## Exercise: Analyzing the NYC Taxi Dataset with PySpark

Dataset: The New York City Taxi and Limousine Commission (TLC) provides a publicly available dataset of taxi trips in New York City, which can be downloaded from the TLC website: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml

### Tasks:

1. Download the TLC taxi dataset and load it into a PySpark DataFrame.
2. Explore the dataset by performing basic operations such as filtering, grouping, and aggregating.
3. Calculate the average trip distance and fare amount for each hour of the day, and plot the results using a line chart.
4. Identify the top 10 busiest taxi zones in terms of pick-up counts, and plot the results using a bar chart.
5. Calculate the correlation between the trip distance and the fare amount, and plot the results using a scatter plot.

### Hints:

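- Use the PySpark SQL module to perform SQL-like operations on the DataFrame.
- Use PySpark's statistics helpers (for example `DataFrame.stat.corr`, or `pyspark.ml.stat.Correlation` from MLlib) for statistical analysis. MLlib itself does not provide plotting.
- Use the Matplotlib library to create plots and charts from small, aggregated results collected to the driver.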

#### Note: This is just an example exercise, and you can modify it or choose a different dataset depending on your interests and goals. The key is to practice PySpark by working with real-world data and solving meaningful problems.

In [5]:
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("nyc_tlc") \
    .master("local[*]") \
    .getOrCreate()

sc = spark.sparkContext

In [11]:
df = spark.read.parquet(r'C:\Users\Usuario1\Desktop\Work\ML Learning\Machine Learning Topics Explanation\pySpark\dataset\yellow_tripdata_2022-01.parquet')
df.printSchema()


root
 |-- VendorID: long (nullable = true)
 |-- tpep_pickup_datetime: timestamp (nullable = true)
 |-- tpep_dropoff_datetime: timestamp (nullable = true)
 |-- passenger_count: double (nullable = true)
 |-- trip_distance: double (nullable = true)
 |-- RatecodeID: double (nullable = true)
 |-- store_and_fwd_flag: string (nullable = true)
 |-- PULocationID: long (nullable = true)
 |-- DOLocationID: long (nullable = true)
 |-- payment_type: long (nullable = true)
 |-- fare_amount: double (nullable = true)
 |-- extra: double (nullable = true)
 |-- mta_tax: double (nullable = true)
 |-- tip_amount: double (nullable = true)
 |-- tolls_amount: double (nullable = true)
 |-- improvement_surcharge: double (nullable = true)
 |-- total_amount: double (nullable = true)
 |-- congestion_surcharge: double (nullable = true)
 |-- airport_fee: double (nullable = true)



In [16]:
df.describe('trip_distance').show()

+-------+-----------------+
|summary|    trip_distance|
+-------+-----------------+
|  count|          2463931|
|   mean| 5.37275119311366|
| stddev|547.8714044600813|
|    min|              0.0|
|    max|        306159.28|
+-------+-----------------+



In [17]:
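# approxCountDistinct is deprecated since Spark 2.1; approx_count_distinct is its replacement
from pyspark.sql.functions import approx_count_distinct

# Approximate number of distinct values in every column
df.agg(*[approx_count_distinct(c).alias(c) for c in df.columns]).show()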



+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+
|VendorID|tpep_pickup_datetime|tpep_dropoff_datetime|passenger_count|trip_distance|RatecodeID|store_and_fwd_flag|PULocationID|DOLocationID|payment_type|fare_amount|extra|mta_tax|tip_amount|tolls_amount|improvement_surcharge|total_amount|congestion_surcharge|airport_fee|
+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+
|       4|             1371984|              1435138|             10|         4265|         7|                 2|         260|         264|           6|       6280|   60|     13|      295