<h1>Assignment 3</h1>
<h2>Analyzing NYC Taxi Trip Data with Databricks and Apache Spark</h2>
<h4>By Group 10 (Faiza, Wardah, Amany, Yusra, Sara, Anna), Due 2025-11-26</h4>


<h3>Task 1: Classification</h3>
<h4>Objective: Build a classification pipeline to predict whether a trip resulted in a high fare (e.g., over $20) based on trip characteristics.</h4>

<h4>Step 1: Data Preparation</h4>

a) Load Dataset: Load the CSV file into a Spark DataFrame.

b) Data Exploration: Explore the dataset to understand the variables and handle missing or invalid values.

c) Feature Engineering: Create new features or modify existing ones to improve the model. For example:
- Time-based features (e.g., hour, day of the week) from the pickup or drop-off times.
- Distance calculation between pickup and drop-off coordinates.

d) Target Variable Creation: Define a binary target column high_fare where:
- 1 if the fare is above $20.
- 0 if the fare is $20 or below.

e) Data Splitting: Split the data into training (70%) and testing (30%) sets.
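The distance feature mentioned in step c) can be sketched in plain Python with the haversine formula. This is only an illustration of the calculation, assuming the dataset provides pickup and drop-off latitude and longitude in degrees; the function name `haversine_km` and the Earth-radius constant are choices made for this sketch, not part of the assignment.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    # Haversine term: squared half-chord length between the two points
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

On a Spark DataFrame the same arithmetic would normally be written with column functions (`radians`, `sin`, `cos`, `asin`, `sqrt` from `pyspark.sql.functions`) rather than a Python loop, so that it runs on the executors.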

In [0]:
from pyspark.sql.functions import to_timestamp, unix_timestamp, col, mean, max
from pyspark.sql.types import DoubleType

In [0]:
# a) Load the data, fix types
df = spark.read.format("csv").option("header", "true").load("/Volumes/workspace/default/assignment/yellow_tripdata_2015-01.csv")
df = df.withColumn("passenger_count", col("passenger_count").cast("int"))
df = df.withColumn("trip_distance", col("trip_distance").cast("float"))
df = df.withColumn("fare_amount", col("fare_amount").cast("float"))

In [0]:
# b) Data Exploration
# Drop rows with missing values in fare_amount, trip_distance, or passenger_count
df1 = df.dropna(subset = ["fare_amount", "trip_distance", "passenger_count"])
# Keep only valid rows: fare_amount must be positive and trip_distance non-zero
df1 = df1.filter((df1.fare_amount > 0) & (df1.trip_distance != 0))
# Convert tpep_pickup_datetime and tpep_dropoff_datetime to timestamp type
df1 = df1.withColumn(
    "tpep_pickup_datetime",
    to_timestamp("tpep_pickup_datetime"))
df1 = df1.withColumn(
    "tpep_dropoff_datetime",
    to_timestamp("tpep_dropoff_datetime"))
df1.select("tpep_pickup_datetime").distinct().show()

+--------------------+
|tpep_pickup_datetime|
+--------------------+
| 2015-01-21 05:45:18|
| 2015-01-06 11:49:54|
| 2015-01-10 21:36:52|
| 2015-01-25 00:13:08|
| 2015-01-30 19:50:04|
| 2015-01-10 21:13:52|
| 2015-01-01 22:47:45|
| 2015-01-10 19:12:25|
| 2015-01-10 22:53:25|
| 2015-01-20 15:51:26|
| 2015-01-15 16:15:15|
| 2015-01-09 20:29:42|
| 2015-01-10 21:36:51|
| 2015-01-23 16:51:38|
| 2015-01-04 13:44:51|
| 2015-01-29 21:33:55|
| 2015-01-21 07:27:00|
| 2015-01-10 21:27:46|
| 2015-01-27 16:22:03|
| 2015-01-25 17:45:13|
+--------------------+
only showing top 20 rows
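Steps c) to e) of the data preparation are not shown in the cells above. One possible sketch follows, assuming a cleaned DataFrame such as `df1` whose pickup and drop-off columns are already timestamps. The new column names (`pickup_hour`, `pickup_dayofweek`, `trip_duration_min`) and the seed are hypothetical choices for this example.

```python
FARE_THRESHOLD = 20.0       # dollars; taken from the task statement
SPLIT_WEIGHTS = [0.7, 0.3]  # training / testing fractions from step e)
SPLIT_SEED = 42             # hypothetical seed, only for reproducibility


def prepare_classification_data(df):
    """Add time features, trip duration and the high_fare label, then split 70/30."""
    # Imported inside the function so the constants above can be inspected without Spark.
    from pyspark.sql import functions as F

    featured = (
        df
        .withColumn("pickup_hour", F.hour("tpep_pickup_datetime"))
        # dayofweek returns 1 = Sunday ... 7 = Saturday
        .withColumn("pickup_dayofweek", F.dayofweek("tpep_pickup_datetime"))
        .withColumn(
            "trip_duration_min",
            (F.unix_timestamp("tpep_dropoff_datetime")
             - F.unix_timestamp("tpep_pickup_datetime")) / 60.0,
        )
        # Binary target: 1 if the fare is above the threshold, else 0
        .withColumn(
            "high_fare",
            F.when(F.col("fare_amount") > FARE_THRESHOLD, 1).otherwise(0),
        )
    )
    train_df, test_df = featured.randomSplit(SPLIT_WEIGHTS, seed=SPLIT_SEED)
    return train_df, test_df
```

A call such as `train_df, test_df = prepare_classification_data(df1)` would then produce the two sets. Note that `randomSplit` gives approximately, not exactly, 70% and 30% of the rows.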



<h4>Step 2: Decision Tree Classifier Pipeline</h4>

a) Define the pipeline stages:

- Feature Transformers (VectorAssembler, StandardScaler, etc.).

- Model: Decision Tree Classifier.

b) Hyperparameter Tuning: Use CrossValidator with a grid search (a parameter grid built with ParamGridBuilder) to find the best parameters (e.g., max depth, min instances per node).

c) Model Training: Train the pipeline on the training data.

d) Model Evaluation: Evaluate performance on the test data using metrics like F1 Score, Precision, and Recall.

e) Save Pipeline: Save the trained pipeline.
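The steps above could be put together roughly as follows. This is a minimal sketch, not a reference solution: the feature column names, the grid values, the number of folds and the seed are all assumptions made for illustration, and the label column `high_fare` is assumed to exist already.

```python
# Hypothetical feature columns and grid values; adjust them to the columns
# actually present after feature engineering.
FEATURE_COLS = ["passenger_count", "trip_distance", "pickup_hour", "pickup_dayofweek"]
DT_GRID = {"maxDepth": [3, 5, 10], "minInstancesPerNode": [1, 10, 50]}


def fit_decision_tree_cv(train_df, label_col="high_fare", num_folds=3):
    """Cross-validate an assembler -> scaler -> decision tree pipeline on train_df."""
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import DecisionTreeClassifier
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.ml.feature import StandardScaler, VectorAssembler
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    assembler = VectorAssembler(inputCols=FEATURE_COLS, outputCol="raw_features")
    scaler = StandardScaler(inputCol="raw_features", outputCol="features")
    tree = DecisionTreeClassifier(featuresCol="features", labelCol=label_col)
    pipeline = Pipeline(stages=[assembler, scaler, tree])

    grid = (ParamGridBuilder()
            .addGrid(tree.maxDepth, DT_GRID["maxDepth"])
            .addGrid(tree.minInstancesPerNode, DT_GRID["minInstancesPerNode"])
            .build())
    evaluator = MulticlassClassificationEvaluator(
        labelCol=label_col, predictionCol="prediction", metricName="f1")
    cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                        evaluator=evaluator, numFolds=num_folds, seed=42)
    return cv.fit(train_df)  # CrossValidatorModel; .bestModel is a PipelineModel


def evaluate_classifier(cv_model, test_df, label_col="high_fare"):
    """Return F1, weighted precision and weighted recall on the test set."""
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    predictions = cv_model.transform(test_df)
    metrics = {}
    for name in ("f1", "weightedPrecision", "weightedRecall"):
        evaluator = MulticlassClassificationEvaluator(
            labelCol=label_col, predictionCol="prediction", metricName=name)
        metrics[name] = evaluator.evaluate(predictions)
    return metrics
```

The fitted pipeline can be saved with `cv_model.bestModel.write().overwrite().save(path)`, where `path` is a location of your choosing. With the grid above, cross-validation fits 9 parameter combinations per fold, so smaller grids or fewer folds may be needed on the full dataset.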

<h4>Step 3: Logistic Regression Pipeline</h4>

a) Define a new pipeline with Logistic Regression as the classifier.

b) Perform hyperparameter tuning on Logistic Regression parameters (e.g., regularization parameter, max iterations).

c) Evaluate model performance on test data and compare with the Decision Tree Classifier pipeline.

d) Save the trained pipeline.
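A comparable sketch for the Logistic Regression pipeline is given below. The grid values, fold count and seed are again illustrative assumptions, and the function only builds the `CrossValidator`; fitting and evaluation would follow the same pattern as for the Decision Tree so that both models are scored with the same metrics.

```python
# Hypothetical grid for the regularization parameter and iteration limit.
LR_GRID = {"regParam": [0.01, 0.1, 1.0], "maxIter": [20, 50]}


def build_logistic_regression_cv(feature_cols, label_col="high_fare", num_folds=3):
    """Return an unfitted CrossValidator around an assembler -> scaler -> LR pipeline."""
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.ml.feature import StandardScaler, VectorAssembler
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    assembler = VectorAssembler(inputCols=list(feature_cols), outputCol="raw_features")
    # Scaling matters more here than for trees, because regularization
    # penalizes coefficients on a common scale.
    scaler = StandardScaler(inputCol="raw_features", outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol=label_col)
    pipeline = Pipeline(stages=[assembler, scaler, lr])

    grid = (ParamGridBuilder()
            .addGrid(lr.regParam, LR_GRID["regParam"])
            .addGrid(lr.maxIter, LR_GRID["maxIter"])
            .build())
    evaluator = MulticlassClassificationEvaluator(
        labelCol=label_col, predictionCol="prediction", metricName="f1")
    return CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                          evaluator=evaluator, numFolds=num_folds, seed=42)
```

Calling `.fit(train_df)` on the returned object yields a `CrossValidatorModel`; its `avgMetrics` attribute lists the mean cross-validation score for each grid entry, which helps when reporting which hyperparameters were chosen.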

<h4>Step 4: Report Findings</h4>

a) Discuss the performance of each pipeline and which hyperparameters were chosen.

b) State the best-performing pipeline and explain why it performed better.
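When comparing the two classifiers, it can help to recall how the metrics are derived from the confusion-matrix counts. The plain-Python sketch below computes precision, recall and F1 for the positive class (`high_fare = 1`); the counts in the usage note are made-up numbers for illustration, not results from this dataset.

```python
def binary_metrics(tp, fp, fn):
    """Precision, recall and F1 for the positive class from confusion-matrix counts.

    tp: true positives, fp: false positives, fn: false negatives.
    Each metric falls back to 0.0 when its denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1
```

For example, `binary_metrics(tp=80, fp=20, fn=20)` gives a precision, recall and F1 of 0.8 each. Spark's `weightedPrecision` and `weightedRecall` average these per-class values weighted by class frequency, so they can differ from the positive-class figures when the classes are imbalanced.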

<h3>Task 2: Regression</h3>
<h4>Objective: Build a regression pipeline to predict the fare amount based on trip characteristics.</h4>


<h4>Step 1: Data Preparation</h4>

a) Load Dataset: Use the same dataset and load it into a new DataFrame.

b) Feature Engineering: Similar to the classification task, but focus on features that might help predict fare amount. For example:
- Time-based features (pickup hour, day of week).
- Trip distance and trip duration.

c) Data Splitting: Split the data into training (70%) and testing (30%) sets.

<h4>Step 2: Linear Regression Pipeline</h4>

a) Define the pipeline stages:

- Feature Transformers (VectorAssembler, StandardScaler, etc.).

- Model: Linear Regression.

b) Hyperparameter Tuning: Use CrossValidator with a grid search (a parameter grid built with ParamGridBuilder) to tune hyperparameters (e.g., regularization parameter, max iterations).

c) Model Training: Train the pipeline on the training data.

d) Model Evaluation: Evaluate the model using RMSE (Root Mean Squared Error) and R² Score on the test data.

e) Save Pipeline: Save the trained pipeline.
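A possible shape for the Linear Regression pipeline is sketched below. The grid values, fold count and seed are assumptions for illustration; `feature_cols` stands for whichever numeric feature columns were engineered in Step 1, and `fare_amount` is used as the label.

```python
# Hypothetical grid for the regularization parameter and iteration limit.
LINREG_GRID = {"regParam": [0.0, 0.1, 1.0], "maxIter": [20, 50]}


def fit_linear_regression_cv(train_df, feature_cols, label_col="fare_amount", num_folds=3):
    """Cross-validate an assembler -> scaler -> linear regression pipeline."""
    from pyspark.ml import Pipeline
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.feature import StandardScaler, VectorAssembler
    from pyspark.ml.regression import LinearRegression
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    assembler = VectorAssembler(inputCols=list(feature_cols), outputCol="raw_features")
    scaler = StandardScaler(inputCol="raw_features", outputCol="features")
    linreg = LinearRegression(featuresCol="features", labelCol=label_col)
    pipeline = Pipeline(stages=[assembler, scaler, linreg])

    grid = (ParamGridBuilder()
            .addGrid(linreg.regParam, LINREG_GRID["regParam"])
            .addGrid(linreg.maxIter, LINREG_GRID["maxIter"])
            .build())
    # RMSE is a lower-is-better metric; CrossValidator takes this from the evaluator.
    evaluator = RegressionEvaluator(
        labelCol=label_col, predictionCol="prediction", metricName="rmse")
    cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                        evaluator=evaluator, numFolds=num_folds, seed=42)
    return cv.fit(train_df)


def evaluate_regressor(cv_model, test_df, label_col="fare_amount"):
    """Return RMSE and R² on the test set."""
    from pyspark.ml.evaluation import RegressionEvaluator

    predictions = cv_model.transform(test_df)
    metrics = {}
    for name in ("rmse", "r2"):
        evaluator = RegressionEvaluator(
            labelCol=label_col, predictionCol="prediction", metricName=name)
        metrics[name] = evaluator.evaluate(predictions)
    return metrics
```

As with the classifiers, `cv_model.bestModel.write().overwrite().save(path)` would persist the fitted pipeline to a location of your choosing. The label itself must not appear among `feature_cols`, otherwise the model trivially learns the target.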

<h4>Step 3: Random Forest Regressor Pipeline</h4>

a) Define a new pipeline with Random Forest Regressor.

b) Perform hyperparameter tuning on Random Forest parameters (e.g., number of trees, max depth).

c) Evaluate the model performance on test data and compare it with the Linear Regression pipeline.

d) Save the trained pipeline.
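The Random Forest variant might look like the sketch below. It omits the scaler, since tree-based models do not depend on feature scale; that is a design choice of this example rather than a requirement. Grid values, fold count and seed are hypothetical.

```python
# Hypothetical grid; larger values increase training time considerably.
RF_GRID = {"numTrees": [20, 50], "maxDepth": [5, 10]}


def build_random_forest_cv(feature_cols, label_col="fare_amount", num_folds=3):
    """Return an unfitted CrossValidator around an assembler -> random forest pipeline."""
    from pyspark.ml import Pipeline
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import RandomForestRegressor
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    assembler = VectorAssembler(inputCols=list(feature_cols), outputCol="features")
    forest = RandomForestRegressor(featuresCol="features", labelCol=label_col, seed=42)
    pipeline = Pipeline(stages=[assembler, forest])

    grid = (ParamGridBuilder()
            .addGrid(forest.numTrees, RF_GRID["numTrees"])
            .addGrid(forest.maxDepth, RF_GRID["maxDepth"])
            .build())
    evaluator = RegressionEvaluator(
        labelCol=label_col, predictionCol="prediction", metricName="rmse")
    return CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                          evaluator=evaluator, numFolds=num_folds, seed=42)
```

Fitting with `build_random_forest_cv(cols).fit(train_df)` and scoring the result with the same RMSE and R² evaluators as the Linear Regression pipeline keeps the comparison like for like.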

<h4>Step 4: Report Findings</h4>

a) Discuss the performance of each pipeline and provide a comparison.

b) Justify which pipeline you would choose for future predictions based on the evaluation metrics.
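To ground the comparison, the two regression metrics can be written out directly. The plain-Python sketch below follows the usual definitions, RMSE as the square root of the mean squared residual and R² as one minus the ratio of residual to total sum of squares; the numbers in the usage note are invented for illustration and are not results from this dataset.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

For instance, with true fares `[10, 20, 30]` and predictions `[12, 18, 33]`, the squared residuals sum to 17, so RMSE is `sqrt(17 / 3)` and R² is `1 - 17 / 200 = 0.915`. A lower RMSE (in dollars) and a higher R² both favor a pipeline, and the two usually agree when computed on the same test set.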