This repository presents a complete machine learning project built on the New York City taxi trip dataset, available here: https://www.kaggle.com/datasets/yasserh/nyc-taxi-trip-duration/data. The project begins with preparing and cleaning the dataset and then splits into two distinct machine learning challenges: regression (predicting trip duration) and classification (predicting the taxi vendor).
The entire pipeline covers data cleaning, feature engineering, implementation of various classical ML models, and deep learning architectures for benchmarking.
| File/Folder | Purpose | Key Content |
|---|---|---|
| Data_prep.ipynb | Data Preparation & EDA | The starting point of the project, covering data cleaning, outlier handling, and feature engineering. |
| regression/ | Part 1: Trip Duration Prediction (Regression Task) | Contains scripts for Linear Regression and Neural Networks. |
| classification/ | Part 2: Vendor ID Classification (Classification Task) | Contains notebooks for classical and deep learning classification models. |
File: Data_prep.ipynb
To run this notebook, first download the dataset from https://www.kaggle.com/datasets/yasserh/nyc-taxi-trip-duration/data and load it from a location of your choice. In the notebook the data is read from Google Drive, so update the path to match your setup. This notebook details the crucial first steps of the pipeline:
- Data Cleaning: Handling missing values and ensuring correct data types.
- Outlier Removal: Filtering unrealistic data (e.g., zero-distance, extremely long/short duration trips, and geographical outliers).
- Feature Engineering: Creating essential features like distance (Distance_KM), temporal features (day_of_week, time_of_day_category), and flags for subsequent modeling.
The resulting cleaned and featurized data is saved as df_filtered.csv and is consumed by the model scripts. Use this cleaned dataset for training and evaluating the models in both the regression and classification tasks.
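The cleaning and feature-engineering steps above can be sketched as follows. This is an illustrative reconstruction, not the notebook's exact code: the column names (`pickup_datetime`, `pickup_latitude`, `Distance_KM`, `day_of_week`, `time_of_day_category`) follow the README, but the distance formula (haversine), the time-of-day bins, and the outlier thresholds are assumptions.

```python
# Illustrative sketch of the Data_prep.ipynb pipeline; thresholds and bins are assumptions.
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two coordinate arrays."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["pickup_datetime"] = pd.to_datetime(df["pickup_datetime"])
    # Distance feature from pickup/dropoff coordinates
    df["Distance_KM"] = haversine_km(
        df["pickup_latitude"], df["pickup_longitude"],
        df["dropoff_latitude"], df["dropoff_longitude"],
    )
    # Temporal features
    df["day_of_week"] = df["pickup_datetime"].dt.dayofweek
    hour = df["pickup_datetime"].dt.hour
    df["time_of_day_category"] = pd.cut(
        hour, bins=[-1, 5, 11, 17, 23],
        labels=["night", "morning", "afternoon", "evening"],
    )
    # Drop zero-distance trips and implausible durations (thresholds are illustrative)
    df = df[(df["Distance_KM"] > 0) & df["trip_duration"].between(60, 4 * 3600)]
    return df
```

The cleaned frame would then be written out with something like `df.to_csv("df_filtered.csv", index=False)` so the downstream scripts can read it.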
The goal of this section is to predict the continuous variable trip_duration. To run the regression scripts, ensure that the cleaned data file df_filtered.csv generated from the Data Preparation notebook is accessible to the scripts.
| File Name | Model Type | Description |
|---|---|---|
| linear_regre.py | Linear Regression | Implements a simple linear model as a baseline. Generates an HTML report. |
| all_nn_models.py | Neural Networks (Small, Medium, Large) | Combines the training and evaluation of three distinct Keras deep learning architectures to compare the impact of model complexity. |
| HTML_report_generator_all_nn.py | Reporting Utility | A helper script to generate detailed, visually appealing HTML reports for the Neural Network runs. |
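The baseline approach in linear_regre.py can be sketched like this. The real script reads df_filtered.csv and also generates an HTML report; here synthetic data stands in for the cleaned dataset, and the split/metric choices are assumptions consistent with the MAE/RMSE/R2 metrics named below.

```python
# Minimal sketch of the linear-regression baseline (synthetic stand-in data).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 14))  # 14 preprocessed features, as in the repo
y = X @ rng.normal(size=14) + rng.normal(scale=0.1, size=1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

mae = mean_absolute_error(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test, pred))
r2 = r2_score(y_test, pred)
print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
```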
The regression phase compared a baseline Linear Model against three increasingly complex Neural Networks (NNs) on 14 preprocessed features. The goal was to minimize prediction error (MAE/RMSE).
All regression model results are summarized in dedicated HTML reports:
- regression/HTML_pages/nn_small_model.html
- regression/HTML_pages/nn_medium_model.html
- regression/HTML_pages/nn_larger_model.html
- regression/HTML_pages/linear_regression_report.html
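The small/medium/large comparison in all_nn_models.py is built with Keras; the sketch below uses scikit-learn's MLPRegressor instead, purely to illustrate how three architectures of increasing capacity can be benchmarked in one loop. The layer sizes are assumptions, not the repo's actual architectures.

```python
# Illustrative small/medium/large comparison using MLPRegressor as a Keras stand-in.
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 14))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=1000)  # non-linear target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

architectures = {             # hypothetical layer widths
    "small_model": (16,),
    "medium_model": (64, 32),
    "larger_model": (128, 64, 32),
}
scores = {}
for name, layers in architectures.items():
    nn = MLPRegressor(hidden_layer_sizes=layers, max_iter=500, random_state=0)
    nn.fit(X_train, y_train)
    scores[name] = r2_score(y_test, nn.predict(X_test))
    print(f"{name}: test R2 = {scores[name]:.3f}")
```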
| Model Name | Training Duration (Approx.) | Test R2 Score | Test RMSE (Approx. Seconds) |
|---|---|---|---|
| Linear Regression | Negligible | | |
| Small NN (small_model) | | | |
| Medium NN (medium_model) | | | |
| Larger NN (larger_model) | | | |
Conclusion: The Small NN model established a weak baseline; the exact scores for all models are in the HTML reports listed above.
The goal of this section is to predict the categorical variable vendor_id (Vendor 1 or Vendor 2). It benchmarks linear models against powerful tree-based models, with results documented in model_metrics.csv.
The analysis compares performance across a wide range of classification algorithms:
| Notebook | Model Type |
|---|---|
| DecisionTree.ipynb | Decision Tree Classifier |
| GBC.ipynb | Gradient Boosting Classifier |
| KNN.ipynb | K-Nearest Neighbors (KNN) |
| logistic-regression.ipynb | Logistic Regression |
| RandomForest.ipynb | Random Forest Classifier |
| SVM.ipynb | Support Vector Machine (SVM) |
- model_metrics.csv: A consolidated file containing the accuracy, F1-score, and log-loss for all implemented classification models on both training and test sets.
- Visual Assets: PNG images illustrating key insights, such as feature importance plots for tree-based models and visualization of the decision boundaries.
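Consolidating metrics across notebooks can look roughly like the sketch below. The column names and the use of `make_classification` as stand-in data are assumptions; the real notebooks score models trained on the cleaned taxi data and append to the shared model_metrics.csv.

```python
# Sketch: compute accuracy, F1, and log-loss on train and test sets,
# then write them to a consolidated CSV (column names are illustrative).
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=14, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

def metrics_row(name, model, X, y):
    """One row of metrics for a fitted classifier on a given split."""
    return {
        "model": name,
        "accuracy": accuracy_score(y, model.predict(X)),
        "f1": f1_score(y, model.predict(X)),
        "log_loss": log_loss(y, model.predict_proba(X)),
    }

metrics = pd.DataFrame([
    metrics_row("RandomForest_train", clf, X_train, y_train),
    metrics_row("RandomForest_test", clf, X_test, y_test),
])
metrics.to_csv("model_metrics.csv", index=False)
```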
| Model | Test Accuracy | Test F1-Score | Key Observation |
|---|---|---|---|
| RandomForestClassifier | | | The Champion: Achieved near-perfect classification performance. |
| DecisionTreeClassifier | | | Excellent generalization, but a perfect training score suggests potential complexity/overfitting. |
| LogisticRegression | | | Poor Performer: Barely better than random chance, confirming the relationship is highly non-linear. |
Key Findings on Classification:
- Tree Dominance: The Random Forest Classifier is the clear top performer, indicating that the relationship between trip features and the vendor is complex and non-linear.
- Feature Importance: Visualizations from the Decision Tree and Random Forest show that the most influential features for classifying the vendor are geographical coordinates (pickup/dropoff lat/long) and the calculated distance (Distance_KM), suggesting the vendors operate in distinct geographical patterns or along specific routes.
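Extracting the feature importances behind those plots is a one-liner on a fitted forest. The feature names below are the columns the README identifies as most influential; the stand-in data is synthetic, so the resulting ranking is illustrative only.

```python
# Sketch: rank features by importance from a fitted Random Forest (synthetic data).
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

feature_names = ["pickup_latitude", "pickup_longitude",
                 "dropoff_latitude", "dropoff_longitude", "Distance_KM"]
X, y = make_classification(n_samples=300, n_features=len(feature_names), random_state=1)
clf = RandomForestClassifier(random_state=1).fit(X, y)

importances = (
    pd.Series(clf.feature_importances_, index=feature_names)
    .sort_values(ascending=False)  # most influential features first
)
print(importances)
```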
To run this project locally, you will need Python and the following key libraries:

```shell
pip install pandas numpy scikit-learn tensorflow matplotlib seaborn jupyter
```
1. Start with Data Prep: Open and execute all cells in Data_prep.ipynb to ensure the cleaned data file is generated and saved in the expected location.
2. Run Regression Models: Execute the Python scripts in the regression/ folder:
   - regression/linear_regre.py
   - regression/all_nn_models.py
3. Run Classification Models: Open and run the individual Jupyter Notebooks in the classification/ folder to train the respective models, generate visualizations, and update the model_metrics.csv file.