wasimshoman/ai_models

NYC Taxi Data Science Project: Comprehensive ML Benchmarking

This repository presents a complete machine learning project built around the New York City taxi trip dataset available at https://www.kaggle.com/datasets/yasserh/nyc-taxi-trip-duration/data. The project first prepares and cleans the dataset, then tackles two distinct machine learning challenges: Regression (predicting trip duration) and Classification (predicting the taxi vendor).

The entire pipeline covers data cleaning, feature engineering, implementation of various classical ML models, and deep learning architectures for benchmarking.

📂 Repository Structure

| File/Folder | Purpose | Key Content |
| --- | --- | --- |
| `Data_prep.ipynb` | Data Preparation & EDA | The starting point of the project, covering data cleaning, outlier handling, and feature engineering. |
| `regression/` | Part 1: Trip Duration Prediction (Regression Task) | Contains scripts for Linear Regression and Neural Networks. |
| `classification/` | Part 2: Vendor ID Classification (Classification Task) | Contains notebooks for classical and deep learning classification models. |

1. Data Preparation & Feature Engineering

File: Data_prep.ipynb

Before running this notebook, download the dataset from https://www.kaggle.com/datasets/yasserh/nyc-taxi-trip-duration/data and adjust the read path for your environment (the notebook currently loads the data from a personal Google Drive location, so you will need to change it). This notebook details the crucial first steps of the pipeline:

  • Data Cleaning: Handling missing values and ensuring correct data types.
  • Outlier Removal: Filtering unrealistic data (e.g., zero-distance, extremely long/short duration trips, and geographical outliers).
  • Feature Engineering: Creating essential features like distance (Distance_KM), temporal features (day_of_week, time_of_day_category), and flags for subsequent modeling.
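The feature-engineering step can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the haversine formula and the `time_of_day_category` bin edges are assumptions, while the column names (`Distance_KM`, `day_of_week`, `time_of_day_category`) come from the description above.

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between coordinate arrays."""
    r = 6371.0  # mean Earth radius in km
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * r * np.arcsin(np.sqrt(a))

# One illustrative trip; the notebook operates on the full Kaggle dataset.
df = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(["2016-03-14 17:24:55"]),
    "pickup_latitude": [40.767937], "pickup_longitude": [-73.982155],
    "dropoff_latitude": [40.765602], "dropoff_longitude": [-73.964630],
})
df["Distance_KM"] = haversine_km(df["pickup_latitude"], df["pickup_longitude"],
                                 df["dropoff_latitude"], df["dropoff_longitude"])
df["day_of_week"] = df["pickup_datetime"].dt.dayofweek  # Monday = 0
df["time_of_day_category"] = pd.cut(
    df["pickup_datetime"].dt.hour,
    bins=[-1, 5, 11, 16, 21, 23],  # illustrative bin edges
    labels=["night", "morning", "afternoon", "evening", "late"],
)
```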

The resulting cleaned and featurized data is saved as df_filtered.csv. This file is consumed by the model scripts for training and evaluation in both the regression and classification tasks.

2. Regression Analysis (Predicting Trip Duration)

The goal of this section is to predict the continuous variable trip_duration. Before running the regression scripts, ensure that df_filtered.csv, generated by the Data Preparation notebook, is accessible to them.
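A minimal sketch of the regression setup follows. It uses a synthetic distance-versus-duration relationship so it runs standalone; in the actual pipeline, `pd.read_csv("df_filtered.csv")` would supply the features and the `trip_duration` target instead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned data: duration grows with distance.
rng = np.random.default_rng(0)
X = rng.uniform(0.5, 20.0, size=(1000, 1))           # Distance_KM
y = 120 + 90 * X[:, 0] + rng.normal(0, 60, 1000)     # trip_duration in seconds

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
r2 = r2_score(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test, pred))  # same metrics as the reports
```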

Models Implemented

| File Name | Model Type | Description |
| --- | --- | --- |
| `linear_regre.py` | Linear Regression | Implements a simple linear model as a baseline. Generates an HTML report. |
| `all_nn_models.py` | Neural Networks (Small, Medium, Large) | Combines the training and evaluation of three distinct Keras deep learning architectures to compare the impact of model complexity. |
| `HTML_report_generator_all_nn.py` | Reporting Utility | A helper script to generate detailed, visually appealing HTML reports for the Neural Network runs. |

Regression Results

The regression phase compared a baseline Linear Model against three increasingly complex Neural Networks (NNs) on 14 preprocessed features. The goal was to minimize prediction error (MAE/RMSE).
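The three-architecture comparison might be structured as below. The layer sizes are illustrative assumptions, not taken from all_nn_models.py; only the model names and the 14-feature input width come from this README.

```python
from tensorflow import keras

def build_model(hidden_layers, n_features=14):
    """Build a Keras MLP regressor with the given hidden layer widths."""
    model = keras.Sequential([keras.layers.Input(shape=(n_features,))])
    for units in hidden_layers:
        model.add(keras.layers.Dense(units, activation="relu"))
    model.add(keras.layers.Dense(1))  # single continuous output: trip duration
    model.compile(optimizer="adam", loss="mse",
                  metrics=[keras.metrics.RootMeanSquaredError(), "mae"])
    return model

# Hypothetical widths/depths standing in for the real architectures.
architectures = {
    "small_model": [32],
    "medium_model": [64, 32],
    "larger_model": [128, 64, 32],
}
models = {name: build_model(layers) for name, layers in architectures.items()}
```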

All regression model results are summarized in dedicated HTML reports:

  • regression/HTML_pages/nn_small_model.html
  • regression/HTML_pages/nn_medium_model.html
  • regression/HTML_pages/nn_larger_model.html
  • regression/HTML_pages/linear_regression_report.html
| Model Name | Training Duration (Approx.) | Test $R^2$ Score | Test RMSE (Approx. Seconds) |
| --- | --- | --- | --- |
| Linear Regression | Negligible | $0.63$ | $\sim 348$ |
| Small NN (`small_model`) | $\sim 3.76$ seconds | $0.61$ | $\sim 355$ |
| Medium NN (`medium_model`) | $\sim 9.31$ seconds | $0.71$ | $\sim 360$ |
| Larger NN (`larger_model`) | $\sim 35.74$ seconds | $\mathbf{0.74}$ | $\mathbf{\sim 293}$ |

Conclusion: The Small NN performed no better than the linear baseline, with an $R^2$ score of only $0.61$ (explaining $61\%$ of the variance) and a high RMSE of $\sim 355$ seconds. This weak performance underlined the need for more complex, non-linear models. The Larger Neural Network achieved the highest $R^2$ score ($\mathbf{0.74}$) and the lowest RMSE ($\sim 293$ seconds), confirming that deep learning was necessary to accurately model the non-linear relationship between the features and trip duration.

3. Classification Analysis (Predicting Vendor ID)

The goal of this section is to predict the categorical variable vendor_id (Vendor 1 or Vendor 2). Linear models are benchmarked against powerful tree-based models, with the results documented in model_metrics.csv.
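The shared setup across the notebooks can be sketched as follows. Synthetic data stands in for df_filtered.csv so the sketch runs standalone, and only two of the six classifiers are shown; the metric names mirror those recorded in model_metrics.csv.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, log_loss
from sklearn.model_selection import train_test_split

# In the notebooks: df = pd.read_csv("df_filtered.csv")
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Distance_KM": rng.uniform(0.5, 20, 500),
    "pickup_latitude": rng.uniform(40.6, 40.9, 500),
    "vendor_id": rng.integers(1, 3, 500),  # Vendor 1 or Vendor 2
})
X, y = df.drop(columns=["vendor_id"]), df["vendor_id"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

rows = []
for name, clf in [("LogisticRegression", LogisticRegression(max_iter=1000)),
                  ("RandomForestClassifier", RandomForestClassifier(random_state=42))]:
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    rows.append({"model": name,
                 "test_accuracy": accuracy_score(y_test, pred),
                 "test_f1": f1_score(y_test, pred, average="weighted"),
                 "test_log_loss": log_loss(y_test, clf.predict_proba(X_test))})
metrics = pd.DataFrame(rows)
# metrics.to_csv("model_metrics.csv", index=False)  # as consolidated in the repo
```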

Models Implemented (Jupyter Notebooks)

The analysis compares performance across a wide range of classification algorithms:

| Notebook | Model Type |
| --- | --- |
| `DecisionTree.ipynb` | Decision Tree Classifier |
| `GBC.ipynb` | Gradient Boosting Classifier |
| `KNN.ipynb` | K-Nearest Neighbors (KNN) |
| `logistic-regression.ipynb` | Logistic Regression |
| `RandomForest.ipynb` | Random Forest Classifier |
| `SVM.ipynb` | Support Vector Machine (SVM) |

Classification Results & Visualizations

  • model_metrics.csv: A consolidated file containing the accuracy, F1-score, and log-loss for all implemented classification models on both training and test sets.
  • Visual Assets: PNG images illustrating key insights, such as feature importance plots for tree-based models and visualization of the decision boundaries.
| Model | Test Accuracy | Test F1-Score | Key Observation |
| --- | --- | --- | --- |
| RandomForestClassifier | $\mathbf{0.9999965}$ | $\mathbf{0.9999967}$ | The Champion: achieved near-perfect classification performance. |
| DecisionTreeClassifier | $0.9999955$ | $0.9999955$ | Excellent generalization, but a perfect training score suggests potential for complexity/overfitting. |
| LogisticRegression | $0.5837697$ | $0.4944124$ | Poor Performer: barely better than random chance, confirming the relationship is highly non-linear. |

Key Findings on Classification:

  • Tree Dominance: The Random Forest Classifier is the clear top performer, indicating that the relationship between trip features and the vendor is complex and non-linear.
  • Feature Importance: Visualizations from the Decision Tree and Random Forest show that the most influential features for classifying the vendor are geographical coordinates (pickup/dropoff lat/long) and the calculated distance (Distance_KM), suggesting the vendors operate in distinct geographical patterns or along specific routes.
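Extracting such an importance ranking from a fitted forest is straightforward; the sketch below uses a synthetic label driven by distance purely to make the example self-contained, with the feature names taken from the list above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

features = ["pickup_latitude", "pickup_longitude",
            "dropoff_latitude", "dropoff_longitude", "Distance_KM"]
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(300, len(features))), columns=features)
y = (X["Distance_KM"] > 0).astype(int)  # synthetic vendor label driven by distance

clf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)
# Importances sum to 1; sorting gives the ranking shown in the PNG plots.
importances = (pd.Series(clf.feature_importances_, index=features)
                 .sort_values(ascending=False))
```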

⚙️ Setup and Execution

Prerequisites

To run this project locally, you will need Python and the following key libraries:

pip install pandas numpy scikit-learn tensorflow matplotlib seaborn jupyter

Running the Pipeline

  1. Start with Data Prep: Open and execute all cells in Data_prep.ipynb to ensure the cleaned data file is generated and saved in the expected location.

  2. Run Regression Models: Execute the Python scripts in the regression/ folder:
    regression/linear_regre.py
    regression/all_nn_models.py

  3. Run Classification Models: Open and run the individual Jupyter Notebooks in the classification/ folder to train the respective models, generate visualizations, and update the model_metrics.csv file.
