This repository presents a complete machine learning project built on the New York City taxi trip dataset, available here: https://www.kaggle.com/datasets/yasserh/nyc-taxi-trip-duration/data. The project begins with preparing and cleaning the dataset and then splits into two distinct machine learning challenges: regression (predicting trip duration) and classification (predicting the taxi vendor).
The entire pipeline covers data cleaning, feature engineering, implementation of various classical ML models, and deep learning architectures for benchmarking.
| File/Folder | Purpose | Key Content |
|---|---|---|
| Data_prep.ipynb | Data Preparation & EDA | The starting point of the project, covering data cleaning, outlier handling, and feature engineering. |
| regression/ | Part 1: Trip Duration Prediction (Regression Task) | Contains scripts for Linear Regression and Neural Networks. |
| classification/ | Part 2: Vendor ID Classification (Classification Task) | Contains notebooks for classical and deep learning classification models. |
File: Data_prep.ipynb
To run this notebook, first download the dataset from https://www.kaggle.com/datasets/yasserh/nyc-taxi-trip-duration/data and load it from a location of your choice. In the notebook the data is read from Google Drive, so update the path to match your setup. This notebook details the crucial first steps of the pipeline:
- Data Cleaning: Handling missing values and ensuring correct data types.
- Outlier Removal: Filtering unrealistic data (e.g., zero-distance, extremely long/short duration trips, and geographical outliers).
- Feature Engineering: Creating essential features like distance (Distance_KM), temporal features (day_of_week, time_of_day_category), and flags for subsequent modeling.
The resulting cleaned and featurized data is saved as df_filtered.csv and is consumed by the model scripts. Use this cleaned dataset for training and evaluating the models in both the regression and classification tasks.
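The cleaning and feature-engineering steps above can be sketched as follows. This is an illustrative reconstruction, not the notebook's exact code: the column names (`pickup_datetime`, `pickup_latitude`, `Distance_KM`, `day_of_week`, `time_of_day_category`) follow the README, but the distance formula (haversine), the time-of-day bins, and the outlier thresholds are assumptions.

```python
# Illustrative sketch of the Data_prep.ipynb pipeline; thresholds and bins are assumptions.
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two coordinate arrays."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["pickup_datetime"] = pd.to_datetime(df["pickup_datetime"])
    # Distance feature from pickup/dropoff coordinates
    df["Distance_KM"] = haversine_km(
        df["pickup_latitude"], df["pickup_longitude"],
        df["dropoff_latitude"], df["dropoff_longitude"],
    )
    # Temporal features
    df["day_of_week"] = df["pickup_datetime"].dt.dayofweek
    hour = df["pickup_datetime"].dt.hour
    df["time_of_day_category"] = pd.cut(
        hour, bins=[-1, 5, 11, 17, 23],
        labels=["night", "morning", "afternoon", "evening"],
    )
    # Drop zero-distance trips and implausible durations (thresholds are illustrative)
    df = df[(df["Distance_KM"] > 0) & df["trip_duration"].between(60, 4 * 3600)]
    return df
```

The cleaned frame would then be written out with something like `df.to_csv("df_filtered.csv", index=False)` so the downstream scripts can read it.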
The goal of this section is to predict the continuous variable trip_duration. To run the regression scripts, ensure that the cleaned data file df_filtered.csv generated from the Data Preparation notebook is accessible to the scripts.
| File Name | Model Type | Description |
|---|---|---|
| linear_regre.py | Linear Regression | Implements a simple linear model as a baseline. Generates an HTML report. |
| all_nn_models.py | Neural Networks (Small, Medium, Large) | Combines the training and evaluation of three distinct Keras deep learning architectures to compare the impact of model complexity. |
| HTML_report_generator_all_nn.py | Reporting Utility | A helper script to generate detailed, visually appealing HTML reports for the Neural Network runs. |
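The baseline approach in linear_regre.py can be sketched like this. The real script reads df_filtered.csv and also generates an HTML report; here synthetic data stands in for the cleaned dataset, and the split/metric choices are assumptions consistent with the MAE/RMSE/R2 metrics named below.

```python
# Minimal sketch of the linear-regression baseline (synthetic stand-in data).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 14))  # 14 preprocessed features, as in the repo
y = X @ rng.normal(size=14) + rng.normal(scale=0.1, size=1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

mae = mean_absolute_error(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test, pred))
r2 = r2_score(y_test, pred)
print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
```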
The regression phase compared a baseline Linear Model against three increasingly complex Neural Networks (NNs) on 14 preprocessed features. The goal was to minimize prediction error (MAE/RMSE).
All regression model results are summarized in dedicated HTML reports:
- regression/HTML_pages/nn_small_model.html
- regression/HTML_pages/nn_medium_model.html
- regression/HTML_pages/nn_larger_model.html
- regression/HTML_pages/linear_regression_report.html
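The small/medium/large comparison in all_nn_models.py is built with Keras; the sketch below uses scikit-learn's MLPRegressor instead, purely to illustrate how three architectures of increasing capacity can be benchmarked in one loop. The layer sizes are assumptions, not the repo's actual architectures.

```python
# Illustrative small/medium/large comparison using MLPRegressor as a Keras stand-in.
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 14))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=1000)  # non-linear target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

architectures = {             # hypothetical layer widths
    "small_model": (16,),
    "medium_model": (64, 32),
    "larger_model": (128, 64, 32),
}
scores = {}
for name, layers in architectures.items():
    nn = MLPRegressor(hidden_layer_sizes=layers, max_iter=500, random_state=0)
    nn.fit(X_train, y_train)
    scores[name] = r2_score(y_test, nn.predict(X_test))
    print(f"{name}: test R2 = {scores[name]:.3f}")
```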
| Model Name | Training Duration (Approx.) | Test R2 Score | Test RMSE (Approx. Seconds) |
|---|---|---|---|
| Linear Regression | Negligible | | |
| Small NN (small_model) | | | |
| Medium NN (medium_model) | | | |
| Larger NN (larger_model) | | | |
Conclusion: The Small NN model established a weak baseline; the exact scores for all models are in the HTML reports listed above.
The goal of this section is to predict the categorical variable vendor_id (Vendor 1 or Vendor 2). It benchmarks linear models against powerful tree-based models, with results documented in model_metrics.csv.
The analysis compares performance across a wide range of classification algorithms:
| Notebook | Model Type |
|---|---|
| DecisionTree.ipynb | Decision Tree Classifier |
| GBC.ipynb | Gradient Boosting Classifier |
| KNN.ipynb | K-Nearest Neighbors (KNN) |
| logistic-regression.ipynb | Logistic Regression |
| RandomForest.ipynb | Random Forest Classifier |
| SVM.ipynb | Support Vector Machine (SVM) |
- model_metrics.csv: A consolidated file containing the accuracy, F1-score, and log-loss for all implemented classification models on both training and test sets.
- Visual Assets: PNG images illustrating key insights, such as feature importance plots for tree-based models and visualization of the decision boundaries.
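Consolidating metrics across notebooks can look roughly like the sketch below. The column names and the use of `make_classification` as stand-in data are assumptions; the real notebooks score models trained on the cleaned taxi data and append to the shared model_metrics.csv.

```python
# Sketch: compute accuracy, F1, and log-loss on train and test sets,
# then write them to a consolidated CSV (column names are illustrative).
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=14, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

def metrics_row(name, model, X, y):
    """One row of metrics for a fitted classifier on a given split."""
    return {
        "model": name,
        "accuracy": accuracy_score(y, model.predict(X)),
        "f1": f1_score(y, model.predict(X)),
        "log_loss": log_loss(y, model.predict_proba(X)),
    }

metrics = pd.DataFrame([
    metrics_row("RandomForest_train", clf, X_train, y_train),
    metrics_row("RandomForest_test", clf, X_test, y_test),
])
metrics.to_csv("model_metrics.csv", index=False)
```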
| Model | Test Accuracy | Test F1-Score | Key Observation |
|---|---|---|---|
| RandomForestClassifier | | | The Champion: Achieved near-perfect classification performance. |
| DecisionTreeClassifier | | | Excellent generalization, but a perfect training score suggests potential complexity/overfitting. |
| LogisticRegression | | | Poor Performer: Barely better than random chance, confirming the relationship is highly non-linear. |
Key Findings on Classification:
- Tree Dominance: The Random Forest Classifier is the clear top performer, indicating that the relationship between trip features and the vendor is complex and non-linear.
- Feature Importance: Visualizations from the Decision Tree and Random Forest show that the most influential features for classifying the vendor are geographical coordinates (pickup/dropoff lat/long) and the calculated distance (Distance_KM), suggesting the vendors operate in distinct geographical patterns or along specific routes.
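Extracting the feature importances behind those plots is a one-liner on a fitted forest. The feature names below are the columns the README identifies as most influential; the stand-in data is synthetic, so the resulting ranking is illustrative only.

```python
# Sketch: rank features by importance from a fitted Random Forest (synthetic data).
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

feature_names = ["pickup_latitude", "pickup_longitude",
                 "dropoff_latitude", "dropoff_longitude", "Distance_KM"]
X, y = make_classification(n_samples=300, n_features=len(feature_names), random_state=1)
clf = RandomForestClassifier(random_state=1).fit(X, y)

importances = (
    pd.Series(clf.feature_importances_, index=feature_names)
    .sort_values(ascending=False)  # most influential features first
)
print(importances)
```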
To run this project locally, you will need Python and the following key libraries:

```shell
pip install pandas numpy scikit-learn tensorflow matplotlib seaborn jupyter
```
1. Start with Data Prep: Open and execute all cells in Data_prep.ipynb to ensure the cleaned data file is generated and saved in the expected location.
2. Run Regression Models: Execute the Python scripts in the regression/ folder:
   - regression/linear_regre.py
   - regression/all_nn_models.py
3. Run Classification Models: Open and run the individual Jupyter Notebooks in the classification/ folder to train the respective models, generate visualizations, and update the model_metrics.csv file.