# Project 1 - Large-scale Data Cleaning, Encoding, Exploration, and Predictive Modeling

---

## Project Description

In this project, you will work with the NYC Yellow Taxi Dataset. You will carry out data cleaning, encoding, exploration, and predictive modeling.

The goals of this assignment are to:

1. Carry out exploratory data analysis to gather knowledge from data
2. Apply data visualization techniques
3. Build transformation pipelines for data preprocessing and data cleaning
4. Select machine learning algorithms for regression tasks
5. Design pipelines for hyperparameter tuning and model selection
6. Implement performance evaluation metrics and evaluate results
7. Report observations and propose business-centric solutions and mitigating strategies

This project is **due Friday, October 3 @ 11:59pm**. Late submissions will not be accepted, so please plan accordingly.

## Deliverables

This is an **individual project**.

You will produce the following deliverables to answer all questions below:

1. [**4-page IEEE-format paper**](https://www.ieee.org/conferences/publishing/templates.html). Write a paper of no more than 4 pages addressing the questions below (use the template provided in [this link](https://www.ieee.org/conferences/publishing/templates.html)). When writing this report, consider a business-oriented person as your reader (e.g. the NYC yellow taxi driver company). Tell *the story* of the dataset's goal and propose solutions by addressing (at least) the questions below. You may organize the report per question.

2. **Python Code**. Create two separate Notebooks: (1) "training.ipynb", used for training and hyperparameter tuning, and (2) "test.ipynb", used for evaluating the final trained model on the test set. The "test.ipynb" Notebook will mimic what the business consumer would use, i.e., they receive a set of trained objects and simply use them to make predictions on the dataset. They should NOT have to train or tune anything. Do not forget to **push the tuned objects** to your repository. We should be able to run your "test.ipynb" without having to run the "training.ipynb" file. Points will be deducted otherwise. All of your code should run without any errors and be well-documented.

3. **README.md file**. Edit the readme.md file in your repository to explain how to use your code. If there are user-defined parameters, your readme.md file must clearly indicate so and demonstrate how to use them.

## Reminders

To save a tuned ````scikit-learn```` object, from your "training.ipynb" Notebook, run:

````python
import joblib
joblib.dump(tuned_model_object, 'name_for_tuned_model_object.pkl');
````

In the "test.ipynb" Notebook, you can load this object with:
````python
import joblib
loaded_model_name = joblib.load('name_for_tuned_model_object.pkl');
````

---

## About the Dataset

### 2023 Yellow Taxi Trip Data

These records are generated from the trip record submissions made by yellow taxi Technology Service Providers (TSPs). Each row represents a single trip in a yellow taxi. The trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off taxi zone locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. The data is available at [NYC Open Data](https://opendata.cityofnewyork.us/).

Attributes Description

- `vendor_id`: A code indicating the TPEP provider that provided the record.
- `tpep_pickup_datetime`: The date and time when the meter was engaged.
- `tpep_dropoff_datetime`: The date and time when the meter was disengaged.
- `passenger_count`: The number of passengers in the vehicle.
- `trip_distance`: The elapsed trip distance in miles reported by the taximeter.
- `ratecodeid`: The final rate code in effect at the end of the trip.
- `store_and_fwd_flag`: This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka “store and forward,” because the vehicle did not have a connection to the server.
- `pulocationid`: TLC Taxi Zone in which the taximeter was engaged.
- `dolocationid`: TLC Taxi Zone in which the taximeter was disengaged.
- `payment_type`: A numeric code signifying how the passenger paid for the trip.
- `fare_amount`: The time-and-distance fare calculated by the meter. For additional information on the following columns, see https://www.nyc.gov/site/tlc/passengers/taxi-fare.page
- `extra`: Miscellaneous extras and surcharges.
- `mta_tax`: Tax that is automatically triggered based on the metered rate in use.
- `tip_amount`: Tip amount – This field is automatically populated for credit card tips. Cash tips are not included.
- `tolls_amount`: Total amount of all tolls paid in trip.
- `improvement_surcharge`: Improvement surcharge assessed on trips at the flag drop. The improvement surcharge began being levied in 2015.
- `total_amount`: The total amount charged to passengers. Does not include cash tips.
- `congestion_surcharge`: Total amount collected in trip for NYS congestion surcharge.
- `airport_fee`: Fee applied only to pickups at LaGuardia and John F. Kennedy Airports.

---

## Exercise 1 - Prepare the Data

Apply the necessary data preprocessing using ```scikit-learn``` pipelines. Justify all choices. Use the prepared data to answer Exercises 2, 3, and 4. The only requirements regarding attribute encoding are:

1. Encode the attribute ```Date``` with its respective day of the week (Monday, Tuesday, Wednesday, Thursday, Friday, Saturday and Sunday).
2. Encode the attribute ```Time``` into 4 categories: Morning (10:00 - 11:59), Afternoon (12:00 - 16:59), Evening (17:00 - 18:59) and Night (19:00 - 20:59).
3. Create a new feature - "pre_tip_total_amount" as the sum of attributes ````fare_amount````, ````extra````, ````mta_tax````, ````tolls_amount````, ````improvement_surcharge````, ````congestion_surcharge````, and ````airport_fee````.
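
The three encoding requirements above could be sketched as follows. This is only an illustration under assumptions: it takes the `Date` and `Time` attributes to be derived from `tpep_pickup_datetime`, and the output column names `pickup_day` and `pickup_time_slot` and the function name `add_engineered_features` are hypothetical. Pickups outside 10:00 - 20:59 fall into none of the four categories and are left as missing values here; how to handle them is a choice you still need to make and justify.

```python
import pandas as pd

# Columns summed into pre_tip_total_amount, as listed in requirement 3.
FEE_COLUMNS = [
    "fare_amount", "extra", "mta_tax", "tolls_amount",
    "improvement_surcharge", "congestion_surcharge", "airport_fee",
]

def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the three engineered features added.

    Assumption: day and time are taken from the pickup timestamp.
    """
    out = df.copy()
    pickup = pd.to_datetime(out["tpep_pickup_datetime"])

    # Requirement 1: day of the week as a name (Monday ... Sunday).
    out["pickup_day"] = pickup.dt.day_name()

    # Requirement 2: left-closed hour bins [10,12), [12,17), [17,19), [19,21).
    # Hours outside 10-20 get NaN (no category is defined for them).
    out["pickup_time_slot"] = pd.cut(
        pickup.dt.hour,
        bins=[10, 12, 17, 19, 21],
        labels=["Morning", "Afternoon", "Evening", "Night"],
        right=False,
    )

    # Requirement 3: row-wise sum of the fee columns.
    # Note: sum() skips NaN by default, i.e. a missing fee counts as 0.
    out["pre_tip_total_amount"] = out[FEE_COLUMNS].sum(axis=1)
    return out
```

To use a function like this inside a ```scikit-learn``` pipeline, one option is to wrap it in `sklearn.preprocessing.FunctionTransformer`, so the same step is applied in both "training.ipynb" and "test.ipynb".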

## Exercise 2 - Exploratory Data Analysis

In this exercise, carry out exploratory data analysis to understand the data, including:

1. Pearson's correlation coefficient. In Exercises 3 and 4, you will predict the attributes ````tip_amount```` and ````fare_amount````.
2. Which pickup location brings the most tips? (Don't forget to normalize by number of fares.)
3. How is the tip distribution affected by time of day? By day of the week? By time of day **AND** day of the week? (e.g. Friday night, Monday morning, etc.)
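
As a starting point for the three questions above, the summaries could be computed with pandas along these lines. This is a sketch, not a required approach: the function name `tip_summaries` and the column names `pickup_day` and `pickup_time_slot` are assumptions standing in for whatever names you gave the day and time encodings in Exercise 1, and the mean is only one way to summarize a distribution.

```python
import pandas as pd

def tip_summaries(df: pd.DataFrame,
                  day_col: str = "pickup_day",
                  slot_col: str = "pickup_time_slot"):
    """Return (correlations, per-zone tips, day-by-slot mean tips).

    day_col and slot_col are hypothetical names for the Exercise 1 encodings.
    """
    # Question 1: Pearson correlation of every numeric attribute with the
    # two prediction targets.
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr(method="pearson")[["tip_amount", "fare_amount"]]

    # Question 2: mean tip per pickup zone. Using the mean (not the sum)
    # normalizes by the number of fares; n_trips shows how many trips
    # support each estimate.
    per_zone = (
        df.groupby("pulocationid")["tip_amount"]
        .agg(mean_tip="mean", n_trips="count")
        .sort_values("mean_tip", ascending=False)
    )

    # Question 3: mean tip for each (day of week, time slot) combination.
    by_day_slot = df.pivot_table(
        index=day_col, columns=slot_col, values="tip_amount", aggfunc="mean"
    )
    return corr, per_zone, by_day_slot
```

Zones with very few trips can top the `per_zone` ranking by chance, so it may be worth filtering on `n_trips` before drawing conclusions. The `by_day_slot` table can be passed directly to a heatmap plot.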

## Exercise 3 - Predictive Modeling

For this exercise, consider the coefficient of determination, $r^2$, as one of your metrics of success and report its 95% confidence interval or CI (on the validation set). Carry out any necessary hyperparameter tuning with pipelines. Choose the best CV strategy and report on the best hyperparameter settings.
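
The assignment does not prescribe how to compute the confidence interval. One common option, shown here only as a sketch, is a percentile bootstrap over the validation-set predictions; the function name `bootstrap_r2_ci` and its parameters are hypothetical. An interval built from the per-fold cross-validation scores is another reasonable choice.

```python
import numpy as np
from sklearn.metrics import r2_score

def bootstrap_r2_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for r^2 on a held-out set.

    Returns (point_estimate, lower, upper) for a (1 - alpha) interval.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    rng = np.random.default_rng(seed)
    n = len(y_true)

    scores = np.empty(n_boot)
    for b in range(n_boot):
        # Resample validation rows with replacement and rescore.
        idx = rng.integers(0, n, size=n)
        scores[b] = r2_score(y_true[idx], y_pred[idx])

    lower, upper = np.quantile(scores, [alpha / 2, 1 - alpha / 2])
    return r2_score(y_true, y_pred), lower, upper
```

Here `y_pred` would be the tuned model's predictions on the validation set. The model itself is not refit inside the loop, so the interval reflects sampling variability of the validation data only.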

Train a multiple linear regression **with and without** **Lasso** regularization to **predict ```tip_amount```**. Using the tuned models for each case, answer the following questions:

1. For each model: how is the ````tip_amount```` attribute affected by trip distance, passenger count, pre-tip total amount, other fees, and other variables like pickup day, time slot & location, and vendor of the TPEP provider? That is, how much does each one of these attributes contribute to predicting ````tip_amount````? (Answer this question from the perspective of a taxi driver. What would you tell the taxi driver? Where and when should they work in order to maximize profit from tips?)

2. When using the Lasso regularizer, which value of the hyperparameter $\lambda$ works best for this dataset? Based on the CI for each model, which performs best? Justify your answer.

3. Which features were excluded in the model with a Lasso regularizer, if any? 
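
A minimal sketch of how questions 2 and 3 might be approached with ```scikit-learn``` is given below. Note that ```scikit-learn``` names the regularization strength `alpha`, which plays the role of $\lambda$ here. The helper names `tune_lasso` and `excluded_features`, the candidate grid, and the use of shuffled 5-fold CV are assumptions for illustration; the sketch also assumes the features have already been encoded as numeric columns.

```python
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def tune_lasso(X, y, alphas=(0.001, 0.01, 0.1, 1.0), n_splits=5, seed=0):
    """Grid-search the Lasso strength (sklearn's `alpha`) by r^2."""
    pipe = Pipeline([
        # Scaling puts all coefficients on a comparable footing so the
        # L1 penalty treats features evenly.
        ("scale", StandardScaler()),
        ("lasso", Lasso(max_iter=10_000)),
    ])
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    search = GridSearchCV(
        pipe, {"lasso__alpha": list(alphas)}, scoring="r2", cv=cv
    )
    search.fit(X, y)
    return search

def excluded_features(search, feature_names):
    """Names of features whose Lasso coefficient is exactly zero."""
    coefs = search.best_estimator_.named_steps["lasso"].coef_
    return [name for name, c in zip(feature_names, coefs) if c == 0.0]
```

After fitting, `search.best_params_["lasso__alpha"]` gives the selected strength, and the remaining nonzero coefficients of the scaled features indicate relative contribution. The unregularized baseline can reuse the same pipeline with `LinearRegression` in place of `Lasso`.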

## Exercise 4 - Predictive Modeling

For this exercise, consider the coefficient of determination, $r^2$, as one of your metrics of success and report its 95% confidence interval (CI). Carry out any necessary hyperparameter tuning with pipelines. Choose the best CV strategy and report on the best hyperparameter settings.

Train a multiple linear regression **with and without Lasso** regularization to **predict ```fare_amount```**.

1. For each model: how is the ````fare_amount```` attribute affected by trip distance, passenger count, other fees, and other variables like pickup day, time slot & location, and vendor of the TPEP provider? That is, how much does each one of these attributes contribute to predicting ````fare_amount````? (Answer this question from the perspective of a taxi driver. What would you tell the taxi driver? Where and when should they work in order to maximize earnings from fares?)

2. When using the Lasso regularizer, which value of the hyperparameter $\lambda$ works best for this dataset? Based on the CI for each model, which performs best? Justify your answer.

3. Which features were excluded in the model with a Lasso regularizer, if any? 

---