# Day 1 - ML Workflow

The objective of this exercise is to use the tools and methods you learned in previous weeks to solve a **Kaggle challenge**.

#### The problem to solve is the [New York City Taxi Fare Prediction Challenge](https://www.kaggle.com/c/new-york-city-taxi-fare-prediction). 

The goal is to predict the fare amount (inclusive of tolls) for a taxi ride in New York City given the pickup and dropoff locations.

For this, you are going to follow the different steps below:

## Steps
1. [Get the data](#part1)
2. [Explore the data](#part2)
3. [Data cleaning](#part3)
4. [Evaluation metric](#part4)
5. [Model baseline](#part5)
6. [Build your first model](#part6)
7. [Model evaluation](#part7)
8. [Kaggle submission](#part8)
9. [Model iteration](#part9)

## 1. Get the data <a id='part1'></a>

The dataset is available on [Kaggle](https://www.kaggle.com/c/new-york-city-taxi-fare-prediction/data).

First of all:
- Follow the instructions to download the training and test sets
- Put the datasets in a separate folder on your local disk, that you can name `data` for example.

Now we are going to use Pandas to read and explore the datasets.

In [165]:
import pandas as pd

The training dataset is relatively big (~2 GB), so let's only open a portion of it.
Go to the [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/) to see how to read only part of a CSV file into a dataframe (e.g. at most 1 million rows).

In [1]:
# your code
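
As a hedged illustration, one way to limit how much of a CSV file gets loaded is the `nrows` parameter of `pandas.read_csv`. The sketch below uses a tiny in-memory CSV as a stand-in for the real file; the path `data/train.csv` mentioned in the comments is an assumption about where you stored the download.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the real training file. With the actual data
# you would pass a path such as "data/train.csv" (assumed location) instead.
csv_text = """key,fare_amount,pickup_datetime,passenger_count
a,4.5,2009-06-15 17:26:21 UTC,1
b,16.9,2010-01-05 16:52:16 UTC,1
c,5.7,2011-08-18 00:35:00 UTC,2
"""

# nrows caps the number of data rows parsed, so the rest of the file is skipped.
df = pd.read_csv(io.StringIO(csv_text), nrows=2)
print(df.shape)
```

On the real file, the equivalent call would look like `pd.read_csv("data/train.csv", nrows=1_000_000)`.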

Now let's display the first rows to understand the different fields.

In [2]:
# your code

## 2. Explore the data <a id='part2'></a>

Before trying to solve the prediction problem, we need to get a better understanding of the data.
For that, libraries like Pandas and Seaborn are your best friends.
First of all, make sure you have [Seaborn](https://seaborn.pydata.org/) installed and import it into your notebook. It can also be useful to import `matplotlib.pyplot` to customize a few things such as the default `figsize` or `font.size`.

In [3]:
import seaborn as sns
import matplotlib.pyplot as plt

#plt.style.use(style)
#plt.rcParams[key] = value

### There are multiple things we want to do in terms of data exploration.

- You first want to look at the distribution of the variable you are going to predict: `fare_amount`
- Then you want to visualize the distributions of the other variables
- Then it is helpful to compute and visualize correlations between target variable and other variables.
- Also, look for any missing or wrong data.

### Explore the target variable
- Compute simple statistics of the target variable (min, max, mean, std, ...)
- Plot distributions

In [5]:
# your code

### Explore other variables

- passenger_count (statistics + distribution)
- pickup_datetime (you need to build time features out of pickup datetime)
- Geospatial features (pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude)
- Find other variables you can compute from existing data that might explain the target 

#### Passenger Count

In [6]:
# your code

#### Pickup Datetime 
- Extract time features from pickup_datetime (hour, day of week, month, year)
- Create a function `def extract_time_features(_df)` that you will be able to re-use later
- Be careful with the timezone
- Explore the newly created features 

In [10]:
def extract_time_features(_df):
    return _df

# df = extract_time_features(df)
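
A minimal sketch of what such a function could look like is shown below. It assumes the raw timestamps are expressed in UTC (as the ` UTC` suffix in the values suggests) and converts them to New York local time before extracting features. The keyword arguments and the output column names (`hour`, `dow`, `month`, `year`) are illustrative choices, not part of the exercise.

```python
import pandas as pd


def extract_time_features(_df, time_column="pickup_datetime",
                          timezone="America/New_York"):
    """Return a copy of _df with hour, day-of-week, month and year columns."""
    df = _df.copy()
    # Parse as UTC (assumption about the raw data), then convert to local time
    # so that "hour" reflects the time of day riders actually experienced.
    local_time = pd.to_datetime(df[time_column], utc=True).dt.tz_convert(timezone)
    df["hour"] = local_time.dt.hour
    df["dow"] = local_time.dt.weekday  # Monday = 0, Sunday = 6
    df["month"] = local_time.dt.month
    df["year"] = local_time.dt.year
    return df


example = pd.DataFrame({"pickup_datetime": ["2009-06-15 17:26:21 UTC"]})
print(extract_time_features(example))
```

Working on a copy keeps the input dataframe untouched, which makes the function safe to call repeatedly on the training, validation and test sets.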

In [11]:
# your code to explore time feature (hour, day of week, etc...)

#### Geospatial Data
- Check for absurd (lat, lng) coordinates and inspect the test set
- To visualize geospatial data, you can use libraries like [Folium](https://python-visualization.github.io/folium/)
- Check out this [great example](https://www.kaggle.com/daveianhickey/how-to-folium-for-maps-heatmaps-time-data) on Kaggle
- Tip: Look for HeatMap to generate a heatmap
- Bonus: Look for HeatMapWithTime to generate an animated heatmap

In [12]:
import folium
from folium.plugins import HeatMap
from folium.plugins import HeatMapWithTime

In [13]:
# your code to explore geospatial data

#### Distance
- Compute distance between pickup and dropoff locations (tip: https://en.wikipedia.org/wiki/Haversine_formula)
- Write a method `def haversine_distance(df, **kwargs)` that you will be able to reuse later
- Compute a few statistics for distance and plot distance distribution

In [38]:
def haversine_distance(df, **kwargs):
    df["distance"] = None
    return df

# df = haversine_distance(df)
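
For reference, here is one possible vectorized implementation of the haversine formula with NumPy. The keyword argument names and the Earth radius of 6371 km (a commonly used mean value) are assumptions made for this sketch; the distance is returned in kilometers.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_distance(df, start_lat="pickup_latitude", start_lon="pickup_longitude",
                       end_lat="dropoff_latitude", end_lon="dropoff_longitude"):
    """Return a copy of df with a great-circle "distance" column in km."""
    lat1, lon1 = np.radians(df[start_lat]), np.radians(df[start_lon])
    lat2, lon2 = np.radians(df[end_lat]), np.radians(df[end_lon])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    # Haversine formula: a is the squared half-chord length between the points.
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    df = df.copy()
    df["distance"] = 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))
    return df


trips = pd.DataFrame({
    "pickup_latitude": [40.0, 40.75],
    "pickup_longitude": [-74.0, -73.99],
    "dropoff_latitude": [41.0, 40.75],
    "dropoff_longitude": [-74.0, -73.99],
})
print(haversine_distance(trips)["distance"])
```

The first example trip spans one degree of latitude, which should come out at roughly 111 km; the second has identical pickup and dropoff points.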

In [17]:
# your code to explore distance 

#### Explore how the target variable correlates with other variables
- As a first step, you can visualize the target variable against another variable. For categorical variables, it is often useful to compute the average of the target variable for each category (Seaborn has plots that do it for you!). For continuous variables (like distance), you can use scatter plots or regression plots, or bucket the distance into bins.
- There are many different ways to visualize correlation between features, so be creative.

In [18]:
# your code
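
The numbers behind such plots can also be computed directly with Pandas. The sketch below uses a small made-up dataframe; the column names `distance` and `hour` and the bin edges are assumptions for illustration only.

```python
import pandas as pd

# Made-up sample standing in for the real training data.
df = pd.DataFrame({
    "fare_amount": [5.0, 7.0, 12.0, 20.0, 6.0, 15.0],
    "distance": [1.0, 2.0, 4.0, 8.0, 1.5, 6.0],
    "hour": [8, 8, 17, 17, 8, 17],
})

# Categorical variable: average target per category.
mean_fare_by_hour = df.groupby("hour")["fare_amount"].mean()

# Continuous variable: bucket into bins, then average the target per bin.
distance_bins = pd.cut(df["distance"], bins=[0, 2, 5, 10])
mean_fare_by_distance_bin = df.groupby(distance_bins, observed=True)["fare_amount"].mean()

# Pearson correlation between the target and a continuous feature.
correlation = df["fare_amount"].corr(df["distance"])

print(mean_fare_by_hour)
print(mean_fare_by_distance_bin)
print(correlation)
```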

## 3. Data cleaning <a id='part3'></a>

As you probably identified during data exploration, there are some values that do not seem valid.
In this section, you will take a few steps to clean the training data.

Remove all trips that look incorrect.
- Write a method `clean_data(df)` that you will be able to re-use in the next steps.

In [20]:
def clean_data(_df):
    return _df

#df_cleaned = clean_data(df)
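
As a starting point, a cleaning function might look like the sketch below. All thresholds (positive fares, 1 to 8 passengers, a rough bounding box around New York City) are illustrative assumptions; adjust them based on what you observed during exploration.

```python
import pandas as pd

# Rough bounding box around New York City (illustrative assumption).
NYC_BOUNDS = {"lat": (40.5, 41.0), "lon": (-74.3, -73.7)}


def clean_data(_df):
    """Return only the rows that pass a few basic sanity checks."""
    df = _df.dropna()
    # Assumed rules: fares must be positive, passenger counts plausible.
    mask = (df["fare_amount"] > 0) & df["passenger_count"].between(1, 8)
    for prefix in ("pickup", "dropoff"):
        mask &= df[f"{prefix}_latitude"].between(*NYC_BOUNDS["lat"])
        mask &= df[f"{prefix}_longitude"].between(*NYC_BOUNDS["lon"])
    return df[mask]


raw = pd.DataFrame({
    "fare_amount": [10.0, -5.0, 8.0, None],
    "passenger_count": [1, 1, 2, 1],
    "pickup_latitude": [40.75, 40.75, 0.0, 40.75],
    "pickup_longitude": [-73.99, -73.99, 0.0, -73.99],
    "dropoff_latitude": [40.76, 40.76, 0.0, 40.76],
    "dropoff_longitude": [-73.98, -73.98, 0.0, -73.98],
})
print(len(clean_data(raw)))
```

Only the first example row survives: the others have a negative fare, (0, 0) coordinates, or a missing value.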

## 4. Evaluation metric <a id='part4'></a>

The evaluation metric for this competition is the root mean-squared error or RMSE. RMSE measures the difference between the predictions of a model, and the corresponding ground truth. A large RMSE is equivalent to a large average error, so smaller values of RMSE are better.

More details here: https://www.kaggle.com/c/new-york-city-taxi-fare-prediction/overview/evaluation

Write a function `def compute_rmse(y_pred, y_true)` that computes the RMSE given `y_pred` and `y_true`, two numpy arrays containing the model predictions and the ground truth values.

This function will be useful to evaluate the performance of your model.

In [21]:
def compute_rmse(y_pred, y_true):
    pass
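
One straightforward way to write this with NumPy is sketched below: square the differences, average them, and take the square root. Converting the inputs with `np.asarray` is an optional convenience added here so that lists and Pandas Series are accepted too.

```python
import numpy as np


def compute_rmse(y_pred, y_true):
    """Root mean squared error between predictions and ground truth."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return np.sqrt(np.mean((y_pred - y_true) ** 2))


print(compute_rmse(np.array([2.0, 2.0]), np.array([0.0, 0.0])))
```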

## 5. Model baseline <a id='part5'></a>

Before building your model, it is often useful to get a performance benchmark. For this, you will use a deliberately naive baseline model and compute the evaluation metric on it.
Then, you will be able to see how much better your model is compared to the baseline. It is very common to see ML teams coming up with very sophisticated approaches without knowing by how much their model beats a very simple one.

- Generate predictions based on a simple heuristic
- Evaluate RMSE for these predictions

In [27]:
df["fare_amount_predicted"] = None # heuristic to make simple predictions
compute_rmse(df.fare_amount_predicted, df.fare_amount)
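
For example, one very simple heuristic is to predict the mean fare for every trip. The sketch below applies it to a made-up three-row dataframe and computes the RMSE inline; other heuristics (a median, or a fixed price per kilometer) are equally valid baselines.

```python
import numpy as np
import pandas as pd

# Made-up fares standing in for the real training data.
df = pd.DataFrame({"fare_amount": [5.0, 10.0, 15.0]})

# Baseline heuristic (an assumption of this sketch): always predict the mean.
df["fare_amount_predicted"] = df["fare_amount"].mean()

errors = df["fare_amount_predicted"] - df["fare_amount"]
baseline_rmse = float(np.sqrt((errors ** 2).mean()))
print(baseline_rmse)
```

Note that with a constant mean prediction, the RMSE equals the (population) standard deviation of the target.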

## 6. Build your first model <a id='part6'></a>

Now it is time to build your model!

Here are the different steps you have to follow:

1. Split the data into two different sets (training and validation). You will be measuring the performance of your model on the validation set.
2. Make sure you apply the data cleaning on your training set
3. Think about the different features you want to add in your model
4. For each of these features, make sure you apply the correct transformation so that the model can correctly learn from them (this is true for categorical variables like `hour of day` or `day of week`)
5. Train your model

##### Training/Validation Split

In [29]:
# your code for training/validation
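
A random split can be done with Pandas alone, as in the hedged sketch below; the 80/20 ratio and the seed are arbitrary choices. It relies on the dataframe index being unique. scikit-learn's `train_test_split` is a common alternative.

```python
import pandas as pd

# Made-up data standing in for the cleaned training dataframe.
df = pd.DataFrame({"fare_amount": range(10), "distance": range(10)})

# Hold out a random 20% of the rows for validation (ratio is an assumption).
df_val = df.sample(frac=0.2, random_state=42)
# Everything not sampled stays in the training set (requires a unique index).
df_train = df.drop(df_val.index)

print(len(df_train), len(df_val))
```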

##### Apply data cleaning on training set

In [194]:
# your code for data cleaning

##### List features (continuous vs categorical)

In [30]:
# your features
target = ""
features = []
categorical_features = []

##### Features transformation
- Write a function `def transform_features(df, **kwargs)`, because you will have to apply the same transformation to the validation (or test) set before making predictions
- For categorical features transformation, you can use `pandas.get_dummies` method

In [31]:
def transform_features(_df, **kwargs):
    pass
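
One subtlety with `pandas.get_dummies` is that the validation or test set may not contain every category seen during training, which yields a different set of columns. The sketch below handles this by re-aligning the columns to those of the training matrix. The explicit keyword arguments (`features`, `categorical_features`, `dummy_columns`) are hypothetical names chosen for this example.

```python
import pandas as pd


def transform_features(_df, features, categorical_features, dummy_columns=None):
    """One-hot encode categorical features and align columns if requested."""
    X = pd.get_dummies(_df[features], columns=categorical_features)
    if dummy_columns is not None:
        # Add categories missing from this set as all-zero columns and
        # drop any category the training set never saw.
        X = X.reindex(columns=dummy_columns, fill_value=0)
    return X


train = pd.DataFrame({"distance": [1.0, 5.0], "hour": [8, 17]})
val = pd.DataFrame({"distance": [2.0], "hour": [8]})

X_train = transform_features(train, ["distance", "hour"], ["hour"])
X_val = transform_features(val, ["distance", "hour"], ["hour"],
                           dummy_columns=X_train.columns)
print(list(X_val.columns))
```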

##### Model training

In [33]:
# your code for model training

## 7. Model evaluation <a id='part7'></a>

Now to evaluate your model, you need to use your previously trained model to make predictions on the validation set. 

For this, follow these steps:
1. Apply the same transformations on the validation set
2. Make predictions
3. Evaluate predictions using the `compute_rmse` method

In [34]:
# your code for model evaluation of the validation set
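
The train-then-evaluate loop can be sketched as follows on synthetic data. scikit-learn's `LinearRegression` is used here purely as an example estimator (the exercise does not prescribe one), and the single `distance` feature and the linear fare rule are made up for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data: fare is an exact linear function of distance (made up).
train = pd.DataFrame({"distance": [1.0, 2.0, 3.0, 4.0]})
train["fare_amount"] = 2.5 + 2.0 * train["distance"]
val = pd.DataFrame({"distance": [5.0, 6.0]})
val["fare_amount"] = 2.5 + 2.0 * val["distance"]

# Fit on the training set only.
model = LinearRegression().fit(train[["distance"]], train["fare_amount"])

# Predict on the validation set using the same feature columns.
y_pred = model.predict(val[["distance"]])

# RMSE on the validation set.
val_rmse = float(np.sqrt(np.mean((y_pred - val["fare_amount"].to_numpy()) ** 2)))
print(val_rmse)
```

Because the synthetic target is perfectly linear, the validation RMSE is essentially zero here; on real data, compare it against your baseline RMSE.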

## 8. Kaggle submission <a id='part8'></a>

Now that you have a model, you can make predictions on the Kaggle test set and be evaluated by Kaggle directly.

- Download test data from Kaggle
- Follow [instructions](https://www.kaggle.com/c/new-york-city-taxi-fare-prediction/overview/evaluation) to make sure your predictions are in the right format
- Re-train your model using all the data (do not split between train/validation)
- Apply all feature engineering and transformation methods to the test set
- Use the model to make predictions on the test set
- Submit your predictions!

In [35]:
# your code 
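
Writing the submission file could look like the sketch below. It assumes the expected format is a CSV with a `key` column and a `fare_amount` column, which you should confirm on the evaluation page linked above. The predictions are hard-coded stand-ins for your model's output, and the file is written to an in-memory buffer instead of disk.

```python
import io

import pandas as pd

# Stand-ins for the real test set and for model.predict(...) output.
test_df = pd.DataFrame({"key": ["k1", "k2"], "distance": [1.0, 3.0]})
predictions = [4.5, 8.5]

# Assumed submission layout: one row per test key with its predicted fare.
submission = pd.DataFrame({"key": test_df["key"], "fare_amount": predictions})

buffer = io.StringIO()
submission.to_csv(buffer, index=False)  # index=False avoids an extra column
print(buffer.getvalue())
```

With real data you would pass a file name such as `"submission.csv"` (a hypothetical name) to `to_csv` instead of the buffer.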

## 9. Model iteration <a id='part9'></a>

You can improve your model by trying different things (but don't worry, some of these will be covered in the next days).
- Use more data to train
- Build and add more features 
- Try different estimators
- Adjust your data cleaning to remove more or less data
- Tune the hyperparameters of your model

### Ideas for Feature Engineering

###### Another distance?
- Think about the distance you used, and try to find one that is better suited to our problem (ask a TA for insights)

###### Distance from the center
- Compute a new feature: the distance of the pickup location from the city center
- Make a scatter plot of *distance_from_center* against *distance*
- What do you observe? What new features could you add? How are these new features correlated with the target?
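
A possible sketch of this feature is shown below. The reference point (roughly lower Manhattan) is an assumption; pick whichever center makes sense to you. The helper `haversine_km` and the function name `add_distance_from_center` are hypothetical names introduced for this example.

```python
import numpy as np
import pandas as pd

# Assumed reference point for "the center" (approximate, lower Manhattan).
NYC_CENTER = (40.7128, -74.0060)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))


def add_distance_from_center(_df, center=NYC_CENTER):
    """Return a copy of _df with the pickup's distance to the center in km."""
    df = _df.copy()
    df["distance_from_center"] = haversine_km(
        df["pickup_latitude"], df["pickup_longitude"], center[0], center[1]
    )
    return df


pickups = pd.DataFrame({
    "pickup_latitude": [40.7128, 41.7128],
    "pickup_longitude": [-74.0060, -74.0060],
})
print(add_distance_from_center(pickups))
```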

###### Which direction are you heading in?
- Compute a new feature: the direction in which the trip is heading
- What do you observe? What new features could you add? How are these new features correlated with the target?
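
One way to express direction is the initial compass bearing from pickup to dropoff, sketched below. The output convention (degrees clockwise from north, in the range 0 to 360) and the function name `add_bearing` are choices made for this example.

```python
import numpy as np
import pandas as pd


def add_bearing(_df):
    """Return a copy of _df with the initial bearing of each trip in degrees."""
    df = _df.copy()
    lat1, lon1 = np.radians(df["pickup_latitude"]), np.radians(df["pickup_longitude"])
    lat2, lon2 = np.radians(df["dropoff_latitude"]), np.radians(df["dropoff_longitude"])
    dlon = lon2 - lon1
    # Standard initial-bearing formula on a sphere.
    x = np.sin(dlon) * np.cos(lat2)
    y = np.cos(lat1) * np.sin(lat2) - np.sin(lat1) * np.cos(lat2) * np.cos(dlon)
    # Normalize to [0, 360): 0 = north, 90 = east, 180 = south, 270 = west.
    df["bearing"] = (np.degrees(np.arctan2(x, y)) + 360) % 360
    return df


trips = pd.DataFrame({
    "pickup_latitude": [40.0, 0.0],
    "pickup_longitude": [-74.0, -74.0],
    "dropoff_latitude": [41.0, 0.0],
    "dropoff_longitude": [-74.0, -73.0],
})
print(add_bearing(trips)["bearing"])
```

The first example trip heads due north and the second heads due east along the equator.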