# Uber Fares 🚙🚙
In this exercise, we'll use Random Forests to estimate the price of an Uber ride.

## Importing libraries and dataset
0. Import the usual libraries and read the dataset from this URL:
"https://full-stack-bigdata-datasets.s3.eu-west-3.amazonaws.com/Machine+Learning+Supervis%C3%A9/Decision+trees/uber.csv"

In [88]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error

import plotly.express as px

In [89]:
dataset = pd.read_csv("https://full-stack-bigdata-datasets.s3.eu-west-3.amazonaws.com/Machine+Learning+Supervis%C3%A9/Decision+trees/uber.csv")

## Basic exploring and cleaning
1. Display basic statistics about the dataset. Do you notice some inconsistent values?

In [90]:
print("Number of rows : {}".format(dataset.shape[0]))
print("Number of columns : {}".format(dataset.shape[1]))
print()

print("Display of dataset: ")
display(dataset.head())
print()

print("Basics statistics: ")
data_desc = dataset.describe(include="all")
display(data_desc)
print()

print("Data types: ")
display(dataset.dtypes)

print("Percentage of missing values: ")
display(100 * dataset.isnull().sum() / dataset.shape[0])

Number of rows : 20000
Number of columns : 9

Display of dataset: 


,Unnamed: 0,key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,48462598,2015-05-07 10:24:44.0000004,13.0,2015-05-07 10:24:44 UTC,-73.971664,40.797035,-73.958939,40.777649,1
1,6637611,2014-07-09 09:14:04.0000002,5.5,2014-07-09 09:14:04 UTC,-73.991635,40.749855,-73.98825,40.741341,2
2,8357193,2013-11-11 18:51:00.000000240,8.5,2013-11-11 18:51:00 UTC,-73.982352,40.777042,-73.995912,40.759757,1
3,40466112,2014-05-22 01:54:00.00000069,19.0,2014-05-22 01:54:00 UTC,-73.991455,40.7517,-73.936357,40.812327,1
4,35405035,2011-06-21 23:37:33.0000002,7.7,2011-06-21 23:37:33 UTC,-73.974749,40.756255,-73.952276,40.778332,1



Basics statistics: 


,Unnamed: 0,key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
count,20000.0,20000,20000.0,20000,20000.0,20000.0,20000.0,20000.0,20000.0
unique,,20000,,19967,,,,,
top,,2015-05-07 10:24:44.0000004,,2012-08-28 14:03:00 UTC,,,,,
freq,,1,,2,,,,,
mean,27679490.0,,11.358151,,-72.490431,39.918498,-72.459891,39.923345,1.69015
std,16011230.0,,9.89199,,10.461597,6.051561,10.564266,6.90152,1.311384
min,3949.0,,-23.7,,-75.419276,-74.00619,-75.423067,-73.991765,0.0
25%,13834760.0,,6.0,,-73.992075,40.734733,-73.991423,40.734105,1.0
50%,27697240.0,,8.5,,-73.981904,40.752554,-73.980305,40.752997,1.0
75%,41480820.0,,12.5,,-73.967229,40.767075,-73.963509,40.768348,2.0



Data types: 


Unnamed: 0             int64
key                   object
fare_amount          float64
pickup_datetime       object
pickup_longitude     float64
pickup_latitude      float64
dropoff_longitude    float64
dropoff_latitude     float64
passenger_count        int64
dtype: object

Percentage of missing values: 


Unnamed: 0           0.0
key                  0.0
fare_amount          0.0
pickup_datetime      0.0
pickup_longitude     0.0
pickup_latitude      0.0
dropoff_longitude    0.0
dropoff_latitude     0.0
passenger_count      0.0
dtype: float64

2. Drop the useless columns and the rows containing outliers.

In [91]:
dataset = dataset.drop(["Unnamed: 0", "key"], axis = 1)
mask = dataset['fare_amount'] > 0
dataset = dataset.loc[mask, :]

In [92]:
dataset.head()

,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,13.0,2015-05-07 10:24:44 UTC,-73.971664,40.797035,-73.958939,40.777649,1
1,5.5,2014-07-09 09:14:04 UTC,-73.991635,40.749855,-73.98825,40.741341,2
2,8.5,2013-11-11 18:51:00 UTC,-73.982352,40.777042,-73.995912,40.759757,1
3,19.0,2014-05-22 01:54:00 UTC,-73.991455,40.7517,-73.936357,40.812327,1
4,7.7,2011-06-21 23:37:33 UTC,-73.974749,40.756255,-73.952276,40.778332,1
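The statistics above also show coordinates that look inconsistent, such as a minimum pickup latitude of about -74 while the quartiles sit near 40.75. The cell above only filters on `fare_amount`. A minimal sketch of an additional coordinate filter could look like the following. The helper name `filter_plausible_rides` and the bounding box values are assumptions chosen for illustration from the quartiles, not part of the original solution, and applying such a filter would change the scores reported later.

```python
import pandas as pd


def filter_plausible_rides(df, lon_range=(-75.0, -72.0), lat_range=(40.0, 42.0)):
    """Keep rides with a positive fare and coordinates inside a bounding box.

    Hypothetical helper: the default ranges are assumptions, not values
    given in the exercise.
    """
    mask = df["fare_amount"] > 0
    for col in ["pickup_longitude", "dropoff_longitude"]:
        mask &= df[col].between(*lon_range)
    for col in ["pickup_latitude", "dropoff_latitude"]:
        mask &= df[col].between(*lat_range)
    return df.loc[mask]


# Toy rows: one valid ride, one negative fare, one implausible latitude
toy_rides = pd.DataFrame({
    "fare_amount": [13.0, -23.7, 8.5],
    "pickup_longitude": [-73.971664, -73.991635, -73.982352],
    "pickup_latitude": [40.797035, 40.749855, -74.00619],
    "dropoff_longitude": [-73.958939, -73.98825, -73.995912],
    "dropoff_latitude": [40.777649, 40.741341, 40.759757],
})
clean_rides = filter_plausible_rides(toy_rides)
print(clean_rides)
```

Only the first toy row survives: the second has a negative fare and the third has a latitude outside the assumed range.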


## Feature engineering
### Dealing with datetime objects
3. Convert the `pickup_datetime` column into datetime format. Use pandas' [`dt` accessor](https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.html) to create the following columns:
* Year
* Month
* Day
* Weekday: contains the **name** of the day of week

Then, you can drop the column `pickup_datetime`.

In [93]:
dataset["pickup_datetime"] = pd.to_datetime(dataset["pickup_datetime"])
dataset["Year"] = dataset["pickup_datetime"].dt.year
dataset["Month"] = dataset["pickup_datetime"].dt.month
dataset["Day"] = dataset["pickup_datetime"].dt.day

weekdays_dict = {
    0: 'Monday',
    1: 'Tuesday',
    2: 'Wednesday',
    3: 'Thursday',
    4: 'Friday',
    5: 'Saturday',
    6: 'Sunday'
}
dataset["Weekday"] = dataset["pickup_datetime"].dt.weekday.map(weekdays_dict)

dataset.drop("pickup_datetime", axis=1, inplace=True)

In [94]:
dataset.head()

,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,Year,Month,Day,Weekday
0,13.0,-73.971664,40.797035,-73.958939,40.777649,1,2015,5,7,Thursday
1,5.5,-73.991635,40.749855,-73.98825,40.741341,2,2014,7,9,Wednesday
2,8.5,-73.982352,40.777042,-73.995912,40.759757,1,2013,11,11,Monday
3,19.0,-73.991455,40.7517,-73.936357,40.812327,1,2014,5,22,Thursday
4,7.7,-73.974749,40.756255,-73.952276,40.778332,1,2011,6,21,Tuesday
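As an alternative to mapping weekday numbers through a dictionary, the `dt` accessor also provides `day_name()`, which returns the weekday name directly. The short sketch below uses two timestamps taken from the table above; the variable names are illustrative only.

```python
import pandas as pd

# Two pickup timestamps copied from the first rows of the dataset
pickups = pd.Series(
    pd.to_datetime(["2015-05-07 10:24:44 UTC", "2014-07-09 09:14:04 UTC"])
)

# day_name() gives the weekday name, with no manual mapping needed
weekday_names = pickups.dt.day_name()
print(weekday_names.tolist())
```

The names match the `Weekday` values shown for the same rows in the table above.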


### Haversine formula

It would be very useful to compute the ride distance from the GPS coordinates. The [haversine formula](https://en.wikipedia.org/wiki/Haversine_formula) makes this possible 🤓:

$$
d = 2r \arcsin \big(\sqrt{\sin^2(\frac{\phi_2 - \phi_1}{2}) + \cos \phi_1 \cos \phi_2 \sin^2(\frac{\lambda_2 - \lambda_1}{2})} \big)
$$

where:
* $d$ is the ride distance in kilometers
* $r$ is the Earth's radius in kilometers
* $\phi_1$ is the pickup latitude in radians
* $\phi_2$ is the dropoff latitude in radians
* $\lambda_1$ is the pickup longitude in radians
* $\lambda_2$ is the dropoff longitude in radians
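
As a quick sanity check of the formula (a worked special case, not part of the original exercise), take two points on the equator, so $\phi_1 = \phi_2 = 0$. The latitude term vanishes and $\cos \phi_1 \cos \phi_2 = 1$, which gives, for $|\lambda_2 - \lambda_1| \le \pi$:

$$
d = 2r \arcsin \big(\big|\sin(\frac{\lambda_2 - \lambda_1}{2})\big|\big) = r \, |\lambda_2 - \lambda_1|
$$

With $r = 6371$ km and a longitude difference of one degree ($\pi / 180$ radians), this gives $d \approx 111.2$ km.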

We've implemented a function for you that computes this formula for one ride with coordinates `lon_1`, `lon_2`, `lat_1` and `lat_2` (in degrees):

In [95]:
def haversine(lon_1, lon_2, lat_1, lat_2):
    # Convert degrees to radians
    lon_1, lon_2, lat_1, lat_2 = map(np.radians, [lon_1, lon_2, lat_1, lat_2])

    diff_lon = lon_2 - lon_1
    diff_lat = lat_2 - lat_1

    # Haversine formula, with an Earth radius of 6371 km
    distance_km = 2 * 6371 * np.arcsin(np.sqrt(
        np.sin(diff_lat / 2.0) ** 2
        + np.cos(lat_1) * np.cos(lat_2) * np.sin(diff_lon / 2.0) ** 2
    ))

    return distance_km

4. Apply the `haversine` function to the whole dataset to create a new column `ride_distance`. [This stackoverflow post](https://stackoverflow.com/questions/13331698/how-to-apply-a-function-to-two-columns-of-pandas-dataframe?answertab=trending#tab-top) might help you!

In [96]:
dataset["ride_distance"] = dataset.apply(lambda row : haversine(row["pickup_longitude"], row["dropoff_longitude"], row["pickup_latitude"], row["dropoff_latitude"]), axis=1)

In [97]:
dataset.head()

,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,Year,Month,Day,Weekday,ride_distance
0,13.0,-73.971664,40.797035,-73.958939,40.777649,1,2015,5,7,Thursday,2.407225
1,5.5,-73.991635,40.749855,-73.98825,40.741341,2,2014,7,9,Wednesday,0.988729
2,8.5,-73.982352,40.777042,-73.995912,40.759757,1,2013,11,11,Monday,2.235651
3,19.0,-73.991455,40.7517,-73.936357,40.812327,1,2014,5,22,Thursday,8.183379
4,7.7,-73.974749,40.756255,-73.952276,40.778332,1,2011,6,21,Tuesday,3.099698
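Because `haversine` only uses NumPy functions, it can also be called on whole columns at once instead of row by row with `apply`, which is usually much faster on large datasets. The sketch below repeats the function so that it runs on its own, and uses the five rows shown above as a toy DataFrame named `rides` (an illustrative name).

```python
import numpy as np
import pandas as pd


def haversine(lon_1, lon_2, lat_1, lat_2):
    # Same formula as in the notebook: inputs in degrees, output in km
    lon_1, lon_2, lat_1, lat_2 = map(np.radians, [lon_1, lon_2, lat_1, lat_2])
    diff_lon = lon_2 - lon_1
    diff_lat = lat_2 - lat_1
    return 2 * 6371 * np.arcsin(np.sqrt(
        np.sin(diff_lat / 2.0) ** 2
        + np.cos(lat_1) * np.cos(lat_2) * np.sin(diff_lon / 2.0) ** 2
    ))


rides = pd.DataFrame({
    "pickup_longitude": [-73.971664, -73.991635, -73.982352, -73.991455, -73.974749],
    "pickup_latitude": [40.797035, 40.749855, 40.777042, 40.7517, 40.756255],
    "dropoff_longitude": [-73.958939, -73.98825, -73.995912, -73.936357, -73.952276],
    "dropoff_latitude": [40.777649, 40.741341, 40.759757, 40.812327, 40.778332],
})

# Vectorized call: each argument is a whole column (pandas Series)
rides["ride_distance"] = haversine(
    rides["pickup_longitude"],
    rides["dropoff_longitude"],
    rides["pickup_latitude"],
    rides["dropoff_latitude"],
)
print(rides["ride_distance"].round(3).tolist())
```

The resulting distances should agree with the `ride_distance` column displayed above, up to the rounding of the printed coordinates.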


## Preprocessing
5. Separate the target from the features

In [98]:
target_variable = "fare_amount"

X = dataset.drop(target_variable, axis = 1)
y = dataset[target_variable]

6. Detect names of numeric/categorical features

In [99]:
#numerical_features = [i for i in X.columns if X[i].dtype in ["int32", "float32", "int64", "float64"]]
#categorical_features = [i for i in X.columns if X[i].dtype in ["object", "str", "category"]]

# 'Year', 'Month' and 'Day' are integers, but they are discrete, so we treat them as categorical

numerical_features = ['pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude', 'passenger_count', 'ride_distance']
categorical_features = ['Year', 'Month', 'Day', 'Weekday']


print(f"Numerical features : {numerical_features}")
print(f"Categorical features : {categorical_features}")

Numerical features : ['pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude', 'passenger_count', 'ride_distance']
Categorical features : ['Year', 'Month', 'Day', 'Weekday']


7. Make a train/test split with `test_size = 0.2`

In [100]:
X_train_unproc, X_test_unproc, y_train, y_test = train_test_split(X, y, test_size=0.2)
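
The split above is random, so the scores reported below will vary slightly from one run to the next. Passing a `random_state` to `train_test_split` makes the split reproducible. The sketch below illustrates this on a small toy array; the value `0` is an arbitrary choice, not one used in the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two calls with the same random_state give exactly the same split
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```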

8. Apply all the necessary preprocessing steps.

Hint: in this exercise, we'll first create a baseline model with a multivariate **linear regression**. So don't forget to make all the transformations that are required for this kind of model 😉

In [101]:
scaler = StandardScaler()
encoder = OneHotEncoder(drop="first")

preprocessor = ColumnTransformer(
    transformers=[
        ('num', scaler, numerical_features),
        ('cat', encoder, categorical_features)
    ]
)

X_train = preprocessor.fit_transform(X_train_unproc)
X_test = preprocessor.transform(X_test_unproc)
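
An alternative to transforming `X_train` and `X_test` by hand is to wrap the preprocessor and the model in a scikit-learn `Pipeline`, so that the scaler and encoder are always fitted on the same data as the model. The sketch below shows the idea on a tiny made-up DataFrame; the step names and the toy values are assumptions for illustration, and the notebook itself does not use a pipeline.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up toy data with one numeric and one categorical feature
toy = pd.DataFrame({
    "ride_distance": [1.0, 2.5, 8.2, 3.1, 0.9, 4.4],
    "Weekday": ["Monday", "Friday", "Monday", "Friday", "Monday", "Friday"],
    "fare_amount": [5.5, 8.5, 19.0, 7.7, 4.5, 12.0],
})
features = ["ride_distance", "Weekday"]

model = Pipeline(steps=[
    ("preprocessing", ColumnTransformer(transformers=[
        ("num", StandardScaler(), ["ride_distance"]),
        ("cat", OneHotEncoder(drop="first"), ["Weekday"]),
    ])),
    ("regressor", LinearRegression()),
])

# fit() runs the preprocessing and then trains the regressor
model.fit(toy[features], toy["fare_amount"])
predictions = model.predict(toy[features])
print(predictions.shape)
```

A pipeline like this can also be passed directly to `GridSearchCV`, in which case the preprocessing is re-fitted inside each cross-validation fold.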

## Baseline: Linear Regression
9. Train a linear regression model and evaluate its performance. Is it satisfactory?

In [102]:
lr = LinearRegression()
lr.fit(X_train, y_train)

In [103]:
print(f"R2 score on train set : {lr.score(X_train, y_train)}")
print(f"R2 score on test set : {lr.score(X_test, y_test)}")

R2 score on train set : 0.027038504512852612
R2 score on test set : 0.025129365038334184


The R2 scores on both sets are very poor (below 0.03): the linear model explains almost none of the variance in the fares.
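
To put such a score in perspective, recall that R2 compares the model's squared errors with those of a constant model that always predicts the mean of the target. An R2 close to 0 therefore means the model is barely better than predicting the average fare. The sketch below checks this on a few made-up fare values (the numbers are illustrative only).

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up fares and a "model" that always predicts their mean
y_true = np.array([5.5, 8.5, 13.0, 19.0, 7.7])
mean_prediction = np.full_like(y_true, y_true.mean())

# R2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_true - mean_prediction) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, mean_prediction))
```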

## Random Forest
10. Train a Random Forest model with default hyperparameters. Is the performance better?

In [104]:
rfr = RandomForestRegressor()
rfr.fit(X_train, y_train)

In [105]:
print(f"R2 score on train set : {rfr.score(X_train, y_train)}")
print(f"R2 score on test set : {rfr.score(X_test, y_test)}")

R2 score on train set : 0.9682589530019909
R2 score on test set : 0.7764562206798922


The model is far better, but it is now overfitting: R2 is about 0.97 on the train set but only about 0.78 on the test set.

### Grid search
11. Use grid search to tune the model's hyperparameters. You can try the following values:

```
params = {
    'max_depth': [10, 12, 14],
    'min_samples_split': [4, 8],
    'n_estimators': [60, 80, 100]
}
```



In [106]:
params = {
    'max_depth': [10, 12, 14],
    'min_samples_split': [4, 8],
    'n_estimators': [60, 80, 100]
}

gridsearch = GridSearchCV(estimator=rfr, param_grid=params, cv=3)
gridsearch.fit(X_train, y_train)
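
Once a grid search has been fitted, the selected hyperparameters and the mean cross-validated score can be read from `best_params_` and `best_score_`, and the refitted model is available as `best_estimator_`. The sketch below runs a deliberately small search on synthetic data so that it executes quickly; the toy grid and the data from `make_regression` are assumptions and do not reproduce the notebook's results.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, used only to keep the example fast
X_toy, y_toy = make_regression(n_samples=120, n_features=4, noise=5.0, random_state=0)

toy_grid = {"max_depth": [2, 4], "n_estimators": [10, 20]}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid=toy_grid, cv=3)
search.fit(X_toy, y_toy)

print("Best hyperparameters:", search.best_params_)
print("Mean cross-validated R2 of the best configuration:", search.best_score_)
```

With the default `refit=True`, calls such as `search.score` and `search.predict` use the best estimator refitted on the whole training set.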

### Performances
12. Display the R2 score and the [mean absolute error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html?highlight=mean%20absolute%20error#sklearn.metrics.mean_absolute_error) on the train and test sets. What do you think of this model? Would it be worthwhile to use it to estimate fares on new data?

In [107]:
print(f"R2 score on train set : {gridsearch.score(X_train, y_train)}")
print(f"Mean Absolute Error train set : {mean_absolute_error(y_train, gridsearch.predict(X_train))}")
print()
print(f"R2 score on test set : {gridsearch.score(X_test, y_test)}")
print(f"Mean Absolute Error test set : {mean_absolute_error(y_test, gridsearch.predict(X_test))}")

R2 score on train set : 0.9300667530138934
Mean Absolute Error train set : 1.5027243160926302

R2 score on test set : 0.7820692454605659
Mean Absolute Error test set : 2.278251896885369
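
The mean absolute error is expressed in the same unit as the fare, which makes it easier to interpret than R2: a test MAE of about 2.28 can be compared with the median fare of 8.5 shown in the statistics above. As a reminder of what the metric computes, here is a small sketch on made-up values (illustrative numbers only).

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up true fares and predictions
y_true = np.array([10.0, 8.0, 12.0])
y_pred = np.array([9.0, 10.0, 12.0])

# MAE is the average of the absolute differences: (1 + 2 + 0) / 3
mae_manual = np.mean(np.abs(y_true - y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(mae_manual, mae_sklearn)
```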


## Feature importance
13. Make a bar plot with the importances of each feature. Are you surprised?

In [108]:
column_names = []
for name, step, features_list in preprocessor.transformers_: # loop over steps of ColumnTransformer
    if name == 'num': # if pipeline is for numeric variables
        features = features_list # just get the names of columns to which it has been applied
    else: # if pipeline is for categorical variables
        features = step.get_feature_names_out() # get output columns names from OneHotEncoder
    column_names.extend(features) # concatenate features names
        
print("Names of columns corresponding to each coefficient: ", column_names)


Names of columns corresponding to each coefficient:  ['pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude', 'passenger_count', 'ride_distance', 'Year_2010', 'Year_2011', 'Year_2012', 'Year_2013', 'Year_2014', 'Year_2015', 'Month_2', 'Month_3', 'Month_4', 'Month_5', 'Month_6', 'Month_7', 'Month_8', 'Month_9', 'Month_10', 'Month_11', 'Month_12', 'Day_2', 'Day_3', 'Day_4', 'Day_5', 'Day_6', 'Day_7', 'Day_8', 'Day_9', 'Day_10', 'Day_11', 'Day_12', 'Day_13', 'Day_14', 'Day_15', 'Day_16', 'Day_17', 'Day_18', 'Day_19', 'Day_20', 'Day_21', 'Day_22', 'Day_23', 'Day_24', 'Day_25', 'Day_26', 'Day_27', 'Day_28', 'Day_29', 'Day_30', 'Day_31', 'Weekday_Monday', 'Weekday_Saturday', 'Weekday_Sunday', 'Weekday_Thursday', 'Weekday_Tuesday', 'Weekday_Wednesday']


In [109]:
feature_importance = pd.DataFrame(index = column_names, data = gridsearch.best_estimator_.feature_importances_, columns=["feature_importances"])
feature_importance = feature_importance.sort_values(by = 'feature_importances')

In [110]:
fig = px.bar(feature_importance)
fig.show()

14. Would the model be able to make good predictions if we hadn't included the ride distance by hand? Train a new Random Forest model (with grid search) by dropping the `ride_distance` column from the features, and conclude.

In [111]:
target_variable = "fare_amount"

X = dataset.drop([target_variable, "ride_distance"], axis = 1)
y = dataset[target_variable]

numerical_features = ['pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude', 'passenger_count']
categorical_features = ['Year', 'Month', 'Day', 'Weekday']

X_train_unproc, X_test_unproc, y_train, y_test = train_test_split(X, y, test_size=0.2)

scaler = StandardScaler()
encoder = OneHotEncoder(drop="first")

preprocessor = ColumnTransformer(
    transformers=[
        ('num', scaler, numerical_features),
        ('cat', encoder, categorical_features)
    ]
)

X_train = preprocessor.fit_transform(X_train_unproc)
X_test = preprocessor.transform(X_test_unproc)

params = {
    'max_depth': [10, 12, 14],
    'min_samples_split': [4, 8],
    'n_estimators': [60, 80, 100]
}

gridsearch = GridSearchCV(estimator=rfr, param_grid=params, cv=3)
gridsearch.fit(X_train, y_train)

In [112]:
print(f"R2 score on train set : {gridsearch.score(X_train, y_train)}")
print(f"Mean Absolute Error train set : {mean_absolute_error(y_train, gridsearch.predict(X_train))}")
print()
print(f"R2 score on test set : {gridsearch.score(X_test, y_test)}")
print(f"Mean Absolute Error test set : {mean_absolute_error(y_test, gridsearch.predict(X_test))}")

R2 score on train set : 0.8414339029432591
Mean Absolute Error train set : 2.5666997763020154

R2 score on test set : 0.7386447128963429
Mean Absolute Error test set : 2.9279744489821797


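Even without the `ride_distance` column, which was the most important feature according to the previous RandomForestRegressor, the model remains far better than the linear baseline (test R2 of about 0.74 versus about 0.03). It is somewhat worse than the model that includes `ride_distance` (test R2 of about 0.78, test MAE of about 2.28 versus 2.93), which suggests that the forest can recover part of the distance information from the raw coordinates.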