## Use Case Overview

The objective of this notebook is to predict the duration of NYC taxi trips. \
We use data related to yellow or green taxis to train a simple prediction model.

In this notebook, we use:

- Yellow taxi data from January 2021 for model training.
- Data from February 2021 to test the model (make predictions).

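**Features for Model Training**: Variables describing the trip itself: pickup location, dropoff location, and passenger count (`PULocationID`, `DOLocationID`, `passenger_count`)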

**Target to Predict**: The duration of the trip


### Data
Data for the whole course can be downloaded following this [link](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page) or using the following code:



In [None]:
# install gdown
!pip install gdown

In [None]:
import gdown
import os

DATA_FOLDER = "../../data"
train_path = f"{DATA_FOLDER}/yellow_tripdata_2021-01.parquet"
test_path = f"{DATA_FOLDER}/yellow_tripdata_2021-02.parquet"
predict_path = f"{DATA_FOLDER}/yellow_tripdata_2021-03.parquet"


if not os.path.exists(DATA_FOLDER):
    os.makedirs(DATA_FOLDER)
    print(f"New directory {DATA_FOLDER} created!")

gdown.download(
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet",
    train_path,
    quiet=False,
)
gdown.download(
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-02.parquet",
    test_path,
    quiet=False,
)
gdown.download(
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-03.parquet",
    predict_path,
    quiet=False,
)

In [None]:
import os

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error
import warnings

warnings.filterwarnings("ignore")

**Info**: <p style=color:green>This notebook works with large amounts of data, so some cells can take a while to run.</p>

# 1 - Load data

In [None]:
DATA_FOLDER = "../../data"

train_df = pd.read_parquet(os.path.join(DATA_FOLDER, "yellow_tripdata_2021-01.parquet"))

In [None]:
train_df.head()

In [None]:
train_df.info()

In [None]:
train_df.isnull().sum()

# 2 - Prepare the data

## 2-1 Compute the target

The dataset contains pickup and dropoff time but not the duration itself. \
We compute the duration of a taxi trip in minutes using these two variables.

In [None]:
def compute_target(df):
    df["duration"] = df["tpep_dropoff_datetime"] - df["tpep_pickup_datetime"]
    df["duration"] = df["duration"].dt.total_seconds() / 60
    return df


train_df = compute_target(train_df)
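
To make the unit conversion concrete, here is a minimal sketch on a hypothetical two-row frame (the timestamps are invented for illustration; only the column names come from the dataset). It also shows how a dropoff recorded before the pickup produces a negative duration.

```python
import pandas as pd

# Hypothetical toy data reusing the TLC column names.
toy = pd.DataFrame(
    {
        "tpep_pickup_datetime": pd.to_datetime(
            ["2021-01-01 00:30:00", "2021-01-01 10:00:00"]
        ),
        "tpep_dropoff_datetime": pd.to_datetime(
            ["2021-01-01 00:45:30", "2021-01-01 09:58:00"]
        ),
    }
)

# Subtracting two datetime columns yields a timedelta column.
delta = toy["tpep_dropoff_datetime"] - toy["tpep_pickup_datetime"]

# total_seconds() / 60 converts each timedelta to fractional minutes.
toy["duration"] = delta.dt.total_seconds() / 60

print(toy["duration"].tolist())  # [15.5, -2.0]
```

The second row is the kind of record that the outlier filtering later in this section removes.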

We can visualise how the duration is distributed:

In [None]:
train_df["duration"].describe()

In [None]:
fig = plt.figure(figsize=(12, 6))
ax = fig.add_subplot(111)
# sns.distplot is deprecated in recent seaborn versions; histplot replaces it
sns.histplot(train_df["duration"], bins=100, ax=ax);

We notice that there are negative durations and trips that last 6 hours. \
We will proceed to remove outliers and narrow the scope to trips lasting between 1 minute and 1 hour.

In [None]:
MIN_DURATION = 1
MAX_DURATION = 60


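def filter_outliers(df, min_duration=MIN_DURATION, max_duration=MAX_DURATION):
    # between() is inclusive on both ends; copy() avoids SettingWithCopyWarning
    # when columns are assigned on the filtered frame later on.
    df = df[df["duration"].between(min_duration, max_duration)].copy()
    return df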


train_df = filter_outliers(train_df)
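
As a quick illustration of the filtering rule, the sketch below applies the same `between` call to a handful of made-up durations. The values are hypothetical; the point is that both bounds are inclusive, so trips of exactly 1 and exactly 60 minutes are kept.

```python
import pandas as pd

# Hypothetical durations in minutes, including the kinds of outliers seen above.
durations = pd.Series([-2.0, 0.5, 1.0, 30.0, 60.0, 360.0])

# Series.between includes both endpoints by default.
mask = durations.between(1, 60)
kept = durations[mask]

print(kept.tolist())  # [1.0, 30.0, 60.0]
print(f"{1 - mask.mean():.0%} of rows dropped")  # 50% of rows dropped
```

Printing the dropped share on the real data is a cheap sanity check that the thresholds do not discard more trips than intended.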

In [None]:
fig = plt.figure(figsize=(12, 6))
ax = fig.add_subplot(111)
# sns.distplot is deprecated in recent seaborn versions; histplot replaces it
sns.histplot(train_df["duration"], bins=100, ax=ax);

## 2-2 Prepare features

### 2-2-1 Categorical features

We encode the discrete variables as strings, then extract the feature matrix and the target. These are used to train two models, a linear regression and a random forest.

In [None]:
CATEGORICAL_COLS = ["PULocationID", "DOLocationID", "passenger_count"]


def encode_categorical_cols(df):
    # Cast to int first so float IDs such as 142.0 become "142", then to str:
    # DictVectorizer one-hot encodes string values but keeps numbers as-is.
    df[CATEGORICAL_COLS] = df[CATEGORICAL_COLS].fillna(-1).astype("int").astype("str")
    return df


train_df = encode_categorical_cols(train_df)

In [None]:
def extract_x_y(df, dv=None):
    dicts = df[CATEGORICAL_COLS].to_dict(orient="records")
    if dv is None:
        dv = DictVectorizer()
        dv.fit(dicts)
    X = dv.transform(dicts)
    y = df["duration"].values
    return X, y, dv


X_train, y_train, dv = extract_x_y(train_df)

# 3 - Train model

In [None]:
lr = LinearRegression()
lr.fit(X_train, y_train)

In [None]:
# Training takes about 2.5 minutes
rf = RandomForestRegressor(n_estimators=100, max_depth=11, max_features="sqrt", random_state=1, n_jobs=-1)
rf.fit(X_train, y_train)

# 4 - Evaluate model

In [None]:
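def compute_metrics(y, y_pred):
    metrics = {
        # RMSE as the square root of the MSE; the `squared=False` argument
        # was deprecated in scikit-learn 1.4 and removed in later releases.
        "rmse": mean_squared_error(y, y_pred) ** 0.5,
        "mape": mean_absolute_percentage_error(y, y_pred),
    }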
    return metrics
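
To clarify what the two metrics measure, here is a pure-Python sketch on three made-up trips. It is an illustration of the formulas, not a replacement for the scikit-learn functions, and the numbers are hypothetical.

```python
import math

# Hypothetical true and predicted durations in minutes.
y_true = [10.0, 20.0, 40.0]
y_pred = [12.0, 18.0, 30.0]

errors = [p - t for t, p in zip(y_true, y_pred)]  # [2.0, -2.0, -10.0]

# RMSE: square root of the mean squared error, in the target's unit (minutes).
rmse = math.sqrt(sum(e**2 for e in errors) / len(errors))

# MAPE: mean of |error| / |true value|, a fraction rather than a percentage.
mape = sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / len(errors)

print(rmse)  # 6.0
print(round(mape, 4))  # 0.1833
```

RMSE is dominated by the single 10-minute miss, while MAPE weighs each error relative to the trip length. MAPE becomes unstable when true values approach zero, which the 1-minute lower bound on durations helps to avoid.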

## 4-1 On train data

In [None]:
y_pred_lr = lr.predict(X_train)
y_pred_rf = rf.predict(X_train)
compute_metrics(y_train, y_pred_lr), compute_metrics(y_train, y_pred_rf)

## 4-2 On test data

In [None]:
test_df = pd.read_parquet(os.path.join(DATA_FOLDER, "yellow_tripdata_2021-02.parquet"))

In [None]:
test_df = compute_target(test_df)
test_df = filter_outliers(test_df)
test_df = encode_categorical_cols(test_df)
X_test, y_test, _ = extract_x_y(test_df, dv=dv)
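
Reusing the vectorizer fitted on the training data keeps the test matrix aligned with the training columns. One consequence worth knowing is sketched below with hypothetical IDs: a category that was never seen during `fit` is silently ignored at `transform` time, so that row gets all zeros for the feature.

```python
from sklearn.feature_extraction import DictVectorizer

# Fit on two known pickup zones (hypothetical values).
dv_demo = DictVectorizer()
dv_demo.fit([{"PULocationID": "142"}, {"PULocationID": "236"}])

# "999" was not present during fit, so it maps to no column.
X_new = dv_demo.transform([{"PULocationID": "142"}, {"PULocationID": "999"}])

print(X_new.toarray())
# [[1. 0.]
#  [0. 0.]]
```

For such rows the models fall back on the remaining features and, for the linear regression, the intercept.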

In [None]:
y_pred_test_lr = lr.predict(X_test)
y_pred_test_rf = rf.predict(X_test)
compute_metrics(y_test, y_pred_test_lr), compute_metrics(y_test, y_pred_test_rf)