# DataTalks - MLOps ZoomCamp - Week 1 - HW 1

The goal of the homework is to train a simple model for predicting the duration of a ride.

Link to homework: [https://github.com/DataTalksClub/mlops-zoomcamp/blob/main/01-intro/homework.md](https://github.com/DataTalksClub/mlops-zoomcamp/blob/main/01-intro/homework.md)

## 1. Initializing modules

In [1]:
# Importing the set of Python packages needed for the analysis

import pandas as pd
import logging
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from typing import Optional
import pickle
import seaborn as sns

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

## 2. Downloading the data

We can now download the data from the NYC website: [https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page)

In [2]:
# --- Defining the URL of the data
# NOTE: The current data uses AWS cloudfront for managing the dataset.


def download_dataset(year: str, month: str) -> pd.DataFrame:
    """
    Function to download the dataset from the
    ``NYC Taxi & Limousine Commission`` website.

    Parameters
    ------------
    year : str
        Year, for which to download the dataset.

    month : str
        Month, for which to download the dataset.
        The variable must contain 2 digits, e.g.
        'February' would be represented as '02'.

    Returns
    ----------
    nyc_dataset : pandas.DataFrame
        Dataset of the NYC Taxi & Limousine trip data.
    """
    # Checking input data
    # `year` - Type check
    year_type_arr = (str, int)
    if not isinstance(year, year_type_arr):
        msg = "`year` ({}) is not of the correct data type ({})"
        msg = msg.format(type(year), year_type_arr)
        raise TypeError(msg)
    # `month` - Type check
    month_type_arr = (str,)
    if not isinstance(month, month_type_arr):
        msg = "`month` ({}) is not of the correct data type ({})"
        msg = msg.format(type(month), month_type_arr)
        raise TypeError(msg)
    # `month` - Length check
    if len(month) != 2:
        msg = "`month` must have two digits and only has {}"
        msg = msg.format(len(month))
        raise ValueError(msg)
    #
    # Base URL
    dataset_url = f"https://d37ci6vzurychx.cloudfront.net/trip-data/fhv_tripdata_{year}-{month}.parquet"

    logger.info(f">> Reading '{dataset_url}'")

    return pd.read_parquet(dataset_url)


In [3]:
# Dataset for '2021' - 'January'
dataset_jan = download_dataset(year="2021", month="01")

# Dataset for '2021' - 'February'
dataset_feb = download_dataset(year="2021", month="02")


INFO:__main__:>> Reading 'https://d37ci6vzurychx.cloudfront.net/trip-data/fhv_tripdata_2021-01.parquet'
INFO:__main__:>> Reading 'https://d37ci6vzurychx.cloudfront.net/trip-data/fhv_tripdata_2021-02.parquet'
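
Because `download_dataset` expects `month` as a two-digit string, it can be handy to build the URL from integers instead. The snippet below is a minimal sketch of that idea; `build_fhv_url` is a hypothetical helper that is not part of the notebook, and it assumes the same CloudFront URL pattern used above.

```python
def build_fhv_url(year: int, month: int) -> str:
    """Build the parquet URL for one month of FHV trip data (hypothetical helper)."""
    if not 1 <= month <= 12:
        raise ValueError(f"`month` must be between 1 and 12, got {month}")
    # `:02d` zero-pads the month, so 2 becomes '02'
    return (
        "https://d37ci6vzurychx.cloudfront.net/trip-data/"
        f"fhv_tripdata_{year}-{month:02d}.parquet"
    )


url = build_fhv_url(2021, 2)
print(url)
```

Taking integers moves the zero-padding into one place, so callers cannot pass `"2"` by mistake.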


## 3. Data analysis

### 3.1 Describing the datasets

We can now preview the first rows of both datasets:

In [4]:
dataset_jan.head()

|   | dispatching_base_num | pickup_datetime | dropOff_datetime | PUlocationID | DOlocationID | SR_Flag | Affiliated_base_number |
|---|---|---|---|---|---|---|---|
| 0 | B00009 | 2021-01-01 00:27:00 | 2021-01-01 00:44:00 |  |  |  | B00009 |
| 1 | B00009 | 2021-01-01 00:50:00 | 2021-01-01 01:07:00 |  |  |  | B00009 |
| 2 | B00013 | 2021-01-01 00:01:00 | 2021-01-01 01:51:00 |  |  |  | B00013 |
| 3 | B00037 | 2021-01-01 00:13:09 | 2021-01-01 00:21:26 |  | 72.0 |  | B00037 |
| 4 | B00037 | 2021-01-01 00:38:31 | 2021-01-01 00:53:44 |  | 61.0 |  | B00037 |


In [5]:
dataset_feb.head()

|   | dispatching_base_num | pickup_datetime | dropOff_datetime | PUlocationID | DOlocationID | SR_Flag | Affiliated_base_number |
|---|---|---|---|---|---|---|---|
| 0 | B00013 | 2021-02-01 00:01:00 | 2021-02-01 01:33:00 |  |  |  | B00014 |
| 1 | B00021 | 2021-02-01 00:55:40 | 2021-02-01 01:06:20 | 173.0 | 82.0 |  | B00021 |
| 2 | B00021 | 2021-02-01 00:14:03 | 2021-02-01 00:28:37 | 173.0 | 56.0 |  | B00021 |
| 3 | B00021 | 2021-02-01 00:27:48 | 2021-02-01 00:35:45 | 82.0 | 129.0 |  | B00021 |
| 4 | B00037 | 2021-02-01 00:12:50 | 2021-02-01 00:26:38 |  | 225.0 |  | B00037 |


Similarly, we can now describe both datasets:

In [6]:
logger.info(f">> There are `{len(dataset_jan)}` records for the 'January' dataset")

logger.info(f">> There are `{len(dataset_feb)}` records for the 'February' dataset")

INFO:__main__:>> There are `1154112` records for the 'January' dataset
INFO:__main__:>> There are `1037692` records for the 'February' dataset


### 3.2 Computing the trip duration

We can now compute the `duration` variable in *minutes*.

In [7]:
def calculate_trip_duration(dataset: pd.DataFrame) -> pd.Series:
    """
    Function to calculate the trip duration for a set of pickup and dropoff times.

    Parameters
    ------------
    dataset : pandas.DataFrame
        Dataset containing the trip data for pickup and drop off times.

    Returns
    -----------
    trip_duration : pd.Series
        Series corresponding to the trip duration in ``minutes``.
    """
    # Defining column names
    pickup_colname = "pickup_datetime"
    dropoff_colname = "dropOff_datetime"

    # Calculating trip duration in minutes

    return (
        pd.to_datetime(dataset[dropoff_colname])
        - pd.to_datetime(dataset[pickup_colname])
    ).dt.total_seconds() / 60
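
As a quick sanity check of the calculation above, the same subtraction can be run on a couple of rows. This is a small sketch that reuses two pickup/drop-off pairs from the January preview; the names `trips` and `duration_minutes` are illustrative only.

```python
import pandas as pd

# Two trips taken from the January preview table
trips = pd.DataFrame(
    {
        "pickup_datetime": ["2021-01-01 00:27:00", "2021-01-01 00:13:09"],
        "dropOff_datetime": ["2021-01-01 00:44:00", "2021-01-01 00:21:26"],
    }
)

# Subtracting two datetime columns yields a timedelta column
delta = pd.to_datetime(trips["dropOff_datetime"]) - pd.to_datetime(
    trips["pickup_datetime"]
)

# Convert the timedelta to minutes (seconds / 60)
duration_minutes = delta.dt.total_seconds() / 60
print(duration_minutes.round(2).tolist())  # [17.0, 8.28]
```

The first trip lasts exactly 17 minutes, and the second lasts 8 minutes and 17 seconds, i.e. about 8.28 minutes.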


We can now run the function for each of the datasets:

In [8]:
# January
dataset_jan["duration"] = calculate_trip_duration(dataset_jan)

# February
dataset_feb["duration"] = calculate_trip_duration(dataset_feb)

In [9]:
logger.info(f">> The average trip duration in 'January' is '{dataset_jan['duration'].mean():0.2f}' minutes")

logger.info(f">> The average trip duration in 'February' is '{dataset_feb['duration'].mean():0.2f}' minutes")

INFO:__main__:>> The average trip duration in 'January' is '19.17' minutes
INFO:__main__:>> The average trip duration in 'February' is '20.71' minutes


## 4. Data preparation

We are only interested in the trips with durations between `1` and `60` minutes (inclusive).

In [10]:
# January
dataset_jan_filtered = dataset_jan.loc[
    dataset_jan["duration"].between(1, 60, inclusive="both")
]

logger.info(
    f"There are `{len(dataset_jan_filtered)}` number of trips between [1,60] minutes in 'January'. (# of records dropped: {len(dataset_jan) - len(dataset_jan_filtered)})"
)

# February

dataset_feb_filtered = dataset_feb.loc[
    dataset_feb["duration"].between(1, 60, inclusive="both")
]

logger.info(
    f"There are `{len(dataset_feb_filtered)}` number of trips between [1,60] minutes in 'February'. (# of records dropped: {len(dataset_feb) - len(dataset_feb_filtered)})"
)


INFO:__main__:There are `1109826` number of trips between [1,60] minutes in 'January'. (# of records dropped: 44286)
INFO:__main__:There are `990113` number of trips between [1,60] minutes in 'February'. (# of records dropped: 47579)
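
To make the effect of `inclusive="both"` concrete, here is a small sketch on made-up durations (the values are illustrative, not taken from the dataset). Trips of exactly 1 or 60 minutes are kept, while anything outside that range is dropped.

```python
import pandas as pd

# Hypothetical trip durations in minutes
durations = pd.Series([0.5, 1.0, 17.0, 60.0, 60.5])

# `inclusive="both"` keeps the boundary values 1 and 60
mask = durations.between(1, 60, inclusive="both")
kept = durations[mask]
dropped = len(durations) - len(kept)

print(kept.tolist())  # [1.0, 17.0, 60.0]
print(dropped)  # 2
```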


## 5. Missing values

In order to create the necessary features for our model, we first need to do some data cleaning:

1. Replace the missing values of the pickup and dropoff location IDs with `-1`.

In [11]:
# Defining column names
pickup_id_colname = "PUlocationID"
dropoff_id_colname = "DOlocationID"

# Number of records with missing values - January
missing_data_jan_pickup_id_frac = 100 * dataset_jan_filtered[pickup_id_colname].isna().sum()/len(dataset_jan_filtered[pickup_id_colname])
missing_data_jan_dropoff_id_frac = 100 * dataset_jan_filtered[dropoff_id_colname].isna().sum()/len(dataset_jan_filtered[dropoff_id_colname])

logger.info(f">> There is a total of '{missing_data_jan_pickup_id_frac:.2f}%' of missing values - Pickup IDs - January")
logger.info(f">> There is a total of '{missing_data_jan_dropoff_id_frac:.2f}%' of missing values - DropOff IDs - January")

INFO:__main__:>> There is a total of '83.53%' of missing values - Pickup IDs - January
INFO:__main__:>> There is a total of '13.33%' of missing values - DropOff IDs - January


In [12]:
# Replacing missing values with '-1'
for colname in [pickup_id_colname, dropoff_id_colname]:
    dataset_jan_filtered.loc[:, colname] = dataset_jan_filtered[colname].fillna(-1).astype(int)
    dataset_feb_filtered.loc[:, colname] = dataset_feb_filtered[colname].fillna(-1).astype(int)
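
The two steps above (measuring the share of missing values, then filling them with `-1`) can be sketched on a tiny series. The location IDs below are made up for illustration.

```python
import pandas as pd

# Hypothetical location IDs with two missing entries
location_ids = pd.Series([173.0, float("nan"), 82.0, float("nan")])

# Percentage of missing values: the mean of a boolean mask is the fraction of True
missing_pct = 100 * location_ids.isna().mean()

# Replace missing values with the sentinel -1, then cast to integers
filled = location_ids.fillna(-1).astype(int)

print(missing_pct)  # 50.0
print(filled.tolist())  # [173, -1, 82, -1]
```

Using `-1` as the fill value works here because it is not a valid location ID, so the model can treat "unknown location" as its own category.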

## 6. One-hot encoding

We need to one-hot encode the categorical columns using a *dictionary vectorizer*.

In [13]:
# Columns to use for the categorical columns
categorical_colnames = [pickup_id_colname, dropoff_id_colname]

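# Turning the categorical variables into strings for the dictionary vectorizer
dataset_jan_filtered.loc[:, categorical_colnames] = dataset_jan_filtered[categorical_colnames].astype(str)
dataset_feb_filtered.loc[:, categorical_colnames] = dataset_feb_filtered[categorical_colnames].astype(str)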

# Creating train and predictor variables for each dataset
train_dict_jan = dataset_jan_filtered[categorical_colnames].to_dict(orient="records")
train_dict_feb = dataset_feb_filtered[categorical_colnames].to_dict(orient="records")

# Initializing vectorizer
dv = DictVectorizer()

# And computing the one-hot encoding
X_train = dv.fit_transform(train_dict_jan)

# And now the variable that we want to predict
y_train = dataset_jan_filtered["duration"].values

In [14]:
logger.info(f"The shape of the training features: '{X_train.shape}' - January")

INFO:__main__:The shape of the training features: '(1109826, 525)' - January


We now have the necessary ingredients to train a model.

## 7. Training a model

We can now train a linear regression model with the data from **January**. We can then
validate that model with the data from **February**.

In [15]:
# Initialize the model
lr = LinearRegression()

# Training a linear model
lr.fit(X_train, y_train)

We can now calculate the `RMSE` (root-mean-squared error) between the actual `duration` of each trip and
the predicted one:

In [16]:
# Predicted 'duration' using the Linear regression model
y_pred = lr.predict(X_train)

# Calculating the RMSE of the data
rmse = mean_squared_error(y_train, y_pred, squared=False)

logger.info(f"The RMSE between the 'real' and 'predicted' trip duration is: {rmse:.2f}")

INFO:__main__:The RMSE between the 'real' and 'predicted' trip duration is: 10.53
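
The RMSE is the square root of the mean of the squared differences between the actual and predicted values. The sketch below computes it by hand on made-up numbers and compares it with scikit-learn. It takes the square root of `mean_squared_error` explicitly instead of passing `squared=False`, on the assumption that this argument may not be accepted by newer scikit-learn releases (to my knowledge it was deprecated in version 1.4).

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted durations (minutes)
y_true = np.array([10.0, 20.0, 30.0])
y_hat = np.array([12.0, 18.0, 33.0])

# Errors are -2, 2, -3; squared errors are 4, 4, 9; their mean is 17/3
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# Same quantity via scikit-learn: MSE first, then the square root
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))

print(f"{rmse_manual:.2f}", f"{rmse_sklearn:.2f}")  # 2.38 2.38
```

Because the RMSE is expressed in the same unit as the target, the `10.53` above can be read as a typical error of roughly ten and a half minutes.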


## 8. Evaluating the model

The next step is to evaluate the model on a brand new dataset, i.e. a dataset that the model hasn't
been trained on. For example, we'll validate the model using the dataset from *February*.

To do this, we'll need to perform the same set of data preparation tasks that we've done for the dataset
from *January*.

In [17]:
def data_preparation(
    month: str,
    year: str,
    min_duration: int = 1,
    max_duration: int = 60,
) -> pd.DataFrame:
    """
    Function to prepare a dataset. This function will perform the
    data cleaning steps, as well as any data type casting, etc.

    Parameters
    ------------
    month : str
        Month, for which to download the dataset.
        The variable must contain 2 digits, e.g.
        'February' would be represented as '02'.

    year : str
        Year, for which to download the dataset.

    min_duration : int, optional
        Minimum number of minutes that a trip lasts.
        This variable is set to ``1`` by default.

    max_duration : int, optional
        Maximum number of minutes that a trip lasts.
        This variable is set to ``60`` by default.

    Returns
    -----------
    dataset_processed : pandas.DataFrame
        Dataset after having gone through the data cleaning and
        data processing steps.
    """
    # 1. Reading in the dataset
    raw_dataset = download_dataset(month=month, year=year)

    # 2. Computing trip duration
    duration_colname = "duration"
    raw_dataset[duration_colname] = calculate_trip_duration(
        dataset=raw_dataset
    )

    # 3. Keeping only the trips that lasted between `min_duration` and `max_duration` minutes (inclusive).
    dataset = raw_dataset.loc[
        raw_dataset[duration_colname].between(
            min_duration, max_duration, inclusive="both"
        )
    ]

    # 4. Handle missing values for the pickup ID and dropoff ID.
    pickup_id_colname = "PUlocationID"
    dropoff_id_colname = "DOlocationID"
    fill_value = -1

    categorical_cols = [pickup_id_colname, dropoff_id_colname]

    dataset.loc[:, categorical_cols] = (
        dataset[categorical_cols].fillna(fill_value).astype(int).astype(str)
    )

    return dataset.reset_index(drop=True)


In order to validate our results, we will read the dataset from `2021-02` and pass it
through the data processing pipeline using the function `data_preparation`:

In [18]:
val_dataset = data_preparation(month="02", year="2021", min_duration=1, max_duration=60)

INFO:__main__:>> Reading 'https://d37ci6vzurychx.cloudfront.net/trip-data/fhv_tripdata_2021-02.parquet'


In [19]:
val_dataset.head()

|   | dispatching_base_num | pickup_datetime | dropOff_datetime | PUlocationID | DOlocationID | SR_Flag | Affiliated_base_number | duration |
|---|---|---|---|---|---|---|---|---|
| 0 | B00021 | 2021-02-01 00:55:40 | 2021-02-01 01:06:20 | 173 | 82 |  | B00021 | 10.666667 |
| 1 | B00021 | 2021-02-01 00:14:03 | 2021-02-01 00:28:37 | 173 | 56 |  | B00021 | 14.566667 |
| 2 | B00021 | 2021-02-01 00:27:48 | 2021-02-01 00:35:45 | 82 | 129 |  | B00021 | 7.95 |
| 3 | B00037 | 2021-02-01 00:12:50 | 2021-02-01 00:26:38 | -1 | 225 |  | B00037 | 13.8 |
| 4 | B00037 | 2021-02-01 00:00:37 | 2021-02-01 00:09:35 | -1 | 61 |  | B00037 | 8.966667 |


In [20]:
logger.info(f"The average duration of the trip in the validation dataset is : {val_dataset['duration'].mean():.2f} minutes")

INFO:__main__:The average duration of the trip in the validation dataset is : 16.86 minutes


We will now validate the results using the existing linear regression model:

In [21]:
# Defining the columns to use for categorical variables
pickup_id_colname = "PUlocationID"
dropoff_id_colname = "DOlocationID"
categorical_cols = [pickup_id_colname, dropoff_id_colname]

# Vectorizing the data
val_dicts = val_dataset[categorical_cols].to_dict(orient="records")

# Creating the `X` and `y` variables for the validation dataset
X_val = dv.transform(val_dicts)
y_val = val_dataset["duration"].values

# Computing the predictions of the trip duration using the validation dataset
y_pred_val = lr.predict(X_val)

Finally, we can look at the `RMSE` for the validation dataset:

In [22]:
rmse_val = mean_squared_error(y_true=y_val, y_pred=y_pred_val, squared=False)
logger.info(f"The RMSE of the validation dataset is: {rmse_val:.2f}")

INFO:__main__:The RMSE of the validation dataset is: 12.85


## Final words

In the previous sections, we saw that the `RMSE` for the training dataset was `10.53`, while the
`RMSE` for the validation dataset was `12.85`. This shows that the model performs somewhat
worse on the validation dataset than on the training dataset.

However, this model is good enough for demonstration purposes.