# MLOps Zoomcamp - Homework #1

The goal of this homework is to train a simple model for predicting the duration of a ride - similar to what we did in this module.

## Q1. Downloading the data

We'll use the same NYC taxi dataset, but instead of "Green Taxi Trip Records", we'll use "For-Hire Vehicle Trip Records".

Download the data for January and February 2021.

Note that you need "For-Hire Vehicle Trip Records", not "High Volume For-Hire Vehicle Trip Records".

Read the data for January. How many records are there?

*Answer:*

In [1]:
%%time 

import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

Wall time: 1.34 s


In [2]:
!ls -l ../data

total 22004
-rw-r--r-- 1 user 197121 11886281 May 18 21:37 fhv_tripdata_2021-01.parquet
-rw-r--r-- 1 user 197121 10645466 May 18 21:37 fhv_tripdata_2021-02.parquet


In [3]:
%%time 

df = pd.read_parquet('../data/fhv_tripdata_2021-01.parquet')
df.head()

Wall time: 220 ms


| | dispatching_base_num | pickup_datetime | dropOff_datetime | PUlocationID | DOlocationID | SR_Flag | Affiliated_base_number |
|---|---|---|---|---|---|---|---|
| 0 | B00009 | 2021-01-01 00:27:00 | 2021-01-01 00:44:00 | | | | B00009 |
| 1 | B00009 | 2021-01-01 00:50:00 | 2021-01-01 01:07:00 | | | | B00009 |
| 2 | B00013 | 2021-01-01 00:01:00 | 2021-01-01 01:51:00 | | | | B00013 |
| 3 | B00037 | 2021-01-01 00:13:09 | 2021-01-01 00:21:26 | | 72.0 | | B00037 |
| 4 | B00037 | 2021-01-01 00:38:31 | 2021-01-01 00:53:44 | | 61.0 | | B00037 |


In [4]:
df[df['DOlocationID'] == 110.0]

| | dispatching_base_num | pickup_datetime | dropOff_datetime | PUlocationID | DOlocationID | SR_Flag | Affiliated_base_number |
|---|---|---|---|---|---|---|---|


In [None]:
len_df = len(df)
print("Number of records:", len_df)

## Q2. Computing duration

Now let's compute the duration variable. It should contain the duration of a ride in minutes.

What's the average trip duration in January?

*Answer:*

In [None]:
def create_duration_feature(dataset):
    """Create a new feature called duration"""
    dataset['duration'] = dataset.dropOff_datetime - dataset.pickup_datetime
    dataset.duration = dataset.duration.apply(lambda td: td.total_seconds() / 60)
    return dataset

In [None]:
df = create_duration_feature(df)
avg_duration = round(df.duration.mean(), 2)

print('Average trip duration in January was', avg_duration, 'minutes')
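The same conversion can be done without a row-by-row `apply`. The sketch below is only an illustration on a made-up two-row frame (the `trips` name is hypothetical; the column names and timestamps are copied from the `df.head()` output above). It relies on the pandas `.dt.total_seconds()` accessor for timedelta columns.

```python
import pandas as pd

# Toy frame reusing the FHV column names; two rows taken from df.head()
trips = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(
        ["2021-01-01 00:27:00", "2021-01-01 00:13:09"]
    ),
    "dropOff_datetime": pd.to_datetime(
        ["2021-01-01 00:44:00", "2021-01-01 00:21:26"]
    ),
})

# Subtracting two datetime columns gives a timedelta column;
# .dt.total_seconds() converts the whole column at once
trips["duration"] = (
    trips["dropOff_datetime"] - trips["pickup_datetime"]
).dt.total_seconds() / 60

print(trips["duration"].tolist())
```

On the full dataset this should give the same `duration` values as `create_duration_feature`, and it is usually faster because no Python-level function is called per row.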

## Data preparation

Check the distribution of the duration variable. There are some outliers.

Let's remove them and keep only the records where the duration was between 1 and 60 minutes (inclusive).

How many records did you drop?

*Answer:*

In [None]:
def filter_dataset(dataset):
    return dataset[(dataset.duration >= 1) & (dataset.duration <= 60)]

df = filter_dataset(df)

print("The number of dropped rows:", (len_df - len(df)))
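One way to check the distribution before filtering is to look at summary statistics with a few upper percentiles. The sketch below uses invented duration values (the `durations` series is hypothetical, not taken from the dataset) and `Series.between`, which is inclusive on both ends by default and so matches the `>= 1` and `<= 60` conditions above.

```python
import pandas as pd

# Made-up durations in minutes, including a few obvious outliers
durations = pd.Series(
    [0.5, 1.0, 8.3, 17.0, 59.9, 60.0, 110.0, 1440.0], name="duration"
)

# Upper percentiles make the long right tail visible
print(durations.describe(percentiles=[0.5, 0.95, 0.99]))

# between() includes both bounds, like (d >= 1) & (d <= 60)
mask = durations.between(1, 60)
kept = int(mask.sum())
dropped = len(durations) - kept

print(f"kept={kept}, dropped={dropped}")
```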

## Q3. Missing values

The features we'll use for our model are the pickup and dropoff location IDs.

However, these columns have a lot of missing values. Let's replace them with "-1".

What's the fraction of missing values for the pickup location ID? That is, the fraction of "-1"s after you filled the NAs.

*Answer:*

In [None]:
def replace_missing_values(dataset):
    """Replace missing values by -1"""
    dataset['PUlocationID'] = dataset['PUlocationID'].fillna(-1)
    dataset['DOlocationID'] = dataset['DOlocationID'].fillna(-1)
    return dataset

In [None]:
df = replace_missing_values(df)

missing_values_fraction = round(df['PUlocationID'].value_counts()[-1] / len(df), 2)

print('Fraction of missing values in the PUlocationID column is', missing_values_fraction)
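As a sanity check, the fraction of "-1"s after filling should equal the fraction of NaNs before filling. The sketch below shows this on a made-up five-value column (the `pu` series is hypothetical). Taking the mean of a boolean mask is a compact alternative to indexing into `value_counts()`.

```python
import numpy as np
import pandas as pd

# Toy pickup-location column with three missing values out of five
pu = pd.Series([np.nan, np.nan, 72.0, np.nan, 61.0], name="PUlocationID")

# Fraction of NaNs before filling
frac_na = pu.isna().mean()

# Fraction of -1 values after filling
filled = pu.fillna(-1)
frac_minus_one = (filled == -1).mean()

print(frac_na, frac_minus_one)
```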

## Q4. One-hot encoding

Let's apply one-hot encoding to the pickup and dropoff location IDs. We'll use only these two features for our model.

* Turn the dataframe into a list of dictionaries
* Fit a dictionary vectorizer
* Get a feature matrix from it

What's the dimensionality of this matrix? (The number of columns).

*Answer:*

In [None]:
def get_x_and_y(dataset):
    """Extract features (X) and target variable (y)"""
    # Convert the selected columns to categories
    categorical = ['PUlocationID', 'DOlocationID']
    dataset[categorical] = dataset[categorical].astype(str)

    # Create a list of dictionaries
    df_dicts = dataset[categorical].to_dict(orient='records')

    # Fit a dictionary vectorizer
    dv = DictVectorizer()
    X = dv.fit_transform(df_dicts)

    # Create a target vector for model training
    target = 'duration'
    y = dataset[target].values

    return X, y

In [None]:
X_train, y_train = get_x_and_y(df)

print(f'The matrix has {X_train.shape[1]} columns')
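To see what `DictVectorizer` does with string-valued features, here is a minimal sketch on three invented records (the `records` list is made up for illustration). Each distinct `column=value` pair becomes one column, so the number of columns equals the number of distinct pickup IDs plus the number of distinct dropoff IDs.

```python
from sklearn.feature_extraction import DictVectorizer

# Three toy records; IDs are strings, as after .astype(str)
records = [
    {"PUlocationID": "-1.0", "DOlocationID": "72.0"},
    {"PUlocationID": "-1.0", "DOlocationID": "61.0"},
    {"PUlocationID": "10.0", "DOlocationID": "72.0"},
]

dv = DictVectorizer()
X = dv.fit_transform(records)  # sparse one-hot matrix

# Two distinct pickup IDs + two distinct dropoff IDs -> four columns
print(dv.feature_names_)
print(X.shape)
```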

## Q5. Training a model

Now let's use the feature matrix from the previous step to train a model.

* Train a plain linear regression model with default parameters
* Calculate the RMSE of the model on the training data

What's the RMSE on train?

*Answer:*

In [None]:
def train_model(estimator, X, y):
    model = estimator()
    print(f'Training a {model.__class__.__name__} model using default hyperparameters...')
    model.fit(X, y)
    return model


def eval_model(model, X, y, data_info='train'):
    print(f'Evaluating the {model.__class__.__name__} model on the {data_info} dataset ...')
    
    # Predict the values for the feature matrix passed in
    y_pred = model.predict(X)
    
    # Evaluate the model against the targets passed in (not the global y_train)
    rmse = mean_squared_error(y, y_pred, squared=False)
    return round(rmse, 2)

In [None]:
lr_model = train_model(LinearRegression, X_train, y_train)

rmse = eval_model(lr_model, X_train, y_train, 'train')
print(f'RMSE on the training data: {rmse}')
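For reference, RMSE is the square root of the mean squared difference between targets and predictions. The sketch below computes it by hand on made-up numbers and compares the result with scikit-learn. It takes the square root of `mean_squared_error` explicitly instead of passing `squared=False`, on the assumption that this form works across scikit-learn versions (as far as I know, newer releases deprecate the `squared` argument).

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Invented targets and predictions, in minutes
y_true = np.array([10.0, 20.0, 30.0])
y_pred = np.array([12.0, 18.0, 33.0])

# RMSE by definition: sqrt(mean((y - y_hat)^2))
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Same quantity via scikit-learn's MSE
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(rmse_manual, rmse_sklearn)
```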

## Q6. Evaluating the model

### Prepare the validation dataset

In [None]:
%%time 

# Load the validation dataset
df = pd.read_parquet('../data/fhv_tripdata_2021-02.parquet')

# Feature engineering: create a new column
# df['duration'] = df.dropOff_datetime - df.pickup_datetime
# df.duration = df.duration.apply(lambda td: td.total_seconds() / 60)
df = create_duration_feature(df)

# Filter out unused values
# df = df[(df.duration >= 1) & (df.duration <= 60)]
df = filter_dataset(df)

# Replace all NaN values by -1
# df['PUlocationID'] = df['PUlocationID'].fillna(-1)
# df['DOlocationID'] = df['DOlocationID'].fillna(-1)
df = replace_missing_values(df)

# Convert the selected columns to categories
# categorical = ['PUlocationID', 'DOlocationID']
# df[categorical] = df[categorical].astype(str)

# Create a list of dictionaries
# val_dicts = df[categorical].to_dict(orient='records')

# Fit a dictionary vectorizer
# dv = DictVectorizer()
# X_val = dv.fit_transform(val_dicts)

# Create a target vector for the model training
# target = 'duration'
# y_val = df[target].values

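# Caution: get_x_and_y fits a new DictVectorizer on this dataframe, so the
# columns of X_val are not guaranteed to line up with those of X_train
X_val, y_val = get_x_and_y(df)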

print(f'The validation matrix has {X_val.shape[1]} columns')

In [None]:
# df['PUlocationID'].value_counts()
df['PUlocationID'].isnull().values.any(), df['DOlocationID'].isnull().values.any()

In [None]:
df[df['DOlocationID'] == '110.0']

In [None]:
X_val.shape

In [None]:
%%time 

# Predict the values using the validation data
y_pred = lr_model.predict(X_val)

# Evaluate the model
rmse = mean_squared_error(y_val, y_pred, squared=False)
rmse = round(rmse, 2)
print(f'RMSE on the validation data: {rmse}')

In [None]:
# Reload the January data and create a list of dictionaries
df2 = pd.read_parquet('../data/fhv_tripdata_2021-01.parquet')

df2[['PUlocationID', 'DOlocationID']] = df2[['PUlocationID', 'DOlocationID']].astype(str)
df_dicts2 = df2[['PUlocationID', 'DOlocationID']].to_dict(orient='records')

# Fit a dictionary vectorizer
dv2 = DictVectorizer()
X2 = dv2.fit_transform(df_dicts2)

In [None]:
df3 = pd.read_parquet('../data/fhv_tripdata_2021-02.parquet')
df3[['PUlocationID', 'DOlocationID']] = df3[['PUlocationID', 'DOlocationID']].astype(str)
df_dicts3 = df3[['PUlocationID', 'DOlocationID']].to_dict(orient='records')

# Fit a dictionary vectorizer
dv3 = DictVectorizer()
X3 = dv3.fit_transform(df_dicts3)

In [None]:
# Symmetric difference: feature names present in only one of the two months
set(dv2.feature_names_) ^ set(dv3.feature_names_)

In [None]:
# The symmetric difference is order-independent, so swapping the operands gives the same set
set(dv3.feature_names_) ^ set(dv2.feature_names_)

{'DOlocationID=110.0'}
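The output above suggests that the two months do not produce the same set of one-hot columns, so a vectorizer fitted separately on the validation data yields a matrix whose columns do not match the ones the model was trained on. A common way to avoid this is to fit the `DictVectorizer` once on the training records and only call `transform` on the validation records. The sketch below uses invented records (`train_dicts` and `val_dicts` are hypothetical) and is not the notebook's actual pipeline.

```python
from sklearn.feature_extraction import DictVectorizer

# Toy training records: dropoff IDs 72.0 and 61.0 only
train_dicts = [
    {"PUlocationID": "-1.0", "DOlocationID": "72.0"},
    {"PUlocationID": "-1.0", "DOlocationID": "61.0"},
]
# Toy validation records: one extra dropoff ID (110.0) unseen in training
val_dicts = [
    {"PUlocationID": "-1.0", "DOlocationID": "72.0"},
    {"PUlocationID": "-1.0", "DOlocationID": "61.0"},
    {"PUlocationID": "-1.0", "DOlocationID": "110.0"},
]

dv = DictVectorizer()
X_train = dv.fit_transform(train_dicts)  # learns the column layout
X_val = dv.transform(val_dicts)          # reuses it; unseen values are ignored

# Refitting on the validation data produces a different layout
X_val_refit = DictVectorizer().fit_transform(val_dicts)

print(X_train.shape, X_val.shape, X_val_refit.shape)
```

With `transform`, the unseen `DOlocationID=110.0` simply gets no column, and `X_val` keeps the same width and column order as `X_train`, so a model trained on `X_train` can score it directly.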