# MLOps Zoomcamp - Homework #1

The goal of this homework is to train a simple model for predicting the duration of a ride - similar to what we did in this module.

## Q1. Downloading the data

We'll use the same NYC taxi dataset, but instead of "Green Taxi Trip Records", we'll use "For-Hire Vehicle Trip Records".

Download the data for January and February 2021.

Note that you need "For-Hire Vehicle Trip Records", not "High Volume For-Hire Vehicle Trip Records".

Read the data for January. How many records are there?

*Answer:*

In [1]:
%%time 

import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

Wall time: 5.39 s


In [2]:
!ls -l ../data

total 22004
-rw-r--r-- 1 user 197121 11886281 May 18 21:37 fhv_tripdata_2021-01.parquet
-rw-r--r-- 1 user 197121 10645466 May 18 21:37 fhv_tripdata_2021-02.parquet


In [3]:
%%time 

df = pd.read_parquet('../data/fhv_tripdata_2021-01.parquet')
df.head()

Wall time: 497 ms


| | dispatching_base_num | pickup_datetime | dropOff_datetime | PUlocationID | DOlocationID | SR_Flag | Affiliated_base_number |
|---|---|---|---|---|---|---|---|
| 0 | B00009 | 2021-01-01 00:27:00 | 2021-01-01 00:44:00 | | | | B00009 |
| 1 | B00009 | 2021-01-01 00:50:00 | 2021-01-01 01:07:00 | | | | B00009 |
| 2 | B00013 | 2021-01-01 00:01:00 | 2021-01-01 01:51:00 | | | | B00013 |
| 3 | B00037 | 2021-01-01 00:13:09 | 2021-01-01 00:21:26 | | 72.0 | | B00037 |
| 4 | B00037 | 2021-01-01 00:38:31 | 2021-01-01 00:53:44 | | 61.0 | | B00037 |


In [4]:
%%time 

len_df = len(df)
print("Number of records:", len_df)

Number of records: 1154112
Wall time: 3.01 ms


## Q2. Computing duration

Now let's compute the duration variable. It should contain the duration of a ride in minutes.

What's the average trip duration in January?

*Answer:*

In [5]:
%%time 

df['duration'] = df.dropOff_datetime - df.pickup_datetime
df.duration = df.duration.apply(lambda td: td.total_seconds() / 60)

Wall time: 23.2 s


In [6]:
avg_duration = round(df.duration.mean(), 2)
print('Average trip duration in January was', avg_duration, 'minutes')

Average trip duration in January was 19.17 minutes
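
The `apply` call in cell 5 runs a Python lambda once per row, which probably accounts for most of the 23 seconds. As a hedged alternative, the sketch below computes the same quantity with the vectorized `.dt.total_seconds()` accessor. The frame `toy` is a hypothetical stand-in built from two rows of the `df.head()` output above; it is not the real dataset.

```python
import pandas as pd

# Hypothetical two-row frame with the same column names as the FHV data
toy = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(["2021-01-01 00:27:00", "2021-01-01 00:13:09"]),
    "dropOff_datetime": pd.to_datetime(["2021-01-01 00:44:00", "2021-01-01 00:21:26"]),
})

# Subtracting two datetime columns yields a timedelta column;
# .dt.total_seconds() converts the whole column at once, without a Python loop
toy["duration"] = (toy["dropOff_datetime"] - toy["pickup_datetime"]).dt.total_seconds() / 60

print(toy["duration"])
```

On the full January frame the same expression should give the same `duration` values as the `apply` version, only faster.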


## Data preparation

Check the distribution of the duration variable. There are some outliers.

Let's remove them and keep only the records where the duration was between 1 and 60 minutes (inclusive).

How many records did you drop?

*Answer:*

In [7]:
df = df[(df.duration >= 1) & (df.duration <= 60)]

print("The number of dropped rows:", (len_df - len(df)))

The number of dropped rows: 44286
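
The task asks to check the distribution before filtering, which the cell above skips. A minimal sketch of both steps follows, run on made-up durations (the values in `durations` are illustrative assumptions, not taken from the dataset). `Series.between` is inclusive on both ends by default, so it matches the `>= 1` and `<= 60` conditions.

```python
import pandas as pd

# Made-up durations in minutes, including a few outliers
durations = pd.Series([0.5, 1.0, 12.3, 60.0, 61.0, 423.0])

# Summary statistics with upper percentiles help reveal the long tail
print(durations.describe(percentiles=[0.95, 0.99]))

# Keep rides lasting between 1 and 60 minutes, both bounds included
mask = durations.between(1, 60)
kept = durations[mask]
dropped = len(durations) - len(kept)

print("Dropped rows:", dropped)
```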


## Q3. Missing values

The features we'll use for our model are the pickup and dropoff location IDs.

These columns have a lot of missing values. Let's replace them with "-1".

What's the fraction of missing values for the pickup location ID? I.e., the fraction of "-1"s after you filled the NAs.

*Answer:*

In [8]:
%%time 

df['PUlocationID'] = df['PUlocationID'].fillna(-1)
df['DOlocationID'] = df['DOlocationID'].fillna(-1)

Wall time: 44.7 ms


In [9]:
missing_values = round(df['PUlocationID'].value_counts()[-1] / len(df), 2)

print('Fraction of missing values in the PUlocationID column is', missing_values)

Fraction of missing values in the PUlocationID column is 0.84
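
The same fraction can be read directly from a boolean mask, because the mean of a boolean Series is the share of `True` values. The sketch below uses a made-up column (`pu` is a hypothetical stand-in for `df['PUlocationID']`) and shows that counting NaNs before filling and counting `-1` after filling agree.

```python
import numpy as np
import pandas as pd

# Made-up pickup IDs: three of the five values are missing
pu = pd.Series([np.nan, np.nan, 72.0, np.nan, 61.0])

# Share of missing values before filling
frac_before = pu.isna().mean()

# Share of -1 values after filling; should be identical
filled = pu.fillna(-1)
frac_after = (filled == -1).mean()

print(frac_before, frac_after)
```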


## Q4. One-hot encoding

Let's apply one-hot encoding to the pickup and dropoff location IDs. We'll use only these two features for our model.

* Turn the dataframe into a list of dictionaries
* Fit a dictionary vectorizer
* Get a feature matrix from it

What's the dimensionality of this matrix? (The number of columns).

*Answer:*

In [10]:
%%time 

# Convert the selected columns to strings so that DictVectorizer one-hot encodes them
categorical = ['PUlocationID', 'DOlocationID']
df[categorical] = df[categorical].astype(str)

# Create a list of dictionaries
train_dicts = df[categorical].to_dict(orient='records')

# Fit a dictionary vectorizer
dv = DictVectorizer()
X_train = dv.fit_transform(train_dicts)

# Create a target vector for the model training
target = 'duration'
y_train = df[target].values

Wall time: 16.1 s


In [11]:
print(f'The matrix has {X_train.shape[1]} columns')

The matrix has 525 columns
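
To make the column count less opaque, here is a small sketch of what `DictVectorizer` does with string values: each distinct `feature=value` pair becomes one column. The three records and the name `toy_dv` are hypothetical; only the ID strings mimic the format seen in this notebook.

```python
from sklearn.feature_extraction import DictVectorizer

# Three made-up rides with string-typed location IDs
records = [
    {"PUlocationID": "-1.0", "DOlocationID": "72.0"},
    {"PUlocationID": "-1.0", "DOlocationID": "61.0"},
    {"PUlocationID": "13.0", "DOlocationID": "61.0"},
]

toy_dv = DictVectorizer()
X = toy_dv.fit_transform(records)  # sparse matrix, one row per record

# vocabulary_ maps each "feature=value" name to its column index
feature_names = sorted(toy_dv.vocabulary_)

print(X.shape)
print(feature_names)
```

With two distinct pickup values and two distinct dropoff values, the toy matrix has four columns. By the same logic, the 525 columns above are the distinct pickup and dropoff values seen in January.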


## Q5. Training a model

Now let's use the feature matrix from the previous step to train a model.

* Train a plain linear regression model with default parameters
* Calculate the RMSE of the model on the training data

What's the RMSE on train?

*Answer:*

In [12]:
%%time 

# Train a Linear Regressor
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

# Predict the values using the training data
y_pred = lr_model.predict(X_train)

# Evaluate the model
rmse = mean_squared_error(y_train, y_pred, squared=False)
rmse = round(rmse, 2)
print(f'RMSE on the training data: {rmse}')

RMSE on the training data: 10.53
Wall time: 28.6 s
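
Recent scikit-learn releases deprecate the `squared` argument of `mean_squared_error`, so the call above may warn or fail depending on the installed version. A version-independent sketch is to take the square root of the MSE yourself. The arrays below are made-up numbers for illustration only.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions (minutes)
y_true = np.array([10.0, 20.0, 30.0])
y_hat = np.array([12.0, 18.0, 33.0])

# RMSE is the square root of the mean squared error
rmse_toy = np.sqrt(mean_squared_error(y_true, y_hat))

print(round(rmse_toy, 2))
```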


## Q6. Evaluating the model

### Prepare the validation dataset

In [13]:
%%time 

# Load the validation dataset
df = pd.read_parquet('../data/fhv_tripdata_2021-02.parquet')

# Feature engineering: create a new column
df['duration'] = df.dropOff_datetime - df.pickup_datetime
df.duration = df.duration.apply(lambda td: td.total_seconds() / 60)

# Filter out unused values
df = df[(df.duration >= 1) & (df.duration <= 60)]

# Replace all NaN values by -1
df['PUlocationID'] = df['PUlocationID'].fillna(-1)
df['DOlocationID'] = df['DOlocationID'].fillna(-1)

# Convert the selected columns to strings so that DictVectorizer one-hot encodes them
categorical = ['PUlocationID', 'DOlocationID']
df[categorical] = df[categorical].astype(str)

# Create a list of dictionaries
val_dicts = df[categorical].to_dict(orient='records')

# Fit a new dictionary vectorizer on the validation data (this replaces the one fitted on the training data)
dv = DictVectorizer()
X_val = dv.fit_transform(val_dicts)

# Create a target vector for the model validation
target = 'duration'
y_val = df[target].values

Wall time: 56.5 s


In [16]:
val_dicts

[{'PUlocationID': '173.0', 'DOlocationID': '82.0'},
 {'PUlocationID': '173.0', 'DOlocationID': '56.0'},
 {'PUlocationID': '82.0', 'DOlocationID': '129.0'},
 {'PUlocationID': '-1.0', 'DOlocationID': '225.0'},
 {'PUlocationID': '-1.0', 'DOlocationID': '61.0'},
 {'PUlocationID': '-1.0', 'DOlocationID': '26.0'},
 {'PUlocationID': '-1.0', 'DOlocationID': '72.0'},
 {'PUlocationID': '-1.0', 'DOlocationID': '169.0'},
 {'PUlocationID': '-1.0', 'DOlocationID': '161.0'},
 {'PUlocationID': '13.0', 'DOlocationID': '182.0'},
 {'PUlocationID': '152.0', 'DOlocationID': '244.0'},
 {'PUlocationID': '-1.0', 'DOlocationID': '-1.0'},
 {'PUlocationID': '-1.0', 'DOlocationID': '-1.0'},
 {'PUlocationID': '-1.0', 'DOlocationID': '-1.0'},
 {'PUlocationID': '-1.0', 'DOlocationID': '265.0'},
 {'PUlocationID': '-1.0', 'DOlocationID': '237.0'},
 {'PUlocationID': '-1.0', 'DOlocationID': '248.0'},
 {'PUlocationID': '-1.0', 'DOlocationID': '248.0'},
 {'PUlocationID': '-1.0', 'DOlocationID': '159.0'},
 {'PUlocationID':

In [15]:
X_val.shape

(990113, 526)

In [14]:
%%time 

# Predict the values using the validation data
y_pred = lr_model.predict(X_val)

# Evaluate the model
rmse = mean_squared_error(y_val, y_pred, squared=False)
rmse = round(rmse, 2)
print(f'RMSE on the validation data: {rmse}')

ValueError: dimension mismatch
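
The mismatch most likely comes from cell 13. It fits a fresh `DictVectorizer` on the February data, which yields 526 columns, while `lr_model` was trained on the 525 columns learned from January. The usual remedy is to fit the vectorizer once on the training data and only call `transform` on the validation data. The sketch below shows this on made-up records; `toy_dv`, `model`, and the short `PU`/`DO` keys are hypothetical names.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression

# Made-up training and validation records; validation contains unseen IDs
train = [{"PU": "1.0", "DO": "2.0"}, {"PU": "1.0", "DO": "3.0"}, {"PU": "-1.0", "DO": "2.0"}]
val = [{"PU": "1.0", "DO": "2.0"}, {"PU": "4.0", "DO": "9.0"}]
y = np.array([10.0, 15.0, 12.0])

# Fit the vectorizer on the training data only
toy_dv = DictVectorizer()
X_tr = toy_dv.fit_transform(train)
model = LinearRegression().fit(X_tr, y)

# Reuse the fitted vectorizer: transform() keeps the training columns
# and silently ignores feature values it has not seen before
X_v = toy_dv.transform(val)
preds = model.predict(X_v)

print(X_tr.shape, X_v.shape)
```

Applied to this notebook, that would mean dropping the second `dv = DictVectorizer()` and computing `X_val = dv.transform(val_dicts)` with the vectorizer from cell 10, so that `X_val` has the same 525 columns as `X_train`.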