<a href="https://colab.research.google.com/github/tnaka78/mlops-zoomcamp/blob/main/mlops_zoomcamp_01_homework.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MLOps Zoomcamp 01 Homework

In [1]:
import os
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

## Q1. Download the data

In [2]:
!wget https://nyc-tlc.s3.amazonaws.com/trip+data/fhv_tripdata_2021-01.parquet
!wget https://nyc-tlc.s3.amazonaws.com/trip+data/fhv_tripdata_2021-02.parquet

--2022-05-23 11:28:49--  https://nyc-tlc.s3.amazonaws.com/trip+data/fhv_tripdata_2021-01.parquet
Resolving nyc-tlc.s3.amazonaws.com (nyc-tlc.s3.amazonaws.com)... 52.217.196.33
Connecting to nyc-tlc.s3.amazonaws.com (nyc-tlc.s3.amazonaws.com)|52.217.196.33|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11886281 (11M) [binary/octet-stream]
Saving to: ‘fhv_tripdata_2021-01.parquet’


2022-05-23 11:28:49 (50.2 MB/s) - ‘fhv_tripdata_2021-01.parquet’ saved [11886281/11886281]

--2022-05-23 11:28:49--  https://nyc-tlc.s3.amazonaws.com/trip+data/fhv_tripdata_2021-02.parquet
Resolving nyc-tlc.s3.amazonaws.com (nyc-tlc.s3.amazonaws.com)... 52.217.196.33
Connecting to nyc-tlc.s3.amazonaws.com (nyc-tlc.s3.amazonaws.com)|52.217.196.33|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10645466 (10M) [binary/octet-stream]
Saving to: ‘fhv_tripdata_2021-02.parquet’


2022-05-23 11:28:50 (48.5 MB/s) - ‘fhv_tripdata_2021-02.parquet’ saved [10645466/10645

In [3]:
train_df = pd.read_parquet('./fhv_tripdata_2021-01.parquet')
train_df.head(5)

Unnamed: 0,dispatching_base_num,pickup_datetime,dropOff_datetime,PUlocationID,DOlocationID,SR_Flag,Affiliated_base_number
0,B00009,2021-01-01 00:27:00,2021-01-01 00:44:00,,,,B00009
1,B00009,2021-01-01 00:50:00,2021-01-01 01:07:00,,,,B00009
2,B00013,2021-01-01 00:01:00,2021-01-01 01:51:00,,,,B00013
3,B00037,2021-01-01 00:13:09,2021-01-01 00:21:26,,72.0,,B00037
4,B00037,2021-01-01 00:38:31,2021-01-01 00:53:44,,61.0,,B00037


In [4]:
len(train_df)

1154112

## Q2. Computing duration

In [5]:
train_df['duration'] = (train_df['dropOff_datetime'] - train_df['pickup_datetime']).dt.total_seconds() / 60.0
train_df.head(5)

Unnamed: 0,dispatching_base_num,pickup_datetime,dropOff_datetime,PUlocationID,DOlocationID,SR_Flag,Affiliated_base_number,duration
0,B00009,2021-01-01 00:27:00,2021-01-01 00:44:00,,,,B00009,17.0
1,B00009,2021-01-01 00:50:00,2021-01-01 01:07:00,,,,B00009,17.0
2,B00013,2021-01-01 00:01:00,2021-01-01 01:51:00,,,,B00013,110.0
3,B00037,2021-01-01 00:13:09,2021-01-01 00:21:26,,72.0,,B00037,8.283333
4,B00037,2021-01-01 00:38:31,2021-01-01 00:53:44,,61.0,,B00037,15.216667


In [6]:
train_df['duration'].mean()

19.1672240937939
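
As a quick sanity check on the duration computation, the sketch below rebuilds two of the rows shown above in a small hand-made frame and recomputes their duration in minutes. The frame `toy` is an illustrative assumption and is not part of the dataset loading code.

```python
import pandas as pd

# Two rows copied by hand from the head() output above (illustrative only)
toy = pd.DataFrame({
    'pickup_datetime': pd.to_datetime(['2021-01-01 00:27:00', '2021-01-01 00:13:09']),
    'dropOff_datetime': pd.to_datetime(['2021-01-01 00:44:00', '2021-01-01 00:21:26']),
})

# Timedelta -> seconds -> minutes, computed for the whole column at once
toy['duration'] = (toy['dropOff_datetime'] - toy['pickup_datetime']).dt.total_seconds() / 60.0
durations = toy['duration'].tolist()
```

The second row spans 8 minutes and 17 seconds, which is where the 8.283333 in the table comes from.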

## Data preparation

Number of records whose duration falls outside the range of 1 to 60 minutes:

In [7]:
len(train_df[(train_df['duration'] < 1.0) | (train_df['duration'] > 60.0)])

44286

In [8]:
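# .copy() makes the filtered frame independent, so later in-place edits do not trigger SettingWithCopyWarning
train_df = train_df[(train_df['duration'] >= 1.0) & (train_df['duration'] <= 60.0)].copy()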
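
The effect of this filter can be checked with plain arithmetic on the two counts reported above (1154112 records in total, 44286 outside the range). The variable names below are illustrative.

```python
# Counts taken from the cell outputs above
total_records = 1154112
dropped_records = 44286

# Records that survive the 1-to-60-minute filter
kept_records = total_records - dropped_records

# Share of the original data removed by the filter
dropped_share = dropped_records / total_records
```

Roughly 3.8% of the January records are dropped, and the 1109826 remaining records match the number of rows in the feature matrix built in Q4.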

## Q3. Missing values

In [9]:
train_df.isnull().sum() / len(train_df)

dispatching_base_num      0.000000
pickup_datetime           0.000000
dropOff_datetime          0.000000
PUlocationID              0.835273
DOlocationID              0.133270
SR_Flag                   1.000000
Affiliated_base_number    0.000697
duration                  0.000000
dtype: float64
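
The ratio above is the per-column mean of the null mask, so `isnull().mean()` gives the same numbers as `isnull().sum() / len(df)`. A minimal sketch on a hand-made frame (the frame `toy` and its values are assumptions for illustration):

```python
import pandas as pd

nan = float('nan')

# Four illustrative rows: three missing pickup IDs, one missing drop-off ID
toy = pd.DataFrame({
    'PUlocationID': [nan, nan, nan, 173.0],
    'DOlocationID': [nan, 72.0, 61.0, 82.0],
})

# Fraction of missing values per column, computed two equivalent ways
share_via_sum = toy.isnull().sum() / len(toy)
share_via_mean = toy.isnull().mean()

# Replace the missing IDs with the sentinel value -1
filled = toy.fillna({'PUlocationID': -1, 'DOlocationID': -1})
```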

Fill missing pickup and drop-off location IDs with -1

In [10]:
train_df.fillna({'PUlocationID': -1, 'DOlocationID': -1}, inplace=True)
train_df.head(5)

Unnamed: 0,dispatching_base_num,pickup_datetime,dropOff_datetime,PUlocationID,DOlocationID,SR_Flag,Affiliated_base_number,duration
0,B00009,2021-01-01 00:27:00,2021-01-01 00:44:00,-1.0,-1.0,,B00009,17.0
1,B00009,2021-01-01 00:50:00,2021-01-01 01:07:00,-1.0,-1.0,,B00009,17.0
3,B00037,2021-01-01 00:13:09,2021-01-01 00:21:26,-1.0,72.0,,B00037,8.283333
4,B00037,2021-01-01 00:38:31,2021-01-01 00:53:44,-1.0,61.0,,B00037,15.216667
5,B00037,2021-01-01 00:59:02,2021-01-01 01:08:05,-1.0,71.0,,B00037,9.05


## Q4. One-hot encoding

In [11]:
train_df[['PUlocationID', 'DOlocationID']] = train_df[['PUlocationID', 'DOlocationID']].astype(str)
train_df.head(5)

Unnamed: 0,dispatching_base_num,pickup_datetime,dropOff_datetime,PUlocationID,DOlocationID,SR_Flag,Affiliated_base_number,duration
0,B00009,2021-01-01 00:27:00,2021-01-01 00:44:00,-1.0,-1.0,,B00009,17.0
1,B00009,2021-01-01 00:50:00,2021-01-01 01:07:00,-1.0,-1.0,,B00009,17.0
3,B00037,2021-01-01 00:13:09,2021-01-01 00:21:26,-1.0,72.0,,B00037,8.283333
4,B00037,2021-01-01 00:38:31,2021-01-01 00:53:44,-1.0,61.0,,B00037,15.216667
5,B00037,2021-01-01 00:59:02,2021-01-01 01:08:05,-1.0,71.0,,B00037,9.05


In [12]:
train_dict = train_df[['PUlocationID', 'DOlocationID']].to_dict(orient='records')

In [13]:
dv = DictVectorizer()
x_train = dv.fit_transform(train_dict)

In [14]:
x_train.shape

(1109826, 525)

## Q5. Train a model

In [15]:
y_train = train_df['duration'].values

In [16]:
lr = LinearRegression()
lr.fit(x_train, y_train)
y_pred = lr.predict(x_train)
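# RMSE as the square root of the MSE; the squared=False argument is no longer accepted by recent scikit-learn releases
mean_squared_error(y_train, y_pred) ** 0.5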

10.528519107210744
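
The value above is a root mean squared error (RMSE), expressed in minutes. As a minimal sketch, the RMSE can be computed by hand as the square root of the mean squared difference and compared against scikit-learn; the arrays `y_true` and `y_hat` are made-up values for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions, in minutes
y_true = np.array([17.0, 17.0, 8.0])
y_hat = np.array([15.0, 19.0, 8.0])

# RMSE by definition: sqrt(mean((y - y_hat)^2))
rmse_manual = float(np.sqrt(np.mean((y_true - y_hat) ** 2)))

# Same quantity via scikit-learn's MSE
rmse_sklearn = float(mean_squared_error(y_true, y_hat) ** 0.5)
```

Here the errors are 2, -2 and 0 minutes, so the MSE is 8/3 and the RMSE is about 1.63 minutes.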

## Q6. Evaluate the model

Prepare validation data

In [17]:
val_df = pd.read_parquet('./fhv_tripdata_2021-02.parquet')
val_df['duration'] = (val_df['dropOff_datetime'] - val_df['pickup_datetime']).dt.total_seconds() / 60.0
# .copy() makes the filtered frame independent, which avoids SettingWithCopyWarning on the assignments below
val_df = val_df[(val_df['duration'] >= 1.0) & (val_df['duration'] <= 60.0)].copy()
val_df = val_df.fillna({'PUlocationID': -1, 'DOlocationID': -1})
val_df[['PUlocationID', 'DOlocationID']] = val_df[['PUlocationID', 'DOlocationID']].astype(str)
val_df.head(5)

Unnamed: 0,dispatching_base_num,pickup_datetime,dropOff_datetime,PUlocationID,DOlocationID,SR_Flag,Affiliated_base_number,duration
1,B00021,2021-02-01 00:55:40,2021-02-01 01:06:20,173.0,82.0,,B00021,10.666667
2,B00021,2021-02-01 00:14:03,2021-02-01 00:28:37,173.0,56.0,,B00021,14.566667
3,B00021,2021-02-01 00:27:48,2021-02-01 00:35:45,82.0,129.0,,B00021,7.95
4,B00037,2021-02-01 00:12:50,2021-02-01 00:26:38,-1.0,225.0,,B00037,13.8
5,B00037,2021-02-01 00:00:37,2021-02-01 00:09:35,-1.0,61.0,,B00037,8.966667


In [18]:
val_dict = val_df[['PUlocationID', 'DOlocationID']].to_dict(orient='records')
x_val = dv.transform(val_dict)
y_val = val_df['duration'].values

Evaluate

In [19]:
y_pred = lr.predict(x_val)
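# RMSE as the square root of the MSE; the squared=False argument is no longer accepted by recent scikit-learn releases
mean_squared_error(y_val, y_pred) ** 0.5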

11.014283196111764
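
The training and validation frames go through the same preprocessing steps (duration, range filter, fill with -1, cast to string). One way to keep the two in sync is to wrap those steps in a single helper. The sketch below is one possible arrangement, not part of the original notebook; the names `prepare_features` and `CATEGORICAL` are hypothetical, and the toy frame stands in for the parquet files.

```python
import pandas as pd

CATEGORICAL = ['PUlocationID', 'DOlocationID']  # hypothetical constant


def prepare_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: apply the notebook's preprocessing to one frame."""
    # Trip duration in minutes; assign() returns a new frame and leaves the input untouched
    df = df.assign(
        duration=(df['dropOff_datetime'] - df['pickup_datetime']).dt.total_seconds() / 60.0
    )
    # Keep trips between 1 and 60 minutes; copy so the assignment below is safe
    df = df[(df['duration'] >= 1.0) & (df['duration'] <= 60.0)].copy()
    # Missing IDs become -1, then everything is cast to str for one-hot encoding
    df[CATEGORICAL] = df[CATEGORICAL].fillna(-1).astype(str)
    return df


nan = float('nan')

# Toy stand-in for a monthly file: 17 min, 110 min (filtered out), 8 min 17 s
toy = pd.DataFrame({
    'pickup_datetime': pd.to_datetime(
        ['2021-01-01 00:27:00', '2021-01-01 00:01:00', '2021-01-01 00:13:09']),
    'dropOff_datetime': pd.to_datetime(
        ['2021-01-01 00:44:00', '2021-01-01 01:51:00', '2021-01-01 00:21:26']),
    'PUlocationID': [nan, nan, nan],
    'DOlocationID': [nan, nan, 72.0],
})
prepared = prepare_features(toy)
```

With such a helper, both months could be loaded as `prepare_features(pd.read_parquet(path))`, with the `DictVectorizer` fitted on the January result only and reused via `transform` for February, as in cell 18.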