# Homework of session 1
----

The goal of this homework is to train a simple model for predicting the duration of a ride, similar to what we did in this module.

## Libraries

In [2]:
import pandas as pd
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")


from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_squared_error

# Q1. Downloading the data
----

We'll use the same <a href="https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page" target="_blank">NYC Taxi dataset</a>, but instead of "Green Taxi Trip Records", we'll use "For-Hire Vehicle Trip Records".

Download the data for January and February 2021.

Note that you need "For-Hire Vehicle Trip Records", not "High Volume For-Hire Vehicle Trip Records".

Read the data for January. How many records are there?

- 1054112
- 1154112
- 1254112
- 1354112

In [12]:
%cd ./data
!wget https://nyc-tlc.s3.amazonaws.com/trip+data/fhv_tripdata_2021-01.parquet
!wget https://nyc-tlc.s3.amazonaws.com/trip+data/fhv_tripdata_2021-02.parquet

/home/premsurawut/mlops-zoomcamp/notebook/01-Intro/data
--2022-06-11 15:00:26--  https://nyc-tlc.s3.amazonaws.com/trip+data/fhv_tripdata_2021-01.parquet
Resolving nyc-tlc.s3.amazonaws.com (nyc-tlc.s3.amazonaws.com)... 52.217.231.41
Connecting to nyc-tlc.s3.amazonaws.com (nyc-tlc.s3.amazonaws.com)|52.217.231.41|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11886281 (11M) [binary/octet-stream]
Saving to: ‘fhv_tripdata_2021-01.parquet’


2022-06-11 15:00:27 (13.0 MB/s) - ‘fhv_tripdata_2021-01.parquet’ saved [11886281/11886281]

--2022-06-11 15:00:28--  https://nyc-tlc.s3.amazonaws.com/trip+data/fhv_tripdata_2021-02.parquet
Resolving nyc-tlc.s3.amazonaws.com (nyc-tlc.s3.amazonaws.com)... 52.217.231.41
Connecting to nyc-tlc.s3.amazonaws.com (nyc-tlc.s3.amazonaws.com)|52.217.231.41|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10645466 (10M) [binary/octet-stream]
Saving to: ‘fhv_tripdata_2021-02.parquet’


2022-06-11 15:00:29 (12.5 MB/s

In [19]:
df = pd.read_parquet('./fhv_tripdata_2021-01.parquet')

In [23]:
print('Q1: How many records are there?')
print(f'A1: {len(df)}')

Q1: How many records are there?
A1: 1154112


# Q2. Computing duration
----

Now let's compute the duration variable. It should contain the duration of a ride in minutes.

What's the average trip duration in January?

- 15.16
- 19.16
- 24.16
- 29.16

In [21]:
df['duration'] = df.dropOff_datetime - df.pickup_datetime
df['duration'] = df.duration.dt.total_seconds() / 60

In [26]:
print("Q2: What's the average trip duration in January?")
print(f"A2: {round(df['duration'].mean(), 3)}")

Q2: What's the average trip duration in January?
A2: 19.167
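
As a minimal sketch of the duration computation on toy data (the two rows below are made up; only the column names come from the dataset), subtracting the datetime columns gives a timedelta, which is then converted to minutes:

```python
import pandas as pd

# Toy rows (hypothetical values); column names mirror the FHV data used above.
toy = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(["2021-01-01 00:00:00", "2021-01-01 10:15:00"]),
    "dropOff_datetime": pd.to_datetime(["2021-01-01 00:17:30", "2021-01-01 11:00:00"]),
})

# Subtracting two datetime columns yields a timedelta column.
delta = toy["dropOff_datetime"] - toy["pickup_datetime"]

# total_seconds() covers the whole interval, so dividing by 60 gives fractional minutes.
toy["duration"] = delta.dt.total_seconds() / 60

print(toy["duration"].tolist())  # [17.5, 45.0]
```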


## Data preparation
Check the distribution of the duration variable. There are some outliers.

Let's remove them and keep only the records where the duration was between 1 and 60 minutes (inclusive).

How many records did you drop?

In [29]:
n_before = len(df)  # row count before filtering
df = df[(df.duration >= 1) & (df.duration <= 60)].copy()
print('Q: How many records did you drop?')
print(f'A: {n_before - len(df)}')

Q: How many records did you drop?
A: 44286
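
One possible way to check the distribution before filtering is to look at summary statistics with a few upper percentiles. The sketch below uses a hypothetical series of durations, not the real data, and `Series.between` as an equivalent of the two-sided comparison used above:

```python
import pandas as pd

# Hypothetical durations in minutes, including obvious outliers.
durations = pd.Series([0.2, 5.0, 12.5, 30.0, 59.9, 60.0, 75.0, 1440.0])

# Upper percentiles make a long right tail easy to spot.
summary = durations.describe(percentiles=[0.5, 0.95, 0.99])
print(summary)

# Series.between is inclusive on both ends by default (1 <= x <= 60).
mask = durations.between(1, 60)
n_dropped = int((~mask).sum())
share_kept = mask.mean()

print(n_dropped, share_kept)  # 3 0.625
```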


# Q3. Missing values
----

The features we'll use for our model are the pickup and dropoff location IDs.

Both columns have many missing values. Let's replace them with "-1".

What's the fraction of missing values for the pickup location ID? That is, the fraction of "-1"s after you fill the NAs.

- 53%
- 63%
- 73%
- 83%

In [47]:
categorical = ['PUlocationID', 'DOlocationID']

n_rows = len(df)
df[categorical] = df[categorical].fillna(-1).astype('int')
n_missing_pu = len(df[df['PUlocationID'] == -1])
fraction_pu_id = n_missing_pu / n_rows * 100  # percentage of rows with a missing pickup ID

print('Q3: What is the fraction of missing values for pickup location ID?')
print(f'A3: {fraction_pu_id}')

Q3: What is the fraction of missing values for pickup location ID?
A3: 83.52732770722618
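
The same idea on a toy frame (the values are invented for illustration): after filling the NAs with -1, the mean of the boolean mask is the fraction of missing pickup IDs.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; NaN marks a missing location ID.
toy = pd.DataFrame({
    "PUlocationID": [np.nan, 12.0, np.nan, np.nan],
    "DOlocationID": [7.0, np.nan, 3.0, 7.0],
})
categorical = ["PUlocationID", "DOlocationID"]

# Replace missing IDs with the sentinel -1, as in the cell above.
toy[categorical] = toy[categorical].fillna(-1).astype("int")

# The mean of a boolean mask is the share of True values.
fraction_missing = (toy["PUlocationID"] == -1).mean()

print(f"{fraction_missing:.1%}")  # 75.0%
```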


# Q4. One-hot encoding
----

Let's apply one-hot encoding to the pickup and dropoff location IDs. We'll use only these two features for our model.

- Turn the dataframe into a list of dictionaries
- Fit a dictionary vectorizer
- Get a feature matrix from it

What's the dimensionality of this matrix? (The number of columns).

- 2
- 152
- 352
- 525
- 725

In [48]:
df[categorical] = df[categorical].astype('str')

In [49]:
train_dicts = df[categorical].to_dict(orient='records')

In [50]:
dv = DictVectorizer()
X_train = dv.fit_transform(train_dicts) 

In [54]:
y_train = df.duration.values

In [55]:
print('Q4: What is the dimensionality of this matrix?')
print(f'A4: {len(dv.feature_names_)}')

Q4: What is the dimensionality of this matrix?
A4: 525
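
To see where the columns come from, here is a small sketch with three made-up records. `DictVectorizer` one-hot encodes string values, creating one column per `feature=value` pair, whereas numeric values would be passed through as a single numeric column. That is why the IDs are cast to `str` first.

```python
from sklearn.feature_extraction import DictVectorizer

# Made-up records; IDs are strings so that they are one-hot encoded.
records = [
    {"PUlocationID": "-1", "DOlocationID": "7"},
    {"PUlocationID": "12", "DOlocationID": "7"},
    {"PUlocationID": "-1", "DOlocationID": "3"},
]

dv_toy = DictVectorizer()
X_toy = dv_toy.fit_transform(records)  # sparse matrix by default

# One column per distinct "feature=value" pair seen during fit.
print(dv_toy.feature_names_)
print(X_toy.shape)      # (3, 4)
print(X_toy.toarray())
```

With two distinct pickup values and two distinct dropoff values, the toy matrix has four columns. The 525 columns above arise the same way from the distinct IDs in the January data.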


# Q5. Training a model
----

Now let's use the feature matrix from the previous step to train a model.

- Train a plain linear regression model with default parameters
- Calculate the RMSE of the model on the training data

What's the RMSE on train?

- 5.52
- 10.52
- 15.52
- 20.52

In [57]:
lr = LinearRegression()
lr.fit(X_train, y_train)

y_pred = lr.predict(X_train)

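# squared=False makes mean_squared_error return the RMSE instead of the MSE
# (newer scikit-learn releases provide root_mean_squared_error for this)
rmse_train = mean_squared_error(y_train, y_pred, squared=False)
print('Q5: What is the RMSE on train')
print(f'A5: {rmse_train}')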

Q5: What is the RMSE on train
A5: 10.528519107205451
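
For reference, RMSE is the square root of the mean squared error. A short sketch on invented numbers, computing it by hand and via the square root of `mean_squared_error`:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Invented targets and predictions, in minutes.
y_true = np.array([10.0, 20.0, 30.0])
y_hat = np.array([12.0, 18.0, 33.0])

# RMSE by definition: square the errors, average them, take the root.
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# Same quantity from the plain MSE; this form does not rely on the squared argument.
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))

print(rmse_manual, rmse_sklearn)
```

Because RMSE is in the same units as the target, a value of about 10.5 here means the predictions are off by roughly ten minutes in the root-mean-square sense.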


# Q6. Evaluating the model
----

Now let's apply this model to the validation dataset (Feb 2021).

What's the RMSE on validation?

- 6.01
- 11.01
- 16.01
- 21.01

In [58]:
categorical = ['PUlocationID', 'DOlocationID']

def read_data(filename):
    """Load one month of trips and apply the same preprocessing as the training data."""
    df = pd.read_parquet(filename)
    
    df['duration'] = df.dropOff_datetime - df.pickup_datetime
    df['duration'] = df.duration.dt.total_seconds() / 60

    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    df[categorical] = df[categorical].fillna(-1).astype('int').astype('str')
    
    return df

In [59]:
df_val = read_data('./fhv_tripdata_2021-02.parquet')

In [60]:
val_dicts = df_val[categorical].to_dict(orient='records')

In [61]:
X_val = dv.transform(val_dicts) 
y_pred = lr.predict(X_val)
y_val = df_val.duration.values

# squared=False returns the RMSE, so name the variable accordingly
val_rmse = mean_squared_error(y_val, y_pred, squared=False)

In [62]:
print('Q6: What is the RMSE on validation?')
print(f'A6: {val_rmse}')

Q6: What is the RMSE on validation?
A6: 11.014283139629091
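
Note that the validation data is passed through `dv.transform`, not `fit_transform`, so the feature columns stay aligned with what the model was trained on. A small sketch with made-up records shows what happens to a location ID that did not appear during fitting: it is silently ignored and the row gets all zeros for that feature.

```python
from sklearn.feature_extraction import DictVectorizer

# Made-up records; "99" appears only in the validation set.
train = [{"PUlocationID": "-1"}, {"PUlocationID": "12"}]
val = [{"PUlocationID": "12"}, {"PUlocationID": "99"}]

dv_toy = DictVectorizer()
dv_toy.fit(train)

# transform reuses the columns learned in fit; unseen values add no new columns.
X_val_toy = dv_toy.transform(val)

print(dv_toy.feature_names_)
print(X_val_toy.toarray())
```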
