## Homework Week1 - MLOps Zoomcamp

### Author: [Sebastián Ayala Ruano](https://sayalaruano.github.io/)

### Dataset

In this homework, we will use the "For-Hire Vehicle Trip Records" from the [NYC taxi dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page).

Download the data for January and February 2021.

Note that you need "For-Hire Vehicle Trip Records", not "High Volume For-Hire Vehicle Trip Records".

In [64]:
# Import libraries 
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import mean_squared_error
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

In [65]:
# Load data 
df_jan = pd.read_parquet("Data/fhv_tripdata_2021-01.parquet")
df_feb = pd.read_parquet("Data/fhv_tripdata_2021-02.parquet")

### Question 1
Read the data for January. How many records are there?

In [66]:
df_jan.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1154112 entries, 0 to 1154111
Data columns (total 7 columns):
 #   Column                  Non-Null Count    Dtype         
---  ------                  --------------    -----         
 0   dispatching_base_num    1154112 non-null  object        
 1   pickup_datetime         1154112 non-null  datetime64[ns]
 2   dropOff_datetime        1154112 non-null  datetime64[ns]
 3   PUlocationID            195845 non-null   float64       
 4   DOlocationID            991892 non-null   float64       
 5   SR_Flag                 0 non-null        object        
 6   Affiliated_base_number  1153227 non-null  object        
dtypes: datetime64[ns](2), float64(2), object(3)
memory usage: 61.6+ MB


**Answer:** 1154112

### Question 2 - Computing duration

Now let's compute the `duration` variable. It should contain the duration of a ride in minutes. 

What's the average trip duration in January?

In [67]:
# Create trip duration columns 
df_jan['duration'] = df_jan.dropOff_datetime - df_jan.pickup_datetime
df_feb['duration'] = df_feb.dropOff_datetime - df_feb.pickup_datetime

# Convert trip duration to minutes
df_jan.duration = df_jan.duration.apply(lambda td: td.total_seconds() / 60)
df_feb.duration = df_feb.duration.apply(lambda td: td.total_seconds() / 60)

In [68]:
# Obtain trip duration mean
df_jan['duration'].mean()

19.167224093791006

**Answer:** 19.16
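As a side note, the same conversion can be done without a row-by-row `apply`. The sketch below assumes a toy frame with the same column names as the FHV data (the timestamps are made up for illustration) and uses the `.dt.total_seconds()` accessor on the timedelta column.

```python
import pandas as pd

# Toy frame with the same column names as the FHV data (values are made up).
trips = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(["2021-01-01 00:00:00", "2021-01-01 10:30:00"]),
    "dropOff_datetime": pd.to_datetime(["2021-01-01 00:17:00", "2021-01-01 10:32:30"]),
})

# Subtracting two datetime columns yields a timedelta column;
# .dt.total_seconds() converts the whole column at once.
delta = trips["dropOff_datetime"] - trips["pickup_datetime"]
trips["duration"] = delta.dt.total_seconds() / 60

print(trips["duration"].tolist())  # [17.0, 2.5]
```

On a frame with over a million rows, the vectorized form is usually noticeably faster than `apply` with a lambda, while giving the same values.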

### Data preparation

Check the distribution of the duration variable. There are some outliers. 

Let's remove them and keep only the records where the duration was between 1 and 60 minutes (inclusive).

How many records did you drop?

In [69]:
# Apply trip duration filter 
df_jan_ft = df_jan[(df_jan.duration >= 1) & (df_jan.duration <= 60)]
df_feb_ft = df_feb[(df_feb.duration >= 1) & (df_feb.duration <= 60)]

In [70]:
# Number of dropped records
df_jan_dropp_n = len(df_jan) - len(df_jan_ft)
df_feb_dropp_n = len(df_feb) - len(df_feb_ft)

In [71]:
df_jan_dropp_n

44286

In [72]:
df_feb_dropp_n

47579

**Answer:** 44286 records dropped for January (47579 for February)
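The inclusive 1 to 60 minute filter can also be written with `Series.between`, which is inclusive on both ends by default. This is a minimal sketch on made-up durations, not the homework data.

```python
import pandas as pd

# Made-up durations in minutes, including values on and outside the boundaries.
durations = pd.DataFrame({"duration": [0.5, 1.0, 30.0, 60.0, 61.0]})

# between() keeps both endpoints by default, matching (>= 1) & (<= 60).
mask = durations["duration"].between(1, 60)
kept = durations[mask]
dropped = len(durations) - len(kept)

print(len(kept), dropped)  # 3 2
```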

### Question 3 - Missing values

The features we'll use for our model are the pickup and dropoff location IDs.

However, these columns have many missing values. Let's replace them with "-1".

What's the fraction of missing values for the pickup location ID, i.e. the fraction of "-1"s after filling the NAs?

In [73]:
# Select features for the model
features = ['PUlocationID', 'DOlocationID']

# Fill NAs with -1
df_jan_train = df_jan_ft[features].fillna(-1)

# Convert columns from float to strings
df_jan_train = df_jan_train.astype(str)


In [80]:
# Fraction of missing values
n_nas = df_jan_train["PUlocationID"].value_counts()["-1.0"]

total = len(df_jan_train)

fraction = n_nas/total

fraction

0.8352732770722617

**Answer:** 83%
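The fraction of "-1"s after filling equals the fraction of NAs before filling, so it can be cross-checked with `isna().mean()`. The sketch below uses a made-up four-row series named like the pickup column.

```python
import numpy as np
import pandas as pd

# Made-up pickup IDs: three of the four values are missing.
pickup = pd.Series([np.nan, 71.0, np.nan, np.nan], name="PUlocationID")

# The mean of a boolean mask is the fraction of True values.
frac_missing = pickup.isna().mean()

# After filling with -1, the share of -1 values is the same number.
frac_filled = (pickup.fillna(-1) == -1).mean()

print(frac_missing, frac_filled)  # 0.75 0.75
```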

### Question 4 - One-hot encoding

Let's apply one-hot encoding to the pickup and dropoff location IDs. We'll use only these two features for our model.

* Turn the dataframe into a list of dictionaries
* Fit a dictionary vectorizer 
* Get a feature matrix from it

What's the dimensionality of this matrix (the number of columns)?

In [82]:
# Convert dataframe into a list of dicts
train_dicts = df_jan_train.to_dict(orient='records')

# Fit dict vect and obtain feat matrix
dv = DictVectorizer()
X_train = dv.fit_transform(train_dicts)

In [83]:
X_train.shape

(1109826, 525)

**Answer:** 525
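To see where the column count comes from, here is a small sketch with three made-up records. `DictVectorizer` creates one column per distinct `feature=value` pair among string-valued features, so the width of the matrix is the number of distinct pickup IDs plus the number of distinct dropoff IDs.

```python
from sklearn.feature_extraction import DictVectorizer

# Made-up records; IDs are strings, so each distinct value becomes its own column.
records = [
    {"PUlocationID": "-1.0", "DOlocationID": "71.0"},
    {"PUlocationID": "10.0", "DOlocationID": "71.0"},
    {"PUlocationID": "-1.0", "DOlocationID": "5.0"},
]

dv = DictVectorizer()
X = dv.fit_transform(records)

print(X.shape)            # (3, 4)
print(dv.feature_names_)
# ['DOlocationID=5.0', 'DOlocationID=71.0', 'PUlocationID=-1.0', 'PUlocationID=10.0']
```

This is also why the IDs are cast to strings first: numeric values would be kept as a single numeric column per feature instead of being one-hot encoded.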

### Question 5 - Training a model

Now let's use the feature matrix from the previous step to train a model. 

* Train a plain linear regression model with default parameters 
* Calculate the RMSE of the model on the training data

What's the RMSE on train?

In [84]:
# Obtain target variable column
target = 'duration'
y_train = df_jan_ft[target].values

# Train lr model
lr = LinearRegression()
lr.fit(X_train, y_train)


In [85]:
# Obtain RMSE
y_pred = lr.predict(X_train)
mean_squared_error(y_train, y_pred, squared=False)

10.528519388409808

**Answer:** 10.52
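For reference, RMSE is the square root of the mean squared difference between targets and predictions. The sketch below computes it by hand with NumPy on made-up numbers; it does not depend on the `squared=False` argument, which recent scikit-learn releases have deprecated in favour of a separate `root_mean_squared_error` function.

```python
import numpy as np

# Made-up targets and predictions, in minutes.
y_true = np.array([10.0, 20.0, 30.0])
y_hat = np.array([12.0, 18.0, 33.0])

# Errors are -2, 2, -3; squared errors are 4, 4, 9; their mean is 17/3.
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))

print(round(rmse, 4))  # 2.3805
```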

### Question 6 - Evaluating the model

Now let's apply this model to the validation dataset (Feb 2021). 

What's the RMSE on validation?

In [92]:
def read_dataframe(filename):
    # Load data
    df = pd.read_parquet(filename)

    # Create trip duration columns 
    df['duration'] = df.dropOff_datetime - df.pickup_datetime
    
    # Convert trip duration to minutes
    df.duration = df.duration.apply(lambda td: td.total_seconds() / 60)

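    # Apply trip duration filter; .copy() avoids pandas' SettingWithCopyWarning
    # when the feature columns are reassigned below
    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()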

    # Select features for the model
    features = ['PUlocationID', 'DOlocationID']

    # Fill NAs with -1
    df[features] = df[features].fillna(-1)

    # Convert columns from float to strings
    df[features] = df[features].astype(str)
    
    return df

In [93]:
# Load data
df_train = read_dataframe("Data/fhv_tripdata_2021-01.parquet")
df_val = read_dataframe("Data/fhv_tripdata_2021-02.parquet")

In [94]:
len(df_train), len(df_val)

(1109826, 990113)

In [95]:
# Create and fit dicts
features = ['PUlocationID', 'DOlocationID']


dv = DictVectorizer()

train_dicts = df_train[features].to_dict(orient='records')
X_train = dv.fit_transform(train_dicts)

val_dicts = df_val[features].to_dict(orient='records')
X_val = dv.transform(val_dicts)

In [96]:
# Obtain target column
target = 'duration'
y_train = df_train[target].values
y_val = df_val[target].values

In [97]:
# Evaluate the model on validation dataset
lr = LinearRegression()
lr.fit(X_train, y_train)

y_pred = lr.predict(X_val)

mean_squared_error(y_val, y_pred, squared=False)

11.014287519486222

**Answer:** 11.01
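The validation data is passed through `dv.transform`, not `dv.fit_transform`, so it is encoded with the columns learned from the training data. The sketch below, on made-up records, shows what happens to a location ID that appears only in validation: it is silently ignored and the row is all zeros for that feature.

```python
from sklearn.feature_extraction import DictVectorizer

# Made-up records; "99.0" appears only in the validation set.
train = [{"PUlocationID": "-1.0"}, {"PUlocationID": "10.0"}]
val = [{"PUlocationID": "10.0"}, {"PUlocationID": "99.0"}]

dv = DictVectorizer()
X_train = dv.fit_transform(train)  # learns the two training columns
X_val = dv.transform(val)          # reuses them; unseen values are dropped

print(X_val.shape)      # (2, 2)
print(X_val.toarray())  # second row is all zeros
```

Calling `fit_transform` on the validation set instead would produce a matrix whose columns no longer line up with the ones the model was trained on.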