Dataset: 
- training set: https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet
- validation set: https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-02.parquet

Homework: https://github.com/DataTalksClub/mlops-zoomcamp/blob/main/cohorts/2024/01-intro/homework.md

In [5]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression, Lasso, Ridge

from sklearn.metrics import mean_squared_error

import pickle

In [10]:
# ! pip install pyarrow

In [9]:
### open parquet data directly from the url
### dataset 2024
# df_train = pd.read_parquet('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet')
# df_val = pd.read_parquet('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-02.parquet')

### dataset 2023
df_train = pd.read_parquet('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet')
df_val = pd.read_parquet('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-02.parquet')

### 1. Q1
Download the data for January and February 2023.

Read the data for January. How many columns are there?
#### Answer: 19

#### Solution:

In [10]:
df_train.shape

(3066766, 19)

### 2. Q2
Now let's compute the `duration` variable. 

It should contain the duration of a ride in minutes.

What's the standard deviation of the trips duration in January?
#### Answer: 42.59

#### Solution:

In [11]:
# trip duration as a timedelta, then converted to minutes
df_train['duration'] = df_train.tpep_dropoff_datetime - df_train.tpep_pickup_datetime
df_train['duration'] = df_train['duration'].dt.total_seconds() / 60

df_train['duration'].describe() 

count    3.066766e+06
mean     1.566900e+01
std      4.259435e+01
min     -2.920000e+01
25%      7.116667e+00
50%      1.151667e+01
75%      1.830000e+01
max      1.002918e+04
Name: duration, dtype: float64

In [12]:
df_train['duration'].std()

42.594351241920904

In [13]:
# same duration computation for the validation month
df_val['duration'] = df_val.tpep_dropoff_datetime - df_val.tpep_pickup_datetime
df_val['duration'] = df_val['duration'].dt.total_seconds() / 60

df_val['duration'].describe() 

count    2.913955e+06
mean     1.601591e+01
std      4.284210e+01
min     -4.361667e+01
25%      7.250000e+00
50%      1.180000e+01
75%      1.876667e+01
max      7.053617e+03
Name: duration, dtype: float64

In [14]:
df_val['duration'].std()

42.84210176105097
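
As a small check of the duration computation, the sketch below builds a toy frame with the same two timestamp columns and converts the difference to minutes. The `toy` frame and its values are made up for illustration and are not taken from the taxi data.

```python
import pandas as pd

# Toy trips (made-up values) with the same timestamp columns as the taxi data.
toy = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(
        ["2023-01-01 00:00:00", "2023-01-01 10:15:00", "2023-01-01 23:50:00"]
    ),
    "tpep_dropoff_datetime": pd.to_datetime(
        ["2023-01-01 00:12:30", "2023-01-01 10:45:00", "2023-01-02 00:05:00"]
    ),
})

# Subtracting two datetime columns gives a timedelta column;
# .dt.total_seconds() converts it to float seconds, and / 60 gives minutes.
toy["duration"] = (
    toy["tpep_dropoff_datetime"] - toy["tpep_pickup_datetime"]
).dt.total_seconds() / 60

print(toy["duration"].tolist())  # [12.5, 30.0, 15.0]
print(toy["duration"].std())     # sample standard deviation (ddof=1, the pandas default)
```

Note that `Series.std()` uses `ddof=1` by default, so it reports the sample standard deviation rather than the population one.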

### 3. Q3
Next, we need to check the distribution of the duration variable. There are some outliers. Let's remove them and keep only the records where the duration was between 1 and 60 minutes (inclusive).

What fraction of the records is left after you drop the outliers?
#### Answer: 98%

#### Solution: 

In [15]:
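len_before = len(df_train)
# keep trips of 1 to 60 minutes (inclusive); .copy() avoids SettingWithCopyWarning on later column assignments
df_train = df_train[(df_train['duration'] >= 1) & (df_train['duration'] <= 60)].copy()
len_after = len(df_train)
100*(len_after/len_before)  # percentage of records kept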

98.1220282212598

In [16]:
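# same filter for validation; .copy() avoids SettingWithCopyWarning on later column assignments
df_val = df_val[(df_val['duration'] >= 1) & (df_val['duration'] <= 60)].copy()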
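
The inclusive range filter can also be written with `Series.between`, which includes both endpoints by default. A minimal sketch on made-up durations (the values in `durations` are hypothetical):

```python
import pandas as pd

# Made-up durations in minutes, including values just outside the 1..60 range.
durations = pd.Series([-5.0, 0.5, 1.0, 12.0, 60.0, 60.5, 300.0, 25.0])

mask = durations.between(1, 60)   # inclusive on both ends: 1 <= duration <= 60
kept = durations[mask]

fraction_kept = len(kept) / len(durations)
print(kept.tolist())      # [1.0, 12.0, 60.0, 25.0]
print(fraction_kept)      # 0.5
```

The boundary values 1.0 and 60.0 are kept, while 0.5 and 60.5 are dropped.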

### 4. Q4
Let's apply one-hot encoding to the pickup and dropoff location IDs. We'll use only these two features for our model.
- Turn the dataframe into a list of dictionaries (remember to re-cast the ids to strings - otherwise it will label encode them)
- Fit a dictionary vectorizer
- Get a feature matrix from it

What's the dimensionality of this matrix (number of columns)?
#### Answer: 515

#### Solution:

In [17]:
categorical = ['PULocationID', 'DOLocationID']
df_train[categorical] = df_train[categorical].astype(str)
df_val[categorical] = df_val[categorical].astype(str)

In [18]:
df_train['PU_DO'] = df_train['PULocationID'] + '_' + df_train['DOLocationID']
df_val['PU_DO'] = df_val['PULocationID'] + '_' + df_val['DOLocationID']

In [22]:
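# The question asks for the two IDs as separate features:
# categorical = ['PULocationID','DOLocationID']
# This run instead uses the combined pickup_dropoff pair plus the trip distance,
# so the matrix below has more columns than the 515 given in the answer.
categorical = ['PU_DO']
numerical = ['trip_distance']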

dv = DictVectorizer()

train_dicts = df_train[categorical + numerical].to_dict(orient='records')
X_train = dv.fit_transform(train_dicts)

val_dicts = df_val[categorical + numerical].to_dict(orient='records')
X_val = dv.transform(val_dicts)

In [23]:
X_train

<3009173x21802 sparse matrix of type '<class 'numpy.float64'>'
	with 6018346 stored elements in Compressed Sparse Row format>

In [24]:
X_val

<2855951x21802 sparse matrix of type '<class 'numpy.float64'>'
	with 5704693 stored elements in Compressed Sparse Row format>

### 5. Q5
Now let's use the feature matrix from the previous step to train a model.
- Train a plain linear regression model with default parameters
- Calculate the RMSE of the model on the training data

What's the RMSE on train?
#### Answer: 3.64

#### Solution:

In [25]:
target = 'duration'
y_train = df_train[target].values 
y_val = df_val[target].values 

In [29]:
## model building (Linear Regression)

lr = LinearRegression()
lr.fit(X_train, y_train)  ### model training

y_pred = lr.predict(X_train)   ### prediction inference

## model evaluation
# RMSE = sqrt(MSE); the `squared=False` argument is deprecated in recent scikit-learn releases
mean_squared_error(y_train, y_pred) ** 0.5  ### root mean squared error

5.026784311948384
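
For reference, RMSE is the square root of the mean of the squared residuals: RMSE = sqrt(mean((y - y_pred)^2)). A small worked check with made-up numbers (not taken from the taxi data):

```python
import numpy as np

y_true = np.array([10.0, 20.0, 30.0, 40.0])   # made-up durations in minutes
y_hat = np.array([12.0, 18.0, 33.0, 35.0])    # made-up predictions

residuals = y_true - y_hat                    # [-2, 2, -3, 5]
mse = np.mean(residuals ** 2)                 # (4 + 4 + 9 + 25) / 4 = 10.5
rmse = np.sqrt(mse)                           # same units as the target (minutes)

print(mse)    # 10.5
print(rmse)   # about 3.24
```

Since RMSE is in the units of the target, a value of about 5 on the taxi data means predictions are off by roughly five minutes in the root-mean-square sense.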

### 6. Q6
Now let's apply this model to the validation dataset (February 2023).

What's the RMSE on validation?
#### Answer: 7.81

#### Solution:

In [30]:
## model building (Linear Regression)

lr = LinearRegression()
lr.fit(X_train, y_train)  ### model training

y_pred = lr.predict(X_val)   ### prediction inference

## model evaluation
# RMSE = sqrt(MSE); the `squared=False` argument is deprecated in recent scikit-learn releases
mean_squared_error(y_val, y_pred) ** 0.5  ### root mean squared error

5.198299504452104
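
The model does not have to be refit before scoring the validation set: fitting once on the training matrix and calling `predict` on each matrix gives both numbers. A sketch on synthetic data, where `rmse` is a hypothetical helper and the arrays are made up:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


def rmse(model, X, y):
    """Hypothetical helper: RMSE of `model` predictions on (X, y)."""
    return mean_squared_error(y, model.predict(X)) ** 0.5


# Synthetic, noise-free data following y = 2x + 1 (not taxi data).
X_tr = np.array([[0.0], [1.0], [2.0], [3.0]])
y_tr = 2 * X_tr.ravel() + 1
X_va = np.array([[4.0], [5.0]])
y_va = 2 * X_va.ravel() + 1

model = LinearRegression().fit(X_tr, y_tr)   # fit once, on training data only

print(rmse(model, X_tr, y_tr))   # close to 0
print(rmse(model, X_va, y_va))   # close to 0, since the data is exactly linear
```

On real data the validation RMSE is usually somewhat higher than the training RMSE, as in the two results above.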