## Homework

The goal of this homework is to train a simple model for predicting the duration of a ride - similar to what we did in this module.

In [1]:
import pandas as pd
import numpy as np

In [2]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

## Q1. Downloading the data

We'll use [the same NYC taxi dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page),
but instead of "**Green** Taxi Trip Records", we'll use "**Yellow** Taxi Trip Records".

Download the data for January and February 2022.

Read the data for January. How many columns are there?

* 16
* 17
* 18
* 19

In [3]:
! wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet -O ../data/yellow_tripdata_2022-01.parquet
! wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-02.parquet -O ../data/yellow_tripdata_2022-02.parquet



--2023-05-22 13:42:36--  https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet
Resolving d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)... 18.160.201.131, 18.160.201.50, 18.160.201.5, ...
Connecting to d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)|18.160.201.131|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 38139949 (36M) [application/x-www-form-urlencoded]
Saving to: ‘../data/yellow_tripdata_2022-01.parquet’


2023-05-22 13:42:36 (173 MB/s) - ‘../data/yellow_tripdata_2022-01.parquet’ saved [38139949/38139949]

--2023-05-22 13:42:36--  https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-02.parquet
Resolving d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)... 18.160.201.126, 18.160.201.5, 18.160.201.131, ...
Connecting to d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)|18.160.201.126|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 456

In [4]:
jan = pd.read_parquet('../data/yellow_tripdata_2022-01.parquet')
jan.shape[1]

19

## Q2. Computing duration

Now let's compute the `duration` variable. It should contain the duration of a ride in minutes. 

What's the standard deviation of the trips duration in January?

* 41.45
* 46.45
* 51.45
* 56.45

In [5]:
jan['duration'] = jan.tpep_dropoff_datetime - jan.tpep_pickup_datetime
jan.duration = jan.duration.apply(lambda td: td.total_seconds() / 60)

jan[['tpep_dropoff_datetime', 'tpep_pickup_datetime', 'duration']].head()

Unnamed: 0,tpep_dropoff_datetime,tpep_pickup_datetime,duration
0,2022-01-01 00:53:29,2022-01-01 00:35:40,17.816667
1,2022-01-01 00:42:07,2022-01-01 00:33:43,8.4
2,2022-01-01 01:02:19,2022-01-01 00:53:21,8.966667
3,2022-01-01 00:35:23,2022-01-01 00:25:21,10.033333
4,2022-01-01 01:14:20,2022-01-01 00:36:48,37.533333


In [6]:
np.std(jan.duration)

46.445295712725304

## Q3. Dropping outliers

Next, we need to check the distribution of the `duration` variable. There are some outliers. Let's remove them and keep only the records where the duration was between 1 and 60 minutes (inclusive).

What fraction of the records left after you dropped the outliers?

* 90%
* 92%
* 95%
* 98%

In [7]:
filtered = jan[(jan.duration >= 1) & (jan.duration <= 60)]
filtered.shape[0] / float(jan.shape[0])

0.9827547930522406

## Q4. One-hot encoding

Let's apply one-hot encoding to the pickup and dropoff location IDs. We'll use only these two features for our model. 

* Turn the dataframe into a list of dictionaries
* Fit a dictionary vectorizer 
* Get a feature matrix from it

What's the dimensionality of this matrix (number of columns)?

* 2
* 155
* 345
* 515
* 715

In [8]:
categorical = ['PULocationID', 'DOLocationID']
filtered.loc[:, categorical] = filtered[categorical].astype(str)
train_df = filtered[categorical]

In [9]:
dv = DictVectorizer(sparse=False)
train_dict = train_df.to_dict(orient='records')
X_train = dv.fit_transform(train_dict)

In [10]:
X_train.shape[1]

515

## Q5. Training a model

Now let's use the feature matrix from the previous step to train a model. 

* Train a plain linear regression model with default parameters 
* Calculate the RMSE of the model on the training data

What's the RMSE on train?

* 6.99
* 11.99
* 16.99
* 21.99

In [11]:
target = filtered.duration
y_train = target.values

lr = LinearRegression()
lr.fit(X_train, y_train)

In [12]:
y_pred = lr.predict(X_train)

mean_squared_error(y_train, y_pred, squared=False)

6.986190689964798

## Q6. Evaluating the model

Now let's apply this model to the validation dataset (February 2022). 

What's the RMSE on validation?

* 7.79
* 12.79
* 17.79
* 22.79

In [13]:
feb = pd.read_parquet('../data/yellow_tripdata_2022-02.parquet')
feb['duration'] = feb.tpep_dropoff_datetime - feb.tpep_pickup_datetime
feb.duration = feb.duration.apply(lambda td: td.total_seconds() / 60)

feb[['tpep_dropoff_datetime', 'tpep_pickup_datetime', 'duration']].head()

Unnamed: 0,tpep_dropoff_datetime,tpep_pickup_datetime,duration
0,2022-02-01 00:19:24,2022-02-01 00:06:58,12.433333
1,2022-02-01 00:55:55,2022-02-01 00:38:22,17.55
2,2022-02-01 00:26:59,2022-02-01 00:03:20,23.65
3,2022-02-01 00:28:05,2022-02-01 00:08:00,20.083333
4,2022-02-01 00:33:07,2022-02-01 00:06:48,26.316667


In [14]:
feb_filtered = feb[(feb.duration >= 1) & (feb.duration <= 60)]
feb_filtered.loc[:, categorical] = filtered[categorical].astype(str)
feb_filtered = feb_filtered[
    ~(
        pd.isna(feb_filtered.PULocationID) |
        pd.isna(feb_filtered.DOLocationID)
   )
]
feb_X = feb_filtered[categorical]

In [15]:
feb_train_dict = feb_X.to_dict(orient='records')
X_val = dv.transform(feb_train_dict)

In [16]:
feb_target = feb_filtered.duration
feb_target.fillna(-1)
y_val = feb_target.values

In [17]:
y_pred = lr.predict(X_val)
mean_squared_error(y_val, y_pred, squared=False)

11.152070593951542

## Submit the results

* Submit your results here: https://forms.gle/uYTnWrcsubi2gdGV7
* You can submit your solution multiple times. In this case, only the last submission will be used
* If your answer doesn't match options exactly, select the closest one


## Deadline

The deadline for submitting is 23 May 2023 (Tuesday), 23:00 CEST (Berlin time). 

After that, the form will be closed.