## Homework

The goal of this homework is to train a simple model for predicting the duration of a taxi ride.

In [1]:
import pandas as pd

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

## Q1. Downloading the data

Use the "**Yellow** Taxi Trip Records" in the [NYC taxi dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page).

Download the data for January and February 2023.
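
One way to fetch the two files is sketched below. The CloudFront base URL and the helper names `trip_data_url` and `download_month` are assumptions for illustration, not part of the homework; check the dataset page for the current links. The block only prints the URLs, so nothing is downloaded until `download_month` is called.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Assumption: the TLC parquet files are served from this CloudFront prefix.
# Verify against the dataset page before relying on it.
BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"


def trip_data_url(year: int, month: int, taxi: str = "yellow") -> str:
    """Build the (assumed) download URL for one month of trip records."""
    return f"{BASE_URL}/{taxi}_tripdata_{year}-{month:02d}.parquet"


def download_month(year: int, month: int, data_dir: str = "./data") -> Path:
    """Download one month into data_dir, skipping files that already exist."""
    target = Path(data_dir) / f"yellow_tripdata_{year}-{month:02d}.parquet"
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(trip_data_url(year, month), target)
    return target


# Only the URLs are printed here; call download_month(2023, 1) to fetch a file.
for month in (1, 2):
    print(trip_data_url(2023, month))
```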

Read the data for January. How many columns are there?

* 16
* 17
* 18
* 19

In [2]:
df_train = pd.read_parquet('./data/yellow_tripdata_2023-01.parquet')
print(f'No. of columns: {df_train.shape[1]}')

No. of columns: 19


## Q2. Computing duration

Compute the `duration` variable in minutes. 

What's the standard deviation of the trip durations in January?

* 32.59
* 42.59
* 52.59
* 62.59

In [3]:
df_train['duration'] = df_train['tpep_dropoff_datetime'] - df_train['tpep_pickup_datetime']
# Vectorized conversion from timedelta to minutes (avoids a per-row Python lambda)
df_train['duration'] = df_train['duration'].dt.total_seconds() / 60
df_train['duration']

0           8.433333
1           6.316667
2          12.750000
3           9.616667
4          10.833333
             ...    
3066761    13.983333
3066762    19.450000
3066763    24.516667
3066764    13.000000
3066765    14.400000
Name: duration, Length: 3066766, dtype: float64

In [4]:
df_train['duration'].describe()

count    3.066766e+06
mean     1.566900e+01
std      4.259435e+01
min     -2.920000e+01
25%      7.116667e+00
50%      1.151667e+01
75%      1.830000e+01
max      1.002918e+04
Name: duration, dtype: float64

In [5]:
print(f"Standard deviation of duration: {df_train['duration'].std():.2f}")

Standard deviation of duration: 42.59
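
As a sanity check on the duration computation, here is a minimal sketch on two made-up trips. The timestamps are hypothetical; only the column names match the TLC files.

```python
import pandas as pd

# Toy data (hypothetical values) with the same column names as the TLC files.
toy = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(
        ["2023-01-01 00:00:00", "2023-01-01 10:15:00"]
    ),
    "tpep_dropoff_datetime": pd.to_datetime(
        ["2023-01-01 00:08:30", "2023-01-01 11:00:00"]
    ),
})

# Subtracting two datetime columns yields a timedelta column;
# .dt.total_seconds() converts it to float seconds in one vectorized step.
delta = toy["tpep_dropoff_datetime"] - toy["tpep_pickup_datetime"]
toy["duration"] = delta.dt.total_seconds() / 60

print(toy["duration"].tolist())  # [8.5, 45.0]
```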


## Q3. Dropping outliers

Check the distribution of the `duration` variable. There are some outliers. Remove them and keep only the records where the duration was between 1 and 60 minutes (inclusive).

What fraction of the records is left after dropping the outliers?

* 90%
* 92%
* 95%
* 98%

In [6]:
initial_row_count = df_train.shape[0]
initial_row_count

3066766

In [7]:
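# .copy() makes the filtered frame independent, so later column assignments
# do not trigger pandas' SettingWithCopyWarning
df_train = df_train[(df_train['duration'] >= 1) & (df_train['duration'] <= 60)].copy()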
reduced_row_count = df_train.shape[0]
reduced_row_count

3009173

In [8]:
fraction_left = reduced_row_count / initial_row_count
print(f'Fraction of records left: {fraction_left:.0%}')

Fraction of records left: 98%
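
The inclusive bounds matter at the edges: trips of exactly 1 or 60 minutes are kept. A small sketch with made-up durations, using `Series.between` as an equivalent way to write the same filter:

```python
import pandas as pd

# Hypothetical durations in minutes, chosen to sit on and around the bounds.
durations = pd.Series([0.5, 1.0, 30.0, 60.0, 60.1], name="duration")

# between() includes both endpoints by default, matching (>= 1) & (<= 60).
mask = durations.between(1, 60)
kept = durations[mask]

print(kept.tolist())         # [1.0, 30.0, 60.0]
print(f"{mask.mean():.0%}")  # 60%
```

The mean of a boolean mask is the fraction of `True` values, which gives the share of records kept without storing row counts separately.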


## Q4. One-hot encoding

Apply one-hot encoding to the pickup and dropoff location IDs. Use only these two features for the model. 

* Turn the dataframe into a list of dictionaries (remember to cast the IDs to strings first; otherwise the vectorizer treats them as plain numeric values instead of one-hot encoding them)
* Fit a dictionary vectorizer
* Get a feature matrix from it

What's the dimensionality of this matrix (number of columns)?

* 2
* 155
* 345
* 515
* 715

In [9]:
categorical = ['PULocationID', 'DOLocationID']

df_train[categorical] = df_train[categorical].astype(str)

In [10]:
train_dicts = df_train[categorical].to_dict(orient='records')

dv = DictVectorizer()
X_train = dv.fit_transform(train_dicts)

In [11]:
print(f'Dimensionality of feature matrix: {len(dv.feature_names_)}')

Dimensionality of feature matrix: 515
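
To see why the IDs are cast to strings, the sketch below fits `DictVectorizer` on the same toy IDs twice, once as integers and once as strings. The ID values are made up for illustration.

```python
from sklearn.feature_extraction import DictVectorizer

# The same three hypothetical pickups, once as ints and once as strings.
numeric_ids = [{"PULocationID": 161}, {"PULocationID": 43}, {"PULocationID": 161}]
string_ids = [{"PULocationID": "161"}, {"PULocationID": "43"}, {"PULocationID": "161"}]

dv_num = DictVectorizer()
X_num = dv_num.fit_transform(numeric_ids)

dv_str = DictVectorizer()
X_str = dv_str.fit_transform(string_ids)

# Numeric values become a single column that holds the raw number.
print(dv_num.feature_names_)  # ['PULocationID']
# String values become one indicator column per distinct value.
print(dv_str.feature_names_)  # ['PULocationID=161', 'PULocationID=43']
print(X_num.shape, X_str.shape)  # (3, 1) (3, 2)
```

With integer IDs a linear model would fit one coefficient to the zone number itself, which implies an ordering between zones that has no meaning.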


## Q5. Training a model

Use the feature matrix from the previous step to train a model. 

* Train a plain linear regression model with default parameters 
* Calculate the RMSE of the model on the training data

What's the RMSE on train?

* 3.64
* 7.64
* 11.64
* 16.64

In [12]:
target = 'duration'
y_train = df_train[target].values

lr = LinearRegression()
lr.fit(X_train, y_train)

y_pred = lr.predict(X_train)

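# RMSE is the square root of MSE; taking the root explicitly avoids the
# `squared=False` argument, which is deprecated in newer scikit-learn releases
print(f'Train RMSE: {mean_squared_error(y_train, y_pred) ** 0.5:.2f}')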

Train RMSE: 7.65
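
RMSE is the square root of the mean squared difference between targets and predictions, so it is expressed in the same unit as the target (minutes here). A minimal sketch that computes it by hand on made-up numbers:

```python
import numpy as np

# Hypothetical targets and predictions, in minutes.
y_true = np.array([10.0, 20.0, 30.0])
y_hat = np.array([12.0, 18.0, 33.0])

# Errors are -2, 2, -3; squared they are 4, 4, 9; their mean is 17/3.
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))

print(f"{rmse:.2f}")  # 2.38
```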


## Q6. Evaluating the model

Apply this model to the validation dataset (February 2023). 

What's the RMSE on validation?

* 3.81
* 7.81
* 11.81
* 16.81

In [13]:
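def read_dataframe(filename):
    if filename.endswith('.csv'):
        df = pd.read_csv(filename)

        df['tpep_dropoff_datetime'] = pd.to_datetime(df['tpep_dropoff_datetime'])
        df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'])
    elif filename.endswith('.parquet'):
        df = pd.read_parquet(filename)
    else:
        # Fail early; otherwise `df` would be undefined further down
        raise ValueError(f'Unsupported file type: {filename}')

    # Trip duration in minutes
    df['duration'] = df['tpep_dropoff_datetime'] - df['tpep_pickup_datetime']
    df['duration'] = df['duration'].dt.total_seconds() / 60

    # Keep trips between 1 and 60 minutes (inclusive); .copy() avoids
    # SettingWithCopyWarning on the assignment below
    df = df[(df['duration'] >= 1) & (df['duration'] <= 60)].copy()

    # Cast location IDs to strings so DictVectorizer one-hot encodes them
    categorical = ['PULocationID', 'DOLocationID']
    df[categorical] = df[categorical].astype(str)

    return df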

In [14]:
df_val = read_dataframe('./data/yellow_tripdata_2023-02.parquet')

In [15]:
val_dicts = df_val[categorical].to_dict(orient='records')
X_val = dv.transform(val_dicts)
y_val = df_val[target].values

In [16]:
y_pred = lr.predict(X_val)

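# RMSE is the square root of MSE; taking the root explicitly avoids the
# `squared=False` argument, which is deprecated in newer scikit-learn releases
print(f'Validation RMSE: {mean_squared_error(y_val, y_pred) ** 0.5:.2f}')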

Validation RMSE: 7.81
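
Note that the validation data goes through `transform`, not `fit_transform`, so the feature columns stay those learned from January. A sketch with made-up IDs showing what happens to a location that appears only in the validation set:

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical IDs; "999" does not occur in the training dictionaries.
train = [{"PULocationID": "161"}, {"PULocationID": "43"}]
val = [{"PULocationID": "161"}, {"PULocationID": "999"}]

dv_toy = DictVectorizer()
dv_toy.fit(train)

# transform() keeps the fitted columns; unseen features are silently ignored.
X_val_toy = dv_toy.transform(val)

print(X_val_toy.shape)               # (2, 2)
print(X_val_toy.toarray().tolist())  # [[1.0, 0.0], [0.0, 0.0]]
```

The row for the unseen ID is all zeros, so a linear model would predict from its intercept alone for that trip.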
