## Homework 1
* Author: Sebastián Contreras
* Email: sebastiancz@live.cl

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [2]:
data = pd.read_parquet("yellow_tripdata_2022-01.parquet")

## Q1. Downloading the data

We'll use [the same NYC taxi dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page),
but instead of "**Green** Taxi Trip Records", we'll use "**Yellow** Taxi Trip Records".

Download the data for January and February 2022.

Read the data for January. How many columns are there?

* 16
* 17
* 18
* 19

In [3]:
data.shape[1]

19

## Q2. Computing duration

Now let's compute the `duration` variable. It should contain the duration of a ride in minutes. 

What's the standard deviation of the trip durations in January?

* 41.45
* 46.45
* 51.45
* 56.45

In [4]:
data["duration"] = (data["tpep_dropoff_datetime"] - data["tpep_pickup_datetime"]).dt.total_seconds()/60

In [5]:
np.round(data["duration"].std(), 2)

46.45
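As a minimal sketch of the duration computation above, the same expression can be checked on a toy frame. The two trips below are made up for illustration and are not taken from the TLC data; only the column names match the notebook.

```python
import pandas as pd

# Toy data: two made-up trips (not from the TLC dataset).
toy = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(
        ["2022-01-01 00:00:00", "2022-01-01 10:15:00"]
    ),
    "tpep_dropoff_datetime": pd.to_datetime(
        ["2022-01-01 00:12:30", "2022-01-01 11:00:00"]
    ),
})

# Subtracting two datetime columns yields a timedelta column;
# .dt.total_seconds() converts it to float seconds, and / 60 to minutes.
toy["duration"] = (
    toy["tpep_dropoff_datetime"] - toy["tpep_pickup_datetime"]
).dt.total_seconds() / 60

durations = toy["duration"].tolist()
print(durations)
```

Using `total_seconds()` rather than the `.dt.seconds` attribute matters for trips longer than a day, since `.dt.seconds` only returns the seconds component of each timedelta.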

## Q3. Dropping outliers

Next, we need to check the distribution of the `duration` variable. There are some outliers. Let's remove them and keep only the records where the duration was between 1 and 60 minutes (inclusive).

What fraction of the records is left after you drop the outliers?

* 90%
* 92%
* 95%
* 98%


In [6]:
old_size = data.shape[0]

In [7]:
data.query("duration >= 1 & duration <= 60", inplace=True)

In [8]:
np.round(data.shape[0] / old_size * 100, 0)

98.0
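The fraction of records retained is the number of rows kept divided by the original row count. A small sketch on made-up durations, assuming `Series.between` (inclusive on both ends by default) as an alternative to `query`:

```python
import pandas as pd

# Toy durations in minutes (made up for illustration).
toy = pd.DataFrame({"duration": [0.5, 1.0, 12.0, 60.0, 61.0, 300.0, 25.0, 8.0]})

old_size = len(toy)

# between(1, 60) keeps 1 <= duration <= 60, matching the inclusive bounds in Q3.
kept = toy[toy["duration"].between(1, 60)]

fraction_kept = len(kept) / old_size
print(f"{fraction_kept:.1%} of the rows remain")
```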

## Q4. One-hot encoding

Let's apply one-hot encoding to the pickup and dropoff location IDs. We'll use only these two features for our model. 

* Turn the dataframe into a list of dictionaries
* Fit a dictionary vectorizer 
* Get a feature matrix from it

What's the dimensionality of this matrix (number of columns)?

* 2
* 155
* 345
* 515
* 715

In [9]:
data[['PULocationID', 'DOLocationID']] = data[['PULocationID', 'DOLocationID']].fillna(-1).astype('int')
data[['PULocationID', 'DOLocationID']] = data[['PULocationID', 'DOLocationID']].astype('str')
train_dicts = data[['PULocationID', 'DOLocationID']].to_dict(orient='records')

In [10]:
vectorizer = DictVectorizer()
X_train = vectorizer.fit_transform(train_dicts)
y_train = data["duration"].values
X_train.shape[0]  # row count; the column count asked for in Q4 is X_train.shape[1]

2421440
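To illustrate how the dimensionality arises, here is a hedged sketch of the three Q4 steps on toy records. The location IDs are invented; the point is that `DictVectorizer` creates one column per distinct `feature=value` pair when the values are strings, so the number of columns is the number of distinct pickup IDs plus the number of distinct dropoff IDs.

```python
from sklearn.feature_extraction import DictVectorizer

# Toy records with string-valued location IDs (values are made up).
toy_dicts = [
    {"PULocationID": "1", "DOLocationID": "2"},
    {"PULocationID": "1", "DOLocationID": "3"},
    {"PULocationID": "4", "DOLocationID": "2"},
]

dv = DictVectorizer()
X = dv.fit_transform(toy_dicts)  # sparse matrix by default

# shape[0] is the number of records, shape[1] the number of one-hot columns.
n_rows, n_cols = X.shape

# vocabulary_ maps each "feature=value" name to its column index.
feature_names = sorted(dv.vocabulary_)
print(n_rows, n_cols, feature_names)
```

Here there are two distinct pickup IDs and two distinct dropoff IDs, giving four columns for three rows. Numeric values would instead be passed through as a single column per feature, which is why the notebook casts the IDs to `str` first.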

## Q5. Training a model

Now let's use the feature matrix from the previous step to train a model. 

* Train a plain linear regression model with default parameters 
* Calculate the RMSE of the model on the training data

What's the RMSE on train?

* 6.99
* 11.99
* 16.99
* 21.99

In [11]:
lr = LinearRegression()
lr.fit(X_train, y_train)

In [12]:
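# np.sqrt of the MSE gives the RMSE; the `squared=False` argument is deprecated in recent scikit-learn releases.
np.round(np.sqrt(mean_squared_error(y_true=y_train, y_pred=lr.predict(X_train))), 2)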

6.99
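For reference, RMSE is the square root of the mean squared difference between targets and predictions. A minimal NumPy sketch on made-up numbers, independent of scikit-learn:

```python
import numpy as np

# Made-up targets and predictions, in minutes.
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([11.0, 13.0, 31.0, 47.0])

# Residuals are -1, 7, -1, -7; their squares sum to 100, so the mean is 25.
rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
print(rmse)
```

Because the errors are squared before averaging, a few large misses (here the two 7-minute errors) dominate the result.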

## Q6. Evaluating the model

Now let's apply this model to the validation dataset (February 2022). 

What's the RMSE on validation?

* 7.79
* 12.79
* 17.79
* 22.79

In [13]:
feb = pd.read_parquet("yellow_tripdata_2022-02.parquet")
feb["duration"] = (feb["tpep_dropoff_datetime"] - feb["tpep_pickup_datetime"]).dt.total_seconds()/60
feb.query("duration >= 1 & duration <= 60", inplace=True)
feb[['PULocationID', 'DOLocationID']] = feb[['PULocationID', 'DOLocationID']].fillna(-1).astype('int')
feb[['PULocationID', 'DOLocationID']] = feb[['PULocationID', 'DOLocationID']].astype('str')
test_dicts = feb[['PULocationID', 'DOLocationID']].to_dict(orient='records')

X_test = vectorizer.transform(test_dicts)
y_test = feb["duration"].values
y_hat = lr.predict(X_test)
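# np.sqrt of the MSE gives the RMSE; the `squared=False` argument is deprecated in recent scikit-learn releases.
np.round(np.sqrt(mean_squared_error(y_true=y_test, y_pred=y_hat)), 2)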

7.79
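The February preprocessing repeats the January steps line for line. One possible way to avoid the duplication is a small helper; `prepare_features` and `CATEGORICAL` are hypothetical names introduced here for illustration, and the toy frame below is made up rather than read from the parquet files.

```python
import pandas as pd

CATEGORICAL = ["PULocationID", "DOLocationID"]  # hypothetical constant


def prepare_features(df):
    """Hypothetical helper: return (list of feature dicts, duration array)."""
    df = df.copy()
    # Trip duration in minutes.
    df["duration"] = (
        df["tpep_dropoff_datetime"] - df["tpep_pickup_datetime"]
    ).dt.total_seconds() / 60
    # Keep trips between 1 and 60 minutes, inclusive.
    df = df[df["duration"].between(1, 60)].copy()
    # Missing IDs become -1, then everything is cast to str for one-hot encoding.
    df[CATEGORICAL] = df[CATEGORICAL].fillna(-1).astype("int").astype("str")
    return df[CATEGORICAL].to_dict(orient="records"), df["duration"].values


# Toy frame: the second trip lasts 20 seconds and is filtered out.
toy = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(
        ["2022-02-01 08:00:00", "2022-02-01 09:00:00", "2022-02-01 10:00:00"]
    ),
    "tpep_dropoff_datetime": pd.to_datetime(
        ["2022-02-01 08:10:00", "2022-02-01 09:00:20", "2022-02-01 10:30:00"]
    ),
    "PULocationID": [132.0, 48.0, None],
    "DOLocationID": [236, 48, 7],
})

dicts, y = prepare_features(toy)
print(dicts, y)
```

With such a helper, the validation step would reduce to calling it on the February frame and passing the dicts to `vectorizer.transform`, which reuses the columns learned from January instead of fitting again.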