In [2]:
import pandas as pd
import numpy as np

## Q1. Downloading the data
Read the data for January. How many records are there?

In [3]:
df_jan_21 = pd.read_parquet('../data/fhv_tripdata_2021-01.parquet')

In [4]:
# Number of records - For-Hire Vehicle Trip Records in January 2021
df_jan_21.shape[0]

1154112

## Q2. Computing duration
Now let's compute the duration variable. It should contain the duration of a ride in minutes.

What's the average trip duration in January?
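
As a minimal sketch of the computation, the snippet below uses a hand-made two-row frame. The timestamps are illustrative assumptions, not values from the dataset. Subtracting two datetime columns yields a timedelta column, which `.dt.total_seconds()` converts to a plain number.

```python
import pandas as pd

# Hypothetical two-row sample using the same column names as the FHV data
sample = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(["2021-01-01 00:00:00", "2021-01-01 10:15:00"]),
    "dropOff_datetime": pd.to_datetime(["2021-01-01 00:17:00", "2021-01-01 10:45:30"]),
})

# datetime - datetime gives a timedelta; convert seconds to minutes
delta = sample["dropOff_datetime"] - sample["pickup_datetime"]
sample["trip_duration_min"] = delta.dt.total_seconds() / 60

durations = sample["trip_duration_min"].tolist()  # [17.0, 30.5]
```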

In [5]:
df_jan_21['trip_duration_min'] = df_jan_21['dropOff_datetime'] - df_jan_21['pickup_datetime']

In [6]:
# Convert the timedelta column to minutes with the vectorized .dt accessor
df_jan_21['trip_duration_min'] = df_jan_21['trip_duration_min'].dt.total_seconds() / 60

In [7]:
df_jan_21['trip_duration_min'].mean()

19.167224093791006

## Data preparation
Check the distribution of the duration variable. There are some outliers.

Let's remove them and keep only the records where the duration was between 1 and 60 minutes (inclusive).

How many records did you drop?
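
The inclusive range filter can be sketched with `Series.between`, which includes both endpoints by default. The six duration values are made up for illustration.

```python
import pandas as pd

# Hypothetical durations in minutes, including values on and outside the bounds
durations = pd.Series([0.5, 1.0, 17.0, 60.0, 61.2, 423.0])

# between() is inclusive on both ends by default, so 1.0 and 60.0 are kept
keep = durations.between(1, 60)
kept = durations[keep]

n_dropped = len(durations) - len(kept)  # 0.5, 61.2 and 423.0 are dropped
```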

In [8]:
df_jan_21_dropped = df_jan_21[(df_jan_21['trip_duration_min'] >= 1) & (df_jan_21['trip_duration_min'] <= 60)]

In [9]:
df_jan_21_dropped.shape[0]

1109826

In [10]:
df_jan_21.shape[0] - df_jan_21_dropped.shape[0]

44286

## Q3. Missing values
The features we'll use for our model are the pickup and dropoff location IDs.

However, these columns have many missing values. Let's replace them with "-1".

What's the fraction of missing values for the pickup location ID, i.e. the fraction of "-1"s after filling the NAs?
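
A small sketch of the idea on made-up values: the mean of a boolean mask is the fraction of `True` entries, so comparing the filled column with `-1` gives the fraction directly. The five values below are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical pickup IDs with three of five values missing
pu = pd.Series([np.nan, 12.0, np.nan, np.nan, 7.0], name="PUlocationID")

filled = pu.fillna(-1)

# Mean of a boolean Series = share of True values
fraction_missing = (filled == -1).mean()  # 3 of 5 values
```

Multiplying the result by 100 gives the percentage form.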


In [11]:
# Percentage (not fraction) of missing values per column
df_jan_21.isnull().mean()*100

dispatching_base_num        0.000000
pickup_datetime             0.000000
dropOff_datetime            0.000000
PUlocationID               83.030676
DOlocationID               14.055828
SR_Flag                   100.000000
Affiliated_base_number      0.076682
trip_duration_min           0.000000
dtype: float64

In [12]:
df_jan_21[['PUlocationID', 'DOlocationID']] = df_jan_21[['PUlocationID', 'DOlocationID']].replace(np.nan, -1)

In [13]:
# Percentage of pickup location IDs equal to -1 after filling the NAs
len(df_jan_21[df_jan_21['PUlocationID']== -1])*100/len(df_jan_21)

83.03067639882438

## Q4. One-hot encoding
Let's apply one-hot encoding to the pickup and dropoff location IDs. We'll use only these two features for our model.

- Turn the dataframe into a list of dictionaries
- Fit a dictionary vectorizer
- Get a feature matrix from it

What's the dimensionality of this matrix (the number of columns)?
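
The steps above can be sketched on three hand-written records. The ID strings are illustrative assumptions. With string values, `DictVectorizer` creates one column per distinct `feature=value` pair, so the column count equals the number of distinct pickup IDs plus the number of distinct dropoff IDs.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical records; values are strings so they are one-hot encoded
records = [
    {"PUlocationID": "-1.0", "DOlocationID": "72.0"},
    {"PUlocationID": "61.0", "DOlocationID": "72.0"},
    {"PUlocationID": "-1.0", "DOlocationID": "14.0"},
]

dv = DictVectorizer()
X = dv.fit_transform(records)  # sparse matrix, one row per record

n_columns = X.shape[1]  # 2 distinct pickups + 2 distinct dropoffs
feature_names = sorted(dv.vocabulary_)  # e.g. 'PUlocationID=-1.0'
```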

In [14]:
from sklearn.feature_extraction import DictVectorizer

In [15]:
categorical_features = ['PUlocationID', 'DOlocationID']
# Work on an explicit copy so the assignment below does not raise SettingWithCopyWarning
df_jan_21_dropped = df_jan_21_dropped.copy()
df_jan_21_dropped[categorical_features] = df_jan_21_dropped[categorical_features].astype(str)


In [16]:
dict_vectorizer = DictVectorizer()

In [17]:
df_jan_dict = df_jan_21_dropped[categorical_features].to_dict(orient='records')

In [18]:
X_train = dict_vectorizer.fit_transform(df_jan_dict)

In [19]:
X_train.shape[1]

525

In [20]:
y_train = df_jan_21_dropped['trip_duration_min'].values

In [21]:
y_train

array([17.        , 17.        ,  8.28333333, ..., 16.2       ,
       19.43333333, 36.        ])

## Q5. Training a model
Now let's use the feature matrix from the previous step to train a model.

- Train a plain linear regression model with default parameters
- Calculate the RMSE of the model on the training data

What's the RMSE on train?
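
For reference, RMSE is the square root of the mean squared difference between targets and predictions. A minimal sketch with NumPy, using made-up numbers rather than model output:

```python
import numpy as np

# Hypothetical targets and predictions, in minutes
y_true = np.array([10.0, 20.0, 30.0])
y_hat = np.array([12.0, 18.0, 33.0])

errors = y_true - y_hat  # [-2., 2., -3.]

# Square the errors, average them, then take the square root
rmse_manual = np.sqrt(np.mean(errors ** 2))  # sqrt((4 + 4 + 9) / 3)
```

Because it does not depend on a particular scikit-learn keyword argument, this form can also serve as a cross-check on the library result.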

In [22]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [23]:
model_lr = LinearRegression()

In [24]:
model_lr.fit(X_train, y_train)

LinearRegression()

In [25]:
y_pred = model_lr.predict(X_train)
rmse = mean_squared_error(y_train, y_pred, squared=False)

In [26]:
rmse

10.52851910720539

## Q6. Evaluating the model
Now let's apply this model to the validation dataset (Feb 2021).

What's the RMSE on validation?
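
When encoding validation data, the vectorizer fitted on the training data is reused with `transform`, and the values must be preprocessed the same way as in training. The sketch below, on made-up records, shows how a fitted `DictVectorizer` handles a known string category, an unseen category, and a float value that was not cast to `str`.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical training records with string-valued IDs
train_records = [{"PUlocationID": "61.0"}, {"PUlocationID": "-1.0"}]
dv = DictVectorizer()
dv.fit(train_records)

# Same preprocessing as training: the string maps onto its fitted column
val_ok = dv.transform([{"PUlocationID": "61.0"}]).toarray()

# A category unseen during fit has no column and is silently ignored
val_unseen = dv.transform([{"PUlocationID": "99.0"}]).toarray()

# A float is treated as a numeric feature named "PUlocationID", which is
# not in the fitted vocabulary, so it is ignored as well
val_numeric = dv.transform([{"PUlocationID": 61.0}]).toarray()
```

In the last two cases the row is all zeros, so a linear model would predict only its intercept for that record.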

In [27]:

df_feb_21 = pd.read_parquet('../data/fhv_tripdata_2021-02.parquet')

In [36]:
df_feb_21['trip_duration_min'] = df_feb_21['dropOff_datetime'] - df_feb_21['pickup_datetime']

In [37]:
# Convert the timedelta column to minutes with the vectorized .dt accessor
df_feb_21['trip_duration_min'] = df_feb_21['trip_duration_min'].dt.total_seconds() / 60

In [43]:
df_feb_21_dropped = df_feb_21[(df_feb_21['trip_duration_min'] >= 1) & (df_feb_21['trip_duration_min'] <= 60)]

In [44]:
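# Cast location IDs to str as in training; numeric values are not in the fitted
# vocabulary and would be silently ignored. The RMSE recorded below predates this cast.
df_feb_dict = df_feb_21_dropped[categorical_features].astype(str).to_dict(orient='records')
X_val = dict_vectorizer.transform(df_feb_dict)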

In [45]:
y_val = df_feb_21_dropped['trip_duration_min'].values

In [51]:
y_pred_val = model_lr.predict(X_val)
rmse_val = mean_squared_error(y_val, y_pred_val, squared=False)

In [52]:
rmse_val

12.855087247019194