# Homework 1
The goal of this homework is to train a simple model for predicting the duration of a ride, similar to what we did in this module.

## Q1. Downloading the data

We'll use the same [NYC taxi dataset][nyc-taxi-dataset], but instead of "Green Taxi Trip Records", we'll use "For-Hire Vehicle Trip Records".

Download the data for January and February 2021.

Note that you need "For-Hire Vehicle Trip Records", not "High Volume For-Hire Vehicle Trip Records".

Read the data for January. How many records are there?

* 1054112
* 1154112
* 1254112
* 1354112

[nyc-taxi-dataset]: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page

In [1]:
from pathlib import Path
import requests

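def download_nyc_for_hire_vehicle(year_month: str, save_dir: str) -> str:
    """Download one month of FHV trip data and return the local file path."""
    fname = f'fhv_tripdata_{year_month}.parquet'
    path = f'{save_dir}/{fname}'

    # Download only when the file is not already on disk
    if not Path(path).exists():
        r = requests.get(f'https://nyc-tlc.s3.amazonaws.com/trip+data/{fname}', stream=True)
        r.raise_for_status()  # stop here instead of saving an error response

        # Write the response body to disk in small chunks
        with open(path, 'wb') as fout:
            for chunk in r.iter_content(chunk_size=1024):
                fout.write(chunk)

    return path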

In [2]:
import pandas as pd

jan_datafile = download_nyc_for_hire_vehicle('2021-01', './')
df = pd.read_parquet(jan_datafile)

df

|  | dispatching_base_num | pickup_datetime | dropOff_datetime | PUlocationID | DOlocationID | SR_Flag | Affiliated_base_number |
|---|---|---|---|---|---|---|---|
| 0 | B00009 | 2021-01-01 00:27:00 | 2021-01-01 00:44:00 |  |  |  | B00009 |
| 1 | B00009 | 2021-01-01 00:50:00 | 2021-01-01 01:07:00 |  |  |  | B00009 |
| 2 | B00013 | 2021-01-01 00:01:00 | 2021-01-01 01:51:00 |  |  |  | B00013 |
| 3 | B00037 | 2021-01-01 00:13:09 | 2021-01-01 00:21:26 |  | 72.0 |  | B00037 |
| 4 | B00037 | 2021-01-01 00:38:31 | 2021-01-01 00:53:44 |  | 61.0 |  | B00037 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1154107 | B03266 | 2021-01-31 23:43:03 | 2021-01-31 23:51:48 | 7.0 | 7.0 |  | B03266 |
| 1154108 | B03284 | 2021-01-31 23:50:27 | 2021-02-01 00:48:03 | 44.0 | 91.0 |  |  |
| 1154109 | B03285 | 2021-01-31 23:13:46 | 2021-01-31 23:29:58 | 171.0 | 171.0 |  | B03285 |
| 1154110 | B03285 | 2021-01-31 23:58:03 | 2021-02-01 00:17:29 | 15.0 | 15.0 |  | B03285 |


## Q2. Computing duration

Now let's compute the duration variable. It should contain the duration of a ride in minutes.

What's the average trip duration in January?

* 15.16
* 19.16
* 24.16
* 29.16

In [3]:
df['duration'] = df['dropOff_datetime'] - df['pickup_datetime']
df['duration'].mean()

Timedelta('0 days 00:19:10.033445627')
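
The mean above is a `Timedelta`, while the question asks for minutes. One way to get a plain number is to convert the column to float minutes before averaging. This is a minimal sketch on a tiny made-up frame; the name `toy` and its timestamps are illustrative assumptions, not values from the dataset.

```python
import pandas as pd

# Toy data standing in for the real trip records (values are made up).
toy = pd.DataFrame({
    'pickup_datetime': pd.to_datetime(['2021-01-01 00:00:00', '2021-01-01 00:10:00']),
    'dropOff_datetime': pd.to_datetime(['2021-01-01 00:15:00', '2021-01-01 00:40:00']),
})

# Subtracting two datetime columns gives a timedelta column;
# total_seconds() / 60 turns it into float minutes.
delta = toy['dropOff_datetime'] - toy['pickup_datetime']
toy['duration_min'] = delta.dt.total_seconds() / 60

mean_minutes = toy['duration_min'].mean()
print(mean_minutes)
```

Applied to the January result, 19 minutes and 10.03 seconds is roughly 19.17 minutes, which is closest to the 19.16 option.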

## Data preparation

Check the distribution of the duration variable. There are some outliers.

Let's remove them and keep only the records where the duration was between 1 and 60 minutes (inclusive).

How many records did you drop?

## Q3. Missing values

The features we'll use for our model are the pickup and dropoff location IDs.

However, these columns have a lot of missing values. Let's replace them with "-1".

What's the fraction of missing values for the pickup location ID? I.e., the fraction of "-1"s after you filled the NAs.

* 53%
* 63%
* 73%
* 83%

In [4]:
df['duration'].describe()

count                      1154112
mean     0 days 00:19:10.033445627
std      0 days 06:38:41.529882844
min                0 days 00:00:01
25%                0 days 00:07:46
50%                0 days 00:13:24
75%                0 days 00:22:17
max              294 days 00:11:03
Name: duration, dtype: object

In [5]:
df2 = df.loc[(df.duration >= pd.Timedelta(minutes=1)) & (df.duration <= pd.Timedelta(minutes=60))].copy()
print(f'Dropped {len(df) - len(df2):,} records from original dataframe')

Dropped 44,286 records from original dataframe
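
The same inclusive filter can also be written with `Series.between`, which includes both endpoints by default. The sketch below uses a made-up `toy` frame (an assumption for illustration) to check that rides of exactly 1 and exactly 60 minutes are kept.

```python
import pandas as pd

# Durations of 30 s, 1 min, 30 min, 60 min and 60 min 1 s (made-up values).
toy = pd.DataFrame({'duration': pd.to_timedelta([30, 60, 1800, 3600, 3601], unit='s')})

lower = pd.Timedelta(minutes=1)
upper = pd.Timedelta(minutes=60)

# between() keeps rows where lower <= duration <= upper
kept = toy[toy['duration'].between(lower, upper)].copy()
dropped = len(toy) - len(kept)
print(f'Dropped {dropped:,} records')
```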


In [6]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1109826 entries, 0 to 1154111
Data columns (total 8 columns):
 #   Column                  Non-Null Count    Dtype          
---  ------                  --------------    -----          
 0   dispatching_base_num    1109826 non-null  object         
 1   pickup_datetime         1109826 non-null  datetime64[ns] 
 2   dropOff_datetime        1109826 non-null  datetime64[ns] 
 3   PUlocationID            182818 non-null   float64        
 4   DOlocationID            961919 non-null   float64        
 5   SR_Flag                 0 non-null        object         
 6   Affiliated_base_number  1109053 non-null  object         
 7   duration                1109826 non-null  timedelta64[ns]
dtypes: datetime64[ns](2), float64(2), object(3), timedelta64[ns](1)
memory usage: 76.2+ MB


In [7]:
# Fill missing location IDs in the filtered frame with the sentinel value -1
df2['PUlocationID'] = df2['PUlocationID'].fillna(-1)
df2['DOlocationID'] = df2['DOlocationID'].fillna(-1)
df2

|  | dispatching_base_num | pickup_datetime | dropOff_datetime | PUlocationID | DOlocationID | SR_Flag | Affiliated_base_number | duration |
|---|---|---|---|---|---|---|---|---|
| 0 | B00009 | 2021-01-01 00:27:00 | 2021-01-01 00:44:00 | -1.0 | -1.0 |  | B00009 | 0 days 00:17:00 |
| 1 | B00009 | 2021-01-01 00:50:00 | 2021-01-01 01:07:00 | -1.0 | -1.0 |  | B00009 | 0 days 00:17:00 |
| 3 | B00037 | 2021-01-01 00:13:09 | 2021-01-01 00:21:26 | -1.0 | 72.0 |  | B00037 | 0 days 00:08:17 |
| 4 | B00037 | 2021-01-01 00:38:31 | 2021-01-01 00:53:44 | -1.0 | 61.0 |  | B00037 | 0 days 00:15:13 |
| 5 | B00037 | 2021-01-01 00:59:02 | 2021-01-01 01:08:05 | -1.0 | 71.0 |  | B00037 | 0 days 00:09:03 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1154107 | B03266 | 2021-01-31 23:43:03 | 2021-01-31 23:51:48 | 7.0 | 7.0 |  | B03266 | 0 days 00:08:45 |
| 1154108 | B03284 | 2021-01-31 23:50:27 | 2021-02-01 00:48:03 | 44.0 | 91.0 |  |  | 0 days 00:57:36 |
| 1154109 | B03285 | 2021-01-31 23:13:46 | 2021-01-31 23:29:58 | 171.0 | 171.0 |  | B03285 | 0 days 00:16:12 |
| 1154110 | B03285 | 2021-01-31 23:58:03 | 2021-02-01 00:17:29 | 15.0 | 15.0 |  | B03285 | 0 days 00:19:26 |


In [8]:
pu_percent_replaced = len(df2.loc[df2.PUlocationID < 0]) / len(df2)
print(f'Replaced {pu_percent_replaced:.3} of PUlocationID with -1')

Replaced 0.835 of PUlocationID with -1
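
As a cross-check, the fraction of "-1"s after filling should equal the fraction of NAs before filling, and the mean of a boolean mask gives either one directly. A minimal sketch, assuming a small made-up `toy` frame rather than the real data:

```python
import pandas as pd

nan = float('nan')
# Made-up location IDs; NaN marks a missing value.
toy = pd.DataFrame({
    'PUlocationID': [nan, 7.0, nan, nan],
    'DOlocationID': [72.0, 7.0, nan, 61.0],
})

# Fraction of missing pickup IDs before filling (mean of a boolean mask).
missing_fraction = toy['PUlocationID'].isna().mean()

# Replace NAs with the sentinel -1, then measure the share of -1 values.
filled = toy.fillna(-1)
filled_fraction = (filled['PUlocationID'] == -1).mean()

print(f'{missing_fraction:.1%} missing, {filled_fraction:.1%} filled with -1')
```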


## Q4. One-hot encoding

Let's apply one-hot encoding to the pickup and dropoff location IDs. We'll use only these two features for our model.

* Turn the dataframe into a list of dictionaries
* Fit a dictionary vectorizer
* Get a feature matrix from it

What's the dimensionality of this matrix (i.e., the number of columns)?

* 2
* 152
* 352
* 525
* 725


In [9]:
from sklearn.feature_extraction import DictVectorizer

location_columns = ['PUlocationID', 'DOlocationID']
df2[location_columns] = df2[location_columns].astype('str')
records = df2[location_columns].to_dict(orient='records')
dict_vectorizer = DictVectorizer(sparse=False)
transformed = dict_vectorizer.fit_transform(records)

transformed.shape

(1109826, 525)
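
To see where the column count comes from, here is a small sketch with three made-up records (the values are assumptions for illustration). `DictVectorizer` creates one column per distinct `feature=value` pair for string values, so the width of the matrix is the number of distinct pickup IDs plus the number of distinct dropoff IDs. The sketch keeps the default sparse output instead of `sparse=False`.

```python
from sklearn.feature_extraction import DictVectorizer

# Made-up records in the same shape as df2[location_columns].to_dict(orient='records')
records = [
    {'PUlocationID': '-1.0', 'DOlocationID': '72.0'},
    {'PUlocationID': '-1.0', 'DOlocationID': '61.0'},
    {'PUlocationID': '7.0', 'DOlocationID': '7.0'},
]

dv = DictVectorizer()  # sparse output by default
X = dv.fit_transform(records)

# 2 distinct pickup values + 3 distinct dropoff values = 5 columns
print(X.shape)
print(dv.feature_names_)
```

For the full January data, a dense float64 matrix of shape (1109826, 525) needs about 4.7 GB of memory, so the sparse default may be the more practical choice on smaller machines.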