# New York City Taxi Ride Duration Prediction

In this case study, we will build a model to predict the duration of a taxi ride. We will take the following steps:
  * Install the dependencies
  * Load the data as pandas dataframes
  * Define the outcome variable - the variable we are trying to predict.
  * Build features with Deep Feature Synthesis using the [featuretools](https://featuretools.com) package. We will start with simple features and incrementally improve the feature definitions and examine the accuracy of the system.
  


Allocate at least 2-3 hours to go through this case study end to end.

# Install Dependencies 
<p>If you have not done so already, download this repository <a href="https://github.com/Featuretools/DSx/archive/master.zip">from GitHub</a>. Once you have downloaded the archive, unzip it and cd into the directory from the command line. Next, run the command ``./install_osx.sh`` if you are on a Mac or ``./install_linux.sh`` if you are on Linux. This should install all of the dependencies.</p>
<p>If you are on a Windows machine, open the requirements.txt file and install each of the dependencies listed (featuretools, jupyter, pandas, sklearn, numpy).</p>
<p>Once you have installed all of the dependencies, open this notebook. On Mac and Linux, navigate to the directory that you downloaded and run ``jupyter notebook`` to be taken to this notebook in your default web browser. When you open the NewYorkCity_taxi_case_study.ipynb file in the web browser, you can step through the code by clicking the ``Run`` button at the top of the page. If you have any questions about how to use <a href="http://jupyter.org/">Jupyter</a>, refer to Google or the discussion forum.</p>

# Running the Code

In [5]:
import featuretools as ft
import utils
from utils import load_nyc_taxi_data, compute_features, preview, feature_importances
from sklearn.ensemble import GradientBoostingRegressor
from featuretools.primitives import (Minute, Hour, Day, Week, Month,
                                     Weekday, Weekend, Count, Sum, Mean, Median, Std, Min, Max)
import numpy as np
ft.__version__
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# Step 1: Download and load the raw data as pandas dataframes
<p>If you have not yet downloaded the data, you can download it <a href="https://s3.amazonaws.com/mit-dsx-data/nyc-taxi-data.zip">from S3</a>. Once you have downloaded the archive, unzip it and place the nyc-taxi-data folder in the same directory as this notebook.
</p>

In [6]:
trips, pickup_neighborhoods, dropoff_neighborhoods = load_nyc_taxi_data()
preview(trips, 10)

Unnamed: 0,id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,payment_type,trip_duration,pickup_neighborhood,dropoff_neighborhood
0,0,2,2016-01-01 00:00:19,2016-01-01 00:06:31,3,1.32,-73.961258,40.7962,-73.95005,40.787312,2,372.0,AH,C
672146,672146,1,2016-04-29 07:01:31,2016-04-29 07:15:46,1,3.3,-73.949951,40.784653,-73.982536,40.75547,1,855.0,C,AA
672147,672147,2,2016-04-29 07:01:43,2016-04-29 07:09:15,1,1.14,-73.967331,40.75737,-73.954277,40.765282,1,452.0,N,K
672148,672148,1,2016-04-29 07:01:46,2016-04-29 07:07:54,1,1.1,-74.003082,40.727509,-73.984703,40.724377,1,368.0,AB,AC
672149,672149,2,2016-04-29 07:01:46,2016-04-29 07:06:48,2,1.4,-73.990158,40.77235,-73.982147,40.7598,1,302.0,AR,AA
672150,672150,1,2016-04-29 07:01:59,2016-04-29 07:07:33,1,1.2,-73.983681,40.746677,-73.971703,40.762463,2,334.0,AO,A
672151,672151,2,2016-04-29 07:02:11,2016-04-29 07:15:24,2,2.13,-73.994209,40.750999,-73.969391,40.761539,1,793.0,D,AK
672152,672152,1,2016-04-29 07:02:11,2016-04-29 07:06:44,1,1.0,-73.983276,40.770985,-73.98011,40.760666,1,273.0,AR,A
672153,672153,2,2016-04-29 07:02:13,2016-04-29 07:08:36,1,1.17,-73.980141,40.743168,-73.983391,40.754665,1,383.0,Y,AA
672154,672154,1,2016-04-29 07:02:16,2016-04-29 07:04:07,1,0.5,-73.965973,40.765381,-73.970558,40.758724,1,111.0,AK,N


The ``trips`` table has the following fields:
* ``id`` which uniquely identifies the trip
* ``vendor_id`` is the taxi cab company - in our case study we have data from three different cab companies
* ``pickup_datetime`` the time stamp for pickup
* ``dropoff_datetime`` the time stamp for drop-off
* ``passenger_count`` the number of passengers for the trip
* ``trip_distance`` total distance of the trip in miles 
* ``pickup_longitude`` the longitude for pickup
* ``pickup_latitude`` the latitude for pickup
* ``dropoff_longitude`` the longitude of drop-off
* ``dropoff_latitude`` the latitude of drop-off
* ``payment_type`` a numeric code signifying how the passenger paid for the trip: 1 = Credit card, 2 = Cash, 3 = No charge, 4 = Dispute, 5 = Unknown, 6 = Voided
* ``trip_duration`` the duration of the trip, which we would like to predict using the other fields
* ``pickup_neighborhood`` a one- or two-letter id of the neighborhood where the trip started
* ``dropoff_neighborhood`` a one- or two-letter id of the neighborhood where the trip ended
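
The preview rows suggest that ``trip_duration`` is the elapsed time between pickup and drop-off, measured in seconds. The sketch below checks this against two rows from the preview using only the standard library. The helper name `duration_seconds` is ours for illustration and is not part of ``utils``.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

def duration_seconds(pickup: str, dropoff: str) -> float:
    """Elapsed seconds between two timestamps formatted like the trips table."""
    delta = datetime.strptime(dropoff, FMT) - datetime.strptime(pickup, FMT)
    return delta.total_seconds()

# First preview row (trip id 0), whose recorded trip_duration is 372.0
first_row = duration_seconds("2016-01-01 00:00:19", "2016-01-01 00:06:31")
# Second preview row (trip id 672146), whose recorded trip_duration is 855.0
second_row = duration_seconds("2016-04-29 07:01:31", "2016-04-29 07:15:46")
print(first_row, second_row)
```

Both values agree with the ``trip_duration`` column in the preview, which is why the rest of this notebook treats the target as a number of seconds.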

# Step 2: Prepare the Data
Let's create entities and relationships. The three entities in this data are:
* trips 
* pickup_neighborhoods
* dropoff_neighborhoods

This data has the following relationships
* pickup_neighborhoods --> trips (one neighborhood can have multiple trips that start in it. This means pickup_neighborhoods is the ``parent_entity`` and trips is the child entity)
* dropoff_neighborhoods --> trips (one neighborhood can have multiple trips that end in it. This means dropoff_neighborhoods is the ``parent_entity`` and trips is the child entity)
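
In plain pandas terms, a parent-child relationship of this kind is a one-to-many link: the parent key is unique, the child key may repeat, and every child row should point at an existing parent row. The following is a minimal sketch on made-up toy tables (the values are not taken from the real data); it only illustrates the idea and says nothing about how featuretools stores relationships internally.

```python
import pandas as pd

# Toy stand-ins for the real tables (values are made up for illustration).
neighborhoods = pd.DataFrame({
    "neighborhood_id": ["A", "AA", "C"],
    "latitude": [40.76, 40.75, 40.78],
})
toy_trips = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "pickup_neighborhood": ["A", "C", "A", "AA"],
})

# The parent key must be unique; the child key may repeat.
assert neighborhoods["neighborhood_id"].is_unique

# Every child row must point at an existing parent row.
orphans = ~toy_trips["pickup_neighborhood"].isin(neighborhoods["neighborhood_id"])
n_orphans = int(orphans.sum())

# Following the relationship from child to parent is a many-to-one join.
joined = toy_trips.merge(
    neighborhoods,
    left_on="pickup_neighborhood",
    right_on="neighborhood_id",
    how="left",
    validate="many_to_one",
)
print(joined[["id", "pickup_neighborhood", "latitude"]])
```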

In <a href="https://www.featuretools.com/">featuretools</a> (an automated feature engineering software package), we specify the list of entities and relationships as follows:


In [7]:
entities = {
    "trips": (trips, "id", "pickup_datetime"),
    "pickup_neighborhoods": (pickup_neighborhoods, "neighborhood_id"),
    "dropoff_neighborhoods": (dropoff_neighborhoods, "neighborhood_id"),
}

relationships = [("pickup_neighborhoods", "neighborhood_id", "trips", "pickup_neighborhood"),
                 ("dropoff_neighborhoods", "neighborhood_id", "trips", "dropoff_neighborhood")]

Next, we specify the cutoff time for each instance of the ``target_entity``, in this case ``trips``. This timestamp is the last point in time from which DFS may use data when calculating features. In this scenario, that is the pickup time, because we would like to predict the duration using only data available before the trip starts.

For the purposes of the case study, we select only trips that started from January 12th, 2016 onward.

In [15]:
cutoff_time = trips[['id', 'pickup_datetime']]
cutoff_time = cutoff_time[cutoff_time['pickup_datetime'] > "2016-01-12"]
cutoff_time.head(5)

Unnamed: 0,id,pickup_datetime
56311,56311,2016-01-12 00:00:25
56312,56312,2016-01-12 00:02:09
56313,56313,2016-01-12 00:02:25
56314,56314,2016-01-12 00:02:41
56315,56315,2016-01-12 00:03:44
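
To make the role of the cutoff time concrete, here is a small pandas sketch on made-up toy data. It computes a neighborhood-level feature for one trip using only rows that started before that trip's cutoff. The helper `mean_duration_before` is hypothetical, and the choice of a strict "before" comparison is an assumption for illustration; DFS handles this bookkeeping itself, and its exact boundary rule is not covered here.

```python
import pandas as pd

# Made-up trips, all starting in the same neighborhood.
toy_trips = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "pickup_datetime": pd.to_datetime([
        "2016-01-12 08:00:00",
        "2016-01-12 09:00:00",
        "2016-01-12 10:00:00",
        "2016-01-12 11:00:00",
    ]),
    "pickup_neighborhood": ["A", "A", "A", "A"],
    "trip_duration": [300.0, 500.0, 900.0, 100.0],
})

def mean_duration_before(trips, neighborhood, cutoff):
    """Mean duration of trips in `neighborhood` that started strictly before `cutoff`."""
    past = trips[(trips["pickup_neighborhood"] == neighborhood)
                 & (trips["pickup_datetime"] < cutoff)]
    return past["trip_duration"].mean()

# Feature value for trip 3: only trips 1 and 2 are visible at its cutoff,
# so trip 3's own duration and the later trip 4 cannot leak into the feature.
cutoff = toy_trips.loc[toy_trips["id"] == 3, "pickup_datetime"].iloc[0]
value = mean_duration_before(toy_trips, "A", cutoff)
print(value)
```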


# Step 3: Create baseline features using Deep Feature Synthesis

Instead of manually creating features, such as "month of pickup datetime", we can let DFS come up with them automatically. It does this by:
* interpreting the variable types of the columns, e.g. categorical, numeric, and others
* matching the columns to the primitives that can be applied to their variable types
* creating features based on these matches
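
The three steps above can be sketched in a few lines of plain Python. This is a toy illustration, not featuretools' implementation: the type labels assigned to the columns and the small primitive registry are assumptions made for the example.

```python
# Column name -> variable type, for a few columns of the trips table
# (the type labels here are illustrative).
column_types = {
    "pickup_datetime": "datetime",
    "dropoff_datetime": "datetime",
    "trip_distance": "numeric",
    "payment_type": "categorical",
}

# Primitive name -> variable type it accepts (a toy registry, not featuretools' own).
primitive_inputs = {
    "IS_WEEKEND": "datetime",
    "HOUR": "datetime",
}

def candidate_features(columns, primitives):
    """Pair every primitive with every column whose type it accepts."""
    return [
        f"{prim}({col})"
        for prim, accepted in primitives.items()
        for col, col_type in columns.items()
        if col_type == accepted
    ]

names = candidate_features(column_types, primitive_inputs)
print(names)
```

Only the two ``datetime`` columns match, so each primitive yields two candidate features, which mirrors the naming seen in the feature lists in this notebook.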

# Create transform features using transform primitives

As we described in the video, features fall into two major categories, ``transform`` and ``aggregate``. In featuretools, we can create transform features by specifying ``transform`` primitives. Below we specify a ``transform`` primitive called ``weekend``. Here is what it does:

* It can be applied to any ``datetime`` column in the data. 
* For each entry in the column, it assesses whether the entry falls on a weekend and returns a boolean.
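
As a rough standard-library equivalent of what such a primitive computes for a single value (a sketch with a helper name of our own, not the featuretools source):

```python
from datetime import datetime

def is_weekend(ts: datetime) -> bool:
    """True for Saturday (5) or Sunday (6), using Monday == 0."""
    return ts.weekday() >= 5

tuesday = datetime(2016, 1, 12, 0, 0, 25)   # a pickup time that appears in the data
saturday = datetime(2016, 4, 30, 7, 1, 31)  # made-up timestamp that falls on a Saturday
print(is_weekend(tuesday), is_weekend(saturday))
```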

In this specific data, there are two ``datetime`` columns, ``pickup_datetime`` and ``dropoff_datetime``. The tool automatically creates features using the primitive and these two columns, as shown below.

In [12]:
trans_primitives = [Weekend]

features = ft.dfs(entities=entities,
                  relationships=relationships,
                  target_entity="trips",
                  trans_primitives=trans_primitives,
                  agg_primitives=[],
                  ignore_variables={"trips": ["pickup_latitude", "pickup_longitude",
                                              "dropoff_latitude", "dropoff_longitude"]},
                  features_only=True)

*If you're interested in parameters to DFS such as `ignore_variables`, you can learn more about them [here](https://docs.featuretools.com/generated/featuretools.dfs.html#featuretools.dfs).*
<p>Here are the features created.</p>

In [13]:
print ("Number of features: %d" % len(features))
features

Number of features: 13


[<Feature: passenger_count>,
 <Feature: trip_distance>,
 <Feature: payment_type>,
 <Feature: trip_duration>,
 <Feature: pickup_neighborhood>,
 <Feature: dropoff_neighborhood>,
 <Feature: vendor_id>,
 <Feature: IS_WEEKEND(pickup_datetime)>,
 <Feature: IS_WEEKEND(dropoff_datetime)>,
 <Feature: pickup_neighborhoods.latitude>,
 <Feature: pickup_neighborhoods.longitude>,
 <Feature: dropoff_neighborhoods.latitude>,
 <Feature: dropoff_neighborhoods.longitude>]

Now let's compute the features. 

In [14]:
feature_matrix = compute_features(features, cutoff_time)

Progress: 100%|██████████| 5/5 [00:27<00:00,  5.45s/cutoff time]
Finishing computing...


In [16]:
preview(feature_matrix, 5)

Unnamed: 0_level_0,dropoff_neighborhoods.latitude,pickup_neighborhoods.latitude,IS_WEEKEND(pickup_datetime),IS_WEEKEND(dropoff_datetime),passenger_count,trip_distance,dropoff_neighborhood = AD,dropoff_neighborhood = A,dropoff_neighborhood = AA,dropoff_neighborhood = D,...,pickup_neighborhood = AK,pickup_neighborhood = AO,pickup_neighborhood = N,pickup_neighborhood = R,pickup_neighborhood = O,vendor_id,pickup_neighborhoods.longitude,dropoff_neighborhoods.longitude,trip_duration,payment_type
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
56311,40.721435,40.720245,False,False,1,1.61,0,0,0,0,...,0,0,0,0,0,2,-73.987205,-73.998366,645.0,1
691284,40.721435,40.729652,False,False,2,0.61,0,0,0,0,...,0,0,0,0,0,2,-73.991595,-73.998366,160.0,1
691285,40.785005,40.77627,False,False,2,0.88,0,0,0,0,...,0,0,0,0,0,2,-73.982322,-73.97605,295.0,1
691286,40.757707,40.742531,False,False,1,1.9,0,0,1,0,...,0,0,0,0,0,1,-73.977943,-73.986446,1573.0,1
691288,40.761087,40.747126,False,False,1,1.0,0,0,0,0,...,0,1,0,0,0,1,-73.985336,-73.995736,404.0,1


# Step 4: Build the Model

To build a model, we
* Separate the data into a portion for ``training`` (75% in this case) and a portion for ``testing``
* Take the log of the trip duration so that a more linear relationship can be found
* Train a model using a ``GradientBoostingRegressor``
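
A short numpy sketch of the log step, using a few durations (in seconds) that appear in the previews above. Adding 1 before taking the log keeps a zero-second duration finite, and the transform is undone exactly by exponentiating and subtracting 1. `np.log1p` and `np.expm1` are numpy's built-in versions of these two operations.

```python
import numpy as np

durations = np.array([111.0, 372.0, 855.0, 1906.0])  # seconds

y_log = np.log1p(durations)   # same as np.log(durations + 1)
recovered = np.expm1(y_log)   # same as np.exp(y_log) - 1

print(y_log.round(3))
print(recovered.round(1))
```

The log compresses the long right tail of the durations, so a few very long trips have less influence on the squared-error loss used during training.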

In [17]:
# separate the whole feature matrix into train features, train labels,
# test features, and test labels
X_train, y_train, X_test, y_test = utils.get_train_test_fm(feature_matrix, .75)
y_train = np.log(y_train+1)
y_test = np.log(y_test+1)
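
The source of ``utils.get_train_test_fm`` is not shown in this notebook. As an assumption-laden sketch of what a split like this could look like, the hypothetical helper below keeps the rows in their existing order, uses the first 75% for training, and separates the ``trip_duration`` column as the label. The real helper may differ in any of these choices.

```python
import pandas as pd

def split_feature_matrix(fm: pd.DataFrame, train_fraction: float,
                         label: str = "trip_duration"):
    """Split rows in their existing order and separate the label column."""
    n_train = int(len(fm) * train_fraction)
    train, test = fm.iloc[:n_train], fm.iloc[n_train:]
    return (train.drop(columns=[label]), train[label],
            test.drop(columns=[label]), test[label])

# Toy feature matrix with values borrowed from the preview above.
toy_fm = pd.DataFrame({
    "trip_distance": [1.61, 0.61, 0.88, 1.90],
    "trip_duration": [645.0, 160.0, 295.0, 1573.0],
})
toy_X_train, toy_y_train, toy_X_test, toy_y_test = split_feature_matrix(toy_fm, 0.75)
print(len(toy_X_train), len(toy_X_test))
```

Whatever the implementation, the label column has to be removed from the feature columns before fitting, since ``trip_duration`` appears in the list of features produced by DFS.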

In [18]:
model = GradientBoostingRegressor(verbose=True)
model.fit(X_train, y_train)
model.score(X_test, y_test)

      Iter       Train Loss   Remaining Time 
         1           0.4925            3.81m
         2           0.4333            3.97m
         3           0.3843            3.76m
         4           0.3446            3.67m
         5           0.3119            3.61m
         6           0.2852            3.54m
         7           0.2634            3.50m
         8           0.2454            3.46m
         9           0.2305            3.49m
        10           0.2183            3.45m
        20           0.1666            3.02m
        30           0.1558            2.74m
        40           0.1514            2.24m
        50           0.1488            1.86m
        60           0.1472            1.47m
        70           0.1458            1.08m
        80           0.1448           42.75s
        90           0.1440           21.02s
       100           0.1433            0.00s


0.72201075268017556
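
For scikit-learn regressors, ``score`` returns the coefficient of determination R² on the data passed in, here computed on the log-transformed durations. The pure-Python sketch below shows the formula on made-up numbers: a value of 1 means perfect predictions, and 0 means no better than always predicting the mean.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up log-durations and predictions, for illustration only.
y_true = [5.0, 6.0, 7.0, 6.0]
y_hat = [5.5, 6.0, 6.5, 6.0]
score = r_squared(y_true, y_hat)
print(score)
```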

In [50]:
from sklearn.metrics import mean_squared_error

# predict once and reuse the predictions for both metrics
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
mse_train = mean_squared_error(y_train, y_train_pred)
mse_test = mean_squared_error(y_test, y_test_pred)
print('MSE train: %.3f, test: %.3f' % (mse_train, mse_test))
print('Root MSE train: %.3f, test: %.3f' % (np.sqrt(mse_train), np.sqrt(mse_test)))

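MSE train: 0.118, test: 0.133
Root MSE train: 0.343, test: 0.364

Note that these numbers are identical to the ones reported in Step 6. This cell was run (as ``In [50]``) after the Step 6 model had been trained, so it reports that model's errors. The final train loss of the model trained in this step is 0.1433, as shown in the training log above.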


# Step 5: Adding more Transform Primitives

* Add the ``Minute``, ``Hour``, ``Day``, ``Week``, ``Month``, and ``Weekday`` primitives alongside ``Weekend``
* All these transform primitives apply to ``datetime`` columns
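
For a single timestamp, the values these primitives produce can be approximated with the standard library, as in the sketch below. The helper `datetime_parts` is ours, and the use of the ISO week number for ``WEEK`` is an assumption; for the pickup time of trip 56311 the month, day, week, and hour it returns agree with the preview of the feature matrix in this step.

```python
from datetime import datetime

def datetime_parts(ts: datetime) -> dict:
    """Calendar components comparable to the transform primitives listed above."""
    return {
        "MINUTE": ts.minute,
        "HOUR": ts.hour,
        "DAY": ts.day,
        "WEEK": ts.isocalendar()[1],   # ISO week number (assumed convention)
        "MONTH": ts.month,
        "WEEKDAY": ts.weekday(),       # Monday == 0
        "IS_WEEKEND": ts.weekday() >= 5,
    }

parts = datetime_parts(datetime(2016, 1, 12, 0, 0, 25))  # pickup time of trip 56311
print(parts)
```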

In [19]:
trans_primitives = [Minute, Hour, Day, Week, Month, Weekday, Weekend]

features = ft.dfs(entities=entities,
                  relationships=relationships,
                  target_entity="trips",
                  trans_primitives=trans_primitives,
                  agg_primitives=[],
                  ignore_variables={"trips": ["pickup_latitude", "pickup_longitude",
                                              "dropoff_latitude", "dropoff_longitude"]},
                  features_only=True)

In [20]:
print ("Number of features: %d" % len(features))
features

Number of features: 25


[<Feature: passenger_count>,
 <Feature: trip_distance>,
 <Feature: payment_type>,
 <Feature: trip_duration>,
 <Feature: pickup_neighborhood>,
 <Feature: dropoff_neighborhood>,
 <Feature: vendor_id>,
 <Feature: MINUTE(pickup_datetime)>,
 <Feature: MINUTE(dropoff_datetime)>,
 <Feature: HOUR(pickup_datetime)>,
 <Feature: HOUR(dropoff_datetime)>,
 <Feature: DAY(pickup_datetime)>,
 <Feature: DAY(dropoff_datetime)>,
 <Feature: WEEK(pickup_datetime)>,
 <Feature: WEEK(dropoff_datetime)>,
 <Feature: MONTH(pickup_datetime)>,
 <Feature: MONTH(dropoff_datetime)>,
 <Feature: WEEKDAY(pickup_datetime)>,
 <Feature: WEEKDAY(dropoff_datetime)>,
 <Feature: IS_WEEKEND(pickup_datetime)>,
 <Feature: IS_WEEKEND(dropoff_datetime)>,
 <Feature: pickup_neighborhoods.latitude>,
 <Feature: pickup_neighborhoods.longitude>,
 <Feature: dropoff_neighborhoods.latitude>,
 <Feature: dropoff_neighborhoods.longitude>]

Now let's compute the features. 

In [21]:
feature_matrix = compute_features(features, cutoff_time)

Progress: 100%|██████████| 5/5 [00:36<00:00,  7.35s/cutoff time]
Finishing computing...


In [22]:
preview(feature_matrix, 10)

Unnamed: 0_level_0,trip_distance,dropoff_neighborhoods.latitude,MONTH(pickup_datetime),DAY(dropoff_datetime),DAY(pickup_datetime),WEEK(pickup_datetime),MINUTE(dropoff_datetime),HOUR(pickup_datetime),dropoff_neighborhoods.longitude,IS_WEEKEND(dropoff_datetime),...,pickup_neighborhood = AK,pickup_neighborhood = AO,pickup_neighborhood = N,pickup_neighborhood = R,pickup_neighborhood = O,passenger_count,trip_duration,IS_WEEKEND(pickup_datetime),vendor_id,WEEKDAY(dropoff_datetime)
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
56311,1.61,40.721435,1,12,12,2,11,0,-73.998366,False,...,0,0,0,0,0,1,645.0,False,2,1
691284,0.61,40.721435,5,2,2,18,24,12,-73.998366,False,...,0,0,0,0,0,2,160.0,False,2,0
691285,0.88,40.785005,5,2,2,18,27,12,-73.97605,False,...,0,0,0,0,0,2,295.0,False,2,0
691286,1.9,40.757707,5,2,2,18,48,12,-73.986446,False,...,0,0,0,0,0,1,1573.0,False,1,0
691288,1.0,40.761087,5,2,2,18,30,12,-73.995736,False,...,0,1,0,0,0,1,404.0,False,1,0
691289,3.24,40.761492,5,2,2,18,55,12,-73.975899,False,...,0,0,0,0,0,1,1906.0,False,2,0
691290,0.1,40.764723,5,2,2,18,26,12,-73.966696,False,...,1,0,0,0,0,1,156.0,False,1,0
691291,1.6,40.77627,5,2,2,18,37,12,-73.982322,False,...,0,0,0,0,0,1,827.0,False,1,0
691292,1.5,40.764723,5,2,2,18,39,12,-73.966696,False,...,0,0,0,0,0,1,883.0,False,1,0
691293,1.89,40.766488,5,2,2,18,34,12,-73.983998,False,...,0,0,0,0,1,2,592.0,False,2,0


# Step 6: Build the new model

In [23]:
# separate the whole feature matrix into train features, train labels,
# test features, and test labels
X_train, y_train, X_test, y_test = utils.get_train_test_fm(feature_matrix, .75)
y_train = np.log(y_train+1)
y_test = np.log(y_test+1)

In [24]:
model = GradientBoostingRegressor(verbose=True)
model.fit(X_train,y_train)
model.score(X_test,y_test)

      Iter       Train Loss   Remaining Time 
         1           0.4925            5.47m
         2           0.4333            5.36m
         3           0.3843            5.11m
         4           0.3444            4.96m
         5           0.3117            4.91m
         6           0.2848            4.84m
         7           0.2620            4.76m
         8           0.2435            4.80m
         9           0.2282            4.88m
        10           0.2152            4.93m
        20           0.1588            4.31m
        30           0.1415            3.57m
        40           0.1332            3.06m
        50           0.1283            2.44m
        60           0.1252            1.96m
        70           0.1227            1.47m
        80           0.1207           58.23s
        90           0.1191           28.68s
       100           0.1177            0.00s


0.77555576702088913

In [49]:
from sklearn.metrics import mean_squared_error

# predict once and reuse the predictions for both metrics
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
mse_train = mean_squared_error(y_train, y_train_pred)
mse_test = mean_squared_error(y_test, y_test_pred)
print('MSE train: %.3f, test: %.3f' % (mse_train, mse_test))
print('Root MSE train: %.3f, test: %.3f' % (np.sqrt(mse_train), np.sqrt(mse_test)))

MSE train: 0.118, test: 0.133
Root MSE train: 0.343, test: 0.364
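
Because the labels were transformed with ``log(y + 1)``, the root MSE above is measured on the log scale, which is often called the root mean squared logarithmic error (RMSLE). The sketch below computes it in pure Python on made-up durations and converts the reported test value of 0.364 into a multiplicative factor, which gives a rough sense of the size of a typical error in the original units.

```python
import math

def rmsle(y_true, y_pred):
    """Root mean squared error between log(1 + y_true) and log(1 + y_pred)."""
    squared = [(math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

# Made-up durations in seconds, for illustration only.
actual = [300.0, 600.0, 1200.0]
perfect = rmsle(actual, actual)                 # identical predictions give 0
too_long = rmsle(actual, [2 * a + 1 for a in actual])  # predictions about twice too long

# A log-scale error of 0.364 expressed as a multiplicative factor.
factor = math.exp(0.364)
print(perfect, round(too_long, 3), round(factor, 2))
```

A factor of roughly 1.44 means that a typical prediction is off by something on the order of 44% of the true duration, in either direction.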


# Step 7: Add Aggregation Primitives

Now let's add aggregation primitives. These primitives generate features for the parent entities ``pickup_neighborhoods`` and ``dropoff_neighborhoods``, and then add them to the ``trips`` entity, which is the entity for which we are trying to make predictions.

In [42]:
trans_primitives = [Minute, Hour, Day, Week, Month, Weekday, Weekend]
aggregation_primitives = [Count, Sum, Mean, Median, Std, Max, Min]

features = ft.dfs(entities=entities,
                  relationships=relationships,
                  target_entity="trips",
                  trans_primitives=trans_primitives,
                  agg_primitives=aggregation_primitives,
                  ignore_variables={"trips": ["pickup_latitude", "pickup_longitude",
                                              "dropoff_latitude", "dropoff_longitude"]},
                  features_only=True)
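
In pandas terms, an aggregation feature amounts to a group-by on the child table followed by a join back onto each child row. The sketch below shows this on made-up toy data, with column names chosen to mimic the DFS naming. It ignores cutoff times, which DFS applies when it computes the real features, so treat it as an illustration of the shape of the computation only.

```python
import pandas as pd

# Made-up trips for illustration.
toy_trips = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "pickup_neighborhood": ["A", "A", "C", "A"],
    "trip_distance": [1.0, 3.0, 2.0, 2.0],
})

# Aggregate the child rows (trips) for each parent (neighborhood).
per_neighborhood = (
    toy_trips.groupby("pickup_neighborhood")["trip_distance"]
    .agg(["count", "mean", "max"])
    .rename(columns={
        "count": "pickup_neighborhoods.COUNT(trips)",
        "mean": "pickup_neighborhoods.MEAN(trips.trip_distance)",
        "max": "pickup_neighborhoods.MAX(trips.trip_distance)",
    })
)

# Bring the parent-level aggregates back onto each trip.
with_aggs = toy_trips.merge(
    per_neighborhood, left_on="pickup_neighborhood", right_index=True, how="left"
)
print(with_aggs)
```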

In [43]:
print ("Number of features: %d" % len(features))
features

Number of features: 63


[<Feature: passenger_count>,
 <Feature: trip_distance>,
 <Feature: payment_type>,
 <Feature: trip_duration>,
 <Feature: pickup_neighborhood>,
 <Feature: dropoff_neighborhood>,
 <Feature: vendor_id>,
 <Feature: MINUTE(pickup_datetime)>,
 <Feature: MINUTE(dropoff_datetime)>,
 <Feature: HOUR(pickup_datetime)>,
 <Feature: HOUR(dropoff_datetime)>,
 <Feature: DAY(pickup_datetime)>,
 <Feature: DAY(dropoff_datetime)>,
 <Feature: WEEK(pickup_datetime)>,
 <Feature: WEEK(dropoff_datetime)>,
 <Feature: MONTH(pickup_datetime)>,
 <Feature: MONTH(dropoff_datetime)>,
 <Feature: WEEKDAY(pickup_datetime)>,
 <Feature: WEEKDAY(dropoff_datetime)>,
 <Feature: IS_WEEKEND(pickup_datetime)>,
 <Feature: IS_WEEKEND(dropoff_datetime)>,
 <Feature: pickup_neighborhoods.latitude>,
 <Feature: pickup_neighborhoods.longitude>,
 <Feature: dropoff_neighborhoods.latitude>,
 <Feature: dropoff_neighborhoods.longitude>,
 <Feature: pickup_neighborhoods.COUNT(trips)>,
 <Feature: pickup_neighborhoods.SUM(trips.passenger_count)>

In [44]:
feature_matrix = compute_features(features, cutoff_time)

Progress: 100%|██████████| 5/5 [01:33<00:00, 18.63s/cutoff time]
Finishing computing...


In [51]:
preview(feature_matrix, 10)

Unnamed: 0_level_0,pickup_neighborhoods.MEAN(trips.trip_distance),HOUR(pickup_datetime),dropoff_neighborhood = AD,dropoff_neighborhood = A,dropoff_neighborhood = AA,dropoff_neighborhood = D,dropoff_neighborhood = AR,dropoff_neighborhood = C,dropoff_neighborhood = O,dropoff_neighborhood = N,...,dropoff_neighborhoods.MAX(trips.trip_distance),MONTH(pickup_datetime),MONTH(dropoff_datetime),pickup_neighborhoods.MAX(trips.trip_duration),pickup_neighborhoods.SUM(trips.passenger_count),pickup_neighborhoods.longitude,pickup_neighborhoods.MAX(trips.passenger_count),trip_duration,dropoff_neighborhoods.MIN(trips.trip_distance),vendor_id
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
56311,2.978551,0,0,0,0,0,0,0,0,0,...,20.54,1,1,3292.0,2283.0,-73.987205,6.0,645.0,0.0,2
691284,2.232347,12,0,0,0,0,0,0,0,0,...,24.1,5,5,3606.0,34521.0,-73.991595,6.0,160.0,0.0,2
691285,2.062772,12,0,0,0,0,0,0,0,0,...,86.52,5,5,3580.0,36299.0,-73.982322,6.0,295.0,0.0,2
691286,2.125305,12,0,0,1,0,0,0,0,0,...,502.8,5,5,3587.0,31158.0,-73.977943,6.0,1573.0,0.0,1
691288,2.171776,12,0,0,0,0,0,0,0,0,...,26.62,5,5,3598.0,43543.0,-73.985336,6.0,404.0,0.0,1
691289,2.509054,12,0,1,0,0,0,0,0,0,...,25.6,5,5,3561.0,30913.0,-73.998366,6.0,1906.0,0.0,2
691290,1.830726,12,0,0,0,0,0,0,0,0,...,21.6,5,5,3604.0,43212.0,-73.966696,6.0,156.0,0.0,1
691291,2.266902,12,0,0,0,0,0,0,0,0,...,501.4,5,5,3579.0,32656.0,-73.956886,6.0,827.0,0.0,1
691292,2.274509,12,0,0,0,0,0,0,0,0,...,21.6,5,5,3602.0,57862.0,-73.976515,6.0,883.0,0.0,1
691293,1.872252,12,0,0,0,0,1,0,0,0,...,27.83,5,5,3586.0,39612.0,-73.960551,6.0,592.0,0.0,2


# Step 8: Build the new model

In [52]:
# separate the whole feature matrix into train features, train labels,
# test features, and test labels
X_train, y_train, X_test, y_test = utils.get_train_test_fm(feature_matrix, .75)
y_train = np.log(y_train+1)
y_test = np.log(y_test+1)

In [53]:
# note: this may take up to 30 minutes to run
model = GradientBoostingRegressor(verbose=True)
model.fit(X_train, y_train)

      Iter       Train Loss   Remaining Time 
         1           0.4925           13.34m
         2           0.4333           12.82m
         3           0.3843           12.78m
         4           0.3444           12.57m
         5           0.3117           12.42m
         6           0.2848           12.16m
         7           0.2620           11.89m
         8           0.2435           11.79m
         9           0.2282           11.75m
        10           0.2152           11.75m
        20           0.1585           10.86m
        30           0.1420            9.33m
        40           0.1332            7.86m
        50           0.1271            6.49m
        60           0.1238            5.12m
        70           0.1211            3.79m
        80           0.1191            2.49m
        90           0.1176            1.24m
       100           0.1163            0.00s


GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=3, max_features=None,
             max_leaf_nodes=None, min_impurity_decrease=0.0,
             min_impurity_split=None, min_samples_leaf=1,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             n_estimators=100, presort='auto', random_state=None,
             subsample=1.0, verbose=True, warm_start=False)

# Step 9: Evaluate on test data

In [54]:
model.score(X_test,y_test)

0.7780888563898487

We can also make predictions using our model.

In [55]:
y_pred = model.predict(X_test)
y_pred = np.exp(y_pred) - 1 # undo the log we took earlier
y_pred[5:]

array([  557.67992664,   590.2602792 ,  1497.39684679, ...,  1063.48382696,
        1800.89361932,   739.60249439])

# Additional Analysis
<p>Let's look at how important each feature was for the model.</p>

In [56]:
feature_importances(model, feature_matrix.columns, n=15)

1: Feature: trip_distance, 0.314
2: Feature: HOUR(pickup_datetime), 0.126
3: Feature: HOUR(dropoff_datetime), 0.089
4: Feature: WEEKDAY(pickup_datetime), 0.051
5: Feature: dropoff_neighborhoods.latitude, 0.046
6: Feature: dropoff_neighborhoods.longitude, 0.036
7: Feature: dropoff_neighborhoods.MEDIAN(trips.trip_distance), 0.027
8: Feature: WEEKDAY(dropoff_datetime), 0.026
9: Feature: pickup_neighborhoods.longitude, 0.022
10: Feature: dropoff_neighborhoods.MEDIAN(trips.trip_duration), 0.022
11: Feature: IS_WEEKEND(pickup_datetime), 0.021
12: Feature: dropoff_neighborhoods.MEAN(trips.trip_duration), 0.019
13: Feature: pickup_neighborhoods.MEDIAN(trips.trip_distance), 0.018
14: Feature: dropoff_neighborhoods.MEAN(trips.trip_distance), 0.018
15: Feature: payment_type, 0.018
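
The source of ``utils.feature_importances`` is not shown here. A fitted scikit-learn ``GradientBoostingRegressor`` exposes its importances through the ``feature_importances_`` attribute, so a ranking like the one above could be produced by a helper along the following lines. The function `top_features` is hypothetical, and the numbers in the example are made up.

```python
def top_features(importances, names, n=15):
    """Return the n (name, importance) pairs with the largest importance."""
    ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

# Made-up importances for three columns, for illustration only.
names = ["payment_type", "trip_distance", "HOUR(pickup_datetime)"]
importances = [0.02, 0.31, 0.13]

top_two = top_features(importances, names, n=2)
for rank, (name, value) in enumerate(top_two, start=1):
    print("%d: Feature: %s, %.3f" % (rank, name, value))
```

With a real model, the call would look like ``top_features(model.feature_importances_, X_train.columns)``, assuming the column names are in the same order as the columns the model was trained on.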
