# Question 3 
### Can we accurately predict how much a driver will receive in tips based on distance traveled, time spent, hour of the day, and day of the week?

#### Because of limited computing power and time, we analyze one day of the week at a time.

### Applying KNN and Regression

#### Some useful links

https://www.dataquest.io/blog/k-nearest-neighbors-in-python/ - KNN

http://www.statsoft.com/textbook/k-nearest-neighbors - Cross Validation with KNN

http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html - Selecting Classifier for Cross Validation

This notebook handles one type of customer only: those who paid a tip.

In [16]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.path as mplPath
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsRegressor
# train_test_split and GridSearchCV live in sklearn.model_selection;
# sklearn.cross_validation and sklearn.grid_search were removed in scikit-learn 0.20.
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import mean_squared_error
%matplotlib inline

### Reading in the cleaned data

In [2]:
# df = pd.read_csv("datasets/clean-january-2013.csv")  # tippers and non-tippers together
df = pd.read_csv("datasets/cleaner-january-2013.csv")  # only those that paid tips

In [3]:
df.head()

Unnamed: 0.1,Unnamed: 0,weekday,hour,pickup,trip_distance,tip_amount,total_amount,time_spent
0,0,Friday,0,"40.613763,-73.972592",0.0,7.7,38.5,0.0
1,1,Friday,0,"40.61622,-73.97454999999998",2.9,3.25,16.25,11.0
2,2,Friday,0,"40.620231,-73.96423799999998",5.1,5.0,25.0,19.0
3,3,Friday,0,"40.629405,-74.017868",13.36,7.5,45.5,22.0
4,4,Friday,0,"40.63121,-74.017517",2.64,2.2,13.7,11.0


In [4]:
df.tail()

Unnamed: 0.1,Unnamed: 0,weekday,hour,pickup,trip_distance,tip_amount,total_amount,time_spent
7375724,7375724,Wednesday,23,"40.873275,-73.886922",2.8,1.5,13.5,10.0
7375725,7375725,Wednesday,23,"40.890857,-73.908495",0.0,4.41,7.91,1.0
7375726,7375726,Wednesday,23,"40.900912,-74.00320499999998",0.0,23.0,75.5,0.0
7375727,7375727,Wednesday,23,"40.902505,-74.00228799999998",0.0,3.0,86.0,6.0
7375728,7375728,Wednesday,23,"40.907027,-73.909115",0.0,9.5,50.5,0.0


In [5]:
# Keep only the Wednesday trips.
df = df.loc[df['weekday'] == 'Wednesday']

### Generate the training and testing set

In [6]:
# Generate the training set.  Set random_state to be able to replicate results.
train = df.sample(frac=0.8, random_state=1)
# Select anything not in the training set and put it in the testing set.
test = df.loc[~df.index.isin(train.index)]
# Print the shapes of both sets.
print(train.shape)
print(test.shape)

(961593, 8)
(240398, 8)
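The same 80/20 split can also be made with scikit-learn's `train_test_split`. The sketch below uses a small made-up DataFrame (`toy` and its values are illustrative, not the real dataset) to show that both approaches give disjoint sets of the expected sizes. The two methods will generally pick different rows even with the same `random_state`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the Wednesday trips.
toy = pd.DataFrame({
    "trip_distance": [1.2, 2.6, 0.46, 0.95, 2.6, 5.1, 13.36, 2.64, 2.9, 0.8],
    "tip_amount":    [1.8, 2.2, 0.9, 0.5, 4.05, 5.0, 7.5, 2.2, 3.25, 1.0],
})

# Approach used in this notebook: sample 80% and take the complement.
train_a = toy.sample(frac=0.8, random_state=1)
test_a = toy.loc[~toy.index.isin(train_a.index)]

# Alternative: let scikit-learn do the split in one call.
train_b, test_b = train_test_split(toy, test_size=0.2, random_state=1)

print(train_a.shape, test_a.shape)
print(train_b.shape, test_b.shape)
```

The complement trick relies on the index being unique, which holds here because the index is the original row number.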


In [7]:
train.head()

Unnamed: 0.1,Unnamed: 0,weekday,hour,pickup,trip_distance,tip_amount,total_amount,time_spent
6449753,6449753,Wednesday,9,"40.750237,-73.98330099999998",1.2,1.8,10.8,11.0
6585129,6585129,Wednesday,11,"40.763397,-73.996442",2.6,2.2,13.7,12.0
7284201,7284201,Wednesday,22,"40.744727,-73.991235",0.46,0.9,5.9,3.0
6440440,6440440,Wednesday,9,"40.743532,-73.98443799999998",0.95,0.5,8.0,8.0
6608456,6608456,Wednesday,12,"40.722912,-73.99893",2.6,4.05,17.55,18.0


In [8]:
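# Note: total_amount appears to include the tip itself, so keeping it as a
# feature lets the model partly read the answer; the scores below reflect that.
Xtrain = train[['weekday','hour','pickup','trip_distance','total_amount','time_spent']]
ytrain = train[['tip_amount']]
Xtest = test[['weekday','hour','pickup','trip_distance','total_amount','time_spent']]
ytest = test[['tip_amount']]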
Xtrain.shape, ytrain.shape, Xtest.shape, ytest.shape

((961593, 6), (961593, 1), (240398, 6), (240398, 1))

In [9]:
Xtrain.head() # training set data we're interested in

Unnamed: 0,weekday,hour,pickup,trip_distance,total_amount,time_spent
6449753,Wednesday,9,"40.750237,-73.98330099999998",1.2,10.8,11.0
6585129,Wednesday,11,"40.763397,-73.996442",2.6,13.7,12.0
7284201,Wednesday,22,"40.744727,-73.991235",0.46,5.9,3.0
6440440,Wednesday,9,"40.743532,-73.98443799999998",0.95,8.0,8.0
6608456,Wednesday,12,"40.722912,-73.99893",2.6,17.55,18.0


In [10]:
ytrain.head() # corresponding tips

Unnamed: 0,tip_amount
6449753,1.8
6585129,2.2
7284201,0.9
6440440,0.5
6608456,4.05


### Create the matrix

In [11]:
# Set the pickup coordinates aside in case they are useful later.
pickup_train = Xtrain[['pickup']]

Xtrain = Xtrain.join(pd.get_dummies(Xtrain['hour']))
Xtrain = Xtrain.drop(['hour','weekday','pickup'], axis=1)
Xtrain.head()

Unnamed: 0,trip_distance,total_amount,time_spent,0,1,2,3,4,5,6,...,14,15,16,17,18,19,20,21,22,23
6449753,1.2,10.8,11.0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6585129,2.6,13.7,12.0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7284201,0.46,5.9,3.0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
6440440,0.95,8.0,8.0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6608456,2.6,17.55,18.0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [12]:
# Set the pickup coordinates aside in case they are useful later.
pickup_test = Xtest[['pickup']]

Xtest = Xtest.join(pd.get_dummies(Xtest['hour']))
Xtest = Xtest.drop(['hour','weekday','pickup'], axis=1)
Xtest.head()

Unnamed: 0,trip_distance,total_amount,time_spent,0,1,2,3,4,5,6,...,14,15,16,17,18,19,20,21,22,23
6173760,16.9,67.3,25.0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6173766,13.8,49.2,29.0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6173768,16.8,63.3,24.0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6173778,18.8,60.0,25.0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6173779,16.68,63.3,25.0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [13]:
Xtrain.shape

(961593, 27)

In [14]:
print(Xtest.shape)
Xtest.head()

(240398, 27)


Unnamed: 0,trip_distance,total_amount,time_spent,0,1,2,3,4,5,6,...,14,15,16,17,18,19,20,21,22,23
6173760,16.9,67.3,25.0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6173766,13.8,49.2,29.0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6173768,16.8,63.3,24.0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6173778,18.8,60.0,25.0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6173779,16.68,63.3,25.0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
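Calling `pd.get_dummies` separately on the training and testing sets works here because both happen to contain all 24 hours (27 columns each). On a smaller sample an hour could be missing from one set, and the two matrices would no longer line up. A minimal sketch of one way to guard against that, using made-up frames and a hypothetical helper `encode_hour`:

```python
import pandas as pd

# Hypothetical toy frames; the test frame happens to lack hour 23.
train_toy = pd.DataFrame({"hour": [0, 9, 23], "trip_distance": [1.2, 2.6, 0.5]})
test_toy = pd.DataFrame({"hour": [0, 9], "trip_distance": [16.9, 13.8]})

def encode_hour(frame, hours=range(24)):
    """One-hot encode 'hour' with a fixed set of columns hour_0 .. hour_23."""
    dummies = pd.get_dummies(frame["hour"], prefix="hour")
    expected = ["hour_%d" % h for h in hours]
    # reindex adds any missing hour columns (filled with 0) and fixes the order.
    dummies = dummies.reindex(columns=expected, fill_value=0).astype(int)
    return frame.drop(columns="hour").join(dummies)

train_enc = encode_hour(train_toy)
test_enc = encode_hour(test_toy)
print(train_enc.shape, test_enc.shape)
```

The `hour_` prefix also keeps every column name a string. Mixing integer and string column names is something newer scikit-learn releases may reject when fitting on a DataFrame.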


### Cross Validation with KNeighborsRegressor

http://scikit-learn.org/stable/modules/cross_validation.html - 3.1.1. Computing cross-validated metrics

In [21]:
%%time
cross_validation = cross_val_score(KNeighborsRegressor(),Xtrain, ytrain,cv=5)
cross_validation

array([ 0.83574942,  0.85191105,  0.85379918,  0.85866424,  0.8598996 ])

In [24]:
print("R^2: %0.2f (+/- %0.2f)" % (cross_validation.mean(), cross_validation.std() * 2))

R^2: 0.85 (+/- 0.02)
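For a regressor, `cross_val_score` defaults to the estimator's own `score` method, which for `KNeighborsRegressor` is the coefficient of determination R^2, so the five values above are R^2 scores and not a classification accuracy. To get a cross-validated error in dollars instead, a `scoring` argument can be passed. The sketch below does this on synthetic stand-in data (the arrays `X` and `y` are made up for illustration).

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (not the taxi data).
rng = np.random.RandomState(0)
X = rng.uniform(0, 20, size=(500, 2))
y = 0.5 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(0, 0.5, size=500)

model = KNeighborsRegressor()

# Default scoring for a regressor is R^2.
r2_scores = cross_val_score(model, X, y, cv=5)

# scikit-learn maximizes scores, so the MSE is reported negated.
neg_mse = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
rmse_scores = np.sqrt(-neg_mse)

print("R^2:  %0.2f (+/- %0.2f)" % (r2_scores.mean(), r2_scores.std() * 2))
print("RMSE: %0.2f (+/- %0.2f)" % (rmse_scores.mean(), rmse_scores.std() * 2))
```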


In [22]:
%%time 
clf = KNeighborsRegressor().fit(Xtrain, ytrain)
score = clf.score(Xtest, ytest)
print("R^2 on the test set: %.3f" % score)

R^2 on the test set: 0.866
Wall time: 2min 13s
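With its default settings (`n_neighbors=5`, uniform weights, Euclidean distance), `KNeighborsRegressor` predicts a tip by averaging the tips of the five closest training trips. The sketch below reproduces that by hand on random made-up data and compares the result with scikit-learn's prediction.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Random stand-in data (not the taxi data).
rng = np.random.RandomState(0)
X_train = rng.uniform(0, 10, size=(200, 3))
y_train = rng.uniform(0, 5, size=200)
X_new = rng.uniform(0, 10, size=(4, 3))

k = 5
knn = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
sk_pred = knn.predict(X_new)

manual_pred = []
for x in X_new:
    # Euclidean distance from x to every training point.
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # Indices of the k closest training points.
    nearest = np.argsort(dists)[:k]
    # Uniform weights: the prediction is the plain mean of their targets.
    manual_pred.append(y_train[nearest].mean())
manual_pred = np.array(manual_pred)

print(np.allclose(manual_pred, sk_pred))
```

Because the distance is computed on raw values, columns with a large numeric range (such as `total_amount`) weigh far more in the neighbor search than the 0/1 hour indicators.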


### Computing error


With the point predictions in hand, we can measure how far off they are. We use the mean squared error (MSE) and its square root (RMSE), which is expressed in the same units as the tips.

In [27]:
%%time
# mean_squared_error expects (y_true, y_pred); the value is the same either way.
mse = mean_squared_error(ytest, clf.predict(Xtest))

Wall time: 1min 21s


In [28]:
print("MSE = ",mse)
print("RMSE = ",np.sqrt(mse))

MSE =  0.645572972828
RMSE =  0.803475558326
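As a small check on what these two numbers mean, the sketch below computes MSE and RMSE by hand on three made-up values and compares them with `mean_squared_error`. The arrays are illustrative only.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up tips and predictions, in dollars.
y_true = np.array([2.0, 4.0, 6.0])
y_pred = np.array([2.5, 3.5, 7.0])

# MSE: the mean of the squared differences.
mse_manual = np.mean((y_true - y_pred) ** 2)
# RMSE: the square root of the MSE, back in dollars.
rmse_manual = np.sqrt(mse_manual)

mse_sklearn = mean_squared_error(y_true, y_pred)

print("MSE  =", mse_manual, "(sklearn:", mse_sklearn, ")")
print("RMSE =", rmse_manual)
```

Read this way, the RMSE of about 0.80 reported above means the predictions are typically off by roughly 80 cents, with large misses counting more heavily than small ones.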


In [29]:
%%time
np.column_stack((clf.predict(Xtest),ytest))

Wall time: 1min 20s


array([[ 10.14,  10.  ],
       [  7.43,   8.2 ],
       [  7.39,   6.  ],
       ..., 
       [  2.7 ,   2.7 ],
       [ 17.72,  23.  ],
       [ 25.15,   3.  ]])
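The first column is the prediction and the second is the actual tip. A single average error hides individual misses such as the last row (25.15 predicted against 3.00 paid), which appears to match the trip in `df.tail()` with a $3 tip on an $86 total and zero recorded distance. A minimal sketch for ranking predictions by absolute error, using only the six pairs visible in the truncated output (the column names are our own choice):

```python
import numpy as np
import pandas as pd

# The six (predicted, actual) pairs visible in the truncated output.
pairs = np.array([
    [10.14, 10.0],
    [7.43, 8.2],
    [7.39, 6.0],
    [2.7, 2.7],
    [17.72, 23.0],
    [25.15, 3.0],
])

results = pd.DataFrame(pairs, columns=["predicted", "actual"])
results["abs_error"] = (results["predicted"] - results["actual"]).abs()

# Sort so the worst predictions come first.
worst = results.sort_values("abs_error", ascending=False)
print(worst.head(3))
```

On the full test set the same idea applies by building `results` from `clf.predict(Xtest)` and `ytest`, which helps spot suspicious records worth cleaning.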