# What is the best location to pick up customers on each day of the week in a given month?


As a taxi driver, you always want to know where the best place to pick up customers is: the more customers you get, the more profit you earn. In this project, we use K-means clustering and polynomial regression to predict the best pickup location for a taxi driver.

## Preamble

In [None]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import pandas as pd

## Load the data
You can download the data from this [link](https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2013-01.csv).

In [None]:
df=pd.read_csv("datasets/yellow_tripdata_2013-01.csv")

## Preprocess data
Because the data set is large and contains some outliers, we need to preprocess it: we reduce its size and remove the outliers.

In [None]:
# create a new column called weekday
# (Series.dt.weekday_name was removed in pandas 1.0; day_name() replaces it)
timestamp = pd.to_datetime(df['pickup_datetime'])
df['weekday'] = timestamp.dt.day_name()
df.head()
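
As a quick check of the weekday derivation, here is a small sketch on made-up timestamps (the three dates are illustrative and are not rows from the dataset). It assumes a pandas version that provides `Series.dt.day_name()`.

```python
import pandas as pd

# Three illustrative pickup timestamps (not taken from the dataset)
sample = pd.Series(["2013-01-01 00:15:00",
                    "2013-01-03 08:30:00",
                    "2013-01-06 23:59:00"])

# Parse the strings and map each timestamp to its English weekday name
sample_weekday = pd.to_datetime(sample).dt.day_name()
print(sample_weekday.tolist())
```

The resulting names are plain strings, so they can be compared directly with a value such as `"Thursday"`.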

In [None]:
# drop the columns we do not need
df = df.drop(['vendor_id','passenger_count','trip_distance','rate_code',
              'store_and_fwd_flag','payment_type','fare_amount','surcharge','mta_tax',
             'tip_amount','tolls_amount','total_amount','dropoff_datetime',
              'dropoff_longitude','dropoff_latitude'], axis=1)

# remove garbage rows whose pickup point falls outside a bounding box around New York City
df=df[(df['pickup_latitude'] > 40.492083) & (df['pickup_latitude']<40.944536) &
     (df['pickup_longitude']> -74.267880)& (df['pickup_longitude']< -73.662022)]

df.head()
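
The bounding-box filter above can also be written as a small reusable helper. The sketch below is one possible formulation, not the notebook's own code: the helper name `inside_bbox`, the constant names, and the toy rows are all made up for illustration. Note that `Series.between` includes the boundary values by default, whereas the filter above uses strict inequalities.

```python
import pandas as pd

# Same limits as the filter above, given names for readability (names are hypothetical)
LAT_MIN, LAT_MAX = 40.492083, 40.944536
LON_MIN, LON_MAX = -74.267880, -73.662022

def inside_bbox(frame):
    """Return only the rows whose pickup point lies inside the bounding box."""
    mask = (frame["pickup_latitude"].between(LAT_MIN, LAT_MAX)
            & frame["pickup_longitude"].between(LON_MIN, LON_MAX))
    return frame[mask]

# Toy rows: one plausible pickup, one (0, 0) placeholder, one with a bad longitude
toy = pd.DataFrame({
    "pickup_latitude": [40.75, 0.0, 40.70],
    "pickup_longitude": [-73.99, 0.0, -80.00],
})
kept = inside_bbox(toy)
print(len(kept))
```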

In [None]:
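# select all pickups on the chosen weekday within the chosen month
my_weekday = "Thursday"
my_month = 1
pickup_time = pd.to_datetime(df['pickup_datetime'])
month_start = pd.Timestamp(2013, my_month, 1)
# first day of the following month; unlike my_month + 1, this also works for December
month_end = month_start + pd.offsets.MonthBegin(1)
df_select = df[(df['weekday'] == my_weekday) &
               (pickup_time >= month_start) &
               (pickup_time < month_end)]
# keep at most 70000 rows; copy so that adding columns later does not raise SettingWithCopyWarning
df_select = df_select[:70000].copy()
df_select.head()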

## K-means Clustering
We use K-means to divide the pickups into 100 clusters, which stand in for 100 blocks of New York City. We then look at the size and the center of each cluster.

In [None]:
%%time
# use K-means to group the pickups by longitude and latitude
from sklearn.cluster import KMeans
my_cluster = 100
lon = df_select['pickup_longitude'].values
lat = df_select['pickup_latitude'].values
# one row per pickup: [longitude, latitude]
coordinate_array = np.column_stack((lon, lat))

kmeans_n = KMeans(n_clusters=my_cluster, n_init=1, random_state=1000)
kmeans_n.fit(coordinate_array)
labels = kmeans_n.labels_
print(labels)

In [None]:
# add a new column called Cluster
df_select['Cluster']=labels
df_select.head()

In [None]:
# prepare for regression: one row per cluster, holding the number of pickups in that cluster
Cluster_size = df_select.groupby('Cluster').size().values.reshape(-1, 1)
Cluster_center = kmeans_n.cluster_centers_
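
To make the two quantities used later (cluster size and cluster center) concrete, here is a minimal sketch on five made-up points that form two obvious groups. The coordinates and the names `points`, `toy_kmeans`, and `toy_sizes` are illustrative assumptions; `np.bincount` is used here as an alternative way to count the members of each cluster.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated groups of made-up (longitude, latitude) points
points = np.array([
    [-74.00, 40.70], [-74.01, 40.71], [-73.99, 40.69],
    [-73.80, 40.90], [-73.81, 40.91],
])
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Number of points assigned to each cluster, indexed by cluster label
toy_sizes = np.bincount(toy_kmeans.labels_, minlength=2)

print(toy_kmeans.cluster_centers_)  # one (longitude, latitude) row per cluster
print(toy_sizes)
```

Row `i` of `cluster_centers_` and entry `i` of the size array describe the same cluster, which is what lets the notebook pair sizes with centers.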

## Analysis

In [None]:
plt.hist(Cluster_size)
plt.title("Distribution of cluster sizes")
plt.show()

In [None]:
# longitude and latitude of every cluster center
analysis_lon = Cluster_center[:, 0]
analysis_lat = Cluster_center[:, 1]
plt.scatter(analysis_lon, analysis_lat)
plt.title("Distribution of cluster centers")
plt.show()

In [None]:
plt.scatter(analysis_lon,Cluster_size)
plt.title("longitude VS Cluster Size")
plt.show()

In [None]:
plt.scatter(analysis_lat,Cluster_size)
plt.title("latitude VS Cluster Size")
plt.show()

## Training data and testing data

In [None]:
# use the first 80% of the clusters for training and the remaining 20% for testing
train_size = int(len(Cluster_size) * 0.8)
train_feature = Cluster_size[:train_size]
train_response = Cluster_center[:train_size]
# the test set starts where the training set ends, so the two do not overlap
test_feature = Cluster_size[train_size:]
test_response = Cluster_center[train_size:]
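
An alternative to slicing by position is scikit-learn's `train_test_split`, which shuffles the rows before splitting and guarantees disjoint subsets. The sketch below uses made-up arrays (`toy_size`, `toy_center`) in place of the real cluster data; it is a suggestion, not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up stand-ins: 10 cluster sizes and 10 (longitude, latitude) centers
toy_size = np.arange(10).reshape(-1, 1)
toy_center = np.column_stack((np.linspace(-74.0, -73.9, 10),
                              np.linspace(40.7, 40.8, 10)))

# 80% of the rows for training, 20% for testing, shuffled reproducibly
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_size, toy_center, test_size=0.2, random_state=0)

print(X_tr.shape, X_te.shape)
```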

## Validation - coefficient of determination (R^2)
To check how well our model predicts, we evaluate it with the coefficient of determination (R^2) and the mean squared error (MSE).

In [None]:
# coefficient of determination (R^2)
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline


def fit_model(X, y):
    # PolynomialFeatures already adds a constant column, so the linear step needs no intercept
    model = Pipeline([('poly', PolynomialFeatures(degree=3)),
                      ('linear', LinearRegression(fit_intercept=False))])
    model.fit(X, y)
    return model

def score_model(model, X, y, Xv, yv):
    # returns (R^2 on the training data, R^2 on the validation data)
    return model.score(X, y), model.score(Xv, yv)

def fit_model_and_score(data, response, validation, val_response):
    model = fit_model(data, response)
    return score_model(model, data, response, validation, val_response)

print(fit_model_and_score(train_feature, train_response,
                          test_feature, test_response))
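
For reference, R^2 compares the model's squared error with that of a baseline that always predicts the mean: R^2 = 1 - SS_res / SS_tot. The sketch below computes it by hand on made-up numbers and checks the result against scikit-learn's `r2_score`; the arrays `y_true` and `y_pred` are illustrative only.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true values and predictions
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```

A value of 1 means a perfect fit, 0 means the model is no better than predicting the mean, and a negative value (possible on test data) means it is worse than that baseline.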


## MSE

In [None]:
# use the mean squared error to evaluate the model
from sklearn.metrics import mean_squared_error

MSE_model = Pipeline([('poly', PolynomialFeatures(degree=3)),
                      ('linear', LinearRegression(fit_intercept=False))])
MSE_model.fit(train_feature, train_response)
# predict the cluster centers of the held-out clusters and compare with the true centers
y_MSE = MSE_model.predict(test_feature)
mean_squared_error(test_response, y_MSE)
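
Because the response has two columns (longitude and latitude), it helps to see what a single MSE number means here. The sketch below, on made-up coordinates, shows that with the default settings `mean_squared_error` matches the mean of the squared differences over all rows and both columns. The arrays `centers_true` and `centers_pred` are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true and predicted (longitude, latitude) pairs
centers_true = np.array([[-73.99, 40.75], [-73.95, 40.78], [-74.00, 40.72]])
centers_pred = np.array([[-73.98, 40.76], [-73.96, 40.77], [-74.02, 40.70]])

# Mean of the squared differences over every row and both columns
mse_manual = np.mean((centers_true - centers_pred) ** 2)

print(mse_manual, mean_squared_error(centers_true, centers_pred))
```

Since the inputs are in degrees, the MSE is in squared degrees; taking its square root gives a typical error in degrees.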


## Prediction
We fit a model that maps cluster size to cluster center. We then feed the largest cluster size into the model to predict the best pickup location.

In [None]:
# predict the best location

X = Cluster_size
y = Cluster_center

prediction_model = Pipeline([('poly', PolynomialFeatures(degree=3)),
                             ('linear', LinearRegression(fit_intercept=False))])
prediction_model.fit(X, y)
# a single sample with a single feature: the largest cluster size
X_predict = [[Cluster_size.max()]]
y_predict = prediction_model.predict(X_predict)
y_predict
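
To see what the `'poly'` step of the pipeline feeds into the linear regression, the sketch below expands two made-up cluster sizes with `PolynomialFeatures(degree=3)`. Each size `s` becomes the row `[1, s, s^2, s^3]`; the leading constant column plays the role of the intercept, which is why the pipeline can use `fit_intercept=False`. The values in `sizes` are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two made-up cluster sizes, one feature per row
sizes = np.array([[2.0], [3.0]])

# Degree-3 expansion of a single feature: constant, s, s^2, s^3
expanded = PolynomialFeatures(degree=3).fit_transform(sizes)
print(expanded)
# [[ 1.  2.  4.  8.]
#  [ 1.  3.  9. 27.]]
```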


## Visualization
For visualization, we use the randint function to generate 100 random colors and draw the clusters that K-means created in a scatter plot, one color per cluster. We also use a scatter plot to visualize the prediction, showing the predicted point in red and the actual center of the largest cluster in yellow.

In [None]:
# prepare for visualization

# predicted (longitude, latitude) of the best pickup location
visual_x, visual_y = y_predict[0]

# index of the largest cluster
max_size_cluster = int(np.argmax(Cluster_size))

# actual center of the largest cluster
actual_x, actual_y = kmeans_n.cluster_centers_[max_size_cluster]

In [None]:
# visualize the K-means clusters, one random color per cluster
from random import randint
colors = []

for i in range(my_cluster):
    colors.append('#%06X' % randint(0, 0xFFFFFF))

plt.figure(figsize=(18,9))
for i in range(my_cluster):
    my_cluster_df=df_select[df_select['Cluster']==i]
    lon_x=my_cluster_df.pickup_longitude.values
    lat_y=my_cluster_df.pickup_latitude.values
    plt.scatter(lon_x,lat_y,alpha=0.2,s=100,c=colors[i])

plt.axis([visual_x-0.1,visual_x+0.1,visual_y-0.1,visual_y+0.1])
plt.title("K-means clusters around the predicted location")
plt.show()


In [None]:
# scatter plot of all pickups for the selected weekday,
# with the predicted best location in red and the actual center of the largest cluster in yellow
plt.figure(figsize=(18,9))
plt.scatter(lon,lat,alpha=0.2,s=100)
plt.scatter(visual_x,visual_y ,c='r',s=100)
plt.scatter(actual_x,actual_y ,c='y',s=100)
plt.axis([visual_x-0.05,visual_x+0.05,visual_y-0.05,visual_y+0.05])
plt.show()
