<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/5/58/Uber_logo_2018.svg/1024px-Uber_logo_2018.svg.png" alt="UBER LOGO" width="50%" />

# UBER Pickups 

## Company's Description 📇

<a href="http://uber.com/" target="_blank">Uber</a> is one of the most famous startups in the world. It started as a ride-sharing application for people who couldn't afford a taxi. Uber has since expanded into food delivery with <a href="https://www.ubereats.com/fr-en" target="_blank">Uber Eats</a>, package delivery, freight transportation, and even urban transportation with <a href="https://www.uber.com/fr/en/ride/uber-bike/" target="_blank">Jump Bike</a> and <a href="https://www.li.me/" target="_blank">Lime</a>, which the company funded.


The company's goal is to revolutionize transportation across the globe. It now operates in about 70 countries and 900 cities and generates over $14 billion in revenue! 😮


## Project 🚧

One of the main pain points that Uber's team identified is that drivers are sometimes not around when users need them. For example, a user might be in San Francisco's Financial District while Uber drivers are looking for customers in the Castro.  

(If you are not familiar with the Bay Area, check out <a href="https://www.google.com/maps/place/San+Francisco,+CA,+USA/@37.7515389,-122.4567213,13.43z/data=!4m5!3m4!1s0x80859a6d00690021:0x4a501367f076adff!8m2!3d37.7749295!4d-122.4194155" target="_blank">Google Maps</a>.)

Even though the two neighborhoods are not that far apart, users would still have to wait 10 to 15 minutes before being picked up, which is too long. Uber's research shows that users are willing to wait 5 to 7 minutes; beyond that, they cancel their ride. 

Therefore, Uber's data team would like to work on a project where **their app would recommend hot-zones in major cities to be in at any given time of day.**  

## Goals 🎯

Uber already has data about pickups in major cities. Your objective is to create algorithms that determine where the hot-zones are, so that drivers know where to be. You will therefore:

* Create an algorithm to find hot zones 
* Visualize results on a nice dashboard 

## Scope of this project 🖼️

To start off, Uber wants to try this feature in New York City, so you will focus on this city only. Data can be found here: 

👉👉<a href="https://full-stack-bigdata-datasets.s3.eu-west-3.amazonaws.com/Machine+Learning+non+Supervis%C3%A9/Projects/uber-trip-data.zip" target="_blank"> Uber Trip Data</a> 👈👈

**You only need to focus on New York City for this project**

## Helpers 🦮

Here are a few tips to help you complete this project: 

### Clustering is your friend 

Clustering techniques are a perfect fit for the job. Think about it: all the pickup locations can be gathered into different clusters. You can then use **cluster coordinates to pin hot zones** 😉
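
As a minimal sketch of this idea, assuming scikit-learn's `KMeans` and a handful of made-up coordinates (the points below are illustrative and not taken from the Uber data), the fitted cluster centers can serve as candidate hot-zone pins:

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up pickup coordinates (lat, lon): two tight groups, for illustration only
pickups = np.array([
    [40.750, -73.990], [40.751, -73.991], [40.749, -73.989],
    [40.645, -73.782], [40.646, -73.783], [40.644, -73.781],
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pickups)

# One (lat, lon) pair per cluster: candidate hot-zone pins, sorted by latitude
hot_zones = sorted(map(tuple, kmeans.cluster_centers_.round(3)))
print(hot_zones)
```

Each center is the mean position of the pickups assigned to that cluster, which is why it is a natural place to drop a pin.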
    

### Create maps with `plotly` 

Check out the <a href="https://plotly.com/" target="_blank">Plotly</a> documentation: you can create maps and populate them easily. There are other libraries, of course, but this one should do the job well. 


### Start small, grow big 

Even though Uber wants to have hot-zones per hour and per day of week, you should first **start small**. Pick one day at a given hour and **then start to generalize** your approach. 
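
One possible way to isolate a single day and hour with pandas is sketched below. The rows are made up; only the column names (`Date/Time`, `Lat`, `Lon`) follow the Uber CSV files, and `df_demo` and `one_slot` are hypothetical names.

```python
import pandas as pd

# Tiny made-up sample using the same column names as the Uber CSV files
df_demo = pd.DataFrame({
    "Date/Time": ["4/7/2014 18:05:00", "4/7/2014 18:40:00",
                  "4/7/2014 09:15:00", "4/8/2014 18:20:00"],
    "Lat": [40.75, 40.76, 40.72, 40.74],
    "Lon": [-73.99, -73.98, -74.00, -73.97],
})
# between_time() needs a DatetimeIndex
df_demo.index = pd.to_datetime(df_demo["Date/Time"])

# Keep Mondays only (weekday 0), then the 18:00-18:59 slot
one_slot = df_demo[df_demo.index.weekday == 0].between_time("18:00", "18:59")
print(len(one_slot))
```

Once the clustering works on one such slot, the same filter can be placed inside a loop over days and hours.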

## Deliverable 📬

To complete this project, your team should: 

* Produce a map with hot-zones using any Python library (`plotly` or anything else). 
* Describe hot-zones **at least** per day of week. 
* Compare results from **at least** two unsupervised algorithms, such as KMeans and DBSCAN. 

Your maps should look something like this: 

<img src="https://full-stack-assets.s3.eu-west-3.amazonaws.com/images/Clusters_uber_pickups.png" alt="Uber Cluster Map" />

In [12]:
import pandas as pd 
import numpy as np
import calendar

from sklearn.cluster import KMeans, MiniBatchKMeans, DBSCAN
from sklearn.metrics import silhouette_score

import matplotlib.pyplot as plt
import plotly.express as px
import plotly.io as pio
from plotly.subplots import make_subplots
import plotly.graph_objects as go

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

pio.renderers.default = "browser"

from IPython.display import display

In [13]:
####
# Load & explore
########
df = pd.read_csv('uber-raw-data-apr14.csv')
display(df.head())

print("Basics statistics: ")
display(df.describe(include='all'))

print("Percentage of missing values: ")
display(100*df.isnull().sum()/df.shape[0])

Unnamed: 0,Date/Time,Lat,Lon,Base
0,4/1/2014 0:11:00,40.769,-73.9549,B02512
1,4/1/2014 0:17:00,40.7267,-74.0345,B02512
2,4/1/2014 0:21:00,40.7316,-73.9873,B02512
3,4/1/2014 0:28:00,40.7588,-73.9776,B02512
4,4/1/2014 0:33:00,40.7594,-73.9722,B02512


Basics statistics: 


Unnamed: 0,Date/Time,Lat,Lon,Base
count,564516,564516.0,564516.0,564516
unique,41999,,,5
top,4/7/2014 20:21:00,,,B02682
freq,97,,,227808
mean,,40.740005,-73.976817,
std,,0.036083,0.050426,
min,,40.0729,-74.7733,
25%,,40.7225,-73.9977,
50%,,40.7425,-73.9848,
75%,,40.7607,-73.97,


Percentage of missing values: 


Date/Time    0.0
Lat          0.0
Lon          0.0
Base         0.0
dtype: float64

In [None]:
# no missing values

In [14]:
# transform the index into datetime
df.index = pd.to_datetime(df['Date/Time'])
df.head()

Date/Time (index),Date/Time,Lat,Lon,Base
2014-04-01 00:11:00,4/1/2014 0:11:00,40.769,-73.9549,B02512
2014-04-01 00:17:00,4/1/2014 0:17:00,40.7267,-74.0345,B02512
2014-04-01 00:21:00,4/1/2014 0:21:00,40.7316,-73.9873,B02512
2014-04-01 00:28:00,4/1/2014 0:28:00,40.7588,-73.9776,B02512
2014-04-01 00:33:00,4/1/2014 0:33:00,40.7594,-73.9722,B02512


In [15]:
timeslot_24 = [('00:00', '00:59'),('01:00', '01:59'), ('02:00', '02:59'),
            ('03:00', '03:59'), ('04:00', '04:59'), ('05:00', '05:59'),
            ('06:00', '06:59'), ('07:00', '07:59'), ('08:00', '08:59'),
            ('09:00', '09:59'), ('10:00', '10:59'), ('11:00', '11:59'),            
            ('12:00', '12:59'), ('13:00', '13:59'), ('14:00', '14:59'),   
            ('15:00', '15:59'), ('16:00', '16:59'), ('17:00', '17:59'),   
            ('18:00', '18:59'), ('19:00', '19:59'), ('20:00', '20:59'),   
            ('21:00', '21:59'), ('22:00', '22:59'), ('23:00', '23:59')]
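
The same 24 hourly slots can also be generated rather than typed by hand. A small sketch (the name `timeslot_24_generated` is hypothetical, chosen so it does not overwrite the list above):

```python
# Build the 24 hourly (start, end) pairs with zero-padded hours
timeslot_24_generated = [(f"{h:02d}:00", f"{h:02d}:59") for h in range(24)]

print(timeslot_24_generated[0], timeslot_24_generated[-1])
```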

In [19]:
###
# Let's take a sample: Mondays over one month, between 16:00 and 19:00
###########
# .copy() so that adding a 'cluster' column later does not write to a view of df
df_sample = df.loc[df.index.weekday.isin([0])].between_time('16:00', '19:00').copy()

# Collect the within-cluster sum of squares (WCSS) for each value of K.
# After fitting, the .inertia_ attribute holds the WCSS for that K.
# K-means is iterative: each observation is assigned to its closest cluster center,
# each center is then moved to the mean of the observations assigned to it,
# and the two steps repeat until the assignments stabilize.
# MiniBatchKMeans runs the same steps on small random batches for speed.
wcss =  []
k_elbow = []
X = df_sample.iloc[:,[1,2]]
for i in range (1,11): 
    kmeans = MiniBatchKMeans(n_clusters= i, random_state = 0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
    k_elbow.append(i)
    print("WCSS for K={} --> {}".format(i, wcss[-1]))

# Compute the mean silhouette score
# For each point, the silhouette coefficient compares the mean distance to the points
# of its own cluster (cohesion, a) with the mean distance to the points of the nearest
# neighbouring cluster (separation, b): s = (b - a) / max(a, b).
# If s is negative, the point is on average closer to the neighbouring cluster than to
# its own, so it is poorly assigned.
# Conversely, if s is positive, the point is on average closer to its own cluster than
# to the neighbouring one, so it is well assigned.
sil = []
k_sil = []
## Careful: start at i=2, because the silhouette score needs at least 2 distinct labels
for i in range (2,11): 
    kmeans = MiniBatchKMeans(n_clusters= i, random_state = 0)
    kmeans.fit(X)
    sil.append(silhouette_score(X, kmeans.labels_))
    k_sil.append(i)
    print("Silhouette score for K={} is {}".format(i, sil[-1]))


WCSS for K=1 --> 51.1366118272159
WCSS for K=2 --> 40.23584818244221
WCSS for K=3 --> 22.730704340161246
WCSS for K=4 --> 17.981831844260036
WCSS for K=5 --> 15.877859665669508
WCSS for K=6 --> 10.226382557726241
WCSS for K=7 --> 9.104290425769683
WCSS for K=8 --> 8.335205622077147
WCSS for K=9 --> 6.91196626003633
WCSS for K=10 --> 6.957086806792694
Silhouette score for K=2 is 0.35163373121297087
Silhouette score for K=3 is 0.4369799145043831
Silhouette score for K=4 is 0.4708849291922912
Silhouette score for K=5 is 0.4878956282110826
Silhouette score for K=6 is 0.4991258180233731
Silhouette score for K=7 is 0.4027942036493982
Silhouette score for K=8 is 0.4165866437855045
Silhouette score for K=9 is 0.42662634475948513
Silhouette score for K=10 is 0.34875801578135485
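
To make the silhouette definition concrete, here is a from-scratch sketch on a made-up one-dimensional toy set. The helper `silhouette_of` and the data are hypothetical; scikit-learn's `silhouette_score` averages the same per-point quantity, `(b - a) / max(a, b)`, over all points.

```python
def silhouette_of(i, points, labels):
    """Silhouette coefficient of points[i] for 1-D data (illustrative helper)."""
    own = labels[i]
    # a: mean distance to the other members of the point's own cluster
    same = [abs(points[i] - p)
            for j, (p, lab) in enumerate(zip(points, labels))
            if lab == own and j != i]
    if not same:
        return 0.0  # usual convention for a cluster with a single member
    a = sum(same) / len(same)
    # b: smallest mean distance to the members of any other cluster
    b = min(
        sum(abs(points[i] - p) for p, lab in zip(points, labels) if lab == other)
        / labels.count(other)
        for other in set(labels) if other != own
    )
    return (b - a) / max(a, b)

points = [0.0, 1.0, 10.0, 11.0]   # two obvious groups
labels = [0, 0, 1, 1]
scores = [silhouette_of(i, points, labels) for i in range(len(points))]
mean_score = sum(scores) / len(scores)
print(mean_score)
```

With two well-separated groups every coefficient is close to 1. Assigning the point `1.0` to the far group instead would make its coefficient negative.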


In [20]:
fig = make_subplots(rows=1, cols=2, subplot_titles=("Inertia per cluster", "Silhouette Score per cluster"))

# Create a DataFrame for the within-cluster sum of squares (WCSS)
wcss_frame = pd.DataFrame(wcss)
k_frame = pd.Series(k_elbow)

# Create figure
fig.add_trace(
    go.Scatter(x=k_frame, y=wcss_frame.iloc[:,-1]),
    row=1, col=1
)
fig.update_xaxes(title_text="# Clusters", row=1, col=1)
fig.update_yaxes(title_text="Inertia", row=1, col=1)

# Create a DataFrame for the mean silhouette scores
cluster_scores=pd.DataFrame(sil)
k_frame = pd.Series(k_sil)

# Create figure
fig.add_trace (
    go.Bar(x=k_frame, y=cluster_scores.iloc[:, -1]),
    row=1, col=2
)
fig.update_xaxes(title_text="# Clusters", row=1, col=2)
fig.update_yaxes(title_text="Silhouette score", row=1, col=2)

# Render
fig.update_layout(showlegend=False)
fig.show(renderer="vscode") # if using workspace
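
Reading the silhouette bar chart amounts to picking the K with the highest score. A small sketch using the scores printed earlier, rounded to four decimals (the names `silhouette_by_k` and `best_k` are hypothetical):

```python
# Mean silhouette scores copied from the printed output, rounded to 4 decimals
silhouette_by_k = {2: 0.3516, 3: 0.4370, 4: 0.4709, 5: 0.4879, 6: 0.4991,
                   7: 0.4028, 8: 0.4166, 9: 0.4266, 10: 0.3488}

# The K with the highest mean silhouette score
best_k = max(silhouette_by_k, key=silhouette_by_k.get)
print(best_k)  # 6
```

By this criterion alone K=6 scores highest on this sample, with K=5 and K=4 close behind. The elbow of the inertia curve is a second, more subjective signal, so the final choice remains a judgment call.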

In [21]:
####
## let's look at it on a map !
#############
kmeans = MiniBatchKMeans(n_clusters=4, random_state=42)
kmeans.fit(X)

# assign each sample to a cluster
df_sample['cluster'] = kmeans.predict(X)

fig = px.scatter_mapbox(df_sample, lat="Lat", lon="Lon", 
    zoom=10, mapbox_style="carto-positron",
    color='cluster', 
    color_continuous_scale=px.colors.sequential.Rainbow,
    width=600,height=600)

fig.show(renderer="vscode")

In [28]:
####
## Let's try a DBSCAN on the sample
###########
X = df_sample.iloc[:,[1,2]].values
db = DBSCAN(eps=0.0025, metric='manhattan', min_samples=100, n_jobs=-1)
db.fit(X)

# Assign each sample to a cluster
df_sample['cluster'] = db.labels_
fig = px.scatter_mapbox(
    df_sample[(df_sample.cluster>=0)], lat="Lat", lon="Lon", 
    zoom=10, mapbox_style="carto-positron",
    color='cluster', 
    color_continuous_scale=px.colors.sequential.Rainbow, 
    width=600,height=600)
    
fig.show(renderer="vscode")
df_sample.cluster.value_counts()

-1    6278
 1    6138
 2    1744
 0     261
 4     211
 6     190
 5     190
 3     141
 7     122
 8      58
Name: cluster, dtype: int64

In [None]:
# Many samples (6278 of 15333, about 41%) are labelled -1, i.e. DBSCAN classifies them as noise (outliers)
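
Because the features are raw latitude and longitude, `eps` is expressed in degrees, and it helps to know roughly what distance that covers. A rough sketch, assuming a spherical Earth and the mean pickup latitude reported by `df.describe()` (all names below are hypothetical):

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, an approximation
eps_deg = 0.0025            # the eps value used in the DBSCAN call
nyc_lat_deg = 40.74         # mean latitude of the pickups, from df.describe()

# Length of one degree of latitude on a sphere
meters_per_degree = EARTH_RADIUS_M * math.pi / 180

# A degree of longitude shrinks with the cosine of the latitude
eps_north_south_m = eps_deg * meters_per_degree
eps_east_west_m = eps_north_south_m * math.cos(math.radians(nyc_lat_deg))

print(round(eps_north_south_m), round(eps_east_west_m))
```

Under these assumptions, 0.0025° is on the order of 280 m north-south and 210 m east-west. With the Manhattan metric, `eps` bounds the sum of the latitude and longitude offsets, so a point needs 100 pickups within a few city blocks to count as a core point, which probably explains part of the large noise share.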

In [23]:
##########
# We build a data frame with a full month of data
# for each day of the week, for each hour slot
#######
hotspots = pd.DataFrame(columns=['Lat', 'Lon', 'day', 'time_slot'])
for day in range(7):
    dfday = df[(df.index.dayofweek == day)]
    for slot in timeslot_24:
        dfslot = dfday.between_time(slot[0], slot[1]).copy()
        X = dfslot.iloc[:,[1,2]].values
        kmeans = MiniBatchKMeans(n_clusters=4, random_state=42)
        kmeans.fit(X)
        dfslot['day'] = calendar.day_name[day]
        dfslot['time_slot'] = slot[0]
        dfslot['cluster'] = kmeans.predict(X)
        hotspots = pd.concat([hotspots, dfslot], ignore_index=True)

hotspots.head(100), hotspots.shape

(        Lat      Lon     day time_slot          Date/Time    Base  cluster
 0   40.7205 -73.9939  Monday     00:00   4/7/2014 0:31:00  B02512      2.0
 1   40.7407 -74.0077  Monday     00:00   4/7/2014 0:37:00  B02512      2.0
 2   40.7591 -73.9892  Monday     00:00   4/7/2014 0:50:00  B02512      0.0
 3   40.7419 -74.0034  Monday     00:00   4/7/2014 0:58:00  B02512      2.0
 4   40.7456 -73.9773  Monday     00:00  4/14/2014 0:02:00  B02512      0.0
 ..      ...      ...     ...       ...                ...     ...      ...
 95  40.7365 -73.9908  Monday     00:00  4/21/2014 0:05:00  B02598      2.0
 96  40.7731 -73.9797  Monday     00:00  4/21/2014 0:06:00  B02598      0.0
 97  40.7242 -74.0102  Monday     00:00  4/21/2014 0:06:00  B02598      2.0
 98  40.7470 -73.9732  Monday     00:00  4/21/2014 0:07:00  B02598      0.0
 99  40.6339 -73.9285  Monday     00:00  4/21/2014 0:07:00  B02598      2.0
 
 [100 rows x 7 columns],
 (564516, 7))
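
Note that `hotspots` keeps every pickup with its cluster label (564516 rows), not the cluster centers. A possible follow-up, sketched on made-up rows and assuming one pin per cluster is wanted, is to aggregate to one row per day, time slot and cluster. The names `points` and `centers` are hypothetical, and the `hits` column name is borrowed from the commented-out `size='hits'` argument in the plotting cells.

```python
import pandas as pd

# Made-up rows with the same columns as `hotspots` (illustrative values only)
points = pd.DataFrame({
    "Lat": [40.75, 40.76, 40.64, 40.65, 40.70],
    "Lon": [-73.99, -73.98, -73.78, -73.79, -74.00],
    "day": ["Monday"] * 5,
    "time_slot": ["00:00"] * 4 + ["01:00"],
    "cluster": [0, 0, 1, 1, 0],
})

# One row per (day, time_slot, cluster): mean position and number of pickups
centers = (
    points.groupby(["day", "time_slot", "cluster"], as_index=False)
          .agg(Lat=("Lat", "mean"), Lon=("Lon", "mean"), hits=("Lat", "size"))
)
print(centers)
```

A frame like `centers` is much smaller than the raw points, and `hits` could drive the marker size on the map.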

In [24]:
###
# Let's display the KMeans hot-zones over a whole Monday
#########
hotspots_monday = hotspots[hotspots["day"] == 'Monday']
fig = px.scatter_mapbox(hotspots_monday, lat="Lat", lon="Lon", 
    zoom=10, mapbox_style="carto-positron",
    color='cluster', 
    color_continuous_scale=px.colors.sequential.Rainbow,
    animation_frame='time_slot',
    animation_group='day', 
)
fig.show()

In [23]:
fig = px.scatter_mapbox(hotspots, lat="Lat", lon="Lon", 
    zoom=10, mapbox_style="carto-positron",
    color='day', #size='hits',
    color_continuous_scale=px.colors.sequential.Rainbow,
    animation_frame='time_slot',
    animation_group='day')
fig.show()

In [29]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

dbspots = pd.DataFrame(columns=['Lat', 'Lon', 'day', 'time_slot'])

for day in range(7):
    dfday = df[(df.index.dayofweek == day)]
    for slot in timeslot_24:
        dfslot = dfday.between_time(slot[0], slot[1]).copy()
        X = dfslot.iloc[:,[1,2]].values
        db = DBSCAN(eps=0.0025, metric='manhattan', min_samples=100, n_jobs=-1)
        db.fit(X)
        
        dfslot['day'] = calendar.day_name[day]
        dfslot['time_slot'] = slot[0]
        dfslot['cluster'] = db.labels_
        dfslot = dfslot[(dfslot.cluster>=0)]
        dbspots = pd.concat([dbspots, dfslot], ignore_index=True)
        
dbspots.head(100), dbspots.shape

(        Lat      Lon     day time_slot          Date/Time    Base  cluster
 0   40.6448 -73.7825  Monday     06:00  4/21/2014 6:12:00  B02512      0.0
 1   40.6449 -73.7822  Monday     06:00  4/28/2014 6:16:00  B02512      0.0
 2   40.6447 -73.7826  Monday     06:00  4/28/2014 6:17:00  B02512      0.0
 3   40.6449 -73.7821  Monday     06:00   4/7/2014 6:05:00  B02598      0.0
 4   40.6450 -73.7821  Monday     06:00   4/7/2014 6:10:00  B02598      0.0
 ..      ...      ...     ...       ...                ...     ...      ...
 95  40.6449 -73.7821  Monday     06:00  4/28/2014 6:07:00  B02682      0.0
 96  40.6450 -73.7815  Monday     06:00  4/28/2014 6:13:00  B02682      0.0
 97  40.6447 -73.7829  Monday     06:00  4/28/2014 6:13:00  B02682      0.0
 98  40.6448 -73.7827  Monday     06:00  4/28/2014 6:15:00  B02682      0.0
 99  40.6449 -73.7821  Monday     06:00  4/28/2014 6:17:00  B02682      0.0
 
 [100 rows x 7 columns],
 (51823, 7))

In [30]:
dbspots['time_slot'].value_counts().count()

14

In [27]:
#dbspots['hits'] = dbspots['hits'].astype(str).astype(int)

fig = px.scatter_mapbox(dbspots, lat="Lat", lon="Lon", 
    zoom=10, mapbox_style="carto-positron",
    color='cluster', #size='hits',
    color_continuous_scale=px.colors.sequential.Rainbow,
    animation_frame='time_slot',
    animation_group='day')
fig.show()