<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/5/58/Uber_logo_2018.svg/1024px-Uber_logo_2018.svg.png" alt="UBER LOGO" width="50%" />

# UBER Pickups 

## Company's Description 📇

<a href="http://uber.com/" target="_blank">Uber</a> is one of the most famous startups in the world. It started as a ride-sharing application for people who couldn't afford a taxi. Uber has since expanded into food delivery with <a href="https://www.ubereats.com/fr-en" target="_blank">Uber Eats</a>, package delivery, freight transportation, and even urban transportation with <a href="https://www.uber.com/fr/en/ride/uber-bike/" target="_blank"> Jump Bike</a> and <a href="https://www.li.me/" target="_blank"> Lime </a>, which the company funded.


The company's goal is to revolutionize transportation across the globe. It now operates in about 70 countries and 900 cities and generates over $14 billion in revenue! 😮

## Project 🚧

One of the main pain points that Uber's team found is that drivers are sometimes not around when users need them. For example, a user might be in San Francisco's Financial District while Uber drivers are looking for customers in the Castro.

(If you are not familiar with the Bay Area, check out <a href="https://www.google.com/maps/place/San+Francisco,+CA,+USA/@37.7515389,-122.4567213,13.43z/data=!4m5!3m4!1s0x80859a6d00690021:0x4a501367f076adff!8m2!3d37.7749295!4d-122.4194155" target="_blank">Google Maps</a>.)

Even though the two neighborhoods are not that far apart, users would still have to wait 10 to 15 minutes before being picked up, which is too long. Uber's research shows that users are willing to wait 5-7 minutes; beyond that, they cancel their ride.

Therefore, Uber's data team would like to work on a project where **their app would recommend hot-zones in major cities for drivers to be in at any given time of day.**

In [17]:
import pandas as pd
import numpy as np
import json
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score
import plotly.express as px
import plotly.graph_objects as go
import plotly.subplots as sp

In [2]:
july_dataset = r"C:\Users\giand\OneDrive\Documents\__packages__\jedha\src\uber-trip-data\uber-raw-data-jul14.csv"
df = pd.read_csv(july_dataset)

1. Group the dataset by day of the week
2. Group by each hour of the day

## EDA

In [3]:
df.shape

(796121, 4)

In [4]:
df.head()

Unnamed: 0,Date/Time,Lat,Lon,Base
0,7/1/2014 0:03:00,40.7586,-73.9706,B02512
1,7/1/2014 0:05:00,40.7605,-73.9994,B02512
2,7/1/2014 0:06:00,40.732,-73.9999,B02512
3,7/1/2014 0:09:00,40.7635,-73.9793,B02512
4,7/1/2014 0:20:00,40.7204,-74.0047,B02512


In [5]:
(df.isna().sum() / df.shape[0]).apply(lambda x: f"{round(x * 100)} %")

Date/Time    0 %
Lat          0 %
Lon          0 %
Base         0 %
dtype: object

In [6]:
df.describe(include="all")

Unnamed: 0,Date/Time,Lat,Lon,Base
count,796121,796121.0,796121.0,796121
unique,44286,,,5
top,7/15/2014 19:30:00,,,B02617
freq,79,,,310160
mean,,40.739141,-73.972353,
std,,0.040551,0.05866,
min,,39.7214,-74.826,
25%,,40.7209,-73.9961,
50%,,40.7425,-73.9832,
75%,,40.7608,-73.9651,


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 796121 entries, 0 to 796120
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   Date/Time  796121 non-null  object 
 1   Lat        796121 non-null  float64
 2   Lon        796121 non-null  float64
 3   Base       796121 non-null  object 
dtypes: float64(2), object(2)
memory usage: 24.3+ MB


In [8]:
df_sample = df.sample(10000)

In [9]:
for column in df_sample.columns:
    fig = px.histogram(df_sample[column])
    fig.show()

## Pipeline & Pre-processing

In [10]:
df_sample = df_sample.drop("Base", axis=1)

In [11]:
df_sample["Date/Time"] = pd.to_datetime(df_sample["Date/Time"])
df_sample["dayofweek"] = df_sample["Date/Time"].dt.dayofweek
df_sample["hour"] = df_sample["Date/Time"].dt.hour
df_sample = df_sample.drop("Date/Time", axis=1)
df_sample.head()

Unnamed: 0,Lat,Lon,dayofweek,hour
281476,40.7692,-73.9509,1,7
87884,40.7222,-74.0053,2,13
730546,40.7164,-73.996,2,15
594294,40.725,-74.0077,1,16
28991,40.7221,-73.9584,6,2
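
With `dayofweek` and `hour` derived from `Date/Time`, the day-of-week and hour-of-day grouping planned at the start of the notebook can be sketched as a single `groupby`. This is only a minimal illustration on a small hand-made frame: the `toy` data and the `count_pickups` helper are hypothetical and not part of the notebook. Note that pandas encodes Monday as 0 and Sunday as 6.

```python
import pandas as pd

# Hypothetical toy data in the same format as the raw "Date/Time" column.
toy = pd.DataFrame(
    {
        "Date/Time": [
            "7/1/2014 0:03:00",
            "7/1/2014 0:05:00",
            "7/1/2014 13:20:00",
            "7/2/2014 13:45:00",
        ],
        "Lat": [40.7586, 40.7605, 40.7320, 40.7635],
        "Lon": [-73.9706, -73.9994, -73.9999, -73.9793],
    }
)


def count_pickups(frame: pd.DataFrame) -> pd.DataFrame:
    """Count pickups per (dayofweek, hour) slot."""
    when = pd.to_datetime(frame["Date/Time"], format="%m/%d/%Y %H:%M:%S")
    slots = pd.DataFrame({"dayofweek": when.dt.dayofweek, "hour": when.dt.hour})
    # size() counts the rows in each group; reset_index turns the result into a flat table.
    return slots.groupby(["dayofweek", "hour"]).size().reset_index(name="pickups")


counts = count_pickups(toy)
print(counts)
```

A table like `counts` makes it easy to see which time slots carry enough pickups to be worth clustering separately.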


In [33]:
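from pathlib import Path

Path("data").mkdir(exist_ok=True)  # make sure the output directory exists
dfs = {}
for i in range(7):
    # One subset per weekday (pandas convention: 0 = Monday, 6 = Sunday)
    dfs[i] = df_sample[df_sample["dayofweek"] == i]
    # Fit a fresh scaler per weekday; "dayofweek" is constant here, so it scales to all zeros
    sc = StandardScaler()
    X = sc.fit_transform(dfs[i])
    # Save the raw subset and its standardized counterpart
    dfs[i].to_csv(f"data/dataset_day_{i}.csv", index=False)
    np.save(f"data/preprocessing_day_{i}.npy", X)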
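
The imports at the top of the notebook include `KMeans` and `silhouette_score`, which suggests that the scaled arrays saved above are meant to feed a clustering step. The following is a minimal sketch of that idea, not the notebook's actual method. It makes several assumptions: it uses synthetic coordinates instead of the saved `.npy` files, it clusters on latitude and longitude only, and the `best_k` helper and its candidate range of `k` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one weekday's (Lat, Lon) pickups: two well-separated blobs.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[40.75, -73.98], scale=0.005, size=(100, 2))
blob_b = rng.normal(loc=[40.65, -73.78], scale=0.005, size=(100, 2))
coords = np.vstack([blob_a, blob_b])

# Standardize the coordinates, as the notebook does before saving its arrays.
X = StandardScaler().fit_transform(coords)


def best_k(X, candidates=range(2, 6)):
    """Return the k with the highest silhouette score, plus all scores."""
    scores = {}
    for k in candidates:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get), scores


k, scores = best_k(X)
print(k, scores)
```

The cluster centers for the selected `k` could then serve as candidate hot-zones for that weekday. On real pickup data the silhouette score may not single out one `k` as clearly as it does on these synthetic blobs.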