# Uber Data Analysis using Python

This notebook is a continuation of the "Uber Basic Data Analysis" notebook (https://www.kaggle.com/brahimmebrek/uber-basic-data-analysis).

It contains more in-depth visualizations (heatmaps, spatial visualizations, and animation) of the Uber Pickups in New York City data set.

The analysis is broken up into 3 sections:
- Data Loading and Preparation (same as the "Uber Basic Data Analysis" notebook).
- Cross Analysis through heatmaps.
- Spatial visualization and animation.

GitHub repo: https://github.com/BrahimMebrek/Uber_Data_Analysis/

## 1. Data Loading and Preparation

### 1.1 Loading Modules

In [None]:
import pandas as pd
import numpy as np

#Visualization modules
import matplotlib.pyplot as plt
import seaborn as sns

#The matplotlib basemap toolkit is a library for plotting 2D data on maps in Python
from mpl_toolkits.basemap import Basemap
from matplotlib import cm #Colormap

#Animation Modules
from matplotlib.animation import FuncAnimation
import matplotlib.animation as animation

%matplotlib inline

### 1.2 Loading Data

In [None]:
#Load the datasets

df_apr14=pd.read_csv("/kaggle/input/uber-pickups-in-new-york-city/uber-raw-data-apr14.csv")
df_may14=pd.read_csv("/kaggle/input/uber-pickups-in-new-york-city/uber-raw-data-may14.csv")
df_jun14=pd.read_csv("/kaggle/input/uber-pickups-in-new-york-city/uber-raw-data-jun14.csv")
df_jul14=pd.read_csv("/kaggle/input/uber-pickups-in-new-york-city/uber-raw-data-jul14.csv")
df_aug14=pd.read_csv("/kaggle/input/uber-pickups-in-new-york-city/uber-raw-data-aug14.csv")
df_sep14=pd.read_csv("/kaggle/input/uber-pickups-in-new-york-city/uber-raw-data-sep14.csv")

#Merge the dataframes into one

#DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
df = pd.concat([df_apr14, df_may14, df_jun14, df_jul14, df_aug14, df_sep14], ignore_index=True)
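
The six reads and the merge can also be written as a loop. The sketch below is one possible way to do it, assuming the files keep the `uber-raw-data-<month>14.csv` naming shown above; the helper name `load_monthly_pickups` is hypothetical and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_monthly_pickups(data_dir, months):
    """Read one CSV per month and stack them into a single DataFrame.

    Hypothetical helper: assumes files named uber-raw-data-<month>14.csv.
    """
    data_dir = Path(data_dir)
    frames = [
        pd.read_csv(data_dir / f"uber-raw-data-{month}14.csv")
        for month in months
    ]
    # ignore_index=True rebuilds a clean 0..n-1 index across all months
    return pd.concat(frames, ignore_index=True)


# Example call, using the input path from this notebook:
# df = load_monthly_pickups("/kaggle/input/uber-pickups-in-new-york-city",
#                           ["apr", "may", "jun", "jul", "aug", "sep"])
```

Building the list of frames first and concatenating once avoids repeated copying of the growing table.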

### 1.3 Data Preparation

In [None]:
df.head()

In [None]:
df.info()

In [None]:
#Renaming the Date/Time column
df = df.rename(columns={'Date/Time': 'Date_time'})

#Converting the Date_time column to datetime
df['Date_time'] = pd.to_datetime(df['Date_time'])

#Adding useful columns
df['Month'] = df['Date_time'].dt.month_name()
df['Weekday'] = df['Date_time'].dt.day_name()
df['Day'] = df['Date_time'].dt.day
df['Hour'] = df['Date_time'].dt.hour
df['Minute'] = df['Date_time'].dt.minute
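
To see what these derived columns look like, here is a small sketch on two made-up rows. It assumes the raw timestamps follow a `month/day/year hour:minute:second` layout; passing that layout explicitly through `format` is optional, but it removes any ambiguity between day-first and month-first dates and can speed up parsing on large tables.

```python
import pandas as pd

# Two made-up rows in the assumed raw layout "month/day/year hour:minute:second"
sample = pd.DataFrame({"Date_time": ["4/1/2014 0:11:00", "4/30/2014 17:45:00"]})

# An explicit format states how the strings should be read
sample["Date_time"] = pd.to_datetime(sample["Date_time"], format="%m/%d/%Y %H:%M:%S")

# The .dt accessor exposes calendar fields of a datetime column
sample["Month"] = sample["Date_time"].dt.month_name()
sample["Weekday"] = sample["Date_time"].dt.day_name()
sample["Hour"] = sample["Date_time"].dt.hour

print(sample)
```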

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe(include = 'all')

## 2. Cross Analysis
Through our exploration we are going to visualize:
- Heatmap by Hour and Day.
- Heatmap by Hour and Weekday.
- Heatmap by Day and Month.
- Heatmap by Month and Weekday.

In [None]:
#Defining a function that counts the number of rows
def count_rows(rows):
    return len(rows)
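
The heatmaps in this section all rely on the same pattern: group by two columns, count the rows in each group, and unstack one level into columns. As a minimal sketch on made-up data, the built-in `size()` gives the same row counts without a helper function; the `toy` frame below is purely illustrative.

```python
import pandas as pd

# Tiny stand-in for the pickup table (values are made up)
toy = pd.DataFrame({
    "Hour": [0, 0, 0, 1, 1, 2],
    "Day":  [1, 1, 2, 1, 2, 2],
})

# size() counts rows per (Hour, Day) group; unstack() moves Day into columns
toy_hour_day = toy.groupby(["Hour", "Day"]).size().unstack()

print(toy_hour_day)
```

Combinations that never occur (here Hour 2 on Day 1) show up as NaN in the unstacked table, which seaborn leaves blank in a heatmap.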

### 2.1 Heatmap by Hour and Day

In [None]:
#Creating the hour and day dataframe
df_hour_day = df.groupby('Hour Day'.split()).apply(count_rows).unstack()
df_hour_day.head()

In [None]:
plt.figure(figsize = (12,8))

#Using the seaborn heatmap function 
ax = sns.heatmap(df_hour_day, cmap=cm.YlGnBu, linewidth = .5)
ax.set(title="Trips by Hour and Day");

#### Analysing the results
We see that the number of trips increases throughout the day, with peak demand in the evening between 16:00 and 18:00.

This corresponds to the time when employees finish work and head home.

### 2.2 Heatmap by Hour and Weekday

In [None]:
df_hour_weekday = df.groupby('Hour Weekday'.split(), sort = False).apply(count_rows).unstack()
df_hour_weekday.head()

In [None]:
plt.figure(figsize = (12,8))

ax = sns.heatmap(df_hour_weekday, cmap=cm.YlGnBu, linewidth = .5)
ax.set(title="Trips by Hour and Weekday");

#### Analysing the results
On working days (Monday to Friday) the number of trips is higher from 16:00 to 21:00, which confirms the evening peak seen in the first heatmap.

On Friday the number of trips remains high until 23:00 and continues into early Saturday. This corresponds to the time when people leave work and then go out for dinner or drinks before the weekend.

The same pattern appears on Saturday: people tend to go out at night, and the number of trips remains high until early Sunday.
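
Because the groupby uses `sort = False`, the weekday columns follow the order in which the weekdays first appear in the data, which is not necessarily Monday to Sunday. A minimal sketch of putting them in calendar order before plotting, using made-up counts (the names `weekday_order` and `toy_hour_weekday` are illustrative only):

```python
import pandas as pd

weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# Made-up counts with columns in arbitrary order, as sort=False may produce
toy_hour_weekday = pd.DataFrame(
    {"Tuesday": [5, 9], "Sunday": [7, 3], "Monday": [4, 8]},
    index=pd.Index([0, 17], name="Hour"),
)

# Keep only the weekdays that are present, in calendar order
present = [day for day in weekday_order if day in toy_hour_weekday.columns]
ordered = toy_hour_weekday[present]

print(ordered)
```

With the columns in calendar order, the Friday-to-Saturday and Saturday-to-Sunday transitions sit next to each other in the heatmap.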

### 2.3 Heatmap by Day and Month

In [None]:
df_day_month = df.groupby('Day Month'.split(), sort = False).apply(count_rows).unstack()
df_day_month.head()

In [None]:
plt.figure(figsize = (12,8))

ax = sns.heatmap(df_day_month, cmap = cm.YlGnBu, linewidth = .5)
ax.set(title="Trips by Day and Month");

#### Analysing the results
We observe that the number of trips increases each month, which suggests that Uber's ridership grew steadily from April to September 2014.

The visualization also shows a dark spot on 30 April. The number of trips that day was extreme compared to the rest of the month.

Unfortunately, we have not been able to find any factual information to explain the spike. One possible explanation is a successful marketing campaign that day, but this is only an assumption. For the rest of the analysis we treat that day as an outlier.

In [None]:
#The April column of the day/month table (one value per day of the month)
april = df_day_month['April']

#The number of trips on the 30th of April (the April maximum)
max_april = april.max()

#The mean number of trips over the rest of April (the NaN for day 31 is skipped)
mean_rest_april = april.drop(april.idxmax()).mean()

ratio_april = round(max_april / mean_rest_april)
print('The number of trips on the 30th of April is {} times higher than the mean number of trips during the rest of the month'.format(ratio_april))

### 2.4 Heatmap by Month and Weekday

In [None]:
df_month_weekday = df.groupby('Month Weekday'.split(), sort = False).apply(count_rows).unstack()
df_month_weekday.head()

In [None]:
plt.figure(figsize = (12,8))

ax = sns.heatmap(df_month_weekday, cmap= cm.YlGnBu, linewidth = .5)
ax.set(title="Trips by Month and Weekday");

## 3. Spatial Visualization

In [None]:
#Setting up the limits
top, bottom, left, right = 41, 40.55, -74.3, -73.6

#Extracting the Longitude and Latitude of each pickup in our dataset
Longitudes = df['Lon'].values
Latitudes  = df['Lat'].values

### 3.1 Scatter visualization
For our first visualization we can reduce the computational cost by dropping rows with duplicate Latitude and Longitude pairs.

In [None]:
df_reduced = df.drop_duplicates(['Lat','Lon'])

In [None]:
ratio_reduction = round((count_rows(df) - count_rows(df_reduced))/count_rows(df) * 100)
print('The dataset has been reduced by {}%'.format(ratio_reduction))

In [None]:
#Extracting the Longitude and Latitude of each pickup in our reduced dataset
Longitudes_reduced = df_reduced['Lon']
Latitudes_reduced  = df_reduced['Lat']

In [None]:
%matplotlib inline

plt.figure(figsize=(16, 12))

plt.plot(Longitudes_reduced, Latitudes_reduced, '.', ms=.8, alpha=.5)

plt.ylim(top=top, bottom=bottom)
plt.xlim(left=left, right=right)


plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('New York Uber Pickups from April to September 2014')

plt.show()

### 3.2 Heatmap visualization
This visualization is more computationally demanding, since we can't use the reduced dataset if we want the heatmap to reflect the number of pickups.
We will use Basemap to create the spatial heatmap.
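
The reason the de-duplicated data cannot be used here is that `drop_duplicates` keeps one row per location and discards how often that location occurred. The sketch below illustrates this on made-up coordinates, and shows a grouped count as one way to keep both the distinct locations and their frequencies; the `Pickups` column name is an assumption for illustration.

```python
import pandas as pd

# Made-up pickups: the first coordinate pair occurs three times
toy = pd.DataFrame({
    "Lat": [40.75, 40.75, 40.75, 40.64],
    "Lon": [-73.99, -73.99, -73.99, -73.78],
})

# drop_duplicates keeps one row per location, so the multiplicity is lost
toy_reduced = toy.drop_duplicates(["Lat", "Lon"])

# Grouping instead keeps one row per location plus its pickup count
toy_counts = toy.groupby(["Lat", "Lon"]).size().reset_index(name="Pickups")

print(len(toy_reduced))
print(toy_counts)
```

Such a count column could in principle be passed to matplotlib's `hexbin` as weights through its `C` argument together with `reduce_C_function=np.sum`; that variation is not what this notebook does, which bins the full coordinate arrays directly.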

In [None]:
plt.figure(figsize=(18, 14))
plt.title('New York Uber Pickups from April to September 2014')

#https://matplotlib.org/basemap/api/basemap_api.html
#Named nyc_map rather than map to avoid shadowing Python's built-in map()
nyc_map = Basemap(projection='merc', urcrnrlat=top, llcrnrlat=bottom, llcrnrlon=left, urcrnrlon=right)
x, y = nyc_map(Longitudes, Latitudes)
nyc_map.hexbin(x, y, gridsize=1000, bins='log', cmap=cm.inferno)
nyc_map.colorbar(location='right', format='%.1f', label='Number of Pickups');
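
To make the binning idea behind this heatmap more concrete, here is a rough sketch that counts points per cell with NumPy alone. It is not the method used above: it uses rectangular cells from `np.histogram2d` rather than hexagons, works on raw longitude/latitude without a map projection, and runs on randomly generated coordinates rather than the real pickups. The log scaling shown is one common choice and may differ from what `bins='log'` does internally.

```python
import numpy as np

# Same bounding box as in the notebook
top, bottom, left, right = 41, 40.55, -74.3, -73.6

# Made-up coordinates standing in for the Longitudes/Latitudes arrays
rng = np.random.default_rng(0)
lons = rng.uniform(left, right, size=10_000)
lats = rng.uniform(bottom, top, size=10_000)

# Count points per rectangular cell (hexbin uses hexagonal cells instead)
counts, lon_edges, lat_edges = np.histogram2d(
    lons, lats, bins=50, range=[[left, right], [bottom, top]]
)

# Log scaling compresses the range so busy cells do not wash out quiet ones;
# log10(count + 1) keeps empty cells at 0
log_counts = np.log10(counts + 1)

print(counts.shape, counts.sum())
```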

#### Analysing the results
From our spatial visualization we observe that:
- Most Uber pickups in New York occur between Midtown and Lower Manhattan.
- Upper Manhattan and Brooklyn Heights come next.
- Jersey City and the rest of Brooklyn come last.

We see some brighter spots in our heatmap, corresponding to:
- LaGuardia Airport in East Elmhurst.
- John F. Kennedy International Airport.
- Newark Liberty International Airport.

We know that many airports have specific requirements about where customers can be picked up by vehicles on the Uber platform. We can assume that these three airports have such requirements, since they represent a large share of Uber's business in New York.


### Created by MEBREK Brahim