# Exploring Uber New York data



Questions I want to answer with this data:

## Using Uber data only

What areas show the most pickups?
Do pickups increase over time? Month-to-month variation is the best way to look at this.
What day in the week is busiest?
What time of the day is busiest?

## Using Uber data and other Taxi data

Do we see a decrease in pickups for other Taxi companies?


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
april_uber = pd.read_csv('../input/uber-raw-data-apr14.csv')
april_uber.info()

In [None]:
april_uber['Date/Time'] = pd.to_datetime(april_uber['Date/Time'], format="%m/%d/%Y %H:%M:%S")

In [None]:
april_uber['Dayofweek'] = april_uber['Date/Time'].dt.dayofweek
april_uber['Daynumber'] = april_uber['Date/Time'].dt.day
april_uber['Hour'] = april_uber['Date/Time'].dt.hour
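
As a quick sanity check on what the three derived columns contain, here is a minimal sketch on two hand-written timestamps (the `sample` frame is made up for illustration and is not part of the Uber data). Note that `dt.dayofweek` counts from Monday=0, so a Saturday shows up as 5.

```python
import pandas as pd

# Two hand-picked timestamps matching the format string used above (illustrative only).
sample = pd.DataFrame({'Date/Time': ['04/12/2014 00:11:00', '04/30/2014 17:45:00']})
sample['Date/Time'] = pd.to_datetime(sample['Date/Time'], format="%m/%d/%Y %H:%M:%S")

sample['Dayofweek'] = sample['Date/Time'].dt.dayofweek  # Monday=0 ... Sunday=6
sample['Daynumber'] = sample['Date/Time'].dt.day        # day of the month
sample['Hour'] = sample['Date/Time'].dt.hour            # hour of the day, 0-23
print(sample)
```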

In [None]:
print(april_uber['Dayofweek'].unique())
print(april_uber['Daynumber'].unique())
print(april_uber['Hour'].unique())

In [None]:
april_uber['Hour'].hist(bins=24,color='k', alpha=0.5)
plt.xlim(0,23)
#plt.title('')
plt.ylabel('Total Journeys')
plt.xlabel('Time in Hours');
plt.axvline(8, color='r', linestyle='solid')
plt.axvline(17, color='r', linestyle='solid')

The peaks in pickups coincide with the rush hours; the red lines mark 8am and 5pm.
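
The busiest hour can also be read off numerically rather than from the plot. The sketch below uses a small made-up `hours` series as a stand-in for `april_uber['Hour']`; the numbers are illustrative only.

```python
import pandas as pd

# Synthetic hours standing in for april_uber['Hour'] (illustrative only).
hours = pd.Series([8, 8, 17, 17, 17, 17, 3, 12, 18, 18, 18])

# Count pickups per hour and keep all 24 hours in order (hours with no pickups get 0).
per_hour = hours.value_counts().reindex(range(24), fill_value=0)

busiest_hour = per_hour.idxmax()
print(per_hour[per_hour > 0])
print('Busiest hour:', busiest_hour)
```

On the real data the same two lines would be applied to `april_uber['Hour']` directly.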

In [None]:
april_uber['Daynumber'].hist(bins=30,color='k', alpha=0.5)
plt.xlim(1,30)
plt.title('Journeys for April 2014')
plt.ylabel('Total Journeys')
plt.xlabel('Date in April');

Weekends are generally quieter than weekdays. Let's have a quick look at Saturday the 12th of April to see what the pickup distribution over the day looks like.
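
To compare weekdays directly, one option is to average the daily totals per weekday instead of summing them, since April 2014 contains five Tuesdays and Wednesdays but only four of every other weekday. A minimal sketch on made-up pickups (the `pickups` frame and its `Date` column are assumptions for illustration):

```python
import pandas as pd

# Synthetic pickups, one row per pickup (illustrative only).
pickups = pd.DataFrame({
    'Date/Time': pd.to_datetime([
        '2014-04-07 08:00', '2014-04-07 09:00', '2014-04-07 17:00',  # Monday
        '2014-04-12 13:00',                                           # Saturday
        '2014-04-14 08:00', '2014-04-14 18:00', '2014-04-14 19:00',  # Monday
        '2014-04-19 22:00', '2014-04-19 23:00',                       # Saturday
    ])
})
pickups['Date'] = pickups['Date/Time'].dt.date
pickups['Dayofweek'] = pickups['Date/Time'].dt.dayofweek  # Monday=0 ... Sunday=6

# Pickups per calendar day, then the mean of those daily totals per weekday.
daily = pickups.groupby(['Date', 'Dayofweek']).size()
mean_per_weekday = daily.groupby(level='Dayofweek').mean()
print(mean_per_weekday)
```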

In [None]:
april_uber[april_uber['Daynumber']==12]['Hour'].hist(bins=24,color='k', alpha=0.5)
plt.xlim(0,24)
plt.title('Journeys for 12th of April 2014')
plt.ylabel('Total Journeys')
plt.xlabel('Time in Hours');
plt.axvline(8, color='r', linestyle='solid')
plt.axvline(17, color='r', linestyle='solid')

We see that the distribution is very different for a Saturday with no maximum at rush hour times.

The histogram of journeys by date above shows that the final date (April 30th 2014) is an outlier, with roughly 10000 more Uber pickups than the preceding days. Let's compare it with the same weekday one week earlier (April 23rd) to check that this is not a mistake in the data.

In [None]:
test_30 = april_uber[april_uber['Daynumber']== 30]
test_23 = april_uber[april_uber['Daynumber']== 23]
print(len(test_30))
print(len(test_23))

Indeed, there are far more Uber pickups on the 30th of April. Let's see whether there is an event spike (maybe a concert was on) by comparing the hourly distributions of the two days.

In [None]:
test_30['Hour'].hist(bins=24,color='k', alpha=0.5)
test_23['Hour'].hist(bins=24,color='b', alpha=0.5)
plt.xlim(0,23)
#plt.title('')
plt.ylabel('Total Journeys')
plt.xlabel('Time in Hours');
plt.axvline(8, color='r', linestyle='solid')
plt.axvline(17, color='r', linestyle='solid')

Both days have a very similar distribution, so there is no single event spike.
Checking online, it turns out there was a major rain storm on this day. Presumably many people who normally walk to work opted for a taxi to avoid the rain.

http://www.nycareaweather.com/2014/04/april-30-2014-heavy-rain-continues-over-2-3-inches/
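
The similarity of the two hourly distributions can also be checked numerically by comparing the share of pickups per hour instead of the raw counts. The sketch below is one way to do that, using total variation distance; `hourly_share`, `quiet_day` and `busy_day` are hypothetical names, and the two series are made-up stand-ins for `test_23['Hour']` and `test_30['Hour']`.

```python
import pandas as pd

def hourly_share(hours):
    """Return the fraction of a day's pickups that fall in each hour (0-23)."""
    counts = hours.value_counts().reindex(range(24), fill_value=0)
    return counts / counts.sum()

# Synthetic stand-ins for test_23['Hour'] and test_30['Hour'] (illustrative only).
quiet_day = pd.Series([8, 17, 17, 18])
busy_day = pd.Series([8, 8, 17, 17, 17, 17, 18, 18])

# Total variation distance between the two distributions:
# 0 means identical shapes, 1 means no overlap at all.
distance = 0.5 * (hourly_share(quiet_day) - hourly_share(busy_day)).abs().sum()
print(distance)
```

In this toy case the busy day has twice as many pickups but the same shape, so the distance is zero.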

Sadly, we can't check whether other taxi companies saw a similar peak, as no data is available for this date.

Let's now investigate the month by month Uber pickups.

# Combining all Uber datasets

In [None]:
def load_uber_month(month):
    """Load one month of raw Uber data and add the derived date columns."""
    df = pd.read_csv(f'../input/uber-raw-data-{month}14.csv')
    df['Date/Time'] = pd.to_datetime(df['Date/Time'], format="%m/%d/%Y %H:%M:%S")
    df['Dayofweek'] = df['Date/Time'].dt.dayofweek
    df['Daynumber'] = df['Date/Time'].dt.day
    df['Hour'] = df['Date/Time'].dt.hour
    return df

may_uber = load_uber_month('may')
jun_uber = load_uber_month('jun')
jul_uber = load_uber_month('jul')
aug_uber = load_uber_month('aug')
sep_uber = load_uber_month('sep')

In [None]:
full_uber = pd.concat([april_uber,may_uber,jun_uber,jul_uber,aug_uber,sep_uber])

In [None]:
full_uber['Month'] = full_uber['Date/Time'].dt.month

In [None]:
full_uber['Hour'].hist(bins=24,color='k', alpha=0.5)
plt.xlim(0,23)
plt.ylabel('Total Journeys')
plt.xlabel('Time in Hours');
plt.axvline(8, color='r', linestyle='solid')
plt.axvline(17, color='r', linestyle='solid')

In [None]:
full_uber['Month'].hist(bins=6,color='k', alpha=0.5)
plt.xlim(4,9)
#plt.title('')
plt.ylabel('Total Journeys')
plt.xlabel('Month');

In [None]:
((len(full_uber[full_uber['Month']==9])-len(full_uber[full_uber['Month']==4]))/len(full_uber[full_uber['Month']==4])) * 100

Uber pickups increased by 82% from April to September 2014, and the increase is quite steady and roughly linear. Let's take a look at another taxi company to see if there is any decrease in pickups due to Uber.
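
How steady the increase is can be checked by looking at the change from each month to the next. A minimal sketch, with a made-up `months` series standing in for `full_uber['Month']` (the counts are illustrative and are not the real Uber figures):

```python
import pandas as pd

# Synthetic month labels standing in for full_uber['Month'] (illustrative only).
months = pd.Series([4] * 100 + [5] * 120 + [6] * 150)

monthly = months.value_counts().sort_index()  # pickups per month, in calendar order
growth = monthly.pct_change() * 100           # percent change vs. the previous month
total_growth = (monthly.iloc[-1] - monthly.iloc[0]) / monthly.iloc[0] * 100

print(growth.round(1))
print(total_growth)
```

The first entry of `growth` is NaN because the first month has no predecessor to compare against.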

In [None]:
skyline = pd.read_csv('../input/other-Skyline_B00111.csv')
skyline['Date'] = pd.to_datetime(skyline['Date'], format="%m/%d/%Y")
skyline['Month'] = skyline['Date'].dt.month
# Uber in black, Skyline in blue so the two histograms can be told apart.
full_uber['Month'].hist(bins=6,color='k', alpha=0.5)
skyline['Month'].hist(bins=3,color='b', alpha=0.5)
plt.xlim(4,9)
#plt.title('')
plt.ylabel('Total Journeys')
plt.xlabel('Month');

In [None]:
print(len(skyline[skyline['Month']==7]))
print(len(skyline[skyline['Month']==8]))
print(len(skyline[skyline['Month']==9]))

No decrease can be seen for this taxi company alone, but a firm conclusion would require combining the data from all the other taxi companies.
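
A rough sketch of what that combination could look like, assuming each company's file has first been reduced to a `Month` column the same way `skyline` was. The frames `company_a` and `company_b`, the `Company` column and all the rows are made up for illustration.

```python
import pandas as pd

# Hypothetical per-company frames, each already reduced to a 'Month' column.
# The company names and rows are made up for illustration.
company_a = pd.DataFrame({'Month': [7, 7, 8, 9]})
company_b = pd.DataFrame({'Month': [7, 8, 8, 9, 9]})

# Tag each frame with its company, stack them, and count pickups per month.
combined = pd.concat(
    [company_a.assign(Company='A'), company_b.assign(Company='B')],
    ignore_index=True,
)
per_month = combined.groupby('Month').size()
per_company = combined.groupby(['Month', 'Company']).size().unstack(fill_value=0)
print(per_month)
print(per_company)
```

The `per_month` totals would then be compared with the Uber monthly counts over the same months.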