# Uber New York Data Analysis

# 1. Import the Libraries

First, we import the basic modules we need for the analysis and its visualizations.
* **pandas**       : Data frames
* **seaborn**      : Data visualization
* **numpy**        : Numerical computation
* **matplotlib**   : Data visualization
* **plotly**       : Interactive data visualization

In [None]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import os

import plotly.express as px
import plotly.graph_objects as go

# 2. Data Loading and Preparation

For this analysis we need Uber trip data from 2014. There are six files of raw data on Uber pickups in New York City, one per month from April to September 2014, so we have to concatenate them into a single dataset.

In [None]:
path = r'../input/uber-pickups-in-new-york-city'
files = ['uber-raw-data-aug14.csv',
         'uber-raw-data-apr14.csv',
         'uber-raw-data-jul14.csv',
         'uber-raw-data-jun14.csv',
         'uber-raw-data-may14.csv',
         'uber-raw-data-sep14.csv']
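# Read each monthly file, then stack all of them with a single concat call
# (this avoids repeatedly concatenating onto an empty DataFrame).
frames = [pd.read_csv(os.path.join(path, file), encoding='utf-8') for file in files]
final = pd.concat(frames)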

In [None]:
final.shape

The files are separated by month and each has the following columns:

* Date/Time : The date and time of the Uber pickup
* Lat       : The latitude of the Uber pickup
* Lon       : The longitude of the Uber pickup
* Base      : The TLC base company code affiliated with the Uber pickup

The globe is divided by two sets of imaginary lines. Lines of longitude run from top to bottom (north to south) and span 360 degrees around the globe, while lines of latitude run from side to side (west to east) and span 180 degrees from pole to pole.
Latitude measures the distance north or south of the Equator, and longitude measures the distance east or west of the prime meridian.
Every location on Earth has a global address. Because the address is in numbers, people can communicate about location no matter what language they speak. A global address is given as two numbers called coordinates: a location's latitude and its longitude ("Lat/Long").
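
As a small illustration of working with coordinates, the sketch below checks that latitude and longitude values fall in their valid ranges and then keeps only points inside a rough New York City window. The sample points are hypothetical, and the window simply reuses the axis limits of the scatter plot later in this notebook; it is not an official city boundary.

```python
import pandas as pd

# Hypothetical sample coordinates (not taken from the Uber files).
points = pd.DataFrame({
    "Lat": [40.7690, 40.7267, 51.5074],
    "Lon": [-73.9549, -74.0345, -0.1278],
})

# Valid coordinates: latitude in [-90, 90], longitude in [-180, 180].
valid = points["Lat"].between(-90, 90) & points["Lon"].between(-180, 180)

# Rough NYC window (assumption): longitude -74.2 to -73.7, latitude 40.6 to 41.
in_nyc = points["Lat"].between(40.6, 41.0) & points["Lon"].between(-74.2, -73.7)

print(points[valid & in_nyc])
```

The third point has valid coordinates but lies far outside the window, so only the first two rows are kept.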

In [None]:
df = final.copy()
df.head(10)

Let's check the data types.

In [None]:
df.dtypes

The **Date/Time** column has the data type object, so we should convert it to datetime using **pd.to_datetime()**.

In [None]:
df['Date/Time'] = pd.to_datetime(df['Date/Time'], format='%m/%d/%Y %H:%M:%S')
df.dtypes
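
To see what the conversion does in isolation, here is a minimal sketch on two hypothetical strings written in the same month/day/year layout as the format string above. It also shows `errors="coerce"`, which is not used in this notebook but can help when some rows might be malformed.

```python
import pandas as pd

# Hypothetical sample strings in the layout assumed by the format string.
raw = pd.Series(["4/1/2014 0:11:00", "9/30/2014 22:57:00"])

parsed = pd.to_datetime(raw, format="%m/%d/%Y %H:%M:%S")
print(parsed.dtype)  # a datetime64 dtype instead of object

# errors="coerce" turns unparseable values into NaT instead of raising.
messy = pd.Series(["4/1/2014 0:11:00", "not a date"])
coerced = pd.to_datetime(messy, format="%m/%d/%Y %H:%M:%S", errors="coerce")
print(coerced.isna().sum())
```

Passing an explicit `format` saves pandas from having to infer the layout, which usually makes parsing large files faster and less ambiguous.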

Now we are going to add new columns for the weekday, day, minute, month, and hour.

In [None]:
df['weekday']=df['Date/Time'].dt.day_name()
df['day']=df['Date/Time'].dt.day
df['minute']=df['Date/Time'].dt.minute
df['month']=df['Date/Time'].dt.month
df['hour']=df['Date/Time'].dt.hour
df.head()
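
Weekday names sort alphabetically by default, which can scramble charts and pivot tables. A possible way around this, sketched here on hypothetical timestamps, is to reindex the counts in calendar order. The `weekday_order` list is a helper introduced only for this example.

```python
import pandas as pd

# Hypothetical timestamps: a Tuesday, a Sunday and a Monday in April 2014.
ts = pd.Series(pd.to_datetime([
    "2014-04-01 08:15:00",
    "2014-04-06 23:40:00",
    "2014-04-07 12:00:00",
]))

weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# Count trips per weekday, then put the rows in calendar order;
# weekdays with no trips get a count of 0.
counts = ts.dt.day_name().value_counts().reindex(weekday_order, fill_value=0)
print(counts)
```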

In [None]:
df.dtypes

In [None]:
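# One color per weekday bar (7 bars). value_counts() sorts in descending
# order, so the first bar is the busiest day and gets the highlight color.
colors = ['lightslategray',] * 7
colors[0] = 'crimson'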

fig = go.Figure(data=[go.Bar(
    x=df['weekday'].value_counts().index,
    y=df['weekday'].value_counts().values,
    marker_color=colors # marker color can be a single color value or an iterable
)])
fig.update_layout(title_text='Rush Day of Uber Trip')

From the bar chart above we can see that Thursday is the busiest day.

# 3. Exploratory Data and Analysis

## Analysing Trip of Uber

### Thursday seems to have the most trips

In [None]:
# One color per weekday bar (7 bars); the first (busiest) bar is highlighted.
colors = ['lightslategray',] * 7
colors[0] = 'crimson'

fig = go.Figure(data=[go.Bar(
    x = df['weekday'].value_counts().index,
    y = df['weekday'].value_counts(),
    marker_color=colors # marker color can be a single color value or an iterable
)])
fig.update_layout(title_text='High Sales of Uber Trip')
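
The color list has to contain one entry per bar. A minimal sketch, on hypothetical data, of computing the counts once and deriving the colors from them, so that the list always has the right length:

```python
import pandas as pd

# Hypothetical weekday column.
weekday = pd.Series(["Thursday", "Friday", "Thursday",
                     "Monday", "Thursday", "Friday"])

counts = weekday.value_counts()  # computed once, sorted in descending order

# Highlight the busiest day; every other bar gets the neutral color.
colors = ["crimson" if day == counts.idxmax() else "lightslategray"
          for day in counts.index]
print(list(zip(counts.index, counts.values, colors)))
```

`counts.index`, `counts.values` and `colors` can then be passed to `go.Bar` as `x`, `y` and `marker_color`.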


### Analysis by Hour

In [None]:
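# One bin per hour of the day (0-23) so that each bar is a single hour.
plt.hist(df['hour'], bins=24, range=(-0.5, 23.5))
plt.ylabel('frequency')
plt.xlabel('hour of day')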

The histogram above shows that demand peaks in the evening, when people are leaving work.
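
The peak can also be read off numerically rather than by eye. A small sketch on hypothetical hour values, counting trips per hour and looking up the busiest one:

```python
import pandas as pd

# Hypothetical pickup hours (0-23).
hours = pd.Series([8, 17, 17, 18, 17, 9, 18, 23])

# Trips per hour, with every hour of the day present (0 where no trips).
per_hour = hours.value_counts().reindex(range(24), fill_value=0)

peak_hour = per_hour.idxmax()
print(peak_hour, per_hour[peak_hour])
```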

### Peaks in the evening

In [None]:
plt.figure(figsize=(40,20))
for i,month in enumerate(df['month'].unique()):
  plt.subplot(3,2,i+1)
  df[df['month']==month]['hour'].hist()

From all the plots above we can conclude that the rush happens in the evening.

## Analysing Monthly Rides

In [None]:
import plotly.graph_objects as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

In [None]:
df.groupby('month')['hour'].count()

In [None]:
fig = go.Figure(data=[go.Bar(
    x = df.groupby('month')['hour'].count().index,
    y = df.groupby('month')['hour'].count(),
    #marker_color=colors # marker color can be a single color value or an iterable
)])
fig.update_layout(title_text='The Highest Monthly Ride')

The month with the most rides is September.

Next, we analyse the number of journeys on each day of the month.
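
To confirm the busiest month without reading it off the chart, one option is to take the index of the largest count. The sketch below uses a hypothetical `month` column and the standard library `calendar` module to turn month numbers into names.

```python
import calendar
import pandas as pd

# Hypothetical month numbers, one per trip.
trips = pd.DataFrame({"month": [4, 5, 5, 9, 9, 9]})

per_month = trips.groupby("month").size()

# Replace month numbers with month names for readability.
per_month.index = [calendar.month_name[int(m)] for m in per_month.index]

busiest_month = per_month.idxmax()
print(per_month)
print(busiest_month)
```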

In [None]:
plt.figure(figsize=(10,8))
# Bins cover days 1-31 so that the 31st is not dropped from the chart.
plt.hist(df['day'],bins=31,rwidth=0.8,range=(0.5,31.5))
plt.xlabel('date of the month')
plt.ylabel('Total Journeys')
plt.title('Journeys by Month Day')

The histogram above suggests that rides tend to pick up towards the end of the month.
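
One caveat when comparing days of the month is that not every day number occurs equally often: the 31st exists in only some months. A rough sketch, on hypothetical timestamps, of dividing the raw count for each day number by the number of distinct calendar dates that share it:

```python
import pandas as pd

# Hypothetical pickups: two on Jun 30, two on Jul 30 and three on Jul 31.
stamps = pd.to_datetime([
    "2014-06-30 09:00", "2014-06-30 18:00",
    "2014-07-30 09:00", "2014-07-30 18:00",
    "2014-07-31 08:00", "2014-07-31 12:00", "2014-07-31 19:00",
])
trips = pd.DataFrame({"Date/Time": stamps})
trips["day"] = trips["Date/Time"].dt.day
trips["date"] = trips["Date/Time"].dt.normalize()

raw = trips.groupby("day").size()                # total trips per day number
n_dates = trips.groupby("day")["date"].nunique() # calendar dates per day number
per_date = raw / n_dates                         # average trips per calendar date
print(per_date)
```

In this toy data the 30th has more trips in total, but the 31st is busier once the counts are averaged per calendar date.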

## Analysing Demand of Ubers

To analyse the total rides within each month, we iterate over the months and plot each one separately.

In [None]:
plt.figure(figsize=(20,12))
for i, month in enumerate(df['month'].unique(),1):
  plt.subplot(3,2,i)
  df_out=df[df['month']==month]
  plt.hist(df_out['day'])
  plt.xlabel('day in month {}'.format(month))
  plt.ylabel('total rides')

From the visualization above, in almost every month the last couple of days have the most rides.

**Analysing rush by hour**

In [None]:
ax=sns.pointplot(x='hour',y='Lat', data=df, hue='weekday')
ax.set_title('Hour of day vs latitude of pickup')

The point plot above shows the mean pickup latitude for each hour of the day, with one line per weekday.

## Performing Cross Analysis

Analyse which base number is the most popular in each month.

In [None]:
base=df.groupby(['Base','month'])['Date/Time'].count().reset_index()
base

In [None]:
plt.figure(figsize=(10,6))
ax = sns.lineplot(x='month',y='Date/Time', hue='Base',data=base)
ax.set_title('Popular Base Number by Month')

From the line plot, B02617 stands out as the most popular base.
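
The leading base in each month can also be looked up directly. A minimal sketch, using a small frame shaped like `base` but filled with hypothetical counts:

```python
import pandas as pd

# Same layout as `base` (Base, month, count of Date/Time); counts are hypothetical.
base = pd.DataFrame({
    "Base": ["B02598", "B02617", "B02598", "B02617"],
    "month": [4, 4, 9, 9],
    "Date/Time": [30, 10, 25, 40],
})

# One row per month, one column per base.
wide = base.pivot(index="month", columns="Base", values="Date/Time")

# For each month, the base with the largest count.
top_base = wide.idxmax(axis=1)
print(top_base)
```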

In this cross analysis we are going to visualize:
1. Heatmap by hour and weekday
2. Heatmap by hour and day
3. Heatmap by month and day
4. Heatmap by month and weekday

In [None]:
#Heatmap by hour and weekday
def count_rows(rows):
  return len(rows)
by_cross = df.groupby(['weekday','hour']).apply(count_rows)
by_cross

Let's create a pivot table

In [None]:
pivot=by_cross.unstack()
pivot
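
For reference, the same kind of count table can be built without a custom counting function. This sketch, on hypothetical data, compares `groupby(...).size().unstack()` with `pd.crosstab`; both are standard pandas calls, and neither is what the notebook itself uses.

```python
import pandas as pd

# Hypothetical weekday/hour pairs, one row per trip.
trips = pd.DataFrame({
    "weekday": ["Monday", "Monday", "Tuesday", "Monday"],
    "hour": [8, 8, 17, 17],
})

# Count rows per (weekday, hour) and spread hours into columns;
# fill_value=0 replaces missing combinations with 0 instead of NaN.
by_size = trips.groupby(["weekday", "hour"]).size().unstack(fill_value=0)

# pd.crosstab builds the same frequency table in one call.
by_crosstab = pd.crosstab(trips["weekday"], trips["hour"])

print(by_size)
```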

In [None]:
plt.figure(figsize=(15,8))
sns.heatmap(pivot)

To make it easy, we can define a function called heatmap.

In [None]:
#Heatmap by hour and day, month and day, month and weekday
def heatmap(col1, col2):
  by_cross = df.groupby([col1,col2]).apply(count_rows)
  pivot=by_cross.unstack()
  plt.figure(figsize=(15,8))
  return sns.heatmap(pivot)  

In [None]:
#Heatmap by hour and day
heatmap('day','hour')

In [None]:
#Heatmap by month and day
heatmap('day','month')

In [None]:
#Heatmap by month and weekday
heatmap('weekday', 'month')

## Performing Spatial Analysis on Demand of Uber

Analysis of the pickup location data points

In [None]:
plt.figure(figsize=(12,6))
plt.plot(df['Lon'],df['Lat'],'r+',ms=0.5)
plt.xlim(-74.2,-73.7)
plt.ylim(40.6,41)

Perform spatial analysis using a heatmap to get a clear picture of where the rush is.

In [None]:
df_out=df[df['weekday']=='Sunday']
df_out

In [None]:
rush=df_out.groupby(['Lat','Lon'])['weekday'].count().reset_index()
rush.columns=['Lat','Lon','no of trips']
rush
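
Exact latitude/longitude pairs rarely repeat, so grouping on raw coordinates tends to give many cells with a single trip. One possible refinement, sketched here on hypothetical points, is to round the coordinates to a coarse grid before counting. The two-decimal grid is an assumption for illustration; 0.01 degrees of latitude is roughly 1.1 km.

```python
import pandas as pd

# Hypothetical pickup coordinates.
pickups = pd.DataFrame({
    "Lat": [40.7521, 40.7538, 40.7612, 40.6449],
    "Lon": [-73.9776, -73.9781, -73.9790, -73.7822],
})

# Snap every point to a 0.01-degree grid cell.
cell = pickups[["Lat", "Lon"]].round(2)

# Count trips per grid cell.
rush = cell.groupby(["Lat", "Lon"]).size().reset_index(name="no of trips")
print(rush)
```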

## Analysing Uber Pickup on Each Month

In [None]:
uber_15=pd.read_csv(r'../input/uber-pickups-in-new-york-city/uber-raw-data-janjune-15.csv')

In [None]:
uber_15.head()

In [None]:
uber_15.dtypes

In [None]:
uber_15['Pickup_date']=pd.to_datetime(uber_15['Pickup_date'], format='%Y-%m-%d %H:%M:%S')

In [None]:
uber_15.dtypes

In [None]:
uber_15['weekday']=uber_15['Pickup_date'].dt.day_name()
uber_15['day']=uber_15['Pickup_date'].dt.day
uber_15['minute']=uber_15['Pickup_date'].dt.minute
uber_15['month']=uber_15['Pickup_date'].dt.month
uber_15['hour']=uber_15['Pickup_date'].dt.hour

In [None]:
uber_15.head()

Uber pickups by the month in New York City

In [None]:
px.bar(x=uber_15['month'].value_counts().index,
       y=uber_15['month'].value_counts())

## Analysing Rush in New York City

Analysing the rush in New York City for every hour

In [None]:
plt.figure(figsize=(12,6))
# Pass the column by name; recent seaborn versions expect keyword arguments here.
sns.countplot(x='hour', data=uber_15)
plt.title("Rush in New York City")

An in-depth analysis of the rush in NYC by day and hour

In [None]:
summary=uber_15.groupby(['weekday','hour'])['Pickup_date'].count().reset_index()
summary

In [None]:
summary.columns=['weekday','hour','count']
summary.head()

In [None]:
plt.figure(figsize=(12,8))
sns.pointplot(x='hour',y='count',hue='weekday',data=summary)

From the graph above, we can see that on every weekday there is an increase between hours 5 and 9, while on weekends demand is low at that time.
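
The weekday/weekend contrast can be made explicit by averaging the hourly counts for the two groups. A minimal sketch on a small frame shaped like `summary`, with hypothetical counts; the `day_type` column is a helper introduced only for this example.

```python
import pandas as pd

# Same layout as `summary` (weekday, hour, count); counts are hypothetical.
summary = pd.DataFrame({
    "weekday": ["Monday", "Monday", "Saturday", "Saturday", "Tuesday", "Tuesday"],
    "hour": [7, 20, 7, 20, 7, 20],
    "count": [120, 200, 40, 260, 100, 220],
})

# Label each row as weekday or weekend.
is_weekend = summary["weekday"].isin(["Saturday", "Sunday"])
summary["day_type"] = is_weekend.map({True: "weekend", False: "weekday"})

# Average count per hour, with one column per day type.
profile = summary.groupby(["hour", "day_type"])["count"].mean().unstack("day_type")
print(profile)
```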

## Performing In-Depth Analysis of Uber Base Number

**Analysing which base number has most active vehicles**

In [None]:
uber_foil=pd.read_csv(r'../input/uber-pickups-in-new-york-city/Uber-Jan-Feb-FOIL.csv')

In [None]:
uber_foil.head()

In [None]:
uber_foil.dtypes

Find the unique base numbers available in our data.

In [None]:
uber_foil['dispatching_base_number'].unique()

So, we need the distribution of active vehicles for each base number. Because we are comparing a distribution across several categories (the base numbers), a box plot is a simple choice.

In [None]:
plt.figure(figsize=(12,10))
sns.boxplot(x='dispatching_base_number',y='active_vehicles', data=uber_foil)

The box plot shows the distribution for each base number. From this we can conclude that B02764, shown in green, has the highest number of active vehicles.

**Analysing which base number has most trips**

In [None]:
plt.figure(figsize=(12,10))
sns.boxplot(x='dispatching_base_number', y='trips', data=uber_foil)

Again, B02764 leads here. It has both the highest number of active vehicles and the highest number of trips.
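
A numeric summary can back up what the box plots show. The sketch below computes the median active vehicles and trips per base on a small frame with the same column names as `uber_foil`. All values are hypothetical.

```python
import pandas as pd

# Same column names as uber_foil; the numbers are hypothetical.
foil = pd.DataFrame({
    "dispatching_base_number": ["B02512", "B02764", "B02512", "B02764"],
    "active_vehicles": [190, 3427, 225, 3147],
    "trips": [1132, 29421, 1765, 19974],
})

# Median per base (the center line of each box in a box plot).
per_base = foil.groupby("dispatching_base_number")[["active_vehicles", "trips"]].median()

# Rank the bases by median trips, largest first.
ranking = per_base.sort_values("trips", ascending=False)
print(ranking)
```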

**How average trips per vehicle increases or decreases over time for each base number**

In [None]:
# define trips per vehicle
uber_foil['trips/vehicle']=uber_foil['trips']/uber_foil['active_vehicles']

In [None]:
uber_foil.head()

In [None]:
plt.figure(figsize=(12,6))
uber_foil.set_index('date').groupby(['dispatching_base_number'])['trips/vehicle'].plot()
plt.ylabel('Average trips/vehicle')
plt.title('Demand vs Supply Chart')
plt.legend()

The chart has one line for each base number. B02764 and B02598 show the highest trips per vehicle, which means these two bases performed better. In contrast, B02512, shown in blue, has lower demand per vehicle than the other base numbers.
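
To compare bases with numbers instead of line colors, one option is to pivot the ratio into one column per base and average it. This is a sketch on hypothetical rows; the month/day/year date layout is an assumption about the FOIL file, not something confirmed above.

```python
import pandas as pd

# Same column names as uber_foil; the rows and the date layout are hypothetical.
foil = pd.DataFrame({
    "dispatching_base_number": ["B02512", "B02764", "B02512", "B02764"],
    "date": ["1/1/2015", "1/1/2015", "1/2/2015", "1/2/2015"],
    "active_vehicles": [200, 3000, 250, 3200],
    "trips": [1000, 27000, 1500, 32000],
})
foil["date"] = pd.to_datetime(foil["date"], format="%m/%d/%Y")
foil["trips/vehicle"] = foil["trips"] / foil["active_vehicles"]

# One row per date, one column per base.
wide = foil.pivot(index="date", columns="dispatching_base_number",
                  values="trips/vehicle")

# Average trips per vehicle for each base, highest first.
mean_ratio = wide.mean().sort_values(ascending=False)
print(mean_ratio)
```

Calling `wide.plot()` on a frame like this draws one labelled line per base, with dates on the x-axis in chronological order.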