# 0. Introduction

We are going to explore the Uber Request Data to identify incomplete requests and suggest solutions for them.
Requests that are not completed do not generate revenue for the company.
We are provided with data on requests made between the Airport and the City.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import datetime as dt
import matplotlib.pyplot as plt

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# 1. Loading and visual analysis of the dataset

In [None]:
UberData = pd.read_csv("../input/uber-request-data/Uber Request Data.csv")

In [None]:
UberData.head()

In [None]:
UberData.tail()

In [None]:
UberData.shape

In [None]:
UberData.info()

In [None]:
UberData.Status.value_counts()

In [None]:
UberData.dtypes.value_counts()

In [None]:
UberData.select_dtypes(include=['object']).columns

Looking at a subset of the data and the definition of the features, we notice:
1. Status contains categorical values
2. The dataset has 6,745 rows, and Driver id and Drop timestamp contain missing values
3. Both Request timestamp and Drop timestamp are of type object and need to be cast to datetime
4. Driver id is a float but should be an integer
5. This is a small dataset, so no further analysis of performance is required
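
Point 4 has a catch: a plain integer column cannot hold missing values, which is why pandas loads the IDs as float. A minimal sketch of one option, pandas' nullable `Int64` dtype, on made-up values (the `driver_id` series below is illustrative and not taken from the dataset):

```python
import pandas as pd

# Hypothetical driver IDs; the missing entry stands for a request with no driver.
driver_id = pd.Series([1.0, 285.0, None, 12.0])

# Capital-I "Int64" is the nullable integer dtype: whole numbers plus <NA>.
driver_id_int = driver_id.astype("Int64")

print(driver_id.dtype)      # float64
print(driver_id_int.dtype)  # Int64
```

Since the notebook drops this column later, the cast is optional here.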

# 2. Checking for Duplicates and NULL values

In [None]:
UberData.duplicated().sum()

There are no duplicated rows in our dataset

In [None]:
UberData.isna().sum()

In [None]:
NA = UberData.isna().sum()/len(UberData)
NA[NA > 0].sort_values()

We see that Driver id and Drop timestamp have missing data.

**Check these two columns further to find any pattern and determine the best way to fill these NAs**

In [None]:
UberData[UberData['Driver id'].isna()]['Status'].unique()

In [None]:
UberData[UberData['Drop timestamp'].isna()]['Status'].unique()

From the above we notice that Driver id is not available for rides that were not initiated.
Drop timestamp is missing if the ride was not initiated or if it was cancelled.
We do not have to worry about the NAs in these two fields, as they are expected to be missing.
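
The link between missing values and status can also be read off a single table. A small sketch on a toy frame that mimics the original column names (the rows are invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "Status": ["Trip Completed", "Cancelled", "No Cars Available", "Cancelled"],
    "Driver id": [7.0, 3.0, None, 5.0],
    "Drop timestamp": ["13-07-2016 09:25:47", None, None, None],
})

# Share of missing values per column within each status (1.0 = always missing).
missing_by_status = (
    toy[["Driver id", "Drop timestamp"]]
    .isna()
    .groupby(toy["Status"])
    .mean()
)
print(missing_by_status)
```

A value of 1.0 for a status means the field is never filled for that status, which supports treating those NAs as structural rather than as data-quality problems.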

# 3. Feature selection

The purpose of our analysis is to identify possible reasons for rides not being completed due to unavailability of cars.
Looking at the available columns, Driver id and Drop timestamp are not going to be useful for this analysis.

In [None]:
UberData.drop(columns=['Driver id', 'Drop timestamp'],inplace = True)

# 4. Feature Engineering

**Renaming the columns to remove blank space and make the casing uniform**

In [None]:
UberData.columns = [i.replace(' ', '_').lower() for i in UberData.columns]

**Fixing the datatype**

In [None]:
UberData['request_timestamp'] = pd.to_datetime(UberData['request_timestamp'])

**Let's add some new features to support a more detailed analysis**

In [None]:
UberData['request_weekday'] = UberData['request_timestamp'].dt.day_name()

In [None]:
UberData['request_weekday'].unique()

Notice that our dataset doesn't have data for Tuesday or the weekend.
We can logically infer that our dataset may be a subset of the actual data, as it is unlikely that there were 0 rides on any given day.
Though this might not affect our analysis, it is best to communicate this to the stakeholders for clarity.

In [None]:
UberData['request_month'] = UberData['request_timestamp'].dt.month_name()

In [None]:
UberData['request_month'].unique()

Notice that our dataset has data only for July, November and December.
It is best to communicate this to the stakeholders for clarity.
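
Before reporting the gaps in weekdays and months, it may be worth ruling out a parsing artifact. If the raw timestamps are written day-first, `pd.to_datetime` reads them month-first by default whenever the day is 12 or lower, which can scatter a single week across several months. A minimal sketch with two made-up strings (not taken from the dataset):

```python
import pandas as pd

# Made-up samples, intended as 11 July and 12 July 2016 written day-first.
raw = pd.Series(["11/07/2016 11:51", "12/07/2016 09:17"])

month_first = pd.to_datetime(raw)               # default: first number is the month
day_first = pd.to_datetime(raw, dayfirst=True)  # first number is the day

print(month_first.dt.month_name().tolist())  # ['November', 'December']
print(day_first.dt.month_name().tolist())    # ['July', 'July']
print(day_first.dt.day_name().tolist())      # ['Monday', 'Tuesday']
```

If the raw file turns out to use day-first dates, passing `dayfirst=True` when casting `request_timestamp` would change the weekday and month features, so it is worth checking a few raw values by eye.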

In [None]:
UberData['request_year'] = UberData['request_timestamp'].dt.strftime('%Y')

In [None]:
UberData['request_year'].unique()

Our dataset contains requests made in the year 2016. Since there is just one year, let's drop this feature.

In [None]:
UberData.drop(columns=['request_year'],inplace = True)

In [None]:
UberData['Request_Hour'] = UberData['request_timestamp'].dt.round('H').dt.hour

In [None]:
UberData.rename(columns={"Request_Hour": "request_hour"}, inplace = True)

In [None]:
UberData['request_hour'].unique()
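
Note that `dt.round` sends a request to the nearest hour, so anything after half past counts toward the next hour and late-evening requests can wrap around to hour 0. A short sketch on made-up timestamps comparing this with taking `dt.hour` directly (the `"60min"` frequency string is used here as an equivalent spelling of one hour):

```python
import pandas as pd

ts = pd.Series(pd.to_datetime([
    "2016-07-13 08:29:00",
    "2016-07-13 08:31:00",
    "2016-07-13 23:45:00",
]))

rounded_hour = ts.dt.round("60min").dt.hour  # nearest hour
clock_hour = ts.dt.hour                      # hour in which the request was made

print(rounded_hour.tolist())  # [8, 9, 0]
print(clock_hour.tolist())    # [8, 8, 23]
```

Either choice is defensible, but it decides whether a 23:45 request ends up in the late-night or the early-morning bucket.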

In [None]:
def get_hr(hr):
    """Map an hour of the day (0-23) to a named time-of-day bucket."""
    if 0 <= hr <= 6:
        return 'Early Morning'
    elif 7 <= hr < 12:
        return 'Morning'
    elif 12 <= hr < 16:
        return 'Afternoon'
    elif 16 <= hr < 19:
        return 'Evening'
    elif 19 <= hr < 22:
        return 'Night'
    elif hr >= 22:
        return 'Late Night'

In [None]:
UberData['time_of_day'] = UberData['request_hour'].apply(get_hr)
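
The same bucketing can also be written with `pd.cut`, which returns an ordered categorical so that plots and group-bys follow clock order rather than alphabetical order. A sketch assuming integer hours from 0 to 23; the bin edges mirror the thresholds in `get_hr`, and the `hours` series is made up:

```python
import pandas as pd

hours = pd.Series([0, 6, 7, 11, 12, 15, 16, 18, 19, 21, 22, 23])

labels = ["Early Morning", "Morning", "Afternoon", "Evening", "Night", "Late Night"]
# Right-closed bins: (-1, 6], (6, 11], (11, 15], (15, 18], (18, 21], (21, 23]
edges = [-1, 6, 11, 15, 18, 21, 23]

time_of_day = pd.cut(hours, bins=edges, labels=labels)
print(time_of_day.tolist())
```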

**Our dataframe with new features**

In [None]:
UberData.head()

# 5. Observation

**Number of rides requested at each pickup point**

In [None]:
def make_autopct(values):
    def my_autopct(pct):
        total = sum(values)
        val = int(round(pct*total/100.0))
        return '{p:.2f}%  ({v:d})'.format(p=pct,v=val)
    return my_autopct

UberData['pickup_point'].value_counts().plot.pie(autopct=make_autopct(UberData['pickup_point'].value_counts()))
plt.show()

Our dataset concentrates on rides to and from the Airport, and we see that slightly more requests are made from the City to the Airport.

In [None]:
city_status = UberData[UberData.pickup_point == "City"]['status'].value_counts()
city_status.plot.pie(autopct=make_autopct(city_status))
plt.show()

In [None]:
airport_status = UberData[UberData.pickup_point == "Airport"]['status'].value_counts()
airport_status.plot.pie(autopct=make_autopct(airport_status))
plt.show()

Cars are most often unavailable for requests made from the Airport.
Most cancelled rides originate in the City, whereas the cancellation rate is much lower for requests made from the Airport.
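
The pie charts can be backed by a single table of status shares per pickup point. A minimal sketch with `pd.crosstab` on an invented toy frame (column names follow the renamed dataframe; the counts are not real):

```python
import pandas as pd

toy = pd.DataFrame({
    "pickup_point": ["Airport", "Airport", "Airport", "Airport",
                     "City", "City", "City", "City"],
    "status": ["No Cars Available", "No Cars Available", "Trip Completed", "Cancelled",
               "Cancelled", "Cancelled", "Trip Completed", "No Cars Available"],
})

# normalize="index" turns counts into row-wise shares, so each row sums to 1.
status_share = pd.crosstab(toy["pickup_point"], toy["status"], normalize="index")
print(status_share.round(2))
```

Comparing rows of this table makes the Airport versus City contrast explicit without reading percentages off two separate charts.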

**Requests made on various days and time**

In [None]:
plt.hist(UberData[UberData.pickup_point == "Airport"]['request_hour'], bins=len(UberData['request_hour'].unique()))
plt.title("Airport")
plt.xlabel("Request hour")
plt.ylabel("No. of Requests")
plt.show()

In [None]:
plt.hist(UberData[(UberData.pickup_point == "Airport") & (UberData.status == "No Cars Available")]['request_hour'], bins=len(UberData['request_hour'].unique()))
plt.title("Cars unavailable at the Airport")
plt.xlabel("Request hour")
plt.ylabel("No. of Requests")
plt.show()

Most requests are made from late evening until midnight, and cars are unavailable during this peak time.
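
One caveat on the histograms: with `bins=len(...)`, matplotlib splits the range into equal-width bins that do not necessarily line up with whole hours. A sketch of one alternative, assuming integer hours from 0 to 23, is to centre one bin on each hour (the `hours` array is made up):

```python
import numpy as np

hours = np.array([0, 5, 5, 18, 19, 19, 19, 20, 21, 23])

# 25 edges at -0.5, 0.5, ..., 23.5 give 24 bins, one centred on each hour.
edges = np.arange(25) - 0.5
counts, _ = np.histogram(hours, bins=edges)

print(counts[19])  # 3
```

The same `edges` array can be passed as `bins=edges` to `plt.hist`, so each bar corresponds to exactly one request hour.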

In [None]:
plt.hist(UberData[UberData.pickup_point == "City"]['request_hour'], bins=len(UberData['request_hour'].unique()))
plt.title("City")
plt.xlabel("Request hour")
plt.ylabel("No. of Requests")
plt.show()

In [None]:
plt.hist(UberData[(UberData.pickup_point == "City") & (UberData.status == "No Cars Available")]['request_hour'], bins=len(UberData['request_hour'].unique()))
plt.title("Cars unavailable in City")
plt.xlabel("Request hour")
plt.ylabel("No. of Requests")
plt.show()

In contrast to Airport rides, most people heading to the Airport prefer to travel early in the morning, until about 10 am, and this is when cars are unavailable.

In [None]:
plt.hist(UberData[(UberData.pickup_point == "City")]['request_weekday'], bins=len(UberData['request_weekday'].unique()))
plt.title("City")
plt.xlabel("Request Weekday")
plt.ylabel("No. of Requests")
plt.show()

In [None]:
plt.hist(UberData[(UberData.pickup_point == "City") & (UberData.status == "No Cars Available")]['request_weekday'], bins=len(UberData['request_weekday'].unique()))
plt.title("No Cars Available in City")
plt.xlabel("Request Weekday")
plt.ylabel("No. of Requests")
plt.show()

In [None]:
plt.hist(UberData[(UberData.pickup_point == "Airport")]['request_weekday'], bins=len(UberData['request_weekday'].unique()))
plt.title("Airport")
plt.xlabel("Request Weekday")
plt.ylabel("No. of Requests")
plt.show()

In [None]:
plt.hist(UberData[(UberData.pickup_point == "Airport") & (UberData.status == "No Cars Available")]['request_weekday'], bins=len(UberData['request_weekday'].unique()))
plt.title("No Cars Available in Airport")
plt.xlabel("Request Weekday")
plt.ylabel("No. of Requests")
plt.show()

From the above charts we see that Wednesday is the busiest day for both inbound and outbound travel, and that is when demand is high.

In [None]:
tdgrp = UberData[(UberData.pickup_point == "City")].groupby('time_of_day')['request_id'].count().reset_index(name='# of request').sort_values(by=["# of request"], ascending = False)
sns.barplot(x="time_of_day", y="# of request", data=tdgrp)
plt.show()

tdgrp = UberData[(UberData.pickup_point == "City") & (UberData.status == "No Cars Available")].groupby('time_of_day')['request_id'].count().reset_index(name='# of request').sort_values(by=["# of request"], ascending = False)
sns.barplot(x="time_of_day", y="# of request", data=tdgrp)
plt.show()

Compared to the Airport, the City has a smaller supply-demand gap.

In [None]:
tdgrp = UberData[(UberData.pickup_point == "Airport")].groupby('time_of_day')['request_id'].count().reset_index(name='# of request').sort_values(by=["# of request"], ascending = False)
sns.barplot(x="time_of_day", y="# of request", data=tdgrp)
plt.show()
tdgrp = UberData[(UberData.pickup_point == "Airport") & (UberData.status == "No Cars Available")].groupby('time_of_day')['request_id'].count().reset_index(name='# of request').sort_values(by=["# of request"], ascending = False)
sns.barplot(x="time_of_day", y="# of request", data=tdgrp)
plt.show()

The supply-demand gap is small only from morning to noon.
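
To put numbers on the gap, one option is to treat every request as demand and every completed trip as supply. This definition is an assumption made for the sketch, not something stated in the data. On an invented toy frame:

```python
import pandas as pd

toy = pd.DataFrame({
    "time_of_day": ["Morning", "Morning", "Morning",
                    "Evening", "Evening", "Evening", "Evening"],
    "status": ["Trip Completed", "Trip Completed", "Cancelled",
               "No Cars Available", "No Cars Available", "Trip Completed", "Cancelled"],
})

# 1 for a completed trip, 0 otherwise, so the sum counts completed trips.
completed = toy["status"].eq("Trip Completed").astype(int)

summary = (
    toy.assign(completed=completed)
    .groupby("time_of_day")["completed"]
    .agg(demand="count", supply="sum")
)
summary["gap"] = summary["demand"] - summary["supply"]
summary["gap_share"] = summary["gap"] / summary["demand"]
print(summary)
```

Running the same aggregation separately for each pickup point would allow the Airport and City gaps to be compared as shares rather than as bar heights.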

**Quick Summary on Cancelled Rides**

In [None]:
fig, axes = plt.subplots(3, 1, figsize=(15, 11))
UberData[(UberData.status == "Cancelled") & (UberData.pickup_point == "City")].groupby(['time_of_day','status'])['request_id'].count().unstack().plot.bar(ax = axes[0],legend=True, rot=0)
UberData[(UberData.status == "Cancelled") & (UberData.pickup_point == "City")].groupby(['request_weekday','status'])['request_id'].count().unstack().plot.bar(ax = axes[1],legend=True, rot=0)
UberData[(UberData.status == "Cancelled") & (UberData.pickup_point == "City")].groupby(['request_hour','status'])['request_id'].count().unstack().plot.bar(ax = axes[2],legend=True, rot=0)
fig.tight_layout() 


**Summarizing the Stats by Status**

In [None]:
UberData[UberData.pickup_point == "Airport"].groupby(['request_weekday','time_of_day','status'])['request_id'].count().unstack().plot.bar(legend=True, figsize=(30,10))
plt.title("Airport")
plt.show()

UberData[UberData.pickup_point == "City"].groupby(['request_weekday','time_of_day','status'])['request_id'].count().unstack().plot.bar(legend=True, figsize=(30,10))
plt.title("City")
plt.show()

Note that in some day and time slots most of the rides are cancelled. This needs further investigation into the reason for each cancellation and who initiated it.

# 6. Summary

**Problems observed:**
1. Requests from the Airport to the City and vice versa are almost equal in number, with slightly more requests from the City
2. About half the rides are not completed due to cancellation or unavailability
3. Most unavailability occurs at the Airport rather than in the City, and most cancellations happen in the City
4. Peak demand and unavailability occur in the evening at the Airport and in the morning in the City
5. Peak demand and unavailability occur on Wednesdays

**Suggested solutions**
1. Special incentives can be offered to drivers on Wednesdays, to drivers who take Airport pickups at night, and to drivers who take passengers from the City to the Airport in the morning.
2. A dedicated fleet service that handles only Airport rides can be set up to meet the demand.
3. A van service that can accommodate more than one passenger can be introduced to close the supply-demand gap with fewer vehicles.
4. Drivers making an Airport pickup at night can be encouraged to wait in the City for Airport requests in the early to mid morning. Having fewer Airport-to-City requests in the morning, and vice versa, can be a reason for this gap. The business can offer bonuses, extra commission and similar incentives to encourage drivers to bridge this gap.

# 7. A Quick Look at an EDA Package