# Uber Analysis

## Dataset

The dataset contains Start Date, End Date, Start Location, End Location, Miles Driven, and Purpose of drive (Business, Personal, Meals, Errands, Meetings, Customer Support, etc.).

Geography: USA, Sri Lanka and Pakistan

Time period: January - December 2016

Unit of analysis: Drives

Total Drives: 1,155

Total Miles: 12,204

In [None]:
# Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime, time

import warnings
warnings.filterwarnings('ignore')



## Exploratory Data Analysis

In [None]:
df = pd.read_csv("../input/uberdrives/My Uber Drives - 2016.csv", encoding="latin1")
df.head()

In [None]:
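# Remove the literal "*" suffix from column names.
# regex=False treats "*" as a plain character; as a regex it is an invalid pattern.
df.columns = df.columns.str.replace("*", "", regex=False)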
df.head(1)
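
The column clean-up above can be checked on a tiny frame. This is a minimal sketch: the `toy` DataFrame and its values are made up for illustration and only mimic the starred headers of the raw CSV.

```python
import pandas as pd

# Toy frame mimicking the starred headers in the raw CSV (values are made up).
toy = pd.DataFrame({"START_DATE*": ["1/1/2016 21:11"], "MILES*": [5.1]})

# regex=False makes "*" a literal character rather than a regex quantifier.
toy.columns = toy.columns.str.replace("*", "", regex=False)
print(list(toy.columns))
```

Passing `regex=False` explicitly keeps the behaviour the same across pandas versions, whose default for this parameter has changed over time.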

In [None]:
df.info()

In [None]:
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns in the dataset")

In [None]:
df.describe()

### Handling missing data

In [None]:
df.isnull().sum()

In [None]:
# Visualizing the missing data
plt.figure(figsize=(10,5))
sns.heatmap(df.isnull(),cmap="magma",yticklabels=False,cbar = False)
plt.show()

In [None]:
import missingno as msno

msno.bar(df)
plt.show()

In [None]:
df_copy = df.copy()

In [None]:
null_columns = df_copy.columns[df_copy.isnull().any()]
df_copy[null_columns].isnull().sum()

In [None]:
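# Drop the final row (index 1155). The dataset has 1,155 drives (indices 0 to 1154),
# so this extra row is presumably a totals line rather than a drive.
df_copy.drop(index = 1155, inplace = True)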


In [None]:
df_copy.isnull().sum()

In [None]:
#percentage of null value present in purpose column
int((df_copy['PURPOSE'].isnull().sum()/len(df_copy))*100)
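
The same percentage can be computed for every column at once, since the mean of a boolean mask is the fraction of `True` values. A minimal sketch, assuming a made-up `toy` frame in place of `df_copy`:

```python
import numpy as np
import pandas as pd

# Made-up stand-in for df_copy with two missing PURPOSE values out of four.
toy = pd.DataFrame({
    "PURPOSE": ["Meeting", np.nan, "Errand/Supplies", np.nan],
    "MILES": [5.1, 4.8, 10.0, 2.2],
})

# isna() gives a boolean frame; its column-wise mean is the missing fraction.
missing_pct = (toy.isna().mean() * 100).round(1)
print(missing_pct)
```

Unlike `int(...)`, rounding keeps the fractional part, which matters when the missing share is small.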

In [None]:
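# Fill missing PURPOSE values with the previous trip's purpose (forward fill).
# Series.ffill() replaces the deprecated fillna(method='ffill', inplace=True) form.
df_copy['PURPOSE'] = df_copy['PURPOSE'].ffill()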

In [None]:
df_copy.isna().sum()
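
Forward fill assumes that a trip with no recorded purpose shares the purpose of the trip before it, which may not hold. The sketch below, on a made-up `purpose` Series, contrasts it with an alternative that is not used in this notebook: labelling the gaps explicitly (the `"Unknown"` label is an assumption).

```python
import numpy as np
import pandas as pd

# Made-up sequence of trip purposes with gaps.
purpose = pd.Series(["Meeting", np.nan, np.nan, "Errand/Supplies", np.nan])

# Forward fill: each gap inherits the last recorded purpose.
forward_filled = purpose.ffill()

# Alternative (hypothetical label): keep the gaps visible as their own group.
labelled = purpose.fillna("Unknown")

print(pd.DataFrame({"ffill": forward_filled, "labelled": labelled}))
```

With forward fill, the filled rows inflate the counts of whichever purposes happen to precede the gaps, so purpose frequencies later in the analysis should be read with that in mind.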

### Relation between duration and purpose of cab ride

In [None]:
df_copy['START_DATE'] = pd.to_datetime(df_copy['START_DATE'], errors='coerce')
df_copy['END_DATE'] = pd.to_datetime(df_copy['END_DATE'], errors='coerce')
df_copy.info()
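
Because `errors='coerce'` silently turns unparseable values into `NaT`, it can help to count how many rows failed. The sketch below assumes timestamps of the form `month/day/year hour:minute`; both the format string and the sample values are assumptions, not taken from the dataset.

```python
import pandas as pd

# Made-up raw values: two timestamps in the assumed format and one non-date string.
raw = pd.Series(["1/1/2016 21:11", "12/31/2016 22:08", "Totals"])

# An explicit format avoids ambiguous day/month guessing; bad values become NaT.
parsed = pd.to_datetime(raw, format="%m/%d/%Y %H:%M", errors="coerce")

# Count the values that could not be parsed.
n_failed = int(parsed.isna().sum())
print(parsed)
print("unparseable rows:", n_failed)
```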

In [None]:
category = pd.crosstab(index = df_copy['CATEGORY'],columns = 'Count of travels as per category')
category.plot(kind = 'bar',color = 'r',alpha = 0.7)
plt.legend()
plt.tight_layout()
category

In [None]:
start_point = df_copy.START.value_counts()
start_point_value=start_point[start_point>10]
pie=plt.pie(start_point_value,labels = start_point_value.index, shadow = True, startangle = 190)
plt.tight_layout()
plt.title("Start location")
plt.show()

> According to the pie chart above, most trips start in Cary, Morrisville, and Whitebridge. Some start locations are unknown and are grouped together.

### Which places have the fewest trip starts?

In [None]:
start_point = df_copy.START.value_counts()
start_point_value_low = start_point[start_point <= 10]
start_point_value_low

> Above are the start points with the fewest starts (10 or fewer).

### Which places have the most trip stops?

In [None]:
Stop_point = df_copy.STOP.value_counts()
Stop_point[Stop_point > 10]

> Cary, Unknown Location, Morrisville, Whitebridge, and then Islamabad are the most frequent stop points. The most frequent stop points are not identical to the most frequent start points; there is a slight difference.

### Which places have the fewest trip stops?

In [None]:
Stop_point = df_copy.STOP.value_counts()
Stop_point[Stop_point <= 10]

> These are the least frequent stop points.

> Looking at the most frequent start and stop points, we can see that Cary, Morrisville, and Whitebridge are popular destinations.
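
The frequent/infrequent split used for both start and stop points follows one pattern: count each location, then filter the counts by a threshold. A minimal sketch on made-up locations, with a threshold of 2 instead of the notebook's 10 so that the toy data splits:

```python
import pandas as pd

# Made-up start locations; the real data comes from df_copy.START.
starts = pd.Series(["Cary"] * 4 + ["Morrisville"] * 3 + ["Apex"])

# value_counts() returns locations sorted from most to least frequent.
counts = starts.value_counts()

threshold = 2  # the notebook uses 10; lowered here for the toy data
frequent = counts[counts > threshold]
infrequent = counts[counts <= threshold]

print(frequent)
print(infrequent)
```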


### Most frequently travelled distances

In [None]:
Miles = df_copy.MILES.value_counts()
print(Miles[Miles > 10])

Miles[Miles > 10].plot(kind = 'bar')
plt.tight_layout()
plt.title("Miles travelled")
plt.show()


In [None]:
Miles = pd.crosstab(index = df_copy['MILES']>10, columns = 'Count of Miles')
Miles.plot(kind = 'bar', color = 'r',alpha = 0.7)
plt.legend()
Miles

In [None]:
# Count trips (not distinct mileage values) above and at or below 10 miles
miles_high = (df_copy.MILES > 10).sum()
miles_low = (df_copy.MILES <= 10).sum()
pie_values = np.array([miles_high, miles_low])
plt.pie(pie_values,labels=['miles higher than 10','miles 10 or lower'], shadow=True, startangle = 155)
plt.title("Miles travelled")
ax = plt.gca()
plt.legend(bbox_to_anchor=(1, 1), bbox_transform=ax.transAxes)
plt.tight_layout()
plt.show()

> From the plots above, we can see that most trips are shorter than 10 miles.
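
The share of short trips can also be stated as a number rather than read off a chart. A minimal sketch, assuming a made-up `miles` Series in place of `df_copy.MILES`:

```python
import pandas as pd

# Made-up trip distances in miles.
miles = pd.Series([5.1, 4.8, 63.7, 2.0, 10.4, 8.0])

# The mean of a boolean mask is the fraction of trips at or below 10 miles.
short_share = (miles <= 10).mean()

# pd.cut assigns each trip to a labelled distance bin: (0, 10] and (10, inf].
bins = pd.cut(miles, bins=[0, 10, float("inf")],
              labels=["<= 10 miles", "> 10 miles"])
trips_per_bin = bins.value_counts().sort_index()

print(f"share of trips at or below 10 miles: {short_share:.0%}")
print(trips_per_bin)
```

This counts trips directly, whereas applying a threshold to `value_counts()` of the mileage column would count distinct mileage values instead.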

### Purpose of most cab rides

In [None]:
plt.figure(figsize=(15,6))
# Pass the column as x=; recent seaborn versions no longer accept it positionally
sns.countplot(x = df_copy['PURPOSE'], order = df_copy['PURPOSE'].value_counts().index, palette='viridis')
plt.show()

> Cabs were mostly used for meetings and meals/entertainment.

In [None]:
# calculating minutes of trip
df_copy['MINUTES']=df_copy.END_DATE - df_copy.START_DATE
df_copy['MINUTES'] = df_copy['MINUTES'].dt.total_seconds()/60
df_copy.head()
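
The duration calculation can be verified on two made-up trips: subtracting two datetime columns yields a timedelta, and `total_seconds() / 60` converts it to minutes.

```python
import pandas as pd

# Two made-up trips with known durations of 6 and 12 minutes.
trips = pd.DataFrame({
    "START_DATE": pd.to_datetime(["2016-01-01 21:11", "2016-01-02 01:25"]),
    "END_DATE": pd.to_datetime(["2016-01-01 21:17", "2016-01-02 01:37"]),
})

# Timedelta -> seconds -> minutes, in one step.
trips["MINUTES"] = (trips["END_DATE"] - trips["START_DATE"]).dt.total_seconds() / 60
print(trips)
```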

Grouping miles by purpose to see whether any pattern emerges.

In [None]:
pd.DataFrame({
    'MEAN': df_copy.groupby(['PURPOSE'])['MILES'].mean().round(1),
    'MIN' : df_copy.groupby(['PURPOSE'])['MILES'].min(),
    'MAX' : df_copy.groupby(['PURPOSE'])['MILES'].max()}).reset_index()
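
The same summary can be written with a single `groupby` and named aggregations, which avoids grouping three times. A minimal sketch on made-up trips; note that, unlike the cell above, it rounds all three columns:

```python
import pandas as pd

# Made-up trips; the real data comes from df_copy.
trips = pd.DataFrame({
    "PURPOSE": ["Meeting", "Meeting", "Commute"],
    "MILES": [10.0, 20.0, 180.2],
})

# Named aggregation: each keyword becomes an output column.
summary = (
    trips.groupby("PURPOSE")["MILES"]
    .agg(MEAN="mean", MIN="min", MAX="max")
    .round(1)
    .reset_index()
)
print(summary)
```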


### Box plot of MILES and MINUTES split by PURPOSE

In [None]:
plt.figure(figsize=(16,7))
plt.subplot(1,2,1)
sns.boxplot(data = df_copy,x = df_copy.PURPOSE, y = df_copy.MILES)
plt.xticks(rotation = 90)
plt.subplot(1,2,2)
sns.boxplot(data = df_copy,x = df_copy.PURPOSE, y = df_copy.MINUTES )
plt.xticks(rotation = 90)
plt.show()

Box plot without outliers

In [None]:
plt.figure(figsize=(16,7))
plt.subplot(1,2,1)
sns.boxplot(data = df_copy, x = df_copy.PURPOSE, y = df_copy.MILES,showfliers = False)
plt.xticks(rotation = 90)
plt.subplot(1,2,2)
sns.boxplot(data = df_copy, x = df_copy.PURPOSE, y = df_copy.MINUTES, showfliers = False)
plt.xticks(rotation=90)
plt.show()


Checking for round trips (trips that start and stop at the same location)

In [None]:
plt.figure(figsize=(8,5))
def is_round_trip(row):
    # Named is_round_trip to avoid shadowing Python's built-in round()
    return 'YES' if row['START'] == row['STOP'] else 'NO'

df_copy['ROUND_TRIP'] = df_copy.apply(is_round_trip, axis = 1)
sns.countplot(x = df_copy['ROUND_TRIP'],order = df_copy['ROUND_TRIP'].value_counts().index, palette = 'rocket_r')
plt.show()
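
A row-wise `apply` is not required for this flag: comparing the two columns directly is vectorized and gives the same labels. A minimal sketch on made-up trips:

```python
import numpy as np
import pandas as pd

# Made-up trips; the real data comes from df_copy.
trips = pd.DataFrame({
    "START": ["Cary", "Cary", "Morrisville"],
    "STOP": ["Cary", "Morrisville", "Cary"],
})

# Element-wise comparison of the two columns, mapped to YES/NO labels.
trips["ROUND_TRIP"] = np.where(trips["START"] == trips["STOP"], "YES", "NO")
print(trips)
```

Either way, two trips that both start and stop at an unknown location also match, so the YES count may overstate true round trips.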

Frequency of trips each month

In [None]:
df_copy['MONTH'] = pd.DatetimeIndex(df_copy['START_DATE']).month

In [None]:
dic = {1:'Jan', 2: 'Feb', 3: 'Mar', 4: 'April', 5: 'May', 6: 'June', 7: 'July', 8: 'Aug', 9: 'Sep',
      10: 'Oct', 11: 'Nov', 12: 'Dec' }

df_copy['MONTH'] = df_copy['MONTH'].map(dic)
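
Mapping months to plain strings loses their calendar order, so plots sort them by count or alphabetically. One option, sketched below on made-up dates, is an ordered categorical; the three-letter labels in `month_order` are an assumption and differ slightly from the `dic` labels above.

```python
import pandas as pd

# Made-up trip start dates.
dates = pd.Series(pd.to_datetime(["2016-12-05", "2016-01-10", "2016-12-20", "2016-03-02"]))

month_order = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Month number (1-12) -> label, stored as an ordered categorical.
labels = dates.dt.month.map(lambda m: month_order[m - 1])
months = pd.Categorical(labels, categories=month_order, ordered=True)

# sort_index() follows the category order, so the result runs Jan..Dec,
# and months without trips appear with a count of 0.
monthly_counts = pd.Series(months).value_counts().sort_index()
print(monthly_counts)
```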

In [None]:
df_copy

In [None]:
plt.figure(figsize=(12,7))
sns.countplot(x = df_copy['MONTH'], order = df_copy['MONTH'].value_counts().index, palette="deep")
plt.axhline(df_copy['MONTH'].value_counts().mean(),linestyle='--', color = 'darkred', label='Mean Trips across Months')
plt.legend()
plt.show()

In [None]:
plt.figure(figsize=(12,7))
sns.countplot(x = df_copy['ROUND_TRIP'], hue = df_copy['MONTH'])
plt.legend(bbox_to_anchor=(1.05, 0.95),loc=2)
plt.show()

### How does the PURPOSE of Cab ride vary with time and distance?

In [None]:
plt.figure(figsize = (16,7))
plt.subplot(1,2,1)
sns.boxplot(data = df_copy,x = df_copy.PURPOSE,y = df_copy.MILES, showfliers = False)
plt.xticks(rotation = 90)
plt.subplot(1,2,2)
# Second panel shows duration (MINUTES), not distance again
sns.boxplot(data = df_copy, x = df_copy.PURPOSE, y = df_copy.MINUTES, showfliers = False)
plt.xticks(rotation = 90)
plt.show()

### Is the distance proportional to the duration?

In [None]:
plt.figure(figsize=(16,7))
plt.subplot(1,2,1)
sns.lineplot(data = df_copy, x =df_copy.MINUTES, y = df_copy.MILES)
plt.grid(True, linestyle = '--')

plt.subplot(1,2,2)
sns.scatterplot(data = df_copy,x = df_copy.MINUTES, y = df_copy.MILES)
plt.grid(True, linestyle = '--')
plt.show()

We see that the conventional expectation, that distance is proportional to duration, does not always hold: some cab rides took more time to cover less distance.
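
The visual impression can be backed by two numbers: the correlation between distance and duration, and the average speed of each trip. A minimal sketch on made-up trips; the `MPH` column is introduced here for illustration and is not part of the notebook.

```python
import pandas as pd

# Made-up trips: three at a steady pace and one slow, short trip.
trips = pd.DataFrame({
    "MILES": [5.0, 10.0, 2.0, 3.0],
    "MINUTES": [10.0, 20.0, 30.0, 6.0],
})

# Pearson correlation: 1.0 would mean perfectly proportional.
corr = trips["MILES"].corr(trips["MINUTES"])

# Average speed in miles per hour; zero-minute trips would produce inf here.
trips["MPH"] = trips["MILES"] / (trips["MINUTES"] / 60)

# The slowest trip is the one that took the most time for its distance.
slowest = trips.sort_values("MPH").head(1)

print(f"correlation: {corr:.2f}")
print(slowest)
```

Trips with unusually low speed are the ones that break the proportionality, for example because of traffic or waiting time.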

In [None]:
plt.figure(figsize=(16,5))
plt.subplot(1,2,1)
n, bins, patches = plt.hist(df_copy.MINUTES)
plt.xticks(bins.round())
plt.grid(True, linestyle="dotted")
plt.title("Count of Cab ride MINUTES")

plt.subplot(1,2,2)
n, bins, patches = plt.hist(df_copy.MILES)
plt.xticks(bins.round())
plt.grid(True, linestyle="dotted")
plt.title("Count of Cab ride MILES")

plt.show()

### Distribution of Cab rides based on Category

In [None]:
plt.figure(figsize=(9,5))
sns.countplot(data = df_copy,x = "PURPOSE", hue="CATEGORY",dodge=False)
plt.xticks(rotation=45)
plt.show()

The plot above clearly separates the purposes of Business rides from those of Personal rides. This may help cab aggregators decide which segment to introduce new cabs in.

### Where do customers most frequently take cabs?

In [None]:
plt.figure(figsize=(15,4))
df_copy['START'].value_counts().head(25).plot(kind="bar")
plt.title("Cab Rides START Location frequency")
plt.xticks(rotation = 45);

### Frequency of Cab Rides STOP

In [None]:
plt.figure(figsize=(15,4))
df_copy['STOP'].value_counts().head(25).plot(kind = "bar")
plt.title("Cab Rides STOP Location frequency")
plt.xticks(rotation = 45);

In [None]:
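# factorplot was removed from seaborn; catplot is its replacement (size= is now height=)
g = sns.catplot(x="PURPOSE", y="MILES", hue="CATEGORY", data=df,
                height=15, kind="bar", palette="muted")
# Bars show the mean MILES for each purpose and category
g.fig.suptitle('Mean miles per category and purpose', fontsize= 25)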
g.fig.set_size_inches(15, 5)
g.set_xlabels('PURPOSE', fontsize= 20)
g.set_ylabels('MILES', fontsize= 20)
plt.show()

### **CONCLUSION**



1. Most cab rides are within a distance of 31 miles and take about 34 minutes.
2. Business cab rides are higher not only in volume but also in distance travelled.
3. The main uses of cab rides are Meal/Entertainment, Customer Visit, Meeting, and Errand/Supplies.
4. Cab traffic is mostly concentrated in 5 cities or localities.
5. A seasonal pattern in cab ride volume exists, with the highest volume in December.



