# Uber Ride Data Analysis

This dataset contains details of one customer's Uber rides.<br>
**Dataset:** Each ride records the start date, end date, category (Business or Personal), start location, end location, miles driven, and purpose of the drive (meals etc.). Source: [dataset](https://www.kaggle.com/zusmani/uberdrives).<br>


# Objective

To draw insights from the behavior of a typical Uber customer.

# Importing libraries

In [82]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import calendar

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Loading dataset

In [2]:
df = pd.read_csv('/kaggle/input/uberdrives/My Uber Drives - 2016.csv')
df.head()

In [3]:
df.tail()

In [4]:
print(df.shape)
df.dtypes

**There are 6 categorical variables and 1 numeric variable.**<br>
***`START_DATE*` and `END_DATE*` are stored as `object` dtype, so we need to convert them to datetime.***

# Checking for null values

In [5]:
df.isna().sum()

In [6]:
df[df['END_DATE*'].isna()]

**As we can see, this row contains invalid data in most of its columns, so we will delete it.**

In [7]:
# dropping row containing null vals
df.drop(df[df['END_DATE*'].isna()].index,axis=0,inplace=True)

In [8]:
df.isna().sum()

In [105]:
df.info()

**Now only the Purpose column contains null values. <br>
Since more than 55% of its values are missing, I am dropping this column and excluding it from the analysis.**
<br> Alternatively, you may delete the rows with null values and keep this column in the analysis.<br>
```sns.countplot(x=df['PURPOSE*'], order=df['PURPOSE*'].value_counts().index)```
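The share of missing values per column can be checked before deciding what to drop. Below is a minimal sketch on a made-up frame; the `sample` rows and the names `missing_pct` and `with_purpose` are hypothetical and only mimic the dataset's column names.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mimicking the dataset's columns (values are made up)
sample = pd.DataFrame({
    'MILES*': [5.1, 4.8, 7.1, 63.7],
    'PURPOSE*': ['Meal/Entertain', np.nan, 'Customer Visit', np.nan],
})

# Percentage of missing values in each column
missing_pct = sample.isna().mean() * 100

# Alternative to dropping the column: keep only the rows where the purpose is known
with_purpose = sample.dropna(subset=['PURPOSE*'])
```

The same two lines applied to `df` would show whether dropping the column or dropping the rows loses less data.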


In [106]:
# dropping Purpose
df.drop(['PURPOSE*'],axis=1,inplace=True)
df.head(2)

# Checking for duplicate rows

In [9]:
df[df.duplicated()]

We will remove this duplicate row

In [10]:
df.drop(df[df.duplicated()].index, axis=0, inplace=True)
df[df.duplicated()]

**Converting start_date & end_date cols into datetime** 

In [11]:
df['START_DATE*'] = pd.to_datetime(df['START_DATE*'], format='%m/%d/%Y %H:%M')
df['END_DATE*'] = pd.to_datetime(df['END_DATE*'], format='%m/%d/%Y %H:%M')
df.dtypes

# EDA

# Univariate

## CATEGORY*

In [12]:
df['CATEGORY*'].unique()

There are 2 ride categories: Business (work-related travel) and Personal (personal travel).

In [13]:
df[['CATEGORY*','MILES*']].groupby(['CATEGORY*']).agg(tot_miles=('MILES*','sum'))

In [14]:
# DataFrame.plot creates its own figure, so no separate plt.figure() call is needed
df[['CATEGORY*','MILES*']].groupby(['CATEGORY*']).agg(tot_miles=('MILES*','sum')).plot(kind='bar')
plt.xlabel('Category')
plt.ylabel('Total Miles')
plt.title('Total Miles per Category')

**The user mainly uses Uber for business purposes.**<br>
* Around 94% of miles were driven on Business trips.
* Only 6% of miles were driven on Personal trips.
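The percentage split can be computed directly from the grouped totals. This is a small sketch on made-up numbers, where `sample` and `share_pct` are hypothetical names rather than part of the original notebook.

```python
import pandas as pd

# Hypothetical rides (values are made up for illustration)
sample = pd.DataFrame({
    'CATEGORY*': ['Business', 'Business', 'Personal', 'Business'],
    'MILES*': [10.0, 20.0, 5.0, 15.0],
})

# Total miles per category, then each category's share of all miles in percent
miles_by_category = sample.groupby('CATEGORY*')['MILES*'].sum()
share_pct = (miles_by_category / miles_by_category.sum() * 100).round(1)
```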

## START*

In [25]:
len(df['START*'].unique())

**There are 177 unique starting points**

In [15]:
# Top 10 Start places
df['START*'].value_counts(ascending=False)[:10]

In [16]:

df['START*'].value_counts(ascending=False)[:10].plot(kind='barh',ylabel='Places',xlabel='Pickup Count',title='Top 10 Pickup places')

**Cary is the most popular Starting point for this user**

## STOP*

In [26]:
len(df['STOP*'].unique())

**There are 188 unique Drop points (destination)**

In [17]:

df['STOP*'].value_counts(ascending=False)[:10].plot(kind='barh',ylabel='Places',xlabel='Drop Count',title='Top 10 Drop places')

**Cary is also the most popular stop for this user.**<br> 
***The user's home may be in Cary, since most rides start or stop there.***

In [31]:
df[df['START*']=='Unknown Location']['START*'].value_counts()

In [28]:
df[df['STOP*']=='Unknown Location']['STOP*'].value_counts()

## MILES*

In [18]:
sns.histplot(df['MILES*'],kde=True)

**The miles data is right-skewed.**
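Besides reading the histogram, skewness can be checked numerically: a positive value from `Series.skew()` and a mean well above the median both indicate a long right tail. A minimal sketch on made-up distances (the `miles` values are hypothetical):

```python
import pandas as pd

# Hypothetical trip distances with one very long ride (values are made up)
miles = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 50.0])

skewness = miles.skew()        # sample skewness; positive means a right tail
mean_miles = miles.mean()      # pulled upward by the long ride
median_miles = miles.median()  # robust to the long ride
```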

In [20]:
df.describe().T

# Multivariate analysis

In [21]:
df.head()

In [37]:
df.groupby(['START*','STOP*'])['MILES*'].apply(print)

In [50]:
df.groupby(['START*','STOP*'])['MILES*'].sum().sort_values(ascending=False)[1:11]

**Cary-Durham and Cary-Morrisville (in both directions) account for the most total miles driven.**
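Summing miles per route favors routes that are traveled often, so it can help to look at the mean distance and the ride count alongside the total. The sketch below uses made-up rides; `sample` and `route_stats` are hypothetical names.

```python
import pandas as pd

# Hypothetical rides: one frequent short route and one rare long route
sample = pd.DataFrame({
    'START*': ['Cary', 'Cary', 'Cary', 'Raleigh'],
    'STOP*': ['Durham', 'Durham', 'Durham', 'Cary'],
    'MILES*': [10.0, 10.0, 10.0, 25.0],
})

# Total miles, mean miles per ride, and number of rides for each route
route_stats = sample.groupby(['START*', 'STOP*'])['MILES*'].agg(
    total_miles='sum', mean_miles='mean', rides='count'
)

most_total = route_stats['total_miles'].idxmax()   # route with the most miles overall
longest_avg = route_stats['mean_miles'].idxmax()   # route with the longest average ride
```

In this sample the frequent route wins on total miles while the rare route has the longest average ride, which is why the two rankings should not be read interchangeably.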

**Checking for Round Trip**

In [54]:
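# A ride is treated as a round trip when it starts and stops at the same location.
# np.where compares the two columns in one vectorized step instead of row by row.
df['ROUND_TRIP*'] = np.where(df['START*'] == df['STOP*'], 'YES', 'NO')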

sns.countplot(x='ROUND_TRIP*',data=df, order=df['ROUND_TRIP*'].value_counts().index)

In [55]:
df['ROUND_TRIP*'].value_counts()

**The user mostly takes single-trip Uber rides.**<br>
* Around 75% of trips are single trips and 25% are round trips.
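The percentages can be read off directly with `value_counts(normalize=True)`. A minimal sketch on made-up rides (`sample` and `round_trip_pct` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Hypothetical rides (values are made up)
sample = pd.DataFrame({
    'START*': ['Cary', 'Cary', 'Durham', 'Cary'],
    'STOP*': ['Cary', 'Durham', 'Cary', 'Morrisville'],
})

# Flag rides whose start and stop match, then convert counts to percentages
flags = pd.Series(np.where(sample['START*'] == sample['STOP*'], 'YES', 'NO'))
round_trip_pct = flags.value_counts(normalize=True) * 100
```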

## Calculating Ride duration

In [58]:
df.dtypes

In [59]:
df['Ride_duration'] = df['END_DATE*']-df['START_DATE*']
df.head()

**Converting Ride_duration into Minutes**

In [60]:
# total_seconds() already includes whole days => https://pandas.pydata.org/pandas-docs/stable/user_guide/timedeltas.html
df['Ride_duration'] = df['Ride_duration'].dt.total_seconds() / 60
df.head()
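When converting a timedelta to minutes, any whole days must be multiplied by 24*60, not divided by it. Using `total_seconds()` avoids handling the day and second components separately. A small sketch with made-up timestamps (the names `start`, `end` and `duration_min` are hypothetical):

```python
import pandas as pd

# Hypothetical ride times; the second ride crosses midnight
start = pd.Series(pd.to_datetime(['2016-01-01 21:11', '2016-01-01 23:50']))
end = pd.Series(pd.to_datetime(['2016-01-01 21:17', '2016-01-02 00:10']))

# Subtracting datetimes gives timedeltas; total_seconds() / 60 gives minutes
duration_min = (end - start).dt.total_seconds() / 60

# A timedelta longer than a day: 1 day and 30 minutes is 1470 minutes
long_ride_min = pd.Timedelta(days=1, minutes=30).total_seconds() / 60
```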

In [80]:
# Capture the month, year, day and hour of each ride in separate columns
# (START_DATE* is already a datetime column, so the .dt accessor works directly)
df['month'] = df['START_DATE*'].dt.month
df['Year'] = df['START_DATE*'].dt.year
df['Day'] = df['START_DATE*'].dt.day
df['Hour'] = df['START_DATE*'].dt.hour

# dayofweek is 0 for Monday through 6 for Sunday; map it to short day names
df['day_of_week'] = df['START_DATE*'].dt.dayofweek
days = {0:'Mon',1:'Tue',2:'Wed',3:'Thur',4:'Fri',5:'Sat',6:'Sun'}

df['day_of_week'] = df['day_of_week'].map(days)

df.head()

**Adding month name instead of month number**

In [83]:
df['month'] = df['month'].apply(lambda x: calendar.month_abbr[x])
df.head()

**Total rides/month**

In [85]:
print(df['month'].value_counts())

In [88]:
sns.countplot(x='month',data=df,order=df['month'].value_counts().index,hue='CATEGORY*')

**The most rides were taken in December (all of them Business trips).**<br>
The top 5 months by number of trips were December, August, November, February and March.<br>
**Uber was used for personal trips in February, March, April, June and July.**

In [91]:
sns.countplot(x='day_of_week',data=df,order=df['day_of_week'].value_counts().index,hue='CATEGORY*')

**Friday was the day on which Uber rides were used the most.**

**Average distance covered/month**

In [96]:
# Select MILES* before averaging so that non-numeric columns are not passed to mean()
df.groupby('month')['MILES*'].mean().sort_values(ascending = False).plot(kind='bar')
plt.axhline(df['MILES*'].mean(), linestyle='--', color='red', label='Mean distance')
plt.legend()
plt.show()

**On average, the user's rides were longest in April and shortest in November.**
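Month abbreviations sort alphabetically by default, which makes seasonal patterns harder to read. One option is an ordered categorical so that grouped results come out in calendar order. The sketch below uses made-up data; `sample`, `month_order` and `avg_miles` are hypothetical names.

```python
import calendar
import pandas as pd

# 'Jan' .. 'Dec' in calendar order (month_abbr[0] is an empty string)
month_order = list(calendar.month_abbr)[1:]

# Hypothetical rides (values are made up)
sample = pd.DataFrame({
    'month': ['Dec', 'Jan', 'Apr', 'Jan'],
    'MILES*': [4.0, 6.0, 20.0, 10.0],
})

# An ordered categorical makes groupby sort months chronologically
sample['month'] = pd.Categorical(sample['month'], categories=month_order, ordered=True)
avg_miles = sample.groupby('month', observed=True)['MILES*'].mean()
```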

In [100]:
sns.countplot(x='Hour',data=df,order=df['Hour'].value_counts().index,hue='CATEGORY*')

**The maximum number of trips took place in the evening and around noon.**
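To summarize hours as parts of the day, the hour values can be binned with `pd.cut`. The bin edges and labels below are assumptions chosen for illustration, not definitions from the dataset.

```python
import pandas as pd

# Hypothetical ride start hours (0-23)
hours = pd.Series([0, 5, 9, 12, 15, 18, 21, 23])

# Assumed boundaries: [0, 6) Night, [6, 12) Morning, [12, 17) Afternoon, [17, 24) Evening
bins = [0, 6, 12, 17, 24]
labels = ['Night', 'Morning', 'Afternoon', 'Evening']

# right=False makes each bin include its left edge and exclude its right edge
part_of_day = pd.cut(hours, bins=bins, labels=labels, right=False)
trips_per_part = part_of_day.value_counts()
```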

### Calculating Trip speed

In [102]:
df.head()

In [111]:
df['Duration_hours'] = df['Ride_duration']/60
# MILES* divided by hours gives miles per hour, so the column is named Speed_MPH
df['Speed_MPH'] = df['MILES*']/df['Duration_hours']
df.head(2)

In [121]:
fig, ax = plt.subplots()
sns.histplot(x='Speed_MPH',data=df,kde=True,ax=ax)
ax.set_xlim(1,31)
ax.set_xticks([x*50 for x in range(0,5)])


**Speed is right skewed**
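Rides whose start and end times fall in the same minute have a duration of zero, and dividing by zero yields an infinite speed. A minimal sketch of one way to handle this is shown below on made-up values; `sample`, `speed_mph` and `speed_kmh` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical rides: distance in miles and duration in minutes (values are made up)
sample = pd.DataFrame({
    'MILES*': [10.0, 3.0, 4.0],
    'Ride_duration': [30.0, 0.0, 15.0],
})

hours = sample['Ride_duration'] / 60

# Miles per hour; zero-duration rides give inf, which is replaced with NaN
speed_mph = (sample['MILES*'] / hours).replace([np.inf, -np.inf], np.nan)

# Optional conversion to km/h (1 mile = 1.609344 km)
speed_kmh = speed_mph * 1.609344
```

NaN values are skipped by `mean()` and `describe()`, so the zero-duration rides no longer distort the summary statistics.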

# Conclusion

* **The user mainly uses Uber for business purposes.** <br>
    * Around 94% of miles were driven on Business trips.
    * Only 6% of miles were driven on Personal trips.
* There are 177 unique starting points.<br>
    * **Cary is the most popular starting point for this user.**
* There are 188 unique stop points.
    * **Cary is the most popular drop point for this user.**
* **Cary-Durham and Cary-Morrisville (in both directions) account for the most total miles among the user's routes.**
* **The user usually takes single-trip Uber rides.**
    * Around 75% of trips are single trips and 25% are round trips.
* **The user took the most rides in December and the fewest in September.**
* **Friday has the maximum number of trips.**
* **Afternoons and evenings seem to have the maximum number of trips.**
* **On average, the user's rides were longest in April and shortest in November.**