##### 16 Feb 2021

## Cab Rides EDA

In this exercise we will perform Exploratory Data Analysis on the Cab Rides data to better understand its nuances.

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time, datetime
%matplotlib inline

In [None]:
df=pd.read_csv('../input/uberdrives/My Uber Drives - 2016.csv')
df.head()

### Converting Dates to Datetime Type

Two columns, START_DATE and END_DATE, should be of datetime type.  
We need to convert them.

Before referring to columns by name, we remove special characters such as '*' from the column names.  
The cell below does that.

In [None]:
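# regex=False treats '*' as a literal character, not as a regex quantifier
df.columns = df.columns.str.replace('*', '', regex=False)
# Drop the row at index 1155
df.drop(index=1155, inplace=True)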
df.head()

Now we convert both columns to datetime using `pd.to_datetime`.

In [None]:
df['START_DATE']= pd.to_datetime(df['START_DATE'])
df['END_DATE']= pd.to_datetime(df['END_DATE'])
df.info();
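
When parsing date strings, it can help to state the expected layout explicitly and to count the entries that fail to parse. The sketch below uses toy strings; the month/day/year layout in `format` is an assumption for illustration, not something confirmed for this file.

```python
import pandas as pd

# Toy strings in an assumed month/day/year hour:minute layout.
raw = pd.Series(["1/1/2016 21:11", "1/2/2016 1:25", "not a date"])

# errors="coerce" turns unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(raw, format="%m/%d/%Y %H:%M", errors="coerce")

# Counting NaT values shows how many rows failed to parse.
n_bad = int(parsed.isna().sum())
print(parsed)
print("unparseable rows:", n_bad)
```

If `n_bad` is non-zero on real data, those rows are worth inspecting before any duration is computed from them.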

### Checking null values

Next, we notice right away that the PURPOSE column has some missing values.  
Let's look into it further.

In [None]:
df.isnull().sum()

There are 502 missing values, almost 45% of all observations.  
That is too many to fill in with a simple statistic such as the mean, median or mode (and since PURPOSE is categorical, only the mode would apply in any case).  
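
The share of missing values per column can be read off directly instead of being worked out by hand. A minimal sketch on a made-up frame (the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the rides table; the values are made up.
toy = pd.DataFrame({
    "MILES": [5.1, 4.8, 7.1, 0.7, 63.7],
    "PURPOSE": ["Meeting", np.nan, "Errand/Supplies", np.nan, "Customer Visit"],
})

# isnull().mean() is the fraction of missing entries in each column.
missing_pct = (toy.isnull().mean() * 100).round(1)
print(missing_pct)
```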

Our priority now is to predict the missing values.  
The next few steps are a trial-and-error search for the best way to do so.

In [None]:
df.nunique()

In [None]:
df.PURPOSE.value_counts()

In [None]:
# Let's calculate the ride duration, as it might be related to the PURPOSE of the cab ride.
df['MINUTES'] = df.END_DATE - df.START_DATE
df.head()

In [None]:
# The duration is a timedelta, so we convert it to minutes as a float in order to use it for analysis.
df['MINUTES'] = df['MINUTES'].dt.total_seconds() /60
df.head()
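
To make the conversion concrete, here is a small sketch with made-up timestamps showing the timedelta result and its value in minutes:

```python
import pandas as pd

# Made-up start and end times for two rides.
start = pd.to_datetime(pd.Series(["2016-01-01 21:11", "2016-01-02 01:25"]))
end = pd.to_datetime(pd.Series(["2016-01-01 21:17", "2016-01-02 01:37"]))

delta = end - start                      # timedelta64 values
minutes = delta.dt.total_seconds() / 60  # plain float minutes
print(delta.dtype, minutes.tolist())
```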

Summarizing MILES by PURPOSE to see if any pattern emerges.

In [None]:
pd.DataFrame({'MEAN': df.groupby(['PURPOSE'])['MILES'].mean().round(1), 
              'MIN' : df.groupby(['PURPOSE'])['MILES'].min(), 
              'MAX' : df.groupby(["PURPOSE"])['MILES'].max()}).reset_index()
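
The same kind of summary can be produced with a single `groupby` and named aggregations, which avoids grouping three times. This is a sketch on toy data, not a replacement for the cell above:

```python
import pandas as pd

# Toy rides; values are made up for illustration.
toy = pd.DataFrame({
    "PURPOSE": ["Meeting", "Meeting", "Commute", "Errand", "Errand"],
    "MILES": [10.0, 20.0, 180.2, 3.0, 5.0],
})

summary = (
    toy.groupby("PURPOSE")["MILES"]
       .agg(MEAN="mean", MIN="min", MAX="max")  # one pass, named output columns
       .round({"MEAN": 1})                      # round only the mean
       .reset_index()
)
print(summary)
```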

Let's use boxplots to better visualize the spread.

## Boxplots of MILES and MINUTES split by PURPOSE

In [None]:
plt.figure(figsize=(16,7))
plt.subplot(1,2,1)
sns.boxplot(data=df, x=df.PURPOSE, y=df.MILES)
plt.xticks(rotation=45)
plt.subplot(1,2,2)
sns.boxplot(data=df, x=df.PURPOSE, y=df.MINUTES)
plt.xticks(rotation=45);

## Boxplots of MILES and MINUTES based on PURPOSE without outliers

In [None]:
plt.figure(figsize=(16,7))
plt.subplot(1,2,1)
sns.boxplot(data=df, x=df.PURPOSE, y=df.MILES,showfliers=False)
plt.xticks(rotation=45)
plt.subplot(1,2,2)
sns.boxplot(data=df, x=df.PURPOSE, y=df.MINUTES,showfliers=False)
plt.xticks(rotation=45);

In [None]:
df_new = df[~df['PURPOSE'].isin(['Commute','Charity ($)','Moving','Airport/Travel'])]

In [None]:
df_new.shape

In [None]:
df_new.PURPOSE.value_counts()

In [None]:
df_new.groupby(df_new.PURPOSE)['MILES'].mean().round(3)

After trying multiple approaches, I decided to predict the missing values with a decision tree classifier.

In [None]:
df_na = df[df.PURPOSE.isna()]
df_na.head()

In [None]:
# Instantiating the decision tree classifier
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier(criterion='entropy')

In [None]:
# Dropping NA values for train dataset
df_a = df.dropna()

# Splitting dataset into independent and dependent variables
X = df_a[['CATEGORY','MILES','MINUTES']]
y = df_a.PURPOSE

# As X has categorical variables, converting all to numeric type using one hot encoding
X = pd.get_dummies(X, drop_first = True)

# Training dtree model 
dtree.fit(X,y)
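
Before trusting imputed labels, it may be worth estimating how well a tree predicts PURPOSE on rows where the label is known. The sketch below runs 5-fold cross-validation on synthetic data; the frame `toy`, the labelling rule and `random_state` are all assumptions made for illustration, and the score says nothing about the real rides data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the labelled rows; not the real rides data.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "CATEGORY": rng.choice(["Business", "Personal"], size=n),
    "MILES": rng.uniform(0.5, 30.0, size=n).round(1),
})
toy["MINUTES"] = (toy["MILES"] * 2 + rng.normal(0, 3, size=n)).clip(lower=1).round()
# Hypothetical labelling rule so the tree has something to learn.
toy["PURPOSE"] = np.where(toy["MILES"] > 15, "Customer Visit", "Meeting")

# Same encoding idea as in the notebook, on the toy frame.
X_toy = pd.get_dummies(toy[["CATEGORY", "MILES", "MINUTES"]], drop_first=True, dtype=float)
y_toy = toy["PURPOSE"]

model = DecisionTreeClassifier(criterion="entropy", random_state=0)
scores = cross_val_score(model, X_toy, y_toy, cv=5)  # accuracy on each fold
print(round(scores.mean(), 3))
```

On the real data, the same call with `X` and `y` from the cell above would give a rough idea of how reliable the predicted PURPOSE values are likely to be.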


In [None]:
# Storing rows with null values in X_na
X_na = df_na[['CATEGORY','MILES','MINUTES']]

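# Performing one-hot encoding, then aligning to the training columns so that
# the prediction frame matches X even if a category is absent from these rows
X_na = pd.get_dummies(X_na).reindex(columns=X.columns, fill_value=0)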

# Making the predictions using dtree model
preds = dtree.predict(X_na)
preds.shape
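
One-hot encoding the training rows and the rows to predict separately can produce different columns when a category is missing from one of them. A small sketch on toy frames (the names `train` and `new` are hypothetical) shows the mismatch and one way to align the columns:

```python
import pandas as pd

train = pd.DataFrame({"CATEGORY": ["Business", "Personal"], "MILES": [5.0, 2.0]})
new = pd.DataFrame({"CATEGORY": ["Personal", "Personal"], "MILES": [3.0, 9.0]})

X_train = pd.get_dummies(train, drop_first=True, dtype=int)

# Encoding `new` the same way loses the CATEGORY information entirely,
# because its only category is the one that drop_first removes.
naive = pd.get_dummies(new, drop_first=True, dtype=int)

# Encoding without drop_first and reindexing keeps exactly the training columns.
aligned = pd.get_dummies(new, dtype=int).reindex(columns=X_train.columns, fill_value=0)

print(list(naive.columns))
print(list(aligned.columns))
```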

In [None]:
X_na['PURPOSE'] = preds
# fillna aligns on the index, so only the missing rows change; assign the result back
df['PURPOSE'] = df['PURPOSE'].fillna(X_na['PURPOSE'])

In [None]:
# Let's check if all missing values have been filled
df.isnull().sum()

In [None]:
# Let's check the difference in the PURPOSE variable
df.PURPOSE.value_counts()

In [None]:
X_na.PURPOSE.value_counts()

Now that we have all the data, let's see what inferences we can draw from it.

# 1. How does the PURPOSE of Cab ride vary with time and distance?

### Boxplots of MILES and MINUTES based on PURPOSE (without outliers)

In [None]:
plt.figure(figsize=(16,7))
plt.subplot(1,2,1)
sns.boxplot(data=df, x=df.PURPOSE, y=df.MILES, showfliers = False)
plt.xticks(rotation = 45)
plt.subplot(1,2,2)
sns.boxplot(data=df, x=df.PURPOSE, y=df.MINUTES, showfliers = False)
plt.xticks(rotation = 45);

### Boxplots of MILES and MINUTES based on PURPOSE (with outliers)

In [None]:
plt.figure(figsize=(16,7))
plt.subplot(1,2,1)
sns.boxplot(data=df, x=df.PURPOSE, y=df.MILES)
plt.xticks(rotation = 45)
plt.subplot(1,2,2)
sns.boxplot(data=df, x=df.PURPOSE, y=df.MINUTES)
plt.xticks(rotation = 45);

We have successfully filled the missing values.  
Now that our dataset is complete, let's proceed to visualizing the data using meaningful plots.

In [None]:
df.nunique()

First, we will look at the relationship between the distance (MILES) and time taken (MINUTES).

# 2. Is the distance proportional to the duration?

### Plots of MILES with respect to MINUTES

In [None]:
plt.figure(figsize = (14,5))
plt.subplot(1,2,1)
sns.lineplot(data=df, x=df.MINUTES, y=df.MILES)
plt.grid(True, linestyle = "--")
plt.subplot(1,2,2)
sns.scatterplot(data=df, x=df.MINUTES, y=df.MILES)
plt.grid(True, linestyle = "--")

Clearly, the line plot does not give a clear picture of the spread.  
Plotting the data in more than one way helps us decide which plot to use.  
We also see that the usual assumption that distance is proportional to time does not always hold, as some cab rides took more time to cover less distance.
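
The visual impression can be backed by two simple numbers: the correlation between MILES and MINUTES, and the average speed of each ride. The sketch below uses made-up values, and the 10 mph threshold for a "slow" ride is an arbitrary assumption.

```python
import pandas as pd

# Made-up rides; the third one is slow for its distance.
toy = pd.DataFrame({
    "MILES": [5.0, 10.0, 2.0, 20.0],
    "MINUTES": [10.0, 20.0, 30.0, 40.0],
})

# Pearson correlation: 1.0 would mean a perfectly linear relationship.
r = toy["MILES"].corr(toy["MINUTES"])

# Average speed in miles per hour (rides with 0 minutes would give inf).
toy["MPH"] = toy["MILES"] / (toy["MINUTES"] / 60)

# Rides below an assumed 10 mph threshold took long for the distance covered.
slow = toy[toy["MPH"] < 10]
print(round(r, 2))
print(slow)
```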

In [None]:
plt.figure(figsize = (16,5))

plt.subplot(1,2,1)
n, bins, patches = plt.hist(df.MINUTES)
plt.xticks(bins.round())
plt.grid(True, linestyle = "dotted")
plt.title("Count of Cab ride MINUTES")

plt.subplot(1,2,2)
n, bins, patches = plt.hist(df.MILES)
plt.xticks(bins.round())
plt.grid(True, linestyle = "dotted")
plt.title("Count of Cab ride MILES");

# 3. Is the distance-time relation the same for both Business and Personal categories?

### Plot of MILES and MINUTES w.r.t CATEGORY of Cab Ride

In [None]:
sns.countplot(data=df, x="CATEGORY")

In [None]:
plt.figure(figsize = (14,5))
plt.subplot(1,2,1)
sns.regplot(data=df[df['CATEGORY'] == 'Business'],x="MILES", y="MINUTES")
plt.title("BUSINESS CAB RIDES")
plt.grid(True, linestyle = ":")

plt.subplot(1,2,2)
sns.regplot(data=df[df['CATEGORY'] == 'Personal'],x="MILES", y="MINUTES")
plt.title("PERSONAL CAB RIDES")
plt.grid(True, linestyle = ":")

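The above charts show not only the trend of the scatter (the fitted regression line) but also, as a shaded band, the confidence interval of that fit (95% by default in seaborn), which is a measure of uncertainty in the line rather than the standard deviation of the points.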

# 4. What is the purpose (destination) of most cab rides?

### Split of rides based on PURPOSE

In [None]:
pd.Series(df['PURPOSE']).value_counts().plot(kind="bar")
plt.xticks(rotation = 45);

The bulk of the cab rides are for Meal/Entertainment, Meeting, Errand/Supplies and Customer Visit.

### Distribution of Cab rides based on Category

In [None]:
plt.figure(figsize = (9,5))
sns.countplot(data=df,x="PURPOSE", hue = 'CATEGORY', dodge = False)
plt.xticks(rotation = 45);

The above plot clearly separates the purposes of Business rides from those of Personal rides.  
This may help cab aggregators decide which segment to introduce new cabs in.

Now, let's look at cab usage by location.  
We will see where most cab rides start and where they stop.

# 5. Where do customers most frequently take cabs?

### Frequency of Cab Rides START

In [None]:
plt.figure(figsize = (15,4))
pd.Series(df['START']).value_counts()[:25].plot(kind="bar")
plt.title("Cab Rides START Location frequency")
plt.xticks(rotation = 45);

### Frequency of Cab Rides STOP

In [None]:
plt.figure(figsize=(15,4))
pd.Series(df['STOP']).value_counts()[:25].plot(kind = "bar")
plt.title("Cab Rides STOP Location frequency")
plt.xticks(rotation = 45);

The above graphs show how often rides start and stop at each of the 25 most frequent locations.
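
Start and stop frequencies can also be combined to find the most common routes. A minimal sketch, with placeholder location names "A", "B" and "C" standing in for real values:

```python
import pandas as pd

# Placeholder locations; not taken from the dataset.
toy = pd.DataFrame({
    "START": ["A", "A", "B", "A", "C"],
    "STOP":  ["B", "B", "A", "C", "A"],
})

# Count each (START, STOP) pair and keep the most frequent ones.
routes = (
    toy.groupby(["START", "STOP"]).size()
       .sort_values(ascending=False)
       .head(3)
)
print(routes)
```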

# 6. When are cab rides more popular (frequently used)?

In [None]:
df['MONTH'] = df['END_DATE'].dt.month_name()
df.head(5)

In [None]:
pd.Series(df['MONTH'].value_counts()).plot(kind="bar")

Let's split the above further by CATEGORY for better analysis.

In [None]:
plt.figure(figsize = (8,5))
# MONTH is categorical, so a KDE curve and a numeric bin width do not apply here
sns.histplot(data=df, x='MONTH', hue='CATEGORY', multiple='stack')
plt.xticks(rotation = 45);

As we can see, ride volume clearly follows a seasonal trend.  
The cab company could make use of this to increase its rides.
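
When comparing months, it helps to keep them in calendar order rather than in order of frequency or appearance. One way to sketch this, using made-up dates, is an ordered categorical:

```python
import calendar
import pandas as pd

# Made-up ride dates.
dates = pd.to_datetime(pd.Series(["2016-12-05", "2016-02-10", "2016-12-20", "2016-06-01"]))

month_order = list(calendar.month_name)[1:]  # January .. December

# An ordered categorical remembers the calendar order of the month names.
months = pd.Categorical(dates.dt.month_name(), categories=month_order, ordered=True)

# sort_index() then sorts by calendar position instead of alphabetically.
counts = pd.Series(months).value_counts().sort_index()
print(counts)
```

Calling `counts.plot(kind="bar")` on such a result would draw the bars from January to December.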

# Conclusion

From the above Exploratory Data Analysis, we can infer the following:
1. The mean is skewed by the outlier Commute cab ride. Apart from it, the other purposes fall within similar ranges.
2. Most of the cab rides are within a distance of 31 miles, taking about 34 minutes.
3. Business cab rides are greater not only in number, but also in distance travelled.
4. The main uses of cab rides are Meal/Entertainment, Customer Visit, Meeting and Errand/Supplies.
5. Cab traffic is mostly concentrated in 5 cities or localities.
6. A seasonal pattern in cab ride volume exists, with the highest volume in December.