 <h1>UBER Rides Analysis</h1>


This notebook analyzes the My Uber Drives 2016 dataset.  Within this dataset we will look for hidden patterns in the rides taken during the time frame covered by the data.  The dataset contains 7 variables and 1156 observations.

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns 
import datetime as dt
%matplotlib inline 

In [None]:
df = pd.read_csv('../input/My Uber Drives - 2016.csv')

After reading the CSV file into a pandas DataFrame, let's look at the columns and their associated data types.  Notice from the output below that some variables contain null values.

In [None]:
df.info()

Here is a snapshot of the first 5 rows of the data. This gives us an idea of the data elements within the variables.  We can see there are Category, Purpose, Miles, and other columns.  We will primarily use purpose and miles to draw conclusions about the purpose of trips and miles travelled.

In [None]:
df.head()

In [None]:
df.tail()

<h1>Data Cleansing and Preparation</h1>

In [None]:
# Drop the last row (index 1155) before cleaning; drop() returns a new DataFrame.
df1 = df.drop(df.index[1155])

In [None]:
df1['PICK_DATE'] = df['START_DATE*'].str.split(' ').str[0]

In [None]:
df1['DROP_DATE'] = df['END_DATE*'].str.split(' ').str[0]
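
As a minimal sketch of what the two cells above do, the snippet below applies the same `.str.split(' ').str[0]` pattern to a small made-up frame. The sample timestamps are hypothetical and only assume that the date and time are separated by a single space.

```python
import pandas as pd

# Hypothetical sample rows; the real values come from the CSV.
sample = pd.DataFrame({
    'START_DATE*': ['1/1/2016 21:11', '1/2/2016 1:25'],
    'END_DATE*': ['1/1/2016 21:17', '1/2/2016 1:37'],
})

# Split each string on the space and keep the first piece (the date part).
sample['PICK_DATE'] = sample['START_DATE*'].str.split(' ').str[0]
sample['DROP_DATE'] = sample['END_DATE*'].str.split(' ').str[0]

print(sample[['PICK_DATE', 'DROP_DATE']])
```

The resulting columns are still plain strings, which is enough for grouping trips by day.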

In [None]:
# Parse the start and end timestamps into datetime64[ns] columns.
for col in ['START_DATE*', 'END_DATE*']:
    df1[col] = pd.to_datetime(df1[col])

In [None]:
df1['CITY_PAIR'] = df1['START*']+'-'+ df1['STOP*']


In [None]:
df1['TOTAL_TIME'] = df1['END_DATE*']-df1['START_DATE*']
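
Subtracting the two datetime columns gives a timedelta column. If a plain number is easier to plot or summarize, one possible approach is to convert it to minutes, as sketched below. The timestamps are made up, and `TOTAL_MINUTES` is a hypothetical column name that the notebook does not use.

```python
import pandas as pd

# Hypothetical timestamps standing in for START_DATE* and END_DATE*.
trips = pd.DataFrame({
    'START_DATE*': pd.to_datetime(['2016-01-01 21:11', '2016-01-02 01:25']),
    'END_DATE*': pd.to_datetime(['2016-01-01 21:17', '2016-01-02 01:37']),
})

# Subtracting two datetime columns yields a timedelta column.
trips['TOTAL_TIME'] = trips['END_DATE*'] - trips['START_DATE*']

# Convert the timedelta to minutes as a float (hypothetical helper column).
trips['TOTAL_MINUTES'] = trips['TOTAL_TIME'].dt.total_seconds() / 60

print(trips[['TOTAL_TIME', 'TOTAL_MINUTES']])
```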

In [None]:
df1.info()


In [None]:
df1.isnull().sum()

In [None]:
df1['PURPOSE*'] = df1['PURPOSE*'].fillna('OTHER')
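
To illustrate the effect of the `fillna('OTHER')` call, here is a small sketch on made-up rows. The purposes and mileage values are hypothetical; only the column names and the `'OTHER'` label come from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; two of them have no purpose recorded.
sample = pd.DataFrame({
    'PURPOSE*': ['Meeting', np.nan, 'Customer Visit', np.nan],
    'MILES*': [5.1, 4.8, 7.6, 3.3],
})

# Count missing purposes before and after filling them in.
missing_before = sample['PURPOSE*'].isnull().sum()
sample['PURPOSE*'] = sample['PURPOSE*'].fillna('OTHER')
missing_after = sample['PURPOSE*'].isnull().sum()

print(missing_before, missing_after)
print(sample['PURPOSE*'].value_counts())
```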

In [None]:
# Sum only the numeric columns; datetime columns cannot be summed.
df1.groupby('PURPOSE*', as_index=False).sum(numeric_only=True)

In [None]:
df1.isnull().sum()

In [None]:
# Use .loc to recode the category in place and avoid chained assignment.
df1.loc[df1['PURPOSE*'] == 'Meal/Entertain', 'CATEGORY*'] = 'Meals'

In [None]:
df1.describe()

In [None]:
df1.describe(include=['O'])

In [None]:
df1.head()

<h1>Data Visualization</h1>

The boxplot below shows miles driven by trip purpose.  As you can see, there are several outliers.  The one that catches my attention is the Customer Visit trip with a total of about 300 miles driven.  This represents a trip from Latta to Jacksonville with an approximate travel time of 5 hours and 30 minutes.

In [None]:
oth = ['OTHER']

# FacetGrid's `size` argument was renamed to `height` in newer seaborn releases.
g = sns.FacetGrid(data=df1[~df1['PURPOSE*'].isin(oth)], aspect=2, height=6)
g.map(sns.boxplot, 'PURPOSE*', 'MILES*', palette="Set1")
plt.show()
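
The points drawn beyond the whiskers can also be listed numerically. The sketch below uses the 1.5 × IQR rule, which is the default whisker setting for matplotlib and seaborn boxplots, on made-up mileage values. It is an illustration under those assumptions and does not reproduce the figures from the dataset.

```python
import pandas as pd

# Hypothetical mileage values; the last one plays the role of an extreme trip.
miles = pd.Series([2.0, 4.0, 5.0, 6.0, 8.0, 10.0, 12.0, 300.0], name='MILES*')

# Interquartile range: distance between the 25th and 75th percentiles.
q1 = miles.quantile(0.25)
q3 = miles.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = miles[(miles < lower) | (miles > upper)]

print(outliers)
```

On the real data, the same filter could be applied per purpose, for example inside a `groupby('PURPOSE*')`.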

The distribution plot below shows how miles are distributed across trips.  It shows that about 1,100 trips fall between 0 and 25 miles.

In [None]:
plt.figure(figsize=(18,8))
plt.hist(df1['MILES*'])
plt.show()
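
Reading counts off a histogram is approximate, because the bin edges are chosen automatically. One way to get exact counts per mileage band is `pd.cut`, sketched here on made-up values. The bin edges and labels are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical mileage values standing in for df1['MILES*'].
miles = pd.Series([0.5, 3.2, 9.9, 12.0, 24.9, 25.0, 47.5, 120.0, 310.3],
                  name='MILES*')

# Assumed band edges; adjust them to the ranges of interest.
edges = [0, 25, 50, 100, 350]
labels = ['0-25', '25-50', '50-100', '100-350']

# right=False makes each band include its left edge and exclude its right edge.
bands = pd.cut(miles, bins=edges, labels=labels, right=False)
counts = bands.value_counts().sort_index()

print(counts)
```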

This pie chart shows the percentage of trips for each value of the PURPOSE* variable.  I have exploded the slice with the highest percentage, which in this case is the trips that did not have a purpose assigned.  This is why I earlier used fillna() to replace those missing values with OTHER.

In [None]:
plt.figure(figsize=(10,10))
df1['PURPOSE*'].value_counts()[:11].plot(kind='pie',autopct='%1.1f%%',shadow=True,explode=[0.1,0,0,0,0,0,0,0,0,0,0])
plt.show()
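
The `explode` list above is hard-coded to 11 entries, so it must match the number of slices exactly. A minimal sketch of building it from the data instead, using made-up purposes, could look like this:

```python
import pandas as pd

# Hypothetical purposes standing in for df1['PURPOSE*'].
purposes = pd.Series(
    ['OTHER', 'OTHER', 'OTHER', 'Meeting', 'Meeting', 'Errand/Supplies'],
    name='PURPOSE*',
)

# value_counts() sorts in descending order, so the largest slice comes first.
counts = purposes.value_counts()

# One explode value per slice: offset only the first (largest) slice.
explode = [0.1] + [0] * (len(counts) - 1)

print(explode)
```

The resulting list can then be passed as `explode=explode` to the pie plot, whatever the number of distinct purposes.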


In [None]:
g = sns.FacetGrid(data=df1, aspect=2, height=8)
g.map(sns.countplot, 'PURPOSE*', palette="Set1")
plt.show()


In [None]:
x = np.arange(len(df1))  # one x position per trip
y = df1['MILES*']

plt.figure(figsize=(18,8))

plt.scatter(x, y, s=15)
plt.xticks([0, 400, 800, 1200])
plt.show()


In [None]:
g = sns.FacetGrid(data=df1, aspect=2, height=8, hue='PURPOSE*')
g.map(plt.plot, 'START_DATE*')
plt.legend()
plt.xlabel('# of Trips')
plt.show()

In [None]:
plt.figure(figsize=(18,8))
df1['CITY_PAIR'].value_counts()[:50].plot(kind='bar')
plt.show()

In [None]:
g = sns.FacetGrid(data=df1, aspect=2, height=8, hue='CATEGORY*')
g.map(plt.plot, 'TOTAL_TIME')
plt.show()

In conclusion, the data shows that some trips fall well outside the average miles travelled on an Uber drive.  Each of the 1,155 observations corresponds to one trip, but the data does not show how many drivers those trips cover.  Therefore I could not draw a concrete conclusion per driver.  The data also includes cities that are overseas.

In [None]:
totals = df1.groupby('CATEGORY*', as_index=False).agg({'MILES*': 'sum'})

In [None]:
totals['PERCENTAGE'] = (totals['MILES*']/df1['MILES*'].sum())*100

In [None]:
totals
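
As a sanity check on the share calculation, the sketch below repeats the same two steps on a small made-up frame and confirms that the shares add up to 100. The `Business` and `Personal` labels and all mileage values are hypothetical.

```python
import pandas as pd

# Hypothetical trips; category labels and miles are made up for illustration.
trips = pd.DataFrame({
    'CATEGORY*': ['Business', 'Business', 'Meals', 'Personal'],
    'MILES*': [30.0, 50.0, 10.0, 10.0],
})

# Total miles per category, then each category's share of all miles.
totals = trips.groupby('CATEGORY*', as_index=False).agg({'MILES*': 'sum'})
totals['PERCENTAGE'] = totals['MILES*'] / trips['MILES*'].sum() * 100

print(totals)
print(totals['PERCENTAGE'].sum())
```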

In [None]:
sizes = np.array(totals['PERCENTAGE'])
labels = np.array(totals['CATEGORY*'])


fig1, ax1 = plt.subplots(figsize=(9,9))
ax1.pie(sizes, explode=[0.2,0,0], labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.title('PERCENTAGE OF MILES BY CATEGORY')

plt.show()

In [None]:
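# Average only the numeric columns; text and date columns cannot be averaged.
cat = df1.groupby('CATEGORY*', as_index=False).mean(numeric_only=True)

plt.figure(figsize=(18,8))

# Newer seaborn releases require x and y to be passed as keyword arguments.
sns.barplot(x='CATEGORY*', y='MILES*', data=cat)
plt.title('AVERAGE MILES DRIVEN PER CATEGORY')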
plt.show()