# Data Analysis Using Python

In this notebook we analyse the Uber dataset, which contains details of Uber drivers. The dataset helps us understand the behaviour of an ordinary Uber customer.

I will do some exploratory data analysis and try to answer a few questions to uncover insights.

Importing Necessary Packages

In [None]:
import pandas as pd
import numpy as np
from pylab import *
import matplotlib.pyplot as plt

In [None]:
df=pd.read_csv('../input/uberdrives/My Uber Drives - 2016.csv')

In [None]:
df.columns #Displaying Column Names

In [None]:
# Let's explore the shape of the dataframe.
print("Shape of the dataframe is:", df.shape)

In [None]:
# Let's explore the first 5 rows of the dataframe to see sample records.
df.head()

In [None]:
#exploring last 5 rows
df.tail(5)  

I can see some missing values in the dataset :)

In [None]:
df.isnull().sum()

Only the Purpose column has many missing values. In this notebook we are not going deeper into missing value analysis, so we simply drop the rows with missing data.

In [None]:
df=df.dropna()

In [None]:
df.isnull().sum()
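
Dropping rows is the simplest option, but it also discards every trip that has no recorded purpose. As a minimal sketch on made-up rows (the values below are invented for illustration and are not taken from the dataset), this compares `dropna()` with filling the gap with a placeholder label such as `"Unknown"`, which is an assumed label, not one used in the notebook:

```python
import numpy as np
import pandas as pd

# Toy rows, invented for illustration; only the column names mimic the dataset
toy = pd.DataFrame({
    "MILES*": [5.1, 4.8, 12.0, 7.5],
    "PURPOSE*": ["Meeting", np.nan, "Customer Visit", np.nan],
})

dropped = toy.dropna()                        # removes every row with a missing value
filled = toy.fillna({"PURPOSE*": "Unknown"})  # keeps all rows, labels the gap instead

print("rows before:", len(toy))
print("rows after dropna:", len(dropped))
print("rows after fillna:", len(filled))
```

Which option is appropriate depends on the question being asked. Filling keeps the mileage of unlabeled trips available for totals, while dropping keeps only fully described trips.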

Some basic methods to explore the data :)

In [None]:
df.describe()## to get the summary stats 

In [None]:
df['MILES*'].max() #to get maximum value in miles column

In [None]:
df.sample(7)  #to display random sample records 

In [None]:
df.dtypes #to get the datatype of each column

In [None]:
df.info()    # to get complete info about dataframe

In [None]:
df.sort_values(by=['MILES*'],ascending=False).head(10) #Sorting based on miles

In [None]:
df[df['PURPOSE*'].isnull()] ## displays the rows where purpose is null 

No rows are returned because we have already dropped all null values.

Let's analyse each column separately 

**1. Start Date**

In [None]:
df1=df.copy() #taking copy of dataframe

In [None]:
df1["START_DATE*"]=pd.to_datetime(df["START_DATE*"],format="%m/%d/%Y %H:%M") #changing the datatype and format of start date

In [None]:
df1.info()

In [None]:
df1.head()

We can see datatype and format of start date column has been changed
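
As a small sketch of what the conversion does, the example below parses a few made-up strings with the same format string. The `errors="coerce"` argument is an optional addition that is not used in the notebook. It turns any value that does not match the format (here a hypothetical non-date entry) into `NaT` instead of raising an error.

```python
import pandas as pd

# Made-up strings in month/day/year hour:minute layout, plus one non-date entry
raw = pd.Series(["1/1/2016 21:11", "12/31/2016 1:07", "not a date"])

# errors="coerce" converts unparseable entries to NaT instead of raising
parsed = pd.to_datetime(raw, format="%m/%d/%Y %H:%M", errors="coerce")

print(parsed)
print(parsed.isna())  # True marks the entry that could not be parsed
```

Once the column has a datetime dtype, parts such as month and hour are available through the `.dt` accessor.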

**Explore start date by month**

In [None]:
sd_m_dis=df1["START_DATE*"].dt.month.value_counts()
sd_m_dis=sd_m_dis.sort_index()
sd_m_mean=sd_m_dis.mean()
def autolabel(rects):
    for rect in rects:
        height = rect.get_height()
        plt.text(rect.get_x()+rect.get_width()/2., 1.03*height, '%s' % int(height))
plt.figure(0)
rects=plt.bar(sd_m_dis.index,sd_m_dis.values)
plt.axhline(sd_m_mean,color="r",linestyle="--") #monthly average as a dashed red line
plt.title("Start date distribution by month")
plt.xlabel("Month")
plt.ylabel("Trips")
plt.grid()
autolabel(rects)

The driver took the most rides in December. Interestingly, no rides were recorded in September. In April, May, August and October the number of trips is below the monthly average.
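
A month with no rides does not appear in the output of `value_counts()` at all, so it is missing from the bar chart and is not counted in the average. As a hedged sketch on made-up counts, `reindex` can add the missing months with a count of zero:

```python
import pandas as pd

# Made-up monthly trip counts; month 3 has no trips and is absent from the index
counts = pd.Series([3, 5, 2], index=[1, 2, 4])

# Reindex over all twelve months, filling absent months with zero
full = counts.reindex(range(1, 13), fill_value=0)

print(full)
print("mean over months with trips:", counts.mean())
print("mean over all twelve months:", full.mean())
```

Note that the two averages differ, so it is worth stating which one is meant when comparing a month against "the average".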

**Explore start date by hour**

In [None]:
sd_h_dis=df1["START_DATE*"].dt.hour.value_counts()
sd_h_dis=sd_h_dis.sort_index()
sd_h_mean=sd_h_dis.mean()
plt.figure(1)
rects=plt.bar(sd_h_dis.index,sd_h_dis.values)
plt.axhline(sd_h_mean,color="r",linestyle="--") #hourly average as a dashed red line
plt.title("Start date distribution by hour")
plt.xlabel("Hours")
plt.ylabel("Trips")
plt.grid()
autolabel(rects)

Most of the trips start between 10am and 6pm.

**2. End Date**

End date will be similar to start date so we are skipping it :)

**3. Category**

In [None]:
ct_dis=df1["CATEGORY*"].value_counts()
plt.figure(2)
rects=plt.bar(range(1,len(ct_dis.index)+1),ct_dis.values)
plt.title("Category distribution")
plt.xlabel("Category")
plt.ylabel("Quantity")
plt.xticks(range(1,len(ct_dis.index)+1),ct_dis.index)
#plt.grid()
autolabel(rects)

We can see that most trips are business trips, so the driver spends far more time on business travel than on personal travel.

**4. Start**

In [None]:
st_dis=df1["START*"].value_counts()
st_dis.sort_values(inplace=True,ascending=False)
st_dis=st_dis.iloc[:10]
print("Start place:\n",st_dis)

Most of the trips start in Cary.

**5. Stop**

In [None]:
stp_dis=df1["STOP*"].value_counts()
stp_dis=stp_dis.sort_values(ascending=False) #sort_values returns a new Series, so assign it back
stp_dis=stp_dis.iloc[:10]
print("Stop place:\n",stp_dis)

We can conclude that the driver is based in Cary.
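
Because `value_counts()` already returns counts in descending order, the top places can also be taken directly with `head()`, without a separate sort. A minimal sketch on made-up place names:

```python
import pandas as pd

# Made-up place names for illustration
places = pd.Series(["Cary", "Morrisville", "Cary", "Durham", "Cary", "Morrisville"])

# value_counts() sorts by count, highest first, so head(n) gives the top n
top = places.value_counts().head(2)

print(top)
```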

**Miles**

In [None]:
ml_dis=df1["MILES*"]
ml_range_lst=["<=5","5-10","10-15","15-20",">20"]
ml_dic=dict()
for item in ml_range_lst:
    ml_dic[item]=0
for mile in ml_dis.values:
    if mile<=5:
        ml_dic["<=5"]+=1
    elif mile<=10:
        ml_dic["5-10"]+=1
    elif mile<=15:
        ml_dic["10-15"]+=1
    elif mile<=20:
        ml_dic["15-20"]+=1
    else:
        ml_dic[">20"]+=1
ml_dis=pd.Series(ml_dic)
ml_dis.sort_values(inplace=True,ascending=False)
print("Miles:\n",ml_dis)

The driver takes many more short trips than long ones. I think shorter trips may earn him more money.
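
The same distance buckets can also be sketched with `pd.cut`, which assigns each value to a bin without an explicit loop. The mileage values below are made up to exercise the bin edges. With the default `right=True`, each bin includes its upper edge, which matches the `<=` comparisons used in the loop.

```python
import numpy as np
import pandas as pd

# Made-up mileages chosen to sit on and around the bin edges
miles = pd.Series([0.5, 5.0, 5.1, 10.0, 14.9, 20.0, 20.1, 310.3])

bins = [-np.inf, 5, 10, 15, 20, np.inf]           # edges of the five buckets
labels = ["<=5", "5-10", "10-15", "15-20", ">20"]  # same labels as the loop version

# right=True (the default) means a value equal to an edge falls in the lower bucket
binned = pd.cut(miles, bins=bins, labels=labels)
counts = binned.value_counts()

print(counts)
```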

**6. Purpose**

In [None]:
pp_dis=df1["PURPOSE*"].value_counts()
pp_dis=pp_dis.sort_values(ascending=False) #sort_values returns a new Series, so assign it back
pp_dis=pp_dis.iloc[:10]
print("PURPOSE:\n",pp_dis)

Meeting and Meal/Entertain account for the most trips. Surprisingly, Between Offices and airport travel account for very few trips.

**Creating new columns**

In [None]:
df1['triptype']=np.where(df1['MILES*']<=df1['MILES*'].mean(),'short','long')

In [None]:
trip_type=df1['triptype'].value_counts()
trip_type=trip_type.sort_values(ascending=False) #sort_values returns a new Series, so assign it back
trip_type=trip_type.iloc[:10]
print("Trip type:\n",trip_type)

We have already seen that short trips are more numerous. The new column can be used for further analysis.

In [None]:
df1['KMS']=df1['MILES*']*1.609344 #1 mile = 1.609344 km

In [None]:
df1.head()

Created the KMS column based on miles.

In [None]:
df1["END_DATE*"]=pd.to_datetime(df1["END_DATE*"],format="%m/%d/%Y %H:%M") #changing the datatype and format of end date
#Calculate the duration of each ride
df1['Duration'] = df1['END_DATE*'] - df1['START_DATE*']
#convert duration to minutes (total_seconds() already accounts for whole days)
df1['Duration'] = df1['Duration'].dt.total_seconds()/60
#speed in miles per hour: minutes are converted to hours, and zero durations become NaN to avoid dividing by zero
df1['SpeedMph']=df1['MILES*']/(df1['Duration'].replace(0,np.nan)/60)

We have created speed column from miles and duration columns. This can be used for further analysis.
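
As a worked check of the units, the sketch below computes minutes and miles per hour on two made-up trips (the column names `start`, `end`, `miles`, `minutes` and `mph` are hypothetical and used only here). Speed in miles per hour is miles divided by hours, so the duration in minutes is divided by 60 first.

```python
import pandas as pd

# Two made-up trips; the second one crosses midnight
trips = pd.DataFrame({
    "start": pd.to_datetime(["2016-01-01 21:00", "2016-01-02 23:30"]),
    "end": pd.to_datetime(["2016-01-01 21:30", "2016-01-03 00:30"]),
    "miles": [10.0, 30.0],
})

# total_seconds() covers the whole timedelta, including any day component
trips["minutes"] = (trips["end"] - trips["start"]).dt.total_seconds() / 60

# miles per hour = miles / (minutes / 60)
trips["mph"] = trips["miles"] / (trips["minutes"] / 60)

print(trips[["miles", "minutes", "mph"]])
```

Here 10 miles in 30 minutes gives 20 mph, and 30 miles in 60 minutes gives 30 mph.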

**Findings**

Based on the above analysis we can answer many questions about the dataset. A few are shown below.

**What is the average length of the trip?**

In [None]:
print('Average length of trip in minutes:\n',df1['Duration'].mean())

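**What is the average number of rides per month?**

In [None]:
print('Average trips per month:', sd_m_mean) #average over the months that have at least one trip
print('Trips per month:\n', sd_m_dis)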

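**What is the average number of rides per hour?**

In [None]:
print('Average trips per hour:', sd_h_mean) #average over the hours that have at least one trip
print('Trips per hour:\n', sd_h_dis)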

**Category vs Purpose Vs Miles**

In [None]:
df1.groupby(['CATEGORY*','PURPOSE*'])['MILES*'].agg(['mean','count','max','min'])

We can see that most trips fall under the Business category, with Meeting among the most frequent purposes.

**Category vs Triptype vs Purpose Vs Speed**

In [None]:
df1.groupby(['CATEGORY*','triptype','PURPOSE*'])['SpeedMph'].agg(['mean','median','max','min','count'])