### Case study - Uber Data Analysis

Data on one driver's Uber trips is available for the year 2016.
Your manager wants you to explore this data and provide useful insights into the trip behaviour of an Uber driver.

#### Dataset - 
The dataset contains the start date, end date, category (Business or Personal), start location, stop location, miles driven, and purpose of each drive (Meal/Entertain, Errand/Supplies, Meeting, Customer Visit, etc.).


#### Steps - 

1. Import the libraries.

2. Load the data and inspect it.

3. Check for missing values; either remove or fill them.

4. Get a summary of the data using Python functions.

5. Explore the data parameter by parameter.

The data records the location (start and stop), the time (start and end), the category and purpose of each trip, and the miles covered.


In [1]:
# ----------------------
# Concepts To cover 
# ----------------------
# 1. Data profiling
# 2. group by function
# 3. Apply function 
# 4. DateTime operations 

In [1]:
# Import the libraries 
import numpy as np
import pandas as pd

In [2]:
# Read the Data 

df = pd.read_csv('My_Uber_Drives _2016.csv')

In [None]:
 # View first n rows of data 

In [6]:
df.head(10)

Unnamed: 0,START_DATE*,END_DATE*,CATEGORY*,START*,STOP*,MILES*,PURPOSE*
0,1/1/2016 21:11,1/1/2016 21:17,Business,Fort Pierce,Fort Pierce,5.1,Meal/Entertain
1,1/2/2016 1:25,1/2/2016 1:37,Business,Fort Pierce,Fort Pierce,5.0,
2,1/2/2016 20:25,1/2/2016 20:38,Business,Fort Pierce,Fort Pierce,4.8,Errand/Supplies
3,1/5/2016 17:31,1/5/2016 17:45,Business,Fort Pierce,Fort Pierce,4.7,Meeting
4,1/6/2016 14:42,1/6/2016 15:49,Business,Fort Pierce,West Palm Beach,63.7,Customer Visit
5,1/6/2016 17:15,1/6/2016 17:19,Business,West Palm Beach,West Palm Beach,4.3,Meal/Entertain
6,1/6/2016 17:30,1/6/2016 17:35,Business,West Palm Beach,Palm Beach,7.1,Meeting
7,1/7/2016 13:27,1/7/2016 13:33,Business,Cary,Cary,0.8,Meeting
8,1/10/2016 8:05,1/10/2016 8:25,Business,Cary,Morrisville,8.3,Meeting
9,1/10/2016 12:17,1/10/2016 12:44,Business,Jamaica,New York,16.5,Customer Visit


In [None]:
# View the first 5 rows of data (head() defaults to 5 rows; df.tail() would show the last 5)


In [8]:
df.head()

Unnamed: 0,START_DATE*,END_DATE*,CATEGORY*,START*,STOP*,MILES*,PURPOSE*
0,1/1/2016 21:11,1/1/2016 21:17,Business,Fort Pierce,Fort Pierce,5.1,Meal/Entertain
1,1/2/2016 1:25,1/2/2016 1:37,Business,Fort Pierce,Fort Pierce,5.0,
2,1/2/2016 20:25,1/2/2016 20:38,Business,Fort Pierce,Fort Pierce,4.8,Errand/Supplies
3,1/5/2016 17:31,1/5/2016 17:45,Business,Fort Pierce,Fort Pierce,4.7,Meeting
4,1/6/2016 14:42,1/6/2016 15:49,Business,Fort Pierce,West Palm Beach,63.7,Customer Visit


In [None]:
# understand shape and size of data 


In [9]:
df.shape

(1156, 7)

In [None]:
# Same as above, gives non-null number of records
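A minimal sketch of what this empty cell is probably meant to run: `df.info()` reports the total row count together with the non-null count and dtype of each column. The small DataFrame here is a stand-in built from the first three rows shown earlier, not the full file.

```python
import numpy as np
import pandas as pd

# Stand-in for the notebook's df, built from rows shown by df.head()
df = pd.DataFrame({
    "START_DATE*": ["1/1/2016 21:11", "1/2/2016 1:25", "1/2/2016 20:25"],
    "END_DATE*": ["1/1/2016 21:17", "1/2/2016 1:37", "1/2/2016 20:38"],
    "CATEGORY*": ["Business", "Business", "Business"],
    "START*": ["Fort Pierce", "Fort Pierce", "Fort Pierce"],
    "STOP*": ["Fort Pierce", "Fort Pierce", "Fort Pierce"],
    "MILES*": [5.1, 5.0, 4.8],
    "PURPOSE*": ["Meal/Entertain", np.nan, "Errand/Supplies"],
})

# info() prints the row count plus the non-null count and dtype of every column
df.info()

# count() returns the same non-null counts as a Series, which is easier to reuse
non_null_counts = df.count()
print(non_null_counts)
```

Comparing each column's non-null count with the total row count shows at a glance which columns have gaps.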


1. The PURPOSE column has many missing values.
2. Are there 1155 or 1156 records? `df.shape` reports 1156 rows, so one of them may not be a real trip.
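One way to investigate the record-count question is to look at rows that lack an end date. The sketch below assumes the extra record is an incomplete row rather than a real trip. The last row of the sample frame is hypothetical and only mimics such a record; its values are not taken from the file.

```python
import numpy as np
import pandas as pd

# Stand-in for df: three trips from the notebook plus one hypothetical incomplete row
df = pd.DataFrame({
    "START_DATE*": ["1/1/2016 21:11", "1/2/2016 1:25", "1/2/2016 20:25", "placeholder"],
    "END_DATE*": ["1/1/2016 21:17", "1/2/2016 1:37", "1/2/2016 20:38", np.nan],
    "CATEGORY*": ["Business", "Business", "Business", np.nan],
    "MILES*": [5.1, 5.0, 4.8, 14.9],  # 14.9 is a hypothetical value
})

# Rows without an end date cannot be complete trips
incomplete = df[df["END_DATE*"].isnull()]
print(incomplete)

# Keep only the rows that have an end date
trips = df.dropna(subset=["END_DATE*"])
print(len(df), len(trips))
```

If the real data contains exactly one such row, the number of genuine trips would be one less than the row count reported by `df.shape`.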

Flag the missing values in every column (a `True` under `PURPOSE*` marks a record with no purpose)

In [10]:
df.isnull()


Unnamed: 0,START_DATE*,END_DATE*,CATEGORY*,START*,STOP*,MILES*,PURPOSE*
0,False,False,False,False,False,False,False
1,False,False,False,False,False,False,True
2,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...
1151,False,False,False,False,False,False,False
1152,False,False,False,False,False,False,False
1153,False,False,False,False,False,False,False
1154,False,False,False,False,False,False,False


How many values are missing (null) in each column

In [11]:
df.isnull().sum()

START_DATE*      0
END_DATE*        1
CATEGORY*        1
START*           1
STOP*            1
MILES*           0
PURPOSE*       503
dtype: int64

### Renaming columns

In [None]:
# Rename the columns to remove the * from the names




# or
# Remove the * character from all the column names.


# You can also rename specific columns
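The two approaches named in the comments above could look like the sketch below. It works on copies, because the later cells in this notebook still use the starred names such as `'PURPOSE*'`. The variable names `renamed_all` and `renamed_some` are illustrative only.

```python
import pandas as pd

# Empty stand-in frame that carries the notebook's column names
df = pd.DataFrame(columns=[
    "START_DATE*", "END_DATE*", "CATEGORY*", "START*", "STOP*", "MILES*", "PURPOSE*",
])

# Option 1: strip the literal '*' from every column name
# (regex=False makes '*' plain text rather than a regex quantifier)
renamed_all = df.copy()
renamed_all.columns = renamed_all.columns.str.replace("*", "", regex=False)

# Option 2: rename only selected columns with an explicit mapping
renamed_some = df.rename(columns={"MILES*": "MILES", "PURPOSE*": "PURPOSE"})

print(list(renamed_all.columns))
print(list(renamed_some.columns))
```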


### Filtering dataframes - 1

In [12]:
# Show the entries where PURPOSE is null
df[df['PURPOSE*'].isnull()]
# To invert the selection (not null), negate the boolean mask with ~


Unnamed: 0,START_DATE*,END_DATE*,CATEGORY*,START*,STOP*,MILES*,PURPOSE*
1,1/2/2016 1:25,1/2/2016 1:37,Business,Fort Pierce,Fort Pierce,5.0,
32,1/19/2016 9:09,1/19/2016 9:23,Business,Whitebridge,Lake Wellingborough,7.2,
85,2/9/2016 10:54,2/9/2016 11:07,Personal,Whitebridge,Northwoods,5.3,
86,2/9/2016 11:43,2/9/2016 11:50,Personal,Northwoods,Tanglewood,3.0,
87,2/9/2016 13:36,2/9/2016 13:52,Personal,Tanglewood,Preston,5.1,
...,...,...,...,...,...,...,...
1066,12/19/2016 14:37,12/19/2016 14:50,Business,Unknown Location,Unknown Location,5.4,
1069,12/19/2016 19:05,12/19/2016 19:17,Business,Islamabad,Unknown Location,2.2,
1071,12/20/2016 8:49,12/20/2016 9:24,Business,Unknown Location,Rawalpindi,12.0,
1143,12/29/2016 20:53,12/29/2016 21:42,Business,Kar?chi,Unknown Location,6.4,


### Filtering dataframes - 2

In [None]:
# 1. Conditions within the dataframe

# Invert the selection: keep the rows where PURPOSE is not null
dfd1 = df[~df['PURPOSE*'].isnull()]
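As a small illustration of conditions within a dataframe, the sketch below negates a mask with `~` and combines two conditions with `&`. The sample rows come from the `df.head()` output; the 5-mile threshold is an arbitrary choice for the example.

```python
import numpy as np
import pandas as pd

# Stand-in for df, using the first five rows shown by df.head()
df = pd.DataFrame({
    "CATEGORY*": ["Business"] * 5,
    "MILES*": [5.1, 5.0, 4.8, 4.7, 63.7],
    "PURPOSE*": ["Meal/Entertain", np.nan, "Errand/Supplies", "Meeting", "Customer Visit"],
})

# ~ negates a boolean mask: keep the rows where PURPOSE is present
has_purpose = df[~df["PURPOSE*"].isnull()]

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
long_business = df[(df["CATEGORY*"] == "Business") & (df["MILES*"] > 5)]

print(has_purpose)
print(long_business)
```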


Explore the details from the MILES column


In [17]:
# Show the row that has the max miles
# (assigned to a variable, because a cell displays only its last expression)
max_miles_row = df[df['MILES*'] == df['MILES*'].max()]

# Show the top 10 rides in terms of distance driven (MILES in decreasing order)
df['MILES*'].sort_values(ascending=False).head(10)




1155    12204.7
269       310.3
270       201.0
881       195.9
776       195.6
546       195.3
559       180.2
297       174.2
299       159.3
727       156.9
Name: MILES*, dtype: float64
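Sorting the column on its own returns only the mileage values. If the full rows are wanted, `DataFrame.nlargest` is one option, and `idxmax` locates the single longest ride. The sketch uses the ten rows from the `df.head()` output as a stand-in for the full data.

```python
import pandas as pd

# Stand-in for df: locations and miles of the first ten rows shown earlier
df = pd.DataFrame({
    "START*": ["Fort Pierce", "Fort Pierce", "Fort Pierce", "Fort Pierce", "Fort Pierce",
               "West Palm Beach", "West Palm Beach", "Cary", "Cary", "Jamaica"],
    "STOP*": ["Fort Pierce", "Fort Pierce", "Fort Pierce", "Fort Pierce", "West Palm Beach",
              "West Palm Beach", "Palm Beach", "Cary", "Morrisville", "New York"],
    "MILES*": [5.1, 5.0, 4.8, 4.7, 63.7, 4.3, 7.1, 0.8, 8.3, 16.5],
})

# Full rows of the three longest rides, longest first
top3 = df.nlargest(3, "MILES*")
print(top3)

# The single row with the maximum miles
longest = df.loc[df["MILES*"].idxmax()]
print(longest)
```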

#### Dropping rows which have null values

In [None]:
# Take the initial data and drop the NA values

#Get the shape of the dataframe after removing the null values
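A possible way to fill in this cell, sketched on a five-row stand-in: `dropna()` returns a new frame without any row that contains a null, and `subset` restricts the check to chosen columns. The names `df_clean` and `df_purpose` are illustrative.

```python
import numpy as np
import pandas as pd

# Stand-in for df, using the first five rows shown by df.head()
df = pd.DataFrame({
    "START*": ["Fort Pierce"] * 5,
    "MILES*": [5.1, 5.0, 4.8, 4.7, 63.7],
    "PURPOSE*": ["Meal/Entertain", np.nan, "Errand/Supplies", "Meeting", "Customer Visit"],
})

# Drop every row that has a null in any column (df itself is left unchanged)
df_clean = df.dropna()

# Drop only the rows where PURPOSE is null
df_purpose = df.dropna(subset=["PURPOSE*"])

# Shape after removing the null values
print(df_clean.shape)
```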


After dropping the nulls (in the PURPOSE column), the filtered dataset contains 653 rows.

### Let's explore the data parameter by parameter - 

1. Destination (starting and stopping points)

2. Time (hour of the day, day of week, month of year)

3. Categories

4. Purpose

5. Grouping two parameters to get more insights


## 1. Understanding the start and stop points

In [None]:
# Get the unique starting point, unique destination
# names of unique start points


In [None]:
#count of unique start points using  len()

In [None]:
# or you can use the nunique function
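The three cells above could be filled in roughly as follows. This is a sketch on the ten rows from the `df.head()` output, so the counts it prints apply to that sample only.

```python
import pandas as pd

# Stand-in for df: start locations of the first ten rows shown earlier
df = pd.DataFrame({
    "START*": ["Fort Pierce", "Fort Pierce", "Fort Pierce", "Fort Pierce", "Fort Pierce",
               "West Palm Beach", "West Palm Beach", "Cary", "Cary", "Jamaica"],
})

# Names of the unique start points, in order of first appearance
start_names = df["START*"].unique()
print(start_names)

# Count of unique start points using len()
n_start = len(start_names)

# Same count with nunique()
n_start_alt = df["START*"].nunique()
print(n_start, n_start_alt)
```

On data with missing values the two counts can differ by one, because `unique()` keeps NaN as a value while `nunique()` skips it by default.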

In [None]:
# Get the unique stop points (destinations)
# names of unique stop points

In [None]:
# count of unique stop points

Locations that appear as both start and stop points
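One way to find the locations that serve as both a start and a stop point is a set intersection. The sketch uses the ten rows from the `df.head()` output as sample data, and dropping NaN first is a precaution for the full file.

```python
import pandas as pd

# Stand-in for df: locations of the first ten rows shown earlier
df = pd.DataFrame({
    "START*": ["Fort Pierce", "Fort Pierce", "Fort Pierce", "Fort Pierce", "Fort Pierce",
               "West Palm Beach", "West Palm Beach", "Cary", "Cary", "Jamaica"],
    "STOP*": ["Fort Pierce", "Fort Pierce", "Fort Pierce", "Fort Pierce", "West Palm Beach",
              "West Palm Beach", "Palm Beach", "Cary", "Morrisville", "New York"],
})

# Intersect the two sets of names; dropna() keeps NaN out of the comparison
both = set(df["START*"].dropna()) & set(df["STOP*"].dropna())
print(sorted(both))
```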

In [None]:
#Identify popular start points - top 10


In [None]:
#Identify popular stop destinations - top 10
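For the popularity questions, `value_counts()` already sorts by frequency, so taking `head(10)` gives the top 10. This is a sketch on the ten sample rows from the `df.head()` output, so the sample has fewer than ten distinct names.

```python
import pandas as pd

# Stand-in for df: locations of the first ten rows shown earlier
df = pd.DataFrame({
    "START*": ["Fort Pierce", "Fort Pierce", "Fort Pierce", "Fort Pierce", "Fort Pierce",
               "West Palm Beach", "West Palm Beach", "Cary", "Cary", "Jamaica"],
    "STOP*": ["Fort Pierce", "Fort Pierce", "Fort Pierce", "Fort Pierce", "West Palm Beach",
              "West Palm Beach", "Palm Beach", "Cary", "Morrisville", "New York"],
})

# Most frequent start points, most popular first
top_starts = df["START*"].value_counts().head(10)
print(top_starts)

# Most frequent stop destinations, most popular first
top_stops = df["STOP*"].value_counts().head(10)
print(top_stops)
```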


In [None]:
# Are there cases where the start and the stop location are the same?
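Comparing the two columns element-wise answers this directly. The sketch below again uses the ten rows from the `df.head()` output, and `same_place` is an illustrative name.

```python
import pandas as pd

# Stand-in for df: locations of the first ten rows shown earlier
df = pd.DataFrame({
    "START*": ["Fort Pierce", "Fort Pierce", "Fort Pierce", "Fort Pierce", "Fort Pierce",
               "West Palm Beach", "West Palm Beach", "Cary", "Cary", "Jamaica"],
    "STOP*": ["Fort Pierce", "Fort Pierce", "Fort Pierce", "Fort Pierce", "West Palm Beach",
              "West Palm Beach", "Palm Beach", "Cary", "Morrisville", "New York"],
})

# Boolean mask: True where a ride starts and stops in the same location
same_place = df[df["START*"] == df["STOP*"]]
print(len(same_place))
print(same_place)
```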
