# **Assignment on Uber dataset**

**Load the dataset**

In [46]:
import pandas as pd
uber_data = pd.read_csv("Uber.csv") 

**Display basic info about the dataset**

In [48]:
uber_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1156 entries, 0 to 1155
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   START_DATE*  1156 non-null   object 
 1   END_DATE*    1155 non-null   object 
 2   CATEGORY*    1155 non-null   object 
 3   START*       1155 non-null   object 
 4   STOP*        1155 non-null   object 
 5   MILES*       1156 non-null   float64
 6   PURPOSE*     653 non-null    object 
dtypes: float64(1), object(6)
memory usage: 63.3+ KB


**Check for missing values**

In [50]:
uber_data.isnull().sum()

START_DATE*      0
END_DATE*        1
CATEGORY*        1
START*           1
STOP*            1
MILES*           0
PURPOSE*       503
dtype: int64
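In the counts above, 503 of 1156 rows (about 43.5%) have no `PURPOSE*` value. Raw counts can be easier to judge as percentages. The sketch below shows one way to do that on a small made-up frame; `sample` and its values are illustrative assumptions, not rows from `Uber.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for uber_data; the values are illustrative only.
sample = pd.DataFrame({
    "MILES*": [5.1, 4.8, 7.1, 10.4],
    "PURPOSE*": ["Meeting", np.nan, np.nan, "Commute"],
})

# isnull() yields booleans, so mean() is the fraction of missing entries per column.
missing_pct = sample.isnull().mean() * 100
print(missing_pct)
```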

**Drop rows with missing values**

In [52]:
uber_data = uber_data.dropna()

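**Fill missing values (PURPOSE* with 'Unknown', MILES* with the column mean)**

In [56]:
# Note: the earlier dropna() already removed every row containing a missing
# value, so at this point both fillna() calls leave the data unchanged.
uber_data['PURPOSE*'] = uber_data['PURPOSE*'].fillna('Unknown')

uber_data['MILES*'] = uber_data['MILES*'].fillna(uber_data['MILES*'].mean())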
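The order of `dropna()` and `fillna()` matters. When `dropna()` runs first, rows with a missing `PURPOSE*` are discarded before they can be labelled 'Unknown'. The following is a minimal sketch of the two orderings on a made-up frame (`sample`, `drop_first`, and `fill_first` are hypothetical names). It is not a claim about which ordering the assignment intended.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three rides lack a purpose, one also lacks a category.
sample = pd.DataFrame({
    "CATEGORY*": ["Business", "Business", "Personal", np.nan],
    "MILES*": [5.1, 4.8, 7.1, 12.0],
    "PURPOSE*": ["Meeting", np.nan, np.nan, np.nan],
})

# Ordering used in this notebook: drop first, so fillna() finds nothing to fill.
drop_first = sample.dropna().copy()
drop_first["PURPOSE*"] = drop_first["PURPOSE*"].fillna("Unknown")

# Alternative ordering: label missing purposes, then drop rows that are
# still incomplete in some other column.
fill_first = sample.copy()
fill_first["PURPOSE*"] = fill_first["PURPOSE*"].fillna("Unknown")
fill_first = fill_first.dropna()

print(len(drop_first), len(fill_first))
```

With the alternative ordering, rides without a recorded purpose stay in the data and show up as their own 'Unknown' group in later aggregations.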


**Check and remove duplicates**

In [58]:
uber_data.duplicated().sum()

1

In [60]:
uber_data = uber_data.drop_duplicates()
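Before dropping duplicates it can help to look at them. A small sketch, assuming a made-up frame named `sample`: `duplicated(keep=False)` flags every member of a duplicate group, whereas the default flags only the later copies.

```python
import pandas as pd

# Hypothetical rides; the first two rows are exact duplicates.
sample = pd.DataFrame({
    "START*": ["Cary", "Cary", "Apex"],
    "STOP*": ["Durham", "Durham", "Cary"],
    "MILES*": [10.4, 10.4, 5.5],
})

# keep=False marks all copies, so both duplicate rows can be inspected.
dupes = sample[sample.duplicated(keep=False)]
print(dupes)

# drop_duplicates() keeps the first occurrence of each group by default.
deduped = sample.drop_duplicates()
```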

**Convert START_DATE and END_DATE to datetime**

In [62]:
uber_data['START_DATE*'] = pd.to_datetime(uber_data['START_DATE*'], errors='coerce')

In [64]:
uber_data['END_DATE*'] = pd.to_datetime(uber_data['END_DATE*'], errors='coerce')
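Because `errors='coerce'` silently turns unparseable strings into `NaT`, it is worth counting them after the conversion. Once both columns are datetimes, a trip duration follows by subtraction. The sketch below uses a made-up frame, and `DURATION_MIN` is a hypothetical column name that does not appear in the original notebook.

```python
import pandas as pd

# Hypothetical input: one valid start time and one malformed one.
sample = pd.DataFrame({
    "START_DATE*": ["2016-01-06 14:42:00", "not a date"],
    "END_DATE*": ["2016-01-06 15:49:00", "2016-01-06 16:00:00"],
})

for col in ["START_DATE*", "END_DATE*"]:
    sample[col] = pd.to_datetime(sample[col], errors="coerce")

# Count values that could not be parsed and became NaT.
unparsed = sample["START_DATE*"].isna().sum()

# Subtracting datetimes gives a Timedelta; convert it to minutes.
sample["DURATION_MIN"] = (
    sample["END_DATE*"] - sample["START_DATE*"]
).dt.total_seconds() / 60
print(unparsed, sample["DURATION_MIN"].tolist())
```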

**Total number of rides per category:**

In [67]:
rides_per_category = uber_data['CATEGORY*'].value_counts()
print(rides_per_category)

CATEGORY*
Business    646
Personal      6
Name: count, dtype: int64


**Total miles traveled for each purpose:**

In [70]:
miles_per_purpose = uber_data.groupby('PURPOSE*')['MILES*'].sum()
print(miles_per_purpose)

PURPOSE*
Airport/Travel       16.5
Between Offices     197.0
Charity ($)          15.1
Commute             180.2
Customer Visit     2089.5
Errand/Supplies     508.0
Meal/Entertain      911.7
Meeting            2841.4
Moving               18.2
Temporary Site      523.7
Name: MILES*, dtype: float64
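The totals are printed in alphabetical order of purpose. Sorting them and expressing each as a share of all miles makes the ranking easier to read. The sketch below rebuilds the series from the printed output above so that it runs on its own; `ranked` and `share` are illustrative names.

```python
import pandas as pd

# Totals copied from the printed output above.
miles_per_purpose = pd.Series({
    "Airport/Travel": 16.5,
    "Between Offices": 197.0,
    "Charity ($)": 15.1,
    "Commute": 180.2,
    "Customer Visit": 2089.5,
    "Errand/Supplies": 508.0,
    "Meal/Entertain": 911.7,
    "Meeting": 2841.4,
    "Moving": 18.2,
    "Temporary Site": 523.7,
}, name="MILES*")

# Largest totals first, then each purpose as a percentage of all miles.
ranked = miles_per_purpose.sort_values(ascending=False)
share = (ranked / ranked.sum() * 100).round(1)
print(share.head(3))
```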


**Average distance for business vs. personal rides:**

In [90]:
avg_dist = uber_data.groupby('CATEGORY*')['MILES*'].mean()
print(avg_dist)

CATEGORY*
Business    10.971827
Personal    35.583333
Name: MILES*, dtype: float64
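The Personal average rests on only 6 rides, so a single long trip can move it a lot. Reporting the count and median next to the mean makes that visible. A minimal sketch on made-up data (`sample` and `summary` are hypothetical names):

```python
import pandas as pd

# Hypothetical rides; values are illustrative only.
sample = pd.DataFrame({
    "CATEGORY*": ["Business", "Business", "Business", "Personal"],
    "MILES*": [5.0, 7.0, 12.0, 40.0],
})

# agg() with a list of function names returns one column per statistic.
summary = sample.groupby("CATEGORY*")["MILES*"].agg(["count", "mean", "median"])
print(summary)
```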


**Add a column for cost estimation (assuming $2 per mile):**

In [76]:
uber_data['COST_ESTIMATION'] = uber_data['MILES*'] * 2
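Putting the rate in a named constant makes the assumption explicit and easy to change. The sketch below is one possible arrangement: `RATE_PER_MILE` is a hypothetical name, the $2 figure is the assignment's assumption rather than a real fare, and the two mileages are taken from the long rides in this dataset.

```python
import pandas as pd

# Assumed flat rate from the assignment; not an actual Uber tariff.
RATE_PER_MILE = 2.0

# Two mileages that appear among the long rides in this dataset.
sample = pd.DataFrame({"MILES*": [63.7, 136.0]})

# Round to cents so the column reads as a currency amount.
sample["COST_ESTIMATION"] = (sample["MILES*"] * RATE_PER_MILE).round(2)
print(sample)
```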

**Filter rides longer than 50 miles:**

In [79]:
long_rides = uber_data[uber_data['MILES*'] > 50]
long_rides

|     | START_DATE* | END_DATE* | CATEGORY* | START* | STOP* | MILES* | PURPOSE* | COST_ESTIMATION |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4 | 2016-01-06 14:42:00 | 2016-01-06 15:49:00 | Business | Fort Pierce | West Palm Beach | 63.7 | Customer Visit | 127.4 |
| 232 | 2016-03-17 12:52:00 | 2016-03-17 15:11:00 | Business | Austin | Katy | 136.0 | Customer Visit | 272.0 |
| 251 | 2016-03-19 19:33:00 | 2016-03-19 20:39:00 | Business | Galveston | Houston | 57.0 | Customer Visit | 114.0 |
| 268 | 2016-03-25 13:24:00 | 2016-03-25 16:22:00 | Business | Cary | Latta | 144.0 | Customer Visit | 288.0 |
| 269 | 2016-03-25 16:52:00 | 2016-03-25 22:22:00 | Business | Latta | Jacksonville | 310.3 | Customer Visit | 620.6 |
| 270 | 2016-03-25 22:54:00 | 2016-03-26 01:39:00 | Business | Jacksonville | Kissimmee | 201.0 | Meeting | 402.0 |
| 295 | 2016-04-02 12:21:00 | 2016-04-02 14:47:00 | Business | Kissimmee | Daytona Beach | 77.3 | Customer Visit | 154.6 |
| 296 | 2016-04-02 16:57:00 | 2016-04-02 18:09:00 | Business | Daytona Beach | Jacksonville | 80.5 | Customer Visit | 161.0 |
| 297 | 2016-04-02 19:38:00 | 2016-04-02 22:36:00 | Business | Jacksonville | Ridgeland | 174.2 | Customer Visit | 348.4 |
| 298 | 2016-04-02 23:11:00 | 2016-04-03 01:34:00 | Business | Ridgeland | Florence | 144.0 | Meeting | 288.0 |


**Filter by specific purpose (e.g., meetings):**

In [82]:
meetings_rides = uber_data[uber_data['PURPOSE*'] == 'Meeting']
meetings_rides

|     | START_DATE* | END_DATE* | CATEGORY* | START* | STOP* | MILES* | PURPOSE* | COST_ESTIMATION |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3 | 2016-01-05 17:31:00 | 2016-01-05 17:45:00 | Business | Fort Pierce | Fort Pierce | 4.7 | Meeting | 9.4 |
| 6 | 2016-01-06 17:30:00 | 2016-01-06 17:35:00 | Business | West Palm Beach | Palm Beach | 7.1 | Meeting | 14.2 |
| 7 | 2016-01-07 13:27:00 | 2016-01-07 13:33:00 | Business | Cary | Cary | 0.8 | Meeting | 1.6 |
| 8 | 2016-01-10 08:05:00 | 2016-01-10 08:25:00 | Business | Cary | Morrisville | 8.3 | Meeting | 16.6 |
| 10 | 2016-01-10 15:08:00 | 2016-01-10 15:51:00 | Business | New York | Queens | 10.8 | Meeting | 21.6 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1142 | 2016-12-29 20:15:00 | 2016-12-29 20:45:00 | Business | Kar?chi | Kar?chi | 7.2 | Meeting | 14.4 |
| 1144 | 2016-12-29 23:14:00 | 2016-12-29 23:47:00 | Business | Unknown Location | Kar?chi | 12.9 | Meeting | 25.8 |
| 1148 | 2016-12-30 16:45:00 | 2016-12-30 17:08:00 | Business | Kar?chi | Kar?chi | 4.6 | Meeting | 9.2 |
| 1150 | 2016-12-31 01:07:00 | 2016-12-31 01:14:00 | Business | Kar?chi | Kar?chi | 0.7 | Meeting | 1.4 |
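The distance filter and the purpose filter can be combined in a single boolean mask. A minimal sketch on a made-up frame (`sample` and `long_meetings` are hypothetical names):

```python
import pandas as pd

# Hypothetical rides; values are illustrative only.
sample = pd.DataFrame({
    "MILES*": [4.7, 201.0, 63.7],
    "PURPOSE*": ["Meeting", "Meeting", "Customer Visit"],
})

# Each condition needs its own parentheses because & binds more tightly
# than the comparison operators > and ==.
long_meetings = sample[(sample["MILES*"] > 50) & (sample["PURPOSE*"] == "Meeting")]
print(long_meetings)
```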


**What is the total number of business trips versus personal trips?**

In [85]:
business_vs_personal = uber_data['CATEGORY*'].value_counts()
business_vs_personal

CATEGORY*
Business    646
Personal      6
Name: count, dtype: int64

**What percentage of trips are business versus personal?**

In [92]:
percent = uber_data['CATEGORY*'].value_counts(normalize=True) * 100
percent

CATEGORY*
Business    99.079755
Personal     0.920245
Name: proportion, dtype: float64
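As a cross-check, the same percentages follow directly from the counts reported earlier (646 Business and 6 Personal rides). The sketch below redoes that arithmetic and rounds to two decimals for presentation; `counts` is rebuilt by hand here so the block runs on its own.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({"Business": 646, "Personal": 6}, name="count")

# Each category's share of all rides, as a percentage rounded to 2 decimals.
percent = (counts / counts.sum() * 100).round(2)
print(percent)
```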