## Uber Data Analysis

### 1. Understand the Problem and Data
Description

Business Problem:

Data Description:

Tools:

EDA


## 🚗 **Real-World Business Scenario**

You are a **Data Analyst working for Uber** in the **Operations and Strategy Department**.
Your goal is to analyze **driver trip data** to understand:

* **Usage patterns** (when, where, and why people are using Uber)
* **Trip performance** (trip length, time, miles, frequency)
* **Customer and business insights** (business vs personal use)
* **Operational optimization** (reduce idle time, improve driver efficiency, increase revenue)

You are tasked with conducting an **Exploratory Data Analysis (EDA)** on Uber ride data to help management make **data-driven decisions**.

---

## 🎯 **EDA Objectives**

1. Understand the **overall trip behavior**.
2. Identify **patterns and trends** in ride usage.
3. Explore **seasonal and time-based variations**.
4. Compare **trip purposes and categories**.
5. Detect **outliers or anomalies** (e.g., very long trips).
6. Provide **actionable insights** for Uber's business strategy.

---

## 💡 **Key Business Questions for EDA**

### 🕒 1. Time-based Analysis

* What is the **total number of trips per day/week/month**?
* Which **days of the week** have the most Uber trips?
* What are the **peak hours** for Uber rides (morning, evening, late night)?
* How does the **ride frequency vary over time** (seasonal trends)?
* Are there any **particular months or dates** with unusual activity?

---

### 📍 2. Location-based Analysis

* Which **pickup locations (START)** are most common?
* Which **drop locations (STOP)** are most frequent?
* What are the **top origin-destination pairs**?
* Do riders tend to **travel within the same city** or **between cities**?

---

### 🧭 3. Distance and Duration Insights

* What is the **average distance (miles)** per trip?
* Which trips are **shortest and longest**?
* Is there a **relationship between trip distance and time**?
* How do **business trips** differ from **personal trips** in terms of distance?
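
The distance and duration questions above could be approached along the following lines. This is a minimal sketch on a small hand-made table that copies the column names of the Uber file; the rows and the derived column names (`duration_min`, `mph`) are assumptions for illustration, not values from the dataset.

```python
import pandas as pd

# Toy trips in the same layout as the Uber file (illustrative values only).
trips = pd.DataFrame({
    "START_DATE": pd.to_datetime(["2016-01-01 21:11", "2016-01-06 14:42",
                                  "2016-01-10 08:05", "2016-01-10 18:30"]),
    "END_DATE": pd.to_datetime(["2016-01-01 21:17", "2016-01-06 15:49",
                                "2016-01-10 08:25", "2016-01-10 18:40"]),
    "CATEGORY": ["Business", "Business", "Personal", "Personal"],
    "MILES": [5.1, 63.7, 8.3, 2.0],
})

# Trip duration in minutes, derived from the two timestamps.
trips["duration_min"] = (trips["END_DATE"] - trips["START_DATE"]).dt.total_seconds() / 60

# Average speed is a quick sanity check on distance versus time.
trips["mph"] = trips["MILES"] / (trips["duration_min"] / 60)

avg_miles = trips["MILES"].mean()                       # average distance per trip
longest = trips.loc[trips["MILES"].idxmax()]            # longest trip
shortest = trips.loc[trips["MILES"].idxmin()]           # shortest trip
corr = trips["MILES"].corr(trips["duration_min"])       # distance vs. time relationship
by_category = trips.groupby("CATEGORY")["MILES"].agg(["count", "mean", "median"])

print(f"average miles: {avg_miles:.2f}, correlation: {corr:.2f}")
print(by_category)
```

On the real data, the median is usually a safer comparison than the mean because a few very long trips can dominate the average.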

---

### 🏢 4. Category and Purpose Analysis

* What proportion of rides are **Business** vs **Personal**?
* What are the **top purposes** for business trips (e.g., Customer Visit, Meeting, Errand)?
* Are **Business rides longer or shorter** on average compared to Personal rides?
* Do certain **purposes** happen at specific times of day (e.g., "Meeting" rides during office hours)?
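
One possible way to answer the category and purpose questions is with `value_counts(normalize=True)` and a cross-tabulation of purpose against start hour. The sketch below uses made-up rows with the dataset's column names; the `hour` column is an assumed helper column.

```python
import pandas as pd

# Toy trips (illustrative values only).
trips = pd.DataFrame({
    "START_DATE": pd.to_datetime(["2016-03-01 09:00", "2016-03-01 10:30",
                                  "2016-03-02 14:00", "2016-03-03 08:00",
                                  "2016-03-04 20:00"]),
    "CATEGORY": ["Business", "Business", "Business", "Personal", "Business"],
    "PURPOSE": ["Meeting", "Meeting", "Customer Visit", "Commute", "Meal/Entertain"],
})

# Share of Business vs Personal rides.
share = trips["CATEGORY"].value_counts(normalize=True)

# Most common purposes among Business rides only.
top_purposes = trips.loc[trips["CATEGORY"] == "Business", "PURPOSE"].value_counts()

# Purpose by hour of day: rows are purposes, columns are start hours.
trips["hour"] = trips["START_DATE"].dt.hour
purpose_by_hour = pd.crosstab(trips["PURPOSE"], trips["hour"])

print(share)
print(top_purposes)
print(purpose_by_hour)
```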

---

### 📈 5. Driver and Operational Efficiency

* How many **total miles** were driven overall?
* What is the **average miles per day**?
* Are there **idle days** (no rides)?
* Are there **patterns of frequent short rides** vs **few long rides**?
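
The efficiency questions above can be sketched with a daily resample, which also exposes idle days as days with a trip count of zero. The data below is invented, and the distance thresholds used for "short", "medium", and "long" rides are assumptions chosen only for the example.

```python
import pandas as pd

# Toy trips (illustrative values only).
trips = pd.DataFrame({
    "START_DATE": pd.to_datetime(["2016-01-01 09:00", "2016-01-01 18:00",
                                  "2016-01-02 10:00", "2016-01-05 12:00"]),
    "MILES": [5.0, 3.0, 12.0, 40.0],
})

total_miles = trips["MILES"].sum()

# One row per calendar day between the first and last trip;
# days without trips get count 0 and sum 0.0.
daily = trips.set_index("START_DATE")["MILES"].resample("D").agg(["count", "sum"])

idle_days = daily.index[daily["count"] == 0]
avg_miles_per_active_day = daily.loc[daily["count"] > 0, "sum"].mean()
avg_miles_per_calendar_day = daily["sum"].mean()

# Frequent short rides vs. few long rides (hypothetical cut-offs: 5 and 20 miles).
ride_mix = pd.cut(trips["MILES"], bins=[0, 5, 20, float("inf")],
                  labels=["short", "medium", "long"]).value_counts()

print(total_miles, avg_miles_per_active_day, avg_miles_per_calendar_day)
print(list(idle_days.date))
print(ride_mix)
```

Whether the average should be taken over active days or over all calendar days depends on the question being asked, so the sketch reports both.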

---

### ⚠️ 6. Data Quality and Anomalies

* Are there any **missing values** in PURPOSE or CATEGORY?
* Any **negative or zero miles** that need cleaning?
* Are there **duplicate trips** or incorrect timestamps?
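
These checks can be collected into a small report. The sketch below runs on a deliberately flawed toy table; the `report` dictionary and its key names are assumptions for illustration.

```python
import pandas as pd

# Toy trips with one problem of each kind (illustrative values only).
trips = pd.DataFrame({
    "START_DATE": pd.to_datetime(["2016-06-28 23:34", "2016-06-28 23:34",
                                  "2016-07-01 10:00", "2016-07-02 09:00"]),
    "END_DATE": pd.to_datetime(["2016-06-28 23:59", "2016-06-28 23:59",
                                "2016-07-01 09:50", "2016-07-02 09:20"]),
    "CATEGORY": ["Business", "Business", "Business", None],
    "MILES": [9.9, 9.9, 0.0, 4.2],
    "PURPOSE": ["Meeting", "Meeting", None, "Errand/Supplies"],
})

report = {
    "missing_purpose": int(trips["PURPOSE"].isna().sum()),
    "missing_category": int(trips["CATEGORY"].isna().sum()),
    "non_positive_miles": int((trips["MILES"] <= 0).sum()),
    "duplicate_rows": int(trips.duplicated().sum()),
    # A trip that ends before it starts points to a timestamp error.
    "end_before_start": int((trips["END_DATE"] < trips["START_DATE"]).sum()),
}

for check, count in report.items():
    print(f"{check}: {count}")
```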

---

### 💰 7. Potential Business Recommendations

After performing EDA, you could generate insights like:

* Suggest **incentives** for high-demand times or routes.
* Identify **popular travel corridors** where surge pricing could apply.
* Recommend **driver shifts** based on demand peaks.
* Understand how **business vs personal rides** contribute to revenue.

---

## 🧮 **Next Steps After EDA**

Once the EDA is complete, you can:

1. Build a **Power BI or Tableau dashboard** showing interactive trip insights.
2. Use **Machine Learning** to predict:

   * Trip demand by hour/day.
   * Average distance or ride duration.
   * Customer segmentation by travel purpose.

---

## 2. Data Collection and Preparation
- Gather Data: Collect the relevant data from various sources.
- Clean and Preprocess: Handle missing values, remove duplicates, fix data types, correct inconsistent or invalid data, and transform the data as needed.

In [1]:
import pandas as pd

In [2]:
# 2.1 Data collection / data load
df = pd.read_csv('/content/UberDataset.csv')
df.head()

Unnamed: 0,START_DATE,END_DATE,CATEGORY,START,STOP,MILES,PURPOSE
0,01-01-2016 21:11,01-01-2016 21:17,Business,Fort Pierce,Fort Pierce,5.1,Meal/Entertain
1,01-02-2016 01:25,01-02-2016 01:37,Business,Fort Pierce,Fort Pierce,5.0,
2,01-02-2016 20:25,01-02-2016 20:38,Business,Fort Pierce,Fort Pierce,4.8,Errand/Supplies
3,01-05-2016 17:31,01-05-2016 17:45,Business,Fort Pierce,Fort Pierce,4.7,Meeting
4,01-06-2016 14:42,01-06-2016 15:49,Business,Fort Pierce,West Palm Beach,63.7,Customer Visit


Dataset source: https://www.kaggle.com/datasets/bhanupratapbiswas/uber-data-analysis

## Steps
1. basic info
2. missing/null value check
3. missing value delete/update
4. duplicate value check
5. duplicate value delete/drop
6. invalid data check and correct
7. data types corrections
8. save the clean data
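
The steps above can be sketched as a single function. This is only an outline under stated assumptions: `clean_trips` is a hypothetical helper, the toy rows are invented (including a summary-style row that has only `START_DATE` and `MILES` filled), and the date format is assumed to be month-first with slashes.

```python
import pandas as pd

def clean_trips(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps in order and return a new DataFrame."""
    df = raw.copy()
    # Steps 2-3: fill missing PURPOSE with the most frequent value,
    # then drop rows that are still missing any field.
    df["PURPOSE"] = df["PURPOSE"].fillna(df["PURPOSE"].mode()[0])
    df = df.dropna()
    # Steps 4-5: remove exact duplicate rows.
    df = df.drop_duplicates()
    # Step 6: keep only trips with a positive distance.
    df = df[df["MILES"] > 0].copy()
    # Step 7: parse timestamps (assumed format: month/day/year hour:minute).
    for col in ("START_DATE", "END_DATE"):
        df[col] = pd.to_datetime(df[col], format="%m/%d/%Y %H:%M")
    return df.reset_index(drop=True)

# Toy input (illustrative values only).
raw = pd.DataFrame({
    "START_DATE": ["6/28/2016 23:34", "6/28/2016 23:34", "1/13/2016 13:54", "Totals"],
    "END_DATE": ["6/28/2016 23:59", "6/28/2016 23:59", "1/13/2016 14:07", None],
    "CATEGORY": ["Business", "Business", "Business", None],
    "START": ["Durham", "Durham", "Downtown", None],
    "STOP": ["Cary", "Cary", "Gulfton", None],
    "MILES": [9.9, 9.9, 11.2, 21.1],
    "PURPOSE": ["Meeting", "Meeting", None, None],
})

clean = clean_trips(raw)
print(clean)
# Step 8 would then be: clean.to_csv("uber_clean.csv", index=False)
```

Returning a new DataFrame instead of mutating in place keeps the raw data available for comparison after each step.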

====
EDA


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1156 entries, 0 to 1155
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   START_DATE  1156 non-null   object 
 1   END_DATE    1155 non-null   object 
 2   CATEGORY    1155 non-null   object 
 3   START       1155 non-null   object 
 4   STOP        1155 non-null   object 
 5   MILES       1156 non-null   float64
 6   PURPOSE     653 non-null    object 
dtypes: float64(1), object(6)
memory usage: 63.3+ KB


In [4]:
# 1.1 statistical summary --> numerical columns
df.describe()

Unnamed: 0,MILES
count,1156.0
mean,21.115398
std,359.299007
min,0.5
25%,2.9
50%,6.0
75%,10.4
max,12204.7
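
The summary shows a maximum of 12204.7 miles against a 75th percentile of 10.4, which suggests at least one extreme row worth inspecting. One common way to flag such values is the 1.5 × IQR rule, sketched here on a handful of made-up distances; the rule and its multiplier are a conventional choice, not something the dataset prescribes.

```python
import pandas as pd

# Toy distances, including one extreme value (illustrative only).
miles = pd.Series([0.5, 2.9, 5.1, 6.0, 10.4, 63.7, 12204.7], name="MILES")

# Interquartile range and the usual 1.5 * IQR fences.
q1, q3 = miles.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = miles[(miles < lower) | (miles > upper)]
print(f"fences: [{lower:.3f}, {upper:.3f}]")
print(outliers)
```

Flagged rows are candidates for inspection rather than automatic deletion, since a long trip can be legitimate.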


In [5]:
# 1.2 statistical summary --> categorical columns
df.describe(include='object')

Unnamed: 0,START_DATE,END_DATE,CATEGORY,START,STOP,PURPOSE
count,1156,1155,1155,1155,1155,653
unique,1155,1154,2,177,188,10
top,6/28/2016 23:34,6/28/2016 23:59,Business,Cary,Cary,Meeting
freq,2,2,1078,201,203,187


In [6]:
# 2. missing/null value check
df.isnull().sum()

Unnamed: 0,0
START_DATE,0
END_DATE,1
CATEGORY,1
START,1
STOP,1
MILES,0
PURPOSE,503


In [7]:
df["PURPOSE"].unique()

array(['Meal/Entertain', nan, 'Errand/Supplies', 'Meeting',
       'Customer Visit', 'Temporary Site', 'Between Offices',
       'Charity ($)', 'Commute', 'Moving', 'Airport/Travel'], dtype=object)

In [8]:
df["PURPOSE"].value_counts()

Unnamed: 0_level_0,count
PURPOSE,Unnamed: 1_level_1
Meeting,187
Meal/Entertain,160
Errand/Supplies,128
Customer Visit,101
Temporary Site,50
Between Offices,18
Moving,4
Airport/Travel,3
Commute,1
Charity ($),1


In [9]:
df["PURPOSE"]

Unnamed: 0,PURPOSE
0,Meal/Entertain
1,
2,Errand/Supplies
3,Meeting
4,Customer Visit
...,...
1151,Temporary Site
1152,Meeting
1153,Temporary Site
1154,Temporary Site


In [10]:
df["PURPOSE"].mode()[0]

'Meeting'

In [11]:
# replace null values (preview only: fillna returns a new Series, df is unchanged until it is reassigned)
df["PURPOSE"].fillna("Meeting")

Unnamed: 0,PURPOSE
0,Meal/Entertain
1,Meeting
2,Errand/Supplies
3,Meeting
4,Customer Visit
...,...
1151,Temporary Site
1152,Meeting
1153,Temporary Site
1154,Temporary Site


In [12]:
df['xyz']= df["PURPOSE"].fillna(df["PURPOSE"].mode()[0])

In [13]:
df['PURPOSE']= df["PURPOSE"].fillna(df["PURPOSE"].mode()[0])
# replace the null values and reassign in same columns
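
Since 503 of the 1156 PURPOSE values are missing, filling them with the mode raises the "Meeting" count from 187 to roughly 690 and can distort any later purpose analysis. An alternative worth considering is an explicit placeholder label. The sketch below compares both options on a toy Series; the label `"Unknown"` is an assumption.

```python
import pandas as pd

# Toy PURPOSE column with half of the values missing (illustrative only).
purpose = pd.Series(["Meeting", None, "Errand/Supplies", "Meeting", None, None],
                    name="PURPOSE")

# Option A: fill with the most frequent value (what the notebook does).
mode_filled = purpose.fillna(purpose.mode()[0])

# Option B: keep missing values visible under their own label.
label_filled = purpose.fillna("Unknown")

print(mode_filled.value_counts(normalize=True))
print(label_filled.value_counts(normalize=True))
```

With option B the share of "Meeting" stays at its observed level, and the size of the gap remains visible in every chart.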

# 3. missing value delete/update

In [14]:
df.isnull().sum()

Unnamed: 0,0
START_DATE,0
END_DATE,1
CATEGORY,1
START,1
STOP,1
MILES,0
PURPOSE,0
xyz,0


In [15]:
# drop the column
df.drop(columns=['xyz'],inplace=True)

In [16]:
# df = df.drop(columns=['xyz'])

In [17]:
df.dropna(axis=0,inplace=True)

In [18]:
df.isnull().sum()

Unnamed: 0,0
START_DATE,0
END_DATE,0
CATEGORY,0
START,0
STOP,0
MILES,0
PURPOSE,0


In [19]:
# 4. duplicate value check
df.duplicated().sum()

np.int64(1)

In [20]:
len(df)

1155

In [21]:
# check the actual duplicate value
df.duplicated()

Unnamed: 0,0
0,False
1,False
2,False
3,False
4,False
...,...
1150,False
1151,False
1152,False
1153,False


In [22]:
df[df.duplicated()]

Unnamed: 0,START_DATE,END_DATE,CATEGORY,START,STOP,MILES,PURPOSE
492,6/28/2016 23:34,6/28/2016 23:59,Business,Durham,Cary,9.9,Meeting


In [23]:
df['START_DATE']== "6/28/2016 23:34"

Unnamed: 0,START_DATE
0,False
1,False
2,False
3,False
4,False
...,...
1150,False
1151,False
1152,False
1153,False


In [24]:
df['START_DATE']== "6/28/2016 23:34"
# boolean indexing

# show value
df[df['START_DATE']== "6/28/2016 23:34"]

Unnamed: 0,START_DATE,END_DATE,CATEGORY,START,STOP,MILES,PURPOSE
491,6/28/2016 23:34,6/28/2016 23:59,Business,Durham,Cary,9.9,Meeting
492,6/28/2016 23:34,6/28/2016 23:59,Business,Durham,Cary,9.9,Meeting


In [25]:
df.nunique()

Unnamed: 0,0
START_DATE,1154
END_DATE,1154
CATEGORY,2
START,177
STOP,188
MILES,256
PURPOSE,10


In [26]:
# delete the duplicate
df.drop_duplicates(inplace=True)

In [27]:
# 6. invalid data check and correct
df3 = pd.read_csv('/content/UberDataset.csv',encoding="utf-8",encoding_errors="replace")
df3.iloc[23]

Unnamed: 0,23
START_DATE,1/13/2016 13:54
END_DATE,1/13/2016 14:07
CATEGORY,Business
START,Downtown
STOP,Gulfton
MILES,11.2
PURPOSE,Meeting


In [28]:
# Error value resolution example
import pandas as pd
import numpy as np

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 'thirty', np.nan],
        'City': ['New York', 'London', 'Paris', 'New York']}
df2 = pd.DataFrame(data)

# 1. Check Data Types
print("Original dtypes:")
print(df2.dtypes)

# 2. Correct 'Age' column (handle non-numeric and missing values)
df2['Age'] = pd.to_numeric(df2['Age'], errors='coerce') # Convert 'thirty' to NaN

# Reassign the column; calling fillna(..., inplace=True) on df2['Age'] acts on a copy
df2['Age'] = df2['Age'].fillna(df2['Age'].mean()) # Fill NaN with mean

# 3. Check for consistent 'City' values (not applicable in this simple example, but shown for illustration)
# print(df2['City'].unique()) # Would show ['New York', 'London', 'Paris']

print("\nCorrected DataFrame:")
print(df2)
print("\nCorrected dtypes:")
print(df2.dtypes)

Original dtypes:
Name    object
Age     object
City    object
dtype: object

Corrected DataFrame:
      Name   Age      City
0    Alice  25.0  New York
1      Bob  30.0    London
2  Charlie  27.5     Paris
3    David  27.5  New York

Corrected dtypes:
Name     object
Age     float64
City     object
dtype: object


In [29]:
import pandas as pd
import numpy as np

# Create a sample DataFrame with mixed date formats
df5 = pd.DataFrame({
    'date_time': ['01-01-2016 21:11', '1/13/2016 13:54', '05-20-2016 08:30'],
    'value': [10, 20, 30]
})

print("Original DataFrame:")
print(df5)
print("\nData types before conversion:")
print(df5.info())

# Convert the 'date_time' column to datetime objects
# `format='mixed'` infers the format of each element individually.
# `dayfirst=True` reads ambiguous dates day-first: '01-01-2016' parses the same either way,
# but '01-02-2016' would become 1 February. '1/13/2016' and '05-20-2016' can only be month-first.
df5['date_time'] = pd.to_datetime(df5['date_time'], format='mixed', dayfirst=True)

print("\nDataFrame after conversion:")
print(df5)
print("\nData types after conversion:")
print(df5.info())


Original DataFrame:
          date_time  value
0  01-01-2016 21:11     10
1   1/13/2016 13:54     20
2  05-20-2016 08:30     30

Data types before conversion:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   date_time  3 non-null      object
 1   value      3 non-null      int64 
dtypes: int64(1), object(1)
memory usage: 180.0+ bytes
None

DataFrame after conversion:
            date_time  value
0 2016-01-01 21:11:00     10
1 2016-01-13 13:54:00     20
2 2016-05-20 08:30:00     30

Data types after conversion:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   date_time  3 non-null      datetime64[ns]
 1   value      3 non-null      int64         
dtypes: datetime64[ns](1), int64(1)
memory usage: 180.0 bytes
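
The three sample strings above parse the same under either convention, so they do not show what `dayfirst` changes. The sketch below uses an ambiguous string to make the difference visible, then shows one way to avoid guessing by trying explicit formats in turn. The two format strings are assumptions based on the date styles that appear in this file.

```python
import pandas as pd

# The same ambiguous string under both conventions.
as_day_first = pd.to_datetime("01-02-2016 01:25", dayfirst=True)     # 1 February 2016
as_month_first = pd.to_datetime("01-02-2016 01:25", dayfirst=False)  # 2 January 2016

# Explicit month-first parsing: try the dash style, then the slash style.
raw = pd.Series(["01-02-2016 01:25", "1/13/2016 13:54", "12/31/2016 1:07"])
dash = pd.to_datetime(raw, format="%m-%d-%Y %H:%M", errors="coerce")
slash = pd.to_datetime(raw, format="%m/%d/%Y %H:%M", errors="coerce")
parsed = dash.fillna(slash)  # rows that failed the first format use the second

print(as_day_first, as_month_first)
print(parsed)
```

Any row left as `NaT` after both attempts matches neither format and can be inspected by hand.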


In [30]:
df.dtypes

Unnamed: 0,0
START_DATE,object
END_DATE,object
CATEGORY,object
START,object
STOP,object
MILES,float64
PURPOSE,object


In [31]:
df['START_DATE']

Unnamed: 0,START_DATE
0,01-01-2016 21:11
1,01-02-2016 01:25
2,01-02-2016 20:25
3,01-05-2016 17:31
4,01-06-2016 14:42
...,...
1150,12/31/2016 1:07
1151,12/31/2016 13:24
1152,12/31/2016 15:03
1153,12/31/2016 21:32


In [32]:
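# 7. data types corrections
# Note: dayfirst=True reads an ambiguous string such as '01-02-2016' as 1 February 2016.
# Rows like '12/31/2016' are month-first, so if the whole file is month-first, dayfirst=False is the safer setting.
pd.to_datetime(df['START_DATE'],format="mixed",dayfirst=True)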

Unnamed: 0,START_DATE
0,2016-01-01 21:11:00
1,2016-02-01 01:25:00
2,2016-02-01 20:25:00
3,2016-05-01 17:31:00
4,2016-06-01 14:42:00
...,...
1150,2016-12-31 01:07:00
1151,2016-12-31 13:24:00
1152,2016-12-31 15:03:00
1153,2016-12-31 21:32:00


In [33]:
df['START_DATE'] = pd.to_datetime(df['START_DATE'],format="mixed",dayfirst=True)

df['END_DATE'] = pd.to_datetime(df['END_DATE'],format="mixed",dayfirst=True)

# ====
# EDA

In [34]:
df.dtypes
# confirm that the date columns are now datetime64

Unnamed: 0,0
START_DATE,datetime64[ns]
END_DATE,datetime64[ns]
CATEGORY,object
START,object
STOP,object
MILES,float64
PURPOSE,object


In [35]:
# 8. save the clean data
df.to_csv("uber_clean.csv",index=False)

In [48]:
import pandas as pd
# load the clean data
df = pd.read_csv("/content/uber_clean.csv")

df['START_DATE'] = pd.to_datetime(df['START_DATE'],format="mixed",dayfirst=True)

df['END_DATE'] = pd.to_datetime(df['END_DATE'],format="mixed",dayfirst=True)

df.head()

Unnamed: 0,START_DATE,END_DATE,CATEGORY,START,STOP,MILES,PURPOSE
0,2016-01-01 21:11:00,2016-01-01 21:17:00,Business,Fort Pierce,Fort Pierce,5.1,Meal/Entertain
1,2016-02-01 01:25:00,2016-02-01 01:37:00,Business,Fort Pierce,Fort Pierce,5.0,Meeting
2,2016-02-01 20:25:00,2016-02-01 20:38:00,Business,Fort Pierce,Fort Pierce,4.8,Errand/Supplies
3,2016-05-01 17:31:00,2016-05-01 17:45:00,Business,Fort Pierce,Fort Pierce,4.7,Meeting
4,2016-06-01 14:42:00,2016-06-01 15:49:00,Business,Fort Pierce,West Palm Beach,63.7,Customer Visit


In [None]:
# create new columns for the month name, the ISO week number, and the year


In [None]:
df["month"] = df['START_DATE'].dt.month_name()
df["weeknum"] =df['START_DATE'].dt.isocalendar().week
df["year"] = df['START_DATE'].dt.year
df.head()



### 🕒 1. Time-based Analysis

* What is the **total number of trips per day/week/month**?
* Which **days of the week** have the most Uber trips?
* What are the **peak hours** for Uber rides (morning, evening, late night)?
* How does the **ride frequency vary over time** (seasonal trends)?
* Are there any **particular months or dates** with unusual activity?


In [88]:
# month
df['month'].value_counts()

Unnamed: 0_level_0,count
month,Unnamed: 1_level_1
December,151
August,127
October,117
February,105
July,105
March,104
June,98
November,96
January,81
April,62
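
`value_counts()` sorts by frequency, which is useful for ranking but hides the seasonal shape. To read the trend over the year, the counts can be put back into calendar order. A minimal sketch on invented dates, with months that have no trips shown as zero:

```python
import calendar
import pandas as pd

# Toy trips (illustrative values only).
trips = pd.DataFrame({
    "START_DATE": pd.to_datetime(["2016-01-05", "2016-03-10",
                                  "2016-03-11", "2016-12-31"]),
})

# January..December, in calendar order.
month_order = list(calendar.month_name)[1:]

counts = (
    trips["START_DATE"].dt.month_name()
    .value_counts()
    .reindex(month_order, fill_value=0)  # calendar order, 0 for empty months
)
print(counts)
```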


In [57]:
df['month'].value_counts().mean()

np.float64(96.16666666666667)

In [None]:
# week
df['weeknum'].value_counts()

In [61]:
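# average trips per day-of-month number (1-31); this pools the same day number across all months, so it is not a per-date average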
df['START_DATE'].dt.day.value_counts().mean()

np.float64(37.225806451612904)
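
The figure above is the total trip count divided by the 31 possible day numbers. For trips per actual calendar date, the timestamps can be truncated to midnight before counting. The sketch below contrasts the two groupings on toy data.

```python
import pandas as pd

# Toy trips (illustrative values only).
trips = pd.DataFrame({
    "START_DATE": pd.to_datetime(["2016-01-01 09:00", "2016-01-01 18:00",
                                  "2016-02-01 10:00", "2016-02-03 12:00"]),
})

# Groups by day number: 1 January and 1 February land in the same bucket.
per_day_of_month = trips["START_DATE"].dt.day.value_counts()

# Groups by calendar date: one bucket per date that has at least one trip.
per_date = trips["START_DATE"].dt.normalize().value_counts()

print(per_day_of_month.mean())  # average per day number
print(per_date.mean())          # average per active calendar date
```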

In [62]:
len(df)

1154

In [None]:
# Which days of the week have the most Uber trips?
# 0 represents Monday, and the numbers increase to 6 for Sunday.
df['START_DATE'].dt.weekday.value_counts()


In [78]:
df['START_DATE'].dt.day_name()

Unnamed: 0,START_DATE
0,Friday
1,Monday
2,Monday
3,Sunday
4,Wednesday
...,...
1149,Saturday
1150,Saturday
1151,Saturday
1152,Saturday


In [76]:
df['START_DATE'].dt.weekday.value_counts().reset_index()

Unnamed: 0,START_DATE,count
0,4,185
1,6,173
2,0,172
3,1,172
4,3,163
5,5,156
6,2,133


In [87]:
df['START_DATE'].dt.day_name().value_counts()

Unnamed: 0_level_0,count
START_DATE,Unnamed: 1_level_1
Friday,185
Sunday,173
Monday,172
Tuesday,172
Thursday,163
Saturday,156
Wednesday,133


In [86]:
df['START_DATE'].dt.day_name().value_counts().reset_index().set_index('START_DATE').loc["Friday"]

Unnamed: 0,Friday
count,185


In [104]:
# show the months with fewer bookings than the monthly average
monthly_booking = df['month'].value_counts().reset_index()
monthly_booking["count"].mean()

np.float64(96.16666666666667)

In [97]:
# boolean mask: df[condition] keeps only the rows where the condition is True
monthly_booking["count"]< monthly_booking["count"].mean()

monthly_booking[monthly_booking["count"]< monthly_booking["count"].mean()]

Unnamed: 0,month,count
7,November,96
8,January,81
9,April,62
10,May,56
11,September,52


In [None]:
# Which pickup locations (START) are most common?
print(df['START'].value_counts())
# Which drop locations (STOP) are most frequent?
print(df['STOP'].value_counts())

## Insights
- Cary is the most common location for both the start point (START) and the stop point (STOP).
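
The remaining location questions (top origin-destination pairs, and trips within the same place versus between places) could be answered along these lines. The rows below are invented and only reuse place names that appear in the dataset.

```python
import pandas as pd

# Toy trips (illustrative values only).
trips = pd.DataFrame({
    "START": ["Cary", "Cary", "Durham", "Fort Pierce", "Cary"],
    "STOP": ["Cary", "Morrisville", "Cary", "West Palm Beach", "Morrisville"],
})

# Count each (START, STOP) combination and rank by frequency.
pairs = trips.groupby(["START", "STOP"]).size().sort_values(ascending=False)

# Share of trips that start and stop in the same place.
same_city_share = (trips["START"] == trips["STOP"]).mean()

print(pairs.head(10))
print(f"same-location share: {same_city_share:.0%}")
```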


In [None]:
# https://docs.google.com/document/d/19E_ETfW-HLniredgT18HVQV4mgBOT-lLZ7hmuxToRsE/edit?tab=t.0


## Conclusion