# **Project Name**    - Uber Supply-Demand Gap Analysis



##### **Project Type**    - EDA
##### **Project name**    - Uber Supply-Demand Gap Analysis
##### **Contribution**    - Individual
##### **Team Member 1 -** Shivam Gupta

# **Project Summary -**

In this project titled *"Uber Supply-Demand Gap Analysis"*, the goal is to perform an Exploratory Data Analysis (EDA) on a dataset containing Uber cab requests to identify the imbalance between supply and demand during different times of the day. Uber, like many ride-sharing platforms, often faces issues where the number of customer requests is higher than the number of available drivers, especially during peak hours. This can lead to cancellations or unavailability of rides, resulting in poor customer experience and lost business opportunities.

The dataset contains roughly 6,700 rows and includes fields such as Request ID, Pickup Point (City or Airport), Status (Trip Completed, Cancelled, No Cars Available), Driver ID, and timestamps for both request and drop times. The raw data needed preprocessing: using *Excel and Python*, missing values were handled, timestamps were converted to proper datetime formats, and extra spaces and inconsistencies in columns were fixed.

After cleaning, a thorough *Exploratory Data Analysis* was performed using *Pandas, Matplotlib, and Seaborn*. Key insights were derived by analyzing:

- *Trip Status Distribution*: More than 60% of the requests were not completed, with a high number of "Cancelled" and "No Cars Available" outcomes.
- *Hourly Trend of Requests*: Demand was highest during commute hours (7 AM–9 AM and 5 PM–9 PM).
- *Pickup Point-wise Status*: The city showed more cancellations, while the airport had a higher "No Cars Available" rate.
- *Supply vs Demand Comparison*: During peak hours, the gap between the number of ride requests and the number of completed rides was significant.
- *Problematic Hours*: Specific hours like 6 PM saw the highest number of failed requests.
- *Top Pickup Point Load*: The city had more requests overall, showing a higher demand area.

SQL queries were also used in *MySQL Workbench* to cross-check insights like:
- Status-wise count
- Hourly distribution of requests
- Pickup point vs status matrix
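
As a rough sketch, the same three cross-checks can also be expressed in pandas. The rows below are invented toy data, not the Uber dataset, and `Request Hour` is assumed to be an hour column derived from the request timestamp.

```python
import pandas as pd

# Toy rows for illustration only; these are NOT taken from the Uber dataset.
toy = pd.DataFrame({
    "Pickup point": ["City", "City", "Airport", "Airport", "City"],
    "Status": ["Cancelled", "Trip Completed", "No Cars Available",
               "Trip Completed", "Cancelled"],
    "Request Hour": [8, 8, 18, 18, 9],
})

# 1. Status-wise count (like GROUP BY Status with COUNT(*))
status_counts = toy["Status"].value_counts()

# 2. Hourly distribution of requests (like GROUP BY hour with COUNT(*))
hourly_counts = toy.groupby("Request Hour").size()

# 3. Pickup point vs status matrix (rows: pickup point, columns: status)
status_matrix = pd.crosstab(toy["Pickup point"], toy["Status"])

print(status_counts, hourly_counts, status_matrix, sep="\n\n")
```

If the pandas numbers and the SQL numbers disagree, the usual suspects are different cleaning steps applied before each tool saw the data.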

These insights can help Uber make strategic decisions such as improving driver allocation during peak hours, increasing incentives for drivers to avoid cancellations, and using demand forecasting to reduce the supply-demand gap.

This project was completed individually and covers the full pipeline of a basic data analytics project: from data collection, cleaning, and EDA to conclusion and visualization. The project also includes charts, graphs, and SQL queries that support the findings and recommendations.

Overall, this analysis can support Uber in enhancing service quality and customer satisfaction by using data to drive decisions.

# **GitHub Link -**

https://github.com/shivamgupta-2005/uber-supply-demand-gap

# **Problem Statement**


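Uber riders often see their requests cancelled or answered with "No Cars Available", especially during peak hours. Using the ride-request dataset, this project identifies when and where (City vs Airport) demand exceeds supply, and which kind of failure dominates in each case.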

#### **Define Your Business Objective?**

The main business objective of this project is to identify the *gap between supply and demand* for Uber rides and understand the factors causing *cancellations* and *unavailability of cabs*, especially during peak hours.

By analyzing the dataset, we aim to:
- Detect *high-demand time slots and locations* (like City vs Airport).
- Analyze *driver availability* versus *ride requests*.
- Provide insights on *why and when rides are getting cancelled or not assigned*.
- Help Uber make *data-driven decisions* to reduce service failure rate and improve customer satisfaction.

These findings can be used to optimize *driver allocation*, provide *incentives during peak hours*, and *reduce customer churn*, thereby improving Uber's business performance and operational efficiency.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Library
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# For cleaner visuals
sns.set(style="whitegrid")

### Dataset Loading

In [None]:
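# Load Dataset (Colab upload; elsewhere, place uber_data.csv in the working directory)
try:
    from google.colab import files
    uploaded = files.upload()
except ImportError:
    uploaded = None  # not running in Colab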

### Dataset First View

In [None]:
# Dataset First Look
df = pd.read_csv("uber_data.csv")

# Display first 5 rows
df.head()


### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("Shape:", df.shape)


# Describe numeric data
df.describe()

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values


plt.figure(figsize=(10,6))
sns.heatmap(df.isnull(), cbar=False, yticklabels=False, cmap='viridis')
plt.title("Missing Values Heatmap")
plt.show()

### What did you know about your dataset?
- There are 6745 entries and 6 columns.
- Some values in Drop timestamp are missing for Cancelled/No Cars Available rides.
- No duplicate rows found.
- Timestamp columns are stored as strings and will be converted to datetime.
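
One way to confirm that the missing values are tied to ride status is to count nulls per status. This is a minimal sketch on invented toy rows, using the column names from the Variables Description; the real counts come from running the same grouping on `df`.

```python
import pandas as pd

# Toy rows for illustration only (not real data).
toy = pd.DataFrame({
    "Status": ["Trip Completed", "Cancelled", "No Cars Available"],
    "Driver id": [12.0, 7.0, None],
    "Drop timestamp": ["2016-07-11 09:10:00", None, None],
})

# Flag missing cells, then count them within each status group.
missing_by_status = (
    toy[["Driver id", "Drop timestamp"]]
    .isnull()
    .groupby(toy["Status"])
    .sum()
)
print(missing_by_status)
```

If the nulls are logical, completed trips should show zero missing values in both columns.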



## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe(include='all')

### Variables Description

- Request id: Unique ID for each ride request.
- Pickup point: Location where the ride was requested (Airport/City).
- Driver id: ID of driver (if assigned).
- Status: Final ride status (Trip Completed, Cancelled, No Cars Available).
- Request timestamp: Time when request was made.
- Drop timestamp: Time when ride ended (if completed).

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for col in df.columns:
  print(f"{col} -> {df[col].nunique()} unique values")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
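# Write your code to make your dataset analysis ready.
# Column names follow the Variables Description above; adjust them if your CSV headers differ.
# Remove exact duplicate rows (if any)
df = df.drop_duplicates()

# Convert timestamp columns to datetime
df['Request timestamp'] = pd.to_datetime(df['Request timestamp'])
df['Drop timestamp'] = pd.to_datetime(df['Drop timestamp'])

# Extract hour and day name from the request timestamp for analysis
df['Request Hour'] = df['Request timestamp'].dt.hour
df['Request Day'] = df['Request timestamp'].dt.day_name()

# Optionally remove rows with a null driver ID (kept by default,
# because "No Cars Available" requests never have a driver)
# df = df[df['Driver id'].notnull()]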

### What all manipulations have you done and insights you found?

- Converted Request timestamp and Drop timestamp into datetime format.
- Created new columns: Request Hour and Request Day for better time-based analysis.
- Missing driver IDs were kept as-is since "No Cars Available" cases don't have drivers.
- Cleaned the data to remove duplicate rows (if any).
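
The timestamp steps above can be sketched on a couple of invented rows. Passing `errors="coerce"` is an assumption of this sketch rather than something the notebook does: it turns unparseable strings into `NaT` instead of raising an error.

```python
import pandas as pd

# Toy rows for illustration only (not real data).
toy = pd.DataFrame({
    "Request timestamp": ["2016-07-11 08:15:00", "2016-07-12 18:40:00"],
    "Drop timestamp": ["2016-07-11 09:05:00", None],
})

for col in ["Request timestamp", "Drop timestamp"]:
    # Missing or unparseable values become NaT rather than raising.
    toy[col] = pd.to_datetime(toy[col], errors="coerce")

# Derive the time features used in the charts.
toy["Request Hour"] = toy["Request timestamp"].dt.hour
toy["Request Day"] = toy["Request timestamp"].dt.day_name()
print(toy)
```

After a coerced conversion it is worth re-running `isnull().sum()` to check that no extra timestamps were silently lost.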

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
sns.countplot(data=df, x='Status', palette='Set2')
plt.title('Overall Request Status Count')
plt.xlabel('Ride Status')
plt.ylabel('Number of Requests')
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart (count plot) makes it easy to compare total request counts across the status categories (Trip Completed, Cancelled, No Cars Available).

##### 2. What is/are the insight(s) found from the chart?

Most requests end as either "No Cars Available" or "Cancelled", and only a minority are completed, which points to a supply problem.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. The chart shows a major supply gap, and every failed request is lost revenue. This insight helps Uber plan for more driver availability.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
sns.countplot(data=df, x='Pickup point', hue='Status', palette='Set1')
plt.title('Pickup Point vs Ride Status')
plt.xlabel('Pickup Location')
plt.ylabel('Number of Requests')
plt.show()


##### 1. Why did you pick the specific chart?

Shows how location (Airport/City) affects ride status.

##### 2. What is/are the insight(s) found from the chart?

Airport has high “No Cars Available”; City has high “Cancelled”.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

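Yes. The two locations fail in different ways, so they need different fixes: more drivers at the Airport, where "No Cars Available" dominates, and measures to reduce cancellations in the City. Left unaddressed, both patterns mean lost rides and lost revenue.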
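
To compare the two pickup points on an equal footing, the counts can be turned into row percentages. The sketch below uses invented toy rows, so the percentages are illustrative only.

```python
import pandas as pd

# Toy rows for illustration only (not real data).
toy = pd.DataFrame({
    "Pickup point": ["City", "City", "City", "City",
                     "Airport", "Airport", "Airport", "Airport"],
    "Status": ["Cancelled", "Cancelled", "Trip Completed", "No Cars Available",
               "No Cars Available", "No Cars Available", "Trip Completed", "Cancelled"],
})

# normalize="index" divides each row by its total, so every row sums to 100%.
share = pd.crosstab(toy["Pickup point"], toy["Status"], normalize="index") * 100

# Most common outcome per pickup point.
dominant = share.idxmax(axis=1)
print(share.round(1))
print(dominant)
```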

#### Chart - 3

In [None]:
# Chart - 3 visualization code
sns.countplot(data=df, x='Request Hour', palette='Set3')
plt.title('Number of Requests by Hour of Day')
plt.xlabel('Hour of Day')
plt.ylabel('Total Requests')
plt.show()


##### 1. Why did you pick the specific chart?

A count plot (bar chart) by hour shows the time-based demand pattern.

##### 2. What is/are the insight(s) found from the chart?

Heavy spike during office hours (5 AM–9 AM, 5 PM–9 PM).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps Uber plan driver availability during peak hours.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(12,6))
sns.countplot(data=df, x='Request Hour', hue='Status', palette='muted')
plt.title('Ride Status by Hour')
plt.xlabel('Hour')
plt.ylabel('Count')
plt.legend(title='Status')
plt.show()


##### 1. Why did you pick the specific chart?

Chosen to analyze status variation by time.

##### 2. What is/are the insight(s) found from the chart?

Peak-hour rides are mostly cancelled or not fulfilled.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Failures are concentrated in peak hours, so fixing peak-hour supply would improve fulfillment the most.
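
The hourly gap can also be put into numbers. In this sketch, demand is taken to be all requests in an hour and supply to be completed trips; the column names `demand`, `supply`, `gap`, and `failure_rate` are illustrative choices, and the rows are invented toy data.

```python
import pandas as pd

# Toy rows for illustration only (not real data).
toy = pd.DataFrame({
    "Request Hour": [8, 8, 8, 18, 18, 13],
    "Status": ["Cancelled", "Cancelled", "Trip Completed",
               "No Cars Available", "Trip Completed", "Trip Completed"],
})

hourly = toy.groupby("Request Hour").agg(
    demand=("Status", "size"),                                   # all requests
    supply=("Status", lambda s: (s == "Trip Completed").sum()),  # completed trips
)
hourly["gap"] = hourly["demand"] - hourly["supply"]
hourly["failure_rate"] = hourly["gap"] / hourly["demand"]

# Hour with the largest absolute gap.
worst_hour = hourly["gap"].idxmax()
print(hourly)
print("Worst hour:", worst_hour)
```

Sorting `hourly` by `gap` or `failure_rate` gives a ranked list of the hours that need attention first.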

#### Chart - 5

In [None]:
# Chart - 5 visualization code
sns.countplot(data=df, x='Request Day', hue='Status', palette='coolwarm')
plt.title('Request Day vs Ride Status')
plt.xlabel('Day')
plt.ylabel('Number of Requests')
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?

Checks whether weekdays or weekends have more failures.

##### 2. What is/are the insight(s) found from the chart?

day and Monday show very high failure.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Uber can push offers/driver boosts on those days to balance load.
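
By default the day names appear in the order they occur in the data, which can make day-wise comparisons harder to read. A small sketch of putting them in calendar order, on invented toy rows:

```python
import pandas as pd

DAY_ORDER = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Toy rows for illustration only (not real data).
toy = pd.DataFrame({"Request Day": ["Wednesday", "Monday", "Friday", "Monday"]})

# Reindex the counts into calendar order; days with no requests get 0.
day_counts = toy["Request Day"].value_counts().reindex(DAY_ORDER, fill_value=0)
print(day_counts)
```

The same list can be passed to the plot via the `order` parameter of `sns.countplot`, for example `order=DAY_ORDER`.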



#### Chart - 6

In [None]:
# Chart - 6 visualization code
df['Status'].value_counts().plot(kind='pie', autopct='%1.1f%%', colors=['lightgreen', 'red', 'gray'])
plt.title('Ride Status Distribution')
plt.ylabel('')
plt.show()


##### 1. Why did you pick the specific chart?

To quickly show the overall percentage share of each ride status.

##### 2. What is/are the insight(s) found from the chart?

More than 60% of rides are not completed, which is a serious issue.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Low service fulfillment hurts growth and needs urgent resolution.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
df['Pickup point'].value_counts().plot(kind='pie', autopct='%1.1f%%', colors=['skyblue', 'orange'])
plt.title('Pickup Point Distribution')
plt.ylabel('')
plt.show()


##### 1. Why did you pick the specific chart?


To understand where most bookings are being made from.

##### 2. What is/are the insight(s) found from the chart?

Requests are split almost equally between City and Airport.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Focus resource allocation equally on both pickup zones.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
heatmap_data = df.groupby(['Request Hour', 'Pickup point'])['Status'].value_counts().unstack().fillna(0)
sns.heatmap(heatmap_data, cmap='YlGnBu')
plt.title('Status Heatmap by Hour and Pickup Point')
plt.show()


##### 1. Why did you pick the specific chart?

To view status trend with time and pickup point together.

##### 2. What is/are the insight(s) found from the chart?

Airport faces shortage in evening, city faces cancellations in morning.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Uber can target each failure mode by time and place: more cars at the Airport in the evening, and fewer cancellations in the City in the morning.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
sns.heatmap(df.isnull(), cbar=False, yticklabels=False, cmap='viridis')
plt.title("Missing Values Heatmap")
plt.show()


##### 1. Why did you pick the specific chart?

To visualize which columns have missing data.

##### 2. What is/are the insight(s) found from the chart?

Drop timestamp is missing where rides were not completed.

Driver ID is missing when no driver was assigned.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Data quality looks good; missing values are logical.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
df['Request timestamp'].dt.date.value_counts().sort_index().plot(kind='line')
plt.title('Requests per Day')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?

To monitor if demand is increasing/decreasing over days.

##### 2. What is/are the insight(s) found from the chart?

Ride demand is consistent; few spikes on weekends.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps in daily operations planning and prediction.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
sns.histplot(df[df['Pickup point'] == 'Airport']['Request Hour'], bins=24, color='orange')
plt.title('Hourly Requests - Airport')
plt.show()


##### 1. Why did you pick the specific chart?


To zoom into Airport behavior over the day.

##### 2. What is/are the insight(s) found from the chart?

Airport demand is very high in evenings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Assign more drivers to the Airport after 5 PM.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
sns.histplot(df[df['Pickup point'] == 'City']['Request Hour'], bins=24, color='blue')
plt.title('Hourly Requests - City')
plt.show()


##### 1. Why did you pick the specific chart?

To zoom into City ride requests.

##### 2. What is/are the insight(s) found from the chart?

City rides peak early morning (7–9 AM).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Boost driver availability in city zone during office hours.

#### Chart - 13

In [None]:
# Chart - 13 visualization code
sns.boxplot(x='Status', y='Request Hour', data=df)
plt.title('Request Hour vs Status Distribution')
plt.show()


##### 1. Why did you pick the specific chart?

To see how request hour varies across different statuses.

##### 2. What is/are the insight(s) found from the chart?

Cancelled requests cluster in the early morning, while "No Cars Available" requests cluster in the evening.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Failure types are time-sensitive, so the fixes should be time-specific.
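
One way to work with time-specific patterns is to group request hours into named slots and tabulate status per slot. The slot boundaries and names below are assumptions made for this sketch, not values from the analysis, and the rows are invented toy data.

```python
import pandas as pd

# Toy rows for illustration only (not real data).
toy = pd.DataFrame({
    "Request Hour": [2, 6, 8, 13, 18, 20, 23],
    "Status": ["No Cars Available", "Cancelled", "Cancelled", "Trip Completed",
               "No Cars Available", "No Cars Available", "Trip Completed"],
})

# Slot boundaries and names are assumptions chosen for this sketch.
bins = [0, 5, 10, 17, 22, 24]
labels = ["Late Night", "Morning", "Daytime", "Evening", "Night"]

# right=False makes each slot include its start hour and exclude its end hour.
toy["Time Slot"] = pd.cut(toy["Request Hour"], bins=bins, labels=labels, right=False)

# Status counts per slot.
slot_status = pd.crosstab(toy["Time Slot"], toy["Status"])
print(slot_status)
```

Moving the bin edges by an hour or two is a quick way to check whether a conclusion depends on where the slots are cut.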

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='YlGnBu')
plt.title('Correlation Heatmap')
plt.show()


##### 1. Why did you pick the specific chart?

To check numerical relationships in data (e.g., hour, driver id).

##### 2. What is/are the insight(s) found from the chart?

Mostly low correlation since many features are categorical, but time-related columns might show useful trends.
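
Because `Status` is categorical, it does not appear in the correlation matrix at all. A possible workaround, sketched here on invented toy rows, is to one-hot encode it first so that each status becomes a 0/1 column that can be correlated with the request hour.

```python
import pandas as pd

# Toy rows for illustration only (not real data).
toy = pd.DataFrame({
    "Request Hour": [6, 7, 8, 18, 19, 20],
    "Status": ["Cancelled", "Cancelled", "Trip Completed",
               "No Cars Available", "No Cars Available", "Trip Completed"],
})

# One 0/1 indicator column per status value.
dummies = pd.get_dummies(toy["Status"], dtype=int)

# Correlate the hour with each status indicator.
corr = pd.concat([toy["Request Hour"], dummies], axis=1).corr()
print(corr["Request Hour"].round(2))
```

Hour of day is cyclical, so a linear correlation is only a rough indicator; the hourly count plots remain the more reliable view.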



#### Chart - 15 - Pair Plot

In [None]:
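# Pair Plot visualization code
g = sns.pairplot(df, hue='Status')
# plt.title would only label the last subplot, so set a figure-level title instead
g.figure.suptitle('Pairplot of Variables', y=1.02)
plt.show()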



##### 1. Why did you pick the specific chart?


To get a combined visual for all numeric variable interactions by Status.

##### 2. What is/are the insight(s) found from the chart?

Some spread visible in request hours vs status, rest looks uniformly distributed.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

To address the supply-demand gap and improve ride completion rate, the following actions are recommended:

1. *Increase driver availability* during peak hours (morning and evening).
2. *Deploy more drivers to Airport* during evening hours where "No Cars Available" is high.
3. Use *predictive analytics* to forecast demand zones in advance.
4. Provide *driver incentives* during high-demand periods to reduce cancellations.
5. Improve customer-driver matching logic to reduce wait time and frustration.

Implementing these solutions will directly improve customer experience and ride success rate.

# **Conclusion**

This EDA project successfully identified key issues in Uber’s ride request data.  
The analysis revealed high cancellations in the morning and unavailability in the evening — especially from the Airport.  
By understanding time-based demand patterns and pickup point behaviors, data-driven strategies can be implemented to improve service efficiency, reduce failures, and enhance customer satisfaction.

Thus, data can play a crucial role in optimizing operations and closing the supply-demand gap.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***