<a href="https://colab.research.google.com/github/tejalhinge23/Uber-Request-Data-analysis/blob/main/Sample_EDA_Submission_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -



Uber Request Data Analysis

# **Project Summary -**

The primary goal of this project was to perform a comprehensive analysis of Uber’s ride request data to identify operational bottlenecks, with a focus on detecting the demand–supply gap, understanding cancellation trends, and evaluating request patterns across different times of the day and pickup points. By leveraging exploratory data analysis (EDA) techniques, the project aimed to generate actionable insights to improve service availability and customer satisfaction.


# **GitHub Link -**

https://github.com/tejalhinge23


# **Problem Statement**


Uber, a leading ride-hailing service, aims to provide reliable and timely transportation to its users. However, frequent complaints from customers—especially regarding cancellations, long wait times, and "No Cars Available" messages—highlight a possible mismatch between rider demand and driver supply at specific times and locations.

The company seeks to understand the underlying causes of these service disruptions by analyzing historical ride request data. This analysis should reveal:

*   When and where demand exceeds supply
*   The factors leading to ride cancellations
*   Time slots with low driver availability
*   Patterns in pickup behavior at City vs Airport


#### **Define Your Business Objective?**


The core business objective of this project is to help Uber optimize its ride-hailing operations by identifying and addressing service inefficiencies that lead to poor user experiences—particularly caused by ride cancellations, cab unavailability, and mismatched supply and demand.

Specifically, the goal is to:

*   Analyze historical ride request data to uncover patterns in rider demand and driver availability
*   Pinpoint critical time slots and locations where supply-demand imbalances are most severe
*   Provide data-driven recommendations to:
    *   Minimize ride cancellations
    *   Increase driver participation during high-demand hours
    *   Improve customer satisfaction and retention
    *   Support future resource allocation and pricing strategy decisions

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





plt.figure(figsize=(6, 4))
sns.countplot(data=df, x="Status", order=df["Status"].value_counts().index)
plt.title("Request Status Count")
plt.show()

1) The countplot for the Status column was chosen because:

*   It summarizes the final outcome of all Uber ride requests in a clear and visual way.
*   It allows us to compare the number of trips that were Completed, Cancelled, or marked as No Cars Available.
*   Since Status is a categorical variable, a bar chart is the most appropriate visualization to show frequency distributions.

2) The chart shows that:

*   A substantial number of rides are "Trip Completed", which is expected.
*   A significant portion of rides are "Cancelled" by drivers or customers.
*   A large and concerning number of requests result in "No Cars Available".

This reveals that:

*   Service demand is not always met by supply, especially during certain times or at specific pickup points.
*   Uber is potentially losing a large number of customers due to unavailability and cancellations.

3) By identifying the magnitude of trip failures (cancelled and unserved rides), Uber can:

*   Improve driver allocation during peak times.
*   Introduce dynamic pricing or incentives to reduce cancellations.
*   Implement predictive scheduling models to reduce "No Cars Available" cases.

This leads to:

*   Higher ride completion rates
*   Better user experience
*   Increased revenue and brand trust

plt.figure(figsize=(6, 4))
sns.countplot(data=df, x="Pickup point", hue="Status")
plt.title("Pickup Point vs Status")
plt.show()

1) This chart visualizes how request outcomes (Status) differ between City and Airport pickup locations. It gives a comparative view of demand satisfaction by location. A grouped bar chart is ideal here because we are comparing two categorical variables (Pickup point × Status).

2) City pickup point:

*   Higher number of cancellations by drivers, especially in the early morning.
*   Moderate levels of completed trips and some "No Cars Available".

Airport pickup point:

*   High "No Cars Available" rate, especially during nighttime hours.
*   Slightly more successful completions than the City during the day.

3) Positive business impact:

*   Helps Uber allocate more drivers based on location-specific trends.
*   Promotes dynamic driver incentives:
    *   Offer morning bonuses in the City to reduce cancellations.
    *   Encourage night shifts near the Airport to increase availability.
*   Supports data-driven planning for surge pricing or predictive deployment.

df["Request Hour"] = df["Request timestamp"].dt.hour
plt.figure(figsize=(10, 4))
sns.countplot(x="Request Hour", data=df, order=sorted(df["Request Hour"].unique()))
plt.title("Requests by Hour of Day")
plt.show()

1) This chart provides a time-based view of rider demand across the day.

It helps identify peak demand hours and off-peak periods.

This kind of hourly distribution is essential for operational planning:

*   When to allocate more drivers
*   When surge pricing should apply
*   When downtime can be expected

Bar charts are ideal for this type of frequency comparison.

2) Insights from the chart:

*   Peak demand occurs between 5 AM and 9 AM (early morning) and again in the evening (5 PM to 9 PM).
*   Lowest activity occurs late at night (1 AM to 4 AM).
*   The morning peak likely corresponds to airport-bound business or work trips.
*   The evening peak aligns with the commute home or social travel.
*   These patterns align with typical urban ride-hailing behavior.

3) Positive business impact:

*   Optimize driver shift scheduling: Ensure more drivers are active during high-demand periods.
*   Boost revenue: Use surge pricing intelligently during peak hours.
*   Improve service reliability: Avoid "No Cars Available" messages by predicting driver need in advance.
*   Customer satisfaction improves when requests are met promptly during busy hours.

# ***Let's Begin !***

## ***1. Know Your Data***

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:

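# Upload the cleaned CSV from the local machine into the Colab session
from google.colab import files
uploaded = files.upload()

# Parse both timestamp columns as datetimes while loading
df = pd.read_csv("Uber_Request_Data_Cleaned.csv", parse_dates=["Request timestamp", "Drop timestamp"])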



### Dataset First View

In [None]:
display(df.head())

### Dataset Rows & Columns count

In [None]:
# Reuse the DataFrame loaded above; re-reading the CSV without parse_dates
# would turn the timestamp columns back into plain strings
print("Rows:", df.shape[0])
print("Columns:", df.shape[1])


### Dataset Information

In [None]:
df.info()


#### Duplicate Values

In [None]:
df.duplicated().sum()


#### Missing Values/Null Values

In [None]:
df.isnull().sum()


In [None]:
plt.figure(figsize=(10, 4))
sns.heatmap(df.isnull(), cbar=False, cmap="viridis")
plt.title("Missing Value Heatmap")
plt.show()

### What did you know about your dataset?

*   The dataset consists of 6,745 Uber ride requests collected over 5 days (from July 11 to July 15, 2016).
*   It contains 6 columns: Request id, Pickup point, Driver id, Status, Request timestamp, Drop timestamp.
*   Timestamps were properly parsed as datetime types.
*   Driver id is often missing for rides that were not completed.
*   Status is a categorical variable with 3 values: "Trip Completed", "Cancelled", and "No Cars Available".





## ***2. Understanding Your Variables***

In [None]:
df.columns


In [None]:
df.describe()

### Variables Description


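*   Request id: Unique identifier for each ride request.
*   Pickup point: Location from where the ride was requested. Values: 'City' or 'Airport'.
*   Driver id: ID of the driver assigned to the request. NaN if no driver was assigned.
*   Status: Final status of the ride request. Values: 'Trip Completed', 'Cancelled', or 'No Cars Available'.
*   Request timestamp: Date and time at which the ride was requested.
*   Drop timestamp: Date and time at which the trip ended. Missing for requests that were not completed.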

### Check Unique Values for each variable.

In [None]:
df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Step 1: Import Required Libraries
import pandas as pd
import numpy as np

# Step 2: Load Dataset
df = pd.read_csv("Uber_Request_Data_Cleaned.csv", parse_dates=["Request timestamp", "Drop timestamp"])

# Step 3: Rename Columns for Simplicity (Optional but recommended)
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

# Step 4: Check for Duplicates and Remove Them
print("Duplicate rows:", df.duplicated().sum())
df.drop_duplicates(inplace=True)

# Step 5: Check for Missing Values
print("Missing values:\n", df.isnull().sum())

# Step 6: Feature Engineering – Extract Hour & Day
df["request_hour"] = df["request_timestamp"].dt.hour
df["request_day"] = df["request_timestamp"].dt.date

# Step 7: Bucket the Request Hour into Named Time Slots (Optional)
# Bins are right-closed: 0-4 Late Night, 5-9 Morning, 10-16 Afternoon, 17-21 Evening, 22-23 Night
df["time_slot"] = pd.cut(df["request_hour"],
                         bins=[-1, 4, 9, 16, 21, 24],
                         labels=["Late Night", "Morning", "Afternoon", "Evening", "Night"])

# Step 8: Handle Missing Data (Optional)
# Drop rows with missing 'request_timestamp' (shouldn’t exist)
df.dropna(subset=["request_timestamp"], inplace=True)

# Keep missing driver_id and drop_timestamp for later analysis (don't fill yet)

# Step 9: Validate Data Types
print(df.dtypes)

# Step 10: Final Dataset Preview
print("Final shape:", df.shape)
df.head()
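
The time-slot boundaries in Step 7 are easy to misread, so the sketch below checks them on their own. It reuses the same bin edges and labels, applies them to every hour from 0 to 23, and prints the first and last hour that falls in each slot. The helper name `hour_to_slot` is hypothetical and only used for this illustration.

```python
import pandas as pd

# Same bin edges and labels as Step 7 of the wrangling code.
BINS = [-1, 4, 9, 16, 21, 24]
LABELS = ["Late Night", "Morning", "Afternoon", "Evening", "Night"]


def hour_to_slot(hours: pd.Series) -> pd.Series:
    """Map request hours (0-23) to named time slots using right-closed bins."""
    return pd.cut(hours, bins=BINS, labels=LABELS)


# One value per hour of the day, so every bin boundary is exercised.
hours = pd.Series(range(24), name="request_hour")
slots = hour_to_slot(hours)

# First and last hour that lands in each slot.
slot_ranges = hours.groupby(slots, observed=True).agg(["min", "max"])
print(slot_ranges)
```

Because `pd.cut` uses right-closed intervals by default, starting the first edge at -1 is what allows hour 0 to land in "Late Night" instead of becoming NaN.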


### What all manipulations have you done and insights you found?

1. Demand-Supply Gap
    *   High number of ride cancellations from the City during early morning hours.
    *   Many "No Cars Available" requests from the Airport during late night and early morning, indicating a driver shortage.

2. Peak Request Hours
    *   Most ride requests occurred between 5 AM – 9 AM and 5 PM – 9 PM, indicating peak traffic/demand periods.

3. Cancellation & Unavailability Patterns
    *   Cancellations peak during Morning (5–9 AM) and Evening (5–9 PM), likely due to traffic or ride refusal.
    *   Unavailability is highest during Night (10 PM – 3 AM), especially at the Airport.

4. Trip Completion Analysis
    *   The majority of completed trips occur during daytime hours (8 AM – 8 PM).
    *   Average trip duration ranges from 15–40 minutes, based on completed trips only.

5. Pickup Point Trends
    *   City has more ride requests overall.
    *   Airport requests are more likely to face unavailability, especially during Night.
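
One way to put numbers on the demand-supply gap described above is to count, for each time slot and pickup point, how many requests were made and how many ended in a completed trip. The sketch below assumes the snake_case column names produced by the wrangling step (`time_slot`, `pickup_point`, `status`). The function name `demand_supply_gap` is hypothetical, and the rows are toy values for illustration only, not figures from the Uber dataset.

```python
import pandas as pd


def demand_supply_gap(df: pd.DataFrame) -> pd.DataFrame:
    """Count requests, completed trips, and the unmet gap per slot and pickup point."""
    completed = df["status"].eq("Trip Completed")
    summary = (
        df.assign(completed=completed)
        .groupby(["time_slot", "pickup_point"], observed=True)
        .agg(requests=("status", "size"), completed=("completed", "sum"))
    )
    # Gap = requests that did not end in a completed trip.
    summary["gap"] = summary["requests"] - summary["completed"]
    summary["gap_share"] = summary["gap"] / summary["requests"]
    return summary.reset_index()


# Toy rows for illustration only; these are NOT values from the Uber dataset.
toy = pd.DataFrame({
    "time_slot": ["Morning", "Morning", "Morning", "Night", "Night"],
    "pickup_point": ["City", "City", "City", "Airport", "Airport"],
    "status": ["Cancelled", "Cancelled", "Trip Completed",
               "No Cars Available", "Trip Completed"],
})
print(demand_supply_gap(toy))
```

Running the same function on the wrangled `df` would show which slot and pickup point combinations carry the largest gap, which is the evidence the insights above rely on.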

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
plt.figure(figsize=(8, 4))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title("Missing Value Heatmap")
plt.show()

##### 1. Why did you pick the specific chart?

Visual Simplicity

*   A heatmap of missing values provides a quick visual scan of where data is incomplete.
*   You can instantly see which columns (e.g., driver_id, drop_timestamp) have missing values and how frequently.

Better Than Just .isnull().sum()

*   While df.isnull().sum() gives raw numbers, a heatmap shows the row-by-row distribution of missing data.
*   It helps detect patterns (e.g., entire rows or blocks of rows missing drop_timestamp when status ≠ Trip Completed).

##### 2. What is/are the insight(s) found from the chart?

1. Missing Values Are Not Random
    *   Missing values are concentrated in two specific columns: driver_id and drop_timestamp.

2. Driver ID Is Missing When No Trip Was Assigned
    *   Rows missing driver_id usually correspond to:
        *   "No Cars Available": no driver ever accepted the request.
        *   "Cancelled": cancelled by either the user or the driver before a match was made.

3. Drop Timestamp Is Missing for Incomplete Trips
    *   drop_timestamp is only present when the trip was successfully completed.
    *   It is missing for Cancelled rides and for the No Cars Available status.
    *   This is valid business behavior, not a data error.

4. No Missing Values in Request Data
    *   Columns like request_id, pickup_point, status, and request_timestamp are fully populated, which is good for analysis.

##### 3. Will the gained insights help creating a positive business impact?

1. Improved Trip Completion Monitoring
    *   Insight: drop_timestamp is missing only when the trip is not completed.
    *   Impact: Uber can track trip completion rates more accurately and investigate high drop-off zones/times.

2. Better Driver Allocation Strategies
    *   Insight: driver_id is missing when no driver was assigned (e.g., "No Cars Available").
    *   Impact: This helps identify supply-demand gaps by location and time, so Uber can:
        *   Send driver availability alerts during peak demand
        *   Offer surge incentives or shift scheduling at high-demand times

3. Clean and Trustworthy Data for Analysis
    *   Insight: Missing values are logically linked to ride status, not due to data entry errors.
    *   Impact: Ensures data integrity for analysts and builds trust in machine learning models using this dataset.
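
The link between missing values and ride status can be checked directly rather than read off the heatmap. The sketch below computes, for each status, the share of rows where `driver_id` and `drop_timestamp` are missing. The helper name `missing_by_status` is hypothetical, and the rows are toy values chosen only to show the shape of the output, so they say nothing about what the real dataset contains.

```python
import pandas as pd


def missing_by_status(df: pd.DataFrame,
                      columns=("driver_id", "drop_timestamp")) -> pd.DataFrame:
    """Share of missing values in each column, split by request status."""
    return df[list(columns)].isna().groupby(df["status"]).mean()


# Toy rows for illustration only; these are NOT values from the Uber dataset.
toy = pd.DataFrame({
    "status": ["Trip Completed", "Trip Completed", "Cancelled", "No Cars Available"],
    "driver_id": [1.0, 2.0, 3.0, None],
    "drop_timestamp": pd.to_datetime(
        ["2016-07-11 08:30", "2016-07-11 09:10", None, None]
    ),
})
print(missing_by_status(toy))
```

A value of 1.0 means the column is always missing for that status and 0.0 means it is never missing. On the real `df`, intermediate values would indicate that the pattern is not as clean as the heatmap suggests.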



#### Chart - 6

In [None]:
print("Pickup Point:\n", df['pickup_point'].value_counts())
print("\nStatus:\n", df['status'].value_counts())

plt.figure(figsize=(6, 4))
sns.countplot(data=df, x="status", order=df["status"].value_counts().index)
plt.title("Request Status Count")
plt.show()


##### 1. Why did you pick the specific chart?

Visualizes Distribution of Request Outcomes:

This countplot is ideal for showing how Uber ride requests ended:

*   Trip Completed
*   Cancelled
*   No Cars Available

Categorical Variable → Bar Chart Is Best:

*   The column status is categorical with limited distinct values.
*   A bar chart is the most appropriate tool to compare the frequency of categories.

Quickly Identifies Operational Issues:

By visualizing how many requests were cancelled or unserved, we can instantly spot potential problems in Uber's service, such as demand–supply gaps or driver unreliability.

Establishes Baseline for Further Analysis:

This is a foundational EDA chart. It sets the stage for deeper exploration, such as:

*   Time-based cancellations
*   Location-specific shortages
*   Unmet demand patterns

##### 2. What is/are the insight(s) found from the chart?

1. Trip Completed is the most common status
    *   This indicates that the Uber system is mostly functioning well: many requests are successfully fulfilled.

2. Cancelled rides are high (often second in count)
    *   A large number of trips are being cancelled, most likely by drivers, especially during peak hours or long distances (e.g., airport trips from the city).
    *   Suggests dissatisfaction, inconvenience, or poor driver allocation.

3. "No Cars Available" is alarmingly frequent
    *   Many requests are never matched with a driver.
    *   This usually occurs late at night or in low-supply zones like the airport.
    *   Indicates a demand-supply imbalance.




##### 3. Will the gained insights help creating a positive business impact?

Yes, the insights from the "Request Status Count" chart are critical for driving operational and strategic improvements at Uber.

Positive Business Impact:

| Insight | Positive Impact |
| --- | --- |
| High 'Trip Completed' count | Confirms service reliability in certain time slots/locations. Uber can study and replicate these patterns elsewhere. |
| High 'Cancelled' count | Identifies driver behavior issues (e.g., early morning cancellations). Uber can offer incentives or enforce penalties to reduce cancellations. |

#### Chart - 7

In [None]:
plt.figure(figsize=(6, 4))
sns.countplot(data=df, x="pickup_point", hue="status")
plt.title("Pickup Point vs Status")
plt.show()


##### 1. Why did you pick the specific chart?

This grouped bar chart is chosen to analyze how request statuses vary by pickup location — specifically between City and Airport. Here's why it’s effective:

Compares Two Key Categorical Variables
pickup_point: tells us where the request originated.

status: tells us what happened to the request (Completed, Cancelled, or No Cars Available).

The countplot with hue="status" shows how the request outcomes differ between Airport and City.

##### 2. What is/are the insight(s) found from the chart?

1. City has the highest number of ride requests
    *   More rides are requested from the City compared to the Airport.
    *   Indicates higher rider activity (demand) in city zones, likely due to work commutes, urban errands, etc.

2. Cancellations are more common in the City
    *   Many rides from the City are cancelled, often in early morning hours.
    *   This suggests drivers may refuse trips to the airport (possibly due to traffic, distance, or return-ride issues).
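
Because the City has more requests overall, raw counts can exaggerate how much worse its cancellations are. A row-normalised crosstab compares the two pickup points on equal footing. The sketch below assumes the snake_case column names from the wrangling step. The function name `status_share_by_pickup` is hypothetical and the rows are toy values for illustration only.

```python
import pandas as pd


def status_share_by_pickup(df: pd.DataFrame) -> pd.DataFrame:
    """Row-normalised crosstab: the statuses of each pickup point sum to 1."""
    return pd.crosstab(df["pickup_point"], df["status"], normalize="index")


# Toy rows for illustration only; these are NOT values from the Uber dataset.
toy = pd.DataFrame({
    "pickup_point": ["City", "City", "City", "City", "Airport", "Airport"],
    "status": ["Cancelled", "Cancelled", "Trip Completed", "Trip Completed",
               "No Cars Available", "Trip Completed"],
})
print(status_share_by_pickup(toy).round(2))
```

Passing `normalize="index"` divides each row by its own total, so a higher share in the "Cancelled" column for the City row would support the second insight independently of request volume.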




##### 3. Will the gained insights help creating a positive business impact?
The insights from the “Pickup Point vs Status” chart provide valuable, location-specific operational intelligence that Uber can directly use to improve service quality, resource allocation, and customer satisfaction.


#### Chart - 8

In [None]:
df["request_hour"] = df["request_timestamp"].dt.hour

plt.figure(figsize=(10, 4))
sns.countplot(x="request_hour", data=df, order=sorted(df["request_hour"].unique()))
plt.title("Requests by Hour of Day")
plt.xlabel("Hour of Day")
plt.ylabel("Number of Requests")
plt.show()


##### 1. Why did you pick the specific chart?

This countplot (bar chart) was selected to understand how rider demand varies throughout the day. Here's why it's a strategic choice:

 Visualizes Hourly Demand Patterns
request_hour shows what time of day people request rides.

Helps identify peak and off-peak hours.

Essential for resource planning and driver shift scheduling.



##### 2. What is/are the insight(s) found from the chart?

Demand peaks between 5 AM and 9 AM and again between 5 PM and 9 PM, while activity is lowest late at night (1 AM to 4 AM).

This time-based insight is critical for:

*   Driver shift planning
*   Surge pricing strategy
*   Demand forecasting models

It helps Uber align supply with demand, increasing trip completion and reducing customer wait times and cancellations.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Optimized Driver Scheduling
Insight: Ride demand is highest during 5–9 AM and 5–9 PM.

Action: Uber can ensure more drivers are online during these hours.

Impact: Higher ride fulfillment, reduced wait times, and fewer cancellations.
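
The size of the two peaks can be stated as a share of all requests instead of being read off the bar chart. The sketch below treats "5–9 AM" and "5–9 PM" as the inclusive hour ranges 5 to 9 and 17 to 21, which is an assumption about how the text defines the windows. The function name `peak_share` and the `WINDOWS` dictionary are hypothetical, and the hours are toy values for illustration only.

```python
import pandas as pd


def peak_share(hours: pd.Series, windows: dict) -> pd.Series:
    """Fraction of requests whose hour falls in each inclusive [start, end] window."""
    return pd.Series({
        name: hours.between(start, end).mean()
        for name, (start, end) in windows.items()
    })


# Assumed reading of the peak windows: hours 5..9 and 17..21, both ends inclusive.
WINDOWS = {"morning_peak": (5, 9), "evening_peak": (17, 21)}

# Toy request hours for illustration only; NOT values from the Uber dataset.
toy_hours = pd.Series([5, 6, 9, 12, 17, 18, 21, 23])
print(peak_share(toy_hours, WINDOWS))
```

On the real data the call would be `peak_share(df["request_hour"], WINDOWS)`. With five hours in each window, a share well above 5/24 (about 21%) would indicate a genuine peak.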


#### Chart - 9

In [None]:
# Step 1: Compute request-to-drop time in minutes (NaN where there is no drop timestamp)
df["trip_duration_min"] = (df["drop_timestamp"] - df["request_timestamp"]).dt.total_seconds() / 60

# Step 2: Filter only completed trips
completed_trips = df[df["status"] == "Trip Completed"]

# Step 3: Now you can plot
plt.figure(figsize=(8, 4))
sns.histplot(data=completed_trips, x="trip_duration_min", bins=40, kde=True)
plt.title("Trip Duration Distribution (in Minutes)")
plt.xlabel("Duration (min)")
plt.ylabel("Number of Trips")
plt.show()


##### 1. Why did you pick the specific chart?

This chart was selected to analyze the distribution of trip durations for completed Uber rides. It's a histogram with a KDE (smooth curve) overlay, which provides deep insight into how long most rides last.



##### 2. What is/are the insight(s) found from the chart?

1. Most trips last between 15 and 40 minutes
    *   The highest density (peak of the histogram) is in the 15–40 minute range, showing this is the typical trip duration for Uber users in this dataset.

2. There are short trips too (~5–10 mins)
    *   A noticeable number of short trips suggests riders are using Uber for local, short-distance travel (e.g., within neighborhoods or city zones).
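
Quantiles give a more precise version of the ranges quoted above than the histogram alone. The sketch below computes them for completed trips only, using the same request-to-drop definition as the chart code, so the figure includes any waiting time before pickup. The function name `duration_summary` is hypothetical and the rows are toy values for illustration only.

```python
import pandas as pd


def duration_summary(df: pd.DataFrame) -> pd.Series:
    """Quantiles of request-to-drop time in minutes, for completed trips only."""
    done = df[df["status"] == "Trip Completed"]
    minutes = (done["drop_timestamp"] - done["request_timestamp"]).dt.total_seconds() / 60
    return minutes.quantile([0.05, 0.25, 0.5, 0.75, 0.95])


# Toy rows for illustration only; these are NOT values from the Uber dataset.
toy = pd.DataFrame({
    "status": ["Trip Completed", "Trip Completed", "Trip Completed", "Cancelled"],
    "request_timestamp": pd.to_datetime(
        ["2016-07-11 08:00", "2016-07-11 09:00", "2016-07-11 10:00", "2016-07-11 11:00"]
    ),
    "drop_timestamp": pd.to_datetime(
        ["2016-07-11 08:20", "2016-07-11 09:30", "2016-07-11 10:40", None]
    ),
})
print(duration_summary(toy))
```

If the 5th and 95th percentiles on the real data fall close to 15 and 40 minutes, the first insight holds. A 5th percentile near 5–10 minutes would support the second.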



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from the trip duration distribution chart provide valuable information about ride patterns, helping Uber optimize its pricing, route efficiency, and driver operations.



## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

To achieve the business objective of improving ride availability, reducing cancellations, and optimizing operational efficiency, here are strategic and actionable suggestions:

1. Balance Supply and Demand with Time- and Location-Based Planning
    *   Use historical data (like this analysis) to identify:
        *   Peak hours of demand (e.g., 5–9 AM, 5–9 PM)
        *   High-failure zones (e.g., Airport at night)
    *   Allocate more drivers proactively during these times and places.
    *   Apply real-time heatmaps for dynamic driver deployment.

2. Introduce Targeted Driver Incentive Programs
    *   Offer extra payouts or bonuses during:
        *   Early mornings from the City (to reduce cancellations)
        *   Late nights at the Airport (to avoid "No Cars Available")
    *   This will encourage drivers to stay active in critical demand slots.

3. Implement Predictive Scheduling
    *   Use machine learning or simple trend analysis to:
        *   Forecast hourly demand
        *   Auto-suggest shift timings to drivers via app notifications
    *   This helps ensure enough active drivers are available before demand spikes.

4. Segment Rides by Trip Duration for Operational Focus
    *   Provide quick-service options for short trips (5–10 min)
    *   Ensure long-distance trip availability by rotating longer-trip drivers with breaks or incentives
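
As a starting point for the "simple trend analysis" mentioned under predictive scheduling, the sketch below builds a naive baseline: the average number of requests per hour of day across the observed days. It is not a forecasting model, and with only five days of data it should be treated as a rough guide. The function name `hourly_baseline` is hypothetical and the rows are toy values for illustration only.

```python
import pandas as pd


def hourly_baseline(df: pd.DataFrame) -> pd.Series:
    """Average requests per hour of day, averaged across the observed days."""
    ts = df["request_timestamp"]
    counts = df.groupby([ts.dt.date.rename("day"), ts.dt.hour.rename("hour")]).size()
    # Hours with no requests on a given day count as 0 before averaging.
    per_day = counts.unstack("hour", fill_value=0)
    return per_day.mean(axis=0)


# Toy rows for illustration only; these are NOT values from the Uber dataset.
toy = pd.DataFrame({
    "request_timestamp": pd.to_datetime([
        "2016-07-11 08:00", "2016-07-11 08:30", "2016-07-11 09:00",
        "2016-07-12 08:15",
    ])
})
print(hourly_baseline(toy))
```

Hours that never appear on any day are absent from the result instead of being reported as zero. Comparing this baseline with the number of completed trips per hour would indicate how many additional drivers each slot needs.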



# **Conclusion**

This project successfully analyzed Uber ride request data to uncover key operational challenges and provide data-driven solutions. Through exploratory data analysis (EDA), visualizations, and pattern recognition, we identified critical insights that align directly with Uber’s business objective of improving ride availability, driver utilization, and customer satisfaction.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***

You can upload the file by clicking the folder icon on the left sidebar, then clicking the upload icon. Once uploaded, you can use the following code to load the data:

In [None]:
from google.colab import files
uploaded = files.upload()
for filename in uploaded.keys():
  print(f'User uploaded file "{filename}" with length {len(uploaded[filename])} bytes')

After uploading, you can read the file into a pandas DataFrame:

In [None]:
import pandas as pd
import io

for filename in uploaded.keys():
  df = pd.read_csv(io.StringIO(uploaded[filename].decode('utf-8')), parse_dates=["Request timestamp", "Drop timestamp"])

display(df.head())