# Predicting Uber Ride Cancellations using Exploratory Data Analysis and Machine Learning

Source Dataset: https://www.kaggle.com/datasets/yashdevladdha/uber-ride-analytics-dashboard/data

**Domain**: Urban Mobility & Ride-Hailing Analytics

**Objective**: Build a predictive model to determine whether a customer will cancel a ride before it begins, using only the booking metadata available at the time of booking. The goal is to help the platform proactively identify high-risk cancellations and optimize driver dispatch efficiency.
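
To make the "booking-time only" constraint concrete, the sketch below derives a binary target from `Booking Status` and keeps only columns that would plausibly be known when the booking is placed. The toy rows and the choice of `BOOKING_TIME_COLUMNS` are assumptions for illustration, not the final feature set.

```python
import pandas as pd

# Toy rows that mimic a few columns of the dataset (values are illustrative).
toy = pd.DataFrame({
    "Booking Status": ["Completed", "Cancelled by Customer", "No Driver Found"],
    "Vehicle Type": ["Auto", "Go Sedan", "eBike"],
    "Pickup Location": ["Khandsa", "Shastri Nagar", "Palam Vihar"],
    "Drop Location": ["Malviya Nagar", "Gurgaon Sector 56", "Jhilmil"],
    "Driver Ratings": [4.9, None, None],  # only known after a ride has taken place
})

# Assumption: these columns are available at booking time.
BOOKING_TIME_COLUMNS = ["Vehicle Type", "Pickup Location", "Drop Location"]

# Target: 1 if the customer cancelled the booking, 0 otherwise.
y = (toy["Booking Status"] == "Cancelled by Customer").astype(int)
X = toy[BOOKING_TIME_COLUMNS]

print(y.tolist())
print(X.columns.tolist())
```

Post-ride columns such as ratings are left out of `X` because they would not exist at prediction time.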

## Data Extraction

Import Kaggle Dataset

In [2]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("yashdevladdha/uber-ride-analytics-dashboard")

print("Path to dataset files:", path)

Path to dataset files: C:\Users\Tirthankar Raha\.cache\kagglehub\datasets\yashdevladdha\uber-ride-analytics-dashboard\versions\2


In [None]:
# Load the dataset from the download directory returned by kagglehub
import os
import pandas as pd

df = pd.read_csv(os.path.join(path, "ncr_ride_bookings.csv"))
df.head()

Unnamed: 0,Date,Time,Booking ID,Booking Status,Customer ID,Vehicle Type,Pickup Location,Drop Location,Avg VTAT,Avg CTAT,Cancelled Rides by Customer,Reason for cancelling by Customer,Cancelled Rides by Driver,Driver Cancellation Reason,Incomplete Rides,Incomplete Rides Reason,Booking Value,Ride Distance,Driver Ratings,Customer Rating,Payment Method
0,2024-03-23,12:29:38,"""CNR5884300""",No Driver Found,"""CID1982111""",eBike,Palam Vihar,Jhilmil,,,,,,,,,,,,,
1,2024-11-29,18:01:39,"""CNR1326809""",Incomplete,"""CID4604802""",Go Sedan,Shastri Nagar,Gurgaon Sector 56,4.9,14.0,,,,,1.0,Vehicle Breakdown,237.0,5.73,,,UPI
2,2024-08-23,08:56:10,"""CNR8494506""",Completed,"""CID9202816""",Auto,Khandsa,Malviya Nagar,13.4,25.8,,,,,,,627.0,13.58,4.9,4.9,Debit Card
3,2024-10-21,17:17:25,"""CNR8906825""",Completed,"""CID2610914""",Premier Sedan,Central Secretariat,Inderlok,13.1,28.5,,,,,,,416.0,34.02,4.6,5.0,UPI
4,2024-09-16,22:08:00,"""CNR1950162""",Completed,"""CID9933542""",Bike,Ghitorni Village,Khan Market,5.3,19.6,,,,,,,737.0,48.21,4.1,4.3,UPI


## Initiation

Import Libraries

In [10]:
# Basic imports for data analysis
import pandas as pd
import numpy as np

# Importing libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Importing libraries for data modeling
import sklearn.metrics as metrics 
from sklearn.metrics import accuracy_score, recall_score, confusion_matrix, classification_report, roc_auc_score, RocCurveDisplay
from sklearn.model_selection import train_test_split, GridSearchCV, PredefinedSplit
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from xgboost import plot_importance

# Set pandas display options for full visibility
pd.set_option('display.max_columns', None)

# Suppress all warnings (not only FutureWarning) to keep the output readable
import warnings
warnings.filterwarnings('ignore')

Review Data

In [12]:
# Variable overview
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 21 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   Date                               150000 non-null  object 
 1   Time                               150000 non-null  object 
 2   Booking ID                         150000 non-null  object 
 3   Booking Status                     150000 non-null  object 
 4   Customer ID                        150000 non-null  object 
 5   Vehicle Type                       150000 non-null  object 
 6   Pickup Location                    150000 non-null  object 
 7   Drop Location                      150000 non-null  object 
 8   Avg VTAT                           139500 non-null  float64
 9   Avg CTAT                           102000 non-null  float64
 10  Cancelled Rides by Customer        10500 non-null   float64
 11  Reason for cancelling by Customer  1050

- Refer to the column definitions below:


| Column Name                     | Description                                                                  |
|---------------------------------|------------------------------------------------------------------------------|
| Date                            | Date of the booking                                                          |
| Time                            | Time of the booking                                                          |
| Booking ID                      | Unique identifier for each ride booking                                      |
| Booking Status                  | Status of booking (Completed, Incomplete, Cancelled by Customer, Cancelled by Driver, No Driver Found) |
| Customer ID                     | Unique identifier for customers                                              |
| Vehicle Type                    | Type of vehicle (Go Mini, Go Sedan, Auto, eBike, Bike, Uber XL, Premier Sedan) |
| Pickup Location                 | Starting location of the ride                                                |
| Drop Location                   | Destination location of the ride                                             |
| Avg VTAT                        | Average time for driver to reach pickup location (in minutes)                |
| Avg CTAT                        | Average trip duration from pickup to destination (in minutes)                |
| Cancelled Rides by Customer     | Customer-initiated cancellation flag                                         |
| Reason for cancelling by Customer | Reason for customer cancellation                                           |
| Cancelled Rides by Driver       | Driver-initiated cancellation flag                                           |
| Driver Cancellation Reason      | Reason for driver cancellation                                               |
| Incomplete Rides                | Incomplete ride flag                                                         |
| Incomplete Rides Reason         | Reason for incomplete rides                                                  |
| Booking Value                   | Total fare amount for the ride                                               |
| Ride Distance                   | Distance covered during the ride (in km)                                     |
| Driver Ratings                  | Rating given to driver (1-5 scale)                                           |
| Customer Rating                 | Rating given by customer (1-5 scale)                                         |
| Payment Method                  | Method used for payment (UPI, Cash, Credit Card, Uber Wallet, Debit Card)    |


**Observations**:

- Several columns have missing values.
- Column names are inconvenient to reference in code (mixed case and embedded spaces).
- 'Date' and 'Time' are stored as `object` (string) columns and need to be converted to a datetime type.
- Values in 'Booking ID' and 'Customer ID' are wrapped in literal double quotes, which need to be stripped.

In [15]:
# Summary statistics
df.describe(include='all')

Unnamed: 0,Date,Time,Booking ID,Booking Status,Customer ID,Vehicle Type,Pickup Location,Drop Location,Avg VTAT,Avg CTAT,Cancelled Rides by Customer,Reason for cancelling by Customer,Cancelled Rides by Driver,Driver Cancellation Reason,Incomplete Rides,Incomplete Rides Reason,Booking Value,Ride Distance,Driver Ratings,Customer Rating,Payment Method
count,150000,150000,150000,150000,150000,150000,150000,150000,139500.0,102000.0,10500.0,10500,27000.0,27000,9000.0,9000,102000.0,102000.0,93000.0,93000.0,102000
unique,365,62910,148767,5,148788,7,176,176,,,,5,,4,,3,,,,,5
top,2024-11-16,17:44:57,"""CNR3648267""",Completed,"""CID6715450""",Auto,Khandsa,Ashram,,,,Wrong Address,,Customer related issue,,Customer Demand,,,,,UPI
freq,462,16,3,93000,3,37419,949,936,,,,2362,,6837,,3040,,,,,45909
mean,,,,,,,,,8.456352,29.149636,1.0,,1.0,,1.0,,508.295912,24.637012,4.230992,4.404584,
std,,,,,,,,,3.773564,8.902577,0.0,,0.0,,0.0,,395.805774,14.002138,0.436871,0.437819,
min,,,,,,,,,2.0,10.0,1.0,,1.0,,1.0,,50.0,1.0,3.0,3.0,
25%,,,,,,,,,5.3,21.6,1.0,,1.0,,1.0,,234.0,12.46,4.1,4.2,
50%,,,,,,,,,8.3,28.8,1.0,,1.0,,1.0,,414.0,23.72,4.3,4.5,
75%,,,,,,,,,11.3,36.8,1.0,,1.0,,1.0,,689.0,36.82,4.6,4.8,


**Observations**:

- The numeric columns show plausible ranges overall, but the maximum booking value of 4277 is far above the 75th percentile (689) and is likely an outlier.

Based on these observations, the initial plan is:

- Data Pre-processing:
    - Modify columns/values as needed
    - Check for and handle duplicates
    - Check for and handle nulls
- Exploratory Data Analysis
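
Returning to the booking-value observation: one common way to judge whether the maximum is an outlier is the 1.5 × IQR rule. The sketch below applies it to the quartiles reported by `df.describe()` and the maximum quoted above. The rule is an assumption here, not a method this notebook has committed to.

```python
# Quartiles of 'Booking Value' from the summary statistics,
# and the maximum quoted in the observation.
q1, q3, max_value = 234.0, 689.0, 4277.0

iqr = q3 - q1                  # interquartile range
upper_fence = q3 + 1.5 * iqr   # values above this are flagged by the rule

print(f"IQR = {iqr}, upper fence = {upper_fence}")
print("max is flagged:", max_value > upper_fence)
```

With these numbers the fence is 689 + 1.5 × 455 = 1371.5, so the maximum is flagged. Whether such rides should be removed or kept is a separate decision.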

## Data Pre-processing

Clean and standardize column names

In [16]:
# Replace spaces with underscores for easier referencing in code
# Convert all letters to lowercase for consistency
df.columns = df.columns.str.replace(' ', '_').str.lower()
df.head()

Unnamed: 0,date,time,booking_id,booking_status,customer_id,vehicle_type,pickup_location,drop_location,avg_vtat,avg_ctat,cancelled_rides_by_customer,reason_for_cancelling_by_customer,cancelled_rides_by_driver,driver_cancellation_reason,incomplete_rides,incomplete_rides_reason,booking_value,ride_distance,driver_ratings,customer_rating,payment_method
0,2024-03-23,12:29:38,"""CNR5884300""",No Driver Found,"""CID1982111""",eBike,Palam Vihar,Jhilmil,,,,,,,,,,,,,
1,2024-11-29,18:01:39,"""CNR1326809""",Incomplete,"""CID4604802""",Go Sedan,Shastri Nagar,Gurgaon Sector 56,4.9,14.0,,,,,1.0,Vehicle Breakdown,237.0,5.73,,,UPI
2,2024-08-23,08:56:10,"""CNR8494506""",Completed,"""CID9202816""",Auto,Khandsa,Malviya Nagar,13.4,25.8,,,,,,,627.0,13.58,4.9,4.9,Debit Card
3,2024-10-21,17:17:25,"""CNR8906825""",Completed,"""CID2610914""",Premier Sedan,Central Secretariat,Inderlok,13.1,28.5,,,,,,,416.0,34.02,4.6,5.0,UPI
4,2024-09-16,22:08:00,"""CNR1950162""",Completed,"""CID9933542""",Bike,Ghitorni Village,Khan Market,5.3,19.6,,,,,,,737.0,48.21,4.1,4.3,UPI
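
If the headers ever contain stray or repeated whitespace, a slightly more defensive variant of the renaming step may help. This is a sketch on a toy frame; the messy headers are hypothetical and do not occur in the output shown above.

```python
import pandas as pd

# Toy frame with messy headers (the extra whitespace is hypothetical).
toy = pd.DataFrame(columns=[" Booking ID", "Avg  VTAT", "Payment Method "])

toy.columns = (
    toy.columns
    .str.strip()                            # drop leading/trailing whitespace
    .str.replace(r"\s+", "_", regex=True)   # collapse whitespace runs into one underscore
    .str.lower()
)
print(toy.columns.tolist())
```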


Date & Time Features Conversion and Extraction

In [17]:
# Combine 'date' and 'time' into a single datetime column (unparseable values become NaT)
df['datetime'] = pd.to_datetime(df['date'] + ' ' + df['time'], format='%Y-%m-%d %H:%M:%S', errors='coerce')

# Drop the original columns, which are now redundant
df.drop(['date', 'time'], axis=1, inplace=True)
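
Because `errors='coerce'` silently turns unparseable values into `NaT`, it can be worth counting them right after the conversion. The sketch below shows the behaviour on a toy frame; the malformed time in the second row is a hypothetical example.

```python
import pandas as pd

# Toy input: the second row has a malformed time (hypothetical example).
toy = pd.DataFrame({
    "date": ["2024-03-23", "2024-11-29"],
    "time": ["12:29:38", "25:61:00"],
})

toy["datetime"] = pd.to_datetime(
    toy["date"] + " " + toy["time"],
    format="%Y-%m-%d %H:%M:%S",
    errors="coerce",  # unparseable values become NaT instead of raising
)

# Count the rows that failed to parse.
n_failed = int(toy["datetime"].isna().sum())
print("unparseable rows:", n_failed)
```

On the real data the same count can be read from `df['datetime'].isna().sum()`.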

In [18]:
# Verify the change
print(df.info())
print(df[['datetime']].head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 20 columns):
 #   Column                             Non-Null Count   Dtype         
---  ------                             --------------   -----         
 0   booking_id                         150000 non-null  object        
 1   booking_status                     150000 non-null  object        
 2   customer_id                        150000 non-null  object        
 3   vehicle_type                       150000 non-null  object        
 4   pickup_location                    150000 non-null  object        
 5   drop_location                      150000 non-null  object        
 6   avg_vtat                           139500 non-null  float64       
 7   avg_ctat                           102000 non-null  float64       
 8   cancelled_rides_by_customer        10500 non-null   float64       
 9   reason_for_cancelling_by_customer  10500 non-null   object        
 10  cancelled_rides_by_d

In [19]:
# Extract datetime features for further analysis and modeling
df['hour'] = df['datetime'].dt.hour
df['day_of_week'] = df['datetime'].dt.day_name()
df['weekday'] = df['datetime'].dt.dayofweek
df['month'] = df['datetime'].dt.month
df['is_weekend'] = df['day_of_week'].isin(['Saturday', 'Sunday'])

df.head()

Unnamed: 0,booking_id,booking_status,customer_id,vehicle_type,pickup_location,drop_location,avg_vtat,avg_ctat,cancelled_rides_by_customer,reason_for_cancelling_by_customer,cancelled_rides_by_driver,driver_cancellation_reason,incomplete_rides,incomplete_rides_reason,booking_value,ride_distance,driver_ratings,customer_rating,payment_method,datetime,hour,day_of_week,weekday,month,is_weekend
0,"""CNR5884300""",No Driver Found,"""CID1982111""",eBike,Palam Vihar,Jhilmil,,,,,,,,,,,,,,2024-03-23 12:29:38,12,Saturday,5,3,True
1,"""CNR1326809""",Incomplete,"""CID4604802""",Go Sedan,Shastri Nagar,Gurgaon Sector 56,4.9,14.0,,,,,1.0,Vehicle Breakdown,237.0,5.73,,,UPI,2024-11-29 18:01:39,18,Friday,4,11,False
2,"""CNR8494506""",Completed,"""CID9202816""",Auto,Khandsa,Malviya Nagar,13.4,25.8,,,,,,,627.0,13.58,4.9,4.9,Debit Card,2024-08-23 08:56:10,8,Friday,4,8,False
3,"""CNR8906825""",Completed,"""CID2610914""",Premier Sedan,Central Secretariat,Inderlok,13.1,28.5,,,,,,,416.0,34.02,4.6,5.0,UPI,2024-10-21 17:17:25,17,Monday,0,10,False
4,"""CNR1950162""",Completed,"""CID9933542""",Bike,Ghitorni Village,Khan Market,5.3,19.6,,,,,,,737.0,48.21,4.1,4.3,UPI,2024-09-16 22:08:00,22,Monday,0,9,False
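
As a quick consistency check on the extracted features, the weekend flag can also be derived from the numeric day of the week. The sketch below compares the two approaches on three timestamps taken from the rows shown in this notebook. It is an illustrative alternative, not a replacement for the cell above.

```python
import pandas as pd

ts = pd.Series(pd.to_datetime([
    "2024-03-23 12:29:38",  # Saturday
    "2024-11-29 18:01:39",  # Friday
    "2024-03-10 19:55:06",  # Sunday
]))

# Name-based flag, as used in the notebook.
by_name = ts.dt.day_name().isin(["Saturday", "Sunday"])

# Number-based flag: dayofweek is 0 for Monday through 6 for Sunday.
by_number = ts.dt.dayofweek >= 5

print(by_name.tolist())
print("flags agree:", by_name.equals(by_number))
```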


Cleaning data values

In [20]:
# Remove double quotes from column values of 'booking_id' and 'customer_id' for consistency
df['booking_id'] = df['booking_id'].str.strip('"')
df['customer_id'] = df['customer_id'].str.strip('"')
df.head()

Unnamed: 0,booking_id,booking_status,customer_id,vehicle_type,pickup_location,drop_location,avg_vtat,avg_ctat,cancelled_rides_by_customer,reason_for_cancelling_by_customer,cancelled_rides_by_driver,driver_cancellation_reason,incomplete_rides,incomplete_rides_reason,booking_value,ride_distance,driver_ratings,customer_rating,payment_method,datetime,hour,day_of_week,weekday,month,is_weekend
0,CNR5884300,No Driver Found,CID1982111,eBike,Palam Vihar,Jhilmil,,,,,,,,,,,,,,2024-03-23 12:29:38,12,Saturday,5,3,True
1,CNR1326809,Incomplete,CID4604802,Go Sedan,Shastri Nagar,Gurgaon Sector 56,4.9,14.0,,,,,1.0,Vehicle Breakdown,237.0,5.73,,,UPI,2024-11-29 18:01:39,18,Friday,4,11,False
2,CNR8494506,Completed,CID9202816,Auto,Khandsa,Malviya Nagar,13.4,25.8,,,,,,,627.0,13.58,4.9,4.9,Debit Card,2024-08-23 08:56:10,8,Friday,4,8,False
3,CNR8906825,Completed,CID2610914,Premier Sedan,Central Secretariat,Inderlok,13.1,28.5,,,,,,,416.0,34.02,4.6,5.0,UPI,2024-10-21 17:17:25,17,Monday,0,10,False
4,CNR1950162,Completed,CID9933542,Bike,Ghitorni Village,Khan Market,5.3,19.6,,,,,,,737.0,48.21,4.1,4.3,UPI,2024-09-16 22:08:00,22,Monday,0,9,False


In [21]:
# Check unique values in categorical columns
print("Vehicle Types:", df['vehicle_type'].unique())
print("Booking statuses:", df['booking_status'].unique())

Vehicle Types: ['eBike' 'Go Sedan' 'Auto' 'Premier Sedan' 'Bike' 'Go Mini' 'Uber XL']
Booking statuses: ['No Driver Found' 'Incomplete' 'Completed' 'Cancelled by Driver'
 'Cancelled by Customer']


Check/eliminate/deal with duplicates

In [22]:
# Check for fully duplicated rows
df.duplicated().sum()

np.int64(0)

In [23]:
# Check for missing Booking IDs and Customer IDs 
print(df['booking_id'].isnull().sum())
print(df['customer_id'].isnull().sum())

0
0


In [24]:
# Check for duplicate Booking IDs
dup_mask = df['booking_id'].duplicated(keep='first')
print("duplicate rows :", dup_mask.sum())
df[dup_mask].head()

duplicate rows : 1233


Unnamed: 0,booking_id,booking_status,customer_id,vehicle_type,pickup_location,drop_location,avg_vtat,avg_ctat,cancelled_rides_by_customer,reason_for_cancelling_by_customer,cancelled_rides_by_driver,driver_cancellation_reason,incomplete_rides,incomplete_rides_reason,booking_value,ride_distance,driver_ratings,customer_rating,payment_method,datetime,hour,day_of_week,weekday,month,is_weekend
5522,CNR5071968,No Driver Found,CID6309096,Auto,Kanhaiya Nagar,India Gate,,,,,,,,,,,,,,2024-03-10 19:55:06,19,Sunday,6,3,True
7762,CNR8512595,Completed,CID9741888,Go Mini,Narsinghpur,Huda City Centre,3.5,15.4,,,,,,,187.0,41.45,4.4,4.2,Credit Card,2024-03-01 11:55:56,11,Friday,4,3,False
9587,CNR1029172,Completed,CID6382731,Auto,Inderlok,Laxmi Nagar,6.9,34.4,,,,,,,332.0,36.38,4.3,4.3,UPI,2024-12-17 19:19:02,19,Tuesday,1,12,False
9726,CNR7132372,Completed,CID6950827,Go Sedan,Kalkaji,Sushant Lok,3.8,18.3,,,,,,,389.0,44.95,3.7,4.2,UPI,2024-05-23 20:44:27,20,Thursday,3,5,False
10186,CNR7768664,Completed,CID4473762,eBike,Anand Vihar ISBT,Netaji Subhash Place,5.1,42.1,,,,,,,357.0,17.03,4.2,4.2,UPI,2024-12-14 21:15:59,21,Saturday,5,12,True


In [25]:
# Inspect a duplicated booking_id to see whether the repeats stem from 'No Driver Found' bookings
df[df['booking_id']=='CNR5071968']

Unnamed: 0,booking_id,booking_status,customer_id,vehicle_type,pickup_location,drop_location,avg_vtat,avg_ctat,cancelled_rides_by_customer,reason_for_cancelling_by_customer,cancelled_rides_by_driver,driver_cancellation_reason,incomplete_rides,incomplete_rides_reason,booking_value,ride_distance,driver_ratings,customer_rating,payment_method,datetime,hour,day_of_week,weekday,month,is_weekend
317,CNR5071968,Completed,CID7384045,Go Sedan,Panchsheel Park,Yamuna Bank,4.7,42.5,,,,,,,473.0,48.35,4.7,3.8,Cash,2024-10-10 03:56:19,3,Thursday,3,10,False
5522,CNR5071968,No Driver Found,CID6309096,Auto,Kanhaiya Nagar,India Gate,,,,,,,,,,,,,,2024-03-10 19:55:06,19,Sunday,6,3,True


In [26]:
df[df['booking_id']=='CNR8512595']

Unnamed: 0,booking_id,booking_status,customer_id,vehicle_type,pickup_location,drop_location,avg_vtat,avg_ctat,cancelled_rides_by_customer,reason_for_cancelling_by_customer,cancelled_rides_by_driver,driver_cancellation_reason,incomplete_rides,incomplete_rides_reason,booking_value,ride_distance,driver_ratings,customer_rating,payment_method,datetime,hour,day_of_week,weekday,month,is_weekend
1893,CNR8512595,Completed,CID8017027,Auto,Ashok Vihar,Mehrauli,14.6,29.6,,,,,,,294.0,44.22,3.6,5.0,Cash,2024-11-02 10:45:25,10,Saturday,5,11,True
7762,CNR8512595,Completed,CID9741888,Go Mini,Narsinghpur,Huda City Centre,3.5,15.4,,,,,,,187.0,41.45,4.4,4.2,Credit Card,2024-03-01 11:55:56,11,Friday,4,3,False


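The repeated `booking_id` values are not limited to 'No Driver Found' bookings: `CNR8512595`, for example, appears on two completed rides with different customers, dates, and routes. These rows are therefore distinct bookings that share an ID rather than duplicated records. We keep all of them so that no information is lost, and drop the `booking_id` column before modeling.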
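
The two manual lookups above can be generalised to all repeated IDs. The following sketch counts distinct customers per duplicated `booking_id` on a toy frame. The first four rows are modelled on the rows inspected above, and the last row is a hypothetical non-duplicated booking.

```python
import pandas as pd

toy = pd.DataFrame({
    "booking_id": ["CNR5071968", "CNR5071968", "CNR8512595", "CNR8512595", "CNR0000001"],
    "customer_id": ["CID7384045", "CID6309096", "CID8017027", "CID9741888", "CID0000001"],
})

# Keep every row whose booking_id occurs more than once.
dupes = toy[toy["booking_id"].duplicated(keep=False)]

# Number of distinct customers per duplicated booking_id.
customers_per_id = dupes.groupby("booking_id")["customer_id"].nunique()

# True if every duplicated ID maps to more than one customer,
# i.e. the rows look like distinct rides rather than repeated records.
all_distinct = bool((customers_per_id > 1).all())
print(customers_per_id)
print("all duplicated IDs span multiple customers:", all_distinct)
```

Running the same `groupby` on `df` would show whether this pattern holds for all 1233 repeated IDs.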

In [33]:
pd.crosstab(df['cancelled_rides_by_customer'],df['driver_ratings'],dropna=False)

cancelled_rides_by_customer \ driver_ratings,3.0,3.1,3.2,3.3,3.4,3.5,3.6,3.7,3.8,3.9,4.0,4.1,4.2,4.3,4.4,4.5,4.6,4.7,4.8,4.9,5.0,NaN
1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,10500
,745,1459,1538,1461,1491,748,2026,3790,3848,3915,1995,6966,13841,14081,7018,4634,9368,4678,2328,4705,2365,46500


In [32]:
pd.crosstab(df['cancelled_rides_by_customer'],df['customer_rating'],dropna=False)

cancelled_rides_by_customer \ customer_rating,3.0,3.1,3.2,3.3,3.4,3.5,3.6,3.7,3.8,3.9,4.0,4.1,4.2,4.3,4.4,4.5,4.6,4.7,4.8,4.9,5.0,NaN
1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,10500
,468,1008,881,900,928,443,1194,2354,2357,2370,1185,5396,10697,10995,5279,5890,11533,5763,5880,11642,5837,46500


Driver and customer ratings are missing (NaN) for all 10,500 bookings that the customer cancelled, so ratings exist only for rides that were not cancelled by the customer.
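
The same conclusion can be checked without reading a wide crosstab, using a boolean mask. This is a minimal sketch on toy data; the column names match the notebook, and the values are illustrative.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "cancelled_rides_by_customer": [1.0, np.nan, np.nan, 1.0],
    "driver_ratings": [np.nan, 4.9, np.nan, np.nan],
    "customer_rating": [np.nan, 4.9, np.nan, np.nan],
})

# Rows where the customer-cancellation flag is set.
cancelled = toy["cancelled_rides_by_customer"].notna()

# For each rating column: are all values missing among cancelled bookings?
ratings_missing = toy.loc[cancelled, ["driver_ratings", "customer_rating"]].isna().all()
print(ratings_missing)
```

If both entries are `True` on the real data, the ratings cannot serve as predictors of customer cancellation, since their presence already implies the ride was not cancelled by the customer.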

In [34]:
# Check for Nulls
df.isna().sum()

booking_id                                0
booking_status                            0
customer_id                               0
vehicle_type                              0
pickup_location                           0
drop_location                             0
avg_vtat                              10500
avg_ctat                              48000
cancelled_rides_by_customer          139500
reason_for_cancelling_by_customer    139500
cancelled_rides_by_driver            123000
driver_cancellation_reason           123000
incomplete_rides                     141000
incomplete_rides_reason              141000
booking_value                         48000
ride_distance                         48000
driver_ratings                        57000
customer_rating                       57000
payment_method                        48000
datetime                                  0
hour                                      0
day_of_week                               0
weekday                         

**Treating NaNs**: the columns with missing values fall into four groups:

- avg_vtat, avg_ctat
- cancelled_rides_by_customer, cancelled_rides_by_driver, incomplete_rides
- reason_for_cancelling_by_customer, driver_cancellation_reason, incomplete_rides_reason (*these will be dropped before modeling, as they are not required*)
- booking_value, ride_distance, driver_ratings, customer_rating, payment_method
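
For the flag and reason columns in the list above, one possible treatment is sketched below on toy data. It assumes that a missing flag means the event did not occur. This is consistent with the summary statistics, where the flag columns only ever take the value 1.0, but it is an assumption rather than a documented property of the dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "cancelled_rides_by_customer": [1.0, np.nan, np.nan],
    "cancelled_rides_by_driver": [np.nan, 1.0, np.nan],
    "incomplete_rides": [np.nan, np.nan, np.nan],
    "reason_for_cancelling_by_customer": ["Wrong Address", None, None],
})

flag_cols = ["cancelled_rides_by_customer", "cancelled_rides_by_driver", "incomplete_rides"]
reason_cols = ["reason_for_cancelling_by_customer"]

# Assumption: a missing flag means "did not happen", so NaN becomes 0.
toy[flag_cols] = toy[flag_cols].fillna(0).astype(int)

# Free-text reasons are dropped, as noted in the list above.
toy = toy.drop(columns=reason_cols)
print(toy)
```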

In [35]:
# Check for missing values in 'avg_vtat' 
df[df['avg_vtat'].isnull()].head()

Unnamed: 0,booking_id,booking_status,customer_id,vehicle_type,pickup_location,drop_location,avg_vtat,avg_ctat,cancelled_rides_by_customer,reason_for_cancelling_by_customer,cancelled_rides_by_driver,driver_cancellation_reason,incomplete_rides,incomplete_rides_reason,booking_value,ride_distance,driver_ratings,customer_rating,payment_method,datetime,hour,day_of_week,weekday,month,is_weekend
0,CNR5884300,No Driver Found,CID1982111,eBike,Palam Vihar,Jhilmil,,,,,,,,,,,,,,2024-03-23 12:29:38,12,Saturday,5,3,True
8,CNR4510807,No Driver Found,CID7873618,Go Sedan,Noida Sector 62,Noida Sector 18,,,,,,,,,,,,,,2024-09-14 12:49:09,12,Saturday,5,9,True
11,CNR9551927,No Driver Found,CID7568143,Auto,Vidhan Sabha,AIIMS,,,,,,,,,,,,,,2024-09-18 08:09:38,8,Wednesday,2,9,False
27,CNR4499383,No Driver Found,CID5717521,Premier Sedan,Sadar Bazar Gurgaon,Mehrauli,,,,,,,,,,,,,,2024-04-12 19:42:35,19,Friday,4,4,False
57,CNR9773309,No Driver Found,CID9965847,Uber XL,Anand Vihar ISBT,Dwarka Sector 21,,,,,,,,,,,,,,2024-04-11 15:43:34,15,Thursday,3,4,False


In [37]:
# Check which booking statuses have missing 'avg_vtat' and 'avg_ctat'
print('Missing avg_vtat:',df[df['avg_vtat'].isnull()]['booking_status'].unique())
print('Missing avg_ctat:',df[df['avg_ctat'].isnull()]['booking_status'].unique())

Missing avg_vtat: ['No Driver Found']
Missing avg_ctat: ['No Driver Found' 'Cancelled by Driver' 'Cancelled by Customer']
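
`unique()` shows which statuses have missing values but not whether every booking with that status is affected. The sketch below computes the share of missing values per status on toy data. The non-missing `avg_vtat` values for the cancelled rows are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "booking_status": ["Completed", "No Driver Found", "Cancelled by Driver",
                       "Cancelled by Customer", "Incomplete"],
    "avg_vtat": [13.4, np.nan, 5.0, 6.0, 4.9],
    "avg_ctat": [25.8, np.nan, np.nan, np.nan, 14.0],
})

# Fraction of missing values per booking status (1.0 = always missing).
missing_share = (
    toy[["avg_vtat", "avg_ctat"]]
    .isna()
    .groupby(toy["booking_status"])
    .mean()
)
print(missing_share)
```

A share of 1.0 for a status means the column is never populated for it. In that case the missingness itself encodes the outcome, which matters when only booking-time information may be used.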
