# Feature Engineering for Ride Cancellation Prediction

**Author:** Zehra Buse Tüfekçi  
**Date:** 26 February 2026

## Purpose of This Notebook
This notebook focuses on feature engineering for the ride-hailing dataset.  
We create new features, encode categorical variables, and prepare the dataset for machine learning models.

## Data Loading and Initial Inspection
We load the cleaned dataset from the previous notebook and inspect its structure to understand the columns and types.

In [33]:
from google.colab import files
files.upload()

{}

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('ride_cancellation_cleaned.csv')
df

Unnamed: 0,Date,Time,Booking ID,Booking Status,Customer ID,Vehicle Type,Pickup Location,Drop Location,Avg VTAT,Avg CTAT,...,Reason for cancelling by Customer,Cancelled Rides by Driver,Driver Cancellation Reason,Incomplete Rides,Incomplete Rides Reason,Booking Value,Ride Distance,Driver Ratings,Customer Rating,Payment Method
0,2024-03-23,12:29:38,"""CNR5884300""",No Driver Found,"""CID1982111""",eBike,Palam Vihar,Jhilmil,8.3,28.8,...,No_Customer_Cancellation,0.0,No_Driver_Cancellation,0.0,No_Incomplete_Rides,414.0,23.72,4.3,4.5,UPI
1,2024-11-29,18:01:39,"""CNR1326809""",Incomplete,"""CID4604802""",Go Sedan,Shastri Nagar,Gurgaon Sector 56,4.9,14.0,...,No_Customer_Cancellation,0.0,No_Driver_Cancellation,1.0,Vehicle Breakdown,237.0,5.73,4.3,4.5,UPI
2,2024-08-23,08:56:10,"""CNR8494506""",Completed,"""CID9202816""",Auto,Khandsa,Malviya Nagar,13.4,25.8,...,No_Customer_Cancellation,0.0,No_Driver_Cancellation,0.0,No_Incomplete_Rides,627.0,13.58,4.9,4.9,Debit Card
3,2024-10-21,17:17:25,"""CNR8906825""",Completed,"""CID2610914""",Premier Sedan,Central Secretariat,Inderlok,13.1,28.5,...,No_Customer_Cancellation,0.0,No_Driver_Cancellation,0.0,No_Incomplete_Rides,416.0,34.02,4.6,5.0,UPI
4,2024-09-16,22:08:00,"""CNR1950162""",Completed,"""CID9933542""",Bike,Ghitorni Village,Khan Market,5.3,19.6,...,No_Customer_Cancellation,0.0,No_Driver_Cancellation,0.0,No_Incomplete_Rides,737.0,48.21,4.1,4.3,UPI
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
149995,2024-11-11,19:34:01,"""CNR6500631""",Completed,"""CID4337371""",Go Mini,MG Road,Ghitorni,10.2,44.4,...,No_Customer_Cancellation,0.0,No_Driver_Cancellation,0.0,No_Incomplete_Rides,475.0,40.08,3.7,4.1,Uber Wallet
149996,2024-11-24,15:55:09,"""CNR2468611""",Completed,"""CID2325623""",Go Mini,Golf Course Road,Akshardham,5.1,30.8,...,No_Customer_Cancellation,0.0,No_Driver_Cancellation,0.0,No_Incomplete_Rides,1093.0,21.31,4.8,5.0,UPI
149997,2024-09-18,10:55:15,"""CNR6358306""",Completed,"""CID9925486""",Go Sedan,Satguru Ram Singh Marg,Jor Bagh,2.7,23.4,...,No_Customer_Cancellation,0.0,No_Driver_Cancellation,0.0,No_Incomplete_Rides,852.0,15.93,3.9,4.4,Cash
149998,2024-10-05,07:53:34,"""CNR3030099""",Completed,"""CID9415487""",Auto,Ghaziabad,Saidulajab,6.9,39.6,...,No_Customer_Cancellation,0.0,No_Driver_Cancellation,0.0,No_Incomplete_Rides,333.0,45.54,4.1,3.7,UPI


In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 21 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   Date                               150000 non-null  object 
 1   Time                               150000 non-null  object 
 2   Booking ID                         150000 non-null  object 
 3   Booking Status                     150000 non-null  object 
 4   Customer ID                        150000 non-null  object 
 5   Vehicle Type                       150000 non-null  object 
 6   Pickup Location                    150000 non-null  object 
 7   Drop Location                      150000 non-null  object 
 8   Avg VTAT                           150000 non-null  float64
 9   Avg CTAT                           150000 non-null  float64
 10  Cancelled Rides by Customer        150000 non-null  float64
 11  Reason for cancelling by Customer  1500

## Dropping Unnecessary Columns
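We remove columns that are not needed for modeling: the booking and customer identifiers, the pickup and drop locations (place names with many distinct values), and the cancellation-reason and incomplete-ride columns, which describe the outcome of a ride rather than its conditions.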

In [35]:
df = df.drop(['Booking ID','Customer ID'],axis=1)

In [36]:
df['Pickup Location'].unique()

array(['Palam Vihar', 'Shastri Nagar', 'Khandsa', 'Central Secretariat',
       'Ghitorni Village', 'AIIMS', 'Vaishali', 'Mayur Vihar',
       'Noida Sector 62', 'Rohini', 'Udyog Bhawan', 'Vidhan Sabha',
       'Patel Chowk', 'Malviya Nagar', 'Madipur', 'Jama Masjid',
       'IGI Airport', 'Vinobapuri', 'Kashmere Gate', 'Pitampura',
       'Punjabi Bagh', 'Greater Noida', 'Tis Hazari', 'Noida Sector 18',
       'Kanhaiya Nagar', 'Okhla', 'Cyber Hub', 'Sadar Bazar Gurgaon',
       'Shastri Park', 'Faridabad Sector 15', 'Qutub Minar', 'Mundka',
       'DLF City Court', 'New Colony', 'Nirman Vihar',
       'New Delhi Railway Station', 'Civil Lines Gurgaon', 'Seelampur',
       'Noida Extension', 'Adarsh Nagar', 'Panipat', 'Karol Bagh',
       'Sultanpur', 'Moti Nagar', 'Dilshad Garden', 'Aya Nagar',
       'Rajiv Chowk', 'MG Road', 'Jasola', 'Ardee City', 'Meerut',
       'Anand Vihar ISBT', 'Lajpat Nagar', 'Tughlakabad', 'Karkarduma',
       'Dwarka Mor', 'Anand Vihar', 'Uttam Nagar', 'M

In [37]:
df['Drop Location'].unique()

array(['Jhilmil', 'Gurgaon Sector 56', 'Malviya Nagar', 'Inderlok',
       'Khan Market', 'Narsinghpur', 'Punjabi Bagh', 'Cyber Hub',
       'Noida Sector 18', 'Adarsh Nagar', 'Dwarka Sector 21', 'AIIMS',
       'Kherki Daula Toll', 'Ghitorni Village', 'GTB Nagar', 'Madipur',
       'Anand Vihar', 'Rajiv Nagar', 'Mansarovar Park',
       'Botanical Garden', 'IMT Manesar', 'Old Gurgaon',
       'Barakhamba Road', 'Saket', 'Mehrauli', 'Vishwavidyalaya',
       'Preet Vihar', 'Nehru Place', 'Shahdara', 'Noida Film City',
       'Mandi House', 'Janakpuri', 'Udyog Vihar Phase 4',
       'Civil Lines Gurgaon', 'Karkarduma', 'Tagore Garden',
       'Noida Extension', 'Anand Vihar ISBT', 'Central Secretariat',
       'Hauz Rani', 'Palam Vihar', 'RK Puram', 'Basai Dhankot',
       'Badarpur', 'Ramesh Nagar', 'Akshardham', 'Yamuna Bank',
       'IGI Airport', 'New Colony', 'Green Park', 'ITO',
       'New Delhi Railway Station', 'Mundka', 'India Gate', 'Pitampura',
       'Netaji Subhash Place',

In [38]:
df = df.drop(['Pickup Location','Drop Location'],axis=1)

In [39]:
df['Reason for cancelling by Customer'].unique()

array(['No_Customer_Cancellation',
       'Driver is not moving towards pickup location',
       'Driver asked to cancel', 'AC is not working', 'Change of plans',
       'Wrong Address'], dtype=object)

In [40]:
df = df.drop(['Reason for cancelling by Customer'],axis=1)

In [41]:
df['Driver Cancellation Reason'].unique()

array(['No_Driver_Cancellation', 'Personal & Car related issues',
       'Customer related issue', 'More than permitted people in there',
       'The customer was coughing/sick'], dtype=object)

In [42]:
df = df.drop(['Driver Cancellation Reason'],axis=1)

In [43]:
df = df.drop(['Incomplete Rides','Incomplete Rides Reason'],axis=1)
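
The separate `drop` calls in this section could also be collapsed into one step. The following is a minimal sketch, assuming the column names shown in the `df.info()` output; the helper name `drop_unused_columns` and the toy frame are hypothetical and only serve as illustration.

```python
import pandas as pd

# Columns removed in this section, grouped by the reason for dropping them.
ID_COLS = ["Booking ID", "Customer ID"]
LOCATION_COLS = ["Pickup Location", "Drop Location"]
REASON_COLS = [
    "Reason for cancelling by Customer",
    "Driver Cancellation Reason",
    "Incomplete Rides",
    "Incomplete Rides Reason",
]


def drop_unused_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a new frame without the identifier, location, and reason columns."""
    to_drop = ID_COLS + LOCATION_COLS + REASON_COLS
    # errors="ignore" skips names that are already absent, so the step can be re-run.
    return frame.drop(columns=to_drop, errors="ignore")


# Hypothetical toy frame with a subset of the real columns.
toy = pd.DataFrame(
    {
        "Booking ID": ["CNR1"],
        "Customer ID": ["CID1"],
        "Pickup Location": ["AIIMS"],
        "Drop Location": ["Saket"],
        "Vehicle Type": ["Auto"],
        "Booking Value": [414.0],
    }
)
trimmed = drop_unused_columns(toy)
print(trimmed.columns.tolist())
```

Keeping the names in one place makes it easier to see at a glance what was removed, and `errors="ignore"` avoids a `KeyError` when a cell is executed a second time.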

In [44]:
df.tail(20)

Unnamed: 0,Date,Time,Booking Status,Vehicle Type,Avg VTAT,Avg CTAT,Cancelled Rides by Customer,Cancelled Rides by Driver,Booking Value,Ride Distance,Driver Ratings,Customer Rating,Payment Method
149980,2024-12-10,15:31:45,No Driver Found,Premier Sedan,8.3,28.8,0.0,0.0,414.0,23.72,4.3,4.5,UPI
149981,2024-12-10,09:46:40,Completed,Go Sedan,5.8,18.5,0.0,0.0,91.0,43.43,4.9,4.1,UPI
149982,2024-11-13,06:51:19,No Driver Found,Auto,8.3,28.8,0.0,0.0,414.0,23.72,4.3,4.5,UPI
149983,2024-03-23,20:33:36,Completed,Go Mini,3.0,27.5,0.0,0.0,581.0,25.1,4.3,5.0,Credit Card
149984,2024-01-14,15:42:15,Incomplete,Go Sedan,5.9,12.8,0.0,0.0,1146.0,13.96,4.3,4.5,Cash
149985,2024-02-26,16:45:21,Completed,Bike,13.4,27.0,0.0,0.0,193.0,13.29,4.2,4.6,Debit Card
149986,2024-03-15,17:49:46,Completed,Premier Sedan,12.9,43.9,0.0,0.0,289.0,44.0,4.3,5.0,Cash
149987,2024-02-06,19:51:00,Completed,Go Mini,2.3,19.9,0.0,0.0,101.0,10.69,4.2,4.5,UPI
149988,2024-08-17,10:43:13,Completed,Bike,13.4,26.6,0.0,0.0,96.0,27.72,4.3,4.2,UPI
149989,2024-07-22,10:04:18,Completed,Auto,14.2,22.8,0.0,0.0,75.0,8.33,4.2,4.9,Credit Card


## Encoding Categorical Variables
We convert categorical variables into numerical representations using one-hot encoding.

In [45]:
df['Vehicle Type'].unique()

array(['eBike', 'Go Sedan', 'Auto', 'Premier Sedan', 'Bike', 'Go Mini',
       'Uber XL'], dtype=object)

In [46]:
df = pd.get_dummies(df, columns=['Vehicle Type'])

In [47]:
df['Payment Method'].unique()

array(['UPI', 'Debit Card', 'Cash', 'Uber Wallet', 'Credit Card'],
      dtype=object)

In [48]:
df["Payment Method"].value_counts()

Payment Method,count
UPI,93909
Cash,25367
Uber Wallet,12276
Credit Card,10209
Debit Card,8239


In [49]:
df = pd.get_dummies(df, columns=["Payment Method"])
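
As a possible alternative to two separate calls, both columns can be encoded at once. The sketch below uses a hypothetical toy frame and passes `dtype=int` to `pd.get_dummies`, so the indicator columns come out as 0/1 integers rather than booleans.

```python
import pandas as pd

# Hypothetical toy frame with the two categorical columns used in this notebook.
toy = pd.DataFrame(
    {
        "Vehicle Type": ["Auto", "Bike", "Auto"],
        "Payment Method": ["UPI", "Cash", "UPI"],
        "Booking Value": [414.0, 237.0, 627.0],
    }
)

# Encode both columns in one call; dtype=int yields 0/1 integers instead of booleans.
encoded = pd.get_dummies(toy, columns=["Vehicle Type", "Payment Method"], dtype=int)
print(encoded.columns.tolist())
print(encoded["Vehicle Type_Auto"].tolist())
```

Each new column is named `<original column>_<category>`, and the untouched numeric columns are kept in front of the indicator columns.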

## Converting Boolean Columns
The indicator columns created by `pd.get_dummies` are boolean. We convert them to integers (0/1) for compatibility with machine learning models.

In [50]:
bool_cols = df.select_dtypes(bool).columns
df[bool_cols] = df[bool_cols].astype(int)

In [None]:
# dummy_cols = df.filter(regex="Vehicle Type|Payment Method").columns
# df[dummy_cols] = df[dummy_cols].astype(int)

## Extracting Date and Time Features
We extract features such as year, month, day, day of week, weekend flag, hour, and minute from date and time columns.

In [51]:
# errors='coerce' turns unparseable dates into NaT instead of raising an error
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')

In [52]:
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
df['Day_of_week'] = df['Date'].dt.dayofweek  # 0=Monday
df['Is_weekend'] = df['Day_of_week'].isin([5,6]).astype(int)

In [53]:
df['Time'] = pd.to_datetime(df['Time'], format='%H:%M:%S')
df['Hour'] = df['Time'].dt.hour
df['Minute'] = df['Time'].dt.minute

In [54]:
df = df.drop(['Date','Time'],axis=1)
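
The date and time steps above can be sketched as a single helper and checked on a couple of rows. This is only an illustration: the function name `add_datetime_features` and the toy frame are hypothetical, and the two sample rows reuse dates and times from the first rows of the dataset preview.

```python
import pandas as pd


def add_datetime_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Derive calendar and clock features, then drop the raw Date and Time columns."""
    out = frame.copy()
    date = pd.to_datetime(out["Date"], errors="coerce")
    # Parsing a bare time attaches a placeholder date (1900-01-01);
    # only the hour and minute are used afterwards.
    time = pd.to_datetime(out["Time"], format="%H:%M:%S", errors="coerce")

    out["Year"] = date.dt.year
    out["Month"] = date.dt.month
    out["Day"] = date.dt.day
    out["Day_of_week"] = date.dt.dayofweek  # 0=Monday ... 6=Sunday
    out["Is_weekend"] = out["Day_of_week"].isin([5, 6]).astype(int)
    out["Hour"] = time.dt.hour
    out["Minute"] = time.dt.minute
    return out.drop(columns=["Date", "Time"])


toy = pd.DataFrame(
    {
        "Date": ["2024-03-23", "2024-11-29"],  # a Saturday and a Friday
        "Time": ["12:29:38", "18:01:39"],
    }
)
features = add_datetime_features(toy)
print(features)
```

Checking a known Saturday and a known weekday is a quick way to confirm that `Is_weekend` follows the `0=Monday` convention of `dt.dayofweek`.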

## Creating Target Variable
We exclude bookings with status `Incomplete` or `No Driver Found` and create a binary target variable: 1 if the ride was cancelled by the driver or the customer, 0 otherwise.

In [55]:
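# .copy() makes the filtered frame independent of the original,
# so the assignment below does not trigger a SettingWithCopyWarning.
df = df[~df["Booking Status"].isin(["Incomplete","No Driver Found"])].copy()

df["target"] = df["Booking Status"].isin(
    ["Cancelled by Driver","Cancelled by Customer"]
).astype(int)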
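
Because `target` is derived directly from `Booking Status`, that column must not be passed to a model as a feature. The sketch below separates features from the target under one labeled assumption: that `Cancelled Rides by Customer` and `Cancelled Rides by Driver` are per-ride cancellation flags, as their names suggest, and would therefore leak the outcome as well. The helper name `split_features_target` and the toy frame are hypothetical.

```python
import pandas as pd

CANCELLED = ["Cancelled by Driver", "Cancelled by Customer"]
EXCLUDED = ["Incomplete", "No Driver Found"]
# Assumption: these columns reveal the outcome and should not be used as features.
LEAKY_COLS = [
    "Booking Status",
    "Cancelled Rides by Customer",
    "Cancelled Rides by Driver",
]


def split_features_target(frame: pd.DataFrame):
    """Filter out excluded statuses and return (X, y) without outcome-revealing columns."""
    kept = frame[~frame["Booking Status"].isin(EXCLUDED)].copy()
    y = kept["Booking Status"].isin(CANCELLED).astype(int)
    X = kept.drop(columns=LEAKY_COLS, errors="ignore")
    return X, y


toy = pd.DataFrame(
    {
        "Booking Status": [
            "Completed",
            "Cancelled by Driver",
            "Incomplete",
            "Cancelled by Customer",
            "No Driver Found",
        ],
        "Cancelled Rides by Driver": [0.0, 1.0, 0.0, 0.0, 0.0],
        "Booking Value": [414.0, 237.0, 627.0, 416.0, 737.0],
    }
)
X, y = split_features_target(toy)
print(X.columns.tolist())
print(y.tolist())
```

Whether the two flag columns should be excluded depends on how they were produced upstream, so it is worth checking their relationship to `Booking Status` on the real data before training.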

## Saving and Downloading the Processed Dataset

After completing feature engineering, we save the processed dataset to a CSV file.  
This dataset now includes all newly created features and is ready for machine learning model training.

The dataset is saved as `ride_cancellation_processed_v1.csv` and can be downloaded directly from Colab.
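
Before writing the file, a quick check for leftover non-numeric columns and missing values can catch surprises early. The following is a rough sketch on a hypothetical toy frame; the helper name `non_numeric_columns` is an assumption and not part of the original notebook.

```python
import pandas as pd


def non_numeric_columns(frame: pd.DataFrame) -> list:
    """List columns whose dtype is neither numeric nor boolean."""
    return frame.select_dtypes(exclude=["number", "bool"]).columns.tolist()


# Hypothetical toy frame: one text column left over next to numeric features.
toy = pd.DataFrame(
    {
        "Booking Status": ["Completed", "Cancelled by Driver"],
        "Hour": [12, 18],
        "target": [0, 1],
    }
)
leftover = non_numeric_columns(toy)
missing = int(toy.isna().sum().sum())
print(leftover)
print(missing)
```

Any column reported here would need to be encoded or excluded before most estimators can be fitted on the saved file.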

In [56]:
df.to_csv("ride_cancellation_processed_v1.csv", index=False)

In [57]:
from google.colab import files
files.download("ride_cancellation_processed_v1.csv")
