# Task 2

---

## Predictive modeling of customer bookings

This Jupyter notebook includes some code to get you started with this predictive modeling task. We will use various packages for data manipulation, feature engineering and machine learning.

### Exploratory data analysis

First, we must explore the data in order to better understand what we have and the statistical properties of the dataset.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("customer_booking.csv", encoding="ISO-8859-1")
df.head()

| | num_passengers | sales_channel | trip_type | purchase_lead | length_of_stay | flight_hour | flight_day | route | booking_origin | wants_extra_baggage | wants_preferred_seat | wants_in_flight_meals | flight_duration | booking_complete |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | Internet | RoundTrip | 262 | 19 | 7 | Sat | AKLDEL | New Zealand | 1 | 0 | 0 | 5.52 | 0 |
| 1 | 1 | Internet | RoundTrip | 112 | 20 | 3 | Sat | AKLDEL | New Zealand | 0 | 0 | 0 | 5.52 | 0 |
| 2 | 2 | Internet | RoundTrip | 243 | 22 | 17 | Wed | AKLDEL | India | 1 | 1 | 0 | 5.52 | 0 |
| 3 | 1 | Internet | RoundTrip | 96 | 31 | 4 | Sat | AKLDEL | New Zealand | 0 | 0 | 1 | 5.52 | 0 |
| 4 | 2 | Internet | RoundTrip | 68 | 22 | 15 | Wed | AKLDEL | India | 1 | 0 | 1 | 5.52 | 0 |


The `.head()` method shows the first 5 rows of the dataset, which is useful for a visual inspection of our columns.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 14 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   num_passengers         50000 non-null  int64  
 1   sales_channel          50000 non-null  object 
 2   trip_type              50000 non-null  object 
 3   purchase_lead          50000 non-null  int64  
 4   length_of_stay         50000 non-null  int64  
 5   flight_hour            50000 non-null  int64  
 6   flight_day             50000 non-null  object 
 7   route                  50000 non-null  object 
 8   booking_origin         50000 non-null  object 
 9   wants_extra_baggage    50000 non-null  int64  
 10  wants_preferred_seat   50000 non-null  int64  
 11  wants_in_flight_meals  50000 non-null  int64  
 12  flight_duration        50000 non-null  float64
 13  booking_complete       50000 non-null  int64  
dtypes: float64(1), int64(8), object(5)
memory usage: 5.3+ 

The `.info()` method gives us a data description, telling us the names of the columns, their data types and how many non-null values each one has. Fortunately, every column has 50,000 non-null entries, so we have no null values. It looks like some of these columns should be converted into different data types, e.g. `flight_day`.
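
The same two checks can also be done programmatically, which is handy when the frame has too many columns to read by eye. This is a minimal sketch on a small made-up frame (the values and the `toy` name are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy frame standing in for the bookings data (values are made up).
toy = pd.DataFrame({
    "num_passengers": [2, 1, 2],
    "flight_day": ["Sat", "Sat", "Wed"],
    "flight_duration": [5.52, 5.52, None],
})

# Number of missing values in each column.
null_counts = toy.isnull().sum()

# Non-numeric columns are the candidates for type conversion or encoding.
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(null_counts)
print(non_numeric_cols)
```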

To provide more context, below is a more detailed data description, explaining exactly what each column means:

- `num_passengers` = number of passengers travelling
- `sales_channel` = sales channel booking was made on
- `trip_type` = trip type (Round Trip, One Way, Circle Trip)
- `purchase_lead` = number of days between travel date and booking date
- `length_of_stay` = number of days spent at destination
- `flight_hour` = hour of flight departure
- `flight_day` = day of week of flight departure
- `route` = origin -> destination flight route
- `booking_origin` = country from where booking was made
- `wants_extra_baggage` = if the customer wanted extra baggage in the booking
- `wants_preferred_seat` = if the customer wanted a preferred seat in the booking
- `wants_in_flight_meals` = if the customer wanted in-flight meals in the booking
- `flight_duration` = total duration of flight (in hours)
- `booking_complete` = flag indicating if the customer completed the booking

Before we compute any statistics on the data, let's do any necessary data conversion.

In [4]:
df["flight_day"].unique()

array(['Sat', 'Wed', 'Thu', 'Mon', 'Sun', 'Tue', 'Fri'], dtype=object)

In [5]:
mapping = {
    "Mon": 1,
    "Tue": 2,
    "Wed": 3,
    "Thu": 4,
    "Fri": 5,
    "Sat": 6,
    "Sun": 7,
}

df["flight_day"] = df["flight_day"].map(mapping)

In [6]:
df["flight_day"].unique()

array([6, 3, 4, 1, 7, 2, 5])
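
One thing worth checking after a dictionary `.map()` is that no label was left out of the mapping, because unmapped labels silently become `NaN`. The sketch below uses a made-up series with a deliberately bogus label (`"Xyz"`) to show how such values can be surfaced:

```python
import pandas as pd

days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
mapping = {day: i for i, day in enumerate(days, start=1)}  # Mon=1 ... Sun=7

# Made-up input; "Xyz" is a hypothetical bad label, not a value from the dataset.
s = pd.Series(["Sat", "Wed", "Thu", "Mon", "Xyz"])

mapped = s.map(mapping)

# Labels missing from the mapping come back as NaN, so list them explicitly.
unmapped = s[mapped.isna()].unique().tolist()

print(mapped.tolist())
print(unmapped)
```

If `unmapped` is empty, the conversion covered every label that occurs in the column.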

In [7]:
df.describe()

| | num_passengers | purchase_lead | length_of_stay | flight_hour | flight_day | wants_extra_baggage | wants_preferred_seat | wants_in_flight_meals | flight_duration | booking_complete |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 |
| mean | 1.59124 | 84.94048 | 23.04456 | 9.06634 | 3.81442 | 0.66878 | 0.29696 | 0.42714 | 7.277561 | 0.14956 |
| std | 1.020165 | 90.451378 | 33.88767 | 5.41266 | 1.992792 | 0.470657 | 0.456923 | 0.494668 | 1.496863 | 0.356643 |
| min | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 4.67 | 0.0 |
| 25% | 1.0 | 21.0 | 5.0 | 5.0 | 2.0 | 0.0 | 0.0 | 0.0 | 5.62 | 0.0 |
| 50% | 1.0 | 51.0 | 17.0 | 9.0 | 4.0 | 1.0 | 0.0 | 0.0 | 7.57 | 0.0 |
| 75% | 2.0 | 115.0 | 28.0 | 13.0 | 5.0 | 1.0 | 1.0 | 1.0 | 8.83 | 0.0 |
| max | 9.0 | 867.0 | 778.0 | 23.0 | 7.0 | 1.0 | 1.0 | 1.0 | 9.5 | 1.0 |


The `.describe()` method gives us summary statistics for every numeric column (non-numeric columns are skipped by default), including `flight_day` now that it has been mapped to integers. This gives us a quick overview of the mean, minimum, maximum and quartiles of each column.
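
The text columns can be summarised in a similar way by selecting them first. A minimal sketch on made-up data (the `"Mobile"` and `"OneWay"` values are assumptions for illustration):

```python
import pandas as pd

# Toy frame; only the column names follow the dataset, the values are made up.
toy = pd.DataFrame({
    "sales_channel": ["Internet", "Internet", "Mobile", "Internet"],
    "trip_type": ["RoundTrip", "RoundTrip", "OneWay", "RoundTrip"],
    "purchase_lead": [262, 112, 243, 96],
})

# For non-numeric columns, describe() reports count, unique, top and freq.
cat_summary = toy.select_dtypes(exclude="number").describe()

# Relative frequency of each label in one column.
channel_share = toy["sales_channel"].value_counts(normalize=True)

print(cat_summary)
print(channel_share)
```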

From this point, you should continue exploring the dataset with some visualisations and other metrics that you think may be useful. Then, you should prepare your dataset for predictive modelling. Finally, you should train your machine learning model, evaluate it with performance metrics and output visualisations for the contributing variables. All of this analysis should be summarised in your single slide.
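
A natural first step in that exploration is to look at the class balance of the target and at how the completion rate varies across a categorical column. The `.describe()` output above puts the mean of `booking_complete` at about 0.15, so roughly 15% of rows are completed bookings. The sketch below shows the idea on a small made-up frame (values are illustrative only):

```python
import pandas as pd

# Toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    "sales_channel": ["Internet", "Internet", "Mobile", "Mobile", "Internet"],
    "booking_complete": [0, 1, 0, 0, 1],
})

# Share of each class in the target (class balance).
class_share = toy["booking_complete"].value_counts(normalize=True)

# The mean of a 0/1 flag is a proportion, so this is the completion rate per channel.
rate_by_channel = toy.groupby("sales_channel")["booking_complete"].mean()

print(class_share)
print(rate_by_channel)
```

The same `groupby(...).mean()` pattern can be applied to any other categorical column, and the resulting series can be passed to `.plot(kind="bar")` for a quick visual.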

In [8]:
import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.impute import SimpleImputer
import matplotlib.pyplot as plt
from pptx import Presentation
from pptx.util import Inches
import json

In [10]:
# ---------- CONFIG ----------
DATA_PATH = "customer_booking.csv"   # put file in same folder or change path
OUT_DIR = Path(".")
SAMPLE_ROWS = 20000   # set to None to use full dataset (if you have enough memory)
RANDOM_STATE = 42
N_ESTIMATORS = 100
CV_FOLDS = 3

In [11]:
# ---------- LOAD ----------
print("Loading CSV...")
df = pd.read_csv(DATA_PATH, encoding='latin-1')
print("Shape:", df.shape)

Loading CSV...
Shape: (50000, 14)


In [12]:
# ---------- QUICK EDA ----------
print("\nColumns and types:\n", df.dtypes)
print("\nMissing values (top):\n", df.isnull().sum().sort_values(ascending=False).head(15))


Columns and types:
 num_passengers             int64
sales_channel             object
trip_type                 object
purchase_lead              int64
length_of_stay             int64
flight_hour                int64
flight_day                object
route                     object
booking_origin            object
wants_extra_baggage        int64
wants_preferred_seat       int64
wants_in_flight_meals      int64
flight_duration          float64
booking_complete           int64
dtype: object

Missing values (top):
 num_passengers           0
sales_channel            0
trip_type                0
purchase_lead            0
length_of_stay           0
flight_hour              0
flight_day               0
route                    0
booking_origin           0
wants_extra_baggage      0
wants_preferred_seat     0
wants_in_flight_meals    0
flight_duration          0
booking_complete         0
dtype: int64


In [13]:
# ---------- TARGET ----------
TARGET = "booking_complete"
if TARGET not in df.columns:
    raise ValueError("Target column 'booking_complete' not found.")

In [14]:
# ---------- OPTIONAL SAMPLING ----------
if SAMPLE_ROWS is not None and SAMPLE_ROWS < len(df):
    print(f"Sampling {SAMPLE_ROWS} rows for faster iteration ...")
    df = df.sample(n=SAMPLE_ROWS, random_state=RANDOM_STATE).reset_index(drop=True)

Sampling 20000 rows for faster iteration ...


In [15]:
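#---------- FEATURE ENGINEERING ----------
# extras_requested: number of add-ons requested (0-3).
# Sum only the flag columns that exist: df.get(col, 0) returns a plain int
# for a missing column, and an int has no .fillna() method.
extra_cols = [c for c in ('wants_extra_baggage', 'wants_preferred_seat', 'wants_in_flight_meals')
              if c in df.columns]
df['extras_requested'] = df[extra_cols].fillna(0).astype(int).sum(axis=1)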

In [16]:
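# lead_time_bin
# pd.cut intervals are right-closed, so include_lowest=True is needed for the
# -1 sentinel (missing values) to land in the first bin instead of becoming NaN.
if 'purchase_lead' in df.columns:
    df['lead_time_bin'] = pd.cut(df['purchase_lead'].fillna(-1),
                                 bins=[-1,7,30,90,3650],
                                 include_lowest=True,
                                 labels=["missing_or_0-7","8-30","31-90","91+"]).astype(str)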
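
When binning with a sentinel value such as -1, it is worth checking how `pd.cut` treats the lowest edge. Intervals are right-closed by default, so a value equal to the first edge falls outside every bin unless `include_lowest=True` is passed. A small sketch with made-up values chosen to sit on the bin edges:

```python
import pandas as pd

# Made-up values chosen to sit on or next to the bin edges.
values = pd.Series([-1, 0, 7, 8, 90, 91])
edges = [-1, 7, 30, 90, 3650]

# Default: intervals are (-1, 7], (7, 30], ... so -1 itself is not covered.
default_bins = pd.cut(values, bins=edges)

# include_lowest=True closes the first interval on the left as well.
inclusive_bins = pd.cut(values, bins=edges, include_lowest=True)

print(default_bins.isna().tolist())
print(inclusive_bins.isna().tolist())
```

Only the first value (-1) is unbinned in the default case; with `include_lowest=True` every value gets a bin.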

In [17]:
# time_of_day
if 'flight_hour' in df.columns:
    def tod(h):
        try:
            h = int(h)
        except (TypeError, ValueError):  # non-numeric or missing hour
            return "unknown"
        if h < 6: return "early_morning"
        if h < 10: return "morning"
        if h < 14: return "midday"
        if h < 18: return "afternoon"
        if h < 22: return "evening"
        return "night"
    df['time_of_day'] = df['flight_hour'].apply(tod)

In [18]:
# stay_bin
# Bins are right-closed: (-1,0], (0,2], (2,7], (7,3650]; labels name the whole-day ranges.
if 'length_of_stay' in df.columns:
    df['stay_bin'] = pd.cut(df['length_of_stay'].fillna(0), bins=[-1,0,2,7,3650], labels=["missing/0","1-2","3-7","8+"]).astype(str)

In [19]:
# ---------- PREPARE FEATURES ----------
drop_cols = [TARGET]
X = df[[c for c in df.columns if c not in drop_cols]].copy()
y = df[TARGET].astype(int).copy()

numeric_cols = X.select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = [c for c in X.columns if c not in numeric_cols]

# Encode categoricals as category codes (memory efficient)
for c in categorical_cols:
    X[c] = X[c].astype('category').cat.codes.replace({-1: np.nan})

# Impute
for c in numeric_cols:
    X[c] = X[c].fillna(X[c].median())
for c in categorical_cols:
    X[c] = X[c].fillna(X[c].mode().iloc[0] if not X[c].mode().empty else 0)
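
Category codes are compact, but they assign integers in sorted label order, which implies an ordering that the labels do not really have. For low-cardinality columns, one-hot encoding is a common alternative. The sketch below contrasts the two on made-up data; the `"OneWay"` and `"CircleTrip"` spellings are assumptions, and this is an illustration rather than a change to the pipeline above:

```python
import pandas as pd

# Toy frame; label spellings other than "RoundTrip" are assumed for illustration.
toy = pd.DataFrame({
    "trip_type": ["RoundTrip", "OneWay", "RoundTrip", "CircleTrip"],
    "num_passengers": [2, 1, 1, 3],
})

# Category codes follow the sorted label order: CircleTrip=0, OneWay=1, RoundTrip=2.
codes = toy["trip_type"].astype("category").cat.codes

# One-hot encoding creates one 0/1 column per label instead.
one_hot = pd.get_dummies(toy, columns=["trip_type"], dtype=int)

print(codes.tolist())
print(one_hot.columns.tolist())
```

Tree-based models usually cope with integer codes, so the trade-off mainly matters for high-cardinality columns, where one-hot encoding would create a very wide matrix.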

In [20]:
# ---------- MODELING (CV predictions) ----------
clf = RandomForestClassifier(n_estimators=N_ESTIMATORS, random_state=RANDOM_STATE, n_jobs=-1, class_weight='balanced')
cv = StratifiedKFold(n_splits=CV_FOLDS, shuffle=True, random_state=RANDOM_STATE)

print("Running cross-validated predictions (this may take some minutes)...")
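# One CV pass is enough: the forest's class prediction is the argmax of its
# predicted probabilities, so y_pred can be derived from the probability matrix.
proba = cross_val_predict(clf, X, y, cv=cv, method='predict_proba', n_jobs=1)
y_proba = proba[:, 1]
y_pred = proba.argmax(axis=1)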

Running cross-validated predictions (this may take some minutes)...


In [21]:
# ---------- METRICS ----------
acc = accuracy_score(y, y_pred)
prec = precision_score(y, y_pred, zero_division=0)
rec = recall_score(y, y_pred, zero_division=0)
f1 = f1_score(y, y_pred, zero_division=0)
auc = roc_auc_score(y, y_proba)

metrics = dict(accuracy=acc, precision=prec, recall=rec, f1=f1, roc_auc=auc)
print("\nMetrics:", metrics)

# Fit final model for importances
clf.fit(X, y)
importances = clf.feature_importances_
fi = pd.DataFrame({'feature': X.columns, 'importance': importances}).sort_values('importance', ascending=False)


Metrics: {'accuracy': 0.8491, 'precision': 0.4449648711943794, 'recall': 0.06395153147088523, 'f1': 0.11183048852266039, 'roc_auc': 0.7376193686581224}
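
These numbers need to be read with the class imbalance in mind. About 15% of rows in the full dataset are completed bookings, so a model that always predicts "no booking" would already score close to 85% accuracy, and the recall of about 0.06 shows that very few completed bookings are caught at the default 0.5 cut-off. One possible follow-up is to pick the decision threshold from the out-of-fold probabilities instead. The sketch below shows the mechanics on made-up labels and scores (not the notebook's results):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Made-up labels and predicted probabilities, for illustration only.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 0, 1])
y_score = np.array([0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.45, 0.50, 0.55])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# precision/recall have one more entry than thresholds; drop the final point.
p, r = precision[:-1], recall[:-1]
f1_scores = 2 * p * r / np.clip(p + r, 1e-12, None)

# Threshold with the highest F1 on these scores.
best_threshold = thresholds[np.argmax(f1_scores)]

y_pred_default = (y_score >= 0.5).astype(int)
y_pred_tuned = (y_score >= best_threshold).astype(int)

print(best_threshold, y_pred_default.sum(), y_pred_tuned.sum())
```

A threshold chosen on the same predictions that are then reported will look optimistic, so it is safer to select it on one split and evaluate on another.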


In [22]:
# ---------- SAVE OUTPUTS ----------
OUT_DIR = Path(OUT_DIR)
OUT_DIR.mkdir(parents=True, exist_ok=True)  # also create missing parent folders

predictions_fp = OUT_DIR / "booking_predictions_cv.csv"
df_copy = df.copy()
df_copy['predicted_booking'] = y_pred
df_copy['predicted_proba'] = y_proba
df_copy.to_csv(predictions_fp, index=False)

fi_fp = OUT_DIR / "feature_importances.csv"
fi.to_csv(fi_fp, index=False)

metrics_fp = OUT_DIR / "model_metrics.json"
with open(metrics_fp, "w") as f:
    json.dump(metrics, f)


In [26]:
# Plot top features
top_n = 12
plt.figure(figsize=(10,6))
plt.barh(fi['feature'].head(top_n)[::-1], fi['importance'].head(top_n)[::-1])
plt.xlabel("Feature importance")
plt.title("Top features (RandomForest)")
plt.tight_layout()
plot_fp = OUT_DIR / "feature_importances.png"
plt.savefig(plot_fp)
plt.close()
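
The impurity-based importances plotted here are computed on the training data and can be inflated for high-cardinality features. Permutation importance on held-out data is a useful cross-check. The following is a minimal sketch on synthetic data, where only the first feature carries signal; the data and variable names are assumptions for illustration, not part of the booking pipeline:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: feature 0 determines the label, feature 1 is pure noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 2))
y_demo = (X_demo[:, 0] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Shuffle each feature on the held-out set and measure the drop in score.
result = permutation_importance(model, X_test, y_test, n_repeats=5, random_state=0)

print(result.importances_mean)
```

A feature whose shuffling barely changes the held-out score contributes little to the predictions, whatever its impurity-based ranking says.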



Saved outputs:
 - predictions -> booking_predictions_cv.csv
 - feature importances -> feature_importances.csv
 - plot -> feature_importances.png
 - pptx -> booking_model_summary.pptx
