# CPSC 330 - Applied Machine Learning 

## Homework 6: Putting it all together 
### Associated lectures: All material till lecture 13 

**Due date: Wednesday, March 15, 2023 at 11:59pm**

## Table of contents

- [Submission instructions](#si)
- [Understanding the problem](#1)
- [Data splitting](#2)
- [EDA](#3)
- (Optional) [Feature engineering](#4)
- [Preprocessing and transformations](#5)
- [Baseline model](#6)
- [Linear models](#7)
- [Different classifiers](#8)
- (Optional) [Feature selection](#9)
- [Hyperparameter optimization](#10)
- [Interpretation and feature importances](#11)
- [Results on the test set](#12)
- (Optional) [Explaining predictions](#13)
- [Summary of the results](#14)

## Imports 

In [24]:
import os

%matplotlib inline
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import xgboost as xgb
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from lightgbm.sklearn import LGBMClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    f1_score,
    make_scorer,
    ConfusionMatrixDisplay,
    accuracy_score, 
    precision_score, 
    recall_score
)
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_val_score,
    cross_validate,
    train_test_split,
)
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.svm import SVC
from skopt import BayesSearchCV
from skopt.space import Real, Integer, Categorical
import eli5
from scipy.stats import randint

<br><br>

## Instructions 
<hr>
rubric={points:2}

Follow the [homework submission instructions](https://github.com/UBC-CS/cpsc330-2022W2/blob/main/docs/homework_instructions.md). 

**You may work on this homework in a group and submit your assignment as a group.** Below are some instructions on working as a group.  
- The maximum group size is 3. 
- Use group work as an opportunity to collaborate and learn new things from each other. 
- Be respectful to each other and make sure you understand all the concepts in the assignment well. 
- It's your responsibility to make sure that the assignment is submitted by one of the group members before the deadline. 
- You can find the instructions on how to do group submission on Gradescope [here](https://help.gradescope.com/article/m5qz2xsnjy-student-add-group-members).

<br><br>

## Introduction <a name="in"></a>
<hr>

At this point we are at the end of the supervised machine learning part of the course. In this homework, you will work on an open-ended mini-project, where you will put together the different things you have learned so far to solve an interesting problem.

A few notes and tips when you work on this mini-project: 

#### Tips

1. This mini-project is open-ended, and while working on it, there might be some situations where you'll have to use your own judgment and make your own decisions (as you would be doing when you work as a data scientist). Make sure you explain your decisions whenever necessary. 
2. **Do not include everything you ever tried in your submission** -- it's fine just to have your final code. That said, your code should be reproducible and well-documented. For example, if you chose your hyperparameters based on some hyperparameter optimization experiment, you should leave in the code for that experiment so that someone else could re-run it and obtain the same hyperparameters, rather than mysteriously just setting the hyperparameters to some (carefully chosen) values in your code. 
3. If you realize that you are repeating a lot of code, try to organize it into functions. Clear presentation of your code, experiments, and results is the key to success in this lab. You may use code from lecture notes or previous lab solutions with appropriate attribution. 
4. If you are having trouble running models on your laptop because of the size of the dataset, you can create your train/test split in such a way that you have less data in the train split. If you end up doing this, please write a note to the grader in the submission explaining why you are doing it.  

#### Assessment

We plan to grade fairly and leniently. We don't have some secret target score that you need to achieve to get a good grade. **You'll be assessed on demonstration of mastery of course topics, clear presentation, and the quality of your analysis and results.** For example, if you just have a bunch of code and no text or figures, that's not good. If you do a bunch of sane things and get a lower accuracy than your friend, don't sweat it.

#### A final note

Finally, the style of this "project" question is different from other assignments. It'll be up to you to decide when you're "done" -- in fact, this is one of the hardest parts of real projects. But please don't spend WAY too much time on this... perhaps "a few hours" (2-8 hours???) is a good guideline for a typical submission. Of course if you're having fun you're welcome to spend as much time as you want! But, if so, try not to do it out of perfectionism or to get the best possible grade. Do it because you're learning and enjoying it. Students from past cohorts have found this kind of lab useful and fun, and I hope you enjoy it as well. 

<br><br>

## 1. Understanding the problem <a name="1"></a>
<hr>
rubric={points:4}

In this mini project, you will be working on a classification problem of predicting whether a customer will cancel the reservation they have made at a hotel. 
For this problem, you will use the [Reservation Cancellation Prediction Dataset](https://www.kaggle.com/datasets/gauravduttakiit/reservation-cancellation-prediction?select=train__dataset.csv). In this dataset, there are about 18,000 examples and 18 features (including the target), and the goal is to estimate whether a person will cancel their booking; this column is labeled "booking_status" in the data (1 = canceled). 

**Your tasks:**

1. Spend some time understanding the problem and what each feature means. You can find this information in the documentation on [the dataset page on Kaggle](https://www.kaggle.com/datasets/gauravduttakiit/reservation-cancellation-prediction?select=train__dataset.csv). Write a few sentences on your initial thoughts on the problem and the dataset. 
2. Download the dataset and read it as a pandas dataframe. 

1. The goal of this experiment is to create a model that helps hotels predict whether or not a reservation will be honored, based on information about the booking and the customer. There are 18 columns in the dataset in total, including the target. All of them are numerical. Ten of these features are not continuous; their values represent discrete classes. Three of these discrete features are binary and the remaining seven are ordinally encoded. The target column is binary, so this is a classification problem and not a regression problem. 

In [2]:
df = pd.read_csv("train__dataset.csv")

<br><br>

## 2. Data splitting <a name="2"></a>
<hr>
rubric={points:2}

**Your tasks:**

1. Split the data into train and test portions. 

In [3]:
train_df, test_df = train_test_split(df, test_size = 0.2, random_state = 123)
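
The split above is a plain random split. Since only about a third of the examples are cancellations (see the mean of `booking_status` in the summary statistics), one option is to stratify the split on the target so both portions keep the same class balance. The sketch below uses a small synthetic frame (the `toy_*` names and values are made up for illustration) rather than the actual dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the booking data: roughly one third positive labels
rng = np.random.default_rng(123)
toy_df = pd.DataFrame(
    {
        "lead_time": rng.integers(0, 400, size=1000),
        "booking_status": (rng.random(1000) < 0.33).astype(int),
    }
)

# stratify keeps the class proportions (nearly) identical in both splits
toy_train, toy_test = train_test_split(
    toy_df, test_size=0.2, random_state=123, stratify=toy_df["booking_status"]
)

train_rate = toy_train["booking_status"].mean()
test_rate = toy_test["booking_status"].mean()
print(f"positive rate: train={train_rate:.3f}, test={test_rate:.3f}")
```

With 14,509 training rows a random split is already likely to be close to balanced, so stratification here mainly removes one source of run-to-run variation.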

<br><br>

## 3. EDA <a name="3"></a>
<hr>
rubric={points:10}

**Your tasks:**

1. Perform exploratory data analysis on the train set.
2. Include at least two summary statistics and two visualizations that you find useful, and accompany each one with a sentence explaining it.
3. Summarize your initial observations about the data. 
4. Pick appropriate metric/metrics for assessment. 

In [4]:
#use df.head() to assess scales of features
train_df.head()
# Scales of features are different. arrival_year requires ordinal encoding or OHE?

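| | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 800 | 2 | 0 | 1 | 3 | 0 | 0 | 0 | 23 | 2018 | 1 | 7 | 1 | 0 | 0 | 0 | 87.0 | 2 | 0 |
| 13544 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 15 | 2018 | 2 | 19 | 2 | 0 | 0 | 0 | 81.0 | 0 | 0 |
| 14555 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 3 | 2017 | 11 | 23 | 2 | 1 | 0 | 1 | 65.0 | 0 | 0 |
| 11224 | 2 | 0 | 0 | 2 | 0 | 0 | 1 | 148 | 2018 | 7 | 8 | 1 | 0 | 0 | 0 | 136.8 | 1 | 1 |
| 10890 | 2 | 1 | 2 | 4 | 0 | 0 | 0 | 61 | 2018 | 7 | 23 | 1 | 0 | 0 | 0 | 121.5 | 0 | 1 |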


In [5]:
#use df.info() to assess feature value types and missing values
train_df.info()
# No missing values found. avg_price_per_room has float dtype instead of integer.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14509 entries, 800 to 15725
Data columns (total 18 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   no_of_adults                          14509 non-null  int64  
 1   no_of_children                        14509 non-null  int64  
 2   no_of_weekend_nights                  14509 non-null  int64  
 3   no_of_week_nights                     14509 non-null  int64  
 4   type_of_meal_plan                     14509 non-null  int64  
 5   required_car_parking_space            14509 non-null  int64  
 6   room_type_reserved                    14509 non-null  int64  
 7   lead_time                             14509 non-null  int64  
 8   arrival_year                          14509 non-null  int64  
 9   arrival_month                         14509 non-null  int64  
 10  arrival_date                          14509 non-null  int64  
 11  market_segmen

In [6]:
#use df.describe() to assess summary statistics on dataset.
train_df.describe()

| | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 14509.0 | 14509.0 | 14509.0 | 14509.0 | 14509.0 | 14509.0 | 14509.0 | 14509.0 | 14509.0 | 14509.0 | 14509.0 | 14509.0 | 14509.0 | 14509.0 | 14509.0 | 14509.0 | 14509.0 | 14509.0 |
| mean | 1.849059 | 0.107106 | 0.809566 | 2.196223 | 0.319526 | 0.032532 | 0.337859 | 85.211524 | 2017.821628 | 7.422979 | 15.711765 | 0.803708 | 0.025364 | 0.022331 | 0.153629 | 103.558628 | 0.621476 | 0.325867 |
| std | 0.51497 | 0.398989 | 0.869672 | 1.414027 | 0.630462 | 0.177413 | 0.77562 | 86.901659 | 0.382839 | 3.079938 | 8.759235 | 0.644105 | 0.157232 | 0.359667 | 1.720322 | 35.546998 | 0.78942 | 0.468714 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2017.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 2.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 16.0 | 2018.0 | 5.0 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 | 80.3 | 0.0 | 0.0 |
| 50% | 2.0 | 0.0 | 1.0 | 2.0 | 0.0 | 0.0 | 0.0 | 57.0 | 2018.0 | 8.0 | 16.0 | 1.0 | 0.0 | 0.0 | 0.0 | 99.45 | 0.0 | 0.0 |
| 75% | 2.0 | 0.0 | 2.0 | 3.0 | 0.0 | 0.0 | 0.0 | 126.0 | 2018.0 | 10.0 | 23.0 | 1.0 | 0.0 | 0.0 | 0.0 | 120.6 | 1.0 | 1.0 |
| max | 4.0 | 3.0 | 7.0 | 17.0 | 3.0 | 1.0 | 6.0 | 443.0 | 2018.0 | 12.0 | 31.0 | 4.0 | 1.0 | 13.0 | 58.0 | 540.0 | 5.0 | 1.0 |


In [7]:
#creating X_train and y_train
X_train = train_df.drop(columns=["booking_status"])
y_train = train_df["booking_status"]

X_test = test_df.drop(columns=["booking_status"])
y_test = test_df["booking_status"]

3. Data Observations
- 18 features including target (booking_status)
- all numerical (17 integer dtypes, 1 float dtype)
- type_of_meal_plan, required_car_parking_space, room_type_reserved, market_segment_type, and repeated_guest are integer-coded categories, so they are candidates for ordinal encoding or one-hot encoding (OHE)


4. Metrics for Evaluation
- Recall, since we want to catch as many of the customers who cancel their booking as possible. Recall measures the fraction of actual cancellations (positive examples) that the model correctly identifies. 
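
To track recall next to accuracy during cross-validation, `cross_validate` accepts a list of scorer names through its `scoring` parameter. The following is a minimal sketch on synthetic, imbalanced data (`X_toy`, `y_toy`, and the plain logistic regression are placeholders, not the notebook's pipeline).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic, imbalanced binary data (about 2:1), standing in for the bookings
X_toy, y_toy = make_classification(
    n_samples=600, n_features=8, weights=[0.67, 0.33], random_state=123
)

scoring = ["accuracy", "precision", "recall", "f1"]
cv_scores = cross_validate(
    LogisticRegression(max_iter=1000), X_toy, y_toy, cv=5, scoring=scoring
)

# One row per metric: mean and standard deviation over the 5 folds
summary = pd.DataFrame(cv_scores).filter(like="test_").agg(["mean", "std"]).T
print(summary)
```

The same `scoring` argument can be passed to `GridSearchCV` or `RandomizedSearchCV`; with several metrics, those classes also need `refit` set to the metric used to pick the best candidate.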
<br><br>

<br><br>

## (Optional) 4. Feature engineering <a name="4"></a>
<hr>
rubric={points:1}

**Your tasks:**

1. Carry out feature engineering. In other words, extract new features relevant for the problem and work with your new feature set in the following exercises. You may have to go back and forth between feature engineering and preprocessing. 

One possible feature: replace no_of_previous_cancellations and no_of_previous_bookings_not_canceled with a past_cancellation_rate, calculated as the number of previous cancellations divided by the total number of previous bookings (both canceled and not canceled). This feature is not implemented in the pipeline used in the following sections. 
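
A minimal sketch of how such a rate could be computed with pandas. The helper name `add_past_cancellation_rate` is hypothetical, and assigning a rate of 0.0 to guests with no booking history is an assumption (most guests have no previous bookings, so the 0/0 case needs an explicit choice).

```python
import numpy as np
import pandas as pd


def add_past_cancellation_rate(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a past_cancellation_rate column added."""
    out = df.copy()
    total = (
        out["no_of_previous_cancellations"]
        + out["no_of_previous_bookings_not_canceled"]
    )
    # Guests with no history would give 0/0; treat their rate as 0.0 (an assumption)
    out["past_cancellation_rate"] = (
        out["no_of_previous_cancellations"] / total.replace(0, np.nan)
    ).fillna(0.0)
    return out


toy = pd.DataFrame(
    {
        "no_of_previous_cancellations": [0, 1, 3, 0],
        "no_of_previous_bookings_not_canceled": [0, 1, 1, 5],
    }
)
print(add_past_cancellation_rate(toy)["past_cancellation_rate"].tolist())
# [0.0, 0.5, 0.75, 0.0]
```

Because the rate uses only columns of the same row, it can be computed before the train/test split without leaking information from the test set.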


<br><br>

<br><br>

## 5. Preprocessing and transformations <a name="5"></a>
<hr>
rubric={points:10}

**Your tasks:**

1. Identify different feature types and the transformations you would apply on each feature type. 
2. Define a column transformer, if necessary. 

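1. Feature types 
 
 - All features are numeric
 - There are no missing or null values, so no imputation is needed
 - The count-like and continuous numeric features will be scaled with StandardScaler()
 - Although all features are numeric, some are binary or integer-coded discrete categories and require OHE. These will be labeled as "categorical" features
 
 - arrival_year and arrival_month will be dropped in this pipeline. Note that arrival_date holds only the day of the month (1 to 31), so it does not carry the year or month; dropping these two columns is a simplification that discards any yearly or seasonal pattern. 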
 

<br><br>

In [8]:
#classifying features by transformations
numerical_feats = ["no_of_adults", "no_of_children", "no_of_weekend_nights", "no_of_week_nights", "lead_time", "arrival_date", "no_of_previous_cancellations", "no_of_previous_bookings_not_canceled", "avg_price_per_room", "no_of_special_requests"]

categorical_feats = ["type_of_meal_plan", "required_car_parking_space", "room_type_reserved", "market_segment_type", "repeated_guest"]

drop_features = ["arrival_year", "arrival_month"]

In [9]:
#defining feature transformers
numeric_transformer = make_pipeline(StandardScaler())

categorical_transformer = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
)

#creating a column_transformer preprocessing object for the dataset
preprocessor = make_column_transformer(
    (numeric_transformer, numerical_feats),
    (categorical_transformer, categorical_feats),
    ("drop", drop_features),
)

In [10]:
#transforming our X data
X_transformed = preprocessor.fit_transform(X_train)

<br><br>

## 6. Baseline model <a name="6"></a>
<hr>

rubric={points:2}

**Your tasks:**
1. Try `scikit-learn`'s baseline model and report results.

In [11]:
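# DummyClassifier ignores the feature values; "prior" always predicts the most frequent class
sk_dc = DummyClassifier(strategy="prior")

sk_dc.fit(X_transformed, y_train)

# Transform the test features with the already-fitted preprocessor for consistency
X_test_transformed = preprocessor.transform(X_test)

print(sk_dc.predict(X_test_transformed))
print(sk_dc.predict_proba(X_test_transformed))
print(sk_dc.score(X_test_transformed, y_test))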

[0 0 0 ... 0 0 0]
[[0.6741333 0.3258667]
 [0.6741333 0.3258667]
 [0.6741333 0.3258667]
 ...
 [0.6741333 0.3258667]
 [0.6741333 0.3258667]
 [0.6741333 0.3258667]]
0.665380374862183
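
The baseline above is scored on the test split. An alternative that leaves the test set untouched is to cross-validate the dummy model, and to report recall as well, since a majority-class predictor never predicts a cancellation. A small sketch on synthetic data with a similar class balance (all names and data here are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate

# Synthetic data with roughly two thirds negatives, one third positives
X_toy, y_toy = make_classification(
    n_samples=600, n_features=5, weights=[0.67, 0.33], random_state=123
)

dummy_scores = cross_validate(
    DummyClassifier(strategy="prior"),
    X_toy,
    y_toy,
    cv=5,
    scoring=["accuracy", "recall"],
)

dummy_accuracy = dummy_scores["test_accuracy"].mean()
dummy_recall = dummy_scores["test_recall"].mean()
print(f"accuracy={dummy_accuracy:.3f}, recall={dummy_recall:.3f}")
```

The accuracy matches the share of the majority class while the recall is zero, which is why accuracy alone can make a useless model look acceptable on imbalanced data.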


<br><br>

## 7. Linear models <a name="7"></a>
<hr>
rubric={points:12}

**Your tasks:**

1. Try logistic regression as a first real attempt. 
2. Carry out hyperparameter tuning to explore different values for the complexity hyperparameter `C`. 
3. Report validation scores along with standard deviation. 
4. Summarize your results.

In [12]:
param_grid = {
    'logisticregression__C': 10.0 ** np.arange(-2, 2, 0.5),  # 8 values of C from 0.01 to about 31.6
}

pipe_lr = make_pipeline(preprocessor, LogisticRegression(max_iter = 1000))

In [13]:
grid_search = GridSearchCV(
    pipe_lr, param_grid, cv = 5, n_jobs = -1, return_train_score = True
)

grid_search.fit(X_train, y_train)
print(grid_search.best_score_)
print(grid_search.best_params_)

0.8023984219895179
{'logisticregression__C': 0.1}


In [14]:
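# Nested cross-validation: each outer fold re-runs the whole grid search
scores_grid = cross_validate(
    grid_search, X_train, y_train, return_train_score=True
)
scores_grid_df = pd.DataFrame(scores_grid)  # kept for reference; not displayed

# Per-candidate results of the grid search fitted in the previous cell
results = pd.DataFrame(grid_search.cv_results_)
results.T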

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| mean_fit_time | 0.160558 | 0.188002 | 0.294496 | 0.447148 | 0.682286 | 1.05253 | 1.734358 | 1.225265 |
| std_fit_time | 0.011651 | 0.007188 | 0.0179 | 0.03909 | 0.035951 | 0.187779 | 0.229477 | 0.141805 |
| mean_score_time | 0.017618 | 0.017972 | 0.016111 | 0.01836 | 0.020209 | 0.022769 | 0.017809 | 0.01421 |
| std_score_time | 0.001629 | 0.003233 | 0.005314 | 0.002322 | 0.006176 | 0.007731 | 0.003684 | 0.007048 |
| param_logisticregression__C | 0.01 | 0.031623 | 0.1 | 0.316228 | 1.0 | 3.162278 | 10.0 | 31.622777 |
| params | {'logisticregression__C': 0.01} | {'logisticregression__C': 0.03162277660168379} | {'logisticregression__C': 0.1} | {'logisticregression__C': 0.31622776601683794} | {'logisticregression__C': 1.0} | {'logisticregression__C': 3.1622776601683795} | {'logisticregression__C': 10.0} | {'logisticregression__C': 31.622776601683793} |
| split0_test_score | 0.800138 | 0.80255 | 0.801861 | 0.801172 | 0.800138 | 0.801516 | 0.801861 | 0.801172 |
| split1_test_score | 0.804618 | 0.807374 | 0.811165 | 0.810131 | 0.809786 | 0.810476 | 0.810476 | 0.81082 |
| split2_test_score | 0.79807 | 0.796692 | 0.797381 | 0.796692 | 0.797381 | 0.797037 | 0.797037 | 0.797037 |
| split3_test_score | 0.793935 | 0.799449 | 0.800482 | 0.800482 | 0.800482 | 0.800827 | 0.801172 | 0.800827 |


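4. Logistic regression reached a best mean cross-validation accuracy of about 0.802 at C = 0.1. In the splits shown above, the validation scores change very little across the values of C tried (roughly 0.79 to 0.81), so the model is not very sensitive to C in this range. <br><br>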
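
For reporting validation scores together with their standard deviation, the `cv_results_` attribute of a fitted search already contains `mean_test_score` and `std_test_score` for every candidate. A minimal sketch on synthetic data (the `toy_*` names and the coarser grid are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=123)

toy_pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
toy_grid = {"logisticregression__C": 10.0 ** np.arange(-2, 2, 1.0)}

toy_search = GridSearchCV(toy_pipe, toy_grid, cv=5, return_train_score=True)
toy_search.fit(X_toy, y_toy)

# Keep only the columns needed for a compact report, best candidate first
cols = [
    "param_logisticregression__C",
    "mean_test_score",
    "std_test_score",
    "mean_train_score",
    "rank_test_score",
]
report = pd.DataFrame(toy_search.cv_results_)[cols].sort_values("rank_test_score")
print(report)
```

Comparing `mean_train_score` with `mean_test_score` per row also shows how the gap between training and validation changes as `C` grows.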

<br><br>

## 8. Different classifiers <a name="8"></a>
<hr>
rubric={points:15}

**Your tasks:**
1. Try at least 3 other models aside from logistic regression. At least one of these models should be a tree-based ensemble model (e.g., lgbm, random forest, xgboost). 
2. Summarize your results. Can you beat logistic regression? 

In [15]:
results2 = pd.DataFrame()

In [16]:
pipe_lgbm = make_pipeline(
    preprocessor, LGBMClassifier(random_state=123, n_jobs=-1)
)

score_lgbm = cross_validate(
    pipe_lgbm, X_train, y_train, return_train_score=True
)

lgbm_df = pd.DataFrame(score_lgbm)

print(lgbm_df.head())

   fit_time  score_time  test_score  train_score
0  0.177237    0.019094    0.873880     0.898337
1  0.157623    0.020832    0.888008     0.894460
2  0.156438    0.017781    0.865610     0.897562
3  0.155025    0.017795    0.864232     0.895580
4  0.170316    0.017595    0.873492     0.896795


In [17]:
pipe_rf = make_pipeline(
    preprocessor, RandomForestClassifier(random_state=123, n_jobs=-1)
)

score_rf = cross_validate (
    pipe_rf, X_train, y_train, return_train_score=True
)

rf_df = pd.DataFrame(score_rf)

print(rf_df.head())

   fit_time  score_time  test_score  train_score
0  0.888175    0.053411    0.876292     0.996812
1  0.799578    0.052953    0.888353     0.996037
2  0.775968    0.050255    0.874914     0.996381
3  0.757759    0.052705    0.872846     0.996468
4  0.738213    0.051479    0.875215     0.996813


In [18]:
pipe_xg = make_pipeline(
    preprocessor, xgb.XGBClassifier(n_jobs=-1)
)

score_xg = cross_validate(
    pipe_xg, X_train, y_train, return_train_score=True
)

xg_df = pd.DataFrame(score_xg)

print(xg_df.head())

   fit_time  score_time  test_score  train_score
0  1.174755    0.016551    0.870434     0.932885
1  1.093472    0.015378    0.883529     0.926510
2  1.159798    0.016822    0.862164     0.930042
3  1.100173    0.018193    0.862853     0.931679
4  1.127384    0.015990    0.870390     0.933236


2. All three models reached a higher validation score than logistic regression. Random forest, LightGBM, and XGBoost all had validation scores >= 0.85, while the best mean validation score of logistic regression was about 0.80. LightGBM appears to overfit the least: its validation scores are similar to those of the other models, but its training scores (about 0.90) are much closer to its validation scores than those of random forest (about 0.997) or XGBoost (about 0.93). This could indicate that it is the more trustworthy model in this case. <br><br>
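
Since the same `cross_validate` call is repeated for every model, the comparison could be organized with a small helper that returns the mean and standard deviation per score. The sketch below is one way to do it; the helper name `mean_std_cv_scores`, the synthetic data, and the choice of models are placeholders (only scikit-learn models are used so the block stays self-contained).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier


def mean_std_cv_scores(model, X, y, **kwargs):
    """Cross-validate model and return 'mean (+/- std)' strings per score column."""
    scores = pd.DataFrame(cross_validate(model, X, y, **kwargs))
    mean, std = scores.mean(), scores.std()
    return pd.Series(
        {name: f"{mean[name]:.3f} (+/- {std[name]:.3f})" for name in scores.columns}
    )


X_toy, y_toy = make_classification(n_samples=400, n_features=8, random_state=123)

toy_models = {
    "dummy": DummyClassifier(strategy="prior"),
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=123),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=123),
}

# One row per model, one column per score returned by cross_validate
comparison = pd.DataFrame(
    {
        name: mean_std_cv_scores(model, X_toy, y_toy, return_train_score=True)
        for name, model in toy_models.items()
    }
).T
print(comparison)
```

In the notebook, the dictionary values would be the full pipelines (for example `pipe_lgbm`, `pipe_rf`, `pipe_xg`) so that preprocessing is re-fitted inside each fold.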

<br><br>

## (Optional) 9. Feature selection <a name="9"></a>
<hr>
rubric={points:1}

**Your tasks:**

Make some attempts to select relevant features. You may try `RFECV` or forward selection. Do the results improve with feature selection? Summarize your results. If you see improvements in the results, keep feature selection in your pipeline. If not, you may abandon it in the next exercises. 

In [19]:
pipe_lgbm = make_pipeline(
    preprocessor, RFECV(estimator=LGBMClassifier()), LGBMClassifier(random_state=123, n_jobs=-1)
)

score_lgbm = cross_validate(
    pipe_lgbm, X_train, y_train, return_train_score=True
)

lgbm_df = pd.DataFrame(score_lgbm)

print(lgbm_df.head())

    fit_time  score_time  test_score  train_score
0  19.327819    0.024214    0.874914     0.898251
1  22.214469    0.021617    0.885252     0.891789
2  19.507927    0.018233    0.865610     0.898854
3  19.914379    0.017965    0.866644     0.893340
4  19.247902    0.017849    0.871768     0.896968


In [20]:
pipe_rf = make_pipeline(
    preprocessor, RFECV(estimator=RandomForestClassifier()), RandomForestClassifier(random_state=123, n_jobs=-1)
)

score_rf = cross_validate (
    pipe_rf, X_train, y_train, return_train_score=True
)

rf_df = pd.DataFrame(score_rf)

print(rf_df.head())

     fit_time  score_time  test_score  train_score
0  139.741779    0.064121    0.878015     0.996898
1  150.198344    0.047526    0.887664     0.996123
2  146.693794    0.047705    0.870434     0.996209
3  134.986772    0.051656    0.872157     0.996468
4  143.351186    0.047997    0.875215     0.996813


In [21]:
pipe_xg = make_pipeline(
    preprocessor, RFECV(estimator=xgb.XGBClassifier()), xgb.XGBClassifier(n_jobs=-1)
)

score_xg = cross_validate(
    pipe_xg, X_train, y_train, return_train_score=True
)

xg_df = pd.DataFrame(score_xg)

print(xg_df.head())

     fit_time  score_time  test_score  train_score
0  105.564357    0.016455    0.872846     0.916861
1   93.967879    0.016358    0.890765     0.926251
2   87.493266    0.016283    0.862164     0.930042
3   91.667238    0.017118    0.866644     0.930559
4   91.371901    0.016172    0.870390     0.932633


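Using RFECV did not noticeably change the validation scores of any of the three models: the per-fold scores are within about 0.01 of those obtained without feature selection, while fit times increased from about a second or less to roughly 20 to 150 seconds per fold. The results do not improve with feature selection. <br><br>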
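
To see which features `RFECV` actually keeps, the fitted selector exposes `n_features_` and the boolean mask `support_`. A minimal sketch on synthetic data, using logistic regression as the ranking estimator to keep it fast (an assumption; the cells above use the tree-based models themselves):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# 10 synthetic features, of which only 3 are informative
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=123
)

selector = RFECV(LogisticRegression(max_iter=1000), cv=5)
selector.fit(X_toy, y_toy)

print(selector.n_features_)  # number of features kept
print(selector.support_)     # boolean mask over the input columns

X_selected = selector.transform(X_toy)
print(X_selected.shape)
```

Inside a pipeline, the same attributes are reachable through `named_steps["rfecv"]` after fitting.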

<br><br>

## 10. Hyperparameter optimization <a name="10"></a>
<hr>
rubric={points:15}

**Your tasks:**

Make some attempts to optimize hyperparameters for the models you've tried and summarize your results. You may pick one of the best performing models from the previous exercise and tune hyperparameters only for that model. You may use `sklearn`'s methods for hyperparameter optimization or fancier Bayesian optimization methods. 
  - [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)   
  - [RandomizedSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)
  - [scikit-optimize](https://github.com/scikit-optimize/scikit-optimize)

I will use LightGBM for hyperparameter optimization.

In [25]:


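# pipe_lgbm was last redefined with an RFECV step in the feature selection section,
# which makes every search iteration very slow (the run recorded below was interrupted).
# Build a plain preprocessing + LightGBM pipeline for the search instead.
pipe_lgbm = make_pipeline(
    preprocessor, LGBMClassifier(random_state=123, n_jobs=-1)
)

param_dist = {
    'lgbmclassifier__n_estimators': randint(100, 500),
    'lgbmclassifier__learning_rate': [0.01, 0.1, 1]
}

opt = RandomizedSearchCV(
    pipe_lgbm,
    param_distributions=param_dist,
    cv=5,
    n_jobs=-1,
    n_iter=10,
    random_state=123,  # makes the sampled candidates reproducible
    return_train_score=True
)

# Nested cross-validation: the outer folds score the whole search procedure
score_lgbm = cross_validate(
   opt, X_train, y_train, return_train_score=True
)

lgbm_df = pd.DataFrame(score_lgbm)

print(lgbm_df.head())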

KeyboardInterrupt: 
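
A cheaper way to obtain tuned hyperparameters is to fit the search object once and read `best_params_` and `best_score_`, instead of wrapping the search in another `cross_validate`. The sketch below shows the pattern on synthetic data, with scikit-learn's `GradientBoostingClassifier` swapped in for LightGBM so the block is self-contained; the parameter ranges are illustrative, not recommendations.

```python
from scipy.stats import loguniform, randint
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=123)

toy_param_dist = {
    "n_estimators": randint(20, 80),           # integers in [20, 80)
    "learning_rate": loguniform(1e-2, 1e0),    # sampled on a log scale
}

toy_search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=123),
    param_distributions=toy_param_dist,
    n_iter=5,
    cv=3,
    random_state=123,  # reproducible sampling of candidates
    return_train_score=True,
)
toy_search.fit(X_toy, y_toy)

print(toy_search.best_params_)
print(round(toy_search.best_score_, 3))
```

With a pipeline, the parameter names need the step prefix, as in `lgbmclassifier__n_estimators` above. Sampling the learning rate from a continuous log-scale distribution covers the range more evenly than a short list of fixed values.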

<br><br>

## 11. Interpretation and feature importances <a name="11"></a>
<hr>
rubric={points:15}

**Your tasks:**

1. Use the methods we saw in class (e.g., `eli5`, `shap`) (or any other methods of your choice) to explain feature importances of one of the best performing models. Summarize your observations. 

In [26]:
print(pipe_lgbm.named_steps)

{'columntransformer': ColumnTransformer(transformers=[('pipeline-1',
                                 Pipeline(steps=[('standardscaler',
                                                  StandardScaler())]),
                                 ['no_of_adults', 'no_of_children',
                                  'no_of_weekend_nights', 'no_of_week_nights',
                                  'lead_time', 'arrival_date',
                                  'no_of_previous_cancellations',
                                  'no_of_previous_bookings_not_canceled',
                                  'avg_price_per_room',
                                  'no_of_special_requests']),
                                ('pipeline-2',
                                 Pipeline(steps=[('onehotencoder',
                                                  OneHotEncoder(handle_unknown='ignore'))]),
                                 ['type_of_meal_plan',
                                  'required_car_parking_space'

In [27]:
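pipe_lgbm = make_pipeline(
    preprocessor, LGBMClassifier(random_state=123, n_jobs=-1)
)

# Fit first so the fitted transformers (and their output feature names) are available
pipe_lgbm.fit(X_train, y_train)

# Names of the one-hot encoded columns, in the order the encoder outputs them
ohe_feature_names = (
    pipe_lgbm.named_steps["columntransformer"]
    .named_transformers_["pipeline-2"]
    .named_steps["onehotencoder"]
    .get_feature_names_out()
    .tolist()
)

# Column order matches the ColumnTransformer: numeric features first, then one-hot columns
feature_names = numerical_feats + ohe_feature_names

eli5.explain_weights(pipe_lgbm.named_steps["lgbmclassifier"], feature_names=feature_names)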

| Weight | Feature |
|---|---|
| 0.4073 | lead_time |
| 0.1698 | avg_price_per_room |
| 0.1284 | no_of_special_requests |
| 0.1143 | market_segment_type_1 |
| 0.0401 | arrival_date |
| 0.0287 | no_of_adults |
| 0.0284 | no_of_week_nights |
| 0.0267 | no_of_weekend_nights |
| 0.0149 | required_car_parking_space_0 |
| 0.0119 | market_segment_type_0 |

1. From the eli5 summary, we can see that lead_time, avg_price_per_room, no_of_special_requests, and market_segment_type_1 have the highest weights in the feature importances of the LGBMClassifier model. lead_time alone has a weight of about 0.41, more than twice that of the next feature. <br><br>
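
As a cross-check on model-specific importances, scikit-learn's `permutation_importance` measures how much a score drops when one feature is shuffled. This is a different technique from the eli5 weights above, shown here only as a sketch on synthetic data with a random forest (feature names such as `feature_0` are made up).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# With shuffle=False the two informative features are the first two columns
X_toy, y_toy = make_classification(
    n_samples=500,
    n_features=6,
    n_informative=2,
    n_redundant=0,
    shuffle=False,
    random_state=123,
)
toy_names = [f"feature_{i}" for i in range(6)]

X_tr, X_val, y_tr, y_val = train_test_split(X_toy, y_toy, random_state=123)
toy_model = RandomForestClassifier(n_estimators=100, random_state=123).fit(X_tr, y_tr)

# Importances are computed on held-out data, not on the training split
perm = permutation_importance(toy_model, X_val, y_val, n_repeats=10, random_state=123)
importances = pd.Series(perm.importances_mean, index=toy_names).sort_values(
    ascending=False
)
print(importances)
```

Passing the whole fitted pipeline and the untransformed validation frame to `permutation_importance` would give importances per original column rather than per one-hot column.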

<br><br>

## 12. Results on the test set <a name="12"></a>
<hr>

rubric={points:5}

**Your tasks:**

1. Try your best performing model on the test data and report test scores. 
2. Do the test scores agree with the validation scores from before? To what extent do you trust your results? Do you think you've had issues with optimization bias? 

In [29]:
y_pred = pipe_lgbm.predict(X_test)

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Report the evaluation metrics
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 score: {f1:.4f}")

Accuracy: 0.8705
Precision: 0.8419
Recall: 0.7545
F1 score: 0.7958


2. The test scores are similar to the mean cross-validation scores obtained earlier (about 0.87 accuracy for LightGBM). The model performs reasonably well, but there is still room for improvement, particularly for recall: a recall of 0.7545 means that about a quarter of the actual cancellations in the test set were missed (false negatives). Furthermore, since the performance on the training data and the test data is similar, there seems to be little overfitting to the training data. Therefore, we would not expect major issues with optimization bias. 
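
The link between recall and false negatives can be read directly from a confusion matrix. The sketch below uses ten hypothetical labels and predictions (1 = canceled), not the notebook's results.

```python
from sklearn.metrics import classification_report, confusion_matrix, recall_score

# Hypothetical labels and predictions, for illustration only
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_hat = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print(tn, fp, fn, tp)  # 5 1 1 3

# recall = tp / (tp + fn) = 3 / 4
toy_recall = recall_score(y_true, y_hat)
print(toy_recall)  # 0.75

print(classification_report(y_true, y_hat, target_names=["not canceled", "canceled"]))
```

Applied to `y_test` and `y_pred`, the same calls would show how many cancellations the model missed in absolute numbers.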

<br><br>

## (Optional) 13. Explaining predictions <a name="13"></a>
rubric={points:1}

**Your tasks**

1. Take one or two test predictions and explain them with SHAP force plots.  

<br><br>

## 14. Summary of results <a name="14"></a>
<hr>
rubric={points:10}

**Your tasks:**

1. Report your final test score along with the metric you used. 
2. Write concluding remarks.
3. Discuss other ideas that you did not try but could potentially improve the performance/interpretability . 

1. The final test accuracy is approximately 0.8705 on the test split, with a recall of 0.7545.
2. These two scores suggest that our LightGBM classifier is performing well on this classification problem, correctly predicting 87.05% of cases in the test split. We observed that ensemble models were generally more accurate than simple logistic regression (with scores around 80%) and the DummyClassifier (with scores around 66%). Hyperparameter optimization with GridSearchCV did not significantly increase the test scores; however, it did raise the LGBM model's training scores from ~85% to over 95% in some cases, which may signify some overfitting. Furthermore, feature importance analysis showed that lead_time, avg_price_per_room, no_of_special_requests, and the OHE feature market_segment_type_1 were the most predictive features of whether or not someone would cancel their booking. 
3. Hyperparameter optimization using Bayesian optimization could potentially improve performance thanks to its guided search of the parameter space, whereas the current method using grid search is restricted to a slow, exhaustive search over the predefined parameter grid.

<br><br><br><br>

## Submission instructions <a name="si"></a>

**PLEASE READ:** When you are ready to submit your assignment do the following:

1. Run all cells in your notebook to make sure there are no errors by doing `Kernel -> Restart Kernel and Clear All Outputs` and then `Run -> Run All Cells`. 
2. Notebooks with cell execution numbers out of order or not starting from “1” will have marks deducted. Notebooks without the output displayed may not be graded at all (because we need to see the output in order to grade your work).
3. Upload the assignment using Gradescope's drag and drop tool. Check out this [Gradescope Student Guide](https://lthub.ubc.ca/guides/gradescope-student-guide/) if you need help with Gradescope submission. 