# Assignment 12: Predicting Hotel Booking Cancellations  
## Models: Naïve Bayes, Support Vector Machine (SVM), and Neural Network

**Objectives:**
- Understand how to use classification models (Naïve Bayes, SVM, Neural Networks) to predict hotel cancellations.
- Compare models in terms of accuracy, complexity, and business relevance.
- Interpret and communicate model results from a business perspective.

## Business Scenario

You work as a data analyst for a hospitality group that manages both **Resort** and **City Hotels**. One major challenge in operations is the unpredictability of **booking cancellations**, which affects staffing, inventory, and revenue planning.

You've been asked to use historical booking data to predict whether a future booking will be canceled. Your insights will help management plan more effectively.


Your task is to:
1. Build and evaluate three models: Naïve Bayes, SVM, and Neural Network.
2. Compare performance.
3. Recommend which model is best suited for the business needs.

<a href="https://colab.research.google.com/github/Stan-Pugsley/is_4487_base/blob/main/Assignments/assignment_12_bayes_svm_neural.ipynb" target="_parent">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>


## Dataset Description: Hotel Bookings

This dataset contains booking information for two types of hotels: a **city hotel** and a **resort hotel**. Each record corresponds to a single booking and includes various details about the reservation, customer demographics, booking source, and whether the booking was canceled.

**Source**: [GitHub - TidyTuesday: Hotel Bookings](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/readme.md)

### Key Use Cases
- Understand customer booking behavior
- Explore factors related to cancellations
- Segment guests based on booking characteristics
- Compare city vs. resort hotel performance

### Data Dictionary

| Variable | Type | Description |
|----------|------|-------------|
| `hotel` | character | Hotel type: City or Resort |
| `is_canceled` | integer | 1 = Canceled, 0 = Not Canceled |
| `lead_time` | integer | Days between booking and arrival |
| `arrival_date_year` | integer | Year of arrival |
| `arrival_date_month` | character | Month of arrival |
| `stays_in_weekend_nights` | integer | Nights stayed on weekends |
| `stays_in_week_nights` | integer | Nights stayed on weekdays |
| `adults` | integer | Number of adults |
| `children` | integer | Number of children |
| `babies` | integer | Number of babies |
| `meal` | character | Type of meal booked |
| `country` | character | Country code of origin |
| `market_segment` | character | Booking source (e.g., Direct, Online TA) |
| `distribution_channel` | character | Booking channel used |
| `is_repeated_guest` | integer | 1 = Repeated guest, 0 = New guest |
| `previous_cancellations` | integer | Past booking cancellations |
| `previous_bookings_not_canceled` | integer | Past bookings not canceled |
| `reserved_room_type` | character | Initially reserved room type |
| `assigned_room_type` | character | Room type assigned at check-in |
| `booking_changes` | integer | Number of booking modifications |
| `deposit_type` | character | Deposit type (No Deposit, Non-Refund, etc.) |
| `agent` | character | Agent ID who made the booking |
| `company` | character | Company ID (if booking through company) |
| `days_in_waiting_list` | integer | Days on the waiting list |
| `customer_type` | character | Booking type: Contract, Transient, etc. |
| `adr` | float | Average Daily Rate (price per night) |
| `required_car_parking_spaces` | integer | Requested parking spots |
| `total_of_special_requests` | integer | Number of special requests made |
| `reservation_status` | character | Final status (Canceled, No-Show, Check-Out) |
| `reservation_status_date` | date | Date of the last status update |

This dataset is ideal for classification, segmentation, and trend analysis exercises.


## 1. Load and Prepare the Hotel Booking Dataset

**Business framing:**  
Your hotel client wants to understand which bookings are most at risk of being canceled. But before modeling, your job is to prepare the data to ensure clean and reliable input.

### Do the following:
- Load the `hotels.csv` file from https://raw.githubusercontent.com/Stan-Pugsley/is_4487_base/refs/heads/main/DataSets/hotels.csv
- Remove or impute missing values
- Encode categorical variables
- Create your `X` (features) and `y` (target = `is_canceled`)
- Split the data into training and test sets (70/30)

### In Your Response:
1. How many total rows and columns are in the dataset?
2. What types of features (categorical, numerical) are included?
3. What steps did you take to clean or prepare the data?


In [1]:
# Add code here 🔧
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Load dataset
url = "https://raw.githubusercontent.com/Stan-Pugsley/is_4487_base/refs/heads/main/DataSets/hotels.csv"
hotels = pd.read_csv(url)

# Inspect shape
print("Dataset shape:", hotels.shape)

# Inspect column types and missing values
print(hotels.info())
print(hotels.isna().sum())


# Identify numeric and categorical columns
numeric_cols = hotels.select_dtypes(include=['int64', 'float64']).columns
categorical_cols = hotels.select_dtypes(include=['object']).columns

# Impute numeric missing values with median
# (assigning the result back avoids the pandas chained-inplace warning)
for col in numeric_cols:
    hotels[col] = hotels[col].fillna(hotels[col].median())

# Impute categorical missing values with mode
for col in categorical_cols:
    hotels[col] = hotels[col].fillna(hotels[col].mode()[0])

# Encode categorical variables using LabelEncoder (refit for each column)
le = LabelEncoder()
for col in categorical_cols:
    hotels[col] = le.fit_transform(hotels[col])

# Define features and target.
# Caution: reservation_status and reservation_status_date are recorded after the
# outcome is known, so leaving them in X lets the models read the answer directly.
X = hotels.drop(columns=['is_canceled'])
y = hotels['is_canceled']


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Print
print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)


Dataset shape: (119390, 32)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 32 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   hotel                           119390 non-null  object 
 1   is_canceled                     119390 non-null  int64  
 2   lead_time                       119390 non-null  int64  
 3   arrival_date_year               119390 non-null  int64  
 4   arrival_date_month              119390 non-null  object 
 5   arrival_date_week_number        119390 non-null  int64  
 6   arrival_date_day_of_month       119390 non-null  int64  
 7   stays_in_weekend_nights         119390 non-null  int64  
 8   stays_in_week_nights            119390 non-null  int64  
 9   adults                          119390 non-null  int64  
 10  children                        119386 non-null  float64
 11  babies                          119390 non-null  i


Training set shape: (83573, 31)
Test set shape: (35817, 31)


### ✍️ Your Response: 🔧
1. There are 119,390 rows and 32 columns.

2. There are 17 numerical features: lead_time, arrival_date_year, arrival_date_week_number, arrival_date_day_of_month, stays_in_weekend_nights, stays_in_week_nights, adults, children, babies, is_repeated_guest, previous_cancellations, previous_bookings_not_canceled, booking_changes, days_in_waiting_list, adr, required_car_parking_spaces, total_of_special_requests

There are 12 categorical features: hotel, arrival_date_month, meal, country, market_segment, distribution_channel, reserved_room_type, assigned_room_type, deposit_type, customer_type, reservation_status, reservation_status_date

3. I checked for missing values and identified the columns with missing data. I imputed numeric missing values (for example, `children` and `agent`) with the median and categorical missing values (`country`) with the mode. I then encoded all categorical features with `LabelEncoder` to convert them into numeric values for modeling. I defined `X` as all columns except the target and `y` as the target variable `is_canceled`. Finally, I split the data into 70 percent training (83,573 rows) and 30 percent testing (35,817 rows).
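
One caveat on the preparation above: `reservation_status` and `reservation_status_date` are only filled in once a booking has been resolved, so keeping them in `X` may let a model read the outcome directly. The sketch below uses a small made-up DataFrame (the rows and the `leak_cols` name are illustrative assumptions, not the notebook's data) to show one possible way to drop such columns and one-hot encode the remaining categoricals, which avoids the arbitrary ordering that label encoding implies.

```python
import pandas as pd

# Toy stand-in for the hotels data (values are invented for illustration).
toy = pd.DataFrame({
    "hotel": ["City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel"],
    "lead_time": [10, 200, 35, 90],
    "deposit_type": ["No Deposit", "Non Refund", "No Deposit", "No Deposit"],
    "reservation_status": ["Check-Out", "Canceled", "Check-Out", "Canceled"],
    "reservation_status_date": ["2016-01-05", "2016-02-01", "2016-03-09", "2016-04-02"],
    "is_canceled": [0, 1, 0, 1],
})

# Columns that are only known after the outcome (an assumed list for this sketch).
leak_cols = ["reservation_status", "reservation_status_date"]

y_toy = toy["is_canceled"]
X_toy = toy.drop(columns=["is_canceled"] + leak_cols)

# One-hot encode the remaining categorical columns; numeric columns pass through.
X_toy = pd.get_dummies(X_toy, columns=["hotel", "deposit_type"])

print(X_toy.columns.tolist())
```

Rerunning the models on features prepared this way would likely give lower, but more realistic, scores than the near-perfect ones reported in this notebook.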

## 2. Build a Naïve Bayes Model

**Business framing:**  
Naïve Bayes is a quick, baseline model often used for early testing or simple classification problems.

### Do the following:
- Train a Naïve Bayes classifier on your training data
- Use it to predict on your test data
- Print a classification report and confusion matrix

### In Your Response:
1. How well does the model perform?  And what metric is best used to judge the performance?
2. Where might this model be useful for the hotel (e.g. real-time alerts, operational decisions)?


In [2]:
# Add code here 🔧
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

# Train Naive Bayes classifier
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)

# Predict on test data
y_pred_nb = nb_model.predict(X_test)

# Evaluate performance
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_nb))
print("\nClassification Report:\n", classification_report(y_test, y_pred_nb))
print("Accuracy:", accuracy_score(y_test, y_pred_nb))


Confusion Matrix:
 [[22202   276]
 [    0 13339]]

Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.99      0.99     22478
           1       0.98      1.00      0.99     13339

    accuracy                           0.99     35817
   macro avg       0.99      0.99      0.99     35817
weighted avg       0.99      0.99      0.99     35817

Accuracy: 0.9922941619901164


### ✍️ Your Response: 🔧
1. The Naive Bayes model performs very well. Accuracy is 0.992, and the F1-score is 0.99 for both classes (0 = not canceled, 1 = canceled). The confusion matrix shows 276 false positives and no false negatives. The best metric is the F1-score because it balances precision and recall.

2. This model could support real-time alerts, operational decisions, and marketing decisions. It can flag at-risk bookings so staff can follow up, adjust room inventory, send special offers to guests, or target ads more effectively.
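
To make the link between the confusion matrix and the report explicit, the following sketch recomputes accuracy, precision, recall, and F1 for class 1 from the four counts printed above. The variable names are illustrative; only the counts come from the notebook output.

```python
import numpy as np

# Confusion matrix reported above: rows = actual class, columns = predicted class.
cm = np.array([[22202, 276],
               [0, 13339]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)          # of predicted cancellations, how many were real
recall = tp / (tp + fn)             # of real cancellations, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

Rounded to two decimals, these values agree with the class 1 row of the classification report.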

## 3. Build a Support Vector Machine (SVM) Model

**Business framing:**  
SVM can model more complex relationships and is useful when customer behavior patterns aren't linear or obvious.

### Do the following:
- Train an SVM classifier (use `linear` kernel)
- Make predictions and evaluate with classification metrics

### In Your Response:
1. How well does the model perform?  And what metric is best used to judge the performance?
2. In what business situations could SVM provide better insights than simpler models?


In [3]:
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train LinearSVC
svm_model = LinearSVC(C=1.0, max_iter=5000, random_state=42)
svm_model.fit(X_train_scaled, y_train)

# Predict
y_pred_svm = svm_model.predict(X_test_scaled)

# Evaluate
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_svm))
print("\nClassification Report:\n", classification_report(y_test, y_pred_svm))
print("Accuracy:", accuracy_score(y_test, y_pred_svm))


Confusion Matrix:
 [[22473     5]
 [  500 12839]]

Classification Report:
               precision    recall  f1-score   support

           0       0.98      1.00      0.99     22478
           1       1.00      0.96      0.98     13339

    accuracy                           0.99     35817
   macro avg       0.99      0.98      0.98     35817
weighted avg       0.99      0.99      0.99     35817

Accuracy: 0.9859005500181478


### ✍️ Your Response: 🔧
1. The model performed well. Accuracy is 0.986, and the F1-score is 0.99 for class 0 and 0.98 for class 1 of `is_canceled`. F1 is again the best metric if the hotel wants to catch cancellations without raising many false positives, which could disrupt business decisions.

2. SVM could provide better insights when cancellation patterns are complex or subtle. One example is revenue optimization, where understanding how several factors jointly influence cancellations can inform pricing adjustments.
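
Because a linear SVM learns one weight per feature, its coefficients can hint at which inputs push a prediction toward cancellation. The sketch below is a minimal illustration on synthetic data from `make_classification` (not the hotel bookings); it also wraps the scaler and `LinearSVC` in a pipeline so that scaling is fit on the training split only. The feature names `feature_0` ... `feature_5` are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in data (assumption: six numeric features, binary target).
X_demo, y_demo = make_classification(n_samples=500, n_features=6, n_informative=3,
                                     n_redundant=0, random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# The pipeline fits the scaler on the training split, then trains the linear SVM.
svm_pipe = make_pipeline(StandardScaler(),
                         LinearSVC(C=1.0, max_iter=5000, random_state=42))
svm_pipe.fit(Xtr, ytr)
print("Test accuracy:", svm_pipe.score(Xte, yte))

# With standardized inputs, a larger absolute coefficient suggests more influence.
coefs = svm_pipe.named_steps["linearsvc"].coef_.ravel()
ranking = np.argsort(-np.abs(coefs))
for idx in ranking:
    print(f"feature_{idx}: {coefs[idx]:+.3f}")
```

On the real data, the same ranking would use the column names of `X` instead of the placeholder labels.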

## 4. Build a Neural Network Model

**Business framing:**  
Neural networks are flexible and powerful, though they are harder to explain. They may work well when subtle patterns exist in the data.

### Do the following:
- Build an `MLPClassifier` model using the `sklearn.neural_network` module
- Choose a simple architecture (e.g., 2 hidden layers)
- Evaluate accuracy and performance

### In Your Response:
1. How does this model compare to the others?
2. Would the business be comfortable using a "black box" model like this? Why or why not?


In [4]:
# Add code here 🔧
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

# Train a simple neural network with 2 hidden layers
nn_model = MLPClassifier(hidden_layer_sizes=(16, 8), activation='relu',
                         solver='adam', max_iter=100, random_state=42)
nn_model.fit(X_train_scaled, y_train)

# Predict on test data
y_pred_nn = nn_model.predict(X_test_scaled)

# Evaluate performance
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_nn))
print("\nClassification Report:\n", classification_report(y_test, y_pred_nn))
print("Accuracy:", accuracy_score(y_test, y_pred_nn))


Confusion Matrix:
 [[22478     0]
 [    1 13338]]

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     22478
           1       1.00      1.00      1.00     13339

    accuracy                           1.00     35817
   macro avg       1.00      1.00      1.00     35817
weighted avg       1.00      1.00      1.00     35817

Accuracy: 0.9999720802970656


### ✍️ Your Response: 🔧
1. The neural network performs better than both previous models. Its F1-score is 1.00 for both classes (canceled and not canceled), and its accuracy of 0.99997 exceeds Naive Bayes (0.992) and SVM (0.986), with the caveat that it is a black box.

2. They might be cautious, particularly in the hotel business, because when issues come up it may be hard to explain why a booking was predicted to cancel. The model could still be used for alerts or prioritization, but for decisions that require explanation, such as refund policies, interpretable models such as Naive Bayes or SVM may be preferable.
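
One common way to make a black-box model somewhat easier to explain is permutation importance: shuffle one feature at a time and measure how much the test score drops. The sketch below applies scikit-learn's `permutation_importance` to a small `MLPClassifier` trained on synthetic data; the data, the `_demo` names, and the `max_iter=500` setting are assumptions for illustration, not the notebook's model.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: five numeric features, two informative).
X_demo, y_demo = make_classification(n_samples=400, n_features=5, n_informative=2,
                                     n_redundant=0, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Scale using training statistics only, then train a two-hidden-layer network.
scaler_demo = StandardScaler().fit(Xtr)
mlp_demo = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=500, random_state=0)
mlp_demo.fit(scaler_demo.transform(Xtr), ytr)

# Shuffle each feature several times and record the average drop in accuracy.
result = permutation_importance(mlp_demo, scaler_demo.transform(Xte), yte,
                                n_repeats=5, random_state=0)
for i, (mean, std) in enumerate(zip(result.importances_mean, result.importances_std)):
    print(f"feature_{i}: {mean:.3f} +/- {std:.3f}")
```

Features whose shuffling barely changes the score contribute little to the predictions, which gives management at least a coarse explanation of what drives the model.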

## 5. Compare All Three Models

### Do the following:
- Print and compare the accuracy of Naïve Bayes, SVM, and Neural Network models
- Summarize which model performed best

### In Your Response:
1. Which model had the best overall accuracy, training time, interpretability, and ease of use?
2. Would you recommend this model for deployment, and why?


In [7]:
# Add code here 🔧
# Accuracy comparison of all three models

from sklearn.metrics import accuracy_score

# Naive Bayes
y_pred_nb = nb_model.predict(X_test)
acc_nb = accuracy_score(y_test, y_pred_nb)

# SVM (LinearSVC)
y_pred_svm = svm_model.predict(X_test_scaled)
acc_svm = accuracy_score(y_test, y_pred_svm)

# Neural Network (MLPClassifier)
y_pred_nn = nn_model.predict(X_test_scaled)
acc_nn = accuracy_score(y_test, y_pred_nn)

print("Accuracy Comparison:")
print(f"Naive Bayes Accuracy: {acc_nb:.5f}")
print(f"SVM Accuracy: {acc_svm:.5f}")
print(f"Neural Network Accuracy: {acc_nn:.5f}")


Accuracy Comparison:
Naive Bayes Accuracy: 0.99229
SVM Accuracy: 0.98590
Neural Network Accuracy: 0.99997


### ✍️ Your Response: 🔧
1. The neural network had the highest accuracy at 0.99997, compared with 0.99229 for Naive Bayes and 0.98590 for SVM. Naive Bayes was the fastest to train. It was also the most interpretable, as it is the simplest model, and the easiest to use, as it required the least setup to implement.

2. I would recommend Naive Bayes for deployment because it is very accurate, fast, and easy to interpret, which suits real-time cancellation alerts.
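
The training-time comparison above is not measured in the notebook. A rough way to check it is to time each `fit` call with `time.perf_counter`, as in the sketch below. It runs on synthetic data, so the numbers it prints say nothing about the hotel dataset; the `candidates` and `results` names and the `max_iter` values are assumptions for illustration.

```python
import time

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in data (assumption: ten numeric features, binary target).
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

candidates = {
    "Naive Bayes": GaussianNB(),
    "SVM (LinearSVC)": make_pipeline(
        StandardScaler(), LinearSVC(max_iter=5000, random_state=42)),
    "Neural Network": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=300, random_state=42)),
}

# Record test accuracy and wall-clock fit time for each model.
results = {}
for name, model in candidates.items():
    start = time.perf_counter()
    model.fit(Xtr, ytr)
    fit_seconds = time.perf_counter() - start
    results[name] = (model.score(Xte, yte), fit_seconds)
    print(f"{name}: accuracy={results[name][0]:.3f}, fit time={fit_seconds:.3f}s")
```

Wall-clock timings vary between runs and machines, so repeating the measurement a few times gives a fairer comparison.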

## 6. Final Business Recommendation

### In Your Response:
1. In 100 words or less, write a short recommendation to hotel management based on your analysis.

Possible info to include:
- Which model do you recommend implementing?
- What business problem does it help solve?
- Are there any risks or limitations?
- What additional data might improve the results in the future?
2. How does this relate to the customized learning outcome you created in Canvas?


### ✍️ Your Response: 🔧
1. I recommend implementing the Naive Bayes model to predict hotel booking cancellations. It provides fast, accurate, and interpretable predictions, which would help management proactively manage overbooking and related planning decisions. While the model performs very well, it may miss subtle patterns that more complex, black-box models such as neural networks could capture. Including additional data, such as customer behavior trends, booking device type, or seasonal promotions, could further improve accuracy.

2. This relates to my customized learning outcome because interpreting these results aligns with my second goal: developing my analytical thinking skills and gaining hands-on experience parsing and communicating real data.

## Submission Instructions
✅ Checklist:
- All code cells run without error
- All markdown responses are complete
- Submit on Canvas as instructed

In [6]:
!jupyter nbconvert --to html "assignment_12_LoreSpencer.ipynb"
