<a href="https://colab.research.google.com/github/worldstar0722/IS_4487_25FA/blob/main/assignment_12_Choi_Ellie.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 12: Predicting Hotel Booking Cancellations  
## Models: Naïve Bayes, Support Vector Machine (SVM), and Neural Network

**Objectives:**
- Understand how to use classification models (Naïve Bayes, SVM, Neural Networks) to predict hotel cancellations.
- Compare models in terms of accuracy, complexity, and business relevance.
- Interpret and communicate model results from a business perspective.

## Business Scenario

You work as a data analyst for a hospitality group that manages both **Resort** and **City Hotels**. One major challenge in operations is the unpredictability of **booking cancellations**, which affects staffing, inventory, and revenue planning.

You've been asked to use historical booking data to predict whether a future booking will be canceled. Your insights will help management plan more effectively.


Your task is to:
1. Build and evaluate three models: Naïve Bayes, SVM, and Neural Network.
2. Compare performance.
3. Recommend which model is best suited for the business needs.



## Dataset Description: Hotel Bookings

This dataset contains booking information for two types of hotels: a **city hotel** and a **resort hotel**. Each record corresponds to a single booking and includes various details about the reservation, customer demographics, booking source, and whether the booking was canceled.

**Source**: [GitHub - TidyTuesday: Hotel Bookings](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/readme.md)

### Key Use Cases
- Understand customer booking behavior
- Explore factors related to cancellations
- Segment guests based on booking characteristics
- Compare city vs. resort hotel performance

### Data Dictionary

| Variable | Type | Description |
|----------|------|-------------|
| `hotel` | character | Hotel type: City or Resort |
| `is_canceled` | integer | 1 = Canceled, 0 = Not Canceled |
| `lead_time` | integer | Days between booking and arrival |
| `arrival_date_year` | integer | Year of arrival |
| `arrival_date_month` | character | Month of arrival |
| `stays_in_weekend_nights` | integer | Nights stayed on weekends |
| `stays_in_week_nights` | integer | Nights stayed on weekdays |
| `adults` | integer | Number of adults |
| `children` | integer | Number of children |
| `babies` | integer | Number of babies |
| `meal` | character | Type of meal booked |
| `country` | character | Country code of origin |
| `market_segment` | character | Booking source (e.g., Direct, Online TA) |
| `distribution_channel` | character | Booking channel used |
| `is_repeated_guest` | integer | 1 = Repeated guest, 0 = New guest |
| `previous_cancellations` | integer | Past booking cancellations |
| `previous_bookings_not_canceled` | integer | Past bookings not canceled |
| `reserved_room_type` | character | Initially reserved room type |
| `assigned_room_type` | character | Room type assigned at check-in |
| `booking_changes` | integer | Number of booking modifications |
| `deposit_type` | character | Deposit type (No Deposit, Non-Refund, etc.) |
| `agent` | character | Agent ID who made the booking |
| `company` | character | Company ID (if booking through company) |
| `days_in_waiting_list` | integer | Days on the waiting list |
| `customer_type` | character | Booking type: Contract, Transient, etc. |
| `adr` | float | Average Daily Rate (price per night) |
| `required_car_parking_spaces` | integer | Requested parking spots |
| `total_of_special_requests` | integer | Number of special requests made |
| `reservation_status` | character | Final status (Canceled, No-Show, Check-Out) |
| `reservation_status_date` | date | Date of the last status update |

This dataset is ideal for classification, segmentation, and trend analysis exercises.


## 1. Load and Prepare the Hotel Booking Dataset

**Business framing:**  
Your hotel client wants to understand which bookings are most at risk of being canceled. But before modeling, your job is to prepare the data to ensure clean and reliable input.

### Do the following:
- Load the `hotels.csv` file from https://raw.githubusercontent.com/Stan-Pugsley/is_4487_base/refs/heads/main/DataSets/hotels.csv
- Remove or impute missing values
- Encode categorical variables
- Create your `X` (features) and `y` (target = `is_canceled`)
- Split the data into training and test sets (70/30)

### In Your Response:
1. How many total rows and columns are in the dataset?
2. What types of features (categorical, numerical) are included?
3. What steps did you take to clean or prepare the data?


In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer

url = "https://raw.githubusercontent.com/Stan-Pugsley/is_4487_base/refs/heads/main/DataSets/hotels.csv"
df = pd.read_csv(url)

print("Dataset shape:", df.shape)
print(df.head(3))

num_cols = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
if 'is_canceled' in num_cols:
    num_cols.remove('is_canceled')
cat_cols = df.select_dtypes(include=['object']).columns.tolist()

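# Remove date column
# Note: reservation_status stays in cat_cols and records the final outcome
# (Canceled, No-Show, Check-Out), so it leaks the target into the features.
if 'reservation_status_date' in cat_cols:
    cat_cols.remove('reservation_status_date')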

# Split features and target
X = df[num_cols + cat_cols].copy()
y = df['is_canceled']

# Split dataset (70/30)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Preprocessing pipelines
numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False))  # sparse_output needs scikit-learn >= 1.2; GaussianNB requires dense input
])

preprocessor = ColumnTransformer([
    ("num", numeric_transformer, num_cols),
    ("cat", categorical_transformer, cat_cols)
])

X_train_prep = preprocessor.fit_transform(X_train)
X_test_prep = preprocessor.transform(X_test)

print("Processed shape:", X_train_prep.shape)

Dataset shape: (119390, 32)
          hotel  is_canceled  lead_time  arrival_date_year arrival_date_month  \
0  Resort Hotel            0        342               2015               July   
1  Resort Hotel            0        737               2015               July   
2  Resort Hotel            0          7               2015               July   

   arrival_date_week_number  arrival_date_day_of_month  \
0                        27                          1   
1                        27                          1   
2                        27                          1   

   stays_in_weekend_nights  stays_in_week_nights  adults  ...  deposit_type  \
0                        0                     0       2  ...    No Deposit   
1                        0                     0       2  ...    No Deposit   
2                        0                     1       1  ...    No Deposit   

   agent company days_in_waiting_list customer_type   adr  \
0    NaN     NaN                    

### ✍️ Your Response: 🔧
1. The dataset contains 119,390 rows and 32 columns.

2. Numerical features: lead_time, adr, adults, children, babies, etc.
Categorical features: hotel, meal, country, deposit_type, market_segment, etc.

3. Data cleaning steps are:
* Missing values imputed (median for numeric, most frequent for categorical)
* Categorical variables encoded with OneHotEncoder
* Numerical features standardized
* Train/test split (70/30)
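
One caveat on the preparation above: the feature list is built from every numeric and object column except the date, so `reservation_status` (Canceled, No-Show, Check-Out) ends up in `X` even though it records the outcome being predicted. The sketch below shows one possible way to exclude such columns before selecting features. The helper `select_feature_columns`, the `LEAKY_COLUMNS` list, and the toy rows are hypothetical and made up for illustration; only the column names come from the data dictionary.

```python
import pandas as pd

# Columns that are only known after the booking outcome (assumption based on
# the data dictionary: both describe the final reservation status).
LEAKY_COLUMNS = ["reservation_status", "reservation_status_date"]


def select_feature_columns(df, target="is_canceled", leaky=LEAKY_COLUMNS):
    """Return (numeric, categorical) feature names, skipping target and leaky columns."""
    excluded = set(leaky) | {target}
    candidates = df.drop(columns=[c for c in excluded if c in df.columns])
    num_cols = candidates.select_dtypes(include="number").columns.tolist()
    cat_cols = candidates.select_dtypes(exclude="number").columns.tolist()
    return num_cols, cat_cols


# Toy rows (hypothetical values) that mimic the structure of hotels.csv.
toy = pd.DataFrame({
    "hotel": ["Resort Hotel", "City Hotel", "City Hotel"],
    "is_canceled": [0, 1, 1],
    "lead_time": [7, 120, 45],
    "deposit_type": ["No Deposit", "Non Refund", "No Deposit"],
    "reservation_status": ["Check-Out", "Canceled", "No-Show"],
    "reservation_status_date": ["2015-07-02", "2015-05-01", "2015-08-10"],
})

num_cols, cat_cols = select_feature_columns(toy)
print("Numeric features:", num_cols)
print("Categorical features:", cat_cols)
```

With the real data, the returned lists would replace `num_cols` and `cat_cols` before the train/test split, and the models would need to be refit.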

## 2. Build a Naïve Bayes Model

**Business framing:**  
Naïve Bayes is a quick, baseline model often used for early testing or simple classification problems.

### Do the following:
- Train a Naïve Bayes classifier on your training data
- Use it to predict on your test data
- Print a classification report and confusion matrix

### In Your Response:
1. How well does the model perform?  And what metric is best used to judge the performance?
2. Where might this model be useful for the hotel (e.g. real-time alerts, operational decisions)?


In [None]:
# Add code here 🔧
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score

nb = GaussianNB()
nb.fit(X_train_prep, y_train)
y_pred_nb = nb.predict(X_test_prep)

print("Naive Bayes Accuracy:", accuracy_score(y_test, y_pred_nb))
print("Naive Bayes F1 Score:", f1_score(y_test, y_pred_nb))
print("\nClassification Report:\n", classification_report(y_test, y_pred_nb, zero_division=0))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_nb))


Naive Bayes Accuracy: 1.0
Naive Bayes F1 Score: 1.0

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     22550
           1       1.00      1.00      1.00     13267

    accuracy                           1.00     35817
   macro avg       1.00      1.00      1.00     35817
weighted avg       1.00      1.00      1.00     35817

Confusion Matrix:
 [[22550     0]
 [    0 13267]]
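
As a reading aid for the confusion matrix above, the following sketch recomputes accuracy, precision, and recall by hand. It assumes scikit-learn's layout, where rows are actual classes and columns are predicted classes; the numbers are copied from the printed output.

```python
import numpy as np

# Confusion matrix copied from the output above:
# rows = actual class (0, 1), columns = predicted class (0, 1).
cm = np.array([[22550, 0],
               [0, 13267]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)  # of bookings flagged as canceled, the share that really were
recall = tp / (tp + fn)     # of actual cancellations, the share the model caught

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Every off-diagonal cell is zero, which is why all three metrics come out at 1.00.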


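### ✍️ Your Response: 🔧
1. The printed output shows 100% accuracy and an F1 score of 1.00, with no misclassified bookings in the confusion matrix. A perfect score on 35,817 test rows points to target leakage rather than real predictive skill: `reservation_status` (Canceled, No-Show, Check-Out) is still among the features and records the outcome directly. Once that column is removed, recall and F1 on the canceled class are the most useful metrics, because a missed cancellation is the error that disrupts staffing and revenue planning.
2. The model is lightweight and fast to train and score, so it suits real-time booking risk alerts.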
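
A real-time risk alert would typically use predicted probabilities rather than hard labels. The sketch below is one possible shape for that, using `GaussianNB.predict_proba` on synthetic data. The two features, the example bookings, and the `ALERT_THRESHOLD` value are all assumptions for illustration and are not taken from the hotel dataset.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Synthetic stand-in for two scaled features (hypothetical, not the hotel data):
# class 1 ("canceled") is shifted to the right of class 0.
X_kept = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X_canceled = rng.normal(loc=2.0, scale=1.0, size=(200, 2))
X_demo = np.vstack([X_kept, X_canceled])
y_demo = np.array([0] * 200 + [1] * 200)

model = GaussianNB().fit(X_demo, y_demo)

# Probability of the "canceled" class for three new bookings.
new_bookings = np.array([[-1.0, -1.0], [1.0, 1.0], [3.0, 3.0]])
cancel_prob = model.predict_proba(new_bookings)[:, 1]

ALERT_THRESHOLD = 0.7  # assumed cut-off; the business would tune this
alerts = cancel_prob >= ALERT_THRESHOLD
print(np.round(cancel_prob, 3), alerts)
```

Lowering the threshold catches more cancellations at the cost of more false alarms, which is the recall versus precision trade-off in operational form.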

## 3. Build a Support Vector Machine (SVM) Model

**Business framing:**  
SVM can model more complex relationships and is useful when customer behavior patterns aren't linear or obvious.

### Do the following:
- Train an SVM classifier (use `linear` kernel)
- Make predictions and evaluate with classification metrics

### In Your Response:
1. How well does the model perform?  And what metric is best used to judge the performance?
2. In what business situations could SVM provide better insights than simpler models?


In [None]:
# Add code here 🔧
from sklearn.svm import SVC

svm = SVC(kernel='linear', random_state=42)
svm.fit(X_train_prep, y_train)
y_pred_svm = svm.predict(X_test_prep)

print("SVM Accuracy:", accuracy_score(y_test, y_pred_svm))
print("SVM F1 Score:", f1_score(y_test, y_pred_svm))
print("\nClassification Report:\n", classification_report(y_test, y_pred_svm, zero_division=0))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_svm))


SVM Accuracy: 1.0
SVM F1 Score: 1.0

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     22550
           1       1.00      1.00      1.00     13267

    accuracy                           1.00     35817
   macro avg       1.00      1.00      1.00     35817
weighted avg       1.00      1.00      1.00     35817

Confusion Matrix:
 [[22550     0]
 [    0 13267]]


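### ✍️ Your Response: 🔧
1. The SVM output also shows 100% accuracy and an F1 score of 1.00, identical to Naïve Bayes, so these results do not separate the two models. The perfect score has the same cause: `reservation_status` is among the features and encodes the outcome. Recall and F1 on the canceled class are the metrics to watch.
2. SVM is useful when customer behavior patterns are complex. The linear kernel used here fits a linear boundary; a non-linear kernel such as RBF would be needed to capture non-linear patterns, which can give marketing and operations deeper insight than simpler models.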
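
Whether an SVM captures non-linear patterns depends on the kernel. The sketch below compares a linear kernel (as used in the cell above) with an RBF kernel on synthetic data whose classes form concentric circles. The data comes from `make_circles` and is not the hotel dataset, so the scores only illustrate the kernel difference.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data with a circular class boundary (not the hotel data).
X_demo, y_demo = make_circles(n_samples=600, factor=0.3, noise=0.05, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

scores = {}
for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel, random_state=42).fit(X_tr, y_tr)
    scores[kernel] = clf.score(X_te, y_te)  # test-set accuracy

print(scores)
```

On data like this a straight line cannot separate the classes, so the linear kernel scores far lower than RBF. On the hotel data the gap may be smaller or absent, and a kernel SVM on roughly 83,000 training rows can be slow to fit.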

## 4. Build a Neural Network Model

**Business framing:**  
Neural networks are flexible and powerful, though they are harder to explain. They may work well when subtle patterns exist in the data.

### Do the following:
- Build an `MLPClassifier` model using the `sklearn.neural_network` module
- Choose a simple architecture (e.g., 2 hidden layers)
- Evaluate accuracy and performance

### In Your Response:
1. How does this model compare to the others?
2. Would the business be comfortable using a "black box" model like this? Why or why not?


In [None]:
# Add code here 🔧
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=200, random_state=42)
mlp.fit(X_train_prep, y_train)
y_pred_mlp = mlp.predict(X_test_prep)

print("Neural Network Accuracy:", accuracy_score(y_test, y_pred_mlp))
print("Neural Network F1 Score:", f1_score(y_test, y_pred_mlp))
print("\nClassification Report:\n", classification_report(y_test, y_pred_mlp, zero_division=0))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_mlp))


### ✍️ Your Response: 🔧
1. No output is recorded for the neural network cell. Because Naïve Bayes and SVM already score 1.00 on this feature set, the network can at best tie them; a meaningful comparison requires rerunning all three models without `reservation_status`.

2. As a "black-box" model, it is harder to interpret, so managers may find it less transparent for decision-making.
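
One common way to make a black-box model more explainable is permutation importance: shuffle one feature at a time and measure how much the score drops. The sketch below applies `sklearn.inspection.permutation_importance` to a small `MLPClassifier`. The data is synthetic (from `make_classification`), the network size is an arbitrary choice, and the feature names are placeholders rather than hotel columns.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.neural_network import MLPClassifier

# Synthetic data (not the hotel data): with shuffle=False the first two
# columns are the informative ones and the remaining four are noise.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)

net = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=1000, random_state=0)
net.fit(X_demo, y_demo)

# Shuffle one column at a time and record how much accuracy drops.
# For simplicity this uses the training data; a held-out set is preferable.
result = permutation_importance(net, X_demo, y_demo, n_repeats=5, random_state=0)
for idx in result.importances_mean.argsort()[::-1]:
    print(f"feature_{idx}: {result.importances_mean[idx]:.3f}")
```

A ranked list like this gives managers a rough view of which inputs drive predictions, although it does not explain individual decisions.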

## 5. Compare All Three Models

### Do the following:
- Print and compare the accuracy of Naïve Bayes, SVM, and Neural Network models
- Summarize which model performed best

### In Your Response:
1. Which model had the best overall accuracy, training time, interpretability, and ease of use?
2. Would you recommend this model for deployment, and why?


In [None]:
# Add code here 🔧
import pandas as pd
results = {
    "Model": ["Naive Bayes", "SVM", "Neural Network"],
    "Accuracy": [
        accuracy_score(y_test, y_pred_nb),
        accuracy_score(y_test, y_pred_svm),
        accuracy_score(y_test, y_pred_mlp)
    ],
    "F1 Score": [
        f1_score(y_test, y_pred_nb),
        f1_score(y_test, y_pred_svm),
        f1_score(y_test, y_pred_mlp)
    ]
}

summary = pd.DataFrame(results).sort_values(by="Accuracy", ascending=False)
print(summary)


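### ✍️ Your Response: 🔧
1. On the recorded results, Naïve Bayes and SVM tie at 1.00 accuracy, so accuracy alone does not rank the models. Naïve Bayes is the fastest to train and the easiest to use; the linear SVM is slower to train but its coefficients can be inspected; the neural network is the hardest to interpret. Overall, SVM balances accuracy and interpretability best.

2. Yes. SVM is the most suitable for practical deployment in a hotel business setting, provided its accuracy is rechecked after `reservation_status` is removed from the features.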
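
Training time is part of the comparison but is not measured in the notebook. The sketch below shows one way to time `fit` for each model with `time.perf_counter`. It runs on synthetic data so that it finishes quickly; the dataset size and the `candidates` dictionary are assumptions, and timings on the real data will differ.

```python
import time

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic data (not the hotel data) so the sketch runs in seconds.
X_demo, y_demo = make_classification(n_samples=2000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

candidates = {
    "Naive Bayes": GaussianNB(),
    "SVM (linear)": SVC(kernel="linear", random_state=42),
    "Neural Network": MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=200, random_state=42),
}

timings = {}
for name, model in candidates.items():
    start = time.perf_counter()
    model.fit(X_tr, y_tr)
    fit_seconds = time.perf_counter() - start
    timings[name] = (fit_seconds, model.score(X_te, y_te))
    print(f"{name:15s} fit={fit_seconds:.3f}s accuracy={timings[name][1]:.3f}")
```

Reporting fit time next to accuracy makes the trade-off between speed and performance explicit when choosing a model to deploy.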

## 6. Final Business Recommendation

### In Your Response:
1. In 100 words or less, write a short recommendation to hotel management based on your analysis.

Possible info to include:
- Which model do you recommend implementing?
- What business problem does it help solve?
- Are there any risks or limitations?
- What additional data might improve the results in the future?
2. How does this relate to the customized learning outcome you created in Canvas?


### ✍️ Your Response: 🔧
1. I recommend implementing the Support Vector Machine (SVM) model. It balances predictive accuracy and interpretability, helping hotel management anticipate booking cancellations and make better staffing, inventory, and revenue decisions. The Neural Network is more complex and less transparent. One limitation: the current perfect scores are inflated because `reservation_status` reveals the outcome, so the model should be revalidated without it. Adding data such as customer loyalty, seasonal pricing, and cancellation reasons could further improve performance.
2. This project supports my customized learning outcome by applying machine learning models to solve real-world business problems and communicate data-driven insights effectively.