<a href="https://colab.research.google.com/github/worldstar0722/IS_4487_25FA/blob/main/assignment_09_ChoiEllie.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# IS 4487 Assignment 9: Customer Segmentation with Clustering

In this assignment, you will:
- Apply unsupervised learning to explore patterns in hotel booking behavior
- Use K-Means and Gaussian Mixture Models (GMM) for customer segmentation
- Evaluate model quality with metrics like Silhouette Score and Davies-Bouldin Index
- Connect clustering to actionable business insights

## Why This Matters

Businesses like hotels and travel platforms (e.g., Airbnb or Expedia) rely on customer segmentation to tailor promotions, pricing strategies, and service levels. Unlike supervised models, clustering helps uncover patterns when no labels exist—an ideal tool when entering new markets or analyzing unstructured customer behavior.



## Dataset Description: Hotel Bookings

This dataset contains booking information for two types of hotels: a **city hotel** and a **resort hotel**. Each record corresponds to a single booking and includes various details about the reservation, customer demographics, booking source, and whether the booking was canceled.

**Source**: [GitHub - TidyTuesday: Hotel Bookings](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/readme.md)

### Key Use Cases
- Understand customer booking behavior
- Explore factors related to cancellations
- Segment guests based on booking characteristics
- Compare city vs. resort hotel performance

### Data Dictionary

| Variable | Type | Description |
|----------|------|-------------|
| `hotel` | character | Hotel type: City or Resort |
| `is_canceled` | integer | 1 = Canceled, 0 = Not Canceled |
| `lead_time` | integer | Days between booking and arrival |
| `arrival_date_year` | integer | Year of arrival |
| `arrival_date_month` | character | Month of arrival |
| `stays_in_weekend_nights` | integer | Nights stayed on weekends |
| `stays_in_week_nights` | integer | Nights stayed on weekdays |
| `adults` | integer | Number of adults |
| `children` | integer | Number of children |
| `babies` | integer | Number of babies |
| `meal` | character | Type of meal booked |
| `country` | character | Country code of origin |
| `market_segment` | character | Booking source (e.g., Direct, Online TA) |
| `distribution_channel` | character | Booking channel used |
| `is_repeated_guest` | integer | 1 = Repeated guest, 0 = New guest |
| `previous_cancellations` | integer | Past booking cancellations |
| `previous_bookings_not_canceled` | integer | Past bookings not canceled |
| `reserved_room_type` | character | Initially reserved room type |
| `assigned_room_type` | character | Room type assigned at check-in |
| `booking_changes` | integer | Number of booking modifications |
| `deposit_type` | character | Deposit type (No Deposit, Non-Refund, etc.) |
| `agent` | character | Agent ID who made the booking |
| `company` | character | Company ID (if booking through company) |
| `days_in_waiting_list` | integer | Days on the waiting list |
| `customer_type` | character | Booking type: Contract, Transient, etc. |
| `adr` | float | Average Daily Rate (price per night) |
| `required_car_parking_spaces` | integer | Requested parking spots |
| `total_of_special_requests` | integer | Number of special requests made |
| `reservation_status` | character | Final status (Canceled, No-Show, Check-Out) |
| `reservation_status_date` | date | Date of the last status update |

This dataset is ideal for classification, segmentation, and trend analysis exercises.

## 1. Setup and Load Data

Business framing:  

Before we can cluster or segment anything, we need clean, accessible data in a usable format.

- Import the necessary Python libraries
- Load the hotel bookings dataset by [downloading the file](https://github.com/rfordatascience/tidytuesday/blob/main/data/2020/2020-02-11/readme.md#get-the-data-here) or using this link: https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2020/2020-02-11/hotels.csv
- Display the first few rows

### In Your Response:
1. What stands out in the initial preview? Any columns or rows that seem unusual?

In [None]:
# Add code here 🔧


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score, davies_bouldin_score
from sklearn.metrics import pairwise_distances_argmin_min
import seaborn as sns

url = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2020/2020-02-11/hotels.csv"
df = pd.read_csv(url)

display(df.head())
df.info()  # info() prints its summary directly and returns None, so display() is not needed
display(df.describe(include='all').T)


### ✍️ Your Response: 🔧
1. The preview shows a mix of categorical columns (`hotel`, `meal`, `market_segment`, `distribution_channel`, etc.) and numeric booking columns (`lead_time`, `stays_in_weekend_nights`, `stays_in_week_nights`, `adults`, `children`, `babies`, `previous_cancellations`, `adr`, `total_of_special_requests`, etc.).<br>
Several columns deserve a closer look for missing or unusual values: `children` may contain NaNs or zeros, `agent` and `company` can hold the string "NULL" or NaN, and `country` has many unique values (country codes).<br>
Zeros in `adr` (average daily rate) may indicate missing or erroneous prices.<br>
`reservation_status_date` is read in as a string unless it is explicitly converted to a date, so it needs care if it is used.
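
The checks described above can be sketched on a small made-up frame. The values below are illustrative only and are not rows from the hotel dataset; the sketch assumes that the placeholder string is exactly `"NULL"`, as mentioned in the response.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows that mimic the issues noted in the preview
toy_bookings = pd.DataFrame({
    "children": [0.0, np.nan, 2.0, 0.0],
    "agent": ["9", "NULL", "240", "NULL"],
    "adr": [75.0, 0.0, 120.5, 98.0],
})

# Turn the "NULL" placeholder strings into real missing values
cleaned = toy_bookings.replace("NULL", np.nan)

# Count missing values per column and rows with a zero daily rate
missing_per_column = cleaned.isna().sum()
zero_adr_rows = int((cleaned["adr"] == 0).sum())

print(missing_per_column)
print("Rows with adr == 0:", zero_adr_rows)
```

Running the same two checks on `df` would show how many rows are affected before deciding whether to drop or impute them.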

## 2. Select and Prepare Features

Business framing:  

A hotel might want to group guests based on how long they stay, how far in advance they book, or how likely they are to make special requests. You need to pick variables that represent meaningful guest behavior.

- Choose 3–5 numeric features related to customer behavior
- Drop missing values if needed
- Standardize using `StandardScaler`
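
To illustrate what standardization does before clustering, the sketch below applies the z-score transform by hand with NumPy. The three rows are made up for illustration. It uses the population standard deviation (`ddof=0`), which matches the default behavior of `StandardScaler`.

```python
import numpy as np

# Hypothetical toy rows: [lead_time in days, adr in price per night]
toy = np.array([
    [10.0, 80.0],
    [200.0, 120.0],
    [30.0, 100.0],
])

# z-score: subtract each column's mean, then divide by its standard deviation
col_mean = toy.mean(axis=0)
col_std = toy.std(axis=0)  # ddof=0, the population standard deviation
toy_scaled = (toy - col_mean) / col_std

print(toy_scaled.round(3))
print("column means:", toy_scaled.mean(axis=0).round(6))
print("column stds: ", toy_scaled.std(axis=0).round(6))
```

After scaling, every feature has mean 0 and standard deviation 1, so a feature measured in hundreds of days no longer dominates the Euclidean distances used by K-Means.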

### In Your Response:
1. What features did you select and why?
2. What kinds of patterns or segments do you expect to find?


In [None]:
# Add code here 🔧
df['total_nights'] = df['stays_in_weekend_nights'] + df['stays_in_week_nights']

features = ['lead_time', 'total_nights', 'adr', 'total_of_special_requests']
data = df[features].copy()

data = data.replace([np.inf, -np.inf], np.nan).dropna()

print("Rows after dropping NA:", data.shape[0])
print(data.describe())

scaler = StandardScaler()
X = scaler.fit_transform(data)

X_df = pd.DataFrame(X, columns=features, index=data.index)


### ✍️ Your Response: 🔧
1. Four variables were selected: `lead_time`, `total_nights`, `adr`, and `total_of_special_requests`. They capture when a guest books, how long the guest stays, how much the guest spends per night, and how much service the guest requests, so together they describe the main patterns of hotel use.<br>
2. Distinct customer segments were expected to emerge from these variables, such as short-stay business travelers, long-stay vacationing families, and guests who book far in advance.

## 3. Apply K-Means Clustering

Business framing:  

Let’s say you’re working with the hotel’s marketing manager. She wants to group guests into a few clear types to target email campaigns. K-Means is a fast, simple way to try this.

- Fit a `KMeans` model with your selected features
- Choose a value of `k` (e.g. 3, 4, or 5)
- Predict clusters and assign to each guest
- Visualize using a scatterplot of 2 features
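
For intuition about what `KMeans` does when it is fit, here is a bare-bones sketch of the underlying loop (Lloyd's algorithm) in NumPy. It is not scikit-learn's implementation, which by default uses k-means++ initialization and keeps the best of several restarts. The helper names `nearest_center` and `lloyd_kmeans` and the toy data are hypothetical.

```python
import numpy as np

def nearest_center(points, centers):
    """Index of the closest center for every point (Euclidean distance)."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def lloyd_kmeans(points, k, n_iter=20, seed=0):
    """Illustrative K-Means: alternate assignment and update steps."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center
        labels = nearest_center(points, centers)
        # Update step: each center moves to the mean of its points
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving
        centers = new_centers
    return nearest_center(points, centers), centers

# Two well-separated made-up groups of standardized guests
rng = np.random.default_rng(42)
group_a = rng.normal(loc=[-2.0, -2.0], scale=0.3, size=(50, 2))
group_b = rng.normal(loc=[2.0, 2.0], scale=0.3, size=(50, 2))
toy_points = np.vstack([group_a, group_b])

toy_labels, toy_centers = lloyd_kmeans(toy_points, k=2)
print(toy_centers.round(2))
```

Each guest gets exactly one label, which is why K-Means is described as hard clustering.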

Much of this assignment has already been covered in the lab. Please be sure to complete the lab before the assignment.

### In Your Response:
1. What `k` value did you choose, and how did you decide?
2. What types of customers seem to show up in the clusters?



In [None]:
# Add code here 🔧
inertias = {}
sil_scores = {}
for k in [2,3,4,5,6]:
    km = KMeans(n_clusters=k, random_state=42, n_init=20)
    labels = km.fit_predict(X)
    inertias[k] = km.inertia_
    sil_scores[k] = silhouette_score(X, labels)

print("Inertia by k:", inertias)
print("Silhouette by k:", sil_scores)

k = 4
kmeans = KMeans(n_clusters=k, random_state=42, n_init=20)
k_labels = kmeans.fit_predict(X)

data_k = data.copy()
data_k['kmeans_cluster'] = k_labels

plt.figure(figsize=(8,6))
sns.scatterplot(x=data_k['lead_time'], y=data_k['adr'], hue=data_k['kmeans_cluster'], palette='tab10', s=30, alpha=0.7)
plt.title(f'KMeans (k={k}) clusters — lead_time vs adr')
plt.xlabel('lead_time')
plt.ylabel('adr')
plt.legend(title='cluster')
plt.show()

cluster_means = data_k.groupby('kmeans_cluster')[features].mean().round(2)
display(cluster_means)


### ✍️ Your Response: 🔧
1. After comparing several values of `k` (2 through 6), k = 4 gave the most balanced result based on the silhouette score and the scatterplot. At this value, the clusters were clearly separated and easy to interpret.<br>
2. Four main guest types appeared: short-stay, low-price business guests; long-stay, high-price leisure guests; early bookers; and family guests with many special requests.

## 4. Apply Gaussian Mixture Model (GMM)

Business framing:  

Not all guests fit neatly into one cluster. GMM lets us capture uncertainty — useful if customers behave similarly across groups.

- Fit a GMM with the same number of clusters you chose in Part 3
- Predict soft clusters (remember that soft clustering deals with probabilities, not labels)
- Visualize the GMM model so that you may compare it to the KMeans scatterplot
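
The difference between soft and hard assignment can be sketched with a hand-built one-dimensional mixture. All parameters below (weights, means, standard deviations, and the three guests) are made up for illustration; a fitted `GaussianMixture` estimates its own parameters from the data.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Hypothetical mixture over standardized lead_time:
# component 0 = "last-minute bookers", component 1 = "early planners"
weights = np.array([0.5, 0.5])
means = np.array([-1.0, 1.0])
stds = np.array([0.5, 0.5])

guests = np.array([-1.2, 0.0, 0.9])  # three made-up guests

# Soft assignment: weight * density, normalized so each row sums to 1
weighted = weights * gaussian_pdf(guests[:, None], means, stds)
soft_probs = weighted / weighted.sum(axis=1, keepdims=True)

# Hard assignment: pick the most probable component for each guest
hard_labels = soft_probs.argmax(axis=1)

print(soft_probs.round(3))
print(hard_labels)
```

The guest at 0.0 sits exactly between the two components and gets a 50/50 split, information that a single hard label would hide. In scikit-learn, `predict_proba` returns this kind of probability matrix and `predict` returns the most probable component.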

### In Your Response:
1. How did the GMM results compare to KMeans?
2. What business questions might GMM help answer better?


In [None]:
# Add your code here
gmm = GaussianMixture(n_components=k, covariance_type='full', random_state=42)
gmm.fit(X)
gmm_probs = gmm.predict_proba(X)
gmm_labels = gmm.predict(X)
# Attach to a copy
data_g = data.copy()
data_g['gmm_cluster'] = gmm_labels

fig, axes = plt.subplots(1,2, figsize=(14,5))
sns.scatterplot(x=data_k['lead_time'], y=data_k['adr'], hue=data_k['kmeans_cluster'], ax=axes[0], palette='tab10', s=30, alpha=0.7)
axes[0].set_title(f'KMeans (k={k})')
axes[0].set_xlabel('lead_time'); axes[0].set_ylabel('adr')

sns.scatterplot(x=data_g['lead_time'], y=data_g['adr'], hue=data_g['gmm_cluster'], ax=axes[1], palette='tab10', s=30, alpha=0.7)
axes[1].set_title(f'GMM (k={k})')
axes[1].set_xlabel('lead_time'); axes[1].set_ylabel('adr')

plt.tight_layout()
plt.show()

prob_df = pd.DataFrame(gmm_probs[:10], columns=[f'gmm_prob_{i}' for i in range(k)])
display(prob_df.round(3))


### ✍️ Your Response: 🔧
1. GMM produced smoother, more flexible cluster boundaries than K-Means and, through its membership probabilities, better represents guests who could belong to more than one segment.<br>
2. GMM is useful for identifying guests with mixed tendencies, such as those traveling for both business and leisure, and for deciding which guests might respond to more than one type of promotion.

## 5. Evaluate Your Models

Business framing:  

In business, models should be both useful and reliable. You’ll compare model quality using standard evaluation metrics.

- Calculate:
  - WCSS
  - Silhouette Score
  - Davies-Bouldin Index
- Compare both models

**Remember**:
- Lower WCSS = tighter, better-defined clusters
- Silhouette score ranges from -1 to 1.  Higher values = better clustering
- Lower Davies-Bouldin Index = better clustering
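
To make the three metrics concrete, the sketch below computes each one by hand with NumPy on four made-up points in two clusters. It follows the standard definitions and assumes every cluster has at least two points; in practice `silhouette_score` and `davies_bouldin_score` do this work.

```python
import numpy as np

# Made-up standardized points: two tight, well-separated clusters
pts = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
lab = np.array([0, 0, 1, 1])
ids = np.unique(lab)
centroids = np.array([pts[lab == c].mean(axis=0) for c in ids])

# WCSS: squared distance from each point to its own centroid, summed
wcss = sum(((pts[lab == c] - centroids[i]) ** 2).sum() for i, c in enumerate(ids))

# Silhouette: per point, (b - a) / max(a, b), where
#   a = mean distance to the other points in its own cluster
#   b = lowest mean distance to the points of any other cluster
pair = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
sil = []
for i in range(len(pts)):
    own = (lab == lab[i])
    own[i] = False  # exclude the point itself
    a = pair[i, own].mean()
    b = min(pair[i, lab == c].mean() for c in ids if c != lab[i])
    sil.append((b - a) / max(a, b))
silhouette = float(np.mean(sil))

# Davies-Bouldin: for each cluster, the largest ratio of combined
# within-cluster spread to centroid distance, averaged over clusters
spread = np.array([
    np.linalg.norm(pts[lab == c] - centroids[i], axis=1).mean()
    for i, c in enumerate(ids)
])
db_terms = []
for i in range(len(ids)):
    ratios = [
        (spread[i] + spread[j]) / np.linalg.norm(centroids[i] - centroids[j])
        for j in range(len(ids)) if j != i
    ]
    db_terms.append(max(ratios))
davies_bouldin = float(np.mean(db_terms))

print(f"WCSS={wcss:.2f}, silhouette={silhouette:.3f}, Davies-Bouldin={davies_bouldin:.3f}")
# prints: WCSS=4.00, silhouette=0.802, Davies-Bouldin=0.200
```

Tight, far-apart clusters give a low WCSS, a silhouette near 1, and a Davies-Bouldin value near 0. WCSS is defined relative to cluster centroids, so it is usually reported for K-Means only.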

### In Your Response:
1. Which model performed better on the metrics?
2. Would you recommend KMeans or GMM for a business analyst? Why?


In [None]:
# Add code here 🔧
wcss_k = kmeans.inertia_
sil_k = silhouette_score(X, k_labels)
db_k = davies_bouldin_score(X, k_labels)

sil_g = silhouette_score(X, gmm_labels)
db_g = davies_bouldin_score(X, gmm_labels)
bic = gmm.bic(X)
aic = gmm.aic(X)

print("KMeans (k={}): WCSS(inertia)={:.2f}, Silhouette={:.4f}, Davies-Bouldin={:.4f}".format(k, wcss_k, sil_k, db_k))
print("GMM   (k={}): Silhouette={:.4f}, Davies-Bouldin={:.4f}, BIC={:.1f}, AIC={:.1f}".format(k, sil_g, db_g, bic, aic))

eval_df = pd.DataFrame({
    'model': ['KMeans', 'GMM (hard labels)'],
    'wcss': [wcss_k, np.nan],
    'silhouette': [sil_k, sil_g],
    'davies_bouldin': [db_k, db_g],
    'gmm_bic': [np.nan, bic],
    'gmm_aic': [np.nan, aic]
})
display(eval_df.T)


### ✍️ Your Response: 🔧
1. K-Means had the higher silhouette score, meaning its clusters were more clearly separated. GMM also reports AIC and BIC, but these are likelihood-based measures that K-Means does not produce, so they cannot be compared directly between the two models.<br>
2. K-Means is simpler to interpret and apply, so it is the better fit for a business analyst; GMM is recommended as a supplementary tool for exploration.

## 6. Business Interpretation

Business framing:  

What do these clusters mean in the real world? Could they represent solo travelers, families, or bargain shoppers?

- Review characteristics of each cluster (e.g. average `lead_time`, `total_of_special_requests`)
- Think from a marketing or hotel operations perspective

### In Your Response:
1. What do the segments represent in terms of guest behavior?
2. How could the hotel tailor services or promotions to each group?


In [None]:
# Add code here 🔧
summary = data_k.groupby('kmeans_cluster')[features].agg(['mean','std','count']).round(2)
display(summary)

cluster_profile = data_k.groupby('kmeans_cluster').agg({
    'lead_time': 'mean',
    'total_nights': 'mean',
    'adr': 'mean',
    'total_of_special_requests': 'mean',
    'kmeans_cluster': 'count'
}).rename(columns={'kmeans_cluster':'count'}).round(2)

display(cluster_profile)

for idx, row in cluster_profile.iterrows():
    # iterrows() upcasts the row to float, so cast the count back to int for display
    print(f"Cluster {idx}: count={int(row['count'])}, lead_time={row['lead_time']}, nights={row['total_nights']}, adr={row['adr']}, special_req={row['total_of_special_requests']}")


### ✍️ Your Response: 🔧
1. The four segments correspond to business guests, leisure families, early bookers, and guests who prefer customized service, each showing a different behavior pattern.<br>
2. Hotels can optimize marketing by offering short-stay discounts to business guests and package deals to family guests.

## 7. Final Reflection

Business framing:  

Many teams ask for "segmentation" without knowing how it works. You now have hands-on experience with two clustering techniques and how to present the results.

### In Your Response:
1. What was most challenging about unsupervised learning?
2. When would you use clustering instead of supervised models?
3. How would you explain the value of clustering to a non-technical manager?
4. How does this relate to the customized learning outcome you created in Canvas?


### ✍️ Your Response: 🔧
1. The hardest part of unsupervised learning is that there is no single ground truth: choosing features, selecting the number of clusters (k), and interpreting the clusters all require domain knowledge and iteration. Evaluation metrics (silhouette, Davies-Bouldin) are useful but imperfect, and they do not replace business sense.<br>
2. Use clustering when you lack labeled outcomes (e.g., there are no "customer-type" labels) and want to discover natural groupings for personalization, product development, or exploration. Use supervised learning when you have a labeled target to predict (churn, cancellation).<br>
3. For a non-technical manager: clustering sorts guests into a few actionable groups. For each group we can design targeted promotions, pricing, and operations, which can lead to higher revenue and better guest satisfaction. It turns a huge, messy dataset into a small set of customer archetypes.<br>
4. This assignment demonstrates an end-to-end unsupervised workflow (feature selection, standardization, multiple algorithms, metrics, and business interpretation), which builds practical skills for customer segmentation projects.

## Submission Instructions

✅ **Before submitting:**
- Make sure all code cells are run and outputs are visible  
- All markdown questions are answered thoughtfully  
- Submit the assignment as an **HTML file** on Canvas


In [None]:
!jupyter nbconvert --to html "assignment_09_LastnameFirstname.ipynb"