# 🎯 K-Means Clustering - Customer Segmentation

Welcome to your hands-on K-Means clustering exercise! In this notebook, you'll discover hidden patterns in customer data to segment customers into meaningful groups.

## 🔍 Key Questions to Answer
1. **What is the optimal number of clusters?** (using the elbow method)
2. **How many customers are in each cluster?** (using K-Means clustering)

## 📊 Dataset Info
- **Rows**: 200 customers from a retail store
- **Clustering features**: `Annual_Income` (k$) and `Spending_Score` (1-100); the dataset also has an `Age` column, used here only for exploration
- **Goal**: Segment customers for targeted marketing campaigns

Let's discover customer patterns! 🔍✨

In [None]:
# Install required packages
%pip install pandas numpy matplotlib seaborn scikit-learn

print("✅ All packages installed successfully!")

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Set plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("✅ All libraries imported successfully!")
print("🎨 Plot styling configured!")

## 📊 Step 1: Load and Explore the Data

Let's load the customer dataset and understand what we're working with.

In [None]:
# Load the customer dataset
df = pd.read_csv('customer_data.csv')

print("✅ Dataset loaded successfully!")
print(f"📊 Shape: {df.shape}")
print(f"📋 Columns: {list(df.columns)}")

# Display first few rows
print("\n🔍 First 5 rows:")
print(df.head())

In [None]:
# Dataset overview
print("=== DATASET OVERVIEW ===")
# df.info() prints its report directly and returns None, so it is not wrapped in print()
df.info()

print("\n=== STATISTICAL SUMMARY ===")
print(df.describe())

In [None]:
# Visualize the data distribution
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Age distribution
axes[0].hist(df['Age'], bins=20, alpha=0.7, color='skyblue', edgecolor='black')
axes[0].set_title('Age Distribution')
axes[0].set_xlabel('Age')
axes[0].set_ylabel('Frequency')

# Annual Income distribution
axes[1].hist(df['Annual_Income'], bins=20, alpha=0.7, color='lightgreen', edgecolor='black')
axes[1].set_title('Annual Income Distribution')
axes[1].set_xlabel('Annual Income (k$)')
axes[1].set_ylabel('Frequency')

# Spending Score distribution
axes[2].hist(df['Spending_Score'], bins=20, alpha=0.7, color='salmon', edgecolor='black')
axes[2].set_title('Spending Score Distribution')
axes[2].set_xlabel('Spending Score (1-100)')
axes[2].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

print("📊 Data distributions visualized!")

## 🎯 Step 2: Prepare Data for Clustering

For customer segmentation, we'll focus on Annual Income and Spending Score as our clustering features.
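
A note on scale: K-Means relies on Euclidean distance, so a feature with a much larger numeric range can dominate the result. This notebook clusters the raw values, since both features span broadly similar ranges. If your features differed a lot in scale, standardizing them first is one common option. The sketch below is only an illustration on a hypothetical toy array (`toy_features` is made up, not taken from `customer_data.csv`):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical (income, spending score) pairs, for illustration only
toy_features = np.array([
    [15.0, 39.0],
    [16.0, 81.0],
    [54.0, 50.0],
    [78.0, 17.0],
    [137.0, 83.0],
])

# Rescale each column to zero mean and unit variance
scaler = StandardScaler()
toy_scaled = scaler.fit_transform(toy_features)

print("Column means after scaling:", toy_scaled.mean(axis=0))
print("Column std devs after scaling:", toy_scaled.std(axis=0))
```

If you cluster scaled data, `scaler.inverse_transform` can map the cluster centers back to the original units for interpretation.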

In [None]:
# Select features for clustering
features = ['Annual_Income', 'Spending_Score']
X = df[features].values

print(f"✅ Selected features: {features}")
print(f"📊 Clustering data shape: {X.shape}")

# Display the clustering data
print("\n🔍 First 5 rows of clustering data:")
print(X[:5])

In [None]:
# Visualize the raw data
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], alpha=0.6, s=50)
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.title('Customer Data: Annual Income vs Spending Score')
plt.grid(True, alpha=0.3)
plt.show()

print("🎨 Raw data visualized! Can you spot any potential clusters?")

## 🔍 Question 1: What is the optimal number of clusters?

Before applying K-Means, we need to choose the number of clusters, k. The elbow method plots the Within-Cluster Sum of Squares (WCSS) against k. WCSS generally shrinks as k grows, so instead of looking for its minimum we look for the "sweet spot" where adding more clusters no longer improves the clustering quality by much.

**Your task**: Use the elbow method to identify the optimal number of clusters by looking for the "elbow" point in the WCSS curve.
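
To make WCSS concrete, here is a minimal sketch that computes it by hand and compares it with scikit-learn's `inertia_` attribute. It runs on hypothetical toy data (`toy_X` is randomly generated, not the customer dataset), so the numbers say nothing about the exercise answer:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two well-separated groups of 2-D points
rng = np.random.default_rng(0)
toy_X = np.vstack([
    rng.normal(loc=[20, 20], scale=2.0, size=(30, 2)),
    rng.normal(loc=[80, 80], scale=2.0, size=(30, 2)),
])

toy_kmeans = KMeans(n_clusters=2, random_state=42, n_init=10).fit(toy_X)

# WCSS: squared distance from each point to its assigned cluster center, summed
assigned_centers = toy_kmeans.cluster_centers_[toy_kmeans.labels_]
manual_wcss = np.sum((toy_X - assigned_centers) ** 2)

print(f"Manual WCSS:     {manual_wcss:.2f}")
print(f"kmeans.inertia_: {toy_kmeans.inertia_:.2f}")
```

The two values should agree up to floating-point error, which is why the loop in the next cell can simply collect `kmeans.inertia_` for each k.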

In [None]:
# Calculate Within-Cluster Sum of Squares (WCSS) for different k values
wcss = []
k_range = range(1, 11)

print("🔍 Calculating WCSS for different k values...")

for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
    print(f"k={k}: WCSS = {kmeans.inertia_:.0f}")

print("\n✅ WCSS calculation completed!")

In [None]:
# Plot the elbow curve
plt.figure(figsize=(10, 6))
plt.plot(k_range, wcss, 'bo-', linewidth=2, markersize=8)
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Within-Cluster Sum of Squares (WCSS)')
plt.title('Elbow Method for Optimal k')
plt.grid(True, alpha=0.3)

plt.show()

print("📊 Elbow curve plotted!")
print("🤔 Based on the elbow curve, what do you think is the optimal number of clusters?")

## 🎯 Question 2: How many customers are in each cluster?

Now let's apply K-Means with the optimal number of clusters you identified and find out how many customers belong to each cluster.

**Your task**: Apply K-Means clustering and report the size of each cluster (both count and percentage).
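
As a quick illustration of "count and percentage", the sketch below summarizes a hypothetical label vector with pandas. The labels in `toy_labels` are invented for the example and are not the clustering result for this dataset:

```python
import pandas as pd

# Hypothetical cluster labels for 10 customers (illustration only)
toy_labels = pd.Series([0, 1, 1, 2, 0, 1, 2, 1, 0, 1], name="Cluster")

# Count members per cluster, ordered by cluster id
toy_counts = toy_labels.value_counts().sort_index()

# Convert counts to percentages of all customers
toy_percent = (toy_counts / len(toy_labels) * 100).round(1)

toy_summary = pd.DataFrame({"count": toy_counts, "percent": toy_percent})
print(toy_summary)
```

The same pattern applies to the `Cluster` column you will add to `df` in the next cell.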

In [None]:
# Apply K-Means with the optimal k identified from the elbow method
# TODO: Set optimal_k to the number of clusters you chose from the elbow curve
# optimal_k =

print(f"🎯 Applying K-Means with k={optimal_k}...")

# TODO: Create a KMeans model with optimal_k clusters (random_state=42, n_init=10),
#       fit it on X, and store the cluster label assigned to each customer
# kmeans =
# cluster_labels =

# Add cluster labels to original dataframe
df['Cluster'] = cluster_labels

# Display results
print(f"✅ K-Means clustering completed!")
print(f"📊 Cluster centers:")
print(kmeans.cluster_centers_)
print(f"\n🎯 Final WCSS: {kmeans.inertia_:.0f}")

In [None]:
# Analyze cluster sizes
print("=== CLUSTER ANALYSIS ===")
cluster_counts = df['Cluster'].value_counts().sort_index()
print("Cluster sizes:")
for cluster, count in cluster_counts.items():
    print(f"  Cluster {cluster}: {count} customers ({count/len(df)*100:.1f}%)")

# Display cluster statistics
print("\n📊 Cluster Statistics:")
for cluster in range(optimal_k):
    cluster_data = df[df['Cluster'] == cluster]
    print(f"\nCluster {cluster}:")
    print(f"  Average Income: ${cluster_data['Annual_Income'].mean():.0f}k")
    print(f"  Average Spending Score: {cluster_data['Spending_Score'].mean():.1f}")
    print(f"  Age Range: {cluster_data['Age'].min()}-{cluster_data['Age'].max()}")

print(f"\n🎉 Analysis complete! You now have the answers to both questions!")

## 📋 Summary

Congratulations! You've successfully completed the K-Means clustering analysis. You should now be able to answer:

1. **What is the optimal number of clusters?** _(Look at your elbow curve)_
2. **How many customers are in each cluster?** _(Check the cluster sizes above)_

🎯 **Ready to submit your answers?** Make sure you have:
- ✅ Identified the optimal number of clusters from the elbow method
- ✅ Determined the size of each cluster (count and percentage)
- ✅ Understood the customer distribution across clusters

Great work on mastering K-Means clustering! 🎉