# Suggested Solution: Customer Segmentation with Bank Marketing Data Set

This notebook presents a suggested solution for the customer segmentation exercise using the Bank Marketing Data Set. In the following sections, we:

- **Load and preprocess the data** by handling missing values, scaling numerical features, and encoding categorical features.
- **Visualize the data** using plots to explore distributions and relationships.
- **Perform clustering** using the $K$-Means algorithm, including the use of the elbow method and silhouette scores to choose the optimal number of clusters.
- **Visualize clusters** with Principal Component Analysis (PCA) to reduce the data to 2D for easier interpretation.
- **Interpret the clusters** by summarizing key features and discussing insights.

Let's get started!

## 1. Data Loading and Preprocessing

In this section, we load the Bank Marketing Data Set (the file `bank-additional-full.csv`), handle missing values, standardize numerical features, and one-hot encode categorical features. Because clustering is unsupervised, we drop the target variable (`y`) from the feature set but keep it aside for later interpretation.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Load the dataset (adjust the relative path below to match where the file is stored)
# The file uses a semicolon (;) as the separator
df = pd.read_csv('../../data/bank_marketing_data_set/bank-additional/bank-additional-full.csv', sep=';')
print('Dataset shape:', df.shape)
df.head()

### Handling Missing Values

In its categorical columns, the dataset uses the string `'unknown'` to mark missing values. Here, we replace `'unknown'` with `NaN` and then drop every row that contains a missing value. This is simple but discards data; in a production system, you might impute missing values instead.

In [None]:
# Replace 'unknown' with NaN and drop rows with missing values
df.replace('unknown', np.nan, inplace=True)
df.dropna(inplace=True)
print('Dataset shape after dropping missing values:', df.shape)
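
As an alternative to dropping rows, the following is a minimal sketch of mode imputation with scikit-learn's `SimpleImputer`. The toy DataFrame and its values are made up for illustration; only the column names `job` and `marital` are borrowed from the dataset, and this is not the approach used in the rest of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-in for two categorical columns (values are illustrative only)
toy = pd.DataFrame({
    'job': ['admin.', 'unknown', 'admin.', 'technician'],
    'marital': ['married', 'single', 'unknown', 'married'],
})

# Treat 'unknown' as missing, then fill each column with its most frequent value
toy = toy.replace('unknown', np.nan)
imputer = SimpleImputer(strategy='most_frequent')
toy_imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(toy_imputed)
```

Mode imputation keeps every row but can inflate the most common category, so it is worth comparing cluster results with and without it.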

### Feature Separation and Preprocessing

Since our goal is to perform unsupervised clustering, we drop the target variable (`y`) from our feature set. We then identify numerical and categorical features, scale the numerical features using `StandardScaler`, and one-hot encode the categorical features.

In [None]:
# Separate target variable from features (retain it for later interpretation)
target = df['y']
df_features = df.drop('y', axis=1)

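# Define the list of numeric columns based on the dataset description
# Note: per that description, pdays uses 999 to mean 'not previously contacted',
# so its scaled values should be interpreted with care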
numeric_cols = [
    'age', 'duration', 'campaign', 'pdays', 'previous', 
    'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed'
]

# All other columns are considered categorical
categorical_cols = [col for col in df_features.columns if col not in numeric_cols]

print('Numeric columns:', numeric_cols)
print('Categorical columns:', categorical_cols)

from sklearn.preprocessing import StandardScaler

# Scale numeric features
scaler = StandardScaler()
df_features[numeric_cols] = scaler.fit_transform(df_features[numeric_cols])

# One-hot encode categorical features (dtype=float keeps the feature matrix purely numeric)
df_features_encoded = pd.get_dummies(df_features, columns=categorical_cols, drop_first=True, dtype=float)

# Final preprocessed dataset for clustering
X = df_features_encoded
print('Preprocessed feature set shape:', X.shape)
X.head()
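
The same scaling and encoding steps can also be bundled into a single reusable object. The sketch below assumes scikit-learn's `ColumnTransformer` and uses a small made-up DataFrame; the names `toy`, `toy_numeric`, `toy_categorical`, and `preprocessor` are hypothetical and not part of the notebook above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data; only the column names are borrowed from the dataset
toy = pd.DataFrame({
    'age': [25, 35, 45, 55],
    'campaign': [1, 2, 3, 2],
    'job': ['admin.', 'technician', 'services', 'admin.'],
})
toy_numeric = ['age', 'campaign']
toy_categorical = ['job']

# sparse_threshold=0 forces a dense array regardless of the encoder's output type
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), toy_numeric),
        ('cat', OneHotEncoder(drop='first'), toy_categorical),
    ],
    sparse_threshold=0,
)
X_toy = preprocessor.fit_transform(toy)
print(X_toy.shape)  # 2 scaled numeric columns + 2 one-hot columns
```

One reason to prefer this form is that the fitted `preprocessor` can be applied unchanged to new customers with `preprocessor.transform(...)`.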

## 2. Data Visualization

Before clustering, it is helpful to visualize the data. Here, we plot the distribution of the numerical features and inspect the correlations among them. Such visualizations can reveal potential outliers and guide our understanding of the feature space.

In [None]:
# Plot histograms for numeric features (values are standardized at this point)
X_numeric = X[numeric_cols]
X_numeric.hist(bins=30, figsize=(15, 10));
plt.tight_layout()
plt.show()

# Plot a correlation heatmap for numeric features
plt.figure(figsize=(12, 10))
sns.heatmap(X_numeric.corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap of Numeric Features')
plt.show()

## 3. Clustering with $K$-Means

We now apply the $K$-Means clustering algorithm to the preprocessed data. In order to determine a good number of clusters, we use the elbow method (plotting the Within-Cluster Sum of Squares, WCSS) and compute the silhouette score for each candidate number of clusters.
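
For reference, the two criteria are usually defined as follows. The notation is ours rather than something defined elsewhere in this notebook: $C_j$ is cluster $j$, $\mu_j$ its centroid, $a(i)$ the mean distance from point $i$ to the other points in its own cluster, and $b(i)$ the mean distance from point $i$ to the points of the nearest other cluster.

$$
\mathrm{WCSS} = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}
$$

The value returned by `silhouette_score` is the mean of $s(i)$ over the points; it lies in $[-1, 1]$, and higher values indicate better-separated clusters. WCSS tends to keep decreasing as $K$ grows, which is why we look for an elbow in the curve rather than its minimum.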

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Use the elbow method and silhouette scores to determine the optimal number of clusters
wcss = []
silhouette_scores = []
K_range = range(2, 11)  # testing k from 2 to 10

for k in K_range:
    # n_init is set explicitly because its default differs across scikit-learn versions
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
    labels = kmeans.labels_
    sil_score = silhouette_score(X, labels)
    silhouette_scores.append(sil_score)
    print(f"Clusters: {k}, WCSS: {kmeans.inertia_:.2f}, Silhouette Score: {sil_score:.3f}")

# Plot the elbow curve and silhouette scores
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(list(K_range), wcss, marker='o')
plt.title('Elbow Method')
plt.xlabel('Number of clusters (k)')
plt.ylabel('WCSS')

plt.subplot(1, 2, 2)
plt.plot(list(K_range), silhouette_scores, marker='o')
plt.title('Silhouette Score')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Silhouette Score')

plt.tight_layout()
plt.show()

# Based on the plots and scores, choose an optimal number of clusters (for example, k=3)
optimal_k = 3
kmeans_optimal = KMeans(n_clusters=optimal_k, n_init=10, random_state=42)
cluster_labels = kmeans_optimal.fit_predict(X)
print('Optimal number of clusters chosen:', optimal_k)
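
Instead of reading the value of k off the plots, the silhouette criterion can also be applied programmatically. The sketch below does this on synthetic, well-separated data generated with `make_blobs` (not the bank data), and uses the `sample_size` argument of `silhouette_score` to score a random subset, which can reduce the cost on larger datasets. The names `X_demo`, `scores`, and `best_k` are hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (illustrative only)
X_demo, _ = make_blobs(
    n_samples=600,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=42,
)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    # sample_size evaluates the score on a random subset of the points
    scores[k] = silhouette_score(X_demo, labels, sample_size=300, random_state=42)

best_k = max(scores, key=scores.get)
print('Silhouette by k:', {k: round(float(s), 3) for k, s in scores.items()})
print('k with the highest silhouette:', best_k)
```

On real data the highest silhouette and the elbow do not always agree, so the automatic choice is best treated as a starting point for inspection.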

## 4. Cluster Visualization using PCA

Since the dataset is high-dimensional, we use Principal Component Analysis (PCA) to reduce the data to 2 dimensions for visualization purposes. The following scatter plot shows the clusters in the reduced space.

In [None]:
from sklearn.decomposition import PCA

# Reduce dimensionality using PCA for visualization
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X)

# Plot the clusters using the first two principal components
plt.figure(figsize=(8, 6))
sns.scatterplot(x=X_pca[:, 0], y=X_pca[:, 1], hue=cluster_labels, palette='Set1', s=50)
plt.title('Clusters Visualization using PCA')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title='Cluster')
plt.show()
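
How much a 2D view can be trusted depends on how much variance the two components retain. The following sketch shows how to check this with the `explained_variance_ratio_` attribute of a fitted `PCA`; it runs on synthetic data, so the printed numbers say nothing about the bank dataset, and `X_demo` and `pca_demo` are hypothetical names.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 10-dimensional data (illustrative only): two strong directions plus noise
rng = np.random.default_rng(42)
latent = rng.normal(size=(500, 2)) * [5.0, 3.0]
mixing = rng.normal(size=(2, 10))
X_demo = latent @ mixing + rng.normal(scale=0.5, size=(500, 10))

pca_demo = PCA(n_components=2, random_state=42)
pca_demo.fit(X_demo)

# Fraction of the total variance captured by each retained component
ratios = pca_demo.explained_variance_ratio_
print('Variance explained per component:', ratios.round(3))
print('Total variance explained in 2D:', round(float(ratios.sum()), 3))
```

In the notebook, the equivalent check would be `pca.explained_variance_ratio_` after the `fit_transform` call above; a low total suggests that apparent overlap in the scatter plot may partly be an artifact of the projection.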

## 5. Cluster Interpretation and Insights

To interpret the clusters, we attach the cluster labels back to the original dataset and compute summary statistics for each cluster. We also examine the distribution of the target variable (`y`) within each cluster to see if certain segments are more likely to subscribe to a term deposit.

Feel free to extend this analysis by exploring additional features or visualizations.

In [None]:
# Add the cluster labels to the original dataframe for further interpretation
df['cluster'] = cluster_labels

# Display summary statistics for numeric features in each cluster
for cluster in range(optimal_k):
    print(f"\nSummary statistics for Cluster {cluster}:")
    display(df[df['cluster'] == cluster][numeric_cols].describe())

# Examine the distribution of the target variable 'y' in each cluster
for cluster in range(optimal_k):
    print(f"\nTarget variable distribution in Cluster {cluster}:")
    print(df[df['cluster'] == cluster]['y'].value_counts())
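
Raw counts are hard to compare when clusters differ in size. A minimal sketch of computing per-cluster proportions with `pd.crosstab` follows; the toy labels are made up, and in the notebook the inputs would be `df['cluster']` and `df['y']`.

```python
import pandas as pd

# Toy labels (illustrative only)
toy = pd.DataFrame({
    'cluster': [0, 0, 0, 0, 1, 1, 2, 2, 2, 2],
    'y': ['no', 'no', 'no', 'yes', 'yes', 'no', 'no', 'no', 'no', 'no'],
})

# normalize='index' turns each row (cluster) into proportions that sum to 1
rates = pd.crosstab(toy['cluster'], toy['y'], normalize='index')
print(rates)
```

The `yes` column then reads directly as the subscription rate of each cluster.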

## Discussion and Reflection

- The **elbow method** and **silhouette scores** helped us choose an appropriate number of clusters (in this example, we selected 3).
- The **PCA visualization** shows how well the clusters separate in a 2D projection. Some overlap is expected, because reducing high-dimensional data to 2 dimensions discards part of the variance.
- The **cluster summaries** and the target variable distributions provide insights into how different customer segments behave. For example, one cluster might have a higher proportion of customers who subscribe to term deposits.
- In a real-world scenario, you could further refine this analysis by experimenting with other clustering techniques, adjusting preprocessing steps, or incorporating domain knowledge to interpret the clusters.
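
As one way to experiment with another clustering technique, the sketch below compares $K$-Means with scikit-learn's `AgglomerativeClustering` on synthetic data and measures how closely the two partitions agree using the adjusted Rand index. The data and the names `X_demo`, `kmeans_labels`, `agglo_labels`, and `agreement` are hypothetical, and agglomerative clustering can be costly on a dataset the size of the full bank data.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic data with three well-separated groups (illustrative only)
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=42,
)

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_demo)
# AgglomerativeClustering uses Ward linkage by default
agglo_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X_demo)

# The adjusted Rand index is 1.0 when two partitions are identical up to relabeling
agreement = adjusted_rand_score(kmeans_labels, agglo_labels)
print(f'Adjusted Rand index between the two partitions: {agreement:.3f}')
```

High agreement between methods is some evidence that the segments reflect structure in the data rather than a quirk of one algorithm.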

### Final Thoughts

Customer segmentation can drive targeted marketing strategies and improve campaign efficiency. However, unsupervised clustering usually requires iterative tuning and expert interpretation to yield actionable insights.