<div style="background-color: #f0f0f0; padding: 25px; border-radius: 5px; font-weight: bold; margin-top: 25px;">

# Introduction to Unsupervised Learning: customer segmentation
    
</div>

> Example to explain the concept of customer segmentation
>
> **Objective:** You are the owner of a shopping center and want to understand the behavior of your customers in order to give this information to the marketing team and plan the strategy accordingly.

> * This example uses hierarchical and K-means clustering.
> * Dimensionality reduction is also used to plot the data in 2D.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# Import data
df=pd.read_csv("Data/Mall_Customers.csv")
df.head()

In [None]:
# Let's check for missing values
df.isnull().sum()

In [None]:
df.dtypes

<div style="background-color: #f0f0f0; padding: 25px; border-radius: 5px; margin-top: 25px;">

`Gender` has the `object` dtype (strings). We need to convert this string feature into discrete numerical values.
</div>

<div style="background-color: #f0f0f0; padding: 25px; border-radius: 5px; font-weight: bold; margin-top: 25px;">

## Exploratory Data Analysis (EDA)
    
</div>

> Gender distribution 
> 
> Gender distribution by age

In [None]:
# Gender distribution

import seaborn as sns
sns.countplot(x='Gender', data=df)
plt.title('Distribution of Gender')

In [None]:
df.describe()

<div style="background-color: #f0f0f0; padding: 25px; border-radius: 5px; margin-top: 25px;">

 Age distribution by gender (histogram)
    
</div>


In [None]:


plt.hist('Age', data=df[df['Gender'] == 'Male'], alpha=0.2, color='g', label='Male')
plt.hist('Age', data=df[df['Gender'] == 'Female'], alpha=0.2, color='blue',label='Female')
plt.title('Age distribution by gender')
plt.xlabel('Age')
plt.legend()

<div style="background-color: #f0f0f0; padding: 25px; border-radius: 5px; margin-top: 25px;">

 Age distribution by gender (density plot)
    
</div>


In [None]:

sns.kdeplot(data=df[df['Gender'] == 'Male']['Age'], 
            fill=True, color='green', label='Male', alpha=0.3)
sns.kdeplot(data=df[df['Gender'] == 'Female']['Age'], 
            fill=True, color='blue', label='Female', alpha=0.3)

# Add a descriptive title and axis labels
plt.title('Age Distribution by Gender (Density Plot)', fontsize=14, fontweight='bold')
plt.xlabel('Age', fontsize=12)
plt.ylabel('Density', fontsize=12)

# Add a legend
plt.legend(title="Gender", fontsize=10)


<div style="background-color: #f0f0f0; padding: 25px; border-radius: 5px; margin-top: 25px;">

 Boxplot Age - Gender
    
</div>

In [None]:
sns.set(style="whitegrid")
sns.boxplot(x="Gender", y="Age", data=df)

<div style="background-color: #f0f0f0; padding: 25px; border-radius: 5px; margin-top: 25px;">

Boxplot with 'Annual Income (k$)'
    
</div>

In [None]:
sns.set(style="whitegrid")
sns.boxplot(x="Gender", y="Annual Income (k$)", data=df)


<div style="background-color: #f0f0f0; padding: 25px; border-radius: 5px; margin-top: 25px;">

Boxplot with 'Spending Score (1-100)'
    
</div>

In [None]:
sns.set(style="whitegrid")
sns.boxplot(x="Gender", y="Spending Score (1-100)", data=df)

<div style="background-color: #f0f0f0; padding: 25px; border-radius: 5px; margin-top: 25px;">

Variables correlation
    
</div>

In [None]:
# Let's check variables correlation
sns.heatmap(df.corr(), annot=True)

<div style="background-color: #f0f0f0; padding: 25px; border-radius: 5px; margin-top: 25px;">

Error! The correlation cannot be computed because the DataFrame still contains string data (the `Gender` column).
    
</div>

> * If the categories have no natural order and the model could misread integer codes as an ordering or as distances (for example linear models, or distance-based methods such as K-means), use **One-Hot Encoding**.
>
> * If dimensionality is a concern, or you're using tree-based algorithms that are less sensitive to arbitrary integer codes, **Label Encoding** is fine.
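The difference between the two encodings is easier to see on a feature with more than two categories. The following is a minimal sketch on a hypothetical `Membership` column (not part of the mall dataset); the column name and its values are assumptions made only for illustration.

```python
import pandas as pd

# Hypothetical toy column with three unordered categories (not the mall dataset)
toy = pd.DataFrame({"Membership": ["Gold", "Silver", "Bronze", "Gold"]})

# One-hot encoding: one 0/1 column per category, no order implied
one_hot = pd.get_dummies(toy, columns=["Membership"], dtype=int)

# Label encoding: one integer per category (alphabetical order here),
# which implicitly suggests Bronze < Gold < Silver
label_codes = toy["Membership"].astype("category").cat.codes

print(one_hot)
print(label_codes.tolist())
```

For a binary feature such as `Gender`, one-hot encoding with `drop_first=True` and label encoding both produce a single 0/1 column, so the choice matters mostly for features with three or more categories.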

In [None]:
df = pd.get_dummies(df, columns=['Gender'], drop_first=True, dtype=int)

In [None]:
df

In [None]:
# Let's check variables correlation
sns.heatmap(df.corr(), annot=True)



<div style="background-color: #f0f0f0; padding: 25px; border-radius: 5px; margin-top: 25px;">

## Example 1. Hierarchical Clustering: groups of clients which are similar regarding Annual Income vs. Age
    
</div>

> * **Agglomerative Clustering** is a type of hierarchical clustering algorithm used to group similar data points together. It is a bottom-up approach, where each data point starts as its own individual cluster, and pairs of clusters are merged as we move up the hierarchy.

> Bottom-up approach:
> * Initially, each data point is treated as its own cluster.
> * The algorithm iteratively merges the two closest clusters at each step.
> * This process continues until all data points are merged into a single cluster or until a predefined number of clusters is reached.

> ``AgglomerativeClustering(n_clusters=...)``: Performs hierarchical clustering. The `n_clusters` argument sets the number of clusters to find (scikit-learn's default is 2). You can adjust this number based on your analysis or the dendrogram.

> ``fit_predict()``: This method fits the model to the scaled data and assigns each data point to a cluster.
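To make the bottom-up procedure concrete, here is a deliberately naive sketch on five hypothetical 2D points. It uses single linkage (the smallest distance between any two members) for brevity, whereas scikit-learn's `AgglomerativeClustering` uses Ward linkage by default; the function name `naive_agglomerative` is made up for this illustration.

```python
import numpy as np

def naive_agglomerative(points, n_clusters):
    """Bottom-up clustering with single linkage (illustration only)."""
    # Start with every point in its own cluster (lists of row indices)
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None  # (distance, index_a, index_b) of the closest pair so far
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: smallest distance between any two members
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge the two closest clusters into one
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Hypothetical toy points: two tight pairs and one isolated point
toy_points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.3, 5.0], [10.0, 0.0]])
result = naive_agglomerative(toy_points, n_clusters=3)
print(result)
```

The two tight pairs are merged first and the isolated point stays alone, which is the behavior described in the bullets above.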

In [None]:

# create a subset/specific dataframe for this exercise
df_age_income = df[['Age', 'Annual Income (k$)']]

In [None]:
df_age_income

<div style="background-color: #f0f0f0; padding: 25px; border-radius: 5px; margin-top: 25px;">

Scale the data and Apply Agglomerative clustering
    
</div>



In [None]:
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler


# Step 1: Scale the data
scaler = StandardScaler()
df_age_income_scaled = scaler.fit_transform(df_age_income)

# Step 2: Apply Agglomerative Clustering (Hierarchical Clustering)
agg_clustering = AgglomerativeClustering(n_clusters=2)  # 2 is the default; adjust n_clusters as needed
y_hierarchical = agg_clustering.fit_predict(df_age_income_scaled)


In [None]:
y_hierarchical

In [None]:
df_age_income_scaled

In [None]:
# Step 3: Plot the results
plt.figure(figsize=(8, 6))
plt.scatter(df_age_income_scaled[:, 0], df_age_income_scaled[:, 1], c=y_hierarchical, cmap='viridis', s=50)
plt.title('Hierarchical Clustering')
plt.xlabel('Age (standardized)')
plt.ylabel('Annual Income (standardized)')
plt.show()


<div style="background-color: #f0f0f0; padding: 25px; border-radius: 5px; margin-top: 25px;">

Plot the results on the unscaled data to better understand the groups
    
</div>


In [None]:

# Step 3 (repeated): plot the same cluster labels against the original, unscaled values
plt.figure(figsize=(8, 6))
plt.scatter(df_age_income.loc[:, "Age"], df_age_income.loc[:, "Annual Income (k$)"], c=y_hierarchical, cmap='viridis', s=50)
plt.title('Hierarchical Clustering')
plt.xlabel('Age')
plt.ylabel('Annual Income (k$)')
plt.show()

<div style="background-color: #f0f0f0; padding: 25px; border-radius: 5px; margin-top: 25px;">

Visualizing the Dendrogram
    
</div>

> The dendrogram shows how clusters are merged at each level. The height at which two branches join is the distance between the merged clusters: the lower the join, the more similar the clusters are.
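A dendrogram can also be "cut" to obtain flat cluster labels. The sketch below uses SciPy's `linkage` and `fcluster` on hypothetical synthetic data (two well-separated blobs); the data and the choice of two clusters are assumptions for illustration only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Hypothetical toy data: two well-separated blobs of 10 points each
toy = np.vstack([rng.normal(0, 0.5, size=(10, 2)),
                 rng.normal(5, 0.5, size=(10, 2))])

# Each row of Z describes one merge: [cluster_a, cluster_b, merge_height, new_size]
Z = linkage(toy, method="ward")
print(Z[-1])  # the final merge, which joins the two blobs

# Cut the tree so that at most 2 flat clusters remain
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

The last merge has by far the largest height, which is the visual cue a dendrogram gives for choosing where to cut.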

In [None]:
from scipy.cluster.hierarchy import dendrogram, linkage

# Step 4: Create the linkage matrix for the dendrogram
# Use the scaled data so the tree matches the clustering performed above
linked = linkage(df_age_income_scaled, 'ward')

# Step 5: Plot the dendrogram
plt.figure(figsize=(20, 7))
dendrogram(linked)
plt.title('Dendrogram for Hierarchical Clustering')
plt.xlabel('Sample Index')
plt.ylabel('Euclidean Distance')
plt.show()



<div style="background-color: #f0f0f0; padding: 25px; border-radius: 5px; margin-top: 25px;">

## Example 2. K-means: Find K groups of clients which are similar regarding Spending Score vs. Age
    
</div>

> * To set the number of clusters we want to have as a result, we will use the **Elbow method**


In [None]:
sns.scatterplot(x='Age', y='Spending Score (1-100)', data=df)
plt.title('Age to Spending Score')

<div style="background-color: #f0f0f0; padding: 25px; border-radius: 5px; font-weight: bold; margin-top: 25px;">

Elbow method
    
</div>

> * The Elbow Method is a commonly used technique in clustering analysis to determine the optimal number of clusters (k) in a dataset.
> * On the graph, the **x-axis** represents the **number of clusters (k)**, and the **y-axis** represents the within-cluster sum of squares **(WCSS)**.
> * The Elbow Method helps balance between:
>     * Underfitting: too few clusters (large WCSS).
>     * Overfitting: too many clusters (small WCSS but less meaningful groupings).
> * By choosing the "elbow point," you aim for a model that captures the structure of the data efficiently without unnecessary complexity.
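WCSS is the sum, over all points, of the squared distance between a point and the centroid of its cluster. As a minimal sketch, the helper below (the name `wcss_by_hand` is hypothetical) computes it by hand for four toy points, once with a single cluster and once with two.

```python
import numpy as np

def wcss_by_hand(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    diffs = points - centroids[labels]
    return float(np.sum(diffs ** 2))

# Hypothetical toy points: a left pair and a right pair
points = np.array([[1.0, 1.0], [1.0, 3.0], [9.0, 1.0], [9.0, 3.0]])

# k = 1: all points share one centroid at the overall mean (5, 2)
labels_k1 = np.array([0, 0, 0, 0])
centroids_k1 = points.mean(axis=0, keepdims=True)

# k = 2: the left pair and the right pair each get their own centroid
labels_k2 = np.array([0, 0, 1, 1])
centroids_k2 = np.array([points[:2].mean(axis=0), points[2:].mean(axis=0)])

print(wcss_by_hand(points, labels_k1, centroids_k1))  # 68.0
print(wcss_by_hand(points, labels_k2, centroids_k2))  # 4.0
```

Going from one to two clusters drops the WCSS sharply here; adding further clusters could only remove the remaining 4.0, which is the kind of flattening the elbow plot reveals.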

<div style="background-color: #f0f0f0; padding: 25px; border-radius: 5px; margin-top: 25px;">

Scale the data
    
</div>

In [None]:
from sklearn.preprocessing import StandardScaler

# create a subset/specific dataframe for this exercise
df_age_spending = df[['Age', 'Spending Score (1-100)']]

# Scale the data
scaler = StandardScaler()
df_age_spending_scaled = scaler.fit_transform(df_age_spending)

In [None]:
df_age_spending_scaled

In [None]:
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans


wcss = []
K = range(1,15)

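for k in K:
    km = KMeans(n_clusters=k, n_init=10, random_state=0)  # fixed seed for a reproducible curve
    km = km.fit(df_age_spending_scaled)
    wcss.append(km.inertia_)  # inertia_ is the WCSS of the fitted model
plt.plot(K, wcss, 'bx-')
plt.xlabel('Number of clusters (k)')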
plt.ylabel('WCSS')
plt.title('Elbow Method For Optimal k')
plt.show()

> * The elbow appears to occur at **k = 3**. After this point, the WCSS decreases much more slowly, indicating that adding more clusters does not significantly improve the fit. Therefore, k = 3 is a reasonable choice for this dataset.

> * ``fit()``: Fits the k-means model to the data. The algorithm iteratively computes the positions of the cluster centroids, reassigning each data point to its nearest centroid along the way.
> * ``predict()``: Assigns each data point to one of the clusters already calculated by the model. In other words, after the centroids of the clusters are calculated, the algorithm assigns each point to its closest centroid.
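The split between `fit()` and `predict()` matters most when a new data point arrives after training. The sketch below uses six hypothetical customers (age, spending score) and one hypothetical new customer; all values are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical toy customers: [age, spending score]
X = np.array([[20, 80], [22, 85], [25, 78],
              [60, 20], [62, 25], [65, 18]], dtype=float)

scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# fit() learns the centroids; labels_ holds the assignments of the training points
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)

# predict() assigns a new point to the nearest learned centroid.
# The new point must go through the same scaler as the training data.
new_customer = np.array([[23.0, 90.0]])
new_label = km.predict(scaler.transform(new_customer))[0]

print(km.labels_, new_label)
```

The young, high-spending newcomer lands in the same cluster as the first three training customers, without the centroids being recomputed.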




In [None]:
kmeans = KMeans(n_clusters=3, random_state=0).fit(df_age_spending_scaled)
y_kmeans = kmeans.predict(df_age_spending_scaled)


In [None]:
y_kmeans

In [None]:

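plt.scatter(df['Age'], df['Spending Score (1-100)'], c=y_kmeans, s=50, alpha=0.5, cmap='viridis')
# Centroids live in the scaled space; map them back to the original units before plotting
centers = scaler.inverse_transform(kmeans.cluster_centers_)
plt.scatter(centers[:, 0], centers[:, 1], c='red', marker='X', s=200, label='Centroids')
plt.xlabel('Age')
plt.ylabel('Spending Score (1-100)')
plt.legend()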

### Conclusions

The marketing target groups could be as follows:

> * Young customers with high spending.
> * Young customers with low spending.
> * Older customers with low to medium spending.




<div style="background-color: #f0f0f0; padding: 25px; border-radius: 5px; margin-top: 25px;">

## Example 3. K-means: Find K groups of clients which are similar regarding Spending Score vs Annual Income
    
</div>

> * To set the number of clusters we want to have as a result, we will use the **Elbow method**


**Which groups do you detect?**

In [None]:
sns.scatterplot(x='Annual Income (k$)', y='Spending Score (1-100)', data=df)
plt.title('Annual Income(k$) to Spending Score')

In [None]:
# create a subset/specific dataframe for this exercise
df_income_spending = df[['Annual Income (k$)', 'Spending Score (1-100)']]


# Scale the data
scaler = StandardScaler()
df_income_spending_scaled = scaler.fit_transform(df_income_spending)

In [None]:
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

wcss = []
K = range(1,15)

for k in K:
    km = KMeans(n_clusters=k, n_init=10, random_state=0)  # fixed seed for a reproducible curve
    km = km.fit(df_income_spending_scaled)
    wcss.append(km.inertia_)  # inertia_ is the WCSS of the fitted model

# Plot the Elbow
plt.plot(K, wcss, 'bx-')
plt.xlabel('Number of clusters (k)')
plt.ylabel('WCSS')
plt.title('Elbow Method For Optimal k')
plt.show()


In [None]:
kmeans = KMeans(n_clusters=5, random_state=0).fit(df_income_spending_scaled)
y_kmeans = kmeans.predict(df_income_spending_scaled)

In [None]:
y_kmeans

In [None]:

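plt.scatter(df['Annual Income (k$)'], df['Spending Score (1-100)'], c=y_kmeans, s=50, alpha=0.5, cmap='viridis')
# Centroids live in the scaled space; map them back to the original units before plotting
centers = scaler.inverse_transform(kmeans.cluster_centers_)
plt.scatter(centers[:, 0], centers[:, 1], c='red', marker='X', s=200, label='Centroids')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()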

### Conclusions
> * Low income, high spending (Yellow cluster): These customers have low incomes but spend heavily, indicating strong engagement or discretionary spending habits.
> * Low income, low spending (Green cluster): Customers in this group have low incomes and spend minimally, likely budget-conscious shoppers.
> * Moderate income, moderate spending (Purple cluster): These customers have moderate incomes and spend moderately, likely representing the store’s core, regular customer base.
> * High income, low spending (Dark blue cluster): Despite having high incomes, these customers spend less. The store may need to find ways to increase engagement or tailor marketing efforts for this group.
> * High income, high spending (Blue cluster): These customers have high incomes and spend heavily.


<div style="background-color: #f0f0f0; padding: 25px; border-radius: 5px; margin-top: 25px;">

# Dimensionality reduction with a linear unsupervised method: Principal Component Analysis (PCA)
    
</div>


> * Principal Component Analysis (PCA) is a statistical technique used for dimensionality reduction. It transforms a high-dimensional dataset into a lower-dimensional one by identifying the most important directions (called principal components) that capture the maximum variance in the data. This allows for a more efficient representation of the data while preserving its essential structure, making it easier to visualize and analyze.
> * In simpler terms, PCA reduces the number of variables in the data while retaining as much information as possible.
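The idea of "directions that capture the maximum variance" can be sketched directly with NumPy. The example below is an illustration on hypothetical synthetic data with three features, one of which is nearly a copy of another; it follows the textbook eigen-decomposition route and is not necessarily how scikit-learn computes PCA internally.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy data: the second feature is almost a copy of the first
x = rng.normal(size=200)
data = np.column_stack([x, x + 0.1 * rng.normal(size=200), rng.normal(size=200)])

# 1. Standardize each feature (zero mean, unit variance)
standardized = (data - data.mean(axis=0)) / data.std(axis=0)

# 2. Covariance matrix and its eigen-decomposition (eigh is for symmetric matrices)
cov = np.cov(standardized, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 3. Sort the components by decreasing variance (eigenvalue)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 4. Project the data onto the first two principal components
projected = standardized @ eigenvectors[:, :2]
explained = eigenvalues[:2] / eigenvalues.sum()
print(projected.shape, explained.round(3))
```

Because two of the three features are almost redundant, two components retain nearly all of the variance, which is the sense in which PCA keeps "as much information as possible".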


>
> * We use the whole dataset to cluster clients, taking all the available information into account.

In [None]:
df.head()

In [None]:
df = df.drop(columns=["CustomerID"])
df

## Dimension reduction: PCA

> * Reduce the dataset to 2 principal components that capture the maximum information of the original attributes.
> * By applying PCA, we transform the data into a 2D space, allowing us to visually observe the structure and grouping of the data points.
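Because PCA looks for directions of maximum variance, a feature measured on a large scale can dominate the components. The sketch below compares PCA on raw and standardized versions of a hypothetical two-feature dataset (an income-like column and a 0/1 flag); the data is synthetic and only meant to illustrate the effect.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical toy features on very different scales
income = rng.normal(60, 25, size=300)        # spread of tens of units
gender_flag = rng.integers(0, 2, size=300)   # only 0 or 1
toy = np.column_stack([income, gender_flag]).astype(float)

pca_raw = PCA(n_components=2).fit(toy)
pca_scaled = PCA(n_components=2).fit(StandardScaler().fit_transform(toy))

# Share of the total variance captured by each component
print(pca_raw.explained_variance_ratio_.round(3))
print(pca_scaled.explained_variance_ratio_.round(3))
```

On the raw data the first component is essentially the income column alone; after standardization both features contribute on an equal footing.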

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)  # Scaling the data


pca = PCA(n_components=2)
results_pca = pca.fit_transform(scaled_data)  # fit PCA on the scaled data, not the raw df
print(results_pca)


In [None]:

plt.scatter(results_pca[:, 0], results_pca[:, 1],alpha=0.4)
plt.show()

Calculate the optimal number of clusters for these two principal components.

In [None]:
wcss = []
K = range(1,15)

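for k in K:
    km = KMeans(n_clusters=k, n_init=10, random_state=0)  # fixed seed for a reproducible curve
    km = km.fit(results_pca)
    wcss.append(km.inertia_)  # inertia_ is the WCSS of the fitted model

plt.plot(K, wcss, 'bx-')
plt.xlabel('Number of clusters (k)')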
plt.ylabel('WCSS')
plt.title('Elbow Method For Optimal k')
plt.show()

In [None]:
kmeans = KMeans(n_clusters=3, random_state=0).fit(results_pca)
y_kmeans = kmeans.predict(results_pca)

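plt.scatter(results_pca[:, 0], results_pca[:, 1], c=y_kmeans, s=50, alpha=0.5, cmap='viridis')
# The centroids are already in PCA space, so they can be plotted directly
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', marker='X', s=200, label='Centroids')
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.legend()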

## Adding the resulting Clusters to the Original Data

> * **Clustering results in labels**: The clustering process has generated a label for each data point.
> * **Supervised learning with cluster labels**: These labels can now be used to train a supervised classification model. Clustering is one way to create labels when none exist.
> * **Predicting customer groups**: When a new customer arrives, the trained model can predict which group they belong to from their input data, without repeating the clustering or grouping customers manually.
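The two-step idea (cluster first, then classify) could look like the sketch below. It uses hypothetical synthetic customers and a k-nearest-neighbors classifier; the choice of classifier is an assumption for illustration, and any supervised model could be used instead.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Hypothetical toy data: two groups of customers in a 2D feature space
X = np.vstack([rng.normal(0, 0.5, size=(30, 2)),
               rng.normal(4, 0.5, size=(30, 2))])

# Step 1 (unsupervised): clustering produces one label per customer
cluster_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Step 2 (supervised): train a classifier using the cluster labels as targets
clf = KNeighborsClassifier(n_neighbors=5).fit(X, cluster_labels)

# Step 3: assign a new, hypothetical customer to one of the existing groups
new_customer = np.array([[3.8, 4.1]])
predicted = clf.predict(new_customer)[0]
print(predicted)
```

The new customer sits near the second group and receives that group's label, with no need to re-run the clustering.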


In [None]:
y_kmeans

In [None]:
df_labels = df.copy()
df_labels['cluster'] = y_kmeans

In [None]:
df_labels.head()



<div style="background-color: #f0f0f0; padding: 25px; border-radius: 5px; margin-top: 25px;">

 ## Plot Age vs Annual Income, coloring the data points for each Cluster
    
</div>

In [None]:
plt.figure(figsize=(8, 6))

# Scatter plot with color depending on the 'cluster' column
sns.scatterplot(data=df_labels, x='Age', y='Annual Income (k$)', hue='cluster', palette='viridis', s=100)

# Customize the plot
plt.title('Age vs Annual Income by Cluster')
plt.xlabel('Age')
plt.ylabel('Annual Income (k$)')
plt.legend(title='Cluster', loc='upper left')

# Show plot
plt.show()



<div style="background-color: #f0f0f0; padding: 25px; border-radius: 5px; margin-top: 25px;">

 ## Plot Annual Income (k$) vs Spending Score, coloring the data points for each Cluster
    
</div>

In [None]:
plt.figure(figsize=(8, 6))

# Scatter plot with color depending on the 'cluster' column
sns.scatterplot(data=df_labels, y='Spending Score (1-100)', x='Annual Income (k$)', hue='cluster', palette='viridis', s=100)

# Customize the plot
plt.title('Annual Income (k$) vs Spending Score by Cluster')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend(title='Cluster', loc='upper left')

# Show plot
plt.show()



<div style="background-color: #f0f0f0; padding: 25px; border-radius: 5px; margin-top: 25px;">

 ## Plot Age vs Spending Score, coloring the data points for each Cluster
    
</div>

In [None]:
plt.figure(figsize=(8, 6))

# Scatter plot with color depending on the 'cluster' column
sns.scatterplot(data=df_labels, y='Spending Score (1-100)', x='Age', hue='cluster', palette='viridis', s=100)

# Customize the plot
plt.title('Age vs Spending Score by Cluster')
plt.xlabel('Age')
plt.ylabel('Spending Score (1-100)')
plt.legend(title='Cluster', loc='upper left')

# Show plot
plt.show()