## Mall Customers Clustering

The objective of this project is to apply several clustering algorithms to the dataset, draw conclusions about the customers, and compare the models used.

**Importing Modules**

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans, AgglomerativeClustering, AffinityPropagation, DBSCAN
import scipy.cluster.hierarchy as sch
import plotly.figure_factory as ff
import plotly.express as px
import plotly.graph_objects as go
sns.set()

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

%matplotlib inline

**Loading and getting to know the dataset**

In [None]:
dataset = pd.read_csv('../input/customer-segmentation-tutorial-in-python/Mall_Customers.csv')
dataset.head()

**The dataset has:**
- Numerical columns: CustomerID, Age, Annual Income and Spending Score.
- Categorical column: Gender    

In [None]:
dataset.describe().T

- There are 200 observations in total.
- The customers range from 18 to 70 years old.
- The annual income ranges from 15000 USD to 137000 USD.

*Checking for missing values*

In [None]:
print(pd.isnull(dataset).sum())

There are no missing values in the set.

**EDA**

First I'll drop the CustomerID column, as it doesn't carry any useful information.

In [None]:
dataset2 = dataset.copy()
dataset2 = dataset2.drop(['CustomerID'], axis = 1)

In [None]:
fig = px.scatter_matrix(dataset2,
    dimensions=["Age", "Annual Income (k$)", "Spending Score (1-100)"],
    color="Gender", symbol="Gender",
    title="Scatter matrix",
    labels={col:col.replace('_', ' ') for col in dataset2.columns}) # remove underscore

fig.update_traces(diagonal_visible=False)
fig.show()

Although many relations could be analyzed, I'll focus on Annual Income vs. Spending Score.

**Correlations**

In [None]:
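x = ["Age", "Annual Income (k$)", "Spending Score (1-100)"]
# Correlate only the numeric columns: 'Gender' holds strings, and
# DataFrame.corr() raises on non-numeric columns in recent pandas versions.
corr = dataset2[x].corr()
heat = go.Heatmap(z = corr.values,
                  x = x,
                  y = x,
                  xgap=1, ygap=1,
                  colorbar_thickness=20,
                  colorbar_ticklen=3,
                  hovertext = corr.round(2).values,
                  hoverinfo='text'
                   )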

title = 'Correlation Matrix'               

layout = go.Layout(title_text=title, title_x=0.5, 
                   width=600, height=600,
                   xaxis_showgrid=False,
                   yaxis_showgrid=False,
                   yaxis_autorange='reversed')
   
fig=go.Figure(data=[heat], layout=layout)        
fig.show() 

There seems to be only a slight negative correlation between 'Age' and 'Spending Score' in this set: people tend to spend less as they get older.
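As a reminder of what the values in the heatmap mean, here is a minimal sketch of the Pearson coefficient computed by hand with NumPy. The `age` and `score` arrays are toy values made up for illustration (they are not taken from the dataset), and `pearson_r` is a hypothetical helper name.

```python
import numpy as np

# Toy values for illustration only; NOT taken from Mall_Customers.csv.
age = np.array([20, 30, 40, 50, 60], dtype=float)
score = np.array([80, 70, 55, 45, 30], dtype=float)

def pearson_r(a, b):
    """Pearson correlation: covariance scaled by both standard deviations."""
    a_c = a - a.mean()  # center each variable on its mean
    b_c = b - b.mean()
    return float((a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum()))

r = pearson_r(age, score)
print(f"r = {r:.3f}")  # negative: in this toy sample the score falls as age rises
```

A value near -1 or 1 indicates a strong linear relation, while a value near 0 indicates none, which is how the "slight" correlation above should be read.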

**Let's see how numerical columns are distributed**

In [None]:
hist_data = [dataset2['Age'], dataset2['Annual Income (k$)'], dataset2['Spending Score (1-100)']]
group_labels = ['Age', 'Annual Income (k$)', 'Spending Score (1-100)']

fig = ff.create_distplot(hist_data, group_labels, bin_size=[5, 10, 8])
fig.update_layout(title_text='Age, Income and Score distribution')
fig.show()

In [None]:
fig = px.scatter(dataset2, x="Annual Income (k$)", y = "Spending Score (1-100)",size='Age', color="Gender")
fig.show()

There is no clear correlation between Annual Income and Spending Score. Let's see later what the clustering analysis can tell us.

**What about gender and age?**

In [None]:
Genre = pd.DataFrame(dataset2['Gender'].value_counts()).reset_index()
Genre.columns = ['Gender','Total']
fig = px.pie(Genre, values = 'Total', names = 'Gender', title='Gender', hole=.4, color = 'Gender',width=800, height=400)
fig.show()

In [None]:
fig = px.bar(Genre, x = 'Gender', y='Total', color='Gender',width=600, height=500)
fig.show()

In [None]:
Male = dataset2[dataset2["Gender"] == 'Male'][['Gender','Age']]
temp = pd.DataFrame(Male['Age'].value_counts().reset_index())
temp.columns = ['Age','Total']

Female = dataset2[dataset2["Gender"] == 'Female'][['Gender','Age']]
temp2 = pd.DataFrame(Female['Age'].value_counts().reset_index())
temp2.columns = ['Age','Total']

In [None]:
import plotly.graph_objects as go

fig = go.Figure()
fig.add_trace(go.Bar(
    x = temp['Age'],
    y = temp['Total'],
    name='Male',
    marker_color='rgba(94, 144, 175, 0.8)'
))
fig.add_trace(go.Bar(
    x = temp2['Age'],
    y = temp2['Total'],
    name='Female',
    marker_color='rgba(249, 70, 10, 0.9)'
))

# Here we modify the tickangle of the xaxis, resulting in rotated labels.
fig.update_layout(title = 'Age per gender', barmode = 'group', xaxis_tickangle=-45)
fig.show()


**Conclusions:**
- There are more women than men in the dataset, and the average age of both groups is around 33 years.
- Among the older customers, there are more men than women.
- There is no strong correlation between age and income, and only a slight one between age and spending score.

In [None]:
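# Cluster on two features only: columns 2 and 3 of dataset2,
# i.e. 'Annual Income (k$)' and 'Spending Score (1-100)'.
X = dataset2.iloc[:,2:4].values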

**Clustering analysis**

I'll do some comparison with different clustering algorithms:
- K-Means
- Hierarchical clustering
- Affinity propagation
- DBSCAN

## K-Means

K-Means starts from initial cluster centers (chosen at random or, as here, with k-means++) and then iteratively reassigns points and moves the centers to find better solutions. The number of clusters has to be given beforehand, so I'll use the within-cluster sum of squares (WCSS) and the elbow method to pick a reasonable value.
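The iteration that K-Means performs can be sketched in a few lines of NumPy. This is only an illustration under simplifying assumptions: it uses plain random initialization instead of k-means++, a single run, and toy points that are not the mall data. `kmeans_sketch` is a hypothetical helper, not the scikit-learn implementation.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=100, seed=0):
    """Lloyd iterations: assign each point to its nearest center, then move the centers."""
    rng = np.random.default_rng(seed)
    # Plain random initialization: pick k distinct data points as starting centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every point to every center, shape (n_points, k).
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each center to the mean of its points (keep it if it has no points).
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final assignment and WCSS (what scikit-learn exposes as inertia_).
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    wcss = float(d2[np.arange(len(points)), labels].sum())
    return labels, centers, wcss

# Toy data: two well-separated squares of four points each.
points = np.array([[1, 1], [1, 2], [2, 1], [2, 2],
                   [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
demo_labels, demo_centers, demo_wcss = kmeans_sketch(points, k=2)
print(demo_labels, demo_wcss)
```

The returned WCSS is the quantity that the elbow method tracks as the number of clusters grows.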

**WCSS**

In [None]:
wcss = []
for i in range(1, 11):  # try 1 to 10 clusters
    kmeans = KMeans(n_clusters = i, init = "k-means++", max_iter = 500, n_init = 10, random_state = 123)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

# x must have the same length as wcss (10 values)
fig = go.Figure(data = go.Scatter(x = list(range(1, 11)), y = wcss))


fig.update_layout(title='WCSS vs. Cluster number',
                   xaxis_title='Clusters',
                   yaxis_title='WCSS')
fig.show()
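The elbow is usually read off the plot by eye, but it can also be estimated numerically. The sketch below uses a rough heuristic that is my own assumption, not part of the original analysis: it picks the k where the curve bends most sharply (largest second difference). The `wcss_demo` values are made up for illustration and were not computed from the dataset.

```python
import numpy as np

# Illustrative WCSS values for k = 1..8 (made up, NOT computed from the mall data).
wcss_demo = np.array([1000.0, 700.0, 450.0, 250.0, 100.0, 85.0, 75.0, 68.0])
ks = np.arange(1, len(wcss_demo) + 1)

# Second difference at each interior k: wcss[k-1] - 2*wcss[k] + wcss[k+1].
# It is largest where a steep drop is followed by a flat stretch.
bend = wcss_demo[:-2] - 2 * wcss_demo[1:-1] + wcss_demo[2:]
elbow_k = int(ks[1:-1][bend.argmax()])
print(elbow_k)  # 5 for these illustrative values
```

The heuristic depends on the scale of the curve and can be fooled by noisy values, so it is best treated as a cross-check on the visual reading rather than a replacement for it.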

In [None]:
kmeans = KMeans(n_clusters = 5, init="k-means++", max_iter = 500, n_init = 10, random_state = 123)
identified_clusters = kmeans.fit_predict(X)


data_with_clusters = dataset2.copy()
data_with_clusters['Cluster'] = identified_clusters

In [None]:
fig = px.scatter_3d(data_with_clusters, x = 'Age', y='Annual Income (k$)', z='Spending Score (1-100)',
              color='Cluster', opacity = 0.8, size='Age', size_max=30)
fig.show()

We can see that the clusters could be labeled as:
- Low income and low spending score (blue)
- Low income and high spending score (yellow)
- Mid income and mid spending score (pink): Seems to be the most populated one
- High annual income and low spending score (purple)
- High annual income and high spending score (orange)

As the mall marketing department, we would like to move every observation upward so people spend more money. We should focus on the pink and purple clusters, as they represent either a large share of the customers or a high annual income that could be spent. We could study what people in the pink cluster mostly buy and offer discounts on those items, and offer premium items to people in the purple one.

## Hierarchical Clustering

In hierarchical agglomerative clustering, each point starts as its own cluster, and clusters are merged step by step according to the distance between them (the closest ones first). I can choose which distance and linkage criterion to use.
To get a better idea of the number of clusters, I'll use a dendrogram.
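The merge-then-cut idea can be sketched with SciPy alone. The points below are toy values forming three obvious groups (not the mall data), and cutting the tree with `fcluster` is one possible way to turn a dendrogram into flat labels, shown here as an illustration.

```python
import numpy as np
import scipy.cluster.hierarchy as sch

# Toy 2-D points forming three obvious groups (illustrative, not the mall data).
points = np.array([
    [1.0, 1.0], [1.5, 1.2], [1.2, 0.8],
    [8.0, 8.0], [8.3, 7.9], [7.8, 8.4],
    [1.0, 9.0], [1.4, 9.3], [0.7, 8.8],
])

# Ward linkage merges, at each step, the pair of clusters whose union
# increases the within-cluster variance the least.
Z = sch.linkage(points, method="ward")

# Each row of Z describes one merge: the two cluster ids joined,
# the merge distance, and the size of the new cluster.
print(Z.shape)  # one merge per row: n_points - 1 rows, 4 columns

# Cut the tree so that at most three flat clusters remain.
labels = sch.fcluster(Z, t=3, criterion="maxclust")
print(labels)
```

The merge distances in the third column of `Z` are the heights of the vertical lines in the dendrogram, which is why a long vertical line suggests a natural place to cut.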

In [None]:
fig = ff.create_dendrogram(X,
                           linkagefun = lambda x: sch.linkage(x, "ward"),)

# Ward minimizes the variance of the points inside a cluster.

fig.update_layout(title = 'Hierarchical Clustering', xaxis_title='Customers',
                   yaxis_title='Euclidean Distance', width=700, height=700)

fig.show()

In [None]:
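# Ward linkage always uses Euclidean distance, so no distance argument is needed.
# (The old 'affinity' argument was renamed to 'metric' and removed in newer scikit-learn releases.)
hc = AgglomerativeClustering(n_clusters = 5, linkage = "ward")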
identified_clusters = hc.fit_predict(X)

data_with_clusters = dataset2.copy()
data_with_clusters['Cluster'] = identified_clusters

fig = px.scatter_3d(data_with_clusters, x = 'Age', y='Annual Income (k$)', z='Spending Score (1-100)',
              color='Cluster', opacity = 0.8, size='Age', size_max=30)
fig.show()

It produces nearly the same clusters as K-Means, with just slight differences.

**Conclusion:** 
- Using the elbow method, the most suitable number of clusters may be 3 or 5.
- Looking at the dendrogram and cutting horizontally across the longest vertical line (the second blue one from the left), 5 clusters seems to be the best option.
- With both algorithms the resulting clusters lead to the same conclusions. Both seem to have worked really well.
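One way to put a number on how similar two clusterings are is the adjusted Rand index, which ignores how the cluster ids happen to be numbered. This is a sketch that goes slightly beyond the original analysis: the label lists below are hypothetical stand-ins for the K-Means and hierarchical results, not the real outputs.

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical label vectors standing in for two clustering results.
labels_a = [0, 0, 0, 1, 1, 1, 2, 2, 2]
labels_b = [2, 2, 2, 0, 0, 0, 1, 1, 1]  # same grouping as labels_a, different ids
labels_c = [2, 2, 0, 0, 0, 0, 1, 1, 1]  # one point assigned to another cluster

same = adjusted_rand_score(labels_a, labels_b)
close = adjusted_rand_score(labels_a, labels_c)
print(same, close)  # identical groupings score 1.0; small differences score below 1
```

To apply this in the notebook, the K-Means and hierarchical labels would need to be stored under separate names, since `identified_clusters` is overwritten by each model.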

## Affinity Propagation

This algorithm doesn't require a preset number of clusters. It takes as input measures of similarity between pairs of data points, and points that are similar to each other can end up in the same cluster. I'll use the default settings.
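To make the "similarity between pairs" input concrete, the sketch below builds the similarity matrix by hand, assuming negative squared Euclidean distance (scikit-learn's default for this estimator), and passes it in as a precomputed affinity. The points are toy values, and `ap_demo` is a hypothetical name.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# Toy points (illustrative, not the mall data).
points = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
                   [8.0, 8.0], [8.2, 7.7], [7.9, 8.3]])

# Similarity of points i and j: negative squared Euclidean distance,
# so closer points get a higher (less negative) similarity.
diff = points[:, None, :] - points[None, :, :]
S = -(diff ** 2).sum(axis=2)

# The preference says how willing each point is to become an exemplar.
# scikit-learn defaults to the median of the similarities when it is not given;
# larger values tend to produce more clusters.
preference = np.median(S)

ap_demo = AffinityPropagation(affinity="precomputed", preference=preference, random_state=0)
labels = ap_demo.fit_predict(S)
print(labels)
```

Since the number of clusters follows from `preference` rather than being set directly, adjusting that parameter is the natural lever when the default yields too many or too few groups.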

In [None]:
ap = AffinityPropagation(random_state = 0)
identified_clusters = ap.fit_predict(X)

data_with_clusters = dataset2.copy()
data_with_clusters['Cluster'] = identified_clusters

fig = px.scatter_3d(data_with_clusters, x = 'Age', y='Annual Income (k$)', z='Spending Score (1-100)',
              color='Cluster', opacity = 0.8, size='Age', size_max=30)
fig.show()

The number of clusters is 10. It doesn't seem like a good result here, but it could be a useful algorithm for other datasets, or perhaps for studying how Age relates to the other variables.

## DBSCAN

It is a density-based clustering algorithm. For each observation, the algorithm looks at a neighborhood around it and counts how many data points fall inside. Points in dense neighborhoods are grouped into a cluster, which keeps growing until there are no more nearby points; then the algorithm proceeds to form another cluster.
I'll define the minimum number of data points needed to form a cluster (`min_samples`) and the maximum distance for two points to be considered neighbors (`eps`). I don't have to set the number of clusters beforehand.
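The roles of the two parameters, and the noise label that DBSCAN assigns to isolated points, can be seen in a small sketch. The data below is a toy layout (two 3x3 grids and one far-away point), not the mall data, and `db_demo` is a hypothetical name.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two 3x3 grids of points with spacing 1, plus one isolated point (toy data).
grid = np.array([[i, j] for i in range(3) for j in range(3)], dtype=float)
points = np.vstack([grid, grid + 20.0, [[100.0, 100.0]]])

# eps: neighborhood radius; min_samples: points needed in that radius
# (the point itself included) for a point to count as a core point.
db_demo = DBSCAN(eps=1.5, min_samples=5)
labels = db_demo.fit_predict(points)

# Points that belong to no dense region receive the label -1 (noise).
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int((labels == -1).sum())
print(n_clusters, n_noise)  # 2 clusters and 1 noise point for this layout
```

When plotting DBSCAN results, it helps to remember that the -1 label is not a cluster of its own but the set of points left unassigned.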

In [None]:
DBS = DBSCAN(eps = 9, min_samples = 5)

identified_clusters = DBS.fit_predict(X)

data_with_clusters = dataset2.copy()
data_with_clusters['Cluster'] = identified_clusters

fig = px.scatter_3d(data_with_clusters, x = 'Age', y='Annual Income (k$)', z='Spending Score (1-100)',
              color='Cluster', opacity = 0.8, size='Age', size_max=30)
fig.show()

Although the result isn't as clean as the K-Means one, DBSCAN is a great algorithm to tune and to try to reach different conclusions with. In this case we get the same number of clusters, but they are not as informative or representative (they might improve with some more tuning).
