In [None]:
# Import the packages
import numpy as np  # multi-dimensional arrays, matrices and high-level mathematical functions
import matplotlib.pyplot as plt  # plotting interface used to size and draw the figures
import pandas as pd  # data manipulation and analysis
from sklearn.cluster import KMeans  # k-means, an unsupervised clustering algorithm
from yellowbrick.cluster import KElbowVisualizer  # visualises the elbow when deciding the number of clusters for KMeans
import seaborn as sns; sns.set()  # plot styling
%matplotlib inline

**Doing the EDA**


To get a good clustering result, we need to understand the data, find patterns and interpret the plots. This helps us both with unsupervised clustering and with shaping a marketing strategy that can be implemented in our smart sale system.

In [None]:
#Read data from the CSV file
customers=pd.read_csv("../input/customer-segmentation/Cust_Segmentation.csv")
customers.head()

In [None]:
#check for missing data and value types 
customers.info()
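
`info()` reports non-null counts, which have to be compared against the total by eye. As a small sketch, the missing values can also be counted directly per column. The toy frame below is made up for illustration and only borrows a few column names from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the customer table (values are invented)
toy = pd.DataFrame({
    "Age": [41, 47, 33, 29],
    "Income": [19.0, 100.0, np.nan, 57.0],
    "Defaulted": [0.0, np.nan, np.nan, 1.0],
})

# Number of missing entries in each column
missing_per_column = toy.isna().sum()
# Share of missing entries in each column (between 0 and 1)
missing_share = toy.isna().mean()

print(missing_per_column)
print(missing_share)
```

The same two calls can be applied to `customers` to see which features have gaps before clustering.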

In [None]:
customers.describe()

In [None]:
#look for patterns in the dataset: where would unsupervised machine learning algorithms be most useful?
sns.pairplot(customers)

We can observe some patterns in specific plots. These are the first clues on where the algorithms could be applied. We can also try to find patterns by separating the dataset entries based on different criteria. We now move to plots with three dimensions, where not only the X and Y axes give us details about the data but the color does too. The dataset has 8 features, which means it could be represented in at most 8 dimensions, but a scatter plot cannot show that many, so we analyse two features at a time and use color for a third.

In [None]:
#what if we group the data based on whether the customers have defaulted or not? Can we see patterns?
sns.pairplot(hue="Defaulted", data=customers)

In some graphs we can already observe some form of clustering, where black dots are not blended with white dots. This means we can already see how supervised clustering (grouping by a known label) shapes up based on this criterion.
We will now separate the entries based on the level of education. Let's see how the level of education changes the way we look at the graphs.

In [None]:
#what about the level of education?
sns.pairplot(hue="Edu", data=customers)

Unlike the separation based on whether the customers have defaulted or not, the separation based on education level produces few graphs with well-separated groups. Education levels are mixed in most of the graphs. However, a pattern can be observed in the "Years Employed" and "Income" graph: higher education levels tend to sit at higher income levels than lower education levels. Even so, the more years a customer has been employed, the more likely they are to have a greater income. We can observe that the minimum income rises with the number of years employed.
We will now look at a correlation matrix to see which pairs of features are more strongly correlated.

In [None]:
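#the correlation matrix helps us see which pairs of features are more strongly correlated
#select_dtypes keeps only the numeric columns, since corr() can fail on text columns in recent pandas versions
customers.select_dtypes("number").corr().style.background_gradient(cmap="coolwarm")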
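
Reading the strongest pair off a colored matrix works for a handful of features, but it can also be extracted programmatically. The sketch below uses synthetic columns `a`, `b` and `c` (hypothetical names, not from the dataset), where `b` is built to be almost a linear copy of `a`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),  # nearly a linear function of a
    "c": rng.normal(size=200),                     # independent noise
})

corr = toy.corr()

# Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Pair of column names with the largest absolute correlation
strongest = pairs.abs().idxmax()
print(strongest, pairs[strongest])
```

Sorting `pairs.abs()` in descending order gives the full ranking, which is one way to pick the second-strongest pair as well.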

**In depth view of plots**


The intersection of "Years Employed" and "Income" stands out, so we will take a deeper look at the graph of these two features.

In [None]:
#create a plot to illustrate the correlation between years employed and income. Color differentiation based on the level of education
plt.figure(figsize = (10,10))
sns.scatterplot(x="Years Employed", y="Income", hue = 'Edu',palette="blend:#55cf59,#000000", data=customers)

In the plot above, we can observe that people with less experience in the field they work in can have income levels similar to people who have more years of experience but a lower level of education. This kind of information can already be used for a marketing strategy. Because this pair of features has a very high correlation coefficient, we should look further into it to find other patterns.

In [None]:
# create the same plot but differentiate entries by color based on the customers' age
plt.figure(figsize = (10,10))
sns.scatterplot(x="Years Employed", y="Income", hue = 'Age', data=customers)


It is no surprise that older people are more likely to have more years employed in the domain they work in. The interesting fact about this plot is that older people with fewer years of experience have better incomes than younger people with the same number of years employed. This is a strong sign that further analysis is required on a plot illustrating income in relation to age.

In [None]:
#create a plot to illustrate income in relation to age. Color differentiation based on years employed
plt.figure(figsize = (10,10))
sns.scatterplot(x="Age", y="Income", hue = 'Years Employed', palette="blend:#fc8d8d,#000000", data=customers)


This might be one of the clearest graphs. We can see a very clear separation between young customers with little experience and older customers with more experience. This can be considered supervised clustering because the correlation is so strong and the separation so good.
We can also observe that the large majority of customers are young and have few years of experience. This might mean the business is probably a STUDENT RELATED business, since most students are young and have few years of experience and a low income.

If we look at the correlation matrix again, we can observe that the highest correlation is between the "Card Debt" and "Other Debt" features. So let's plot this graph and color the points by the feature with the second-highest correlation in the matrix, namely "Income".

In [None]:
#create a graph of "Card Debt" in relation to "Other Debt" and color the points based on income
plt.figure(figsize = (10,10))
sns.scatterplot(x="Card Debt", y="Other Debt", palette = "blend:#d9d6ff,#000000",hue = 'Income', data=customers)

The scatter plot above shows us that most of the customers have little to no debt at all. But the same people who have no debt are also the ones with the lowest incomes. This detail supports the idea of a business related to students.

Yet, in the graph below we can see that younger people with lower incomes are more likely to have defaulted than older people with higher incomes. A trend line has been added to better visualise how a higher income relates to the "Defaulted" feature.

In [None]:
#lmplot creates its own figure, so its size is set with height instead of plt.figure
sns.lmplot(x="Age", y="Income", hue="Defaulted", data=customers, height=10)

**Scaling the graph**

Before implementing the unsupervised algorithms we want to make sure we get accurate results. The algorithms rely on distances between points, so a feature with a larger numeric range would dominate the result. To reduce this effect, we scale the values on both axes of a plot to lie between 0 and 1.

In [None]:
from sklearn.preprocessing import MinMaxScaler  # for scaling the values in the graph
scaler = MinMaxScaler()
# MinMaxScaler scales each column independently, so both features can be transformed in one call
customers[['Income', 'Years Employed']] = scaler.fit_transform(customers[['Income', 'Years Employed']])

plt.figure(figsize = (10,10))
sns.scatterplot(x="Years Employed", y="Income", hue = 'Edu',palette="blend:#55cf59,#000000", data=customers)
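
For reference, min-max scaling maps each value `x` of a column to `(x - min) / (max - min)`. The sketch below checks that formula by hand against `MinMaxScaler` on a small made-up array; the numbers are illustrative and not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two made-up columns with different ranges
X = np.array([
    [14.0, 2.0],
    [50.0, 10.0],
    [32.0, 6.0],
])

# Column-wise min-max scaling written out by hand
col_min = X.min(axis=0)
col_max = X.max(axis=0)
manual = (X - col_min) / (col_max - col_min)

# The same transformation done by scikit-learn
scaled = MinMaxScaler().fit_transform(X)

print(manual)
print(np.allclose(manual, scaled))
```

Each column is scaled with its own minimum and maximum, so the relative order of points along each axis is preserved.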

**Check the right number of clusters to implement KMeans**

To implement the KMeans algorithm we first need to decide how many clusters to use. Therefore, we create a graph that illustrates the distortion score for each number of clusters in the tested range (`k=(1,12)`) and look for the elbow: the number of clusters after which adding more no longer decreases the distortion significantly.

In [None]:
model = KMeans()
visualizer = KElbowVisualizer(model, k=(1,12))
plt.figure(figsize = (10,10))
visualizer.fit(customers[['Years Employed', 'Income']])  # Fit the data to the visualizer
visualizer.show() #plot the visualiser
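
As a rough sketch of what the elbow plot is built from, the loop below fits KMeans for several values of `k` and records `inertia_`, the sum of squared distances to the nearest centroid that scikit-learn exposes. The data is synthetic (three blobs from `make_blobs`), and the fixed `random_state` and `n_init` values are choices made for this illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups (not the customer data)
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [5, 5], [10, 0]],
    cluster_std=0.6,
    random_state=0,
)

# Fit KMeans for each candidate k and store the resulting inertia
inertias = {}
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

On data like this the drop in inertia is large up to `k=3` and small afterwards, which is the bend the visualizer highlights.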

**Show the clustered array based on a specific graph**

Now we have the number of clusters and we can implement the KMeans algorithm.

In [None]:
km = KMeans(n_clusters = 3) #set the number of clusters
y_predicted = km.fit_predict(customers[['Years Employed', 'Income']]) #fit on the two features of the graph and get a cluster label for each row
print(y_predicted) #print the array of cluster labels

In [None]:
#plot the graph
customers['cluster'] = y_predicted
customers1=customers[customers.cluster==0]
customers2=customers[customers.cluster==1]
customers3=customers[customers.cluster==2]
plt.figure(figsize = (10,10))
sns.scatterplot(x="Years Employed", y="Income", color = 'red', data=customers1)
sns.scatterplot(x="Years Employed", y="Income", color = 'green', data=customers2)
sns.scatterplot(x="Years Employed", y="Income", color = 'blue', data=customers3)

We can see the algorithm worked pretty well. The clusters are well defined and the entries mix very little, which is a good sign of a quality clustering. Now we can repeat this process for multiple graphs to see what the results are in different situations.
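
Judging separation by eye can be complemented with a number. One common option, which this notebook does not use and which is shown here only as a sketch, is the silhouette score from `sklearn.metrics`: it ranges from -1 to 1, with higher values indicating tighter, better-separated clusters. The data below is synthetic.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic, well-separated groups used only for illustration
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [5, 5], [10, 0]],
    cluster_std=0.6,
    random_state=0,
)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Mean silhouette coefficient over all samples
score = silhouette_score(X, labels)
print(round(score, 3))
```

Computing the same score on `customers[['Years Employed', 'Income']]` with `y_predicted` would give a value that can be compared across feature pairs and algorithms.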

In [None]:

model = KMeans()
visualizer = KElbowVisualizer(model, k=(1,12))
plt.figure(figsize = (10,10))
visualizer.fit(customers[['Age', 'Income']])        # Fit the data to the visualizer
visualizer.show() 

In [None]:
km = KMeans(n_clusters = 3)
#note: Income was min-max scaled earlier but Age was not, so Age dominates the distances here
y_predicted = km.fit_predict(customers[['Age', 'Income']])
customers['cluster'] = y_predicted
customers1=customers[customers.cluster==0]
customers2=customers[customers.cluster==1]
customers3=customers[customers.cluster==2]
plt.figure(figsize = (10,10))
sns.scatterplot(x="Age", y="Income", color = 'red', data=customers1)
sns.scatterplot(x="Age", y="Income", color = 'green', data=customers2)
sns.scatterplot(x="Age", y="Income", color = 'blue', data=customers3)

In [None]:
model = KMeans()
visualizer = KElbowVisualizer(model, k=(1,12))
plt.figure(figsize = (10,10))
visualizer.fit(customers[['Card Debt', 'Other Debt']])        # Fit the data to the visualizer
visualizer.show() 

In [None]:
km = KMeans(n_clusters = 3)
y_predicted = km.fit_predict(customers[['Card Debt', 'Other Debt']])
customers['cluster'] = y_predicted
customers1=customers[customers.cluster==0]
customers2=customers[customers.cluster==1]
customers3=customers[customers.cluster==2]
plt.figure(figsize = (10,10))
sns.scatterplot(x="Card Debt", y="Other Debt", color = 'red', data=customers1)
sns.scatterplot(x="Card Debt", y="Other Debt", color = 'green', data=customers2)
sns.scatterplot(x="Card Debt", y="Other Debt", color = 'blue', data=customers3)

When we look at the "Card Debt" and "Other Debt" clustering, we can observe that it comes very close to the grouping we obtained from supervised clustering on the same graph. This is another sign that the algorithm is precise enough.

In [None]:
model = KMeans()
visualizer = KElbowVisualizer(model, k=(1,12))
plt.figure(figsize = (10,10))
visualizer.fit(customers[['DebtIncomeRatio', 'Card Debt']])        # Fit the data to the visualizer
visualizer.show() 

In [None]:
km = KMeans(n_clusters = 3)
y_predicted = km.fit_predict(customers[['DebtIncomeRatio', 'Card Debt']])
customers['cluster'] = y_predicted
customers1=customers[customers.cluster==0]
customers2=customers[customers.cluster==1]
customers3=customers[customers.cluster==2]
plt.figure(figsize = (10,10))
sns.scatterplot(x="DebtIncomeRatio", y="Card Debt", color = 'red', data=customers1)
sns.scatterplot(x="DebtIncomeRatio", y="Card Debt", color = 'green', data=customers2)
sns.scatterplot(x="DebtIncomeRatio", y="Card Debt", color = 'blue', data=customers3)

In [None]:
km = KMeans(n_clusters = 3)
y_predicted = km.fit_predict(customers[['Card Debt','DebtIncomeRatio']])
customers['cluster'] = y_predicted
customers1=customers[customers.cluster==0]
customers2=customers[customers.cluster==1]
customers3=customers[customers.cluster==2]
plt.figure(figsize = (10,10))
sns.scatterplot(x="Card Debt", y="DebtIncomeRatio", color = 'red', data=customers1)
sns.scatterplot(x="Card Debt", y="DebtIncomeRatio", color = 'green', data=customers2)
sns.scatterplot(x="Card Debt", y="DebtIncomeRatio", color = 'blue', data=customers3)

**Hierarchical clustering**

Now we need to compare the results of different unsupervised clustering algorithms to understand which one is better suited for a business decision.

In [None]:
from sklearn.cluster import AgglomerativeClustering  # for hierarchical clustering
# with default arguments AgglomerativeClustering produces 2 clusters (labels 0 and 1)
hclusters = AgglomerativeClustering().fit(customers[['Card Debt','DebtIncomeRatio']])
hclusters.labels_

In [None]:
customers['H_clusters'] = hclusters.labels_
# only labels 0 and 1 exist, because the model was fitted with the default of 2 clusters
customers1=customers[customers.H_clusters==0]
customers2=customers[customers.H_clusters==1]
plt.figure(figsize = (10,10))
sns.scatterplot(x="Years Employed", y="Income", color = 'red', data=customers1)
sns.scatterplot(x="Years Employed", y="Income", color = 'green', data=customers2)

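We can clearly see that the clustering we achieved with hierarchical clustering is very messy on this plot: values are mixed together and we have no clear information to extract from this scatterplot. Note that these labels were fitted on "Card Debt" and "DebtIncomeRatio", not on the two features plotted here, which may explain part of the mixing.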

In [None]:
customers['H_clusters'] = hclusters.labels_
# only labels 0 and 1 exist, because the model was fitted with the default of 2 clusters
customers1=customers[customers.H_clusters==0]
customers2=customers[customers.H_clusters==1]
plt.figure(figsize = (10,10))
sns.scatterplot(x="Card Debt", y="Other Debt", color = 'red', data=customers1)
sns.scatterplot(x="Card Debt", y="Other Debt", color = 'green', data=customers2)

On the "Card Debt" and "Other Debt" graph we can see a clustering similar to the KMeans one, but with just 2 clusters. The values are still mixed together, but the information is better separated than in the other graphs.
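
The 2 clusters come from the default `n_clusters=2` of `AgglomerativeClustering`. A like-for-like comparison with the 3-cluster KMeans runs would need the number set explicitly. The sketch below shows the difference on synthetic blobs, not on the customer data.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic data with three groups, used only for illustration
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [5, 5], [10, 0]],
    cluster_std=0.5,
    random_state=0,
)

# Default arguments: the tree is cut into 2 clusters
default_labels = AgglomerativeClustering().fit_predict(X)

# Explicitly request 3 clusters to match the KMeans setup
three_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

print(np.unique(default_labels))
print(np.unique(three_labels))
```

Whether 3 hierarchical clusters would look cleaner on the customer features is not something this sketch can answer; it only shows how to request them.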

**DBSCAN Clustering**

For a better evaluation, we also compare the results of the DBSCAN clustering algorithm with the other two.

In [None]:
from sklearn.cluster import DBSCAN
dbclusters = DBSCAN(eps=0.3, min_samples=10).fit(customers[['DebtIncomeRatio', 'Card Debt']])
print(dbclusters.labels_) #print the full array of cluster labels
print(dbclusters.labels_.min()) #show the minimum and maximum label so we know how many clusters we have (-1 marks noise points)
print(dbclusters.labels_.max())

In [None]:
customers['DBSCAN_clusters'] = dbclusters.labels_
customers0=customers[customers.DBSCAN_clusters==-1]
customers1=customers[customers.DBSCAN_clusters==0]
customers2=customers[customers.DBSCAN_clusters==1]
customers3=customers[customers.DBSCAN_clusters==2]
customers4=customers[customers.DBSCAN_clusters==3]
customers5=customers[customers.DBSCAN_clusters==4]
plt.figure(figsize = (10,10))
sns.scatterplot(x='DebtIncomeRatio', y="Card Debt", color = 'brown', data=customers0)
sns.scatterplot(x='DebtIncomeRatio', y="Card Debt", color = 'red', data=customers1)
sns.scatterplot(x='DebtIncomeRatio', y="Card Debt", color = 'green', data=customers2)
sns.scatterplot(x='DebtIncomeRatio', y="Card Debt", color = 'blue', data=customers3)
sns.scatterplot(x='DebtIncomeRatio', y="Card Debt", color = 'orange', data=customers4)
sns.scatterplot(x='DebtIncomeRatio', y="Card Debt", color = 'black', data=customers5)

The DBSCAN result consists of 6 groups (labels -1 to 4, where -1 marks the points DBSCAN treats as noise), which might be useful for a business decision if the purpose is to create multiple marketing strategies. Now let's see how DBSCAN works with other graphs.
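
Since DBSCAN uses the label -1 for noise, the number of actual clusters and the number of noise points can be counted separately instead of reading them from the minimum and maximum label. A minimal sketch on synthetic data (two tight blobs plus one far-away point), with `eps` and `min_samples` chosen for this toy example only:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0, 0], scale=0.1, size=(50, 2))
blob_b = rng.normal(loc=[5, 5], scale=0.1, size=(50, 2))
far_point = np.array([[20.0, 20.0]])  # isolated point, expected to be noise
X = np.vstack([blob_a, blob_b, far_point])

labels = DBSCAN(eps=0.5, min_samples=5).fit(X).labels_

# Clusters are the distinct labels excluding the noise label -1
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
# Noise points are the rows labelled -1
n_noise = int(np.sum(labels == -1))

print(n_clusters, n_noise)
```

Applied to `dbclusters.labels_`, the same two lines separate real clusters from noise in the customer data.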

In [None]:
from sklearn.cluster import DBSCAN
dbclusters = DBSCAN(eps=0.3, min_samples=10).fit(customers[['Card Debt', 'Other Debt']]) 
dbclusters.labels_

In [None]:
customers['DBSCAN_clusters'] = dbclusters.labels_
customers1=customers[customers.DBSCAN_clusters==-1]
customers2=customers[customers.DBSCAN_clusters==0]
plt.figure(figsize = (10,10))
sns.scatterplot(x='Card Debt', y="Other Debt", color = 'red', data=customers1)
sns.scatterplot(x='Card Debt', y="Other Debt", color = 'green', data=customers2)

If we compare the result of the hierarchical clustering with DBSCAN, they look similar, but DBSCAN definitely separates the clusters in a more precise way. However, if we need more clusters, KMeans is the choice.

In [None]:
dbclusters = DBSCAN(eps=0.3, min_samples=10).fit(customers[['Years Employed', 'Income']]) 
dbclusters.labels_

In [None]:
customers['DBSCAN_clusters'] = dbclusters.labels_
customers1=customers[customers.DBSCAN_clusters==-1]
customers2=customers[customers.DBSCAN_clusters==0]
plt.figure(figsize = (10,10))
sns.scatterplot(x='Years Employed', y="Income", color = 'red', data=customers1)
sns.scatterplot(x='Years Employed', y="Income", color = 'green', data=customers2)

We can see DBSCAN made 2 groups for this graph, and one of them (label -1, which DBSCAN uses for noise) contains just one element, which can be considered an outlier. Let's get rid of it to see how the DBSCAN result changes after that.

In [None]:
customers.drop(labels = 532, inplace=True)
customers[customers['Income'] == customers['Income'].max()]
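
The row label 532 above is hard-coded. As a sketch, the label can also be looked up, here under the assumption that the outlier is the row with the highest income. The toy frame and its index values are invented for illustration.

```python
import pandas as pd

# Toy frame with a non-default index and one extreme income value
toy = pd.DataFrame(
    {"Income": [19, 100, 57, 446, 31]},
    index=[10, 11, 12, 13, 14],
)

# Index label of the row with the highest income
outlier_label = toy["Income"].idxmax()

# Drop that row without modifying the original frame
trimmed = toy.drop(index=outlier_label)

print(outlier_label)
print(trimmed)
```

Returning a new frame instead of using `inplace=True` keeps the original data available if the cell is run again.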

In [None]:

dbclusters = DBSCAN(eps=0.3, min_samples=10).fit(customers[['Years Employed', 'Income']]) 
dbclusters.labels_

In [None]:
customers['DBSCAN_clusters'] = dbclusters.labels_
#customers1=customers[customers.DBSCAN_clusters==-1]
customers2=customers[customers.DBSCAN_clusters==0]
plt.figure(figsize = (10,10))
sns.scatterplot(x='Years Employed', y="Income", color = 'red', data=customers2)
#sns.scatterplot(x='Years Employed', y="Income", color = 'green', data=customers2)

Once we take out the outlier we can see DBSCAN is doing no clustering at all: it puts the points into a single cluster. As a result, the best clustering algorithm for our smart sale system could be KMeans, but this really depends on what the business is planning to achieve.
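
One possible reason for the single DBSCAN cluster, offered here as an assumption rather than a finding from this dataset, is the choice of `eps`: after min-max scaling both axes span 0 to 1, so a neighbourhood radius of 0.3 covers a large part of the plot. The sketch below shows on synthetic data in the unit square how the number of clusters depends on `eps`. The helper `count_clusters` is hypothetical and defined only for this example.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
# Two compact groups inside the unit square (synthetic data)
X = np.vstack([
    rng.normal(loc=[0.2, 0.2], scale=0.03, size=(100, 2)),
    rng.normal(loc=[0.7, 0.7], scale=0.03, size=(100, 2)),
])

def count_clusters(eps):
    """Run DBSCAN with the given eps and count clusters, ignoring noise (-1)."""
    labels = DBSCAN(eps=eps, min_samples=10).fit(X).labels_
    return len(set(labels) - {-1})

small_eps_clusters = count_clusters(0.1)  # radius smaller than the gap between groups
large_eps_clusters = count_clusters(1.0)  # radius larger than the gap between groups

print(small_eps_clusters, large_eps_clusters)
```

Trying a few smaller `eps` values on the scaled customer features would show whether DBSCAN can find more structure there before it is ruled out.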