In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
from numpy import unique
from numpy import where
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# visualization of data
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.cluster import KMeans # unsupervised machine learning algorithm 
from sklearn.metrics import silhouette_score # Used for the silhouette method for acquiring K
from sklearn.cluster import Birch # Used for birch clustering method
from sklearn.cluster import SpectralClustering # Used for spectral clustering method

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Create dataframes for EDA
customer_df = pd.read_csv('../input/customer-segmentation/Cust_Segmentation.csv')
customer_df

In [None]:
customer_df.describe()

In [None]:
# To explore the data, let’s see which columns have the tightest correlation 
sns.pairplot(hue='Edu', data=customer_df, palette="husl")

With this many data points and five colours for education level, it is hard to pick out any clear correlation from this graph, and further work is needed before drawing conclusions.

In [None]:
sns.pairplot(hue='Defaulted', data=customer_df, palette="bright")

Reducing the number of colours on the graph from five to two, by colouring on whether an individual has defaulted rather than on education, may make the graph easier to read.

There is a lot of clustering in the debt income ratio column, which is not surprising for individuals who have defaulted. The same degree of clustering appears in card debt, whereas other debt is much more blended and shows less correlation with defaults.

The remaining columns are more blended still, so correlations are hard to see at this stage without further visualisations focused on those who have defaulted.

In [None]:
sns.displot(customer_df, x="Edu", hue="Defaulted")

The previous visualisations did not paint the full picture: less educated people had a higher count of defaults, but there were also more of them in the dataset. The default ratio within each group would be more telling.
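
As a compact way to get a per-group ratio, the default rate for every education level can be computed in one `groupby` step. The sketch below uses a small made-up frame (`toy_df` and its values are assumptions for illustration; only the column names match the notebook). Note that `mean()` skips missing values, so if `Defaulted` contains any, those rows drop out of the denominator, which differs from dividing by a plain row count.

```python
import pandas as pd

# Made-up rows for illustration; only the column names match the notebook.
toy_df = pd.DataFrame({
    "Edu":       [1, 1, 1, 1, 2, 2, 3, 3],
    "Defaulted": [0, 0, 0, 1, 0, 1, 1, 1],
})

# The mean of a 0/1 column is the share of ones, i.e. the default rate.
default_rate = toy_df.groupby("Edu")["Defaulted"].mean().mul(100).round(1)
print(default_rate)
```

The result is a Series indexed by education level, which can be passed straight to a bar plot.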

In [None]:
edu_1 = len(customer_df.query('Edu == 1'))
edu_1_defaulted = len(customer_df.query('Edu == 1 and Defaulted == 1'))
edu_ratio_1 = (round(edu_1_defaulted/edu_1, 2)*100)

edu_2 = len(customer_df.query('Edu == 2'))
edu_2_defaulted = len(customer_df.query('Edu == 2 and Defaulted == 1'))
edu_ratio_2 = (round(edu_2_defaulted/edu_2, 2)*100)

edu_3 = len(customer_df.query('Edu == 3'))
edu_3_defaulted = len(customer_df.query('Edu == 3 and Defaulted == 1'))
edu_ratio_3 = (round(edu_3_defaulted/edu_3, 2)*100)

edu_4 = len(customer_df.query('Edu == 4'))
edu_4_defaulted = len(customer_df.query('Edu == 4 and Defaulted == 1'))
edu_ratio_4 = (round(edu_4_defaulted/edu_4, 2)*100)

edu_5 = len(customer_df.query('Edu == 5'))
edu_5_defaulted = len(customer_df.query('Edu == 5 and Defaulted == 1'))
edu_ratio_5 = (round(edu_5_defaulted/edu_5, 2)*100)

In [None]:
edu_data = [edu_ratio_1, edu_ratio_2, edu_ratio_3, edu_ratio_4, edu_ratio_5]
edu_data_df = pd.DataFrame({"Edu": edu_data})
plt.bar(['1','2','3','4','5'], edu_data_df.Edu)
plt.xlabel('Education level')
plt.ylabel('Ratio of defaults')
plt.title('Education level and ratio of defaults')
plt.show()

Relative to the number of individuals at each level, those in the medium-to-high range of education (levels 3 and 4) had the highest default rates, which was a surprising result.

Below, let's look for additional correlations with a heatmap. To make the graph easier to read, let's use a triangular heatmap, which avoids showing each pair of variables twice.

In [None]:
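# Correlate numeric columns only; recent pandas versions raise an error if text columns are included
corr = customer_df.select_dtypes(include="number").corr()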

# Hide the upper half to maintain a triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Create figure
f, ax = plt.subplots(figsize=(12, 10))

# Create colourmap
cmap = sns.color_palette("coolwarm", as_cmap=True)

# Display heatmap
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1, center=0,
            square=True, annot=True, linewidths=1, cbar_kws={"shrink": 1})

One of the best visualisations for EDA is a heatmap annotated with the correlation between variables. The visualisation can also be made easier to read by removing the top half of the graph, which removes the duplicated fields.
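
To see what the triangle mask does, the sketch below builds it for a tiny made-up frame (`toy_df` and its numbers are assumptions, not the customer data). `np.triu` marks the diagonal and everything above it as `True`, and seaborn hides the cells where the mask is `True`, so each pair of columns is left visible exactly once.

```python
import numpy as np
import pandas as pd

# Made-up numbers for illustration only.
toy_df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 1.0, 4.0, 3.0],
    "c": [4.0, 3.0, 2.0, 1.0],
})
toy_corr = toy_df.corr()

# True marks the cells to hide: the diagonal and everything above it.
toy_mask = np.triu(np.ones_like(toy_corr, dtype=bool))

# Cells that stay visible: one per unordered pair of columns.
visible = toy_corr.where(~toy_mask)
print(visible)
```

With three columns, six of the nine cells are masked and three remain, one for each pair.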

The strongest correlation, 64%, is between card debt and other debt, which may suggest poor debt management among these customers. The second strongest is between income and years employed at 63%, a relationship we have seen previously in other visualisations. In third place is income and other debt at 60%, which seems strange at first. One would assume that other debt could include mortgages, which would make sense, as higher earners have a better chance of obtaining a mortgage. Taken together, these results could link years employed to income, income to other debt, and other debt to card debt. However, it is important to note that correlation does not imply causation.
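
Reading a ranking off a heatmap by eye is error-prone, so it can help to list the pairs in order of strength. The following is a minimal sketch, assuming pandas 1.5 or later for the `numeric_only` argument; the helper name `ranked_correlations` and the toy data are hypothetical and not part of the notebook.

```python
from itertools import combinations

import pandas as pd


def ranked_correlations(frame: pd.DataFrame) -> pd.Series:
    """Return each unordered column pair once, strongest |r| first."""
    corr = frame.corr(numeric_only=True)
    # One entry per pair, so nothing is listed twice.
    pairs = {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}
    return pd.Series(pairs).sort_values(key=lambda s: s.abs(), ascending=False)


# Made-up numbers for illustration only; the text column is ignored.
toy_df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 1.0, 4.0, 3.0],
    "c": [4.0, 3.0, 2.0, 1.0],
    "label": ["w", "x", "y", "z"],
})
ranked = ranked_correlations(toy_df)
print(ranked)
```

Sorting on the absolute value keeps strong negative correlations near the top rather than at the bottom of the list.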

As for the negative correlations, the strongest was between years employed and defaulted at -28%. The negative correlations are weak compared with the strongest positive ones: education and years employed had a correlation of -15%, and age and defaulted had -14%.


Let's expand on these findings with a relplot that shows the two types of debt (other and card) and compares them with the levels of education. In addition, let's use the debt income ratio to judge how excessive the debts are. We expect larger debt income ratios at the upper ends of card and other debt.

In [None]:
sns.relplot(x="Other Debt", y="Card Debt", hue="Edu", size="DebtIncomeRatio",
            sizes=(40, 400), alpha=.5, palette="muted",
            height=6, data=customer_df)

Rather surprisingly, the largest debt income ratios did not concentrate at the upper ends of card and other debt. This could be because individuals with low incomes hold low debts in pound (£) terms, owing to poor credit scores or to not meeting the requirements for larger loans because of financial instability.

In [None]:
sns.jointplot(x="Years Employed", y="Income", data=customer_df, color="red")

The joint plot shows a general increase in income as years employed goes up, visible from about ten years onwards. We can also see that the highest density of people have been employed for only 0-2 years, with a visible drop-off in density as years employed increases. Finally, most customers sit at the lower end of the income range, with density falling as income increases.

In [None]:
sns.histplot(data=customer_df, x="Age")

From the histogram above, the highest density of customers is in the 25-40 age range, with density falling for both younger and older customers.

Now we will implement the unsupervised machine learning algorithm k-means, using the silhouette method to determine k.

This will test our previous EDA findings on correlation against the results of the machine learning algorithm.

In [None]:
#Positive correlations tests
test1_df = customer_df[["Card Debt", "Other Debt"]]
test2_df = customer_df[["Income", "Years Employed"]]
test3_df = customer_df[["Income", "Other Debt"]]

Based on the EDA, the following test cases, built from the most strongly correlated pairs of columns, will be used to assess k-means and how well it can find patterns in the data.

**KMEAN TEST 1**

In [None]:
sil = []
maximum_k = 10

# Find the best value for k, minimum of 2
for k in range(2, maximum_k + 1):
    kmeans = KMeans(n_clusters = k).fit(test1_df)
    labels = kmeans.labels_
    sil.append(silhouette_score(test1_df, labels, metric = 'euclidean'))

# Store the value of k and sil scores, to find the best k to use
k = [*range(2, 11, 1)]
dic1 = {'k':k,'sil':sil}
sil1_df = pd.DataFrame(dic1)
sil1_df

The code above uses the silhouette method to determine the value of k from the highest silhouette score. The minimum value of k is 2 and the maximum is set to 10.
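
Picking the best k from the table of scores can be done programmatically rather than by eye. The sketch below is illustrative only: it uses synthetic blobs from `make_blobs` instead of the customer data, and `sil_df`, `best_k` and the fixed `random_state` are assumptions introduced here.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: three well-separated blobs (not the customer data).
X, _ = make_blobs(n_samples=150, centers=[(-10, -10), (0, 0), (10, 10)],
                  cluster_std=0.5, random_state=0)

rows = []
for k in range(2, 7):
    # Fixed seed and explicit n_init so repeated runs give the same scores.
    fitted = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    rows.append({"k": k, "sil": silhouette_score(X, fitted.labels_)})
sil_df = pd.DataFrame(rows)

# The chosen k is the row with the highest silhouette score.
best_k = int(sil_df.loc[sil_df["sil"].idxmax(), "k"])
print(best_k)
```

Note that the silhouette score is computed on the same data the model was fitted to; scoring the labels against a different table would give a meaningless value.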

In [None]:
sns.lineplot(data=sil1_df, x="k", y="sil")

The graph then plots the results of the k iterations and the silhouette score to visualise the best k value.

In [None]:
print(labels)

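This array holds the cluster label assigned to each customer by the most recent fit in the loop above, that is, the model with k = 10. It is the output of the model, not the data used to train it.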
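
Because `labels` is overwritten on every pass of the silhouette loop, the array left behind belongs to the last k tried, not to the k with the best score. A minimal sketch of the effect, and of refitting once with a chosen k, is shown below; the random points and the value of `chosen_k` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))  # made-up points, not the customer data

maximum_k = 10
for k in range(2, maximum_k + 1):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).labels_

# After the loop, `labels` comes from the final pass (k = maximum_k).
print(len(np.unique(labels)))

# Refit once with the k actually wanted, so every point gets one of its labels.
chosen_k = 2  # hypothetical choice for illustration
chosen_labels = KMeans(n_clusters=chosen_k, n_init=10,
                       random_state=0).fit_predict(X)
print(np.unique(chosen_labels))
```

Plotting only labels 0 and 1 from the first array would leave most points off the chart, whereas the refitted labels cover every point.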

In [None]:
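# Note: `labels` is left over from the last loop pass (k = maximum_k),
# so clusters 0 and 1 below are only two of the ten groups.
customer_df['Labelled_Clusters1'] = labels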
test0 = customer_df[customer_df.Labelled_Clusters1 == 0]
test1 = customer_df[customer_df.Labelled_Clusters1 == 1]

sns.scatterplot(x="Card Debt", y="Other Debt", data=test0)
sns.scatterplot(x="Card Debt", y="Other Debt", data=test1)

With all of the required data in place, we can plot the results for one of the tests, in this case card debt against other debt. There are two clear clusters: low other debt is much more tightly related to low card debt, whereas high other debt bore no clear relation to card debt.
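
A verbal description of clusters can be backed up with a per-cluster summary table. The sketch below is an illustration under assumptions: `toy_df` contains made-up debts (only the column names match the notebook) and the clustering is refitted with two clusters on that toy data.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up debts for illustration; only the column names match the notebook.
toy_df = pd.DataFrame({
    "Card Debt":  [0.2, 0.4, 0.3, 0.5, 6.0, 7.5, 9.0],
    "Other Debt": [0.5, 0.8, 0.6, 1.0, 12.0, 14.0, 11.0],
})

# Labels are computed from the two debt columns before the new column is added.
toy_df["cluster"] = KMeans(n_clusters=2, n_init=10,
                           random_state=0).fit_predict(toy_df)

# Size, centre and spread of each cluster in the original units.
summary = (toy_df.groupby("cluster")[["Card Debt", "Other Debt"]]
           .agg(["count", "mean", "std"]))
print(summary)
```

A small standard deviation indicates a tight cluster, and a large one indicates a loose cluster, which puts numbers on what the scatter plot shows.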

**KMEAN TEST 2**

In [None]:
sil = []
maximum_k = 10

# Find the best value for k, minimum of 2
for k in range(2, maximum_k + 1):
    kmeans = KMeans(n_clusters = k).fit(test2_df)
    labels = kmeans.labels_
    sil.append(silhouette_score(test2_df, labels, metric = 'euclidean'))  # score against the data that was fitted

# Store the value of k and sil scores, to find the best k to use
k = [*range(2, 11, 1)]
dic2 = {'k':k,'sil':sil}
sil1_df = pd.DataFrame(dic2)

sns.lineplot(data=sil1_df, x="k", y="sil")

In [None]:
customer_df['Labelled_Clusters2'] = labels
test0 = customer_df[customer_df.Labelled_Clusters2 == 0]
test1 = customer_df[customer_df.Labelled_Clusters2 == 1]

sns.scatterplot(x="Income", y="Years Employed", data=test0)
sns.scatterplot(x="Income", y="Years Employed", data=test1)

In the second scatter plot, the first cluster is tightly grouped around low incomes of roughly 10-25 and 0-10 years employed. The second cluster is far more spread out: it sits at incomes of about 95 and above, with years employed ranging from about 5 to 30. NOTE: The values quoted are estimates.

**KMEAN TEST 3**

In [None]:
sil = []
maximum_k = 10

# Find the best value for k, minimum of 2
for k in range(2, maximum_k + 1):
    kmeans = KMeans(n_clusters = k).fit(test3_df)
    labels = kmeans.labels_
    sil.append(silhouette_score(test3_df, labels, metric = 'euclidean'))  # score against the data that was fitted

# Store the value of k and sil scores, to find the best k to use
k = [*range(2, 11, 1)]
dic3 = {'k':k,'sil':sil}
sil1_df = pd.DataFrame(dic3)

sns.lineplot(data=sil1_df, x="k", y="sil")

In [None]:
customer_df['Labelled_Clusters3'] = labels
test0 = customer_df[customer_df.Labelled_Clusters3 == 0]
test1 = customer_df[customer_df.Labelled_Clusters3 == 1]

sns.scatterplot(x="Income", y="Other Debt", data=test0)
sns.scatterplot(x="Income", y="Other Debt", data=test1)

Similar results appeared for income measured against other debt: low income goes with low other debt, while high income goes with widely varying other debt and therefore forms a loose, spread-out cluster. This is the opposite of the tight clustering of individuals with low income and low other debt.

Next we will use the Birch clustering method to generate clusters for comparison against k-means.

**BIRCH CLUSTERING TEST 1**

In [None]:
# birch clustering
birch1_array = test1_df.to_numpy() # Convert dataframe to np array for data formatting 
print(birch1_array)

First we need to convert the existing data from a dataframe to a NumPy array, so that it fits the positional indexing used by the scatter plot below, before it is passed to the Birch method.
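
The conversion keeps the rows and the column order but drops the column names, so columns are then addressed by position. A minimal sketch with made-up data follows (`toy_df`, `toy_array` and the `threshold` value are assumptions chosen for this toy example, not the notebook's settings).

```python
import numpy as np
import pandas as pd
from sklearn.cluster import Birch

# Two made-up groups of customers; only the column names match the notebook.
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "Card Debt":  np.concatenate([rng.normal(1, 0.3, 50), rng.normal(8, 0.3, 50)]),
    "Other Debt": np.concatenate([rng.normal(2, 0.3, 50), rng.normal(15, 0.3, 50)]),
})

# to_numpy() gives an (n_rows, n_columns) array in the dataframe's column order.
toy_array = toy_df.to_numpy()
print(toy_array.shape)

# Column 0 is "Card Debt" and column 1 is "Other Debt".
toy_labels = Birch(threshold=0.5, n_clusters=2).fit_predict(toy_array)
first_cluster = toy_array[toy_labels == toy_labels[0]]
print(first_cluster[:3, 0], first_cluster[:3, 1])
```

A boolean mask such as `toy_labels == toy_labels[0]` selects the rows of one cluster directly, which is an alternative to indexing with the result of `where`.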

In [None]:
# Set the parameters for number of clusters. To keep the results comparable number of clusters was set to 2
model = Birch(threshold=0.01, n_clusters=2) 
model.fit(birch1_array)
clust = model.predict(birch1_array)
clusters = unique(clust)
# For each cluster create a scatter plot
for cluster in clusters:
    row = where(clust == cluster)
    plt.scatter(birch1_array[row, 0], birch1_array[row, 1])
    plt.xlabel("Card Debt")
    plt.ylabel("Other Debt")
plt.show()

Comparing the k-means scatter plots with the Birch plots is challenging because the k-means plots do not show every data point (only the points labelled 0 and 1 are drawn). However, some distinctions can be made: the k-means visualisations show a higher degree of precision in identifying clusters, whereas the Birch method plots every point and so supports only broader conclusions about the relationships within the data.

In this graph in particular, the classifications are very different from those of k-means. K-means had two distinct groupings, whereas the Birch clusters were more closely blended. In this instance, therefore, k-means is the better approach for identification.

**BIRCH CLUSTERING TEST 2**

In [None]:
birch2_array = test2_df.to_numpy() # Convert dataframe to np array for data formatting 
model = Birch(threshold=0.01, n_clusters=2) 
model.fit(birch2_array)
clust = model.predict(birch2_array)
clusters = unique(clust)
# For each cluster create a scatter plot
for cluster in clusters:
    row = where(clust == cluster)
    plt.scatter(birch2_array[row, 0], birch2_array[row, 1])
    plt.xlabel("Income")
    plt.ylabel("Years Employed")
plt.show()

The same results are repeated here, reinforcing the findings from the first test comparison.

**BIRCH CLUSTERING TEST 3**

In [None]:
birch3_array = test3_df.to_numpy() # Convert dataframe to np array for data formatting 
model = Birch(threshold=0.01, n_clusters=2) 
model.fit(birch3_array)
clust = model.predict(birch3_array)
clusters = unique(clust)
# For each cluster create a scatter plot
for cluster in clusters:
    row = where(clust == cluster)
    plt.scatter(birch3_array[row, 0], birch3_array[row, 1])
    plt.xlabel("Income")
    plt.ylabel("Other Debt")
plt.show()

In the last comparison the results are similar. However, the blending in the Birch output is more severe here, with the points of the two clusters lying very close together.

Because the Birch models classify the whole dataset, the boundaries between classes can become blurred, whereas the k-means plots show denser, more clearly separated groupings.
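
Visual comparison of two clusterings can be complemented with a numeric agreement score. The sketch below uses the adjusted Rand index from scikit-learn on synthetic blobs; the data, the `threshold` value and the variable names are assumptions for illustration and say nothing about the customer data.

```python
from sklearn.cluster import Birch, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in data: two well-separated blobs (not the customer data).
X, _ = make_blobs(n_samples=200, centers=[(0, 0), (20, 20)],
                  cluster_std=1.0, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
birch_labels = Birch(threshold=0.5, n_clusters=2).fit_predict(X)

# 1.0 means the two methods group the points identically (label names aside);
# values near 0 mean the agreement is no better than chance.
agreement = adjusted_rand_score(kmeans_labels, birch_labels)
print(agreement)
```

For the comparison to be meaningful, both methods need to label the same set of points with the same number of clusters.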

**SPECTRAL CLUSTERING TEST 1**

In [None]:
model = SpectralClustering(n_clusters=2, affinity='nearest_neighbors')
labels = model.fit_predict(birch1_array)
plt.scatter(birch1_array[:, 0], birch1_array[:, 1], c=labels, s=50);
plt.xlabel("Card Debt")
plt.ylabel("Other Debt")
plt.show()

For the spectral clustering method, the clustering was broadly in line with that of the Birch method, but still less clear-cut than the k-means results.

**SPECTRAL CLUSTERING TEST 2**

In [None]:
model = SpectralClustering(n_clusters=2, affinity='nearest_neighbors')
labels = model.fit_predict(birch2_array)
plt.scatter(birch2_array[:, 0], birch2_array[:, 1], c=labels, s=50);
plt.xlabel("Income")
plt.ylabel("Years Employed")
plt.show()

Again, the same results were seen in the second test, as with the first.

**SPECTRAL CLUSTERING TEST 3**

In [None]:
model = SpectralClustering(n_clusters=3, affinity='nearest_neighbors')
labels = model.fit_predict(birch3_array)
plt.scatter(birch3_array[:, 0], birch3_array[:, 1], c=labels, s=50);
plt.xlabel("Income")
plt.ylabel("Other Debt")
plt.show()

In this test of the spectral clustering method, an issue with the implementation appeared that did not affect the previous tests: the third test still produces a graph, but it is accompanied by a warning. The issue was partially fixed by raising the number of clusters from two to three, which allowed two clusters to be displayed on the graph.
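
The warning text is not reproduced here, so its cause is not known. One thing that can be checked, purely as an assumption about a possible cause, is whether the nearest-neighbour graph behind `affinity='nearest_neighbors'` splits into disconnected pieces. The sketch below builds a comparable graph on made-up points; it is not the exact graph that `SpectralClustering` constructs internally, and `n_neighbors=5` is an arbitrary choice.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import kneighbors_graph

# Made-up points: two groups placed very far apart.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(30, 2)),
               rng.normal(100, 1, size=(30, 2))])

# Each point is linked to its nearest neighbours, then the graph is symmetrised.
graph = kneighbors_graph(X, n_neighbors=5, include_self=True)
graph = 0.5 * (graph + graph.T)

# More than one component means some points cannot reach others through the graph.
n_components, component_of = connected_components(graph, directed=False)
print(n_components)
```

If a check like this reports several components on the real data, raising the number of neighbours is one option to try.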