## Introduction
They all have one thing in common, which is that they belong to the same category of algorithms: supervised learning. This kind of algorithm tries to learn patterns based on a specified outcome column (target variable) such as sales, employee churn, or class of customer.

But what if your dataset has no such variable, or you don't want to specify a target variable? How do you find interesting patterns? This is where clustering algorithms, which belong to the unsupervised learning category, come in.

Clustering algorithms are very popular in the data science industry for grouping similar data points and detecting outliers. Banks can use them for fraud detection by identifying unusual clusters in the data, and e-commerce companies can use them to identify groups of users with similar browsing behaviors, as in the following figures:

![alternative text](https://s3.amazonaws.com/thinkific/file_uploads/59347/images/f75/89e/d32/C15019_05_01.png)


Clustering analysis performed on this data would uncover natural patterns by grouping similar data points such that you may get the following result:
![alternative text](https://s3.amazonaws.com/thinkific/file_uploads/59347/images/ca5/20d/092/C15019_05_02.png)

The data is now segmented into three customer groups depending on their recurring visits and time spent on the website. Different marketing plans can then be used for each of these groups to maximize sales.

## Clustering with K-Means

The objective of k-means is to group similar data points (or observations) together to form a cluster. Think of it as grouping elements that are close to each other. For example, if you were manually analyzing user behavior on a mobile app, you might end up grouping customers who log in quite frequently, or users who make bigger in-app purchases.
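
As a minimal sketch of this idea, the snippet below fits k-means on six made-up points that form two obvious groups. The points, the variable names (`toy_points`, `toy_kmeans`, `toy_labels`) and the choice of two clusters are assumptions made for illustration only; they are not part of the dataset used in this chapter.

```python
import numpy as np
from sklearn.cluster import KMeans

# Six hypothetical observations: three near (1, 1) and three near (10, 10)
toy_points = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                       [10.0, 10.0], [10.5, 11.0], [11.0, 10.0]])

# n_init is set explicitly so the behaviour does not depend on the library default
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
toy_labels = toy_kmeans.fit_predict(toy_points)

print(toy_labels)                    # one cluster id per observation
print(toy_kmeans.cluster_centers_)   # coordinates of the two cluster centres
```

The cluster ids themselves are arbitrary; what matters is that points close to each other receive the same id.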

In [1]:
import pandas as pd
from sklearn.cluster import KMeans

In [52]:
file_url = 'https://raw.githubusercontent.com/PacktWorkshops/The-Data-Science-Workshop/master/Chapter05/DataSet/taxstats2015.csv'
df = pd.read_csv(file_url, usecols =  ['Postcode','Average net tax','Average total deductions'])
df.head()

| | Postcode | Average total deductions | Average net tax |
|---|---|---|---|
| 0 | 2000 | 2071 | 27555 |
| 1 | 2006 | 3804 | 28142 |
| 2 | 2007 | 1740 | 15649 |
| 3 | 2008 | 3917 | 53976 |
| 4 | 2009 | 3433 | 32430 |


In [53]:
kmeans = KMeans(random_state= 42)
X = df[['Average net tax','Average total deductions']]
kmeans.fit(X)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=8, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=42, tol=0.0001, verbose=0)

In [4]:
y_preds = kmeans.predict(X)
df['cluster'] = y_preds

In [5]:
df.head()

| | Postcode | Average total deductions | Average net tax | cluster |
|---|---|---|---|---|
| 0 | 2000 | 2071 | 27555 | 6 |
| 1 | 2006 | 3804 | 28142 | 6 |
| 2 | 2007 | 1740 | 15649 | 5 |
| 3 | 2008 | 3917 | 53976 | 7 |
| 4 | 2009 | 3433 | 32430 | 2 |


## Interpreting k-means Results

The objective of cluster analysis is to group observations with similar patterns together. But how can we see whether the groupings found by the algorithm are meaningful?

One way of investigating this is to analyze the dataset row by row, together with the cluster assigned to each observation. Another way is to use pivot tables or group by operations.

We can create a pivot table similar to the ones in Excel with the **pivot_table()** method from pandas. We specify the following parameters:

* **values**: The numerical columns you want to calculate summaries (or aggregations) for.

* **index**: The columns whose values define the groups you want to see summaries for.

* **aggfunc**: The aggregation functions you want to summarize the data with, such as averages or counts.
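
To make these three parameters concrete, here is a small sketch on a made-up DataFrame. The values and the names `toy`, `summary` and `summary_groupby` are hypothetical; the second statement shows what should be the equivalent group by operation.

```python
import pandas as pd

# Hypothetical toy data: two clusters with made-up values
toy = pd.DataFrame({
    'cluster': [0, 0, 1, 1],
    'Average net tax': [10000, 12000, 50000, 54000],
    'Average total deductions': [2000, 2200, 6000, 6400],
})

# One row per cluster, one column per summarised variable
summary = toy.pivot_table(values=['Average net tax', 'Average total deductions'],
                          index='cluster', aggfunc='mean')

# The same aggregation written as a group by operation
summary_groupby = toy.groupby('cluster')[['Average net tax',
                                          'Average total deductions']].mean()
print(summary)
```

Passing the aggregation as the string `'mean'` rather than a function is a stylistic choice here; both forms describe the same average.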

In [6]:
import numpy as np
df.pivot_table(values = ['Average net tax', 'Average total deductions'], index = 'cluster', aggfunc = 'mean')

| cluster | Average net tax | Average total deductions |
|---|---|---|
| 0 | 10126.249641 | 2176.045911 |
| 1 | 20474.671096 | 2909.352159 |
| 2 | 36197.407895 | 4789.328947 |
| 3 | 13016.528477 | 2480.054305 |
| 4 | 77733.111111 | 12523.111111 |
| 5 | 16391.342495 | 2657.048626 |
| 6 | 26765.963235 | 3498.676471 |
| 7 | 50887.692308 | 6156.961538 |


In this summary, we can see that the algorithm has grouped the data into eight clusters (clusters 0 to 7). Cluster 0 has the lowest average net tax and total deductions amounts among all the clusters, while cluster 4 has the highest values. With this pivot table, we are able to compare the clusters with one another using their summarized values.

Using an aggregated view of clusters is a good way of seeing the difference between them. Another possibility is to visualize clusters in a graph.

In [7]:
import altair as alt
chart = alt.Chart(df)
scatter_plot = chart.mark_circle()
scatter_plot.encode(x = 'Average net tax', y = 'Average total deductions', color = 'cluster:N')

Excellent! We can now easily see what the clusters in this graph are and how they differ from each other. We can clearly see that k-means assigned data points to each cluster mainly based on the x-axis variable, which is Average net tax. The boundaries of the clusters are close to vertical straight lines. For instance, judging by the cluster averages in the pivot table, the boundary separating clusters 5 and 1 is roughly around 18,000. Observations below this limit are assigned to cluster 5 and those above to cluster 1.

In [8]:
import altair as alt
chart = alt.Chart(df)
scatter_plot = chart.mark_circle()
scatter_plot.encode(x = 'Average net tax', y = 'Average total deductions', color = 'cluster:N',
                    tooltip = ['Postcode','cluster','Average net tax', 'Average total deductions']).interactive()

### Exercise 5.02: Clustering Australian Postcodes by Business Income and Expenses

In [9]:
import pandas as pd
from sklearn.cluster import KMeans 
import altair as alt
import numpy as np

In [10]:
file_url = 'https://raw.githubusercontent.com/PacktWorkshops/The-Data-Science-Workshop/master/Chapter05/DataSet/taxstats2015.csv'
df = pd.read_csv(file_url, usecols = ['Postcode','Average total business income','Average total business expenses'])
df.head()

| | Postcode | Average total business income | Average total business expenses |
|---|---|---|---|
| 0 | 2000 | 210901 | 222191 |
| 1 | 2006 | 69983 | 48971 |
| 2 | 2007 | 575099 | 639499 |
| 3 | 2008 | 53329 | 32173 |
| 4 | 2009 | 237539 | 222993 |


In [11]:
X = df[['Average total business income', 'Average total business expenses']]

In [12]:
kmeans = KMeans(random_state = 8)
kmeans.fit(X)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=8, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=8, tol=0.0001, verbose=0)

In [13]:
y_preds = kmeans.predict(X)
y_preds[-10:]

array([3, 0, 0, 1, 6, 0, 0, 0, 0, 3], dtype=int32)

In [14]:
df['cluster'] = y_preds

In [15]:
df.pivot_table(values = ['Average total business income', 'Average total business expenses'],
               index = 'cluster', aggfunc = 'mean')

| cluster | Average total business expenses | Average total business income |
|---|---|---|
| 0 | 38130.124827 | 53608.097087 |
| 1 | 58310.891111 | 76449.391111 |
| 2 | 250410.190476 | 301417.809524 |
| 3 | 82319.30597 | 104102.712687 |
| 4 | 812481.333333 | 837920.333333 |
| 5 | 173350.25974 | 208767.74026 |
| 6 | 118572.299517 | 145933.570048 |
| 7 | 449722.5 | 488551.625 |


In [16]:
scatter_plot = alt.Chart(df).mark_circle()
scatter_plot.encode(x='Average total business income', y='Average total business expenses', color='cluster:N', tooltip=['Postcode', 'cluster', 'Average total business income', 'Average total business expenses']).interactive()

## Choosing the Number of Clusters

We can select how many clusters we want by passing a value for a hyperparameter. For k-means, **n_clusters** is one of the most important hyperparameters to tune. Choosing a low value will lead k-means to group many data points together, even though they are very different from each other. On the other hand, choosing a high value may force the algorithm to split close observations into multiple clusters, even though they are very similar.

Looking at the scatter plot from the ATO dataset, eight clusters seem to be a lot. On the graph, some of the clusters look very close to each other and have similar values.
Intuitively, just by looking at the plot, you could have said there were between two and four different clusters. As we can see, this is quite subjective, and it would be great if there was a method to help us define the right number of clusters. Luckily, such a method exists, and it is called the **Elbow** method.

This method assesses the compactness of clusters, the objective being to minimize a value known as **inertia**.
Inertia is the sum of the squared distances between each data point and the centroid of the cluster it is assigned to. The lower the inertia, the closer the points of each cluster are to each other.
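
The sketch below recomputes inertia by hand, as the sum of squared distances between each point and the centroid of its assigned cluster, and compares it with the `inertia_` attribute of a fitted model. The data is synthetic (generated with `make_blobs`) and the names `points`, `model` and `manual_inertia` are assumptions chosen for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data, used only for illustration
points, _ = make_blobs(n_samples=200, centers=3, random_state=0)

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)

# Centroid assigned to each point, then the sum of squared distances to it
assigned_centroids = model.cluster_centers_[model.labels_]
manual_inertia = ((points - assigned_centroids) ** 2).sum()

print(manual_inertia, model.inertia_)
```

The two printed values should agree up to floating-point rounding.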

In [17]:
clusters = pd.DataFrame()
clusters['cluster_range'] = range(1,10)
inertia = []

for k in clusters['cluster_range']:
  kmeans = KMeans(n_clusters = k, random_state = 8).fit(X)
  inertia.append(kmeans.inertia_)

In [18]:
clusters['inertia'] = inertia
clusters

| | cluster_range | inertia |
|---|---|---|
| 0 | 1 | 13335160000000.0 |
| 1 | 2 | 7063097000000.0 |
| 2 | 3 | 3718683000000.0 |
| 3 | 4 | 2341849000000.0 |
| 4 | 5 | 1713801000000.0 |
| 5 | 6 | 1226459000000.0 |
| 6 | 7 | 942273500000.0 |
| 7 | 8 | 748884700000.0 |
| 8 | 9 | 634848300000.0 |


In [19]:
alt.Chart(clusters).mark_line().encode(x = 'cluster_range', y = 'inertia')

Now that we have plotted the inertia value against the number of clusters, we need to find the optimal number of clusters. To do so, we look for the inflection point in the graph, where the inertia value starts to decrease more slowly, that is, where the slope of the line is at almost a 45-degree angle. Finding the right inflection point can be tricky. If you picture the line as an arm, what we want is the center of the elbow. From the graph above, we can see that the optimal value is **3**.
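
Reading the elbow off a chart is subjective, so it can help to cross-check it with a simple heuristic. The sketch below is one possible heuristic, not a method used by scikit-learn or by this chapter: it rescales both axes to the 0 to 1 range and picks the number of clusters where the curve lies furthest below the straight line joining its two end points. The names `k_values`, `inertia_values`, `gap` and `elbow_k` are assumptions; the inertia values are copied from the table above.

```python
import numpy as np

# Inertia values copied from the table above (k = 1 to 9)
k_values = np.arange(1, 10)
inertia_values = np.array([13335160000000.0, 7063097000000.0, 3718683000000.0,
                           2341849000000.0, 1713801000000.0, 1226459000000.0,
                           942273500000.0, 748884700000.0, 634848300000.0])

# Rescale both axes to [0, 1] so that they are comparable
k_scaled = (k_values - k_values.min()) / (k_values.max() - k_values.min())
inertia_scaled = ((inertia_values - inertia_values.min())
                  / (inertia_values.max() - inertia_values.min()))

# Gap between the straight line joining the two end points and the curve;
# the elbow is taken to be where the curve sags furthest below that line
gap = (1 - k_scaled) - inertia_scaled
elbow_k = int(k_values[np.argmax(gap)])
print(elbow_k)
```

For these values the heuristic also points to three clusters, although the gap at four clusters is only slightly smaller, which reflects how close the call is.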

In [20]:
kmeans = KMeans(random_state = 42, n_clusters = 3)
kmeans.fit(X)
df['cluster2'] = kmeans.predict(X)
df

| | Postcode | Average total business income | Average total business expenses | cluster | cluster2 |
|---|---|---|---|---|---|
| 0 | 2000 | 210901 | 222191 | 5 | 2 |
| 1 | 2006 | 69983 | 48971 | 1 | 0 |
| 2 | 2007 | 575099 | 639499 | 7 | 1 |
| 3 | 2008 | 53329 | 32173 | 0 | 0 |
| 4 | 2009 | 237539 | 222993 | 5 | 2 |
| ... | ... | ... | ... | ... | ... |
| 2468 | 870 | 62793 | 44687 | 0 | 0 |
| 2469 | 872 | 53025 | 45670 | 0 | 0 |
| 2470 | 880 | 45603 | 28700 | 0 | 0 |
| 2471 | 885 | 53148 | 39850 | 0 | 0 |


In [21]:
scatter_plot.encode(x='Average total business income', y='Average total business expenses',color='cluster2:N',
    tooltip=['Postcode', 'cluster2']
).interactive()

### Finding the Optimal Number of Clusters

In [22]:
import pandas as pd
from sklearn.cluster import KMeans 
import altair as alt

In [23]:
file_url = 'https://raw.githubusercontent.com/PacktWorkshops/The-Data-Science-Workshop/master/Chapter05/DataSet/taxstats2015.csv'
df = pd.read_csv(file_url, usecols = ['Postcode',
                                      'Average total business income',
                                      'Average total business expenses'])


In [24]:
X = df[['Average total business income', 'Average total business expenses']]
clusters = pd.DataFrame()
inertia = []

In [25]:
clusters['cluster_range'] = range(1,25)

for k in clusters['cluster_range']:
  kmeans = KMeans(n_clusters = k).fit(X)
  inertia.append(kmeans.inertia_)

In [26]:
clusters['inertia'] = inertia
clusters

| | cluster_range | inertia |
|---|---|---|
| 0 | 1 | 13335160000000.0 |
| 1 | 2 | 7063097000000.0 |
| 2 | 3 | 3718740000000.0 |
| 3 | 4 | 2351450000000.0 |
| 4 | 5 | 1740765000000.0 |
| 5 | 6 | 1224315000000.0 |
| 6 | 7 | 960991500000.0 |
| 7 | 8 | 748853300000.0 |
| 8 | 9 | 634613500000.0 |
| 9 | 10 | 569912100000.0 |


In [27]:
alt.Chart(clusters).mark_line().encode(alt.X('cluster_range'), alt.Y('inertia'))

In [28]:
optim_cluster = 4
kmeans = KMeans(random_state = 42, n_clusters = optim_cluster)
kmeans.fit(X)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=4, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=42, tol=0.0001, verbose=0)

In [29]:
df['cluster2'] = kmeans.predict(X)

In [30]:
df

| | Postcode | Average total business income | Average total business expenses | cluster2 |
|---|---|---|---|---|
| 0 | 2000 | 210901 | 222191 | 3 |
| 1 | 2006 | 69983 | 48971 | 0 |
| 2 | 2007 | 575099 | 639499 | 2 |
| 3 | 2008 | 53329 | 32173 | 0 |
| 4 | 2009 | 237539 | 222993 | 3 |
| ... | ... | ... | ... | ... |
| 2468 | 870 | 62793 | 44687 | 0 |
| 2469 | 872 | 53025 | 45670 | 0 |
| 2470 | 880 | 45603 | 28700 | 0 |
| 2471 | 885 | 53148 | 39850 | 0 |


In [31]:
alt.Chart(df).mark_circle().encode(x = 'Average total business income',
                                  y = 'Average total business expenses',
                                  color = 'cluster2:N',
                                  tooltip = ['Postcode', 'cluster2', 'Average total business income',
                                             'Average total business expenses']).interactive()

Observations:
* **Cluster 0 (blue)**: Groups all observations with an average total business income lower than 100k and average total business expenses lower than 80k.

* **Cluster 3 (cyan)**: Groups data points with an average total business income lower than 180k and average total business expenses lower than 160k.

* **Cluster 1 (orange)**: Groups data points with an average total business income lower than 370k and average total business expenses lower than 330k.

* **Cluster 2 (red)**: Groups data points with an average total business income higher than 370k and average total business expenses higher than 330k.

## Initializing Clusters

Since the beginning of this chapter, we've been referring to k-means every time we've fitted our clustering algorithms. But you may have noticed in each model summary that there was a hyperparameter called init with the default value as k-means++. We were, in fact, using k-means++ all this time.

The difference between k-means and k-means++ is in how they initialize clusters at the start of the training. k-means randomly chooses the center of each cluster (called the centroid) and then assigns each data point to its nearest cluster. If this cluster initialization is chosen incorrectly, this may lead to non-optimal grouping at the end of the training process. For example, in the following graph, we can clearly see the three natural groupings of the data, but the algorithm didn't succeed in identifying them properly:

![alternative text](https://s3.amazonaws.com/thinkific/file_uploads/59347/images/a37/751/797/C15019_05_26.png)

K-means++ is an attempt to find better clusters at initialization time. The idea behind it is to choose the first centroid randomly from the data points and then pick each of the next ones from the remaining data points using a probability distribution that favors points further away from the centroids already chosen.
Even though k-means++ tends to get better results compared to the original k-means, in some cases, it can still lead to non-optimal clustering.
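
The seeding idea can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not scikit-learn's actual implementation: the function name `kmeans_plus_plus_init`, its parameters and the toy data are all hypothetical, and each new centroid is drawn with probability proportional to its squared distance from the nearest centroid already chosen.

```python
import numpy as np

def kmeans_plus_plus_init(points, n_clusters, seed=0):
    """Pick initial centroids following the k-means++ seeding idea."""
    rng = np.random.default_rng(seed)
    # First centroid: a data point chosen uniformly at random
    centroids = [points[rng.integers(len(points))]]
    for _ in range(1, n_clusters):
        # Squared distance from every point to its nearest chosen centroid
        diffs = points[:, None, :] - np.array(centroids)[None, :, :]
        nearest_sq_dist = (diffs ** 2).sum(axis=2).min(axis=1)
        # Far-away points get a proportionally higher chance of being picked
        probabilities = nearest_sq_dist / nearest_sq_dist.sum()
        centroids.append(points[rng.choice(len(points), p=probabilities)])
    return np.array(centroids)

# Hypothetical data: three well-separated groups of 50 points each
rng = np.random.default_rng(42)
toy_points = np.vstack([rng.normal(loc, 0.5, size=(50, 2))
                        for loc in ([0, 0], [10, 10], [0, 10])])
initial_centroids = kmeans_plus_plus_init(toy_points, n_clusters=3)
print(initial_centroids)
```

Because a point that is already a centroid has a distance of zero, it cannot be drawn a second time, which is what spreads the starting centroids out.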

Another hyperparameter we can use to lower the risk of incorrect clusters is **n_init**. This corresponds to the number of times k-means is run with different initializations, the final model being the best run (the one with the lowest inertia).

So, if you have a high number for this hyperparameter, you will have a higher chance of finding the optimal clusters, but the training time will be longer.
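
To illustrate the idea behind n_init, the sketch below runs k-means ten times by hand, each time with a single random initialization, and keeps the lowest inertia. It mimics the principle rather than reproducing scikit-learn's internals; the synthetic data and the names `points`, `single_run_inertias` and `best_inertia` are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data, used only for illustration
points, _ = make_blobs(n_samples=300, centers=4, random_state=3)

# Ten single-initialization runs, each starting from different random centroids
single_run_inertias = []
for seed in range(10):
    run = KMeans(n_clusters=4, init='random', n_init=1, random_state=seed).fit(points)
    single_run_inertias.append(run.inertia_)

# Keeping the run with the lowest inertia is the idea behind n_init
best_inertia = min(single_run_inertias)
print(single_run_inertias)
print(best_inertia)
```

If the printed inertias differ from one run to the next, some initializations ended in a worse grouping than others, which is exactly the risk that a higher n_init reduces.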

In [32]:
kmeans = KMeans(random_state = 14, n_clusters = 3, init = 'random', n_init= 1)
kmeans.fit(X)

KMeans(algorithm='auto', copy_x=True, init='random', max_iter=300, n_clusters=3,
       n_init=1, n_jobs=None, precompute_distances='auto', random_state=14,
       tol=0.0001, verbose=0)

In [33]:
# X holds the business income and expenses columns, so plot those
df['cluster3'] = kmeans.predict(X)
alt.Chart(df).mark_circle().encode(x='Average total business income', y='Average total business expenses',color='cluster3:N',
    tooltip=['Postcode', 'cluster3', 'Average total business income', 'Average total business expenses']
).interactive()

### Using Different Initialization Parameters to Achieve a Suitable Outcome


In [34]:
import pandas as pd
from sklearn.cluster import KMeans
import altair as alt

In [None]:
file_url = 'https://raw.githubusercontent.com/PacktWorkshops/The-Data-Science-Workshop/master/Chapter05/DataSet/taxstats2015.csv'


In [35]:
df = pd.read_csv(file_url,usecols=['Postcode', 'Average total business income', 'Average total business expenses'])


In [36]:
X = df[['Average total business income', 'Average total business expenses']]

In [37]:
kmeans = KMeans(random_state= 1, n_clusters = 4, init = 'random', n_init=1)
kmeans.fit(X)

KMeans(algorithm='auto', copy_x=True, init='random', max_iter=300, n_clusters=4,
       n_init=1, n_jobs=None, precompute_distances='auto', random_state=1,
       tol=0.0001, verbose=0)

In [43]:
df['cluster3'] = kmeans.predict(X)
scatter_plot = alt.Chart(df).mark_circle()

In [44]:
scatter_plot.encode(x = 'Average total business income',
                    y = 'Average total business expenses',
                    color = 'cluster3:N',
                    tooltip = ['Postcode','cluster3',
                               'Average total business income',
                               'Average total business expenses']).interactive()

In [46]:
kmeans = KMeans(random_state = 1, n_clusters = 4, init = 'random', n_init = 10)
kmeans.fit(X)

df['cluster4'] = kmeans.predict(X)
scatter_plot = alt.Chart(df).mark_circle()
scatter_plot.encode(x = 'Average total business income', y = 'Average total business expenses',
                    color = 'cluster4:N',
                    tooltip = ['Postcode', 'cluster4', 'Average total business income',
                               'Average total business expenses']).interactive()

In [48]:
kmeans = KMeans(random_state = 1, n_clusters = 4, init = 'random', n_init = 100)
kmeans.fit(X)

df['cluster5'] = kmeans.predict(X)
scatter_plot = alt.Chart(df).mark_circle()
scatter_plot.encode(x = 'Average total business income', y = 'Average total business expenses',
                    color = 'cluster5:N',
                    tooltip = ['Postcode', 'cluster5', 'Average total business income',
                               'Average total business expenses']).interactive()

## Calculating the Distance to the Centroid

We've talked a lot about similarities between data points in the previous sections, but we haven't really defined what this means. You have probably guessed that it has something to do with how close or how far observations are from each other. You are heading in the right direction. It is to do with some sort of distance measure between two points. The one used by k-means is called squared Euclidean distance and its formula is:

![alternative text](https://s3.amazonaws.com/thinkific/file_uploads/59347/images/53d/4bf/6ee/C15019_05_32.png)

This formula represents the sum of the squared differences between the coordinates of two data points. Here, x and y are the two data points and the index, i, runs over their coordinates. If the data has two dimensions, i takes the values 1 and 2. Similarly, if there are three dimensions, i goes from 1 to 3.

In [49]:
x = X.iloc[0,].values
y = X.iloc[1,].values
print(x)
print(y)

[210901 222191]
[69983 48971]


The coordinates for **x** are **(210901, 222191)** and the coordinates for **y** are **(69983, 48971)**.

In [51]:
squared_distances = (x[0] - y[0]) ** 2 + (x[1] - y[1]) ** 2
print(squared_distances)

49863051124
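
The same calculation can be written for any number of dimensions, following $d(x, y) = \sum_i (x_i - y_i)^2$. The helper below is a hypothetical sketch (the name `squared_euclidean_distance` is not part of any library); it converts the inputs to floats so that squaring large values cannot overflow a fixed-size integer type.

```python
import numpy as np

def squared_euclidean_distance(point_a, point_b):
    """Sum of squared coordinate differences, for any number of dimensions."""
    # Floats avoid integer overflow when squaring large values
    point_a = np.asarray(point_a, dtype=float)
    point_b = np.asarray(point_b, dtype=float)
    return ((point_a - point_b) ** 2).sum()

print(squared_euclidean_distance([210901, 222191], [69983, 48971]))
print(squared_euclidean_distance([1, 2, 3], [4, 6, 3]))
```

The first call uses the two observations above and should reproduce 49863051124; the second shows a made-up three-dimensional case (9 + 16 + 0 = 25).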


K-means uses this metric to calculate the distance between each data point and the center of its assigned cluster (also called the centroid). Here is the basic logic behind this algorithm:

1. Choose the centers of the clusters (the centroids) randomly.
2. Assign each data point to the nearest centroid using the squared Euclidean distance.
3. Update each centroid coordinate to the newly calculated center of the data points assigned to it.
4. Repeat Steps 2 and 3 until the clusters converge.
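
The four steps above can be sketched in plain NumPy. This is a minimal teaching version under stated assumptions, not the scikit-learn implementation: the function name `simple_kmeans`, its parameters and the toy data are hypothetical, the starting centroids are drawn from the data points themselves, and a centroid that ends up with no points is simply left in place.

```python
import numpy as np

def simple_kmeans(points, n_clusters, max_iter=100, seed=0):
    """Minimal k-means loop following the four steps listed above."""
    rng = np.random.default_rng(seed)
    # Step 1: use randomly chosen data points as the starting centroids
    centroids = points[rng.choice(len(points), size=n_clusters, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to the centroid with the smallest squared distance
        sq_dist = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dist.argmin(axis=1)
        # Step 3: move each centroid to the mean of its points
        # (a centroid with no points is left where it is)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(n_clusters)
        ])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical data: two tight, well-separated groups
toy_points = np.array([[1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
                       [8.0, 8.0], [8.2, 7.9], [7.9, 8.3]])
labels, centroids = simple_kmeans(toy_points, n_clusters=2)
print(labels)
print(centroids)
```

On data this cleanly separated, the loop should settle after a few iterations with one centroid in the middle of each group.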

In [55]:
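# This example uses the net tax and deductions columns, so reload them first
df = pd.read_csv(file_url, usecols = ['Postcode', 'Average net tax', 'Average total deductions'])
X = df[['Average net tax', 'Average total deductions']]
kmeans = KMeans(random_state = 42, n_clusters= 3, init = 'k-means++', n_init= 5)
kmeans.fit(X)
df['cluster6'] = kmeans.predict(X)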

In [56]:
#Extract centroids
centroids = kmeans.cluster_centers_
centroids = pd.DataFrame(centroids, columns = ['Average net tax',
                                               'Average total deductions'])
print(centroids)

   Average net tax  Average total deductions
0     21645.622924               3054.901993
1     12395.607303               2383.760674
2     45279.802198               6067.000000


In [58]:
chart1 = alt.Chart(df).mark_circle().encode(x ='Average net tax', y = 'Average total deductions',
                                            color = 'cluster6:N',
                                            tooltip = ['Postcode', 'cluster6', 'Average net tax',
                                                       'Average total deductions']).interactive()

In [59]:
chart1

In [60]:
chart2 = alt.Chart(centroids).mark_circle(size = 100).encode(
    x = 'Average net tax',
    y = 'Average total deductions',
    color = alt.value('black'),
    tooltip = ['Average net tax',
               'Average total deductions']
).interactive()
chart2

In [61]:
chart1 + chart2

### Finding the Closest Centroids in Our Dataset

In [62]:
import pandas as pd
from sklearn.cluster import KMeans 
import altair as alt

In [63]:
file_url = 'https://raw.githubusercontent.com/PacktWorkshops/The-Data-Science-Workshop/master/Chapter05/DataSet/taxstats2015.csv'
df = pd.read_csv(file_url,
                 usecols = ['Postcode', 'Average total business income',
                            'Average total business expenses'])
X = df[['Average total business income',
        'Average total business expenses']]

In [64]:
business_income_min = df['Average total business income'].min()
business_income_max = df['Average total business income'].max()

business_expenses_min = df['Average total business expenses'].min()
business_expenses_max = df['Average total business expenses'].max()

In [65]:
print(business_income_min)
print(business_income_max)
print(business_expenses_min)
print(business_expenses_max)

0
876324
0
884659


Now import the random package and use the seed() method to set a seed of 42, as shown in the following code snippet:

In [67]:
import random
random.seed(42)

Generate four random values using the sample() method from the random package, with possible values between the minimum and maximum values of the 'Average total business income' column using range(), and store the results in a new column called 'Average total business income' in the centroids DataFrame:

In [68]:
centroids = pd.DataFrame()
centroids['Average total business income'] = random.sample(range(business_income_min,business_income_max), 4)

Repeat the same process to generate 4 random values for 'Average total business expenses':

In [69]:
centroids['Average total business expenses'] = random.sample(range(
    business_expenses_min, business_expenses_max
),4)

In [70]:
centroids['cluster'] = centroids.index 
centroids

| | Average total business income | Average total business expenses | cluster |
|---|---|---|---|
| 0 | 670487 | 288389 | 0 |
| 1 | 116739 | 256787 | 1 |
| 2 | 26225 | 234053 | 2 |
| 3 | 777572 | 146316 | 3 |


In [71]:
chart1 = alt.Chart(df.head()).mark_circle().encode(
    x = 'Average total business income',
    y = 'Average total business expenses',
    color = alt.value('orange'),
    tooltip = ['Postcode', 'Average total business income',
               'Average total business expenses']
).interactive()
chart1

In [72]:
chart2 = alt.Chart(centroids).mark_circle(size = 100).encode(
    x = 'Average total business income',
    y = 'Average total business expenses',
    color = alt.value('black'),
    tooltip = ['cluster', 'Average total business income',
               'Average total business expenses']
).interactive()
chart2

In [73]:
chart1 + chart2

In [74]:
def squared_euclidean(data_x, data_y, centroid_x, centroid_y):
  return (data_x - centroid_x) ** 2 + (data_y - centroid_y)**2

# Coordinates of the first observation (row 0)
data_x = df.at[0, 'Average total business income']
data_y = df.at[0, 'Average total business expenses']

In [75]:
distances = [squared_euclidean(data_x,data_y, centroids.at[i,
                               'Average total business income'],
                               centroids.at[i,'Average total business expenses']) for i in range(4)]
distances

[215601466600, 10063365460, 34245932020, 326873037866]

In [76]:
cluster_index = distances.index(min(distances))

In [77]:
df.at[0, 'cluster'] = cluster_index

In [78]:
df.head()

| | Postcode | Average total business income | Average total business expenses | cluster |
|---|---|---|---|---|
| 0 | 2000 | 210901 | 222191 | 1.0 |
| 1 | 2006 | 69983 | 48971 | NaN |
| 2 | 2007 | 575099 | 639499 | NaN |
| 3 | 2008 | 53329 | 32173 | NaN |
| 4 | 2009 | 237539 | 222993 | NaN |


In [79]:
# Assign each of the remaining four rows to its nearest centroid
for row in range(1, 5):
    distances = [squared_euclidean(df.at[row, 'Average total business income'],
                                   df.at[row, 'Average total business expenses'],
                                   centroids.at[i, 'Average total business income'],
                                   centroids.at[i, 'Average total business expenses'])
                 for i in range(4)]
    df.at[row, 'cluster'] = distances.index(min(distances))

In [80]:
df.head()

| | Postcode | Average total business income | Average total business expenses | cluster |
|---|---|---|---|---|
| 0 | 2000 | 210901 | 222191 | 1.0 |
| 1 | 2006 | 69983 | 48971 | 2.0 |
| 2 | 2007 | 575099 | 639499 | 0.0 |
| 3 | 2008 | 53329 | 32173 | 2.0 |
| 4 | 2009 | 237539 | 222993 | 1.0 |


In [81]:
chart1 = alt.Chart(df.head()).mark_circle().encode(x='Average total business income', y='Average total business expenses', color='cluster:N',
    tooltip=['Postcode', 'cluster', 'Average total business income', 'Average total business expenses']
).interactive()

chart2 = alt.Chart(centroids).mark_circle(size=100).encode(x='Average total business income', y='Average total business expenses', color=alt.value('black'),
    tooltip=['cluster', 'Average total business income', 'Average total business expenses']
).interactive()
chart1 + chart2

In this final result, we can see where the four centroids have been placed in the graph and which cluster each of the five data points has been assigned to:

* The two data points in the bottom-left corner have been assigned to cluster 2, which corresponds to the one with a centroid of coordinates of 26,000 (average total business income) and 234,000 (average total business expense). It is the closest centroid for these two points.
* The two observations in the middle are very close to the centroid with coordinates of 116,000 (average total business income) and 256,000 (average total business expense), which corresponds to cluster 1.
* The observation at the top has been assigned to cluster 0, whose centroid has coordinates of 670,000 (average total business income) and 288,000 (average total business expense).
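
The row-by-row assignment can also be expressed in one vectorized step. The sketch below is an alternative formulation, not the approach used in the exercise: it copies the five observations and the four random centroids from the tables above into NumPy arrays (the names `observations`, `centroid_coords`, `sq_dist` and `nearest` are assumptions) and uses broadcasting to compute every distance at once.

```python
import numpy as np

# The five observations and four random centroids shown in the tables above,
# as (business income, business expenses) pairs; floats avoid integer overflow
observations = np.array([[210901, 222191], [69983, 48971], [575099, 639499],
                         [53329, 32173], [237539, 222993]], dtype=float)
centroid_coords = np.array([[670487, 288389], [116739, 256787],
                            [26225, 234053], [777572, 146316]], dtype=float)

# Squared Euclidean distance from every observation to every centroid (5 x 4)
sq_dist = ((observations[:, None, :] - centroid_coords[None, :, :]) ** 2).sum(axis=2)

# Index of the nearest centroid for each observation
nearest = sq_dist.argmin(axis=1)
print(nearest)
```

The resulting indices should match the cluster column obtained above: 1, 2, 0, 2 and 1.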

### Standardizing the Data from Our Dataset

In [82]:
import pandas as pd
from sklearn.cluster import KMeans
import altair as alt

In [83]:
file_url = 'https://raw.githubusercontent.com/PacktWorkshops/The-Data-Science-Workshop/master/Chapter05/DataSet/taxstats2015.csv'
df = pd.read_csv(file_url,
                 usecols = ['Postcode', 'Average total business income',
                            'Average total business expenses'])


In [84]:
X = df[['Average total business income',
        'Average total business expenses']]

In [85]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

In [87]:
min_max_scaler = MinMaxScaler()
min_max_scaler.fit(X)
X_min_max = min_max_scaler.transform(X)

In [88]:
kmeans = KMeans(random_state=1, n_clusters=4, init='k-means++', n_init=5)
kmeans.fit(X_min_max)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=4, n_init=5, n_jobs=None, precompute_distances='auto',
       random_state=1, tol=0.0001, verbose=0)

In [89]:
df['cluster8'] = kmeans.predict(X_min_max)

In [90]:
scatter_plot = alt.Chart(df).mark_circle()
scatter_plot.encode(x='Average total business income', y='Average total business expenses',color='cluster8:N',
    tooltip=['Postcode', 'cluster8', 'Average total business income', 'Average total business expenses']
).interactive()

In [91]:
standard_scaler = StandardScaler()
X_scaled = standard_scaler.fit_transform(X)
kmeans = KMeans(random_state=1, n_clusters=4, init='k-means++', n_init=5)
kmeans.fit(X_scaled)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=4, n_init=5, n_jobs=None, precompute_distances='auto',
       random_state=1, tol=0.0001, verbose=0)

In [92]:
df['cluster9'] = kmeans.predict(X_scaled)

In [93]:
scatter_plot = alt.Chart(df).mark_circle()
scatter_plot.encode(x='Average total business income', y='Average total business expenses',color='cluster9:N',
    tooltip=['Postcode', 'cluster9', 'Average total business income', 'Average total business expenses']
).interactive()

The k-means clustering results are very similar between min-max and z-score standardization, which are the outputs for Steps 10 and 13. Compared to the results in Exercise 5.4, Using Different Initialization Parameters to Achieve a Suitable Outcome, we can see that the boundaries between clusters 1 and 2 are slightly lower after standardization; indeed, the blue cluster has switched places in the output. If you look at Figure 5.31 of Exercise 5.4, Using Different Initialization Parameters to Achieve a Suitable Outcome, you will notice the switch in cluster 0 (blue) when compared to Figure 5.54 of Exercise 5.5, Finding the Closest Centroids in Our Dataset. These results are very close to each other because the range of values for the two variables (average total business income and average total business expenses) is almost identical: between 0 and 900,000. Therefore, k-means is not putting more weight toward one of these variables.

Well done! You've completed the final exercise of this chapter. You have learned how to preprocess data before fitting a k-means model with two very popular methods: min-max scaling and z-score.
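
For reference, both scaling methods boil down to a simple per-column formula. The sketch below applies the formulas by hand to a small made-up array and checks them against `MinMaxScaler` and `StandardScaler`; the array and the names `data`, `manual_min_max` and `manual_z_score` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical values on very different scales (for illustration only)
data = np.array([[0.0, 100.0], [5.0, 300.0], [10.0, 900.0]])

# Min-max scaling: (value - column minimum) / (column maximum - column minimum)
manual_min_max = (data - data.min(axis=0)) / (data.max(axis=0) - data.min(axis=0))

# Z-score: (value - column mean) / column standard deviation
manual_z_score = (data - data.mean(axis=0)) / data.std(axis=0)

print(np.allclose(manual_min_max, MinMaxScaler().fit_transform(data)))
print(np.allclose(manual_z_score, StandardScaler().fit_transform(data)))
```

After either transformation the two columns live on comparable scales, so neither dominates the squared Euclidean distance used by k-means.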

## Perform Customer Segmentation in a Bank Using k-means

In [97]:
file_url = 'https://raw.githubusercontent.com/PacktWorkshops/The-Data-Science-Workshop/master/Chapter05/DataSet/german.data-numeric.dat'
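# read_csv no longer accepts prefix= (removed in pandas 2.0), so add_prefix() names the columns X0, X1, ...
df = pd.read_csv(file_url, header = None, sep = r'\s\s+', engine = 'python').add_prefix('X')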

  


In [105]:
X = df[['X3','X9']]
X

| | X3 | X9 |
|---|---|---|
| 0 | 12 | 67 |
| 1 | 60 | 22 |
| 2 | 21 | 49 |
| 3 | 79 | 45 |
| 4 | 49 | 53 |
| ... | ... | ... |
| 995 | 17 | 31 |
| 996 | 39 | 40 |
| 997 | 8 | 38 |
| 998 | 18 | 23 |


In [106]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_scaled = sc.fit_transform(X)

In [107]:
clusters = pd.DataFrame()
inertia = []

clusters['cluster_range'] = range(1,15)

In [108]:
for k in clusters['cluster_range']:
  kmeans = KMeans(n_clusters = k, random_state = 8).fit(X_scaled)
  inertia.append(kmeans.inertia_)

In [110]:
clusters['inertia'] = inertia

In [111]:
clusters

| | cluster_range | inertia |
|---|---|---|
| 0 | 1 | 2000.0 |
| 1 | 2 | 1280.612749 |
| 2 | 3 | 767.637196 |
| 3 | 4 | 576.086134 |
| 4 | 5 | 443.905649 |
| 5 | 6 | 360.418261 |
| 6 | 7 | 291.39305 |
| 7 | 8 | 252.709449 |
| 8 | 9 | 219.498996 |
| 9 | 10 | 193.015983 |


In [113]:
alt.Chart(clusters).mark_line().encode(alt.X('cluster_range'), alt.Y('inertia'))

In [115]:
cluster_optimal = 5

In [116]:
kmeans = KMeans(random_state = 1, n_clusters = cluster_optimal, init = 'k-means++',
                n_init = 50, max_iter = 1000)

kmeans.fit(X_scaled)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=1000,
       n_clusters=5, n_init=50, n_jobs=None, precompute_distances='auto',
       random_state=1, tol=0.0001, verbose=0)

In [117]:
df['cluster'] = kmeans.predict(X_scaled)

In [119]:
scatter_plot = alt.Chart(df).mark_circle()
scatter_plot.encode(x = 'X3', y = 'X9', color =  'cluster:N')

## Summary
You are now ready to perform cluster analysis with the k-means algorithm on your own dataset. This type of analysis is very popular in the industry for segmenting customer profiles as well as detecting suspicious transactions or anomalies.

We learned about a lot of different concepts, such as centroids and squared Euclidean distance. We went through the main k-means hyperparameters: init (initialization method), n_init (number of initialization runs), n_clusters (number of clusters), and random_state (specified seed). We also discussed the importance of choosing the optimal number of clusters, initializing centroids properly, and standardizing data. You have learned how to use the following Python packages: pandas, altair, and sklearn (in particular its KMeans class).

In this chapter, we only looked at k-means, but it is not the only clustering algorithm. There are quite a lot of algorithms that use different approaches, such as hierarchical clustering and the Gaussian mixture model, to name a few. If you are interested in this field, you now have all the basic knowledge you need to explore these other algorithms on your own.

Next, you will see how we can assess the performance of these models and what tools can be used to make them even better.