In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# K-Means

As a first step, we need to import the data from the `retail_ml_dataset.csv` data file that we constructed and exported on Day 1 (or the corresponding backup file that we have provided) into the variable `customers_ml_data` using the `read_csv()` function from `pandas` (`pd`). We also define the column to use as the row labels of the DataFrame; in this case, *CustomerID*. Once loaded, we can again apply the `head()` method to preview the first 5 rows of our DataFrame.

In [None]:
# Import the data, using CustomerID as the row index

customers_ml_data = pd.read_csv('data/online_retail_afterEDA.csv', index_col='CustomerID')
customers_ml_data.head()

We will start by looking specifically at numerical features. Below we list the non-binary features and separate them into a dataset called `customers`:

In [None]:
non_binary_cols = [
    'balance', 'max_spent', 'mean_spent', 
    'min_spent', 'n_orders','total_items', 
    'total_refunded', 'total_spent' ]

customers = customers_ml_data[non_binary_cols]
customers.head()

Let's also import the data from `pca_scores.csv`. By now this should be easy! Let's call this `Xscores`. Set the index column to be `CustomerID`:

In [None]:
# Import pca_scores.csv using pd.read_csv
Xscores = pd.read_csv('data/pca_scores.csv', index_col='CustomerID')
Xscores.head()


## Clustering with K-Means

K-means clustering is a method for finding clusters and cluster centroids (that is, the center point of each cluster) in a set of points. The K-means algorithm is quite simple and alternates between two steps:

1. for each centroid, identify the subset of training points that are closer to it than to any other centroid
2. move each centroid to the mean of the points assigned to it

These two steps are iterated until the centroids no longer move (significantly) or the assignments no longer change. Then, a new point $x$ can be assigned to the cluster of the closest centroid.
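
The two steps above can be sketched in a few lines of NumPy. This is only a minimal illustration, not what `sklearn` does internally: the function name `lloyd_kmeans`, the toy points and the choice to pass the initial centroids in by hand are all assumptions made for the example.

```python
import numpy as np

def lloyd_kmeans(points, init_centroids, n_iter=100):
    """Minimal K-means sketch: alternate assignment and centroid update."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(init_centroids, dtype=float)
    for _ in range(n_iter):
        # Step 1: assign every point to its closest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 2: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(len(centroids))
        ])
        # Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Hypothetical toy data: two small groups of 2D points
toy_points = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
                       [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]])

# Assumption: start from one point of each group
toy_centroids, toy_labels = lloyd_kmeans(toy_points, toy_points[[0, 3]])
print(toy_labels)
print(toy_centroids)
```

In practice the starting centroids are chosen automatically, and the result can depend on that choice, which is why it is worth re-running K-means or fixing a random seed when comparing results.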

### Learning Activity - Run K-Means with two features

Isolate the features `mean_spent` and `max_spent`, then run the K-Means algorithm on the resulting dataset using K=2 and visualise the result. You will need to:

* create an instance of `KMeans` with 2 clusters
* fit it to the isolated features (via the `.fit` method)
* check how it is doing by looking at the predicted assignment (via the `.predict` method)

This fit/predict pattern is the standard `sklearn` workflow for most algorithms.

In [None]:
from sklearn.cluster import KMeans

# Apply k-means with 2 clusters using a subset of features 
# (mean_spent and max_spent)

kmeans = KMeans(n_clusters = 2)
cust2  = customers[['mean_spent', 'max_spent']]
kmeans.fit(cust2)
cluster_assignment = kmeans.predict(cust2)


Let us introduce a simple function to better visualise what's going on:

In [None]:
# This function generates a pairplot enhanced with the result of k-means
def pairplot_cluster(df, cols, cluster_assignment):
    """
    Input
        df, dataframe that contains the data to plot
        cols, columns to consider for the plot
        cluster_assignment, cluster assignment returned 
        by the clustering algorithm
    """
    # seaborn will color the samples according to the column cluster;
    # assign() returns a copy, so the caller's dataframe is left untouched
    plot_df = df.assign(cluster=cluster_assignment)
    sns.pairplot(plot_df, vars=cols, hue='cluster')

Let's use it now to see how the clustering above turned out (ignore any warnings that come up):

In [None]:
# Visualise the clusters using pairplot_cluster()
pairplot_cluster(customers, ['mean_spent', 'max_spent'], cluster_assignment)


#### What can you observe?

* the separation between the two clusters is "clean" (the two clusters can be separated with a line)
* one cluster contains customers with low spending, the other customers with high spending
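
The first observation is not a coincidence. With two centroids $c_0$ and $c_1$, a point $x$ is closer to $c_1$ exactly when $(c_1 - c_0) \cdot x > (\lVert c_1 \rVert^2 - \lVert c_0 \rVert^2)/2$, so the boundary is the perpendicular bisector of the segment joining the centroids. The sketch below checks this numerically; the centroids and points are hypothetical values chosen only for illustration.

```python
import numpy as np

# Hypothetical centroids and points, chosen only for illustration
c0 = np.array([1.0, 2.0])
c1 = np.array([6.0, 9.0])
rng = np.random.default_rng(0)
pts = rng.uniform(low=0.0, high=10.0, size=(200, 2))

# Rule A: nearest centroid (the K-means assignment step)
nearest = (np.linalg.norm(pts - c1, axis=1)
           < np.linalg.norm(pts - c0, axis=1)).astype(int)

# Rule B: which side of the perpendicular bisector the point falls on
w = c1 - c0
threshold = (c1 @ c1 - c0 @ c0) / 2.0
linear = (pts @ w > threshold).astype(int)

# Both rules give the same assignment for every point
print((nearest == linear).all())
```

The bisector lives in the space of the features used for fitting, here `mean_spent` and `max_spent`.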

### Test Activity - Run K-Means with all the features
Run K-Means using all the features available and visualise the result in the subspace `mean_spent` and `max_spent`.

In [None]:
# Apply k-means with 2 clusters using all features
kmeans = KMeans(n_clusters = 2)
kmeans.fit(customers)
cluster_assignment = kmeans.predict(customers)


Now visualise the result using the same subset of variables as before. What has changed?

In [None]:
# Visualise the clusters using pairplot_cluster()
pairplot_cluster(customers, ['mean_spent', 'max_spent'], cluster_assignment)


***Question***: Why can't the clusters be separated with a line as before?

### Learning activity - Compare expenditure between clusters

Select the features `'mean_spent'` and `'max_spent'` and compare the two clusters obtained above using them.

In [None]:
# Compare expenditure between clusters
features = ['mean_spent', 'max_spent']

# create a dataframe with the selected features for the customers
# where cluster_assignment == 0
cluster1_df = customers.loc[cluster_assignment == 0, features].copy()

cluster1_desc = cluster1_df.describe()
cluster1_desc.columns = [c+'_0' for c in cluster1_desc.columns]


In [None]:
# then with cluster_assignment == 1
cluster2_df = pd.DataFrame(data=customers[cluster_assignment == 1], 
                             columns=customers.columns)[features]

cluster2_desc = cluster2_df.describe()
cluster2_desc.columns = [c+'_1' for c in cluster2_desc.columns]


In [None]:
# Concatenate both summaries side by side
compare_df = pd.concat((cluster1_desc, cluster2_desc), axis=1)
compare_df


### Test Activity - Looking at the centroids

Look at the centroids of the clusters `kmeans.cluster_centers_` and check the values of the centroids for the features `mean_spent`, `max_spent`. You will need to create a new DataFrame where the data is simply `kmeans.cluster_centers_`.

In [None]:
# Get the centroids and display them
centers_df = pd.DataFrame(data=kmeans.cluster_centers_, columns=customers.columns)
print(centers_df[features])


### Learning Activity - Compare mean expenditure with box plot

Compare the distribution of the feature `mean_spent` in the two clusters using a box plot. You will need:

* `sns.boxplot` (seaborn's boxplot)

In [None]:
# Compare mean expenditure with box plot
cluster1_df.columns = [c+'_0' for c in cluster1_df.columns]
cluster2_df.columns = [c+'_1' for c in cluster2_df.columns]

#plt.figure(figsize = (10,6))
# Plot the raw per-customer values (not the describe() summaries),
# so that each box reflects the actual distribution within a cluster
sns.boxplot(data=pd.concat((cluster1_df['mean_spent_0'], 
                            cluster2_df['mean_spent_1']), 
                           axis=1), showfliers=False)


Does this seem to make sense? How can you interpret the plots?
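
To help read the plot, the sketch below computes the quantities a box plot draws, assuming the default whisker length of 1.5 times the interquartile range (IQR). The `values` array is hypothetical and stands in for one cluster's `mean_spent` column.

```python
import numpy as np

# Hypothetical spending values, with one extreme customer
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100], dtype=float)

# The box spans the first to third quartile, with a line at the median
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Whiskers reach the furthest observations within 1.5 * IQR of the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything beyond the fences is a "flier"; showfliers=False hides these points
fliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3)             # box edges and median
print(whisker_low, whisker_high)  # whisker ends
print(fliers)                     # hidden when showfliers=False
```

Because `showfliers=False` hides the extreme customers, the boxes describe the bulk of each cluster rather than its largest spenders.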

### Learning Activity - Compute the silhouette score
Compute the silhouette score of the clusters resulting from the application of K-Means.

The Silhouette Coefficient is calculated using the mean intra-cluster distance (``a``) and the mean nearest-cluster distance (``b``) for each sample.  The Silhouette Coefficient for a sample is ``(b - a) / max(a, b)``. It represents how similar a sample is to the samples in its own cluster compared to samples in other clusters.

The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.
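
To make the definition concrete, here is a minimal NumPy sketch of the per-sample computation with Euclidean distances. The helper name `silhouette_values` and the toy data are assumptions made for illustration; the overall score is the mean of the per-sample values.

```python
import numpy as np

def silhouette_values(points, labels):
    """Per-sample silhouette (b - a) / max(a, b), using Euclidean distances."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distance matrix
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    values = np.zeros(len(points))
    for i in range(len(points)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            # Convention: a sample alone in its cluster gets a score of 0
            continue
        # a: mean distance to the other members of the sample's own cluster
        a = dist[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dist[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        values[i] = (b - a) / max(a, b)
    return values

# Hypothetical toy data: two tight groups on a line
toy_points = np.array([[0.0], [1.0], [10.0], [11.0]])

good_scores = silhouette_values(toy_points, [0, 0, 1, 1])  # sensible clustering
bad_scores = silhouette_values(toy_points, [0, 1, 0, 1])   # groups mixed up

print(good_scores, good_scores.mean())
print(bad_scores, bad_scores.mean())
```

The sensible labelling gives values close to 1, while the mixed-up labelling gives negative values for every sample.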

`sklearn` provides the function `silhouette_score` which you can call and display.

In [None]:
from sklearn.metrics import silhouette_score

# Computing the silhouette score
print('silhouette_score {0:.2f}'.format(silhouette_score(customers, cluster_assignment)))


This silhouette score is reasonably high, which we can interpret as meaning that the corresponding clusters are quite compact.

### Test Activity - Run KMeans on the dataset obtained with PCA

Run KMeans on the dataset `Xscores` using the first 2 principal components.

Visualise the results, again using the function `pairplot_cluster`, on the first 4 principal components.

In [None]:
# Run KMeans on the first two principal components
kmeans = KMeans(n_clusters = 2)
kmeans.fit(Xscores[['PC1', 'PC2']])
cluster_assignment = kmeans.predict(Xscores[['PC1', 'PC2']])
pairplot_cluster(Xscores, ['PC1', 'PC2', 'PC3', 'PC4'], cluster_assignment)
