# Introduction

The EDA in https://www.kaggle.com/gvyshnya/using-autoviz-to-build-a-comprehensive-eda shows that *cont2* and *cont14* appear to separate into relatively well-contained clusters when plotted against *target* in the training set. This leads to the hypothesis that the observations in the training and test sets of this competition cluster in a statistically meaningful way within the two-dimensional feature space of *cont2*X*cont14*.

*Notes*: 
- we are going to use KMeans clustering with the Euclidean distance metric in the *cont2*X*cont14* space to find the optimal cluster breakdown
- we also experimented with a density-based clustering approach (namely, the *DBSCAN* method), but it did not work well for this dataset

# Preparation Activities

First of all, we are going to do a few usual preparation steps:

- import the packages we need for this analysis
- read the competition data into memory for further manipulation

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import datetime as dt
from typing import Tuple, List, Dict

import matplotlib.pyplot as plt
import matplotlib.cm as cm

import seaborn as sns
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
import plotly.offline


# read data
in_kaggle = True


def get_data_file_path(is_in_kaggle: bool) -> Tuple[str, str, str]:
    train_path = ''
    test_path = ''
    sample_submission_path = ''

    if is_in_kaggle:
        # running in Kaggle, inside the competition
        train_path = '../input/tabular-playground-series-jan-2021/train.csv'
        test_path = '../input/tabular-playground-series-jan-2021/test.csv'
        sample_submission_path = '../input/tabular-playground-series-jan-2021/sample_submission.csv'
    else:
        # running locally
        train_path = 'data/train.csv'
        test_path = 'data/test.csv'
        sample_submission_path = 'data/sample_submission.csv'

    return train_path, test_path, sample_submission_path

In [None]:
# main flow
start_time = dt.datetime.now()
print("Started at ", start_time)

In [None]:
%%time
# get the training set and labels
train_set_path, test_set_path, sample_subm_path = get_data_file_path(in_kaggle)

df_train = pd.read_csv(train_set_path)
df_test = pd.read_csv(test_set_path)

subm = pd.read_csv(sample_subm_path)

# list of feature columns
feature_list = [col for col in df_train.columns if col.startswith('cont')]

Before moving on with the clustering experiments, we will check the basic info about our training dataset (records count, data types of variables, % of missing values etc.)

In [None]:
df_train.info()
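
`df_train.info()` reports non-null counts rather than percentages. A minimal sketch of computing the share of missing values per column is shown here on a hypothetical toy frame (the `toy` frame and its values are made up for illustration and are not competition data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_train (values are made up)
toy = pd.DataFrame({
    "cont1": [0.1, 0.2, np.nan, 0.4],
    "cont2": [0.5, 0.6, 0.7, 0.8],
    "target": [7.0, np.nan, np.nan, 8.0],
})

# isna() marks missing cells; the column-wise mean of the boolean mask
# is the fraction of missing values, scaled here to a percentage
missing_pct = toy.isna().mean() * 100
print(missing_pct)
```

The same expression should work on `df_train` directly.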

# KMeans Clustering Experiments

Now, we are ready to create a subset of the training set to use in the clustering experiments

In [None]:
%%time
# Let's cluster the observations 

clustering_cols = ['cont2', 'cont14']

# subset of training set for the clustering experiment
X = df_train.filter(clustering_cols, axis=1)
#X = StandardScaler().fit_transform(X)

display(X.head())

Reducing the clustering feature space to two dimensions makes Euclidean distance-based clustering algorithms (such as the KMeans clustering we use below) more appropriate.

However, a weak point of such algorithms is that the researcher has to choose the number of clusters somewhat arbitrarily before the actual analysis starts. The final clustering is very sensitive to this choice, so a poorly chosen number of clusters can make the result far less useful.

To mitigate this risk, we are going to ground the choice of the number of clusters for our KMeans experiment in the data, using so-called 'silhouette analysis' (as explained in https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html)
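
To make the silhouette coefficient concrete: for each sample, `a` is the mean distance to the other members of its own cluster, `b` is the smallest mean distance to the members of any other cluster, and the coefficient is `(b - a) / max(a, b)`. The sketch below computes it by hand with NumPy on four made-up points. The helper name `silhouette_by_hand` and the toy data are assumptions for illustration; in practice `sklearn.metrics.silhouette_samples` would be used instead.

```python
import numpy as np


def silhouette_by_hand(X, labels):
    """Per-sample silhouette coefficients with Euclidean distance.

    Assumes at least two distinct cluster labels.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distance matrix (fine for small inputs only)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # a singleton cluster gets a coefficient of 0
        # a: mean distance to the other members of the own cluster
        a = dist[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of another cluster
        b = min(dist[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores


# Two tight, well-separated toy clusters (made-up points)
points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
labels = np.array([0, 0, 1, 1])
scores = silhouette_by_hand(points, labels)
print(scores)
```

Values close to 1 indicate samples that sit well inside their cluster, values near 0 indicate samples on a boundary, and negative values suggest a sample may have been assigned to the wrong cluster.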

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score



X = StandardScaler().fit_transform(X)

range_n_clusters = [6, 7, 8, 9, 10, 11, 12]

for n_clusters in range_n_clusters:
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1, 1 but in this example all
    # lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])

    # Initialize the clusterer with n_clusters value and a random generator
    # seed of 10 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(X)

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(X, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(X, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd Plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(X[:, 0], X[:, 1], marker='.', s=30, lw=0, alpha=0.7,
                c=colors, edgecolor='k')

    # Labeling the clusters
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(centers[:, 0], centers[:, 1], marker='o',
                c="white", alpha=1, s=200, edgecolor='k')

    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1,
                    s=50, edgecolor='k')

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")

    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')

plt.show()

Judging by the charts above, the solution with 6 clusters looks best in terms of cluster geometry and the spread of the points.

Now we are ready to proceed with the actual clustering of the observations in the training and test sets, using KMeans clustering with 6 clusters.

In [None]:
from sklearn.cluster import KMeans

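# fix the seed (same value as in the silhouette analysis) for reproducible clusters
kmeans = KMeans(n_clusters=6, random_state=10)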
kmeans.fit(X)

clusters = kmeans.predict(X)
centroids = kmeans.cluster_centers_

df_train['cluster'] = clusters

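# subset of the test set for cluster assignment, scaled with the statistics
# of the training set (a scaler re-fitted on the test set would shift the
# space in which KMeans was fitted)
X_test = df_test.filter(clustering_cols, axis=1)
scaler = StandardScaler().fit(df_train.filter(clustering_cols, axis=1))
X_test = scaler.transform(X_test)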

clusters_test = kmeans.predict(X_test)
df_test['cluster'] = clusters_test

# drop id column
df_train = df_train.drop(['id'], axis=1)
df_test = df_test.drop(['id'], axis=1)

df_train.groupby('cluster').mean().reset_index()

As the variation of the feature means across the clusters of the training set shows, the clusters appear to be statistically distinct and provide a meaningful grouping of the training records.
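
A table of group means does not by itself test significance. One possible check is a one-way ANOVA via `scipy.stats.f_oneway`, sketched here on synthetic data (the group means, sizes, and variable names are assumptions for illustration, not values from the competition data):

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)

# Hypothetical values of one column within three clusters (synthetic data)
groups = [rng.normal(loc=mu, scale=1.0, size=200) for mu in (0.0, 0.5, 1.0)]

# One-way ANOVA: the null hypothesis is that all group means are equal
result = f_oneway(*groups)
print("F statistic:", result.statistic, "p-value:", result.pvalue)
```

Such a test is most informative for a column that was not used to form the clusters (for example *target*), because *cont2* and *cont14* differ across clusters by construction. With samples as large as the competition data, even small differences can yield very small p-values, so the size of the differences matters as well.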

Let's count the number of records in each cluster of the training set.

In [None]:
df_count = df_train.filter(['cluster', 'cont1'], axis=1)
df_count.groupby('cluster').count().reset_index()

Now, we are going to check whether the clustering calculated above applies equally well to the test set.

In [None]:
df_test.groupby('cluster').mean().reset_index()

As the variation of the feature means across the clusters of the test set shows, the clusters appear to be statistically distinct here as well.

Let's count the number of records in each cluster of the testing set.

In [None]:
df_count = df_test.filter(['cluster', 'cont1'], axis=1)
df_count.groupby('cluster').count().reset_index()
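
One way to keep the cluster assignments of the training and test sets comparable is to fit both the scaler and KMeans on the training data only and then apply the fitted steps to the test data. The sketch below does this with a scikit-learn pipeline on synthetic two-dimensional data. The arrays `train_2d` and `test_2d` are hypothetical stand-ins for the *cont2*/*cont14* columns, and the two-cluster setup is an assumption chosen to keep the example small.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for the two clustering columns: two tight blobs
train_2d = np.vstack([rng.normal(0.0, 0.1, size=(50, 2)),
                      rng.normal(3.0, 0.1, size=(50, 2))])
test_2d = np.vstack([rng.normal(0.0, 0.1, size=(10, 2)),
                     rng.normal(3.0, 0.1, size=(10, 2))])

# The scaler and KMeans are both fitted on the training data only
pipe = make_pipeline(StandardScaler(),
                     KMeans(n_clusters=2, n_init=10, random_state=10))
train_labels = pipe.fit_predict(train_2d)

# predict() reuses the training mean/std and the fitted centroids
test_labels = pipe.predict(test_2d)
print(np.bincount(train_labels), np.bincount(test_labels))
```

Bundling the steps in a pipeline makes it harder to scale the test set with its own statistics by accident.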

# References

You can find more theory on the methods and techniques used in these experiments at the links below:

- Selecting the number of clusters with silhouette analysis on KMeans clustering - https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html
- Mukesh Chaudhary, Silhouette Analysis in K-means Clustering - https://medium.com/@cmukesh8688/silhouette-analysis-in-k-means-clustering-cefa9a7ad111

In [None]:
print('We are done. That is all, folks!')
finish_time = dt.datetime.now()
print("Finished at ", finish_time)
elapsed = finish_time - start_time
print("Elapsed time: ", elapsed)