<a class="anchor" id="0"></a>

# Clustering the [Heart Disease UCI](https://www.kaggle.com/ronitf/heart-disease-uci) dataset with 12 methods
### Methods with automatic determination of the number of clusters:
* MeanShift
* DBSCAN
* OPTICS
* AffinityPropagation

### Methods that require the number of clusters as an input parameter:
* KMeans
* MiniBatchKMeans
* AgglomerativeClustering_ward
* AgglomerativeClustering_average
* AgglomerativeClustering_complete
* Birch
* GaussianMixture
* SpectralClustering
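
The practical difference between the two groups can be seen on synthetic data. The sketch below is only an illustration and not part of this notebook's pipeline: the blob centers, `eps`, and `min_samples` are assumed values chosen so that the clusters are well separated.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs

# Synthetic data (assumption): three tight, well-separated 2-D blobs
centers = [(0, 0), (10, 0), (0, 10)]
X_demo, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.3, random_state=0)

# DBSCAN receives no cluster count; it derives one from point density
dbscan_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X_demo)
n_found = len(set(dbscan_labels) - {-1})  # -1 marks noise points, not a cluster

# KMeans returns the number of clusters it is asked for
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)
n_requested = len(set(kmeans_labels))

print(f"DBSCAN found {n_found} clusters; KMeans returned {n_requested}")
```

On real data the density-based result depends strongly on `eps`, so the count it reports should be read as a property of the chosen parameters, not of the data alone.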

## Acknowledgements
* [Heart Disease - Automatic AdvEDA & FE & 20 models](https://www.kaggle.com/vbmokin/heart-disease-automatic-adveda-fe-20-models)
* [Titanic Top 3% : cluster analysis](https://www.kaggle.com/vbmokin/titanic-top-3-cluster-analysis)
* [Clustering & Visualization of Clusters using PCA](https://www.kaggle.com/sabanasimbutt/clustering-visualization-of-clusters-using-pca)
* [scikit-learn: Comparing different clustering algorithms on toy datasets](https://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html#sphx-glr-auto-examples-cluster-plot-cluster-comparison-py)

<a class="anchor" id="0.1"></a>

## Table of Contents

1. [Import libraries](#1)
1. [Download datasets](#2)
1. [EDA & FE](#3)
1. [Clustering](#4)
1. [Conclusion](#5)

## 1. Import libraries <a class="anchor" id="1"></a>

[Back to Table of Contents](#0.1)

In [None]:
import os
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

import pandas_profiling as pp

from sklearn import cluster, mixture
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, DBSCAN, OPTICS
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import kneighbors_graph
from itertools import cycle, islice

import plotly as py
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

import warnings
warnings.simplefilter('ignore')

# Use the fully qualified option name; the short form is ambiguous in recent pandas
pd.set_option('display.max_columns', 200)

## 2. Download datasets <a class="anchor" id="2"></a>

[Back to Table of Contents](#0.1)

In [None]:
data = pd.read_csv("../input/heart-disease-uci/heart.csv")
data = data.drop_duplicates().reset_index(drop=True)

In [None]:
data.head(3)

## 3. EDA & FE <a class="anchor" id="3"></a>

[Back to Table of Contents](#0.1)

In [None]:
data.describe()

In [None]:
# Data format optimization: downcast each numeric column to the smallest dtype that holds its range
def reduce_mem_usage(df, verbose=True):
    numerics = ['int8', 'int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose:
        print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(
            end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df

data = reduce_mem_usage(data)

In [None]:
data.head(3)

In [None]:
data.info()

## 4. Clustering <a class="anchor" id="4"></a>

[Back to Table of Contents](#0.1)

In [None]:
# Thanks to https://www.kaggle.com/vbmokin/titanic-top-3-cluster-analysis

def generate_clustering_algorithms(Z, n_clusters, m):
    # Build the clustering estimator named by m. Supported values of m:
    # 'MeanShift', 'KMeans', 'MiniBatchKMeans', 'AgglomerativeClustering_ward',
    # 'SpectralClustering', 'DBSCAN', 'OPTICS', 'AffinityPropagation',
    # 'AgglomerativeClustering_average', 'AgglomerativeClustering_complete',
    # 'Birch', 'GaussianMixture'
    
    # Thanks to: https://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html#sphx-glr-auto-examples-cluster-plot-cluster-comparison-py
    params = {'quantile': .2,
              'eps': .3,
              'damping': .9,
              'preference': -200,
              'n_neighbors': 10,
              'n_clusters': n_clusters,
              'min_samples': 3,
              'xi': 0.05,
              'min_cluster_size': 0.05}
    
    # estimate bandwidth for mean shift
    bandwidth = cluster.estimate_bandwidth(Z, quantile=params['quantile'])

    # connectivity matrix for the structured agglomerative methods (ward, average, complete)
    connectivity = kneighbors_graph(
        Z, n_neighbors=params['n_neighbors'], include_self=False)
    
    # make connectivity symmetric
    connectivity = 0.5 * (connectivity + connectivity.T)

    # ============
    # Create cluster objects
    # ============
    if m == 'MeanShift':
        cl = cluster.MeanShift(bandwidth=bandwidth, bin_seeding=True)
    elif m == 'KMeans':
        cl = cluster.KMeans(n_clusters=n_clusters, random_state = 1000)
    elif m == 'MiniBatchKMeans':
        cl = cluster.MiniBatchKMeans(n_clusters=n_clusters)
    elif m == 'AgglomerativeClustering_ward':
        cl = cluster.AgglomerativeClustering(n_clusters=n_clusters, linkage='ward',
                                    connectivity=connectivity)
    elif m == 'SpectralClustering':
        cl = cluster.SpectralClustering(n_clusters=n_clusters, eigen_solver='arpack',
                                        affinity="nearest_neighbors")
    elif m == 'DBSCAN':
        cl = cluster.DBSCAN(eps=params['eps'])
    elif m == 'OPTICS':
        cl = cluster.OPTICS(min_samples=params['min_samples'],
                            xi=params['xi'],
                            min_cluster_size=params['min_cluster_size'])
    elif m == 'AffinityPropagation':
        cl = cluster.AffinityPropagation(damping=params['damping'])
    elif m == 'AgglomerativeClustering_average':
        # scikit-learn >= 1.2 names this parameter `metric`; older releases call it `affinity`
        cl = cluster.AgglomerativeClustering(linkage="average", metric="cityblock",
                    n_clusters=params['n_clusters'], connectivity=connectivity)
    elif m == 'AgglomerativeClustering_complete':
        cl = cluster.AgglomerativeClustering(linkage="complete", metric="cityblock",
                    n_clusters=params['n_clusters'], connectivity=connectivity)
    elif m == 'Birch':
        cl = cluster.Birch(n_clusters=params['n_clusters'])
    elif m == 'GaussianMixture':
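        cl = mixture.GaussianMixture(n_components=n_clusters, covariance_type='full')
    else:
        # Fail early instead of hitting an unbound variable at the return statement
        raise ValueError(f"Unknown clustering method: {m}")

    return cl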

In [None]:
def clustering_df(X, n, m, output_hist):
    
    # Standardization
    X_columns = X.columns
    scaler = StandardScaler()
    scaler.fit(X)
    X = pd.DataFrame(scaler.transform(X), columns = X_columns)
    cl = generate_clustering_algorithms(X, n, m)
    cl.fit(X)
    if hasattr(cl, 'labels_'):
        # np.int was removed in NumPy 1.24; the builtin int is equivalent here
        labels = cl.labels_.astype(int)
    else:
        labels = cl.predict(X) 
    clusters=pd.concat([X, pd.DataFrame({'cluster':labels})], axis=1)
    
    # Inverse Standardization
    X_inv = pd.DataFrame(scaler.inverse_transform(X), columns = X_columns)    
    clusters_inv=pd.concat([X_inv, pd.DataFrame({'cluster':labels})], axis=1)
    
    # Number of points in clusters
    print("Number of points in clusters:\n", clusters['cluster'].value_counts())
    
    # Data in clusters - thanks to https://www.kaggle.com/sabanasimbutt/clustering-visualization-of-clusters-using-pca    
    if output_hist:
        # One grid of histograms per feature, with one panel per cluster
        for c in X_columns:
            grid = sns.FacetGrid(clusters_inv, col='cluster')
            grid.map(plt.hist, c)
        
    return clusters, clusters_inv
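
`clustering_df` standardizes the features before fitting, since most of these methods rely on distances and would otherwise be dominated by columns with large numeric ranges. A minimal sketch of that round trip on made-up values (the column names and numbers here are illustrative assumptions, not taken from the dataset):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame (assumption): two features on very different scales
toy = pd.DataFrame({'age': [40.0, 50.0, 60.0, 70.0],
                    'chol': [180.0, 240.0, 300.0, 360.0]})

scaler = StandardScaler()
toy_scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns)

# After scaling, every column has mean 0 and population standard deviation 1
print(toy_scaled.mean().round(6).tolist())
print(toy_scaled.std(ddof=0).round(6).tolist())

# inverse_transform maps the scaled values back to the original units
toy_restored = pd.DataFrame(scaler.inverse_transform(toy_scaled), columns=toy.columns)
```

Keeping both versions, as the function does, lets the clustering run on scaled values while the histograms are read in the original units.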

### Methods with automatic determination of the number of clusters:
* MeanShift
* DBSCAN
* OPTICS
* AffinityPropagation

### Methods that require the number of clusters as an input parameter:
* KMeans
* MiniBatchKMeans
* AgglomerativeClustering_ward
* AgglomerativeClustering_average
* AgglomerativeClustering_complete
* Birch
* GaussianMixture
* SpectralClustering

In [None]:
# All 12 methods
methods_all = ['KMeans', 'MiniBatchKMeans', 'MeanShift', 
               'DBSCAN', 'OPTICS', 
               'AffinityPropagation',               
               'AgglomerativeClustering_ward',
               'AgglomerativeClustering_average',
               'AgglomerativeClustering_complete',
               'Birch', 
               'GaussianMixture',
               'SpectralClustering'
              ]

In [None]:
# Default number of clusters for the methods that require it as a parameter
n_default = 6
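
The value 6 is fixed by hand here. One common way to check such a choice is the silhouette score, which is higher when clusters are compact and well separated. The following sketch is not part of the original analysis: it uses synthetic blobs (an assumption) instead of the heart data, and the helper name `best_k_by_silhouette` is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(X, k_values, random_state=0):
    """Return (best_k, scores), where scores maps each k to its mean silhouette."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Synthetic data (assumption): four well-separated blobs
centers = [(0, 0), (10, 0), (0, 10), (10, 10)]
X_demo, _ = make_blobs(n_samples=400, centers=centers, cluster_std=0.5, random_state=0)

best_k, scores = best_k_by_silhouette(X_demo, range(2, 8))
print(best_k, {k: round(float(s), 3) for k, s in scores.items()})
```

The same helper could in principle be applied to the standardized heart data, although real data rarely gives as clear a maximum as these blobs.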

In [None]:
def plot_draw(X, title, m):
    # Drawing a plot with clusters on the plane (using PCA transformation)
    # Thanks to https://www.kaggle.com/sabanasimbutt/clustering-visualization-of-clusters-using-pca
    
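    # Use only the feature columns, so the cluster labels do not influence the projection
    dist = 1 - cosine_similarity(X.drop(columns='cluster'))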
    
    # PCA transform
    pca = PCA(2)
    pca.fit(dist)
    X_PCA = pca.transform(dist)
    
    # Assign one distinct color and legend name to each cluster label
    # (endpoint=False because the hsv colormap is red at both 0 and 1)
    hsv = plt.get_cmap('hsv')
    cluster_ids = sorted(X['cluster'].unique())
    colors = list(hsv(np.linspace(0, 1, len(cluster_ids), endpoint=False)))
    names_dict = {c: str(c) for c in cluster_ids}
    colors_dict = dict(zip(cluster_ids, colors))
    
    # Visualization
    x, y = X_PCA[:, 0], X_PCA[:, 1]

    df = pd.DataFrame({'x': x, 'y':y, 'label':X['cluster'].tolist()}) 
    groups = df.groupby('label')

    fig, ax = plt.subplots(figsize=(12, 8)) 

    for name, group in groups:
        ax.plot(group.x, group.y, marker='o', linestyle='', ms=10,
                color=colors_dict[name],
                label=names_dict[name], 
                mec='none')
    ax.set_aspect('auto')
    # Hide ticks and tick labels: the PCA axes have no direct physical meaning
    ax.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
    ax.tick_params(axis='y', which='both', left=False, right=False, labelleft=False)

    ax.legend(loc='upper right')
    ax.set_title(f"{title} by method {m}")
    plt.show()
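
`plot_draw` projects the cosine distance matrix onto two principal components. How much of the structure survives that projection can be checked with `explained_variance_ratio_`. A small sketch on random data (an assumption, used only to keep the example self-contained):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 5))  # 50 samples, 5 features (synthetic)

# Cosine distance between every pair of samples: shape (50, 50), zero diagonal
dist = 1 - cosine_similarity(X_demo)

# Each row of dist is treated as a feature vector and reduced to 2 components
pca = PCA(n_components=2)
coords = pca.fit_transform(dist)

# Share of the variance of dist retained by the 2-D view
retained = pca.explained_variance_ratio_.sum()
print(coords.shape, round(float(retained), 3))
```

A low retained share suggests that clusters which overlap in the plot may still be separated in the full feature space.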

In [None]:
res = dict(zip(methods_all, [False]*len(methods_all)))
n_clust = dict(zip(methods_all, [1]*len(methods_all)))
for method in methods_all:
    print(f"Method - {method}")
    Y, Y_inv = clustering_df(data.copy(), n_default, method, True)
    
    # If the number of clusters is less than 2, then the clustering is not successful
    n_cl = len(Y['cluster'].value_counts())
    if n_cl > 1:
        res[method] = True
        n_clust[method] = n_cl
        plot_draw(Y, "Data clustering", method)
    else:
        print('Clustering is not successful because all data is in one cluster!\n')
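
The check above counts every distinct label. In scikit-learn, DBSCAN and OPTICS assign the label -1 to noise points, so a result with labels {-1, 0} passes `n_cl > 1` although only one real cluster was found. A stricter count could look like the sketch below; `count_real_clusters` is a hypothetical helper, not part of this notebook.

```python
import numpy as np

def count_real_clusters(labels, noise_label=-1):
    """Count distinct cluster labels, ignoring the label used for noise."""
    unique = set(np.asarray(labels).tolist())
    unique.discard(noise_label)
    return len(unique)

print(count_real_clusters([-1, -1, 0, 0, 0]))  # one cluster plus noise
print(count_real_clusters([0, 0, 1, 1, 2]))    # three clusters, no noise
```

Passing `Y['cluster']` to such a helper in place of `len(Y['cluster'].value_counts())` would be one way to apply it in the loop.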

## 5. Conclusion <a class="anchor" id="5"></a>

[Back to Table of Contents](#0.1)

In [None]:
# Results: methods that split the data into more than one cluster (reported as "optimal" below)
methods_bad = []
print('Optimal clustering methods:\n')
for (k, v) in res.items():
    if v:
        print(f"- {k} with number of clusters = {n_clust[k]}")
    else: 
        methods_bad.append(k)

In [None]:
# Results: methods in which all data are in one cluster
if len(methods_bad) > 0:
    print('Methods in which all data are in one cluster:\n')
    for method in methods_bad:
        print(f'- {method}')

I hope you find this notebook useful and enjoyable.

Your comments and feedback are most welcome.

[Go to Top](#0)