# Clustering practical course 1

The aim of this practical session is to apply the clustering algorithms seen in class (K-Means, hierarchical clustering and spectral clustering) in order to cluster patients into two groups: patients with chronic kidney disease and patients without it.

**Reminder: Clustering algorithms are unsupervised methods and hence we do not use the labels when applying them. However, we can use them afterwards as an evaluation step.**

**Useful links:**

**Before using any method, it is important to know how and when to use it. For this, you can refer to the extra notebooks: KMeansClusteringwithPython.ipynb, ClusteringPython.ipynb and SpectralClustering.ipynb.**

**In addition, there is a Scikit Learn user guide that you can find here: https://scikit-learn.org/stable/user_guide.html and a Pandas user guide that you can find here: https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html.**

## Data Description

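We will use a data frame with 400 instances described by the following 24 variables (11 numerical, 13 nominal).

Attribute information (specific gravity, albumin and sugar are listed as nominal but take numeric values; they are handled with the numerical variables in the Data Preparation section):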

1.Age(numerical) 
age in years 

2.Blood Pressure(numerical) 
bp in mm/Hg 

3.Specific Gravity(nominal??) 
sg - (1.005,1.010,1.015,1.020,1.025) 

4.Albumin(nominal??) 
al - (0,1,2,3,4,5) 

5.Sugar(nominal??) 
su - (0,1,2,3,4,5) 

6.Red Blood Cells(nominal) 
rbc - (normal,abnormal) 

7.Pus Cell (nominal) 
pc - (normal,abnormal) 

8.Pus Cell clumps(nominal) 
pcc - (present,notpresent) 

9.Bacteria(nominal) 
ba - (present,notpresent) 

10.Blood Glucose Random(numerical)	
bgr in mgs/dl 

11.Blood Urea(numerical)	
bu in mgs/dl

12.Serum Creatinine(numerical)	
sc in mgs/dl

13.Sodium(numerical) 
sod in mEq/L 

14.Potassium(numerical)	
pot in mEq/L 

15.Hemoglobin(numerical) 
hemo in gms 

16.Packed Cell Volume(numerical) 
pcv 

17.White Blood Cell Count(numerical) 
wc in cells/cumm 

18.Red Blood Cell Count(numerical)	
rc in millions/cmm 

19.Hypertension(nominal)	
htn - (yes,no) 

20.Diabetes Mellitus(nominal)	
dm - (yes,no) 

21.Coronary Artery Disease(nominal) 
cad - (yes,no) 

22.Appetite(nominal)	
appet - (good,poor) 

23.Pedal Edema(nominal) 
pe - (yes,no)	

24.Anemia(nominal) 
ane - (yes,no) 



## Mount Drive 

The following code is to be executed in case you are using Google Colab **ONLY**.

In [None]:
import os
from google.colab import drive

drive.mount('/content/drive')

# Change to the directory to where your files are
os.chdir('drive/My Drive/')  

## Import libraries

**First, we import the libraries we usually use for data analysis: numpy, pandas, matplotlib and seaborn.**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## Import the data

**Read in the "chronic_kidney_disease" file using read_csv.**

**After importing the data, we need to check what it looks like and what its properties are. Use the pandas head, describe and info methods.**
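A minimal sketch of these two steps. The file name and extension (`chronic_kidney_disease.csv`) are an assumption, and the three-row inline sample with made-up values only stands in for the real file so that the cell runs on its own. The name `df_demo` is hypothetical; in the notebook the result would be stored in `df`.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for the real file (made-up values).
sample_csv = io.StringIO(
    "age,bp,hemo,htn,class\n"
    "48,80,15.4,yes,ckd\n"
    "62,,9.6,no,ckd\n"
    "35,70,14.1,no,notckd\n"
)

# With the real data this would be something like:
#   df = pd.read_csv("chronic_kidney_disease.csv")   # file name assumed
df_demo = pd.read_csv(sample_csv)

print(df_demo.head())      # first rows of the table
print(df_demo.describe())  # summary statistics of the numerical columns
df_demo.info()             # column types and non-null counts
```

The non-null counts printed by `info` are a quick way to see which columns contain missing values.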

## Data Preparation

**There are multiple things we need to do before being able to apply any clustering algorithm:**

**1- Separate the labels from the variables**

**2- Separate the numerical from the categorical variables**

**3- Replace the missing values from both numerical and categorical variables**

In [None]:
# 1- Separate the labels from the variables
data = df.drop("class", axis=1)
labels = df["class"]

In [None]:
# 2-Separate the numerical from the categorical variables

data_num = data[["age", "bp","sg", "al", "su", "bgr", "bu", "sc", "sod", "pot", "hemo","pcv", "wbcc", "rbcc"]]

# (Do the same for the categorical attributes)

# Help: the nominal attributes are: "rbc","pc","pcc","ba","htn","dm", "cad", "appet", "pe", "ane"


In [None]:
# 3- Replace the missing values from both numerical and categorical variables
# In the case of numerical variables, we will replace the missing values by the median.
# In the case of nominal variables, we will first transform and represent them into numerical values 
# using factorize (search the pandas user guide for more info ) and then we will replace the missing 
# values by the most frequent one. 

# Numerical: 
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")
imputer.fit(data_num)

# (Do the same for the categorical attributes)
# Use the scikit learn user guide to find the correct strategy to use.
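One possible way to carry out step 3, sketched on toy frames with made-up values. The names `num_demo`, `cat_demo` and the `*_tr_demo` results are hypothetical stand-ins for `data_num`, the categorical frame and their imputed versions. Note that `pd.factorize` encodes missing values as -1, so the sketch turns them back into NaN before imputing.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frames standing in for the numerical and nominal variables.
num_demo = pd.DataFrame({"age": [48.0, np.nan, 35.0, 62.0],
                         "bp": [80.0, 70.0, np.nan, 90.0]})
cat_demo = pd.DataFrame({"htn": ["yes", "no", "yes", np.nan],
                         "appet": ["good", np.nan, "poor", "good"]})

# Numerical: fit_transform learns the medians and fills the gaps in one call.
num_imputer = SimpleImputer(strategy="median")
num_tr_demo = pd.DataFrame(num_imputer.fit_transform(num_demo),
                           columns=num_demo.columns, index=num_demo.index)

# Nominal: factorize each column into integer codes. Missing values get
# the code -1, so they are converted back to NaN before imputing.
cat_codes = pd.DataFrame(
    {col: pd.factorize(cat_demo[col])[0] for col in cat_demo.columns},
    index=cat_demo.index,
).replace(-1, np.nan)

# Replace the missing codes by the most frequent code of each column.
cat_imputer = SimpleImputer(strategy="most_frequent")
cat_tr_demo = pd.DataFrame(cat_imputer.fit_transform(cat_codes),
                           columns=cat_demo.columns, index=cat_demo.index)

print(num_tr_demo)
print(cat_tr_demo)
```

Wrapping the imputer output in a DataFrame keeps the column names, which makes the later concatenation and the plots easier to read.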




In [3]:
# Concatenate the two datasets again into one dataframe called: data_tr
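A minimal sketch of the concatenation, assuming both imputed parts are DataFrames that share the same index. The two small frames and the name `data_tr_demo` are hypothetical; in the notebook the result would be called `data_tr`.

```python
import pandas as pd

# Hypothetical imputed numerical and nominal parts (made-up values).
num_tr_demo = pd.DataFrame({"age": [48.0, 41.5], "bp": [80.0, 75.0]})
cat_tr_demo = pd.DataFrame({"htn": [0.0, 1.0], "appet": [0.0, 0.0]})

# axis=1 places the columns side by side. Rows are aligned on the index,
# so both frames must share the same index.
data_tr_demo = pd.concat([num_tr_demo, cat_tr_demo], axis=1)
print(data_tr_demo)
```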



## Data Visualization
It's time to create some data visualizations! 
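As a starting point, here is a small sketch of two common plots: histograms of each variable and a scatter plot of two variables. The data is randomly generated for illustration only (the column names are borrowed from the data description, the values are not real); with the prepared data you would plot columns of `data_tr` instead.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical random data, for illustration only.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "hemo": rng.normal(12.5, 2.5, 100),
    "sc": rng.lognormal(0.5, 0.8, 100),
    "bgr": rng.normal(140, 50, 100),
})

# Histograms show the distribution and the scale of each variable.
axes = demo.hist(bins=20, figsize=(10, 3), layout=(1, 3))

# A scatter plot of two variables gives a first impression of possible groups.
fig, ax = plt.subplots(figsize=(5, 4))
ax.scatter(demo["hemo"], demo["sc"], s=15, alpha=0.7)
ax.set_xlabel("hemo")
ax.set_ylabel("sc")
plt.show()
```

Plots like these also reveal variables with very different scales, which matters for distance-based methods such as K-Means.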

## Applying K-Means and Hierarchical clustering

**Before applying the silhouette method to find the appropriate number of clusters, try a simple run of both K-Means and hierarchical clustering in order to better understand how they work.**

In [None]:
# K-Means


# Hierarchical clustering
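A minimal sketch of such a first run, on a hypothetical toy array with two obvious groups (`X_demo` is made up; in the notebook you would pass `data_tr`). The choices `n_clusters=2`, `n_init=10` and `random_state=0` are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

# Hypothetical toy data: two well-separated groups of points.
X_demo = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

# K-Means: fit_predict fits the model and returns one cluster id per sample.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans_labels = kmeans.fit_predict(X_demo)

# Hierarchical (agglomerative) clustering with the default settings.
hier = AgglomerativeClustering(n_clusters=2)
hier_labels = hier.fit_predict(X_demo)

print("K-Means:     ", kmeans_labels)
print("Hierarchical:", hier_labels)
```

The cluster ids themselves are arbitrary: the two methods may agree on the groups while numbering them differently.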

## Number of clusters

**When clustering, it is important to choose a suitable number of clusters for the data at hand. Rather than judging each candidate number by hand, we will use the silhouette method (with both K-Means and hierarchical clustering) to select it.**

In [None]:
# We will start with K-Means.

# First, we define the range in which we will search for our number of clusters.
range_n_clusters = range(2, 5)

# Import KMeans and the silhouette functions from scikit-learn
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.cm as cm

# X is the prepared data as a NumPy array (data_tr is the dataframe built
# in the Data Preparation section)
X = np.asarray(data_tr)

for n_clusters in range_n_clusters:
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1 to 1, but the x-axis is
    # limited to [-0.1, 1] here
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])

    # Initialize the clusterer with n_clusters value and a random generator
    # seed of 10 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, n_init=10, random_state=10)

    # Apply K-Means and save the predicted labels (the cluster to which each sample was assigned)
    cluster_labels = clusterer.fit_predict(X)

    # Compute silhouette_avg, the variable containing the average silhouette,
    # using the silhouette_score function.
    silhouette_avg = silhouette_score(X, cluster_labels)

    # Print the average obtained using the silhouette_score function
    print("For n_clusters =", n_clusters,
          "the average silhouette score is:", silhouette_avg)

    # Compute sample_silhouette_values, the variable containing the silhouette
    # score of each sample, using the silhouette_samples function
    sample_silhouette_values = silhouette_samples(X, cluster_labels)

    
    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")
 

    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd Plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(X[:, 0], X[:, 1], marker='.', s=30, lw=0, alpha=0.7,
                c=colors, edgecolor='k')

    # Labeling the clusters
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(centers[:, 0], centers[:, 1], marker='o',
                c="white", alpha=1, s=200, edgecolor='k')

    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1,
                    s=50, edgecolor='k')

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")

    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')

plt.show()


In [None]:
# We will now do the same for the Hierarchical clustering algorithm. 
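A compact sketch of the same search for hierarchical clustering, without the plots. It assumes scikit-learn's `AgglomerativeClustering` with its default settings and uses a hypothetical toy array `X_demo` made of three compact groups; in the notebook you would use the prepared data instead.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Hypothetical toy data: three compact, well-separated groups of points.
rng = np.random.default_rng(0)
group_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(20, 2))
                    for c in group_centers])

# Average silhouette score for each candidate number of clusters.
scores = {}
for n_clusters in range(2, 5):
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(X_demo)
    scores[n_clusters] = silhouette_score(X_demo, labels)
    print("n_clusters =", n_clusters, "average silhouette =", scores[n_clusters])

# The candidate with the highest average silhouette is retained.
best_n = max(scores, key=scores.get)
print("Selected number of clusters:", best_n)
```

Hierarchical clustering has no cluster centers, so the second subplot of the K-Means cell would need to drop the lines that use `cluster_centers_`.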

## K Means Cluster Creation

**Create an instance of a K Means model with the selected number of clusters.**

**Fit the model to all the data except for the class label.**

**What are the cluster center vectors?**
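A short sketch of these three steps on a hypothetical four-point array (`X_demo` is made up, and `n_clusters=2` stands for whatever number you selected above). After fitting, the centers are available in the `cluster_centers_` attribute.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two pairs of points.
X_demo = np.array([[1.0, 1.0], [1.0, 3.0], [9.0, 9.0], [9.0, 11.0]])

model = KMeans(n_clusters=2, n_init=10, random_state=10)
model.fit(X_demo)

# One row per cluster, one column per feature: each row is the mean of the
# samples assigned to that cluster.
print(model.cluster_centers_)

# Cluster id assigned to each sample.
print(model.labels_)
```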

## Hierarchical clustering
**Create an instance of a hierarchical clustering model with the selected number of clusters.**

**Visualize the resulting assignment vector**

**What are the properties of the model? (Which distance was used, which linkage method, etc)**
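A possible sketch for these questions, assuming scikit-learn's `AgglomerativeClustering` and a hypothetical toy array `X_demo`. The assignment vector is stored in `labels_`, and `get_params()` lists the settings of the model, including the linkage method and the distance.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical toy data: two well-separated groups of points.
X_demo = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

hier = AgglomerativeClustering(n_clusters=2)
hier.fit(X_demo)

# Assignment vector: one cluster id per sample.
print(hier.labels_)

# Settings of the model (linkage method, distance, number of clusters, ...).
for name, value in hier.get_params().items():
    print(name, "=", value)
```

With the default settings the linkage is `ward`, which works on Euclidean distances.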

## Evaluation

There is no perfect way to evaluate clustering if you don't have the labels. In this practical course we take advantage of the availability of labels to evaluate our clusters.

**For each model, create a new column in df, named 'Cluster_model', containing the cluster assigned to each sample by that model.**

**Create a confusion matrix and classification report to see how well the clustering algorithms worked without being given any labels.**

In [None]:
from sklearn.metrics import confusion_matrix,classification_report

#K Means


# Hierarchical clustering
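A sketch of the evaluation on made-up values. The class names `ckd` and `notckd` are an assumption about how the labels are written in the file, and `cluster_ids` stands for the output of any of the models. Since cluster ids are arbitrary, the sketch first checks whether the numbering has to be swapped before building the confusion matrix.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

# Hypothetical true classes and cluster assignments (made-up values).
true_classes = np.array(["ckd", "ckd", "ckd", "notckd", "notckd", "notckd"])
cluster_ids = np.array([0, 0, 1, 1, 1, 1])

# Encode the classes as 0/1 so they can be compared with the cluster ids.
y_true = (true_classes == "ckd").astype(int)

# Cluster ids are arbitrary: if fewer than half of the samples agree with
# the classes, swap 0 and 1 (valid for two clusters only).
y_pred = cluster_ids.copy()
if (y_pred == y_true).mean() < 0.5:
    y_pred = 1 - y_pred

cm = confusion_matrix(y_true, y_pred)
print(cm)
print(classification_report(y_true, y_pred, target_names=["notckd", "ckd"]))
```

Rows of the confusion matrix correspond to the true classes and columns to the clusters after relabeling.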

## Spectral Clustering

**Next, we will compare Spectral Clustering and K-Means on the same data.** 

**Create an instance of a Spectral Clustering model**
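A minimal sketch, assuming scikit-learn's `SpectralClustering`. The two-moons data from `make_moons` is a hypothetical stand-in for the prepared data, and the nearest-neighbors affinity with `n_neighbors=10` is one illustrative choice among several.

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons

# Hypothetical toy data: two interleaving half circles.
X_demo, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# Build the similarity graph from the nearest neighbors of each sample,
# then cluster the samples in the spectral embedding of that graph.
spectral = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                              n_neighbors=10, random_state=0)
spectral_labels = spectral.fit_predict(X_demo)

print(spectral_labels[:10])
```

Unlike K-Means, spectral clustering has no `cluster_centers_` attribute, so the comparison is made on the assignment vectors.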

## Evaluation
**As with the previous algorithms, create a confusion matrix and compare the results with those of the previous methods.**