### **KMeans Clustering**
- As discussed previously, ML algorithms can be broadly classified into two categories: `Supervised` and `Unsupervised`.
    - When we have a "target" and need to learn a function "f" of the independent variables "X" to predict that target, the approach is called `Supervised`.
    - When we do not have a target, we try to understand the data by looking at the relations among the instances with respect to the given variables (which instances are similar, how instance a relates to instance b, etc.). This approach is called `Unsupervised`. Today we will learn another unsupervised algorithm called KMeans.
- KMeans is a distance-based iterative technique in which instances that are "closer" are "grouped" together to form a "`cluster`".
- This "closeness" is measured by a distance, by default the Euclidean distance.
- We need to specify in advance how many clusters we want.
- What is iterative in this case?
    - We specify the number of clusters we need. In the first iteration, the centroids (centres) of the clusters are picked at random (a centroid need not be a data point; it can be any other point as well). For example, if we need 3 clusters, 3 centroids are randomly picked.
    - The distance from each data point to each of these centroids is computed, and each point is assigned to the cluster whose centroid is closest. This is the "Assignment phase".
    - Once all points are assigned, a new centroid is computed for each cluster as the mean of its points along every dimension (for two points in 2D this is ((x1+x2)/2, (y1+y2)/2)). Remember this formula :)
    - Once the new centroids are computed, the assignment phase starts again: compute the distance between each data point and each new centroid, and assign the point to the closest cluster. After assignment, the cluster centroids are recomputed. This process continues until the centroids no longer change from the previous iteration.
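
The assignment and update phases described above can be sketched in plain NumPy. This is only an illustrative sketch, not scikit-learn's implementation: the function name `kmeans_sketch`, its parameters, and the toy data are assumptions made for this example, and the initial centroids are drawn from the data points themselves for simplicity.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Plain k-means loop on an (n_samples, n_features) array (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Assumption: initialise centroids by picking k distinct data points at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment phase: squared Euclidean distance of every point to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update phase: each centroid becomes the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop when the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy data: two well-separated groups in 2-D.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids = kmeans_sketch(X, k=2)
print(labels)
print(centroids)
```

The cluster ids themselves are arbitrary (which group is called 0 or 1 depends on the random start); only the grouping matters.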
    
    
 - **Getting Ideal K**
    - We choose k by assessing the `within sum of squares of distances (wss)`, i.e. the sum of squared distances of all points to the centroid of their cluster.
    - We experiment with k, starting with 1 (form 1 cluster and compute wss, then form 2 clusters and compute wss, then 3 clusters, and so on).
    - We plot wss against k (k on the x-axis and wss on the y-axis). This plot is known as a scree plot (also called an elbow plot). We inspect it for the "elbow", the value of k beyond which wss stops dropping sharply, to get the ideal k.
    - Another measure used to check whether the clustering is proper is the silhouette score.
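
As a rough illustration of the two measures above, the sketch below computes wss (`inertia_`) and the silhouette score for several values of k. It runs on synthetic stand-in data rather than the cereals dataset; the names `X_demo`, `wss_by_k` and `silhouette_by_k` and the range of k are assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (not the cereals dataset): three tight 2-D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

wss_by_k = {}
silhouette_by_k = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    wss_by_k[k] = km.inertia_      # within-cluster sum of squares
    if k >= 2:                     # silhouette is undefined for a single cluster
        silhouette_by_k[k] = silhouette_score(X_demo, km.labels_)

# Pick the k with the highest average silhouette score.
best_k = max(silhouette_by_k, key=silhouette_by_k.get)
print(wss_by_k)
print("k with the highest silhouette score:", best_k)
```

Unlike the scree plot, which is read by eye, the silhouette score gives a single number per k, so the two are often looked at together.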

 - **Clustering Stability**
    - Once the data points are segmented into clusters, we need to check cluster stability, i.e. if we run the same algorithm again, do the data points that were grouped into one cluster previously end up together again? If yes, the clustering is stable; otherwise it is not.
    
- **Interpretation of centroids**
    - The centroid of a cluster acts as a representative of that cluster. For example, if we are segmenting customers of retail stores based on attributes like average spending, items purchased, preferred mode of transaction, etc., then the centroid of a cluster represents the average behaviour (for each attribute) of the customers in that cluster.
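
When clustering is done on standardized data, the centroids are in standardized units and are easier to read once mapped back to the original scale. The sketch below shows one possible way to do that with `StandardScaler.inverse_transform`. The customer table, its column names and its values are invented purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical customer table; column names and values are invented.
customers = pd.DataFrame({
    "avg_spend":       [20.0, 22.0, 25.0, 200.0, 210.0, 190.0],
    "items_purchased": [2.0, 3.0, 2.0, 15.0, 14.0, 16.0],
})

scaler = StandardScaler()
customers_std = scaler.fit_transform(customers)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers_std)

# Centroids live in standardized space; map them back to the original units.
centroids_original = pd.DataFrame(
    scaler.inverse_transform(km.cluster_centers_),
    columns=customers.columns,
)
print(centroids_original)
```

Each row of `centroids_original` can then be read as the "average customer" of one cluster.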
    

### *Note on Distances*

- In the ideal case, when all the variables are numeric, Euclidean distance makes sense (if we want to use Euclidean).
- Other distance measures are available: for numeric variables we have **Euclidean, Manhattan, Chebyshev**, and for categorical variables we have **Hamming, Jaccard, etc.**
- If the data has mixed variables (both numeric and categorical), a *quick way to do clustering is by creating dummy variables for the categorical attributes*. This approach, applying Euclidean distance to dummy variables, may not be appreciated by some Data Science practitioners. Another approach for dealing with mixed variables is to use `Gower's distance`.
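
To make the distance measures listed above concrete, here is a small NumPy sketch. The vectors are invented for illustration, and Hamming is computed as the proportion of mismatching positions (one common convention; some texts use the raw count instead).

```python
import numpy as np

# Two numeric instances (values invented for illustration).
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))   # sqrt(9 + 16 + 0)
manhattan = np.sum(np.abs(a - b))           # 3 + 4 + 0
chebyshev = np.max(np.abs(a - b))           # max(3, 4, 0)

# Two categorical instances.
u = np.array(["red", "small", "round"])
v = np.array(["red", "large", "round"])
hamming = np.mean(u != v)                   # 1 mismatch out of 3 positions

# Two binary (presence/absence) instances.
p = np.array([1, 1, 0, 1], dtype=bool)
q = np.array([1, 0, 0, 1], dtype=bool)
jaccard = 1 - np.sum(p & q) / np.sum(p | q) # 1 - |intersection| / |union|

print(euclidean, manhattan, chebyshev, hamming, jaccard)
```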

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

In [None]:
import os
os.getcwd()

In [None]:
cereals = pd.read_csv("/home/divyas/Lab/Cereals.csv")  # read the dataset
cereals.head()

## Data Description

- `calories`: calories per serving
- `protein`: grams of protein
- `fat`: grams of fat
- `sodium`: milligrams of sodium
- `fiber`: grams of dietary fiber
- `carbo`: grams of complex carbohydrates
- `sugars`: grams of sugars
- `potass`: milligrams of potassium
- `vitamins`: vitamins and minerals - 0, 25, or 100, indicating the typical percentage of FDA recommended
- `shelf`: display shelf (1, 2, or 3, counting from the floor)
- `weight`: weight in ounces of one serving
- `cups`: number of cups in one serving
- `rating`: a rating of the cereals

## Dropping less relevant columns

In [None]:
# Drop the columns ['shelf','weight','cups','rating'] from the dataframe.
cereals.drop(['shelf','weight','cups','rating'], axis=1, inplace=True)




In [None]:
cereals.describe()

In [None]:
# Check for Missing Values (count of missing entries per column)
cereals.isnull().sum()



## Decoupling name label

In [None]:
labels = cereals['name']
cereals.drop(['name'], axis=1,inplace=True)

## Imputation

In [None]:
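# Impute the missing values in the dataframe using SimpleImputer and strategy='mean'.
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')
cereals = pd.DataFrame(imputer.fit_transform(cereals), columns=cereals.columns)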


## Standardization

In [None]:
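# Using StandardScaler, perform standardization on the dataset
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
cereals_std = pd.DataFrame(scaler.fit_transform(cereals), columns=cereals.columns)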


In [None]:
cereals_std.describe()

## Agglomerative Clustering
**Parameter description**

`n_clusters` : The number of clusters to find.

`linkage` : {“ward”, “complete”, “average”}

**ward** minimizes the variance of the clusters being merged.

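**complete** uses the maximum of the distances between all observations of the two sets.

**average** uses the average of the distances between each observation of the two sets.

`affinity` : {“euclidean”, “l1”, “l2”, “manhattan”, “cosine”}

Metric used to compute the linkage. With `linkage="ward"`, only “euclidean” is accepted. Newer scikit-learn releases name this parameter `metric` instead of `affinity`.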



In [None]:
from scipy.cluster.hierarchy import linkage, dendrogram
linkage_matrix = linkage(cereals_std, method='ward', metric='euclidean')

In [None]:
labelList = range(1, cereals_std.shape[0]+1 )
plt.figure(figsize=(12, 7)) 
dendrogram(linkage_matrix, labels=labelList)
plt.show()
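
The dendrogram is read by eye, but the same linkage matrix can also be cut into flat clusters programmatically. The sketch below uses SciPy's `fcluster` with `criterion="maxclust"` as one possible way to do this. It runs on synthetic stand-in data, and the names `X_demo`, `demo_linkage` and `flat_labels` are assumptions made for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic stand-in for the standardized data: three separated 2-D groups.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X_demo = np.vstack([c + rng.normal(scale=0.3, size=(10, 2)) for c in centers])

demo_linkage = linkage(X_demo, method="ward", metric="euclidean")

# Cut the tree so that at most 3 flat clusters remain (labels start at 1).
flat_labels = fcluster(demo_linkage, t=3, criterion="maxclust")
print(np.unique(flat_labels, return_counts=True))
```

On the notebook's data, the same call applied to `linkage_matrix` with `t=6` would give labels comparable to the 6-cluster solution that follows.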

## Implementing 6 clusters

In [None]:
from sklearn.cluster import AgglomerativeClustering

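# Ward linkage always uses Euclidean distance, so the metric is left at its
# default (the `affinity` argument is not accepted by newer scikit-learn).
clust = AgglomerativeClustering(n_clusters=6, linkage='ward')

# Cluster the standardized data, the same data used for the dendrogram
cluster_predictions = clust.fit_predict(cereals_std)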

result = pd.DataFrame({'Label':labels,
                       'Cluster':cluster_predictions})
result.head()

## K-Means Clustering
**Parameter description**

`n_clusters` : The number of clusters to find.

`n_init` : Number of times the k-means algorithm will be run with different centroid seeds. The final result will be the best output of the n_init consecutive runs in terms of inertia.

`max_iter` : Maximum number of iterations (recomputations of the cluster centroids) for a single run.

`n_jobs` : The number of jobs to use for the computation. This works by computing each of the n_init runs in parallel. (Older scikit-learn only; the parameter was removed from `KMeans` in scikit-learn 1.0.)

In [None]:
from sklearn.cluster import KMeans
wss = []
for k in range(2,15):
    km = KMeans(n_clusters=k)
    km.fit(cereals_std)
    wss.append(km.inertia_)

In [None]:
plt.figure(figsize=(12,7))
plt.plot(range(2,15), wss, 'bx--')
plt.xlabel('k')
plt.ylabel('wss')
plt.show()

## Kmeans with 6 clusters

In [None]:
from sklearn.cluster import KMeans
km = KMeans(n_clusters=6, random_state=4545, n_init=50)
km.fit(cereals_std)
kmeans_clusters = km.predict(cereals_std)

result = pd.DataFrame({"Label":labels, "KMeans Cluster":kmeans_clusters})
result.head()

## Cluster Characteristics

In [None]:
cereals_org = pd.read_csv("/home/divyas/Lab/Cereals.csv")
cereals_org.drop(['shelf','weight','cups','rating'], axis=1, inplace=True)
cereals_org['Cluster'] = result['KMeans Cluster']
# Average only the numeric columns; the string column 'name' cannot be averaged
cereals_org.groupby("Cluster").mean(numeric_only=True)

### Checking cluster stability

In [None]:
from sklearn.metrics import adjusted_rand_score
import numpy as np

In [None]:
indices=cereals_std.sample(frac=0.9,random_state=123).index
print(indices)

In [None]:
cereals_std_subset=cereals_std.iloc[indices,:]

In [None]:
kmeans_object = KMeans(n_clusters=5,n_init=30,max_iter=300,random_state=1000)
kmeans_object.fit(cereals_std)
clus1= kmeans_object.predict(cereals_std)

In [None]:
kmeans_object = KMeans(n_clusters=5,n_init=30,max_iter=300,random_state=1000)
kmeans_object.fit(cereals_std_subset)
clus2= kmeans_object.predict(cereals_std_subset)

In [None]:
print(len(clus1))
print(len(clus2))

In [None]:
clus1=clus1[indices]
print(len(clus1))

In [None]:
adjusted_rand_score(clus1,clus2)
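
To help read the score above: the adjusted Rand index compares groupings rather than cluster ids, so two runs that put the same points together score 1.0 even if the clusters are numbered differently, while unrelated groupings score close to 0 or below. A minimal sketch with invented label vectors:

```python
from sklearn.metrics import adjusted_rand_score

labels_a = [0, 0, 1, 1, 2, 2]
labels_b = [2, 2, 0, 0, 1, 1]   # same grouping as labels_a, different cluster ids
labels_c = [0, 1, 0, 1, 0, 1]   # a different grouping

same_grouping = adjusted_rand_score(labels_a, labels_b)
different_grouping = adjusted_rand_score(labels_a, labels_c)
print(same_grouping, different_grouping)
```

A value close to 1 between the full-data and subsample clusterings therefore suggests the clusters are stable.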