---
# Machine Learning Methods: K Means Clustering
---
We will use scikit-learn for clustering in the following example. Scikit-learn is the go-to package for machine learning in Python. It is built on top of the other packages we've discussed (e.g. NumPy, SciPy, and matplotlib).

---
## Introduction to K means
---
<b>K means clustering function</b><br>
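class sklearn.cluster.KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='auto', verbose=0, random_state=None, copy_x=True, n_jobs=1, algorithm='auto')<br>
This signature comes from an older scikit-learn release; the precompute_distances and n_jobs parameters were removed in version 1.0.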

#### Commonly Used Parameters
	
<b> n_clusters: </b> int, optional, default: 8 <br>
The number of clusters to form as well as the number of centroids to generate.<br>

<b> init: </b> {'k-means++', 'random' or an ndarray}, default: 'k-means++' <br>
Method for initialization.

'k-means++': selects initial cluster centers that are spread apart from one another, which typically speeds up convergence of k-means clustering.

'random': chooses k observations (rows) at random from the data as the initial centroids.

If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.
<br>
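
As a rough illustration of the ndarray option, the sketch below passes explicit starting centers to KMeans. The data set `X_demo` and the starting coordinates in `init_centers` are made up for this example and do not come from the Iris walkthrough. Because the starting centers are fixed, `n_init=1` is used so that only a single run is performed.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two well-separated blobs of 20 points each.
rng = np.random.RandomState(0)
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(20, 2)),
])

# Explicit initial centers, shape (n_clusters, n_features) = (2, 2).
init_centers = np.array([[0.0, 0.0], [5.0, 5.0]])

# With fixed starting centers there is nothing to re-seed, so run once.
km_fixed = KMeans(n_clusters=2, init=init_centers, n_init=1).fit(X_demo)

print(km_fixed.cluster_centers_)
print(km_fixed.labels_)
```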

<b> n_init: </b> int, default: 10 <br>
Number of times the k-means algorithm will be run with different centroid seeds. The final result is the best output of the n_init consecutive runs in terms of inertia.<br>

<b> max_iter: </b> int, default: 300 <br>
Maximum number of iterations of the k-means algorithm for a single run.<br>

<b> random_state: </b> int, RandomState instance or None, optional, default: None <br>
If int, random_state is the seed used by the random number generator. If a RandomState instance, random_state is the random number generator. If None, the random number generator is the RandomState instance used by np.random.<br>
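
The parameters above can be combined in a short reproducibility check. This is a minimal sketch on made-up data (`X_demo` is a hypothetical toy array, and the seed value 42 is arbitrary): two estimators built with the same integer random_state should produce the same clustering.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two blobs of 20 points each.
rng = np.random.RandomState(0)
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(20, 2)),
])

# Identical settings and an identical integer seed for both estimators.
settings = dict(n_clusters=2, init="k-means++", n_init=10, max_iter=300, random_state=42)
km_a = KMeans(**settings).fit(X_demo)
km_b = KMeans(**settings).fit(X_demo)

# Both fits start from the same random draws, so the labels should match.
same_labels = np.array_equal(km_a.labels_, km_b.labels_)
print(same_labels)
```

Leaving random_state as None instead draws from NumPy's global random state, so results can differ between runs unless that global state is seeded.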

#### Attributes
<b> cluster\_centers\_: </b> array, [n_clusters, n_features] <br>
Coordinates of cluster centers<br>

<b> labels_: </b> array, [n_samples] <br>
Label (cluster index) of each point<br>

<b> inertia_: </b> float <br>
Sum of squared distances of samples to their closest cluster center.<br>
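
To make the attributes concrete, the sketch below fits KMeans on the Iris measurements and recomputes the inertia by hand from labels_ and cluster_centers_. The variable names (`X_iris`, `km`, `manual_inertia`) and the seed are chosen for this illustration only.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans

X_iris = datasets.load_iris().data  # shape (150, 4)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_iris)

print(km.cluster_centers_.shape)  # (n_clusters, n_features)
print(km.labels_.shape)           # (n_samples,)

# Look up the center assigned to each sample, then sum the squared distances.
assigned_centers = km.cluster_centers_[km.labels_]
manual_inertia = float(((X_iris - assigned_centers) ** 2).sum())

# The two values should agree up to floating-point error.
print(manual_inertia, km.inertia_)
```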

The table below shows the common methods of the KMeans class:

| Method | Description         
| ------------- |:------------- 
|fit (X[,y])| Compute k-means clustering.
|fit_predict (X[,y])| Compute cluster centers and predict cluster index for each sample.
|fit_transform (X[,y])| Compute clustering and transform X to cluster-distance space.
|get_params ([deep])| Get parameters for this estimator.
|predict (X)| Predict the closest cluster each sample in X belongs to.
|score (X[,y])| Opposite of the value of X on the K-means objective (the negative of the inertia computed on X).
|set_params (\*\*params)| Set the parameters of this estimator.
|transform (X)| Transform X to a cluster-distance space.
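
The methods in the table can be exercised together in a few lines. The following is a sketch on made-up data, where `X_train` and `X_new` are hypothetical arrays invented for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two blobs for fitting, plus two unseen points.
rng = np.random.RandomState(0)
X_train = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(20, 2)),
])
X_new = np.array([[0.1, -0.2], [4.8, 5.1]])

km = KMeans(n_clusters=2, n_init=10, random_state=0)

# fit_predict: fit the model and return the cluster index of each training sample.
train_labels = km.fit_predict(X_train)

# predict: assign unseen samples to the closest fitted cluster center.
new_labels = km.predict(X_new)

# transform: distance from each sample to every cluster center,
# shape (n_samples, n_clusters).
distances = km.transform(X_new)

# score: negative of the k-means objective (inertia) evaluated on the given data.
score = km.score(X_train)

print(train_labels)
print(new_labels)
print(distances)
print(score)
```

The predicted label for a sample is the column with the smallest distance in the output of transform, which is one way to sanity-check the two methods against each other.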

To use KMeans, you must import the class from sklearn.cluster, as seen below:

In [1]:
from sklearn.cluster import KMeans

---
## Example: Using the scikit-learn Iris Dataset
---

In [2]:
import numpy as np
from sklearn import datasets

np.random.seed(5)

# Load the Iris dataset. It contains petal and sepal measurements for 3 types of iris
# (Setosa, Versicolour, and Virginica), stored in a 150x4 numpy.ndarray.
# Rows are samples; columns are Sepal Length, Sepal Width, Petal Length, and Petal Width.
iris = datasets.load_iris()

#Assigning the 4 columns of data (Sepal Length, Sepal Width, Petal Length and Petal Width) to the variable X
X = iris.data

#Assigning the actual classification (Setosa, Versicolour, and Virginica) to the variable y
y = iris.target

print("Actual Classifications")
print(y)
print()

#K means clustering results with 8 clusters
estimators_8 = KMeans(n_clusters=8)
est_8 = estimators_8.fit(X)
labels_8 = est_8.labels_
print("K means results using: 8 clusters")
print(labels_8)
print()

#K means clustering results with 3 clusters
estimators_3 = KMeans(n_clusters=3)
est_3 = estimators_3.fit(X)
labels_3 = est_3.labels_
print("K means results using: 3 clusters")
print(labels_3)
print()


Actual Classifications
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]

K means results using: 8 clusters
[1 5 5 5 1 1 5 1 5 5 1 5 5 5 1 1 1 1 1 1 1 1 5 1 5 5 1 1 1 5 5 1 1 1 5 5 1
 5 5 1 1 5 5 1 1 5 1 5 1 5 2 2 2 7 2 7 2 6 2 7 6 7 7 2 7 2 7 7 2 7 4 7 4 2
 2 2 2 2 2 7 7 7 7 4 7 2 2 2 7 7 7 2 7 6 7 7 7 2 6 7 0 4 3 0 0 3 7 3 0 3 0
 4 0 4 4 0 0 3 3 4 0 4 3 4 0 3 4 4 0 3 3 3 0 4 4 3 0 0 4 0 0 0 4 0 0 0 4 0
 0 4]

K means results using: 3 clusters
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 1 1 1 2 1 1 1 1
 1 1 2 2 1 1 1 1 2 1 2 1 2 1 1 2 2 1 1 1 1 1 2 1 1 1 1 2 1 1 1 2 1 1 1

The 8-cluster result cannot be compared directly against the original classifications because the number of categories differs. The 3-cluster result, however, can be cross-tabulated against the original classifications, as shown below.

In [3]:
import pandas as pd

#Combining the cluster labels and the actual classifications in one DataFrame
results = pd.DataFrame({'Prediction': labels_3, 'Target': y})

#Using Crosstab to print results
print(pd.crosstab(results['Target'], results['Prediction']))
print()

print("Classifications of Iris Flowers")
print(list(iris.target_names))

Prediction   0   1   2
Target                
0           50   0   0
1            0   2  48
2            0  36  14

Classifications of Iris Flowers
['setosa', 'versicolor', 'virginica']


The results of the K means algorithm for 3 clusters are as follows:
    1. Cluster 0 contained all 50 of the Setosa iris flowers.
    2. Cluster 1 contained 2 of the Versicolor iris flowers and 36 of the Virginica iris flowers.
    3. Cluster 2 contained 48 of the Versicolor iris flowers and 14 of the Virginica iris flowers.
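
A cross-tabulation like the one above can be turned into a single summary number by mapping each cluster to its most common species and counting how often that mapping matches the true class. The sketch below is one simple way to do this, not a standard scikit-learn metric; the names `majority`, `mapped`, and `agreement` are introduced here for illustration. Cluster numbering is arbitrary, so the cluster indices may differ from the output shown earlier.

```python
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.cluster import KMeans

iris = datasets.load_iris()
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(iris.data)

# Rows are the true classes, columns are the cluster indices.
table = pd.crosstab(iris.target, labels, rownames=["Target"], colnames=["Cluster"])

# For each cluster (column), find the true class (row) with the highest count.
majority = table.idxmax(axis=0)

# Replace every cluster index with its majority class and compare to the truth.
mapped = majority.loc[labels].to_numpy()
agreement = float((mapped == iris.target).mean())

print(table)
print({int(c): str(iris.target_names[s]) for c, s in majority.items()})
print(agreement)
```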

The k-Means clustering algorithm is a common first choice for grouping unlabeled data when the number of groups can be specified in advance. The code above is a good starting point for implementing the k-Means algorithm on your engagements. Further information can be found in the section below:

### Other Useful Resources
- [Introduction to k-Means Clustering](https://www.datascience.com/blog/k-means-clustering)
- [K-means clustering: how it works](https://www.youtube.com/watch?v=_aWzGGNrcic)