# K-Means Clustering

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_blobs

## Generate Data

To make sure that our data actually contains relevant clusters, we will generate it ourselves.

To do so, we will use [`make_blobs`](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html) from `sklearn`.  

We want a dataset with **500 observations**, **2 features** and **4 clusters**.  

We set `random_state=42` so that the generated data is reproducible and you can compare results with your buddy.

**Run the cell below to generate your data**

In [None]:
# Generate data
# n_features=2 is the default; it is spelled out here to match the text above
X, y = make_blobs(n_samples=500, n_features=2, centers=4, random_state=42)

❓ **>>>** Make a scatter plot of your two features against each other. Color the points according to their corresponding  value in `y`.

Don't forget the color argument is:
 - `c` for matplotlib  
 - `hue` for seaborn 
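
As a minimal sketch of how the two arguments differ, the snippet below colors a tiny made-up dataset with both libraries. The names `toy_points` and `toy_labels` are hypothetical and only serve the illustration; they are not the notebook's `X` and `y`.

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical toy data (not the notebook's X and y): 4 points, 2 groups
toy_points = np.array([[0.0, 0.0], [0.5, 0.5], [3.0, 3.0], [3.5, 3.5]])
toy_labels = np.array([0, 0, 1, 1])

fig, (ax_mpl, ax_sns) = plt.subplots(1, 2, figsize=(8, 3))

# matplotlib: one value per point, passed through `c`
ax_mpl.scatter(toy_points[:, 0], toy_points[:, 1], c=toy_labels)
ax_mpl.set_title("matplotlib: c=")

# seaborn: the grouping variable is passed through `hue`
sns.scatterplot(x=toy_points[:, 0], y=toy_points[:, 1], hue=toy_labels, ax=ax_sns)
ax_sns.set_title("seaborn: hue=")
```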

In [None]:
# Code here!


You should see 4 distinct clusters, each with a different color.

## K-Means

Let's assume that we never knew about `y` and only received `X` to work with.  
We only have 2 features and no target.

Your goal is to find the **number of clusters** ($k$) that best matches the structure of your data.  


❓ **>>>** Import `KMeans` from `sklearn.cluster` and instantiate a model with the parameters:
- `n_clusters=2`,
- `random_state=42`

❓ **>>>** Fit the model on your `X`  
❓ **>>>** Get your predictions and store them in a `y_pred` variable.

**Hint**: Each step takes a single line of code.

In [None]:
# Code here!


The predictions are a vector of cluster assignments, one for each observation.  
With `n_clusters=2`, each observation in `X` is assigned to one of two clusters.
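
To make the shape of that vector concrete, here is a minimal sketch on a made-up dataset. The array `toy_points` and the names `toy_model` and `toy_labels` are hypothetical, and `n_init=10` is an assumption added so the sketch behaves the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two obvious groups of three points each
toy_points = np.array([
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
    [5.0, 5.0], [5.1, 5.2], [5.2, 5.1],
])

toy_model = KMeans(n_clusters=2, n_init=10, random_state=0)
toy_labels = toy_model.fit_predict(toy_points)

print(toy_labels.shape)       # one label per observation
print(np.unique(toy_labels))  # the cluster ids in use
```

The cluster ids themselves are arbitrary: which group ends up as `0` and which as `1` carries no meaning.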

❓ **>>>** Using the previous code, make a scatter plot of your two features against each other.

❓ **>>>** Color the points according to the predicted cluster in `y_pred`.

In [None]:
# Code here!


You can still see 4 distinct clusters; however, the colors only show 2.
**This clustering around 2 centers is clearly not optimal; we can do better.**

## Find the optimal number of clusters $k$

Once fitted, the `KMeans` instance gains an attribute named `inertia_`.

It represents the **sum of squared distances of observations to their associated (closest) cluster center**. 

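For a given number of clusters, the lower, the better. Keep in mind that inertia also tends to decrease as $k$ grows, so the lowest inertia across different values of $k$ is not automatically the best choice.  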
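
The definition can be checked by hand. The sketch below regenerates the data under a new name (`X_demo`, a hypothetical name used to keep the snippet self-contained), fits a model, and recomputes the sum of squared distances from the fitted attributes `cluster_centers_` and `labels_`. Setting `n_init=10` is an assumption made here, not something the notebook requires.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same generation settings as the notebook, under a separate name
X_demo, _ = make_blobs(n_samples=500, centers=4, random_state=42)

demo_model = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_demo)

# Look up the centroid assigned to each observation
assigned_centers = demo_model.cluster_centers_[demo_model.labels_]

# Sum of squared Euclidean distances to the assigned centroid
manual_inertia = np.sum((X_demo - assigned_centers) ** 2)

print(manual_inertia, demo_model.inertia_)
```

Up to floating-point rounding, the two printed values should agree.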

In [None]:
KMeans(n_clusters=2, random_state=42).fit(X).inertia_

Think of this in comparison to the Sum of Squared Errors in a Linear Regression.

- `SSE` of a `Linear Regression` --> `Sum of squared residuals (vertical distances between observations and the regression line)`  

- `Inertia` of a `KMeans Clustering` --> `Sum of squared distances between observations and their closest centroid`
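
Written as formulas, the parallel looks like this. The notation is assumed here rather than taken from the notebook: $\hat{y}_i$ is the regression prediction for observation $i$, $x_i$ is an observation, and $\mu_{c(i)}$ is the centroid of the cluster assigned to $x_i$.

$$
\text{SSE} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
\qquad
\text{Inertia} = \sum_{i=1}^{n} \left\lVert x_i - \mu_{c(i)} \right\rVert^2
$$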

One way for us to find the optimal number of clusters is a heuristic: the **Elbow Method**.  

We have to try several numbers of clusters and look at the inertia obtained for each one.

❓ **>>>** Fit a `KMeans` for every number of clusters between 1 and 10. For each one, save the inertia in a list `wcss` (Within-Cluster Sum of Squares).

In [None]:
wcss = []
clusters = list(range(1, 11))
# Code here!


❓ **>>>** Plot the inertias in `wcss` against their corresponding number of clusters.

In [None]:
# Code here!


The curve shows a clear elbow at 4 clusters: inertia drops sharply up to $k=4$ and only marginally afterwards.  
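
One way to sanity-check the elbow without relying on the plot alone is to look at how much inertia each additional cluster removes. The sketch below is only an illustration: `X_demo`, `ks` and `inertias` are hypothetical names chosen to keep it self-contained, and `n_init=10` is set explicitly because the default has changed across scikit-learn versions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same generation settings as the notebook, under a separate name
X_demo, _ = make_blobs(n_samples=500, centers=4, random_state=42)

ks = list(range(1, 11))
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_demo).inertia_
    for k in ks
]

# Inertia removed by going from k-1 to k clusters
for k, before, after in zip(ks[1:], inertias[:-1], inertias[1:]):
    print(f"k={k}: inertia drops by {before - after:.1f}")
```

The elbow sits at the last value of $k$ for which the drop is still large compared with the drops that follow.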

## K-Means with optimal clusters

With the optimal number of clusters known, it's time to fit a final `KMeans`.

❓ **>>>** Fit a `KMeans` with `n_clusters=4` on your `X`, store the predictions in `y_pred`  

❓ **>>>** Make a scatter plot of your two features against each other, and color the points according to the predicted cluster in `y_pred`.

In [None]:
# Code here!


We successfully identified **4 clusters** among our observations.  

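**Notes:** `KMeans` relies on Euclidean distances, so features measured on larger scales weigh more in the clustering. Scaling features beforehand is not always necessary (here both features are on comparable scales), but it rarely hurts.  
You can check these [detailed answers](https://datascience.stackexchange.com/questions/6715/is-it-necessary-to-standardize-your-data-before-clustering) to go further.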
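
If you want to try scaling, one possible approach is to chain a `StandardScaler` and a `KMeans` in a pipeline. The sketch below is an assumption-laden illustration, not part of the original exercise: `X_stretched` is a hypothetical variant of the data in which the second feature is multiplied by 1000 to mimic a change of units, and `n_init=10` is set explicitly for consistency across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same generation settings as the notebook, under a separate name
X_demo, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# Hypothetical: pretend the second feature is recorded in much smaller units
X_stretched = X_demo * np.array([1.0, 1000.0])

# Standardize each feature, then cluster
scaled_kmeans = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=4, n_init=10, random_state=42),
)
labels_scaled = scaled_kmeans.fit_predict(X_stretched)

print(np.bincount(labels_scaled))  # number of observations per cluster
```

Without the scaler, the stretched feature would dominate every distance computation, which is the situation the note above warns about.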