# Unsupervised Learning and K Means

This module introduces Unsupervised Learning and its applications. One of the most common uses of Unsupervised Learning is clustering observations using k-means. In this module, you will become familiar with the theory behind this algorithm and put it into practice in a demonstration.

## Learning Objectives
- Explain the kinds of problems suitable for Unsupervised Learning approaches
- Describe the clustering process of the k-means algorithm
- Become familiar with k-means clustering syntax in scikit-learn

# Introduction to Unsupervised Learning

- Data points have unknown outcome

Types:

- Clustering: identify unknown structure in data
  - K-Means
  - Hierarchical Agglomerative Clustering
  - Mean Shift
  - DBSCAN

- Dimensionality reduction: use structural characteristics to simplify data
  - That is, reduce the size of the dataset without losing much of the information contained in the original dataset.
  - Principal Components Analysis
  - Non-negative Matrix Factorization

## Curse of Dimensionality
- In theory, increasing features should improve performance.
- In practice, too many features lead to worse performance.

The number of training examples required to cover the feature space increases exponentially with dimensionality. For example, with 10 positions along each dimension:
- 1 dimension: 10 positions 
- 2 dimensions: 100 positions 
- 3 dimensions: 1000 positions
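
The arithmetic behind the list above can be sketched in a few lines. This is only an illustration that assumes each dimension is split into 10 positions, as in the notes; the helper name `cells_needed` is hypothetical.

```python
# Number of grid cells needed to cover the space when every
# dimension is split into `per_axis` positions (10 in the notes above).
def cells_needed(n_dims, per_axis=10):
    """Return the number of cells in an n_dims-dimensional grid."""
    return per_axis ** n_dims

for d in (1, 2, 3, 10):
    print(f"{d} dimension(s): {cells_needed(d):,} positions")
```

Going from 3 to 10 dimensions already takes the count from a thousand to ten billion positions, which is why the amount of data needed to cover the space grows so quickly.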

## Curse of Dimensionality: Churn Example

The Curse of Dimensionality comes up often in applications.

Consider the customer churn example from earlier.

The original dataset has 54 columns:

- Some, like 'Age', 'Under 30', and 'Senior Citizen', are closely related.
- Others (Latitude, for example) are essentially duplicated.
- Even if we remove duplicates and non-numeric columns, the curse of dimensionality still applies.

Clustering can help identify groups of similar customers.

Dimensionality reduction can improve both the performance and interpretability of this grouping.

![](./images/01_UnsupervisedLearningOverview.png)

![](./images/02_ClusteringExample.png)

## Common Clustering Use Cases

Classification

Anomaly detection

Customer segmentation

Improve supervised learning

## Common Dimension Reduction Use Cases

Image processing: 
- High resolution images -> compressed images

Image tracking



# K Means Clustering

## K Means Algorithm

K=2

- Initialize the algorithm by picking 2 random points. These act as the centroids of our clusters.
- With the centroids initialized, take each example in the space and assign it to a cluster by computing its distance to each centroid and choosing the nearest one.
- Next, move each centroid to the mean of the points assigned to its cluster.
- Repeat the assignment and update steps until the centroids no longer move. Once the centroids stay in place, we have our two clusters.
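
The steps above can be sketched in NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name `kmeans_sketch`, the toy data, and the choice to draw the initial centroids from the data points themselves are all assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    """Minimal k-means on an (n_samples, n_features) float array."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as the starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: label each point with its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Toy data: two well-separated blobs (made up for illustration).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centroids, labels = kmeans_sketch(X, k=2)
print(centroids)
```

On data this cleanly separated, the two centroids should settle near the centers of the two blobs after a handful of iterations.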

K=3

- There can be multiple solutions; which one the algorithm converges to depends on the initial points.

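### Smarter Initialization
- Pick the first centroid at random from the data points.
- Pick the next centroid with probability $ distance(x_i)^2 / \sum_{j=1}^n distance(x_j)^2 $, where $distance(x_i)$ is the distance from $x_i$ to the nearest centroid chosen so far, so points far from the first centroid are more likely to be picked.
- Repeat, each time favoring points far from the centroids already chosen.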
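
The distance-weighted sampling described above can be sketched as follows. This is a simplified illustration of the idea behind `init='k-means++'`, assuming the distance is measured to the nearest centroid chosen so far; the function name `kmeans_pp_init` and the toy data are hypothetical.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Pick k initial centroids from X using squared-distance weighting."""
    rng = np.random.default_rng(seed)
    # First centroid: one data point chosen uniformly at random.
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen centroid.
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Sample the next centroid with probability d2 / sum(d2),
        # so far-away points are more likely to be selected.
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

# Toy data: three blobs (made up for illustration).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.5, (30, 2)) for c in (0, 5, 10)])
init_centroids = kmeans_pp_init(X, k=3)
print(init_centroids)
```

A point that has already been chosen has distance zero to itself, so it gets probability zero and cannot be picked twice.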

## Selecting the Right Number of Clusters in K-Means

Sometimes the question has a K:
- Clustering similar jobs on 4 CPU cores (K=4)
- A clothing design in 10 different sizes to cover most people (K=10)
- A navigation interface for browsing scientific papers with 20 disciplines (K=20)

Often, the number of clusters (K) is unclear, and we need an approach to select it.


### Evaluating Clustering Performance

Inertia: sum of squared distances from each point $(x_i)$ to the centroid of its cluster $(C_k)$.
$$
\sum_{i=1}^n(x_i-C_k)^2
$$

- A smaller value corresponds to tighter clusters.

- The value is sensitive to the number of points in each cluster.

- Use inertia when you are more concerned that clusters have similar numbers of points.

Distortion: average of squared distances from each point $(x_i)$ to the centroid of its cluster $(C_k)$.
$$
\dfrac{1}{n} \sum_{i=1}^n(x_i-C_k)^2
$$

- A smaller value corresponds to tighter clusters.
- Unlike inertia, it doesn't generally increase as more points are added.
- Use distortion when the similarity of points within a cluster is more important.
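
As a small worked example of the two formulas, the sketch below computes both metrics for a hand-made clustering. The helper name `inertia_and_distortion` and the four data points are assumptions chosen so that the arithmetic is easy to check by hand.

```python
import numpy as np

def inertia_and_distortion(X, labels, centroids):
    """Return (sum, mean) of squared distances to the assigned centroid."""
    # centroids[labels] lines up each point with its own cluster centroid.
    sq_dists = ((X - centroids[labels]) ** 2).sum(axis=1)
    return sq_dists.sum(), sq_dists.mean()

# Four points in two clusters; every point is exactly 1 unit from its centroid.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])

inertia, distortion = inertia_and_distortion(X, labels, centroids)
print(inertia, distortion)  # 4.0 1.0
```

Doubling the number of points at the same distances would double the inertia but leave the distortion at 1.0, which is the sensitivity to cluster size noted above.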

How can we find the clustering with the best inertia? For a predefined K, run the K-means algorithm several times with different initial configurations and compute the resulting inertia or distortion for each run. Then keep the run whose initialization led to the best (lowest) inertia or distortion.
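
One way to sketch this restart procedure with scikit-learn is shown below. The toy data and the number of restarts are assumptions; `init="random"` with `n_init=1` is used so that each fit corresponds to exactly one random initialization.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: three blobs (made up for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.6, (40, 2)) for c in (0, 4, 8)])

results = []
for seed in range(5):
    # One random initialization per fit, controlled by the seed.
    km = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed)
    km.fit(X)
    results.append((km.inertia_, seed))

# Keep the run with the lowest inertia.
best_inertia, best_seed = min(results)
print(best_seed, best_inertia)
```

In practice the loop is rarely written by hand: the `n_init` parameter of `KMeans` runs several initializations internally and keeps the fit with the lowest inertia.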



## Elbow Method and Applying K-Means

![](./images/03_ChoosingRightNumberOfCluster.png)

### K-Means: The Syntax

```python
# Import the class containing the clustering method.
from sklearn.cluster import KMeans

# Create an instance of the class.
kmeans = KMeans(n_clusters=3, init='k-means++')

# Fit the instance on the data (X1), then predict clusters for new data (X2).
# X1 and X2 are assumed to be numeric arrays with the same feature columns.
kmeans = kmeans.fit(X1)
y_predict = kmeans.predict(X2)

# Can also be used in batch mode with MiniBatchKMeans.
```

### K-Means: Elbow Method Syntax

To implement the elbow method, fit K-Means for a range of values of k and save the inertia of each fit.

```python
from sklearn.cluster import KMeans

inertia = []
# Start at k=1: n_clusters must be at least 1.
list_clusters = list(range(1, 11))
for k in list_clusters:
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(X)
    inertia.append(kmeans.inertia_)
```
```
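
For a self-contained version that can be run end to end, the sketch below applies the same loop to synthetic data. The dataset from `make_blobs`, the range of k, and the `drops` helper list are assumptions for illustration only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 true clusters (an assumption for this demo).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=0)

ks = list(range(1, 9))
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

# Improvement in inertia gained by adding one more cluster.
drops = [inertias[i] - inertias[i + 1] for i in range(len(inertias) - 1)]

for k, value in zip(ks, inertias):
    print(k, round(value, 1))
```

Plotting `inertias` against `ks` gives the elbow curve. One would look for the value of k after which the entries in `drops` become small, although where that happens depends on the data.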