# Unsupervised Learning and K Means

This module introduces Unsupervised Learning and its applications. One of the most common uses of Unsupervised Learning is clustering observations using k-means. In this module, you become familiar with the theory behind this algorithm and put it into practice in a demonstration.

## Learning Objectives
- Explain the kinds of problems suitable for Unsupervised Learning approaches
- Describe the clustering process of the k-means algorithm
- Become familiar with k-means clustering syntax in scikit-learn

# Introduction to Unsupervised Learning

- Data points have unknown outcome

Types:

- Clustering: identify unknown structure in data
  - K-Means
  - Hierarchical Agglomerative Clustering
  - Mean Shift
  - DBSCAN

- Dimensionality reduction: use structural characteristics to simplify data
  - That is, shrink the dataset without losing much of the information it originally contained.
  - Principal Components Analysis
  - Non-negative Matrix Factorization

## Curse of Dimensionality
- In theory, increasing features should improve performance.
- In practice, too many features lead to worse performance.

Number of training examples required increases exponentially with dimensionality.
- 1 dimension: 10 positions 
- 2 dimensions: 100 positions 
- 3 dimensions: 1000 positions
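
The growth in the list above can be reproduced with a few lines of arithmetic. This is only a sketch: the helper name `positions_needed` and the choice of 10 bins per axis are assumptions taken from the 10 / 100 / 1000 pattern, not part of the course material.

```python
def positions_needed(n_dims, bins_per_axis=10):
    """Grid cells needed to cover n_dims axes at bins_per_axis bins each."""
    return bins_per_axis ** n_dims

# Each extra dimension multiplies the number of positions by bins_per_axis.
for n_dims in (1, 2, 3, 10):
    print(f"{n_dims:>2} dimensions: {positions_needed(n_dims):,} positions")
```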

## Curse of Dimensionality: Churn Example

The Curse of Dimensionality comes up often in applications.

Consider the customer churn example from earlier.

The original dataset has 54 columns:

- Some, like 'Age', 'Under 30', and 'Senior Citizen' are
closely related.
- Others (Latitude for example) are essentially duplicated.
- Even if we remove duplicates and non-numeric columns,
the curse of dimensionality applies

Clustering can help identify groups of similar customers

Dimensionality reduction can improve both the performance and
interpretability of this grouping

![](./images/01_UnsupervisedLearningOverview.png)

![](./images/02_ClusteringExample.png)

## Common Clustering Use Cases

Classification

Anomaly detection

Customer segmentation

Improve supervised learning

## Common Dimension Reduction Use Cases

Image processing: 
- High resolution images -> compressed images

Image tracking



# K Means Clustering

## K Means Algorithm

K=2

- Initialize the algorithm by picking 2 random points; these act as the centroids of our clusters.
- With the centroids initialized, assign each example to a cluster by computing its distance to every centroid and choosing the nearest one.
- Move each centroid to the mean of the points currently assigned to its cluster.
- Repeat the assignment and update steps until the centroids no longer move. At that point the centroids stay in place and we have our two clusters.
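
The steps above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the scikit-learn implementation: the function name `kmeans`, the iteration cap `n_iters`, and the synthetic two-blob data are all made up for the example.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain k-means on an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct observations as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign every point to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two synthetic blobs (illustrative data, not from the course).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centroids, labels = kmeans(X, k=2)
print(centroids)
```

Because the starting centroids are random, a different `seed` can lead to a different final clustering.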

K=3

- there can be multiple solutions, depends on initial points

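### Smarter Initialization
- Pick the first point at random.
- Pick the next point with probability $ \dfrac{distance(x_i)^2}{\sum_{j=1}^n distance(x_j)^2} $, where $distance(x_i)$ is the distance from $x_i$ to the first point, so points far from it are more likely to be chosen.
- ... pick each further point far from the ones already chosen.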
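
A rough sketch of this distance-weighted seeding is shown below. It assumes the usual k-means++ convention that, once several points have been chosen, the distance used is the one to the nearest already-chosen point; the function name `kmeans_plus_plus_init` and the sample data are hypothetical.

```python
import numpy as np

def kmeans_plus_plus_init(X, k, seed=0):
    """Choose k initial centroids from the rows of X, favouring far-apart points."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # First centroid: one observation chosen uniformly at random.
    centroids = [X[rng.integers(n)]]
    for _ in range(1, k):
        # Squared distance from every point to its nearest chosen centroid.
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Sample the next centroid with probability proportional to d2.
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(n, p=probs)])
    return np.array(centroids)

# Illustrative data: three synthetic blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.5, (30, 2)) for c in ([0, 0], [6, 0], [3, 5])])
print(kmeans_plus_plus_init(X, k=3))
```

An already-chosen point has squared distance 0 and therefore probability 0, so it cannot be picked twice.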

## Selecting the Right Number of Clusters in K-Means

Sometimes the question has a K:
- Clustering similar jobs on 4 CPU cores (K=4)
- A clothing design in 10 different sizes to cover most people (K=10)
- A navigation interface for browsing scientific papers with 20 disciplines (K=20)

Often, the number of clusters (K) is unclear, and we need an approach to select it.


### Evaluating Clustering Performance

Inertia: sum of squared distances from each point $(x_i)$ to the centroid $(C_k)$ of its assigned cluster.
$$
\sum_{i=1}^n(x_i-C_k)^2
$$

- Smaller value corresponds to tighter clusters.

- Value sensitive to number of points in cluster.

- If you are more concerned that clusters have similar numbers of points, use inertia.

Distortion: average of squared distances from each point $(x_i)$ to the centroid $(C_k)$ of its assigned cluster.
$$
\dfrac{1}{n} \sum_{i=1}^n(x_i-C_k)^2
$$

- Smaller value corresponds to tighter clusters.
- Doesn't generally increase as more points are added (relative to Inertia)
- When the similarity of points in the cluster is more important, you should use distortion
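
Both metrics can be computed directly from the definitions. The snippet below is a small sketch on a hand-made data set; the function names and the four sample points are assumptions for illustration only.

```python
import numpy as np

def inertia(X, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return float(((X - centroids[labels]) ** 2).sum())

def distortion(X, centroids, labels):
    """Average squared distance from each point to its assigned centroid."""
    return inertia(X, centroids, labels) / len(X)

# Four points in two clusters, with each centroid at its cluster mean.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 4.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 1.0], [10.0, 2.0]])

print(inertia(X, centroids, labels))     # 1 + 1 + 4 + 4 = 10.0
print(distortion(X, centroids, labels))  # 10.0 / 4 = 2.5
```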

So how do we find the clustering with the best inertia? With K fixed in advance, we run the K-means algorithm several times, each with a different initial configuration, and compute the resulting inertia or distortion for each run. We then keep the run whose initialization led to the best (lowest) inertia or distortion.
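
One way to try this with scikit-learn is sketched below. It assumes synthetic data and uses `init="random"` with `n_init=1` so that each fit corresponds to exactly one starting configuration; the seeds and blob positions are arbitrary choices for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative data: three synthetic blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.6, (40, 2)) for c in ([0, 0], [4, 0], [2, 4])])

results = []
for seed in range(5):
    # n_init=1 so each fit uses a single random starting configuration.
    km = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X)
    results.append((km.inertia_, seed))

# Keep the run with the lowest inertia.
best_inertia, best_seed = min(results)
print(f"best seed: {best_seed}, inertia: {best_inertia:.2f}, "
      f"distortion: {best_inertia / len(X):.3f}")
```

In practice `KMeans` can do this internally: its `n_init` parameter controls how many initializations are tried, and the fit with the lowest inertia is kept.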



## Elbow method and Applying K-means

![](./images/03_ChoosingRightNumberOfCluster.png)

### K-Means: The Syntax

```python
# Import the class containing the clustering method.
from sklearn.cluster import KMeans

# Create an instance of the class.
kmeans = KMeans(n_clusters=3, init='k-means++')

# Fit the instance on the data and then predict clusters for new data.
# X1 (training data) and X2 (new data) are assumed to be 2D feature arrays.
kmeans = kmeans.fit(X1)
y_predict = kmeans.predict(X2)

# Can also be used in batch mode with MiniBatchKMeans.
```
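
For the batch mode mentioned in the last comment, a minimal sketch with `MiniBatchKMeans` might look like the following. The synthetic `X1` and `X2` arrays and the `batch_size` value are assumptions chosen for the example.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Illustrative training data (three synthetic blobs) and two new points.
rng = np.random.default_rng(0)
X1 = np.vstack([rng.normal(c, 0.5, (200, 2)) for c in ([0, 0], [5, 5], [0, 5])])
X2 = np.array([[0.1, -0.2], [4.8, 5.1]])

# Same fit/predict interface as KMeans, but each update uses a small batch.
mbk = MiniBatchKMeans(n_clusters=3, batch_size=64, n_init=3, random_state=0)
mbk.fit(X1)
y_predict = mbk.predict(X2)
print(y_predict)
```

Mini-batches trade a little clustering quality for speed, which tends to matter only on large data sets.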

### K-Means: Elbow Method Syntax

To implement the elbow method, fit K-Means for a range of values of k and save the inertia of each fit.

```python
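from sklearn.cluster import KMeans

inertia = []
list_clusters = list(range(1, 11))  # k must be at least 1
for k in list_clusters:
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(X)  # X is assumed to be the feature matrix
    inertia.append(kmeans.inertia_)
```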
```
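
To see the elbow on actual numbers, the loop can be run end to end on synthetic data. This sketch assumes data generated by `make_blobs` with 4 centers; the range of k and the other parameter values are arbitrary.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative data with 4 true clusters.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.7, random_state=0)

ks = list(range(1, 9))
inertia = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertia.append(km.inertia_)

# The elbow is roughly where the drop in inertia starts to flatten out.
for k, prev, curr in zip(ks[1:], inertia, inertia[1:]):
    print(f"k={k}: inertia={curr:9.1f}  drop from k-1={prev - curr:9.1f}")
```

Plotting `inertia` against `ks` gives the usual elbow curve.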

# Summary
## Unsupervised Learning Algorithms
Unsupervised algorithms are relevant when we don’t have an outcome or labeled variable we are trying to predict.

They are helpful to find structures within our data set and when we want to partition our data set into smaller pieces.   

Types of Unsupervised Learning:

| Type of Unsupervised Learning | Data                                                    | Example                                                                      | Algorithms                                                         |
|-------------------------------|---------------------------------------------------------|------------------------------------------------------------------------------|--------------------------------------------------------------------|
| Clustering                    | Use unlabeled data, Identify unknown structures in data | Segmenting customers into different groups                                   | K-means, Hierarchical Agglomerative Clustering, DBSCAN, Mean shift |
| Dimensionality Reduction      | Use structural characteristics to simplify data         | Reducing size without losing too much information from our original data set | Principal Components Analysis, Non-negative Matrix Factorization   |

Dimensionality reduction is important in the context of large amounts of data.

## The Curse of Dimensionality

In theory, a large number of features should improve performance, because models have more information to learn from. In practice, too many features lead to worse performance. Several things can go wrong when you have too many features, for example:

- Some features can be spurious correlations: they correlate with the outcome within your data set but not outside it, once new data comes in.

- Too many features create more noise than signal.

- Algorithms find it hard to sort through non-meaningful features if you have too many features.

- The number of training examples required increases exponentially with dimensionality.

- Higher dimensions slow performance.

- Larger data sets are computationally more expensive.

- Higher incidence of outliers.

To fix these problems in real life, it's best to reduce the dimension of the data set. 

Similar to feature selection, you can use Unsupervised Machine Learning models such as Principal Components Analysis.

## Common uses of clustering in the real world
1. Anomaly detection

Example: Fraudulent transactions.

A suspicious pattern might be a small cluster of credit card transactions with a high volume of attempts, small amounts, and new merchants. Such transactions form a new cluster, which is flagged as an anomaly that may indicate fraud.

2. Customer segmentation

You could segment the customers by recency, frequency, and average amount of visits in the last 3 months. Another common type of segmentation is by demographic and level of engagement, for example, single customers, new parents, empty nesters, etc. You can combine each segment with its preferred marketing channel and use these insights for future marketing campaigns.

3. Improve supervised learning

You can perform a Logistic regression for each cluster. This means training one model for each segment of your data to try to improve the classification.
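
A sketch of this idea is shown below, assuming synthetic data in which the two segments relate features to labels in opposite ways. The helper `predict_by_cluster` is hypothetical, and the code assumes every cluster contains both classes (otherwise `LogisticRegression` cannot be fit on it).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Illustrative data: two segments with opposite feature/label relationships.
rng = np.random.default_rng(0)
X_a = rng.normal([0, 0], 1.0, (200, 2))
X_b = rng.normal([8, 8], 1.0, (200, 2))
X = np.vstack([X_a, X_b])
y = np.concatenate([(X_a[:, 0] > 0).astype(int), (X_b[:, 0] < 8).astype(int)])

# Step 1: cluster the observations.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Step 2: train one classifier per cluster
# (assumes each cluster contains both classes).
models = {}
for c in range(kmeans.n_clusters):
    mask = kmeans.labels_ == c
    models[c] = LogisticRegression().fit(X[mask], y[mask])

def predict_by_cluster(X_new):
    """Route each row to the model trained on its cluster."""
    clusters = kmeans.predict(X_new)
    preds = np.empty(len(X_new), dtype=int)
    for c in np.unique(clusters):
        mask = clusters == c
        preds[mask] = models[c].predict(X_new[mask])
    return preds

accuracy = (predict_by_cluster(X) == y).mean()
print(f"training accuracy with per-cluster models: {accuracy:.3f}")
```

Whether this beats a single model depends on the data; it helps mainly when the segments really do behave differently.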

## Common uses of Dimension Reduction in the real world

1. Turn high-resolution images into compressed images

This means producing a reduced, more compact version of the images that still contains most of the information that tells us what the image is about.

2. Image tracking

Reduce the noise to the primary factors that are relevant in a video capture. Reducing the data set can greatly improve the computational efficiency of the detection algorithms.

## K-means Clustering
K-means clustering is an iterative process in which similar observations are grouped together. The algorithm starts by taking k random points, known as centroids (2 in the K=2 example), calculates the distance from each observation to each centroid, and assigns each observation to the nearest centroid. After the first iteration, every point belongs to a cluster.

Next, the centroid of each cluster is recalculated as the mean of all points in that cluster. We then repeat the assignment and update steps until no example is assigned to a different cluster.

The k in k-means is the number of clusters, and each centroid is the mean of its cluster, hence the name. The algorithm converges when the centroids do not move anymore.

With more than two clusters there can be multiple solutions. By multiple solutions, we mean that the centroids have converged and no longer move, but depending on the initial points they can converge in different places.

## Advantages and Disadvantages of K-Means  

The main advantage of k-means algorithm is that it is easy to compute. One disadvantage is that this algorithm is sensitive to the choice of the initial points, so different initial configurations may yield different results. 

To overcome this, there is a smarter initialization of K-means clusters called K-means++, which helps to avoid getting stuck at local optima. This is the default initialization of K-means in scikit-learn.

## Model Selection, choosing K number of clusters

Sometimes you want to split your data into a predetermined number of groups or segments. Often, the number of clusters (K) is unclear, and you need an approach to select it.

A common metric is Inertia, defined as the sum of squared distances from each point to its cluster centroid.

Smaller values of Inertia correspond to tighter clusters. This means that we penalize spread-out clusters and reward clusters that are tight around their centroids.

The drawback of this metric is that its value is sensitive to the number of points in the clusters. The more points you add, the more inertia keeps growing, even if the new points are closer to the centroids than the existing points.

Another metric is Distortion, defined as the average squared distance from each point to its cluster centroid.

Smaller values of distortion correspond to tighter clusters.

An advantage of distortion is that it doesn't generally increase as more points are added (relative to inertia). Adding points that are closer to the centroid actually decreases the average distance to the cluster centroid.

## Inertia Vs. Distortion 

Both Inertia and Distortion are measures of entropy per cluster.

Inertia will always increase, as more members are added to each cluster, while this will not be the case with distortion. 
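
A tiny numeric check of this behaviour, using made-up points placed symmetrically around a centroid at the origin so that adding points does not move it:

```python
import numpy as np

def inertia_and_distortion(points, centroid):
    """Return (sum, mean) of squared distances from points to centroid."""
    sq_dist = ((points - centroid) ** 2).sum(axis=1)
    return float(sq_dist.sum()), float(sq_dist.mean())

centroid = np.array([0.0, 0.0])
points = np.array([[2.0, 0.0], [-2.0, 0.0], [0.0, 2.0], [0.0, -2.0]])
before = inertia_and_distortion(points, centroid)

# Add two points that sit much closer to the centroid than the existing ones.
more_points = np.vstack([points, [[0.5, 0.0], [-0.5, 0.0]]])
after = inertia_and_distortion(more_points, centroid)

print(before)  # (16.0, 4.0)
print(after)   # (16.5, 2.75)
```

Inertia rises from 16.0 to 16.5 even though the new points are close to the centroid, while distortion falls from 4.0 to 2.75.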

When the similarity of the points in the cluster is most important, you should use distortion; if you are more concerned that clusters should have a similar number of points, you should use inertia.

## Finding the right cluster
To find the clustering with a low entropy metric, you can run several k-means models with different initial configurations, compare the results, and determine which of the initializations leads to the lowest inertia or distortion.