# Module 7 Introduction to Clustering

# Introduction
### Slide 1
#### Module 7 Introduction to Clustering
- Unsupervised Learning
- K-Means
- DB-SCAN
- Case Study-Credit Card Data

---
## Introduction Script
Hello and welcome. 

This module is about clustering, which is a type of unsupervised learning. 

Until now, all the machine learning algorithms we've learned have been supervised learning algorithms. Supervised learning uses training data to make connections between the inputs and the outputs.

Unsupervised learning, on the other hand, does not use outputs. The most common unsupervised learning method is clustering, which is used to find hidden patterns in the data and split the data into different groups.

We will introduce two types of clustering algorithms, K-means and DBSCAN.

One of the most popular applications of clustering is customer segmentation, which divides customers into sub-groups based on shared characteristics, such as purchase patterns. In this module, we will have a case study that uses k-means clustering to segment credit card users into different groups.

As I mentioned before, please watch the video to learn the concepts behind the algorithms, and more importantly, go through the lesson notebooks and practice as much as you can.

---
## Lesson 1: Introduction to K-means Clustering

---
### Slide 1(Just the first image in clustering)
#### K-means

1. Choose k centers randomly
2. Form k groups by centers
3. Recalculate centers of k groups
4. Repeat 2-3 until no change in groups or reach max_iter
5. Calculate total distance D
6. Repeat 1-5 pre-defined times (n_init)
7. Choose groups with smallest D as final clusters

### Slide 1 Script
(face)

In this lesson, we introduce the simplest clustering method, k-means. K-means tries to divide a dataset into k groups. We need to provide the value of k or number of groups to the model.

Let's start with an example.

In this example, the dots are the datapoints. Our goal is to divide the dataset into two different groups. By visualizing the data, we can easily identify the two groups, one group at the top right corner and one group at the bottom left corner.

Let's see how the k-means algorithm divides this dataset.


---
### Slide 2

#### K-means

##### Image need to change
<img src='https://miro.medium.com/max/638/0*rrzG3LyOnAvOepbJ.png' width=500>

---
### Slide 2 Script

(slide1)

The first step in the k-means algorithm is to randomly pick k data points as centers. In this case, since we are going to divide the dataset into two groups, we pick two data points randomly as initial centers, represented by the blue dot and the red dot.

(slide2)

We then divide the dataset into two subgroups based on the distance between each datapoint and the two centers. Now we have two subgroups, blue and red. All datapoints in subgroup blue are closer to the blue center, and all datapoints in subgroup red are closer to the red center.

(slide3)

We then calculate the true center of each group, represented by the new blue and red dots.

(slide4)

With the two new centers, we again divide the dataset into two groups based on the distance of datapoints to the two new centers. 

(slide5)

We repeat step 3 and step 4 until the centers don't change any more. The two subgroups determined by the final centers are the final result.

(slide 6)
This is the whole process.

Since we pick the original centers randomly, it's possible that the final result is not optimal. To improve the chance of a good result, we can repeat the whole process with different original centers. For each run, we calculate the sum of the distances from each datapoint to its center, then choose the run with the smallest total distance as our final result.
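The whole process, including the repeated runs, can be sketched in a few lines of NumPy. This is only an illustration, not the scikit-learn implementation: the function names `kmeans_once` and `kmeans` and the two-blob toy dataset are made up for this example, and the total distance D is assumed to be the sum of squared distances.

```python
import numpy as np


def kmeans_once(X, k, rng, max_iter=300):
    """One k-means run starting from k randomly chosen data points."""
    # Pick k distinct data points as the initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Form groups: assign every point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # the groups stopped changing
        labels = new_labels
        # Recalculate each center as the mean of its group.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    # Total distance D (assumption: sum of squared distances to the centers).
    total = float(((X - centers[labels]) ** 2).sum())
    return labels, centers, total


def kmeans(X, k, n_init=10, seed=0):
    """Repeat the run n_init times and keep the run with the smallest D."""
    rng = np.random.default_rng(seed)
    runs = [kmeans_once(X, k, rng) for _ in range(n_init)]
    return min(runs, key=lambda run: run[2])


# Hypothetical toy data: one blob near (0, 0) and one near (5, 5).
data_rng = np.random.default_rng(42)
X = np.vstack([
    data_rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    data_rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])
labels, centers, total = kmeans(X, k=2)
print(np.bincount(labels))  # number of points in each group
```

In practice we use the scikit-learn model instead of writing this ourselves, but the sketch shows where `n_init` and `max_iter` enter the algorithm.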

We will use the iris dataset to demonstrate k-means in this lesson. The iris dataset has 4 features. To make visualization easier, we use principal component analysis to convert the iris dataset to two features.


---
### Slide 3

#### Principal Component Analysis(PCA)
- Reduce Dimension
- Retain major characteristics of original features
- Help visualize clusters


---
### Slide 3 Script

Principal component analysis, or PCA, is an algorithm that reduces the number of features in a dataset while retaining most of the characteristics of the original features. PCA itself is a type of unsupervised learning.

It's common practice in clustering analysis to visualize a dataset with the help of PCA. But please note that we still use the original features to do the clustering. PCA is just used to help us visualize the dataset.
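As a rough sketch of this step, the snippet below reduces the four iris features to two with scikit-learn's `PCA`. Scaling the features first with `StandardScaler` is an assumption made for this example, and the variable names are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load the four original iris features (150 rows, 4 columns).
X = load_iris().data

# Assumption for this sketch: scale the features before applying PCA.
X_scaled = StandardScaler().fit_transform(X)

# Keep only the first two principal components for plotting.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(X_2d.shape)
# Share of the original variance kept by each of the two components.
print(pca.explained_variance_ratio_)
```

The two columns of `X_2d` can be used as the x and y axes of a scatter plot, while the clustering itself still runs on the original features.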

---
### Slide 4
#### Sklearn KMeans

- Scale features
- No train-test split
- Sklearn KMeans Hyperparameters
    - `n_clusters`: number of clusters, $k$
    - `n_init`: number of times to choose different initial centers, default 10 (newer scikit-learn releases default to `'auto'`).
    - `max_iter`: maximum number of iterations for the algorithm in any given run, default 300.

### Slide 4 Script

Now let's look at some important aspects of the k-means model.

First, since k-means relies on distance calculation, scaling the features is critical.

Secondly, a train-test split is not necessary because k-means is an unsupervised learning method.

The scikit-learn cluster module has a KMeans model which accepts many hyperparameters. I'll explain the three most important hyperparameters.

The first one is n_clusters, which is the number of clusters or the value of k.

The second one is n_init, which is the number of times to choose different initial centers. The default value is 10, although newer scikit-learn releases default to 'auto'.

The third one is max_iter. This hyperparameter sets a limit on how many times the algorithm can repeat steps 3 and 4, that is, recalculating the centers and regrouping the dataset. In k-means, after the initial centers are picked, the algorithm repeats regrouping and recalculating centers until there's no change in the subgroups. But sometimes the subgroups keep changing for a very large number of iterations. To avoid a long-running process, we set max_iter. The process ends once max_iter is reached, even if the subgroups are still changing. The default value of max_iter is 300.
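To make these points concrete, here is a minimal sketch that scales the iris features and sets the three hyperparameters explicitly. The choice of `StandardScaler` and the `random_state` value are assumptions for this example, not requirements of the lesson.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Scale the features so that no single feature dominates the distances.
X = StandardScaler().fit_transform(load_iris().data)

# No train-test split: the model is fit on the whole dataset.
model = KMeans(n_clusters=3, n_init=10, max_iter=300, random_state=0)
model.fit(X)

# n_iter_ is the number of iterations used by the best run;
# it is usually far below max_iter.
print(model.n_iter_)
print(model.cluster_centers_.shape)
```

If `n_iter_` ever equals `max_iter`, the run was stopped by the limit rather than by the groups settling down.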



---


### Slide 5
#### Sklearn KMeans
```
from sklearn.cluster import KMeans

# We build our model assuming three clusters
k_means = KMeans(n_clusters=3, n_init=25, random_state=23)

# We fit our original data
k_means.fit(x)

# Obtain the predictions
y_cluster = k_means.predict(x)
```

### Slide 5 Script
The steps to apply k-means are pretty similar to those of the supervised learning models. We first create the k-means model, then train the model with the dataset, then call the predict function to divide the dataset into subgroups.

In the lesson notebook, we plot the clusters and compare them with the true clusters, because for the iris dataset we have the output, which is the species column. You are encouraged to understand the Python code that is used to make the plot, but it's not required.


### Slide 6
#### Determine K - Elbow Method
- Use KMeans
- Get total distance D for each $k$
- Plot D against k
- Find elbow
<img src='images/elbow.png' width=500>

### Slide 6 Script
K-means is pretty easy to understand. But there's a challenge. How do we know the proper value of k?

Sometimes, we can get this information from prior knowledge. For example, with the iris dataset, we know there are three different species of iris in the dataset. But when we don't have the prior information, we can find k programmatically with the elbow method.

The elbow method actually uses the k-means algorithm itself to find the best value of k. We perform k-means on the dataset for a range of k values, for example from 1 to 10. In each round, we calculate the sum of the distances from each data point to its cluster center. Scikit-learn reports this as the sum of squared distances. We then plot the total distance against the k values.

The plot in this slide is the elbow plot of the iris dataset for k from 1 to 10. The total distance decreases as k increases. The extreme case is when k equals the total number of data points in the dataset: every data point is its own cluster center, the distance from each data point to its center is 0, and so the total distance is also 0.

The best value for k is at the elbow point of this plot. In other words, we choose a k beyond which further increases in k don't reduce the total distance significantly.

For the iris dataset, as shown in the plot, the elbow appears at k equal to 3, which matches our prior knowledge.
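The data behind an elbow plot can be collected with a short loop. This is a sketch under a few assumptions: the features are scaled with `StandardScaler`, the total distance D is taken from the fitted model's `inertia_` attribute (the sum of squared distances), and the plotting call itself is left out.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

k_values = list(range(1, 11))
distances = []
for k in k_values:
    # Fit k-means for this k and record the total distance D.
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    distances.append(model.inertia_)

# These pairs are the points of the elbow plot (k on x, D on y).
for k, d in zip(k_values, distances):
    print(k, round(d, 1))
```

Plotting `distances` against `k_values` gives the elbow plot; the elbow is where the curve stops dropping sharply.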


### Slide 8
#### Explore Cluster Characteristics
<img src='images/cluster_pairplot.png' width=600>

### Slide 8 Script

After the dataset is divided into different clusters, we want to explore the characteristics of each cluster. You can calculate various statistics of the clusters, or explore the clusters visually. One of the simplest ways is to draw a pairplot of the dataset and the cluster labels.

The plot in this slide is the pairplot of the iris dataset and the cluster labels assigned by the k-means model.

From the last column of the pairplot, the 3rd and 4th rows indicate that cluster 0 has the shortest petal width and petal length. From the second row, we can see that cluster 0 has the largest sepal width.

The scatter plots in the pairplot also clearly show that cluster 0 is easily separated from the other clusters. On the other hand, there's no clear separation between cluster 1 and cluster 2.

The scatter plots are a good indicator of how good the clusters are, but we still need some metrics to evaluate the quality of the clusters.
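Besides the pairplot, simple per-cluster statistics are another way to explore the clusters. The sketch below prints the member count and the mean of each original feature for every cluster. The names `cluster_means` and `members` are made up for this example, and the cluster numbering may differ from the one in the slide.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = iris.data  # original, unscaled features for readable statistics

# Cluster on scaled features, then describe clusters in original units.
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)

cluster_means = {}
for c in np.unique(labels):
    members = X[labels == c]
    cluster_means[int(c)] = members.mean(axis=0)
    summary = dict(zip(iris.feature_names, members.mean(axis=0).round(2)))
    print(f"Cluster {c}: {len(members)} members, means {summary}")
```

A cluster whose mean petal length and petal width are clearly smaller than the others corresponds to the well-separated group seen in the pairplot.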

### Slide 9
#### Clustering Evaluation
- Adjusted Rand Index
    - Range: -1 to 1
    - Compare with true class
- Silhouette
    - Range: -1 to 1
    - Doesn't use true class
    - Compare distance within own cluster to distance to nearest other cluster

### Slide 9 Script

We've learned many different regression and classification evaluation metrics. But they're all for supervised learning, and the metrics are based on the comparison of the predictions and the true outputs.

Clustering is unsupervised learning where the learning process doesn't need the true outputs. When true outputs are available in the dataset, like the iris data set, we can compare the predicted clusters with the true outputs and get a score. 

Most of the scikit-learn clustering evaluation metrics need true outputs. The adjusted Rand index is one of them. The adjusted Rand index ranges from -1 to 1, where 1 means the clusters match the true outputs 100%.

The problem is, most clustering datasets don't have outputs at all. We need a metric that doesn't rely on the true outputs. The scikit-learn metrics module has a few such metrics, and the one we use here is the silhouette score. Instead of comparing to the true outputs, the silhouette score compares, for each datapoint, its average distance to the other datapoints in its own cluster with its average distance to the datapoints in the nearest other cluster. If the distances within the cluster are much smaller than the distances to the nearest other cluster, the clusters are well separated. The silhouette score also ranges from -1 to 1, and a larger value indicates a better clustering result.
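Both metrics are available in `sklearn.metrics`. The following sketch computes them for a k-means result on the iris dataset. Scaling the features and the `random_state` value are assumptions for this example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = StandardScaler().fit_transform(iris.data)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Adjusted Rand index: needs the true classes (the species column).
ari = adjusted_rand_score(iris.target, labels)

# Silhouette score: needs only the features and the cluster labels.
sil = silhouette_score(X, labels)

print(f"Adjusted Rand index: {ari:.3f}")
print(f"Silhouette score:    {sil:.3f}")
```

Note that the adjusted Rand index does not care how the clusters are numbered, so cluster 0 does not have to correspond to species 0.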

---
## Lesson 2: K-means Case Study



---
### Slide 1
#### Credit Card Data
<img src='images/credit_card_dataset.png' width=600>

---
### Slide 1 Script
(face)

In this lesson, we will have a k-means case study on the credit card dataset. 

(slide)

The credit card dataset is downloaded from Kaggle, which is an online community of data scientists. You can find a lot of datasets, data analytics articles and source code on Kaggle.

The credit card dataset summarizes the usage behavior of about 9000 active credit card holders over a 6-month period. The original dataset has 18 features. We will only pick 5 features in this lesson. The five features are:

Our goal is to divide the credit card users into different subgroups.


---
### Slide 2
#### Determine K
<img src='images/elbow_creditcard.png' width=500>


---
### Slide 2 Script

Since we don't have any idea how many subgroups there should be in the dataset, we will use the elbow plot to help us determine the value of k.

Unfortunately, there's no clear elbow in the plot. This is very common in clustering analysis. Real-world datasets rarely have clear cluster separations like what we saw in the iris dataset.

In this case, we will pick k equal to 6.

---
### Slide 3
#### Clusters Member Counts

Cluster 0     : 3436 members  
Cluster 1     : 2572 members  
Cluster 2     : 1270 members  
Cluster 3     : 1233 members  
Cluster 4     :   56 members  
Cluster 5     :   69 members  

---
### Slide 3 Script
Once we determine the value of k, it's very simple to apply the k-means model to the dataset.

This is the member count of each cluster identified by the k-means model. Clusters 4 and 5 are negligible since their counts are too small compared to the total size of the dataset.
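Member counts like these can be produced from the cluster labels with `np.bincount`. The credit card data is not included here, so the sketch below uses a hypothetical stand-in dataset from `make_blobs` with 5 features. Only the counting step is the point of the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the 5 selected credit card features.
X, _ = make_blobs(n_samples=500, n_features=5, centers=4, random_state=0)
X = StandardScaler().fit_transform(X)

labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(X)

# Count how many data points were assigned to each of the 6 clusters.
counts = np.bincount(labels, minlength=6)
for cluster, count in enumerate(counts):
    share = count / len(labels)
    print(f"Cluster {cluster}: {count:4d} members ({share:.1%})")
```

Printing the share next to the count makes it easy to spot clusters that are too small to matter.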

In the lesson notebook, we explore the clusters with the help of the pairplot. The pairplot indicates that there's no clear separation between clusters. But we can still find some useful information. Please refer to the notebook for more details.


### Slide 4(Not needed)
#### Explore Clusters with Pair Plot
<img src='images/creditcard_pairplot.png' width=600>

### Slide 4 Script(Not needed)

The pairplot indicates that there's no clear separation between clusters. But we can still find some useful information. For example, the first row and last column show that most of the purchases in cluster 1 are cash-in-advance purchases, which means the users in this cluster made a lot of online purchases. These users are a good marketing target for online businesses.

I'm not going to spend too much time on this plot since the individual plots are too small to see in the slide. Please refer to the lesson notebook for more coding and analysis details of this case study.

---
### Slide 5
#### Visualize Data
<img src='images/pca_creditcard.png' width=500>

---
### Slide 5 Script

This is a 2-dimensional scatter plot of the credit card dataset. We first apply PCA to reduce the dataset to two features, then draw the scatter plot. From the plot we can see that there isn't any clear separation in the dataset. One possible reason could be that we don't have enough information in the dataset. You may try downloading the original credit card data from Kaggle and repeating the case study to see if you can get a better clustering result.


---
## Lesson 3: Introduction to Density-Based Clustering

- DBSCAN
- Scikit Learn DBSCAN Model
- Hyperparameter Estimation
- Kmeans vs. DBSCAN


### Lesson 3 Script
(face)
In the previous two lessons, we introduced the k-means algorithm, which is a simple and fast clustering method. The biggest challenge in k-means is that it requires knowledge of the number of clusters, or the value of k, ahead of time.

In this lesson, we will introduce a density-based clustering method, DBSCAN, which stands for Density-Based Spatial Clustering of Applications with Noise. To understand this confusing name, let's first take a look at some basic concepts in DBSCAN.

---
### Slide 1
#### DBSCAN
- Hyperparameters
    - `eps`: Max distance between two points for them to be considered in the same neighborhood
    - `min_samples`: Number of points that must lie within the neighborhood of the current point in order for it to be considered a _core point_.

- Point Categories
    - **Core point**: at least `min_samples` points within distance `eps`.
    - **Border point**: not a core point, but within `eps` of at least one core point.
    - **Noise point**: neither a core point nor a border point.

- Clusters
    - All core points that are reachable from each other form a cluster. All border points of these core points also belong to the cluster.


### Slide 1 Script

In the k-means model, the most important hyperparameter is the value of k. DBSCAN doesn't require the value of k, but it requires two other hyperparameters, epsilon and min_samples.

Epsilon is a distance value. If the distance between two data points is less than epsilon, they are considered to be in the same neighborhood.

min_samples is an integer which is used to define a core point.

A core point is a datapoint that has at least min_samples neighbors.

If a datapoint is not a core point, but it's within the distance epsilon of any of the core points, it's a border point.

If a datapoint is neither a core point nor a border point, it's a noise point.

Two datapoints are reachable when their distance is less than epsilon. All core points that are reachable from each other form a cluster. Border points belong to the cluster of their closest core points.
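The three point categories can be worked out directly from pairwise distances. The sketch below is an illustration in NumPy, not the scikit-learn implementation. The function name `classify_points` and the seven toy points are made up, and a point exactly at distance epsilon is assumed to count as a neighbor.

```python
import numpy as np


def classify_points(X, eps, min_samples):
    """Label each point as core, border or noise (illustrative helper)."""
    # Pairwise distances between all points.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # neighbors[i, j] is True when j lies within eps of i.
    # Each point counts as its own neighbor.
    neighbors = dists <= eps
    is_core = neighbors.sum(axis=1) >= min_samples
    # Border: not a core point, but within eps of at least one core point.
    near_core = (neighbors & is_core[None, :]).any(axis=1)
    is_border = ~is_core & near_core
    # Noise: neither core nor border.
    is_noise = ~is_core & ~is_border
    return is_core, is_border, is_noise


# Toy data: five tightly packed points, one point on the edge,
# and one point far away from everything.
X = np.array([
    [0.00, 0.00], [0.10, 0.00], [0.00, 0.10], [0.10, 0.10], [0.05, 0.05],
    [0.55, 0.00],
    [3.00, 3.00],
])
is_core, is_border, is_noise = classify_points(X, eps=0.5, min_samples=4)
print("core:  ", np.where(is_core)[0])
print("border:", np.where(is_border)[0])
print("noise: ", np.where(is_noise)[0])
```

With these settings the five packed points are core points, the edge point is a border point because it has only three neighbors but is within epsilon of a core point, and the far point is noise.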

Let's use an image to understand these concepts.


### Slide 2
### DBSCAN
<img src='https://3.bp.blogspot.com/-rDYuyg00Z0w/WXA-OQpkAfI/AAAAAAAAI_I/QshfNVNHD_wXJwXEipRIVzDSX5iOEAy2wCEwYBhgL/s1600/DBSCAN_Points.PNG'>
<img src='https://miro.medium.com/max/2174/0*bUyZlx3rbNneiUA_'>


### Slide 2 Script

In this image, min_samples is 4 and epsilon is the radius of the circles. The red dot is a core point because it has 6 neighbors within the distance epsilon. There are 6 neighbors because the datapoint itself is counted as one neighbor.

The blue dot is a border point because it is not a core point since it only has two neighbors, but it's reachable to a core point. The yellow dot is neither core point nor border point, so it's a noise point.

(next slide)

In this image, all the red dots are core points and they are reachable from each other. The blue dots are border points because they are reachable from at least one core point. All the reachable core points and their border points form a cluster.

So in the image, these dots are in the same cluster and this dot is a noise point.

### Slide 3
#### Sklearn DBSCAN
```
from sklearn.cluster import DBSCAN

# Apply DBSCAN
db = DBSCAN(eps=0.575, min_samples=10)
db.fit(x)
```

### Slide 3 Script

The scikit-learn cluster module defines the DBSCAN model. It's very simple to perform DBSCAN clustering. As shown in this slide, you just need to provide the epsilon and min_samples values and call the fit function to train on the dataset.
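After fitting, the results are read from attributes of the model. The sketch below assumes the dataset is the scaled iris data, which is a guess made for this example. The `eps` and `min_samples` values are the ones from the slide.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Assumption for this sketch: cluster the scaled iris features.
X = StandardScaler().fit_transform(load_iris().data)

db = DBSCAN(eps=0.575, min_samples=10)
db.fit(X)

# labels_ holds the cluster of each point; noise points are labeled -1.
labels = db.labels_
n_clusters = len(set(labels.tolist())) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
# core_sample_indices_ lists the row indices of the core points.
n_core = len(db.core_sample_indices_)

print(f"clusters: {n_clusters}, core points: {n_core}, noise points: {n_noise}")
```

Unlike k-means, the number of clusters is an output of the model here, not an input.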

The problem is, how do we determine the values of epsilon and min_samples?

### Slide 4
#### Hyperparameter Estimation

- `min_samples`
    - Greater than number of features, e.g. 2 times
- `eps`
    - small `eps` preferable
    - k-distance graph

### Slide 4 Script

There are a couple of general rules for choosing the hyperparameter values.

min_samples can't be too small. A rule of thumb is two times the number of features in the dataset.

For epsilon, a smaller value is preferred. And we can estimate the proper epsilon with the k-distance graph.

### Slide 5
#### K-distance Graph

- Plot distance to nearest neighbor for all data points
- Choose a distance at elbow that covers majority of data points

<img src='images/k_distance.png' width=500>

### Slide 5 Script

This is the k-distance graph. The y axis is the distance of each datapoint to its nearest neighbor. We sort the distances of all datapoints to their nearest neighbors, then make the k-distance plot. The x axis in the plot represents the sorted index of the datapoints. 

For example, the plot in this slide is the k-distance plot of the iris dataset. There are 150 datapoints in the iris dataset, so there are 150 points on the plot line. On the plot line, when x equals 0, y is also about 0, which means that the distance between this datapoint and its nearest neighbor is 0. When x equals 140, y is about 0.6, which means that the distance from this datapoint to its nearest neighbor is about 0.6, and there are 140 datapoints that are within 0.6 of their nearest neighbors.

The elbow in the k-distance graph indicates the distance within which the majority of data points lie from their nearest neighbors. This elbow is the estimate of the epsilon value. In this case, epsilon is between 0.5 and 0.6.
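The values behind a k-distance graph can be computed with scikit-learn's `NearestNeighbors`. This is a sketch that may differ from the lesson notebook: it assumes the scaled iris features and leaves out the plotting call.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

# Calling kneighbors() without arguments returns, for every training
# point, its nearest neighbor other than the point itself.
nn = NearestNeighbors(n_neighbors=1).fit(X)
distances, _ = nn.kneighbors()

# Sort the nearest-neighbor distances; this is the y axis of the plot.
k_dist = np.sort(distances[:, 0])

print(k_dist[:5])   # smallest distances
print(k_dist[-5:])  # largest distances
```

Plotting `k_dist` against its index gives the k-distance graph, and the y value at the elbow is the estimate for `eps`.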

You can find the python code to plot the k-distance graph in the lesson notebook. It's fairly straightforward, please try to understand it.


### Slide 6
#### k-Means vs. DBSCAN

#### k-Means works well:
<img src='images/k_means_good.png' width=400>

#### k-Means works poorly but DBSCAN works well:
<img src='images/dbscan_good.png' width=400>


### Slide 6 Script

Now we've learned two clustering algorithms, k-means and DBSCAN. A natural question is, which one should we use?

Well, the answer to this question totally depends on the dataset.

In the left image in this slide, the clusters in the dataset have spherical-like shapes, the clusters can be constructed nicely around the centers. For this kind of dataset, k-means works better.

In the right image, however, the clusters in the dataset have a special shape. It's not possible to separate the two clusters with fixed centers. But dbscan can easily separate the clusters since the data points in the same cluster are reachable to each other. In this case, dbscan is better.
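The difference can be reproduced on a synthetic dataset. The sketch below uses scikit-learn's `make_moons`, which generates two interleaving half circles. It is an illustrative stand-in and may not be the dataset shown in the slide, and the `eps` and `min_samples` values are assumptions chosen for this toy data.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half circles; y holds the true group of each point.
X, y = make_moons(n_samples=400, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Compare each result with the true groups.
ari_km = adjusted_rand_score(y, km_labels)
ari_db = adjusted_rand_score(y, db_labels)

print(f"k-means adjusted Rand index: {ari_km:.2f}")
print(f"DBSCAN adjusted Rand index:  {ari_db:.2f}")
```

On this kind of shape, k-means cuts both half circles with a straight boundary, while DBSCAN follows each half circle because neighboring points are reachable from each other.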

So visualizing the dataset can help us make the choice. But to visualize a dataset, we first have to reduce the dataset to two features, which we can do by using principal component analysis, introduced in lesson 1.

Another notable issue is that dbscan performs better when the clusters have similar density, because there's just one epsilon value for all clusters.

If you recall the case study on the credit card data we did in lesson 2, the scatter plot of the credit card data looks like this.



### Slide 7
<img src='images/pca_creditcard.png' width=500>

### Slide 7 Script

The density of the data points is apparently not consistent. There are also no special shapes. So even though k-means doesn't do a good job of separating the clusters, it's still considered the better choice between the two clustering algorithms for this dataset.

# Module 7 Review

### Slide 1
- K-means
    - n_clusters
    - elbow method
- DBSCAN
    - eps, min_samples
    - k-distance graph
- Visualize dataset
    - PCA

---
## Review Script

In this module we learned two popular unsupervised learning algorithms, k-means and DBSCAN.

You need to understand how they work and the difference between the two algorithms.

k-means forms clusters with a predetermined value of k, the number of clusters. It works better with datasets that have spherical-like cluster shapes.

The elbow plot helps to determine value of k.

DBSCAN forms clusters by connecting core points and border points. It doesn't require the number of clusters in advance, but it requires epsilon, which is the distance used to define neighbor points, and min_samples, which is used to define core points. DBSCAN works better when the clusters in a dataset have special shapes and similar density.

The k-distance graph helps to determine value of epsilon.

A scatter plot with the help of PCA is a good way to visualize the dataset and will help to determine which algorithm to choose.

Now I'd like to talk about the assignment a little bit.

One problem in the assignment asks you to prepare data for an elbow plot. You don't need to write the plot code, just create the data for the plot. There's complete elbow plot code in the lesson 1 notebook. If you understand the code in the lesson notebook and follow the problem instructions closely, you should have no trouble finishing the problem.

Other problems are fairly straightforward. Just remember to work on the problems in order.

Good luck.