![image.png](attachment:image.png)

## Clustering:
- What is clustering? Example
- Types of clustering
- K-Means clustering algorithm
- Choosing the right number of clusters (K)
- Python implementation of the K-Means clustering algorithm
- Visualization
- Conclusions

# Clustering - Making Sense of Unlabeled Data
- The process of dividing a dataset into groups of similar data points is called clustering.
- Clustering is an unsupervised learning technique: it works on data that has no predefined labels.
- Clustering is often used for market segmentation, social network analysis, search result grouping, medical imaging, and anomaly detection.
![image.png](attachment:image.png)

# K-Means Clustering Algorithm

![image.png](attachment:image.png)

![image.png](attachment:image.png)

## Application
- Clustering is a widely used machine learning approach. 
- It is sometimes used in the domain of network security as a way to detect anomalous behavior in computer networks. 
- Clustering is also sometimes used to automatically group or categorize documents based on similarity. 
- Another common use of clustering is to segment customers for marketing purposes. 

- To illustrate this use case, let's assume that we work for a small credit union and that we have the data shown here for each of our card customers. 
- Our objective with this data could be to group or segment these customers based on spending score and income level.
- Using a scale of low to high for both variables, we can represent the data this way in 2 dimensional space. 
- A clustering algorithm could then assign each customer to a group based on how similar they are to other customers. 

![image-2.png](attachment:image-2.png)

![image.png](attachment:image.png)

- In this example, the algorithm assigned each customer to one of three clusters.
- This means that customers who end up in the same cluster are very similar in terms of their income and spending habits.
- These clusters have no intrinsic meaning other than that they represent closely related customers.
- It is up to us to assign contextual labels to each of the clusters.

![image.png](attachment:image.png)

- For example, we could decide to assign the labels alpha, beta, and theta to the three clusters.
- By doing this, we are implicitly assigning labels to every item within each cluster.
- Because clustering lets us apply labels to previously unlabeled data in this way, it is sometimes referred to as unsupervised classification.

![image.png](attachment:image.png)

![image.png](attachment:image.png)

- There are several approaches to clustering, each with its own strengths and weaknesses.
- The type of clustering one chooses often depends on the characteristics of the data and the type of clusters needed.

# Types of Clustering

### Hierarchical Clustering
- Clustering can be hierarchical. With hierarchical clustering, clusters are nested within each other.
- This means that the boundaries of a particular cluster can fall within the boundaries of another cluster, creating a parent-child relationship.
- This nested structure creates a hierarchy that is often represented as a cluster tree known as a dendrogram.
![image.png](attachment:image.png)
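
The nesting described above can be sketched with a bottom-up (agglomerative) procedure that keeps merging the two closest clusters. The text does not name a specific algorithm, so the single-linkage rule, the function names, and the sample points below are assumptions chosen for illustration only.

```python
import math

def single_linkage(cluster_a, cluster_b):
    """Distance between two clusters: the closest pair of points across them."""
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)

def agglomerate(points):
    """Repeatedly merge the two closest clusters and record every merge."""
    clusters = [(p,) for p in points]  # start with one cluster per point
    merges = []
    while len(clusters) > 1:
        # find the pair of clusters with the smallest linkage distance
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda pair: single_linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        parent = clusters[i] + clusters[j]
        # replace the two child clusters with their parent cluster
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [parent]
    return merges

points = [(1, 1), (1, 2), (5, 5), (7, 5), (11, 9)]  # hypothetical data
merges = agglomerate(points)
for left, right in merges:
    print(left, "+", right)
```

Each recorded merge is one parent node with two children, so the list of merges, read from first to last, is the information a dendrogram draws from the leaves up to the root.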

### Partitional Clustering
- Clustering can be partitional. With partitional clustering, each cluster boundary is independent of the others.
- There is no hierarchy between clusters.
- Partitional clustering divides data objects into non-overlapping groups.
- In other words, no object can be a member of more than one cluster, and every cluster must have at least one object.
![image.png](attachment:image.png)


### Overlapping Clustering
- Clustering can also be overlapping. As the name implies, an overlapping cluster is one whose boundaries can overlap with those of other clusters.
- This means that each item in the dataset can belong to one or more clusters.
- Overlapping clustering differs from hierarchical clustering in that hierarchical clustering has a parent-child relationship between clusters.
- The boundaries of a child cluster must always be within the boundaries of the parent cluster. There is no parent-child relationship with overlapping clustering.
![image.png](attachment:image.png)

### Fuzzy or Soft Clustering
- Another approach to clustering is fuzzy or soft clustering.
- With soft clustering, the membership of an item in a particular cluster is expressed as a membership weight that ranges from 0 to 1.
- The larger the weight, the greater the likelihood that the item belongs to that cluster.
- If the weight is 0, then the item absolutely does not belong to the cluster.
- If the weight is 1, then the item absolutely does belong to the cluster in question.
![image.png](attachment:image.png)
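
To make the idea of a membership weight concrete, here is a minimal sketch that turns distances to cluster centers into weights between 0 and 1. It assumes the weighting rule used by fuzzy c-means (with fuzzifier `m = 2`), which is only one of several possible rules and is not specified in the text; the function name and the two centers are hypothetical.

```python
import math

def membership_weights(point, centers, m=2.0):
    """Return one weight per center; under this rule the weights sum to 1."""
    distances = [math.dist(point, c) for c in centers]
    # a point sitting exactly on a center belongs fully to that center
    if 0.0 in distances:
        hits = [1.0 if d == 0.0 else 0.0 for d in distances]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    # closer centers receive larger weights
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
        for d_i in distances
    ]

centers = [(0.0, 0.0), (4.0, 0.0)]  # hypothetical cluster centers
weights = membership_weights((1.0, 0.0), centers)
print([round(w, 2) for w in weights])  # [0.9, 0.1]
```

The point is three times closer to the first center than to the second, so it receives a weight of 0.9 for the first cluster and 0.1 for the second.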

### Density-Based Clustering
- Clustering can also be density-based.
- Density-based clustering determines cluster assignments based on the density of data points in a region.
- Clusters are assigned where there are high densities of data points, separated by low-density regions.
![image.png](attachment:image.png)
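
One simple way to express "high density" is to count how many neighbors a point has within a fixed radius. The sketch below does only that; it is loosely in the spirit of the core-point test used by algorithms such as DBSCAN, but it is not a full clustering algorithm, and the `radius`, `min_neighbors`, and sample points are assumptions made for illustration.

```python
import math

def neighbors_within(points, index, radius):
    """Indices of the other points that lie within `radius` of points[index]."""
    return [
        j for j, q in enumerate(points)
        if j != index and math.dist(points[index], q) <= radius
    ]

def dense_points(points, radius, min_neighbors):
    """Indices of points that have at least `min_neighbors` close neighbors."""
    return [
        i for i in range(len(points))
        if len(neighbors_within(points, i, radius)) >= min_neighbors
    ]

points = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8)]  # hypothetical data
dense = dense_points(points, radius=1.5, min_neighbors=2)
print(dense)  # [0, 1, 2, 3]
```

The four points packed together count as dense, while the isolated point at (8, 8) does not and would sit in a low-density region.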

## K-Means Clustering
- One of the most commonly used clustering techniques is known as K-Means clustering.
- K-Means clustering is a partitional clustering approach.
- This means that the cluster boundaries are independent of each other.
- Each item can only belong to one cluster, and every item is assigned to a cluster.
- With K-Means clustering, we start by specifying how many clusters, K, we want.
- Then the algorithm uses a process known as expectation-maximization to assign every item in the dataset to one and only one of K non-overlapping clusters based on similarity.
![image.png](attachment:image.png)

![image.png](attachment:image.png)

- To illustrate how K-Means clustering works, let's imagine that we have a data set as represented here with 12 instances and two features, A and B. 
![image.png](attachment:image.png)

- If our goal is to partition this data into three separate clusters, we set the value of K to three and let the K-Means clustering algorithm do the rest. 
- The first thing that the algorithm does is choose K random points in the feature space that serve as initial centers for the clusters. These initial centers are represented by points C1, C2, and C3 as shown here. 
- Note that these initial centers are randomly chosen and do not have to be one of the points from the original data.
![image.png](attachment:image.png)
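
The initialization step can be sketched as follows. The text only says that the centers are random points in the feature space, so drawing them uniformly from the bounding box of the data is an assumption, as are the function name, the `seed` parameter, and the sample points.

```python
import random

def random_initial_centers(points, k, seed=None):
    """Pick k random locations inside the bounding box of the data."""
    rng = random.Random(seed)  # a seed makes the random choice repeatable
    dims = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]
    return [
        tuple(rng.uniform(lows[d], highs[d]) for d in range(dims))
        for _ in range(k)
    ]

points = [(1, 2), (2, 1), (5, 6), (6, 5), (9, 9), (8, 10)]  # hypothetical data
centers = random_initial_centers(points, k=3, seed=42)
print(centers)
```

The returned centers generally do not coincide with any of the original data points, which matches the note above.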

- After choosing the initial cluster centers, the algorithm assigns each item to the center that is closest to it.
- This is the expectation phase of the expectation-maximization process.
- To determine the cluster center closest to a particular point, the K-Means algorithm calculates the Euclidean distance between that point and each of the three cluster centers.
- Euclidean distance is the straight-line distance between the coordinates of two points in multidimensional space.
- The Euclidean distance between any two points, A and B, is as shown here.
![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)
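
For two points $A = (a_1, \dots, a_n)$ and $B = (b_1, \dots, b_n)$, the Euclidean distance is $d(A, B) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$. A minimal sketch of this formula and of the nearest-center assignment follows; the function names and the three center coordinates are hypothetical.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a_i - b_i) ** 2 for a_i, b_i in zip(a, b)))

def nearest_center(point, centers):
    """Index of the center that is closest to `point`."""
    return min(
        range(len(centers)),
        key=lambda i: euclidean_distance(point, centers[i]),
    )

distance = euclidean_distance((3, 3), (4, 2))
print(round(distance, 3))  # 1.414

centers = [(1, 1), (4, 2), (8, 8)]  # hypothetical centers C1, C2, C3
closest = nearest_center((3, 3), centers)
print(closest)  # 1, i.e. the second center
```

Running `nearest_center` for every item in the dataset is the whole of the expectation phase.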

![image.png](attachment:image.png)

- For example, given a data point with (X,Y) coordinates of (3,3) and a cluster center with (X,Y) coordinates of (4,2), the Euclidean distance between the data point and the cluster center is $\sqrt{(3-4)^2 + (3-2)^2} = \sqrt{2} \approx 1.414$.
- Getting back to our illustration, with each item assigned to a cluster, the K-Means algorithm then proceeds to compute a new centroid for each cluster.
![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)

- This is the maximization phase of the expectation-maximization process.
- The cluster centroid is the average position of the items within the cluster.
- The cluster centroid for any three points, A, B, and C, is calculated as shown here.
- For example, given the three points in the red cluster with (X,Y) coordinates as shown, the cluster centroid is a point with (X,Y) coordinates of (2,5).
- Getting back to our illustration, after the new cluster centers are calculated, the K-Means algorithm reassigns each item to the cluster whose centroid is closest to it.
- This has the effect of shifting some points from one cluster to another. As the process of expectation and maximization is repeated, the shape of each cluster evolves as cluster centroids shift and items are reassigned to new clusters.
- Eventually, the clusters reach a state of convergence.
- This is the point at which the cluster centroids no longer shift, which also means that no further item reassignments occur. At this point, the process terminates and we get a final cluster assignment for each item in the dataset.
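
The full loop described above (assign, recompute centroids, repeat until nothing moves) can be sketched in a few lines. This is a minimal illustration rather than a production implementation: the function names, the starting centers, the sample points, and the rule that an empty cluster keeps its old center are all assumptions. The three points passed to `centroid` are also hypothetical, chosen so that their average is (2, 5).

```python
import math

def centroid(points):
    """Average position of a group of points."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))

def assign(points, centers):
    """Expectation step: index of the nearest center for every point."""
    return [
        min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        for p in points
    ]

def update(points, labels, centers):
    """Maximization step: move each center to the centroid of its members."""
    new_centers = []
    for i, old in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == i]
        # an empty cluster keeps its old center (one simple convention)
        new_centers.append(centroid(members) if members else old)
    return new_centers

def k_means(points, centers, max_iterations=100):
    """Alternate both steps until the centers stop moving or the limit is hit."""
    for _ in range(max_iterations):
        labels = assign(points, centers)
        new_centers = update(points, labels, centers)
        if new_centers == centers:  # convergence: no centroid shifted
            break
        centers = new_centers
    return labels, centers

print(centroid([(1, 4), (2, 6), (3, 5)]))  # (2.0, 5.0)

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]  # hypothetical data
labels, centers = k_means(points, centers=[(0, 0), (10, 10)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

On this small dataset the loop converges on the second pass, because recomputing the centroids no longer changes any assignment.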

![image.png](attachment:image.png)

# Choosing the right number of clusters
![image.png](attachment:image.png)

# Elbow method - WCSS
![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)
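
WCSS stands for within-cluster sum of squares: the sum, over all clusters, of the squared distances between each point and the centroid of its cluster. A minimal sketch of that calculation follows. The sample points and the hand-written cluster labels are hypothetical; in practice the labels would come from running K-Means once for each candidate value of K.

```python
def wcss(points, labels):
    """Sum of squared distances from each point to its own cluster centroid."""
    total = 0.0
    for label in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == label]
        dims = len(members[0])
        center = [sum(p[d] for p in members) / len(members) for d in range(dims)]
        total += sum((p[d] - center[d]) ** 2 for p in members for d in range(dims))
    return total

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]  # hypothetical data
one_cluster = wcss(points, [0, 0, 0, 0, 0, 0])     # K = 1
two_clusters = wcss(points, [0, 0, 0, 1, 1, 1])    # K = 2
three_clusters = wcss(points, [0, 0, 0, 1, 1, 2])  # K = 3
print(round(one_cluster, 2), round(two_clusters, 2), round(three_clusters, 2))
```

WCSS drops sharply when going from one cluster to two and only slightly from two to three. Plotting WCSS against K, that bend in the curve is the "elbow", which here suggests K = 2.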