#### Clustering
- Clustering is an unsupervised learning technique.
- It is the process of partitioning a set of data into meaningful sub-classes, called clusters.

Applications of clustering:
- Pattern recognition
- Spatial data analysis
- Image processing
- Document classification
- Identification of similar entities
- Finding patterns in weather behaviour

Why we need clustering:
- Clustering can be used for the following tasks:
    - Analyzing research data
    - Creating a summary
    - Detecting noise
    - Duplication detection

Algorithms of clustering:
- K-means clustering
- Hierarchical clustering
- Density-based clustering
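
As a rough illustration of the three families listed above, the sketch below runs one scikit-learn estimator from each on the same toy data. The dataset, the blob centres, and the `eps` and `min_samples` values are assumptions chosen only so that the groups are easy to separate; they are not taken from the notebook.

```python
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_blobs

# Toy data (assumed for illustration): three tight, well-separated 2D blobs.
X_demo, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# K-means: needs the number of clusters up front.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)

# Hierarchical (agglomerative): merges the closest groups until 3 remain.
hier_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X_demo)

# Density-based: no cluster count; points in sparse regions get label -1 (noise).
dbscan_labels = DBSCAN(eps=1.5, min_samples=5).fit_predict(X_demo)

for name, labels in [
    ("K-means", kmeans_labels),
    ("Hierarchical", hier_labels),
    ("DBSCAN", dbscan_labels),
]:
    n_found = len(set(labels) - {-1})  # ignore the DBSCAN noise label
    print(f"{name}: {n_found} clusters")
```

On data this clean all three approaches should agree; they tend to differ on clusters with irregular shapes, varying densities, or outliers.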


#### K-means clustering
- It is an iterative algorithm that divides an unlabeled dataset into k different clusters in such a way that each data point belongs to only one group, whose members have similar properties.
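
The iteration can be sketched in plain NumPy as two alternating steps: assign every point to its nearest centroid, then move each centroid to the mean of its points. This is a minimal sketch of the standard (Lloyd's) procedure, not scikit-learn's implementation; the function name `kmeans_sketch`, the random initialisation, and the sample points are assumptions made for illustration.

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    """Minimal K-means loop (hypothetical helper, for illustration only)."""
    rng = np.random.default_rng(seed)
    # Initialise the centroids with k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: distance from every point to every centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        # An empty cluster keeps its previous centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two obvious groups of three points each (assumed sample data).
points = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
])
labels, centroids = kmeans_sketch(points, k=2)
print(labels)
print(centroids)
```

Each point receives exactly one label, which is what makes the result a partition. The label numbers themselves are arbitrary and depend on the initialisation.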

Characteristics of K-means clustering:
- It is a partition-based clustering algorithm.
- K-means divides data into non-overlapping groups, without any cluster-internal structure.
- Within a cluster, examples are very similar.
- Examples in different clusters are quite different.
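
These properties can be checked numerically. The sketch below, on an assumed synthetic dataset, confirms that every point gets exactly one label and compares each point's distance to its own centroid with its distance to the nearest other centroid. It relies on `KMeans.transform`, which returns the distance from each sample to each cluster centre.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumed demo data: 300 points in 2D drawn around 3 centres.
X_demo, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=42)
model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_demo)

# One label per point: the clusters do not overlap.
print("labels shape:", model.labels_.shape)

# Distance of every point to every centroid, shape (300, 3).
dist_to_centroids = model.transform(X_demo)
rows = np.arange(len(X_demo))

# Distance to the centroid of the cluster the point was assigned to.
own = dist_to_centroids[rows, model.labels_]

# Distance to the closest centroid of any other cluster.
masked = dist_to_centroids.copy()
masked[rows, model.labels_] = np.inf
nearest_other = masked.min(axis=1)

print("mean distance to own centroid:    ", own.mean())
print("mean distance to nearest other one:", nearest_other.mean())
```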

Determination of K in K-means clustering
- The performance of the K-means clustering algorithm depends on the quality of the clusters it forms, so choosing the optimal number of clusters is an important and non-trivial task.

**Elbow method** 
- It uses the WCSS value (Within-Cluster Sum of Squares), which measures the total variation within the clusters: the sum of the squared distances between each point and the centroid of its cluster.
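
To make the WCSS value concrete, the sketch below computes it by hand as the sum of squared distances from each point to its cluster centroid, and compares the result with the `inertia_` attribute that scikit-learn's `KMeans` exposes for the same quantity. The dataset and the choice of 4 clusters are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumed demo data: 200 points with 3 features around 4 centres.
X_demo, _ = make_blobs(n_samples=200, centers=4, n_features=3, random_state=0)
model = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_demo)

# WCSS: for each cluster, add up the squared distances to its centroid.
wcss = 0.0
for j, centroid in enumerate(model.cluster_centers_):
    members = X_demo[model.labels_ == j]
    wcss += np.sum((members - centroid) ** 2)

print("manual WCSS:    ", wcss)
print("KMeans.inertia_:", model.inertia_)
```

The two numbers should agree up to floating-point rounding, so `inertia_` can be used directly when WCSS is needed.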

Steps of Elbow method:
- It runs K-means clustering on the given dataset for different values of K (typically ranging from 1 to 10).
- For each value of K, it calculates the WCSS value.
- It plots a curve of the calculated WCSS values against the number of clusters K.
- The point where the curve bends sharply, so that the plot looks like an arm bent at the elbow, is taken as the best value of K.
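
The steps above can be sketched as follows. This is an illustrative version that builds its own synthetic data with `make_blobs`, so the variable names, the fixed `random_state`, and the dataset itself are assumptions; `inertia_` is the WCSS of the fitted model.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumed demo data: 800 points with 3 features around 4 centres.
X_demo, _ = make_blobs(n_samples=800, n_features=3, centers=4, random_state=0)

# Steps 1 and 2: fit K-means for K = 1..10 and record the WCSS of each fit.
k_values = list(range(1, 11))
wcss_values = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    wcss_values.append(model.inertia_)

# Step 3: plot WCSS against K.
plt.plot(k_values, wcss_values, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS")
plt.title("Elbow method")
plt.show()

# Step 4: read the elbow off the plot, i.e. the K after which
# adding more clusters reduces the WCSS only slightly.
```

For well-separated blobs the bend usually appears near the number of centres used to generate the data, but on real datasets the elbow can be ambiguous and is a judgment call.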

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D # 3d plotting tool
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs # generate synthetic clustered data

plt.rcParams['figure.figsize'] = (16,9) # plt.rcParams modifies global Matplotlib settings.
# Generate Synthetic Dataset with 4 Clusters
X,y = make_blobs(n_samples=800, n_features=3, centers=4) 
# make_blobs() creates a random dataset with multiple clusters.
# n_samples=800 → 800 data points will be generated.
# n_features=3 → Each data point has 3 features (for 3D plotting).
# centers=4 → The dataset contains 4 clusters.
X


array([[-2.81031454,  3.7962009 ,  6.74904045],
       [-1.93606152,  3.9581727 , -9.67625225],
       [ 2.28793664, -4.29972926, -6.39700659],
       ...,
       [ 3.34258449, -4.3693951 , -8.26538745],
       [ 1.05381772, -7.03249685, -5.32037082],
       [-1.74086639,  2.79685687, -8.73532727]])

In [3]:
y

array([1, 2, 0, 1, 3, 1, 3, 3, 2, 2, 1, 2, 1, 3, 2, 1, 3, 1, 3, 2, 1, 0,
       2, 3, 2, 0, 2, 0, 1, 1, 1, 1, 3, 1, 0, 1, 1, 0, 1, 1, 1, 3, 1, 1,
       0, 2, 3, 3, 2, 0, 2, 2, 2, 0, 2, 2, 0, 1, 2, 0, 0, 3, 0, 1, 0, 1,
       3, 1, 3, 2, 1, 2, 2, 2, 1, 1, 1, 0, 1, 2, 3, 3, 0, 1, 2, 0, 1, 0,
       3, 1, 3, 1, 3, 0, 0, 2, 2, 2, 2, 2, 1, 1, 2, 0, 0, 3, 0, 2, 2, 2,
       1, 1, 1, 3, 2, 3, 3, 3, 3, 2, 1, 0, 3, 0, 2, 3, 3, 3, 0, 2, 0, 2,
       0, 2, 1, 2, 0, 3, 0, 1, 2, 1, 2, 3, 1, 2, 0, 3, 2, 2, 1, 3, 2, 2,
       3, 0, 0, 0, 3, 1, 2, 3, 3, 0, 1, 1, 0, 3, 3, 2, 1, 2, 1, 2, 0, 0,
       0, 0, 1, 0, 2, 3, 1, 3, 1, 3, 1, 0, 2, 2, 1, 1, 2, 1, 1, 0, 1, 1,
       0, 1, 1, 3, 3, 2, 0, 2, 1, 1, 1, 1, 3, 3, 3, 3, 2, 3, 1, 2, 3, 3,
       1, 2, 3, 1, 2, 0, 3, 3, 2, 2, 2, 3, 1, 3, 2, 1, 2, 0, 1, 0, 0, 2,
       1, 3, 0, 2, 1, 0, 3, 2, 3, 0, 2, 1, 2, 0, 0, 1, 1, 2, 2, 1, 1, 0,
       3, 1, 2, 0, 1, 3, 1, 0, 1, 0, 2, 1, 3, 1, 0, 3, 0, 1, 3, 2, 1, 2,
       3, 2, 1, 3, 3, 3, 1, 0, 3, 1, 1, 3, 3, 2, 1,