# Clustering with K-Means

**Clustering** is an unsupervised learning technique used to group similar data points into clusters. One of the most popular clustering algorithms is **K-Means**. K-Means partitions the data into **K** clusters, where each data point belongs to the cluster with the nearest mean. It's widely used in data analysis, image compression, and customer segmentation.

### Steps in K-Means Clustering:
1. **Initialization**: Randomly select K data points as the initial centroids (center points of clusters).
2. **Assignment Step**: Assign each data point to the nearest centroid.
3. **Update Step**: Recalculate the centroids based on the mean of the points assigned to each cluster.
4. **Repeat**: Repeat the assignment and update steps until the centroids no longer change (the algorithm has converged) or a maximum number of iterations is reached.
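
The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration under simplifying assumptions (plain random initialization, Euclidean distance, a hypothetical helper named `kmeans_sketch`, made-up toy data), not the implementation used by scikit-learn.

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    """Minimal K-Means sketch; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k distinct data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assignment: distance from every point to every centroid, keep the nearest
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # 3. Update: mean of the assigned points (keep the old centroid if a cluster is empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Repeat until the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two toy groups with made-up values
X_demo = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
centroids, labels = kmeans_sketch(X_demo, k=2)
print(labels)
print(centroids)
```

On well-separated data like this, points from the same group are expected to share a label, although which group receives label 0 depends on the random initialization.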

### How K-Means Works:
- **K** represents the number of clusters that the algorithm will form.
- The **centroids** are the central points of each cluster and are updated during the algorithm's execution.
- **Euclidean distance** is typically used to measure the distance between data points and centroids.
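
As a small illustration of the distance measure, the Euclidean distance between two points is the square root of the summed squared coordinate differences. The sketch below uses only the standard library; the helper name `nearest_centroid` and the sample coordinates are made up for this example.

```python
import math

def nearest_centroid(point, centroids):
    """Return (index, distance) of the centroid closest to `point`."""
    distances = [math.dist(point, c) for c in centroids]  # Euclidean distances
    best = min(range(len(centroids)), key=distances.__getitem__)
    return best, distances[best]

example_centroids = [(0.0, 0.0), (3.0, 4.0)]
idx, dist = nearest_centroid((3.0, 3.0), example_centroids)
print(idx, dist)  # 1 1.0
```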

### Applications
K-means clustering is versatile and can be applied in various domains, including:
- **Market Segmentation**: Grouping customers based on purchasing behavior.
- **Anomaly Detection**: Identifying outliers or unusual patterns in data.
- **Customer Grouping**: Categorizing customers for targeted marketing.

# Using K-Means for Classification (Clustering - Unsupervised ML):

## Elbow Method

### What is the Elbow Method?
The Elbow Method is used to determine the optimal number of clusters (`K`) for K-means clustering. It involves plotting the inertia (sum of squared distances from each point to its cluster centroid) against the number of clusters and identifying the "elbow point."
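
To make the definition of inertia concrete, the sketch below computes it by hand for a tiny made-up dataset, once with a single cluster and once with two. The helper name `inertia` is hypothetical; in scikit-learn the same quantity is exposed as the `inertia_` attribute of a fitted `KMeans` model.

```python
import numpy as np

def inertia(X, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    diffs = X - centroids[labels]
    return float(np.sum(diffs ** 2))

points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

# K = 1: every point belongs to one cluster centered on the overall mean
inertia_k1 = inertia(points, np.zeros(4, dtype=int), np.array([[5.0, 1.0]]))

# K = 2: left pair and right pair each get their own centroid
inertia_k2 = inertia(points, np.array([0, 0, 1, 1]),
                     np.array([[0.0, 1.0], [10.0, 1.0]]))

print(inertia_k1, inertia_k2)  # 104.0 4.0
```

The sharp drop from K = 1 to K = 2 is the kind of change the elbow plot is meant to expose.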

### Steps:
1. Run K-means for a range of values of `K`.
2. Calculate inertia for each value of `K`.
3. Plot `K` vs. inertia.
4. Identify the "elbow," where the rate of decrease in inertia slows significantly. This point represents the optimal `K`.

### Why Use the Elbow Method?
- To avoid underfitting or overfitting by choosing a reasonable number of clusters: too few clusters merge distinct groups, while too many split natural groups apart.

## **1. Elbow Method for Determining Optimal Clusters**

In addition to visual inspection, the optimal K can be determined programmatically by monitoring the percentage change in inertia. If the change remains below a certain threshold (e.g., 5%) for three consecutive K values, the optimal K is identified.
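
One possible reading of this rule is sketched below as a standalone function. The function name, the `patience` parameter, and the choice to return the K just before the run of small changes begins are assumptions made for illustration; the inertia values are made up.

```python
def pick_k_by_inertia_change(k_values, inertias, threshold=5.0, patience=3):
    """Return the first K after which `patience` consecutive percentage changes
    in inertia are all below `threshold`, or None if that never happens."""
    # pct_changes[j] is the change when moving from k_values[j] to k_values[j + 1]
    pct_changes = [
        abs(curr - prev) / prev * 100 if prev else 0.0
        for prev, curr in zip(inertias, inertias[1:])
    ]
    for start in range(len(pct_changes) - patience + 1):
        if all(c < threshold for c in pct_changes[start:start + patience]):
            return k_values[start]
    return None

ks = [2, 3, 4, 5, 6, 7]
demo_inertias = [100.0, 50.0, 20.0, 19.5, 19.2, 19.0]  # made-up values
print(pick_k_by_inertia_change(ks, demo_inertias))  # 4
```

Here the changes after K = 4 are roughly 2.5%, 1.5%, and 1.0%, so K = 4 is returned. A curve that keeps dropping by more than the threshold yields `None`, and the caller has to decide on a fallback.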

### Visualizing Clusters
After determining the optimal K, the final K-means model is fitted to the data. Clusters can then be visualized by plotting data points, colored by their assigned cluster labels. This helps in interpreting the cluster structure and verifying the clustering results.

### Steps in the Implementation
1. **Load the Dataset**: Load the dataset containing features such as marks and study hours.
2. **Normalize the Data**: Apply Min-Max scaling to ensure equal contribution from all features.
3. **Determine Optimal Clusters**: Use the elbow method to plot inertia and identify the optimal K.
4. **Fit the Final Model**: Train the K-means model using the optimal K.
5. **Visualize Clusters**: Plot the data points with cluster assignments.
6. **Output Optimal K**: Display the number of clusters chosen.


## Sample Code:

In [None]:
# Import pandas for data processing
import pandas as pd

# Read the dataset
dataset = pd.read_csv("studentclusters.csv")
X = dataset.copy()

# Visualize the data using Scatter plot (optional, just to see the raw data)
X.plot.scatter(x='marks', y='shours')

# Fit and Transform the data for MinMax normalization
from sklearn.preprocessing import minmax_scale
X_scaled = minmax_scale(X)

# Elbow method to determine optimum clusters
inertia_values = []  # To store inertia values


# import KMeans for clustering
from sklearn.cluster import KMeans
# Try different numbers of clusters (from 2 to 5)
for i in range(2, 6):
    kmeans = KMeans(n_clusters=i, random_state=42)  # Set random_state for reproducibility
    kmeans.fit(X_scaled)
    inertia_values.append(kmeans.inertia_)

# Plotting the Elbow Curve
import matplotlib.pyplot as plt

plt.plot(range(2, 6), inertia_values, marker='o')
plt.title('Elbow Method for Optimal K')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia (Sum of Squared Distances)')
plt.grid(True)
plt.show()


## What Does the Code Achieve?
### 1. Loads and Prepares the Data
- The dataset is read, and its features are normalized using MinMax scaling.
- This ensures fair contribution of all attributes in the clustering process, as K-means is sensitive to the scale of features.
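
For intuition, Min-Max scaling maps each column to the range [0, 1] using `(x - min) / (max - min)`. The sketch below reproduces that formula by hand on made-up values; the helper `min_max_scale` is hypothetical and only mirrors the default column-wise behavior of scikit-learn's `minmax_scale`.

```python
import numpy as np

def min_max_scale(X):
    """Rescale each column to [0, 1] via (x - min) / (max - min)."""
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # constant columns map to 0 instead of dividing by zero
    return (X - col_min) / col_range

# Columns: marks, study hours (made-up values)
raw = np.array([[40.0, 1.0], [70.0, 3.0], [100.0, 5.0]])
scaled = min_max_scale(raw)
print(scaled)
```

Before scaling, marks span 60 units while study hours span only 4, so distances would be dominated by marks. After scaling, both columns span the same range.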



### 2. Explores the Data
- A scatter plot is generated to provide an initial view of the data's distribution.
- This helps understand the relationship between the features (`marks` and `shours`, the study hours) visually.



### 3. Applies K-means Clustering
- The data is clustered for different numbers of clusters (`K`).
- For each value of `K`, the clustering is performed, and the inertia (a measure of clustering quality) is calculated.



### 4. Determines the Optimal Number of Clusters
- The Elbow Method is used to find the optimal value of `K` by analyzing the inertia values.
- The "elbow point," where the rate of decrease in inertia slows significantly, represents the optimal number of clusters.


### 5. Visualizes the Results
- The Elbow Curve is plotted to visualize how inertia changes with the number of clusters.
- This visual representation helps identify the optimal value of `K` effectively.


## **2. Elbow Plot and Cluster Plot as Separate Panels**

#### Purpose:
- Provides a visual understanding of the clusters formed by K-means.
- Verifies the clustering results and checks for overlapping clusters or outliers.

#### Steps for Cluster Plot:
1. Fit the K-means model using the optimal `K`.
2. Assign cluster labels to each data point.
3. Create a scatter plot with points colored according to their cluster labels.
4. Analyze the grouping and separation of clusters.

## Sample Code:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import minmax_scale
from sklearn.cluster import KMeans

# Read the dataset
dataset = pd.read_csv("studentclusters.csv")
X = dataset.copy()

# Normalize the data for MinMax normalization
X_scaled = minmax_scale(X)

# Get the number of samples in the dataset
n_samples = X_scaled.shape[0]

# Elbow method to determine optimum clusters
inertia_values = []  # To store inertia values
inertia_change = []  # To track percentage change in inertia values

# The maximum number of clusters should be at most the number of samples
max_clusters = min(50, n_samples)

# Try different numbers of clusters (from 2 to max_clusters)
for i in range(2, max_clusters + 1):
    kmeans = KMeans(n_clusters=i, random_state=42)  # Set random_state for reproducibility
    kmeans.fit(X_scaled)
    inertia_values.append(kmeans.inertia_)

    # Calculate the percentage change in inertia compared to the previous inertia value
    if i > 2:
        prev_inertia = inertia_values[-2]
        current_inertia = inertia_values[-1]

        # Avoid division by zero; record 0% in that case so indices stay aligned with K
        if prev_inertia != 0:
            change = (current_inertia - prev_inertia) / prev_inertia * 100
        else:
            change = 0.0
        inertia_change.append(change)

# Plotting the Elbow Curve
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)  # First plot (Elbow method)
plt.plot(range(2, max_clusters + 1), inertia_values, marker='o')
plt.title('Elbow Method for Optimal K')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia (Sum of Squared Distances)')
plt.grid(True)

# Automatically pick K: find the first run of 3 consecutive changes below 5%
# inertia_change[j] is the change when going from K = j + 2 to K = j + 3
optimal_k = None
for i in range(3, len(inertia_change) + 1):
    if all(abs(change) < 5 for change in inertia_change[i-3:i]):
        optimal_k = i - 1  # K just before the run of small changes begins
        break

# If no stable run was found, fall back to the largest K tried
if optimal_k is None:
    optimal_k = max_clusters

# Fit the final KMeans model with the optimal_k
kmeans = KMeans(n_clusters=optimal_k, random_state=42)
kmeans.fit(X_scaled)

# Plot the clusters
plt.subplot(1, 2, 2)  # Second plot (Cluster visualization)
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=kmeans.labels_, cmap='viridis')
plt.title(f'Clusters for K={optimal_k}')
plt.xlabel('Marks')
plt.ylabel('Study Hours')
plt.grid(True)

plt.tight_layout()
plt.show()

# Output the optimal number of clusters determined
print(f"Optimal number of clusters: {optimal_k}")


## What Does the Code Achieve?

### 1. Loads and Prepares the Data
- The dataset is loaded using `pandas`, and a copy is made to ensure the original data remains intact.
- The features in the dataset are normalized using Min-Max scaling to ensure fair contribution of all attributes during clustering.

**Purpose**: Prepare the dataset for the K-means clustering algorithm by bringing all features onto a common [0, 1] scale.



### 2. Determines the Optimal Number of Clusters
- The Elbow Method is used to calculate **inertia values** for a range of clusters (`K` values).
- **Inertia** is the sum of squared distances from data points to their nearest cluster center. Lower inertia values indicate tighter clusters.

**Purpose**: Identify the optimal number of clusters by finding the "elbow point" where the rate of decrease in inertia slows significantly.



### 3. Automates the Detection of Optimal Clusters
- The percentage change in inertia values is calculated for successive cluster numbers (`K` values).
- A heuristic checks if the inertia change stabilizes below a threshold (e.g., 5%) for three consecutive values of `K`. If this condition is met, the corresponding `K` is selected as the optimal number of clusters.

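**Purpose**: Automate the process of finding the optimal number of clusters, reducing the need for manual interpretation of the Elbow Curve. The 5% threshold and the three-value window are heuristic choices and may need tuning for other datasets.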



### 4. Applies K-Means Clustering
- The K-means algorithm is applied to the dataset using the optimal number of clusters (`K`).
- Each data point is assigned to a cluster based on its proximity to the cluster centroids.

**Purpose**: Partition the data into meaningful clusters based on the similarity of data points.
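
As a brief illustration of what a fitted model exposes, the sketch below fits `KMeans` on a few made-up, already-scaled points and then assigns two new points to the nearest learned centroid. The data values and `n_init=10` are assumptions chosen for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up points already scaled to [0, 1]
train = np.array([[0.10, 0.10], [0.15, 0.20], [0.20, 0.10],
                  [0.80, 0.90], [0.90, 0.80], [0.85, 0.95]])

model = KMeans(n_clusters=2, n_init=10, random_state=42).fit(train)

print(model.labels_)           # cluster index assigned to each training point
print(model.cluster_centers_)  # one centroid (row) per cluster

# New points are assigned to the cluster whose centroid is nearest
new_points = np.array([[0.12, 0.18], [0.88, 0.86]])
predicted = model.predict(new_points)
print(predicted)
```

The numeric label given to each cluster is arbitrary, so only the grouping itself should be interpreted, not whether a cluster is called 0 or 1.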



### 5. Visualizes the Results
- **Elbow Plot**: A line plot of inertia values against the number of clusters (`K`), helping visualize how inertia decreases as the number of clusters increases.
- **Cluster Plot**: A scatter plot where each data point is colored according to its cluster assignment, visually demonstrating the grouping of data points.

**Purpose**: Provide visual insights into the clustering process:
  - The Elbow Plot helps identify the optimal `K`.
  - The Cluster Plot shows the final cluster assignments.



### 6. Outputs the Optimal Number of Clusters
- Prints the optimal number of clusters determined by the Elbow Method and heuristic.

**Purpose**: Inform the user about the ideal number of clusters for the dataset.
