# <center>**CSE 572: Data Mining Homework 3**</center>
**Name: Sriranjan Srikanth** <br>
**ASU ID: 1229309109**


### **Task 1: K-Means Clustering**
This task involves implementing the K-Means clustering algorithm from scratch using three distance metrics:
1. Euclidean Distance
2. 1 - Cosine Similarity
3. 1 - Generalized Jaccard Similarity

We compare the three metrics by sum of squared errors (SSE), predictive accuracy, and convergence behavior (number of iterations and run time).
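For reference, the three distances can be written as follows. This is a sketch that assumes feature vectors $x$ and $y$ with non-negative entries (as with pixel intensities); the notation is introduced here and is not part of the original assignment text.

$$
d_{\text{Euclidean}}(x, y) = \sqrt{\sum_{i} (x_i - y_i)^2}
$$

$$
d_{\text{cosine}}(x, y) = 1 - \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}
$$

$$
d_{\text{Jaccard}}(x, y) = 1 - \frac{\sum_{i} \min(x_i, y_i)}{\sum_{i} \max(x_i, y_i)}
$$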

In [11]:
import pandas as pd

# Load the datasets
data = pd.read_csv('kmeans_data/data.csv', header=None)  # 10000 samples, 784 features
labels = pd.read_csv('kmeans_data/label.csv', header=None)  # Ground-truth labels

# Display basic information about the dataset
print("Dataset Overview:")
print(data.head())
print("\nDataset Info:")
print(data.info())

# Check for missing values
print("\nMissing Values:")
print(data.isnull().sum())

# Define the number of clusters (K) based on the labels
K = labels[0].nunique()  # There are 10 unique labels
print(f"Number of Clusters (K): {K}")

Dataset Overview:
   0    1    2    3    4    5    6    7    8    9    ...  774  775  776  777  \
0    0    0    0    0    0    0    0    0    0    0  ...    0    0    0    0   
1    0    0    0    0    0    0    0    0    0    0  ...    0    0    0    0   
2    0    0    0    0    0    0    0    0    0    0  ...    0    0    0    0   
3    0    0    0    0    0    0    0    0    0    0  ...    0    0    0    0   
4    0    0    0    0    0    0    0    0    0    0  ...    0    0    0    0   

   778  779  780  781  782  783  
0    0    0    0    0    0    0  
1    0    0    0    0    0    0  
2    0    0    0    0    0    0  
3    0    0    0    0    0    0  
4    0    0    0    0    0    0  

[5 rows x 784 columns]

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Columns: 784 entries, 0 to 783
dtypes: int64(784)
memory usage: 59.8 MB
None

Missing Values:
0      0
1      0
2      0
3      0
4      0
      ..
779    0
780    0
781    0
782    0

### **K-Means Implementation**
Below, we implement the K-Means clustering algorithm from scratch. This includes:
- Distance functions for Euclidean, 1 - Cosine Similarity, and 1 - Generalized Jaccard Similarity.
- Functions to initialize centroids, assign clusters, and update centroids.
- A main function to perform K-Means clustering.

In [12]:
import numpy as np

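# Euclidean distance
def euclidean_distance(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

# 1 - Cosine similarity (despite its name, this function returns a distance)
def cosine_similarity(x, y):
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    if denom == 0:
        # Assumption: a zero vector has similarity 0 to everything
        return 1.0
    return 1 - np.dot(x, y) / denom

# 1 - Generalized Jaccard similarity (despite its name, this function returns a distance)
def jaccard_similarity(x, y):
    intersection = np.sum(np.minimum(x, y))
    union = np.sum(np.maximum(x, y))
    if union == 0:
        # Assumption: two all-zero vectors are treated as identical
        return 0.0
    return 1 - (intersection / union)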

In [13]:
# Initialize centroids
def initialize_centroids(data, K):
    return data.sample(n=K).to_numpy()

# Assign clusters based on the selected distance function
def assign_clusters(data, centroids, distance_fn):
    clusters = []
    for _, point in data.iterrows():
        distances = [distance_fn(point.to_numpy(), centroid) for centroid in centroids]
        clusters.append(np.argmin(distances))
    return np.array(clusters)

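# Update centroids based on cluster assignments
def update_centroids(data, clusters, K):
    new_centroids = []
    for k in range(K):
        cluster_points = data[clusters == k]
        if len(cluster_points) == 0:
            # Empty cluster: re-seed from a random point so the mean is not NaN
            cluster_points = data.sample(n=1)
        new_centroids.append(cluster_points.mean(axis=0))
    return np.array(new_centroids)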

# Check for convergence
def has_converged(old_centroids, new_centroids):
    return np.allclose(old_centroids, new_centroids)
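The row-by-row loop in `assign_clusters` calls the distance function once per point and centroid, which is slow on 10,000 samples. As a rough sketch of an alternative, the distances to all centroids can be computed with array operations. The helper name `pairwise_distances_sketch` and the demo arrays are hypothetical and not part of the original notebook; the zero-vector handling is an assumption.

```python
import numpy as np

def pairwise_distances_sketch(X, C, metric="euclidean"):
    """Return an (n_points, n_centroids) distance matrix (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    C = np.asarray(C, dtype=float)
    if metric == "euclidean":
        # ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2, clipped at 0 against round-off
        sq = (X ** 2).sum(axis=1)[:, None] - 2.0 * (X @ C.T) + (C ** 2).sum(axis=1)[None, :]
        return np.sqrt(np.maximum(sq, 0.0))
    if metric == "cosine":
        norms = np.linalg.norm(X, axis=1)[:, None] * np.linalg.norm(C, axis=1)[None, :]
        # Assumption: a zero vector gets similarity 0, i.e. distance 1
        sim = np.divide(X @ C.T, norms, out=np.zeros_like(norms), where=norms > 0)
        return 1.0 - sim
    if metric == "jaccard":
        out = np.empty((X.shape[0], C.shape[0]))
        for j, c in enumerate(C):  # loop over K centroids only, not over all points
            inter = np.minimum(X, c).sum(axis=1)
            union = np.maximum(X, c).sum(axis=1)
            # Assumption: two all-zero vectors are identical, i.e. distance 0
            ratio = np.divide(inter, union, out=np.ones_like(inter), where=union > 0)
            out[:, j] = 1.0 - ratio
        return out
    raise ValueError(f"unknown metric: {metric}")

# Tiny demo: two points, two centroids
X_demo = np.array([[3.0, 4.0], [4.0, 3.0]])
C_demo = np.array([[4.0, 3.0], [3.0, 5.0]])
nearest = pairwise_distances_sketch(X_demo, C_demo).argmin(axis=1)
print(nearest)
```

Taking `argmin(axis=1)` of the returned matrix gives the same kind of cluster index array that `assign_clusters` produces, provided the same metric is used.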

In [14]:
# Main K-Means algorithm
def kmeans(data, K, distance_fn, max_iters=100):
    centroids = initialize_centroids(data, K)
    for i in range(max_iters):
        clusters = assign_clusters(data, centroids, distance_fn)
        new_centroids = update_centroids(data, clusters, K)
        
        # Stop early once the centroids no longer move; max_iters caps the loop otherwise
        if has_converged(centroids, new_centroids):
            break
        centroids = new_centroids
    
    return clusters, centroids

# Compute SSE
def compute_sse(data, clusters, centroids):
    sse = 0
    for k in range(len(centroids)):
        cluster_points = data[clusters == k]
        sse += np.sum((cluster_points - centroids[k]) ** 2)
    return sse
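
In symbols, the quantity returned by `compute_sse` can be written as below, where $C_k$ is the set of points assigned to cluster $k$ and $\mu_k$ is its centroid. The notation is introduced here for illustration.

$$
\mathrm{SSE} = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2
$$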

### **Running K-Means with Different Distance Metrics**
We now run the K-Means algorithm using:
1. Euclidean Distance
2. 1 - Cosine Similarity
3. 1 - Generalized Jaccard Similarity

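We then compute the SSE of each clustering and compare the three methods. `compute_sse` measures squared Euclidean distance to the centroid in every case, so the three values share a scale regardless of the metric used for assignment.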

In [15]:
# NumPy array version of the data, used by compute_sse below (kmeans itself takes the DataFrame)
features = data.to_numpy()

# Run K-Means for each distance metric
clusters_euclidean, centroids_euclidean = kmeans(data, K, euclidean_distance)
clusters_cosine, centroids_cosine = kmeans(data, K, cosine_similarity)
clusters_jaccard, centroids_jaccard = kmeans(data, K, jaccard_similarity)

# Compute SSE for each method
sse_euclidean = compute_sse(features, clusters_euclidean, centroids_euclidean)
sse_cosine = compute_sse(features, clusters_cosine, centroids_cosine)
sse_jaccard = compute_sse(features, clusters_jaccard, centroids_jaccard)

# Print SSE Results
print("SSE Results:")
print(f"Euclidean: {sse_euclidean}")
print(f"Cosine: {sse_cosine}")
print(f"Jaccard: {sse_jaccard}")

SSE Results:
Euclidean: 25437047258.18773
Cosine: 25484251221.48815
Jaccard: 25416818402.689133


### **Cluster Evaluation and Analysis**

1. Majority Vote Labeling
   - We label each cluster using the majority vote of the ground-truth labels in `label.csv`. This step allows us to evaluate the clustering accuracy for each distance metric.

2. Predictive Accuracy
   - For each clustering method (Euclidean, Cosine, and Jaccard), we compute the predictive accuracy as the percentage of correctly labeled data points.

3. Iterations and Convergence Analysis
   - We analyze the number of iterations and time required for convergence under different stopping criteria:
     - No change in centroid position.
     - Increase in SSE in the next iteration.
     - Maximum number of iterations (e.g., 100).
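
Items 1 and 2 can be illustrated on a toy example before applying them to the real clusters. The arrays below are made up for illustration and all `toy_` names are hypothetical.

```python
import numpy as np

# Six points in two clusters, with made-up ground-truth labels
toy_clusters = np.array([0, 0, 0, 1, 1, 1])
toy_truth = np.array([5, 5, 3, 7, 7, 7])

# Majority vote: the most frequent true label inside each cluster
toy_cluster_labels = np.array([
    np.bincount(toy_truth[toy_clusters == k]).argmax() for k in range(2)
])

# Every point inherits the label of its cluster
toy_predicted = toy_cluster_labels[toy_clusters]

# Accuracy is the fraction of points whose inherited label is correct
toy_accuracy = np.mean(toy_predicted == toy_truth)
print(toy_cluster_labels, toy_accuracy)
```

Cluster 0 is labeled 5 and cluster 1 is labeled 7, so the single point with true label 3 is the only miss and the accuracy is 5/6.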

In [21]:
from sklearn.metrics import accuracy_score

# Function to label clusters using majority voting
def majority_vote_labeling(clusters, labels, K):
    # Integer dtype so the array holds class labels; empty clusters keep the default label 0
    cluster_labels = np.zeros(K, dtype=int)
    for k in range(K):
        # Extract ground-truth labels of points in the current cluster
        cluster_points = labels[clusters == k]
        # Assign the majority label to the cluster
        if len(cluster_points) > 0:
            cluster_labels[k] = np.bincount(cluster_points).argmax()
    return cluster_labels

# Function to calculate predictive accuracy
def compute_accuracy(clusters, cluster_labels, true_labels):
    predicted_labels = cluster_labels[clusters]
    return accuracy_score(true_labels, predicted_labels)

# Perform majority vote labeling for each clustering method
true_labels = labels.to_numpy().flatten()
cluster_labels_euclidean = majority_vote_labeling(clusters_euclidean, true_labels, K)
cluster_labels_cosine = majority_vote_labeling(clusters_cosine, true_labels, K)
cluster_labels_jaccard = majority_vote_labeling(clusters_jaccard, true_labels, K)

# Compute predictive accuracy for each method
accuracy_euclidean = compute_accuracy(clusters_euclidean, cluster_labels_euclidean, true_labels)
accuracy_cosine = compute_accuracy(clusters_cosine, cluster_labels_cosine, true_labels)
accuracy_jaccard = compute_accuracy(clusters_jaccard, cluster_labels_jaccard, true_labels)

# Print Accuracy Results
print("Accuracy Results:")
print(f"Euclidean: {accuracy_euclidean * 100:.2f}%")
print(f"Cosine: {accuracy_cosine * 100:.2f}%")
print(f"Jaccard: {accuracy_jaccard * 100:.2f}%")

Accuracy Results:
Euclidean: 60.32%
Cosine: 64.30%
Jaccard: 60.23%


### **Iteration and Convergence Analysis**
We analyze the iterations and convergence criteria for each distance metric:
1. **Stopping Criteria:**
   - No change in centroid position.
   - SSE increases in the next iteration.
   - Maximum number of iterations (e.g., 100).
2. **Iterations and Convergence Time:**
   - Count the number of iterations taken for each method.
   - Measure the total execution time.

In [22]:
import time

# Modified K-Means to track iterations and time
def kmeans_with_metrics(data, K, distance_fn, max_iters=100):
    centroids = initialize_centroids(data, K)
    iterations = 0
    # perf_counter is a monotonic, high-resolution clock suited to timing code
    start_time = time.perf_counter()

    for i in range(max_iters):
        iterations += 1
        clusters = assign_clusters(data, centroids, distance_fn)
        new_centroids = update_centroids(data, clusters, K)
        
        # Stop early once the centroids no longer move; max_iters caps the loop otherwise
        if has_converged(centroids, new_centroids):
            break
        centroids = new_centroids

    elapsed_time = time.perf_counter() - start_time
    return clusters, centroids, iterations, elapsed_time

# Run K-Means with metrics for each distance function
_, _, iters_euclidean, time_euclidean = kmeans_with_metrics(data, K, euclidean_distance)
_, _, iters_cosine, time_cosine = kmeans_with_metrics(data, K, cosine_similarity)
_, _, iters_jaccard, time_jaccard = kmeans_with_metrics(data, K, jaccard_similarity)

# Print Iteration and Time Results
print("Iterations and Convergence Time:")
print(f"Euclidean - Iterations: {iters_euclidean}, Time: {time_euclidean:.2f} seconds")
print(f"Cosine - Iterations: {iters_cosine}, Time: {time_cosine:.2f} seconds")
print(f"Jaccard - Iterations: {iters_jaccard}, Time: {time_jaccard:.2f} seconds")

Iterations and Convergence Time:
Euclidean - Iterations: 100, Time: 50.90 seconds
Cosine - Iterations: 91, Time: 54.23 seconds
Jaccard - Iterations: 79, Time: 55.11 seconds
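
The `kmeans_with_metrics` cell stops on centroid convergence or on the iteration cap; the SSE-increase criterion listed earlier is not checked there. A minimal sketch of a loop that checks all three rules and reports which one fired might look like this. The function name `kmeans_three_stops`, its `pairwise` parameter, and the synthetic demo data are assumptions for illustration, not part of the original notebook.

```python
import numpy as np

def euclidean_matrix(X, C):
    # (n_points, n_centroids) Euclidean distances via broadcasting
    return np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)

def kmeans_three_stops(X, K, pairwise=euclidean_matrix, max_iters=100, seed=0):
    """K-Means sketch that returns the stopping rule that ended the run."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    prev_sse = np.inf
    for iteration in range(1, max_iters + 1):
        # Assignment step under the chosen distance
        clusters = pairwise(X, centroids).argmin(axis=1)
        # Update step; an empty cluster keeps its previous centroid
        new_centroids = centroids.copy()
        for k in range(K):
            members = X[clusters == k]
            if len(members) > 0:
                new_centroids[k] = members.mean(axis=0)
        # SSE is always measured as squared Euclidean distance to the centroid
        sse = float(((X - new_centroids[clusters]) ** 2).sum())
        if np.allclose(centroids, new_centroids):
            return clusters, new_centroids, iteration, "centroids unchanged"
        if sse > prev_sse:
            # Roll back to the centroids from before the increase
            return clusters, centroids, iteration, "SSE increased"
        centroids, prev_sse = new_centroids, sse
    return clusters, centroids, max_iters, "max iterations reached"

# Synthetic demo: two small blobs in 2-D
demo_rng = np.random.default_rng(1)
X_demo = np.vstack([demo_rng.normal(0, 0.1, (20, 2)),
                    demo_rng.normal(5, 0.1, (20, 2))])
clusters_demo, centroids_demo, n_iter, reason = kmeans_three_stops(X_demo, K=2)
print(n_iter, reason)
```

With Euclidean assignment and mean updates the SSE does not increase between iterations, so the second rule is mainly relevant when a cosine or Jaccard distance matrix is passed as `pairwise`.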
