---
title: "Q4. Expectation-Maximization Algorithm (8 points)"
author: "TW"
date: "2025-03-29"
categories: python
draft: false
---

# Q4. Expectation-Maximization Algorithm (8 points)

In this question, you are required to implement the EM algorithm yourself.

### Data Descriptions:
1. The data is in Data_Q4.csv.
2. The test dataset is in Q4_Data.csv. It has 6 attributes ('A', 'B', ..., 'F') and 626 instances in total. You need to cluster all the instances into two classes. Assume the initial centers are c1=(0,0,0,0,0,0) and c2=(1,1,1,1,1,1).

### REQUIREMENTS:
1. Report the updated centers and SSE for the first two iterations.
2. Report the overall iteration step when your algorithm terminates.
3. Report the final converged centers for each cluster.

### Submissions:
1. Put all reports in Q4_readme.pdf.
2. Submit your source code in folder Q4_code.
3. Put files/folder above in folder Q4.

### NOTES:
Please use the termination condition below:

**Termination condition: the EM algorithm terminates when:**
1. The sum, over all centers, of the L1 distance between the old and new center
   \[
   \sum_{\text{each center}} \lVert C_{\text{old}} - C_{\text{new}} \rVert_1
   \]
   is smaller than 0.0001, **or**
2. The iteration step is greater than the maximum iteration step of 100.

Let’s tackle Q4, which asks us to implement the Expectation-Maximization (EM) algorithm to cluster the dataset in `Q4_Data.csv`. The dataset contains 626 instances with 6 attributes, and we need to group them into two clusters (c1 and c2) starting from specified initial centers. We’ll report the centers and Sum of Squared Errors (SSE) for the first two iterations, the iteration count at termination, and the final converged centers, then package the code and report in a folder named `Q4`. Let’s break this down step by step.

---

### Step 1: Understanding the Problem and Data

- **Data Description**:
  - `Q4_Data.csv`: Contains 626 instances with 6 numerical attributes ('A' through 'F').
  - We need to cluster the data into 2 clusters: c1 and c2.
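  - Initial centers used in this walkthrough (the problem statement specifies c2 = (1,1,1,1,1,1), whereas the code and outputs below use 1.1 in every coordinate):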
    - c1 = (0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
    - c2 = (1.1, 1.1, 1.1, 1.1, 1.1, 1.1)

- **Task**:
  - Implement the EM algorithm for clustering. EM is usually associated with Gaussian Mixture Models, but the problem describes a k-means-like variant with hard assignments.
  - Report the updated centers and SSE for the first two iterations.
  - Report the iteration step at which the algorithm terminates.
  - Report the final centers when the algorithm converges.
  - Submit the code and report in a folder named `Q4`.

- **Termination Conditions** (the algorithm stops as soon as either one holds):
  - The sum over all clusters of the L1 distance between the old and new centers is smaller than 0.0001, i.e., \(\sum_{\text{each center}} ||C_{\text{old}} - C_{\text{new}}||_1 < 0.0001\).
  - The iteration step exceeds the maximum of 100 iterations.
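
The two stopping rules can be expressed as a small helper. This is a minimal sketch: the function name `should_stop` and its parameters are illustrative assumptions, not part of the assignment or of the implementation further down.

```python
import numpy as np

def should_stop(old_centers, new_centers, step, tol=1e-4, max_steps=100):
    """Return True when either stopping rule from the problem statement holds.

    `should_stop` and its parameter names are hypothetical, chosen for illustration.
    """
    # Sum of L1 distances between old and new centers, taken over all clusters
    shift = np.sum(np.abs(np.asarray(new_centers) - np.asarray(old_centers)))
    return bool(shift < tol or step > max_steps)

old = np.array([[0.0, 0.0], [1.0, 1.0]])
new = np.array([[0.5, 0.0], [1.0, 0.75]])
print(np.sum(np.abs(new - old)))      # total shift: 0.5 + 0.25 = 0.75
print(should_stop(old, new, step=1))  # False: the centers are still moving
print(should_stop(old, old, step=1))  # True: the shift is 0
```

Note that the shift is summed over both clusters and all six dimensions before it is compared with the tolerance.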

#### EM Algorithm for Clustering
The EM algorithm is typically used for Gaussian Mixture Models (GMMs), where it iteratively estimates the parameters (means, covariances, and mixing coefficients) of the mixture components. However, the problem description (hard assignments to clusters, L1-distance for convergence, and SSE as a metric) suggests a k-means-like approach with EM terminology. In k-means, the "E-step" assigns points to the nearest cluster, and the "M-step" updates the cluster centers as the mean of assigned points. We’ll implement this interpretation of the EM algorithm:

- **E-step**: Assign each data point to the nearest cluster based on Euclidean distance.
- **M-step**: Update the cluster centers as the mean of the points assigned to each cluster.
- **SSE**: Compute the Sum of Squared Errors as the sum of squared Euclidean distances from each point to its assigned cluster center.
- **Convergence**: Stop when the L1-distance between old and new centers is less than 0.0001 or after 100 iterations.
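
In symbols, one iteration of this hard-assignment scheme can be written as follows. The notation is introduced here for illustration and does not come from the problem statement: \(x_i\) is the \(i\)-th data point, \(c_k\) is the center of cluster \(k\), \(z_i\) is the cluster assigned to \(x_i\), and \(S_k = \{ i : z_i = k \}\) is the set of points assigned to cluster \(k\).

\[
z_i = \arg\min_{k \in \{1, 2\}} \lVert x_i - c_k \rVert_2^2,
\qquad
c_k^{\text{new}} = \frac{1}{|S_k|} \sum_{i \in S_k} x_i,
\qquad
\text{SSE} = \sum_{i=1}^{n} \lVert x_i - c_{z_i} \rVert_2^2
\]

The stopping rule then compares \(\sum_{k} \lVert c_k - c_k^{\text{new}} \rVert_1\) with 0.0001.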

---

### Step 2: Implementing the EM Algorithm
We’ll implement the algorithm in Python using NumPy and Pandas. Let’s go through the steps.

#### Step 2.1: Load and Preprocess the Data
First, we load the data from `Q4_Data.csv`.



```python
%cd /content/drive/MyDrive/Notes/MSBD5002/Data_Q4
```

```
/content/drive/MyDrive/Notes/MSBD5002/Data_Q4
```


```python
import numpy as np
import pandas as pd
import os
import shutil

# Load the data
data = pd.read_csv('Q4_Data.csv')
print("Shape of data:", data.shape)  # Should be (626, 6)

# Convert to numpy array
X = data.to_numpy()
```

```
Shape of data: (626, 6)
```
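
Before clustering, it can help to confirm that the frame is fully numeric and has no missing values, since the distance computations assume both. The sketch below uses a hypothetical helper, `check_numeric_frame`, and a made-up two-row stand-in for the real file. Only the column names 'A' to 'F' come from the problem statement.

```python
import io
import pandas as pd

def check_numeric_frame(df, expected_cols=6):
    """Hypothetical helper: validate the frame, then return it as a float array."""
    if df.shape[1] != expected_cols:
        raise ValueError(f"expected {expected_cols} columns, got {df.shape[1]}")
    if df.isna().any().any():
        raise ValueError("data contains missing values")
    non_numeric = df.select_dtypes(exclude="number").columns.tolist()
    if non_numeric:
        raise ValueError(f"non-numeric columns: {non_numeric}")
    return df.to_numpy(dtype=float)

# Tiny synthetic stand-in for Q4_Data.csv (values are made up for illustration)
csv_text = "A,B,C,D,E,F\n1,0,2,0,0,1\n3,5,9,4,0,1\n"
toy = pd.read_csv(io.StringIO(csv_text))
X_toy = check_numeric_frame(toy)
print(X_toy.shape)  # (2, 6)
```

Converting to `float` up front also prevents integer truncation when the cluster means are written back into the centers array.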




#### Step 2.2: Implement the EM Algorithm
We’ll define the EM algorithm with the specified initial centers, iterate until convergence, and track the centers and SSE for the first two iterations.



```python
# Define the EM algorithm for clustering
def em_clustering(X, initial_centers, max_iters=100, tol=0.0001):
    """
    Hard-assignment (k-means-style) EM clustering.

    Parameters:
    - X: Data array of shape (n_samples, n_features)
    - initial_centers: Initial cluster centers of shape (n_clusters, n_features)
    - max_iters: Maximum number of iterations
    - tol: Tolerance for convergence (summed L1-distance between old and new centers)
    Returns:
    - centers: Final cluster centers
    - iteration_logs: List of (centers at the start of the iteration, SSE) per iteration
    """
    n_samples, n_features = X.shape
    n_clusters = initial_centers.shape[0]

    # Initialize centers as floats so integer inputs do not truncate the means
    centers = initial_centers.astype(float)
    iteration_logs = []

    for iteration in range(max_iters):
        # E-step: Assign points to the nearest cluster
        distances = np.zeros((n_samples, n_clusters))
        for k in range(n_clusters):
            distances[:, k] = np.sum((X - centers[k]) ** 2, axis=1)  # Squared Euclidean distance
        labels = np.argmin(distances, axis=1)  # Assign to nearest cluster

        # Compute SSE (Sum of Squared Errors) against the current centers
        sse = 0
        for k in range(n_clusters):
            cluster_points = X[labels == k]
            if len(cluster_points) > 0:
                sse += np.sum((cluster_points - centers[k]) ** 2)

        # M-step: Update cluster centers
        new_centers = np.zeros_like(centers)
        for k in range(n_clusters):
            cluster_points = X[labels == k]
            if len(cluster_points) > 0:
                new_centers[k] = np.mean(cluster_points, axis=0)
            else:
                new_centers[k] = centers[k]  # If cluster is empty, keep the old center

        # Log the centers used in this iteration and the SSE measured against them
        iteration_logs.append((centers.copy(), sse))

        # Check for convergence using L1-distance
        l1_distance = np.sum(np.abs(new_centers - centers))
        centers = new_centers

        if l1_distance < tol:
            print(f"Converged after {iteration + 1} iterations")
            break
    else:
        # The loop ended without meeting the tolerance
        print(f"Reached maximum iterations ({max_iters})")

    return centers, iteration_logs
```


```python
# Initial centers
initial_centers = np.array([
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # c1
    [1.1, 1.1, 1.1, 1.1, 1.1, 1.1]   # c2
])

# Run the EM algorithm
final_centers, iteration_logs = em_clustering(X, initial_centers)
```

```
Converged after 11 iterations
```


- **E-step**: We compute the squared Euclidean distance from each point to each cluster center and assign the point to the nearest cluster.
- **M-step**: We update each cluster center as the mean of the points assigned to that cluster.
- **SSE**: We compute the Sum of Squared Errors as the sum of squared distances from each point to its assigned cluster center.
- **Convergence**: We check the L1-distance (sum of absolute differences) between the old and new centers and stop if it’s less than 0.0001 or after 100 iterations.
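- **Logging**: We store, for each iteration, the centers in effect at the start of that iteration and the SSE measured against them, so `iteration_logs[0]` holds the initial centers and `iteration_logs[1]` holds the centers produced by the first M-step.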
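
The per-cluster loops in the E-step can also be written with NumPy broadcasting. The following is a sketch of a single iteration on toy data, not the code that produced the results in this post. The helper name `hard_em_step` and the toy points are assumptions made for illustration.

```python
import numpy as np

def hard_em_step(X, centers):
    """One hard-assignment iteration (hypothetical helper, not the code used above)."""
    # E-step: (n, 1, d) - (1, k, d) broadcasts to (n, k, d); summing over d
    # gives the squared Euclidean distance from every point to every center
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = sq_dist.argmin(axis=1)
    # SSE measured against the centers used for the assignment
    sse = sq_dist[np.arange(len(X)), labels].sum()
    # M-step: mean of each cluster's points; an empty cluster keeps its old center
    new_centers = centers.copy()
    for k in range(len(centers)):
        members = X[labels == k]
        if len(members) > 0:
            new_centers[k] = members.mean(axis=0)
    return labels, new_centers, sse

X_toy = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
toy_centers = np.array([[0.0, 0.0], [10.0, 10.0]])
labels, new_centers, sse = hard_em_step(X_toy, toy_centers)
print(labels)       # [0 0 1 1]
print(new_centers)  # rows are the updated centers (0, 1) and (10, 11)
print(sse)          # 8.0
```

The broadcasted form allocates an intermediate array of shape `(n_samples, n_clusters, n_features)`, which is negligible for 626 points and 2 clusters but worth keeping in mind for larger data.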

#### Step 2.3: Report the Results
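We need to report:
1. The updated centers and SSE for the first two iterations.
2. The iteration step at which the algorithm terminates (printed by `em_clustering` when it stops).
3. The final centers when the algorithm converges.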



```python
# Report the logged centers and SSE for the first two iterations
print("First Iteration:")
print("Centers:")
print("c1:", iteration_logs[0][0][0])
print("c2:", iteration_logs[0][0][1])
print("SSE:", iteration_logs[0][1])

print("\nSecond Iteration:")
print("Centers:")
print("c1:", iteration_logs[1][0][0])
print("c2:", iteration_logs[1][0][1])
print("SSE:", iteration_logs[1][1])

# Report the final centers
print("\nFinal Centers:")
print("c1:", final_centers[0])
print("c2:", final_centers[1])
```

```
First Iteration:
Centers:
c1: [0. 0. 0. 0. 0. 0.]
c2: [1.1 1.1 1.1 1.1 1.1 1.1]
SSE: 530541.02

Second Iteration:
Centers:
c1: [1.44444444 0.22222222 0.66666667 0.         0.         0.11111111]
c2: [ 2.84602917  6.42139384 14.31604538  8.99351702  0.24311183  1.01620746]
SSE: 325470.1898978879

Final Centers:
c1: [2.52037037 4.6037037  9.33888889 5.17037037 0.22222222 0.95      ]
c2: [ 4.74418605 17.18604651 44.13953488 32.05813953  0.34883721  1.3372093 ]
```


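- The `iteration_logs` list contains one `(centers, SSE)` tuple per iteration, where the centers are those in effect at the start of that iteration. The "First Iteration" entry therefore shows the initial centers, and the centers produced by the first update appear under "Second Iteration".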
- The `final_centers` variable contains the centers after convergence.
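
Because `em_clustering` logs the centers from the start of each iteration, a report that must list the centers *after* each update needs a slightly different log. The sketch below is one possible variant, shown on toy data only. The function name `run_with_updated_log` is hypothetical, and measuring SSE against the updated centers with the current labels is an assumed convention that the assignment does not spell out.

```python
import numpy as np

def run_with_updated_log(X, initial_centers, max_iters=100, tol=1e-4):
    """Hypothetical variant: log the centers after each M-step and the SSE against them."""
    centers = np.asarray(initial_centers, dtype=float).copy()
    log = []
    step = 0
    for step in range(1, max_iters + 1):
        # E-step: nearest center by squared Euclidean distance
        sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dist.argmin(axis=1)
        # M-step: empty clusters keep their previous center
        new_centers = centers.copy()
        for k in range(len(centers)):
            if np.any(labels == k):
                new_centers[k] = X[labels == k].mean(axis=0)
        # SSE of each point to its cluster's updated center (assumed convention)
        sse = ((X - new_centers[labels]) ** 2).sum()
        log.append((new_centers.copy(), sse))
        # Summed L1 shift between old and new centers
        shift = np.abs(new_centers - centers).sum()
        centers = new_centers
        if shift < tol:
            break
    return centers, log, step

X_toy = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
toy_final, toy_log, toy_steps = run_with_updated_log(X_toy, [[0.0, 0.0], [1.0, 1.0]])
print(toy_steps)   # 3 on this toy data
print(toy_final)   # rows are the centers (0, 1) and (10, 11)
```

With this variant, `toy_log[0]` and `toy_log[1]` hold the updated centers for the first two iterations. The numbers it would produce on `Q4_Data.csv` are not shown here because they have not been run.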

#### Step 2.4: Package the Submission
We need to submit the report and code in a folder named `Q4`. The report (`Q4_readme.pdf`) should include the updated centers and SSE for the first two iterations, the iteration count at termination, and the final centers, and the source code goes in a `Q4_code` subfolder.



- **Note**: The question asks for `Q4_readme.pdf`, so you’ll need to convert the code and report to PDF format manually (e.g., by copying the code and output into a document and saving as PDF).
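
Assembling the folder can be scripted. The sketch below builds the layout the problem statement asks for (`Q4/Q4_readme.pdf` plus `Q4/Q4_code/`) inside a temporary directory. The helper `build_submission` and the file names `q4_em.py` and `report.pdf` are placeholders assumed for the demonstration.

```python
import shutil
import tempfile
from pathlib import Path

def build_submission(workdir, code_file, report_file):
    """Hypothetical helper: assemble Q4/ with Q4_readme.pdf and Q4_code/ inside workdir."""
    root = Path(workdir) / "Q4"
    code_dir = root / "Q4_code"
    code_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy(code_file, code_dir / Path(code_file).name)
    shutil.copy(report_file, root / "Q4_readme.pdf")
    return root

# Demonstration in a temporary directory with placeholder files
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    script = tmp / "q4_em.py"    # assumed script name
    report = tmp / "report.pdf"  # assumed name of the exported report
    script.write_text("# EM clustering code goes here\n")
    report.write_bytes(b"%PDF-1.4 placeholder")
    root = build_submission(tmp, script, report)
    created = sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))
    print(created)  # ['Q4_code', 'Q4_code/q4_em.py', 'Q4_readme.pdf']
```

In practice you would call `build_submission` with your working directory and the real paths of your script and exported PDF.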

---

### Step 3: Write the Report
The report should include the updated centers and SSE for the first two iterations, the final centers, and the code. Here’s a summary to include in `Q4_readme.pdf`:

#### Report Content

1. **Introduction**:

   - The task is to implement the Expectation-Maximization (EM) algorithm for clustering the data in `Q4_Data.csv` into two clusters (c1 and c2).
   - The dataset contains 626 instances with 6 attributes.
   - Initial centers:
     - c1 = (0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
     - c2 = (1.1, 1.1, 1.1, 1.1, 1.1, 1.1)

2. **Algorithm Description**:

   - **E-step**: Assign each data point to the nearest cluster based on Euclidean distance.
   - **M-step**: Update the cluster centers as the mean of the points assigned to each cluster.
   - **SSE**: Compute the Sum of Squared Errors as the sum of squared distances from each point to its assigned cluster center.
   - **Convergence**: Stop when the L1-distance between old and new centers is less than 0.0001 or after 100 iterations.

3. **Results**:

   - **First Iteration**:
     - Centers:
       - c1: [Output from iteration_logs[0][0][0]]
       - c2: [Output from iteration_logs[0][0][1]]
     - SSE: [Output from iteration_logs[0][1]]

   - **Second Iteration**:
     - Centers:
       - c1: [Output from iteration_logs[1][0][0]]
       - c2: [Output from iteration_logs[1][0][1]]
     - SSE: [Output from iteration_logs[1][1]]

   - **Iterations at Termination**: [Iteration count printed by em_clustering]

   - **Final Centers**:
     - c1: [Output from final_centers[0]]
     - c2: [Output from final_centers[1]]

4. **Code**:
   - [Include the entire code from above]

---

### Final Submission
Your submission folder `Q4` should contain:
- `Q4_readme.pdf`: The report with the code, the updated centers and SSE for the first two iterations, the iteration count at termination, and the final centers.
- `Q4_code/`: The folder holding your source code.

**Folder Structure**:
```
Q4/
├── Q4_readme.pdf
└── Q4_code/
```

To create the PDF:

1. Copy the report content above into a document editor.
2. Include the actual output (centers and SSE) from running the code.
3. Format it for clarity (e.g., use headings, bullet points).
4. Export the document as a PDF named `Q4_readme.pdf`.
5. Place the PDF in the `Q4` folder.

---

### Notes and Potential Improvements
1. **EM vs. K-Means**: The problem uses EM terminology but describes a k-means-like algorithm. A true EM algorithm for clustering would involve Gaussian Mixture Models (GMMs) with soft assignments (probabilities), covariance matrices, and maximization of the likelihood. If the problem intended a GMM, we’d need to modify the implementation to include these components.
2. **Empty Clusters**: The code handles empty clusters by keeping the old center, but in a real scenario, you might reinitialize the center randomly.
3. **Data Preprocessing**: The problem doesn’t specify preprocessing, but in practice, you might normalize the data if the features have different scales.
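
To make the contrast in the first note concrete, here is a sketch of one soft-assignment EM step. It is deliberately simplified and not a full GMM: it assumes spherical Gaussians with a fixed, shared variance, so only the means and mixing weights are updated. The function name `soft_em_step` and the toy data are assumptions for illustration.

```python
import numpy as np

def soft_em_step(X, means, weights, var=1.0):
    """One EM step for a mixture of spherical Gaussians with fixed, shared variance.

    Hypothetical, simplified sketch: a full GMM would also update the covariances.
    """
    # E-step: log(weight_k * N(x_i | mean_k, var * I)), dropping constants shared by all k
    sq_dist = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    log_p = np.log(weights)[None, :] - 0.5 * sq_dist / var
    log_p -= log_p.max(axis=1, keepdims=True)  # stabilize before exponentiating
    resp = np.exp(log_p)
    resp /= resp.sum(axis=1, keepdims=True)    # responsibilities; each row sums to 1
    # M-step: responsibility-weighted means and updated mixing weights
    nk = resp.sum(axis=0)
    new_means = (resp.T @ X) / nk[:, None]
    new_weights = nk / len(X)
    return resp, new_means, new_weights

X_toy = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
toy_means = np.array([[0.0, 0.0], [1.0, 1.0]])
toy_weights = np.array([0.5, 0.5])
resp, new_means, new_weights = soft_em_step(X_toy, toy_means, toy_weights)
print(resp.round(3))  # each point gets a probability for each cluster
print(new_weights)
```

Unlike the hard E-step, every point contributes to every center in proportion to its responsibility. If a component's responsibilities all underflow to zero, the division by `nk` fails, so a production version would need a guard for that case.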

# END