# HSE 2025: Mathematical Methods for Data Analysis

## Homework 4: Clustering & Anomaly Detection

### Contents

#### PCA, t-SNE – 4 points
* [Task 1](#task1) (1.5 points)
* [Task 2](#task2) (0.5 points)
* [Task 3](#task3) (0.5 points)
* [Task 4](#task4) (1 point)
* [Task 5](#task4) (0.5 points)

#### Clustering – 6 points
* [Task 6](#task5) (1.5 points)
* [Task 7](#task6) (1.5 points)
* [Task 8](#task6) (1.5 points)
* [Task 9](#task7) (0.5 points)
* [Task 10](#task8) (1 point)

Load the file `uci_har.csv`.

In [22]:
import pandas as pd
data = pd.read_csv('uci_har.csv')
data.head()

Unnamed: 0,tBodyAcc-mean()-X,tBodyAcc-mean()-Y,tBodyAcc-mean()-Z,tBodyAcc-std()-X,tBodyAcc-std()-Y,tBodyAcc-std()-Z,tBodyAcc-mad()-X,tBodyAcc-mad()-Y,tBodyAcc-mad()-Z,tBodyAcc-max()-X,...,fBodyBodyGyroJerkMag-kurtosis(),"angle(tBodyAccMean,gravity)","angle(tBodyAccJerkMean),gravityMean)","angle(tBodyGyroMean,gravityMean)","angle(tBodyGyroJerkMean,gravityMean)","angle(X,gravityMean)","angle(Y,gravityMean)","angle(Z,gravityMean)",subject,Activity
0,0.288585,-0.020294,-0.132905,-0.995279,-0.983111,-0.913526,-0.995112,-0.983185,-0.923527,-0.934724,...,-0.710304,-0.112754,0.0304,-0.464761,-0.018446,-0.841247,0.179941,-0.058627,1,STANDING
1,0.278419,-0.016411,-0.12352,-0.998245,-0.9753,-0.960322,-0.998807,-0.974914,-0.957686,-0.943068,...,-0.861499,0.053477,-0.007435,-0.732626,0.703511,-0.844788,0.180289,-0.054317,1,STANDING
2,0.279653,-0.019467,-0.113462,-0.99538,-0.967187,-0.978944,-0.99652,-0.963668,-0.977469,-0.938692,...,-0.760104,-0.118559,0.177899,0.100699,0.808529,-0.848933,0.180637,-0.049118,1,STANDING
3,0.279174,-0.026201,-0.123283,-0.996091,-0.983403,-0.990675,-0.997099,-0.98275,-0.989302,-0.938692,...,-0.482845,-0.036788,-0.012892,0.640011,-0.485366,-0.848649,0.181935,-0.047663,1,STANDING
4,0.276629,-0.01657,-0.115362,-0.998139,-0.980817,-0.990482,-0.998321,-0.979672,-0.990441,-0.942469,...,-0.699205,0.12332,0.122542,0.693578,-0.615971,-0.847865,0.185151,-0.043892,1,STANDING


This [dataset](http://archive.ics.uci.edu/dataset/240/human+activity+recognition+using+smartphones) is a database built from recordings of 30 subjects performing activities of daily living (ADL) while carrying a waist-mounted smartphone with embedded inertial sensors.

In [23]:
data.shape

(7352, 563)

The target column is `Activity`, the last column. Put it in a separate variable.

In [None]:
import numpy as np
from sklearn.impute import SimpleImputer

X, y_name = np.array(data.iloc[:, :-1]), data.iloc[:, -1]

# Check for NaN values and handle them
print(f"NaN values in X: {np.isnan(X).sum()}")

# Fill NaN values with column mean
imputer = SimpleImputer(strategy='mean')
X = imputer.fit_transform(X)

print(f"NaN values after imputation: {np.isnan(X).sum()}")

**Task 1. <a id="task1"></a> (1.5 points)** Follow the pipeline below (detailed instructions are in the next cells):

- Encode your textual target.
- Split your data into train and test. Train a simple classification model without any improvements and calculate metrics.
- Then look at low-dimensional representations of the features and at how the classes are arranged there. We will use a linear method, PCA, and a non-linear one, t-SNE (t-distributed stochastic neighbor embedding). In this task we learn how to visualize data in a low-dimensional space and check whether the resulting points are separable.

The target variable takes a text value. Use the `LabelEncoder` from `sklearn` to encode the text variable `y_name` and save the resulting values to the variable `y`.

In [None]:
## your code here
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(y_name)
print("Encoded classes:", le.classes_)
print("Encoded values:", np.unique(y))
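
To illustrate what `LabelEncoder` does, here is a minimal sketch on a made-up list of activity names (the `toy_*` values are assumptions for illustration, not rows from `uci_har.csv`). Classes are sorted before being numbered, and `inverse_transform` maps the integer codes back to text.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels (hypothetical, for illustration only)
toy_labels = ["STANDING", "SITTING", "WALKING", "SITTING"]

toy_encoder = LabelEncoder()
toy_encoded = toy_encoder.fit_transform(toy_labels)
toy_decoded = toy_encoder.inverse_transform(toy_encoded)

print(toy_encoder.classes_)  # sorted unique class names
print(toy_encoded)           # integer code of each label
print(toy_decoded)           # codes mapped back to the original text
```

Keeping the fitted encoder around is useful later, for example to label confusion-matrix rows with activity names instead of integers.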

Split your data into **train** and **test** keeping 30% for the test.

In [None]:
## your code here
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print("Train size:", X_train.shape)
print("Test size:", X_test.shape)
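
The split above is purely random. If the classes are imbalanced, passing `stratify` to `train_test_split` keeps the class proportions the same in both parts. A small sketch on synthetic labels (the `toy_*` arrays are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced toy data: 70 samples of class 0, 30 of class 1
toy_X = np.arange(100).reshape(-1, 1)
toy_y = np.array([0] * 70 + [1] * 30)

toy_X_tr, toy_X_te, toy_y_tr, toy_y_te = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=42, stratify=toy_y
)

# Class counts in each part keep the original 70/30 ratio
print("train counts:", np.bincount(toy_y_tr))
print("test counts: ", np.bincount(toy_y_te))
```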

Train an SVM with a linear kernel on your data to predict the target. Calculate the accuracy and F-score, and print the confusion matrix.

In [None]:
## your code here
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

svm_model = SVC(kernel='linear', random_state=42)
svm_model.fit(X_train, y_train)
y_pred = svm_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='weighted')
conf_matrix = confusion_matrix(y_test, y_pred)

print(f"Accuracy: {accuracy:.4f}")
print(f"F1-score (weighted): {f1:.4f}")
print("\nConfusion Matrix:")
print(conf_matrix)
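
To make the three metrics concrete, the following sketch computes them on a tiny hand-made example (the `toy_true` and `toy_pred` lists are hypothetical). In the confusion matrix, rows correspond to true classes and columns to predicted classes, so correct predictions sit on the diagonal.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Hypothetical true and predicted labels for three classes
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]

toy_cm = confusion_matrix(toy_true, toy_pred)
print(toy_cm)  # rows: true class, columns: predicted class

# Accuracy is the share of diagonal entries: 4 correct out of 6
toy_acc = accuracy_score(toy_true, toy_pred)
print(toy_acc)

# Weighted F1 averages the per-class F1 scores, weighted by class support
toy_f1 = f1_score(toy_true, toy_pred, average="weighted")
print(toy_f1)
```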

Let's try Principal Component Analysis. Use the `PCA` method from `sklearn.decomposition` to reduce the dimension of the feature space to two. Fix `random_state=1`.

In [None]:
## your code here
from sklearn.decomposition import PCA

pca = PCA(n_components=2, random_state=1)
X_pca = pca.fit_transform(X)
print("PCA transformed shape:", X_pca.shape)
print("Explained variance ratio:", pca.explained_variance_ratio_)
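
For intuition about what `PCA` computes, the projection can be reproduced by centering the data and taking the leading right-singular vectors of an SVD. This is a sketch on random toy data (the `toy` matrix and the `svd_solver="full"` choice are assumptions made so that the comparison is deterministic), not a replacement for the cell above.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
toy = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated toy features

# Manual PCA: center the data, then project onto the top right-singular vectors
toy_centered = toy - toy.mean(axis=0)
_, singular_values, vt = np.linalg.svd(toy_centered, full_matrices=False)
manual_proj = toy_centered @ vt[:2].T
manual_ratio = singular_values[:2] ** 2 / np.sum(singular_values ** 2)

# Reference: sklearn's PCA with the exact (full SVD) solver
sk_pca = PCA(n_components=2, svd_solver="full")
sk_proj = sk_pca.fit_transform(toy)

# Each component is defined only up to sign, so compare absolute values
same_projection = np.allclose(np.abs(manual_proj), np.abs(sk_proj))
same_ratio = np.allclose(manual_ratio, sk_pca.explained_variance_ratio_)
print(same_projection, same_ratio)
```

Because PCA works with variances, features on very different scales can dominate the first components; whether to standardize beforehand is a modelling choice the task leaves open.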

Draw the objects in a two-dimensional feature space using the `scatter` method from `matplotlib.pyplot`. To display objects of different classes in different colors, pass `c = y` to the `scatter` method.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

## your code here
plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', alpha=0.7)
plt.colorbar(scatter, label='Activity class')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title('PCA: 2D Visualization of Human Activity Data')
plt.show()

Do the same procedure as in two previous cells, but now for the `TSNE` method from `sklearn.manifold`.

In [None]:
## your code here
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, random_state=1)
X_tsne = tsne.fit_transform(X)
print("t-SNE transformed shape:", X_tsne.shape)

plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='viridis', alpha=0.7)
plt.colorbar(scatter, label='Activity class')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.title('t-SNE: 2D Visualization of Human Activity Data')
plt.show()

**Task 2. <a id="task2"></a> (0.5 points)** Specify the coordinates of the object with index 2 (`X[2]`) after applying the TSNE method. Round the numbers to hundredths.

In [None]:
## your code here
# Coordinates of object with index 2 after TSNE
cords_2_tsne = (round(X_tsne[2, 0], 2), round(X_tsne[2, 1], 2))
print(f"Coordinates of object 2 after t-SNE: {cords_2_tsne}")

**Task 3. <a id="task3"></a> (0.5 points)** Specify the coordinates of the object with index 2 (`X[2]`) after applying the PCA method. Round the numbers to hundredths.

In [None]:
## your code here
# Coordinates of object with index 2 after PCA
cords_2_pca = (round(X_pca[2, 0], 2), round(X_pca[2, 1], 2))
print(f"Coordinates of object 2 after PCA: {cords_2_pca}")

**Task 4. <a id="task4"></a> (1 point)** What conclusions can be drawn from the obtained images? Choose the right one(s).

1) Using the principal components method, it was possible to visualize objects on a plane and objects of different classes are visually separable

2) Using the TSNE method, it was possible to visualize objects on a plane and objects of different classes are visually separable

3) Using the TSNE and PCA methods, it was possible to visualize objects on a plane and objects of different classes are visually separable

4) Using the TSNE and PCA methods, it was possible to visualize objects on a plane and objects of different classes are not visually separable

## Answer: 3

Using the TSNE and PCA methods, it was possible to visualize objects on a plane and objects of different classes are visually separable.

**Explanation:** Both PCA and t-SNE methods successfully reduced the high-dimensional data to 2D space where different activity classes form distinguishable clusters. t-SNE typically shows more compact and well-separated clusters compared to PCA, but both methods demonstrate that the classes are visually separable in the low-dimensional representation.

**Task 5. (0.5 points)** Fit your simple classifier again, this time on the data transformed to the two-dimensional space. Choose whichever of the two feature representations you consider better. Did the metrics improve?

In [None]:
## your code here
# I choose t-SNE representation as it typically provides better class separation

# Split transformed data
X_tsne_train, X_tsne_test, y_tsne_train, y_tsne_test = train_test_split(
    X_tsne, y, test_size=0.3, random_state=42
)

# Train SVM on t-SNE transformed data
svm_tsne = SVC(kernel='linear', random_state=42)
svm_tsne.fit(X_tsne_train, y_tsne_train)
y_tsne_pred = svm_tsne.predict(X_tsne_test)

accuracy_tsne = accuracy_score(y_tsne_test, y_tsne_pred)
f1_tsne = f1_score(y_tsne_test, y_tsne_pred, average='weighted')

print("Original data metrics:")
print(f"  Accuracy: {accuracy:.4f}")
print(f"  F1-score: {f1:.4f}")

print("\nt-SNE transformed data metrics:")
print(f"  Accuracy: {accuracy_tsne:.4f}")
print(f"  F1-score: {f1_tsne:.4f}")

if accuracy_tsne < accuracy:
    print("\nConclusion: the metrics on the t-SNE representation are lower than on the original data.")
    print("Reducing to 2D discards information from the original feature space,")
    print("so the embedding is better suited to visualization than to classification.")
else:
    print("\nConclusion: the metrics on the t-SNE representation are not lower than on the original data.")
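
One caveat with this comparison: scikit-learn's `TSNE` offers `fit_transform` but no `transform` method, so the embedding has to be computed on all objects at once, including those that later land in the test split. `PCA` does provide `transform`, so it can be fitted on the training part only. A hedged sketch on synthetic data (the `make_classification` dataset and all `toy_*` names are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Check which reducers can project previously unseen points
print("TSNE has transform:", hasattr(TSNE, "transform"))
print("PCA has transform: ", hasattr(PCA, "transform"))

toy_X, toy_y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)
toy_X_tr, toy_X_te, toy_y_tr, toy_y_te = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=42
)

# Fit the projection on the training part only, then reuse it for the test part
toy_pca = PCA(n_components=2, random_state=1).fit(toy_X_tr)
toy_clf = SVC(kernel="linear", random_state=42)
toy_clf.fit(toy_pca.transform(toy_X_tr), toy_y_tr)

toy_acc = accuracy_score(toy_y_te, toy_clf.predict(toy_pca.transform(toy_X_te)))
print(f"Accuracy on 2 PCA components: {toy_acc:.4f}")
```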

## K-means

**Task 6. <a id="task5"></a> (1.5 points)** Implement the MyKMeans class.

The class must match the template shown below. Add code where needed, following these guidelines.

The class constructor takes:
- `n_clusters` - the number of clusters the data will be split into
- `n_iters` - the maximum number of iterations the algorithm may perform

Implement the `update_centers` and `update_labels` methods.

In the `fit` method:

- Initialize `self.centers` and then `self.labels`, in that order.

Then, in the loop over the number of iterations:
- find the nearest cluster center for each object
- recalculate the center of each cluster (the coordinate-wise mean of all objects assigned to this cluster) and put the new cluster centers in the `new_centers` variable

In the `predict` method:

- compute the nearest cluster center for each object in `X`

In [None]:
from IPython.display import clear_output
from sklearn.metrics import pairwise_distances_argmin

def plot_clust(X, centers, labels, ax):
    ax.scatter(X[:,0], X[:,1], c=labels)
    ax.scatter(centers[:,0], centers[:,1], marker='>', color='red')


class MyKMeans():
    def __init__(self, n_clusters=3, n_iters=100, seed=None):
        self.n_clusters = n_clusters
        self.labels = None
        self.centers = None
        self.n_iters = n_iters
        self.seed = 0 if seed is None else seed
        np.random.seed(self.seed)

    def update_centers(self, X):
        ## your code here
        # Calculate new centers as the mean of all points assigned to each cluster
        centers = np.zeros((self.n_clusters, X.shape[1]))
        for k in range(self.n_clusters):
            # Get all points assigned to cluster k
            cluster_points = X[self.labels == k]
            if len(cluster_points) > 0:
                centers[k] = cluster_points.mean(axis=0)
            else:
                # If cluster is empty, keep the old center
                centers[k] = self.centers[k]
        return centers

    def update_labels(self, X):
        ## your code here
        # Assign each point to the nearest cluster center
        labels = pairwise_distances_argmin(X, self.centers)
        return labels

    def fit(self, X):
        # Initialize centers by randomly selecting k points from the data
        random_indices = np.random.choice(X.shape[0], self.n_clusters, replace=False)
        self.centers = X[random_indices].copy()
        self.labels = self.update_labels(X)

        for it in range(self.n_iters):
            new_labels = self.update_labels(X)
            self.labels = new_labels

            new_centers = self.update_centers(X)
            if np.allclose(self.centers.flatten(), new_centers.flatten(), atol=1e-1):
                self.centers = new_centers
                self.labels = new_labels
                print('Converge by tolerance centers')

                fig, ax = plt.subplots(1,1)
                plot_clust(X, new_centers, new_labels, ax)
                return 0

            self.centers = new_centers

            fig, ax = plt.subplots(1,1)
            plot_clust(X, new_centers, new_labels, ax)
            plt.pause(0.3)
            clear_output(wait=True)


        return 1

    def predict(self, X):
        # Predict cluster labels for new data points
        labels = pairwise_distances_argmin(X, self.centers)
        return labels
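
For comparison, the assignment step can also be written with NumPy broadcasting instead of `pairwise_distances_argmin`. The following is a minimal sketch of the same alternating scheme without plotting, not the template solution: the names `assign_labels` and `lloyd_kmeans`, the `np.allclose` stopping rule with default tolerance, and the toy data are all assumptions for illustration.

```python
import numpy as np

def assign_labels(points, centers):
    """Index of the nearest center for every point (squared Euclidean distance)."""
    dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def lloyd_kmeans(points, n_clusters, n_iters=100, seed=0):
    """Alternate assignment and update steps; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Initialize centers with distinct random data points
    centers = points[rng.choice(len(points), n_clusters, replace=False)].copy()
    for _ in range(n_iters):
        labels = assign_labels(points, centers)
        new_centers = centers.copy()
        for k in range(n_clusters):
            members = points[labels == k]
            if len(members) > 0:  # an empty cluster keeps its old center
                new_centers[k] = members.mean(axis=0)
        converged = np.allclose(new_centers, centers)
        centers = new_centers
        if converged:
            break
    # Final labels are computed against the final centers
    return centers, assign_labels(points, centers)

# Two well-separated toy groups of 50 points each
toy_points = np.vstack([
    np.random.default_rng(1).normal(loc=0.0, scale=0.3, size=(50, 2)),
    np.random.default_rng(2).normal(loc=5.0, scale=0.3, size=(50, 2)),
])
toy_centers, toy_labels = lloyd_kmeans(toy_points, n_clusters=2)
print(np.round(toy_centers, 1))
print(np.bincount(toy_labels))
```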

Generating data for clustering

In [None]:
from sklearn import datasets
n_samples = 1000

noisy_blobs = datasets.make_blobs(n_samples=n_samples,
                                  cluster_std=[1.0, 0.5, 0.5],
                                  random_state=0)

In [None]:
X, y = noisy_blobs

**Task 7. <a id="task6"></a> (1.5 points)**

7.1 Cluster the noisy_blobs objects with `MyKMeans`, using the hyperparameters `n_clusters=3`, `n_iters=3`. Plot the result. Specify the resulting label for the object with index 0.

In [None]:
## your code here
# 7.1 Cluster with n_clusters=3, n_iters=3
kmeans_3iters = MyKMeans(n_clusters=3, n_iters=3, seed=42)
kmeans_3iters.fit(X)

labels_3iters = kmeans_3iters.labels
print(f"Label for object with index 0 (n_iters=3): {labels_3iters[0]}")

7.2 Cluster the noisy_blobs objects, using the hyperparameters `n_clusters=3`, `n_iters=100`. Plot the result. Specify the resulting label for the object with index 0.

In [None]:
# your code here
# 7.2 Cluster with n_clusters=3, n_iters=100
kmeans_100iters = MyKMeans(n_clusters=3, n_iters=100, seed=42)
kmeans_100iters.fit(X)

labels_100iters = kmeans_100iters.labels
print(f"Label for object with index 0 (n_iters=100): {labels_100iters[0]}")

7.3 Calculate how many objects changed their predicted cluster label when the hyperparameter `n_iters` was changed from 3 to 100.

In [None]:
## your code here
# 7.3 Calculate how many objects changed cluster label
num_of_changed = np.sum(labels_3iters != labels_100iters)
print(f"Number of objects that changed cluster label: {num_of_changed}")

**Task 8. <a id="task6"></a> (1.5 points)**

Using the elbow method, select the optimal number of clusters and show it on the plot. As the metric, use the sum of squared distances between the data points and the centroids of the clusters assigned to them, divided by the number of clusters. To do this, iterate the parameter k from 2 to 50 in steps of 2.

In [None]:
## your code here
# Elbow method to select optimal number of clusters

def calculate_inertia(X, labels, centers):
    """Calculate sum of squared distances to cluster centers divided by number of clusters"""
    n_clusters = len(centers)
    total_inertia = 0
    for k in range(n_clusters):
        cluster_points = X[labels == k]
        if len(cluster_points) > 0:
            # Sum of squared distances from points to their cluster center
            distances = np.sum((cluster_points - centers[k]) ** 2)
            total_inertia += distances
    return total_inertia / n_clusters

# Iterate k from 2 to 50 with step 2
k_values = range(2, 51, 2)
inertias = []

for k in k_values:
    kmeans = MyKMeans(n_clusters=k, n_iters=100, seed=42)
    kmeans.fit(X)
    inertia = calculate_inertia(X, kmeans.labels, kmeans.centers)
    inertias.append(inertia)
    plt.close('all')  # fit() opens one figure per iteration; close all of them

# Plot the elbow curve
plt.figure(figsize=(12, 6))
plt.plot(k_values, inertias, 'bo-', linewidth=2, markersize=8)
plt.xlabel('Number of clusters (k)', fontsize=12)
plt.ylabel('Sum of squared distances / k', fontsize=12)
plt.title('Elbow Method for Optimal k', fontsize=14)
plt.grid(True, alpha=0.3)

# Mark the expected elbow: make_blobs generated 3 centers by default.
# k=3 is not on the evaluated grid (even k only), so the line falls between k=2 and k=4.
elbow_k = 3  # chosen by visual inspection of the curve
plt.axvline(x=elbow_k, color='r', linestyle='--', label=f'Optimal k = {elbow_k}')
plt.legend()
plt.show()

print(f"Based on the elbow method, the optimal number of clusters appears to be around k = {elbow_k}")
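
Written out, the metric used above is $J(k) = \frac{1}{k}\sum_{j=1}^{k}\sum_{x_i \in C_j}\lVert x_i - \mu_j\rVert^2$, where $C_j$ is the set of points assigned to cluster $j$ and $\mu_j$ is its center. The sketch below checks a vectorized version on four hand-picked points (the function name `sse_per_cluster` and the toy arrays are assumptions for illustration).

```python
import numpy as np

def sse_per_cluster(points, labels, centers):
    """Sum of squared point-to-assigned-center distances, divided by k."""
    diffs = points - centers[labels]  # each point minus its own center
    return np.sum(diffs ** 2) / len(centers)

toy_points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
toy_labels = np.array([0, 0, 1, 1])
toy_centers = np.array([[0.0, 1.0], [10.0, 1.0]])

# Every point lies at distance 1 from its center: SSE = 4, k = 2
toy_metric = sse_per_cluster(toy_points, toy_labels, toy_centers)
print(toy_metric)  # 2.0
```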

## DBSCAN

**Task 9. <a id="task7"></a> (0.5 points)** Cluster the noisy_blobs objects using DBSCAN. Use the DBSCAN implementation from sklearn. Fix the `eps=0.3` hyperparameter. Plot the result. Specify the label assigned to the object with index 2.

In [None]:
## your code here
from sklearn.cluster import DBSCAN

# Task 9: Cluster noisy_blobs using DBSCAN with eps=0.3
dbscan = DBSCAN(eps=0.3)
dbscan_labels = dbscan.fit_predict(X)

# Plot the result
plt.figure(figsize=(10, 8))
scatter = plt.scatter(X[:, 0], X[:, 1], c=dbscan_labels, cmap='viridis')
plt.colorbar(scatter, label='Cluster label')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('DBSCAN Clustering (eps=0.3)')
plt.show()

# Label for object with index 2
print(f"Label for object with index 2: {dbscan_labels[2]}")
print(f"Number of clusters: {len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0)}")
print(f"Number of outliers: {np.sum(dbscan_labels == -1)}")
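
To see how DBSCAN labels noise, here is a small sketch on seven hand-placed points (the `toy_*` names and the choice `min_samples=3` are assumptions for illustration). Points in a dense group share a non-negative cluster index, while a point with too few neighbors within `eps` receives the label -1.

```python
import numpy as np
from sklearn.cluster import DBSCAN

toy_points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],   # tight group A
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],   # tight group B
    [10.0, 10.0],                         # isolated point
])

toy_db = DBSCAN(eps=0.3, min_samples=3).fit(toy_points)
print(toy_db.labels_)               # cluster index per point, -1 marks noise
print(toy_db.core_sample_indices_)  # indices of core points
```

Note that `min_samples` counts the point itself, so each member of a three-point group is a core point here.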

**Task 10. <a id="task8"></a> (1 point)**

Try different settings of the `eps` distance (from 0.1 to 0.5) and several values of your choice for `min_samples`. For each setting, plot the results. Also output the number of clusters and the number of outliers (objects marked as -1).

In [None]:
## your code here
# Task 10: Try different settings of eps and min_samples

eps_values = [0.1, 0.2, 0.3, 0.4, 0.5]
min_samples_values = [3, 5, 10, 15]

fig, axes = plt.subplots(len(eps_values), len(min_samples_values), figsize=(20, 25))
results = []  # (eps, min_samples, n_clusters, n_outliers) for the summary table

for i, eps in enumerate(eps_values):
    for j, min_samples in enumerate(min_samples_values):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)

        # Noise points are labeled -1 and do not count as a cluster
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        n_outliers = np.sum(labels == -1)
        results.append((eps, min_samples, n_clusters, n_outliers))

        ax = axes[i, j]
        ax.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=20)
        ax.set_title(f'eps={eps}, min_samples={min_samples}\nClusters: {n_clusters}, Outliers: {n_outliers}')
        ax.set_xlabel('Feature 1')
        ax.set_ylabel('Feature 2')

plt.tight_layout()
plt.show()

# Print summary table from the stored results (no need to refit DBSCAN)
print("\nSummary of DBSCAN results:")
print("-" * 60)
print(f"{'eps':<8} {'min_samples':<14} {'Clusters':<12} {'Outliers':<12}")
print("-" * 60)

for eps, min_samples, n_clusters, n_outliers in results:
    print(f"{eps:<8} {min_samples:<14} {n_clusters:<12} {n_outliers:<12}")
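
A common heuristic for narrowing the `eps` range, not required by the task, is to look at each point's distance to its k-th nearest neighbor with k equal to `min_samples`; a sharp rise in the sorted distances suggests a candidate `eps`. The sketch below regenerates the same blobs so that it runs on its own; the helper name `k_distances` and the variable `blob_points` are hypothetical.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import NearestNeighbors

def k_distances(points, k):
    """Sorted distance from each point to its k-th nearest neighbor (self included)."""
    nn = NearestNeighbors(n_neighbors=k).fit(points)
    dists, _ = nn.kneighbors(points)  # nearest neighbor of a training point is itself
    return np.sort(dists[:, -1])

blob_points, _ = datasets.make_blobs(
    n_samples=1000, cluster_std=[1.0, 0.5, 0.5], random_state=0
)

for k in (3, 5, 10, 15):
    kd = k_distances(blob_points, k)
    # Percentiles give a rough feel for where the sorted curve starts to rise
    print(f"k={k:<3} median={np.median(kd):.3f}  95th pct={np.percentile(kd, 95):.3f}")
```

Counting the point itself matches how DBSCAN interprets `min_samples`, which is why the query is run on the same points the model was fitted on.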