Task 1 – Your main task is to cluster the given dataset using K-Means and DBSCAN. Your code must address the following points, and your final report should reflect them.

• How do you choose the number of clusters in K-Means? Is it the same number of clusters for DBSCAN?

• How do you find the optimal parameter values?

• What data processing steps do you apply and why?
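
One common way to approach the first question is to sweep several values of k and compare inertia (for an elbow plot) with the silhouette score. The sketch below is only an illustration: it runs on a synthetic stand-in dataset with four well-separated groups, and the helper name `sweep_k` and the range of k are assumptions, not part of the assignment.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature matrix (assumption: four well-separated groups).
X_demo, _ = make_blobs(
    n_samples=600,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
    cluster_std=0.8,
    random_state=42,
)
X_demo_scaled = StandardScaler().fit_transform(X_demo)


def sweep_k(X, k_values, random_state=42):
    """Return {k: (inertia, silhouette)} for each candidate number of clusters."""
    results = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
        results[k] = (model.inertia_, silhouette_score(X, model.labels_))
    return results


k_results = sweep_k(X_demo_scaled, range(2, 9))

# Pick the k with the highest silhouette; inertia is kept for an elbow plot.
best_k = max(k_results, key=lambda k: k_results[k][1])

for k, (inertia, sil) in k_results.items():
    print(f"k={k}: inertia={inertia:.1f}, silhouette={sil:.3f}")
print("k with the highest silhouette:", best_k)
```

DBSCAN has no parameter for the number of clusters: the count follows from `eps` and `min_samples`, so it need not match the k chosen for K-Means.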

In [26]:
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Load the dataset
data = pd.read_csv('Data/UCI HAR Dataset/UCI HAR Dataset/your_dataset.csv')

# Assuming the dataset contains feature columns and possibly a label column
# Separate features (X) from labels (y), if labels are available
X = data.drop('label_column_name', axis=1)  # Adjust 'label_column_name' to your dataset's label column
y = data['label_column_name']  # Adjust 'label_column_name' to your dataset's label column

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Initialize KMeans model
kmeans = KMeans(n_clusters=6, random_state=42)  # Adjust the number of clusters as needed

# Fit KMeans model to the standardized features
kmeans.fit(X_scaled)

# Get cluster labels
cluster_labels = kmeans.labels_

# Add cluster labels to the original dataset
data['cluster'] = cluster_labels

# Visualize the clusters using only the first two standardized features
# (for data with more than two features, project with PCA before plotting)
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=cluster_labels, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], marker='x', c='red', s=100)
plt.title('KMeans Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()


FileNotFoundError: [Errno 2] No such file or directory: 'Data/UCI HAR Dataset/UCI HAR Dataset/your_dataset.csv'

In [25]:
# Parameter tuning for DBSCAN
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Assumes features_scaled (NumPy array), kmeans_labels and optimal_kmeans_clusters
# were defined in earlier cells
eps_values = [0.1, 0.5, 1.0, 1.5, 2.0]  # Candidate epsilon values
min_samples_values = [5, 10, 15, 20, 25]  # Candidate min_samples values

# Keep each score together with its parameters so the best score maps back correctly
dbscan_results = []  # (score, eps, min_samples)
for eps in eps_values:
    for min_samples in min_samples_values:
        dbscan_labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(features_scaled)
        clustered = dbscan_labels != -1  # Label -1 marks noise, not a cluster
        if len(np.unique(dbscan_labels[clustered])) > 1:  # At least two real clusters
            score = davies_bouldin_score(features_scaled[clustered], dbscan_labels[clustered])
            dbscan_results.append((score, eps, min_samples))

if dbscan_results:
    # Lower Davies-Bouldin scores indicate better-separated clusters
    _, optimal_eps, optimal_min_samples = min(dbscan_results)
    optimal_dbscan_params = (optimal_eps, optimal_min_samples)

    # Use the optimal parameters to perform clustering
    dbscan = DBSCAN(eps=optimal_eps, min_samples=optimal_min_samples)
    dbscan_labels = dbscan.fit_predict(features_scaled)
    clustered = dbscan_labels != -1

    # Evaluate K-Means and DBSCAN (noise points are excluded from the DBSCAN score)
    kmeans_silhouette = silhouette_score(features_scaled, kmeans_labels)
    dbscan_davies_bouldin = davies_bouldin_score(features_scaled[clustered], dbscan_labels[clustered])

    print("Optimal number of clusters for K-Means:", optimal_kmeans_clusters)
    print("Optimal parameters for DBSCAN (epsilon, min_samples):", optimal_dbscan_params)
    print("K-Means Silhouette Score:", kmeans_silhouette)
    print("DBSCAN Davies-Bouldin Score:", dbscan_davies_bouldin)
else:
    print("DBSCAN failed to generate multiple clusters with the given parameter ranges.")


DBSCAN failed to generate multiple clusters with the given parameter ranges.
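
A likely reason for the message above is that eps values up to 2.0 are small for standardized data with hundreds of features, where typical pairwise distances grow roughly with the square root of the dimensionality. A common heuristic is to derive eps from the data by looking at each point's distance to its k-th nearest neighbour. The sketch below illustrates this on synthetic data; the helper `k_distances` and the use of the 95th percentile in place of reading the knee off a plot are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature matrix (assumption: three separated groups).
X_demo, _ = make_blobs(
    n_samples=600,
    centers=[[0, 0], [10, 0], [0, 10]],
    cluster_std=0.8,
    random_state=42,
)
X_demo_scaled = StandardScaler().fit_transform(X_demo)


def k_distances(X, k):
    """Sorted distance from every point to its k-th nearest neighbour (itself excluded)."""
    # Ask for k + 1 neighbours because each point is returned as its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    distances, _ = nn.kneighbors(X)
    return np.sort(distances[:, -1])


min_samples = 10
kdist = k_distances(X_demo_scaled, k=min_samples)

# Crude stand-in for locating the knee of the sorted k-distance curve by eye.
eps_candidate = float(np.percentile(kdist, 95))

labels = DBSCAN(eps=eps_candidate, min_samples=min_samples).fit_predict(X_demo_scaled)
n_clusters = len(set(labels) - {-1})  # label -1 is noise, not a cluster
noise_fraction = float(np.mean(labels == -1))

print(f"eps candidate: {eps_candidate:.3f}")
print(f"clusters found: {n_clusters}, noise fraction: {noise_fraction:.2%}")
```

Plotting `kdist` and choosing eps near the point where the curve bends sharply upward is the usual manual version of this step; the resulting value can then be used to centre a finer grid search.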


Task 2 – Use a dimensionality reduction technique before using K-Means and DBSCAN on the dataset.

• What is the dimensionality reduction technique that you choose, and why?

• Does it have any effect on your code efficiency, both in terms of computational efficiency and clustering output?

• How do you compare the outcome of this model with the model where the dimensionality reduction technique was not applied to the dataset?
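
The last two questions can be approached by running the same clustering on the full and the PCA-reduced features, then comparing fit time, silhouette score, and how closely the two partitions agree. The following is a minimal sketch on synthetic data; `fit_and_time` is a hypothetical helper, and the dataset size is an arbitrary assumption.

```python
import time

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: six groups in 100 dimensions (assumption, not the real dataset).
X_demo, _ = make_blobs(n_samples=1000, n_features=100, centers=6, random_state=42)
X_full = StandardScaler().fit_transform(X_demo)
X_reduced = PCA(n_components=0.95, random_state=42).fit_transform(X_full)


def fit_and_time(X, n_clusters=6, random_state=42):
    """Fit K-Means and return (labels, seconds spent fitting, silhouette score)."""
    start = time.perf_counter()
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(X)
    elapsed = time.perf_counter() - start
    return model.labels_, elapsed, silhouette_score(X, model.labels_)


labels_full, t_full, sil_full = fit_and_time(X_full)
labels_pca, t_pca, sil_pca = fit_and_time(X_reduced)

# Adjusted Rand index of 1.0 means both runs produced the same partition.
agreement = adjusted_rand_score(labels_full, labels_pca)

print(f"Dimensions: {X_full.shape[1]} -> {X_reduced.shape[1]}")
print(f"Fit time:   {t_full:.3f}s -> {t_pca:.3f}s")
print(f"Silhouette: {sil_full:.3f} -> {sil_pca:.3f}")
print(f"Agreement between the two partitions (ARI): {agreement:.3f}")
```

The two silhouette scores are computed in different feature spaces, so they are indicative rather than strictly comparable; the adjusted Rand index is the more direct measure of whether PCA changed the clustering.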

In [3]:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# Load the dataset
# Assuming features_train, features_test, y_train, y_test are loaded from the dataset

# Concatenate train and test features
features = pd.concat([features_train, features_test], axis=0, ignore_index=True)

# Standardize the features
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

# Apply PCA for dimensionality reduction
pca = PCA(n_components=0.95)  # Retain 95% of variance
features_pca = pca.fit_transform(features_scaled)

# Perform clustering using K-Means with PCA-transformed features
kmeans = KMeans(n_clusters=6, random_state=42)
kmeans.fit(features_pca)
kmeans_labels = kmeans.labels_

# Perform clustering using DBSCAN with PCA-transformed features
dbscan = DBSCAN(eps=3, min_samples=2)
dbscan_labels = dbscan.fit_predict(features_pca)

# Evaluate K-Means with PCA using Silhouette Score
kmeans_silhouette = silhouette_score(features_pca, kmeans_labels)
print("K-Means with PCA Silhouette Score:", kmeans_silhouette)

# Compare clustering outcomes, computational efficiency, and interpretability with the model where dimensionality reduction was not applied


NameError: name 'features_train' is not defined
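
The NameError arises because `features_train` and `features_test` are never loaded. The sketch below shows one way to read them, assuming the standard UCI HAR archive layout (`train/X_train.txt`, `train/y_train.txt`, and the matching files under `test/`), where the feature files are whitespace-delimited without a header. The helper `load_har_split` is hypothetical, and the demo writes tiny placeholder files to a temporary directory so that it runs without the real data.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd


def load_har_split(root, split):
    """Load features and activity labels for one split ('train' or 'test')."""
    root = Path(root)
    features = pd.read_csv(root / split / f"X_{split}.txt", sep=r"\s+", header=None)
    labels = pd.read_csv(root / split / f"y_{split}.txt", header=None)[0]
    return features, labels


# Demo on placeholder files that mimic the assumed layout (not real sensor data).
with tempfile.TemporaryDirectory() as tmp:
    rng = np.random.default_rng(0)
    for split, n_rows in (("train", 5), ("test", 3)):
        split_dir = Path(tmp) / split
        split_dir.mkdir()
        np.savetxt(split_dir / f"X_{split}.txt", rng.normal(size=(n_rows, 4)))
        np.savetxt(split_dir / f"y_{split}.txt", rng.integers(1, 7, size=n_rows), fmt="%d")

    features_train, y_train = load_har_split(tmp, "train")
    features_test, y_test = load_har_split(tmp, "test")

# Same concatenation step as in the cell above.
features = pd.concat([features_train, features_test], axis=0, ignore_index=True)
print(features.shape)
```

With the real data, the first argument would be the dataset directory, for example the `Data/UCI HAR Dataset/UCI HAR Dataset` folder referenced in the first cell, provided the files sit where this sketch assumes.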

Task 3 – Visualize your clustering.

• Have you applied any dimensionality reduction techniques? Why?
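
Clusters fitted in a high-dimensional space cannot be plotted directly, so a separate two-component projection is typically used for display only. The sketch below illustrates this on synthetic data; `plot_clusters_2d` is a hypothetical helper, and the projection is independent of whatever features were used for clustering.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: three groups in 20 dimensions (assumption, not the real dataset).
X_demo, _ = make_blobs(n_samples=500, n_features=20, centers=3, random_state=42)
X_scaled = StandardScaler().fit_transform(X_demo)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_scaled)


def plot_clusters_2d(X, labels, title):
    """Project X onto its first two principal components and colour points by label."""
    pca_2d = PCA(n_components=2, random_state=42)
    coords = pca_2d.fit_transform(X)
    explained = pca_2d.explained_variance_ratio_

    fig, ax = plt.subplots(figsize=(8, 6))
    scatter = ax.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="viridis", s=10)
    ax.set_title(title)
    # Report the explained variance so readers can judge how faithful the view is.
    ax.set_xlabel(f"Principal Component 1 ({explained[0]:.0%} of variance)")
    ax.set_ylabel(f"Principal Component 2 ({explained[1]:.0%} of variance)")
    fig.colorbar(scatter, ax=ax, label="Cluster label")
    return fig, coords


fig, coords = plot_clusters_2d(X_scaled, labels, "K-Means clusters (2-D PCA projection)")
```

In a notebook the figure renders at the end of the cell. Clusters that overlap in this view may still be separated along components that are not shown.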

In [4]:
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Visualize clustering results with PCA-transformed features
def plot_clusters(features_pca, labels, title):
    # PCA with n_components=0.95 usually keeps more than three components,
    # so plot the first three in 3D when available and the first two otherwise
    if features_pca.shape[1] >= 3:
        fig = plt.figure(figsize=(8, 6))
        ax = fig.add_subplot(111, projection='3d')
        ax.scatter(features_pca[:, 0], features_pca[:, 1], features_pca[:, 2], c=labels, cmap='viridis')
        ax.set_zlabel('Principal Component 3')
    else:
        fig, ax = plt.subplots(figsize=(8, 6))
        ax.scatter(features_pca[:, 0], features_pca[:, 1], c=labels, cmap='viridis')
    ax.set_title(title)
    ax.set_xlabel('Principal Component 1')
    ax.set_ylabel('Principal Component 2')
    plt.show()

# Plot clusters obtained from K-Means with PCA
plot_clusters(features_pca, kmeans_labels, 'K-Means Clustering with PCA')

# Plot clusters obtained from DBSCAN with PCA
plot_clusters(features_pca, dbscan_labels, 'DBSCAN Clustering with PCA')


NameError: name 'features_pca' is not defined

Task 4 – Write a scientific report which includes

• Introduction (what is the problem you are solving?)

• Data processing (what are the choices you made in data processing and how you performed it?)

• Modeling (make sure you have answered all the questions in Tasks 1-3)

• Conclusion (How do you interpret the identified clusters? What do they represent? What were the “scientific” bottlenecks? How did you overcome them?)
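
When activity labels are available, one way to interpret the clusters is to cross-tabulate them against the labels and summarise the agreement with the adjusted Rand index. The sketch below uses synthetic data with six groups standing in for the six activities; the mapping of synthetic groups to activity names is purely illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Activity names as listed in the report; their assignment to synthetic groups is arbitrary.
activity_names = [
    "WALKING", "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS",
    "SITTING", "STANDING", "LAYING",
]

# Synthetic stand-in for the features and the true activity labels (assumption).
X_demo, y_demo = make_blobs(n_samples=600, n_features=10, centers=6, random_state=42)
cluster_labels = KMeans(n_clusters=6, n_init=10, random_state=42).fit_predict(X_demo)

# Rows are true activities, columns are cluster ids, cells are sample counts.
true_names = pd.Series(np.array(activity_names)[y_demo], name="activity")
table = pd.crosstab(true_names, pd.Series(cluster_labels, name="cluster"))

# 1.0 means the clusters reproduce the labels exactly; values near 0 mean chance-level agreement.
ari = adjusted_rand_score(y_demo, cluster_labels)

print(table)
print(f"Adjusted Rand index: {ari:.3f}")
```

A row whose counts are spread across several columns points to an activity that the clustering does not isolate, which is the kind of observation the conclusion can report.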

In [5]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Load the dataset
# Assuming features_train, features_test, y_train, y_test are loaded from the dataset

# Concatenate train and test features
features = pd.concat([features_train, features_test], axis=0, ignore_index=True)

# Data processing
# Standardize the features
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

# Modeling
# Apply PCA for dimensionality reduction
pca = PCA(n_components=0.95)  # Retain 95% of variance
features_pca = pca.fit_transform(features_scaled)

# Perform clustering using K-Means with PCA-transformed features
kmeans = KMeans(n_clusters=6, random_state=42)
kmeans.fit(features_pca)
kmeans_labels = kmeans.labels_

# Perform clustering using DBSCAN with PCA-transformed features
dbscan = DBSCAN(eps=3, min_samples=2)
dbscan_labels = dbscan.fit_predict(features_pca)

# Evaluate K-Means with PCA using Silhouette Score
kmeans_silhouette = silhouette_score(features_pca, kmeans_labels)

# Visualize clustering results with PCA-transformed features
def plot_clusters(features_pca, labels, title):
    # PCA with n_components=0.95 usually keeps more than three components,
    # so plot the first three in 3D when available and the first two otherwise
    if features_pca.shape[1] >= 3:
        fig = plt.figure(figsize=(8, 6))
        ax = fig.add_subplot(111, projection='3d')
        ax.scatter(features_pca[:, 0], features_pca[:, 1], features_pca[:, 2], c=labels, cmap='viridis')
        ax.set_zlabel('Principal Component 3')
    else:
        fig, ax = plt.subplots(figsize=(8, 6))
        ax.scatter(features_pca[:, 0], features_pca[:, 1], c=labels, cmap='viridis')
    ax.set_title(title)
    ax.set_xlabel('Principal Component 1')
    ax.set_ylabel('Principal Component 2')
    plt.show()

# Plot clusters obtained from K-Means with PCA
plot_clusters(features_pca, kmeans_labels, 'K-Means Clustering with PCA')

# Plot clusters obtained from DBSCAN with PCA
plot_clusters(features_pca, dbscan_labels, 'DBSCAN Clustering with PCA')

# Write a scientific report
report = """
Scientific Report: Human Activity Recognition Clustering

Introduction:
The problem addressed in this study is human activity recognition using smartphone sensor data. The goal is to identify different activities such as walking, walking upstairs, walking downstairs, sitting, standing, and laying based on accelerometer and gyroscope readings from a Samsung Galaxy S II smartphone.

Data Processing:
The dataset consists of sensor readings captured at a constant rate of 50Hz, preprocessed with noise filters, and segmented into fixed-width sliding windows. Features were extracted from these windows in the time and frequency domains. The data was then standardized using StandardScaler to ensure that all features are on a similar scale.

Modeling:
We applied Principal Component Analysis (PCA) to reduce the dimensionality of the standardized features while retaining 95% of the variance. Clustering was then performed on the PCA-transformed features using K-Means and DBSCAN. For K-Means, we chose 6 clusters based on domain knowledge, since the dataset records six activities. For DBSCAN, we used an epsilon value of 3 and a minimum samples parameter of 2. We evaluated the clustering quality of K-Means using the Silhouette Score.

Conclusion:
The clusters are intended to correspond to the six recorded activities: walking, walking upstairs, walking downstairs, sitting, standing, and laying. Whether each cluster maps to a single activity should be checked against the activity labels, because the Silhouette Score measures only how compact and well separated the clusters are. The main scientific bottleneck was the high dimensionality of the dataset, which we addressed by applying PCA. This reduced the computational cost and made it possible to visualize the clustering results in a lower-dimensional space.
"""

print(report)


NameError: name 'features_train' is not defined