# CS6140 Assignment 4: Unsupervised Learning on FMA Data, Part 2
- Student: Sukhrobbek Ilyosbekov

**Table of Contents**

1. [Data Preparation](#1)
2. [Dimension Reduction Techniques](#2)
    1. [Principal Component Analysis (PCA)](#2.1)
    2. [Uniform Manifold Approximation and Projection (UMAP)](#2.2)
    3. [t-Distributed Stochastic Neighbor Embedding (t-SNE)](#2.3)
    4. [Locally Linear Embedding (LLE)](#2.4)
    5. [ISOMAP](#2.5)
3. [Visualization](#3)
4. [Correlation Analysis](#4)
5. [Comparative Analysis](#5)

## Setup
To run this notebook, install the following libraries:
- pandas
- matplotlib
- seaborn
- scikit-learn
- umap-learn (which depends on numba)

They can be installed with the following command:
```bash
pip install pandas matplotlib seaborn scikit-learn numba umap-learn
```

In [None]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

def get_data_path(relative_path: str) -> str:
    """
    Get the absolute path to a file in the `dataset` directory.

    Args:
        relative_path: The path to the file relative to the `dataset` directory including the file name.

    Returns:
        The full path to the file in the `dataset` directory.

    Examples:
        >>> get_data_path("assignment1/boston_listings.csv")
        "C:/Users/username/assignments/dataset/assignment1/boston_listings.csv"
    """
    return os.path.abspath(os.path.join("../../dataset", relative_path))

## Data Preparation <a class="anchor" id="1"></a>

First, load the features dataset and flatten its four-row (multi-level) column header into single-level column names.

In [None]:
# Load the features dataset
features_df = pd.read_csv(get_data_path("assignment4/features.csv"), header=[0, 1, 2, 3])

# Flatten the multi-row header to create a single-level column index for easier manipulation
features_df.columns = ["_".join(filter(None, col)).strip() for col in features_df.columns.values]

# Remove '_Unnamed: 1_level_3', '_Unnamed: 2_level_3' and so on using regex
features_df.columns = features_df.columns.str.replace(r"_Unnamed: \d+_level_\d+$", "", regex=True)

# Rename the column feature_statistics_number_track_id to track_id
features_df.rename(columns={"feature_statistics_number_track_id": "track_id"}, inplace=True)

print("Features dataset info:")
# info() prints its summary directly and returns None, so it is not wrapped in display()
features_df.info()

print("First few rows of the features dataset:")
display(features_df.head())

Features dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 106574 entries, 0 to 106573
Columns: 519 entries, track_id to zcr_std_01
dtypes: float64(518), int64(1)
memory usage: 422.0 MB

First few rows of the features dataset:


| | track_id | chroma_cens_kurtosis_01 | chroma_cens_kurtosis_02 | chroma_cens_kurtosis_03 | chroma_cens_kurtosis_04 | chroma_cens_kurtosis_05 | chroma_cens_kurtosis_06 | chroma_cens_kurtosis_07 | chroma_cens_kurtosis_08 | chroma_cens_kurtosis_09 | ... | tonnetz_std_04 | tonnetz_std_05 | tonnetz_std_06 | zcr_kurtosis_01 | zcr_max_01 | zcr_mean_01 | zcr_median_01 | zcr_min_01 | zcr_skew_01 | zcr_std_01 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 7.180653 | 5.230309 | 0.249321 | 1.34762 | 1.482478 | 0.531371 | 1.481593 | 2.691455 | 0.866868 | ... | 0.054125 | 0.012226 | 0.012111 | 5.75889 | 0.459473 | 0.085629 | 0.071289 | 0.0 | 2.089872 | 0.061448 |
| 1 | 3 | 1.888963 | 0.760539 | 0.345297 | 2.295201 | 1.654031 | 0.067592 | 1.366848 | 1.054094 | 0.108103 | ... | 0.063831 | 0.014212 | 0.01774 | 2.824694 | 0.466309 | 0.084578 | 0.063965 | 0.0 | 1.716724 | 0.06933 |
| 2 | 5 | 0.527563 | -0.077654 | -0.27961 | 0.685883 | 1.93757 | 0.880839 | -0.923192 | -0.927232 | 0.666617 | ... | 0.04073 | 0.012691 | 0.014759 | 6.808415 | 0.375 | 0.053114 | 0.041504 | 0.0 | 2.193303 | 0.044861 |
| 3 | 10 | 3.702245 | -0.291193 | 2.196742 | -0.234449 | 1.367364 | 0.998411 | 1.770694 | 1.604566 | 0.521217 | ... | 0.074358 | 0.017952 | 0.013921 | 21.434212 | 0.452148 | 0.077515 | 0.071777 | 0.0 | 3.542325 | 0.0408 |
| 4 | 20 | -0.193837 | -0.198527 | 0.201546 | 0.258556 | 0.775204 | 0.084794 | -0.289294 | -0.81641 | 0.043851 | ... | 0.095003 | 0.022492 | 0.021355 | 16.669037 | 0.469727 | 0.047225 | 0.040039 | 0.000977 | 3.189831 | 0.030993 |
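
As a small illustration of the header flattening used above, the sketch below applies the same join, regex, and rename steps to a toy dataframe. The toy column tuples and values are assumptions made for the example (they only imitate the shape of the real header), not data from `features.csv`.

```python
import pandas as pd

# Toy four-level header imitating what pd.read_csv(..., header=[0, 1, 2, 3])
# produces for features.csv; names and values here are made up for illustration.
columns = pd.MultiIndex.from_tuples([
    ("feature", "statistics", "number", "track_id"),
    ("chroma_cens", "kurtosis", "01", "Unnamed: 1_level_3"),
    ("chroma_cens", "kurtosis", "02", "Unnamed: 2_level_3"),
    ("zcr", "mean", "01", "Unnamed: 3_level_3"),
])
toy_df = pd.DataFrame([[2, 7.18, 5.23, 0.09], [3, 1.89, 0.76, 0.08]], columns=columns)

# Join the levels of each column tuple with underscores
flat = pd.Index(["_".join(filter(None, col)).strip() for col in toy_df.columns.values])

# Drop the placeholder suffix pandas generates for the empty fourth header row
flat = flat.str.replace(r"_Unnamed: \d+_level_\d+$", "", regex=True)

toy_df.columns = flat
toy_df = toy_df.rename(columns={"feature_statistics_number_track_id": "track_id"})

print(list(toy_df.columns))
```

The first column needs the explicit rename because all four of its header cells are filled, so the regex has nothing to strip from it.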


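Next, keep only the features that belong to the selected domains. No additional scaling is applied in this notebook. Note, however, that the raw columns are not on a common scale: in the preview above, `chroma_cens_kurtosis_01` reaches 7.18 while `tonnetz_std_05` stays around 0.01, so variance- and distance-based methods will weight the large-range columns more heavily.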
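
If scaling turns out to matter for the embeddings, one common option is scikit-learn's `StandardScaler`. The following is a minimal sketch on a made-up two-column dataframe (`toy_features` is a hypothetical stand-in, with values loosely based on the preview); it is not a step this notebook performs.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for features_df: two columns on very different scales
toy_features = pd.DataFrame({
    "chroma_cens_kurtosis_01": [7.18, 1.89, 0.53, 3.70, -0.19],
    "tonnetz_std_05": [0.0122, 0.0142, 0.0127, 0.0180, 0.0225],
})

# Standardize each column to zero mean and unit variance
scaled = StandardScaler().fit_transform(toy_features)
scaled_df = pd.DataFrame(scaled, columns=toy_features.columns, index=toy_features.index)

print("std before:", toy_features.std(ddof=0).round(4).to_dict())
print("std after: ", scaled_df.std(ddof=0).round(4).to_dict())
```

Whether to standardize is a modeling choice: it keeps a few large-range columns from dominating, at the cost of discarding the original units.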

In [None]:
# Select domains for dimensionality reduction
domains_to_extract = ["chroma", "mfcc", "spectral", "tonnetz"]

# Keep every column whose name contains one of the selected domains
extracted_columns = [
    col for col in features_df.columns
    if any(domain in col for domain in domains_to_extract)
]

# Restrict the features dataframe to the extracted columns
features_df = features_df[extracted_columns]

# Display the shape and a preview of the extracted features
print("Shape of extracted features for selected domains:", features_df.shape)
display(features_df.head())

Shape of extracted features for selected domains: (106574, 504)


| | chroma_cens_kurtosis_01 | chroma_cens_kurtosis_02 | chroma_cens_kurtosis_03 | chroma_cens_kurtosis_04 | chroma_cens_kurtosis_05 | chroma_cens_kurtosis_06 | chroma_cens_kurtosis_07 | chroma_cens_kurtosis_08 | chroma_cens_kurtosis_09 | chroma_cens_kurtosis_10 | ... | tonnetz_skew_03 | tonnetz_skew_04 | tonnetz_skew_05 | tonnetz_skew_06 | tonnetz_std_01 | tonnetz_std_02 | tonnetz_std_03 | tonnetz_std_04 | tonnetz_std_05 | tonnetz_std_06 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.180653 | 5.230309 | 0.249321 | 1.34762 | 1.482478 | 0.531371 | 1.481593 | 2.691455 | 0.866868 | 1.341231 | ... | 0.200944 | 0.593595 | -0.177665 | -1.424201 | 0.019809 | 0.029569 | 0.038974 | 0.054125 | 0.012226 | 0.012111 |
| 1 | 1.888963 | 0.760539 | 0.345297 | 2.295201 | 1.654031 | 0.067592 | 1.366848 | 1.054094 | 0.108103 | 0.619185 | ... | 0.17193 | -0.99071 | 0.574556 | 0.556494 | 0.026316 | 0.018708 | 0.051151 | 0.063831 | 0.014212 | 0.01774 |
| 2 | 0.527563 | -0.077654 | -0.27961 | 0.685883 | 1.93757 | 0.880839 | -0.923192 | -0.927232 | 0.666617 | 1.038546 | ... | -0.419971 | -0.014541 | -0.199314 | -0.925733 | 0.02555 | 0.021106 | 0.084997 | 0.04073 | 0.012691 | 0.014759 |
| 3 | 3.702245 | -0.291193 | 2.196742 | -0.234449 | 1.367364 | 0.998411 | 1.770694 | 1.604566 | 0.521217 | 1.982386 | ... | 0.015767 | -1.094873 | 1.164041 | 0.246746 | 0.021413 | 0.031989 | 0.088197 | 0.074358 | 0.017952 | 0.013921 |
| 4 | -0.193837 | -0.198527 | 0.201546 | 0.258556 | 0.775204 | 0.084794 | -0.289294 | -0.81641 | 0.043851 | -0.804761 | ... | 0.081732 | 0.040777 | 0.23235 | -0.207831 | 0.033342 | 0.035174 | 0.105521 | 0.095003 | 0.022492 | 0.021355 |
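
Because the selection matches domain names as substrings, a single entry such as `"chroma"` picks up every chroma variant. The sketch below shows this on a hypothetical list of column names (`toy_columns` is an assumption for illustration) and counts how many selected columns fall under each domain.

```python
# Same domain list as in the cell above
domains = ["chroma", "mfcc", "spectral", "tonnetz"]

# Hypothetical column names, chosen only to illustrate the matching rule
toy_columns = [
    "track_id",
    "chroma_cens_kurtosis_01",
    "chroma_stft_mean_03",
    "mfcc_mean_01",
    "rmse_mean_01",
    "spectral_centroid_mean_01",
    "tonnetz_std_06",
    "zcr_std_01",
]

# Substring match: a column is kept if any domain name occurs in it
selected = [c for c in toy_columns if any(d in c for d in domains)]

# Number of selected columns per domain
counts = {d: sum(d in c for c in selected) for d in domains}

print(selected)
print(counts)
```

Running the same count on the real column list is a quick way to confirm that the 504 retained columns are split across domains as expected.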


## Dimension Reduction Techniques <a class="anchor" id="2"></a>


### 1. Principal Component Analysis (PCA) <a class="anchor" id="2.1"></a>

In [None]:
from sklearn.decomposition import PCA

# Apply PCA for 2D and 3D
pca_2d = PCA(n_components=2).fit_transform(features_df)
pca_3d = PCA(n_components=3).fit_transform(features_df)

# Save PCA results for visualization
pca_reduction = {"2D": pca_2d, "3D": pca_3d}

print("PCA reduction results:")
print("2D shape:", pca_2d.shape)
print("3D shape:", pca_3d.shape)

PCA reduction results:
2D shape: (106574, 2)
3D shape: (106574, 3)
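
The shapes alone do not say how much of the variance the projection retains. As a hedged sketch on synthetic data (the array `X` below is a made-up stand-in for `features_df`), `explained_variance_ratio_` reports the share of variance per component. With the deterministic `"full"` solver, the first two columns of a 3-component fit also coincide with a separate 2-component fit, so a single fit can serve both views.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for features_df: 200 samples, 10 correlated features
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(200, 10))

# Fit once with three components using the deterministic full SVD solver
pca = PCA(n_components=3, svd_solver="full")
X_3d = pca.fit_transform(X)

# The 2D projection is the first two columns of the 3D projection
X_2d = X_3d[:, :2]

# Separate 2-component fit, only to check the claim above
X_2d_direct = PCA(n_components=2, svd_solver="full").fit_transform(X)

print("explained variance ratio:", pca.explained_variance_ratio_.round(3))
print("2D slice matches separate 2D fit:", np.allclose(X_2d, X_2d_direct))
```

With a randomized solver, the two fits may differ slightly unless `random_state` is fixed.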


### 2. Uniform Manifold Approximation and Projection (UMAP) <a class="anchor" id="2.2"></a>

In [None]:
import umap

# Apply UMAP for 2D and 3D
umap_2d = umap.UMAP(n_components=2, random_state=42).fit_transform(features_df)
umap_3d = umap.UMAP(n_components=3, random_state=42).fit_transform(features_df)

# Save UMAP results for visualization
reduced_umap = {"2D": umap_2d, "3D": umap_3d}
print("UMAP reduction results:")
print("2D shape:", umap_2d.shape)
print("3D shape:", umap_3d.shape)
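
Each technique in this section follows the same pattern: fit a 2D and a 3D embedding, store both in a dictionary, and print the shapes. One possible way to avoid repeating that by hand is a small helper. `reduce_2d_3d` below is a hypothetical name introduced for this sketch, and it is demonstrated with PCA on random data so that it runs without the FMA files.

```python
from typing import Callable, Dict

import numpy as np
from sklearn.decomposition import PCA


def reduce_2d_3d(make_reducer: Callable[[int], object], data) -> Dict[str, np.ndarray]:
    """Fit a fresh reducer for 2 and 3 components and return both embeddings.

    make_reducer is any callable that takes n_components and returns an
    object with a fit_transform method (hypothetical helper, not from the notebook).
    """
    results = {}
    for n in (2, 3):
        results[f"{n}D"] = make_reducer(n).fit_transform(data)
    return results


# Demonstration on random data standing in for features_df
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))

demo = reduce_2d_3d(lambda n: PCA(n_components=n), X)
for name, embedding in demo.items():
    print(f"{name} shape:", embedding.shape)
```

Under the same assumptions, the UMAP cell could be written as `reduce_2d_3d(lambda n: umap.UMAP(n_components=n, random_state=42), features_df)`. The printed shapes then always come from the embeddings that were just computed.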