# Dimensionality reduction

Here we work on reducing the dimensionality of the ECGs. Because the ECGs have so many dimensions, working with them directly would expose us to the curse of dimensionality and lead to overfitting. At the same time, the reduction should retain as much of the variance, or **information**, as possible.

To do so we have discussed two main methods:

* Sparse / Localized PCA
  * Uses a penalty term to encourage sparse loadings (few nonzero coefficients per component)
  * Can be used to encourage clusters in contiguous time blocks
  * Time-localized axes

* Varimax-rotated PCA
  * After regular PCA, finds a rotation of the axes that yields a few large coefficients and leaves the rest near zero
  * Peakier loadings = more local
  * Time-localized axes

In this notebook we assume that we receive a dataset of preprocessed ECGs.

The first code cell below generates a fictional dataset so that the rest of the notebook can be run.

Before anything else, let's briefly explain what PCA is, since it is the common denominator of the two methods. Classical PCA (Principal Component Analysis) is a mathematical technique that finds the main orthogonal directions of the data, those that capture maximum variance, by extracting them from the eigendecomposition of the data's covariance matrix.
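As a minimal sketch of that description, the cell below computes PCA by hand from the eigendecomposition of the covariance matrix and compares the explained-variance ratios with scikit-learn's `PCA`. The toy data and the names `X_toy`, `eigvals`, `eigvecs`, `explained_ratio` and `pca_toy` are assumptions made only for this illustration; they are not part of the ECG pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical toy data: 100 samples, 5 correlated features
X_toy = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))

# Center the data, then eigendecompose the sample covariance matrix
X_centered = X_toy - X_toy.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order

# Sort directions by decreasing variance
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Fraction of the total variance captured by each direction
explained_ratio = eigvals / eigvals.sum()

# Compare with scikit-learn's PCA keeping all components
pca_toy = PCA(n_components=5).fit(X_toy)
print("Manual ratios :", np.round(explained_ratio, 4))
print("sklearn ratios:", np.round(pca_toy.explained_variance_ratio_, 4))
```

The columns of `eigvecs` play the role of the principal directions; they may differ from `pca_toy.components_` by a sign, which does not change the variance they capture.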

The main problem with PCA is that these orthogonal directions are "dense" linear combinations of all input variables, which are often hard to **interpret**.
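To make "dense" concrete, here is a small sketch that counts how many loadings of an ordinary PCA are clearly nonzero. The random data, the `1e-3` threshold and the names `X_demo`, `pca_demo` and `dense_fraction` are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples, 50 features
X_demo = rng.normal(size=(200, 50))

pca_demo = PCA(n_components=5).fit(X_demo)

# Fraction of loadings whose magnitude exceeds a small (arbitrary) threshold
dense_fraction = np.mean(np.abs(pca_demo.components_) > 1e-3)
print(f"Fraction of non-negligible loadings: {dense_fraction:.2f}")
```

Almost every input variable contributes to every component, which is what makes the components hard to read.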

With that in mind we can begin the dimensionality reduction.



In [1]:
# TODO: generate a fictional dataset

import numpy as np
import pandas as pd


# Set seed for reproducibility
np.random.seed(42)

# Define dimensions
n_samples = 200
n_features = 300

# Create fictional ECG-like data (here using random normal values as a placeholder)
X = np.random.randn(n_samples, n_features)

# Assemble into a DataFrame for easy inspection
df_X = pd.DataFrame(X, columns=[f'feat_{i}' for i in range(n_features)])

display(df_X.head(5))


Unnamed: 0,feat_0,feat_1,feat_2,feat_3,feat_4,feat_5,feat_6,feat_7,feat_8,feat_9,...,feat_290,feat_291,feat_292,feat_293,feat_294,feat_295,feat_296,feat_297,feat_298,feat_299
0,0.496714,-0.138264,0.647689,1.52303,-0.234153,-0.234137,1.579213,0.767435,-0.469474,0.54256,...,-0.208122,-0.493001,-0.589365,0.849602,0.357015,-0.69291,0.8996,0.3073,0.812862,0.629629
1,-0.828995,-0.560181,0.747294,0.61037,-0.020902,0.117327,1.277665,-0.591571,0.547097,-0.202193,...,0.071566,-0.477657,0.47898,0.333662,1.03754,-0.510016,-0.269875,-0.978764,-0.444293,0.3773
2,0.756989,-0.922165,0.869606,1.355638,0.413435,1.876796,-0.773789,-1.244655,-1.77872,1.496044,...,0.820482,0.507274,1.066675,1.169296,1.382159,0.64871,-0.167118,0.146714,1.206509,-0.816936
3,0.368673,-0.393339,0.028745,1.278452,0.191099,0.046437,-1.359856,0.746254,0.645484,2.163255,...,0.105376,-1.334025,-0.601368,0.319782,-1.592994,0.440475,-0.019638,0.55249,0.223914,1.36414
4,0.125225,-0.429406,0.122298,0.543298,0.04886,0.040592,-0.701992,-0.662901,-1.402605,1.749577,...,-0.551858,2.558199,-0.564248,0.184551,1.54211,2.006093,2.061504,1.208366,1.024063,0.592527


Here we implement plain PCA.

In [2]:
from sklearn.decomposition import PCA
import pandas as pd

# 1) Assume df_X is your (n_samples × n_features) DataFrame
#    e.g. created earlier via:
#    df_X = pd.DataFrame(X, columns=[f'feat_{i}' for i in range(n_features)])

# 2) Instantiate PCA
#    - n_components can be either:
#        • an integer (number of PCs), or
#        • a float between 0 and 1 (fraction of variance to keep)
pca = PCA(n_components=0.95, random_state=42)

# 3) Fit & transform
X_pca = pca.fit_transform(df_X.values)

# 4) Inspect variance explained
print("Explained variance ratio per PC:")
print(pca.explained_variance_ratio_)  
print("\nCumulative variance explained:")
print(pca.explained_variance_ratio_.cumsum())

# 5) (Optional) Wrap into a DataFrame for easy use
df_X_pca = pd.DataFrame(
    X_pca,
    columns=[f'PC{i+1}' for i in range(pca.n_components_)],
    index=df_X.index
)

print("\nFirst 5 rows of the reduced data:")
print(df_X_pca.head(5))


Explained variance ratio per PC:
[0.01605944 0.01569535 0.01499596 0.01488709 0.01455131 0.01424252
 0.01415717 0.01363388 0.01356039 0.01344596 0.01314262 0.01290196
 0.01280622 0.01244572 0.01229407 0.01206551 0.01187292 0.01170122
 0.01146608 0.01138102 0.01120521 0.01105827 0.0109297  0.01070051
 0.01060557 0.01056034 0.01045172 0.01033721 0.01013765 0.01001458
 0.00994158 0.00973409 0.00968025 0.00962159 0.00944848 0.00927274
 0.00916679 0.00904363 0.00899977 0.00892411 0.00888745 0.00884979
 0.00862049 0.00838706 0.00827709 0.0081563  0.00808601 0.00802413
 0.00789816 0.0077281  0.00766952 0.00743622 0.00733128 0.00730304
 0.00724774 0.00711202 0.00701177 0.00695149 0.00691388 0.00674345
 0.00665178 0.00656398 0.00654665 0.00639504 0.00629454 0.00616742
 0.00615379 0.0060565  0.00604253 0.00594068 0.00584859 0.00578821
 0.00573479 0.00564961 0.00557211 0.00545586 0.00541105 0.00535889
 0.00532864 0.00521022 0.00514272 0.0050569  0.00496574 0.00491471
 0.00487082 0.00476166 0.0046

# Sparse / Localized PCA

Sparse PCA pursues the same goal as PCA but, to avoid the "dense" linear combinations, it adds a penalty or constraint on the component loadings so that each principal component involves only a few variables and the rest are left at 0.

This is useful because it concentrates each component on the most informative features, eliminates the less informative ones, and is easy to interpret.
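A minimal sketch of the effect of the penalty, assuming small random data and two arbitrary values of `alpha`: the larger penalty should leave a larger share of the loadings at exactly zero. The names `X_small`, `alpha_demo`, `spca_demo` and `zero_fraction` are hypothetical and used only here.

```python
import numpy as np
from sklearn.decomposition import SparsePCA

rng = np.random.default_rng(0)

# Hypothetical small dataset so the fit is fast: 100 samples, 30 features
X_small = rng.normal(size=(100, 30))

zero_fraction = {}
for alpha_demo in (0.01, 2.0):
    spca_demo = SparsePCA(n_components=3, alpha=alpha_demo, random_state=0)
    spca_demo.fit(X_small)
    # Share of loadings that the L1 penalty drove to exactly zero
    zero_fraction[alpha_demo] = np.mean(spca_demo.components_ == 0)
    print(f"alpha={alpha_demo}: {zero_fraction[alpha_demo]:.2%} of loadings are zero")
```

In practice `alpha` trades interpretability against retained variance, so it is worth sweeping a few values on the real ECG data.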

In [3]:
from sklearn.decomposition import SparsePCA
import pandas as pd
import numpy as np

# Recreate (or assume you already have) df_X
np.random.seed(42)
n_samples, n_features = 200, 300
X = np.random.randn(n_samples, n_features)
df_X = pd.DataFrame(X, columns=[f'feat_{i}' for i in range(n_features)])

# 1) Fit SparsePCA without normalize_components
spca = SparsePCA(
    n_components=10,   # number of sparse principal components
    alpha=1.0,         # sparsity penalty (higher → sparser loadings)
    random_state=42,
)
spca.fit(df_X.values)

# 2) Extract component loadings into a DataFrame
components = pd.DataFrame(
    spca.components_,
    columns=df_X.columns,
    index=[f'PC_{i}' for i in range(10)]
)

# Print top-8 features (by absolute loading) for the first 5 PCs
for pc in components.index[:5]:
    top_feats = components.loc[pc].abs().sort_values(ascending=False).head(8)
    print(f"\n{pc} top loadings:")
    print(top_feats)

# 3) Transform the data into the 10-dimensional sparse-PC space
X_sp = spca.transform(df_X.values)
df_X_sp = pd.DataFrame(
    X_sp,
    columns=[f'PC_{i}' for i in range(10)]
)

print("\nTransformed data (first 5 rows):")
print(df_X_sp.head())



PC_0 top loadings:
feat_74     0.316802
feat_228    0.283498
feat_72     0.245838
feat_108    0.236455
feat_245    0.227249
feat_160    0.197237
feat_13     0.191349
feat_8      0.187856
Name: PC_0, dtype: float64

PC_1 top loadings:
feat_146    0.228632
feat_43     0.209846
feat_148    0.202400
feat_177    0.200614
feat_54     0.190349
feat_14     0.186605
feat_111    0.179402
feat_80     0.177820
Name: PC_1, dtype: float64

PC_2 top loadings:
feat_130    0.275821
feat_9      0.249756
feat_205    0.244931
feat_225    0.218859
feat_21     0.209641
feat_214    0.204038
feat_216    0.196294
feat_257    0.183819
Name: PC_2, dtype: float64

PC_3 top loadings:
feat_86     0.268748
feat_105    0.257897
feat_147    0.240687
feat_91     0.226022
feat_100    0.192003
feat_186    0.183467
feat_289    0.182757
feat_48     0.182751
Name: PC_3, dtype: float64

PC_4 top loadings:
feat_158    0.273490
feat_195    0.259760
feat_31     0.245300
feat_52     0.212296
feat_273    0.211079
feat_26     0.1

In [None]:
import numpy as np
from sklearn.metrics import mean_squared_error

# Reconstruct the data from the sparse codes: codes @ loadings.
# Note: spca.mean_ is not added back here, so this is the reconstruction
# of the centered data (assuming transform() centers its input).
X_recon = np.dot(X_sp, spca.components_)

# Compute variance and error metrics
var_orig = np.var(X, axis=0).sum()
var_recon = np.var(X_recon, axis=0).sum()
variance_ratio = var_recon / var_orig
mse = mean_squared_error(X, X_recon)

# Print results
print("SparsePCA Reconstruction Metrics")
print("---------------------------------")
print(f"Total original variance     : {var_orig:.4f}")
print(f"Total reconstructed variance: {var_recon:.4f}")
print(f"Variance ratio (recon/orig) : {variance_ratio:.4f}")
print(f"Mean Squared Error          : {mse:.4f}")


SparsePCA Reconstruction Metrics
---------------------------------
Total original variance     : 299.4894
Total reconstructed variance: 34.8273
Variance ratio (recon/orig) : 0.1163
Mean Squared Error          : 0.8851


# Varimax 

Varimax rotation takes the dense PCA loadings and rotates them to maximize the variance of squared loadings per component, producing a few large (peak) coefficients and many near zero.
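As a small sketch of what "variance of squared loadings" rewards, the function below (a hypothetical helper named `varimax_criterion`, written only for this illustration) scores a flat loading vector and a peaky one of the same norm.

```python
import numpy as np

def varimax_criterion(loadings):
    """Sum over components (columns) of the variance of the squared loadings."""
    return np.var(loadings**2, axis=0).sum()

# Two unit-norm loading vectors over 4 features (one component each)
flat = np.full((4, 1), 0.5)                      # weight spread evenly
peaky = np.array([[1.0], [0.0], [0.0], [0.0]])   # all weight on one feature

print(varimax_criterion(flat))   # 0.0
print(varimax_criterion(peaky))  # 0.1875
```

The peaky vector scores higher, so a rotation that maximizes this quantity pushes each component towards a few dominant features.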

In [12]:
import numpy as np
from sklearn.decomposition import PCA

# 4a) Dense PCA
pca = PCA(n_components=0.95, random_state=42)
X_pca = pca.fit_transform(df_X.values)
loadings = pca.components_

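# 4b) Varimax rotation function
def varimax(Phi, gamma=1.0, q=100, tol=1e-6):
    """Rotate the (p × k) loading matrix Phi with an orthogonal k × k matrix.

    gamma=1.0 gives varimax; q is the maximum number of iterations and
    tol the stopping threshold on the improvement of the criterion.
    """
    p, k = Phi.shape
    R = np.eye(k)  # start from the identity (no rotation)
    d_old = 0
    for _ in range(q):
        L = Phi @ R  # current rotated loadings
        # The SVD of the criterion gradient gives the next orthogonal rotation
        u, s, vh = np.linalg.svd(
            Phi.T @ (L**3 - (gamma/p) * L @ np.diag(np.diag(L.T @ L)))
        )
        R = u @ vh
        d = s.sum()
        # Stop once the sum of singular values improves by less than tol
        if d_old and d - d_old < tol:
            break
        d_old = d
    return Phi @ R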

# 4c) Apply varimax
rot_load = varimax(loadings.T).T

df_load_rot = pd.DataFrame(
    rot_load,
    columns=df_X.columns,
    index=[f'PC{i+1}_rot' for i in range(rot_load.shape[0])]
)

# Show top features per rotated component
for comp in df_load_rot.index[:3]:
    top_feats = df_load_rot.loc[comp].abs().nlargest(8)
    print(f"{comp} top features:\n", top_feats)

PC1_rot top features:
 feat_160    0.611958
feat_194    0.163164
feat_166    0.155095
feat_175    0.152752
feat_87     0.136912
feat_42     0.124962
feat_182    0.124229
feat_262    0.121628
Name: PC1_rot, dtype: float64
PC2_rot top features:
 feat_80     0.657233
feat_59     0.147627
feat_49     0.140667
feat_83     0.136933
feat_277    0.124374
feat_116    0.121086
feat_12     0.117473
feat_276    0.116194
Name: PC2_rot, dtype: float64
PC3_rot top features:
 feat_108    0.653636
feat_277    0.181869
feat_247    0.136170
feat_239    0.123922
feat_232    0.117709
feat_186    0.111421
feat_71     0.110252
feat_121    0.107989
Name: PC3_rot, dtype: float64


In [15]:
X_rot = X @ rot_load.T  # project the data onto the rotated axes
# Compute variance metrics
var_orig = np.var(X, axis=0).sum()
var_rot = np.var(X_rot, axis=0).sum()
variance_ratio = var_rot / var_orig

# Print results
print("Varimax-rotated PCA Variance Metrics")
print("------------------------------------")
print(f"Total original variance     : {var_orig:.4f}")
print(f"Total projected variance    : {var_rot:.4f}")
print(f"Variance ratio (proj/orig)  : {variance_ratio:.4f}")
print(f"Number of components after rotation: {rot_load.shape[0]}")

Varimax-rotated PCA Variance Metrics
------------------------------------
Total original variance     : 299.4894
Total projected variance    : 284.8177
Variance ratio (proj/orig)  : 0.9510
Number of components after rotation: 144
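
A ratio of about 0.95 is expected here: an orthogonal rotation only redistributes variance among the retained axes, so the total stays at what PCA kept. The sketch below checks this on hypothetical toy data, using a random orthogonal matrix as a stand-in for the varimax rotation; `X_chk`, `pca_chk`, `W`, `Q` and `W_rot` are names assumed for this illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 samples, 30 features
X_chk = rng.normal(size=(200, 30))
pca_chk = PCA(n_components=10).fit(X_chk)
W = pca_chk.components_  # (10, 30) loadings with orthonormal rows

# Random orthogonal 10 x 10 matrix from a QR decomposition
Q, _ = np.linalg.qr(rng.normal(size=(10, 10)))
W_rot = Q @ W  # rotated loadings span the same subspace

# Total variance of the projections before and after rotation
X_chk_centered = X_chk - X_chk.mean(axis=0)
var_pca = np.var(X_chk_centered @ W.T, axis=0).sum()
var_rotated = np.var(X_chk_centered @ W_rot.T, axis=0).sum()

print(f"Total variance, PCA axes    : {var_pca:.4f}")
print(f"Total variance, rotated axes: {var_rotated:.4f}")
```

The per-component variances change under rotation, but their sum does not, so the rotation buys interpretability without losing retained variance.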
