# **Dimensionality Reduction Methods**



The dimensionality of a dataset is the number of features it contains. Reducing dimensionality shortens runtimes and often improves model performance, which makes it a powerful tool for datasets with “high dimensionality”. For instance, a hundred-feature problem can sometimes be reduced to fewer than ten transformed features, saving computational time and resources while maintaining or even improving performance. Dimensionality reduction methods are typically machine learning algorithms themselves, such as Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA).

These techniques transform the existing feature space into a new, smaller set of features, usually ordered by decreasing importance. Since they “extract” new features from high-dimensional data, they’re also referred to as Feature Extraction methods. The transformed features no longer correspond directly to anything in the real world. They are mathematical combinations of the original features and are often difficult to interpret. This lack of interpretability is one of the main drawbacks of dimensionality reduction.

We can apply these reduction methods not just to the whole feature space `X` but also to a subset of highly correlated features.
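
As a rough sketch of reducing only part of the feature space, scikit-learn's `ColumnTransformer` can run PCA on a chosen block of columns and pass the rest through unchanged. The synthetic data, the column indices, and the name `pca_block` are assumptions made for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): columns 0-2 are noisy
# copies of one hidden signal, columns 3-4 are independent features.
rng = np.random.default_rng(0)
signal = rng.normal(size=(200, 1))
correlated = signal + 0.1 * rng.normal(size=(200, 3))
independent = rng.normal(size=(200, 2))
X = np.hstack([correlated, independent])

# Standardize and compress only the correlated block; pass the rest through.
reducer = ColumnTransformer(
    transformers=[
        ("pca_block", make_pipeline(StandardScaler(), PCA(n_components=1)), [0, 1, 2]),
    ],
    remainder="passthrough",
)
X_reduced = reducer.fit_transform(X)
print(X_reduced.shape)  # (200, 3): one principal component plus two untouched columns
```

The untouched columns keep their original meaning, so only the correlated block loses interpretability.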

## **Principal Component Analysis (PCA)**

Principal component analysis (PCA) is an unsupervised learning algorithm that transforms high-dimensional data into a smaller number of features using principal components (PCs). The goal is to preserve as much of the variance (information) in the original data as possible in a lower-dimensional space.

### **np.linalg.eig**

The new features generated by PCA are linear combinations of the original features. In this exercise, we will perform PCA using the NumPy function `np.linalg.eig`, which performs eigendecomposition and returns the eigenvalues and eigenvectors. Each eigenvalue is proportional to the share of variation described by its principal component. The eigenvectors are also known as the principal axes. They tell us how to transform (rotate) our data into new features that capture this variation.

```python
import numpy as np
# `data` is assumed to be a pandas DataFrame of numeric features
correlation_matrix = data.corr()
eigenvalues, eigenvectors = np.linalg.eig(correlation_matrix)
```
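
One caveat: `np.linalg.eig` does not promise any particular ordering of the eigenvalues, while a scree plot assumes they run from largest to smallest. A minimal sketch of one way to handle this, assuming a synthetic stand-in for `data`, is to use `np.linalg.eigh` (intended for symmetric matrices such as a correlation matrix) and sort explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for `data`: 4 features, two of them strongly correlated.
base = rng.normal(size=(300, 4))
base[:, 1] = base[:, 0] + 0.2 * base[:, 1]
correlation_matrix = np.corrcoef(base, rowvar=False)

# eigh is meant for symmetric matrices; it returns real eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(correlation_matrix)

# Reorder so the first component explains the most variance.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]  # columns are the principal axes

print(eigenvalues)
```

The eigenvalues of a correlation matrix sum to the number of features, which is a quick sanity check on the result.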

After performing PCA, we generally want to know how useful the new features are. One way to visualize this is to create a scree plot, which shows the proportion of information described by each principal component. The proportion of information explained is equal to the relative size of each eigenvalue:



```python
info_prop = eigenvalues / eigenvalues.sum()
print(info_prop)
```

To create a scree plot, we can then plot these relative proportions:

```python
import matplotlib.pyplot as plt

plt.plot(np.arange(1, len(info_prop) + 1), info_prop, 'bo-')
plt.show()
```

Another way to view this is to see how many principal axes it takes to reach around 95% of the total amount of information. Ideally, we’d like to retain as few features as possible while still reaching this threshold. To do this, we need to calculate the cumulative sum of the `info_prop` vector we created earlier:

```python
cum_info_prop = np.cumsum(info_prop)

plt.plot(np.arange(1,len(info_prop)+1), cum_info_prop, 'bo-')
plt.hlines(y=.95, xmin=0, xmax=15)  # 95% threshold
plt.vlines(x=4, ymin=0, ymax=1)  # number of components that reach it for this dataset
plt.show()
```
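
Rather than reading the cutoff off the plot, the number of components can also be computed directly. The sketch below uses made-up eigenvalues (an assumption, not values from the exercise) that are already sorted in descending order.

```python
import numpy as np

# Hypothetical eigenvalues, already sorted in descending order.
eigenvalues = np.array([4.0, 2.5, 1.5, 1.0, 0.6, 0.4])
info_prop = eigenvalues / eigenvalues.sum()
cum_info_prop = np.cumsum(info_prop)

# Index of the first cumulative value that reaches the threshold, plus one
# because components are counted from 1.
threshold = 0.95
n_keep = int(np.argmax(cum_info_prop >= threshold) + 1)
print(n_keep)  # 5
```

scikit-learn offers a similar shortcut: passing a float between 0 and 1 as `n_components` to `PCA` (for example `PCA(n_components=0.95)`) keeps just enough components to explain that share of the variance.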

### **sklearn.decomposition.PCA**

Another way to perform PCA is using the scikit-learn class `sklearn.decomposition.PCA`. The steps to perform PCA using this method are:

1. Standardize the data matrix. This is done by subtracting the mean and dividing by the standard deviation of each column vector.

```python
    from sklearn.preprocessing import StandardScaler

    # Standardize numerical features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
```

2. Perform eigendecomposition by fitting the standardized data. We can access the eigenvectors using the `components_` attribute and the proportional sizes of the eigenvalues using the `explained_variance_ratio_` attribute. Compared with the NumPy approach, this class offers several solvers for computing the principal axes (selected with the `svd_solver` parameter), which can be faster and more numerically stable on large datasets.

```python
    import pandas as pd
    from sklearn.decomposition import PCA

    pca = PCA(n_components=2)

    # Eigenvectors (principal axes), one column per component
    components = pca.fit(X_scaled).components_
    components = pd.DataFrame(components).transpose()
    components.index = X.columns  # assumes X is a pandas DataFrame
    print(components)

    var_ratio = pca.explained_variance_ratio_
    var_ratio = pd.DataFrame(var_ratio).transpose()
    print(var_ratio)
```


3. Once we have performed PCA and obtained the eigenvectors, we can use them to project the data onto the first few principal axes. We can do this by taking the dot product of the data and the eigenvectors, or by calling `fit_transform` on a `sklearn.decomposition.PCA` object as follows:

```python
    from sklearn.decomposition import PCA

    # Apply PCA (reduce to 2 components for visualization)
    pca = PCA(n_components=2)
    X_pca = pca.fit_transform(X_scaled)

    # Convert to DataFrame for visualization
    df_pca = pd.DataFrame(X_pca, columns=['PC1', 'PC2'])
    df_pca['Survived'] = y.values
```


4. Once we have the transformed data, we can look at a scatter plot of the first two transformed features using seaborn or matplotlib. This allows us to view relationships between multiple features at once in 2D or 3D space. Often, the first two or three principal components reveal clusters in the data.

```python
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Plot PCA results
    plt.figure(figsize=(8, 6))
    sns.scatterplot(x='PC1', y='PC2', hue='Survived', data=df_pca, palette='coolwarm', alpha=0.7)
    plt.title('PCA on Titanic Dataset')
    plt.xlabel('Principal Component 1')
    plt.ylabel('Principal Component 2')
    plt.legend(title='Survived')
    plt.show()

    # Same scatter plot drawn with lmplot (regression line disabled);
    # hue must name a column that exists in df_pca
    sns.lmplot(x='PC1', y='PC2', data=df_pca, hue='Survived', fit_reg=False)
    plt.show()
```
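
To make steps 1 to 3 less of a black box, the sketch below repeats the standardization and the projection by hand and compares them with scikit-learn's output. The random data is a hypothetical stand-in for the real feature matrix.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data: 5 features with some shared structure.
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))

# Step 1 by hand: subtract the column mean, divide by the column standard deviation.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)
X_scaled = StandardScaler().fit_transform(X)

# Steps 2-3 with scikit-learn.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Step 3 by hand: dot product of the (already centered) data with the principal axes.
X_projected = X_scaled @ pca.components_.T

print(np.allclose(X_manual, X_scaled))    # manual scaling matches StandardScaler
print(np.allclose(X_projected, X_pca))    # manual projection matches fit_transform
print(np.corrcoef(X_pca, rowvar=False).round(6))  # off-diagonal entries are ~0
```

The last line also shows that the projected features are uncorrelated with each other.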


So far we have used PCA to find principal axes and project the data onto them. We can use a subset of the projected data for modeling, while retaining most of the information in the original (and higher-dimensional) dataset.

For example, recall in the previous exercise that the first four principal axes already contained 95% of the total amount of variance (or information) in the original data. We can use the first four components to train a model, just like we would on the original 16 features.

Because of the lower dimensionality, we should expect training times to be faster. Furthermore, the principal components are uncorrelated with one another, which can improve performance for models that are sensitive to correlated inputs.

```python
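from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Train model without PCA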
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"Accuracy without PCA: {accuracy_score(y_test, y_pred):.4f}")

# Train model with PCA-transformed data
X_train_pca, X_test_pca, y_train, y_test = train_test_split(X_pca, y, test_size=0.2, random_state=42)
model_pca = RandomForestClassifier(random_state=42)
model_pca.fit(X_train_pca, y_train)
y_pred_pca = model_pca.predict(X_test_pca)
print(f"Accuracy with PCA (2 components): {accuracy_score(y_test, y_pred_pca):.4f}")
```

PCA reduces dimensionality at the cost of some interpretability. In this comparison, the model trained without PCA will often score higher, because keeping only two components discards information and tree-based models such as random forests already handle many features well on their own. PCA tends to help more when features are highly correlated or when using distance-based models like KNN.
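
The comparison above fits the scaler and PCA on the full dataset before splitting, so the test rows influence the transformation. A hedged alternative is to wrap the steps in a scikit-learn `Pipeline`, which fits them on the training split only. The sketch below does this with KNN on synthetic data; the dataset and the choice of five components are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a real dataset (an assumption for illustration):
# 20 features, many of them redundant combinations of the informative ones.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, n_redundant=10, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The scaler and PCA are fit on the training split only, then reused on the test split.
knn_pca = make_pipeline(StandardScaler(), PCA(n_components=5), KNeighborsClassifier())
knn_pca.fit(X_train, y_train)

score = knn_pca.score(X_test, y_test)
print(f"KNN accuracy with PCA (5 components): {score:.4f}")
```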

### **Extra Reading**

https://medium.com/@abhaysingh71711/linear-discriminant-analysis-lda-maximizing-class-separability-for-supervised-learning-f5f0a504c196

## **Linear Discriminant Analysis (LDA)**

Like PCA, linear discriminant analysis (LDA) transforms data into new features that are linear combinations of the original ones. However, instead of maximizing the preserved variance, LDA maximizes the separation between known classes so that they are well separated in a lower-dimensional space. It is a supervised technique for classification problems with continuous independent variables, and it is used in a wide variety of applications including image recognition, marketing, and biological classification. It works by using linear algebra to find a subspace of the independent variables that simplifies the classification problem.


Let’s start with a simple example. Here’s a classification problem with two continuous independent variables and two classes: a red class and a blue class.

![image](images/lda_1.png)

In this example you could easily draw a line that would separate most of the red data points from the blue ones. But could you simplify the problem by reducing the data to a single dimension? If we project the data onto a 1-dimensional subspace, we might be able to separate red from blue with only one dimension instead of two. Here are two different ways of projecting the data onto 1-dimensional subspaces.

![image](images/lda_2.png)

![image](images/lda_3.png)

In the first example, there’s no good way to separate red from blue. But in the second example you could easily choose a good decision boundary. This second subspace isn’t just any subspace: it’s the one that does the best job of separating red and blue. It was obtained by using LDA.

This is what LDA does in general. It finds the best possible subspace for a given classification problem. When there are only two classes, LDA finds a subspace that maximizes the ratio of variance between the classes to variance within the classes. This means that it finds a subspace where classes are far apart from each other, but the observations within each individual class are close to each other. LDA can also be applied when there are more than two classes, but this is slightly more complicated.
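
For the two-class case, the direction that maximizes this ratio can be computed in a few lines of NumPy. The sketch below uses Fisher's classic solution, in which the direction is proportional to the inverse within-class scatter matrix times the difference of the class means. The red and blue samples are synthetic and the helper `separation` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two hypothetical classes in 2D with the same covariance and different means.
cov = [[1.0, 0.8], [0.8, 1.0]]
red = rng.multivariate_normal([0.0, 0.0], cov, size=200)
blue = rng.multivariate_normal([2.0, 0.0], cov, size=200)

mean_red, mean_blue = red.mean(axis=0), blue.mean(axis=0)

# Within-class scatter: spread of each class around its own mean, summed.
S_w = (red - mean_red).T @ (red - mean_red) + (blue - mean_blue).T @ (blue - mean_blue)

# Fisher's direction: proportional to S_w^{-1} (mean_blue - mean_red).
w = np.linalg.solve(S_w, mean_blue - mean_red)
w /= np.linalg.norm(w)

def separation(direction):
    """Ratio of between-class to within-class variance along a direction."""
    r, b = red @ direction, blue @ direction
    return (b.mean() - r.mean()) ** 2 / (r.var() + b.var())

print(separation(w))                     # along the LDA direction
print(separation(np.array([1.0, 0.0])))  # along the x-axis, for comparison
```

Even though the class means differ only along the x-axis here, the LDA direction separates the classes better because it also accounts for the correlation within each class.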

Once LDA finds the best subspace, data points can be projected onto that subspace. This yields a data set with fewer dimensions but nearly as much predictive power. LDA can also be used as a classifier by itself with the additional step of computing a decision boundary.

Notice that LDA is similar to Principal Component Analysis (PCA). Both LDA and PCA are dimensionality reduction tools. They both reduce dimensions in a similar way: by using linear algebra to find optimal subspaces for a statistical or machine learning problem.

A key difference between LDA and PCA is that they have different applications. PCA is unsupervised: it ignores class labels and uses linear algebra to find the subspace that maximizes the variance of a data set. This makes it a general-purpose tool for compression, visualization, and decorrelating features before fitting a model such as linear regression. LDA, on the other hand, is supervised: it uses the class labels to find a subspace that is best for classification problems.

```python
# Import libraries
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Load data
seeds = pd.read_csv('seeds.csv')
X = seeds.drop('variety', axis=1)
y = seeds['variety']

# Create LDA model
lda = LinearDiscriminantAnalysis(n_components=1)

# Fit the data and create a subspace X_new
X_new = lda.fit_transform(X, y)
```

`X_new` is the projection of `X` onto a 1-dimensional subspace. We can use it to train a classifier like logistic regression.

```python
# Import library
from sklearn.linear_model import LogisticRegression

# Create logistic regression model
lr = LogisticRegression()

# Fit the model
lr.fit(X_new, y)

# Model accuracy (measured on the same data the model was trained on)
print(lr.score(X_new, y))
```
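
Because the score above is measured on the training data, it may be optimistic. A minimal sketch of a held-out evaluation is shown below. It uses synthetic three-class data in place of `seeds.csv` (an assumption for illustration) and chains LDA and logistic regression in a pipeline so that LDA is fit on the training split only.

```python
from sklearn.datasets import make_blobs
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the real data (an assumption): 3 classes, 7 features.
X, y = make_blobs(n_samples=210, n_features=7, centers=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# LDA allows at most (number of classes - 1) components, so 3 classes give up to 2.
model = make_pipeline(LinearDiscriminantAnalysis(n_components=2), LogisticRegression())
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print(test_accuracy)
```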


## **T-distributed Stochastic Neighbor Embedding (t-SNE)**

T-distributed stochastic neighbor embedding (t-SNE) is another dimensionality reduction technique, but it uses a nonlinear method to map each data point from a high-dimensional space to a 2- or 3-dimensional space. As its name implies, t-SNE uses the Student's t-distribution to model the points so that similar points are mapped close to each other and dissimilar points are mapped farther apart. As a result, t-SNE is great at preserving the local structure of the original data.
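
A minimal sketch of t-SNE with scikit-learn's `sklearn.manifold.TSNE` follows. The Iris dataset bundled with scikit-learn and the perplexity value are assumptions chosen for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Iris ships with scikit-learn; it stands in for any labelled dataset here.
X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# perplexity roughly controls how many neighbors each point attends to;
# it must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)

print(X_tsne.shape)  # (150, 2)
```

Unlike `PCA`, `TSNE` has no separate `transform` method for unseen data, so the embedding is typically used for visualization rather than as input features for a model.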