## What you'll learn in this course

* What dimensionality reduction is and why we use it
* What PCA is
* The intuition behind the computation

## **Dimensionality Reduction**

A common misconception in data science is that the more explanatory variables we have at our disposal, the better the model we can derive from them.

The key to effective data science is producing useful results within a given context. Sometimes, having too many variables can introduce redundancy and make interpretation difficult.

To address this, we use dimensionality reduction techniques. These methods allow us to summarize data efficiently, reducing the number of features while preserving the most important information.

---

## **Why Do We Use Dimensionality Reduction**

### 1. Visualisation

Imagine you have a dataset with four dimensions (4-D). Since humans can only visualize up to three dimensions, plotting such data directly is impossible. However, by reducing the dimensionality (e.g., from 4D to 2D), we can project the data into a lower-dimensional space and visualize it more easily.
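As a rough illustration of going from 4D to 2D, the sketch below uses scikit-learn's bundled Iris dataset, which happens to have four features. The choice of dataset and the scaling step are assumptions made for this example, not something the course text prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Iris has 150 samples and 4 numeric features (an assumed example dataset).
X = load_iris().data

# Put every feature on the same scale before projecting.
X_scaled = StandardScaler().fit_transform(X)

# Keep only the two directions that capture the most variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(X_2d.shape)  # (150, 2)
# Share of the total variance kept by each of the two new axes.
print(pca.explained_variance_ratio_)
```

The two columns of `X_2d` can then be passed to any 2D scatter plot.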


### 2. Noise Reduction

It is common in data science to have redundant or highly correlated variables that describe the same phenomenon. Reducing the number of dimensions can filter out noise and highlight the most relevant patterns. For example, in a dataset of noisy images, keeping only the leading components can suppress much of the pixel-level noise while preserving the overall shapes.
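The noise-filtering effect can be sketched with synthetic data. Everything below is hypothetical: the data is generated from two hidden factors plus random noise, and the number of components kept (`k = 2`) is assumed to be known in advance, which is rarely the case in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, k = 500, 10, 2

# Hypothetical clean signal: 10 correlated features driven by 2 hidden factors.
latent = rng.normal(size=(n_samples, k))
mixing = rng.normal(size=(k, n_features))
clean = latent @ mixing

# Observed data = clean signal + independent noise on every feature.
noisy = clean + 0.3 * rng.normal(size=clean.shape)

# Center the data, find the principal axes, and keep only the top k.
mean = noisy.mean(axis=0)
centered = noisy - mean
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
top = Vt[:k]  # shape (k, n_features)

# Project onto the top-k axes, then map back to the original feature space.
denoised = centered @ top.T @ top + mean

err_noisy = np.linalg.norm(noisy - clean)
err_denoised = np.linalg.norm(denoised - clean)
print(err_noisy, err_denoised)
```

The reconstruction from two components should sit closer to the clean signal than the noisy observations do, because most of the noise lies in the discarded directions.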

---

## **Principal Component Analysis (PCA)**

PCA is the most famous algorithm for dimensionality reduction.

The idea of this unsupervised algorithm is to build linear combinations of the original features that transform your initial dataset into a smaller one with fewer columns.



<img src="https://raw.githubusercontent.com/sbendimerad/sklearn_course_demo/main/images/image1.png" alt="PCA_logic"/>

To achieve this, the original dataset is multiplied by a matrix whose columns are vectors called eigenvectors. These eigenvectors have two key properties:

* They define new axes (principal components) that best capture the variance in the data.
* They summarize the original data, preserving the most important patterns with minimal information loss.


<img src="https://raw.githubusercontent.com/sbendimerad/sklearn_course_demo/main/images/image3.png" alt="UA"/>



---

# **How Does It Work?**

How do we find these **eigenvectors**? We follow **four key steps**:

## **1. Normalization**  

This is a standard preprocessing step in Machine Learning, often called standardization. We apply the following formula to every value $x_i$ of each feature:

$$
z_i = \frac{x_i - \mu}{\sigma}
$$

Where:  
- $\mu$ is the **mean** of the feature  
- $\sigma$ is the **standard deviation**  

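This ensures that all features have the same scale (mean 0 and standard deviation 1), so that features with large numeric ranges do not dominate the variance that PCA tries to capture.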
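A minimal NumPy sketch of this formula, applied column by column. The three features and their scales are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 100 samples, 3 features on very different scales.
X = rng.normal(
    loc=[40.0, 50000.0, 2.0],
    scale=[10.0, 15000.0, 1.0],
    size=(100, 3),
)

mu = X.mean(axis=0)     # per-feature mean
sigma = X.std(axis=0)   # per-feature standard deviation
Z = (X - mu) / sigma    # z_i = (x_i - mu) / sigma, feature by feature

print(Z.mean(axis=0))   # approximately 0 for every feature
print(Z.std(axis=0))    # approximately 1 for every feature
```

scikit-learn's `StandardScaler` performs the same computation and is usually the more convenient choice inside a pipeline.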

---

## **2. Compute the Covariance Matrix**  

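Covariance measures how two variables in our dataset vary together: it is positive when they tend to increase together, negative when one tends to decrease as the other increases, and close to zero when they are unrelated. Below is an example of a **covariance matrix**: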


<img src="https://raw.githubusercontent.com/sbendimerad/sklearn_course_demo/main/images/image4.png" alt="PCA_logic"/>

- **Diagonal values** → variance of each variable  
- **Off-diagonal values** → covariance between variables  

### **Why is this important?**  
The goal of PCA is to **summarize information** efficiently. Ideally, we want to transform our data into a new coordinate system where the **covariance matrix is diagonal**, meaning its only non-zero values lie on the diagonal and every covariance between two different features is zero.  

This means:  
- We have **new uncorrelated features** (principal components).  
- Redundant information is **removed**.  
- Noise is reduced, and we keep only the **most important information**.  
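The sketch below computes a covariance matrix with NumPy and then checks the claim about the new coordinate system: after rotating the data onto the eigenvectors of its covariance matrix, the off-diagonal covariances vanish. The three synthetic variables are assumptions chosen so that two of them are strongly correlated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical variables: x2 is strongly correlated with x1, x3 is independent.
x1 = rng.normal(size=200)
x2 = 0.8 * x1 + 0.2 * rng.normal(size=200)
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

# Standardize, then compute the covariance matrix (columns are variables).
Z = (X - X.mean(axis=0)) / X.std(axis=0)
C = np.cov(Z, rowvar=False)
print(np.round(C, 2))  # large off-diagonal entry between x1 and x2

# Eigenvectors of the symmetric matrix C define the new axes.
eigvals, eigvecs = np.linalg.eigh(C)

# Express the data in the new coordinate system and recompute the covariance.
Z_rot = Z @ eigvecs
C_rot = np.cov(Z_rot, rowvar=False)
print(np.round(C_rot, 2))  # diagonal: the new features are uncorrelated
```

The diagonal of `C_rot` holds the eigenvalues of `C`, that is, the variance carried by each new axis.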


## **3. Compute SVD**  

PCA relies on **Singular Value Decomposition (SVD)** to break down the standardized data matrix $A$ (one row per sample, one column per feature) into three matrices:  

$$
A = U \Sigma V^\intercal
$$

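Where:  
- $U$ contains the **left singular vectors**, which are the eigenvectors of $AA^\intercal$.  
- $\Sigma$ is a diagonal matrix of **singular values**. Their squares are the eigenvalues of $A^\intercal A$ and represent the importance of each axis.  
- $V^\intercal$ contains the **right singular vectors**, which are the eigenvectors of $A^\intercal A$ (new axes for projection).  

Finding $V$ means we've identified the best directions to **project** our data into a lower-dimensional space. This holds when the rows of $A$ are samples and the columns are centered features, because $A^\intercal A$ is then proportional to the covariance matrix from step 2.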
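A short sketch of the decomposition using `np.linalg.svd`, on a hypothetical centered matrix with rows as samples and columns as features. It checks two facts: the three factors rebuild $A$, and the squared singular values match the eigenvalues of $A^\intercal A$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data matrix: 50 samples, 4 features, centered column by column.
A = rng.normal(size=(50, 4))
A = A - A.mean(axis=0)

# Thin SVD: U is (50, 4), s holds the 4 singular values, Vt is (4, 4).
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# The three factors multiply back to the original matrix.
A_rebuilt = U @ np.diag(s) @ Vt
print(np.allclose(A, A_rebuilt))

# Squared singular values equal the eigenvalues of A^T A (sorted descending).
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]
print(np.allclose(s**2, eigvals))
```

Note that NumPy returns the singular values as a 1-D array `s` rather than as the full diagonal matrix, and returns $V^\intercal$ (here `Vt`) rather than $V$.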

---

## **4. Apply PCA**  

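To get our **principal components**, we multiply the data by the first $k$ columns of $V$, written $V_k$, where $k$ is the number of dimensions we want to keep:  

$$
A' = A V_k
$$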

This **transforms** the data into a new space where features are uncorrelated and ordered by importance.  

Thanks to this transformation, we obtain **new features** (principal components) that **summarize the data** while reducing its dimensionality.  

Imagine our original dataset contained variables like **age, salary, number of children, and years of experience**. After PCA, the data could be summarized into **two new features (PC1 and PC2)**, each one a linear combination of the original variables.  

The number of principal components is always **less than or equal to** the number of original features.
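Putting the steps together, here is a minimal NumPy sketch that reduces four features to two. The data is synthetic, the choice `k = 2` is an assumption, and rows are taken to be samples, so the projection axes are the rows of the `Vt` matrix returned by `np.linalg.svd`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: 200 samples, 4 correlated features.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 4))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 4))

# Step 1: standardize each feature.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 3: SVD of the standardized data.
U, s, Vt = np.linalg.svd(Z, full_matrices=False)

# Step 4: keep the first k axes and project the data onto them.
k = 2
components = Vt[:k].T      # shape (4, 2): one column per principal axis
Z_pca = Z @ components     # shape (200, 2): columns are PC1 and PC2

# Share of the total variance carried by each axis, in decreasing order.
explained = s**2 / np.sum(s**2)
print(Z_pca.shape)         # (200, 2)
print(explained)
```

Each column of `components` lists the weights of the four original features in one principal component, which is what "linear combination of the original variables" means in practice. scikit-learn's `PCA` class wraps the same idea, with the difference that it centers the data but does not scale it.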




<img src="https://raw.githubusercontent.com/sbendimerad/sklearn_course_demo/main/images/image1.png" alt="UA"/>
