# Principal Components & Clustering Analysis Benefits

Principal component analysis and clustering analysis are two very useful and prominent techniques in machine learning. They are different techniques that do not serve the same purpose. However, both are valuable tools for data scientists like us to build and optimize our models and analyses. Hence, we will review their respective benefits and the scenarios in which they complement each other.

## 1. PCA Benefits

Let's start with a definition:

"Principal Component Analysis (PCA) is a statistical technique used to reduce the dimensionality of the data (reduce the number of features in the dataset) by selecting the most important features that capture maximum information about the dataset."

In other words, PCA uses the correlation between the variables of a dataset to build new variables, called principal components, that reduce the size of the data while minimizing the loss of information. Strictly speaking, PCA does not pick a subset of the original features: each component combines all of them.

The output of PCA is a set of components, each of which is a particular linear combination of our variables. The first component explains the largest share of the variance, and each subsequent component explains less. The objective is to explain the highest percentage of variance with the smallest number of components.
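
To make the idea of components and explained variance concrete, here is a minimal sketch using NumPy's singular value decomposition. The `pca` helper, the toy data, and the variable names are assumptions made for illustration, not part of the original analysis.

```python
import numpy as np

def pca(X, n_components):
    """Return (scores, components, explained_variance_ratio) for data X."""
    X_centered = X - X.mean(axis=0)  # PCA works on centered data
    # Rows of Vt are the principal directions, ordered by decreasing variance
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained_variance = S**2 / (X.shape[0] - 1)
    ratio = explained_variance / explained_variance.sum()
    components = Vt[:n_components]
    scores = X_centered @ components.T  # observations expressed in the new variables
    return scores, components, ratio[:n_components]

rng = np.random.default_rng(0)
# Toy data (assumption): 200 observations, 3 variables, two of them strongly correlated
x1 = rng.normal(size=200)
X = np.column_stack([
    x1,
    2 * x1 + rng.normal(scale=0.1, size=200),
    rng.normal(size=200),
])

scores, components, ratio = pca(X, n_components=3)
print(ratio)  # share of variance explained by each component, in decreasing order
```

Each row of `components` holds the weights of one linear combination of the original variables, and `ratio` shows how quickly the explained variance drops from one component to the next.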

![image.png](attachment:image.png)

### Computation Resources Reduction

The most intuitive benefit of the technique is the reduction of computation time and power.

Indeed, by reducing the amount of data while keeping nearly the same level of information, the model can run faster and still produce results close to those obtained on the original dataset.

Instead of processing every piece of data, some of which is useless or redundant, the model can focus on the components that explain most of the variance.
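
As a rough illustration of the size reduction, the sketch below keeps the smallest number of components whose cumulative explained variance reaches a target. The 95% threshold and the synthetic data (50 variables driven by 5 latent factors) are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data (assumption): 500 observations of 50 variables
# generated from only 5 latent factors plus a little noise
latent = rng.normal(size=(500, 5))
mixing = rng.normal(size=(5, 50))
X = latent @ mixing + 0.1 * rng.normal(size=(500, 50))

X_centered = X - X.mean(axis=0)
_, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
ratio = S**2 / np.sum(S**2)          # explained variance ratio per component
cumulative = np.cumsum(ratio)

threshold = 0.95                     # hypothetical target, adjust to your needs
# First position where the cumulative ratio reaches the threshold
k = int(np.searchsorted(cumulative, threshold) + 1)

X_reduced = X_centered @ Vt[:k].T    # same observations, far fewer columns
print(X.shape, "->", X_reduced.shape)
```

Any model trained on `X_reduced` handles at most a handful of columns instead of 50, which is where the savings in computation time come from.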

### Dimension Reduction

The reduction of dimensions has several mechanical effects:

- Overfitting arises when a model is tied too closely to a particular dataset. By removing the redundant and least informative parts of the dataset, we can move away from the particular case and get closer to the general rule underlying the data. The model then focuses on the most informative dimensions without being distracted by every peculiarity of the data, which tends to reduce the noise.


- For the same reasons, the method can also yield more representative samples, which can be very useful when bootstrapping.


- If you want to visualize the whole dataset at once, it will be difficult or impossible if there are more than 3 variables. The alternative is a series of 2-D graphs, but it becomes hard for the human brain to connect those graphs together. In this case, PCA reduces the dimensions so that the relations between observations described by many variables can be viewed on a single graph.
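
The visualization point can be sketched as follows: observations described by 6 variables are projected onto their first two principal components, giving coordinates that fit on one 2-D scatter plot. The two synthetic groups are an assumption for illustration, and the plotting call itself is left out.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data (assumption): two groups of 100 observations, 6 variables each
group_a = rng.normal(loc=0.0, size=(100, 6))
group_b = rng.normal(loc=4.0, size=(100, 6))
X = np.vstack([group_a, group_b])

X_centered = X - X.mean(axis=0)
_, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

coords_2d = X_centered @ Vt[:2].T               # one (x, y) pair per observation
retained = (S[:2] ** 2).sum() / (S ** 2).sum()  # variance kept by the 2-D view

print(coords_2d.shape)  # (200, 2)
print(round(retained, 2))

# Distance between the two group averages along the first component
gap = abs(coords_2d[:100, 0].mean() - coords_2d[100:, 0].mean())
print(round(gap, 2))
```

The two columns of `coords_2d` can be passed to any scatter plot function, and `retained` indicates how faithful the single graph is to the full dataset.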

### Multicollinearity Reduction

In large datasets, the more variables you have, the more likely it is that some of them are related. Correlated variables can distort your algorithm, for example by making regression coefficients unstable and hard to interpret.

This phenomenon is called "multicollinearity". Fortunately, this issue is addressed by PCA.

Dimension reduction groups variables that are correlated and explain the same part of the variance into shared components, and the resulting components are uncorrelated with one another. It also spares you from hunting for correlated variables manually, which is tedious in a large dataset, to say the least.
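
A small sketch of this effect, assuming synthetic data in which two of three variables are nearly copies of each other: the correlation matrix is computed before and after projecting onto the principal components.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy data (assumption): x2 is almost a copy of x1, x3 is independent
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.3, size=300)
x3 = rng.normal(size=300)
X = np.column_stack([x1, x2, x3])

corr_before = np.corrcoef(X, rowvar=False)

# Project the centered data onto all principal directions
X_centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
scores = X_centered @ Vt.T

corr_after = np.corrcoef(scores, rowvar=False)

print(np.round(corr_before, 2))  # strong correlation between the first two variables
print(np.round(corr_after, 2))   # off-diagonal entries are numerically zero
```

The original variables `x1` and `x2` are strongly correlated, while the component scores are not correlated with each other at all, which is why models fed with components no longer suffer from multicollinearity.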

## 2. Clustering Benefits

"The objective of cluster analysis is to find similar groups of subjects, where “similarity” between each pair of subjects means some global measure over the whole set of characteristics."

In summary, the goal of clustering analysis is to group observations into classes in such a way that an observation is more similar to the other observations in its class than to those in any other class. The similarity or distance between observations is computed with a metric of your choice, for example the Euclidean distance.

There are many clustering methods, but the expected output is usually an array of the same length as the dataset, in which every observation is assigned the number of its class.
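
For instance, the Euclidean distance between every pair of observations can be computed in a few lines. The four 2-D points below are made up for the example.

```python
import numpy as np

# Toy observations (assumption): four points described by two variables
X = np.array([
    [0.0, 0.0],
    [3.0, 4.0],
    [0.0, 1.0],
    [6.0, 8.0],
])

# Broadcasting builds every pairwise difference, shape (4, 4, 2)
diff = X[:, None, :] - X[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=2))  # D[i, j] = distance between observations i and j

print(D[0, 1])  # 5.0

# Nearest neighbour of each observation, ignoring the zero distance to itself
D_no_self = D.copy()
np.fill_diagonal(D_no_self, np.inf)
nearest = D_no_self.argmin(axis=1)
print(nearest)
```

A matrix like `D` is the raw material of many clustering methods: observations separated by small distances are candidates for the same class.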

![image.png](attachment:image.png)

### Classification & Flexibility

The first use of the clustering method is to identify a number of subgroups in the dataset (known in advance or not) and the relations between those groups (hierarchical or not).

The many existing clustering algorithms allow great flexibility in building those clusters, depending on the type of data and the relations underlying it.

The main factor that distinguishes those algorithms is the way you choose to define your classes.
Here are a couple of examples:
- K-means starts from a predetermined number of centroids and assigns every observation to its closest centroid to determine the population of each class
- t-SNE computes the similarity of every point relative to every other point. Strictly speaking, it is an embedding technique used to reveal clusters visually rather than a clustering algorithm
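
The K-means idea can be sketched in plain NumPy as below. This is a simplified illustration rather than a production implementation: the `kmeans` helper is hypothetical, and it uses a deterministic farthest-first initialization as a simplifying assumption, whereas random or k-means++ starts are more common in practice.

```python
import numpy as np

def kmeans(X, k, n_iter=100):
    """Return (labels, centroids) after alternating assignment and update steps."""
    # Farthest-first initialization (assumption made to keep the example deterministic)
    centroids = [X[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[d.argmax()])
    centroids = np.array(centroids)

    for _ in range(n_iter):
        # Assignment step: each observation joins its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its observations
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # assignments are stable
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(0)
# Toy data (assumption): two well-separated blobs of 50 observations each
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=10.0, scale=0.5, size=(50, 2))
X = np.vstack([blob_a, blob_b])

labels, centroids = kmeans(X, k=2)
print(labels.shape)  # one class number per observation
print(centroids)
```

The `labels` array has the same length as the dataset and holds the class number of each observation, which is the typical output format of a clustering method.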

Clustering analysis is basically a flexible and efficient method to organise and interpret your data.

### Discover Unknown Structure in Data

Clustering can also be used in exploratory data analysis (EDA) to get a first sense of the data. EDA is a very important step when first coming into contact with data, especially when the dataset is large and the variables are complex.

Clustering analysis does not distinguish between dependent and independent variables, so you can load the cleaned data into the clustering algorithm and get a clear idea of the populations present in the dataset and their relationships to each other.

Digging deeper, it can also help uncover underlying structures in your data, including counterintuitive ones. Such hidden insights can change your understanding of the model.

## 3. Combination

Now that we are more familiar with the mechanisms and benefits of those two techniques, it is interesting to see in which cases we can combine them efficiently.

To summarize, we know that PCA reduces dimensions and optimizes the information/sample ratio, while clustering identifies subpopulations and underlying patterns.

The ideal scenario for combining them would then be a large dataset with many variables and possibly redundant information, in which we would like to profile the characteristics of the subpopulations it contains.

PCA would be used first to condense the data and reduce the number of variables. In a second step, clustering would analyze the reduced data to determine the classes.

In other words, when classifying large datasets, PCA can be a great help to clustering algorithms, improving both the efficiency and the quality of the analysis.
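
The two-step approach can be sketched end to end as follows. Everything here is an illustrative assumption: the helper names `pca_scores` and `kmeans`, the synthetic dataset of 20 variables with two hidden groups, the choice of 2 components, and the farthest-first initialization of K-means.

```python
import numpy as np

def pca_scores(X, n_components):
    """Project centered data onto its first principal components."""
    X_centered = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

def kmeans(X, k, n_iter=100):
    """Simplified K-means with farthest-first initialization (assumption)."""
    centroids = [X[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[d.argmax()])
    centroids = np.array(centroids)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(3)
n_per_group, n_vars = 150, 20

# Toy data (assumption): two groups that differ only on the first 5 of 20 variables
shift = np.zeros(n_vars)
shift[:5] = 3.0
X = np.vstack([
    rng.normal(size=(n_per_group, n_vars)),
    rng.normal(size=(n_per_group, n_vars)) + shift,
])
truth = np.repeat([0, 1], n_per_group)

# Step 1: PCA condenses 20 variables into 2 components
X_reduced = pca_scores(X, n_components=2)
# Step 2: clustering runs on the reduced data
labels, _ = kmeans(X_reduced, k=2)

# Cluster numbers are arbitrary, so compare with the truth up to a swap
agreement = max(np.mean(labels == truth), np.mean(labels != truth))
print(X.shape, "->", X_reduced.shape)
print(round(agreement, 3))
```

In this toy setting the clustering step only has to process 2 columns instead of 20, yet it still recovers the two hidden groups, which is the kind of gain the combination aims for.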

#### Sources

http://theprofessionalspoint.blogspot.com/2019/03/advantages-and-disadvantages-of_4.html

https://towardsdatascience.com/principal-components-analysis-pca-fundamentals-benefits-insights-for-industry-2f03ad18c4d7

https://medium.com/@dmitriy.kavyazin/principal-component-analysis-and-k-means-clustering-to-visualize-a-high-dimensional-dataset-577b2a7a5fe2

https://www.qualtrics.com/experience-management/research/cluster-analysis/

https://www.surveygizmo.com/resources/blog/cluster-analysis/#:~:text=Clustering%20allows%20researchers%20to%20identify,data%20once%20they%20are%20discovered.

https://www.youtube.com/watch?v=FgakZw6K1QQ