# Principal Component Analysis

Datasets often contain many variables, and each variable adds a dimension. Many of these features are closely correlated with each other. For example, there is a very strong relationship between the monthly and annual payments of a telecommunications company's client, or between waist width and body fat ratio. We can therefore combine such features into a new component and reduce the number of dimensions.

**What is Principal Component Analysis?**

Principal Component Analysis (PCA) is **a complexity reduction technique** that tries to reduce a large set of variables to a smaller set of **components representing most of the information in the variables**. At the conceptual level, PCA identifies sets of variables that share variance and creates a component to represent that shared variance.

PCA is useful whenever you have a large set of somewhat correlated variables and want to reduce the dimensionality or work with fewer variables. Reduced dimensionality aids in identifying patterns among the components and distinguishing shared qualities among the columns.

PCA is also handy when a regression has many predictor variables that exhibit multicollinearity. Because the principal components are uncorrelated with each other, using them as predictors removes the multicollinearity and simplifies the regression model development process.
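The following is a minimal sketch of that idea, using synthetic data in which one predictor is almost a copy of another. The variable names (`x1`, `x2`, `x3`) and the noise level are assumptions made up for illustration. It compares the correlations between the original predictors with the correlations between the component scores.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic predictors (illustrative only): x2 is almost a copy of x1.
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

# Standardize each column, then take the eigenvectors of the covariance matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(Z, rowvar=False))

# Project the data onto the eigenvectors to obtain the component scores.
scores = Z @ eigenvectors

corr_original = np.corrcoef(Z, rowvar=False)
corr_scores = np.corrcoef(scores, rowvar=False)

# Largest absolute correlation between two different components.
max_offdiag = np.abs(corr_scores - np.eye(3)).max()

print("corr(x1, x2) in the original data:", round(corr_original[0, 1], 3))
print("largest correlation between components:", max_offdiag)
```

The original predictors `x1` and `x2` are strongly correlated, while the component scores are uncorrelated up to floating-point error, which is why they can be used as regression inputs without multicollinearity.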

The **two main purposes of PCA** are: 

- Reducing dimensionality in large datasets (to avoid the **curse of dimensionality**)
- Easier visualization of datasets for analysis.

A common best practice is to choose the number of components (k) so that they retain 75%-90% of the information (variance). When some features are very highly correlated, however, it can be acceptable to go below 75%.
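One way to pick k from a variance threshold is to look at the cumulative share of variance explained by the components. The sketch below assumes standardized features and a hypothetical helper named `choose_n_components`. The synthetic data (six features driven by two latent factors) is made up for illustration.

```python
import numpy as np

def choose_n_components(X, threshold=0.90):
    """Return the smallest k whose first k components explain at least
    `threshold` of the total variance, plus the cumulative ratios."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # eigvalsh returns eigenvalues in ascending order; reverse to descending.
    eigenvalues = np.linalg.eigvalsh(np.cov(Z, rowvar=False))[::-1]
    cumulative = np.cumsum(eigenvalues / eigenvalues.sum())
    # First position where the cumulative ratio reaches the threshold.
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(cumulative)), cumulative

# Synthetic data (illustrative only): 6 features built from 2 latent factors.
rng = np.random.default_rng(42)
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.1 * rng.normal(size=(300, 6))

k, cumulative = choose_n_components(X, threshold=0.90)
print("components kept:", k)
print("cumulative explained variance:", np.round(cumulative, 3))
```

Lowering `threshold` toward 0.75 keeps fewer components at the cost of discarding more variance.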

![image.png](attachment:798a7d6c-16df-47f3-adb2-6bb42cb26e5e.png)

Principal Component Analysis (PCA) is a statistical technique that helps to simplify the complexity of a dataset by reducing the number of variables while retaining the important information.

It works by identifying the underlying patterns and relationships between the different variables in a dataset and finding new variables that summarize the original data. These new variables are called "principal components."

The first principal component captures the most significant amount of variation in the data, and each subsequent component captures less and less variation.

PCA can be useful in various fields, including finance, engineering, biology, and psychology, for identifying the critical factors that influence a phenomenon and for visualizing complex data.

![image.png](attachment:da5c8c35-9b4c-4cfd-8044-6c9b5b9fb4d6.png)

**Key terms in Principal Component Analysis (PCA)**:

- **Variance**: It is a measure of how much a variable changes or fluctuates. PCA looks for the directions (linear combinations of the variables) with the highest variance, as these carry the most information about the dataset.
- **Eigenvectors**: Eigenvectors of the covariance matrix are the direction vectors that define the new coordinate system in which we can represent our data. The eigenvector with the largest eigenvalue indicates the direction in which the data varies the most.
- **Eigenvalues**: Eigenvalues are the measures of the amount of variation captured by each eigenvector. The higher the eigenvalue, the more significant the corresponding eigenvector is.
- **Principal Components**: Principal components are the new variables that result from the transformation of the original data using eigenvectors. They are orthogonal to each other and capture most of the variation in the original data.
- **Dimensionality Reduction**: PCA helps in reducing the dimensionality of the dataset by discarding the less significant components (those with the smallest eigenvalues) while retaining most of the original data's important information.
- **Covariance Matrix**: It is a square matrix that shows the covariance between each pair of variables in the dataset. It is used to compute the eigenvectors and eigenvalues needed for PCA.
- **Scree Plot**: A scree plot is a graphical representation of the eigenvalues of the principal components, ordered from largest to smallest. The point where the eigenvalues begin to level off (the "elbow") can be used to determine the number of principal components to retain.
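The terms above fit together as a short sequence of steps: standardize, build the covariance matrix, take its eigenvectors and eigenvalues, and project. The sketch below walks through those steps with NumPy on toy data. The data and variable names are assumptions for illustration, not part of any particular library's PCA implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data (illustrative only): two correlated features and one independent feature.
n = 400
a = rng.normal(size=n)
X = np.column_stack([a, 0.8 * a + 0.3 * rng.normal(size=n), rng.normal(size=n)])

# 1. Center and scale each feature.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# 2. Covariance matrix of the standardized features (3 x 3).
C = np.cov(Z, rowvar=False)

# 3. Eigen-decomposition. eigh is meant for symmetric matrices and
#    returns eigenvalues in ascending order, so re-sort to descending.
eigenvalues, eigenvectors = np.linalg.eigh(C)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# 4. Principal components: project the data onto the eigenvectors.
scores = Z @ eigenvectors

# 5. Values a scree plot would show: each eigenvalue's share of the total.
explained_ratio = eigenvalues / eigenvalues.sum()

print("eigenvalues:", np.round(eigenvalues, 3))
print("explained variance ratio:", np.round(explained_ratio, 3))
print("variance of each component:", np.round(scores.var(axis=0, ddof=1), 3))
```

The variance of each column of `scores` matches the corresponding eigenvalue, and the eigenvectors are mutually orthogonal, which is what the definitions above describe.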

**Advantages**:

- **Reduces Complexity**: PCA is an effective technique for reducing the complexity of large datasets by summarizing correlated variables into a few principal components.
- **Retains important information**: PCA can retain the essential information in the data, even after reducing the number of variables. This makes it easier to analyze, visualize and interpret the data.
- **Improves data visualization**: PCA can help in visualizing complex data by reducing it to two or three dimensions, making it easier to interpret.
- **Removes multicollinearity**: PCA can remove the effects of multicollinearity, which occurs when two or more variables are highly correlated.
- **Speeds up machine learning algorithms**: PCA can reduce the training time of machine learning algorithms by reducing the number of input variables while keeping most of the relevant information.

**Disadvantages**:

- **Loss of interpretability**: The new principal components generated through PCA may not be directly interpretable, making it difficult to explain the results to non-technical stakeholders.
- **Dependent on variable scaling**: PCA is sensitive to the scaling of variables. If the variables have different units or scales, then the results of PCA may be biased towards the variables with larger scales.
- **Information loss**: While PCA retains most of the essential information, there is still some information loss, especially if the number of principal components used is significantly lower than the number of original variables.
- **Not suitable for categorical data**: PCA is not suitable for categorical data because it is based on variance-covariance matrix calculations, which are not applicable to categorical variables.
- **Requires domain knowledge**: The results of PCA need to be interpreted in the context of the domain knowledge of the data being analyzed. Without this understanding, the results may not be meaningful.
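The sensitivity to variable scaling is easy to see on a small example. The sketch below uses two hypothetical, correlated features on very different scales (`income` in currency units and `age` in years, both made up for illustration) and compares the first principal component before and after standardization.

```python
import numpy as np

def first_component(X):
    """Unit-length direction of largest variance (leading eigenvector
    of the covariance matrix of X)."""
    C = np.cov(X - X.mean(axis=0), rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(C)
    return eigenvectors[:, -1]  # eigh sorts ascending, so the last is largest

rng = np.random.default_rng(7)
n = 1000

# Hypothetical features: age in years, income in currency units.
age = rng.normal(40, 10, size=n)
income = 30_000 + 800 * age + rng.normal(0, 6_000, size=n)
X = np.column_stack([income, age])

# Without scaling, the large-scale feature dominates the first component.
pc1_raw = first_component(X)

# After standardization, both features contribute.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
pc1_scaled = first_component(Z)

print("PC1 loadings (raw):   ", np.round(np.abs(pc1_raw), 4))
print("PC1 loadings (scaled):", np.round(np.abs(pc1_scaled), 4))
```

On the raw data, the first component points almost entirely along `income`. After standardization, the two features carry equal weight. This is the usual reason to standardize features before PCA when they are measured in different units.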

PCA can be used as a preprocessing step for various machine learning algorithms to improve their performance. **Some of the algorithms that commonly benefit from PCA include**:

- **Linear Regression**: PCA can be used to reduce the dimensionality of the dataset before applying linear regression. This can help to reduce the effects of multicollinearity and may make the coefficient estimates more stable.
- **Logistic Regression**: PCA can be used to reduce the dimensionality of the dataset before applying logistic regression. This can help to improve the efficiency of the model and reduce the risk of overfitting.
- **Support Vector Machines (SVM)**: PCA can be used to preprocess the data before applying SVM. This can help to reduce the computational cost of the algorithm and improve its performance.
- **Neural Networks**: PCA can be used as a preprocessing step to reduce the dimensionality of the input data to neural networks. This can help to improve the efficiency of the training process and reduce the risk of overfitting.
- **Clustering**: PCA can be used to reduce the dimensionality of the dataset before applying clustering algorithms. This can help to improve the accuracy of the clustering and reduce the computational cost of the algorithm.
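When PCA is used as a preprocessing step, a common pattern is to learn the scaling and the components from the training data only and then apply the same transformation to new data. The sketch below illustrates this with a hypothetical `SimplePCA` helper (not a library class) followed by an ordinary least-squares fit on the component scores. The data is synthetic and the 150/50 split is an arbitrary assumption.

```python
import numpy as np

class SimplePCA:
    """Hypothetical minimal PCA helper, for illustration only."""

    def __init__(self, n_components):
        self.n_components = n_components

    def fit(self, X):
        # Learn centering, scaling, and components from X alone.
        self.mean_ = X.mean(axis=0)
        self.scale_ = X.std(axis=0, ddof=1)
        Z = (X - self.mean_) / self.scale_
        eigenvalues, eigenvectors = np.linalg.eigh(np.cov(Z, rowvar=False))
        order = np.argsort(eigenvalues)[::-1][: self.n_components]
        self.components_ = eigenvectors[:, order]
        return self

    def transform(self, X):
        # Reuse the statistics learned in fit(); do not recompute them.
        return ((X - self.mean_) / self.scale_) @ self.components_

# Synthetic data (illustrative only): 5 features built from 2 latent factors.
rng = np.random.default_rng(3)
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(200, 5))
y = latent[:, 0] - 2.0 * latent[:, 1] + 0.1 * rng.normal(size=200)

X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

# Fit PCA on the training rows only, then transform both splits.
pca = SimplePCA(n_components=2).fit(X_train)
T_train = pca.transform(X_train)
T_test = pca.transform(X_test)

# Linear regression on the component scores (intercept column added).
A_train = np.column_stack([np.ones(len(T_train)), T_train])
coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
y_pred = np.column_stack([np.ones(len(T_test)), T_test]) @ coef

print("reduced training shape:", T_train.shape)
print("test RMSE:", np.sqrt(np.mean((y_test - y_pred) ** 2)))
```

Fitting on the training split alone keeps information from the test rows out of the learned components, so the test error remains an honest estimate.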


Overall, PCA is a powerful technique that can improve the performance of various machine learning algorithms by reducing the dimensionality of the dataset, improving efficiency, and reducing the risk of overfitting.