# How-to: Principal component analysis (PCA)

## 1. Install and import the necessary packages and libraries

I already have the most recent versions of **pandas, numpy, seaborn, matplotlib and scikit-learn** installed. If you do not, you can install them with pip (see pypi.org) or with `conda install` in Anaconda Prompt (see anaconda.org). If you get the error `ImportError: cannot import name 'html5lib' from 'pip._vendor'`, you can install html5lib in Anaconda Prompt (`conda install -c anaconda html5lib`).

Currently installed versions:
<br>pandas 1.4.4
<br>numpy 1.21.5
<br>seaborn 0.12.2
<br>matplotlib 3.5.1
<br>scikit-learn 1.1.1
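
If you want to compare your own environment with the versions listed above, a small check like the one below can help. This is only a sketch: the helper name `installed_versions` is my own, and it assumes the packages were installed under their usual PyPI names.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as published on PyPI (note the hyphen in scikit-learn)
PACKAGES = ["pandas", "numpy", "seaborn", "matplotlib", "scikit-learn"]


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


versions = installed_versions(PACKAGES)
for name, ver in versions.items():
    print(f"{name}: {ver if ver is not None else 'not installed'}")
```

Any package reported as not installed needs to be installed before the import cell will run.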

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.decomposition import PCA

## 2. Read csv file into Pandas dataframe

In [None]:
# Read the csv file into the pandas dataframe
df = pd.read_csv('filename.csv')

# If the output is truncated so you can't see all rows, show them all with:
pd.set_option('display.max_rows', None)

# Show all columns as well
pd.set_option('display.max_columns', None)

df.head()
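
`pd.set_option` changes the display settings for the rest of the session. If you only want the full output for one cell, `pd.option_context` is a possible alternative. The sketch below uses a small made-up CSV (read from a string instead of 'filename.csv') so that it runs on its own.

```python
import io

import pandas as pd

# Hypothetical stand-in for 'filename.csv' so the example is self-contained
csv_text = (
    "variable1,variable2,variable3,variable4\n"
    "1.0,2.0,3.0,4.0\n"
    "2.0,1.5,2.5,5.0\n"
    "3.0,1.0,2.0,6.0\n"
)
df_demo = pd.read_csv(io.StringIO(csv_text))

# Remember the current setting so we can compare afterwards
default_max_rows = pd.get_option('display.max_rows')

# Inside the "with" block, all rows and columns are displayed
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    inside_max_rows = pd.get_option('display.max_rows')
    print(df_demo)

# Outside the block, the previous setting applies again
restored_max_rows = pd.get_option('display.max_rows')
print(restored_max_rows == default_max_rows)
```

If you have already used `pd.set_option` and want the defaults back, `pd.reset_option('display.max_rows')` undoes the change.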

## 3. Principal component analysis (PCA)

In [None]:
# Selecting the variables you want to include in the PCA
df2 = df[['variable1', 'variable2', 'variable3', 'variable4']]
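
Before running the PCA it is worth checking two things about the selected variables. scikit-learn's `PCA` centers the data but does not rescale it, so variables measured on larger scales will dominate the components, and it does not accept missing values. Below is a minimal sketch of one way to handle both, using made-up data. Dropping incomplete rows and standardizing with `StandardScaler` are assumptions on my part, not necessarily the right choice for your dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up data with very different scales per variable
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'variable1': rng.normal(0, 1, 50),
    'variable2': rng.normal(10, 100, 50),
    'variable3': rng.normal(-5, 0.1, 50),
    'variable4': rng.normal(3, 20, 50),
})
demo.loc[0, 'variable2'] = np.nan  # one missing value to show the clean-up step

# Assumption: dropping rows with missing values is acceptable here
demo_complete = demo.dropna()

# Standardize every variable to mean 0 and standard deviation 1
scaled_values = StandardScaler().fit_transform(demo_complete)
demo_scaled = pd.DataFrame(
    scaled_values,
    columns=demo_complete.columns,
    index=demo_complete.index,  # keep the original row labels
)

print(demo_scaled.describe())
```

If you standardize, you would pass the scaled dataframe to `fit_transform` instead of the raw selection.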

In [None]:
# Conduct the PCA
pca_name = PCA(n_components=2)  # Use n_components to specify the number of principal components you want
principalComponents = pca_name.fit_transform(df2)

# Create the df that will contain all the principal component values
principal_df = pd.DataFrame(data=principalComponents,
                            columns=['pc1', 'pc2'])  # Add more column names if you have more than 2 components

# Check the results
print(principal_df.head())
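
Besides printing the first rows, a quick scatter plot of the first two components is a common way to check the results. The sketch below uses random made-up data and plain matplotlib. The column names `pc1` and `pc2` follow the ones used above; everything else is illustrative.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Made-up data standing in for the selected variables
rng = np.random.default_rng(1)
demo = pd.DataFrame(
    rng.normal(size=(100, 4)),
    columns=['variable1', 'variable2', 'variable3', 'variable4'],
)

# Same steps as above: fit the PCA and store the component values
scores = PCA(n_components=2).fit_transform(demo)
scores_df = pd.DataFrame(scores, columns=['pc1', 'pc2'])

# One point per row of the data, positioned by its two component values
fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(scores_df['pc1'], scores_df['pc2'], s=15, alpha=0.7)
ax.axhline(0, color='grey', linewidth=0.5)  # reference lines through the origin
ax.axvline(0, color='grey', linewidth=0.5)
ax.set_xlabel('pc1')
ax.set_ylabel('pc2')
ax.set_title('Observations on the first two principal components')
```

In a notebook the figure is displayed below the cell; in a script you would add `plt.show()` at the end.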

In [None]:
# Get the proportion of variance explained by each principal component
print('Explained variance per principal component: {}'.format(pca_name.explained_variance_ratio_))

In [None]:
# Calculate the total variance explained by the retained components
print('Total explained variance:')
print(pca_name.explained_variance_ratio_.sum())  # Sums the ratios from the cell above, so no values need to be typed in by hand
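
The total explained variance can also help you decide how many components to keep. One possible approach is to fit a PCA with all components and look at the cumulative explained variance, as sketched below with made-up data. The 90% threshold is a hypothetical value chosen for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up data: 6 variables driven mostly by 2 underlying factors plus noise
rng = np.random.default_rng(2)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
data = latent @ mixing + 0.1 * rng.normal(size=(200, 6))

# Fit with all components (n_components left at its default)
pca_full = PCA().fit(data)

# Running total of the explained variance ratios
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
print(cumulative)

# Smallest number of components whose cumulative ratio reaches the threshold
threshold = 0.90  # hypothetical cut-off, adjust to your needs
n_needed = int(np.searchsorted(cumulative, threshold) + 1)
print('Components needed:', n_needed)
```

The last value of the cumulative sum is 1 (up to rounding), because all components together explain all of the variance.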

In [None]:
# Get the loadings on each of the principal components
# Define feature_names
feature_names = list(df2.columns)  # The same variables, in the same order, as used to fit the PCA

# Get loadings
loadings = pca_name.components_.T * np.sqrt(pca_name.explained_variance_)
loading_matrix = pd.DataFrame(loadings, columns=['pc1', 'pc2'], index=feature_names)
print(loading_matrix.sort_values(by=['pc1'], ascending=False))  # Sorts the features according to the values for PC1
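
A note on interpreting the loadings: computed this way, each loading is the covariance between a variable and a component's scores, scaled by the component's standard deviation. Dividing a loading by the standard deviation of its variable gives the correlation between that variable and the component. If the variables were standardized beforehand, the loadings are already (almost exactly) correlations. The sketch below checks this relationship on made-up data.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Made-up data: three variables share a common factor, one is pure noise
rng = np.random.default_rng(3)
base = rng.normal(size=150)
demo = pd.DataFrame({
    'variable1': base + 0.3 * rng.normal(size=150),
    'variable2': 2 * base + rng.normal(size=150),
    'variable3': rng.normal(size=150),
    'variable4': -base + 0.5 * rng.normal(size=150),
})

pca_demo = PCA(n_components=2)
scores = pca_demo.fit_transform(demo)

# Loadings computed exactly as in the cell above
loadings = pca_demo.components_.T * np.sqrt(pca_demo.explained_variance_)

# Divide each row by that variable's sample standard deviation (ddof=1,
# matching how explained_variance_ is computed) to get correlations
correlations = loadings / demo.std(ddof=1).to_numpy()[:, None]

# Direct check: correlation between each variable and each component's scores
direct = np.array([
    [np.corrcoef(demo[col], scores[:, k])[0, 1] for k in range(2)]
    for col in demo.columns
])

print(pd.DataFrame(correlations, columns=['pc1', 'pc2'], index=demo.columns))
print(np.allclose(correlations, direct))
```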

In [None]:
# Add the PCs to the original dataframe (concatenate)
df3 = pd.concat([df, principal_df], axis=1)
print('shape', df3.shape)
df3.tail()
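
One thing to watch out for when concatenating: `pd.concat` with `axis=1` matches rows by index label, not by position. If rows were removed before the PCA (for example because of missing values), a dataframe of principal components with a fresh 0, 1, 2, ... index will be attached to the wrong rows. The sketch below shows the problem and one possible fix on made-up data. Dropping the incomplete row is an assumption for the sake of the example.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

demo = pd.DataFrame({
    'variable1': [1.0, 2.0, np.nan, 4.0, 5.0],
    'variable2': [2.0, 1.0, 3.0, 5.0, 4.0],
})

# Row 2 is dropped, so the remaining index labels are 0, 1, 3, 4
complete = demo.dropna()
scores = PCA(n_components=2).fit_transform(complete)

# Default index 0..3: the scores end up attached to the wrong rows
misaligned = pd.concat(
    [demo, pd.DataFrame(scores, columns=['pc1', 'pc2'])],
    axis=1,
)

# Reusing the index of the rows that went into the PCA keeps things aligned
aligned = pd.concat(
    [demo, pd.DataFrame(scores, columns=['pc1', 'pc2'], index=complete.index)],
    axis=1,
)

print(misaligned)
print(aligned)
```

In the aligned version, the row that was excluded from the PCA has missing values for `pc1` and `pc2`, as it should.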

## 4. Saving the dataset

In [None]:
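# Note: saving under the same name as the input file overwrites it; use a new filename if you want to keep the original
df3.to_csv('filename.csv')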
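
By default, `to_csv` also writes the row index as the first column. When the file is read back with `pd.read_csv`, that column shows up as `Unnamed: 0`. If you do not need the index, passing `index=False` avoids this. The sketch below shows the difference using a temporary folder and made-up data; the file names are hypothetical.

```python
import os
import tempfile

import pandas as pd

demo = pd.DataFrame({'variable1': [1.0, 2.0], 'pc1': [0.5, -0.5]})

with tempfile.TemporaryDirectory() as tmp_dir:
    with_index_path = os.path.join(tmp_dir, 'with_index.csv')
    without_index_path = os.path.join(tmp_dir, 'without_index.csv')

    demo.to_csv(with_index_path)                  # default: index is written
    demo.to_csv(without_index_path, index=False)  # index is left out

    reloaded_with_index = pd.read_csv(with_index_path)
    reloaded_without_index = pd.read_csv(without_index_path)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```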