## EDA: iris dataset - part 3: PCA

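PCA (principal component analysis) lets us reduce the number of dimensions of our dataset without sacrificing too much information. Note: each principal component (PC) is a linear combination of the original features, chosen to maximise variance. The first PC points in the direction of maximum variance; the second is orthogonal to the first and captures the largest share of the remaining variance.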
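
To make the idea concrete, here is a minimal sketch that finds the principal directions by hand from the covariance matrix. The data are synthetic and all names (`data`, `centred`, `scores`) are illustrative assumptions; scikit-learn's `PCA` uses an SVD internally rather than this eigendecomposition, so treat it as an illustration, not as the library's implementation.

```python
import numpy as np

# Synthetic, correlated 2-D data (illustrative only, not the iris measurements)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
data = np.column_stack([x, 0.5 * x + rng.normal(scale=0.3, size=200)])

# Centre the data and compute its covariance matrix
centred = data - data.mean(axis=0)
cov = np.cov(centred, rowvar=False)

# Eigenvectors of the covariance matrix are the principal directions,
# eigenvalues are the variances along them (eigh returns ascending order)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
pc1, pc2 = eigvecs[:, 0], eigvecs[:, 1]

# Each score column is a linear combination of the centred features
scores = centred @ eigvecs

print("variance along PC1 and PC2:", eigvals)
print("dot product of PC1 and PC2:", pc1 @ pc2)
```

The dot product should be zero up to floating-point error, which is what "orthogonal" means here, and the variance along PC1 should be the larger of the two.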

First, import the dependencies:

In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import seaborn as sns
import matplotlib.pyplot as plt

Then load our dataset as iris:

In [None]:
iris = pd.read_csv(r"C:\Users\jschoer\Desktop\DSA103 Coding and Tests\DSA103\python-chemistry-intro\src\dsa103\lecture 7\DSA_iris_cleaned.csv")
iris.head()

1. Separate (numerical) features for PCA from the targets ("species"):

In [None]:
X = iris.drop(["species"], axis=1)
y = iris["species"]

2. Standardise (subtract the mean and divide by the standard deviation)

In [None]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
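
As a sanity check on what the scaler does, the following sketch reproduces the standardisation by hand on a small made-up matrix. `X_demo` and its values are hypothetical; the one detail worth knowing is that `StandardScaler` divides by the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix: two features on very different scales
X_demo = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 200.0], [4.0, 400.0]])

# Manual standardisation: subtract the column mean, divide by the column std (ddof=0)
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# The same operation via scikit-learn
X_sklearn = StandardScaler().fit_transform(X_demo)

print("manual and StandardScaler agree:", np.allclose(X_manual, X_sklearn))
print("column means after scaling:", X_sklearn.mean(axis=0))
print("column stds after scaling:", X_sklearn.std(axis=0))
```

Without this step, a feature measured on a larger scale would dominate the variance and therefore the first principal component.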

3. Apply PCA (specify the target number of dimensions)

In [None]:
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
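
One way to judge whether two components are enough is to fit PCA with all components and look at the cumulative explained variance. The sketch below assumes scikit-learn's bundled iris data (`load_iris`) as a stand-in for the cleaned CSV, so the numbers may differ slightly from those of the file used in this notebook; `X_demo` and `pca_full` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled iris data stands in for the cleaned CSV
X_demo = load_iris().data
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# With no n_components given, PCA keeps min(n_samples, n_features) components
pca_full = PCA().fit(X_demo_scaled)

ratios = pca_full.explained_variance_ratio_
cumulative = np.cumsum(ratios)

for i, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f"PC{i}: {r:.3f} of the variance, {c:.3f} cumulative")
```

If the cumulative value after the second component is already close to 1, little information is lost by keeping only two dimensions.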

4. Combine for plotting

In [None]:
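# Wrap the PCA scores in a DataFrame and attach the species labels.
# to_numpy() attaches the labels by row position rather than by index label.
df_pca = pd.DataFrame(X_pca, columns=["PC1", "PC2"])
df_pca["target"] = y.to_numpy()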
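
A note on the combine step: pandas matches a Series to a DataFrame column by index label, not by row position. For a freshly loaded CSV both indices run from 0 upwards, so nothing goes wrong, but if rows had been filtered out beforehand the labels could end up misaligned. A minimal sketch with made-up labels (`labels` and `frame` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Hypothetical labels whose index has a gap, as after dropping row 1
labels = pd.Series(["a", "b", "c"], index=[0, 2, 3])

# A freshly built frame gets the default index 0, 1, 2
frame = pd.DataFrame(np.zeros((3, 2)), columns=["PC1", "PC2"])

frame["aligned"] = labels                 # matched by index label
frame["positional"] = labels.to_numpy()   # matched by row position

print(frame)
```

The `aligned` column contains a missing value and a shifted label, while the `positional` column keeps the labels in their original row order.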

5. Visualise as a scatter plot

In [None]:
sns.scatterplot(data=df_pca, x="PC1", y="PC2", hue="target", s=80)
plt.title("PCA Visualization")
plt.xlabel(f"PC1 ({pca.explained_variance_ratio_[0]*100:.1f}% variance)")
plt.ylabel(f"PC2 ({pca.explained_variance_ratio_[1]*100:.1f}% variance)")
plt.show()
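
To interpret the two axes of such a plot, it can help to look at how strongly each original feature contributes to each component. The sketch below again assumes scikit-learn's bundled iris data as a stand-in for the cleaned CSV, so the column names are those of `load_iris`, not necessarily those of the file used here; `weights` is an illustrative name.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled iris data stands in for the cleaned CSV
features = load_iris(as_frame=True).data

pca_demo = PCA(n_components=2).fit(StandardScaler().fit_transform(features))

# Each row of components_ is one principal direction (unit length);
# its entries are the weights of the input features in that component
weights = pd.DataFrame(
    pca_demo.components_.T,
    index=features.columns,
    columns=["PC1", "PC2"],
)
print(weights.round(2))
```

Features with large absolute weights in a column drive that component; the sign of a whole column is arbitrary and may flip between runs or library versions.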

Compare that with the scatter plots of the untransformed iris data:

In [None]:
sns.pairplot(iris, hue='species', diag_kind='kde')
plt.suptitle('Pairplot of Features by Species', y=1.02)
plt.show()