# Why PCA?

In [None]:
!wget -q https://github.com/PSAM-5020-2025S-A/5020-utils/raw/main/src/data_utils.py

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

from data_utils import PCA, StandardScaler, KMeansClustering, object_from_json_url

### Get Data

In [None]:
PENGUIN_URL = "https://raw.githubusercontent.com/PSAM-5020-2025S-A/5020-utils/refs/heads/main/datasets/json/penguins.json"
penguin_data = object_from_json_url(PENGUIN_URL)

penguins_df = pd.DataFrame.from_records(penguin_data)
penguins_df

### Penguin Example

Explore the penguin data.

Let's encode the species column as integers.
It's a simple encoding, so we can do it manually with a function and the column's `apply()` method (`Series.apply()`).

In [None]:
species = list(penguins_df["species"].unique())

def species_to_label(s):
  return species.index(s)

penguins_df["label"] = penguins_df["species"].apply(species_to_label)

display(penguins_df)
penguins_df.shape
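As a point of comparison, the same kind of integer encoding can be sketched with `pd.factorize()`, which numbers categories in order of first appearance. The `toy_species` values below are made up for illustration and are not read from the penguin dataset.

```python
import pandas as pd

# Toy stand-in for a species column (illustrative values only)
toy_species = pd.Series(["Adelie", "Gentoo", "Adelie", "Chinstrap", "Gentoo"])

# factorize() assigns 0, 1, 2, ... in order of first appearance,
# the same numbering that list(unique()).index(s) produces
toy_labels, toy_categories = pd.factorize(toy_species)

print(toy_labels.tolist())       # [0, 1, 0, 2, 1]
print(list(toy_categories))      # ['Adelie', 'Gentoo', 'Chinstrap']

# The manual approach gives the same labels
toy_unique = list(toy_species.unique())
toy_manual = toy_species.apply(toy_unique.index).tolist()
```

Either approach works for a handful of categories; the manual function just makes the mapping explicit.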

### Scale

In [None]:
# TODO: Scale data and store the result in penguins_scaled_df
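For intuition about what scaling does, here is a minimal sketch of standard (z-score) scaling written directly in pandas. It does not use the `StandardScaler` from `data_utils`, and the toy column names and values are invented for illustration. Using the population standard deviation (`ddof=0`) is an assumption of this sketch.

```python
import pandas as pd

# Toy measurements on very different scales (illustrative values only)
toy_df = pd.DataFrame({
    "mass_g": [3500.0, 4200.0, 5000.0, 3900.0],
    "bill_mm": [38.0, 45.0, 48.0, 41.0],
})

# z-score: subtract each column's mean, then divide by its standard deviation
# (ddof=0 is an assumption made for this sketch)
toy_scaled = (toy_df - toy_df.mean()) / toy_df.std(ddof=0)

# After scaling, every column has mean 0 and standard deviation 1
print(toy_scaled.mean().round(6))
print(toy_scaled.std(ddof=0).round(6))
```

Once every column is on the same scale, no single feature dominates the covariances just because its units are larger.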

### Covariances

Covariance tables and a few plots can give us some insight into our data.

Now that the data is scaled, the covariances are directly comparable across features.

In [None]:
# TODO: Look at covariances
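One reason scaling helps here: the covariance matrix of standardized data is the correlation matrix of the original data. The sketch below checks this on synthetic data. The columns `a`, `b`, `c` are hypothetical, and the scaling uses the pandas default sample standard deviation (`ddof=1`) so that it matches `DataFrame.cov()` exactly.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)

# Synthetic features: b follows a closely, c is unrelated noise
toy_df = pd.DataFrame({
    "a": x,
    "b": 2.0 * x + rng.normal(scale=0.5, size=200),
    "c": rng.normal(size=200),
})

# Standardize with the sample standard deviation (pandas default, ddof=1)
toy_scaled = (toy_df - toy_df.mean()) / toy_df.std()

toy_cov = toy_scaled.cov()    # covariance of the scaled data
toy_corr = toy_df.corr()      # correlation of the unscaled data

print(toy_cov.round(3))
print(np.allclose(toy_cov.values, toy_corr.values))
```

The diagonal is all ones (each feature's variance after scaling), and the off-diagonal entries fall between -1 and 1.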

In [None]:
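# Get 2 or 3 most related features with idxmax()
p_cov = penguins_scaled_df.cov()
# Zero out each feature's largest entry (for scaled data, its own variance
# on the diagonal) so idxmax() below doesn't pair every feature with itself
p_cov[p_cov == p_cov.max(axis=1)] = 0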

display(p_cov)
p_cov.abs().idxmax()
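The masking trick above can also be written with an explicit diagonal fill, which may be easier to read. This is a sketch on a small hand-written covariance table, not on the penguin data; `toy_cov` and its feature names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hand-written covariance table for three hypothetical features
toy_cov = pd.DataFrame(
    [[1.0, 0.9, -0.2],
     [0.9, 1.0, 0.1],
     [-0.2, 0.1, 1.0]],
    index=["a", "b", "c"], columns=["a", "b", "c"],
)

# Work on absolute values so strong negative relationships count too,
# and zero the diagonal so a feature can't be matched with itself
vals = toy_cov.abs().to_numpy(copy=True)
np.fill_diagonal(vals, 0.0)
off_diag = pd.DataFrame(vals, index=toy_cov.index, columns=toy_cov.columns)

# For each feature, the other feature it is most related to
closest = off_diag.idxmax()
print(closest)

# The single most related pair overall
i, j = np.unravel_index(vals.argmax(), vals.shape)
top_pair = [toy_cov.columns[i], toy_cov.columns[j]]
print(top_pair)
```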

### Separate Features

In [None]:
# TODO: Separate features from the scaled full DataFrame and store them in penguins_features_df

### Plot the Data

We can look at plots of the most correlated features, and of all of the possible pairs of features.

In [None]:
print("TOP FEATURES")
# TOP CORRELATED FEATURES

# TODO: Add the 2 most correlated features here
top_features = []

for i,cx in enumerate(top_features):
  for j,cy in enumerate(top_features):
    if j > i:
      plt.scatter(penguins_features_df[cx], penguins_features_df[cy], c=penguins_df["label"])
      plt.xlabel(cx)
      plt.ylabel(cy)
      plt.show()

print("ALL FEATURES")
# ALL FEATURES
for i,cx in enumerate(penguins_features_df.columns):
  for j,cy in enumerate(penguins_features_df.columns):
    if j > i:
      plt.scatter(penguins_features_df[cx], penguins_features_df[cy], c=penguins_df["label"])
      plt.xlabel(cx)
      plt.ylabel(cy)
      plt.show()

### PCA

The plots tell us a lot, but the information is spread across many images.

The top-correlated features actually make it hard to see the separation between two of the penguin species.

We can try to simplify how we visualize this data by performing `PCA` and combining some of the original features into _principal components_.

In [None]:
# TODO: create PCA with 3 components

# TODO: fit+transform the features and store the result in penguins_pca_df

# TODO: look at explained variance
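To see what a PCA computes under the hood, here is a minimal NumPy sketch based on the eigen-decomposition of the covariance matrix. It is not the `PCA` class from `data_utils` and makes no claim about that class's interface. The synthetic matrix `X` is an assumption made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: 4 features that mostly follow one hidden direction, plus noise
latent = rng.normal(size=(300, 1))
X = latent @ np.array([[1.0, 0.8, -0.6, 0.3]]) + 0.2 * rng.normal(size=(300, 4))

# Standardize each feature
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigen-decomposition of the covariance matrix
cov = np.cov(X_scaled, rowvar=False)
eig_vals, eig_vecs = np.linalg.eigh(cov)    # eigh returns ascending order

# Sort components from most to least variance
order = np.argsort(eig_vals)[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

# Keep the first 3 principal components
n_components = 3
X_pca = X_scaled @ eig_vecs[:, :n_components]

# Fraction of the total variance captured by each component
explained_ratio = eig_vals / eig_vals.sum()
print(explained_ratio.round(3))
```

Each principal component is a weighted combination of the original features, and the explained-variance ratios indicate how much each one contributes.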

### Covariances Again

We can look at the covariance table of the `PCA` data.

In [None]:
# TODO: Look at covariance of PCA data

Hmmm... the covariances of the `PCA` data look strange: everything off the diagonal is (nearly) zero.

But that's actually what's expected.

`PCA` separates our data into new features that are combinations of the original features, but that are themselves uncorrelated with each other.
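This property is easy to check numerically. The sketch below projects synthetic, strongly related features onto the eigenvectors of their covariance matrix and compares the covariance before and after. The data and variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
a = rng.normal(size=500)

# Three synthetic features that are strongly related to each other
X = np.column_stack([
    a,
    a + 0.3 * rng.normal(size=500),
    -a + 0.5 * rng.normal(size=500),
])
X = X - X.mean(axis=0)

# Project the centered data onto the eigenvectors of its covariance matrix
cov_before = np.cov(X, rowvar=False)
_, eig_vecs = np.linalg.eigh(cov_before)
X_pca = X @ eig_vecs

# Covariance of the projected data: only the diagonal should be non-zero
cov_after = np.cov(X_pca, rowvar=False)
off_diag_after = cov_after - np.diag(np.diag(cov_after))

print(cov_before.round(3))
print(cov_after.round(3))
```

The remaining diagonal entries are the variances of the individual components.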

### Plots

In [None]:
pca_column_names = penguins_pca_df.columns

# First 2 PCs
plt.scatter(penguins_pca_df[pca_column_names[0]], penguins_pca_df[pca_column_names[1]], c=penguins_df["label"])
plt.xlabel(pca_column_names[0])
plt.ylabel(pca_column_names[1])
plt.title("2 PCs")
plt.show()

# First 3 PCs
fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(projection='3d')

ax.scatter(penguins_pca_df[pca_column_names[0]],
           penguins_pca_df[pca_column_names[1]],
           penguins_pca_df[pca_column_names[2]],
           c=penguins_df["label"])
ax.set_xlabel(pca_column_names[0])
ax.set_ylabel(pca_column_names[1])
ax.set_zlabel(pca_column_names[2])
ax.set_title("3 PCs")
plt.show()

Although `PCA` has combined some of the features, the plots still show a lot of the information from our original data.

### Clustering

I wonder what clustering would do...

In [None]:
penguin_clusterer = KMeansClustering(n_clusters=6)
penguin_clusters = penguin_clusterer.fit_predict(penguins_pca_df)

In [None]:
# First 3 PCs, colored by cluster
fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(projection='3d')

ax.scatter(penguins_pca_df[pca_column_names[0]],
           penguins_pca_df[pca_column_names[1]],
           penguins_pca_df[pca_column_names[2]],
           c=penguin_clusters["clusters"])
ax.set_xlabel(pca_column_names[0])
ax.set_ylabel(pca_column_names[1])
ax.set_zlabel(pca_column_names[2])
ax.set_title("3 PCs")
plt.show()
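Besides comparing the two 3D plots by eye, one way to see how clusters line up with species is a contingency table. The sketch below uses `pd.crosstab()` on small made-up label and cluster arrays; they are not the output of the clustering above.

```python
import pandas as pd

# Made-up species labels and cluster ids for 9 hypothetical penguins
toy_labels = pd.Series([0, 0, 0, 1, 1, 1, 2, 2, 2], name="label")
toy_clusters = pd.Series([4, 4, 1, 0, 0, 0, 3, 5, 3], name="cluster")

# Rows are species labels, columns are cluster ids,
# each cell counts how many samples share that label and cluster
table = pd.crosstab(toy_labels, toy_clusters)
print(table)
```

A row with one dominant column means that species landed mostly in a single cluster; a row spread over several columns means the clustering split that species up.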