# LDA vs PCA on Iris

Compare two matrix factorization techniques, **PCA** (Principal Component Analysis) and **LDA** (Linear Discriminant Analysis), on a multi-class classification problem using the classic Iris dataset.

I'm going off of these two awesome blog posts to better understand each step of these techniques:
- https://www.apsl.net/blog/2017/07/18/using-linear-discriminant-analysis-lda-data-explore-step-step/
- https://www.apsl.net/blog/2017/06/21/using-principal-component-analysis-pac-data-explore-step-step/

Contents:
- [PCA](#PCA)
- [LDA](#LDA)
- [Compare PCA and LDA](#Compare-PCA-and-LDA)
- [Building a classifier](#Building-a-classifier)

## Load the data

In [None]:
import math
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt


df = pd.read_csv("../input/iris-flower-dataset/IRIS.csv")

df.info()
df.head()

In [None]:
# get numpy arrays
label = "species"

y = df[label].values
X = df[[col for col in df.columns if col != label]].values

## PCA

Steps:
1. Standardize the data (be careful of data leakage if using PCA components for modeling)
2. Obtain the eigendecomposition from the covariance or correlation matrix
3. Sort eigenvalues in descending order and choose the $k$ eigenvectors that correspond to the $k$ largest eigenvalues
4. Construct the projection matrix $W$ from the selected $k$ eigenvectors
5. Transform the original dataset $X$ via $W$ to obtain a $k$-dimensional feature subspace
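
The five steps above can be sketched as one small NumPy function. This is a minimal sketch, not the notebook's code: `pca_project` and the random `X_demo` are hypothetical names and data, and it uses `np.linalg.eigh` (meant for symmetric matrices) where the cells below use `np.linalg.eig`.

```python
import numpy as np

def pca_project(X, k):
    """Project X (n_samples x n_features) onto its top-k principal components."""
    # Step 1: standardize each feature to zero mean and unit variance
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: eigendecomposition of the symmetric covariance matrix
    cov = np.cov(X_std, rowvar=False)
    eig_vals, eig_vecs = np.linalg.eigh(cov)
    # Step 3: indices of the k largest eigenvalues, in decreasing order
    order = np.argsort(eig_vals)[::-1][:k]
    # Step 4: projection matrix W with the chosen eigenvectors as columns
    W = eig_vecs[:, order]
    # Step 5: project the standardized data onto the k-dimensional subspace
    return X_std @ W, eig_vals[order]

rng = np.random.default_rng(0)
# hypothetical correlated data standing in for the Iris features
X_demo = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
Y_demo, top_vals = pca_project(X_demo, k=2)
print(Y_demo.shape)
```

The variance of each projected column equals the corresponding eigenvalue, which is a handy sanity check on the sorting step.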

In [None]:
label_dict = {1: 'Iris-Setosa',
              2: 'Iris-Versicolor',
              3: 'Iris-Virginica'}

feature_dict = {0: 'sepal length [cm]',
                1: 'sepal width [cm]',
                2: 'petal length [cm]',
                3: 'petal width [cm]'}

with plt.style.context('seaborn-whitegrid'):
    plt.figure(figsize=(8, 6))
    for cnt in range(4):
        plt.subplot(2, 2, cnt+1)
        for lab in ('Iris-setosa', 'Iris-versicolor', 'Iris-virginica'):
            plt.hist(X[y==lab, cnt],
                     label=lab,
                     bins=10,
                     alpha=0.3,)
        plt.xlabel(feature_dict[cnt])
    plt.legend(loc='upper right', fancybox=True, fontsize=8)

    plt.tight_layout()
    plt.savefig('PREDI.png', format='png', dpi=1200)
    plt.show()

In [None]:
# step 1: standardize the data
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(X)

In [None]:
# step 2: eigendecomposition
cov_mat = np.cov(X_std.T)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)

print('Eigenvectors \n%s' %eig_vecs)
print('\nEigenvalues \n%s' %eig_vals)
np.linalg.norm(eig_vecs[:, 0]) # eigenvectors are the columns; each has unit norm

Quick note on using the covariance vs correlation matrix.

In some fields, finance in particular, the correlation matrix is typically used instead of the covariance matrix. However, if the input data was standardized, the eigendecomposition of the covariance matrix yields the same eigenvectors and explained-variance ratios as an eigendecomposition of the correlation matrix, since **the correlation matrix can be understood as the normalized covariance matrix.**
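
A quick numerical check of that claim, as a sketch on made-up data (`X_demo` is hypothetical). One detail worth knowing: `StandardScaler` divides by the population standard deviation (`ddof=0`) while `np.cov` defaults to the sample estimate (`ddof=1`), so the two matrices differ by a constant factor of `n / (n - 1)`. A constant factor rescales the eigenvalues but leaves the eigenvectors and the explained-variance ratios unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
# hypothetical features on very different scales
X_demo = rng.normal(size=(150, 4)) * np.array([1.0, 5.0, 0.2, 10.0])
n = X_demo.shape[0]

corr = np.corrcoef(X_demo, rowvar=False)

# Standardizing with the sample standard deviation reproduces the correlation matrix
X_std1 = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0, ddof=1)
cov_std1 = np.cov(X_std1, rowvar=False)

# StandardScaler-style standardization (ddof=0) is off by the factor n / (n - 1)
X_std0 = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
cov_std0 = np.cov(X_std0, rowvar=False)

print(np.allclose(cov_std1, corr))
print(np.allclose(cov_std0, corr * n / (n - 1)))
```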

In [None]:
# Make a list of (eigenvalue, eigenvector) tuples
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]

# Sort the (eigenvalue, eigenvector) tuples from high to low
eig_pairs.sort(key=lambda x: x[0], reverse=True)

# Visually confirm that the list is correctly sorted by decreasing eigenvalues
print('Eigenvalues in descending order:')
for i in eig_pairs:
    print(i[0])

In [None]:
# step 3: determine top k eigenvectors
tot = sum(eig_vals)
var_exp = [(i / tot)*100 for i in sorted(eig_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)

with plt.style.context('seaborn-whitegrid'):
    plt.figure(figsize=(6, 4))

    plt.bar(range(4), var_exp, alpha=0.5, align='center',
            label='individual explained variance')
    plt.step(range(4), cum_var_exp, where='mid',
             label='cumulative explained variance')
    plt.ylabel('Explained variance ratio')
    plt.xlabel('Principal components')
    plt.legend(loc='best')
    plt.tight_layout()
plt.savefig('PREDI2.png', format='png', dpi=1200)
plt.show()

Together, the first two principal components explain 95.8% of the variance.
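
Picking $k$ from the cumulative curve can also be done programmatically. The sketch below assumes a hypothetical helper `components_needed` and made-up eigenvalues; the threshold is a judgment call, not something the blog posts prescribe.

```python
import numpy as np

def components_needed(eig_vals, threshold=0.95):
    """Smallest number of components whose cumulative variance ratio reaches threshold."""
    vals = np.sort(np.asarray(eig_vals, dtype=float))[::-1]
    cum = np.cumsum(vals / vals.sum())
    # index of the first cumulative ratio >= threshold, converted to a count
    k = int(np.searchsorted(cum, threshold)) + 1
    # guard against floating-point round-off when threshold is close to 1
    return min(k, len(vals)), cum

# made-up eigenvalues: ratios are 0.6, 0.2, 0.1, 0.1
k_demo, cum_demo = components_needed([6.0, 2.0, 1.0, 1.0], threshold=0.85)
print(k_demo, cum_demo)
```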

In [None]:
# step 4: make the projection matrix W
matrix_w = np.hstack((eig_pairs[0][1].reshape(4,1),
                      eig_pairs[1][1].reshape(4,1)))
print('Matrix W:\n', matrix_w)

In [None]:
# step 5: project data X onto new feature space
Y = X_std.dot(matrix_w)

with plt.style.context('seaborn-whitegrid'):
    plt.figure(figsize=(6, 4))
    for lab, col in zip(('Iris-setosa', 'Iris-versicolor', 'Iris-virginica'),
                        ('blue', 'red', 'green')):
        plt.scatter(Y[y==lab, 0],
                    Y[y==lab, 1],
                    label=lab,
                    c=col)
    plt.xlabel('Principal Component 1')
    plt.ylabel('Principal Component 2')
    plt.legend(loc='lower center')
    plt.tight_layout()
    plt.show()

In [None]:
# Using sklearn
from sklearn.decomposition import PCA as sklearnPCA
sklearn_pca = sklearnPCA(n_components=2)
Y_sklearn = sklearn_pca.fit_transform(X_std)
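
When comparing `Y_sklearn` with the manual projection `Y`, expect some columns to be mirrored. scikit-learn computes PCA through a singular value decomposition rather than an explicit eigendecomposition, and an eigenvector is only defined up to sign. The sketch below shows the same effect with NumPy alone on hypothetical data (`X_demo`), so it is an illustration of the idea rather than of scikit-learn's internals.

```python
import numpy as np

rng = np.random.default_rng(2)
# hypothetical correlated data
X_demo = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))
X_c = X_demo - X_demo.mean(axis=0)  # center the data
n = X_c.shape[0]

# Route 1: eigendecomposition of the covariance matrix
eig_vals, eig_vecs = np.linalg.eigh(np.cov(X_c, rowvar=False))
order = np.argsort(eig_vals)[::-1][:2]
Y_eig = X_c @ eig_vecs[:, order]

# Route 2: SVD of the centered data; the rows of Vt are the principal directions
U, S, Vt = np.linalg.svd(X_c, full_matrices=False)
Y_svd = X_c @ Vt[:2].T

# The projections agree up to the sign of each column
same_up_to_sign = np.allclose(np.abs(Y_eig), np.abs(Y_svd))
# Singular values and eigenvalues are related by lambda = s**2 / (n - 1)
vals_match = np.allclose(S[:2] ** 2 / (n - 1), eig_vals[order])
print(same_up_to_sign, vals_match)
```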

## LDA

5 steps:
1. Compute the d-dimensional mean vectors for each class (where d=number of features)
2. Compute two "scatter" matrices: (a) between-class matrix and (b) within-class matrix
3. Compute the eigendecomposition of the combined scatter matrix: $Av=\lambda v$ where $A=S_W^{-1}S_B$, and $S_W$ and $S_B$ are the within-class and between-class scatter matrices.
4. Sort the eigenvectors in descending order by their eigenvalues and keep the top $k$ to form a $d \times k$ matrix $W$.
5. Use $W$ to transform the samples into the new subspace via matrix multiplication: $Z=XW$
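
As with PCA, the steps above fit in one short function. This is a sketch under assumptions: `lda_project` and the synthetic three-class data are hypothetical, and it uses `np.linalg.solve(S_W, S_B)` to form $S_W^{-1}S_B$ instead of the explicit `np.linalg.inv` call used in the cells below.

```python
import numpy as np

def lda_project(X, y, k):
    """Project X onto the top-k linear discriminants for labels y."""
    d = X.shape[1]
    global_mean = X.mean(axis=0)
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for c in np.unique(y):
        X_c = X[y == c]
        mean_c = X_c.mean(axis=0)                  # step 1: class mean vector
        centered = X_c - mean_c
        S_W += centered.T @ centered               # step 2a: within-class scatter
        diff = (mean_c - global_mean).reshape(d, 1)
        S_B += X_c.shape[0] * (diff @ diff.T)      # step 2b: between-class scatter
    # step 3: eigendecomposition of S_W^{-1} S_B
    eig_vals, eig_vecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    # step 4: keep the eigenvectors with the k largest eigenvalues
    order = np.argsort(eig_vals.real)[::-1][:k]
    W = eig_vecs[:, order].real
    # step 5: Z = XW
    return X @ W

rng = np.random.default_rng(3)
# hypothetical data: three classes of 50 samples with shifted means
means = np.array([[0, 0, 0, 0], [3, 0, 0, 0], [0, 3, 0, 0]], dtype=float)
X_demo = np.vstack([rng.normal(loc=m, size=(50, 4)) for m in means])
y_demo = np.repeat([1, 2, 3], 50)
Z_demo = lda_project(X_demo, y_demo, k=2)
print(Z_demo.shape)
```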

In [None]:
# first encode the label so it's convenient to work with
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
label_encoder = enc.fit(y)
y = label_encoder.transform(y) + 1

label_dict = {1: 'Setosa', 2: 'Versicolor', 3:'Virginica'}

In [None]:
# EDA like before
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12,6))

for ax,cnt in zip(axes.ravel(), range(4)):  

    # set bin sizes
    min_b = math.floor(np.min(X[:,cnt]))
    max_b = math.ceil(np.max(X[:,cnt]))
    bins = np.linspace(min_b, max_b, 25)

    # plotting the histograms
    for lab,col in zip(range(1,4), ('blue', 'red', 'green')):
        ax.hist(X[y==lab, cnt],
                   color=col,
                   label='class %s' %label_dict[lab],
                   bins=bins,
                   alpha=0.5,)
    ylims = ax.get_ylim()

    # plot annotation
    leg = ax.legend(loc='upper right', fancybox=True, fontsize=8)
    leg.get_frame().set_alpha(0.5)
    ax.set_ylim([0, max(ylims)+2])
    ax.set_xlabel(feature_dict[cnt])
    ax.set_title('Iris histogram #%s' %str(cnt+1))

    # hide axis ticks
    ax.tick_params(axis="both", which="both", bottom=False, top=False,
            labelbottom=True, left=False, right=False, labelleft=True)

    # remove axis spines
    ax.spines["top"].set_visible(False)  
    ax.spines["right"].set_visible(False)
    ax.spines["bottom"].set_visible(False)
    ax.spines["left"].set_visible(False)    

axes[0][0].set_ylabel('count')
axes[1][0].set_ylabel('count')

fig.tight_layout()       

plt.show()

In [None]:
# step 1: mean vectors
np.set_printoptions(precision=4)

mean_vectors = []
for _cls in np.unique(y):
    mean_vectors.append(np.mean(X[y==_cls], axis=0))
    print('Mean Vector class %s: %s\n' %(label_dict[_cls], mean_vectors[_cls-1]))
    
# for use later
global_mean_vector = np.mean(X, axis=0)

In [None]:
# step 2: scatter matrices
S_W = np.zeros((4,4))
for cl, mv in zip(np.unique(y), mean_vectors):
    class_sc_mat = np.zeros((4,4))                  # scatter matrix for every class
    for row in X[y == cl]:
        row, mv = row.reshape(4,1), mv.reshape(4,1) # make column vectors
        class_sc_mat += (row-mv).dot((row-mv).T)
    S_W += class_sc_mat                             # sum class scatter matrices
print('within-class Scatter Matrix:\n', S_W)

In [None]:
# equivalent to cell above, less code, harder to read
dd = np.zeros((4,4))
for idx in range(1,4):
    dd += np.dot((X[y==idx] - mean_vectors[idx-1]).T, (X[y==idx] - mean_vectors[idx-1]))
dd

In [None]:
# between scatter matrix
S_B = np.zeros((4,4))
for i,mean_vec in enumerate(mean_vectors):  
    n = X[y==i+1,:].shape[0]
    mean_vec = mean_vec.reshape(4,1) # make column vector
    global_mean_vector = global_mean_vector.reshape(4,1) # make column vector
    S_B += n * (mean_vec - global_mean_vector).dot((mean_vec - global_mean_vector).T)

print('between-class Scatter Matrix:\n', S_B)
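
A useful sanity check on the two scatter matrices is that they add up to the total scatter matrix, $S_T = S_W + S_B$, where $S_T$ is computed around the global mean. The sketch below verifies this on hypothetical data (`X_demo`, `y_demo`) rather than on the notebook's `X` and `y`.

```python
import numpy as np

rng = np.random.default_rng(4)
# hypothetical data: three classes of 30 samples, one class shifted
X_demo = rng.normal(size=(90, 4))
y_demo = np.repeat([1, 2, 3], 30)
X_demo[y_demo == 2] += 2.0

global_mean = X_demo.mean(axis=0)
S_W_demo = np.zeros((4, 4))
S_B_demo = np.zeros((4, 4))
for c in np.unique(y_demo):
    X_c = X_demo[y_demo == c]
    mean_c = X_c.mean(axis=0)
    S_W_demo += (X_c - mean_c).T @ (X_c - mean_c)
    diff = (mean_c - global_mean).reshape(4, 1)
    S_B_demo += len(X_c) * (diff @ diff.T)

# total scatter around the global mean
S_T_demo = (X_demo - global_mean).T @ (X_demo - global_mean)
print(np.allclose(S_T_demo, S_W_demo + S_B_demo))
```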

In [None]:
# step 3: eigendecomposition of scatter matrices
eig_vals, eig_vecs = np.linalg.eig(np.linalg.inv(S_W).dot(S_B))

for i in range(len(eig_vals)):
    eigvec_sc = eig_vecs[:,i].reshape(4,1)   
    print('\nEigenvector {}: \n{}'.format(i+1, eigvec_sc.real))
    print('Eigenvalue {:}: {:.2e}'.format(i+1, eig_vals[i].real))

In [None]:
# step 4: choose k << d eigenvectors for new subspace
# Make a list of (eigenvalue, eigenvector) tuples
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]

# Sort the (eigenvalue, eigenvector) tuples from high to low
eig_pairs = sorted(eig_pairs, key=lambda k: k[0], reverse=True)

# Visually confirm that the list is correctly sorted by decreasing eigenvalues

print('Eigenvalues in decreasing order:\n')
for i in eig_pairs:
    print(i[0])
    
print('\nVariance explained:\n')
eigv_sum = sum(eig_vals)
for i,j in enumerate(eig_pairs):
    print('eigenvalue {0:}: {1:.2%}'.format(i+1, (j[0]/eigv_sum).real))


The first eigenvector alone is probably enough here, whereas the PCA analysis above suggested keeping two components.
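
There is also a structural reason why some eigenvalues come out as (numerically) zero: with $c$ classes, $S_B$ is a sum of $c$ rank-one matrices whose weighted mean differences sum to zero, so its rank is at most $c-1$. For three classes that means at most two useful discriminants. A small sketch on hypothetical data (all names below are made up for the illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
n_classes, n_features, n_per_class = 3, 4, 40
# hypothetical class means and samples
class_means = rng.normal(scale=3.0, size=(n_classes, n_features))
X_demo = np.vstack(
    [rng.normal(loc=m, size=(n_per_class, n_features)) for m in class_means]
)
y_demo = np.repeat(np.arange(n_classes), n_per_class)

global_mean = X_demo.mean(axis=0)
S_B_demo = np.zeros((n_features, n_features))
for c in range(n_classes):
    diff = (X_demo[y_demo == c].mean(axis=0) - global_mean).reshape(-1, 1)
    S_B_demo += n_per_class * (diff @ diff.T)

# rank with an explicit relative tolerance to ignore round-off
tol = 1e-8 * np.linalg.norm(S_B_demo, 2)
rank_demo = int(np.linalg.matrix_rank(S_B_demo, tol=tol))
print(rank_demo)
```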

In [None]:
# Step 4: cont'd
W = np.hstack((eig_pairs[0][1].reshape(4,1), eig_pairs[1][1].reshape(4,1)))
print('Matrix W:\n', W.real)

In [None]:
# step 5: transform samples onto the new subspace
X_lda = X.dot(W)
assert X_lda.shape == (150,2), "The matrix is not 150x2 dimensional."

In [None]:
from matplotlib import pyplot as plt

def plot_step_lda():

    ax = plt.subplot(111)
    for label,marker,color in zip(
        range(1,4),('^', 's', 'o'),('blue', 'red', 'green')):

        plt.scatter(x=X_lda[:,0].real[y == label],
                y=X_lda[:,1].real[y == label],
                marker=marker,
                color=color,
                alpha=0.5,
                label=label_dict[label]
                )

    plt.xlabel('LD1')
    plt.ylabel('LD2')

    leg = plt.legend(loc='upper right', fancybox=True)
    leg.get_frame().set_alpha(0.5)
    plt.title('LDA: Iris projection onto the first 2 linear discriminants')

    # hide axis ticks
    plt.tick_params(axis="both", which="both", bottom=False, top=False,
            labelbottom=True, left=False, right=False, labelleft=True)

    # remove axis spines
    ax.spines["top"].set_visible(False)  
    ax.spines["right"].set_visible(False)
    ax.spines["bottom"].set_visible(False)
    ax.spines["left"].set_visible(False)    

    plt.grid()
    plt.tight_layout()
    plt.show()

plot_step_lda()

In [None]:
# in Sklearn
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

sklearn_lda = LDA(n_components=2)
X_lda_sklearn = sklearn_lda.fit_transform(X, y)

## Building a classifier

Roughly the steps to follow:

1. Check for class imbalance
2. Settle on performance metric for training: we'll just look at a bunch
3. Split data into training test split
4. Train models using (stratified?) k-fold cross-validation and report validation performance
5. Evaluate final models on test set

In [None]:
# even class distribution
df.species.value_counts(normalize=True)

In [None]:
from sklearn.model_selection import train_test_split

label = "species"
y = df[label]
enc = LabelEncoder()
label_encoder = enc.fit(y)
# make labels ints
y = label_encoder.transform(y)
# to get the labels back: label_encoder.inverse_transform(y)

X = df[[col for col in df.columns if col != label]]

# train test split, stratified so each split keeps the class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=73, stratify=y
)

In [None]:
# class proportions in the train and test splits
np.bincount(y_train) / len(y_train), np.bincount(y_test) / len(y_test)

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.decomposition import PCA as sklearnPCA

# using prior info to choose n_components
pca_pipe = Pipeline([("scaler", StandardScaler()), ("pca", sklearnPCA(n_components=2))])
# don't need to standardize for LDA: https://stats.stackexchange.com/a/110803
lda_pipe = Pipeline([("lda", LDA(n_components=1))])

# fit on training data to prevent leakage
pca_fit = pca_pipe.fit(X_train)
lda_fit = lda_pipe.fit(X_train, y_train)

This next part is a bit overkill for such a small dataset, but it is the "right way" to do things: a stratified k-fold split into training and validation sets.
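
One variant worth knowing: if the reducer and the classifier live in a single pipeline, cross-validation refits the projection inside every fold, so the validation fold never influences the PCA or LDA fit. The sketch below is an assumption-laden alternative to the cells that follow, not a rewrite of them: it loads scikit-learn's bundled copy of Iris via `load_iris` instead of the CSV, and the fold seed is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# assumption: scikit-learn's bundled Iris data stands in for IRIS.csv
X_iris, y_iris = load_iris(return_X_y=True)

# reducer + classifier in one estimator, refit from scratch in each fold
pca_lr = make_pipeline(StandardScaler(), PCA(n_components=2), LogisticRegression())
lda_lr = make_pipeline(LinearDiscriminantAnalysis(n_components=1), LogisticRegression())

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
pca_scores = cross_val_score(pca_lr, X_iris, y_iris, cv=cv, scoring="f1_macro")
lda_scores = cross_val_score(lda_lr, X_iris, y_iris, cv=cv, scoring="f1_macro")
print("PCA + LR macro F1 per fold:", pca_scores)
print("LDA + LR macro F1 per fold:", lda_scores)
```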

## Compare PCA and LDA

In [None]:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix

def evaluate_predictions(true, pred):
    print("F1:".ljust(10), f1_score(true, pred, average="macro"))
    print("Precision:".ljust(10), precision_score(true, pred, average="macro"))
    print("Recall:".ljust(10), recall_score(true, pred, average="macro"))

def train_and_validate(train_x, train_y, estimator):
    skf = StratifiedKFold(n_splits=3)
    for train_index, valid_index in skf.split(train_x, train_y):
        # training set
        X_t, y_t = train_x[train_index], train_y[train_index]
        # validation set
        X_v, y_v = train_x[valid_index], train_y[valid_index]
        # train model
        clf = estimator.fit(X_t, y_t)
        preds_t = clf.predict(X_t)
        preds_v = clf.predict(X_v)
        
        # evaluate
        print("Training set evaluation - ")
        evaluate_predictions(y_t, preds_t)
        print("\nValidation set evaluation - ")
        evaluate_predictions(y_v, preds_v)
        print("\n"*2)
    return estimator.fit(train_x, train_y)
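
A note on argument order: scikit-learn's metric functions expect the ground truth first (`y_true, y_pred`). With macro averaging, passing the two in the wrong order swaps precision and recall. The NumPy sketch below shows the effect; `macro_precision_recall` is a hypothetical helper written for this illustration and the labels are made up.

```python
import numpy as np

def macro_precision_recall(true, pred):
    """Macro-averaged precision and recall over the classes present in `true`."""
    true, pred = np.asarray(true), np.asarray(pred)
    precisions, recalls = [], []
    for c in np.unique(true):
        tp = np.sum((pred == c) & (true == c))
        predicted_c = np.sum(pred == c)   # how often class c was predicted
        actual_c = np.sum(true == c)      # how often class c actually occurs
        precisions.append(tp / predicted_c if predicted_c else 0.0)
        recalls.append(tp / actual_c)
    return float(np.mean(precisions)), float(np.mean(recalls))

# made-up labels where precision and recall differ
y_true_demo = [0, 0, 0, 0, 1, 2]
y_pred_demo = [0, 0, 1, 2, 1, 2]

p, r = macro_precision_recall(y_true_demo, y_pred_demo)
p_swapped, r_swapped = macro_precision_recall(y_pred_demo, y_true_demo)
print(p, r)
print(p_swapped, r_swapped)
```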

In [None]:
from sklearn.linear_model import LogisticRegression

# pca data
X_pca = pca_fit.transform(X_train)

lr_pca = train_and_validate(X_pca, y_train, LogisticRegression(random_state=312))

In [None]:
# LDA data
X_lda = lda_fit.transform(X_train)

lr_lda = train_and_validate(X_lda, y_train, LogisticRegression(random_state=312))

In [None]:
# Compare test performance
preds_pca = lr_pca.predict(pca_fit.transform(X_test))
preds_lda = lr_lda.predict(lda_fit.transform(X_test))

print("PCA::")
evaluate_predictions(y_test, preds_pca)
print("\nLDA::")
evaluate_predictions(y_test, preds_lda)

In [None]:
# TODO: plot the LDA LR decision boundary if I can

## Results

Perhaps I shouldn't be surprised that LDA performs better with just a single feature, given that it is a supervised dimensionality reduction technique. However, I _am_ surprised, mostly because it's a new technique for me that I learned from Andreas Mueller's SciPy 2020 talk about `dabl`, a package that I'm sure would've been useful here.

## Things I've learned:
- Sklearn actually has a lot of stuff scattered about to learn. It also doesn't work intuitively (to me!) yet
- Pipelines are nice, they're my friend
- Sklearn works with numpy arrays mostly, so getting more familiar with Numpy will help given my pandas dependence (I like naming columns!)
- [Sklearn has tutorials!](https://scikit-learn.org/stable/tutorial/index.html) I should go thru them to get more familiar with what's available.
- [Matplotlib also has tutorials](https://matplotlib.org/stable/tutorials/index.html), definitely need those to learn the API(s) better.