# Roll No: 2018102024

# Exercise 2

In Exercise 1, we computed LDA for a multi-class problem, the Iris dataset. In this exercise, we compare LDA and PCA on the same dataset.

As a reminder, the Iris dataset contains measurements for 150 iris flowers from three different species.

The three classes in the Iris dataset:
1. Iris-setosa (n=50)
2. Iris-versicolor (n=50)
3. Iris-virginica (n=50)

The four features of the Iris dataset:
1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm

<img src="iris_petal_sepal.png">



In [1]:
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns; sns.set();
import pandas as pd
from sklearn.model_selection import train_test_split
from numpy import pi

### Importing the dataset

In [2]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
dataset = pd.read_csv(url, names=names)

dataset.tail()

| | sepal-length | sepal-width | petal-length | petal-width | Class |
|---|---|---|---|---|---|
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |


### Data preprocessing

Once the dataset is loaded into a pandas DataFrame, the first step is to split it into features and corresponding labels, and then to split the result into training and test sets. The following code separates the feature set from the labels:

In [3]:
X = dataset.iloc[:, 0:4].values
y = dataset.iloc[:, 4].values

The script above assigns the first four columns of the dataset (the feature set) to the variable `X`, and the values in the fifth column (the labels) to the variable `y`.

The following code divides data into training and test sets:

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

#### Feature Scaling

We will also perform feature scaling as part of data preprocessing. For this task, we use scikit-learn's `StandardScaler`.

In [5]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [6]:
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)

((120, 4), (30, 4))
((120,), (30,))
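
As a rough illustration of what the scaling step does, the sketch below reproduces `StandardScaler` by hand: each feature is shifted by the training mean and divided by the training standard deviation, and the test set reuses those same training statistics. The arrays `X_train_demo` and `X_test_demo` are synthetic stand-ins introduced only for this example, not the Iris data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the training and test feature matrices (assumption)
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=5.0, scale=2.0, size=(120, 4))
X_test_demo = rng.normal(loc=5.0, scale=2.0, size=(30, 4))

# Fit on the training data only, as in the cell above
scaler = StandardScaler().fit(X_train_demo)

# Manual standardisation using training-set statistics only
mu = X_train_demo.mean(axis=0)
sigma = X_train_demo.std(axis=0)  # population standard deviation (ddof=0)
manual_train = (X_train_demo - mu) / sigma
manual_test = (X_test_demo - mu) / sigma

# Both comparisons should agree up to floating-point error
print(np.allclose(manual_train, scaler.transform(X_train_demo)))
print(np.allclose(manual_test, scaler.transform(X_test_demo)))
```

Fitting the scaler on the training split alone keeps information from the test set out of the preprocessing step.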


## Write your code below

Write your code to compute the PCA and LDA on the IRIS dataset below.

In [7]:
def getlabel(y, labels=None):
    # Map each distinct class in y to the corresponding entry of labels
    classes = np.unique(y)
    if labels is None:
        labels = list(range(classes.shape[0]))
    out = np.zeros((y.shape[0]), dtype='object')
    for c, label in zip(classes, labels):
        out[y == c] = label
    return out

def cov(X):
    # Covariance of the rows of X; X must already be centred
    return X.T @ X / X.shape[0]

def LDA(X_train, y_train):
    
    m, d = X_train.shape
    mean = np.mean(X_train, axis=0).reshape((1, -1))
    # SW: within-class scatter, SB: between-class scatter
    SW, SB = np.zeros((d, d)), np.zeros((d, d))
    classes = np.unique(y_train)
    for c in classes:
        Xc = X_train[y_train == c]
        meanc = np.mean(Xc, axis=0).reshape((1, -1))
        SB += Xc.shape[0] * cov(meanc - mean)
        SW += cov(Xc - meanc)
    SB /= m
    
    # Discriminant directions are the eigenvectors of pinv(SW) @ SB
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(SW) @ SB)
    
    # Drop numerical imaginary parts, then sort by decreasing eigenvalue
    eigvals, eigvecs = np.real(eigvals), np.real(eigvecs)
    idx = np.argsort(eigvals)[::-1]
    eigvals = eigvals[idx]
    eigvecs = eigvecs[:, idx]
    
    # Leading discriminant direction as a (1, d) row vector
    return eigvecs[:, 0:1].T

In [None]:
# LDA
W_lda = LDA(X_train, y_train)

In [None]:
# training data
plt.scatter((W_lda @ X_train.T).ravel(), np.zeros((X_train.shape[0])), c=getlabel(y_train,['r', 'g', 'b']))
plt.show()

In [None]:
# test data
plt.scatter((W_lda @ X_test.T).ravel(), np.zeros((X_test.shape[0])), c=getlabel(y_test,['r', 'g', 'b']))
plt.show()

In [8]:
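def PCA(X):
    
    # Centre a copy so the caller's array is not modified in place
    X = X - np.mean(X, axis=0, keepdims=True)
    
    # The covariance matrix is symmetric, so eigh returns real eigenpairs
    eigvals, eigvecs = np.linalg.eigh(cov(X))
    
    # Sort eigenpairs by decreasing eigenvalue
    idx = eigvals.argsort()[::-1]
    eigvals = eigvals[idx]
    eigvecs = eigvecs[:, idx]
    
    # Leading principal direction as a (1, n) row vector
    return eigvecs[:, 0:1].T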

In [9]:
# PCA
W_pca = PCA(X_train)

In [10]:
# training data
plt.scatter((W_pca @ X_train.T).ravel(), np.zeros((X_train.shape[0])), c=getlabel(y_train,['r', 'g', 'b']))
plt.show()

In [None]:
# test data
plt.scatter((W_pca @ X_test.T).ravel(), np.zeros((X_test.shape[0])), c=getlabel(y_test,['r', 'g', 'b']))
plt.show()
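
One way to sanity-check an eigendecomposition-based PCA like the one above is to compare its leading direction with the one scikit-learn finds. The sketch below assumes the full standardised Iris data from `sklearn.datasets.load_iris` rather than the training split, and `leading_direction` is a hypothetical helper written for this example. Eigenvectors are only defined up to sign, so the comparison uses the absolute cosine between the two unit vectors.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA as SklearnPCA
from sklearn.preprocessing import StandardScaler

# Full standardised Iris data (assumption: the notebook uses a training split)
X_iris = StandardScaler().fit_transform(load_iris().data)

def leading_direction(X):
    """Hypothetical helper: unit eigenvector of the covariance with the largest eigenvalue."""
    Xc = X - X.mean(axis=0, keepdims=True)
    covariance = Xc.T @ Xc / Xc.shape[0]
    eigvals, eigvecs = np.linalg.eigh(covariance)
    return eigvecs[:, np.argmax(eigvals)]

w_manual = leading_direction(X_iris)
w_sklearn = SklearnPCA(n_components=1).fit(X_iris).components_[0]

# Both vectors have unit length, so the dot product is the cosine of the angle
cosine = abs(w_manual @ w_sklearn)
print(f"|cosine| between the two directions: {cosine:.6f}")
```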

**Observation**

LDA reduces the dimension of the data by projecting onto the direction that maximises between-class variance relative to within-class variance. <br>
As is visible from the graphs, the one-dimensional LDA feature separates the classes better than the PCA feature: in the PCA projection, the green (Iris-versicolor) and blue (Iris-virginica) classes overlap.
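
The visual comparison can also be put into a number. The sketch below is one possible way to do that, not the notebook's own method: it uses scikit-learn's `LinearDiscriminantAnalysis` and `PCA` in place of the functions written above, loads the data with `load_iris`, and scores each one-dimensional projection of the training set with a hypothetical helper `fisher_ratio` (between-class variance divided by within-class variance). A larger ratio means the class means are further apart relative to the spread inside each class.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA as SklearnPCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def fisher_ratio(z, y):
    """Hypothetical helper: between-class over within-class variance of a 1-D feature z."""
    overall_mean = z.mean()
    between, within = 0.0, 0.0
    for c in np.unique(y):
        zc = z[y == c]
        between += zc.shape[0] * (zc.mean() - overall_mean) ** 2
        within += ((zc - zc.mean()) ** 2).sum()
    return between / within

# Same split parameters as the notebook; load_iris replaces the CSV download (assumption)
X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_iris, y_iris, test_size=0.2, random_state=0)
X_tr = StandardScaler().fit_transform(X_tr)

# One-dimensional projections of the training data
z_lda = LinearDiscriminantAnalysis(n_components=1).fit(X_tr, y_tr).transform(X_tr).ravel()
z_pca = SklearnPCA(n_components=1).fit(X_tr).transform(X_tr).ravel()

ratio_lda = fisher_ratio(z_lda, y_tr)
ratio_pca = fisher_ratio(z_pca, y_tr)
print(f"Fisher ratio, LDA projection: {ratio_lda:.2f}")
print(f"Fisher ratio, PCA projection: {ratio_pca:.2f}")
```

Because LDA chooses its direction to maximise this kind of ratio on the data it was fitted on, the LDA value should not fall below the PCA value on the training set; PCA ignores the labels and only maximises total variance.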