## Classification using LDA

A fundamental assumption of LDA is that the independent variables are normally distributed within each class, with a covariance matrix that is shared by all classes. LDA is closely related to principal component analysis (PCA) and factor analysis in that all three look for linear combinations of variables that best explain the data. LDA explicitly attempts to model the difference between the classes of data. PCA, in contrast, does not take class differences into account, and factor analysis builds the feature combinations based on differences rather than similarities.
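
The classification rule that follows from the Gaussian assumption can be written in closed form. The notation below is standard textbook notation rather than something taken from this notebook: $\mu_k$ is the mean of class $k$, $\Sigma$ is the covariance matrix assumed common to all classes, and $\pi_k$ is the prior probability of class $k$. An observation $x$ is assigned to the class with the largest discriminant score.

```latex
\delta_k(x) = x^\top \Sigma^{-1} \mu_k \;-\; \tfrac{1}{2}\, \mu_k^\top \Sigma^{-1} \mu_k \;+\; \log \pi_k,
\qquad
\hat{y}(x) = \arg\max_k \, \delta_k(x)
```

Because $\delta_k(x)$ is linear in $x$, the boundary between any two classes is a straight line (or a hyperplane in more than two dimensions).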

https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html


In [54]:

import os
import numpy as np
import pandas as pd

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.model_selection import train_test_split


In [2]:
os.chdir('C:\\Users\\satish\\Documents\\New folder\\PGPDS-PBS\\Semester 2\\Advanced Statistics\\Session 18-20 Linear Discriminant Analysis')


## Binary class 

In [10]:
# load data
df = pd.read_excel("lda_numerics_working.xlsx", sheet_name = "Sheet1")
df


Unnamed: 0,X1,X2,class
0,2.947814,6.626878,1
1,2.530388,7.78505,1
2,3.566991,5.651046,1
3,3.156983,5.467077,1
4,2.582346,4.457777,2
5,2.155826,6.222343,2
6,3.273418,3.520687,2


In [30]:
X = df[['X1','X2']]
y = df['class']

# fit the model 
clf = LinearDiscriminantAnalysis()
clf.fit(X, y)

# make predictions
pred = clf.predict(X)
probab = clf.predict_proba(X)

print("Parameters:", clf.get_params())
print("Predictions:", pred)
print("Class probabilities:", probab)

Parameters: {'covariance_estimator': None, 'n_components': None, 'priors': None, 'shrinkage': None, 'solver': 'svd', 'store_covariance': False, 'tol': 0.0001}
Predictions: [1 1 1 1 2 2 2]
Class probabilities: [[9.99999821e-01 1.79241441e-07]
 [9.99999973e-01 2.70033679e-08]
 [1.00000000e+00 4.23764311e-10]
 [9.99534569e-01 4.65430804e-04]
 [9.36087430e-10 9.99999999e-01]
 [6.65016506e-06 9.99993350e-01]
 [4.90351091e-06 9.99995096e-01]]


In [7]:
from sklearn.metrics import accuracy_score
accuracy_score(y, pred)


1.0
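
To make the fitted rule less of a black box, here is a minimal NumPy sketch of the textbook LDA computation on the same seven rows. It estimates the class means, a pooled covariance matrix (divided by n − K, which is an assumption of this sketch), and priors from class frequencies, then picks the class with the larger linear score. The helper name `lda_scores` is made up for illustration, and this is not scikit-learn's internal implementation.

```python
import numpy as np

# The seven observations shown in the table above
X = np.array([
    [2.947814, 6.626878],
    [2.530388, 7.785050],
    [3.566991, 5.651046],
    [3.156983, 5.467077],
    [2.582346, 4.457777],
    [2.155826, 6.222343],
    [3.273418, 3.520687],
])
y = np.array([1, 1, 1, 1, 2, 2, 2])


def lda_scores(X, y):
    """Return (classes, scores); scores[i, k] is the linear score of row i for class k."""
    classes = np.unique(y)
    n, k = len(y), len(classes)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    priors = np.array([(y == c).mean() for c in classes])

    # Pooled within-class covariance, shared by all classes
    pooled = np.zeros((X.shape[1], X.shape[1]))
    for c, mu in zip(classes, means):
        d = X[y == c] - mu
        pooled += d.T @ d
    pooled /= n - k

    inv = np.linalg.inv(pooled)
    # delta_k(x) = x' S^-1 mu_k - 0.5 mu_k' S^-1 mu_k + log(pi_k)
    linear = X @ inv @ means.T
    const = -0.5 * np.sum(means @ inv * means, axis=1) + np.log(priors)
    return classes, linear + const


classes, scores = lda_scores(X, y)
pred_manual = classes[scores.argmax(axis=1)]
print(pred_manual)
```

On this data the manual rule assigns the first four rows to class 1 and the last three to class 2, the same labels that `clf.predict` returned above.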

## Multi class

In [32]:
os.chdir('C:\\Users\\satish\\Downloads\\')


# load data
iris = pd.read_csv("iris.csv")
iris.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


In [33]:
X= iris[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
y= iris[['species']]

In [34]:
# split into train and test
(X_train,X_test,y_train,y_test) = train_test_split(X, y, test_size = 0.3, stratify = y, random_state = 100)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(105, 4)
(45, 4)
(105, 1)
(45, 1)
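
The effect of `stratify` is easiest to see by counting classes on each side of the split. The sketch below assumes scikit-learn's bundled iris data (`load_iris`) as a stand-in for the `iris.csv` file used here, so the labels are integers 0, 1, 2 rather than species names.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: the bundled iris data stands in for iris.csv
X_demo, y_demo = load_iris(return_X_y=True)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=100
)

# Count how many rows of each class landed in each part of the split
train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print("train:", train_counts)
print("test: ", test_counts)
```

With 50 rows per class and a 70/30 split, stratification should leave 35 rows of each class in the training set and 15 in the test set, so the class balance is the same on both sides.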


In [44]:
# fit model 
clf = LinearDiscriminantAnalysis(solver="eigen")
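# pass the labels as a 1-D array; a column-vector y makes scikit-learn emit a warning
clf.fit(X_train, y_train.values.ravel())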

# make predictions
pred = clf.predict(X_train)
probab = clf.predict_proba(X_train)

print("Parameters:", clf.get_params())
print("Predictions:", pred)
print("Class order:",clf.classes_)
print("Class probabilities:", probab)


Parameters: {'covariance_estimator': None, 'n_components': None, 'priors': None, 'shrinkage': None, 'solver': 'eigen', 'store_covariance': False, 'tol': 0.0001}
Predictions: ['Iris-versicolor' 'Iris-versicolor' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-virginica' 'Iris-virginica' 'Iris-versicolor'
 'Iris-virginica' 'Iris-virginica' 'Iris-setosa' 'Iris-setosa'
 'Iris-virginica' 'Iris-virginica' 'Iris-setosa' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-virginica' 'Iris-virginica' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-setosa' 'Iris-virginica' 'Iris-virginica'
 'Iris-virginica' 'Iris-versicolor' 'Iris-setosa' 'Iris-versicolor'
 'Iris-setosa' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-virginica' 'Iris-setosa'
 'Iris-virginica' 'Iris-versicolor' 'Iris-versicolor' 'Iris-virginica'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-virginica' 'Iris-virginica'
 'Iris-versicolor' 'Iris-setosa' 'Iris-virginica' 'Iris-versicolor'
 'Iri

In [45]:
from sklearn.metrics import accuracy_score
print("Training accuracy:")
accuracy_score(y_train, pred)


Training accuracy:


0.9809523809523809

In [53]:
# test data performance
pred = clf.predict(X_test)
probab = clf.predict_proba(X_test)

print("Predictions:", pred)
print("Class order:",clf.classes_)
print("Class probabilities:", probab)
# means_ holds the group means: the average of each predictor within each class,
# which LDA uses as estimates of μ_k
print("Means:", clf.means_)
print("Coefficients:", clf.coef_)
print("Prior probabilities:", clf.priors_)


Predictions: ['Iris-setosa' 'Iris-versicolor' 'Iris-virginica' 'Iris-versicolor'
 'Iris-virginica' 'Iris-setosa' 'Iris-versicolor' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-setosa' 'Iris-versicolor' 'Iris-virginica'
 'Iris-virginica' 'Iris-virginica' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-virginica' 'Iris-versicolor' 'Iris-versicolor' 'Iris-virginica'
 'Iris-versicolor' 'Iris-setosa' 'Iris-virginica' 'Iris-versicolor'
 'Iris-setosa' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica'
 'Iris-virginica' 'Iris-setosa' 'Iris-versicolor' 'Iris-virginica'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-versicolor' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
Class order: ['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
Class probabilities: [[1.00000000e+00 8.91206761e-22 3.29490966e-44]
 [1.36211440e-24 9.99949982e-01 5.00182420e-05]
 [1.02850458e-43 3.13647312e-04 9.99686353e-01]
 [1.388257
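
The fitted attributes printed in this cell are enough to reproduce the model's decisions by hand. The following sketch again assumes scikit-learn's bundled iris data in place of `iris.csv`, fits on the full dataset rather than the training split, and uses the illustrative names `lda` and `manual_scores`. It checks that the linear scores are `X @ coef_.T + intercept_` and that the default priors are the class frequencies.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Assumption: the bundled iris data stands in for iris.csv
X_demo, y_demo = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis(solver="eigen").fit(X_demo, y_demo)

# One row of linear scores per sample, one column per class
manual_scores = X_demo @ lda.coef_.T + lda.intercept_
scores_match = np.allclose(manual_scores, lda.decision_function(X_demo))

# With no priors passed in, priors_ are the class frequencies in the fitted data
class_freq = np.bincount(y_demo) / len(y_demo)
priors_match = np.allclose(class_freq, lda.priors_)

# The predicted class is the one with the largest score
labels_match = np.array_equal(
    lda.classes_[manual_scores.argmax(axis=1)], lda.predict(X_demo)
)

print(scores_match, priors_match, labels_match)
```

Each row of `coef_` together with the matching entry of `intercept_` defines one class's linear discriminant, which is why `coef_` has one row per class and one column per predictor.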

In [47]:
from sklearn.metrics import accuracy_score
print("Test accuracy:")
accuracy_score(y_test, pred)


Test accuracy:


0.9777777777777777

In [48]:
# form confusion matrix and find accuracy scores
from sklearn.metrics import confusion_matrix

c= confusion_matrix(y_test, pred)
c

array([[15,  0,  0],
       [ 0, 15,  0],
       [ 0,  1, 14]], dtype=int64)

In [49]:
# full report
from sklearn import metrics
print("Test data performance:")
print(metrics.classification_report(y_test, clf.predict(X_test)))


Test data performance:
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        15
Iris-versicolor       0.94      1.00      0.97        15
 Iris-virginica       1.00      0.93      0.97        15

       accuracy                           0.98        45
      macro avg       0.98      0.98      0.98        45
   weighted avg       0.98      0.98      0.98        45
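
The figures in the classification report can be recovered directly from the confusion matrix. The sketch below hard-codes the matrix printed above and assumes scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix printed above: rows are true classes, columns are predictions
cm = np.array([
    [15, 0, 0],
    [0, 15, 0],
    [0, 1, 14],
])
labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

true_pos = np.diag(cm)
precision = true_pos / cm.sum(axis=0)  # correct / everything predicted as the class
recall = true_pos / cm.sum(axis=1)     # correct / everything truly in the class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_pos.sum() / cm.sum()

for name, p, r, f in zip(labels, precision, recall, f1):
    print(f"{name:16s} precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.2f}")
```

The single misclassified flower (a true Iris-virginica predicted as Iris-versicolor) lowers the recall of virginica to 14/15 and the precision of versicolor to 15/16, and gives the overall accuracy of 44/45.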



## Quadratic discriminant analysis

Quadratic discriminant analysis (QDA) provides an alternative to LDA. Like LDA, the QDA classifier assumes that the observations from each class are drawn from a Gaussian distribution and plugs estimates of the parameters into Bayes' theorem to perform prediction. Unlike LDA, however, QDA assumes that each class has its own covariance matrix. As a result, the QDA classifier is a quadratic, rather than a linear, function of the predictors.
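
A small NumPy sketch can illustrate why per-class covariance matrices matter. The helper `qda_scores` and the toy parameters are hypothetical and chosen only for illustration: two classes share the same mean but have very different spreads, a situation no linear rule can separate.

```python
import numpy as np


def qda_scores(x, means, covs, priors):
    """Quadratic discriminant score of point x for each class."""
    scores = []
    for mu, cov, prior in zip(means, covs, priors):
        diff = x - mu
        _, logdet = np.linalg.slogdet(cov)
        maha = diff @ np.linalg.solve(cov, diff)
        # delta_k(x) = -0.5 log|S_k| - 0.5 (x - mu_k)' S_k^-1 (x - mu_k) + log(pi_k)
        scores.append(-0.5 * logdet - 0.5 * maha + np.log(prior))
    return np.array(scores)


# Hypothetical toy parameters: identical means, very different spread
means = [np.zeros(2), np.zeros(2)]
covs = [0.01 * np.eye(2), 100.0 * np.eye(2)]
priors = [0.5, 0.5]

near = qda_scores(np.array([0.0, 0.0]), means, covs, priors).argmax()
far = qda_scores(np.array([5.0, 5.0]), means, covs, priors).argmax()
print(near, far)
```

The point at the origin goes to the tightly concentrated class (index 0) and the distant point goes to the widely spread class (index 1). The decision boundary here is a circle around the common mean, which is the kind of curved boundary that the quadratic term makes possible.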

In [60]:
# fit model 
clf = QuadraticDiscriminantAnalysis()
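# pass the labels as a 1-D array; a column-vector y makes scikit-learn emit a warning
clf.fit(X_train, y_train.values.ravel())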

# make predictions
pred = clf.predict(X_train)
probab = clf.predict_proba(X_train)

print("Parameters:", clf.get_params())
print("Predictions:", pred)
print("Class order:",clf.classes_)
print("Class probabilities:", probab)

print("Means:", clf.means_)
print("Prior probabilities:", clf.priors_)


Parameters: {'priors': None, 'reg_param': 0.0, 'store_covariance': False, 'tol': 0.0001}
Predictions: ['Iris-versicolor' 'Iris-versicolor' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-virginica' 'Iris-virginica' 'Iris-versicolor'
 'Iris-virginica' 'Iris-virginica' 'Iris-setosa' 'Iris-setosa'
 'Iris-virginica' 'Iris-virginica' 'Iris-setosa' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-virginica' 'Iris-virginica' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-setosa' 'Iris-virginica' 'Iris-virginica'
 'Iris-virginica' 'Iris-versicolor' 'Iris-setosa' 'Iris-versicolor'
 'Iris-setosa' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-virginica' 'Iris-setosa'
 'Iris-virginica' 'Iris-versicolor' 'Iris-versicolor' 'Iris-virginica'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-virginica' 'Iris-virginica'
 'Iris-versicolor' 'Iris-setosa' 'Iris-virginica' 'Iris-versicolor'
 'Iris-setosa' 'Iris-versicolor' 'Iris-virginica' 'Iris-virginica'
 'Iris-ver


In [61]:
from sklearn.metrics import accuracy_score
print("Training accuracy:")
accuracy_score(y_train, pred)


Training accuracy:


0.9809523809523809

In [62]:
# test data performance
pred = clf.predict(X_test)
probab = clf.predict_proba(X_test)

print("Predictions:", pred)
print("Class order:",clf.classes_)
print("Class probabilities:", probab)

Predictions: ['Iris-setosa' 'Iris-versicolor' 'Iris-virginica' 'Iris-versicolor'
 'Iris-virginica' 'Iris-setosa' 'Iris-versicolor' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-setosa' 'Iris-versicolor' 'Iris-virginica'
 'Iris-virginica' 'Iris-virginica' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-virginica' 'Iris-versicolor' 'Iris-versicolor' 'Iris-virginica'
 'Iris-versicolor' 'Iris-setosa' 'Iris-virginica' 'Iris-versicolor'
 'Iris-setosa' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica'
 'Iris-virginica' 'Iris-setosa' 'Iris-versicolor' 'Iris-virginica'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-versicolor' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
Class order: ['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
Class probabilities: [[1.00000000e+000 2.73491146e-023 1.83427424e-040]
 [1.35041702e-079 9.99779674e-001 2.20326403e-004]
 [6.42891552e-158 4.53251218e-006 9.99995467e-001]
 

In [63]:
from sklearn.metrics import accuracy_score
print("Test accuracy:")
accuracy_score(y_test, pred)


Test accuracy:


0.9777777777777777

In [64]:
# form confusion matrix and find accuracy scores
from sklearn.metrics import confusion_matrix

c= confusion_matrix(y_test, pred)
c

array([[15,  0,  0],
       [ 0, 15,  0],
       [ 0,  1, 14]], dtype=int64)

In [65]:
# full report
from sklearn import metrics
print("Test data performance:")
print(metrics.classification_report(y_test, clf.predict(X_test)))


Test data performance:
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        15
Iris-versicolor       0.94      1.00      0.97        15
 Iris-virginica       1.00      0.93      0.97        15

       accuracy                           0.98        45
      macro avg       0.98      0.98      0.98        45
   weighted avg       0.98      0.98      0.98        45

