This notebook implements **Linear Discriminant Analysis (LDA)** as explained in the book "**Python Machine Learning**" by **Sebastian Raschka** and **Vahid Mirjalili**.

Prerequisites:

* Python
* pandas
* numpy
* matplotlib
* scikit-learn

**Dataset:** Wine

**Note:** Descriptive comments in the code cells explain each step.

**Assumptions for LDA**: 

* Samples are normally distributed
* Features are statistically independent
* Classes have identical covariance matrices


Import necessary packages:

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
from matplotlib.colors import ListedColormap

Load the Wine dataset into a pandas DataFrame:



In [None]:
df_wine = pd.read_csv('../input/Wine.csv')
df_wine.head()

Assign column names to the DataFrame:

In [None]:
df_wine.columns = [
    'name',
    'alcohol',
    'malicAcid',
    'ash',
    'ashalcalinity',
    'magnesium',
    'totalPhenols',
    'flavanoids',
    'nonFlavanoidPhenols',
    'proanthocyanins',
    'colorIntensity',
    'hue',
    'od280_od315',
    'proline',
]
df_wine.head()

Step 1: Split the data into training and test sets in a 70:30 ratio, then standardize the features so that each one has zero mean and unit variance before LDA is applied

In [None]:
# make train-test sets
from sklearn.model_selection import train_test_split
# features are in columns 1 onward, class labels are in column 0
X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values
# print(np.unique(y))
# split with stratify on y for equal proportion of classes in train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

# standardize the features: fit the scaler on the training set only, then reuse it on the test set
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)
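
The effect of fitting the scaler on the training set only can be checked on a small example. The sketch below uses made-up toy data (the names `X_toy_train`, `X_toy_test` and `sc_toy` are illustrative and not part of the notebook) and assumes scikit-learn is installed:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales
rng = np.random.RandomState(0)
X_toy_train = np.column_stack([rng.normal(10.0, 2.0, 6), rng.normal(1000.0, 300.0, 6)])
X_toy_test = np.column_stack([rng.normal(10.0, 2.0, 3), rng.normal(1000.0, 300.0, 3)])

sc_toy = StandardScaler()
# fit_transform learns the per-feature mean and standard deviation from the training data
X_toy_train_std = sc_toy.fit_transform(X_toy_train)
# transform reuses the training statistics, so the test set is scaled consistently
X_toy_test_std = sc_toy.transform(X_toy_test)

print('Train mean per feature:', X_toy_train_std.mean(axis=0))
print('Train std per feature: ', X_toy_train_std.std(axis=0))
```

The standardized training features have zero mean and unit standard deviation. The test features generally do not, because they are scaled with the training statistics.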

Step 2: Compute the mean vectors of the features for each class label

>> $m_i = \frac{1}{n_i} \sum_{x \in D_i} x$, where $D_i$ is the set of the $n_i$ training samples in class $i$, so each $m_i$ holds one mean per feature

In [None]:
# set precision of the vectors
np.set_printoptions(precision=4)
mean_vecs = []

# for each label compute the mean vector
for label in range(1, 4):
    mean_vecs.append(np.mean(X_train_std[y_train == label], axis=0))
    print('Mean Vector %s: %s\n' % (label, mean_vecs[label - 1]))
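
To make the boolean-mask selection concrete, here is a minimal sketch on a hand-made array whose class means are easy to verify by hand. The names `X_toy`, `y_toy` and `toy_means` are hypothetical and only serve the illustration:

```python
import numpy as np

# Hypothetical toy data: 4 samples, 2 features, 2 classes
X_toy = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0],
                  [7.0, 10.0]])
y_toy = np.array([1, 1, 2, 2])

# y_toy == label is a boolean mask that selects the rows of one class;
# the mean over axis 0 then averages each feature across those rows
toy_means = {int(label): X_toy[y_toy == label].mean(axis=0) for label in np.unique(y_toy)}

for label, mean_vec in toy_means.items():
    print('Mean Vector %s: %s' % (label, mean_vec))
```

Class 1 averages the rows `[1, 2]` and `[3, 4]`, and class 2 averages `[5, 6]` and `[7, 10]`.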

Step 3: Compute the between class and within class **Scatter Matrices** using the mean vectors

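**Within Class Scatter Matrix**: the sum of the scatter matrices of the individual classes, i.e.

>> $S_W = \sum_{i=1}^{c} S_i$, where $S_i = \sum_{x \in D_i} (x - m_i)(x - m_i)^T$
>> Here $i$ runs over the $c$ classes, $D_i$ is the set of samples in class $i$, $m_i$ is the class mean vector and $T$ denotes the transpose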

In [None]:
# define number of features
d = 13
# define the within class scatter matrix of dimension d x d
S_W = np.zeros((d, d))

# run through each class label and keep track of the corresponding mean vector
for label, mv in zip(range(1, 4), mean_vecs):
    # define class scatter matrix for each label of dimension d x d
    class_scatter = np.zeros((d, d))
    # reshape the class mean to a column vector of dimension d x 1
    mv = mv.reshape(d, 1)

    # run through each row corresponding to a class label and compute the class scatter matrix
    for row in X_train_std[y_train == label]:
        # reshape to a column vector of dimension d x 1
        row = row.reshape(d, 1)
        # sum for each row d x d dimensional class matrices
        class_scatter += (row - mv).dot((row - mv).T)
    S_W += class_scatter
# within class scatter matrix of dimension d x d
print("Within Class Scatter Matrix: %s x %s" % (S_W.shape[0], S_W.shape[1]))

In [None]:
print('Class label distribution: %s' % np.bincount(y_train)[1:])

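As can be seen above, the class labels are not uniformly distributed in the training set, so the individual class scatter matrices should be scaled before they are summed into the Within Class Scatter Matrix. Dividing each class scatter matrix by the number of samples in that class gives the covariance matrix of the class, so the covariance matrix is simply a normalized version of the scatter matrix. Note that `np.cov`, used in the next cell, divides by $n_i - 1$ rather than $n_i$ by default. In equations:

>> $S_W^{scaled} = \sum_{i=1}^{c} \Sigma_i$, where $\Sigma_i = \frac{1}{n_i} S_i = \frac{1}{n_i} \sum_{x \in D_i} (x - m_i)(x - m_i)^T$
>> Here $n_i$ is the number of samples in class $i$ and $D_i$ is the set of those samples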

In [None]:
S_W = np.zeros((d, d))
for label in range(1, 4):
    # np.cov expects one feature per row, hence the transpose
    class_scatter = np.cov(X_train_std[y_train == label].T)
    S_W += class_scatter
print('Scaled Within Class Scatter Matrix: %sx%s' % (S_W.shape[0], S_W.shape[1]))
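
The relationship between a class scatter matrix and `np.cov` can be verified numerically. The sketch below uses random toy data (the names `X_c`, `S_c` and `n_c` are illustrative assumptions, not notebook variables) and compares the hand-computed scatter matrix with both normalizations offered by `np.cov`:

```python
import numpy as np

# Hypothetical samples of a single class: 20 samples, 3 features
rng = np.random.RandomState(1)
X_c = rng.normal(size=(20, 3))
n_c = X_c.shape[0]

# scatter matrix: sum over samples of (x - m)(x - m)^T, written as one matrix product
m_c = X_c.mean(axis=0)
S_c = (X_c - m_c).T.dot(X_c - m_c)

# np.cov treats rows as variables, hence the transpose
cov_default = np.cov(X_c.T)             # normalizes by n - 1
cov_biased = np.cov(X_c.T, bias=True)   # normalizes by n

print(np.allclose(cov_default, S_c / (n_c - 1)))
print(np.allclose(cov_biased, S_c / n_c))
```

Both normalizations differ only by a constant factor per class, so either one removes the dependence of the class scatter matrix on the class size.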

**Between Class Scatter Matrix**:  
>> $S_B = \sum_{i=1}^{c} n_i (m_i - m)(m_i - m)^T$, where $n_i$ is the number of samples in class $i$, $m_i$ is the mean vector of class $i$ and $m$ is the overall mean computed over the samples of all classes

In [None]:
# calculate the overall mean vector as a column vector of dimension d x 1
mean_overall = np.mean(X_train_std, axis=0).reshape(d, 1)
# define Between Class Scatter Matrix of dimension d x d
S_B = np.zeros((d, d))
for i, mean_vec in enumerate(mean_vecs):
    # find number of samples for each class (class labels start at 1)
    n = X_train[y_train == i + 1].shape[0]
    mean_vec = mean_vec.reshape(d, 1)
    # find the scatter matrix using the above equation
    S_B += n * (mean_vec - mean_overall).dot((mean_vec - mean_overall).T)
print('Between Class Scatter Matrix: %sx%s' % (S_B.shape[0], S_B.shape[1]))

Step 4: Decompose the matrix $S_W^{-1} S_B$ into eigenpairs and sort them in descending order of eigenvalue

In [None]:
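# eigendecomposition of inv(S_W) . S_B
eigen_vals, eigen_vecs = np.linalg.eig(np.linalg.inv(S_W).dot(S_B))
# pair the magnitude of each eigenvalue with its eigenvector (a column of eigen_vecs)
eigen_pairs = [(np.abs(eigen_vals[i]), eigen_vecs[:, i]) for i in range(len(eigen_vals))]
# sort the pairs by eigenvalue, largest first
eigen_pairs = sorted(eigen_pairs, key=lambda k: k[0], reverse=True)
print('Eigenvalues in descending order: \n')
for eigen_pair in eigen_pairs:
    print(eigen_pair[0])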

As the result above shows, LDA yields at most c-1 linear discriminants, where c is the number of classes. The between-class scatter matrix S_B is the sum of c matrices of rank 1 or less, and because the class-mean deviations weighted by class size sum to zero, its rank is at most c-1. The remaining eigenvalues are not exactly zero but extremely close to it, which is only an effect of NumPy's floating-point arithmetic.
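
The rank argument can be checked on synthetic data. The following sketch builds a between-class scatter matrix for three hypothetical classes with five features (all names, such as `S_B_toy`, are illustrative) and counts the eigenvalues that are not numerically zero. The relative threshold `1e-10` is an arbitrary assumption chosen for the illustration:

```python
import numpy as np

# Hypothetical data: 3 classes with 30 samples each, 5 features
rng = np.random.RandomState(2)
n_classes, n_features = 3, 5
X_r = rng.normal(size=(90, n_features))
y_r = np.repeat(np.arange(1, n_classes + 1), 30)

m_all = X_r.mean(axis=0)
S_B_toy = np.zeros((n_features, n_features))
for label in range(1, n_classes + 1):
    X_l = X_r[y_r == label]
    # deviation of the class mean from the overall mean as a column vector
    diff = (X_l.mean(axis=0) - m_all).reshape(-1, 1)
    # each term is an outer product and therefore has rank 1 or less
    S_B_toy += X_l.shape[0] * diff.dot(diff.T)

# S_B_toy is symmetric, so eigvalsh returns real eigenvalues
eig_toy = np.linalg.eigvalsh(S_B_toy)
n_nonzero = int(np.sum(eig_toy > 1e-10 * eig_toy.max()))
print('Eigenvalues:', eig_toy)
print('Numerically non-zero eigenvalues:', n_nonzero)
```

With three classes, at most two eigenvalues are meaningfully different from zero, no matter how many features there are.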

Let's now plot the linear discriminants by decreasing eigenvalues to check how much class discriminatory information is captured.

In [None]:
tot = sum(eigen_vals.real)
discr = [(i / tot) for i in sorted(eigen_vals.real,reverse=True)]
cum_discr = np.cumsum(discr)
plt.bar(range(1, 14), discr, alpha=0.5, align='center',label='individual "discriminability"')
plt.step(range(1, 14), cum_discr, where='mid',label='cumulative "discriminability"')
plt.ylabel('"discriminability" ratio')
plt.xlabel('Linear Discriminants')
plt.ylim([-0.1, 1.1])
plt.legend(loc='best')
plt.show()

The figure above shows that the first two linear discriminants capture almost 100% of the class-discriminatory information.
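
The arithmetic behind the individual and cumulative ratios is easiest to follow with round numbers. The eigenvalues in this sketch are made up for illustration and are not the values computed from the Wine data:

```python
import numpy as np

# Hypothetical eigenvalues, deliberately unsorted
toy_eigen_vals = np.array([2.0, 0.0, 8.0])

# sort in descending order and divide by the total to get each discriminant's share
sorted_vals = np.sort(toy_eigen_vals)[::-1]
ratios = sorted_vals / sorted_vals.sum()
# the running sum shows how much is captured by the first k discriminants
cumulative = np.cumsum(ratios)

print('Individual ratios:', ratios)
print('Cumulative ratios:', cumulative)
```

Here the first discriminant accounts for 8/10 of the total and the first two together account for all of it, which is the same pattern the bar and step plot display for the real eigenvalues.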

Step 5: Construct the transformation matrix using the top 2 discriminants

In [None]:
#transformation matrix of dimension d x k i.e. 13 x 2 here
w = np.hstack((eigen_pairs[0][1][:, np.newaxis].real,eigen_pairs[1][1][:, np.newaxis].real))
print('Matrix W:\n', w)

Step 6: Project the samples onto the new feature sub-space using the transformation matrix

In [None]:
# Xnew = Xorig.W
X_train_lda = X_train_std.dot(w)
colors = ['r', 'b', 'g']
markers = ['s', 'x', 'o']
for l, c, m in zip(np.unique(y_train), colors, markers):
    plt.scatter(X_train_lda[y_train==l, 0], X_train_lda[y_train==l, 1] * (-1), c=c, label=l, marker=m)
plt.xlabel('LD 1')
plt.ylabel('LD 2')
plt.legend(loc='lower right')
plt.show()

The plot above shows that the three wine classes become linearly separable in the new feature subspace, so a linear classifier can tell them apart.
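
As a cross-check of the manual steps, the same projection can be obtained from scikit-learn's `LinearDiscriminantAnalysis`. This is a minimal sketch under one loud assumption: it loads scikit-learn's bundled Wine data through `load_wine` as a stand-in for `../input/Wine.csv`, so its class labels run from 0 to 2 instead of 1 to 3. The variable names are illustrative, and the signs or scaling of the discriminants may differ from the manual result:

```python
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled Wine data stands in for ../input/Wine.csv
X_sk, y_sk = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_sk, y_sk, test_size=0.3, stratify=y_sk, random_state=0)

# standardize with statistics learned from the training set only
sc_sk = StandardScaler()
X_tr_std = sc_sk.fit_transform(X_tr)
X_te_std = sc_sk.transform(X_te)

# LDA is supervised, so fitting needs the class labels as well
lda = LinearDiscriminantAnalysis(n_components=2)
X_tr_lda = lda.fit_transform(X_tr_std, y_tr)
X_te_lda = lda.transform(X_te_std)

print('Projected shapes:', X_tr_lda.shape, X_te_lda.shape)
print('Explained variance ratio:', lda.explained_variance_ratio_)
```

The projected arrays have two columns, one per linear discriminant, and can be plotted in the same way as `X_train_lda`.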