<font color="green">*To start working on this notebook, or any other notebook that we will use in the Moringa Data Science Course, we will need to save our own copy of it. We can do this by clicking File > Save a Copy in Drive. We will then be able to make edits to our own copy of this notebook.*</font>

# Multivariate Analysis with Python: Linear Discriminant Analysis

## Linear Discriminant Analysis

**Linear Discriminant Analysis (LDA)** is a simple and powerful linear transformation that is most commonly used as a dimensionality reduction technique in the pre-processing step of machine learning applications. The goal of linear discriminant analysis is to project a dataset onto a lower-dimensional space with good class separability, in order to avoid overfitting (the "curse of dimensionality") and to reduce computational costs.

To further explain dimensionality reduction:

> Dimensionality reduction is a machine learning or statistical technique for reducing the number of random variables in a problem by obtaining a set of principal variables. It can be carried out using a number of methods that simplify the modeling of complex problems, eliminate redundancy, and reduce the risk of the model overfitting and thereby fitting noise that does not belong to the underlying pattern.

For a longer explanation of dimensionality reduction, see this [overview of dimension reduction methods](https://www.analyticsvidhya.com/blog/2015/07/dimension-reduction-methods/).
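To make the idea of "projecting onto a space with good class separability" more concrete, here is a minimal NumPy sketch of the classical scatter-matrix formulation of LDA. It is an illustration only, not the routine scikit-learn uses internally. The helper name `lda_projection` and the use of `sklearn.datasets.load_iris` as a stand-in dataset are assumptions made for this sketch.

```python
import numpy as np
from sklearn.datasets import load_iris


def lda_projection(X, y, n_components):
    """Hypothetical helper: return the top discriminant directions for (X, y)."""
    classes = np.unique(y)
    n_features = X.shape[1]
    overall_mean = X.mean(axis=0)

    S_w = np.zeros((n_features, n_features))  # within-class scatter
    S_b = np.zeros((n_features, n_features))  # between-class scatter
    for c in classes:
        X_c = X[y == c]
        mean_c = X_c.mean(axis=0)
        centered = X_c - mean_c
        S_w += centered.T @ centered
        diff = (mean_c - overall_mean).reshape(-1, 1)
        S_b += X_c.shape[0] * (diff @ diff.T)

    # Directions that maximise between-class scatter relative to within-class scatter
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_w) @ S_b)
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs.real[:, order[:n_components]]


# Assumption: scikit-learn's bundled iris data stands in for the CSV used below
X_demo, y_demo = load_iris(return_X_y=True)
W = lda_projection(X_demo, y_demo, n_components=2)
X_projected = X_demo @ W
print(X_projected.shape)  # (150, 2)
```

The four original features are reduced to two columns, each a linear combination chosen so that the class means are pushed apart while the spread within each class stays small.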

### Example 1

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Example 1
# ---
# Question: Let's apply linear discriminant analysis on the iris dataset below before any modeling 
# ---
# Dataset url = http://bit.ly/IrisDataset
# ---
#
iris = pd.read_csv('http://bit.ly/IrisDataset')
iris.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


In [2]:
# Step 3: Once the dataset is loaded into a pandas DataFrame, the first step is to divide it 
# into features and corresponding labels, and then split the result into training and test sets. 
# The following code divides the data into a feature set and labels. 
# It assigns the first four columns of the dataset (the feature set) to the X variable, 
# while the values in the fifth column (the labels) are assigned to the y variable.
#
X = iris.iloc[:, 0:4].values
y = iris.iloc[:, 4].values
X

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
       [4.9, 3
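Positional slicing with `iloc` depends on the column order of the file. A possible alternative, sketched below, is to select the features by name. The small DataFrame here is rebuilt by hand from the five rows shown by `iris.head()` so that the snippet runs on its own; the names `sample`, `feature_columns`, `X_named` and `y_named` are illustrative and not part of the original notebook.

```python
import pandas as pd

# The five rows displayed by iris.head(), rebuilt inline for illustration
sample = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 4.7, 4.6, 5.0],
    'sepal_width': [3.5, 3.0, 3.2, 3.1, 3.6],
    'petal_length': [1.4, 1.4, 1.3, 1.5, 1.4],
    'petal_width': [0.2, 0.2, 0.2, 0.2, 0.2],
    'species': ['Iris-setosa'] * 5,
})

# Selecting by name keeps working even if the column order changes
feature_columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
X_named = sample[feature_columns].values
y_named = sample['species'].values

print(X_named.shape)  # (5, 4)
print(y_named[0])     # Iris-setosa
```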

In [3]:
# Step 4: The following code divides data into training and test sets
#
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
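A plain random split does not guarantee that each class is equally represented in the test set; the confusion matrix later in this example has row sums of 11, 13 and 6. If balanced class proportions matter, one option is the `stratify` argument of `train_test_split`. The sketch below assumes scikit-learn's bundled copy of the iris data (`load_iris`) in place of the CSV, and uses separate variable names so that it does not overwrite the notebook's own split.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: the bundled iris data, with integer labels 0, 1, 2
X_demo, y_demo = load_iris(return_X_y=True)

# stratify=y_demo keeps the class proportions the same in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print(X_tr.shape, X_te.shape)  # (120, 4) (30, 4)
print(np.bincount(y_te))       # [10 10 10]
```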

In [4]:
# Step 5: Feature scaling
# We now need to perform feature scaling. We execute the following code to do so:
# 
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
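The scaler is fitted on the training set only and then reused on the test set, so no information from the test data leaks into the pre-processing. The following sketch checks what that means numerically. It assumes the bundled `load_iris` data and uses its own variable names (`scaler`, `X_tr_scaled`, `X_te_scaled`) rather than the notebook's.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learns mean and std from the training set
X_te_scaled = scaler.transform(X_te)      # reuses those training statistics

# The training features now have zero mean and unit variance
print(np.allclose(X_tr_scaled.mean(axis=0), 0))  # True
print(np.allclose(X_tr_scaled.std(axis=0), 1))   # True

# The test set is scaled with the training mean and std, not its own
manual = (X_te - X_tr.mean(axis=0)) / X_tr.std(axis=0)
print(np.allclose(X_te_scaled, manual))          # True
```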

In [5]:
# Step 6: Performing LDA
# It requires only four lines of code to perform LDA with scikit-learn. 
# The LinearDiscriminantAnalysis class of the sklearn.discriminant_analysis 
# module can be used to perform LDA in Python. 
# Let's take a look at the following code:
#

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
lda = LDA(n_components=1)
X_train = lda.fit_transform(X_train, y_train)
X_test = lda.transform(X_test)
X_train
# In the script above, the LinearDiscriminantAnalysis class is imported as LDA. 
# We pass a value for the n_components parameter of LDA, 
# which is the number of linear discriminants that we want to retrieve. 
# In this case we set n_components to 1, since we first want to check the performance 
# of our classifier with a single linear discriminant. 
# Finally, we call fit_transform on the training set and transform on the test set to retrieve the linear discriminants.
# Notice that, for LDA, fit_transform takes two arguments: X_train and y_train. 
# This reflects the fact that LDA takes the class labels into account when selecting the linear discriminants.

array([[-4.84326712],
       [-2.88401579],
       [ 8.04118724],
       [-3.90012642],
       [-6.55946413],
       [-1.82612383],
       [ 8.21851986],
       [-4.31291764],
       [-1.67209307],
       [-1.01913925],
       [-5.37811974],
       [ 7.59082681],
       [-4.76554212],
       [ 6.51049517],
       [ 8.510923  ],
       [-0.93522367],
       [-4.87049552],
       [-6.64278899],
       [-4.46418919],
       [-6.01092077],
       [-0.85385771],
       [-6.37544577],
       [-2.29140422],
       [-0.85778425],
       [-4.08563137],
       [-3.59738247],
       [-4.60441081],
       [-4.8516968 ],
       [-1.14965857],
       [-4.72372712],
       [-2.00993261],
       [ 6.87077438],
       [-5.33964055],
       [-1.83064108],
       [-0.18289676],
       [-1.73467442],
       [-2.11519115],
       [-5.03636778],
       [ 8.40710326],
       [ 7.64836207],
       [-5.17860481],
       [-1.06064911],
       [ 7.65589589],
       [ 9.19366357],
       [-2.11943694],
       [ 6
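LDA can produce at most `min(n_features, n_classes - 1)` discriminants, so with three iris species the largest valid `n_components` is 2. The sketch below, which assumes the bundled `load_iris` data and the default solver, fits two discriminants and inspects `explained_variance_ratio_` to see how much of the between-class variance each one carries. In recent scikit-learn releases, asking for more components than the limit is expected to raise a `ValueError`; the exact behaviour may differ in older versions.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X_demo, y_demo = load_iris(return_X_y=True)

# Two discriminants is the maximum for a three-class problem
lda_demo = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda_demo.fit_transform(X_demo, y_demo)

print(X_lda.shape)  # (150, 2)
print(lda_demo.explained_variance_ratio_)  # share of between-class variance per discriminant

# Requesting a third discriminant exceeds min(n_features, n_classes - 1)
try:
    LinearDiscriminantAnalysis(n_components=3).fit(X_demo, y_demo)
except ValueError as err:
    print('ValueError:', err)
```

If the first ratio is close to 1, a single discriminant already captures most of the class separation, which is consistent with the choice of `n_components=1` in this example.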

In [6]:
# Step 7: Training and Making Predictions
# We will use a random forest classifier to evaluate the performance of the LDA-reduced features, as shown
# 

from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(max_depth=2, random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

In [7]:
# Step 8: Evaluating the Performance
# As always, the last step is to evaluate performance of the algorithm 
# with the help of a confusion matrix and find the accuracy of the prediction.
# 

from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

cm = confusion_matrix(y_test, y_pred)
print(cm)
print('Accuracy: ' + str(accuracy_score(y_test, y_pred)))

# We can see that with one linear discriminant, the algorithm achieved an accuracy of 100%, 
# which is greater than the accuracy achieved with one principal component, which was 93.33%.

[[11  0  0]
 [ 0 13  0]
 [ 0  0  6]]
Accuracy: 1.0
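An accuracy measured on a single 30-sample test set can vary noticeably from one split to another. As a possible cross-check, the same three steps (scaling, LDA, random forest) can be wrapped in a pipeline and scored with 5-fold cross-validation, so that the scaler and the LDA projection are refitted on the training portion of every fold. This is a sketch under the assumption that the bundled `load_iris` data is an acceptable substitute for the CSV; the name `model` is illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_iris(return_X_y=True)

# Same steps as in the notebook, chained so each fold is pre-processed independently
model = make_pipeline(
    StandardScaler(),
    LinearDiscriminantAnalysis(n_components=1),
    RandomForestClassifier(max_depth=2, random_state=0),
)

scores = cross_val_score(model, X_demo, y_demo, cv=5)
print(scores)         # one accuracy value per fold
print(scores.mean())  # average accuracy across the five folds
```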


### <font color="green">Challenges</font>

In [8]:
# Challenge 1
# ---
# Question: Perform linear discriminant analysis to predict the cellular localization sites of proteins
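# Dataset url = https://www.kaggle.com/imnikhilanand/heart-attack-analysis/data
# ---
# Note: this URL appears to point to a Kaggle web page rather than a raw CSV file, 
# so pd.read_csv cannot parse it (hence the ParserError below). 
# Download the CSV from that page and pass its local path to pd.read_csv instead.
#
df = pd.read_csv('https://www.kaggle.com/imnikhilanand/heart-attack-analysis/data')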
df.head()

ParserError: Error tokenizing data. C error: Expected 1 fields in line 7, saw 2


In [None]:
# Challenge 2
# ---
# Question: Using the breast cancer wisconsin (diagnostic) dataset perform linear discriminant analysis
# Dataset url = http://bit.ly/BreastCancerDataset
# ---
#
# OUR CODE GOES HERE