In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

We defined header_list and passed it to read_csv as the column names so that feature names can be reported when necessary. We then set X to the feature columns and y to the target column. Finally, we split the data into 80/20 train-test sets, stratified on the target so that both sets keep the same class proportions.

In [22]:
# Column names for the 41 molecular descriptors plus the target label
header_list = ['SpMax_L', 'J_Dz(e)', 'nHM', 'F01[N-N]', 'F04[C-N]', 'NssssC', 'nCb-', 'C%', 'nCp', 'nO', 'F03[C-N]',
               'SdssC', 'HyWi_B(m)', 'LOC', 'SM6_L', 'F03[C-O]', 'Me', 'Mi', 'nN-N', 'nArNO2', 'nCRX3', 'SpPosA_B(p)',
               'nCIR', 'B01[C-Br]', 'B03[C-Cl]', 'N-073', 'SpMax_A', 'Psi_i_1d', 'B04[C-Br]', 'SdO', 'TI2_L', 'nCrt',
               'C-026', 'F02[C-N]', 'nHDon', 'SpMax_B(m)', 'Psi_i_A', 'nN', 'SM6_B(m)', 'nArCOOR', 'nX', 'TARGET']
df = pd.read_csv('BioDegData.csv', names=header_list)
X, y = df.iloc[:, 0:-1].values, df.iloc[:, -1].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)
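
As a quick illustration of what stratify=y does in the split above, the following sketch uses made-up labels (y_demo and X_demo are hypothetical, not the biodegradation data) and checks that the class proportion is preserved in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 66% class 0, 34% class 1 (not the real data)
y_demo = np.array([0] * 660 + [1] * 340)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

# Fraction of class 1 in each subset; stratification keeps these close to 0.34
train_ratio = y_tr.mean()
test_ratio = y_te.mean()
print(len(y_tr), len(y_te), train_ratio, test_ratio)
```

Without stratify, the two fractions can drift apart by chance, which matters more as the classes become more imbalanced.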

We standardized the features with StandardScaler before applying LDA. The scaler is fit on the training set only, and the same transformation is then applied to the test set.

In [23]:
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
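
The reason for calling fit_transform on the training set and only transform on the test set can be seen in a small sketch. The arrays below are synthetic stand-ins (assumed for illustration): after scaling, the training columns have mean 0 and standard deviation 1 exactly, while the test columns are only approximately standardized because they reuse the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, not the biodegradation features)
rng = np.random.default_rng(0)
X_tr_demo = rng.normal(loc=5.0, scale=3.0, size=(200, 4))
X_te_demo = rng.normal(loc=5.0, scale=3.0, size=(50, 4))

sc_demo = StandardScaler()
X_tr_scaled = sc_demo.fit_transform(X_tr_demo)  # learns mean_ and scale_ from train
X_te_scaled = sc_demo.transform(X_te_demo)      # reuses the training statistics

print(X_tr_scaled.mean(axis=0).round(6))  # numerically zero for every column
print(X_tr_scaled.std(axis=0).round(6))   # numerically one for every column
print(X_te_scaled.mean(axis=0).round(2))  # near zero, but generally not exact
```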

To implement Linear Discriminant Analysis, we used scikit-learn's LinearDiscriminantAnalysis class with n_components set to 1. LDA yields at most (number of classes - 1) components, so 1 is the only valid value for this two-class target. We then transformed the train and test data and fit a RandomForestClassifier on the result to obtain an accuracy score for LDA.

In [24]:
lda = LDA(n_components=1)
X_train = lda.fit_transform(X_train, y_train)
X_test = lda.transform(X_test)
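
A minimal sketch of the n_components limit, using synthetic two-class data (X_demo and y_demo are assumptions for illustration). With two classes, LDA projects onto a single discriminant axis, and recent scikit-learn versions reject a larger n_components with a ValueError.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# Synthetic two-class data (hypothetical): class 1 is shifted by one unit
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
y_demo = np.array([0] * 50 + [1] * 50)
X_demo[y_demo == 1] += 1.0

# Two classes give at most one discriminant component
Z = LDA(n_components=1).fit_transform(X_demo, y_demo)
print(Z.shape)

# Asking for more than n_classes - 1 components is expected to fail
try:
    LDA(n_components=2).fit(X_demo, y_demo)
    too_many_ok = True
except ValueError as err:
    too_many_ok = False
    print(err)
```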

In [25]:
classifier = RandomForestClassifier(max_depth=2, random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

In [26]:
cm = confusion_matrix(y_test, y_pred)
print(cm)
print('Accuracy: ' + str(accuracy_score(y_test, y_pred)))

[[115  25]
 [  7  64]]
Accuracy: 0.8483412322274881
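
The reported accuracy can be reproduced directly from the confusion matrix printed above, which also gives the per-class recall and precision. This sketch hard-codes the printed values and relies on scikit-learn's convention that rows are true labels and columns are predicted labels.

```python
import numpy as np

# Values copied from the printed confusion matrix (rows: true, columns: predicted)
cm = np.array([[115, 25],
               [7, 64]])

# Accuracy is the share of correct predictions: (115 + 64) / 211
accuracy = np.trace(cm) / cm.sum()

# Recall per class: correct predictions divided by the row total
recall_per_class = cm.diagonal() / cm.sum(axis=1)

# Precision per class: correct predictions divided by the column total
precision_per_class = cm.diagonal() / cm.sum(axis=0)

print(accuracy)
print(recall_per_class.round(4))
print(precision_per_class.round(4))
```

The first class has the lower recall (115 of 140) and the second class the lower precision (64 of 89), so most of the errors are first-class samples predicted as the second class.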


We found that the accuracy score with LDA was decent (about 0.85), but lower than what PCA achieved on this dataset. We therefore kept PCA as the feature-extraction step in our SVM model. Another disadvantage of LDA on a dataset like this is that it can be biased toward dominant features and does not perform well with skewed data. The feature "SpMax_B(m)" has by far the highest importance, so it is possible that LDA would overweight that one feature.
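
One way to make an LDA-versus-PCA comparison like this like-for-like is to hold the classifier fixed and swap only the reduction step. The sketch below is an assumption-laden illustration, not the comparison we ran: it uses synthetic data from make_classification, a hypothetical PCA(n_components=10), and the same shallow random forest for both pipelines, scored with 5-fold cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 41 features and a roughly 66/34 class split (hypothetical)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=41, n_informative=8,
    weights=[0.66, 0.34], random_state=1,
)

def build(reducer):
    # Scaling and reduction are refit inside each fold, which avoids leakage
    return make_pipeline(
        StandardScaler(),
        reducer,
        RandomForestClassifier(max_depth=2, random_state=0),
    )

scores = {
    "LDA": cross_val_score(build(LDA(n_components=1)), X_demo, y_demo, cv=5).mean(),
    "PCA": cross_val_score(build(PCA(n_components=10)), X_demo, y_demo, cv=5).mean(),
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Which method scores higher depends on the data, so the numbers from this synthetic example say nothing about the biodegradation dataset itself.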