In this notebook my goal is to find the model that classifies patients most accurately into their diagnostic classes, using the attributes that describe each patient's biomechanical characteristics.
To do so, I first separate the predictors (the numerical attributes) from the target (the class to which the patient belongs).

The contents of this notebook will be:
- Exploratory Data Analysis
- Feature Extraction (PCA,LDA)
- Train_test_split and cross-validation methods
- Multinomial Logistic Regression
- Decision Tree
- K-Nearest Neighbors
- Learning curve
- Confusion Matrix
- Accuracy, Precision, Recall


In [None]:
import pandas as pd # Manipulation of tabular and sequential data
import numpy as np # Linear algebra
import seaborn as sns # High-level interface for statistical plots
import sklearn # Provides a large number of classification algorithms
import matplotlib.pyplot as plt # 2D plotting
%matplotlib inline


#  Exploratory Data Analysis

In [None]:
df=pd.read_csv('../input/column_3C_weka.csv')
df.head()

A first look shows that the dataset has 310 entries and 7 columns: 6 numerical columns with the biomechanical characteristics of the patient and one categorical column with the class assigned to the patient. Note that no patient has missing data.

In [None]:
df.info()

With the describe() function I obtain a summary of the main statistical measures of each numerical column.

In [None]:
df.describe()

I separate the predictors from the response.

In [None]:
X=df.drop(['class'],axis=1) #Predictors

Y=df['class'].values #Response
#print(Y)
X.head()
#print(len(X))

Let's now look at these numerical results by graphing them through box-plots:

In [None]:
plt.figure(figsize=(10,6))
sns.boxplot(data=X,orient='v')
plt.show()


All distributions contain outliers. The most evident is the maximum value of the 'degree_spondylolisthesis' attribute, which lies very far from every other observation. Looking more closely at that patient, I noticed values that are quite strange compared to the other patients in the same class, so, assuming a transcription error during the collection of the dataset, I decided to remove the record.

In [None]:
x=df[df['degree_spondylolisthesis']>100]
x.head(5)

In [None]:
x=df[df['degree_spondylolisthesis']<400]
X=x.drop(['class'],axis=1) #Predictors
Y=x['class'].values #Response
X.head()
#Now I have 309 records instead of 310
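The outlier above was picked by eye and with a fixed threshold. As a complement, here is a minimal sketch of Tukey's rule, the same 1.5 x IQR criterion that box-plots use to draw individual points. The function name `iqr_outlier_mask` and the sample values are my own assumptions for illustration; they are not taken from the dataset.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask that is True where a value lies outside
    [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up values, only to show the rule at work
sample = np.array([2.0, 5.0, 7.0, 9.0, 12.0, 15.0, 418.0])
mask = iqr_outlier_mask(sample)
print(sample[mask])  # values flagged as outliers
```

On the real data the same mask could be computed column by column, but flagging a point is not the same as deciding to drop it: that decision still needs a look at the record itself.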

Deleting the anomalous value gives a better view of the graphical representation:

In [None]:
plt.figure(figsize=(12,6))
sns.boxplot(data=X,orient='v')
plt.show()


The distances between the quartiles and the median show that 'degree_spondylolisthesis' has a clearly positively skewed (right-skewed) distribution compared with the other attributes: its median is smaller than its mean. Its standard deviation is also very high, which indicates that the values are widely dispersed around the mean.
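To make the "median smaller than mean" argument concrete, the sketch below computes the mean, the median and the sample skewness (third standardized moment) of a right-skewed sample. The data are synthetic (an exponential sample chosen by me as a stand-in), so the numbers say nothing about the real attribute.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic right-skewed sample, standing in for a skewed attribute
sample = rng.exponential(scale=10.0, size=5000)

mean, median = sample.mean(), np.median(sample)
# Sample skewness: mean of the cubed standardized values
skewness = np.mean(((sample - mean) / sample.std()) ** 3)

print(f"mean={mean:.2f} median={median:.2f} skewness={skewness:.2f}")
```

A positive skewness together with mean > median is the pattern described above; applying the same three lines to a column of `X` would give the corresponding figures for the dataset.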

From the following graph we can also see how the values of the biomechanical variables for each class vary.

In [None]:
#Create boxplot for each variable
plt.figure(figsize=(20,10))
for idx, var in enumerate(X): # idx instead of id, which would shadow a Python built-in
    plt.subplot(2,3,idx+1)
    sns.boxplot(x='class', y=var, data=x)

The distribution of patients by class shows 60 (19.4%) patients diagnosed with Hernia, 149 (48.2%) with Spondylolisthesis and 100 (32.4%) with a Normal spinal column.

In [None]:
plt.figure(figsize=(7,4))
sns.countplot(x='class', data=x) # x is the dataset without the outlier (309 records)
plt.show()
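The counts and percentages can also be obtained numerically. This is a small sketch that rebuilds a label column from the class counts quoted in the text (60, 149, 100) instead of reading the CSV, so it runs on its own; on the real data the same calls would be applied to the 'class' column.

```python
import pandas as pd

# Labels rebuilt from the class counts quoted in the text (outlier removed)
labels = pd.Series(['Hernia'] * 60 + ['Spondylolisthesis'] * 149 + ['Normal'] * 100)

counts = labels.value_counts()
shares = (labels.value_counts(normalize=True) * 100).round(1)
summary = pd.DataFrame({'count': counts, 'percent': shares})
print(summary)
```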

Setting the classes aside, let us now look at the correlation between the variables, to get an idea of how each variable is linked to the others.

In [None]:
fig, ax = plt.subplots(figsize=(12,6)) 
sns.heatmap(X.corr(), annot=True, fmt=".2f",linewidths=.5,ax=ax)
plt.show()


The following pairplot shows the pairwise relationships between the biomechanical variables, with the points colored by class:

In [None]:
g = sns.pairplot(df, hue='class')

# Principal Component Analysis

In [None]:
#Standardize the data 
from sklearn.preprocessing import StandardScaler
X_std=StandardScaler().fit_transform(X)
scaled_df = pd.DataFrame(X_std, columns=['pelvic_incidence', 'pelvic_tilt', 'lumbar_lordosis_angle','sacral_slope','pelvic_radius','degree_spondylolisthesis'])
#print(np.std(X_std))
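Standardization replaces each value x with z = (x - mean) / std, column by column. The sketch below checks, on synthetic data of my own choosing, that this formula with the population standard deviation (ddof=0) reproduces what StandardScaler returns.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Synthetic data: 3 features with very different locations and scales
data = rng.normal(loc=[50.0, 120.0, 5.0], scale=[10.0, 15.0, 30.0], size=(200, 3))

# z = (x - mean) / std, using the population std (ddof=0)
manual = (data - data.mean(axis=0)) / data.std(axis=0)
scaled = StandardScaler().fit_transform(data)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```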

In [None]:
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(14, 6))
ax1.set_title('Before standardization')
sns.kdeplot(df['pelvic_incidence'], ax=ax1)
sns.kdeplot(df['pelvic_tilt'], ax=ax1)
sns.kdeplot(df['lumbar_lordosis_angle'], ax=ax1)
sns.kdeplot(df['sacral_slope'], ax=ax1)
sns.kdeplot(df['pelvic_radius'], ax=ax1)
sns.kdeplot(df['degree_spondylolisthesis'], ax=ax1)
ax1.set_ylim(0,0.05)
ax2.set_title('After standardization')
sns.kdeplot(scaled_df['pelvic_incidence'], ax=ax2)
sns.kdeplot(scaled_df['pelvic_tilt'], ax=ax2)
sns.kdeplot(scaled_df['lumbar_lordosis_angle'], ax=ax2)
sns.kdeplot(scaled_df['sacral_slope'], ax=ax2)
sns.kdeplot(scaled_df['pelvic_radius'], ax=ax2)
sns.kdeplot(scaled_df['degree_spondylolisthesis'], ax=ax2)
ax2.set_ylim(0,0.6)

plt.show()


In [None]:
cov_mat=np.cov(X_std.T) # with np.cov I compute the covariance matrix of the standardized data
eigen_vals,eigen_vecs=np.linalg.eig(cov_mat) # with linalg.eig I compute the eigenvalues and eigenvectors
print('\nEigenvalues \n%s' % eigen_vals)
print('\nEigenvectors \n%s' % eigen_vecs)
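Since a covariance matrix is symmetric, `np.linalg.eigh` is a possible alternative to `np.linalg.eig`: it returns real eigenvalues, in ascending order. The sketch below, on synthetic data standing in for `X_std`, sorts them in descending order and checks the defining relation C v = lambda v for every eigenvector.

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(size=(100, 4))           # synthetic stand-in for X_std
cov = np.cov(data.T)

# eigh assumes a symmetric matrix and returns eigenvalues in ascending order
vals, vecs = np.linalg.eigh(cov)
order = np.argsort(vals)[::-1]             # indices for descending order
vals, vecs = vals[order], vecs[:, order]

# Each column v of vecs satisfies cov @ v = lambda * v
print(np.allclose(cov @ vecs, vecs * vals))
```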


In [None]:
tot=sum(eigen_vals)
var_exp=[(i/tot)*100 for i in sorted(eigen_vals,reverse=True)]
cum_var_exp=np.cumsum(var_exp)
#print(len(var_exp))
plt.figure(figsize=(10,5))
plt.bar(range(6),var_exp,alpha=0.5,align='center',label='individual explained variance')
plt.step(range(6),cum_var_exp,where='mid',label='cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')
plt.legend(loc='best')
plt.tight_layout()
    

In [None]:
print("Variance explained by each principal component (%):")
print(var_exp)
print("\n")
print("Cumulative variance explained (%):")
print(cum_var_exp)
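The explained variance of a component is its eigenvalue divided by the sum of all eigenvalues. As a sanity check, this sketch compares that ratio with the `explained_variance_ratio_` attribute of scikit-learn's PCA. The data are synthetic and only stand in for `X_std`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
# Synthetic correlated data, standardized like X_std
data = rng.normal(size=(150, 4)) @ rng.normal(size=(4, 4))
data = (data - data.mean(axis=0)) / data.std(axis=0)

vals = np.linalg.eigvalsh(np.cov(data.T))[::-1]   # eigenvalues, descending
ratio_manual = vals / vals.sum()
ratio_sklearn = PCA().fit(data).explained_variance_ratio_

print(np.allclose(ratio_manual, ratio_sklearn))
print(np.cumsum(ratio_manual))                    # cumulative ratio, ends at 1
```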

In [None]:
#I build the list of (eigenvalue,eigenvector) tuples
eigen_pairs=[(np.abs(eigen_vals[i]),eigen_vecs[:,i]) for i in range(len(eigen_vals))]
# I sort the (eigenvalue,eigenvector) tuples in descending order
eigen_pairs.sort(key=lambda x:x[0],reverse=True)

print('Eigenvalues in descending order:')
for i in eigen_pairs:
    print(i[0])

In [None]:
#I choose the first two principal components because I want to plot the data in 2 dimensions
matrix_w=np.hstack((eigen_pairs[0][1][:,np.newaxis],
                   eigen_pairs[1][1][:,np.newaxis],
                   ))
print('Matrix W:\n',matrix_w)
print(np.shape(matrix_w))
print(np.shape(X_std))

In [None]:
#I transform the original 309x6 matrix into a 309x2 data matrix
X_std_pca=X_std.dot(matrix_w) # (309x6) dot (6x2) = (309x2)
#print(X_std_pca)

The plot of the data on the first two principal components is the following:

In [None]:
colors=['blue','green','orange']
markers=['s','^','o']
plt.figure(figsize=(6,4))
for lab,col,m in zip(np.unique(Y),colors,markers):
    plt.scatter(X_std_pca[Y==lab,0],X_std_pca[Y==lab,1]*-1,label=lab,c=col,marker=m,alpha=0.8)
    
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()

In [None]:
#I implement PCA using sklearn
from sklearn.decomposition import PCA as sklearnPCA
pca = sklearnPCA(n_components=2)
T=pca.fit_transform(X_std)#Fit the model with X and apply the dimensionality reduction on X.
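A manual projection and the scikit-learn one should agree, except that each component may come out with the opposite sign, because an eigenvector multiplied by -1 is still an eigenvector. This is why an axis is sometimes flipped by hand when plotting. A minimal check on synthetic data (my own stand-in for `X_std`):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
data = rng.normal(size=(120, 5)) @ rng.normal(size=(5, 5))   # synthetic
data = (data - data.mean(axis=0)) / data.std(axis=0)

vals, vecs = np.linalg.eigh(np.cov(data.T))
w = vecs[:, np.argsort(vals)[::-1][:2]]      # top-2 eigenvectors as columns
manual = data @ w
auto = PCA(n_components=2).fit_transform(data)

# Compare column by column, allowing each component to be flipped
same_up_to_sign = all(
    np.allclose(manual[:, j], auto[:, j]) or np.allclose(manual[:, j], -auto[:, j])
    for j in range(2)
)
print(same_up_to_sign)
```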

In [None]:
import math

def get_important_features(transformed_features, components_, columns):
    """
    Print the original features ranked by "importance", i.e. by the
    length of their loading vector in the plane of the first two
    principal components.
    """
    num_columns = len(columns) # use the argument, not the global X

    # Scale the principal components by the max value in
    # the transformed set belonging to that component
    xvector = components_[0] * max(transformed_features[:,0])
    yvector = components_[1] * max(transformed_features[:,1])

    # Sort each column by its length. These are your *original*
    # columns, not the principal components.
    important_features = { columns[i] :round( math.sqrt(xvector[i]**2 + yvector[i]**2),2) for i in range(num_columns) }
    important_features = sorted(zip(important_features.values(), important_features.keys()), reverse=True)
    print ("Features by importance:\n\n",important_features)

get_important_features(T, pca.components_, X.columns.values)

In [None]:
def draw_vectors(transformed_features, components_, columns):
    """
    This function projects the *original* features onto the
    principal component feature-space, so that you can visualize
    how "important" each one is in the first two components
    """

    num_columns = len(columns) # use the argument, not the global X

    # Scale the principal components by the max value in
    # the transformed set belonging to that component
    xvector = components_[0] * max(transformed_features[:,0])
    yvector = components_[1] * max(transformed_features[:,1])

    ax = plt.axes()


    for i in range(num_columns):
        # Use an arrow to project each original feature as a
        # labeled vector on the principal component axes
        plt.arrow(0, 0, xvector[i], yvector[i], color='black', width=0.0005, head_width=0.02, alpha=0.75)
        plt.text(xvector[i]*1.2, yvector[i]*1.2, list(columns)[i], color='black', alpha=0.9)

    return ax

In [None]:
plt.figure(figsize=(14,7))
ax = draw_vectors(T, pca.components_, X.columns.values)
T_df = pd.DataFrame(T)
T_df.columns = ['component1', 'component2']
T_df=T_df.values
colors=['blue','green','orange']
markers=['s','^','o']

for lab,col,m in zip(np.unique(Y),colors,markers):
    plt.scatter(T_df[Y==lab,0],T_df[Y==lab,1],label=lab,c=col,marker=m,alpha=0.6)

plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')

plt.show()
#print(T_df)

In the following plot, the faint black crosses are the original standardized data (first two attributes), while the colored points are the same records reconstructed from the first two principal components with inverse_transform.

In [None]:
X_new = pca.inverse_transform(T) # back-projection of the 2 components into the 6-dimensional attribute space
colors=['blue','green','orange']
markers=['s','^','o']
plt.figure(figsize=(14,7))
plt.scatter(X_std[:, 0], X_std[:, 1], alpha=0.2,c='black',marker='x') # original standardized data
for lab,col,m in zip(np.unique(Y),colors,markers):
    plt.scatter(X_new[Y==lab,0],X_new[Y==lab,1],label=lab,c=col,marker=m,alpha=0.9)


# Linear Discriminant Analysis

In [None]:
from sklearn.preprocessing import LabelEncoder

enc=LabelEncoder()
label_encoder=enc.fit(Y)
Enc_y=label_encoder.transform(Y)+1
label_dict={1:'Hernia',2:'Normal',3:'Spondylolisthesis'}
np.set_printoptions(precision=4)
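mean_vecs=[]
for cl in range(1,4):
    # class means computed on the standardized data, the same space in which the scatter matrices are built
    mean_vecs.append(np.mean(X_std[Enc_y==cl],axis=0))
    print('Mean vector class %s:\n%s\n'%(label_dict[cl],mean_vecs[cl-1]))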

In [None]:
d=6
S_W=np.zeros((d,d)) # within-class scatter matrix: sum of the per-class scatter matrices (a covariance matrix is a scatter matrix normalized by the number of samples)
for cl,mv in zip(range(1,4),mean_vecs):
    class_sc_mat=np.zeros((d,d))
    mv=np.asarray(mv).reshape(d,1) # class mean as a column vector, reshaped once per class
    for row in X_std[Enc_y==cl]:
        row=row.reshape(d,1)
        class_sc_mat+=(row-mv).dot((row-mv).T)
    S_W+=class_sc_mat
print('Within-class scatter matrix: %sx%s' % (S_W.shape[0],S_W.shape[1]))
print('within-class scatter matrix:\n',S_W)



In [None]:
#Compute the between-class scatter matrix
mean_overall=np.mean(X_std,axis=0)
d=6
S_B=np.zeros((d,d))#between class scatter matrix
for i,mean_vec in enumerate(mean_vecs):
    n=X_std[Enc_y==i+1,:].shape[0]
    mean_vec=np.asarray(mean_vec).reshape(d,1) # column vector (works for arrays and for pandas Series)
    mean_overall=mean_overall.reshape(d,1)
    S_B+=n*(mean_vec-mean_overall).dot((mean_vec-mean_overall).T)
print('Between-class scatter matrix: %sx%s' % (S_B.shape[0],S_B.shape[1]))
print('Between-class scatter matrix:\n',S_B)
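A useful check on the two scatter matrices is the identity S_T = S_W + S_B, where S_T is the total scatter around the overall mean. It only holds when class means, overall mean and samples all live in the same space (for example, all standardized). The sketch below verifies it on synthetic 3-class data of my own making and also prints the rank of S_B, which is at most (number of classes - 1).

```python
import numpy as np

rng = np.random.default_rng(5)
# Synthetic 3-class data with 4 features (not the notebook's data)
features = np.vstack([rng.normal(loc=m, size=(40, 4)) for m in (0.0, 1.5, 3.0)])
classes = np.repeat([0, 1, 2], 40)

overall_mean = features.mean(axis=0)
s_w = np.zeros((4, 4))
s_b = np.zeros((4, 4))
for c in np.unique(classes):
    block = features[classes == c]
    mean_c = block.mean(axis=0)
    centered = block - mean_c
    s_w += centered.T @ centered                   # within-class scatter
    diff = (mean_c - overall_mean).reshape(-1, 1)
    s_b += len(block) * (diff @ diff.T)            # between-class scatter

total = (features - overall_mean).T @ (features - overall_mean)
print(np.allclose(total, s_w + s_b))
print(np.linalg.matrix_rank(s_b))
```

The rank bound is the reason why, with three classes, only two linear discriminants carry discriminative information.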




In [None]:
eigen_vals,eigen_vecs=np.linalg.eig(np.linalg.inv(S_W).dot(S_B))


In [None]:
#I build the list of (eigenvalue,eigenvector) tuples
eigen_pairs=[(np.abs(eigen_vals[i]),eigen_vecs[:,i]) for i in range(len(eigen_vals))]
# I sort the (eigenvalue,eigenvector) tuples in descending order
eigen_pairs=sorted(eigen_pairs,key=lambda x:x[0],reverse=True)
# print the sorted eigenvalues
print('Eigenvalues in descending order:')
for i in eigen_pairs:
    print(i[0])

In [None]:
print('Discriminability ratio:\n')
eigv_sum=sum(eigen_vals)
for i,j in enumerate(eigen_pairs):
    print('Linear discriminant {0:}:{1:.2%}'.format(i+1,(j[0]/eigv_sum).real))

In [None]:
tot=sum(eigen_vals.real)
discr=[(i/tot) for i in sorted(eigen_vals.real,reverse=True)]
cum_discr=np.cumsum(discr)

plt.figure(figsize=(10,6))
plt.bar(range(1,7),discr,alpha=0.5,align='center',label='individual discriminability')
plt.step(range(1,7),cum_discr,where='mid',label='cumulative discriminability')
plt.ylabel('discriminability ratio')
plt.xlabel('Linear discriminant')
plt.legend(loc='upper right')
plt.tight_layout()
    

In [None]:
print('Cumulative discriminability: ',cum_discr)

In [None]:
w=np.hstack((eigen_pairs[0][1][:,np.newaxis],
            eigen_pairs[1][1][:,np.newaxis],
            #eigen_pairs[2][1][:,np.newaxis]
            ))
print('Matrix W:\n',w)

The following scatter plot represents the data using the first two linear discriminants. We can observe that LDA separates the classes better than PCA.

In [None]:
X_std_lda=X_std.dot(w.real)
colors=['blue','green','orange']
markers=['s','^','o']
plt.figure(figsize=(10,6))
for lab,col,m in zip(np.unique(Y),colors,markers):
    plt.scatter(X_std_lda[Y==lab,0]*-1,X_std_lda[Y==lab,1],label=lab,c=col,marker=m,alpha=0.8)
plt.xlabel('LD1')
plt.ylabel('LD2')
plt.legend(loc='upper right')
plt.tight_layout()
plt.show() 


In [None]:
#Implementation of LDA using sklearn library
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
sklearn_lda=LDA(n_components=2)
X_lda_sklearn=sklearn_lda.fit_transform(X,Y)

In [None]:
def plot_scikit_lda(X,title):
    plt.figure(figsize=(10,6))
    ax=plt.gca() # axes of the figure just created (ax was undefined inside the function)
    for label,marker,color in zip(
    np.unique(Y),('s','^','o'),('blue','green','orange')):
        plt.scatter(x=X[:,0][Y==label],
                    y=X[:,1][Y==label],
                    label=label,
                    marker=marker,
                    color=color,
                    alpha=0.8)
    plt.xlabel('LD1')
    plt.ylabel('LD2')
    leg= plt.legend(loc='upper right', fancybox=True)

    # plt.title(title)

    # hide the tick marks but keep the tick labels (booleans instead of the old "on"/"off" strings)
    plt.tick_params(axis="both",which="both",bottom=False,top=False,labelbottom=True,left=False,right=False,labelleft=True)

    # remove the frame around the plot
    ax.spines["top"].set_visible(False)
    ax.spines["right"].set_visible(False)
    ax.spines["bottom"].set_visible(False)
    ax.spines["left"].set_visible(False)

    # plt.grid()
    plt.tight_layout() # the call was missing its parentheses
    plt.show()

In [None]:
plot_scikit_lda(X_lda_sklearn, title='Default LDA via scikit-learn')

# Train_test_split method

In [None]:
from sklearn.model_selection import train_test_split,StratifiedKFold
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.2,random_state=42,stratify=Y)
print('X_train dim:', x_train.shape)
print('X_test dim:',x_test.shape)
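The `stratify=Y` argument asks train_test_split to keep the class proportions (almost) identical in the training and test sets. A small sketch of this effect, using labels rebuilt from the class counts quoted earlier and a dummy feature column that I made up so the example runs on its own:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels rebuilt from the class counts quoted earlier; the feature is a dummy
y_demo = np.array(['Hernia'] * 60 + ['Spondylolisthesis'] * 149 + ['Normal'] * 100)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

def shares(labels):
    """Fraction of samples per class."""
    names, counts = np.unique(labels, return_counts=True)
    return dict(zip(names, (counts / counts.sum()).round(3)))

print(shares(y_tr))
print(shares(y_te))
```

Without stratification the smallest class (Hernia) could end up noticeably over- or under-represented in a test set of about 60 records.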


In [None]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn import svm

The following DataFrame contains the train_score and test_score of the different classifiers, obtained with the train_test_split method.

In [None]:

# (name, estimator) pairs, so that every row is labeled by the same code path
named_models=[('MultinomialLogisticRegression',LogisticRegression(multi_class='multinomial',solver='newton-cg')),
              ('LogisticRegression',LogisticRegression()),
              ('DecisionTreeClassifier (max_depth=3)',DecisionTreeClassifier(max_depth=3)),
              ('DecisionTreeClassifier (max_depth=5)',DecisionTreeClassifier(max_depth=5)),
              ('KNeighborsClassifier (k=3)',KNeighborsClassifier(n_neighbors=3)),
              ('KNeighborsClassifier (k=8)',KNeighborsClassifier(n_neighbors=8)),
              ('KNeighborsClassifier (k=16)',KNeighborsClassifier(n_neighbors=16))]
models=[model for _,model in named_models]
classifier_comparison=pd.DataFrame(columns=['Classifier','train_score','test_score'])
for i,(name,model) in enumerate(named_models):
    model.fit(x_train,y_train)
    classifier_comparison.loc[i,'Classifier']=name
    classifier_comparison.loc[i,'train_score']=model.score(x_train,y_train)
    classifier_comparison.loc[i,'test_score']=model.score(x_test,y_test)
dtree=models[2] # fitted decision tree with max_depth=3
classifier_comparison
    

# K-fold cross-validation method

The following DataFrame contains the mean train_score and test_score of the different classifiers, obtained with stratified 10-fold cross-validation.

In [None]:
# (name, estimator) pairs, so that every row is labeled by the same code path
named_models=[('MultinomialLogisticRegression',LogisticRegression(multi_class='multinomial',solver='newton-cg')),
              ('LogisticRegression',LogisticRegression()),
              ('DecisionTreeClassifier (max_depth=3)',DecisionTreeClassifier(max_depth=3)),
              ('DecisionTreeClassifier (max_depth=5)',DecisionTreeClassifier(max_depth=5)),
              ('KNeighborsClassifier (k=3)',KNeighborsClassifier(n_neighbors=3)),
              ('KNeighborsClassifier (k=8)',KNeighborsClassifier(n_neighbors=8)),
              ('KNeighborsClassifier (k=16)',KNeighborsClassifier(n_neighbors=16))]
# random_state only has an effect together with shuffle=True (recent scikit-learn versions reject it otherwise)
kfold=StratifiedKFold(n_splits=10,shuffle=True,random_state=42)
classifier_comparison=pd.DataFrame(columns=['Classifier','train_score','test_score'])
for i,(name,model) in enumerate(named_models):
    # return_train_score=True is needed to obtain 'train_score' in recent scikit-learn versions
    cv_result=cross_validate(model,X,Y,cv=kfold,scoring='accuracy',return_train_score=True)
    classifier_comparison.loc[i,'Classifier']=name
    classifier_comparison.loc[i,'train_score']=cv_result['train_score'].mean()
    classifier_comparison.loc[i,'test_score']=cv_result['test_score'].mean()
classifier_comparison


In [None]:
#complexity of the model using K-NN
neig = np.arange(1, 20)
train_accuracy = []
test_accuracy = []
train = np.array(X)
labels = np.array(Y)
n_folds = kfold.get_n_splits()
# Loop over different values of k to find the k with the best accuracy
for k in neig:
    knn = KNeighborsClassifier(n_neighbors=k)
    test_sum=0
    train_sum=0
    for train_index, test_index in kfold.split(train,labels):
        X_train, X_test = train[train_index], train[test_index]
        y_train, y_test = labels[train_index], labels[test_index]
        knn.fit(X_train,y_train)
        test_sum+= knn.score(X_test,y_test)
        train_sum+=knn.score(X_train,y_train)
    # mean train accuracy over the folds
    train_accuracy.append(train_sum/n_folds)
    # mean test accuracy over the folds
    test_accuracy.append(test_sum/n_folds)

# Plot
plt.figure(figsize=[8,4])
plt.plot(neig, test_accuracy, label = 'Testing Accuracy')
plt.plot(neig, train_accuracy, label = 'Training Accuracy')
plt.legend()
plt.title('k-value vs accuracy')
plt.xlabel('Number of Neighbors')
plt.ylabel('Accuracy')
plt.xticks(neig)
plt.savefig('graph.png')
plt.show()
print("Best accuracy is {} with K = {}".format(np.max(test_accuracy),neig[np.argmax(test_accuracy)]))
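The same search over k can be written more compactly with scikit-learn's `validation_curve`, which runs the cross-validation loop for every value of one hyperparameter. This is only a sketch: the data come from `make_classification` (a synthetic stand-in for X and Y, my assumption), so the selected k is not a result for the real dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, validation_curve
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class problem standing in for (X, Y)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, n_informative=4,
                                     n_classes=3, random_state=0)
k_values = np.arange(1, 20)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# One row per value of k, one column per fold
train_scores, test_scores = validation_curve(
    KNeighborsClassifier(), X_demo, y_demo,
    param_name='n_neighbors', param_range=k_values, cv=cv, scoring='accuracy')

mean_test = test_scores.mean(axis=1)
best_k = k_values[np.argmax(mean_test)]
print(best_k, mean_test.max())
```

With k=1 the training accuracy is 1.0 by construction (each point is its own nearest neighbor), which is why the training curve alone cannot be used to choose k.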


I now check the models' performance on the data to which I applied PCA.

In [None]:
# (name, estimator) pairs, so that every row is labeled by the same code path
named_models=[('MultinomialLogisticRegression',LogisticRegression(multi_class='multinomial',solver='newton-cg')),
              ('LogisticRegression',LogisticRegression()),
              ('DecisionTreeClassifier (max_depth=3)',DecisionTreeClassifier(max_depth=3)),
              ('DecisionTreeClassifier (max_depth=5)',DecisionTreeClassifier(max_depth=5)),
              ('KNeighborsClassifier (k=3)',KNeighborsClassifier(n_neighbors=3)),
              ('KNeighborsClassifier (k=8)',KNeighborsClassifier(n_neighbors=8)),
              ('KNeighborsClassifier (k=16)',KNeighborsClassifier(n_neighbors=16))]
# random_state only has an effect together with shuffle=True (recent scikit-learn versions reject it otherwise)
kfold=StratifiedKFold(n_splits=10,shuffle=True,random_state=42)
classifier_comparison=pd.DataFrame(columns=['Classifier','train_score','test_score'])
for i,(name,model) in enumerate(named_models):
    # return_train_score=True is needed to obtain 'train_score' in recent scikit-learn versions
    cv_result=cross_validate(model,X_std_pca,Y,cv=kfold,scoring='accuracy',return_train_score=True)
    classifier_comparison.loc[i,'Classifier']=name
    classifier_comparison.loc[i,'train_score']=cv_result['train_score'].mean()
    classifier_comparison.loc[i,'test_score']=cv_result['test_score'].mean()
classifier_comparison






I now check the models' performance on the data to which I applied LDA.

In [None]:
# (name, estimator) pairs, so that every row is labeled by the same code path
named_models=[('MultinomialLogisticRegression',LogisticRegression(multi_class='multinomial',solver='newton-cg')),
              ('LogisticRegression',LogisticRegression()),
              ('DecisionTreeClassifier (max_depth=3)',DecisionTreeClassifier(max_depth=3)),
              ('DecisionTreeClassifier (max_depth=5)',DecisionTreeClassifier(max_depth=5)),
              ('KNeighborsClassifier (k=3)',KNeighborsClassifier(n_neighbors=3)),
              ('KNeighborsClassifier (k=8)',KNeighborsClassifier(n_neighbors=8)),
              ('KNeighborsClassifier (k=16)',KNeighborsClassifier(n_neighbors=16))]
# random_state only has an effect together with shuffle=True (recent scikit-learn versions reject it otherwise)
kfold=StratifiedKFold(n_splits=10,shuffle=True,random_state=42)
classifier_comparison=pd.DataFrame(columns=['Classifier','train_score','test_score'])
for i,(name,model) in enumerate(named_models):
    # return_train_score=True is needed to obtain 'train_score' in recent scikit-learn versions
    cv_result=cross_validate(model,X_lda_sklearn,Y,cv=kfold,scoring='accuracy',return_train_score=True)
    classifier_comparison.loc[i,'Classifier']=name
    classifier_comparison.loc[i,'train_score']=cv_result['train_score'].mean()
    classifier_comparison.loc[i,'test_score']=cv_result['test_score'].mean()
classifier_comparison


Comparing the model results after applying PCA and after applying LDA, the results with LDA are much better than those with PCA. In particular, the results obtained after applying LDA are very similar to those obtained on the original dataset without any feature extraction technique.
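In the cells above, the projection is fitted once on all records before cross-validation, so every test fold has already contributed to it (and, for LDA, so have its labels). A possible variant, sketched here on synthetic data from `make_classification` (an assumption, not the notebook's dataset), refits the scaler and the LDA inside every fold by wrapping them in a pipeline with the classifier.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class problem standing in for (X, Y)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, n_informative=4,
                                     n_classes=3, random_state=0)

# Scaling and LDA are fitted on the training part of each fold only
pipe = make_pipeline(StandardScaler(),
                     LinearDiscriminantAnalysis(n_components=2),
                     KNeighborsClassifier(n_neighbors=8))
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_validate(pipe, X_demo, y_demo, cv=cv, scoring='accuracy',
                        return_train_score=True)
print(scores['train_score'].mean(), scores['test_score'].mean())
```

Whether this changes the comparison on the real data is something I have not measured here; the sketch only shows how the evaluation could be set up.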

The following is the graphical representation of the decision tree with maximum depth equal to three. From this representation we can see that the variable 'degree_spondylolisthesis' splits the dataset into two subsets: for 'degree_spondylolisthesis' <= 16.079 the patients mainly belong to the classes Hernia and Normal, while for 'degree_spondylolisthesis' > 16.079 they mainly belong to the class Spondylolisthesis. The Gini index makes it possible to analyze the impurity of each node of the tree; moreover, each node reports the number of patients that belong to each class.
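The Gini index of a node is 1 minus the sum of the squared class proportions: 0 for a pure node, larger when the classes are mixed. A short worked example, using the class counts of the whole dataset quoted earlier (the root of a tree fitted on a training fold will have slightly different counts) and a made-up pure leaf:

```python
import numpy as np

def gini_impurity(class_counts):
    """Gini impurity 1 - sum(p_k^2) for the class counts of one tree node."""
    counts = np.asarray(class_counts, dtype=float)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# Counts for Hernia, Normal, Spondylolisthesis over the 309 records
root = gini_impurity([60, 100, 149])
pure = gini_impurity([0, 0, 40])        # hypothetical pure leaf
print(round(root, 3), pure)
```

A split is chosen so that the weighted average impurity of the two children is as low as possible, which is why the first split isolates most of the Spondylolisthesis patients.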

In [None]:
from io import StringIO # sklearn.externals.six was removed from recent scikit-learn versions
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus
train = np.array(X)
labels = np.array(Y)
dtree=DecisionTreeClassifier(max_depth=3)
n_folds = kfold.get_n_splits()
test_sum=0
train_sum=0
for train_index, test_index in kfold.split(train,labels):
    X_train, X_test = train[train_index], train[test_index]
    y_train, y_test = labels[train_index], labels[test_index]
    #print("TRAIN:", train_index, "TEST:", test_index)
    dtree.fit(X_train,y_train)
    test_sum+= dtree.score(X_test,y_test)
    train_sum+=dtree.score(X_train,y_train)

print("train score: ",train_sum/n_folds)
print("test score: ",test_sum/n_folds)

# names of the 6 predictor columns; the exported tree is the one fitted on the last fold
labels=list(df.columns[:6])

dot_data = StringIO()
export_graphviz(dtree, out_file=dot_data,
                feature_names=labels,
                class_names=['Hernia', 'Normal', 'Spondylolisthesis'],
                filled=True, rounded=True,
                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())

# Learning curve

In [None]:
from sklearn.utils import shuffle
X_shuf, Y_shuf = shuffle(X, Y)


- Learning curve using Multinomial Logistic Regression classifier

In [None]:
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
cv=StratifiedKFold(n_splits=10)
pipe_lr=make_pipeline(StandardScaler(),LogisticRegression(multi_class='multinomial',solver='newton-cg',random_state=42))
train_sizes,train_scores,test_scores= learning_curve(estimator=pipe_lr,X=X_shuf,y=Y_shuf,train_sizes=np.linspace(0.1,1.0,10),cv=cv,n_jobs=1)
train_mean=np.mean(train_scores,axis=1)
train_std=np.std(train_scores,axis=1)
test_mean=np.mean(test_scores,axis=1)
test_std=np.std(test_scores,axis=1)
plt.plot(train_sizes,train_mean,color='blue',marker='o',markersize=5,label='training accuracy')
plt.fill_between(train_sizes,train_mean+train_std,train_mean-train_std,alpha=0.2,color='blue')
plt.plot(train_sizes,test_mean,color='black',linestyle='--',marker='s',markersize=5,label='validation accuracy')
plt.fill_between(train_sizes,test_mean+test_std,test_mean-test_std,alpha=0.2,color='black')
plt.grid()
plt.xlabel('Number of training samples')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.ylim([0.65,1.0])
plt.show()

From the above curve, we observe that the model already works really well when only 150 samples are used, and that its accuracy tends to increase as the size of the training set grows.

- Learning curve using Decision-Tree classifier with max depth equal to 3

In [None]:
pipe_lr1=make_pipeline(StandardScaler(),DecisionTreeClassifier(max_depth=3,random_state=1))
train_sizes,train_scores,test_scores= learning_curve(estimator=pipe_lr1,X=X_shuf,y=Y_shuf,train_sizes=np.linspace(0.1,1.0,10),cv=cv,n_jobs=1)
train_mean=np.mean(train_scores,axis=1)
train_std=np.std(train_scores,axis=1)
test_mean=np.mean(test_scores,axis=1)
test_std=np.std(test_scores,axis=1)
plt.plot(train_sizes,train_mean,color='blue',marker='o',markersize=5,label='training accuracy')
plt.fill_between(train_sizes,train_mean+train_std,train_mean-train_std,alpha=0.2,color='blue')
plt.plot(train_sizes,test_mean,color='black',linestyle='--',marker='s',markersize=5,label='validation accuracy')
plt.fill_between(train_sizes,test_mean+test_std,test_mean-test_std,alpha=0.2,color='black')
plt.grid()
plt.xlabel('Number of training samples')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.ylim([0.65,1.0])
plt.show()

The learning curve above, for the Decision Tree classifier with depth 3, shows that this classifier initially suffers from overfitting, as indicated by the marked gap between the training and validation accuracy curves. This gap narrows as the number of samples increases, and the accuracy reaches almost 85% once more than 250 samples are considered.
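
The gap between the two accuracy curves can also be read off numerically rather than by eye. The following is a minimal sketch, assuming score arrays shaped like the output of `learning_curve` (one row per training size, one column per fold). The values in `toy_train_scores` and `toy_test_scores` are made up for illustration and are not results from this notebook, and `accuracy_gap` is a hypothetical helper name.

```python
import numpy as np

def accuracy_gap(train_scores, test_scores):
    """Mean training accuracy minus mean validation accuracy, per training size."""
    train_mean = np.mean(train_scores, axis=1)
    test_mean = np.mean(test_scores, axis=1)
    return train_mean - test_mean

# Made-up scores for illustration only: 3 training sizes x 2 folds
toy_train_scores = np.array([[1.00, 0.98], [0.92, 0.90], [0.88, 0.86]])
toy_test_scores = np.array([[0.70, 0.72], [0.80, 0.78], [0.84, 0.82]])

gap = accuracy_gap(toy_train_scores, toy_test_scores)
print(np.round(gap, 2))
```

A gap that shrinks from one training size to the next corresponds to the narrowing between the two curves in the plot.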

- Learning curve using K-NN classifier with k=8

In [None]:
pipe_lr2=make_pipeline(StandardScaler(),KNeighborsClassifier(n_neighbors=8))
train_sizes,train_scores,test_scores= learning_curve(estimator=pipe_lr2,X=X_shuf,y=Y_shuf,train_sizes=np.linspace(0.1,1.0,10),cv=cv,n_jobs=1)
train_mean=np.mean(train_scores,axis=1)
train_std=np.std(train_scores,axis=1)
test_mean=np.mean(test_scores,axis=1)
test_std=np.std(test_scores,axis=1)
plt.plot(train_sizes,train_mean,color='blue',marker='o',markersize=5,label='training accuracy')
plt.fill_between(train_sizes,train_mean+train_std,train_mean-train_std,alpha=0.2,color='blue')
plt.plot(train_sizes,test_mean,color='black',linestyle='--',marker='s',markersize=5,label='validation accuracy')
plt.fill_between(train_sizes,test_mean+test_std,test_mean-test_std,alpha=0.2,color='black')
plt.grid()
plt.xlabel('Number of training samples')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.ylim([0.65,1.0])
plt.show()

The learning curve above shows that the K-Nearest Neighbors classifier achieves its highest accuracy at about 150 samples and then loses some accuracy. As the number of samples increases further, the training and validation accuracies seem to converge again. A larger dataset would have helped to remove this doubt.
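
The three learning-curve cells above repeat the same computation with a different estimator each time. As a possible way to factor that out, here is a minimal sketch. `learning_curve_summary` is a hypothetical helper name, and the Iris dataset from scikit-learn is used only as a stand-in so that the example runs on its own; it is not the dataset analyzed in this notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, learning_curve
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import shuffle

def learning_curve_summary(estimator, X, y, cv, n_points=10):
    """Return training sizes with mean/std of training and validation accuracy."""
    sizes, train_scores, test_scores = learning_curve(
        estimator=estimator, X=X, y=y,
        train_sizes=np.linspace(0.1, 1.0, n_points), cv=cv, n_jobs=1)
    return {
        'sizes': sizes,
        'train_mean': train_scores.mean(axis=1),
        'train_std': train_scores.std(axis=1),
        'test_mean': test_scores.mean(axis=1),
        'test_std': test_scores.std(axis=1),
    }

# Stand-in data (assumption): Iris, shuffled with a fixed seed for reproducibility
X_demo, y_demo = load_iris(return_X_y=True)
X_demo, y_demo = shuffle(X_demo, y_demo, random_state=0)

pipe_demo = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=8))
summary = learning_curve_summary(pipe_demo, X_demo, y_demo,
                                 cv=StratifiedKFold(n_splits=10))
print(summary['sizes'])
```

The returned dictionary holds the arrays that the plotting code above needs for `plt.plot` and `plt.fill_between`, so the same plotting lines could be reused for each classifier.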

# Confusion Matrix, Precision, Recall

Previously I evaluated my models using the accuracy metric.
Now, to measure the performance of each model, I use precision and recall, which I derive from the confusion matrix.
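
For reference, assuming a confusion matrix $C$ whose rows are the true classes and whose columns are the predicted classes (the convention used by scikit-learn's `confusion_matrix`), the per-class recall and precision for class $k$ can be written as:

```latex
\mathrm{Recall}_k = \frac{C_{kk}}{\sum_{j} C_{kj}},
\qquad
\mathrm{Precision}_k = \frac{C_{kk}}{\sum_{i} C_{ik}}
```

Recall divides the diagonal entry by its row sum (all patients who truly belong to class $k$), while precision divides it by its column sum (all patients predicted as class $k$).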

In [None]:
import itertools
def plot_confusion_matrix(cm, classes,normalize=False,cmap=plt.cm.Blues):
    """Plot the confusion matrix `cm` as a heat map, one cell per (true, predicted) pair.

    Set `normalize=True` to show each row as fractions of the true class.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
# Per-fold accuracy of the multinomial logistic regression
scores = cross_val_score(LogisticRegression(multi_class='multinomial',solver='newton-cg'), X, Y, cv=cv)
# Out-of-fold prediction for every patient
y_predL = cross_val_predict(LogisticRegression(multi_class='multinomial',solver='newton-cg'), X, Y, cv=cv)
# Confusion matrix: rows are true classes, columns are predicted classes
conf_matL = confusion_matrix(Y, y_predL)


In [None]:
class_names=['Hernia','Normal','Spondylolisthesis']
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(conf_matL, classes=class_names )

# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(conf_matL, classes=class_names, normalize=True )

plt.show()

In [None]:
recall=np.diag(conf_matL)/np.sum(conf_matL,axis=1)
precision=np.diag(conf_matL)/np.sum(conf_matL,axis=0)
comparison=pd.DataFrame(index=['Hernia','Normal','Spondylolisthesis'],columns=['Recall %','Precision %'])
comparison['Recall %']=recall*100
comparison['Precision %']=precision*100
comparison
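
As a sanity check on the diagonal-over-row-sum and diagonal-over-column-sum computation above, the same quantities can be compared with scikit-learn's `recall_score` and `precision_score` using `average=None`. This is a minimal sketch; the labels in `y_true_toy` and `y_pred_toy` are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

labels = ['Hernia', 'Normal', 'Spondylolisthesis']

# Toy labels, made up for illustration only (not the notebook's data)
y_true_toy = ['Hernia', 'Hernia', 'Normal', 'Normal', 'Normal',
              'Spondylolisthesis', 'Spondylolisthesis', 'Spondylolisthesis']
y_pred_toy = ['Hernia', 'Normal', 'Normal', 'Normal', 'Hernia',
              'Spondylolisthesis', 'Spondylolisthesis', 'Normal']

# Rows are true classes, columns are predicted classes
cm_toy = confusion_matrix(y_true_toy, y_pred_toy, labels=labels)

# Manual computation, as in the cell above
recall_manual = np.diag(cm_toy) / cm_toy.sum(axis=1)
precision_manual = np.diag(cm_toy) / cm_toy.sum(axis=0)

# Per-class values from scikit-learn (average=None returns one value per label)
recall_sk = recall_score(y_true_toy, y_pred_toy, labels=labels, average=None)
precision_sk = precision_score(y_true_toy, y_pred_toy, labels=labels, average=None)

print(np.allclose(recall_manual, recall_sk))
print(np.allclose(precision_manual, precision_sk))
```

When `labels` is not passed, `confusion_matrix` orders the classes by sorted label value, which for these class names gives Hernia, Normal, Spondylolisthesis.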

In [None]:
scores = cross_val_score(DecisionTreeClassifier(max_depth=3), X, Y, cv=cv)
y_predD = cross_val_predict(DecisionTreeClassifier(max_depth=3), X, Y, cv=cv)
conf_matD = confusion_matrix(Y, y_predD)

In [None]:
plt.figure()
plot_confusion_matrix(conf_matD, classes=class_names )

# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(conf_matD, classes=class_names, normalize=True )

plt.show()

In [None]:
recall=np.diag(conf_matD)/np.sum(conf_matD,axis=1)
precision=np.diag(conf_matD)/np.sum(conf_matD,axis=0)
#print(pd.DataFrame(recall,columns=['Recall']))
#print(pd.DataFrame(precision,columns=['Precision']))
comparison=pd.DataFrame(index=['Hernia','Normal','Spondylolisthesis'],columns=['Recall %','Precision %'])
comparison['Recall %']=recall*100
comparison['Precision %']=precision*100
comparison

In [None]:
scores = cross_val_score(KNeighborsClassifier(n_neighbors=8), X, Y, cv=cv)
y_predK = cross_val_predict(KNeighborsClassifier(n_neighbors=8), X, Y, cv=cv)
conf_matK = confusion_matrix(Y, y_predK)

In [None]:
plt.figure()
plot_confusion_matrix(conf_matK, classes=class_names )

# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(conf_matK, classes=class_names, normalize=True )

plt.show()

In [None]:
recall=np.diag(conf_matK)/np.sum(conf_matK,axis=1)
precision=np.diag(conf_matK)/np.sum(conf_matK,axis=0)
#print(pd.DataFrame(recall,columns=['Recall']))
#print(pd.DataFrame(precision,columns=['Precision']))
comparison=pd.DataFrame(index=['Hernia','Normal','Spondylolisthesis'],columns=['Recall %','Precision %'])
comparison['Recall %']=recall*100
comparison['Precision %']=precision*100
comparison

Among the feature extraction techniques used to reduce the dimensionality of the data, and given that my goal is to represent the data through two components, PCA gives a fair result while LDA gives a very good one.
Furthermore, the results obtained using only the first two components of the LDA are very close to those obtained on the original dataset without any feature extraction. Since the difference is small, it is convenient to reduce the dimensionality of the data with LDA, both because it improves the computational efficiency of the learning algorithm and because it can sometimes improve performance in the prediction phase.

However, I have found that patients labeled "Hernia" and "Normal" have similar characteristics, with no clear separation between the two classes, while patients labeled "Spondylolisthesis" are more clearly separated from the other two classes.
In general, for the analyzed dataset, the techniques used are quite effective: in most cases the accuracy is around 85%.

The generated models should then be able to predict the patient class with good accuracy for new records added to the dataset, provided they have the same structure, with the same attributes and classes.