# Explainable Breast Cancer Diagnosis
## via Logistic Regression and Decision Tree

Most machine learning models are considered black boxes, but in high-stakes settings such as breast cancer diagnosis we need to know how a classifier reaches its decisions.

Here, we experiment with explainable techniques and models. The goal is not the highest possible accuracy (which might mean reverting to black-box models) but an exploration of some options for interpretable machine learning and statistical analysis.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings('ignore')

In [None]:
df=pd.read_csv('../input/breast-cancer-wisconsin-data/data.csv')
df.head(3)

In [None]:
y=df['diagnosis'] #output labels
df.drop(columns=['Unnamed: 32','id'],inplace=True) #'Unnamed: 32' is all NaN; 'id' carries no diagnostic information

The features fall into three groups: 'mean', 'se', and 'worst'. We will draw a correlation heatmap for each group, drop the redundant columns, then draw a heatmap of the remaining features and drop any further redundant columns it reveals.

In [None]:
sns.heatmap(df.iloc[:,1:11].corr(),annot=True,fmt='.1g');

In [None]:
sns.heatmap(df.iloc[:,11:21].corr(),annot=True,fmt='.1g');

In [None]:
sns.heatmap(df.iloc[:,21:].corr(),annot=True,fmt='.1g');

We will drop one feature from every pair of features with a correlation coefficient >= 0.9, because such a pair is nearly the same column twice. Then we will draw a correlation heatmap for the remaining features and again drop one feature from each pair with correlation >= 0.9.
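
The heatmaps are read by eye in this notebook. As a complement, the rule above can be sketched in code. This is a minimal illustration on toy data, not the procedure used here: the helper `highly_correlated_pairs` and the toy columns are hypothetical names introduced for the example.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(frame, threshold=0.9):
    """List (feature_a, feature_b, |corr|) for every pair with |corr| >= threshold.

    Hypothetical helper, not part of the original notebook.
    """
    corr = frame.corr().abs()
    cols = corr.columns
    # Keep only the strict upper triangle so each pair is reported once
    # and the diagonal (always 1.0) is ignored.
    rows, columns = np.where(np.triu(corr.values >= threshold, k=1))
    return [(cols[i], cols[j], float(corr.iloc[i, j])) for i, j in zip(rows, columns)]

# Toy data (made up): perimeter is almost a linear function of radius,
# texture is independent of both.
rng = np.random.default_rng(0)
radius = rng.normal(14, 3, size=200)
toy = pd.DataFrame({
    'radius': radius,
    'perimeter': 2 * np.pi * radius + rng.normal(0, 0.5, size=200),
    'texture': rng.normal(19, 4, size=200),
})

pairs = highly_correlated_pairs(toy)
for a, b, value in pairs:
    print(a, b, round(value, 3))
```

On the real dataset one would pass the feature columns only (for example `df.iloc[:,1:]`) and then decide by hand which member of each pair to drop.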

In [None]:
df.drop(columns=['perimeter_mean','area_mean','compactness_mean','concave points_mean',
                      'perimeter_se','area_se',
                      'perimeter_worst','area_worst'],
                       inplace=True)
plt.figure(figsize=(14,10))
sns.heatmap(df.iloc[:,1:].corr(), annot=True, fmt='.1g');

In [None]:
df.drop(columns=['radius_worst','texture_worst','concavity_worst','concave points_worst'],
        inplace=True)

In [None]:
#how many features does the new dataset have?
print ('The resulting dataset has',df.shape[1]-1, 'features')

In [None]:
#logistic regression for feature selection
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

y=df['diagnosis']
enc=LabelEncoder()
y=enc.fit_transform(y.values)
x=df.drop(columns='diagnosis').values
x_tr,x_ts,y_tr,y_ts=train_test_split(x,y, random_state=7, test_size=0.2,stratify=y)
sc=StandardScaler()
x_tr_sc=sc.fit_transform(x_tr)
x_ts_sc=sc.transform(x_ts)

#after some experimentation we found this to be the best model
lr=LogisticRegression(C=1.0, random_state=7)
lr.fit(x_tr_sc,y_tr)
y_pred=lr.predict(x_ts_sc)
print('coefs', lr.coef_)
print('accuracy', accuracy_score(y_ts,y_pred))
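
The notebook does not say how the experiments behind `C=1.0` were run. One possible way to search over `C`, sketched under that caveat, is a cross-validated grid search on the training split. The grid values, `cv=5`, and `max_iter=1000` are assumptions, and the example uses scikit-learn's bundled copy of the dataset (all 30 features) rather than the reduced Kaggle CSV.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bundled copy of the Wisconsin dataset; column names differ from the CSV.
data = load_breast_cancer()
x_tr, x_ts, y_tr, y_ts = train_test_split(
    data.data, data.target, random_state=7, test_size=0.2, stratify=data.target)

# Scaling lives inside the pipeline so each CV fold is scaled on its own training part.
pipe = make_pipeline(StandardScaler(),
                     LogisticRegression(random_state=7, max_iter=1000))

# Candidate values for C are an assumption made for this sketch.
grid = GridSearchCV(pipe,
                    {'logisticregression__C': [0.01, 0.1, 1.0, 10.0, 100.0]},
                    cv=5, scoring='accuracy')
grid.fit(x_tr, y_tr)

best_c = grid.best_params_['logisticregression__C']
test_accuracy = grid.score(x_ts, y_ts)
print('best C:', best_c)
print('test accuracy:', test_accuracy)
```

Choosing `C` by cross-validation on the training data keeps the test split untouched until the final evaluation.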

In [None]:
#which are the most significant features, and how much does each contribute?
coefs=lr.coef_.ravel() #flatten (1, n_features) without hard-coding the feature count
for ind in coefs.argsort():
    print(df.columns[ind+1]) #+1 skips the 'diagnosis' column
    print(coefs[ind])
    print('')

In [None]:
#find the ten most useful features
ab=np.abs(coefs)
cols=df.columns[ab.argsort()[:-11:-1]+1]
cols

In [None]:
#train a model with only the best features

x=df[cols].values
x_tr,x_ts,y_tr,y_ts=train_test_split(x,y, random_state=7, test_size=0.2,stratify=y)
sc=StandardScaler()
x_tr_sc=sc.fit_transform(x_tr)
x_ts_sc=sc.transform(x_ts)


lr=LogisticRegression(C=10.0, random_state=7)
lr.fit(x_tr_sc,y_tr)
y_pred=lr.predict(x_ts_sc)
print('coefs', lr.coef_)
print('accuracy', accuracy_score(y_ts,y_pred))

from sklearn.metrics import confusion_matrix

plt.figure(figsize=(3,3))
sns.heatmap(confusion_matrix(y_ts,y_pred), annot=True, fmt='d');

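Running logistic regression with only the selected features yields a slight increase in accuracy. Note that this second model also uses C=10.0 instead of C=1.0, so the comparison is not strictly like-for-like. Next we check the selection visually by inspecting the distributions of the features.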
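
Because the inputs are standardized, each coefficient can also be read as an odds ratio: `exp(coef)` is the factor by which the odds of the positive class change when that feature increases by one standard deviation, with the others held fixed. The sketch below shows this on toy data. The column names `strong`, `weak`, and `noise` are made up for the example and do not come from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): one informative feature, one weak one, one pure noise.
rng = np.random.default_rng(7)
n = 400
toy = pd.DataFrame({
    'strong': rng.normal(size=n),
    'weak': rng.normal(size=n),
    'noise': rng.normal(size=n),
})
score = 3.0 * toy['strong'] + 0.5 * toy['weak'] + rng.logistic(size=n)
labels = (score > 0).astype(int).values

x_sc = StandardScaler().fit_transform(toy.values)
model = LogisticRegression(C=1.0, random_state=7).fit(x_sc, labels)

# One row per feature: raw coefficient and the corresponding odds ratio.
coef = pd.Series(model.coef_.ravel(), index=toy.columns)
summary = pd.DataFrame({'coef': coef, 'odds_ratio': np.exp(coef)})

# Order by absolute coefficient, largest first.
summary = summary.loc[coef.abs().sort_values(ascending=False).index]
print(summary)
```

An odds ratio above 1 pushes towards the positive class, below 1 towards the negative class, and a value near 1 means the feature contributes little.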

In [None]:
#make violin plots to visually evaluate features selected by lr
means=df.iloc[:,1:7]
ses=df.iloc[:,7:15]
worsts=df.iloc[:,15:]

means_sc=(means-means.mean())/(means.std())
ses_sc=(ses-ses.mean())/(ses.std())
worsts_sc=(worsts-worsts.mean())/(worsts.std())

means_sc=pd.concat([df['diagnosis'],means_sc],axis=1)
ses_sc=pd.concat([df['diagnosis'],ses_sc],axis=1)
worsts_sc=pd.concat([df['diagnosis'],worsts_sc],axis=1)

means_sc=pd.melt(means_sc, id_vars='diagnosis',
                 var_name='features',
                 value_name='value')
ses_sc=pd.melt(ses_sc, id_vars='diagnosis',
               var_name='features',
               value_name='value')
worsts_sc=pd.melt(worsts_sc, id_vars='diagnosis',
                  var_name='features',
                  value_name='value')

In [None]:
sns.violinplot(y='features',x='value', hue='diagnosis',
               data=means_sc, split=True);

In [None]:
sns.violinplot(y='features',x='value', hue='diagnosis',
               data=ses_sc, split=True);

In [None]:
sns.violinplot(y='features',x='value', hue='diagnosis', 
               data=worsts_sc, split=True);

Comparing the violin plots with the features selected by the logistic regression classifier, we see that the first few selected features are the ones whose distributions separate the two classes most clearly. The rest do not look as significant, but they still separate the classes better than the features that received weak coefficients. All in all, logistic regression seems to have given us the most significant features, the ones most helpful in determining the class.
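
The visual comparison can be backed by a simple number per feature. The sketch below computes the absolute difference between the class means in units of pooled standard deviation (a standardized mean difference). This is an added illustration, not something the notebook computes, and the helper `class_separation` and the toy columns are hypothetical.

```python
import numpy as np
import pandas as pd

def class_separation(frame, label_col='diagnosis'):
    """Absolute gap between the two class means, in pooled standard deviations.

    Hypothetical helper; assumes exactly two classes and numeric features.
    """
    groups = frame.groupby(label_col)
    means = groups.mean()
    # Simple average of the two per-class variances, then square root.
    pooled_sd = np.sqrt(groups.var().mean())
    gap = (means.iloc[0] - means.iloc[1]).abs()
    return (gap / pooled_sd).sort_values(ascending=False)

# Toy data (made up): one feature shifted by class, one identical across classes.
rng = np.random.default_rng(1)
n = 150
toy = pd.DataFrame({
    'diagnosis': ['B'] * n + ['M'] * n,
    'separated': np.concatenate([rng.normal(0, 1, n), rng.normal(3, 1, n)]),
    'overlapping': np.concatenate([rng.normal(0, 1, n), rng.normal(0, 1, n)]),
})

separation = class_separation(toy)
print(separation)
```

Features whose violins barely overlap should score high on this measure, and features with near-identical violins should score close to zero.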

Now we will find the specific values of these features that determine the class, using a decision tree.
We will also use the graphviz package to draw the reasoning process the fitted tree follows to reach its decisions.

In [None]:
from sklearn.tree import DecisionTreeClassifier

tree=DecisionTreeClassifier(max_depth=4)
tree.fit(x_tr,y_tr)
y_pred=tree.predict(x_ts)

print('accuracy:',accuracy_score(y_ts,y_pred))

In [None]:
plt.figure(figsize=(3,3))
sns.heatmap(confusion_matrix(y_ts,y_pred), annot=True, fmt='d');

In [None]:
enc.inverse_transform([0,1])

In [None]:
#create graph
from sklearn.tree import export_graphviz
import graphviz
from graphviz import Source

graph=Source(export_graphviz(tree,feature_names=df[cols].columns,
                   class_names=['B','M'],rounded=True,proportion = False, filled=True,precision=2))



display(graph)

Upon comparison with the violin plots we see that the decision tree picked good features and good split values for them. Its performance, however, was clearly suboptimal at 88% accuracy. Bear in mind that decision trees involve some randomness, so we may fit several trees and keep the best.
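
One way to handle that randomness, sketched here as an assumption rather than as the method used in this notebook, is to fix `random_state`, compare a few seeds by cross-validation on the training split, and print the chosen tree as text with `export_text`. The example uses scikit-learn's bundled copy of the dataset, so feature names differ from the CSV and the labels are encoded as 0 = malignant, 1 = benign.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Bundled dataset: target 0 is malignant, 1 is benign (unlike the LabelEncoder above).
data = load_breast_cancer()
x_tr, x_ts, y_tr, y_ts = train_test_split(
    data.data, data.target, random_state=7, test_size=0.2, stratify=data.target)

# Score a handful of seeds with 5-fold CV on the training split only,
# so the test split is not used to pick the tree. Five seeds is an arbitrary choice.
scores = {}
for seed in range(5):
    candidate = DecisionTreeClassifier(max_depth=4, random_state=seed)
    scores[seed] = cross_val_score(candidate, x_tr, y_tr, cv=5).mean()

best_seed = max(scores, key=scores.get)
best_tree = DecisionTreeClassifier(max_depth=4, random_state=best_seed).fit(x_tr, y_tr)
test_accuracy = best_tree.score(x_ts, y_ts)

# Plain-text version of the tree: one line per split or leaf.
rules = export_text(best_tree, feature_names=list(data.feature_names))
print('best seed:', best_seed, 'test accuracy:', test_accuracy)
print(rules)
```

The text output carries the same splits as the graphviz image and can be pasted into a report without rendering a graph.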

## Conclusion

We explored the potential of traditional machine learning for explainable classification. In high-stakes situations, like cancer diagnosis, we need some information on how classifiers make predictions. Our pipeline had four steps:

1) We started by removing redundant features, identified through correlation analysis.

2) Then we applied logistic regression to find how much each feature contributes to whether the tumor is benign or malignant. Based on the coefficients assigned by the model, we picked the ten most significant features.

3) We checked that they were good features with two methods: a) we ran a new logistic regression model on them, which yielded a slight increase in accuracy, and b) through visual inspection of violin plots we saw that the distributions of these features are distinct/separable for each class.

4) Finally, we ran a decision tree to find the values of these features that determine the class, and visualised this in a tree-graph image.

For each classification, the doctor can now consult this graph to see how the system reached its decision. For higher performance, the doctor should use the logistic regression classifier, which does not explain its decisions in as much detail as the decision tree, but shows the coefficient with which each feature contributes to the classification and reaches very high accuracy (97.36%).
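
Consulting the graph for a single patient can also be done programmatically. The sketch below follows one sample through a fitted tree and lists the tests it passes on the way to its leaf. It is an illustration under assumptions: the helper `explain_sample` is hypothetical, and the tree is fitted on scikit-learn's bundled copy of the dataset rather than on the reduced feature set used above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
clf = DecisionTreeClassifier(max_depth=4, random_state=7).fit(data.data, data.target)

def explain_sample(clf, sample, feature_names):
    """Return the split tests a single sample passes through (hypothetical helper)."""
    row = sample.reshape(1, -1)
    node_ids = clf.decision_path(row).indices  # nodes visited, root first
    leaf_id = clf.apply(row)[0]                # id of the final leaf
    steps = []
    for node in node_ids:
        if node == leaf_id:
            continue  # the leaf holds the prediction, not a test
        feat = clf.tree_.feature[node]
        thr = clf.tree_.threshold[node]
        op = '<=' if sample[feat] <= thr else '>'
        steps.append(f'{feature_names[feat]} = {sample[feat]:.2f} {op} {thr:.2f}')
    return steps

sample = data.data[0]
steps = explain_sample(clf, sample, data.feature_names)
predicted = data.target_names[clf.predict(sample.reshape(1, -1))[0]]

for step in steps:
    print(step)
print('prediction:', predicted)
```

Each printed line corresponds to one box on the path from the root of the graph to the leaf that produced the prediction.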