<a href="https://colab.research.google.com/github/ubern-mia/bme-labs/blob/main/Session3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Session 3

Today, we will look into what features a classifier deems important for the decision it takes.

First, load the data again (the same file you used last time).

In [None]:
from google.colab import files
! cd "/content"
uploaded = files.upload()

measurements = "/content/" + list(uploaded.keys())[0]

As last time, define the feature columns and a training/testing split. We will again use a random forest classifier for the prediction, but instead of cross-validating as last time, we fit it once on the full training set.

In [None]:
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, balanced_accuracy_score

# define a mapping of the disease encoding back to strings
diseasestatus = {0: "Healthy", 1: "ASD", 2: "Epilepsy"}

measurements = pd.read_csv("/content/fsfaststats.csv")

# map the numeric disease encoding to its name and make sure the age is a float
measurements["Disease"] = [diseasestatus[e] for e in measurements["Disease"]]
measurements["Age"] = [float(e) for e in measurements["Age"]]

# X is the feature matrix we feed to the classifier, i.e. the measurements and the age of the subject
# keep the original column order (a set has no stable order across runs)
features = [c for c in measurements.columns if c not in ("Subject", "Disease")]
X = measurements.loc[:, features]
y = measurements["Disease"]

# Reserve 20% of the data for testing
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in sss.split(X, y):
  print("Training set size: " + str(len(train_index)))
  print("Test set size: " + str(len(test_index)))

X_train = X.iloc[train_index, :]
y_train = y.iloc[train_index]

X_test = X.iloc[test_index, :]
y_test = y.iloc[test_index]

# Plug in the parameter settings that work well according to your experiments
clf = RandomForestClassifier(n_estimators=200, max_depth=None, random_state=42)

# Train the classifier and apply it to the test set
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
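
Before interpreting the classifier, it can help to check how well it does on the held-out data. The following is a minimal sketch on synthetic data: `make_classification` stands in for the measurements table (an assumption), and all `_demo` names are hypothetical. In the notebook you would pass `y_test` and `y_pred` to the same two metric functions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic three-class data standing in for the real measurements (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=4, n_classes=3,
                                     random_state=42)

# Stratified 80/20 split, analogous to the split used above
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42)

clf_demo = RandomForestClassifier(n_estimators=200, random_state=42)
clf_demo.fit(X_tr_demo, y_tr_demo)
y_pred_demo = clf_demo.predict(X_te_demo)

# Balanced accuracy is the mean of the per-class recalls
bal_acc_demo = balanced_accuracy_score(y_te_demo, y_pred_demo)
# Rows are the true classes, columns the predicted classes
cm_demo = confusion_matrix(y_te_demo, y_pred_demo)

print("Balanced accuracy:", bal_acc_demo)
print(cm_demo)
```

A feature importance is only as trustworthy as the model it comes from, so a quick look at the test performance is a sensible sanity check.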

With this trained classifier, we can now inspect its built-in (impurity-based) feature importances to see which features matter most:

In [None]:
featimportances = clf.feature_importances_

# sort these in descending order and see which features they correspond to
sortidx = (-featimportances).argsort()
importances_sorted = featimportances[sortidx]
featurenames_sorted = X_train.columns[sortidx]
print(importances_sorted)
print(featurenames_sorted)
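
To get a feeling for what `feature_importances_` contains, the sketch below looks at the individual trees of a forest. It uses synthetic data (an assumption, not the course data), and the `_demo` names are hypothetical. In current scikit-learn versions the forest-level importance is the average of the per-tree importances, normalised to sum to one, so the spread across trees gives a rough idea of how stable each value is.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the real measurements (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=3, random_state=0)
forest_demo = RandomForestClassifier(n_estimators=100, random_state=0)
forest_demo.fit(X_demo, y_demo)

# One row per tree, one column per feature
per_tree_demo = np.array([tree.feature_importances_
                          for tree in forest_demo.estimators_])
mean_over_trees_demo = per_tree_demo.mean(axis=0)
std_over_trees_demo = per_tree_demo.std(axis=0)

print("Sum of importances:", forest_demo.feature_importances_.sum())
print("Mean over trees:   ", mean_over_trees_demo)
print("Std over trees:    ", std_over_trees_demo)
```

The standard deviation could be passed as error bars to a bar plot of the importances.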

Let's plot the feature importances. Since there are so many, we limit the plot to the top 20 features.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

n_most_important = 20
sns.barplot(x=importances_sorted[:n_most_important], 
            y=featurenames_sorted[:n_most_important], color='forestgreen')
plt.title("Random forest feature importance")
plt.xlabel("Importance")
plt.ylabel("Feature")
plt.show()

You now know roughly which features the classifier considers most important. Next, check whether this importance is already visible when you look at the features individually. As in the previous sessions, you can plot each feature against age. The example below selects the left hippocampus; replace it with the most important feature, and repeat this for the top three features according to your findings above.

In [None]:
plotcols = list(set(measurements.columns) - set(["Age", "Subject", "Disease"]))
feature = "Left-Hippocampus"

sns.lmplot(x="Age", y=feature, hue="Disease", data=measurements, ci=95)
plt.title(feature)
plt.show()

## Permutation-based feature importances
We can also gain insight into the importance of features for classifiers that do not offer a built-in measure like the one you have seen for the Random Forest classifier. Permutation-based feature importance measures how much a performance score drops when the values of a single feature are randomly shuffled. Additionally, the impurity-based feature importance you used above can be misleading for features with many unique values (high cardinality).
Permutation-based feature importance can also give you a better impression of the behaviour on the test set, since the impurity-based importance is derived from model parameters inferred from the training set.
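
The idea behind permutation importance can be sketched by hand: shuffle one feature column of the test set at a time and record how much the score drops compared to the unshuffled data. The helper `manual_permutation_importance` below is hypothetical and only meant as an illustration on synthetic data (an assumption); it is not scikit-learn's implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split


def manual_permutation_importance(model, X, y, n_repeats=5, seed=42):
    """Hypothetical helper: mean/std drop in balanced accuracy per feature."""
    rng = np.random.default_rng(seed)
    baseline = balanced_accuracy_score(y, model.predict(X))
    drops = np.zeros((X.shape[1], n_repeats))
    for col in range(X.shape[1]):
        for rep in range(n_repeats):
            X_perm = X.copy()
            # Shuffle a single column; this breaks its link to the label
            X_perm[:, col] = rng.permutation(X_perm[:, col])
            score = balanced_accuracy_score(y, model.predict(X_perm))
            drops[col, rep] = baseline - score
    return drops.mean(axis=1), drops.std(axis=1)


# Synthetic data standing in for the real measurements (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=3, random_state=0)
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0)

model_demo = RandomForestClassifier(n_estimators=100, random_state=0)
model_demo.fit(X_tr_demo, y_tr_demo)

imp_mean_demo, imp_std_demo = manual_permutation_importance(
    model_demo, X_te_demo, y_te_demo)
print(imp_mean_demo)
print(imp_std_demo)
```

A value close to zero (or even slightly negative) suggests the model does not rely on that feature for its test-set predictions.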

The code below shows an example of how to get these feature importances for the Random Forest classifier you have used before, but it can be replaced by any other classifier you tested.

For this example, we will use the balanced accuracy as a performance metric.


In [None]:
from sklearn.inspection import permutation_importance

result = permutation_importance(clf, X_test, y_test, n_repeats=10, 
                                random_state=42, n_jobs=5, 
                                scoring='balanced_accuracy')

# put the feature importances into a dataframe with the mean and standard 
# deviation across the repeated runs
forest_importances = pd.DataFrame(zip(result.importances_mean, 
                                      result.importances_std), 
                                  index=X_test.columns, 
                                  columns=["Importance mean", "Importance std"])

In [None]:
# let's sort that according to the mean importance column
forest_importances_sorted = forest_importances.sort_values(
    by=['Importance mean'], ascending=False)

# to not clutter the plot too much, limit ourselves to the n_most_important 
# features as before
plotting_subset = forest_importances_sorted.iloc[0:n_most_important, :]

fig, ax = plt.subplots()
plotting_subset["Importance mean"].plot.bar(
    yerr=plotting_subset['Importance std'], ax=ax, color='forestgreen')
ax.set_title("Permutation feature importances")
ax.set_ylabel("Mean balanced accuracy decrease")
fig.tight_layout()
plt.show()
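
As noted above, permutation importance does not depend on the type of classifier. As a minimal sketch, the following applies the same `permutation_importance` call to a logistic regression, which has no `feature_importances_` attribute. The data is synthetic and the choice of model and the `_demo` names are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the real measurements (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=3, random_state=1)
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=1)

# Scale the features, then fit a logistic regression
pipe_demo = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe_demo.fit(X_tr_demo, y_tr_demo)

# Same call as for the random forest, only the estimator differs
result_demo = permutation_importance(pipe_demo, X_te_demo, y_te_demo,
                                     n_repeats=10, random_state=42,
                                     scoring='balanced_accuracy')

print(result_demo.importances_mean)
print(result_demo.importances_std)
```

`result_demo.importances` holds the raw score drops with one row per feature and one column per repeat, in case you want to plot the full distribution instead of mean and standard deviation.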

Since we use the age of the subject as a feature and do not explicitly model the effect of aging, it may also be informative to plot the distribution of each feature separately per disease group (as opposed to plotting the feature values against age, as you did before). To get a clearer plot, the following example only shows the ten most important features.

In [None]:
n_top_features = 10
# select the top features (by permutation importance) plus the disease label
subset = list(plotting_subset.index[:n_top_features]) + ['Disease']
important_measurements = measurements[subset]

# convert the dataframe to long format
imp_meas_long = pd.melt(important_measurements, id_vars='Disease',
        var_name='Feature', value_name='Feature value')
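
To see what the conversion to long format does, here is a small sketch with a made-up table (the feature names and values are invented for illustration). Each feature column is stacked into (`Feature`, `Feature value`) pairs, while `Disease` is repeated for every row.

```python
import pandas as pd

# Made-up wide table: one row per subject, one column per feature
wide_demo = pd.DataFrame({
    "Disease": ["Healthy", "ASD", "Epilepsy"],
    "feature_a": [1.0, 2.0, 3.0],
    "feature_b": [10.0, 20.0, 30.0],
})

# Long format: one row per (subject, feature) combination
long_demo = pd.melt(wide_demo, id_vars="Disease",
                    var_name="Feature", value_name="Feature value")
print(long_demo)
```

The long table has three rows per feature, which is the layout `sns.boxplot` expects when `x`, `y` and `hue` are given as column names.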

In [None]:

fig, ax = plt.subplots(figsize=(20, 5))
sns.boxplot(x='Feature', y='Feature value', hue="Disease", data=imp_meas_long, 
            palette="Set2", ax=ax)
ax.tick_params(axis="x", labelrotation=90)
plt.title("Distribution by disease for important features")
plt.show()