In this example, we will use Sibyl to find the top contributing features in the Student Performance dataset.

Data source:

P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7.


In [1]:
### Imports
import pandas as pd
import numpy as np
import sys

First, we provide human-readable descriptions of every feature and load the data. These descriptions make the resulting explanations much more user-friendly.

In [2]:
feature_descriptions = {
    "school":"School",
    "sex":"Sex",
    "age":"Age",
    "address":"Address type",
    "famsize":"Family size",
    "Pstatus":"Parent's cohabitation status",
    "Medu":"Mother's education",
    "Fedu":"Father's education",
    "Mjob":"Mother's job",
    "Fjob":"Father's job",
    "reason":"Reason for choosing this school",
    "guardian":"Student's guardian",
    "traveltime":"Home to school travel time",
    "studytime":"Weekly study time",
    "failures":"Number of past class failures",
    "schoolsup":"Extra education support",
    "famsup":"Family educational support",
    "paid":"Extra paid classes within the subject",
    "activities":"Extra-curricular activities",
    "nursery":"Attended nursery school",
    "higher":"Wants to take higher education",
    "internet":"Has internet at home",
    "romantic":"In a romantic relationship",
    "famrel":"Quality of family relationships (1-5)",
    "freetime":"Amount of free time after school (1-5)",
    "goout":"Frequency of going out with friends (1-5)",
    "Dalc":"Frequency of workday alcohol consumption (1-5)",
    "Walc":"Frequency of weekend alcohol consumption (1-5)",
    "health":"Current health status (1-5)",
    "absences":"Number of school absences"
}

file_path = "student-por.csv"
data_table = pd.read_csv(file_path, sep=";")
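
One way to picture what the descriptions are for: they map raw column names to readable labels. The sketch below applies such a mapping to a made-up explanation table with pandas `rename`. The names `toy_descriptions` and `toy_explanation` and the values are hypothetical, and this is only an illustration of the idea, not necessarily how the library applies the descriptions internally.

```python
import pandas as pd

# Hypothetical subset of the description mapping
toy_descriptions = {
    "Medu": "Mother's education",
    "failures": "Number of past class failures",
}

# Hypothetical explanation values keyed by raw column names
toy_explanation = pd.DataFrame([[0.1, -0.2]], columns=["Medu", "failures"])

# Swap raw column names for their human-readable descriptions
readable = toy_explanation.rename(columns=toy_descriptions)
print(list(readable.columns))
```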

Next, we extract our chosen target variable. In this case, we will predict whether a student passes, defined here as a final grade (G3) greater than 10. The grade columns G1, G2, and G3 are then dropped from the features.

In [3]:
y = (data_table["G3"]>10).astype(int)

data = data_table.drop(["G1", "G2", "G3"], axis='columns')
X = data
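
As a small illustration of how this threshold behaves, the sketch below applies the same comparison to a handful of made-up grades (the values in `toy_grades` are hypothetical, not taken from the dataset). Because the comparison is strict, a grade of exactly 10 lands in the "not pass" class.

```python
import pandas as pd

# Hypothetical final grades, invented for illustration only
toy_grades = pd.Series([0, 9, 10, 11, 18], name="G3")

# Same rule as above: strictly greater than 10 counts as a pass
toy_target = (toy_grades > 10).astype(int)

print(toy_target.tolist())
```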

Now, we will create the transformers. The first will one-hot encode the categorical features, the second will encode boolean (yes/no) features as integers, and the third will standardize the data.

Some combinations of transformation and explanation type require post-hoc transformations on the explanations themselves. In this case, we will run SHAP on the one-hot-encoded features, and then recombine the contributions of these features. The OneHotEncoderWrapper includes this functionality.

In [5]:
from pyreal.utils.transformer import OneHotEncoderWrapper, DataFrameWrapper
from sklearn.preprocessing import StandardScaler

class BooleanEncoder:
    """Encodes yes/no columns as 1/0 and family size as an integer code."""
    def __init__(self, cols):
        self.cols = cols
    def transform(self, data):
        data_transform = data.copy()
        for col in self.cols:
            # Assign the result back rather than replacing in place on a column selection
            data_transform[col] = data_transform[col].replace(['yes', 'no'], [1, 0])
        # Encode family size as integer codes: 'LE3' -> 0, 'GT3' -> 1
        famsize = pd.Categorical(data_transform["famsize"], categories=['LE3', 'GT3'])
        data_transform["famsize"] = famsize.codes
        return data_transform

onehotencoder = OneHotEncoderWrapper(["school", "sex", "address", "Pstatus", "reason", "guardian", "Mjob", "Fjob"])
onehotencoder.fit(data)

boolean_encoder = BooleanEncoder(["schoolsup", "famsup", "paid", "activities", "nursery", "internet", "romantic", "higher"])

standard_scaler = DataFrameWrapper(StandardScaler())
data_for_fitting = boolean_encoder.transform(onehotencoder.transform(X))
standard_scaler.fit(data_for_fitting)
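
To make the recombination step described above more concrete, here is a minimal sketch of one way to fold contributions computed on one-hot-encoded columns back onto the original feature. Since SHAP values are additive, summing the per-column contributions is a natural choice. The column names, the `column_to_feature` mapping, and the numbers are all hypothetical, and this is not a claim about how OneHotEncoderWrapper implements it.

```python
import pandas as pd

# Hypothetical contributions for one row, computed on encoded columns
encoded_contributions = pd.Series(
    [0.25, -0.125, 0.125, 0.5],
    index=["Mjob_teacher", "Mjob_other", "Mjob_health", "age"],
)

# Assumed mapping from each encoded column back to its original feature
column_to_feature = {
    "Mjob_teacher": "Mjob",
    "Mjob_other": "Mjob",
    "Mjob_health": "Mjob",
    "age": "age",
}

# Sum the contributions of all columns that belong to the same feature
combined = encoded_contributions.groupby(column_to_feature).sum()
print(combined)
```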

Now, we can create the LocalFeatureContribution object, using the information generated above.

In [10]:
from pyreal.explainers import LocalFeatureContribution

m_transforms = [onehotencoder, boolean_encoder, standard_scaler]
lfc = LocalFeatureContribution(model="model.pkl", 
                               x_orig=X, e_transforms=m_transforms,
                               m_transforms=m_transforms,
                               contribution_transforms=onehotencoder,
                               feature_descriptions=feature_descriptions)

We can check the accuracy of the model on the full dataset:

In [11]:
preds = lfc.model_predict(X)
print("Accuracy: %.2f%%" % (np.mean(preds==y)*100))

Accuracy: 81.97%
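
An accuracy figure is easier to interpret next to a simple baseline. The sketch below compares a model's accuracy with that of always predicting the most frequent class. The arrays `toy_y` and `toy_preds` are made up for illustration and are unrelated to the result above.

```python
import numpy as np

# Hypothetical labels and predictions, invented for illustration only
toy_y = np.array([1, 1, 1, 0, 1, 0, 1, 1])
toy_preds = np.array([1, 1, 1, 0, 1, 1, 1, 1])

# Fraction of predictions that match the labels
model_accuracy = np.mean(toy_preds == toy_y)

# Baseline: always predict the most frequent class
majority_class = np.bincount(toy_y).argmax()
baseline_accuracy = np.mean(toy_y == majority_class)

print("Model: %.2f%%, baseline: %.2f%%" % (model_accuracy * 100, baseline_accuracy * 100))
```

The same comparison could be run on `preds` and `y` from the cell above.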


Finally, we fit our explainer and look at the features that contribute most to the prediction for the first student.

In [12]:
lfc.fit()

In [13]:
contributions = lfc.produce(X.iloc[0:1])
top = contributions.sort_values(by=0, axis=1, ascending=False).iloc[:,:5]

print("Top contributing features: ")
for item in top:
    print(item, "-", top[item].values[0])

Top contributing features: 
Father's job - 0.561098933017578
School - 0.45280347134123017
Number of past class failures - 0.3177348486168521
Father's education - 0.2955309994589378
Sex - 0.2843050270532579
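
Sorting in descending order, as above, surfaces the largest positive contributions. If strongly negative contributions matter as well, one option is to rank by absolute value instead. The sketch below does this on a made-up single-row table; the feature names and values in `toy_contributions` are hypothetical.

```python
import pandas as pd

# Hypothetical single-row contributions, invented for illustration only
toy_contributions = pd.DataFrame(
    [[0.5, -0.75, 0.125, -0.25]],
    columns=["Feature A", "Feature B", "Feature C", "Feature D"],
)

row = toy_contributions.iloc[0]

# Order features by magnitude while keeping the original signed values
ranked = row.reindex(row.abs().sort_values(ascending=False).index)
top_two = ranked.head(2)

for name, value in top_two.items():
    print(name, "-", value)
```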
