**Import libraries and import data set.**

In [1]:
import pandas as pd
import numpy as np
import edc_agnostic
from function_edc import fn_1 
import scipy.sparse  # csr_matrix is used below, so import the submodule explicitly

In [45]:
# Run the sedc_algorithm.py module so that its definitions are available in the notebook namespace
%run sedc_algorithm.py

**For this demonstration, we use the [MovieLens 1M data set](https://grouplens.org/datasets/movielens/1m/), which contains the movie-viewing behavior of users. The target variable is binary (1 if gender = 'MALE' and 0 if gender = 'FEMALE').**

In [25]:
target = pd.read_csv('target_ML1M.csv')
data = pd.read_csv('data_ML1M.csv')
feature_names = pd.read_csv('feature_names_ML1M.csv')

**Split the data into a training and a test set (80%-20%). We use the fine-tuned MLP hyperparameter configuration from the paper by De Cnudde et al. (2018), *'An exploratory study towards applying and demystifying deep learning classification on behavioral big data'*, and train the MLP classifier on the training set.**

In [26]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(scipy.sparse.csr_matrix(data.iloc[:,1:3707].values), target.iloc[:,1], test_size=0.2, random_state=0)

In [27]:
from sklearn.neural_network import MLPClassifier
MLP_model = MLPClassifier(activation='relu', learning_rate_init=0.30452, alpha=0.0001, learning_rate='adaptive', early_stopping=True, hidden_layer_sizes=(532,135,1009), solver='lbfgs', batch_size=100)
MLP_model.fit(x_train, y_train)

MLPClassifier(activation='relu', alpha=0.0001, batch_size=100, beta_1=0.9,
              beta_2=0.999, early_stopping=True, epsilon=1e-08,
              hidden_layer_sizes=(532, 135, 1009), learning_rate='adaptive',
              learning_rate_init=0.30452, max_iter=200, momentum=0.9,
              n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
              random_state=None, shuffle=True, solver='lbfgs', tol=0.0001,
              validation_fraction=0.1, verbose=False, warm_start=False)

**Calculate the Area under the ROC curve (AUC) of the model on the test set.**

In [36]:
from sklearn.metrics import roc_auc_score

Scores = MLP_model.predict_proba(x_test)[:,1] #predict scores using the trained MLP model
AUC = roc_auc_score(y_test,Scores) #output AUC of the model 
print("The AUC of the model is %f" %AUC)

The AUC of the model is 0.821381
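
**As a reminder of what this number measures: the AUC equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative instance, with ties counted as one half. The sketch below computes it by brute force on made-up labels and scores; `toy_auc`, `toy_labels` and `toy_scores` are illustrative names and values, not part of the notebook's data.**

```python
import numpy as np


def toy_auc(y_true, scores):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    # Correctly ordered pairs count 1, tied pairs count 0.5.
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


# Made-up labels and scores, for illustration only.
toy_labels = [1, 0, 1, 1, 0]
toy_scores = [0.9, 0.4, 0.35, 0.8, 0.1]
print(toy_auc(toy_labels, toy_scores))
```

**In this toy example, 5 of the 6 positive-negative pairs are ordered correctly, so the value is 5/6. The brute-force comparison is quadratic in the number of instances and is only meant to make the definition concrete.**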


**Predict 25% of the test instances as positive (gender = 'MALE'), e.g., because of a limited targeting budget. Obtain the indices of the test instances that are predicted as 'MALE', i.e., the instances to which the model assigns the highest 'MALE' scores.**

In [37]:
probs = MLP_model.predict_proba(x_test)[:,1]
threshold_classifier_probs = np.percentile(probs,75) 
predictions_probs = (probs>=threshold_classifier_probs)
indices_probs_pos = np.nonzero(predictions_probs)  # indices of the positively predicted test instances
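
**A minimal sketch of the same thresholding step on synthetic scores. The values in `toy_probs` are random numbers that stand in for the model's scores; all `toy_` names are illustrative. Using the 75th percentile as the cut-off flags roughly the top 25% of the instances as positive.**

```python
import numpy as np

# Synthetic scores standing in for predict_proba(...)[:, 1] (illustration only).
rng = np.random.default_rng(0)
toy_probs = rng.random(1000)

# The 75th percentile is the score below which 75% of the instances fall.
toy_threshold = np.percentile(toy_probs, 75)

# Instances at or above the threshold are predicted positive.
toy_predictions = toy_probs >= toy_threshold

# np.nonzero returns a tuple with one index array per dimension,
# so the first element holds the positions of the positive predictions.
toy_indices_pos = np.nonzero(toy_predictions)[0]

print(len(toy_indices_pos) / len(toy_probs))
```

**Because `np.nonzero` returns a tuple, displaying its result directly shows the index array wrapped in parentheses.**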

In [38]:
classification_model = MLP_model 

def classifier_fn(X):
    c=classification_model.predict_proba(X)
    y_predicted_proba=c[:,1]
    return y_predicted_proba

**Create an SEDC explainer object. By default, the SEDC algorithm stops searching as soon as a first explanation is found, when a 2-minute time limit is exceeded, or when more than 50 iterations are required (see edc_agnostic.py for more details). Only the active (nonzero) features are perturbed (set to zero) to evaluate the impact on the model's predicted output. In other words, only movies that a user has watched can become part of the counterfactual explanation of the model prediction.**

In [39]:
explainer_SEDC = SEDC_Explainer(feature_names = np.array(feature_names.iloc[:,1]), 
                               threshold_classifier = threshold_classifier_probs, 
                               classifier_fn = classifier_fn)
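
**The idea of perturbing only active features can be illustrated with a small, self-contained sketch. It is not the code in sedc_algorithm.py: the function `toy_counterfactual_search`, the linear `toy_score_fn`, the weights, and the greedy best-first expansion are all assumptions made for illustration, and the time limit is omitted. The sketch sets active features to zero and stops as soon as a removal pushes the score below the threshold.**

```python
import numpy as np


def toy_counterfactual_search(x, score_fn, threshold, max_iter=50):
    """Best-first search for a set of active features whose removal
    pushes the score below the threshold (simplified illustration)."""
    x = np.asarray(x, dtype=float)
    # Only active (nonzero) features are candidates for removal.
    active = [int(i) for i in np.flatnonzero(x)]

    def score_without(combo):
        perturbed = x.copy()
        perturbed[list(combo)] = 0.0  # "remove" the features in combo
        return float(score_fn(perturbed))

    # Start with every single-feature removal.
    scores = {(i,): score_without((i,)) for i in active}
    expanded = set()

    for _ in range(max_iter):
        # A combination that drops the score below the threshold flips the
        # predicted class, so it counts as an explanation.
        found = [c for c, s in scores.items() if s < threshold]
        if found:
            # Prefer the smallest set, then the lowest resulting score.
            return min(found, key=lambda c: (len(c), scores[c]))
        frontier = [c for c in scores if c not in expanded]
        if not frontier:
            return None
        # Expand the combination that lowers the score the most so far.
        best = min(frontier, key=scores.get)
        expanded.add(best)
        for i in active:
            if i not in best:
                combo = tuple(sorted(best + (i,)))
                if combo not in scores:
                    scores[combo] = score_without(combo)
    return None


# Hypothetical linear scorer and instance, for illustration only.
toy_weights = np.array([0.5, 0.1, 0.3, 0.2, 0.4])


def toy_score_fn(v):
    return toy_weights @ v


toy_instance = np.array([1, 1, 1, 0, 1])  # feature 3 is inactive
toy_result = toy_counterfactual_search(toy_instance, toy_score_fn, threshold=0.6)
print(toy_result)
```

**In this toy setting no single removal is enough, so the search expands the most promising single feature (index 0) and finds a two-feature explanation. Feature 3 is inactive and is therefore never considered.**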

**Show the indices of the positively predicted test instances.**

In [32]:
indices_probs_pos #all instances that are predicted as 'MALE'

(array([  21,   24,   25,   30,   32,   37,   57,   60,   61,   65,   70,
          74,   75,   84,   86,   88,   94,   95,   97,  101,  102,  106,
         116,  120,  122,  123,  127,  133,  139,  140,  141,  142,  144,
         148,  152,  153,  164,  166,  173,  174,  175,  180,  181,  189,
         190,  192,  199,  200,  206,  211,  214,  220,  222,  223,  230,
         254,  255,  257,  264,  269,  271,  273,  282,  284,  285,  287,
         290,  292,  295,  298,  301,  305,  307,  309,  312,  315,  316,
         317,  319,  320,  330,  336,  339,  340,  341,  343,  349,  350,
         352,  353,  358,  361,  366,  372,  375,  384,  394,  395,  396,
         397,  401,  409,  415,  417,  421,  436,  454,  458,  466,  471,
         472,  475,  478,  482,  485,  487,  489,  498,  501,  503,  510,
         513,  523,  524,  529,  532,  533,  549,  550,  563,  564,  565,
         572,  575,  576,  579,  587,  590,  608,  609,  615,  617,  623,
         627,  640,  641,  643,  645, 

**Explain why the user with index = 4 is predicted as a 'MALE' user by the model.**

In [40]:
index = 4
instance_idx = x_test[index]
explanation = explainer_SEDC.explanation(instance_idx)

Initialization complete

 Elapsed time 0 


 Iteration 1 

length of new_combinations is 1 features.
new combination cannot be expanded

 Elapsed time 0 


 size combis to expand 126 

iterations are done

 Elapsed time 0 



**Show the explanation(s) found.**

In [42]:
explanation[0]

[['Austin Powers']]

In [17]:
print("IF the user did not watch the movie(s) " + str(explanation[0][0]) + ", THEN the predicted class would change from 'MALE' to 'FEMALE'.")


IF the user did not watch the movie(s) ['Tin Cup (1996)'], THEN the predicted class would change from 'MALE' to 'FEMALE'.


**Show more information about the explanation(s): *explanation[0]* shows the explanation set(s), *explanation[1]* shows the number of active features of the instance to explain, *explanation[2]* shows the number of explanations found, *explanation[3]* shows the number of features in the smallest-sized explanation, *explanation[4]* shows the time elapsed in seconds to find the explanation, *explanation[5]* shows the predicted score change when removing the feature(s) in the smallest-sized explanation, and *explanation[6]* shows the number of iterations that the algorithm needed.**

In [18]:
explanation

([['Tin Cup (1996)']],
 126,
 2,
 1,
 0.062448978424072266,
 [array([6.94710955e-12])],
 1)
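
**Since the result is a plain tuple with seven positions, it can help to give the positions names. The sketch below wraps the values printed above in a `namedtuple`; the type `ToyExplanation` and its field names are hypothetical and not part of the SEDC code, and the score change is written as a plain float instead of a NumPy array.**

```python
from collections import namedtuple

# Hypothetical field names; the order follows the printed tuple.
ToyExplanation = namedtuple(
    "ToyExplanation",
    [
        "explanation_sets",    # position 0
        "n_active_features",   # position 1
        "n_explanations",      # position 2
        "min_size",            # position 3
        "elapsed_seconds",     # position 4
        "score_change",        # position 5
        "n_iterations",        # position 6
    ],
)

# Values copied from the output shown above.
example = ToyExplanation(
    [['Tin Cup (1996)']], 126, 2, 1, 0.062448978424072266, [6.94710955e-12], 1
)

print(example.n_active_features, example.n_iterations)
```

**With a real result, `ToyExplanation(*explanation)` would do the same, assuming the tuple has exactly these seven positions.**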

**Explain why the user with index = 30 is predicted as a 'MALE' user by the model.**

In [19]:
index = 30
instance_idx = x_test[index]
explanation = explainer_SEDC.explanation(instance_idx)

Initialization is complete.

 Elapsed time 0 


 Iteration 1 

Length of new_combinations is 1 features.
New combinations can be expanded

 Elapsed time 0 


 Size combis to expand 382 


 Iteration 2 

Length of new_combinations is 2 features.
New combinations can be expanded

 Elapsed time 0 


 Size combis to expand 571 


 Iteration 3 

Length of new_combinations is 3 features.
New combinations can be expanded

 Elapsed time 1 


 Size combis to expand 759 


 Iteration 4 

Length of new_combinations is 4 features.
New combinations can be expanded

 Elapsed time 2 


 Size combis to expand 946 


 Iteration 5 

Length of new_combinations is 5 features.
New combination cannot be expanded

 Elapsed time 2 


 Size combis to expand 946 

Iterations are done.

 Elapsed time 2 



**Show the smallest-sized explanation found by the SEDC explainer for the user with index = 30.**

In [20]:
print("IF the user did not watch the movie(s) " + str(explanation[0][0]) + ", THEN the predicted class would change from 'MALE' to 'FEMALE'.")

IF the user did not watch the movie(s) ['Being There (1979)', 'Star Wars', 'Moonraker (1979)', 'Exorcist', 'Apocalypse Now (1979)'], THEN the predicted class would change from 'MALE' to 'FEMALE'.


**Show the first 10 explanations found by the SEDC algorithm for the user with index = 30. To do so, we set max_explained to 10.**

In [21]:
explainer_SEDC2 = SEDC_Explainer(feature_names = np.array(feature_names.iloc[:,1]), 
                               threshold_classifier = threshold_classifier_probs, 
                               classifier_fn = classifier_fn, max_explained=10)

In [22]:
index = 30
instance_idx = x_test[index]
explanation = explainer_SEDC2.explanation(instance_idx)

Initialization is complete.

 Elapsed time 0 


 Iteration 1 

Length of new_combinations is 1 features.
New combinations can be expanded

 Elapsed time 0 


 Size combis to expand 382 


 Iteration 2 

Length of new_combinations is 2 features.
New combinations can be expanded

 Elapsed time 0 


 Size combis to expand 571 


 Iteration 3 

Length of new_combinations is 3 features.
New combinations can be expanded

 Elapsed time 1 


 Size combis to expand 759 


 Iteration 4 

Length of new_combinations is 4 features.
New combinations can be expanded

 Elapsed time 2 


 Size combis to expand 946 


 Iteration 5 

Length of new_combinations is 5 features.
New combination cannot be expanded

 Elapsed time 2 


 Size combis to expand 946 

Iterations are done.

 Elapsed time 2 



**85 explanations were found after 3 iterations; the first 10 are shown. The elapsed time is 1.07 seconds, and the user has 192 active features (movies watched).**

In [23]:
explanation

([['Being There (1979)',
   'Star Wars',
   'Moonraker (1979)',
   'Exorcist',
   'Apocalypse Now (1979)'],
  ['Being There (1979)',
   'Star Wars',
   'Moonraker (1979)',
   'Exorcist',
   'From Russia with Love (1963)'],
  ['Being There (1979)',
   'Star Wars',
   'Moonraker (1979)',
   'Exorcist',
   'Halloween (1978)'],
  ['Being There (1979)',
   'Star Wars',
   'Moonraker (1979)',
   'Exorcist',
   'Lethal Weapon 3 (1992)'],
  ['Being There (1979)',
   'Star Wars',
   'Moonraker (1979)',
   'Exorcist',
   'Clear and Present Danger (1994)'],
  ['Being There (1979)',
   'Star Wars',
   'Moonraker (1979)',
   'Exorcist',
   'GoodFellas (1990)'],
  ['Being There (1979)',
   'Star Wars',
   'Moonraker (1979)',
   'Exorcist',
   'Dr. Strangelove or'],
  ['Being There (1979)',
   'Star Wars',
   'Moonraker (1979)',
   'Exorcist',
   'Sixth Sense'],
  ['Being There (1979)',
   'Star Wars',
   'Moonraker (1979)',
   'Exorcist',
   'Apollo 13 (1995)'],
  ['Being There (1979)',
   'Star War