<h1>Introduction</h1>

This notebook applies the Ruleex implementation of ANN-DT to sentiment analysis of English text. The dataset is the Large Movie Review Dataset (IMDb reviews), which contains 50K reviews split equally between the training and test sets (25K each). Positive and negative reviews are also balanced within each set (12.5K positive and 12.5K negative reviews in each). Because the algorithm is very slow, the demonstration uses a reduced subset of the data. The model is a simple deep neural network trained on tf-idf features.

Dataset location: https://ai.stanford.edu/~amaas/data/sentiment/

ANN-DT Paper: https://ieeexplore.ieee.org/document/809084

Code reference: https://github.com/fantamat/ruleex

For this notebook, use forked library: https://github.com/rohancode/ruleex_modified

In [1]:
import warnings
warnings.filterwarnings('ignore')

from glob import glob
import os
import sys

import pandas as pd
import numpy as np
import re

import keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Flatten
from keras.layers.core import Dense

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

from gtrain import FCNet

import tensorflow as tf

Using TensorFlow backend.


In [2]:
keras.__version__

'2.2.4'

In [3]:
tf.__version__

'1.13.1'

Add the directory where the repository (https://github.com/rohancode/ruleex_modified) is cloned to the Python path:

In [4]:
sys.path.append('/Users/Rohan/Desktop/Cl/XAI/Rulex/workspace')

In [5]:
from ruleex_modified.ruleex.deepred.model import DeepRedFCNet
import ruleex_modified.ruleex.anndt as ndt
import ruleex_modified.ruleex.deepred as dr

In [6]:
from tqdm import tqdm

Sentiment Analysis Model

In [7]:
def parse_folder(name):
    data = []
    for verdict in ('neg', 'pos'):
        for file in tqdm(glob(os.path.join(name, verdict, '*.txt'))):
            # Use a context manager so each review file is closed after reading
            with open(file, encoding='utf8') as f:
                text = f.read()
            data.append({
                'text': text,
                'verdict': verdict == 'pos'
            })
    return pd.DataFrame(data)

df_train = parse_folder('/Users/Rohan/Desktop/Cl/XAI/aclImdb/train/')
df_test = parse_folder('/Users/Rohan/Desktop/Cl/XAI/aclImdb/test/')

df = pd.concat([df_train, df_test])
df.reset_index(inplace=True)
df.drop(['index'], axis=1, inplace=True)
df = df.sample(frac=1)

100%|██████████| 12500/12500 [00:00<00:00, 14004.46it/s]
100%|██████████| 12500/12500 [00:00<00:00, 13699.22it/s]
100%|██████████| 12500/12500 [00:00<00:00, 13956.11it/s]
100%|██████████| 12500/12500 [00:00<00:00, 12917.82it/s]


In [8]:
df.verdict.value_counts()

True     25000
False    25000
Name: verdict, dtype: int64

Work on a reduced dataset to cut runtime for a quick demonstration

In [9]:
df = df.head(20000)

In [10]:
df.verdict.value_counts()

False    10026
True      9974
Name: verdict, dtype: int64

In [11]:
TAG_RE = re.compile(r'<[^>]+>')
def remove_tags(text):
    return TAG_RE.sub('', text)

def preprocess_text(sen):
    sentence = remove_tags(sen)
    sentence = re.sub('[^a-zA-Z]', ' ', sentence)
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)
    sentence = re.sub(r'\s+', ' ', sentence)
    return sentence

X_text = []
sentences = list(df['text'])
for sen in sentences:
    X_text.append(preprocess_text(sen))
    
y = df['verdict']
y = np.array(list(map(lambda x: [0,1] if x==True else [1,0], y)))
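To make the effect of the cleaning steps concrete, the sketch below repeats the same regular expressions on a single made-up review string and shows the one-hot label convention. The sample text and the `one_hot` helper are illustrative assumptions, not part of the notebook.

```python
import re

# Same regular expressions as the preprocessing cell, repeated so this runs on its own.
TAG_RE = re.compile(r'<[^>]+>')

def preprocess_text(sen):
    sentence = TAG_RE.sub('', sen)                        # drop HTML tags such as <br />
    sentence = re.sub('[^a-zA-Z]', ' ', sentence)         # replace non-letters with spaces
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)   # drop isolated single letters
    sentence = re.sub(r'\s+', ' ', sentence)              # collapse runs of whitespace
    return sentence

def one_hot(verdict):
    # Hypothetical helper mirroring the lambda above: [1, 0] = negative, [0, 1] = positive.
    return [0, 1] if verdict else [1, 0]

sample = "This movie was <br />a great film!!"   # made-up review text
cleaned = preprocess_text(sample)
print(repr(cleaned))
print(one_hot(True), one_hot(False))
```

Note that the cleaned string keeps a single trailing space where the punctuation used to be; the tf-idf tokenizer ignores it.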

In [12]:
x_text, x_text_test, y, y_test = train_test_split(X_text, y, test_size=0.20, random_state=42)

In [13]:
vectorizer = TfidfVectorizer(max_features=500)

In [14]:
X_tfidf = vectorizer.fit_transform(x_text)

In [15]:
X_test_tfidf = vectorizer.transform(x_text_test)

In [16]:
model_mn = Sequential([
  Dense(100, activation='relu', input_shape=(500,)),
  Dense(30, activation='relu'),
  Dense(2, activation='softmax'),
])

Instructions for updating:
Colocations handled automatically by placer.
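As a rough sanity check on the size of this network, the following sketch counts its trainable parameters from the layer widths alone. It only assumes the standard dense-layer layout (one weight per input-unit pair plus one bias per unit); the list mirrors the `layer_sizes` that `TaT_anndt` later passes to `DeepRedFCNet`.

```python
# Layer widths of the network: 500 tf-idf inputs, then Dense(100), Dense(30), Dense(2).
layer_sizes = [500, 100, 30, 2]

# A dense layer with n_in inputs and n_out units holds n_in * n_out weights plus n_out biases.
params_per_layer = [
    n_in * n_out + n_out
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
total_params = sum(params_per_layer)
print(params_per_layer, total_params)
```

Almost all of the parameters sit in the first layer, which is why the number of tf-idf features dominates the model size.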


In [17]:
model_mn.compile(
  optimizer='adam',
  loss='categorical_crossentropy',
  metrics=['accuracy'],
)

In [18]:
model_mn.fit(
  X_tfidf,
  y,
  epochs=3,
  batch_size=32,
)

Instructions for updating:
Use tf.cast instead.
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x12e72b510>

In [19]:
model_mn.evaluate(X_test_tfidf, y_test)



[0.3848508437871933, 0.8285]
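The two numbers returned by `evaluate` are the loss and the accuracy, in the order given to `compile`. A minimal NumPy sketch of how these two quantities are defined, using made-up softmax outputs and labels rather than the notebook's data:

```python
import numpy as np

# Made-up softmax outputs and one-hot labels for three reviews
# ([1, 0] = negative, [0, 1] = positive).
probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
labels = np.array([[1, 0], [0, 1], [0, 1]])

# Categorical cross-entropy: mean negative log-probability assigned to the true class.
loss = float(-np.mean(np.sum(labels * np.log(probs), axis=1)))

# Accuracy: fraction of rows where the most probable class matches the label.
accuracy = float(np.mean(np.argmax(probs, axis=1) == np.argmax(labels, axis=1)))
print(loss, accuracy)
```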

Ruleex ANN-DT

In [20]:
directory = 'sentiment_new123'
data = 'imdb'
method = 'anndt'

In [21]:
name = data + "_" + method
tat_params = dict()

In [22]:
tat_params["save_dir"] = os.path.join("runs", directory, name)
os.makedirs(tat_params["save_dir"], exist_ok=True)

In [23]:
tat_params["init_restrictions"] = np.array([[0, 1] for _ in range(500)])
tat_params["act_val_num"] = 0
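The `[0, 1]` pair per feature presumably gives the lower and upper bound of each input attribute; that reading of `init_restrictions` is an assumption, not something the notebook states. It is at least consistent with the data: with scikit-learn's default `norm='l2'`, tf-idf rows are non-negative and have unit length, so every entry lies in [0, 1]. A small sketch on a made-up corpus:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpus standing in for the reviews.
docs = ["great film great cast", "boring film", "great fun"]

vec = TfidfVectorizer(max_features=3)   # keep only the 3 most frequent terms
X = vec.fit_transform(docs).toarray()

# With the default norm='l2', every non-empty row has unit length and no negative entries.
row_norms = np.linalg.norm(X, axis=1)

# Same construction as init_restrictions, sized to this toy feature matrix.
bounds = np.array([[0, 1] for _ in range(X.shape[1])])
inside = bool(np.all((X >= bounds[:, 0]) & (X <= bounds[:, 1])))
print(X.shape, row_norms, inside)
```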

In [24]:
def TaT_anndt(tr_in_dict, tr_out_dict, tst_in_dict, tst_out_dict, params):
    
    layer_sizes = []
    layer_sizes.append(500)
    for layer in model_mn.layers:
        layer_sizes.append(layer.output_shape[-1])
    
    weights = model_mn.get_weights()
    
    net = DeepRedFCNet(layer_sizes)
    net.init_eval_weights(weights)
    
    p = {
        "varbose": 1,
        ndt.ATTRIBUTE_SELECTION: False,
    }

    if "min_train_samples" in params:
        p[ndt.MIN_TRAIN_SAMPLES] = params["min_train_samples"]

    if "max_depth" not in params:
        p[ndt.MAX_DEPTH] = 10
    else:
        p[ndt.MAX_DEPTH] = params["max_depth"]

    if "min_samples" in params:
        p[ndt.MIN_SAMPLES] = params["min_samples"]

    if "min_split_fraction" in params:
        p[ndt.MIN_SPLIT_FRACTION] = params["min_split_fraction"]

    stat_test = None
    if "split_test" in params:
        if params["split_test"]=="t":
            stat_test = ndt.test_t
        elif params["split_test"] == "welch":
            stat_test = ndt.test_welch
        elif params["split_test"] == "chi2":
            stat_test = ndt.test_chi2
        elif params["split_test"] == "f":
            stat_test = ndt.test_F
        else:
            stat_test = ndt.test_chi2
    else:
        stat_test = ndt.test_chi2

    if "measure" in params:
        if params["measure"]=="gini":
            measure = ndt.GiniMeasure
        elif params["measure"]=="missclass":
            measure = ndt.FidelityGain
        elif params["measure"]=="maxdiff":
            measure = ndt.MaxDifference
        elif params["measure"]=="var":
            measure = ndt.VarianceMeasure
        else:
            measure = ndt.EntropyMeasure
    else:
        measure = ndt.EntropyMeasure

    if "attr_selection" in params:
        if params["attr_selection"]=="absvar":
            p[ndt.ATTRIBUTE_SELECTION] = ndt.MODE_ABSOLUTE_VARIATION
        elif params["attr_selection"]=="missclass":
            p[ndt.ATTRIBUTE_SELECTION] = ndt.MODE_MISSCLASSIFICATION
        elif params["attr_selection"]=="conmissclass":
            p[ndt.ATTRIBUTE_SELECTION] = ndt.MODE_CONTINUOUS_MISSCLASSIFICATION

    if "measure_weights" in params:
        if params["measure_weights"]=="train":
            p[ndt.MEASURE_WEIGHTS] = ndt.MODE_TRAIN
        elif params["measure_weights"]=="all":
            p[ndt.MEASURE_WEIGHTS] = ndt.MODE_ALL
        else:
            p[ndt.MEASURE_WEIGHTS] = ndt.MODE_NONE

    if "force_sampling" in params:
        p[ndt.FORCE_SAMPLING] = params["force_sampling"]

    if "num_samples" in params:
        indexes = np.random.permutation(len(tr_in_dict["x"]))
        x_train = tr_in_dict["x"][indexes[:params["num_samples"]]]
    else:
        x_train = tr_in_dict["x"]

    if "vs_others" in params:
        model = lambda x: net.eval_binary_class(x, params["vs_others"])
    else:
        model = net.eval

    if "init_restrictions" in params:
        res = params["init_restrictions"]
    else:
        res = None

    p[ndt.SPLIT_TEST_AFTER] = 3
    rt = ndt.anndt(model, x_train, p,
                   stat_test=stat_test,
                   MeasureClass=measure,
                   init_restrictions=res,
                   sampler=ndt.BerNormalSampler(x_train, always_positive=True, sigma=0.01))
    rt.save(os.path.join(params["save_dir"], "rt.pic"))

    # Baseline: an sklearn tree of the same depth, fitted to the network's predicted labels
    dt = DecisionTreeClassifier(max_depth=p[ndt.MAX_DEPTH])
    dt.fit(x_train, np.argmax(model(x_train), axis=1))
    dt = dr.sklearndt_to_ruletree(dt, one_class_on_leafs=True)
    print("rt.view_graph()")
    rt.view_graph(filename='mnist_tree.pdf', varbose=True)
    dt.save(os.path.join(params["save_dir"], "sklear_dt.pic"))
    inf = p[ndt.INF]
    rt_train = rt.eval_all(tr_in_dict["x"])
    dt_train = dt.eval_all(tr_in_dict["x"])
    ln = np.argmax(model(tr_in_dict["x"]), axis=1)
    inf["fidelity"] = np.mean(rt_train == ln)
    inf["dt_fidelity"] = np.mean(dt_train == ln)
    rt_val = rt.eval_all(tst_in_dict["x"])
    dt_val = dt.eval_all(tst_in_dict["x"])
    l = np.argmax(tst_out_dict["y"], axis=1)
    ln = np.argmax(model(tst_in_dict["x"]), axis=1)
    inf["val_fidelity"] = np.mean(ln == rt_val)
    inf["val_accuracy"] = np.mean(l == rt_val)
    inf["dt_val_fidelity"] = np.mean(ln == dt_val)
    inf["dt_val_accuracy"] = np.mean(l == dt_val)
    print("ANN-DT, validation {}: fidelity {}, val_fidelity {}, val_accuracy {}".format(
        params["act_val_num"], inf["fidelity"], inf["val_fidelity"], inf["val_accuracy"]))
    print("sklearn DT, validation {}: fidelity {}, val_fidelity {}, val_accuracy {}".format(
        params["act_val_num"], inf["dt_fidelity"], inf["dt_val_fidelity"], inf["dt_val_accuracy"]))
    return inf
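The function reports two kinds of scores: fidelity compares the tree's predictions with the network's predictions, while accuracy compares them with the true labels. A minimal sketch of the difference, using made-up label vectors in place of `rt.eval_all(...)` and `model(...)`:

```python
import numpy as np

# Made-up class labels for six test reviews (0 = negative, 1 = positive).
true_labels = np.array([0, 1, 1, 0, 1, 0])   # ground truth
net_labels = np.array([0, 1, 0, 0, 1, 1])    # argmax of the network output
tree_labels = np.array([0, 1, 0, 0, 1, 1])   # prediction of the extracted tree

# Fidelity: how often the tree agrees with the network it was extracted from.
fidelity = float(np.mean(tree_labels == net_labels))

# Accuracy: how often the tree agrees with the ground truth.
accuracy = float(np.mean(tree_labels == true_labels))
print(fidelity, accuracy)
```

In this toy case the tree mimics the network perfectly, so fidelity is 1.0, but it also inherits the network's two mistakes, so accuracy is lower.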

In [25]:
x_train_final = X_tfidf.toarray()
x_test_final = X_test_tfidf.toarray()

In [26]:
inf = TaT_anndt({"x": x_train_final}, {"y": y}, {"x": x_test_final}, {"y": y_test}, tat_params)

[anndt]: Generated new node with split x_37 > 0.019214667040972223 in train samples separation (3797, 12203)
[anndt]: Generated new node with split x_175 > 0.052310866126188585 in train samples separation (561, 3236)
[anndt]: Generated new node with split x_487 > 0.01358462706339282 in train samples separation (64, 497)
[anndt]: Stopping rule - fraction of the founded node is to low so the leaf is generated.
[anndt]: Generated new node with split x_131 > 0.05004814777826065 in train samples separation (46, 451)
[anndt]: Generating 4 new samples.
[anndt]: Generated new node with split x_301 > 0.06701800203651462 in train samples separation (5, 41)
[anndt]: Generating 44 new samples.
[anndt]: Generated new node with split x_288 > 0.03739006353521823 in train samples separation (2, 3)
[anndt]: Generating 25 new samples.
[anndt]: Statistics test passed at confidence level 0.05
[anndt]: Generating 25 new samples.
[anndt]: Statistics test passed at confidence level 0.05
[anndt]: Generating 6

[anndt]: Generated new node with split x_218 > 0.019301516921505422 in train samples separation (3, 8)
[anndt]: Generating 43 new samples.
[anndt]: Statistics test passed at confidence level 0.05
[anndt]: Generating 7 new samples.
[anndt]: Statistics test passed at confidence level 0.05
[anndt]: Generated new node with split x_473 > 0.049124185275493344 in train samples separation (18, 41)
[anndt]: Generating 32 new samples.
[anndt]: Generated new node with split x_15 > 0.08870415241524131 in train samples separation (0, 18)
[anndt]: stopping rule - low number of train samples
[anndt]: Generating 2 new samples.
[anndt]: Statistics test passed at confidence level 0.05
[anndt]: Generating 9 new samples.
[anndt]: Generated new node with split x_462 > 0.04155629930465232 in train samples separation (8, 33)
[anndt]: Generating 40 new samples.
[anndt]: Generated new node with split x_7 > 0.20233480799291859 in train samples separation (1, 7)
[anndt]: stopping rule - low number of train sampl

[anndt]: Statistics test passed at confidence level 0.05
[anndt]: Generating 2 new samples.
[anndt]: Generated new node with split x_12 > 0.12061375619414348 in train samples separation (1, 34)
[anndt]: stopping rule - max depth exceeded (10)
[anndt]: stopping rule - max depth exceeded (10)
[anndt]: Stopping rule - fraction of the founded node is to low so the leaf is generated.
[anndt]: Generated new node with split x_131 > 0.07138082104335486 in train samples separation (516, 7861)
[anndt]: Generated new node with split x_131 > 0.18414171863682788 in train samples separation (94, 422)
[anndt]: Statistics test passed at confidence level 0.05
[anndt]: Generated new node with split x_44 > 0.05096901382621156 in train samples separation (79, 343)
[anndt]: Generated new node with split x_49 > 0.06458443420971804 in train samples separation (15, 64)
[anndt]: Generating 35 new samples.
[anndt]: Statistics test passed at confidence level 0.05
[anndt]: Generated new node with split x_254 > 0.
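The "Generating N new samples" lines appear where few training samples reach a node. The general idea in ANN-DT is to draw extra points in that region and let the network label them. The sketch below illustrates that idea only; `sample_near` and `toy_model` are hypothetical stand-ins and do not reproduce how `ndt.BerNormalSampler` works internally.

```python
import numpy as np

def sample_near(points, n_new, lower, upper, sigma=0.01, seed=0):
    """Hypothetical helper, not ruleex code: add Gaussian noise to randomly
    chosen rows of `points`, then clip the result to the box [lower, upper]."""
    rng = np.random.RandomState(seed)
    picks = points[rng.randint(0, len(points), size=n_new)]
    noisy = picks + rng.normal(0.0, sigma, size=picks.shape)
    return np.clip(noisy, lower, upper)

def toy_model(x):
    """Stand-in for the network: class 1 if feature 0 exceeds 0.5."""
    return (x[:, 0] > 0.5).astype(int)

node_points = np.array([[0.48, 0.10], [0.52, 0.30]])   # two samples that reached a node
new_points = sample_near(node_points, n_new=25, lower=0.0, upper=1.0)
new_labels = toy_model(new_points)   # the model, not the dataset, labels the new samples
print(new_points.shape, new_labels[:5])
```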

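The decision tree visualization (.pdf), which highlights the extracted rules, is saved in the current directory under the filename passed to `rt.view_graph` (`mnist_tree.pdf`).

Save the vocabulary so that the feature indices in the decision tree visualization can be mapped back to words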

In [27]:
# Invert the vectorizer's word -> column-index mapping so feature x_i can be looked up by index
voc = {index: word for word, index in vectorizer.vocabulary_.items()}

In [28]:
# Write one "index,word" row per feature (plain writes suffice, so the csv module is not needed)
with open('voc Dec8.csv', 'w') as f:
    for key, word in voc.items():
        f.write("%s,%s\n" % (key, word))
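With the index-to-word mapping saved, a split such as `x_37 > 0.0192` from the ANN-DT log can be read as a threshold on one word's tf-idf weight. The sketch below shows one way to do that lookup. The three words in `voc` are made up for illustration (the real mapping comes from the saved CSV), and `describe_split` is a hypothetical helper.

```python
import re

# Hypothetical slice of the index-to-word mapping; these words are invented,
# the real ones must be read from the saved vocabulary file.
voc = {37: "bad", 131: "great", 175: "worst"}

SPLIT_RE = re.compile(r"x_(\d+) > ([0-9.eE+-]+)")

def describe_split(log_line, vocabulary):
    """Replace the feature index in an ANN-DT split with its vocabulary word."""
    match = SPLIT_RE.search(log_line)
    if match is None:
        return None
    index, threshold = int(match.group(1)), float(match.group(2))
    word = vocabulary.get(index, "<unknown>")
    return "tfidf('{}') > {:.4f}".format(word, threshold)

line = ("[anndt]: Generated new node with split x_37 > 0.019214667040972223 "
        "in train samples separation (3797, 12203)")
print(describe_split(line, voc))
```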