# Load the Pretrained Model and the dataset
We use a BiLSTM as the example model and ChnSentiCorp as the example dataset. More models are listed in the [PaddleNLP Model Zoo](https://paddlenlp.readthedocs.io/zh/latest/model_zoo/transformers.html#transformer).

The BiLSTM model has no tokenizer of its own, so we use `JiebaTokenizer`.

PaddleNLP is required to run this notebook and can be installed with pip:
```bash
pip install setuptools_scm 
pip install --upgrade paddlenlp
```

In [2]:
!wget https://paddlenlp.bj.bcebos.com/data/senta_word_dict.txt -P assets/

--2022-03-28 12:30:06--  https://paddlenlp.bj.bcebos.com/data/senta_word_dict.txt
Resolving paddlenlp.bj.bcebos.com (paddlenlp.bj.bcebos.com)... 10.70.0.165
Connecting to paddlenlp.bj.bcebos.com (paddlenlp.bj.bcebos.com)|10.70.0.165|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14600150 (14M) [text/plain]
Saving to: ‘assets/senta_word_dict.txt’


2022-03-28 12:30:07 (37.5 MB/s) - ‘assets/senta_word_dict.txt’ saved [14600150/14600150]



In [3]:
from paddlenlp.data import JiebaTokenizer, Pad, Stack, Tuple, Vocab

VOCAB_PATH = "assets/senta_word_dict.txt"

vocab = Vocab.load_vocabulary(VOCAB_PATH, unk_token='[UNK]', pad_token='[PAD]')

vocab_size = len(vocab)
num_classes = 2
pad_token_id = vocab.to_indices('[PAD]')
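
As a rough illustration of what the vocabulary lookup does, the sketch below maps tokens to indices and falls back to the unknown token for anything missing. The `toy_vocab` dictionary and the `tokens_to_ids` helper are hypothetical stand-ins for the real `Vocab` object; only the `[PAD]` and `[UNK]` conventions come from the cell above.

```python
# Hypothetical toy vocabulary; the real one is loaded from senta_word_dict.txt.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "酒店": 2, "整洁": 3}


def tokens_to_ids(tokens, token_to_idx, unk_token="[UNK]"):
    """Map each token to its index, falling back to the unknown-token index."""
    unk_id = token_to_idx[unk_token]
    return [token_to_idx.get(token, unk_id) for token in tokens]


ids = tokens_to_ids(["酒店", "很", "整洁"], toy_vocab)
print(ids)  # [2, 1, 3]
```

Mapping every token to some index keeps the id sequence the same length as the token sequence, which matters later when attributions are matched back to words.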

In [8]:
from paddlenlp.datasets import load_dataset
DATASET_NAME = 'chnsenticorp'
train_ds, dev_ds, test_ds = load_dataset(
    DATASET_NAME, splits=["train", "dev", "test"]
)

In [9]:
import paddle
import paddlenlp
import paddle.nn as nn
import paddle.nn.functional as F

MODEL_NAME = 'bilstm'

class LSTMModel(nn.Layer):
    def __init__(self,
                 vocab_size,
                 num_classes,
                 emb_dim=128,
                 padding_idx=0,
                 lstm_hidden_size=198,
                 direction='forward',
                 lstm_layers=1,
                 dropout_rate=0.0,
                 pooling_type=None,
                 fc_hidden_size=96):
        super().__init__()

        # First, look up the input word ids in the embedding table to get word embeddings
        self.embedder = nn.Embedding(
            num_embeddings=vocab_size,
            embedding_dim=emb_dim,
            padding_idx=padding_idx)

        # Encode the word embeddings into a text representation with LSTMEncoder
        self.lstm_encoder = paddlenlp.seq2vec.LSTMEncoder(
            emb_dim,
            lstm_hidden_size,
            num_layers=lstm_layers,
            direction=direction,
            dropout=dropout_rate,
            pooling_type=pooling_type)

        # LSTMEncoder.get_output_dim() returns the hidden size of the encoded text representation
        self.fc = nn.Linear(self.lstm_encoder.get_output_dim(), fc_hidden_size)

        # Final classifier
        self.output_layer = nn.Linear(fc_hidden_size, num_classes)

    def forward(self, text, seq_len):
        # Shape: (batch_size, num_tokens, embedding_dim)
        embedded_text = self.embedder(text)

        # Shape: (batch_size, num_directions*lstm_hidden_size)
        # num_directions = 2 if direction is 'bidirectional' else 1
        text_repr = self.lstm_encoder(embedded_text, sequence_length=seq_len)


        # Shape: (batch_size, fc_hidden_size)
        fc_out = paddle.tanh(self.fc(text_repr))

        # Shape: (batch_size, num_classes)
        logits = self.output_layer(fc_out)
        return logits

model_ = LSTMModel(
        len(vocab),
        2,
        direction='bidirectional',
        padding_idx=vocab['[PAD]'])

model = paddle.Model(model_)

W0328 12:38:25.235689 19403 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.4, Runtime API Version: 10.2
W0328 12:38:25.241302 19403 device_context.cc:465] device: 0, cuDNN Version: 7.6.


# Prepare the Model

## Train the model

In [10]:
import numpy as np
from functools import partial
from assets.utils import create_dataloader

def convert_example(example, tokenizer, is_test=False):
    input_ids = tokenizer.encode(example["text"])
    valid_length = np.array(len(input_ids), dtype='int64')
    input_ids = np.array(input_ids, dtype='int64')

    if not is_test:
        label = np.array(example["label"], dtype="int64")
        return input_ids, valid_length, label
    else:
        qid = np.array(example["qid"], dtype="int64")
        return input_ids, valid_length, qid

# Reads data and generates mini-batches.
tokenizer = JiebaTokenizer(vocab)
trans_fn = partial(convert_example, tokenizer=tokenizer, is_test=False)

batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=vocab.token_to_idx.get('[PAD]', 0)),  # input_ids
    Stack(dtype="int64"),  # seq len
    Stack(dtype="int64")  # label
): [data for data in fn(samples)]
                        
train_loader = create_dataloader(
    train_ds,
    trans_fn=trans_fn,
    batch_size=128,
    mode='train',
    batchify_fn=batchify_fn)
                        
dev_loader = create_dataloader(
    dev_ds,
    trans_fn=trans_fn,
    batch_size=128,
    mode='validation',
    batchify_fn=batchify_fn)

optimizer = paddle.optimizer.Adam(
    parameters=model.parameters(), learning_rate=5e-4)

# Defines loss and metric.
criterion = paddle.nn.CrossEntropyLoss()
metric = paddle.metric.Accuracy()

model.prepare(optimizer, criterion, metric)

model.fit(
    train_loader,
    dev_loader,
    epochs=3,
    save_dir=f'assets/{DATASET_NAME}-{MODEL_NAME}',
    verbose=1
)

The loss value printed in the log is the current step, and the metric is the average value of previous steps.
Epoch 1/3
save checkpoint at /root/codespace/InterpretDL-master/tutorials/assets/chnsenticorp-bilstm/0
Eval begin...
Eval samples: 1200
Epoch 2/3
save checkpoint at /root/codespace/InterpretDL-master/tutorials/assets/chnsenticorp-bilstm/1
Eval begin...
Eval samples: 1200
Epoch 3/3
save checkpoint at /root/codespace/InterpretDL-master/tutorials/assets/chnsenticorp-bilstm/2
Eval begin...
Eval samples: 1200
save checkpoint at /root/codespace/InterpretDL-master/tutorials/assets/chnsenticorp-bilstm/final
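
The `batchify_fn` in the training cell pads the token-id sequences of a mini-batch to a common length and stacks the sequence lengths and labels. A minimal pure-Python sketch of the padding step, assuming right-padding to the longest sequence in the batch; `pad_batch` is a hypothetical helper, not the PaddleNLP `Pad` implementation.

```python
def pad_batch(sequences, pad_val=0):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    padded = [list(seq) + [pad_val] * (max_len - len(seq)) for seq in sequences]
    # The original lengths let the encoder ignore the padded positions.
    lengths = [len(seq) for seq in sequences]
    return padded, lengths


batch = [[5, 8, 2], [7], [4, 9]]
padded, lengths = pad_batch(batch, pad_val=0)
print(padded)   # [[5, 8, 2], [7, 0, 0], [4, 9, 0]]
print(lengths)  # [3, 1, 2]
```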


## Or Load the trained model

In [11]:
# Load the trained model.
state_dict = paddle.load(f'assets/{DATASET_NAME}-{MODEL_NAME}/final.pdparams')
model_.set_dict(state_dict)

# See the prediction results

In [12]:
from assets.utils import predict

data = [
    {"text":'这个宾馆比较陈旧了，特价的房间也很一般。总体来说一般'},
    {"text":'怀着十分激动的心情放映，可是看着看着发现，在放映完毕后，出现一集米老鼠的动画片'},
    {"text":'作为老的四星酒店，房间依然很整洁，相当不错。机场接机服务很好，可以在车上办理入住手续，节省时间。'},
]

label_map = {0: 'negative', 1: 'positive'}

batch_size = 32

# results = predict(
#     model, data, tokenizer, label_map, batch_size=batch_size)

# for idx, text in enumerate(data):
#     print('Data: {} \t Label: {}'.format(text, results[idx]))
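
The `predict` helper lives in `assets/utils.py` and is not shown in this notebook. Presumably it turns the model's logits into one of the labels in `label_map`; the sketch below illustrates that step under this assumption, with a hypothetical `logits_to_label` function and made-up logits.

```python
import math

label_map = {0: "negative", 1: "positive"}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_to_label(logits, label_map):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return label_map[idx], probs[idx]


# Made-up logits for a single review.
label, prob = logits_to_label([2.0, 0.0], label_map)
print(label, round(prob, 3))  # negative 0.881
```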

# Prepare for Interpretations

In [13]:
import interpretdl as it
import numpy as np
from assets.utils import convert_example, aggregate_subwords_and_importances
from paddlenlp.data import Stack, Tuple, Pad
from interpretdl.data_processor.visualizer import VisualizationTextRecord, visualize_text

import jieba
def preprocess_fn(text):
    # Tokenize with jieba, the same way the interpreter cells below do.
    tokens = " ".join(jieba.cut(text)).split(' ')
    unk_id = vocab.token_to_idx.get('[UNK]', None)
    ids = []
    for token in tokens:
        # Out-of-vocabulary tokens fall back to the [UNK] id.
        wid = vocab.token_to_idx.get(token, unk_id)
        # A falsy id (None or 0) is skipped, so `ids` can be shorter than `tokens`.
        if wid:
            ids.append(wid)

    # Return a batch of one sample: the token ids and the sequence length.
    texts = paddle.to_tensor([ids])
    seq_lens = paddle.to_tensor([len(ids)])
    return texts, seq_lens

## IG Interpreter

In [17]:
ig = it.IntGradNLPInterpreter(model_, device='gpu:0')

true_labels = [0, 0, 1]
recs = []
for sample_idx, review in enumerate(data):
    text = review['text']

    pred_labels, pred_probs, avg_gradients = ig.interpret(
        preprocess_fn(text),
        steps=10,
        embedding_name='embedder',
        return_pred=True)

    for i in range(avg_gradients.shape[0]):
        subwords = " ".join(jieba.cut(text)).split(' ')
        subword_importances = avg_gradients[i]
        words, word_importances = aggregate_subwords_and_importances(subwords, subword_importances)
        word_importances = np.array(word_importances) / np.linalg.norm(
            word_importances)

        pred_label = pred_labels[i]
        pred_prob = pred_probs[i, pred_label]
        # Index the true label by review, not by position inside the batch of one.
        true_label = true_labels[sample_idx]
        interp_class = pred_label

        # Flip the sign when class 0 (negative) is the interpreted class.
        if interp_class == 0:
            word_importances = -word_importances
        recs.append(
            VisualizationTextRecord(words, word_importances, true_label,
                                    pred_label, pred_prob, interp_class)
        )

visualize_text(recs)
# The visualization is not available on GitHub

True Label,Predicted Label (Prob),Target Label,Word Importance
0.0,0 (0.82),0.0,这个 宾馆 比较 陈旧 了 ， 特价 的 房间 也 很 一般 。 总体 来说 一般
,,,
0.0,0 (0.92),0.0,怀着 十分 激动 的 心情 放映 ， 可是 看着 看着 发现 ， 在 放映 完毕 后 ， 出现 一集 米老鼠 的 动画片
,,,
1.0,1 (1.00),1.0,作为 老 的 四星 酒店 ， 房间 依然 很 整洁 ， 相当 不错 。 机场 接机 服务 很 好 ， 可以 在 车上 办理 入住 手续 ， 节省时间 。
,,,
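
To make the `steps` argument more concrete, here is a small sketch of the integrated-gradients idea on a toy function. It is not InterpretDL's implementation: the toy model, its analytic gradient, and the all-zero baseline are assumptions chosen so the example runs without Paddle. The gradient is averaged at `steps` points on the straight path from the baseline to the input and then scaled by the input-baseline difference.

```python
def toy_model(x, weights):
    """Toy differentiable 'model': f(x) = sum_i w_i * x_i**2."""
    return sum(w * xi ** 2 for w, xi in zip(weights, x))


def toy_gradient(x, weights):
    """Analytic gradient of toy_model with respect to x."""
    return [2.0 * w * xi for w, xi in zip(weights, x)]


def integrated_gradients(x, baseline, weights, steps=10):
    """Approximate IG with a midpoint Riemann sum along the straight path."""
    avg_grad = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = toy_gradient(point, weights)
        avg_grad = [a + g / steps for a, g in zip(avg_grad, grad)]
    # Scale the averaged gradient by the input-baseline difference.
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


weights = [1.0, -2.0, 0.5]
x = [1.0, 2.0, -3.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline, weights, steps=10)

# Completeness check: attributions should sum to f(x) - f(baseline).
gap = sum(attributions) - (toy_model(x, weights) - toy_model(baseline, weights))
print(attributions, abs(gap) < 1e-9)
```

For a real network the integrand is not linear in the path position, so more steps generally give a closer approximation at a higher cost.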


## LIME Interpreter

In [18]:
true_labels = [0, 0, 1]
recs = []

lime = it.LIMENLPInterpreter(model_, device='gpu:0')
for i, review in enumerate(data):
    review = review['text']
    pred_class, pred_prob, lime_weights = lime.interpret(
        review,
        preprocess_fn,
        num_samples=1000,
        batch_size=10,
        unk_id=vocab['[UNK]'],
        pad_id=0,
        return_pred=True)

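    # Tokenize the review with jieba, as preprocess_fn does.
    subwords = " ".join(jieba.cut(review)).split(' ')
    # Take the first interpreted class and keep the weight (second element)
    # of each entry, skipping the first and last entries.
    interp_class = list(lime_weights.keys())[0]
    subword_importances = [t[1] for t in lime_weights[interp_class][1 : -1]]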
    
    words, word_importances = subwords, subword_importances
    word_importances = np.array(word_importances) / np.linalg.norm(
        word_importances)
    
    true_label = true_labels[i]
    
    if interp_class == 0:
        word_importances = -word_importances
        
    rec = VisualizationTextRecord(
        words, 
        word_importances, 
        true_label,                   
        pred_class[0], 
        pred_prob[0],
        interp_class
    )
    
    recs.append(rec)

visualize_text(recs)
# The visualization is not available on GitHub

True Label,Predicted Label (Prob),Target Label,Word Importance
0.0,0 (0.82),0.0,这个 宾馆 比较 陈旧 了 ， 特价 的 房间 也 很 一般 。 总体
,,,
0.0,0 (0.92),0.0,怀着 十分 激动 的 心情 放映 ， 可是 看着 看着 发现 ， 在 放映 完毕 后 ， 出现 一集 米老鼠
,,,
1.0,1 (1.00),1.0,作为 老 的 四星 酒店 ， 房间 依然 很 整洁 ， 相当 不错 。 机场 接机 服务 很 好 ， 可以 在 车上 办理 入住 手续 ，
,,,
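
The idea behind LIME can be sketched without any deep-learning library: perturb the input by masking tokens, query the model on each perturbation, and fit a locally weighted linear surrogate whose coefficients serve as token weights. Everything below is hypothetical (the `black_box` scorer, the kernel, and enumerating all masks instead of drawing `num_samples` random ones); it is a sketch of the technique, not of `LIMENLPInterpreter` itself.

```python
import itertools
import math


def black_box(mask):
    """Hypothetical model score for a masked sentence (1 = token kept)."""
    token_effects = [0.6, -0.3, 0.1]
    return 0.2 + sum(e * m for e, m in zip(token_effects, mask))


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in reversed(range(n)):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def fit_lime_surrogate(num_tokens, kernel_width=0.75):
    """Fit a locally weighted linear surrogate over all binary token masks."""
    masks = list(itertools.product([0, 1], repeat=num_tokens))
    # Design matrix: an intercept column followed by one column per token.
    rows = [[1.0] + [float(m) for m in mask] for mask in masks]
    targets = [black_box(mask) for mask in masks]
    # Masks that drop fewer tokens are closer to the original and weigh more.
    sample_w = []
    for mask in masks:
        distance = (num_tokens - sum(mask)) / num_tokens
        sample_w.append(math.exp(-(distance ** 2) / kernel_width ** 2))
    dim = num_tokens + 1
    # Normal equations of weighted least squares: (X^T W X) beta = X^T W y.
    xtwx = [
        [sum(w * r[i] * r[j] for w, r in zip(sample_w, rows)) for j in range(dim)]
        for i in range(dim)
    ]
    xtwy = [
        sum(w * r[i] * t for w, r, t in zip(sample_w, rows, targets))
        for i in range(dim)
    ]
    beta = solve_linear_system(xtwx, xtwy)
    return beta[1:]  # drop the intercept, keep one weight per token


token_weights = fit_lime_surrogate(num_tokens=3)
print([round(w, 6) for w in token_weights])
```

Because the toy scorer is exactly linear in the mask, the surrogate recovers its per-token effects; a real classifier is only approximated locally around the original sentence.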


## GradShap Interpreter

In [19]:
gradshap = it.GradShapNLPInterpreter(model_, device='gpu:0')

true_labels = [0, 0, 1]
recs = []

for sample_idx, review in enumerate(data):
    text = review['text']

    pred_labels, pred_probs, avg_gradients = gradshap.interpret(
        preprocess_fn(text),
        n_samples=50,
        noise_amount=0.1,
        embedding_name='embedder',
        return_pred=True)

    for i in range(avg_gradients.shape[0]):
        subwords = " ".join(jieba.cut(text)).split(' ')
        subword_importances = avg_gradients[i]
        words, word_importances = aggregate_subwords_and_importances(subwords, subword_importances)
        word_importances = np.array(word_importances) / np.linalg.norm(
            word_importances)

        pred_label = pred_labels[i]
        pred_prob = pred_probs[i, pred_label]
        # Index the true label by review, not by position inside the batch of one.
        true_label = true_labels[sample_idx]
        interp_class = pred_label

        # Flip the sign when class 0 (negative) is the interpreted class.
        if interp_class == 0:
            word_importances = -word_importances
        recs.append(
            VisualizationTextRecord(words, word_importances, true_label,
                                    pred_label, pred_prob, interp_class)
        )

visualize_text(recs)
# The visualization is not available on GitHub

True Label,Predicted Label (Prob),Target Label,Word Importance
0.0,0 (0.82),0.0,这个 宾馆 比较 陈旧 了 ， 特价 的 房间 也 很 一般 。 总体 来说 一般
,,,
0.0,0 (0.92),0.0,怀着 十分 激动 的 心情 放映 ， 可是 看着 看着 发现 ， 在 放映 完毕 后 ， 出现 一集 米老鼠 的 动画片
,,,
1.0,1 (1.00),1.0,作为 老 的 四星 酒店 ， 房间 依然 很 整洁 ， 相当 不错 。 机场 接机 服务 很 好 ， 可以 在 车上 办理 入住 手续 ， 节省时间 。
,,,
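
The sketch below shows one commonly described gradient-SHAP recipe on a toy function: add Gaussian noise to the input, pick a random point between the baseline and the noisy input, and average the gradient times the input-baseline difference over `n_samples` draws. Treating `n_samples` and `noise_amount` this way is an assumption about what the interpreter's arguments control; the toy gradient, the zero baseline, and the `grad_shap` helper are hypothetical.

```python
import random


def toy_gradient(x, weights):
    """Gradient of the toy model f(x) = sum_i w_i * x_i**2."""
    return [2.0 * w * xi for w, xi in zip(weights, x)]


def grad_shap(x, baseline, weights, n_samples=50, noise_amount=0.1, seed=0):
    """Average gradient * (input - baseline) over noisy points on the path."""
    rng = random.Random(seed)
    totals = [0.0] * len(x)
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, noise_amount) for xi in x]
        alpha = rng.random()  # random position between baseline and noisy input
        point = [b + alpha * (ni - b) for ni, b in zip(noisy, baseline)]
        grad = toy_gradient(point, weights)
        totals = [
            t + g * (ni - b)
            for t, g, ni, b in zip(totals, grad, noisy, baseline)
        ]
    return [t / n_samples for t in totals]


weights = [1.0, -2.0, 0.5]
x = [1.0, 2.0, -3.0]
baseline = [0.0, 0.0, 0.0]
attributions = grad_shap(x, baseline, weights)

# Normalize to unit L2 norm, as the notebook does before visualization.
norm = sum(a * a for a in attributions) ** 0.5
print([round(a / norm, 3) for a in attributions])
```

Unlike the fixed path used for integrated gradients, the result here depends on the random draws, so a fixed seed is used to keep the sketch reproducible.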
