# Discovering Important Words for Sentiments With NormLIME

This notebook loads a pretrained BiLSTM model, following the [PaddleNLP TextClassification](https://github.com/PaddlePaddle/models/tree/release/2.0-beta/PaddleNLP/examples/text_classification/rnn) example, and performs sentiment analysis on review data. The full official PaddlePaddle sentiment classification tutorial can be found [here](https://github.com/PaddlePaddle/models/tree/release/2.0-beta/PaddleNLP/examples/text_classification).

The NormLIME method aggregates local models into global and class-specific interpretations, which makes it suited to finding features that matter across a whole dataset rather than in a single prediction. In this notebook, we use NormLIME, specifically `NormLIMENLPInterpreter`, to discover the words that contribute the most to positive and negative sentiment predictions.
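
As a rough illustration of the aggregation step, the sketch below averages normalized local weights per feature. It assumes the normalization described in the NormLIME paper, where each local weight is scaled by its share of the explanation's L1 norm. The function name `aggregate_normlime` and the toy numbers are hypothetical, and InterpretDL's internal implementation may differ.

```python
from collections import defaultdict


def aggregate_normlime(local_explanations):
    """Aggregate local LIME weights into global ones (illustrative only).

    local_explanations: list of dicts mapping a feature id to its local weight
    for one class. Returns {feature_id: (mean_importance, count)}.
    """
    contributions = defaultdict(list)
    for weights in local_explanations:
        l1_norm = sum(abs(w) for w in weights.values())
        if l1_norm == 0:
            continue  # an all-zero explanation carries no information
        for feature_id, w in weights.items():
            # |w| scaled by its share of the L1 norm: |w| * |w| / ||w||_1
            contributions[feature_id].append(w * w / l1_norm)
    return {
        feature_id: (sum(values) / len(values), len(values))
        for feature_id, values in contributions.items()
    }


# Two toy local explanations for the same class (made-up numbers).
toy_explanations = [{1: 0.6, 2: -0.2}, {1: 0.3, 3: 0.1}]
global_weights = aggregate_normlime(toy_explanations)
print(global_weights)
```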

Install PaddleNLP first:
``` bash
pip install setuptools_scm 
pip install --upgrade paddlenlp==2.1 
```

In [1]:
import paddle
import numpy as np
import interpretdl as it
import jieba
!ln -s ../tutorials/assets assets

In [2]:
import warnings 
warnings.filterwarnings("ignore")

Load the vocabulary and specify the path to the pretrained model. Set `unk_id` to the word id of the *\[UNK\]* token. Other possible choices include the empty token *\"\"* and the *\[PAD\]* token.

Follow our tutorial (`tutorials/bilstm-zh-chnsenticorp.ipynb`) to get the pretrained weights.

In [3]:
def load_vocab(vocab_file):
    """Loads a vocabulary file into a dictionary."""
    vocab = {}
    with open(vocab_file, "r", encoding="utf-8") as reader:
        tokens = reader.readlines()
    for index, token in enumerate(tokens):
        token = token.rstrip("\n").split("\t")[0]
        vocab[token] = index
    return vocab

PARAMS_PATH = "assets/final.pdparams"
VOCAB_PATH = "assets/senta_word_dict.txt"

vocab = load_vocab(VOCAB_PATH)
unk_token_id = vocab['[UNK]']
pad_token_id = vocab['[PAD]']
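
To make the lookup behaviour concrete, here is a small self-contained sketch. It assumes a vocabulary file with one token per line, as `load_vocab` expects. The file contents, the helper name `load_toy_vocab`, and the token order are made up for illustration and do not reflect the real `senta_word_dict.txt`.

```python
import os
import tempfile


def load_toy_vocab(vocab_file):
    """Map each token (first tab-separated field per line) to its line index."""
    vocab = {}
    with open(vocab_file, "r", encoding="utf-8") as reader:
        for index, line in enumerate(reader):
            vocab[line.rstrip("\n").split("\t")[0]] = index
    return vocab


# Write a tiny, made-up vocabulary file: one token per line.
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_word_dict.txt")
    with open(toy_path, "w", encoding="utf-8") as writer:
        writer.write("[PAD]\n[UNK]\n很\n好\n")
    toy_vocab = load_toy_vocab(toy_path)

toy_unk_id = toy_vocab["[UNK]"]
# Known tokens map to their own id; unseen tokens fall back to [UNK].
toy_ids = [toy_vocab.get(token, toy_unk_id) for token in ["很", "好", "差"]]
print(toy_ids)  # [2, 3, 1]
```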

Define the BiLSTM model on top of `paddlenlp.seq2vec.LSTMEncoder`, then load the pretrained weights.

In [4]:
import paddle
import paddlenlp
import paddle.nn as nn
import paddle.nn.functional as F

class LSTMModel(nn.Layer):
    def __init__(self,
                 vocab_size,
                 num_classes,
                 emb_dim=128,
                 padding_idx=0,
                 lstm_hidden_size=198,
                 direction='forward',
                 lstm_layers=1,
                 dropout_rate=0.0,
                 pooling_type=None,
                 fc_hidden_size=96):
        super().__init__()

        # First, look up the input word ids and map them to word embeddings
        self.embedder = nn.Embedding(
            num_embeddings=vocab_size,
            embedding_dim=emb_dim,
            padding_idx=padding_idx)

        # Pass the word embeddings through LSTMEncoder to map them into the text semantic representation space
        self.lstm_encoder = paddlenlp.seq2vec.LSTMEncoder(
            emb_dim,
            lstm_hidden_size,
            num_layers=lstm_layers,
            direction=direction,
            dropout=dropout_rate,
            pooling_type=pooling_type)

        # LSTMEncoder.get_output_dim() returns the hidden size of the text representation produced by the encoder
        self.fc = nn.Linear(self.lstm_encoder.get_output_dim(), fc_hidden_size)

        # The final classifier
        self.output_layer = nn.Linear(fc_hidden_size, num_classes)

    def forward(self, text, seq_len):
        # Shape: (batch_size, num_tokens, embedding_dim)
        embedded_text = self.embedder(text)

        # Shape: (batch_size, num_directions*lstm_hidden_size)
        # num_directions = 2 if direction is 'bidirectional' else 1
        text_repr = self.lstm_encoder(embedded_text, sequence_length=seq_len)


        # Shape: (batch_size, fc_hidden_size)
        fc_out = paddle.tanh(self.fc(text_repr))

        # Shape: (batch_size, num_classes)
        logits = self.output_layer(fc_out)
        return logits

model = LSTMModel(
    len(vocab),
    2,
    direction='bidirectional',
    padding_idx=vocab['[PAD]'])



In [5]:
PARAMS_PATH = 'assets/chnsenticorp-bilstm/final.pdparams'
state_dict = paddle.load(PARAMS_PATH)
model.set_dict(state_dict)
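
The model's `forward` returns raw logits, one per class. To read them as class probabilities, a softmax is typically applied. The sketch below does this in plain Python on made-up logit values. Whether the interpreter applies softmax internally is an assumption here, not something this notebook states.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    peak = max(logits)
    # Subtract the maximum before exponentiating for numerical stability.
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for the two classes: index 0 = negative, index 1 = positive.
example_logits = [-1.2, 2.3]
example_probs = softmax(example_logits)
predicted_label = example_probs.index(max(example_probs))
print(example_probs, predicted_label)
```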

Define a preprocessing function that takes **a raw string** and returns the model inputs that can be fed into the model.

In this case, the raw string is split into words and mapped to word ids. *texts* is a list of lists, each containing a sequence of padded word ids. *seq_lens* is a list containing the unpadded length of each sequence in *texts*.

Since the input is a single raw string, both *texts* and *seq_lens* have length 1.
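
For intuition about the two outputs, the following sketch pads a batch of several id sequences and records their true lengths. With a single input string, as in this notebook, no padding is needed. The helper name `pad_batch` and the id values are hypothetical.

```python
def pad_batch(id_sequences, pad_id=0):
    """Right-pad every sequence to the longest one and record true lengths."""
    true_lens = [len(ids) for ids in id_sequences]
    max_len = max(true_lens)
    padded = [ids + [pad_id] * (max_len - len(ids)) for ids in id_sequences]
    return padded, true_lens


# Two made-up sequences of word ids with different lengths.
padded_texts, true_lens = pad_batch([[5, 8, 13], [7]], pad_id=0)
print(padded_texts)  # [[5, 8, 13], [7, 0, 0]]
print(true_lens)     # [3, 1]
```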

In [7]:
def preprocess_fn(text):
    texts = []
    seq_lens = []

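    # Segment the raw string into words with jieba.
    tokens = " ".join(jieba.cut(text)).split(' ')
    ids = []
    unk_id = vocab.get('[UNK]', None)
    for token in tokens:
        # Map out-of-vocabulary words to [UNK].
        wid = vocab.get(token, unk_id)
        # Skip tokens without a usable id (None, or id 0).
        if wid:
            ids.append(wid)
    texts.append(ids)
    seq_lens.append(len(ids))

    # A single input string needs no padding, so the batch has size 1.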

    texts = paddle.to_tensor(texts)
    seq_lens = paddle.to_tensor(seq_lens)
    return texts, seq_lens

We use the first 1200 samples in the training set as our data.

In [8]:
from paddlenlp.datasets import load_dataset
DATASET_NAME = 'chnsenticorp'
train_ds, dev_ds, test_ds = load_dataset(
    DATASET_NAME, splits=["train", "dev", "test"]
)


In [9]:
data = [d['text'] for d in list(train_ds)[:1200]]
print('total of %d sentences' % len(data))

total of 1200 sentences


Initialize the `NormLIMENLPInterpreter`. Intermediate results are saved to a *.npz* file so that rerunning on the same dataset does not repeat the whole computation.

In [10]:
normlime = it.NormLIMENLPInterpreter(model, device='gpu:0')

Run `interpret` on the whole dataset. This may take some time.

In [None]:
normlime_weights = normlime.interpret(
    data,
    preprocess_fn,
    unk_id=unk_token_id,
    pad_id=pad_token_id,
    num_samples=500,
    batch_size=50, 
    temp_data_file='assets/all_lime_weights_nlp.npz')
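
The `num_samples` and `unk_id` arguments relate to how LIME-style methods probe the model: each sentence is perturbed many times by masking some of its tokens, and a local linear model is fitted on the resulting predictions. The sketch below shows only the perturbation step, under the assumption that masked tokens are replaced by `unk_id`. The function `perturb_ids` is hypothetical and is not InterpretDL's actual sampling code.

```python
import random


def perturb_ids(word_ids, unk_id, num_samples, seed=0):
    """Generate perturbed copies of a sentence by masking random tokens.

    Returns (masks, samples): masks[i][j] is 1 if token j was kept in sample i,
    and samples[i] is the id sequence with masked tokens replaced by unk_id.
    """
    rng = random.Random(seed)
    masks, samples = [], []
    for _ in range(num_samples):
        mask = [rng.randint(0, 1) for _ in word_ids]
        sample = [wid if keep else unk_id for wid, keep in zip(word_ids, mask)]
        masks.append(mask)
        samples.append(sample)
    return masks, samples


# Made-up ids for a four-token sentence; 1 plays the role of unk_id.
masks, samples = perturb_ids([12, 40, 7, 99], unk_id=1, num_samples=5)
for mask, sample in zip(masks, samples):
    print(mask, sample)
```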

In the cells below, we print the 20 words with the largest weights for positive and negative sentiment. Only words that appear at least 5 times are included.

In [12]:
import pandas as pd
id2word = dict(zip(vocab.values(), vocab.keys()))
# Positive 
temp = {
    id2word[wid]: normlime_weights[1][wid]
    for wid in normlime_weights[1]
}
W = [(word, weight[0], weight[1]) for word, weight in temp.items() if  weight[1] >= 5]
pd.DataFrame(data = sorted(W, key=lambda x: -x[1])[:20], columns = ['word', 'weight', 'frequency'])

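| | word | weight | frequency |
|---|---|---|---|
| 0 | 每次 (every time) | 0.031743 | 12 |
| 1 | 实惠 (good value) | 0.024094 | 10 |
| 2 | 精致 (exquisite) | 0.022729 | 5 |
| 3 | 很漂亮 (very pretty) | 0.01989 | 11 |
| 4 | 超值 (great value) | 0.015587 | 11 |
| 5 | 小巧 (small and compact) | 0.013401 | 12 |
| 6 | 没得说 (beyond reproach) | 0.013223 | 5 |
| 7 | 时尚 (fashionable) | 0.012933 | 7 |
| 8 | 大方 (tasteful) | 0.012203 | 5 |
| 9 | 出差 (business trip) | 0.011033 | 6 |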


In [13]:
# Negative
temp = {
    id2word[wid]: normlime_weights[0][wid]
    for wid in normlime_weights[0]
}
W = [(word, weight[0], weight[1]) for word, weight in temp.items() if  weight[1] >= 5]
pd.DataFrame(data = sorted(W, key=lambda x: -x[1])[:20], columns = ['word', 'weight', 'frequency'])

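| | word | weight | frequency |
|---|---|---|---|
| 0 | 很差 (very bad) | 0.04336 | 13 |
| 1 | 穴位 (acupuncture point) | 0.033197 | 5 |
| 2 | 偏 (skewed) | 0.029332 | 5 |
| 3 | 差 (bad) | 0.023744 | 62 |
| 4 | 宣传 (advertising) | 0.021885 | 5 |
| 5 | 缺点 (shortcoming) | 0.021287 | 9 |
| 6 | 声卡 (sound card) | 0.021061 | 6 |
| 7 | 求医 (seeking medical treatment) | 0.020905 | 6 |
| 8 | 发热量 (heat output) | 0.018769 | 8 |
| 9 | 热 (hot) | 0.017466 | 21 |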
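
The two cells above repeat the same filter-and-sort steps for each class. One way to factor them out is a small helper like the one below. It is a sketch that assumes each entry is a `(weight, frequency)` pair, mirroring how `normlime_weights[label][wid]` is indexed above. The name `top_words` and the toy data are hypothetical.

```python
def top_words(class_weights, id2word, min_frequency=5, k=20):
    """Return up to k (word, weight, frequency) tuples, largest weight first.

    class_weights maps a word id to a (weight, frequency) pair.
    """
    rows = [
        (id2word[wid], weight, frequency)
        for wid, (weight, frequency) in class_weights.items()
        if frequency >= min_frequency
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)[:k]


# Made-up weights for illustration only.
toy_id2word = {0: "a", 1: "b", 2: "c"}
toy_weights = {0: (0.01, 12), 1: (0.05, 3), 2: (0.03, 7)}
toy_top = top_words(toy_weights, toy_id2word)
print(toy_top)  # [('c', 0.03, 7), ('a', 0.01, 12)]
```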
