# Sentiment Analysis

This tutorial demonstrates how to use nlp_toolkit to train a classification model and predict new samples. The task is binary sentiment classification.

The dataset was crawled from Kanzhun.com and Dajie.com and consists of comments about the pros and cons of companies.

Available models:

1. Bi-LSTM Attention
2. Multi Head Self Attention
3. TextCNN

In [1]:
import sys
sys.path.append('../')
from nlp_toolkit import Dataset, Classifier

Using TensorFlow backend.


## Data Processing

### Load config dict

In [27]:
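import yaml
# safe_load parses plain YAML only; calling yaml.load without a Loader is deprecated
with open('../nlp_toolkit/config.yaml') as f:
    config = yaml.safe_load(f)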

### Load Data

In [28]:
dataset = Dataset(fname='../data/company_pro_con.txt', task_type='classification', mode='train', segment=False, config=config)

2018-11-10 20:10:20,439 - data.py[line:89] - INFO: data loaded


In [4]:
for x, y in zip(dataset.texts[0:10], dataset.labels[0:10]):
    print(x, y)

进去 前 许诺 的 工资 给 的 高 0
校园 环境 优美 ， 美女 很多 ， 适合 居住 ， 食堂 饭菜 便宜 ， 操场 好 ， 可以 天天 运动 0
老板 人 很好 老 员工 会 各种 教 你 东西 ， 而且 不会 有所 保留 薪水 在 大连 还 算 可以 0
人员 比较 多 ， 复杂   办公室 容易 形成 拉帮结派 不利于 企业 发展 1
出差 太多 了 。 在 现场 开发 很苦 逼 。 1
公司 目前 地理位置 不 太 理想 ， 离 城市 中心 较 远点 。 1
公司 的 技术 水平 国内 顶尖 ， 十几 年 的 资历 ， 制作 的 作品 几乎 都 是 精品 ， 参与 过 很多 知名 项目 。 0
工作 流程 复杂     个人 上升 空间 有限     新产品 的 创新 能力 有限   组织 架构 稍 显 臃肿 1
无偿 加班 ， 加班 多 ， 没 加班费 ， 压力 很大 1
环境 比较 轻松 ， 跟 项目 走 ， 能 学 不少 专业 知识 ， 经验 很 重要 0
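Judging from the printed samples, each text is already segmented into whitespace-separated tokens (hence `segment=False`), and label 0 appears to mark pros while label 1 marks cons. A quick sanity check on class balance and sequence length can be sketched as below. The `texts` and `labels` lists are toy stand-ins for `dataset.texts` and `dataset.labels`, and storing labels as integers is an assumption.

```python
from collections import Counter

# Toy stand-ins for dataset.texts / dataset.labels, copied from the sample above.
texts = [
    "进去 前 许诺 的 工资 给 的 高",
    "出差 太多 了 。 在 现场 开发 很苦 逼 。",
    "无偿 加班 ， 加班 多 ， 没 加班费 ， 压力 很大",
    "环境 比较 轻松 ， 跟 项目 走 ， 能 学 不少 专业 知识 ， 经验 很 重要",
]
labels = [0, 1, 1, 0]  # assumption: labels are stored as integers


def summarize(texts, labels):
    """Return (label counts, token count per sample) for pre-segmented texts."""
    label_counts = Counter(labels)
    lengths = [len(text.split()) for text in texts]
    return label_counts, lengths


label_counts, lengths = summarize(texts, labels)
print(label_counts)
print("max length:", max(lengths), "mean length:", sum(lengths) / len(lengths))
```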


### Transform data to index

In [29]:
# using pre-trained embeddings requires an embedding file in gensim format
x, y, config = dataset.transform()
print(x['word'].shape, y.shape)

2018-11-10 20:10:31,165 - data.py[line:116] - INFO: texts and labels transformed to number index
2018-11-10 20:10:35,576 - utilities.py[line:73] - INFO: OOV rate: 0.00 %
2018-11-10 20:10:35,593 - data.py[line:122] - INFO: Loaded Pre_trained Embeddings


(94635, 100) (94635, 2)
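The shapes above suggest that each text becomes a row of 100 token ids and each label a one-hot vector of size 2. The following is a minimal sketch of that kind of transformation in plain Python. It is not the nlp_toolkit implementation: the helper names, the `<pad>`/`<unk>` tokens, and the post-padding scheme are all assumptions made for illustration.

```python
def build_vocab(tokenized_texts, specials=("<pad>", "<unk>")):
    """Assign an integer id to every token, reserving ids for special tokens."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    for tokens in tokenized_texts:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def texts_to_index(tokenized_texts, vocab, maxlen):
    """Map tokens to ids, truncating or right-padding every row to maxlen."""
    rows = []
    for tokens in tokenized_texts:
        ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokens[:maxlen]]
        rows.append(ids + [vocab["<pad>"]] * (maxlen - len(ids)))
    return rows


def one_hot(labels, nb_classes):
    """Turn integer class labels into one-hot rows."""
    return [[1 if k == label else 0 for k in range(nb_classes)] for label in labels]


tokenized = [text.split() for text in ["老板 人 很好", "出差 太多 了 。"]]
vocab = build_vocab(tokenized)
x_toy = texts_to_index(tokenized, vocab, maxlen=100)
y_toy = one_hot([0, 1], nb_classes=2)
print((len(x_toy), len(x_toy[0])), (len(y_toy), len(y_toy[0])))
```

With two toy sentences the result has shape (2, 100) for the inputs and (2, 2) for the labels, mirroring the (94635, 100) and (94635, 2) reported above.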


To inspect the vocabulary and label index mappings, uncomment the lines below:

In [6]:
# dataset.transformer._word_vocab._token2id

In [7]:
# dataset.transformer._label_vocab._token2id

In [31]:
transformer = dataset.transformer

## Classifier Training

### Define classifier

In [32]:
model_name='bi_lstm_att'
# to get attention weights during prediction, enable return_att in the model config
config[model_name]['return_att'] = True
text_classifier = Classifier(model_name=model_name, transformer=transformer, seq_type='bucket', config=config)
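The `seq_type='bucket'` option groups samples of similar length so that each batch needs less padding. The idea can be sketched as follows. This is an illustration of length bucketing in general, not the library's `sequence.py`; the function names and the `bucket_width` parameter are hypothetical.

```python
from collections import defaultdict


def bucket_by_length(sequences, bucket_width=1):
    """Group sample indices by padded length (next multiple of bucket_width)."""
    buckets = defaultdict(list)
    for idx, seq in enumerate(sequences):
        # Ceiling division gives the smallest multiple of bucket_width >= len(seq)
        key = -(-len(seq) // bucket_width) * bucket_width
        buckets[key].append(idx)
    return dict(buckets)


def bucketed_padding(sequences, buckets):
    """Count pad positions when each sample is padded to its bucket length."""
    return sum(key - len(sequences[i]) for key, idxs in buckets.items() for i in idxs)


def global_padding(sequences):
    """Count pad positions when every sample is padded to the longest sample."""
    longest = max(len(seq) for seq in sequences)
    return sum(longest - len(seq) for seq in sequences)


seqs = [[1] * 3, [1] * 4, [1] * 9, [1] * 10, [1] * 3]
buckets = bucket_by_length(seqs, bucket_width=5)
print(buckets)
print("padding with buckets:", bucketed_padding(seqs, buckets))
print("padding without buckets:", global_padding(seqs))
```

In this toy case bucketing cuts the number of padded positions from 21 to 6, which is where the training speed-up comes from.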

### Train Model

In [10]:
trained_model = text_classifier.train(x, y)

2018-11-10 19:08:34,366 - trainer.py[line:113] - INFO: bi_lstm_att model structure...
2018-11-10 19:08:34,385 - trainer.py[line:123] - INFO: train/valid set: 75708/18927
2018-11-10 19:08:34,386 - trainer.py[line:80] - INFO: use bucket sequence to speed up model training
2018-11-10 19:08:34,388 - sequence.py[line:300] - INFO: Training with 99 non-empty buckets


__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
word (InputLayer)               (None, None)         0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, None, 300)    7333200     word[0][0]                       
__________________________________________________________________________________________________
activation_1 (Activation)       (None, None, 300)    0           embedding[0][0]                  
__________________________________________________________________________________________________
embed_drop (SpatialDropout1D)   (None, None, 300)    0           activation_1[0][0]               
__________________________________________________________________________________________________
bi_lstm_0 

2018-11-10 19:08:34,725 - sequence.py[line:300] - INFO: Training with 98 non-empty buckets


mointor training process using f1 score
Successfully made a directory: models/bi_lstm_att_201811101908
using Early Stopping
using Reduce LR On Plateau
tracking loss history and metrics


2018-11-10 19:08:35,100 - trainer.py[line:154] - INFO: saving model parameters and transformer to models/bi_lstm_att_201811101908


model hyperparameters:
 {'nb_classes': 2, 'nb_tokens': 24444, 'maxlen': None, 'embedding_dim': 300, 'rnn_size': 512, 'attention_dim': 128, 'embed_dropout_rate': 0.25, 'final_dropout_rate': 0.5, 'return_attention': True}
Epoch 1/25
0 - f1: 93.41
1 - f1: 92.46

Epoch 00001: f1 improved from -inf to 0.92963, saving model to models/bi_lstm_att_201811101908/model_weights_01_0.8823_0.9296.h5
Epoch 2/25
0 - f1: 94.31
1 - f1: 93.63

Epoch 00002: f1 improved from 0.92963 to 0.93985, saving model to models/bi_lstm_att_201811101908/model_weights_02_0.8846_0.9398.h5
Epoch 3/25
0 - f1: 94.43
1 - f1: 93.93

Epoch 00003: f1 improved from 0.93985 to 0.94196, saving model to models/bi_lstm_att_201811101908/model_weights_03_0.8746_0.9420.h5
Epoch 4/25
0 - f1: 94.39
1 - f1: 93.73

Epoch 00004: f1 did not improve from 0.94196
Epoch 5/25
0 - f1: 94.46
1 - f1: 93.83

Epoch 00005: f1 did not improve from 0.94196
Epoch 6/25
0 - f1: 94.48
1 - f1: 93.88

Epoch 00006: f1 did not improve from 0.94196
best f1: 0.9
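The lines `0 - f1: ...` and `1 - f1: ...` in the log report an F1 score per class. As a reminder of what that metric measures, here is a small self-contained sketch that computes per-class F1 from integer labels. How the library aggregates the two values into the single monitored `f1` is not shown in the log, so this sketch stops at the per-class scores.

```python
def per_class_f1(y_true, y_pred, nb_classes):
    """Return a list with the F1 score of each class (0.0 when undefined)."""
    scores = []
    for c in range(nb_classes):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        # F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


y_true_toy = [0, 0, 0, 0, 1, 1]
y_pred_toy = [0, 0, 0, 1, 1, 1]
f1_scores = per_class_f1(y_true_toy, y_pred_toy, nb_classes=2)
for c, score in enumerate(f1_scores):
    print("{} - f1: {:.2f}".format(c, 100 * score))
```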

### 10-fold training

In [None]:
config['train']['train_mode'] = 'fold'
text_classifier = Classifier(model_name=model_name, transformer=transformer, seq_type='bucket', config=config)
text_classifier.train(x, y)

2018-11-10 20:13:01,319 - trainer.py[line:169] - INFO: 10-fold starts!



------------------------ fold 0------------------------


2018-11-10 20:13:04,139 - trainer.py[line:180] - INFO: bi_lstm_att model structure...
2018-11-10 20:13:04,271 - trainer.py[line:80] - INFO: use bucket sequence to speed up model training
2018-11-10 20:13:04,274 - sequence.py[line:300] - INFO: Training with 99 non-empty buckets


__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
word (InputLayer)               (None, None)         0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, None, 300)    7333200     word[0][0]                       
__________________________________________________________________________________________________
activation_4 (Activation)       (None, None, 300)    0           embedding[0][0]                  
__________________________________________________________________________________________________
embed_drop (SpatialDropout1D)   (None, None, 300)    0           activation_4[0][0]               
__________________________________________________________________________________________________
bi_lstm_0 

2018-11-10 20:13:04,632 - sequence.py[line:300] - INFO: Training with 96 non-empty buckets


mointor training process using f1 score
using Early Stopping
using Reduce LR On Plateau
tracking loss history and metrics
Epoch 1/25
0 - f1: 94.37
1 - f1: 93.73
Epoch 2/25
0 - f1: 94.72
1 - f1: 94.30
Epoch 3/25
0 - f1: 94.95
1 - f1: 94.38
Epoch 4/25
0 - f1: 94.95
1 - f1: 94.42
Epoch 5/25
0 - f1: 94.99
1 - f1: 94.48
Epoch 6/25
0 - f1: 95.05
1 - f1: 94.51
Epoch 7/25
0 - f1: 95.09
1 - f1: 94.56
Epoch 8/25
0 - f1: 95.08
1 - f1: 94.54
Epoch 9/25
0 - f1: 95.03
1 - f1: 94.48
Epoch 10/25
0 - f1: 95.07
1 - f1: 94.54

------------------------ fold 1------------------------


2018-11-10 20:40:50,748 - trainer.py[line:80] - INFO: use bucket sequence to speed up model training
2018-11-10 20:40:50,751 - sequence.py[line:300] - INFO: Training with 99 non-empty buckets
2018-11-10 20:40:51,101 - sequence.py[line:300] - INFO: Training with 97 non-empty buckets


mointor training process using f1 score
using Early Stopping
using Reduce LR On Plateau
tracking loss history and metrics
Epoch 1/25
0 - f1: 92.48
1 - f1: 92.05
Epoch 2/25
0 - f1: 92.92
1 - f1: 92.58
Epoch 3/25
0 - f1: 93.33
1 - f1: 93.01
Epoch 4/25
0 - f1: 93.71
1 - f1: 93.32
Epoch 5/25
0 - f1: 93.94
1 - f1: 93.62
Epoch 6/25
0 - f1: 94.16
1 - f1: 93.76
Epoch 7/25
0 - f1: 94.24
1 - f1: 93.92
Epoch 8/25
0 - f1: 94.60
1 - f1: 94.25
Epoch 9/25
0 - f1: 94.78
1 - f1: 94.40
Epoch 10/25
0 - f1: 94.77
1 - f1: 94.45
Epoch 11/25
0 - f1: 94.89
1 - f1: 94.60
Epoch 12/25
0 - f1: 95.10
1 - f1: 94.76
Epoch 13/25
0 - f1: 95.18
1 - f1: 94.88
Epoch 14/25
0 - f1: 95.17
1 - f1: 94.81
Epoch 15/25

## Predict New Samples

### Load data and transformer

In [2]:
dataset = Dataset('../data/company_pro_con_predict.txt',
                  task_type='classification', mode='predict',
                  tran_fname='models/bi_lstm_att_201811101908/transformer.h5',
                  segment=False)
x_seq = dataset.transform()

2018-11-10 19:44:27,999 - data.py[line:73] - INFO: transformer loaded
2018-11-10 19:44:28,283 - data.py[line:89] - INFO: data loaded


data transformer loaded
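In predict mode the saved transformer is reused, so the vocabulary is fixed and tokens that were never seen during training cannot receive new ids. A rough way to check how much of a new file falls outside the vocabulary is sketched below. The `vocab` dict is a toy stand-in, and the computation is an assumption about what an OOV rate means, not the code behind the `OOV rate` log line.

```python
def oov_rate(tokenized_texts, vocab):
    """Return the percentage of tokens that are missing from vocab."""
    total = sum(len(tokens) for tokens in tokenized_texts)
    unknown = sum(1 for tokens in tokenized_texts for tok in tokens if tok not in vocab)
    return 100.0 * unknown / total if total else 0.0


# Toy vocabulary standing in for the mapping stored in the saved transformer.
vocab_toy = {"<pad>": 0, "<unk>": 1, "老板": 2, "人": 3, "很好": 4}
new_texts = [["老板", "很好"], ["食堂", "很好"]]
rate = oov_rate(new_texts, vocab_toy)
print("OOV rate: {:.2f} %".format(rate))
```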


### Load model

In [3]:
text_classifier = Classifier('bi_lstm_att', dataset.transformer)
text_classifier.load(weight_fname='models/bi_lstm_att_201811101908/model_weights_03_0.8746_0.9420.h5',
                     para_fname='models/bi_lstm_att_201811101908/model_parameters.json')

model loaded
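The weight file loaded above is the epoch 3 checkpoint, whose last filename field (0.9420) matches the best f1 in the training log (0.94196). Assuming that the last numeric field of each checkpoint name is the monitored f1, picking the best file automatically could look like the sketch below. The regular expression and the helper name are hypothetical, and the meaning of the middle field is not stated in the log.

```python
import re

# Assumed pattern: model_weights_<epoch>_<other metric>_<f1>.h5
CHECKPOINT_RE = re.compile(r"model_weights_(\d+)_(\d+\.\d+)_(\d+\.\d+)\.h5$")


def best_checkpoint(filenames):
    """Return the filename with the highest value in its last numeric field."""
    best_name, best_score = None, float("-inf")
    for name in filenames:
        match = CHECKPOINT_RE.search(name)
        if match is None:
            continue  # skip files that are not weight checkpoints
        score = float(match.group(3))
        if score > best_score:
            best_name, best_score = name, score
    return best_name


checkpoints = [
    "models/bi_lstm_att_201811101908/model_weights_01_0.8823_0.9296.h5",
    "models/bi_lstm_att_201811101908/model_weights_02_0.8846_0.9398.h5",
    "models/bi_lstm_att_201811101908/model_weights_03_0.8746_0.9420.h5",
    "models/bi_lstm_att_201811101908/model_parameters.json",
]
chosen = best_checkpoint(checkpoints)
print(chosen)
```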


### Predict samples

In [6]:
y_pred, attention = text_classifier.predict(x_seq)

2018-11-10 19:50:14,200 - classifier.py[line:121] - INFO: predict 94635 samples used 193.8s
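The tutorial does not show what `y_pred` looks like. If it holds one row of class probabilities per sample (an assumption, since the return format is not documented here), turning it into label ids is a per-row argmax, as in this small sketch with made-up probabilities:

```python
def to_labels(probabilities):
    """Return the index of the largest value in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in probabilities]


# Made-up probabilities for three samples and two classes.
probs_toy = [[0.91, 0.09], [0.20, 0.80], [0.55, 0.45]]
pred_labels = to_labels(probs_toy)
print(pred_labels)
```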


### Attention visualization

In [8]:
x_len = x_seq['length']
attention_true = [attention[i][:x_len[i]] for i in range(len(x_len))]

In [22]:
from nlp_toolkit import visualization as vs

In [23]:
vs.mk_html(dataset.texts[2].split(), attention_true[2])

<span style="background-color: #FFFEFE">老板</span> <span style="background-color: #FFFEFE">人</span> <span style="background-color: #FFD6D6">很好</span> <span style="background-color: #FFFCFC">老</span> <span style="background-color: #FFFEFE">员工</span> <span style="background-color: #FFFEFE">会</span> <span style="background-color: #FFFBFB">各种</span> <span style="background-color: #FFD3D3">教</span> <span style="background-color: #FFD7D7">你</span> <span style="background-color: #FFECEC">东西</span> <span style="background-color: #FFFCFC">，</span> <span style="background-color: #FFFEFE">而且</span> <span style="background-color: #FFD4D4">不会</span> <span style="background-color: #FFFEFE">有所</span> <span style="background-color: #FFFEFE">保留</span> <span style="background-color: #FFFEFE">薪水</span> <span style="background-color: #FFFEFE">在</span> <span style="background-color: #FFFEFE">大连</span> <span style="background-color: #FFFEFE">还</span> <span style="background-color: #FFEDED">算</span> <span style="background-color: #FFD5D5">可以</span><br><br>
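In the output above, tokens with more attention get a stronger red background (for example `#FFD6D6` for 很好 versus `#FFFEFE` for 老板). One way to produce this kind of markup is sketched below. It is not the `mk_html` implementation: the function name, the `max_fade` parameter, and the scaling by the largest weight are assumptions chosen for illustration.

```python
from html import escape


def attention_to_html(tokens, weights, max_fade=0.5):
    """Wrap each token in a span whose background gets redder with its weight."""
    top = max(weights) or 1.0  # avoid dividing by zero when all weights are 0
    spans = []
    for tok, weight in zip(tokens, weights):
        # Red stays at FF; green and blue shrink as the weight grows
        channel = round(255 * (1 - max_fade * weight / top))
        spans.append(
            '<span style="background-color: #FF{0:02X}{0:02X}">{1}</span>'.format(
                channel, escape(tok)
            )
        )
    return " ".join(spans)


markup = attention_to_html(["老板", "人", "很好"], [0.0, 0.1, 0.4])
print(markup)
```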

Alternatively, write all results to an HTML file and open it in a browser:

In [26]:
vs.attention_visualization(dataset.texts, attention_true, x_len, output_fname='result.html')