# Chunking

This tutorial shows how to use nlp_toolkit to train a sequence labeling model and predict labels for new samples. The task is chunking.

The dataset consists of work-experience texts taken from different CVs, and the goal is to label the noun phrases in each text.

Available models:

1. WordRNN

In [1]:
import sys
sys.path.append('../')
from nlp_toolkit import Dataset, Labeler

Using TensorFlow backend.


## Data Processing

### Load config dict

In [2]:
import yaml
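# safe_load needs no Loader argument (newer PyYAML versions require one for yaml.load)
with open('../nlp_toolkit/config.yaml') as f:
    config = yaml.safe_load(f)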
config['data']['inner_char'] = True
config['embed']['pre'] = False
config['train']['metric'] = 'f1_seq'

### Load data

In [3]:
dataset = Dataset(fname='../data/cv_word.txt', task_type='sequence_labeling', mode='train', segment=False, config=config)

2018-11-10 21:21:35,567 - data.py[line:89] - INFO: data loaded


In [4]:
for x, y in zip(dataset.texts[0:10], dataset.labels[0:10]):
    print(x, y)

主要 帮助 工地 师傅 一起 超平 , 防线 工作 O O B-Chunk E-Chunk O O O O O
协助 线 上 、 线 下 活动 的 执行 O O O O B-Chunk I-Chunk E-Chunk O O
执行 各项 培训 相关 的 各项 工作 流程 O O O O O O B-Chunk E-Chunk
云南 : 曲靖 、 昭通 下属 的 5 个 县级 供电 公司 10 个 供电所 O O O O O O O O O O B-Chunk E-Chunk O O O
担任 培训 学校 英语 讲师 一 职 和 学生 管理 O O B-Chunk I-Chunk E-Chunk O O O B-Chunk E-Chunk
搜寻 招标 公告 , 告知 领导 及 业务 人员 , 确认 是否 报名 O B-Chunk E-Chunk O O O O B-Chunk E-Chunk O O O O
2001 / 10 -- 2002 / 04 : 上海 润 宝 工贸 公司 <s> 所属 行业 : <s> 环保 <s> 销售部 <s> 销售 代表 <s> 负责 江浙 一带 工业 圆 区 的 空气过滤器 的 销售 和 维护 , 期间 昆山 翊 腾 电子 是 长期 的 客户 O O O O O O O O O O O B-Chunk E-Chunk O O O O O O O O O B-Chunk E-Chunk O O O O O O O O O O O O O O O O O O O O O O O
<s> 仓库 管理 : 对 仓库 进行 合理 布局 , 为 方便 员工 操作 和 减少 失误 , 能够 独立 编排 库 位 图 和 货位 表 O B-Chunk E-Chunk O O O O O O O O O B-Chunk E-Chunk O O O O O O O O O O O O O
<s> 档案 管理 : 能 独立 制定 仓库 管理 文档 , 专人 负责 仓库 资料 的 更新 归档 并 定期 检查 O B-Chunk E-Chunk O O O O B-Chunk E-Chunk O O O O O O O O O O O O
电子 技术 / 半导体 / 集成电路 B-Chunk E-Chunk O O O O
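The labels in the sample above follow a begin/inside/end scheme: `B-Chunk` opens a chunk, `I-Chunk` continues it, `E-Chunk` closes it, and `O` marks tokens outside any chunk. As a minimal sketch (the helper `extract_chunks` is hypothetical and not part of nlp_toolkit), the chunks can be read back from a tagged sentence like this:

```python
def extract_chunks(tokens, tags):
    """Return the token spans marked as chunks by B-/I-/E- tags.

    Hypothetical helper for illustration; not part of nlp_toolkit.
    """
    chunks, current = [], []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            current = [token]                  # open a new chunk
        elif tag.startswith("I-") and current:
            current.append(token)              # continue the open chunk
        elif tag.startswith("E-") and current:
            current.append(token)              # close the chunk and store it
            chunks.append(" ".join(current))
            current = []
        else:
            current = []                       # "O" or an ill-formed sequence resets
    return chunks


# Fifth sample from the output above
tokens = "担任 培训 学校 英语 讲师 一 职 和 学生 管理".split()
tags = "O O B-Chunk I-Chunk E-Chunk O O O B-Chunk E-Chunk".split()
print(extract_chunks(tokens, tags))  # ['学校 英语 讲师', '学生 管理']
```

Only the four tags visible in the sample are handled; a single-token chunk tag, if the dataset has one, would need an extra branch.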


### Transform data to index

In [5]:
# To use pre-trained embeddings instead, a gensim-format embedding file is needed
x, y, config = dataset.transform()
print(x['word'].shape, y.shape)

2018-11-10 21:21:55,130 - data.py[line:116] - INFO: texts and labels transformed to number index
2018-11-10 21:21:55,131 - data.py[line:124] - INFO: Use Embeddings from Straching


(57415, 100) (57415, 100, 4)
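The printed shapes suggest that each of the 57415 texts is represented as 100 word indices and that each position carries a 4-way label vector. The following sketch shows one way such arrays could be built with NumPy. The index mappings, the zero padding, and the helper `to_arrays` are assumptions for illustration; the real mappings are held by `dataset.transformer` and may differ.

```python
import numpy as np

# Hypothetical index mappings; the real ones live in dataset.transformer.
word2id = {"<pad>": 0, "电子": 1, "技术": 2, "/": 3}
label2id = {"O": 0, "B-Chunk": 1, "I-Chunk": 2, "E-Chunk": 3}


def to_arrays(token_seqs, tag_seqs, maxlen):
    """Pad word ids to maxlen and one-hot encode the tags (illustrative only)."""
    x = np.zeros((len(token_seqs), maxlen), dtype="int32")
    y = np.zeros((len(token_seqs), maxlen, len(label2id)), dtype="float32")
    for i, (tokens, tags) in enumerate(zip(token_seqs, tag_seqs)):
        for j, (token, tag) in enumerate(zip(tokens[:maxlen], tags[:maxlen])):
            x[i, j] = word2id[token]       # padded positions keep index 0
            y[i, j, label2id[tag]] = 1.0   # padded positions keep an all-zero row
    return x, y


x_demo, y_demo = to_arrays([["电子", "技术", "/"]],
                           [["B-Chunk", "E-Chunk", "O"]],
                           maxlen=100)
print(x_demo.shape, y_demo.shape)  # (1, 100) (1, 100, 4)
```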


To inspect the vocabulary and label index mappings, uncomment the following cells:

In [6]:
# dataset.transformer._word_vocab._token2id

In [7]:
# dataset.transformer._label_vocab._token2id

In [6]:
transformer = dataset.transformer

## Chunking Labeling

### Define Sequence Labeler

In [7]:
model_name='word_rnn'
seq_labeler = Labeler(model_name=model_name, transformer=transformer, seq_type='bucket', config=config)

### Train model

In [8]:
trained_model = seq_labeler.train(x, y)

2018-11-10 20:47:03,102 - trainer.py[line:113] - INFO: word_rnn model structure...


__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
char (InputLayer)               (None, None, None)   0                                            
__________________________________________________________________________________________________
time_distributed_1 (TimeDistrib (None, None, None, 3 114016      char[0][0]                       
__________________________________________________________________________________________________
word (InputLayer)               (None, None)         0                                            
__________________________________________________________________________________________________
char_cnn (TimeDistributed)      (None, None, None, 6 4160        time_distributed_1[0][0]         
__________________________________________________________________________________________________
word_embed

2018-11-10 20:47:03,549 - trainer.py[line:123] - INFO: train/valid set: 45932/11483
2018-11-10 20:47:03,550 - trainer.py[line:80] - INFO: use bucket sequence to speed up model training
2018-11-10 20:47:03,552 - sequence.py[line:300] - INFO: Training with 77 non-empty buckets
2018-11-10 20:47:03,898 - sequence.py[line:300] - INFO: Training with 77 non-empty buckets
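The log lines above mention bucketing. The general idea is to group sequences of similar length so that each batch only needs padding up to its bucket's length rather than the global maximum. The sketch below illustrates that idea in plain Python; the function `bucket_by_length` and the bucket width are assumptions, and the toolkit's own bucket boundaries may be chosen differently.

```python
from collections import defaultdict


def bucket_by_length(sequences, bucket_width=5):
    """Group sequences by length rounded up to a multiple of bucket_width.

    Illustrative only; not the nlp_toolkit implementation.
    """
    buckets = defaultdict(list)
    for seq in sequences:
        # Ceiling division gives the padded length for this sequence
        padded_len = -(-len(seq) // bucket_width) * bucket_width
        buckets[padded_len].append(seq)
    return dict(buckets)


sequences = [[1, 2, 3], [4, 5, 6, 7], [1] * 9, [2] * 12]
buckets = bucket_by_length(sequences)
print({length: len(seqs) for length, seqs in sorted(buckets.items())})
# {5: 2, 10: 1, 15: 1}
```

Batches are then drawn from one bucket at a time, so short sentences are not padded to the length of the longest sentence in the dataset.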


mointor training process using f1 score and label acc
Successfully made a directory: models/word_rnn_201811102046
using Early Stopping
using Reduce LR On Plateau
tracking loss history and metrics


2018-11-10 20:47:04,855 - trainer.py[line:154] - INFO: saving model parameters and transformer to models/word_rnn_201811102046


model hyperparameters:
 {'nb_classes': 4, 'nb_tokens': 39673, 'maxlen': None, 'embedding_dim': 64, 'rnn_type': 'lstm', 'nb_rnn_layers': 2, 'drop_rate': 0.5, 're_drop_rate': 0.15, 'use_crf': True, 'inner_char': True, 'word_rnn_size': 128, 'integration_method': 'attention', 'char_feature_method': 'cnn', 'max_charlen': 10, 'nb_char_tokens': 3563, 'char_embedding_dim': 32, 'nb_filters': 64, 'conv_kernel_size': 2}
Epoch 1/25
 - acc: 84.46
 - f1: 87.70
             precision    recall  f1-score   support

      Chunk       0.91      0.84      0.88      1523

avg / total       0.91      0.84      0.88      1523


Epoch 00001: f1_seq improved from -inf to 0.87705, saving model to models/word_rnn_201811102046/model_weights_01_0.9258_0.8770.h5
Epoch 2/25
 - acc: 86.62
 - f1: 89.38
             precision    recall  f1-score   support

      Chunk       0.92      0.87      0.89      1520

avg / total       0.92      0.87      0.89      1520


Epoch 00002: f1_seq improved from 0.87705 to 0.89376, s
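The per-epoch report lists precision, recall, and F1 for the `Chunk` type. A common way to compute such scores for sequence labeling is at the span level: a predicted chunk counts as correct only if both its start and end match a gold chunk. The sketch below shows that calculation. Whether `f1_seq` is computed exactly this way is an assumption, and the helpers `spans` and `chunk_f1` are hypothetical.

```python
def spans(tags):
    """Return the set of (start, end) index pairs covered by B-...E- runs."""
    found, start = set(), None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            start = i
        elif tag.startswith("E-") and start is not None:
            found.add((start, i))
            start = None
        elif not tag.startswith("I-"):
            start = None               # "O" (or a stray E-) closes any open run
    return found


def chunk_f1(true_tags, pred_tags):
    """Span-level precision, recall and F1 for one tagged sentence."""
    true_spans, pred_spans = spans(true_tags), spans(pred_tags)
    correct = len(true_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(true_spans) if true_spans else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


true_tags = "O O B-Chunk I-Chunk E-Chunk O O O B-Chunk E-Chunk".split()
pred_tags = "O O B-Chunk I-Chunk E-Chunk O O O O O".split()
precision, recall, f1 = chunk_f1(true_tags, pred_tags)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=1.00 recall=0.50 f1=0.67
```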

### 10-fold training

In [None]:
config['train']['train_mode'] = 'fold'
seq_labeler = Labeler(model_name=model_name, transformer=transformer, config=config)
seq_labeler.train(x, y)

2018-11-10 21:21:56,592 - trainer.py[line:172] - INFO: 10-fold starts!



------------------------ fold 0------------------------


2018-11-10 21:21:58,183 - trainer.py[line:183] - INFO: word_rnn model structure...


__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
char (InputLayer)               (None, None, None)   0                                            
__________________________________________________________________________________________________
time_distributed_2 (TimeDistrib (None, None, None, 3 114016      char[0][0]                       
__________________________________________________________________________________________________
word (InputLayer)               (None, None)         0                                            
__________________________________________________________________________________________________
char_cnn (TimeDistributed)      (None, None, None, 6 4160        time_distributed_2[0][0]         
__________________________________________________________________________________________________
word_embed

2018-11-10 21:22:01,797 - trainer.py[line:83] - INFO: use bucket sequence to speed up model training
2018-11-10 21:22:01,799 - sequence.py[line:300] - INFO: Training with 77 non-empty buckets
2018-11-10 21:22:02,207 - sequence.py[line:300] - INFO: Training with 77 non-empty buckets


mointor training process using f1 score and label acc
using Early Stopping
using Reduce LR On Plateau
tracking loss history and metrics
Epoch 1/25
 20/840 [..............................] - ETA: 7:44 - loss: 0.9048 - acc: 0.7473
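In k-fold training, the data is split into k parts, and each part serves once as the validation set while the remaining parts are used for training. The sketch below shows a plain NumPy version of such a split. The function `kfold_indices`, the shuffling, and the seed are assumptions for illustration; how the toolkit partitions the data internally is not shown in this notebook.

```python
import numpy as np


def kfold_indices(n_samples, n_folds=10, seed=0):
    """Yield (train_idx, valid_idx) pairs for each fold (illustrative only)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)       # shuffle once up front
    folds = np.array_split(order, n_folds)   # nearly equal-sized parts
    for k in range(n_folds):
        valid_idx = folds[k]
        train_idx = np.concatenate(folds[:k] + folds[k + 1:])
        yield train_idx, valid_idx


# 57415 is the number of samples reported by dataset.transform() above
splits = list(kfold_indices(57415, n_folds=10))
print(len(splits))  # 10
```

Each sample lands in exactly one validation fold, so the ten validation scores together cover the whole dataset.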

## Predict New Samples

### Load data and transformer

In [2]:
dataset = Dataset('../data/cv_word_predict.txt',
                  task_type='sequence_labeling', mode='predict',
                  tran_fname='models/word_rnn_201811102046/transformer.h5',
                  segment=False)
x_seq = dataset.transform()

2018-11-10 21:10:46,455 - data.py[line:73] - INFO: transformer loaded
2018-11-10 21:10:46,786 - data.py[line:89] - INFO: data loaded


data transformer loaded


### Load model

In [3]:
seq_labeler = Labeler('word_rnn', dataset.transformer)
seq_labeler.load(weight_fname='models/word_rnn_201811102046/model_weights_02_0.9273_0.8938.h5',
                 para_fname='models/word_rnn_201811102046/model_parameters.json')

model loaded


### Predict samples

In [4]:
y_pred = seq_labeler.predict(x_seq)

2018-11-10 21:13:12,591 - labeler.py[line:119] - INFO: predict 57415 samples used 131.9s


In [8]:
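# Trim each predicted sequence to its true (unpadded) length
x_len = x_seq['length']
y_pred_true = [y_pred[i][:x_len[i]] for i in range(len(x_len))]
# Pair each token of the first sample with its predicted label
print(list(zip(dataset.texts[0].split(' '), y_pred_true[0])))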

[('主要', 'O'), ('帮助', 'O'), ('工地', 'B-Chunk'), ('师傅', 'E-Chunk'), ('一起', 'O'), ('超平', 'O'), (',', 'O'), ('防线', 'O'), ('工作', 'O')]
