## 基于CRF的中文命名实体识别

- 随机场是由若干个位置组成的整体，当按照某种分布给每一个位置随机赋予一个值之后，其全体就叫做随机场。
- 马尔科夫随机场是随机场的特例，它假设随机场中某一个位置的赋值仅仅与和它相邻的位置的赋值有关，和与其不相邻的位置的赋值无关。
- CRF 是马尔科夫随机场的特例，它假设马尔科夫随机场中只有 X 和 Y 两种变量，X 一般是给定的，而 Y 一般是在给定 X 的条件下我们的输出。这样马尔科夫随机场就特化成了条件随机场。
- 对于 CRF，我们给出准确的数学语言描述：设 X 与 Y 是随机变量，P(Y|X) 是给定 X 时 Y 的条件概率分布，若随机变量 Y 构成的是一个马尔科夫随机场，则称条件概率分布 P(Y|X) 是条件随机场。

数据格式要求形如：

12 O
末 O
周 O
商 B-industry
品 I-industry
房 I-industry
成 O
交 O
面 O
积 O
有 O
所 O
放 O
大 O
。


字与标签之间以空格隔开，每句话中间以换行分割

In [2]:
import re
import sklearn_crfsuite
from sklearn_crfsuite import metrics
from sklearn.externals import joblib
import yaml
import warnings

warnings.filterwarnings('ignore')

In [3]:
def load_data(data_path):
    data_read_all = list()
    data_sent_with_label = list()
    with open(data_path, mode='r', encoding="utf-8") as f:
        for line in f:
            if line.strip() == "":
                data_read_all.append(data_sent_with_label.copy())
                data_sent_with_label.clear()
            else:
                data_sent_with_label.append(tuple(line.strip().split(" ")))
    return data_read_all

In [4]:
def word2features(sent, i):
    word = sent[i][0]

    features = {
        'bias': 1.0,
        'word': word,
        'word.isdigit()': word.isdigit(),
    }
    if i > 0:
        word1 = sent[i-1][0]
        words = word1 + word
        features.update({
            '-1:word': word1,
            '-1:words': words,
            '-1:word.isdigit()': word1.isdigit(),
        })
    else:
        features['BOS'] = True

    if i > 1:
        word2 = sent[i-2][0]
        word1 = sent[i-1][0]
        words = word1 + word2 + word
        features.update({
            '-2:word': word2,
            '-2:words': words,
            '-3:word.isdigit()': word1.isdigit(),
        })

    if i > 2:
        word3 = sent[i - 3][0]
        word2 = sent[i - 2][0]
        word1 = sent[i - 1][0]
        words = word1 + word2 + word3 + word
        features.update({
            '-3:word': word3,
            '-3:words': words,
            '-3:word.isdigit()': word1.isdigit(),
        })

    if i < len(sent)-1:
        word1 = sent[i+1][0]
        words = word1 + word
        features.update({
            '+1:word': word1,
            '+1:words': words,
            '+1:word.isdigit()': word1.isdigit(),
        })
    else:
        features['EOS'] = True

    if i < len(sent)-2:
        word2 = sent[i + 2][0]
        word1 = sent[i + 1][0]
        words = word + word1 + word2
        features.update({
            '+2:word': word2,
            '+2:words': words,
            '+2:word.isdigit()': word2.isdigit(),
        })

    if i < len(sent)-3:
        word3 = sent[i + 3][0]
        word2 = sent[i + 2][0]
        word1 = sent[i + 1][0]
        words = word + word1 + word2 + word3
        features.update({
            '+3:word': word3,
            '+3:words': words,
            '+3:word.isdigit()': word3.isdigit(),
        })

    return features

In [5]:
def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]


def sent2labels(sent):
    return [ele[-1] for ele in sent]


In [6]:
with open('./NER_config.yml') as f:
        hp = yaml.load(f)

### 1.读取数据

In [7]:
train_corpus = load_data(hp['data_train'])
test_corpus = load_data(hp['data_dev'])

X_train = [sent2features(s) for s in train_corpus]
y_train = [sent2labels(s) for s in train_corpus]

X_dev = [sent2features(s) for s in test_corpus]
y_dev = [sent2labels(s) for s in test_corpus]

### 2.模型训练

In [8]:
# **表示该位置接受任意多个关键字（keyword）参数，在函数**位置上转化为词典 [key:value, key:value ]
crf_model = sklearn_crfsuite.CRF(**hp["CRF_hyper"],verbose=True)
crf_model.fit(X_train, y_train)

loading training data to CRFsuite: 100%|██████████| 4232/4232 [00:06<00:00, 649.85it/s]



Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 1
0....1....2....3....4....5....6....7....8....9....10
Number of features: 688227
Seconds required: 1.886

L-BFGS optimization
c1: 0.250000
c2: 0.018000
num_memories: 6
max_iterations: 100
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

Iter 1   time=0.40  loss=74015.64 active=686162 feature_norm=1.00
Iter 2   time=0.21  loss=55168.33 active=680668 feature_norm=1.44
Iter 3   time=0.23  loss=53448.33 active=257166 feature_norm=1.53
Iter 4   time=0.22  loss=50159.52 active=225741 feature_norm=1.76
Iter 5   time=0.22  loss=43384.43 active=157893 feature_norm=2.03
Iter 6   time=0.58  loss=39856.83 active=133583 feature_norm=3.00
Iter 7   time=0.22  loss=36560.75 active=159338 feature_norm=2.95
Iter 8   time=0.22  loss=34255.21 active=158556 feature_norm=3.19
Iter 9   time=0.59  loss=31861.58 active=157267 feature_norm=3.63
Iter

CRF(algorithm='lbfgs', all_possible_states=None, all_possible_transitions=True,
    averaging=None, c=None, c1=0.25, c2=0.018, calibration_candidates=None,
    calibration_eta=None, calibration_max_trials=None, calibration_rate=None,
    calibration_samples=None, delta=None, epsilon=None, error_sensitive=None,
    gamma=None, keep_tempfiles=None, linesearch=None, max_iterations=100,
    max_linesearch=None, min_freq=None, model_filename=None, num_memories=None,
    pa_type=None, period=None, trainer_cls=None, variance=None, verbose=True)

### 3.验证模型效果

In [10]:
labels=list(crf_model.classes_)
labels.remove("O")
y_pred = crf_model.predict(X_dev)
metrics.flat_f1_score(y_dev, y_pred,
                      average='weighted', labels=labels)
sorted_labels = sorted(labels,key=lambda name: (name[1:], name[0]))
print(metrics.flat_classification_report(
    y_dev, y_pred, labels=sorted_labels, digits=3
))

              precision    recall  f1-score   support

  B-industry      0.875     0.759     0.813       572
  I-industry      0.868     0.776     0.819      1352

   micro avg      0.870     0.771     0.817      1924
   macro avg      0.871     0.767     0.816      1924
weighted avg      0.870     0.771     0.817      1924



### 4.保存模型

In [11]:
import joblib
joblib.dump(crf_model, "./small_corpus_crf_model.pkl")

['./small_corpus_crf_model.pkl']

### 5.使用模型预测

In [14]:
text = '国际成品油需求大幅下滑，国际原油的最大需求来源于成品油。'

NER_tagger = joblib.load('./small_corpus_crf_model.pkl')
list_result = []
new_sents = re.split(u'(。|！|\!|？|\?)', text)
sents_feature = [sent2features(sent) for sent in new_sents]
y_pred = NER_tagger.predict(sents_feature)
for sent, ner_tag in zip(new_sents, y_pred):
    for word, tag in zip(sent, ner_tag):
        list_result.append((word,tag))
list_result        

[('国', 'O'),
 ('际', 'O'),
 ('成', 'O'),
 ('品', 'O'),
 ('油', 'O'),
 ('需', 'O'),
 ('求', 'O'),
 ('大', 'O'),
 ('幅', 'O'),
 ('下', 'O'),
 ('滑', 'O'),
 ('，', 'O'),
 ('国', 'O'),
 ('际', 'O'),
 ('原', 'B-industry'),
 ('油', 'I-industry'),
 ('的', 'O'),
 ('最', 'O'),
 ('大', 'O'),
 ('需', 'O'),
 ('求', 'O'),
 ('来', 'O'),
 ('源', 'O'),
 ('于', 'O'),
 ('成', 'O'),
 ('品', 'O'),
 ('油', 'O'),
 ('。', 'O')]

## 可以在模型训练前使用RandomizedSearchCV或GridSearchCV寻找最优参数

GridSearchCV：网格搜索和交叉验证

RandomizedSearchCV ：以随机在参数空间中采样的方式代替了GridSearchCV对于参数的网格搜索

In [52]:
from sklearn.metrics import make_scorer
from sklearn.model_selection import RandomizedSearchCV

# define fixed parameters and parameters to search
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    max_iterations=100,
    all_possible_transitions=True
)
params_space = {
    'c1': scipy.stats.expon(scale=0.5),
    'c2': scipy.stats.expon(scale=0.05),
}

# use the same metric for evaluation
f1_scorer = make_scorer(metrics.flat_f1_score,
                        average='weighted', labels=labels)

# search
rs = RandomizedSearchCV(crf, params_space,
                        cv=3,
                        verbose=1,
                        n_jobs=-1,
                        n_iter=50,
                        scoring=f1_scorer)
rs.fit(X_train, y_train)

Fitting 3 folds for each of 50 candidates, totalling 150 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 10 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 tasks      | elapsed: 13.1min
[Parallel(n_jobs=-1)]: Done 150 out of 150 | elapsed: 63.2min finished


RandomizedSearchCV(cv=3, error_score='raise-deprecating',
                   estimator=CRF(algorithm='lbfgs', all_possible_states=None,
                                 all_possible_transitions=True, averaging=None,
                                 c=None, c1=None, c2=None,
                                 calibration_candidates=None,
                                 calibration_eta=None,
                                 calibration_max_trials=None,
                                 calibration_rate=None,
                                 calibration_samples=None, delta=None,
                                 epsilon=None, error_sensitive=None,...
                   iid='warn', n_iter=50, n_jobs=-1,
                   param_distributions={'c1': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fcf9f05f978>,
                                        'c2': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fcf9f076358>},
                   pre_dispatch='2*n_jobs', random_state=

In [54]:
print('best params:', rs.best_params_)
print('best CV score:', rs.best_score_)
print('model size: {:0.2f}M'.format(rs.best_estimator_.size_ / 1000000))

best params: {'c1': 0.25321620648556353, 'c2': 0.018405602278209005}
best CV score: 0.7675154799048618
model size: 1.66M
