# <center>Natural Language Processing: Supplementary Material</center>

## Course Outline
* 1. Introduction to Named Entity Recognition
* 2. A Simple Model in Practice
* 3. A More Complex Model in Practice

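## 1. Introduction to Named Entity Recognition

Named Entity Recognition (NER) is a fundamental task in NLP. It serves as a building block for many other NLP tasks, including information extraction, question answering, syntactic parsing, and machine translation.

**Entity recognition**

Put simply, entity recognition is the process of picking out of a sentence the spans that belong to the entity types you care about.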

|小明|在|北京大学|的|燕园|读书|
| ---- |:----:|:----:|:----:|:----:|:----:|
|PER||ORG||LOC|&nbsp;|

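In the example above, an NER model takes the sentence “小明在北京大学的燕园读书” (roughly, “Xiao Ming studies at Yan Yuan, Peking University”) and picks out “小明” as PER (person), “北京大学” as ORG (organization), and “燕园” as LOC (location).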

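**Data labeling**

The data is labeled with the three-tag BIO scheme (B-begin, I-inside, O-outside). Every character that is not part of an entity is tagged O; the first character of an entity is tagged B and each following character I, with the entity type as a suffix (for example B-PER, I-PER). Entities are then recovered from the tag sequence by following the BIO rules.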

|小|明|在|北|京|大|学|的|燕|园|读|书|
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|B-PER|I-PER|O|B-ORG|I-ORG|I-ORG|I-ORG|O|B-LOC|I-LOC|O|O|
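
The last step, recovering entities from a BIO tag sequence, can be sketched as follows. This is a minimal illustration, not code from the course: the function name `extract_entities` is hypothetical, and dropping a stray `I-` tag that does not continue an open span is one possible design choice among several.

```python
def extract_entities(chars, tags):
    """Collect (text, type) spans from parallel character and BIO tag lists."""
    entities = []
    current_chars, current_type = [], None

    def flush():
        # Close the span being built, if any.
        nonlocal current_chars, current_type
        if current_chars:
            entities.append(("".join(current_chars), current_type))
        current_chars, current_type = [], None

    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new span.
            flush()
            current_chars, current_type = [ch], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type extends the open span.
            current_chars.append(ch)
        else:
            # "O", or an I- tag that does not continue the open span.
            flush()
    flush()
    return entities


chars = list("小明在北京大学的燕园读书")
tags = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG", "I-ORG", "I-ORG",
        "O", "B-LOC", "I-LOC", "O", "O"]
entities = extract_entities(chars, tags)
print(entities)
```

Applied to the labeled sentence in the table, this returns the three spans with their types: 小明 (PER), 北京大学 (ORG), and 燕园 (LOC).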



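## 2. A Simple Model in Practice

**Bidirectional LSTM**

![](images/ner.png)

* In the figure, B-Person and I-Person mark the first and non-first characters of a person name, B-Organization and I-Organization mark the first and non-first characters of an organization name, and O marks a character that is not part of any named entity.
* The input is a sequence of word embeddings, encoded by a bidirectional LSTM. A fully connected layer of size [hidden_dim, num_label] on top of the LSTM hidden states, followed by a softmax, yields a probability for every label at every time step. This is the part shown in the yellow box in the figure.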
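
To make the projection step concrete, here is a small pure-Python sketch of what the fully connected layer does at each time step. The sizes and random values are arbitrary placeholders, and the names `softmax` and `project` are chosen for illustration only. In a bidirectional LSTM that concatenates its two directions, `hidden_dim` would be the size of the concatenated forward and backward states.

```python
import math
import random


def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def project(hidden_states, weights, bias):
    """Apply the same dense layer to every time step, then a softmax."""
    probs = []
    for h in hidden_states:  # h has length hidden_dim
        scores = [
            sum(h_i * w_row[j] for h_i, w_row in zip(h, weights)) + bias[j]
            for j in range(len(bias))
        ]
        probs.append(softmax(scores))
    return probs


random.seed(0)
steps, hidden_dim, num_label = 4, 6, 5  # toy sizes, chosen arbitrarily
hidden_states = [[random.uniform(-1, 1) for _ in range(hidden_dim)]
                 for _ in range(steps)]
weights = [[random.uniform(-1, 1) for _ in range(num_label)]
           for _ in range(hidden_dim)]  # shape [hidden_dim, num_label]
bias = [0.0] * num_label

probs = project(hidden_states, weights, bias)
# The predicted label at each step is the one with the highest probability.
predicted = [row.index(max(row)) for row in probs]
print(len(probs), len(probs[0]))
```

The result has one row per time step and one column per label, which corresponds to the grid of per-step label probabilities described above.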


### 2.1 Data Processing

In [None]:
import pandas as pd
import numpy as np

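# Load the data
data = pd.read_csv("data/ner_dataset.csv", encoding="latin1")
# Forward fill: replace each missing value with the previous non-missing value
# (DataFrame.ffill() replaces the deprecated fillna(method="ffill"))
data = data.ffill()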

In [None]:
# Inspect the data
data.head(50)

In [None]:
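# Build the vocabulary and count the distinct words
words = list(set(data["Word"].values))

# Reserve two special tokens: one for unseen words, one for padding
words.append("UNKNOWN")
words.append("ENDPAD")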

n_words = len(words)
print(n_words)

In [None]:
# Count the distinct tags
tags = list(set(data["Tag"].values))
n_tags = len(tags)
n_tags

In [None]:
# Group the rows into sentences
agg_func = lambda s: [(w, p, t) for w, p, t in zip(s["Word"].values.tolist(),
                                                           s["POS"].values.tolist(),
                                                           s["Tag"].values.tolist())]
grouped = data.groupby("Sentence").apply(agg_func)

In [None]:
# Compute sentence lengths
sentence_lens = [len(s) for s in grouped]
sentence_lens

In [None]:
# Summarize the sentence-length distribution
import numpy as np
print("Shortest sentence length =", min(sentence_lens))
print("Longest sentence length =", max(sentence_lens))
print("Mean sentence length =", np.mean(sentence_lens))

In [None]:
# Plot the sentence-length histogram
import matplotlib.pyplot as plt
plt.hist(sentence_lens, bins=range(min(sentence_lens), max(sentence_lens)+2, 2))
plt.show()

In [None]:
# Fixed sequence length used for padding and truncation
max_len = 40

In [None]:
# Convert the Series to a list
sentences = [s for s in grouped]
sentences

In [None]:
# Lookup tables from each word and each tag to an integer index
word2idx = {w: i for i, w in enumerate(words)}
tag2idx = {t: i for i, t in enumerate(tags)}

In [None]:
tag2idx

In [None]:
word2idx["ENDPAD"]

In [None]:
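from keras.preprocessing.sequence import pad_sequences
# Encode every word as its vocabulary index
x_data = [[word2idx[w[0]] for w in s] for s in sentences]
# Pad at the end with the ENDPAD index (n_words - 1) up to max_len
x_data = pad_sequences(maxlen=max_len, sequences=x_data, padding="post", value=n_words - 1)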
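
The vocabulary reserves an `UNKNOWN` token, but the encoding above only handles words that already appear in the dataset. A sketch of how a new sentence might be encoded with a fallback to `UNKNOWN` and padding with `ENDPAD` is shown below. The helper `encode_sentence` and the toy vocabulary are hypothetical. Note also one deliberate difference: this sketch keeps the first `max_len` tokens of a long sentence, whereas `pad_sequences` by default truncates from the front (`truncating="pre"`).

```python
def encode_sentence(tokens, word2idx, max_len, unk="UNKNOWN", pad="ENDPAD"):
    """Map tokens to ids, falling back to `unk`, then pad or cut to max_len."""
    ids = [word2idx.get(tok, word2idx[unk]) for tok in tokens]
    ids = ids[:max_len]  # keep the first max_len tokens
    ids += [word2idx[pad]] * (max_len - len(ids))  # pad at the end
    return ids


# Toy vocabulary standing in for the notebook's word2idx.
toy_words = ["London", "is", "rainy", "UNKNOWN", "ENDPAD"]
toy_word2idx = {w: i for i, w in enumerate(toy_words)}

# "foggy" is not in the toy vocabulary, so it maps to the UNKNOWN index.
encoded = encode_sentence(["London", "is", "foggy"], toy_word2idx, max_len=5)
print(encoded)
```

With the notebook's own tables, the call would be `encode_sentence(tokens, word2idx, max_len)`.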

In [None]:
from keras.preprocessing.sequence import pad_sequences
# Encode every tag as its index; padded positions are labeled "O"
y_data = [[tag2idx[w[2]] for w in s] for s in sentences]
y_data = pad_sequences(maxlen=max_len, sequences=y_data, padding="post", value=tag2idx["O"])

In [None]:
print(x_data[0])
print(y_data[0])

In [None]:
from keras.utils import to_categorical
# One-hot encode the tags: shape (n_sentences, max_len, n_tags)
y_data = to_categorical(y_data, num_classes=n_tags)

In [None]:
print(x_data[0])
print(y_data[0])

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.2)

### 2.2 Building the Model

In [None]:
from keras.models import Sequential
from keras.layers import LSTM, Embedding, Dense, Bidirectional

model = Sequential()
# Map each word index to a 100-dimensional embedding
model.add(Embedding(input_dim=n_words, output_dim=100, input_length=max_len))
# Encode the sequence in both directions, keeping the output at every time step
model.add(Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.1)))
# Per-step softmax over the tag set
model.add(Dense(n_tags, activation="softmax"))
model.summary()

In [None]:
model.compile(optimizer="rmsprop", loss="categorical_crossentropy", metrics=["accuracy"])

### 2.3 Training the Model

In [None]:
model.fit(x_train, y_train, batch_size=64, epochs=1, validation_split=0.1, verbose=2)

### 2.4 Evaluating the Model

In [None]:
scores = model.evaluate(x_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))
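
One caveat about this accuracy figure: the label sequences are padded with the "O" tag, so padded positions count as tokens and are usually easy to get right, which can inflate the score. A rough sketch of a token accuracy that skips padded positions follows. The function `masked_accuracy` and the toy arrays are illustrative assumptions, not part of the notebook.

```python
def masked_accuracy(x_ids, true_tags, pred_tags, pad_id):
    """Token accuracy over positions whose input id is not the pad id."""
    correct = total = 0
    for ids, t_row, p_row in zip(x_ids, true_tags, pred_tags):
        for wid, t, p in zip(ids, t_row, p_row):
            if wid == pad_id:
                continue  # skip padding positions
            total += 1
            correct += int(t == p)
    return correct / total if total else 0.0


# Toy batch: id 9 is padding, tag 0 plays the role of "O".
x_ids = [[3, 5, 9, 9], [2, 9, 9, 9]]
true_tags = [[1, 0, 0, 0], [2, 0, 0, 0]]
pred_tags = [[1, 1, 0, 0], [0, 0, 0, 0]]

# Plain accuracy counts all 8 positions, padding included.
plain = sum(t == p
            for t_row, p_row in zip(true_tags, pred_tags)
            for t, p in zip(t_row, p_row)) / 8
masked = masked_accuracy(x_ids, true_tags, pred_tags, pad_id=9)
print(plain, masked)
```

In this toy batch the plain accuracy is 0.75, while only one of the three real tokens is tagged correctly. With the notebook's data, the tag indices could be obtained by taking the argmax over the last axis of `model.predict(x_test)` and of `y_test`, with `pad_id = word2idx["ENDPAD"]`.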

## 3. A More Complex Model in Practice

**BiLSTM with CRF**

![](images/crf.png)

**Installing keras-contrib**
```sh
pip install git+https://www.github.com/keras-team/keras-contrib.git
```

**BiLSTM-CRF model**
```py
from keras.models import Sequential
from keras.layers import LSTM, Embedding, Bidirectional
from keras_contrib.layers import CRF
from keras_contrib.losses import crf_loss
from keras_contrib.metrics import crf_viterbi_accuracy

model = Sequential()
model.add(Embedding(input_dim=n_words, output_dim=100, input_length=max_len))
model.add(Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.1)))
# The CRF layer replaces the per-step softmax and scores whole tag sequences.
# sparse_target=False matches the one-hot y_data built in section 2.1;
# sparse_target=True would instead expect integer tag indices.
crf = CRF(n_tags, sparse_target=False)
model.add(crf)
model.summary()

model.compile('adam', loss=crf_loss, metrics=[crf_viterbi_accuracy])
```
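
To give some intuition for what a CRF layer adds on top of per-step scores, here is a generic Viterbi decoding sketch in plain Python. It is not the keras-contrib implementation. The transition scores are set by hand for illustration, whereas a CRF layer learns them during training, and the function name `viterbi` and the toy numbers are assumptions.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag path.

    emissions[t][j]   : score of tag j at step t (e.g. from the BiLSTM)
    transitions[i][j] : score of moving from tag i to tag j
    """
    n_tags = len(transitions)
    score = list(emissions[0])  # best score of any path ending in each tag
    backpointers = []

    for emit in emissions[1:]:
        new_score, pointers = [], []
        for j in range(n_tags):
            # Best previous tag for reaching tag j at this step.
            best_i = max(range(n_tags),
                         key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)

    # Walk the back-pointers from the best final tag.
    last = max(range(n_tags), key=lambda j: score[j])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


tags = ["O", "B-PER", "I-PER"]
NEG = -1e4  # large penalty for a forbidden move
transitions = [
    [0.0, 0.0, NEG],  # O     -> I-PER is not allowed
    [0.0, 0.0, 0.0],  # B-PER -> anything
    [0.0, 0.0, 0.0],  # I-PER -> anything
]
# Taken alone, step 1 prefers O and step 2 prefers I-PER.
emissions = [[2.0, 1.5, 0.0], [0.0, 0.0, 1.0]]

greedy = [row.index(max(row)) for row in emissions]
best = viterbi(emissions, transitions)
print([tags[i] for i in greedy], [tags[i] for i in best])
```

Picking the best tag independently at each step gives the invalid sequence O, I-PER. Decoding over whole sequences with the transition penalty instead yields B-PER, I-PER, which is the kind of consistency a per-step softmax cannot enforce.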

In [None]:
# Exercise
# Entity recognition with the BiLSTM-CRF model



# Any Questions?