Named Entity Recognition (NER) involves determining entity boundaries and classifying entities into predefined categories, and is a fundamental task in natural language processing (NLP).
This project targets Chinese NER and reproduces the BiLSTM-CRF, Lattice LSTM, LR-CNN, and WC-LSTM models.
In addition, the source code of the graph-based model LGN and the sequence-based model SLK-NER is available on GitHub.
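As a point of reference for the reproduced baselines, the sketch below shows the general shape of a character-level BiLSTM-CRF tagger. It is a minimal illustration, not this repository's implementation: it assumes the third-party pytorch-crf package (which targets newer PyTorch releases than the v0.4.0 pinned below), and the class and parameter names are made up for the example.

```python
# Minimal character-level BiLSTM-CRF sketch (illustrative, NOT this repo's code).
# Assumes the third-party `pytorch-crf` package: pip install pytorch-crf
import torch.nn as nn
from torchcrf import CRF

class CharBiLSTMCRF(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, vocab_size, num_tags, embed_dim=100, hidden_dim=200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim // 2,
                            batch_first=True, bidirectional=True)
        self.proj = nn.Linear(hidden_dim, num_tags)   # per-char emission scores
        self.crf = CRF(num_tags, batch_first=True)

    def loss(self, chars, tags, mask):
        emissions = self.proj(self.lstm(self.embed(chars))[0])
        return -self.crf(emissions, tags, mask=mask)  # negative log-likelihood

    def decode(self, chars, mask):
        emissions = self.proj(self.lstm(self.embed(chars))[0])
        return self.crf.decode(emissions, mask=mask)  # best tag paths (Viterbi)
```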
- PyTorch v0.4.0
- Python v3.6.2
- numpy
- tqdm
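A typical environment setup might look as follows; the exact command for PyTorch v0.4.0 depends on platform and CUDA version, so the official PyTorch installation instructions may be needed instead:

```bash
# Illustrative setup; PyTorch v0.4.0 wheels are platform- and CUDA-specific.
pip install torch==0.4.0 numpy tqdm
```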
The open Resume dataset was collected by Zhang and Yang from Sina Finance; it consists mainly of résumés of senior executives of companies listed on Chinese stock markets. It can be obtained from [Zhang and Yang, 2018]; place it under the ./data/resume directory.
Dataset statistics:
Type | Train | Dev | Test |
---|---|---|---|
Sentence | 3.8k | 0.46k | 0.48k |
Char | 124.1k | 13.9k | 15.1k |
Tagging scheme: BMEO
Separator: '\t' (e.g., 吴\tB-NAME)
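Given this one-character-per-line, tab-separated format with blank lines between sentences, a minimal reader could look like the sketch below; the function name and file path are illustrative, not part of this repository:

```python
# Minimal reader for the char<TAB>tag format (illustrative, not this repo's loader).
def load_sentences(path):
    """Return a list of (chars, tags) pairs, one per sentence."""
    sentences, chars, tags = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                    # blank line ends a sentence
                if chars:
                    sentences.append((chars, tags))
                    chars, tags = [], []
                continue
            char, tag = line.split("\t")    # e.g. "吴\tB-NAME"
            chars.append(char)
            tags.append(tag)
    if chars:                               # file may lack a trailing blank line
        sentences.append((chars, tags))
    return sentences

# Usage (hypothetical file name):
# train = load_sentences("./data/resume/train.char")
```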
Entity types: the dataset was manually annotated with 8 types of named entities using the YEDDA system [Yang et al., 2018].
Tag | Meaning | Train | Dev | Test |
---|---|---|---|---|
CONT | Country | 260 | 33 | 28 |
EDU | Educational Institution | 858 | 106 | 112 |
LOC | Location | 47 | 2 | 6 |
NAME | Personal Name | 952 | 110 | 112 |
ORG | Organization | 4611 | 523 | 553 |
PRO | Profession | 287 | 18 | 33 |
RACE | Ethnic Background | 115 | 15 | 14 |
TITLE | Job Title | 6308 | 690 | 772 |
Total | | 13438 | 1497 | 1630 |
See the ./data/resume directory for details.
The pretrained embeddings use the baseline of the word segmenter RichWordSegmentor [Yang et al., 2017a].
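Assuming the common text format for pretrained vectors (one token followed by its vector values per line), an embedding matrix can be built roughly as follows; the function name, file layout, and initialization scale are assumptions for illustration:

```python
import numpy as np

# Sketch: build an embedding matrix from a text-format vector file, i.e. one
# "token v1 v2 ... vd" entry per line (illustrative, not this repo's code).
def load_pretrained(path, vocab, dim=50, scale=0.25):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split()
            if len(parts) != dim + 1:       # skip headers / malformed lines
                continue
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    # Tokens missing from the file get small random vectors.
    matrix = np.random.uniform(-scale, scale, (len(vocab), dim)).astype(np.float32)
    for token, idx in vocab.items():
        if token in vectors:
            matrix[idx] = vectors[token]
    return matrix
```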
Hyperparameters are set in the ./*.conf configuration files. Example run:
python main.py --conf_path ./wclstm_ner.conf # conf_path: path to the configuration file
To test, set the parameter status to test in the configuration file ./*.conf, then run:
python main.py --conf_path ./wclstm_ner.conf
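For orientation, a configuration file might look roughly like the excerpt below; only the status parameter is documented here, so everything else is a hypothetical placeholder:

```
# Hypothetical excerpt of a *.conf file. Only `status` is documented above;
# data paths, embedding paths, and hyperparameters are repo-specific keys.
status=test
```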
Results on the Resume dataset:
Models | P | R | F1 |
---|---|---|---|
BiLSTM-CRF [Lample et al., 2016] | 93.7 | 93.3 | 93.5 |
BiLSTM-CRF + bichar [Yang et al., 2017a] | 93.9 | 94.1 | 94.0 |
CAN [Zhu et al., 2019] | 95.1 | 94.8 | 94.9 |
BERT [Devlin et al., 2019] | 94.2 | 95.8 | 95.0 |
Lattice LSTM [Zhang and Yang, 2018] | 94.8 | 94.1 | 94.5 |
LR-CNN [Gui et al., 2019] | 95.4 | 94.8 | 95.1 |
WC-LSTM [Liu et al., 2019] | 95.3 | 95.2 | 95.2 |
LGN [Gui et al., 2019] | 95.3 | 95.5 | 95.4 |
SLK-NER [Hu et al., 2020] | 95.2 | 96.4 | 95.8 |
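The scores above are entity-level micro-averaged precision, recall, and F1: a predicted entity counts as correct only if both its span and its type match the gold annotation. A minimal computation over BMEO tag sequences might look like this sketch (illustrative, not the repository's evaluation code):

```python
# Sketch: entity-level P/R/F1 over BMEO tag sequences (illustrative).
def extract_spans(tags):
    """Return the set of (start, end, type) entity spans in one sentence."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        elif tag.startswith("M-"):
            if etype != tag[2:]:            # inconsistent entity, discard it
                start, etype = None, None
        elif tag.startswith("E-"):
            if start is not None and etype == tag[2:]:
                spans.add((start, i, etype))
            start, etype = None, None
        else:                               # "O" (an S- tag in BMES-style data
            start, etype = None, None       # would need one more case)
    return spans

def micro_prf1(gold_seqs, pred_seqs):
    tp = fp = fn = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = extract_spans(gold), extract_spans(pred)
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```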
References
[1] Jie Yang, Yue Zhang, Linwei Li, and Xingxuan Li. 2018. YEDDA: A Lightweight Collaborative Text Span Annotation Tool. In ACL (System Demonstrations).
[2] Jie Yang, Zhiyang Teng, Meishan Zhang, and Yue Zhang. 2016. Combining Discrete and Neural Features for Sequence Labeling. In CICLing.
[3] Xuezhe Ma and Eduard Hovy. 2016. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. In ACL.
[4] Emma Strubell, Patrick Verga, David Belanger, and Andrew McCallum. 2017. Fast and Accurate Entity Recognition with Iterated Dilated Convolutions. In EMNLP.
[5] Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural Architectures for Named Entity Recognition. In NAACL-HLT.
[6] Jie Yang, Yue Zhang, and Fei Dong. 2017. Neural Word Segmentation with Rich Pretraining. In ACL.
[7] Yuying Zhu and Guoxin Wang. 2019. CAN-NER: Convolutional Attention Network for Chinese Named Entity Recognition. In NAACL.
[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL.
[9] Yue Zhang and Jie Yang. 2018. Chinese NER Using Lattice LSTM. In ACL.
[10] Tao Gui, Ruotian Ma, Qi Zhang, Lujun Zhao, Yu-Gang Jiang, and Xuanjing Huang. 2019. CNN-Based Chinese NER with Lexicon Rethinking. In IJCAI.
[11] Wei Liu et al. 2019. An Encoding Strategy Based Word-Character LSTM for Chinese NER. In NAACL-HLT.
[12] Tao Gui, Yicheng Zou, Qi Zhang, Minlong Peng, Jinlan Fu, Zhongyu Wei, and Xuanjing Huang. 2019. A Lexicon-Based Graph Neural Network for Chinese NER. In EMNLP-IJCNLP.
[13] Dou Hu and Lingwei Wei. 2020. SLK-NER: Exploiting Second-order Lexicon Knowledge for Chinese NER. In the 32nd International Conference on Software Engineering and Knowledge Engineering (SEKE).