Learning Named Entity Tagger from Domain-Specific Dictionary



Check Our New NER Toolkit🚀🚀🚀

  • Inference:
    • LightNER: efficient inference with models that are pre-trained or trained with any of the following tools.
  • Training:
    • LD-Net: train NER models with efficient contextualized representations.
    • VanillaNER: train vanilla NER models with pre-trained embeddings.
  • Distant Training:
    • AutoNER: train NER models without line-by-line annotations and get competitive performance.


AutoNER trains named entity taggers with distant supervision, without requiring line-by-line annotations.

Details about AutoNER can be accessed at: https://arxiv.org/abs/1809.03599

Model Notes



Method                  Precision   Recall   F1
Supervised Benchmark    88.84       85.16    86.96
Dictionary Match        93.93       58.35    71.98
Fuzzy-LSTM-CRF          88.27       76.75    82.11
AutoNER                 88.96       81.00    84.80
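
As a quick sanity check on the table above, F1 can be recomputed from precision and recall. The sketch below assumes the table uses the standard definition of F1 as the harmonic mean of precision and recall; the helper name `f1_score` is ours and is not part of the AutoNER code. The recomputed values agree with the reported ones to within rounding.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1), copied from the table above.
reported = {
    "Supervised Benchmark": (88.84, 85.16, 86.96),
    "Dictionary Match": (93.93, 58.35, 71.98),
    "Fuzzy-LSTM-CRF": (88.27, 76.75, 82.11),
    "AutoNER": (88.96, 81.00, 84.80),
}

for method, (p, r, f1) in reported.items():
    print(f"{method}: recomputed F1 = {f1_score(p, r):.2f}, reported F1 = {f1:.2f}")
```

The comparison makes the trade-off visible: Dictionary Match has the highest precision but much lower recall, which is what pulls its F1 down.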


Required Inputs

  • Tokenized Raw Texts
    • Example: data/BC5CDR/raw_text.txt
      • One token per line.
      • An empty line means the end of a sentence.
  • Two Dictionaries
    • Core Dictionary w/ Type Info
      • Example: data/BC5CDR/dict_core.txt
        • Two columns (i.e., Type, Tokenized Surface) per line.
        • Tab separated.
      • How to obtain?
        • From domain-specific dictionaries.
    • Full Dictionary w/o Type Info
      • Example: data/BC5CDR/dict_full.txt
        • One tokenized high-quality phrase per line.
      • How to obtain?
        • From domain-specific dictionaries.
        • By applying the high-quality phrase mining tool to a domain-specific corpus.
  • Pre-trained word embeddings
    • Train your own or download from the web.
    • The example run uses embedding/bio_embedding.txt, which can be downloaded from our group's server, for example with curl http://dmserv4.cs.illinois.edu/bio_embedding.txt -o embedding/bio_embedding.txt. Because the embedding encoding step consumes a lot of memory, we also provide the pre-encoded file via autoner_train.sh.
  • [Optional] Development & Test Sets.
    • Example: data/BC5CDR/truth_dev.ck and data/BC5CDR/truth_test.ck
      • Three columns (i.e., token, Tie or Break label, entity type).
      • I is Break.
      • O is Tie.
      • Two special tokens <s> and <eof> mean the start and end of the sentence.
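
The input formats listed above can be read with a few small helpers. This is a minimal sketch, not the loader used by the AutoNER scripts: all function names are hypothetical, the sample rows are invented for illustration, and it assumes that the columns of the .ck files are whitespace separated and that the <s> and <eof> rows also carry three columns.

```python
from typing import Iterable, List, Tuple

# Label names as described above: I is Break, O is Tie.
BOUNDARY = {"I": "Break", "O": "Tie"}


def read_sentences(lines: Iterable[str]) -> List[List[str]]:
    """Raw text: one token per line, an empty line ends a sentence."""
    sentences, current = [], []
    for line in lines:
        token = line.strip()
        if token:
            current.append(token)
        elif current:
            sentences.append(current)
            current = []
    if current:  # last sentence without a trailing empty line
        sentences.append(current)
    return sentences


def read_core_dict(lines: Iterable[str]) -> List[Tuple[str, List[str]]]:
    """Core dictionary: 'Type<TAB>Tokenized Surface' per line."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        entity_type, surface = line.split("\t", 1)
        entries.append((entity_type, surface.split()))
    return entries


def read_full_dict(lines: Iterable[str]) -> List[List[str]]:
    """Full dictionary: one tokenized phrase per line, no type."""
    return [line.split() for line in lines if line.strip()]


def read_ck(lines: Iterable[str]) -> List[List[Tuple[str, str, str]]]:
    """Dev/test sets: token, Tie/Break label, entity type per row.

    Assumption: a new sentence begins at each <s> row and the
    current sentence is closed at an <eof> row.
    """
    sentences, current = [], []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed rows
        token, label, entity_type = parts
        if token == "<s>" and current:
            sentences.append(current)
            current = []
        current.append((token, BOUNDARY[label], entity_type))
        if token == "<eof>":
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


# Invented sample rows, for illustration only.
raw = ["aspirin", "reduces", "fever", ".", "", "It", "works", "."]
core = ["Chemical\taspirin", "Disease\theart attack"]
ck = ["<s> O None", "aspirin I Chemical", "helps I None", "<eof> I None"]

print(read_sentences(raw))
print(read_core_dict(core))
print(read_ck(ck))
```

To read the real files, pass an open file handle instead of a list, for example `read_sentences(open("data/BC5CDR/raw_text.txt", encoding="utf-8"))`.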


The packages this project depends on are listed below:



To train an AutoNER model, please run ./autoner_train.sh


To apply the trained AutoNER model, please run ./autoner_test.sh


You can specify the parameters in the bash files. The variable names are self-explanatory.


Please cite the following two papers if you are using our tool. Thanks!

@inproceedings{shang2018learning,
  title = {Learning Named Entity Tagger using Domain-Specific Dictionary},
  author = {Shang, Jingbo and Liu, Liyuan and Ren, Xiang and Gu, Xiaotao and Ren, Teng and Han, Jiawei},
  booktitle = {EMNLP},
  year = {2018}
}

@article{shang2018automated,
  title = {Automated phrase mining from massive text corpora},
  author = {Shang, Jingbo and Liu, Jialu and Jiang, Meng and Ren, Xiang and Voss, Clare R and Han, Jiawei},
  journal = {IEEE Transactions on Knowledge and Data Engineering},
  year = {2018},
  publisher = {IEEE}
}