# BERT-based NER using CoNLL-2003
> Author: Xin Xu <xxucs@zju.edu.cn>

## Overview
- **Named-entity recognition (NER)** (also known as named entity identification, entity chunking, and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.
- [**CoNLL-2003**](https://www.clips.uantwerpen.be/conll2003/ner/) is a dataset for NER that covers four types of named entities: persons, locations, organizations, and miscellaneous entities. The dataset is in the 'data' folder, which contains *train.txt*, *valid.txt* and *test.txt*.
- [**Bidirectional Encoder Representations from Transformers (BERT)**](https://github.com/google-research/bert) is a transformer-based machine learning technique for natural language processing (NLP) pre-training developed by Google.
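
To make the dataset description concrete, here is a minimal sketch of reading CoNLL-style files. It assumes the usual column layout (token, POS tag, chunk tag, NER tag, separated by whitespace, with blank lines between sentences and `-DOCSTART-` lines between documents). The `read_conll` helper and the sample lines are illustrative only and are not taken from this repository.

```python
def read_conll(lines):
    """Group CoNLL-style lines into sentences of (token, ner_tag) pairs."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        # Blank lines and -DOCSTART- markers separate sentences and documents.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split()
        # Assumption: first column is the token, last column is the NER tag.
        current.append((fields[0], fields[-1]))
    if current:
        sentences.append(current)
    return sentences


# Illustrative sample in the BIO tagging scheme (not copied from data/train.txt).
sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
""".splitlines()

sentences = read_conll(sample)
for sentence in sentences:
    print(sentence)
```

Each sentence becomes a list of `(token, tag)` pairs, where the tag suffix (`PER`, `LOC`, `ORG`, `MISC`) is one of the four entity types listed above and `O` marks tokens outside any entity.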

## Clone Repository
The first step is to clone the BERTNER GitHub repository.

In [None]:
!git clone https://github.com/xxupiano/BERTNER.git

In [None]:
%cd BERTNER

## Prepare the runtime environment

In [None]:
!pip install -r requirements.txt

## Fine-Tune
- To fine-tune the **bert-base** model, run 'run_ner.py'.
- The command below takes the following arguments:
  - '--data_dir' is the directory that contains the dataset. Pass 'data/', the directory inside the cloned 'BERTNER' folder.
  - '--bert_model' selects the pretrained BERT model to download from Hugging Face Transformers. Hugging Face lists several model names; here we select 'bert-base-cased'.
  - '--task_name' names the task to perform. Enter 'ner', since we train the model for named-entity recognition (NER).
  - '--output_dir' is the directory where the fine-tuned model is stored. Here we use 'out_ner'.
  - For other arguments such as '--max_seq_length', '--num_train_epochs' and '--warmup_proportion', use the values suggested in the repository.
  - Pass '--do_train' to train the model and '--do_eval' to evaluate it afterwards.

In [None]:
!python run_ner.py --data_dir=data/ --bert_model=bert-base-cased --task_name=ner --output_dir=out_ner --max_seq_length=128 --do_train --num_train_epochs 5 --do_eval --warmup_proportion=0.1
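
As a rough illustration of what `--warmup_proportion=0.1` typically means, the sketch below computes a learning-rate multiplier that rises linearly over the first 10% of training steps and then decays linearly to zero. This is a common schedule for BERT fine-tuning, but it is an assumption here: the exact schedule implemented in 'run_ner.py' is not shown in this notebook, and `lr_scale` and the step counts are hypothetical.

```python
def lr_scale(step, total_steps, warmup_proportion=0.1):
    """Multiplier applied to the base learning rate at a given step.

    Assumed schedule: linear warmup followed by linear decay to zero.
    """
    warmup_steps = int(total_steps * warmup_proportion)
    if warmup_steps > 0 and step < warmup_steps:
        # Warmup phase: grow linearly from 0 to 1.
        return step / warmup_steps
    # Decay phase: shrink linearly from 1 to 0 over the remaining steps.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


total_steps = 1000  # hypothetical number of optimizer steps
for step in (0, 50, 100, 550, 1000):
    print(step, lr_scale(step, total_steps))
```

With 1000 total steps, the multiplier reaches its peak at step 100, so the first 100 updates use a reduced learning rate.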

## Prediction
- Set the variable *text* in the following cell to the sentence you want to tag.
- Run the cell to get the NER result.

In [None]:
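from bert import Ner

# Load the fine-tuned model from the output directory used during training.
model = Ner("out_ner/")

# Sentence to run NER on.
text = "Irene, a master student in Zhejiang University, Hangzhou, is traveling in Warsaw for Chopin Music Festival."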
print("Text to predict Entity:")
print(text)
print('Results of NER:')

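# Human-readable names for the four CoNLL-2003 entity types.
label_names = {
    'PER': 'Person',
    'LOC': 'Location',
    'ORG': 'Organization',
    'MISC': 'Miscellaneous',
}

result = model.predict(text)
for k, v in result.items():
    if v:
        # Print the entities found, followed by the entity type.
        print(v, end=': ')
        print(label_names.get(k, k))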
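
The result above is keyed by entity type. To show how token-level tags can be turned into such a grouping, here is a minimal sketch that merges BIO tags into entity strings. It is not the repository's implementation: `group_entities` and the example tags are hypothetical, and the tags are written by hand rather than produced by the model.

```python
from collections import defaultdict


def group_entities(tokens, tags):
    """Merge BIO-tagged tokens into a dict of entity type -> list of entities."""
    entities = defaultdict(list)
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, ent_type = tag.partition("-")
        # An I- tag of the same type extends the current entity.
        if prefix == "I" and ent_type == current_type:
            current_tokens.append(token)
            continue
        # Anything else closes the current entity, if there is one.
        if current_tokens:
            entities[current_type].append(" ".join(current_tokens))
        if prefix in ("B", "I"):
            # Start a new entity (leniently accepting I- without a preceding B-).
            current_tokens, current_type = [token], ent_type
        else:
            current_tokens, current_type = [], None
    if current_tokens:
        entities[current_type].append(" ".join(current_tokens))
    return dict(entities)


# Hand-written example tags, for illustration only.
tokens = ["Irene", "studies", "at", "Zhejiang", "University", "in", "Hangzhou"]
tags = ["B-PER", "O", "O", "B-ORG", "I-ORG", "O", "B-LOC"]
print(group_entities(tokens, tags))
```

Multi-token entities such as "Zhejiang University" are joined into a single string because the `I-ORG` tag continues the span opened by `B-ORG`.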