<a href="https://colab.research.google.com/github/sandeep16064/Named-Entity-Recognition-NER-Papers/blob/master/model%20of%2004_NER_using_spaCy_CoNLL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Training and Evaluating an NER model with spaCy on the CoNLL dataset

In this notebook, we use spaCy's command-line interface to train and evaluate an NER model on the CoNLL data. We then compare it with spaCy's pretrained NER model.

Note: this experiment creates several folders: `spacyNER_data` (converted data), `model` (trained model), and `result` and `pretrained_result` (evaluation output).

## Step 1: Converting the data to JSON so it can be used by spaCy

In [1]:
import os

In [3]:
# upload train.txt, test.txt, dev.txt from Data/conll2003/en
try:
    from google.colab import files
    uploaded = files.upload()
except ModuleNotFoundError:
    print('Not using colab')

Saving dev.txt to dev.txt
Saving test.txt to test.txt
Saving train.txt to train (1).txt


In [6]:
# Read the CoNLL data from the conll2003 folder and store the converted data in a folder named spacyNER_data

# !mkdir spacyNER_data
os.makedirs('spacyNER_data', exist_ok=True)

# makedirs with exist_ok=True creates the folder if it doesn't exist and does nothing if it does,
# whereas `!mkdir` (or os.mkdir) reports an error when the folder already exists
try:
    import google.colab 
    !python -m spacy convert "train.txt" spacyNER_data -c ner
    !python -m spacy convert "test.txt" spacyNER_data -c ner
    !python -m spacy convert "dev.txt" spacyNER_data -c ner
except ModuleNotFoundError:
    !python -m spacy convert "Data/conll2003/en/train.txt" spacyNER_data -c ner
    !python -m spacy convert "Data/conll2003/en/test.txt" spacyNER_data -c ner
    !python -m spacy convert "Data/conll2003/en/dev.txt" spacyNER_data -c ner

ℹ Auto-detected token-per-line NER format
ℹ Grouping every 1 sentences into a document.
⚠ To generate better training data, you may want to group sentences
into documents with `-n 10`.
✔ Generated output file (14987 documents): spacyNER_data/train.json
ℹ Auto-detected token-per-line NER format
ℹ Grouping every 1 sentences into a document.
⚠ To generate better training data, you may want to group sentences
into documents with `-n 10`.
✔ Generated output file (3466 documents): spacyNER_data/test.json
ℹ Auto-detected token-per-line NER format
ℹ Grouping every 1 sentences into a document.
⚠ To generate better training data, you may want to group sentences
into documents with `-n 10`.
✔ Generated output file (3466 documents): spacyNER_data/dev.json


#### For example, the data before and after running spaCy's convert command look as follows.

In [7]:
try:
    import google.colab
    !echo "BEFORE : (train.txt)"
    !head "train.txt" -n 11 | tail -n 9
except ModuleNotFoundError:
    print("BEFORE : (Data/conll2003/en/train.txt)")
    # Print lines 3-11, the same range the `head | tail` command above shows
    with open("Data/conll2003/en/train.txt") as file:
        content = file.readlines()
    print(*content[2:11], sep='')

BEFORE : (train.txt)
EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O
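
Each non-blank line in this format carries four whitespace-separated columns: the token, its part-of-speech tag, its chunk tag, and its NER tag in IOB notation. As a rough illustration of what a converter has to read, the sketch below recovers sentences by splitting on blank lines. It is not spaCy's implementation: `parse_conll` is a hypothetical helper, the skipping of `-DOCSTART-` marker lines is an assumption about the file, and the blank line in the sample is inserted artificially to show the splitting.

```python
from typing import List, Tuple

Token = Tuple[str, str, str, str]  # (word, POS tag, chunk tag, NER tag)

def parse_conll(lines: List[str]) -> List[List[Token]]:
    """Split token-per-line CoNLL text into sentences of 4-column tokens."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in lines:
        line = line.strip()
        # A blank line ends the current sentence; -DOCSTART- marker lines
        # are assumed to be document separators and are skipped.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        word, pos, chunk, ner = line.split()
        current.append((word, pos, chunk, ner))
    if current:
        sentences.append(current)
    return sentences

sample = [
    "EU NNP B-NP B-ORG",
    "rejects VBZ B-VP O",
    "German JJ B-NP B-MISC",
    "",  # artificial sentence break, for illustration only
    "British JJ B-NP B-MISC",
    "lamb NN I-NP O",
]
sentences = parse_conll(sample)
print(len(sentences))   # 2
print(sentences[0][0])  # ('EU', 'NNP', 'B-NP', 'B-ORG')
```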


In [8]:
try:
    import google.colab
    !echo "AFTER : (spacyNER_data/train.json)"
    !head "spacyNER_data/train.json" -n 77 | tail -n 58
except ModuleNotFoundError:
    print("AFTER : (spacyNER_data/train.json)")
    with open('spacyNER_data/train.json') as f:
        content = f.readlines()
    print(*content[19:77], sep='')

AFTER : (spacyNER_data/train.json)
  {
    "id":1,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"EU",
                "tag":"NNP",
                "ner":"U-ORG"
              },
              {
                "orth":"rejects",
                "tag":"VBZ",
                "ner":"O"
              },
              {
                "orth":"German",
                "tag":"JJ",
                "ner":"U-MISC"
              },
              {
                "orth":"call",
                "tag":"NN",
                "ner":"O"
              },
              {
                "orth":"to",
                "tag":"TO",
                "ner":"O"
              },
              {
                "orth":"boycott",
                "tag":"VB",
                "ner":"O"
              },
              {
                "orth":"British",
                "tag":"JJ",
                "ner":"U-MISC"
              },
              {


## Training the NER model with spaCy (CLI)

All the command-line options are listed at: https://spacy.io/api/cli#train
We train with spaCy's `train` command for English (`en`), and the results are stored in a folder
called "model" (created during training). The training file is "spacyNER_data/train.json" and the validation file is "spacyNER_data/dev.json".

-G enables gold preprocessing (`--gold-preproc`) in the spaCy v2 CLI. It is not the GPU option, which is the lowercase `-g` (`--use-gpu`); consistent with this, the training log reports CPU WPS.
-p stands for pipeline and should be followed by a comma-separated list of components; in this case, a tagger and an NER component are trained together.
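
Outside a notebook the `!` shell shortcut is not available, so the same command has to be launched another way. The sketch below assembles the argument list used in this notebook so it could be passed to `subprocess.run`. The helper name `build_train_command` and its parameters are hypothetical; only the positional arguments and flags that appear in this notebook are used, and they follow the spaCy v2 CLI syntax.

```python
import sys
from typing import List, Optional

def build_train_command(lang: str, output_dir: str, train_path: str,
                        dev_path: str, pipeline: List[str],
                        flags: Optional[List[str]] = None) -> List[str]:
    """Build the argument list for `python -m spacy train ...` (v2 syntax)."""
    cmd = [sys.executable, "-m", "spacy", "train",
           lang, output_dir, train_path, dev_path]
    cmd += flags or []                  # extra flags, passed through unchanged
    cmd += ["-p", ",".join(pipeline)]   # pipeline components, comma-separated
    return cmd

cmd = build_train_command("en", "model", "spacyNER_data/train.json",
                          "spacyNER_data/dev.json", ["tagger", "ner"],
                          flags=["-G"])
print(" ".join(cmd[1:]))
# -m spacy train en model spacyNER_data/train.json spacyNER_data/dev.json -G -p tagger,ner

# To actually start training from a script (requires spaCy v2 to be installed):
# import subprocess; subprocess.run(cmd, check=True)
```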

In [10]:
!python -m spacy train en model spacyNER_data/train.json spacyNER_data/dev.json -G -p tagger,ner

✔ Created output directory: model
Training pipeline: ['tagger', 'ner']
Starting with blank model 'en'
Counting training words (limit=0)

Itn  Tag Loss    Tag %    NER Loss   NER P   NER R   NER F   Token %  CPU WPS
---  ---------  --------  ---------  ------  ------  ------  -------  -------
  1  31411.855    94.143  16692.765  83.458  82.868  83.162  100.000    12182
  2  16916.010    94.819   7588.607  86.702  86.133  86.416  100.000    12355
  3  13723.936    95.130   5222.184  88.040  87.462  87.750  100.000    12042
  4  11705.437    95.308   3899.803  88.289  87.799  88.043  100.000    12375
  5  10372.514    95.302   3016.821  88.227  88.034  88.131  100.000    12296
  6   9653.006    95.391   2564.013  88.445  87.984  88.214  100.000    12308
  7   8948.312    95.455   2221.047  88.468  88.051  88.259  100.000    12321
  8   8299.700    95.498   1969.310  88.695  88.068  88.380  100.000    12414
  9   7895.707    95.585   1745.215  88.547  8

Notice how the NER F-score improves with every iteration shown, with the largest gains in the first three.

## Evaluating the model with the test data set (`spacyNER_data/test.json`)

### On the trained model (`model/model-best`)

In [16]:
# Create a folder to store the output and visualizations.
# !mkdir result
os.makedirs('result', exist_ok=True)
!python -m spacy evaluate model/model-best spacyNER_data/test.json -dp result
# !python -m spacy evaluate model/model-final data/test.txt.json -dp result

Time      4.29 s
Words     51578 
Words/s   12024 
TOK       100.00
POS       95.59 
UAS       0.00  
LAS       0.00  
NER P     88.75 
NER R     88.15 
NER F     88.45 
Textcat   0.00  

✔ Generated 25 parses as HTML
result




A visualization of the entity-tagged test data is saved in the file `result/entities.html`.
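
The NER F value in the evaluation output is consistent with the usual F1 definition, the harmonic mean of precision (NER P) and recall (NER R). A minimal sketch that reproduces the reported figure from the other two (the function name `f_score` is just illustrative):

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# NER P and NER R of the trained model on the test set
print(round(f_score(88.75, 88.15), 2))  # 88.45
```

Because the harmonic mean is pulled toward the smaller of the two values, a model cannot reach a high F-score by maximizing only precision or only recall.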

### On spaCy's pretrained NER model (`en_core_web_sm`)

In [12]:
!python -m spacy download en_core_web_sm

Collecting en_core_web_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz (12.0 MB)
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


In [13]:
# !mkdir pretrained_result
os.makedirs('pretrained_result', exist_ok=True)
!python -m spacy evaluate en_core_web_sm spacyNER_data/test.json -dp pretrained_result

Time      7.77 s
Words     51578 
Words/s   6637  
TOK       100.00
POS       87.09 
UAS       0.00  
LAS       0.00  
NER P     5.15  
NER R     7.17  
NER F     5.99  
Textcat   0.00  

✔ Generated 25 parses as HTML
pretrained_result


A visualization of the entity-tagged test data is saved in the file `pretrained_result/entities.html`.
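
One likely contributor to the very low NER scores of the pretrained model is a label mismatch rather than poor entity detection alone. `en_core_web_sm` predicts OntoNotes-style labels such as `PERSON`, `GPE`, `NORP` and `DATE`, whereas the CoNLL data uses `PER`, `LOC`, `ORG` and `MISC`, so a correctly found person is still scored as an error. The sketch below shows one way predictions could be relabeled before scoring. The mapping table is an assumption made for illustration, not an official correspondence, and the example spans are made up.

```python
from typing import List, Tuple

# Hypothetical mapping from OntoNotes-style labels to CoNLL-2003 labels.
ONTONOTES_TO_CONLL = {
    "PERSON": "PER",
    "ORG": "ORG",
    "GPE": "LOC",
    "LOC": "LOC",
    "FAC": "LOC",
    "NORP": "MISC",
    "EVENT": "MISC",
    "LANGUAGE": "MISC",
    "PRODUCT": "MISC",
    "WORK_OF_ART": "MISC",
    "LAW": "MISC",
}

Span = Tuple[int, int, str]  # (start token, end token, label)

def map_entities(spans: List[Span]) -> List[Span]:
    """Relabel spans; drop labels with no CoNLL counterpart (DATE, CARDINAL, ...)."""
    mapped = []
    for start, end, label in spans:
        target = ONTONOTES_TO_CONLL.get(label)
        if target is not None:
            mapped.append((start, end, target))
    return mapped

# Made-up predictions, for illustration only
predicted = [(0, 1, "ORG"), (2, 3, "NORP"), (6, 7, "NORP"), (9, 10, "DATE")]
print(map_entities(predicted))
# [(0, 1, 'ORG'), (2, 3, 'MISC'), (6, 7, 'MISC')]
```

Scoring the relabeled spans against the CoNLL annotations would give a fairer comparison with the model trained above, although differences in annotation guidelines between the two datasets would remain.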