<a href="https://colab.research.google.com/github/sunnypwang/CRF_demo/blob/master/CRF_for_NER.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Training a named-entity recognizer using Conditional Random Fields

We want to build a feature-based named-entity recognizer (NER) that uses the following features:

*   Morphological and orthographical features
*   Word type features
*   Word features
*   Gazetteer features

We want to evaluate these feature sets incrementally on the English NER task.


In [0]:
!wget https://raw.githubusercontent.com/synalp/NER/master/corpus/CoNLL-2003/eng.train.openNLP

In [0]:
!wget https://raw.githubusercontent.com/synalp/NER/master/corpus/CoNLL-2003/eng.testa.openNLP

In [0]:
!pip install python-crfsuite

In [0]:
import pycrfsuite
import spacy

# Orthographical features

Write feature functions that answer the following questions:
*  Is the first letter of the word capitalized?
*  Is the entire word capitalized?
*  What is the general shape of the word?


In [0]:
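def is_first_letter_capitalized(word_seq, feature_dict_seq, position):
  # True only for title-cased words: first letter uppercase, the rest lowercase
  # (so 'London' is True, while 'EU' and 'A' are False)
  word = word_seq[position]
  feature_dict_seq[position]['is_first_letter_cap'] = word[0].isupper() and len(word) > 1 and word[1:].islower()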


def is_all_caps(word_seq, feature_dict_seq, position):
  pass

def word_shape(word_seq, feature_dict_seq, position):
  pass
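
One possible way to fill in the two stubs above is sketched below. The feature names `is_all_caps` and `word_shape`, and the shape alphabet (`X` for uppercase, `x` for lowercase, `d` for digits, with repeated characters collapsed), are assumptions made for illustration rather than a scheme prescribed by this notebook.

```python
import re

def is_all_caps(word_seq, feature_dict_seq, position):
  # str.isupper() is True only if the word has at least one cased letter
  # and every cased letter is uppercase, so '1996' is False and 'EU' is True
  word = word_seq[position]
  feature_dict_seq[position]['is_all_caps'] = word.isupper()

def word_shape(word_seq, feature_dict_seq, position):
  # Assumed shape alphabet: uppercase -> 'X', lowercase -> 'x', digit -> 'd';
  # all other characters are kept as they are
  word = word_seq[position]
  shape = re.sub(r'[A-Z]', 'X', word)
  shape = re.sub(r'[a-z]', 'x', shape)
  shape = re.sub(r'[0-9]', 'd', shape)
  # Collapse runs of the same character so that word length does not matter
  shape = re.sub(r'(.)\1+', r'\1', shape)
  feature_dict_seq[position]['word_shape'] = shape
```

With this scheme `London` and `Brussels` share the shape `Xx`, which lets the model generalize across capitalized words of different lengths.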




# Train a simple model and evaluate

* Load the training set and the test set

* Apply the feature functions above to both sets, then fit a CRF model on the training set

* Evaluate the results on the test set

In [0]:
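def load_data(file_name):
  """Read a CoNLL-style file into a list of sentences of (word, ner_tag) pairs."""
  sequence_list = []
  cur_seq = []
  with open(file_name) as f:
    for line in f:
      if line.strip() == '':
        # A blank line ends a sentence; single-token sequences are skipped
        if len(cur_seq) > 1:
          sequence_list.append(cur_seq)
        cur_seq = []
      else:
        word, _, _, ner_tag = line.strip().split()
        # Convert IOB1 to IOB2: an I- tag that starts an entity (sentence start,
        # after 'O', or after an entity of a different type) becomes B-
        prev_type = cur_seq[-1][1][2:] if cur_seq else ''
        if ner_tag[0] == 'I' and prev_type != ner_tag[2:]:
          ner_tag = 'B' + ner_tag[1:]
        cur_seq.append((word, ner_tag))
  # Keep the last sentence when the file does not end with a blank line
  if len(cur_seq) > 1:
    sequence_list.append(cur_seq)
  return sequence_list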


In [0]:
training_set = load_data('eng.train.openNLP')
test_set = load_data('eng.testa.openNLP')

In [0]:
training_set[200]

In [0]:
def train_and_evaluate(training_set, test_set, feature_function_list):
  pass

train_and_evaluate(training_set, test_set, [is_first_letter_capitalized, is_all_caps, word_shape])
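
A minimal sketch of how `train_and_evaluate` could be filled in is shown below. It assumes the feature-function signature used in this notebook and the `(word, ner_tag)` sequences returned by `load_data`. The helper names `featurize`, `extract_spans`, and `span_f1`, the `model_path` default, and the choice of entity-level precision, recall, and F1 as the metric are all assumptions made for illustration. `pycrfsuite` is imported inside the function so that the pure-Python helpers can be tried without it.

```python
def featurize(sequence, feature_function_list):
  # sequence is a list of (word, ner_tag) pairs; returns one feature dict per word
  word_seq = [word for word, _ in sequence]
  feature_dict_seq = [{} for _ in word_seq]
  for position in range(len(word_seq)):
    for feature_function in feature_function_list:
      feature_function(word_seq, feature_dict_seq, position)
  return feature_dict_seq

def extract_spans(tag_seq):
  # Turn a BIO tag sequence into a set of (start, end, entity_type) spans
  spans, start, ent_type = set(), None, None
  for i, tag in enumerate(list(tag_seq) + ['O']):  # sentinel 'O' closes a trailing span
    if start is not None and (tag == 'O' or tag[0] == 'B' or tag[2:] != ent_type):
      spans.add((start, i, ent_type))
      start, ent_type = None, None
    if tag != 'O' and start is None:
      start, ent_type = i, tag[2:]
  return spans

def span_f1(gold_seqs, pred_seqs):
  # An entity counts as correct only if its boundaries and type both match
  tp = n_gold = n_pred = 0
  for gold, pred in zip(gold_seqs, pred_seqs):
    gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
    tp += len(gold_spans & pred_spans)
    n_gold += len(gold_spans)
    n_pred += len(pred_spans)
  precision = tp / n_pred if n_pred else 0.0
  recall = tp / n_gold if n_gold else 0.0
  f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
  return precision, recall, f1

def train_and_evaluate(training_set, test_set, feature_function_list,
                       model_path='ner.crfsuite'):
  import pycrfsuite  # lazy import: only needed for actual training
  trainer = pycrfsuite.Trainer(verbose=False)
  for sequence in training_set:
    trainer.append(featurize(sequence, feature_function_list),
                   [tag for _, tag in sequence])
  trainer.train(model_path)

  tagger = pycrfsuite.Tagger()
  tagger.open(model_path)
  gold = [[tag for _, tag in sequence] for sequence in test_set]
  pred = [tagger.tag(featurize(sequence, feature_function_list))
          for sequence in test_set]
  precision, recall, f1 = span_f1(gold, pred)
  print('precision=%.3f recall=%.3f f1=%.3f' % (precision, recall, f1))
  return precision, recall, f1
```

Entity-level scoring is used here instead of per-token accuracy because most tokens are tagged `O`, so token accuracy stays high even when few entities are found.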


# Word type features

Use the spaCy NLP pipeline to obtain universal part-of-speech (POS) tags, then write feature functions for:

* POS tag of the current word (t-0)
* POS tag of the next word (t+1)
* POS tag of the previous word (t-1)
* t-1 AND t-0 AND t+1


In [0]:
# The 'en' shortcut was removed in spaCy v3; load the model by its full package name
nlp_processor = spacy.load('en_core_web_sm')
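
The POS features listed above could be sketched as follows. Because the feature functions in this notebook only receive the word sequence, this sketch wraps a tagger in a closure and caches the tags of the most recent sentence. The names `make_pos_features` and `spacy_pos_tagger`, the feature keys, and the `<BOS>`/`<EOS>` padding symbols are assumptions for illustration. The spaCy part builds a `Doc` from the existing tokens so that spaCy's own tokenizer does not change the token boundaries; treat it as an untested sketch that depends on the installed spaCy version and model.

```python
def make_pos_features(pos_tagger):
  # pos_tagger: any callable mapping a list of words to a list of POS tags
  # of the same length (hypothetical interface for this sketch)
  cache = {}

  def pos_features(word_seq, feature_dict_seq, position):
    key = tuple(word_seq)
    if key not in cache:
      cache.clear()  # keep only the most recent sentence in memory
      cache[key] = pos_tagger(list(word_seq))
    tags = cache[key]
    prev_tag = tags[position - 1] if position > 0 else '<BOS>'
    next_tag = tags[position + 1] if position + 1 < len(tags) else '<EOS>'
    features = feature_dict_seq[position]
    features['pos_cur'] = tags[position]
    features['pos_prev'] = prev_tag
    features['pos_next'] = next_tag
    features['pos_trigram'] = prev_tag + '|' + tags[position] + '|' + next_tag

  return pos_features

def spacy_pos_tagger(nlp):
  # Wrap a loaded spaCy pipeline so that it tags pre-tokenized words
  from spacy.tokens import Doc  # lazy import: only needed when spaCy is used

  def tag(words):
    doc = Doc(nlp.vocab, words=words)
    for _, component in nlp.pipeline:
      doc = component(doc)
    return [token.pos_ for token in doc]

  return tag
```

A feature function for training could then be created with `make_pos_features(spacy_pos_tagger(nlp_processor))` and passed to `train_and_evaluate` alongside the orthographical features.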

# Word features
These features let the model memorize the name strings it sees in the training set.
* current word (w-0)
* next word (w+1)
* previous word (w-1)
* w-1 AND w-0 AND w+1
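
The four word features above might look like the sketch below. The function name `word_features`, the feature keys, and the `<BOS>`/`<EOS>` padding symbols for sentence boundaries are assumptions for illustration. Words are kept in their original case here; lowercasing them is a reasonable alternative that reduces sparsity.

```python
def word_features(word_seq, feature_dict_seq, position):
  # Pad with boundary symbols so the first and last words still get context features
  cur_word = word_seq[position]
  prev_word = word_seq[position - 1] if position > 0 else '<BOS>'
  next_word = word_seq[position + 1] if position + 1 < len(word_seq) else '<EOS>'
  features = feature_dict_seq[position]
  features['word_cur'] = cur_word
  features['word_prev'] = prev_word
  features['word_next'] = next_word
  # Conjunction of the three words; sparse, but precise when it matches
  features['word_trigram'] = prev_word + '|' + cur_word + '|' + next_word
```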

# Gazetteer features

In NLP, a gazetteer is a list of names.

The training set cannot possibly cover all of the names in the world. But a list of names from trustworthy sources provides an enormous source of knowledge. 

First, write a regular expression to extract the name list from 

https://names.mongabay.com//baby_names/boys-2017.html

https://names.mongabay.com//baby_names/girls-2017.html

Then download a list of countries:

https://gist.githubusercontent.com/kalinchernev/486393efcca01623b18d/raw/daa24c9fea66afb7d68f8d69f0c4b8eeb9406e83/countries


Use these lists to form features:

* Is this word in the list?

In practice, we would scrape Wikipedia or other sources (e.g., a medical dictionary) for more coverage.

In [0]:
import requests
import re

In [0]:
!wget https://gist.githubusercontent.com/kalinchernev/486393efcca01623b18d/raw/daa24c9fea66afb7d68f8d69f0c4b8eeb9406e83/countries

In [0]:
website_text = requests.get('https://names.mongabay.com//baby_names/boys-2017.html').text
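
The extraction and lookup steps could be sketched as below. The actual markup of the name pages is not shown in this notebook, so the regular expression assumes that each name sits alone in a table cell such as `<td>Liam</td>`; inspect `website_text` and adjust the pattern if the page differs. The helper names `extract_names`, `load_countries`, and `make_gazetteer_feature`, and the assumption that the countries file holds one country per line, are likewise illustrative.

```python
import re

def extract_names(html_text):
  # ASSUMPTION: names appear as a single capitalized word inside a <td> cell
  return set(re.findall(r'<td>\s*([A-Z][a-z]+)\s*</td>', html_text))

def load_countries(text):
  # ASSUMPTION: one country name per line
  return {line.strip() for line in text.splitlines() if line.strip()}

def make_gazetteer_feature(list_name, entries):
  # Build a feature function that checks membership in the given list
  lowered = {entry.lower() for entry in entries}

  def gazetteer_feature(word_seq, feature_dict_seq, position):
    word = word_seq[position].lower()
    feature_dict_seq[position]['in_' + list_name] = word in lowered

  return gazetteer_feature
```

For example, `make_gazetteer_feature('boy_names', extract_names(website_text))` yields a feature function with the same signature as the others. Note that a single-token lookup like this cannot match multi-word entries such as `United States`.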

# Putting it all together

Try different combinations of features and see if you can get good results. 
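
One way to organize this search is sketched below: it tries every non-empty subset of the feature functions and ranks the subsets by score. The names `evaluate_feature_subsets` and `evaluate` are hypothetical; `evaluate` stands for any callable that takes a list of feature functions and returns a single number, such as the F1 from a finished `train_and_evaluate`. The number of subsets doubles with each added feature function, so this is only practical for a handful of them.

```python
from itertools import combinations

def evaluate_feature_subsets(feature_function_list, evaluate):
  # evaluate: hypothetical callable mapping a list of feature functions to a score
  results = []
  for size in range(1, len(feature_function_list) + 1):
    for subset in combinations(feature_function_list, size):
      score = evaluate(list(subset))
      results.append(([f.__name__ for f in subset], score))
  # Best-scoring feature combination first
  results.sort(key=lambda result: result[1], reverse=True)
  return results
```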