# Named Entity Recognition using Conditional Random Fields



### Overview


The objective is to build an entity recognition model to predict various entities/phrases from unstructured text. 

The data consists of a set of sentences (sequences), each of which contains a series of words (e.g., 'Zambian', 'officials') and their respective IOB labels (e.g., 'B-gpe', 'O').  

A quick peek at the dataset:


| sentence_id | word      | Tag   |
|-------------|-----------|-------|
| 704         | Zambian   | B-gpe |
| 704         | officials | O     |
| 704         | say       | O     |


The entities to be recognized are as follows:  

nat -> Natural Phenomenon  
gpe -> Geopolitical  
tim -> Time   
geo -> Geographical   
org -> Organization  
per -> Person  
art -> Artifact  
eve -> Event  

The target variable to predict is `Tag`, which follows the convention below.  
B-{ } : Beginning of an entity phrase  
I-{ } : Inside an entity phrase  
O     : Outside      
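
To make the convention concrete, the sketch below groups IOB tags back into entity phrases, using the first sentence of the dataset as input. The helper name `iob_to_spans` is hypothetical and not part of this project's code; it also assumes that an `I-` tag which does not continue the current entity is simply treated as outside.

```python
def iob_to_spans(tokens, tags):
    """Group (token, IOB tag) pairs into (entity_type, phrase) tuples."""
    spans = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts: close the previous one, if any
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the entity that is currently open
            current_tokens.append(token)
        else:
            # 'O' (or a stray 'I-'): close any open entity
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


tokens = ['Zambian', 'officials', 'say', 'reports', 'that', 'President',
          'Levy', 'Mwanawasa', 'has', 'died', 'are', 'FALSE', '.']
tags = ['B-gpe', 'O', 'O', 'O', 'O', 'B-per', 'I-per', 'I-per',
        'O', 'O', 'O', 'O', 'O']

spans = iob_to_spans(tokens, tags)
print(spans)
# [('gpe', 'Zambian'), ('per', 'President Levy Mwanawasa')]
```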

We will predict the entities using **Conditional Random Fields**.
  
Conditional Random Field is a standard model for predicting the most likely sequence of labels that correspond to a sequence of inputs.
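
For reference, a linear-chain CRF is commonly written as the conditional distribution below over a tag sequence $y$ given the words $x$. The feature functions $f_k$ and weights $\lambda_k$ are generic textbook notation, not names used elsewhere in this notebook.

$$
p(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \right),
\qquad
Z(x) = \sum_{y'} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \right)
$$

Prediction then amounts to finding the tag sequence $y$ that maximizes $p(y \mid x)$.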


### Environment set-up

Install the libraries listed in `requirements.txt` using the command below.

In [1]:
!pip install -r requirements.txt



### Load Libraries

In [2]:
# Data Analysis
import numpy as np
import csv
import pandas as pd
pd.set_option('display.max_rows', 100)
pd.set_option("display.max_colwidth", 30)
pd.set_option("display.max_columns", 30)

# Text feature extraction - Custom functions defined in utils.py
from utils import sent2features

# Modeling Algorithm
import sklearn_crfsuite

# Model Selection
from sklearn.model_selection import RandomizedSearchCV
import scipy.stats
from sklearn.metrics import make_scorer
from sklearn_crfsuite import scorers, metrics

# Saving and loading Model 
import pickle

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Vinubalan\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Vinubalan\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


### 1. Load Data

In [3]:
# Raw string avoids an invalid-escape warning for the whitespace regex separator
df = pd.read_csv("data/train_new.txt", sep=r"\s+", quoting=csv.QUOTE_NONE)
print(df.shape)
df.head()

(46464, 3)


Unnamed: 0,sentence_id,word,Tag
0,704,Zambian,B-gpe
1,704,officials,O
2,704,say,O
3,704,reports,O
4,704,that,O


In [4]:
df.Tag.nunique()

17

In [5]:
from sklearn.feature_extraction import DictVectorizer

df_new =  df.head(100).drop(columns = ['sentence_id','Tag'])

v = DictVectorizer(sparse=False)
X = v.fit_transform(df_new.to_dict('records'))


In [6]:
df_new.head()

Unnamed: 0,word
0,Zambian
1,officials
2,say
3,reports
4,that


In [7]:
df_new.word.nunique()

78

### 2. Data Exploration

In [8]:
print("Unique sentences : {} , Unique words :{}\n".format(df.sentence_id.nunique(), df.word.nunique()))

print("Proportion of Tags")
print(np.round(df.Tag.value_counts(normalize = True),4)*100)
print("\nProportion of Tags excluding'O'")
print(np.round(df[df.Tag != 'O'].Tag.value_counts(normalize = True),4)*100)

Unique sentences : 2099 , Unique words :7310

Proportion of Tags
O        84.92
B-geo     3.11
I-per     1.90
B-org     1.87
B-gpe     1.87
B-tim     1.75
B-per     1.67
I-org     1.43
I-geo     0.61
I-tim     0.49
B-art     0.09
B-eve     0.08
I-eve     0.06
I-art     0.06
I-gpe     0.05
B-nat     0.03
I-nat     0.01
Name: Tag, dtype: float64

Proportion of Tags excluding'O'
B-geo    20.62
I-per    12.61
B-org    12.43
B-gpe    12.39
B-tim    11.57
B-per    11.07
I-org     9.47
I-geo     4.05
I-tim     3.22
B-art     0.57
B-eve     0.51
I-eve     0.43
I-art     0.41
I-gpe     0.36
B-nat     0.20
I-nat     0.07
Name: Tag, dtype: float64


The 'O' tag accounts for ~85% of all tokens and far outweighs every other tag.

### 3. Data Preprocessing

For each sentence, we will retrieve tokens and their corresponding tags using the below code

In [9]:
sentences = []
y = []

# for each sentence ID, extract
# 1. List of words in the sentence
# 2. List of corresponding tags to the words

for i in df.sentence_id.unique():
    sentences.append(df[df.sentence_id == i].word.tolist())
    y.append(df[df.sentence_id == i].Tag.tolist())

print('First Sentence Words Sequence\n', sentences[0])
print('First Sentence Tags\n', y[0])

First Sentence Words Sequence
 ['Zambian', 'officials', 'say', 'reports', 'that', 'President', 'Levy', 'Mwanawasa', 'has', 'died', 'are', 'FALSE', '.']
First Sentence Tags
 ['B-gpe', 'O', 'O', 'O', 'O', 'B-per', 'I-per', 'I-per', 'O', 'O', 'O', 'O', 'O']
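
The loop above filters the full DataFrame once per sentence ID. As an alternative sketch, a single `groupby` pass can build the same lists. The tiny `df_demo` frame and the `*_demo` names below are made up for illustration, and `sort=False` is assumed so that sentences keep their order of first appearance, as `unique()` does.

```python
import pandas as pd

# Toy frame with the same columns as the training data (values are illustrative)
df_demo = pd.DataFrame({
    "sentence_id": [704, 704, 704, 12, 12],
    "word": ["Zambian", "officials", "say", "Hello", "."],
    "Tag": ["B-gpe", "O", "O", "O", "O"],
})

# sort=False keeps groups in order of first appearance
grouped = df_demo.groupby("sentence_id", sort=False)
sentences_demo = grouped["word"].agg(list).tolist()
y_demo = grouped["Tag"].agg(list).tolist()

print(sentences_demo)
print(y_demo)
```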


### 3. Feature Extraction

Within a sentence, the characteristics of a word at position [t] can be partly described by its neighbouring words at positions [t-1] and [t+1].

We will therefore enhance the feature set by taking account of features of surrounding words.

For each word in the sentence, the following features are extracted

* **Word in Lowercase**    
Word converted to lowercase  
Features : word.lower, +1:word.lower, -1:word.lower  
<br />  
* **Word Shape**    
Indicates whether the word is a number, uppercase, lowercase, capitalized, camelcase, mixedcase, wildcard, endingdot, abbreviation, contains-hyphen  
Features : word.shape, +1:word.shape, -1:word.shape  
<br />  
* **Stemmed Word**  
Word normalized using stemming, with the PorterStemmer API from the nltk library.    
Features : word.stem, +1:word.stem, -1:word.stem   
<br />  
* **Lemmatized Word**    
Word normalized using lemmatization, with the WordNetLemmatizer API from the nltk library.  
Features : word.lemma, +1:word.lemma, -1:word.lemma  
<br />  
* **Parts of Speech**     
Part-of-speech tag of the word, extracted with the pos_tag API from the nltk library.    
Features : word.pos, +1:word.pos, -1:word.pos  
Related features : word.pos[:2], +1:word.pos[:2], -1:word.pos[:2], i.e. the first two characters of the POS tag  
<br />   
* **Beginning or End of Sentence**  
Indicates whether the word is at the beginning or the end of the sentence  
Features : BOS, EOS  


_Note: Features of the previous word are denoted with the prefix '-1' and features of the next word with the prefix '+1'_ 

The features are stored in format required by sklearn-crfsuite.
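
sklearn-crfsuite expects each sentence as a list of per-word feature dictionaries. The sketch below illustrates that shape with a reduced feature set. The functions `word2features_sketch` and `sent2features_sketch` are hypothetical stand-ins, not the `sent2features` implementation in `utils.py`, and the nltk-based features (stem, lemma, POS) are left out so that the example runs without extra dependencies.

```python
def word2features_sketch(sent, i):
    """Build a feature dict for the word at position i of a tokenized sentence."""
    word = sent[i]
    features = {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),  # crude stand-in for word shape
        "word.isdigit": word.isdigit(),
    }
    if i > 0:
        # Features of the previous word carry the '-1:' prefix
        features["-1:word.lower"] = sent[i - 1].lower()
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(sent) - 1:
        # Features of the next word carry the '+1:' prefix
        features["+1:word.lower"] = sent[i + 1].lower()
    else:
        features["EOS"] = True  # end of sentence
    return features


def sent2features_sketch(sent):
    """One feature dict per word, in sentence order."""
    return [word2features_sketch(sent, i) for i in range(len(sent))]


feats = sent2features_sketch(["Zambian", "officials", "say"])
for f in feats:
    print(f)
```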



In [None]:
X = [sent2features(s) for s in sentences]

print('Features of words from First Sentence below')
pd.DataFrame(X[0])

### 4. Model Training and Selection

Now we will train a Linear-chain Conditional Random Fields model using `sklearn_crfsuite.CRF`

Training uses the L-BFGS method (the default algorithm) with L1 and L2 regularization, controlled by `c1` and `c2` respectively.

Instead of training the model with the default regularization parameters (c1 = 0 and c2 = 1), we will search for better values using `sklearn.model_selection.RandomizedSearchCV` with 10 iterations.

5-fold cross-validation is used to assess the performance of the model across the sampled combinations of c1 and c2 values. 

With the best found parameters, `RandomizedSearchCV` refits on the whole dataset by default.
<br />  
**Evaluation Metric for Model Selection**

NER can be viewed as a multi-class classification problem in which the IOB tag `O` makes up ~85% of all labels.

Predicting `O` as `O` is of least importance to us; it is comparable to predicting true negatives in an imbalanced binary classification problem.

True positives, false negatives and false positives for each label are what matter most, so we will use the F1 score as the evaluation metric.  

In a multiclass setting, the `average` parameter of the F1 score function must also be chosen from the options below:

* `micro` : calculates F1 directly from the global counts of TP, FN and FP across all classes  

   $F1_{class1 + class2 +...+classN}$
<br />

* `macro` : calculates F1 separately for each class and takes the unweighted mean over the $N$ classes   
   $\frac{1}{N}\left(F1_{class1} + F1_{class2} +...+ F1_{classN}\right)$
<br />

* `weighted` : calculates F1 for each class independently, then averages using weights proportional to the number of true labels of each class:  

    $F1_{class1} \cdot W_{class1} + F1_{class2} \cdot W_{class2} +...+ F1_{classN} \cdot W_{classN}$    
<br /> 


**Because `macro` penalises the model more heavily when it performs poorly on minority classes, we will use the `macro` average of the F1 score.**
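
To see how the three averaging choices behave under class imbalance, here is a small plain-Python sketch. The function `f1_by_average` and the toy labels are illustrative assumptions, not part of the notebook's pipeline (which uses `metrics.flat_f1_score`). The model in this toy case misses the single `B-geo` token.

```python
from collections import Counter


def f1_by_average(y_true, y_pred):
    """Return micro, macro and weighted F1 for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label gets a false positive
            fn[t] += 1  # true label gets a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in labels}
    support = Counter(y_true)  # number of true labels per class
    n = len(y_true)
    return {
        "micro": f1(sum(tp.values()), sum(fp.values()), sum(fn.values())),
        "macro": sum(per_class.values()) / len(labels),
        "weighted": sum(per_class[c] * support[c] / n for c in labels),
    }


y_true = ["O"] * 8 + ["B-per", "B-geo"]
y_pred = ["O"] * 8 + ["B-per", "O"]  # the B-geo token is missed

scores = f1_by_average(y_true, y_pred)
print({k: round(v, 3) for k, v in scores.items()})
# {'micro': 0.9, 'macro': 0.647, 'weighted': 0.853}
```

The missed minority class barely moves the micro and weighted scores but pulls the macro score down sharply, which is the behaviour we want from the selection metric.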

In [None]:
# Conditional Random Field Classifier
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    max_iterations=50,
    all_possible_transitions=True
)

# Parameter Space for L1 (c1) and L2 (c2) regularization
params_space = {
    'c1': scipy.stats.expon(scale=0.5),
    'c2': scipy.stats.expon(scale=0.05),
}

# Metric for evaluation - F1 Macro
f1_scorer = make_scorer(metrics.flat_f1_score, average='macro')

# Random Search with 5 fold CV and 10 Iterations
rs = RandomizedSearchCV(crf, 
                        params_space,
                        cv=5,
                        verbose=1,
                        n_jobs=-1,
                        n_iter=10,
                        scoring=f1_scorer)

rs.fit(X, y)
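
The search space above draws `c1` and `c2` from exponential distributions whose mean equals `scale`, so most candidates are small positive values with occasional larger ones. The sketch below imitates that sampling with the standard-library `random.expovariate`, an assumption made to keep the example dependency-free; the notebook itself relies on `scipy.stats.expon`, and `sample_candidates` is a hypothetical helper.

```python
import random

rng = random.Random(0)  # fixed seed so the draws are repeatable


def sample_candidates(n_iter, c1_scale=0.5, c2_scale=0.05):
    """Draw n_iter (c1, c2) pairs from exponential distributions."""
    # expovariate takes a rate, which is 1 / mean, i.e. 1 / scale
    return [
        {"c1": rng.expovariate(1 / c1_scale),
         "c2": rng.expovariate(1 / c2_scale)}
        for _ in range(n_iter)
    ]


# Ten candidates, matching n_iter=10 in the random search
candidates = sample_candidates(10)
for cand in candidates:
    print(f"c1={cand['c1']:.3f}  c2={cand['c2']:.3f}")

# With many draws, the sample means approach the scale values
many = sample_candidates(10_000)
mean_c1 = sum(c["c1"] for c in many) / len(many)
mean_c2 = sum(c["c2"] for c in many) / len(many)
print(f"mean c1={mean_c1:.3f}  mean c2={mean_c2:.3f}")
```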

In [None]:
# Extract and View CV results
cvresults = pd.DataFrame(rs.cv_results_)
print(cvresults.shape)
cvresults.sort_values('rank_test_score').head()

In [None]:
# Print Best Parameters
print('best params:', rs.best_params_)
print('best CV score:', rs.best_score_)

We achieved a best macro F1 score of 0.5648 at c1 = 0.21 and c2 = 0.01.


The final estimator is refitted by `RandomizedSearchCV` on the entire dataset (X) with the above regularization values.

### 5. Save Model

The best estimator is extracted and saved in folder `model` for future scoring

In [None]:
model_name = "model/ner_model.pickle"
# Use a context manager so the file handle is closed after writing
with open(model_name, 'wb') as f:
    pickle.dump(rs.best_estimator_, f)
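
Loading the saved model for later scoring could look like the sketch below. A plain dictionary stands in for `rs.best_estimator_` and a temporary directory for the `model` folder so that the example runs on its own; both are assumptions for illustration only.

```python
import os
import pickle
import tempfile

# Placeholder for rs.best_estimator_ (illustrative values only)
stand_in_model = {"c1": 0.21, "c2": 0.01}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "ner_model.pickle")

    # Save: mirrors the pickle.dump step above
    with open(model_path, "wb") as f:
        pickle.dump(stand_in_model, f)

    # Load: restore the object for scoring
    with open(model_path, "rb") as f:
        loaded = pickle.load(f)

print(loaded)
```

With the real estimator, `loaded.predict(...)` would take features built by `sent2features` and return one tag sequence per sentence. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.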

### Scope for further work


spaCy and a few other Python libraries provide open-source pre-trained NER models. These models can be further trained on our IOB tags.   


Code example  
https://towardsdatascience.com/custom-named-entity-recognition-using-spacy-7140ebbb3718



### References 

* [Named Entity Recognition and Classification](https://towardsdatascience.com/named-entity-recognition-and-classification-with-scikit-learn-f05372f07ba2)
* [sklearn-crfsuite](https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html)