## Introduction 

In this notebook we will go through a CRF only named-entity recognition implementation based on finance corpus. The following would be the sequence of the notebook:
<br>
1. Data Preprocessing
2. Extract features from the sentences (Feature Engineering)
3. Training a Condtional Random Field model
4. Evaluating the trained CRF model
5. Optimising the hyperparameters 

## Import the required libraries

In [1]:
import pandas as pd
import numpy as np
import glob

from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import learning_curve
from sklearn.model_selection import train_test_split

from sklearn_crfsuite import CRF
from sklearn.metrics import make_scorer
from sklearn_crfsuite import metrics
from sklearn.exceptions import UndefinedMetricWarning 

import warnings
import nltk
import math
import sys

## Data Preprocessing


First, the data is loaded into a Pandas DataFrame. This can be done easily using the read_csv function, specifying that the separator is a space. It's also useful to keep the blank lines, which are helpful later for determining the sentence breaks. <br>
<br>
Once the data is loaded into a DataFrame, the easy access we have to columns allows a couple of useful things to be done - group the data by the "ne" column to see the distributions of each tag, and extract the classes (disregarding 'O' and blank lines with NaN values) as a list

### Parts of Speech Tag Generation

In [2]:
# Read the NER data keeping blank lines and adding columns
ner_data = pd.concat([pd.read_csv(f, skip_blank_lines=False, encoding="utf-8", index_col=None) for f in glob.glob("../Data/*.csv")])
ner_data.columns = ["Token", "NE"]

POS_tags =  nltk.pos_tag(ner_data["Token"])
POS_List = []

for w in POS_tags:
    POS_List.append(w[1])
    
ner_data["POS"] = POS_List
    
print(ner_data)

          Token                    NE   POS
0    $2,000,000     B-Notional Amount    CD
1           USD     I-Notional Amount   NNP
2     6/20/2011     B-Expiration Date    CD
3     Agreement                     O   NNP
4          with                     O    IN
5            JP        B-Counterparty   NNP
6        Morgan        I-Counterparty   NNP
7         dated                     O   VBD
8       6/17/06                     O    CD
9       whereby                     O    IN
10          the                     O    DT
11    Portfolio                     O   NNP
12         will                     O    MD
13      receive  B-Direction of Trade    VB
14        0.35%          B-Fixed Rate    CD
15          per                     O    IN
16    yeartimes                     O   NNS
17          the                     O    DT
18     notional                     O    JJ
19      amount.                     O    NN
20           \\                   NaN   VBD
21          The                 

### Visualize Tag Distribution

In [3]:
tag_distribution = ner_data.groupby("NE").size().reset_index(name='counts')
print(tag_distribution)

                      NE  counts
0         B-Counterparty      16
1   B-Direction of Trade      16
2      B-Expiration Date      16
3           B-Fixed Rate      16
4      B-Notional Amount      16
5     B-Reference Entity      16
6         I-Counterparty      18
7      I-Expiration Date      14
8      I-Notional Amount       1
9     I-Reference Entity      64
10                     O     478


Now filtering the classes of Named Entity that we do not require in this analysis

In [4]:
classes = list(filter(lambda x: x not in ["O", np.nan], list(ner_data["NE"].unique())))
print(classes)

['B-Notional Amount', 'I-Notional Amount', 'B-Expiration Date', 'B-Counterparty', 'I-Counterparty', 'B-Direction of Trade', 'B-Fixed Rate', 'B-Reference Entity', 'I-Reference Entity', 'I-Expiration Date']


### Extract sentences from dataset


Next, sentences need to be extracted from the data - it's useful to have the sentences as a list of lists, with each sublist containing the token, POS tag, and NE label for every word token in the sentence

In [5]:
# Create a sentences dictionary and an initial single sentence dictionary
sentences, sentence = [], []
# Create a progress bar
# pbar = pyprind.ProgBar(len(ner_data))
# For each row in the NER data...
for index, row in ner_data.iterrows():
    # If the row is empty (no string in the token column)
    if '\\' in row["Token"]:
        # If the current sentence is not empty, append it to the sentences and create a new sentence
        if len(sentence) > 0:
            sentences.append(sentence)
            sentence = []
    # Otherwise...
    else:
        # If the row does not indicate the start of a document, add the token to the current sentence
        if type(row["Token"]) != float and type(row["POS"]) != float and type(row["NE"]) != float:
            sentence.append([row["Token"], row["POS"], row["NE"]])
    #pbar.update()

## Feature Engineering

Now we are going to define a function which would allow us to extract the word features in the sentence. This includes the following:
<br>
1. Current Parts of Speech Tags
2. Previous and Next Parts of Speech Tags
3. Current Words
4. Previous Words
5. Next Words

<br>For now, we have avoided chunking however a little internet research shows us that chunking indeed can improve the accuracy and sensitivity of the model

In [6]:
def extractWordFeatures(sentence, iterator):
    POS = sentence[iterator][1]
    Token = sentence[iterator][0]

    # Aggregating a feuature dicitonary based on the features of the current POS and word
    
    featureDict = { "POS[:2]" : POS[:2],
                 "POS" : POS,
                 "Token.isdigit()" : Token.isdigit(),
                 "Token.istitle()" : Token.istitle(),
                 "Token.isupper()" : Token.isupper(),
                 "Token[-2:]" : Token[-2:],
                 "Token[-3:]" : Token[-3:],
                 "Token.lower()" : Token.lower(),
                 "bias" : 1.0,
    }
    
    if iterator > 1:
        previousWord = sentence[iterator-1][0]
        previousPosTag = sentence[iterator-1][1]
        
        # Add characteristics of the sentence's previous word and POS to the feature dictionary
        featureDict.update({ "-1:Token.lower()": previousWord.lower(),
                          "-1:Token.istitle()": previousWord.istitle(),
                          "-1:Token.isupper()": previousWord.isupper(),
                          "-1:POS": previousPosTag,
                          "-1:POS[:2]": previousPosTag[:2],
                        })
        
    # Add "Beginning of Sentence" at the start of the dictionary    
    else:
        featureDict["BOS"] = True
    
    if iterator < len(sentence)-1:
        nextWord = sentence[iterator+1][0]
        nextPos = sentence[iterator+1][1]
        # Add characteristics of the sentence's previous next and POS to the feature dictionary
        featureDict.update({ "+1:Token.lower()": nextWord.lower(),
                          "+1:Token.istitle()": nextWord.istitle(),
                          "+1:Token.isupper()": nextWord.isupper(),
                          "+1:POS": nextPos,
                          "+1:POS[:2]": nextPos[:2],
                        })
        
    else:
        featureDict["EOS"] = True
    
    return featureDict    

Using the word_features function, a list of feature dictionaries for each word token in a sentence can be extracted, corresponding to a list of NE labels for each word token in a sentence.

In [7]:
# Return a feature dictionary for each word in a given sentence
def sentence_features(sentence):
    return [extractWordFeatures(sentence, iterator) for iterator in range(len(sentence))]

# Return the label (NER tag) for each word in a given sentence
def sentence_labels(sentence):
    return [label for token, pos, label in sentence]

## Training a Condtional Random Field model

Using the predefined functions, X and y can be extracted as lists of feature dictionaries for each word token in each sentence, and as lists of NE labels for each word token in each sentence, respectively. scikit-learn's 'test_train_split' function can then be used to split X and y into training and test sets, split 80% training to 20% test.

In [8]:
# For each sentence, extract the sentence features as X, and the labels as y
X = [sentence_features(sentence) for sentence in sentences]
y = [sentence_labels(sentence) for sentence in sentences]

# Split X and y into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

print("First token features:\n{}\n{}".format("-"*21, X_train[0][0]))
print("\nFirst token label:\n{}\n{}".format("-"*18, y_train[0][0]))

First token features:
---------------------
{'+1:Token.lower()': 'monthly', 'Token.istitle()': True, 'POS': 'NNP', 'Token.lower()': 'receive', '+1:POS[:2]': 'RB', '+1:POS': 'RB', 'Token[-3:]': 'ive', 'Token.isupper()': False, 'POS[:2]': 'NN', '+1:Token.isupper()': False, '+1:Token.istitle()': False, 'BOS': True, 'Token.isdigit()': False, 'bias': 1.0, 'Token[-2:]': 've'}

First token label:
------------------
B-Direction of Trade


In [9]:
# Create a new CRF model
crf = CRF(algorithm="lbfgs",
          c1=0.1,
          c2=0.1,
          max_iterations=100,
          all_possible_transitions=True)

# Train the CRF model on the supplied training data
crf.fit(X_train, y_train)

CRF(algorithm='lbfgs', all_possible_states=None,
  all_possible_transitions=True, averaging=None, c=None, c1=0.1, c2=0.1,
  calibration_candidates=None, calibration_eta=None,
  calibration_max_trials=None, calibration_rate=None,
  calibration_samples=None, delta=None, epsilon=None, error_sensitive=None,
  gamma=None, keep_tempfiles=None, linesearch=None, max_iterations=100,
  max_linesearch=None, min_freq=None, model_filename=None,
  num_memories=None, pa_type=None, period=None, trainer_cls=None,
  variance=None, verbose=False)

In [10]:
# Use the CRF model to make predictions on the test data
y_pred = crf.predict(X_test)
print(metrics.flat_classification_report(y_test, y_pred, labels=classes))

                      precision    recall  f1-score   support

   B-Notional Amount       1.00      0.75      0.86         4
   I-Notional Amount       0.00      0.00      0.00         0
   B-Expiration Date       1.00      0.75      0.86         4
      B-Counterparty       1.00      1.00      1.00         3
      I-Counterparty       1.00      1.00      1.00         2
B-Direction of Trade       1.00      1.00      1.00         3
        B-Fixed Rate       1.00      1.00      1.00         3
  B-Reference Entity       1.00      0.75      0.86         4
  I-Reference Entity       1.00      0.81      0.90        16
   I-Expiration Date       1.00      1.00      1.00         3

           micro avg       1.00      0.86      0.92        42
           macro avg       0.90      0.81      0.85        42
        weighted avg       1.00      0.86      0.92        42



  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)
