## Introduction 

In this notebook we will go through a CRF only named-entity recognition implementation based on finance corpus. The following would be the sequence of the notebook:
<br>
1. Data Preprocessing
2. Extract features from the sentences (Feature Engineering)
3. Training a Condtional Random Field model
4. Evaluating the trained CRF model
5. Optimising the hyperparameters 

## Import the required libraries

In [1]:
import pandas as pd
import numpy as np
import glob

from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import learning_curve
from sklearn.model_selection import train_test_split

from sklearn_crfsuite import CRF
from sklearn.metrics import make_scorer
from sklearn_crfsuite import metrics
from sklearn.exceptions import UndefinedMetricWarning 

import warnings
import nltk
import math
import sys

## Data Preprocessing


First, the data is loaded into a Pandas DataFrame. This can be done easily using the read_csv function, specifying that the separator is a space. It's also useful to keep the blank lines, which are helpful later for determining the sentence breaks. <br>
<br>
Once the data is loaded into a DataFrame, the easy access we have to columns allows a couple of useful things to be done - group the data by the "ne" column to see the distributions of each tag, and extract the classes (disregarding 'O' and blank lines with NaN values) as a list

### Parts of Speech Tag Generation

In [2]:
# Read the NER data keeping blank lines and adding columns
ner_data = pd.concat([pd.read_csv(f, skip_blank_lines=False, encoding="utf-8", index_col=None, na_values=' ') for f in glob.glob("../Data/*.csv")])
ner_data.columns = ["Token", "NE"]
ner_data["Token"] = ner_data["Token"].astype(str)
print(ner_data["Token"])
POS_tags =  nltk.pos_tag(ner_data["Token"])

POS_List = []

for w in POS_tags:
    POS_List.append(w[1])
    
ner_data["POS"] = POS_List
    
print(ner_data)

0            Bought
1            credit
2           default
3        protection
4                on
5        RadioShack
6             Corp.
7               and
8               pay
9             1.16%
10          Broker,
11              UBS
12          Warburg
13          Expires
14         December
15             2010
16         USD7,625
17           Bought
18           credit
19          default
20       protection
21               on
22          Limited
23          Brands,
24             Inc.
25              and
26              pay
27            1.065
28          Broker,
29              UBS
           ...     
192           equal
193              to
194           0.12%
195      multiplied
196              by
197             the
198        notional
199          amount
200             and
201         receive
202            from
203        Barclays
204            Bank
205             plc
206            upon
207            each
208         default
209           event
210              of


### Visualize Tag Distribution

In [3]:
tag_distribution = ner_data.groupby("NE").size().reset_index(name='counts')
print(tag_distribution)

                      NE  counts
0         B-Counterparty     261
1   B-Direction of Trade     260
2      B-Expiration Date     265
3           B-Fixed Rate     263
4      B-Notional Amount     261
5     B-Reference Entity     252
6                    B-o      18
7         I-Counterparty     511
8      I-Expiration Date      85
9           I-Fixed Rate       1
10     I-Notional Amount       2
11    I-Reference Entity     596
12                     O    7602


Now filtering the classes of Named Entity that we do not require in this analysis

In [4]:
classes = list(filter(lambda x: x not in ["O", np.nan], list(ner_data["NE"].unique())))
print(classes)

['B-Direction of Trade', 'B-Reference Entity', 'I-Reference Entity', 'B-Fixed Rate', 'B-Counterparty', 'I-Counterparty', 'B-Expiration Date', 'I-Expiration Date', 'B-Notional Amount', 'I-Notional Amount', 'I-Fixed Rate', 'B-o']


### Extract sentences from dataset


Next, sentences need to be extracted from the data - it's useful to have the sentences as a list of lists, with each sublist containing the token, POS tag, and NE label for every word token in the sentence

In [5]:
# Create a sentences dictionary and an initial single sentence dictionary
sentences, sentence = [], []
# Create a progress bar
# pbar = pyprind.ProgBar(len(ner_data))
# For each row in the NER data...
for index, row in ner_data.iterrows():
    # If the row is empty (no string in the token column)
    if '\\' in row["Token"]:
        # If the current sentence is not empty, append it to the sentences and create a new sentence
        if len(sentence) > 0:
            sentences.append(sentence)
            sentence = []
    # Otherwise...
    else:
        # If the row does not indicate the start of a document, add the token to the current sentence
        if type(row["Token"]) != float and type(row["POS"]) != float and type(row["NE"]) != float:
            sentence.append([row["Token"], row["POS"], row["NE"]])
    #pbar.update()

## Feature Engineering

Now we are going to define a function which would allow us to extract the word features in the sentence. This includes the following:
<br>
1. Current Parts of Speech Tags
2. Previous and Next Parts of Speech Tags
3. Current Words
4. Previous Words
5. Next Words

<br>For now, we have avoided chunking however a little internet research shows us that chunking indeed can improve the accuracy and sensitivity of the model

In [6]:
def extractWordFeatures(sentence, iterator):
    POS = sentence[iterator][1]
    Token = sentence[iterator][0]

    # Aggregating a feuature dicitonary based on the features of the current POS and word
    
    featureDict = { "POS[:2]" : POS[:2],
                 "POS" : POS,
                 "Token.isdigit()" : Token.isdigit(),
                 "Token.istitle()" : Token.istitle(),
                 "Token.isupper()" : Token.isupper(),
                 "Token[-2:]" : Token[-2:],
                 "Token[-3:]" : Token[-3:],
                 "Token.lower()" : Token.lower(),
                 "bias" : 1.0,
    }
    
    if iterator > 1:
        previousWord = sentence[iterator-1][0]
        previousPosTag = sentence[iterator-1][1]
        
        # Add characteristics of the sentence's previous word and POS to the feature dictionary
        featureDict.update({ "-1:Token.lower()": previousWord.lower(),
                          "-1:Token.istitle()": previousWord.istitle(),
                          "-1:Token.isupper()": previousWord.isupper(),
                          "-1:POS": previousPosTag,
                          "-1:POS[:2]": previousPosTag[:2],
                        })
        
    # Add "Beginning of Sentence" at the start of the dictionary    
    else:
        featureDict["BOS"] = True
    
    if iterator < len(sentence)-1:
        nextWord = sentence[iterator+1][0]
        nextPos = sentence[iterator+1][1]
        # Add characteristics of the sentence's previous next and POS to the feature dictionary
        featureDict.update({ "+1:Token.lower()": nextWord.lower(),
                          "+1:Token.istitle()": nextWord.istitle(),
                          "+1:Token.isupper()": nextWord.isupper(),
                          "+1:POS": nextPos,
                          "+1:POS[:2]": nextPos[:2],
                        })
        
    else:
        featureDict["EOS"] = True
    
    return featureDict    

Using the word_features function, a list of feature dictionaries for each word token in a sentence can be extracted, corresponding to a list of NE labels for each word token in a sentence.

In [7]:
# Return a feature dictionary for each word in a given sentence
def sentence_features(sentence):
    return [extractWordFeatures(sentence, iterator) for iterator in range(len(sentence))]

# Return the label (NER tag) for each word in a given sentence
def sentence_labels(sentence):
    return [label for token, pos, label in sentence]

## Training a Condtional Random Field model

Using the predefined functions, X and y can be extracted as lists of feature dictionaries for each word token in each sentence, and as lists of NE labels for each word token in each sentence, respectively. scikit-learn's 'test_train_split' function can then be used to split X and y into training and test sets, split 80% training to 20% test.

In [8]:
# For each sentence, extract the sentence features as X, and the labels as y
X = [sentence_features(sentence) for sentence in sentences]
y = [sentence_labels(sentence) for sentence in sentences]

# Split X and y into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

print("First token features:\n{}\n{}".format("-"*21, X_train[0][0]))
print("\nFirst token label:\n{}\n{}".format("-"*18, y_train[0][0]))

First token features:
---------------------
{'Token.lower()': '1,00,00,000', 'Token.isupper()': False, '+1:Token.istitle()': False, 'Token.isdigit()': False, 'Token[-3:]': '000', 'POS[:2]': 'CD', 'BOS': True, 'POS': 'CD', '+1:POS[:2]': 'NN', 'Token.istitle()': False, '+1:Token.lower()': 'usd', '+1:POS': 'NNP', 'bias': 1.0, 'Token[-2:]': '00', '+1:Token.isupper()': True}

First token label:
------------------
B-Notional Amount


In [9]:
# Create a new CRF model
crf = CRF(algorithm="lbfgs",
          c1=0.1,
          c2=0.1,
          max_iterations=100,
          all_possible_transitions=True)

# Train the CRF model on the supplied training data
crf.fit(X_train, y_train)

CRF(algorithm='lbfgs', all_possible_states=None,
  all_possible_transitions=True, averaging=None, c=None, c1=0.1, c2=0.1,
  calibration_candidates=None, calibration_eta=None,
  calibration_max_trials=None, calibration_rate=None,
  calibration_samples=None, delta=None, epsilon=None, error_sensitive=None,
  gamma=None, keep_tempfiles=None, linesearch=None, max_iterations=100,
  max_linesearch=None, min_freq=None, model_filename=None,
  num_memories=None, pa_type=None, period=None, trainer_cls=None,
  variance=None, verbose=False)

## Evaluating the trained CRF model

The trained model can now be used to make predictions based on the test data, which can in turn be compared to the expected labels from the test data to produce a classification report (precision, recall and F1 scores).

In [10]:
# Use the CRF model to make predictions on the test data
y_pred = crf.predict(X_test)
print(metrics.flat_classification_report(y_test, y_pred, labels=classes))

                      precision    recall  f1-score   support

B-Direction of Trade       0.96      1.00      0.98        51
  B-Reference Entity       0.94      0.98      0.96        51
  I-Reference Entity       0.95      0.98      0.97       114
        B-Fixed Rate       1.00      1.00      1.00        52
      B-Counterparty       1.00      1.00      1.00        52
      I-Counterparty       1.00      1.00      1.00        96
   B-Expiration Date       1.00      1.00      1.00        52
   I-Expiration Date       0.94      1.00      0.97        15
   B-Notional Amount       0.98      0.94      0.96        51
   I-Notional Amount       0.00      0.00      0.00         1
        I-Fixed Rate       0.00      0.00      0.00         0
                 B-o       0.00      0.00      0.00         3

           micro avg       0.97      0.98      0.98       538
           macro avg       0.73      0.74      0.74       538
        weighted avg       0.97      0.98      0.98       538



  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


## Optimising the hyperparameters

Under this section, we will experiment with different values of C1 and C2 values for the elastic net regularisation. In order to achieve this we will use cross-validated randomised search. To avoid a computationally intensive task, we will limit the iterations to 50 and use a 3-fold cross-validation. This in turn would mean that we are essentially training a 150 models.  
<br>
Following the optimisation, we can see that lower values (increased regularisation strength) for both C1 and C2 values result in the best performing model - particularly for C1

In [11]:
# Set up a parameter grid to experiment with different values for C1 and C2
param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
param_grid = {"c1": param_range,
              "c2": param_range}

# Set up a bespoke scorer that will compare the cross validated models according to their F1 scores
f1_scorer = make_scorer(metrics.flat_f1_score, average='weighted', labels=classes)

# Perform a 3-fold cross-validated, randomised search of 50 combinations for different values for C1 and C2
rs = RandomizedSearchCV(estimator=crf,
                        param_distributions=param_grid,
                        scoring=f1_scorer,
                        cv=3,
                        verbose=1,
                        n_iter=50,
                        n_jobs=-1)

# Train the models in the randomised search, ignoring any 'UndefinedMetricWarning' that comes up 
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=UndefinedMetricWarning)
    rs.fit(X_train, y_train)

# Print the model that scored highest in the randomised search, and the parameters it used
print(rs.best_score_)
print(rs.best_params_)

Fitting 3 folds for each of 50 candidates, totalling 150 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:   20.8s
[Parallel(n_jobs=-1)]: Done 150 out of 150 | elapsed:  1.9min finished


0.9359328334687578
{'c1': 0.1, 'c2': 0.1}


## Evaluate the optimised CRF model

In [12]:
crf = rs.best_estimator_

y_pred = crf.predict(X_test)
print(metrics.flat_classification_report(y_test, y_pred, labels=classes))

                      precision    recall  f1-score   support

B-Direction of Trade       0.96      1.00      0.98        51
  B-Reference Entity       0.94      0.98      0.96        51
  I-Reference Entity       0.95      0.98      0.97       114
        B-Fixed Rate       1.00      1.00      1.00        52
      B-Counterparty       1.00      1.00      1.00        52
      I-Counterparty       1.00      1.00      1.00        96
   B-Expiration Date       1.00      1.00      1.00        52
   I-Expiration Date       0.94      1.00      0.97        15
   B-Notional Amount       0.98      0.94      0.96        51
   I-Notional Amount       0.00      0.00      0.00         1
        I-Fixed Rate       0.00      0.00      0.00         0
                 B-o       0.00      0.00      0.00         3

           micro avg       0.97      0.98      0.98       538
           macro avg       0.73      0.74      0.74       538
        weighted avg       0.97      0.98      0.98       538



  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)
