## Introduction 

This notebook walks through a named-entity recognition (NER) implementation that uses only a Conditional Random Field (CRF), trained on a finance corpus. The notebook proceeds in the following order:
<br>
1. Loading the dataset into a dataframe
2. Data Preprocessing
3. Extracting features from the sentences (feature engineering)
4. Training a Conditional Random Field model
5. Evaluating the trained CRF model
6. Optimising the hyperparameters 

## Import the required libraries

In [9]:
import pandas as pd
import numpy as np 

from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import learning_curve
from sklearn.model_selection import train_test_split

from sklearn_crfsuite import CRF
from sklearn.metrics import make_scorer
from sklearn_crfsuite import metrics
from sklearn.exceptions import UndefinedMetricWarning 

import warnings
import nltk
import math
import sys

## Data Preprocessing

### Part-of-Speech Tag Generation

In [30]:
# Read the NER data, keeping blank lines, and name the two columns
ner_data = pd.read_csv("../Data/tag1.csv", skip_blank_lines=False, encoding="utf-8", index_col=None)
ner_data.columns = ["Token", "NE"]

# Tag every token with its part of speech; pos_tag returns (token, tag) pairs
POS_tags = nltk.pos_tag(ner_data["Token"])

# Keep only the tag from each pair and store it as a new column
POS_List = [tag for _, tag in POS_tags]
ner_data["POS"] = POS_List

print(ner_data)

         Token                    NE  POS
0   $2,000,000     B-Notional Amount   CD
1          USD     I-Notional Amount  NNP
2    6/20/2011     B-Expiration Date   CD
3    Agreement                     O  NNP
4         with                     O   IN
5           JP        B-Counterparty  NNP
6       Morgan        I-Counterparty  NNP
7        dated                     O  VBD
8      6/17/06                     O   CD
9      whereby                     O   IN
10         the                     O   DT
11   Portfolio                     O  NNP
12        will                     O   MD
13     receive  B-Direction of Trade   VB
14       0.35%          B-Fixed Rate   CD
15         per                     O   IN
16   yeartimes                     O  NNS
17         the                     O   DT
18    notional                     O   JJ
19  amount.The                     O   NN
20   Portfolio                     O  NNP
21       makes                     O  VBZ
22           a                    
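
A CRF is trained on sequences rather than on isolated tokens, so the flat token table eventually has to be grouped into sentences. The sketch below shows one way this could be done, assuming that the blank lines kept by `skip_blank_lines=False` mark sentence boundaries; the notebook does not confirm this. The helper names `is_missing` and `split_into_sentences` are hypothetical, and the sample rows reuse tokens from the output above.

```python
import math


def is_missing(value):
    """True for None or a float NaN, which is how pandas represents a blank row."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def split_into_sentences(rows):
    """Group (token, pos, label) rows into sentences, splitting on blank rows.

    Assumption: a row whose token is missing separates two sentences.
    """
    sentences, current = [], []
    for token, pos, label in rows:
        if is_missing(token):
            # Close the sentence collected so far, skipping repeated blanks
            if current:
                sentences.append(current)
                current = []
        else:
            current.append((token, pos, label))
    if current:
        sentences.append(current)
    return sentences


nan = float("nan")
rows = [
    ("JP", "NNP", "B-Counterparty"),
    ("Morgan", "NNP", "I-Counterparty"),
    (nan, nan, nan),  # blank separator row (assumed sentence boundary)
    ("receive", "VB", "B-Direction of Trade"),
    ("0.35%", "CD", "B-Fixed Rate"),
]
sentences = split_into_sentences(rows)
print(len(sentences))
```

With the dataframe above, the rows could be supplied as `ner_data[["Token", "POS", "NE"]].itertuples(index=False, name=None)`.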

### Visualize Tag Distribution

In [12]:
tag_distribution = ner_data.groupby("NE").size().reset_index(name='counts')
print(tag_distribution)

                     NE  counts
0        B-Counterparty       2
1  B-Direction of Trade       2
2     B-Expiration Date       2
3          B-Fixed Rate       2
4     B-Notional Amount       2
5    B-Reference Entity       2
6        I-Counterparty       2
7     I-Notional Amount       1
8    I-Reference Entity       5
9                     O      55
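
The counts above are heavily skewed toward the `O` tag, which is worth quantifying before evaluating a model. As a small sketch, the snippet below copies the counts from the printed table into a plain dictionary and computes each tag's share of all tokens. The variable names are illustrative and not part of the notebook.

```python
# Tag counts copied from the printed tag distribution above
counts = {
    "B-Counterparty": 2,
    "B-Direction of Trade": 2,
    "B-Expiration Date": 2,
    "B-Fixed Rate": 2,
    "B-Notional Amount": 2,
    "B-Reference Entity": 2,
    "I-Counterparty": 2,
    "I-Notional Amount": 1,
    "I-Reference Entity": 5,
    "O": 55,
}

total = sum(counts.values())           # all labelled tokens
entity_tokens = total - counts["O"]    # tokens that belong to some entity
shares = {tag: n / total for tag, n in counts.items()}

print(f"Labelled tokens: {total}")
print(f"Entity tokens:   {entity_tokens}")
print(f"Share of 'O':    {shares['O']:.1%}")
```

Because most tokens are `O`, a model that predicted `O` everywhere would still score well on plain token accuracy, which is one reason to look at per-entity metrics instead.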


Next, we filter out the labels that are not needed in this analysis: the `O` tag, which marks tokens outside any entity, and any missing (`NaN`) values.

In [13]:
classes = list(filter(lambda x: x not in ["O", np.nan], list(ner_data["NE"].unique())))
print(classes)

['B-Notional Amount', 'I-Notional Amount', 'B-Expiration Date', 'B-Counterparty', 'I-Counterparty', 'B-Direction of Trade', 'B-Fixed Rate', 'B-Reference Entity', 'I-Reference Entity']
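
A note on the filter above: `x not in ["O", np.nan]` works only when the missing value is the very same `np.nan` object, because `NaN` never compares equal to itself. The following sketch shows a variant that tests for missing values explicitly. The function name `entity_classes` is hypothetical, and it is offered as an alternative under that assumption, not as the notebook's method.

```python
import math


def entity_classes(labels):
    """Return the unique labels in first-seen order, dropping "O" and missing values."""
    classes = []
    for label in labels:
        # Skip missing values, whichever NaN object represents them
        if label is None or (isinstance(label, float) and math.isnan(label)):
            continue
        # Skip the "outside" tag and labels already collected
        if label == "O" or label in classes:
            continue
        classes.append(label)
    return classes


sample = ["B-Counterparty", "I-Counterparty", "O", float("nan"), "B-Fixed Rate", "O", "B-Counterparty"]
print(entity_classes(sample))
```

Applied to the dataframe, this would be called as `entity_classes(ner_data["NE"])`.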
