## Introduction 

In this notebook we walk through a named-entity recognition (NER) implementation that uses only a Conditional Random Field (CRF), trained on a finance corpus. The notebook proceeds as follows:
1. Loading the dataset into a dataframe
2. Data Preprocessing
3. Extract features from the sentences (Feature Engineering)
4. Training a Conditional Random Field model
5. Evaluating the trained CRF model
6. Optimising the hyperparameters 
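
To make step 3 concrete: `sklearn_crfsuite.CRF` is trained on sentences represented as lists of per-token feature dictionaries. The sketch below shows one way such dictionaries could be built from `(token, POS tag)` pairs. The function names and the particular features are illustrative assumptions, not necessarily the feature set used later in this notebook.

```python
def token_features(sentence, i):
    """Build a feature dict for the token at position i.

    `sentence` is a list of (token, pos_tag) pairs. The feature names
    below are illustrative choices, not the notebook's actual feature set.
    """
    token, pos = sentence[i]
    features = {
        "bias": 1.0,
        "token.lower": token.lower(),
        "token.istitle": token.istitle(),
        "token.isupper": token.isupper(),
        "token.has_digit": any(ch.isdigit() for ch in token),
        "pos": pos,
    }
    if i > 0:
        # Context from the previous token
        prev_token, prev_pos = sentence[i - 1]
        features["prev.token.lower"] = prev_token.lower()
        features["prev.pos"] = prev_pos
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(sentence) - 1:
        # Context from the next token
        next_token, next_pos = sentence[i + 1]
        features["next.token.lower"] = next_token.lower()
        features["next.pos"] = next_pos
    else:
        features["EOS"] = True  # end of sentence
    return features


def sentence_features(sentence):
    """Feature dicts for every token in one sentence."""
    return [token_features(sentence, i) for i in range(len(sentence))]


# A short example sentence with hand-written POS tags
example = [("Agreement", "NNP"), ("with", "IN"), ("JP", "NNP"), ("Morgan", "NNP")]
X = sentence_features(example)
```

Including the neighbouring tokens gives the CRF some local context for each decision, which is often useful for multi-token entities such as counterparty names.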

## Import the required libraries

In [9]:
import pandas as pd
import numpy as np 

from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import learning_curve
from sklearn.model_selection import train_test_split

from sklearn_crfsuite import CRF
from sklearn.metrics import make_scorer
from sklearn_crfsuite import metrics
from sklearn.exceptions import UndefinedMetricWarning 

import warnings
import nltk
import math
import sys

## Data Preprocessing

### Part-of-Speech Tag Generation

In [11]:
# Read the NER data, keeping blank lines, and name the two columns
ner_data = pd.read_csv("../Data/tag1.csv", skip_blank_lines=False, encoding="utf-8", index_col=None)
ner_data.columns = ["Token", "NE"]

# nltk.pos_tag returns (token, tag) pairs, so each POS cell holds a tuple
ner_data["POS"] = nltk.pos_tag(ner_data["Token"])
print(ner_data)

         Token                    NE               POS
0   $2,000,000     B-Notional Amount  ($2,000,000, CD)
1          USD     I-Notional Amount        (USD, NNP)
2    6/20/2011     B-Expiration Date   (6/20/2011, CD)
3    Agreement                     O  (Agreement, NNP)
4         with                     O        (with, IN)
5           JP        B-Counterparty         (JP, NNP)
6       Morgan        I-Counterparty     (Morgan, NNP)
7        dated                     O      (dated, VBD)
8      6/17/06                     O     (6/17/06, CD)
9      whereby                     O     (whereby, IN)
10         the                     O         (the, DT)
11   Portfolio                     O  (Portfolio, NNP)
12        will                     O        (will, MD)
13     receive  B-Direction of Trade     (receive, VB)
14       0.35%          B-Fixed Rate       (0.35%, CD)
15         per                     O         (per, IN)
16   yeartimes                     O  (yeartimes, NNS)
17        
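
The `POS` column holds `(token, tag)` tuples because `nltk.pos_tag` returns one pair per token. If only the tag is wanted, and rows left blank by `skip_blank_lines=False` should not be passed to the tagger, a helper along these lines might be used. This is a minimal sketch: `tag_non_missing`, `is_missing`, and `stub_tagger` are hypothetical names, and the stub stands in for `nltk.pos_tag` only so the example runs without NLTK's model data.

```python
import math


def is_missing(token):
    """True for None or a float NaN (how pandas represents a blank row)."""
    return token is None or (isinstance(token, float) and math.isnan(token))


def tag_non_missing(tokens, tagger):
    """Return one POS tag per input token, with None for missing tokens.

    `tagger` is any callable mapping a list of strings to a list of
    (token, tag) pairs, for example nltk.pos_tag.
    """
    present = [str(t) for t in tokens if not is_missing(t)]
    tagged = iter(tagger(present))
    # Walk the original list so the result stays aligned with the rows
    return [None if is_missing(t) else next(tagged)[1] for t in tokens]


def stub_tagger(words):
    """Stand-in tagger (an assumption for this sketch, not a real POS model)."""
    return [(w, "CD" if any(c.isdigit() for c in w) else "NNP") for w in words]


tokens = ["$2,000,000", "USD", float("nan"), "JP", "Morgan"]
pos_tags = tag_non_missing(tokens, stub_tagger)
```

In the notebook this would be called as `ner_data["POS"] = tag_non_missing(ner_data["Token"].tolist(), nltk.pos_tag)`, which keeps the column aligned with the original rows.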

### Visualize Tag Distribution

In [None]:
tag_distribution = ner_data.groupby("NE").size().reset_index(name='counts')
print(tag_distribution)
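
The counts above are per tag, so `B-` and `I-` variants of the same entity are reported separately. To see the distribution per entity type instead, the prefix can be stripped before counting. The sketch below uses only the standard library; `entity_type` is a hypothetical helper, and the tag list is copied from the sample rows printed earlier rather than from the full dataset.

```python
from collections import Counter


def entity_type(tag):
    """Strip a leading B-/I- prefix; other tags such as 'O' are returned unchanged."""
    return tag[2:] if tag[:2] in ("B-", "I-") else tag


# NE values copied from rows 0-16 of the printed sample
sample_tags = [
    "B-Notional Amount", "I-Notional Amount", "B-Expiration Date", "O", "O",
    "B-Counterparty", "I-Counterparty", "O", "O", "O", "O", "O", "O",
    "B-Direction of Trade", "B-Fixed Rate", "O", "O",
]

entity_counts = Counter(entity_type(t) for t in sample_tags)
```

On the full dataframe the same idea could be applied with `ner_data["NE"].dropna().map(entity_type).value_counts()`.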

Next, filter out the labels that are not needed in this analysis (the `O` tag and missing values), keeping only the named-entity classes.

In [None]:
# Drop missing values first, then exclude the "O" (outside) tag
classes = [tag for tag in ner_data["NE"].dropna().unique() if tag != "O"]
print(classes)
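
The `classes` list keeps tags in the order they first appear in the data. When the list is later used for per-class evaluation, it can be easier to read if the `B-` and `I-` variants of each entity sit next to each other. One possible way to order it is sketched below; `sort_bio_labels` is a hypothetical helper, and the example list contains only the tags visible in the printed sample, so the real `classes` list may be longer.

```python
def sort_bio_labels(labels):
    """Order labels by entity name, then by prefix, so B-X and I-X are adjacent."""
    return sorted(labels, key=lambda name: (name[1:], name[0]))


# Tags visible in the printed sample rows (the full dataset may contain more)
example_classes = [
    "B-Notional Amount", "I-Notional Amount", "B-Expiration Date",
    "B-Counterparty", "I-Counterparty", "B-Direction of Trade", "B-Fixed Rate",
]

sorted_classes = sort_bio_labels(example_classes)
```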