## Practice on NER
This is a simple tutorial to recap and better understand application of NER in NLP problems.

In [1]:
# Required libraries
import pandas as pd
import numpy as np

# Take note of encoding
pd_dataset = pd.read_csv("ner_dataset.csv", encoding="latin1")

## Inspect dataset
Since the first column has labels only on the first word of each sentence, I'll use the 'ffill' method to fill forward the data to other rows.

In [2]:
pd_dataset.head()

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,,of,IN,O
2,,demonstrators,NNS,O
3,,have,VBP,O
4,,marched,VBN,O


In [3]:
pd_dataset.fillna(method="pad", inplace=True) # Can use either pad/ffill

# top 5 rows
pd_dataset.head()

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,Sentence: 1,of,IN,O
2,Sentence: 1,demonstrators,NNS,O
3,Sentence: 1,have,VBP,O
4,Sentence: 1,marched,VBN,O


In [4]:
# last 5 rows
pd_dataset.tail()

Unnamed: 0,Sentence #,Word,POS,Tag
1048570,Sentence: 47959,they,PRP,O
1048571,Sentence: 47959,responded,VBD,O
1048572,Sentence: 47959,to,TO,O
1048573,Sentence: 47959,the,DT,O
1048574,Sentence: 47959,attack,NN,O


In [5]:
# datatypes of columns
pd_dataset.dtypes

Sentence #    object
Word          object
POS           object
Tag           object
dtype: object

In [6]:
words = list(set(pd_dataset['Word'].values))
len(words)

35178

Currently, we can see that there are 47959 sentences and 35178 unique words.

### Define a class to retrieve sentences

In [7]:
class SentenceConcatenator(object):
    def __init__(self, dataset):
        self.sentence_pos = 1
        self.dataset = dataset
        self.is_empty = False
        
    def get_next_word(self):
        try:
            # Getting all words in a particular sentence through checking first column data
            sentence = self.dataset[self.dataset['Sentence #'] == "Sentence: {}".format(self.sentence_pos)]
            self.sentence_pos += 1
            return sentence['Word'].tolist(), sentence['POS'].tolist(), sentence['Tag'].tolist()
        except:
            # Case: No dataset
            self.empty = True
            return None, None, None

In [8]:
conca = SentenceConcatenator(pd_dataset)
sentence, pos, tag = conca.get_next_word()

### Memorization
The baseline would be to remember the most common named entity for each word and predict that. In the event where we don't know the word, we would just predict 'O'.

Links to libraries:<br>
1) BaseEstimator: http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html <br>
2) TransformerMixin: http://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html

<strong>BaseEstimator</strong> provides a default implementation for the <i>get_params</i> and <i>set_params</i> methods. This is useful in making the model grid search-able with <i>GridSearchCV</i> for automated parameters tuning and behaves well with others when combined in a <i>Pipeline</i>.

<strong>TransformerMixin</strong> is a simple class that provides a <i>.fit_transform</i> method that does <i>.fit</i> and <i>.transform</i> and it works well with scikit-learn.

In [9]:
# Required libraries
from sklearn.base import BaseEstimator, TransformerMixin

class MemoryTagger(BaseEstimator, TransformerMixin):
    def fit(self, X, y):
        '''
        Expects a list of words as X and a list of tags as y.
        '''
        voc = {}
        self.tags = []
        for x, t in zip(X, y):
            if t not in self.tags:
                self.tags.append(t)
            if x in voc:
                if t in voc[x]:
                    voc[x][t] += 1
                else:
                    voc[x][t] = 1
            else:
                voc[x] = {t: 1}
        self.memory = {}
        for k, d in voc.items():
            self.memory[k] = max(d, key=d.get)
        return self.memory
    
    def predict(self, X, y=None):
        '''
        Predict the the tag from memory. If word is unknown, predict 'O'.
        '''
        return [self.memory.get(x, 'O') for x in X]

In [10]:
# Initialize tagger
tagger = MemoryTagger()

# Fit the tagger
print(tagger.fit(sentence, tag))

# Getting all unique tags
tagger.tags

{'Thousands': 'O', 'of': 'O', 'demonstrators': 'O', 'have': 'O', 'marched': 'O', 'through': 'O', 'London': 'B-geo', 'to': 'O', 'protest': 'O', 'the': 'O', 'war': 'O', 'in': 'O', 'Iraq': 'B-geo', 'and': 'O', 'demand': 'O', 'withdrawal': 'O', 'British': 'B-gpe', 'troops': 'O', 'from': 'O', 'that': 'O', 'country': 'O', '.': 'O'}


['O', 'B-geo', 'B-gpe']

### Perform Cross-Validation

In [11]:
from sklearn.cross_validation import cross_val_predict
from sklearn.metrics import classification_report

all_words = pd_dataset['Word'].values.tolist()
all_tags = pd_dataset['Tag'].values.tolist()



### Recap on classfication report metrics
1) Precision (Low Precision = High False Positive)<br> 
<li>The ability of the classifier not to label as positive a sample that is negative.</li>

2) Recall (Low Recall = High False Negative)<br> 
<li>The ability of the classifier to find all the positive samples.</li>

3) f1-score <br>
<li>A weighted average of the precision and recall, where an f1 score reaches its best value at 1 and worst score at 0.</li>

4) support <br>
<li>The number of instances in the test set that have true label.</li>

In [12]:
# Predicting respective tags
pred = cross_val_predict(estimator=MemoryTagger(), X=all_words, y=all_tags, cv=5)

report = classification_report(y_pred=pred, y_true=all_tags)
print(report)

             precision    recall  f1-score   support

      B-art       0.20      0.05      0.09       402
      B-eve       0.54      0.25      0.34       308
      B-geo       0.78      0.85      0.81     37644
      B-gpe       0.94      0.93      0.94     15870
      B-nat       0.42      0.28      0.33       201
      B-org       0.67      0.49      0.56     20143
      B-per       0.78      0.65      0.71     16990
      B-tim       0.87      0.77      0.82     20333
      I-art       0.04      0.01      0.01       297
      I-eve       0.39      0.12      0.18       253
      I-geo       0.73      0.58      0.65      7414
      I-gpe       0.62      0.45      0.52       198
      I-nat       0.00      0.00      0.00        51
      I-org       0.69      0.53      0.60     16784
      I-per       0.73      0.65      0.69     17251
      I-tim       0.58      0.13      0.21      6528
          O       0.97      0.99      0.98    887908

avg / total       0.94      0.95      0.94  

The recall value is relatively low and signifies that it is weak. This is mainly because we are unable to predict on words we don't know about. Now for the fun part, let's apply some machine learning models and see which works best!

### Machine Learning Modeling

In [13]:
from sklearn.ensemble import RandomForestClassifier

# These simple features are created using methods in python library
def feature_map(word):
    '''Simple feature map.'''
    return np.array([word.istitle(), word.islower(), word.isupper(), len(word),
                     word.isdigit(),  word.isalpha()])

words = [feature_map(w) for w in pd_dataset["Word"].values.tolist()]

pred = cross_val_predict(RandomForestClassifier(n_estimators=20),
                         X=words, y=all_tags, cv=5)

report = classification_report(y_pred=pred, y_true=all_tags)
print(report)

             precision    recall  f1-score   support

      B-art       0.00      0.00      0.00       402
      B-eve       0.00      0.00      0.00       308
      B-geo       0.26      0.79      0.40     37644
      B-gpe       0.26      0.06      0.09     15870
      B-nat       0.00      0.00      0.00       201
      B-org       0.65      0.17      0.27     20143
      B-per       0.97      0.20      0.33     16990
      B-tim       0.29      0.32      0.30     20333
      I-art       0.00      0.00      0.00       297
      I-eve       0.00      0.00      0.00       253
      I-geo       0.00      0.00      0.00      7414
      I-gpe       0.00      0.00      0.00       198
      I-nat       0.00      0.00      0.00        51
      I-org       0.36      0.03      0.06     16784
      I-per       0.47      0.02      0.04     17251
      I-tim       0.50      0.06      0.11      6528
          O       0.97      0.98      0.97    887908

avg / total       0.88      0.87      0.86  

  'precision', 'predicted', average, warn_for)


The results worsened and this can be attributed to the lack of information necessary for the decision to be made. Next, we shall enhance our simple features by memory and also, context information.

## Feature Engineering

<strong>Label-Encoding</strong><br>
In essence, label-encoding is used to transform categorical data into suitable numeric values. It converts each value in a column to a number.

In [14]:
from sklearn.preprocessing import LabelEncoder

class FeatureTransformer(BaseEstimator, TransformerMixin):
    
    def __init__(self):
        self.memory_tagger = MemoryTagger()
        self.tag_encoder = LabelEncoder()
        self.pos_encoder = LabelEncoder()
        
    def fit(self, X, y):
        words = X["Word"].values.tolist()
        self.pos = X["POS"].values.tolist()
        tags = X["Tag"].values.tolist()
        self.memory_tagger.fit(all_words, all_tags)
        self.tag_encoder.fit(all_tags)
        self.pos_encoder.fit(self.pos)
        return self
    
    def transform(self, X, y=None):
        def pos_default(p):
            if p in self.pos:
                return self.pos_encoder.transform([p])[0]
            else:
                return -1
        
        pos = X["POS"].values.tolist()
        words = X["Word"].values.tolist()
        out = []
        for i in range(len(words)):
            w = words[i]
            p = pos[i]
            if i < len(words) - 1:
                wp = self.tag_encoder.transform(self.memory_tagger.predict([words[i+1]]))[0]
                posp = pos_default(pos[i+1])
            else:
                wp = self.tag_encoder.transform(['O'])[0]
                posp = pos_default(".")
            if i > 0:
                if words[i-1] != ".":
                    wm = self.tag_encoder.transform(self.memory_tagger.predict([words[i-1]]))[0]
                    posm = pos_default(pos[i-1])
                else:
                    wm = self.tag_encoder.transform(['O'])[0]
                    posm = pos_default(".")
            else:
                posm = pos_default(".")
                wm = self.tag_encoder.transform(['O'])[0]
            out.append(np.array([w.istitle(), w.islower(), w.isupper(), len(w), w.isdigit(), w.isalpha(),
                                 self.tag_encoder.transform(self.memory_tagger.predict([w]))[0],
                                 pos_default(p), wp, wm, posp, posm]))
        return out

## Scikit-learn Pipeline

Next, we will use the pipeline from scikit-learn to build our model.<br>
Pipeline lists the steps that our model puts the text corpus through.

In [15]:
from sklearn.pipeline import Pipeline

pred = cross_val_predict(Pipeline([("feature_map", FeatureTransformer()), 
                                   ("clf", RandomForestClassifier(n_estimators=20, n_jobs=3))]),
                         X=pd_dataset, y=all_tags, cv=5)

report = classification_report(y_pred=pred, y_true=all_tags)
print(report)

             precision    recall  f1-score   support

      B-art       0.54      0.36      0.43       402
      B-eve       0.47      0.28      0.35       308
      B-geo       0.84      0.90      0.87     37644
      B-gpe       0.98      0.93      0.96     15870
      B-nat       0.46      0.23      0.31       201
      B-org       0.78      0.71      0.75     20143
      B-per       0.86      0.86      0.86     16990
      B-tim       0.90      0.82      0.86     20333
      I-art       0.37      0.14      0.20       297
      I-eve       0.27      0.13      0.18       253
      I-geo       0.79      0.72      0.75      7414
      I-gpe       0.72      0.47      0.57       198
      I-nat       0.86      0.24      0.37        51
      I-org       0.80      0.77      0.78     16784
      I-per       0.89      0.89      0.89     17251
      I-tim       0.84      0.54      0.66      6528
          O       0.99      1.00      0.99    887908

avg / total       0.97      0.97      0.97  

Results have shown an improvement, showing feature engineering being an integral step in word-processing. Nonetheless, this is just an implementation of simple features and way more can be done to improve the outcome.