## Coursework Task 1: Extracting Entities from Scientific Abstracts

In Natural Language Processing, Named Entity Recognition is a process of extracting the important information from large unstructered text data and classifying those entities into suitable categories. In this project, we will look at the task of extracting entity types such as Task, Process and Material from corpus of text.

## Method 1: Conditional random fields + POS

In [1]:
#conda install -c conda-forge sklearn-crfsuite

In [2]:
#import the necessary libraries
import os
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import itertools

from pathlib import Path
from scienceie_loader import *

from sklearn_crfsuite import CRF, metrics
from sklearn.metrics import make_scorer, f1_score,classification_report
from sklearn.model_selection import train_test_split
from sklearn.exceptions import UndefinedMetricWarning

from IPython import display

import nltk
from nltk import pos_tag
from nltk.tag import CRFTagger

import warnings
warnings.filterwarnings("ignore")

## 1. Data Pre-processing

In this subsection, we will divide a document into sentences and assign POS tags to each word in the sentence. 

In [3]:
#dataset path
data_root = os.path.join(os.getcwd(), 'original_datasets')
data_train = os.path.join(data_root, 'scienceie2017_train/train2')
data_dev = os.path.join(data_root, 'scienceie2017_dev/dev')
data_test = os.path.join(data_root, 'semeval_articles_test')

In [4]:
#tokenized data
train_docs, train_rels, _ = load_tokenized_data(data_train)
dev_docs, dev_rels, _ = load_tokenized_data(data_dev)
test_docs, test_rels, _ = load_tokenized_data(data_test)

print(f'number of training documents: {len(train_docs)}')
print(f'number of dev documents: {len(dev_docs)}')
print(f'number of test documents: {len(test_docs)}')

number of training documents: 350
number of dev documents: 50
number of test documents: 100


In [5]:
#labels
labels = sorted(list({tag for doc in train_docs for word, tag in doc}))

I have decided to split the document into sentences. So, it would give us a more datapoint.

In [6]:
def docs_sentence(docs):
    """function that split document into sentences"""
    answer = []
    result = []
    a = []
    for doc in docs:
        for word, ner in doc:
            if word !=".":
                a.append((word, ner))
            else:
                a.append(('.', 'O'))
                result.append(a)
                a = list()
        answer.append(result)
        result = list()
        
    return answer

In [7]:
def pos(lst):
    """function that assign pos tag to each word in a sentence"""
    answer = []
    temp = []
    
    for doc in lst:
        for sentence in doc:
            words = [word for word, ner in sentence]
            pos_tags = nltk.pos_tag(words)
            atemp = []
            
            for num in range(len(pos_tags)):
                atemp.append((sentence[num][0], pos_tags[num][1], sentence[num][1]))
            temp.append(atemp)
            atemp = list()
        answer.append(temp)
        temp = list()
        
    return answer
        
        
    

In [8]:
#train_set and test_set are in the form [[[sentence1, sentence2, ...]]]
train_set = pos(docs_sentence(train_docs))
test_set = pos(docs_sentence(test_docs))

In [9]:
#train and test
train = [sentence for document in train_set for sentence in document]
test = [sentence for document in test_set for sentence in document]

In [10]:
#lengths of train and test
print(f"The lengths of train and test are {len(train)} and {len(test)} respectively.")

The lengths of train and test are 717 and 248 respectively.


## 2. Features

Next, we will create features needed to perform named entity recognition. In this example we use word identity, word suffix, word shape and word POS tag; also, some information from nearby words is used.

This makes a simple baseline, but you certainly can add and remove some features to get (much?) better results - experiment with it.

sklearn-crfsuite (and python-crfsuite) supports several feature formats; here we use feature dicts.

In [11]:
def CRFfeatures(sentence, counter):
    
    """function to create features for NER"""
    
    word = sentence[counter][0]
    pos_tag = sentence[counter][1]
    
    """dictionary to save all the values of the features"""
    result = { "POS_Tag[:2]" : pos_tag[:2],
               "POS_Tag" : pos_tag,
               "word.isdigit()" : word.isdigit(),
               "word.istitle()" : word.istitle(),
               "word.isupper()" : word.isupper(),
               "word[-2:]" : word[-2:],
               "word[-3:]" : word[-3:],
               "word.lower()" : word.lower(),
               "bias" : 1.0
    }
    
    if counter > 1:
        previousword = sentence[counter-1][0]
        previouspos_tag = sentence[counter-1][1]
        
        result.update({ "previous:word.lower()": previousword.lower(),
                          "previous:word.istitle()": previousword.istitle(),
                          "previous:word.isupper()": previousword.isupper(),
                          "previous:POS_Tag": previouspos_tag,
                          "previous:POS_Tag[:2]": previouspos_tag[:2]
                        })
    else:
        result["BOS"] = True
    
    if counter < len(sentence)-1:
        nextword = sentence[counter+1][0]
        nextpos_tag = sentence[counter+1][1]
   
        result.update({ "next:word.lower()": nextword.lower(),
                          "next:word.istitle()": nextword.istitle(),
                          "next:word.isupper()": nextword.isupper(),
                          "next:POS_Tag": nextpos_tag,
                          "next:POS_Tag[:2]": nextpos_tag[:2]
                        })  
    else:
        result["EOS"] = True
    
    return result

In [13]:
def extract_features(sentence):
    
    """function that returns the features for each word in a sentence"""
    return [CRFfeatures(sentence, number) for number in range(len(sentence))]

def assign_labels(sentence):
    
    """function that returns the NER tag for each word in a sentence"""
    return [label for token, pos, label in sentence]

In [14]:
X_train = [extract_features(sentence) for sentence in train]
y_train = [assign_labels(sentence) for sentence in train]
X_test = [extract_features(sentence) for sentence in test]
y_test = [assign_labels(sentence) for sentence in test]

In [15]:
crf = CRF(algorithm="lbfgs",
          c1=0.30017377233926223,
          c2=0.11259458226548057,
          max_iterations=1000,
          all_possible_transitions=True)

c1 and c2 values are obtained using RandomizedSearchCV technique. Sometime, I get CRF keep_tempfiles error and only way I could solve this issue by restarting the notebook. Thats why I have removed my hyperparamter tuning step.

In [None]:
crf.fit(X_train, y_train)

In [23]:
y_pred = crf.predict(X_test)
print(metrics.flat_classification_report(y_test, y_pred))

              precision    recall  f1-score   support

  B-Material       0.38      0.22      0.28       810
   B-Process       0.41      0.27      0.32       883
      B-Task       0.15      0.12      0.14       187
  I-Material       0.34      0.32      0.33       759
   I-Process       0.26      0.32      0.29      1248
      I-Task       0.16      0.22      0.18       737
           O       0.87      0.88      0.87     16473

    accuracy                           0.74     21097
   macro avg       0.37      0.34      0.34     21097
weighted avg       0.74      0.74      0.74     21097



In [26]:
def BPT(lst):
    
    """function that converts B and I tags of a given entity to single
    NER tag"""
    
    #example: "B-Process" and "I-Process" to "Process"

    a = list(itertools.chain.from_iterable(lst))
    result = []
    
    for element in a:
        if element == "O":
            result.append("O")
        elif element == "B-Material" or element == "I-Material":
            result.append("Material")
        elif element == "B-Process" or element == "I-Process":
            result.append("Process")
        elif element == "B-Task" or element == "I-Task":
            result.append("Task")
            
    return result      

In [27]:
print(classification_report(BPT(y_test), BPT(y_pred)))

              precision    recall  f1-score   support

    Material       0.41      0.31      0.35      1569
           O       0.87      0.88      0.87     16473
     Process       0.35      0.34      0.35      2131
        Task       0.17      0.21      0.19       924

    accuracy                           0.75     21097
   macro avg       0.45      0.44      0.44     21097
weighted avg       0.75      0.75      0.75     21097



## Error analysis

In [69]:
#store the values of entity, true and pred
dictionary = {
             "true tag":[],
             "predicted tag":[]}

In [76]:
dictionary["entity"] = [word for word, pos, ne_tag in test[82][:11]]
dictionary["true tag"] = y_test[82][:11]
dictionary["predicted tag"] = y_pred[82][:11]

In [77]:
#dataframe
df = pd.DataFrame(data = dictionary)
df = df[["entity", "true tag", "predicted tag"]]
df

Unnamed: 0,entity,true tag,predicted tag
0,Apache,B-Process,O
1,Pig,I-Process,O
2,is,O,O
3,a,O,O
4,platform,O,O
5,for,O,O
6,creating,O,B-Task
7,MapReduce,B-Process,I-Task
8,workflows,I-Process,I-Task
9,with,O,I-Task
