In [None]:
from IPython.core.display import HTML
def css_styling():
    styles = open("../input/titlestyle/style2.css", "r").read()
    return HTML(styles)
css_styling()

<div class="heading">
   <h1><span style="color: white">Content</span></h1>
</div>

* [Named-Entity Recognition (NER) - Intro and Concept](#topic1)
* [Applications](#topic2)
* [POS Tagging](#topic3)
* [Chunking](#topic4)
* [IOB Tags](#topic5)
* [Extracting Named Entities (NLTK)](#topic6)
* [spaCy Named Entity Recognition](#topic7)
* [Visualizing Named Entities (spaCy)](#topic8)
* [NER BERT Hugging Face](#topic9)
* [NER BERT Finetuning Existing Model](#topic10)
* [Other Useful Resources and References](#topic11)

<a id =topic1> </a>
<div class="heading">
   <h1><span style="color: white">Named-Entity Recognition (NER)</span></h1>
</div>
<div class="content">

<b>Named-entity recognition (NER) is the task of locating named entities mentioned in unstructured text and classifying them into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, etc.</b>
<br><br>
NER is used in many areas of NLP and can help answer questions such as:
<ul>
<li>Which companies and persons are mentioned in the documents?
<li>In which articles or posts is a specified product mentioned?
<li>Does the article contain medical terms and which ones?
</ul>

State-of-the-art NER systems for English produce near-human performance: the best system entering MUC-7 scored an F-measure of 93.39%, while human annotators scored around 97%.
<br><br>
Named-entity recognition is often broken down into two distinct problems:
<ul>
<li>Detection of names
<li>Classification of the names by the type of entity they refer to (person, organization or location)
</ul>
<u>Detection of names</u> is typically simplified to a segmentation problem, where a single name may be made up of several substrings. For example, "Bank of America" is a single name even though the substring "America" is itself a name. 
<u>Classification of names</u> requires choosing an ontology by which to organize categories of things.
<br><br>
While doing NER, besides the correct and incorrect predicted terms, we'll probably face some "partially correct" predictions. For example:
<br>
<ul>
<li>Incomplete names (missing the last token of "John Smith, M.D.")
<li>Names with extra tokens (including the "mr." token in "mr. John Smith")
<li>Partitioning adjacent entities differently (for example, treating the two names in "Hans, Jones Blick" as one)
<li>Assigning related but inexact type (for example, "substance" vs. "drug", or "school" vs. "organization")
<li>Correctly identifying an entity, when what the user wanted was a smaller- or larger-scope entity (for example, identifying "James Madison" as a personal name, when it's part of "James Madison University")
</ul>
<br><br>
</div>
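
<div class="content">
The F-measure mentioned above is usually computed over whole entity spans, which is why "partially correct" predictions matter. The following is a minimal sketch, assuming strict exact-match scoring where a predicted span counts only if its boundaries and its type both agree with the annotation. The function name <code>span_f1</code> and the token-index spans are hypothetical and used for illustration only.
</div>

```python
def span_f1(gold, predicted):
    """Exact-match precision, recall and F1 over (start, end, label) spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical sentence: "mr. John Smith visited James Madison University"
# token indices:           0    1    2     3       4     5       6
gold = [(1, 3, "PER"), (4, 7, "ORG")]
# The first prediction wrongly includes the "mr." token, so it is not an exact match
predicted = [(0, 3, "PER"), (4, 7, "ORG")]

precision, recall, f1 = span_f1(gold, predicted)
print(precision, recall, f1)  # 0.5 0.5 0.5
```

<div class="content">
Under this strict scheme the "mr. John Smith" prediction earns no credit at all, even though it overlaps the annotated name. Evaluation schemes that reward partial overlap exist, but they are not shown here.
</div>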

<a id =topic2> </a>
<div class="heading">
   <h1><span style="color: white">Applications</span></h1>
</div>
<div class="content">

NER can be applied in many real-world situations where analyzing a large quantity of text is helpful. Some example applications include:

<ul>
<li> Improve customer support by categorizing and filtering user requests, complaints, and questions. It can help businesses obtain insights about their customers.
<li> Help categorize applicants’ CVs and speed up the process.
<li> Improve search and recommendation engines using recognized entities.
<li> Search and extract useful information from documents and blog posts.
</ul>
    
</div>

<a id =topic3> </a>
<div class="heading">
   <h1><span style="color: white">POS Tagging</span></h1>
</div>
<div class="content">

<b>POS tagging (part-of-speech tagging)</b> is the process of marking up each word in a text with its part of speech (noun, verb, adjective, and so on), based on both its definition and its context. It is also called grammatical tagging.
</div>

In [None]:
import numpy as np
import pandas as pd
import os
from nltk import word_tokenize, pos_tag

#os.listdir('../input/nlp-getting-started/train.csv')
# read data from nlp-getting-started
nlp_start_df = pd.read_csv('../input/nlp-getting-started/train.csv')

# take one example sentence 
ex = nlp_start_df.loc[159]['text']

# tokenize the sentence and apply POS tagging
sent = pos_tag(word_tokenize(ex))
sent

<div class="content">
From the example above, we see that the word <b>'Experts' is tagged 'NNS' (plural noun), 'France' is 'NNP' (proper noun) and 'French' is 'JJ' (adjective)</b>. The whole list of abbreviations can be found <a href="https://www.guru99.com/pos-tagging-chunking-nltk.html">here</a>. After this step, we can move on to chunking noun phrases, which is the basis for finding named entities.
</div>
<a id =topic4> </a>
<div class="heading">
   <h1><span style="color: white">Chunking</span></h1>
</div>
<div class="content">

<b>Chunking in NLP is the process of grouping small pieces of information into larger units.</b> The primary use of chunking is to group words into "noun phrases." It adds structure to the sentence by combining POS tags with regular expressions. The resulting groups of words are called "chunks."

There are no pre-defined rules for chunking; we define them according to our needs. Thus, if we want to chunk <b>only 'NN'</b> tags, we use the pattern <pre><code>mychunk:{&lt;NN&gt;}</code></pre> but if we need to chunk all tags which <b>start with 'NN'</b>, we use <pre><code>mychunk:{&lt;NN.*&gt;}</code></pre> More about regex patterns can be found <a href="https://www.w3schools.com/python/python_regex.asp">here</a>.
</div>
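
<div class="content">
To make the difference between the two tag patterns concrete, here is a simplified pure-Python sketch of what a rule such as <code>{&lt;NN.*&gt;+}</code> does: it groups maximal runs of adjacent tokens whose POS tag matches a regular expression. This is an illustration only and not NLTK's implementation; the helper <code>chunk_by_tag</code> and the hand-tagged sentence are hypothetical.
</div>

```python
import re
from itertools import groupby


def chunk_by_tag(tagged_tokens, tag_regex):
    """Group maximal runs of adjacent tokens whose POS tag fully matches tag_regex."""
    def tag_matches(pair):
        return re.fullmatch(tag_regex, pair[1]) is not None

    chunks = []
    for matches, run in groupby(tagged_tokens, key=tag_matches):
        if matches:
            chunks.append([word for word, _tag in run])
    return chunks


# Hand-tagged example sentence (tags assigned manually for illustration)
tagged = [("French", "JJ"), ("experts", "NNS"), ("visit", "VBP"),
          ("New", "NNP"), ("York", "NNP"), ("city", "NN")]

only_nn = chunk_by_tag(tagged, r"NN")     # exactly the tag 'NN'
any_noun = chunk_by_tag(tagged, r"NN.*")  # any tag starting with 'NN'

print(only_nn)   # [['city']]
print(any_noun)  # [['experts'], ['New', 'York', 'city']]
```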

In [None]:
!pip install svgling

In [None]:
from nltk import RegexpParser
import svgling

# chunk all runs of adjacent nouns (any tag starting with NN)
patterns = """mychunk:{<NN.*>+}"""
chunker = RegexpParser(patterns)
output = chunker.parse(sent)
print("After Chunking",output)
svgling.draw_tree(output)

<a id =topic5> </a>
<div class="heading">
   <h1><span style="color: white">IOB Tags</span></h1>
</div>
<div class="content">

Like part-of-speech tags, <b>IOB tags</b> are assigned to each token, and they offer a slightly different way of representing chunk structures. The format marks whether a token is at the beginning of a chunk (B), inside a chunk (I), or outside any chunk (O).
</div>
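
<div class="content">
As a rough illustration of how IOB tags encode chunk boundaries, the sketch below rebuilds chunks from (word, POS tag, IOB tag) triples, which is the shape of triple this notebook gets from <code>tree2conlltags</code>. The helper <code>iob_to_chunks</code> and the sample triples are hypothetical; NLTK's own <code>conlltags2tree</code> does the real conversion back to a tree.
</div>

```python
def iob_to_chunks(iob_triples):
    """Collect (label, words) chunks from (word, pos, iob_tag) triples."""
    chunks, current_label, current_words = [], None, []
    for word, _pos, tag in iob_triples:
        prefix, _, label = tag.partition("-")
        # A chunk starts at a B- tag, or at an I- tag whose label differs
        # from the chunk currently being collected
        starts_new = prefix == "B" or (prefix == "I" and label != current_label)
        if current_words and (prefix == "O" or starts_new):
            chunks.append((current_label, current_words))
            current_label, current_words = None, []
        if prefix in ("B", "I"):
            current_label = label
            current_words.append(word)
    if current_words:
        chunks.append((current_label, current_words))
    return chunks


# Hand-written sample using the chunk label from this notebook ("mychunk")
sample = [("French", "JJ", "O"), ("experts", "NNS", "B-mychunk"),
          ("visit", "VBP", "O"), ("New", "NNP", "B-mychunk"),
          ("York", "NNP", "I-mychunk"), ("city", "NN", "I-mychunk")]

chunks = iob_to_chunks(sample)
print(chunks)  # [('mychunk', ['experts']), ('mychunk', ['New', 'York', 'city'])]
```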

In [None]:
from nltk.chunk import conlltags2tree, tree2conlltags
from pprint import pprint

iob_tagged = tree2conlltags(output)
iob_tagged

<a id =topic6> </a>
<div class="heading">
   <h1><span style="color: white">Extracting Named Entities</span></h1>
</div>
<div class="content">
Recognizing a <b>named entity</b> is a specific kind of chunk extraction that uses entity tags along with chunk tags. Common entity tags include <b>PERSON, LOCATION, and ORGANIZATION</b>. NLTK already ships a pre-trained named entity chunker, available through the ne_chunk() function in the nltk.chunk module.
</div>

In [None]:
from nltk.chunk import ne_chunk

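def extract_ne(trees, labels):
    """Return the subtrees of `trees` whose label is in `labels`."""
    ne_list = []
    # iterate over the argument, not the global ne_res
    for tree in trees:
        # leaves are plain (word, tag) tuples; only subtrees have a label
        if hasattr(tree, 'label') and tree.label() in labels:
            ne_list.append(tree)

    return ne_list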
            
ne_res = ne_chunk(pos_tag(word_tokenize(ex)))
labels = ['ORGANIZATION']

print(extract_ne(ne_res, labels))

<a id =topic7> </a>
<div class="heading">
   <h1><span style="color: white">spaCy Named Entity Recognition</span></h1>
</div>
<div class="content">
<b>spaCy features an extremely fast statistical entity recognition system that assigns labels to contiguous spans of tokens</b>. The default trained pipelines can identify a variety of named and numeric entities, including companies, locations, organizations and products.
</div>

In [None]:
import sqlite3

cnx = sqlite3.connect('../input/wikibooks-dataset/wikibooks.sqlite')
df_wikibooks = pd.read_sql_query("SELECT * FROM en", cnx)
df_wikibooks.head()

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")
wiki_ex = df_wikibooks.iloc[11]['body_text']
doc = nlp(wiki_ex)
doc

In [None]:
print('All entity types that spacy recognised from the document above')
set([ent.label_ for ent in doc.ents])

In [None]:
# compare entity texts so that repeated mentions are listed only once
print('Persons from the document above')
print({ent.text for ent in doc.ents if ent.label_ == 'PERSON'})
print('Organizations from the document above')
print({ent.text for ent in doc.ents if ent.label_ == 'ORG'})

<a id =topic8> </a>
<div class="heading">
   <h1><span style="color: white">Visualizing Named Entities</span></h1>
</div>

In [None]:
from spacy import displacy
displacy.render(doc, style="ent", jupyter=True)

<a id =topic9> </a>
<div class="heading">
   <h1><span style="color: white">NER BERT Hugging Face</span></h1>
</div>
<div class="content">
<b>BERT (Bidirectional Encoder Representations from Transformers)</b> is a Transformer-based neural network pretrained on a large text corpus. It converts tokens into numerical vectors (embeddings) and reads the whole sentence in both directions, so the representation of each word reflects the context in which it is used.
<br><br>
Because of this, BERT is well suited to language-understanding tasks that depend on linguistic cues in the text and on the context in which words are used.
<br><br>
Besides text classification, summarization, question answering and several other tasks, BERT can be used to solve NER.
<br><br>
<b>Hugging Face Transformers</b> is one of the most widely used NLP libraries. With this library, we can leverage popular NLP models such as BERT, DistilBERT, and RoBERTa and apply them to a wide range of text-processing tasks. Hugging Face provides simple access to a variety of models and datasets for many NLP tasks.
<br><br>
Models for NER tasks can be found under the "Token Classification" section here https://huggingface.co/models?pipeline_tag=token-classification&sort=downloads. We can use the default model or select one from the mentioned link.
<br><br>
A great Hugging Face course about transformers and the usage of the library can be found here: https://huggingface.co/course/chapter0?fw=pt
</div>

In [None]:
from transformers import pipeline

nlp_start_df = pd.read_csv('../input/nlp-getting-started/train.csv')
# take one example sentence 
ex = nlp_start_df.loc[159]['text']

generator = pipeline("ner",
                     #model="dslim/bert-base-NER",
                     grouped_entities=True)
generator(ex)
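
<div class="content">
When entities are grouped, the pipeline returns a list of dictionaries with the keys <code>entity_group</code>, <code>score</code>, <code>word</code>, <code>start</code> and <code>end</code>. The sketch below shows one way to post-process such a list by dropping low-confidence entities and grouping the rest by type. The sample values (including scores and character offsets) are made up for illustration and are not real model output; the helper <code>entities_by_type</code> and the 0.5 threshold are assumptions.
</div>

```python
# Made-up sample in the shape returned by a grouped "ner" pipeline
sample_output = [
    {"entity_group": "ORG", "score": 0.98, "word": "Fox News", "start": 46, "end": 54},
    {"entity_group": "PER", "score": 0.99, "word": "Mike Tobin", "start": 65, "end": 75},
    {"entity_group": "LOC", "score": 0.42, "word": "Minnesota", "start": 122, "end": 131},
]


def entities_by_type(results, min_score=0.5):
    """Group entity texts by type, keeping only those scored at least min_score."""
    grouped = {}
    for item in results:
        if item["score"] >= min_score:
            grouped.setdefault(item["entity_group"], []).append(item["word"])
    return grouped


confident = entities_by_type(sample_output)
print(confident)  # {'ORG': ['Fox News'], 'PER': ['Mike Tobin']}
```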


<a id =topic10> </a>
<div class="heading">
   <h1><span style="color: white">NER BERT Finetuning Existing Model</span></h1>
</div>
<div class="content">

Training a neural network from scratch is difficult and time-consuming, and for most users it is not even feasible. Instead, we can use models from Hugging Face that have already been trained on a large amount of text. 
<br><br>
Through this pretraining, such models develop a statistical understanding of the language they were trained on, but they might not be directly useful for our specific task. To make use of the model's knowledge, we can apply fine-tuning: we take a pretrained model and train it a little more on our annotated data.
<br><br>
This process is called transfer learning, because knowledge is transferred from one model or task to another. It is a common strategy in deep learning.
<br><br>
First of all, we can show the annotated data for the NER task.
</div>

In [None]:
import pandas as pd
data_dir = '../input/entity-annotated-corpus'
df = pd.read_csv(f'{data_dir}/ner_dataset.csv')
df['Sentence #'] = df['Sentence #'].ffill()
df_gr = df.groupby('Sentence #').agg(lambda x: list(x))

df_gr.head()

<div class="content">
In our case, we'll need the columns `Word` (tokenized sentence) and `Tag` (entities in the sentence). Also, in order to fine-tune the model, our data must use the same entities that the model was trained on. Let's print the entities from our data set and from the pretrained model.
</div>

In [None]:
tags = []
for tag in df_gr['Tag'].to_list():
    tags.extend(tag)
print('Entities in our data set')
set(tags)

In [None]:
from transformers import AutoModelForTokenClassification

model_checkpoint = "dslim/bert-base-NER"
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint)

print('Entities from the pretrained model')
model.config.id2label


Create the mapping that converts each entity tag in our data set into the corresponding entity id (key) from the pretrained model's `id2label` printed above.

In [None]:
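entity_mapping = {
    # tags with a direct counterpart in the pretrained model
    'O': 0,
    'B-per': 3, 'I-per': 4, 'B-org': 5, 'I-org': 6, 'B-geo': 7, 'I-geo': 8,
    # all remaining tags (B- and I- alike) are collapsed into id 1
    'B-art': 1, 'B-eve': 1, 'B-gpe': 1, 'B-nat': 1, 'B-tim': 1,
    'I-art': 1, 'I-eve': 1, 'I-gpe': 1, 'I-nat': 1, 'I-tim': 1,
}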
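
<div class="content">
Because the labels are later looked up with <code>entity_mapping[...]</code>, any tag missing from the mapping would raise a <code>KeyError</code> in the middle of training. A quick check like the one sketched below can catch that early. The shortened mapping, the tag set and the helper <code>unmapped_tags</code> are hypothetical and only illustrate the idea.
</div>

```python
# Shortened, illustrative copy of a tag-to-id mapping (not the full one)
mapping_sketch = {'O': 0, 'B-per': 3, 'I-per': 4, 'B-tim': 1, 'I-tim': 1}

# Hypothetical set of tags found in a data set; 'B-xyz' is deliberately unknown
dataset_tags = {'O', 'B-per', 'I-per', 'B-tim', 'B-xyz'}


def unmapped_tags(tags, mapping):
    """Return the tags that have no id and would fail during label conversion."""
    return sorted(set(tags) - set(mapping))


missing = unmapped_tags(dataset_tags, mapping_sketch)
print(missing)  # ['B-xyz']
```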

In [None]:
class NERDataset:
    def __init__(self, df):
        # input is annotated data frame
        self.texts = df['Word'].to_list()
        self.tags = df['Tag'].to_list()
    
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, item):
        text = self.texts[item]
        tags = self.tags[item]
        
        ids = []
        target_tag =[]
        
        # tokenize words and define tags accordingly
        # running -> [run, ##ning]
        # tags - ['O', 'O']
        for i, s in enumerate(text):
            inputs = tokenizer.encode(s, add_special_tokens=False)
            input_len = len(inputs)
            ids.extend(inputs)
            target_tag.extend([entity_mapping[tags[i]]] * input_len)
        
        # truncate
        ids = ids[:MAX_LEN - 2]
        target_tag = target_tag[:MAX_LEN - 2]
        
        # add special tokens
        ids = [101] + ids + [102]
        target_tag = [0] + target_tag + [0]
        mask = [1] * len(ids)
        token_type_ids = [0] * len(ids)
        
        # construct padding
        padding_len = MAX_LEN - len(ids)
        ids = ids + ([0] * padding_len)
        mask = mask + ([0] * padding_len)
        token_type_ids = token_type_ids + ([0] * padding_len)
        target_tag = target_tag + ([0] * padding_len)
        
        return {'input_ids': torch.tensor(ids, dtype=torch.long),
                'attention_mask': torch.tensor(mask, dtype=torch.long),
                'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long),
                'labels': torch.tensor(target_tag, dtype=torch.long)
               }
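
<div class="content">
The label alignment performed in <code>__getitem__</code> can be followed on a toy example. The sketch below uses a hypothetical lookup table in place of a real tokenizer (the sub-token ids 11 to 14 are made up), and reuses the special-token ids 101 and 102 and the padding id 0 from the class above. Every sub-token inherits the label of the word it came from.
</div>

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # same ids as in the class above

# Hypothetical stand-in for a tokenizer: word -> list of made-up sub-token ids.
# "running" is split into two pieces, like running -> [run, ##ning].
TOY_PIECES = {"John": [11], "is": [12], "running": [13, 14]}


def encode_with_labels(words, label_ids, max_len):
    """Expand word-level labels to sub-tokens, then truncate, wrap and pad."""
    ids, labels = [], []
    for word, label in zip(words, label_ids):
        pieces = TOY_PIECES[word]
        ids.extend(pieces)
        labels.extend([label] * len(pieces))  # each piece inherits the word label

    # leave room for the two special tokens, which get label 0
    ids = [CLS_ID] + ids[:max_len - 2] + [SEP_ID]
    labels = [0] + labels[:max_len - 2] + [0]
    mask = [1] * len(ids)

    padding = max_len - len(ids)
    return ids + [PAD_ID] * padding, mask + [0] * padding, labels + [0] * padding


ids, mask, labels = encode_with_labels(["John", "is", "running"], [3, 0, 0], max_len=8)
print(ids)     # [101, 11, 12, 13, 14, 102, 0, 0]
print(mask)    # [1, 1, 1, 1, 1, 1, 0, 0]
print(labels)  # [0, 3, 0, 0, 0, 0, 0, 0]
```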

In [None]:
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer
import torch

df_train, df_val = train_test_split(df_gr, test_size=0.2, random_state=42)
df_val, df_test = train_test_split(df_val, test_size=0.5, random_state=42)

model_checkpoint = "dslim/bert-base-NER"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

MAX_LEN = 128

data_train = NERDataset(df_train)
data_val = NERDataset(df_val)
data_test = NERDataset(df_test)

# initialize DataLoader used to return batches for training/validation
loader_train = torch.utils.data.DataLoader(
    data_train, batch_size=32, num_workers=4
)

loader_val = torch.utils.data.DataLoader(
    data_val, batch_size=32, num_workers=4
)

In [None]:
# transformers.AdamW is deprecated in favor of the PyTorch implementation
from torch.optim import AdamW
from transformers import AutoModelForTokenClassification, get_scheduler
from tqdm.notebook import tqdm
from sklearn.metrics import f1_score, accuracy_score

model = AutoModelForTokenClassification.from_pretrained(model_checkpoint)

# just train the linear classifier on top of BERT
param_optimizer = list(model.classifier.named_parameters())
optimizer_grouped_parameters = [{"params": [p for n, p in param_optimizer]}]
optimizer = AdamW(
    optimizer_grouped_parameters,
    lr=3e-5,
    eps=1e-12
)
## full finetuning
#optimizer = AdamW(model.parameters())

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

# add scheduler to linearly reduce the learning rate throughout the epochs.
num_epochs = 3
num_training_steps = num_epochs * len(loader_train)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)


progress_bar = tqdm(range(num_training_steps))
for epoch in range(num_epochs):
    model.train()
    final_loss = 0
    predictions , true_labels = [], []
    for batch in loader_train:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        true_labels.extend(batch['labels'].detach().cpu().numpy().ravel())
        predictions.extend(np.argmax(outputs[1].detach().cpu().numpy(), axis=2).ravel())
        
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
        final_loss+=loss.item()
        
    print(f'Training loss: {final_loss/len(loader_train)}')
    print('Training F1: {}'.format(f1_score(predictions, true_labels, average='macro')))
    print(f'Training acc: {accuracy_score(predictions, true_labels)}')
    print('*'*20)
    
    model.eval()
    final_loss = 0
    predictions , true_labels = [], []
    # gradients are not needed for validation
    with torch.no_grad():
        for batch in loader_val:
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)
            final_loss+=outputs.loss.item()
            true_labels.extend(batch['labels'].detach().cpu().numpy().ravel())
            predictions.extend(np.argmax(outputs[1].detach().cpu().numpy(), axis=2).ravel())
    print(f'Validation loss: {final_loss/len(loader_val)}')
    print('Validation F1: {}'.format(f1_score(predictions, true_labels, average='macro')))
    print(f'Validation acc: {accuracy_score(predictions, true_labels)}')
    print('*'*20)
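
<div class="content">
The scheduler in the training cell reduces the learning rate linearly over the training steps. A minimal sketch of that behaviour, assuming the rate rises linearly during any warmup steps and then decays linearly to zero at the last step, is shown below. The function <code>linear_lr</code> is a hypothetical re-implementation for illustration, not the library's code.
</div>

```python
def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at a given step for a linear warmup/decay schedule."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_steps = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_steps)


# With 4 training steps and no warmup, the rate falls from 3e-5 to 0 in equal steps
schedule = [linear_lr(step, 3e-5, num_training_steps=4) for step in range(5)]
for step, lr in enumerate(schedule):
    print(step, lr)
```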

In [None]:
import numpy as np

# test the model
test_sentence = """
Mr. Trump’s tweets began just moments after a Fox News report by Mike Tobin, a 
reporter for the network, about protests in Minnesota and elsewhere. 
"""
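tokenized_sentence = tokenizer.encode(test_sentence)
# use the same device as the model instead of hard-coding .cuda()
input_ids = torch.tensor([tokenized_sentence]).to(device)
model.eval()
with torch.no_grad():
    output = model(input_ids)
label_indices = np.argmax(output[0].to('cpu').numpy(), axis=2)
# join WordPiece sub-tokens (prefixed with "##") back into words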
tokens = tokenizer.convert_ids_to_tokens(input_ids.to('cpu').numpy()[0])
new_tokens, new_labels = [], []

for token, label_idx in zip(tokens, label_indices[0]):
    if token.startswith("##"):
        new_tokens[-1] = new_tokens[-1] + token[2:]
    else:
        new_labels.append(label_idx)
        new_tokens.append(token)
        
for token, label in zip(new_tokens, new_labels):
    print("{}\t{}".format(model.config.id2label[label], token))

<a id =topic11> </a>
<div class="heading">
   <h1><span style="color: white">Other Useful Resources and References</span></h1>
</div>

* https://en.wikipedia.org/wiki/Named-entity_recognition
* https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da
* https://www.guru99.com/pos-tagging-chunking-nltk.html
* https://www.geeksforgeeks.org/nlp-extracting-named-entities/
* https://spacy.io/usage/linguistic-features#named-entities
* https://www.kaggle.com/abhishek/entity-extraction-model-using-bert-pytorch
* https://huggingface.co/course/chapter3/4?fw=pt
* https://www.depends-on-the-definition.com/named-entity-recognition-with-bert/
* https://www.lighttag.io/blog/sequence-labeling-with-transformers/example
* https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/token_classification.ipynb#scrollTo=545PP3o8IrJV
* https://discuss.huggingface.co/search?q=token%20classification