<a href="https://colab.research.google.com/github/sssanthosh107/AWS_Python_code/blob/master/BERT_Based_NER_using_CONLL_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This Colab notebook was created by [Pragnakalp Techlabs](https://www.pragnakalp.com/).

You can copy this Colab to your Drive and then execute the commands in the given order. For more details on BERT-based NER, see our blog post [BERT Based Named Entity Recognition (NER) Tutorial and Demo](https://www.pragnakalp.com/bert-named-entity-recognition-ner-tutorial-demo/).

You can also [purchase the Demo of our BERT based NER system including model fine-tuned on 5 datasets](https://www.pragnakalp.com/buy-bert-based-ner-model-code/).

Check all our [NLP Demos on demos.pragnakalp.com](https://demos.pragnakalp.com) 

## **BERT Fine-tuning and Prediction on CONLL 2003:**

---
>**OVERVIEW**:

>**Named Entity Recognition (NER)**, also known as entity extraction or entity chunking, is the process in which an algorithm extracts real-world named entities from text and classifies them into predefined categories such as person, place, time, and organization.

> **CONLL 2003** is a widely used benchmark dataset covering four types of named entities: persons, locations, organizations, and miscellaneous entities. The data files follow the BIO tagging scheme, and each line contains four columns separated by a single space: the word, its part-of-speech tag, its syntactic chunk tag, and its named-entity tag.
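
> To make the four-column layout concrete, here is a minimal sketch of reading CoNLL-style text into sentences of (word, NER tag) pairs. The `SAMPLE` string and the `read_conll` helper are illustrative assumptions, not part of the BERT-NER repository.

```python
# Hypothetical sample in the four-column layout described above:
# word, POS tag, chunk tag, NER tag (columns separated by one space).
SAMPLE = """\
-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
. . O O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""


def read_conll(text):
    """Return a list of sentences, each a list of (word, ner_tag) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        # Blank lines and -DOCSTART- markers separate sentences/documents.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split(" ")
        # First column is the word, last column is the NER tag.
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


sentences = read_conll(SAMPLE)
for sentence in sentences:
    print(sentence)
```

> The `B-` prefix marks the first word of an entity and `I-` marks a continuation, so "Peter Blackburn" is one PER entity spanning two words.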

>This Colab notebook shows how to fine-tune a BERT model on the **CONLL 2003** dataset using the **kamalkraj** GitHub repository, and then how to run prediction. Using this you can create your own Named Entity Recognition (NER) system.


### **Change Runtime to GPU**

> On the main menu, click on **Runtime** and select **Change runtime type**. Set "**GPU**" as the hardware accelerator.
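
> After switching the runtime, you may want to confirm that a GPU is actually visible. The sketch below uses only the standard library and assumes the `nvidia-smi` tool is on the PATH when a GPU runtime is active; the `gpu_visible` helper is a hypothetical name. Once `torch` is installed, `torch.cuda.is_available()` (the check used later in 'bert.py') is the more direct test.

```python
import shutil
import subprocess


def gpu_visible():
    """Return True if `nvidia-smi` exists and exits successfully."""
    if shutil.which("nvidia-smi") is None:
        return False
    result = subprocess.run(["nvidia-smi"],
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0


has_gpu = gpu_visible()
print("GPU runtime detected:", has_gpu)
```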


### **Clone Repository:**
>The first step is to clone the kamalkraj GitHub repository. The command below clones it:

In [1]:
!git clone https://github.com/kamalkraj/BERT-NER.git

Cloning into 'BERT-NER'...
remote: Enumerating objects: 246, done.
remote: Total 246 (delta 0), reused 0 (delta 0), pack-reused 246
Receiving objects: 100% (246/246), 1.67 MiB | 1.54 MiB/s, done.
Resolving deltas: 100% (125/125), done.


Use the "ls -l" command to verify that the repository was cloned properly.

In [2]:
ls -l

total 8
drwxr-xr-x 6 root root 4096 Jul 14 06:47 BERT-NER/
drwxr-xr-x 1 root root 4096 Jul 10 16:29 sample_data/


Now change into the 'BERT-NER' directory using the command below:

In [3]:
cd BERT-NER/

/content/BERT-NER


**BERT-NER files:** 
> Use 'ls -l' to check the contents of the BERT-NER folder. We will use the files and folders listed below for fine-tuning and prediction.


In [4]:
ls -l

total 100
-rw-r--r-- 1 root root   470 Jul 14 06:47 api.py
-rw-r--r-- 1 root root  4674 Jul 14 06:47 bert.py
drwxr-xr-x 3 root root  4096 Jul 14 06:47 cpp-app/
drwxr-xr-x 2 root root  4096 Jul 14 06:47 data/
drwxr-xr-x 2 root root  4096 Jul 14 06:47 img/
-rw-r--r-- 1 root root 34523 Jul 14 06:47 LICENSE.txt
-rw-r--r-- 1 root root  4777 Jul 14 06:47 README.md
-rw-r--r-- 1 root root   173 Jul 14 06:47 requirements.txt
-rw-r--r-- 1 root root 27023 Jul 14 06:47 run_ner.py


"requirements.txt" lists all the packages required for training and inference. Install them with the command below.

In [5]:
!pip3 install -r requirements.txt

Collecting pytorch-transformers==1.2.0
  Downloading https://files.pythonhosted.org/packages/a3/b7/d3d18008a67e0b968d1ab93ad444fc05699403fa662f634b2f2c318a508b/pytorch_transformers-1.2.0-py3-none-any.whl (176kB)
     |████████████████████████████████| 184kB 2.8MB/s 
Collecting torch==1.2.0
  Downloading https://files.pythonhosted.org/packages/30/57/d5cceb0799c06733eefce80c395459f28970ebb9e896846ce96ab579a3f1/torch-1.2.0-cp36-cp36m-manylinux1_x86_64.whl (748.8MB)
     |████████████████████████████████| 748.9MB 21kB/s 
Collecting seqeval==0.0.5
  Downloading https://files.pythonhosted.org/packages/dc/b6/6e58b54c0fa343f9c24969cb887f3e76c13d16dded640cc620a914f27dc4/seqeval-0.0.5-py3-none-any.whl
Collecting tqdm==4.31.1
  Downloading https://files.pythonhosted.org/packages/6c/4b/c38b5144cf167c4f52288517436ccafefe9dc01b8d1c190e18a6b154cd4a/tqdm-4.31.1-py2.py3-none-any.whl (48kB)
     |████████████████████████████████| 51kB 6.5MB/s 
Collecting nltk

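> Because the repository pins exact versions, it can be useful to check what actually got installed. The sketch below compares installed versions against the pins visible in the pip output above. It assumes Python 3.8+ for `importlib.metadata` (the original run used a Python 3.6 wheel, where a backport would be needed), and `check_pins` is a hypothetical helper name.

```python
from importlib import metadata

# Pins as shown in the pip output above; nltk is listed without a version.
PINS = {
    "pytorch-transformers": "1.2.0",
    "torch": "1.2.0",
    "seqeval": "0.0.5",
    "tqdm": "4.31.1",
    "nltk": None,
}


def check_pins(pins):
    """Map each package to 'ok', 'missing', or 'found <version>'."""
    report = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if wanted is None or installed == wanted:
            report[name] = "ok"
        else:
            report[name] = "found " + installed
    return report


report = check_pins(PINS)
for name, status in report.items():
    print(name, "->", status)
```
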
### **Fine-Tuning:**
> To fine-tune the **BERT base** model, run 'run_ner.py' with the command given below.

> The command takes the following arguments:
   
*   '--data_dir' is the directory that holds the dataset. Pass 'data/', which is the directory visible inside the 'BERT-NER' folder in the 'BERT-NER files' listing above.
*   '--bert_model' selects which **pretrained BERT base** model to download from Hugging Face Transformers. Hugging Face publishes several model names; here we select 'bert-base-multilingual-cased'.
*   '--task_name' is the task to perform. Enter 'ner', as we will train the model for Named Entity Recognition (NER).
*   '--output_dir' is where the fine-tuned model is stored. We use the directory name 'out_ner'.
*   For other arguments such as '--max_seq_length', '--num_train_epochs' and '--warmup_proportion', use the values suggested in the repository.
*   Pass '--do_train' to train the model and '--do_eval' to evaluate it afterwards.
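
> To illustrate what '--warmup_proportion' controls, here is a minimal sketch assuming a linear warmup followed by a linear decay to zero, a common schedule for BERT fine-tuning. The exact schedule and step count computed inside 'run_ner.py' are not shown in this notebook, so treat the function, the dataset size, and the batch size below as illustrative assumptions.

```python
def warmup_linear(step, total_steps, warmup_proportion=0.1):
    """Learning-rate multiplier in [0, 1] for a given optimizer step."""
    warmup_steps = int(total_steps * warmup_proportion)
    if step < warmup_steps:
        # Ramp up linearly from 0 to 1 during warmup.
        return step / max(1, warmup_steps)
    # Then decay linearly from 1 to 0 over the remaining steps.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


num_examples = 14000   # hypothetical training-set size
batch_size = 32        # hypothetical batch size
epochs = 5             # matches --num_train_epochs 5
total_steps = (num_examples // batch_size) * epochs

for step in (0, total_steps // 20, total_steps // 10, total_steps // 2, total_steps):
    print(step, round(warmup_linear(step, total_steps), 3))
```

> With a proportion of 0.1, the first tenth of the optimizer steps is spent ramping the learning rate up, which tends to stabilize the early updates of a pretrained model.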





    

In [6]:
!python run_ner.py --data_dir=data/ --bert_model=bert-base-multilingual-cased --task_name=ner --output_dir=out_ner --max_seq_length=128 --do_train --num_train_epochs 5 --do_eval --warmup_proportion=0.1

07/14/2020 06:52:42 - INFO - __main__ -   device: cuda n_gpu: 1, distributed training: False, 16-bits training: False
07/14/2020 06:52:43 - INFO - pytorch_transformers.file_utils -   https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-vocab.txt not found in cache or force_download set to True, downloading to /tmp/tmp6_4ig7jc
100% 995526/995526 [00:01<00:00, 914332.45B/s]
07/14/2020 06:52:45 - INFO - pytorch_transformers.file_utils -   copying /tmp/tmp6_4ig7jc to cache at /root/.cache/torch/pytorch_transformers/96435fa287fbf7e469185f1062386e05a075cadbf6838b74da22bf64b080bc32.99bcd55fc66f4f3360bc49ba472b940b8dcf223ea6a345deb969d607ca900729
07/14/2020 06:52:45 - INFO - pytorch_transformers.file_utils -   creating metadata file for /root/.cache/torch/pytorch_transformers/96435fa287fbf7e469185f1062386e05a075cadbf6838b74da22bf64b080bc32.99bcd55fc66f4f3360bc49ba472b940b8dcf223ea6a345deb969d607ca900729
07/14/2020 06:52:45 - INFO - pytorch_transformers.file_utils - 

### **Prediction:**

The 'pwd' command prints the current working directory. All the required files and folders are inside the 'BERT-NER' folder, so we use this command to verify that we are still there.

In [7]:
pwd

'/content/BERT-NER'
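
> The 'bert.py' file used for prediction reads a 'model_config.json' file from the fine-tuned model directory and expects the keys "label_map", "max_seq_length" and "do_lower". The sketch below shows one way to inspect such a file before predicting. The `load_label_map` helper and the sample values are assumptions made so the example runs anywhere; in this notebook you would pass "out_ner/" instead of the temporary directory.

```python
import json
import os
import tempfile


def load_label_map(model_dir, config_name="model_config.json"):
    """Read the config and return (label_map with int keys, max_seq_length)."""
    with open(os.path.join(model_dir, config_name)) as handle:
        config = json.load(handle)
    # JSON object keys are always strings, so convert them back to ints.
    label_map = {int(key): value for key, value in config["label_map"].items()}
    return label_map, config["max_seq_length"]


# Stand-in directory with made-up values for demonstration only.
with tempfile.TemporaryDirectory() as demo_dir:
    sample = {"label_map": {"1": "O", "2": "B-PER", "3": "I-PER"},
              "max_seq_length": 128, "do_lower": False}
    with open(os.path.join(demo_dir, "model_config.json"), "w") as handle:
        json.dump(sample, handle)
    label_map, max_seq_length = load_label_map(demo_dir)

print(label_map, max_seq_length)
```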

**Overwrite 'bert.py' files:**
> We have modified 'bert.py' so that each '**entity detected**' and its '**entity type**' are displayed more clearly for the sentence being tested.
>Overwrite 'bert.py' using the cell below.

In [8]:
%%writefile bert.py
"""BERT NER Inference."""

from __future__ import absolute_import, division, print_function

import json
import os

import torch
import torch.nn.functional as F
from nltk import word_tokenize
from pytorch_transformers import (BertConfig, BertForTokenClassification,
                                  BertTokenizer)


class BertNer(BertForTokenClassification):

    def forward(self, input_ids, token_type_ids=None, attention_mask=None, valid_ids=None):
        sequence_output = self.bert(input_ids, token_type_ids, attention_mask, head_mask=None)[0]
        batch_size, max_len, feat_dim = sequence_output.shape
        # Allocate on the same device as the encoder output instead of guessing.
        valid_output = torch.zeros(batch_size, max_len, feat_dim, dtype=torch.float32,
                                   device=sequence_output.device)
        # Pack the vectors at valid positions (first sub-token of each word,
        # plus [CLS] and [SEP]) to the front of each row.
        for i in range(batch_size):
            jj = -1
            for j in range(max_len):
                if valid_ids[i][j].item() == 1:
                    jj += 1
                    valid_output[i][jj] = sequence_output[i][j]
        sequence_output = self.dropout(valid_output)
        logits = self.classifier(sequence_output)
        return logits

class Ner:

    def __init__(self,model_dir: str):
        self.model , self.tokenizer, self.model_config = self.load_model(model_dir)
        self.label_map = self.model_config["label_map"]
        self.max_seq_length = self.model_config["max_seq_length"]
        self.label_map = {int(k):v for k,v in self.label_map.items()}
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = self.model.to(self.device)
        self.model.eval()

    def load_model(self, model_dir: str, model_config: str = "model_config.json"):
        model_config = os.path.join(model_dir,model_config)
        # Use a context manager so the config file handle is closed promptly.
        with open(model_config) as config_file:
            model_config = json.load(config_file)
        model = BertNer.from_pretrained(model_dir)
        tokenizer = BertTokenizer.from_pretrained(model_dir, do_lower_case=model_config["do_lower"])
        return model, tokenizer, model_config

    def tokenize(self, text: str):
        """Tokenize input into WordPiece tokens and mark each word's first piece."""
        words = word_tokenize(text)
        tokens = []
        valid_positions = []
        for word in words:
            token = self.tokenizer.tokenize(word)
            tokens.extend(token)
            # Only the first sub-token of each word receives a label.
            for piece_index in range(len(token)):
                valid_positions.append(1 if piece_index == 0 else 0)
        return tokens, valid_positions

    def preprocess(self, text: str):
        """ preprocess """
        tokens, valid_positions = self.tokenize(text)
        ## insert "[CLS]"
        tokens.insert(0,"[CLS]")
        valid_positions.insert(0,1)
        ## insert "[SEP]"
        tokens.append("[SEP]")
        valid_positions.append(1)
        segment_ids = []
        for i in range(len(tokens)):
            segment_ids.append(0)
        input_ids = self.tokenizer.convert_tokens_to_ids(tokens)
        input_mask = [1] * len(input_ids)
        while len(input_ids) < self.max_seq_length:
            input_ids.append(0)
            input_mask.append(0)
            segment_ids.append(0)
            valid_positions.append(0)
        return input_ids,input_mask,segment_ids,valid_positions

    def predict(self, text: str):
        input_ids,input_mask,segment_ids,valid_ids = self.preprocess(text)
        input_ids = torch.tensor([input_ids],dtype=torch.long,device=self.device)
        input_mask = torch.tensor([input_mask],dtype=torch.long,device=self.device)
        segment_ids = torch.tensor([segment_ids],dtype=torch.long,device=self.device)
        valid_ids = torch.tensor([valid_ids],dtype=torch.long,device=self.device)
        with torch.no_grad():
            logits = self.model(input_ids, segment_ids, input_mask,valid_ids)
        logits = F.softmax(logits,dim=2)
        logits_label = torch.argmax(logits,dim=2)
        logits_label = logits_label.detach().cpu().numpy().tolist()[0]

        logits_confidence = [values[label].item() for values,label in zip(logits[0],logits_label)]

        logits = []
        pos = 0
        for index,mask in enumerate(valid_ids[0]):
            if index == 0:
                continue
            if mask == 1:
                logits.append((logits_label[index-pos],logits_confidence[index-pos]))
            else:
                pos += 1
        logits.pop()

        labels = [(self.label_map[label],confidence) for label,confidence in logits]
        words = word_tokenize(text)
        assert len(labels) == len(words)

        # Group consecutive B-/I- tags of the same type into one entity span,
        # so that separate mentions of the same type are reported separately.
        type_names = {"PER": "Person", "LOC": "Location",
                      "ORG": "Organization", "MISC": "Miscellaneous Entity"}
        entities = []  # list of (words_in_span, entity_type)
        current_type = None
        for word, (label, confidence) in zip(words, labels):
            prefix, _, entity_type = label.partition("-")
            if prefix == "B" and entity_type in type_names:
                # A B- tag always starts a new entity.
                entities.append(([word], entity_type))
                current_type = entity_type
            elif prefix == "I" and entity_type == current_type:
                # An I- tag extends the entity that is currently open.
                entities[-1][0].append(word)
            else:
                current_type = None

        output = [' '.join(span) + ": " + type_names[entity_type]
                  for span, entity_type in entities]
        return output



Overwriting bert.py
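
> The `tokenize` method above marks only the first sub-token of each word as a valid position, so the model emits one label per word rather than one per WordPiece. The sketch below reproduces that alignment with a toy splitter. `toy_wordpiece` is a hypothetical stand-in that cuts words into fixed-size chunks; the real `BertTokenizer` splits according to its vocabulary and will produce different pieces.

```python
def toy_wordpiece(word, max_piece=4):
    """Hypothetical stand-in for a WordPiece tokenizer: fixed-size chunks."""
    pieces = [word[i:i + max_piece] for i in range(0, len(word), max_piece)]
    return [pieces[0]] + ["##" + piece for piece in pieces[1:]]


def align(words):
    """Return (tokens, valid_positions); 1 marks the first piece of each word."""
    tokens, valid_positions = [], []
    for word in words:
        pieces = toy_wordpiece(word)
        tokens.extend(pieces)
        valid_positions.extend([1] + [0] * (len(pieces) - 1))
    return tokens, valid_positions


tokens, valid_positions = align(["Bob", "lived", "in", "Florida"])
print(tokens)
print(valid_positions)
```

> The number of ones in `valid_positions` equals the number of input words, which is why `predict` can assert that it has exactly one label per word.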


Run the cell below to import 'nltk' and download its 'punkt' tokenizer data, which 'word_tokenize' in 'bert.py' needs in order to split the sentence into words.

In [9]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

Run the cell below for the final output:

>The import line loads the '**Ner**' class from the 'bert.py' file. The '**Ner**' class initializes our fine-tuned model.

>Store the loaded model in a variable. In the cell below we store it in the 'model' variable.

> Pass any text as a string for entity detection. The cell below stores a sample sentence in the 'text' variable.

> Use the '**predict**' method of the '**Ner**' class to detect entities in 'text' and store the result in 'output', which is a list of the detected entities.

> Loop over the 'output' list with a 'for' loop and print each 'prediction'.

In [24]:
# from Prediction_utils_vietnam_labels import Ner
from bert import Ner
model = Ner("out_ner/")

text= "Rahul company Every morning Jack and I usually play cricket and go to United states through metro train happily can you do that working in software company"
# text='நான் கூகுளில் வேளை பார்க்கிறேன் ஏன் பெயர் மார்க் சுலேகா மார்க் என்று ஆளைப்பார்கள்'
print("Text to predict Entity:", text)

output = model.predict(text)
for prediction in output:
    print(prediction)

Text to predict Entity: Rahul company Every morning Jack and I usually play cricket and go to United states through metro train happily can you do that working in software company
Jack: Person
United: Location
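
> Each prediction is a string of the form "entity: type", as in the output above. If you need the results in a structured form, a minimal sketch such as the following groups them by entity type. `group_by_type` is a hypothetical helper, and the input list is copied from the output shown above.

```python
from collections import defaultdict


def group_by_type(predictions):
    """Turn ["Jack: Person", ...] into {"Person": ["Jack"], ...}."""
    grouped = defaultdict(list)
    for item in predictions:
        # Split on the last ": " so entity text containing a colon survives.
        entity, _, entity_type = item.rpartition(": ")
        grouped[entity_type].append(entity)
    return dict(grouped)


grouped = group_by_type(["Jack: Person", "United: Location"])
print(grouped)
```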


In [None]:
from google.colab import drive
drive.mount("/content/drive", force_remount=True)

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


In [None]:
!cp -r /content/BERT-NER/out_ner/* /content/drive/My\ Drive/NER_Vietnam_finetune_model/
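
> The `cp` command above expects the destination folder to exist already. As an alternative, here is a minimal Python sketch that creates the destination if needed and copies the model files into it. It assumes Python 3.8+ for `dirs_exist_ok`; `backup_model` is a hypothetical helper, and temporary directories stand in for the real paths so the example runs anywhere. In Colab the arguments would be "/content/BERT-NER/out_ner" and the mounted Drive folder.

```python
import os
import shutil
import tempfile


def backup_model(src_dir, dst_dir):
    """Copy everything under src_dir into dst_dir, creating dst_dir if needed."""
    shutil.copytree(src_dir, dst_dir, dirs_exist_ok=True)  # Python 3.8+
    return sorted(os.listdir(dst_dir))


# Temporary stand-ins for the fine-tuned model directory and the Drive folder.
with tempfile.TemporaryDirectory() as root:
    src = os.path.join(root, "out_ner")
    os.makedirs(src)
    with open(os.path.join(src, "model_config.json"), "w") as handle:
        handle.write("{}")
    copied = backup_model(src, os.path.join(root, "drive", "backup"))

print(copied)
```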