
# `Task 1`: NER using Transformer
Prepared by: Rebanta Aryal | RPALabs | 31st March, 2022

[Named Entity Recognition (NER)](https://en.wikipedia.org/wiki/Named-entity_recognition) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, monetary values, and percentages.

It aims to assign a class to each token (usually a single word) in a sequence. Because of this, NER is also referred to as token classification.

## **Simple Transformers**
Simple Transformers is built on top of the Transformers library by [Hugging Face](https://huggingface.co).

In [None]:
#install simple transformers
!pip install simpletransformers

Collecting simpletransformers
  Downloading simpletransformers-0.63.6-py3-none-any.whl (249 kB)

In [None]:
import pandas as pd

In [None]:
# Latin-1 maps every byte to one of the first 256 Unicode code points, so the file always decodes
data = pd.read_csv("/content/drive/MyDrive/archive/ner_datasetreference.csv", encoding="latin1")
data.head(10)

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,,of,IN,O
2,,demonstrators,NNS,O
3,,have,VBP,O
4,,marched,VBN,O
5,,through,IN,O
6,,London,NNP,B-geo
7,,to,TO,O
8,,protest,VB,O
9,,the,DT,O


### Preprocess the data by filling each NaN in `Sentence #` with the sentence label above it (Sentence: 1, and so on)

In [None]:
data = data.ffill()  # forward-fill: propagate the last valid observation down each column
data.head(30)

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,Sentence: 1,of,IN,O
2,Sentence: 1,demonstrators,NNS,O
3,Sentence: 1,have,VBP,O
4,Sentence: 1,marched,VBN,O
5,Sentence: 1,through,IN,O
6,Sentence: 1,London,NNP,B-geo
7,Sentence: 1,to,TO,O
8,Sentence: 1,protest,VB,O
9,Sentence: 1,the,DT,O
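
To make the forward-fill step concrete, here is a minimal sketch on a toy frame. The rows and the name `toy` are made up for illustration and are not taken from the dataset; only the column layout mirrors the table above.

```python
import pandas as pd

# Toy frame (illustrative only): the sentence label appears on the first
# token of each sentence, and the remaining tokens are missing it.
toy = pd.DataFrame({
    "Sentence #": ["Sentence: 1", None, None, "Sentence: 2", None],
    "Word": ["Thousands", "of", "demonstrators", "They", "marched"],
})

# Forward-fill copies the last seen label down to the rows below it.
toy["Sentence #"] = toy["Sentence #"].ffill()
print(toy)
```

After the fill, every token row carries the label of the sentence it belongs to, which is what the grouping by sentence relies on later.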


In [None]:
from sklearn.preprocessing import LabelEncoder

from sklearn.metrics import accuracy_score

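``LabelEncoder`` is used to convert the sentence column into integer IDs, because the model expects a numeric sentence identifier. The integers are assigned by the sorted order of the unique values, not at random, so every word of a sentence receives the same ID.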

In [None]:
data["Sentence #"]=LabelEncoder().fit_transform(data["Sentence #"])
data.head(30)

Unnamed: 0,Sentence #,Word,POS,Tag
0,0,Thousands,NNS,O
1,0,of,IN,O
2,0,demonstrators,NNS,O
3,0,have,VBP,O
4,0,marched,VBN,O
5,0,through,IN,O
6,0,London,NNP,B-geo
7,0,to,TO,O
8,0,protest,VB,O
9,0,the,DT,O
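
One detail worth seeing in isolation: because the sentence labels are strings, `LabelEncoder` sorts them lexicographically, so the resulting IDs do not follow the numeric sentence order. The sketch below uses three made-up labels and, as one possible alternative that is not part of the original notebook, `pd.factorize`, which numbers values in order of first appearance.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Three illustrative labels in document order.
sentences = pd.Series(["Sentence: 1", "Sentence: 2", "Sentence: 10"])

# LabelEncoder sorts the unique strings first:
# "Sentence: 1" < "Sentence: 10" < "Sentence: 2"
encoded = LabelEncoder().fit_transform(sentences)
print(encoded.tolist())   # [0, 2, 1]

# pd.factorize numbers values in order of first appearance instead.
codes, uniques = pd.factorize(sentences)
print(codes.tolist())     # [0, 1, 2]
```

Either way each sentence gets one unique ID, which is all the grouping needs; the difference only matters if the IDs are later used to look sentences up by their original number.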


Simple Transformers expects the columns to be named `sentence_id`, `words`, and `labels`, so we rename the columns of our dataset accordingly.

In [None]:
data.rename(columns={"Sentence #":"sentence_id","Word":"words","Tag":"labels"}, inplace=True)

Let the independent variables be ``sentence_id`` and ``words``, and the dependent variable be ``labels``.

In [None]:
X=data[["sentence_id","words"]]
y=data["labels"]

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test= train_test_split(X,y,test_size=0.2) 

In [None]:
#Let's create our train and test data
train_data=pd.DataFrame({"sentence_id":X_train["sentence_id"],"words":X_train["words"],"labels":y_train})
test_data=pd.DataFrame({"sentence_id":X_test["sentence_id"],"words":X_test["words"],"labels":y_test})

In [None]:
train_data.head(10)

Unnamed: 0,sentence_id,words,labels
524892,15564,.,O
137680,43824,suspect,O
919823,35585,50,O
676429,23237,say,O
750906,27030,reopen,O
946164,36932,compound,O
634835,21133,the,O
109626,42395,published,O
599926,19349,the,O
347660,6561,",",O


In [None]:
test_data.head(10)

Unnamed: 0,sentence_id,words,labels
134853,43682,city,O
292346,3766,is,O
241174,1166,by,O
433121,10857,with,O
682726,23549,'s,O
55454,16846,.,O
641562,21480,He,O
944671,36853,on,O
697843,24329,de,B-org
538432,16267,The,O
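
The two previews above show single tokens from unrelated sentences, because `train_test_split` shuffles individual rows: words of the same sentence can land in both sets, and their order is lost. If the intent is to keep each sentence whole, one option is to split by `sentence_id` instead. The following is a minimal sketch on toy data, assuming scikit-learn's `GroupShuffleSplit`; the toy frame and the names `train_toy` and `test_toy` are illustrative and not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy token table (illustrative only) with four short sentences.
toy = pd.DataFrame({
    "sentence_id": [0, 0, 0, 1, 1, 2, 2, 2, 3, 3],
    "words": ["I", "like", "Nepal", "Hello", "London",
              "She", "visited", "Paris", "Good", "morning"],
    "labels": ["O", "O", "B-geo", "O", "B-geo",
               "O", "O", "B-geo", "O", "O"],
})

# Split on whole groups: every row sharing a sentence_id goes to one side.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(toy, groups=toy["sentence_id"]))

train_toy = toy.iloc[train_idx]
test_toy = toy.iloc[test_idx]

print(sorted(train_toy["sentence_id"].unique()))
print(sorted(test_toy["sentence_id"].unique()))
```

With this kind of split no sentence is shared between training and evaluation data, so the evaluation scores describe sentences the model has not seen.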


### Model Training
Let's use the ``NERModel`` and ``NERArgs`` classes from the Simple Transformers library to build the model.

In [None]:
!pip install setuptools==59.5.0

Collecting setuptools==59.5.0
  Downloading setuptools-59.5.0-py3-none-any.whl (952 kB)

In [None]:
!pip install simpletransformers


Collecting simpletransformers
  Downloading simpletransformers-0.63.6-py3-none-any.whl (249 kB)

In [None]:
from simpletransformers.ner import NERModel,NERArgs

The ``unique labels`` of the NER dataset are collected into a list.

In [None]:
label = data["labels"].unique().tolist()
label

['O',
 'B-geo',
 'B-gpe',
 'B-per',
 'I-geo',
 'B-org',
 'I-org',
 'B-tim',
 'B-art',
 'I-art',
 'I-per',
 'I-gpe',
 'I-tim',
 'B-nat',
 'B-eve',
 'I-eve',
 'I-nat']
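
The `B-` and `I-` prefixes in this list follow the common BIO convention: `B-` marks the first token of an entity, `I-` marks a continuation of it, and `O` marks a token outside any entity. As a small sketch (the list is copied from the output above; `entity_types` and `missing` are names introduced here for illustration), the entity types can be recovered and checked like this:

```python
# Label list copied from the notebook output above.
label = ['O', 'B-geo', 'B-gpe', 'B-per', 'I-geo', 'B-org', 'I-org', 'B-tim',
         'B-art', 'I-art', 'I-per', 'I-gpe', 'I-tim', 'B-nat', 'B-eve',
         'I-eve', 'I-nat']

# Strip the B-/I- prefix to get the distinct entity types.
entity_types = sorted({tag.split("-", 1)[1] for tag in label if tag != "O"})
print(entity_types)

# Every type should have both a B- tag and an I- tag.
missing = [t for t in entity_types
           if f"B-{t}" not in label or f"I-{t}" not in label]
print(missing)
```

The 17 labels therefore break down into 8 entity types with two tags each, plus `O`.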

In [None]:
args = NERArgs()
args.num_train_epochs = 1
args.learning_rate = 1e-4
args.overwrite_output_dir = True
args.train_batch_size = 32
args.eval_batch_size = 32

In [None]:
model_NER=NERModel('bert','bert-base-cased',labels=label,args=args,use_cuda=True)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForTokenClassification: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cas


In [None]:
model_NER.train_model(train_data,eval_data = test_data,acc=accuracy_score)

(1499, 0.18992250606795405)

In [None]:
result, model_outputs, preds_list = model_NER.eval_model(test_data)

In [None]:
result

{'eval_loss': 0.1688306446871357,
 'precision': 0.8304639932815453,
 'recall': 0.759091636014713,
 'f1_score': 0.7931754758284177}
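
As a quick sanity check on these numbers, the sketch below assumes the reported F1 is the usual harmonic mean of precision and recall and recomputes it from the two values printed above.

```python
# Values copied from the `result` dictionary above.
precision = 0.8304639932815453
recall = 0.759091636014713

# F1 as the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.7932
```

The recomputed value agrees with the reported `f1_score`, and it shows that the lower recall (about 0.76) is what pulls the F1 below the precision.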

In [None]:
prediction, model_output = model_NER.predict(["What is the capital city of Nepal"])

In [None]:
prediction

[[{'What': 'O'},
  {'is': 'O'},
  {'the': 'O'},
  {'capital': 'O'},
  {'city': 'O'},
  {'of': 'O'},
  {'Nepal': 'B-gpe'}]]

In [None]:
prediction1, model_output = model_NER.predict(["My name is Rebanta and I live in Kathmandu"])

In [None]:
prediction1

[[{'My': 'O'},
  {'name': 'O'},
  {'is': 'O'},
  {'Rebanta': 'B-per'},
  {'and': 'O'},
  {'I': 'O'},
  {'live': 'O'},
  {'in': 'O'},
  {'Kathmandu': 'B-geo'}]]
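
The prediction is a list of sentences, each a list of one-item `{word: tag}` dictionaries. To turn that into a list of entities, the tags can be grouped into spans. This is a minimal sketch; `extract_entities` is a hypothetical helper written for this example, not a Simple Transformers function, and it simply drops an `I-` tag that does not continue a preceding entity of the same type.

```python
def extract_entities(sentence_prediction):
    """Group BIO-tagged tokens into (entity text, entity type) pairs."""
    entities = []
    current_words, current_type = [], None

    def flush():
        # Store the entity collected so far, if any.
        if current_words:
            entities.append((" ".join(current_words), current_type))

    for token in sentence_prediction:
        word, tag = next(iter(token.items()))
        if tag.startswith("B-"):
            flush()
            current_words, current_type = [word], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_words.append(word)
        else:
            flush()
            current_words, current_type = [], None
    flush()
    return entities


# Prediction copied from the output above.
prediction1 = [[{'My': 'O'}, {'name': 'O'}, {'is': 'O'}, {'Rebanta': 'B-per'},
                {'and': 'O'}, {'I': 'O'}, {'live': 'O'}, {'in': 'O'},
                {'Kathmandu': 'B-geo'}]]

entities = extract_entities(prediction1[0])
print(entities)  # [('Rebanta', 'per'), ('Kathmandu', 'geo')]
```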

In [None]:
#save the model
import pickle


In [None]:
NER_Task = 'NER_Task1.sav'
# use a context manager so the file handle is closed after writing
with open(NER_Task, 'wb') as f:
    pickle.dump(model_NER, f)