<a href="https://colab.research.google.com/github/reban87/NLP-Projects/blob/main/NER_Recognition_Transformer01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[Named Entity Recognition (NER)](https://en.wikipedia.org/wiki/Named-entity_recognition) is a subtask of information extraction that seeks to locate named entities mentioned in unstructured text and classify them into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, monetary values, and percentages.

It aims to assign a class to each token (usually a single word) in a sequence. Because of this, NER is also referred to as token classification.
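To make "one class per token" concrete, here is a minimal sketch using the first tokens and tags of the dataset loaded later in this notebook. The tags follow the usual BIO convention: `B-` marks the beginning of an entity, `I-` a continuation, and `O` a token outside any entity. The variable names are illustrative only.

```python
# Tokens and tags copied from the first rows of the dataset preview in this notebook.
tokens = ["Thousands", "of", "demonstrators", "have", "marched",
          "through", "London", "to", "protest", "the"]
tags = ["O", "O", "O", "O", "O", "O", "B-geo", "O", "O", "O"]

# Token classification means exactly one label per token.
assert len(tokens) == len(tags)

# Keep only the tokens that carry an entity label (anything other than "O").
entity_tokens = [(tok, tag) for tok, tag in zip(tokens, tags) if tag != "O"]
print(entity_tokens)
```

Only `London` carries an entity tag here; every other token is labeled `O`.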

## **Simple Transformers**
Simple Transformers is built on top of the Transformers library by [Hugging Face](https://huggingface.co).

In [3]:
# Uncomment the line below to install simple transformers
#!pip install simpletransformers

In [4]:
import pandas as pd

In [8]:
# Read the CSV as Latin-1, a single-byte encoding that covers the first 256 Unicode code points
data = pd.read_csv("/content/drive/MyDrive/archive/ner_datasetreference.csv", encoding="latin1")
data.head(10)

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,,of,IN,O
2,,demonstrators,NNS,O
3,,have,VBP,O
4,,marched,VBN,O
5,,through,IN,O
6,,London,NNP,B-geo
7,,to,TO,O
8,,protest,VB,O
9,,the,DT,O



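Only the first token of each sentence carries a value in the ``Sentence #`` column; the rest are NaN. Let us preprocess the data by filling each NaN with the most recent sentence label above it ("Sentence: 1", and so on).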

In [10]:
# Propagate the last valid observation forward (forward fill).
# DataFrame.ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas.
data = data.ffill()
data.head(30)

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,Sentence: 1,of,IN,O
2,Sentence: 1,demonstrators,NNS,O
3,Sentence: 1,have,VBP,O
4,Sentence: 1,marched,VBN,O
5,Sentence: 1,through,IN,O
6,Sentence: 1,London,NNP,B-geo
7,Sentence: 1,to,TO,O
8,Sentence: 1,protest,VB,O
9,Sentence: 1,the,DT,O
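As a plain-Python sketch of what forward fill does to the ``Sentence #`` column: each missing entry takes the last value seen above it. `forward_fill` is a hypothetical helper written for illustration (it is not how pandas implements it), and `None` stands in for NaN.

```python
def forward_fill(values):
    """Replace each None with the most recent non-None value seen so far."""
    filled, last = [], None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled

# Toy column shaped like "Sentence #": a label on the first token only.
column = ["Sentence: 1", None, None, "Sentence: 2", None]
print(forward_fill(column))
```

A leading `None` (one with no value above it) stays `None` in this sketch, which matches the idea that there is nothing to propagate yet.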


In [11]:
from sklearn.preprocessing import LabelEncoder

from sklearn.metrics import accuracy_score

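``Label Encoding`` is used to convert the sentence column into integer IDs, because the model is given the sentence identifier in encoded form. The codes are not random: ``LabelEncoder`` numbers the unique strings in sorted order, so the IDs identify sentences but do not follow the original numeric order.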

In [12]:
data["Sentence #"]=LabelEncoder().fit_transform(data["Sentence #"])
data.head(30)

Unnamed: 0,Sentence #,Word,POS,Tag
0,0,Thousands,NNS,O
1,0,of,IN,O
2,0,demonstrators,NNS,O
3,0,have,VBP,O
4,0,marched,VBN,O
5,0,through,IN,O
6,0,London,NNP,B-geo
7,0,to,TO,O
8,0,protest,VB,O
9,0,the,DT,O
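The ordering of the encoded IDs can be sketched in plain Python. `encode_sorted` is a hypothetical stand-in that mimics the mapping `LabelEncoder` produces (index into the sorted unique values); `sentence_number` is an illustrative alternative that keeps the numeric order by parsing the integer out of the string. Neither function is part of the original notebook.

```python
import re

def encode_sorted(values):
    """Mimic LabelEncoder's mapping: index into the sorted unique values."""
    classes = sorted(set(values))
    lookup = {value: index for index, value in enumerate(classes)}
    return [lookup[value] for value in values]

def sentence_number(label):
    """Alternative: parse the integer out of a 'Sentence: N' string."""
    return int(re.search(r"\d+", label).group())

ids = ["Sentence: 1", "Sentence: 2", "Sentence: 10"]
print(encode_sorted(ids))                    # string order: "Sentence: 10" sorts before "Sentence: 2"
print([sentence_number(s) for s in ids])     # numeric order is preserved
```

Either scheme works for grouping tokens into sentences, since all that matters is that tokens of the same sentence share an ID.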


Simple Transformers expects the columns ``sentence_id``, ``words`` and ``labels``, which differ from the column names in our dataset, so we rename the columns as below.

In [13]:
data.rename(columns={"Sentence #":"sentence_id","Word":"words","Tag":"labels"}, inplace=True)

Let the independent variables be ``sentence_id`` and ``words``, and the dependent variable be ``labels``.

In [14]:
X=data[["sentence_id","words"]]
y=data["labels"]

In [15]:
from sklearn.model_selection import train_test_split
# Note: this samples individual token rows, so tokens of one sentence can land in both sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [16]:
#Let's create our train and test data
train_data=pd.DataFrame({"sentence_id":X_train["sentence_id"],"words":X_train["words"],"labels":y_train})
test_data=pd.DataFrame({"sentence_id":X_test["sentence_id"],"words":X_test["words"],"labels":y_test})

In [17]:
train_data.head(10)

Unnamed: 0,sentence_id,words,labels
795140,29276,ripped,O
686685,23754,who,O
283003,3291,31,I-tim
981597,38758,Iraq,B-geo
865932,32854,the,O
874797,33299,until,O
546538,16655,primary,O
81733,30224,Iraqi,B-gpe
139820,43931,troops,O
402136,9307,the,O


In [18]:
test_data.head(10)

Unnamed: 0,sentence_id,words,labels
830791,31084,sentences,O
547526,16705,.,O
815018,30287,Liberia,B-geo
160203,44965,Fund-Global,I-org
619670,20366,DRC,O
681892,23507,at,O
906877,34927,top,O
881288,33631,resulted,O
21248,47516,talks,O
56530,17335,in,O
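The previews above show tokens from unrelated sentences, because the split is made row by row. If whole sentences should stay together, one option is to split on the sentence IDs instead. The following is a minimal sketch of that idea, not the approach used in this notebook; `split_sentence_ids`, its parameters, and the toy data are all hypothetical.

```python
import random

def split_sentence_ids(sentence_ids, test_size=0.2, seed=42):
    """Assign whole sentences to train or test so no sentence is split across both."""
    unique_ids = sorted(set(sentence_ids))
    rng = random.Random(seed)          # fixed seed for a reproducible split
    rng.shuffle(unique_ids)
    n_test = int(round(len(unique_ids) * test_size))
    test_ids = set(unique_ids[:n_test])
    train_ids = set(unique_ids[n_test:])
    return train_ids, test_ids

# Toy column: 10 sentences with 3 tokens each (illustrative values only).
toy_sentence_ids = [sid for sid in range(10) for _ in range(3)]
train_ids, test_ids = split_sentence_ids(toy_sentence_ids)
print(len(train_ids), len(test_ids))
```

With the DataFrame from this notebook, the resulting sets could be applied as `data[data["sentence_id"].isin(train_ids)]` and `data[data["sentence_id"].isin(test_ids)]`. scikit-learn's `GroupShuffleSplit` offers a similar group-aware split.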


### Model Training
Let's use the Simple Transformers library, with ``NERModel`` and ``NERArgs``, to build the model.

In [24]:
# Uncomment for installation
# !pip install setuptools==59.5.0
# !pip install simpletransformers
from simpletransformers.ner import NERModel,NERArgs

The ``unique labels`` in the NER dataset are collected into a list.

In [25]:
label = data["labels"].unique().tolist()
label

['O',
 'B-geo',
 'B-gpe',
 'B-per',
 'I-geo',
 'B-org',
 'I-org',
 'B-tim',
 'B-art',
 'I-art',
 'I-per',
 'I-gpe',
 'I-tim',
 'B-nat',
 'B-eve',
 'I-eve',
 'I-nat']

In [26]:
args = NERArgs()
args.num_train_epochs = 1          # a single pass over the training data
args.learning_rate = 1e-4
args.overwrite_output_dir = True   # allow reusing an existing output directory
args.train_batch_size = 32
args.eval_batch_size = 32

In [37]:
# Use the pretrained bert-base-cased checkpoint for training
model_NER = NERModel('bert', 'bert-base-cased', labels=label, args=args, use_cuda=True)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForTokenClassification: ['cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cas

In [38]:
# acc is an extra metric; it is only computed if args.evaluate_during_training is enabled
model_NER.train_model(train_data, eval_data=test_data, acc=accuracy_score)

(1499, 0.19330047081666363)
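The tuple above appears to hold the number of optimizer steps (1499) and the average training loss. As a rough sanity check, the step count can be related to the batch size of 32 set earlier. This is a sketch under two assumptions that are not stated in the notebook: one optimizer step per batch (no gradient accumulation) and a final partial batch that is kept rather than dropped. Both helper functions are hypothetical.

```python
import math

def steps_per_epoch(num_sequences, batch_size):
    """Number of batches needed to cover all sequences once."""
    return math.ceil(num_sequences / batch_size)

def sequence_count_range(steps, batch_size):
    """Smallest and largest number of sequences consistent with a step count."""
    return (steps - 1) * batch_size + 1, steps * batch_size

low, high = sequence_count_range(steps=1499, batch_size=32)
print(low, high)
```

Under those assumptions, the training set was grouped into somewhere between `low` and `high` sequences, which counts sentence groups rather than individual token rows.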

In [39]:
result, model_outputs, preds_list = model_NER.eval_model(test_data)

In [40]:
result

{'eval_loss': 0.16965139810503305,
 'f1_score': 0.7940279606085348,
 'precision': 0.8309559515920901,
 'recall': 0.7602425015954052}
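The reported F1 score is the harmonic mean of precision and recall, which can be checked directly from the values above. This is a small arithmetic sketch, not part of the library's API.

```python
# Values copied from the evaluation result above.
precision = 0.8309559515920901
recall = 0.7602425015954052

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f1)
```

The computed value agrees with the reported `f1_score` to several decimal places. Because precision is higher than recall, the model misses entities more often than it predicts spurious ones.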

In [41]:
prediction, model_output = model_NER.predict(["What is the capital city of Nepal"])

In [42]:
prediction

[[{'What': 'O'},
  {'is': 'O'},
  {'the': 'O'},
  {'capital': 'O'},
  {'city': 'O'},
  {'of': 'O'},
  {'Nepal': 'B-gpe'}]]

In [43]:
prediction1, model_output = model_NER.predict(["My name is Rebanta and I live in Kathmandu"])

In [44]:
prediction1

[[{'My': 'O'},
  {'name': 'O'},
  {'is': 'O'},
  {'Rebanta': 'B-per'},
  {'and': 'O'},
  {'I': 'O'},
  {'live': 'O'},
  {'in': 'O'},
  {'Kathmandu': 'B-geo'}]]
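The prediction is a list (one entry per input sentence) of single-key dictionaries mapping each word to its tag. A downstream application usually wants the entities themselves, so here is a minimal sketch that groups `B-`/`I-` tagged words into entity spans. `extract_entities` is a hypothetical helper, not part of Simple Transformers, and it simply ignores a stray `I-` tag that does not continue a matching entity.

```python
def extract_entities(sentence_prediction):
    """Group B-/I- tagged words into (entity_text, entity_type) pairs."""
    entities, current_words, current_type = [], [], None
    for item in sentence_prediction:
        (word, tag), = item.items()    # each dict holds exactly one word: tag pair
        if tag.startswith("B-"):
            # A new entity starts; flush any entity still being built.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [word], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the entity currently being built.
            current_words.append(word)
        else:
            # "O" or a stray "I-" tag: close any open entity.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities

# Prediction copied from the output above.
prediction1 = [[{'My': 'O'}, {'name': 'O'}, {'is': 'O'}, {'Rebanta': 'B-per'},
                {'and': 'O'}, {'I': 'O'}, {'live': 'O'}, {'in': 'O'},
                {'Kathmandu': 'B-geo'}]]
print(extract_entities(prediction1[0]))
```

For the sentence above this yields one person entity and one geographic entity.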

In [45]:
#save the model
import pickle


In [46]:
NER_Task = 'NER_Task1.sav'
# Use a context manager so the file handle is closed after writing
with open(NER_Task, 'wb') as f:
    pickle.dump(model_NER, f)