# Named Entity Recognition

1. Named entities are predefined categories chosen for the use case, such as names of people, organizations, places, codes, time expressions, and monetary values.

2. NER assigns a class to each token (usually a single word) in a sequence, which is why it is also called token classification.
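
To make the token-classification framing concrete, here is a minimal sketch that pairs each token with a tag and collects the tagged spans. It assumes the common BIO convention (`B-` opens an entity, `I-` continues it, `O` is outside any entity). The helper name `extract_entities` is hypothetical and not part of any library.

```python
def extract_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) spans."""
    entities = []
    current_words, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity; close any open one first.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_words and tag[2:] == current_type:
            # An I- tag extends the open entity of the same type.
            current_words.append(token)
        else:
            # "O" (or an I- tag with no matching open entity) closes the span.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities


tokens = ["Thousands", "of", "demonstrators", "have", "marched", "through", "London"]
tags = ["O", "O", "O", "O", "O", "O", "B-geo"]
print(extract_entities(tokens, tags))  # [('London', 'geo')]
```

In this sketch an `I-` tag that does not follow a matching `B-` tag is treated as outside any entity; other conventions are possible.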

In [1]:
!pip install simpletransformers


In [2]:
import pandas as pd

# Load the token-level dataset; the file is Latin-1 encoded.
data = pd.read_csv("ner_dataset.csv", encoding="latin1")

In [3]:
data.head(30)

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,,of,IN,O
2,,demonstrators,NNS,O
3,,have,VBP,O
4,,marched,VBN,O
5,,through,IN,O
6,,London,NNP,B-geo
7,,to,TO,O
8,,protest,VB,O
9,,the,DT,O


In [4]:
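# Only the first token of each sentence carries a "Sentence #" value;
# forward-fill so every row is labelled with its sentence.
data = data.ffill()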

In [5]:
data.head(30)

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,Sentence: 1,of,IN,O
2,Sentence: 1,demonstrators,NNS,O
3,Sentence: 1,have,VBP,O
4,Sentence: 1,marched,VBN,O
5,Sentence: 1,through,IN,O
6,Sentence: 1,London,NNP,B-geo
7,Sentence: 1,to,TO,O
8,Sentence: 1,protest,VB,O
9,Sentence: 1,the,DT,O
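
The effect of the forward fill can be seen on a small frame. This is a sketch on made-up toy data, not the real dataset; `DataFrame.ffill()` is used here and behaves the same as `fillna(method="ffill")`.

```python
import pandas as pd

# Toy data (assumed for illustration): only the first row of each
# sentence has a sentence label, as in the raw CSV.
toy = pd.DataFrame({
    "Sentence #": ["Sentence: 1", None, None, "Sentence: 2", None],
    "Word": ["Thousands", "of", "demonstrators", "Hello", "world"],
})

# Each missing label is replaced by the last label seen above it.
filled = toy.ffill()
print(filled["Sentence #"].tolist())
# ['Sentence: 1', 'Sentence: 1', 'Sentence: 1', 'Sentence: 2', 'Sentence: 2']
```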


In [6]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [7]:
# Replace the "Sentence: N" strings with integer sentence IDs.
data["Sentence #"] = LabelEncoder().fit_transform(data["Sentence #"])

In [8]:
data.head(30)

Unnamed: 0,Sentence #,Word,POS,Tag
0,0,Thousands,NNS,O
1,0,of,IN,O
2,0,demonstrators,NNS,O
3,0,have,VBP,O
4,0,marched,VBN,O
5,0,through,IN,O
6,0,London,NNP,B-geo
7,0,to,TO,O
8,0,protest,VB,O
9,0,the,DT,O
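
One detail of this encoding is worth seeing on a toy input: `LabelEncoder` sorts the string labels before numbering them, so the IDs are unique per sentence but do not follow the numeric sentence order. The sketch below uses three made-up labels for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels (assumed for illustration).
sentence_labels = ["Sentence: 1", "Sentence: 2", "Sentence: 10"]

encoder = LabelEncoder()
ids = encoder.fit_transform(sentence_labels)

# Strings sort lexicographically, so "Sentence: 10" comes before "Sentence: 2".
print(dict(zip(sentence_labels, ids.tolist())))
# {'Sentence: 1': 0, 'Sentence: 2': 2, 'Sentence: 10': 1}
```

Since the ID is only used to group tokens into sentences, uniqueness is what matters here.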


In [9]:
data.rename(
    columns={"Sentence #": "sentence_id", "Word": "words", "Tag": "labels"},
    inplace=True,
)

In [10]:
data["labels"] = data["labels"].str.upper()

In [11]:
X = data[["sentence_id", "words"]]
Y = data["labels"]

In [12]:
x_train, x_test, y_train, y_test = train_test_split(X,Y, test_size =0.2)

In [13]:
# Build the training and test DataFrames in the column layout simpletransformers expects.
train_data = pd.DataFrame(
    {"sentence_id": x_train["sentence_id"], "words": x_train["words"], "labels": y_train}
)
test_data = pd.DataFrame(
    {"sentence_id": x_test["sentence_id"], "words": x_test["words"], "labels": y_test}
)

In [14]:
train_data

Unnamed: 0,sentence_id,words,labels
425028,10443,a,O
275645,2910,Chiweshe,I-PER
170551,30589,donors,O
447415,11616,$,O
624686,20615,North,B-GEO
...,...,...,...
350820,6722,extending,O
557995,17250,Siad,I-PER
24749,1347,cuts,O
427380,10569,of,O
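
The `train_test_split` call above samples individual rows, so tokens from one sentence can end up in both `train_data` and `test_data`, as the shuffled index in the output suggests. A possible alternative, which is not what this notebook does, is to split by `sentence_id` so that each sentence stays whole. The sketch below uses scikit-learn's `GroupShuffleSplit` on a toy frame; the toy data and the variable names are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy frame (assumed for illustration) with four sentences.
toy = pd.DataFrame({
    "sentence_id": [0, 0, 0, 1, 1, 2, 2, 2, 3, 3],
    "words": ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"],
    "labels": ["O"] * 10,
})

# Split on groups: every row sharing a sentence_id goes to the same side.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(toy, groups=toy["sentence_id"]))

toy_train = toy.iloc[train_idx]
toy_test = toy.iloc[test_idx]

# No sentence appears on both sides of the split.
shared = set(toy_train["sentence_id"]) & set(toy_test["sentence_id"])
print(shared)  # set()
```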


# Model Training


In [15]:
from simpletransformers.ner import NERModel,NERArgs

In [16]:
label = data["labels"].unique().tolist()
label

['O',
 'B-GEO',
 'B-GPE',
 'B-PER',
 'I-GEO',
 'B-ORG',
 'I-ORG',
 'B-TIM',
 'B-ART',
 'I-ART',
 'I-PER',
 'I-GPE',
 'I-TIM',
 'B-NAT',
 'B-EVE',
 'I-EVE',
 'I-NAT']

In [17]:
args = NERArgs()
args.num_train_epochs = 4
args.learning_rate = 1e-4
args.overwrite_output_dir = True
args.train_batch_size = 32
args.eval_batch_size = 32


In [18]:
model = NERModel("bert", "bert-base-cased", labels=label, args=args)


Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForTokenClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cas


In [19]:
model.train_model(train_data, eval_data=test_data, acc=accuracy_score)

(4132, 0.12850200420129226)

In [20]:
result, model_outputs, preds_list = model.eval_model(test_data)


In [21]:
result


{'eval_loss': 0.2433706720188186,
 'f1_score': 0.7736200716845878,
 'precision': 0.7957137239480928,
 'recall': 0.7527201711150376}
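
As a quick consistency check on these numbers, the reported `f1_score` should be the harmonic mean of `precision` and `recall`. The sketch below recomputes it from the two values printed above, with nothing assumed beyond the standard F1 formula.

```python
# Values copied from the evaluation result above.
precision = 0.7957137239480928
recall = 0.7527201711150376

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.7736
```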

In [22]:
prediction, model_output = model.predict(["2021, Israel, Netanyahu continue to defy Biden with airstrikes amid more Hamas rocket fire"])


In [23]:
prediction

[[{'2021,': 'B-TIM'},
  {'Israel,': 'B-GEO'},
  {'Netanyahu': 'I-PER'},
  {'continue': 'O'},
  {'to': 'O'},
  {'defy': 'O'},
  {'Biden': 'B-PER'},
  {'with': 'O'},
  {'airstrikes': 'O'},
  {'amid': 'O'},
  {'more': 'O'},
  {'Hamas': 'B-ORG'},
  {'rocket': 'O'},
  {'fire': 'O'}]]
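
The prediction is a list (one entry per input sentence) of single-key dictionaries, which is awkward to scan. A minimal sketch for pulling out only the tagged words follows. The helper `tagged_words` is hypothetical, and the `prediction` literal is copied from the output above so the block runs without the model. Stripping trailing punctuation is an added convenience, because tokens such as `2021,` keep their comma when the input is split on whitespace.

```python
# Copied from the model output above.
prediction = [[
    {"2021,": "B-TIM"}, {"Israel,": "B-GEO"}, {"Netanyahu": "I-PER"},
    {"continue": "O"}, {"to": "O"}, {"defy": "O"}, {"Biden": "B-PER"},
    {"with": "O"}, {"airstrikes": "O"}, {"amid": "O"}, {"more": "O"},
    {"Hamas": "B-ORG"}, {"rocket": "O"}, {"fire": "O"},
]]


def tagged_words(sentence_prediction):
    """Return (word, tag) pairs for every token not tagged 'O'."""
    pairs = []
    for item in sentence_prediction:
        # Each item is a dict with exactly one word -> tag entry.
        (word, tag), = item.items()
        if tag != "O":
            pairs.append((word.strip(",.;:"), tag))
    return pairs


print(tagged_words(prediction[0]))
# [('2021', 'B-TIM'), ('Israel', 'B-GEO'), ('Netanyahu', 'I-PER'), ('Biden', 'B-PER'), ('Hamas', 'B-ORG')]
```

Note that `Netanyahu` is tagged `I-PER` without a preceding `B-PER`, so a strict BIO decoder would not treat it as the start of an entity.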