# Hindi NER
NER(Named Entity Recognition) is a token classification task, Where we classify the given token into several given categories. Here I have used 2-Layer transformer architecture to generate the contextual embeddings the we simply apply softmax for multiclass classification.

## Import Required libraries

In [2]:
from datasets import load_dataset
from tokenizers import decoders,models,pre_tokenizers,processors,trainers,Tokenizer,normalizers
from torch.utils.data import Dataset,DataLoader
import torch
import torch.nn as nn
import torch.nn.functional as F

## Load dataset
Here we are using <b>cfilt/HiNER-collapsed</b>. It was released by CFILT lab <b>IIT Bombay</b> in 2022. It has text and their NER tags as following
- 0-B-loc 
- 3-I-loc 
- 2-B-per  
- 5-I-per 
- 1-B-org  
- 4-I-org 
- 6-O \
the above given ordering is correct and the ordering given in the dataset library is incorrect. This dataset has 3 splits. 
- train-75.8k training exaples
- test-21.7k training examples
- validation -10.9k training examples \
we also have full dataset <b>cfilt/HiNER</b>. It has 23 categories. 

In [3]:
ds = load_dataset("cfilt/HiNER-collapsed")

Downloading data:   0%|          | 0.00/7.10M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.02M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/2.02M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/75827 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10851 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/21657 [00:00<?, ? examples/s]

In [4]:
print(ds)

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 75827
    })
    validation: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 10851
    })
    test: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 21657
    })
})


In [5]:
train=ds['train'].data.to_pandas()

In [6]:
print (train)

          id                                             tokens  \
0          0  [इस, क़ानून, का, कई, संगठनों, ने, विरोध, किया,...   
1          1  [देर, रात, तक, जुहू, चौपाटी, में, यह, नजारा, आ...   
2          2  [रामनगर, इगलास, ,, अलीगढ़, ,, उत्तर, प्रदेश, स...   
3          3  [पहाड़िया, आदिवासी, विद्रोह, १७७२-८०, के, मध्य...   
4          4  [दूसरे, मुख्य, टाँके, महल, परिसर, में, स्थित, ...   
...      ...                                                ...   
75822  75822                              [मछुआरों, की, नाव, .]   
75823  75823  [चरणदासियों, का, मंदिर, भी, बहुत, पुराना, है, ...   
75824  75824  [भैडियाणा,, गैरसैण, तहसील, में, भारत, के, उत्त...   
75825  75825  [ज़िन्दगी, (, अंग्रेज़ी:, Life), 1976, में, बन...   
75826  75826  [जब, बर्फ़, थोड़ी, कठोर, हो, और, पैर, से, दबा,...   

                                                ner_tags  
0      [6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...  
1                      [6, 6, 6, 0, 3, 6, 6, 6, 6, 6, 6]  
2                 

In [7]:
train.head()

Unnamed: 0,id,tokens,ner_tags
0,0,"[इस, क़ानून, का, कई, संगठनों, ने, विरोध, किया,...","[6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ..."
1,1,"[देर, रात, तक, जुहू, चौपाटी, में, यह, नजारा, आ...","[6, 6, 6, 0, 3, 6, 6, 6, 6, 6, 6]"
2,2,"[रामनगर, इगलास, ,, अलीगढ़, ,, उत्तर, प्रदेश, स...","[0, 0, 6, 0, 6, 0, 3, 6, 6, 6, 6]"
3,3,"[पहाड़िया, आदिवासी, विद्रोह, १७७२-८०, के, मध्य...","[6, 6, 6, 6, 6, 6, 6, 0, 6, 6, 6, 6, 6]"
4,4,"[दूसरे, मुख्य, टाँके, महल, परिसर, में, स्थित, ...","[6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6]"


In [8]:
features=ds['train'].features['ner_tags'].feature.names

In [9]:
len(features)

7

In [10]:
print(features)

['O', 'B-PER', 'I-PER', 'B-LOC', 'I-LOC', 'B-ORG', 'I-ORG']


In [11]:
validation=ds['validation'].data.to_pandas()
validation.head()

Unnamed: 0,id,tokens,ner_tags
0,0,"[अमरीकी, सेना, के, प्रवक्ता, कैप्टन, विलियम, प...","[1, 4, 6, 6, 6, 2, 5, 6, 1, 6, 6, 6, 6, 6, 6, ..."
1,1,"[अम्दौं, N.Z.A., ,, धारी, तहसील, में, भारत, के...","[0, 3, 6, 0, 6, 6, 0, 6, 0, 6, 6, 6, 0, 6, 6, ..."
2,2,"[उनके, ज्योतिषी, ने, सलाह, दी, है, कि, अगर, उन...","[6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ..."
3,3,"[सतपाली-रिंग०-१,, चौबटाखाल, तहसील, में, भारत, ...","[0, 0, 6, 6, 0, 6, 0, 6, 6, 6, 0, 6, 6, 0, 6, ..."
4,4,"[बीसीसीआई, की, इस, राहत, का, अर्थ, है, कि, अब,...","[1, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ..."


In [12]:
test=ds['test'].data.to_pandas()
test.head()

Unnamed: 0,id,tokens,ner_tags
0,0,"[स्वाइन, फ्लू:, कहाँ-कहाँ, .]","[6, 6, 6, 6]"
1,1,"[काश, मधुबाला, को, लंबी, उम्र, मिली, होती, .]","[6, 2, 6, 6, 6, 6, 6, 6]"
2,2,"[परभूपुर, भारत, के, उत्तर, प्रदेश, राज्य, के, ...","[0, 0, 6, 0, 3, 6, 6, 0, 6, 6, 0, 6, 6, 6, 6, ..."
3,3,"[७२, हेक्टेयर, भूमि, निर्मित, यह, राजप्रासाद, ...","[6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ..."
4,4,"[सुर्ख़ियो, में, .]","[6, 6, 6]"


In [13]:
text=""
for txt in train['tokens']:
    text+=" ".join(txt)+"\n"

In [14]:
for txt in validation['tokens']:
    text+=" ".join(txt)+"\n"

In [15]:
for txt in test['tokens']:
    text+=" ".join(txt)+"\n"

In [16]:
len(text)

11077784

In [17]:
with open("./data.txt", "w",encoding='utf-8') as f:
    f.write(text)

## Tokenizer
Here we are using <b>Byte-Pair-Encoding Tokenizer</b> for tokeninzing our text. I have used tokenizers library to make the tokenizer with vocab size 30000.

In [19]:
bpe_tokenizer=Tokenizer(model=models.BPE(unk_token="[UNK]"))

In [20]:
bpe_tokenizer.normalizer=normalizers.NFC()

In [21]:
bpe_tokenizer.pre_tokenizer=pre_tokenizers.Sequence([pre_tokenizers.Whitespace(),pre_tokenizers.Punctuation()])

In [22]:
bpe_trainer=trainers.BpeTrainer(vocab_size=30000,min_frequency=2,special_tokens=["[UNK]","[PAD]","[CLS]","[SEP]"])

In [23]:
bpe_tokenizer.train(files=['./data.txt'],trainer=bpe_trainer)

In [24]:
bpe_tokenizer.save("bpe_tokenizer.json")

In [25]:
bpe_tokenizer.encode("स्वाइन फ्लू: कहाँ-कहाँ").ids

[3684, 7762, 26, 4413, 13, 4413]

In [26]:
bpe_tokenizer.encode(["स्वाइन", "फ्लू:", "कहाँ-कहाँ", "."],is_pretokenized=True).tokens

['स्वाइन', 'फ्लू', ':', 'कहाँ', '-', 'कहाँ', '.']

In [27]:
bpe_tokenizer.encode(["स्वाइन", "फ्लू:", "कहाँ-कहाँ", "."],is_pretokenized=True).word_ids

[0, 1, 1, 2, 2, 2, 3]

In [28]:
bpe_tokenizer.decode([3684, 7762, 26, 4413, 13, 4413])

'स्वाइन फ्लू : कहाँ - कहाँ'

In [29]:
tags=ds['train'].features['ner_tags'].feature

In [30]:
tags.int2str(1)

'B-PER'

In [31]:
def tokenize_words(example):
    return bpe_tokenizer.encode(example,is_pretokenized=True).ids

In [32]:
train['token_ids']=train['tokens'].apply(tokenize_words)
test['token_ids']=test['tokens'].apply(tokenize_words)
validation['token_ids']=validation['tokens'].apply(tokenize_words)

In [33]:
train.head()

Unnamed: 0,id,tokens,ner_tags,token_ids
0,0,"[इस, क़ानून, का, कई, संगठनों, ने, विरोध, किया,...","[6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...","[995, 2312, 935, 1285, 4190, 937, 2126, 1072, ..."
1,1,"[देर, रात, तक, जुहू, चौपाटी, में, यह, नजारा, आ...","[6, 6, 6, 0, 3, 6, 6, 6, 6, 6, 6]","[3140, 1738, 1149, 19612, 18475, 934, 1103, 73..."
2,2,"[रामनगर, इगलास, ,, अलीगढ़, ,, उत्तर, प्रदेश, स...","[0, 0, 6, 0, 6, 0, 3, 6, 6, 6, 6]","[4290, 3791, 12, 1595, 12, 1086, 1033, 1021, 9..."
3,3,"[पहाड़िया, आदिवासी, विद्रोह, १७७२-८०, के, मध्य...","[6, 6, 6, 6, 6, 6, 6, 0, 6, 6, 6, 6, 6]","[2472, 2191, 7071, 4250, 4695, 16342, 13, 8897..."
4,4,"[दूसरे, मुख्य, टाँके, महल, परिसर, में, स्थित, ...","[6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6]","[1725, 1455, 15391, 4025, 4172, 934, 1021, 972..."


In [34]:
def get_word_ids(example):
    return bpe_tokenizer.encode(example,is_pretokenized=True).word_ids

In [35]:
train['word_ids']=train['tokens'].apply(get_word_ids)
test['word_ids']=test['tokens'].apply(get_word_ids)
validation['word_ids']=validation['tokens'].apply(get_word_ids)

In [36]:
train.head()

Unnamed: 0,id,tokens,ner_tags,token_ids,word_ids
0,0,"[इस, क़ानून, का, कई, संगठनों, ने, विरोध, किया,...","[6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...","[995, 2312, 935, 1285, 4190, 937, 2126, 1072, ...","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,..."
1,1,"[देर, रात, तक, जुहू, चौपाटी, में, यह, नजारा, आ...","[6, 6, 6, 0, 3, 6, 6, 6, 6, 6, 6]","[3140, 1738, 1149, 19612, 18475, 934, 1103, 73...","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]"
2,2,"[रामनगर, इगलास, ,, अलीगढ़, ,, उत्तर, प्रदेश, स...","[0, 0, 6, 0, 6, 0, 3, 6, 6, 6, 6]","[4290, 3791, 12, 1595, 12, 1086, 1033, 1021, 9...","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10]"
3,3,"[पहाड़िया, आदिवासी, विद्रोह, १७७२-८०, के, मध्य...","[6, 6, 6, 6, 6, 6, 6, 0, 6, 6, 6, 6, 6]","[2472, 2191, 7071, 4250, 4695, 16342, 13, 8897...","[0, 0, 1, 2, 3, 3, 3, 3, 4, 5, 6, 7, 8, 9, 10,..."
4,4,"[दूसरे, मुख्य, टाँके, महल, परिसर, में, स्थित, ...","[6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6]","[1725, 1455, 15391, 4025, 4172, 934, 1021, 972...","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 11, 12,..."


In [37]:
print(train.iloc[10,4])
print(train.iloc[10,2])
print(len(train.iloc[10,4]))
print(len(train.iloc[10,2]))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 11, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 40]
[6 6 6 6 1 6 2 5 0 3 6 6 1 6 2 5 5 0 6 0 6 6 6 2 5 0 3 6 6 6 6 6 2 5 0 3 6
 0 6 6 6]
45
41


since we have used byte pair encoding tokenizer to encode the text which can split the words in several different subwords. So our NER tags cab be mismatched in this case. we will use same tag for these subwords.

In [38]:
def align_ner_tags(ner_tags,word_ids):
  adjusted_tag=[]
  for i in word_ids:
    adjusted_tag.append(ner_tags[i])
  return adjusted_tag

In [39]:
adj=align_ner_tags(train.iloc[10,2],train.iloc[10,4])
print(adj)

[6, 6, 6, 6, 1, 6, 2, 5, 0, 3, 6, 6, 6, 6, 1, 6, 2, 5, 5, 0, 6, 0, 6, 6, 6, 2, 5, 0, 3, 6, 6, 6, 6, 6, 6, 2, 5, 0, 3, 6, 0, 6, 6, 6, 6]


In [40]:
print(len(adj))

45


In [41]:
adjusted_tags=[]
for i in range(len(train)):
  adjusted_tags.append(align_ner_tags(train.iloc[i,2],train.iloc[i,4]))

In [42]:
train['adjusted_tags']=adjusted_tags

In [43]:
train.head()

Unnamed: 0,id,tokens,ner_tags,token_ids,word_ids,adjusted_tags
0,0,"[इस, क़ानून, का, कई, संगठनों, ने, विरोध, किया,...","[6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...","[995, 2312, 935, 1285, 4190, 937, 2126, 1072, ...","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ..."
1,1,"[देर, रात, तक, जुहू, चौपाटी, में, यह, नजारा, आ...","[6, 6, 6, 0, 3, 6, 6, 6, 6, 6, 6]","[3140, 1738, 1149, 19612, 18475, 934, 1103, 73...","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]","[6, 6, 6, 0, 3, 6, 6, 6, 6, 6, 6]"
2,2,"[रामनगर, इगलास, ,, अलीगढ़, ,, उत्तर, प्रदेश, स...","[0, 0, 6, 0, 6, 0, 3, 6, 6, 6, 6]","[4290, 3791, 12, 1595, 12, 1086, 1033, 1021, 9...","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10]","[0, 0, 6, 0, 6, 0, 3, 6, 6, 6, 6, 6]"
3,3,"[पहाड़िया, आदिवासी, विद्रोह, १७७२-८०, के, मध्य...","[6, 6, 6, 6, 6, 6, 6, 0, 6, 6, 6, 6, 6]","[2472, 2191, 7071, 4250, 4695, 16342, 13, 8897...","[0, 0, 1, 2, 3, 3, 3, 3, 4, 5, 6, 7, 8, 9, 10,...","[6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 0, 6, 6, 6, ..."
4,4,"[दूसरे, मुख्य, टाँके, महल, परिसर, में, स्थित, ...","[6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6]","[1725, 1455, 15391, 4025, 4172, 934, 1021, 972...","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 11, 12,...","[6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6]"


In [44]:
adjusted_tags=[]
for i in range(len(test)):
  adjusted_tags.append(align_ner_tags(test.iloc[i,2],test.iloc[i,4]))

In [45]:
test['adjusted_tags']=adjusted_tags

In [46]:
adjusted_tags=[]
for i in range(len(validation)):
  adjusted_tags.append(align_ner_tags(validation.iloc[i,2],validation.iloc[i,4]))

In [47]:
validation['adjusted_tags']=adjusted_tags


In [48]:
train.head()

Unnamed: 0,id,tokens,ner_tags,token_ids,word_ids,adjusted_tags
0,0,"[इस, क़ानून, का, कई, संगठनों, ने, विरोध, किया,...","[6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...","[995, 2312, 935, 1285, 4190, 937, 2126, 1072, ...","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ..."
1,1,"[देर, रात, तक, जुहू, चौपाटी, में, यह, नजारा, आ...","[6, 6, 6, 0, 3, 6, 6, 6, 6, 6, 6]","[3140, 1738, 1149, 19612, 18475, 934, 1103, 73...","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]","[6, 6, 6, 0, 3, 6, 6, 6, 6, 6, 6]"
2,2,"[रामनगर, इगलास, ,, अलीगढ़, ,, उत्तर, प्रदेश, स...","[0, 0, 6, 0, 6, 0, 3, 6, 6, 6, 6]","[4290, 3791, 12, 1595, 12, 1086, 1033, 1021, 9...","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10]","[0, 0, 6, 0, 6, 0, 3, 6, 6, 6, 6, 6]"
3,3,"[पहाड़िया, आदिवासी, विद्रोह, १७७२-८०, के, मध्य...","[6, 6, 6, 6, 6, 6, 6, 0, 6, 6, 6, 6, 6]","[2472, 2191, 7071, 4250, 4695, 16342, 13, 8897...","[0, 0, 1, 2, 3, 3, 3, 3, 4, 5, 6, 7, 8, 9, 10,...","[6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 0, 6, 6, 6, ..."
4,4,"[दूसरे, मुख्य, टाँके, महल, परिसर, में, स्थित, ...","[6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6]","[1725, 1455, 15391, 4025, 4172, 934, 1021, 972...","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 11, 12,...","[6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6]"


## Define the dataset
Here I am using max sequence length 512 so will pad the text with padding token which has index 1 in the tokenizer.

In [50]:
class NERDataset(Dataset):
  def __init__(self,word_ids,token_tags):
    super(NERDataset,self).__init__()
    self.ids=word_ids
    self.tags=token_tags
  def __len__(self):
    return len(self.ids)
  def __getitem__(self,idx):
    if len(self.ids[idx])<512:
      padding=[1]*(512-len(self.ids[idx]))
      padding2=[-1]*(512-len(self.tags[idx]))
      self.ids[idx]+=padding
      self.tags[idx]+=padding2
    return torch.tensor(self.ids[idx]),torch.tensor(self.tags[idx])

In [51]:
train_dataset=NERDataset(train['token_ids'].to_list(),train['adjusted_tags'].to_list())

In [52]:
train_dataset[0]

(tensor([  995,  2312,   935,  1285,  4190,   937,  2126,  1072,  1031,   961,
          1058,  1031,   944,   995,  2312,   935,  8589,  2788,  5012,   930,
           944,  1365,  2509,  1334,  1000,  2109,   950,   422, 18429,   989,
          1494,   930,    14,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,  

In [None]:
# this function generates the padding mask for single example.
def generate_padding_mask(batch):
  batch_size,seq_length=batch.shape
  mask=torch.full((batch_size,seq_length),False) 
  indices=batch.argmin(dim=-1)
  for i in range(len(indices)):
    mask[i,0:indices[i]]=True 
  return mask


## Define the Model
In the model we are using 2 transformer encoder layer and d_model=256, feedforward dim=512 with 8 attension heads, Then finally a single fully connected layer which is an output layer

In [67]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_sequence_length):
        super().__init__()
        self.max_sequence_length = max_sequence_length
        self.d_model = d_model

    def forward(self):
        even_i = torch.arange(0, self.d_model, 2).float()
        denominator = torch.pow(10000, even_i/self.d_model)
        position = (torch.arange(self.max_sequence_length)
                          .reshape(self.max_sequence_length, 1))
        even_PE = torch.sin(position / denominator)
        odd_PE = torch.cos(position / denominator)
        stacked = torch.stack([even_PE, odd_PE], dim=2)
        PE = torch.flatten(stacked, start_dim=1, end_dim=2)
        return PE

class NERModel(nn.Module):
  def __init__(self,vocab_size,tagset_size,max_seq_length,d_model,num_layers):
    super(NERModel,self).__init__()
    self.embedding=nn.Embedding(vocab_size,d_model)
    self.transformer_encoder=nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model=256,nhead=8,dim_feedforward=max_seq_length,batch_first=True),num_layers=num_layers)
    self.fc=nn.Linear(d_model,tagset_size)
    self.pos=PositionalEncoding(d_model,max_seq_length)
  def forward(self, word_ids):
        mask = generate_padding_mask(word_ids)
        word_ids = word_ids.long()
        word_ids = self.embedding(word_ids)
        word_ids = word_ids + self.pos().to(word_ids.device)
        word_ids = self.transformer_encoder(word_ids, src_key_padding_mask=mask.to(word_ids.device))
        word_ids = self.fc(word_ids)
        return word_ids



In [None]:
vocab_size=30000
target_size=7
max_seq_length=512
d_model=256
num_layers=2
batch_size=32
learning_rate=0.001
num_epochs=10
device=torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [68]:
model=NERModel(vocab_size,target_size,max_seq_length,d_model,num_layers)

In [58]:
dataloader=DataLoader(train_dataset,batch_size=batch_size,shuffle=True)

In [71]:
criterion=nn.CrossEntropyLoss()
optimizer=torch.optim.Adam(model.parameters(),lr=learning_rate)

In [69]:
model=model.to(device)

In [61]:
device

device(type='cuda')

In [72]:
for i in range(num_epochs):
  model.train()
  for x,y in dataloader:
    x=x.to(device)
    y=y.to(device)
    out=model(x)
    indices=y.argmin(dim=-1)
    loss=0
    for j in range(len(indices)):
      loss+=criterion(out[j,0:indices[j]],y[j,0:indices[j]])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"epoch={i} loss = {loss.item()}")


[1;30;43mStreaming output truncated to the last 5000 lines.[0m
epoch=7 loss = 7.416934013366699
epoch=7 loss = 3.2502119541168213
epoch=7 loss = 8.193572998046875
epoch=7 loss = 2.7964730262756348
epoch=7 loss = 4.981128692626953
epoch=7 loss = 5.639580726623535
epoch=7 loss = 3.72403883934021
epoch=7 loss = 4.152966499328613
epoch=7 loss = 7.028319358825684
epoch=7 loss = 4.692986488342285
epoch=7 loss = 7.036635875701904
epoch=7 loss = 3.992140293121338
epoch=7 loss = 4.548179626464844
epoch=7 loss = 3.8135180473327637
epoch=7 loss = 4.211056709289551
epoch=7 loss = 5.800565242767334
epoch=7 loss = 4.14138126373291
epoch=7 loss = 5.1243414878845215
epoch=7 loss = 5.36757230758667
epoch=7 loss = 3.9451699256896973
epoch=7 loss = 3.9391870498657227
epoch=7 loss = 5.706661224365234
epoch=7 loss = 4.38700008392334
epoch=7 loss = 4.927907943725586
epoch=7 loss = 4.77551794052124
epoch=7 loss = 4.045258045196533
epoch=7 loss = 4.47740364074707
epoch=7 loss = 6.319763660430908
epoch=7 los

In [73]:
torch.save(model, "model.pt")

## test the model

In [74]:

test_dataset=NERDataset(test['token_ids'].to_list(),test['adjusted_tags'].to_list())
test_dataloader=DataLoader(test_dataset,batch_size=32,shuffle=True)

In [75]:
# evaluate the model on test data
model.eval()
with torch.no_grad():
  correct = 0
  total = 0
  for x, y in test_dataloader:
    x=x.to(device)
    y=y.to(device)
    out = model(x)
    indices=y.argmin(dim=-1)
    for i in range(len(indices)):
      correct += (out[i,0:indices[i]].argmax(dim=-1) == y[i,0:indices[i]]).sum().item()
      total += indices[i].sum().item()
  accuracy = correct / total
  print("Test Accuracy:", accuracy)

Test Accuracy: 0.9426961824503889


Although we have a good test accuracy but it is overfitted. We have a small dataset and we have trained it only for 10 epochs. Let's test it with unseen data.

In [76]:
example="मेरा नाम रहीम खान है"

In [79]:
tokens=bpe_tokenizer.encode(example)
print(tokens)

Encoding(num_tokens=5, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])


In [82]:
ids=tokens.ids
print(ids)

[2449, 1377, 19501, 2959, 930]


In [81]:
tokens.tokens

['मेरा', 'नाम', 'रहीम', 'खान', 'है']

In [83]:
padding=[1]*(512-len(ids))
ids+=padding
print(ids)

[2449, 1377, 19501, 2959, 930, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 

In [84]:
out=model(torch.tensor(ids).unsqueeze(0).to(device))

In [86]:
out=out.squeeze()


In [87]:
probs=F.softmax(out,dim=-1)

In [88]:
probs.argmax(dim=-1)

tensor([6, 6, 3, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
        6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
        6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
        6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
        6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
        6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
        6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
        6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
        6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
        6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
        6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
        6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
        6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,

It has correctly identified "khan" as person name but "rahim" is incorrectly identified.