# Importing the dependencies

In [None]:
# Let us first check the GPU resource available
!nvidia-smi

Wed Jun 19 08:33:35 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   64C    P8              17W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [None]:
!pip install transformers datasets tokenizers seqeval -q

In [None]:
!pip install accelerate -U -q
!pip install transformers[torch] -q

In [None]:
import datasets
import pandas as pd
import numpy as np
import json
import torch

from transformers import BertTokenizerFast
from transformers import DataCollatorForTokenClassification
from transformers import AutoModelForTokenClassification

from transformers import TrainingArguments
from transformers import Trainer
from datasets import load_metric

from transformers import pipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

<!-- # NER and POS Theory -->
# Named Entity Recognition
Named Entity Recognition (NER) is a core NLP task that identifies and classifies specific types of information in text. It can be viewed as a highlighter for the important items in a passage, which gives a machine better context as it tries to understand the data.

It does two things:
1. **Identification**: It scans through the text and pinpoints words and phrases that represent specific entities, such as:
  * Names of persons
  * Organizations
  * Locations
  * Dates
  * Times
  * Quantities

2. **Classification**: It assigns each identified word or phrase to a category and labels it, which helps the machine understand the nature of the information it has found.
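
To make the "highlighter" idea concrete, here is a minimal sketch that marks hand-labelled entity spans in a sentence. The `highlight` helper and the span format `(start, end, type)` are illustrative assumptions and not part of any library.

```python
# Hand-labelled toy example: (start_token, end_token_exclusive, entity_type).
tokens = ["Ben", "is", "a", "college", "student", "at", "UCLA"]
entities = [(0, 1, "PER"), (6, 7, "ORG")]

def highlight(tokens, entities):
    """Wrap each entity span in [text|TYPE] and leave other tokens untouched."""
    starts = {start: (end, label) for start, end, label in entities}
    out, i = [], 0
    while i < len(tokens):
        if i in starts:
            end, label = starts[i]
            out.append(f"[{' '.join(tokens[i:end])}|{label}]")
            i = end
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)

highlighted = highlight(tokens, entities)
print(highlighted)
```

Identification corresponds to finding the span boundaries, and classification corresponds to the type attached to each span.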






# Part of Speech
Part-of-Speech (POS) tagging is a fundamental building block of many NLP tasks. It sorts words into grammatical categories to capture their function within a sentence.

It works as follows:

1. **Assigning Labels**: Each word in a sentence is assigned a label corresponding to its grammatical category. Common POS tags include:

  * **Noun (NN)**
  * **Verb (VB)**
  * **Adjective (JJ)**
  * **Adverb (RB)**
  * **Pronoun (PRP)**
  * **Preposition (IN)**
  * **Conjunction (CC)**

By understanding the grammatical role of each word, NLP applications can achieve better performance in tasks such as machine translation, information retrieval, syntactic and semantic analysis.
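
As a small illustration, a tagged sentence can be represented as a list of `(word, tag)` pairs. The tags below are assigned by hand in Penn Treebank style purely for demonstration, and `words_with_tag_prefix` is a hypothetical helper.

```python
# Hand-assigned Penn Treebank-style tags, for illustration only.
tagged = [("The", "DT"), ("quick", "JJ"), ("fox", "NN"),
          ("jumps", "VBZ"), ("over", "IN"), ("it", "PRP")]

def words_with_tag_prefix(tagged_sentence, prefix):
    """Return the words whose tag starts with `prefix` (e.g. 'VB' matches VB and VBZ)."""
    return [word for word, tag in tagged_sentence if tag.startswith(prefix)]

nouns = words_with_tag_prefix(tagged, "NN")
verbs = words_with_tag_prefix(tagged, "VB")
print(nouns, verbs)
```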



# Loading the dataset
For this project, we will be using the **CoNLL-2003** dataset. The CoNLL-2003 shared task concerns language-independent named entity recognition and concentrates on four types of named entities:

* Persons
* Locations
* Organizations
* Names of miscellaneous entities that do not belong to the previous three groups.

You can check it out [here](https://huggingface.co/datasets/eriktks/conll2003)

In [None]:
conll2003 = datasets.load_dataset("conll2003", trust_remote_code=True)



Let us now look at our data and how it is split.

In [None]:
conll2003

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})

We can also see the description of each split, as shown below.

In [None]:
print("Training Data")
print(conll2003["train"].description)
print("---------------------------------------------------------------------------------------")
print("Validation Data")
print(conll2003["validation"].description)
print("---------------------------------------------------------------------------------------")
print("Test Data")
print(conll2003["test"].description)
print("---------------------------------------------------------------------------------------")

Training Data
The shared task of CoNLL-2003 concerns language-independent named entity recognition. We will concentrate on
four types of named entities: persons, locations, organizations and names of miscellaneous entities that do
not belong to the previous three groups.

The CoNLL-2003 shared task data files contain four columns separated by a single space. Each word has been put on
a separate line and there is an empty line after each sentence. The first item on each line is a word, the second
a part-of-speech (POS) tag, the third a syntactic chunk tag and the fourth the named entity tag. The chunk tags
and the named entity tags have the format I-TYPE which means that the word is inside a phrase of type TYPE. Only
if two phrases of the same type immediately follow each other, the first word of the second phrase will have tag
B-TYPE to show that it starts a new phrase. A word with tag O is not part of a phrase. Note the dataset uses IOB2
tagging scheme, whereas the original dataset us
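
The file layout described above (one word per line, four space-separated columns, an empty line after each sentence) can be parsed with a few lines of code. This is a minimal sketch: `parse_conll` is a hypothetical helper and the `raw` string is a small illustrative sample in that layout, not a file shipped with the dataset.

```python
# Illustrative sample in the four-column layout: word, POS tag, chunk tag, NER tag.
raw = """EU NNP I-NP I-ORG
rejects VBZ I-VP O
German JJ I-NP I-MISC

Peter NNP I-NP I-PER
"""

def parse_conll(text):
    """Split CoNLL-style text into sentences of (word, pos, chunk, ner) tuples."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:               # an empty line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        word, pos, chunk, ner = line.split()
        current.append((word, pos, chunk, ner))
    if current:                    # the last sentence may lack a trailing blank line
        sentences.append(current)
    return sentences

sentences = parse_conll(raw)
print(len(sentences), sentences[0])
```

The Hugging Face loader used in this notebook already performs this step, so the sketch is only meant to show what the raw format looks like.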

# Loading the Model
For this, we will use **BERT** as our base model and fine-tune it on the CoNLL-2003 data.

It's a bidirectional transformer pretrained using a combination of masked language modeling objective and next sentence prediction on a large corpus comprising the Toronto Book Corpus and Wikipedia. You can read more about it [here](https://arxiv.org/pdf/1810.04805).

Although BERT is a comparatively small language model rather than a large language model (LLM), it performs well on NER because of its bidirectional training. Each token's representation is conditioned on the words to both its left and its right, which lets the model capture the contextual relationships that NER depends on.
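
The difference between bidirectional and left-to-right context can be sketched with visibility masks. This is a simplified illustration that ignores padding and special tokens; `visibility` is a hypothetical helper, not a Transformers API.

```python
def visibility(n_tokens, bidirectional):
    """mask[i][j] is 1 when position i may attend to position j."""
    return [
        [1 if bidirectional or j <= i else 0 for j in range(n_tokens)]
        for i in range(n_tokens)
    ]

tokens = ["Ben", "works", "at", "UCLA"]
causal = visibility(len(tokens), bidirectional=False)   # left-to-right model
full = visibility(len(tokens), bidirectional=True)      # BERT-style encoder

# Tokens visible when encoding "Ben" (position 0) under each scheme.
seen_causal = [t for t, m in zip(tokens, causal[0]) if m]
seen_full = [t for t, m in zip(tokens, full[0]) if m]
print(seen_causal, seen_full)
```

With the full mask, the representation of "Ben" can draw on "works at UCLA", which is the kind of right-hand context that helps decide whether a word is a person, an organization or neither.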

In [None]:
model = "bert-base-uncased"

tokenizer = BertTokenizerFast.from_pretrained(model)

Let us now see how the BERT tokenizer processes a sample from the data.

In [None]:
example_text = conll2003["train"][0]["tokens"]
example_text

['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']

In [None]:
tokenizer_example_id = tokenizer(example_text, is_split_into_words=True)
tokenizer_example_id

{'input_ids': [101, 7327, 19164, 2446, 2655, 2000, 17757, 2329, 12559, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [None]:
example_tokens = tokenizer.convert_ids_to_tokens(tokenizer_example_id["input_ids"])
example_tokens

['[CLS]',
 'eu',
 'rejects',
 'german',
 'call',
 'to',
 'boycott',
 'british',
 'lamb',
 '.',
 '[SEP]']

As we can see, when the BERT tokenizer converts text into tokens, it adds a **[CLS]** token at the beginning and a **[SEP]** token at the end of the sequence.

Here, **[CLS]** stands for "classification" and is used as the first token for classification tasks. **[SEP]** stands for "separator" and is used to separate different sentences.

Since we aim to fine-tune BERT on this data, the labels in our dataset must be adjusted to account for these extra tokens. Let us use a function to align the labels with the tokenized input.

In [None]:
def tokenize_and_align_labels(examples, label_all_tokens=True):
    # Tokenize the pre-split words of every sentence in the batch
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    labels = []

    for i, label in enumerate(examples["ner_tags"]):
        # word_ids maps each token to the index of the word it came from;
        # special tokens such as [CLS] and [SEP] are mapped to None
        word_ids = tokenized_inputs.word_ids(batch_index=i)

        previous_word_idx = None
        label_ids = []

        for word_idx in word_ids:
            if word_idx is None:
                # Special tokens get -100 so the loss function ignores them during training
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                # First token of a word: use the label of that word
                label_ids.append(label[word_idx])
            else:
                # Remaining sub-word tokens of the same word: repeat the word's label
                # if label_all_tokens is True, otherwise mask them with -100
                label_ids.append(label[word_idx] if label_all_tokens else -100)

            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs
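
The reason `-100` works as a mask is that it is the default `ignore_index` of PyTorch's cross-entropy loss. The following is a minimal pure-Python sketch of that behaviour, not the actual PyTorch implementation; `masked_cross_entropy` is a hypothetical helper.

```python
import math

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def masked_cross_entropy(logits, labels):
    """Mean cross-entropy over positions whose label is not IGNORE_INDEX."""
    losses = []
    for scores, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue                      # masked positions contribute nothing
        log_norm = math.log(sum(math.exp(s) for s in scores))
        losses.append(log_norm - scores[label])
    return sum(losses) / len(losses)

# Three positions, two classes; the first position is a special token.
logits = [[5.0, -5.0], [0.0, 0.0], [0.0, 0.0]]
labels = [IGNORE_INDEX, 0, 1]
loss = masked_cross_entropy(logits, labels)
print(loss)
```

Whatever the model outputs at the masked position, the loss stays the same, so no gradient signal comes from the special tokens.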

To see how the data is aligned to the labels, we can look at this example below.

In [None]:
example = tokenize_and_align_labels(conll2003["train"][0:1])

print("Tokens---------------------------------->Labels")
for token, label in zip(tokenizer.convert_ids_to_tokens(example["input_ids"][0]),example["labels"][0]):
    print(f"{token:-<40} {label}")

Tokens---------------------------------->Labels
[CLS]----------------------------------- -100
eu-------------------------------------- 3
rejects--------------------------------- 0
german---------------------------------- 7
call------------------------------------ 0
to-------------------------------------- 0
boycott--------------------------------- 0
british--------------------------------- 7
lamb------------------------------------ 0
.--------------------------------------- 0
[SEP]----------------------------------- -100
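
None of the words in this sentence is split into sub-words, so the effect of `label_all_tokens` is not visible here. The sketch below replays the same alignment loop on a hypothetical `word_ids` list in which one word is split into two pieces; the values are made up for illustration.

```python
def align(word_labels, word_ids, label_all_tokens):
    """Mirror of the loop in tokenize_and_align_labels for a single sentence."""
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(-100)                  # special token
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first piece of a word
        else:
            aligned.append(word_labels[word_id] if label_all_tokens else -100)
        previous = word_id
    return aligned

# Hypothetical tokenization: word 1 is split into two sub-word pieces.
word_labels = [0, 5]                       # e.g. 'O', 'B-LOC'
word_ids = [None, 0, 1, 1, None]           # [CLS], word 0, word 1, word 1, [SEP]

all_pieces = align(word_labels, word_ids, label_all_tokens=True)
first_only = align(word_labels, word_ids, label_all_tokens=False)
print(all_pieces, first_only)
```

With `label_all_tokens=True` every piece of a word carries the word's label, while with `False` only the first piece is scored.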


Let us now go ahead and map this data with our function to convert it into the desired format.

In [None]:
tokenized_datasets = conll2003.map(tokenize_and_align_labels, batched=True)

Now, we can go ahead and load the model.

In [None]:
num_labels = len(conll2003["train"].features["ner_tags"].feature.names)

In [None]:
model = AutoModelForTokenClassification.from_pretrained(model, num_labels=num_labels).to(device)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# Training the Model on Custom Data
Let us load the arguments and the metrics required for evaluation.

In [None]:
args=TrainingArguments(
    "test-ner",
    evaluation_strategy = "epoch",
    do_train=True,
    do_eval=True,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01
)



## Metric Evaluation using Seqeval
Here, we will be using the **Seqeval** metric, which is a Python framework for sequence labeling evaluation. Seqeval can evaluate the performance of chunking tasks such as named-entity recognition, part-of-speech tagging, semantic role labeling and so on.

In [None]:
metric = datasets.load_metric("seqeval",trust_remote_code=True)



In [None]:
labels_list = conll2003["train"].features["ner_tags"].feature.names
labels_list

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

This shows the labels that can be predicted for each token, where:

1. **O**: This tag is used for tokens that are Outside of any named entity.
2. **B-PER**: This tag is used to mark the Beginning of a Person named entity.
3. **I-PER**: This tag is used to mark the Inside of a Person named entity.
4. **B-ORG**: This tag is used to mark the Beginning of an Organization named entity.
5. **I-ORG**: This tag is used to mark the Inside of an Organization named entity.
6. **B-LOC**: This tag is used to mark the Beginning of a Location named entity.
7. **I-LOC**: This tag is used to mark the Inside of a Location named entity.
8. **B-MISC**: This tag is used to mark the Beginning of a Miscellaneous named entity.
9. **I-MISC**: This tag is used to mark the Inside of a Miscellaneous named entity.
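
To show how these tags combine into entities, the following sketch groups a tag sequence into `(type, start, end)` spans. It is a simplified decoder written for illustration (`iob2_to_spans` is a hypothetical helper), and the tokens and tags are a made-up example.

```python
def iob2_to_spans(tags):
    """Return (entity_type, start, end_exclusive) for each entity in a tag sequence."""
    spans, start, current = [], None, None
    for i, tag in enumerate(tags + ["O"]):          # sentinel flushes the last span
        prefix, _, kind = tag.partition("-")
        continues = prefix == "I" and kind == current
        if current is not None and not continues:
            spans.append((current, start, i))       # close the open entity
            start, current = None, None
        if prefix == "B" or (prefix == "I" and current is None):
            start, current = i, kind                # open a new entity
    return spans

tokens = ["Mark", "Twain", "visited", "New", "York", "."]
tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O"]

spans = iob2_to_spans(tags)
entities = [(" ".join(tokens[s:e]), kind) for kind, s, e in spans]
print(entities)
```

A `B-` tag always opens a new entity, and the `I-` tags that follow with the same type extend it.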

In [None]:
example = conll2003["train"][1002]
label_ids = example["ner_tags"]
example_labels = [labels_list[i] for i in label_ids]
example_labels

['O', 'B-MISC', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'O']

Now, let us look at the actual sentence.

In [None]:
example["tokens"]

['ONE', 'ROMANIAN', 'DIES', 'IN', 'BUS', 'CRASH', 'IN', 'BULGARIA', '.']

So here, the word "ROMANIAN" is the beginning of a miscellaneous entity and "BULGARIA" is the beginning of a location entity. With that, let us see how the metric works. Assume our model predicted the following:

**['O', 'B-LOC', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'O']**

Let us see how this prediction would be scored.

In [None]:
metric.compute(predictions=[["O","B-LOC","O","O","O","O","O","B-LOC","O"]], references=[example_labels])

{'LOC': {'precision': 0.5,
  'recall': 1.0,
  'f1': 0.6666666666666666,
  'number': 1},
 'MISC': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 1},
 'overall_precision': 0.5,
 'overall_recall': 0.5,
 'overall_f1': 0.5,
 'overall_accuracy': 0.8888888888888888}
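
The overall scores above can be reproduced by hand, which makes clear that precision, recall and F1 are computed over whole entities while accuracy is computed per token. This is a minimal sketch that assumes well-formed tags where every entity starts with `B-`; the `spans` helper is hypothetical and is not how Seqeval is implemented internally.

```python
def spans(tags):
    """Set of (type, start, end_exclusive) entities, assuming every entity starts with B-."""
    out, start = set(), None
    for i, tag in enumerate(tags + ["O"]):          # sentinel closes a trailing entity
        if start is not None and not tag.startswith("I-"):
            out.add((tags[start][2:], start, i))
            start = None
        if tag.startswith("B-"):
            start = i
    return out

reference = ["O", "B-MISC", "O", "O", "O", "O", "O", "B-LOC", "O"]
predicted = ["O", "B-LOC", "O", "O", "O", "O", "O", "B-LOC", "O"]

gold, pred = spans(reference), spans(predicted)
correct = len(gold & pred)                # exact match on type and boundaries
precision = correct / len(pred)
recall = correct / len(gold)
f1 = 2 * precision * recall / (precision + recall)
accuracy = sum(r == p for r, p in zip(reference, predicted)) / len(reference)
print(precision, recall, f1, accuracy)
```

Only one of the two predicted entities matches a reference entity, so precision and recall are both 0.5, while eight of the nine token tags agree.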

As we can see, it scores each entity type separately and also reports overall scores. With this knowledge, let us build a function that evaluates the model's predictions and returns the overall scores.

In [None]:
def compute_metrics(eval_preds):
    pred_logits, labels = eval_preds

    # Pick the highest-scoring class per token. Softmax is monotonic, so the argmax of
    # the logits equals the argmax of the probabilities and no softmax is needed.
    pred_ids = np.argmax(pred_logits, axis=2)

    # Drop every position whose label is -100 (special tokens and masked sub-words)
    predictions = [
        [labels_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(pred_ids, labels)
    ]

    true_labels = [
        [labels_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(pred_ids, labels)
    ]
    results = metric.compute(predictions=predictions, references=true_labels)

    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
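
The two steps inside `compute_metrics` (taking the argmax and dropping the `-100` positions) can be traced on a tiny hand-made example. The logits, the shortened label list and the helper functions below are illustrative assumptions; the real function operates on NumPy arrays for the whole evaluation set.

```python
import math

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

def softmax(values):
    """Numerically stable softmax."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

label_names = ["O", "B-PER", "I-PER"]        # shortened label list for illustration
logits = [[9.0, 0.1, 0.2],                   # [CLS]
          [0.3, 4.0, 1.0],                   # "ben"
          [3.5, 0.2, 0.1],                   # "runs"
          [8.0, 0.0, 0.0]]                   # [SEP]
labels = [-100, 1, 0, -100]

pred_ids = [argmax(row) for row in logits]
# Softmax is monotonic, so it never changes which class wins.
assert pred_ids == [argmax(softmax(row)) for row in logits]

predictions = [label_names[p] for p, l in zip(pred_ids, labels) if l != -100]
references = [label_names[l] for l in labels if l != -100]
print(predictions, references)
```

The model still produces a prediction for `[CLS]` and `[SEP]`, but those positions are filtered out before scoring because their label is `-100`.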

In [None]:
data_collator=DataCollatorForTokenClassification(tokenizer)
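
Sentences in a batch have different lengths, so the collator pads them to a common length at batch time. The sketch below imitates that behaviour in plain Python under stated assumptions: right-side padding, a pad token id of 0 (the `pad_token_id` of this model) and `-100` as the label padding value, which is the collator's default. `pad_batch` is a hypothetical helper; the real collator also returns tensors.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Right-pad every example in the batch to the length of the longest one."""
    longest = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        missing = longest - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * missing)
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * missing)
        batch["labels"].append(f["labels"] + [label_pad_id] * missing)
    return batch

features = [
    {"input_ids": [101, 7327, 102], "labels": [-100, 3, -100]},
    {"input_ids": [101, 2446, 2655, 1012, 102], "labels": [-100, 7, 0, 0, -100]},
]
batch = pad_batch(features)
print(batch["input_ids"][0], batch["attention_mask"][0], batch["labels"][0])
```

Padding the labels with `-100` keeps the padded positions out of the loss, in the same way as the special tokens.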

## Trainer
Now, we can go ahead and build the Trainer.

In [None]:
trainer = Trainer(
              model,
              args,
              train_dataset=tokenized_datasets["train"],
              eval_dataset=tokenized_datasets["validation"],
              data_collator=data_collator,
              tokenizer=tokenizer,
              compute_metrics=compute_metrics
          )

In [None]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | 0.2211 | 0.063035 | 0.914687 | 0.925942 | 0.92028 | 0.982144 |
| 2 | 0.0455 | 0.056911 | 0.933517 | 0.947198 | 0.940308 | 0.98575 |
| 3 | 0.0242 | 0.057938 | 0.938834 | 0.949547 | 0.94416 | 0.98656 |
| 4 | 0.0139 | 0.06098 | 0.934177 | 0.949435 | 0.941744 | 0.98602 |
| 5 | 0.0094 | 0.062548 | 0.941424 | 0.951113 | 0.946244 | 0.986767 |


TrainOutput(global_step=4390, training_loss=0.05012114933248533, metrics={'train_runtime': 1141.8752, 'train_samples_per_second': 61.482, 'train_steps_per_second': 3.845, 'total_flos': 1702317283240608.0, 'train_loss': 0.05012114933248533, 'epoch': 5.0})
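
The `global_step=4390` reported above follows directly from the dataset size and the training arguments. The quick check below assumes a single GPU and no gradient accumulation, which matches the setup shown in this notebook.

```python
import math

train_rows = 14041      # size of the train split shown earlier
batch_size = 16         # per_device_train_batch_size
epochs = 5              # num_train_epochs

# The last batch of each epoch is smaller, so the division is rounded up.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)
```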

In [None]:
model.save_pretrained("NER_Model_2.0")
tokenizer.save_pretrained("NER_Tokenizer_2.0")

('NER_Tokenizer_2.0/tokenizer_config.json',
 'NER_Tokenizer_2.0/special_tokens_map.json',
 'NER_Tokenizer_2.0/vocab.txt',
 'NER_Tokenizer_2.0/added_tokens.json',
 'NER_Tokenizer_2.0/tokenizer.json')

Now, before we proceed, let us have a quick look at the configuration of the model.

In [None]:
with open("/content/NER_Model_2.0/config.json") as f:
    config = json.load(f)
config

{'_name_or_path': 'bert-base-uncased',
 'architectures': ['BertForTokenClassification'],
 'attention_probs_dropout_prob': 0.1,
 'classifier_dropout': None,
 'gradient_checkpointing': False,
 'hidden_act': 'gelu',
 'hidden_dropout_prob': 0.1,
 'hidden_size': 768,
 'id2label': {'0': 'LABEL_0',
  '1': 'LABEL_1',
  '2': 'LABEL_2',
  '3': 'LABEL_3',
  '4': 'LABEL_4',
  '5': 'LABEL_5',
  '6': 'LABEL_6',
  '7': 'LABEL_7',
  '8': 'LABEL_8'},
 'initializer_range': 0.02,
 'intermediate_size': 3072,
 'label2id': {'LABEL_0': 0,
  'LABEL_1': 1,
  'LABEL_2': 2,
  'LABEL_3': 3,
  'LABEL_4': 4,
  'LABEL_5': 5,
  'LABEL_6': 6,
  'LABEL_7': 7,
  'LABEL_8': 8},
 'layer_norm_eps': 1e-12,
 'max_position_embeddings': 512,
 'model_type': 'bert',
 'num_attention_heads': 12,
 'num_hidden_layers': 12,
 'pad_token_id': 0,
 'position_embedding_type': 'absolute',
 'torch_dtype': 'float32',
 'transformers_version': '4.41.2',
 'type_vocab_size': 2,
 'use_cache': True,
 'vocab_size': 30522}

As we can see above, the configuration only contains generic placeholder labels (`LABEL_0` to `LABEL_8`) and not the entity labels of the dataset on which the model was trained. So let us go ahead and set them.

In [None]:
id2label = {
    str(i): label for i,label in enumerate(labels_list)
}

id2label

{'0': 'O',
 '1': 'B-PER',
 '2': 'I-PER',
 '3': 'B-ORG',
 '4': 'I-ORG',
 '5': 'B-LOC',
 '6': 'I-LOC',
 '7': 'B-MISC',
 '8': 'I-MISC'}

In [None]:
label2id = {
    label: str(i) for i,label in enumerate(labels_list)
}

label2id

{'O': '0',
 'B-PER': '1',
 'I-PER': '2',
 'B-ORG': '3',
 'I-ORG': '4',
 'B-LOC': '5',
 'I-LOC': '6',
 'B-MISC': '7',
 'I-MISC': '8'}

In [None]:
config["id2label"] = id2label
config["label2id"] = label2id
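
A short aside on why `id2label` uses string keys: JSON object keys are always strings, so integer keys turn into strings once the configuration is written to disk. The sketch below demonstrates this with a shortened, illustrative label list (`labels_demo` is not part of the notebook).

```python
import json

labels_demo = ["O", "B-PER", "I-PER"]          # shortened list for illustration
id2label_demo = {i: label for i, label in enumerate(labels_demo)}
label2id_demo = {label: i for i, label in enumerate(labels_demo)}

# Round-trip through JSON, as happens when a config is saved and reloaded.
restored = json.loads(json.dumps({"id2label": id2label_demo, "label2id": label2id_demo}))
print(restored)
```

Integer keys come back as strings, while integer values stay integers, which matches the layout of the `config.json` printed earlier, where the `label2id` values are plain integers. As an alternative to editing the file, `from_pretrained` also accepts `id2label` and `label2id` keyword arguments when the model is first created.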

The final configuration of our model is shown below.

In [None]:
config

{'_name_or_path': 'bert-base-uncased',
 'architectures': ['BertForTokenClassification'],
 'attention_probs_dropout_prob': 0.1,
 'classifier_dropout': None,
 'gradient_checkpointing': False,
 'hidden_act': 'gelu',
 'hidden_dropout_prob': 0.1,
 'hidden_size': 768,
 'id2label': {'0': 'O',
  '1': 'B-PER',
  '2': 'I-PER',
  '3': 'B-ORG',
  '4': 'I-ORG',
  '5': 'B-LOC',
  '6': 'I-LOC',
  '7': 'B-MISC',
  '8': 'I-MISC'},
 'initializer_range': 0.02,
 'intermediate_size': 3072,
 'label2id': {'O': '0',
  'B-PER': '1',
  'I-PER': '2',
  'B-ORG': '3',
  'I-ORG': '4',
  'B-LOC': '5',
  'I-LOC': '6',
  'B-MISC': '7',
  'I-MISC': '8'},
 'layer_norm_eps': 1e-12,
 'max_position_embeddings': 512,
 'model_type': 'bert',
 'num_attention_heads': 12,
 'num_hidden_layers': 12,
 'pad_token_id': 0,
 'position_embedding_type': 'absolute',
 'torch_dtype': 'float32',
 'transformers_version': '4.41.2',
 'type_vocab_size': 2,
 'use_cache': True,
 'vocab_size': 30522}

In [None]:
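# Use a context manager so the file is flushed and closed before the model is reloaded
with open("/content/NER_Model_2.0/config.json", "w") as f:
    json.dump(config, f)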

# Transformer Pipeline
Let us now go ahead and build the pipeline for making predictions.

In [None]:
model_finetuned = AutoModelForTokenClassification.from_pretrained("/content/NER_Model_2.0")
tokenizer_finetuned = BertTokenizerFast.from_pretrained("/content/NER_Tokenizer_2.0")

nlp_pipeline = pipeline(
    "ner",
    model= model_finetuned,
    tokenizer=tokenizer_finetuned
    )

Let us now take an example to work with, and make a prediction on the same.

In [None]:
example = "Ben is a college student at UCLA"
nlp_pipeline(example)

[{'entity': 'B-PER',
  'score': 0.9949121,
  'index': 1,
  'word': 'ben',
  'start': 0,
  'end': 3},
 {'entity': 'B-ORG',
  'score': 0.91264635,
  'index': 7,
  'word': 'ucla',
  'start': 28,
  'end': 32}]

In [None]:
example2 = "Apple launched a Mac with the M3 chip"
nlp_pipeline(example2)

[{'entity': 'B-ORG',
  'score': 0.99727315,
  'index': 1,
  'word': 'apple',
  'start': 0,
  'end': 5},
 {'entity': 'B-MISC',
  'score': 0.94175667,
  'index': 4,
  'word': 'mac',
  'start': 17,
  'end': 20},
 {'entity': 'B-MISC',
  'score': 0.9121459,
  'index': 7,
  'word': 'm3',
  'start': 30,
  'end': 32}]

In [None]:
example3 = "EU rejects German call to boycott British lamb."
nlp_pipeline(example3)

[{'entity': 'B-ORG',
  'score': 0.9988004,
  'index': 1,
  'word': 'eu',
  'start': 0,
  'end': 2},
 {'entity': 'B-MISC',
  'score': 0.9991093,
  'index': 3,
  'word': 'german',
  'start': 11,
  'end': 17},
 {'entity': 'B-MISC',
  'score': 0.99863094,
  'index': 7,
  'word': 'british',
  'start': 34,
  'end': 41}]

In [None]:
example4 ="Microsoft Windows created their software in 2000"
nlp_pipeline(example4)

[{'entity': 'B-ORG',
  'score': 0.9977283,
  'index': 1,
  'word': 'microsoft',
  'start': 0,
  'end': 9},
 {'entity': 'I-ORG',
  'score': 0.98648596,
  'index': 2,
  'word': 'windows',
  'start': 10,
  'end': 17}]
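
The pipeline returns one entry per token, so a multi-word entity such as this one arrives as a `B-ORG` token followed by an `I-ORG` token. The sketch below merges such neighbours using the character offsets; `group_entities` is a hypothetical helper written for illustration, and the token dictionaries are copied from the output above with the scores omitted.

```python
text = "Microsoft Windows created their software in 2000"
# Token-level predictions copied from the output above (scores omitted).
tokens = [
    {"entity": "B-ORG", "index": 1, "word": "microsoft", "start": 0, "end": 9},
    {"entity": "I-ORG", "index": 2, "word": "windows", "start": 10, "end": 17},
]

def group_entities(text, tokens):
    """Merge a B- token with directly following I- tokens of the same type."""
    groups = []
    for tok in tokens:
        prefix, _, kind = tok["entity"].partition("-")
        adjacent = bool(groups) and tok["index"] == groups[-1]["last_index"] + 1
        if prefix == "I" and adjacent and groups[-1]["type"] == kind:
            groups[-1]["end"] = tok["end"]              # extend the open entity
            groups[-1]["last_index"] = tok["index"]
        else:
            groups.append({"type": kind, "start": tok["start"],
                           "end": tok["end"], "last_index": tok["index"]})
    for g in groups:
        g["text"] = text[g["start"]:g["end"]]           # slice restores original casing
    return groups

grouped = group_entities(text, tokens)
print([(g["text"], g["type"]) for g in grouped])
```

The `pipeline` function also offers an `aggregation_strategy` argument (for example `"simple"`) that performs a similar grouping, which is likely the more convenient option in practice.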

In [None]:
example5 = "Mark is a founder of facebook and microsoft"
nlp_pipeline(example5)

[{'entity': 'B-PER',
  'score': 0.9974425,
  'index': 1,
  'word': 'mark',
  'start': 0,
  'end': 4},
 {'entity': 'B-ORG',
  'score': 0.99797136,
  'index': 6,
  'word': 'facebook',
  'start': 21,
  'end': 29},
 {'entity': 'B-ORG',
  'score': 0.9970747,
  'index': 8,
  'word': 'microsoft',
  'start': 34,
  'end': 43}]

# Conclusion
In this project, we looked at two key NLP techniques: Named Entity Recognition (NER) and Part-of-Speech (POS) tagging. NER helps machines understand text by identifying important details such as names and locations. POS tagging assigns grammatical labels to words, which aids tasks like translation and analysis.

We then built a custom NER system by fine-tuning the BERT model on the CoNLL-2003 dataset. This is just a taste of NLP's potential, fueled by deep learning and ever-growing data, to revolutionize human-computer interaction. You can check out this notebook and experiment for yourself [here](https://colab.research.google.com/drive/1HdardYoLm9j30bcOU_U9EJrhOVgYcMDG?usp=sharing).