<a href="https://colab.research.google.com/github/ritesh-tiwary/nlp/blob/main/NER_with_RoBERTa.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Named Entity Recognition (NER) with RoBERTa-based Model
This document provides an overview and usage instructions for a Python program that performs Named Entity Recognition (NER) using a pre-trained RoBERTa-based model. NER is a technique that identifies entities such as names of persons, organizations, locations, and more in a given text.

## Installation
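Before using the program, install the Hugging Face Transformers library with pip. The program also imports PyTorch, which is preinstalled on Colab but must be installed separately in other environments: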

In [1]:
!pip install transformers

## Program Description
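The program uses the Hugging Face Transformers library to run NER on a single sentence with a model fine-tuned for token classification. Although the notebook is titled for RoBERTa, the checkpoint it loads, dbmdz/bert-large-cased-finetuned-conll03-english, is a BERT-large cased model fine-tuned on the CoNLL-2003 English dataset; other token-classification checkpoints can be tried by changing model_name. The program tokenizes the input sentence, predicts a label for each token, and collects the labeled tokens into named entities.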

## Usage
1. Import the necessary libraries:

In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

2. Load the tokenizer and the fine-tuned token-classification model:

In [3]:
model_name = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


3. Define the input sentence you want to perform NER on:

In [8]:
sentence = "Bush has signed the operations for the Iraq"

4. Tokenize the input sentence:

In [9]:
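# Encode the sentence (this adds [CLS] and [SEP]) and map the ids back to token strings,
# so that the token list lines up one-to-one with the model's predictions.
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(sentence))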
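
The checkpoint loaded above uses a WordPiece tokenizer, which can split a word into several sub-tokens and marks each continuation piece with a leading `##`. The following is a minimal sketch of how such pieces could be joined back into whole words. The helper name `merge_wordpieces` and the token list are hypothetical and written by hand for illustration; they are not output from the tokenizer.

```python
def merge_wordpieces(tokens, special_tokens=("[CLS]", "[SEP]", "[PAD]")):
    """Join WordPiece sub-tokens (prefixed with '##') back into whole words."""
    words = []
    for token in tokens:
        if token in special_tokens:
            continue  # markers added by the tokenizer, not part of the text
        if token.startswith("##") and words:
            words[-1] += token[2:]  # continuation of the previous word
        else:
            words.append(token)
    return words


# Hand-written token list for illustration only, not real tokenizer output.
example_tokens = ["[CLS]", "Wash", "##ing", "##ton", "is", "rainy", "[SEP]"]
words = merge_wordpieces(example_tokens)
print(words)  # ['Washington', 'is', 'rainy']
```

With the real tokenizer, the set of special tokens is available as `tokenizer.all_special_tokens` and could be passed in place of the default tuple.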

5. Encode the sentence as tensors, run the model, and map each token's highest-scoring class to its label:

In [10]:
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

token_predictions = outputs.logits.argmax(2).squeeze().tolist()
entity_predictions = [model.config.id2label[label_id] for label_id in token_predictions]
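
The step above takes, for every token, the index of the largest logit and looks it up in the model's label map. A rough sketch of the same idea in plain Python is shown below, with a softmax added to turn the winning logit into a confidence score. The label map and logit values here are made up for illustration; a real model exposes its own mapping as `model.config.id2label`.

```python
import math

# Hypothetical label map and logits, chosen by hand for illustration.
id2label = {0: "O", 1: "I-PER", 2: "I-LOC"}
logits = [
    [4.0, 0.5, 0.2],  # scores for token 1
    [0.1, 3.2, 0.3],  # scores for token 2
    [0.3, 0.2, 2.9],  # scores for token 3
]


def softmax(row):
    """Convert one row of logits into probabilities that sum to 1."""
    highest = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - highest) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits, id2label):
    """Return (label, probability) for the highest-scoring class of each token."""
    decoded = []
    for row in logits:
        probs = softmax(row)
        best = max(range(len(row)), key=row.__getitem__)  # argmax over classes
        decoded.append((id2label[best], probs[best]))
    return decoded


predictions = decode(logits, id2label)
print([label for label, _ in predictions])  # ['O', 'I-PER', 'I-LOC']
```

The probability attached to each label could be used to discard low-confidence predictions before entities are assembled.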

6. Extract and display the identified names:

In [19]:
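names = []
current_name, current_type = "", None
for token, entity in zip(tokens, entity_predictions):
    # Labels look like 'B-PER' or 'I-PER'; 'O' and special tokens are not entities.
    is_entity = entity != 'O' and token not in tokenizer.all_special_tokens
    entity_type = entity[2:] if is_entity else None
    if token.startswith('##') and current_name:
        # WordPiece continuation: glue it onto the current word without a space.
        current_name += token[2:]
    elif is_entity and entity.startswith('I-') and entity_type == current_type:
        # Next word of the same entity.
        current_name += " " + token
    else:
        # The current entity (if any) ends here; store it with its own type.
        if current_name:
            names.append((current_name, current_type))
        current_name, current_type = (token, entity_type) if is_entity else ("", None)

if current_name:
    names.append((current_name, current_type))

print("Entities in the sentence:", names)

For the example sentence, the entities found are Bush and Iraq; each is stored as a (text, type) tuple whose type is the predicted label with its B- or I- prefix removed.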


The model labels entities of several types (e.g., 'PER' for a person's name). To adapt the program to your own NER requirements, inspect entity_predictions to see the label assigned to each token, and filter or post-process the names list as needed.
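
One possible adaptation is to group the extracted entities by type so that, for example, only person names are kept. The sketch below assumes the entities are available as (text, type) tuples; the helper name `group_by_type` and the sample values are hypothetical and are not output from the model.

```python
from collections import defaultdict


def group_by_type(entities):
    """Group (text, type) tuples into a dict mapping type -> list of texts."""
    grouped = defaultdict(list)
    for text, entity_type in entities:
        grouped[entity_type].append(text)
    return dict(grouped)


# Illustrative values only, not output from the model.
sample = [("Ada Lovelace", "PER"), ("London", "LOC"), ("Charles Babbage", "PER")]
by_type = group_by_type(sample)
print(by_type["PER"])  # ['Ada Lovelace', 'Charles Babbage']
```

Using `by_type.get("ORG", [])` instead of direct indexing avoids a KeyError when a type does not occur in the sentence.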