We will use a pre-trained NER model from Hugging Face, dbmdz/bert-large-cased-finetuned-conll03-english: a cased BERT-large model fine-tuned for Named Entity Recognition (NER) on the CoNLL-2003 English dataset. It recognizes four entity types: PER (Person), ORG (Organization), LOC (Location), and MISC (Miscellaneous).

https://www.aimodels.fyi/models/huggingFace/bert-large-cased-finetuned-conll03-english-dbmdz

We’ll apply the model to our preprocessed text to extract persons, organizations, locations, and miscellaneous entities.

In [None]:
from transformers import pipeline
import pandas as pd
from tqdm import tqdm

In [None]:
# device=0 runs the model on the first CUDA GPU; use device=-1 to run on CPU.
ner_model = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english", aggregation_strategy="simple", device=0)

Loading the model may print a warning that some checkpoint weights were not used. This is expected behavior for the pre-trained NER model: the warning simply means that weights related to tasks other than NER (such as sequence classification) are not being used, which is correct for our purpose.
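
For orientation, with `aggregation_strategy="simple"` the pipeline returns, for each input text, a list of dictionaries with the keys `entity_group`, `score`, `word`, `start`, and `end`. The sketch below uses a hand-written sample in that shape (the sentence and scores are illustrative, not real model output) together with a hypothetical helper, `filter_by_score`, that keeps only confident predictions. The 0.9 threshold is an assumption, not a recommended value.

```python
# Hand-written sample in the shape produced by a token-classification pipeline
# with aggregation_strategy="simple". Text and scores are illustrative only.
text = "Barack Obama spoke in Springfield"
sample_output = [
    {"entity_group": "PER", "score": 0.99, "word": "Barack Obama", "start": 0, "end": 12},
    {"entity_group": "LOC", "score": 0.62, "word": "Springfield", "start": 22, "end": 33},
]

def filter_by_score(entities, min_score=0.9):
    """Hypothetical helper: keep (word, label) pairs scoring at least min_score."""
    return [(e["word"], e["entity_group"]) for e in entities if e["score"] >= min_score]

confident = filter_by_score(sample_output)
print(confident)  # [('Barack Obama', 'PER')]
```

The `start` and `end` values are character offsets into the input, so `text[start:end]` recovers the matched span.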

In [3]:
df_obama = pd.read_json('../data/preprocessed/obama_preprocessed.json', lines=True)
df_trump = pd.read_json('../data/preprocessed/trump_preprocessed.json', lines=True)
df_biden = pd.read_json('../data/preprocessed/biden_preprocessed.json', lines=True)

In [4]:
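def extract_entities(texts):
    """Run NER on each text and return one list of (word, label) tuples per text."""
    entities_list = []
    for text in tqdm(texts, desc="Processing NER", ncols=100):
        # Each prediction is a dict; keep only the entity text and its label.
        entities = ner_model(text)
        entities_list.append([(ent['word'], ent['entity_group']) for ent in entities])
    return entities_list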
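
To make the return value concrete: `extract_entities` yields one list per input text, each holding `(word, label)` tuples. The sketch below reproduces that logic with a stand-in for the model, so the structure can be inspected without downloading weights. Both `fake_ner` and `extract_entities_with` are hypothetical names introduced for illustration; `fake_ner` only tags a few hard-coded words, and the `tqdm` progress bar is omitted.

```python
def fake_ner(text):
    """Hypothetical stand-in for ner_model: tags a few hard-coded names."""
    known = {"Obama": "PER", "Congress": "ORG", "Chicago": "LOC"}
    return [
        {"word": word, "entity_group": label}
        for word, label in known.items()
        if word in text
    ]

def extract_entities_with(model, texts):
    """Same logic as extract_entities, with the model passed in explicitly."""
    return [
        [(ent["word"], ent["entity_group"]) for ent in model(text)]
        for text in texts
    ]

result = extract_entities_with(fake_ner, ["Obama addressed Congress", "No names here"])
print(result)  # [[('Obama', 'PER'), ('Congress', 'ORG')], []]
```

A text with no recognized entities produces an empty list, so the output always has the same length as the input.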

In [None]:
df_obama['entities'] = extract_entities(df_obama['processed_text'])
df_obama.to_json('../data/entities/obama_entities.json', orient='records', lines=True)
print("NER applied and entities saved for Obama dataset.")
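
One detail to keep in mind when reloading the saved file: JSON has no tuple type, so the `(word, label)` tuples are expected to come back as two-element lists. A minimal sketch with the standard `json` module shows the effect; the sample record is made up for illustration, and it is an assumption here that the pandas round trip behaves the same way.

```python
import json

# Made-up record in the same shape as one row of the 'entities' column.
record = {"entities": [("Obama", "PER"), ("Congress", "ORG")]}

line = json.dumps(record)     # one JSON Lines record, as a string
restored = json.loads(line)   # tuples are now lists

print(restored["entities"])   # [['Obama', 'PER'], ['Congress', 'ORG']]

# Convert back to tuples if downstream code relies on them (e.g. for hashing).
entities = [tuple(pair) for pair in restored["entities"]]
```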