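We will use a pre-trained NER model from Hugging Face. The original choice was dbmdz/bert-large-cased-finetuned-conll03-english, a BERT model fine-tuned for Named Entity Recognition (NER) on the CoNLL-03 dataset. The code below instead runs the smaller elastic/distilbert-base-cased-finetuned-conll03-english, a DistilBERT model fine-tuned on the same dataset, and keeps the BERT-large call commented out. Both models recognize four entity types: PER (Person), ORG (Organization), LOC (Location), and MISC (Miscellaneous).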

https://www.aimodels.fyi/models/huggingFace/bert-large-cased-finetuned-conll03-english-dbmdz

We’ll apply the model to our preprocessed text to extract persons, organizations, locations, and miscellaneous entities.

In [18]:
from transformers import pipeline
import pandas as pd
from tqdm import tqdm

In [17]:
# Original, larger model (kept for reference):
# ner_model = pipeline("ner",
#                      model="dbmdz/bert-large-cased-finetuned-conll03-english",
#                      aggregation_strategy="simple")

# DistilBERT fine-tuned on CoNLL-03 for NER; "simple" aggregation merges
# sub-word tokens into whole entity spans.
ner_model = pipeline("ner",
                     model="elastic/distilbert-base-cased-finetuned-conll03-english",
                     aggregation_strategy="simple")

Downloading config.json:   0%|          | 0.00/954 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/261M [00:00<?, ?B/s]

Downloading tokenizer_config.json:   0%|          | 0.00/257 [00:00<?, ?B/s]

Downloading vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

If loading the model prints a warning that some checkpoint weights were not used, this is expected behavior for the pre-trained NER model.
The warning simply means that weights belonging to tasks other than NER (such as sequence classification) are ignored, which is correct for our purpose.

In [19]:
df_obama = pd.read_json('../data/preprocessed/obama_preprocessed.json', lines=True)
#df_trump = pd.read_json('../data/preprocessed/trump_preprocessed.json', lines=True)
#df_biden = pd.read_json('../data/preprocessed/biden_preprocessed.json', lines=True)
print(len(df_obama))
df_obama = df_obama[['publish_date', 'processed_text']].head(1000)  # Only process the first 1000 rows


15712


In [21]:
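def extract_entities(texts):
    """Run the NER pipeline on each text; return one list of (word, label) tuples per text."""
    entities_list = []
    for text in tqdm(texts, desc="Processing NER", ncols=100):
        entities = ner_model(text)  # list of dicts, one per aggregated entity
        entities_list.append([(ent['word'], ent['entity_group']) for ent in entities])
    return entities_list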
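
As a minimal sketch of what `extract_entities` consumes: with `aggregation_strategy="simple"`, the pipeline returns one dict per grouped entity, with keys such as `entity_group`, `score`, `word`, `start`, and `end`. The sample below is hand-written for illustration and is not actual model output. The helper `to_pairs` and its `min_score` threshold are hypothetical additions, not part of the notebook.

```python
# Hand-written stand-in for the pipeline output of one text (illustrative values only).
sample_output = [
    {"entity_group": "PER", "score": 0.99, "word": "Barack Obama", "start": 0, "end": 12},
    {"entity_group": "LOC", "score": 0.97, "word": "Chicago", "start": 23, "end": 30},
    {"entity_group": "MISC", "score": 0.41, "word": "Act", "start": 40, "end": 43},
]


def to_pairs(entities, min_score=0.0):
    """Reduce pipeline dicts to (word, label) tuples, dropping hits below min_score."""
    return [
        (ent["word"], ent["entity_group"])
        for ent in entities
        if ent["score"] >= min_score
    ]


pairs_all = to_pairs(sample_output)                       # same shape extract_entities stores
pairs_confident = to_pairs(sample_output, min_score=0.5)  # hypothetical confidence filter
print(pairs_all)
print(pairs_confident)
```

Keeping the `score` field around, rather than discarding it as the notebook does, would make it possible to filter out low-confidence entities later. Whether that is worthwhile depends on how noisy the extracted entities turn out to be.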

In [22]:
df_obama['entities'] = extract_entities(df_obama['processed_text'])
df_obama = df_obama.drop(columns=['processed_text'])

df_obama.to_json('../data/entities/DistilBERTobama_entities.json', orient='records', lines=True)
print("NER applied and entities saved for Obama dataset.")

Processing NER: 100%|███████████████████████████████████████████| 1000/1000 [03:07<00:00,  5.32it/s]

NER applied and entities saved for Obama dataset.
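
The saved file holds one JSON record per line, each with an `entities` list. A minimal sketch of how such a file could be tallied afterwards is shown below. It reads two hand-written records from an in-memory buffer instead of the real file, and the date format, the sample values, and the helper `count_entities` are all assumptions made for illustration.

```python
import io
import json
from collections import Counter

# Two hand-written records in a JSON Lines layout like the saved file (illustrative values only).
jsonl = io.StringIO(
    '{"publish_date": "2009-01-20", "entities": [["Barack Obama", "PER"], ["America", "LOC"]]}\n'
    '{"publish_date": "2009-02-24", "entities": [["Congress", "ORG"], ["America", "LOC"]]}\n'
)


def count_entities(lines):
    """Count how often each (word, label) pair occurs across all records."""
    counts = Counter()
    for line in lines:
        record = json.loads(line)
        # JSON has no tuple type, so each (word, label) pair comes back as a two-item list.
        counts.update((word, label) for word, label in record["entities"])
    return counts


entity_counts = count_entities(jsonl)
print(entity_counts.most_common(1))
```

To run this against the real output, the buffer would be replaced by an open file handle on the saved `.json` file.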



