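We will use a pre-trained NER model from Hugging Face. The original choice was dbmdz/bert-large-cased-finetuned-conll03-english, a BERT-large model fine-tuned for Named Entity Recognition (NER) on the CoNLL-2003 dataset. The code below keeps that model commented out and instead loads elastic/distilbert-base-cased-finetuned-conll03-english, a smaller DistilBERT model fine-tuned on the same dataset. Both recognize four entity types: PER (Person), ORG (Organization), LOC (Location), and MISC (Miscellaneous).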

https://www.aimodels.fyi/models/huggingFace/bert-large-cased-finetuned-conll03-english-dbmdz

We’ll apply the model to the preprocessed text of each dataset and save the extracted entities alongside the publish date.

In [33]:
from transformers import pipeline
import pandas as pd
from tqdm import tqdm
from collections import Counter
from itertools import chain
import ast
import re

In [2]:
#ner_model = pipeline("ner", 
#                     model="dbmdz/bert-large-cased-finetuned-conll03-english", 
#                     aggregation_strategy="simple")

ner_model = pipeline("ner", 
                     model="elastic/distilbert-base-cased-finetuned-conll03-english",  # DistilBERT fine-tuned on CoNLL-03 NER
                     aggregation_strategy="simple")

If a warning about unused weights appears when the model loads, this is expected behavior for the pre-trained NER model.
The warning simply means that checkpoint weights related to tasks other than NER (like sequence classification) are not being used, which is correct for our purpose of NER.
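
To make the `aggregation_strategy="simple"` setting more concrete, here is a simplified, word-level sketch of how per-token BIO tags can be merged into entity groups. It is not the library's implementation, which works on subword tokens and also tracks scores and character offsets. The function name `group_entities` and the sample tags are illustrative assumptions.

```python
def group_entities(tagged_words):
    """Merge consecutive (word, BIO-tag) pairs into (text, label) entity groups."""
    groups = []
    current_words, current_label = [], None

    def flush():
        # Close the entity currently being built, if there is one.
        if current_words:
            groups.append((" ".join(current_words), current_label))

    for word, tag in tagged_words:
        if tag == "O":
            # A non-entity word ends any open entity.
            flush()
            current_words, current_label = [], None
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "B" or label != current_label:
            # A B- tag or a change of label starts a new entity.
            flush()
            current_words, current_label = [word], label
        else:
            # An I- tag with the same label continues the open entity.
            current_words.append(word)
    flush()
    return groups


# Hand-written tags for illustration only, not model output.
sample = [("Barack", "B-PER"), ("Obama", "I-PER"), ("visited", "O"), ("Berlin", "B-LOC")]
grouped = group_entities(sample)
```

With grouping enabled, downstream code can read one label per entity (`entity_group`) instead of reassembling word pieces by hand.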

In [10]:
# Load the preprocessed datasets (JSON Lines, one record per line)
df_obama = pd.read_json('../data/preprocessed/obama_preprocessed.json', lines=True)
df_trump = pd.read_json('../data/preprocessed/trump_preprocessed.json', lines=True)
df_biden = pd.read_json('../data/preprocessed/biden_preprocessed.json', lines=True)

# Keep only the needed columns and draw a reproducible sample of 7,000 rows per dataset
df_obama = df_obama[['publish_date', 'processed_text']].sample(n=7000, random_state=42)
df_trump = df_trump[['publish_date', 'processed_text']].sample(n=7000, random_state=42)
df_biden = df_biden[['publish_date', 'processed_text']].sample(n=7000, random_state=42)
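
Note that `sample(n=7000)` raises a `ValueError` if a dataset has fewer than 7,000 rows. A minimal sketch of a guarded version follows; the helper name `sample_rows` and the tiny demo frame are hypothetical and only illustrate the idea.

```python
import pandas as pd


def sample_rows(df, n=7000, seed=42, columns=("publish_date", "processed_text")):
    """Return up to n randomly sampled rows, restricted to the given columns."""
    subset = df[list(columns)]
    # Cap n at the number of available rows so sample() never raises.
    return subset.sample(n=min(n, len(subset)), random_state=seed)


# Made-up rows for illustration only.
demo = pd.DataFrame({
    "publish_date": ["2020-01-01", "2020-01-02", "2020-01-03"],
    "processed_text": ["a", "b", "c"],
    "extra": [1, 2, 3],
})
small = sample_rows(demo, n=2)      # fewer rows than available
capped = sample_rows(demo, n=7000)  # more rows requested than available
```

Fixing `random_state` keeps the sample identical across runs, which matters when the NER step takes several minutes per dataset.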


In [7]:
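def extract_entities(texts):
    """Run NER on each text and return one list of (word, entity_group) tuples per text."""
    entities_list = []
    for text in tqdm(texts, desc="Processing NER", ncols=100):
        # With aggregation_strategy="simple", each result is a dict describing one grouped entity
        entities = ner_model(text)
        entities_list.append([(ent['word'], ent['entity_group']) for ent in entities])
    return entities_list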
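
The loop above calls the model once per text. Transformers pipelines also accept a list of texts together with a `batch_size` argument, which may reduce per-call overhead, especially on a GPU. The sketch below is one possible batched variant, not the method used in this notebook. It assumes the pipeline returns one list of entity dicts per input text when given a list. The names `extract_entities_batched` and `fake_ner` are hypothetical, and `fake_ner` only stands in for `ner_model` so the example runs without downloading a model.

```python
def extract_entities_batched(texts, ner, batch_size=16):
    """Return one list of (word, entity_group) tuples per input text."""
    texts = list(texts)
    entities_list = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        # Assumption: a list input yields one list of entity dicts per text.
        results = ner(chunk, batch_size=batch_size)
        for doc in results:
            entities_list.append([(ent["word"], ent["entity_group"]) for ent in doc])
    return entities_list


def fake_ner(chunk, batch_size=1):
    """Stand-in for the real pipeline: tags the first word of each non-empty text as PER."""
    return [[{"word": t.split()[0], "entity_group": "PER"}] if t else [] for t in chunk]


demo = extract_entities_batched(["Obama spoke", "", "Biden replied"], fake_ner, batch_size=2)
```

With the real pipeline, the call would look like `extract_entities_batched(df_obama['processed_text'], ner_model)`. A suitable `batch_size` depends on the hardware and on text length.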

In [8]:
df_obama['entities'] = extract_entities(df_obama['processed_text'])
df_obama = df_obama.drop(columns=['processed_text'])

df_obama.to_json('../data/entities/DistilBERTobama_entities.json', orient='records', lines=True)
print("NER applied and entities saved for Obama dataset.")

Processing NER: 100%|███████████████████████████████████████████| 7000/7000 [16:40<00:00,  7.00it/s]

NER applied and entities saved for Obama dataset.


In [11]:
# Trump dataset
df_trump['entities'] = extract_entities(df_trump['processed_text'])
df_trump = df_trump.drop(columns=['processed_text'])

df_trump.to_json('../data/entities/DistilBERTtrump_entities.json', orient='records', lines=True)
print("NER applied and entities saved for Trump dataset.")


# Biden dataset
df_biden['entities'] = extract_entities(df_biden['processed_text'])
df_biden = df_biden.drop(columns=['processed_text'])

df_biden.to_json('../data/entities/DistilBERTbiden_entities.json', orient='records', lines=True)
print("NER applied and entities saved for Biden dataset.")


Processing NER: 100%|███████████████████████████████████████████| 7000/7000 [14:55<00:00,  7.82it/s]

NER applied and entities saved for Trump dataset.

Processing NER: 100%|███████████████████████████████████████████| 7000/7000 [14:03<00:00,  8.30it/s]

NER applied and entities saved for Biden dataset.
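
One detail to keep in mind when the saved files are read back: JSON has no tuple type, so each `(word, entity_group)` pair is stored as a two-element list. The sketch below shows the round trip on a single made-up record using only the standard library, then counts entity pairs with `Counter` and `chain`. The record contents are illustrative, not taken from the datasets.

```python
import json
from collections import Counter
from itertools import chain

# One record shaped like a line in the saved JSON Lines files (values are made up).
line = json.dumps({
    "publish_date": "2020-01-01",
    "entities": [("Obama", "PER"), ("Senate", "ORG"), ("Obama", "PER")],
})

record = json.loads(line)
# After loading, each pair is a list, which is unhashable and cannot be counted directly.
pairs_per_row = [[tuple(pair) for pair in record["entities"]]]

# Flatten all rows and count how often each (word, label) pair occurs.
counts = Counter(chain.from_iterable(pairs_per_row))
top = counts.most_common(1)
```

Converting the pairs back to tuples is what makes them usable as `Counter` keys.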



