This notebook runs the fine-tuned BERT model on newspaper content collected from the Library of Congress and labels three named-entity types: Person, Location, and Organization. It then stores the labeled data in a JSON file.

Code attribution: most of the code here was written or debugged by Claude Sonnet 4.5. See each individual code cell for detailed attribution.

In [1]:
# Load the fine-tuned BERT model
'''Code attribution: the entire code cell was written by Claude Sonnet 4.5, with edits and debugging by me'''
import json
import time

import torch
from tqdm import tqdm
from transformers import AutoModelForTokenClassification, AutoTokenizer

modelPath = "/Users/Jerry/Desktop/CS372/CoNLLFine-TunedBERT1BestPerforming"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelForTokenClassification.from_pretrained(modelPath)
model.to(device)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(modelPath)

print(f"Model loaded successfully from {modelPath}")
print(f"Number of labels: {model.config.num_labels}")

Model loaded successfully from /Users/Jerry/Desktop/CS372/CoNLLFine-TunedBERT1BestPerforming
Number of labels: 9


In [2]:
# Specify the path to the JSON file that stores the newspaper text
inputJsonFile = "/Users/Jerry/Desktop/CS372/FinalProject/data/newspaperCleanedContent.json"  

# Also specify the path of the JSON file that will store the labeled entities
outputJsonFile = "entitiesOutput.json"
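
For orientation, the processing code later in this notebook reads the input as a mapping from document title to text and writes a mapping from title to a list of [entity_text, entity_type] pairs. Below is a minimal sketch of both shapes. The title, text, and entities are made-up sample values, not taken from the dataset.

```python
import json

# Hypothetical sample input: {title: text}. Real titles and text will differ.
sampleInput = {
    "Example Gazette, page 1": "Mr. Smith of Boston visited the Red Cross.",
}

# Shape of the output file: {title: [[entity_text, entity_type], ...]}
sampleOutput = {
    "Example Gazette, page 1": [
        ["Smith", "PERSON"],
        ["Boston", "LOCATION"],
        ["Red Cross", "ORGANIZATION"],
    ],
}

# Round-trip through JSON with the same options the notebook uses when saving
serialized = json.dumps(sampleOutput, indent=4, ensure_ascii=False)
restored = json.loads(serialized)
print(restored["Example Gazette, page 1"][0])
```

Storing each entity as a two-item list keeps the file plain JSON, since JSON has no tuple type.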

In [3]:
# Load label map
'''Code attribution: from CoNLL 2003 dataset Hugging Face documentation: https://huggingface.co/datasets/eriktks/conll2003'''
id2label = {
    0: "O",
    1: "B-PER",
    2: "I-PER",
    3: "B-ORG",
    4: "I-ORG",
    5: "B-LOC",
    6: "I-LOC",
    7: "B-MISC",
    8: "I-MISC"
}
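
The `B-` prefix marks the first token of an entity and `I-` marks a continuation of the same entity, while `O` means "not an entity". The sketch below illustrates how such a tag sequence groups into spans. It is a simplified, word-level illustration: `groupBioTags` is a hypothetical helper, it keeps every entity type (including MISC), and it joins words with spaces instead of slicing the original text by character offsets.

```python
# Same label map as above, repeated so the sketch is self-contained
id2label = {0: "O", 1: "B-PER", 2: "I-PER", 3: "B-ORG", 4: "I-ORG",
            5: "B-LOC", 6: "I-LOC", 7: "B-MISC", 8: "I-MISC"}


def groupBioTags(words, labelIds):
    """Group word-level BIO predictions into (text, type) spans."""
    spans = []
    currentWords, currentType = [], None
    for word, labelId in zip(words, labelIds):
        label = id2label.get(labelId, "O")
        prefix, _, entityType = label.partition("-")
        if prefix == "I" and currentWords and entityType == currentType:
            currentWords.append(word)  # continue the open span
            continue
        if currentWords:  # any other label closes the open span
            spans.append((" ".join(currentWords), currentType))
            currentWords, currentType = [], None
        if prefix == "B":  # start a new span
            currentWords, currentType = [word], entityType
    if currentWords:  # flush a span that runs to the end of the input
        spans.append((" ".join(currentWords), currentType))
    return spans


words = ["Abraham", "Lincoln", "spoke", "in", "New", "York", "."]
labelIds = [1, 2, 0, 0, 5, 6, 0]
print(groupBioTags(words, labelIds))
```

An `I-` tag with no open span, or with a type that differs from the open span, does not start a new entity in this sketch.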


In [4]:
# This code cell runs the fine-tuned BERT model on the 529 newspaper documents and stores the NER labels in a JSON file.
'''Code attribution: most of the two functions ("extractEntities" and "processJsonFile") was written by Claude Sonnet 4.5. I changed variable data types and file paths and added inference-time reporting.'''

def extractEntities(text, model, tokenizer, device):
    """
    Extract named entities from text using the fine-tuned BERT model.
    Returns a list of [entity_text, entity_type] pairs.
    """
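    # Tokenize the text. Note: truncation=True keeps only the first
    # tokenizer.model_max_length tokens (typically 512 for BERT), so text
    # beyond that limit is never labeled.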
    encoding = tokenizer(
        text,
        truncation=True,
        padding="max_length",
        return_tensors="pt",
        return_offsets_mapping=True
    )

    offsetMapping = encoding.pop("offset_mapping").squeeze().tolist()
    inputIds = encoding["input_ids"].to(device)
    attentionMask = encoding["attention_mask"].to(device)

    # Run inference
    with torch.no_grad():
        outputs = model(inputIds, attention_mask=attentionMask)
        predictions = outputs.logits.argmax(dim=-1).squeeze().tolist()

    # Convert tokens back to words and extract entities
    tokens = tokenizer.convert_ids_to_tokens(inputIds.squeeze().tolist())

    entities = []
    currentEntity = []
    currentEntityType = None
    currentEntityOffsets = []

    for token, predId, (start, end) in zip(tokens, predictions, offsetMapping):
        # Skip special tokens and padding
        if token in ["[CLS]", "[SEP]", "[PAD]"] or start == end == 0:
            # Save any ongoing entity before skipping
            if currentEntity:
                entityText = text[currentEntityOffsets[0][0]:currentEntityOffsets[-1][1]]
                entities.append([entityText, currentEntityType])
                currentEntity = []
                currentEntityType = None
                currentEntityOffsets = []
            continue

        label = id2label.get(predId, "O")

        if label.startswith("B-"):  # Beginning of new entity
            # Save previous entity if exists
            if currentEntity:
                entityText = text[currentEntityOffsets[0][0]:currentEntityOffsets[-1][1]]
                entities.append([entityText, currentEntityType])

            # Start new entity
            entityType = label[2:]  # Remove "B-" prefix
            if entityType in ["PER", "ORG", "LOC"]:  # Only keep the three target types
                currentEntity = [token]
                currentEntityType = "PERSON" if entityType == "PER" else \
                                     "ORGANIZATION" if entityType == "ORG" else "LOCATION"
                currentEntityOffsets = [(start, end)]
            else:
                currentEntity = []
                currentEntityType = None
                currentEntityOffsets = []

        elif label.startswith("I-") and currentEntity:  # Inside ongoing entity
            entityType = label[2:]
            expectedType = "PER" if currentEntityType == "PERSON" else \
                           "ORG" if currentEntityType == "ORGANIZATION" else "LOC"

            if entityType == expectedType:  # Continue same entity
                currentEntity.append(token)
                currentEntityOffsets.append((start, end))
            else:  # Type mismatch, save previous and reset
                entityText = text[currentEntityOffsets[0][0]:currentEntityOffsets[-1][1]]
                entities.append([entityText, currentEntityType])
                currentEntity = []
                currentEntityType = None
                currentEntityOffsets = []

        else:  # "O" or other label
            if currentEntity:
                entityText = text[currentEntityOffsets[0][0]:currentEntityOffsets[-1][1]]
                entities.append([entityText, currentEntityType])
                currentEntity = []
                currentEntityType = None
                currentEntityOffsets = []

    # Don't forget the last entity
    if currentEntity:
        entityText = text[currentEntityOffsets[0][0]:currentEntityOffsets[-1][1]]
        entities.append([entityText, currentEntityType])

    return entities


def processJsonFile(inputFile, outputFile, model, tokenizer, device):
    """
    Process the entire JSON file and extract entities for all texts.
    """
    # Load input JSON
    print(f"Loading {inputFile}...")
    with open(inputFile, 'r', encoding='utf-8') as f:
        data = json.load(f)

    # Process each text
    results = {}
    print(f"Processing {len(data)} documents...")

    totalInferenceTime = 0
    totalDocumentsProcessed = 0
    totalWordCount = 0
    for title, text in tqdm(data.items(), desc="Extracting entities"):
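        # perf_counter is a monotonic, high-resolution clock suited to timing intervals.
        # The measured interval covers tokenization, the forward pass, and entity decoding.
        startTime = time.perf_counter()
        entities = extractEntities(text, model, tokenizer, device)
        endTime = time.perf_counter()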
        totalWordCount += len(text.split())
        inferenceTime = endTime - startTime
        totalInferenceTime += inferenceTime
        totalDocumentsProcessed += 1
        results[title] = entities

    # Save results
    print(f"Saving results to {outputFile}...")
    with open(outputFile, 'w', encoding='utf-8') as f:
        json.dump(results, f, indent=4, ensure_ascii=False)

    print(f"Done! Processed {len(results)} documents.")
    print(f"Total entities extracted: {sum(len(v) for v in results.values())}")

    # Print some statistics
    entityCounts = {"PERSON": 0, "ORGANIZATION": 0, "LOCATION": 0}
    for entities in results.values():
        for entity, entityType in entities:
            entityCounts[entityType] = entityCounts.get(entityType, 0) + 1

    print("\nEntity type breakdown:")
    for entityType, count in entityCounts.items():
        print(f"  {entityType}: {count}")

    return totalInferenceTime, totalDocumentsProcessed, totalWordCount


# Run the processing

totalInferenceTime, totalDocumentsProcessed, totalWordCount = processJsonFile(inputJsonFile, outputJsonFile, model, tokenizer, device)

Loading /Users/Jerry/Desktop/CS372/FinalProject/data/newspaperCleanedContent.json...
Processing 529 documents...


Extracting entities: 100%|██████████| 529/529 [02:42<00:00,  3.26it/s]

Saving results to entitiesOutput.json...
Done! Processed 529 documents.
Total entities extracted: 8347

Entity type breakdown:
  PERSON: 2284
  ORGANIZATION: 869
  LOCATION: 5194
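
Because the tokenizer call in `extractEntities` uses `truncation=True`, each document is cut to the tokenizer's maximum length before inference. One common way to cover longer documents is to split the token sequence into overlapping windows and run the model on each window. The sketch below shows only the windowing arithmetic on a plain list. `chunkWithOverlap`, `maxLen`, and `stride` are hypothetical names, not part of this notebook, and the default of 510 assumes a 512-token limit minus the [CLS] and [SEP] tokens.

```python
def chunkWithOverlap(items, maxLen=510, stride=128):
    """Split a sequence into (offset, window) pairs of at most maxLen items.

    Consecutive windows share `stride` items, so an entity cut at one
    window boundary can appear whole in the next window.
    """
    if maxLen <= stride:
        raise ValueError("maxLen must be larger than stride")
    if not items:
        return []
    step = maxLen - stride
    chunks = []
    start = 0
    while True:
        chunks.append((start, items[start:start + maxLen]))
        if start + maxLen >= len(items):  # this window reached the end
            break
        start += step
    return chunks


# Stand-in for a long document's token IDs
tokenIds = list(range(1200))
windows = chunkWithOverlap(tokenIds)
for offset, window in windows:
    print(offset, len(window))
```

Entities found in the overlapping region would need de-duplication before counting. The Hugging Face tokenizer also accepts `return_overflowing_tokens=True` together with `stride`, which may make a manual helper unnecessary. This notebook does not use that option.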





In [23]:
# Find inference time:
'''Code attribution: all code is mine'''

averageTimeToProcessOneNewspaper = totalInferenceTime / totalDocumentsProcessed
averageWordCountPerDocument = totalWordCount / totalDocumentsProcessed
print(f"Average inference time to process one newspaper: {averageTimeToProcessOneNewspaper} when average word count per document is: {averageWordCountPerDocument}")


Average inference time to process one newspaper: 0.023015342723219066 when average word count per document is: 2923.742911153119


I calculate the inference time per document by averaging the per-document inference times over all documents in my dataset. To put the average in perspective, I also calculate the average word count per document.

Average inference time: 0.023 sec per document.
Average word count per document: 2924 words.
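
As a quick consistency check, the timed average can be converted into a throughput and compared with the rate shown on the progress bar. The sketch below uses only figures printed earlier in this notebook. The variable names are illustrative.

```python
# Figures printed earlier in this notebook
avgSecondsPerDoc = 0.023015342723219066  # from the timing cell
numDocuments = 529                       # from the processing output
progressBarRate = 3.26                   # documents per second shown by tqdm

docsPerSecond = 1 / avgSecondsPerDoc
totalSeconds = avgSecondsPerDoc * numDocuments
progressBarSecondsPerDoc = 1 / progressBarRate

print(f"Documents per second (timed): {docsPerSecond:.1f}")
print(f"Summed inference time: {totalSeconds:.1f} s")
print(f"Seconds per document implied by tqdm: {progressBarSecondsPerDoc:.3f}")
```

The two per-document figures differ here (about 0.023 s timed versus about 0.307 s implied by the progress bar). One possible explanation, stated as an assumption, is that the progress-bar output and the timing cell come from different runs or hardware. Re-running both cells in the same session would confirm which figure applies.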

Hardware specs: the model was run on an A100 GPU with 80 GB of GPU RAM, through a Google Colab Pro for Education account.