A model's training data strongly affects how well it performs on your specific task or dataset.

Hugging Face (HF): Performance varies with the model you choose. Larger, more general models can be slower and may be less accurate in specialized domains. Fine-tuning a model on your own annotated dataset may yield better results for specialized tasks.


In [None]:
from transformers import pipeline
import pandas as pd

Experiment with 2 Different HF Models:

In [15]:
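# Load the Hugging Face NER pipeline; aggregation_strategy="simple" groups
# sub-word tokens into whole entity spans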
nlp = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

# Load text data
data = pd.read_csv('all-the-news-2-1.csv', nrows=10)
text_column = 'article'

def process_ner(text):
    # Apply NER
    ner_results = nlp(text)
    
    # Format the results
    formatted_results = []
    for ent in ner_results:
        formatted_results.append(f"{ent['word']} ({ent['entity_group']})")
    
    return ', '.join(formatted_results)

# Open a text file to write the results
with open('ner_results.txt', 'w', encoding='utf-8') as f:
    # Process each row
    for i, row in data.iterrows():
        text = row[text_column]
        
        # Write the row number
        f.write(f"NER Results for Article {i}:\n")
        
        # Process the text and write the results
        results = process_ner(text)
        f.write(results + '\n\n')
        
        # Add a separator between rows
        f.write('=' * 50 + '\n\n')

print("NER results have been saved to 'ner_results.txt'")

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


NER results have been saved to 'ner_results.txt'
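
The text file is convenient for reading, but a per-label count is often easier to compare across models. The following is a minimal sketch, assuming the aggregated pipeline output is a list of dicts with `entity_group` and `score` keys (as used in `process_ner` above). `summarize_entities`, the `min_score` threshold, and the sample data are hypothetical and only for illustration.

```python
from collections import Counter

def summarize_entities(ner_results, min_score=0.0):
    """Count entity labels in aggregated NER output, skipping low-confidence spans."""
    kept = [ent for ent in ner_results if float(ent["score"]) >= min_score]
    return Counter(ent["entity_group"] for ent in kept)

# Hypothetical sample shaped like the pipeline output (not real model results)
sample_results = [
    {"entity_group": "PER", "score": 0.99, "word": "Angela Merkel", "start": 0, "end": 13},
    {"entity_group": "LOC", "score": 0.98, "word": "Berlin", "start": 25, "end": 31},
    {"entity_group": "ORG", "score": 0.55, "word": "EU", "start": 40, "end": 42},
    {"entity_group": "LOC", "score": 0.97, "word": "Germany", "start": 50, "end": 57},
]

label_counts = summarize_entities(sample_results, min_score=0.9)
print(label_counts)
```

With the real pipeline you would call `summarize_entities(nlp(text))` for each article. The threshold is a judgment call and worth tuning on a few hand-checked articles.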


In [14]:
from spacy import displacy

# Load the Hugging Face pipeline for NER
nlp = pipeline("ner", model="jean-baptiste/roberta-large-ner-english", aggregation_strategy="simple")

# Load text data
data = pd.read_csv('all-the-news-2-1.csv', nrows=10)
text_column = 'article'

def process_and_visualize(text, article):
    # Apply NER
    ner_results = nlp(text)
    
    # Convert Hugging Face NER results to spaCy's format
    ents = [
        {
            'start': ent['start'],
            'end': ent['end'],
            'label': ent['entity_group']
        }
        for ent in ner_results
    ]
    spacy_format = {"text": text, "ents": ents}
    
    print(f"NER Results for Article {article}:")
    displacy.render(spacy_format, style="ent", manual=True, jupyter=True)
    print("\n" + "="*50 + "\n")  # Separator between rows

# Process each row
for i, row in data.iterrows():
    text = row[text_column]
    process_and_visualize(text, i)

NER Results for Articles 0 to 9: [displaCy entity visualizations, one per article; the rendered output is not preserved in this export]
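
Since the goal is to experiment with two models, it can help to compare their predictions on the same text directly. Below is a rough sketch, assuming both pipelines return aggregated results with `start`, `end`, and `entity_group` keys and that their label names are comparable. `entity_spans`, `compare_models`, and the sample results are hypothetical and not output from either model.

```python
def entity_spans(ner_results):
    """Reduce aggregated NER output to a set of (start, end, label) tuples."""
    return {(ent["start"], ent["end"], ent["entity_group"]) for ent in ner_results}

def compare_models(results_a, results_b):
    """Split entity spans into those found by both models and by only one of them."""
    spans_a, spans_b = entity_spans(results_a), entity_spans(results_b)
    return {
        "both": sorted(spans_a & spans_b),
        "only_a": sorted(spans_a - spans_b),
        "only_b": sorted(spans_b - spans_a),
    }

# Hypothetical results for the text "Tim Cook visited Paris for Apple."
results_a = [
    {"entity_group": "PER", "word": "Tim Cook", "start": 0, "end": 8},
    {"entity_group": "LOC", "word": "Paris", "start": 17, "end": 22},
]
results_b = [
    {"entity_group": "PER", "word": "Tim Cook", "start": 0, "end": 8},
    {"entity_group": "LOC", "word": "Paris", "start": 17, "end": 22},
    {"entity_group": "ORG", "word": "Apple", "start": 27, "end": 32},
]

comparison = compare_models(results_a, results_b)
print(comparison)
```

Exact span matching is strict: two models that disagree by one character on a boundary count as a mismatch, so the `only_a` and `only_b` lists are worth inspecting by hand before drawing conclusions.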




