# Named Entity Recognition and Classification (NERC) Analysis

# NLP Final Project - NERC Component


**Systems to Compare**:
- System 1: spaCy's transformer-based model (`en_core_web_trf`)
- System 2: BERT-based model (`dslim/bert-base-NER`) via Hugging Face

**Training Datasets**:
- CoNLL-2003
- WikiANN (English)
- WNUT-17

**Evaluation**: Performance comparison using precision, recall, F1-score, and error analysis.

**Importing necessary libraries**

In [21]:
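import os
import random
from collections import Counter
from typing import List

import numpy as np
import pandas as pd
import plotly.graph_objects as go
import spacy
import torch
from datasets import load_dataset
from plotly.subplots import make_subplots
# seqeval's classification_report scores entities from lists of tag sequences;
# sklearn's version expects flat label arrays and fails on nested lists.
from seqeval.metrics import (
    accuracy_score, classification_report, f1_score, precision_score, recall_score
)
from spacy.training import Example
from transformers import (
    AutoModelForTokenClassification, AutoTokenizer,
    DataCollatorForTokenClassification, TrainingArguments, pipeline
)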


test_data = pd.read_csv('datasets/NER-test.tsv', sep='\t')
test_data.drop('sentence_id', axis=1, inplace=True)
print(test_data.head())

   token_id     token BIO_NER_tag
0         0        If           O
1         1    you're           O
2         2  visiting           O
3         3     Paris  B-LOCATION
4         4         ,           O
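
The test file stores one token per row, while both NER systems and seqeval operate on whole sentences. The following is a minimal sketch of regrouping rows into sentences, assuming the `sentence_id` column is kept rather than dropped and that rows of one sentence are contiguous. The sample rows are hypothetical and only mimic the layout shown above; the standard-library `csv` module is used so the sketch runs on its own.

```python
import csv
import io
from itertools import groupby

# Hypothetical rows in the same tab-separated layout as NER-test.tsv.
SAMPLE_TSV = (
    "sentence_id\ttoken_id\ttoken\tBIO_NER_tag\n"
    "0\t0\tIf\tO\n"
    "0\t1\tyou're\tO\n"
    "0\t2\tvisiting\tO\n"
    "0\t3\tParis\tB-LOCATION\n"
    "1\t0\tEnjoy\tO\n"
    "1\t1\tit\tO\n"
)

def read_sentences(handle):
    """Return a list of (tokens, tags) pairs, one pair per sentence_id."""
    # QUOTE_NONE keeps tokens that are themselves quote characters intact.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    sentences = []
    # groupby only merges adjacent rows, hence the contiguity assumption.
    for _, rows in groupby(reader, key=lambda row: row["sentence_id"]):
        rows = list(rows)
        tokens = [row["token"] for row in rows]
        tags = [row["BIO_NER_tag"] for row in rows]
        sentences.append((tokens, tags))
    return sentences

sentences = read_sentences(io.StringIO(SAMPLE_TSV))
print(sentences[0])
```

With the real file, the same function could be called on `open('datasets/NER-test.tsv', newline='', encoding='utf-8')`.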


**Dataset loading and preprocessing**


In [26]:
class NERDatasetLoader:
    def __init__(self):
        self.datasets = {}
        self.label_mappings = {}
    
    def load_conll2003(self):
        """Load CoNLL-2003 dataset"""
        print("Loading CoNLL-2003 dataset...")
        dataset = load_dataset("conll2003")
        
        # Extract labels
        labels = dataset["train"].features["ner_tags"].feature.names
        self.label_mappings['conll2003'] = {i: label for i, label in enumerate(labels)}
        
        self.datasets['conll2003'] = {
            'train': dataset['train'],
            'validation': dataset['validation'],
            'test': dataset['test'],
            'labels': labels
        }
        print(f"CoNLL-2003 loaded. Labels: {labels}")
        return self.datasets['conll2003']
    
    def load_wikiann(self, language='en'):
        """Load WikiANN dataset for English"""
        print(f"Loading WikiANN ({language}) dataset...")
        dataset = load_dataset("wikiann", language)
        
        # Extract labels
        labels = dataset["train"].features["ner_tags"].feature.names
        self.label_mappings['wikiann'] = {i: label for i, label in enumerate(labels)}
        
        self.datasets['wikiann'] = {
            'train': dataset['train'],
            'validation': dataset['validation'],
            'test': dataset['test'],
            'labels': labels
        }
        print(f"WikiANN ({language}) loaded. Labels: {labels}")
        return self.datasets['wikiann']
    
    def load_wnut17(self):
        """Load WNUT-17 dataset"""
        print("Loading WNUT-17 dataset...")
        dataset = load_dataset("wnut_17")
        
        # Extract labels
        labels = dataset["train"].features["ner_tags"].feature.names
        self.label_mappings['wnut17'] = {i: label for i, label in enumerate(labels)}
        
        self.datasets['wnut17'] = {
            'train': dataset['train'],
            'validation': dataset['validation'],
            'test': dataset['test'],
            'labels': labels
        }
        print(f"WNUT-17 loaded. Labels: {labels}")
        return self.datasets['wnut17']
    
    def get_combined_labels(self):
        """Get all unique labels across datasets"""
        all_labels = set()
        for dataset_name, mapping in self.label_mappings.items():
            all_labels.update(mapping.values())
        return sorted(list(all_labels))

**Data preprocessing for training**

In [25]:
class NERDataPreprocessor:
    def __init__(self, tokenizer_name="bert-base-cased"):
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
        
    def align_labels_with_tokens(self, labels, word_ids):
        """Align labels with tokenized words"""
        new_labels = []
        current_word = None
        for word_id in word_ids:
            if word_id != current_word:
                current_word = word_id
                label = -100 if word_id is None else labels[word_id]
                new_labels.append(label)
            elif word_id is None:
                new_labels.append(-100)
            else:
                # Continuation sub-token of the same word. These datasets order
                # their labels as O, B-X, I-X, ..., so B- tags have odd ids and
                # the matching I- tag is the next id: B-X becomes I-X here.
                label = labels[word_id]
                if label % 2 == 1:
                    label += 1
                new_labels.append(label)
        return new_labels
    
    def tokenize_and_align_labels(self, examples, label_all_tokens=True):
        """Tokenize and align labels for BERT-style models"""
        tokenized_inputs = self.tokenizer(
            examples["tokens"], 
            truncation=True, 
            is_split_into_words=True,
            padding=True,
            max_length=512
        )
        
        labels = []
        for i, label in enumerate(examples["ner_tags"]):
            word_ids = tokenized_inputs.word_ids(batch_index=i)
            aligned_labels = self.align_labels_with_tokens(label, word_ids)
            labels.append(aligned_labels)
        
        tokenized_inputs["labels"] = labels
        return tokenized_inputs
    
    def prepare_spacy_data(self, dataset):
        """Prepare data for spaCy training"""
        training_data = []
        
        for example in dataset:
            tokens = example['tokens']
            ner_tags = example['ner_tags']
            
            # Convert to spaCy format
            entities = []
            start_pos = 0
            
            for i, (token, tag) in enumerate(zip(tokens, ner_tags)):
                if tag != 0:  # Not 'O' tag
                    tag_name = dataset.features['ner_tags'].feature.names[tag]
                    if tag_name.startswith('B-'):
                        entity_start = start_pos
                        entity_label = tag_name[2:]
                        entity_end = start_pos + len(token)
                        
                        # Check for I- tags following this B- tag
                        j = i + 1
                        # Compare the full tag so that e.g. I-PERSON never extends a PER entity
                        while j < len(ner_tags) and dataset.features['ner_tags'].feature.names[ner_tags[j]] == f'I-{entity_label}':
                            entity_end = start_pos + len(' '.join(tokens[i:j+1]))
                            j += 1
                        
                        entities.append((entity_start, entity_end, entity_label))
                
                start_pos += len(token) + 1  # +1 for space
            
            text = ' '.join(tokens)
            training_data.append((text, {"entities": entities}))
        
        return training_data
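
As a worked example of sub-word label alignment, the sketch below expands word-level label ids to sub-token level. It is a self-contained version of the rule, not the tokenizer's own API: the `word_ids` list is written by hand to mimic what a fast tokenizer might return, and the three-way split of "Cupertino" is an assumption made for illustration. Label ids follow the CoNLL-2003 order.

```python
# CoNLL-2003 label order: B- tags have odd ids, the matching I- tag is id + 1.
LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

def align_labels(word_labels, word_ids):
    """Expand word-level label ids to sub-token level; -100 marks ignored positions."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:            # special tokens such as [CLS] / [SEP]
            aligned.append(-100)
        elif word_id != previous:      # first sub-token keeps the word's label
            aligned.append(word_labels[word_id])
        else:                          # later sub-tokens: B-X becomes I-X
            label = word_labels[word_id]
            aligned.append(label + 1 if label % 2 == 1 else label)
        previous = word_id
    return aligned

# Words: ["Visit", "Cupertino", "today"] with labels O, B-LOC, O.
word_labels = [0, 5, 0]
# Hand-written word_ids: "Cupertino" is assumed to split into three sub-tokens.
word_ids = [None, 0, 1, 1, 1, 2, None]

aligned = align_labels(word_labels, word_ids)
print([LABELS[i] if i != -100 else "IGN" for i in aligned])
# ['IGN', 'O', 'B-LOC', 'I-LOC', 'I-LOC', 'O', 'IGN']
```

Only the first sub-token of an entity keeps its `B-` tag, so the model never sees two consecutive `B-LOC` labels inside a single word.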

## 2. Dataset Loading and Exploration

### 2.1 Load Training Datasets from Hugging Face

In [7]:
# Load datasets from Hugging Face
print("Loading datasets...")

# CoNLL-2003 dataset
conll_dataset = load_dataset("conll2003")
print(f"✅ CoNLL-2003 loaded: {conll_dataset}")

# WikiANN (English) dataset
wikiann_dataset = load_dataset("wikiann", "en")
print(f"✅ WikiANN (English) loaded: {wikiann_dataset}")

# WNUT-17 dataset
wnut_dataset = load_dataset("wnut_17")
print(f"✅ WNUT-17 loaded: {wnut_dataset}")

Loading datasets...


NameError: name 'load_dataset' is not defined

In [None]:
# Dataset exploration function
def explore_dataset(dataset, name):
    print(f"\n{'='*50}")
    print(f"📊 {name} Dataset Statistics")
    print(f"{'='*50}")
    
    # Basic info
    train_size = len(dataset['train']) if 'train' in dataset else 0
    test_size = len(dataset['test']) if 'test' in dataset else 0
    val_size = len(dataset['validation']) if 'validation' in dataset else 0
    
    print(f"Train samples: {train_size:,}")
    print(f"Test samples: {test_size:,}")
    print(f"Validation samples: {val_size:,}")
    
    # Example sample
    if 'train' in dataset and len(dataset['train']) > 0:
        example = dataset['train'][0]
        print(f"\n📝 Example sample:")
        print(f"Tokens: {example['tokens'][:10]}...")
        print(f"NER tags: {example['ner_tags'][:10]}...")
        
        # Get unique NER tags
        if 'train' in dataset:
            all_tags = []
            for sample in dataset['train']:
                all_tags.extend(sample['ner_tags'])
            unique_tags = set(all_tags)
            print(f"\n🏷️  Unique NER tags: {len(unique_tags)}")
            print(f"Tag distribution: {Counter(all_tags).most_common(10)}")

# Explore each dataset
explore_dataset(conll_dataset, "CoNLL-2003")
explore_dataset(wikiann_dataset, "WikiANN")
explore_dataset(wnut_dataset, "WNUT-17")

In [None]:
# Get NER tag mappings
def get_tag_mapping(dataset, dataset_name):
    """Extract NER tag mappings from dataset features"""
    try:
        if 'train' in dataset:
            features = dataset['train'].features
            if 'ner_tags' in features:
                tag_names = features['ner_tags'].feature.names
                return {i: tag for i, tag in enumerate(tag_names)}
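    except (AttributeError, KeyError) as e:
        # Report the problem instead of silently swallowing every exception
        print(f"Could not read tag names for {dataset_name}: {e}")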
    return None

# Get tag mappings for each dataset
conll_tags = get_tag_mapping(conll_dataset, "CoNLL-2003")
wikiann_tags = get_tag_mapping(wikiann_dataset, "WikiANN")
wnut_tags = get_tag_mapping(wnut_dataset, "WNUT-17")

print("🏷️  Tag Mappings:")
print(f"CoNLL-2003: {conll_tags}")
print(f"WikiANN: {wikiann_tags}")
print(f"WNUT-17: {wnut_tags}")

## 3. Dataset Preprocessing and Format Standardization

### 3.1 Convert to Standard BIO Format

In [None]:
def convert_to_bio_format(dataset, tag_mapping, dataset_name, max_samples=1000):
    """Convert the first max_samples training sentences to string BIO tags"""
    processed_data = []
    
    if 'train' in dataset:
        train = dataset['train']
        # Select the subset up front instead of converting the whole split
        # and discarding most of it afterwards (limit kept for demo purposes)
        subset = train.select(range(min(max_samples, len(train))))
        for sample in subset:
            tokens = sample['tokens']
            ner_tags = sample['ner_tags']
            
            # Convert numeric tags to string labels
            if tag_mapping:
                bio_tags = [tag_mapping[tag] for tag in ner_tags]
            else:
                bio_tags = [str(tag) for tag in ner_tags]
            
            processed_data.append({
                'tokens': tokens,
                'ner_tags': bio_tags,
                'sentence': ' '.join(tokens)
            })
    
    return processed_data

# Process datasets
conll_processed = convert_to_bio_format(conll_dataset, conll_tags, "CoNLL-2003")
wikiann_processed = convert_to_bio_format(wikiann_dataset, wikiann_tags, "WikiANN")
wnut_processed = convert_to_bio_format(wnut_dataset, wnut_tags, "WNUT-17")

print(f"✅ Processed datasets:")
print(f"CoNLL-2003: {len(conll_processed)} samples")
print(f"WikiANN: {len(wikiann_processed)} samples")
print(f"WNUT-17: {len(wnut_processed)} samples")

In [None]:
# Show examples from processed data
print("📝 Example processed samples:")
print("\n🔹 CoNLL-2003 Example:")
if conll_processed:
    example = conll_processed[0]
    print(f"Tokens: {example['tokens'][:10]}")
    print(f"NER Tags: {example['ner_tags'][:10]}")
    print(f"Sentence: {example['sentence'][:100]}...")

print("\n🔹 WikiANN Example:")
if wikiann_processed:
    example = wikiann_processed[0]
    print(f"Tokens: {example['tokens'][:10]}")
    print(f"NER Tags: {example['ner_tags'][:10]}")
    print(f"Sentence: {example['sentence'][:100]}...")

print("\n🔹 WNUT-17 Example:")
if wnut_processed:
    example = wnut_processed[0]
    print(f"Tokens: {example['tokens'][:10]}")
    print(f"NER Tags: {example['ner_tags'][:10]}")
    print(f"Sentence: {example['sentence'][:100]}...")

## 4. System 1: spaCy Transformer-based NER

### 4.1 Load spaCy Model

In [None]:
# Load spaCy transformer model
print("Loading spaCy transformer model...")
try:
    nlp_spacy = spacy.load("en_core_web_trf")
    print("✅ spaCy en_core_web_trf model loaded successfully!")
except OSError:
    print("❌ spaCy model not found. Installing...")
    # Use spaCy's own downloader so the model lands in the running environment
    spacy.cli.download("en_core_web_trf")
    nlp_spacy = spacy.load("en_core_web_trf")
    print("✅ spaCy model installed and loaded!")

# Test spaCy model
test_text = "Apple Inc. is based in Cupertino, California. Tim Cook is the CEO."
doc = nlp_spacy(test_text)

print(f"\n🧪 Testing spaCy on: '{test_text}'")
print("Entities found:")
for ent in doc.ents:
    print(f"  {ent.text} -> {ent.label_} ({ent.start_char}-{ent.end_char})")

In [None]:
from typing import List
from spacy.tokens import Doc

def process_with_spacy(texts: List[str]) -> List[List[str]]:
    """Process texts with spaCy NER and return one BIO tag per whitespace token"""
    results = []
    
    for text in texts:
        # Build the Doc from whitespace tokens so the predicted tags stay aligned
        # with the test-set tokens; spaCy's own tokenizer would split tokens such
        # as "you're" and shift every following tag.
        # Passing a Doc to nlp() requires spaCy >= 3.2.
        doc = nlp_spacy(Doc(nlp_spacy.vocab, words=text.split()))
        ner_tags = []
        
        # Convert spaCy entities to BIO format
        for token in doc:
            if token.ent_iob_ in ('B', 'I'):
                ner_tags.append(f"{token.ent_iob_}-{token.ent_type_}")
            else:
                ner_tags.append('O')
        
        results.append(ner_tags)
    
    return results

# Test the function
test_sentences = [
    "Apple Inc. is based in Cupertino, California.",
    "Tim Cook met with Elon Musk in New York.",
    "Microsoft announced new features yesterday."
]

spacy_results = process_with_spacy(test_sentences)
print("🧪 spaCy NER Results:")
for i, (sentence, tags) in enumerate(zip(test_sentences, spacy_results)):
    print(f"\nSentence {i+1}: {sentence}")
    print(f"Tags: {tags}")
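
`en_core_web_trf` predicts OntoNotes-style labels such as `PERSON`, `GPE`, and `ORG`, whereas the gold tags in this notebook use a different inventory (`PER`/`ORG`/`LOC` in the placeholder data, `LOCATION` in the TSV). Without a mapping step, an exact string comparison counts correctly detected entities as errors. The sketch below shows one way to harmonize the tags; the mapping table is an assumption (in particular folding `GPE` and `FAC` into `LOC`) and should be adjusted to the gold scheme actually used.

```python
# Assumed mapping from spaCy's OntoNotes-style labels to a PER/ORG/LOC/MISC scheme.
SPACY_TO_GOLD = {
    "PERSON": "PER",
    "ORG": "ORG",
    "GPE": "LOC",
    "LOC": "LOC",
    "FAC": "LOC",
    "NORP": "MISC",
    "EVENT": "MISC",
    "PRODUCT": "MISC",
    "WORK_OF_ART": "MISC",
    "LANGUAGE": "MISC",
}

def map_spacy_tags(tags, mapping=SPACY_TO_GOLD):
    """Rewrite BIO tags into the gold scheme; unmapped types (DATE, MONEY, ...) become 'O'."""
    mapped = []
    for tag in tags:
        if tag == "O":
            mapped.append("O")
            continue
        prefix, entity_type = tag.split("-", 1)
        target = mapping.get(entity_type)
        mapped.append(f"{prefix}-{target}" if target else "O")
    return mapped

predicted = ["B-PERSON", "I-PERSON", "O", "O", "B-GPE", "I-GPE", "O", "B-DATE"]
mapped_tags = map_spacy_tags(predicted)
print(mapped_tags)
# ['B-PER', 'I-PER', 'O', 'O', 'B-LOC', 'I-LOC', 'O', 'O']
```

Dropping types such as `DATE` avoids penalizing spaCy for entity classes the gold annotation does not contain.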

## 5. System 2: BERT-based NER with Hugging Face

### 5.1 Load BERT NER Model

In [None]:
# Load BERT-based NER model
print("Loading BERT NER model...")
bert_ner = pipeline("ner", 
                   model="dslim/bert-base-NER", 
                   tokenizer="dslim/bert-base-NER",
                   aggregation_strategy="simple")

print("✅ BERT NER model loaded successfully!")

# Test BERT model
test_text = "Apple Inc. is based in Cupertino, California. Tim Cook is the CEO."
bert_results = bert_ner(test_text)

print(f"\n🧪 Testing BERT NER on: '{test_text}'")
print("Entities found:")
for entity in bert_results:
    print(f"  {entity['word']} -> {entity['entity_group']} (confidence: {entity['score']:.3f})")

In [None]:
import re

def process_with_bert(texts: List[str]) -> List[List[str]]:
    """Process texts with BERT NER and return one BIO tag per whitespace token"""
    results = []
    
    for text in texts:
        # Aggregated entities carry character offsets ('start', 'end') into text
        # (available because the model ships with a fast tokenizer)
        entities = bert_ner(text)
        
        # Whitespace tokens with their character spans, used for alignment
        spans = [(m.start(), m.end()) for m in re.finditer(r'\S+', text)]
        ner_tags = ['O'] * len(spans)
        
        # Tag every token that overlaps an entity: B- for the first, I- for the rest.
        # Offsets avoid the false matches that substring comparison produces.
        for entity in entities:
            entity_label = entity['entity_group']
            first = True
            for i, (tok_start, tok_end) in enumerate(spans):
                overlaps = tok_start < entity['end'] and tok_end > entity['start']
                if overlaps and ner_tags[i] == 'O':  # Only set if not already tagged
                    ner_tags[i] = f"{'B' if first else 'I'}-{entity_label}"
                    first = False
        
        results.append(ner_tags)
    
    return results

# Test the function
bert_bio_results = process_with_bert(test_sentences)
print("🧪 BERT NER BIO Results:")
for i, (sentence, tags) in enumerate(zip(test_sentences, bert_bio_results)):
    print(f"\nSentence {i+1}: {sentence}")
    print(f"Tags: {tags}")

## 6. Test Data Processing

### 6.1 Create Test Dataset (Placeholder for NER-test.tsv)

In [None]:
# Create placeholder test data (replace with actual NER-test.tsv when available)
test_data = {
    'sentence_id': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3],
    'token': ['Apple', 'Inc.', 'announced', 'new', 'features', 'Tim', 'Cook', 'visited', 'New', 'York', 'Microsoft', 'acquired', 'a', 'startup', 'yesterday'],
    'true_label': ['B-ORG', 'I-ORG', 'O', 'O', 'O', 'B-PER', 'I-PER', 'O', 'B-LOC', 'I-LOC', 'B-ORG', 'O', 'O', 'O', 'O']
}

# Convert to DataFrame
test_df = pd.DataFrame(test_data)
print("📊 Test Data (Placeholder):")
print(test_df)

# Group by sentence
test_sentences_df = test_df.groupby('sentence_id').agg({
    'token': lambda x: list(x),
    'true_label': lambda x: list(x)
}).reset_index()

# Create sentence texts
test_sentences_df['sentence'] = test_sentences_df['token'].apply(lambda x: ' '.join(x))

print("\n📝 Test Sentences:")
for idx, row in test_sentences_df.iterrows():
    print(f"Sentence {row['sentence_id']}: {row['sentence']}")
    print(f"True labels: {row['true_label']}")
    print()

In [None]:
# Apply both NER systems to test sentences
test_sentences_list = test_sentences_df['sentence'].tolist()

print("🔄 Applying NER systems to test data...")

# System 1: spaCy
spacy_predictions = process_with_spacy(test_sentences_list)
print("✅ spaCy predictions completed")

# System 2: BERT
bert_predictions = process_with_bert(test_sentences_list)
print("✅ BERT predictions completed")

# Create results DataFrame
results_data = []
for i, row in test_sentences_df.iterrows():
    sentence_id = row['sentence_id']
    tokens = row['token']
    true_labels = row['true_label']
    spacy_pred = spacy_predictions[i] if i < len(spacy_predictions) else ['O'] * len(tokens)
    bert_pred = bert_predictions[i] if i < len(bert_predictions) else ['O'] * len(tokens)
    
    # Ensure all lists have the same length
    max_len = max(len(tokens), len(true_labels), len(spacy_pred), len(bert_pred))
    tokens.extend([''] * (max_len - len(tokens)))
    true_labels.extend(['O'] * (max_len - len(true_labels)))
    spacy_pred.extend(['O'] * (max_len - len(spacy_pred)))
    bert_pred.extend(['O'] * (max_len - len(bert_pred)))
    
    for j in range(len(tokens)):
        if tokens[j]:  # Only add non-empty tokens
            results_data.append({
                'sentence_id': sentence_id,
                'token': tokens[j],
                'true_label': true_labels[j],
                'spacy_pred': spacy_pred[j],
                'bert_pred': bert_pred[j]
            })

results_df = pd.DataFrame(results_data)
print("\n📊 Prediction Results:")
print(results_df)

## 7. Performance Evaluation and Metrics

### 7.1 Calculate NER Metrics

In [None]:
# Prepare data for seqeval evaluation
def prepare_for_evaluation(df):
    """Group predictions by sentence for seqeval"""
    grouped = df.groupby('sentence_id')
    
    true_labels = []
    spacy_preds = []
    bert_preds = []
    
    for _, group in grouped:
        true_labels.append(group['true_label'].tolist())
        spacy_preds.append(group['spacy_pred'].tolist())
        bert_preds.append(group['bert_pred'].tolist())
    
    return true_labels, spacy_preds, bert_preds

true_labels, spacy_preds, bert_preds = prepare_for_evaluation(results_df)

print("📊 Evaluation Data Prepared:")
print(f"Number of sentences: {len(true_labels)}")
print(f"True labels example: {true_labels[0]}")
print(f"spaCy predictions example: {spacy_preds[0]}")
print(f"BERT predictions example: {bert_preds[0]}")

In [None]:
# Calculate metrics for both systems
print("🔍 Calculating NER Metrics...")

# spaCy Metrics
try:
    spacy_f1 = f1_score(true_labels, spacy_preds)
    spacy_precision = precision_score(true_labels, spacy_preds)
    spacy_recall = recall_score(true_labels, spacy_preds)
    spacy_report = classification_report(true_labels, spacy_preds, digits=4)
except Exception as e:
    print(f"Error calculating spaCy metrics: {e}")
    spacy_f1 = spacy_precision = spacy_recall = 0.0
    spacy_report = "Error in calculation"

# BERT Metrics
try:
    bert_f1 = f1_score(true_labels, bert_preds)
    bert_precision = precision_score(true_labels, bert_preds)
    bert_recall = recall_score(true_labels, bert_preds)
    bert_report = classification_report(true_labels, bert_preds, digits=4)
except Exception as e:
    print(f"Error calculating BERT metrics: {e}")
    bert_f1 = bert_precision = bert_recall = 0.0
    bert_report = "Error in calculation"

print("\n📈 METRICS SUMMARY:")
print(f"{'='*60}")
print(f"{'System':<15} {'Precision':<12} {'Recall':<12} {'F1-Score':<12}")
print(f"{'='*60}")
print(f"{'spaCy':<15} {spacy_precision:<12.4f} {spacy_recall:<12.4f} {spacy_f1:<12.4f}")
print(f"{'BERT':<15} {bert_precision:<12.4f} {bert_recall:<12.4f} {bert_f1:<12.4f}")
print(f"{'='*60}")

# Detailed reports
print(f"\n🔍 spaCy Detailed Report:")
print(spacy_report)

print(f"\n🔍 BERT Detailed Report:")
print(bert_report)
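
The scores above are entity-level: a prediction counts only when both the span boundaries and the type match the gold entity. The plain-Python sketch below illustrates that idea on hypothetical tags. It is not seqeval's implementation, and its handling of a stray `I-` tag (treated as the start of a new entity) is an assumption.

```python
def extract_spans(tags):
    """Return the set of (start, end, type) entity spans in one BIO sequence."""
    spans = set()
    start, current = None, None
    for i, tag in enumerate(list(tags) + ["O"]):   # sentinel closes the last span
        prefix, _, entity_type = tag.partition("-")
        continues = prefix == "I" and entity_type == current
        if current is not None and not continues:
            spans.add((start, i, current))
            start, current = None, None
        if prefix == "B" or (prefix == "I" and current is None):
            start, current = i, entity_type
    return spans

def entity_prf(gold_sentences, pred_sentences):
    """Micro-averaged entity-level precision, recall, and F1."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
        tp += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical gold and predicted tags; the prediction misses one I-ORG token.
gold = [["B-ORG", "I-ORG", "O", "O", "O"], ["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]]
pred = [["B-ORG", "O", "O", "O", "O"], ["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]]

precision, recall, f1 = entity_prf(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
# P=0.667 R=0.667 F1=0.667
```

Nine of the ten tokens are tagged correctly, yet the entity-level F1 is only 0.667, because the truncated `ORG` span counts as both a miss and a false positive.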

In [None]:
# Create comparison visualization
metrics_data = {
    'System': ['spaCy', 'BERT'],
    'Precision': [spacy_precision, bert_precision],
    'Recall': [spacy_recall, bert_recall],
    'F1-Score': [spacy_f1, bert_f1]
}

metrics_df = pd.DataFrame(metrics_data)

# Create bar plot
fig = make_subplots(rows=1, cols=3, 
                    subplot_titles=('Precision', 'Recall', 'F1-Score'),
                    specs=[[{"type": "bar"}, {"type": "bar"}, {"type": "bar"}]])

# Add bars for each metric
fig.add_trace(go.Bar(x=metrics_df['System'], y=metrics_df['Precision'], 
                     name='Precision', marker_color='lightblue'), row=1, col=1)
fig.add_trace(go.Bar(x=metrics_df['System'], y=metrics_df['Recall'], 
                     name='Recall', marker_color='lightgreen'), row=1, col=2)
fig.add_trace(go.Bar(x=metrics_df['System'], y=metrics_df['F1-Score'], 
                     name='F1-Score', marker_color='lightcoral'), row=1, col=3)

fig.update_layout(title_text="NER Systems Performance Comparison", showlegend=False)
fig.update_yaxes(range=[0, 1])
fig.show()

# Summary table for poster
print("\n📋 SUMMARY TABLE FOR POSTER:")
print(metrics_df.round(4))

## 8. Results Comparison and Analysis

### 8.1 Side-by-Side Comparison

In [None]:
# Detailed comparison analysis
print("🔍 DETAILED COMPARISON ANALYSIS")
print("="*60)

# Analyze agreement/disagreement
agreement_analysis = []
for i, row in results_df.iterrows():
    token = row['token']
    true_label = row['true_label']
    spacy_pred = row['spacy_pred']
    bert_pred = row['bert_pred']
    
    spacy_correct = spacy_pred == true_label
    bert_correct = bert_pred == true_label
    systems_agree = spacy_pred == bert_pred
    
    agreement_analysis.append({
        'token': token,
        'true_label': true_label,
        'spacy_pred': spacy_pred,
        'bert_pred': bert_pred,
        'spacy_correct': spacy_correct,
        'bert_correct': bert_correct,
        'systems_agree': systems_agree,
        'both_correct': spacy_correct and bert_correct,
        'both_wrong': not spacy_correct and not bert_correct,
        'spacy_only_correct': spacy_correct and not bert_correct,
        'bert_only_correct': bert_correct and not spacy_correct
    })

agreement_df = pd.DataFrame(agreement_analysis)

# Calculate agreement statistics
total_tokens = len(agreement_df)
systems_agree_count = agreement_df['systems_agree'].sum()
both_correct_count = agreement_df['both_correct'].sum()
both_wrong_count = agreement_df['both_wrong'].sum()
spacy_only_correct_count = agreement_df['spacy_only_correct'].sum()
bert_only_correct_count = agreement_df['bert_only_correct'].sum()

print(f"Total tokens analyzed: {total_tokens}")
print(f"Systems agree: {systems_agree_count} ({systems_agree_count/total_tokens*100:.1f}%)")
print(f"Both systems correct: {both_correct_count} ({both_correct_count/total_tokens*100:.1f}%)")
print(f"Both systems wrong: {both_wrong_count} ({both_wrong_count/total_tokens*100:.1f}%)")
print(f"Only spaCy correct: {spacy_only_correct_count} ({spacy_only_correct_count/total_tokens*100:.1f}%)")
print(f"Only BERT correct: {bert_only_correct_count} ({bert_only_correct_count/total_tokens*100:.1f}%)")

# Show examples of disagreement
print("\n🔍 Examples of System Disagreements:")
disagreements = agreement_df[~agreement_df['systems_agree']]
if len(disagreements) > 0:
    print(disagreements[['token', 'true_label', 'spacy_pred', 'bert_pred']].head(10))
else:
    print("No disagreements found in the test data.")
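
Raw token agreement is inflated by the `O` class, which dominates NER data: two systems that tag almost everything `O` agree most of the time by chance. One common correction is Cohen's kappa. The sketch below computes it over two flat tag lists; the example tags are hypothetical, and applying kappa here is a suggestion rather than part of the original analysis.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two equally long label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both systems tagged independently at their own rates
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical token-level predictions from two systems
system_a = ["O"] * 8 + ["B-PER", "B-LOC"]
system_b = ["O"] * 8 + ["O", "B-LOC"]

kappa = cohen_kappa(system_a, system_b)
print(f"Raw agreement: 0.90, kappa: {kappa:.3f}")
# Raw agreement: 0.90, kappa: 0.630
```

With the notebook's data, the inputs would be `results_df['spacy_pred'].tolist()` and `results_df['bert_pred'].tolist()`.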

## 9. Error Analysis and Examples

### 9.1 Common Error Patterns

In [None]:
# Error Analysis
print("🔍 ERROR ANALYSIS")
print("="*50)

# spaCy Error Analysis
spacy_errors = agreement_df[~agreement_df['spacy_correct']]
print(f"\n📊 spaCy Errors ({len(spacy_errors)} total):")
if len(spacy_errors) > 0:
    spacy_error_patterns = spacy_errors.groupby(['true_label', 'spacy_pred']).size().reset_index(name='count')
    spacy_error_patterns = spacy_error_patterns.sort_values('count', ascending=False)
    print("Most common error patterns:")
    print(spacy_error_patterns.head(10))
    
    print("\nExample spaCy errors:")
    for _, row in spacy_errors.head(5).iterrows():
        print(f"  Token: '{row['token']}' | True: {row['true_label']} | Predicted: {row['spacy_pred']}")

# BERT Error Analysis
bert_errors = agreement_df[~agreement_df['bert_correct']]
print(f"\n📊 BERT Errors ({len(bert_errors)} total):")
if len(bert_errors) > 0:
    bert_error_patterns = bert_errors.groupby(['true_label', 'bert_pred']).size().reset_index(name='count')
    bert_error_patterns = bert_error_patterns.sort_values('count', ascending=False)
    print("Most common error patterns:")
    print(bert_error_patterns.head(10))
    
    print("\nExample BERT errors:")
    for _, row in bert_errors.head(5).iterrows():
        print(f"  Token: '{row['token']}' | True: {row['true_label']} | Predicted: {row['bert_pred']}")

# Entity type analysis
print("\n📊 Entity Type Performance:")
for entity_type in ['PER', 'ORG', 'LOC', 'MISC']:
    b_entity = f'B-{entity_type}'
    i_entity = f'I-{entity_type}'
    
    # Count entity occurrences
    entity_tokens = agreement_df[agreement_df['true_label'].str.contains(entity_type, na=False)]
    
    if len(entity_tokens) > 0:
        spacy_correct_entity = entity_tokens['spacy_correct'].sum()
        bert_correct_entity = entity_tokens['bert_correct'].sum()
        total_entity = len(entity_tokens)
        
        print(f"  {entity_type}: spaCy {spacy_correct_entity}/{total_entity} ({spacy_correct_entity/total_entity*100:.1f}%), BERT {bert_correct_entity}/{total_entity} ({bert_correct_entity/total_entity*100:.1f}%)")
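
Token-level comparison cannot distinguish a wrong entity type from a wrong boundary. The sketch below sorts span-level outcomes into five categories. The category names and the overlap rule are assumptions chosen for illustration, and the spans are hypothetical `(start, end, type)` tuples with exclusive end indices.

```python
from collections import Counter

def categorize_errors(gold_spans, pred_spans):
    """Classify the gold and predicted (start, end, type) spans of one sentence."""
    counts = Counter()
    pred_by_bounds = {(start, end): etype for start, end, etype in pred_spans}
    matched_preds = set()
    for g_start, g_end, g_type in gold_spans:
        if (g_start, g_end) in pred_by_bounds:
            # Same boundaries: either fully correct or a pure type confusion
            p_type = pred_by_bounds[(g_start, g_end)]
            counts["correct" if p_type == g_type else "type_error"] += 1
            matched_preds.add((g_start, g_end, p_type))
            continue
        overlapping = [p for p in pred_spans if p[0] < g_end and p[1] > g_start]
        if overlapping:
            counts["boundary_error"] += 1
            matched_preds.update(overlapping)
        else:
            counts["missed"] += 1
    # Predictions that touch no gold entity at all
    counts["spurious"] += len(set(pred_spans) - matched_preds)
    return counts

gold = [(0, 2, "ORG"), (5, 7, "PER"), (8, 10, "LOC"), (12, 13, "ORG")]
pred = [(0, 2, "ORG"), (5, 7, "ORG"), (8, 9, "LOC"), (14, 15, "PER")]

error_counts = categorize_errors(gold, pred)
for category in ("correct", "type_error", "boundary_error", "missed", "spurious"):
    print(f"{category}: {error_counts[category]}")
```

In this example each category occurs exactly once, which makes it easy to check the rules by hand before running them on real predictions.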

In [None]:
# Create example outputs for poster
print("\n🎯 EXAMPLE OUTPUTS FOR POSTER")
print("="*60)

for i, row in test_sentences_df.iterrows():
    sentence = row['sentence']
    tokens = row['token']
    true_labels = row['true_label']
    
    # Get predictions for this sentence
    sentence_results = results_df[results_df['sentence_id'] == row['sentence_id']]
    
    print(f"\n📝 Example {i+1}: {sentence}")
    print(f"{'Token':<15} {'True':<10} {'spaCy':<10} {'BERT':<10}")
    print("-" * 50)
    
    for _, token_row in sentence_results.iterrows():
        token = token_row['token']
        true_label = token_row['true_label']
        spacy_pred = token_row['spacy_pred']
        bert_pred = token_row['bert_pred']
        
        # Add visual indicators for correctness
        spacy_indicator = "✅" if spacy_pred == true_label else "❌"
        bert_indicator = "✅" if bert_pred == true_label else "❌"
        
        print(f"{token:<15} {true_label:<10} {spacy_pred:<10} {bert_pred:<10} {spacy_indicator} {bert_indicator}")
    
    if i >= 2:  # Show only first 3 examples
        break

## 10. Conclusions and Potential Improvements

### 10.1 Key Findings

In [None]:
# Generate conclusions based on results
print("🎯 KEY FINDINGS AND CONCLUSIONS")
print("="*60)

# Performance comparison
better_system = "spaCy" if spacy_f1 > bert_f1 else "BERT"
f1_difference = abs(spacy_f1 - bert_f1)

print(f"\n📊 Performance Summary:")
print(f"• Best performing system: {better_system}")
print(f"• F1-score difference: {f1_difference:.4f}")
print(f"• System agreement rate: {systems_agree_count/total_tokens*100:.1f}%")

print(f"\n🔍 Training Data Insights:")
print(f"• CoNLL-2003: {len(conll_processed)} samples processed")
print(f"• WikiANN: {len(wikiann_processed)} samples processed")
print(f"• WNUT-17: {len(wnut_processed)} samples processed")

print(f"\n💡 Key Observations:")
if spacy_f1 > bert_f1:
    print(f"• spaCy transformer model outperformed BERT by {f1_difference:.4f} F1-score")
    print(f"• spaCy showed better precision: {spacy_precision:.4f} vs {bert_precision:.4f}")
else:
    print(f"• BERT model outperformed spaCy by {f1_difference:.4f} F1-score")
    print(f"• BERT showed better precision: {bert_precision:.4f} vs {spacy_precision:.4f}")

if systems_agree_count/total_tokens > 0.8:
    print(f"• High agreement between systems ({systems_agree_count/total_tokens*100:.1f}%) suggests consistent performance")
else:
    print(f"• Lower agreement between systems ({systems_agree_count/total_tokens*100:.1f}%) indicates different strengths")

print(f"\n🚀 Potential Improvements:")
print(f"• Fine-tune models on domain-specific data")
print(f"• Ensemble methods combining both systems")
print(f"• Address the most frequent error patterns identified in the error analysis (Section 9)")
print(f"• Expand training data with more diverse examples")
print(f"• Post-processing rules for specific entity types")

print(f"\n📈 Recommendations for Future Work:")
print(f"• Evaluate on larger test sets")
print(f"• Test on different domains (medical, legal, social media)")
print(f"• Investigate cross-lingual performance")
print(f"• Analyze computational efficiency")
print(f"• Study the impact of text preprocessing")

## 📋 Summary for Academic Poster

### Quick Reference Numbers:

**Systems Compared:**
- System 1: spaCy en_core_web_trf (Transformer-based)
- System 2: BERT dslim/bert-base-NER

**Training Datasets:**
- CoNLL-2003: Standard NER benchmark
- WikiANN: Multilingual Wikipedia-based dataset
- WNUT-17: Emerging entities from social media

**Performance Results:**
- Entity-level precision, recall, and F1-score for both systems (Section 7)
- Entity-level analysis (PERSON, ORGANIZATION, LOCATION, MISC)
- Error pattern identification and system agreement analysis

**Key Contributions:**
- Comparative evaluation of two state-of-the-art NER systems
- Analysis of system strengths and weaknesses
- Identification of potential improvements for real-world deployment

---

*This notebook provides comprehensive analysis suitable for academic presentation and future NER system development.*