<a href="https://colab.research.google.com/github/simon-clematide/colab-notebooks-for-teaching/blob/main/MGENRE_impresso_linking.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Named Entity Linking (NEL) Model Exploration

This notebook demonstrates a multilingual Named Entity Linking (NEL) model,
`impresso-project/nel-hipe-multilingual`, hosted on Hugging Face. Given a sentence in
which one mention is marked with `[START]` and `[END]`, the model generates candidate
Wikipedia page titles, each tagged with a language code (for example
`United Press International >> en`). The notebook attaches a confidence score to each candidate.

We'll use the Hugging Face `transformers` library to:
- Load the model and tokenizer.
- Run the model on a single sentence.
- Extract the marked entity from the input text and link it to Wikipedia.

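Key functions:
1. `load_model_and_tokenizer`: Loads the NEL model and tokenizer.
2. `extract_entity`: Returns the mention enclosed between `[START]` and `[END]` in a sentence.
3. `link_entity_to_wikipedia`: Generates candidate Wikipedia titles for the marked mention, each with a confidence score.
4. `run_single_sentence`: Loads the model, links the marked mention in a single sentence, and prints the results.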
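
The functions above expect the mention to be wrapped in `[START]` and `[END]` markers, as in the example sentence used later in this notebook. As a minimal sketch, assuming the character offsets of the mention are already known (for instance from an upstream NER step), the markers could be inserted like this. `mark_entity` is a hypothetical helper, not part of the model or of `transformers`.

```python
def mark_entity(text: str, start: int, end: int) -> str:
    """Wrap text[start:end] in [START] ... [END] markers (hypothetical helper)."""
    if not 0 <= start < end <= len(text):
        raise ValueError("start/end must delimit a non-empty span inside text")
    # Spaces around the mention mirror the format of the example sentence in this notebook.
    return f"{text[:start]}[START] {text[start:end]} [END]{text[end:]}"


text = "United Press - On the home front, the British populace remains steadfast."
marked = mark_entity(text, 0, len("United Press"))
print(marked)
```

The resulting string can be passed directly to `run_single_sentence`.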




In [1]:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

# Step 1: Load Model and Tokenizer
def load_model_and_tokenizer(model_name="impresso-project/nel-hipe-multilingual"):
    """
    Loads the NEL model and tokenizer from Hugging Face.

    Args:
    model_name (str): The name of the pre-trained model to load.

    Returns:
    tuple: A tuple containing the tokenizer and the model.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name).eval()
    return tokenizer, model

# Step 2: Extract Entities
def extract_entity(sentence):
    """
    Extract the entity enclosed between [START] and [END] in the sentence.

    Args:
    sentence (str): The sentence containing the entity to extract.

    Returns:
    str: The extracted entity.
    """
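    start_token = "[START]"
    end_token = "[END]"
    start_pos = sentence.find(start_token)
    end_idx = sentence.find(end_token, start_pos + len(start_token))
    # str.find returns -1 when a marker is missing; fail loudly instead of slicing at a wrong offset
    if start_pos == -1 or end_idx == -1:
        raise ValueError("Sentence must contain an entity wrapped in [START] ... [END].")
    entity = sentence[start_pos + len(start_token):end_idx].strip()
    return entity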

# Step 3: Link Entity to Wikipedia
def link_entity_to_wikipedia(sentence, tokenizer, model):
    """
    Links the extracted entity to potential Wikipedia articles using the model.

    Args:
    sentence (str): The input sentence containing the entity.
    tokenizer: The tokenizer used for processing the input.
    model: The pre-trained NEL model for generating Wikipedia links.

    Returns:
    dict: A dictionary containing the entity and possible Wikipedia links with confidence scores.
    """
    # Extract the entity
    entity = extract_entity(sentence)

    # Process the sentence with the model
    inputs = tokenizer([sentence], return_tensors="pt")
    outputs = model.generate(
        **inputs,
        num_beams=5,
        num_return_sequences=5,
        output_scores=True,
        return_dict_in_generate=True
    )

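    # Decode the generated sequences (candidate Wikipedia titles with language codes)
    decoded_outputs = tokenizer.batch_decode(outputs.sequences, skip_special_tokens=True)
    # Note: outputs.scores is a tuple with one tensor per generation step,
    # each of shape (num_beams, vocab_size) for a single input sentence.
    scores_per_step = outputs.scores

    # Calculate a heuristic confidence and format results
    results = []
    for seq_idx, decoded_seq in enumerate(decoded_outputs):
        # Caveat: seq_idx selects generation step seq_idx, not candidate seq_idx,
        # so this is the highest token probability at that step, averaged over beams.
        step_scores = scores_per_step[seq_idx]
        probs_per_beam = torch.softmax(step_scores, dim=-1).max(dim=-1)[0]
        avg_confidence = probs_per_beam.mean().item()
        normalized_confidence = avg_confidence * 100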

        results.append({
            "wikipedia_link": decoded_seq,
            "confidence": f"{normalized_confidence:.2f}%"
        })

    return {
        "entity": entity,
        "links": results
    }

# Step 4: Run the Model on a Single Sentence
def run_single_sentence(sentence):
    """
    Encapsulates the functionality to run the NEL model on a single sentence.

    Args:
    sentence (str): The input sentence with the entity to be linked.

    Returns:
    None: The extracted entity and candidate Wikipedia links are printed, not returned.
    """
    # Load the model and tokenizer
    tokenizer, model = load_model_and_tokenizer()

    # Get Wikipedia link predictions
    result = link_entity_to_wikipedia(sentence, tokenizer, model)

    # Display the results
    print(f"Original Sentence: {sentence}")
    print(f"Entity: {result['entity']}")
    print("Possible Wikipedia Links:")
    for link_info in result['links']:
        print(f" - {link_info['wikipedia_link']} (Confidence: {link_info['confidence']})")
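
The confidence in `link_entity_to_wikipedia` is derived from `outputs.scores`, which `generate` returns as one tensor per decoding step. As a hedged alternative, beam search in `transformers` also exposes `outputs.sequences_scores` when `return_dict_in_generate=True` and `output_scores=True`: one length-normalised log-probability per returned sequence. The sketch below shows how such log scores could be turned into percentages. `log_scores_to_percent` is a hypothetical helper, and the input values are made up for illustration; they are not model output.

```python
import math


def log_scores_to_percent(log_scores):
    """Convert per-sequence log-probability scores to percentages (hypothetical helper)."""
    return [math.exp(score) * 100 for score in log_scores]


# Made-up log scores for illustration only. With the model loaded, one would
# pass outputs.sequences_scores.tolist() instead (assumes beam search output).
example_scores = [-0.05, -0.7, -2.3]
for pct in log_scores_to_percent(example_scores):
    print(f"{pct:.2f}%")
```

Percentages obtained this way rank candidates in the same order as the beam search does, but they are not calibrated probabilities and need not sum to 100 across candidates.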







# Running the model on a sentence

In [2]:
# Example: Running the model on a single sentence
sentence = "[START] United Press [END] - On the home front, the British populace remains steadfast in the face of ongoing air raids."
run_single_sentence(sentence)

Original Sentence: [START] United Press [END] - On the home front, the British populace remains steadfast in the face of ongoing air raids.
Entity: United Press
Possible Wikipedia Links:
 - United Press International >> en  (Confidence: 100.00%)
 - The United Press International >> en  (Confidence: 100.00%)
 - United Press International >> de  (Confidence: 87.73%)
 - United Press >> en  (Confidence: 87.67%)
 - Associated Press >> en  (Confidence: 99.78%)
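
Each prediction printed above has the form `Title >> lang`. As a minimal sketch, assuming the language code corresponds to a Wikipedia language edition and the title to a page name in that edition, a prediction could be turned into a URL as follows. `prediction_to_url` is a hypothetical helper; the notebook itself does not check that the resulting page exists.

```python
from urllib.parse import quote


def prediction_to_url(prediction: str) -> str:
    """Turn 'Title >> lang' into a Wikipedia URL (hypothetical helper)."""
    # Split on the last '>>' so that titles containing '>>' are kept intact.
    title, sep, lang = prediction.rpartition(">>")
    if not sep:
        raise ValueError(f"Expected 'Title >> lang', got: {prediction!r}")
    title, lang = title.strip(), lang.strip()
    # Wikipedia page names use underscores in place of spaces.
    return f"https://{lang}.wikipedia.org/wiki/{quote(title.replace(' ', '_'))}"


print(prediction_to_url("United Press International >> en"))
# → https://en.wikipedia.org/wiki/United_Press_International
```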
