Inspired by "Transformers and LLMs" CME 295 2025 (Stanford Univ.)

Alright. Let's move on to an example of named entity recognition (NER), which is a token classification task:

The goal is to classify named entities into pre-defined categories. For this, we can use a DistilBERT model fine-tuned for NER, such as dslim/distilbert-NER (fine-tuned on CoNLL-2003). Datasets like CoNLL-2003 annotate every token in the so-called BIO format.

To make this clearer, let's code it and predict the category of each token in "Tim Cook presented the new iPhone in Las Vegas on Tuesday.".

In [3]:
from transformers import pipeline

MODEL_NAME = "dslim/distilbert-NER"


def main() -> None:
    classifier = pipeline(
        task="token-classification",
        model=MODEL_NAME,
        framework="pt",
    )

    text = "Tim Cook presented the new iPhone in Las Vegas on Tuesday."
    result = classifier(text)  # pipeline returns a list of dictionaries

    for token in result:
        print(f"Token: {token['word']}")
        print(f"BIO tag: {token['entity']}")
        print(f"Softmax probability: {token['score']:.6f}")
        print()


if __name__ == "__main__":
    main()


Device set to use mps:0


Token: Tim
BIO tag: B-PER
Softmax probability: 0.998844

Token: Cook
BIO tag: I-PER
Softmax probability: 0.998934

Token: iPhone
BIO tag: B-MISC
Softmax probability: 0.780839

Token: Las
BIO tag: B-LOC
Softmax probability: 0.997057

Token: Vegas
BIO tag: I-LOC
Softmax probability: 0.997060



The token "Tim" is tagged "B-PER", "Cook" is tagged "I-PER", and "iPhone" is tagged "B-MISC". "B-PER" stands for "Begin-Person", i.e. the token marks the beginning of a PERSON entity. "I-PER" stands for "Inside-Person", i.e. the token continues the same entity. "B-MISC" stands for "Begin-Miscellaneous", i.e. the token marks the beginning of a MISC (OTHER) entity. "O" stands for "Outside", i.e. the token is not part of any entity; when an "O" tag follows a B- or I-tag, the preceding entity span has ended. Tokens tagged "O" (such as "presented" or "the") are missing from the output above because the pipeline skips the "O" label by default.
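
To see the BIO rules in action without loading a model, here is a minimal sketch that groups hand-written tags into spans. The helper name bio_to_spans, the word-level tokens and the "O" tags are my own assumptions for illustration (the real model works on WordPiece sub-tokens, and the pipeline does not print "O" tokens).

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, entity_type) spans."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        continues = prefix == "I" and entity_type == current_type
        if not continues and current_tokens:
            # The open span ends at "O", at a new "B-" tag, or at a type change.
            spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
        if prefix in ("B", "I"):
            if not continues:
                # Lenient choice: a stray "I-" tag with no open span starts a new one.
                current_type = entity_type
            current_tokens.append(token)

    if current_tokens:
        # Close a span that runs up to the end of the sentence.
        spans.append((" ".join(current_tokens), current_type))
    return spans


# Hand-written word-level example (assumed tags, not model output).
tokens = ["Tim", "Cook", "presented", "the", "new", "iPhone",
          "in", "Las", "Vegas", "on", "Tuesday", "."]
tags = ["B-PER", "I-PER", "O", "O", "O", "B-MISC",
        "O", "B-LOC", "I-LOC", "O", "O", "O"]

print(bio_to_spans(tokens, tags))
```

This prints [('Tim Cook', 'PER'), ('iPhone', 'MISC'), ('Las Vegas', 'LOC')], i.e. three spans built from five tagged tokens.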

To see this more concretely, let us merge the predictions for individual tokens into predictions for spans (a span is a contiguous segment of text formed by one or more consecutive tokens):


In [4]:
from transformers import pipeline

MODEL_NAME = "dslim/distilbert-NER"


def main() -> None:
    classifier = pipeline(
        task="token-classification",
        model=MODEL_NAME,
        framework="pt",
        aggregation_strategy="simple",  # merge token-level predictions into full word/entity spans
    )

    text = "Tim Cook presented the new iPhone in Las Vegas on Tuesday."
    result = classifier(text)  # pipeline returns a list of dictionaries

    for entity in result:
        print(f"Entity: {entity['word']}")
        print(f"Group: {entity['entity_group']}")
        print(f"Softmax probability: {entity['score']:.6f}")
        print()


if __name__ == "__main__":
    main()


Device set to use mps:0


Entity: Tim Cook
Group: PER
Softmax probability: 0.998889

Entity: iPhone
Group: MISC
Softmax probability: 0.780839

Entity: Las Vegas
Group: LOC
Softmax probability: 0.997058
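
A quick sanity check on the span scores: they are consistent with the "simple" strategy averaging the scores of the tokens it merges. The sketch below recomputes the mean from the rounded token scores printed by the previous cell; span_score is a hypothetical helper of mine, not part of transformers.

```python
from statistics import mean
from typing import List

# Token-level scores copied from the earlier printed output (rounded to 6 decimals).
token_scores = {
    "Tim Cook": [0.998844, 0.998934],
    "Las Vegas": [0.997057, 0.997060],
}


def span_score(scores: List[float]) -> float:
    """Average the token scores of one span (assumed behaviour of "simple")."""
    return mean(scores)


for entity, scores in token_scores.items():
    print(f"{entity}: {span_score(scores):.6f}")
```

The recomputed values agree with the span scores above up to rounding in the last digit, since the inputs here are already rounded. "iPhone" is a single-token span in the earlier output, so its score stays at 0.780839.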



Nice! The model assigned PER, MISC and LOC to the entities "Tim Cook", "iPhone" and "Las Vegas". You may wonder what happened to "Tuesday". CoNLL-2003 only annotates persons, organisations, locations and miscellaneous entities, not dates, so a model fine-tuned on it leaves "Tuesday" untagged. To classify dates as well, I can pick, for example, a DistilBERT model fine-tuned on OntoNotes 5, which defines 18 entity types.

Let's code this.

In [5]:
from transformers import pipeline

MODEL_NAME = "nickprock/distilbert-finetuned-ner-ontonotes"


def main() -> None:
    classifier = pipeline(
        task="token-classification",
        model=MODEL_NAME,
        framework="pt",
        aggregation_strategy="simple",  # enable word-level named entities
    )

    text = "Tim Cook presented the new iPhone in Las Vegas on Tuesday."
    result = classifier(text)  # pipeline returns a list of dictionaries (one per entity)

    for entity in result:
        print(f"Entity: {entity['word']}")
        print(f"Type: {entity['entity_group']}")
        print(f"Softmax probability: {entity['score']:.6f}")
        print()


if __name__ == "__main__":
    main()


Device set to use mps:0


Entity: Tim Cook
Type: PERSON
Softmax probability: 0.999755

Entity: iPhone
Type: PRODUCT
Softmax probability: 0.996051

Entity: Las Vegas
Type: GPE
Softmax probability: 0.999699

Entity: Tuesday
Type: DATE
Softmax probability: 0.999402



Alright. With the model fine-tuned on OntoNotes 5, "Tuesday" is now classified as "DATE" with a probability of 0.999402. The other entities also receive finer-grained types: PERSON, PRODUCT and GPE (geopolitical entity).
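
To make the difference between the two label sets concrete, here is a small sketch that writes out the entity types of both datasets and derives the BIO labels from them. The type lists are typed in by hand from the dataset documentation as I know it, and bio_labels is a hypothetical helper; a checkpoint's actual labels live in its configuration (id2label) and may be ordered or named slightly differently.

```python
from typing import List

# Entity types per dataset (hand-written here, not read from a model).
CONLL2003_TYPES = ["PER", "ORG", "LOC", "MISC"]
ONTONOTES5_TYPES = [
    "PERSON", "NORP", "FAC", "ORG", "GPE", "LOC", "PRODUCT", "EVENT",
    "WORK_OF_ART", "LAW", "LANGUAGE", "DATE", "TIME", "PERCENT",
    "MONEY", "QUANTITY", "ORDINAL", "CARDINAL",
]


def bio_labels(entity_types: List[str]) -> List[str]:
    """Build the BIO label set: "O" plus one B- and one I- tag per entity type."""
    labels = ["O"]
    for entity_type in entity_types:
        labels.extend([f"B-{entity_type}", f"I-{entity_type}"])
    return labels


print(len(ONTONOTES5_TYPES))                 # 18 entity types
print(len(bio_labels(CONLL2003_TYPES)))      # 9 BIO labels
print(len(bio_labels(ONTONOTES5_TYPES)))     # 37 BIO labels
print("DATE" in CONLL2003_TYPES, "DATE" in ONTONOTES5_TYPES)  # False True
```

So the classification head simply has more classes to choose from: 2 x 18 + 1 = 37 labels instead of 2 x 4 + 1 = 9, and DATE is one of the added types.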

Before we continue with an example of machine translation (text generation), I want to make clear that the previously introduced metrics (accuracy, precision, recall, F1) also apply to NER. The difference is that in NER these metrics are computed per token rather than per sentence, because NER assigns one BIO label to each token. NER results are also commonly reported at the entity (span) level, where a prediction only counts as correct if both the span boundaries and the entity type match.
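
Here is a minimal sketch of the token-level view, where every token counts as one classification example. The gold and predicted tag sequences are made up by me for illustration (they are not model output), and token_level_prf is a hypothetical helper.

```python
from typing import Dict, List


def token_level_prf(gold: List[str], pred: List[str], label: str) -> Dict[str, float]:
    """Precision, recall and F1 for one label, counting each token as one example."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical tags for the 12 tokens of the example sentence.
gold = ["B-PER", "I-PER", "O", "O", "O", "B-MISC", "O", "B-LOC", "I-LOC", "O", "O", "O"]
pred = ["B-PER", "I-PER", "O", "O", "O", "O", "O", "B-LOC", "B-LOC", "O", "O", "O"]

# Token-level accuracy: share of tokens whose predicted tag equals the gold tag.
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(f"Accuracy: {accuracy:.4f}")

scores = token_level_prf(gold, pred, "B-LOC")
print(f"B-LOC precision: {scores['precision']:.4f}")
print(f"B-LOC recall: {scores['recall']:.4f}")
print(f"B-LOC F1: {scores['f1']:.4f}")
```

In this made-up example, 10 of 12 tokens are tagged correctly. For "B-LOC" there is one true positive ("Las") and one false positive ("Vegas", whose gold tag is "I-LOC"), so precision is 0.5 and recall is 1.0.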

Perfect! Let's move on.