# CA6011 Deep Learning for NLP: Week 3 Lab Part 3 -- Hugging Face

In this part of the lab the aim is to get a first feel for how to create and apply NLP systems on Hugging Face.

We'll start with the quickest way to get an NLP system up and running, in one single step using `pipeline`, for sentiment analysis, prompted text generation and summarisation. See the pipeline documentation for details: https://huggingface.co/docs/transformers/main_classes/pipelines

Then we'll take a look at what happens behind the scenes, and code up the system in several steps equivalent to `pipeline`.

First up is a **sentiment analysis** system which we create with the simple line:

    classifier = pipeline("sentiment-analysis")

Note that here we simply state the *task*, nothing else, and everything needed to solve the task is supplied via defaults. Once created, we can simply call `classifier` with any text or list of texts as the argument and get a (list of) class label(s) and score(s) returned.

The range of tasks that can be addressed in this way can be found here: https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.pipeline.task

The default sentiment analysis model (`distilbert-base-uncased-finetuned-sst-2-english`) is named in the output below (we'll be using it explicitly later). It classifies each input text into one of two output classes (POSITIVE, NEGATIVE).

Note the model naming convention: the name of the base model (`distilbert-base-uncased`), followed by an indication that it is a finetuned version of that model (`finetuned`), followed by the name of the dataset it was finetuned on (`sst-2-english`).

The actual system output is shown last and identifies the class predicted (POSITIVE) and the probability assigned to it (0.99985).

In [2]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

output = classifier("A smart, solidly crafted procedural that's anchored in family drama, Anatomy of a Fall finds star Sandra Hüller and director/co-writer Justine Triet operating at peak power.")
print(output)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0


[{'label': 'POSITIVE', 'score': 0.9998509883880615}]


Getting the class label `POSITIVE` and a high probability score as the output returned is exactly what we would expect, given the glowing film review snippet we gave the system to classify.
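Since the pipeline returns a plain list with one dictionary per input text, the results can be post-processed with ordinary Python. The sketch below works on the output copied from the run above; the helper name `describe_predictions` and the `threshold` value are hypothetical, chosen only for illustration.

```python
# Output copied from the run above: one dict per input text.
output = [{'label': 'POSITIVE', 'score': 0.9998509883880615}]

def describe_predictions(predictions, threshold=0.9):
    """Return (label, rounded score, confident?) for each prediction dict.

    `threshold` is a hypothetical cut-off, not something the pipeline defines.
    """
    return [
        (p["label"], round(p["score"], 4), p["score"] >= threshold)
        for p in predictions
    ]

for label, score, confident in describe_predictions(output):
    print(f"{label}: {score} (confident: {confident})")
# prints: POSITIVE: 0.9999 (confident: True)
```

The same helper works unchanged when the classifier is called with a list of texts, because the result is then simply a longer list of dictionaries.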

Next we create and use a **prompted generation** system in the same way as the sentiment analyser above, except that this time we don't rely on the default text-generation model; instead we specify `distilgpt2` as the model to be used.

Any model on the Hugging Face hub can in principle be used in this way in a pipeline: https://huggingface.co/models

Note that if the model defines the task, then the first task argument can be omitted.

In [3]:
generator = pipeline("text-generation", model="distilgpt2")

output = generator("A smart, solidly crafted procedural that's anchored in family drama, Anatomy of a Fall finds ",truncation=True, max_length=100, num_return_sequences=1)
print(output)

Device set to use mps:0
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=100) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': "A smart, solidly crafted procedural that's anchored in family drama, Anatomy of a Fall finds iced-up children who are very young and are going to be interested. It's a bit off-putting, a bit off-putting, and it's a little tricky to get into the main cast, especially since the main characters are from the same generation.\n\n\n\n\nThe main characters are from the same generation.\nWith the cast having been cast over the years, we feel we have a solid cast, but now we have to be sure that we don't have another cast, so that we can create a cohesive cast.\nThere's a lot of room to expand, though, and it's not necessarily all of the problems we'll have with the cast. It's also a big part of the story, as well. We want to have a good story that's not about the characters but about the characters, the show, the stories about the world, the characters.\nThere's also a lot of room to expand.\nThere's a lot of room to expand.\nThere's a lot of room to expand.\nThere's a lot

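The continuation we get back reads superficially like a review, but it is repetitive and nonsensical in places. Note that if you run the generator multiple times you will get different continuations.

You can also ask for continuations of a different length and for multiple continuations via the last two arguments in the `generator` call, `max_length` and `num_return_sequences`. Note, however, the warning in the output above: in this run the default `max_new_tokens` (=256) took precedence over `max_length` (=100), so to control the length here you would set `max_new_tokens` instead.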
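Why do repeated runs give different continuations? A likely explanation, which the run-to-run variation suggests but the output does not state, is that the generator samples each next token from the model's probability distribution rather than always taking the most probable one. The toy sketch below illustrates the difference in pure Python; the vocabulary and scores are made up, and none of the function names come from `transformers`.

```python
import math
import random

# Hypothetical next-token scores (logits) for a toy four-word vocabulary.
toy_logits = {"star": 2.0, "director": 1.5, "audiences": 1.0, "cats": -1.0}

def softmax(scores, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens them."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, rng, temperature=1.0):
    """Draw one token at random, weighted by its softmax probability."""
    tokens = list(logits)
    probs = softmax([logits[t] for t in tokens], temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

def greedy_token(logits):
    """Always pick the highest-scoring token (no randomness)."""
    return max(logits, key=logits.get)

rng = random.Random(0)  # a fixed seed makes the sampled sequence repeatable
print([sample_token(toy_logits, rng) for _ in range(5)])
print(greedy_token(toy_logits))  # always "star"
```

Greedy selection is deterministic, whereas sampling only becomes repeatable once the random seed is fixed.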

Next we're going to create a **summariser**. This time we pass the model and the tokenizer to the `pipeline` call as objects, having first loaded their pretrained weights explicitly with the `from_pretrained` method.

For this we need to import matching model and tokenizer classes to load the weights into (the first line of the code below). For more on loading pretrained model checkpoints, see: https://huggingface.co/learn/nlp-course/chapter4/2#using-pretrained-models

In [4]:
from transformers import BartForConditionalGeneration, AutoTokenizer

text = """Anatomy of a Fall is an uncommonly perceptive and thought-provoking procedural.
          Because the movie transpires in France and works using the rules of French jurisprudence,
          it is better able to address questions of truth than a U.S.-based iteration of the same story
          would be able to do. (The case would never make it in front of a judge in an American court.)
          By focusing on the narrative that emerges during a trial rather than the events of what happened
          at the chalet shared by wife Sandra Voyter (Sandra Huller), husband Samuel Maleski (Samuel Theis),
          and their son, Daniel (Milo Machado Graner), Anatomy of a Fall can ponder the unknowability of any
          objective truth. It’s another facet of the Rashomon prism.
          Recognizing that images captured by the camera represent something concrete, director Justine Triet
          is careful about deciding what to show on-screen. The instance of death is never depicted; we see
          precursor moments and are by Daniel’s side when he discovers the body but the minutia surrounding
          the actual death is left for the lawyers to argue. And, because she wants to emphasize the elusive
          nature of an objective truth, Triet rejects a facile omniscient representation of the death-scene at any point."""

task = "summarization"
model_name = "sshleifer/distilbart-cnn-12-6"
model = BartForConditionalGeneration.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

summariser = pipeline(task, model=model, tokenizer=tokenizer)

output = summariser(text, num_return_sequences=1)

print(output)

Device set to use mps:0


[{'summary_text': ' Anatomy of a Fall is an uncommonly perceptive and thought-provoking procedural procedural . The movie transpires in France and works using the rules of French jurisprudence . Director Justine Triet rejects a facile omniscient representation of the death-scene at any point .'}]


The summary produced is relevant, but relies on extraction rather than abstraction. Admittedly it's a tough review to summarise!

Running the above will give the same output every time, but we can ask for alternatives by increasing `num_return_sequences`.

Let's take a closer look at what the tokenizer does, using the same tokenizer as in the summarization example above, but using a shorter text so we can more easily see what's going on.

First we'll look at what the `tokenizer` call returns: the token IDs and the attention mask, which indicates which tokens the model should attend to (1) and which it should ignore (0; none in this example). Note that the list of token IDs contains two extra token IDs: the start-of-sequence ID 0 and the end-of-sequence ID 2.

Next we'll look at the intermediate steps of converting the input text into a token representation, and the latter into token IDs (this time without delimiting tokens). Finally we decode the list of IDs back into a word sequence which should give us the original text back.

In [5]:
text = 'Anatomy of a Fall is an uncommonly perceptive and thought-provoking procedural.'

output = tokenizer(text)
print("tokenizer(text) output: ", output)

tokens = tokenizer.tokenize(text)
print("tokens: ", tokens)

ids = tokenizer.convert_tokens_to_ids(tokens)
print("ids: ", ids)

decoded_ids = tokenizer.decode(ids)
print("decoded ids: ", decoded_ids)


tokenizer(text) output:  {'input_ids': [0, 4688, 415, 13604, 9, 10, 9197, 16, 41, 18186, 352, 228, 42579, 8, 802, 12, 13138, 14805, 24126, 4, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
tokens:  ['An', 'at', 'omy', 'Ġof', 'Ġa', 'ĠFall', 'Ġis', 'Ġan', 'Ġuncommon', 'ly', 'Ġper', 'ceptive', 'Ġand', 'Ġthought', '-', 'prov', 'oking', 'Ġprocedural', '.']
ids:  [4688, 415, 13604, 9, 10, 9197, 16, 41, 18186, 352, 228, 42579, 8, 802, 12, 13138, 14805, 24126, 4]
decoded ids:  Anatomy of a Fall is an uncommonly perceptive and thought-provoking procedural.


And indeed we do get the original word sequence back. Decoding a shortened sequence will work too (or any other sequence for that matter):

In [6]:
tokenizer.decode([4688, 415, 13604, 9, 10, 9197, 16, 18186, 352, 228, 42579, 4])

'Anatomy of a Fall is uncommonly perceptive.'
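The `Ġ` character in the token list printed earlier marks a token that begins with a space, which is how this tokenizer keeps track of word boundaries. As a rough illustration (a simplification that is only assumed to hold for plain ASCII text like ours, and not how `tokenizer.decode` is actually implemented), the original string can be rebuilt from the tokens by joining them and turning each `Ġ` back into a space:

```python
# Tokens copied from the tokenizer.tokenize(text) output above.
tokens = ['An', 'at', 'omy', 'Ġof', 'Ġa', 'ĠFall', 'Ġis', 'Ġan', 'Ġuncommon',
          'ly', 'Ġper', 'ceptive', 'Ġand', 'Ġthought', '-', 'prov', 'oking',
          'Ġprocedural', '.']

def join_tokens(tokens, space_marker="Ġ"):
    """Concatenate subword tokens, restoring spaces from the marker."""
    return "".join(tokens).replace(space_marker, " ")

print(join_tokens(tokens))
# prints: Anatomy of a Fall is an uncommonly perceptive and thought-provoking procedural.

# Tokens without the marker continue the previous word.
continuations = [t for t in tokens[1:] if not t.startswith("Ġ")]
print(continuations)
# prints: ['at', 'omy', 'ly', 'ceptive', '-', 'prov', 'oking', '.']
```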

In our final example we return to **sentiment analysis**, but this time without `pipeline`. Instead we run `model` directly to obtain logits, pass these to `softmax` to obtain probabilities, and finally apply `argmax` to determine the winning class.

Note that this time we're also moving to multiple input texts, processed at the same time.

In [7]:
from transformers import AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = ["Anatomy of a Fall is an uncommonly perceptive and thought-provoking procedural.",
         "Anatomy of a Fall can ponder the unknowability of any objective truth.",
         "Triet rejects a facile omniscient representation of the death-scene at any point.",
         "It’s another facet of the Rashomon prism."]

tokenized_inputs = tokenizer(inputs, padding=True, truncation=True, max_length=256, return_tensors="pt")
print("Tokenized inputs using separate steps:\n", tokenized_inputs)

with torch.no_grad():
  output = model(**tokenized_inputs) # unpack dictionary to use as argument values
  print("Logits using separate steps:\n", output.logits)
  scores = F.softmax(output.logits, dim=1)
  print("Scores using separate steps:\n", scores)
  labels = torch.argmax(scores, dim=1)
  print("Labels using separate steps:\n", labels)

Tokenized inputs using separate steps:
 {'input_ids': tensor([[  101, 13336,  1997,  1037,  2991,  2003,  2019, 13191,  2135,  2566,
         28687,  1998,  2245,  1011,  4013, 22776, 24508,  1012,   102,     0,
             0,     0,     0],
        [  101, 13336,  1997,  1037,  2991,  2064, 29211,  1996,  4895,  2243,
         19779,  8010,  1997,  2151,  7863,  3606,  1012,   102,     0,     0,
             0,     0,     0],
        [  101, 13012,  3388, 19164,  1037,  6904,  6895,  2571, 18168,  8977,
         23402,  3372,  6630,  1997,  1996,  2331,  1011,  3496,  2012,  2151,
          2391,  1012,   102],
        [  101,  2009,  1521,  1055,  2178,  2227,  2102,  1997,  1996, 23438,
         19506,  2078, 26113,  1012,   102,     0,     0,     0,     0,     0,
             0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
        [1, 

This gives us, in order, the four tokenized inputs and their attention masks, the corresponding logits produced by the model, the probabilities that the logits map to, and finally the winning class labels.
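The trailing zeros in the tokenized inputs are padding: with `padding=True` the shorter sequences are filled up to the length of the longest one in the batch, and the attention mask records which positions hold real tokens. The pure-Python sketch below mimics that behaviour on short, made-up ID lists. It assumes a padding ID of 0, as the output above suggests, and the helper name `pad_batch` is hypothetical, not part of `transformers`.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every ID list to the longest length and build attention masks.

    Returns (input_ids, attention_mask); mask entries are 1 for real tokens
    and 0 for padding positions.
    """
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq))
                      for seq in sequences]
    return input_ids, attention_mask

# Made-up sequences; 101 and 102 are the delimiter IDs seen in the output above.
batch = [[101, 2009, 2003, 102], [101, 102]]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)       # prints: [[101, 2009, 2003, 102], [101, 102, 0, 0]]
print(attention_mask)  # prints: [[1, 1, 1, 1], [1, 1, 0, 0]]
```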

Now let's check that a pipeline consisting of the same model and tokenizer would have produced the same result:

In [8]:
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

outputs = classifier(inputs)
print("Labels and scores using pipeline: ", outputs)

Device set to use mps:0


Labels and scores using pipeline:  [{'label': 'POSITIVE', 'score': 0.9997468590736389}, {'label': 'NEGATIVE', 'score': 0.8784357309341431}, {'label': 'NEGATIVE', 'score': 0.9952684044837952}, {'label': 'NEGATIVE', 'score': 0.8752526044845581}]


If you compare these scores with the probabilities produced by the separate steps above, you'll find that each one matches the probability of the winning class for that input.
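To make the correspondence concrete, here is a minimal pure-Python sketch of how a row of logits maps to a pipeline-style label and score. The logits are hypothetical, not the values from the run above, and the index-to-label mapping (0 for NEGATIVE, 1 for POSITIVE) is an assumption about this model's configuration that you can check via `model.config.id2label`.

```python
import math

# Assumed label mapping; verify against model.config.id2label.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}

def to_pipeline_style(logits_batch, id2label=ID2LABEL):
    """Apply softmax then argmax to each row of logits."""
    results = []
    for logits in logits_batch:
        peak = max(logits)  # subtract the maximum for numerical stability
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        probs = [e / total for e in exps]
        best = max(range(len(probs)), key=probs.__getitem__)  # argmax
        results.append({"label": id2label[best], "score": probs[best]})
    return results

# Hypothetical logits for two inputs.
hypothetical_logits = [[-4.0, 4.3], [1.2, -0.8]]
for result in to_pipeline_style(hypothetical_logits):
    print(result["label"], round(result["score"], 4))
```

The score reported for each input is just the largest softmax probability, which is why the pipeline output contains one number per text rather than one per class.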

Now it's over to you. In the cell below, create a pipeline for a task of your choosing, first just specifying the task, then also the model and tokenizer, finally replacing the pipeline with steps in the same way we did above.

In [9]:
# Step 1: Create a pipeline for a task of your choosing, then test it with some inputs

## INSERT YOUR CODE HERE ##
ner = pipeline("ner")
preds = ner("Did Ronaldo score in the football match")
preds = [
    {
        "entity": pred["entity"],
        "score": round(pred["score"], 4),
        "index": pred["index"],
        "word": pred["word"],
        "start": pred["start"],
        "end": pred["end"],
    }
    for pred in preds
]
if preds:
    print(*preds, sep="\n")
else:
    print("no entities found")
## END OF YOUR CODE ##


# Step 2: Now specify the model and tokenizer you wish to use in the pipeline

## INSERT YOUR CODE HERE ##

## END OF YOUR CODE ##


# Step 3: Finally, use separate steps equivalent to your pipeline with specified model and tokenizer
# to produce outputs for the same inputs as in Step 1, and check that the outputs are the same

## INSERT YOUR CODE HERE ##

## END OF YOUR CODE ##

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use mps:0


{'entity': 'I-PER', 'score': np.float32(0.9949), 'index': 2, 'word': 'Ronald', 'start': 4, 'end': 10}
{'entity': 'I-PER', 'score': np.float32(0.9817), 'index': 3, 'word': '##o', 'start': 10, 'end': 11}
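Note that the name is reported as two word pieces, `Ronald` and `##o`, because the tokenizer splits it into subword tokens and the model labels each one separately. One way to recover whole-word entities is to merge continuation pieces by hand, as in the sketch below. It works on the predictions copied from the output above (with scores as plain floats); the helper name is hypothetical, and keeping the lowest score of the merged pieces is a choice made for this sketch. The pipeline also accepts an `aggregation_strategy` argument that groups tokens for you, which may be the more convenient route.

```python
# Predictions copied from the output above (scores as plain floats).
preds = [
    {"entity": "I-PER", "score": 0.9949, "index": 2, "word": "Ronald", "start": 4, "end": 10},
    {"entity": "I-PER", "score": 0.9817, "index": 3, "word": "##o", "start": 10, "end": 11},
]

def merge_word_pieces(preds):
    """Merge '##' continuation pieces into the preceding adjacent token."""
    merged = []
    for pred in preds:
        is_continuation = pred["word"].startswith("##")
        if is_continuation and merged and merged[-1]["end"] == pred["start"]:
            last = merged[-1]
            last["word"] += pred["word"][2:]  # drop the '##' prefix
            last["end"] = pred["end"]
            last["score"] = min(last["score"], pred["score"])  # sketch choice
        else:
            # Copy the fields we need so the input list is left unmodified.
            merged.append({k: pred[k] for k in ("entity", "score", "word", "start", "end")})
    return merged

print(*merge_word_pieces(preds), sep="\n")
# prints: {'entity': 'I-PER', 'score': 0.9817, 'word': 'Ronaldo', 'start': 4, 'end': 11}
```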


In [12]:
from IPython.display import Audio

# No task argument is given here: the task is inferred from the model
pipe = pipeline(model="suno/bark-small")
output = pipe("Hey it's HuggingFace on the phone! Rachel are you talking to me or are you talking to your cat again")

audio = output["audio"]
sampling_rate = output["sampling_rate"]

# Play the generated waveform inline in the notebook
Audio(audio, rate=sampling_rate)

Device set to use mps:0
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.


In [16]:
import re

from transformers import pipeline

# 1. Sentiment pipeline (default model, as in the first example)
sentiment_pipe = pipeline("sentiment-analysis")

doc = "Cristiano Ronaldo scored a goal in the last game, he could be on form again!"

def get_player_sentiment(document: str, player_name: str) -> dict:
    """Classify the sentiment of every sentence that mentions player_name."""
    # Split after sentence-final punctuation (., ! or ?) followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", document)

    # Keep the non-empty sentences mentioning the player (case-insensitive)
    player_sentences = [s.strip() for s in sentences
                        if s.strip() and player_name.lower() in s.lower()]

    # Return the same keys whether or not the player is mentioned
    if not player_sentences:
        return {"player": player_name, "mentions": 0, "sentences": [], "sentiments": []}

    # Get sentiment for each mention
    results = sentiment_pipe(player_sentences)

    return {
        "player": player_name,
        "mentions": len(player_sentences),
        "sentences": player_sentences,
        "sentiments": results
    }
print(get_player_sentiment(doc, "Cristiano Ronaldo"))

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0


{'player': 'Cristiano Ronaldo', 'mentions': 1, 'sentences': ['Cristiano Ronaldo scored a goal in the last game, he could be on form again!'], 'sentiments': [{'label': 'POSITIVE', 'score': 0.9986085295677185}]}
