# LLMs with Hugging Face
In this notebook, we'll take a whirlwind tour of some top applications using Large Language Models (LLMs):

* Summarization
* Sentiment analysis
* Translation
* Zero-shot classification
* Few-shot learning

## Learning Objectives
1. Use a variety of existing models for a variety of common applications.
2. Understand basic prompt engineering.
3. Understand search vs. sampling for LLM inference.
4. Get familiar with the main Hugging Face abstractions: datasets, pipelines, tokenizers, and models.

Libraries:

* `sacremoses`: required by the translation model `Helsinki-NLP/opus-mt-en-es` (English -> Spanish)

In [None]:
%%capture
%pip install sacremoses==0.0.53

In [None]:
%%capture
%pip install datasets
%pip install transformers


## Common LLM Applications

In [None]:
from datasets import load_dataset

In [None]:
from transformers import pipeline

#### **Summarization**
Summarization can take two forms:

* **Extractive**: selecting representative excerpts from the text.
* **Abstractive**: generating novel text summaries.

Here, we will use a model that performs abstractive summarization.
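To make the contrast concrete, here is a minimal sketch of the *extractive* approach, using only the standard library. The word-frequency scoring heuristic and the function name `extractive_summary` are illustrative assumptions, not something this notebook or any Hugging Face model uses; an abstractive model such as `t5-small` generates new text instead of copying sentences.

```python
import re
from collections import Counter


def extractive_summary(text: str, num_sentences: int = 1) -> str:
    """Toy extractive summarizer: keep the sentences with the most frequent words."""
    # Naive sentence split on '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole text, lower-cased.
    freqs = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> float:
        words = re.findall(r"[a-z']+", sentence.lower())
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freqs[w] for w in words) / max(len(words), 1)

    # Keep the top-scoring sentences, then restore their original order.
    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    return " ".join(s for s in sentences if s in top)


text = (
    "The storm damaged many roads. "
    "Repairs to the roads will take weeks. "
    "Officials thanked volunteers."
)
print(extractive_summary(text, num_sentences=1))
```

Every sentence in the output is copied verbatim from the input, which is the defining property of extractive summarization.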

**Data and Model:**

**Data** : xsum dataset, which provides a set of BBC articles and summaries.

**Model** : t5-small model

In [None]:
mkdir cache


In [None]:
x_sum_dataset = load_dataset("xsum",version="1.2.0", cache_dir="cache")

The repository for xsum contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/xsum.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


In [None]:
x_sum_dataset

DatasetDict({
    train: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 204045
    })
    validation: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11332
    })
    test: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11334
    })
})

#### About the dataset
Splits:
* train
* validation
* test

Features:
* document
* summary
* id


In [None]:
xsum_sample = x_sum_dataset["train"].select(range(10))
display(xsum_sample.to_pandas())

Unnamed: 0,document,summary,id
0,"The full cost of damage in Newton Stewart, one...",Clean-up operations are continuing across the ...,35232142
1,A fire alarm went off at the Holiday Inn in Ho...,Two tourist buses have been destroyed by fire ...,40143035
2,Ferrari appeared in a position to challenge un...,Lewis Hamilton stormed to pole position at the...,35951548
3,"John Edward Bates, formerly of Spalding, Linco...",A former Lincolnshire Police officer carried o...,36266422
4,Patients and staff were evacuated from Cerahpa...,An armed man who locked himself into a room at...,38826984
5,Simone Favaro got the crucial try with the las...,Defending Pro12 champions Glasgow Warriors bag...,34540833
6,"Veronica Vanessa Chango-Alverez, 31, was kille...",A man with links to a car that was involved in...,20836172
7,Belgian cyclist Demoitie died after a collisio...,Welsh cyclist Luke Rowe says changes to the sp...,35932467
8,"Gundogan, 26, told BBC Sport he ""can see the f...",Manchester City midfielder Ilkay Gundogan says...,40758845
9,The crash happened about 07:20 GMT at the junc...,A jogger has been hit by an unmarked police ca...,30358490


### Loading a pre-trained model with the Hugging Face library

In [None]:
summarizer = pipeline(
    task="summarization",
    model="t5-small",
    max_length = 20,
    min_length = 10,
    truncation =True,
    model_kwargs={"cache_dir": "../working/cache/"})
# Note: We specify cache_dir to use predownloaded models.

In [None]:
#apply to article 1:
summarizer(xsum_sample["document"][0])

[{'summary_text': 'the full cost of damage in Newton Stewart is still being assessed . many roads in pee'}]

In [None]:
#apply to a batch of articles
results = summarizer(xsum_sample["document"])

In [None]:
results

[{'summary_text': 'the full cost of damage in Newton Stewart is still being assessed . many roads in pee'},
 {'summary_text': 'a fire alarm went off at the Holiday Inn in Hope Street on sunday .'},
 {'summary_text': 'Sebastian Vettel will start third ahead of team-mate Kimi Raikkonen .'},
 {'summary_text': 'the 67-year-old is accused of committing the offences between March 1972 and'},
 {'summary_text': 'a man receiving psychiatric treatment at the clinic threatened to shoot himself and others'},
 {'summary_text': 'Gregor Townsend gave a debut to powerhouse wing Taqele Nai'},
 {'summary_text': 'Veronica Vanessa Chango-Alverez, 31, was killed and another man injured in'},
 {'summary_text': 'the 25-year-old was hit by a motorbike during the Gent-Wevel'},
 {'summary_text': 'gundogan will not be fit for the start of the premier league season at Brighton .'},
 {'summary_text': 'the crash happened about 07:20 GMT at the junction of the A127 and Progress Road '}]

In [None]:
# Display the generated summary side-by-side

import pandas as pd

display(
    pd.DataFrame.from_dict(results)
    .rename({"summary_text" : "generated_summary"},axis = 1)
    .join(pd.DataFrame.from_dict(xsum_sample))
    [["generated_summary","summary","document","id"]]
)

Unnamed: 0,generated_summary,summary,document,id
0,the full cost of damage in Newton Stewart is s...,Clean-up operations are continuing across the ...,"The full cost of damage in Newton Stewart, one...",35232142
1,a fire alarm went off at the Holiday Inn in Ho...,Two tourist buses have been destroyed by fire ...,A fire alarm went off at the Holiday Inn in Ho...,40143035
2,Sebastian Vettel will start third ahead of tea...,Lewis Hamilton stormed to pole position at the...,Ferrari appeared in a position to challenge un...,35951548
3,the 67-year-old is accused of committing the o...,A former Lincolnshire Police officer carried o...,"John Edward Bates, formerly of Spalding, Linco...",36266422
4,a man receiving psychiatric treatment at the c...,An armed man who locked himself into a room at...,Patients and staff were evacuated from Cerahpa...,38826984
5,Gregor Townsend gave a debut to powerhouse win...,Defending Pro12 champions Glasgow Warriors bag...,Simone Favaro got the crucial try with the las...,34540833
6,"Veronica Vanessa Chango-Alverez, 31, was kille...",A man with links to a car that was involved in...,"Veronica Vanessa Chango-Alverez, 31, was kille...",20836172
7,the 25-year-old was hit by a motorbike during ...,Welsh cyclist Luke Rowe says changes to the sp...,Belgian cyclist Demoitie died after a collisio...,35932467
8,gundogan will not be fit for the start of the ...,Manchester City midfielder Ilkay Gundogan says...,"Gundogan, 26, told BBC Sport he ""can see the f...",40758845
9,the crash happened about 07:20 GMT at the junc...,A jogger has been hit by an unmarked police ca...,The crash happened about 07:20 GMT at the junc...,30358490


## Sentiment Analysis

Sentiment analysis is a text classification task of estimating whether a piece of text is positive, negative, or another "sentiment" label.

In [None]:
poem_dataset = load_dataset("poem_sentiment",cache_dir="cache")
poem_sample = poem_dataset["train"].select(range(10))
display(poem_sample.to_pandas())

Generating train split:   0%|          | 0/892 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/105 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/104 [00:00<?, ? examples/s]

Unnamed: 0,id,verse_text,label
0,0,with pale blue berries. in these peaceful shad...,1
1,1,"it flows so long as falls the rain,",2
2,2,"and that is why, the lonesome day,",0
3,3,"when i peruse the conquered fame of heroes, an...",3
4,4,of inward strife for truth and liberty.,3
5,5,the red sword sealed their vows!,3
6,6,and very venus of a pipe.,2
7,7,"who the man, who, called a brother.",2
8,8,"and so on. then a worthless gaud or two,",0
9,9,to hide the orb of truth--and every throne,2


In [None]:
sentiment_classifier = pipeline(
    task="text-classification",
    model="nickwong64/bert-base-uncased-poems-sentiment",
    model_kwargs = {"cache_dir" : "../working/cache/"},
)

In [None]:
results = sentiment_classifier(poem_sample["verse_text"])

In [None]:
# display the predicted sentiment side-by-side

joined_data =(
    pd.DataFrame.from_dict(results)
    .rename({'label' : "predicted_label"},axis = 1)
    .join(pd.DataFrame.from_dict(poem_sample).rename({"label" : "true_label"},axis =1))

    )

#change label indices to text labels
sentiment_labels = {
    0 : "negative",
    1 : "positive",
    2 : "no_impact",
    3 : "mixed"
}

joined_data = joined_data.replace({"true_label" : sentiment_labels})

display(joined_data[["predicted_label","true_label","score","verse_text"]])


Unnamed: 0,predicted_label,true_label,score,verse_text
0,positive,positive,0.996594,with pale blue berries. in these peaceful shad...
1,no_impact,no_impact,0.998741,"it flows so long as falls the rain,"
2,negative,negative,0.995966,"and that is why, the lonesome day,"
3,mixed,mixed,0.968735,"when i peruse the conquered fame of heroes, an..."
4,mixed,mixed,0.975967,of inward strife for truth and liberty.
5,mixed,mixed,0.96658,the red sword sealed their vows!
6,no_impact,no_impact,0.998639,and very venus of a pipe.
7,no_impact,no_impact,0.998611,"who the man, who, called a brother."
8,negative,negative,0.996557,"and so on. then a worthless gaud or two,"
9,no_impact,no_impact,0.998519,to hide the orb of truth--and every throne
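The `score` column can be read as the model's probability for the predicted label. As a rough sketch of how a classifier's raw outputs (logits) typically become a label and a score, the snippet below applies a softmax and picks the most probable class. The logits are made-up numbers, and the assumption that the pipeline applies a softmax over the four classes reflects its usual single-label behavior rather than anything verified in this notebook.

```python
import math

# Label ids as used by the poem_sentiment dataset in this notebook.
id2label = {0: "negative", 1: "positive", 2: "no_impact", 3: "mixed"}


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


# Hypothetical logits for one verse (not produced by the real model).
print(classify([0.2, 4.1, 0.5, -1.0]))
```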


## Translation

In [None]:
en_to_es_translation_pipeline = pipeline(
    task = "translation",
    model = "Helsinki-NLP/opus-mt-en-es",
    model_kwargs = {"cache_dir" : "../working/cache/"},
)


In [None]:
en_to_es_translation_pipeline("more than a club")

[{'translation_text': 'más que un club'}]

Using a model that handles multiple languages

The `t5-small` model can translate English to:
* French
* Romanian
* German



In [None]:
t5_small_pipeline = pipeline(
    task = "text2text-generation",
    model = "t5-small",
    max_length = 50,
    model_kwargs = {"cache_dir" : "../working/cache/"},
)


In [None]:
t5_small_pipeline("translate English to German: more than a club")

[{'generated_text': 'mehr als ein Club'}]
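T5 selects its task from a plain-text prefix at the start of the input. The helper below is a small sketch of building such prompts; the dictionary keys and the function name `build_t5_prompt` are hypothetical, while the prefix strings are the ones T5 is generally documented to have been trained with.

```python
# Task prefixes commonly used with the original T5 checkpoints.
T5_TASK_PREFIXES = {
    "summarize": "summarize: ",
    "en_to_de": "translate English to German: ",
    "en_to_fr": "translate English to French: ",
    "en_to_ro": "translate English to Romanian: ",
}


def build_t5_prompt(task: str, text: str) -> str:
    """Prepend the task prefix that tells T5 what to do with the text."""
    if task not in T5_TASK_PREFIXES:
        raise ValueError(f"Unknown task {task!r}; expected one of {sorted(T5_TASK_PREFIXES)}")
    return T5_TASK_PREFIXES[task] + text


print(build_t5_prompt("en_to_fr", "more than a club"))
```

The resulting string can be passed to `t5_small_pipeline` in the same way as the German example above.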

## Zero-shot classification

Zero-shot classification is the task of classifying a piece of text into one of a few given categories or labels, without having explicitly trained the model to predict those categories beforehand.
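One common way to do this, and the idea behind NLI-based models like the one used in this section, is to turn each candidate label into a hypothesis sentence and ask how strongly the text entails it. The sketch below shows that scoring loop only. `toy_entailment_logit` is a hypothetical stand-in for a real NLI model (it just counts shared words), and the template `"This example is {}."` is assumed here for illustration.

```python
import math


def zero_shot_scores(text, candidate_labels, entailment_logit,
                     hypothesis_template="This example is {}."):
    """Rank labels by how strongly `text` entails each label's hypothesis."""
    # One (premise, hypothesis) pair per candidate label.
    logits = [
        entailment_logit(text, hypothesis_template.format(label))
        for label in candidate_labels
    ]
    # Softmax across labels so the scores sum to 1.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    scored = [(label, e / total) for label, e in zip(candidate_labels, exps)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_entailment_logit(premise, hypothesis):
    """Hypothetical NLI stand-in: number of words shared by premise and hypothesis."""
    premise_words = set(premise.lower().split())
    hypothesis_words = set(hypothesis.lower().rstrip(".").split())
    return float(len(premise_words & hypothesis_words))


ranked = zero_shot_scores(
    "The team won the sports final",
    ["politics", "sports", "finance"],
    toy_entailment_logit,
)
print(ranked)
```

Because the labels only appear inside the hypothesis text, new labels can be added at inference time without retraining.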

In [None]:
deberta_v3 = pipeline(
    task = "zero-shot-classification",
    model = 'cross-encoder/nli-deberta-v3-small',
    model_kwargs = {"cache_dir": ".../working/cache/"},
    )


In [None]:
candidate_labels=[
            "politics",
            "finance",
            "sports",
            "science and technology",
            "pop culture",
            "breaking news",
        ]

In [None]:
article =  """
Simone Favaro got the crucial try with the last move of the game, following earlier touchdowns by Chris Fusaro, Zander Fagerson and Junior Bulumakau.
Rynard Landman and Ashton Hewitt got a try in either half for the Dragons.
Glasgow showed far superior strength in depth as they took control of a messy match in the second period.
Home coach Gregor Townsend gave a debut to powerhouse Fijian-born Wallaby wing Taqele Naiyaravoro, and centre Alex Dunbar returned from long-term injury, while the Dragons gave first starts of the season to wing Aled Brew and hooker Elliot Dee.
Glasgow lost hooker Pat McArthur to an early shoulder injury but took advantage of their first pressure when Rory Clegg slotted over a penalty on 12 minutes.
It took 24 minutes for a disjointed game to produce a try as Sarel Pretorius sniped from close range and Landman forced his way over for Jason Tovey to convert - although it was the lock's last contribution as he departed with a chest injury shortly afterwards.
Glasgow struck back when Fusaro drove over from a rolling maul on 35 minutes for Clegg to convert.
But the Dragons levelled at 10-10 before half-time when Naiyaravoro was yellow-carded for an aerial tackle on Brew and Tovey slotted the easy goal.
The visitors could not make the most of their one-man advantage after the break as their error count cost them dearly.
It was Glasgow's bench experience that showed when Mike Blair's break led to a short-range score from teenage prop Fagerson, converted by Clegg.
Debutant Favaro was the second home player to be sin-binned, on 63 minutes, but again the Warriors made light of it as replacement wing Bulumakau, a recruit from the Army, pounced to deftly hack through a bouncing ball for an opportunist try.
The Dragons got back within striking range with some excellent combined handling putting Hewitt over unopposed after 72 minutes.
However, Favaro became sinner-turned-saint as he got on the end of another effective rolling maul to earn his side the extra point with the last move of the game, Clegg converting.
Dragons director of rugby Lyn Jones said: "We're disappointed to have lost but our performance was a lot better [than against Leinster] and the game could have gone either way.
"Unfortunately too many errors behind the scrum cost us a great deal, though from where we were a fortnight ago in Dublin our workrate and desire was excellent.
"It was simply error count from individuals behind the scrum that cost us field position, it's not rocket science - they were correct in how they played and we had a few errors, that was the difference."
Glasgow Warriors: Rory Hughes, Taqele Naiyaravoro, Alex Dunbar, Fraser Lyle, Lee Jones, Rory Clegg, Grayson Hart; Alex Allan, Pat MacArthur, Zander Fagerson, Rob Harley (capt), Scott Cummings, Hugh Blake, Chris Fusaro, Adam Ashe.
Replacements: Fergus Scott, Jerry Yanuyanutawa, Mike Cusack, Greg Peterson, Simone Favaro, Mike Blair, Gregor Hunter, Junior Bulumakau.
Dragons: Carl Meyer, Ashton Hewitt, Ross Wardle, Adam Warren, Aled Brew, Jason Tovey, Sarel Pretorius; Boris Stankovich, Elliot Dee, Brok Harris, Nick Crosswell, Rynard Landman (capt), Lewis Evans, Nic Cudd, Ed Jackson.
Replacements: Rhys Buckley, Phil Price, Shaun Knight, Matthew Screech, Ollie Griffiths, Luc Jones, Charlie Davies, Nick Scott.
"""

In [None]:
res = deberta_v3(article,candidate_labels=candidate_labels)


In [None]:
display(pd.DataFrame(res))

Unnamed: 0,sequence,labels,scores
0,\nSimone Favaro got the crucial try with the l...,sports,0.469012
1,\nSimone Favaro got the crucial try with the l...,breaking news,0.223165
2,\nSimone Favaro got the crucial try with the l...,science and technology,0.107025
3,\nSimone Favaro got the crucial try with the l...,pop culture,0.104471
4,\nSimone Favaro got the crucial try with the l...,politics,0.05739
5,\nSimone Favaro got the crucial try with the l...,finance,0.038937


## Few-Shot Learning

In few-shot learning, you give the model an instruction, a few query-response examples of how to follow that instruction, and then a new query. The model must then generate the response for that new query.

In [None]:
few_shot_pipeline = pipeline(
    task = "text-generation",
    model = "EleutherAI/gpt-neo-1.3B",
    max_new_tokens = 10, #limiting the response length
    model_kwargs = {"cache_dir" : "../working/cache/"},
)

 In the few-shot prompts below, we separate the examples with a special token "###" and use the same token to encourage the LLM to end its output after answering the query. We will tell the pipeline to use that special token as the end-of-sequence (EOS) token below.
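Writing these prompts by hand is error-prone, so it can help to assemble them from a list of examples. The helper below is a sketch; the function name `build_few_shot_prompt` and its arguments are hypothetical, and the layout simply mirrors the `[Tweet]` / `[Sentiment]` format used in this notebook.

```python
def build_few_shot_prompt(instruction, examples, query, separator="###"):
    """Build a few-shot prompt: instruction, solved examples, then the open query."""
    lines = [instruction, ""]
    for tweet, sentiment in examples:
        lines.append(f'[Tweet]: "{tweet}"')
        lines.append(f"[Sentiment]: {sentiment}")
        lines.append(separator)  # marks the end of one example
    # The final query is left unanswered for the model to complete.
    lines.append(f'[Tweet]: "{query}"')
    lines.append("[Sentiment]:")
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "For each tweet, describe its sentiment:",
    [("This is the link to the article", "Neutral")],
    "This new music video was incredible",
)
print(prompt)
```

The returned string can be passed to `few_shot_pipeline` together with `eos_token_id`, exactly like a hand-written prompt.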

In [None]:
eos_token_id = few_shot_pipeline.tokenizer.encode("###")[0]


In [None]:
eos_token_id

21017

In [None]:
# without any examples, the model output is inconsistent and usually incorrect.

results = few_shot_pipeline(
    """ For each tweet, describe its sentiment:

    Tweet: "I loved the new Batman movie!"
    Sentiment:""",

    eos_token_id = eos_token_id,
)
print(results[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:21017 for open-end generation.


 For each tweet, describe its sentiment:

    Tweet: "I loved the new Batman movie!"
    Sentiment: Very negative

We use the sentiment score as


In [None]:
# With only 1 example, the model may or may not get the answer right.
results = few_shot_pipeline(
    """For each tweet, describe its sentiment:

[Tweet]: "This is the link to the article"
[Sentiment]: Neutral
###
[Tweet]: "This new music video was incredible"
[Sentiment]:""",
    eos_token_id=eos_token_id,
)

print(results[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:21017 for open-end generation.


For each tweet, describe its sentiment:

[Tweet]: "This is the link to the article"
[Sentiment]: Neutral
###
[Tweet]: "This new music video was incredible"
[Sentiment]: Positive
###


In [None]:
# With 1 example for each sentiment, the model is more likely to understand!
results = few_shot_pipeline(
    """For each tweet, describe its sentiment:

[Tweet]: "I hate it when my phone battery dies."
[Sentiment]: Negative
###
[Tweet]: "My day has been 👍"
[Sentiment]: Positive
###
[Tweet]: "This is the link to the article"
[Sentiment]: Neutral
###
[Tweet]: "This new music video was incredible"
[Sentiment]:""",
    eos_token_id=eos_token_id,
)

print(results[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:21017 for open-end generation.


For each tweet, describe its sentiment:

[Tweet]: "I hate it when my phone battery dies."
[Sentiment]: Negative
###
[Tweet]: "My day has been 👍"
[Sentiment]: Positive
###
[Tweet]: "This is the link to the article"
[Sentiment]: Neutral
###
[Tweet]: "This new music video was incredible"
[Sentiment]: Neutral
#favicon_favicon


In [None]:
# These book descriptions were taken from their corresponding Wikipedia pages.
results = few_shot_pipeline(
    """Generate a book summary from the title:

[book title]: "Stranger in a Strange Land"
[book description]: "This novel tells the story of Valentine Michael Smith, a human who comes to Earth in early adulthood after being born on the planet Mars and raised by Martians, and explores his interaction with and eventual transformation of Terran culture."
###
[book title]: "The Adventures of Tom Sawyer"
[book description]: "This novel is about a boy growing up along the Mississippi River. It is set in the 1840s in the town of St. Petersburg, which is based on Hannibal, Missouri, where Twain lived as a boy. In the novel, Tom Sawyer has several adventures, often with his friend Huckleberry Finn."
###
[book title]: "Dune"
[book description]: "This novel is set in the distant future amidst a feudal interstellar society in which various noble houses control planetary fiefs. It tells the story of young Paul Atreides, whose family accepts the stewardship of the planet Arrakis. While the planet is an inhospitable and sparsely populated desert wasteland, it is the only source of melange, or spice, a drug that extends life and enhances mental abilities.  The story explores the multilayered interactions of politics, religion, ecology, technology, and human emotion, as the factions of the empire confront each other in a struggle for the control of Arrakis and its spice."
###
[book title]: "Blue Mars"
[book description]:""",
    eos_token_id=eos_token_id,
    max_new_tokens=50,
)
print(results[0]["generated_text"])


Setting `pad_token_id` to `eos_token_id`:21017 for open-end generation.


Generate a book summary from the title:

[book title]: "Stranger in a Strange Land"
[book description]: "This novel tells the story of Valentine Michael Smith, a human who comes to Earth in early adulthood after being born on the planet Mars and raised by Martians, and explores his interaction with and eventual transformation of Terran culture."
###
[book title]: "The Adventures of Tom Sawyer"
[book description]: "This novel is about a boy growing up along the Mississippi River. It is set in the 1840s in the town of St. Petersburg, which is based on Hannibal, Missouri, where Twain lived as a boy. In the novel, Tom Sawyer has several adventures, often with his friend Huckleberry Finn."
###
[book title]: "Dune"
[book description]: "This novel is set in the distant future amidst a feudal interstellar society in which various noble houses control planetary fiefs. It tells the story of young Paul Atreides, whose family accepts the stewardship of the planet Arrakis. While the planet is an in




# Hugging Face API

* Searching and sampling to generate text
* Auto loaders for tokenizer and models
* Model-Specific Loaders



In [None]:
display(xsum_sample.to_pandas())

Unnamed: 0,document,summary,id
0,"The full cost of damage in Newton Stewart, one...",Clean-up operations are continuing across the ...,35232142
1,A fire alarm went off at the Holiday Inn in Ho...,Two tourist buses have been destroyed by fire ...,40143035
2,Ferrari appeared in a position to challenge un...,Lewis Hamilton stormed to pole position at the...,35951548
3,"John Edward Bates, formerly of Spalding, Linco...",A former Lincolnshire Police officer carried o...,36266422
4,Patients and staff were evacuated from Cerahpa...,An armed man who locked himself into a room at...,38826984
5,Simone Favaro got the crucial try with the las...,Defending Pro12 champions Glasgow Warriors bag...,34540833
6,"Veronica Vanessa Chango-Alverez, 31, was kille...",A man with links to a car that was involved in...,20836172
7,Belgian cyclist Demoitie died after a collisio...,Welsh cyclist Luke Rowe says changes to the sp...,35932467
8,"Gundogan, 26, told BBC Sport he ""can see the f...",Manchester City midfielder Ilkay Gundogan says...,40758845
9,The crash happened about 07:20 GMT at the junc...,A jogger has been hit by an unmarked police ca...,30358490


## Searching and Sampling in inference



The parameters `num_beams` and `do_sample` used in this section are inference configuration options.

* LLMs work by predicting (generating) the next token, then the next, and so on.
* Goal: generate a high-probability sequence of tokens.


#### Two main methods for LLM inference
* *Search*: given the tokens generated so far, pick the most likely next token.
  * **Greedy search**: pick the single most likely next token at each step.
  * **Beam search**: explore several candidate sequences in parallel and keep the most likely ones; the number of candidates is set via the parameter ```num_beams```.

* *Sampling*: pick the next token by sampling from the predicted distribution over tokens.
  * **Top-k sampling**: the parameter ```top_k``` restricts sampling to the ```k``` most likely tokens.
  * **Top-p sampling**: the parameter ```top_p``` restricts sampling to the smallest set of most likely tokens whose cumulative probability reaches ```p```.


* You can toggle between search and sampling via the parameter `do_sample`.
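The strategies above can be illustrated on a single decoding step with a toy next-token distribution. This is a standard-library sketch with made-up probabilities and hypothetical helper names; it is not how `transformers` implements generation internally, but the filtering logic follows the same idea.

```python
import random


def greedy(probs):
    """Greedy search: take the single most likely token."""
    return max(probs, key=probs.get)


def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    kept, mass = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        mass += prob
        if mass >= p:
            break
    total = sum(q for _, q in kept)
    return {tok: q / total for tok, q in kept}


def sample(probs, rng):
    """Sampling: draw one token according to its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Made-up distribution over the next token.
next_token_probs = {"cat": 0.5, "dog": 0.3, "fish": 0.15, "xylophone": 0.05}
rng = random.Random(0)  # seeded so repeated runs give the same draws

print("greedy:", greedy(next_token_probs))
print("top-k (k=2):", top_k_filter(next_token_probs, 2))
print("top-p (p=0.75):", top_p_filter(next_token_probs, 0.75))
print("sampled:", sample(top_k_filter(next_token_probs, 2), rng))
```

Both filters remove the unlikely tail (here `"xylophone"`) before sampling, which is why they make sampling behave more like search.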



In [None]:
# We previously called the summarization pipeline using the default inference configuration.

# This does Greedy Search

summarizer(xsum_sample["document"][0])

[{'summary_text': 'the full cost of damage in Newton Stewart is still being assessed . many roads in pee'}]

In [None]:
# We can instead do a beam search by specifying num_beams.
# This takes longer to run, but it might find a better (more likely) sequence of text.
summarizer(xsum_sample["document"][0], num_beams=10)

[{'summary_text': 'the full cost of damage in Newton Stewart is still being assessed . many roads in pee'}]
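To see why beam search can find a more likely sequence than greedy search, consider the toy example below. The "model" is a hypothetical lookup table of next-token probabilities that depend only on the previous token, and the search runs for a fixed number of steps. Real beam search also handles end-of-sequence tokens and length penalties, which this sketch omits.

```python
import math

# Toy "language model": next-token probabilities given the last token only.
TOY_MODEL = {
    "<s>": {"A": 0.6, "B": 0.4},
    "A": {"x": 0.55, "y": 0.45},
    "B": {"x": 0.9, "y": 0.1},
}


def beam_search(model, start, steps, num_beams):
    """Keep the `num_beams` most likely partial sequences at every step."""
    beams = [([start], 0.0)]  # (tokens, log-probability)
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            for tok, p in model[tokens[-1]].items():
                candidates.append((tokens + [tok], score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    best_tokens, best_score = beams[0]
    # Drop the start token and convert the log-probability back.
    return best_tokens[1:], math.exp(best_score)


# num_beams=1 is greedy search: it commits to "A" and ends with probability 0.33.
print(beam_search(TOY_MODEL, "<s>", steps=2, num_beams=1))
# With two beams, the initially less likely "B" survives and wins with 0.36.
print(beam_search(TOY_MODEL, "<s>", steps=2, num_beams=2))
```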

In [None]:
# Alternatively, we could use sampling.
summarizer(xsum_sample["document"][0], do_sample=True)

[{'summary_text': 'the full cost of damage in Newton Stewart is still being assessed . many roads in pee'}]

In [None]:
# We can modify sampling to be more greedy by limiting sampling to the top_k or top_p most likely next tokens.
summarizer(xsum_sample["document"][0], do_sample=True, top_k=10, top_p=0.8)

[{'summary_text': 'the full cost of damage in Newton Stewart is still being assessed . many roads in pee'}]

## Auto* loaders for tokenizers and models
We have already seen the `dataset` and `pipeline` abstractions from Hugging Face. While a `pipeline` is a quick way to set up an LLM for a given task, the slightly lower-level abstractions `model` and `tokenizer` permit a bit more control over options. We will show how to use those briefly, following this pattern:

* Given input articles.
* Tokenize them (converting to token indices).
* Apply the model on the tokenized data to generate summaries (represented as token indices).
* Decode the summaries into human-readable text.

We will first look at the `Auto*` classes for tokenizers and model types, which can simplify loading pre-trained tokenizers and models.



* `AutoTokenizer`
* `AutoModelForSeq2SeqLM`

In [None]:
from transformers import AutoTokenizer

In [None]:
from transformers import AutoModelForSeq2SeqLM
# AutoModelForSeq2SeqLM loads a pre-trained model for sequence-to-sequence tasks,
# such as translation, summarization, and other text-to-text generation.

In [None]:


#load the pre-trained tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("t5-small", cache_dir=".../working/cache")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small", cache_dir=".../working/cache")

In [None]:
# from transformers import AutoTokenizer

# # Load a pre-trained tokenizer model
# tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# # Tokenize a piece of text
# text = "This is an example sentence."
# inputs = tokenizer.encode_plus(
#     text,
#     add_special_tokens=True,
#     max_length=512,
#     truncation=True,
#     return_attention_mask=True,
#     return_tensors='pt'
# )

# print(inputs['input_ids'])
# print(inputs['attention_mask'])

In [None]:
# For summarization, T5-small expects a prefix "summarize: ", so we prepend that to each article as a prompt.

articles = list(map(lambda article: "summarize: " + article, xsum_sample["document"]))
display(pd.DataFrame(articles,columns=["prompts"]))

Unnamed: 0,prompts
0,summarize: The full cost of damage in Newton S...
1,summarize: A fire alarm went off at the Holida...
2,summarize: Ferrari appeared in a position to c...
3,"summarize: John Edward Bates, formerly of Spal..."
4,summarize: Patients and staff were evacuated f...
5,summarize: Simone Favaro got the crucial try w...
6,"summarize: Veronica Vanessa Chango-Alverez, 31..."
7,summarize: Belgian cyclist Demoitie died after...
8,"summarize: Gundogan, 26, told BBC Sport he ""ca..."
9,summarize: The crash happened about 07:20 GMT ...


In [None]:
# tokenize the input

inputs = tokenizer(
    articles, max_length=1024, return_tensors ="pt",padding = True, truncation = True
    )

print("input_ids:")
print(inputs["input_ids"])
print("attention_mask:")
print(inputs["attention_mask"])

input_ids:
tensor([[21603,    10,    37,  ...,     0,     0,     0],
        [21603,    10,    71,  ...,     0,     0,     0],
        [21603,    10, 21945,  ..., 18002,    21,     1],
        ...,
        [21603,    10, 21768,  ...,     0,     0,     0],
        [21603,    10,  9982,  ...,     0,     0,     0],
        [21603,    10,    37,  ...,     0,     0,     0]])
attention_mask:
tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 1, 1, 1],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])
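The trailing zeros in `input_ids` are padding, and the matching zeros in `attention_mask` tell the model to ignore those positions. The sketch below reproduces that behavior on plain Python lists. `pad_batch` is a hypothetical helper, the token ids are illustrative, and the pad id of 0 matches what the output above suggests for T5. A real tokenizer also keeps the end-of-sequence token when truncating, which this simplified version does not.

```python
def pad_batch(token_id_lists, pad_id=0, max_length=None):
    """Pad variable-length sequences to a common length and build attention masks."""
    if max_length is not None:
        # Simplified truncation: just cut each sequence off.
        token_id_lists = [ids[:max_length] for ids in token_id_lists]
    longest = max(len(ids) for ids in token_id_lists)
    input_ids, attention_mask = [], []
    for ids in token_id_lists:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[21603, 10, 37, 1], [21603, 10, 71, 423, 583, 1]])
print(batch["input_ids"])
print(batch["attention_mask"])
```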


In [None]:
# Generate summaries
summary_ids = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    num_beams=2,
    min_length=0,
    max_length=40,
)
print(summary_ids)

tensor([[    0,     8,   423,   583,    13,  1783,    16, 20126, 16496,    19,
           341,   271, 14841,     3,     5,   186,  7540,    16,   158,    15,
          2296,     7,  5718,  2367, 14621,  4161,    57,  4125,   387,     3,
             5,     3,     9,  8347,  5685,  3048,    16,   286,   640,     8],
        [    0,  1472,  6196,   877,   326,    44,     8,  9108,    86,    29,
            16,  6000,  1887,    30,  1856,     3,     5,  2554,   130,  1380,
            12,  1175,     8,  1595,     3,     5,    80,    13,     8,   192,
         14264,    19,    45, 13692,    63,     6,     8,   119,    45, 20576],
        [    0,     3,   849,  2239,     7,   163, 14014,     3,    60,  8234,
           232,   227,     3, 19585,   643,   845,   150,  8033,    47,   787,
            30,   213,     3,    88,   225,  2447,     3,     5,     3,   849,
          2239,     7,   497,     3,    31,    29,    32,   964,  8033,    47],
        [    0,     8,     3,  3708,    18,  1201

(Each row above contains the token IDs of one generated summary.)

In [None]:
# Decode the generated summaries
decoded_summaries = tokenizer.batch_decode(summary_ids, skip_special_tokens=True)
display(pd.DataFrame(decoded_summaries, columns=["decoded_summaries"]))

Unnamed: 0,decoded_summaries
0,the full cost of damage in Newton Stewart is s...
1,fire alarm went off at the Holiday Inn in Hope...
2,stewards only handed reprimand after governing...
3,the 67-year-old is accused of committing the o...
4,a man receiving treatment at the clinic threat...
5,Gregor Townsend gave a debut to powerhouse win...
6,"Veronica Vanessa Chango-Alverez, 31, was kille..."
7,the 25-year-old was hit by a motorbike during ...
8,gundogan says he can see the finishing line af...
9,the crash happened about 07:20 GMT at the junc...


## Model-specific tokenizer and model loaders
You can also more directly load specific tokenizer and model types, rather than relying on `Auto*` classes to choose the right ones for you.

* T5Tokenizer
* T5ForConditionalGeneration
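As a mental model, an `Auto*` class can be thought of as a lookup from the `model_type` recorded in a checkpoint's configuration to a concrete class. The sketch below illustrates only that dispatch idea with hypothetical toy classes; it is not the actual `transformers` implementation.

```python
class ToyT5Model:
    """Hypothetical stand-in for a model-specific class."""

    def __init__(self, config):
        self.config = config


class ToyBartModel:
    """A second hypothetical model-specific class."""

    def __init__(self, config):
        self.config = config


# Maps the `model_type` field of a config to the class that implements it.
TOY_MODEL_REGISTRY = {"t5": ToyT5Model, "bart": ToyBartModel}


class ToyAutoModel:
    """Chooses the concrete class from the config, like an Auto* loader would."""

    @classmethod
    def from_config(cls, config):
        model_type = config["model_type"]
        if model_type not in TOY_MODEL_REGISTRY:
            raise ValueError(f"Unsupported model_type: {model_type!r}")
        return TOY_MODEL_REGISTRY[model_type](config)


model_via_auto = ToyAutoModel.from_config({"model_type": "t5"})
model_direct = ToyT5Model({"model_type": "t5"})
print(type(model_via_auto).__name__, type(model_direct).__name__)
```

Loading the specific class directly skips the lookup, which is useful when you already know the architecture and want its model-specific options.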

In [None]:
from transformers import T5Tokenizer
from transformers import T5ForConditionalGeneration

In [None]:
tokenizer = T5Tokenizer.from_pretrained("t5-small", cache_dir=".../working/cache")
model = T5ForConditionalGeneration.from_pretrained("t5-small", cache_dir=".../working/cache")

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [None]:
# The tokenizer and model can then be used just like the ones loaded via the Auto* classes.

inputs = tokenizer(
    articles,
    max_length=1024,
    return_tensors="pt",
    padding=True,
    truncation=True,
)

# Store the output in `summary_ids` so that the decode step below uses these results
# rather than the ones generated earlier.
summary_ids = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    num_beams=2,
    min_length=0,
    max_length=40,
)

decoded_summaries = tokenizer.batch_decode(summary_ids, skip_special_tokens=True)
display(pd.DataFrame(decoded_summaries, columns=["decoded_summaries"]))

Unnamed: 0,decoded_summaries
0,the full cost of damage in Newton Stewart is s...
1,fire alarm went off at the Holiday Inn in Hope...
2,stewards only handed reprimand after governing...
3,the 67-year-old is accused of committing the o...
4,a man receiving treatment at the clinic threat...
5,Gregor Townsend gave a debut to powerhouse win...
6,"Veronica Vanessa Chango-Alverez, 31, was kille..."
7,the 25-year-old was hit by a motorbike during ...
8,gundogan says he can see the finishing line af...
9,the crash happened about 07:20 GMT at the junc...


## Summary

We've covered some common LLM applications and seen how to get started with them quickly using pre-trained models from the Hugging Face Hub. We've also seen how to tweak some inference configurations.