#This notebook covers text summarization with LLMs. We explore prompt engineering by comparing zero-shot, one-shot and few-shot inference, and then try out different generative configuration parameters for inference.

#Without fine-tuning, there are two things we can do to affect the output of a given LLM: prompt engineering (writing instruction prompts and adding in-context examples) and configuring the generation parameters used at inference.

#Pre-requisite libraries

In [4]:
## Library versions are pinned so the code still runs after future releases
%pip install -U datasets==2.17.0

%pip install --upgrade pip
%pip install --disable-pip-version-check \
    torch==1.13.1 \
    torchdata==0.5.1 --quiet

%pip install \
    transformers==4.27.2 --quiet
%pip install py7zr

Collecting py7zr
  Downloading py7zr-0.21.0-py3-none-any.whl.metadata (17 kB)
Collecting texttable (from py7zr)
  Downloading texttable-1.7.0-py2.py3-none-any.whl.metadata (9.8 kB)
Collecting pycryptodomex>=3.16.0 (from py7zr)
  Downloading pycryptodomex-3.20.0-cp35-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.4 kB)
Collecting pyzstd>=0.15.9 (from py7zr)
  Downloading pyzstd-0.16.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.4 kB)
Collecting pyppmd<1.2.0,>=1.1.0 (from py7zr)
  Downloading pyppmd-1.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.7 kB)
Collecting pybcj<1.1.0,>=1.0.0 (from py7zr)
  Downloading pybcj-1.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.0 kB)
Collecting multivolumefile>=0.2.3 (from py7zr)
  Downloading multivolumefile-0.2.3-py3-none-any.whl.metadata (6.3 kB)
Collecting inflate64<1.1.0,>=1.0.0 (from py7zr)
  Downloading inflate64-1.0.0-cp310-cp310-manylinux_2

In [2]:
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM
from transformers import AutoTokenizer
from transformers import GenerationConfig

#Dataset Structure

The created dataset is made of 16369 conversations distributed uniformly into 4 groups based on the number of utterances in conversations: 3-6, 7-12, 13-18 and 19-30. Each utterance contains the name of the speaker. Most conversations consist of dialogues between two interlocutors (about 75% of all conversations), the rest is between three or more people.

Information about the dataset:

The SAMSum dataset contains about 16k messenger-like conversations with summaries. Conversations were created and written down by linguists fluent in English. Linguists were asked to create conversations similar to those they write on a daily basis, reflecting the proportion of topics of their real-life messenger conversations. The style and register are diversified: conversations could be informal, semi-formal or formal, and they may contain slang words, emoticons and typos. Then, the conversations were annotated with summaries. It was assumed that summaries should be a concise brief of what people talked about in the conversation in third person. The SAMSum dataset was prepared by Samsung R&D Institute Poland and is distributed for research purposes.

Columns in the dataset:
*  Dialogue: text of dialogue
*  Summary: human written summary of the dialogue
*  ID: unique id of an example

In [5]:
huggingface_dataset_name = "samsum" # https://huggingface.co/datasets/samsum

dataset = load_dataset(huggingface_dataset_name)

Downloading data:   0%|          | 0.00/2.94M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/14732 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/819 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/818 [00:00<?, ? examples/s]

Exploring the dataset with two examples, looking at dialogue and human summaries

In [6]:
example_indices = [1, 2]

dash_line = '.' * 149  # separator line of 149 dots, same result as '.'.join('' for x in range(150))

for i, index in enumerate(example_indices):
    print(dash_line)
    print('Example ', i + 1)
    print(dash_line)
    print('INPUT DIALOGUE:')
    print(dataset['test'][index]['dialogue'])
    print(dash_line)
    print('BASELINE HUMAN SUMMARY:')
    print(dataset['test'][index]['summary'])
    print(dash_line)
    print()

.....................................................................................................................................................
Example  1
.....................................................................................................................................................
INPUT DIALOGUE:
Eric: MACHINE!
Rob: That's so gr8!
Eric: I know! And shows how Americans see Russian ;)
Rob: And it's really funny!
Eric: I know! I especially like the train part!
Rob: Hahaha! No one talks to the machine like that!
Eric: Is this his only stand-up?
Rob: Idk. I'll check.
Eric: Sure.
Rob: Turns out no! There are some of his stand-ups on youtube.
Eric: Gr8! I'll watch them now!
Rob: Me too!
Eric: MACHINE!
Rob: MACHINE!
Eric: TTYL?
Rob: Sure :)
.....................................................................................................................................................
BASELINE HUMAN SUMMARY:
Eric and Rob are going to watch a stand

#Importing the flan-t5-small model from Hugging Face (details here: https://huggingface.co/google/flan-t5-small)

In [7]:
model_name='google/flan-t5-small' # a general purpose model (using small version for speed)

model = AutoModelForSeq2SeqLM.from_pretrained(model_name) # AutoModelForSeq2SeqLM can be used
# to load any seq2seq (or encoder-decoder) model that has a language modeling (LM) head on top.
# AutoModelForCausalLM is used for auto-regressive language models like all the GPT models.



config.json:   0%|          | 0.00/1.40k [00:00<?, ?B/s]



model.safetensors:   0%|          | 0.00/308M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

In [8]:
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
# The tokenizer converts raw text (here, our conversations) into integer token IDs.
# Inside the model, each ID selects one vector from the embedding table; these
# vectors (the embeddings) are what the flan-t5 model actually processes.

tokenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

In [9]:
sentence = "What time is it, Tom?"

sentence_encoded = tokenizer(sentence, return_tensors='pt')

sentence_decoded = tokenizer.decode(
        sentence_encoded["input_ids"][0],
        skip_special_tokens=True
    )

print('ENCODED SENTENCE:')
print(sentence_encoded["input_ids"][0])
print('\nDECODED SENTENCE:')
print(sentence_decoded)

ENCODED SENTENCE:
tensor([ 363,   97,   19,   34,    6, 3059,   58,    1])

DECODED SENTENCE:
What time is it, Tom?
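
To make the "IDs point to vectors" idea concrete, here is a toy sketch in plain Python. It is not how the FLAN-T5 tokenizer works internally (that one is a subword tokenizer loaded from `spiece.model`); the vocabulary `toy_vocab`, the embedding values and the helper names are all made up for illustration. The only detail borrowed from the output above is that the encoded sequence ends with ID 1, the end-of-sequence token.

```python
# Toy word-level tokenizer. Every name, word and ID below is hypothetical,
# except that ID 1 is used as the end-of-sequence marker, as in the output above.
toy_vocab = {"<pad>": 0, "</s>": 1, "<unk>": 2, "what": 3, "time": 4, "is": 5, "it": 6}
id_to_token = {i: t for t, i in toy_vocab.items()}

# One small vector per vocabulary entry (a real model learns these values).
toy_embeddings = [[float(i), float(i) / 10.0] for i in range(len(toy_vocab))]


def toy_encode(text):
    """Map lowercase whitespace-separated words to IDs and append end-of-sequence."""
    ids = [toy_vocab.get(word, toy_vocab["<unk>"]) for word in text.lower().split()]
    return ids + [toy_vocab["</s>"]]


def toy_decode(ids):
    """Map IDs back to words, skipping the special tokens."""
    special = {"<pad>", "</s>"}
    return " ".join(id_to_token[i] for i in ids if id_to_token[i] not in special)


ids = toy_encode("What time is it")
vectors = [toy_embeddings[i] for i in ids]  # the IDs are just row indices
print(ids)
print(toy_decode(ids))
```

Decoding with the special tokens skipped mirrors what `skip_special_tokens=True` does in the cell above.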


#Summaries below generated by the model **without** any prompt engineering.

In [10]:
for i, index in enumerate(example_indices):
    dialogue = dataset['test'][index]['dialogue']
    summary = dataset['test'][index]['summary']

    inputs = tokenizer(dialogue, return_tensors='pt')
    output = tokenizer.decode(
        model.generate(
            inputs["input_ids"],
            max_new_tokens=50,
        )[0],
        skip_special_tokens=True
    )

    print(dash_line)
    print('Example ', i + 1)
    print(dash_line)
    print(f'INPUT PROMPT:\n{dialogue}')
    print(dash_line)
    print(f'BASELINE HUMAN SUMMARY:\n{summary}')
    print(dash_line)
    print(f'MODEL GENERATION - WITHOUT PROMPT ENGINEERING:\n{output}\n')

.....................................................................................................................................................
Example  1
.....................................................................................................................................................
INPUT PROMPT:
Eric: MACHINE!
Rob: That's so gr8!
Eric: I know! And shows how Americans see Russian ;)
Rob: And it's really funny!
Eric: I know! I especially like the train part!
Rob: Hahaha! No one talks to the machine like that!
Eric: Is this his only stand-up?
Rob: Idk. I'll check.
Eric: Sure.
Rob: Turns out no! There are some of his stand-ups on youtube.
Eric: Gr8! I'll watch them now!
Rob: Me too!
Eric: MACHINE!
Rob: MACHINE!
Eric: TTYL?
Rob: Sure :)
.....................................................................................................................................................
BASELINE HUMAN SUMMARY:
Eric and Rob are going to watch a stand-u

#The model-generated summaries lack important details of the dialogue.

##1st example: The model summary misses that Eric is also going to watch the stand-ups on YouTube.

##2nd example: The part of the model output about "Bob will help Lenny with the first pair of trousers" isn't useful information. The baseline summary is much more useful: "Lenny can't decide which trousers to buy. Bob advised Lenny on that topic."

#Below we apply a bit of prompt engineering and convert the dialogue into an instruction prompt by wrapping it in the following text:

"Summarize the following conversation.

{dialogue}

Summary:"

This is zero-shot inference with an instruction prompt. Maybe this will help improve the model-generated summary?


In [11]:
for i, index in enumerate(example_indices):
    dialogue = dataset['test'][index]['dialogue']
    summary = dataset['test'][index]['summary']

    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
    """

    # Input constructed prompt instead of the dialogue.
    inputs = tokenizer(prompt, return_tensors='pt')
    output = tokenizer.decode(
        model.generate(
            inputs["input_ids"],
            max_new_tokens=50,
        )[0],
        skip_special_tokens=True
    )

    print(dash_line)
    print('Example ', i + 1)
    print(dash_line)
    print(f'INPUT PROMPT:\n{prompt}')
    print(dash_line)
    print(f'BASELINE HUMAN SUMMARY:\n{summary}')
    print(dash_line)
    print(f'MODEL GENERATION - ZERO SHOT:\n{output}\n')

.....................................................................................................................................................
Example  1
.....................................................................................................................................................
INPUT PROMPT:

Summarize the following conversation.

Eric: MACHINE!
Rob: That's so gr8!
Eric: I know! And shows how Americans see Russian ;)
Rob: And it's really funny!
Eric: I know! I especially like the train part!
Rob: Hahaha! No one talks to the machine like that!
Eric: Is this his only stand-up?
Rob: Idk. I'll check.
Eric: Sure.
Rob: Turns out no! There are some of his stand-ups on youtube.
Eric: Gr8! I'll watch them now!
Rob: Me too!
Eric: MACHINE!
Rob: MACHINE!
Eric: TTYL?
Rob: Sure :)

Summary:
    
.....................................................................................................................................................
BASELINE H

1st example: The model summary now captures that Eric also wants to watch the stand-ups on YouTube.

2nd example: Not much difference. The output changes slightly, but it is still not useful as a summary.

Now we change the wording of the instruction prompt. This is what we had before:

"Summarize the following conversation.

{dialogue}

Summary:"

We change it to the following, where the question now comes after the dialogue:

"Dialogue: {dialogue}

What was going on?"

to see if that makes a difference:

In [12]:
for i, index in enumerate(example_indices):
    dialogue = dataset['test'][index]['dialogue']
    summary = dataset['test'][index]['summary']

    prompt = f"""
Dialogue:

{dialogue}

What was going on?
"""

    inputs = tokenizer(prompt, return_tensors='pt')
    output = tokenizer.decode(
        model.generate(
            inputs["input_ids"],
            max_new_tokens=50,
        )[0],
        skip_special_tokens=True
    )

    print(dash_line)
    print('Example ', i + 1)
    print(dash_line)
    print(f'INPUT PROMPT:\n{prompt}')
    print(dash_line)
    print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
    print(dash_line)
    print(f'MODEL GENERATION - ZERO SHOT:\n{output}\n')

.....................................................................................................................................................
Example  1
.....................................................................................................................................................
INPUT PROMPT:

Dialogue:

Eric: MACHINE!
Rob: That's so gr8!
Eric: I know! And shows how Americans see Russian ;)
Rob: And it's really funny!
Eric: I know! I especially like the train part!
Rob: Hahaha! No one talks to the machine like that!
Eric: Is this his only stand-up?
Rob: Idk. I'll check.
Eric: Sure.
Rob: Turns out no! There are some of his stand-ups on youtube.
Eric: Gr8! I'll watch them now!
Rob: Me too!
Eric: MACHINE!
Rob: MACHINE!
Eric: TTYL?
Rob: Sure :)

What was going on?

.....................................................................................................................................................
BASELINE HUMAN SUMMARY:
Eric and

##For the 1st example, the model-generated output stayed the same. For the 2nd example, it got worse and left out some details.

#Next we try one-shot learning (giving the model one complete example to learn from inside the prompt) together with an instruction prompt:

##We first provide a complete example using the template below:
"Dialogue:
{dialogue}

What was going on?

{summary}"

##And then the dialogue we want the model to generate a summary for:

"Dialogue: {dialogue}

What was going on?"

In [13]:
def make_prompt(example_indices_full, example_index_to_summarize):
    prompt = ''
    for index in example_indices_full:
        dialogue = dataset['test'][index]['dialogue']
        summary = dataset['test'][index]['summary']

        # The stop sequence '{summary}\n\n\n' is important for FLAN-T5. Other models may have their own preferred stop sequence.
        prompt += f"""
Dialogue:

{dialogue}

What was going on?
{summary}


"""

    dialogue = dataset['test'][example_index_to_summarize]['dialogue']

    prompt += f"""
Dialogue:

{dialogue}

What was going on?
"""

    return prompt

In [14]:
example_indices_full = [40]
example_index_to_summarize = 1

one_shot_prompt = make_prompt(example_indices_full, example_index_to_summarize)

print(one_shot_prompt)


Dialogue:

Sebastian: It's been already a year since we moved here.
Sebastian: This is totally the best time of my life.
Kevin: Really? 
Sebastian: Yeah! Totally maaan.
Sebastian: During this 1 year I learned more than ever. 
Sebastian: I learned how to be resourceful, I'm learning responsibility, and I literally have the power to make my dreams come true.
Kevin: It's great to hear that.
Kevin: It's great that you are satisfied with your decisions.
Kevin: And above all it's great to see that you have someone you love by your side :)
Sebastian: Exactly!
Sebastian: That's another part of my life that is going great.
Kevin: I wish I had such a person by my side.
Sebastian: Don't worry about it.
Sebastian: I have a feeling this day will come shortly.
Kevin: Haha. I don' think so, but thanks.
Sebastian: This one year proved to me that when you want something really badly, you can achieve it.
Kevin: I want to win lottery and I never did :D
Sebastian: If you devoted your lif

In [15]:
summary = dataset['test'][example_index_to_summarize]['summary']

inputs = tokenizer(one_shot_prompt, return_tensors='pt')
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        max_new_tokens=50,
    )[0],
    skip_special_tokens=True
)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - ONE SHOT:\n{output}')

.....................................................................................................................................................
BASELINE HUMAN SUMMARY:
Eric and Rob are going to watch a stand-up on youtube.

.....................................................................................................................................................
MODEL GENERATION - ONE SHOT:
Eric and Rob are watching a stand-up on YouTube.


In [16]:
example_indices_full = [40]
example_index_to_summarize = 2

one_shot_prompt = make_prompt(example_indices_full, example_index_to_summarize)

print(one_shot_prompt)


Dialogue:

Sebastian: It's been already a year since we moved here.
Sebastian: This is totally the best time of my life.
Kevin: Really? 
Sebastian: Yeah! Totally maaan.
Sebastian: During this 1 year I learned more than ever. 
Sebastian: I learned how to be resourceful, I'm learning responsibility, and I literally have the power to make my dreams come true.
Kevin: It's great to hear that.
Kevin: It's great that you are satisfied with your decisions.
Kevin: And above all it's great to see that you have someone you love by your side :)
Sebastian: Exactly!
Sebastian: That's another part of my life that is going great.
Kevin: I wish I had such a person by my side.
Sebastian: Don't worry about it.
Sebastian: I have a feeling this day will come shortly.
Kevin: Haha. I don' think so, but thanks.
Sebastian: This one year proved to me that when you want something really badly, you can achieve it.
Kevin: I want to win lottery and I never did :D
Sebastian: If you devoted your lif

In [17]:
summary = dataset['test'][example_index_to_summarize]['summary']

inputs = tokenizer(one_shot_prompt, return_tensors='pt')
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        max_new_tokens=50,
    )[0],
    skip_special_tokens=True
)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - ONE SHOT:\n{output}')

Token indices sequence length is longer than the specified maximum sequence length for this model (537 > 512). Running this sequence through the model will result in indexing errors


.....................................................................................................................................................
BASELINE HUMAN SUMMARY:
Lenny can't decide which trousers to buy. Bob advised Lenny on that topic. Lenny goes with Bob's advice to pick the trousers that are of best quality.

.....................................................................................................................................................
MODEL GENERATION - ONE SHOT:
Bob will help Lenny with the outfit.



##In both examples, there is not much difference in quality compared to using an instruction prompt alone.

1st example:
The only thing that could be improved is the tense: the model says Eric and Rob "are watching", whereas the baseline says they "are going to watch". They aren't currently watching stand-up on YouTube; they plan to.

2nd example:
The model generates a weak summary. It leaves out many details of the dialogue, e.g. that Bob advised Lenny and that Lenny goes with Bob's advice to pick the trousers of the best quality, even if they're the same colour as ones he already has.
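
The comparisons above are made by eye. As a rough complement, the sketch below scores a generated summary against the human baseline by word overlap. This is an assumption on my part rather than something the notebook uses: `unigram_f1` is a hypothetical helper and only a simplified stand-in for a metric such as ROUGE-1. The two strings are copied from the one-shot output for the 1st example.

```python
import re
from collections import Counter


def unigram_f1(candidate, reference):
    """Unigram-overlap F1 between two strings (a simplified stand-in for ROUGE-1)."""
    cand = Counter(re.findall(r"[a-z0-9]+", candidate.lower()))
    ref = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    overlap = sum((cand & ref).values())  # clipped count of shared words
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


baseline = "Eric and Rob are going to watch a stand-up on youtube."
one_shot = "Eric and Rob are watching a stand-up on YouTube."
print(round(unigram_f1(one_shot, baseline), 2))
```

A word-overlap score like this rewards shared vocabulary only. It cannot tell that "are watching" and "are going to watch" differ in meaning, so it supplements the manual reading above rather than replacing it.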


##Below we try few-shot learning by adding two more dialogue-summary pairs to the prompt

In [18]:
example_indices_full = [40,50,60]
example_index_to_summarize = 3

few_shot_prompt = make_prompt(example_indices_full, example_index_to_summarize)

print(few_shot_prompt)


Dialogue:

Sebastian: It's been already a year since we moved here.
Sebastian: This is totally the best time of my life.
Kevin: Really? 
Sebastian: Yeah! Totally maaan.
Sebastian: During this 1 year I learned more than ever. 
Sebastian: I learned how to be resourceful, I'm learning responsibility, and I literally have the power to make my dreams come true.
Kevin: It's great to hear that.
Kevin: It's great that you are satisfied with your decisions.
Kevin: And above all it's great to see that you have someone you love by your side :)
Sebastian: Exactly!
Sebastian: That's another part of my life that is going great.
Kevin: I wish I had such a person by my side.
Sebastian: Don't worry about it.
Sebastian: I have a feeling this day will come shortly.
Kevin: Haha. I don' think so, but thanks.
Sebastian: This one year proved to me that when you want something really badly, you can achieve it.
Kevin: I want to win lottery and I never did :D
Sebastian: If you devoted your lif

In [19]:
summary = dataset['test'][example_index_to_summarize]['summary']

inputs = tokenizer(few_shot_prompt, return_tensors='pt')
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        max_new_tokens=50,
    )[0],
    skip_special_tokens=True
)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - FEW SHOT:\n{output}')

.....................................................................................................................................................
BASELINE HUMAN SUMMARY:
Emma will be home soon and she will let Will know.

.....................................................................................................................................................
MODEL GENERATION - FEW SHOT:
Emma is going to pick Will up. Will will pick her up soon.


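In this case, few-shot did not provide much of an improvement over one-shot inference. Anything above 5 or 6 shots will typically not help much either. We also need to make sure we don't exceed the model's input context length, which is 512 tokens here. Input beyond the context length should not be relied on: it is cut off when truncation is enabled and is otherwise outside the range the model was trained on (the tokenizer already warned about a 537-token prompt in the one-shot run above).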
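
One way to respect the context limit is to drop in-context examples until the prompt fits. The sketch below is a minimal illustration under stated assumptions: `fit_shots` and `count_words` are hypothetical helpers, and counting whitespace-separated words is only a stand-in so the example runs without a model. In the notebook you would pass a real token counter instead, for example `lambda text: len(tokenizer(text)["input_ids"])`, with `max_tokens=512`.

```python
def fit_shots(shots, query, count_tokens, max_tokens=512):
    """Keep as many leading shots as fit, together with the query, in max_tokens."""
    kept = list(shots)
    while kept and count_tokens("".join(kept) + query) > max_tokens:
        kept.pop()  # drop the last example and try again
    return "".join(kept) + query


def count_words(text):
    """Stand-in token counter: whitespace-separated words (an assumption)."""
    return len(text.split())


# Three identical toy shots followed by the dialogue to summarize.
shots = ["Dialogue:\nA: hi\nB: hello\nWhat was going on?\nA and B greet.\n\n\n"] * 3
query = "Dialogue:\nC: bye\nWhat was going on?\n"

prompt = fit_shots(shots, query, count_words, max_tokens=30)
print(count_words(prompt))
print(prompt)
```

The query is always kept, even if it alone exceeds the budget, because a prompt without the dialogue to summarize would be useless; in that case the dialogue itself would need shortening.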

Overall, for example 1 we found that some prompt engineering (using an instruction prompt) improved the generated output. Beyond that, adding one-shot or few-shot examples did not improve it further.

#**Generative Configuration Parameters for Inference**

Changing the configuration parameters of the `generate()` method allows different outputs from the LLM.

So far the only parameter that has been set is `max_new_tokens=50`, which defines the maximum number of tokens to generate.

Full list of available parameters can be found in the [Hugging Face Generation documentation](https://huggingface.co/docs/transformers/v4.29.1/en/main_classes/text_generation#transformers.GenerationConfig).

The `GenerationConfig` class is used to organise the configuration parameters.

Another parameter explored is temperature:

In [20]:
example_indices_full = [40,50,60]
example_index_to_summarize = 1

few_shot_prompt = make_prompt(example_indices_full, example_index_to_summarize)

In [21]:
generation_config = GenerationConfig(max_new_tokens=50, do_sample=False, temperature=100.0) # setting temperature to 100

inputs = tokenizer(few_shot_prompt, return_tensors='pt')
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        generation_config=generation_config,
    )[0],
    skip_special_tokens=True
)

print(dash_line)
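# Look up the reference summary for the example being summarized; the `summary`
# variable was last set for a different example (index 3).
print(f"BASELINE HUMAN SUMMARY:\n{dataset['test'][example_index_to_summarize]['summary']}\n")
print(dash_line)
print(f'MODEL GENERATION - FEW SHOT:\n{output}')

.....................................................................................................................................................
BASELINE HUMAN SUMMARY:
Eric and Rob are going to watch a stand-up on youtube.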

.....................................................................................................................................................
MODEL GENERATION - FEW SHOT:
Eric and Rob are watching a show on YouTube.




Setting `do_sample=True` switches from greedy decoding, which always picks the most likely next token, to sampling the next token from the probability distribution over the entire vocabulary. You can then adjust the outputs by changing `temperature` and other parameters (such as `top_k` and `top_p`, although we don't explore these here).

When `do_sample` is set to `False`, as in the cell above, we find that changing the temperature **makes no difference** to the output.
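
A toy illustration of why the temperature cannot matter for greedy decoding: dividing every score by the same positive number never changes which score is largest. The logits and the helper `greedy_pick` below are made up for this sketch and are not the library's code.

```python
def greedy_pick(logits, temperature):
    """Return the index of the highest logit after dividing by the temperature."""
    scaled = [value / temperature for value in logits]
    return max(range(len(scaled)), key=lambda i: scaled[i])


toy_logits = [2.0, 0.5, -1.0, 1.5]  # hypothetical scores for four candidate tokens
picks = [greedy_pick(toy_logits, t) for t in (0.1, 1.0, 100.0)]
print(picks)  # the same index for every temperature
```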

In [22]:
generation_config = GenerationConfig(max_new_tokens=50, do_sample=True, temperature=100.0) # setting temperature to 100 and do_sample to True

inputs = tokenizer(few_shot_prompt, return_tensors='pt')
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        generation_config=generation_config,
    )[0],
    skip_special_tokens=True
)

print(dash_line)
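# Look up the reference summary for the example being summarized; the `summary`
# variable was last set for a different example (index 3).
print(f"BASELINE HUMAN SUMMARY:\n{dataset['test'][example_index_to_summarize]['summary']}\n")
print(dash_line)
print(f'MODEL GENERATION - FEW SHOT:\n{output}')

.....................................................................................................................................................
BASELINE HUMAN SUMMARY:
Eric and Rob are going to watch a stand-up on youtube.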

.....................................................................................................................................................
MODEL GENERATION - FEW SHOT:
Fredi teaches by doing stands–for russia not wearing shirt by Ukrop to look. Robert finds these as useful there which could probably help some English-Berswerke, Serbia/Polixanites as per USs culture


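Above we set `do_sample` to `True` and the temperature to 100.0. The temperature can be any strictly positive float (0 itself is not allowed). Higher values flatten the next-token distribution and make the output more random, as the incoherent summary above shows.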
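
The effect of the temperature on sampling can be sketched with a plain softmax. This is a minimal illustration on made-up logits, not the library's implementation; `softmax_with_temperature` is a hypothetical helper. Low temperatures sharpen the distribution around the top token, while a temperature of 100 makes all four candidates almost equally likely.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing by a positive temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be strictly positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


toy_logits = [2.0, 0.5, -1.0, 1.5]  # hypothetical scores for four candidate tokens
for temperature in (0.5, 1.0, 100.0):
    probs = softmax_with_temperature(toy_logits, temperature)
    print(temperature, [round(p, 3) for p in probs])

# Draw one token index from the flattened distribution (seeded for repeatability).
rng = random.Random(0)
flat = softmax_with_temperature(toy_logits, 100.0)
sampled_index = rng.choices(range(len(flat)), weights=flat, k=1)[0]
print(sampled_index)
```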

In [23]:
generation_config = GenerationConfig(max_new_tokens=3, do_sample=False, temperature=1.0)

inputs = tokenizer(few_shot_prompt, return_tensors='pt')
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        generation_config=generation_config,
    )[0],
    skip_special_tokens=True
)

print(dash_line)
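# Look up the reference summary for the example being summarized; the `summary`
# variable was last set for a different example (index 3).
print(f"BASELINE HUMAN SUMMARY:\n{dataset['test'][example_index_to_summarize]['summary']}\n")
print(dash_line)
print(f'MODEL GENERATION - FEW SHOT:\n{output}')

.....................................................................................................................................................
BASELINE HUMAN SUMMARY:
Eric and Rob are going to watch a stand-up on youtube.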

.....................................................................................................................................................
MODEL GENERATION - FEW SHOT:
Eric and Rob


Above we set `max_new_tokens` to 3. This makes the output too short, so the summary is cut off after at most three generated tokens.

## Other configuration parameters to explore include `min_new_tokens`, `num_beams`, `top_k` and `top_p`. They are documented here: https://huggingface.co/docs/transformers/v4.29.1/en/main_classes/text_generation#transformers.GenerationConfig
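
As a rough idea of what `top_k` and `top_p` do when sampling is enabled: `top_k` restricts sampling to the k most likely tokens, and `top_p` restricts it to the smallest set of most likely tokens whose combined probability reaches p. The sketch below applies both to made-up probabilities in plain Python. `top_k_filter` and `top_p_filter` are hypothetical helpers for illustration, not the library's implementation.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalise their probabilities."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose mass reaches p, then renormalise."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return {i: probs[i] / mass for i in keep}


toy_probs = [0.5, 0.3, 0.15, 0.05]  # hypothetical next-token probabilities
print(top_k_filter(toy_probs, 2))   # only the two most likely tokens remain
print(top_p_filter(toy_probs, 0.9)) # three tokens are needed to reach 0.9
```

Both filters cut off the unlikely tail of the distribution, which is one way to keep sampled output varied without the incoherence seen at very high temperatures.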