In [None]:
## Student: Yuqing Qiao
## Date: 07/29/2024

In this exercise, you will perform prompt engineering on a dialogue summarization task using [Flan-T5](https://huggingface.co/google/flan-t5-large) and the [dialogsum dataset](https://huggingface.co/datasets/knkarthick/dialogsum). You will explore how different prompts affect the output of the model, and compare zero-shot and few-shot inferences. <br/>
Complete the code in the cells below.

### 1. Set up Required Dependencies

In [3]:
!pip install datasets -q --quiet

In [1]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig
from datasets import load_dataset
import re

### 2. Explore the Dataset

In [2]:
from datasets import load_dataset

dataset = load_dataset('knkarthick/dialogsum')


In [48]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
})

Print several dialogues with their baseline summaries.

In [5]:
example_indices = [0, 42, 800]
dash_line = '-' * 100

for i, index in enumerate(example_indices):
    print(dash_line)
    print('Example', i + 1)
    print(dash_line)
    print('INPUT DIALOGUE:')
    print(dataset['test'][index]['dialogue'])
    print(dash_line)
    print('BASELINE HUMAN SUMMARY:')
    print(dataset['test'][index]['summary'])
    print(dash_line)
    print()

----------------------------------------------------------------------------------------------------
Example 1
----------------------------------------------------------------------------------------------------
INPUT DIALOGUE:
#Person1#: Ms. Dawson, I need you to take a dictation for me.
#Person2#: Yes, sir...
#Person1#: This should go out as an intra-office memorandum to all employees by this afternoon. Are you ready?
#Person2#: Yes, sir. Go ahead.
#Person1#: Attention all staff... Effective immediately, all office communications are restricted to email correspondence and official memos. The use of Instant Message programs by employees during working hours is strictly prohibited.
#Person2#: Sir, does this apply to intra-office communications only? Or will it also restrict external communications?
#Person1#: It should apply to all communications, not only in this office between employees, but also any outside communications.
#Person2#: But sir, many employees use Instant Messaging to 

### 3. Summarize Dialogues without Prompt Engineering

Load the Flan-T5-large model and its tokenizer.

In [4]:
model_name = 'google/flan-t5-large'

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)


**Exercise**: Use the pre-trained model to summarize the example dialogues without any prompt engineering. Use the `model.generate()` function with `max_new_tokens=50`.

In [34]:
### WRITE YOUR CODE HERE

def summarize_dialogue(prompt, truncation=False):
    # Tokenize the prompt; truncation=True clips inputs that exceed the tokenizer's maximum length
    inputs = tokenizer(prompt, return_tensors='pt', truncation=truncation)
    summary_ids = model.generate(inputs['input_ids'], max_new_tokens=50)
    # A single prompt yields a single generated sequence, so decode the first one
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

for i, index in enumerate(example_indices):
    dialogue = dataset['test'][index]['dialogue']
    summary = summarize_dialogue(dialogue)
    print(dash_line)
    print("Example Summary", i+1)
    print(dash_line)
    print("Original:", dialogue)
    print(dash_line)
    print(f"Summary:{summary}")
    print(dash_line)
    print()

----------------------------------------------------------------------------------------------------
Example Summary 1
----------------------------------------------------------------------------------------------------
Original: #Person1#: Ms. Dawson, I need you to take a dictation for me.
#Person2#: Yes, sir...
#Person1#: This should go out as an intra-office memorandum to all employees by this afternoon. Are you ready?
#Person2#: Yes, sir. Go ahead.
#Person1#: Attention all staff... Effective immediately, all office communications are restricted to email correspondence and official memos. The use of Instant Message programs by employees during working hours is strictly prohibited.
#Person2#: Sir, does this apply to intra-office communications only? Or will it also restrict external communications?
#Person1#: It should apply to all communications, not only in this office between employees, but also any outside communications.
#Person2#: But sir, many employees use Instant Messaging t

You can see that the model generations make some sense, but the model doesn't seem to be sure what task it is supposed to accomplish and it often just makes up the next sentence in the dialogue. Prompt engineering can help here.

### 4. Summarize Dialogues with Instruction Prompts

In order to instruct the model to perform a task (e.g., summarize a dialogue), you can take the dialogue and convert it into an instruction prompt. This is often called **zero-shot inference**.
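
As a minimal sketch of what converting a dialogue into an instruction prompt can look like, the helper below wraps a dialogue in an instruction and an optional answer cue. The function name `build_zero_shot_prompt`, the `Summary:` cue, and the toy dialogue are illustrative assumptions, not part of the exercise code.

```python
def build_zero_shot_prompt(dialogue,
                           instruction="Summarize the following conversation.",
                           cue="Summary:"):
    """Wrap a dialogue in an instruction; no solved examples are included (zero-shot)."""
    parts = [instruction, dialogue.strip()]
    if cue:
        # An optional cue at the end signals where the answer should begin
        parts.append(cue)
    return "\n\n".join(parts)


# Hypothetical toy dialogue, written in the same speaker-tag style as the dataset
toy_dialogue = "#Person1#: Are we still meeting at noon?\n#Person2#: Yes, see you in the lobby."
zero_shot_prompt = build_zero_shot_prompt(toy_dialogue)
print(zero_shot_prompt)
```

The resulting string could then be passed to the model in place of the raw dialogue.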




**Exercise**: Wrap the dialogues in a descriptive instruction (e.g., "Summarize the following conversation."), and examine how the generated text changes.

In [32]:
### WRITE YOUR CODE HERE
instruction = "Summarize the following conversation."

def summarize_with_instruction(dialogue, instruction):
    # Use `prompt` rather than `input` to avoid shadowing the Python built-in
    prompt = f"{instruction} : {dialogue}"
    return summarize_dialogue(prompt)

for i, index in enumerate(example_indices):
    dialogue = dataset['test'][index]['dialogue']
    summary_with_instruction = summarize_with_instruction(dialogue, instruction)
    summary_without_instruction = summarize_with_instruction(dialogue,"")

    print(dash_line)
    print("Example Summary", i+1)
    print(dash_line)
    print("Summary with instruction:", summary_with_instruction)
    print(dash_line)
    print("Summary without instruction:", summary_without_instruction)
    print(dash_line)
    print()

----------------------------------------------------------------------------------------------------
Example Summary 1
----------------------------------------------------------------------------------------------------
Summary with instruction: #Person1# wants Ms. Dawson to take dictation for him.
----------------------------------------------------------------------------------------------------
Summary without instruction: #Person1#: Ms. Dawson, please take dictation for me.
----------------------------------------------------------------------------------------------------

----------------------------------------------------------------------------------------------------
Example Summary 2
----------------------------------------------------------------------------------------------------
Summary with instruction: #Person1# is worried about his future. #Person2# gives him some advice.
-------------------------------------------------------------------------------------------------

This is much better! However, the model still does not pick up on the nuance of the conversations.

**Exercise:** Experiment with the prompt text and see how it influences the generated output. Do the inferences change if you end the prompt with an empty string vs. `Summary: `?

In [10]:
### WRITE YOUR CODE HERE
instructions = [
    "Summarize the following conversation.",
    "Summarize the following conversation. Summary:"
]

for i, index in enumerate(example_indices):
    dialogue = dataset['test'][index]['dialogue']
    print(dash_line)
    print(f"Summary example:{i}")
    print(dash_line)
    # Use a separate index so the outer loop variable `i` is not overwritten
    for j, instruction in enumerate(instructions):
        print(f"Prompt instruction:{j}")
        summary_with_instruction = summarize_with_instruction(dialogue, instruction)
        print(dash_line)
        print("Summary with instruction:", summary_with_instruction)
        print(dash_line)
        print()

----------------------------------------------------------------------------------------------------
Summary example:0
----------------------------------------------------------------------------------------------------
Prompt instruction:0
----------------------------------------------------------------------------------------------------
Summary with instruction: #Person1# wants Ms. Dawson to take dictation for him.
----------------------------------------------------------------------------------------------------

Prompt instruction:1
----------------------------------------------------------------------------------------------------
Summary with instruction: #Person1# wants to send a memo to all employees.
----------------------------------------------------------------------------------------------------

----------------------------------------------------------------------------------------------------
Summary example:1
----------------------------------------------------------

**Exercise:** Flan-T5 has many prompt templates that are published for certain tasks [here](https://github.com/google-research/FLAN/blob/main/flan/v2/templates.py). Try using its pre-built prompts for dialogue summarization (e.g., the ones under the `"samsum"` key) and see how they influence the outputs.


In [10]:
### WRITE YOUR CODE HERE
templates = {
    "samsum": [
        ("{dialogue}\n\nBriefly summarize that dialogue.", "{summary}"),
        ("Here is a dialogue:\n{dialogue}\n\nWrite a short summary!",
         "{summary}"),
        ("Dialogue:\n{dialogue}\n\nWhat is a summary of this dialogue?",
         "{summary}"),
        ("{dialogue}\n\nWhat was that dialogue about, in two sentences or less?",
         "{summary}"),
        ("Here is a dialogue:\n{dialogue}\n\nWhat were they talking about?",
         "{summary}"),
        ("Dialogue:\n{dialogue}\nWhat were the main points in that "
         "conversation?", "{summary}"),
        ("Dialogue:\n{dialogue}\nWhat was going on in that conversation?",
         "{summary}"),
        ("Write a dialog about anything you want.", "{dialogue}"),
        ("Write a dialog based on this summary:\n{summary}.", "{dialogue}"),
        ("Write a dialog with this premise \"{summary}\".", "{dialogue}"),
    ]}

def is_only_dialogue_check(template: str) -> bool:
    # True when {dialogue} is the only placeholder used in the template
    return set(re.findall(r'\{(\w+)\}', template)) == {'dialogue'}

def summarize_with_template(dialogue, template):
    # Each entry is a (prompt_template, target_template) tuple; the prompt template comes first
    assert is_only_dialogue_check(template[0]), "Template must only contain {dialogue}"
    prompt = template[0].format(dialogue=dialogue)
    return summarize_dialogue(prompt)

# Generate summary for each template, using sample examples
for i,index in enumerate(example_indices):
    dialogue = dataset['test'][index]['dialogue']
    print(dash_line)
    print(f"Summary example:{i}")
    print(dash_line)

    for idx, template in enumerate(templates["samsum"][:6]):
        summary = summarize_with_template(dialogue, template)
        print(f"Summary with template {idx}:", summary)
        print(dash_line)
        print()


----------------------------------------------------------------------------------------------------
Summary example:0
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------

Summary with template 1: Person1 wants to send an intra-office memo to all employees. It's about a new policy on communications. Employees who use Instant Messaging during working hours will be warned and placed on probation. Employees who continue to use Instant
----------------------------------------------------------------------------------------------------

----------------------------------------------------------------------------------------------------

Summary with template 3: Person1 wants Ms. Dawson to take dictation for him. The memo should be distributed to all employees by this afternoon.
------------------------------------------------------------------

Notice that the prompts from Flan-T5 did help, but the model still struggles to pick up on the nuance of the conversation in some cases. This is what you will try to solve with few-shot inferencing.

### 5. Summarize Dialogues with a Few-Shot Inference

**Few-shot inference** is the practice of providing an LLM with several examples of prompt-response pairs that match your task before the actual prompt that you want completed. This is called "in-context learning": the examples condition the model on your specific task without changing its weights.

**Exercise:** Build a function that takes a list of `in_context_example_indexes`, generates a prompt with the examples, then at the end appends the prompt that you want the model to complete (`test_example_index`). Use the same Flan-T5 prompt template from Section 4. Make sure to separate the examples with `"\n\n\n"`.
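
To make the prompt layout concrete, here is a minimal sketch that reuses one of the `"samsum"` templates listed in Section 4 and joins the blocks with `"\n\n\n"`. The names `FLAN_TEMPLATE`, `SEPARATOR`, and `build_few_shot_prompt`, and the toy dialogue/summary pairs, are assumptions made for illustration; in the notebook the pairs would come from the dataset.

```python
# One of the FLAN "samsum" prompt templates, followed by a slot for the answer
FLAN_TEMPLATE = "Dialogue:\n{dialogue}\n\nWhat is a summary of this dialogue?\n{summary}"
SEPARATOR = "\n\n\n"


def build_few_shot_prompt(solved_examples, query_dialogue):
    """solved_examples: list of (dialogue, summary) pairs; query_dialogue: the dialogue to summarize."""
    blocks = [FLAN_TEMPLATE.format(dialogue=d, summary=s) for d, s in solved_examples]
    # The final block leaves the summary empty so the model fills it in
    blocks.append(FLAN_TEMPLATE.format(dialogue=query_dialogue, summary=""))
    return SEPARATOR.join(blocks)


# Hypothetical toy pairs, used only to show the resulting structure
toy_examples = [
    ("#Person1#: Is the report done?\n#Person2#: I will send it tonight.",
     "#Person2# will send the report tonight."),
    ("#Person1#: Coffee?\n#Person2#: Yes, please.",
     "#Person2# accepts #Person1#'s offer of coffee."),
]
toy_prompt = build_few_shot_prompt(
    toy_examples, "#Person1#: Where is the station?\n#Person2#: Two blocks north.")
print(toy_prompt)
```

Ending the prompt with the unanswered question gives the model the same pattern it saw in the solved examples, with only the final summary missing.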

In [16]:
template = (
    dash_line + "\nExample Summary {i}\n" + dash_line +
    "\nOriginal: {dialogue}\n" + dash_line +
    "\nSummary: {summary}\n" + dash_line +
    "\n\n\n"
)

In [26]:
def make_prompt(in_context_example_indices, test_example_index):
    ### WRITE YOUR CODE HERE

    prompt = ""

    # Add each in-context example (dialogue plus its human summary) to the prompt
    for i, index in enumerate(in_context_example_indices):
        dialogue = dataset['test'][index]['dialogue']
        summary = dataset['test'][index]['summary']

        prompt += template.format(i=i+1, dialogue=dialogue, summary=summary)
    # Add the test dialogue with an empty summary, followed by the instruction
    dialogue = dataset['test'][test_example_index]['dialogue']
    prompt += template.format(i=len(in_context_example_indices)+1, dialogue=dialogue, summary="")
    prompt += "\n Briefly summarize that dialogue."
    return prompt

In [27]:
in_context_example_indices = [0, 10, 20]
test_example_index = 800

few_shot_prompt = make_prompt(in_context_example_indices, test_example_index)
print(few_shot_prompt)

----------------------------------------------------------------------------------------------------
Example Summary 1
----------------------------------------------------------------------------------------------------
Original: #Person1#: Ms. Dawson, I need you to take a dictation for me.
#Person2#: Yes, sir...
#Person1#: This should go out as an intra-office memorandum to all employees by this afternoon. Are you ready?
#Person2#: Yes, sir. Go ahead.
#Person1#: Attention all staff... Effective immediately, all office communications are restricted to email correspondence and official memos. The use of Instant Message programs by employees during working hours is strictly prohibited.
#Person2#: Sir, does this apply to intra-office communications only? Or will it also restrict external communications?
#Person1#: It should apply to all communications, not only in this office between employees, but also any outside communications.
#Person2#: But sir, many employees use Instant Messaging t

Now pass this prompt to the model to perform few-shot inference:

In [30]:
### WRITE YOUR CODE HERE
summary = summarize_dialogue(few_shot_prompt)
print(summary)


Token indices sequence length is longer than the specified maximum sequence length for this model (1744 > 512). Running this sequence through the model will result in indexing errors


Dad and his family are coming to visit them in Europe next year.


**Exercise:** Experiment with the few-shot inferencing:
- Choose different dialogues - change the indices in the `in_context_example_indices` list and `test_example_index` value.
- Change the number of examples. Be sure to stay within the model's 512-token context length, however.

How well does few-shot inference work with other examples?
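
The length warning printed for the first few-shot prompt (1744 > 512) suggests checking the prompt size before calling the model. The sketch below drops in-context examples from the end until the prompt fits a budget. `fit_examples_to_budget` and `count_words` are hypothetical helpers; the word counter is only a stand-in so the example runs without the model. In the notebook, one could instead pass a counter based on the loaded tokenizer, for example `lambda text: len(tokenizer(text)['input_ids'])`.

```python
def fit_examples_to_budget(example_blocks, query_block, count_tokens,
                           max_tokens=512, separator="\n\n\n"):
    """Keep as many leading example blocks as fit; the query block is always kept."""
    kept = list(example_blocks)
    while True:
        prompt = separator.join(kept + [query_block])
        if count_tokens(prompt) <= max_tokens or not kept:
            return prompt, len(kept)
        kept.pop()  # drop the last in-context example and try again


# Stand-in counter for illustration only: whitespace-separated words, not real tokens
def count_words(text):
    return len(text.split())


blocks = ["example one " * 5, "example two " * 5, "example three " * 5]
prompt, n_kept = fit_examples_to_budget(blocks, "query dialogue here",
                                        count_words, max_tokens=25)
print(n_kept, count_words(prompt))
```

If even the query alone exceeds the budget, this sketch still returns it with zero examples, so the caller would need to decide whether to truncate the dialogue.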

In [50]:
### WRITE YOUR CODE HERE
my_in_context_example_indices = [75, 85, 95]
my_test_example_index = 100

my_few_shot_prompt = make_prompt(my_in_context_example_indices, my_test_example_index)
print(my_few_shot_prompt)


my_in_context_summary = summarize_dialogue(my_few_shot_prompt)
print("\n\n\n")
print("In context summary:", my_in_context_summary)

----------------------------------------------------------------------------------------------------
Example Summary 1
----------------------------------------------------------------------------------------------------
Original: #Person1#: Do you drink much?
#Person2#: Depending on what you consider a lot.
#Person1#: How frequently do you drink?
#Person2#: Couple times a week. How about you?
#Person1#: Only when I go out. I'm not a big drinker.
#Person2#: How much can you drink?
#Person1#: I usually only have 2 beers.
#Person2#: You're a light weight.
#Person1#: How much can you drink?
#Person2#: I'm usually drinking all night long. At least 10 drinks.
#Person1#: Don't you spend a lot of money then?
#Person2#: No. We usually go to places that have specials. Dante's over on the Ave has $ 5. 00 pitchers on Mondays. So for ten, fifteen bucks, I can get a lot of drinks.
#Person1#: That's true.
#Person2#: If you don't like beer, have you tried mixed drinks? Some of them are pretty good.
#P

### 6. Generative Configuration Parameters for Inference

You can change the configuration parameters of the `generate()` method to see different output from the LLM. So far, the only parameter you have set is `max_new_tokens=50`, which defines the maximum number of tokens to generate. A convenient way of organizing the configuration parameters is to use the `GenerationConfig` class. Setting `do_sample = True` switches from deterministic decoding to sampling, where the next token is drawn from the probability distribution over the entire vocabulary. You can then adjust the outputs by changing `temperature` and other parameters (such as `top_k` and `top_p`). A full list of available parameters can be found in the [Hugging Face Generation documentation](https://huggingface.co/docs/transformers/v4.29.1/en/main_classes/text_generation#transformers.GenerationConfig).
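
To build intuition for what `temperature`, `top_k`, and `top_p` do, here is a simplified pure-Python sketch on a toy four-token vocabulary. The function `filtered_distribution` and the toy logits are assumptions for illustration; the library's own implementation works on tensors and may differ in details.

```python
import math
import random


def filtered_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token: probability} after temperature scaling, top-k, then top-p filtering."""
    # Temperature: dividing logits by T < 1 sharpens the distribution, T > 1 flattens it
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    max_logit = max(scaled.values())
    exps = {tok: math.exp(v - max_logit) for tok, v in scaled.items()}
    ranked = sorted(exps.items(), key=lambda kv: kv[1], reverse=True)
    # Top-k: keep only the k most probable tokens (0 disables the filter)
    if top_k > 0:
        ranked = ranked[:top_k]
    total = sum(weight for _, weight in ranked)
    ranked = [(tok, weight / total) for tok, weight in ranked]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    norm = sum(p for _, p in kept)
    return {tok: p / norm for tok, p in kept}


# Hypothetical logits for the next token
toy_logits = {"Europe": 2.0, "Asia": 1.0, "home": 0.5, "Mars": -1.0}
for t in (0.2, 0.9):
    print(t, filtered_distribution(toy_logits, temperature=t, top_k=3, top_p=0.9))

# Sampling then draws the next token from the filtered distribution
dist = filtered_distribution(toy_logits, temperature=0.9, top_k=3, top_p=0.9)
next_token = random.choices(list(dist), weights=list(dist.values()), k=1)[0]
print(next_token)
```

With the low temperature, almost all probability mass sits on one token, so the top-p filter keeps only that token; with the higher temperature, several candidates survive and sampling can vary between runs.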

**Exercise:** Change the configuration parameters to investigate their influence on the output. Analyze your results.

In [39]:
### WRITE YOUR CODE HERE
from transformers import GenerationConfig

# General configuration
generation_config = GenerationConfig(
    do_sample=True,
    top_k = 10,
    top_p = 0.9,
    temperature = 0.9,
    max_new_tokens = 100
)

def summarize_dialogue_with_config(prompt, config):
    inputs = tokenizer(prompt, return_tensors='pt')
    summary_ids = model.generate(inputs['input_ids'], generation_config=config)
    # A single prompt yields a single generated sequence, so decode the first one
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)



In [45]:
summary_with_config = summarize_dialogue_with_config(few_shot_prompt,generation_config)
print(summary_with_config)

Dad is talking to his uncle Bill, his wife and two daughters in New Zealand. They want to travel to Europe next year.


## Analysis: this output is used as the reference sample for the comparisons below.

In [44]:
# Generation Config with low temperature
generation_config = GenerationConfig(
    do_sample=True,
    top_k = 10,
    top_p = 0.9,
    temperature = 0.2,
    max_new_tokens = 100,
    output_scores = True  # only takes effect when return_dict_in_generate=True
)
summary_with_config = summarize_dialogue_with_config(few_shot_prompt,generation_config)
print(summary_with_config)

Dad and Sarah's uncle Bill, his wife and two daughters are coming to visit them in Europe next year.


## Low-temperature analysis: with a lower temperature, the model generates more conservative output with less randomness.

In [43]:
# config with large top_k
generation_config = GenerationConfig(
    do_sample=True,
    top_k = 100,
    top_p = 0.9,
    temperature = 0.9,
    max_new_tokens = 100
)
summary_with_config = summarize_dialogue_with_config(few_shot_prompt,generation_config)
print(summary_with_config)

Dad is talking to his uncle Bill, his wife and their two daughters. They want to travel to Europe next year.


## Large top_k analysis: a larger top_k widens the pool of candidate tokens, which can introduce more randomness and less coherent output.

In [46]:
# Config with low top_p: only the smallest set of most probable tokens whose
# probabilities add up to top_p or higher is kept for generation.
generation_config = GenerationConfig(
    do_sample=True,
    top_k = 10,
    top_p = 0.2,
    temperature = 0.9,
    max_new_tokens = 100
)
summary_with_config = summarize_dialogue_with_config(few_shot_prompt,generation_config)
print(summary_with_config)

Dad and his family are coming to visit them in Europe next year.


## Low top_p analysis: a low top_p generates more concise output, with more details omitted.

In [51]:
generation_config = GenerationConfig(
    do_sample=True,
    num_beams = 10,
    top_k = 10,
    top_p = 0.9,
    temperature = 0.9,
    max_new_tokens = 100
)
summary_with_config = summarize_dialogue_with_config(few_shot_prompt,generation_config)
print(summary_with_config)

Dad keeps talking about his family in New Zealand. His uncle Bill, his wife and two of their daughters are his cousins. Sarah and Jane are both his cousins although they are step-sisters. They want to travel to Europe next year and will visit them at the same Ae.


## num_beams analysis: with a higher number of beams, the model generates a longer output with more details, which can give a higher-quality summary, although this sample also ends with a garbled fragment ("Ae").
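
The cell above combines `num_beams` with `do_sample=True`, which mixes beam search with sampling. As a simplified, deterministic sketch of the beam idea alone, the toy example below keeps the `num_beams` best partial sequences at each step. `TOY_MODEL` and `beam_search` are hypothetical and stand in for the real model's next-token probabilities.

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token
TOY_MODEL = {
    "<s>": {"Dad": 0.6, "They": 0.4},
    "Dad": {"talks": 0.55, "visits": 0.45},
    "They": {"visit": 0.9, "talk": 0.1},
    "talks": {"</s>": 1.0}, "visits": {"</s>": 1.0},
    "visit": {"</s>": 1.0}, "talk": {"</s>": 1.0},
}


def beam_search(num_beams, max_steps=3):
    """Return the highest-scoring sequence found when num_beams candidates are kept per step."""
    beams = [(["<s>"], 0.0)]  # (tokens, cumulative log-probability)
    for _ in range(max_steps):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == "</s>":
                candidates.append((tokens, score))  # finished sequences carry over
                continue
            for tok, p in TOY_MODEL[tokens[-1]].items():
                candidates.append((tokens + [tok], score + math.log(p)))
        # Keep only the num_beams best partial sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams[0]


print(beam_search(num_beams=1))  # equivalent to greedy decoding
print(beam_search(num_beams=2))
```

With one beam, the search commits to the locally best first token and misses the sequence with the higher overall probability; with two beams, it keeps the alternative alive long enough to find it.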