### PA4: Prompt Engineering

In this exercise, you will perform prompt engineering on a dialogue summarization task using [Flan-T5](https://huggingface.co/google/flan-t5-large) and the [dialogsum dataset](https://huggingface.co/datasets/knkarthick/dialogsum). You will explore how different prompts affect the output of the model, and compare zero-shot and few-shot inferences. <br/>
Complete the code in the cells below.

### 1. Set up Required Dependencies

In [63]:
!pip3 install datasets -q

In [64]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig
from datasets import load_dataset

### 2. Explore the Dataset

In [65]:
from datasets import load_dataset

dataset = load_dataset('knkarthick/dialogsum')

Print several dialogues with their baseline summaries.

In [66]:
example_indices = [0, 42, 800]
dash_line = '-' * 100

for i, index in enumerate(example_indices):
    print(dash_line)
    print('Example', i + 1)
    print(dash_line)
    print('INPUT DIALOGUE:')
    print(dataset['test'][index]['dialogue'])
    print(dash_line)
    print('BASELINE HUMAN SUMMARY:')
    print(dataset['test'][index]['summary'])
    print(dash_line)
    print()

----------------------------------------------------------------------------------------------------
Example 1
----------------------------------------------------------------------------------------------------
INPUT DIALOGUE:
#Person1#: Ms. Dawson, I need you to take a dictation for me.
#Person2#: Yes, sir...
#Person1#: This should go out as an intra-office memorandum to all employees by this afternoon. Are you ready?
#Person2#: Yes, sir. Go ahead.
#Person1#: Attention all staff... Effective immediately, all office communications are restricted to email correspondence and official memos. The use of Instant Message programs by employees during working hours is strictly prohibited.
#Person2#: Sir, does this apply to intra-office communications only? Or will it also restrict external communications?
#Person1#: It should apply to all communications, not only in this office between employees, but also any outside communications.
#Person2#: But sir, many employees use Instant Messaging to 

### 3. Summarize Dialogues without Prompt Engineering

Load the Flan-T5-large model and its tokenizer.

In [67]:
model_name = 'google/flan-t5-large'

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

**Exercise**: Use the pre-trained model to summarize the example dialogues without any prompt engineering. Use the `model.generate()` function with `max_new_tokens=50`.

In [68]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
})

In [69]:
for i, index in enumerate(example_indices):
    dialogue = dataset['test'][index]['dialogue']
    inputs = tokenizer(dialogue, return_tensors='pt', truncation=True)


    summary_ids = model.generate(inputs['input_ids'], max_new_tokens=50)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    print(dash_line)
    print('BASELINE HUMAN SUMMARY:')
    print(dataset['test'][index]['summary'])
    print(dash_line)
    print()

    print(dash_line)
    print('Example', i + 1)
    print(dash_line)
    print('MODEL GENERATED SUMMARY:')
    print(summary)
    print(dash_line)
    print()

----------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
Ms. Dawson helps #Person1# to write a memo to inform every employee that they have to change the communication method and should not use Instant Messaging anymore.
----------------------------------------------------------------------------------------------------

----------------------------------------------------------------------------------------------------
Example 1
----------------------------------------------------------------------------------------------------
MODEL GENERATED SUMMARY:
#Person1: Ms. Dawson, please take dictation for me.
----------------------------------------------------------------------------------------------------

----------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# wants to adjust #Person1#'s life and #Person2# suggests #Person1# be positive and s

You can see that the model generations make some sense, but the model doesn't seem to be sure what task it is supposed to accomplish and it often just makes up the next sentence in the dialogue. Prompt engineering can help here.

### 4. Summarize Dialogues with Instruction Prompts

In order to instruct the model to perform a task (e.g., summarize a dialogue), you can take the dialogue and convert it into an instruction prompt. This is often called **zero-shot inference**.

**Exercise**: Wrap the dialogues in a descriptive instruction (e.g., "Summarize the following conversation."), and examine how the generated text changes.

In [70]:
for i, index in enumerate(example_indices):
    dialogue = dataset['test'][index]['dialogue']

    # Wrap dialogue in a descriptive instruction
    input_text = "Summarize the following conversation: " + dialogue

    # Tokenize the input
    inputs = tokenizer(input_text, return_tensors='pt', truncation=True)

    summary_ids = model.generate(inputs['input_ids'], max_new_tokens=50)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    print(dash_line)
    print('BASELINE HUMAN SUMMARY:')
    print(dataset['test'][index]['summary'])
    print(dash_line)
    print()

    print(dash_line)
    print('Example', i + 1)
    print(dash_line)
    print('MODEL GENERATED SUMMARY:')
    print(summary)
    print(dash_line)
    print()

----------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
Ms. Dawson helps #Person1# to write a memo to inform every employee that they have to change the communication method and should not use Instant Messaging anymore.
----------------------------------------------------------------------------------------------------

----------------------------------------------------------------------------------------------------
Example 1
----------------------------------------------------------------------------------------------------
MODEL GENERATED SUMMARY:
#Person1# wants Ms. Dawson to take dictation for him.
----------------------------------------------------------------------------------------------------

----------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# wants to adjust #Person1#'s life and #Person2# suggests #Person1# be positive and

This is much better! However, the model still does not pick up on the nuance of the conversations.

**Exercise:** Experiment with the prompt text and see how it influences the generated output. Do the inferences change if you end the prompt with an empty string vs. `Summary: `?
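One way to read "end the prompt with" is to place the dialogue between an instruction and a closing cue. The sketch below illustrates that layout on a toy dialogue; the helper name `wrap_dialogue` and its parameters are hypothetical and not part of the notebook.

```python
def wrap_dialogue(dialogue,
                  instruction="Summarize the following conversation.",
                  suffix="Summary: "):
    """Place the dialogue between an instruction and a closing cue.

    `wrap_dialogue` is an illustrative helper, not part of the notebook.
    """
    parts = [instruction, dialogue]
    if suffix:
        parts.append(suffix)
    # Blank lines separate the instruction, the dialogue, and the cue;
    # empty parts are skipped so no stray separators appear.
    return "\n\n".join(part for part in parts if part)


# Toy dialogue in the dataset's speaker format (made up for illustration).
toy_dialogue = "#Person1#: Are you free at noon?\n#Person2#: Yes, let's have lunch."

print(wrap_dialogue(toy_dialogue, suffix=""))
print("---")
print(wrap_dialogue(toy_dialogue, suffix="Summary: "))
```

The returned string can be tokenized and passed to `model.generate()` in the same way as `input_text` in the earlier cell.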

In [71]:
def generate_summaries(prompt_text, description):
    print(f"{description}")
    for i, index in enumerate(example_indices):
        dialogue = dataset['test'][index]['dialogue']

        # Prepare input text with the specified prompt
        input_text = prompt_text + dialogue

        # Tokenize the input
        inputs = tokenizer(input_text, return_tensors='pt', truncation=True)


        summary_ids = model.generate(inputs['input_ids'], max_new_tokens=50)
        summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

        print(dash_line)
        print('BASELINE HUMAN SUMMARY:')
        print(dataset['test'][index]['summary'])
        print(dash_line)
        print('MODEL GENERATED SUMMARY:')
        print(summary)
        print(dash_line)
        print()

# Experiment with different prompts
generate_summaries("", "Empty String Prompt")
generate_summaries("Summary: ", "Summary: Prompt")

Empty String Prompt
----------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
Ms. Dawson helps #Person1# to write a memo to inform every employee that they have to change the communication method and should not use Instant Messaging anymore.
----------------------------------------------------------------------------------------------------
MODEL GENERATED SUMMARY:
#Person1: Ms. Dawson, please take dictation for me.
----------------------------------------------------------------------------------------------------

----------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# wants to adjust #Person1#'s life and #Person2# suggests #Person1# be positive and stay healthy.
----------------------------------------------------------------------------------------------------
MODEL GENERATED SUMMARY:
#Person1#: Thank you, #Person2#.
---------------------

With the `Summary: ` prompt, the model generates a summary, whereas with the empty-string prompt it replies with another line of dialogue.

**Exercise:** Flan-T5 has many prompt templates that are published for certain tasks [here](https://github.com/google-research/FLAN/blob/main/flan/v2/templates.py). Try using its pre-built prompts for dialogue summarization (e.g., the ones under the `"samsum"` key) and see how they influence the outputs.


In [72]:
samsum_prompts = [
        "{dialogue}\n\nBriefly summarize that dialogue.",
        "Here is a dialogue:\n{dialogue}\n\nWrite a short summary!",
        "Dialogue:\n{dialogue}\n\nWhat is a summary of this dialogue?",
        "{dialogue}\n\nWhat was that dialogue about, in two sentences or less?",
        "Here is a dialogue:\n{dialogue}\n\nWhat were they talking about?",
        "Dialogue:\n{dialogue}\nWhat were the main points in that "
         "conversation?",
        "Dialogue:\n{dialogue}\nWhat was going on in that conversation?",
    ]

In [73]:
def generate_summaries(prompt_template, description):
    print(f"\n{'='*20} {description} {'='*20}\n")
    for i, index in enumerate(example_indices):
        dialogue = dataset['test'][index]['dialogue']
        human_summary = dataset['test'][index]['summary']

        # Prepare input text with the specified prompt template
        input_text = prompt_template.format(dialogue=dialogue, summary=human_summary)

        # Tokenize the input
        inputs = tokenizer(input_text, return_tensors='pt', truncation=True)

        # Generate summary with max_new_tokens=50

        summary_ids = model.generate(inputs['input_ids'], max_new_tokens=50)
        summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

        print(dash_line)
        print('BASELINE HUMAN SUMMARY:')
        print(dataset['test'][index]['summary'])
        print(dash_line)
        print('MODEL GENERATED SUMMARY:')
        print(summary)
        print(dash_line)
        print()

# Experiment with different SamSum prompts
for i, prompt_template in enumerate(samsum_prompts):
    generate_summaries(prompt_template, f"SamSum Prompt {i+1}")



----------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
Ms. Dawson helps #Person1# to write a memo to inform every employee that they have to change the communication method and should not use Instant Messaging anymore.
----------------------------------------------------------------------------------------------------
MODEL GENERATED SUMMARY:
----------------------------------------------------------------------------------------------------

----------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# wants to adjust #Person1#'s life and #Person2# suggests #Person1# be positive and stay healthy.
----------------------------------------------------------------------------------------------------
MODEL GENERATED SUMMARY:
Person1 is worried about his future. He should get plenty of sleep, drink less wine and exercise.
--------------------------

Notice that the prompts from Flan-T5 did help, but the model still struggles to pick up on the nuance of the conversation in some cases. This is what you will try to solve with few-shot inferencing.

### 5. Summarize Dialogues with a Few-Shot Inference

**Few-shot inference** is the practice of providing an LLM with several example prompt-response pairs that match your task, placed before the prompt you actually want completed. This is called "in-context learning" and conditions the model on your specific task.

**Exercise:** Build a function that takes a list of `in_context_example_indices`, generates a prompt with the examples, and then appends the prompt that you want the model to complete (`test_example_index`). Use the same Flan-T5 prompt template from Section 4. Make sure to separate the examples with `"\n\n\n"`.
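As a rough illustration of the layout this exercise describes, the sketch below wraps every example in one of the dialogue-summarization templates listed earlier and joins the examples with `"\n\n\n"`. The function name `make_templated_prompt`, the toy examples, and the choice to append each summary directly after the question are assumptions made for illustration.

```python
# One of the templates listed earlier, with a trailing newline added
# (an assumption) so the answer starts on its own line.
TEMPLATE = "Dialogue:\n{dialogue}\n\nWhat is a summary of this dialogue?\n"


def make_templated_prompt(examples, test_dialogue,
                          template=TEMPLATE, separator="\n\n\n"):
    """Build a few-shot prompt from solved examples plus one open query.

    `examples` is a list of dicts with 'dialogue' and 'summary' keys,
    mirroring the dataset's field names.
    """
    shots = [
        template.format(dialogue=ex["dialogue"]) + ex["summary"]
        for ex in examples
    ]
    # The final block has no summary: the model is expected to complete it.
    shots.append(template.format(dialogue=test_dialogue))
    return separator.join(shots)


# Toy data, made up for illustration only.
toy_examples = [
    {"dialogue": "#Person1#: Is the report done?\n#Person2#: Almost.",
     "summary": "#Person2# has nearly finished the report."},
    {"dialogue": "#Person1#: Coffee?\n#Person2#: Yes, please.",
     "summary": "#Person2# accepts a coffee from #Person1#."},
]
print(make_templated_prompt(toy_examples, "#Person1#: Taxi!\n#Person2#: Where to?"))
```

With the notebook's data, `examples` could be built as `[dataset['test'][i] for i in in_context_example_indices]`.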

In [74]:
def make_prompt(in_context_example_indices, test_example_index):
    prompt = ""

    for idx in in_context_example_indices:
        example = dataset['test'][idx]
        prompt += example["dialogue"] + "\n" + example["summary"] + "\n\n\n"

    test_example = dataset['test'][test_example_index]
    prompt += test_example["dialogue"]

    return prompt

# Example usage
in_context_example_indices = [0, 10, 20]
test_example_index = 800

few_shot_prompt = make_prompt(in_context_example_indices, test_example_index)
print(few_shot_prompt)

#Person1#: Ms. Dawson, I need you to take a dictation for me.
#Person2#: Yes, sir...
#Person1#: This should go out as an intra-office memorandum to all employees by this afternoon. Are you ready?
#Person2#: Yes, sir. Go ahead.
#Person1#: Attention all staff... Effective immediately, all office communications are restricted to email correspondence and official memos. The use of Instant Message programs by employees during working hours is strictly prohibited.
#Person2#: Sir, does this apply to intra-office communications only? Or will it also restrict external communications?
#Person1#: It should apply to all communications, not only in this office between employees, but also any outside communications.
#Person2#: But sir, many employees use Instant Messaging to communicate with their clients.
#Person1#: They will just have to change their communication methods. I don't want any - one using Instant Messaging in this office. It wastes too much time! Now, please continue with the memo. Wh

Now pass this prompt to the model to perform few-shot inference:

In [80]:
def generate_summary_for_test_example(in_context_example_indices, test_example_index):
    few_shot_prompt = make_prompt(in_context_example_indices, test_example_index)

    inputs = tokenizer(few_shot_prompt, return_tensors='pt', truncation=True, padding='longest')

    summary_ids = model.generate(inputs['input_ids'], max_new_tokens=50, num_beams=4, early_stopping=True)

    generated_summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    return generated_summary

In [81]:
generated_summary = generate_summary_for_test_example(in_context_example_indices, test_example_index)
print("Generated Summary:")
print(generated_summary)

Generated Summary:
Brian is having a birthday party. He wants to dance with Person2 at the party.


**Exercise:** Experiment with the few-shot inferencing:
- Choose different dialogues - change the indices in the `in_context_example_indices` list and `test_example_index` value.
- Change the number of examples. Be sure to stay within the model's 512-token context length, however.

How well does few-shot inference work with other examples?
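Because the cells above tokenize with `truncation=True`, a prompt that exceeds the context length is cut from the end, so the test dialogue is likely the part that gets lost. The sketch below shows one possible way to keep only as many examples as fit. The function name `fit_examples_to_budget` is hypothetical, and the whitespace-based counter is a crude stand-in for a real tokenizer.

```python
def fit_examples_to_budget(examples, test_text, count_tokens,
                           max_tokens=512, separator="\n\n\n"):
    """Keep as many leading examples as fit, together with the test text.

    `count_tokens` is any callable that maps a string to a token count.
    """
    kept = []
    for example in examples:
        candidate = separator.join(kept + [example, test_text])
        if count_tokens(candidate) > max_tokens:
            break  # adding this example would overflow the budget
        kept.append(example)
    return kept


def whitespace_count(text):
    # Crude proxy for illustration only; real token counts will be higher.
    return len(text.split())


toy_examples = ["a b c", "d e f", "g h i"]
print(fit_examples_to_budget(toy_examples, "x y", whitespace_count, max_tokens=6))
```

With the notebook's tokenizer, the counter could be `lambda text: len(tokenizer(text)['input_ids'])`.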

In [82]:
in_context_example_indices = [0, 1, 2]
test_example_index = 8

few_shot_prompt = make_prompt(in_context_example_indices, test_example_index)
print(few_shot_prompt)

#Person1#: Ms. Dawson, I need you to take a dictation for me.
#Person2#: Yes, sir...
#Person1#: This should go out as an intra-office memorandum to all employees by this afternoon. Are you ready?
#Person2#: Yes, sir. Go ahead.
#Person1#: Attention all staff... Effective immediately, all office communications are restricted to email correspondence and official memos. The use of Instant Message programs by employees during working hours is strictly prohibited.
#Person2#: Sir, does this apply to intra-office communications only? Or will it also restrict external communications?
#Person1#: It should apply to all communications, not only in this office between employees, but also any outside communications.
#Person2#: But sir, many employees use Instant Messaging to communicate with their clients.
#Person1#: They will just have to change their communication methods. I don't want any - one using Instant Messaging in this office. It wastes too much time! Now, please continue with the memo. Wh

In [83]:
generated_summary = generate_summary_for_test_example(in_context_example_indices, test_example_index)
print("Generated Summary:")
print(generated_summary)

Generated Summary:
Sir, does this apply to intra-office communications only? Or will it also restrict external communications?


In [92]:
in_context_example_indices = [2, 23, 64, 5, 13]
test_example_index = 22

few_shot_prompt = make_prompt(in_context_example_indices, test_example_index)
print(few_shot_prompt)

#Person1#: Ms. Dawson, I need you to take a dictation for me.
#Person2#: Yes, sir...
#Person1#: This should go out as an intra-office memorandum to all employees by this afternoon. Are you ready?
#Person2#: Yes, sir. Go ahead.
#Person1#: Attention all staff... Effective immediately, all office communications are restricted to email correspondence and official memos. The use of Instant Message programs by employees during working hours is strictly prohibited.
#Person2#: Sir, does this apply to intra-office communications only? Or will it also restrict external communications?
#Person1#: It should apply to all communications, not only in this office between employees, but also any outside communications.
#Person2#: But sir, many employees use Instant Messaging to communicate with their clients.
#Person1#: They will just have to change their communication methods. I don't want any - one using Instant Messaging in this office. It wastes too much time! Now, please continue with the memo. Wh

In [94]:
generated_summary = generate_summary_for_test_example(in_context_example_indices, test_example_index)
print("Generated Summary:")
print(generated_summary)

Generated Summary:
The charge for the laundry service on Nov. 20th is 30 dollars.
#Person1# helps #Person2# correct a mischarged bill on laundry service and helps #Person2# check out.


In [95]:
print(dataset['test'][22]['summary'])

#Person1# helps #Person2# correct a mischarged bill on laundry service and helps #Person2# check out.


Few-shot inference works better here: the generated summary is close to the human-written summary, and it outperforms the zero-shot results.

### 6. Generative Configuration Parameters for Inference

You can change the configuration parameters of the `generate()` method to see a different output from the LLM. So far, the only parameter that you have been setting is `max_new_tokens=50`, which defines the maximum number of tokens to generate. A convenient way of organizing the configuration parameters is to use the `GenerationConfig` class. By setting the parameter `do_sample=True`, you can activate various decoding strategies that influence how the next token is chosen from the probability distribution over the entire vocabulary. You can then adjust the outputs by changing `temperature` and other parameters (such as `top_k` and `top_p`). A full list of available parameters can be found in the [Hugging Face Generation documentation](https://huggingface.co/docs/transformers/v4.29.1/en/main_classes/text_generation#transformers.GenerationConfig).
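To build intuition for what these parameters do, here is a simplified, library-independent sketch of how `temperature`, `top_k`, and `top_p` reshape a next-token distribution before sampling. It is not the Hugging Face implementation; the function name `next_token_distribution` and the toy logits are made up for illustration.

```python
import math
import random


def _normalize(pairs):
    total = sum(p for _, p in pairs)
    return [(tok, p / total) for tok, p in pairs]


def next_token_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token: probability} after temperature, top-k, and top-p."""
    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    weights = [(tok, math.exp(v - peak)) for tok, v in scaled.items()]
    ranked = sorted(_normalize(weights), key=lambda kv: kv[1], reverse=True)

    # Top-k: keep only the k most likely tokens.
    if top_k > 0:
        ranked = _normalize(ranked[:top_k])

    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    if top_p < 1.0:
        kept, cumulative = [], 0.0
        for tok, p in ranked:
            kept.append((tok, p))
            cumulative += p
            if cumulative >= top_p:
                break
        ranked = _normalize(kept)

    return dict(ranked)


toy_logits = {"party": 2.0, "dance": 1.0, "cake": 0.0}
dist = next_token_distribution(toy_logits, temperature=0.7, top_k=2, top_p=0.95)
print(dist)

# Sampling draws a token at random according to the filtered distribution.
random.seed(0)
print(random.choices(list(dist), weights=list(dist.values()), k=1)[0])
```

Without sampling, the most likely token is always picked, which is why these parameters only matter when sampling is enabled.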

In [96]:
def generate_summary(in_context_example_indices, test_example_index, max_new_tokens, num_beams, temperature, top_k, top_p):
    few_shot_prompt = make_prompt(in_context_example_indices, test_example_index)

    inputs = tokenizer(few_shot_prompt, return_tensors='pt', truncation=True, padding='longest')

    summary_ids = model.generate(
        inputs['input_ids'],
        max_new_tokens=max_new_tokens,
        num_beams=num_beams,
        temperature=temperature,
        top_k=top_k,
        top_p=top_p,
        early_stopping=True
    )

    generated_summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    return generated_summary

in_context_example_indices = [0, 10, 20]
test_example_index = 800

configurations = [
    {"max_new_tokens": 50, "num_beams": 4, "temperature": 0.7, "top_k": 50, "top_p": 0.9},
    {"max_new_tokens": 100, "num_beams": 1, "temperature": 1.0, "top_k": 10, "top_p": 0.8},
    {"max_new_tokens": 30, "num_beams": 8, "temperature": 0.3, "top_k": 100, "top_p": 0.95},
    {"max_new_tokens": 70, "num_beams": 4, "temperature": 1.0, "top_k": 50, "top_p": 0.7}
]

for config in configurations:
    generated_summary = generate_summary(
        in_context_example_indices,
        test_example_index,
        max_new_tokens=config["max_new_tokens"],
        num_beams=config["num_beams"],
        temperature=config["temperature"],
        top_k=config["top_k"],
        top_p=config["top_p"]
    )
    print(f"\nConfiguration: {config}")
    print("Generated Summary:")
    print(generated_summary)
    print("="*50)




Configuration: {'max_new_tokens': 50, 'num_beams': 4, 'temperature': 0.7, 'top_k': 50, 'top_p': 0.9}
Generated Summary:
Brian is having a birthday party. He wants to dance with Person2 at the party.





Configuration: {'max_new_tokens': 100, 'num_beams': 1, 'temperature': 1.0, 'top_k': 10, 'top_p': 0.8}
Generated Summary:
Brian is having a birthday party. He wants to dance with Person2 at the party.





Configuration: {'max_new_tokens': 30, 'num_beams': 8, 'temperature': 0.3, 'top_k': 100, 'top_p': 0.95}
Generated Summary:
Brian is having a birthday party. He wants to dance with Person2 at the party.





Configuration: {'max_new_tokens': 70, 'num_beams': 4, 'temperature': 1.0, 'top_k': 50, 'top_p': 0.7}
Generated Summary:
Brian is having a birthday party. He wants to dance with Person2 at the party.


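The output is the same for all four configurations. A likely reason is that `generate_summary()` never sets `do_sample=True`, so `generate()` falls back to greedy or beam search and `temperature`, `top_k`, and `top_p` have no effect; only `num_beams` and `max_new_tokens` actually differ between the runs.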
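Since `temperature`, `top_k`, and `top_p` are sampling parameters, the configurations could be rerun with `do_sample=True`, as described at the start of this section. The sketch below only prepares the keyword arguments; the helper name `to_sampling_kwargs` is hypothetical.

```python
def to_sampling_kwargs(config):
    """Copy a configuration dict and switch sampling on.

    The original dict is left untouched so it can still be printed as-is.
    """
    kwargs = dict(config)
    kwargs["do_sample"] = True  # required for temperature/top_k/top_p to apply
    return kwargs


# Same values as the first two configurations in the cell above.
configurations = [
    {"max_new_tokens": 50, "num_beams": 4, "temperature": 0.7, "top_k": 50, "top_p": 0.9},
    {"max_new_tokens": 100, "num_beams": 1, "temperature": 1.0, "top_k": 10, "top_p": 0.8},
]

for config in configurations:
    print(to_sampling_kwargs(config))
```

Each result could then be passed as `model.generate(inputs['input_ids'], **to_sampling_kwargs(config))`. With sampling enabled, outputs can vary from run to run; calling `transformers.set_seed()` beforehand is one way to make them repeatable.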