# <span style="color: blue"> SUMMARIZE DIALOGUE </span>

## Installation

In [1]:
%pip install -U datasets==2.17.0 # installs the datasets library from Hugging Face
%pip install --upgrade pip
%pip install --disable-pip-version-check torch==1.13.1 torchdata==0.5.1 --quiet
%pip install transformers==4.27.2 --quiet # installs the transformers library from Hugging Face

Defaulting to user installation because normal site-packages is not writeable
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Note: you may need to restart the kernel to use updated packages.


## Imports from Hugging Face libraries `datasets` and `transformers`

In [169]:
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM # loads a pre-trained sequence-to-sequence model
from transformers import AutoTokenizer # loads the appropriate tokenizer for any given model
from transformers import GenerationConfig # allows you to configure generation parameters for text generation tasks, such as max length, temperature, top-k sampling, and others

# <span style="color: green"> Dataset & Model </span>

In [3]:
huggingface_dataset_name = "knkarthick/dialogsum"
dataset = load_dataset(huggingface_dataset_name) # DatasetDict object


## Inspecting the dataset

In [19]:
print(f"type(dataset): {type(dataset)}")
print(f"len(dataset) = {len(dataset)}")
print(f"dataset.keys(): {dataset.keys()}")
print(f"type(dataset['test']): {type(dataset['test'])}")
print(f"testdata[0].keys(): {dataset['test'][0].keys()}")
print(f"number of examples in train data      : {len(dataset['train'])}")
print(f"number of examples in validation data : {len(dataset['validation'])}")
print(f"number of examples in test data       : {len(dataset['test'])}")

type(dataset): <class 'datasets.dataset_dict.DatasetDict'>
len(dataset) = 3
dataset.keys(): dict_keys(['train', 'validation', 'test'])
type(dataset['test']): <class 'datasets.arrow_dataset.Dataset'>
testdata[0].keys(): dict_keys(['id', 'dialogue', 'summary', 'topic'])
number of examples in train data      : 12460
number of examples in validation data : 500
number of examples in test data       : 1500


**Note:**
- `dataset` is a **DatasetDict**, a dictionary-like object with keys *train*, *validation*, and *test*.
- Each split, e.g., `dataset['test']`, is a **Dataset** whose examples can be accessed by index: `dataset['test'][index]`.
- Each example is a dictionary with keys *id*, *dialogue*, *summary*, and *topic*.
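
The nesting described in this note (splits, then examples, then fields) can be walked with a few lines of plain Python. The sketch below is only an illustration: `describe_splits` and `toy_dataset` are hypothetical names, and the toy data is a stand-in with the same shape as the real `DatasetDict`, not content from DialogSum. The helper relies only on `.items()`, `len()`, and integer indexing, so it should also accept the real `dataset` object.

```python
def describe_splits(dataset):
    """Return the number of examples and the field names of every split."""
    report = {}
    for split_name, split in dataset.items():
        # Look at the first example (if any) to discover the field names
        first_example = split[0] if len(split) > 0 else {}
        report[split_name] = {
            "num_examples": len(split),
            "keys": sorted(first_example.keys()),
        }
    return report


# Hypothetical toy data that mirrors the structure: split -> list of example dicts
toy_dataset = {
    "train": [
        {"id": "train_0", "dialogue": "#Person1#: Hi.", "summary": "A greeting.", "topic": "greeting"},
        {"id": "train_1", "dialogue": "#Person1#: Bye.", "summary": "A farewell.", "topic": "farewell"},
    ],
    "test": [
        {"id": "test_0", "dialogue": "#Person1#: Thanks.", "summary": "Gratitude.", "topic": "thanks"},
    ],
}

for name, info in describe_splits(toy_dataset).items():
    print(f"{name}: {info['num_examples']} examples, keys = {info['keys']}")
```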

In [30]:
testdata = dataset['test']
example_indices = [1, 123]

hbar = '_'*40

for i, index in enumerate(example_indices):
    print(hbar)
    print(f"Example {i}")
    print('\nINPUT DIALOGUE:')
    print(testdata[index]['dialogue'])
    print('\nBASELINE HUMAN SUMMARY:')
    print(testdata[index]['summary'])

________________________________________
Example 0

INPUT DIALOGUE:
#Person1#: Ms. Dawson, I need you to take a dictation for me.
#Person2#: Yes, sir...
#Person1#: This should go out as an intra-office memorandum to all employees by this afternoon. Are you ready?
#Person2#: Yes, sir. Go ahead.
#Person1#: Attention all staff... Effective immediately, all office communications are restricted to email correspondence and official memos. The use of Instant Message programs by employees during working hours is strictly prohibited.
#Person2#: Sir, does this apply to intra-office communications only? Or will it also restrict external communications?
#Person1#: It should apply to all communications, not only in this office between employees, but also any outside communications.
#Person2#: But sir, many employees use Instant Messaging to communicate with their clients.
#Person1#: They will just have to change their communication methods. I don't want any - one using Instant Messaging in this off

## Pretrained Model: FLAN-T5

In [31]:
model_name = 'google/flan-t5-base'
model = AutoModelForSeq2SeqLM.from_pretrained(model_name) # creates an instance of AutoModelForSeq2SeqLM class with .from_pretrained() method




## Tokenizer

In [33]:
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True) # tokenizer for the FLAN-T5 model using `AutoTokenizer.from_pretrained()` method


**Note:**
- Fast tokenizer: Some models support a fast tokenizer implemented in Rust, which is much faster than the standard Python implementation. The difference matters most when tokenizing large amounts of text.
- Standard tokenizer: The standard tokenizer is written in Python and is slower than the fast version.

In [57]:
sentence = "How is it going Fakhreddin?"
sentence_encoded = tokenizer(sentence, return_tensors='pt') # return tensors in PyTorch ('pt') format
sentence_decoded = tokenizer.decode(sentence_encoded["input_ids"][0], skip_special_tokens=True)

print(f"ENCODED SENTENCE: {sentence_encoded['input_ids'][0]}")
print(f"SENTENCE:         {sentence}")
print(f"DECODED SENTENCE: {sentence_decoded}")

ENCODED SENTENCE: tensor([  571,    19,    34,   352,   377, 18965,  1271,  2644,    58,     1])
SENTENCE:         How is it going Fakhreddin?
DECODED SENTENCE: How is it going Fakhreddin?


**Note:**
- Other options for `return_tensors` are `'tf'` and `'np'` for TensorFlow and NumPy, respectively.
- `sentence_encoded` is a dictionary-like object whose keys include `'input_ids'`.
- `sentence_encoded['input_ids']` is a 2D tensor with shape `(1, N)`, where `N` denotes the number of tokens in the sequence.

The `skip_special_tokens=True` argument tells `tokenizer.decode` to drop the special tokens that the tokenizer added during encoding. Many NLP models rely on such tokens for specific purposes, for example `</s>` to mark the end of a sequence. With the flag set, the decoded sentence reads cleanly, without these extra symbols.

Example: Without `skip_special_tokens=True`, you might see output like this:

`How is it going Fakhreddin?</s>`
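
To make the effect concrete at the level of token ids, here is a minimal sketch that mimics the filtering step in plain Python. It assumes that the FLAN-T5 tokenizer uses id 0 for `<pad>` and id 1 for `</s>`; the helper `strip_special_ids` is a hypothetical name, not part of `transformers`, and the real tokenizer may treat further ids as special.

```python
# Assumption: id 0 is <pad> and id 1 is </s> in the FLAN-T5 vocabulary
SPECIAL_TOKEN_IDS = {0: "<pad>", 1: "</s>"}


def strip_special_ids(token_ids, special_ids=SPECIAL_TOKEN_IDS):
    """Drop special-token ids while keeping the order of the remaining ids."""
    return [token_id for token_id in token_ids if token_id not in special_ids]


# Ids printed for the encoded sentence earlier; the trailing 1 is </s>
encoded = [571, 19, 34, 352, 377, 18965, 1271, 2644, 58, 1]
cleaned = strip_special_ids(encoded)

print(f"with special tokens:    {encoded}")
print(f"without special tokens: {cleaned}")
```

Only the final id is removed here, which matches the single `</s>` that appears when decoding without `skip_special_tokens=True`.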

# <span style="color: green"> Model Performance </span>

## No Prompt Engineering

In [126]:
def make_zero_shot_prompt(dialogue):
    return dialogue # no prompt engineering

In [127]:
testdata = dataset['test']
test_indices = [11] # working with one single example

def generate(make_prompt):
    # `make_prompt` maps a dialogue string to the prompt string sent to the model
    for i, index in enumerate(test_indices):
        dialogue = testdata[index]['dialogue']
        summary  = testdata[index]['summary']

        prompt = make_prompt(dialogue)
        inputs = tokenizer(prompt, return_tensors='pt')

        # generate at most 50 new tokens, then decode the first (and only) sequence in the batch
        model_tokenized_output = model.generate(inputs["input_ids"], max_new_tokens=50)
        output = tokenizer.decode(model_tokenized_output[0], skip_special_tokens=True)

        print(hbar)
        print(f"Example {i}")
        print('\nINPUT DIALOGUE:')
        print(dialogue)
        print('\nBASELINE HUMAN SUMMARY:')
        print(summary)
        print('\nMODEL GENERATION:')
        print(output)
        print(f"\ninputs['input_ids'].size() = {inputs['input_ids'].size()}")
        print(f"\nmodel_tokenized_output.size() = {model_tokenized_output.size()}")

generate(make_zero_shot_prompt)

________________________________________
Example 0

INPUT DIALOGUE:
#Person1#: Happy Birthday, this is for you, Brian.
#Person2#: I'm so happy you remember, please come in and enjoy the party. Everyone's here, I'm sure you have a good time.
#Person1#: Brian, may I have a pleasure to have a dance with you?
#Person2#: Ok.
#Person1#: This is really wonderful party.
#Person2#: Yes, you are always popular with everyone. and you look very pretty today.
#Person1#: Thanks, that's very kind of you to say. I hope my necklace goes with my dress, and they both make me look good I feel.
#Person2#: You look great, you are absolutely glowing.
#Person1#: Thanks, this is a fine party. We should have a drink together to celebrate your birthday

BASELINE HUMAN SUMMARY:
#Person1# has a dance with Brian at Brian's birthday party. Brian thinks #Person1# looks great and is popular.

MODEL GENERATION:
Brian, thank you for coming to our party.

inputs['input_ids'].size() = torch.Size([1, 197])

model_tokenized

**Note:** `model.generate(inputs['input_ids'])` generates a sequence of output tokens. The input `inputs['input_ids']` has shape `(1, N)` and the output of this method has shape `(1, M)`, where `N` and `M` are the number of tokens in the prompt and in the generated answer, respectively. `tokenizer.decode`, however, expects a 1D token sequence, so we pass `model_tokenized_output[0]` to extract the single sequence for decoding.

## Zero-Shot with Instruction Prompt

In [128]:
def make_zero_shot_prompt(dialogue):
    prompt = f"Summarize the following conversation.\n{dialogue}\nSummary:"
    return prompt

In [129]:
generate(make_zero_shot_prompt)

________________________________________
Example 0

INPUT DIALOGUE:
#Person1#: Happy Birthday, this is for you, Brian.
#Person2#: I'm so happy you remember, please come in and enjoy the party. Everyone's here, I'm sure you have a good time.
#Person1#: Brian, may I have a pleasure to have a dance with you?
#Person2#: Ok.
#Person1#: This is really wonderful party.
#Person2#: Yes, you are always popular with everyone. and you look very pretty today.
#Person1#: Thanks, that's very kind of you to say. I hope my necklace goes with my dress, and they both make me look good I feel.
#Person2#: You look great, you are absolutely glowing.
#Person1#: Thanks, this is a fine party. We should have a drink together to celebrate your birthday

BASELINE HUMAN SUMMARY:
#Person1# has a dance with Brian at Brian's birthday party. Brian thinks #Person1# looks great and is popular.

MODEL GENERATION:
#Person1#: Happy birthday, Brian. #Person2#: Thank you for coming.

inputs['input_ids'].size() = torch.Size([

## One-Shot and Few-Shot Inference

In [150]:
# This function takes a list of `full_example_indices`, generates a prompt with full examples;
# then at the end appends the prompt which you want the model to complete (`dialogue_to_be_summarized`)
def make_few_shot_prompt(full_example_indices, dialogue_to_be_summarized):
    prompt = ''
    for index in full_example_indices:
        dialogue = dataset['test'][index]['dialogue']
        summary = dataset['test'][index]['summary']
        
        # Each solved example ends with '{summary}\n', which separates it from the next block. Other models may prefer a different separator or stop sequence.
        prompt += f"Dialogue:\n\n{dialogue}\n\nWhat was going on?\n{summary}\n"    
    
    
    prompt += f"Dialogue:\n\n{dialogue_to_be_summarized}\n\nWhat was going on?"
        
    return prompt
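
The same prompt layout can be tried without loading the dataset. The following is a minimal sketch, assuming the solved examples are passed in as a list of dictionaries with `dialogue` and `summary` keys; `build_few_shot_prompt` and the toy dialogues are hypothetical and not taken from DialogSum.

```python
def build_few_shot_prompt(examples, dialogue_to_be_summarized):
    """Concatenate solved examples, then append the dialogue left for the model."""
    parts = []
    for example in examples:
        # Solved block: dialogue, question, and the reference summary
        parts.append(
            f"Dialogue:\n\n{example['dialogue']}\n\nWhat was going on?\n{example['summary']}\n"
        )
    # Unsolved block: ends at the question so the model supplies the summary
    parts.append(f"Dialogue:\n\n{dialogue_to_be_summarized}\n\nWhat was going on?")
    return "".join(parts)


# Hypothetical toy data for illustration only
examples = [
    {
        "dialogue": "#Person1#: Is the shop open?\n#Person2#: Yes, until six.",
        "summary": "#Person1# asks #Person2# about the opening hours.",
    }
]
prompt = build_few_shot_prompt(examples, "#Person1#: Happy Birthday, Brian.")
print(prompt)
```

Passing an empty list of examples reduces this to a zero-shot prompt, one example gives one-shot, and several give few-shot. Each added example lengthens the prompt, so the token count should be checked against the model's input limit.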

In [163]:
index = test_indices[0]
dialogue = testdata[index]['dialogue']
summary  = testdata[index]['summary']
full_example_indices = [23] # what happens if you use the index of the dialogue_to_be_summarized here?

generation_config = GenerationConfig(max_new_tokens=50)
def generate():
    prompt = make_few_shot_prompt(full_example_indices, dialogue)
    print(f"prompt:\n\n{prompt}")
    inputs = tokenizer(prompt, return_tensors='pt')

    model_tokenized_output = model.generate(inputs["input_ids"], generation_config=generation_config)
    output = tokenizer.decode(model_tokenized_output[0], skip_special_tokens=True)

    print(hbar)
    print(f"Example {index}")
    print('\nINPUT DIALOGUE:')
    print(testdata[index]['dialogue'])
    print('\nBASELINE HUMAN SUMMARY:')
    print(testdata[index]['summary'])
    print('\nMODEL GENERATION:')
    print(output)
    print(f"\ninputs['input_ids'].size() = {inputs['input_ids'].size()}")
    print(f"\nmodel_tokenized_output.size() = {model_tokenized_output.size()}")
    
generate()

prompt:

Dialogue:

#Person1#: Good coming. What can I do for you?
#Person2#: I'm in Room 309. I'm checking out today. Can I have my bill now?
#Person1#: Certainly. Please wait a moment. Here you are.
#Person2#: Thanks. Wait... What's this? The 30 dollar for?
#Person1#: Excuse me... The charge for your laundry service on Nov. 20th.
#Person2#: But I did't take any laundry service during my stay here. I think you have added someone else's.
#Person1#: Ummmm...Sorry, would you mind waiting a moment? We check it with the department concerned.
#Person2#: No. As long as we get this straightened out.
#Person1#: I'm very sorry. There has been a mistake. We'll correct the bill. Please take a look.
#Person2#: Okay, here you are.
#Person1#: Goodbye.

What was going on?
#Person2# finds #Person2# being mischarged. #Person1# corrects the bill and #Person2# pays for it.
Dialogue:

#Person1#: Happy Birthday, this is for you, Brian.
#Person2#: I'm so happy you remember, please come in and enjoy the part


## Generative Configuration Parameters

In [168]:
generation_config = GenerationConfig(max_new_tokens=50, do_sample=True, temperature=0.1)
generate()

prompt:

Dialogue:

#Person1#: Good coming. What can I do for you?
#Person2#: I'm in Room 309. I'm checking out today. Can I have my bill now?
#Person1#: Certainly. Please wait a moment. Here you are.
#Person2#: Thanks. Wait... What's this? The 30 dollar for?
#Person1#: Excuse me... The charge for your laundry service on Nov. 20th.
#Person2#: But I did't take any laundry service during my stay here. I think you have added someone else's.
#Person1#: Ummmm...Sorry, would you mind waiting a moment? We check it with the department concerned.
#Person2#: No. As long as we get this straightened out.
#Person1#: I'm very sorry. There has been a mistake. We'll correct the bill. Please take a look.
#Person2#: Okay, here you are.
#Person1#: Goodbye.

What was going on?
#Person2# finds #Person2# being mischarged. #Person1# corrects the bill and #Person2# pays for it.
Dialogue:

#Person1#: Happy Birthday, this is for you, Brian.
#Person2#: I'm so happy you remember, please come in and enjoy the part
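
With `do_sample=True`, the next token is drawn from a probability distribution instead of always being the most likely one, and `temperature` rescales the model's scores before they are turned into probabilities. The sketch below illustrates that rescaling in plain Python. The logits are hypothetical numbers, and `softmax_with_temperature` is an illustrative helper, not a `transformers` function.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Divide the logits by the temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [logit / temperature for logit in logits]
    largest = max(scaled)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(value - largest) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens

for temperature in (1.0, 0.1):
    probs = softmax_with_temperature(logits, temperature)
    print(f"temperature={temperature}: {[round(p, 4) for p in probs]}")
```

A low temperature such as 0.1 concentrates almost all of the probability on the highest-scoring token, so sampling behaves nearly like greedy decoding. Values above 1 flatten the distribution and make the output more varied.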