# **Test some parameters**

In [1]:
!pip install datasets transformers



In [2]:
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM
from transformers import AutoTokenizer
from transformers import GenerationConfig

In [3]:
hf_dataset_name = "knkarthick/dialogsum"
dataset = load_dataset(hf_dataset_name)

Dataset splits: train 12,460 examples, validation 500 examples, test 1,500 examples.

In [4]:
model_name = 'google/flan-t5-base'
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

In [5]:
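def make_prompt(example_indices_full, example_index_to_summarize):
  """Build a few-shot prompt from test-split dialogues.

  Each index in example_indices_full contributes a dialogue with its human
  summary; the dialogue at example_index_to_summarize is appended last,
  without a summary, for the model to complete.
  """
  prompt = ''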

  for index in example_indices_full:
    dialogue = dataset['test'][index]['dialogue']
    summary = dataset['test'][index]['summary']

    # The stop sequence '{summary}\n\n\n' is important for FLAN-T5.
    # Other models may have their own preferred stop sequence.
    prompt += f"""
Dialogue:

{dialogue}

summary:
{summary}\n\n\n
"""

  dialogue = dataset['test'][example_index_to_summarize]['dialogue']

  prompt += f"""
Dialogue:

{dialogue}

Summarize:
"""

  return prompt

In [6]:
example_indices_full = [40, 80, 120]
example_index_to_summarize = 200

few_shot_prompt = make_prompt(example_indices_full, example_index_to_summarize)

In [7]:
dialogue = dataset['test'][example_index_to_summarize]['dialogue']

summary = dataset['test'][example_index_to_summarize]['summary']

inputs = tokenizer(few_shot_prompt, return_tensors='pt')
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        max_new_tokens=50,
    )[0],
    skip_special_tokens=True
)

print(f'Dialogue: \n{dialogue}\n')

print(f'Generated Summary: {output}\n')

print(f'Human Summary: {summary}')

Token indices sequence length is longer than the specified maximum sequence length for this model (809 > 512). Running this sequence through the model will result in indexing errors


Dialogue: 
#Person1#: Have you considered upgrading your system?
#Person2#: Yes, but I'm not sure what exactly I would need.
#Person1#: You could consider adding a painting program to your software. It would allow you to make up your own flyers and banners for advertising.
#Person2#: That would be a definite bonus.
#Person1#: You might also want to upgrade your hardware because it is pretty outdated now.
#Person2#: How can we do that?
#Person1#: You'd probably need a faster processor, to begin with. And you also need a more powerful hard disc, more memory and a faster modem. Do you have a CD-ROM drive?
#Person2#: No.
#Person1#: Then you might want to add a CD-ROM drive too, because most new software programs are coming out on Cds.
#Person2#: That sounds great. Thanks.

Generated Summary: #Person1 wants to upgrade his computer. #Person2 wants to add a painting program to his software. #Person1 wants to upgrade his hardware.

Human Summary: #Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.
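
The tokenizer warning above reports a prompt of 809 tokens against a 512-token limit. One possible way to stay inside a limit is to keep only as many few-shot examples as fit. The sketch below is an illustration, not part of the original notebook: `select_shots` and `count_words` are hypothetical helpers, and the whitespace word count is a stand-in for a real token count.

```python
def select_shots(shots, query, max_tokens, count_tokens):
    """Keep as many leading shots as fit, together with the query, in max_tokens."""
    used = count_tokens(query)  # the query is always kept
    kept = []
    for shot in shots:
        cost = count_tokens(shot)
        if used + cost > max_tokens:
            break  # the next shot would exceed the budget
        kept.append(shot)
        used += cost
    return kept, used


def count_words(text):
    # Stand-in counter (assumption): whitespace-separated words, not model tokens.
    return len(text.split())


shots = ["a b c d", "e f g", "h i j k l"]
kept, used = select_shots(shots, "q r", max_tokens=9, count_tokens=count_words)
print(kept, used)
```

With the notebook's tokenizer, the counter could be replaced by something like `lambda t: len(tokenizer(t)["input_ids"])`, so the budget is measured in the model's own tokens.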

# **max_new_tokens**


In [8]:
inputs = tokenizer(few_shot_prompt, return_tensors='pt')
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        max_new_tokens=10,
    )[0],
    skip_special_tokens=True
)

print(f'Generated Summary: {output}\n')

print(f'Human Summary: {summary}')

Generated Summary: #Person1 wants to upgrade his computer.

Human Summary: #Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.
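
The role of `max_new_tokens` can be pictured with a toy decoding loop: generation stops at the end-of-sequence token or when the cap is reached, whichever comes first. This is a minimal sketch under stated assumptions. `greedy_generate`, `scripted_model`, and `SCRIPT` are hypothetical names, and the "model" simply replays a fixed sentence rather than predicting anything.

```python
def greedy_generate(next_token, max_new_tokens, eos="</s>"):
    """Call next_token(generated) until EOS appears or the token cap is hit."""
    generated = []
    for _ in range(max_new_tokens):
        token = next_token(generated)
        if token == eos:
            break  # the model decided to stop
        generated.append(token)
    return generated


# Hypothetical stand-in for a model: replays a fixed sentence, then EOS.
SCRIPT = ["Person1", "wants", "to", "upgrade", "his", "computer", ".", "</s>"]


def scripted_model(generated):
    return SCRIPT[len(generated)]


short = greedy_generate(scripted_model, max_new_tokens=3)   # cut off by the cap
full = greedy_generate(scripted_model, max_new_tokens=50)   # stops at EOS
print(short)
print(full)
```

The cap is an upper bound only: a large value does not force a longer output, while a small value can cut a summary off mid-thought.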


# **do_sample**

Default values used when `do_sample=True` and nothing else is set:

```
top_k=50
top_p=1.0
```

In [9]:
inputs = tokenizer(few_shot_prompt, return_tensors='pt')
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        max_new_tokens=50,
        do_sample=True       # turn on sampling
    )[0],
    skip_special_tokens=True
)

print(f'Generated Summary: {output}\n')

print(f'Human Summary: {summary}')

Generated Summary: #Person1 is asking if he has a list of options.

Human Summary: #Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.
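
The difference between greedy decoding and sampling can be sketched on a single next-token distribution. The probabilities below are made up for illustration, and `greedy_pick` and `sample_pick` are hypothetical helpers rather than library functions.

```python
import random

# Made-up next-token distribution (assumption, for illustration only).
probs = {"upgrade": 0.6, "buy": 0.3, "sell": 0.1}


def greedy_pick(probs):
    """Always return the single most probable token."""
    return max(probs, key=probs.get)


def sample_pick(probs, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded so the run is repeatable
greedy = [greedy_pick(probs) for _ in range(5)]
sampled = [sample_pick(probs, rng) for _ in range(1000)]
counts = {t: sampled.count(t) for t in probs}
print(greedy)
print(counts)
```

Greedy decoding returns the same token every time, while sampling visits lower-probability tokens roughly in proportion to their probability. That is why the sampled summary above differs from the greedy one.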


# **do_sample** + **top_k** & **top_p**

Note: the following values disable each filter:

```
top_k=0    # disables top-k
top_p=1.0  # disables top-p
```


In [10]:
inputs = tokenizer(few_shot_prompt, return_tensors='pt')
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        max_new_tokens=50,
        do_sample=True,       # turn on sampling
        top_k=20,
        top_p=0.90
    )[0],
    skip_special_tokens=True
)

print(f'Generated Summary: {output}\n')

print(f'Human Summary: {summary}')

Generated Summary: Several software programs are available. Some of them need updating.

Human Summary: #Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.
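
To make the two filters concrete, here is a minimal sketch of top-k followed by top-p (nucleus) filtering on a toy distribution. It illustrates the general technique and is not the library's internal implementation. The helper names and probabilities are assumptions.

```python
def renormalize(probs):
    """Rescale the remaining probabilities so they sum to 1."""
    total = sum(probs.values())
    return {t: v / total for t, v in probs.items()}


def top_k_filter(probs, k):
    """Keep the k most probable tokens; k <= 0 keeps everything."""
    if k <= 0:
        return dict(probs)
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return renormalize(dict(ranked[:k]))


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, mass = {}, 0.0
    for token, prob in ranked:
        kept[token] = prob
        mass += prob
        if mass >= p:
            break
    return renormalize(kept)


# Made-up next-token distribution (assumption, for illustration only).
probs = {"upgrade": 0.5, "buy": 0.25, "sell": 0.125, "paint": 0.125}
after_k = top_k_filter(probs, k=3)      # drops the least likely token
final = top_p_filter(after_k, p=0.7)    # keeps the top tokens covering 70% of the mass
print(after_k)
print(final)
```

Sampling then draws only from `final`, so lower values of `top_k` or `top_p` narrow the set of candidate tokens and make the output less varied.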


# **temperature** (& **do_sample**)

`temperature` only takes effect when `do_sample=True`; without sampling it has no influence on the output.

Typical values:

*   temperature → 0 → approximates **greedy** decoding (picks the highest-probability token).
*   0 < temperature < 1 → sharpens the distribution, so output is less random.
*   temperature = 1 → **default**, samples from the model's unmodified distribution.
*   temperature > 1 → flattens the distribution, so output is more random (more diverse).
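
The effect of temperature can be sketched as dividing the logits by the temperature before the softmax. The logits below are made up, and `softmax_with_temperature` is a hypothetical helper written for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature; temperature must be positive."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
sharp = softmax_with_temperature(logits, 0.5)  # more peaked
base = softmax_with_temperature(logits, 1.0)   # unmodified distribution
flat = softmax_with_temperature(logits, 2.0)   # more uniform
for name, dist in (("0.5", sharp), ("1.0", base), ("2.0", flat)):
    print(name, [round(p, 3) for p in dist])
```

Lower temperatures move probability mass toward the top-scoring token, and higher temperatures spread it across the alternatives.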




In [11]:
inputs = tokenizer(few_shot_prompt, return_tensors='pt')
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        max_new_tokens=50,
        do_sample=True,       # turn on sampling
        temperature=0.5
    )[0],
    skip_special_tokens=True
)

print(f'Generated Summary: {output}\n')

print(f'Human Summary: {summary}')



Generated Summary: #Person1 proposes some upgrades to his computer software.

Human Summary: #Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.
