<a href="https://colab.research.google.com/github/wadra/LLM_from_Scratch/blob/main/code/myedit/T5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# T5

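In this notebook (based on [Sinan Ozdemir's T5 notebook](https://github.com/sinanuozdemir/oreilly-hands-on-transformers/blob/main/notebooks/t5.ipynb)), we use T5 "out of the box", with no fine-tuning of our own, for several NLP tasks: translation, summarization, grammatical acceptability (CoLA), and semantic similarity (STS-B).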

### Load dependencies

In [1]:
%%capture
!pip install transformers==4.28.0 sentencepiece==0.1.98

In [2]:
from transformers import T5ForConditionalGeneration, T5Tokenizer

### Load model

In [3]:
tokenizer = T5Tokenizer.from_pretrained('t5-base')
model = T5ForConditionalGeneration.from_pretrained('t5-base')

### Perform inference

**Translation**:

In [4]:
input_ids = tokenizer.encode('translate English to German: Where is the chocolate?', return_tensors='pt')

translate_ids = model.generate(
    input_ids,
    num_beams=4, # number of candidate sequences (beams) kept at each step; higher can give better results but is more expensive
    no_repeat_ngram_size=3, # no 3-gram may appear more than once in the output
    max_length=20, # maximum length of the generated sequence, in tokens
    early_stopping=True # stop beam search as soon as num_beams finished candidates exist
)

output = tokenizer.decode(translate_ids[0], skip_special_tokens=True)

print(f"Translated text:\n{output}")

Translated text:
Wo ist die Schokolade?
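
To make the `num_beams` setting more concrete, here is a minimal sketch of beam search over a hand-made toy distribution. `TOY_MODEL` and `beam_search` are hypothetical names used only for illustration, the probabilities are invented, and this is not how `model.generate` is implemented internally. With one beam (greedy search) the locally best first token wins; with two beams, a sequence with a higher overall probability is found.

```python
import math

# Hypothetical toy "model": maps a prefix to next-token probabilities (invented numbers).
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.5, "y": 0.5},
    ("B",): {"z": 0.9, "w": 0.1},
}

def beam_search(model, num_beams, max_length):
    """Return the best (sequence, log-probability) pair found with `num_beams` beams."""
    beams = [((), 0.0)]  # each beam is (sequence, cumulative log-probability)
    for _ in range(max_length):
        candidates = []
        for seq, score in beams:
            # Extend every surviving beam by every possible next token.
            for token, prob in model.get(seq, {}).items():
                candidates.append((seq + (token,), score + math.log(prob)))
        if not candidates:
            break
        # Keep only the num_beams highest-scoring extensions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams[0]

greedy_seq, greedy_score = beam_search(TOY_MODEL, num_beams=1, max_length=2)
beam_seq, beam_score = beam_search(TOY_MODEL, num_beams=2, max_length=2)
print(greedy_seq, round(math.exp(greedy_score), 2))
print(beam_seq, round(math.exp(beam_score), 2))
```

Greedy search commits to `"A"` because it is the most likely first token, while the two-beam search also keeps `"B"` alive and ends up with the more probable full sequence.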


**Summarization** of the [T5 paper](https://arxiv.org/abs/1910.10683) abstract:

In [5]:
text_to_summarize = """Transfer learning, where a model is first pre-trained on a
data-rich task before being fine-tuned on a downstream task, has emerged as a
powerful technique in natural language processing (NLP). The effectiveness of
transfer learning has given rise to a diversity of approaches, methodology, and
practice. In this paper, we explore the landscape of transfer learning techniques
for NLP by introducing a unified framework that converts all text-based language
problems into a text-to-text format. Our systematic study compares pre-training
objectives, architectures, unlabeled data sets, transfer approaches, and other
factors on dozens of language understanding tasks. By combining the insights from
our exploration with scale and our new Colossal Clean Crawled Corpus, we
achieve state-of-the-art results on many benchmarks covering summarization,
question answering, text classification, and more. To facilitate future work on
transfer learning for NLP, we release our data set, pre-trained models, and code."""

preprocess_text = text_to_summarize.strip().replace("\n", " ") # join wrapped lines with a space so words at line breaks are not fused

t5_prepared_text = "summarize: " + preprocess_text # add prompt

input_ids = tokenizer.encode(t5_prepared_text, return_tensors="pt")

# summarize
summary_ids = model.generate(
    input_ids,
    num_beams=4,
    no_repeat_ngram_size=3,
    min_length=30, # minimum length of the generated sequence, in tokens
    max_length=50,
    early_stopping=True
)

output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print(f"Summarized text: \n{output}")

Summarized text: 
transfer learning has emerged as a powerful technique in natural language processing (NLP) a unified framework converts all text-based language problems into a text-to-text format. our study compares pre-training objectives
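
The `no_repeat_ngram_size=3` setting used above can be pictured as a filter applied before each new token is chosen: any token that would complete a 3-gram already present in the output is ruled out. The following is a simplified sketch of that idea, not the library's implementation; `banned_next_tokens` is a hypothetical helper and it works on plain strings rather than token ids.

```python
def banned_next_tokens(generated, n):
    """Return the tokens that would repeat an n-gram already present in `generated`."""
    if n <= 0 or len(generated) + 1 < n:
        return set()
    # The last n-1 tokens form the prefix of the n-gram that the next token would complete.
    prefix = tuple(generated[len(generated) - (n - 1):])
    banned = set()
    for i in range(len(generated) - n + 1):
        ngram = tuple(generated[i:i + n])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned

tokens = ["transfer", "learning", "is", "powerful", "transfer", "learning"]
print(banned_next_tokens(tokens, 3))  # "is" would repeat "transfer learning is"
```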


CoLA, the Corpus of Linguistic Acceptability, checks for **grammatical correctness**:

In [8]:
input_ids = tokenizer.encode('cola sentence: The class is going poorly', return_tensors='pt')

cola_ids = model.generate(
    input_ids,
    num_beams=4,
    no_repeat_ngram_size=3,
    max_length=20,
    early_stopping=True
)

output = tokenizer.decode(cola_ids[0], skip_special_tokens=True)

print(f"is grammatically correct?: \n{output}")

is grammatically correct?: 
acceptable


In [9]:
input_ids = tokenizer.encode('cola sentence: The poorly is going class', return_tensors='pt')

cola_ids = model.generate(
    input_ids,
    num_beams=4,
    no_repeat_ngram_size=3,
    max_length=20,
    early_stopping=True
)

output = tokenizer.decode(cola_ids[0], skip_special_tokens=True)

print(f"is grammatically correct?: \n{output}")

is grammatically correct?: 
unacceptable
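
The cells in this notebook repeat the same three steps: encode a prefixed prompt, call `generate`, and decode the first returned sequence. As a sketch, these steps could be factored into a small helper. The names `run_t5` and `cola_is_acceptable` are hypothetical and not part of `transformers`; the helper assumes it is given the `model` and `tokenizer` objects loaded earlier.

```python
def run_t5(model, tokenizer, prompt, **generate_kwargs):
    """Encode `prompt`, generate with `model`, and decode the first returned sequence."""
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    output_ids = model.generate(input_ids, **generate_kwargs)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

def cola_is_acceptable(output):
    """Map the CoLA text label produced by the model to a boolean."""
    return output.strip().lower() == "acceptable"
```

With the objects from the cells above, a call might look like `run_t5(model, tokenizer, 'cola sentence: The class is going poorly', num_beams=4, no_repeat_ngram_size=3, max_length=20, early_stopping=True)`, with the result passed to `cola_is_acceptable`.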


STS-B, the Semantic Textual Similarity Benchmark, rates the **semantic similarity** between two sentences on a scale from 0 to 5:

In [10]:
sentence_one = 'How to fish'
sentence_two = 'Guide for anglers'

input_ids = tokenizer.encode(f"stsb sentence1: {sentence_one} sentence2: {sentence_two}", return_tensors='pt')

# calculate semantic similarity
stsb_ids = model.generate(
    input_ids,
    max_length=3 # the score is short, so a few tokens are enough
)

output = tokenizer.decode(stsb_ids[0], skip_special_tokens=True)

print(f"semantically similar? (0-5): \n{output}")

semantically similar? (0-5): 
3.2


In [11]:
sentence_one = 'How to fish'
sentence_two = 'Guide for hikers'

input_ids = tokenizer.encode(f"stsb sentence1: {sentence_one} sentence2: {sentence_two}", return_tensors='pt')

# calculate semantic similarity
stsb_ids = model.generate(
    input_ids,
    max_length=3 # the score is short, so a few tokens are enough
)

output = tokenizer.decode(stsb_ids[0], skip_special_tokens=True)

print(f"semantically similar? (0-5): \n{output}")

semantically similar? (0-5): 
0.0
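
Because T5 returns the similarity score as text, downstream code usually needs it as a number. The following is a minimal sketch, assuming the output is a plain decimal string such as the ones above; `stsb_prompt` and `parse_stsb_score` are hypothetical helper names, and clamping to the 0 to 5 range is a design choice rather than something the model guarantees.

```python
import math

def stsb_prompt(sentence_one, sentence_two):
    """Build the STS-B prompt in the same format as the cells above."""
    return f"stsb sentence1: {sentence_one} sentence2: {sentence_two}"

def parse_stsb_score(output):
    """Convert the decoded model output to a float in [0, 5], or None if it is not a number."""
    try:
        score = float(output.strip())
    except ValueError:
        return None
    if not math.isfinite(score):
        return None
    # Clamp in case the generated text falls slightly outside the benchmark's range.
    return min(5.0, max(0.0, score))

print(stsb_prompt("How to fish", "Guide for anglers"))
print(parse_stsb_score("3.2"))
```

Returning `None` for unparseable text keeps a malformed generation from being silently treated as a score of zero.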
