# T5-base model

## Translation

In [None]:
# install as needed
pip install sentencepiece

In [8]:
from transformers import T5Tokenizer, T5Model

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5Model.from_pretrained("t5-base")

input_ids = tokenizer(
    "Studies have been shown that owning a dog is good for you", return_tensors="pt"
).input_ids  # Batch size 1
decoder_input_ids = tokenizer("Studies show that", return_tensors="pt").input_ids  # Batch size 1

# forward pass
outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
last_hidden_states = outputs.last_hidden_state

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [9]:
last_hidden_states

tensor([[[ 0.0005, -0.1320,  0.2328,  ...,  0.0699,  0.0568, -0.1638],
         [-0.0052, -0.1856,  0.1215,  ..., -0.0225, -0.0012, -0.2632],
         [ 0.0824, -0.3187,  0.0708,  ...,  0.0134,  0.0274, -0.3392],
         [ 0.1529, -0.4006,  0.1589,  ...,  0.0711,  0.0720, -0.2287]]],
       grad_fn=<MulBackward0>)

In [10]:
last_hidden_states.shape

torch.Size([1, 4, 768])

## Summarization

In [1]:
import torch
from transformers import AutoTokenizer, AutoModelWithLMHead

  from .autonotebook import tqdm as notebook_tqdm


In [2]:
tokenizer=AutoTokenizer.from_pretrained('T5-base')
model=AutoModelWithLMHead.from_pretrained('T5-base', return_dict=True)



In [3]:
sequence = ("Data science is an interdisciplinary field[10] focused on extracting knowledge from typically large data sets and applying the knowledge and insights from that data to solve problems in a wide range of application domains.[11] The field encompasses preparing data for analysis, formulating data science problems, analyzing data, developing data-driven solutions, and presenting findings to inform high-level decisions in a broad range of application domains. As such, it incorporates skills from computer science, statistics, information science, mathematics, data visualization, information visualization, data sonification, data integration, graphic design, complex systems, communication and business.[12][13] Statistician Nathan Yau, drawing on Ben Fry, also links data science to human–computer interaction: users should be able to intuitively control and explore data.[14][15] In 2015, the American Statistical Association identified database management, statistics and machine learning, and distributed and parallel systems as the three emerging foundational professional communities.[16]")

In [4]:
inputs=tokenizer.encode("sumarize: " +sequence,return_tensors='pt', max_length=512, truncation=True)

In [5]:
output = model.generate(inputs, min_length=80, max_length=100)

In [6]:
summary=tokenizer.decode(output[0])
print(summary)

2023-06-19 09:42:59.840820: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


<pad> data science is an interdisciplinary field focused on extracting knowledge from typically large data sets. it incorporates skills from computer science, statistics, information science, mathematics, data visualization, information visualization, data sonification, graphic design, complex systems, communication and business. the field encompasses preparing data for analysis, formulating data science problems, analyzing data, developing data-driven solutions.</s>
