In [None]:
import pandas as pd
from simpletransformers.t5 import T5Model, T5Args

train_data = [
    ["binary classification", "Anakin was Luke's father", "1"],
    ["binary classification", "Luke was a Sith Lord", "0"],
    ["generate question", "Star Wars is an American epic space-opera media franchise created by George Lucas, which began with the eponymous 1977 film and quickly became a worldwide pop-culture phenomenon", "Who created the Star Wars franchise?"],
    ["generate question", "Anakin was Luke's father", "Who was Luke's father?"],
]
train_df = pd.DataFrame(train_data)
train_df.columns = ["prefix", "input_text", "target_text"]

eval_data = [
    ["binary classification", "Leia was Luke's sister", "1"],
    ["binary classification", "Han was a Sith Lord", "0"],
    ["generate question", "In 2020, the Star Wars franchise's total value was estimated at US$70 billion, and it is currently the fifth-highest-grossing media franchise of all time.", "What is the total value of the Star Wars franchise?"],
    ["generate question", "Leia was Luke's sister", "Who was Luke's sister?"],
]
eval_df = pd.DataFrame(eval_data)
eval_df.columns = ["prefix", "input_text", "target_text"]

model_args = T5Args()
model_args.num_train_epochs = 200
model_args.no_save = True
model_args.evaluate_generated_text = True
model_args.evaluate_during_training = True
model_args.evaluate_during_training_verbose = True
model_args.overwrite_output_dir = True

model = T5Model("t5", "t5-base", args=model_args)


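def count_matches(labels, preds):
    # Custom evaluation metric: number of generated outputs that exactly match their targets.
    # The prints expose the raw targets and predictions at each evaluation step.
    print(labels)
    print(preds)
    return sum(label == pred for label, pred in zip(labels, preds))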


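# Pass the metric as a keyword argument so it is computed at each evaluation.
model.train_model(train_df, eval_data=eval_df, matches=count_matches)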

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
`prepare_seq2seq_batch` is deprecated and will be removed in version 5 of HuggingFace Transformers. Use the regular
`__call__` method to prepare your inputs and targets.

Here is a short example:

model_inputs = tokenizer(src_texts, text_target=tgt_texts, ...)

If you either need to use different keyword arguments for the source and target texts, you should do two calls like
this:

model_inputs = tokenizer(src_texts, ...)
labels = tokenizer(text_target=tgt_texts, ...)
model_inputs["labels"] = labels["input_ids"]

See the documentation of your specific tokenizer for more details on the specif
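
The warnings above make two recommendations: set `max_length` explicitly instead of relying on an implicit 512-token limit, and prepare targets through the tokenizer's regular `__call__` with `text_target` rather than `prepare_seq2seq_batch`. The sketch below illustrates the two-call pattern from the warning. So that it runs offline, it uses `ToyTokenizer`, a hypothetical whitespace tokenizer that only mimics the call signature; `encode_pairs`, `max_source_length`, and `max_target_length` are likewise illustrative names, not part of any library.

```python
class ToyTokenizer:
    """Hypothetical stand-in for a real tokenizer: whitespace split, toy integer ids."""

    def __init__(self):
        self.vocab = {}

    def _encode(self, text, max_length, truncation):
        # Assign each new token the next free id, starting at 1.
        ids = [self.vocab.setdefault(tok, len(self.vocab) + 1) for tok in text.split()]
        return ids[:max_length] if truncation and max_length else ids

    def __call__(self, text=None, text_target=None, max_length=None, truncation=False):
        texts = text if text is not None else text_target
        return {"input_ids": [self._encode(t, max_length, truncation) for t in texts]}


def encode_pairs(tokenizer, src_texts, tgt_texts, max_source_length, max_target_length):
    # Request truncation and a length limit explicitly for sources and targets,
    # using separate calls because the two limits differ.
    model_inputs = tokenizer(src_texts, max_length=max_source_length, truncation=True)
    labels = tokenizer(text_target=tgt_texts, max_length=max_target_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


batch = encode_pairs(
    ToyTokenizer(),
    ["generate question: Anakin was Luke's father"],
    ["Who was Luke's father?"],
    max_source_length=4,
    max_target_length=8,
)
print(batch)
```

With Transformers installed, a real tokenizer object could be passed in place of `ToyTokenizer`, assuming the installed version accepts `text_target` as the warning describes. A real tokenizer returns its own batch type rather than a plain dict, but the warning's example assigns `labels` to it in the same way.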

Generating outputs: 100%|██████████| 1/1 [00:00<00:00, 15.75it/s]
Decoding outputs: 100%|██████████| 4/4 [00:00<00:00,  5.87it/s]
Epoch 4 of 200:   2%|▊         | 3/200 [00:18<20:00,  6.09s/it]
Epochs 3/200. Running Loss:    3.2437: 100%|██████████| 1/1 [00:00<00:00,  4.16it/s]
100%|██████████| 4/4 [00:00<00:00, 62.87it/s]
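
The `count_matches` metric defined in the cell above scores generated text by exact string equality with the target, so any difference in casing or punctuation counts as a miss. As a minimal sketch of that behavior, the block below reimplements the metric without the debug prints and adds `match_rate`, a hypothetical helper (not part of simpletransformers) that reports the same count as a fraction. The predictions are made up for illustration and are not output from the model.

```python
def count_matches(labels, preds):
    # Same exact-match rule as the notebook's metric, without the debug prints.
    return sum(label == pred for label, pred in zip(labels, preds))


def match_rate(labels, preds):
    # Hypothetical helper: exact matches as a fraction of all examples.
    return count_matches(labels, preds) / len(labels) if labels else 0.0


# Targets taken from the evaluation set; predictions are invented for this sketch.
labels = ["1", "0", "Who was Luke's sister?"]
preds = ["1", "1", "Who was Luke's sister?"]

print(count_matches(labels, preds))
print(match_rate(labels, preds))
```

A fraction is easier to compare across evaluation sets of different sizes, which matters here because the binary classification and question generation examples share one metric.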