<a href="https://colab.research.google.com/github/ralfferreira/generate-abstract/blob/main/AbstractPszemraj.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [5]:
pip install transformers datasets evaluate torch

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [6]:
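import os
# Limit the size of blocks the CUDA caching allocator may split, which can reduce fragmentation.
# Set this before PyTorch initializes CUDA.
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:21'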

In [7]:
from datasets import load_dataset

sci_papers = load_dataset("hackathon-pln-es/scientific_papers_en", split="train")
sci_papers = sci_papers.train_test_split(test_size=0.2)
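
With test_size=0.2, the split sizes that appear later in the training log (1403 training examples, 351 evaluation examples) can be reproduced by hand. This is a minimal sketch, assuming the dataset has 1754 rows in total (inferred from 1403 + 351, not stated in the notebook) and that the fractional test size is rounded up; `split_sizes` is a hypothetical helper, not part of the datasets library.

```python
import math

def split_sizes(n_rows, test_size):
    """Return (n_train, n_test) for a fractional test size, rounding the test part up."""
    n_test = math.ceil(test_size * n_rows)  # assumption: the test share is rounded up
    return n_rows - n_test, n_test

# 1754 is an assumed total, taken from the sizes in the training log.
n_train, n_test = split_sizes(1754, 0.2)
print(n_train, n_test)
```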



In [8]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("pszemraj/long-t5-tglobal-base-16384-book-summary")
model = AutoModelForSeq2SeqLM.from_pretrained("pszemraj/long-t5-tglobal-base-16384-book-summary")


In [9]:
prefix = "summarize: "


def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["full_text"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

    labels = tokenizer(text_target=examples["abstract"], max_length=500, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [10]:
tokenized_sci_papers = sci_papers.map(preprocess_function, batched=True)


In [11]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
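
The collator pads each batch to the length of its longest example and, by default, pads the labels with -100 so that padded positions are ignored by the loss. The following is a plain-Python sketch of that idea, not the library's implementation; `pad_labels` and the token ids are hypothetical.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default

def pad_labels(batch, pad_value=IGNORE_INDEX):
    """Right-pad every label sequence to the length of the longest one in the batch."""
    longest = max(len(labels) for labels in batch)
    return [labels + [pad_value] * (longest - len(labels)) for labels in batch]

# Two label sequences of different lengths (made-up token ids).
padded = pad_labels([[71, 1712, 1], [8, 1]])
print(padded)  # [[71, 1712, 1], [8, 1, -100]]
```

This is also why compute_metrics replaces -100 with the pad token id before decoding: -100 is not a valid token id.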

In [12]:
!pip install rouge_score
import evaluate

rouge = evaluate.load("rouge")

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
Building wheels for collected packages: rouge-score
  Building wheel for rouge-score (setup.py) ... done
  Created wheel for rouge-score: filename=rouge_score-0.1.2-py3-none-any.whl size=24955 sha256=754a73fc66f1ea6e4c1962ccd4e81d4cdd21589989993aa599298db893b84351
  Stored in directory: /root/.cache/pip/wheels/24/55/6f/ebfc4cb176d1c9665da4e306e1705496206d08215c1acd9dde
Successfully built rouge-score
Installing collected packages: rouge-score
Successfully installed rouge-score-0.1.2



In [13]:
import numpy as np


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
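
ROUGE-1 measures unigram overlap between a generated summary and its reference. The following is a simplified sketch of the F1 variant, assuming lowercase whitespace tokenization and no stemming. The `rouge` metric loaded above uses its own tokenizer and, with use_stemmer=True, a stemmer, so its scores will differ. `rouge1_f1` is a hypothetical helper.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between two strings (simplified: no stemming)."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each token counts at most as often as it occurs in both texts.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1(
    "the transformer relies on attention",
    "the transformer relies entirely on attention mechanisms",
)
print(round(score, 4))  # 0.8333
```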

In [14]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained("pszemraj/long-t5-tglobal-base-16384-book-summary")


In [15]:
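import torch
cuda0 = torch.device('cuda:0')

# Allocate a small tensor on the GPU to confirm that CUDA is usable.
x = torch.tensor([1., 2.], device=cuda0)
# x.device is device(type='cuda', index=0)

# Release unused cached memory held by the allocator, then print allocator statistics.
torch.cuda.empty_cache()
print(torch.cuda.memory_summary(device=cuda0, abbreviated=False))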



In [16]:
training_args = Seq2SeqTrainingArguments(
    output_dir="test_dir",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=2,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_sci_papers["train"],
    eval_dataset=tokenized_sci_papers["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

The following columns in the training set don't have a corresponding argument in `LongT5ForConditionalGeneration.forward` and have been ignored: id, abstract, text_no_abstract, full_text. If id, abstract, text_no_abstract, full_text are not expected by `LongT5ForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 1403
  Num Epochs = 2
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 2
  Gradient Accumulation steps = 1
  Total optimization steps = 1404
  Number of trainable parameters = 247587456
You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch  Training Loss  Validation Loss  Rouge1  Rouge2  Rougel  Rougelsum  Gen Len
1      0.6465         0.416194         0.5708  0.3958  0.5472  0.5478     134.9516
2      0.5209         0.40357          0.5765  0.3997  0.5526  0.553      133.5726


Saving model checkpoint to test_dir/checkpoint-500
Configuration saved in test_dir/checkpoint-500/config.json
Model weights saved in test_dir/checkpoint-500/pytorch_model.bin
tokenizer config file saved in test_dir/checkpoint-500/tokenizer_config.json
Special tokens file saved in test_dir/checkpoint-500/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `LongT5ForConditionalGeneration.forward` and have been ignored: id, abstract, text_no_abstract, full_text. If id, abstract, text_no_abstract, full_text are not expected by `LongT5ForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 351
  Batch size = 2
Saving model checkpoint to test_dir/checkpoint-1000
Configuration saved in test_dir/checkpoint-1000/config.json
Model weights saved in test_dir/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in test_dir/checkpoint-1000/tokenizer_config.json
Special toke

TrainOutput(global_step=1404, training_loss=0.5616397694644765, metrics={'train_runtime': 8176.7981, 'train_samples_per_second': 0.343, 'train_steps_per_second': 0.172, 'total_flos': 3843017146368000.0, 'train_loss': 0.5616397694644765, 'epoch': 2.0})
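
The figures in the log above are mutually consistent and can be checked by hand. This short sketch uses only numbers printed in the log. The checkpoint interval of 500 steps is an inference from the checkpoint-500 and checkpoint-1000 directory names, since the notebook does not set it explicitly.

```python
import math

num_examples = 1403        # "Num examples" in the training log
batch_size = 2             # per_device_train_batch_size, one device, no accumulation
num_epochs = 2
train_runtime = 8176.7981  # seconds, from TrainOutput

# The last batch of each epoch is partial, so the step count is rounded up.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(total_steps)  # 1404

steps_per_second = total_steps / train_runtime
samples_per_second = num_examples * num_epochs / train_runtime
print(round(steps_per_second, 3), round(samples_per_second, 3))  # 0.172 0.343

save_every = 500  # inferred from the checkpoint directory names
checkpoints = list(range(save_every, total_steps + 1, save_every))
print(checkpoints)  # [500, 1000]
```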

In [17]:
artigo="""Recurrent models typically factor computation along the symbol positions of the input and output
sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden
states ht, as a function of the previous hidden state ht−1 and the input for position t. This inherently
sequential nature precludes parallelization within training examples, which becomes critical at longer
sequence lengths, as memory constraints limit batching across examples. Recent work has achieved
significant improvements in computational efficiency through factorization tricks [21] and conditional
computation [32], while also improving model performance in case of the latter. The fundamental
constraint of sequential computation, however, remains.

Attention mechanisms have become an integral part of compelling sequence modeling and transduc-
tion models in various tasks, allowing modeling of dependencies without regard to their distance in
the input or output sequences [2, 19]. In all but a few cases [27], however, such attention mechanisms
are used in conjunction with a recurrent network.

In this work we propose the Transformer, a model architecture eschewing recurrence and instead
relying entirely on an attention mechanism to draw global dependencies between input and output.
The Transformer allows for significantly more parallelization and can reach a new state of the art in
translation quality after being trained for as little as twelve hours on eight P100 GPUs."""

In [18]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("pszemraj/long-t5-tglobal-base-16384-book-summary")
inputs = tokenizer(artigo, return_tensors="pt").input_ids

loading file spiece.model from cache at /root/.cache/huggingface/hub/models--pszemraj--long-t5-tglobal-base-16384-book-summary/snapshots/8180a3b656e2e04608ffc5ee5634a8e5f52d9962/spiece.model
loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--pszemraj--long-t5-tglobal-base-16384-book-summary/snapshots/8180a3b656e2e04608ffc5ee5634a8e5f52d9962/tokenizer.json
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--pszemraj--long-t5-tglobal-base-16384-book-summary/snapshots/8180a3b656e2e04608ffc5ee5634a8e5f52d9962/special_tokens_map.json
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--pszemraj--long-t5-tglobal-base-16384-book-summary/snapshots/8180a3b656e2e04608ffc5ee5634a8e5f52d9962/tokenizer_config.json


In [19]:
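from transformers import AutoModelForSeq2SeqLM

# NOTE: `inputs` was encoded with the LongT5 tokenizer, but this cell loads
# facebook/bart-large-cnn, which has a different vocabulary and is not the
# model fine-tuned above, so the token ids do not match the model.
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")
outputs = model.generate(inputs, max_new_tokens=500, do_sample=False)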


loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--facebook--bart-large-cnn/snapshots/45b6053e29f785d9a3b94aecfe8473b015e67156/config.json
Model config BartConfig {
  "_name_or_path": "facebook/bart-large-cnn",
  "_num_labels": 3,
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "add_final_layer_norm": false,
  "architectures": [
    "BartForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "classif_dropout": 0.0,
  "classifier_dropout": 0.0,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 12,
  "decoder_start_token_id": 2,
  "dropout": 0.1,
  "early_stopping": true,
  "encoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 12,
  "eos_token_id": 2,
  "force_bos_token_to_be_generated": true,
  "forced_bos_token_id": 0,
  "forced_eos_token_id": 2,
  "gradient_checkpointing": fals


loading weights file pytorch_model.bin from cache at /root/.cache/huggingface/hub/models--facebook--bart-large-cnn/snapshots/45b6053e29f785d9a3b94aecfe8473b015e67156/pytorch_model.bin
All model checkpoint weights were used when initializing BartForConditionalGeneration.

All the weights of BartForConditionalGeneration were initialized from the model checkpoint at facebook/bart-large-cnn.
If your task is similar to the task the model of the checkpoint was trained on, you can already use BartForConditionalGeneration for predictions without further training.


In [20]:
tokenizer.decode(outputs[0], skip_special_tokens=True)

'would Recent work has achieved significant improvements in computational efficiency through factorization tricks [21] and conditional computation [32], while also improving model performance in case of the latter. The fundamental constraint I of sequential computation, however, remains. Attention mechanisms have becomet an integral part of compelling sequence modeling and transduc- dependencies without regard to their distance in the inputX'
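
The stray fragments in the decoded text ('would', 'becomet', 'inputX') are what one might expect when token ids cross between models with different vocabularies: the ids here come from the LongT5 tokenizer, while facebook/bart-large-cnn assigns its own meaning to each id. The following toy sketch shows the effect with two made-up vocabularies (both hypothetical and far smaller than real ones).

```python
# Two hypothetical vocabularies that assign different tokens to the same ids.
vocab_a = {0: "<pad>", 1: "attention", 2: "is", 3: "all", 4: "you", 5: "need"}
vocab_b = {0: "<s>", 1: "the", 2: "cat", 3: "sat", 4: "down", 5: "today"}

def encode(text, vocab):
    """Map whitespace-separated tokens to ids under the given vocabulary."""
    token_to_id = {token: idx for idx, token in vocab.items()}
    return [token_to_id[token] for token in text.split()]

def decode(ids, vocab):
    """Map ids back to tokens under the given vocabulary."""
    return " ".join(vocab[idx] for idx in ids)

ids = encode("attention is all you need", vocab_a)
print(decode(ids, vocab_a))  # attention is all you need
print(decode(ids, vocab_b))  # the cat sat down today
```

Loading the tokenizer and the model from the same checkpoint keeps the id-to-token mapping consistent.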