# Source

* Paper source: Automatic Pull Request Title Generation
* Dataset: https://github.com/soarsmu/PRTiger/raw/main/data/PRTiger.zip
* Model source: https://colab.research.google.com/github/elsanns/xai-nlp-notebooks/blob/master/fine_tune_bart_summarization_two_langs.ipynb

# Load dataset

In [1]:
PATH = '/kaggle/working'

In [2]:
!wget https://github.com/soarsmu/PRTiger/raw/main/data/PRTiger.zip

--2023-05-22 14:04:53--  https://github.com/soarsmu/PRTiger/raw/main/data/PRTiger.zip
Resolving github.com (github.com)... 192.30.255.113
Connecting to github.com (github.com)|192.30.255.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/soarsmu/PRTiger/main/data/PRTiger.zip [following]
--2023-05-22 14:04:53--  https://raw.githubusercontent.com/soarsmu/PRTiger/main/data/PRTiger.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25051895 (24M) [application/zip]
Saving to: ‘PRTiger.zip’


2023-05-22 14:04:54 (188 MB/s) - ‘PRTiger.zip’ saved [25051895/25051895]



In [3]:
!unzip -a {PATH}/PRTiger.zip

Archive:  /kaggle/working/PRTiger.zip
   creating: PRTiger/
   creating: PRTiger/no-token/
  inflating: PRTiger/.DS_Store       [binary]
  inflating: __MACOSX/PRTiger/._.DS_Store  [binary]
   creating: PRTiger/with-token/
  inflating: PRTiger/no-token/valid.csv  [binary]
  inflating: __MACOSX/PRTiger/no-token/._valid.csv  [binary]
  inflating: PRTiger/no-token/test.csv  [binary]
  inflating: __MACOSX/PRTiger/no-token/._test.csv  [binary]
  inflating: PRTiger/no-token/train.csv  [binary]
  inflating: __MACOSX/PRTiger/no-token/._train.csv  [binary]
  inflating: PRTiger/with-token/valid.csv  [binary]
  inflating: __MACOSX/PRTiger/with-token/._valid.csv  [binary]
  inflating: PRTiger/with-token/test.csv  [binary]
  inflating: __MACOSX/PRTiger/with-token/._test.csv  [binary]
  inflating: PRTiger/with-token/train.csv  [binary]
  inflating: __MACOSX/PRTiger/with-token/._train.csv  [binary]


# Import

In [4]:
! pip install transformers
! pip install datasets
! pip install sentencepiece
! pip install rouge_score

Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... done
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24954 sha256=de96405a0ef91a98c0fe042dab0dcb400fd9c5cd60e60ca43920ad44bce94913
  Stored in directory: /root/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2

In [5]:
import pandas as pd
import torch
import numpy as np
import datasets

from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
    DataCollatorForSeq2Seq,
)

from tabulate import tabulate
import nltk
from datetime import datetime

from datasets import Dataset, DatasetDict



# Model and tokenizer

In [6]:
model_name = "facebook/bart-base"

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# tokenization
encoder_max_length = 512 
decoder_max_length = 64

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.72k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/558M [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

# Prepare data

In [7]:
df_train = pd.read_csv(f'{PATH}/PRTiger/no-token/train.csv')
df_valid = pd.read_csv(f'{PATH}/PRTiger/no-token/valid.csv')
df_test = pd.read_csv(f'{PATH}/PRTiger/no-token/test.csv')

In [8]:
def format_data(df_input):
    # Copy the selected columns so the assignments below modify an independent
    # DataFrame rather than a slice of the input (avoids SettingWithCopyWarning).
    df_output = df_input[['text', 'summary']].copy()
    df_output.columns = ["document", "summary"]
    # Lowercase both the PR text and its title.
    df_output['document'] = df_output['document'].str.lower()
    df_output['summary'] = df_output['summary'].str.lower()
    return df_output

In [9]:
train = format_data(df_train)
validation = format_data(df_valid)
test = format_data(df_test)

In [10]:
train_data_txt = Dataset.from_pandas(train)
validation_data_txt = Dataset.from_pandas(validation)
test_data_txt = Dataset.from_pandas(test)

## Tokenize

In [11]:
def batch_tokenize_preprocess(batch, tokenizer, max_source_length, max_target_length):
    source, target = batch["document"], batch["summary"]
    source_tokenized = tokenizer(
        source, padding="max_length", truncation=True, max_length=max_source_length
    )
    target_tokenized = tokenizer(
        target, padding="max_length", truncation=True, max_length=max_target_length
    )

    batch = {k: v for k, v in source_tokenized.items()}
    # Ignore padding in the loss
    batch["labels"] = [
        [-100 if token == tokenizer.pad_token_id else token for token in l]
        for l in target_tokenized["input_ids"]
    ]
    return batch


train_data = train_data_txt.map(
    lambda batch: batch_tokenize_preprocess(
        batch, tokenizer, encoder_max_length, decoder_max_length
    ),
    batched=True,
    remove_columns=train_data_txt.column_names,
)

validation_data = validation_data_txt.map(
    lambda batch: batch_tokenize_preprocess(
        batch, tokenizer, encoder_max_length, decoder_max_length
    ),
    batched=True,
    remove_columns=validation_data_txt.column_names,
)


test_data = test_data_txt.map(
    lambda batch: batch_tokenize_preprocess(
        batch, tokenizer, encoder_max_length, decoder_max_length
    ),
    batched=True,
    remove_columns=test_data_txt.column_names,
)

  0%|          | 0/36 [00:00<?, ?ba/s]

  0%|          | 0/5 [00:00<?, ?ba/s]

  0%|          | 0/5 [00:00<?, ?ba/s]
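The label masking inside `batch_tokenize_preprocess` can be illustrated without loading the tokenizer. The following is a minimal sketch, assuming a pad token id of 1 (the value configured for `facebook/bart-base`; in the notebook, read it from `tokenizer.pad_token_id`). The names `mask_padding`, `PAD_ID`, `IGNORE_INDEX` and the token ids are hypothetical and exist only for illustration. Positions set to -100 are skipped by PyTorch's cross-entropy loss, whose default `ignore_index` is -100.

```python
# Hypothetical pad id for illustration; in the notebook use tokenizer.pad_token_id.
PAD_ID = 1
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_padding(batch_input_ids, pad_id=PAD_ID):
    """Replace padding ids with IGNORE_INDEX so they do not contribute to the loss."""
    return [
        [IGNORE_INDEX if token == pad_id else token for token in ids]
        for ids in batch_input_ids
    ]


# Two made-up target sequences, both padded to length 6.
padded_targets = [
    [0, 4321, 87, 2, 1, 1],
    [0, 55, 2, 1, 1, 1],
]
labels = mask_padding(padded_targets)
print(labels)
# [[0, 4321, 87, 2, -100, -100], [0, 55, 2, -100, -100, -100]]
```

Only the targets are masked this way; the encoder inputs keep their padding ids and rely on `attention_mask` instead.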

# Train Setting

## Metrics

In [12]:
# Borrowed from https://github.com/huggingface/transformers/blob/master/examples/seq2seq/run_summarization.py

nltk.download("punkt", quiet=True)

metric = datasets.load_metric("rouge")


def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]

    # rougeLSum expects newline after each sentence
    preds = ["\n".join(nltk.sent_tokenize(pred)) for pred in preds]
    labels = ["\n".join(nltk.sent_tokenize(label)) for label in labels]

    return preds, labels


def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(
        predictions=decoded_preds, references=decoded_labels, use_stemmer=True
    )
    # Extract a few results from ROUGE
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}

    prediction_lens = [
        np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds
    ]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result

Downloading builder script:   0%|          | 0.00/2.16k [00:00<?, ?B/s]
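To make the ROUGE-1 number reported by `compute_metrics` more concrete, here is a simplified sketch of a unigram-overlap F-measure for a single prediction/reference pair. It is not the `rouge_score` implementation: it splits on whitespace, applies no stemming, and does no bootstrap aggregation, so its values will generally differ from the library's. The function name `rouge1_f` is hypothetical; the example strings are row 0 of the validation samples printed later in this notebook.

```python
from collections import Counter


def rouge1_f(prediction: str, reference: str) -> float:
    """Unigram-overlap F-measure on whitespace tokens (no stemming, single pair)."""
    pred_counts = Counter(prediction.split())
    ref_counts = Counter(reference.split())
    # Multiset intersection: each token counts at most as often as it appears in both.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f(
    "fix error when service type is nodeport",
    "fix notes.txt when service type is nodeport",
)
print(round(score * 100, 2))  # 85.71 (6 of 7 tokens match on both sides)
```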

## Training arguments

In [13]:
training_args = Seq2SeqTrainingArguments(
    output_dir="baseline",
    seed=42,
    data_seed=42,
    num_train_epochs=4,
    do_train=True,
    do_eval=True,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    warmup_steps=500,
    weight_decay=0.1,
    label_smoothing_factor=0.1,
    predict_with_generate=True,
    logging_steps=6000,
    evaluation_strategy="steps",
    eval_steps=6000,
    save_steps=6000,
    save_total_limit=5,
    load_best_model_at_end=True,
    report_to="none",
)

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_data,
    eval_dataset=validation_data,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

# Train

In [14]:
WANDB_INTEGRATION = False

In [15]:
trainer.train()

You're using a BartTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


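| Step | Training Loss | Validation Loss | Rouge1 | Rouge2 | RougeL | RougeLsum | Gen Len |
|------|---------------|-----------------|--------|--------|--------|-----------|---------|
| 6000 | 3.981 | 3.623652 | 43.6811 | 22.857 | 40.2063 | 40.2046 | 13.3749 |
| 12000 | 3.5753 | 3.527457 | 45.2726 | 23.7434 | 41.6732 | 41.6859 | 13.1084 |
| 18000 | 3.345 | 3.493475 | 46.0762 | 24.6738 | 42.195 | 42.2146 | 13.615 |
| 24000 | 3.0597 | 3.438034 | 45.8483 | 24.4818 | 42.265 | 42.2991 | 13.1705 |
| 30000 | 2.9079 | 3.469506 | 46.549 | 24.7135 | 42.8129 | 42.8285 | 13.1992 |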


TrainOutput(global_step=35052, training_loss=3.291603646457406, metrics={'train_runtime': 8621.4603, 'train_samples_per_second': 16.263, 'train_steps_per_second': 4.066, 'total_flos': 4.274496466845696e+16, 'train_loss': 3.291603646457406, 'epoch': 4.0})
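The numbers in `TrainOutput` can be cross-checked against the training arguments with a little arithmetic. The sketch below assumes a single device and no gradient accumulation (neither is stated in the notebook), so that one optimizer step consumes one batch of 4 examples; the variable names are illustrative.

```python
global_step = 35052        # from TrainOutput
num_epochs = 4             # num_train_epochs
batch_size = 4             # per_device_train_batch_size
eval_steps = 6000          # eval_steps / logging_steps
train_runtime = 8621.4603  # seconds, from TrainOutput

steps_per_epoch = global_step // num_epochs

# Assumption: one device, no gradient accumulation, last partial batch kept.
# The training set size then lies in this inclusive range.
max_examples = steps_per_epoch * batch_size
min_examples = (steps_per_epoch - 1) * batch_size + 1

# Steps at which evaluation and logging are triggered.
eval_points = list(range(eval_steps, global_step + 1, eval_steps))

steps_per_second = global_step / train_runtime

print(steps_per_epoch, (min_examples, max_examples))  # 8763 (35049, 35052)
print(eval_points)                  # [6000, 12000, 18000, 24000, 30000]
print(round(steps_per_second, 3))   # 4.066
```

Under these assumptions the five evaluation points match the five rows of the training log, and the computed throughput agrees with the reported `train_steps_per_second`.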

# Evaluate Valid

In [16]:
trainer.evaluate()

{'eval_loss': 3.4380335807800293,
 'eval_rouge1': 45.8483,
 'eval_rouge2': 24.4818,
 'eval_rougeL': 42.265,
 'eval_rougeLsum': 42.2991,
 'eval_gen_len': 13.1705,
 'eval_runtime': 346.7671,
 'eval_samples_per_second': 12.637,
 'eval_steps_per_second': 3.161,
 'epoch': 4.0}
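The values returned by `trainer.evaluate()` match the step-24000 row of the training log rather than the last logged step. That is consistent with `load_best_model_at_end=True`: when `metric_for_best_model` is not set, the Trainer selects the best checkpoint by validation loss. The sketch below replays that selection on the logged values; the `eval_log` structure is illustrative, and the ROUGE-1 comparison only shows what a different selection criterion would have picked.

```python
# (step, validation loss, ROUGE-1) copied from the training log.
eval_log = [
    (6000, 3.623652, 43.6811),
    (12000, 3.527457, 45.2726),
    (18000, 3.493475, 46.0762),
    (24000, 3.438034, 45.8483),
    (30000, 3.469506, 46.5490),
]

# Lower validation loss is better; higher ROUGE-1 is better.
best_by_loss = min(eval_log, key=lambda row: row[1])
best_by_rouge1 = max(eval_log, key=lambda row: row[2])

print(best_by_loss[0], best_by_rouge1[0])  # 24000 30000
```

If selecting on ROUGE is preferred, `Seq2SeqTrainingArguments` accepts `metric_for_best_model` and `greater_is_better`; whether that yields better test results here has not been checked.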

# Save Model

In [17]:
trainer.save_model("/kaggle/working/baseline")

In [18]:
!zip -r baseline1.zip /kaggle/working/baseline

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
  adding: kaggle/working/baseline/ (stored 0%)
  adding: kaggle/working/baseline/special_tokens_map.json (deflated 52%)
  adding: kaggle/working/baseline/checkpoint-30000/ (stored 0%)
  adding: kaggle/working/baseline/checkpoint-30000/scheduler.pt (deflated 48%)
  adding: kaggle/working/baseline/checkpoint-30000/trainer_state.json (deflated 73%)
  adding: kaggle/working/baseline/checkpoint-30000/special_tokens_map.json (deflated 52%)
  adding: kaggle/working/baseline/checkpoint-30000/tokenizer_config.json (deflated 50%)
  adding: kaggle/working/baseline/checkpoint-30000/pytorch_model.bin (deflated 8%)
  adding: kaggle/working/baseline/checkpoint-30000/vocab.json (deflated 59%)
  adding: kaggle/working/baseli

# Evaluate Test

In [19]:
tester = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_data,
    eval_dataset=test_data,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [20]:
tester.evaluate()

{'eval_loss': 3.4328579902648926,
 'eval_rouge1': 46.6314,
 'eval_rouge2': 24.9313,
 'eval_rougeL': 42.7895,
 'eval_rougeLsum': 42.7886,
 'eval_gen_len': 13.1917,
 'eval_runtime': 348.2106,
 'eval_samples_per_second': 12.584,
 'eval_steps_per_second': 3.148}

# Evaluation

**Valid**

In [21]:
def generate_summary(test_samples, model):
    inputs = tokenizer(
        test_samples["document"],
        padding="max_length",
        truncation=True,
        max_length=encoder_max_length,
        return_tensors="pt",
    )
    input_ids = inputs.input_ids.to(model.device)
    attention_mask = inputs.attention_mask.to(model.device)
    outputs = model.generate(input_ids, attention_mask=attention_mask)
    output_str = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return outputs, output_str


# model_before_tuning = AutoModelForSeq2SeqLM.from_pretrained(model_name)
validation_samples = validation_data_txt.select(range(16))

# summaries_before_tuning = generate_summary(validation_samples, model_before_tuning)[1]
summaries_after_tuning = generate_summary(validation_samples, model)[1]



In [22]:
print(
    tabulate(
        zip(
            range(len(summaries_after_tuning)),
            summaries_after_tuning,
            validation_samples["summary"],
        ),
        headers=["Id", "Summary after", "Target summary"],
    )
)
print("\nTarget summaries:\n")
print(
    tabulate(list(enumerate(validation_samples["summary"])), headers=["Id", "Target summary"])
)
print("\nSource documents:\n")
print(tabulate(list(enumerate(validation_samples["document"])), headers=["Id", "Document"]))

  Id  Summary after                                                              Target summary
----  -------------------------------------------------------------------------  -----------------------------------------------------------------------------------
   0  fix error when service type is nodeport                                    fix notes.txt when service type is nodeport
   1  add rotationangle prop to react-swipeable                                  add rotationangle prop to react-swipeable
   2  set max-width on search box and selection area                             fix long text causing the search box and selections to overflow on multiple selects
   3  cherry pick #14203 to 20.7: fix issue #14202                               cherry pick #14203 to 20.6: fix issue #14202
   4  clarify null and duplicates in javadocs                                    explicit handling of null values with retainduplicates
   5  implement basedtypetests for arrowstringdtype             

**Test**

In [23]:
test_samples = test_data_txt.select(range(16))
test_summaries_after_tuning = generate_summary(test_samples, model)[1]

In [24]:
print(
    tabulate(
        zip(
            range(len(test_summaries_after_tuning)),
            test_summaries_after_tuning,
            test_samples["summary"],
        ),
        headers=["Id", "Summary predict", "Summary target"],
    )
)
# print("\nTarget summaries:\n")
# print(
#     tabulate(list(enumerate(test_samples["summary"])), headers=["Id", "Target summary"])
# )
print("\nSource documents:\n")
print(tabulate(list(enumerate(test_samples["document"])), headers=["Id", "Document"]))

  Id  Summary predict                                                                        Summary target
----  -------------------------------------------------------------------------------------  -------------------------------------------------------------------------
   0  fix relative paths in tests, attempt 1                                                 fix relative paths in tests, part 1
   1  disallow null as valid parameter in get_class()                                        get_class() disallow null parameter rfc
   2  oneclasssvm n_support returns incorrect value                                          fixed n_support_ attr for oneclasssvm and svr
   3  use gettypecheckedbody to allow lazy type-checking when emitting function definitions  add and use abstractfunctiondecl::gettypecheckedbody
   4  implicit generic "any". and specify generic parameters                                 implicit generic "any" for builtins
   5  resolves checkstyle errors for lazy-loading