This notebook illustrates how to fine-tune the GPT-2 model on scientific article data.

Import required libraries

In [17]:
from transformers import AutoModelForCausalLM, AutoTokenizer, top_k_top_p_filtering, TrainingArguments, Trainer, default_data_collator
import torch
from torch import nn
import pickle
from datasets import Dataset, Features
import math
from huggingface_hub import notebook_login

Load the articles object

In [2]:
with open("articles.obj", "rb") as f:
    articles = pickle.load(f)

Convert the articles into a dictionary of columns, the format expected when loading a Hugging Face dataset, and skip any article whose title is blank because it was not read properly.

In [3]:
# One list per column; Dataset.from_dict expects a dict of equal-length lists
articles_t = {"pmc": [], "title": [], "abstract": [], "introduction": [],
              "result": [], "discussion": [], "conclusion": []}
for pmc, sections in articles.items():
    # Skip articles whose title is blank, i.e. was not read properly
    if sections["title"].strip() == "":
        continue
    articles_t["pmc"].append(pmc)
    for sec, text in sections.items():
        articles_t[sec].append(text)
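
The conversion above can be tried on a small stand-in before running it on the full pickle. The sketch below is only an illustration: the toy ids and texts are made up, and the helper name `to_columns` and the use of `.get` with an empty-string default are assumptions that do not appear in the notebook. It also checks that every column ends up with the same length, which a dictionary of columns needs in order to load as a dataset.

```python
# Toy stand-in for the pickled `articles` object: PMC id -> section name -> text.
# The ids and texts are invented for illustration only.
SECTIONS = ["title", "abstract", "introduction", "result", "discussion", "conclusion"]

toy_articles = {
    "PMC0000001": {"title": "A study", "abstract": "abs 1", "introduction": "intro 1",
                   "result": "", "discussion": "", "conclusion": ""},
    "PMC0000002": {"title": "   ", "abstract": "abs 2", "introduction": "intro 2",
                   "result": "", "discussion": "", "conclusion": ""},  # blank title
    "PMC0000003": {"title": "Another study", "abstract": "abs 3", "introduction": "",
                   "result": "res 3", "discussion": "", "conclusion": ""},
}


def to_columns(articles):
    """Turn {id: {section: text}} into {column: [values]} and skip blank titles."""
    columns = {"pmc": []}
    columns.update({sec: [] for sec in SECTIONS})
    for pmc, sections in articles.items():
        if sections["title"].strip() == "":
            continue
        columns["pmc"].append(pmc)
        for sec in SECTIONS:
            # Assumption: a missing section is stored as an empty string,
            # which keeps all columns the same length.
            columns[sec].append(sections.get(sec, ""))
    return columns


toy_columns = to_columns(toy_articles)
column_lengths = {name: len(values) for name, values in toy_columns.items()}
print(column_lengths)
```

Iterating over a fixed section list, rather than over whatever keys each article happens to have, is one way to guard against ragged columns if an article is missing a section.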

Convert the articles into a Hugging Face dataset and ensure it loaded correctly

In [4]:
articles_dataset = Dataset.from_dict(articles_t)
print(articles_dataset)

Dataset({
    features: ['pmc', 'title', 'abstract', 'introduction', 'result', 'discussion', 'conclusion'],
    num_rows: 7020
})


For ease of processing later, keep only the articles that have an introduction, then split them into a training set (80%) and a test set (20%).

In [5]:
intros_dataset = articles_dataset.filter(lambda example: example["introduction"]!="")
intros_dataset = intros_dataset.train_test_split(train_size=0.8)
print(intros_dataset)

DatasetDict({
    train: Dataset({
        features: ['pmc', 'title', 'abstract', 'introduction', 'result', 'discussion', 'conclusion'],
        num_rows: 4284
    })
    test: Dataset({
        features: ['pmc', 'title', 'abstract', 'introduction', 'result', 'discussion', 'conclusion'],
        num_rows: 1071
    })
})
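
The split above is drawn without a fixed random seed, so rerunning the cell assigns articles to the training and test sets differently each time. The plain-Python sketch below illustrates the idea of a seeded 80/20 split and reproduces the row counts printed above (4284 + 1071 = 5355 rows). It is not the datasets library's implementation: the helper `split_indices`, the rounding rule, and the seed value 42 are assumptions made for illustration. `train_test_split` itself accepts a `seed` argument for the same purpose.

```python
import random


def split_indices(n_rows, train_size=0.8, seed=42):
    """Shuffle row indices with a fixed seed and cut them into train/test parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Assumption: round to the nearest whole row; the library may round differently.
    n_train = round(n_rows * train_size)
    return indices[:n_train], indices[n_train:]


n_rows = 4284 + 1071  # rows with a non-empty introduction, from the output above
train_idx, test_idx = split_indices(n_rows)
print(len(train_idx), len(test_idx))

# The same seed yields the same split on every run.
train_again, test_again = split_indices(n_rows)
print(train_idx == train_again and test_idx == test_again)
```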


Load the GPT-2 tokenizer and model. GPT-2 does not define a padding token, so set it to the end-of-text token.

In [6]:
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

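Define a function that tokenizes each title and introduction as a pair, with the title serving as the prompt and the introduction as the text that follows. Tokenization converts the text into the integer token IDs that the model takes as input. Only the introduction is truncated, and every example is padded to 1024 tokens.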

In [7]:
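def intros_preprocess_function(examples):
    # Encode (title, introduction) as one sequence; truncate only the introduction
    inputs = tokenizer(examples["title"], examples["introduction"], max_length=1024,
                       truncation="only_second", padding="max_length")
    # For causal language modeling the labels are the inputs; the model shifts them internally
    inputs["labels"] = inputs["input_ids"].copy()
    return inputs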
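
To make the effect of truncating only the second segment and padding to a fixed length concrete, here is a minimal sketch on plain lists of token IDs. It is not the tokenizer's implementation: the helper `pack_pair` and the toy IDs are hypothetical, and it assumes that no special tokens are inserted between the title and the introduction. The padding ID 50256 is GPT-2's end-of-text token, matching the padding choice made earlier.

```python
MAX_LENGTH = 1024   # same limit as in the preprocessing function
PAD_ID = 50256      # GPT-2 end-of-text token, reused here as padding


def pack_pair(title_ids, intro_ids, pad_id=PAD_ID, max_length=MAX_LENGTH):
    """Keep the title whole, truncate the introduction, then pad to max_length."""
    room = max_length - len(title_ids)
    if room < 0:
        raise ValueError("title alone exceeds max_length")
    ids = title_ids + intro_ids[:room]          # only the second segment is cut
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,  # 0 marks padding
    }


title_ids = [11, 12, 13]                        # toy IDs, not real tokens
long_intro = list(range(100, 2100))             # 2000 tokens, too long to fit
short_intro = [200, 201, 202, 203, 204]

long_example = pack_pair(title_ids, long_intro)
short_example = pack_pair(title_ids, short_intro)
print(len(long_example["input_ids"]), sum(long_example["attention_mask"]))
print(len(short_example["input_ids"]), sum(short_example["attention_mask"]))
```

Padding every example to the full 1024 tokens keeps all tensors the same shape, at the cost of spending computation on padding for short introductions.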

Tokenize the introductions and check that the result has the expected columns.

In [8]:
tokenized_intros = intros_dataset.map(intros_preprocess_function, num_proc=3,  
                                     remove_columns=intros_dataset['train'].column_names)

In [9]:
print(tokenized_intros)

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 4284
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1071
    })
})
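
One caveat about the labels: because they are a straight copy of the input IDs, the padded positions also carry labels and therefore contribute to the loss. A common alternative, not used in this notebook, is to set the labels at padded positions to -100, the value that PyTorch's cross-entropy loss ignores. The sketch below shows that variant on plain lists; the helper `mask_padding_labels` is hypothetical. Since the padding token here is the same as the end-of-text token, the mask is taken from the attention mask rather than from the token ID. Applying it would change the losses and perplexity reported in this notebook, which were obtained with the unmasked labels.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss


def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [
        token if keep == 1 else IGNORE_INDEX
        for token, keep in zip(input_ids, attention_mask)
    ]


# Toy example: five real tokens (the last is a genuine end-of-text token)
# followed by three padding positions that reuse the same ID.
toy_input_ids = [11, 12, 13, 14, 50256, 50256, 50256, 50256]
toy_attention_mask = [1, 1, 1, 1, 1, 0, 0, 0]

toy_labels = mask_padding_labels(toy_input_ids, toy_attention_mask)
print(toy_labels)
```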


Set the training arguments. Most values follow the defaults used in the documentation; the batch sizes are lowered because of technical limitations.

In [10]:
data_collator = default_data_collator
training_args = TrainingArguments(
    output_dir='./gpt2-finetuned-scientific-articles',
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=2,
    weight_decay=0.01,
)

Set up the trainer for fine-tuning and run the training.

In [11]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_intros['train'],
    eval_dataset=tokenized_intros['test'],
    data_collator=data_collator,
    tokenizer=tokenizer
)

In [12]:
trainer.train()

***** Running training *****
  Num examples = 4284
  Num Epochs = 2
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 1
  Total optimization steps = 2142


Epoch,Training Loss,Validation Loss
1,2.5293,2.389173
2,2.4821,2.379318


Saving model checkpoint to ./gpt2-finetuned-scientific-articles/checkpoint-500
Configuration saved in ./gpt2-finetuned-scientific-articles/checkpoint-500/config.json
Model weights saved in ./gpt2-finetuned-scientific-articles/checkpoint-500/pytorch_model.bin
tokenizer config file saved in ./gpt2-finetuned-scientific-articles/checkpoint-500/tokenizer_config.json
Special tokens file saved in ./gpt2-finetuned-scientific-articles/checkpoint-500/special_tokens_map.json
Saving model checkpoint to ./gpt2-finetuned-scientific-articles/checkpoint-1000
Configuration saved in ./gpt2-finetuned-scientific-articles/checkpoint-1000/config.json
Model weights saved in ./gpt2-finetuned-scientific-articles/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in ./gpt2-finetuned-scientific-articles/checkpoint-1000/tokenizer_config.json
Special tokens file saved in ./gpt2-finetuned-scientific-articles/checkpoint-1000/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 1071
  Ba

TrainOutput(global_step=2142, training_loss=2.5094094583634234, metrics={'train_runtime': 56480.4096, 'train_samples_per_second': 0.152, 'train_steps_per_second': 0.038, 'total_flos': 4477500260352000.0, 'train_loss': 2.5094094583634234, 'epoch': 2.0})
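
As a sanity check, the figures in the training log can be reproduced from the settings with a few lines of arithmetic. The sketch below assumes a single device and no gradient accumulation, as the log indicates, and uses only numbers printed above.

```python
import math

num_examples = 4284          # training rows
batch_size = 4               # per_device_train_batch_size, single device
num_epochs = 2
train_runtime = 56480.4096   # seconds, from TrainOutput

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
samples_per_second = num_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime
runtime_hours = train_runtime / 3600

print(f"steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
print(f"samples/s: {samples_per_second:.3f}, steps/s: {steps_per_second:.3f}")
print(f"runtime: {runtime_hours:.1f} hours")
```

The totals agree with the log: 2142 optimization steps, about 0.152 samples per second, and about 0.038 steps per second, which is roughly 15.7 hours for the two epochs.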

Compute the perplexity of the fine-tuned model on the test set, i.e. the exponential of the evaluation loss.

In [15]:
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity: 10.80
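
Perplexity is the exponential of the mean cross-entropy loss per token, so it can be recomputed directly from the validation losses in the training table. The short sketch below does that for both epochs; the dictionary name `validation_losses` is only illustrative.

```python
import math

# Validation loss per epoch, copied from the training table above.
validation_losses = {1: 2.389173, 2: 2.379318}

perplexities = {epoch: math.exp(loss) for epoch, loss in validation_losses.items()}
for epoch, ppl in perplexities.items():
    print(f"epoch {epoch}: perplexity = {ppl:.2f}")
```

The value for epoch 2 matches the perplexity of 10.80 printed above, which suggests the final evaluation used the same test set and model state as the epoch-2 validation pass.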


Log in to the Hugging Face Hub and push the model to the repository.

In [56]:
notebook_login()

Login successful
Your token has been saved to /home/shariq/.huggingface/token


In [58]:
#trainer.args.hub_model_id = "ssmadha/gpt2-finetuned-scientific-articles"
trainer.push_to_hub()

Saving model checkpoint to ./gpt2-finetuned-scientific-articles
Configuration saved in ./gpt2-finetuned-scientific-articles/config.json
Model weights saved in ./gpt2-finetuned-scientific-articles/pytorch_model.bin
tokenizer config file saved in ./gpt2-finetuned-scientific-articles/tokenizer_config.json
Special tokens file saved in ./gpt2-finetuned-scientific-articles/special_tokens_map.json
Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}}
