# Code to pretrain GPT2 on chemistry paper titles
Pretrain GPT2 to use as the decoder component of a vision encoder-decoder for 'this JACS does not exist'

Check GPU memory usage with `!nvidia-smi`. If running in a notebook, the kernel needs to be restarted to clear the memory.

In [None]:
!nvidia-smi
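
When the code runs as a script rather than in a notebook, the same check can be made from Python. The following is only a sketch and not part of the original notebook: `gpu_memory_report` is a hypothetical helper that shells out to `nvidia-smi` with its query flags and returns `None` when the tool is missing or fails.

```python
import shutil
import subprocess


def gpu_memory_report():
    """Return nvidia-smi's per-GPU memory usage as text, or None if unavailable."""
    # Hypothetical helper, not part of the original notebook.
    if shutil.which("nvidia-smi") is None:
        return None  # nvidia-smi is not on the PATH
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.used,memory.total", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout.strip() if result.returncode == 0 else None


print(gpu_memory_report())
```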

Initialise the training dataframes from a CSV file

In [None]:
import torch
import pandas as pd
from ast import literal_eval
import os

# Load the csv
headings = ['title', 'file_name', 'Abstract']
df = pd.read_csv('data.csv', names = headings)

# Drop null values
df = df.mask(df.eq('None')).dropna()

# Data was written as byte strings of the form b'string' - evaluate in python then decode to utf-8
df['title'] = df['title'].map(lambda x: literal_eval(x).decode('utf-8'))

# Make train-test split
from sklearn.model_selection import train_test_split

# Hold out 10% for evaluation, consistent with the sample counts printed further down
train_df, test_df = train_test_split(df, test_size=0.1, random_state=42)
train_df.reset_index(drop=True, inplace=True)
test_df.reset_index(drop=True, inplace=True)
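
The byte-string decoding step above is easy to misread, so here is a small standalone sketch of what it does to one value. The helper name `decode_title` and the example string are made up for illustration and do not come from the dataset.

```python
from ast import literal_eval


def decode_title(raw):
    """Turn the text "b'...'" into the str it represents."""
    value = literal_eval(raw)  # safely evaluates the literal to a bytes object
    return value.decode("utf-8")


# Made-up example value; \xe2\x80\x93 is the UTF-8 encoding of an en dash.
raw = "b'W\\xe2\\x80\\x93H Bonds'"
print(decode_title(raw))
```

`literal_eval` only accepts Python literals, so unlike `eval` it cannot execute arbitrary code stored in the CSV.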

Draw a histogram of the tokenized title lengths to choose the max length cutoff (for this dataset 128 was plenty). This cell uses `text_processor`, which is defined in the next cell, so run that cell first.

In [3]:
df['title'].map(lambda x: len(text_processor(x)['input_ids'])).hist(bins=100)
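
Instead of reading the cutoff off the histogram by eye, it can also be computed from the list of lengths. This is a minimal sketch under assumptions: `length_cutoff` is a hypothetical helper, the 99.9% default coverage is an arbitrary choice, and the toy lengths stand in for the real token counts.

```python
import math


def length_cutoff(lengths, coverage=0.999):
    """Smallest observed length that covers at least `coverage` of the samples."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths)
    # Index of the sample sitting at the requested coverage quantile.
    index = max(0, math.ceil(coverage * len(ordered)) - 1)
    return ordered[index]


# Toy token counts standing in for len(text_processor(title)["input_ids"]).
toy_lengths = [12, 18, 25, 31, 40, 22, 19, 27, 35, 96]
print(length_cutoff(toy_lengths, coverage=0.9))  # covers 9 of the 10 toy titles
print(length_cutoff(toy_lengths, coverage=1.0))  # covers every toy title
```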

GPT2 has no padding token by default, only `<|endoftext|>`. We add a dedicated padding token to the tokenizer so that the model can distinguish padding from end-of-sequence (EOS) tokens and learn where a sequence ends (i.e. learn that titles should not be too long).

**Important:** The pretrained model's token embedding matrix must be resized to match the new tokenizer size: use `model.resize_token_embeddings(len(tokenizer))`

In [4]:
from transformers import GPT2Tokenizer

text_processor = GPT2Tokenizer.from_pretrained("gpt2", pad_token="<|pad|>")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Write a torch dataset that loads our text dataset. 

We append an EOS token to each title, then pad up to the max length with the padding token defined above.

In [1]:
from torch.utils.data import Dataset

class TextDataset(Dataset):
    def __init__(self, df, text_processor, max_target_len=128):
        self.df = df
        self.text_processor = text_processor
        self.max_target_len = max_target_len

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        # Add the EOS token (positional lookup, so any dataframe index works)
        title = self.df['title'].iloc[idx] + self.text_processor.eos_token
        text = self.text_processor(
            title,
            padding="max_length",
            max_length=self.max_target_len,
            truncation=True, # Need to set this as it does not auto-truncate
            ).input_ids

        return torch.tensor(text, dtype=torch.long)
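
To make the EOS-then-pad layout concrete, here is a sketch on plain lists of token ids, with no tokenizer involved. `eos_then_pad` is a hypothetical helper; the ids 50256 and 50257 are the EOS and padding ids used in this notebook, and right-side truncation and padding are assumed.

```python
EOS_ID = 50256  # "<|endoftext|>" in the GPT2 vocabulary
PAD_ID = 50257  # the "<|pad|>" token added to the tokenizer above


def eos_then_pad(token_ids, max_len, eos_id=EOS_ID, pad_id=PAD_ID):
    """Append EOS, truncate to max_len, then right-pad with pad_id."""
    ids = list(token_ids) + [eos_id]
    ids = ids[:max_len]  # truncation drops the EOS when the title is too long
    return ids + [pad_id] * (max_len - len(ids))


print(eos_then_pad([77, 39310, 46582], max_len=6))
print(eos_then_pad([77, 39310, 46582, 9, 4225, 4658], max_len=6))
```

The second call shows why the cutoff should be generous: a title that fills the whole window loses its EOS token to truncation.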

Initialize our datasets

In [11]:
train_dataset = TextDataset(
    df=train_df,
    text_processor=text_processor,
    max_target_len=128)

eval_dataset = TextDataset(
    df=test_df,
    text_processor=text_processor,
    max_target_len=128)

print("Number of training samples:", len(train_dataset))
print("Number of eval samples:", len(eval_dataset))

Number of training samples: 57452
Number of eval samples: 6384


Check tokenization works properly - note the EOS token 50256 followed by multiple 50257 padding tokens

In [12]:
train_dataset[0]

tensor([   77, 39310, 46582,     9,  4225,  4658,  3401,  5039,   262,  3167,
         4754,   485, 33396, 32480,   286,  4551,   312,   342,  2101,  1134,
          316,   404,  9346, 15742, 50256, 50257, 50257, 50257, 50257, 50257,
        50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257,
        50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257,
        50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257,
        50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257,
        50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257,
        50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257,
        50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257,
        50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257,
        50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257,
        50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257])

Initialise a data collator

In [13]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
      tokenizer=text_processor, mlm=False,
)
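
With `mlm=False`, the collator builds the labels by copying the input ids and replacing padding positions with -100, the value the loss ignores. The sketch below imitates that masking on a plain list to show why a separate padding token matters. `causal_lm_labels` is a hypothetical stand-in, not the collator's actual code.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def causal_lm_labels(input_ids, pad_id):
    """Copy input_ids as labels, hiding padding positions from the loss."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in input_ids]


example = [316, 404, 9346, 15742, 50256, 50257, 50257]
print(causal_lm_labels(example, pad_id=50257))  # EOS (50256) stays a training target
print(causal_lm_labels(example, pad_id=50256))  # reusing EOS as padding would hide it
```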


Initialise the parameters for the Huggingface transformers trainer

In [14]:
from transformers import Trainer, TrainingArguments, AutoModelForCausalLM

# Load a pre-trained vanilla GPT2 model
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Resize model token embeddings to account for extra padding token
model.resize_token_embeddings(len(text_processor))

training_args = TrainingArguments(
    output_dir="./<OUTPUT-DIR>", # output directory
    overwrite_output_dir=True, # overwrite the content of the output directory
    num_train_epochs=800, # number of training epochs
    per_device_train_batch_size=16, # batch size for training
    per_device_eval_batch_size=16, # batch size for evaluation
    evaluation_strategy='steps', # when to evaluate (newer transformers releases name this argument eval_strategy)
    eval_steps=1000, # number of steps between two evaluations
    save_steps=10000, # number of steps between model saves 
    save_total_limit=3, # maximum number of models to keep (oldest ones are removed)
    warmup_steps=500, # number of warmup steps for learning rate scheduler
    prediction_loss_only=True,
    )

# Pass arguments to trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
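
The step-based settings are easier to judge once converted to epochs. The arithmetic below is a rough sketch that assumes a single device and no gradient accumulation, neither of which is set explicitly in the arguments above.

```python
import math

train_samples = 57452  # training set size printed earlier
batch_size = 16        # per_device_train_batch_size
eval_steps = 1000
save_steps = 10000

# Assumes one device and no gradient accumulation.
steps_per_epoch = math.ceil(train_samples / batch_size)
print("optimizer steps per epoch:", steps_per_epoch)
print("epochs between evaluations:", round(eval_steps / steps_per_epoch, 2))
print("epochs between checkpoints:", round(save_steps / steps_per_epoch, 2))
```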

Now run the trainer!

**Note:** this is better run as a Python script than in a notebook, because the cell output can become very long and crash the notebook.

In [None]:
trainer.train()

Save the model to a local folder

In [None]:
trainer.save_model("./gpt2-pretrain")

Saving model checkpoint to ./gpt2-pretrain
Configuration saved in ./gpt2-pretrain\config.json
Model weights saved in ./gpt2-pretrain\pytorch_model.bin


Try loading a pretrained model

In [None]:
# Raw string so the backslash in the Windows path is not treated as an escape sequence
model = AutoModelForCausalLM.from_pretrained(r"gpt2-pretrain3\checkpoint-20000")

Write a generate function (can also use the transformers `model.generate()` function)

In [19]:
from tqdm import tqdm, trange
import torch.nn.functional as F

def generate(
    model,
    tokenizer,
    prompt,
    entry_count=10,
    entry_length=30, # maximum number of tokens to generate
    top_p=0.8,
    temperature=1.,
):
    model.eval()
    generated_num = 0
    generated_list = []

    filter_value = -float("Inf")

    with torch.no_grad():

        for entry_idx in trange(entry_count):

            entry_finished = False
            generated = torch.tensor(tokenizer.encode(prompt)).unsqueeze(0)

            for i in range(entry_length):
                # Only the logits are needed, so no labels are passed and no loss is computed
                logits = model(generated).logits
                logits = logits[:, -1, :] / (temperature if temperature > 0 else 1.0)

                sorted_logits, sorted_indices = torch.sort(logits, descending=True)
                cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)

                sorted_indices_to_remove = cumulative_probs > top_p
                sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[
                    ..., :-1
                ].clone()
                sorted_indices_to_remove[..., 0] = 0

                indices_to_remove = sorted_indices[sorted_indices_to_remove]
                logits[:, indices_to_remove] = filter_value

                next_token = torch.multinomial(F.softmax(logits, dim=-1), num_samples=1)
                generated = torch.cat((generated, next_token), dim=1)

                if next_token.item() == tokenizer.eos_token_id:
                    entry_finished = True

                if entry_finished:

                    generated_num = generated_num + 1

                    output_list = list(generated.squeeze().numpy())
                    output_text = tokenizer.decode(output_list)
                    generated_list.append(output_text)
                    break
            
            if not entry_finished:
                output_list = list(generated.squeeze().numpy())
                output_text = f"{tokenizer.decode(output_list)}<|endoftext|>"
                generated_list.append(output_text)
                
    return generated_list

# Generate one title per prompt. test_data should be a dataframe with a 'title' column
# and a 0..n-1 index; the global `model` and `text_processor` are used.
def text_generation(test_data):
    generated_titles = []
    for i in range(len(test_data)):
        x = generate(model.to('cpu'), text_processor, test_data['title'][i], entry_count=1)
        generated_titles.append(x)
    return generated_titles
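
The nucleus (top-p) filtering step inside `generate` can be easier to follow on a plain list of probabilities. The sketch below mirrors the same logic without torch: sort in descending order, take the cumulative sum, mark everything past the threshold, then shift the marks right by one so the token that crosses the threshold is still kept. `top_p_keep` is a hypothetical helper and the probabilities are made up.

```python
def top_p_keep(probs, top_p=0.8):
    """Return the indices of the tokens that survive top-p (nucleus) filtering."""
    # Token indices ordered from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Mark positions whose cumulative probability exceeds top_p.
    cumulative = 0.0
    remove = []
    for i in order:
        cumulative += probs[i]
        remove.append(cumulative > top_p)

    # Shift right by one so the first token crossing the threshold is kept,
    # which also guarantees that the most probable token is never removed.
    remove = [False] + remove[:-1]
    return sorted(i for i, drop in zip(order, remove) if not drop)


probs = [0.10, 0.50, 0.15, 0.25]
print(top_p_keep(probs, top_p=0.8))  # the 0.10 token is dropped
```

In `generate`, the removed tokens have their logits set to negative infinity, so the softmax that follows renormalises over the kept tokens only.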

Test the model by prompting it with the first 4 words of 10 randomly sampled test set titles

In [20]:
test_set = test_df['title'].sample(n=10)
test_set_prompt = test_set.str.split().str[:4].str.join(' ')
test_set_prompt = test_set_prompt.reset_index()
test_set_prompt

#Run the functions to generate the titles
generated_titles = text_generation(test_set_prompt)

100%|██████████| 1/1 [00:02<00:00,  2.61s/it]
100%|██████████| 1/1 [00:02<00:00,  2.46s/it]
100%|██████████| 1/1 [00:02<00:00,  2.06s/it]
100%|██████████| 1/1 [00:01<00:00,  1.02s/it]
100%|██████████| 1/1 [00:00<00:00,  1.33it/s]
100%|██████████| 1/1 [00:02<00:00,  2.49s/it]
100%|██████████| 1/1 [00:01<00:00,  1.12s/it]
100%|██████████| 1/1 [00:01<00:00,  1.45s/it]
100%|██████████| 1/1 [00:02<00:00,  2.75s/it]
100%|██████████| 1/1 [00:02<00:00,  2.09s/it]
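
For reference, the pandas chain `.str.split().str[:4].str.join(' ')` used to build the prompts is equivalent to the following plain-Python operation on a single title. `first_words` is a hypothetical helper name.

```python
def first_words(title, n=4):
    """Return the first n whitespace-separated words of a title."""
    return " ".join(title.split()[:n])


title = "Tandem Cyclopropanation/Ring-Closing Metathesis of Dienynes"
print(first_words(title))
```

Words are split on whitespace only, so a hyphenated or slash-joined term counts as one word.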


Compare generated titles to original titles

In [21]:
for i in generated_titles:
    print(i)

['Origin of Dark-Channel X-ray Absorption Fine Structure of (NH3)4Ca(OH)5 in Supercritical Carbon Dioxide<|endoftext|>']
['Comprehensive Thermochemistry of W–H and H–H Bonds in the Lanthanide Phosphate (Ln5Me4) System<|endoftext|>']
['Fragmentation Energetics of Clusters Based on Cluster Modification:\u2009 Assignment of the Concentration-Dependent Rate Constant<|endoftext|>']
['Transient Photoconductivity of Acceptor-Substituted Layered Zirconium Oxides<|endoftext|>']
['Palladium-Catalyzed Aerobic Oxidative Cyclization of Unactivated Alkenes<|endoftext|>']
['Mild Aerobic Oxidative Palladium(II)-Catalyzed Arylation of Indoles:\u2009 Access to Chiral Olefins<|endoftext|>']
['A Pentacoordinate Boron-Containing π-Electron System for High-Performance Polymer Solar Cells<|endoftext|>']
['Ferroelectric Alkylamide-Substituted Helicene Derivative: Synthesis, Characterization, and Redox Properties<|endoftext|>']
['Tandem Cyclopropanation/Ring-Closing Metathesis of Cyclohexadienes: Convergent Ac

In [22]:
for i in test_set:
    print(i)

Origin of Dark-Channel X-ray Fluorescence from Transition-Metal Ions in Water
Comprehensive Thermochemistry of W–H Bonding in the Metal Hydrides CpW(CO)2(IMes)H, [CpW(CO)2(IMes)H]•+, and [CpW(CO)2(IMes)(H)2]+. Influence of an N-Heterocyclic Carbene Ligand on Metal Hydride Bond Energies
Fragmentation Energetics of Clusters Relevant to Atmospheric New Particle Formation
Transient Photoconductivity of Acceptor-Substituted Poly(3-butylthiophene)
Palladium-Catalyzed Aerobic Oxidative Cyclization of N-Aryl Imines: Indole Synthesis from Anilines and Ketones
Mild Aerobic Oxidative Palladium (II) Catalyzed C−H Bond Functionalization:  Regioselective and Switchable C−H Alkenylation and Annulation of Pyrroles
A Pentacoordinate Boron-Containing π-Electron System with Cl–B–Cl Three-Center Four-Electron Bonds
Ferroelectric Alkylamide-Substituted Helicene Derivative with Two-Dimensional Hydrogen-Bonding Lamellar Phase
Tandem Cyclopropanation/Ring-Closing Metathesis of Dienynes
Cyclic Penta-Twinned Rh