# Galactica Samples Generation


This notebook runs the main script for generating the Galactica paper samples. It was run on several machines, each processing the slice of titles selected by the `START` and `END` variables. These indices select rows from the arXiv input paper samples; for each selected title, the model generates an abstract, an introduction, and a conclusion. After 5k samples had been generated, the per-machine outputs were concatenated and later filtered to serve as input for the classifiers.

## Set up

Define which titles to use for generation:

In [None]:
START = 0 #inclusive
END = 500 #exclusive
BATCH_SIZE = 2 # save after BATCH_SIZE inputs.

assert (END - START) % BATCH_SIZE == 0
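
As a rough illustration of how these three values interact, the sketch below splits the half-open range `[START, END)` into consecutive batches. The helper name `batch_ranges` is hypothetical and is not part of the notebook; it only mirrors the divisibility check above.

```python
from typing import List, Tuple


def batch_ranges(start: int, end: int, batch_size: int) -> List[Tuple[int, int]]:
    """Split the half-open range [start, end) into consecutive (lo, hi) batches."""
    if (end - start) % batch_size != 0:
        raise ValueError("range length must be a multiple of batch_size")
    return [(lo, lo + batch_size) for lo in range(start, end, batch_size)]


ranges = batch_ranges(0, 500, 2)
print(len(ranges))  # number of batches for the default settings
print(ranges[:2])   # first two (inclusive, exclusive) index pairs
```

Each pair corresponds to one save point, since results are written to disk after every `BATCH_SIZE` titles.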

Set up Colab by specifying the Drive folder where `galactica.csv` is stored:

In [None]:
COLAB_BASE_DIR = '/gdrive/MyDrive/colab/xai'

Mount drive folder:

In [None]:
from google.colab import drive
drive.mount('/gdrive', force_remount=True)

Mounted at /gdrive


In [None]:
!ls

Import libraries:

In [None]:
!pip install transformers
!pip install accelerate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.11.1 tokenizers-0.13.2 transformers-4.25.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting accelerate
  Downloading accelerate-0.15.0-py3-none-any.whl (191 kB)

In [None]:
import numpy as np
import csv
import pandas as pd
from shutil import copyfile
import os
import math
from tqdm import tqdm

Download model:

In [None]:
from transformers import AutoTokenizer, OPTForCausalLM
import re
import pandas as pd
import torch

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-1.3b")
model = OPTForCausalLM.from_pretrained("facebook/galactica-1.3b", device_map="auto")

Copy input and output files from Drive:

In [None]:
copyfile(COLAB_BASE_DIR + "/galactica.csv", "./galactica.csv")
output_file = f"galactica_{START}_{END}.csv"
if os.path.isfile(COLAB_BASE_DIR + output_file):
  copyfile(COLAB_BASE_DIR + output_file, output_file)
else:
  with open(output_file,'w') as csvfile:
      writer = csv.DictWriter(csvfile, fieldnames=['title', 'abstract', 'introduction', 'conclusion'])
      writer.writeheader()

## Generation

For every section, make sure that the tokenizer uses the correct maximum input length, pads on the left, and uses pad token id 1, whose token string is "\<pad\>".
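
To make the padding convention concrete, here is a toy sketch in plain Python that mimics what left padding to a fixed length does to a list of token ids. It does not call the real tokenizer; `left_pad` and the example ids are made up for illustration, and only the pad id of 1 comes from the text above.

```python
from typing import List

PAD_TOKEN_ID = 1  # pad id stated above; the matching token string is "<pad>"


def left_pad(token_ids: List[int], max_length: int, pad_id: int = PAD_TOKEN_ID) -> List[int]:
    """Prepend pad ids so the prompt tokens end up at the right edge."""
    if len(token_ids) > max_length:
        raise ValueError("prompt is longer than max_length")
    return [pad_id] * (max_length - len(token_ids)) + token_ids


padded = left_pad([101, 102, 103], max_length=8)  # ids are arbitrary placeholders
print(padded)
```

With a decoder-only model, padding on the left keeps the prompt tokens directly adjacent to the newly generated tokens.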

In [None]:
def get_abstract(title: str, tokenizer = tokenizer, model = model, debug: bool = False) -> str:
  """
  Returns an abstract generation based on the given title.
  This method transforms the given title to the prompt "Title: $title\n\n".

  Generated abstracts are accepted when:
    the abstract text follows the string "Abstract:" and ends before EOS token.
  Generated abstracts are rejected when:
    the generation doesn't have an EOS token.
  Accepted generations are checked for a dot ending and having extra section titles.

  Parameters
  ----------
  title : str

  tokenizer: transformers tokenizer
      a transformers tokenizer. By default pretrained 
      AutoTokenizer of facebook/galactica-1.3b.

  model: OPTForCausalLM
      an OPTForCausalLM. By default pretrained 
      OPTForCausalLM of facebook/galactica-1.3b.

  debug: bool
      a boolean indicator for printing the matched generation. If True,
      prints an abstract that matches the generation expected quality.
  """
  tokenizer.pad_token_id = 1
  tokenizer.padding_side = 'left'
  tokenizer.model_max_length = 64
  input_text = f"Title: {title}\n\n"
  input_ids = tokenizer(input_text, padding='max_length', return_tensors="pt").input_ids.to("cuda")

  found_match = False

  while (not found_match): # keep asking for generation, if not good enough.
    out = model.generate(input_ids, max_new_tokens=512,
                            do_sample=True)
    # strip only the leading "<pad>" tokens; str.lstrip('<pad>') would strip a character set.
    text = re.sub(r"^(?:<pad>)+", "", tokenizer.decode(out[0]))

    ret = ""

    found_match = re.search(r"</s>", text) is not None
    # get the string after the context 'Abstract:' and before the EOS token.
    found_list = re.findall(r"Abstract:([\s\S]*)</s>", text)
    if (len(found_list) == 0):
      found_match = False
    else:
      ret = found_list[0]

    if (debug and found_match):
      print(f"matched abstract text {text}")
      print(f"MATCHED!\n")

  ret = ret.replace('\n', ' ').strip()
  # skip abstract string if present in the abstract itself. Other cleaning methods are used
  # later in the filtering notebook as well.
  if (ret[0:len("Abstract")] == "Abstract"):
    ret = ret[len("Abstract"):]

  if(ret != ""):
    # add dot if no dot in the end.
    if (ret[-1] != "."):
      ret = ret + "."

  return ret
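
The acceptance rule in `get_abstract` can be tried without loading the model. The sketch below applies the same "Abstract: ... EOS" pattern to hand-written strings; `extract_abstract` and both sample strings are hypothetical stand-ins for decoded model output, not part of the notebook.

```python
import re
from typing import Optional

ABSTRACT_PATTERN = re.compile(r"Abstract:([\s\S]*)</s>")


def extract_abstract(decoded: str) -> Optional[str]:
    """Return the text between "Abstract:" and the last "</s>", or None if rejected."""
    decoded = re.sub(r"^(?:<pad>)+", "", decoded)  # drop left padding only
    match = ABSTRACT_PATTERN.search(decoded)
    if match is None:
        return None  # no EOS token after "Abstract:", so the caller would resample
    text = match.group(1).replace("\n", " ").strip()
    if text and not text.endswith("."):
        text += "."  # same dot-ending rule as above
    return text


accepted = "<pad><pad>Title: A toy title\n\nAbstract: We study a toy problem</s>"
rejected = "<pad>Title: A toy title\n\nAbstract: We study a toy"  # generation was cut off
print(extract_abstract(accepted))
print(extract_abstract(rejected))
```

A `None` result corresponds to the case where the loop above asks the model for another sample.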

In [None]:
def get_introduction(title, abstract, tokenizer = tokenizer, model = model, debug = False):
  """
  Returns an introduction generation based on the given title and abstract.
  This method transforms the given title and abstract 
  to the prompt "Title: $title\n\nAbstract: $abstract\n\nIntroduction:\n".

  Generated introductions are accepted when:
    first try: the introduction text follows the string "Introduction:" and ends before a new section header.
    or second try: the introduction text follows the string "Introduction:" and ends before an EOS token.
  Generated introductions are rejected when:
    they don't follow any of the acceptance criteria.
    or they are less than 40 tokens.
  Accepted generations are checked for a dot ending and having extra section titles.

  Parameters
  ----------
  title : str

  abstract : str

  tokenizer: transformers tokenizer
      a transformers tokenizer. By default pretrained 
      AutoTokenizer of facebook/galactica-1.3b.

  model: OPTForCausalLM
      an OPTForCausalLM. By default pretrained 
      OPTForCausalLM of facebook/galactica-1.3b.

  debug: bool
      a boolean indicator for printing the matched generation. If True,
      prints an introduction that matches the generation expected quality.
  """
  tokenizer.pad_token_id = 1
  tokenizer.padding_side = 'left'
  tokenizer.model_max_length = 64 + 512
  input_text = f"""
  Title: {title}

  Abstract: {abstract}

  Introduction:
  """
  found_match = False
  input_ids = tokenizer(input_text, padding='max_length', return_tensors="pt").input_ids.to("cuda")

  while (not found_match):
    out = model.generate(input_ids, max_new_tokens=1024,
                            temperature=0.7,
                            top_k=25,
                            top_p=0.9,
                            no_repeat_ngram_size=10,
                            early_stopping=True,
                            do_sample=True)
    
    # strip only the leading "<pad>" tokens; str.lstrip('<pad>') would strip a character set.
    text = re.sub(r"^(?:<pad>)+", "", tokenizer.decode(out[0]))
    ret = ""
    found_match = re.search(r"</s>", text) is not None # by default no match if no EOS token.
    # first try: find the intro by taking the paragraph under the section header until a new section header.
    found_list = re.findall(r"Introduction:([\s\S]*?)[A-Z].+:\n+", text)

    # if no match, check if there is complete introduction without any extra sections after.
    # second try: when no second header is present, find the paragraph which is under intro header,
    # till EOS token. 
    if (len(found_list) == 0):
      found_list = re.findall(r"Introduction:([\s\S]*)<\/s>", text)
    else: # if there is a match but no EOS token, take intro. nevertheless.
      found_match = True

    # if no match at all, try again.
    if (len(found_list) == 0):
      found_match = False
    else: # if there is a match but too small, skip.
      ret = found_list[0].replace('\n', ' ').strip()
      # ignore small match sizes due to regex limitations or other reasons.
      if (len(tokenizer.encode(ret)) < 40):
        found_match = False
      else:
        found_match = True

    if (debug and found_match):
      print(f"matched intro text {text}")
      print(f"MATCHED!\n")

  if(ret != ""):
    if (ret[-1] != "."):
      ret = ret + "."

  return ret
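
The two-stage matching used for introductions can be sketched in isolation as follows. The regular expressions are the ones from the function above; `extract_introduction` and the sample strings are hypothetical, and the 40-token minimum is left out because it needs the real tokenizer.

```python
import re
from typing import Optional

# First try: stop at the next "Header:" line. Second try: stop at the EOS token.
HEADER_TERMINATED = re.compile(r"Introduction:([\s\S]*?)[A-Z].+:\n+")
EOS_TERMINATED = re.compile(r"Introduction:([\s\S]*)</s>")


def extract_introduction(decoded: str) -> Optional[str]:
    """Apply the two patterns in order and return the first captured text."""
    for pattern in (HEADER_TERMINATED, EOS_TERMINATED):
        match = pattern.search(decoded)
        if match is not None:
            return match.group(1).replace("\n", " ").strip()
    return None


with_header = "Introduction:\nDeep nets are popular.\n\nRelated Work:\n\nMore text</s>"
with_eos = "Introduction:\nDeep nets are popular.</s>"
unfinished = "Introduction:\nunfinished text"
for sample in (with_header, with_eos, unfinished):
    print(extract_introduction(sample))
```

The lazy quantifier in the first pattern is what keeps the capture from running past the first following section header.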

In [None]:
def get_conclusion(title, abstract, introduction, tokenizer = tokenizer, model = model, debug = False):
  """
  Returns a conclusion generation based on the given title, abstract and introduction.
  This method transforms the given title, abstract and introduction
  to the prompt "Title: $title\n\nAbstract: $abstract\n\nIntroduction:\n\n$introduction\n\nConclusion:\n".

  Generated conclusions are accepted when:
    first try: the conclusion text follows the string "Conclusion:" and ends before References:.
    or second try: the conclusion text follows the string "Conclusion:" and ends before Acknowledgments:.
    or third try: the conclusion text follows the string "Conclusion:" and ends before EOS token.
    or fourth try: the conclusion text follows the string "Conclusion:".
  Generated conclusions are rejected when:
    they don't follow any of the acceptance criteria.
    or they are less than 30 tokens.
  Accepted generations are checked for a dot ending and having extra section titles.

  Parameters
  ----------
  title : str

  abstract : str

  introduction : str

  tokenizer: transformers tokenizer
      a transformers tokenizer. By default pretrained 
      AutoTokenizer of facebook/galactica-1.3b.

  model: OPTForCausalLM
      an OPTForCausalLM. By default pretrained 
      OPTForCausalLM of facebook/galactica-1.3b.

  debug: bool
      a boolean indicator for printing the matched generation. If True,
      prints a conclusion that matches the generation expected quality.
  """
  tokenizer.pad_token_id = 1
  tokenizer.padding_side = 'left'
  tokenizer.model_max_length = 64 + 512 + 1024
  input_text = f"""
  Title: {title}

  Abstract: {abstract}

  Introduction:

  {introduction}

  Conclusion:
  """
  found_match = False
  input_ids = tokenizer(input_text, padding='max_length', return_tensors="pt").input_ids.to("cuda")

  tries = 3
  found_okay_text = []

  while (not found_match and tries>0):
    out = model.generate(input_ids, max_new_tokens=1024,
                            temperature=0.7,
                            top_k=25,
                            top_p=0.9,
                            no_repeat_ngram_size=10,
                            early_stopping=True,
                            do_sample=True)
    
    # strip only the leading "<pad>" tokens; str.lstrip('<pad>') would strip a character set.
    text = re.sub(r"^(?:<pad>)+", "", tokenizer.decode(out[0]))
    ret = ""
    found_match = re.search(r"</s>", text) is not None
    
    # if (debug):
    #   print(f"original conc found: {text} \n")

    found_refs = re.search(r"References:", text)
    found_list = []
    # only get conclusions with a references section after them, to ensure that the text is written in past tense.
    found_list_refs = re.findall(r"Conclusion:([\s\S]*?)References:\n*", text)

    # lv is used to indicate how good the generation is among the loops.
    if (len(found_list_refs)==0):
      found_list = re.findall(r"Conclusion:([\s\S]*?)Acknowledgments:\n*", text)
      if (len(found_list)!=0):
        found_okay_text.append({"text":found_list[0], "lv":1})

    if (len(found_list)==0):
      found_list = re.findall(r"Conclusion:([\s\S]*)<\/s>", text)
      if (len(found_list)!=0):
        found_okay_text.append({"text":found_list[0], "lv":2})

    if (len(found_list)==0):
      found_list = re.findall(r"Conclusion:([\s\S]*)", text)
      if (len(found_list)!=0):
        found_okay_text.append({"text":found_list[0], "lv":3})


    if (len(found_list_refs) == 0):
      found_match = False
    else:
      ret = found_list_refs[0].replace('\n', ' ').strip()
      # ignore small match sizes due to regex limitations or other reasons.
      if (len(tokenizer.encode(ret)) < 30):
        found_match = False
      else:
        found_match = True

    if (debug and found_match):
      print(f"matched conc text {text}")
      print(f"MATCHED!\n")
    tries-=1

  # tries are over.
  if (tries == 0 and ret == ""):
    # get the levels found
    okay_list_lvs = list(map(lambda x: x["lv"], found_okay_text))
    # get the index of the lowest level.
    min_lv = min(okay_list_lvs)
    idx_min = okay_list_lvs.index(min_lv)
    ret = found_okay_text[idx_min]["text"].replace('\n', ' ').strip()
    # print("fail after many tries...")
    # print(ret)

  if(ret != ""):
    if (ret[-1] != "."):
      ret = ret + "."
  return ret
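
When none of the tries yields a conclusion followed by "References:", the function falls back to the best of the weaker matches it collected. A minimal sketch of that selection step is shown below; `best_fallback` and the candidate texts are hypothetical, and only the `text`/`lv` record shape is taken from the code above.

```python
from typing import Dict, List, Optional, Union

Candidate = Dict[str, Union[str, int]]


def best_fallback(candidates: List[Candidate]) -> Optional[str]:
    """Pick the candidate with the lowest level (earliest one on ties)."""
    if not candidates:
        return None
    best = min(candidates, key=lambda c: c["lv"])
    return str(best["text"]).replace("\n", " ").strip()


collected = [
    {"text": "cut off in the middle", "lv": 3},         # matched "Conclusion:" only
    {"text": " We showed X.\nFuture work. ", "lv": 1},  # ended before "Acknowledgments:"
    {"text": "We showed Y.", "lv": 2},                  # ended at the EOS token
]
print(best_fallback(collected))
```

Using `min` with a `key` returns the first candidate at the lowest level, which matches taking `index(min(...))` on the list of levels.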

In [None]:
def generate_paper(title, tokenizer = tokenizer, model = model, debug = False):
  # `abstract` is used instead of `abs` to avoid shadowing the built-in function.
  abstract = get_abstract(title, tokenizer, model, debug = debug)
  # print(f"{abstract} \n")
  intro = get_introduction(title, abstract, tokenizer, model, debug = debug)
  # print(f"{intro} \n")
  conc = get_conclusion(title, abstract, intro, tokenizer, model, debug = debug)
  # print(f"{conc} \n")
  if (debug):
    print(f"""
    Title: {title}

    Abstract: 
    
    {abstract}

    Introduction:

    {intro}

    Conclusion:

    {conc}
    """)
  return {"abstract": abstract, "introduction": intro, "conclusion": conc}

In [None]:
def clean_special_tokens(row):
  """
  Returns a cleaned dict after stripping Galactica special tokens from the sections.
  Reference spans ([START_REF]...[END_REF]) are replaced by a numeric index.

  Parameters
  ----------
  row : dict
    a dict with the abstract, introduction and conclusion sections. 
  """
  tags = [
    "[START_REF]",
    "[END_REF]",
    "[START_SUP]",
    "[END_SUP]",
    "[START_DNA]",
    "[END_DNA]",
    "[START_AMINO]",
    "[END_AMINO]",
    "[START_SMILES]",
    "[END_SMILES]",
    "[START_I_SMILES]",
    "[END_I_SMILES]",
  ]

  reference_index = 1
  reference = {}
  abstract = row["abstract"]
  introduction = row["introduction"]
  conclusion = row["conclusion"]

  # replace each reference span with a numeric index shared across sections.
  while abstract.find("[START_REF]") >= 0:
    start = abstract.find("[START_REF]")
    end = abstract.find("[END_REF]") + len("[END_REF]")
    # an unclosed reference gives an empty slice; truncate, as done for the conclusion.
    if start > end:
        abstract = abstract[:start]
        break
    ref = abstract[start:end]

    if ref not in reference:
        reference[ref] = reference_index
        abstract = abstract.replace(ref, str(reference_index))
        reference_index += 1
    else:
        abstract = abstract.replace(ref, str(reference[ref]))

  for tag in tags:
    abstract = abstract.replace(tag, "")

  # replace each reference span with a numeric index shared across sections.
  while introduction.find("[START_REF]") >= 0:
    start = introduction.find("[START_REF]")
    end = introduction.find("[END_REF]") + len("[END_REF]")
    # an unclosed reference gives an empty slice; truncate, as done for the conclusion.
    if start > end:
        introduction = introduction[:start]
        break
    ref = introduction[start:end]

    if ref not in reference:
        reference[ref] = reference_index
        introduction = introduction.replace(ref, str(reference_index))
        reference_index += 1
    else:
        introduction = introduction.replace(ref, str(reference[ref]))

  for tag in tags:
    introduction = introduction.replace(tag, "")
  # filter refs and make a list.
  while conclusion.find("[START_REF]") >= 0:

      start = conclusion.find("[START_REF]")
      end = conclusion.find("[END_REF]") + len("[END_REF]")
      if start > end:
          conclusion = conclusion[:start]
          break
      ref = conclusion[start:end]

      if ref not in reference:
          reference[ref] = reference_index
          conclusion = conclusion.replace(ref, str(reference_index))
          reference_index += 1
      else:
          conclusion = conclusion.replace(ref, str(reference[ref]))

  for tag in tags:
    conclusion = conclusion.replace(tag, "")

  return {"abstract":abstract, "introduction":introduction, "conclusion":conclusion}
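
The reference numbering above repeats the same loop for each section. As an alternative sketch (not the notebook's implementation), the same idea can be written once with a regular expression; `number_references` and the sample sentences are hypothetical. Unlike the loops above, this version leaves an unclosed `[START_REF]` untouched instead of truncating the text.

```python
import re
from typing import Dict

REF_PATTERN = re.compile(r"\[START_REF\].*?\[END_REF\]", flags=re.DOTALL)


def number_references(text: str, seen: Dict[str, int]) -> str:
    """Replace each reference span with a 1-based index, reusing indices via `seen`."""
    def replace(match):
        ref = match.group(0)
        if ref not in seen:
            seen[ref] = len(seen) + 1  # next free index
        return str(seen[ref])

    return REF_PATTERN.sub(replace, text)


seen: Dict[str, int] = {}  # shared across sections so indices stay consistent
abstract = number_references(
    "See [START_REF]Paper A[END_REF] and [START_REF]Paper B[END_REF], "
    "then [START_REF]Paper A[END_REF] again.",
    seen,
)
conclusion = number_references("As in [START_REF]Paper B[END_REF].", seen)
print(abstract)
print(conclusion)
```

Passing the same `seen` dictionary to every section plays the role of the shared `reference` dictionary in `clean_special_tokens`.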

Start processing:

In [None]:
import gc

In [None]:
titles = list(pd.read_csv('galactica.csv')['title'])
titles = titles[START:END]
n_batches_done = 0
abstracts, introductions, conclusions = [], [], []

n_batches_start_end = math.ceil((END - START) / BATCH_SIZE)
print(f"{n_batches_start_end} batches are needed for titles from {START} (inclusive) to {END} (exclusive)")

# Reload from file
print("Reading output file to resume work...")
generated_df = pd.read_csv(output_file)
abstracts = list(generated_df['abstract'])
assert len(abstracts) % BATCH_SIZE == 0
n_batches_done = len(abstracts) // BATCH_SIZE
print(f"{n_batches_done} have been found in the output file")

n_batches_to_do = n_batches_start_end - n_batches_done
print(f"There are {n_batches_to_do} batches still to process")

# Process  
for batch in range(n_batches_done, n_batches_start_end):
    current_titles = titles[batch*BATCH_SIZE:(batch+1)*BATCH_SIZE]
    current_abstracts = []
    current_introductions = []
    current_conclusions = []

    print(f"\nStarting processing batch {batch+1} of {n_batches_start_end}...")
    pbar = tqdm(total=len(current_titles))
    for i in range(len(current_titles)):
        # memory cleaning before generating the next paper.
        with torch.no_grad():
          torch.cuda.empty_cache()
        unclean_paper = None
        cleaned_paper = None
        gc.collect()
        unclean_paper = generate_paper(current_titles[i])
        cleaned_paper = clean_special_tokens(unclean_paper)
        current_abstracts.append(cleaned_paper["abstract"])
        current_introductions.append(cleaned_paper["introduction"])
        current_conclusions.append(cleaned_paper["conclusion"])
        # memory cleaning.
        unclean_paper = None
        cleaned_paper = None
        gc.collect()
        with torch.no_grad():
          torch.cuda.empty_cache()

        pbar.update(1)
    pbar.close()

    print(f"\nSaving batch {batch+1}...")
    with open(output_file,'a') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=['title', 'abstract', 'introduction', 'conclusion'])
        for i in range(len(current_abstracts)):
            writer.writerow({'title': current_titles[i],
                              'abstract': current_abstracts[i],
                              'introduction': current_introductions[i],
                              'conclusion': current_conclusions[i]
                              })
    copyfile(output_file, COLAB_BASE_DIR + output_file)

250 batches are needed for titles from 500 (inclusive) to 1000 (exclusive)
Reading output file to resume work...
30 have been found in the output file
There are 220 batches still to process

Starting processing batch 31 of 250...



 50%|█████     | 1/2 [11:55<11:55, 715.47s/it]
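
The resume step in the cell above boils down to counting the rows already written and dividing by the batch size. The sketch below reproduces that arithmetic on an in-memory CSV so it can be run without Drive or the model; `count_batches_done` and the toy rows are hypothetical.

```python
import csv
import io


def count_batches_done(csv_text: str, batch_size: int) -> int:
    """Count complete batches already present in the CSV text (header excluded)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if len(rows) % batch_size != 0:
        raise ValueError("output file ends in the middle of a batch")
    return len(rows) // batch_size


# Toy file content: a header plus four generated rows.
saved = (
    "title,abstract,introduction,conclusion\n"
    "t0,a0,i0,c0\n"
    "t1,a1,i1,c1\n"
    "t2,a2,i2,c2\n"
    "t3,a3,i3,c3\n"
)
done = count_batches_done(saved, batch_size=2)
total = 250  # batches needed for a 500-title slice with BATCH_SIZE = 2
print(f"resume at batch {done + 1} of {total}")
```

Because rows are appended only after a whole batch finishes, an interrupted run loses at most the batch that was in progress.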
