<a href="https://colab.research.google.com/github/spyysalo/dl-binf-summer-school-2023/blob/main/galactica_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text generation with GALACTICA

This notebook demonstrates text generation with a small GALACTICA model (https://galactica.org/) on Colab.

First, we'll install the required Python packages. The [transformers](https://huggingface.co/docs/transformers/index) package is used to load the model and run generation, and the [accelerate](https://huggingface.co/docs/accelerate/index) package supports running large models efficiently across the available devices (it is required for the `device_map='auto'` option used when loading the model below).

In [1]:
!pip install --quiet transformers accelerate

Next, we'll import the `AutoTokenizer` and `AutoModelForCausalLM` classes. These support loading tokenizers and models from the [Hugging Face Hub](https://huggingface.co/models).

In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM

Next, load a causal language model and its tokenizer. You can substitute the name of any other causal language model here (e.g. other GALACTICA models), but note that Colab may run out of memory when loading very large models.

In [3]:
MODEL_NAME = 'facebook/galactica-1.3b'

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map='auto')
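To get a rough sense of which models are likely to fit, the memory needed for the weights alone can be estimated as the number of parameters times the bytes per parameter. The sketch below is back-of-the-envelope arithmetic only: the parameter counts are nominal values read off the model names (an assumption, not measured sizes), the helper `weight_memory_gb` is a hypothetical name introduced here, and the estimate ignores activations and other runtime overhead.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Parameter counts are nominal values taken from the model names (assumption).
NOMINAL_PARAMS = {
    'facebook/galactica-125m': 125e6,
    'facebook/galactica-1.3b': 1.3e9,
    'facebook/galactica-6.7b': 6.7e9,
    'facebook/galactica-30b': 30e9,
}
BYTES_PER_PARAM = {'float32': 4, 'float16': 2}


def weight_memory_gb(n_params, dtype='float32'):
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for name, n_params in NOMINAL_PARAMS.items():
    fp32 = weight_memory_gb(n_params, 'float32')
    fp16 = weight_memory_gb(n_params, 'float16')
    print(f'{name}: ~{fp32:.1f} GB in float32, ~{fp16:.1f} GB in float16')
```

By this estimate the 1.3B model used here needs roughly 5 GB for its weights in 32-bit precision, while the larger variants quickly exceed what a single Colab accelerator typically offers.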

We'll define a simple function to perform generation for a given text prompt using broadly reasonable parameters. For details on text generation using transformers, see e.g. [this tutorial](https://huggingface.co/blog/how-to-generate).

In [4]:
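def generate(prompt, temperature=0.7, max_new_tokens=100):
    """Sample a continuation of `prompt` and return the decoded text."""
    input_ids = tokenizer(prompt, return_tensors='pt').input_ids
    # Tensor.to() returns a new tensor rather than moving it in place,
    # so the result must be assigned back.
    input_ids = input_ids.to(model.device)
    output = model.generate(
        input_ids,
        do_sample=True,  # sample from the distribution instead of greedy decoding
        temperature=temperature,
        max_new_tokens=max_new_tokens,
        no_repeat_ngram_size=2,  # never generate the same bigram twice
    )
    # The output contains the prompt tokens followed by the generated ones.
    return tokenizer.decode(output[0], skip_special_tokens=True)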
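To build some intuition for two of the parameters used above, the following is a minimal pure-Python sketch of what `temperature` and `no_repeat_ngram_size` do conceptually. It is not the transformers implementation: the function names `softmax_with_temperature` and `banned_next_tokens` are hypothetical, and the logits and token IDs are made-up toy values.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def banned_next_tokens(tokens, ngram_size=2):
    """Return tokens that would complete an n-gram already present in `tokens`."""
    # The last (ngram_size - 1) tokens form the prefix of the next n-gram.
    prefix = tuple(tokens[len(tokens) - (ngram_size - 1):])
    banned = set()
    for i in range(len(tokens) - ngram_size + 1):
        if tuple(tokens[i:i + ngram_size - 1]) == prefix:
            banned.add(tokens[i + ngram_size - 1])
    return banned


# Toy scores for three candidate tokens (made-up values).
logits = [2.0, 1.0, 0.0]
for t in (0.3, 0.7, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f'temperature={t}:', [round(p, 3) for p in probs])

# Toy token IDs: the sequence ends in 5, and 5 was previously followed by 7,
# so generating 7 next would repeat the bigram (5, 7).
print(banned_next_tokens([5, 7, 9, 5], ngram_size=2))  # -> {7}
```

Lower temperatures concentrate probability on the highest-scoring token, making output more conservative, while higher temperatures flatten the distribution and make output more varied. Blocking repeated bigrams is a blunt but effective guard against the model looping on the same phrase.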

Run generation with a few example prompts.

(Note that re-running these generation examples will produce different outputs, because `model.generate` is invoked with `do_sample=True` and therefore samples each token at random.)
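The effect of sampling on reproducibility can be illustrated without the model. The sketch below draws token indices from a fixed toy distribution using Python's standard `random` module: unseeded calls generally differ from run to run, while calls with the same seed return the same sequence. The helper `sample_tokens` and the probabilities are hypothetical. In the notebook itself, the analogous step would presumably be to fix the random seed before calling `generate`, for example with the `set_seed` function from transformers.

```python
import random


def sample_tokens(probs, n, seed=None):
    """Draw `n` token indices according to `probs`, optionally with a fixed seed."""
    rng = random.Random(seed)  # seed=None uses system entropy, so runs differ
    return rng.choices(range(len(probs)), weights=probs, k=n)


# Toy next-token distribution over three tokens (made-up values).
probs = [0.6, 0.3, 0.1]

print(sample_tokens(probs, 10))           # generally varies between runs
print(sample_tokens(probs, 10, seed=42))  # same seed ...
print(sample_tokens(probs, 10, seed=42))  # ... same sequence
```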

In [5]:
print(generate('p53 is an extensively studied protein that is known to interact with'))

p53 is an extensively studied protein that is known to interact with proteins that mediate DNA repair, cell cycle regulation, and apoptosis ( Functional interactions between p50(nrb) and p73: regulation of apoptosis, transcriptional activation, DNA binding, oligomerization, nuclear localization, subcellular localization and growth suppressor activity., Vousden).

p70 S6 kinase 1 [p-p85α/α-subunit of PI3K] is a serine-threonine kinase that plays an important role in the regulation


In [6]:
print(generate('The most significant risk factors for cancer include'))

The most significant risk factors for cancer include smoking, alcohol consumption, and physical inactivity. Therefore, efforts to reduce the incidence of cancer are essential. In this study, we analyzed the association between the intake of green tea and the risk of lung cancer among 27,414 male workers in the Korean Working Environment Cohort study. Cox proportional hazards regression analysis was used to calculate hazard ratios (HRs) and 95% confidence intervals [CIs]. During follow-up, 331 cases of incident
