<a href="https://colab.research.google.com/github/spyysalo/dl-binf-summer-school-2025/blob/main/galactica_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text generation with GALACTICA

This notebook demonstrates text generation with a small GALACTICA model (https://galactica.org/) on Colab.

First, we'll install the required Python packages. The [transformers](https://huggingface.co/docs/transformers/index) package is used to load the model and run generation, and the [accelerate](https://huggingface.co/docs/accelerate/index) package handles placing models on the available devices (it is required for the `device_map='auto'` argument used when loading the model below).

In [None]:
!pip install --quiet transformers accelerate

We'll perform generation using the `pipeline` function. It abstracts over many of the details involved in loading models from the [Hugging Face Hub](https://huggingface.co/models) and using them for common tasks.

In [None]:
from transformers import pipeline

Next, create a `pipeline` for text generation, loading a named model. You can substitute any other causal language model name here (e.g. other GALACTICA models), but note that Colab may have issues running very large models.

In [None]:
MODEL_NAME = 'facebook/galactica-1.3b'

pipe = pipeline(
    'text-generation',
    model=MODEL_NAME,
    device_map='auto',
)

Device set to use cuda:0


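We'll define a simple function to perform generation for a given text prompt. By default it samples with a temperature of 0.7 and generates at most 100 new tokens; the returned text consists of the prompt followed by the model's continuation. For details on text generation using transformers, see e.g. [this tutorial](https://huggingface.co/blog/how-to-generate).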

In [None]:
def generate(prompt, temperature=0.7, max_new_tokens=100):
    output = pipe(
        prompt,
        do_sample=True,
        temperature=temperature,
        max_new_tokens=max_new_tokens
    )
    return output[0]['generated_text']
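
To give some intuition for the `temperature` parameter, here is a minimal sketch of temperature-scaled sampling over a toy next-token distribution. The candidate tokens and their scores are made up for illustration, and the helper `softmax_with_temperature` is not part of transformers; this shows the general idea rather than the library's internal implementation.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical next-token candidates and scores (illustrative values only)
tokens = ['DNA', 'MDM2', 'proteins', 'the']
logits = [2.0, 1.5, 1.0, 0.2]

# Lower temperatures concentrate probability on the top-scoring token;
# higher temperatures spread it more evenly across candidates.
for t in (0.2, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])

# Sampling draws tokens at random according to these probabilities.
# A fixed seed makes this toy draw repeatable.
rng = random.Random(0)
samples = rng.choices(tokens, weights=softmax_with_temperature(logits, 0.7), k=5)
print(samples)
```

With a low temperature, the output stays close to the model's single most likely continuation; with a high temperature, it becomes more varied but also more prone to drifting off topic.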

Run generation with a few example prompts.

(Note that re-running these generation examples will produce different outputs: the pipeline is called with `do_sample=True`, so each next token is sampled at random from the model's predicted distribution rather than chosen deterministically.)

In [None]:
print(generate('p53 is an extensively studied protein that is known to interact with'))

p53 is an extensively studied protein that is known to interact with many proteins including histones, chromatin remodelers, and transcription factors, among others, to promote cell survival and/or apoptosis [ p53-Dependent Apoptosis: Evolutionary Conservation of the p53 Pathway from Bacteria to Humans, Loughery, p53-dependent apoptosis: a conserved mechanism in human and mouse., Kastan]. Additionally, p53 is also known to contribute to the regulation of autophagy, which is defined as the process of


In [None]:
print(generate('The most significant risk factors for cancer include'))

The most significant risk factors for cancer include obesity (for breast cancer), physical inactivity (for colorectal cancer), and smoking (for colorectal cancer) [ The risk factors for cancer in Japan, Hiraki].

Since the 1990s, the number of patients with cancer in Japan has increased over 10% annually, and the number of deaths from cancer has increased by 20% over the same period. In 2014, there were 1


**NOTE**: GALACTICA is a _base_ language model and has _not_ been trained to follow instructions (or chat). Because of this, "requests" such as the following will not result in responsive output.

In [None]:
print(generate('List the five most common types of cancer, with one per line.'))

List the five most common types of cancer, with one per line.

# 2.3. Data Preparation

In the data preparation step, we first extracted all the data from the TCGA database. The following data are extracted: (1) the ID and name of the patient; (2) the ID, name, and survival time of the tumor; (3) the ID, name, and survival time of the tumor microenvironment; (4) the ID, name, and survival time of the immune cells; and (


Try the model with a few prompts that test for facts relevant to your work.
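
Since a base model continues text rather than answering requests, one option is to phrase a prompt as the beginning of the text you would like to see. The sketch below builds such a prompt for the cancer-types request above; the helper `as_list_prompt` is hypothetical (not part of transformers or GALACTICA), and there is no guarantee that the model's continuation will be accurate or well formatted.

```python
def as_list_prompt(topic, first_marker='1.'):
    """Phrase a request as the opening of a numbered list for a base model to continue."""
    return f'The {topic} are:\n\n{first_marker}'

prompt = as_list_prompt('five most common types of cancer')
print(prompt)

# To try this with the model, pass the prompt to the generate function
# defined earlier in this notebook, for example:
# print(generate(prompt, max_new_tokens=50))
```

Ending the prompt with the first list marker nudges the model toward continuing the list instead of starting an unrelated passage.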

---

**BONUS**: try changing the model from `facebook/galactica-1.3b` to a model trained to follow instructions, e.g. `HuggingFaceTB/SmolLM2-1.7B-Instruct`, and rerun the generations. What is different?