<a href="https://colab.research.google.com/github/spyysalo/dl-binf-summer-school-2025/blob/main/galactica_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text generation with GALACTICA

This notebook demonstrates text generation with a small GALACTICA model (https://galactica.org/) on Colab.

First, we'll install the required Python packages. The [transformers](https://huggingface.co/docs/transformers/index) package is used to load the model and run generation, and the [accelerate](https://huggingface.co/docs/accelerate/index) package supports running large models efficiently on multiple devices.

In [1]:
!pip install --quiet transformers accelerate

We'll perform generation using the `pipeline` class. This class abstracts over many of the details involved in loading models from the [Hugging Face Hub](https://huggingface.co/models) and using them for common tasks.

In [2]:
from transformers import pipeline

First, create a `pipeline` for text generation, loading a named model. You can substitute the name of any other causal language model here (e.g. another GALACTICA model), but note that Colab may be unable to run very large models.

In [3]:
MODEL_NAME = 'facebook/galactica-1.3b'

pipe = pipeline(
    'text-generation',
    model=MODEL_NAME,
    device_map='auto',
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Device set to use cuda:0
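
The caveat about very large models can be made concrete with some rough arithmetic. The sketch below estimates the memory needed just to hold the model weights. It assumes 2 bytes per parameter (16-bit weights), takes nominal parameter counts from the GALACTICA model names, and uses an illustrative 15 GB GPU budget that is an assumption rather than a measured Colab limit. Activations and other overhead are ignored, so real usage will be higher.

```python
# Rough estimate of the memory needed to hold model weights.
# Assumption: 2 bytes per parameter (16-bit weights). Activations, caches
# and framework overhead are NOT included, so real usage is higher.

BYTES_PER_PARAM = 2      # assumed 16-bit weights
GPU_BUDGET_GB = 15.0     # illustrative budget, not a measured Colab limit

# Nominal parameter counts, read off the model names.
NOMINAL_PARAMS = {
    'facebook/galactica-125m': 125e6,
    'facebook/galactica-1.3b': 1.3e9,
    'facebook/galactica-6.7b': 6.7e9,
    'facebook/galactica-30b': 30e9,
    'facebook/galactica-120b': 120e9,
}

def weight_memory_gb(n_params, bytes_per_param=BYTES_PER_PARAM):
    """Return the approximate size of the weights in gigabytes."""
    return n_params * bytes_per_param / 1e9

for name, n_params in NOMINAL_PARAMS.items():
    gb = weight_memory_gb(n_params)
    verdict = 'may fit' if gb < GPU_BUDGET_GB else 'unlikely to fit'
    print(f'{name}: ~{gb:.1f} GB of weights ({verdict})')
```

Treat the verdicts as a first filter only: a model whose weights barely fit may still run out of memory during generation.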


We'll define a simple function to perform generation for a given text prompt using broadly reasonable parameters. For details on text generation using transformers, see e.g. [this tutorial](https://huggingface.co/blog/how-to-generate).

In [4]:
def generate(prompt, temperature=0.7, max_new_tokens=100):
    output = pipe(
        prompt,
        do_sample=True,
        temperature=temperature,
        max_new_tokens=max_new_tokens
    )
    return output[0]['generated_text']
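
To give some intuition for the `temperature` parameter, here is a minimal standalone sketch of how temperature conventionally reshapes a next-token distribution: the raw scores (logits) are divided by the temperature before the softmax. The logits are made-up toy values and the function name is illustrative; this is not the library's internal code.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw scores to probabilities after dividing by temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens

for t in (0.2, 0.7, 1.5):
    probs = softmax_with_temperature(toy_logits, t)
    print(f'temperature={t}:', [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the highest-scoring token, giving more predictable text, while higher temperatures flatten the distribution and make the output more varied.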

Run generation with a few example prompts.

(Note that re-running these examples will produce different outputs, because `pipe` is called with `do_sample=True` and therefore samples each token at random rather than always picking the most likely one.)
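
The run-to-run variation comes from the random number generator behind the sampling step. The toy sketch below (standard library only, with a made-up vocabulary and probabilities) shows that fixing the seed makes sampled output repeatable. For the actual pipeline, calling `transformers.set_seed` with a fixed value before generating should have a similar effect; that call is not part of the original notebook and is mentioned here as a suggestion.

```python
import random

def sample_tokens(vocab, weights, n, seed=None):
    """Draw n tokens at random; a fixed seed makes the draws repeatable."""
    rng = random.Random(seed)  # seed=None gives different draws on each run
    return rng.choices(vocab, weights=weights, k=n)

vocab = ['A', 'C', 'G', 'T']      # toy vocabulary
weights = [0.4, 0.3, 0.2, 0.1]    # toy probabilities

run1 = sample_tokens(vocab, weights, 10, seed=42)
run2 = sample_tokens(vocab, weights, 10, seed=42)
print(run1)
print('identical with the same seed:', run1 == run2)
```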

In [5]:
print(generate('p53 is an extensively studied protein that is known to interact with'))

p53 is an extensively studied protein that is known to interact with a diverse range of cellular proteins to maintain genome integrity. In recent years, the role of p53 has been increasingly linked to the regulation of cell cycle checkpoints and apoptosis. In addition to these direct roles, p53 has been shown to interact with a variety of signaling proteins that are known to play key roles in cellular survival, proliferation and apoptosis. The identification of these interactions has given rise to a new class of small molecule inhibitors of p53, which are likely to have important implications in


In [6]:
print(generate('The most significant risk factors for cancer include'))

The most significant risk factors for cancer include tobacco and alcohol consumption; however, only tobacco consumption was a significant risk factor for cancer in our study (p = 0.01). The smoking rate in the present study was 39.3%, which is higher than the national rate of 20% in 2017 []. There are several possible reasons for the high smoking rates in this study. One is that the questionnaire was distributed by social media groups, which may have attracted more


**NOTE**: GALACTICA is a _base_ language model and has _not_ been trained to follow instructions (or chat). Because of this, it treats a "request" such as the following as text to continue rather than as an instruction to carry out, so the output is typically not responsive.

In [7]:
print(generate('List the five most common types of cancer, with one per line.'))

List the five most common types of cancer, with one per line.

The first three columns list the ICD-10 code, the corresponding cancer type, and the number of patients with cancer in the U.S. in each calendar year. The fourth column shows the cancer type in the data, based on ICD-O-3 codes. Finally, the last column shows the name of the cancer type if we don't see the ICD-10 code. In this way, users can make their own definitions of cancer types.

# 
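
One common workaround with base models is to phrase the prompt as the beginning of a document whose natural continuation is the desired answer. The helper below is a hypothetical sketch (its name and the list marker are illustrative choices, not part of the original notebook), and there is no guarantee that the continuation will be responsive or factually correct.

```python
def as_document_start(heading, list_marker='1.'):
    """Reframe a request as the start of a document the model can continue."""
    # No trailing space after the marker, so the model begins the first item.
    return f'{heading.strip()}\n\n{list_marker}'

prompt = as_document_start('The five most common types of cancer are:')
print(prompt)

# In the notebook, the prompt could then be passed to the function defined
# earlier, e.g.: print(generate(prompt))
```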


Try the model with a few prompts that test for facts relevant to your work.
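
When trying several prompts, it can help to collect the results and look only at the text the model added, since the output shown above repeats the prompt. The sketch below assumes a generation function that returns the prompt followed by its continuation, like `generate` above. The helper names, the example prompt, and the stand-in `fake_generate` are all illustrative.

```python
def continuation_only(prompt, generated_text):
    """Strip the leading prompt, leaving only what the model added."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    return generated_text

def run_prompts(prompts, generate_fn):
    """Map each prompt to the continuation produced by generate_fn."""
    return {p: continuation_only(p, generate_fn(p)) for p in prompts}

# Stand-in so that this sketch runs without a model; in the notebook,
# pass the real generate function instead.
def fake_generate(prompt):
    return prompt + ' <model continuation>'

results = run_prompts(['BRCA1 is a gene that'], fake_generate)
for prompt, continuation in results.items():
    print(repr(prompt), '->', repr(continuation))
```

Check any factual claims in the continuations against a trusted source, as the model can produce fluent but incorrect text.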