<a href="https://colab.research.google.com/github/spyysalo/dl-binf-summer-school-2024/blob/main/galactica_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text generation with GALACTICA

This notebook demonstrates text generation with a small GALACTICA model (https://galactica.org/) on Colab.

First, we'll install the required Python packages. The [transformers](https://huggingface.co/docs/transformers/index) package is used to load the model and run generation, and the [accelerate](https://huggingface.co/docs/accelerate/index) package supports running large models efficiently on multiple devices.

In [1]:
!pip install --quiet transformers accelerate

We'll perform generation using the `pipeline` function, which abstracts over many of the details involved in loading models from the [Hugging Face Hub](https://huggingface.co/models) and using them for common tasks.

In [2]:
from transformers import pipeline

Next, create a `pipeline` for text generation, loading a named model. You can substitute the name of any other causal language model here (e.g. other GALACTICA models), but note that Colab may be unable to run very large models.

In [3]:
MODEL_NAME = 'facebook/galactica-1.3b'

pipe = pipeline(
    'text-generation',
    model=MODEL_NAME,
    device_map='auto',
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


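As a rough guide to which models are likely to fit, the memory needed just to hold the weights is approximately the parameter count times the bytes per parameter. The sketch below works this out for a few sizes. The helper name `weight_memory_gb` and the round parameter counts are assumptions for illustration, and actual memory use will be higher because of activations and framework overhead.

```python
def weight_memory_gb(n_params, bytes_per_param=2):
    """Approximate memory (in GB, 1 GB = 1e9 bytes) needed to hold model weights."""
    return n_params * bytes_per_param / 1e9

# Illustrative parameter counts (assumed round numbers, not exact model sizes).
for name, n_params in [('1.3B', 1.3e9), ('6.7B', 6.7e9), ('30B', 30e9)]:
    fp16 = weight_memory_gb(n_params, bytes_per_param=2)  # 16-bit floats
    fp32 = weight_memory_gb(n_params, bytes_per_param=4)  # 32-bit floats
    print(f'{name}: ~{fp16:.1f} GB in 16-bit, ~{fp32:.1f} GB in 32-bit')
```

Comparing these estimates with the memory available in your Colab runtime gives a quick first check before attempting to download a larger model.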

We'll define a simple function to perform generation for a given text prompt using broadly reasonable parameters. For details on text generation using transformers, see e.g. [this tutorial](https://huggingface.co/blog/how-to-generate).

In [4]:
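def generate(prompt, temperature=0.7, max_new_tokens=100):
    """Return the prompt followed by a sampled continuation."""
    output = pipe(
        prompt,
        do_sample=True,  # sample tokens instead of always taking the most likely one
        temperature=temperature,
        max_new_tokens=max_new_tokens,
    )
    # The pipeline returns one dict per generated sequence; take the first.
    return output[0]['generated_text']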
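To give a feel for what the `temperature` parameter does, here is a minimal, self-contained sketch of temperature-scaled sampling in plain Python. It is a toy illustration, not the implementation used inside transformers: the function names and the example scores are made up for this sketch.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw scores to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature=1.0, rng=random):
    """Draw one token index at random according to the tempered probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Toy scores for three candidate tokens (made-up values for illustration).
toy_logits = [2.0, 1.0, 0.1]
for t in (0.2, 0.7, 1.5):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the highest-scoring token, so outputs become more predictable. Higher temperatures flatten the distribution and make outputs more varied.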

Run generation with a few example prompts.

(Note that re-running these generation examples will produce different outputs, because `model.generate` is invoked with `do_sample=True` and therefore picks tokens at random.)
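If repeatable outputs are needed, one option is to seed the random number generator before each call. The sketch below shows the idea with a stand-in generator so that it runs without a model. `generate_with_seed` and `toy_generate` are hypothetical helpers introduced here, not part of the notebook or of transformers.

```python
import random

def generate_with_seed(generate_fn, prompt, seed, seed_fn=random.seed):
    """Seed the random number generator, then call generate_fn(prompt)."""
    seed_fn(seed)
    return generate_fn(prompt)

# Stand-in for the notebook's generate(): appends randomly chosen words.
def toy_generate(prompt, n_words=5):
    vocab = ['protein', 'gene', 'cell', 'tumor', 'pathway']
    return prompt + ' ' + ' '.join(random.choice(vocab) for _ in range(n_words))

first = generate_with_seed(toy_generate, 'p53 interacts with', seed=42)
second = generate_with_seed(toy_generate, 'p53 interacts with', seed=42)
print(first == second)  # the same seed reproduces the same toy output
```

With the real model, the analogous call would pass the notebook's `generate` together with the `set_seed` function from `transformers`, for example `generate_with_seed(generate, prompt, 42, seed_fn=set_seed)`. Outputs may still differ across hardware and library versions.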

In [5]:
print(generate('p53 is an extensively studied protein that is known to interact with'))

p53 is an extensively studied protein that is known to interact with DNA and histones and act as a tumor suppressor. p53: 800 million years of evolution and 40 years of discovery, Levine p53 can be activated by a variety of DNA damage and other stress signals and is an important regulator of cell cycle arrest, apoptosis, and senescence. The role of p53 in cell cycle arrest and apoptosis., Amaral p53 can also regulate autophagy in response to cellular stress, DNA damage, and other


In [6]:
print(generate('The most significant risk factors for cancer include'))

The most significant risk factors for cancer include exposure to carcinogens, such as tobacco smoke, and genetic susceptibility1. Tobacco smoke contains numerous carcinogens, including polycyclic aromatic hydrocarbons (PAHs), which are generated by incomplete combustion of organic materials2. PAHs are ubiquitously present in indoor air and can enter the human body through inhalation, ingestion of food or water, and dermal contact. PAHs are a group of hydrocarbons with a wide variety of chemical structures and physicochemical properties, which include high volatility, poor water solubility,


**NOTE**: GALACTICA is a _base_ language model and has _not_ been trained to follow instructions (or chat). Because of this, "requests" such as the following will not result in responsive output.

In [7]:
print(generate('List the five most common types of cancer, with one per line.'))

List the five most common types of cancer, with one per line.

## History

 The game was developed by Infocom, with help from the company's former employee Steve Madden. The game was originally planned to be called Star Wars: Battlefront II, but the name was changed to avoid confusion with the Star Wars: Battlefront game that was released about 18 months before. The game was released in 1981 for the Amstrad CPC, Atari ST, Commodore 64, and ZX Spectrum
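A common workaround with base models is to phrase the request as the beginning of a text whose natural continuation is the desired answer. The following is a minimal sketch of that idea. The helper `as_list_prompt` is hypothetical, and whether GALACTICA continues such a prompt usefully has not been verified here.

```python
def as_list_prompt(topic, count_word='five', first_marker='1.'):
    """Phrase a request as the opening of a numbered list for the model to continue."""
    return f'The {count_word} most common {topic} are:\n{first_marker}'

# Build a completion-style prompt in place of the instruction-style request.
prompt = as_list_prompt('types of cancer')
print(prompt)
```

Passing `prompt` to `generate` gives the model the start of a list to continue rather than an instruction to follow. As with any generated text, the result should be checked against a reliable source.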


Try the model with a few prompts that test for facts relevant to your work.