<a href="https://colab.research.google.com/github/spyysalo/dl-binf-summer-school-2024/blob/main/galactica_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text generation with GALACTICA

This notebook demonstrates text generation with a small GALACTICA model (https://galactica.org/) on Colab.

First, we'll install the required Python packages. The [transformers](https://huggingface.co/docs/transformers/index) package is used to load the model and run generation, and the [accelerate](https://huggingface.co/docs/accelerate/index) package supports running large models efficiently on multiple devices.

In [1]:
!pip install --quiet transformers accelerate

We'll perform generation using the `pipeline` function. The object it returns abstracts over many of the details involved in loading models from the [Hugging Face Hub](https://huggingface.co/models) and using them for common tasks.

In [2]:
from transformers import pipeline

Next, create a `pipeline` for text generation, loading a named model. You can substitute the name of any other causal language model here (e.g. other GALACTICA models), but note that Colab may have issues running very large models.

In [3]:
MODEL_NAME = 'facebook/galactica-1.3b'

pipe = pipeline(
    'text-generation',
    model=MODEL_NAME,
    device_map='auto',
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
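
To get a rough sense of which models are likely to fit on a Colab instance, you can estimate the memory needed for the weights alone as the parameter count times the bytes per parameter. The sketch below is a back-of-the-envelope calculation, not a measurement: the parameter counts are nominal figures for the released GALACTICA checkpoints, the helper `estimate_weight_memory_gb` is a hypothetical name introduced here, and the estimate ignores activations and other runtime overhead. The `pytorch_model.bin` download for `facebook/galactica-1.3b` is about 2.63 GB, which is consistent with roughly 2 bytes per parameter.

```python
# Nominal parameter counts for the released GALACTICA checkpoints
# (approximate figures, used only for a rough estimate).
GALACTICA_PARAMS = {
    'facebook/galactica-125m': 125e6,
    'facebook/galactica-1.3b': 1.3e9,
    'facebook/galactica-6.7b': 6.7e9,
    'facebook/galactica-30b': 30e9,
    'facebook/galactica-120b': 120e9,
}


def estimate_weight_memory_gb(n_params, bytes_per_param=2):
    """Rough memory for the weights alone, in GB (10^9 bytes)."""
    return n_params * bytes_per_param / 1e9


for name, n_params in GALACTICA_PARAMS.items():
    half = estimate_weight_memory_gb(n_params, bytes_per_param=2)
    full = estimate_weight_memory_gb(n_params, bytes_per_param=4)
    print(f'{name}: ~{half:.1f} GB at 16-bit, ~{full:.1f} GB at 32-bit')
```

Actual usage will be higher than these figures, so treat them as a lower bound when choosing a model.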



We'll define a simple function to perform generation for a given text prompt using broadly reasonable parameters. For details on text generation using transformers, see e.g. [this tutorial](https://huggingface.co/blog/how-to-generate).

In [4]:
def generate(prompt, temperature=0.7, max_new_tokens=100):
    output = pipe(
        prompt,
        do_sample=True,
        temperature=temperature,
        max_new_tokens=max_new_tokens
    )
    return output[0]['generated_text']
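
To build some intuition for the `temperature` parameter, the following sketch applies a temperature-scaled softmax to a small set of made-up scores. It is a simplified illustration of the general idea, not the code that `transformers` runs internally: the `logits` values and the helper name `softmax_with_temperature` are assumptions chosen for the example.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw scores to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.5]

for t in (0.3, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f'temperature={t}:', [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the highest-scoring token, making output more predictable, while higher temperatures flatten the distribution and make sampling more varied.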

Run generation with a few example prompts.

(Note that re-running these generation examples will produce different outputs, because the pipeline calls `model.generate` with `do_sample=True`, which picks each token by random sampling.)
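
If you want repeatable outputs, one option is to fix the random seed before sampling. The sketch below illustrates the principle with Python's standard `random` module and a made-up vocabulary: the tokens, weights, and the helper `sample_tokens` are hypothetical and stand in for the model's real next-token distribution. In the notebook itself, calling `set_seed` from `transformers` before `generate` serves the same purpose, though results may still differ across hardware and library versions.

```python
import random


def sample_tokens(seed, n=5):
    """Draw n tokens from a toy distribution using a seeded generator."""
    rng = random.Random(seed)
    # Hypothetical vocabulary and probabilities, for illustration only.
    vocab = ['apoptosis', 'DNA', 'MDM2', 'cells']
    weights = [0.4, 0.3, 0.2, 0.1]
    return rng.choices(vocab, weights=weights, k=n)


# The same seed reproduces the same sequence of samples.
print(sample_tokens(seed=0))
print(sample_tokens(seed=0))

# A different seed will usually give a different sequence.
print(sample_tokens(seed=1))

# In the notebook, the analogous step would be:
#   from transformers import set_seed
#   set_seed(0)
#   print(generate('p53 is an extensively studied protein that is known to interact with'))
```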

In [5]:
print(generate('p53 is an extensively studied protein that is known to interact with'))

p53 is an extensively studied protein that is known to interact with a number of other proteins that regulate cell growth, differentiation and apoptosis. It has been shown that p53 can directly interact with various transcription factors, including the transcription factor Sp1 ( Sp1 interacts with tumor suppressor protein p53, Guo). Sp1 has been shown to regulate transcription of p53, and it is tempting to speculate that Sp1 is the downstream effector of the interaction between p53 and the miR-21 promoter in the present study.



In [6]:
print(generate('The most significant risk factors for cancer include'))

The most significant risk factors for cancer include a history of smoking, obesity, diabetes, alcoholism, cardiovascular disease, and depression ( Cancer and depression., Gilman,  Depression, anxiety and risk of breast cancer in African American and white women: a longitudinal study., Goodman). Cancer patients are more likely to develop depression, anxiety, and sleep disorders ( The effects of cancer on sleep: a review of the literature and proposed mechanisms., Fisher). A history of cancer was also found


**NOTE**: GALACTICA is a _base_ language model and has _not_ been trained to follow instructions (or chat). Because of this, "requests" such as the following will not result in responsive output.

In [7]:
print(generate('List the five most common types of cancer, with one per line.'))

List the five most common types of cancer, with one per line.

# 4.2.2. Classifying Types of Breast Cancer

Breast cancer is a highly heterogeneous disease, which is classified into four main types: ductal, lobular, papillary, and medullary. Ductal, lobular and papillary breast cancers are the most common and represent 70–80% of breast cancers, while the remaining 20% are medullary-like and are rarer [ Clinical classification of breast cancer., Vakiani
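
With a base model, it often works better to phrase the request as the beginning of a text that the model can continue. The sketch below builds such a prompt for the list example. The helper `list_prompt` is a hypothetical name introduced here, and this is only one possible phrasing: there is no guarantee that the model will continue it with a correct or complete list.

```python
def list_prompt(intro, first_marker='1.'):
    """Turn a request into the start of a numbered list for the model to continue."""
    return f'{intro}\n\n{first_marker}'


prompt = list_prompt('The five most common types of cancer are:')
print(prompt)

# In the notebook, pass the prompt to the generate function defined earlier:
#   print(generate(prompt))
```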


Try the model with a few prompts that test for facts relevant to your work.