<a href="https://colab.research.google.com/github/rsamala/datascience/blob/main/biogpt-demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BioGPT Simple Example 

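This notebook is a simple example of using BioGPT for text generation: it loads the pretrained model, inspects the tokenizer, and generates text with a pipeline.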

# Installing Dependencies

The `transformers` library provides the BioGPT classes and downloads the pretrained model. `BioGptTokenizer` uses the `sacremoses` library internally, so it must be installed as well.

In [None]:
!pip install transformers
!pip install sacremoses

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.12.1-py3-none-any.whl (190 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.12.1 tokenizers-0.13.2 transformers-4.26.1

In [None]:
from transformers import BioGptForCausalLM, BioGptTokenizer, pipeline

# Setting Up the Model and Tokenizer

In [None]:
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")

Downloading (…)lve/main/config.json:   0%|          | 0.00/595 [00:00<?, ?B/s]

Downloading (…)"pytorch_model.bin";:   0%|          | 0.00/1.56G [00:00<?, ?B/s]

## The Tokenizer

In [None]:
# Load the tokenizer that matches the pretrained model.
tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
# Tokenize a sample sentence; the result holds `input_ids` and `attention_mask`.
result = tokenizer("You need to approximately know how much tokens a text is, as the API uses tokens as a prameter. Here is an example. Let's take this sentence and run it through the the tokenizer.")
print(result)
print(f"There are {len(result['input_ids'])} tokens")

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/927k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/696k [00:00<?, ?B/s]

{'input_ids': [2, 1015, 20337, 658, 13, 792, 6502, 597, 1100, 2953, 803, 4404, 14, 4210, 21, 7, 27, 6, 15513, 2719, 2953, 803, 4404, 27, 14, 3437, 33838, 4, 556, 21, 32, 2327, 4, 237, 1346, 121, 2956, 35, 17305, 8, 5431, 114, 199, 6, 6, 2953, 18559, 11698, 4], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
There are 49 tokens
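
The count printed above can be wrapped in a small helper when several texts need to be checked against a length limit. The sketch below is illustrative only: `count_tokens` and `fits_budget` are hypothetical names, not part of `transformers`, and they assume only that the tokenizer is callable and returns a mapping with an `input_ids` list, as in the output above.

```python
def count_tokens(tokenizer, text):
    """Return how many token IDs `tokenizer` produces for `text`.

    Assumes `tokenizer(text)` returns a mapping with an "input_ids" list,
    as the BioGptTokenizer call above does.
    """
    return len(tokenizer(text)["input_ids"])


def fits_budget(tokenizer, text, max_tokens):
    """Return True if `text` tokenizes to at most `max_tokens` IDs."""
    return count_tokens(tokenizer, text) <= max_tokens
```

Called with the `tokenizer` loaded above, `count_tokens` performs the same `len(result['input_ids'])` computation as the cell, so the count includes any special tokens the tokenizer adds.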


# Text Generation 

In [None]:
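# Build a text-generation pipeline from the model and tokenizer loaded above.
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Generate a single continuation with the default settings.
generator("A healthy lifestyle includes ")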



[{'generated_text': 'A healthy lifestyle includes  a healthy diet, physical activity, and smoking cessation.'}]

In [None]:
# Enable sampling and request five different continuations of the same prompt.
generator("A healthy lifestyle includes ", num_return_sequences=5, do_sample=True)

[{'generated_text': 'A healthy lifestyle includes  healthy eating and aerobic physical behavior, but the extent of these behaviors and lifestyle factors'},
 {'generated_text': 'A healthy lifestyle includes  regular physical activity, healthy eating, and a healthy weight status.'},
 {'generated_text': 'A healthy lifestyle includes  a healthy diet, exercise, and healthy weights.'},
 {'generated_text': 'A healthy lifestyle includes  a daily healthy eating pattern, a healthy physical activity, and having a minimum of'},
 {'generated_text': 'A healthy lifestyle includes  daily physical activity, adequate nutrition, and the use of appropriate dietary and pharmacological treatments'}]
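
Each result is a dictionary whose `generated_text` repeats the prompt before the continuation. A minimal sketch for keeping only the new text follows; `strip_prompt` is a hypothetical helper, not a `transformers` function, and it assumes the output keeps the list-of-dicts shape shown above.

```python
def strip_prompt(outputs, prompt):
    """Return only the newly generated text from pipeline-style outputs.

    `outputs` is assumed to be a list of dicts with a "generated_text" key,
    matching the results printed above.
    """
    continuations = []
    for item in outputs:
        text = item["generated_text"]
        # Drop the prompt only when the text really starts with it.
        if text.startswith(prompt):
            text = text[len(prompt):]
        continuations.append(text.strip())
    return continuations


# Two of the sampled results from above, copied verbatim.
sample = [
    {"generated_text": "A healthy lifestyle includes  a healthy diet, exercise, and healthy weights."},
    {"generated_text": "A healthy lifestyle includes  regular physical activity, healthy eating, and a healthy weight status."},
]
print(strip_prompt(sample, "A healthy lifestyle includes "))
```

The text-generation pipeline also accepts `return_full_text=False`, which should give a similar result at generation time without post-processing.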

In [None]:
# Constrain each continuation to between 40 and 70 newly generated tokens.
generator("A healthy lifestyle includes ", num_return_sequences=3, do_sample=True,
          min_new_tokens=40, max_new_tokens=70)

[{'generated_text': 'A healthy lifestyle includes  an increased intake of vegetables, fruit, legumes, and moderate to hard physical activities. (ABSTRACT TRUNCATED AT 250 WORDS) 3. (ABSTRACT TRUNCATED AT 250 WORDS) 2 CURRENT STATE'},
 {'generated_text': 'A healthy lifestyle includes  regular aerobic exercise, smoking cessation, and healthy diet. & nbsp.; METHODS: To promote physical activity, the study involved a public-health intervention, based on a health promotion model in primary care.'},
 {'generated_text': 'A healthy lifestyle includes  dietary, physical activity, and sedentary behaviors., An evaluation of the Canadian Primary Care Exercise Program among young, inactive women aged 19-32 years. (J Diabetes Ther 9 (Suppl 1): R13-R33, 2013).'}]

In [None]:
# Same length limits, plus a time budget: max_time is given in seconds.
generator("A healthy lifestyle includes ", num_return_sequences=3, do_sample=True,
          min_new_tokens=40, max_new_tokens=70, max_time=10)

[{'generated_text': 'A healthy lifestyle includes  a physically active life and a healthy diet. I. Concept, concept, development, and evaluation. The Lifestyle Lifestyle Change Intervention Study Group. (ABSTRACT TRUNCATED AT 250 WORDS). I. Concept analysis. (ABSTRACT'},
 {'generated_text': 'A healthy lifestyle includes  a healthy diet and regular physical activity. (2) Lifestyle interventions are key to reducing obesity-related health disparity. (3) Healthy lifestyles include a healthy diet and regular physical activity (4) Lifestyle interventions'},
 {'generated_text': 'A healthy lifestyle includes  a safe and healthy diet; however, sedentary lifestyles and physical inactivity are well-established risk factors for cardiovascular diseases. "Physical activity (PA) is an essential component for the prevention of chronic diseases as well'}]
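
To see how `max_time` affects a call, the wall-clock duration can be measured around it. The sketch below uses only the standard library; `timed_call` is a hypothetical helper that works with any callable and is not something the notebook or `transformers` provides.

```python
import time


def timed_call(fn, *args, **kwargs):
    """Call `fn` with the given arguments and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Hypothetical usage with the pipeline defined earlier:
# outputs, seconds = timed_call(
#     generator, "A healthy lifestyle includes ",
#     do_sample=True, max_new_tokens=70, max_time=10,
# )
# print(f"Generation took {seconds:.1f} s")
```

Since the time limit appears to be checked between generation steps, the measured duration may run slightly past the value passed as `max_time`.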