# BioGPT with HF Transformers
A notebook to evaluate [BioGPT](https://academic.oup.com/bib/article/23/6/bbac409/6713511?guestAccessKey=a66d9b5d-4f83-4017-bb52-405815c907b9), Microsoft's domain-specific generative Transformer language model pre-trained on large-scale biomedical literature.  
No hardware acceleration is needed to run the code in this notebook.

### Settings

Install the missing requirements in the Colab VM: Hugging Face's Transformers and sacremoses, which the BioGPT tokenizer depends on.

In [None]:
!pip install transformers sacremoses

Import the necessary packages/classes.

In [None]:
import torch
from transformers import pipeline, set_seed
from transformers import BioGptTokenizer, BioGptForCausalLM

Load a pretrained model.

In [None]:
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")
tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")

### Text Generation

Create the generation pipeline.

In [None]:
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
set_seed(42)

Set the prompt for the model, the maximum length (in tokens) of each generated sequence, and the number of sequences to generate.

In [None]:
prompt = "Psoralen is" #@param {type: "string"}
generated_sequence_max_length = 60 #@param {type:"slider", min:10, max:200, step:1}
num_return_sequences = 3 #@param {type:"slider", min:1, max:20, step:1}

Generate text. The generated sequences are printed to the code cell output.

In [None]:
generator(prompt, 
          max_length=generated_sequence_max_length, 
          num_return_sequences=num_return_sequences, 
          do_sample=True)
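
The `text-generation` pipeline returns a list of dictionaries, each with a `generated_text` key, so the raw cell output can be hard to scan. The sketch below shows one possible way to turn that list into numbered lines. The helper name `format_generations` and the sample data are hypothetical and are not part of the original notebook; the strings are placeholders, not real model output.

```python
def format_generations(outputs):
    """Return one numbered, whitespace-normalized line per generated sequence."""
    lines = []
    for i, item in enumerate(outputs, start=1):
        # Collapse newlines and repeated spaces so each sequence fits on one line.
        text = " ".join(item["generated_text"].split())
        lines.append(f"[{i}] {text}")
    return lines


# Placeholder data in the shape the pipeline returns (NOT real BioGPT output).
sample_outputs = [
    {"generated_text": "Psoralen is placeholder text one."},
    {"generated_text": "Psoralen is   placeholder\ntext two."},
]

for line in format_generations(sample_outputs):
    print(line)
```

In the notebook, the same helper could be applied to the list returned by `generator(...)` instead of `sample_outputs`.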

### Beam-Search Decoding

Set the minimum and maximum length (in tokens) of the generated text and the number of beams. The prompt is the one set in the text-generation form above.

In [None]:
generated_text_min_length = 100 #@param {type:"slider", min:10, max:200, step:1}
# BioGPT supports at most 1024 positions, so the slider is capped at that value.
generated_text_max_length = 1024 #@param {type:"slider", min:300, max:1024, step:1}
num_beams = 5 #@param {type:"slider", min:1, max:10, step:1}

Tokenize the prompt and return the encoded inputs as PyTorch tensors.

In [None]:
inputs = tokenizer(prompt, return_tensors="pt")
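
In `generate`, both `min_length` and `max_length` count the prompt tokens as well as the newly generated ones. As a rough illustration, the sketch below computes how many new tokens a given `max_length` leaves room for. The helper `new_token_budget` and the token IDs are hypothetical; a plain dictionary stands in for the tokenizer output so the example runs without downloading the model.

```python
def new_token_budget(encoded, max_length):
    """Number of tokens that can still be generated once the prompt is counted."""
    prompt_len = len(encoded["input_ids"][0])  # first (and only) sequence in the batch
    return max(max_length - prompt_len, 0)


# Made-up token IDs standing in for tokenizer(prompt, return_tensors="pt").
fake_inputs = {"input_ids": [[2, 101, 102, 103]], "attention_mask": [[1, 1, 1, 1]]}

print(new_token_budget(fake_inputs, max_length=60))  # 56
```

The same function should also accept the real `inputs` object, since it only indexes `input_ids` and takes the length of the first row.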

Run beam-search decoding with gradient tracking disabled. The decoded text is shown in the code cell output.

In [None]:
with torch.no_grad():
    beam_output = model.generate(**inputs,
                                min_length=generated_text_min_length,
                                max_length=generated_text_max_length,
                                num_beams=num_beams,
                                early_stopping=True
                                )
tokenizer.decode(beam_output[0], skip_special_tokens=True)
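
To make the role of `num_beams` more concrete, here is a minimal beam-search sketch over a toy next-token table. It is a simplified illustration, not the Transformers implementation: the `TOY_MODEL` probabilities, the `beam_search` function, and its `max_steps` parameter are all assumptions made for this example, and it omits details such as length penalties and minimum-length constraints.

```python
import math

# Hypothetical toy "language model": next-token probabilities depend only on the last token.
TOY_MODEL = {
    "<s>": {"a": 0.5, "b": 0.4, "</s>": 0.1},
    "a": {"a": 0.1, "b": 0.2, "</s>": 0.7},
    "b": {"a": 0.05, "b": 0.05, "</s>": 0.9},
}


def beam_search(model, num_beams, max_steps, bos="<s>", eos="</s>"):
    """Return the (tokens, log_prob) pair with the highest total log-probability."""
    beams = [([bos], 0.0)]  # hypotheses still being extended
    finished = []           # hypotheses that ended with the end-of-sequence token
    for _ in range(max_steps):
        # Extend every open hypothesis by every possible next token.
        candidates = []
        for tokens, score in beams:
            for tok, prob in model[tokens[-1]].items():
                candidates.append((tokens + [tok], score + math.log(prob)))
        # Keep only the num_beams best extensions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:num_beams]:
            if tokens[-1] == eos:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # include unfinished hypotheses if max_steps was reached
    return max(finished, key=lambda c: c[1])


greedy_tokens, greedy_score = beam_search(TOY_MODEL, num_beams=1, max_steps=5)
wide_tokens, wide_score = beam_search(TOY_MODEL, num_beams=2, max_steps=5)
print(greedy_tokens)  # ['<s>', 'a', '</s>']
print(wide_tokens)    # ['<s>', 'b', '</s>']
```

With one beam the search is greedy and commits to `a`, the most likely first token, ending with a total probability of 0.35. With two beams it also keeps `b` alive and finds the sequence with probability 0.36, which is why a larger `num_beams` can yield higher-scoring text at the cost of more computation.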