# BioGPT

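This notebook uses the BioGPT checkpoint `microsoft/biogpt` hosted on [Hugging Face](https://huggingface.co/microsoft/biogpt). The 1.5B-parameter BioGPT-Large variant is published separately as `microsoft/BioGPT-Large`.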

In [1]:
import torch
from transformers import BioGptTokenizer, BioGptForCausalLM, set_seed, pipeline

In [2]:
tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")

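Beam-search decoding. The beam size (`num_beams`) sets how many candidate sequences (hypotheses) are kept at each decoding step. A larger beam explores more of the search space and tends to find higher-probability outputs, but inference is slower and uses more memory. A higher-probability output is not necessarily a better one: as the result of the cell below shows, beam search can produce repetitive text.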

In [3]:

set_seed(43)
sentence = "to differentiate germline and sporadic mutations, we need:"
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    beam_output = model.generate(**inputs,
                                min_length=100,
                                max_length=1024,
                                num_beams=10,
                                early_stopping=True
                                )
tokenizer.decode(beam_output[0], skip_special_tokens=True)

'to differentiate germline and sporadic mutations, we need: (1) a better understanding of the mechanisms of germline mutation; (2) a better understanding of the mechanisms of somatic mutation; (3) a better understanding of the mechanisms of somatic mutation; (4) a better understanding of the mechanisms of germline mutation; (5) a better understanding of the mechanisms of somatic mutation; (6) a better understanding of the mechanisms of germline mutation; (7) a better understanding of the mechanisms of somatic mutation; and (8) a better understanding of the mechanisms of germline mutation.'
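To make the role of the beam size concrete, here is a minimal sketch of beam search over a toy next-token table. The table `TOY_MODEL`, its probabilities, and the function `beam_search` are made up for illustration; the real `generate` implementation also applies length penalties and stopping criteria that are omitted here.

```python
import math

# Hypothetical toy "language model": the next-token probabilities depend only
# on the previous token. All numbers are invented for illustration.
TOY_MODEL = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"a": 0.1, "b": 0.5, "</s>": 0.4},
    "b": {"b": 0.1, "</s>": 0.9},
}


def beam_search(num_beams, max_steps=4):
    """Return (tokens, probability) of the best sequence found."""
    beams = [(0.0, ["<s>"])]  # (sum of log-probabilities, tokens so far)
    finished = []
    for _ in range(max_steps):
        # Extend every open hypothesis by every possible next token.
        candidates = [
            (score + math.log(p), tokens + [tok])
            for score, tokens in beams
            for tok, p in TOY_MODEL[tokens[-1]].items()
        ]
        candidates.sort(key=lambda c: c[0], reverse=True)
        # Keep only the num_beams best candidates; set aside completed ones.
        beams = []
        for score, tokens in candidates[:num_beams]:
            (finished if tokens[-1] == "</s>" else beams).append((score, tokens))
        if not beams:
            break
    best_score, best_tokens = max(finished + beams, key=lambda c: c[0])
    return best_tokens, math.exp(best_score)


for k in (1, 2):
    tokens, prob = beam_search(num_beams=k)
    print(k, tokens, f"{prob:.2f}")
# num_beams=1 (greedy) commits to "a" first and ends with probability 0.27;
# num_beams=2 also keeps "b" alive and finds "<s> b </s>" with probability 0.36.
```

With a single beam the search is greedy and misses the more probable sequence, which is the trade-off the `num_beams` argument controls.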

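You can also use this model directly with a text-generation pipeline. Because `do_sample=True` draws tokens at random, the output depends on the random seed; the seed set with `set_seed` above keeps the result reproducible when the cells are run in order: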

In [4]:
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
generator(sentence, max_length=50, num_return_sequences=1, do_sample=True)


[{'generated_text': 'to differentiate germline and sporadic mutations, we need: (1) methods / assays not exclusively focused on germline ones such as the identification of new disease-associated single nucleotide variants and next-generation sequencing strategies able to characterize non-inherited disease-causing variants'}]
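The effect of seeding can be illustrated without loading a model. The sketch below samples from a made-up next-token distribution using Python's standard `random` module; `VOCAB`, `WEIGHTS`, and `sample_tokens` are hypothetical names, and the example only mirrors the idea behind `set_seed`, which seeds the Python, NumPy, and PyTorch generators.

```python
import random

# Hypothetical next-token distribution; tokens and weights are invented.
VOCAB = ["gene", "mutation", "variant", "tumor"]
WEIGHTS = [0.4, 0.3, 0.2, 0.1]


def sample_tokens(seed, n_tokens=8):
    """Draw n_tokens from the toy distribution with a seeded generator."""
    rng = random.Random(seed)  # private generator; global state is untouched
    return rng.choices(VOCAB, weights=WEIGHTS, k=n_tokens)


first = sample_tokens(seed=43)
second = sample_tokens(seed=43)
print(first == second)  # True: the same seed replays the same draws
```

Changing the seed will usually, though not necessarily, give a different sequence, which is why sampled generations differ between runs unless the seed is fixed.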

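Here is how to use this model to get the features (final hidden states) of a given text in PyTorch:

In [5]:
import torch
from transformers import BioGptTokenizer, BioGptModel
tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
# BioGptForCausalLM returns next-token logits; the bare BioGptModel returns hidden states.
feature_model = BioGptModel.from_pretrained("microsoft/biogpt")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
with torch.no_grad():  # inference only, so skip gradient tracking
    output = feature_model(**encoded_input)
features = output.last_hidden_state  # shape: (batch, tokens, hidden_size)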
