# Text-based Conditional Sequence Generation

Here we explore a generative model built on a text LLM that has been specialised, by fine-tuning, to understand protein sequences.

This model, described in the paper [Energy Efficient Protein Language Models: Leveraging Small Language Models with LoRA for Controllable Protein Generation](https://arxiv.org/abs/2411.05966), starts from a pre-trained text LLM and fine-tunes it further on protein sequences. As a result, the model learns to generate protein sequences, and generation can be conditioned on one of the ten class names the model was trained on.


In [8]:
from transformers import AutoTokenizer, pipeline

### Unconditional generation example

In [9]:
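# Load the fine-tuned protein model, paired with the Phi-3-mini tokenizer.
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct", trust_remote_code=True)
generator = pipeline("text-generation", model="Esperanto/Protein-Phi-3-mini", tokenizer=tokenizer)

# The bare prompt "Seq=<" (no class prefix) requests unconditional generation.
sequences = generator(
    "Seq=<",
    temperature=0.2,
    top_k=40,
    top_p=0.9,
    do_sample=True,
    repetition_penalty=1.2,
    max_new_tokens=30,
    num_return_sequences=500,
)

# Each result is a dict; 'generated_text' holds the prompt plus the generated continuation.
for sequence in sequences:
    print(sequence["generated_text"])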

### Conditional generation example

Note that the prompt now begins with the conditioning text `[Generate Ligase enzyme protein]`.

The model was trained on the following ten classes:
- SAM-MT 
- TPHD 
- TRX 
- CheY 
- Ligase 
- Hydrolase 
- Lyase 
- Oxidoreductase 
- Transferase 
- Isomerase
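
As a small convenience, the prompt for any of the classes above could be assembled with a helper like the one below. This is a minimal sketch, not part of the model's API: `KNOWN_CLASSES`, `build_prompt` and the `template` parameter are hypothetical names introduced here. Only the Ligase prompt is shown in this notebook, so applying the same `enzyme protein` wording to the other nine classes is an assumption; check the model card and pass a different `template` if it differs.

```python
# Hypothetical helper: builds the prompt string for conditional or unconditional generation.
KNOWN_CLASSES = [
    "SAM-MT", "TPHD", "TRX", "CheY", "Ligase",
    "Hydrolase", "Lyase", "Oxidoreductase", "Transferase", "Isomerase",
]


def build_prompt(class_name=None, template="[Generate {name} enzyme protein] Seq=<"):
    """Return the generation prompt.

    With no class name, return the unconditional prompt "Seq=<".
    The default template mirrors the Ligase example; reusing it for the
    other classes is an assumption, not something confirmed by the source.
    """
    if class_name is None:
        return "Seq=<"
    if class_name not in KNOWN_CLASSES:
        raise ValueError(f"Unknown class {class_name!r}; expected one of {KNOWN_CLASSES}")
    return template.format(name=class_name)
```

Calling `build_prompt("Ligase")` reproduces the prompt used in the cell that follows, and rejecting unknown names early avoids silently conditioning on a label the model never saw.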

In [11]:
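# Load the fine-tuned protein model, paired with the Phi-3-mini tokenizer.
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct", trust_remote_code=True)
generator = pipeline("text-generation", model="Esperanto/Protein-Phi-3-mini", tokenizer=tokenizer)

# The bracketed prefix conditions generation on the Ligase class.
sequences = generator(
    "[Generate Ligase enzyme protein] Seq=<",
    temperature=0.2,
    top_k=40,
    top_p=0.9,
    do_sample=True,
    repetition_penalty=1.2,
    max_new_tokens=30,
    num_return_sequences=500,
)

# Each result is a dict; 'generated_text' holds the prompt plus the generated continuation.
for sequence in sequences:
    print(sequence["generated_text"])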
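
Because `generated_text` includes the prompt, it can be useful to strip it and keep only the amino-acid string. The sketch below is one possible way to do that. It assumes the sequence follows the `Seq=<` marker and may be closed by `>` (with `max_new_tokens=30` the closing bracket will often be missing). The names `extract_sequence` and `is_plausible_protein` are hypothetical helpers, not part of `transformers`.

```python
# The 20 standard amino-acid one-letter codes.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def extract_sequence(generated_text):
    """Return the text after 'Seq=<' up to an optional closing '>'.

    Returns an empty string if the marker is absent.
    Assumes the output is laid out as '... Seq=<RESIDUES>'.
    """
    _, marker, tail = generated_text.partition("Seq=<")
    if not marker:
        return ""
    return tail.split(">", 1)[0].strip()


def is_plausible_protein(seq):
    """True if seq is non-empty and uses only the 20 standard residue letters."""
    return bool(seq) and set(seq) <= VALID_RESIDUES
```

For example, `[extract_sequence(s["generated_text"]) for s in sequences]` would give a list of candidate sequences, which could then be filtered with `is_plausible_protein` before any downstream analysis.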

## Bonus Exercise

### What do state-of-the-art natural language models know about proteins based on their sequences?

Try the following prompt in some state-of-the-art natural language models (e.g. claude.ai, chat.openai.com, chat.deepseek.com):

```
Can you describe this protein? MSTAGKVIKCKAAVLWELKKPFSIEEVEVAPPKAHEVRIKMVAAGICRSDEHVVSGNLVTPLPVILGHEAAGIVESVGEGVTTVKPGDKVIPLFTPQCGKCRICKNPESNYCLKNDLGNPRGTLQDGTRRFTCSGKPIHHFVGVSTFSQYTVVDENAVAKIDAASPLEKVCLIGCGFSTGYGSAVKVAKVTPGSTCAVFGLGGVGLSVVMGCKAAGAARIIAVDINKDKFAKAKELGATECINPQDYKKPIQEVLKEMTDGGVDFSFEVIGRLDTMMASLLCCHEACGTSVIVGVPPDSQNLSINPMLLLTGRTWKGAIFGGFKSKESVPKLVADFMAKKFSLDALITNILPFEKINEGFDLLRSGKSIRTVLTF
```

The protein is P00326 - [alcohol dehydrogenase](https://www.uniprot.org/uniprotkb/P00326/entry#sequences). 

How accurate and relevant are the descriptions given by the different models? Now try some of your own favourite protein sequences.

