<a href="https://colab.research.google.com/github/tijeco/berteome/blob/dev/notebooks/final/index.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# berteome

A library to analyze and explore protein sequences using BERT models

## Install

`pip install berteome`

# How to use

`berteome` uses BERT-style masked language models to predict, for every residue in a protein sequence, the probability of each amino acid at that position.

The main `berteome` library can be imported as follows:

In [3]:
from berteome import berteome

A function for listing the models currently supported by `berteome` is coming soon. In the meantime, here they are:

- `Rostlab/prot_bert`
- `facebook/esm2_t33_650M_UR50D`
- `facebook/esm1b_t33_650M_UR50S`

All of these models are distributed through Hugging Face, and `berteome` makes great use of its API.

## Load library

To load the `Rostlab/prot_bert` model, run the following:

In [4]:
bert_tokenizer, bert_model = berteome.load_model("Rostlab/prot_bert")

Some weights of the model checkpoint at Rostlab/prot_bert were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


The language models used by `berteome` were trained with a masked-token approach. In this approach, a random amino acid in a protein is masked and the model is trained to predict what that amino acid should be. The models are trained this way on an incredibly large number of protein sequences, to the point that they begin to learn the language of protein sequence space as we currently know it. For instance, a model can learn which residues are unlikely to occur at a given position in a protein. Using these models, you can place a mask at any residue in a protein, and the model will generate a probability for each amino acid that could go there.
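To make the masking step concrete, here is a minimal sketch of how one masked input per residue could be built. It is not `berteome`'s internal code: the helper name `masked_variants` is made up for illustration, and the space-separated residues with a `[MASK]` token are assumed from the input convention of `Rostlab/prot_bert` (the ESM models use a different mask token).

```python
def masked_variants(sequence, mask_token="[MASK]"):
    """Yield (one-based position, masked input string) for every residue.

    Hypothetical helper for illustration only, not part of berteome.
    """
    residues = list(sequence)
    for i in range(len(residues)):
        masked = residues.copy()
        masked[i] = mask_token  # hide exactly one residue
        yield i + 1, " ".join(masked)  # residues separated by spaces


for position, text in masked_variants("MENDEL"):
    print(position, text)
```

Each of these strings would be passed to the model separately, and the prediction at the masked position gives one row of probabilities.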

`berteome` lets you investigate these predictions for a given protein sequence by masking every residue in turn and predicting the probabilities of all possible amino acids at each position. The result is an easy-to-use pandas DataFrame. To build this DataFrame for a very simple peptide sequence (`MENDEL`), do the following:

In [6]:
mendel_bertDF = berteome.modelPredDF("MENDEL", bert_tokenizer, bert_model)
mendel_bertDF.predDf

| | wt | wtIndex | wtScore | A | C | D | E | F | G | H | ... | M | N | P | Q | R | S | T | V | W | Y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M | 1 | 0.076602 | 0.036697 | 0.011504 | 0.048245 | 0.118906 | 0.024072 | 0.039202 | 0.012621 | ... | 0.076602 | 0.072661 | 0.024722 | 0.038672 | 0.043105 | 0.07028 | 0.056544 | 0.049927 | 0.007781 | 0.021699 |
| 1 | E | 2 | 0.07483 | 0.045721 | 0.015662 | 0.041921 | 0.07483 | 0.037153 | 0.044325 | 0.018264 | ... | 0.043581 | 0.062667 | 0.025277 | 0.036911 | 0.055543 | 0.064425 | 0.049955 | 0.056789 | 0.012691 | 0.029893 |
| 2 | N | 3 | 0.04199 | 0.043564 | 0.009685 | 0.16259 | 0.184364 | 0.033782 | 0.044661 | 0.012355 | ... | 0.041484 | 0.04199 | 0.019992 | 0.025515 | 0.029433 | 0.048106 | 0.030303 | 0.054742 | 0.00743 | 0.024924 |
| 3 | D | 4 | 0.049748 | 0.042083 | 0.013244 | 0.049748 | 0.086194 | 0.039736 | 0.055911 | 0.016861 | ... | 0.04008 | 0.060822 | 0.032024 | 0.039689 | 0.046228 | 0.062323 | 0.044901 | 0.058937 | 0.010875 | 0.026596 |
| 4 | E | 5 | 0.086915 | 0.046641 | 0.01877 | 0.079822 | 0.086915 | 0.050638 | 0.050466 | 0.022397 | ... | 0.028962 | 0.062234 | 0.023879 | 0.030534 | 0.040489 | 0.065195 | 0.044938 | 0.068038 | 0.012156 | 0.038034 |
| 5 | L | 6 | 0.060736 | 0.038191 | 0.009217 | 0.065189 | 0.152547 | 0.02095 | 0.049525 | 0.013955 | ... | 0.040042 | 0.096484 | 0.020712 | 0.035022 | 0.046888 | 0.049071 | 0.046247 | 0.048276 | 0.010486 | 0.022727 |


This dataframe is where the true berteomic magic begins. Each row corresponds to one residue in the input protein sequence.

The `wt` column holds the actual amino acid at the given position, and `wtIndex` is a one-based index of the residue, which makes plotting easier (it may not stick around forever, though).

`wtScore` is one of the most informative values. Ideally, the model would predict that the masked residue is the same as the wild-type residue in the sequence. This column gives the probability that the model actually assigned to the wild-type residue at that position.

The remaining columns hold the probability the model assigned to each possible amino acid when the residue in that row was masked.
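As a small illustration of how to read one row, the sketch below copies the visible values for position 3 (wild type `N`) from the table above into a plain dictionary and compares the wild-type probability with the other residues. The I, K and L columns are elided in the printed table, so they are missing here as well, and the variable names are only illustrative.

```python
# Visible part of the row for position 3 (wild type N) from the table above.
# The I, K and L columns are elided there, so they are missing here too.
row = {
    "A": 0.043564, "C": 0.009685, "D": 0.16259, "E": 0.184364,
    "F": 0.033782, "G": 0.044661, "H": 0.012355, "M": 0.041484,
    "N": 0.04199, "P": 0.019992, "Q": 0.025515, "R": 0.029433,
    "S": 0.048106, "T": 0.030303, "V": 0.054742, "W": 0.00743,
    "Y": 0.024924,
}
wt = "N"

wt_score = row[wt]  # same value as the wtScore column
top_residue = max(row, key=row.get)  # most probable visible residue
# Rank 1 would mean the wild type is the model's favourite residue.
wt_rank = 1 + sum(score > wt_score for score in row.values())

print(wt, wt_score, top_residue, row[top_residue], wt_rank)
```

Among the visible columns, the model prefers `E` and `D` over the wild-type `N` at this position. On the full DataFrame, pandas' `idxmax(axis=1)` over the amino acid columns should give the top residue for every row at once.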

Using `MENDEL` as a toy example isn't truly a great use case, because the model wasn't necessarily trained on such short peptides with that particular order of residues. Nevertheless, it still demonstrates the possibilities that `berteome` opens up.

From 6 amino acids, we have 120 data points (6 positions × 20 amino acids) to start playing around with!

From a single sequence, you could generate a seqlogo plot, which would normally require aligning multiple homologous sequences together.
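As a rough sketch of the logo idea, the snippet below turns one position's probabilities into letter heights using the conventional information-content calculation (without any small-sample correction). The function names `information_content` and `logo_heights` are hypothetical, the example distributions are made up rather than real model output, and the actual drawing is left to whichever logo plotting library you prefer.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def information_content(probs):
    """Bits of information at one position: log2(N) minus Shannon entropy."""
    total = sum(probs.values())
    normalized = [p / total for p in probs.values() if p > 0]
    entropy = -sum(p * math.log2(p) for p in normalized)
    return math.log2(len(probs)) - entropy


def logo_heights(probs):
    """Height of each letter: its probability times the information content."""
    ic = information_content(probs)
    total = sum(probs.values())
    return {aa: (p / total) * ic for aa, p in probs.items()}


# Made-up distributions: one flat, one strongly favouring E.
flat = {aa: 1.0 for aa in AMINO_ACIDS}
peaked = {aa: (10.0 if aa == "E" else 1.0) for aa in AMINO_ACIDS}

print(round(information_content(flat), 3), round(information_content(peaked), 3))
```

A flat distribution carries no information, so its letters would have zero height, while a confident prediction produces one tall letter.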

A position-specific scoring matrix for BLAST searches and HMM searches could be generated from a single sequence!
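One possible way to get from probabilities to scores is a log-odds transform against background frequencies, sketched below. This is only an assumption about how such a matrix could be derived: the function name `log_odds_row` is hypothetical, the uniform background (1/20 per amino acid) is a simplification, and real search tools expect their own file formats and score scaling, which are not covered here.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def log_odds_row(probs, background=None, pseudocount=1e-6):
    """Convert one position's probabilities into log2-odds scores.

    Hypothetical helper; a uniform background is assumed when none is given.
    """
    if background is None:
        background = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    return {
        # The pseudocount avoids log2(0) for residues with zero probability.
        aa: math.log2((probs.get(aa, 0.0) + pseudocount) / background[aa])
        for aa in AMINO_ACIDS
    }


# Made-up probabilities for one position (not real model output).
example = {aa: 0.04 for aa in AMINO_ACIDS}
example["E"] = 0.24

scores = log_odds_row(example)
print(round(scores["E"], 2), round(scores["A"], 2))
```

Positive scores mark residues the model finds more likely than the background, and negative scores mark residues it finds less likely.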

From a single sequence, you could also sample variants guided by the model's probabilities, which should be more plausible than random mutations. This offers a reasonable way to mutate a protein sequence or to augment a sparse protein dataset!
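A minimal sketch of such sampling, assuming you already have one probability distribution per position: each residue is drawn with a weight equal to its probability. The helper `sample_variant` and the three-position example are hypothetical and do not come from `berteome` or from real model output.

```python
import random


def sample_variant(position_probs, rng):
    """Draw one residue per position, weighted by the given probabilities."""
    residues = []
    for probs in position_probs:
        letters = list(probs)
        weights = [probs[aa] for aa in letters]
        residues.append(rng.choices(letters, weights=weights, k=1)[0])
    return "".join(residues)


# Hypothetical per-position probabilities (not real model output).
position_probs = [
    {"M": 0.7, "L": 0.2, "V": 0.1},
    {"E": 0.6, "D": 0.3, "Q": 0.1},
    {"N": 0.5, "D": 0.3, "S": 0.2},
]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
variants = [sample_variant(position_probs, rng) for _ in range(5)]
print(variants)
```

Because each position is sampled independently, this ignores interactions between residues; it is meant as a starting point rather than a full variant-generation method.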

# Development

To build the library, run the following:

```
nbdev_export
```

Then install it in editable mode in a development environment:

```
pip install -e '.[dev]'
```

I do quite a bit of work on a Chromebook, which means working on GitHub through Codespaces and on Google Colab. To install a particular commit of `berteome`, you can do the following:

In [1]:
!pip install "berteome @ git+https://github.com/tijeco/berteome@eb9c1dfd45c52c8d43362fda9d74a4643755d544"

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting berteome@ git+https://github.com/tijeco/berteome@eb9c1dfd45c52c8d43362fda9d74a4643755d544
  Using cached berteome-0.1.5-py3-none-any.whl
