# berteome

A library for analyzing and exploring protein sequences with BERT models.

## Install

`pip install berteome`





# Getting started

`berteome` uses BERT-style masked language models to predict the amino acid at every residue position in a protein sequence.

The main `berteome` library can be imported as follows:

In [1]:
from berteome import berteome

The `modelLoader` class can be used to show what models are supported by `berteome`. 

In [2]:
berteome_models = berteome.modelLoader()
berteome_models.supported_models

['Rostlab/prot_bert',
 'facebook/esm2_t33_650M_UR50D',
 'facebook/esm1b_t33_650M_UR50S']

All of these models are distributed through Hugging Face, and `berteome` makes heavy use of its API.

## Load a model

To load the `prot_bert` model, run the following:

In [3]:
bert_tokenizer, bert_model = berteome_models.load_model("Rostlab/prot_bert")

Some weights of the model checkpoint at Rostlab/prot_bert were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


The language models used by `berteome` were trained with a masked-token objective: a random amino acid in a protein is hidden, and the model is trained to predict what it should be. Training on a very large number of protein sequences lets these models begin to learn the language of protein sequence space as we currently know it, for instance which residues are unlikely to occur at a given position in a protein. With these models, you can place a mask at any residue in a protein, and the model will produce a probability score for every amino acid that could go there.
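
As a rough illustration of the masking step, the sketch below builds one masked copy of a sequence per residue. The helper name `mask_each_residue` and the `mask_token` default are assumptions made for this example and are not part of `berteome`. The exact input format depends on the model: ProtBert expects residues separated by spaces with `[MASK]` as the mask token, while other models use different tokens.

```python
def mask_each_residue(seq, mask_token="[MASK]"):
    """Return one copy of `seq` per position, with that position masked.

    Hypothetical helper for illustration; not part of the berteome API.
    """
    masked = []
    for i in range(len(seq)):
        tokens = list(seq)
        tokens[i] = mask_token
        # Join residues with spaces, the input style ProtBert's tokenizer expects.
        masked.append(" ".join(tokens))
    return masked


masked_inputs = mask_each_residue("MENDEL")
for text in masked_inputs:
    print(text)
```

Each of these masked strings would then be passed to the model, which returns a probability for every amino acid at the masked position.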

`berteome` lets you investigate these predictions for a given protein sequence by masking every residue in turn and predicting the probabilities of all possible amino acids at that position. The result is an easy-to-work-with pandas dataframe. To build this dataframe for a very simple peptide sequence (`MENDEL`), do the following:

In [4]:
mendel_berteome = berteome.modelPredDF("MENDEL", bert_tokenizer, bert_model)
mendel_berteome.df

| | wt | wtIndex | wtScore | n_effective | topAA | topAAscore | A | C | D | E | ... | M | N | P | Q | R | S | T | V | W | Y |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | M | 1 | 0.076601 | 16.680502 | E | 0.118907 | 0.036697 | 0.011504 | 0.048245 | 0.118907 | ... | 0.076601 | 0.072661 | 0.024722 | 0.038672 | 0.043104 | 0.07028 | 0.056544 | 0.049927 | 0.007781 | 0.021699 |
| 1 | E | 2 | 0.07483 | 17.59915 | L | 0.106501 | 0.045721 | 0.015662 | 0.041921 | 0.07483 | ... | 0.043581 | 0.062667 | 0.025277 | 0.036911 | 0.055543 | 0.064424 | 0.049955 | 0.056789 | 0.012691 | 0.029893 |
| 2 | N | 3 | 0.04199 | 14.518506 | E | 0.184365 | 0.043564 | 0.009685 | 0.162591 | 0.184365 | ... | 0.041484 | 0.04199 | 0.019992 | 0.025515 | 0.029433 | 0.048105 | 0.030303 | 0.054742 | 0.00743 | 0.024924 |
| 3 | D | 4 | 0.049748 | 17.561045 | L | 0.109087 | 0.042082 | 0.013244 | 0.049748 | 0.086194 | ... | 0.04008 | 0.060822 | 0.032024 | 0.039689 | 0.046228 | 0.062323 | 0.044901 | 0.058937 | 0.010875 | 0.026596 |
| 4 | E | 5 | 0.086915 | 17.921403 | L | 0.090806 | 0.046641 | 0.01877 | 0.079823 | 0.086915 | ... | 0.028962 | 0.062234 | 0.023879 | 0.030534 | 0.040489 | 0.065195 | 0.044938 | 0.068038 | 0.012156 | 0.038034 |
| 5 | L | 6 | 0.060736 | 16.06808 | E | 0.152547 | 0.038191 | 0.009217 | 0.065189 | 0.152547 | ... | 0.040042 | 0.096484 | 0.020712 | 0.035022 | 0.046888 | 0.049071 | 0.046247 | 0.048276 | 0.010486 | 0.022727 |


This dataframe is where the true berteomic magic begins. Each row corresponds to one residue in the input protein sequence.


Here is a breakdown of some of the columns in the dataframe.

- `wt` represents the actual amino acid at the given position
- `wtIndex` is a one-based index of the residue, which makes plotting easier; it may not stick around forever, though
- `wtScore` is a very interesting and important value. For a given protein, one would hope that the model would predict that the masked residue would be the same as the wild-type in the sequence. This column gives us the actual probability that the model provided for the wild type residue at that position.
- `n_effective` is a measure of site-specific variability that serves as a proxy for how many amino acids could occupy that site. It is defined as $N_{eff}(i) = \exp\left(-\sum_{j} p_{ji} \ln p_{ji}\right)$, where $p_{ji}$ is the predicted probability of amino acid $j$ at position $i$
- `topAA` is the top scoring amino acid at a given position in the protein
- `topAAscore` is the score of the top scoring amino acid at a given position in the protein

The remaining columns hold the probability that the model assigned to each possible amino acid when the residue in that row was masked.
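
To make the column definitions concrete, here is a minimal sketch that derives `wtScore`, `topAA`, `topAAscore`, and `n_effective` from a single row of probabilities. The helper `summarize_position` and the two toy distributions are made up for this example; this is not how `berteome` itself is implemented, only a reading of the definitions listed here.

```python
import math


def summarize_position(wt, probs):
    """Derive the summary columns for one position from its probability row.

    `probs` maps amino acid letters to probabilities that sum to 1.
    Hypothetical helper for illustration; not part of the berteome API.
    """
    top_aa = max(probs, key=probs.get)
    # Shannon entropy with natural log; zero-probability terms contribute nothing.
    entropy = -sum(p * math.log(p) for p in probs.values() if p > 0)
    return {
        "wt": wt,
        "wtScore": probs[wt],
        "topAA": top_aa,
        "topAAscore": probs[top_aa],
        "n_effective": math.exp(entropy),
    }


amino_acids = "ACDEFGHIKLMNPQRSTVWY"

# Uniform case: all 20 amino acids equally plausible, so n_effective is at its maximum of 20.
uniform = {aa: 1 / 20 for aa in amino_acids}
uniform_summary = summarize_position("M", uniform)
print(uniform_summary["n_effective"])

# Peaked case (made-up numbers): nearly all mass on two residues, so n_effective is just under 2.
peaked = {aa: 0.0 for aa in amino_acids}
peaked["E"], peaked["D"] = 0.6, 0.4
peaked_summary = summarize_position("E", peaked)
print(peaked_summary)
```

Read this way, the `n_effective` values of roughly 14 to 18 in the `MENDEL` dataframe indicate that the model considers most amino acids plausible at every position of such a short peptide.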

## Amino acid correlation

To investigate how correlated the predictions for the different amino acids are with one another in a given berteome dataframe, use the `aa_correlation()` method, which returns a correlation dataframe:

In [5]:
mendel_berteome.aa_correlation()

| | A | C | D | E | F | G | H | I | K | L | M | N | P | Q | R | S | T | V | W | Y |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| A | 1.0 | 0.728715 | 0.23581 | -0.389881 | 0.879476 | 0.295942 | 0.745626 | 0.281984 | -0.521585 | 0.733507 | -0.720196 | -0.611638 | 0.079974 | -0.43348 | -0.010743 | 0.051074 | -0.411039 | 0.833228 | 0.585931 | 0.854028 |
| C | 0.728715 | 1.0 | -0.335088 | -0.816557 | 0.854113 | 0.231246 | 0.94853 | 0.774245 | -0.042328 | 0.46636 | -0.382039 | -0.235091 | 0.369492 | 0.063831 | 0.313224 | 0.63868 | 0.247716 | 0.876382 | 0.736406 | 0.923179 |
| D | 0.23581 | -0.335088 | 1.0 | 0.765979 | 0.084235 | -0.105942 | -0.311789 | -0.822667 | -0.909459 | 0.087417 | -0.275036 | -0.582001 | -0.59921 | -0.924921 | -0.890908 | -0.67145 | -0.903985 | 0.053578 | -0.545097 | -0.021774 |
| E | -0.389881 | -0.816557 | 0.765979 | 1.0 | -0.555587 | -0.275369 | -0.756602 | -0.960438 | -0.445065 | -0.44961 | 0.096603 | -0.027768 | -0.732525 | -0.612381 | -0.71052 | -0.797273 | -0.600746 | -0.555545 | -0.767344 | -0.570188 |
| F | 0.879476 | 0.854113 | 0.084235 | -0.555587 | 1.0 | 0.456554 | 0.850718 | 0.485913 | -0.477462 | 0.699525 | -0.622554 | -0.579098 | 0.359113 | -0.254101 | -0.07273 | 0.316786 | -0.244819 | 0.988905 | 0.54693 | 0.916871 |
| G | 0.295942 | 0.231246 | -0.105942 | -0.275369 | 0.456554 | 1.0 | 0.469726 | 0.397903 | -0.077726 | 0.31133 | -0.730925 | 0.058535 | 0.495876 | 0.101607 | 0.10324 | -0.197845 | -0.268707 | 0.46457 | 0.501197 | 0.351616 |
| H | 0.745626 | 0.94853 | -0.311789 | -0.756602 | 0.850718 | 0.469726 | 1.0 | 0.780561 | -0.042413 | 0.403463 | -0.613985 | -0.096182 | 0.331736 | 0.020781 | 0.334197 | 0.428618 | 0.133952 | 0.884543 | 0.852826 | 0.949145 |
| I | 0.281984 | 0.774245 | -0.822667 | -0.960438 | 0.485913 | 0.397903 | 0.780561 | 1.0 | 0.529274 | 0.250579 | -0.168629 | 0.251701 | 0.680964 | 0.638906 | 0.732241 | 0.718695 | 0.641516 | 0.519193 | 0.815986 | 0.560868 |
| K | -0.521585 | -0.042328 | -0.909459 | -0.445065 | -0.477462 | -0.077726 | -0.042413 | 0.529274 | 1.0 | -0.363198 | 0.430711 | 0.773596 | 0.335641 | 0.889436 | 0.850882 | 0.411445 | 0.872259 | -0.447153 | 0.317516 | -0.325409 |
| L | 0.733507 | 0.46636 | 0.087417 | -0.44961 | 0.699525 | 0.31133 | 0.403463 | 0.250579 | -0.363198 | 1.0 | -0.360747 | -0.779561 | 0.554168 | -0.037799 | 0.062691 | 0.196182 | -0.320035 | 0.588133 | 0.326965 | 0.436262 |
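
The matrix is symmetric with ones on the diagonal, which is what a Pearson correlation between amino acid columns, computed across residue positions, would look like. The README does not say which method `aa_correlation()` uses, so treat that as an assumption. As a check, the sketch below computes the Pearson coefficient for the `D` and `E` columns of the `MENDEL` dataframe shown earlier; it comes out at approximately 0.766, in line with the `D`/`E` entry of the matrix. The `pearson` helper is written out by hand for illustration only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered products and centered squares.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Per-position probabilities for D and E, copied from the MENDEL dataframe.
d_probs = [0.048245, 0.041921, 0.162591, 0.049748, 0.079823, 0.065189]
e_probs = [0.118907, 0.074830, 0.184365, 0.086194, 0.086915, 0.152547]

r_de = pearson(d_probs, e_probs)
print(round(r_de, 3))
```

A high positive value like this one means the model tends to raise or lower the probabilities of the two amino acids together across positions.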


## Most probable variants

`berteome` can also be used to generate single-residue substitution variants from the top k amino acids at each residue in a protein. To generate the top 3 mutational variants for `MENDEL`, load the `generate` submodule and use it as follows:

In [6]:
from berteome import generate

In [7]:
generate.top_k_variants(mendel_berteome, 3)

| | sub | seq |
| --- | --- | --- |
| 0 | 0subE | EENDEL |
| 1 | 0subK | KENDEL |
| 2 | 0subN | NENDEL |
| 3 | 1subL | MLNDEL |
| 4 | 1subK | MKNDEL |
| 5 | 1subI | MINDEL |
| 6 | 2subE | MEEDEL |
| 7 | 2subD | MEDDEL |
| 8 | 2subL | MELDEL |
| 9 | 3subL | MENLEL |


This returns a dataframe with L x k possible single amino acid variants, where L is the length of the sequence.

- `sub` is the substitution id that indicates which residue was substituted with what amino acid, following the pattern `{residue_number}sub{substituted_amino_acid}`. As the output above shows, the residue number is zero-based.
- `seq` is the new variant sequence.
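
The following sketch shows one way such variants could be built from per-position probabilities. It is a hypothetical re-implementation for illustration (`sketch_top_k_variants` and the toy numbers are not part of `berteome`), and the library's own logic may differ. One assumption is worth calling out: the output above appears to skip the wild-type residue (at position 0, `M` is not listed even though its score is higher than that of `N`), so the sketch excludes it as well.

```python
def sketch_top_k_variants(seq, position_probs, k):
    """Build single-substitution variants from the k most probable residues per position.

    `position_probs` holds one {amino_acid: probability} dict per residue of `seq`.
    Hypothetical re-implementation for illustration; not the berteome API.
    """
    variants = []
    for i, probs in enumerate(position_probs):
        # Rank candidates by probability, skipping the wild-type residue itself.
        ranked = sorted(
            (aa for aa in probs if aa != seq[i]),
            key=probs.get,
            reverse=True,
        )
        for aa in ranked[:k]:
            # Zero-based substitution id, matching the `sub` column pattern.
            variants.append((f"{i}sub{aa}", seq[:i] + aa + seq[i + 1:]))
    return variants


# Toy probabilities for a two-residue peptide (made-up numbers).
toy_probs = [
    {"M": 0.5, "E": 0.3, "K": 0.15, "N": 0.05},
    {"E": 0.2, "L": 0.4, "K": 0.3, "I": 0.1},
]
toy_variants = sketch_top_k_variants("ME", toy_probs, 2)
for sub, variant in toy_variants:
    print(sub, variant)
```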

## Random sequences

If you'd like to randomly generate proteins by sampling from the amino acid probabilities at each residue position in the berteome dataframe, you can use `n_random_seqs`:

In [9]:
generate.n_random_seqs(mendel_berteome, 10)

| | seq | score |
| --- | --- | --- |
| 0 | QSSESS | 0.05861 |
| 1 | NDHGWH | 0.034827 |
| 2 | SAKCVK | 0.056908 |
| 3 | PLRLMA | 0.056149 |
| 4 | IQVQFS | 0.049592 |
| 5 | MTDEIA | 0.081339 |
| 6 | PVFAVM | 0.044242 |
| 7 | NMSSVW | 0.050866 |
| 8 | DYYIIQ | 0.047647 |
| 9 | GQLEPM | 0.053943 |


- `seq` is the randomly generated sequence
- `score` is the average score of the amino acids chosen in the randomly generated sequence
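
A minimal sketch of this kind of sampling, assuming each residue is drawn independently from its position's probability distribution and that `score` is the mean probability of the residues drawn. The function `sample_sequences`, its `seed` parameter, and the toy distributions are made up for this example; the actual `n_random_seqs` implementation may differ.

```python
import random


def sample_sequences(position_probs, n, seed=None):
    """Sample `n` sequences, drawing each residue from its position's distribution.

    Returns (sequence, mean probability of the chosen residues) pairs.
    Hypothetical re-implementation for illustration; not the berteome API.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(n):
        chosen = []
        for probs in position_probs:
            residues = list(probs)
            weights = [probs[aa] for aa in residues]
            # Weighted draw of a single residue for this position.
            chosen.append(rng.choices(residues, weights=weights, k=1)[0])
        score = sum(p[aa] for p, aa in zip(position_probs, chosen)) / len(chosen)
        results.append(("".join(chosen), score))
    return results


# Toy distributions for a three-residue peptide (made-up numbers).
toy_probs = [
    {"M": 0.7, "E": 0.2, "K": 0.1},
    {"E": 0.5, "L": 0.3, "D": 0.2},
    {"N": 0.6, "E": 0.4},
]
samples = sample_sequences(toy_probs, 5, seed=0)
for seq, score in samples:
    print(seq, round(score, 3))
```

Passing a fixed `seed` makes the draws reproducible, which is handy when comparing generated sequences across runs.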

## Plotting

# Development

To build the library, run the following:

```
nbdev export
```

Then, pip install it in a development environment:

```
pip install -e '.[dev]'
```

I do quite a bit of work on a Chromebook, which means working on GitHub through Codespaces and on Google Colab. To install a particular commit hash of `berteome` in those environments, you can do the following:

In [11]:
!pip uninstall berteome

Found existing installation: berteome 0.1.5
Uninstalling berteome-0.1.5:
  Would remove:
    /usr/local/lib/python3.8/dist-packages/berteome-0.1.5.dist-info/*
    /usr/local/lib/python3.8/dist-packages/berteome/*
Proceed (y/n)? y
  Successfully uninstalled berteome-0.1.5


In [12]:
!pip install "berteome @ git+https://github.com/tijeco/berteome@08ec268f6d066acc885cf82625b6bb1d3865019d"

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting berteome@ git+https://github.com/tijeco/berteome@08ec268f6d066acc885cf82625b6bb1d3865019d
  Cloning https://github.com/tijeco/berteome (to revision 08ec268f6d066acc885cf82625b6bb1d3865019d) to /tmp/pip-install-r7afa7d6/berteome_92a02d8df4704f279e5d6306c6f351df
  Running command git clone -q https://github.com/tijeco/berteome /tmp/pip-install-r7afa7d6/berteome_92a02d8df4704f279e5d6306c6f351df
  Running command git rev-parse -q --verify 'sha^08ec268f6d066acc885cf82625b6bb1d3865019d'
  Running command git fetch -q https://github.com/tijeco/berteome 08ec268f6d066acc885cf82625b6bb1d3865019d
  Running command git checkout -q 08ec268f6d066acc885cf82625b6bb1d3865019d
Building wheels for collected packages: berteome
  Building wheel for berteome (setup.py) ... done
  Created wheel for berteome: filename=berteome-0.1.5-py3-none-any.whl size=16681 sha256=29d9a4a4b4117df4960344