Alright, so I think I have a pretty neat idea for this package, so I'm just going to document a few of my thoughts. This is an augment package, where the idea is to take a protein sequence and generate a bunch of point-mutation variants. I think this would be quite handy in certain applications, for instance antimicrobial peptide classification. There aren't too many known examples of antimicrobial peptides, but tons and tons of examples of non-antimicrobial peptides.

If this were a computer vision problem, say we were trying to build a classifier for four-leaf clovers vs three-leaf clovers, you'd have a much easier time finding examples of three-leaf clovers than four-leaf ones. To compensate for this so that you have an evenish sampling, you can augment the dataset by manipulating the pictures you do have and jittering the images about to get a kind of pseudoreplicate. Sure, there are lots of ways to approach that problem, but protein sequences are not so easy to augment. If they are all in the same gene family, you can make a multiple sequence alignment of them and reconstruct their ancestral states using phylogenetic approaches; this would roughly double the data size! But if that isn't the case, then what can you do?

Well, you can take the single point mutation approach. Say for a protein of length L, generate N variants where you randomly mutate a single amino acid residue. That is sort of fine in practice, but we know that some mutations are more deleterious than others. If only there were a way to determine which mutations weren't likely to be deleterious... Aha! So I think we can take the dataframes generated from ESM or prot_bert and generate the top k most likely mutational variants per residue.

Let's think about how that could potentially augment the data set. For a single peptide of length L, if we take the top k substitutions per residue, that gives L*k variants. For the toy MENDEL example where L is 6, k = 5 would generate 30 variants! Which is quite substantial, and for larger proteins I think you could get a reasonably large number of variants.
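
For comparison, here is a rough sketch of the random single-point baseline described above, plus the L*k arithmetic. The helper name `random_point_mutants` and its `seed` argument are made up for illustration, and it assumes the sequence only contains the 20 standard residues.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def random_point_mutants(seq, n, seed=None):
    """Return up to n distinct single-point mutants of seq (hypothetical helper)."""
    rng = random.Random(seed)
    # Assuming only standard residues, each position has 19 alternatives
    max_variants = len(seq) * (len(AMINO_ACIDS) - 1)
    variants = set()
    while len(variants) < min(n, max_variants):
        pos = rng.randrange(len(seq))
        # Never "mutate" a position to the residue it already has
        sub = rng.choice([aa for aa in AMINO_ACIDS if aa != seq[pos]])
        variants.add(seq[:pos] + sub + seq[pos + 1:])
    return sorted(variants)

# Top-k approach instead yields L * k variants, e.g. 6 * 5 for MENDEL
n_top_k = len("MENDEL") * 5
```

The random version treats every substitution as equally plausible, which is exactly the weakness the top-k idea is meant to address.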

Alright, so here is what I plan to do. The prediction dataframe is pretty core to this, though for demonstration purposes I'll probably just handmake a MENDEL dataframe. From that, the main function will be an augmentPep function, which will take a prediction df for a single peptide, as well as k to dictate how many mutational variants to generate per residue. An obvious caveat is that the wild-type residue will probably often be in the top k possible substitutions, so there will need to be a small check to ensure we are getting the top k residues that are not wt. Beyond that, two additional functions come to mind that can be built on top of this, augmentFasta and augmentPeps, which just take augmentPep and apply it to either a fasta file or an iterable list of peptides. I should also think about how to name mutants; I feel like just posResidue + ResidueSub would be sufficient to tag onto the original id.

Let's try to jump into this and make a dummy MENDEL dataframe.

In [None]:
import pandas as pd

In [None]:
mendelDF = pd.DataFrame.from_dict({
	#	 A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y
	"M":[0,0,0,0,0,0,0,0,0,8,9,7,0,0,0,0,0,0,0,0],
	"E":[0,0,7,8,9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
	"N":[0,0,0,0,0,0,0,0,0,9,8,7,0,0,0,0,0,0,0,0],
	"D":[0,7,8,9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
	# "E":[0,0,9,8,7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0], # redundant... for the example
	"L":[0,0,0,0,0,0,0,0,8,9,7,0,0,0,0,0,0,0,0,0]
}, orient="index", columns=list("ACDEFGHIKLMNPQRSTVWY")).reset_index().rename(columns={"index": "wt"})


mendelDF

Unnamed: 0,wt,A,C,D,E,F,G,H,I,K,...,M,N,P,Q,R,S,T,V,W,Y
0,M,0,0,0,0,0,0,0,0,0,...,9,7,0,0,0,0,0,0,0,0
1,E,0,0,7,8,9,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,N,0,0,0,0,0,0,0,0,0,...,8,7,0,0,0,0,0,0,0,0
3,D,0,7,8,9,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,L,0,0,0,0,0,0,0,0,8,...,7,0,0,0,0,0,0,0,0,0


Alright, so this is very much a barebones dummy df, but I think it will work.

For this example I only want to sample k=2. Going through each residue, the top non-wt substitutions should be something like:

1. M: L,N
2. E: D,F
3. N: L,M
4. D: C,E
<!-- 5. E: D,F -->
6. L: K,M

So that should give us 10 variant sequences (5 rows in the dummy df, 2 substitutions each)!

So let me go ahead and think about what the logic needs to be for this. There is certainly a chance that I will be able to use some fancy pandas stuff to do this, but for now I will just focus on what might work.

So I definitely know that I need to grab the top k values in each column.. wait, that's not right. No, I need to get the top values for each row! Then the tricky part is figuring out whether the column name is the same as that row's wt residue.


Oh gravy, there definitely has to be some sort of SQL-y or pandas-y way to get this.
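
One possible pandas-y route, sketched under the assumption that the prediction df has a `wt` column plus one column per amino acid (as in the dummy df above): melt to long format, drop the rows where the substitution equals the wild type, then take the first k rows per position after sorting by score. The function name `top_k_substitutions` and the `pos`, `sub`, and `score` column names are invented here.

```python
import pandas as pd

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")

def top_k_substitutions(df, k):
    """Long-format table of the k best non-wild-type residues per position."""
    long_df = (
        df.reset_index(drop=True)   # make sure positions run 0..L-1
          .rename_axis("pos")
          .reset_index()            # turn the position index into a column
          .melt(id_vars=["pos", "wt"], value_vars=AMINO_ACIDS,
                var_name="sub", value_name="score")
    )
    # Exclude "substitutions" that are just the wild-type residue
    long_df = long_df[long_df["sub"] != long_df["wt"]]
    return (
        long_df.sort_values(["pos", "score"], ascending=[True, False])
               .groupby("pos")
               .head(k)             # first k rows of each position, best first
               .reset_index(drop=True)
    )
```

Each row of the result is one (position, substitution) pair, so building the variant sequences from it is a simple loop over the rows.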


In [None]:
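def augmentPep(df, k):
	# Wild-type sequence as a mutable list of residues, one per row of df
	seqList = list(df["wt"])
	variantDict = {}
	# enumerate gives the 0-based sequence position even if df has a non-default index
	for pos, (_, row) in enumerate(df.iterrows()):
		scores = row[list("ACDEFGHIKLMNPQRSTVWY")]
		# Drop the wild-type residue, then keep the k highest-scoring substitutions
		top_k_scores = scores.drop(row["wt"], errors="ignore").sort_values(ascending=False).head(k)

		for res in top_k_scores.index:
			seqCopy = seqList.copy()
			seqCopy[pos] = res
			# Key is "<0-based position>x<substituted residue>"
			variantDict[f"{pos}x{res}"] = ''.join(seqCopy)

	return variantDict

augmentPep(mendelDF, 2)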


{'0xL': 'LENDL',
 '0xN': 'NENDL',
 '1xF': 'MFNDL',
 '1xD': 'MDNDL',
 '2xL': 'MELDL',
 '2xM': 'MEMDL',
 '3xE': 'MENEL',
 '3xC': 'MENCL',
 '4xK': 'MENDK',
 '4xM': 'MENDM'}

In [None]:
assert augmentPep(mendelDF, 2) == {'0xL': 'LENDL',
 '0xN': 'NENDL',
 '1xF': 'MFNDL',
 '1xD': 'MDNDL',
 '2xL': 'MELDL',
 '2xM': 'MEMDL',
 '3xE': 'MENEL',
 '3xC': 'MENCL',
 '4xK': 'MENDK',
 '4xM': 'MENDM'}

Alright!! In principle, that's working!!

The output structure needs to be modified so that the mutational variant is annotated as "pos_res".

I think I like this..
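
The "pos_res" annotation mentioned above could be sketched as a small relabelling step on the dict that augmentPep returns. The helper name `relabel_variants`, the `seq_id` prefix, and the choice of 1-based positions by default are all assumptions on my part, not something settled in these notes.

```python
def relabel_variants(variant_dict, seq_id, one_based=True):
    """Turn keys like '0xL' into '<seq_id>_<pos>_<res>' (hypothetical helper)."""
    relabeled = {}
    for tag, seq in variant_dict.items():
        # Keys from augmentPep look like "<0-based position>x<residue>"
        pos, res = tag.split("x")
        pos = int(pos) + (1 if one_based else 0)
        relabeled[f"{seq_id}_{pos}_{res}"] = seq
    return relabeled
```

Prefixing with the original id keeps variants from different peptides distinguishable once they are pooled into one dataset.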

Let's see how it works on esm/prot_bert dataframes

In [None]:
from berteome import prot_bert

Some weights of the model checkpoint at Rostlab/prot_bert were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
mendel_prot_bert_DF = prot_bert.bertPredictionDF("MENDEL")

In [None]:
mendel_prot_bert_DF

Unnamed: 0,wt,wtIndex,A,C,D,E,F,G,H,I,...,M,N,P,Q,R,S,T,V,W,Y
0,M,1,0.036685,0.011501,0.048229,0.118868,0.024064,0.03919,0.012617,0.066477,...,0.07658,0.072637,0.024714,0.03866,0.043091,0.070257,0.056526,0.049911,0.007779,0.021692
1,E,2,0.045712,0.015659,0.041913,0.074816,0.037146,0.044317,0.01826,0.073063,...,0.043572,0.062655,0.025272,0.036905,0.055532,0.064412,0.049945,0.056779,0.012689,0.029887
2,N,3,0.043558,0.009684,0.162566,0.184336,0.033777,0.044654,0.012353,0.052622,...,0.041478,0.041984,0.019989,0.025511,0.029428,0.048098,0.030299,0.054734,0.007428,0.02492
3,D,4,0.042079,0.013243,0.049744,0.086189,0.039733,0.055907,0.01686,0.073291,...,0.040078,0.060817,0.032022,0.039686,0.046224,0.062319,0.044898,0.058933,0.010875,0.026594
4,E,5,0.046638,0.018769,0.079816,0.086908,0.050634,0.050462,0.022395,0.074495,...,0.02896,0.062229,0.023877,0.030532,0.040486,0.06519,0.044934,0.068032,0.012155,0.038031
5,L,6,0.035695,0.008615,0.060928,0.142576,0.019581,0.046287,0.013043,0.060374,...,0.037424,0.090177,0.019358,0.032733,0.043823,0.045863,0.043224,0.045121,0.0098,0.021241


In [None]:
augmentPep(mendel_prot_bert_DF, 2)

{'0xE': 'EENDEL',
 '0xK': 'KENDEL',
 '1xL': 'MLNDEL',
 '1xK': 'MKNDEL',
 '2xE': 'MEEDEL',
 '2xD': 'MEDDEL',
 '3xL': 'MENLEL',
 '3xK': 'MENKEL',
 '4xL': 'MENDLL',
 '4xD': 'MENDDL',
 '5xE': 'MENDEE',
 '5xK': 'MENDEK'}

In [None]:
from berteome import esm

In [None]:
mendel_esm_DF = esm.esmPredictionDF("MENDEL")

In [None]:
augmentPep(mendel_esm_DF, 2)

{'0xE': 'EENDEL',
 '0xD': 'DENDEL',
 '1xL': 'MLNDEL',
 '1xS': 'MSNDEL',
 '2xE': 'MEEDEL',
 '2xL': 'MELDEL',
 '3xL': 'MENLEL',
 '3xK': 'MENKEL',
 '4xL': 'MENDLL',
 '4xS': 'MENDSL',
 '5xE': 'MENDEE',
 '5xK': 'MENDEK'}

As far as I'm concerned, this is working! It shouldn't be too difficult to use this to make an `augmentPeps` and an `augmentFasta`, which take multiple peptides and return their top k variants.

I don't entirely need this right this moment, so maybe I should just work on that when I definitely need it or at least have more time, because it will probably take a couple of hours to finish.
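
For whenever that time comes, here is a minimal sketch of one shape the multi-peptide wrappers might take. Everything in it is an assumption: the snake_case names, passing the prediction and augmentation functions in as arguments, the `id_tag` key format, and the tiny hand-rolled FASTA reader (a real version might prefer an existing parser).

```python
def read_fasta(path):
    """Minimal FASTA reader returning {record id: sequence}."""
    records, current_id = {}, None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Use the first whitespace-delimited token of the header as the id
                header = line[1:].split()
                current_id = header[0] if header else f"seq{len(records)}"
                records[current_id] = ""
            elif current_id is not None:
                records[current_id] += line
    return records

def augment_peps(peptides, predict_fn, augment_fn, k):
    """Apply augment_fn to every peptide in a {id: sequence} mapping."""
    variants = {}
    for pep_id, seq in peptides.items():
        pred_df = predict_fn(seq)  # one prediction df per peptide
        for tag, variant in augment_fn(pred_df, k).items():
            variants[f"{pep_id}_{tag}"] = variant
    return variants

def augment_fasta(path, predict_fn, augment_fn, k):
    """Same as augment_peps, but reading the peptides from a FASTA file."""
    return augment_peps(read_fasta(path), predict_fn, augment_fn, k)
```

With the pieces from this notebook, a call might look like `augment_peps({"mendel": "MENDEL"}, prot_bert.bertPredictionDF, augmentPep, 2)`.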