# Models

There are several possible configurations of the Multibind model, and this tutorial goes through all of them. The most important distinction is the type of data to be modeled: SELEX, PBM, or other genomic data such as scATAC.

In [1]:
import multibind as mb
import bindome as bd
bd.constants.ANNOTATIONS_DIRECTORY = '../../../annotations'
import pandas as pd
import numpy as np
import scipy.io  # explicit submodule import so that scipy.io.loadmat is available
import torch
import torch.utils.data as tdata
import os
import pickle

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print("Using device: " + str(device))

Using device: cpu


## Modeling SELEX data

First we look at how to create a model for SELEX data. This count-based model uses ideas from [Rube22](https://doi.org/10.1038/s41587-022-01307-0). Assume we have one dataset from a single experiment with two rounds, round zero and round one. We can then create a dataset object of the class `mb.datasets.SelexDataset`.

In [2]:
data = mb.bindome.datasets.ProBound.ctcf(flank_length=0)
data

Unnamed: 0,seq,0,1
0,AAAAAAAGCCCGGAAATAGGCAACTTGTAG,0,1
1,AAAAAAAGGATGTTCCTAGCAACTTATAAA,1,0
2,AAAAAACAACGATAACCAACTGCTGCCGGA,0,1
3,AAAAAACACATGTATGAGTTTTTGATGGAG,1,0
4,AAAAAACCCTCCTTGGTGTCGGACGGCTAT,0,1
...,...,...,...
120091,TTTTTTTTCTTCATTGTTACAGTAGGTAGC,1,0
120092,TTTTTTTTGACTGCTTGGCTGGCTCCTGTG,1,0
120093,TTTTTTTTGGTCGGATTCGCTGTTGTTCAC,0,1
120094,TTTTTTTTTGAACCGGCCGCTCCTATGATC,1,0


In [3]:
dataset = mb.datasets.SelexDataset(data, n_rounds=1)
train = tdata.DataLoader(dataset=dataset, batch_size=256, shuffle=True)

For SELEX data, the model first computes a score $S_r$ that indicates the strength of binding in every experimental round $r$. The parameters $a$ are the activities, $\vec{b}$ are the binding modes, and $\vec{X}$ denotes the one-hot encoded input sequence.

$S_r = a_{n.s., r} + \sum_b a_{b, r} \sum_x e^{\vec{b} \cdot \vec{X}}$.
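
The score formula above can be sketched in plain NumPy. This is only an illustration, not the Multibind implementation: it assumes that the inner sum runs over all windows $x$ of the sequence, scans the forward strand only, and uses the hypothetical helper names `one_hot`, `mode_score` and `round_score`.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (4, len(seq)) matrix with a single 1 per column."""
    encoded = np.zeros((len(BASES), len(seq)))
    for position, base in enumerate(seq):
        encoded[BASES.index(base), position] = 1.0
    return encoded

def mode_score(binding_mode, encoded_seq):
    """Sum exp(b . X_x) over all windows x of the sequence (forward strand only)."""
    width = binding_mode.shape[1]
    n_windows = encoded_seq.shape[1] - width + 1
    energies = [
        np.sum(binding_mode * encoded_seq[:, start:start + width])
        for start in range(n_windows)
    ]
    return float(np.sum(np.exp(energies)))

def round_score(a_ns, activities, binding_modes, encoded_seq):
    """S_r = a_ns + sum_b a_b * sum_x exp(b . X_x) for a single round r."""
    return a_ns + sum(
        a * mode_score(b, encoded_seq) for a, b in zip(activities, binding_modes)
    )

# Toy binding mode of width 2 that rewards the dinucleotide "AC".
mode_ac = np.zeros((4, 2))
mode_ac[BASES.index("A"), 0] = 1.0
mode_ac[BASES.index("C"), 1] = 1.0

score = round_score(
    a_ns=0.1, activities=[0.5], binding_modes=[mode_ac], encoded_seq=one_hot("ACGT")
)
print(score)
```

The window "AC" contributes $e^2$ and the two non-matching windows contribute $e^0 = 1$ each, so the non-specific term and the activity simply offset and rescale that sum.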

After that step, the enrichment over the rounds is calculated by

$Enr_r = \prod_{i=1}^r S_i$

and then converted to counts by a normalization step.
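
As a rough sketch of these two steps, the enrichment is a cumulative product of the per-round scores. The normalization shown here is an assumption chosen for illustration (rescaling every round to a hypothetical sequencing depth `reads_per_round`); the text does not specify the scheme used by the library, and the toy scores are made up.

```python
import numpy as np

# Hypothetical per-round scores S_1..S_3 for two sequences (one row per sequence).
scores = np.array([
    [2.0, 2.0, 2.0],  # strongly bound sequence
    [0.5, 0.5, 0.5],  # weakly bound sequence
])

# Enr_r = prod_{i=1}^{r} S_i is a cumulative product along the round axis.
enrichment = np.cumprod(scores, axis=1)

# Round zero is the unselected library; assume an enrichment of 1 there.
round_zero = np.ones((scores.shape[0], 1))
enrichment = np.concatenate([round_zero, enrichment], axis=1)

# Assumed normalization: scale each round so that predictions sum to its read depth.
reads_per_round = np.array([100.0, 100.0, 100.0, 100.0])
predicted_counts = enrichment / enrichment.sum(axis=0, keepdims=True) * reads_per_round

print(enrichment)
print(predicted_counts.round(1))
```

With a constant score above one, a sequence takes up a growing share of the reads in every round, while a sequence with a score below one is progressively depleted.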

Such a model can be initialized as follows:

In [None]:
model = mb.models.Multibind(datatype="selex", n_rounds=4, n_kernels=3)

With `n_kernels=3`, this gives us a model that has a parameter for unspecific binding and parameters for two additional binding modes. In most use cases we do not need to initialize the model ourselves; `mb.tl.train_iterative()` does this for us.

If we have several (e.g. two) SELEX experiments that were performed with the same protein and the same number of rounds, we need to indicate this in the `mb.datasets.SelexDataset`. Here we do so by adding a `batch` column to the data.

In [4]:
data = mb.bindome.datasets.ProBound.ctcf(flank_length=0)
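# first half of the rows -> batch "A", remaining rows -> batch "B" (also works for odd lengths)
data["batch"] = np.where(np.arange(len(data)) < len(data) // 2, "A", "B")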
data

Unnamed: 0,seq,0,1,batch
0,AAAAAAAGCCCGGAAATAGGCAACTTGTAG,0,1,A
1,AAAAAAAGGATGTTCCTAGCAACTTATAAA,1,0,A
2,AAAAAACAACGATAACCAACTGCTGCCGGA,0,1,A
3,AAAAAACACATGTATGAGTTTTTGATGGAG,1,0,A
4,AAAAAACCCTCCTTGGTGTCGGACGGCTAT,0,1,A
...,...,...,...,...
120091,TTTTTTTTCTTCATTGTTACAGTAGGTAGC,1,0,B
120092,TTTTTTTTGACTGCTTGGCTGGCTCCTGTG,1,0,B
120093,TTTTTTTTGGTCGGATTCGCTGTTGTTCAC,0,1,B
120094,TTTTTTTTTGAACCGGCCGCTCCTATGATC,1,0,B


In [5]:
dataset = mb.datasets.SelexDataset(data, n_rounds=1, labels=[0, 1])
train = tdata.DataLoader(dataset=dataset, batch_size=256, shuffle=True)

Then we can use the following model:

In [None]:
model = mb.models.Multibind(datatype="selex", n_rounds=4, n_kernels=3, n_batches=5)

## Modeling PBM data

Next we look at the options for PBM data. Assume we have one dataset with multiple proteins and multiple DNA sequences. First we need to shift the measured signal so that the smallest value is zero. Then we can create a `mb.datasets.PBMDataset` object.

In [6]:
matlab_path = os.path.join(bd.constants.ANNOTATIONS_DIRECTORY, 'pbm', 'affreg', 'PbmDataHom6_norm.mat')
mat = scipy.io.loadmat(matlab_path)
data = mat['PbmData'][0]
seqs_dna = data[0][5]
seqs_dna = [s[0][0] for s in seqs_dna]
# load the MSA sequences, one hot encoded
df, signal = bd.datasets.PBM.pbm_homeo_affreg()
# x, y = pickle.load(open('../../data/example_homeo_PbmData.pkl', 'rb'))
with open(os.path.join(bd.constants.ANNOTATIONS_DIRECTORY, 'pbm/example_homeo_PbmData.pkl'), 'rb') as f:
    x, y = pickle.load(f)

# Set up the dataframe: one row per DNA sequence, one column per protein
df = pd.DataFrame(signal.T)
df['seq'] = seqs_dna
df.index = df['seq']
del df['seq']
# shift every column so that its smallest value is zero
df -= df.min()

df

seq,0,1,2,3,4,5,6,7,8,9,...,168,169,170,171,172,173,174,175,176,177
TAGCTTTCCAAAATTCACCAGTAACTTGGTAAATCCGTCTGTGTTCCGTTGTCCGTGCTG,14.074130,11.894619,13.774439,8.808984,8.268782,14.754835,13.638542,9.100628,10.618406,10.080940,...,9.524837,14.552818,12.600934,37.484852,11.189450,14.685014,18.448307,9.905655,14.942296,12.645746
CGCATGCCCGAGCCTAATTGCTTCTTCGTCGTAGTCGTCTGTGTTCCGTTGTCCGTGCTG,15.046353,12.855888,17.880991,10.092453,10.175521,15.899379,15.671677,10.596622,12.771672,12.296549,...,7.726180,13.517212,11.836590,37.536018,11.103490,14.903232,21.093563,11.908670,16.154250,14.110695
GTCTATTTTAAAAACAATACGCACGCCCGTCATATAGTCTGTGTTCCGTTGTCCGTGCTG,14.677882,11.799079,13.770519,9.351587,8.768532,14.975954,13.734794,9.104438,10.930431,9.963806,...,8.883301,14.268490,12.982803,38.106815,11.577204,15.080433,18.676463,10.253420,15.606891,12.875898
CGATTTCCCTCCGTTCTCACACCTAGACGGTTTCCAGTCTGTGTTCCGTTGTCCGTGCTG,13.749306,11.447418,13.712889,8.438002,7.762961,14.369059,13.259086,8.685034,10.134055,9.537355,...,8.092311,13.943988,12.443106,37.637481,11.000383,14.092121,17.980643,9.958218,15.060158,12.229817
AGCTATAAGGACAACGCTTCGCGCGCGCAATCATACGTCTGTGTTCCGTTGTCCGTGCTG,13.867320,10.276587,12.700894,9.158781,8.015545,14.244715,13.411069,8.954147,10.560594,9.124708,...,7.515072,13.442757,12.287620,37.877111,11.433560,15.052433,18.015072,9.619749,14.667666,11.876489
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
GGGTACGCAGTACCTCACTGCCGAACTGACACTCAAGTCTGTGTTCCGTTGTCCGTGCTG,13.717141,10.975625,12.919744,9.253869,7.236386,13.490946,13.063000,7.819911,10.035072,9.028914,...,7.705765,13.114445,13.680240,37.758438,11.035615,14.529917,17.640501,9.543198,14.617094,11.848627
GTGACAGGTGGTTCGTTGTACCGTGATCTTAGATCGGTCTGTGTTCCGTTGTCCGTGCTG,14.024820,11.971725,13.179835,9.078670,7.542659,13.471047,13.328045,8.486929,10.571290,9.583800,...,7.753925,13.653856,16.833282,39.930070,11.427839,14.324456,17.727517,9.673005,15.088025,12.208795
GGCACGCGATAAAATCATGATTGATTGAGAATACAGGTCTGTGTTCCGTTGTCCGTGCTG,13.498761,11.451631,14.616221,9.520041,9.024161,13.993389,13.943477,8.930199,12.095298,9.860593,...,8.230562,13.179457,12.443699,37.418914,11.154854,15.299397,19.339742,11.119318,15.456251,12.317442
ATAGTGCAACTTCCACAAGTGAACGCACGTCCTAGGGTCTGTGTTCCGTTGTCCGTGCTG,12.718854,10.979732,13.138953,9.266587,7.644174,14.049973,13.048131,7.959475,10.198705,9.129285,...,8.204274,13.232719,11.984509,37.679619,11.571540,14.099203,17.647029,9.530850,14.511629,11.742010
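
The shift `df -= df.min()` used in the cell above subtracts the minimum of each column, so every protein's smallest signal becomes zero. The toy frame below (made-up values, hypothetical column names) contrasts this with a single global shift, where only the overall minimum becomes zero. Which variant suits a given dataset is a modeling choice that this sketch does not settle.

```python
import pandas as pd

# Toy signal matrix: rows are DNA sequences, columns are proteins.
toy = pd.DataFrame(
    {"protein_0": [3.0, 5.0, 4.0], "protein_1": [10.0, 12.0, 11.0]},
    index=pd.Index(["ACGT", "CGTA", "GTAC"], name="seq"),
)

# Per-column shift (what `df -= df.min()` does): each column's minimum becomes zero.
per_protein = toy - toy.min()

# Global shift: only the smallest value of the whole table becomes zero.
global_shift = toy - toy.min().min()

print(per_protein)
print(global_shift)
```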


In [7]:
dataset = mb.datasets.PBMDataset(df)
train = tdata.DataLoader(dataset=dataset, batch_size=256, shuffle=True)

The model for PBM data also computes scores that quantify binding. As before, the parameters $a$ are the activities, $\vec{b}$ are the binding modes, and $\vec{X}$ denotes the one-hot encoded input sequence. The scores are learned per protein $p$.

$S_p = a_{n.s., p} + \sum_b a_{b, p} \sum_x e^{\vec{b} \cdot \vec{X}}$.

Here we learn the binding modes jointly for all proteins. We need the following model:

In [None]:
model = mb.models.Multibind(datatype="pbm", n_kernels=4)

The binding modes are shared across all proteins, but the activities are learned per protein.
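
Sharing the binding modes while learning activities per protein amounts to a matrix product. The following NumPy sketch uses random toy values and hypothetical array names; `mode_sums[s, b]` stands for the precomputed term $\sum_x e^{\vec{b} \cdot \vec{X}}$ of binding mode $b$ on sequence $s$. It is not the library's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_seqs, n_proteins, n_modes = 5, 3, 2

# Shared across proteins: one summed binding-mode response per sequence and mode.
mode_sums = rng.uniform(0.5, 2.0, size=(n_seqs, n_modes))

# Protein-specific: one activity per (mode, protein) and one non-specific term per protein.
activities = rng.uniform(0.1, 1.0, size=(n_modes, n_proteins))
a_ns = rng.uniform(0.0, 0.1, size=(1, n_proteins))

# S_p = a_ns_p + sum_b a_{b,p} * mode_sums[:, b], evaluated for all sequences and proteins.
scores = a_ns + mode_sums @ activities

print(scores.shape)
```

Only `activities` and `a_ns` carry a protein axis, which is why a single set of binding modes can serve every protein in the dataset.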

If we want to learn independent binding modes per protein, we need a generator class that stores the binding modes. An object of `mb.models.BMCollection` serves this purpose.

In [None]:
n_proteins = 178  # assuming the dataset contains 178 proteins
bm_generator = mb.models.BMCollection(n_proteins=n_proteins, n_kernels=3)
model = mb.models.Multibind(datatype="pbm", bm_generator=bm_generator)

If we additionally know the residue sequences of the proteins, we can initialize a `mb.datasets.ResiduePBMDataset`.

In [8]:
dataset = mb.datasets.ResiduePBMDataset(df, x)
train = tdata.DataLoader(dataset=dataset, batch_size=128, shuffle=True)

Then we use `mb.models.BMPrediction` to predict binding modes from the residue sequences. At the moment, an LSTM is used for that task.

In [None]:
bm_generator = mb.models.BMPrediction(
    input_size=21, 
    hidden_size=2, 
    num_layers=1, 
    seq_length=train.dataset.get_max_residue_length(),
)
model = mb.models.Multibind(datatype="pbm", bm_generator=bm_generator)
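
One plausible reading of `input_size=21` is a one-hot alphabet of the 20 standard amino acids plus a gap symbol for aligned sequences. This is an assumption, not something the tutorial states. Under that assumption, the sketch below shows how residue sequences could be padded to a common length and encoded. The helper `encode_residues` and the alphabet order are hypothetical and need not match the encoding stored in `x`.

```python
import numpy as np

# Assumed alphabet: 20 standard amino acids plus "-" for alignment gaps (21 symbols).
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"

def encode_residues(seqs):
    """Return an array of shape (n_seqs, max_length, 21) with one-hot encoded residues."""
    max_len = max(len(s) for s in seqs)
    encoded = np.zeros((len(seqs), max_len, len(ALPHABET)))
    for i, seq in enumerate(seqs):
        padded = seq.ljust(max_len, "-")  # pad shorter sequences with gap symbols
        for j, residue in enumerate(padded):
            encoded[i, j, ALPHABET.index(residue)] = 1.0
    return encoded

x_toy = encode_residues(["MKR", "MK-RQ"])
print(x_toy.shape)
```

In this layout the last axis has the size passed as `input_size` and the middle axis corresponds to the maximum residue length passed as `seq_length`.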

## Modeling genomic data

This option can be used for scATAC data, for example. Internally, the model works similarly to the PBM model, so the measured signal must also be shifted such that the smallest value in the dataset is zero. We can then use the `mb.datasets.GenomicsDataset` class to store the data and initialize the corresponding model:

In [None]:
model = mb.models.Multibind(datatype="gen", n_kernels=4)

Here the binding modes are also shared across all cells.

The other model options are available here as well and work in the same way as for PBM data; we just need to replace `"pbm"` with `"gen"`.

There are further parameters that are beyond the scope of this tutorial; they are described in the API documentation.