Loads the full library of 160,000 GB1 variants and computes the deterministic baselines (implemented in bases.py).

In [12]:
import torch
import itertools
import pickle

from scipy.stats import norm
import operator

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style('white')
sns.set_context('paper')
# Plot adjustments:
plt.rcParams.update({'ytick.labelsize': 15})
plt.rcParams.update({'xtick.labelsize': 15})
plt.rcParams.update({'axes.labelsize': 35})
plt.rcParams.update({'legend.fontsize': 30})
plt.rcParams.update({'axes.titlesize': 16})

from gptorch import kernels, models
import bases, helpers, opt

In [13]:
with open('../inputs/GB1.pkl', 'rb') as f:
    t = pickle.load(f)

X = t[0]         # one-hot encoding of each variant
T = t[1]         # tokenized encoding of each variant
y = t[3].values  # y-values as a NumPy array, aligned with the rows of X

In [14]:
seq_to_x = {} # maps each amino-acid sequence (string) to its row index in X
for i, x in enumerate(X):
    seq = helpers.decode_X(x)
    seq_to_x[seq] = i

For a given X, this baseline is a sequence built from the best amino acid (the one with the highest y-value) at each of the four variable positions of the wildtype variant. It is computed greedily: first, the amino acid at the first position is varied while the other three positions are held at their wildtype amino acids; next, the amino acid at the second position is varied while the first position is held at the best amino acid just chosen and the remaining two positions are held at wildtype; and so on. Thus, after the first step, the amino acids at the fixed positions are not necessarily all wildtype.

The function det_fixed() in bases.py computes this baseline and returns a list of all 24 possible optimal untested variants (as strings). The variant in that list with the highest y-value is then taken as the deterministic baseline.
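The greedy procedure described above can be sketched as follows. This is a minimal illustration, not the code in bases.py: the names `greedy_fixed_sketch`, `all_orderings_sketch`, and the `fitness` dictionary (sequence string to y-value) are hypothetical. It also assumes that the 24 candidates correspond to the 4! = 24 orders in which the four positions can be visited. The printed candidate list is consistent with that reading, but it remains an assumption.

```python
import itertools

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def greedy_fixed_sketch(wt, fitness, order, alphabet=AMINO_ACIDS):
    """Visit positions in `order`; at each one keep the best amino acid,
    holding earlier choices fixed and unvisited positions at wildtype."""
    current = list(wt)
    for pos in order:
        def score(aa, pos=pos):
            trial = current[:pos] + [aa] + current[pos + 1:]
            # Sequences missing from the lookup are never selected.
            return fitness.get("".join(trial), float("-inf"))
        current[pos] = max(alphabet, key=score)
    return "".join(current)


def all_orderings_sketch(wt, fitness, alphabet=AMINO_ACIDS):
    """One greedy candidate per ordering of the positions (assumption)."""
    return [
        greedy_fixed_sketch(wt, fitness, order, alphabet)
        for order in itertools.permutations(range(len(wt)))
    ]


# Toy two-position landscape over a two-letter alphabet (made-up values).
toy = {"AA": 0.0, "BA": 1.0, "AB": 2.0, "BB": -1.0}
candidates = all_orderings_sketch("AA", toy, alphabet="AB")
best = max(candidates, key=toy.get)
print(candidates, best)
```

On the real data, `fitness` could be built from the decoded sequences and `y`; the toy landscape is only there to keep the sketch self-contained.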

In [18]:
# DET BASELINE_FIXED

wt = 'VDGV' # wt as string
seqs = bases.det_fixed(wt, X, y)
print("aa: {}".format(seqs))

# Look up the y-value of each of the 24 candidates returned by det_fixed(), then keep the sequence with the max y

seqs = list(set(seqs)) # remove duplicates
X_decode = [helpers.decode_X(x) for x in X]
ys_baseline = [y[seq_to_x[s]] for s in seqs] # constant-time lookup via the seq_to_x dictionary built earlier
max_baseline = seqs[ys_baseline.index(max(ys_baseline))]

y_seq1 = max(ys_baseline)
print("best baseline: {}".format(max_baseline))
print("y value: {}".format(y_seq1))
print("global max: {}".format(np.max(y)))

aa: ['ARQL', 'ARQC', 'ACQL', 'AVQL', 'AVFN', 'AEFN', 'ARQL', 'ARQC', 'IRQL', 'CRQA', 'YRCQ', 'YRIQ', 'YVQY', 'YYQA', 'YVQY', 'GVQR', 'VEQA', 'CEQA', 'MFHN', 'MFHN', 'MVHN', 'CVRN', 'CVRN', 'CVRN']
best baseline: YRCQ
y value: 0.6716957389071662
global max: 0.9962089758370211


For a given X, this baseline is also a sequence made of the best amino acid (the one with the highest y-value) at each of the four variable positions of the wildtype variant. Here, however, each position is treated independently: the amino acid at one position is varied while the other three positions are held at their wildtype amino acids, and the best amino acid found at each position is then combined into a single sequence. The fixed part in every iteration is therefore always taken from the wildtype sequence.

The function det_vary() in bases.py computes this baseline and returns a single variant (as a string), which is taken as the deterministic baseline.
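The independent, position-by-position procedure can be sketched as below. Again, this is only an illustration under stated assumptions, not the code in bases.py: `independent_vary_sketch` and the `fitness` dictionary (sequence string to y-value) are hypothetical names, and the toy landscape is made up.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def independent_vary_sketch(wt, fitness, alphabet=AMINO_ACIDS):
    """For each position, find the best single substitution in the
    wildtype background, then combine the per-position winners."""
    best = []
    for pos in range(len(wt)):
        def score(aa, pos=pos):
            trial = wt[:pos] + aa + wt[pos + 1:]
            # Sequences missing from the lookup are never selected.
            return fitness.get(trial, float("-inf"))
        best.append(max(alphabet, key=score))
    return "".join(best)


# Toy two-position landscape over a two-letter alphabet (made-up values).
toy = {"AA": 0.0, "BA": 1.0, "AB": 2.0, "BB": -1.0}
seq = independent_vary_sketch("AA", toy, alphabet="AB")
print(seq, toy[seq])
```

In the toy landscape each single substitution improves on the wildtype, yet their combination scores worse than the wildtype. Because the combined sequence is never evaluated during the search, this baseline can land on a poor variant.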

In [19]:
# DET BASELINE_VARY

seq2 = bases.det_vary(wt, X, y)
print("aa: {}".format(seq2))

y_seq2 = y[X_decode.index(seq2)]
print("y value: {}".format(y_seq2))
print("global max: {}".format(np.max(y)))

aa: ARQN
y value: -1.3947150084629183
global max: 0.9962089758370211
