## Homework 3.3: Site-saturation mutagenesis (20 points)

One approach for diversity generation is to do saturation mutagenesis on individual residues (site-saturation mutagenesis). Degenerate oligos are used to build saturation mutagenesis libraries. Different methods have been developed to build site-saturation libraries with reduced redundancy, and these are outlined in [this paper by Kille, et al.](https://doi.org/10.1021/sb300037w) You should read that paper to help you work through this problem.

## Part a

Name two advantages that site-saturation mutagenesis has over ePCR. 

<hr>

1. Site-saturation mutagenesis produces a library of well-defined size (20^n amino acid sequences, where n is the number of saturated sites), whereas the size and composition of an ePCR library are not well defined.
2. Site-saturation mutagenesis lets the experimenter exploit available information about the target protein by focusing on specific sites (or combinations of sites) that are believed to affect activity or stability. For example, it can target residues in the active site, which are likely to control activity.

## Part b

What is the risk of evolving a protein by focusing on individual residues?

<hr>

By focusing on individual residues, an experimenter risks exploring a residue or collection of residues that does not significantly contribute to activity or stability. If the hypothesis about which residues are most relevant for fitness or stability gains is incorrect, the experimenter may get stuck in fitness dead space or at a local fitness optimum instead of converging on the global maximum of the sequence-fitness landscape.

## Part c

Plot the number of codons that code for each amino acid for the NNN, NNK, and NDT/VHG/TGG degenerate codon libraries. That is, on the x-axis you should have the amino acids and on the y-axis you should have the respective number of codons that code for them. How many total codons are there in each respective saturation library?

For convenient reference, here are the bases each symbol encompasses.

| Symbol     | Bases      |
| :--------: | :--------: |
| N          | A, G, T, C |
| K          | G, T       |
| D          | A, G, T    |
| V          | A, G, C    |
| H          | A, T, C    |

<hr>

In [1]:
def get_codons(codon_lib):
    """Expand a degenerate codon into a list of explicit codons.

    `codon_lib` holds the allowed bases at each of the three positions.
    """
    codon_list = []
    for first_base in codon_lib[0]:
        for second_base in codon_lib[1]:
            for third_base in codon_lib[2]:
                codon_list.append(first_base + second_base + third_base)

    # Translation is handled separately by translate(), which looks each
    # codon up in the full genetic code table (codon_dict)
    return codon_list

# build list of all natural codons
bases = "TCAG"
codon_list = []
for first_base in bases:
    for second_base in bases:
        for third_base in bases:
            codon_list += [first_base + second_base + third_base]

# The amino acids that are coded for (* = STOP codon)
amino_acids = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Build dictionary from tuple of 2-tuples (technically an iterator, but it works)
codon_dict = dict(zip(codon_list, amino_acids))

def translate(codons):
    """Translate a list of codons into a protein sequence."""

    translated = []
    for c in codons:
        translated.append(codon_dict[c])

    return translated

def count_aa(lib):
    aa = list("ACDEFGHIKLMNPQRSTVWY*")
    count = dict(zip(aa, [0]*len(aa)))

    for c in lib:
        count[c] += 1
    return count
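
As a cross-check on the nested loops above, the same codon table can be built more compactly with `itertools.product`. This is only a sketch and not part of the original solution: the names `CODON_TABLE` and `expand` are introduced here for illustration, and the standard genetic code in TCAG order is assumed.

```python
import itertools

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# product() varies the last position fastest, which matches the
# TCAG ordering of AMINO_ACIDS (TTT, TTC, TTA, TTG, TCT, ...).
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(itertools.product(BASES, repeat=3), AMINO_ACIDS)
}

def expand(degenerate_codon):
    """Expand per-position base choices, e.g. ["ACGT", "AGT", "T"] for NDT."""
    return ["".join(c) for c in itertools.product(*degenerate_codon)]

# Sanity checks on well-known codons, then the NDT library
print(CODON_TABLE["ATG"], CODON_TABLE["TGG"], CODON_TABLE["TAA"])
ndt = expand(["ACGT", "AGT", "T"])
print(len(ndt), sorted(CODON_TABLE[c] for c in ndt))
```

The first line should print `M W *`, and the NDT library should expand to 12 codons encoding 12 distinct amino acids.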

In [2]:
import bokeh.plotting
import bokeh.io

def plot_bokeh(lib, title):
    """Plot codon counts per amino acid as a stem plot."""

    bokeh.io.output_notebook()

    aa = list("ACDEFGHIKLMNPQRSTVWY*")

    p = bokeh.plotting.figure(
        frame_width=300,
        frame_height=150,
        x_axis_label="amino acid",
        y_axis_label="count",
        x_range=aa,
        title=title,
    )

    # One marker per amino acid, with a stem down to zero
    p.scatter(x=aa, y=lib)
    for amino_acid, y in zip(aa, lib):
        p.line(x=[amino_acid, amino_acid], y=[0, y])

    return bokeh.io.show(p)

In [3]:
N = ['A', 'G', 'C', 'T']
K = ['G', 'T']
D = ['A', 'G', 'T']
V = ['A', 'G', 'C']
H = ['A', 'C', 'T']

# build codon libraries
lib = {}
lib['NNN'] = get_codons([N, N, N])
lib['NNK'] = get_codons([N, N, K])
lib['NDT'] = get_codons([N, D, ['T']])
lib['VHG'] = get_codons([V, H, ['G']])
lib['TGG'] = get_codons([['T'], ['G'], ['G']])

# translate
trans_lib = {}
trans_lib['NNN'] = translate(lib['NNN'])
trans_lib['NNK'] = translate(lib['NNK'])
trans_lib['NDT'] = translate(lib['NDT'])
trans_lib['VHG'] = translate(lib['VHG'])
trans_lib['TGG'] = translate(lib['TGG'])
trans_combo_lib =  trans_lib['NDT'] + trans_lib['VHG'] + trans_lib['TGG']

# count
count_lib = {}
count_lib['NNN'] = count_aa(trans_lib['NNN'])
count_lib['NNK'] = count_aa(trans_lib['NNK'])
count_lib['NDT'] = count_aa(trans_lib['NDT'])
count_lib['VHG'] = count_aa(trans_lib['VHG'])
count_lib['TGG'] = count_aa(trans_lib['TGG'])
count_lib['NDT/VHG/TGG'] = count_aa(trans_combo_lib)

# plot
for l in count_lib:
    count = list(count_lib[l].values())
    total_codons = sum(count)
    plot_bokeh(count, title = str(l) + ', Total codons: ' + str(total_codons))
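
The plot titles report the total codon count for each library. As a compact summary of the same information, the sketch below tabulates the total codons, distinct amino acids, and stop codons for each library. It is self-contained and separate from the cells above; the helper names `expand` and `summarize` are illustrative only.

```python
import itertools

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(
    zip(("".join(c) for c in itertools.product(BASES, repeat=3)), AMINO_ACIDS)
)

# Degenerate base symbols from the table in the problem statement
SYMBOLS = {"N": "AGTC", "K": "GT", "D": "AGT", "V": "AGC", "H": "ATC"}

def expand(pattern):
    """Expand a degenerate codon such as 'NNK' into its explicit codons."""
    choices = [SYMBOLS.get(s, s) for s in pattern]  # plain bases map to themselves
    return ["".join(c) for c in itertools.product(*choices)]

def summarize(patterns):
    """Return (total codons, distinct amino acids, stop codons) for a primer mix."""
    codons = [c for p in patterns for c in expand(p)]
    translated = [CODON_TABLE[c] for c in codons]
    return len(codons), len(set(translated) - {"*"}), translated.count("*")

libraries = [
    ("NNN", ["NNN"]),
    ("NNK", ["NNK"]),
    ("NDT/VHG/TGG", ["NDT", "VHG", "TGG"]),
]
for name, patterns in libraries:
    total, n_aa, n_stop = summarize(patterns)
    print(f"{name}: {total} codons, {n_aa} amino acids, {n_stop} stop codons")
```

This gives 64 codons (3 stops) for NNN, 32 codons (1 stop) for NNK, and 22 codons (no stops) for NDT/VHG/TGG, with all 20 amino acids represented in each library.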



## Part d

The NDT/VHG/TGG site-saturation library requires three different primers. Is there any advantage to using these three primers separately (i.e. perform separate PCRs and transformations for the NDT, VHG, and TGG primers)? If there is an advantage, then why would one nonetheless choose to mix all three primers together?

*Hint*: Consider the number of colonies required to reach 95% library coverage for separate or combined transformations using formula (1) in [Kille, et al. 2012](https://doi.org/10.1021/sb300037w).

<hr>

In [4]:
import math

f = 0.95

def calc_lib_size(c):
    return - c*math.log(1 - f)

# separate transformation library size
NDT_lib = calc_lib_size(12)
print('NDT number of colonies for 95% library coverage: {:0.0f}'.format(float(NDT_lib)))

VHG_lib = calc_lib_size(c = 9)
print('VHG number of colonies for 95% library coverage: {:0.0f}'.format(float(VHG_lib)))

TGG_lib = 1 # 100% coverage for 1 mutation
print('TGG number of colonies for 95% library coverage: {:0.0f}'.format(float(TGG_lib)))

print('Total number of colonies for separate transformation libraries: {:0.0f}'.format(float(NDT_lib + VHG_lib + TGG_lib)))

# combined transformation library size
NDT_VHG_TGG_lib = calc_lib_size(c = 22)
print('NDT/VHG/TGG number of colonies for 95% library coverage: {:0.0f}'.format(float(NDT_VHG_TGG_lib)))


NDT number of colonies for 95% library coverage: 36
VHG number of colonies for 95% library coverage: 27
TGG number of colonies for 95% library coverage: 1
Total number of colonies for separate transformation libraries: 64
NDT/VHG/TGG number of colonies for 95% library coverage: 66


For SSM, reaching 95% library coverage requires 64 colonies in total for the separate transformation libraries (NDT, VHG, TGG) and 66 colonies for the combined transformation library (NDT/VHG/TGG). Separate transformations therefore offer a slight advantage in screening burden, but the experimenter can reduce the overall workload by using the combined library: mutagenesis and expression need to be completed only once rather than three times.
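
The colony numbers above come from the approximation T = -V ln(1 - F). As a rough check, the sketch below compares them with the exact expected coverage for sampling with replacement, 1 - (1 - 1/V)^T, assuming every codon in a library is equally likely. The function names are illustrative and not taken from the paper.

```python
import math

def colonies_needed(n_variants, coverage=0.95):
    """Colonies to screen for the target coverage, T = -V ln(1 - F)."""
    return -n_variants * math.log(1 - coverage)

def expected_coverage(n_variants, n_colonies):
    """Expected fraction of variants seen at least once, assuming each
    colony carries one of n_variants equally likely codons."""
    return 1 - (1 - 1 / n_variants) ** n_colonies

for name, v in [("NDT", 12), ("VHG", 9), ("NDT/VHG/TGG", 22)]:
    t = math.ceil(colonies_needed(v))
    print(f"{name}: {t} colonies, expected coverage {expected_coverage(v, t):.3f}")
```

Note that in the combined library valine and leucine are each encoded twice (once by NDT and once by VHG), so coverage at the amino acid level is slightly uneven even when codon coverage is uniform.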


## Part e

Optogenetics refers to the ability to control or monitor cellular activities with light using genetically encoded machinery. Light-activated microbial rhodopsins can be transgenically expressed in neurons to reversibly control and sense neural activity. Rhodopsins are a family of light-activated integral membrane proteins that adopt a seven trans-membrane $\alpha$-helical fold. The polyene chromophore retinal is covalently attached to the $\epsilon$-amino group of a conserved lysine residue on the seventh $\alpha$-helix through a protonated Schiff base (PSB) linkage.  Microbial rhodopsin pumps and channels are widely used for optogenetic applications. Light-triggered isomerization of retinal from all-*trans* to 13-*cis* initiates the rhodopsin photocycle and ultimately results in the movement of ions across the membrane. The absorption maximum of rhodopsin is determined by the energy gap between the resting state (S0) and excited state (S1) of the retinal chromophore. 

Consider the proton-pumping rhodopsin (PPR) *Gloeobacter violaceus* rhodopsin (GR). GR has 298 residues. Upon light activation, GR is weakly fluorescent and transports protons into the cell. Say you are screening variants of GR with a 96-well plate reader, allowing you to screen about 1500 colonies of a single library (a fairly high-throughput screen). You have the following options for library creation: 

* Error-prone PCR. 
* Site-saturation mutagenesis of the twenty residues within 5 Angstroms of the retinal (D121, W122, T125, V126, L129, M158, I159, G162, E166, G178, S181, T182, F185, W222, Y225, P226, D253, A256, and K257)

For each of the following libraries, how many variants are possible, and what is the coverage obtained by screening 1500 colonies?

1. Single amino acid mutants. What would the coverage be if you screened using an ePCR where each variant has exactly one amino acid different from the parent? What would the coverage (of single mutants) be if the number of amino acid mutations in your ePCR followed a Poisson distribution with an average of one mutation per variant?
2. Single site-saturation at each of the 20 locations of interest. 
3. Simultaneous site-saturation at all 20 locations of interest.

<hr>

In [5]:
import math

def calc_coverage(v, n_colonies=1500):
    """Fractional coverage F = 1 - exp(-T/V) from Kille, et al., for T screened colonies."""
    return 1 - math.exp(-n_colonies/v)
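
The coverage function above is a rearrangement of the library-size formula used in Part d. As a short derivation, with $T$ the number of colonies screened, $V$ the number of variants, and $F$ the fractional coverage (symbols chosen here to match the code, and assuming all variants are equally probable):

```latex
\begin{align}
T &= -V \ln(1 - F) \\
\ln(1 - F) &= -\frac{T}{V} \\
F &= 1 - e^{-T/V}
\end{align}
```

For example, $T = 1500$ and $V = 380$ give $F = 1 - e^{-1500/380} \approx 0.98$.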

In [6]:
# 3.3.1
from scipy.stats import poisson

v = 298*19  # single amino acid mutants: 298 positions x 19 substitutions

print('3.3.1, exactly 1 aa: Variants = {:0.0f}, Coverage = {:0.2f}'.format(v, calc_coverage(v)))

# Under a Poisson distribution with mean 1, only a fraction f of colonies
# carry exactly one mutation, so the library behaves as if it were 1/f times larger
f = poisson.pmf(1, 1)
v2 = v/f
print('3.3.1, Poisson distribution: Effective library size = {:0.0f}, Coverage = {:0.2f}'.format(v2, calc_coverage(v2)))

# 3.3.2
v = 20*19
print('3.3.2: Variants (excluding wt) = {:0.0f}, Coverage = {:0.2f}'.format(v, calc_coverage(v)))

# 3.3.3
v = 20**20 - 1  # 20 amino acids at each of 20 sites, minus the wild-type sequence
print('3.3.3: Variants (excluding wt) = {:e}, Coverage = {:0.2f}'.format(v, calc_coverage(v)))

3.3.1, exactly 1 aa: Variants = 5662, Coverage = 0.23
3.3.1, Poisson distribution: Effective library size = 15391, Coverage = 0.09
3.3.2: Variants (excluding wt) = 380, Coverage = 0.98
3.3.3: Variants (excluding wt) = 1.048576e+26, Coverage = 0.00
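
To put these coverages in perspective, the sketch below inverts the coverage formula to estimate how many colonies each single-mutant library would need for 95% coverage. It also recomputes the Poisson case by scaling the number of screened colonies instead of the library size, which is mathematically equivalent. The function and variable names are illustrative, and the 95% target is an assumption carried over from Part d.

```python
import math

N_SCREENED = 1500

def coverage(n_variants, n_colonies=N_SCREENED):
    """Fractional coverage F = 1 - exp(-T / V)."""
    return 1 - math.exp(-n_colonies / n_variants)

def colonies_for(n_variants, target=0.95):
    """Colonies needed to reach the target coverage, T = -V ln(1 - F)."""
    return -n_variants * math.log(1 - target)

single_mutants = 298 * 19  # all single amino acid substitutions in GR
p_one = math.exp(-1)       # Poisson(1) probability of exactly one mutation

# Only a fraction p_one of the screened colonies are single mutants
f_poisson = coverage(single_mutants, N_SCREENED * p_one)

print(f"Poisson ePCR single-mutant coverage: {f_poisson:.2f}")
print(f"Colonies for 95% of ePCR single mutants: {colonies_for(single_mutants):.0f}")
print(f"Colonies for 95% of the 380 SSM variants: {colonies_for(20 * 19):.0f}")
```

Under these assumptions, the 1500-colony screen exceeds what the single site-saturation library needs for 95% coverage, while falling well short for the ePCR single mutants.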


## Part f

One obstacle to using microbial opsins for optogenetics is expressing them in mammalian cells. If you want to find a variant of GR that expresses better in mammalian cells, which library would you screen, and why?

<hr>

I would screen the ePCR library, since we do not know which residues contribute to expression in mammalian cells. ePCR mutates positions throughout the gene, which can help identify residues that boost expression in mammalian cells. In contrast, SSM constrains the variant search to sites that are close to the retinal.

## Part g

Beyond just having good expression, we also would like to have a large spectral shift (absorbing and emitting wavelengths of light different from the parent GR). To accomplish this goal, which library would you screen, and why?

<hr>

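I would screen the single site-saturation mutagenesis library for the 20 locations close to the retinal. The problem states that "the absorption maximum of rhodopsin is determined by the energy gap between the resting state (S0) and excited state (S1) of the retinal chromophore." Residues within 5 Angstroms of the retinal shape the chromophore's local environment and can therefore alter this energy gap, so mutations at these positions are the most likely to produce a spectral shift. The coverage calculation in Part e also shows that this library can be screened almost completely (about 98% coverage with 1500 colonies).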

<br />