# Correcting for H6 standard neutralization in AUSAB-05 selection modeling
The issue: less-potent sera significantly neutralize the H6 standard in selections at concentrations of IC99 and above. We could correct for this by running RT-qPCR to measure the actual amount of H6 RNA relative to total RNA at each selection concentration, then normalizing.

Here, I estimate non-neutralized H6 counts at each concentration by comparing to AUSAB-07 and AUSAB-13 selections (these do not neutralize H6 at relevant concentrations). After correcting prob escape scores with these new H6 estimates, I generate models to see if this helps better resolve escape sites. 
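For reference, here is a minimal sketch of how a prob escape score depends on the neut standard count, mirroring the formula applied later in this notebook. The function name and the toy counts are made up for illustration; they are not real data.

```python
def prob_escape_from_counts(
    antibody_count,
    antibody_std_count,
    no_antibody_count,
    no_antibody_std_count,
):
    """Return (uncensored, censored) probability of escape for one variant."""
    uncensored = (antibody_count / antibody_std_count) / (
        no_antibody_count / no_antibody_std_count
    )
    # censoring clips the score at 1, as in the prob_escape column
    return uncensored, min(uncensored, 1.0)


# toy counts: same variant, before and after doubling the neut standard count
measured, _ = prob_escape_from_counts(50, 1_000, 100, 500)
corrected, _ = prob_escape_from_counts(50, 2_000, 100, 500)
print(measured, corrected)
```

If the H6 standard is partly neutralized, its count is too low and every prob escape score at that concentration is inflated by the same factor, which is why replacing the standard count rescales all scores at once.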

Upshot is that there's not a significant benefit from editing H6 counts and incorporating these more potent selections.

### import packages, set up some general plotting + modeling functions

In [1]:
import pandas as pd
import altair as alt

import polyclonal

from IPython.utils import io

import warnings
warnings.filterwarnings('ignore')

In [2]:
import os
os.chdir('../../')

In [3]:
# with open("config.yaml") as f:
#     config = yaml.safe_load(f)

In [4]:
# import Bio.SeqIO
# import alignparse.utils

In [5]:
# set up function for mean prob escape chart to avoid clutter from large block of code

def plot_avg_escape(prob_escape):
    max_aa_subs = 4  # group if >= this many substitutions
    
    mean_prob_escape = (
        prob_escape.assign(
            n_subs=lambda x: (
                x["aa_substitutions_reference"]
                .str.split()
                .map(len)
                .clip(upper=max_aa_subs)
                .map(lambda n: str(n) if n < max_aa_subs else f">{max_aa_subs - 1}")
            )
        )
        .groupby(["antibody_concentration", "n_subs"], as_index=False)
        .aggregate({"prob_escape": "mean", "prob_escape_uncensored": "mean"})
        .rename(
            columns={
                "prob_escape": "censored to [0, 1]",
                "prob_escape_uncensored": "not censored",
            }
        )
        .melt(
            id_vars=["antibody_concentration", "n_subs"],
            var_name="censored",
            value_name="probability escape",
        )
    )

    mean_prob_escape_chart = (
        alt.Chart(mean_prob_escape)
        .encode(
            x=alt.X("antibody_concentration"),
            y=alt.Y(
                "probability escape",
                scale=alt.Scale(type="symlog", constant=0.05),
            ),
            column=alt.Column("censored", title=None),
            color=alt.Color("n_subs", title="n substitutions"),
            tooltip=[
                alt.Tooltip(c, format=".3g") if mean_prob_escape[c].dtype == float else c
                for c in mean_prob_escape.columns
            ],
        )
        .mark_line(point=True, size=0.5)
        .properties(width=200, height=125)
        .configure_axis(grid=False)
    )

    return mean_prob_escape_chart

In [47]:
def generate_model(
    prob_escape_df,
    n_epitopes=1
):
    
    model = polyclonal.Polyclonal(
        n_epitopes=n_epitopes,
        data_to_fit=prob_escape_df.rename(
            columns={
                "antibody_concentration": "concentration",
                "aa_substitutions_reference": "aa_substitutions",
            }
        ),
        alphabet=polyclonal.AAS_WITHSTOP_WITHGAP,
    )

    # fit model, suppressing output text to avoid clutter in notebook
    with io.capture_output() as captured:
        opt_res = model.fit(
            logfreq=200,
            reg_escape_weight=0.1,
        )

    mut_escape_plot = model.mut_escape_plot(addtl_slider_stats={"times_seen": 3}, 
                                            init_floor_at_zero=False,
                                            show_heatmap=False
                                           )
    return mut_escape_plot

### Compare relative neut standard fraction between AUSAB-05 and AUSAB-07/13

For AUSAB-05, we start seeing H6 neutralization that interferes with analysis at concentrations 0.0166 and 0.0249. These concentrations are roughly the IC99 and 1.5-fold IC99, respectively. I'm going to try replacing the neut standard counts at these concentrations with fake counts that approximate what the non-neutralized H6 would look like, and see how that affects analysis.

Use AUSAB-13 and AUSAB-07 to get reference values. Selection concentrations for both these antibodies include:
* 1.75-fold IC99
* 1.17-fold IC99
* 0.80-fold IC99

So none of these exactly match up, but we can estimate some intermediate values. AUSAB-07 will likely be the better reference, as it was from the same run and therefore uses the same neut std spike-in virus mix. But use both to check.

Read in summary neut standard count files for both sera:

In [7]:
neut_std_13 = pd.read_csv('results/prob_escape/libA_221027_1_AUSAB-13_1_neut_standard_fracs.csv')

neut_std_13

Unnamed: 0,library,antibody_sample,no-antibody_sample,antibody,antibody_concentration,antibody_count,antibody_frac,no-antibody_count,no-antibody_frac
0,libA,221027_1_antibody_AUSAB-13_0.00752_1,221027_1_no-antibody_control_1,AUSAB-13,0.00752,420515,0.07312,6257,0.00038
1,libA,221027_1_antibody_AUSAB-13_0.00501333_1,221027_1_no-antibody_control_1,AUSAB-13,0.005013,224073,0.04337,6257,0.00038
2,libA,221027_1_antibody_AUSAB-13_0.00334222_1,221027_1_no-antibody_control_1,AUSAB-13,0.003342,326272,0.04842,6257,0.00038
3,libA,221027_1_antibody_AUSAB-13_0.00222815_1,221027_1_no-antibody_control_1,AUSAB-13,0.002228,262072,0.04657,6257,0.00038
4,libA,221027_1_antibody_AUSAB-13_0.00148543_1,221027_1_no-antibody_control_1,AUSAB-13,0.001485,116899,0.0243,6257,0.00038
5,libA,221027_1_antibody_AUSAB-13_0.00099029_1,221027_1_no-antibody_control_1,AUSAB-13,0.00099,47558,0.01284,6257,0.00038
6,libA,221027_1_antibody_AUSAB-13_0.00066019_1,221027_1_no-antibody_control_1,AUSAB-13,0.00066,29529,0.008403,6257,0.00038


In [8]:
neut_std_07 = pd.read_csv('results/prob_escape/libA_221223_1_AUSAB-07_1_neut_standard_fracs.csv')
neut_std_07

Unnamed: 0,library,antibody_sample,no-antibody_sample,antibody,antibody_concentration,antibody_count,antibody_frac,no-antibody_count,no-antibody_frac
0,libA,221223_1_antibody_AUSAB-07_0.00776_1,221223_1_no-antibody_control_1,AUSAB-07,0.00776,723026,0.1505,28408,0.002654
1,libA,221223_1_antibody_AUSAB-07_0.005173333_1,221223_1_no-antibody_control_1,AUSAB-07,0.005173,696024,0.1066,28408,0.002654
2,libA,221223_1_antibody_AUSAB-07_0.003448889_1,221223_1_no-antibody_control_1,AUSAB-07,0.003449,564905,0.1108,28408,0.002654
3,libA,221223_1_antibody_AUSAB-07_0.002299259_1,221223_1_no-antibody_control_1,AUSAB-07,0.002299,349479,0.06804,28408,0.002654
4,libA,221223_1_antibody_AUSAB-07_0.00153284_1,221223_1_no-antibody_control_1,AUSAB-07,0.001533,187261,0.04217,28408,0.002654
5,libA,221223_1_antibody_AUSAB-07_0.001021893_1,221223_1_no-antibody_control_1,AUSAB-07,0.001022,44725,0.01486,28408,0.002654
6,libA,221223_1_antibody_AUSAB-07_0.000681262_1,221223_1_no-antibody_control_1,AUSAB-07,0.000681,6696,0.00724,28408,0.002654


In [9]:
neut_std_05 = pd.read_csv('results/prob_escape/libA_221223_1_AUSAB-05_1_neut_standard_fracs.csv')
neut_std_05['total_antibody_count'] = neut_std_05['antibody_count'] / neut_std_05['antibody_frac']
neut_std_05

Unnamed: 0,library,antibody_sample,no-antibody_sample,antibody,antibody_concentration,antibody_count,antibody_frac,no-antibody_count,no-antibody_frac,total_antibody_count
0,libA,221223_1_antibody_AUSAB-05_0.056_1,221223_1_no-antibody_control_1,AUSAB-05,0.056,74631,0.0117,28408,0.002654,6378718.0
1,libA,221223_1_antibody_AUSAB-05_0.037333333_1,221223_1_no-antibody_control_1,AUSAB-05,0.03733,175377,0.0264,28408,0.002654,6643068.0
2,libA,221223_1_antibody_AUSAB-05_0.024888889_1,221223_1_no-antibody_control_1,AUSAB-05,0.02489,453396,0.07019,28408,0.002654,6459553.0
3,libA,221223_1_antibody_AUSAB-05_0.016592593_1,221223_1_no-antibody_control_1,AUSAB-05,0.01659,743423,0.1395,28408,0.002654,5329197.0
4,libA,221223_1_antibody_AUSAB-05_0.011061729_1,221223_1_no-antibody_control_1,AUSAB-05,0.01106,608469,0.1126,28408,0.002654,5403810.0
5,libA,221223_1_antibody_AUSAB-05_0.007374486_1,221223_1_no-antibody_control_1,AUSAB-05,0.007374,131356,0.03462,28408,0.002654,3794223.0
6,libA,221223_1_antibody_AUSAB-05_0.004916324_1,221223_1_no-antibody_control_1,AUSAB-05,0.004916,39060,0.01057,28408,0.002654,3695364.0


At this point I switched over to an Excel spreadsheet for some manual analysis, asking what the percent change in neut std fraction is from one selection concentration to the next. For AUSAB-07:
* neut std fraction increases by 62% from 0.8-fold IC99 to 1.2-fold IC99 selection
* neut std fraction increases by 61% from 1.2-fold IC99 to 1.8-fold IC99 selection

For AUSAB-13:
* neut std fraction increases by 52% from 0.8-fold IC99 to 1.2-fold IC99 selection
* neut std fraction increases by 96% from 1.2-fold IC99 to 1.8-fold IC99 selection

I have more confidence in the AUSAB-07 calculations, as the starting neut std fraction in the no-antibody control is much higher (about 0.27%) than it is for the earlier AUSAB-13 selections (about 0.04%).

Note that all of these selections are at 1.5-fold differences; AUSAB-05 just had a different starting concentration. So despite fold-IC values not completely matching up, I think it's reasonable to use roughly 60% increase from the previous selection to estimate 'real' H6 counts at IC99 and 1.5-fold IC99 for AUSAB-05.
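As a quick check of the spacing, here is a small sketch that converts the AUSAB-05 selection concentrations (taken from the sample names in the table above) to fold-IC99 values. It assumes the 0.0166 selection sits exactly at the IC99, which the text above only states as approximate.

```python
# AUSAB-05 selection concentrations, highest to lowest
concentrations = [
    0.056,
    0.037333333,
    0.024888889,
    0.016592593,
    0.011061729,
    0.007374486,
    0.004916324,
]
ic99 = 0.016592593  # assumption: treat the 0.0166 selection as exactly the IC99

# fold-IC99 for each selection, keyed by the rounded concentration
fold_ic99 = {round(c, 4): round(c / ic99, 2) for c in concentrations}

# ratio between each concentration and the next lower one
step_ratios = [hi / lo for hi, lo in zip(concentrations, concentrations[1:])]

for conc, fold in fold_ic99.items():
    print(f"{conc}: {fold}-fold IC99")
```

Every step ratio comes out at 1.5, so the concentration just below the IC99 is about 0.67-fold IC99.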

The less potent selections (0.8-fold, 0.5-fold, and 0.3-fold) show similar step-to-step changes in neut std fraction for AUSAB-05 and AUSAB-07. For AUSAB-05, the neut std fraction at each of these concentrations is about 30% of the fraction at the next-higher concentration at both steps; for AUSAB-07 the corresponding values are 49% and 35%. So I can take the actual measured neut std fraction at 0.7-fold IC99 for AUSAB-05, assign an H6 fraction that's 60% greater for IC99 and another 60% greater for 1.5-fold IC99, and convert those fractions back to counts.
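The step-to-step comparison done in the spreadsheet can also be sketched in a few lines of Python. This uses the AUSAB-07 `antibody_frac` values as displayed (rounded) in the table above, so the results may differ slightly from the spreadsheet numbers.

```python
# AUSAB-07 neut standard fractions (antibody_frac), copied from the table above,
# ordered from lowest to highest antibody concentration
conc = [0.000681, 0.001022, 0.001533, 0.002299, 0.003449, 0.005173, 0.00776]
frac = [0.00724, 0.01486, 0.04217, 0.06804, 0.1108, 0.1066, 0.1505]

# fractional change in neut std fraction from each concentration to the next
pct_change = [(nxt - prev) / prev for prev, nxt in zip(frac, frac[1:])]

for c_lo, c_hi, change in zip(conc, conc[1:], pct_change):
    print(f"{c_lo:.6f} -> {c_hi:.6f}: {change:+.0%}")
```

The two steps starting at 0.001533 and 0.002299 come out at roughly 61% and 63%, in line with the ~60% increase used for the AUSAB-05 estimate.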

### Calculate new neut std counts, based on fraction counts relative to previous selection concentration

In [10]:
# for AUSAB-05: IC99 is 0.0166, 1.5-fold is 0.0249, 2.3-fold is 0.0373
# calling them c1, c2, and c3 here for easier variable naming. Not adjusting H6 for first 3 conc

# 0.1126 is the measured neut std fraction at 0.0111, the highest concentration left unadjusted
h6_frac_estimate_c1 = 0.1126 * 1.6
h6_frac_estimate_c2 = h6_frac_estimate_c1 * 1.6
h6_frac_estimate_c3 = h6_frac_estimate_c2 * 1  # no further increase assumed at 2.3-fold IC99

In [11]:
# get dict with new neut_std counts
fraction_h6_dict = {
    0.016590: h6_frac_estimate_c1, 
    0.024890: h6_frac_estimate_c2, 
    0.037330: h6_frac_estimate_c3
                   }

# initialize dict with first 3 selection conc, which I'm not going to change
new_neut_std_count = {
    0.0049: 'keep',
    0.0074: 'keep',
    0.0111: 'keep',
}

for sele_conc in fraction_h6_dict:
    total_counts = (neut_std_05.loc[neut_std_05['antibody_concentration'] == sele_conc]['total_antibody_count']
                    .iloc[0]
                   )
    
    dummy_h6 = (total_counts * fraction_h6_dict[sele_conc]).astype(int)    
    rounded_conc = round(sele_conc, 4)   
    new_neut_std_count[rounded_conc] = dummy_h6

new_neut_std_count

{0.0049: 'keep',
 0.0074: 'keep',
 0.0111: 'keep',
 0.0166: 960108,
 0.0249: 1862004,
 0.0373: 1914904}
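The cell above relies on exact float equality between the dict keys and `antibody_concentration`, then rounds the keys by hand. A more forgiving sketch matches concentrations within a tolerance instead. The helper name `lookup_total` and the 1% tolerance are my own choices for illustration, and the totals are copied from the displayed (rounded) table rather than read from the CSV.

```python
import math

# (antibody_concentration, total_antibody_count) pairs copied from the displayed table
totals = [(0.01659, 5329197.0), (0.02489, 6459553.0), (0.03733, 6643068.0)]

# estimated non-neutralized H6 fractions, same arithmetic as the cell above
frac_c1 = 0.1126 * 1.6
frac_c2 = frac_c1 * 1.6
h6_frac = {0.0166: frac_c1, 0.0249: frac_c2, 0.0373: frac_c2}


def lookup_total(target, pairs, rel_tol=1e-2):
    """Return the total count whose concentration matches target within rel_tol."""
    matches = [
        total for conc, total in pairs if math.isclose(conc, target, rel_tol=rel_tol)
    ]
    if len(matches) != 1:
        raise ValueError(f"expected one match for {target}, found {len(matches)}")
    return matches[0]


estimated_counts = {
    conc: int(lookup_total(conc, totals) * frac) for conc, frac in h6_frac.items()
}
print(estimated_counts)
```

Because the selections are 1.5-fold apart, a 1% tolerance cannot match two concentrations at once, and the helper raises if a concentration is missing rather than failing silently.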

### Manually edit neut std counts and prob escape scores in main prob_escape_05 df

In [12]:
prob_escape_05 = pd.read_csv(
    "results/prob_escape/libA_221223_1_AUSAB-05_1_prob_escape.csv", keep_default_na=False, na_values="nan"
).query(
    "`no-antibody_count` >= no_antibody_count_threshold"
)  # filter for those with sufficient no-antibody counts
assert prob_escape_05.notnull().all().all()

In [13]:
prob_escape_by_sele_conc = []

for sele_conc in new_neut_std_count:    
    # copy the slice so the column edits below don't act on a view of prob_escape_05
    prob_escape_sele = prob_escape_05.loc[prob_escape_05['antibody_concentration'] == sele_conc].copy()
    
    if new_neut_std_count[sele_conc] != 'keep':

        prob_escape_sele['antibody_neut_standard_count'] = new_neut_std_count[sele_conc]
        prob_escape_sele['prob_escape_uncensored'] = (
            (prob_escape_sele['antibody_count'] / prob_escape_sele['antibody_neut_standard_count']) /
            (prob_escape_sele['no-antibody_count'] / prob_escape_sele['no-antibody_neut_standard_count'])
        )

        prob_escape_sele = prob_escape_sele.assign(
            prob_escape=lambda x: x["prob_escape_uncensored"].clip(upper=1),
        )
    
    prob_escape_by_sele_conc.append(prob_escape_sele)
    
prob_escape_edited = pd.concat(prob_escape_by_sele_conc)

### Visualize new average escape plots

In [14]:
plot_avg_escape(prob_escape_05)

In [15]:
plot_avg_escape(prob_escape_edited)

This looks pretty good! There's something wonky happening with the 4th selection concentration, but it's also in the original so I'm not sure if this will be fixed by editing the neut standard counts.

If anything, I could actually increase the neut standard counts a bit to get a consistent downward trend. But let's start with this just for a rough approximation.

Compare full models between the original and edited versions:

In [48]:
generate_model(prob_escape_05)

In [49]:
generate_model(prob_escape_edited)

**Weirdly enough, editing the H6 counts gets us better resolution of the universal stalk mutations (around 320, 370, 440) that always have low escape scores. But the mutations in the actual protein are the same.**

**Try playing around with single concentrations and smaller sets of concentrations:**

In [18]:
selection_df_edited = (
    prob_escape_edited.groupby("antibody_concentration")
    .aggregate(n_variants=pd.NamedAgg("barcode", "nunique"))
    .reset_index()
)

selections_edited = selection_df_edited['antibody_concentration'].tolist()

selections_edited

[0.0049, 0.0074, 0.0111, 0.0166, 0.0249, 0.0373]

In [51]:
escape_plots_edited = []

for selection in selections_edited:
    single_conc = prob_escape_edited.loc[prob_escape_edited['antibody_concentration'] == selection]
    single_conc_plot = generate_model(single_conc)
    escape_plots_edited.append(single_conc_plot)

In [20]:
selection_df_05 = (
    prob_escape_05.groupby("antibody_concentration")
    .aggregate(n_variants=pd.NamedAgg("barcode", "nunique"))
    .reset_index()
)

selections_05 = selection_df_05['antibody_concentration'].tolist()

selections_05

[0.0049, 0.0074, 0.0111, 0.0166, 0.0249, 0.0373, 0.056]

In [50]:
escape_plots_original = []

for selection in selections_05:
    single_conc = prob_escape_05.loc[prob_escape_05['antibody_concentration'] == selection]
    single_conc_plot = generate_model(single_conc)
    escape_plots_original.append(single_conc_plot)
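The comparisons that follow pull plots out of these lists by position, which only works because both lists are sorted the same way. One alternative, sketched here in plain Python, is to key the results by concentration. `fit_by_concentration` and the toy rows are hypothetical, and `len` stands in for `generate_model` so the sketch runs without fitting anything.

```python
def fit_by_concentration(rows, fit_one):
    """Group rows by antibody concentration and apply fit_one to each group."""
    groups = {}
    for row in rows:
        groups.setdefault(row["antibody_concentration"], []).append(row)
    return {conc: fit_one(group) for conc, group in sorted(groups.items())}


# toy rows (not real data); each dict stands in for one variant at one concentration
toy_rows = [
    {"antibody_concentration": 0.0166, "barcode": "AAA"},
    {"antibody_concentration": 0.0166, "barcode": "CCC"},
    {"antibody_concentration": 0.0249, "barcode": "AAA"},
]

results = fit_by_concentration(toy_rows, fit_one=len)
print(results[0.0166])
```

With the real dataframes, the same idea would be a dict comprehension over `prob_escape_05.groupby("antibody_concentration")`, calling `generate_model` on each group.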

### IC99 comparison

In [52]:
escape_plots_original[3]

In [53]:
escape_plots_edited[3]

### 1.5-fold IC99

In [54]:
escape_plots_original[4]

In [55]:
escape_plots_edited[4]

**The original escape plot, where we haven't edited H6 counts, actually has stronger resolution of mutations in the H3 head.**

### 2.3-fold IC99

In [56]:
escape_plots_original[5]

In [57]:
escape_plots_edited[5]

**Again, original magnitude is higher than the edited version. Both have a similar degree of noise, though the edited version has better resolution of sensitizing mutations.**

### Try fitting model on smaller set of concentrations from prob_escape with edited H6 counts

In [58]:
selections_edited

[0.0049, 0.0074, 0.0111, 0.0166, 0.0249, 0.0373]

In [59]:
prob_escape_edited_filtered = prob_escape_edited.loc[
    (prob_escape_edited['antibody_concentration'] != 0.0049) &
    (prob_escape_edited['antibody_concentration'] != 0.0373)
]

generate_model(prob_escape_edited_filtered)

In [60]:
# plot_avg_escape(prob_escape_edited_filtered)

This actually looks pretty good! But compare to the equivalent model with non-edited values:

In [31]:
selections_05

[0.0049, 0.0074, 0.0111, 0.0166, 0.0249, 0.0373, 0.056]

In [61]:
prob_escape_filtered = prob_escape_05.loc[
    (prob_escape_05['antibody_concentration'] != 0.0049) &
    (prob_escape_05['antibody_concentration'] != 0.0373) &
    (prob_escape_05['antibody_concentration'] != 0.056)
]

generate_model(prob_escape_filtered)

In [33]:
plot_avg_escape(prob_escape_filtered)

**The edited version is a little less noisy, which is promising, but it's not a significant change in magnitude like I was hoping for.**

### Test multi-epitope fitting with edited data

The main reason I'm exploring this is because current data is too noisy / doesn't have strong enough signal to assign multiple epitopes. Play around with multi-epitope models fit with different regularization parameters to see if I have any more luck with the edited data.

In [34]:
spatial_distances = polyclonal.pdb_utils.inter_residue_distances(
    "scratch_notebooks/221227_model_fitting/4o5n_renumbered_1chain.pdb",
    target_chains=["A"],
)

In [35]:
reference_sites = pd.read_csv("data/site_map.csv")["reference_site"].tolist()

def generate_multi_epitope_model(
    prob_escape_df,
    n_epitopes=2,
    reg_uniqueness_weight=0,
    reg_uniqueness2_weight=1,
    reg_spatial_weight=0.0,
    reg_spatial2_weight=0.0005,
):
    
    model = polyclonal.Polyclonal(
        n_epitopes=n_epitopes,
        data_to_fit=prob_escape_df.rename(
            columns={
                "antibody_concentration": "concentration",
                "aa_substitutions_reference": "aa_substitutions",
            }
        ),
        alphabet=polyclonal.AAS_WITHSTOP_WITHGAP,
        sites=reference_sites,
        spatial_distances=spatial_distances,
    )

    # fit model, suppressing output text to avoid clutter in notebook
    with io.capture_output() as captured:
        opt_res = model.fit(
            logfreq=200,
            reg_escape_weight=0.1,
            reg_uniqueness_weight=reg_uniqueness_weight,
            reg_uniqueness2_weight=reg_uniqueness2_weight,
            reg_spatial_weight=reg_spatial_weight,
            reg_spatial2_weight=reg_spatial2_weight,
        )

    # display results
#     display(model.activity_wt_barplot())
    mut_escape_plot = model.mut_escape_plot(addtl_slider_stats={"times_seen": 3}, init_floor_at_zero=False)
    
    return mut_escape_plot

In [36]:
# generate_multi_epitope_model(prob_escape_edited_filtered)

In [37]:
# generate_multi_epitope_model(prob_escape_edited_filtered,
#                              reg_spatial2_weight=1e-3
#                             )

In [38]:
# generate_multi_epitope_model(prob_escape_edited_filtered,
#                              reg_spatial2_weight=1e-2
#                             )

In [39]:
# generate_multi_epitope_model(prob_escape_edited_filtered,
#                              reg_spatial2_weight=1e-1
#                             )

In [40]:
# generate_multi_epitope_model(prob_escape_edited_filtered,
#                              reg_spatial2_weight=1
#                             )

In [41]:
# generate_multi_epitope_model(prob_escape_edited_filtered,
#                              n_epitopes=3
#                             )

In [42]:
# generate_multi_epitope_model(prob_escape_edited_filtered,
#                              n_epitopes=3,
#                              reg_spatial2_weight=1e-3
#                             )

In [43]:
# generate_multi_epitope_model(prob_escape_edited_filtered,
#                              n_epitopes=3,
#                              reg_spatial2_weight=1e-2
#                             )

In [44]:
# generate_multi_epitope_model(prob_escape_edited_filtered,
#                              n_epitopes=3,
#                              reg_spatial2_weight=1e-1
#                             )

In [45]:
# generate_multi_epitope_model(prob_escape_edited_filtered,
#                              n_epitopes=3,
#                              reg_spatial2_weight=1
#                             )

In [46]:
# generate_multi_epitope_model(prob_escape_edited_filtered,
#                              n_epitopes=4
#                             )
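The commented-out cells above step through `n_epitopes` and `reg_spatial2_weight` by hand. A compact way to express the same sweep is to build the parameter grid up front. This is only a sketch: `run_sweep` is a hypothetical helper, and it takes the fitting function as an argument so that nothing here depends on `polyclonal` itself.

```python
from itertools import product

n_epitopes_values = [2, 3]
reg_spatial2_weights = [0.0005, 1e-3, 1e-2, 1e-1, 1]  # 0.0005 is the function default

# every combination of n_epitopes and reg_spatial2_weight tried above
param_grid = [
    {"n_epitopes": n, "reg_spatial2_weight": w}
    for n, w in product(n_epitopes_values, reg_spatial2_weights)
]
# plus the single 4-epitope fit with the default spatial weight
param_grid.append({"n_epitopes": 4, "reg_spatial2_weight": 0.0005})


def run_sweep(prob_escape_df, fit_func, grid):
    """Call fit_func(prob_escape_df, **params) for each parameter set in grid."""
    return [(params, fit_func(prob_escape_df, **params)) for params in grid]
```

In this notebook the call would look like `run_sweep(prob_escape_edited_filtered, generate_multi_epitope_model, param_grid)`. Each fit can be slow, so it may be worth running a subset of the grid first.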