# Analyze correlation in prob_escape scores for technical replicates under different conditions

Tested three different selection conditions:
* 6e5 virions in 600uL
* 6e5 virions in 1200uL
* 3e5 virions in 1200uL

I included a technical replicate for the lowest selection concentration in each condition. Random bottlenecking should be less of an issue at higher selection potencies, because there will be far fewer escape variants. So I wanted to analyze the degree of correlation for selections around the IC99, where we expect to see more variant diversity.
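The three conditions differ in virion concentration as well as virion number, which matters when interpreting antibody saturation. A quick calculation makes the concentrations explicit (the dictionary name `conditions` is only for illustration, and "virions" is used here in the same loose sense as in the condition labels):

```python
# Virion number and selection volume for each condition listed above
conditions = {
    "6e5vir_600uL": (6e5, 600),
    "6e5vir_1200uL": (6e5, 1200),
    "3e5vir_1200uL": (3e5, 1200),
}

# Concentration = virions / volume
virions_per_uL = {
    name: virions / volume_uL for name, (virions, volume_uL) in conditions.items()
}

for name, conc in virions_per_uL.items():
    print(f"{name}: {conc:.0f} virions/uL")
```

Each step between conditions halves the virion concentration, so the 3e5-virion condition changes concentration as well as virion number relative to the 600uL condition.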

## Import python modules

In [1]:
import altair as alt

import pandas as pd

import polyclonal

import pickle

import warnings
warnings.filterwarnings('ignore')

In [2]:
import os
os.chdir('../../')

## Set up functions for general analysis

In [3]:
# sample replicates are currently labeled as numbers 1-6, due to limitations in pipeline labeling
# set up dict to convert these to readable labels
sample_dict = {
    1: '6e5vir_600uL_1',
    2: '6e5vir_600uL_2',
    3: '6e5vir_1200uL_1',
    4: '6e5vir_1200uL_2',
    5: '3e5vir_1200uL_1',
    6: '3e5vir_1200uL_2'
}

In [4]:
# define function to calculate correlations from prob_escape file
def get_corr_df(
    antibody,
    replicate_conc, # list of concentrations that include technical replicates
    sample_dict=sample_dict,
    value_to_correlate='prob_escape'   
):    
    
    replicate_prob_escapes = []
    
    for sample in sample_dict:
        prob_escape = pd.read_csv(
            f'results/prob_escape/libB_221202_1_{antibody}_{sample}_prob_escape.csv',
            keep_default_na=False,
            na_values='nan'
        ).query("`no-antibody_count` >= no_antibody_count_threshold")
        assert prob_escape.notnull().all().all()
        
        # rough way to just get condition and replicate given current sample dict
        prob_escape['selection_condition'] = "_".join(sample_dict[sample].split('_')[:2])
        
#         if len(replicate_conc) > 1:
#             # include Ab conc in replicate label
#             prob_escape['replicate'] = str(prob_escape['antibody_concentration']) + sample_dict[sample].split('_')[-1:][0]
#         else:    
#             prob_escape['replicate'] = sample_dict[sample].split('_')[-1:][0]
        
        # include Ab conc in replicate label
        if len(replicate_conc) > 1:
            prob_escape_list = []
            
            for conc in replicate_conc:             
                # copy the subset so that adding a column does not modify a view of prob_escape
                prob_escape_conc = prob_escape[prob_escape['antibody_concentration'] == conc].copy()
                prob_escape_conc['replicate'] = str(conc) + '_' + sample_dict[sample].split('_')[-1]
                prob_escape_list.append(prob_escape_conc)
            
            prob_escape_filtered = pd.concat(prob_escape_list)
        
        # if only one Ab conc, leave replicates as single numbers
        else:
            prob_escape['replicate'] = sample_dict[sample].split('_')[-1:][0]
            prob_escape_filtered = prob_escape[prob_escape['antibody_concentration'].isin(replicate_conc)]
        
        replicate_prob_escapes.append(prob_escape_filtered)
    
    replicate_df = pd.concat(replicate_prob_escapes)
    
    # get df of pearson correlations for technical replicates within a selection condition
    corr = polyclonal.utils.tidy_to_corr(
        replicate_df,
        sample_col='replicate',
        label_col='barcode',
        value_col=value_to_correlate,
        group_cols = ['selection_condition', 'antibody_concentration']
    )
    
    return [corr, replicate_df]

In [5]:
# function for generating heatmaps from corr df
def corr_heatmap(
    corr_df,
    corr_col,
    sample_cols,
    group_col=None,
    corr_range=(0, 1),
    columns=3,
    diverging_colors=None,
#     scheme=None,
):
    
    corr_chart = (
        alt.Chart(corr_df)
        .encode(
            x=alt.X(sample_cols[0], title=None),
            y=alt.Y(sample_cols[1], title=None),
            color=alt.Color(
                corr_col,
                scale=(
                    alt.Scale(domainMid=0, domain=corr_range)
                    if diverging_colors
                    else alt.Scale(domain=corr_range)
                ),
            ),
            tooltip=[
                alt.Tooltip(c, format=".3f") if corr_df[c].dtype == float else c
                for c in corr_df.columns
                if c not in {"_label_1", "_label_2"}
            ],
            facet=(
                alt.Facet()
                if group_col is None
                else alt.Facet(group_col, columns=columns)
            ),
        )
        .mark_rect(stroke="black")
        .properties(width=alt.Step(15), height=alt.Step(15))
        .configure_axis(labelLimit=500)
    )
    
    return corr_chart

## Analyze correlation between prob_escape scores for independent antibody selections
These are technical replicates for selections at the lowest concentrations I tend to use: a single concentration of ~IC90 for the cocktail (1.37ng/uL), and ~IC99 and ~IC99.5 for AUSAB-13 (serum dilutions 0.0014 and 0.0021).
### 1:1 antibody cocktail

In [6]:
cocktail_corr = get_corr_df(
    antibody='1C04-5G04', 
    replicate_conc=[1.37]
)[0]

corr_heatmap(cocktail_corr, 
             corr_col='correlation', 
             sample_cols=['replicate_1', 'replicate_2'], 
             group_col='selection_condition',
            )

Plot on smaller scale to better resolve low correlations:

In [7]:
corr_heatmap(cocktail_corr, 
             corr_col='correlation', 
             sample_cols=['replicate_1', 'replicate_2'], 
             group_col='selection_condition',
             corr_range = (0, 0.1)
            )

### AUSAB-13 serum
*Included technical replicates for both serum selection concentrations in all conditions*

In [8]:
serum_df = get_corr_df('AUSAB-13', [0.0014, 0.0021])[0]

corr_heatmap(serum_df,
             corr_col='correlation', 
             sample_cols=['replicate_1', 'replicate_2'], 
             group_col='selection_condition',
            )

In [9]:
# Max is <0.3, plot on smaller scale:
corr_heatmap(serum_df, 
             corr_col='correlation', 
             sample_cols=['replicate_1', 'replicate_2'], 
             group_col='selection_condition',
             corr_range = (0, 0.3)
            )

*Unfortunately, an error in pooling meant that I did not sequence the 6e5vir_600uL_0.0021-dilution replicate to high enough coverage, so it is missing from these analyses.*

### Increased primer concentration in R1 PCR

In [10]:
sample_dict_primers = {
    3: '6e5vir_1200uL_1',
    4: '6e5vir_1200uL_2',
    7: '6e5vir_1200uL-3uM-primers_1',
    8: '6e5vir_1200uL-3uM-primers_2'
}

In [11]:
serum_df = get_corr_df('AUSAB-13', [0.0014], sample_dict=sample_dict_primers)[0]

corr_heatmap(serum_df,
             corr_col='correlation', 
             sample_cols=['replicate_1', 'replicate_2'], 
             group_col='selection_condition',
            )

In [12]:
corr_heatmap(serum_df,
             corr_col='correlation', 
             sample_cols=['replicate_1', 'replicate_2'], 
             group_col='selection_condition',
             corr_range = (0, 0.3)
            )

It doesn't look like primer concentration in the R1 PCR is the culprit for the low replicate correlation.

Correlations between individual variant prob_escape scores are very low for replicate selections. Serum correlations are slightly better than cocktail correlations, but still only in the range of r=0.2. 
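For reference, the replicate correlations reported here are Pearson r values computed over barcodes. The sketch below shows the same idea in plain pandas on toy data. It is not the implementation of `polyclonal.utils.tidy_to_corr`, and the barcodes and scores are made up for illustration.

```python
import pandas as pd

# Toy data (not experimental values): two replicates of one selection condition
toy = pd.DataFrame({
    "barcode": ["AAA", "CCC", "GGG", "TTT"] * 2,
    "replicate": ["1"] * 4 + ["2"] * 4,
    "prob_escape": [0.10, 0.40, 0.05, 0.80, 0.20, 0.35, 0.00, 0.90],
})

# One row per barcode, one column per replicate; drop barcodes missing from either
wide = toy.pivot(index="barcode", columns="replicate", values="prob_escape").dropna()

# Pearson r between the two replicate columns
r = wide["1"].corr(wide["2"])
print(f"Pearson r = {r:.3f}")
```

Only barcodes present in both replicates contribute, so the number of shared barcodes is worth checking alongside r.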

This could indicate issues with bottlenecking, i.e. prob_escape scores are highly inconsistent because their starting representation in each individual selection is random, due to bottlenecking of library diversity. But these scores are calculated based on counts in the no-antibody control, as well as the neut standard. So we need to analyze those additional factors as well before making any conclusions.
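To make the dependence on the neut standard concrete, here is a minimal sketch of how a score of this kind is typically computed. It assumes the uncensored probability of escape is the ratio of neut-standard-normalized counts with and without antibody; the function name and the no-antibody neut standard argument are my own labels, and the numbers are toy values.

```python
def prob_escape_uncensored(ab_count, ab_neut_std, noab_count, noab_neut_std):
    """Ratio of neut-standard-normalized variant counts (assumed definition)."""
    return (ab_count / ab_neut_std) / (noab_count / noab_neut_std)


# Toy numbers, not experimental values
p = prob_escape_uncensored(
    ab_count=20, ab_neut_std=10_000, noab_count=400, noab_neut_std=5_000
)

# Halving the neut standard count in the antibody sample doubles the score
p_shifted = prob_escape_uncensored(
    ab_count=20, ab_neut_std=5_000, noab_count=400, noab_neut_std=5_000
)

print(p, p_shifted)
```

Under this assumed definition, any replicate-to-replicate noise in the neut standard count rescales every variant's score in that sample by the same factor.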

## Check correlations between variant counts, not prob_escape values

If bottlenecking at the selection virion number is responsible for the low correlation between escape scores, we would expect variant counts to also be poorly correlated.
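The expectation that bottlenecking lowers count correlations can be illustrated with a small simulation. This is a sketch under stated assumptions, not a model of the actual experiment: the library size, the log-normal frequency distribution, and the numbers of virions sampled are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical library: 20,000 variants with log-normally skewed frequencies
n_variants = 20_000
freqs = rng.lognormal(mean=0.0, sigma=1.0, size=n_variants)
freqs /= freqs.sum()


def replicate_count_corr(n_virions):
    """Pearson r between two independent draws of n_virions from the library."""
    rep1 = rng.multinomial(n_virions, freqs)
    rep2 = rng.multinomial(n_virions, freqs)
    return np.corrcoef(rep1, rep2)[0, 1]


bottleneck_corrs = {n: replicate_count_corr(n) for n in (1_000_000, 100_000, 10_000)}
for n, r in bottleneck_corrs.items():
    print(f"{n:>9,d} virions sampled: r = {r:.2f}")
```

In this toy setup the correlation between replicate counts falls as fewer virions are sampled, which is the signature to look for in the real count data.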

In [13]:
cocktail_df = get_corr_df(
    '1C04-5G04', 
    [1.37], 
    value_to_correlate = 'antibody_count'
)[0]

corr_heatmap(cocktail_df, 
             corr_col='correlation', 
             sample_cols=['replicate_1', 'replicate_2'], 
             group_col='selection_condition',
            )

For antibody cocktail selections at the lowest selection concentration (roughly IC90), variant counts are actually quite well correlated: r=0.93 for 6e5 virions in 600uL, and r=0.58 for 6e5 virions in 1200uL.

Side note: this drop is a little surprising. The number of virions is the same, so we wouldn't expect a large change in correlation between technical replicates. But look at the average prob_escape for wildtype variants:
* avg_prob_escape=0.12 for 6e5vir in 600uL
* avg_prob_escape=0.02 for 6e5vir in 1200uL

So this is probably affected by antibody saturation. There are lots of free virions in the 600uL selections, so counts also reflect baseline library composition, which is going to be consistent between replicates. The 1200uL selections are more representative of neutralization, and therefore more variable between replicates. We should keep this in mind, because it's tempting to say that the 3e5 drop (r=0.213) indicates further bottlenecking with reduced virion number. But we can't fully disentangle that from Ab saturation effects at an even lower TCID50/uL.
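The wildtype averages quoted above could be computed along the following lines. This is a sketch on a toy stand-in for the `replicate_df` returned by `get_corr_df`. It assumes that an `aa_substitutions` column exists and is an empty string for wildtype variants; the substitution labels and scores are made up.

```python
import pandas as pd

# Toy stand-in for replicate_df (values and substitution labels are made up)
toy_df = pd.DataFrame({
    "selection_condition": ["6e5vir_600uL"] * 3 + ["6e5vir_1200uL"] * 3,
    "aa_substitutions": ["", "", "A1C", "", "D2E", ""],
    "prob_escape": [0.20, 0.40, 0.90, 0.05, 0.70, 0.15],
})

# Assumption: wildtype variants have an empty aa_substitutions string
wildtype = toy_df[toy_df["aa_substitutions"] == ""]

# Mean prob_escape of wildtype variants in each selection condition
avg_wt_escape = wildtype.groupby("selection_condition")["prob_escape"].mean()
print(avg_wt_escape)
```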

In [14]:
# check serum count correlations
serum_df = get_corr_df(
    'AUSAB-13', 
    [0.0014, 0.0021], 
    value_to_correlate = 'antibody_count'
)[0]

corr_heatmap(serum_df, 
             corr_col='correlation', 
             sample_cols=['replicate_1', 'replicate_2'], 
             group_col='selection_condition',
#              corr_range=(0, 0.1)
            )

We see a similar scenario for serum selections. Note that correlation is slightly higher for the lower-potency selection, which is consistent with generally seeing better correlation under conditions where the antibody is **less** neutralizing. With more neutralization, selection is more stochastic; with less neutralization, variant frequencies are dominated more by library composition, which does not vary between replicates.

## Variability in neut standard count between selection conditions

We're seeing a huge drop in correlation between replicates when we convert raw variant counts to prob escape scores. Note that all selection samples were normalized to the **same** no-Antibody control. 

I made one large aliquot of virus mix, with libA and H6 spiked in at 1%, at a concentration of 6e5 virions / 300uL. I added 300uL of this mix for the 6e5-virion selections, 150uL for the 3e5-virion selections, and then infected 4 wells with 6e5 virions each for the no-Antibody controls. Each no-antibody well was prepped as an individual sample, and I then pooled 3 of these samples together as a single no-antibody control, to mimic the level of coverage I have for a typical experiment.

The most likely explanation here is that correlations between raw counts are skewed artificially high, due to the uneven representation of variants in the library. Good correlation between high-count outlier variants is likely masking lower correlation for the bulk of the library. This gets factored out when we convert to prob_escape scores.
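A small simulation illustrates how uneven library representation can inflate count correlations. It is a sketch with hypothetical parameters (library size, log-normal abundance, and noise level are all assumptions), not an analysis of the experimental data.

```python
import numpy as np

rng = np.random.default_rng(1)
n_variants = 5_000

# Hypothetical, highly uneven library representation shared by both replicates
library_abundance = rng.lognormal(mean=0.0, sigma=1.5, size=n_variants)

# Independent per-replicate noise standing in for stochastic selection
rep1_counts = library_abundance * rng.lognormal(0.0, 0.3, n_variants)
rep2_counts = library_abundance * rng.lognormal(0.0, 0.3, n_variants)

# Raw counts: correlation is driven by the shared abundance of each variant
corr_counts = np.corrcoef(rep1_counts, rep2_counts)[0, 1]

# Normalizing to library representation removes the shared abundance signal
corr_normalized = np.corrcoef(
    rep1_counts / library_abundance, rep2_counts / library_abundance
)[0, 1]

print(f"raw counts: r = {corr_counts:.2f}")
print(f"normalized: r = {corr_normalized:.2f}")
```

In this toy case the raw counts correlate strongly because both replicates share the same skewed abundances, while the normalized values, which contain only the independent noise, are essentially uncorrelated.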

It's still worth checking on proportion of neut standard counts for each selection, to see if this is affecting prob_escape scores.

In [15]:
# get full df of selection replicates for cocktail
cocktail_df = get_corr_df('1C04-5G04', [1.37])[1]

# get counts of antibody neut standard for each selection
# very rough method of reducing large df to a summary of selection condition / rep / neut std count
neut_std_cocktail = (cocktail_df.groupby(['selection_condition', 'replicate'])
            .agg({'antibody_neut_standard_count': ['mean']})
            .reset_index())
neut_std_cocktail.columns = [''.join(col) for col in neut_std_cocktail.columns]
neut_std_cocktail = neut_std_cocktail.rename(
    columns={'antibody_neut_standard_countmean': 'antibody_neut_standard_count'}
)

neut_std_cocktail

| | selection_condition | replicate | antibody_neut_standard_count |
|---|---|---|---|
| 0 | 3e5vir_1200uL | 1 | 70125.0 |
| 1 | 3e5vir_1200uL | 2 | 130865.0 |
| 2 | 6e5vir_1200uL | 1 | 63749.0 |
| 3 | 6e5vir_1200uL | 2 | 55785.0 |
| 4 | 6e5vir_600uL | 1 | 9389.0 |
| 5 | 6e5vir_600uL | 2 | 10960.0 |


In [16]:
# get total counts for each selection as well, to calculate percent neut standard

sample_dict = {
    1: '6e5vir_600uL_1',
    2: '6e5vir_600uL_2',
    3: '6e5vir_1200uL_1',
    4: '6e5vir_1200uL_2',
    5: '3e5vir_1200uL_1',
    6: '3e5vir_1200uL_2'
}

total_counts = []
selection_condition = []
replicate = []

# pull total counts as number valid barcode reads from barcode fates file
for sample in sample_dict:
    bc_fates = pd.read_csv(
        f'results/barcode_runs/fates_by_sample/libB_221202_1_antibody_1C04-5G04_1.37_{sample}.csv'
    )
    
    # select the 'valid barcode' row by position rather than relying on its index label being 3
    total_counts.append(bc_fates.loc[bc_fates['fate'] == 'valid barcode', 'count'].iloc[0])
    
    selection_condition.append("_".join(sample_dict[sample].split('_')[:2]))
    
    replicate.append(sample_dict[sample].split('_')[-1:][0])
    
total_counts_df = pd.DataFrame(
    {'selection_condition': selection_condition,
     'replicate': replicate,
     'total_counts': total_counts}
)

cocktail_counts_summary = neut_std_cocktail.merge(
    total_counts_df,
    on=['selection_condition', 'replicate'],  # make the join keys explicit
    how='inner'
)

cocktail_counts_summary['percent_neut_std'] = (
    cocktail_counts_summary['antibody_neut_standard_count'] / 
    cocktail_counts_summary['total_counts']) * 100

cocktail_counts_summary

| | selection_condition | replicate | antibody_neut_standard_count | total_counts | percent_neut_std |
|---|---|---|---|---|---|
| 0 | 3e5vir_1200uL | 1 | 70125.0 | 2430772 | 2.884886 |
| 1 | 3e5vir_1200uL | 2 | 130865.0 | 2175773 | 6.014644 |
| 2 | 6e5vir_1200uL | 1 | 63749.0 | 2976924 | 2.141439 |
| 3 | 6e5vir_1200uL | 2 | 55785.0 | 2414912 | 2.310022 |
| 4 | 6e5vir_600uL | 1 | 9389.0 | 2023872 | 0.463913 |
| 5 | 6e5vir_600uL | 2 | 10960.0 | 2024530 | 0.54136 |


Percent neut standard counts actually look quite consistent between replicates, with the exception of the 3e5-virion selections, where the two replicates differ about two-fold (2.9% vs 6.0%).
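To put a number on "consistent", the fold difference in percent neut standard between the two replicates of each condition can be computed directly from the summary table above. The values below are copied from that table; the variable names are only for illustration.

```python
import pandas as pd

# percent_neut_std values copied from the summary table above
summary = pd.DataFrame({
    "selection_condition": ["3e5vir_1200uL"] * 2 + ["6e5vir_1200uL"] * 2 + ["6e5vir_600uL"] * 2,
    "replicate": ["1", "2"] * 3,
    "percent_neut_std": [2.884886, 6.014644, 2.141439, 2.310022, 0.463913, 0.541360],
})

# Ratio of the larger to the smaller replicate value within each condition
by_condition = summary.groupby("selection_condition")["percent_neut_std"]
fold_difference = by_condition.max() / by_condition.min()
print(fold_difference.round(2))
```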

I noticed that the avg_prob_escape profiles look very strange for the 3e5-virion selections: the trends are very inconsistent, and they tend to spike up at the most potent Ab concentration (i.e. more escape). I wonder if this has to do with the neut standard variation between samples. The neut standard is at a consistent fraction, 1% of virions. But maybe I need to start with a minimum number of neut standard virions to get consistent growth across samples, since noise is going to be much more consequential when everything starts at lower counts. So I should consider spiking up the neut standard (maybe to 2%) if I do end up dropping virion number.
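A quick calculation of the neut standard input per selection shows what a 2% spike-in would do. This assumes the spike-in percentage translates directly into virion number; the function name is hypothetical.

```python
def neut_standard_input(total_virions, percent):
    """Number of neut standard virions for a given spike-in percentage."""
    return total_virions * percent / 100


for total_virions, percent in [(600_000, 1), (300_000, 1), (300_000, 2)]:
    n = neut_standard_input(total_virions, percent)
    print(f"{total_virions:,d} virions at {percent}% neut standard: {n:,.0f} virions")
```

Under that assumption, a 2% spike-in at 3e5 virions gives the same absolute neut standard input as 1% at 6e5 virions.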

## Main conclusions
1. With technical replicates of selections, correlation between prob_escape scores for individual variants is very low across the board
2. Reducing antibody saturation also leads to a drop in prob_escape correlation
    * At saturating conditions, correlations are likely skewed high by free virions. I.e. non-neutralized library will be very consistent between replicate infections, whereas neutralization is more stochastic. 
3. Neut standard counts are quite variable between samples for the 3e5-virion selections
    * Would need to spike up neut standard fraction if I moved forward with these conditions
    * As things stand, we can't really compare these replicates because the inconsistent neut standard counts are likely skewing results, meaning that I can't draw any conclusions about whether reducing virion number will introduce more severe bottlenecking.
4. Increasing R1 primer concentration does not improve correlation in prob_escape scores between technical replicates.