In [17]:
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm

### Primer rebalancing

In this notebook, we will use the coverage calculations from a MiSeq run to 'rebalance' the primer pools, so that we have more even coverage. This is according to T112_LAB_SOP011_GBS_Testing_&_Rebalancing_V6.docx from GSU. 

In [30]:
# Load the sample metadata
metadata = pd.read_csv("../../config/metadata.tsv", sep="\t")
metadata.shape

(672, 9)

In [78]:
## Load the coverage data for each sample
# Take the mean where we have multiple target SNPs on a given amplicon

dfs =[]
for sample in tqdm(metadata.sampleID):
    df = pd.read_csv(f"../../results/coverage/{sample}.regions.bed.gz", sep="\t", header=None, names=['contig', 'start', 'end', 'amplicon', 'depth'])
    df = df.groupby('amplicon').agg({'depth':'mean'}).reset_index().sort_values('amplicon')
    dfs.append(df.assign(sample=sample))
    
dfs = pd.concat(dfs)

# Remove samples that are negative controls
dfs = dfs.query("~sample.str.contains('negative')", engine='python').reset_index(drop=True)
dfs = dfs.query("~sample.str.contains('Negative')", engine='python').reset_index(drop=True)
dfs = dfs.query("~sample.str.contains('random')", engine='python').reset_index(drop=True)
dfs.shape

  0%|          | 0/672 [00:00<?, ?it/s]

(54202, 3)

Convert the dataframe to amplicons x samples depth table. 

In [121]:
depth_df = dfs.pivot(columns='sample', index='amplicon', values='depth')
depth_df.head(2)

sample,Calvin_01,Calvin_02,Calvin_03,Calvin_04,Calvin_05,Calvin_06,Calvin_07,Calvin_08,Calvin_09,Calvin_10,...,VK7_dead_34,VK7_dead_34_dil,VK7_dead_35,VK7_dead_36,VK7_dead_37,VK7_dead_38,VK7_dead_39,VK7_dead_40,VK7_dead_41,VK7_dead_42
amplicon,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Agam_1,68.0,19.0,50.0,70.0,25.0,32.0,8.0,74.0,28.0,36.0,...,61.0,12.0,157.0,78.0,101.0,124.0,158.0,126.0,79.0,89.0
Agam_10,222.0,40.0,174.0,198.0,109.0,75.0,10.0,226.0,49.0,130.0,...,86.0,11.0,236.0,114.0,171.0,155.0,234.0,204.0,150.0,202.0


Calculate the total reads per amplicon, and the total reads per sample.

In [109]:
tot_per_amplicon = depth_df.sum(axis=1)
tot_per_sample = depth_df.sum(axis=0)

In [None]:
# sort the dataframe by total depth per amplicon and sample. Not necessary. 

# sample_order = tot_per_sample.sort_values().to_frame().reset_index()['sample'].to_list()
# amplicon_order = tot_per_amplicon.sort_values().to_frame().reset_index()['amplicon'].to_list()
# depth_df = depth_df.loc[amplicon_order, sample_order]

Divide each value in the amplicon x sample table by the total reads per amplicon to get the fraction of amplicon/target reads. 

In [120]:
fraction_df = depth_df.divide(tot_per_amplicon, axis=0)

Find the median target fraction across all samples.

In [123]:
med_read_fractions = fraction_df.median(axis=1)
med_read_fractions

amplicon
Agam_1     3.857433e-08
Agam_10    1.576085e-08
Agam_11    3.033801e-08
Agam_12    1.377803e-08
Agam_13    6.351626e-08
               ...     
Agam_8     9.922454e-08
Agam_80    3.358414e-08
Agam_81    0.000000e+00
Agam_82    0.000000e+00
Agam_9     3.697632e-08
Length: 82, dtype: float64

And the total sum of the median read fractions...

In [125]:
med_read_fraction_sum = med_read_fractions.sum()

Scale it so they all add up to 1. 

In [131]:
scaled_med_read_fractions = med_read_fractions * (1/med_read_fraction_sum)
scaled_med_read_fractions

Take the scaled median read fraction for each amplicon, and put it to the power of -0.561 (the magic number). 

In [134]:
primer_volumes = scaled_med_read_fractions**-0.561

The pool weightings can be interpreted directly as the volumes to add of each target’s primer pair in a pool, however in order to reduce inaccuracies associated with pipetting small volumes it is prudent to scale all the weightings such that the minimum weight in the pool is 1 (so that the minimum volume pipetted is 1µl); 

In [137]:
primer_volumes = primer_volumes / min(primer_volumes)

Primer pairs which generate only very small read fractions can be overweighted by the primer rebalancing algorithm, which leads to them dominating the reads from the resultant rebalanced pool. Given that very poorly performing primers may hint at a design issue, it has been empirically determined to be prudent to ‘clip’ the maximum pool weighting to 10x the minimum. Given that the minimum pool weight has been scaled to 1, the maximum volume of primer pair that can be added is 10µl.

In [140]:
primer_volumes = np.clip(primer_volumes, 0, 10)

Calculate the sum of the pool weightings, and the interquartile mean pool weighting; we use the interquartile mean rather than the arithmetic mean so as to allow us to ignore the effect of clipping any overweighted targets to 10x the minimum

In [143]:
primer_volumes.sum()

585.9869569294103

In [145]:
iqr = np.subtract(*np.percentile(primer_volumes, [75, 25]))
iqr

3.731180598479294

Calculate (IQM pool weighting/sum of weightings) and multiply by 250000nM (the concentration of the primer pairs in the source plate) to obtain the ‘central’ primer concentration for the pool in nM.

In [149]:
central_primer_conc = iqr/primer_volumes.sum() * 250_000
central_primer_conc

1591.836027388218

Calculate the dilution factor required to dilute this pool to 40nM working concentration as (central pool concentration/40). 

In [150]:
working_conc_40nm = central_primer_conc/40
working_conc_40nm

39.79590068470545