## Example inference pipeline
This notebook infers lineages from the [Briney et al. 2019](https://doi.org/10.1038/s41586-019-0879-y) dataset.

In [None]:
import os
import pandas as pd
from tqdm import tqdm

* Download data  
Annotated data can be downloaded using links provided in the [briney/grp_paper repository](https://github.com/briney/grp_paper).  
The procedure below pools all replicates and converts them to the AIRR-compatible format required by HILARy.

In [None]:
!wget http://burtonlab.s3.amazonaws.com/sequencing-data/hiseq_2016-supplement/316188_HNCHNBCXY_consensus_UID18-cdr3nt-90_minimal_071817.tar.gz
!mkdir -p data
!tar -xvf 316188_HNCHNBCXY_consensus_UID18-cdr3nt-90_minimal_071817.tar.gz --directory data/

In [None]:
from utils import Compatible
compatible = Compatible()
usecols = ['seq_id',
           'chain',
           'productive',
           'v_full',
           'j_full',
           'cdr3_nt',
           'v_start',
           'vdj_nt',
           'isotype']
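dirname = 'data/consensus-cdr3nt-90_minimal'
dfs = []
# Sort the listing so that sequence_id assignment is reproducible across runs.
for filename in tqdm(sorted(os.listdir(dirname))):
    df = pd.read_csv(os.path.join(dirname, filename), usecols=usecols)
    dfs.append(compatible.df2airr(df))
# Pool all replicates and use the new row index as a unique sequence_id.
df = pd.concat(dfs, ignore_index=True)
df['sequence_id'] = df.index
# Save the mapping from original seq_id to sequence_id before dropping seq_id.
filename = 'data/316188_ids.tsv.gz'
df[['seq_id', 'sequence_id']].to_csv(filename, sep='\t', index=False)
df.drop('seq_id', axis=1, inplace=True)
filename = 'data/316188.tsv.gz'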
usecols = ['sequence_id',
           'v_call',
           'j_call',
           'junction',
           'v_sequence_alignment',
           'j_sequence_alignment',
           'v_germline_alignment',
           'j_germline_alignment']
df[usecols].to_csv(filename,sep='\t',index=False)
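
Because `sequence_id` is reassigned from the row index, the original identifiers survive only in `data/316188_ids.tsv.gz`. The sketch below shows one way to join them back onto a per-sequence result. It uses small toy tables rather than the real files, and the `seq_id` and `family` values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the ID table and a per-sequence result (values are made up).
ids = pd.DataFrame({"seq_id": ["a1", "b7", "c3"], "sequence_id": [0, 1, 2]})
results = pd.DataFrame({"sequence_id": [0, 1, 2], "family": [10, 10, 11]})

# validate="one_to_one" raises if sequence_id is duplicated in either table.
merged = results.merge(ids, on="sequence_id", how="left", validate="one_to_one")
print(merged[["seq_id", "family"]])
```

The `validate` argument is a cheap guard against an ID table that no longer matches the data, for example after re-running the conversion on a different set of files.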

* HILARy pre-inference  
Partition the dataset into VJl classes (sequences sharing the same V gene, J gene, and CDR3 length l).  
For the largest classes, compute the distributions of pairwise distances.  
Use a priori estimates of prevalence to define high-precision and high-sensitivity thresholds.
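
The first two steps can be illustrated on toy data. This is a minimal sketch, not the `apriori` module's implementation: it assumes a class is keyed by V gene, J gene (alleles dropped), and junction length, and that the pairwise distance is the Hamming distance normalized by length. The helper `normalized_hamming` and the toy sequences are hypothetical.

```python
from itertools import combinations

import pandas as pd

# Toy AIRR-like rows; gene calls and junctions are made up.
toy = pd.DataFrame({
    "v_call": ["IGHV1-2*01", "IGHV1-2*02", "IGHV3-23*01", "IGHV1-2*01"],
    "j_call": ["IGHJ4*02", "IGHJ4*02", "IGHJ6*01", "IGHJ4*02"],
    "junction": ["TGTGCGAGAGATTGG", "TGTGCGAGAGCTTGG",
                 "TGTGCGAAATGG", "TGTGCAAGAGATTGG"],
})

# Assumption: a class is defined by V gene, J gene and junction length.
toy["v_gene"] = toy["v_call"].str.split("*").str[0]
toy["j_gene"] = toy["j_call"].str.split("*").str[0]
toy["length"] = toy["junction"].str.len()
classes = toy.groupby(["v_gene", "j_gene", "length"])


def normalized_hamming(a: str, b: str) -> float:
    """Fraction of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Pairwise distances within the largest class.
largest_key = classes.size().idxmax()
junctions = classes.get_group(largest_key)["junction"].tolist()
distances = [normalized_hamming(a, b) for a, b in combinations(junctions, 2)]
print(largest_key, distances)
```

Grouping by length first means every pair compared has the same length, so a position-wise distance is well defined without alignment.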

In [None]:
usecols = ['sequence_id',
           'v_call',
           'j_call',
           'junction',
           'v_sequence_alignment',
           'j_sequence_alignment',
           'v_germline_alignment',
           'j_germline_alignment']
filename = 'data/316188.tsv.gz'
df = pd.read_table(filename,usecols=usecols)

In [None]:
from apriori import preprocess
df = preprocess(df)

In [None]:
from apriori import Apriori
ap = Apriori(df)
ap.get_histograms(df.loc[ap.productive])
ap.get_parameters()
ap.get_thresholds()
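
To see how a prevalence estimate can translate into two thresholds, consider the following sketch. It does not reproduce `Apriori.get_thresholds`; it assumes a two-component mixture of distances (related and unrelated pairs) and picks the largest threshold that meets a target precision and the smallest that meets a target sensitivity. The binomial distributions, the prevalence `rho`, the length `L`, and both targets are hypothetical values chosen for illustration.

```python
from math import comb

import numpy as np

L = 30                      # hypothetical CDR3 length
x = np.arange(L + 1)        # possible Hamming distances 0..L


def binomial_pmf(n: int, p: float) -> np.ndarray:
    """Probability of k mismatches out of n positions, for k = 0..n."""
    return np.array([comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)])


p_true = binomial_pmf(L, 0.1)    # related pairs: few mismatches (toy choice)
p_false = binomial_pmf(L, 0.5)   # unrelated pairs: many mismatches (toy choice)
rho = 0.001                      # hypothetical fraction of related pairs

# Linking all pairs with distance <= t gives these rates as a function of t.
sensitivity = np.cumsum(p_true)
true_pos = rho * sensitivity
false_pos = (1 - rho) * np.cumsum(p_false)
precision = true_pos / (true_pos + false_pos)

target_precision, target_sensitivity = 0.99, 0.90   # hypothetical targets
precise_threshold = int(x[precision >= target_precision].max())
sensitive_threshold = int(x[sensitivity >= target_sensitivity].min())
print(precise_threshold, sensitive_threshold)
```

In this toy setting the prevalence enters only through the precision, which is why a lower prevalence pushes the high-precision threshold down while leaving the high-sensitivity threshold unchanged.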

* HILARy inference  
Define the high-precision and high-sensitivity partitions.  
Apply the full method to the high-sensitivity classes that require further partitioning.
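
The relationship between the two partitions can be shown with a small example. The sketch below assumes single-linkage clustering on Hamming distance with a fixed threshold, which may differ from what `CDR3Clustering` does internally. The function `single_linkage` and the toy sequences are hypothetical.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def single_linkage(seqs, threshold):
    """Label sequences so that any pair within `threshold` shares a cluster."""
    parent = list(range(len(seqs)))

    def find(i):
        # Follow parent pointers to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        if hamming(seqs[i], seqs[j]) <= threshold:
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(seqs))]


cdr3s = ["AAAAAA", "AAAAAT", "AAAATT", "CCCCCC"]
precise = single_linkage(cdr3s, threshold=0)     # strict: fewer links
sensitive = single_linkage(cdr3s, threshold=1)   # permissive: more links
print(precise, sensitive)
```

With a larger threshold clusters can only merge, so each high-precision cluster sits inside one high-sensitivity cluster. Under this assumption, a high-sensitivity cluster that contains several high-precision clusters is a natural candidate for further partitioning.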

In [None]:
from inference import HILARy
from inference import CDR3Clustering
hilary = HILARy(ap)

In [None]:
prec = CDR3Clustering(ap.classes[hilary.group+['precise_threshold']])
sens = CDR3Clustering(ap.classes[hilary.group+['sensitive_threshold']])
df['precise_cluster'] = prec.infer(df.loc[ap.productive])
df['sensitive_cluster'] = sens.infer(df.loc[ap.productive])

In [None]:
hilary.to_do(df)
df['family'] = hilary.infer(df)