# **Example inference pipeline**
This notebook infers lineages from the [Briney et al. 2019](https://doi.org/10.1038/s41586-019-0879-y) dataset.

## 1. Download data
Annotated data can be downloaded using links provided in the [briney/grp_paper repository](https://github.com/briney/grp_paper).  
Uncomment the following two lines to download the archive and extract it into the `./benchmark/briney_dataset/` folder (make sure your current working directory is `.../HILARy/`).

In [None]:
#!wget http://burtonlab.s3.amazonaws.com/sequencing-data/hiseq_2016-supplement/316188_HNCHNBCXY_consensus_UID18-cdr3nt-90_minimal_071817.tar.gz
#!tar -xvf 316188_HNCHNBCXY_consensus_UID18-cdr3nt-90_minimal_071817.tar.gz --directory benchmark/briney_dataset
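If `wget` or `tar` is not available (for example on Windows), the same two steps can be sketched with the Python standard library alone. The helper names `download` and `extract` are hypothetical and not part of HILARy; the URL and destination folder are the ones used in the cell above. The calls are left commented out, as above, so that nothing is downloaded by accident.

```python
import os
import tarfile
import urllib.request

# Same archive as in the wget command above.
URL = (
    "http://burtonlab.s3.amazonaws.com/sequencing-data/hiseq_2016-supplement/"
    "316188_HNCHNBCXY_consensus_UID18-cdr3nt-90_minimal_071817.tar.gz"
)


def download(url, archive_path):
    """Download url to archive_path unless the file already exists (hypothetical helper)."""
    if not os.path.exists(archive_path):
        urllib.request.urlretrieve(url, archive_path)
    return archive_path


def extract(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return the member names (hypothetical helper)."""
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(dest_dir)
    return names


# Uncomment the next two lines to download and extract the data.
# archive = download(URL, os.path.basename(URL))
# extract(archive, "benchmark/briney_dataset")
```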


## 2. Convert Briney data into the AIRR format required by HILARy

### 2.1 Install required libraries

In [None]:
!pip install hilary==1.1.7
!pip install biopython


### 2.2 Process Briney data

In [None]:
import os
import pandas as pd
from tqdm import tqdm
from hilary.utils import create_classes


In [None]:
from compatible import Compatible
compatible = Compatible()
usecols = [
    "seq_id",
    "chain",
    "productive",
    "v_full",
    "j_full",
    "cdr3_nt",
    "v_start",
    "vdj_nt",
    "isotype",
]
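dirname = "./benchmark/briney_dataset/consensus-cdr3nt-90_minimal"
dfs = []
# Read the files in sorted order so that the sequence ids assigned below are reproducible.
for filename in tqdm(sorted(os.listdir(dirname))):
    df = pd.read_csv(os.path.join(dirname, filename), usecols=usecols)
    dfs.append(compatible.df2airr(df))  # convert each file to AIRR columns
df = pd.concat(dfs, ignore_index=True)
# Use the row index as a compact integer sequence id.
df["sequence_id"] = df.index
# Save the mapping from the original Briney ids to the new ids, then drop the original ids.
filename = "./benchmark/briney_dataset/316188_ids.tsv.gz"
df[["seq_id", "sequence_id"]].to_csv(filename, sep="\t", index=False)
df.drop("seq_id", axis=1, inplace=True)
# Write only the AIRR columns selected below.
filename = "./benchmark/briney_dataset/316188.tsv.gz"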
usecols = [
    "sequence_id",
    "v_call",
    "j_call",
    "junction",
    "v_sequence_alignment",
    "j_sequence_alignment",
    "v_germline_alignment",
    "j_germline_alignment",
]
df[usecols].to_csv(filename, sep="\t", index=False)


In [None]:
# Reload the converted table from disk, keeping only the AIRR columns written above.
usecols = [
    "sequence_id",
    "v_call",
    "j_call",
    "junction",
    "v_sequence_alignment",
    "j_sequence_alignment",
    "v_germline_alignment",
    "j_germline_alignment",
]
filename = "benchmark/briney_dataset/316188.tsv.gz"
dataframe = pd.read_table(filename, usecols=usecols)
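When adapting this notebook to another dataset, it can help to check that the table carries the eight columns used above before going further. The helper `missing_airr_columns` below is a hypothetical convenience function, not part of HILARy, and the toy table only illustrates the check.

```python
import pandas as pd

# The eight columns written and reloaded in the cells above.
REQUIRED_COLUMNS = [
    "sequence_id",
    "v_call",
    "j_call",
    "junction",
    "v_sequence_alignment",
    "j_sequence_alignment",
    "v_germline_alignment",
    "j_germline_alignment",
]


def missing_airr_columns(df, required=REQUIRED_COLUMNS):
    """Return the required columns that are absent from df, in the order listed above."""
    return [column for column in required if column not in df.columns]


# Toy table that lacks the four alignment columns.
toy = pd.DataFrame(
    {
        "sequence_id": [0],
        "v_call": ["IGHV1-2"],
        "j_call": ["IGHJ4"],
        "junction": ["TGTGCGAGATGG"],
    }
)
missing = missing_airr_columns(toy)
print(missing)
```

On the real data, `missing_airr_columns(dataframe)` should return an empty list.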


## 3. Tutorial: inferring lineages with the Python package

### 3.0 Uncomment the next line to run on the first 100,000 sequences only

In [None]:
#dataframe=dataframe.head(100000)
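`head` keeps the first rows of the table, which follow the order in which the input files were concatenated. If a subset drawn from the whole table is preferred, a seeded random sample is one alternative. This is a sketch under that assumption; the helper `subsample` and its default seed are hypothetical, and whether a random subset is appropriate depends on what the test run is meant to show.

```python
import pandas as pd


def subsample(df, n, seed=0):
    """Return at most n rows of df, drawn at random with a fixed seed (hypothetical helper)."""
    n = min(n, len(df))
    return df.sample(n=n, random_state=seed)


# Toy illustration on a 10-row table.
toy = pd.DataFrame({"sequence_id": range(10)})
small = subsample(toy, 4)
print(len(small))

# On the real data, uncomment the next line:
# dataframe = subsample(dataframe, 100000)
```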


### 3.1 Create the apriori object

In [None]:
from hilary.apriori import Apriori
apriori = Apriori(silent=False, threads=-1, precision=0.99, sensitivity=0.9) # show progress bars, use all threads


In [None]:
dataframe_processed = apriori.preprocess(df=dataframe, df_kappa=None)
apriori.classes = create_classes(dataframe_processed)


### 3.2 Infer histograms, the parameters rho and mu, and the sensitivity and precision thresholds for all classes

In [None]:
apriori.get_histograms(dataframe_processed)
apriori.get_parameters()
apriori.get_thresholds()
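This notebook does not restate how precision and sensitivity are defined. As a reminder of the usual pairwise convention for clusterings, assumed here rather than taken from the HILARy code, precision is the fraction of inferred same-cluster pairs that truly belong together, and sensitivity is the fraction of true same-lineage pairs that are recovered. The function below is an illustrative sketch on toy labels, not HILARy's implementation.

```python
from itertools import combinations


def pairwise_precision_sensitivity(true_labels, inferred_labels):
    """Compute pairwise precision and sensitivity of an inferred clustering (illustrative only)."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_inferred = inferred_labels[i] == inferred_labels[j]
        if same_true and same_inferred:
            tp += 1  # pair correctly grouped together
        elif same_inferred:
            fp += 1  # pair grouped together but from different lineages
        elif same_true:
            fn += 1  # pair from the same lineage but split apart
    precision = tp / (tp + fp) if tp + fp else 1.0
    sensitivity = tp / (tp + fn) if tp + fn else 1.0
    return precision, sensitivity


# Toy example: five sequences, two true lineages, two inferred clusters.
true_labels = ["a", "a", "a", "b", "b"]
inferred_labels = [1, 1, 2, 2, 2]
precision, sensitivity = pairwise_precision_sensitivity(true_labels, inferred_labels)
print(precision, sensitivity)  # 0.5 0.5
```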


### 3.3 Create the HILARy object from the apriori object

In [None]:
from hilary.inference import HILARy
hilary = HILARy(apriori)  # hilary.df is what is being updated by the algorithm


### 3.4 Compute precise and sensitive clusters

In [None]:
dataframe_cdr3 = hilary.compute_prec_sens_clusters(df=dataframe_processed)


### 3.5 Infer clonal families from these clusters

In [None]:
dataframe_inferred = hilary.infer(df=dataframe_cdr3)


In [None]:
dataframe_inferred.to_csv(
    "./benchmark/briney_dataset/briney_clonal_families.csv"
)
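Once the families are saved, a quick summary of family sizes is a useful sanity check. The sketch below assumes the inferred table has one column holding the family label of each sequence. The name of that column is not stated in this notebook, so it is passed as a parameter, and the `family_id` column in the toy table is purely hypothetical; inspect `dataframe_inferred.columns` to find the actual name.

```python
import pandas as pd


def family_size_summary(df, family_col):
    """Summarize cluster sizes given the column that holds family labels (hypothetical helper)."""
    sizes = df[family_col].value_counts()
    return {
        "n_sequences": int(len(df)),
        "n_families": int(sizes.size),
        "largest_family": int(sizes.max()),
        "n_singletons": int((sizes == 1).sum()),
    }


# Toy table: six sequences in three families of sizes 3, 2 and 1.
toy = pd.DataFrame({"sequence_id": range(6), "family_id": [0, 0, 0, 1, 1, 2]})
summary = family_size_summary(toy, "family_id")
print(summary)
```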
