# **Example inference pipeline**
This notebook infers lineages from the [Briney et al. 2019](https://doi.org/10.1038/s41586-019-0879-y) dataset.

##  1. Download data  
Annotated data can be downloaded using links provided in the [briney/grp_paper repository](https://github.com/briney/grp_paper).  
Uncomment the following two lines to download all data into the `./data/` folder (make sure your current working directory is `.../HILARy/`).

In [3]:
#!wget http://burtonlab.s3.amazonaws.com/sequencing-data/hiseq_2016-supplement/316188_HNCHNBCXY_consensus_UID18-cdr3nt-90_minimal_071817.tar.gz
#!tar -xvf 316188_HNCHNBCXY_consensus_UID18-cdr3nt-90_minimal_071817.tar.gz --directory data/


--2024-02-14 09:46:20--  http://burtonlab.s3.amazonaws.com/sequencing-data/hiseq_2016-supplement/316188_HNCHNBCXY_consensus_UID18-cdr3nt-90_minimal_071817.tar.gz
Resolving burtonlab.s3.amazonaws.com (burtonlab.s3.amazonaws.com)... 3.5.28.165, 52.217.40.36, 52.216.77.172, ...
Connecting to burtonlab.s3.amazonaws.com (burtonlab.s3.amazonaws.com)|3.5.28.165|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3551933919 (3,3G) [application/x-tar]
Saving to: ‘316188_HNCHNBCXY_consensus_UID18-cdr3nt-90_minimal_071817.tar.gz.1’


2024-02-14 09:48:01 (33,7 MB/s) - ‘316188_HNCHNBCXY_consensus_UID18-cdr3nt-90_minimal_071817.tar.gz.1’ saved [3551933919/3551933919]

consensus-cdr3nt-90_minimal/
consensus-cdr3nt-90_minimal/14_consensus.txt

gzip: stdin: unexpected end of file
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
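
The log above hints at what went wrong: `wget` saved the new download with a `.1` suffix, which it does when a file with the original name already exists, while `tar` was pointed at the original name and hit an unexpected end of file. A likely explanation (an assumption, not something the log proves) is that an earlier, partial download was still on disk. The sketch below is one way to avoid this from Python using only the standard library; the helper names `download` and `extract` are hypothetical and not part of HILARy.

```python
import os
import tarfile
import urllib.request


def download(url, dest):
    """Download `url` to `dest`, replacing any earlier (possibly partial) copy."""
    tmp = dest + ".part"
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        expected = response.headers.get("Content-Length")
        while True:
            chunk = response.read(1 << 20)  # read 1 MiB at a time
            if not chunk:
                break
            out.write(chunk)
    # Compare the size on disk with the size announced by the server, if any.
    if expected is not None and os.path.getsize(tmp) != int(expected):
        os.remove(tmp)
        raise IOError(f"incomplete download: {url}")
    os.replace(tmp, dest)  # only a complete file ever gets the final name
    return dest


def extract(archive, out_dir):
    """Extract a .tar.gz archive into `out_dir` and return the member names."""
    os.makedirs(out_dir, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(out_dir)
    return names


# Usage (commented out because the archive is about 3.3 GB):
# url = "http://burtonlab.s3.amazonaws.com/sequencing-data/hiseq_2016-supplement/316188_HNCHNBCXY_consensus_UID18-cdr3nt-90_minimal_071817.tar.gz"
# archive = download(url, "316188_HNCHNBCXY_consensus_UID18-cdr3nt-90_minimal_071817.tar.gz")
# extract(archive, "data/")
```

Writing to a temporary `.part` file and renaming it only after the size check means that an interrupted run never leaves a truncated archive under the final name.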


## 2. Convert Briney data into the AIRR format required by HILARy

### 2.1 Install required libraries

In [1]:
# `os` is part of the Python standard library and does not need to be installed
!pip install tqdm
!pip install pandas
!pip install biopython


You should consider upgrading via the '/home/gathenes/.pyenv/versions/3.9.6/envs/hilary/bin/python3.9 -m pip install --upgrade pip' command.
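
To confirm that the installation succeeded before running the rest of the notebook, a small check along the following lines can help. It is a sketch, not part of HILARy: the helper `missing_packages` and the `REQUIRED` mapping are hypothetical names. Note that the `biopython` package is imported under the module name `Bio`.

```python
import importlib.util

# Map pip package names to the module names they are imported under.
# `biopython` is imported as `Bio`.
REQUIRED = {"tqdm": "tqdm", "pandas": "pandas", "biopython": "Bio"}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be found."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


print("Missing packages:", missing_packages())
```

An empty list means every listed module can be imported in the current environment.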


### 2.2 Process Briney data

In [2]:
import os
import pandas as pd
from tqdm import tqdm
from hilary.utils import create_classes


In [6]:
from compatible import Compatible

compatible = Compatible()
# Columns to read from each Briney consensus file
usecols = [
    "seq_id",
    "chain",
    "productive",
    "v_full",
    "j_full",
    "cdr3_nt",
    "v_start",
    "vdj_nt",
    "isotype",
]
dirname = "./data/consensus-cdr3nt-90_minimal"
dfs = []
# Convert every file in the directory to AIRR format, then concatenate
for filename in tqdm(os.listdir(dirname)):
    df = pd.read_csv(os.path.join(dirname, filename), usecols=usecols)
    dfs.append(compatible.df2airr(df))
df = pd.concat(dfs, ignore_index=True)
# Use the row index as a unique integer sequence_id
df["sequence_id"] = df.index
# Save the mapping between the original seq_id and the new sequence_id
filename = "./data/316188_ids.tsv.gz"
df[["seq_id", "sequence_id"]].to_csv(filename, sep="\t", index=False)
df.drop("seq_id", axis=1, inplace=True)
# Save only the columns needed for the inference
filename = "./data/316188.tsv.gz"
usecols = [
    "sequence_id",
    "v_call",
    "j_call",
    "junction",
    "v_sequence_alignment",
    "j_sequence_alignment",
    "v_germline_alignment",
    "j_germline_alignment",
]
df[usecols].to_csv(filename, sep="\t", index=False)


  0%|          | 0/1 [00:00<?, ?it/s]

100%|██████████| 1/1 [00:08<00:00,  8.27s/it]
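
The cell above replaces the original `seq_id` with an integer `sequence_id` and stores the mapping in a separate file. The sketch below shows, on a toy table with invented `seq_id` values, how such a mapping can be used later to restore the original identifiers. It writes to an in-memory buffer instead of `./data/316188_ids.tsv.gz` so that it runs without the downloaded data.

```python
import io

import pandas as pd

# Toy stand-in for the concatenated table; the seq_id values are invented.
df = pd.DataFrame(
    {
        "seq_id": ["a1", "b7", "c3"],
        "junction": ["TGTGCGAGA", "TGTGCGAAA", "TGTACGAGA"],
    }
)
df["sequence_id"] = df.index  # same integer id scheme as in the cell above

# Write the id mapping to an in-memory buffer rather than a file on disk.
buffer = io.StringIO()
df[["seq_id", "sequence_id"]].to_csv(buffer, sep="\t", index=False)
buffer.seek(0)
ids = pd.read_csv(buffer, sep="\t")

# Downstream tables only carry sequence_id; a merge restores the original seq_id.
results = df.drop(columns="seq_id")
restored = results.merge(ids, on="sequence_id", how="left")
print(restored)
```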


In [12]:
usecols = [
    "sequence_id",
    "v_call",
    "j_call",
    "junction",
    "v_sequence_alignment",
    "j_sequence_alignment",
    "v_germline_alignment",
    "j_germline_alignment",
]
filename = "data/316188.tsv.gz"
dataframe = pd.read_table(filename, usecols=usecols)
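
Before starting the inference it can be useful to verify that the reloaded table has the expected columns and no missing values. The following is a minimal sketch on a toy table; `check_airr_table` is a hypothetical helper, not a HILARy function, and the column list simply repeats the one used in the cell above.

```python
import pandas as pd

REQUIRED_COLUMNS = [
    "sequence_id",
    "v_call",
    "j_call",
    "junction",
    "v_sequence_alignment",
    "j_sequence_alignment",
    "v_germline_alignment",
    "j_germline_alignment",
]


def check_airr_table(df, required=REQUIRED_COLUMNS):
    """Return (missing column names, number of rows with a missing required value)."""
    missing = [column for column in required if column not in df.columns]
    present = [column for column in required if column in df.columns]
    n_incomplete = int(df[present].isna().any(axis=1).sum())
    return missing, n_incomplete


# Toy table with one absent column and one missing junction.
toy = pd.DataFrame({"sequence_id": [0, 1], "junction": ["TGTGCGAGA", None]})
missing, n_incomplete = check_airr_table(
    toy, required=["sequence_id", "junction", "v_call"]
)
print(missing, n_incomplete)
```

On the real data the call would be `check_airr_table(dataframe)`.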


## 3. Package tutorial: inferring lineages in a Python script

### 3.0 Uncomment the next line to run on the first 100,000 sequences only

In [8]:
# dataframe = dataframe.head(100000)
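
Note that `head` keeps the first rows in file order, which is not a random subset. If a random subset is preferred, `DataFrame.sample` with a fixed `random_state` is one option. The toy table below is invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"sequence_id": range(10)})

first_rows = toy.head(3)  # rows 0, 1, 2 in file order
random_rows = toy.sample(n=3, random_state=0)  # reproducible random subset

print(first_rows["sequence_id"].tolist())
print(sorted(random_rows["sequence_id"].tolist()))
```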


### 3.1 Create Apriori object

In [13]:
from hilary.apriori import Apriori
apriori = Apriori(silent=False, threads=-1) # show progress bars, use all threads


In [14]:
dataframe_processed = apriori.preprocess(df=dataframe, df_kappa=None)
apriori.classes = create_classes(dataframe_processed)


100%|██████████| 2886/2886 [00:41<00:00, 68.90it/s] 
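
The log messages in this notebook refer to "VJl classes". As an illustration only, the sketch below groups a toy table by V gene, J gene and junction length, which is our reading of that term. It is not the implementation of `create_classes`; the column names `v_gene`, `j_gene` and `cdr3_length` and the toy values are assumptions.

```python
import pandas as pd

# Toy sequences; gene names and junctions are invented for illustration.
toy = pd.DataFrame(
    {
        "v_gene": ["IGHV1-2", "IGHV1-2", "IGHV3-23", "IGHV1-2"],
        "j_gene": ["IGHJ4", "IGHJ4", "IGHJ6", "IGHJ4"],
        "junction": ["TGTGCGAGA", "TGTGCGAAA", "TGTGCGAGAGAT", "TGTGCGAGAGAT"],
    }
)
toy["cdr3_length"] = toy["junction"].str.len()

# One row per (V gene, J gene, length) class with the number of sequences in it.
classes = (
    toy.groupby(["v_gene", "j_gene", "cdr3_length"])
    .size()
    .reset_index(name="n_sequences")
)
print(classes)
```

Only sequences in the same class have junctions of equal length, so Hamming distances between them are well defined.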


### 3.2 Infer histograms, the parameters rho and mu, and the sensitivity and precision thresholds for all classes

In [16]:
apriori.get_histograms(dataframe_processed)
apriori.get_parameters()
apriori.get_thresholds()


2024-02-14 12:27:44 [debug    ] Computing CDR3 hamming distances within all large VJl classes.


100%|██████████| 2103/2103 [00:10<00:00, 192.25it/s] 
100%|██████████| 2103/2103 [00:03<00:00, 661.16it/s]
100%|██████████| 52/52 [00:00<00:00, 276.76it/s]
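
The debug message above mentions CDR3 Hamming distances computed within classes. The sketch below shows what a histogram of pairwise Hamming distances looks like for a handful of invented, equal-length sequences. It only illustrates the quantity being histogrammed; it is not how `get_histograms` is implemented, and the helper names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def distance_histogram(sequences):
    """Count how many unordered pairs fall at each Hamming distance."""
    return Counter(hamming(a, b) for a, b in combinations(sequences, 2))


cdr3s = ["TGTGCGAGA", "TGTGCGAAA", "TGTACGAGA", "TGTGCGAGA"]
histogram = distance_histogram(cdr3s)
for distance in sorted(histogram):
    print(distance, histogram[distance])
```

This brute-force version compares every pair, so its cost grows quadratically with the number of sequences in a class.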


### 3.3 Create HILARy object from the Apriori object

In [17]:
from hilary.inference import HILARy
hilary = HILARy(apriori)  # hilary.df is what is being updated by the algorithm


### 3.4 Compute precise and sensitive clusters

In [18]:
dataframe_cdr3 = hilary.compute_prec_sens_clusters(df=dataframe_processed)


100%|██████████| 2886/2886 [00:02<00:00, 1427.18it/s]
100%|██████████| 2886/2886 [00:02<00:00, 1347.35it/s]
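
To build intuition for "precise" versus "sensitive" clusters, the sketch below applies single-linkage clustering on Hamming distance at two thresholds. This assumes that the clusters are obtained by thresholding CDR3 distances, which this notebook does not show explicitly. The thresholds 1 and 2 and the sequences are invented; in the pipeline the thresholds come from `apriori.get_thresholds()`.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def single_linkage(sequences, threshold):
    """Label sequences so that any pair within `threshold` shares a label."""
    parent = list(range(len(sequences)))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(sequences)), 2):
        if hamming(sequences[i], sequences[j]) <= threshold:
            parent[find(j)] = find(i)

    # Relabel roots as 0, 1, 2, ... in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(sequences))]


cdr3s = ["TGTGCGAGA", "TGTGCGAAA", "TGTACGCGA", "TGTTTTTTT"]
precise = single_linkage(cdr3s, threshold=1)    # stricter: fewer false merges
sensitive = single_linkage(cdr3s, threshold=2)  # looser: fewer missed links
print(precise)
print(sensitive)
```

With the stricter threshold the third sequence stays on its own; with the looser one it joins the first two, so the sensitive partition is coarser than the precise one.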


### 3.5 Infer clonal families from these clusters

In [19]:
dataframe_inferred = hilary.infer(df=dataframe_cdr3)


2024-02-14 12:28:37 [debug    ] Checking alignment length.     alignment_length=271
2024-02-14 12:28:37 [debug    ] Inferring family clusters for small groups.


100%|██████████| 8294/8294 [04:44<00:00, 29.10it/s]


2024-02-14 12:33:24 [debug    ] Inferring family clusters for large groups.
