# **Example inference pipeline**
This notebook infers lineages from the [Briney et al. 2019](https://doi.org/10.1038/s41586-019-0879-y) dataset.

## 1. Download data
Annotated data can be downloaded using the links provided in the [briney/grp_paper repository](https://github.com/briney/grp_paper).  
Uncomment the following two lines to download all data into the `./data/` folder (make sure your current working directory is `.../HILARy/`).

In [16]:
#!wget http://burtonlab.s3.amazonaws.com/sequencing-data/hiseq_2016-supplement/316188_HNCHNBCXY_consensus_UID18-cdr3nt-90_minimal_071817.tar.gz
#!tar -xvf 316188_HNCHNBCXY_consensus_UID18-cdr3nt-90_minimal_071817.tar.gz --directory data/


--2024-02-17 16:07:38--  http://burtonlab.s3.amazonaws.com/sequencing-data/hiseq_2016-supplement/316188_HNCHNBCXY_consensus_UID18-cdr3nt-90_minimal_071817.tar.gz
Resolving burtonlab.s3.amazonaws.com (burtonlab.s3.amazonaws.com)... 54.231.166.217, 52.216.212.241, 52.217.139.41, ...
Connecting to burtonlab.s3.amazonaws.com (burtonlab.s3.amazonaws.com)|54.231.166.217|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3551933919 (3,3G) [application/x-tar]
Saving to: ‘316188_HNCHNBCXY_consensus_UID18-cdr3nt-90_minimal_071817.tar.gz’


2024-02-17 16:10:42 (18,4 MB/s) - ‘316188_HNCHNBCXY_consensus_UID18-cdr3nt-90_minimal_071817.tar.gz’ saved [3551933919/3551933919]

consensus-cdr3nt-90_minimal/
consensus-cdr3nt-90_minimal/14_consensus.txt
consensus-cdr3nt-90_minimal/6_consensus.txt
consensus-cdr3nt-90_minimal/3_consensus.txt
consensus-cdr3nt-90_minimal/15_consensus.txt
consensus-cdr3nt-90_minimal/16_consensus.txt
consensus-cdr3nt-90_minimal/7_consensus.txt
consensus-cdr3
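
If `wget` and `tar` are not available on your system, the same two steps can be sketched with the Python standard library. This is only a minimal sketch: the helper names `download_archive` and `extract_archive` are hypothetical and not part of HILARy, and the URL is the one from the `wget` command above.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL copied from the wget command above.
ARCHIVE_URL = (
    "http://burtonlab.s3.amazonaws.com/sequencing-data/hiseq_2016-supplement/"
    "316188_HNCHNBCXY_consensus_UID18-cdr3nt-90_minimal_071817.tar.gz"
)


def download_archive(url: str, dest: Path) -> Path:
    """Download `url` to `dest` unless the file already exists (hypothetical helper)."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():
        urllib.request.urlretrieve(url, dest)
    return dest


def extract_archive(archive: Path, out_dir: Path) -> list:
    """Extract a .tar.gz archive into `out_dir` and return the member names."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(out_dir)
    return names


# Usage (commented out because the archive is about 3.3 GB):
# archive = download_archive(ARCHIVE_URL, Path("data") / ARCHIVE_URL.rsplit("/", 1)[-1])
# extract_archive(archive, Path("data"))
```

Skipping the download when the file already exists makes it safe to re-run the cell without fetching the archive again.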

## 2. Convert Briney data into the AIRR format required by HILARy

### 2.1 Install required libraries

In [1]:
# `os` is part of the Python standard library and does not need to be installed.
!pip install tqdm
!pip install pandas
!pip install biopython


You should consider upgrading via the '/home/gathenes/.pyenv/versions/3.9.6/envs/hilary/bin/python3.9 -m pip install --upgrade pip' command.


### 2.2 Process Briney data

In [2]:
import os
import pandas as pd
from tqdm import tqdm
from hilary.utils import create_classes


In [17]:
from compatible import Compatible
compatible = Compatible()
usecols = [
    "seq_id",
    "chain",
    "productive",
    "v_full",
    "j_full",
    "cdr3_nt",
    "v_start",
    "vdj_nt",
    "isotype",
]
dirname = "./data/consensus-cdr3nt-90_minimal"
dfs = []
for filename in tqdm(os.listdir(dirname)):
    df = pd.read_csv(os.path.join(dirname, filename), usecols=usecols)
    dfs.append(compatible.df2airr(df))
df = pd.concat(dfs, ignore_index=True)
df["sequence_id"] = df.index
filename = "./data/316188_ids.tsv.gz"
df[["seq_id", "sequence_id"]].to_csv(filename, sep="\t", index=False)
df.drop("seq_id", axis=1, inplace=True)
filename = "./data/316188.tsv.gz"
usecols = [
    "sequence_id",
    "v_call",
    "j_call",
    "junction",
    "v_sequence_alignment",
    "j_sequence_alignment",
    "v_germline_alignment",
    "j_germline_alignment",
]
df[usecols].to_csv(filename, sep="\t", index=False)


100%|██████████| 18/18 [04:43<00:00, 15.76s/it]
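
Before moving on, it can be useful to confirm that the converted table contains every column written in the cell above. The following is a minimal sketch; `check_airr_columns` is a hypothetical helper and not part of HILARy or of the `compatible` module.

```python
import pandas as pd

# Columns written to 316188.tsv.gz by the conversion cell above.
REQUIRED_COLUMNS = [
    "sequence_id",
    "v_call",
    "j_call",
    "junction",
    "v_sequence_alignment",
    "j_sequence_alignment",
    "v_germline_alignment",
    "j_germline_alignment",
]


def check_airr_columns(df: pd.DataFrame, required=REQUIRED_COLUMNS) -> list:
    """Return the required columns that are missing from `df` (hypothetical helper)."""
    return [col for col in required if col not in df.columns]


# Usage: an empty list means every expected column is present.
# missing = check_airr_columns(df)
# if missing:
#     raise ValueError(f"Missing columns: {missing}")
```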


In [3]:
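# Same columns as written by the conversion cell above.
usecols = [
    "sequence_id",
    "v_call",
    "j_call",
    "junction",
    "v_sequence_alignment",
    "j_sequence_alignment",
    "v_germline_alignment",
    "j_germline_alignment",
]
# NOTE: the conversion cell above writes ./data/316188.tsv.gz; adjust this path if your file is there.
filename = "data/briney_dataset/316188.tsv.gz"
dataframe = pd.read_table(filename, usecols=usecols)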


## 3. Package tutorial: inferring lineages from a Python script

### 3.0 Uncomment the next line to run on the first 100,000 sequences only

In [20]:
#dataframe=dataframe.head(100000)
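
`head` keeps the first rows of the table, which follow the order in which the input files were concatenated. If a subset that is spread across the whole dataset is preferred, a random sample is one alternative. This is a sketch only; `subsample` is a hypothetical helper, and whether random sampling is appropriate for lineage inference on your data is a judgment call, since it thins out every clonal family.

```python
import pandas as pd


def subsample(df: pd.DataFrame, n: int, seed: int = 0) -> pd.DataFrame:
    """Return at most `n` rows drawn uniformly without replacement."""
    if len(df) <= n:
        return df
    # A fixed seed keeps the subset reproducible between runs.
    return df.sample(n=n, random_state=seed)


# Usage:
# dataframe = subsample(dataframe, 100000)
```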


### 3.1 Create the Apriori object

In [4]:
from hilary.apriori import Apriori
apriori = Apriori(silent=False, threads=-1, precision=0.99, sensitivity=0.9) # show progress bars, use all threads


In [5]:
dataframe_processed = apriori.preprocess(df=dataframe, df_kappa=None)
apriori.classes = create_classes(dataframe_processed)


100%|██████████| 7180/7180 [09:59<00:00, 11.98it/s]  
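
The log output further down refers to "VJl classes". Assuming these are groups of sequences sharing the same V gene, J gene, and CDR3 length, the grouping can be illustrated with plain pandas. This is not the implementation of `create_classes`; the function name `vjl_class_sizes` and the choice to strip the allele suffix after `*` are assumptions made for illustration.

```python
import pandas as pd


def vjl_class_sizes(df: pd.DataFrame) -> pd.DataFrame:
    """Count sequences per (V gene, J gene, CDR3 length) group (illustrative only)."""
    keys = pd.DataFrame(
        {
            # Assumption: drop the allele suffix, e.g. "IGHV1-2*01" -> "IGHV1-2".
            "v_gene": df["v_call"].str.split("*").str[0],
            "j_gene": df["j_call"].str.split("*").str[0],
            "cdr3_length": df["junction"].str.len(),
        }
    )
    sizes = keys.groupby(["v_gene", "j_gene", "cdr3_length"]).size()
    return sizes.rename("n_sequences").reset_index()


example = pd.DataFrame(
    {
        "v_call": ["IGHV1-2*01", "IGHV1-2*02", "IGHV3-23*01"],
        "j_call": ["IGHJ4*02", "IGHJ4*02", "IGHJ6*01"],
        "junction": ["TGTGCGAGA", "TGTGCGAAA", "TGTGCG"],
    }
)
print(vjl_class_sizes(example))
```

In the example, the first two sequences fall into the same group because they share V gene, J gene, and junction length once the allele is ignored.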


### 3.2 Compute histograms, infer parameters rho and mu, and derive sensitivity and precision thresholds for all classes

In [6]:
apriori.get_histograms(dataframe_processed)
apriori.get_parameters()
apriori.get_thresholds()


2024-02-17 19:34:21 [debug    ] Computing CDR3 hamming distances within all large VJl classes.


100%|██████████| 4295/4295 [02:38<00:00, 27.04it/s] 
100%|██████████| 4295/4295 [00:15<00:00, 275.82it/s]
100%|██████████| 83/83 [00:00<00:00, 88.63it/s] 
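
The debug message above mentions CDR3 Hamming distances within classes. To make the quantity concrete, here is a brute-force sketch of a pairwise Hamming distance histogram for sequences of equal length. It is illustrative only: it is not how `get_histograms` is implemented, it does not fit rho or mu, and the function names are hypothetical.

```python
from itertools import combinations

import numpy as np


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def hamming_histogram(cdr3s: list) -> np.ndarray:
    """Return counts where counts[d] is the number of pairs at distance d."""
    length = len(cdr3s[0])
    counts = np.zeros(length + 1, dtype=int)
    # Quadratic in the number of sequences; fine for a small illustration.
    for a, b in combinations(cdr3s, 2):
        counts[hamming(a, b)] += 1
    return counts


print(hamming_histogram(["AAAA", "AAAT", "TTTT"]))  # [0 1 0 1 1]
```

The three pairs in the example are at distances 1, 4, and 3, so the histogram has one count in each of those bins.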


### 3.3 Create the HILARy object from the Apriori object

In [7]:
from hilary.inference import HILARy
hilary = HILARy(apriori)  # hilary.df is the dataframe updated by the algorithm


### 3.4 Compute precise and sensitive clusters

In [8]:
dataframe_cdr3 = hilary.compute_prec_sens_clusters(df=dataframe_processed)


100%|██████████| 7180/7180 [00:08<00:00, 799.19it/s] 
100%|██████████| 7180/7180 [00:08<00:00, 832.18it/s] 
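
To give an intuition for why two partitions are computed, the sketch below runs single-linkage clustering on CDR3 Hamming distance at two fixed thresholds. It assumes that a stricter threshold yields a finer, more precise partition and a looser threshold a coarser, more sensitive one. This is an illustration under that assumption, not the code behind `compute_prec_sens_clusters`; in particular, the thresholds here are fixed by hand rather than derived per class as in section 3.2.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def single_linkage(seqs: list, threshold: int) -> list:
    """Label sequences so that any pair within `threshold` shares a cluster."""
    parent = list(range(len(seqs)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        if hamming(seqs[i], seqs[j]) <= threshold:
            parent[find(j)] = find(i)

    # Relabel roots as consecutive integers in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(seqs))]


cdr3s = ["AAAA", "AAAT", "AATT", "TTTT"]
print(single_linkage(cdr3s, threshold=0))  # [0, 1, 2, 3]
print(single_linkage(cdr3s, threshold=1))  # [0, 0, 0, 1]
print(single_linkage(cdr3s, threshold=2))  # [0, 0, 0, 0]
```

With threshold 1, `AAAA` and `AATT` end up together even though they differ at two positions, because single linkage chains them through `AAAT`.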


### 3.5 Infer clonal families from these clusters

In [9]:
dataframe_inferred = hilary.infer(df=dataframe_cdr3)


2024-02-17 19:41:40 [debug    ] Checking alignment length.     alignment_length=271
2024-02-17 19:41:40 [debug    ] Inferring family clusters for small groups.


100%|██████████| 3169/3169 [03:55<00:00, 13.44it/s]


2024-02-17 19:45:42 [debug    ] Inferring family clusters for large groups.


In [10]:
dataframe_inferred.to_csv("./data/results/briney_data/briney_families_new_1e3.csv")
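
`to_csv` does not create missing directories, so the call above fails if `./data/results/briney_data/` does not exist yet. A minimal sketch of a wrapper that creates the parent directory first (`save_results` is a hypothetical helper name):

```python
from pathlib import Path

import pandas as pd


def save_results(df: pd.DataFrame, path) -> Path:
    """Write `df` to `path` as CSV, creating parent directories if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path)
    return path


# Usage:
# save_results(dataframe_inferred, "./data/results/briney_data/briney_families_new_1e3.csv")
```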
