# **Example inference pipeline**
This notebook infers lineages from the [Briney et al. 2019](https://doi.org/10.1038/s41586-019-0879-y) dataset.

##  1. Download data  
Annotated data can be downloaded using links provided in the [briney/grp_paper repository](https://github.com/briney/grp_paper).  
Uncomment the following two lines to download all the data into the `./data/` folder (make sure your current working directory is `.../HILARy/`).

In [1]:
#!wget http://burtonlab.s3.amazonaws.com/sequencing-data/hiseq_2016-supplement/316188_HNCHNBCXY_consensus_UID18-cdr3nt-90_minimal_071817.tar.gz
#!tar -xvf 316188_HNCHNBCXY_consensus_UID18-cdr3nt-90_minimal_071817.tar.gz --directory data/


## 2. Convert the Briney data into the AIRR format required by HILARy

### 2.1 Install required libraries

In [2]:
# os is part of the Python standard library and does not need to be installed
!pip install tqdm
!pip install pandas
!pip install biopython


### 2.2 Process the Briney data

In [2]:
import os
import pandas as pd
from tqdm import tqdm


In [3]:
from compatible import Compatible

compatible = Compatible()

# Columns to read from each Briney CSV file
usecols = [
    "seq_id",
    "chain",
    "productive",
    "v_full",
    "j_full",
    "cdr3_nt",
    "v_start",
    "vdj_nt",
    "isotype",
]
dirname = "./data/consensus-cdr3nt-90_minimal"
dfs = []
# Convert each file to AIRR format, then concatenate the results
for filename in tqdm(os.listdir(dirname)):
    df = pd.read_csv(os.path.join(dirname, filename), usecols=usecols)
    dfs.append(compatible.df2airr(df))
df = pd.concat(dfs, ignore_index=True)
# Use the row index as the new sequence identifier
df["sequence_id"] = df.index
# Save the mapping between the original Briney IDs and the new sequence IDs
filename = "./data/316188_ids.tsv.gz"
df[["seq_id", "sequence_id"]].to_csv(filename, sep="\t", index=False)
df.drop("seq_id", axis=1, inplace=True)
# Save the AIRR columns used in the rest of this notebook
filename = "./data/316188.tsv.gz"
usecols = [
    "sequence_id",
    "v_call",
    "j_call",
    "junction",
    "v_sequence_alignment",
    "j_sequence_alignment",
    "v_germline_alignment",
    "j_germline_alignment",
]
df[usecols].to_csv(filename, sep="\t", index=False)


100%|██████████| 18/18 [04:54<00:00, 16.39s/it]


In [3]:
usecols = [
    "sequence_id",
    "v_call",
    "j_call",
    "junction",
    "v_sequence_alignment",
    "j_sequence_alignment",
    "v_germline_alignment",
    "j_germline_alignment",
]
filename = "data/316188.tsv.gz"
dataframe = pd.read_table(filename, usecols=usecols)


## 3. Tutorial: infer lineages with the HILARy Python package

### 3.0 Optional: keep only the first 100,000 sequences (comment out the next line to run on the full dataset)

In [6]:
dataframe = dataframe.head(100000)


### 3.1 Create an Apriori object
* The `Apriori` class filters the dataframe by removing sequences with:
  - a CDR3 length that is not a multiple of three
  - a CDR3 length outside [15, 81]
  - more than 60 mutations from the germline
  - a null value in one of the columns
* The processed dataframe is stored in the `.df` attribute
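
The four filters listed above can be sketched in plain pandas. This is only an illustration and not the code that `Apriori` runs: `filter_sequences` and `count_mismatches` are hypothetical helpers, the `junction` column is assumed to stand in for the CDR3, and mutations are assumed to be counted as mismatches over the V and J alignments combined.

```python
import pandas as pd


def count_mismatches(seq: str, germline: str) -> int:
    """Count differing positions between two aligned sequences (up to the shorter length)."""
    return sum(a != b for a, b in zip(seq, germline))


def filter_sequences(df: pd.DataFrame, max_mutations: int = 60) -> pd.DataFrame:
    """Hypothetical re-implementation of the four filters (not HILARy's own code)."""
    # Null filter: drop rows with a missing value in any column.
    df = df.dropna()
    # Assumption: the junction column is used as the CDR3.
    cdr3_length = df["junction"].str.len()
    # Assumption: mutations are mismatches over the V and J alignments combined.
    mutations = pd.Series(
        [
            count_mismatches(v_seq, v_germ) + count_mismatches(j_seq, j_germ)
            for v_seq, v_germ, j_seq, j_germ in zip(
                df["v_sequence_alignment"],
                df["v_germline_alignment"],
                df["j_sequence_alignment"],
                df["j_germline_alignment"],
            )
        ],
        index=df.index,
        dtype=int,
    )
    keep = (
        (cdr3_length % 3 == 0)          # length is a multiple of three
        & cdr3_length.between(15, 81)   # length in [15, 81], bounds included
        & (mutations <= max_mutations)  # at most 60 mutations
    )
    return df[keep]


# Toy data: only row 0 passes every filter.
toy = pd.DataFrame(
    {
        "junction": ["A" * 15, "A" * 16, "A" * 12, "A" * 18, None],
        "v_sequence_alignment": ["ACGT" * 25] * 3 + ["A" * 100, "ACGT" * 25],
        "v_germline_alignment": ["ACGT" * 25] * 3 + ["C" * 100, "ACGT" * 25],
        "j_sequence_alignment": ["ACGT"] * 5,
        "j_germline_alignment": ["ACGT"] * 5,
    }
)
filtered = filter_sequences(toy)
print(filtered.index.tolist())  # [0]
```

Real alignments may contain gaps or ambiguous bases, which this sketch would count as mismatches.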

In [8]:
from hilary.apriori import Apriori
apriori = Apriori(dataframe, silent=False)
# apriori.df is a processed version of dataframe that the model is going to use


100%|██████████| 1390/1390 [00:05<00:00, 268.90it/s]


2024-01-09 11:40:45 [debug    ] Filtering sequences            criteria_four=With a null column value. criteria_one=CDR3 length not multiple of three. criteria_three=More than 60 sequence mutations from germline. criteria_two=CDR3 length not in [15,81].


### 3.2 Compute histograms, infer the parameters rho and mu, and derive sensitivity and precision thresholds for all classes

In [9]:
apriori.get_histograms()
apriori.get_parameters()
apriori.get_thresholds()


2024-01-09 11:40:45 [debug    ] Computing CDR3 hamming distances within l class. CDR3_length=45


100%|██████████| 55/55 [00:11<00:00,  4.60it/s]

2024-01-09 11:40:58 [debug    ] Computing CDR3 hamming distances within l class. CDR3_length=39



100%|██████████| 53/53 [00:11<00:00,  4.71it/s]

2024-01-09 11:41:10 [debug    ] Computing CDR3 hamming distances within l class. CDR3_length=42



100%|██████████| 54/54 [00:12<00:00,  4.50it/s]

2024-01-09 11:41:23 [debug    ] Computing CDR3 hamming distances within l class. CDR3_length=54



100%|██████████| 60/60 [00:11<00:00,  5.18it/s]

2024-01-09 11:41:35 [debug    ] Computing CDR3 hamming distances within l class. CDR3_length=48



100%|██████████| 56/56 [00:13<00:00,  4.06it/s]


2024-01-09 11:41:50 [debug    ] Computing CDR3 hamming distances within l class. CDR3_length=51


100%|██████████| 63/63 [00:15<00:00,  4.07it/s]

2024-01-09 11:42:06 [debug    ] Computing CDR3 hamming distances within l class. CDR3_length=36



100%|██████████| 53/53 [00:11<00:00,  4.46it/s]

2024-01-09 11:42:19 [debug    ] Computing CDR3 hamming distances within l class. CDR3_length=57



100%|██████████| 60/60 [00:11<00:00,  5.00it/s]


2024-01-09 11:42:31 [debug    ] Computing CDR3 hamming distances within l class. CDR3_length=33


100%|██████████| 50/50 [00:11<00:00,  4.25it/s]


2024-01-09 11:42:44 [debug    ] Computing CDR3 hamming distances within l class. CDR3_length=60


100%|██████████| 54/54 [00:13<00:00,  4.03it/s]

2024-01-09 11:42:58 [debug    ] Computing CDR3 hamming distances within l class. CDR3_length=30



100%|██████████| 32/32 [00:06<00:00,  4.91it/s]

2024-01-09 11:43:05 [debug    ] Computing CDR3 hamming distances within l class. CDR3_length=27



100%|██████████| 33/33 [00:07<00:00,  4.43it/s]

2024-01-09 11:43:13 [debug    ] Computing CDR3 hamming distances within l class. CDR3_length=63



100%|██████████| 51/51 [00:10<00:00,  4.71it/s]

2024-01-09 11:43:25 [debug    ] Computing CDR3 hamming distances within l class. CDR3_length=66



100%|██████████| 49/49 [00:10<00:00,  4.55it/s]

2024-01-09 11:43:37 [debug    ] Computing CDR3 hamming distances within l class. CDR3_length=69



100%|██████████| 36/36 [00:07<00:00,  4.60it/s]

2024-01-09 11:43:45 [debug    ] Computing CDR3 hamming distances within all large VJl classes.



100%|██████████| 40/40 [00:30<00:00,  1.32it/s]
100%|██████████| 55/55 [00:09<00:00,  5.95it/s]


2024-01-09 11:44:26 [debug    ] Assigning effective prevalence.


100%|██████████| 23/23 [00:04<00:00,  4.88it/s]


2024-01-09 11:44:32 [debug    ] Assigning effective mean distance


100%|██████████| 23/23 [00:04<00:00,  4.86it/s]
100%|██████████| 23/23 [00:04<00:00,  5.13it/s]
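
The log above reports CDR3 Hamming distances computed within each length class. As a rough illustration of what such a histogram contains, here is a brute-force sketch over a handful of equal-length sequences. `hamming` and `distance_histogram` are hypothetical helpers written for this example; the package's own computation is presumably more efficient and may differ in detail.

```python
from collections import Counter
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def distance_histogram(cdr3s: list[str]) -> list[int]:
    """Entry k counts the pairs of sequences at Hamming distance k."""
    length = len(cdr3s[0])
    counts = Counter(hamming(a, b) for a, b in combinations(cdr3s, 2))
    return [counts.get(k, 0) for k in range(length + 1)]


# Toy length class: three closely related sequences and one unrelated sequence.
toy = ["AAAAAA", "AAAAAT", "AAAATT", "CCCCCC"]
histogram = distance_histogram(toy)
print(histogram)  # [0, 2, 1, 0, 0, 0, 3]
```

Pairs of related sequences pile up at small distances and unrelated pairs at large ones. The brute-force loop is quadratic in the number of sequences, so it is only suitable for small examples.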


### 3.3 Create a HILARy object from the Apriori object

In [10]:
from hilary.inference import HILARy
hilary = HILARy(apriori)  # hilary.df is the dataframe that the algorithm updates


### 3.4 Compute precise and sensitive clusters

In [11]:
hilary.compute_prec_sens_clusters()


100%|██████████| 1099/1099 [00:00<00:00, 1200.67it/s]
100%|██████████| 1099/1099 [00:00<00:00, 1190.58it/s]
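
One way to picture the difference between precise and sensitive clusters is to cluster the same sequences at two Hamming-distance thresholds. The sketch below assumes single-linkage clustering on CDR3 Hamming distance, which this notebook does not confirm. `single_linkage` is a hypothetical helper, and the thresholds 1 and 2 are arbitrary, whereas HILARy derives its thresholds per class in `apriori.get_thresholds()`.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def single_linkage(seqs: list[str], threshold: int) -> list[int]:
    """Label sequences so that any pair within `threshold` ends up in the same cluster."""
    parent = list(range(len(seqs)))

    def find(i: int) -> int:
        # Follow parent links to the root, shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        if hamming(seqs[i], seqs[j]) <= threshold:
            parent[find(i)] = find(j)  # merge the two clusters

    roots = [find(i) for i in range(len(seqs))]
    # Relabel roots as 0, 1, 2, ... in order of first appearance.
    labels = {root: k for k, root in enumerate(dict.fromkeys(roots))}
    return [labels[root] for root in roots]


toy = ["AAAAAA", "AAAAAT", "AAATTT", "CCCCCC"]
precise = single_linkage(toy, threshold=1)    # low threshold: finer partition
sensitive = single_linkage(toy, threshold=2)  # higher threshold: coarser partition
print(precise)    # [0, 0, 1, 2]
print(sensitive)  # [0, 0, 0, 1]
```

With the lower threshold, the third sequence stays apart; raising the threshold merges it with the first two. Every precise cluster is contained in a sensitive one.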


### 3.5 Infer clonal families from these clusters

In [12]:
hilary.infer()


2024-01-09 11:48:02 [debug    ] Grouping precise clusters together to reach desired sensitivity.


100%|██████████| 22880/22880 [2:19:56<00:00,  2.73it/s]       


### 3.6 Update original dataframe with clonal families

In [13]:
# Rows of `dataframe` that were removed during the computation (non-productive, null values, etc.)
mask = ~dataframe.index.isin(hilary.df.index)
# Assign clonal families to the retained sequences (values are aligned on the index)
dataframe["family"] = hilary.df["family"]
# Label the removed sequences with family 0
dataframe.loc[mask, "family"] = 0
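
The update above relies on pandas aligning the assigned column on the index. The toy example below reproduces the same three steps on made-up data (the `original` and `filtered` frames and their family labels are invented for illustration) and adds an integer cast, since the temporary missing values turn the column into floats.

```python
import pandas as pd

# Toy stand-ins: `original` plays the role of `dataframe`,
# `filtered` the role of `hilary.df` (rows 1 and 3 were filtered out).
original = pd.DataFrame({"junction": ["a", "b", "c", "d"]})
filtered = original.drop(index=[1, 3]).assign(family=[7, 9])

# Rows of the original frame that are absent from the filtered frame.
mask = ~original.index.isin(filtered.index)
# Assignment aligns on the index, so rows 1 and 3 receive NaN.
original["family"] = filtered["family"]
# Label the removed rows with family 0.
original.loc[mask, "family"] = 0
# NaN forced a float column; cast back to integers once no NaN remains.
original["family"] = original["family"].astype(int)

families = original["family"].tolist()
print(families)  # [7, 0, 9, 0]
```

Because family 0 marks removed sequences, rows labeled 0 do not belong to a common lineage and should be excluded from downstream lineage analyses.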
