# Dandelion class

Much of the functionality of the `dandelion` package revolves around the `Dandelion` class object. The class acts as an intermediary object for storage and flexible interaction with other tools. This section is a quick primer on the `Dandelion` class.

<b>Import modules</b>

In [1]:
import os

os.chdir(os.path.expanduser("~/Downloads/dandelion_tutorial/"))
import dandelion as ddl

ddl.logging.print_versions()

dandelion==0.3.9.dev14 pandas==1.5.3 numpy==1.26.4 matplotlib==3.8.4 networkx==2.8.8 scipy==1.11.4


In [2]:
vdj = ddl.read_h5ddl("dandelion_results.h5ddl")
vdj

Dandelion class object with n_obs = 2077 and n_contigs = 4895
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 

Essentially, the `.data` slot holds the AIRR contig table, while the `.metadata` slot holds a collapsed, per-cell version that can be combined with `AnnData`'s `.obs` slot. You can retrieve these slots as you would the attributes of any class object; for example, to get the metadata:

In [3]:
vdj.metadata

Unnamed: 0,clone_id,clone_id_by_size,sample_id,locus_VDJ,locus_VJ,productive_VDJ,productive_VJ,v_call_genotyped_VDJ,d_call_VDJ,j_call_VDJ,...,mu_count_VJ,mu_count,junction_length_VDJ,junction_length_VJ,junction_aa_length_VDJ,junction_aa_length_VJ,np1_length_VDJ,np1_length_VJ,np2_length_VDJ,np2_length_VJ
sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC,B_VDJ_115_3_1_VJ_25_2_3,89,sc5p_v2_hs_PBMC_10k,IGH,IGK,T,T,"IGHV1-69,IGHV1-69D",IGHD3-22,IGHJ3,...,0.0,0.0,63.0,33.0,21.0,11.0,4.0,0.0,5.0,0.0
sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG,B_VDJ_98_1_1_VJ_186_1_1,1846,sc5p_v2_hs_PBMC_10k,IGH,IGL,T,T,IGHV1-2,,IGHJ3,...,8.0,30.0,42.0,33.0,14.0,11.0,18.0,0.0,0.0,0.0
sc5p_v2_hs_PBMC_10k_AAACCTGTCTTGAGAC,B_VDJ_128_4_4_VJ_196_1_1,1490,sc5p_v2_hs_PBMC_10k,IGH,IGK,T,T,IGHV5-51,,IGHJ3,...,0.0,0.0,54.0,33.0,18.0,11.0,24.0,0.0,0.0,0.0
sc5p_v2_hs_PBMC_10k_AAACGGGAGCGACGTA,B_VDJ_15_2_1_VJ_115_2_7,1489,sc5p_v2_hs_PBMC_10k,IGH,IGL,T,T,IGHV4-4,IGHD6-13,IGHJ3,...,0.0,0.0,54.0,39.0,18.0,13.0,10.0,0.0,0.0,0.0
sc5p_v2_hs_PBMC_10k_AAACGGGCACTGTTAG,B_VDJ_140_2_2_VJ_167_3_4,1488,sc5p_v2_hs_PBMC_10k,IGH,IGL,T,T,IGHV4-39,IGHD3-22,IGHJ3,...,0.0,0.0,51.0,39.0,17.0,13.0,5.0,0.0,7.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
vdj_v1_hs_pbmc3_TTTCCTCAGCAATATG,B_VDJ_30_2_1_VJ_37_2_7,748,vdj_v1_hs_pbmc3,IGH,IGK,T,T,IGHV2-5,"IGHD5/OR15-5b,IGHD5/OR15-5a","IGHJ4,IGHJ5",...,13.0,26.0,30.0,33.0,10.0,11.0,0.0,0.0,14.0,0.0
vdj_v1_hs_pbmc3_TTTCCTCAGCGCTTAT,B_VDJ_100_6_1_VJ_166_1_3,749,vdj_v1_hs_pbmc3,IGH,IGK,T,T,IGHV3-30,IGHD4-17,IGHJ6,...,0.0,0.0,60.0,33.0,20.0,11.0,4.0,0.0,10.0,0.0
vdj_v1_hs_pbmc3_TTTCCTCAGGGAAACA,B_VDJ_201_1_1_VJ_23_4_12,750,vdj_v1_hs_pbmc3,IGH,IGK,T,T,IGHV4-59,IGHD6-13,IGHJ2,...,6.0,22.0,48.0,33.0,16.0,11.0,3.0,0.0,1.0,0.0
vdj_v1_hs_pbmc3_TTTGCGCCATACCATG,B_VDJ_2_4_1_VJ_128_3_5,751,vdj_v1_hs_pbmc3,IGH,IGL,T,T,"IGHV1-69,IGHV1-69D",IGHD2-15,IGHJ6,...,0.0,0.0,66.0,39.0,22.0,13.0,5.0,0.0,3.0,0.0
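
To make the relationship between the two slots more concrete, here is a minimal sketch in plain Python that collapses a toy per-contig table into one record per cell. It is not `dandelion`'s actual implementation: the rows, the `collapse_to_cells` helper and the choice to join multiple calls with `|` are assumptions made purely for illustration.

```python
from collections import defaultdict

# Toy per-contig records, loosely modelled on a few AIRR columns (made-up values).
contigs = [
    {"sequence_id": "cellA_contig_1", "cell_id": "cellA", "locus": "IGK", "v_call": "IGKV1-8"},
    {"sequence_id": "cellA_contig_2", "cell_id": "cellA", "locus": "IGH", "v_call": "IGHV1-69"},
    {"sequence_id": "cellB_contig_1", "cell_id": "cellB", "locus": "IGL", "v_call": "IGLV5-45"},
    {"sequence_id": "cellB_contig_2", "cell_id": "cellB", "locus": "IGH", "v_call": "IGHV1-2"},
]


def collapse_to_cells(rows):
    """Collapse contig-level rows into one record per cell (hypothetical helper)."""
    per_cell = defaultdict(lambda: {"v_call_VDJ": [], "v_call_VJ": []})
    for row in rows:
        # Loci that undergo V(D)J recombination go to the VDJ column, the rest to VJ.
        key = "v_call_VDJ" if row["locus"] in {"IGH", "TRB", "TRD"} else "v_call_VJ"
        per_cell[row["cell_id"]][key].append(row["v_call"])
    # Join multiple calls per cell with '|' so each cell ends up with one value per column.
    return {cell: {k: "|".join(v) for k, v in cols.items()} for cell, cols in per_cell.items()}


metadata = collapse_to_cells(contigs)
print(len(contigs), "contigs ->", len(metadata), "cells")
print(metadata["cellA"])
```

The point of the sketch is only the shape change: many contig rows on the way in, one row per cell barcode on the way out, which is what allows the result to line up with a per-cell table such as `AnnData`'s `.obs`.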


### slicing

You can slice the `Dandelion` object by indexing on either the `.data` or the `.metadata` slot; the behavior is similar to slicing a pandas `DataFrame` or an `AnnData` object.

<b>slicing</b> `.data`

In [4]:
# get the largest clone
largest_clone = vdj.data["clone_id"].value_counts().idxmax()

vdj[vdj.data["clone_id"] == largest_clone]

Dandelion class object with n_obs = 522 and n_contigs = 595
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j

In [5]:
vdj[
    vdj.data_names.isin(
        [
            "sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_contig_1",
            "sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_2",
            "sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_1",
            "sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_1",
            "sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_2",
        ]
    )
]

Dandelion class object with n_obs = 2 and n_contigs = 4
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_num

<b>slicing</b> `.metadata`

In [6]:
vdj[vdj.metadata["productive_VDJ"].isin(["T", "T|T"])]

Dandelion class object with n_obs = 2076 and n_contigs = 4888
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 

In [7]:
vdj[vdj.metadata_names == "vdj_v1_hs_pbmc3_TTTCCTCAGCGCTTAT"]

Dandelion class object with n_obs = 1 and n_contigs = 2
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_num
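
The `n_obs` and `n_contigs` counts in the outputs above suggest that a slice on one slot carries over to the other. The sketch below mimics that idea on a toy mapping in plain Python. The identifiers and both helper functions are hypothetical, and the real class may handle edge cases differently.

```python
# Toy contig table: sequence_id -> cell_id (made-up identifiers).
contig_to_cell = {
    "cellA_contig_1": "cellA",
    "cellA_contig_2": "cellA",
    "cellB_contig_1": "cellB",
    "cellC_contig_1": "cellC",
    "cellC_contig_2": "cellC",
}


def slice_by_contigs(table, wanted_ids):
    """Keep the requested contigs, then keep only cells that still own a contig."""
    kept = {sid: cell for sid, cell in table.items() if sid in wanted_ids}
    return kept, sorted(set(kept.values()))


def slice_by_cells(table, wanted_cells):
    """Keep the requested cells together with every contig that belongs to them."""
    kept = {sid: cell for sid, cell in table.items() if cell in wanted_cells}
    return kept, sorted(set(kept.values()))


# Ids that are not present in the table are simply ignored.
contigs, cells = slice_by_contigs(
    contig_to_cell, {"cellA_contig_1", "cellB_contig_1", "missing_contig"}
)
print(len(cells), "cells,", len(contigs), "contigs")

contigs2, cells2 = slice_by_cells(contig_to_cell, {"cellC"})
print(len(cells2), "cells,", len(contigs2), "contigs")
```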

### copy

You can deep copy the `Dandelion` object to another variable which will inherit all slots:

In [8]:
vdj2 = vdj.copy()
vdj2.metadata

Unnamed: 0,clone_id,clone_id_by_size,sample_id,locus_VDJ,locus_VJ,productive_VDJ,productive_VJ,v_call_genotyped_VDJ,d_call_VDJ,j_call_VDJ,...,mu_count_VJ,mu_count,junction_length_VDJ,junction_length_VJ,junction_aa_length_VDJ,junction_aa_length_VJ,np1_length_VDJ,np1_length_VJ,np2_length_VDJ,np2_length_VJ
sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC,B_VDJ_115_3_1_VJ_25_2_3,89,sc5p_v2_hs_PBMC_10k,IGH,IGK,T,T,"IGHV1-69,IGHV1-69D",IGHD3-22,IGHJ3,...,0.0,0.0,63.0,33.0,21.0,11.0,4.0,0.0,5.0,0.0
sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG,B_VDJ_98_1_1_VJ_186_1_1,1846,sc5p_v2_hs_PBMC_10k,IGH,IGL,T,T,IGHV1-2,,IGHJ3,...,8.0,30.0,42.0,33.0,14.0,11.0,18.0,0.0,0.0,0.0
sc5p_v2_hs_PBMC_10k_AAACCTGTCTTGAGAC,B_VDJ_128_4_4_VJ_196_1_1,1490,sc5p_v2_hs_PBMC_10k,IGH,IGK,T,T,IGHV5-51,,IGHJ3,...,0.0,0.0,54.0,33.0,18.0,11.0,24.0,0.0,0.0,0.0
sc5p_v2_hs_PBMC_10k_AAACGGGAGCGACGTA,B_VDJ_15_2_1_VJ_115_2_7,1489,sc5p_v2_hs_PBMC_10k,IGH,IGL,T,T,IGHV4-4,IGHD6-13,IGHJ3,...,0.0,0.0,54.0,39.0,18.0,13.0,10.0,0.0,0.0,0.0
sc5p_v2_hs_PBMC_10k_AAACGGGCACTGTTAG,B_VDJ_140_2_2_VJ_167_3_4,1488,sc5p_v2_hs_PBMC_10k,IGH,IGL,T,T,IGHV4-39,IGHD3-22,IGHJ3,...,0.0,0.0,51.0,39.0,17.0,13.0,5.0,0.0,7.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
vdj_v1_hs_pbmc3_TTTCCTCAGCAATATG,B_VDJ_30_2_1_VJ_37_2_7,748,vdj_v1_hs_pbmc3,IGH,IGK,T,T,IGHV2-5,"IGHD5/OR15-5b,IGHD5/OR15-5a","IGHJ4,IGHJ5",...,13.0,26.0,30.0,33.0,10.0,11.0,0.0,0.0,14.0,0.0
vdj_v1_hs_pbmc3_TTTCCTCAGCGCTTAT,B_VDJ_100_6_1_VJ_166_1_3,749,vdj_v1_hs_pbmc3,IGH,IGK,T,T,IGHV3-30,IGHD4-17,IGHJ6,...,0.0,0.0,60.0,33.0,20.0,11.0,4.0,0.0,10.0,0.0
vdj_v1_hs_pbmc3_TTTCCTCAGGGAAACA,B_VDJ_201_1_1_VJ_23_4_12,750,vdj_v1_hs_pbmc3,IGH,IGK,T,T,IGHV4-59,IGHD6-13,IGHJ2,...,6.0,22.0,48.0,33.0,16.0,11.0,3.0,0.0,1.0,0.0
vdj_v1_hs_pbmc3_TTTGCGCCATACCATG,B_VDJ_2_4_1_VJ_128_3_5,751,vdj_v1_hs_pbmc3,IGH,IGL,T,T,"IGHV1-69,IGHV1-69D",IGHD2-15,IGHJ6,...,0.0,0.0,66.0,39.0,22.0,13.0,5.0,0.0,3.0,0.0
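
For readers less familiar with the term, the following sketch uses Python's standard `copy` module to show what "deep" means in practice. The nested dictionary is only a stand-in for an object with two slots, not the real `Dandelion` class.

```python
import copy

# A stand-in for an object with two table-like slots (not the real Dandelion class).
original = {
    "data": [{"sequence_id": "cellA_contig_1", "productive": "T"}],
    "metadata": {"cellA": {"productive_VDJ": "T"}},
}

shallow = copy.copy(original)   # new outer dict, but the inner objects are shared
deep = copy.deepcopy(original)  # fully independent copy of every nested object

# Editing the deep copy leaves the original untouched.
deep["metadata"]["cellA"]["productive_VDJ"] = "F"
after_deep_edit = original["metadata"]["cellA"]["productive_VDJ"]

# Editing through the shallow copy changes the original as well.
shallow["metadata"]["cellA"]["productive_VDJ"] = "F"
after_shallow_edit = original["metadata"]["cellA"]["productive_VDJ"]

print(after_deep_edit, after_shallow_edit)
```

A deep copy is therefore the safer choice when you want to experiment on a second object without risking changes to the first.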


### Retrieving entries with `update_metadata`

The `.metadata` slot of the `Dandelion` class is initialized automatically whenever the `.data` slot is filled. However, it only contains a pre-specified, standard set of columns. To retrieve other columns from the `.data` slot, update the metadata with `ddl.update_metadata` and specify the `retrieve` and `retrieve_mode` options.

The following modes determine how the retrieval is completed:

`split and unique only` - splits the retrieval into VDJ and VJ chains. A `|` will separate _**unique**_ elements.

`split and merge` - splits the retrieval into VDJ and VJ chains. A `|` will separate _**every**_ element.

`merge and unique only` - similar to `split and unique only`, but merged into a single column.

`split` - split retrieval into _**individual**_ columns for each contig.

`merge` - merge retrieval into a _**single**_ column where a `|` will separate _**every**_ element.

For numerical columns, there are additional options:

`split and sum` - splits the retrieval into VDJ and VJ chains and sum separately.

`split and average` - similar to `split and sum`, but averages instead of sums.

`sum` - sum the retrievals into a single column.

`average` - averages the retrievals into a single column.

If `retrieve_mode` is not specified, it will default to `split and merge`.
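
The differences between the modes are easier to see on a tiny example. The sketch below re-creates most of them in plain Python for a single cell with three contigs. It is a hypothetical illustration rather than `dandelion`'s code: the function names, the toy values and the ordering of merged elements (VDJ first, then VJ) are assumptions. The `split` mode is left out because the naming of its per-contig columns is not described here.

```python
# Toy contigs for a single cell as (chain group, value) pairs; values are made up.
c_calls = [("VDJ", "IGHM"), ("VJ", "IGKC"), ("VJ", "IGKC")]
umi_counts = [("VDJ", 51.0), ("VJ", 43.0), ("VJ", 7.0)]


def _split(contigs):
    """Separate values into VDJ and VJ lists, keeping contig order."""
    vdj = [value for chain, value in contigs if chain == "VDJ"]
    vj = [value for chain, value in contigs if chain == "VJ"]
    return vdj, vj


def _unique(values):
    """Drop repeated values while keeping first-seen order."""
    return list(dict.fromkeys(values))


def retrieve_text(contigs, mode="split and merge"):
    """Hypothetical re-creation of the string modes (not dandelion's code)."""
    vdj, vj = _split(contigs)
    if mode == "split and unique only":
        return {"VDJ": "|".join(_unique(vdj)), "VJ": "|".join(_unique(vj))}
    if mode == "split and merge":
        return {"VDJ": "|".join(vdj), "VJ": "|".join(vj)}
    if mode == "merge and unique only":
        return "|".join(_unique(vdj + vj))
    if mode == "merge":
        return "|".join(vdj + vj)
    raise ValueError(f"unsupported mode: {mode}")


def retrieve_number(contigs, mode):
    """Hypothetical re-creation of the numerical modes (not dandelion's code)."""
    vdj, vj = _split(contigs)

    def mean(values):
        # Return NaN for an empty chain group instead of dividing by zero.
        return sum(values) / len(values) if values else float("nan")

    if mode == "split and sum":
        return {"VDJ": sum(vdj), "VJ": sum(vj)}
    if mode == "split and average":
        return {"VDJ": mean(vdj), "VJ": mean(vj)}
    if mode == "sum":
        return sum(vdj + vj)
    if mode == "average":
        return mean(vdj + vj)
    raise ValueError(f"unsupported mode: {mode}")


for mode in ("split and unique only", "split and merge", "merge and unique only", "merge"):
    print(mode, "->", retrieve_text(c_calls, mode))
for mode in ("split and sum", "split and average", "sum", "average"):
    print(mode, "->", retrieve_number(umi_counts, mode))
```

In this toy cell, the two identical light-chain values show the difference between the "unique only" and "merge" variants: the former reports the value once, the latter once per contig.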

***Example: retrieving fwr1 sequences***

In [9]:
ddl.update_metadata(vdj, retrieve="fwr1")
vdj

Dandelion class object with n_obs = 2077 and n_contigs = 4895
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 

Note the additional `fwr1` VDJ and VJ columns in the metadata slot.

By default, `dandelion` will not try to merge numerical columns, as doing so can create mixed-dtype columns.
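
As a rough illustration of how a mixed-dtype column could arise, the sketch below joins numbers with `|` only for cells that have more than one contig. The `naive_merge` helper and the values are made up; this is one plausible way to end up with mixed types, not a description of what `dandelion` does internally.

```python
# junction_length-like values per cell (made-up numbers); cellB has two contigs.
values_per_cell = {"cellA": [63.0], "cellB": [42.0, 33.0], "cellC": [54.0]}


def naive_merge(values):
    """Join several numbers with '|' but leave a lone number untouched (illustrative only)."""
    return values[0] if len(values) == 1 else "|".join(str(v) for v in values)


merged_column = {cell: naive_merge(vals) for cell, vals in values_per_cell.items()}
types_in_column = {type(v).__name__ for v in merged_column.values()}

print(merged_column)
print(sorted(types_in_column))
```

A column that holds both floats and strings can no longer be summed or averaged directly, which is why the dedicated numerical modes are the better fit for such data.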

There is a new sub-function that will try to retrieve frequently used columns such as `np1_length` and `np2_length`:

In [10]:
vdj.update_plus()
vdj

Dandelion class object with n_obs = 2077 and n_contigs = 4895
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 

### concatenating multiple objects

This is a simple function to concatenate (append) two or more `Dandelion` objects or `pandas` DataFrames. Note that this operates on the `.data` slot and not the `.metadata` slot.

In [11]:
# for example, the original dandelion class has 2077 unique cell barcodes and 4895 contigs
vdj

Dandelion class object with n_obs = 2077 and n_contigs = 4895
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 

In [12]:
# now it has 14685 (4895*3) contigs instead, and the metadata should also be properly populated
vdj_concat = ddl.concat([vdj, vdj, vdj])
vdj_concat

Dandelion class object with n_obs = 6231 and n_contigs = 14685
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn',

In [13]:
vdj_concat.data[["sequence_id", "cell_id"]].head()

| sequence_id (index) | sequence_id | cell_id |
|---|---|---|
| sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_2_0 | sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_2_0 | sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_0 |
| sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_1_0 | sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_1_0 | sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_0 |
| sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_2_1 | sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_2_1 | sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_1 |
| sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_1_1 | sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_1_1 | sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_1 |
| sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_2_2 | sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_2_2 | sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_2 |


`ddl.concat` also lets you supply custom prefixes/suffixes to add to the sequence ids. If none are provided, it will append `_0`, `_1` etc. as a suffix when it detects that the sequence ids are not unique, as seen above.
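
The suffixing seen in the `sequence_id` column above (`_0`, `_1`, `_2`) can be pictured with a few lines of plain Python. The `make_ids_unique` helper below is hypothetical and only mirrors the visible outcome; it does not use or reproduce `ddl.concat`'s own arguments.

```python
def make_ids_unique(batches, sep="_"):
    """Append a per-batch suffix when ids collide across batches (toy sketch)."""
    all_ids = [sid for batch in batches for sid in batch]
    if len(set(all_ids)) == len(all_ids):
        return all_ids  # already unique, nothing to do
    # Otherwise tag every id with the position of the batch it came from.
    return [f"{sid}{sep}{i}" for i, batch in enumerate(batches) for sid in batch]


batch = ["cellA_contig_1", "cellA_contig_2"]
result = make_ids_unique([batch, batch, batch])
print(result)
```

Without such a suffix, three copies of the same object would share identical sequence ids, and the contig table could no longer be indexed unambiguously.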

### read/write

A `Dandelion` object can be saved using the `.write_h5ddl` and `.write_pkl` functions, with accompanying compression methods such as `gzip`. `write_h5ddl` primarily uses the `h5py` library, while `write_pkl` uses `pickle`. The `read_h5ddl` and `read_pkl` functions read the respective file formats.

In [14]:
%time vdj.write_h5ddl('dandelion_results.h5ddl', compression="gzip")

CPU times: user 7.39 s, sys: 229 ms, total: 7.62 s
Wall time: 7.98 s


If you see any warnings above, they are due to mixed dtypes somewhere in the object. Check the affected columns if you think they will interfere with downstream usage.

In [15]:
%time vdj_1 = ddl.read_h5ddl('dandelion_results.h5ddl')
vdj_1

CPU times: user 1.26 s, sys: 91 ms, total: 1.35 s
Wall time: 1.57 s


Dandelion class object with n_obs = 2077 and n_contigs = 4895
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 

The read/write times using `pickle` can be situationally faster/slower and file sizes can also be situationally smaller/larger (depending on which compression is used).

In [16]:
%time vdj.write_pkl('dandelion_results.pkl.gz')

CPU times: user 5.89 s, sys: 47.5 ms, total: 5.94 s
Wall time: 6.1 s


In [17]:
%time vdj_2 = ddl.read_pkl('dandelion_results.pkl.gz')
vdj_2

CPU times: user 213 ms, sys: 52.4 ms, total: 265 ms
Wall time: 308 ms


Dandelion class object with n_obs = 2077 and n_contigs = 4895
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 
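
For context, the general technique behind a gzip-compressed pickle file can be sketched with the standard library alone. This shows the generic `pickle` plus `gzip` round trip on a stand-in dictionary; it is not a description of how `write_pkl` or `read_pkl` are implemented, and the file name is made up.

```python
import gzip
import os
import pickle
import tempfile

# Stand-in payload; the real object is far richer than this toy dictionary.
payload = {
    "data": [{"sequence_id": "cellA_contig_1"}],
    "metadata": {"cellA": {"clone_id": "1"}},
}

path = os.path.join(tempfile.mkdtemp(), "toy_results.pkl.gz")

with gzip.open(path, "wb") as handle:  # compress while writing
    pickle.dump(payload, handle, protocol=pickle.HIGHEST_PROTOCOL)

with gzip.open(path, "rb") as handle:  # decompress while reading
    restored = pickle.load(handle)

print(restored == payload)
```

Keep in mind that unpickling executes code embedded in the file, so pickle files should only be loaded from sources you trust.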

There are also other writing functions, such as `.write_airr` and `.write_10x`, which write the object to a `.tsv` or `.csv` file compatible with the `airr` and `10x` formats, respectively.

In [18]:
import pandas as pd

vdj2.write_airr("test.airr.tsv")
df = pd.read_csv("test.airr.tsv", sep="\t")
df

Unnamed: 0,sequence_id,sequence,rev_comp,productive,v_call,d_call,j_call,sequence_alignment,germline_alignment,junction,...,j_call_multiplicity,j_call_sequence_start_multimappers,j_call_sequence_end_multimappers,j_call_support_multimappers,mu_count,ambiguous,extra,rearrangement_status,clone_id,changeo_clone_id
0,sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_2,ATCACATAACAACCACATTCCTCCTCTAAAGAAGCCCCTGGGAGCA...,F,T,"IGHV1-69*01,IGHV1-69D*01",IGHD3-22*01,IGHJ3*02,CAGGTGCAGCTGGTGCAGTCTGGGGCT...GAGGTGAAGAAGCCTG...,CAGGTGCAGCTGGTGCAGTCTGGGGCT...GAGGTGAAGAAGCCTG...,TGTGCGACTACGTATTACTATGATAGTAGTGGTTATTACCAGAATG...,...,1.0,445.0,494.0,4.25e-23,0,F,F,standard,B_VDJ_115_3_1_VJ_25_2_3,11_0
1,sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_1,AGGAGTCAGACCCTGTCAGGACACAGCATAGACATGAGGGTCCCCG...,F,T,IGKV1-8*01,,IGKJ1*01,GCCATCCGGATGACCCAGTCTCCATCCTCATTCTCTGCATCTACAG...,GCCATCCGGATGACCCAGTCTCCATCCTCATTCTCTGCATCTACAG...,TGTCAACAGTATTATAGTTACCCTCGGACGTTC,...,1.0,380.0,415.0,2.51e-15,0,F,F,standard,B_VDJ_115_3_1_VJ_25_2_3,11_0
2,sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_1,ACTGTGGGGGTAAGAGGTTGTGTCCACCATGGCCTGGACTCCTCTC...,F,T,IGLV5-45*02,,IGLJ3*02,CAGGCTGTGCTGACTCAGCCGTCTTCC...CTCTCTGCATCTCCTG...,CAGGCTGTGCTGACTCAGCCGTCTTCC...CTCTCTGCATCTCCTG...,TGTATGATTTGGCACAGCAGCGCTTGGGTGGTC,...,1.0,402.0,431.0,6.35e-12,8,F,F,standard,B_VDJ_98_1_1_VJ_186_1_1,150_1
3,sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_2,GGGAGCATCACCCAGCAACCACATCTGTCCTCTAGAGAATCCCCTG...,F,T,IGHV1-2*02,,IGHJ3*02,CAGGTGCAACTGGTGCAGTCTGGGGGT...GAGGTAAAGAAGCCTG...,CAGGTGCAGCTGGTGCAGTCTGGGGCT...GAGGTGAAGAAGCCTG...,TGTGCGAGAGAGATAGAGGGGGACGGTGTTTTTGAAATCTGG,...,1.0,433.0,479.0,4.17e-18,22,F,F,standard,B_VDJ_98_1_1_VJ_186_1_1,150_1
4,sc5p_v2_hs_PBMC_10k_AAACCTGTCTTGAGAC_contig_2,GGAGTCTCCCTCACCGCCCAGCTGGGATCTCAGGGCTTCATTTTCT...,F,T,IGHV5-51*01,,IGHJ3*02,GAGGTGCAGCTGGTGCAGTCTGGAGCA...GAGGTGAAAAAGCCGG...,GAGGTGCAGCTGGTGCAGTCTGGAGCA...GAGGTGAAAAAGCCCG...,TGTGCGAGACATATCCGTGGGAACAGATTTGGCAATGATGCTTTTG...,...,1.0,437.0,486.0,4.19e-23,0,F,F,standard,B_VDJ_128_4_4_VJ_196_1_1,323_2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4890,vdj_v1_hs_pbmc3_TTTCCTCAGGGAAACA_contig_2,ACTTTCTGAGAGTCCTGGACCTCCTGTGCAAGAACATGAAACATCT...,F,T,IGHV4-59*08,IGHD6-13*01,IGHJ2*01,CAGGTGCAGCTGCAGGAGTCGGGCCCA...GGACTGGTAAAACCTT...,CAGGTGCAGCTGCAGGAGTCGGGCCCA...GGACTGGTGAAGCCTT...,TGTGCGAGACCCCGTATAGCAGGATCTGGGTGGTACTTCGATCTCTGG,...,1.0,405,453,1.42e-22,16,F,F,standard,B_VDJ_201_1_1_VJ_23_4_12,1205_1980
4891,vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_2,ATCACATAACAACCACATTCCTCCTCTAAAGAAGCCCCTGGGAGCA...,F,T,"IGHV1-69*01,IGHV1-69D*01",IGHD2-15*01,IGHJ6*02,CAGGTGCAGCTGGTGCAGTCTGGGGCT...GAGGTGAAGAAGCCTG...,CAGGTGCAGCTGGTGCAGTCTGGGGCT...GAGGTGAAGAAGCCTG...,TGTGCGAGATCTCTGGATATTGTAGTGGTGGTAGCACTCTACTACT...,...,1.0,439,496,1.5300000000000001e-27,0,F,F,standard,B_VDJ_2_4_1_VJ_128_3_5,1807_1981
4892,vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_1,AGCTTCAGCTGTGGTAGAGAAGACAGGATTCAGGACAATCTCCAGC...,F,T,IGLV1-47*01,,IGLJ3*02,CAGTCTGTGCTGACTCAGCCACCCTCA...GCGTCTGGGACCCCCG...,CAGTCTGTGCTGACTCAGCCACCCTCA...GCGTCTGGGACCCCCG...,TGTGCAGCATGGGATGACAGCCTGAGTGGTTGGGTGTTC,...,1.0,397,434,2.28e-16,0,F,F,standard,B_VDJ_2_4_1_VJ_128_3_5,1807_1981
4893,vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_2,GGCTGGGGTCTCAGGAGGCAGCACTCTCGGGACGTCTCCACCATGG...,F,T,IGLV2-11*01,,"IGLJ2*01,IGLJ3*01,IGLJ3*02",CAGTCTGCCCTGACTCAGCCTCGCTCA...GTGTCCGGGTCTCCTG...,CAGTCTGCCCTGACTCAGCCTCGCTCA...GTGTCCGGGTCTCCTG...,TGCTGCTCATATGCAGGCAGCTACACTGTGTTTTTC,...,1.0,393,430,2.28e-11,4,F,F,standard,B_VDJ_190_5_3_VJ_191_3_2,1941_1982
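
Because the AIRR output is a plain tab-separated file, it can also be inspected without `pandas`. The sketch below parses a small in-memory example with the standard `csv` module. The two rows are made up and only a handful of AIRR-style column names are included.

```python
import csv
import io

# Two made-up rows in a tab-separated layout with a few AIRR-style column names.
tsv_text = (
    "sequence_id\tproductive\tv_call\tj_call\n"
    "cellA_contig_1\tT\tIGKV1-8*01\tIGKJ1*01\n"
    "cellA_contig_2\tT\tIGHV1-69*01,IGHV1-69D*01\tIGHJ3*02\n"
)

# Commas inside a field are harmless here because the delimiter is a tab.
rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
productive_ids = [row["sequence_id"] for row in rows if row["productive"] == "T"]

print(len(rows), productive_ids)
```

To read a file on disk instead, pass an open file handle to `csv.DictReader` in place of the `io.StringIO` object.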


In [19]:
vdj2.write_10x(
    folder="10x_test",
    filename_prefix="all",
)  # this writes both the contig_annotations.csv and contig.fasta
df = pd.read_csv("10x_test/all_contig_annotations.csv")
df

Unnamed: 0,barcode,contig_id,length,chain,v_gene,d_gene,j_gene,c_gene,full_length,productive,cdr3,cdr3_nt,reads,umis,raw_clonotype_id,raw_consensus_id
0,sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC,sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_2,565,IGH,"IGHV1-69*01,IGHV1-69D*01",IGHD3-22*01,IGHJ3*02,IGHM,,True,CATTYYYDSSGYYQNDAFDIW,TGTGCGACTACGTATTACTATGATAGTAGTGGTTATTACCAGAATG...,4161,51,B_VDJ_115_3_1_VJ_25_2_3,B_VDJ_115_3_1_VJ_25_2_3
1,sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC,sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_1,551,IGK,IGKV1-8*01,,IGKJ1*01,IGKC,,True,CQQYYSYPRTF,TGTCAACAGTATTATAGTTACCCTCGGACGTTC,5679,43,B_VDJ_115_3_1_VJ_25_2_3,B_VDJ_115_3_1_VJ_25_2_3
2,sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG,sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_1,642,IGL,IGLV5-45*02,,IGLJ3*02,IGLC3,,True,CMIWHSSAWVV,TGTATGATTTGGCACAGCAGCGCTTGGGTGGTC,13160,90,B_VDJ_98_1_1_VJ_186_1_1,B_VDJ_98_1_1_VJ_186_1_1
3,sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG,sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_2,550,IGH,IGHV1-2*02,,IGHJ3*02,IGHM,,True,CAREIEGDGVFEIW,TGTGCGAGAGAGATAGAGGGGGACGGTGTTTTTGAAATCTGG,5080,47,B_VDJ_98_1_1_VJ_186_1_1,B_VDJ_98_1_1_VJ_186_1_1
4,sc5p_v2_hs_PBMC_10k_AAACCTGTCTTGAGAC,sc5p_v2_hs_PBMC_10k_AAACCTGTCTTGAGAC_contig_2,557,IGH,IGHV5-51*01,,IGHJ3*02,IGHM,,True,CARHIRGNRFGNDAFDIW,TGTGCGAGACATATCCGTGGGAACAGATTTGGCAATGATGCTTTTG...,8292,80,B_VDJ_128_4_4_VJ_196_1_1,B_VDJ_128_4_4_VJ_196_1_1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4890,vdj_v1_hs_pbmc3_TTTCCTCAGGGAAACA,vdj_v1_hs_pbmc3_TTTCCTCAGGGAAACA_contig_2,524,IGH,IGHV4-59*08,IGHD6-13*01,IGHJ2*01,IGHM,,True,CARPRIAGSGWYFDLW,TGTGCGAGACCCCGTATAGCAGGATCTGGGTGGTACTTCGATCTCTGG,1257,14,B_VDJ_201_1_1_VJ_23_4_12,B_VDJ_201_1_1_VJ_23_4_12
4891,vdj_v1_hs_pbmc3_TTTGCGCCATACCATG,vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_2,568,IGH,"IGHV1-69*01,IGHV1-69D*01",IGHD2-15*01,IGHJ6*02,IGHM,,True,CARSLDIVVVVALYYYYGMDVW,TGTGCGAGATCTCTGGATATTGTAGTGGTGGTAGCACTCTACTACT...,2464,32,B_VDJ_2_4_1_VJ_128_3_5,B_VDJ_2_4_1_VJ_128_3_5
4892,vdj_v1_hs_pbmc3_TTTGCGCCATACCATG,vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_1,645,IGL,IGLV1-47*01,,IGLJ3*02,IGLC3,,True,CAAWDDSLSGWVF,TGTGCAGCATGGGATGACAGCCTGAGTGGTTGGGTGTTC,2457,28,B_VDJ_2_4_1_VJ_128_3_5,B_VDJ_2_4_1_VJ_128_3_5
4893,vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG,vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_2,641,IGL,IGLV2-11*01,,"IGLJ2*01,IGLJ3*01,IGLJ3*02",IGLC,,True,CCSYAGSYTVFF,TGCTGCTCATATGCAGGCAGCTACACTGTGTTTTTC,2744,36,B_VDJ_190_5_3_VJ_191_3_2,B_VDJ_190_5_3_VJ_191_3_2
