# *Dandelion* class

![dandelion_logo](img/dandelion_logo_illustration.png)

Much of the functionality of the `dandelion` package revolves around the `Dandelion` class object. The class acts as an intermediary object for storage and flexible interaction with other tools. This section is a quick primer on the `Dandelion` class.

***Import modules***

In [1]:
import os
os.chdir(os.path.expanduser('/Users/kt16/Downloads/dandelion_tutorial/'))
import dandelion as ddl
ddl.logging.print_versions()

dandelion==0.2.2.dev86 pandas==1.4.2 numpy==1.21.6 matplotlib==3.5.2 networkx==2.8.4 scipy==1.8.1


In [2]:
vdj = ddl.read_h5ddl('dandelion_results.h5ddl')
vdj



Dandelion class object with n_obs = 2420 and n_contigs = 4832
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'c_call', 'consensus_count', 'duplicate_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn',

Basically, the object can be summarized in the following illustration:

![dandelion_class <](img/dandelion_class.png)

Essentially, the `.data` slot holds the AIRR contig table while the `.metadata` slot holds a collapsed version that can be combined with `AnnData`'s `.obs` slot. You can retrieve these slots as you would any attribute of a class object; for example, to get the metadata:

In [3]:
vdj.metadata

Unnamed: 0,clone_id,clone_id_by_size,sample_id,locus_VDJ,locus_VJ,productive_VDJ,productive_VJ,v_call_genotyped_VDJ,v_call_genotyped_VJ,d_call_VDJ,...,junction_aa_VJ,isotype,isotype_status,locus_status,productive_status,rearrangement_VDJ_status,rearrangement_VJ_status,constant_VDJ_status,constant_VJ_status,changeo_clone_id
sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC,49_3_1_24_2_2,1194,sc5p_v2_hs_PBMC_10k,IGH,IGK,T,T,IGHV1-69,IGKV1-8,IGHD3-22,...,CQQYYSYPRTF,IgM,IgM,IGH + IGK,T + T,Single,Single,Single,Single,586_1029
sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG,48_1_2_32_1_1,1697,sc5p_v2_hs_PBMC_10k,IGH,IGL,T,T,IGHV1-2,IGLV5-45,,...,CMIWHSSAWVV,IgM,IgM,IGH + IGL,T + T,Multi,Single,Single,Single,963_134
sc5p_v2_hs_PBMC_10k_AAACCTGTCTTGAGAC,192_4_4_127_1_1,1443,sc5p_v2_hs_PBMC_10k,IGH,IGK,T,T,IGHV5-51,IGKV1D-8,,...,CQQYYSFPYTF,IgM,IgM,IGH + IGK,T + T,Multi,Single,Single,Single,409_1729
sc5p_v2_hs_PBMC_10k_AAACGGGAGCGACGTA,10_1_1_181_2_6,1442,sc5p_v2_hs_PBMC_10k,IGH,IGL,T,T,IGHV4-4,IGLV3-19,IGHD6-13,...,CNSRDSSGNHVVF,IgM,IgM,IGH + IGL,T + T,Single,Single,Single,Single,399_1848
sc5p_v2_hs_PBMC_10k_AAACGGGCACTGTTAG,206_2_4_110_4_4,1441,sc5p_v2_hs_PBMC_10k,IGH,IGL,T,T,IGHV4-39,IGLV3-21,IGHD3-22,...,CQVWDSSSDHVVF,IgM,IgM,IGH + IGL,T + T,Single,Single,Single,Single,1363_1800
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
vdj_v1_hs_pbmc3_TTTCCTCAGCAATATG,33_2_1_149_2_7,347,vdj_v1_hs_pbmc3,IGH,IGK,T,T,IGHV2-5,IGKV4-1,"IGHD5/OR15-5b,IGHD5/OR15-5a",...,CQQYYTTPLTF,IgM,IgM,IGH + IGK,T + T,Single,Single,Single,Single,1809_346
vdj_v1_hs_pbmc3_TTTCCTCAGCGCTTAT,45_5_6_20_1_3,348,vdj_v1_hs_pbmc3,IGH,IGK,T,T,IGHV3-30,IGKV2-30,IGHD4-17,...,CMQGTHWPYTF,IgM,IgM,IGH + IGK,T + T,Single,Single,Single,Single,74_1535
vdj_v1_hs_pbmc3_TTTCCTCAGGGAAACA,6_1_1_100_4_11,349,vdj_v1_hs_pbmc3,IGH,IGK,T,T,IGHV4-59,"IGKV1-39,IGKV1D-39",IGHD6-13,...,CQQSYSTPWTF,IgM,IgM,IGH + IGK,T + T,Single,Single,Single,Single,900_11
vdj_v1_hs_pbmc3_TTTGCGCCATACCATG,138_6_1_2_2_2,350,vdj_v1_hs_pbmc3,IGH,IGL,T,T,IGHV1-69,IGLV1-47,IGHD2-15,...,CAAWDDSLSGWVF,IgM,IgM,IGH + IGL,T + T,Single,Single,Single,Single,1410_1536
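
To make the relationship between the two slots a bit more concrete, here is a toy `pandas` sketch of how a contig-level table could be collapsed into a cell-level table with `_VDJ` and `_VJ` columns. This is only an illustration: the `chain` helper column, the `VDJ_LOCI` set and the tiny table are assumptions made up for the example, and it is not how `dandelion` builds `.metadata` internally.

```python
import pandas as pd

# Toy AIRR-style contig table: one row per contig, several contigs per cell.
data = pd.DataFrame(
    {
        "sequence_id": ["cellA_contig_1", "cellA_contig_2", "cellB_contig_1", "cellB_contig_2"],
        "cell_id": ["cellA", "cellA", "cellB", "cellB"],
        "locus": ["IGH", "IGK", "IGH", "IGL"],
        "v_call": ["IGHV1-69", "IGKV1-8", "IGHV1-2", "IGLV5-45"],
    }
).set_index("sequence_id")

# Assumption for this sketch: loci that rearrange V, D and J segments go to
# the VDJ columns, everything else goes to the VJ columns.
VDJ_LOCI = {"IGH", "TRB", "TRD"}
data["chain"] = ["VDJ" if locus in VDJ_LOCI else "VJ" for locus in data["locus"]]

# Collapse to one row per cell with one column per (field, chain) pair.
collapsed = (
    data.groupby(["cell_id", "chain"])[["locus", "v_call"]]
    .agg(lambda values: "|".join(values))
    .unstack("chain")
)
collapsed.columns = [f"{field}_{chain}" for field, chain in collapsed.columns]
print(collapsed)
```

The collapsed table is indexed by cell barcode, which is what makes it straightforward to line up with a cell-level table such as `AnnData`'s `.obs`.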


### copy

You can deep copy the `Dandelion` object to another variable which will inherit all slots:

In [4]:
vdj2 = vdj.copy()
vdj2.metadata

Unnamed: 0,clone_id,clone_id_by_size,sample_id,locus_VDJ,locus_VJ,productive_VDJ,productive_VJ,v_call_genotyped_VDJ,v_call_genotyped_VJ,d_call_VDJ,...,junction_aa_VJ,isotype,isotype_status,locus_status,productive_status,rearrangement_VDJ_status,rearrangement_VJ_status,constant_VDJ_status,constant_VJ_status,changeo_clone_id
sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC,49_3_1_24_2_2,1194,sc5p_v2_hs_PBMC_10k,IGH,IGK,T,T,IGHV1-69,IGKV1-8,IGHD3-22,...,CQQYYSYPRTF,IgM,IgM,IGH + IGK,T + T,Single,Single,Single,Single,586_1029
sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG,48_1_2_32_1_1,1697,sc5p_v2_hs_PBMC_10k,IGH,IGL,T,T,IGHV1-2,IGLV5-45,,...,CMIWHSSAWVV,IgM,IgM,IGH + IGL,T + T,Multi,Single,Single,Single,963_134
sc5p_v2_hs_PBMC_10k_AAACCTGTCTTGAGAC,192_4_4_127_1_1,1443,sc5p_v2_hs_PBMC_10k,IGH,IGK,T,T,IGHV5-51,IGKV1D-8,,...,CQQYYSFPYTF,IgM,IgM,IGH + IGK,T + T,Multi,Single,Single,Single,409_1729
sc5p_v2_hs_PBMC_10k_AAACGGGAGCGACGTA,10_1_1_181_2_6,1442,sc5p_v2_hs_PBMC_10k,IGH,IGL,T,T,IGHV4-4,IGLV3-19,IGHD6-13,...,CNSRDSSGNHVVF,IgM,IgM,IGH + IGL,T + T,Single,Single,Single,Single,399_1848
sc5p_v2_hs_PBMC_10k_AAACGGGCACTGTTAG,206_2_4_110_4_4,1441,sc5p_v2_hs_PBMC_10k,IGH,IGL,T,T,IGHV4-39,IGLV3-21,IGHD3-22,...,CQVWDSSSDHVVF,IgM,IgM,IGH + IGL,T + T,Single,Single,Single,Single,1363_1800
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
vdj_v1_hs_pbmc3_TTTCCTCAGCAATATG,33_2_1_149_2_7,347,vdj_v1_hs_pbmc3,IGH,IGK,T,T,IGHV2-5,IGKV4-1,"IGHD5/OR15-5b,IGHD5/OR15-5a",...,CQQYYTTPLTF,IgM,IgM,IGH + IGK,T + T,Single,Single,Single,Single,1809_346
vdj_v1_hs_pbmc3_TTTCCTCAGCGCTTAT,45_5_6_20_1_3,348,vdj_v1_hs_pbmc3,IGH,IGK,T,T,IGHV3-30,IGKV2-30,IGHD4-17,...,CMQGTHWPYTF,IgM,IgM,IGH + IGK,T + T,Single,Single,Single,Single,74_1535
vdj_v1_hs_pbmc3_TTTCCTCAGGGAAACA,6_1_1_100_4_11,349,vdj_v1_hs_pbmc3,IGH,IGK,T,T,IGHV4-59,"IGKV1-39,IGKV1D-39",IGHD6-13,...,CQQSYSTPWTF,IgM,IgM,IGH + IGK,T + T,Single,Single,Single,Single,900_11
vdj_v1_hs_pbmc3_TTTGCGCCATACCATG,138_6_1_2_2_2,350,vdj_v1_hs_pbmc3,IGH,IGL,T,T,IGHV1-69,IGLV1-47,IGHD2-15,...,CAAWDDSLSGWVF,IgM,IgM,IGH + IGL,T + T,Single,Single,Single,Single,1410_1536
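
If the difference between a deep copy and a plain assignment is unclear, the same behaviour can be seen with an ordinary `pandas` dataframe. This is a generic sketch using a made-up two-row table; it assumes the copied `Dandelion` slots behave like independent dataframes, which is what a deep copy implies.

```python
import pandas as pd

original = pd.DataFrame({"isotype": ["IgM", "IgM"]}, index=["cellA", "cellB"])

alias = original               # plain assignment: both names point to the same object
independent = original.copy()  # pandas copies the data by default (deep=True)

# Editing the copy leaves the original untouched.
independent.loc["cellA", "isotype"] = "IgG"

# Editing through the alias changes the original, because it is the same object.
alias.loc["cellB", "isotype"] = "IgA"

print(original)
```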


### Retrieving entries with `update_metadata`

The `.metadata` slot in the `Dandelion` class is initialized automatically whenever the `.data` slot is filled. However, it only contains a pre-specified set of columns. To retrieve other columns from the `.data` slot, we can update the metadata with `ddl.update_metadata` and specify the options `retrieve` and `retrieve_mode`.

The following modes determine how the retrieval is completed:

`split and unique only` - splits the retrieval into VDJ and VJ chains. A `|` will separate each _**unique**_ element.

`merge and unique only` - similar to above but merged into a single column.

`split` - split retrieval into _**individual**_ columns for each contig.

`merge` - merge retrieval into a _**single**_ column where a `|` will separate _**every**_ element.
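
The four modes above are easier to compare side by side on a tiny example. The following is a toy re-implementation in plain `pandas` based only on the descriptions above; the helper functions, the `chain` column and the numbered column names produced by `split` are all hypothetical and are not taken from `dandelion`'s code.

```python
import pandas as pd

# One cell with one VDJ contig and two VJ contigs that share the same fwr1.
contigs = pd.DataFrame(
    {
        "cell_id": ["cellA", "cellA", "cellA"],
        "chain": ["VDJ", "VJ", "VJ"],
        "fwr1": ["CAGGTG", "GACATC", "GACATC"],
    }
)


def unique_in_order(values):
    """Drop duplicates while keeping the first-seen order."""
    return list(dict.fromkeys(values))


def split_and_unique_only(df, column):
    """One column per chain type; unique values joined with '|'."""
    out = (
        df.groupby(["cell_id", "chain"])[column]
        .agg(lambda v: "|".join(unique_in_order(v)))
        .unstack("chain")
    )
    out.columns = [f"{column}_{chain}" for chain in out.columns]
    return out


def merge_and_unique_only(df, column):
    """A single column; unique values joined with '|'."""
    return df.groupby("cell_id")[column].agg(lambda v: "|".join(unique_in_order(v)))


def split(df, column):
    """One column per contig (column names here are made up for the sketch)."""
    numbered = df.assign(position=df.groupby(["cell_id", "chain"]).cumcount())
    out = numbered.set_index(["cell_id", "chain", "position"])[column]
    out = out.unstack(["chain", "position"])
    out.columns = [f"{column}_{chain}_{position}" for chain, position in out.columns]
    return out


def merge(df, column):
    """A single column; every value joined with '|', duplicates included."""
    return df.groupby("cell_id")[column].agg(lambda v: "|".join(v))


print(split_and_unique_only(contigs, "fwr1"))
print(merge_and_unique_only(contigs, "fwr1"))
print(split(contigs, "fwr1"))
print(merge(contigs, "fwr1"))
```

The duplicated VJ value is what separates the modes here: the `unique only` variants report it once, while `split` and `merge` keep every contig's entry.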

For numerical columns, there are additional options:

`split and sum` - splits the retrieval into VDJ and VJ chains and sums each separately.

`split and average` - similar to above but averages instead of sums.

`sum` - sum the retrievals into a single column.

`average` - averages the retrievals into a single column.

If `retrieve_mode` is not specified, it will default to `split and unique only`.
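
For the numerical modes, a similar toy sketch shows what the four options would compute on a small table. Again, this is my reading of the descriptions above written in plain `pandas`; the `split_numeric` and `merge_numeric` helpers and the `chain` column are hypothetical, and the real function may treat edge cases (such as missing values) differently.

```python
import pandas as pd

contigs = pd.DataFrame(
    {
        "cell_id": ["cellA", "cellA", "cellA", "cellB", "cellB"],
        "chain": ["VDJ", "VJ", "VJ", "VDJ", "VJ"],
        "duplicate_count": [10, 4, 2, 7, 3],
    }
)


def split_numeric(df, column, how):
    """Aggregate per cell and per chain type, giving one column per chain."""
    out = df.groupby(["cell_id", "chain"])[column].agg(how).unstack("chain")
    out.columns = [f"{column}_{chain}" for chain in out.columns]
    return out


def merge_numeric(df, column, how):
    """Aggregate per cell across all contigs, giving a single column."""
    return df.groupby("cell_id")[column].agg(how)


split_sum = split_numeric(contigs, "duplicate_count", "sum")       # 'split and sum'
split_average = split_numeric(contigs, "duplicate_count", "mean")  # 'split and average'
total = merge_numeric(contigs, "duplicate_count", "sum")           # 'sum'
average = merge_numeric(contigs, "duplicate_count", "mean")        # 'average'

print(split_sum)
print(split_average)
print(total)
print(average)
```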

***Example: retrieving fwr1 sequences***

In [5]:
ddl.update_metadata(vdj, retrieve = 'fwr1')
vdj

Dandelion class object with n_obs = 2420 and n_contigs = 4832
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'c_call', 'consensus_count', 'duplicate_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn',

Note the additional `fwr1` VDJ and VJ columns in the metadata slot.

By default, `dandelion` will not try to merge numerical columns as it can create mixed dtype columns.

There is now a class method that will try to retrieve frequently used columns such as `np1_length` and `np2_length`:

In [6]:
vdj.update_plus()
vdj



Dandelion class object with n_obs = 2420 and n_contigs = 4832
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'c_call', 'consensus_count', 'duplicate_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn',

### concatenating multiple objects

This is a simple function to concatenate (append) two or more `Dandelion` class objects or `pandas` dataframes. Note that this operates on the `.data` slot and not the `.metadata` slot.

In [7]:
# for example, the original dandelion class has 2420 unique cell barcodes and 4832 contigs
vdj

Dandelion class object with n_obs = 2420 and n_contigs = 4832
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'c_call', 'consensus_count', 'duplicate_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn',

In [8]:
# now it has 14496 (4832*3) contigs instead, and the metadata should also be properly populated
vdj_concat = ddl.concat([vdj, vdj, vdj])
vdj_concat

Dandelion class object with n_obs = 2420 and n_contigs = 14496
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'c_call', 'consensus_count', 'duplicate_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn'
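
The reason `n_contigs` triples while `n_obs` stays the same is that the same cell barcodes are repeated in every copy. A minimal `pandas` sketch of that counting logic is below; the three-row table is made up, and this only mimics the arithmetic, not whatever additional bookkeeping `ddl.concat` does on the objects.

```python
import pandas as pd

contigs = pd.DataFrame(
    {
        "sequence_id": ["cellA_contig_1", "cellA_contig_2", "cellB_contig_1"],
        "cell_id": ["cellA", "cellA", "cellB"],
    }
)

# Append the same contig table three times.
stacked = pd.concat([contigs, contigs, contigs], ignore_index=True)

n_contigs = len(stacked)              # rows are appended, so this triples
n_obs = stacked["cell_id"].nunique()  # the set of cell barcodes is unchanged

print(f"n_obs = {n_obs} and n_contigs = {n_contigs}")
```

Note that in this plain `pandas` version the `sequence_id` values are no longer unique after stacking, which is something to keep in mind when concatenating tables that share identifiers.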

### read/write

A `Dandelion` object can be saved using the `.write_h5ddl` and `.write_pkl` functions, with accompanying compression methods. `write_h5ddl` primarily uses the pandas `to_hdf` method and `write_pkl` just uses pickle. The `read_h5ddl` and `read_pkl` functions read the respective file formats.

In [9]:
%time vdj.write_h5ddl('dandelion_results.h5ddl', complib = 'bzip2')



CPU times: user 4.45 s, sys: 187 ms, total: 4.64 s
Wall time: 4.78 s


If you see any warnings above, they are due to mixed dtypes somewhere in the object, so check whether they will interfere with downstream usage.
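
One way to do that checking is to look for object columns that hold more than one Python type. The helper below is a hypothetical convenience function written for this tutorial (it is not part of `dandelion`), shown on a made-up table; on a real object you would pass it a dataframe such as `vdj.data` or `vdj.metadata`.

```python
import pandas as pd


def mixed_dtype_columns(df):
    """Return the object columns that hold more than one Python type.

    Missing values are ignored so that a column of strings with a few NaNs
    is not reported.
    """
    mixed = []
    for name in df.select_dtypes(include="object").columns:
        types = {type(value) for value in df[name].dropna()}
        if len(types) > 1:
            mixed.append(name)
    return mixed


table = pd.DataFrame(
    {
        "np1_length": [3, "3|5", 7],        # integers mixed with a merged string
        "locus": ["IGH", "IGK", "IGL"],     # strings only
        "junction_length": [45, 39, 36],    # integers only
    }
)

print(mixed_dtype_columns(table))
```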

In [10]:
%time vdj_1 = ddl.read_h5ddl('dandelion_results.h5ddl')
vdj_1



CPU times: user 1.26 s, sys: 147 ms, total: 1.41 s
Wall time: 1.48 s


Dandelion class object with n_obs = 2420 and n_contigs = 4832
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'c_call', 'consensus_count', 'duplicate_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn',

The read/write times using `pickle` can be situationally faster/slower and file sizes can also be situationally smaller/larger (depending on which compression is used).

In [11]:
%time vdj.write_pkl('dandelion_results.pkl.gz')

CPU times: user 7.5 s, sys: 50.7 ms, total: 7.55 s
Wall time: 7.68 s


In [12]:
%time vdj_2 = ddl.read_pkl('dandelion_results.pkl.gz')
vdj_2

CPU times: user 220 ms, sys: 24.7 ms, total: 245 ms
Wall time: 274 ms


Dandelion class object with n_obs = 2420 and n_contigs = 4832
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'c_call', 'consensus_count', 'duplicate_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn',
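
If you want to get a feel for how the choice of compression affects file size before committing to one, you can try it on a plain dataframe first. The sketch below uses `pandas`' own `to_pickle` and `read_pickle` (which infer the compression from the file extension) on a made-up table written to a temporary directory; it does not call `write_pkl`, so treat the numbers as indicative only.

```python
import os
import tempfile

import pandas as pd

# Made-up table standing in for a contig table.
table = pd.DataFrame(
    {
        "sequence_id": [f"contig_{i}" for i in range(1000)],
        "junction_length": [30 + i % 20 for i in range(1000)],
    }
)

sizes = {}
round_trip_ok = True
with tempfile.TemporaryDirectory() as tmp:
    for extension in ("pkl", "pkl.gz", "pkl.bz2"):
        path = os.path.join(tmp, f"table.{extension}")
        table.to_pickle(path)  # compression is inferred from the extension
        sizes[extension] = os.path.getsize(path)
        # Read the file back and confirm nothing was lost.
        round_trip_ok = round_trip_ok and pd.read_pickle(path).equals(table)

for extension, size in sizes.items():
    print(f"{extension}: {size} bytes")
```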