# *Dandelion* class

![dandelion_logo](img/dandelion_logo_illustration.png)

Much of the functionality of the `dandelion` package revolves around the `Dandelion` class object. The class acts as an intermediary object for storage and flexible interaction with other tools. This section gives a quick primer on the `Dandelion` class.

***Import modules***

In [1]:
import os
os.chdir(os.path.expanduser('/Users/kt16/Downloads/dandelion_tutorial/'))
import dandelion as ddl
ddl.logging.print_versions()

dandelion==0.1.3.post2.dev62 pandas==1.2.3 numpy==1.20.1 matplotlib==3.3.4 networkx==2.5 scipy==1.6.1 skbio==0.5.6


In [2]:
vdj = ddl.read_h5('dandelion_results.h5')
vdj

Dandelion class object with n_obs = 562 and n_contigs = 1130
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'c_call', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start',

Basically, the object can be summarized in the following illustration:

![dandelion_class <](img/dandelion_class.png)

Essentially, the `.data` slot holds the AIRR contig table, while the `.metadata` slot holds a collapsed version that can be combined with `AnnData`'s `.obs` slot. You can access these slots like attributes of any class object; for example, to get the metadata:

In [3]:
vdj.metadata

Unnamed: 0,clone_id,clone_id_by_size,sample_id,locus_VDJ,locus_VJ,productive_VDJ,productive_VJ,v_call_genotyped_VDJ,v_call_genotyped_VJ,j_call_VDJ,...,status,status_summary,productive,productive_summary,isotype,isotype_summary,vdj_status,vdj_status_summary,constant_status_summary,changeo_clone_id
sc5p_v2_hs_PBMC_1k_ACACTGATCGGTTCGG,131_4_1_58,2,sc5p_v2_hs_PBMC_1k,IGH,IGK,True,True,IGHV4-59,IGKV4-1,IGHJ6,...,IGH + IGK,IGH + IGK,True + True,True + True,IgM,IgM,Single + Single,Single,Single,8_0
sc5p_v2_hs_PBMC_1k_CCGGTAGGTCAGAAGC,17_1_1_163,4,sc5p_v2_hs_PBMC_1k,IGH,IGL,True,True,IGHV3-15,IGLV2-23,IGHJ3,...,IGH + IGL,IGH + IGL,True + True,True + True,IgM,IgM,Single + Single,Single,Single,63_1
sc5p_v2_hs_PBMC_1k_CGATCGGAGATGTCGG,146_2_1_38,8,sc5p_v2_hs_PBMC_1k,IGH,IGL,True,True,IGHV4-39,IGLV3-21,IGHJ5,...,IGH + IGL,IGH + IGL,True + True,True + True,IgA,IgA,Single + Multi_VJ_j,Single,Single,67_2
sc5p_v2_hs_PBMC_1k_CGGACGTGTTGATTCG,22_2_1_55,11,sc5p_v2_hs_PBMC_1k,IGH,IGL,True,True,IGHV1-3,IGLV1-51,IGHJ6,...,IGH + IGL,IGH + IGL,True + True,True + True,IgM,IgM,Single + Multi_VJ_j,Single,Single,72_3
sc5p_v2_hs_PBMC_1k_GTACGTACAGCCTATA,59_3_2_308,10,sc5p_v2_hs_PBMC_1k,IGH,IGK,True,True,IGHV4-59,IGKV3-15,IGHJ4,...,IGH + IGK,IGH + IGK,True + True,True + True,IgM,IgM,Single + Single,Single,Single,57_4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
vdj_nextgem_hs_pbmc3_TTCTCAACACATCTTT,116_2_1_191,353,vdj_nextgem_hs_pbmc3,IGH,IGK,True,True,IGHV3-15,IGKV2-28|IGKV2D-28,IGHJ6,...,IGH + IGK,IGH + IGK,True + True,True + True,IgM,IgM,Single + Multi_VJ_v,Single,Single,294_537
vdj_nextgem_hs_pbmc3_TTCTCCTAGGGCACTA,40_5_2_289,382,vdj_nextgem_hs_pbmc3,IGH,IGL,True,True,IGHV1-2,IGLV2-14,IGHJ4,...,IGH + IGL,IGH + IGL,True + True,True + True,IgM,IgM,Single + Multi_VJ_j,Single,Single,465_538
vdj_nextgem_hs_pbmc3_TTCTTAGCAAACGCGA,71_7_1_249,481,vdj_nextgem_hs_pbmc3,IGH,IGK,True,True,IGHV3-15,IGKV1D-39|IGKV1-39,IGHJ4,...,IGH + IGK,IGH + IGK,True + True,True + True,IgM,IgM,Single + Multi_VJ_v,Single,Single,400_539
vdj_nextgem_hs_pbmc3_TTTATGCTCCGCATAA,117_3_2_56,463,vdj_nextgem_hs_pbmc3,IGH,IGL,True,True,IGHV3-7,IGLV1-51,IGHJ6,...,IGH + IGL,IGH + IGL,True + True,True + True,IgM,IgM,Single + Multi_VJ_j,Single,Single,292_540
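
To make the relationship between the two slots concrete, here is a minimal sketch in plain Python of how a per-contig table might collapse into one row per cell, with `_VDJ` (heavy) and `_VJ` (light) suffixes like those in the metadata columns above. This is not `dandelion`'s implementation: the function `collapse_to_cells` and the toy records are hypothetical, and the sketch assumes at most one heavy and one light contig per cell.

```python
from collections import defaultdict

# Toy contig records (made-up values); field names follow the AIRR columns
# listed in the `.data` slot above.
contigs = [
    {"sequence_id": "cellA_contig_1", "cell_id": "cellA", "locus": "IGH", "v_call": "IGHV4-59"},
    {"sequence_id": "cellA_contig_2", "cell_id": "cellA", "locus": "IGK", "v_call": "IGKV4-1"},
    {"sequence_id": "cellB_contig_1", "cell_id": "cellB", "locus": "IGH", "v_call": "IGHV3-15"},
    {"sequence_id": "cellB_contig_2", "cell_id": "cellB", "locus": "IGL", "v_call": "IGLV2-23"},
]


def collapse_to_cells(records):
    """Hypothetical helper: group contig rows by cell_id, one dict per cell.

    Heavy-chain (IGH) fields get a `_VDJ` suffix, light-chain fields `_VJ`.
    Assumes at most one contig per chain per cell; a later contig of the
    same chain would overwrite an earlier one.
    """
    cells = defaultdict(dict)
    for rec in records:
        chain = "VDJ" if rec["locus"] == "IGH" else "VJ"
        cells[rec["cell_id"]]["locus_" + chain] = rec["locus"]
        cells[rec["cell_id"]]["v_call_" + chain] = rec["v_call"]
    return dict(cells)


metadata = collapse_to_cells(contigs)
```

Four contig rows become two cell-level rows, which mirrors how `n_contigs` is larger than `n_obs` in the object summary.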


### copy

You can deep copy the `Dandelion` object to another variable; the copy inherits all slots:

In [4]:
vdj2 = vdj.copy()
vdj2.metadata

Unnamed: 0,clone_id,clone_id_by_size,sample_id,locus_VDJ,locus_VJ,productive_VDJ,productive_VJ,v_call_genotyped_VDJ,v_call_genotyped_VJ,j_call_VDJ,...,status,status_summary,productive,productive_summary,isotype,isotype_summary,vdj_status,vdj_status_summary,constant_status_summary,changeo_clone_id
sc5p_v2_hs_PBMC_1k_ACACTGATCGGTTCGG,131_4_1_58,2,sc5p_v2_hs_PBMC_1k,IGH,IGK,True,True,IGHV4-59,IGKV4-1,IGHJ6,...,IGH + IGK,IGH + IGK,True + True,True + True,IgM,IgM,Single + Single,Single,Single,8_0
sc5p_v2_hs_PBMC_1k_CCGGTAGGTCAGAAGC,17_1_1_163,4,sc5p_v2_hs_PBMC_1k,IGH,IGL,True,True,IGHV3-15,IGLV2-23,IGHJ3,...,IGH + IGL,IGH + IGL,True + True,True + True,IgM,IgM,Single + Single,Single,Single,63_1
sc5p_v2_hs_PBMC_1k_CGATCGGAGATGTCGG,146_2_1_38,8,sc5p_v2_hs_PBMC_1k,IGH,IGL,True,True,IGHV4-39,IGLV3-21,IGHJ5,...,IGH + IGL,IGH + IGL,True + True,True + True,IgA,IgA,Single + Multi_VJ_j,Single,Single,67_2
sc5p_v2_hs_PBMC_1k_CGGACGTGTTGATTCG,22_2_1_55,11,sc5p_v2_hs_PBMC_1k,IGH,IGL,True,True,IGHV1-3,IGLV1-51,IGHJ6,...,IGH + IGL,IGH + IGL,True + True,True + True,IgM,IgM,Single + Multi_VJ_j,Single,Single,72_3
sc5p_v2_hs_PBMC_1k_GTACGTACAGCCTATA,59_3_2_308,10,sc5p_v2_hs_PBMC_1k,IGH,IGK,True,True,IGHV4-59,IGKV3-15,IGHJ4,...,IGH + IGK,IGH + IGK,True + True,True + True,IgM,IgM,Single + Single,Single,Single,57_4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
vdj_nextgem_hs_pbmc3_TTCTCAACACATCTTT,116_2_1_191,353,vdj_nextgem_hs_pbmc3,IGH,IGK,True,True,IGHV3-15,IGKV2-28|IGKV2D-28,IGHJ6,...,IGH + IGK,IGH + IGK,True + True,True + True,IgM,IgM,Single + Multi_VJ_v,Single,Single,294_537
vdj_nextgem_hs_pbmc3_TTCTCCTAGGGCACTA,40_5_2_289,382,vdj_nextgem_hs_pbmc3,IGH,IGL,True,True,IGHV1-2,IGLV2-14,IGHJ4,...,IGH + IGL,IGH + IGL,True + True,True + True,IgM,IgM,Single + Multi_VJ_j,Single,Single,465_538
vdj_nextgem_hs_pbmc3_TTCTTAGCAAACGCGA,71_7_1_249,481,vdj_nextgem_hs_pbmc3,IGH,IGK,True,True,IGHV3-15,IGKV1D-39|IGKV1-39,IGHJ4,...,IGH + IGK,IGH + IGK,True + True,True + True,IgM,IgM,Single + Multi_VJ_v,Single,Single,400_539
vdj_nextgem_hs_pbmc3_TTTATGCTCCGCATAA,117_3_2_56,463,vdj_nextgem_hs_pbmc3,IGH,IGL,True,True,IGHV3-7,IGLV1-51,IGHJ6,...,IGH + IGL,IGH + IGL,True + True,True + True,IgM,IgM,Single + Multi_VJ_j,Single,Single,292_540


### Retrieving entries with `update_metadata`

The `.metadata` slot of the `Dandelion` class is initialized automatically whenever the `.data` slot is filled. However, it only contains a standard, pre-specified set of columns. To retrieve other columns from the `.data` slot, update the metadata with `ddl.update_metadata` and specify the `retrieve` option.

The following options determine how the retrieval is completed:

`split` - splits the retrieval into heavy and light chain calls.

`split_locus` - similar to `split` but splits the retrieval by locus (`IGH/IGK/IGL`).

`collapse` - joins all elements, separating each with a `|`.

`combine` - similar to `collapse` but only retains unique elements (separated by a `|` if multiple are found).
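
The difference between `collapse` and `combine` can be sketched in plain Python. The two functions below are hypothetical stand-ins that only mimic the behaviour described above; they are not the code `dandelion` runs, and keeping first-seen order in `combine` is an assumption.

```python
def collapse(values, sep="|"):
    """Join every element with the separator, keeping duplicates."""
    return sep.join(values)


def combine(values, sep="|"):
    """Join only the unique elements with the separator.

    dict.fromkeys drops duplicates while keeping first-seen order
    (an assumption; the order used by dandelion is not documented here).
    """
    return sep.join(dict.fromkeys(values))


# Made-up D gene calls for three contigs of one cell.
calls = ["IGHD3-10", "IGHD3-10", "IGHD2-2"]

collapsed = collapse(calls)  # 'IGHD3-10|IGHD3-10|IGHD2-2'
combined = combine(calls)    # 'IGHD3-10|IGHD2-2'
```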

***Example 1 : retrieving D gene calls***

In [5]:
ddl.update_metadata(vdj, retrieve = 'd_call')
vdj

Dandelion class object with n_obs = 562 and n_contigs = 1130
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'c_call', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start',

Note the additional `d_call` heavy and light columns in the metadata slot.

By default, `dandelion` will not try to merge numerical columns, as this can create mixed dtype columns.

***Example 2 : editing clone_id column***

Perhaps you want a bit more control over how clones are called. We can edit this directly in the `.data` slot and retrieve accordingly.

In [6]:
# if we only want to keep the heavy chain clone assignment
clones = []
for clone in vdj.data['clone_id']:
    if '|' in clone: # clones are merged into the same column if they have different pairings of BCR combinations
        clone_list = clone.split('|')
        clones.append('|'.join(list(set([clone_2.rsplit('_', 1)[0] if clone_2.count('_') == 3 else clone_2 for clone_2 in clone_list]))))
    else:
        if clone.count('_') == 3: # X_X_X_X (3 underscores): the last field is the light chain assignment
            clones.append(clone.rsplit('_', 1)[0]) # split on the last underscore and keep only the part before it
        else:
            clones.append(clone)
vdj.data['clone_id_heavy_only'] = clones
ddl.update_metadata(vdj, retrieve = 'clone_id_heavy_only', split = False, collapse = True)
vdj.metadata[['clone_id', 'clone_id_heavy_only']]

Unnamed: 0,clone_id,clone_id_heavy_only
sc5p_v2_hs_PBMC_1k_ACACTGATCGGTTCGG,131_4_1_58,131_4_1
sc5p_v2_hs_PBMC_1k_CCGGTAGGTCAGAAGC,17_1_1_163,17_1_1
sc5p_v2_hs_PBMC_1k_CGATCGGAGATGTCGG,146_2_1_38,146_2_1
sc5p_v2_hs_PBMC_1k_CGGACGTGTTGATTCG,22_2_1_55,22_2_1
sc5p_v2_hs_PBMC_1k_GTACGTACAGCCTATA,59_3_2_308,59_3_2
...,...,...
vdj_nextgem_hs_pbmc3_TTCTCAACACATCTTT,116_2_1_191,116_2_1
vdj_nextgem_hs_pbmc3_TTCTCCTAGGGCACTA,40_5_2_289,40_5_2
vdj_nextgem_hs_pbmc3_TTCTTAGCAAACGCGA,71_7_1_249,71_7_1
vdj_nextgem_hs_pbmc3_TTTATGCTCCGCATAA,117_3_2_56,117_3_2
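
The loop in this example can also be packaged as a small reusable function. The sketch below is one possible refactoring, with `strip_light_chain` being a hypothetical name; unlike the `set`-based version above, it keeps the first-seen order of merged clone IDs, which makes the output deterministic.

```python
def strip_light_chain(clone_id, sep="|"):
    """Drop the last underscore-delimited field from X_X_X_X style clone IDs.

    IDs with fewer than 3 underscores are left unchanged. Merged IDs
    (joined by `sep`) are processed part by part and de-duplicated
    in first-seen order.
    """
    parts = []
    for part in clone_id.split(sep):
        if part.count("_") == 3:
            part = part.rsplit("_", 1)[0]
        if part not in parts:
            parts.append(part)
    return sep.join(parts)


single = strip_light_chain("131_4_1_58")
merged = strip_light_chain("17_1_1_163|17_1_1_200")
unchanged = strip_light_chain("17_1_1")

# With a Dandelion object loaded, the column could then be filled with:
# vdj.data['clone_id_heavy_only'] = [strip_light_chain(c) for c in vdj.data['clone_id']]
```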


### `concat`enating multiple objects

This is a simple function to concatenate (append) two or more `Dandelion` class objects or `pandas` dataframes.

In [7]:
# for example, the original dandelion class has 562 unique cell barcodes and 1130 contigs
vdj

Dandelion class object with n_obs = 562 and n_contigs = 1130
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'c_call', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start',

In [8]:
# now it has 3390 contigs instead, and the metadata should also be properly populated
vdj_concat = ddl.concat([vdj, vdj, vdj])
vdj_concat

Dandelion class object with n_obs = 562 and n_contigs = 3390
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'c_call', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start',
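
In the output above, `n_contigs` triples from 1130 to 3390 while `n_obs` stays at 562. One plausible reading is that contigs are appended row-wise while cells are counted by unique barcode. The toy sketch below illustrates that arithmetic in plain Python; it is not `ddl.concat` itself, and `summarize` is a hypothetical helper.

```python
# Toy contig table (made-up values); each row is one contig.
contigs = [
    {"sequence_id": "cellA_contig_1", "cell_id": "cellA"},
    {"sequence_id": "cellA_contig_2", "cell_id": "cellA"},
    {"sequence_id": "cellB_contig_1", "cell_id": "cellB"},
]


def summarize(rows):
    """Return (n_obs, n_contigs): unique cell barcodes and total contig rows."""
    return len({row["cell_id"] for row in rows}), len(rows)


# Append three copies of the same table row-wise.
merged = [row for table in (contigs, contigs, contigs) for row in table]

before = summarize(contigs)  # (2, 3)
after = summarize(merged)    # (2, 9)
```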

### read/write

`Dandelion` class objects can be saved using the `.write_h5` and `.write_pkl` functions with accompanying compression methods. `write_h5` primarily uses the pandas `to_hdf` method and `write_pkl` just uses pickle. The `read_h5` and `read_pkl` functions read the respective file formats.

In [9]:
%time vdj.write_h5('dandelion_results.h5', complib = 'bzip2')

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block3_values] [items->Index(['sequence_id', 'sequence', 'v_call', 'd_call', 'j_call',
       'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa',
       'v_cigar', 'd_cigar', 'j_cigar', 'locus', 'np2_length',
       'd_sequence_start', 'd_sequence_end', 'd_germline_start',
       'd_germline_end', 'd_score', 'd_identity', 'd_support', 'fwr1', 'fwr2',
       'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'c_call',
       'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x',
       'junction_10x_aa', 'v_call_genotyped', 'germline_alignment_d_mask',
       'sample_id', 'c_sequence_alignment', 'c_germline_alignment',
       'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity',
       'c_support', 'c_call_10x', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa',
       'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa',
       '

CPU times: user 1.05 s, sys: 51.5 ms, total: 1.1 s
Wall time: 1.12 s


In [10]:
%time vdj_1 = ddl.read_h5('dandelion_results.h5')
vdj_1

CPU times: user 419 ms, sys: 37.9 ms, total: 457 ms
Wall time: 458 ms


Dandelion class object with n_obs = 562 and n_contigs = 1130
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'c_call', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start',

Depending on the situation and on which compression is used, reading and writing with `pickle` can be faster or slower, and the resulting files smaller or larger.

In [11]:
%time vdj.write_pkl('dandelion_results.pkl.gz')

CPU times: user 4.44 s, sys: 30.9 ms, total: 4.47 s
Wall time: 4.55 s


In [12]:
%time vdj_2 = ddl.read_pkl('dandelion_results.pkl.gz')
vdj_2

CPU times: user 58.7 ms, sys: 6.65 ms, total: 65.3 ms
Wall time: 68.5 ms


Dandelion class object with n_obs = 562 and n_contigs = 1130
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'c_call', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start',
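
For intuition about the pickle route, the sketch below shows a gzip-compressed pickle round trip using only the Python standard library. It is a minimal illustration, not the body of `write_pkl`/`read_pkl`: the dictionary stands in for a `Dandelion` object, and pairing `gzip` with the `.pkl.gz` extension is an assumption made for this example.

```python
import gzip
import os
import pickle
import tempfile

# Stand-in for a Dandelion object: any picklable Python object works here.
payload = {
    "data": [{"sequence_id": "cellA_contig_1", "cell_id": "cellA"}],
    "metadata": {"cellA": {"locus_VDJ": "IGH"}},
}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "toy_results.pkl.gz")

    # Write: pickle the object straight into a gzip-compressed file.
    with gzip.open(path, "wb") as handle:
        pickle.dump(payload, handle, protocol=pickle.HIGHEST_PROTOCOL)

    # Read: decompress and unpickle in one pass.
    with gzip.open(path, "rb") as handle:
        restored = pickle.load(handle)
```

The restored object is an equal but independent copy of the original, which is the same guarantee you would want from any save and reload cycle.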