# Dandelion class

Much of the functionality of the `dandelion` package revolves around the `Dandelion` class. The class acts as an intermediary object for storage and for flexible interaction with other tools. This section is a quick primer on the `Dandelion` class.

<b>Import modules</b>

In [None]:
import os

os.chdir(os.path.expanduser("~/Downloads/dandelion_tutorial/"))
import dandelion as ddl

ddl.logging.print_versions()

In [None]:
vdj = ddl.read_h5ddl("dandelion_results.h5ddl")
vdj

Essentially, the `.data` slot holds the AIRR contig table, while the `.metadata` slot holds a collapsed version that can be combined with `AnnData`'s `.obs` slot. You can access these slots like ordinary attributes; for example, to get the metadata:

In [None]:
vdj.metadata
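
To make the relationship between the two slots concrete, here is a minimal sketch in plain Python. It is not how `dandelion` builds `.metadata`; it only illustrates the idea of collapsing a contig table into one row per cell. The toy rows and the helper `collapse_to_cells` are made up for illustration.

```python
from collections import defaultdict

# Toy contig table: one dict per contig, loosely modelled on an AIRR table.
contigs = [
    {"sequence_id": "cellA_contig_1", "cell_id": "cellA", "locus": "IGH", "productive": "T"},
    {"sequence_id": "cellA_contig_2", "cell_id": "cellA", "locus": "IGK", "productive": "T"},
    {"sequence_id": "cellB_contig_1", "cell_id": "cellB", "locus": "IGH", "productive": "T"},
]

VDJ_LOCI = {"IGH", "TRB", "TRD"}  # chains that carry a D segment


def collapse_to_cells(rows):
    """Hypothetical helper: build one metadata row per cell from contig rows."""
    cells = defaultdict(lambda: {"productive_VDJ": [], "productive_VJ": []})
    for row in rows:
        key = "productive_VDJ" if row["locus"] in VDJ_LOCI else "productive_VJ"
        cells[row["cell_id"]][key].append(row["productive"])
    # Several contigs of the same chain type are joined with '|'.
    return {
        cell: {col: "|".join(vals) for col, vals in cols.items()}
        for cell, cols in cells.items()
    }


metadata = collapse_to_cells(contigs)
print(metadata["cellA"])
print(metadata["cellB"])
```

Three contig rows become two cell rows, which is the shape needed to line up with a per-cell table such as `AnnData`'s `.obs`.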

### slicing

You can slice the `Dandelion` object through the indices of either `.data` or `.metadata`; the behavior is similar to slicing a `pandas` `DataFrame` or an `AnnData` object.

<b>slicing</b> `.data`

In [None]:
# get the largest clone
largest_clone = vdj.data["clone_id"].value_counts().idxmax()

vdj[vdj.data["clone_id"] == largest_clone]

In [None]:
vdj[
    vdj.data_names.isin(
        [
            "sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_contig_1",
            "sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_2",
            "sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_1",
            "sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_1",
            "sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_2",
        ]
    )
]

<b>slicing</b> `.metadata`

In [None]:
vdj[vdj.metadata["productive_VDJ"].isin(["T", "T|T"])]

In [None]:
vdj[vdj.metadata_names == "vdj_v1_hs_pbmc3_TTTCCTCAGCGCTTAT"]
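
The sketch below shows, in plain Python, one way slicing on either slot can keep the two slots consistent. It assumes that filtering contigs drops cells left without contigs, and that filtering cells drops the contigs of removed cells; treat that as an assumption about the behavior rather than a description of `dandelion`'s code. The functions `slice_by_data` and `slice_by_metadata` are hypothetical.

```python
# Toy stand-ins for the two slots, keyed by their index names.
data = {
    "cellA_contig_1": {"cell_id": "cellA", "clone_id": "c1"},
    "cellA_contig_2": {"cell_id": "cellA", "clone_id": "c1"},
    "cellB_contig_1": {"cell_id": "cellB", "clone_id": "c2"},
}
metadata = {"cellA": {"clone_id": "c1"}, "cellB": {"clone_id": "c2"}}


def slice_by_data(data, metadata, keep):
    """Keep contigs where keep(row) is true, then drop cells left without contigs."""
    new_data = {k: v for k, v in data.items() if keep(v)}
    cells = {row["cell_id"] for row in new_data.values()}
    return new_data, {c: m for c, m in metadata.items() if c in cells}


def slice_by_metadata(data, metadata, keep):
    """Keep cells where keep(row) is true, then keep only those cells' contigs."""
    new_meta = {c: m for c, m in metadata.items() if keep(m)}
    new_data = {k: v for k, v in data.items() if v["cell_id"] in new_meta}
    return new_data, new_meta


d1, m1 = slice_by_data(data, metadata, lambda row: row["clone_id"] == "c1")
d2, m2 = slice_by_metadata(data, metadata, lambda row: row["clone_id"] == "c2")
print(sorted(d1), sorted(m1))
print(sorted(d2), sorted(m2))
```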

### copy

You can deep copy the `Dandelion` object to another variable; the copy inherits all slots:

In [None]:
vdj2 = vdj.copy()
vdj2.metadata

### Retrieving entries with `update_metadata`

The `.metadata` slot of the `Dandelion` class is initialized automatically whenever the `.data` slot is filled. However, it only contains a pre-specified set of standard columns. To retrieve other columns from the `.data` slot, update the metadata with `.update_metadata` and specify the `retrieve` and `retrieve_mode` options.

The following modes determine how the retrieval is completed:

`split and unique only` - splits the retrieval into VDJ and VJ chains. A `|` will separate each _**unique**_ element.

`split and merge` - splits the retrieval into VDJ and VJ chains. A `|` will separate _**every**_ element.

`merge and unique only` - similar to `split and unique only`, but merged into a single column.

`split` - splits the retrieval into _**individual**_ columns for each contig.

`merge` - merges the retrieval into a _**single**_ column where a `|` will separate _**every**_ element.

For numerical columns, there are additional options:

`split and sum` - splits the retrieval into VDJ and VJ chains and sums each separately.

`split and average` - similar to `split and sum`, but averages instead of summing.

`sum` - sums the retrievals into a single column.

`average` - averages the retrievals into a single column.

If `retrieve_mode` is not specified, it defaults to `split and merge`.
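
The differences between the modes are easier to see on a toy example. The following is a minimal sketch in plain Python for the contigs of a single cell, written from the descriptions above; it is not `dandelion`'s implementation. The function `retrieve`, the output key names (especially the numbered keys in `split` mode) and the VDJ-before-VJ ordering in the merged modes are all assumptions made for illustration.

```python
VDJ_LOCI = {"IGH", "TRB", "TRD"}  # all other loci are treated as VJ chains


def _join(values):
    return "|".join(str(v) for v in values)


def _unique(values):
    # Drop duplicates while keeping first-seen order.
    return list(dict.fromkeys(values))


def _mean(values):
    return sum(values) / len(values) if values else None


def retrieve(contigs, column, mode="split and merge"):
    """Toy retrieval for the contigs of one cell; output keys are illustrative."""
    vdj = [c[column] for c in contigs if c["locus"] in VDJ_LOCI]
    vj = [c[column] for c in contigs if c["locus"] not in VDJ_LOCI]
    vdj_key, vj_key = f"{column}_VDJ", f"{column}_VJ"
    if mode == "split and unique only":
        return {vdj_key: _join(_unique(vdj)), vj_key: _join(_unique(vj))}
    if mode == "split and merge":
        return {vdj_key: _join(vdj), vj_key: _join(vj)}
    if mode == "merge and unique only":
        return {column: _join(_unique(vdj + vj))}
    if mode == "merge":
        return {column: _join(vdj + vj)}
    if mode == "split":
        named = [(vdj_key, vdj), (vj_key, vj)]
        return {f"{key}_{i}": v for key, vals in named for i, v in enumerate(vals)}
    if mode == "split and sum":
        return {vdj_key: sum(vdj), vj_key: sum(vj)}
    if mode == "split and average":
        return {vdj_key: _mean(vdj), vj_key: _mean(vj)}
    if mode == "sum":
        return {column: sum(vdj + vj)}
    if mode == "average":
        return {column: _mean(vdj + vj)}
    raise ValueError(f"unknown mode: {mode}")


# One cell with one heavy-chain contig and two light-chain contigs.
cell = [
    {"locus": "IGH", "c_call": "IGHM", "np1_length": 4},
    {"locus": "IGK", "c_call": "IGKC", "np1_length": 2},
    {"locus": "IGK", "c_call": "IGKC", "np1_length": 6},
]
print(retrieve(cell, "c_call", "split and unique only"))
print(retrieve(cell, "c_call", "split and merge"))
print(retrieve(cell, "np1_length", "split and average"))
```

With two identical light-chain calls, `split and unique only` reports the VJ value once, whereas `split and merge` repeats it separated by `|`.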

***Example: retrieving fwr1 sequences***

In [None]:
vdj.update_metadata(retrieve="fwr1")
vdj

Note the additional `fwr1` VDJ and VJ columns in the metadata slot.

By default, `dandelion` will not try to merge numerical columns as it can create mixed dtype columns.

There is also a helper method, `update_plus`, that tries to retrieve frequently used columns such as `np1_length` and `np2_length`:

In [None]:
vdj.update_plus()
vdj

### Renaming barcodes

You can rename the barcodes (both sequence and cell ids at the same time) with a simple function, which is useful when you want more meaningful names. Renaming always works from the indices that were originally used to create the `Dandelion` object. So if you have already run the function once, running it again does not stack another prefix/suffix onto the new indices; it updates the ids based on the original indices.

In [None]:
# original
print(vdj.data[["sequence_id", "cell_id"]])
print(vdj.metadata_names)

In [None]:
# add 'test-' as a prefix; there is also a suffix option
vdj.add_sequence_prefix("test", sep="-")
print(vdj.data[["sequence_id", "cell_id"]])
print(vdj.metadata_names)

In [None]:
# same functionality as above
vdj.add_cell_prefix("test2", sep="_")
print(vdj.data[["sequence_id", "cell_id"]])
print(vdj.metadata_names)

In [None]:
# you can also reset the ids
vdj.reset_ids()
print(vdj.data[["sequence_id", "cell_id"]])
print(vdj.metadata_names)
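
The "always rename from the original indices" behavior described above can be sketched in a few lines of plain Python. The class `ToyIds` and its methods are hypothetical and only illustrate the idea; they are not part of `dandelion`.

```python
class ToyIds:
    """Hypothetical container that remembers the ids it was created with."""

    def __init__(self, ids):
        self._original = list(ids)
        self.ids = list(ids)

    def add_prefix(self, prefix, sep="_"):
        # Always start from the original ids, so prefixes never stack.
        self.ids = [f"{prefix}{sep}{i}" for i in self._original]

    def reset_ids(self):
        # Restore the ids the object was created with.
        self.ids = list(self._original)


cells = ToyIds(["AAACCTGTCATATCGG", "AAACCTGTCCGTTGTC"])
cells.add_prefix("test", sep="-")
cells.add_prefix("test2", sep="_")  # replaces 'test-' instead of stacking on it
renamed = list(cells.ids)
print(renamed)

cells.reset_ids()
print(cells.ids)
```

Because the original ids are kept, the second call yields `test2_...` rather than `test2_test-...`, and resetting is a simple copy.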

### Simplifying the V/D/J/C call annotations

Sometimes the V/D/J/C call annotations can be quite verbose. You can simplify them with the `.simplify()` method, which keeps only the first element of a comma-separated call and strips the allele. This is useful when you want simpler V/D/J/C calls for plotting.

In [None]:
# before
(
    vdj.data[["v_call_genotyped", "j_call"]],
    vdj.metadata[["v_call_genotyped_VDJ", "j_call_VDJ"]],
)

In [None]:
# after
vdj.simplify()
(
    vdj.data[["v_call_genotyped", "j_call"]],
    vdj.metadata[["v_call_genotyped_VDJ", "j_call_VDJ"]],
)
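
As a rough illustration of what the simplification does to a single call string, here is a minimal sketch in plain Python. It assumes calls are comma-separated and that alleles follow a `*`, as in IMGT-style names; the function `simplify_call` is hypothetical and is not `dandelion`'s implementation.

```python
def simplify_call(call):
    """Keep the first of several comma-separated calls and strip the allele."""
    first = call.split(",")[0]
    return first.split("*")[0]


print(simplify_call("IGHV3-23*01,IGHV3-23D*01"))  # IGHV3-23
print(simplify_call("IGHJ4*02"))  # IGHJ4
```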

### concatenating multiple objects

`ddl.concat` is a simple function to concatenate (append) two or more `Dandelion` objects or `pandas` DataFrames. Note that it operates on the `.data` slot and not the `.metadata` slot.

In [None]:
# for example, the original dandelion class has 2071 unique cell barcodes and 4882 contigs
vdj

In [None]:
# now it has 14646 (4882*3) contigs instead, and the metadata should also be properly populated
vdj_concat = ddl.concat([vdj, vdj, vdj])
vdj_concat

In [None]:
vdj_concat.data[["sequence_id", "cell_id"]].head()

`ddl.concat` also lets you supply custom prefixes/suffixes to add to the sequence ids. If none are provided and it detects that the sequence ids are not unique, it adds `-0`, `-1` etc. as a suffix, as seen above.
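
The default suffixing can be pictured with the following plain-Python sketch. It assumes the suffix number is the position of each object in the list passed in, and that ids are left untouched when they are already unique; both are assumptions for illustration, and `concat_ids` is a hypothetical helper rather than part of `dandelion`.

```python
def concat_ids(id_lists):
    """Concatenate id lists; add -0, -1, ... only if ids would collide."""
    flat = [i for ids in id_lists for i in ids]
    if len(set(flat)) == len(flat):
        return flat  # already unique, leave ids untouched
    return [f"{i}-{n}" for n, ids in enumerate(id_lists) for i in ids]


same = ["cellA_contig_1", "cellA_contig_2"]
print(concat_ids([same, same, same]))
print(concat_ids([["cellA_contig_1"], ["cellB_contig_1"]]))
```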

### read/write

A `Dandelion` object can be saved using the `.write_h5ddl` and `.write_pkl` functions, with accompanying compression methods, e.g. `gzip`. `write_h5ddl` primarily uses the `h5py` library and `write_pkl` just uses `pickle`. The `read_h5ddl` and `read_pkl` functions read the respective file formats.

In [None]:
%time vdj.write_h5ddl('dandelion_results.h5ddl', compression="gzip")

If you see any warnings above, they are due to mixed dtypes somewhere in the object. Check these if you think they will interfere with downstream usage.

In [None]:
%time vdj_1 = ddl.read_h5ddl('dandelion_results.h5ddl')
vdj_1

Depending on the data and on which compression is used, reading and writing with `pickle` can be faster or slower, and the resulting files smaller or larger.

In [None]:
%time vdj.write_pkl('dandelion_results.pkl.gz')

In [None]:
%time vdj_2 = ddl.read_pkl('dandelion_results.pkl.gz')
vdj_2
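
For readers unfamiliar with the `.pkl.gz` convention, the general idea of a gzip-compressed pickle round trip looks like the sketch below, using only the Python standard library. Whether `write_pkl` and `read_pkl` do exactly this internally is an assumption; the toy object and file name are made up.

```python
import gzip
import os
import pickle
import tempfile

# Toy object standing in for something with 'data' and 'metadata' parts.
obj = {"data": [{"sequence_id": "cellA_contig_1"}], "metadata": {"cellA": {}}}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_results.pkl.gz")
    # gzip.open returns a file-like object, so pickle can stream straight into it.
    with gzip.open(path, "wb") as handle:
        pickle.dump(obj, handle)
    with gzip.open(path, "rb") as handle:
        restored = pickle.load(handle)

print(restored == obj)
```

As with any pickle file, only load files from sources you trust, since unpickling can execute arbitrary code.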

There are also other writing functions, such as `.write_airr` and `.write_10x`, which write the object to a `.tsv` or `.csv` file that is compatible with the `airr` and `10x` formats respectively.

In [None]:
import pandas as pd

vdj2.write_airr("test.airr.tsv")
df = pd.read_csv("test.airr.tsv", sep="\t")
df

In [None]:
vdj2.write_10x(
    folder="10x_test",
    filename_prefix="all",
)  # this writes both the contig_annotations.csv and contig.fasta files
df = pd.read_csv("10x_test/all_contig_annotations.csv")
df