# V(D)J analysis

In [None]:
import dandelion as ddl
import scanpy as sc
import warnings
import os

warnings.filterwarnings("ignore")
sc.settings.set_figure_params(dpi=80)

Let's run through some of what Dandelion can do in terms of analysis. To kickstart this tutorial, we have prepared GEX and VDJ objects from four demo 10X samples, already parsed for your convenience. The previous notebook shows how this was done; the VDJ loading into Dandelion is likely the part of most interest there, because of the syntax required.

In [None]:
if not os.path.exists("demo-gex.h5ad"):
    os.system("wget ftp://ftp.sanger.ac.uk/pub/users/kp9/demo-gex.h5ad")

if not os.path.exists("demo-vdj.h5ddl"):
    os.system("wget ftp://ftp.sanger.ac.uk/pub/users/kp9/demo-vdj.h5ddl")


Let's import the objects. Dandelion's `h5ddl` files can be read via `ddl.read_h5ddl()`.

In [None]:
adata = sc.read("demo-gex.h5ad")
vdj = ddl.read_h5ddl("demo-vdj.h5ddl")


At this point you're probably wondering why there's a separate Dandelion object. The reason is AIRR compliance. Some of the AIRR columns have more complex typing than what Scanpy can currently support within its objects. However, it's quite straightforward to link up a Scanpy object with a Dandelion one.

In [None]:
vdj, adata = ddl.pp.check_contigs(vdj, adata)


This filters the contigs and synchronises relevant information between the objects. Once linked up like this, any new information can be copied over from the Dandelion object via `ddl.tl.transfer()`. There will be an example later in the notebook.

For now, let's take a look at the chain status (as obtained from the Dandelion object) and known BCR marker expression.

In [None]:
sc.pl.umap(adata, color=["IGHM", "JCHAIN", "chain_status"])

Under the hood, the Dandelion object is essentially two data frames. `.data` holds the AIRR-compliant contig space table, while `.metadata` is an `.obs` equivalent that parses the contig information to a cell level and can be easily integrated with a Scanpy object. There are also `ddl.to_scirpy()` and `ddl.from_scirpy()` for interoperability with Scirpy, as explored in a notebook in the advanced guide. Scirpy also offers its own conversion functions.
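
To picture how a contig-level table relates to a cell-level one, here is a minimal sketch in plain Python. It is not how Dandelion builds `.metadata`; the rows, the `collapse_to_cells` helper and the `|`-joined summary are all made up for illustration, and only the column names (`sequence_id`, `cell_id`, `locus`) follow AIRR conventions.

```python
from collections import defaultdict

# Toy contig-level rows: one row per contig, several contigs may share a cell.
# The values are invented; only the field names follow AIRR conventions.
contigs = [
    {"sequence_id": "cell1_contig_1", "cell_id": "cell1", "locus": "IGH"},
    {"sequence_id": "cell1_contig_2", "cell_id": "cell1", "locus": "IGK"},
    {"sequence_id": "cell2_contig_1", "cell_id": "cell2", "locus": "IGH"},
]


def collapse_to_cells(rows):
    """Summarise contig rows into one entry per cell (hypothetical helper)."""
    loci = defaultdict(list)
    for row in rows:
        loci[row["cell_id"]].append(row["locus"])
    # One string per cell listing the loci found among its contigs.
    return {cell: "|".join(sorted(found)) for cell, found in loci.items()}


cell_table = collapse_to_cells(contigs)
print(cell_table)
```

The point is only that the per-cell view is derived from the per-contig view, which is why the per-cell table has to be regenerated whenever the contig table changes.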

The thing you're most likely to find yourself doing manually with the Dandelion object is modifying cell names to match your GEX naming convention. The cell names can be found in `.data.cell_id`. Change those however you see fit, and then call `.update_metadata()` to regenerate the per-cell `.obs` equivalent.

```
vdj.data.cell_id = [result of modification procedure on existing vdj.data.cell_id]
vdj.update_metadata()
```
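
As one possible modification procedure, the sketch below prepends a sample name to each barcode. The naming convention (`<sample>_<barcode>`), the `add_sample_prefix` helper and the barcodes are assumptions for illustration; your GEX object may well use a different scheme.

```python
def add_sample_prefix(cell_ids, sample):
    """Return new cell names with `sample` and an underscore prepended.

    Hypothetical helper: assumes the GEX object names cells "<sample>_<barcode>".
    """
    return [f"{sample}_{cell_id}" for cell_id in cell_ids]


# Made-up barcodes standing in for the existing cell names.
vdj_cell_ids = ["AAACCTGAGACTGTAA-1", "AAACCTGCAGGCTGAA-1"]
renamed = add_sample_prefix(vdj_cell_ids, "sampleA")
print(renamed)
```

The resulting list is what would be assigned to `vdj.data.cell_id` before calling `vdj.update_metadata()`.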

In [None]:
vdj


Now that we've got the gist of basic handling of the Dandelion object, let's use it for some analysis!

A core element of VDJ analysis is clonotype calling, roughly equivalent to clustering cells in GEX processing. Dandelion requires the clones it calls to have identical V and J genes, along with no more than 15% mismatches in the CDR3 sequences ([common practice](https://royalsocietypublishing.org/doi/10.1098/rstb.2014.0239) in BCR analysis).

For TCR clonotype calling, the common practice is to require identical nucleotide sequences. You can do this by passing `identity=1` and `key="junction"` to the function.

In [None]:
ddl.tl.find_clones(vdj)
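
The BCR criterion described above can be sketched in plain Python. This is not Dandelion's implementation: it assumes that only CDR3 sequences of equal length are compared and that mismatches are counted position by position (Hamming distance). The `same_clone` helper and the example sequences are hypothetical.

```python
def hamming_identity(a, b):
    """Fraction of matching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    if not a:
        return 1.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def same_clone(c1, c2, identity=0.85):
    """Same V gene, same J gene, and CDR3 identity of at least `identity`.

    Assumption for this sketch: CDR3s of different lengths never match.
    """
    if c1["v_call"] != c2["v_call"] or c1["j_call"] != c2["j_call"]:
        return False
    if len(c1["junction_aa"]) != len(c2["junction_aa"]):
        return False
    return hamming_identity(c1["junction_aa"], c2["junction_aa"]) >= identity


# Made-up 20-residue CDR3s: `b` differs from `a` at 2 positions, `c` at 4.
a = {"v_call": "IGHV1-2", "j_call": "IGHJ4", "junction_aa": "CARDYYGSGSYYNWFDPWGQ"}
b = {"v_call": "IGHV1-2", "j_call": "IGHJ4", "junction_aa": "CARDAAGSGSYYNWFDPWGQ"}
c = {"v_call": "IGHV1-2", "j_call": "IGHJ4", "junction_aa": "CARDAAGSGSAANWFDPWGQ"}

print(same_clone(a, b))  # 2 of 20 mismatched, within the 15% allowance
print(same_clone(a, c))  # 4 of 20 mismatched, outside the allowance
```

Setting `identity=1` in this sketch corresponds to requiring identical sequences, as in the TCR case.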


We can compute a graph based on Levenshtein distance of the complete contig sequence. A NetworkX representation of it is now saved in `vdj.graph`.

In [None]:
ddl.tl.generate_network(vdj)
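
For readers unfamiliar with the Levenshtein distance, it is the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. A minimal reference sketch, independent of how Dandelion computes it internally, could look like this:

```python
def levenshtein(a, b):
    """Edit distance between strings `a` and `b` (two-row dynamic programming)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))
```

Unlike the Hamming distance, this measure is defined for sequences of different lengths.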


Since we now know what our clonotype calls are, we can quantify clonal expansion. It's possible to cap this at a desired maximum clonotype size.

In [None]:
ddl.tl.clone_size(vdj)
# this makes an independent column with the provided max_size in its name
ddl.tl.clone_size(vdj, max_size=3)
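
Conceptually, clone size is just the number of cells sharing a clonotype, optionally capped. The sketch below illustrates the idea with the standard library; the `clone_sizes` helper and the `">= 3"` label format are assumptions for illustration and may not match the values Dandelion writes.

```python
from collections import Counter


def clone_sizes(clone_ids, max_size=None):
    """Per-cell clone size as a string, optionally capped at `max_size`.

    Hypothetical helper; the ">= N" label is an assumed format.
    """
    counts = Counter(clone_ids)
    sizes = []
    for clone in clone_ids:
        size = counts[clone]
        if max_size is not None and size >= max_size:
            sizes.append(f">= {max_size}")
        else:
            sizes.append(str(size))
    return sizes


# One made-up clone ID per cell.
clone_ids = ["A", "A", "A", "A", "B", "B", "C"]
uncapped = clone_sizes(clone_ids)
capped = clone_sizes(clone_ids, max_size=3)
print(uncapped)
print(capped)
```

Capping collapses all large clones into a single category, which keeps a categorical colour legend readable.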

Now that our Dandelion object has analysis information inside it, we can copy it over to the Scanpy object to have access to it there. The graph gets turned into the Scanpy standard forms of `.obsp['vdj_distances']` and `.obsp['vdj_connectivities']` for potential downstream use.

In [None]:
ddl.tl.transfer(adata, vdj)


Let's take a look at what we made!

In [None]:
ddl.pl.clone_network(adata, color="clone_id_size")
sc.pl.umap(adata, color="clone_id_size")

Wait, why are we seeing some clone size 0 in the plots? The answer is orphan chains.

In [None]:
ddl.pl.clone_network(adata, color="clone_id_size_max_3")
sc.pl.umap(adata, color="clone_id_size_max_3")

Dandelion comes with a number of plotting functions for your convenience. However, those functions tend to operate best without the Scanpy plotting defaults in place. You can reset Matplotlib's configuration prior to using them.

In [None]:
import matplotlib as mpl
mpl.rcParams.update(mpl.rcParamsDefault)
%matplotlib inline

We've got bar plots.

In [None]:
ddl.pl.barplot(
    vdj[vdj.metadata.isotype_status != "Multi"],  # remove multi from the plots
    color="v_call_genotyped_VDJ",
    xtick_fontsize=5,
)

All of the plotting functions have a number of parameters that can be fiddled with for desired visualisation outcomes. For example, let's disable automatic descending sorting, show counts rather than proportions, and change the palette.

In [None]:
ddl.pl.barplot(
    vdj[vdj.metadata.isotype_status != "Multi"],
    color="v_call_genotyped_VDJ",
    normalize=False,
    sort_descending=None,
    palette="tab20",
    xtick_fontsize=5,
)

We've got stacked bar plots.

In [None]:
ddl.pl.stackedbarplot(
    vdj[vdj.metadata.isotype_status != "Multi"],
    color="isotype_status",
    groupby="locus_status",
    xtick_rotation=0,
    figsize=(4, 3),
)

These can be normalised to add up to 1 for each column.

In [None]:
ddl.pl.stackedbarplot(
    vdj[vdj.metadata.isotype_status != "Multi"],
    color="v_call_genotyped_VDJ",
    groupby="isotype_status",
    normalize=True,
    xtick_fontsize=5,
)
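
The normalisation itself is simple arithmetic: each count is divided by its column total. A minimal sketch with made-up counts (the `normalise_columns` helper is hypothetical and not part of Dandelion):

```python
def normalise_columns(counts):
    """Turn {column: {category: count}} into per-column proportions."""
    result = {}
    for column, categories in counts.items():
        total = sum(categories.values())
        result[column] = {
            category: (count / total if total else 0.0)
            for category, count in categories.items()
        }
    return result


# Made-up counts of V genes per isotype group.
counts = {
    "IgM": {"IGHV1-2": 3, "IGHV3-23": 1},
    "IgG": {"IGHV1-2": 1, "IGHV3-23": 1},
}
proportions = normalise_columns(counts)
print(proportions)
```

After normalisation every column sums to 1, so columns with very different cell numbers can be compared by composition.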

We've also got a spectratype plot, which shows the distribution of the CDR3 length for the various contigs.

In [None]:
ddl.pl.spectratype(
    vdj[vdj.metadata.isotype_status != "Multi"],
    color="junction_length",
    groupby="c_call",
    locus="IGH",
    width=2.3,
)
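
The quantity behind a spectratype is a count of contigs per junction length, split by group. The sketch below shows that tabulation with made-up nucleotide junctions; `length_distribution` is a hypothetical helper, and grouping by `c_call` simply mirrors the call above.

```python
from collections import Counter, defaultdict


def length_distribution(contigs):
    """Count junction lengths per group from (group, junction) pairs."""
    dist = defaultdict(Counter)
    for group, junction in contigs:
        dist[group][len(junction)] += 1
    return {group: dict(lengths) for group, lengths in dist.items()}


# Made-up (c_call, junction) pairs.
contigs = [
    ("IGHM", "TGTGCGAGAGAT"),
    ("IGHM", "TGTGCGAGAGATTAC"),
    ("IGHM", "TGTGCGAAAGAT"),
    ("IGHG1", "TGTGCGAGA"),
]
distribution = length_distribution(contigs)
print(distribution)
```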

Another common VDJ analysis request is to examine the distribution of shared clonotypes between cells of different metadata groups. Dandelion can do this as a circos plot.

In [None]:
ddl.tl.clone_overlap(
    adata, groupby="leiden", colorby="leiden", weighted_overlap=True
)
ddl.pl.clone_overlap(
    adata, groupby="leiden", colorby="leiden", weighted_overlap=True
)
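
As a rough picture of what "shared clonotypes between groups" means, the sketch below counts the clone IDs that two groups have in common. It is a simplification under stated assumptions: it does not reproduce Dandelion's `weighted_overlap` behaviour, and the `shared_clones` helper and the data are made up.

```python
from collections import defaultdict
from itertools import combinations


def shared_clones(cells):
    """Count clone IDs shared by each pair of groups.

    `cells` is an iterable of (group, clone_id) pairs, one per cell.
    """
    clones_by_group = defaultdict(set)
    for group, clone in cells:
        clones_by_group[group].add(clone)
    overlap = {}
    for g1, g2 in combinations(sorted(clones_by_group), 2):
        overlap[(g1, g2)] = len(clones_by_group[g1] & clones_by_group[g2])
    return overlap


# Made-up cells: (leiden cluster, clone ID).
cells = [
    ("0", "c1"), ("0", "c2"),
    ("1", "c1"), ("1", "c3"),
    ("2", "c2"), ("2", "c1"),
]
overlap = shared_clones(cells)
print(overlap)
```

Each non-zero entry corresponds to a link between two groups in a circos-style plot.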

There's also a heatmap on offer.

In [None]:
ddl.pl.clone_overlap(
    adata,
    groupby="leiden",
    colorby="leiden",
    weighted_overlap=True,
    as_heatmap=True,
    # seaborn clustermap kwargs
    cmap="Blues",
    annot=True,
    figsize=(8, 8),
    annot_kws={"size": 10},
)

Save the objects, like so.

In [None]:
adata.write("demo-gex-processed.h5ad")
vdj.write("demo-vdj-processed.h5ddl")
