# Filtering

We now move on to filtering out BCR contigs (and corresponding cells if necessary) from the BCR data and transcriptome object loaded in *scanpy*.

<b>Import <i>dandelion</i> module</b>

In [None]:
import os
import dandelion as ddl

# change directory to somewhere more workable
os.chdir(os.path.expanduser("~/Downloads/dandelion_tutorial/"))
ddl.logging.print_header()

<b>Import modules for use with scanpy</b>

In [None]:
import pandas as pd
import scanpy as sc
import warnings

warnings.filterwarnings("ignore")
sc.logging.print_header()

<b>Import the transcriptome data</b>

In [None]:
samples = [
    "sc5p_v2_hs_PBMC_1k",
    "sc5p_v2_hs_PBMC_10k",
    "vdj_v1_hs_pbmc3",
    "vdj_nextgem_hs_pbmc3",
]
adata_list = []
for sample in samples:
    adata = sc.read_10x_h5(
        sample + "/filtered_feature_bc_matrix.h5", gex_only=True
    )
    adata.obs["sampleid"] = sample
    # rename cells to sample id + barcode
    adata.obs_names = [str(sample) + "_" + str(j) for j in adata.obs_names]
    adata.var_names_make_unique()
    adata_list.append(adata)
adata = adata_list[0].concatenate(adata_list[1:])
# rename the obs_names again, this time cleaving the trailing -#
adata.obs_names = [str(j).split("-")[0] for j in adata.obs_names]
adata
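The two renaming steps above matter because the V(D)J contig table has to refer to cells by the same identifiers as `adata.obs_names`. The sketch below replays the string handling on a single barcode. The barcode is made up, the `-0` stands for the batch suffix that `concatenate` appends by default, and `make_cell_id` is a hypothetical helper for illustration only, not part of *dandelion* or *scanpy*.

```python
def make_cell_id(sample: str, barcode: str) -> str:
    """Prefix a barcode with its sample id and drop everything after the first '-'."""
    return f"{sample}_{barcode}".split("-")[0]


# A 10x barcode carries a "-1" suffix; concatenation adds a batch suffix such as "-0".
raw_name = "AAACCTGAGAAACCAT-1" + "-0"
cell_id = make_cell_id("sc5p_v2_hs_PBMC_1k", raw_name)
print(cell_id)  # sc5p_v2_hs_PBMC_1k_AAACCTGAGAAACCAT

# Caveat: a sample id that itself contains "-" is truncated by the same split.
truncated = make_cell_id("donor-1", "AAACCTGAGAAACCAT-1")
print(truncated)  # donor
```

None of the sample ids used here contain a hyphen, so the simple split is safe. If yours do, strip the suffix from the barcode before adding the sample prefix.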

I'm using a wrapper called `pp.recipe_scanpy_qc` to run through a generic [scanpy](https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html) workflow. You can skip this if you already have a pre-processed `AnnData` object for the subsequent steps.

In [None]:
ddl.pp.recipe_scanpy_qc(adata, mito_cutoff=None)  # use a gmm model to decide
# we can continue with those that survive qc
adata = adata[adata.obs["filter_rna"] == "False"].copy()
adata

## Filter cells that are potential doublets and poor quality in both the V(D)J data and transcriptome data

### `ddl.pp.check_contigs`

<div class="alert alert-warning">

Deprecation warning

Prior to v0.5.0, there were two separate functions for contig QC, `ddl.pp.filter_contigs` and `ddl.pp.check_contigs`, which deal with poor quality contigs by either explicitly removing them or just flagging them. From v0.5.0 onwards, `ddl.pp.filter_contigs` is deprecated and will be removed in a future release, and `ddl.pp.check_contigs` will be the only QC option going forward. `ddl.pp.check_contigs` is easier to maintain and simply marks the problematic contigs as `ambiguous` and withholds them from downstream analysis. The new version of `ddl.pp.check_contigs` also has the `filter_extra` and `filter_ambiguous` options to remove/keep the `extra` contigs (those that pass the internal QC filters and are not explicitly ambiguous) and the `ambiguous` contigs, fulfilling the same utility as `ddl.pp.filter_contigs`.
</div>

We use the function `pp.check_contigs` to mark and filter out cells and contigs from both the V(D)J data and transcriptome data in `AnnData`. The operation will remove bad quality cells based on transcriptome information as well as remove V(D)J doublets (multiplet heavy/long chains, and/or light/short chains) from the V(D)J data. In some situations, a single cell can have multiple heavy/long and light/short chain contigs that share an identical V(D)J+C alignment; in this case, the contigs with fewer UMIs are dropped and their UMIs are added to the `umi_count` column of the contig that is kept. The same procedure is applied to both chains before further checks of the annotation quality, UMI and consensus count distributions.
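To make the collapsing of identically aligned contigs more concrete, here is a minimal plain-Python sketch of the idea. It is only an illustration under simplifying assumptions, not how `ddl.pp.check_contigs` is implemented: the toy records, the omission of `d_call` from the grouping key, and the helper name `collapse_identical` are all mine.

```python
from collections import defaultdict

# Toy contigs; the field names mirror AIRR-style columns, the values are made up.
contigs = [
    {"sequence_id": "c1", "cell_id": "cellA", "v_call": "IGHV1-2",
     "j_call": "IGHJ4", "c_call": "IGHM", "umi_count": 12},
    {"sequence_id": "c2", "cell_id": "cellA", "v_call": "IGHV1-2",
     "j_call": "IGHJ4", "c_call": "IGHM", "umi_count": 3},
    {"sequence_id": "c3", "cell_id": "cellA", "v_call": "IGKV3-20",
     "j_call": "IGKJ1", "c_call": "IGKC", "umi_count": 20},
]


def collapse_identical(rows):
    """Keep one contig per (cell, V, J, C) group and give it the group's summed UMIs."""
    groups = defaultdict(list)
    for row in rows:
        key = (row["cell_id"], row["v_call"], row["j_call"], row["c_call"])
        groups[key].append(row)
    kept = []
    for members in groups.values():
        best = max(members, key=lambda r: r["umi_count"])
        merged = dict(best)  # copy so the input rows stay untouched
        merged["umi_count"] = sum(r["umi_count"] for r in members)
        kept.append(merged)
    return kept


collapsed = collapse_identical(contigs)
for row in collapsed:
    print(row["sequence_id"], row["umi_count"])
```

In this toy cell, `c2` is dropped and its 3 UMIs are added to `c1`, while the light chain `c3` is left as it is.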

Cells in the gene expression object without V(D)J information will not be affected, which means that the `AnnData` object can hold non-B/T cells.

In [None]:
# first we read in the 4 bcr files
bcr_files = []
for sample in samples:
    file_location = sample + "/dandelion/filtered_contig_dandelion.tsv"
    bcr_files.append(pd.read_csv(file_location, sep="\t"))
# ignore_index=True already gives the combined table a fresh 0..n-1 index
bcr = pd.concat(bcr_files, ignore_index=True)
bcr

<div class="alert alert-warning">

Library type
    
It is recommended to specify the <b>library_type</b> argument, as it will remove all contigs that do not belong to the related loci. The rationale is that the primers used for a given library type should mostly amplify sequences from those loci, so contigs from any unexpected loci likely represent artifacts and shouldn't be analysed. The optional argument accepts `ig`, `tr-ab`, `tr-gd` or `None`, where `None` means all contigs will be kept.
</div>    
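As a rough illustration of what restricting contigs to a library type means, the sketch below filters toy contigs by their `locus` value. The mapping from library type to loci follows the usual locus names and is my assumption about what "related loci" covers; `keep_expected_loci` is a hypothetical helper, not a *dandelion* function.

```python
# Assumed mapping from library type to the loci it is expected to amplify.
LOCI_BY_LIBRARY = {
    "ig": {"IGH", "IGK", "IGL"},
    "tr-ab": {"TRA", "TRB"},
    "tr-gd": {"TRG", "TRD"},
}


def keep_expected_loci(rows, library_type=None):
    """Return only the contigs whose locus matches the library type (None keeps all)."""
    if library_type is None:
        return list(rows)
    allowed = LOCI_BY_LIBRARY[library_type]
    return [row for row in rows if row["locus"] in allowed]


toy_contigs = [
    {"sequence_id": "c1", "locus": "IGH"},
    {"sequence_id": "c2", "locus": "IGK"},
    {"sequence_id": "c3", "locus": "TRB"},  # unexpected in a BCR library
]
ig_only = keep_expected_loci(toy_contigs, "ig")
print([row["sequence_id"] for row in ig_only])  # ['c1', 'c2']
```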

The main output of this function is two additional columns in `vdj.data`, `extra` and `ambiguous`, which flag `T` or `F` for contigs that were marked accordingly. The rules for marking contigs are as follows:

`extra` is marked as `T` if the contig passes the internal QC filters based on `umi_count` (or `consensus_count` if there are ties in the `umi_count`) in a cell. If you are only interested in just the top contig pair, you can set `filter_extra=True` to remove the extra contigs.

For VDJ chains, the current rule set is to keep the **top 1** `productive` contig with the highest counts and mark the rest as `extra` (or `ambiguous` if appropriate). Toggle `ntop_vdj` to keep the top `n` (default 1) contigs.

For VJ chains, the current rule set is to keep the **top 2** `productive` contigs with the highest counts and mark the rest as `extra` (or `ambiguous` if appropriate). Toggle `ntop_vj` to keep the top `n` (default 2) contigs.

`ambiguous` is marked as `T` if the contig is of poor quality annotation and would be removed from downstream analysis. Cells with multiple contigs with very low `umi_count` and/or `consensus_count` are also marked as `ambiguous` as it is not possible to distinguish which is the most representative contig.
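The top-`n` rule can be sketched in a few lines of plain Python. This is a simplified illustration and not the actual `ddl.pp.check_contigs` logic: it ranks productive contigs within each cell by `umi_count`, uses `consensus_count` only to break ties, and leaves out the `ambiguous` checks and non-productive contigs entirely. The helper `mark_extra` and the toy values are hypothetical.

```python
from collections import defaultdict


def mark_extra(rows, ntop):
    """Flag productive contigs ranked below the top `ntop` in each cell as extra ('T')."""
    by_cell = defaultdict(list)
    for row in rows:
        if row["productive"] == "T":
            by_cell[row["cell_id"]].append(row)
    flags = {row["sequence_id"]: "F" for row in rows}
    for members in by_cell.values():
        # Rank by umi_count; consensus_count only matters when umi_count ties.
        ranked = sorted(
            members,
            key=lambda r: (r["umi_count"], r["consensus_count"]),
            reverse=True,
        )
        for row in ranked[ntop:]:
            flags[row["sequence_id"]] = "T"
    return flags


# Toy light-chain (VJ) contigs from one cell; all values are made up.
vj_contigs = [
    {"sequence_id": "k1", "cell_id": "cellA", "productive": "T",
     "umi_count": 30, "consensus_count": 900},
    {"sequence_id": "k2", "cell_id": "cellA", "productive": "T",
     "umi_count": 8, "consensus_count": 150},
    {"sequence_id": "k3", "cell_id": "cellA", "productive": "T",
     "umi_count": 8, "consensus_count": 400},
]
extra_vj = mark_extra(vj_contigs, ntop=2)
print(extra_vj)  # {'k1': 'F', 'k2': 'T', 'k3': 'F'}
```

Here `k2` and `k3` tie on `umi_count`, so the higher `consensus_count` of `k3` keeps it in the top 2 and `k2` is the one marked as `extra`.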

Please note that the default for `filter_extra` is `True`. If you want to keep the `extra` contigs, for example because you are interested in T/B-cell development datasets, you need to set `filter_extra=False`. We are setting this to `False` in this example because later on we want to visualise these extra contigs.

In [None]:
vdj, adata = ddl.pp.check_contigs(
    bcr, adata, library_type="ig", filter_extra=False
)

<b>Check the Dandelion object</b>

In [None]:
vdj

<b>Check the AnnData object as well</b>

In [None]:
adata

These are the relevant columns for looking at the QC status of the cells and contigs in the `.obs` slot in the `AnnData` object (and also `.metadata` slot in the `Dandelion` object):
<div class="alert alert-info">

Relevant columns in obs

- `has_contig`: whether cells have V(D)J chains.

- `locus_status`: detailed information on chain status pairings (below).

- `chain_status`: summarised information of the chain locus status pairings (similar to `chain_pairing` in `scirpy`).

- `rearrangement_status_VDJ` and `rearrangement_status_VJ`: whether or not V(D)J gene usage is standard (i.e. all from the same locus).

</div>

So in a standard situation, I would remove cells flagged with `Orphan VJ`, `Orphan VJ-exception`, `ambiguous` in `.metadata.chain_status`, and also any cell marked as `chimeric` in the `.metadata.rearrangement_status_VDJ` and `.metadata.rearrangement_status_VJ` from downstream cell-level calculations/analysis. 
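The filtering rule I just described can be written as a small predicate. The sketch below uses plain dictionaries in place of the `.obs`/`.metadata` table so that it runs on its own. The helper `keep_cell`, the toy status strings such as `Single pair`, and the substring match on "chimeric" are my assumptions for illustration and should be checked against the actual values in your data.

```python
# chain_status values the text suggests excluding from cell-level analysis.
EXCLUDED_CHAIN_STATUS = {"Orphan VJ", "Orphan VJ-exception", "ambiguous"}


def keep_cell(obs_row):
    """Return True if a cell passes the chain_status and rearrangement checks."""
    if obs_row["chain_status"] in EXCLUDED_CHAIN_STATUS:
        return False
    for column in ("rearrangement_status_VDJ", "rearrangement_status_VJ"):
        # Assumption: chimeric rearrangements contain the word "chimeric".
        if "chimeric" in str(obs_row[column]).lower():
            return False
    return True


# Toy per-cell metadata; status strings not named in the text are made up.
cells = {
    "cellA": {"chain_status": "Single pair",
              "rearrangement_status_VDJ": "standard",
              "rearrangement_status_VJ": "standard"},
    "cellB": {"chain_status": "Orphan VJ",
              "rearrangement_status_VDJ": "standard",
              "rearrangement_status_VJ": "standard"},
    "cellC": {"chain_status": "Single pair",
              "rearrangement_status_VDJ": "chimeric",
              "rearrangement_status_VJ": "standard"},
}
kept_cells = [name for name, row in cells.items() if keep_cell(row)]
print(kept_cells)  # ['cellA']
```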

Having said that, you will find that most of `Dandelion`'s functions will work without the need to perform additional filtering, and filtering can be performed on the final `AnnData` object (described in the visualisation section).

<b>Let's take a look at these new columns</b>

In [None]:
pd.crosstab(adata.obs["chain_status"], adata.obs["locus_status"])

If there are multiple library types, i.e. `ddl.pp.filter_contigs` or `ddl.pp.check_contigs` was run with `library_type = None`, or if several TCR/BCR `Dandelion` objects are concatenated, the `v/d/j/c calls` and `productive` columns will be split into additional columns to reflect those that belong to a B cell, alpha-beta T cell, or gamma-delta T cell.

We will use the contig-checked `vdj` and `adata` objects going forward.

## Now actually filter the AnnData object and run through a standard workflow starting by filtering genes and normalizing the data

Because the `AnnData` object was returned filtered but otherwise unprocessed, we still need to normalize it and run through the usual steps here. The following is just a standard scanpy workflow.

In [None]:
# filter genes
sc.pp.filter_genes(adata, min_cells=3)
# Normalize the counts
sc.pp.normalize_total(adata, target_sum=1e4)
# Logarithmize the data
sc.pp.log1p(adata)
# Stash the normalised counts
adata.raw = adata
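To see what the normalization and log transform do to the numbers, here is the arithmetic for a single toy cell in plain Python. The counts are made up, and this only mirrors the idea of scaling each cell to a total of `1e4` followed by `log(1 + x)`; it is not a replacement for the *scanpy* calls.

```python
import math

counts = [3, 1, 0, 6]  # toy raw counts for one cell across four genes
target_sum = 1e4

# Scale the cell so that its counts add up to target_sum.
scale = target_sum / sum(counts)
normalized = [c * scale for c in counts]

# Natural log of (1 + x); zero counts stay at zero.
logged = [math.log1p(x) for x in normalized]

print(normalized)  # [3000.0, 1000.0, 0.0, 6000.0]
print([round(x, 3) for x in logged])
```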

<b>Identify highly-variable genes</b>

In [None]:
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
sc.pl.highly_variable_genes(adata)

<b>Filter the genes to only those marked as highly-variable</b>

In [None]:
adata = adata[:, adata.var.highly_variable]

<b>Regress out effects of total counts per cell and the percentage of mitochondrial genes expressed. Scale the data to unit variance.</b>

In [None]:
sc.pp.regress_out(adata, ["total_counts", "pct_counts_mt"])
sc.pp.scale(adata, max_value=10)

<b>Run PCA</b>

In [None]:
sc.tl.pca(adata, svd_solver="arpack")
sc.pl.pca_variance_ratio(adata, log=True, n_pcs=50)

<b>Computing the neighborhood graph, umap and clusters</b>

In [None]:
# Computing the neighborhood graph
sc.pp.neighbors(adata)
# Embedding the neighborhood graph
sc.tl.umap(adata)
# Clustering the neighborhood graph
sc.tl.leiden(adata)

<b>Visualizing the clusters and whether or not there's a corresponding V(D)J receptor</b>

In [None]:
sc.pl.umap(adata, color=["leiden", "chain_status"])

<b>Visualizing some B cell genes</b>

In [None]:
sc.pl.umap(adata, color=["IGHM", "JCHAIN"])

<b>Save AnnData</b>

We can save this `AnnData` object for now.

In [None]:
adata.write("adata.h5ad", compression="gzip")

<b>Save dandelion</b>

To save the vdj object, we have two options - either save the `.data` and `.metadata` slots with pandas' functions:

In [None]:
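vdj.data.to_csv("filtered_vdj_table.tsv", sep="\t")
# the metadata slot can be saved the same way (the file name is just an example)
vdj.metadata.to_csv("filtered_vdj_metadata.tsv", sep="\t")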

Or save the whole Dandelion class object, either with `.write_h5ddl/.write`, which saves the class to an HDF5 format, or with the pickle-based `.write_pkl` function.

<div class="alert alert-warning">

From v0.4.0, the `.write_h5ddl/.write` function has been refactored to use `h5py`. Support for files saved prior to v0.4.0 (which used `pandas` to save in HDF5 format) will be maintained at least until the next major version and `ddl.read_h5ddl` will be able to read both old and new versions. The old version can be saved with `.write_h5ddl(..., version=3)` but this is not covered by tests because of issues with installing the dependencies.
</div>

In [None]:
vdj.write_h5ddl("dandelion_results.h5ddl")  # can add compression="gzip"

In [None]:
# this will automatically use bzip2 for compression;
# switch the extension to .gz for gzip
vdj.write_pkl("dandelion_results.pkl.pbz2")
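If you are curious what a compressed pickle that depends on the file extension looks like in general, the following standard-library sketch writes and reads back a small object with either bzip2 or gzip. It only illustrates the idea and is not how `.write_pkl` is implemented. `write_pickle`, `read_pickle` and the toy payload are hypothetical.

```python
import bz2
import gzip
import os
import pickle
import tempfile


def write_pickle(obj, path):
    """Pickle `obj` to `path`, choosing the compressor from the file extension."""
    opener = gzip.open if path.endswith(".gz") else bz2.open
    with opener(path, "wb") as handle:
        pickle.dump(obj, handle)


def read_pickle(path):
    """Load an object written by `write_pickle`."""
    opener = gzip.open if path.endswith(".gz") else bz2.open
    with opener(path, "rb") as handle:
        return pickle.load(handle)


payload = {"n_contigs": 3, "loci": ["IGH", "IGK"]}
restored = {}
with tempfile.TemporaryDirectory() as tmp:
    for name in ("toy.pkl.pbz2", "toy.pkl.gz"):
        target = os.path.join(tmp, name)
        write_pickle(payload, target)
        restored[name] = read_pickle(target)

print(restored["toy.pkl.pbz2"] == payload)  # True
```

As with any pickle, only load files that come from a source you trust.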

## Running `ddl.pp.check_contigs` without `AnnData`

Finally, `ddl.pp.check_contigs` can also be run without an `AnnData` object:

In [None]:
vdj3 = ddl.pp.check_contigs(bcr)
vdj3