# scBasset: Batch correction of scATAC-seq data

<div class="alert alert-warning">

Warning

`SCBASSET` is still under development. The current version may not fully reproduce the results of the original implementation.

</div>

In addition to performing [representation learning on scATAC-seq data](https://docs.scvi-tools.org/en/latest/tutorials/notebooks/scbasset.html), scBasset can also be used to integrate data across several samples. This tutorial walks through the following:

1. Loading the dataset
2. Preprocessing the dataset with `scanpy`
3. Setting up and training the model
4. Visualizing the batch-corrected latent space with `scanpy`
5. Quantifying integration performance with `scib-metrics`

In [None]:
!pip install --quiet scvi-colab
from scvi_colab import install
install()

In [None]:
import anndata as ad
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scanpy as sc
import scvi
import seaborn as sns

scvi.settings.seed = 0
sc.set_figure_params(figsize=(4, 4), frameon=False)
%config InlineBackend.print_figure_kwargs={'facecolor' : "w"}
%config InlineBackend.figure_format='retina'

## 1. Loading the dataset

Throughout this tutorial, we use the dataset from [Buenrostro et al., 2018](https://pubmed.ncbi.nlm.nih.gov/29706549/), which contains single-cell chromatin accessibility profiles across 10 populations of human hematopoietic cell types.

In [None]:
adata = sc.read(
    "data/buen_ad_sc.h5ad", 
    backup_url="https://storage.googleapis.com/scbasset_tutorial_data/buen_ad_sc.h5ad"
)
adata

We see that batch information is stored in `adata.obs["batch"]`. In this case, batches correspond to different donors.

In [None]:
adata.obs["batch"].value_counts()

## 2. Preprocessing the dataset

We now use `scanpy` to preprocess the data before passing it to the model. Here, we remove rarely detected peaks (those detected in fewer than 5% of cells) to speed up training. Although the function is named `filter_genes`, it filters the variables of `adata`, which in this dataset are peaks.

In [None]:
print("before filtering:", adata.shape)
min_cells = int(adata.n_obs * 0.05)  # threshold: 5% of cells
sc.pp.filter_genes(adata, min_cells=min_cells)  # in-place filtering of regions
print("after filtering:", adata.shape)
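
To make the filtering rule concrete, here is a minimal NumPy sketch on a made-up count matrix. The array `counts` and its values are illustrative and not part of the dataset. It assumes, as with `sc.pp.filter_genes`, that a peak is kept when it is detected in at least `min_cells` cells.

```python
import numpy as np

# Toy cells-by-peaks count matrix: 40 cells, 4 peaks (hypothetical data).
counts = np.zeros((40, 4), dtype=int)
counts[:10, 0] = 1  # peak 0: detected in 10 cells
counts[:1, 1] = 3   # peak 1: detected in a single cell
counts[:, 2] = 2    # peak 2: detected in every cell
# peak 3 is never detected

min_cells = int(counts.shape[0] * 0.05)    # 5% of 40 cells
cells_per_peak = (counts > 0).sum(axis=0)  # number of cells in which each peak is detected
keep = cells_per_peak >= min_cells         # peaks passing the threshold
filtered = counts[:, keep]

print("threshold:", min_cells)
print("cells per peak:", cells_per_peak)
print("shape after filtering:", filtered.shape)
```

Only the columns (peaks) change; the number of cells is untouched.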

Taking a look at `adata.var`, we see that this dataset has already been processed to include the `start` and `end` positions of each peak, as well as the chromosomes on which they are located.

In [None]:
adata.var.sample(10)

We will use this information in order to add DNA sequences into `adata.varm`. This can be performed in-place with `scvi.data.add_dna_sequence`.

In [None]:
scvi.data.add_dna_sequence(
    adata, 
    chr_var_key="chr", 
    start_var_key="start",
    end_var_key="end",
    genome_name="hg19",
    genome_provider="GENCODE",
    genome_dir="data/genome",
)
adata

The function adds two new fields into `adata.varm`: `dna_sequence`, containing bases for each position, and `dna_code`, containing bases encoded as integers.

In [None]:
adata.varm["dna_sequence"]
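
As a rough illustration of what "bases encoded as integers" means, the sketch below encodes a short sequence and expands it into a one-hot matrix, a common input format for sequence models. The mapping `BASE_TO_CODE` and the helpers `encode` and `one_hot` are hypothetical; the exact codes produced by `scvi.data.add_dna_sequence`, including how ambiguous bases such as `N` are handled, may differ.

```python
import numpy as np

# Hypothetical mapping, for illustration only; not taken from scvi-tools.
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(seq: str) -> np.ndarray:
    """Encode a DNA string as an integer array, one code per base."""
    return np.array([BASE_TO_CODE[base] for base in seq.upper()], dtype=np.int64)


def one_hot(codes: np.ndarray, n_bases: int = 4) -> np.ndarray:
    """Expand integer codes into a (sequence_length, n_bases) one-hot matrix."""
    return np.eye(n_bases, dtype=np.float32)[codes]


codes = encode("ACGTTA")
encoded = one_hot(codes)
print(codes)
print(encoded.shape)
```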

## 3. Setting up and training the model

We now set up our data with the model using `setup_anndata`, which will ensure everything the model needs is in place for training.

In this stage, we can condition the model on additional covariates, which encourages the model to remove the impact of those covariates from the learned latent space. Since we are integrating our data across donors, we set the `batch_key` argument to the key in `adata.obs` that contains donor information (in our case, just `"batch"`).

Additionally, since scBasset draws training mini-batches over regions rather than cells, we transpose the data before passing it to the model, so that regions become the observations. The model also expects binary accessibility data, so we add a new layer that records whether each region is accessible in each cell.

In [None]:
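bdata = adata.transpose()  # regions become observations, cells become variables
bdata.layers["binary"] = (bdata.X > 0).astype(float)  # binarize accessibility counts
scvi.external.SCBASSET.setup_anndata(
    bdata,
    layer="binary",
    dna_code_key="dna_code",
    batch_key="batch",  # donor labels, stored in bdata.var after the transpose
)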
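
The transpose and binarization steps can be checked on a small example. The sketch below uses a made-up dense count matrix (the tutorial's `adata.X` may be sparse, and the values here are purely illustrative).

```python
import numpy as np

# Toy cells-by-regions count matrix: 3 cells, 4 regions (hypothetical data).
counts = np.array(
    [
        [0, 2, 0, 1],
        [3, 0, 0, 1],
        [0, 0, 5, 0],
    ]
)

regions_by_cells = counts.T                    # one row per region, one column per cell
binary = (regions_by_cells > 0).astype(float)  # 1.0 where a region is accessible in a cell

print(regions_by_cells.shape)
print(binary)
```

Each row of `binary` is the accessibility profile of one region across cells, which is the unit the model sees in a mini-batch.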

We now create the model and train it with default parameters.

In [None]:
model = scvi.external.SCBASSET(bdata)
model.train()

## 4. Visualizing the batch-corrected latent space

After training, we retrieve the integrated latent space and save it into `adata.obsm`.

In [None]:
LATENT_KEY = "X_scbasset"
adata.obsm[LATENT_KEY] = model.get_latent_representation()
adata.obsm[LATENT_KEY].shape

Next, we use `scanpy` to cluster and visualize the latent space: we first compute the k-nearest-neighbor graph on the latent representation, then compute a UMAP embedding and Leiden clusters from that graph.

In [None]:
CLUSTER_KEY = "leiden_scbasset"
sc.pp.neighbors(adata, use_rep=LATENT_KEY)
sc.tl.umap(adata)
sc.tl.leiden(adata, key_added=CLUSTER_KEY)

In [None]:
sc.pl.umap(adata, color=CLUSTER_KEY)

## 5. Quantifying integration performance

Finally, we quantify the quality of the integration by computing various integration metrics using `scib-metrics`.

In [None]:
# work in progress :)
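
As a rough, self-contained sanity check (not a `scib-metrics` function), one could measure how often a cell's nearest neighbors in the latent space come from a different batch. The helper `neighbor_batch_mixing` and the simulated data below are hypothetical and only sketch the idea under simple assumptions: Euclidean distances and a fixed number of neighbors `k`.

```python
import numpy as np


def neighbor_batch_mixing(latent: np.ndarray, batches: np.ndarray, k: int = 5) -> float:
    """Mean fraction of each cell's k nearest neighbors that belong to a different batch."""
    # Pairwise squared Euclidean distances between all cells.
    diff = latent[:, None, :] - latent[None, :, :]
    dists = (diff**2).sum(axis=-1)
    np.fill_diagonal(dists, np.inf)  # a cell is not its own neighbor
    nn_idx = np.argsort(dists, axis=1)[:, :k]
    from_other_batch = batches[nn_idx] != batches[:, None]
    return float(from_other_batch.mean())


# Simulated data: two batches of 50 cells in an 8-dimensional latent space.
rng = np.random.default_rng(0)
batches = np.repeat([0, 1], 50)
mixed = rng.normal(size=(100, 8))            # no batch effect
separated = mixed + 10.0 * batches[:, None]  # strong batch shift

score_mixed = neighbor_batch_mixing(mixed, batches)
score_separated = neighbor_batch_mixing(separated, batches)
print("mixed:", score_mixed)
print("separated:", score_separated)
```

With two equally sized batches, a well-mixed space scores near 0.5 and a batch-separated space scores near 0. On the tutorial data, the same helper could be called with `adata.obsm[LATENT_KEY]` and `adata.obs["batch"].to_numpy()`. The score only reflects batch mixing and says nothing about whether cell types remain separated.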