# 1 Million Brain Cells
**Author:** [Severin Dicks](https://github.com/Intron7)

To run this notebook, please make sure you have a working environment with all necessary dependencies. Run the [data_downloader](https://github.com/scverse/rapids_singlecell-notebooks/blob/abc4fc6f3fe7f85cbffb94e76d190cad0ae00a5f/data_downloader.ipynb) notebook first to create the AnnData object we are working with. In this example workflow we'll be looking at a dataset of 1,000,000 brain cells from [Nvidia](https://github.com/clara-parabricks/rapids-single-cell-examples/blob/master/notebooks/1M_brain_cpu_analysis.ipynb).

In [1]:
import scanpy as sc
import cupy as cp
import anndata as ad

import time
import rapids_singlecell as rsc

import warnings

warnings.filterwarnings("ignore")

In [2]:
import gc

## Load and Prepare Data

We load the sparse count matrix from a Zarr store using `anndata.read_zarr`. Apart from flagging mitochondrial genes with `rsc`, the steps that follow call Scanpy and run on the CPU.

In [3]:
data_load_start = time.time()

In [4]:
%%time
adata = ad.read_zarr("zarr/1M.zarr/")

CPU times: user 7.18 s, sys: 1.93 s, total: 9.11 s
Wall time: 2.74 s


The AnnData object stays in host memory in this workflow; no cell below moves it to VRAM.

Verify the shape of the resulting sparse matrix:

In [5]:
adata.shape

(1000000, 27998)

In [6]:
data_load_time = time.time()
print("Total data load and format time: %s" % (data_load_time - data_load_start))

Total data load and format time: 2.93041729927063


## Preprocessing

In [7]:
preprocess_start = time.time()

### Quality Control

We compute basic quality control metrics.

In [8]:
%%time
rsc.pp.flag_gene_family(adata, gene_family_name="MT", gene_family_prefix="mt-")

CPU times: user 2.88 ms, sys: 0 ns, total: 2.88 ms
Wall time: 2.81 ms


In [9]:
%%time
sc.pp.calculate_qc_metrics(adata, qc_vars=["MT"], inplace=True)

CPU times: user 53.4 s, sys: 3.69 s, total: 57.1 s
Wall time: 11.7 s


### Filter

We filter the count matrix to remove cells with an extreme number of genes expressed.
We also filter out cells with a mitochondrial content of more than 20%.

In [10]:
%%time
adata = adata[
    (adata.obs["n_genes_by_counts"] < 5000)
    & (adata.obs["n_genes_by_counts"] > 500)
    & (adata.obs["pct_counts_MT"] < 20)
].copy()

CPU times: user 1.42 s, sys: 1.1 s, total: 2.51 s
Wall time: 2.51 s
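
The filter above is a boolean mask built from per-cell QC columns. As a minimal sketch with toy NumPy arrays (the values are made up for illustration and are not taken from the dataset), the same thresholds select cells like this:

```python
import numpy as np

# Toy per-cell QC metrics (hypothetical values, one entry per cell).
n_genes_by_counts = np.array([300, 1200, 4800, 6000, 2500])
pct_counts_mt = np.array([5.0, 25.0, 10.0, 3.0, 19.9])

# Same thresholds as the notebook cell: 500 < n_genes < 5000 and < 20% MT.
keep = (
    (n_genes_by_counts < 5000)
    & (n_genes_by_counts > 500)
    & (pct_counts_mt < 20)
)

# Positions of the cells that pass all three conditions.
kept_idx = np.flatnonzero(keep)
print(kept_idx)
```

Only the cells at positions 2 and 4 pass all three conditions; the others fail on gene count or mitochondrial percentage.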


Many Python objects are not deallocated until garbage collection runs. When working with data that barely fits in memory (generally, more than 50% of available memory), you may need to manually trigger garbage collection to reclaim memory.

In [11]:
%%time
gc.collect()

CPU times: user 121 ms, sys: 41 ms, total: 162 ms
Wall time: 162 ms


519
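
The integer printed by `gc.collect()` is the number of unreachable objects the collector found. The following standard-library sketch, built around a hypothetical `Node` class, illustrates why an explicit collection can help: objects caught in a reference cycle are not released by reference counting alone.

```python
import gc


class Node:
    """Hypothetical helper: an object that can point at another object."""

    def __init__(self):
        self.ref = None


gc.collect()            # start from a clean state
a, b = Node(), Node()
a.ref, b.ref = b, a     # build a reference cycle
del a, b                # the names are gone, but the cycle keeps both alive
freed = gc.collect()    # the cycle collector finds the unreachable objects
print(freed)
```

The exact count depends on the interpreter version, so only its lower bound (the two `Node` instances) is meaningful here.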

We also filter out genes that are expressed in fewer than 3 cells.

In [12]:
%%time
sc.pp.filter_genes(adata, min_cells=3)

CPU times: user 20 s, sys: 1.92 s, total: 21.9 s
Wall time: 21.9 s
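
Conceptually, `min_cells=3` keeps a gene only if it has a nonzero count in at least three cells. A minimal sketch on a toy dense matrix (hypothetical data; the real matrix is sparse):

```python
import numpy as np

# Toy count matrix: rows are cells, columns are genes (hypothetical values).
counts = np.array(
    [
        [0, 1, 0, 2],
        [0, 3, 0, 0],
        [0, 1, 5, 0],
        [0, 2, 0, 0],
    ]
)

min_cells = 3
# Number of cells in which each gene has a nonzero count.
n_cells_per_gene = (counts > 0).sum(axis=0)
gene_mask = n_cells_per_gene >= min_cells

# Keep only the genes that pass the threshold.
filtered = counts[:, gene_mask]
print(n_cells_per_gene, filtered.shape)
```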


We store the raw expression counts in `.layers["counts"]`.

In [13]:
adata.layers["counts"] = adata.X.copy()

In [14]:
adata.shape

(982490, 22539)

### Normalize

We normalize the count matrix so that the total counts in each cell sum to 1e4.

In [15]:
%%time
sc.pp.normalize_total(adata, target_sum=1e4)

CPU times: user 3.99 s, sys: 534 ms, total: 4.52 s
Wall time: 1.23 s
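
As a rough illustration of what total-count normalization does, the sketch below rescales each row of a toy matrix (hypothetical values) so that it sums to `target_sum`. It assumes every cell has a nonzero total and ignores the sparse storage used in the notebook.

```python
import numpy as np

# Toy count matrix: rows are cells, columns are genes (hypothetical values).
counts = np.array([[1.0, 3.0, 0.0], [10.0, 0.0, 10.0]])
target_sum = 1e4

# Divide each cell by its total count, then scale to the target sum.
totals = counts.sum(axis=1, keepdims=True)
normalized = counts / totals * target_sum
print(normalized)
```

After this step the two cells are comparable even though the second one had five times as many raw counts.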


Next, we log transform the count matrix.

In [16]:
%%time
sc.pp.log1p(adata)

CPU times: user 1 s, sys: 0 ns, total: 1 s
Wall time: 1 s
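
The transform is `log(1 + x)` applied elementwise. One property worth seeing on a toy vector (hypothetical values): zeros stay zero, so a sparse matrix keeps its sparsity pattern, and `expm1` undoes the transform.

```python
import numpy as np

x = np.array([0.0, 1.0, 9999.0])  # toy normalized values
logged = np.log1p(x)              # log(1 + x), elementwise

# Zero entries remain exactly zero; expm1 recovers the original values.
restored = np.expm1(logged)
print(logged, restored)
```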


### Select Most Variable Genes

Now we search for highly variable genes. `sc.pp.highly_variable_genes` supports the flavors `seurat`, `cell_ranger`, and `seurat_v3`, among others (Pearson-residual selection lives in `sc.experimental.pp.highly_variable_genes`). You can filter based on cutoffs or select the top n genes. You can also pass a `batch_key` to reduce batch effects.

In this example we use `seurat_v3` for selecting highly variable genes based on the raw counts in `.layers["counts"]`.

In [17]:
%%time
sc.pp.highly_variable_genes(
    adata, n_top_genes=5000, flavor="seurat_v3", layer="counts"
)

CPU times: user 31.9 s, sys: 984 ms, total: 32.9 s
Wall time: 15 s
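
To give a feel for "select the top n genes", here is a deliberately simplified stand-in. It ranks genes by the plain variance-to-mean ratio on a toy matrix (hypothetical values); it is not the `seurat_v3` procedure, which fits a mean-variance trend before ranking.

```python
import numpy as np

# Toy count matrix: rows are cells, columns are genes (hypothetical values).
counts = np.array(
    [
        [2.0, 0.0, 1.0],
        [2.0, 4.0, 3.0],
        [2.0, 0.0, 1.0],
        [2.0, 4.0, 3.0],
    ]
)

mean = counts.mean(axis=0)
var = counts.var(axis=0, ddof=1)
dispersion = var / mean  # simplified score, assumes nonzero means

# Flag the n_top_genes genes with the largest score.
n_top_genes = 2
top = np.argsort(dispersion)[::-1][:n_top_genes]
highly_variable = np.zeros(counts.shape[1], dtype=bool)
highly_variable[top] = True
print(highly_variable)
```

The constant gene in the first column has zero variance and is left out; the other two are flagged.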


Now we save this version of the AnnData object as `adata.raw`.

In [18]:
%%time
adata.raw = adata

CPU times: user 292 μs, sys: 25 μs, total: 317 μs
Wall time: 324 μs


Now we restrict our AnnData object to the highly variable genes.

In [19]:
%%time
adata = adata[:, adata.var["highly_variable"]].copy()

CPU times: user 13 s, sys: 1.36 s, total: 14.3 s
Wall time: 14.3 s


In [20]:
adata.shape

(982490, 5002)

Next we regress out the effects of counts per cell and the mitochondrial content of the cells. You can use any numerical column in `.obs` for this.

In [None]:
%%time
sc.pp.regress_out(adata, keys=["total_counts", "pct_counts_MT"])

CPU times: user 4 μs, sys: 0 ns, total: 4 μs
Wall time: 8.58 μs
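
Regressing out a covariate means fitting a linear model per gene and keeping the residuals. The sketch below does this for one synthetic gene with ordinary least squares in NumPy; the data, coefficients, and variable names are made up for illustration, and it is a sketch of the idea rather than Scanpy's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells = 100

# Synthetic covariates and one synthetic gene that depends on them.
total_counts = rng.uniform(1000, 5000, n_cells)
pct_counts_mt = rng.uniform(0, 20, n_cells)
expr = 0.001 * total_counts + 0.05 * pct_counts_mt + rng.normal(size=n_cells)

# Design matrix with an intercept column, fitted by least squares.
design = np.column_stack([np.ones(n_cells), total_counts, pct_counts_mt])
coef, *_ = np.linalg.lstsq(design, expr, rcond=None)

# The residuals are the expression values with the covariate effects removed.
residuals = expr - design @ coef
corr_counts = np.corrcoef(residuals, total_counts)[0, 1]
corr_mt = np.corrcoef(residuals, pct_counts_mt)[0, 1]
print(corr_counts, corr_mt)
```

Both correlations are zero up to floating-point error, which is the point of the step: what remains no longer varies linearly with either covariate.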


### Scale

Finally, we scale the expression matrix so that each gene becomes a z-score, and we clip values beyond 10 standard deviations.

In [None]:
%%time
sc.pp.scale(adata, max_value=10, zero_center=True)

CPU times: user 6.49 s, sys: 107 ms, total: 6.6 s
Wall time: 439 ms
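
A minimal sketch of z-scoring with clipping for a single toy gene (hypothetical values). It assumes the sample standard deviation (`ddof=1`) and symmetric clipping at `±max_value`; both are assumptions of this sketch, and the exact behaviour of `sc.pp.scale` can differ between versions.

```python
import numpy as np

max_value = 10.0

# One toy gene across 200 cells with a single extreme outlier.
x = np.zeros(200)
x[0] = 1000.0

# Zero-center and divide by the standard deviation to get z-scores.
z = (x - x.mean()) / x.std(ddof=1)
z_raw_max = z.max()

# Clip so that no value lies more than max_value standard deviations out.
z_clipped = np.clip(z, -max_value, max_value)
print(z_raw_max, z_clipped.max())
```

Without clipping, the outlier sits about 14 standard deviations from the mean; after clipping it is capped at 10, which limits its influence on the PCA that follows.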


### Principal component analysis

We use PCA to reduce the dimensionality of the matrix to its top 100 principal components. Here the PCA runs through Scanpy's `sc.pp.pca`. Setting `use_highly_variable=False` is sufficient because we already subset the matrix to only HVGs.

In [23]:
%%time
sc.pp.pca(adata, n_comps=100, use_highly_variable=False)

CPU times: user 19min 18s, sys: 122 ms, total: 19min 19s
Wall time: 2min 27s
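
For intuition, PCA can be written as a singular value decomposition of the centered matrix. The sketch below uses random toy data and plain NumPy; it illustrates the projection itself and makes no claim about which solver `sc.pp.pca` uses internally.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))  # toy matrix: 50 cells, 8 genes
n_comps = 3

# Center each gene, then decompose.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Project the cells onto the first n_comps principal axes.
X_pca = Xc @ Vt[:n_comps].T
explained_variance = S[:n_comps] ** 2 / (X.shape[0] - 1)
print(X_pca.shape, explained_variance)
```

The rows of `Vt` are the principal axes, ordered by decreasing explained variance, so keeping the first `n_comps` rows keeps the directions that capture the most variation.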


In [25]:
preprocess_time = time.time()
print("Total Preprocessing time: %s" % (preprocess_time - preprocess_start))

Total Preprocessing time: 217.44359397888184


# Visualization
## Clustering and Visualization

### Computing the neighborhood graph and UMAP

Next we compute the neighborhood graph using Scanpy.

Scanpy's CPU implementation of nearest-neighbor search uses an approximation, while the GPU version calculates the exact graph. Both methods are valid, but you might see differences between them.

In [26]:
%%time
sc.pp.neighbors(adata, n_neighbors=15, n_pcs=50)

CPU times: user 2min 19s, sys: 4.16 s, total: 2min 23s
Wall time: 2min 20s
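
To make "exact" concrete: an exact k-nearest-neighbor search compares every point with every other point. The brute-force NumPy sketch below does this on random toy data; it is only practical for small inputs, which is why approximate methods exist, and it is not how either library implements the search.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(20, 5))  # toy data: 20 cells in a 5-dim PCA space
k = 3

# All pairwise Euclidean distances (20 x 20).
diff = points[:, None, :] - points[None, :, :]
dist = np.sqrt((diff**2).sum(axis=-1))
np.fill_diagonal(dist, np.inf)  # a cell is not its own neighbor

# Indices and distances of the k closest cells for each cell.
knn_idx = np.argsort(dist, axis=1)[:, :k]
knn_dist = np.take_along_axis(dist, knn_idx, axis=1)
print(knn_idx[:2])
```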


Next we calculate the UMAP embedding using Scanpy.

In [27]:
%%time
sc.tl.umap(adata, min_dist=0.3)

CPU times: user 2h 43min 28s, sys: 3.06 s, total: 2h 43min 31s
Wall time: 7min 27s


### Clustering

Next, we use the Louvain and Leiden algorithms for graph-based clustering.

In [28]:
%%time
sc.tl.louvain(adata, resolution=0.6, flavor="igraph")

CPU times: user 3min 42s, sys: 3.83 s, total: 3min 45s
Wall time: 3min 45s


In [29]:
%%time
sc.tl.leiden(adata, resolution=1.0, flavor="igraph", key_added="igraph_leiden")

CPU times: user 29.8 s, sys: 2.43 s, total: 32.2 s
Wall time: 32.2 s


In [30]:
%%time
sc.tl.leiden(adata, resolution=1.0, key_added="org_leiden")

CPU times: user 1h 11min 37s, sys: 1min 7s, total: 1h 12min 44s
Wall time: 1h 12min 31s
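
Graph-based clustering methods of this kind search for a partition that scores well on a quality function such as modularity, and the `resolution` parameter shifts the balance towards more or fewer clusters. The sketch below computes modularity for a toy graph with a hypothetical helper `modularity`; it shows the score being optimized, not the Louvain or Leiden search itself.

```python
import numpy as np


def modularity(adjacency, labels, resolution=1.0):
    """Modularity of a partition of an undirected graph (hypothetical helper)."""
    two_m = adjacency.sum()            # twice the number of edges
    degrees = adjacency.sum(axis=1)
    same = labels[:, None] == labels[None, :]
    expected = resolution * np.outer(degrees, degrees) / two_m
    return float(((adjacency - expected) * same).sum() / two_m)


# Toy graph: two triangles joined by a single edge (2-3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

q_two = modularity(A, np.array([0, 0, 0, 1, 1, 1]))  # one cluster per triangle
q_one = modularity(A, np.zeros(6, dtype=int))        # everything in one cluster
print(q_two, q_one)
```

Splitting the graph into its two triangles scores 5/14, while lumping all nodes together scores 0, so the split is the better partition under this measure.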


## TSNE

In [30]:
%%time
sc.tl.tsne(adata, n_pcs=40)

CPU times: user 4h 52min 1s, sys: 5min 19s, total: 4h 57min 21s
Wall time: 25min 41s


## Diffusion Maps

In [None]:
%%time
sc.tl.diffmap(adata)

CPU times: user 52min 34s, sys: 147 ms, total: 52min 34s
Wall time: 50.9 s


After this you can use `X_diffmap` for `sc.pp.neighbors` and other functions. 

In [32]:
print("Total Processing time: %s" % (time.time() - preprocess_start))

Total Processing time: 2602.0673031806946
