# Preparing single-cell data for a benchmark

To streamline benchmarking, we download a number of scRNA-seq datasets, apply pre-processing and extract the PCA-transformed data, along with a single vector of manual labels.
We also create a *k*-nearest-neighbour graph (*k*-NNG) for denoising or triplet generation.
Datasets differ in the type of pre-processing they need, as well as the name of the manual label column.

Use and adapt the code below to download and prepare your datasets.
Some cells use Bash commands for downloading data, assuming the notebook is run on a Linux/macOS machine.
You can also download your data using your browser, or run this notebook in Google Colab (you can mount your Google Drive there).

### Benchmarking on cytometry datasets

This workflow is easily adaptable to flow and mass cytometry data.
You can find examples of basic FCS file data pre-processing [here](https://pytometry.netlify.app/examples/01_preprocess_cytof_oetjen) and [here](https://github.com/saeyslab/ViVAE/blob/main/example_cytometry.ipynb).

In [1]:
import scanpy as sc
import numpy as np
import os
import viscore as vs



#### **0.** Set up dataset name and output path

These will be used for storing the dataset.

In [None]:
dataset_name = 'Triana'
output_path = './data'

In [None]:
# Create the output directory (and any missing parents) if it does not exist yet
os.makedirs(output_path, exist_ok=True)

#### **1.** Download dataset as H5AD file

Datasets are most easily downloadable from the CELLxGENE database using `wget`.

In [None]:
%%bash
wget -O ./scrnaseq.h5ad https://datasets.cellxgene.cziscience.com/d738f73e-7c76-4ff9-b9ef-94a46bc217f4.h5ad >/dev/null 2>&1
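
If `wget` is not available (for example on Windows), the same download can be done from Python. The following is a minimal sketch using only the standard library; the helper name `download_file` is hypothetical and not part of any package used in this notebook.

```python
import os
import urllib.request


def download_file(url, dest, overwrite=False):
    """Download `url` to `dest` unless the file already exists."""
    if os.path.exists(dest) and not overwrite:
        return dest  # reuse the existing download
    urllib.request.urlretrieve(url, dest)
    return dest


# Example (commented out to avoid an accidental large download):
# download_file(
#     'https://datasets.cellxgene.cziscience.com/d738f73e-7c76-4ff9-b9ef-94a46bc217f4.h5ad',
#     './scrnaseq.h5ad',
# )
```

Skipping the download when the file already exists makes it cheap to re-run the notebook from the top.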

#### **2.** Pre-process counts/expression data

If the `X` matrix contains a raw count matrix, set `counts` to `True`.
Otherwise, if it already contains transformed expression values, set it to `False`, as fewer pre-processing steps need to be applied.
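
If it is unclear which case applies, a rough heuristic is to check whether the values are non-negative integers. The sketch below is only an illustration on toy data; `looks_like_counts` is a hypothetical helper, and it expects a dense array (for a SciPy sparse `X`, passing its `.data` attribute should work the same way).

```python
import numpy as np


def looks_like_counts(x, tol=1e-8):
    """Return True if all values are non-negative and (near-)integer."""
    x = np.asarray(x, dtype=float)
    return bool(np.all(x >= 0) and np.all(np.abs(x - np.round(x)) <= tol))


raw = np.array([[0, 3, 1], [12, 0, 7]])
logged = np.log1p(raw)
print(looks_like_counts(raw))     # True
print(looks_like_counts(logged))  # False
```

This is a heuristic only: the dataset's own documentation remains the authoritative source on what `X` contains.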

In [None]:
counts = False

hd = sc.read_h5ad('./scrnaseq.h5ad')

## To filter by some condition (e.g. tissue == 'blood'):
# hd = hd[hd.obs['tissue']=='blood']

if counts:
    sc.pp.normalize_total(hd)
    sc.pp.log1p(hd)
sc.pp.scale(hd, max_value=10.)
sc.tl.pca(hd, svd_solver='arpack', n_comps=100)
pc = hd.obsm['X_pca']
np.save(os.path.join(output_path, f'{dataset_name}_input.npy'), pc, allow_pickle=True)
print(f'Saved {pc.shape[0]}-by-{pc.shape[1]} PC matrix')
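
To make the last two steps more concrete, here is a NumPy-only sketch of standardising features, clipping large values and projecting onto the leading principal components. It is not a reimplementation of the Scanpy functions used above: details such as the clipping behaviour and the solver may differ, and `scale_and_pca` is a hypothetical name.

```python
import numpy as np


def scale_and_pca(x, n_comps=2, max_value=10.0):
    """Standardise columns, clip large values, and project onto top PCs."""
    x = np.asarray(x, dtype=float)
    mean = x.mean(axis=0)
    std = x.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    z = np.clip((x - mean) / std, None, max_value)
    z = z - z.mean(axis=0)  # re-centre after clipping, since PCA assumes centred data
    # SVD of the centred matrix: rows of vt are the principal axes
    u, s, vt = np.linalg.svd(z, full_matrices=False)
    return z @ vt[:n_comps].T


rng = np.random.default_rng(0)
x = rng.normal(size=(200, 20))
pc_demo = scale_and_pca(x, n_comps=5)
print(pc_demo.shape)  # (200, 5)
```

The columns of the result are ordered by decreasing variance, which is why truncating to the first `n_comps` columns keeps the most variable directions.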

#### **3.** Extract annotation

For plotting and supervised evaluation of embeddings, we need a set of labels per cell. Using `colname`, indicate which column of the `obs` dataframe should be used for this.

Additionally, if there are any populations that are considered unknown/unlabelled, list them in `unassigned`.

In [None]:
colname = 'cell_type'
unassigned = []

labels = hd.obs[colname]
np.save(os.path.join(output_path, f'{dataset_name}_labels.npy'), labels, allow_pickle=True)
np.save(os.path.join(output_path, f'{dataset_name}_unassigned.npy'), unassigned, allow_pickle=True)
print(f'Saved {len(labels)}-label vector with {len(np.unique(labels))} unique labels')
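
When the saved files are read back later, the `unassigned` list can be turned into a boolean mask over cells. The sketch below uses toy labels and a temporary directory rather than the real files; the variable names are illustrative assumptions.

```python
import os
import tempfile
import numpy as np

labels_demo = np.array(['T cell', 'B cell', 'unknown', 'T cell'], dtype=object)
unassigned_demo = ['unknown']

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo_labels.npy')
    np.save(path, labels_demo, allow_pickle=True)
    # Object arrays are pickled, so loading also needs allow_pickle=True
    loaded = np.load(path, allow_pickle=True)

# True for cells whose label is not in the unassigned list
assigned = ~np.isin(loaded, unassigned_demo)
print(assigned)          # [ True  True False  True]
print(loaded[assigned])  # ['T cell' 'B cell' 'T cell']
```

With an empty `unassigned` list the mask is `True` everywhere, so the same code works for fully annotated datasets.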

#### **4.** Create *k*-NNG

A *k*-nearest-neighbour graph is pre-computed so that the input matrix can be denoised.
A pre-computed *k*-NNG can also be used by some DR methods (*e.g.* ivis, UMAP, DensMAP) that rely on the *k*-NN relations within the input point cloud.

In [None]:
k = 150

knn = vs.make_knn(x=pc, k=k, fname=os.path.join(output_path, f'{dataset_name}_knn.npy'), verbose=False)
print(f'Saved {k}-nearest-neighbour graph')
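
For intuition about what a *k*-NNG contains, here is a brute-force NumPy sketch that returns neighbour indices and distances for a handful of points. It is an illustration only: the output format of `vs.make_knn` may differ, `brute_force_knn` is a hypothetical name, and the quadratic memory use makes this approach unsuitable for real datasets.

```python
import numpy as np


def brute_force_knn(x, k):
    """Return (indices, distances) of the k nearest neighbours of each row."""
    x = np.asarray(x, dtype=float)
    # Pairwise squared Euclidean distances
    diff = x[:, None, :] - x[None, :, :]
    d2 = np.einsum('ijk,ijk->ij', diff, diff)
    np.fill_diagonal(d2, np.inf)  # exclude each point from its own neighbourhood
    idcs = np.argsort(d2, axis=1)[:, :k]
    dist = np.sqrt(np.take_along_axis(d2, idcs, axis=1))
    return idcs, dist


pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]])
idcs, dist = brute_force_knn(pts, k=2)
print(idcs[0])  # [1 2]
print(dist[0])  # [1. 3.]
```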

#### **5.** Create denoised input matrix

We also pre-compute the denoised input (PC) matrix, which ViVAE uses by default.
It can, in principle, be used by any embedding algorithm.

In [None]:
pc_d = vs.smooth(pc, knn, k=1000, coef=1., n_iter=1)
np.save(os.path.join(output_path, f'{dataset_name}_input_denoised.npy'), pc_d, allow_pickle=True)
print('Saved denoised PC matrix')
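
As a rough illustration of neighbour-based denoising in general, the sketch below shifts each point towards the mean of its nearest neighbours. This is an assumption about the general idea, not the implementation of `vs.smooth`; the function name, the neighbour-index format and the meaning of `coef` here are all hypothetical.

```python
import numpy as np


def smooth_towards_neighbours(x, knn_idcs, coef=1.0, n_iter=1):
    """Shift each point towards the mean of its neighbours.

    knn_idcs: (n_points, k) integer array of neighbour indices.
    coef=0 leaves points unchanged; coef=1 replaces each point by the neighbour mean.
    """
    x = np.asarray(x, dtype=float).copy()
    for _ in range(n_iter):
        nn_mean = x[knn_idcs].mean(axis=1)  # shape (n_points, n_dims)
        x = x + coef * (nn_mean - x)
    return x


pts = np.array([[0.0], [1.0], [2.0], [10.0]])
nbrs = np.array([[1, 2], [0, 2], [1, 3], [1, 2]])  # toy neighbour indices
print(smooth_towards_neighbours(pts, nbrs, coef=0.5).ravel())
```

In this toy example the outlying point at 10 is pulled strongly towards the others, which is the intended effect of this kind of smoothing.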

#### **6.** Create *k*-NNG on denoised data

If we want to provide a *k*-NNG to an algorithm that we run on denoised data, we should pass a *k*-NNG based on the denoised coordinates.
For completeness, we compute that as well.

In [None]:
knn = vs.make_knn(x=pc_d, k=k, fname=os.path.join(output_path, f'{dataset_name}_knn_denoised.npy'), verbose=False)
print(f'Saved denoised {k}-nearest-neighbour graph')

#### **7.** Remove the H5AD file

If everything went well and we don't need the H5AD data anymore, we can delete the original downloaded file.
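
One way to check that "everything went well" before deleting is to confirm that the expected output files exist and are non-empty. The helper below is a hypothetical sketch; pass it whichever file names your run was supposed to produce.

```python
import os


def missing_outputs(output_path, filenames):
    """Return the subset of `filenames` that is missing or empty in `output_path`."""
    missing = []
    for fname in filenames:
        path = os.path.join(output_path, fname)
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(fname)
    return missing


# Example: check two of the files written above before deleting the download
# missing_outputs('./data', ['Triana_input.npy', 'Triana_labels.npy'])
```

An empty list means all listed files are present, in which case removing the H5AD file is safe.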

In [None]:
%%bash
rm ./scrnaseq.h5ad