# Data handling in scvi-tools

In this tutorial we will cover how data is handled in scvi-tools. 

Sections:
1. Data Registration via `setup_anndata()` and `register_tensor_from_anndata()`
2. Data loading and `AnnDataLoader()`

In [1]:
import sys

# If branch is "stable", install from PyPI; otherwise install from source
branch = "stable"
IN_COLAB = "google.colab" in sys.modules

if IN_COLAB and branch == "stable":
    !pip install --quiet scvi-tools[tutorials]
elif IN_COLAB and branch != "stable":
    !pip install --quiet --upgrade jsonschema
    !pip install --quiet git+https://github.com/yoseflab/scvi-tools@$branch#egg=scvi-tools[tutorials]

In [2]:
import scvi
import numpy as np

## 1. Data Registration
Let's start by formatting an example AnnData object for use with scvi-tools. In the following code, we build on the `synthetic_iid()` dataset, copy `X` to a layer, and add continuous and categorical covariates to the AnnData.

In [3]:
adata = scvi.data.synthetic_iid(run_setup_anndata=False)
adata.layers['raw_counts'] = adata.X.copy()
adata.obs['my_categorical_covariate'] = ['A'] * 200 + ['B'] * 200
adata.obs['my_continuous_covariate'] = np.random.randint(0,100,400)
print(adata)

AnnData object with n_obs × n_vars = 400 × 100
    obs: 'batch', 'labels', 'my_categorical_covariate', 'my_continuous_covariate'
    uns: 'protein_names'
    obsm: 'protein_expression'
    layers: 'raw_counts'


Now we call `scvi.data.setup_anndata()` to register all the tensors we want to load into the model during training.

In [4]:
scvi.data.setup_anndata(adata, 
                        batch_key='batch', 
                        labels_key='labels', 
                        layer='raw_counts', 
                        protein_expression_obsm_key='protein_expression', 
                        protein_names_uns_key='protein_names', 
                        categorical_covariate_keys=['my_categorical_covariate'], 
                        continuous_covariate_keys=['my_continuous_covariate'],
                       )

INFO     Using batches from adata.obs["batch"]
INFO     Using labels from adata.obs["labels"]
INFO     Using data from adata.layers["raw_counts"]
INFO     Computing library size prior per batch
INFO     Using protein expression from adata.obsm['protein_expression']
INFO     Using protein names from adata.uns['protein_names']
INFO     Successfully registered anndata object containing 400 cells, 100 vars, 2 batches, 3
         labels, and 100 proteins. Also registered 1 extra categorical covariates and



We can view what was registered via the `scvi.data.view_anndata_setup()` command.

In [5]:
scvi.data.view_anndata_setup(adata)

If there are other tensors in the anndata you need to register, you can use the `scvi.data.register_tensor_from_anndata()` command. 

In the following code we add a new field to our AnnData under the key **extra_values**. We then register it with `register_tensor_from_anndata()`, passing four arguments: the AnnData object (`adata=adata`), the AnnData attribute that holds the field (`adata_attr_name='obs'`), the key of the field within that attribute (`adata_key_name="extra_values"`), and the key under which the data will be available when it is loaded by the dataloader (`registry_key='_extra_values'`).

In [6]:
key = 'extra_values'
adata.obs[key] = np.random.randint(0,10, 400)

scvi.data.register_tensor_from_anndata(adata=adata, 
                                       adata_attr_name='obs',
                                       adata_key_name=key,
                                       registry_key='_extra_values',
                                      )
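
To make the role of the three keys concrete, here is a minimal sketch of a registry as a plain dictionary. It is a toy model under stated assumptions, not the scvi-tools implementation: `register_field`, `fetch_field`, and `toy_adata` are hypothetical names invented for this illustration, and only the argument names mirror the call above.

```python
# Toy model of a data registry; NOT the scvi-tools implementation.
from types import SimpleNamespace


def register_field(registry, adata_attr_name, adata_key_name, registry_key):
    """Record where a field lives so it can later be looked up by registry_key."""
    registry[registry_key] = {
        "attr_name": adata_attr_name,
        "attr_key": adata_key_name,
    }


def fetch_field(container, registry, registry_key):
    """Resolve registry_key to the values stored on the container."""
    entry = registry[registry_key]
    return getattr(container, entry["attr_name"])[entry["attr_key"]]


# Hypothetical stand-in for an AnnData object with a single obs column.
toy_adata = SimpleNamespace(obs={"extra_values": [4, 8, 0, 5]})

registry = {}
register_field(registry, "obs", "extra_values", "_extra_values")
print(fetch_field(toy_adata, registry, "_extra_values"))  # [4, 8, 0, 5]
```

In this sketch the registry key is only a label: the data stays where it is, and the registry records how to find it.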

## 2. DataLoaders

AnnDataLoader is the base dataloader for scvi-tools. In this section we show how the data we registered in the previous section is loaded by AnnDataLoader.

First, we construct an AnnDataLoader and get the first batch. Then we will enumerate all the values in the batch.

In [7]:
from scvi.dataloaders._ann_dataloader import AnnDataLoader
from scvi import _CONSTANTS

# initialize an AnnDataLoader which will iterate over our anndata
adl = AnnDataLoader(adata, batch_size=10)

# get the first batch of data
data_batch = next(iter(adl))

The variable **data_batch** contains the first batch of data. It is a dictionary whose values are the tensors registered in the previous section via `setup_anndata()` and `register_tensor_from_anndata()`. 

For tensors set up with `setup_anndata()`, the keys come from `scvi._CONSTANTS`. For tensors set up with `register_tensor_from_anndata()`, the keys are the values passed to `registry_key`.

In [8]:
print('data_batch_keys:')
print(data_batch.keys())

data_batch_keys:
dict_keys(['X', 'batch_indices', 'local_l_mean', 'local_l_var', 'labels', 'protein_expression', 'cat_covs', 'cont_covs', '_extra_values'])
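
The dictionary-per-batch layout can be pictured with a small pure-Python sketch. This is an illustration under simplifying assumptions, not how `AnnDataLoader` is implemented: `iter_batches` and `toy_fields` are hypothetical names, and plain lists stand in for tensors.

```python
def iter_batches(fields, batch_size):
    """Yield dicts mapping each registry key to that batch's slice of rows."""
    n_rows = len(next(iter(fields.values())))
    for start in range(0, n_rows, batch_size):
        yield {
            key: values[start:start + batch_size]
            for key, values in fields.items()
        }


# Hypothetical registered fields, one entry per cell.
toy_fields = {
    "labels": [2, 2, 0, 0, 2, 0],
    "_extra_values": [4, 8, 0, 5, 6, 1],
}

first_batch = next(iter_batches(toy_fields, batch_size=4))
print(list(first_batch.keys()))      # ['labels', '_extra_values']
print(first_batch["_extra_values"])  # [4, 8, 0, 5]
```

Every key carries the same rows of the dataset, so the values for one cell sit at the same position in each entry of the batch.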


The values in data_batch that were registered via `setup_anndata()` can be accessed with the keys defined in `scvi._CONSTANTS`.

In [9]:
print(_CONSTANTS.X_KEY)                 # key for X values
print(_CONSTANTS.BATCH_KEY)             # key for batch info
print(_CONSTANTS.LOCAL_L_MEAN_KEY)      # key for mean of batch specific log library size
print(_CONSTANTS.LOCAL_L_VAR_KEY)       # key for variance of batch specific log library size
print(_CONSTANTS.LABELS_KEY)            # key for label data
print(_CONSTANTS.PROTEIN_EXP_KEY)       # key for protein data
print(_CONSTANTS.CAT_COVS_KEY)          # key for categorical covariate data
print(_CONSTANTS.CONT_COVS_KEY)         # key for continuous covariate data

X
batch_indices
local_l_mean
local_l_var
labels
protein_expression
cat_covs
cont_covs
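
As a rough illustration of what "mean and variance of batch specific log library size" refers to, the sketch below computes those two statistics per batch from a small count matrix. It is a minimal sketch, not the scvi-tools code: `library_size_stats` is a hypothetical helper, and the use of the natural log and the population variance are assumptions made for this example.

```python
import math


def library_size_stats(counts, batches):
    """Return {batch: (mean, variance)} of the log total counts per cell."""
    log_sizes = {}
    for row, batch in zip(counts, batches):
        # Library size of a cell = sum of its counts over all genes.
        log_sizes.setdefault(batch, []).append(math.log(sum(row)))

    stats = {}
    for batch, values in log_sizes.items():
        mean = sum(values) / len(values)
        # Population variance (divide by n); an assumption of this sketch.
        var = sum((v - mean) ** 2 for v in values) / len(values)
        stats[batch] = (mean, var)
    return stats


# Three toy cells with two genes each, split over two batches.
toy_counts = [[1, 1], [2, 2], [5, 5]]
toy_batches = ["a", "a", "b"]
print(library_size_stats(toy_counts, toy_batches))
```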


The labels in the first batch from the data loader correspond to the labels of the first 10 cells of our AnnData, because the loader above iterates over the cells in order.

In [10]:
adata.obs['labels'][:10]

0    label_2
1    label_2
2    label_0
3    label_0
4    label_2
5    label_0
6    label_2
7    label_0
8    label_2
9    label_1
Name: labels, dtype: category
Categories (3, object): ['label_0', 'label_1', 'label_2']

In [11]:
# setup_anndata automatically encoded the categorical labels as integers
data_batch[_CONSTANTS.LABELS_KEY] 

tensor([[2.],
        [2.],
        [0.],
        [0.],
        [2.],
        [0.],
        [2.],
        [0.],
        [2.],
        [1.]])
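
The mapping from label strings to the integer codes shown above can be reproduced with a short sketch. It assumes categories are numbered in sorted order, which matches the category list printed for `adata.obs['labels']`; `encode_categories` is a hypothetical helper, not an scvi-tools function.

```python
def encode_categories(values):
    """Map each distinct value to its index in the sorted list of categories."""
    categories = sorted(set(values))
    code_of = {category: code for code, category in enumerate(categories)}
    return categories, [code_of[value] for value in values]


# Labels of the first 10 cells, copied from the output above.
labels = [
    "label_2", "label_2", "label_0", "label_0", "label_2",
    "label_0", "label_2", "label_0", "label_2", "label_1",
]

categories, codes = encode_categories(labels)
print(categories)  # ['label_0', 'label_1', 'label_2']
print(codes)       # [2, 2, 0, 0, 2, 0, 2, 0, 2, 1]
```

The codes agree with the tensor values above, which hold the same integers stored as floats in a column of shape batch_size x 1.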

In [12]:
print(data_batch[_CONSTANTS.X_KEY].shape)  # shape is batch_size x n_genes
print(data_batch[_CONSTANTS.BATCH_KEY].shape)  # shape is batch_size x 1

torch.Size([10, 100])
torch.Size([10, 1])


For the tensor we registered via `register_tensor_from_anndata()`, the key is the value passed to the `registry_key` argument, which in our case was `_extra_values`.

In [13]:
adata.obs[:10]['extra_values']

0    4
1    8
2    0
3    5
4    6
5    1
6    2
7    7
8    6
9    9
Name: extra_values, dtype: int64

In [14]:
data_batch['_extra_values']

tensor([[4.],
        [8.],
        [0.],
        [5.],
        [6.],
        [1.],
        [2.],
        [7.],
        [6.],
        [9.]])