# Getting started with anndata

**Authors:** [Adam Gayoso](https://twitter.com/adamgayoso), [Alex Wolf](https://twitter.com/falexwolf)

In this tutorial, we introduce basic properties of the central object, [AnnData](http://anndata.readthedocs.io/en/latest/anndata.AnnData.html) ("Annotated Data").

`AnnData` is specifically designed for tabular data. By this we mean that we have $n$ observations, each of which can be represented as a $d$-dimensional vector, where each dimension corresponds to a variable or feature. Both the rows and columns of this $n \times d$ matrix are special in the sense that they are indexed.

For instance, in scRNA-seq data, each row corresponds to a cell with a barcode, and each column corresponds to a gene with a gene id. Furthermore, for each cell and each gene we might have additional metadata, like (1) donor information for each cell, or (2) alternative gene symbols for each gene. Finally, we might have other unstructured metadata like color palettes to use for plotting. Without surveying every Python-based data structure, we think that, even today, no real alternative exists that:

* Handles sparsity
* Handles unstructured data
* Handles observation- and feature-level metadata
* Is user-friendly

<div class="alert alert-info">

**Note**
    
* Download the notebook by clicking on the _Edit on GitHub_ button. On GitHub, you can download using the _Raw_ button via right-click and _Save Link As_. Alternatively, download the whole [anndata-tutorial](https://github.com/theislab/anndata-tutorials) repository.
* In Jupyter notebooks and lab, you can see the documentation for a Python function by hitting ``SHIFT + TAB``. Hit it twice to expand the view.
* This tutorial is heavily based on a blog [post by Adam in 2021](https://adamgayoso.com/posts/ten_min_to_adata/). It started out as a [blog post by Alex in 2017](https://falexwolf.me/2017/introducing-anndata/).

</div>

In [1]:
import numpy as np
import pandas as pd
import anndata as ad
print(ad.__version__)

0.7.5.dev7+gefffdfb


Let's look at a data matrix with `n_obs` observations (samples) of `n_vars` variables (features).

In [2]:
n_obs, n_vars = 10000, 10
X = np.random.random((n_obs, n_vars))

<div class="alert alert-info">

**Note**

* The convention is that observations of sets of variables are stored in the rows of a data matrix $\mathbf{X}$. This is the convention of the statistics & machine learning textbooks ([Hastie et al., 2009](https://web.stanford.edu/~hastie/ElemStatLearn/); [Murphy,  2012](https://mitpress.mit.edu/books/machine-learning-0)), of _tidy data_ [(Wickham, 2014)](https://doi.org/10.18637/jss.v059.i10), and of established statistics and machine learning packages in Python ([statsmodels](http://www.statsmodels.org/stable/index.html), [scikit-learn](http://scikit-learn.org/)).

* In software for transcriptomic data in R, the opposite convention is used: the observed expression of a gene across a set of samples is stored in rows of a matrix.
    
* In some tools that convert to _tidy data_ format, data is cast to a long table format that spreads samples across many rows storing multiple variables in one column, for instance, in [tidySummarizedExperiment](https://doi.org/doi:10.18129/B9.bioc.tidySummarizedExperiment). Hence, the original _tidy data_ definition by Wickham does not seem to be adopted consistently.

</div>

Let's make this a dataframe and name the variables:

In [3]:
df = pd.DataFrame(X, columns=list('ABCDEFGHIJ'), index=np.arange(n_obs, dtype=int).astype(str))

In [4]:
df.head()

| | A | B | C | D | E | F | G | H | I | J |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.253349 | 0.684447 | 0.669161 | 0.309679 | 0.014394 | 0.277615 | 0.505934 | 0.032573 | 0.43461 | 0.601414 |
| 1 | 0.950834 | 0.626623 | 0.68161 | 0.416638 | 0.134257 | 0.685353 | 0.978178 | 0.863056 | 0.265252 | 0.738342 |
| 2 | 0.53376 | 0.622051 | 0.568626 | 0.878087 | 0.374846 | 0.677614 | 0.594641 | 0.366929 | 0.873765 | 0.592978 |
| 3 | 0.113632 | 0.695772 | 0.876843 | 0.80707 | 0.111572 | 0.920572 | 0.599671 | 0.527112 | 0.03271 | 0.91449 |
| 4 | 0.45918 | 0.638794 | 0.30731 | 0.672624 | 0.682199 | 0.813126 | 0.976467 | 0.616834 | 0.116605 | 0.374751 |


Let's imagine that the observations come from instruments characterizing 10 readouts in a multi-year study with samples taken from different subjects at different sites. We'd typically get that information in some format and then store it in another DataFrame:

In [5]:
obs_meta = pd.DataFrame({
        'time_yr': np.random.choice([0, 2, 4, 8], n_obs),
        'subject_id': np.random.choice(['subject 1', 'subject 2', 'subject 4', 'subject 8'], n_obs),
        'instrument_type': np.random.choice(['type a', 'type b'], n_obs),
        'site': np.random.choice(['site x', 'site y'], n_obs),
    },
    index=np.arange(n_obs, dtype=int).astype(str),    # these are the same IDs of observations as above!
)

We'd like to understand the readout data in light of the metadata, which will require training models that take into account the metadata.

Let us join the readout data with the metadata.

In [6]:
adata = ad.AnnData(df, obs=obs_meta)

We now have a single data container that keeps track of everything:

In [7]:
print(adata)

AnnData object with n_obs × n_vars = 10000 × 10
    obs: 'time_yr', 'subject_id', 'instrument_type', 'site'


If we want DataFrames back, we'd do:

In [8]:
adata.to_df().head()

| | A | B | C | D | E | F | G | H | I | J |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.253349 | 0.684447 | 0.669161 | 0.309679 | 0.014394 | 0.277615 | 0.505934 | 0.032573 | 0.43461 | 0.601414 |
| 1 | 0.950834 | 0.626623 | 0.68161 | 0.416638 | 0.134257 | 0.685353 | 0.978178 | 0.863056 | 0.265252 | 0.738342 |
| 2 | 0.53376 | 0.622051 | 0.568626 | 0.878087 | 0.374846 | 0.677614 | 0.594642 | 0.36693 | 0.873765 | 0.592978 |
| 3 | 0.113632 | 0.695772 | 0.876843 | 0.80707 | 0.111572 | 0.920572 | 0.599671 | 0.527112 | 0.03271 | 0.91449 |
| 4 | 0.45918 | 0.638794 | 0.30731 | 0.672624 | 0.682199 | 0.813126 | 0.976467 | 0.616834 | 0.116605 | 0.374751 |


In [9]:
adata.obs.head()

| | time_yr | subject_id | instrument_type | site |
|---|---|---|---|---|
| 0 | 8 | subject 2 | type a | site y |
| 1 | 4 | subject 1 | type a | site x |
| 2 | 2 | subject 8 | type a | site y |
| 3 | 0 | subject 1 | type b | site x |
| 4 | 8 | subject 1 | type a | site y |


## Subsetting

One of the most important operations on the joint data matrix is subsetting, for instance to focus on subsets of variables or observations, or to define train-test splits for a machine learning model.

<div class="alert alert-info">

**Note**

Similar to numpy arrays, AnnData objects can either hold actual data or reference another `AnnData` object. In the latter case, they are referred to as a "view", as in numpy.

Subsetting AnnData objects always returns views, which has two advantages:

- no new memory is allocated
- it is possible to modify the underlying AnnData object

You can get an actual AnnData object from a view by calling `.copy()` on the view. Usually, this is not necessary, as any modification of elements of a view (calling `.[]` on an attribute of the view) internally calls `.copy()` and makes the view an AnnData object that holds actual data. See the example below.
    
</div>

In [10]:
adata

AnnData object with n_obs × n_vars = 10000 × 10
    obs: 'time_yr', 'subject_id', 'instrument_type', 'site'

Get access to the first 5 rows for variables `'A'` and `'C'`.

<div class="alert alert-info">

**Note**

Indexing into AnnData will assume that integer arguments to `[]` behave like `.iloc` in pandas, whereas string arguments behave like `.loc`. `AnnData` always assumes string indices.
    
</div> 

In [11]:
adata[:5, ['A', 'C']]

View of AnnData object with n_obs × n_vars = 5 × 2
    obs: 'time_yr', 'subject_id', 'instrument_type', 'site'

This is a view! If we want an `AnnData` that holds the data in memory, we call `.copy()`:

In [12]:
adata_subset = adata[:5, ['A', 'C']].copy()

Through a view, we can also set the first 3 elements of a column in the underlying object:

In [13]:
print(adata[:3, 'A'].X.tolist())
adata[:3, 'A'].X = [0, 0, 0]
print(adata[:3, 'A'].X.tolist())

[[0.2533490061759949], [0.9508337378501892], [0.5337598323822021]]
[[0.0], [0.0], [0.0]]


A convenient design choice is the following: if you try to modify parts of a view of an AnnData, the content is automatically copied and the view becomes an object that stores its own data.

In [14]:
adata_subset = adata[:3, ['A', 'B']]

In [15]:
adata_subset

View of AnnData object with n_obs × n_vars = 3 × 2
    obs: 'time_yr', 'subject_id', 'instrument_type', 'site'

In [16]:
adata_subset.obs['foo'] = range(3)

Trying to set attribute `.obs` of view, copying.


Now `adata_subset` stores the actual data and is no longer just a reference to `adata`.

In [17]:
adata_subset

AnnData object with n_obs × n_vars = 3 × 2
    obs: 'time_yr', 'subject_id', 'instrument_type', 'site', 'foo'

You can also slice with sequences or boolean indices, just as in pandas.

In [18]:
adata[adata.obs.time_yr.isin([2, 4])].obs.head()

| | time_yr | subject_id | instrument_type | site |
|---|---|---|---|---|
| 1 | 4 | subject 1 | type a | site x |
| 2 | 2 | subject 8 | type a | site y |
| 6 | 2 | subject 8 | type a | site y |
| 7 | 4 | subject 4 | type a | site y |
| 8 | 2 | subject 8 | type b | site x |


## Writing the results to disk

`AnnData` comes with its own persistent HDF5-based file format: `h5ad`. If string columns with a small number of categories aren't yet categoricals, `AnnData` will auto-transform them to categoricals.

In [19]:
adata.write('my_results.h5ad')

... storing 'subject_id' as categorical
... storing 'instrument_type' as categorical
... storing 'site' as categorical


In [20]:
!h5ls 'my_results.h5ad'

X                        Dataset {10000, 10}
obs                      Group
var                      Group


## Partial reading of large data

If a single `.h5ad` is very large, you can partially read it into memory by using backed mode:

In [21]:
adata = ad.read('my_results.h5ad', backed='r')

In [22]:
adata.isbacked

True

If you do this, you'll need to remember that the `AnnData` object has an open connection to the file used for reading:

In [23]:
adata.filename

PosixPath('my_results.h5ad')

As we're using it in read-only mode, we can't damage anything. To proceed with this tutorial, we still need to explicitly close it: 

In [24]:
adata.file.close()

## Manipulating the object on disk

You just saw that if you index AnnData, you get a view on elements of these data containers that essentially behaves the same as the containers themselves, but doesn't take additional memory.

You can do something similar when backing an AnnData object with a file: AnnData then acts as a view on this file and still behaves essentially the same.

In [25]:
adata = ad.read('my_results.h5ad', backed='r+')

Check that the values we set to zero earlier were stored in the backing file:

In [26]:
print(adata[:3, 'A'].X)

[[0.]
 [0.]
 [0.]]


In [27]:
adata

AnnData object with n_obs × n_vars = 10000 × 10 backed at 'my_results.h5ad'
    obs: 'time_yr', 'subject_id', 'instrument_type', 'site'

Because subsetting returns a view, you can edit the backed object through the `[]` access as if it were in memory:

In [28]:
adata[:3, 'A'].X = [1, 1, 1]

In [29]:
print(adata[:3, 'A'].X)

[[1.]
 [1.]
 [1.]]


To save changes to the file:

In [30]:
adata.file.close()