# Sparse matrices with h5py

*Published: [December 12, 2017](https://github.com/theislab/anndata_usage/tree/master/171212_sparse_matrices_with_h5py/README.ipynb?history=True). [Download](https://github.com/theislab/anndata_usage/tree/master/171212_sparse_matrices_with_h5py/README.ipynb?raw=True) or [view](https://github.com/theislab/anndata_usage/tree/master/171212_sparse_matrices_with_h5py/README.ipynb) executable source.*

<img src="http://www.h5py.org/cat.gif" style="height: 100px; margin: 5px 10px 5px 0px" align="right">

[HDF5](https://en.wikipedia.org/wiki/Hierarchical_Data_Format) is an established cross-platform, cross-language binary data format that allows fast and partial loading of data from disk into memory. [h5py](http://www.h5py.org/) is the established Python API for interacting with HDF5 files (the image shows the book by its lead author). However, neither h5py nor [PyTables](http://www.pytables.org/), the other popular Python interface to HDF5, provides support for sparse matrices.

Within the single-cell genomics community, workarounds have been proposed: earlier this year, [10X Genomics](https://www.10xgenomics.com/) and [Scanpy](https://scanpy.readthedocs.io) adapted the [CSR/CSC/Yale](https://en.wikipedia.org/wiki/Sparse_matrix#Compressed_sparse_row_(CSR,_CRS_or_Yale_format)) format for HDF5 storage in a static way; later, [loompy](http://loompy.org/) suggested dynamically loading matrices using the [COO](https://en.wikipedia.org/wiki/Sparse_matrix#Coordinate_list_(COO)) format. However, no conventions for how different APIs would recognize these objects have been established.

Here, I suggest adopting the [h5sparse](https://github.com/appier/h5sparse) convention proposed by [Appier, Inc.](https://www.appier.com/) earlier this year. The idea is to mark conventional HDF5 groups using two attributes:

- the storage format of the sparse matrix: `h5sparse_format`
- the shape of the sparse matrix: `h5sparse_shape`

These are sufficient to recognize a group as a sparse matrix of a specific format and shape. Here, we introduce [`anndata.h5py`](http://anndata.readthedocs.io/en/latest/anndata.h5py.html), a thin layer on top of h5py that offers all the functionality of h5py and efficiently handles HDF5 files that store both dense and sparse data in different formats. The original [h5sparse](https://pypi.python.org/pypi/h5sparse/0.0.4), by contrast, follows different design principles: it only handles HDF5 files that contain nothing but sparse data, and it offers limited indexing options.
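
To make the convention concrete, here is a minimal sketch that writes and recognizes such a group with the upstream `h5py` package directly. It is not how `anndata.h5py` is implemented; the helper names `write_csr`, `is_sparse_group` and `read_csr` are hypothetical, and the file lives in memory only.

```python
import h5py  # upstream h5py, not the anndata.h5py wrapper
import numpy as np
from scipy.sparse import csr_matrix


def write_csr(parent, name, matrix):
    """Store a CSR matrix as an HDF5 group tagged with the two attributes."""
    group = parent.create_group(name)
    group.attrs["h5sparse_format"] = "csr"
    group.attrs["h5sparse_shape"] = matrix.shape
    for key in ("data", "indices", "indptr"):
        group.create_dataset(key, data=getattr(matrix, key))
    return group


def is_sparse_group(obj):
    """A group counts as a sparse matrix if it carries both attributes."""
    return (
        isinstance(obj, h5py.Group)
        and "h5sparse_format" in obj.attrs
        and "h5sparse_shape" in obj.attrs
    )


def read_csr(group):
    """Rebuild the in-memory matrix from the three component datasets."""
    shape = tuple(int(n) for n in group.attrs["h5sparse_shape"])
    parts = (group["data"][:], group["indices"][:], group["indptr"][:])
    return csr_matrix(parts, shape=shape)


dense = np.array([[0, 1, 0], [0, 0, 2], [0, 0, 0], [3, 4, 0]])

# The "core" driver with backing_store=False keeps the file in memory only.
with h5py.File("sketch.h5", "w", driver="core", backing_store=False) as f:
    write_csr(f, "X", csr_matrix(dense))
    f.create_dataset("X_array", data=dense)
    # Dense datasets and sparse groups can live side by side in one file.
    sparse_flags = {name: is_sparse_group(f[name]) for name in f}
    restored = read_csr(f["X"]).toarray()
```

Because the marker is just a pair of attributes on an ordinary group, any HDF5 reader that does not know the convention still sees three plain datasets.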

In [7]:
import numpy as np
from scipy.sparse import csr_matrix, csc_matrix
from anndata import h5py, logging

## Indexing data on disk as in memory

In [3]:
X_array = np.array(
    [[0, 1, 0],
     [0, 0, 2],
     [0, 0, 0],
     [3, 4, 0]])
# make this a sparse matrix
X = csr_matrix(X_array)

In [4]:
X

<4x3 sparse matrix of type '<class 'numpy.int64'>'
	with 4 stored elements in Compressed Sparse Row format>

Show its entries.

In [5]:
print(X)

  (0, 1)	1
  (1, 2)	2
  (3, 0)	3
  (3, 1)	4


In [9]:
X[:, 1:3].toarray()

array([[1, 0],
       [0, 2],
       [0, 0],
       [4, 0]])

Now, let's open a file and create an [`h5py.SparseDataset`](http://anndata.readthedocs.io/en/latest/anndata.h5py.SparseDataset.html#anndata.h5py.SparseDataset).

In [8]:
f = h5py.File('./test.h5')

In [9]:
f.create_dataset('X', data=X)

<HDF5 sparse dataset: format 'csr', shape (4, 3), type '<i8'>

In [10]:
f['X'][:, 1:3].toarray()

array([[1, 0],
       [0, 2],
       [0, 0],
       [4, 0]])

In [11]:
f['X'][np.array([True, False, False, True])].toarray()

array([[0, 1, 0],
       [3, 4, 0]], dtype=int64)

Check that we can do the same with the dense array.

In [12]:
f.create_dataset('X_array', data=X_array)

<HDF5 dataset "X_array": shape (4, 3), type "<i8">

In [13]:
f.close()

Looking into the file reveals the following structure.

In [14]:
!h5ls -r './test.h5'

/                        Group
/X                       Group
/X/data                  Dataset {4}
/X/indices               Dataset {4}
/X/indptr                Dataset {5}
/X_array                 Dataset {4, 3}
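
The three datasets under `/X` are the standard CSR components. As an illustration in plain Python (the helper `dense_to_csr` is hypothetical and only meant to show the layout), the example matrix compresses as follows, which matches the dataset lengths 4, 4 and 5 listed above:

```python
def dense_to_csr(rows):
    """Compress a list of rows into the CSR triple (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)    # the nonzero value itself
                indices.append(col)   # its column index
        indptr.append(len(data))      # where the next row starts in `data`
    return data, indices, indptr


X_rows = [[0, 1, 0],
          [0, 0, 2],
          [0, 0, 0],
          [3, 4, 0]]

data, indices, indptr = dense_to_csr(X_rows)
print(data)     # [1, 2, 3, 4]
print(indices)  # [1, 2, 0, 1]
print(indptr)   # [0, 1, 2, 2, 4]
```

`indptr` has one entry more than the matrix has rows; the repeated `2` encodes the empty third row.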


The group that stores the sparse matrix is marked by the two attributes `h5sparse_format` and `h5sparse_shape`:

In [15]:
!h5ls -v './test.h5'

Opened "./test.h5" with sec2 driver.
X                        Group
    Attribute: h5sparse_format scalar
        Type:      variable-length null-terminated UTF-8 string
        Data:  "csr"
    Attribute: h5sparse_shape {2}
        Type:      native long
        Data:  4, 3
    Location:  1:800
    Links:     1
X_array                  Dataset {4/4, 3/3}
    Location:  1:7384
    Links:     1
    Storage:   96 logical bytes, 96 allocated bytes, 100.00% utilization
    Type:      native long


## Memory profiling

Let us perform some memory profiling to see whether we really gained something.

In [12]:
logging.print_memory_usage()

Memory usage: current 0.07 GB, difference +0.07 GB


In [13]:
X = csr_matrix(np.ones((10000, 10000)))

This is a really boring, really large matrix filled with $10^8$ ones.

In [14]:
X

<10000x10000 sparse matrix of type '<class 'numpy.float64'>'
	with 100000000 stored elements in Compressed Sparse Row format>

It takes about 1.12 GB in memory.

In [15]:
logging.print_memory_usage()

Memory usage: current 1.19 GB, difference +1.12 GB
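
The 1.12 GB figure is consistent with a back-of-the-envelope count of the three CSR arrays. The calculation below assumes 8-byte `float64` values, 4-byte `int32` indices (what scipy uses for a matrix of this size), and that the reported "GB" are units of $2^{30}$ bytes; these are assumptions, not something the log output states.

```python
n_rows = n_cols = 10_000
nnz = n_rows * n_cols                 # every entry is a stored one

data_bytes = nnz * 8                  # float64 values
indices_bytes = nnz * 4               # int32 column indices
indptr_bytes = (n_rows + 1) * 4       # int32 row pointers

csr_gib = (data_bytes + indices_bytes + indptr_bytes) / 2**30
dense_gib = nnz * 8 / 2**30           # the same matrix as a dense array

print(round(csr_gib, 2))    # 1.12
print(round(dense_gib, 2))  # 0.75
```

For a matrix without any zeros, CSR is therefore larger than the dense array; the example is about on-disk access, not about compression.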


In [16]:
f = h5py.File('./test1.h5')

In [17]:
f.create_dataset('X', data=X)

<HDF5 sparse dataset: format 'csr', shape (10000, 10000), type '<f8'>

In [18]:
logging.print_memory_usage()

Memory usage: current 1.19 GB, difference +0.00 GB


In [39]:
f.close()

To make sure the memory is actually freed, restart the notebook and start over.

In [4]:
f = h5py.File('./test1.h5')

In [5]:
logging.print_memory_usage()

Memory usage: current 0.07 GB, difference +0.00 GB


In [6]:
f['X'][1:5]

<4x10000 sparse matrix of type '<class 'numpy.float64'>'
	with 40000 stored elements in Compressed Sparse Row format>

In [7]:
logging.print_memory_usage()

Memory usage: current 0.07 GB, difference +0.00 GB


In [8]:
f['X'][[1, 5, 10, 13]]

<4x10000 sparse matrix of type '<class 'numpy.float64'>'
	with 40000 stored elements in Compressed Sparse Row format>

In [9]:
logging.print_memory_usage()

Memory usage: current 0.07 GB, difference +0.00 GB
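
Row slices are cheap because, in CSR, the entries of consecutive rows are contiguous in `data` and `indices`, and `indptr` says where they start and stop. The sketch below shows the idea on plain Python lists; the function `csr_row_slice` is hypothetical and not the `anndata.h5py` implementation. An HDF5 dataset accepts the same `lo:hi` slice, so only that range would have to be read from disk.

```python
def csr_row_slice(data, indices, indptr, start, stop):
    """Return the CSR components of rows start..stop-1."""
    lo, hi = indptr[start], indptr[stop]
    # Shift the row pointers so that the slice starts at offset 0.
    sub_indptr = [p - lo for p in indptr[start:stop + 1]]
    return data[lo:hi], indices[lo:hi], sub_indptr


# CSR components of the 4x3 example matrix from the first section.
data = [1, 2, 3, 4]
indices = [1, 2, 0, 1]
indptr = [0, 1, 2, 2, 4]

sub_data, sub_indices, sub_indptr = csr_row_slice(data, indices, indptr, 2, 4)
print(sub_data)     # [3, 4]
print(sub_indices)  # [0, 1]
print(sub_indptr)   # [0, 0, 2]
```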


Only when loading the full object into memory do we again observe a 1.12 GB increase.

In [10]:
f['X'][:]

<10000x10000 sparse matrix of type '<class 'numpy.float64'>'
	with 100000000 stored elements in Compressed Sparse Row format>

In [11]:
logging.print_memory_usage()

Memory usage: current 1.19 GB, difference +1.12 GB


### In the context of an AnnData object

Initialize an AnnData object in memory mode, then set its data matrix to a sparse matrix.

In [3]:
from anndata import AnnData  # not imported in the cells shown above
adata = AnnData(X_array)

In [4]:
adata.X

array([[ 0.,  1.,  0.],
       [ 0.,  0.,  2.],
       [ 0.,  0.,  0.],
       [ 3.,  4.,  0.]], dtype=float32)

In [5]:
adata.X = csr_matrix(X_array)

In [6]:
adata.X

<4x3 sparse matrix of type '<class 'numpy.int64'>'
	with 4 stored elements in Compressed Sparse Row format>

Changing to "backed" mode by setting a backing filename.

In [7]:
adata.filename = './test3.h5ad'

In [8]:
adata.X

<HDF5 sparse dataset: format 'csr', shape (4, 3), type '<i8'>

In [9]:
adata.X[0:2].toarray()

array([[0, 1, 0],
       [0, 0, 2]])

Changing the format of the matrix in the file from CSR to CSC.

In [10]:
adata.X = csc_matrix(X_array)

In [11]:
adata.X

<HDF5 sparse dataset: format 'csc', shape (4, 3), type '<i8'>

In [12]:
adata.X[0:2].toarray()

array([[0, 1, 0],
       [0, 0, 2]])
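
CSC stores the same nonzero entries column by column instead of row by row. A small in-memory comparison with scipy (no file involved, so this says nothing about how the backed dataset is rewritten) shows how the three component arrays differ between the two formats:

```python
import numpy as np
from scipy.sparse import csr_matrix, csc_matrix

X_array = np.array(
    [[0, 1, 0],
     [0, 0, 2],
     [0, 0, 0],
     [3, 4, 0]])

X_csr = csr_matrix(X_array)
X_csc = csc_matrix(X_array)

# CSR: values in row order, `indices` holds column numbers, one pointer per row.
print(X_csr.data.tolist(), X_csr.indices.tolist(), X_csr.indptr.tolist())
# [1, 2, 3, 4] [1, 2, 0, 1] [0, 1, 2, 2, 4]

# CSC: values in column order, `indices` holds row numbers, one pointer per column.
print(X_csc.data.tolist(), X_csc.indices.tolist(), X_csc.indptr.tolist())
# [3, 1, 4, 2] [3, 0, 3, 1] [0, 1, 3, 4]
```

The practical consequence is that contiguous row ranges are the natural access pattern for CSR and contiguous column ranges for CSC.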

In [13]:
!h5ls -rv './test3.h5ad'

Opened "./test3.h5ad" with sec2 driver.
/                        Group
    Location:  1:96
    Links:     1
/X                       Group
    Attribute: h5sparse_format scalar
        Type:      variable-length null-terminated UTF-8 string
        Data:  "csr"
    Attribute: h5sparse_shape {2}
        Type:      native long
        Data:  4, 3
    Location:  1:20456
    Links:     1
/X/data                  Dataset {4/4}
    Location:  1:21376
    Links:     1
    Chunks:    {4} 32 bytes
    Storage:   32 logical bytes, 19 allocated bytes, 168.42% utilization
    Filter-0:  deflate-1 OPT {4}
    Type:      native long
/X/indices               Dataset {4/4}
    Location:  1:21976
    Links:     1
    Chunks:    {4} 16 bytes
    Storage:   16 logical bytes, 18 allocated bytes, 88.89% utilization
    Filter-0:  deflate-1 OPT {4}
    Type:      native int
/X/indptr                Dataset {5/5}
    Location:  1:32840
    Links:     1
    Chunks:    {5} 20 byte