# Sparse matrices with h5py

*Published: December 12, 2017.*

In [1]:
import numpy as np
from anndata import AnnData, h5py, logging
from scipy.sparse import csr_matrix, csc_matrix

#### Slicing for simple data

In [2]:
X_array = np.array(
    [[0, 1, 0],
     [0, 0, 2],
     [0, 0, 0],
     [3, 4, 0]])

In [3]:
X = csr_matrix(X_array)

In [4]:
f = h5py.File('./test.h5')

In [5]:
f.create_dataset('X', data=X)

<HDF5 sparse dataset: format 'csr', shape (4, 3), type '<i8'>

In [6]:
f['X'][:, 1:3].toarray()

array([[1, 0],
       [0, 2],
       [0, 0],
       [4, 0]])

In [7]:
f['X'][np.array([True, False, False, True])].toarray()

array([[0, 1, 0],
       [3, 4, 0]], dtype=int64)

In [8]:
f.close()

Looking into the file reveals the following structure.

In [9]:
!h5ls -r './test.h5'

/                        Group
/X                       Group
/X/data                  Dataset {4}
/X/indices               Dataset {4}
/X/indptr                Dataset {5}
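
The three datasets in the listing have the same names and lengths as the three arrays scipy uses for a CSR matrix. As a minimal sketch using only scipy (no file involved), the arrays for the example matrix can be inspected like this:

```python
import numpy as np
from scipy.sparse import csr_matrix

X_array = np.array(
    [[0, 1, 0],
     [0, 0, 2],
     [0, 0, 0],
     [3, 4, 0]])
X = csr_matrix(X_array)

# Non-zero values, stored row by row.
data = X.data.tolist()        # [1, 2, 3, 4]
# Column index of each stored value.
indices = X.indices.tolist()  # [1, 2, 0, 1]
# Row i owns the stored values indptr[i]:indptr[i + 1].
indptr = X.indptr.tolist()    # [0, 1, 2, 2, 4]
```

The lengths 4, 4 and 5 match the `Dataset {4}`, `Dataset {4}` and `Dataset {5}` entries in the listing.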


The group that stores the sparse matrix is marked by the two attributes `h5sparse_format` and `h5sparse_shape`:

In [10]:
!h5ls -v './test.h5'

Opened "./test.h5" with sec2 driver.
X                        Group
    Attribute: h5sparse_format scalar
        Type:      variable-length null-terminated UTF-8 string
        Data:  "csr"
    Attribute: h5sparse_shape {2}
        Type:      native long
        Data:  4, 3
    Location:  1:800
    Links:     1
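
To illustrate this layout, here is a sketch that writes and reads such a group with plain `h5py` (not the wrapper imported from `anndata` above). The file name `sketch.h5` is hypothetical, and the code only mimics the structure reported by `h5ls`; it is not necessarily how the wrapper is implemented.

```python
import os
import tempfile

import h5py  # plain h5py, assumed to be installed
import numpy as np
from scipy.sparse import csr_matrix

X = csr_matrix(np.array([[0, 1, 0], [0, 0, 2], [0, 0, 0], [3, 4, 0]]))
path = os.path.join(tempfile.mkdtemp(), 'sketch.h5')  # hypothetical file name

# Write the group by hand: two attributes plus the three CSR arrays.
with h5py.File(path, 'w') as f:
    g = f.create_group('X')
    g.attrs['h5sparse_format'] = 'csr'
    g.attrs['h5sparse_shape'] = X.shape
    g.create_dataset('data', data=X.data)
    g.create_dataset('indices', data=X.indices)
    g.create_dataset('indptr', data=X.indptr)

# Read it back and rebuild the scipy matrix.
with h5py.File(path, 'r') as f:
    g = f['X']
    fmt = g.attrs['h5sparse_format']
    if isinstance(fmt, bytes):  # some h5py versions return bytes
        fmt = fmt.decode()
    shape = tuple(int(n) for n in g.attrs['h5sparse_shape'])
    X_restored = csr_matrix(
        (g['data'][:], g['indices'][:], g['indptr'][:]), shape=shape)
```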


#### Memory profiling

In [11]:
import gc

Let us perform some memory profiling to check whether slicing the on-disk dataset really avoids loading the full matrix into memory.

In [12]:
logging.print_memory_usage()

Memory usage: current 0.07 GB, difference +0.07 GB


In [13]:
X = csr_matrix(np.ones((10000, 10000)))

This is a very boring, large matrix with $10^8$ entries. 

In [14]:
X

<10000x10000 sparse matrix of type '<class 'numpy.float64'>'
	with 100000000 stored elements in Compressed Sparse Row format>

It takes about 1.12 GB in memory.
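
As a rough check of this figure, the arithmetic can be sketched as follows. It assumes 8-byte `float64` values, 4-byte `int32` indices (what scipy typically chooses for a matrix of this size), and that the reported "GB" are units of $2^{30}$ bytes; none of these is stated in the notebook.

```python
n_rows = n_cols = 10_000
nnz = n_rows * n_cols              # every entry is stored, 10**8 in total

data_bytes = nnz * 8               # float64 values
indices_bytes = nnz * 4            # int32 column indices (assumed)
indptr_bytes = (n_rows + 1) * 4    # int32 row pointers (assumed)

total_bytes = data_bytes + indices_bytes + indptr_bytes
total_gib = total_bytes / 2**30    # about 1.12
dense_gib = nnz * 8 / 2**30        # about 0.75 for the dense array
```

Because no entry is zero, the sparse representation here is larger than the dense array would be; the matrix only serves as a large test object.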

In [15]:
logging.print_memory_usage()

Memory usage: current 1.19 GB, difference +1.12 GB


In [16]:
f = h5py.File('./test1.h5')

In [17]:
f.create_dataset('X', data=X)

<HDF5 sparse dataset: format 'csr', shape (10000, 10000), type '<f8'>

In [18]:
logging.print_memory_usage()

Memory usage: current 1.19 GB, difference +0.00 GB


In [39]:
f.close()

To make sure the memory is actually freed, we restart the notebook and start over.

In [4]:
f = h5py.File('./test1.h5')

In [5]:
logging.print_memory_usage()

Memory usage: current 0.07 GB, difference +0.00 GB


In [6]:
f['X'][1:5]

<4x10000 sparse matrix of type '<class 'numpy.float64'>'
	with 40000 stored elements in Compressed Sparse Row format>

In [7]:
logging.print_memory_usage()

Memory usage: current 0.07 GB, difference +0.00 GB


In [8]:
f['X'][[1, 5, 10, 13]]

<4x10000 sparse matrix of type '<class 'numpy.float64'>'
	with 40000 stored elements in Compressed Sparse Row format>

In [9]:
logging.print_memory_usage()

Memory usage: current 0.07 GB, difference +0.00 GB
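
One way to see why a row slice can be cheap: in CSR, `indptr` says which contiguous part of `data` and `indices` belongs to the requested rows, so only that part has to be read. The helper `read_row_slice` below is hypothetical and operates on in-memory arrays standing in for the on-disk datasets; it sketches the idea and is not the implementation used by the wrapper.

```python
import numpy as np
from scipy.sparse import csr_matrix


def read_row_slice(data, indices, indptr, start, stop, n_cols):
    """Hypothetical helper: build rows [start, stop) of a CSR matrix."""
    # Only the stored values between these two offsets are needed.
    lo, hi = indptr[start], indptr[stop]
    # Shift the row pointers so that the slice starts at offset 0.
    sub_indptr = np.asarray(indptr[start:stop + 1]) - lo
    return csr_matrix(
        (data[lo:hi], indices[lo:hi], sub_indptr),
        shape=(stop - start, n_cols))


X = csr_matrix(np.array([[0, 1, 0], [0, 0, 2], [0, 0, 0], [3, 4, 0]]))
rows = read_row_slice(X.data, X.indices, X.indptr, 1, 4, X.shape[1])
```

Here `rows` holds the last three rows of the small example matrix, built from a slice of three stored values instead of the full arrays.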


Only when loading the full object into memory do we again observe a 1.12 GB increase.

In [10]:
f['X'][:]

<10000x10000 sparse matrix of type '<class 'numpy.float64'>'
	with 100000000 stored elements in Compressed Sparse Row format>

In [11]:
logging.print_memory_usage()

Memory usage: current 1.19 GB, difference +1.12 GB


### In the context of an AnnData object

Initializing an AnnData object in memory mode and then assigning a sparse matrix to it.

In [3]:
adata = AnnData(X_array)

In [4]:
adata.X

array([[ 0.,  1.,  0.],
       [ 0.,  0.,  2.],
       [ 0.,  0.,  0.],
       [ 3.,  4.,  0.]], dtype=float32)

In [5]:
adata.X = csr_matrix(X_array)

In [6]:
adata.X

<4x3 sparse matrix of type '<class 'numpy.int64'>'
	with 4 stored elements in Compressed Sparse Row format>

Changing to "backed" mode by setting a backing filename.

In [7]:
adata.filename = './test3.h5ad'

In [8]:
adata.X

<HDF5 sparse dataset: format 'csr', shape (4, 3), type '<i8'>

In [9]:
adata.X[0:2].toarray()

array([[0, 1, 0],
       [0, 0, 2]])

Changing the format of the matrix in the file from csr to csc.

In [10]:
adata.X = csc_matrix(X_array)

In [11]:
adata.X

<HDF5 sparse dataset: format 'csc', shape (4, 3), type '<i8'>

In [12]:
adata.X[0:2].toarray()

array([[0, 1, 0],
       [0, 0, 2]])
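
The slice looks the same in both formats because only the internal layout differs. A minimal scipy-only sketch of the two layouts for the example matrix:

```python
import numpy as np
from scipy.sparse import csc_matrix, csr_matrix

X_array = np.array(
    [[0, 1, 0],
     [0, 0, 2],
     [0, 0, 0],
     [3, 4, 0]])
X_csr = csr_matrix(X_array)
X_csc = csc_matrix(X_array)

# CSR stores values row by row; indptr has n_rows + 1 entries.
csr_data = X_csr.data.tolist()      # [1, 2, 3, 4]
csr_indptr = X_csr.indptr.tolist()  # [0, 1, 2, 2, 4]

# CSC stores values column by column; indptr has n_cols + 1 entries
# and indices holds row numbers instead of column numbers.
csc_data = X_csc.data.tolist()        # [3, 1, 4, 2]
csc_indices = X_csc.indices.tolist()  # [3, 0, 3, 1]
csc_indptr = X_csc.indptr.tolist()    # [0, 1, 3, 4]

# Both describe the same matrix.
same = bool((X_csr.toarray() == X_csc.toarray()).all())
```

As a rule of thumb, CSR makes row slices contiguous and CSC makes column slices contiguous.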

In [13]:
!h5ls -rv './test3.h5ad'

Opened "./test3.h5ad" with sec2 driver.
/                        Group
    Location:  1:96
    Links:     1
/X                       Group
    Attribute: h5sparse_format scalar
        Type:      variable-length null-terminated UTF-8 string
        Data:  "csr"
    Attribute: h5sparse_shape {2}
        Type:      native long
        Data:  4, 3
    Location:  1:20456
    Links:     1
/X/data                  Dataset {4/4}
    Location:  1:21376
    Links:     1
    Chunks:    {4} 32 bytes
    Storage:   32 logical bytes, 19 allocated bytes, 168.42% utilization
    Filter-0:  deflate-1 OPT {4}
    Type:      native long
/X/indices               Dataset {4/4}
    Location:  1:21976
    Links:     1
    Chunks:    {4} 16 bytes
    Storage:   16 logical bytes, 18 allocated bytes, 88.89% utilization
    Filter-0:  deflate-1 OPT {4}
    Type:      native int
/X/indptr                Dataset {5/5}
    Location:  1:32840
    Links:     1
    Chunks:    {5} 20 byte