In [1]:
import numpy as np
import pandas as pd
import tempfile

from scipy.sparse import random

In [2]:
from ratschlab_common.io import sparse_df

Example usage of `sparse_df`.

Pandas dataframes support sparse data, but support for storing such dataframes on disk appears limited. This is where `ratschlab_common.io.sparse_df` comes in: it stores sparse dataframes in HDF5 files. There, the sparse columns of a dataframe are stored together as a [COO matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html).
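To make the COO layout concrete, here is a minimal sketch that uses pandas' own `DataFrame.sparse.to_coo()` accessor on a tiny, made-up frame (the name `toy` and its values are purely illustrative). It only shows what the representation looks like, namely parallel arrays of row indices, column indices, and values. It is not a claim about how `sparse_df` is implemented internally.

```python
import pandas as pd

# Tiny illustrative frame (hypothetical data): two sparse columns.
# fill_value=0.0 is set explicitly, since to_coo() requires a fill value of 0.
toy = pd.DataFrame({
    "a": pd.arrays.SparseArray([0.0, 1.5, 0.0, 0.0], fill_value=0.0),
    "b": pd.arrays.SparseArray([0.0, 0.0, 0.0, 2.5], fill_value=0.0),
})

# Convert all sparse columns into a single scipy COO matrix
coo = toy.sparse.to_coo()

# A COO matrix keeps three parallel arrays: row index, column index, value
triplets = sorted(zip(coo.row.tolist(), coo.col.tolist(), coo.data.tolist()))
print(coo.shape)
print(triplets)
```

Only the two non-zero entries are kept as triplets; the zeros are implied by the matrix shape.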

## Generate Random Sparse Data Frame

In [3]:
cols = 100
rows = 100_000

column_names = [f"col{i}" for i in range(cols)]
row_names = [f"myrow{i}" for i in range(rows)]

In [4]:
# generating random sparse matrix
np.random.seed(12345)
data_sparse = random(rows, cols, density=0.0001)

In [5]:
data_sparse

<100000x100 sparse matrix of type '<class 'numpy.float64'>'
	with 1000 stored elements in COOrdinate format>

In [6]:
df = pd.DataFrame.sparse.from_spmatrix(data_sparse, columns=column_names)

In [7]:
df['key'] = row_names
# reorder columns so that the 'key' column comes first
df = df[df.columns.to_list()[-1:] + df.columns.to_list()[:-1]]

In [8]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Columns: 101 entries, key to col99
dtypes: Sparse[float64, 0.0](100), object(1)
memory usage: 6.4 MB


In [9]:
# approximate memory [MB] it would take as a dense data frame
cols*rows*8 / 1024**2

76.2939453125
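The comparison above can also be checked empirically. The following is a minimal sketch on a smaller stand-in frame (the sizes, the seed, and the names `sdf`, `sparse_bytes`, and `dense_bytes` are assumptions chosen for illustration). It uses the pandas accessors `DataFrame.sparse.density` and `DataFrame.sparse.to_dense()`, which require every column to be sparse, so the `key` column is left out here.

```python
import pandas as pd
from scipy.sparse import random as sparse_random

# Small stand-in for the frame above (sizes chosen for illustration only)
n_rows, n_cols = 1_000, 20
mat = sparse_random(n_rows, n_cols, density=0.01, random_state=0)
sdf = pd.DataFrame.sparse.from_spmatrix(mat)

# Fraction of entries that are explicitly stored (i.e. not the fill value)
density = sdf.sparse.density

# Memory of the sparse frame versus its dense equivalent, in bytes
sparse_bytes = sdf.memory_usage(deep=True).sum()
dense_bytes = sdf.sparse.to_dense().memory_usage(deep=True).sum()

print(f"density={density:.4f}, sparse={sparse_bytes} B, dense={dense_bytes} B")
```

The sparse frame stores one value and one index per non-zero entry, so its footprint grows with the number of stored elements rather than with `rows * cols`.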

## Writing Sparse Data Frame to Disk

In [10]:
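import os
# NamedTemporaryFile().name points to a file that is deleted right away; use a fresh temp directory instead
path = os.path.join(tempfile.mkdtemp(), "sparse_df.h5")
sparse_df.to_hdf(df, path)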



## Reading Back

In [11]:
my_df = sparse_df.read_hdf(path)

In [12]:
my_df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Columns: 101 entries, key to col99
dtypes: Sparse[float64, 0.0](100), object(1)
memory usage: 6.4 MB
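Matching `info()` output suggests the round trip preserved the sparse dtypes, but it does not compare values. A stricter check could look like the sketch below. The helper `same_sparse_frame` is hypothetical (it is not part of `ratschlab_common`), and a toy frame stands in for `df` and `my_df` so that the example runs on its own.

```python
import pandas as pd

def same_sparse_frame(left: pd.DataFrame, right: pd.DataFrame) -> bool:
    """Return True if both frames agree in values, columns, index, and dtypes."""
    try:
        # assert_frame_equal also compares dtypes by default
        pd.testing.assert_frame_equal(left, right)
    except AssertionError:
        return False
    return True

# Toy stand-ins (hypothetical data) for the original and the reloaded frame
original = pd.DataFrame({
    "key": ["r0", "r1", "r2"],
    "col0": pd.arrays.SparseArray([0.0, 1.0, 0.0], fill_value=0.0),
})
reloaded = original.copy()

# Same values, but the sparse column was turned into a dense one
densified = original.assign(col0=original["col0"].sparse.to_dense())

print(same_sparse_frame(original, reloaded))
print(same_sparse_frame(original, densified))
```

In this notebook, one could call `same_sparse_frame(df, my_df)` to confirm that both the values and the sparse dtypes survive the round trip.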
