# Part 3B Using compression

> Objectives:
> * How to compress chunked datasets
> * Learn how to fine-tune the HDF5 compression pipeline to suit your needs
> * How to use pandas for reading CSV files

In [1]:
import os
import numpy as np
import pandas as pd
import tables

In [2]:
import shutil
data_dir = "compression"
if os.path.exists(data_dir):
    shutil.rmtree(data_dir)
os.mkdir(data_dir)

## Intermezzo: the MovieLens-1M dataset

Previous work by Greg Reda: http://www.gregreda.com/2013/10/26/using-pandas-on-the-movielens-dataset/

This dataset describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. The files contain 1,000,209 anonymous ratings of approximately 3,900 movies made by 6,040 MovieLens users who joined MovieLens in 2000.


In [3]:
# Import CSV files via pandas
dset = os.path.join('datasets', 'movielens-1m')  # portable; avoids the invalid '\m' escape
fdata = os.path.join(dset, 'ratings.dat.gz')
fitem = os.path.join(dset, 'movies.dat.gz')

# pass in column names for each CSV
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv(fdata, sep=';', names=r_cols)

m_cols = ['movie_id', 'title', 'genres']
movies = pd.read_csv(fitem, sep=';', names=m_cols,
                     dtype={'title': object, 'genres': object})

In [4]:
movies.head()

Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


In [5]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3883 entries, 0 to 3882
Data columns (total 3 columns):
movie_id    3883 non-null int64
title       3883 non-null object
genres      3883 non-null object
dtypes: int64(1), object(2)
memory usage: 91.1+ KB


In [6]:
movies.ftypes

movie_id     int64:dense
title       object:dense
genres      object:dense
dtype: object

In [7]:
ratings.head()

Unnamed: 0,user_id,movie_id,rating,unix_timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


In [8]:
ratings.ftypes

user_id           int64:dense
movie_id          int64:dense
rating            int64:dense
unix_timestamp    int64:dense
dtype: object

## Storing in HDF5/PyTables in compressed form

Compression is enabled through HDF5 filters, which PyTables exposes through the `tables.Filters` class:

```
filters = tables.Filters(complevel=2, complib='zlib')
f.create_table(..., filters=filters)
```

`tables.Filters()` keyword arguments:

* `complevel`: compression level, from 0 (no compression) to 9 (maximum compression)
* `complib`: compression library
  - **zlib** (standard HDF5, the default)
  - **lzo** (usually not available)
  - **bzip2**
  - **BLOSC**
      - blosc:blosclz
      - blosc:lz4
      - blosc:lz4hc
      - blosc:snappy
      - blosc:zlib
      - blosc:zstd
* `shuffle`: whether to apply the shuffle filter (enabled by default)
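
The effect of the shuffle filter is easiest to see in a small experiment. The following is a minimal sketch in pure Python, not the HDF5 implementation: the helper names `shuffle_bytes` and `unshuffle_bytes` are made up for this illustration, and the data is a synthetic stand-in for an integer column. Shuffling regroups the bytes of fixed-size items so that all first bytes come first, then all second bytes, and so on, which tends to give a codec such as zlib longer runs of similar bytes.

```python
import struct
import zlib

def shuffle_bytes(raw, itemsize):
    """Group byte 0 of every item, then byte 1 of every item, and so on."""
    return b"".join(raw[i::itemsize] for i in range(itemsize))

def unshuffle_bytes(shuffled, itemsize):
    """Invert shuffle_bytes()."""
    nitems = len(shuffled) // itemsize
    out = bytearray(len(shuffled))
    for i in range(itemsize):
        out[i::itemsize] = shuffled[i * nitems:(i + 1) * nitems]
    return bytes(out)

# Synthetic stand-in for an Int32 column: slowly increasing ids (assumption).
values = [i // 3 for i in range(30000)]
raw = struct.pack("<%di" % len(values), *values)

plain = zlib.compress(raw, 5)
shuffled = zlib.compress(shuffle_bytes(raw, 4), 5)
print("raw: %d  zlib: %d  shuffle+zlib: %d" % (len(raw), len(plain), len(shuffled)))
```

For slowly varying integers like these, the shuffled buffer should compress to a fraction of the unshuffled one. Real columns will show a smaller or larger gain depending on how their bytes are distributed.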

---------

`to_hdf5` writes the MovieLens dataset to an HDF5 file whose name reflects the `Filters` settings.

In [9]:
def to_hdf5(ratings, movies, filters):
    
    class Ratings(tables.IsDescription):
        user_id = tables.Int32Col(pos=0)
        movie_id = tables.Int32Col(pos=1)
        rating = tables.Int8Col(pos=2)
        unix_timestamp = tables.Int64Col(pos=3)
    
    class Movies(tables.IsDescription):
        movie_id = tables.Int32Col(pos=0)
        title = tables.StringCol(100, pos=1)
        genres = tables.StringCol(50, pos=2)
    
    def get_filename(filters):
        if filters.complevel != 0:
            complib = filters.complib if ":" not in filters.complib else filters.complib.replace(":", "-")
            shuffle = "shuffle" if filters.shuffle else "noshuffle"
            filename = "%s/%s-%d-%s.h5" % (data_dir, complib, filters.complevel, shuffle)
        else:
            filename = "%s/no-compression.h5" % (data_dir,)
        return filename

    filename = get_filename(filters)
    print("Creating file:", filename)
    with tables.open_file(filename, "w") as f:
        table_ratings = f.create_table(f.root, "ratings", Ratings, filters=filters, expectedrows=len(ratings))
        # DataFrame.columns keeps the CSV column order, which matches the `pos` order above
        table_ratings.append([ratings[col].values for col in ratings.columns])
        table_movies = f.create_table(f.root, "movies", Movies, filters=filters, expectedrows=len(movies))
        table_movies.append([movies[col].values for col in movies.columns])
    return filename

In [10]:
%%time
filters = tables.Filters(complevel=5)
h5file = to_hdf5(ratings, movies, filters)

Creating file: compression/zlib-5-shuffle.h5
Wall time: 678 ms


In [11]:
!ptdump -v {h5file}

/ (RootGroup) ''
/movies (Table(3883,), shuffle, zlib(5)) ''
  description := {
  "movie_id": Int32Col(shape=(), dflt=0, pos=0),
  "title": StringCol(itemsize=100, shape=(), dflt=b'', pos=1),
  "genres": StringCol(itemsize=50, shape=(), dflt=b'', pos=2)}
  byteorder := 'little'
  chunkshape := (425,)
/ratings (Table(1000209,), shuffle, zlib(5)) ''
  description := {
  "user_id": Int32Col(shape=(), dflt=0, pos=0),
  "movie_id": Int32Col(shape=(), dflt=0, pos=1),
  "rating": Int8Col(shape=(), dflt=0, pos=2),
  "unix_timestamp": Int64Col(shape=(), dflt=0, pos=3)}
  byteorder := 'little'
  chunkshape := (7710,)
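
The `chunkshape` values in the dump can be related to the row sizes with a little arithmetic. The sketch below assumes that rows are stored packed, with no padding between columns, and uses `struct` format strings that mirror the column types shown above.

```python
import struct

# Packed row sizes implied by the column types in the dump above.
ratings_row = struct.calcsize("<iibq")      # Int32 + Int32 + Int8 + Int64
movies_row = struct.calcsize("<i100s50s")   # Int32 + 100-byte + 50-byte strings

for name, rowsize, chunkrows in (("ratings", ratings_row, 7710),
                                 ("movies", movies_row, 425)):
    chunkbytes = rowsize * chunkrows
    print("%-8s row=%3d B  chunk=%6d B (%.1f KiB)"
          % (name, rowsize, chunkbytes, chunkbytes / 1024))
```

Under that assumption the chunks come out at 131,070 and 65,450 bytes, just below 128 KiB and 64 KiB. This is consistent with PyTables deriving the chunk size from `expectedrows`, although the exact heuristic is not shown here. Compression is applied per chunk, so these are the block sizes each codec gets to work with.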


### Exercise 1: Compression. Create/write speed.

PyTables supports several codecs out of the box.  Do a quick comparison between "zlib", "bzip2" and "blosc" at compression levels 1 (fastest), 5 and 9 (slowest).  Which one compresses best?  Which one compresses fastest?

Blosc is a meta-compressor: it supports several internal codecs, which can be selected from PyTables using the "blosc:`codec`" form.  Do another comparison between the internal Blosc codecs, namely "blosc:blosclz" (the default), "blosc:lz4", "blosc:lz4hc", "blosc:snappy", "blosc:zlib" and "blosc:zstd".

Finally, disable compression entirely (`complevel=0`).  How fast is it compared with the codecs above?

In [12]:
#
#
# SOLUTION STARTS HERE
#
#

In [13]:
for complib in ("zlib", "bzip2", "blosc"):
    filters = tables.Filters(complevel=5, complib=complib)
    %time to_hdf5(ratings, movies, filters)

Creating file: compression/zlib-5-shuffle.h5
Wall time: 648 ms
Creating file: compression/bzip2-5-shuffle.h5
Wall time: 1.46 s
Creating file: compression/blosc-5-shuffle.h5
Wall time: 163 ms


In [14]:
!ls -lh {data_dir}

total 14M
-rw-r--r-- 1 tomkooij 197613 5.0M Jun 22 11:31 blosc-5-shuffle.h5
-rw-r--r-- 1 tomkooij 197613 4.1M Jun 22 11:31 bzip2-5-shuffle.h5
-rw-r--r-- 1 tomkooij 197613 4.3M Jun 22 11:31 zlib-5-shuffle.h5


In [15]:
for complib in ("blosc:blosclz", "blosc:lz4", "blosc:lz4hc", "blosc:snappy", "blosc:zlib", "blosc:zstd"):
    filters = tables.Filters(complevel=5, complib=complib)
    %time to_hdf5(ratings, movies, filters)

Creating file: compression/blosc-blosclz-5-shuffle.h5
Wall time: 173 ms
Creating file: compression/blosc-lz4-5-shuffle.h5
Wall time: 115 ms
Creating file: compression/blosc-lz4hc-5-shuffle.h5
Wall time: 347 ms
Creating file: compression/blosc-snappy-5-shuffle.h5
Wall time: 132 ms
Creating file: compression/blosc-zlib-5-shuffle.h5
Wall time: 638 ms
Creating file: compression/blosc-zstd-5-shuffle.h5
Wall time: 448 ms


In [16]:
# Finally, the uncompressed case
filters = tables.Filters(complevel=0)
%time to_hdf5(ratings, movies, filters)

Creating file: compression/no-compression.h5
Wall time: 93.6 ms


'compression/no-compression.h5'

In [17]:
!ls -lh {data_dir}

total 60M
-rw-r--r-- 1 tomkooij 197613 5.0M Jun 22 11:31 blosc-5-shuffle.h5
-rw-r--r-- 1 tomkooij 197613 5.0M Jun 22 11:31 blosc-blosclz-5-shuffle.h5
-rw-r--r-- 1 tomkooij 197613 5.4M Jun 22 11:31 blosc-lz4-5-shuffle.h5
-rw-r--r-- 1 tomkooij 197613 4.8M Jun 22 11:31 blosc-lz4hc-5-shuffle.h5
-rw-r--r-- 1 tomkooij 197613 5.5M Jun 22 11:32 blosc-snappy-5-shuffle.h5
-rw-r--r-- 1 tomkooij 197613 4.4M Jun 22 11:32 blosc-zlib-5-shuffle.h5
-rw-r--r-- 1 tomkooij 197613 4.3M Jun 22 11:32 blosc-zstd-5-shuffle.h5
-rw-r--r-- 1 tomkooij 197613 4.1M Jun 22 11:31 bzip2-5-shuffle.h5
-rw-r--r-- 1 tomkooij 197613  17M Jun 22 11:32 no-compression.h5
-rw-r--r-- 1 tomkooij 197613 4.3M Jun 22 11:31 zlib-5-shuffle.h5
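
A directory listing works, but ratios are easier to compare than sizes. The helper below is a small sketch (the name `compression_ratios` is made up for this note) that divides the size of the uncompressed file by the size of every other `.h5` file in a directory.

```python
import os

def compression_ratios(directory, baseline="no-compression.h5"):
    """Map each other .h5 file to baseline_size / file_size."""
    base = os.path.getsize(os.path.join(directory, baseline))
    ratios = {}
    for name in sorted(os.listdir(directory)):
        if name.endswith(".h5") and name != baseline:
            size = os.path.getsize(os.path.join(directory, name))
            ratios[name] = base / size
    return ratios

# Only runs when the benchmark files from the cells above exist.
if os.path.isfile(os.path.join("compression", "no-compression.h5")):
    for name, ratio in compression_ratios("compression").items():
        print("%-32s %.1fx" % (name, ratio))
```

With the sizes in the listing above, this gives roughly 3.4x for the default Blosc codec and about 4x for zlib and bzip2.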


## Reading compressed datasets

In [18]:
files = list(os.walk(data_dir))[0][2]

In [19]:
files

['blosc-5-shuffle.h5',
 'blosc-blosclz-5-shuffle.h5',
 'blosc-lz4-5-shuffle.h5',
 'blosc-lz4hc-5-shuffle.h5',
 'blosc-snappy-5-shuffle.h5',
 'blosc-zlib-5-shuffle.h5',
 'blosc-zstd-5-shuffle.h5',
 'bzip2-5-shuffle.h5',
 'no-compression.h5',
 'zlib-5-shuffle.h5']

### Exercise 2: Reading compressed datasets

Which codec and compression level reads fastest?  How does it compare with reading an uncompressed dataset?

In [20]:
#
#
# SOLUTION STARTS HERE
#
#

In [21]:
for f in files:
    print("Reading file:", f)
    with tables.open_file(os.path.join(data_dir, f)) as h5f:
        %timeit h5f.root.ratings[:]

Reading file: blosc-5-shuffle.h5
40.3 ms ± 234 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Reading file: blosc-blosclz-5-shuffle.h5
40.8 ms ± 636 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Reading file: blosc-lz4-5-shuffle.h5
31.2 ms ± 149 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Reading file: blosc-lz4hc-5-shuffle.h5
31.7 ms ± 257 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Reading file: blosc-snappy-5-shuffle.h5
39.3 ms ± 83.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Reading file: blosc-zlib-5-shuffle.h5
106 ms ± 132 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Reading file: blosc-zstd-5-shuffle.h5
57.9 ms ± 394 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Reading file: bzip2-5-shuffle.h5
548 ms ± 11.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Reading file: no-compression.h5
18.1 ms ± 46.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Reading file: zlib-5-shuffle.h5
123 ms 

### Exercise 3: BLOSC multithreading

Blosc can use multithreading for compressing and decompressing, although it is disabled by default.  There are several ways to enable it, but perhaps the easiest is to set the "BLOSC_NTHREADS" environment variable to the desired number of threads (typically the number of cores available in your computer).

Execute the cell below, redo the reading benchmarks, and look at how the reading speed varies.  Pay special attention to the difference between the CPU times and wall times.

In [22]:
os.environ["BLOSC_NTHREADS"] = "4"  # set to any other number you prefer
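
The timing outputs in this notebook show wall time only, so CPU time has to be measured separately. One possible way, sketched here with a hypothetical helper `cpu_and_wall`, is to pair `time.process_time()` (CPU time of the whole process, all threads included) with `time.perf_counter()` (elapsed time).

```python
import time

def cpu_and_wall(func, *args, **kwargs):
    """Call func and return (result, cpu_seconds, wall_seconds)."""
    wall0 = time.perf_counter()
    cpu0 = time.process_time()
    result = func(*args, **kwargs)
    cpu = time.process_time() - cpu0
    wall = time.perf_counter() - wall0
    return result, cpu, wall

# Stand-in workload; in the notebook this could be
#   cpu_and_wall(lambda: h5f.root.ratings[:])
_, cpu, wall = cpu_and_wall(sum, range(1000000))
print("CPU: %.1f ms  wall: %.1f ms" % (cpu * 1e3, wall * 1e3))
```

When several threads decompress in parallel, the CPU time can exceed the wall time, because the work of all threads is added up.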

In [23]:
#
#
# SOLUTION STARTS HERE
#
#

In [24]:
for f in files:
    for nthreads in [1, 2, 4]:
        os.environ["BLOSC_NTHREADS"] = "%s" % nthreads
        print("Reading file:", f, 'nthreads=', os.environ["BLOSC_NTHREADS"])
        with tables.open_file(os.path.join(data_dir, f)) as h5f:
            %timeit h5f.root.ratings[:]

Reading file: blosc-5-shuffle.h5 nthreads= 1
48.6 ms ± 2.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Reading file: blosc-5-shuffle.h5 nthreads= 2
45.7 ms ± 2.81 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Reading file: blosc-5-shuffle.h5 nthreads= 4
39.8 ms ± 2.96 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Reading file: blosc-blosclz-5-shuffle.h5 nthreads= 1
44.5 ms ± 5.96 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Reading file: blosc-blosclz-5-shuffle.h5 nthreads= 2
50 ms ± 4.44 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Reading file: blosc-blosclz-5-shuffle.h5 nthreads= 4
48.6 ms ± 5.71 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Reading file: blosc-lz4-5-shuffle.h5 nthreads= 1
33.1 ms ± 2.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Reading file: blosc-lz4-5-shuffle.h5 nthreads= 2
35.3 ms ± 2.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Reading file: blosc-lz4-5-shuffle.h5 nthrea

## Normalizing and denormalizing tables

Many data sources are expressed in terms of related tables.  For example, part of the [MovieLens dataset](https://grouplens.org/datasets/movielens/) is structured in tables with the following columns:

In [25]:
ratings = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
movies = ['movie_id', 'title', 'genres']

The two tables above are linked by the `movie_id` field.  This makes it possible to run queries that involve both tables, for example, which users ('user_id') gave a rating of 5 to a given movie ('title').  This is called the `normalized` version, and we have already dealt with it in a previous section.

On the other hand, one can fuse the above 2 tables into a single one:

In [26]:
ratings_movies = ['title', 'genres', 'user_id', 'rating', 'unix_timestamp']

As you can see, all the data fields are kept except for 'movie_id', which is no longer needed.  This is called the `denormalized` version.
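
As a tiny illustration of the idea, the sketch below joins a few rows by hand in plain Python. The variable names are made up for this note, and the rows are a small sample taken from the dataset.

```python
# movie_id -> (title, genres)
movie_rows = {
    1: ("Toy Story (1995)", "Animation|Children's|Comedy"),
    2: ("Jumanji (1995)", "Adventure|Children's|Fantasy"),
}

# (user_id, movie_id, rating, unix_timestamp)
rating_rows = [
    (1, 1, 5, 978824268),
    (6, 1, 4, 978237008),
    (8, 1, 4, 978233496),
]

# Look up each rating's movie and drop movie_id from the result.
denormalized = [
    movie_rows[movie_id] + (user_id, rating, unix_timestamp)
    for user_id, movie_id, rating, unix_timestamp in rating_rows
]

for row in denormalized:
    print(row)
```

Every output row carries the full title and genres, so the same strings are repeated once per rating.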

The advantage of this layout is that all the fields are readily available in a single table, so querying it and getting information about all the fields is straightforward.  The disadvantage is that the table contains a lot of duplicated information: the 'title' and 'genres' fields are repeated for every rating, which can be seen as a waste of space.

However, compression can often remove much of the duplicated information in denormalized tables.  Let's see how to produce a denormalized table and how it fares compared with the normalized version.
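
The effect can be previewed with the standard library alone. The sketch below is only an illustration under stated assumptions: the data is synthetic (not MovieLens), the titles are padded to 100 bytes to mimic a fixed-width string column, and plain `zlib` is used without HDF5 chunking or the shuffle filter.

```python
import random
import struct
import zlib

rng = random.Random(42)
n_movies, n_ratings = 50, 20000

# Synthetic data: (movie, user, rating) tuples, sorted by movie.
titles = [("Movie number %d" % m).encode() for m in range(n_movies)]
rows = sorted((rng.randrange(n_movies), rng.randint(1, 6040), rng.randint(1, 5))
              for _ in range(n_ratings))

# Normalized: a ratings table with movie_id plus a small movies table.
norm = b"".join(struct.pack("<iib", user, movie, rating)
                for movie, user, rating in rows)
norm += b"".join(struct.pack("<i100s", m, t) for m, t in enumerate(titles))

# Denormalized: the zero-padded title is repeated in every row.
denorm = b"".join(struct.pack("<ib100s", user, rating, titles[movie])
                  for movie, user, rating in rows)

for name, raw in (("normalized", norm), ("denormalized", denorm)):
    packed = zlib.compress(raw, 5)
    print("%-13s raw=%8d  zlib=%7d  ratio=%.1f"
          % (name, len(raw), len(packed), len(raw) / len(packed)))
```

The denormalized buffer is more than ten times larger before compression, but it compresses at a much higher ratio, because the repeated titles and their padding are cheap for the codec to encode.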

## Denormalizing tables using pandas

In [27]:
# Import CSV files via pandas
dset = os.path.join('datasets', 'movielens-1m')  # portable; avoids the invalid '\m' escape
fdata = os.path.join(dset, 'ratings.dat.gz')
fitem = os.path.join(dset, 'movies.dat.gz')

# pass in column names for each CSV
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv(fdata, sep=';', names=r_cols)

m_cols = ['movie_id', 'title', 'genres']
movies = pd.read_csv(fitem, sep=';', names=m_cols,
                     dtype={'title': object, 'genres': object})

In [28]:
# create one merged DataFrame
lens = pd.merge(movies, ratings)

In [29]:
lens.head()

Unnamed: 0,movie_id,title,genres,user_id,rating,unix_timestamp
0,1,Toy Story (1995),Animation|Children's|Comedy,1,5,978824268
1,1,Toy Story (1995),Animation|Children's|Comedy,6,4,978237008
2,1,Toy Story (1995),Animation|Children's|Comedy,8,4,978233496
3,1,Toy Story (1995),Animation|Children's|Comedy,9,5,978225952
4,1,Toy Story (1995),Animation|Children's|Comedy,10,5,978226474


In [30]:
lens.ftypes

movie_id           int64:dense
title             object:dense
genres            object:dense
user_id            int64:dense
rating             int64:dense
unix_timestamp     int64:dense
dtype: object

In [31]:
def to_hdf5_denorm(lens, filters):

    class Lens(tables.IsDescription):
        user_id = tables.Int32Col(pos=0)
        rating = tables.Int8Col(pos=1)
        unix_timestamp = tables.Int64Col(pos=2)
        title = tables.StringCol(100, pos=3)
        genres = tables.StringCol(50, pos=4)
        
    def get_filename(filters):
        if filters.complevel != 0:
            complib = filters.complib if ":" not in filters.complib else filters.complib.replace(":", "-")
            shuffle = "shuffle" if filters.shuffle else "noshuffle"
            filename = "%s/%s-%d-%s-denorm.h5" % (data_dir, complib, filters.complevel, shuffle)
        else:
            filename = "%s/no-compression-denorm.h5" % (data_dir,)
        return filename

    filename = get_filename(filters)
    print("Creating file:", filename)
    with tables.open_file(filename, "w", filters=filters) as f:
        table_lens = f.create_table(f.root, "lens", Lens)
        table_lens.append([lens[col].values for col in table_lens.dtype.names])
    return filename

In [32]:
%%time
filters = tables.Filters(complevel=0)
h5file = to_hdf5_denorm(lens, filters)

Creating file: compression/no-compression-denorm.h5
Wall time: 492 ms


In [33]:
!ptdump -v -R0,10 {h5file}

/ (RootGroup) ''
/lens (Table(1000209,)) ''
  description := {
  "user_id": Int32Col(shape=(), dflt=0, pos=0),
  "rating": Int8Col(shape=(), dflt=0, pos=1),
  "unix_timestamp": Int64Col(shape=(), dflt=0, pos=2),
  "title": StringCol(itemsize=100, shape=(), dflt=b'', pos=3),
  "genres": StringCol(itemsize=50, shape=(), dflt=b'', pos=4)}
  byteorder := 'little'
  chunkshape := (402,)
  Data dump:
[0] (1, 5, 978824268, b'Toy Story (1995)', b"Animation|Children's|Comedy")
[1] (6, 4, 978237008, b'Toy Story (1995)', b"Animation|Children's|Comedy")
[2] (8, 4, 978233496, b'Toy Story (1995)', b"Animation|Children's|Comedy")
[3] (9, 5, 978225952, b'Toy Story (1995)', b"Animation|Children's|Comedy")
[4] (10, 5, 978226474, b'Toy Story (1995)', b"Animation|Children's|Comedy")
[5] (18, 4, 978154768, b'Toy Story (1995)', b"Animation|Children's|Comedy")
[6] (19, 5, 978555994, b'Toy Story (1995)', b"Animation|Children's|Comedy")
[7] (21, 3, 978139347, b'Toy Story (1995)', b"Animation|Children's|Comedy"

In [34]:
!ls -lh compression

total 215M
-rw-r--r-- 1 tomkooij 197613 5.0M Jun 22 11:31 blosc-5-shuffle.h5
-rw-r--r-- 1 tomkooij 197613 5.0M Jun 22 11:31 blosc-blosclz-5-shuffle.h5
-rw-r--r-- 1 tomkooij 197613 5.4M Jun 22 11:31 blosc-lz4-5-shuffle.h5
-rw-r--r-- 1 tomkooij 197613 4.8M Jun 22 11:31 blosc-lz4hc-5-shuffle.h5
-rw-r--r-- 1 tomkooij 197613 5.5M Jun 22 11:32 blosc-snappy-5-shuffle.h5
-rw-r--r-- 1 tomkooij 197613 4.4M Jun 22 11:32 blosc-zlib-5-shuffle.h5
-rw-r--r-- 1 tomkooij 197613 4.3M Jun 22 11:32 blosc-zstd-5-shuffle.h5
-rw-r--r-- 1 tomkooij 197613 4.1M Jun 22 11:31 bzip2-5-shuffle.h5
-rw-r--r-- 1 tomkooij 197613 156M Jun 22 11:35 no-compression-denorm.h5
-rw-r--r-- 1 tomkooij 197613  17M Jun 22 11:32 no-compression.h5
-rw-r--r-- 1 tomkooij 197613 4.3M Jun 22 11:31 zlib-5-shuffle.h5


As can be seen, the size of the denormalized table is much larger than the normalized one (156 MB vs 17 MB).  But that is without using compression.

### Exercise 4: Compressing a denormalized table

Create a compressed version of the denormalized table and compare it with the same table in the normalized state.
What's the difference in size now?  Why do you think the compression process works much better in this case?

In [35]:
#
#
# SOLUTION STARTS HERE
#
#

In [36]:
filters = tables.Filters(complevel=5, complib="blosc:blosclz")
%time to_hdf5_denorm(lens, filters)

Creating file: compression/blosc-blosclz-5-shuffle-denorm.h5
Wall time: 806 ms


'compression/blosc-blosclz-5-shuffle-denorm.h5'

In [37]:
!ls -lh compression

total 223M
-rw-r--r-- 1 tomkooij 197613 5.0M Jun 22 11:31 blosc-5-shuffle.h5
-rw-r--r-- 1 tomkooij 197613 7.3M Jun 22 11:35 blosc-blosclz-5-shuffle-denorm.h5
-rw-r--r-- 1 tomkooij 197613 5.0M Jun 22 11:31 blosc-blosclz-5-shuffle.h5
-rw-r--r-- 1 tomkooij 197613 5.4M Jun 22 11:31 blosc-lz4-5-shuffle.h5
-rw-r--r-- 1 tomkooij 197613 4.8M Jun 22 11:31 blosc-lz4hc-5-shuffle.h5
-rw-r--r-- 1 tomkooij 197613 5.5M Jun 22 11:32 blosc-snappy-5-shuffle.h5
-rw-r--r-- 1 tomkooij 197613 4.4M Jun 22 11:32 blosc-zlib-5-shuffle.h5
-rw-r--r-- 1 tomkooij 197613 4.3M Jun 22 11:32 blosc-zstd-5-shuffle.h5
-rw-r--r-- 1 tomkooij 197613 4.1M Jun 22 11:31 bzip2-5-shuffle.h5
-rw-r--r-- 1 tomkooij 197613 156M Jun 22 11:35 no-compression-denorm.h5
-rw-r--r-- 1 tomkooij 197613  17M Jun 22 11:32 no-compression.h5
-rw-r--r-- 1 tomkooij 197613 4.3M Jun 22 11:31 zlib-5-shuffle.h5


### Exercise 5: Denormalized table: compare codecs.

Create different files containing the denormalized table using different codecs.  Which one reduces the size the most?  How does it compare with the files for the normalized version?

In [38]:
for complib in ("zlib", "bzip2", "blosc:blosclz", "blosc:lz4", "blosc:lz4hc", "blosc:snappy", "blosc:zlib", "blosc:zstd"):
    filters = tables.Filters(complevel=5, complib=complib)
    %time to_hdf5_denorm(lens, filters)

Creating file: compression/zlib-5-shuffle-denorm.h5
Wall time: 2.34 s
Creating file: compression/bzip2-5-shuffle-denorm.h5
Wall time: 6.45 s
Creating file: compression/blosc-blosclz-5-shuffle-denorm.h5
Wall time: 801 ms
Creating file: compression/blosc-lz4-5-shuffle-denorm.h5
Wall time: 770 ms
Creating file: compression/blosc-lz4hc-5-shuffle-denorm.h5
Wall time: 1.42 s
Creating file: compression/blosc-snappy-5-shuffle-denorm.h5
Wall time: 807 ms
Creating file: compression/blosc-zlib-5-shuffle-denorm.h5
Wall time: 2.51 s
Creating file: compression/blosc-zstd-5-shuffle-denorm.h5
Wall time: 1.57 s


In [39]:
!ls -lh compression

total 417M
-rw-r--r-- 1 tomkooij 197613 5.0M Jun 22 11:31 blosc-5-shuffle.h5
-rw-r--r-- 1 tomkooij 197613 7.3M Jun 22 11:36 blosc-blosclz-5-shuffle-denorm.h5
-rw-r--r-- 1 tomkooij 197613 5.0M Jun 22 11:31 blosc-blosclz-5-shuffle.h5
-rw-r--r-- 1 tomkooij 197613 7.8M Jun 22 11:36 blosc-lz4-5-shuffle-denorm.h5
-rw-r--r-- 1 tomkooij 197613 5.4M Jun 22 11:31 blosc-lz4-5-shuffle.h5
-rw-r--r-- 1 tomkooij 197613 6.7M Jun 22 11:36 blosc-lz4hc-5-shuffle-denorm.h5
-rw-r--r-- 1 tomkooij 197613 4.8M Jun 22 11:31 blosc-lz4hc-5-shuffle.h5
-rw-r--r-- 1 tomkooij 197613 156M Jun 22 11:36 blosc-snappy-5-shuffle-denorm.h5
-rw-r--r-- 1 tomkooij 197613 5.5M Jun 22 11:32 blosc-snappy-5-shuffle.h5
-rw-r--r-- 1 tomkooij 197613 6.1M Jun 22 11:36 blosc-zlib-5-shuffle-denorm.h5
-rw-r--r-- 1 tomkooij 197613 4.4M Jun 22 11:32 blosc-zlib-5-shuffle.h5
-rw-r--r-- 1 tomkooij 197613 5.5M Jun 22 11:36 blosc-zstd-5-shuffle-denorm.h5
-rw-r--r-- 1 tomkooij 197613 4.3M Jun 22 11:32 blosc-zstd-5-shuffle.h5
-rw-r--r-- 1 tomkoo

In the next section we will see the effect of querying normalized and denormalized tables.