# 5 Queries in PyTables

> Objectives:
> * How to query HDF5 files without loading them in-memory
> * How to query normalized and denormalized tables

In [1]:
import os
import numpy as np
import pandas as pd
import tables

In [2]:
import os
import shutil
data_dir = "queries"
if os.path.exists(data_dir):
    shutil.rmtree(data_dir)
os.mkdir(data_dir)

In [3]:
!ls -lh compression

total 417M
-rw-r--r-- 1 tomkooij 197613 5.0M Jun 26 10:22 blosc-5-shuffle.h5
-rw-r--r-- 1 tomkooij 197613 7.3M Jun 26 10:26 blosc-blosclz-5-shuffle-denorm.h5
-rw-r--r-- 1 tomkooij 197613 5.0M Jun 26 10:22 blosc-blosclz-5-shuffle.h5
-rw-r--r-- 1 tomkooij 197613 7.8M Jun 26 10:26 blosc-lz4-5-shuffle-denorm.h5
-rw-r--r-- 1 tomkooij 197613 5.4M Jun 26 10:22 blosc-lz4-5-shuffle.h5
-rw-r--r-- 1 tomkooij 197613 6.7M Jun 26 10:26 blosc-lz4hc-5-shuffle-denorm.h5
-rw-r--r-- 1 tomkooij 197613 4.8M Jun 26 10:22 blosc-lz4hc-5-shuffle.h5
-rw-r--r-- 1 tomkooij 197613 156M Jun 26 10:26 blosc-snappy-5-shuffle-denorm.h5
-rw-r--r-- 1 tomkooij 197613 5.5M Jun 26 10:22 blosc-snappy-5-shuffle.h5
-rw-r--r-- 1 tomkooij 197613 6.1M Jun 26 10:26 blosc-zlib-5-shuffle-denorm.h5
-rw-r--r-- 1 tomkooij 197613 4.4M Jun 26 10:22 blosc-zlib-5-shuffle.h5
-rw-r--r-- 1 tomkooij 197613 5.5M Jun 26 10:26 blosc-zstd-5-shuffle-denorm.h5
-rw-r--r-- 1 tomkooij 197613 4.3M Jun 26 10:22 blosc-zstd-5-shuffle.h5
-rw-r--r-- 1 tomkoo

We need three files created in the previous notebook:

* `compression/no-compression-denorm.h5`
* `compression/blosc-zstd-5-shuffle.h5`
* `compression/blosc-zstd-5-shuffle-denorm.h5`

In [4]:
for fn in ["compression/no-compression-denorm.h5",
           "compression/blosc-zstd-5-shuffle.h5",
           "compression/blosc-zstd-5-shuffle-denorm.h5"]:
    # Fail early with a clear message if a required data file is missing.
    if not os.path.exists(fn):
        raise FileNotFoundError("Missing datafile %s: Rerun 04-Using-Compression Notebook" % fn)

## Querying in PyTables

Searching tables is one of the most common and time-consuming operations users face when mining their data. Running queries as fast as possible is therefore key in data-heavy applications.


In [5]:
# MovieLens-1M (denormalized), not compressed:
fn = "compression/no-compression-denorm.h5"
h5file = tables.open_file(fn)
table = h5file.root.lens

### read table and query in numpy

Naive solution: read the table into memory and select the rows using `numpy`:

In [6]:
x = table.read() 

In [7]:
sum(1 for row in x[x['rating'] >= 4])

575281

Let's benchmark:

In [8]:
%%timeit
x = table.read() 
sum(1 for row in x[x['rating'] >= 4])

357 ms ± 5.85 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


This only works for tables that fit into memory (and we are abusing the OS file cache), but it provides a baseline for our queries.
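
As a side note on the in-memory baseline, the count does not need a Python-level loop once the data is in a `numpy` array. The sketch below uses a small synthetic structured array as a stand-in for the result of `table.read()`; the array contents and sizes are assumptions made for illustration, not the MovieLens data.

```python
import numpy as np

# Hypothetical stand-in for the array returned by table.read():
# a structured array with a small-integer 'rating' field.
rng = np.random.default_rng(0)
x = np.zeros(10_000, dtype=[("user_id", "<i4"), ("rating", "i1")])
x["rating"] = rng.integers(1, 6, size=x.shape[0])  # ratings 1..5

# Row-by-row count, as in the cell above: one Python-level step per selected row.
slow_count = sum(1 for row in x[x["rating"] >= 4])

# Vectorized count: the comparison yields a boolean mask and
# count_nonzero tallies the True entries without a Python loop.
fast_count = int(np.count_nonzero(x["rating"] >= 4))
```

Both expressions give the same count; the vectorized form avoids creating one Python object per selected row.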

###  table.iterrows()

`table.iterrows()` returns an iterator over ALL rows. Using this iterator, we can avoid loading the whole table into memory.

In [9]:
sum(1 for row in table.iterrows() if row['rating'] >= 4)

575281

In [10]:
%%timeit
sum(1 for row in table.iterrows() if row['rating'] >= 4)

306 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Better, but we can do better still:

### table.where()

`table.where()` performs an in-kernel query and returns a row iterator that yields only the selected rows:

In [11]:
type(table.where('rating >= 4'))

tables.tableextension.Row

In [12]:
%%timeit
ts = sum(1 for row in table.where("rating >= 4"))

211 ms ± 1.21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


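The difference between `table.where()` and `table.iterrows()` is that `where()` performs an **in-kernel** query: the condition string is evaluated in compiled code (PyTables uses Numexpr for this) instead of row by row in Python, so only matching rows reach the Python loop.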
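
To make the contrast concrete, here is a minimal sketch on a toy table. The file name `demo.h5`, the node name `toy`, and the data are hypothetical and only serve the illustration. It also shows `get_where_list()`, which returns the indices of the matching rows instead of the rows themselves.

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical toy data: 1000 rows with ratings cycling through 1..5.
data = np.zeros(1000, dtype=[("user_id", "<i4"), ("rating", "i1")])
data["user_id"] = np.arange(1000)
data["rating"] = np.arange(1000) % 5 + 1

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.h5")
    with tables.open_file(path, mode="w") as h5:
        toy = h5.create_table("/", "toy", obj=data)

        # Python-side filter: every row is handed to Python before the test runs.
        n_iterrows = sum(1 for row in toy.iterrows() if row["rating"] >= 4)

        # In-kernel filter: the condition is evaluated inside PyTables,
        # so only matching rows are yielded to Python.
        n_where = sum(1 for row in toy.where("rating >= 4"))

        # Indices of the matching rows, without reading the rows themselves.
        n_coords = len(toy.get_where_list("rating >= 4"))
```

All three approaches select the same rows; they differ in where the condition is evaluated and in what is handed back to Python.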

### read_where()

`table.read_where()` reads all table rows that match a query:

In [13]:
rows = table.read_where("rating >= 4")
rows

array([ (   1, 5,  978824268, b'Toy Story (1995)', b"Animation|Children's|Comedy"),
       (   6, 4,  978237008, b'Toy Story (1995)', b"Animation|Children's|Comedy"),
       (   8, 4,  978233496, b'Toy Story (1995)', b"Animation|Children's|Comedy"),
       ...,
       (5812, 4,  992072099, b'Contender, The (2000)', b'Drama|Thriller'),
       (5837, 4, 1011902656, b'Contender, The (2000)', b'Drama|Thriller'),
       (5998, 4, 1001781044, b'Contender, The (2000)', b'Drama|Thriller')],
      dtype=[('user_id', '<i4'), ('rating', 'i1'), ('unix_timestamp', '<i8'), ('title', 'S100'), ('genres', 'S50')])

In [14]:
len(rows)

575281

In [15]:
%%timeit
rows = table.read_where("rating >= 4")
sum(1 for row in rows)

1.46 s ± 5.56 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


The poor performance is due to `read_where()` itself, which here is about four times slower than the `numpy` benchmark (1.46 s versus 357 ms):

In [16]:
%timeit table.read_where("rating >= 4")

1.45 s ± 64.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Note that this comparison is not entirely fair: we are not actually doing anything with the selected rows.
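
When only one column of the selected rows is needed, `read_where()` accepts a `field` argument that returns just that column. The sketch below uses a hypothetical toy table (names and data are assumptions) and makes no claim about how the timings above would change on the MovieLens file.

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical toy data with a wide string column, similar in shape to 'lens'.
data = np.zeros(1000, dtype=[("rating", "i1"), ("title", "S100")])
data["rating"] = np.arange(1000) % 5 + 1  # ratings cycle through 1..5
data["title"] = b"Toy Story (1995)"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.h5")
    with tables.open_file(path, mode="w") as h5:
        toy = h5.create_table("/", "toy", obj=data)

        # Full records: every column of every matching row is returned.
        full_rows = toy.read_where("rating >= 4")

        # Single column: only the 'rating' values of the matching rows.
        only_ratings = toy.read_where("rating >= 4", field="rating")

        # "Doing something" with the selection: sum the selected ratings.
        total = int(only_ratings.sum())
```

Here `full_rows` is a structured array, while `only_ratings` is a plain one-dimensional array that can be passed directly to `numpy` reductions.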

### Exercise

Change `sum(1 for row in ...)` to `sum(row['rating'] for row in ...)` in the blocks above. Explain the results.

## Normalized vs Denormalized tables

Let's compare a "real life" query:

Query the ratings for the movie `Tom and Huck (1995)`:

### Denormalized

In [17]:
h5denorm = "compression/blosc-zstd-5-shuffle-denorm.h5"
h5file = tables.open_file(h5denorm)
h5lens = h5file.root.lens

In [18]:
h5lens

/lens (Table(1000209,), shuffle, blosc:zstd(5)) ''
  description := {
  "user_id": Int32Col(shape=(), dflt=0, pos=0),
  "rating": Int8Col(shape=(), dflt=0, pos=1),
  "unix_timestamp": Int64Col(shape=(), dflt=0, pos=2),
  "title": StringCol(itemsize=100, shape=(), dflt=b'', pos=3),
  "genres": StringCol(itemsize=50, shape=(), dflt=b'', pos=4)}
  byteorder := 'little'
  chunkshape := (402,)

In [19]:
%%time
ratings = [0] * 6
for rt in range(0,6):
    ratings[rt] = sum(1 for r in h5lens.where("(title == b'Tom and Huck (1995)') & (rating == rt)"))

Wall time: 2.2 s


In [20]:
ratings

[0, 4, 15, 28, 18, 3]

In [21]:
h5file.close()
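
The loop above scans the table once per rating value and relies on PyTables finding `rt` in the calling namespace. A possible alternative, sketched here on a hypothetical six-row table (data and file name are made up), passes the variable explicitly through `condvars` and then builds the same histogram in a single pass with `np.bincount`.

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical miniature of the denormalized 'lens' table.
data = np.zeros(6, dtype=[("rating", "i1"), ("title", "S100")])
data["title"] = [b"Tom and Huck (1995)"] * 4 + [b"Toy Story (1995)"] * 2
data["rating"] = [1, 3, 3, 5, 4, 5]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.h5")
    with tables.open_file(path, mode="w") as h5:
        lens = h5.create_table("/", "lens", obj=data)

        # One query per rating value; 'rt' is supplied explicitly via condvars.
        per_value = [
            sum(1 for _ in lens.where(
                "(title == b'Tom and Huck (1995)') & (rating == rt)",
                condvars={"rt": rt}))
            for rt in range(6)
        ]

        # Single pass: select the movie once, then histogram its ratings.
        selected = lens.read_where("title == b'Tom and Huck (1995)'",
                                   field="rating")
        single_pass = np.bincount(selected, minlength=6).tolist()
```

Both variants produce the same list of counts indexed by rating; the single-pass version reads the table once instead of six times.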

Querying denormalized tables is easy as pie.  Let's see how to manage normalized ones.

### Normalized tables

In [22]:
h5norm = "compression/blosc-zstd-5-shuffle.h5"
h5file = tables.open_file(h5norm)
h5ratings = h5file.root.ratings
h5movies = h5file.root.movies

In [23]:
h5ratings

/ratings (Table(1000209,), shuffle, blosc:zstd(5)) ''
  description := {
  "user_id": Int32Col(shape=(), dflt=0, pos=0),
  "movie_id": Int32Col(shape=(), dflt=0, pos=1),
  "rating": Int8Col(shape=(), dflt=0, pos=2),
  "unix_timestamp": Int64Col(shape=(), dflt=0, pos=3)}
  byteorder := 'little'
  chunkshape := (7710,)

In [24]:
h5movies

/movies (Table(3883,), shuffle, blosc:zstd(5)) ''
  description := {
  "movie_id": Int32Col(shape=(), dflt=0, pos=0),
  "title": StringCol(itemsize=100, shape=(), dflt=b'', pos=1),
  "genres": StringCol(itemsize=50, shape=(), dflt=b'', pos=2)}
  byteorder := 'little'
  chunkshape := (425,)

In [25]:
%%time
ratings = [0] * 6
for rt in range(0,6):
    th_movie_id = [r['movie_id'] for r in h5movies.where("(title == b'Tom and Huck (1995)')")][0]
    ratings[rt] = sum(1 for r in h5ratings.where("(movie_id == th_movie_id) & (rating == rt)"))

Wall time: 430 ms


In [26]:
ratings

[0, 4, 15, 28, 18, 3]

In [27]:
h5file.close()
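
In the normalized loop above, the `movie_id` lookup is repeated on every iteration even though its result never changes. The following sketch resolves the id once and then queries the ratings table a single time. The two toy tables, their contents, and the ids are hypothetical and only mirror the layout of `movies` and `ratings`.

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical miniature versions of the 'movies' and 'ratings' tables.
movies = np.zeros(2, dtype=[("movie_id", "<i4"), ("title", "S100")])
movies["movie_id"] = [1, 8]
movies["title"] = [b"Toy Story (1995)", b"Tom and Huck (1995)"]

ratings = np.zeros(5, dtype=[("movie_id", "<i4"), ("rating", "i1")])
ratings["movie_id"] = [8, 1, 8, 8, 1]
ratings["rating"] = [3, 5, 3, 4, 4]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.h5")
    with tables.open_file(path, mode="w") as h5:
        h5movies = h5.create_table("/", "movies", obj=movies)
        h5ratings = h5.create_table("/", "ratings", obj=ratings)

        # Step 1: resolve the title to its id once, in the small table.
        ids = h5movies.read_where("title == b'Tom and Huck (1995)'",
                                  field="movie_id")
        movie_id = int(ids[0])

        # Step 2: fetch that movie's ratings in one in-kernel query.
        selected = h5ratings.read_where("movie_id == wanted",
                                        condvars={"wanted": movie_id},
                                        field="rating")

        # Step 3: count how often each rating value 0..5 occurs.
        counts = np.bincount(selected, minlength=6).tolist()
```

This is the manual equivalent of a join: the small table is used to translate the title into a key, and the large table is filtered on that key.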