# 5 Queries in PyTables

> Objectives:
> * How to query HDF5 files without loading them in-memory
> * How to query normalized and denormalized tables

In [22]:
import os
import numpy as np
import pandas as pd
import tables

In [23]:
import os
import shutil
data_dir = "queries"
if os.path.exists(data_dir):
    shutil.rmtree(data_dir)
os.mkdir(data_dir)

In [24]:
!ls -lh compression

total 187M
-rw-r--r-- 1 tomkooij 197613 5.5M Jun 26 09:54 blosc-zstd-5-shuffle-denorm.h5
-rw-r--r-- 1 tomkooij 197613 4.3M Jun 26 09:50 blosc-zstd-5-shuffle.h5
-rw-r--r-- 1 tomkooij 197613 156M Jun 26 09:53 no-compression-denorm.h5
-rw-r--r-- 1 tomkooij 197613  17M Jun 26 09:50 no-compression.h5
-rw-r--r-- 1 tomkooij 197613 4.3M Jun 26 09:48 zlib-5-shuffle.h5


We need these files, created in the last notebook:

* `compression/no-compression-denorm.h5`
* `compression/blosc-zstd-5-shuffle.h5`
* `compression/blosc-zstd-5-shuffle-denorm.h5`

In [80]:
for fn in ["compression/no-compression-denorm.h5",
           "compression/blosc-zstd-5-shuffle.h5",
           "compression/blosc-zstd-5-shuffle-denorm.h5"]:
    # Fail early, pointing at the notebook that creates the file
    assert os.path.exists(fn), "Missing datafile %s: Rerun 04-Using-Compression Notebook" % fn

AssertionError: Missing datafile compression/blosc-zstd-5-shuffle-denorm.h5: Rerun 04-Using-Compression Notebook

## Querying in PyTables

Searching in tables is one of the most common and time-consuming operations that a typical user faces when mining their data. Being able to run queries as fast as possible is therefore key in data-heavy applications.


In [26]:
# MovieLens-1M (denormalized), not compressed:
fn = "compression/no-compression-denorm.h5"
h5file = tables.open_file(fn)
table = h5file.root.lens

### read table and query in numpy

Naive solution: read the table into memory and select the rows using `numpy`:

In [69]:
x = table.read() 

In [70]:
sum(1 for row in x[x['rating'] >= 4])  # count rows with rating >= 4

575281

Let's benchmark:

In [68]:
%%timeit
x = table.read()
sum(1 for row in x[x['rating'] >= 4])

341 ms ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


This only works for tables that fit into memory (and we are abusing the OS file cache), but it provides a baseline for the queries that follow.
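The count above loops over the selected rows in Python. As a side note, here is a minimal sketch of letting `numpy` do the counting instead. The small array is a made-up stand-in for the result of `table.read()`, so no timing claim is attached to it:

```python
import numpy as np

# Hypothetical stand-in for the structured array returned by table.read()
dtype = [("user_id", "<i4"), ("rating", "i1")]
x = np.array([(1, 5), (6, 4), (8, 3), (9, 2), (10, 4)], dtype=dtype)

mask = x["rating"] >= 4                 # boolean array, one entry per row

n_python = sum(1 for _ in x[mask])      # Python-level loop, as in the cell above
n_numpy = int(np.count_nonzero(mask))   # counting done inside numpy

print(n_python, n_numpy)
```

Both counts agree; the second form avoids creating a Python object for every selected row.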

###  table.iterrows()

`table.iterrows()` returns an iterator over ALL rows. Using this iterator, we can avoid loading the whole table into memory.

In [48]:
sum(1 for x in table.iterrows() if x['rating'] >= 4)

575281

In [49]:
%%timeit
sum(1 for x in table.iterrows() if x['rating'] >= 4)

297 ms ± 19.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Better, but we can do better still:

### table.where()

`table.where()` performs an in-kernel query and returns a row iterator that iterates over the selected rows only:

In [50]:
type(table.where('rating >= 4'))

tables.tableextension.Row

In [51]:
%%timeit
ts = sum(1 for row in table.where("rating >= 4"))

210 ms ± 25.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


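The difference between `table.where()` and `table.iterrows()` is that `where()` performs an **in-kernel** query: the condition string is evaluated in compiled code (PyTables uses Numexpr for this), so only the matching rows are handed to the Python loop.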
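To make the contrast concrete, here is a minimal sketch on a tiny synthetic table. The file name `demo.h5`, the node name `demo`, and the data are made up for illustration. It also shows the `condvars` argument, which passes Python values into the condition explicitly instead of relying on variables picked up from the surrounding namespace:

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical miniature of the ratings data (values are made up)
dtype = np.dtype([("user_id", "<i4"), ("rating", "i1")])
rows = np.array([(1, 5), (2, 3), (3, 4), (4, 1), (5, 4)], dtype=dtype)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.h5")
    with tables.open_file(path, mode="w") as h5:
        demo = h5.create_table("/", "demo", obj=rows)
        demo.flush()

        # Regular query: every row reaches Python, which applies the test
        n_iterrows = sum(1 for row in demo.iterrows() if row["rating"] >= 4)

        # In-kernel query: the condition string is evaluated in compiled code
        n_where = sum(1 for row in demo.where("rating >= 4"))

        # Same query, with the threshold passed explicitly via condvars
        n_condvars = sum(
            1 for row in demo.where("rating >= threshold",
                                    condvars={"threshold": 4})
        )

print(n_iterrows, n_where, n_condvars)
```

All three counts are the same; only where the comparison is evaluated differs.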

### read_where()

`table.read_where()` reads all table rows that match a query:

In [39]:
x = table.read_where("rating >= 4")
x

array([ (   1, 5,  978824268, b'Toy Story (1995)', b"Animation|Children's|Comedy"),
       (   6, 4,  978237008, b'Toy Story (1995)', b"Animation|Children's|Comedy"),
       (   8, 4,  978233496, b'Toy Story (1995)', b"Animation|Children's|Comedy"),
       ...,
       (5812, 4,  992072099, b'Contender, The (2000)', b'Drama|Thriller'),
       (5837, 4, 1011902656, b'Contender, The (2000)', b'Drama|Thriller'),
       (5998, 4, 1001781044, b'Contender, The (2000)', b'Drama|Thriller')],
      dtype=[('user_id', '<i4'), ('rating', 'i1'), ('unix_timestamp', '<i8'), ('title', 'S100'), ('genres', 'S50')])

In [40]:
len(x)

575281

In [41]:
%%timeit
x = table.read_where("rating >= 4")
sum(1 for row in x)

1.5 s ± 49.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


The poor performance is due to `read_where()` itself, which is about four times slower than the `numpy` benchmark (341 ms). Timing the call on its own confirms this:

In [53]:
%timeit x = table.read_where("rating >= 4")

1.42 s ± 31.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
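`read_where()` returns complete rows, including the wide string columns. If only one column or the row positions are needed, two variants may be worth trying: the `field` argument of `read_where()` and `get_where_list()`. The sketch below uses a made-up three-row table, and whether either variant narrows the gap on the MovieLens file was not measured here:

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical miniature of the denormalized table (values are made up)
dtype = np.dtype([("user_id", "<i4"), ("rating", "i1"), ("title", "S100")])
rows = np.array(
    [(1, 5, b"Toy Story (1995)"),
     (6, 4, b"Toy Story (1995)"),
     (7, 2, b"Toy Story (1995)")],
    dtype=dtype,
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.h5")
    with tables.open_file(path, mode="w") as h5:
        demo = h5.create_table("/", "demo", obj=rows)
        demo.flush()

        # Complete matching rows as a structured array
        full = demo.read_where("rating >= 4")

        # Only the user_id column of the matching rows
        user_ids = demo.read_where("rating >= 4", field="user_id")

        # Only the row numbers (coordinates) of the matching rows
        coords = demo.get_where_list("rating >= 4")

print(len(full), user_ids.tolist(), coords.tolist())
```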


## Normalized vs Denormalized tables

Let's compare a "real life" query: count the ratings, per rating value, for the movie `Tom and Huck (1995)`.

### Denormalized

In [54]:
h5denorm = "compression/blosc-zstd-5-shuffle-denorm.h5"
h5file = tables.open_file(h5denorm)
h5lens = h5file.root.lens

In [55]:
h5lens

/lens (Table(1000209,), shuffle, blosc:zstd(5)) ''
  description := {
  "user_id": Int32Col(shape=(), dflt=0, pos=0),
  "rating": Int8Col(shape=(), dflt=0, pos=1),
  "unix_timestamp": Int64Col(shape=(), dflt=0, pos=2),
  "title": StringCol(itemsize=100, shape=(), dflt=b'', pos=3),
  "genres": StringCol(itemsize=50, shape=(), dflt=b'', pos=4)}
  byteorder := 'little'
  chunkshape := (402,)

In [56]:
%%time
ratings = [0] * 6
for rt in range(0,6):
    ratings[rt] = sum(1 for r in h5lens.where("(title == b'Tom and Huck (1995)') & (rating == rt)"))

Wall time: 2.34 s


In [57]:
ratings

[0, 4, 15, 28, 18, 3]

In [58]:
h5file.close()

Querying denormalized tables is easy as pie: a single condition does the job. Normalized tables need two steps: look up the `movie_id` in `movies`, then query `ratings` with it.

### Normalized tables

In [72]:
h5norm = "compression/blosc-zstd-5-shuffle.h5"
h5file = tables.open_file(h5norm)
h5ratings = h5file.root.ratings
h5movies = h5file.root.movies

In [73]:
h5ratings

/ratings (Table(1000209,), shuffle, blosc:zstd(5)) ''
  description := {
  "user_id": Int32Col(shape=(), dflt=0, pos=0),
  "movie_id": Int32Col(shape=(), dflt=0, pos=1),
  "rating": Int8Col(shape=(), dflt=0, pos=2),
  "unix_timestamp": Int64Col(shape=(), dflt=0, pos=3)}
  byteorder := 'little'
  chunkshape := (7710,)

In [74]:
h5movies

/movies (Table(3883,), shuffle, blosc:zstd(5)) ''
  description := {
  "movie_id": Int32Col(shape=(), dflt=0, pos=0),
  "title": StringCol(itemsize=100, shape=(), dflt=b'', pos=1),
  "genres": StringCol(itemsize=50, shape=(), dflt=b'', pos=2)}
  byteorder := 'little'
  chunkshape := (425,)

In [77]:
%%time
ratings = [0] * 6
for rt in range(0,6):
    th_movie_id = [r['movie_id'] for r in h5movies.where("(title == b'Tom and Huck (1995)')")][0]
    ratings[rt] = sum(1 for r in h5ratings.where("(movie_id == th_movie_id) & (rating == rt)"))

Wall time: 420 ms


In [78]:
ratings

[0, 4, 15, 28, 18, 3]

In [79]:
h5file.close()
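The two-step lookup can also be written so that the movie id is fetched once and the ratings table is scanned once. This is a sketch under assumptions, not the notebook's method: the tables are tiny synthetic stand-ins, the names `wanted` and `mid` are illustrative, and `np.bincount` replaces the loop over rating values:

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical miniatures of the movies and ratings tables (values are made up)
movies = np.array(
    [(1, b"Toy Story (1995)"), (8, b"Tom and Huck (1995)")],
    dtype=[("movie_id", "<i4"), ("title", "S100")],
)
ratings_rows = np.array(
    [(1, 8, 3), (2, 8, 4), (3, 1, 5), (4, 8, 3)],
    dtype=[("user_id", "<i4"), ("movie_id", "<i4"), ("rating", "i1")],
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.h5")
    with tables.open_file(path, mode="w") as h5:
        t_movies = h5.create_table("/", "movies", obj=movies)
        t_ratings = h5.create_table("/", "ratings", obj=ratings_rows)
        h5.flush()

        # Step 1: look up the movie id once, outside any loop
        ids = t_movies.read_where(
            "title == wanted",
            condvars={"wanted": b"Tom and Huck (1995)"},
            field="movie_id",
        )
        movie_id = int(ids[0])

        # Step 2: one in-kernel pass over ratings, keeping only the rating column
        found = t_ratings.read_where(
            "movie_id == mid", condvars={"mid": movie_id}, field="rating"
        )

# Count how often each rating value 0..5 occurs
counts = np.bincount(found, minlength=6).tolist()
print(movie_id, counts)
```

Applied to the real file, `counts` would play the role of the `ratings` list above.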