# Queries in PyTables

> Objectives:
> * Query HDF5 files without loading them in-memory
> * Query normalized and denormalized tables
> * Index columns in tables for accelerating queries

In [3]:
import os
import numpy as np
import pandas as pd
import tables

In [4]:
%ls -lh structuring compression

 Volume in drive D is Data
 Volume Serial Number is 2ACD-5F91

 Directory of D:\PyData-BCN


 Directory of D:\PyData-BCN\structuring

12-06-2017  06:50    <DIR>          .
12-06-2017  06:50    <DIR>          ..
12-06-2017  06:49         7.601.643 blosc-blosclz-5-shuffle.h5
12-06-2017  06:49         8.115.861 blosc-lz4-5-shuffle.h5
12-06-2017  06:50         6.952.939 blosc-lz4hc-5-shuffle.h5
12-06-2017  06:50        13.904.739 blosc-snappy-5-shuffle.h5
12-06-2017  06:50         6.321.787 blosc-zlib-5-shuffle.h5
12-06-2017  06:50         5.686.088 blosc-zstd-5-shuffle.h5
12-06-2017  06:49             5.280 layout.h5
12-06-2017  06:49       163.191.150 no-compressed.h5
12-06-2017  06:49         6.262.051 zlib-5-shuffle.h5
               9 File(s)    218.041.538 bytes

 Directory of D:\PyData-BCN\compression

12-06-2017  06:49    <DIR>          .
12-06-2017  06:49    <DIR>          ..
12-06-2017  06:49         5.221.064 blosc-5-shuffle.h5
12-06-2017  06:49         5.221.080 blosc-blosclz-5

File Not Found


## Querying in PyTables

### Denormalized tables

In [5]:
h5denorm = "structuring/blosc-zstd-5-shuffle.h5"
h5file = tables.open_file(h5denorm)
h5lens = h5file.root.lens

In [6]:
h5lens

/lens (Table(1000209,), shuffle, blosc:zstd(5)) ''
  description := {
  "user_id": Int32Col(shape=(), dflt=0, pos=0),
  "rating": Int8Col(shape=(), dflt=0, pos=1),
  "unix_timestamp": Int64Col(shape=(), dflt=0, pos=2),
  "title": StringCol(itemsize=100, shape=(), dflt=b'', pos=3),
  "genres": StringCol(itemsize=50, shape=(), dflt=b'', pos=4)}
  byteorder := 'little'
  chunkshape := (402,)

In [7]:
%%time
ratings = [0] * 6
for rt in range(0,6):
    ratings[rt] = sum(1 for r in h5lens.where("(title == b'Tom and Huck (1995)') & (rating == rt)"))

Wall time: 2.52 s


In [8]:
ratings

[0, 4, 15, 28, 18, 3]

In [9]:
h5file.close()
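
The query above relies on PyTables picking up `rt` from the calling namespace. As a minimal sketch, the same kind of condition can be written with the variables passed explicitly through `condvars`. The toy file, the `Rating` description and the sample rows below are made up for illustration and are not part of the MovieLens data.

```python
import os
import tempfile

import tables


class Rating(tables.IsDescription):
    # Hypothetical two-column layout, much smaller than the real `lens` table
    rating = tables.Int8Col(pos=0)
    title = tables.StringCol(itemsize=100, pos=1)


path = os.path.join(tempfile.mkdtemp(), "toy-denorm.h5")
with tables.open_file(path, mode="w") as h5:
    table = h5.create_table("/", "lens", Rating)
    table.append([
        (4, b"Tom and Huck (1995)"),
        (3, b"Tom and Huck (1995)"),
        (4, b"Toy Story (1995)"),
        (4, b"Tom and Huck (1995)"),
    ])
    table.flush()

    # Variables used in the condition are supplied explicitly via condvars
    cond = "(title == wanted) & (rating == rt)"
    counts = [0] * 6
    for rt in range(6):
        condvars = {"wanted": b"Tom and Huck (1995)", "rt": rt}
        counts[rt] = sum(1 for _ in table.where(cond, condvars=condvars))

print(counts)
```

Passing `condvars` makes it obvious which Python values take part in the query, which can help when the condition string is built far away from the variables it uses.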

Querying denormalized tables is easy as pie.  Let's see how to manage normalized ones.

### Normalized tables

In [10]:
h5norm = "compression/blosc-zstd-5-shuffle.h5"
h5file = tables.open_file(h5norm)
h5ratings = h5file.root.ratings
h5movies = h5file.root.movies

In [11]:
h5ratings

/ratings (Table(1000209,), shuffle, blosc:zstd(5)) ''
  description := {
  "user_id": Int32Col(shape=(), dflt=0, pos=0),
  "movie_id": Int32Col(shape=(), dflt=0, pos=1),
  "rating": Int8Col(shape=(), dflt=0, pos=2),
  "unix_timestamp": Int64Col(shape=(), dflt=0, pos=3)}
  byteorder := 'little'
  chunkshape := (7710,)

In [12]:
h5movies

/movies (Table(3883,), shuffle, blosc:zstd(5)) ''
  description := {
  "movie_id": Int32Col(shape=(), dflt=0, pos=0),
  "title": StringCol(itemsize=100, shape=(), dflt=b'', pos=1),
  "genres": StringCol(itemsize=50, shape=(), dflt=b'', pos=2)}
  byteorder := 'little'
  chunkshape := (425,)

In [13]:
%%time
ratings = [0] * 6
for rt in range(0,6):
    th_movie_id = [r['movie_id'] for r in h5movies.where("(title == b'Tom and Huck (1995)')")][0]
    ratings[rt] = sum(1 for r in h5ratings.where("(movie_id == th_movie_id) & (rating == rt)"))

Wall time: 370 ms


In [14]:
ratings

[0, 4, 15, 28, 18, 3]

In [15]:
h5file.close()

So, in this run the query on the normalized file is roughly 7x faster than on the denormalized one (370 ms versus 2.52 s).  However, this is just a simple example; in general you should experiment to determine the best layout for your data.
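
In the normalized query, the title lookup does not depend on `rt`, so one possible variant resolves `movie_id` once and then counts ratings in a single pass. This is only a sketch on made-up toy tables (the `Movie` and `Rating` descriptions and the rows are assumptions), and it is not the code that produced the timings above.

```python
import os
import tempfile

import numpy as np
import tables


class Movie(tables.IsDescription):
    movie_id = tables.Int32Col(pos=0)
    title = tables.StringCol(itemsize=100, pos=1)


class Rating(tables.IsDescription):
    movie_id = tables.Int32Col(pos=0)
    rating = tables.Int8Col(pos=1)


path = os.path.join(tempfile.mkdtemp(), "toy-norm.h5")
with tables.open_file(path, mode="w") as h5:
    movies = h5.create_table("/", "movies", Movie)
    movies.append([(1, b"Toy Story (1995)"), (8, b"Tom and Huck (1995)")])
    movies.flush()

    ratings = h5.create_table("/", "ratings", Rating)
    ratings.append([(8, 3), (1, 5), (8, 4), (8, 4), (1, 4)])
    ratings.flush()

    # Step 1: resolve the title to a movie_id once, in the small table
    ids = movies.read_where("title == wanted",
                            condvars={"wanted": b"Tom and Huck (1995)"},
                            field="movie_id")
    th_movie_id = int(ids[0])

    # Step 2: fetch the matching ratings in one query, then count per value
    found = ratings.read_where("movie_id == mid",
                               condvars={"mid": th_movie_id},
                               field="rating")
    counts = np.bincount(found, minlength=6).tolist()

print(counts)
```

Whether this helps on real data depends on how many rows match, since `read_where` materializes the matching values in memory.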

## Indexing

Indexing is a general technique for adding data structures that can accelerate queries.  Let's see how PyTables makes use of this.
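
As a small, hedged illustration of the API involved, the sketch below indexes one column of a toy table and checks that the query returns the same rows before and after. The table layout and the generated rows are invented for this example; `is_indexed`, `create_csindex`, `will_query_use_indexing` and `get_where_list` are the PyTables calls being shown.

```python
import os
import tempfile

import tables


class Rating(tables.IsDescription):
    rating = tables.Int8Col(pos=0)
    title = tables.StringCol(itemsize=100, pos=1)


path = os.path.join(tempfile.mkdtemp(), "toy-indexed.h5")
with tables.open_file(path, mode="w") as h5:
    table = h5.create_table("/", "lens", Rating)
    # Synthetic rows: 50 distinct titles, ratings cycling through 0..5
    table.append([(i % 6, b"Movie %d" % (i % 50)) for i in range(5000)])
    table.flush()

    cond = "(title == wanted) & (rating == rt)"
    condvars = {"wanted": b"Movie 7", "rt": 1}

    indexed_before = table.cols.title.is_indexed
    before = table.get_where_list(cond, condvars=condvars, sort=True).tolist()

    # Build a completely sorted index (CSI) on the title column
    table.cols.title.create_csindex()

    indexed_after = table.cols.title.is_indexed
    # Ask PyTables which indexed columns it can use for this condition
    usable = table.will_query_use_indexing(cond, condvars=condvars)
    after = table.get_where_list(cond, condvars=condvars, sort=True).tolist()

print(indexed_before, indexed_after, sorted(usable))
print(before == after)
```

An index changes how rows are located, not which rows match, so comparing `before` and `after` is a cheap sanity check when experimenting with indexes.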

### Denormalized case

In [16]:
# Copy the original PyTables table into another file
import shutil
h5idx = "movielens-denorm-indexed.h5"
if os.path.exists(h5idx):
    os.unlink(h5idx)
shutil.copyfile(h5denorm, h5idx)

'movielens-denorm-indexed.h5'

In [17]:
# Open the new file in 'a'ppend mode
h5i = tables.open_file(h5idx, mode="a")

In [18]:
# Create an index for the 'title' column
h5lens = h5i.root.lens
blosc_filter = tables.Filters(complevel=9, complib="blosc")
%time h5lens.cols.title.create_csindex(filters=blosc_filter)

Wall time: 3.37 s


1000209

In [19]:
%%time
ratings = [0] * 6
for rt in range(0,6):
    ratings[rt] = sum(1 for r in h5lens.where("(title == b'Tom and Huck (1995)') & (rating == rt)"))

Wall time: 19 ms


OK, so with the `title` index the query takes 19 ms, more than 100x less than the 2.52 s it took without indexing.  What if we index the `rating` column too?

In [20]:
ratings

[0, 4, 15, 28, 18, 3]

In [21]:
# Create an index for the rating column
%time h5lens.cols.rating.create_csindex(filters=blosc_filter)

Wall time: 824 ms


1000209

In [22]:
%%time
ratings = [0] * 6
for rt in range(0,6):
    ratings[rt] = sum(1 for r in h5lens.where("(title == b'Tom and Huck (1995)') & (rating == rt)"))

Wall time: 4 ms


OK, so although smaller in absolute terms, going from 19 ms to 4 ms is another improvement in performance.

In [23]:
ratings

[0, 4, 15, 28, 18, 3]

In [24]:
h5i.close()

### Normalized case

In [25]:
# Copy the original PyTables table into another file
import shutil
h5idx = "movielens-norm-indexed.h5"
if os.path.exists(h5idx):
    os.unlink(h5idx)
shutil.copyfile(h5norm, h5idx)

'movielens-norm-indexed.h5'

In [26]:
# Open the new file in 'a'ppend mode
h5i = tables.open_file(h5idx, mode="a")
h5ratings = h5i.root.ratings
h5movies = h5i.root.movies

In [27]:
# Create an index for the rating column
blosc_filter = tables.Filters(complevel=9, complib="blosc")
%time h5ratings.cols.rating.create_csindex(filters=blosc_filter)

Wall time: 702 ms


1000209

In [28]:
%%time
ratings = [0] * 6
for rt in range(6):
    th_movie_id = [r['movie_id'] for r in h5movies.where("(title == b'Tom and Huck (1995)')")][0]
    ratings[rt] = sum(1 for r in h5ratings.where("(movie_id == th_movie_id) & (rating == rt)"))

Wall time: 575 ms


Hmm, in this case indexing the `rating` column has not accelerated the query, at first sight at least: 575 ms here versus 370 ms without any index.

In [29]:
ratings

[0, 4, 15, 28, 18, 3]

In [30]:
# Create an index for the movie_id column
%time h5ratings.cols.movie_id.create_csindex(filters=blosc_filter)

Wall time: 747 ms


1000209

In [31]:
%%time
ratings = [0] * 6
for rt in range(6):
    th_movie_id = [r['movie_id'] for r in h5movies.where("(title == b'Tom and Huck (1995)')")][0]
    ratings[rt] = sum(1 for r in h5ratings.where("(movie_id == th_movie_id) & (rating == rt)"))

Wall time: 44 ms


This time the query is clearly faster (44 ms), but it still cannot compete with the indexed denormalized case, which at 4 ms is ~10x faster.

In [32]:
ratings

[0, 4, 15, 28, 18, 3]

In [33]:
h5i.close()

In [34]:
%ls -lh movielens*

 Volume in drive D is Data
 Volume Serial Number is 2ACD-5F91

 Directory of D:\PyData-BCN


 Directory of D:\PyData-BCN

29-05-2017  09:14    <DIR>          movielens-1m
12-06-2017  06:51        10.336.895 movielens-denorm-indexed.h5
12-06-2017  06:51        10.019.147 movielens-norm-indexed.h5
               2 File(s)     20.356.042 bytes
               1 Dir(s)   9.433.485.312 bytes free


## Exercise

We have not created an index for the `title` column in the normalized case.  Create such an index and determine whether there is a noticeable speed-up.  Explain why you think that is the case.  Note: the time for a cold query can be **significantly** different from that of a hot query.

In [35]:
# Copy the original PyTables table into another file
import shutil
h5idx2 = "movielens-norm-indexed2.h5"
if os.path.exists(h5idx2):
    os.unlink(h5idx2)
shutil.copyfile(h5idx, h5idx2)

'movielens-norm-indexed2.h5'

In [36]:
# Open the new file in 'a'ppend mode
h5i = tables.open_file(h5idx2, mode="a")
h5ratings = h5i.root.ratings
h5movies = h5i.root.movies

In [37]:
# Create an index for the title column of the movies table
%time h5movies.cols.title.create_csindex(filters=blosc_filter)

Wall time: 16 ms


3883

In [38]:
%%time
ratings = [0] * 6
for rt in range(6):
    th_movie_id = [r['movie_id'] for r in h5movies.where("(title == b'Tom and Huck (1995)')")][0]
    ratings[rt] = sum(1 for r in h5ratings.where("(movie_id == th_movie_id) & (rating == rt)"))

Wall time: 291 ms


In [39]:
ratings

[0, 4, 15, 28, 18, 3]

In [40]:
h5i.close()

So the first time the query runs after the index is built (cold query), the time is reduced a little, but not by much.  For subsequent queries (hot queries), the times are better, but they still do not reach those of the denormalized table.
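
One way to observe the cold/hot difference is to time the same query several times in a row right after reopening the file. The sketch below does this on a made-up `movies` table; the helper `time_runs` is a hypothetical name introduced here, and on such a tiny file (and with the operating system's file cache possibly still warm) the measured gap may be negligible.

```python
import os
import tempfile
import time

import tables


def time_runs(run_query, repeats=3):
    """Call run_query() `repeats` times; return (results, seconds per call)."""
    results, seconds = [], []
    for _ in range(repeats):
        start = time.perf_counter()
        results.append(run_query())
        seconds.append(time.perf_counter() - start)
    return results, seconds


class Movie(tables.IsDescription):
    movie_id = tables.Int32Col(pos=0)
    title = tables.StringCol(itemsize=100, pos=1)


path = os.path.join(tempfile.mkdtemp(), "toy-movies.h5")
with tables.open_file(path, mode="w") as h5:
    movies = h5.create_table("/", "movies", Movie)
    movies.append([(i, b"Movie %d" % i) for i in range(4000)])
    movies.flush()
    movies.cols.title.create_csindex()

# Reopen the file so the first run does not reuse PyTables caches
# from the writing session (the OS page cache may still be warm).
with tables.open_file(path, mode="r") as h5:
    movies = h5.root.movies

    def lookup():
        return movies.read_where("title == wanted",
                                 condvars={"wanted": b"Movie 1995"},
                                 field="movie_id").tolist()

    results, seconds = time_runs(lookup)

for run, elapsed in enumerate(seconds):
    label = "cold" if run == 0 else "hot"
    print(f"run {run} ({label}): {elapsed * 1e3:.3f} ms")
```

Only the first run after opening the file is treated as cold here; a stricter measurement would also need to clear the operating system's cache, which is outside the scope of this sketch.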