# 6 Indexed Queries in PyTables

> Objectives:
>
> * Learn how to index columns in tables for accelerating queries
> * Experiment with different indexes and/or compression
> * Discover some limitations of indexed queries.

In [1]:
import os
import numpy as np
import pandas as pd
import tables

Indexing is a general technique for adding data structures that can accelerate queries.  Let's see how PyTables makes use of this.
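
To make the idea concrete, here is a minimal pure-Python sketch. It is not how PyTables implements its indexes: it only models an index as a sorted copy of a column plus the row numbers the values came from, so that a range query becomes two binary searches instead of a scan over every row. The helpers `build_index`, `query_scan` and `query_indexed` are hypothetical names invented for this illustration.

```python
import bisect
import random

def build_index(values):
    """Return the sorted values and the row numbers they came from."""
    order = sorted(range(len(values)), key=values.__getitem__)
    return [values[i] for i in order], order

def query_scan(values, lo, hi):
    """Full scan: test every row against lo < value < hi."""
    return [i for i, v in enumerate(values) if lo < v < hi]

def query_indexed(sorted_values, row_numbers, lo, hi):
    """Indexed lookup: two binary searches delimit the matching slice."""
    start = bisect.bisect_right(sorted_values, lo)   # first value > lo
    stop = bisect.bisect_left(sorted_values, hi)     # first value >= hi
    return sorted(row_numbers[start:stop])

random.seed(0)
values = [random.randint(0, 1000) for _ in range(10_000)]
sorted_values, row_numbers = build_index(values)

# Both strategies must select exactly the same rows.
assert query_scan(values, 100, 110) == query_indexed(sorted_values, row_numbers, 100, 110)
```

The price is the one-off cost of building (and storing) the sorted structure, which is the same trade-off measured for `create_csindex()` in the cells that follow.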

### Denormalized case

In [2]:
# continue from the previous notebook
data_dir = 'queries'
h5denorm = "compression/blosc-zstd-5-shuffle-denorm.h5"
h5norm = "compression/blosc-zstd-5-shuffle.h5"

In [3]:
## Copy the original PyTables table into another file
import shutil
h5idx = os.path.join(data_dir, "movielens-denorm-indexed.h5")
if os.path.exists(h5idx):
    os.unlink(h5idx)
shutil.copyfile(h5denorm, h5idx)

'queries\\movielens-denorm-indexed.h5'

In [4]:
# Open the new file in 'a'ppend mode
h5i = tables.open_file(h5idx, mode="a")

In [5]:
# Create an index for the 'title' column
h5lens = h5i.root.lens
blosc_filter = tables.Filters(complevel=9, complib="blosc")
%time h5lens.cols.title.create_csindex(filters=blosc_filter)

Wall time: 4.56 s


1000209

In [6]:
%%time
ratings = [0] * 6
for rt in range(0,6):
    ratings[rt] = sum(1 for r in h5lens.where("(title == b'Tom and Huck (1995)') & (rating == rt)"))

Wall time: 31.1 ms


In [7]:
ratings

[0, 4, 15, 28, 18, 3]

OK, so this query takes about 100x less time than it did without an index.  What if we index the `rating` column too?

In [8]:
# Create an index for the rating column
%time h5lens.cols.rating.create_csindex(filters=blosc_filter)

Wall time: 845 ms


1000209

In [9]:
%%time
ratings = [0] * 6
for rt in range(0,6):
    ratings[rt] = sum(1 for r in h5lens.where("(title == b'Tom and Huck (1995)') & (rating == rt)"))

Wall time: 4 ms


OK, so indexing `rating` as well gives another improvement: in this run the query time drops from about 31 ms to 4 ms.
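
One way to check which indexed columns a condition can make use of is `Table.will_query_use_indexing()`. The sketch below does not touch the MovieLens file: the file name `demo.h5`, the table name `lens` and the synthetic rows are assumptions made for the example. It inspects the same kind of two-condition query before and after the second column is indexed.

```python
import os
import tempfile

import numpy as np
import tables

# Synthetic stand-in for the denormalized table: 1000 rows, two columns.
rows = np.zeros(1000, dtype=[("title", "S20"), ("rating", "i4")])
rows["title"] = [b"Movie %d" % (i % 50) for i in range(1000)]
rows["rating"] = np.arange(1000) % 5 + 1

condition = "(title == b'Movie 7') & (rating == rt)"

with tempfile.TemporaryDirectory() as tmp_dir:
    with tables.open_file(os.path.join(tmp_dir, "demo.h5"), mode="w") as f:
        lens = f.create_table("/", "lens", obj=rows)

        # Index only the title column, then ask which indexes the query can use.
        lens.cols.title.create_csindex()
        only_title = lens.will_query_use_indexing(condition, {"rt": 3})

        # Index the rating column too and ask again.
        lens.cols.rating.create_csindex()
        both = lens.will_query_use_indexing(condition, {"rt": 3})

        n_matches = len(lens.read_where(condition, {"rt": 3}))

print(sorted(only_title))  # usable indexed columns with only `title` indexed
print(sorted(both))        # usable indexed columns once `rating` is indexed too
print(n_matches)           # number of rows selected by the condition
```

The method only reports which indexes are usable for the condition; it says nothing about how much time they save, so timing the query as done above remains necessary.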

In [10]:
ratings

[0, 4, 15, 28, 18, 3]

In [11]:
h5i.close()

### Normalized case

In [12]:
## Copy the original PyTables table into another file
import shutil
h5idx = os.path.join(data_dir, "movielens-norm-indexed.h5")
if os.path.exists(h5idx):
    os.unlink(h5idx)
shutil.copyfile(h5norm, h5idx)

'queries\\movielens-norm-indexed.h5'

In [13]:
# Open the new file in 'a'ppend mode
h5i = tables.open_file(h5idx, mode="a")
h5ratings = h5i.root.ratings
h5movies = h5i.root.movies

In [14]:
# Create an index for the rating column
blosc_filter = tables.Filters(complevel=9, complib="blosc")
%time h5ratings.cols.rating.create_csindex(filters=blosc_filter)

Wall time: 626 ms


1000209

In [15]:
%%time
ratings = [0] * 6
for rt in range(6):
    th_movie_id = [r['movie_id'] for r in h5movies.where("(title == b'Tom and Huck (1995)')")][0]
    ratings[rt] = sum(1 for r in h5ratings.where("(movie_id == th_movie_id) & (rating == rt)"))

Wall time: 489 ms


Hmm, in this case indexing the rating column has not served to accelerate the query (at first sight at least).
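
One plausible reading of this result is selectivity: an index helps most when its condition discards nearly all rows. The back-of-the-envelope sketch below uses only numbers printed in this notebook (1,000,209 rows, and the per-rating counts for this movie) plus one assumption, namely that ratings are spread evenly over the five values 1 to 5. That is a simplification, not a measured distribution.

```python
total_rows = 1_000_209                  # rows reported by create_csindex()
movie_counts = [0, 4, 15, 28, 18, 3]    # ratings 0..5 for 'Tom and Huck (1995)'

# Assumption (not measured): ratings are spread evenly over the values 1..5.
rows_per_rating = total_rows / 5

# Rows carrying this movie's id, taken from the query result shown earlier.
rows_for_movie = sum(movie_counts)

print(f"rating == rt   keeps ~{rows_per_rating:,.0f} candidate rows")
print(f"movie_id == id keeps  {rows_for_movie} candidate rows")
print(f"ratio: ~{rows_per_rating / rows_for_movie:,.0f}x more candidates via rating")
```

In this simplified view, the `rating` condition on its own still leaves a very large number of candidate rows that have to be checked against `movie_id`, whereas the `movie_id` condition narrows the search to a few dozen rows.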

In [16]:
ratings

[0, 4, 15, 28, 18, 3]

In [17]:
# Create an index for the movie_id column
%time h5ratings.cols.movie_id.create_csindex(filters=blosc_filter)

Wall time: 688 ms


1000209

In [18]:
%%time
ratings = [0] * 6
for rt in range(6):
    th_movie_id = [r['movie_id'] for r in h5movies.where("(title == b'Tom and Huck (1995)')")][0]
    ratings[rt] = sum(1 for r in h5ratings.where("(movie_id == th_movie_id) & (rating == rt)"))

Wall time: 37.5 ms


This time the query is clearly faster, but it still cannot compete with the query speed of the denormalized case (which is ~10x faster).

In [19]:
ratings

[0, 4, 15, 28, 18, 3]

In [20]:
h5i.close()

In [21]:
!ls -lh {data_dir}

total 99M
-rw-r--r-- 1 tomkooij 197613  79M Jun 22 12:58 keyvalue.h5
-rw-r--r-- 1 tomkooij 197613 9.9M Jun 23 10:39 movielens-denorm-indexed.h5
-rw-r--r-- 1 tomkooij 197613 9.6M Jun 23 10:40 movielens-norm-indexed.h5


## Exercise

We have not created an index for the title in the normalized case.  Create such an index and determine whether there is a noticeable speed-up.  Explain why you think that is the case.  Note: the times for a cold query can be **significantly** different from those for a hot query.
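
To see the cold/hot difference directly, it can help to time the first call separately from the following ones instead of averaging them. The helper below is a hypothetical sketch (`time_cold_and_hot` is not part of PyTables). It accepts any zero-argument callable, for instance a function wrapping the `where()` loop used in this notebook; the `fake_query` workload is only a stand-in so that the sketch runs on its own.

```python
import time

def time_cold_and_hot(query, repeats=5):
    """Return (cold_seconds, best_hot_seconds) for a zero-argument callable."""
    start = time.perf_counter()
    query()                                   # first run in this process
    cold = time.perf_counter() - start

    hot_times = []
    for _ in range(repeats):
        start = time.perf_counter()
        query()                               # later runs may hit warm caches
        hot_times.append(time.perf_counter() - start)
    return cold, min(hot_times)

# Stand-in workload: expensive on the first call, cached afterwards.
cache = {}
def fake_query():
    if "result" not in cache:
        cache["result"] = sum(i * i for i in range(200_000))
    return cache["result"]

cold, hot = time_cold_and_hot(fake_query)
print(f"cold: {cold * 1e3:.2f} ms, best hot: {hot * 1e3:.3f} ms")
```

Note that the first call in a process is only truly cold if the operating system has not already cached the file, so a measurement taken right after copying or indexing the file may look warmer than expected.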

In [22]:
## Copy the original PyTables table into another file
import shutil
h5idx2 = "movielens-norm-indexed2.h5"
if os.path.exists(h5idx2):
    os.unlink(h5idx2)
shutil.copyfile(h5idx, h5idx2)

'movielens-norm-indexed2.h5'

In [23]:
# Open the new file in 'a'ppend mode
h5i = tables.open_file(h5idx2, mode="a")
h5ratings = h5i.root.ratings
h5movies = h5i.root.movies

In [24]:
#
#
# Solution starts here
#
#

In [25]:
# Create an index for the title column of the movies table
%time h5movies.cols.title.create_csindex(filters=blosc_filter)

Wall time: 14 ms


3883

In [26]:
%%time
ratings = [0] * 6
for rt in range(6):
    th_movie_id = [r['movie_id'] for r in h5movies.where("(title == b'Tom and Huck (1995)')")][0]
    ratings[rt] = sum(1 for r in h5ratings.where("(movie_id == th_movie_id) & (rating == rt)"))

Wall time: 284 ms


In [27]:
ratings

[0, 4, 15, 28, 18, 3]

In [28]:
h5i.close()

So the first time the query runs after the index is built (cold query), the time is reduced a bit, but not by much.  Subsequent queries (hot queries) are faster, but they still do not reach the speed of the denormalized table.

## Exercise

Query size vs speed (indexed queries vs non-indexed queries)

We create a (large) file containing `(key, value)` pairs. The `value` is an `int64`. The `key` is a random 10-byte string, chosen so that the data compresses roughly like real data would.

In [29]:
N = 20  # append 20 blocks of 100,000 rows each (2M rows in total)

In [30]:
# adapted from: https://stackoverflow.com/questions/20769818/

import random
import string

class KeyValue(tables.IsDescription):
    key = tables.StringCol(itemsize=10, dflt=" ", pos=0)  
    value = tables.Int64Col(dflt=0, pos=1)

fn = os.path.join(data_dir, "keyvalue.h5")

with tables.open_file(fn, "w") as f:    
    filters = tables.Filters(complevel=5, complib='blosc')
    kv = f.create_table("/", "keyvalues", KeyValue, filters=filters)

    for j in range(1, N+1):
        values = []
        print('block: ', j)
        for _ in range(100000):
            key = "".join(random.sample(string.ascii_uppercase, 10))  # slow!
            value = random.randint(0, 1000000)
            values.append((key, value))
        kv.append(values)

block:  1
block:  2
block:  3
block:  4
block:  5
block:  6
block:  7
block:  8
block:  9
block:  10
block:  11
block:  12
block:  13
block:  14
block:  15
block:  16
block:  17
block:  18
block:  19
block:  20


In [31]:
!ptdump -v -R10 {fn}

/ (RootGroup) ''
/keyvalues (Table(2000000,), shuffle, blosc(5)) ''
  description := {
  "key": StringCol(itemsize=10, shape=(), dflt=b' ', pos=0),
  "value": Int64Col(shape=(), dflt=0, pos=1)}
  byteorder := 'little'
  chunkshape := (3640,)
  Data dump:
[0] (b'OPLGDUVKMQ', 276136)
[1] (b'SPTKNVBHWI', 522686)
[2] (b'LJZBYWGOTE', 450625)
[3] (b'FLEYROGDWB', 262028)
[4] (b'GJUVFCPQKH', 145975)
[5] (b'QKLSWEYHFB', 87708)
[6] (b'STFJEAPZQD', 590688)
[7] (b'BSVHUZRNAL', 469617)
[8] (b'YFSVBOCLTX', 847850)
[9] (b'EZRVMYCIFP', 571338)


Query the `value` column and compare different query (result) sizes.
Compare indexed queries with unindexed queries.
*Optional: compare different compression levels and codecs.*

For example: `'(value > 100000) & (value < 100010)'`
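
Before timing anything, it is useful to estimate how many rows each window should return. The sketch assumes what the generation code above does: 2,000,000 rows whose `value` is drawn uniformly from the 1,000,001 integers in `[0, 1000000]`. `expected_rows` is a hypothetical helper, and `get_query` mirrors the query builder used in the results below.

```python
N_ROWS = 2_000_000          # 20 blocks of 100,000 rows
N_VALUES = 1_000_001        # random.randint(0, 1000000) includes both ends
X = 100_000

def get_query(max_value):
    return '(value > %s) & (value <%s)' % (X, X + max_value)

def expected_rows(max_value):
    """Expected matches for X < value < X + max_value under a uniform draw."""
    integers_in_window = max_value - 1      # both bounds are exclusive
    return N_ROWS * integers_in_window / N_VALUES

for max_value in [10, 50, 100, 1000, 10000]:
    print(f"{get_query(max_value):40s} ~{expected_rows(max_value):8.0f} rows expected")
```

Since there are about two rows per distinct value, the result size grows roughly as twice the window width, which gives a convenient knob for testing how query time depends on the number of rows returned.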


In [32]:
#
#
# Results start here
#
#


In [33]:
max_values = [10, 50, 100, 1000, 10000]
X = 100000

def get_query(max_value):
    return '(value > %s) & (value <%s)' % (X, X+max_value)


with tables.open_file(fn, "a") as f:
    kv = f.root.keyvalues
    #kv = f.root.sorted
    
    kv.cols.value.remove_index()

    for max_value in max_values:
        query = get_query(max_value)
        print('max_value=%d : len=%d' % (max_value, len(kv.read_where(query))))
    
    print('\nwithout index:')
    for max_value in max_values:
        query = get_query(max_value)
        %timeit sum(1 for x in kv.where(query))

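    # NOTE: blosc_filter is defined but not passed to create_csindex() below,
    # so the index is built with the default filters (zlib(1) in the ptdump output).
    blosc_filter = tables.Filters(complevel=9, complib="blosc")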
    print('\nindexing...')
    %time kv.cols.value.create_csindex()

    print('\nwith index')
    for max_value in max_values:
        query = get_query(max_value)
        %timeit sum(1 for x in kv.where(query))

max_value=10 : len=24
max_value=50 : len=95
max_value=100 : len=206
max_value=1000 : len=2061
max_value=10000 : len=20102

without index:
56.3 ms ± 3.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
55.3 ms ± 901 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
58 ms ± 2.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
56.6 ms ± 437 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
59.1 ms ± 853 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

indexing...
Wall time: 3.6 s

with index
229 µs ± 22.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3.94 ms ± 19.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.65 ms ± 120 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
72.1 ms ± 2.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
76.8 ms ± 506 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


## Exercise (Optional)

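Indexed queries with large result sets are difficult to accelerate, and PyTables is not optimised for them. In general, the results are comparable to unindexed queries: in the runs above, the indexed queries returning about 2,000 and 20,000 rows took 72 to 77 ms, versus 57 to 59 ms without an index.

For extreme performance, try an indexed query on a sorted table.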
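
A rough way to see why sorting can help: the table is stored in chunks (3640 rows each here, per the `ptdump` output), so the cost of fetching the matching rows depends on how many distinct chunks contain them. The following pure-Python model is an illustration under that assumption, not a description of PyTables internals. It uses a smaller synthetic table (200,000 rows) to stay fast, and `chunks_touched` is a hypothetical helper.

```python
import random

CHUNK_ROWS = 3640            # chunkshape reported by ptdump
N_ROWS = 200_000             # smaller than the real table, to keep this quick
X = 100_000

def chunks_touched(values, lo, hi):
    """Count matching rows and the distinct chunks that contain them."""
    hits = [i for i, v in enumerate(values) if lo < v < hi]
    return len(hits), len({i // CHUNK_ROWS for i in hits})

random.seed(42)
unsorted_values = [random.randint(0, 1_000_000) for _ in range(N_ROWS)]
sorted_values = sorted(unsorted_values)
total_chunks = -(-N_ROWS // CHUNK_ROWS)      # ceiling division

for width in [100, 1000, 10000, 100000]:
    n, scattered = chunks_touched(unsorted_values, X, X + width)
    _, contiguous = chunks_touched(sorted_values, X, X + width)
    print(f"width={width:6d} rows={n:6d} "
          f"chunks: unsorted={scattered}/{total_chunks} sorted={contiguous}/{total_chunks}")
```

In this model, once a query matches more rows than there are chunks, the matches in the unsorted table are spread over nearly every chunk, so little reading is saved relative to a full scan. In the sorted table the same matches sit in a few neighbouring chunks.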

In [34]:
%%time
with tables.open_file(fn, 'a') as f:
    table = f.root.keyvalues[:]
    table.sort(order='value')
    f.create_table('/', 'sorted', obj=table)

Wall time: 5.31 s


In [35]:
max_values = [10, 50, 100, 1000, 10000]
X = 100000

def get_query(max_value):
    return '(value > %s) & (value <%s)' % (X, X+max_value)


with tables.open_file(fn, "a") as f:
    #kv = f.root.keyvalues
    kv = f.root.sorted
    
    kv.cols.value.remove_index()

    for max_value in max_values:
        query = get_query(max_value)
        print('max_value=%d : len=%d' % (max_value, len(kv.read_where(query))))
    
    print('\nwithout index:')
    for max_value in max_values:
        query = get_query(max_value)
        %timeit sum(1 for x in kv.where(query))

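    # NOTE: blosc_filter is defined but not passed to create_csindex() below,
    # so the index is built with the default filters (zlib(1) in the ptdump output).
    blosc_filter = tables.Filters(complevel=9, complib="blosc")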
    print('\nindexing...')
    %time kv.cols.value.create_csindex()

    print('\nwith index')
    for max_value in max_values:
        query = get_query(max_value)
        %timeit sum(1 for x in kv.where(query))

max_value=10 : len=24
max_value=50 : len=95
max_value=100 : len=206
max_value=1000 : len=2061
max_value=10000 : len=20102

without index:
44.5 ms ± 696 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
45.1 ms ± 216 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
43.9 ms ± 268 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
45.2 ms ± 536 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
46.6 ms ± 363 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

indexing...
Wall time: 2.94 s

with index
188 µs ± 7.75 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
261 µs ± 9.06 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
421 µs ± 6.41 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.38 ms ± 30 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
4.58 ms ± 63.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [37]:
!ptdump -v {fn}

/ (RootGroup) ''
/keyvalues (Table(2000000,), shuffle, blosc(5)) ''
  description := {
  "key": StringCol(itemsize=10, shape=(), dflt=b' ', pos=0),
  "value": Int64Col(shape=(), dflt=0, pos=1)}
  byteorder := 'little'
  chunkshape := (3640,)
  autoindex := True
  colindexes := {
    "value": Index(9, full, shuffle, zlib(1)).is_csi=True}
/sorted (Table(2000000,)) ''
  description := {
  "key": StringCol(itemsize=10, shape=(), dflt=b'', pos=0),
  "value": Int64Col(shape=(), dflt=0, pos=1)}
  byteorder := 'little'
  chunkshape := (3640,)
  autoindex := True
  colindexes := {
    "value": Index(9, full, shuffle, zlib(1)).is_csi=True}
