# 6 Indexed Queries in PyTables

> Objectives:
>
> * Learn how to index columns in tables for accelerating queries
> * Experiment with different indexes and/or compression
> * Discover some limitations of indexed queries.

In [1]:
import os
import numpy as np
import pandas as pd
import tables

Indexing is a general technique for adding data structures that can accelerate queries.  Let's see how PyTables makes use of this.
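As a rough illustration of the general idea (not of how PyTables implements its own indexes), the sketch below builds a sorted list of `(value, row)` pairs and answers an equality query with a binary search instead of a full scan. The data and the helper names `scan_equal` and `indexed_equal` are made up for this example.

```python
import bisect
import random

random.seed(0)
# Made-up column: 100k small integers, similar in spirit to a ratings column
values = [random.randint(0, 5) for _ in range(100_000)]


def scan_equal(values, target):
    """Full scan: look at every row and keep the matching row numbers."""
    return [row for row, v in enumerate(values) if v == target]


# The "index": (value, row) pairs sorted by value, plus the sorted keys alone
index = sorted((v, row) for row, v in enumerate(values))
keys = [v for v, _ in index]


def indexed_equal(keys, index, target):
    """Binary search for the run of equal keys, then return their row numbers."""
    lo = bisect.bisect_left(keys, target)
    hi = bisect.bisect_right(keys, target)
    return sorted(row for _, row in index[lo:hi])


# Both approaches must select exactly the same rows
assert scan_equal(values, 3) == indexed_equal(keys, index, 3)
```

The index costs extra memory and time to build, which is the same trade-off seen when an index is added to a table on disk.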

### Denormalized case

We'll use the same datasets again, but copy them first so that we can add indexes:

In [2]:
# continue from the previous notebook
data_dir = 'queries'
h5denorm = "compression/blosc-zstd-5-shuffle-denorm.h5"
h5norm = "compression/blosc-zstd-5-shuffle.h5"

In [3]:
## Copy the original PyTables table into another file
import shutil
h5idx = os.path.join(data_dir, "movielens-denorm-indexed.h5")
if os.path.exists(h5idx):
    os.unlink(h5idx)
shutil.copyfile(h5denorm, h5idx)

'queries\\movielens-denorm-indexed.h5'

In [4]:
# Open the new file in 'a'ppend mode
h5i = tables.open_file(h5idx, mode="a")

In [5]:
# Create an index for the 'title' column
h5lens = h5i.root.lens
blosc_filter = tables.Filters(complevel=9, complib="blosc")
%time h5lens.cols.title.create_csindex(filters=blosc_filter)

Wall time: 3.68 s


1000209

In [6]:
%%time
ratings = [0] * 6
for rt in range(0,6):
    ratings[rt] = sum(1 for r in h5lens.where("(title == b'Tom and Huck (1995)') & (rating == rt)"))

Wall time: 23 ms


In [7]:
ratings

[0, 4, 15, 28, 18, 3]

OK, so this is about 100x faster than the same query without an index.  What if we index the `rating` column too?

In [8]:
# Create an index for the rating column
%time h5lens.cols.rating.create_csindex(filters=blosc_filter)

Wall time: 1.25 s


1000209

In [9]:
%%time
ratings = [0] * 6
for rt in range(0,6):
    ratings[rt] = sum(1 for r in h5lens.where("(title == b'Tom and Huck (1995)') & (rating == rt)"))

Wall time: 6 ms


OK, so although the gain is smaller (23 ms down to 6 ms), this is another improvement in performance.

In [10]:
ratings

[0, 4, 15, 28, 18, 3]
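One way to check which indexes a query can use, without timing it, is `Table.will_query_use_indexing()`. The sketch below tries it on a small made-up in-memory table, before and after creating each index. The file name, column sizes and data are assumptions for illustration; this is not the MovieLens data.

```python
import numpy as np
import tables

# Hypothetical in-memory file: with backing store 0 nothing is written to disk
h5demo = tables.open_file("demo-indexing.h5", mode="w",
                          driver="H5FD_CORE", driver_core_backing_store=0)

# Made-up rows loosely modelled on the `lens` table
rows = np.zeros(1000, dtype=[("title", "S20"), ("rating", "i4")])
rows["title"] = [b"movie-%d" % (i % 50) for i in range(1000)]
rows["rating"] = np.arange(1000) % 6
demo = h5demo.create_table("/", "lens", obj=rows)

query = "(title == b'movie-7') & (rating == rt)"
condvars = {"rt": 3}  # value for the `rt` variable used in the query

# Names of the indexed columns the query can use (empty if none)
before = demo.will_query_use_indexing(query, condvars)

demo.cols.title.create_csindex()
after_title = demo.will_query_use_indexing(query, condvars)

demo.cols.rating.create_csindex()
after_both = demo.will_query_use_indexing(query, condvars)

print(before, after_title, after_both)
h5demo.close()
```

The same call could be made on `h5lens` with the query string used in the cells above.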

In [11]:
h5i.close()

### Normalized case

In [12]:
## Copy the original PyTables table into another file
import shutil
h5idx = os.path.join(data_dir, "movielens-norm-indexed.h5")
if os.path.exists(h5idx):
    os.unlink(h5idx)
shutil.copyfile(h5norm, h5idx)

'queries\\movielens-norm-indexed.h5'

In [13]:
# Open the new file in 'a'ppend mode
h5i = tables.open_file(h5idx, mode="a")
h5ratings = h5i.root.ratings
h5movies = h5i.root.movies

In [14]:
# Create an index for the rating column
blosc_filter = tables.Filters(complevel=9, complib="blosc")
%time h5ratings.cols.rating.create_csindex(filters=blosc_filter)

Wall time: 1.01 s


1000209

In [15]:
%%time
ratings = [0] * 6
for rt in range(6):
    th_movie_id = [r['movie_id'] for r in h5movies.where("(title == b'Tom and Huck (1995)')")][0]
    ratings[rt] = sum(1 for r in h5ratings.where("(movie_id == th_movie_id) & (rating == rt)"))

Wall time: 677 ms


Hmm, in this case indexing the `rating` column has not accelerated the query (at first sight at least).

In [16]:
ratings

[0, 4, 15, 28, 18, 3]

In [17]:
# Create an index for the movie_id column
%time h5ratings.cols.movie_id.create_csindex(filters=blosc_filter)

Wall time: 723 ms


1000209

In [18]:
%%time
ratings = [0] * 6
for rt in range(6):
    th_movie_id = [r['movie_id'] for r in h5movies.where("(title == b'Tom and Huck (1995)')")][0]
    ratings[rt] = sum(1 for r in h5ratings.where("(movie_id == th_movie_id) & (rating == rt)"))

Wall time: 79.8 ms


This time the query is noticeably faster (677 ms down to 79.8 ms), but it still cannot compete with the query speed for the denormalized case (which is ~10x faster).

In [19]:
ratings

[0, 4, 15, 28, 18, 3]

In [20]:
h5i.close()

In [21]:
!ls -lh {data_dir}

total 20M
-rw-r--r-- 1 tomkooij 197613 9.9M Jun 27 09:04 movielens-denorm-indexed.h5
-rw-r--r-- 1 tomkooij 197613 9.6M Jun 27 09:05 movielens-norm-indexed.h5


## Exercise

We have not created an index for the title for the normalized case.  Create such an index and determine if there is a noticeable speed-up or not.  Explain why you think that is the case.  Note: the times for a cold query can be **significantly** different from those for a hot query.
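To separate the first (possibly cold) run from later (hot) runs, a small timing helper such as the one below can be used. It is only a sketch: the name `time_cold_and_hot` is hypothetical, and the first run is only truly cold if neither PyTables nor the operating system has already cached the data.

```python
import time


def time_cold_and_hot(run_query, repeats=5):
    """Time the first call separately from the best of `repeats` later calls.

    Returns (first_run_seconds, best_later_seconds, last_result).
    """
    start = time.perf_counter()
    result = run_query()
    first = time.perf_counter() - start

    later = []
    for _ in range(repeats):
        start = time.perf_counter()
        result = run_query()
        later.append(time.perf_counter() - start)
    return first, min(later), result


# Stand-in for a real query: count the multiples of 7 below 100000
first, best, count = time_cold_and_hot(
    lambda: sum(1 for v in range(100_000) if v % 7 == 0))
print(f"first run: {first * 1e3:.2f} ms, best later run: {best * 1e3:.2f} ms")
```

For the exercise, `run_query` would wrap the `h5movies.where(...)` lookup instead of the stand-in lambda.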

In [22]:
## Copy the original PyTables table into another file
import shutil
h5idx2 = "movielens-norm-indexed2.h5"
if os.path.exists(h5idx2):
    os.unlink(h5idx2)
shutil.copyfile(h5idx, h5idx2)

'movielens-norm-indexed2.h5'

In [23]:
# Open the new file in 'a'ppend mode
h5i = tables.open_file(h5idx2, mode="a")
h5ratings = h5i.root.ratings
h5movies = h5i.root.movies

## Exercise

Query size vs speed (indexed queries vs non-indexed queries)

We create a (large) file containing `(key, value)` pairs. The `value` is an `int64`. The `key` is a random 10-byte string, chosen to simulate real data with a realistic compression ratio.

Investigate the query speed vs query result size. **Create the data file first; the assignment is below.**

In [24]:
N = 20  # append 20 blocks of 100k rows (2M rows in total)

In [25]:
# adapted from: https://stackoverflow.com/questions/20769818/

import random
import string

class KeyValue(tables.IsDescription):
    key = tables.StringCol(itemsize=10, dflt=" ", pos=0)  
    value = tables.Int64Col(dflt=0, pos=1)

fn = os.path.join(data_dir, "keyvalue.h5")

with tables.open_file(fn, "w") as f:    
    filters = tables.Filters(complevel=5, complib='blosc')
    kv = f.create_table("/", "keyvalues", KeyValue, filters=filters)

    for j in range(1, N+1):
        values = []
        print('block: ', j)
        for _ in range(100000):
            key = "".join(random.sample(string.ascii_uppercase, 10))  # slow!
            value = random.randint(0, 1000000)
            values.append((key, value))
        kv.append(values)

block:  1
block:  2
block:  3
block:  4
block:  5
block:  6
block:  7
block:  8
block:  9
block:  10
block:  11
block:  12
block:  13
block:  14
block:  15
block:  16
block:  17
block:  18
block:  19
block:  20


In [26]:
!ptdump -v -R10 {fn}

/ (RootGroup) ''
/keyvalues (Table(2000000,), shuffle, blosc(5)) ''
  description := {
  "key": StringCol(itemsize=10, shape=(), dflt=b' ', pos=0),
  "value": Int64Col(shape=(), dflt=0, pos=1)}
  byteorder := 'little'
  chunkshape := (3640,)
  Data dump:
[0] (b'IGNFUVDXYC', 37262)
[1] (b'MWYABCKNTO', 489695)
[2] (b'MLBNFAGKQP', 636287)
[3] (b'POSREKMHBJ', 545942)
[4] (b'APWMSZEXQU', 772907)
[5] (b'YVQWGPXMUK', 99778)
[6] (b'KPFNEHOTGJ', 196752)
[7] (b'SDOMVQPLGK', 822077)
[8] (b'BETZPHWVDS', 451722)
[9] (b'YRIEJCKOLQ', 572756)



Query the `value` column and compare different query (result) sizes. Compare indexed queries with unindexed queries.

*Optional: compare different compression levels and codecs*


For example: `'(value > 100000) & (value <1000010)'`
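When choosing bounds, it helps to estimate how many rows a range query should return. The sketch below assumes the values are uniform integers in `[0, 1000000]`, as produced by `random.randint(0, 1000000)` above, and a table of 2,000,000 rows. The helper name `expected_fraction` and the list of bounds are made up for this example.

```python
def expected_fraction(lo, hi, vmin=0, vmax=1_000_000):
    """Expected fraction of rows matching `(value > lo) & (value < hi)`.

    Assumes integer values drawn uniformly from [vmin, vmax], both inclusive.
    """
    first = max(lo + 1, vmin)   # smallest integer that satisfies value > lo
    last = min(hi - 1, vmax)    # largest integer that satisfies value < hi
    matching = max(last - first + 1, 0)
    return matching / (vmax - vmin + 1)


nrows = 2_000_000  # size of the keyvalues table created above
for lo, hi in [(100_000, 100_010), (100_000, 101_000),
               (100_000, 200_000), (100_000, 1_000_010)]:
    frac = expected_fraction(lo, hi)
    print(f"(value > {lo}) & (value < {hi}): "
          f"about {frac:.4%} of rows, about {frac * nrows:,.0f} rows")
```

Under these assumptions the example condition above selects roughly 90% of the table, so it sits at the large end of the range of result sizes worth testing.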


## Exercise (Optional)

Indexed queries with large result sets are difficult: PyTables is not optimised for them. In general, their performance is comparable to that of unindexed queries (which read the entire table).

For extreme performance, try an indexed query on a sorted table.
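One possible way to obtain a sorted table is `Table.copy()` with the `sortby` argument, which uses an existing index on that column to write the rows in sorted order. The sketch below does this on a small made-up in-memory table. The file name, table names and data are assumptions for illustration, and the table is far smaller than the 2M-row file above.

```python
import numpy as np
import tables

# Hypothetical in-memory file: with backing store 0 nothing is written to disk
h5demo = tables.open_file("demo-sorted.h5", mode="w",
                          driver="H5FD_CORE", driver_core_backing_store=0)

# Made-up (key, value) rows; keys are left empty because only `value` matters here
rng = np.random.default_rng(0)
rows = np.zeros(10_000, dtype=[("key", "S10"), ("value", "i8")])
rows["value"] = rng.integers(0, 1_000_001, size=rows.size)
kv = h5demo.create_table("/", "keyvalues", obj=rows)

# A completely sorted index (CSI) on `value` lets the copy be sorted by it
kv.cols.value.create_csindex()
kv_sorted = kv.copy(newname="keyvalues_sorted", sortby="value", propindexes=True)

# Check the ordering, and that the same query selects the same number of rows
is_sorted = bool(np.all(np.diff(kv_sorted.col("value")) >= 0))
condition = "(value > 100000) & (value < 200000)"
n_unsorted = len(kv.read_where(condition))
n_sorted = len(kv_sorted.read_where(condition))
print(is_sorted, n_unsorted, n_sorted)
h5demo.close()
```

In a sorted table the matching rows are stored next to each other, so a range query reads far fewer chunks; whether this helps for a given result size is what the exercise asks you to measure.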