#### Online Index Selection via Combinatorial Contextual Multi-Armed Bandits

In [63]:
import datetime
import itertools
import json
import logging
import math
import os
import random
import re
import subprocess
import sys
import time
import uuid
from collections import defaultdict

import numpy as np
import pandas as pd
import pyodbc
from tqdm import tqdm

%load_ext autoreload
%autoreload 2

import IPython
notebook_path = IPython.get_ipython().starting_dir
target_subdirectory_path = os.path.abspath(os.path.join(os.path.dirname(notebook_path), 'database'))
sys.path.append(target_subdirectory_path)
from utils import *

from mab import *

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [54]:
# Read workload queries from a JSON Lines file (one JSON object per line)
def read_workload(workload_filepath):
    workload = []
    with open(workload_filepath) as f:
        for line in f:
            # skip blank lines; parse every other line as one query
            if line.strip():
                workload.append(json.loads(line))

    return workload

# Path to the file containing the generated queries
workload_filepath = '../datagen/TPCH_workloads/TPCH_static_100_workload.json'

# Read the workload queries from file
workload = read_workload(workload_filepath)
print(len(workload))

2100


#### MAB Index Selection Algorithm

On each round, do the following:

1) Generate candidate arms/indices using the mini-workload from the previous round
2) Generate a context vector for each candidate arm
3) Select the best super-arm, i.e. a configuration/subset of the candidate indices
4) Materialize the super-arm configuration, then execute the new mini-workload for the current round


We implement these four steps separately, in the order given.
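
As a rough orientation, the four steps above can be sketched as a single round function. This is only an illustrative skeleton: `run_round` and all four callables passed into it are hypothetical placeholders, not the actual methods of the `MAB` class used in this notebook.

```python
from typing import Callable, Sequence


def run_round(
    previous_miniworkload: Sequence,
    current_miniworkload: Sequence,
    generate_candidates: Callable,
    generate_contexts: Callable,
    select_super_arm: Callable,
    materialize_and_execute: Callable,
):
    """One bandit round. Every callable is a hypothetical stand-in for a real step."""
    # Step 1: candidate arms come from the PREVIOUS round's mini-workload
    arms = generate_candidates(previous_miniworkload)
    # Step 2: one context vector per candidate arm
    contexts = generate_contexts(arms)
    # Step 3: choose a subset of arms (the super-arm)
    super_arm = select_super_arm(contexts, arms)
    # Step 4: build the chosen indices, run the CURRENT mini-workload, observe rewards
    rewards = materialize_and_execute(super_arm, current_miniworkload)
    return super_arm, rewards


# Toy usage with dummy callables, just to show the data flow
super_arm, rewards = run_round(
    ["q1"],
    ["q2"],
    lambda workload: ["a", "b"],
    lambda arms: [1.0, 2.0],
    lambda contexts, arms: [arms[contexts.index(max(contexts))]],
    lambda chosen, workload: {arm: 1.0 for arm in chosen},
)
```

The point of the sketch is the ordering: candidates and contexts are derived from what was already observed, and the reward for the chosen configuration only becomes available after the current mini-workload has run.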

#### 1. Generation of Candidate Indices

Test index generation on a mini-workload consisting of the first 21 queries.
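
To give a feel for what a candidate arm looks like, here is a minimal sketch that enumerates ordered column combinations for one table and names them in the `IX_<table>_<columns>` style seen in the outputs further down. It assumes candidates are permutations of a query's predicate columns up to a fixed length; the helper `candidate_index_names` and its `max_key_columns` parameter are hypothetical, and the real `mab.generate_candidate_indices` may apply different rules (for example include columns or selectivity filters).

```python
from itertools import permutations


def candidate_index_names(table, predicate_columns, max_key_columns=3):
    """Map index name -> ordered key columns (hypothetical helper, for illustration)."""
    candidates = {}
    columns = sorted(predicate_columns)
    for k in range(1, min(max_key_columns, len(columns)) + 1):
        # column order matters for an index, so enumerate permutations
        for key_columns in permutations(columns, k):
            name = "IX_" + table + "_" + "_".join(key_columns)
            candidates[name] = key_columns
    return candidates


arms = candidate_index_names("part", {"p_partkey", "p_size"})
for name, key_columns in arms.items():
    print(name, key_columns)
```

Because the number of permutations grows quickly with the number of predicate columns, a cap such as `max_key_columns` keeps the arm set manageable in this sketch.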

In [56]:
mab = MAB()

In [55]:
connection = start_connection()

miniworkload = []
for query in workload[0:21]:
    # convert to Query object
    miniworkload.append(Query(connection, query['template_id'], query['query_string'], query['payload'], query['predicates'], query['order_bys']))

close_connection(connection)

In [57]:
connection = start_connection()

# generate candidate indices
index_arms = mab.generate_candidate_indices(connection, miniworkload, verbose=False)

close_connection(connection)

Genereting candidate indices for 21 queries...


Processing queries: 100%|██████████| 21/21 [00:01<00:00, 15.95it/s]


#### 2. Generation of Context Vectors for Each Arm/Index

The context vector of each index is the concatenation of two pieces:

* Columns piece: a vector whose length equals the total number of columns in the database. Each entry corresponds to one column: if the column is in the index, the entry is $10^{-j}$, where $j$ is the position of that column in the index; otherwise the entry is zero.

* Derived context piece: a vector of length 2. The first component is the timestamp of the last round in which the index was used, and the second is the size of the index relative to the entire database.
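
The two pieces described above can be assembled as in the following sketch. It assumes the leading index column has position $j = 0$ (so its entry is 1.0); the text does not say whether positions start at 0 or 1. The function `build_context_vector` and its arguments are hypothetical and are not the signature of `mab.generate_contexts`.

```python
import numpy as np


def build_context_vector(index_columns, all_columns, last_used_round, relative_size):
    """Concatenate the columns piece and the derived piece (illustrative only)."""
    columns_piece = np.zeros(len(all_columns))
    for j, column in enumerate(index_columns):
        # assumption: j = 0 for the leading index column, so entries are 1, 0.1, 0.01, ...
        columns_piece[all_columns.index(column)] = 10.0 ** (-j)
    derived_piece = np.array([last_used_round, relative_size], dtype=float)
    return np.concatenate([columns_piece, derived_piece])


# Toy database with four columns; index on (p_size, p_partkey)
all_columns = ["p_partkey", "p_size", "p_brand", "p_container"]
vec = build_context_vector(
    ("p_size", "p_partkey"), all_columns, last_used_round=2, relative_size=0.01
)
print(vec)
```

Encoding the position as a power of ten lets a single vector capture both which columns an index contains and their order, so two indexes over the same columns in a different order get different contexts.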

In [58]:
connection = start_connection()

# test context vector generation
context_vectors = mab.generate_contexts(connection, index_arms)
print(context_vectors_columns.shape)

close_connection(connection)

(652, 61)


#### 3. Generation of Super-Arm / Best Configuration

To generate the best configuration:

* Compute an estimated upper bound on the expected reward of each index.
* Then solve the 0-1 knapsack problem to find the subset of indices that maximizes the total upper bound on expected reward while satisfying the memory constraint.
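
The two bullets above can be illustrated with a small, self-contained sketch. It assumes a linear reward model with a LinUCB-style confidence bound and integer index sizes, and it solves the knapsack exactly by dynamic programming. The names `upper_confidence_bounds`, `knapsack_01`, `V`, `b`, and `alpha` are assumptions made for this example; `mab.select_best_configuration` may compute its bounds and solve the knapsack differently (for instance greedily).

```python
import numpy as np


def upper_confidence_bounds(contexts, V, b, alpha=1.0):
    """Per-arm bound x^T theta + alpha * sqrt(x^T V^{-1} x), with theta = V^{-1} b."""
    V_inv = np.linalg.inv(V)
    theta = V_inv @ b
    means = contexts @ theta
    widths = np.sqrt(np.einsum("ij,jk,ik->i", contexts, V_inv, contexts))
    return means + alpha * widths


def knapsack_01(values, sizes, capacity):
    """Exact 0-1 knapsack via dynamic programming; sizes and capacity are integers."""
    n = len(values)
    best = [[0.0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]  # option 1: skip item i-1
            if sizes[i - 1] <= c:
                take = best[i - 1][c - sizes[i - 1]] + values[i - 1]
                if take > best[i][c]:
                    best[i][c] = take  # option 2: take item i-1
    # backtrack to recover which items were taken
    chosen, c = [], capacity
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= sizes[i - 1]
    return sorted(chosen)


# Toy example: three arms with one-hot contexts and a memory budget of 5 units
contexts = np.eye(3)
V = np.eye(3)                     # regularized design matrix (assumed state)
b = np.array([3.0, 1.0, 2.0])     # accumulated reward-weighted contexts (assumed state)
ucb = upper_confidence_bounds(contexts, V, b)
selected = knapsack_01(list(ucb), sizes=[4, 2, 3], capacity=5)
print(ucb, selected)
```

In the toy example the single most valuable arm does not win: the two smaller arms together fit the budget and have a higher combined bound, which is exactly the trade-off the knapsack formulation captures.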

In [59]:
# test super arm selection
selected_indices = mab.select_best_configuration(context_vectors, index_arms, verbose=True)


Best configuration contains 150 indices: 
['IX_supplier_s_suppkey', 'IX_supplier_s_suppkey_s_nationkey', 'IX_supplier_s_nationkey_s_suppkey', 'IX_supplier_s_nationkey', 'IX_supplier_s_comment', 'IX_customer_c_custkey', 'IX_part_p_partkey', 'IX_customer_c_nationkey_c_custkey', 'IX_customer_c_custkey_c_nationkey', 'IX_customer_c_nationkey', 'IX_part_p_partkey_p_size', 'IX_part_p_size_p_partkey', 'IX_customer_c_acctbal_c_custkey', 'IX_customer_c_custkey_c_acctbal', 'IX_customer_c_acctbal', 'IX_part_p_size', 'IX_customer_c_mktsegment', 'IX_customer_c_phone_c_custkey', 'IX_customer_c_custkey_c_phone', 'IX_customer_c_phone', 'IX_part_p_brand_p_partkey', 'IX_part_p_container_p_partkey', 'IX_part_p_partkey_p_container', 'IX_part_p_partkey_p_brand', 'IX_part_p_container', 'IX_part_p_brand', 'IX_part_p_container_p_partkey_p_size', 'IX_part_p_partkey_p_container_p_size', 'IX_part_p_partkey_p_size_p_container', 'IX_part_p_size_p_container_p_partkey', 'IX_part_p_size_p_partkey_p_container', 'IX_par

In [77]:
mab = MAB()
mab.step_round(miniworkload, 1, verbose=True)
mab.step_round(miniworkload, 2, verbose=True)

All non-clustered indexes --> [('dbo', 'customer', 'IX_customer_c_acctbal'), ('dbo', 'customer', 'IX_customer_c_acctbal_c_custkey'), ('dbo', 'customer', 'IX_customer_c_acctbal_c_custkey_c_phone'), ('dbo', 'customer', 'IX_customer_c_acctbal_c_phone'), ('dbo', 'customer', 'IX_customer_c_acctbal_c_phone_c_custkey'), ('dbo', 'customer', 'IX_customer_c_custkey'), ('dbo', 'customer', 'IX_customer_c_custkey_c_acctbal'), ('dbo', 'customer', 'IX_customer_c_custkey_c_acctbal_c_phone'), ('dbo', 'customer', 'IX_customer_c_custkey_c_nationkey'), ('dbo', 'customer', 'IX_customer_c_custkey_c_phone'), ('dbo', 'customer', 'IX_customer_c_custkey_c_phone_c_acctbal'), ('dbo', 'customer', 'IX_customer_c_mktsegment'), ('dbo', 'customer', 'IX_customer_c_nationkey'), ('dbo', 'customer', 'IX_customer_c_nationkey_c_custkey'), ('dbo', 'customer', 'IX_customer_c_phone'), ('dbo', 'customer', 'IX_customer_c_phone_c_acctbal'), ('dbo', 'customer', 'IX_customer_c_phone_c_acctbal_c_custkey'), ('dbo', 'customer', 'IX_cu

Processing queries: 100%|██████████| 21/21 [00:00<00:00, 3421.79it/s]

Tables:
Table: customer, Row Count: 150000, PK Columns: ['c_custkey']
Table: orders, Row Count: 1500000, PK Columns: ['o_orderkey']
Table: lineitem, Row Count: 6001215, PK Columns: ['l_linenumber', 'l_orderkey']
Table: part, Row Count: 200000, PK Columns: ['p_partkey']
Table: supplier, Row Count: 10000, PK Columns: ['s_suppkey']
Table: partsupp, Row Count: 800000, PK Columns: ['ps_partkey', 'ps_suppkey']
Table: nation, Row Count: 25, PK Columns: ['n_nationkey']
Table: region, Row Count: 5, PK Columns: ['r_regionkey']

Table --> lineitem, Predicate Columns --> {'l_shipdate'}, table row count --> 6001215
Include columns: ['l_returnflag', 'l_quantity', 'l_tax', 'l_discount', 'l_extendedprice', 'l_linestatus']
Query selectivity: 0.9832358947313169
Full table scan for table: lineitem is cheap, skipping

Table --> lineitem, Payload Columns --> {'l_shipdate'}, table row count --> 6001215
Payload columns are in the predicates, skipping

Table --> lineitem, Predicate Columns --> {'l_shipdate'},




Created index --> [dbo].[part].[IX_part_p_container_p_partkey_p_size], Indexed Columns --> ('p_container', 'p_partkey', 'p_size'), Included Columns --> (), index creation time: 0.31 seconds
Created index --> [dbo].[customer].[IX_customer_c_phone_c_custkey], Indexed Columns --> ('c_phone', 'c_custkey'), Included Columns --> (), index creation time: 0.289 seconds
Created index --> [dbo].[customer].[IX_customer_c_acctbal], Indexed Columns --> ('c_acctbal',), Included Columns --> (), index creation time: 0.243 seconds
Created index --> [dbo].[part].[IX_part_p_brand_p_partkey], Indexed Columns --> ('p_brand', 'p_partkey'), Included Columns --> (), index creation time: 0.312 seconds
Created index --> [dbo].[customer].[IX_customer_c_acctbal_c_custkey], Indexed Columns --> ('c_acctbal', 'c_custkey'), Included Columns --> (), index creation time: 0.147 seconds
Created index --> [dbo].[part].[IX_part_p_partkey], Indexed Columns --> ('p_partkey',), Included Columns --> (), index creation time: 0.

In [79]:
mab.step_round(miniworkload, 3, verbose=True)

Identifying new query templates and updating statistics...
Number of new query templates added: 0
Selecting queries of interest...
Number of queries of interest: 21
Generating candidate indices...
Generating candidate indices for 21 queries...


Processing queries: 100%|██████████| 21/21 [00:00<00:00, 2969.77it/s]

Tables:
Table: customer, Row Count: 150000, PK Columns: ['c_custkey']
Table: orders, Row Count: 1500000, PK Columns: ['o_orderkey']
Table: lineitem, Row Count: 6001215, PK Columns: ['l_linenumber', 'l_orderkey']
Table: part, Row Count: 200000, PK Columns: ['p_partkey']
Table: supplier, Row Count: 10000, PK Columns: ['s_suppkey']
Table: partsupp, Row Count: 800000, PK Columns: ['ps_partkey', 'ps_suppkey']
Table: nation, Row Count: 25, PK Columns: ['n_nationkey']
Table: region, Row Count: 5, PK Columns: ['r_regionkey']

Table --> lineitem, Predicate Columns --> {'l_shipdate'}, table row count --> 6001215
Include columns: ['l_returnflag', 'l_quantity', 'l_tax', 'l_discount', 'l_extendedprice', 'l_linestatus']
Query selectivity: 0.9832358947313169
Full table scan for table: lineitem is cheap, skipping

Table --> lineitem, Payload Columns --> {'l_shipdate'}, table row count --> 6001215
Payload columns are in the predicates, skipping

Table --> lineitem, Predicate Columns --> {'l_shipdate'},




Created index --> [dbo].[part].[IX_part_p_type_p_brand_p_partkey], Indexed Columns --> ('p_type', 'p_brand', 'p_partkey'), Included Columns --> (), index creation time: 0.336 seconds
Created index --> [dbo].[part].[IX_part_p_type_p_brand], Indexed Columns --> ('p_type', 'p_brand'), Included Columns --> (), index creation time: 0.356 seconds
Created index --> [dbo].[part].[IX_part_p_type_p_brand_p_size], Indexed Columns --> ('p_type', 'p_brand', 'p_size'), Included Columns --> (), index creation time: 0.682 seconds
Created index --> [dbo].[part].[IX_part_p_type_p_size], Indexed Columns --> ('p_type', 'p_size'), Included Columns --> (), index creation time: 0.584 seconds
Created index --> [dbo].[part].[IX_part_p_type_p_partkey_p_brand], Indexed Columns --> ('p_type', 'p_partkey', 'p_brand'), Included Columns --> (), index creation time: 0.491 seconds
Created index --> [dbo].[part].[IX_part_p_name], Indexed Columns --> ('p_name',), Included Columns --> (), index creation time: 0.43 second