#### Online Index Selection via Combinatorial Contextual Multi-Armed Bandits

In [13]:
import datetime
import itertools
import json
import logging
import math
import os
import random
import re
import subprocess
import sys
import time
import uuid
from collections import defaultdict

import numpy as np
import pandas as pd
import pyodbc
from tqdm import tqdm

%load_ext autoreload
%autoreload 2

import IPython
notebook_path = IPython.get_ipython().starting_dir
target_subdirectory_path = os.path.abspath(os.path.join(os.path.dirname(notebook_path), 'database'))
sys.path.append(target_subdirectory_path)
from utils import *

from mab import *


In [2]:
# Read workload queries from a file that stores one JSON object per line
def read_workload(workload_filepath):
    workload = []
    with open(workload_filepath) as f:
        # Each line holds a single query serialized as JSON
        for line in f:
            workload.append(json.loads(line))

    return workload

# Base directory containing the generated queries
workload_filepath = '../datagen/TPCH_workloads/TPCH_static_100_workload.json'

# Read the workload queries from file
workload = read_workload(workload_filepath)
print(len(workload))

2100


#### MAB index selection algorithm. On each round, do the following:

1) Generate candidate arms (indices) from the mini-workload of the previous round
2) Generate a context vector for each candidate arm
3) Select the best super-arm, i.e. a configuration (subset) of the candidate indices
4) Materialize the super-arm configuration, then execute the new mini-workload for the current round


We will implement these 4 steps separately in the given order.
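
As a rough orientation before the individual steps, the round structure above can be sketched as a plain loop. This is only a skeleton under stated assumptions: `split_into_rounds`, `run_rounds`, and the four callables passed to it are hypothetical placeholders, not functions from `mab` or `utils`, and starting the first round with an empty previous mini-workload is an assumption made for the sketch.

```python
def split_into_rounds(workload, round_size):
    """Split a workload list into consecutive mini-workloads of `round_size` queries."""
    return [workload[i:i + round_size] for i in range(0, len(workload), round_size)]


def run_rounds(workload, round_size, generate_arms, build_contexts,
               select_super_arm, execute):
    """Skeleton of the per-round loop; all four callables are hypothetical placeholders."""
    history = []
    previous = []  # assumption: no mini-workload is available before the first round
    for round_idx, current in enumerate(split_into_rounds(workload, round_size)):
        arms = generate_arms(previous)                 # step 1: candidate indices
        contexts = build_contexts(arms)                # step 2: one context per arm
        super_arm = select_super_arm(arms, contexts)   # step 3: choose a configuration
        reward = execute(super_arm, current)           # step 4: materialize and run
        history.append((round_idx, super_arm, reward))
        previous = current
    return history


# Toy usage with stand-in callables (purely illustrative, no database involved)
toy_workload = [f"q{i}" for i in range(6)]
history = run_rounds(
    toy_workload,
    round_size=2,
    generate_arms=lambda prev: [f"idx_for_{q}" for q in prev],
    build_contexts=lambda arms: {arm: [0.0] for arm in arms},
    select_super_arm=lambda arms, contexts: arms[:1],
    execute=lambda super_arm, current: -float(len(current)),
)
```

With the 2100-query workload loaded above and a round size of 21, `split_into_rounds` would yield 100 mini-workloads.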

#### 1. Generation of Candidate indices

Test index generation on a mini-workload consisting of the first 21 queries.

In [6]:
mab = MAB()

In [8]:
connection = start_connection()

miniworkload = []
for query in workload[0:21]:
    # convert the raw JSON record to a Query object
    miniworkload.append(Query(
        connection,
        query['template_id'],
        query['query_string'],
        query['payload'],
        query['predicates'],
        query['order_bys'],
    ))

close_connection(connection)

In [9]:
connection = start_connection()

# generate candidate indices
index_arms = mab.generate_candidate_indices(connection, miniworkload, verbose=False)

close_connection(connection)

Gnereting candidate indices for 21 queries...


Processing queries: 100%|██████████| 21/21 [00:00<00:00, 6457.98it/s]


#### 2. Generation of Context Vectors for Each Arm/Index

The context vector of each index is the concatenation of two pieces:

* Columns piece: a vector whose length equals the total number of columns in the database. Each entry corresponds to one column: if the column is part of the index, the entry is $10^{-j}$, where $j$ is the position of that column in the index; otherwise the entry is zero.

* Derived context piece: a vector of length 2. The first component is the timestamp of the last round in which the index was used, and the second is the size of the index relative to the entire database.
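
To make the encoding concrete, here is a minimal sketch of how one such vector could be assembled. It is not the implementation behind `mab.generate_context_vector_columns` or `mab.generate_context_vector_derived`; the helper names `columns_piece` and `context_vector` are hypothetical, the toy column map is made up for the example, and counting the position $j$ from zero (so the leading index column gets the value 1) is an assumption, since the text does not say where $j$ starts.

```python
import numpy as np


def columns_piece(index_columns, columns_to_idx):
    """Encode an index's columns as 10**(-j), with j the column's position in the index.

    Assumption: j is zero-based, so the leading column maps to 1.0.
    """
    vec = np.zeros(len(columns_to_idx))
    for j, column in enumerate(index_columns):
        vec[columns_to_idx[column]] = 10.0 ** (-j)
    return vec


def context_vector(index_columns, columns_to_idx, last_used_round, relative_size):
    """Concatenate the columns piece with the two derived features."""
    derived = np.array([last_used_round, relative_size], dtype=float)
    return np.concatenate([columns_piece(index_columns, columns_to_idx), derived])


# Toy column map (hypothetical; the notebook builds the real one from the database)
toy_columns_to_idx = {"l_orderkey": 0, "l_partkey": 1, "l_shipdate": 2, "o_orderkey": 3}

# Index on (l_shipdate, l_orderkey), last used in round 7, taking 2% of the database
vec = context_vector(("l_shipdate", "l_orderkey"), toy_columns_to_idx, 7, 0.02)
```

Because the value decays by a factor of ten per position, two indices over the same columns in a different order produce different vectors, which lets the bandit distinguish column order.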

In [14]:
connection = start_connection()

all_columns, num_columns = get_all_columns(connection)

close_connection(connection)

In [15]:
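# Map every column name to a fixed position in the columns piece of the context vector
columns_to_idx = {}
i = 0
for table_name, columns in all_columns.items():
    for column in columns:
        columns_to_idx[column] = i
        i += 1

# Inverse map: position -> column name
idx_to_columns = {v: k for k, v in columns_to_idx.items()}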

In [17]:
connection = start_connection()

# test context vector generation
context_vectors_columns = mab.generate_context_vector_columns(index_arms, columns_to_idx)
print(context_vectors_columns.shape)

context_vectors_derived = mab.generate_context_vector_derived(connection, index_arms)
print(context_vectors_derived.shape)

close_connection(connection)

(652, 61)
(652, 2)
