#### Construction of Index Benefit Graph (from Schnaitter's PhD Thesis, 2011)

***Definition***: The `IBG` of a query $q$ is a `DAG` in which each node $Y$ is a subset of $C$, the set of all relevant indexes that could ever be utilized in the execution of $q$. Node $Y$ also stores the following two quantities: 

* $cost(q,Y)$ which is the query optimizer's estimated cost for executing $q$ under configuration $Y$  
* $used(q,Y)$ which is the subset of indexes from $Y$ that are included in the query plan


Recursive algorithm for constructing the IBG:

```python
construct_IBG(q, Y):
    if Y.built:
        return

    # obtain estimated cost and determine indexes used
    Y.cost = cost(q,Y)
    Y.used = used(q,Y)
    Y.built = True
    
    # create children (one for each index in Y.used)
    for a in Y.used:
        create child node: X = Y - {a}   # child node is set Y with index a removed
        X.built = False
        Y.add_child(X)
        # recursively construct IBG on children
        construct_IBG(q, X)

```




```python
# create root node
Y = C
Y.built = False

# call construct_IBG(q, Y)
construct_IBG(q, Y)
```


Some nodes may share the same child. Instead of creating a new node for that child under each parent, we keep a separate hash table of the nodes created so far and reuse an existing node whenever the same configuration comes up again.
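The construction with a shared-node hash table can be sketched on a toy problem. Everything here is an assumption made for illustration: `toy_optimizer`, the tables `t1`/`t2`, the indexes `a`, `b`, `c` and their costs are hypothetical stand-ins for the real what-if optimizer, and configurations are represented as `frozenset`s so they can be used directly as hash-table keys.

```python
# Hypothetical cost model: full-scan cost per table, and (table, scan cost) per index.
TABLE_SCAN = {"t1": 50.0, "t2": 30.0}
INDEX_INFO = {"a": ("t1", 10.0), "b": ("t1", 20.0), "c": ("t2", 5.0)}


def toy_optimizer(config):
    """Return (cost, used): per table, use the cheapest index if it beats a full scan."""
    cost, used = 0.0, set()
    for table, scan_cost in TABLE_SCAN.items():
        options = [i for i in config if INDEX_INFO[i][0] == table]
        best = min(options, key=lambda i: INDEX_INFO[i][1], default=None)
        if best is not None and INDEX_INFO[best][1] < scan_cost:
            used.add(best)
            cost += INDEX_INFO[best][1]
        else:
            cost += scan_cost
    return cost, frozenset(used)


def build_ibg(candidates):
    nodes = {}   # hash table: configuration -> node record
    calls = 0    # number of optimizer calls actually made

    def construct(Y):
        nonlocal calls
        if Y in nodes:          # node already built via another parent: reuse it
            return
        calls += 1
        cost, used = toy_optimizer(Y)
        # one child per used index, obtained by removing that index from Y
        children = [Y - {a} for a in sorted(used)]
        nodes[Y] = {"cost": cost, "used": used, "children": children}
        for X in children:
            construct(X)

    root = frozenset(candidates)
    construct(root)
    return root, nodes, calls


root, nodes, calls = build_ibg(INDEX_INFO)
print(f"{len(nodes)} nodes, {calls} optimizer calls, {2 ** len(INDEX_INFO)} subsets in the power set")
for config, node in nodes.items():
    print(sorted(config), node["cost"], sorted(node["used"]))
```

In this toy case the configuration `{b}` is reached from both `{b, c}` and `{a, b}`, but it is built only once, and the subsets `{a}` and `{a, c}` never become nodes at all.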

Once the IBG has been constructed, we can use it to derive $cost(q, X)$ and $used(q, X)$ for any $X \subseteq C$, even if $X$ is not a node in the IBG, as follows. We start from the root node (which contains every index in $X$ and possibly some additional ones not in $X$) and repeatedly move down to a child that corresponds to the removal of an index not in $X$, until we reach a node $Y$ that is either a leaf or has no such child, i.e. $used(q,Y) \subseteq X$. Then $cost(q,X) = cost(q,Y)$ and $used(q, X) = used(q,Y)$. This relies on the assumption that removing indexes the plan does not use leaves the optimizer's plan and cost unchanged.
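The traversal can be illustrated on a small hand-written IBG. The table below is a hypothetical example (three indexes `a`, `b`, `c` with made-up costs), stored as a mapping from configuration to `(cost, used)`; the child of `Y` for a used index `a` is simply the entry for `Y - {a}`. This is a sketch of the lookup rule, not the notebook's `IBG` class.

```python
# Hypothetical IBG: configuration -> (cost, used indexes).
# Children are implicit: for each index a in used, the child is Y - {a}.
TOY_IBG = {
    frozenset("abc"): (15.0, frozenset("ac")),
    frozenset("bc"): (25.0, frozenset("bc")),
    frozenset("ab"): (40.0, frozenset("a")),
    frozenset("c"): (55.0, frozenset("c")),
    frozenset("b"): (50.0, frozenset("b")),
    frozenset(): (80.0, frozenset()),
}
TOY_ROOT = frozenset("abc")


def lookup(ibg, root, X):
    """Derive (cost, used) for an arbitrary subset X of the root configuration."""
    X = frozenset(X)
    Y = root
    while True:
        cost, used = ibg[Y]
        extra = used - X            # used indexes that X does not contain
        if not extra:               # used(q, Y) is a subset of X: Y answers the query
            return cost, used
        Y = Y - {min(extra)}        # descend to the child that removes one such index


# {a} is not a node of the toy IBG, but its cost can still be derived
print(lookup(TOY_IBG, TOY_ROOT, {"a"}))
print(lookup(TOY_IBG, TOY_ROOT, {"a", "c"}))
```

For `X = {a}` the walk goes from the root to `{a, b}` (removing `c`), where the only used index is `a`, so the lookup stops there.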

So the whole point of the IBG is that it gives us a compressed/efficient representation of the power-set of $C$ so that for any subset $X$ in the power-set we can compute  $cost(q, X)$ and $used(q, X)$ using the IBG, without having to maintain those quantities for every possible subset.

(Later on, we will also see how to use the IBG to derive information about index interactions.)

We can also use the IBG to compute the `maximum benefit` of any index $a \in C$ as follows:

$$
\beta = \max_{X \subseteq C} \text{benefit}_q(a, X)
$$

where $\text{benefit}_q(a, X) \equiv cost(q,X) - cost(q,X \cup \{a\})$. The maximization is over all possible subsets $X$, which looks like a lot of work if the benefit has to be evaluated for every one of them. However, a simple and efficient alternative is to find all the IBG nodes $Y$ that do not contain the index $a$, compute the benefit of $a$ with respect to each of these nodes, and take the maximum.
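As a sanity check of this shortcut, the sketch below compares the maximum over IBG nodes that do not contain $a$ with a brute-force maximum over every subset. The cost model (`SCAN`, `INDEX`) and the node list `IBG_NODES` are hypothetical and written out by hand for a three-index toy problem; in the real setting `cost` would be answered by the IBG rather than by a closed-form function.

```python
from itertools import combinations

# Hypothetical cost model: full-scan cost per table, and (table, scan cost) per index.
SCAN = {"t1": 50.0, "t2": 30.0}
INDEX = {"a": ("t1", 10.0), "b": ("t1", 20.0), "c": ("t2", 5.0)}

# Node configurations of the IBG for this toy model (assumed, listed by hand).
IBG_NODES = [frozenset(s) for s in ("abc", "bc", "ab", "c", "b", "")]


def cost(config):
    """Toy cost: each table is read via its cheapest available index or a full scan."""
    total = 0.0
    for table, scan_cost in SCAN.items():
        choices = [c for (t, c) in (INDEX[i] for i in config) if t == table]
        total += min(choices + [scan_cost])
    return total


def benefit(a, X):
    """benefit_q(a, X) = cost(q, X) - cost(q, X ∪ {a})."""
    X = frozenset(X)
    return cost(X) - cost(X | {a})


def max_benefit_ibg(a):
    # only consider IBG nodes that do not already contain index a
    return max(benefit(a, Y) for Y in IBG_NODES if a not in Y)


def max_benefit_brute_force(a):
    # enumerate every subset of the remaining indexes
    others = sorted(set(INDEX) - {a})
    subsets = (
        frozenset(s)
        for r in range(len(others) + 1)
        for s in combinations(others, r)
    )
    return max(benefit(a, X) for X in subsets)


for a in sorted(INDEX):
    print(a, max_benefit_ibg(a), max_benefit_brute_force(a))
```

On this toy example the two maxima agree for every index, while the node-based version evaluates fewer configurations; the gap would be far larger for a realistic candidate set.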





In [2]:
%load_ext autoreload
%autoreload 2

from ssb_qgen_class import *
from pg_utils import *

import random
import time


In [3]:
# create an SSB query generator object
qg = QGEN()

In [28]:
class Node:
    def __init__(self, id, indexes):
        self.id = id
        self.indexes = indexes
        self.children = []
        self.parents = []
        self.built = False
        self.cost = None
        self.used = None


# class for creating and storing the IBG
class IBG:
    def __init__(self, query_object, C=None):
        self.q = query_object
        if C is None:
            # get all candidate indexes
            self.C = extract_query_indexes(self.q, include_cols=True)
        else:
            self.C = C
        print(f"Number of candidate indexes: {len(self.C)}")
        #print(f"Candidate indexes: {self.C}")
        # map index_id to integer
        self.idx2id = {index.index_id:i for i, index in enumerate(self.C)}
        
        # create a hash table for keeping track of all created nodes
        self.nodes = {}
        # create a root node
        self.root = Node(self.get_configuration_id(self.C), self.C)
        self.nodes[self.root.id] = self.root
        print(f"Created root node with id: {self.root.id}")
        print("Constructing IBG...")
        # start the IBG construction
        self.construct_ibg(self.root)


    # assign unique string id to a configuration
    def get_configuration_id(self, indexes):
        # get sorted list of integer ids
        ids = sorted([self.idx2id[idx.index_id] for idx in indexes])
        return "_".join([str(i) for i in ids])
    

    # obtain cost and used indexes for a given configuration
    def _get_cost_used(self, indexes):
        conn = create_connection()
        # create hypothetical indexes
        hypo_indexes = bulk_create_hypothetical_indexes(conn, indexes)
        # map oid to index object
        oid2index = {}
        for i in range(len(hypo_indexes)):
            oid2index[hypo_indexes[i][0]] = indexes[i]
        # get cost and used indexes
        cost, indexes_used = get_query_cost_estimate_hypo_indexes(conn, self.q.query_string, show_plan=False)
        # map used index oids to index objects
        used = [oid2index[oid] for oid, scan_type, scan_cost in indexes_used]
        # drop hypothetical indexes
        bulk_drop_hypothetical_indexes(conn)
        close_connection(conn)   
        return cost, used

    # recursive IBG construction algorithm
    def construct_ibg(self, Y):
        if Y.built:
            return 
        
        # obtain the query optimizer's estimated cost and used indexes
        cost, used = self._get_cost_used(Y.indexes)
        Y.cost = cost
        Y.used = used
        Y.built = True
        
        #print(f"Creating node for configuration: {[idx.index_id for idx in Y.indexes]}")
        #print(f"Cost: {cost}, Used indexes:")
        #for idx in used:
        #    print(f"{idx}")

        # create children
        for a in Y.used:
            # create a new configuration with index a removed from Y
            X_indexes = [index for index in Y.indexes if index != a]
            X_id = self.get_configuration_id(X_indexes)
            
            # if X is not in the hash table, create a new node and recursively build it
            if X_id not in self.nodes:
                X = Node(X_id, X_indexes)
                X.parents.append(Y)
                self.nodes[X_id] = X
                Y.children.append(X)
                self.construct_ibg(X)

            else:
                X = self.nodes[X_id]
                Y.children.append(X)
                X.parents.append(Y)


    # use IBG to obtain estimated cost and used indexes for arbitrary subset of C
    def get_cost_used(self, X):
        # get id of the configuration
        id = self.get_configuration_id(X)
        # check if the configuration is in the IBG
        if id in self.nodes:
            cost, used = self.nodes[id].cost, self.nodes[id].used
        
        else:
            X_indexes = set([index.index_id for index in X])
            Y = self.root
            Y_indexes = set([index.index_id for index in Y.indexes])
            # walk down the IBG while Y still has indexes outside X and has children to descend to
            while (len(Y_indexes - X_indexes) != 0) and (len(Y.children) > 0):               
                # traverse down to a child node that removes indexes not in X
                child_found = False
                for child in Y.children:
                    child_indexes = set([index.index_id for index in child.indexes])
                    child_indexes_removed = Y_indexes - child_indexes
                    child_indexes_removed_not_in_X = child_indexes_removed - X_indexes
            
                    if len(child_indexes_removed_not_in_X) > 0:
                        Y = child
                        Y_indexes = child_indexes
                        child_found = True
                        break

                # if no children remove indexes not in X    
                if not child_found:
                    break    
        
            cost, used = Y.cost, Y.used

        return cost, used    


    # compute benefit of an index for a given configuration 
    # input X is a list of index objects and 'a' is a single index object
    # X must not contain 'a'
    def compute_benefit(self, a, X):
        if a in X:
            # zero benefit if 'a' is already in X
            #raise ValueError("Index 'a' is already in X")
            return 0
        
        # get cost and used indexes for X
        cost_X, used_X = self.get_cost_used(X)
        # create a new configuration with index a added to X
        X_a = X + [a]
        # get cost and used indexes for X + a
        cost_X_a, used_X_a = self.get_cost_used(X_a)
        # compute benefit
        benefit = cost_X - cost_X_a
        return benefit 


    # compute maximum benefit of adding an index to any possible configuration
    def compute_max_benefit(self, a):
        max_benefit = float('-inf')
        for id, node in self.nodes.items():
            print(f"Computing benefit for node: {[index.index_id for index in node.indexes]}")
            benefit = self.compute_benefit(a, node.indexes)
            if benefit > max_benefit:
                max_benefit = benefit

        return max_benefit


    # function for printing the IBG, using BFS level order traversal
    def print_ibg(self):
        q = [self.root]
        # traverse level by level, print all node ids in a level in a single line before moving to the next level
        while len(q) > 0:
            next_q = []
            for node in q:
                print(f"{node.id} -> ", end="")
                for child in node.children:
                    next_q.append(child)
            print()
            q = next_q  
                     
            

In [29]:
query = qg.generate_query(14)
print(query)

template id: 14, query: 
                SELECT lo_linenumber, lo_quantity, lo_orderdate  
                FROM lineorder
                WHERE lo_linenumber >= 3 AND lo_linenumber <= 4
                AND lo_quantity = 28;
            , payload: {'lineorder': ['lo_linenumber', 'lo_quantity', 'lo_orderdate']}, predicates: {'lineorder': ['lo_linenumber', 'lo_quantity']}, order by: {}, group by: {}


In [30]:
# candidate indexes
C = []
for template_id in (1, 8, 14):
    C += extract_query_indexes(qg.generate_query(template_id), include_cols=True)

ibg = IBG(query, C)

ibg.print_ibg()

Number of candidate indexes: 127
Created root node with id: 1_2_3_4_5_7_8_9_10_11_12_13_14_15_16_17_18_19_20_21_22_23_24_25_26_27_28_29_30_31_32_33_34_35_36_37_42_43_44_45_46_46_47_48_49_50_51_52_53_54_55_56_57_58_59_60_61_62_63_64_65_66_67_68_69_70_71_72_72_73_73_74_75_75_76_76_77_78_79_80_81_82_83_84_85_86_87_88_89_90_91_92_93_94_95_96_97_98_99_100_101_102_103_104_105_106_107_108_109_110_111_112_113_114_115_116_117_118_119_119_120_121_122_123_124_125_126
Constructing IBG...


No index scans were explicitly noted in the query plan.
1_2_3_4_5_7_8_9_10_11_12_13_14_15_16_17_18_19_20_21_22_23_24_25_26_27_28_29_30_31_32_33_34_35_36_37_42_43_44_45_46_46_47_48_49_50_51_52_53_54_55_56_57_58_59_60_61_62_63_64_65_66_67_68_69_70_71_72_72_73_73_74_75_75_76_76_77_78_79_80_81_82_83_84_85_86_87_88_89_90_91_92_93_94_95_96_97_98_99_100_101_102_103_104_105_106_107_108_109_110_111_112_113_114_115_116_117_118_119_119_120_121_122_123_124_125_126 -> 
1_2_3_4_5_7_8_9_10_11_12_13_14_15_16_17_18_19_20_21_22_23_24_25_26_27_28_29_30_31_32_33_34_35_36_37_42_43_44_45_46_46_47_48_49_50_51_52_53_54_55_56_57_58_59_60_61_62_63_64_65_66_67_68_69_70_71_72_72_73_73_74_75_75_76_76_77_78_79_80_81_82_83_84_85_86_87_88_89_90_91_92_93_94_95_96_97_98_99_100_101_102_103_104_105_106_107_108_109_110_111_112_113_114_115_116_117_118_119_119_120_121_122_123_124_125 -> 
1_2_3_4_5_7_8_9_10_11_12_13_14_15_16_17_18_19_20_21_22_23_24_25_26_27_28_29_30_31_32_33_34_35_36_37_42_43_44_45_46_46_47_48_49_50_51_52_53

In [31]:
# pick random subset of candidate indexes
X = random.sample(ibg.C, 8)
cost, used = ibg.get_cost_used(X)
print(f"Cost: {cost}, Used indexes: {[idx.index_id for idx in used]}")

cost, used = ibg._get_cost_used(X)
print(f"Cost: {cost}, Used indexes: {[idx.index_id for idx in used]}")

Cost: 1461523.49, Used indexes: []
No index scans were explicitly noted in the query plan.
Cost: 1461523.49, Used indexes: []


In [64]:
C_other = extract_query_indexes(qg.generate_query(14), include_cols=True)  
# pick a random index to compute maximum benefit
a = random.choice(C_other)
max_benefit = ibg.compute_max_benefit(a)
print(f"Maximum benefit of adding index {a.index_id}: {max_benefit}")

Computing benefit for node: ['IX_lineorder_lo_orderdate', 'IXN_lineorder_lo_orderdate_lo_e', 'IXN_lineorder_lo_orderdate_lo_d', 'IXN_lineorder_lo_orderdate_lo_e_lo_d', 'IX_lineorder_lo_discount', 'IXN_lineorder_lo_discount_lo_e', 'IX_lineorder_lo_quantity', 'IXN_lineorder_lo_quantity_lo_e', 'IXN_lineorder_lo_quantity_lo_d', 'IXN_lineorder_lo_quantity_lo_e_lo_d', 'IX_lineorder_lo_orderdate_lo_discount', 'IXN_lineorder_lo_orderdate_lo_discount_lo_e', 'IX_lineorder_lo_orderdate_lo_quantity', 'IXN_lineorder_lo_orderdate_lo_quantity_lo_e', 'IXN_lineorder_lo_orderdate_lo_quantity_lo_d', 'IXN_lineorder_lo_orderdate_lo_quantity_lo_e_lo_d', 'IX_lineorder_lo_discount_lo_orderdate', 'IXN_lineorder_lo_discount_lo_orderdate_lo_e', 'IX_lineorder_lo_discount_lo_quantity', 'IXN_lineorder_lo_discount_lo_quantity_lo_e', 'IX_lineorder_lo_quantity_lo_orderdate', 'IXN_lineorder_lo_quantity_lo_orderdate_lo_e', 'IXN_lineorder_lo_quantity_lo_orderdate_lo_d', 'IXN_lineorder_lo_quantity_lo_orderdate_lo_e_lo_d',