# Decision Trees for Rulesetting

We will use a decision tree to determine whether the data collected in section one falls within a specified ruleset, and therefore whether each point is in or out of the bag.

While many ML algorithms are considered 'black boxes', a decision tree is much more human-readable and lets us better determine why it predicts or acts the way it does.

This exercise is an example of supervised machine learning: we train a model on training data and then use it to make predictions.

In [1]:
data = [['a', 0, 'good'], ['a', 101, 'good'], ['b', -1, 'bad']] #['letter', 'number', 'class']

You can see that, to classify the 3 data points above as either good or bad, you could choose the rule "if letter == a then good, else bad", or you could choose the rule "if number > -1 then good, else bad".

Either way, this is how decision trees work: they split the data into subgroups, using the features provided (like letter and number), to make a prediction on class.
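As a minimal sketch of the two candidate rules described above, each rule can be written as a small function and checked against the three data points. The function names `rule_by_letter` and `rule_by_number` are hypothetical and introduced only for illustration.

```python
data = [['a', 0, 'good'], ['a', 101, 'good'], ['b', -1, 'bad']]  # ['letter', 'number', 'class']

def rule_by_letter(point):
    # Hypothetical rule: split on the 'letter' feature.
    return 'good' if point[0] == 'a' else 'bad'

def rule_by_number(point):
    # Hypothetical rule: split on the 'number' feature.
    return 'good' if point[1] > -1 else 'bad'

actual = [p[-1] for p in data]
letter_predictions = [rule_by_letter(p) for p in data]
number_predictions = [rule_by_number(p) for p in data]

# Both rules reproduce the class column for these three points.
print(letter_predictions == actual, number_predictions == actual)
```

With only three points, both features separate the classes equally well, which is why either rule is a valid choice here.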

As data sets grow larger and features become more plentiful, you will need better methods for splitting the tree and accounting for the increase in entropy (the measure of uncertainty) in the model.

We can also represent trees with dictionaries, as shown below.

In [2]:
tree = {'letter': {'a': 'good', 'b': 'bad'}, 'number':{0: 'good', 101: 'good', -1: 'bad'}}

In [3]:
tree['letter']['a']

'good'

In [4]:
tree['letter']['b']

'bad'

In [6]:
tree['number'][101] #can replace with 0 or -1; anything else will raise a KeyError since we don't use conditionals yet

'good'
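The dictionary lookup above only succeeds for keys that are stored in the tree. As a minimal sketch of how a conditional could stand in for the exact-match lookup, the hypothetical helper `classify_number` below applies the `number > -1` rule mentioned earlier. The helper and its `threshold` parameter are assumptions for illustration and are not part of the original notebook.

```python
def classify_number(number, threshold=-1):
    # Hypothetical helper: any number above the threshold is 'good'.
    return 'good' if number > threshold else 'bad'

seen = classify_number(101)      # a value that is also a key in the tree
unseen_high = classify_number(57)   # not a key in the tree, still classified
unseen_low = classify_number(-20)   # not a key in the tree, below the threshold
print(seen, unseen_high, unseen_low)
```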

Now we can sort the data points into groups and push each data point on to the next decision in the tree.

Let's first set up a partition over numbers, then a second grouping based on letter, to place each datum (a single point of data) into its final grouping as good or bad.

In [26]:
good_group = []
bad_group = []

for key, value in tree.items():
    if key == 'number':
        for k,v in value.items():
            if k > -1:
                good_group.append(k)
            else:
                bad_group.append(k)
    if key == 'letter':
        for k,v in value.items():
            if k == 'a':
                good_group.append(k)
            else:
                bad_group.append(k)
        

In [30]:
print(good_group,bad_group)

['a', 0, 101] ['b', -1]


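To figure out the entropy (uncertainty) of our dataset, we use the proportion of the data that falls within each category, where $P(X_i)$ is the proportion of data points in category $i$ and $n$ is the number of categories.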

 $ H_{\text{entropy}} = - \sum_{i=1}^{n} P(X_i) \log_2 P(X_i) $

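With log base 2, entropy is measured in bits. For a two-class problem like ours it ranges from 0 (every point is in the same class) to 1 (the two classes are evenly split); with more than two classes the maximum is larger than 1.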
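To make the formula concrete, here is a small sketch that evaluates it for a few class proportions. The function name `entropy_from_proportions` is hypothetical and takes the proportions directly rather than a dataset.

```python
import math

def entropy_from_proportions(proportions):
    # H = sum(-p * log2(p)); a proportion of 0 contributes nothing.
    return sum(-p * math.log2(p) for p in proportions if p > 0)

pure = entropy_from_proportions([1.0])         # every point in one class
even = entropy_from_proportions([0.5, 0.5])    # two classes, evenly split
skewed = entropy_from_proportions([0.9, 0.1])  # two classes, mostly one

print(pure, even, skewed)
```

The pure set has no uncertainty, the even split has the maximum for two classes, and the skewed case falls in between.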

Now that we have a basis for understanding entropy and calculating split points, we can apply this to the data we pickled in section 1.

In [34]:
import pickle

with open('../Section1/data_rand', 'rb') as f:
    L = pickle.load(f)

In [35]:
L #[X,Y,Out of the box or not]

[[0.0, 0.0, False],
 [-6.953878909250905, -2.80954945061934, False],
 [-21.508314803390853, 0.8192789833756753, False],
 [-35.97103602133799, 18.05527895355268, False],
 [-38.063730233661744, 47.982200461347404, True],
 [0.0, 0.0, False],
 [4.193946776030601, 6.217781794162813, False],
 [18.61287221510538, 10.352342131417803, False],
 [39.1676450120639, 1.2007676622123036, True],
 [0.0, 0.0, False],
 [-6.14364033216744, -4.301823272632844, False],
 [-7.45097647338231, 10.641097198743338, False],
 [12.940948734942312, 1.1321863095776035, False],
 [-11.633612593727433, -16.07510678095378, False],
 [-14.901952946764634, 21.28219439748667, False],
 [25.881897469884617, 2.2643726191552283, False],
 [-17.123584855287408, -27.848390289274725, False],
 [-22.352929420146967, 31.92329159622998, False],
 [38.82284620482692, 3.3965589287328655, True],
 [0.0, 0.0, False],
 [2.0672801686274953, -7.209462719537391, False],
 [-2.067280168627497, 7.209462719537391, False],
 [4.134560337254994, -14.4189

In [32]:
import collections
import math
import operator

def entropy(data):
    frequency = collections.Counter([item[-1] for item in data]) # output == ({False: 73, True: 10})
    def item_entropy(category):
        ratio = float(category)/len(data) #ratio of category to len of dataset
        return -1 * ratio * math.log2(ratio) #neg log base 2 of this item
    return sum(item_entropy(c) for c in frequency.values()) #sum it all up to return entropy

In [74]:
print(entropy(L))

0.530744566923854


In [83]:
def best_feature_for_split(data):
    baseline = entropy(data)
    def feature_entropy(f):
        def e(v):
            partitioned_data = [d for d in data if d[f] == v]
            proportion = (float(len(partitioned_data)) / float(len(data)))
            return proportion * entropy(partitioned_data)
        return sum(e(v) for v in set([d[f] for d in data]))
    features = len(data[0]) - 1
    information_gain = [baseline - feature_entropy(f) for f in range(features)]
    best_feature, best_gain = max(enumerate(information_gain), key=operator.itemgetter(1))
    return best_feature

In [84]:
best_feature_for_split(L)

0

In [85]:
def potential_leaf_node(data):
    '''
    returns a tuple of the most common category and a count (category, count)
    '''
    count = collections.Counter([i[-1] for i in data])
    return count.most_common(1)[0]

In [86]:
potential_leaf_node(L)

(False, 73)

In [81]:
def create_tree(data, label):
    category, count = potential_leaf_node(data)
    if count == len(data):
        return category
    node = {}
    feature = best_feature_for_split(data)
    feature_label = label[feature]
    node[feature_label] = {}
    classes = set([d[feature] for d in data])
    for c in classes:
        partitioned_data = [d for d in data if d[feature] == c]
        node[feature_label][c] = create_tree(partitioned_data, label)
    return node

In [116]:
def classify(tree, label, data):
    root = list(tree.keys())[0]
    node = tree[root]
    index = label.index(root)
    for k in node.keys():
        if data[index] == k:
            if isinstance(node[k],dict):
                return classify(node[k], label, data)
            else:
                return node[k]

In [110]:
def as_rule_str(tree, label, ident=0):
    space_ident = '  ' * ident
    s = space_ident
    root = list(tree.keys())[0]
    node = tree[root]
    index = label.index(root)
    for k in node.keys():
        s += 'if ' + label[index] + ' = ' + str(k)
        if isinstance(node[k], dict):
            s += ':\n' + space_ident + as_rule_str(node[k], label, ident + 1)
        else:
            s += ' then ' + str(node[k]) + ('.\n' if ident == 0 else ', ')
    if s[-2:] == ', ':
        s = s[:-2] # drop the trailing ', ' separator
    s+= '\n'
    return s

In [111]:
data = [[0,0, False], [1,0,False], [0,1,True], [1,1,True]]
label = ['x','y','out']

In [112]:
tree =  create_tree(data, label)

In [114]:
print(as_rule_str(tree, label))

if y = 0 then False.
if y = 1 then True.
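The rules above mention only `y`, because splitting on `y` removes all uncertainty in this toy dataset while splitting on `x` removes none. The sketch below recomputes the information gain for both features. The names `class_entropy` and `information_gain` are hypothetical stand-ins that mirror the logic of `entropy` and `best_feature_for_split`, rewritten here so the example is self-contained.

```python
import collections
import math

def class_entropy(rows):
    # Entropy of the class column (the last item of each row).
    counts = collections.Counter(row[-1] for row in rows)
    return sum(-(c / len(rows)) * math.log2(c / len(rows)) for c in counts.values())

def information_gain(rows, feature):
    # Baseline entropy minus the size-weighted entropy of each partition.
    remainder = 0.0
    for value in set(row[feature] for row in rows):
        subset = [row for row in rows if row[feature] == value]
        remainder += len(subset) / len(rows) * class_entropy(subset)
    return class_entropy(rows) - remainder

toy = [[0, 0, False], [1, 0, False], [0, 1, True], [1, 1, True]]
gain_x = information_gain(toy, 0)  # each x value still mixes False and True
gain_y = information_gain(toy, 1)  # each y value holds a single class
print(gain_x, gain_y)
```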




In [117]:
print(classify(tree, label, [1,1]))
print(classify(tree, label, [0,0]))
print(classify(tree, label, [1,2])) # can't classify what it hasn't seen
print(classify(tree, label, [3,4])) # can't classify what it hasn't seen
print(classify(tree, label, [-1,-1])) # can't classify what it hasn't seen

True
False
None
None
None


In [118]:
tree = create_tree(L, label)

In [119]:
print(as_rule_str(tree, label))

if x = 0.0 then False.
if x = 2.0672801686274953 then False.
if x = 2.317627457812104 then False.
if x = 4.193946776030601 then False.
if x = 4.134560337254994 then False.
if x = 6.201840505882496 then False.
if x = 7.20946271953739 then False.
if x = 8.269120674510003 then False.
if x = 7.732955170074895 then False.
if x = 10.336400843137511 then True.
if x = 4.635254915624211 then False.
if x = 12.940948734942312 then False.
if x = 6.952882373436321 then False.
if x = 9.270509831248434 then False.
if x = 11.588137289060553 then False.
if x = 13.905764746872675 then True.
if x = 14.724407751714958 then False.
if x = 18.61287221510538 then False.
if x = 14.724407751714962 then False.
if x = 20.975911692951723 then False.
if x = -2.317627457812106 then False.
if x = 22.159188913479625 then False.
if x = 22.086611627572438 then False.
if x = 25.881897469884617 then False.
if x = 26.15056350788288 then True.
if x = 28.670995575989526 then False.
if x = 29.44881550342992 then False.
if x =

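That is a ton of rules if we use the data from our prior section. Because the tree branches on every exact value of a feature, each distinct x coordinate gets its own rule. We need to generalize the rules.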
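One common way to generalize rules over a continuous feature is to split on a threshold instead of on every exact value. The sketch below is an illustration of that idea only and is not the approach this notebook takes. The names `class_entropy` and `best_threshold` are hypothetical, and the `sample` rows are made up in the same `[x, y, out]` layout rather than taken from the pickled data.

```python
import collections
import math

def class_entropy(rows):
    # Entropy of the class column (the last item of each row).
    counts = collections.Counter(row[-1] for row in rows)
    return sum(-(c / len(rows)) * math.log2(c / len(rows)) for c in counts.values())

def best_threshold(rows, feature):
    # Try the midpoint between each pair of adjacent distinct values and
    # keep the one with the lowest size-weighted entropy after the split.
    values = sorted(set(row[feature] for row in rows))
    best = None
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [r for r in rows if r[feature] <= t]
        right = [r for r in rows if r[feature] > t]
        weighted = (len(left) * class_entropy(left)
                    + len(right) * class_entropy(right)) / len(rows)
        if best is None or weighted < best[1]:
            best = (t, weighted)
    return best  # None when the feature has fewer than two distinct values

# Made-up rows for illustration; not the data from section 1.
sample = [[1.0, 0.0, False], [2.5, 0.0, False], [4.0, 0.0, False],
          [30.0, 0.0, True], [41.0, 0.0, True]]
threshold, weighted_entropy = best_threshold(sample, 0)
print(threshold, weighted_entropy)
```

A single rule of the form "if x <= threshold" then replaces one rule per distinct x value, and it also covers values that never appeared in the training data.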

In [135]:
X = []
Y = []
for i in L:
    X.append(i[0])
    Y.append(i[1])

In [141]:
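def find_edges(tree, label, X, Y):
    # Sort copies so the caller's lists are left untouched.
    X = sorted(X)
    Y = sorted(Y)
    # Values that occur as both an x and a y coordinate, in ascending order.
    diagonals = sorted(set(X).intersection(Y))
    # Classify each point (d, d) on the diagonal.
    results = [classify(tree, label, [d, d]) for d in diagonals]
    low = results.index(False) # position of the first False result
    min_x = X[low]
    min_y = Y[low]
    high = results[::-1].index(False) # offset of the last False result, counted from the end
    max_x = X[len(X) - 1 - high]
    max_y = Y[len(Y) - 1 - high]

    return (min_x, min_y), (max_x, max_y)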

In [148]:
find_edges(tree, label, X, Y) # you can use this to create a new rule to classify a point.

((-48.82698562044138, -55.417272249533355),
 (39.1676450120639, 47.982200461347404))

In [163]:
def new_classifier(tree, label, data, pred_point):
    X = [i[0] for i in data]
    Y = [i[1] for i in data]
    # Compute the bounding box once instead of once per corner value.
    (min_x, min_y), (max_x, max_y) = find_edges(tree, label, X, Y)
    x, y = pred_point[0], pred_point[1]
    if (min_x < x < max_x) and (min_y < y < max_y):
        return 'Inside', True
    else:
        return 'Outside', False
    

In [164]:
new_classifier(tree, label, L, [-5, 5]) #change points around to see if it's within or outside the bounds

('Inside', True)

We now have a new classifier that uses the edges found in our training data to determine the bounding box for our decision tree!