# COMP9418 - Assignment 1 - Bayesian Networks as Classifiers

## UNSW Sydney, October 2020

- Student name 1 - zID
- Student name 2 - zID

## Instructions

**Submission deadline:** Sunday, 18th October 2020, at 18:00:00.

**Late Submission Policy:** The penalty is set at 20% per late day. This is a ceiling penalty: the maximum achievable mark drops by 20 points per late day, so a group that is marked 60/100 and submitted two days late (ceiling of 60/100) still gets 60/100.

**Form of Submission:** This is a group assignment. Each group can have up to **two** students. **Only one member of the group should submit the assignment**.

You can reuse any piece of source code developed in the tutorials.

Submit your files using give. On a CSE Linux machine, type the following on the command-line:

``$ give cs9418 ass1 solution.zip``

Alternatively, you can submit your solution via [WebCMS](https://webcms3.cse.unsw.edu.au/COMP9418/20T3).

## Technical prerequisites

These are the libraries you are allowed to use. No other libraries will be accepted.

In [85]:
# Make division default to floating-point, saving confusion
from __future__ import division
from __future__ import print_function

# Allowed libraries
import numpy as np
import pandas as pd
import scipy as sp
import heapq as pq
import matplotlib as mp
import math
from itertools import product, combinations
from collections import OrderedDict as odict
from graphviz import Digraph
from tabulate import tabulate

# Supplemental libraries
import copy

## Initial task - Initialise graph

Create a graph ``G`` that represents the following network by filling in the edge lists.
![Bayes Net](BayesNet.png)


In [86]:
G = {
    "BreastDensity" : ["Mass"],
    "Location" : ["BC"],
    "Age" : ["BC"],
    "BC" : ["Mass", "AD", "Metastasis", "MC", "SkinRetract","NippleDischarge"],
    "Mass" : ["Size",  "Shape", "Margin" ],
    "AD" : ["FibrTissueDev"],
    "Metastasis" : [ "LymphNodes"],
    "MC" : [],
    "Size" : [],
    "Shape" : [],
    "FibrTissueDev" : [ "SkinRetract" , "NippleDischarge","Spiculation" ],
    "LymphNodes" : [],
    "SkinRetract" : [],
    "NippleDischarge" : [],
    "Spiculation" : ["Margin" ],
    "Margin" : [],
}

## [20 Marks] Task 1 - Efficient d-separation test

Implement the efficient version of the d-separation algorithm in a function ``d_separation(G, X, Z, Y)`` that returns a boolean: true if **X** is d-separated from **Y** given **Z** in the graph $G$ and false otherwise.

* **X**, **Y** and **Z** are Python sets, each containing a set of variable names.
* Variable names may be strings or integers, and can be assumed to be nodes of the graph $G$. 
* $G$ is a graph as defined in tutorial 1.
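The pruning-based test can be sketched in three steps: repeatedly delete leaf nodes outside **X**, **Y** and **Z**, delete every edge leaving a node in **Z**, then check whether **X** and **Y** are still connected when edge directions are ignored. The function name `d_separated_sketch` and the toy graph are illustrative assumptions, not part of the assignment; the graph is assumed to be a dictionary from each node to its list of children.

```python
def d_separated_sketch(graph, X, Z, Y):
    """Pruning-based d-separation test on a DAG given as {node: [children]}."""
    X, Y, Z = set(X), set(Y), set(Z)
    keep = X | Y | Z
    # Work on a private copy with children stored as sets
    g = {v: set(ch) for v, ch in graph.items()}

    # Step 1: repeatedly delete leaf nodes outside X, Y and Z
    changed = True
    while changed:
        leaves = [v for v, ch in g.items() if not ch and v not in keep]
        changed = bool(leaves)
        for leaf in leaves:
            del g[leaf]
        for ch in g.values():
            ch.difference_update(leaves)

    # Step 2: delete every edge leaving a node in Z
    for z in Z:
        g[z] = set()

    # Step 3: ignore edge directions and search from X for any node in Y
    undirected = {v: set() for v in g}
    for v, ch in g.items():
        for w in ch:
            undirected[v].add(w)
            undirected[w].add(v)
    stack, seen = list(X), set(X)
    while stack:
        v = stack.pop()
        if v in Y:
            return False
        for w in undirected[v] - seen:
            seen.add(w)
            stack.append(w)
    return True


# Hypothetical toy graph: collider A -> C <- B with a descendant C -> D
toy = {"A": ["C"], "B": ["C"], "C": ["D"], "D": []}
print(d_separated_sketch(toy, {"A"}, set(), {"B"}))   # True: the unobserved collider blocks the path
print(d_separated_sketch(toy, {"A"}, {"D"}, {"B"}))   # False: observing a descendant of the collider opens it
```

Storing children as sets keeps each pruning pass simple and avoids deep-copying the graph on every deletion.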

In [87]:
## Develop your code for d_separation(G, X, Z, Y) in one or more cells here
import copy 


def isleaf_node(G,node):
    return not G[node] 

def remove_leaf(G1,leaf_node):
    # delete the leaf node and its edges from the G, return a new Graph
    G_new = copy.deepcopy(G1)
    del G_new[leaf_node]
    for key, value in  G_new.items(): 
        if leaf_node in value:
            G_new[key].remove(leaf_node)
    return G_new
    
def repeat_del(G1, node_list):
    """Repeatedly delete the nodes of node_list that are leaves, until none of them is a leaf."""
    remaining = set(node_list)
    removed_any = True
    while removed_any:
        removed_any = False
        for node in list(remaining):
            if isleaf_node(G1, node):
                # remove_leaf returns a new graph, so the caller's graph is left untouched
                G1 = remove_leaf(G1, node)
                remaining.remove(node)
                removed_any = True
    return G1
    
 
    
def dfs_r(G, v, colour):
    """
    argument 
    `G`, an adjacency list representation of a graph
    `v`, next vertex to be visited
    `colour`, dictionary with the colour of each node
    """
    #print('Visiting: ', v)
    # Visited vertices are coloured 'grey'
    colour[v] = 'grey'
    # Let's visit all outgoing edges from v
    for w in G[v]:
        # To avoid loops, we check that the next vertex hasn't been visited yet
        if colour[w] == 'white':
            dfs_r(G, w, colour)
    # When we finish the for loop, we know we have visited all nodes from v. It is time to turn it 'black'
    colour[v] = 'black' 
    return None
  


def d_separation(G1, X, Z, Y): 
    """ 
    Arguments: 
    `G1`, an adjacency list representation of a graph 
    `X`, a set of variable names 
    `Z`, a set of variable names (the conditioning set) 
    `Y`, a set of variable names 
    
    Returns 
    a boolean: True if X is d-separated from Y given Z in the graph G1 and False otherwise.
    
    """
    # X, Y and Z are expected to be pairwise disjoint
    if (X & Y) or (X & Z) or (Y & Z):
        print("X, Y, Z are not disjoint")  # TODO: raise an error or issue a warning instead
    
    combine_set = X.union(Y).union(Z)
    node_set = set(G1.keys())
    remain_nodes = set(node_set  - combine_set)
    
    G_final = copy.deepcopy(repeat_del(G1,remain_nodes)) #repeat del leaf nodes 
    
    for var in Z: 
        G_final[var] = [] #delete outgoing edges from Z set 
    
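    # Make the graph undirected. Iterate over a snapshot of the directed edges
    # so that the reverse edges added here are not processed again
    directed_edges = [(key, n) for key, values in G_final.items() for n in values]
    for key, n in directed_edges:
        G_final[n].append(key)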

    colour = {node: 'white' for node in G_final.keys()}
    #check connectivity 
    separate = True 
    for nodex in X:
        dfs_r(G_final,nodex,colour) 
        Y_color = [colour[node] for node in Y]
        #print(Y_color)
        if 'black' in Y_color:
            separate = False
            

    return(separate)
        

In [88]:
############
## TEST CODE

def test(statement):
    if statement:
        print("Passed test case")
    else:
        print("Failed test case")
        
test(d_separation(G, set(['Age']), set(['BC']), set(['AD'])))
test(not d_separation(G, set(['Spiculation','LymphNodes']), set(['MC', 'Size']), set(['Age'])))

Passed test case
Passed test case


## [10 Marks] Task 2 - Estimate Bayesian Network parameters from data

Implement a function ``learn_outcome_space(data)`` that learns the outcome space (the valid values for each variable) from the pandas dataframe ``data`` and returns a dictionary ``outcomeSpace`` with these values.

Implement a function ``learn_bayes_net(G, data, outcomeSpace)`` that learns the parameters of the Bayesian Network $G$. This function should return a dictionary ``prob_tables`` with all the conditional probability tables (one for each node).

- ``G`` is a directed acyclic graph. For this part of the assignment, $G$ should be declared according to the breast cancer Bayesian network presented in the diagram in the assignment specification.
- ``data`` is a dataframe created from a csv file containing the relevant data. 
- ``outcomeSpace`` is defined in tutorials.
- ``prob_tables`` is a dict from each variable name (node) to a "factor". Factors are defined in tutorial 2. 
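Parameter learning here amounts to maximum-likelihood estimation by counting: each table entry is the number of rows matching both the child value and the parent configuration, divided by the number of rows matching the parent configuration. The sketch below illustrates this on a hypothetical two-column dataset; `estimate_cpt_sketch`, the column names, and the uniform fallback for unseen parent configurations are assumptions made for illustration.

```python
import pandas as pd
from itertools import product

def estimate_cpt_sketch(df, var, parents, outcome_space):
    """Return {'dom': (parents..., var), 'table': {(values...): P(var | parents)}}."""
    table = {}
    for parent_vals in product(*[outcome_space[p] for p in parents]):
        # Rows that match this parent configuration
        mask = pd.Series(True, index=df.index)
        for p, val in zip(parents, parent_vals):
            mask &= df[p] == val
        n_parent = int(mask.sum())
        for outcome in outcome_space[var]:
            n_joint = int((mask & (df[var] == outcome)).sum())
            if n_parent == 0:
                # Assumption: use a uniform distribution when the parent
                # configuration never occurs, rather than dividing by zero
                table[parent_vals + (outcome,)] = 1.0 / len(outcome_space[var])
            else:
                table[parent_vals + (outcome,)] = n_joint / n_parent
    return {"dom": tuple(parents) + (var,), "table": table}

# Hypothetical toy data: Rain is the only parent of Grass
toy = pd.DataFrame({
    "Rain":  ["yes", "yes", "no", "no", "no", "no"],
    "Grass": ["wet", "wet", "wet", "dry", "dry", "dry"],
})
space = {c: tuple(toy[c].unique()) for c in toy.columns}
cpt = estimate_cpt_sketch(toy, "Grass", ["Rain"], space)
print(cpt["table"][("no", "wet")])   # 0.25 (1 of the 4 rows with Rain = no)
```

A node without parents is handled by the same loop, because `product()` over an empty list yields a single empty configuration.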

In [None]:
## Develop your code for learn_outcome_space(data) in one or more cells here

In [89]:
def learn_outcome_space(data):
    outcomeSpace = {}
    for attr in data.columns:
        outcomeSpace[attr] = tuple(data[attr].unique())
    return(outcomeSpace)

In [90]:
############
## TEST CODE

with open('bc.csv') as file:
    data = pd.read_csv(file)

outcomeSpace = learn_outcome_space(data)

outcomes = outcomeSpace['BreastDensity']
answer = ('high', 'medium', 'low')
test(len(outcomes) == len(answer) and set(outcomes) == set(answer))

Passed test case


In [7]:
## Develop your code for learn_bayes_net(G, data, outcomeSpace) in one or more cells here

In [91]:
# Auxiliary functions
def printFactor(f):
    """
    argument 
    `f`, a factor to print on screen
    """
    # Create an empty list that we will fill in with the probability table entries
    table = list()
    
    # Iterate over all keys and probability values in the table
    for key, item in f['table'].items():
        # Convert the tuple to a list to be able to manipulate it
        k = list(key)
        # Append the probability value to the list with key values
        k.append(item)
        # Append an entire row to the table
        table.append(k)
    # dom is used as table header. We need it converted to list
    dom = list(f['dom'])
    # Append a 'Pr' to indicate the probability column
    dom.append('Pr')
    print(tabulate(table,headers=dom,tablefmt='fancy_grid'))
    
def prob(factor, *entry):
    """
    argument 
    `factor`, a dictionary of domain and probability values,
    `entry`, a list of values, one for each variable in the same order as specified in the factor domain.
    
    Returns p(entry)
    """

    return factor['table'][entry]  

In [94]:
def allEqualThisIndex(dict_of_arrays,**fixed_vars):
    first_array = dict_of_arrays[list(dict_of_arrays.keys())[0]]
    index = np.ones_like(first_array,dtype=np.bool_)
    for var_name,var_val in fixed_vars.items():
        index = index & (np.asarray(dict_of_arrays[var_name])==var_val)
    return (index)

def estProbTable(data,var_name,parent_names,outcomeSpace):
    var_outcomes = outcomeSpace[var_name]
    parent_outcomes = [outcomeSpace[var] for var in parent_names]
    all_parent_combinations = product(*parent_outcomes)
    prob_table = odict()
    
    for i,parent_combination in enumerate(all_parent_combinations):
        parent_vars = dict(zip(parent_names,parent_combination))
        parent_index = allEqualThisIndex(data,**parent_vars)
        for var_outcome in var_outcomes:
            var_index = (np.asarray(data[var_name])==var_outcome)
            new_dom = tuple(list(parent_combination)+[var_outcome])
            prob_table[new_dom]=(var_index & parent_index).sum()/parent_index.sum()
    return({'dom':tuple(list(parent_names)+[var_name]),'table':prob_table})

def transposeGraph(G):
    GT = dict((v,[]) for v in G)
    for v in G:
        for w in G[v]:
            GT[w].append(v)
    return (GT)

In [92]:
def learn_bayes_net(G,data,outcomeSpace):
    bayes_net = odict()
    GT = transposeGraph(G)
    for child, parents in GT.items():
        bayes_net[child] = estProbTable(data,child,parents,outcomeSpace)
    return(bayes_net)

In [95]:
############
## TEST CODE

prob_tables = learn_bayes_net(G, data, outcomeSpace)
test(abs(prob_tables['Age']['table'][('35-49',)] - 0.2476) < 0.001)

Passed test case


## [20 Marks] Task 3 - Bayesian Network Classification

Design a new function ``assess_bayes_net(G, prob_tables, data, outcomeSpace, class_var)`` that uses the test cases in ``data`` to assess the performance of the Bayesian network defined by ``G`` and ``prob_tables``. Implement the efficient classification procedure discussed in the lectures. Such a function should return the classifier accuracy. 
 * ``class_var`` is the name of the variable you are predicting, using all other variables.
 * ``outcomeSpace`` was created in task 2
 
Remember to remove the variables ``metastasis`` and ``lymphnodes`` from the dataset before assessing the accuracy.

Return just the accuracy:

``acc = assess_bayes_net(G, prob_tables, data, outcomeSpace, class_var)``
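When every other variable is observed, the posterior of the class variable depends only on the factors that mention it: its own CPT and the CPTs of its children. A minimal sketch of that idea follows, assuming the factor format used in this notebook (`'dom'` and `'table'`) and assuming all variables appearing in those CPTs are observed; `classify_sketch` and the toy network are hypothetical.

```python
def classify_sketch(graph, cpts, outcome_space, class_var, row):
    """Return the most probable class value given a fully observed row (a dict)."""
    # Only the CPT of the class variable and those of its children mention class_var
    relevant = [class_var] + list(graph[class_var])
    best_value, best_score = None, -1.0
    for c in outcome_space[class_var]:
        assignment = dict(row)
        assignment[class_var] = c
        score = 1.0
        for var in relevant:
            key = tuple(assignment[v] for v in cpts[var]["dom"])
            score *= cpts[var]["table"][key]
        # The unnormalised score is enough to pick the argmax
        if score > best_score:
            best_value, best_score = c, score
    return best_value

# Hypothetical toy network: Age -> Disease -> Test
toy_graph = {"Age": ["Disease"], "Disease": ["Test"], "Test": []}
toy_space = {"Age": ("young", "old"), "Disease": ("yes", "no"), "Test": ("pos", "neg")}
toy_cpts = {
    "Disease": {"dom": ("Age", "Disease"),
                "table": {("young", "yes"): 0.1, ("young", "no"): 0.9,
                          ("old", "yes"): 0.4, ("old", "no"): 0.6}},
    "Test": {"dom": ("Disease", "Test"),
             "table": {("yes", "pos"): 0.9, ("yes", "neg"): 0.1,
                       ("no", "pos"): 0.2, ("no", "neg"): 0.8}},
}
print(classify_sketch(toy_graph, toy_cpts, toy_space, "Disease", {"Age": "old", "Test": "pos"}))    # yes
print(classify_sketch(toy_graph, toy_cpts, toy_space, "Disease", {"Age": "young", "Test": "pos"}))  # no
```

This costs one table lookup per relevant factor and class value for each test row, instead of building and marginalising a joint table.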

In [96]:
## Develop your code for assess_bayes_net(G, prob_tables, data, outcomeSpace, class_var) in one or more cells here
def join(f1, f2, outcomeSpace):
    """
    argument 
    `f1`, first factor to be joined.
    `f2`, second factor to be joined.
    `outcomeSpace`, dictionary with the domain of each variable
    
    Returns a new factor with a join of f1 and f2
    """
    
    # First, we need to determine the domain of the new factor. It will be union of the domain in f1 and f2
    # But it is important to eliminate the repetitions
    common_vars = list(f1['dom']) + list(set(f2['dom']) - set(f1['dom']))
    
    # We will build a table from scratch, starting with an empty list. Later on, we will transform the list into an odict
    table = list()
    
    # Here is where the magic happens. The product iterator will generate all combinations of variable values 
    # as specified in outcomeSpace. Therefore, it will naturally respect observed values
    for entries in product(*[outcomeSpace[node] for node in common_vars]):
        
        # We need to map the entries to the domain of the factors f1 and f2
        entryDict = dict(zip(common_vars, entries))
        f1_entry = (entryDict[var] for var in f1['dom'])
        f2_entry = (entryDict[var] for var in f2['dom'])
        
        # Look up the probability of this entry in each factor
        p1 = prob(f1, *f1_entry)           # Use the function prob to get the probability in factor f1 for entry f1_entry 
        p2 = prob(f2, *f2_entry)           # Use the function prob to get the probability in factor f2 for entry f2_entry 
        
        # Create a new table entry with the multiplication of p1 and p2
        table.append((entries, p1 * p2))
    return {'dom': tuple(common_vars), 'table': odict(table)}

def p_joint(outcomeSpace, cond_tables):
    """
    argument 
    `outcomeSpace`, dictionary with domain of each variable
    `cond_tables`, conditional probability distributions estimated from data
    
    Returns a new factor with full joint distribution
    """    
    
    var_list = list(outcomeSpace.keys())
    p = join(cond_tables[var_list[0]], cond_tables[var_list[1]], outcomeSpace)

    for var in var_list[2:]:
        # var is already a variable name, so it indexes cond_tables directly
        p = join(p, cond_tables[var], outcomeSpace)

    return p




In [97]:
def markov_blanket(G,var):
    """Return the Markov blanket of var (its parents, children and children's other parents) as a list of nodes"""
    blanket_list = []
    blanket_list = blanket_list + G[var] #include the children
    children_list = blanket_list 
    GT = transposeGraph(G)
    blanket_list = blanket_list + GT[var] #include the parents 
    
    for node in children_list:
        blanket_list = blanket_list + GT[node] #include spouse 
    
    blanket_list = list(set(blanket_list))
    blanket_list = [i for i in blanket_list if i != var]
    return blanket_list
    
def p_joint_new(my_blanket, outcomeSpace, cond_tables):
    var_list = my_blanket
    
    p = join(cond_tables[var_list[0]], cond_tables[var_list[1]], outcomeSpace)

    for var in var_list[2:]:
        p = join(p,cond_tables[var], outcomeSpace)

    return p


In [98]:
def evidence(var, e, outcomeSpace):
    """
    argument 
    `var`, a valid variable identifier.
    `e`, the observed value for var.
    `outcomeSpace`, dictionary with the domain of each variable
    
    Returns dictionary with a copy of outcomeSpace with var = e
    """    
    newOutcomeSpace = outcomeSpace.copy()      # Shallow copy, so the caller's outcomeSpace is left unchanged
    newOutcomeSpace[var] = (e,)                # Replace the domain of var with a tuple holding the single observed value e
    return newOutcomeSpace

def marginalize(f, var, outcomeSpace):
    """
    argument 
    `f`, factor to be marginalized.
    `var`, variable to be summed out.
    `outcomeSpace`, dictionary with the domain of each variable
    
    Returns a new factor f' with dom(f') = dom(f) - {var}
    """    
    
    # Let's make a copy of f domain and convert it to a list. We need a list to be able to modify its elements
    new_dom = list(f['dom'])
    
    new_dom.remove(var)            # Remove var from the list new_dom
    table = list()                 # Create an empty list for table. We will fill in table from scratch
    for entries in product(*[outcomeSpace[node] for node in new_dom]):
        s = 0                      # Initialize the summation variable s

        # We need to iterate over all possible outcomes of the variable var
        for val in outcomeSpace[var]:
            # To modify the tuple entries, we will need to convert it to a list
            entriesList = list(entries)
            # We need to insert the value of var in the right position in entriesList
            entriesList.insert(f['dom'].index(var), val)
                      
            p = prob(f, *tuple(entriesList))     # Probability of factor f for entriesList
            s = s + p                            # Sum over all values of var by accumulating in s
            
        # Create a new table entry with the probability summed over var
        table.append((entries, s))
    return {'dom': tuple(new_dom), 'table': odict(table)}


def normalize(f):
    """
    argument 
    `f`, factor to be normalized.
    
    Returns a new factor f' as a copy of f with entries that sum up to 1
    """ 
    table = list()
    total = 0   # named total to avoid shadowing the built-in sum()
    for k, p in f['table'].items():
        total = total + p
    for k, p in f['table'].items():
        table.append((k, p/total))
    return {'dom': f['dom'], 'table': odict(table)}


def query(p, outcomeSpace, q_vars, **q_evi):
    """
    argument 
    `p`, probability table to query.
    `outcomeSpace`, dictionary with variable domains
    `q_vars`, list of variables in query head
    `q_evi`, dictionary of evidence in the form of variables names and values
    
    Returns a new NORMALIZED factor with all hidden variables eliminated and evidence set as in q_evi
    """     
    
    # Let's make a copy of these structures, since we will reuse the variable names
    pm = p.copy()
    outSpace = outcomeSpace.copy()
    
    # First, we set the evidence 
    for var_evi, e in q_evi.items():
        outSpace = evidence(var_evi, e, outSpace)
        
    # Second, we eliminate hidden variables NOT in the query
    for var in outSpace:
        if not var in q_vars:
            pm = marginalize(pm, var, outSpace)
    return normalize(pm)

In [99]:
def assess_bayes_net(G, prob_tables, data, outcomeSpace, class_var):
    
    var_blanket = markov_blanket(G,class_var)
    blanket_without_var = copy.deepcopy(var_blanket)
    var_blanket.append(class_var)
    var_remove = ['Metastasis', 'LymphNodes']
    var_list = [i for i in var_blanket if i not in var_remove] # now we get the variables that needs for inference class_var 
    
    p_table =  p_joint_new(var_list, outcomeSpace, prob_tables)
    q_var = class_var
    evidence_list = [var for var in var_list if var!=class_var]
    data_update = data[evidence_list]
    
    data_dict = data_update.to_dict(orient='records')
    outcomeSpace_copy = { var: outcomeSpace[var] for var in var_list}
    match_count = 0
    for i in range(len(data_dict)):
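        # Pass the query variable as a list: with a bare string, `var in q_vars` would be a substring test
        q_table = query(p_table, outcomeSpace_copy, [q_var], **data_dict[i])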
        pred = max(q_table['table'],key=q_table['table'].get)[0]
        if (pred == data.iloc[i][q_var]):
            match_count +=1
    return (match_count/data.shape[0])

In [101]:
############
## TEST CODE
class_var = "BC"
acc = assess_bayes_net(G, prob_tables, data, outcomeSpace, class_var)

In [102]:
acc

0.84225

Develop a function ``cv_bayes_net(G, data, class_var)`` that uses ``learn_outcome_space``, ``learn_bayes_net`` and ``assess_bayes_net`` to learn and assess a Bayesian network in a dataset using 10-fold cross-validation. Compute and report the average accuracy over the ten cross-validation runs as well as the standard deviation, e.g.

``acc, stddev = cv_bayes_net(G, data, class_var)``
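One way to form the folds, offered as a sketch rather than a required approach, is to shuffle the row positions once and split them with `np.array_split`, which keeps every row even when the row count is not a multiple of ten. The helper name `kfold_indices_sketch` and the fixed seed are assumptions.

```python
import numpy as np

def kfold_indices_sketch(n_rows, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; every row is used as test data exactly once."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    folds = np.array_split(shuffled, k)      # fold sizes differ by at most one
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

splits = list(kfold_indices_sketch(23, k=10))
print(len(splits))                           # 10
print(sum(len(test) for _, test in splits))  # 23: no rows are dropped
```

The index arrays can be passed to `data.iloc[...]` to obtain the training and test dataframes for each fold.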

In [103]:
## Develop your code for cv_bayes_net(G, data, class_var) in one or more cells here

In [104]:
# The cross validation is 10 fold here
def cv_bayes_net(G,data,class_var):
    outcomeSpace = learn_outcome_space(data)
    fold_len = int(data.shape[0]/10)
    acc_list = []
    for i in range(10):
        training_index = list(range(0,i*fold_len)) + list(range((i+1)*fold_len,data.shape[0]))
        test_index = list(range(i*fold_len,(i+1)*fold_len))
        training_data = data.iloc[training_index]
        test_data = data.iloc[test_index]
        prob_tables = learn_bayes_net(G,training_data,outcomeSpace)
        acc_list.append(assess_bayes_net(G,prob_tables,test_data,outcomeSpace,class_var))
    
    print(acc_list)
    return (np.mean(acc_list),np.std(acc_list))

In [105]:
############
## TEST CODE

acc, stddev = cv_bayes_net(G, data, 'BC')

[0.8395, 0.8445, 0.836, 0.8485, 0.846, 0.843, 0.8285, 0.84, 0.837, 0.848]


In [106]:
acc

0.8411

## [10 Marks] Task 4 - Naïve Bayes Classification

Design a new function ``assess_naive_bayes(G, prob_tables, data, outcomeSpace, class_var)`` to classify and assess the test cases in a dataset ``data`` according to the Naïve Bayes classifier. To classify each example, use the log probability trick discussed in the lectures. This function should return the accuracy of the classifier in ``data``.
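The log probability trick replaces the product $P(c)\prod_i P(e_i \mid c)$ with the sum $\log P(c) + \sum_i \log P(e_i \mid c)$, which has the same argmax but does not underflow when there are many evidence variables. A minimal sketch follows; the function name, the dictionary layout of the likelihood tables, and the toy numbers are illustrative assumptions.

```python
import math

def naive_bayes_predict_sketch(prior, likelihoods, observation):
    """
    prior:        {class_value: P(class_value)}
    likelihoods:  {feature: {(class_value, feature_value): P(feature_value | class_value)}}
    observation:  {feature: observed_value}
    Returns the predicted class value and the log score of every class value.
    """
    scores = {}
    for c, p_c in prior.items():
        score = math.log(p_c) if p_c > 0 else -math.inf
        for feature, value in observation.items():
            p = likelihoods[feature][(c, value)]
            # log(0) is undefined, so a zero probability is mapped to -inf explicitly
            score += math.log(p) if p > 0 else -math.inf
        scores[c] = score
    return max(scores, key=scores.get), scores

# Hypothetical toy model with two evidence variables
toy_prior = {"yes": 0.3, "no": 0.7}
toy_likelihoods = {
    "Mass": {("yes", "malign"): 0.8, ("yes", "benign"): 0.2,
             ("no", "malign"): 0.1, ("no", "benign"): 0.9},
    "Size": {("yes", "large"): 0.6, ("yes", "small"): 0.4,
             ("no", "large"): 0.2, ("no", "small"): 0.8},
}
pred, log_scores = naive_bayes_predict_sketch(
    toy_prior, toy_likelihoods, {"Mass": "malign", "Size": "large"})
print(pred)   # yes (0.3 * 0.8 * 0.6 = 0.144 against 0.7 * 0.1 * 0.2 = 0.014)
```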

In [110]:
## Develop your code for assess_naive_bayes(G, prob_tables, data, outcomeSpace, class_var) in one or more cells here

def naive_bayes_graph(outcomeSpace, class_var):
    """Return the naive-bayes graph structure (a dict) according to above info"""
    G_nb = {}
    node_list = list(outcomeSpace.keys())
    node_list.remove(class_var)
    
    G_nb[class_var] = node_list
    
    for nodes in node_list:
        G_nb[nodes] = []
    
    return G_nb


def single_var_query(e, node_table):
    '''Return {class value: P(variable = e | class value)} for one evidence variable.'''
    # Keys of a naive Bayes CPT are (class value, variable value) pairs
    prob_with_evi = {key[0]: value for key, value in node_table['table'].items() if key[1] == e}
    return prob_with_evi



    
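def predict(x, y_space, table, prior):
    # Log probability trick: add log-probabilities instead of multiplying probabilities,
    # which avoids numerical underflow when there are many evidence variables
    with np.errstate(divide='ignore'):   # log(0) = -inf is acceptable for the argmax
        pre_dict = {i: np.log(prior[i]) for i in y_space}
        for i in range(len(x)):
            var_prob = single_var_query(x.iloc[i], table[x.index[i]])
            for key in pre_dict.keys():
                pre_dict[key] = pre_dict[key] + np.log(var_prob[key])
        
    yhat = max(pre_dict, key=pre_dict.get)
    
    return yhat 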


def assess_naive_bayes(G, prob_tables, data, outcomeSpace, class_var):
    G_naive = naive_bayes_graph(outcomeSpace,class_var)
    naive_tables = learn_bayes_net(G_naive, data, outcomeSpace)
    
    node_list = list(outcomeSpace.keys())
    var_remove = ['Metastasis', 'LymphNodes']
    var_list = [i for i in node_list if i not in var_remove] #now we get all the variables 
    
    evidence_list = [var for var in var_list if var!=class_var]
    data_update = data[evidence_list]
    
    prior_prob = data[class_var].value_counts(normalize = True)
    
    #p_table =  p_joint_new(var_list, outcomeSpace, naive_tables )
    #outcomeSpace_copy = { var: outcomeSpace[var] for var in var_list }
    #data_dict = data_update.to_dict(orient='records')
    
   # prob_dict = {key: [] for key in outcomeSpace[class_var]} #empety dict to store log prob 
    
    
    y_hat_series = data_update.apply(predict, y_space = outcomeSpace[class_var], table = naive_tables, prior = prior_prob, axis = 1)
    
    
    ''' y_hat_list = []
    
    for obs in data_dict:
        for var,evidence in obs.items():
            single_prob = single_var_query(evidence, naive_tables[var])
            for yhat,prob in single_prob.items():
                prob_dict[yhat].append(prob) #maybe we should use apply here 
    
        total_prob = {key: sum(value) for key,value in prob_dict.items()}
        yhat = max(total_prob)
    
        y_hat_list.append(yhat)
    
    y_hat_seires = np.array(y_hat_list)'''

    correct_predict = np.sum(data[class_var] == y_hat_series)
    acc = correct_predict/len( y_hat_series)
    
    return acc





In [113]:
#option 2 use q3 functions 
def assess_naive_bayes2(G, prob_tables, data, outcomeSpace, class_var):
    G_naive = naive_bayes_graph(outcomeSpace,class_var)
    naive_tables = learn_bayes_net(G_naive, data, outcomeSpace)
    # Named acc rather than prob, which is also the name of the factor lookup helper
    acc = assess_bayes_net(G_naive, naive_tables, data, outcomeSpace, class_var)
    return acc

In [111]:
############
## TEST CODE

acc = assess_naive_bayes(G, prob_tables, data, outcomeSpace, 'BC')
acc 

In [112]:
acc

0.7926

In [116]:
acc2= assess_naive_bayes(G, prob_tables, data, outcomeSpace, 'BC')
acc2

0.7926

In [131]:
naive_graph =naive_bayes_graph(outcomeSpace, 'BC')

Develop a new function ``cv_naive_bayes(data, class_var)`` that uses ``assess_naive_bayes`` to assess the performance of the Naïve Bayes classifier in a dataset ``data``. To develop this code, perform the following steps:

1. Use 10-fold cross-validation to split the data into training and test sets.

2. Implement a function ``learn_naive_bayes_structure(outcomeSpace, class_var)`` to create and return a Naïve Bayes graph structure from ``outcomeSpace`` and ``class_var``. 

3. Use ``learn_bayes_net(G, data, outcomeSpace)`` to learn the Naïve Bayes parameters from a training set ``data``. 

4. Use ``assess_naive_bayes(G, prob_tables, data, outcomeSpace, class_var)`` to compute the accuracy of the Naïve Bayes classifier in a test set ``data``. Remember to remove the variables ``metastasis`` and ``lymphnodes`` from the dataset before assessing the accuracy.

Do 10-fold cross-validation, same as above, and return ``acc`` and ``stddev``.

In [None]:
## Develop your code for learn_naive_bayes_structure(outcomeSpace, class_var) in one or more cells here

def learn_naive_bayes_structure(outcomeSpace, class_var):
    return naive_bayes_graph(outcomeSpace, class_var)




In [None]:
############
## TEST CODE

naive_graph = learn_naive_bayes_structure(outcomeSpace, 'BC')

In [None]:
## Develop your code for cv_naive_bayes(data, class_var) in one or more cells here

In [None]:
############
## TEST CODE

acc, stddev = cv_naive_bayes(data, 'BC')

## [20 Marks] Task 5 - Tree-augmented Naïve Bayes Classification

Similarly to the previous task, implement a Tree-augmented Naïve Bayes (TAN) classifier and evaluate your implementation in the breast cancer dataset. Design a function ``learn_tan_structure(data, outcomeSpace, class_var)`` that learns the TAN structure (graph) from ``data`` and returns it.
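A common way to learn a TAN structure is to weight every pair of features by their conditional mutual information given the class, build a maximum-weight spanning tree over the features, direct its edges away from a root, and finally make the class a parent of every feature. The sketch below follows that recipe on hypothetical toy data; the function names, the choice of the first feature as root, and the use of Prim's algorithm are assumptions, and the signature omits ``outcomeSpace`` for brevity.

```python
import numpy as np
import pandas as pd
from itertools import combinations

def cond_mutual_info_sketch(df, a, b, c):
    """Empirical I(a; b | c) in nats, computed from counts."""
    n = len(df)
    n_abc = df.groupby([a, b, c]).size()
    n_ac = df.groupby([a, c]).size()
    n_bc = df.groupby([b, c]).size()
    n_c = df.groupby(c).size()
    total = 0.0
    for (va, vb, vc), count in n_abc.items():
        if count == 0:
            continue
        # p(a,b,c) * log[ p(a,b,c) p(c) / (p(a,c) p(b,c)) ], with the 1/n factors cancelled
        ratio = count * n_c.loc[vc] / (n_ac.loc[(va, vc)] * n_bc.loc[(vb, vc)])
        total += (count / n) * np.log(ratio)
    return total

def learn_tan_structure_sketch(df, class_var):
    features = [col for col in df.columns if col != class_var]
    weight = {}
    for a, b in combinations(features, 2):
        w = cond_mutual_info_sketch(df, a, b, class_var)
        weight[(a, b)] = weight[(b, a)] = w

    # Prim's algorithm for a maximum-weight spanning tree, rooted at the first feature
    graph = {v: [] for v in df.columns}
    in_tree = [features[0]]
    while len(in_tree) < len(features):
        parent, child = max(
            ((u, v) for u in in_tree for v in features if v not in in_tree),
            key=lambda edge: weight[edge],
        )
        graph[parent].append(child)      # direct the edge away from the root
        in_tree.append(child)

    graph[class_var] = list(features)    # the class is a parent of every feature
    return graph

# Hypothetical toy data: X2 duplicates X1, X3 is independent noise
rng = np.random.default_rng(0)
x1 = rng.choice(["a", "b"], size=200)
toy = pd.DataFrame({
    "C": rng.choice(["yes", "no"], size=200),
    "X1": x1,
    "X2": x1,
    "X3": rng.choice(["lo", "hi"], size=200),
})
tan = learn_tan_structure_sketch(toy, "C")
print(tan["C"])    # ['X1', 'X2', 'X3']
print(tan["X1"])   # contains 'X2', the strongly dependent feature
```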

In [None]:
## Develop your code for learn_tan_structure(data, outcomeSpace, class_var) in one or more cells here






In [None]:
############
## TEST CODE

tan_graph = learn_tan_structure(data, outcomeSpace, class_var)
test(len(tan_graph['BC']) == len(tan_graph)-1)
test('FibrTissueDev' in tan_graph['Spiculation'] or 'Spiculation' in tan_graph['FibrTissueDev'])

Similarly to the other tasks, design a function ``cv_tan(data, class_var)`` that uses 10-fold cross-validation to assess the performance of the TAN classifier from ``data``. Remember to remove the variables ``metastasis`` and ``lymphnodes`` from the dataset before assessing the accuracy. This function should use the ``learn_tan_structure`` as well as other functions defined in this notebook.

In [None]:
## Develop your code for cv_tan(data, class_var) in one or more cells here

In [None]:
############
## TEST CODE

acc, stddev = cv_tan(data, 'BC')

## [20 Marks] Task 6 - Report

Write a report (**with fewer than 500 words**) summarising your findings in this assignment. Your report should address the following:

a. Make a summary and discussion of the experimental results (accuracy). Use plots to illustrate your results.

b. Discuss the complexity of the implemented algorithms.

Use Markdown and LaTeX to write your report in the Jupyter notebook. Develop some plots using Matplotlib to illustrate your results. Be mindful of the maximum number of words. Please be concise and objective.

In [None]:
## Develop your report in one or more cells here