# The University of Melbourne, School of Computing and Information Systems
# COMP30027 Machine Learning, 2019 Semester 1
-----
## Project 1: Gaining Information about Naive Bayes
-----
###### Student Name(s): Xinyao Niu
###### Python version: 3.6.8
###### Submission deadline: 1pm, Fri 5 Apr 2019

This iPython notebook is a template which you may use for your Project 1 submission. (You are not required to use it; in particular, there is no need to use iPython if you do not like it.)

Marking will be applied on the five functions that are defined in this notebook, and to your responses to the questions at the end of this notebook.

You may change the prototypes of these functions, and you may write other functions, according to your requirements. We would appreciate it if the required functions were prominent/easy to find. 

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import os
from os.path import join
import subprocess
import pandas as pd
from collections import defaultdict as dd
from math import log

In [2]:
# run a shell command (given as a list of arguments) and print its output
def bash_call(c:'[str]') -> None:
    res = subprocess.check_output(c)
    for line in res.splitlines():
        print(line.decode('utf-8'))

In [3]:
datapath = join(os.getcwd(),'2019S1-proj1-data')
bash_call(['ls', datapath])

README.txt
anneal.csv
breast-cancer.csv
car.csv
cmc.csv
headers.txt
hepatitis.csv
hypothyroid.csv
mushroom.csv
nursery.csv
primary-tumor.csv


In [4]:
datapath

'/Users/xinyaoniu/Documents/COMP30027-ML/prject 1/2019S1-proj1-data'

In [89]:
HEADERS = {
        "anneal.csv": "family,product-type,steel,carbon,hardness,temper_rolling,condition,formability,strength,non-ageing,surface-finish,surface-quality,enamelability,bc,bf,bt,bw-me,bl,m,chrom,phos,cbond,marvi,exptl,ferro,corr,bbvc,lustre,jurofm,s,p,shape,oil,bore,packing,class".split(","),
        "breast-cancer.csv": "age,menopause,tumor-size,inv-nodes,node-caps,deg-malig,breast,breast-quad,irradiat,class".split(","),
        "car.csv": "buying,maint,doors,persons,lug_boot,safety,class".split(","),
        "cmc.csv": "w-education,h-education,n-child,w-relation,w-work,h-occupation,standard-of-living,media-exposure,class".split(","),
        "hepatitis.csv": "sex,steroid,antivirals,fatigue,malaise,anorexia,liver-big,liver-firm,spleen-palpable,spiders,ascites,varices,histology,class".split(","),
        "hypothyroid.csv": "sex,on-thyroxine,query-on-thyroxine,on_antithyroid,surgery,query-hypothyroid,query-hyperthyroid,pregnant,sick,tumor,lithium,goitre,TSH,T3,TT4,T4U,FTI,TBG,class".split(","),
        "mushroom.csv": "cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,stalk-shape,stalk-root,stalk-surface-above-ring,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat,class".split(","),
        "nursery.csv": "parents,has_nurs,form,children,housing,finance,social,health,class".split(","),
        "primary-tumor.csv": "age,sex,histologic-type,degree-of-diffe,bone,bone-marrow,lung,pleura,peritoneum,liver,brain,skin,neck,supraclavicular,axillar,mediastinum,abdominal,class".split(","),
    }

In [126]:
class naive_bayes_learner():
    
    def __init__(self, path:str, file:str):
        self.file = file
        self.path = join(path, file)
        self.df = None
        self.labels = []
        self.attributes = []  # observed values of each attribute (one tuple per column)
        self.last = -1
        
        self.prob = []
        self.test_size = -1
        self.train_size = -1
        self.train_set = pd.DataFrame()
        self.test_set = pd.DataFrame()
        
        self.predictions = []
        
        self.ig = {}
        
        
    # This function should open a data file in csv, and transform it into a usable format 
    def preprocess(self, val:str='hold-out', shuffle:bool=False, hold:float=0.2) -> None:
        self.df = pd.read_csv(self.path, header=None)
        self.df.columns = HEADERS[self.file]
        self.size = len(self.df.index)
        # name of the class column (always the last column)
        self.last = self.df.columns[-1]
        # distinct class labels
        self.labels = list(set(self.df[self.last]))
        
        if shuffle:
            self.df = self.df.sample(frac=1).reset_index(drop=True)
        
        self.test_size = int(self.df.shape[0] * hold)
        self.train_size = self.df.shape[0] - self.test_size

        self.train_set = self.df.head(self.train_size)
        self.test_set = self.df.tail(self.test_size)
        
        #print("train=", self.train_set.shape, "test=", self.test_set.shape)

        # pre-process the train set
        for i in self.df.columns:
            stats = self.count_frequency(self.train_set[[i,self.last]])
            self.attributes.append(tuple(set(self.train_set[i])))
            self.prob.append(stats)
            
    
    # helper function for preprocess
    # count how often each 'value|label' pair occurs in a two-column frame;
    # when both columns are the class column, count the labels themselves
    def count_frequency(self, df:pd.DataFrame, accumulate:dict=None) -> dd:
        res = dd(int, accumulate or {})
        same = df.columns[0] == df.columns[1]
        
        # iterate by position, so duplicate column names and any row index work
        for feature, label in zip(df.iloc[:, 0], df.iloc[:, 1]):
            if same:
                res[str(label)] += 1
            elif str(feature) == '?': 
                # ignore the missing value
                pass
            else:
                res[str(feature)+'|'+ str(label)] += 1
        return res
    
    
    # helper function of train
    # find the condition of a conditional probability
    def find_condition(self, prob:str) -> str:
        return prob.split('|')[1]
    
    
    # This function should build a supervised NB model
    def train(self, laplace:bool=False) -> None:
        label = self.prob[-1]
        for i in range(len(self.prob)-1):
            current = self.prob[i]
            for k,v in current.items():
                current[k] /= label[self.find_condition(k)]
        
        for k,v in label.items():
            label[k] /= self.train_size
        
        self.prob[-1] = label
    
    
    # predict the class for an instance or a set of instances, based on a trained model
    def predict(self, epsilon:bool=False) -> None:
        # split the test set into features and true labels
        features = self.test_set[self.test_set.columns[:-1]]
        results = self.test_set[self.test_set.columns[-1]]
        
        for i in features.index:
            p = np.zeros(len(self.labels))
            for l in range(len(self.labels)):
                p[l] = (self.prob_calculator(features.loc[i], self.labels[l]))
            #print(p, self.labels[np.argmax(p)], results.loc[i])
            self.predictions.append((self.labels[np.argmax(p)], results.loc[i]))
    
    
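    # helper function of predict
    # calculate the log-probability of a class given the features of one instance
    def prob_calculator(self, features:pd.Series, target:str) -> float:
        prior = self.prob[-1].get(str(target), 0)
        if prior == 0:
            # class never seen in training: log(0) is undefined, treat it as -inf
            return float('-inf')
        res = log(prior)
        for n, f in enumerate(features):
            if str(f) != '?':
                condition = str(f) + '|' + str(target)
                p = self.prob[n].get(condition, 0)
                if p == 0:
                    # unseen value/class pair without smoothing: log(0) -> -inf
                    return float('-inf')
                res += log(p)
        return res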

            
    # evaluate a set of predictions, in a supervised context
    # (the reported value is accuracy: correct predictions / total predictions)
    def evaluate(self) -> None:
        #print(self.predictions)
        total = len(self.predictions)
        correct = 0
        
        for pair in self.predictions:
            if pair[0] == pair[1]:
                correct += 1
        
        print(correct, total)
        print("percison =",correct/total)
        
    
    # calculate the information gain (IG) of an attribute or a set of attributes, with respect to the class
    def info_gain(self) -> None:
        """
        calculate the information gain of each attribute with respect to the class
        """
        full = []  # frequency counts of each attribute (column), paired with the class
        for i in self.test_set:
            full.append(self.count_frequency(self.df[[i,self.last]]))
        
        parent = 0
        n = self.df.shape[0]
        
        for i in full[-1].values():
            parent += self.entropy(i/n)
        #print(parent, n)
        #print(full[-1])
        
        for i in range(len(full)-1):
            current = full[i]
            child = 0
            for attr in self.attributes[i]:
                # ignore the missing value
                if attr == "?":
                    continue
                # class counts for this attribute value
                counts = [current[str(attr) + "|" + str(l)] for l in self.labels]
                total = sum(counts)
                # accumulate the contribution of each class count
                for count in counts:
                    child += (count/self.size)*self.entropy(count/total)
            self.ig[self.df.columns[i]] = parent-child
        
        

    def entropy(self, p:float) -> float:
        """
        helper function for info_gain
        returns the entropy term -p*log2(p) for a single probability
        
        p -- the given probability
        """
        if(p==0):
            return 0
        return -p*log(p,2)
        
        
    
    def run(self) -> None:
        """
        run through the whole workflow automatically;
        intended for convenience and debugging only
        """
        self.preprocess()
        #print(self.prob[-1])
        #print(self.attributes)
        self.train()
        #print(self.prob[-1])
        self.predict()
        self.evaluate()
        self.info_gain()
        print(self.ig)
        

In [127]:
test = naive_bayes_learner(datapath, 'breast-cancer.csv')

In [128]:
test.run()

41 57
percison = 0.7192982456140351
{'age': 0.477873630599265, 'menopause': 0.47310115720971335, 'tumor-size': 0.49562233336866107, 'inv-nodes': 0.517252310316378, 'node-caps': 0.5208954374438026, 'deg-malig': 0.5212338021316094, 'breast': 0.47339050153318996, 'breast-quad': 0.4808297678048802, 'irradiat': 0.48873802917185366}
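
As a point of comparison for the values printed above, the textbook definition IG(A) = H(class) - Σ_v P(A = v) · H(class | A = v) can be sketched without the class. This is a minimal sketch, not the notebook's method: `entropy` and `information_gain` are hypothetical helper names, and dropping rows whose value is `'?'` before computing both entropies is an assumption. Under this definition the gain can never exceed the class entropy.

```python
from collections import Counter
from math import log2


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def information_gain(values, labels, missing="?"):
    """IG of one attribute with respect to the class.

    Assumption: rows whose attribute value is missing are dropped first.
    """
    pairs = [(v, l) for v, l in zip(values, labels) if v != missing]
    if not pairs:
        return 0.0
    # entropy of the class distribution before splitting
    parent = entropy(Counter(l for _, l in pairs).values())
    # class counts inside each branch (one branch per attribute value)
    by_value = {}
    for v, l in pairs:
        by_value.setdefault(v, Counter())[l] += 1
    # each branch entropy is weighted by the fraction of rows in that branch
    child = sum(
        (sum(c.values()) / len(pairs)) * entropy(c.values())
        for c in by_value.values()
    )
    return parent - child
```

With a pandas DataFrame `df` whose last column is the class, a call such as `information_gain(df['age'], df['class'])` would give the gain for one column.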


Questions (you may respond in a cell or cells below):

1. The Naive Bayes classifiers can be seen to vary, in terms of their effectiveness on the given datasets (e.g. in terms of Accuracy). Consider the Information Gain of each attribute, relative to the class distribution — does this help to explain the classifiers’ behaviour? Identify any results that are particularly surprising, and explain why they occur.
2. The Information Gain can be seen as a kind of correlation coefficient between a pair of attributes: when the gain is low, the attribute values are uncorrelated; when the gain is high, the attribute values are correlated. In supervised ML, we typically calculate the Information Gain between a single attribute and the class, but it can be calculated for any pair of attributes. Using the pair-wise IG as a proxy for attribute interdependence, in which cases are our NB assumptions violated? Describe any evidence (or indeed, lack of evidence) that this has some effect on the effectiveness of the NB classifier.
3. Since we have gone to all of the effort of calculating Information Gain, we might as well use that as a criterion for building a “Decision Stump” (1-R classifier). How does the effectiveness of this classifier compare to Naive Bayes? Identify one or more cases where the effectiveness is notably different, and explain why.
4. Evaluating the model on the same data that we use to train the model is considered to be a major mistake in Machine Learning. Implement a hold-out or cross-validation evaluation strategy. How does your estimate of effectiveness change, compared to testing on the training data? Explain why. (The result might surprise you!)
5. Implement one of the advanced smoothing regimes (add-k, Good-Turing). Does changing the smoothing regime (or indeed, not smoothing at all) affect the effectiveness of the Naive Bayes classifier? Explain why, or why not.
6. Naive Bayes is said to elegantly handle missing attribute values. For the datasets with missing values, is there any evidence that the performance is different on the instances with missing values, compared to the instances where all of the values are present? Does it matter which, or how many values are missing? Would an imputation strategy have any effect on this?
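
For question 4, the cross-validation alternative to a single hold-out split can be sketched as a generator of row-index partitions. This is only an illustration under stated assumptions: `k_fold_indices` is a hypothetical helper, and the fold count and seed are arbitrary defaults.

```python
import random


def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs that partition range(n_rows) into k folds."""
    indices = list(range(n_rows))
    # shuffle once with a fixed seed so the folds are reproducible
    random.Random(seed).shuffle(indices)
    # fold i takes every k-th shuffled index, starting at position i
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        # training rows are all rows outside the current test fold
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx
```

With a pandas DataFrame `df`, `df.iloc[train_idx]` and `df.iloc[test_idx]` select the rows of one fold. Because `naive_bayes_learner` accumulates state in `self.prob` and `self.predictions`, a fresh instance per fold would be needed.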

Don't forget that groups of 1 student should respond to question (1), and one other question of your choosing. Groups of 2 students should respond to question (1) and question (2), and two other questions of your choosing. Your responses should be about 150-250 words each.
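
For question 5, add-k smoothing of the conditional probabilities can be sketched as below. This is a hedged sketch, not the notebook's `train` method: `add_k_likelihoods` is a hypothetical helper, the `'value|label'` key format simply mirrors `count_frequency`, and normalising by the number of non-missing rows per label is an assumption.

```python
from collections import Counter


def add_k_likelihoods(values, labels, k=1.0, missing="?"):
    """Return P(value | label) with add-k smoothing, keyed by 'value|label'."""
    rows = [(v, l) for v, l in zip(values, labels) if v != missing]
    observed = sorted({v for v, _ in rows})   # distinct attribute values seen
    pair_counts = Counter(rows)               # count(value, label)
    label_counts = Counter(l for _, l in rows)  # count(label), non-missing rows only
    probs = {}
    for l, n_l in label_counts.items():
        for v in observed:
            # every (value, label) pair gets k pseudo-counts, so none is zero
            probs[f"{v}|{l}"] = (pair_counts[(v, l)] + k) / (n_l + k * len(observed))
    return probs
```

With `k=1` this is Laplace smoothing. Since no probability is zero, taking `log` of the result is always defined.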