
### The University of Melbourne, School of Computing and Information Systems
# COMP30027 Machine Learning, 2020 Semester 1

## Assignment 1: Naive Bayes Classifiers

###### Submission deadline: 7 pm, Monday 20 Apr 2020

**Student Name(s):**    `Shreya Sharma`

**Student ID(s):**     `910986`


This iPython notebook is a template which you will use for your Assignment 1 submission.

Marking will be applied on the four functions that are defined in this notebook, and to your responses to the questions at the end of this notebook (Submitted in a separate PDF file).

**NOTE: YOU SHOULD ADD YOUR RESULTS, DIAGRAMS AND IMAGES FROM YOUR OBSERVATIONS IN THIS FILE TO YOUR REPORT (the PDF file).**

You may change the prototypes of these functions, and you may write other functions, according to your requirements. We would appreciate it if the required functions were prominent/easy to find.

**Adding proper comments to your code is MANDATORY.**

In [1]:
import pandas as pd
import numpy as np
from collections import defaultdict
from sklearn.cluster import KMeans
import math
import matplotlib.pyplot as plt

In [2]:
def get_column_datatypes(filename, df, file_datatype):
    '''
    Returns a list specifying the datatype of each column, requires the filename, 
    original data and specified file datatype
    '''
    col=[]
    if file_datatype=='nominal':                      #'nominal', 'ordinal' and 'numeric' have consistent column datatypes
        col= [0 for i in range(df.shape[1]-1)]
    elif file_datatype=='ordinal':
        col=[1 for i in range(df.shape[1]-1)]
    elif file_datatype=='numeric':
        col= [2 for i in range(df.shape[1]-1)]
    elif file_datatype=='mixed':                      #file with datatype 'mixed' need their column types hard coded
        if filename=='adult.data':
            col= [2, 0, 2, 1, 1, 0, 0, 0, 0, 0, 2, 2, 2, 0]
        elif filename=='bank.data':
            col= [2, 0, 0, 1, 0, 2, 0, 0, 0, 2, 2, 2, 2, 0]
        elif filename=='university.data':
            col= [0, 0, 0, 1, 2, 2, 2, 2, 1, 2, 1, 2, 2, 1, 1, 0]
    return col

def discretize_numeric_data(df, col_datatypes, k_numeric_splits):
    '''
    PLEASE NOTE: This function was made to answer Q1
    Transforms numeric data to discrete, requires dataframe, list of datatypes for each 
    column and the number of clusters (level of discretisation) required
    '''
    for i in range(len(col_datatypes)):
        if col_datatypes[i]==2:                                          #if datatype of column is numeric
            
            df.iloc[:, i]=df.iloc[:, i].fillna(df.iloc[:, i].mean())    #fill empty values with average column value (assign back; inplace on a slice may not update df)
            
            column=np.array(df.iloc[:, i]).reshape(-1,1)                 #create 2D array with column values
            kmeans=KMeans(n_clusters=k_numeric_splits, random_state=100).fit(column) 
            cluster_labels=kmeans.labels_                                #cluster index assigned to each row
            df.iloc[:, i]=cluster_labels                                 #replace column values with cluster labels
            
            col_datatypes[i]=0                                           #change datatype to nominal in datatype list
            
    return df, col_datatypes

In [3]:
# This function should prepare the data by reading it from a file and converting it into a useful format for training and testing
#from sklearn.model_selection import train_test_split

def preprocess(filename, file_datatype, header_row=None, id_column=None, class_col=-1, k_numeric_splits=None):
    '''
    Preprocess a .csv file to ensure consistency across data
    
    INPUT:
    filename: Exact name of file
    file_datatype: As specified in README.txt ('nominal'/'numeric'/'ordinal'/'mixed')
    header_row: Row number of the header in the file (None if there is no header)
    id_column: Any value other than None if the first column of the dataset is an ID column to be dropped
    class_col: Index location of class labels (-1 if at end)
    k_numeric_splits: FOR Q1 ONLY, number of k-means clusters used to discretise all numeric data (None to skip)
    
    OUTPUT:
    X: dataframe of all X attributes
    y: array of all y class labels
    col_datatypes: list specifying the datatypes of all columns in X
    
    NOTE: tokens representing missing values will be maintained in their original form
    '''
    
    #Read in data as dataframe
    df=pd.read_csv(filename, header=header_row)
    
    #Remove ID column if specified to exist - ID column always located in first column
    if id_column is not None:              
        df.drop(0, axis=1, inplace=True)
    
    #If class column not located as last column, move it to last column
    if class_col != -1:
        class_column=df[class_col]
        del df[class_col]
        df[class_col]=class_column
    
    #Find datatypes of each column
    col_datatypes=[]    
    col_datatypes=get_column_datatypes(filename, df, file_datatype)
        
    #Q1 ONLY: Discretise numeric data 
    if k_numeric_splits is not None:
        df, col_datatypes=discretize_numeric_data(df, col_datatypes, k_numeric_splits)
        
    #Separate X attributes from class by creating two dataframes (copy X so later edits do not touch df)
    X=df.iloc[:, :-1].copy()
    y=df.iloc[:, -1]

    #Re-name columns in X to their numerical order (since class column might have been moved/ID column removed)
    X.columns = np.arange(0,len(X.columns))
    
    return X, y, col_datatypes

In [4]:
# This function should calculate prior probabilities and likelihoods from the training data and use
# them to build a naive Bayes model

def train(X, y, col_datatypes):
    '''
    Find the prior probabilities and likelihoods from input training data
    
    INPUT:
    X: dataframe of all X attributes
    y: array of all class labels
    col_datatypes: list specifying datatypes of each column
    
    OUTPUT:
    likelihood_prob: dictionary holding all likelihoods for input dataset
    priors: dictionary holding prior probabilities for input dataset
    '''
    #n_X_attributes=number of columns in X
    n_X_attributes=len(col_datatypes)   
    
    #n_rows=number of rows in X
    n_rows=len(X)           
    
    #class_labels=list of unique classes (y)
    class_labels=list(np.unique(y))           

    #col_dict will contain all unique values for each non-numeric column
    col_dict={}                               
    for i in range(n_X_attributes):
        if col_datatypes[i] != 2:
            unique_vals=list(pd.unique(X.iloc[:, i])) 
            col_dict[i]=unique_vals    
            
    #likelihood - contains count of class labels for unique values in each non-numeric column/mu and sigma for numeric data
    likelihood={} 
    
    #create empty data structure first
    for i in range(n_X_attributes):           
        likelihood[i]={}                   
        if col_datatypes[i]!=2:               
            #for non-numeric data access method = likelihood[column name][unique val][class]
            
            for value in col_dict[i]:
                likelihood[i][value]={}
                for class_1 in class_labels:
                    likelihood[i][value][class_1]=0
        else:                                
            #for numeric data access method = likelihood[column name][class][mu/sigma] 
            for class_1 in class_labels:
                likelihood[i][class_1]={}
     
    #insert count of instances for non-numeric data into likelihood dictionary
    for j in range(len(X)):                 
        class_label=y.iloc[j]
        for i in range(n_X_attributes):
            if col_datatypes[i]!=2:
                likelihood[i][X.iloc[j,i]][class_label]+=1
            
    #add class into X dataframe to assist in finding the conditional mu/sigma
    X['y']=y                                
    key_stats=['mu', 'sigma']
    for i in range(n_X_attributes):
        if col_datatypes[i]==2:   
            #for numeric data:
            
            for class_1 in class_labels:
                #insert mean into data structure:
                likelihood[i][class_1]['mu']=X.loc[X['y'] == class_1, i].mean() 
                
                #insert sigma into data structure
                likelihood[i][class_1]['sigma']=X.loc[X['y'] == class_1, i].std() 
    
    #delete class label from X dataframe, no longer required
    del X['y']                              
    
    #finds the number of instances belonging to each class
    total_class_distribution=defaultdict(int) 
    for i in range(len(X)):
        total_class_distribution[y.iloc[i]]+=1
        
    #likelihood_prob - contains all likelihoods for all attributes
    likelihood_prob={}              
    
    #create empty data structure, identical to likelihood
    for i in range(n_X_attributes):         
        likelihood_prob[i]={}
        if col_datatypes[i]!=2:
            for value in col_dict[i]:
                likelihood_prob[i][value]={}
                for class_1 in class_labels:
                    likelihood_prob[i][value][class_1]=0
        else:
            #if numeric column, copy over values from likelihood
            for class_1 in class_labels:
                likelihood_prob[i][class_1]={}
                for stat in likelihood[i][class_1]:
                    likelihood_prob[i][class_1][stat]=likelihood[i][class_1][stat] 
   
    #if non-numeric column, find the likelihood (probability value) using float division
    for col in likelihood:                 
        if col_datatypes[col]!=2:
            for col_values in likelihood[col]:
                for class_1 in total_class_distribution:
                    likelihood_prob[col][col_values][class_1]=float(likelihood[col][col_values][class_1])/float(total_class_distribution[class_1])
   
    #calculate priors using total_class_distribution
    priors={}                              
    for class_1 in total_class_distribution:
        priors[class_1]=float(total_class_distribution[class_1])/float(len(X))

    return likelihood_prob, priors
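
As a small sanity check on what `train` computes for nominal attributes, the sketch below derives priors and likelihoods for a toy table with pandas. The data and the names `toy`, `toy_priors` and `toy_likelihood` are hypothetical and are not part of the assignment datasets.

```python
import pandas as pd

# Hypothetical toy data: one nominal attribute and a binary class
toy = pd.DataFrame({
    "outlook": ["sunny", "sunny", "rain", "rain", "rain"],
    "play":    ["no",    "yes",   "yes",  "yes",  "no"],
})

# Prior P(c): relative frequency of each class label
toy_priors = toy["play"].value_counts(normalize=True).to_dict()

# Likelihood P(x | c): count of each (value, class) pair divided by the class count
counts = pd.crosstab(toy["outlook"], toy["play"])
toy_likelihood = counts / counts.sum(axis=0)

print(toy_priors)
print(toy_likelihood)
```

Each column of `toy_likelihood` sums to 1, which is a quick way to check that the counts were divided by the class totals rather than by the number of rows.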

In [5]:
# This function should predict classes for new items in a test dataset (for the purposes of this assignment, you
# can re-use the training data as a test set)

def predict(X, y, col_datatypes, likelihood_prob, priors):
    '''
    Predict classes for items in given dataset
    
    INPUT:
    X: dataframe of all X attributes
    y: array of all class labels
    col_datatypes: list specifying datatypes of each column
    likelihood_prob: dictionary containing all likelihoods of all columns
    priors: dictionary containing prior probabilities
    
    OUTPUT:
    y_df: dataframe with two columns, y (received class labels) and y_guess (predicted class labels)
    '''
    
    y_df=pd.DataFrame(y)
    
    #y_guess will contain all predicted classes
    y_guess=[]      
    
    for row in range(len(X)):
        #prior_x_likelihood_prob will contain result of all (prior * likelihood) values for each class
        prior_x_likelihood_prob={}   
        
        for class_1 in priors:
            #need to multiply prior value with all other likelihoods for same class
            multiplier_temp=priors[class_1]         
            
            for col in range(len(X.iloc[0])):
                if col_datatypes[col]!=2:           
                    #if non-numeric column, access likelihood via likelihood_prob
                    
                    try:
                        multiplier_temp*=likelihood_prob[col][X.iloc[row, col]][class_1]
                    except KeyError:
                        #if never seen instance before, assume very rare probability, epsilon
                        multiplier_temp*=0.000001   
               
                else:                              
                    #if numeric, access sigma and mu from likelihood_prob
                    sigma=likelihood_prob[col][class_1]['sigma'] 
                    mu=likelihood_prob[col][class_1]['mu']
                    x_hat=X.iloc[row, col]
                    
                    #find gaussian probability (below)
                    multiplier_temp*=math.exp(-0.5*((x_hat-mu)/sigma)**2)*(1/(sigma*math.sqrt(2*math.pi))) 
           
            #add (prior * likelihood) value to prior_x_likelihood_prob for each class
            prior_x_likelihood_prob[class_1]=multiplier_temp 
        
        #find sum of (prior * likelihood), if both probabilities 0, assume rare event, add epsilon
        denominator=sum(prior_x_likelihood_prob.values())+0.000001 
        
        #divide (prior * likelihood) for each class by sum of (prior * likelihood) values for each class
        for key in prior_x_likelihood_prob: 
            prior_x_likelihood_prob[key]=float(prior_x_likelihood_prob[key])/float(denominator) 
        
        #find maximum probability, and associated class
        max_value=0.0
        max_class=''
        for key in prior_x_likelihood_prob:
            if float(prior_x_likelihood_prob[key])>max_value:
                max_value=prior_x_likelihood_prob[key]
                max_class=key
                
        #append associated class to y_guess
        y_guess.append(max_class)
        
    #append y_guess to y_df dataframe
    y_df['y_guess']=y_guess
    return y_df
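
Multiplying many small likelihoods, as `predict` does, can underflow to zero when a dataset has many attributes. A common alternative is to sum log-probabilities instead. The sketch below is a minimal illustration under that assumption; `log_posterior_scores` and the demo dictionaries are hypothetical and are not used elsewhere in this notebook.

```python
import math

def log_posterior_scores(priors, likelihoods, epsilon=1e-6):
    """Return unnormalised log-scores: log P(c) + sum_i log P(x_i | c).

    priors: dict mapping class -> prior probability
    likelihoods: dict mapping class -> list of per-attribute likelihoods
    epsilon: floor applied to each likelihood so that log(0) never occurs
    """
    scores = {}
    for class_label, prior in priors.items():
        score = math.log(prior)
        for p in likelihoods[class_label]:
            score += math.log(max(p, epsilon))
        scores[class_label] = score
    return scores

# Hypothetical example: 100 attributes, each with a tiny likelihood
priors_demo = {"a": 0.5, "b": 0.5}
likelihoods_demo = {"a": [1e-5] * 100, "b": [2e-5] * 100}

scores = log_posterior_scores(priors_demo, likelihoods_demo)
best = max(scores, key=scores.get)
print(best)
```

In this example the plain product of either likelihood list underflows to `0.0` in double precision, so the two classes could not be told apart, whereas the log-scores still rank class `b` above class `a`.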

In [6]:
# This function should evaluate the prediction performance by comparing your model’s class outputs to ground
# truth labels
def evaluate(y_test_df):
    '''
    Evaluate the accuracy of the model by considering the number of correctly predicted labels
    
    INPUT:
    y_test_df: dataframe containing 2 columns, actual class and predicted class
    
    OUTPUT:
    accuracy: the accuracy value of the model
    '''
    correct=0
    incorrect=0
    for row in range(len(y_test_df)):
        if y_test_df.iloc[row, 0]==y_test_df.iloc[row, 1]:
            #if the prediction is equal to the actual class, then true prediction (true positive/true negative)
            correct+=1
        else:
            #if prediction different from actual class, false positive/false negative
            incorrect+=1
    
    #float division to find accuracy
    accuracy=correct/float(correct+incorrect)
    
    return accuracy
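
Accuracy alone does not show which classes are being confused. The sketch below, on hypothetical labels, computes the same accuracy in vectorised form and adds a confusion matrix with `pd.crosstab`; the frame `demo` only mimics the two-column layout returned by `predict`.

```python
import pandas as pd

# Hypothetical actual and predicted labels in the same layout as y_df
demo = pd.DataFrame({
    "y":       ["a", "a", "b", "b", "b"],
    "y_guess": ["a", "b", "b", "b", "a"],
})

# Fraction of rows where the prediction matches the actual class
accuracy_demo = (demo["y"] == demo["y_guess"]).mean()

# Rows are actual classes, columns are predicted classes
confusion = pd.crosstab(demo["y"], demo["y_guess"])

print(accuracy_demo)
print(confusion)
```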

In [7]:
X, y, col_datatypes=preprocess(filename='university.data', file_datatype='mixed', class_col=14)
likelihood_prob, priors=train(X, y, col_datatypes)
y_df=predict(X, y, col_datatypes, likelihood_prob, priors)
print(evaluate(y_df))

## Questions 


If you are in a group of 1, you will respond to question (1), and **one** other of your choosing (two responses in total).

If you are in a group of 2, you will respond to question (1) and question (2), and **two** others of your choosing (four responses in total). 

A response to a question should take about 100–250 words, and make reference to the data wherever possible.

#### NOTE: you may develop code or functions in response to the questions, but your formal answer should be added to a separate file.

### Q1
Try discretising the numeric attributes in these datasets and treating them as discrete variables in the naïve Bayes classifier. You can use a discretisation method of your choice and group the numeric values into any number of levels (but around 3 to 5 levels would probably be a good starting point). Does discretising the variables improve classification performance, compared to the Gaussian naïve Bayes approach? Why or why not?
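
This notebook discretises with k-means (see `discretize_numeric_data`). For comparison, a simpler option is equal-width binning. The sketch below is illustrative only; `equal_width_bins` is a hypothetical helper and is not called anywhere in the cells that follow.

```python
import numpy as np

def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width.

    Returns integer bin labels in the range 0 .. n_bins - 1.
    """
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    # Use only the interior edges so that the maximum value falls in the last bin
    return np.digitize(values, edges[1:-1])

# Hypothetical column with one large value
labels = equal_width_bins([0, 1, 2, 3, 4, 10], 3)
print(labels.tolist())  # [0, 0, 0, 0, 1, 2]
```

Equal-width bins are sensitive to outliers: here a single large value leaves most of the data in the first bin, which is one reason a clustering-based or equal-frequency scheme may behave differently.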

In [None]:
X, y, col_datatypes=preprocess(filename='wdbc.data', file_datatype='numeric', class_col=1)
preprocess_datatypes=col_datatypes
likelihood_prob, priors=train(X, y, col_datatypes)
y_df=predict(X, y, col_datatypes, likelihood_prob, priors)
print('Accuracy with no discretisation ' + str(evaluate(y_df)))
accuracy=[]
for k_split in range(1, 10):
    X, y, col_datatypes=preprocess(filename='wdbc.data', file_datatype='numeric', class_col=1, k_numeric_splits=k_split)
    likelihood_prob, priors=train(X, y, col_datatypes)
    y_df=predict(X, y, col_datatypes, likelihood_prob, priors)
    accuracy.append(evaluate(y_df))
    
print('Max Accuracy with discretisation: ' + str(max(accuracy)))
print('N K-splits to receive Max Accuracy: '+ str(accuracy.index(max(accuracy))+1))
plt.plot(list(range(1,10)), accuracy)
plt.xlabel('NSplits')
plt.title('NUMERIC DATASET: WDBC Data')
plt.ylabel('Accuracy')
plt.show()

print('Distribution of Numeric Attributes')
for col in X.columns:
    if preprocess_datatypes[col] ==2:
        X.hist(column=col)

In [None]:
X, y, col_datatypes=preprocess(filename='wine.data', file_datatype='numeric', class_col=0)
preprocess_datatypes=col_datatypes
likelihood_prob, priors=train(X, y, col_datatypes)
y_df=predict(X, y, col_datatypes, likelihood_prob, priors)
print('Accuracy with no discretisation ' + str(evaluate(y_df)))

accuracy=[]
for k_split in range(1, 10):
    X, y, col_datatypes=preprocess(filename='wine.data', file_datatype='numeric', class_col=0, k_numeric_splits=k_split)
    likelihood_prob, priors=train(X, y, col_datatypes)
    y_df=predict(X, y, col_datatypes, likelihood_prob, priors)
    accuracy.append(evaluate(y_df))
    
print('Max Accuracy with discretisation: ' + str(max(accuracy)))
print('N K-splits to receive Max Accuracy: '+ str(accuracy.index(max(accuracy))+1))
plt.plot(list(range(1,10)), accuracy)
plt.xlabel('NSplits')
plt.ylabel('Accuracy')
plt.title('NUMERIC DATASET: Wine Data')
plt.show()

print('Distribution of Numeric Attributes')
for col in X.columns:
    if preprocess_datatypes[col] ==2:
        X.hist(column=col)

In [None]:
X, y, col_datatypes=preprocess(filename='adult.data', file_datatype='mixed')
preprocess_datatypes=col_datatypes
likelihood_prob, priors=train(X, y, col_datatypes)
y_df=predict(X, y, col_datatypes, likelihood_prob, priors)
print('Accuracy with no discretisation ' + str(evaluate(y_df)))

accuracy=[]
for k_split in range(1, 10):
    X, y, col_datatypes=preprocess(filename='adult.data', file_datatype='mixed', k_numeric_splits=k_split)
    likelihood_prob, priors=train(X, y, col_datatypes)
    y_df=predict(X, y, col_datatypes, likelihood_prob, priors)
    accuracy.append(evaluate(y_df))
    
print('Max Accuracy with discretisation: ' + str(max(accuracy)))
print('N K-splits to receive Max Accuracy: '+ str(accuracy.index(max(accuracy))+1))
plt.plot(list(range(1,10)), accuracy)
plt.xlabel('NSplits')
plt.ylabel('Accuracy')
plt.title('MIXED DATASET: Adult Data')
plt.show()

print('Distribution of Numeric Attributes')
for col in X.columns:
    if preprocess_datatypes[col] ==2:
        X.hist(column=col)

In [None]:
X, y, col_datatypes=preprocess(filename='bank.data', file_datatype='mixed')
preprocess_datatypes=col_datatypes
likelihood_prob, priors=train(X, y, col_datatypes)
y_df=predict(X, y, col_datatypes, likelihood_prob, priors)
print('Accuracy with no discretisation ' + str(evaluate(y_df)))

accuracy=[]
for k_split in range(1, 10):
    X, y, col_datatypes=preprocess(filename='bank.data', file_datatype='mixed', k_numeric_splits=k_split)
    likelihood_prob, priors=train(X, y, col_datatypes)
    y_df=predict(X, y, col_datatypes, likelihood_prob, priors)
    accuracy.append(evaluate(y_df))
    
print('Max Accuracy with discretisation: ' + str(max(accuracy)))
print('N K-splits to receive Max Accuracy: '+ str(accuracy.index(max(accuracy))+1))
plt.plot(list(range(1,10)), accuracy)
plt.xlabel('NSplits')
plt.ylabel('Accuracy')
plt.title('MIXED DATASET: Bank Data')
plt.show()

print('Distribution of Numeric Attributes')
for col in X.columns:
    if preprocess_datatypes[col] ==2:
        X.hist(column=col)

### Q2
Implement a baseline model (e.g., random or 0R) and compare the performance of the naïve Bayes classifier to this baseline on multiple datasets. Discuss why the baseline performance varies across datasets, and to what extent the naïve Bayes classifier improves on the baseline performance.
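
The question also mentions a random baseline. A minimal sketch of one variant, which guesses classes at random in proportion to their observed frequencies, is given below. The function name `weighted_random_baseline`, its `seed` parameter and the demo labels are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def weighted_random_baseline(y, seed=0):
    """Guess a class for every instance, sampled from the class distribution of y.

    Returns a two-column dataframe (actual class, random guess), the same
    layout that evaluate() expects.
    """
    y = pd.Series(y)
    dist = y.value_counts(normalize=True)      # P(c) for each class
    rng = np.random.default_rng(seed)          # seeded for repeatable results
    guesses = rng.choice(dist.index.to_numpy(), size=len(y), p=dist.to_numpy())
    return pd.DataFrame({"y": y.to_numpy(), "random": guesses})

# Hypothetical imbalanced labels: 90% 'a', 10% 'b'
random_df = weighted_random_baseline(["a"] * 90 + ["b"] * 10)
print((random_df["y"] == random_df["random"]).mean())
```

The expected accuracy of this baseline is the sum of the squared class proportions (0.82 for the labels above), so it is never better in expectation than always predicting the majority class.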

In [None]:
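accuracy_df=pd.DataFrame()       #collects model and baseline accuracies for every dataset
def zero_r(y_df):
    '''
    Zero-R baseline: predict the most frequent class for every instance
    
    INPUT:
    y_df: dataframe with actual class labels in first column and 'y_guess' in second column
    
    OUTPUT:
    y_test_df: dataframe with two columns, actual class labels and 'zero_r' (majority class)
    '''
    #find the name of the column holding the actual class labels
    for column in y_df.columns:
        if column != 'y_guess':
            y_test=column
    
    #count the number of instances belonging to each class
    y_class_count=defaultdict(int)
    for row in range(len(y_df)):
        y_class_count[y_df.iloc[row, 0]]+=1
    
    #majority class is the class with the highest count
    max_count=max(y_class_count.values())
    for (key, value) in y_class_count.items():
        if value == max_count:
            max_class=key
    
    #pair the actual labels with the constant majority-class prediction
    y_test_df=pd.DataFrame(y_df[y_test])
    y_test_df['zero_r']=max_class
    
    return y_test_df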

In [None]:
X, y, col_datatypes=preprocess(filename='breast-cancer-wisconsin.data', file_datatype='nominal', id_column=0)
likelihood_prob, priors=train(X, y, col_datatypes)
y_df=predict(X, y, col_datatypes, likelihood_prob, priors)
model_acc=evaluate(y_df)
print('Model Accuracy: ' + str(model_acc))

y_test_df=zero_r(y_df)
zeror_acc=evaluate(y_test_df)
print('Zero R Accuracy: ' + str(zeror_acc))

performance_diff=model_acc-zeror_acc
print('Increase in Performance using Model: ' +str(performance_diff))

accuracy_df['Breast Cancer Wisconsin']=[model_acc, zeror_acc, performance_diff]

performance=[model_acc, zeror_acc]
models=['Model', 'Zero R']
pos=[1, 1.4]
plt.bar(pos, performance, width=0.3, align='center')
plt.xticks(pos, models)
plt.ylabel('Accuracy')
plt.title('Breast Cancer Wisconsin Data')
plt.show()

In [None]:
X, y, col_datatypes=preprocess(filename='mushroom.data', file_datatype='nominal', class_col=0)
likelihood_prob, priors=train(X, y, col_datatypes)
y_df=predict(X, y, col_datatypes, likelihood_prob, priors)
model_acc=evaluate(y_df)
print('Model Accuracy: ' + str(model_acc))

y_test_df=zero_r(y_df)
zeror_acc=evaluate(y_test_df)
print('Zero R Accuracy: ' + str(zeror_acc))

performance_diff=model_acc-zeror_acc
print('Increase in Performance using Model: ' +str(performance_diff))

accuracy_df['Mushroom']=[model_acc, zeror_acc, performance_diff]

performance=[model_acc, zeror_acc]
models=['Model', 'Zero R']
pos=[1, 1.4]
plt.bar(pos, performance, width=0.3, align='center')
plt.xticks(pos, models)
plt.title('Mushroom Data')
plt.ylabel('Accuracy')
plt.show()

In [None]:
X, y, col_datatypes=preprocess(filename='lymphography.data', file_datatype='nominal', class_col=0)
likelihood_prob, priors=train(X, y, col_datatypes)
y_df=predict(X, y, col_datatypes, likelihood_prob, priors)
model_acc=evaluate(y_df)
print('Model Accuracy: ' + str(model_acc))

y_test_df=zero_r(y_df)
zeror_acc=evaluate(y_test_df)
print('Zero R Accuracy: ' + str(zeror_acc))

performance_diff=model_acc-zeror_acc
print('Increase in Performance using Model: ' +str(performance_diff))

accuracy_df['Lymphography']=[model_acc, zeror_acc, performance_diff]

performance=[model_acc, zeror_acc]
models=['Model', 'Zero R']
pos=[1, 1.4]
plt.bar(pos, performance, width=0.3, align='center')
plt.xticks(pos, models)
plt.ylabel('Accuracy')
plt.title('Lymphography Data')
plt.show()

In [None]:
X, y, col_datatypes=preprocess(filename='wdbc.data', file_datatype='numeric', class_col=1)
likelihood_prob, priors=train(X, y, col_datatypes)
y_df=predict(X, y, col_datatypes, likelihood_prob, priors)
model_acc=evaluate(y_df)
print('Model Accuracy: ' + str(model_acc))

y_test_df=zero_r(y_df)
zeror_acc=evaluate(y_test_df)
print('Zero R Accuracy: ' + str(zeror_acc))

performance_diff=model_acc-zeror_acc
print('Increase in Performance using Model: ' +str(performance_diff))

accuracy_df['WDBC']=[model_acc, zeror_acc, performance_diff]

performance=[model_acc, zeror_acc]
models=['Model', 'Zero R']
pos=[1, 1.4]
plt.bar(pos, performance, width=0.3, align='center')
plt.xticks(pos, models)
plt.ylabel('Accuracy')
plt.title('WDBC Data')
plt.show()

In [None]:
X, y, col_datatypes=preprocess(filename='wine.data', file_datatype='numeric', class_col=0)
likelihood_prob, priors=train(X, y, col_datatypes)
y_df=predict(X, y, col_datatypes, likelihood_prob, priors)
model_acc=evaluate(y_df)
print('Model Accuracy: ' + str(model_acc))

y_test_df=zero_r(y_df)
zeror_acc=evaluate(y_test_df)
print('Zero R Accuracy: ' + str(zeror_acc))
#3 classes

performance_diff=model_acc-zeror_acc
print('Increase in Performance using Model: ' +str(performance_diff))

accuracy_df['Wine']=[model_acc, zeror_acc, performance_diff]

performance=[model_acc, zeror_acc]
models=['Model', 'Zero R']
pos=[1, 1.4]
plt.bar(pos, performance, width=0.3, align='center')
plt.xticks(pos, models)
plt.title('Wine Data')
plt.ylabel('Accuracy')
plt.show()

In [None]:
X, y, col_datatypes=preprocess(filename='car.data', file_datatype='ordinal')
likelihood_prob, priors=train(X, y, col_datatypes)
y_df=predict(X, y, col_datatypes, likelihood_prob, priors)
model_acc=evaluate(y_df)
print('Model Accuracy: ' + str(model_acc))

y_test_df=zero_r(y_df)
zeror_acc=evaluate(y_test_df)
print('Zero R Accuracy: ' + str(zeror_acc))

performance_diff=model_acc-zeror_acc
print('Increase in Performance using Model: ' +str(performance_diff))

accuracy_df['Car']=[model_acc, zeror_acc, performance_diff]

performance=[model_acc, zeror_acc]
models=['Model', 'Zero R']
pos=[1, 1.4]
plt.bar(pos, performance, width=0.3, align='center')
plt.xticks(pos, models)
plt.ylabel('Accuracy')
plt.title('Car Data')
plt.show()

In [None]:
X, y, col_datatypes=preprocess(filename='nursery.data', file_datatype='ordinal')
likelihood_prob, priors=train(X, y, col_datatypes)
y_df=predict(X, y, col_datatypes, likelihood_prob, priors)
model_acc=evaluate(y_df)
print('Model Accuracy: ' + str(model_acc))

y_test_df=zero_r(y_df)
zeror_acc=evaluate(y_test_df)
print('Zero R Accuracy: ' + str(zeror_acc))

performance_diff=model_acc-zeror_acc
print('Increase in Performance using Model: ' +str(performance_diff))

accuracy_df['Nursery']=[model_acc, zeror_acc, performance_diff]

performance=[model_acc, zeror_acc]
models=['Model', 'Zero R']
pos=[1, 1.4]
plt.bar(pos, performance, width=0.3, align='center')
plt.xticks(pos, models)
plt.ylabel('Accuracy')
plt.title('Nursery Data')
plt.show()

In [None]:
X, y, col_datatypes=preprocess(filename='somerville.data', file_datatype='ordinal', class_col=0)
likelihood_prob, priors=train(X, y, col_datatypes)
y_df=predict(X, y, col_datatypes, likelihood_prob, priors)
model_acc=evaluate(y_df)
print('Model Accuracy: ' + str(model_acc))

y_test_df=zero_r(y_df)
zeror_acc=evaluate(y_test_df)
print('Zero R Accuracy: ' + str(zeror_acc))

performance_diff=model_acc-zeror_acc
print('Increase in Performance using Model: ' +str(performance_diff))

accuracy_df['Somerville']=[model_acc, zeror_acc, performance_diff]

performance=[model_acc, zeror_acc]
models=['Model', 'Zero R']
pos=[1, 1.4]
plt.bar(pos, performance, width=0.3, align='center')
plt.xticks(pos, models)
plt.title('Somerville Data')
plt.ylabel('Accuracy')
plt.show()

In [None]:
X, y, col_datatypes=preprocess(filename='adult.data', file_datatype='mixed')
likelihood_prob, priors=train(X, y, col_datatypes)
y_df=predict(X, y, col_datatypes, likelihood_prob, priors)
model_acc=evaluate(y_df)
print('Model Accuracy: ' + str(model_acc))

y_test_df=zero_r(y_df)
zeror_acc=evaluate(y_test_df)
print('Zero R Accuracy: ' + str(zeror_acc))

performance_diff=model_acc-zeror_acc
print('Increase in Performance using Model: ' +str(performance_diff))

accuracy_df['Adult']=[model_acc, zeror_acc, performance_diff]

performance=[model_acc, zeror_acc]
models=['Model', 'Zero R']
pos=[1, 1.4]
plt.bar(pos, performance, width=0.3, align='center')
plt.xticks(pos, models)
plt.title('Adult Data')
plt.ylabel('Accuracy')
plt.show()

In [None]:
X, y, col_datatypes=preprocess(filename='bank.data', file_datatype='mixed')
likelihood_prob, priors=train(X, y, col_datatypes)
y_df=predict(X, y, col_datatypes, likelihood_prob, priors)
model_acc=evaluate(y_df)
print('Model Accuracy: ' + str(model_acc))

y_test_df=zero_r(y_df)
zeror_acc=evaluate(y_test_df)
print('Zero R Accuracy: ' + str(zeror_acc))

performance_diff=model_acc-zeror_acc
print('Increase in Performance using Model: ' +str(performance_diff))

accuracy_df['Bank']=[model_acc, zeror_acc, performance_diff]

performance=[model_acc, zeror_acc]
models=['Model', 'Zero R']
pos=[1, 1.4]
plt.bar(pos, performance, width=0.3, align='center')
plt.xticks(pos, models)
plt.title('Bank Data')
plt.ylabel('Accuracy')
plt.show()

In [None]:
X, y, col_datatypes=preprocess(filename='university.data', file_datatype='mixed', class_col=14)
likelihood_prob, priors=train(X, y, col_datatypes)
y_df=predict(X, y, col_datatypes, likelihood_prob, priors)
model_acc=evaluate(y_df)
print('Model Accuracy: ' + str(model_acc))

y_test_df=zero_r(y_df)
zeror_acc=evaluate(y_test_df)
print('Zero R Accuracy: ' + str(zeror_acc))

performance_diff=model_acc-zeror_acc
print('Increase in Performance using Model: ' +str(performance_diff))

accuracy_df['University']=[model_acc, zeror_acc, performance_diff]

performance=[model_acc, zeror_acc]
models=['Model', 'Zero R']
pos=[1, 1.4]
plt.bar(pos, performance, width=0.3, align='center')
plt.xticks(pos, models)
plt.title('University Data')
plt.ylabel('Accuracy')
plt.show()

In [None]:
accuracy_df.index=['Model Accuracy', 'Zero R Accuracy', 'Difference']

In [None]:
accuracy_df

### Q3
Since it’s difficult to model the probabilities of ordinal data, ordinal attributes are often treated as either nominal variables or numeric variables. Compare these strategies on the ordinal datasets provided. Determine which approach gives higher classification accuracy and discuss why.
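
To treat an ordinal attribute as numeric, its levels first have to be mapped to ranks. The sketch below shows one way to do that with pandas; the level names and their ordering are assumed for illustration and would need to be checked against each dataset's documentation.

```python
import pandas as pd

# Assumed ordering of a hypothetical ordinal attribute, lowest to highest
order = ["low", "med", "high", "vhigh"]
rank = {level: i for i, level in enumerate(order)}

col = pd.Series(["low", "vhigh", "med", "high", "low"])

# Replace each level by its rank so a Gaussian can be fitted to the column
as_numeric = col.map(rank)
print(as_numeric.tolist())  # [0, 3, 1, 2, 0]
```

After such a mapping, marking the column as type `2` in `col_datatypes` would make `train` fit a mean and standard deviation per class, while leaving it as `1` keeps the nominal-style counting.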

### Q4
Evaluating the model on the same data that we use to train the model is considered to be a major mistake in Machine Learning. Implement a hold-out or cross-validation evaluation strategy (you should implement this yourself and do not simply call existing implementations from `scikit-learn`). How does your estimate of effectiveness change, compared to testing on the training data? Explain why. (The result might surprise you!)
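
A minimal sketch of a hand-written k-fold split is shown below, using only NumPy for shuffling. The generator name `k_fold_indices` and its parameters are hypothetical; it yields row positions only and leaves training and prediction to the caller.

```python
import numpy as np

def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs of row positions for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)       # shuffle once so folds are random
    folds = np.array_split(shuffled, k)      # k folds of nearly equal size
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# Example: 10 rows split into 3 folds
for train_idx, test_idx in k_fold_indices(10, k=3):
    print(len(train_idx), len(test_idx))
```

With the functions in this notebook, each pair could be used as `X.iloc[train_idx]` and `y.iloc[train_idx]` for `train`, and `X.iloc[test_idx]` and `y.iloc[test_idx]` for `predict`, averaging `evaluate` over the folds. Passing a `.copy()` of the training slice is advisable because `train` temporarily adds a column to `X`.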

### Q5
Implement one of the advanced smoothing regimes (add-k, Good-Turing). Does changing the smoothing regime (or indeed, not smoothing at all) affect the effectiveness of the naïve Bayes classifier? Explain why, or why not.
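
For reference, add-k smoothing replaces the raw estimate count / class_total with (count + k) / (class_total + k × number of distinct values). The sketch below shows the formula in isolation; `add_k_likelihood` is a hypothetical helper and is not wired into `train`.

```python
def add_k_likelihood(count, class_total, n_values, k=1.0):
    """Add-k smoothed estimate of P(x | c).

    count: number of training rows in class c with attribute value x
    class_total: number of training rows in class c
    n_values: number of distinct values the attribute can take
    k: smoothing constant (k=1 gives Laplace smoothing)
    """
    return (count + k) / (class_total + k * n_values)

# Hypothetical counts for an attribute with three values within one class
counts = [0, 4, 6]
smoothed = [add_k_likelihood(c, sum(counts), len(counts)) for c in counts]
print(smoothed)
```

The unseen value receives probability 1/13 instead of 0, and the three smoothed estimates still sum to 1.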

### Q6
The Gaussian naïve Bayes classifier assumes that numeric attributes come from a Gaussian distribution. Is this assumption always true for the numeric attributes in these datasets? Identify some cases where the Gaussian assumption is violated and describe any evidence (or lack thereof) that this has some effect on the NB classifier’s predictions.
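
Alongside histograms, sample skewness gives a quick numeric indication of departure from a Gaussian shape, since a Gaussian has skewness 0. The sketch below is a rough screening aid rather than a formal normality test; `sample_skewness` is a hypothetical helper and the example values are made up.

```python
import numpy as np

def sample_skewness(values):
    """Return the sample skewness: mean of cubed deviations divided by sigma cubed."""
    values = np.asarray(values, dtype=float)
    centred = values - values.mean()
    sigma = values.std()                     # population standard deviation
    return float((centred ** 3).mean() / sigma ** 3)

symmetric = sample_skewness([1, 2, 3, 4, 5])
right_skewed = sample_skewness([1, 1, 1, 1, 10])
print(symmetric, right_skewed)
```

Applied per class to a numeric column, for example to the values selected by `X.loc[y == c, col]`, a skewness far from 0 would suggest that a single Gaussian per class is a poor fit for that attribute.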