
### The University of Melbourne, School of Computing and Information Systems
# COMP30027 Machine Learning, 2020 Semester 1

## Assignment 1: Naive Bayes Classifiers

###### Submission deadline: 7 pm, Monday 20 Apr 2020

**Student Name(s):**    `Alec Yu, Michael Jaworski`

**Student ID(s):**     `993433, `


This iPython notebook is a template which you will use for your Assignment 1 submission.

Marking will be applied to the four functions that are defined in this notebook, and to your responses to the questions at the end of this notebook (submitted in a separate PDF file).

**NOTE: YOU SHOULD ADD YOUR RESULTS, DIAGRAMS AND IMAGES FROM YOUR OBSERVATIONS IN THIS FILE TO YOUR REPORT (the PDF file).**

You may change the prototypes of these functions, and you may write other functions, according to your requirements. We would appreciate it if the required functions were prominent/easy to find.

**Adding proper comments to your code is MANDATORY.**

In [261]:
import numpy as np
from collections import defaultdict
import re
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd

In [262]:
# Lists for all the datafiles and their types
nominal_files = ["breast-cancer-wisconsin", "mushroom", "lymphography"]
numeric_files = ["wdbc", "wine"]
ordinal_files = ["car", "nursery", "somerville"]
mixed_files = ["adult", "bank", "university"]

file = "university"
filename = "datasets/%s.data" % file
headerfile = "datasets/%s.h" % file

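if file in nominal_files:
    datatype = "nominal"
elif file in ordinal_files:
    datatype = "ordinal"
elif file in numeric_files:
    datatype = "numeric"
elif file in mixed_files:
    datatype = "mixed"
else:
    # Fail early rather than leaving `datatype` undefined for later cells
    raise ValueError("Unknown dataset: %s" % file)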

In [263]:
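# Read the datafile in with a separate header file for attributes
dataframe = pd.read_csv(filename, header=None)
# Close the header file once read, and strip the trailing newline so the
# last attribute name does not end in "\n"
with open(headerfile, "r") as header:
    attributes = header.readline().strip().split(",")
dataframe.columns = attributes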

In [264]:
# Deal with missing values in some datasets.
# Remove all the rows with missing values, for two reasons:
# 1. These datasets have many instances, so dropping a few rows costs little.
# 2. Most of the missing values occur in the nominal datasets, where filling in the mode (or similar)
#    might skew the category counts. A missing numeric value could instead be replaced by the mean,
#    since the numeric attributes are assumed to be Gaussian distributed.


TO DO LIST:
Import and clean the data into a pandas dataframe, using the new header files as column names.
Identify the erroneous instances and decide what to do with them: either remove them or fill in an average.
Make sure each included instance has all attributes filled and that the data is consistent.
Look up generic ways to clean data.
Discretise.

For each type of data, make sure to convert ordinal into numeric, and discretise numeric attributes in the mixed datasets.

Once all this is done, train, predict and evaluate shouldn't be too hard.

Then use this clean data to train.
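
The "convert ordinal into numeric" step in the list above could look roughly like the sketch below. The category order and the helper name `encode_ordinal` are illustrative assumptions; the real order for each attribute has to come from the dataset documentation.

```python
# Hypothetical ordering for a price-like ordinal attribute (an assumption,
# not taken from any of the assignment's header files).
ORDER = ["low", "med", "high", "vhigh"]

def encode_ordinal(values, order):
    """Map each category to its 0-based rank in `order`."""
    rank = {category: position for position, category in enumerate(order)}
    # A KeyError here means a value is missing from `order`
    return [rank[value] for value in values]

encoded = encode_ordinal(["med", "low", "vhigh"], ORDER)
print(encoded)
```

Keeping the order in an explicit list makes the assumed ranking visible, which matters because an alphabetical encoding would silently put "high" before "low".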


In [None]:
# This function should prepare the data by reading it from a file and converting it into a useful format for training and testing

def preprocess(dataframe, datatype):
    # Create a copy of the original dataframe, so the caller's data is left untouched
    df = dataframe.copy()

    # If it's mushroom data, drop the stalk-root column as it has too many missing values. Don't use this attribute
    # (this relies on the global `filename` set in an earlier cell)
    if filename == "datasets/mushroom.data":
        df = df.drop(columns=['stalk-root'])

    # Otherwise just remove the rows that contain a ? for now
    df = df[~df.isin(['?']).any(axis=1)]

    # Leave nominal, ordinal and numeric data as it is.

    # If the data is mixed, we need to discretise the data, as each mixed data set is a classification task using NB
    if datatype == 'mixed':
        df = discretiseNumeric(df)

    return df
############################################################################################################################
# The choice is equal-frequency discretisation, which gives more detail than equal-width discretisation when the
# values are unevenly spread. Computing an "optimal" number of clusters for k-means clustering on every numeric
# column would not be feasible, and choosing an arbitrary k for every discretisation would not be as effective.
# With equal-frequency bins, attributes with few distinct values might end up with fewer than 5 bins. That's okay.
def discretiseNumeric(df):
    # Work on a copy so the input dataframe is not modified in place
    df = df.copy()
    for column in df.columns:
        if df.dtypes[column] == "int64" or df.dtypes[column] == "float64":
            # np.sort returns a sorted copy. Sorting df[column].values in place would reorder the
            # column itself and break the alignment between rows and their class labels
            values = np.sort(df[column].values)

            # Check if the values are already "discretised" (say, 8 or fewer unique values, as for ratings 0-5).
            # If so, then no need to discretise the attribute
            if len(set(values)) <= 8:
                continue

            # Choose the upper edges of bins 1-4 from the sorted values; bin 5 takes everything above the last edge
            step = len(values) // 5
            edges = [values[k * step] for k in range(1, 5)]

            # Assign every value a bin number 1-5: bin i holds the values <= edges[i - 1]
            df[column] = np.searchsorted(edges, df[column].values, side="left") + 1

    return df

############################################################################################################################

df2 = dataframe.copy()
df2 = preprocess(df2, datatype)
df2
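
To illustrate the trade-off described in the comments above, here is a small standard-library sketch contrasting equal-width and equal-frequency binning on a skewed sample. The helper names and sample values are made up for illustration, and the edge choice (`ordered[k * step - 1]`) differs slightly from the notebook's; pandas offers `pd.cut` and `pd.qcut` for the same two strategies.

```python
import bisect

def equal_width_bins(values, n_bins):
    """Split the range [min, max] into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [1] * len(values)
    # Clamp so that the maximum value falls in the last bin
    return [min(int((v - lo) / width), n_bins - 1) + 1 for v in values]

def equal_frequency_bins(values, n_bins):
    """Pick bin edges from the sorted values so bins hold similar counts."""
    ordered = sorted(values)
    step = max(1, len(ordered) // n_bins)
    # Upper edges of bins 1..n_bins-1; the last bin takes the rest
    edges = [ordered[k * step - 1] for k in range(1, n_bins)]
    return [bisect.bisect_left(edges, v) + 1 for v in values]

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # one outlier skews the range
width_bins = equal_width_bins(sample, 5)
frequency_bins = equal_frequency_bins(sample, 5)
print(width_bins)
print(frequency_bins)
```

With the outlier present, equal-width binning puts nine of the ten values in the first bin, while equal-frequency binning spreads them two per bin.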


In [246]:
# This function should calculate prior probabilities and likelihoods from the training data and use
# them to build a naive Bayes model

def train(dataframe, datatype):

    # Nominal, ordinal and (discretised) mixed data use categorical likelihoods;
    # numeric data uses Gaussian likelihoods
    if datatype in ["nominal", "ordinal", "mixed"]:
        modeltype = "NB"
    elif datatype in ["numeric"]:
        modeltype = "gaussianNB"

    #####################################################################################################################
    # code for normal naive bayes
    return
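
The training step for the categorical case is still a stub above. A minimal sketch of what it might compute is shown below, assuming the data is passed as a list of attribute tuples plus a list of labels rather than a dataframe. The function name, the `alpha` parameter (add-one smoothing by default) and the toy rows are illustrative assumptions, not the notebook's final design.

```python
from collections import Counter, defaultdict

def train_categorical_nb(rows, labels, alpha=1.0):
    """Estimate class priors and smoothed per-attribute likelihoods."""
    class_counts = Counter(labels)
    priors = {c: n / len(labels) for c, n in class_counts.items()}

    n_attrs = len(rows[0])
    # Distinct values seen for each attribute, used in the smoothing term
    domains = [{row[j] for row in rows} for j in range(n_attrs)]

    # counts[(class, attribute index)][value] = number of co-occurrences
    counts = defaultdict(Counter)
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            counts[(label, j)][value] += 1

    likelihoods = {}
    for c, class_total in class_counts.items():
        for j in range(n_attrs):
            denominator = class_total + alpha * len(domains[j])
            likelihoods[(c, j)] = {
                v: (counts[(c, j)][v] + alpha) / denominator for v in domains[j]
            }
    return priors, likelihoods

# Toy data, invented for illustration
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]
priors, likelihoods = train_categorical_nb(rows, labels)
print(priors)
print(likelihoods[("no", 0)])
```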

In [247]:
# This function should predict classes for new items in a test dataset (for the purposes of this assignment, you
# can re-use the training data as a test set)

def predict():
    return
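
Prediction is also a stub above. One possible shape for it, assuming a model stored as a dict of priors and a dict of per-attribute likelihood tables, is sketched below. The model values are hand-written for illustration, and the `unseen` fallback probability is an assumption for attribute values never observed in training.

```python
import math

def predict_categorical_nb(priors, likelihoods, row, unseen=1e-9):
    """Return the class with the highest log posterior score for `row`."""
    best_label, best_score = None, -math.inf
    for c, prior in priors.items():
        # Sum logs instead of multiplying probabilities to avoid underflow
        score = math.log(prior)
        for j, value in enumerate(row):
            score += math.log(likelihoods[(c, j)].get(value, unseen))
        if score > best_score:
            best_label, best_score = c, score
    return best_label

# Hand-written toy model with one attribute (illustrative values only)
toy_priors = {"no": 0.5, "yes": 0.5}
toy_likelihoods = {
    ("no", 0): {"sunny": 0.75, "rainy": 0.25},
    ("yes", 0): {"sunny": 0.25, "rainy": 0.75},
}
prediction = predict_categorical_nb(toy_priors, toy_likelihoods, ("sunny",))
print(prediction)
```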

In [248]:
# This function should evaluate the prediction performance by comparing your model’s class outputs to ground
# truth labels

def evaluate():
    return
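
As a starting point for the evaluation stub above, accuracy (the fraction of predictions that match the ground truth) can be computed as in this sketch. The function name and the example labels are illustrative.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that equal the corresponding true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot evaluate on an empty test set")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

score = accuracy(["yes", "no", "yes", "yes"], ["yes", "no", "no", "yes"])
print(score)
```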

## Questions 


If you are in a group of 1, you will respond to question (1), and **one** other of your choosing (two responses in total).

If you are in a group of 2, you will respond to question (1) and question (2), and **two** others of your choosing (four responses in total). 

A response to a question should take about 100–250 words, and make reference to the data wherever possible.

#### NOTE: you may develop code or functions in response to the questions, but your formal answer should be added to a separate file.

### Q1
Try discretising the numeric attributes in these datasets and treating them as discrete variables in the naïve Bayes classifier. You can use a discretisation method of your choice and group the numeric values into any number of levels (but around 3 to 5 levels would probably be a good starting point). Does discretising the variables improve classification performance, compared to the Gaussian naïve Bayes approach? Why or why not?
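
For the Gaussian side of this comparison, the likelihood of a numeric value under a class is the normal density evaluated with that class's mean and standard deviation. A minimal sketch follows; the function name is an assumption, and in practice the log of the density would usually be summed across attributes.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std at x."""
    if std <= 0:
        raise ValueError("std must be positive")
    exponent = -((x - mean) ** 2) / (2 * std ** 2)
    return math.exp(exponent) / (std * math.sqrt(2 * math.pi))

density_at_mean = gaussian_pdf(0.0, mean=0.0, std=1.0)
density_one_std_away = gaussian_pdf(1.0, mean=0.0, std=1.0)
print(density_at_mean, density_one_std_away)
```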

### Q2
Implement a baseline model (e.g., random or 0R) and compare the performance of the naïve Bayes classifier to this baseline on multiple datasets. Discuss why the baseline performance varies across datasets, and to what extent the naïve Bayes classifier improves on the baseline performance.
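
A 0R (Zero-R) baseline ignores the attributes and always predicts the most frequent training class, so its accuracy equals the proportion of the majority class. A small sketch, with made-up labels and a hypothetical helper name:

```python
from collections import Counter

def zero_r(train_labels):
    """Return the most frequent class label in the training data."""
    return Counter(train_labels).most_common(1)[0][0]

labels = ["e", "e", "e", "p", "p"]  # toy labels, invented for illustration
majority = zero_r(labels)
# Accuracy of always predicting the majority class on the same labels
baseline_accuracy = sum(label == majority for label in labels) / len(labels)
print(majority, baseline_accuracy)
```

This is why the baseline varies across datasets: a dataset with a dominant class gives 0R a high score, while a balanced one gives a score near one over the number of classes.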

### Q3
Since it’s difficult to model the probabilities of ordinal data, ordinal attributes are often treated as either nominal variables or numeric variables. Compare these strategies on the ordinal datasets provided. Determine which approach gives higher classification accuracy and discuss why.

### Q4
Evaluating the model on the same data that we use to train the model is considered to be a major mistake in Machine Learning. Implement a hold-out or cross-validation evaluation strategy (you should implement this yourself and do not simply call existing implementations from `scikit-learn`). How does your estimate of effectiveness change, compared to testing on the training data? Explain why. (The result might surprise you!)
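
A hold-out strategy can be implemented by shuffling the instance indices and setting a fraction aside for testing. The sketch below assumes the data is a list of rows with a parallel list of labels; the function name, the default 20% test fraction and the fixed seed are illustrative choices.

```python
import random

def holdout_split(rows, labels, test_fraction=0.2, seed=0):
    """Shuffle the indices and split them into train and test partitions."""
    indices = list(range(len(rows)))
    # A seeded generator keeps the split reproducible between runs
    random.Random(seed).shuffle(indices)
    n_test = int(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train_rows = [rows[i] for i in train_idx]
    train_labels = [labels[i] for i in train_idx]
    test_rows = [rows[i] for i in test_idx]
    test_labels = [labels[i] for i in test_idx]
    return train_rows, train_labels, test_rows, test_labels

rows = [(i,) for i in range(10)]          # toy instances
labels = [i % 2 for i in range(10)]       # toy labels
train_rows, train_labels, test_rows, test_labels = holdout_split(rows, labels)
print(len(train_rows), len(test_rows))
```

Repeating the split with several seeds and averaging the accuracy gives a rough idea of how much the estimate depends on one particular partition.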

### Q5
Implement one of the advanced smoothing regimes (add-k, Good-Turing). Does changing the smoothing regime (or indeed, not smoothing at all) affect the effectiveness of the naïve Bayes classifier? Explain why, or why not.
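
Add-k smoothing adds a constant k to every value count and k times the number of distinct attribute values to the denominator, so unseen combinations get a small non-zero probability. A minimal sketch with made-up counts (the function and parameter names are assumptions):

```python
def add_k_probability(count, class_total, n_values, k=1.0):
    """Smoothed estimate of P(value | class) under add-k smoothing."""
    denominator = class_total + k * n_values
    if denominator == 0:
        raise ValueError("no data and no smoothing: probability is undefined")
    return (count + k) / denominator

# A value never seen with a class of 10 instances, attribute with 3 values
unsmoothed = add_k_probability(0, 10, 3, k=0)
laplace = add_k_probability(0, 10, 3, k=1)
half = add_k_probability(0, 10, 3, k=0.5)
print(unsmoothed, laplace, half)
```

With k = 0 the estimate is exactly zero, which wipes out the whole product of likelihoods for that class; any k > 0 avoids this, and smaller k stays closer to the raw frequencies.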

### Q6
The Gaussian naïve Bayes classifier assumes that numeric attributes come from a Gaussian distribution. Is this assumption always true for the numeric attributes in these datasets? Identify some cases where the Gaussian assumption is violated and describe any evidence (or lack thereof) that this has some effect on the NB classifier’s predictions.
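
One rough, informal check of the Gaussian assumption is sample skewness: a Gaussian is symmetric, so its skewness is near zero, while a long-tailed attribute gives a clearly non-zero value. The sketch below uses the population moment formula on made-up samples; it is only a screening heuristic, and histograms per class would be the natural complement.

```python
def sample_skewness(values):
    """Third standardised moment: about 0 for symmetric data."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        raise ValueError("all values are identical: skewness is undefined")
    return m3 / m2 ** 1.5

symmetric = sample_skewness([1, 2, 3, 4, 5])
long_tailed = sample_skewness([1, 1, 1, 1, 10])
print(symmetric, long_tailed)
```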