
### The University of Melbourne, School of Computing and Information Systems
# COMP30027 Machine Learning, 2020 Semester 1

## Assignment 1: Naive Bayes Classifiers

###### Submission deadline: 7 pm, Monday 20 Apr 2020

**Student Name(s):**    `Alec Yu, Michael Jaworski`

**Student ID(s):**     `993433, 833751`


This iPython notebook is a template which you will use for your Assignment 1 submission.

Marking will be applied on the four functions that are defined in this notebook, and to your responses to the questions at the end of this notebook (Submitted in a separate PDF file).

**NOTE: YOU SHOULD ADD YOUR RESULTS, DIAGRAMS AND IMAGES FROM YOUR OBSERVATIONS IN THIS FILE TO YOUR REPORT (the PDF file).**

You may change the prototypes of these functions, and you may write other functions, according to your requirements. We would appreciate it if the required functions were prominent/easy to find.

**Adding proper comments to your code is MANDATORY.**

In [378]:
import os
import pandas as pd
import numpy as np
# count attribute values
from collections import Counter
from math import sqrt, e, pi
from random import sample
import matplotlib.pyplot as plt

In [495]:
def get_filenames(location: str):
    
    # returns filenames from datasets folder location
    
    files = os.listdir(location)
    
    # match the extension only at the end of the name, then strip it
    return [filename[:-len(".data")] for filename in files if filename.endswith(".data")]
    

def read_data(filename: str):  
    
    # reads the filename.data data file and the filename.h header file and returns a dataframe
    
    df = pd.read_csv(f"datasets/{filename}.data", header=None)
    # close the header file after reading and drop any trailing newline
    with open(f"datasets/{filename}.h", "r") as f:
        header = f.read().strip().split(",")
    df.columns = header
    
    return df

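def get_dtypes(filename: str):
    
    # maps attribute name -> dtype string from lines of the form "name: dtype"
    # note: the path is fixed to the university file; the filename argument is not used
    
    dtypes = {}
    
    with open("datasets/university.dtypes.txt") as f:
        attributes = f.read().splitlines()
    
    for attribute in attributes:
        # skip blank lines (e.g. after a trailing newline), which would raise IndexError
        if ": " not in attribute:
            continue
        name, dtype = attribute.split(": ", 1)
        dtypes[name] = dtype
    
    return dtypes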

def replace_all(df, d):
    
    # replace multiple strings (i.e. '?' and 'unknown' with np.nan)
    
    for k, v in d.items():
        df = df.replace(k, v)
    
    return df

In [380]:
location = "datasets"
filenames = get_filenames(location)
filenames

['adult',
 'bank',
 'breast-cancer-wisconsin',
 'car',
 'lymphography',
 'mushroom',
 'nursery',
 'somerville',
 'trainingtest',
 'university',
 'wdbc',
 'wine']

In [463]:
for filename in filenames:
    df = read_data(filename)
    print(df.head())

   age         workclass  fnlwgt  education  education-num  \
0   39         State-gov   77516  Bachelors             13   
1   50  Self-emp-not-inc   83311  Bachelors             13   
2   38           Private  215646    HS-grad              9   
3   53           Private  234721       11th              7   
4   28           Private  338409  Bachelors             13   

       marital-status         occupation   relationship   race     sex  \
0       Never-married       Adm-clerical  Not-in-family  White    Male   
1  Married-civ-spouse    Exec-managerial        Husband  White    Male   
2            Divorced  Handlers-cleaners  Not-in-family  White    Male   
3  Married-civ-spouse  Handlers-cleaners        Husband  Black    Male   
4  Married-civ-spouse     Prof-specialty           Wife  Black  Female   

   capital-gain  capital-loss  hours-per-week native-country  CLASS  
0          2174             0              40  United-States  <=50K  
1             0             0             

In [411]:
# # find nan encoding i.e. '?' or 'unknown' 

# # get filenames
# filenames = get_filenames("datasets")

# attribute_values_list = []

# for filename in filenames:
    
#     df = read_data(filename)
    
#     for column in df.columns[:-1]:
#         attribute_values = df[column].tolist()
#         attribute_values_list.append(attribute_values)
        
# # sort attribute values by frequency

# attribute_values = [value for attribute_values in attribute_values_list for value in attribute_values]
# attribute_value_counter = Counter(attribute_values)
# attribute_value_counter = {k:v for k, v in sorted(attribute_value_counter.items(), key = lambda item: item[1], reverse = True)}

Hence, `'unknown'` and `'?'` encode missing values; both are replaced with `np.nan`.

In [513]:
# This function should prepare the data by reading it from a file and converting it into a useful format for training and testing

def preprocess(df, filename):
    
    # replace missing values with np.nan
    
    replacements = {"unknown": np.nan, "?": np.nan}  
    df = replace_all(df, replacements)
    
    # drop every column that contains at least one missing value
    df = df.dropna(axis = 'columns')
    
    # rearrange class to [-1] position
    CLASS = df.pop("CLASS").astype("category")
    df["CLASS"] = CLASS
    
    # correct dtypes
    
    nominal_files = ["breast-cancer-wisconsin", "mushroom", "lymphography"]
    numeric_files = ["wdbc", "wine"]
    ordinal_files = ["car", "nursery", "somerville"]
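    mixed_files = ["adult", "bank", "university"]
    
    # files in none of these lists (e.g. "trainingtest") keep their dtypes and get datatype None
    datatype = None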
    
    if (filename in nominal_files):
        datatype = "nominal"
        for column in df.columns[:-1]:
            df[column] = df[column].astype("category")
            
    elif (filename in ordinal_files):
        datatype = "ordinal"
        for column in df.columns[:-1]:
            df[column] = df[column].astype("category")
        
    elif (filename in numeric_files):
        datatype = "numeric"
        for column in df.columns[:-1]:
            df[column] = pd.to_numeric(df[column])
        
    elif (filename in mixed_files):
        datatype = "mixed"
        dtypes = get_dtypes(filename)
        for column in df.columns[:-1]:
            try:
                dtype = dtypes[column]
                if dtype == "categoric":
                    df[column] = df[column].astype("category")
                if dtype == "numeric":
                    df[column] = pd.to_numeric(df[column])
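            # column missing from the dtypes file, or not convertible to numeric: leave it as-is
            except (KeyError, ValueError):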
                pass
            
    return df, datatype

In [506]:
# for filename in filenames:
#     df = read_data(filename)
#     df, datatype = preprocess(df, filename)
#     print(df.head())

In [508]:
filename = "university"
df = read_data(filename)
df, datatype = preprocess(df, filename)
df.head()

| | university-name | state | control | number-of-students | male:female-ratio | student:faculty-ratio | sat-veral | sat-math | expenses | percent-financial-aid | number-of-applicants | percent-admittance | percent-enrolled | social | quality-of-life | academic-emphasis | CLASS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | adelphi | newyork | private | 5-10 | 0.3 | 15.0 | 500 | 475 | 7-10 | 60 | 4-7 | 70 | 40 | 2 | 2 | biology | 2 |
| 1 | arizona-state | arizona | state | 20+ | 0.5 | 20.0 | 450 | 500 | 4-7 | 50 | 17+ | 80 | 60 | 4 | 5 | fine-arts | 3 |
| 2 | boston-college | massachusetts | private:roman-catholic | 5-10 | 0.4 | 20.0 | 500 | 550 | 10+ | 60 | 10-13 | 50 | 40 | 5 | 3 | english | 4 |
| 3 | boston-university | massachusetts | private | 10-15 | 0.45 | 12.0 | 550 | 575 | 10+ | 60 | 13-17 | 60 | 40 | 4 | 3 | liberal-arts | 4 |
| 4 | brown | rhodeisland | private | 5- | 0.5 | 11.0 | 625 | 650 | 10+ | 40 | 10-13 | 20 | 50 | 4 | 5 | arts:sciences | 5 |


In [433]:
# k_means discretization function.
# optimize_k and get_wss functions are used to determine the optimal k value.

def k_means(values, k, old_centroids = None):

    # 1. Select k points at random to act as seed centroids
    # 2. Assign each instance to the cluster with nearest
    # centroid
    # 3. Recompute centroids of the clusters using current
    # assignment. Centroid = centre or mean point of cluster
    # 4. Repeat steps 2 and 3 until the assignment of instances to
    # clusters is stable
    
    if old_centroids is None:
        centroids = sample(list(set(values)), k)
    else:
        centroids = old_centroids    
        
    clusters = [None for val in range(len(values))]

    for i in range(len(values)):
        value = values[i]

        # initialize distance to 'infinity' and cluster to 'none'
        distance = np.inf
        cluster = None

        for j in range(len(centroids)):

            centroid = centroids[j]

            if abs(centroid - value) < distance:

                distance = abs(centroid - value)

                clusters[i] = j  

    new_centroids = [[] for i in range(k)]

    for i in range(len(clusters)):
        cluster = clusters[i]
        new_centroids[cluster].append(values[i])
    
    new_centroids = list(map(np.mean, new_centroids))

    if set(new_centroids) == set(centroids):
        return clusters, centroids
    else:
        clusters, centroids = k_means(values, k, new_centroids)
    
    return clusters, centroids

def get_wss(values, clusters, centroids):
    
    se = []

    for i in range(len(values)):
        se.append((centroids[clusters[i]] - values[i]) ** 2)

    wss = sum(se)

    return wss

def optimize_k(values, ks):
    
    wsss = {}
    
    for k in ks:
        clusters, centroids = k_means(values, k)
        wss = get_wss(values, clusters, centroids)
        wsss[k] = wss
        
    return wsss
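
The recursive `k_means` above makes one recursive call per iteration, so a slow-converging attribute could reach Python's recursion limit, and `np.mean` of an empty cluster yields `nan`. A minimal iterative sketch of the same four steps follows. The name `k_means_1d`, the explicit seed centroids, and the `max_iter` cap are assumptions made for illustration, not part of the notebook.

```python
def k_means_1d(values, centroids, max_iter=100):
    """Iterative 1-D k-means starting from the given seed centroids (hypothetical helper)."""
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iter):
        # assignment step: index of the nearest centroid for every value
        clusters = [
            min(range(len(centroids)), key=lambda j: abs(centroids[j] - v))
            for v in values
        ]
        # update step: mean of each cluster; an empty cluster keeps its old centroid
        updated = []
        for j, old in enumerate(centroids):
            members = [v for v, c in zip(values, clusters) if c == j]
            updated.append(sum(members) / len(members) if members else old)
        # stop once the centroids no longer move
        if updated == centroids:
            break
        centroids = updated
    return clusters, centroids


# toy data with two obvious groups
clusters, centroids = k_means_1d([1, 2, 3, 10, 11, 12], centroids=[1, 12])
```

Passing the seeds in explicitly keeps the sketch deterministic; random seeding with `sample` as in the notebook would work the same way.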

Testing the `optimize_k()` function:

In [434]:
# kss = [3, 5, 10, 50, 100]
# wsss = optimize_k(values, kss)
# plt.plot(list(wsss.keys()), list(wsss.values()))
# plt.title("elbow")

In [517]:
def get_categoric_priors(df):
    
    # returns dataframe of class-conditional probabilities P(attribute value | class) for categorical attributes
    
    priors = []
    
    for column in df.columns[:-1]:
        
        # create "n" column to be able to aggregate attributes
        
        df2 = df.copy()
        df2["n"] = 1
        
        # get count of classes and count of attribute given class
        
        attributes = df2.groupby(["CLASS", column]).agg({"n": "count"}).reset_index()
        classes = attributes.groupby(["CLASS"])["n"].sum().reset_index().rename(columns = {"n": "total"})
        prior = pd.merge(attributes, classes, on = "CLASS", how = "left")
        
        # create "attribute" column to be able to concat prior dataframes
        
        prior["attribute"] = column
        prior = prior.rename(columns = {column: "attribute_value"})
        
        # get P(attribute|class)
        
        prior["p"] = prior.n / prior.total
        prior = prior[["CLASS", "attribute", "attribute_value", "n", "total", "p"]]
        
        priors.append(prior)
        
    priors = pd.concat(priors)
        
    return priors

In [542]:
def get_numeric_statistics(df):
    
    # returns dataframe of numerical attributes' mean and standard deviation,
    # pooled over all classes; non-numeric columns such as CLASS are skipped
    
    numerical_statistics = df.select_dtypes("number").agg(["mean", "std"]).transpose().reset_index().rename(columns = {"index": "attribute"})
    
    return numerical_statistics
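
Gaussian naive Bayes evaluates P(X = x | C), so it needs a mean and standard deviation per attribute *per class*, whereas `get_numeric_statistics` computes one pair per attribute over the whole dataframe. The sketch below shows the per-class grouping for a single attribute using only the standard library. The function name `class_conditional_stats` and the toy values are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean, stdev


def class_conditional_stats(values, labels):
    """Return {label: (mean, sample std)} for one numeric attribute (hypothetical helper)."""
    grouped = defaultdict(list)
    for value, label in zip(values, labels):
        grouped[label].append(value)
    # stdev needs at least two points; fall back to 0.0 for single-instance classes
    return {
        label: (mean(group), stdev(group) if len(group) > 1 else 0.0)
        for label, group in grouped.items()
    }


# toy attribute with two classes
stats = class_conditional_stats([1.0, 3.0, 10.0, 14.0], ["a", "a", "b", "b"])
```

A standard deviation of 0.0 cannot be passed to the Gaussian density directly, so such classes would need separate handling.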

In [511]:
def get_class_priors(df):
    
    # returns the probability of each class occurring
    
    prior = df.groupby(["CLASS"]).size().reset_index().rename(columns = {0: "n"})
    prior["total"] = sum(prior["n"])
    
    # get P(class)
    
    prior["p"] = prior.n / prior.total
    
    return prior

In [512]:
def gaussian_pdf(x, mean, std):
    
    # gaussian function used to determine P(X = x|C) 
    
    return (1/(std * sqrt(2 * pi))) * e ** (-1/2 * ((x - mean)/std) ** 2)
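
Multiplying many small densities can underflow to 0.0, so a common alternative is to sum log-densities instead. The sketch below is the logarithm of the same Gaussian formula; the name `gaussian_log_pdf` is an assumption for illustration.

```python
from math import log, pi


def gaussian_log_pdf(x, mean, std):
    # log of the normal density:
    # -log(std) - 0.5 * log(2 * pi) - 0.5 * ((x - mean) / std) ** 2
    return -log(std) - 0.5 * log(2 * pi) - 0.5 * ((x - mean) / std) ** 2


# at the mean of a standard normal the density is 1 / sqrt(2 * pi)
log_density_at_mean = gaussian_log_pdf(0.0, 0.0, 1.0)
# far in the tail the log-density stays finite even though the density is tiny
log_density_in_tail = gaussian_log_pdf(100.0, 0.0, 1.0)
```

With log-densities, the class score becomes log P(c) plus the sum of the per-attribute terms, and the arg max is unchanged.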

In [439]:
class_priors = get_class_priors(df)

In [537]:
class train_model:
    def __init__(self, class_priors, categoric_priors, numeric_statistics):
        self.class_priors = class_priors
        self.categoric_priors = categoric_priors
        self.numeric_statistics = numeric_statistics
    
    @staticmethod
    def gaussian_pdf(x, mean, std):
        # gaussian function used to determine P(X = x|C) 
        return (1/(std * sqrt(2 * pi))) * e ** (-1/2 * ((x - mean)/std) ** 2)
    
    # instance methods take self as first argument; prediction is not implemented yet
    def predict_nb(self, df):
        pass
    def predict_gnb(self, df):
        pass

In [538]:
get_class_priors(df)

| | CLASS | n | total | p |
|---|---|---|---|---|
| 0 | 1 | 6 | 231 | 0.025974 |
| 1 | 2 | 27 | 231 | 0.116883 |
| 2 | 3 | 84 | 231 | 0.363636 |
| 3 | 4 | 82 | 231 | 0.354978 |
| 4 | 5 | 32 | 231 | 0.138528 |


In [539]:
class_priors = get_class_priors(df)
categoric_priors = get_categoric_priors(df)

In [540]:
nb = train_model(class_priors, categoric_priors, None)

In [541]:
nb.numeric_statistics

In [448]:
# This function should calculate prior probabilities and likelihoods from the training data and use
# them to build a naive Bayes model

def train(df, datatype):
    
    class train_model:
        def __init__(self, class_priors, categoric_priors, numeric_statistics):
            self.class_priors = class_priors
            self.categoric_priors = categoric_priors
            self.numeric_statistics = numeric_statistics

        @staticmethod
        def gaussian_pdf(x, mean, std):
            # gaussian function used to determine P(X = x|C) 
            return (1/(std * sqrt(2 * pi))) * e ** (-1/2 * ((x - mean)/std) ** 2)

        def predict_nb(self, df):
            pass
        def predict_gnb(self, df):
            pass

    class_priors = get_class_priors(df)
    
    if datatype == "nominal" or datatype == "ordinal":
        categoric_priors = get_categoric_priors(df)
        model = train_model(class_priors, categoric_priors, None)
        return model
        
    if datatype == "numeric":
        numeric_statistics = get_numeric_statistics(df)
        model = train_model(class_priors, None, numeric_statistics)
        return model
        
    if datatype == "mixed":
        # preprocess() has already cast the columns, so split on dtype:
        # category columns (CLASS stays last) get likelihood tables,
        # numeric columns get mean and standard deviation
        categoric_priors = get_categoric_priors(df.select_dtypes("category"))
        numeric_statistics = get_numeric_statistics(df.select_dtypes("number"))
        model = train_model(class_priors, categoric_priors, numeric_statistics)
        return model
    
    raise ValueError(f"unknown datatype: {datatype}")

In [553]:
df.dtypes

university-name            object
state                      object
control                    object
number-of-students         object
male:female-ratio         float64
student:faculty-ratio     float64
sat-veral                   int64
sat-math                    int64
expenses                 category
percent-financial-aid       int64
number-of-applicants     category
percent-admittance          int64
percent-enrolled            int64
social                   category
quality-of-life          category
academic-emphasis        category
CLASS                    category
dtype: object

In [548]:
len(df.columns)

17

In [550]:
len(df.select_dtypes("number").columns)

7

In [551]:
len(df.select_dtypes("category").columns)

6

In [None]:
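# This function should predict classes for new items in a test dataset (for the purposes of this assignment, you
# can re-use the training data as a test set)

def predict(df, datatype, attribute_priors, class_priors):
    
    # only categorical attributes are handled here; CLASS is assumed to be the last column
    if datatype not in ("nominal", "ordinal"):
        raise NotImplementedError("only nominal and ordinal data are handled")
    
    # (class, attribute, value) -> P(value | class), taken from get_categoric_priors()
    likelihood = attribute_priors.set_index(["CLASS", "attribute", "attribute_value"])["p"].to_dict()
    predictions = []
    
    for _, row in df.iterrows():
        scores = {}
        for CLASS, class_p in zip(class_priors["CLASS"], class_priors["p"]):
            p = class_p
            for attribute in df.columns[:-1]:
                # combinations never seen in training get probability 0 (no smoothing)
                p *= likelihood.get((CLASS, attribute, row[attribute]), 0)
            scores[CLASS] = p
        predictions.append(max(scores, key=scores.get))
        
    return predictions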

In [396]:
# first instance without its class label
index, row = next(df.drop(columns="CLASS").iterrows())

In [397]:
row

age                          39
workclass             State-gov
fnlwgt                    77516
education             Bachelors
education-num                13
marital-status    Never-married
occupation         Adm-clerical
relationship      Not-in-family
race                      White
sex                        Male
capital-gain               2174
capital-loss                  0
hours-per-week               40
native-country    United-States
Name: 0, dtype: object

In [566]:
CLASSES = class_priors.CLASS.tolist()
CLASS = CLASSES[0]

In [567]:
CLASS

'<=50K'

In [398]:
df.dtypes

age                int64
workclass         object
fnlwgt             int64
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
CLASS             object
dtype: object

In [652]:
CLASSES = class_priors.CLASS.tolist()

class_ps = {}

for CLASS in CLASSES:

    attribute_ps = {}
    
    for attribute in row.index:

        attribute_value = row[attribute]
        
        if datatype == "numeric":
            
            # attribute_priors holds the mean and std of each numeric attribute
            stats = attribute_priors.loc[attribute_priors["attribute"] == attribute]
            attribute_p = gaussian_pdf(attribute_value, stats["mean"].values[0], stats["std"].values[0])
            
        else:
            
            match = attribute_priors.loc[
                (attribute_priors["CLASS"] == CLASS) &
                (attribute_priors["attribute"] == attribute) &
                (attribute_priors["attribute_value"] == attribute_value)
            ]
            
            # no matching row: this value never occurs with this class, so P = 0 (no smoothing)
            attribute_p = match.p.values[0] if len(match) else 0

        attribute_ps[attribute] = attribute_p
    
    # score = P(class) * product of P(attribute value | class), computed once per class
    class_p = class_priors.loc[class_priors["CLASS"] == CLASS].p.values[0]
    
    pi_attributes_p = 1
    for value in attribute_ps.values():
        pi_attributes_p *= value
    
    class_ps[CLASS] = class_p * pi_attributes_p

age 39


In [653]:
class_ps

{'<=50K': 3.2731043372845494e-15, '>50K': 0.0}

In [648]:
attribute_ps = {}
    
for index in row.index:

    attribute = index 
    attribute_value = row[index]

    attribute_p = attribute_priors.loc[
        (attribute_priors["CLASS"] == CLASS) & \
        (attribute_priors["attribute"] == attribute) & \
        (attribute_priors["attribute_value"] == attribute_value)
    ]
    # no matching row: this value never occurs with this class
    if len(attribute_p) == 0:
        attribute_p = 0
    else:
        attribute_p = attribute_p.p.values[0]
        
    attribute_ps[attribute] = attribute_p

In [641]:
attribute_p = attribute_priors.loc[
        (attribute_priors["CLASS"] == CLASS) & \
        (attribute_priors["attribute"] == attribute) & \
        (attribute_priors["attribute_value"] == attribute_value)
    ]

In [647]:
attribute_p.values

AttributeError: 'numpy.float64' object has no attribute 'values'

In [638]:
CLASS

'>50K'

In [633]:
attribute_value

77516

In [623]:
attribute_ps

{'age': 0.0354546613952302, 'workclass': 0.046143790849673204}

In [621]:
pi_attributes_p = 1
for value in attribute_ps.values():
    pi_attributes_p *= value

In [619]:
attribute_ps.values()

dict_values([array([0.03545466]), array([0.04614379]), array([], dtype=float64)])

In [614]:
attribute_p

4.311308670158666e-15

In [594]:
p = attribute_priors.loc[
    (attribute_priors["CLASS"] == CLASS) & \
    (attribute_priors["attribute"] == attribute) & \
    (attribute_priors["attribute_value"] == attribute_value)
]["p"].values[0]

In [595]:
z = []
z.append(p)

In [596]:
z

[0.021763754045307445]

In [None]:
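def nb(row, attribute_priors, class_priors):
    # look up P(value | class) for each attribute; unseen combinations get 0 (no smoothing)
    table = attribute_priors.set_index(["CLASS", "attribute", "attribute_value"])["p"].to_dict()
    scores = {}
    for CLASS, class_p in zip(class_priors["CLASS"], class_priors["p"]):
        likelihoods = [table.get((CLASS, attribute, value), 0) for attribute, value in row.items()]
        scores[CLASS] = class_p * np.prod(likelihoods)
    prediction = max(scores, key=scores.get)
    return prediction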

In [None]:
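# This function should evaluate the prediction performance by comparing your model’s class outputs to ground
# truth labels

def evaluate(predictions, labels):
    # accuracy: fraction of predictions that match the ground-truth labels
    return sum(p == t for p, t in zip(predictions, labels)) / len(labels)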

## Questions 


If you are in a group of 1, you will respond to question (1), and **one** other of your choosing (two responses in total).

If you are in a group of 2, you will respond to question (1) and question (2), and **two** others of your choosing (four responses in total). 

A response to a question should take about 100–250 words, and make reference to the data wherever possible.

#### NOTE: you may develop code or functions in response to the questions, but your formal answer should be added to a separate file.

### Q1
Try discretising the numeric attributes in these datasets and treating them as discrete variables in the naïve Bayes classifier. You can use a discretisation method of your choice and group the numeric values into any number of levels (but around 3 to 5 levels would probably be a good starting point). Does discretising the variables improve classification performance, compared to the Gaussian naïve Bayes approach? Why or why not?
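
As one possible discretisation method, equal-width binning splits the range of an attribute into intervals of the same size. The sketch below is an illustration only; the name `equal_width_bins` and the toy values are assumptions, and other methods (such as k-means) fit the question equally well.

```python
def equal_width_bins(values, n_bins=3):
    """Map each numeric value to a bin index in 0..n_bins-1 (hypothetical helper)."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    if width == 0:
        # constant attribute: everything falls into a single bin
        return [0 for _ in values]
    # the maximum value would land in bin n_bins, so clamp it into the last bin
    return [min(int((v - low) / width), n_bins - 1) for v in values]


# toy attribute: range 1..10 split into three intervals of width 3
bins = equal_width_bins([1, 2, 5, 9, 10], n_bins=3)
```

The resulting bin indices can then be treated as categorical values when counting P(attribute value | class).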

### Q2
Implement a baseline model (e.g., random or 0R) and compare the performance of the naïve Bayes classifier to this baseline on multiple datasets. Discuss why the baseline performance varies across datasets, and to what extent the naïve Bayes classifier improves on the baseline performance.
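
A 0R (zero-R) baseline always predicts the most frequent training class, so its accuracy equals the share of that class in the evaluation data. A minimal sketch, with hypothetical function names and toy labels rather than real dataset counts:

```python
from collections import Counter


def zero_r(train_labels):
    """Return the majority class of the training labels."""
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(predictions, labels):
    """Fraction of predictions that equal the true labels."""
    return sum(p == t for p, t in zip(predictions, labels)) / len(labels)


# toy labels: three of four instances belong to the majority class
labels = ["<=50K", "<=50K", "<=50K", ">50K"]
majority = zero_r(labels)
baseline_accuracy = accuracy([majority] * len(labels), labels)
```

This makes the dependence on class balance explicit: the more skewed the class distribution, the higher the 0R accuracy.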

### Q3
Since it’s difficult to model the probabilities of ordinal data, ordinal attributes are often treated as either nominal variables or numeric variables. Compare these strategies on the ordinal datasets provided. Determine which approach gives higher classification accuracy and discuss why.
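
Treating an ordinal attribute as numeric requires mapping its levels to numbers that respect their order. A minimal sketch follows; the function name and the levels `low`, `med`, `high` are hypothetical, and the real level order has to be taken from each dataset's description.

```python
def encode_ordinal(values, ordered_levels):
    """Replace each level with its rank in ordered_levels (0 = lowest)."""
    rank = {level: i for i, level in enumerate(ordered_levels)}
    return [rank[v] for v in values]


# hypothetical levels, listed from lowest to highest
encoded = encode_ordinal(["low", "high", "med", "low"], ["low", "med", "high"])
```

Integer ranks assume equal spacing between adjacent levels, which is itself a modelling choice worth discussing in the answer.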

### Q4
Evaluating the model on the same data that we use to train the model is considered to be a major mistake in Machine Learning. Implement a hold–out or cross–validation evaluation strategy (you should implement this yourself and do not simply call existing implementations from `scikit-learn`). How does your estimate of effectiveness change, compared to testing on the training data? Explain why. (The result might surprise you!)
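
A hold-out strategy can be sketched as a shuffled split of row positions into a training part and a test part. The function name, the 80/20 default, and the fixed seed below are assumptions for illustration.

```python
from random import Random


def holdout_split(n_rows, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) as disjoint lists of row positions."""
    indices = list(range(n_rows))
    # a seeded generator keeps the split reproducible between runs
    Random(seed).shuffle(indices)
    n_test = int(round(n_rows * test_fraction))
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = holdout_split(10, test_fraction=0.3, seed=42)
```

With a dataframe, the two lists could be passed to `df.iloc[...]` to obtain the training and test rows; the model is then trained on the first part only and evaluated on the second.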

### Q5
Implement one of the advanced smoothing regimes (add-k, Good-Turing). Does changing the smoothing regime (or indeed, not smoothing at all) affect the effectiveness of the naïve Bayes classifier? Explain why, or why not.
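
Add-k smoothing adds k to every count and k times the number of distinct attribute values to the class total, so an unseen value no longer forces the whole product to zero. A minimal sketch with a hypothetical function name and toy counts:

```python
def add_k_probability(count, class_total, n_values, k=1.0):
    """P(value | class) with add-k smoothing; k = 1 is Laplace smoothing."""
    return (count + k) / (class_total + k * n_values)


# toy attribute with two possible values, one of them never seen in this class
p_unseen = add_k_probability(count=0, class_total=8, n_values=2, k=1.0)
p_seen = add_k_probability(count=8, class_total=8, n_values=2, k=1.0)
```

The smoothed probabilities of all values of an attribute still sum to 1 within a class, which is what the `k * n_values` term in the denominator ensures.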

### Q6
The Gaussian naïve Bayes classifier assumes that numeric attributes come from a Gaussian distribution. Is this assumption always true for the numeric attributes in these datasets? Identify some cases where the Gaussian assumption is violated and describe any evidence (or lack thereof) that this has some effect on the NB classifier’s predictions.
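
One simple numeric check of the Gaussian assumption is skewness, which is 0 for a symmetric distribution and grows with a long right tail. The sketch below uses toy values, not values from the datasets, and the function name is an assumption; histograms would be a natural complement.

```python
from statistics import mean, pstdev


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mu, sigma = mean(values), pstdev(values)
    return sum(((v - mu) / sigma) ** 3 for v in values) / len(values)


# a symmetric toy sample and a heavily right-skewed one
symmetric = skewness([1, 2, 3, 4, 5])
right_skewed = skewness([0, 0, 0, 0, 100])
```

A strongly non-zero skewness for an attribute within a class suggests that a single Gaussian describes it poorly; a constant attribute has zero standard deviation and would need to be excluded before calling this function.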