# The University of Melbourne, School of Computing and Information Systems
# COMP30027 Machine Learning, 2018 Semester 1
-----
## Project 1: What is labelled data worth to Naive Bayes?
-----
###### Student Name(s): William Liandri
###### Student ID: 728710
###### Python version: 3.0

This iPython notebook is a template which you may use for your Project 1 submission. (You are not required to use it; in particular, there is no need to use iPython if you do not like it.)

Marking will be applied on the seven functions that are defined in this notebook, and to your responses to the questions at the end of this notebook.

You may change the prototypes of these functions, and you may write other functions, according to your requirements. We would appreciate it if the required functions were prominent/easy to find. 

In [1]:
# Libraries
import pandas as pd
import numpy as np
import random
import math
from collections import defaultdict
from collections import OrderedDict

# Preprocess

In [2]:
# This function should open a data file in csv, and transform it into a usable format 
def preprocess(filedir):
    df = pd.read_csv(filedir, header=None)
    
    # Remove instances with missing values; the helper also returns how many were removed
    df, totalMissingValues = remove_missing_values(df)
    
    return df, totalMissingValues

In [3]:
# This function removes instances (rows) that contain missing values.
def remove_missing_values(df):
    newdf = pd.DataFrame
    list_data = []
    df_columns = df.columns
    tempDict = OrderedDict()
    totalMissingValues = 0
    
    # Iterate through the data to find missing values. 
    for i in df.values:

        # The instance does not contain missing values, so keep it in list_data.
        if('?' not in i):
            list_data.append(i)
        
        # Otherwise count it as an instance with at least one missing value.
        else:
            totalMissingValues += 1
            
    # Put the data back into DataFrame. 
    for i in range(len(df_columns)):
        tempData = [k[i] for k in list_data]
        tempDict[df_columns[i]] = tempData
    
    return newdf.from_dict(tempDict), totalMissingValues
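
The row filter above can also be written with pandas boolean masks. The following is a minimal sketch rather than the notebook's method: the function name `drop_rows_with_marker` and the toy data are hypothetical, and it assumes missing values are always encoded as the string `'?'`.

```python
import pandas as pd

def drop_rows_with_marker(df, marker="?"):
    """Return (clean_df, n_dropped), dropping every row that contains `marker`."""
    # True for each row in which at least one cell equals the marker.
    has_missing = df.eq(marker).any(axis=1)
    clean = df.loc[~has_missing].reset_index(drop=True)
    return clean, int(has_missing.sum())

# Toy data (hypothetical): rows 1 and 2 each contain a '?'.
toy = pd.DataFrame([["a", "x", "yes"],
                    ["?", "y", "no"],
                    ["b", "?", "no"],
                    ["c", "z", "yes"]])
clean, dropped = drop_rows_with_marker(toy)
```

As in `remove_missing_values`, the count is the number of dropped instances, not the number of individual `'?'` cells.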

# Supervised

## Train Supervised

In [4]:
# This function should build a supervised NB model.
def train_supervised(df):   
    
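    # Calculate the class priors and the per-class attribute counts.
    # (The code calls these "posteriors"; strictly they are the likelihoods P(value | class).)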
    prob_priors = prob_priors_supervised(df)
    posteriors = count_posteriors_supervised(df)
    
    prob_posteriors = prob_posteriors_supervised(df, posteriors)
    
    return prob_priors, prob_posteriors

In [5]:
# This function will calculate the prior probability of each class.
def prob_priors_supervised(df):
    
    last_column = df.columns[-1]
    tempDict = df[last_column].value_counts().to_dict()
    sumPriors = sum(tempDict.values())

    # Calculate the probability of the priors
    for i in tempDict.keys():
        tempDict[i] /= sumPriors
    
    return tempDict

In [6]:
# This function will count the posteriors of the data.
def count_posteriors_supervised(df):
    
    columns = df.columns
    df_posteriors = defaultdict(defaultdict)
    
    # Iterate through the columns
    for i in range(len(columns) - 1):
        dictdf = defaultdict(defaultdict)
        
        # Find the unique values for the labelled class
        unique_class = df[columns[len(columns) - 1]].unique()
        
        # Iterate through the unique values.
        for j in range(len(unique_class)):
            dictdf2 = defaultdict(int)
            
            #Iterate through the data
            for k in range(len(df.values)):
                
                # Count this attribute value for the current class.
                if(df.values[k][len(columns)-1] == unique_class[j]):
                    dictdf2[df.values[k][i]] += 1
                
            # Put the data into the dictionaries and array.
            dictdf[unique_class[j]] = dictdf2
                    
        df_posteriors[columns[i]] = dictdf
        
    return df_posteriors

In [7]:
# This function will calculate the probability of the posteriors of the data.
def prob_posteriors_supervised(df, dictPosteriors):
    
    columns = df.columns
    
    # The classes
    uniqueClass = df[columns[-1]].unique()
    
    # How many times each class appears in the data.
    countUniqueClass = df[columns[-1]].value_counts().to_dict()
    
    # Iterate through the attributes
    for i in range(len(columns) - 1):
        uniqueAttributes = df[columns[i]].unique()
    
        for currClass in uniqueClass:
            for currAttrb in uniqueAttributes:
                
                # Calculating the probability of the posteriors and also do the
                # smoothing (Laplace smoothing).
                dictPosteriors[columns[i]][currClass][currAttrb] += 1
                divisor = (len(uniqueAttributes) + countUniqueClass[currClass])
                dictPosteriors[columns[i]][currClass][currAttrb] /= divisor
                
    return dictPosteriors
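
The add-one (Laplace) estimate used above is (count of the value within the class + 1) / (number of instances in the class + number of distinct values of the attribute). A small worked sketch follows; the function name `laplace_estimates` and the toy values are hypothetical and only illustrate the arithmetic.

```python
from collections import Counter

def laplace_estimates(values_in_class, all_values):
    """Add-one smoothed P(value | class) for a single attribute and class."""
    counts = Counter(values_in_class)
    # Denominator: instances in the class plus one pseudo-count per distinct value.
    denom = len(values_in_class) + len(all_values)
    return {v: (counts[v] + 1) / denom for v in all_values}

# Three instances of one class; 'med' never occurs with this class.
probs = laplace_estimates(["low", "low", "high"], ["low", "med", "high"])
# low: (2 + 1) / 6, med: (0 + 1) / 6, high: (1 + 1) / 6
```

The unseen value `med` receives a small non-zero probability, which is what keeps `math.log` in the prediction step from failing on a zero.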

## Predict Supervised

In [8]:
# This function should predict the class for a set of instances, based on a trained model. 
def predict_supervised(df, priors_df, posteriors_df):
    
    predictionResults = []
    listdf = df.values
    columns = df.columns
    uniqueClass = priors_df.keys()
    
    # Iterate through the data and make a prediction using Naive Bayes. 
    for i in range(len(listdf)):
        calcPredictions = []
        
        for j in uniqueClass:
                
            # Start from the log prior; adding logs is equivalent to multiplying probabilities.
            currProb = math.log(priors_df[j])
            
            for k in range(len(listdf[i]) - 1):
                currProb += math.log(posteriors_df[columns[k]][j][listdf[i][k]])
                
            # Record all of the calculation.
            calcPredictions.append([currProb,j])
            
        # Find the one that has the maximum value and store the predicted class.
        predictionResults.append(max(calcPredictions)[1])
    
    return predictionResults
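
`predict_supervised` adds log probabilities instead of multiplying raw probabilities. The sketch below illustrates why that matters; the 80 per-attribute likelihoods of `1e-5` are made-up numbers chosen only to force floating-point underflow.

```python
import math

# 80 hypothetical per-attribute likelihoods, each quite small.
probs = [1e-5] * 80

# Multiplying directly underflows: 1e-400 is below the smallest positive float.
product = 1.0
for p in probs:
    product *= p

# Summing logs keeps the score finite and preserves the ranking of classes.
log_score = sum(math.log(p) for p in probs)
```

Because the logarithm is monotonic, the class with the largest log score is also the class with the largest product, so the predicted class is unchanged.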

## Evaluate Supervised

In [9]:
# This function should evaluate a set of predictions, in a supervised context. 
def evaluate_supervised(df, predict_df):
    
    totalCorrect = 0
    
    # Iterate through the data and find how many correct predictions.
    for i in range(len(df.values)):
        last_element = len(df.values[i]) - 1
        if(predict_df[i] == df.values[i][last_element]):
            
            totalCorrect += 1
            
    return totalCorrect/len(df.values)

# Unsupervised

## Train Unsupervised

In [10]:
def train_unsupervised(df):
    
    # Insert some random numbers according to the number of classes.
    listFractionClasses = generate_random_value(df)
    
    # Calculate the priors and the posteriors probabilities.
    prob_priors = prob_priors_unsupervised(df, listFractionClasses)
    prob_posteriors = prob_posteriors_unsupervised(df, listFractionClasses)
    
    return prob_priors, prob_posteriors
    

In [11]:
# This function generates random fractional values (soft labels) over the classes for each instance.
def generate_random_value(df):
    
    # Get the unique classes that the data has.
    uniqueClasses = df[df.columns[-1]].unique()
    
    listFractionValue = []
    
    # Generate random fractional value.
    for i in range(len(df)):
        randNum = np.random.dirichlet(np.ones(len(uniqueClasses)))
        dictFractionValue = defaultdict(float)
        
        # Maps the random value to the class and stores it in a dictionary.
        for j in range(len(uniqueClasses)):
            dictFractionValue[uniqueClasses[j]] = randNum[j] 
        listFractionValue.append(dictFractionValue)
        
    return listFractionValue
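
Drawing from a Dirichlet distribution with all parameters equal to 1 gives each instance a random non-negative weight per class, with the weights summing to 1. A minimal sketch of that property is below; it uses a seeded `numpy.random.default_rng` generator for reproducibility, which is an assumption of this example (the notebook calls `np.random.dirichlet` directly).

```python
import numpy as np

# Seeded generator so the draw is repeatable (an assumption of this sketch).
rng = np.random.default_rng(0)

# Soft labels for 5 instances over 3 classes: one row per instance.
soft_labels = rng.dirichlet(np.ones(3), size=5)

# Every row is a valid probability distribution over the classes.
row_sums = soft_labels.sum(axis=1)
```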
        

In [12]:
def prob_priors_unsupervised(df, listFractionClasses):
    
    columns = df.columns

    # Find the possible classes in the given dataset.
    uniqueClasses = df[df.columns[-1]].unique()
    
    # Dictionary to store the priors probability. 
    dictPriors = defaultdict(float)

    for i in range(len(uniqueClasses)):
        currPriors = [k[uniqueClasses[i]] for k in listFractionClasses]

        dictPriors[uniqueClasses[i]] = sum(currPriors)/len(df)
        
    return dictPriors

In [13]:
def prob_posteriors_unsupervised(df, listFractionClasses):
    
    columns = df.columns
    
    # Find the possible classes in the given dataset. 
    uniqueClasses = df[df.columns[-1]].unique()
    
    # Dictionary to store the posteriors probability.
    dictPosteriors = defaultdict(lambda: defaultdict(lambda: defaultdict(float)))
    
    # Convert the data into list
    listData = df.values
    
    # Iterate through the instances.
    for i in range(len(listData)):
        
        # Iterate through each of the attributes (the last column is the class label, so skip it)
        for j in range(len(listData[i]) - 1):
            
            # Iterate through the classes
            for k in range(len(uniqueClasses)):
                dictPosteriors[j][uniqueClasses[k]][listData[i][j]] += listFractionClasses[i][uniqueClasses[k]]
    
    # Divided by class counts to get the probabilities.    
    for i in dictPosteriors.keys():
        
        for j in dictPosteriors[i].keys():
            totalClassCounts = sum(dictPosteriors[i][j].values())
            
            for k in dictPosteriors[i][j].keys():
                initialValue = dictPosteriors[i][j][k]
                dictPosteriors[i][j][k] = initialValue/totalClassCounts
    
    return dictPosteriors
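
The update above replaces hard counts with fractional counts: each instance contributes its class weight to the count of the attribute value it holds. A small worked sketch for a single attribute follows; the function name `soft_likelihoods` and the toy weights are hypothetical.

```python
from collections import defaultdict

def soft_likelihoods(column, soft_labels):
    """Estimate P(value | class) for one attribute from fractional class weights."""
    totals = defaultdict(lambda: defaultdict(float))
    for value, weights in zip(column, soft_labels):
        for cls, w in weights.items():
            # Each instance adds its weight for `cls` to the count of its value.
            totals[cls][value] += w
    # Normalise within each class so the values sum to 1.
    return {cls: {v: c / sum(vc.values()) for v, c in vc.items()}
            for cls, vc in totals.items()}

column = ["red", "red", "blue"]
soft_labels = [{"A": 0.9, "B": 0.1},
               {"A": 0.5, "B": 0.5},
               {"A": 0.2, "B": 0.8}]
likelihoods = soft_likelihoods(column, soft_labels)
# Class A: red = (0.9 + 0.5) / 1.6 = 0.875, blue = 0.2 / 1.6 = 0.125
```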

## Predict Unsupervised

In [14]:
def predict_unsupervised(df, priors_df, posteriors_df):
    
    # Convert the data into a list. 
    listData = df.values
        
    columns = df.columns
    
    # The possible classes on the dataset
    uniqueClasses = df[columns[-1]].unique()
    
    # Specify a random number of iterations.
    noIterations = random.randint(10,15)
    
    for iteration in range(0, noIterations):
    
        # An array to store the results of the prediction 
        predictionResults = []
    
        # An array to store fraction value for the next iteration.
        newFractionValue = []

        for i in range(len(listData)):
            calcPredictions = []
            classPredictions = []
            for j in uniqueClasses:
                currProb = priors_df[j]
                
                for k in range(len(listData[i]) - 1):
                    currProb *= posteriors_df[k][j][listData[i][k]]
        
                # Record the results
                calcPredictions.append(currProb)
                classPredictions.append(j)
        
            # Normalise the scores so that they sum to 1 for this instance
            totalPredictions = sum(calcPredictions)
            calcPredictions = [value/totalPredictions for value in calcPredictions]
    
            # Store the value to newFractionValue to be used for the next iteration.
            tempDictFractionValue = {}
            for noClasses in range(len(classPredictions)):
                tempDictFractionValue[classPredictions[noClasses]] = calcPredictions[noClasses]
            newFractionValue.append(tempDictFractionValue)
    
            # Record the predicted class.
            predictionResults.append(classPredictions[calcPredictions.index(max(calcPredictions))])
    
        priors_df = prob_priors_unsupervised(df, newFractionValue)
        posteriors_df = prob_posteriors_unsupervised(df, newFractionValue)
    
    return predictionResults, noIterations
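
`predict_unsupervised` multiplies raw probabilities before normalising. With many attributes the products can underflow to zero, in which case the division by `totalPredictions` would fail. One common remedy, sketched here as an alternative and not as part of the submitted code, is to work with log scores and normalise with the log-sum-exp trick. The function name `normalise_log_scores` and the scores are hypothetical.

```python
import math

def normalise_log_scores(log_scores):
    """Turn per-class log scores into probabilities that sum to 1."""
    top = max(log_scores.values())
    # Subtracting the maximum keeps every exponent <= 0, so exp() cannot
    # overflow and the largest term is exactly 1.
    shifted = {cls: math.exp(s - top) for cls, s in log_scores.items()}
    total = sum(shifted.values())
    return {cls: v / total for cls, v in shifted.items()}

# exp(-1000) underflows to 0.0, yet the relative weights remain recoverable.
resp = normalise_log_scores({"A": -1000.0, "B": -1001.0})
```

The resulting dictionary has the same shape as the entries of `newFractionValue`, so it could feed the next iteration in the same way.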

## Evaluate Unsupervised

In [15]:
# This function should evaluate a set of predictions, in an unsupervised context. 
def evaluate_unsupervised(df, predict_df):
    
    totalCorrect = 0
    
    # Iterate through the data and find how many correct predictions.
    for i in range(len(df.values)):
        
        # Get the index of the actual class in the original dataset. 
        indActualClass = len(df.values[i]) - 1
        if(predict_df[i] == df.values[i][indActualClass]):
            totalCorrect += 1
            
    return totalCorrect/len(df.values)

# Datasets

## breast-cancer.csv

In [16]:
# Preprocess
df1, df1_missing_values = preprocess('2018S1-proj1_data/breast-cancer.csv')

# SUPERVISED
supervised_priors_df1, supervised_posteriors_df1 = train_supervised(df1)
predict_supervised_df1 = predict_supervised(df1, supervised_priors_df1, supervised_posteriors_df1)
evaluate_supervised_df1 = evaluate_supervised(df1, predict_supervised_df1)

print("This dataset has {} missing values.".format(df1_missing_values))
print("The accuracy for supervised Naive Bayes is " + str(evaluate_supervised_df1))

This dataset has 9 missing values.
The accuracy for supervised Naive Bayes is 0.7689530685920578


In [34]:
# UNSUPERVISED
unsupervised_priors_df1, unsupervised_posteriors_df1 = train_unsupervised(df1)
predict_unsupervised_df1, noIterations = predict_unsupervised(df1, unsupervised_priors_df1, unsupervised_posteriors_df1)
evaluate_unsupervised_df1 = evaluate_unsupervised(df1, predict_unsupervised_df1)

print("The total number of iterations is " + str(noIterations) + ". "
      "The accuracy for unsupervised Naive Bayes is " + str(evaluate_unsupervised_df1))


The total number of iterations is 13. The accuracy for unsupervised Naive Bayes is 0.7328519855595668


## car.csv

In [18]:
df2, df2_missing_values = preprocess('2018S1-proj1_data/car.csv')

# SUPERVISED
supervised_priors_df2, supervised_posteriors_df2 = train_supervised(df2)
predict_supervised_df2 = predict_supervised(df2, supervised_priors_df2, supervised_posteriors_df2)
evaluate_supervised_df2 = evaluate_supervised(df2, predict_supervised_df2)

print("This dataset has {} missing values.".format(df2_missing_values))
print("The accuracy for supervised Naive Bayes is " + str(evaluate_supervised_df2))


This dataset has 0 missing values.
The accuracy for supervised Naive Bayes is 0.8715277777777778


In [28]:
# UNSUPERVISED
unsupervised_priors_df2, unsupervised_posteriors_df2 = train_unsupervised(df2)
predict_unsupervised_df2, noIterations = predict_unsupervised(df2, unsupervised_priors_df2, unsupervised_posteriors_df2)
evaluate_unsupervised_df2 = evaluate_unsupervised(df2, predict_unsupervised_df2)

print("The total number of iterations is " + str(noIterations) + ". "
      "The accuracy for unsupervised Naive Bayes is " + str(evaluate_unsupervised_df2))

The total number of iterations is 12. The accuracy for unsupervised Naive Bayes is 0.42476851851851855


## hypothyroid.csv

In [20]:
df3, df3_missing_values = preprocess('2018S1-proj1_data/hypothyroid.csv')

# SUPERVISED
supervised_priors_df3, supervised_posteriors_df3 = train_supervised(df3)
predict_supervised_df3 = predict_supervised(df3, supervised_priors_df3, supervised_posteriors_df3)
evaluate_supervised_df3 = evaluate_supervised(df3, predict_supervised_df3)

print("This dataset has {} missing values.".format(df3_missing_values))
print("The accuracy for supervised Naive Bayes is " + str(evaluate_supervised_df3))

This dataset has 73 missing values.
The accuracy for supervised Naive Bayes is 0.9514563106796117


In [26]:
# UNSUPERVISED
unsupervised_priors_df3, unsupervised_posteriors_df3 = train_unsupervised(df3)
predict_unsupervised_df3, noIterations = predict_unsupervised(df3, unsupervised_priors_df3, unsupervised_posteriors_df3)
evaluate_unsupervised_df3 = evaluate_unsupervised(df3, predict_unsupervised_df3)

print("The total number of iterations is " + str(noIterations) + ". "
      "The accuracy for unsupervised Naive Bayes is " + str(evaluate_unsupervised_df3))

The total number of iterations is 15. The accuracy for unsupervised Naive Bayes is 0.8728155339805825


## mushroom.csv

In [22]:
df4, df4_missing_values = preprocess('2018S1-proj1_data/mushroom.csv')

# SUPERVISED
supervised_priors_df4, supervised_posteriors_df4 = train_supervised(df4)
predict_supervised_df4 = predict_supervised(df4, supervised_priors_df4, supervised_posteriors_df4)
evaluate_supervised_df4 = evaluate_supervised(df4, predict_supervised_df4)

print("This dataset has {} missing values.".format(df4_missing_values))
print("The accuracy for supervised Naive Bayes is " + str(evaluate_supervised_df4))

This dataset has 2480 missing values.
The accuracy for supervised Naive Bayes is 0.976966690290574


In [25]:
# UNSUPERVISED
unsupervised_priors_df4, unsupervised_posteriors_df4 = train_unsupervised(df4)
predict_unsupervised_df4, noIterations = predict_unsupervised(df4, unsupervised_priors_df4, unsupervised_posteriors_df4)
evaluate_unsupervised_df4 = evaluate_unsupervised(df4, predict_unsupervised_df4)

print("The total number of iterations is " + str(noIterations) + ". "
      "The accuracy for unsupervised Naive Bayes is " + str(evaluate_unsupervised_df4))

The total number of iterations is 14. The accuracy for unsupervised Naive Bayes is 0.8525868178596739


Questions (you may respond in a cell or cells below):

1. Since we’re starting off with random guesses, it might be surprising that the unsupervised NB works at all. Explain what characteristics of the data cause it to work pretty well (say, within 10% Accuracy of the supervised NB) most of the time; also, explain why it utterly fails sometimes.
2. When evaluating supervised NB across the four different datasets, you will observe some variation in effectiveness (e.g. Accuracy). Explain what causes this variation. Describe and explain any particularly surprising results.
3. Evaluating the model on the same data that we use to train the model is considered to be a major mistake in Machine Learning. Implement a hold-out (hint: check out numpy.random.shuffle()) or cross-validation evaluation strategy. How does your estimate of Accuracy change, compared to testing on the training data? Explain why. (The result might surprise you!)
4. Implement one of the advanced smoothing regimes (add-k, Good-Turing). Do you notice any variation in the predictions made by either the supervised or unsupervised NB classifiers? Explain why, or why not.
5. The lecture suggests that deterministically labelling the instances in the initialisation phase of the unsupervised NB classifier “doesn’t work very well”. Confirm this for yourself, and then demonstrate why.
6. Rather than evaluating the unsupervised NB classifier by assigning a class deterministically, instead calculate how far away the probabilistic estimate of the true class is from 1 (where we would be certain of the correct class), and take the average over the instances. Does this performance estimate change, as we alter the number of iterations in the method? Explain why.
7. Explore what causes the unsupervised NB classifier to converge: what proportion of instances change their prediction from the random assignment, to the first iteration? From the first to the second? What is the latest iteration where you observe a prediction change? Make some conjecture(s) as to what is occurring here.

Don't forget that groups of 1 student should respond to question (1), and one other question. Groups of 2 students should respond to question (1), and three other questions. Your responses should be about 100-200 words each.

# Question 1

# Question 2

Breast Cancer
The accuracy for supervised Naive Bayes is 0.7689530685920578

