# The University of Melbourne, School of Computing and Information Systems
# COMP30027 Machine Learning, 2019 Semester 1
-----
## Project 1: Gaining Information about Naive Bayes
-----
###### Student Name(s): galaxian explosion
###### Python version: Python 3
###### Submission deadline: 1pm, Fri 5 Apr 2019

This iPython notebook is a template which you may use for your Project 1 submission. (You are not required to use it; in particular, there is no need to use iPython if you do not like it.)

Marking will be applied on the five functions that are defined in this notebook, and to your responses to the questions at the end of this notebook.

You may change the prototypes of these functions, and you may write other functions, according to your requirements. We would appreciate it if the required functions were prominent/easy to find. 

In [3]:
import pandas as pd
import numpy as np
import csv
import math

In [5]:
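# This function should open a data file in csv, and transform it into a usable format 
def preprocess(file):
    # Read every row of the csv file into a list of lists of strings.
    # The with-block closes the file handle once reading is done.
    with open(file, newline='') as f:
        data=list(csv.reader(f))
    
    return data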

In [26]:
# This function should build a supervised NB model

#find prior probability for each class
def get_prior(data):
    
    pri_count={}
    
    
    #count number of the classes
    for instance in data:
        class_label=instance[-1]
        if class_label not in pri_count:
            pri_count[class_label]=1
        else:
            pri_count[class_label]+=1
    
    #divide the number by the sum to get the prior probability
    for key, value in pri_count.items():
        pri_count[key]=value/len(data)
        
    
    return pri_count



#find the smoothed conditional probability P(attribute value | class) for every attribute
#(called "posterior" throughout this notebook)
#parameter: 
#pri_count--- the prior probability for the dataset 
def get_posterior(data, pri_count):

       
    post_count={}
            
    attribute_types=[]
    
    
    #since laplace smoothing is used in this project, 
    #first we record all the possible values for each attribute, excluding missing values
    for attribute in range(len(data[0])-1):
        attr={}
        for instance in data:
            if instance[attribute]!="?":
                if instance[attribute] not in attr:
                    attr[instance[attribute]]=0
        attribute_types.append(attr)

    
    for attribute in range(len(data[0])-1):
        
        attribute_post={}
        diverse=len(attribute_types[attribute])
        
        #for each class and each attribute, record the posterior count
        for classes in pri_count.keys():
            attribute_post[classes]={}
            for key,value in attribute_types[attribute].items():
                attribute_post[classes][key]=value
                
        for instance in data:
            if instance[attribute] !="?":
                if instance[attribute] in attribute_post[instance[-1]]:
                    attribute_post[instance[-1]][instance[attribute]]+=1
           
        #use add-1 smoothing
        for classes in pri_count.keys():
            sums=sum(attribute_post[classes].values())
            for key, value in attribute_post[classes].items():
                attribute_post[classes][key]=(1+value)/(diverse+sums)
        
        post_count[attribute]=attribute_post
    
    return post_count
    
    
    
def train(data):
    
    prior=get_prior(data)
    
    posterior=get_posterior(data,prior)
    
    return prior, posterior

In [153]:
#make a posterior probability function that has no smoothing 
def get_post2(data, pri_count):

       
    post_count={}
            
    attribute_types=[]
    
    
    #first we record all the possible values for each attribute, excluding missing values,
    #so that every class has an entry for every value (no smoothing is applied here)
    for attribute in range(len(data[0])-1):
        attr={}
        for instance in data:
            if instance[attribute]!="?":
                if instance[attribute] not in attr:
                    attr[instance[attribute]]=0
        attribute_types.append(attr)

    
    for attribute in range(len(data[0])-1):
        
        attribute_post={}
        diverse=len(attribute_types[attribute])
        
        #for each class and each attribute, record the posterior count
        for classes in pri_count.keys():
            attribute_post[classes]={}
            for key,value in attribute_types[attribute].items():
                attribute_post[classes][key]=value
                
        for instance in data:
            if instance[attribute] !="?":
                if instance[attribute] in attribute_post[instance[-1]]:
                    attribute_post[instance[-1]][instance[attribute]]+=1
           
        #use no smoothing
        for classes in pri_count.keys():
            sums=sum(attribute_post[classes].values())
            if sums!=0:
                for key, value in attribute_post[classes].items():
                    attribute_post[classes][key]=value/sums
        
        post_count[attribute]=attribute_post
    
    return post_count

#train data without laplace smoothing
def train2(data):
    
    prior=get_prior(data)
    
    posterior=get_post2(data,prior)
    
    return prior, posterior

In [47]:
# This function should predict the class for an instance or a set of instances, based on a trained model 
def predict(data, prior, post):
    
    #list to store the predictions
    prediction=[]
    
    #loop through the data to calculate the prediction probability
    for instance in data:
        calculation_dict=get_standard(prior)
        for key, value in calculation_dict.items():
            for i in range(len(instance)-1):
                attribute=instance[i]
                if attribute in post[i][key]:
                    calculation_dict[key]*=post[i][key][attribute]
        #record the maximum and accept it as the prediction result
        prediction.append(max(calculation_dict, key=calculation_dict.get))
    
    return prediction

#function to produce a fresh copy of the prior probabilities for calculation
#a copy is used so that multiplying in the likelihoods does not overwrite the trained prior
def get_standard(prior):
    pri_dict={}
    for key, value in prior.items():
        pri_dict[key]=value
    
    return pri_dict


        

In [139]:
# This function should evaluate a set of predictions, in a supervised context 
def evaluate(data, prediction):
    
    print("The accracy is ",get_accuracy(data, prediction)/len(data))
    
    
    matrix=confusion(data, prediction)
      
    print('\n')
    print("Confusion Matrix:")
    print('\n')
    print_confusion(matrix)
    
    print('\n')
    recall(matrix)
    print('\n')
    precision(matrix)
    print('\n')
    return

def get_accuracy(data, prediction):
    
    count=0
    #count the num of accurate prediction
    for i in range(len(data)):
        if data[i][-1]==prediction[i]:
            count+=1
    
    return count

def confusion(data, prediction):
    
    matrix_data={}
    actual_labels=[]
    #record the actual class of every instance
    for i in range(len(data)):
        actual_labels.append(data[i][-1])
    #make a row (dict) for each actual class
    for instance in data:
        matrix_data[instance[-1]]={}
    #initialise every cell to 0: rows are actual classes, columns are predicted classes
    for key in matrix_data.keys():
        for instance in data:
            matrix_data[key][instance[-1]]=0
    
    #count each (actual, predicted) pair, the result is the confusion matrix
    for j in range(len(prediction)):
        matrix_data[actual_labels[j]][prediction[j]]+=1
        
    return matrix_data


#print the matrix
def print_confusion(matrix_data):
    
    print('|Actual\\Predict|',end='')
    for k, v in matrix_data.items():
        print('{0:^20s}'.format(k),end='')
    print()
    for k, v in matrix_data.items():
        print('{0:^20s}'.format(k),end='')
        for values in v.values():
            print('{0:^20d}'.format(values),end='')
        print()
    return

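#function to find recalls
def recall(matrix):
    
    recalls={}
    for k, v in matrix.items():
        tp=matrix[k][k]
        #row total = all instances whose actual class is k (tp + fn)
        actual_total=sum(v.values())
        if actual_total==0:
            recalls[k]=0
        else:
            recalls[k]=tp/actual_total
        
    for k,v in recalls.items():
        print("Class        ", k, "          's  recall is         ",v)
    
    #note: an unweighted mean of the per-class recalls is the macro-average,
    #although the label printed below calls it micro-average
    print("The micro-average recall is               ",sum(recalls.values())/len(matrix))
    return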

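#function to find precision
def precision(matrix):
    
    precisions={}
    for k, v in matrix.items():
        tp=matrix[k][k]
        #column total = all instances predicted as class k (tp + fp)
        predicted_total=0
        for key, value in matrix.items():
            predicted_total+=matrix[key][k]
        if predicted_total==0:
            precisions[k]=0
        else:
            precisions[k]=tp/predicted_total
        
    for k,v in precisions.items():
        print("Class        ", k, "          's  precision is         ",v)
    
    #note: an unweighted mean of the per-class precisions is the macro-average,
    #although the label printed below calls it micro-average
    print("The micro-average precision is               ",sum(precisions.values())/len(matrix))
    return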

In [120]:
# This function should calculate the Information Gain of an attribute or a set of attribute, with respect to the class
def info_gain(data, prediction):
    
    
    entropy_data=[]
    
    mean_info=[]
    
    prior,post=train(data)
    
    #first calculate the number and appearance of each attribute branching
    for attribute in range(len(data[0])-1):
        attr_data={}
        for instance in data:
            #skip missing values
            if instance[attribute]!="?":
                if instance[attribute] not in attr_data:
                    attr_data[instance[attribute]]={}
                if instance[-1] not in attr_data[instance[attribute]]:
                    attr_data[instance[attribute]][instance[-1]]=1
                else:
                    attr_data[instance[attribute]][instance[-1]]+=1
        
        entropy_data.append(attr_data)
    
    #calculate the weighted entropy of each attribute value; the sum over attr_H is the mean information of an attribute
    for i in range(len(entropy_data)):
        attr_H={}
        for key, value in entropy_data[i].items():
            entropy=0
            the_sum=sum(value.values())
            for v in value.values():
                entropy+=(v/the_sum)*math.log(v/the_sum,2)
            attr_H[key]=-entropy*(the_sum/len(data))
        mean_info.append(attr_H)
    
    H_R=0
    
    #get the entropy for parents
    for key,value in prior.items():
        H_R+=value*(math.log(value,2))
        
    H_R=H_R*(-1)
    
    
    #get the information gain for each attribute
    for i in range(len(mean_info)):
        mi=0
        for key,value in mean_info[i].items():
            mi+=value
        print("The information gain for attribute     ", i, "       is       ", H_R-mi)
    
    return

In [155]:
#model with no smoothing

#Main functionality
#Check performance for each dataset given
#Be careful: the output is long


datas=["anneal.csv","breast-cancer.csv","car.csv","cmc.csv","hepatitis.csv","hypothyroid.csv","mushroom.csv","nursery.csv","primary-tumor.csv"]
for i in range(len(datas)):
    data=preprocess(datas[i])
    prior,post=train2(data)
    pred=predict(data,prior,post)
    print("The name of the dataset is "+datas[i])
    evaluate(data, pred)
    info_gain(data,pred)
    print("distribution of the classes are")
    print(prior)
    print('\n')
    print("number of the attributes are      ",len(data[0])-1)
    print('\n')
    print("number of the instances are      ",len(data))
    print('\n')
    print('\n')
    print("**"*50)
    print('\n')

The name of the dataset is anneal.csv
The accracy is  0.9910913140311804


Confusion Matrix:


|Actual\Predict|         3                   U                   1                   5                   2          
         3                  677                  2                   0                   0                   5          
         U                   0                   40                  0                   0                   0          
         1                   0                   0                   8                   0                   0          
         5                   0                   0                   0                   67                  0          
         2                   1                   0                   0                   0                   98         


Class         3           's  recall is          0.9897660818713451
Class         U           's  recall is          1.0
Class         1           's  recall is          1.0
Class  

The name of the dataset is hypothyroid.csv
The accracy is  0.9522605121719886


Confusion Matrix:


|Actual\Predict|    hypothyroid           negative      
    hypothyroid              0                  151         
      negative               0                  3012        


Class         hypothyroid           's  recall is          0.0
Class         negative           's  recall is          1.0
The micro-average recall is                0.5


Class         hypothyroid           's  precision is          0
Class         negative           's  precision is          0.9522605121719886
The micro-average precision is                0.4761302560859943


The information gain for attribute      0        is        0.004628873031652547
The information gain for attribute      1        is        0.0009139351160850073
The information gain for attribute      2        is        0.0012382074503017315
The information gain for attribute      3        is        0.00014844815831743796
The informatio

         V                   0                   0                   0                   2                   2                   0                   0                   0                   0                   0                   0                   0                   0                   0                   0                   0                   0                   0                   0                   0                   20         


Class         A           's  recall is          0.7857142857142857
Class         B           's  recall is          1.0
Class         C           's  recall is          0.2222222222222222
Class         D           's  recall is          0.35714285714285715
Class         E           's  recall is          0.28205128205128205
Class         F           's  recall is          1.0
Class         G           's  recall is          0.07142857142857142
Class         H           's  recall is          0.3333333333333333
Class         J           's  recall is 

In [154]:
#model with laplace smoothing

#Main functionality
#Check performance for each dataset given
#Be careful: the output is long


datas=["anneal.csv","breast-cancer.csv","car.csv","cmc.csv","hepatitis.csv","hypothyroid.csv","mushroom.csv","nursery.csv","primary-tumor.csv"]
for i in range(len(datas)):
    data=preprocess(datas[i])
    prior,post=train(data)
    pred=predict(data,prior,post)
    print("The name of the dataset is "+datas[i])
    evaluate(data, pred)
    info_gain(data,pred)
    print("distribution of the classes are")
    print(prior)
    print('\n')
    print("number of the attributes are      ",len(data[0])-1)
    print('\n')
    print("number of the instances are      ",len(data))
    print('\n')
    print('\n')
    print("**"*50)
    print('\n')

The name of the dataset is anneal.csv
The accracy is  0.9220489977728286


Confusion Matrix:


|Actual\Predict|         3                   U                   1                   5                   2          
         3                  618                  53                  1                   0                   12         
         U                   2                   38                  0                   0                   0          
         1                   1                   0                   7                   0                   0          
         5                   0                   0                   0                   67                  0          
         2                   1                   0                   0                   0                   98         


Class         3           's  recall is          0.9035087719298246
Class         U           's  recall is          0.95
Class         1           's  recall is          0.875
Clas

The name of the dataset is hypothyroid.csv
The accracy is  0.9519443566234588


Confusion Matrix:


|Actual\Predict|    hypothyroid           negative      
    hypothyroid              0                  151         
      negative               1                  3011        


Class         hypothyroid           's  recall is          0.0
Class         negative           's  recall is          0.999667994687915
The micro-average recall is                0.4998339973439575


Class         hypothyroid           's  precision is          0.0
Class         negative           's  precision is          0.9522454142947502
The micro-average precision is                0.4761227071473751


The information gain for attribute      0        is        0.004628873031652547
The information gain for attribute      1        is        0.0009139351160850073
The information gain for attribute      2        is        0.0012382074503017315
The information gain for attribute      3        is        0.0001

The information gain for attribute      2        is        1.0262234265467285
The information gain for attribute      3        is        2.0947714990558133
The information gain for attribute      4        is        0.21246189904816637
The information gain for attribute      5        is        0.0203669388480483
The information gain for attribute      6        is        0.10088123982399111
The information gain for attribute      7        is        0.0678727757044233
The information gain for attribute      8        is        0.22052193470670511
The information gain for attribute      9        is        0.1997614363902529
The information gain for attribute      10        is        0.06714460241010656
The information gain for attribute      11        is        0.06025390884525317
The information gain for attribute      12        is        0.29153013602249356
The information gain for attribute      13        is        0.12715354518198252
The information gain for attribute      14        is 

Questions (you may respond in a cell or cells below):

1. The Naive Bayes classifiers can be seen to vary, in terms of their effectiveness on the given datasets (e.g. in terms of Accuracy). Consider the Information Gain of each attribute, relative to the class distribution — does this help to explain the classifiers’ behaviour? Identify any results that are particularly surprising, and explain why they occur.
2. The Information Gain can be seen as a kind of correlation coefficient between a pair of attributes: when the gain is low, the attribute values are uncorrelated; when the gain is high, the attribute values are correlated. In supervised ML, we typically calculate the Infomation Gain between a single attribute and the class, but it can be calculated for any pair of attributes. Using the pair-wise IG as a proxy for attribute interdependence, in which cases are our NB assumptions violated? Describe any evidence (or indeed, lack of evidence) that this is has some effect on the effectiveness of the NB classifier.
3. Since we have gone to all of the effort of calculating Infomation Gain, we might as well use that as a criterion for building a “Decision Stump” (1-R classifier). How does the effectiveness of this classifier compare to Naive Bayes? Identify one or more cases where the effectiveness is notably different, and explain why.
4. Evaluating the model on the same data that we use to train the model is considered to be a major mistake in Machine Learning. Implement a hold–out or cross–validation evaluation strategy. How does your estimate of effectiveness change, compared to testing on the training data? Explain why. (The result might surprise you!)
5. Implement one of the advanced smoothing regimes (add-k, Good-Turing). Does changing the smoothing regime (or indeed, not smoothing at all) affect the effectiveness of the Naive Bayes classifier? Explain why, or why not.
6. Naive Bayes is said to elegantly handle missing attribute values. For the datasets with missing values, is there any evidence that the performance is different on the instances with missing values, compared to the instances where all of the values are present? Does it matter which, or how many values are missing? Would a imputation strategy have any effect on this?

Don't forget that groups of 1 student should respond to question (1), and one other question of your choosing. Groups of 2 students should respond to question (1) and question (2), and two other questions of your choosing. Your responses should be about 150-250 words each.

# Question 1

1. The Naive Bayes classifiers can be seen to vary, in terms of their effectiveness on the given datasets (e.g. in terms of Accuracy). Consider the Information Gain of each attribute, relative to the class distribution — does this help to explain the classifiers’ behaviour? Identify any results that are particularly surprising, and explain why they occur.




In [119]:
#the accuracy for each dataset
#use the result of the model with laplace smoothing

datas=["anneal.csv","breast-cancer.csv","car.csv","cmc.csv","hepatitis.csv","hypothyroid.csv","mushroom.csv","nursery.csv","primary-tumor.csv"]

for i in range(len(datas)):
    data=preprocess(datas[i])
    a,b=train(data)
    pred=predict(data,a,b)
    print(datas[i])
    print("The Accuracy is                 ", get_accuracy(data, pred)/len(data))
    
    print('\n')


anneal.csv
The Accuracy is                  0.9220489977728286


breast-cancer.csv
The Accuracy is                  0.7482517482517482


car.csv
The Accuracy is                  0.8715277777777778


cmc.csv
The Accuracy is                  0.5057705363204344


hepatitis.csv
The Accuracy is                  0.832258064516129


hypothyroid.csv
The Accuracy is                  0.9519443566234588


mushroom.csv
The Accuracy is                  0.9588872476612507


nursery.csv
The Accuracy is                  0.9030092592592592


primary-tumor.csv
The Accuracy is                  0.5545722713864307




The following factors regarding the dataset are considered to help the classifier train well and perform better.

1. A reasonably large number of instances, attributes and classes.


In this way, the classifier sees more circumstances during training, which reduces bias.





2. Classes and attribute values distributed as evenly as possible


The counts of each class and each attribute value should be as close to each other as possible. Information gain prefers highly branching attributes. An even distribution also reduces bias and gives a more general picture of reality.
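
To make the point about class balance concrete, the sketch below computes the entropy of a class distribution from raw counts. The helper name `class_entropy` is hypothetical (it is not part of the notebook above), and the counts 151 and 3012 are the row totals of the hypothyroid.csv confusion matrix printed earlier. Since the information gain is computed as the class entropy minus a non-negative mean information, it cannot exceed the class entropy, so a heavily skewed class distribution caps how large the gain of any attribute can be.

```python
import math

def class_entropy(counts):
    """Entropy (in bits) of a class distribution given raw class counts."""
    total = sum(counts)
    entropy = 0.0
    for c in counts:
        if c > 0:  # a class with zero count contributes nothing
            p = c / total
            entropy -= p * math.log(p, 2)
    return entropy

balanced = class_entropy([50, 50])      # two equally likely classes
skewed = class_entropy([151, 3012])     # hypothyroid.csv class counts
print(balanced, skewed)
```

The balanced case gives one full bit, while the skewed case gives well under a third of a bit, which is the most any single attribute could gain on that dataset.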

Next, the datasets "hypothyroid.csv", "mushroom.csv" and "cmc.csv" are analysed as examples.


In [140]:
datas="hypothyroid.csv"
data=preprocess(datas)
prior,post=train(data)
pred=predict(data,prior,post)
print(datas)
print("The Accuracy is                 ", get_accuracy(data, pred)/len(data),'\n')
evaluate(data, pred)
info_gain(data,pred)
print("distribution of the classes are")
print(prior)
print('\n')
print("number of the attributes are      ",len(data[0])-1)
print('\n')
print("number of the instances are      ",len(data))
print('\n')


hypothyroid.csv
The Accuracy is                  0.9519443566234588 

The accracy is  0.9519443566234588


Confusion Matrix:


|Actual\Predict|    hypothyroid           negative      
    hypothyroid              0                  151         
      negative               1                  3011        


Class         hypothyroid           's  recall is          0.0
Class         negative           's  recall is          0.999667994687915
The micro-average recall is                0.4998339973439575


Class         hypothyroid           's  precision is          0.0
Class         negative           's  precision is          0.9522454142947502
The micro-average precision is                0.4761227071473751


The information gain for attribute      0        is        0.004628873031652547
The information gain for attribute      1        is        0.0009139351160850073
The information gain for attribute      2        is        0.0012382074503017315
The information gain for attribute    




"hypothyroid.csv" shows a surprising result. 
The ratio of the two classes is about 0.95:0.05. The large majority of instances are "negative", which has a huge impact on the prior and posterior probabilities and therefore on the prediction. The precision and recall for class "hypothyroid" are both 0, and the information gain is low for every attribute, which means the classifier did not learn enough from the instances with class "hypothyroid". High accuracy does not mean the classifier is good. In fact, unevenly distributed classes affect the classifier badly. In this case, the classifier cannot handle instances that actually belong to class "hypothyroid".
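
One way to check whether a high accuracy is meaningful is to compare it with a majority-class baseline that always predicts the most frequent class. The sketch below is illustrative only: `majority_baseline_accuracy` is a hypothetical helper, and the label list is rebuilt from the confusion-matrix row totals shown above (151 "hypothyroid", 3012 "negative").

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(labels)

# Labels rebuilt from the hypothyroid.csv confusion-matrix row totals.
labels = ["hypothyroid"] * 151 + ["negative"] * 3012
baseline = majority_baseline_accuracy(labels)
print(baseline)
```

With these counts the baseline is 3012/3163, about 0.952. The Naive Bayes accuracy reported above does not exceed it, so on this dataset the classifier does no better than always answering "negative".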





In [141]:
datas2="mushroom.csv"
data2=preprocess(datas2)
prior2,post2=train(data2)
pred2=predict(data2,prior2,post2)
print(datas2)
print("The Accuracy is                 ", get_accuracy(data2, pred2)/len(data2),'\n')
evaluate(data2, pred2)
info_gain(data2,pred2)
print("distribution of the classes are")
print(prior2)
print('\n')
print("number of the attributes are      ",len(data2[0])-1)
print('\n')
print("number of the instances are      ",len(data2))
print('\n')

mushroom.csv
The Accuracy is                  0.9588872476612507 

The accracy is  0.9588872476612507


Confusion Matrix:


|Actual\Predict|         p                   e          
         p                  3614                302         
         e                   32                 4176        


Class         p           's  recall is          0.9228804902962207
Class         e           's  recall is          0.9923954372623575
The micro-average recall is                0.9576379637792891


Class         p           's  precision is          0.9912232583653319
Class         e           's  precision is          0.9325591782045556
The micro-average precision is                0.9618912182849437


The information gain for attribute      0        is        0.04879670193537311
The information gain for attribute      1        is        0.028590232773772817
The information gain for attribute      2        is        0.03604928297620391
The information gain for attribute      3       



"mushroom.csv"
One of the best datasets and best results in the project, for the following reasons:
1. Evenly distributed classes and attributes. 
2. High information gain for particular attributes, and high precision and recall for both classes. 
3. A decent number of attributes and instances.




In [142]:

datas23="cmc.csv"
data23=preprocess(datas23)
prior23,post23=train(data23)
pred23=predict(data23,prior23,post23)
print(datas23)
print("The Accuracy is                 ", get_accuracy(data23, pred23)/len(data23),'\n')
evaluate(data23, pred23)
info_gain(data23,pred23)
print("distribution of the classes are")
print(prior23)
print('\n')
print("number of the attributes are      ",len(data23[0])-1)
print('\n')
print("number of the instances are      ",len(data23))
print('\n')

cmc.csv
The Accuracy is                  0.5057705363204344 

The accracy is  0.5057705363204344


Confusion Matrix:


|Actual\Predict|       No-use            Long-term           Short-term     
       No-use               374                 135                 120         
     Long-term               64                 190                  79         
     Short-term             172                 158                 181         


Class         No-use           's  recall is          0.5945945945945946
Class         Long-term           's  recall is          0.5705705705705706
Class         Short-term           's  recall is          0.3542074363992172
The micro-average recall is                0.5064575338547941


Class         No-use           's  precision is          0.6131147540983607
Class         Long-term           's  precision is          0.39337474120082816
Class         Short-term           's  precision is          0.4763157894736842
The micro-average precision is   

"cmc.csv"
The result is not so good, for the following reasons:
1. A relatively low number of attributes.
2. A small data size.
3. Low information gain means the attributes branch poorly, so they carry little useful information.



# Question 5
Implement one of the advanced smoothing regimes (add-k, Good-Turing). Does changing the smoothing regime (or indeed, not smoothing at all) affect the effectiveness of the Naive Bayes classifier? Explain why, or why not.

In this project I used Laplace add-k (k=1) smoothing to handle the data. The effect of changing the smoothing regime may not be obvious in this project, and smoothing may even result in lower accuracy, because we are treating the training data as the test data. 

The accuracy on each dataset is used as an example.

In [158]:
#accuracy of models without smoothing and with smoothing

datas=["anneal.csv","breast-cancer.csv","car.csv","cmc.csv","hepatitis.csv","hypothyroid.csv","mushroom.csv","nursery.csv","primary-tumor.csv"]

for i in range(len(datas)):
    data=preprocess(datas[i])
    a,b=train2(data)
    c,d=train(data)
    pred=predict(data,a,b)
    pred2=predict(data,c,d)
    print(datas[i])
    print("No Smoothing accuracy is                 ", get_accuracy(data, pred)/len(data))
    
    print('\n')
    print("Add-k Smoothing accuracy is                 ", get_accuracy(data, pred2)/len(data))
    
    print('\n')

anneal.csv
No Smoothing accuracy is                  0.9910913140311804


Add-k Smoothing accuracy is                  0.9220489977728286


breast-cancer.csv
No Smoothing accuracy is                  0.7587412587412588


Add-k Smoothing accuracy is                  0.7482517482517482


car.csv
No Smoothing accuracy is                  0.8738425925925926


Add-k Smoothing accuracy is                  0.8715277777777778


cmc.csv
No Smoothing accuracy is                  0.5057705363204344


Add-k Smoothing accuracy is                  0.5057705363204344


hepatitis.csv
No Smoothing accuracy is                  0.8387096774193549


Add-k Smoothing accuracy is                  0.832258064516129


hypothyroid.csv
No Smoothing accuracy is                  0.9522605121719886


Add-k Smoothing accuracy is                  0.9519443566234588


mushroom.csv
No Smoothing accuracy is                  0.9958148695224027


Add-k Smoothing accuracy is                  0.9588872476612507


nursery.cs

As the above results show, in this project the accuracy of the model without smoothing is at least as high as that of the model with smoothing for every dataset (the two are equal on cmc.csv).


Smoothing adjusts the zeros in our data to some small number. The reason is that we generally assume nothing is "impossible" or has "absolutely zero probability" unless there is a particular assumption. For example, a fish can fly: flying fish exist in reality. They may not appear in our training data, but they do exist. So rather than setting the probability that a fish can fly to 0, we assign it some reasonably small value.
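
The idea can be sketched as a small add-k estimator. This is an illustration under assumptions, not the notebook's implementation: `add_k_probability` is a hypothetical helper and the fish counts are made up for the example. With k=1 it has the same form as the add-1 step in `get_posterior`, that is (count + 1) / (total + number of distinct values).

```python
def add_k_probability(count, total, n_values, k=1.0):
    """Estimate P(value | class) with add-k smoothing.

    count    -- times this value was seen with the class
    total    -- times any value of the attribute was seen with the class
    n_values -- number of distinct values the attribute can take
    k        -- pseudo-count added to every value (k=0 means no smoothing)
    """
    return (count + k) / (total + k * n_values)

# Made-up counts: 100 fish observed, none of them flying,
# and the attribute "can fly" has 2 possible values (yes / no).
unsmoothed = add_k_probability(0, 100, 2, k=0)
smoothed = add_k_probability(0, 100, 2, k=1)
print(unsmoothed, smoothed)
```

Without smoothing the unseen event gets probability 0, which wipes out the whole product in Naive Bayes; with k=1 it gets 1/102, small but non-zero, and the estimates over both values still sum to 1.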

However, in this project, we use the training data as the test data. This changes our assumption to "we assume something is impossible", because we do not need to consider cases the classifier has not learned: it has already seen "everything". The purpose of smoothing is to help the classifier deal with rarely seen circumstances, to prevent potential bias and mistakes. Since we test on the training data in this project, smoothing does not prevent potential mistakes but instead adds mistakes to the classifier.


In practice, however, smoothing is believed to build a better classifier than no smoothing when dealing with unseen instances.
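
A hold-out split is one way to test this claim on unseen instances. The sketch below only shows the splitting step; `holdout_split`, its `test_fraction` and `seed` parameters, and the toy rows are assumptions made for illustration. The two parts could then be passed to the notebook's `train` and `predict` functions, training on the first part and predicting on the second.

```python
import random

def holdout_split(data, test_fraction=0.2, seed=0):
    """Shuffle a copy of the data and split it into (train, test) lists."""
    rows = list(data)                   # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)   # seeded for a repeatable split
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]

# Toy rows in the notebook's format: attribute values followed by the class.
toy = [["a", "x", "yes"], ["b", "x", "no"], ["a", "y", "yes"],
       ["b", "y", "no"], ["a", "x", "yes"]] * 2
train_rows, test_rows = holdout_split(toy, test_fraction=0.2, seed=1)
print(len(train_rows), len(test_rows))
```

Note that `predict` above skips attribute values that are absent from the trained table, so unseen values in the test part would be ignored rather than raise an error.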

