## The University of Melbourne, School of Computing and Information Systems
# COMP90049 Introduction to Machine Learning, 2020 Semester 2
-----
## Project 1: Predicting stroke with Naive Bayes and K-NN
-----
###### Student Name(s): Wildan Anugrah
###### Python version: Python 3.6.9
###### Submission deadline: 

This iPython notebook is a template which you will use for your Project 1 submission. 

Marking will be applied on the functions that are defined in this notebook, and to your responses to the questions at the end of this notebook.

You may change the prototypes of these functions, and you may write other functions, according to your requirements. We would appreciate it if the required functions were prominent/easy to find. 

In [76]:
from sklearn.model_selection import train_test_split # Newer versions
import numpy as np
import pprint

In [104]:
# This function should transform data into a usable format 
# Helper encoders: map each nominal value to an integer code (999 marks an unrecognised value)
def convert_class_to_integer(x):
    return int(x)

def convert_feature_gender(x):
    return {"Female": 0, "Male": 1}.get(x, 999)

def convert_feature_ever_married(x):
    return {"No": 0, "Yes": 1}.get(x, 999)

def convert_feature_work_type(x):
    codes = {"Govt_job": 0, "Private": 1, "Self-employed": 2, "children": 3, "Never_worked": 4}
    return codes.get(x, 999)

def convert_feature_Residence_type(x):
    return {"Rural": 0, "Urban": 1}.get(x, 999)

def convert_feature_smoking_status(x):
    return {"formerly smoked": 0, "never smoked": 1, "smokes": 2}.get(x, 999)

def preprocess(filename):
    attributes = []
    labels = []
    with open(filename, mode='r') as fin:
        for line in fin:
            atts = line.strip().split(",")
            attributes.append(atts[:-1]) #all atts, excluding the class
            labels.append(atts[-1])
    
    # remove header
    attributes = attributes[1:]
    labels = labels[1:]
    
    attributes_ordinal = []
    for x in attributes:
        f1, f2, f3, f4, f5, f6, f7, f8, f9, f10 = x
        f1 = float(f1) # avg_glucose_level
        f2 = float(f2) #bmi
        f3 = int(f3) #age
        f4 = convert_feature_gender(f4) #gender
        f5 = int(f5) #hypertension
        f6 = int(f6) #heart_disease
        f7 = convert_feature_ever_married(f7) #ever_married
        f8 = convert_feature_work_type(f8) #work_type
        f9 = convert_feature_Residence_type(f9) #Residence_type
        f10 = convert_feature_smoking_status(f10) #smoking_status
        x = [f1, f2, f3, f4, f5, f6, f7, f8, f9, f10]
        attributes_ordinal.append(x)
        
    attributes = attributes_ordinal
    
    labels_ordinal = []
    for x in labels:
        x = convert_class_to_integer(x)
        labels_ordinal.append(x)
        
    labels = labels_ordinal
    
    # cast everything to int; note that this truncates the float features (avg_glucose_level, bmi)
    attributes = np.array(attributes, dtype='int')
    labels = np.array(labels, dtype='int')
    
    return attributes.tolist(), labels.tolist()

#preprocess("stroke_update.csv")

In [192]:
# This function should split a data set into a training set and hold-out test set
def split_data(attributes, labels):
        return train_test_split(attributes, labels, test_size=0.2)

#attributes, labels = preprocess("stroke_update.csv")
#split_data(attributes, labels)
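The stroke class is much rarer than the non-stroke class (436 of the 2192 training rows in the run shown later in this notebook), so a purely random split can shift the class ratio between the training and test sets. The following is a minimal sketch of a stratified hold-out split using only the standard library. The function name `stratified_split` and its `seed` parameter are hypothetical and not part of the template.

```python
import random
from collections import defaultdict

def stratified_split(attributes, labels, test_size=0.2, seed=0):
    """Hold-out split that keeps each class's share roughly equal in both sets."""
    rng = random.Random(seed)

    # Group instance indices by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Shuffle within each class and send the first test_size share to the test set
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    X_train = [attributes[i] for i in train_idx]
    X_test = [attributes[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# Toy data: 80 negative and 20 positive instances
toy_X = [[i] for i in range(100)]
toy_y = [0] * 80 + [1] * 20
X_tr, X_te, y_tr, y_te = stratified_split(toy_X, toy_y)
print(len(y_te), sum(y_te))
```

scikit-learn's `train_test_split` also accepts a `stratify` argument (for example `stratify=labels`), which may be the simpler choice when scikit-learn is already imported.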

In [198]:
# This function should build a supervised NB model
# Function for counting the frequency of each class; the prior p(y=i) = n(i)/N is obtained by dividing by N at prediction time
def p_y(y):
    class_priors = [0]*len(set(y))
    for c in y:
        class_priors[c]+=1    
    return class_priors

# Function for likelihood p(x=j|y=i) = n(i,j)/n(i)
def p_xy(x,y):
    # init dict (over classes) of dict (over features) of dict (over value counts)
    outdict = {c:{} for c in y}
    for d in outdict.keys():
        for f in range(len(x[0])):
            outdict[d][f]={}
            rng = set([i[f] for i in x])
            outdict[d][f] = {v:0 for v in rng}
    
      
    # fill dict with counts
    for idx,_ in enumerate(x):
        for fidx, _ in enumerate(x[idx]):
            outdict[y[idx]][fidx][x[idx][fidx]]+=1
    
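    # normalise counts into likelihoods; zero counts stay 0 (no smoothing is applied here)
    class_counts = p_y(y)  # compute the class counts once instead of once per value
    for cl in outdict.keys():
        for f in outdict[cl].keys():
            for val in outdict[cl][f]:
                if outdict[cl][f][val] > 0:
                    outdict[cl][f][val] = outdict[cl][f][val] / class_counts[cl]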

            
    return outdict

def train(x, y):
    # Build the NB model: class counts (for the priors) and per-class likelihood tables
    return p_y(y), p_xy(x, y)

#py, pxy = train(X_train, y_train)
#pp = pprint.PrettyPrinter(indent=4)
#pp.pprint(pxy)
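The two formulas in the comments above, p(y=i) = n(i)/N and p(x=j|y=i) = n(i,j)/n(i), can be checked by hand on a tiny data set. The sketch below uses made-up toy data (one binary feature, five instances) and the hypothetical names `toy_x`, `toy_y`, `priors` and `likelihoods`. It is an illustration of the counting, not the notebook's implementation.

```python
from collections import Counter

# Toy data: one nominal feature (0/1) and a binary class label
toy_x = [[0], [0], [1], [1], [1]]
toy_y = [0, 0, 0, 1, 1]

n_class = Counter(toy_y)                                        # n(i)
n_joint = Counter((y, row[0]) for row, y in zip(toy_x, toy_y))  # n(i, j)

# Prior: p(y=i) = n(i) / N
priors = {c: n_class[c] / len(toy_y) for c in n_class}

# Likelihood: p(x=j | y=i) = n(i, j) / n(i); a Counter returns 0 for unseen pairs
likelihoods = {(c, v): n_joint[(c, v)] / n_class[c]
               for c in n_class for v in (0, 1)}

print(priors)
print(likelihoods)
```

Here the value 0 never occurs with class 1, so its estimated likelihood is exactly zero. This is the situation that smoothing is meant to handle.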

In [204]:
# This function should predict the class for an instance or a set of instances, based on a trained model 
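def predict(x, pc, pxc):
    # x: one instance; pc: class counts from p_y; pxc: likelihood tables from p_xy
    class_probs = []
    for y in range(len(pc)):
        class_prob = pc[y] / sum(pc)  # prior p(y)
        for fidx, f in enumerate(x):
            # values never seen in training are skipped rather than penalised
            if f in pxc[y][fidx]:
                class_prob = class_prob * pxc[y][fidx][f]
        class_probs.append(class_prob)
    return class_probs, np.argmax(class_probs)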
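With ten features the product of probabilities stays well within floating-point range, but products of many small probabilities can underflow. A common alternative is to sum log-probabilities instead. The sketch below illustrates this under stated assumptions: `predict_log` and the fallback value `eps` are hypothetical, every class is assumed to have a non-zero count, and zero or unseen likelihoods are replaced by `eps` rather than skipped, so it does not behave identically to `predict` above.

```python
import math

def predict_log(x, class_counts, likelihoods, eps=1e-9):
    """Return (log scores, index of the class with the highest log score)."""
    total = sum(class_counts)
    scores = []
    for c, count in enumerate(class_counts):
        score = math.log(count / total)  # log prior; assumes count > 0
        for fidx, value in enumerate(x):
            p = likelihoods[c][fidx].get(value, 0)
            # log(0) is undefined, so fall back to a small constant (assumption)
            score += math.log(p if p > 0 else eps)
        scores.append(score)
    return scores, scores.index(max(scores))

# Toy model in the same nested layout as p_xy: class -> feature index -> value -> probability
toy_counts = [3, 2]
toy_likelihoods = {
    0: {0: {0: 2 / 3, 1: 1 / 3}},
    1: {0: {0: 0.0, 1: 1.0}},
}

_, label_for_1 = predict_log([1], toy_counts, toy_likelihoods)
_, label_for_0 = predict_log([0], toy_counts, toy_likelihoods)
print(label_for_1, label_for_0)
```

Because the logarithm is monotonic, the class ranking matches the ranking of the plain products whenever no zero or unseen likelihood is involved.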

In [199]:
# This function should evaluate a set of predictions in terms of metrics
from sklearn import metrics
def evaluate(pred,true):
    CM = metrics.confusion_matrix(true, pred) # Confusion Matrix
    Acc = metrics.accuracy_score(true, pred) # Accuracy
    precf1 = metrics.precision_recall_fscore_support(true, pred) # Precision, Recall and F1-score
    return CM, Acc, precf1

In [210]:
attr, labels = preprocess("stroke_update.csv")
X_train, X_test, y_train, y_test = split_data(attr, labels)
pxy = p_xy(X_train, y_train)
py = p_y(y_train)

print("\nevaluation using training data")

correct = 0
preds = []
for i in range(len(X_train)):
    prediction = predict(X_train[i], py, pxy)[1]
    correct = correct + int(prediction==y_train[i])
    preds.append(prediction)                 
CM, Acc, precf1 = evaluate(preds, y_train)

print("Confusion Matrix:\n{}\naccuracy: {}\naccuracy by sklearn.metric: {}\nprecision: {}\nrecall: {}\nF1: {}".format(CM,
                                                correct / len(X_train), 
                                                Acc,
                                                precf1[0],
                                                precf1[1],
                                                precf1[2]))

# predict on test
print("\nevaluation using test data")

correct = 0
preds = []
for i in range(len(X_test)):
    prediction = predict(X_test[i], py, pxy)[1]
    correct = correct + int(prediction==y_test[i])
    preds.append(prediction)            
CM, Acc, precf1 = evaluate(preds, y_test)

print("Confusion Matrix:\n{}\naccuracy: {}\naccuracy by sklearn.metric: {}\nprecision: {}\nrecall: {}\nF1: {}".format(CM, 
                                                correct / len(X_test), 
                                                Acc,
                                                precf1[0],
                                                precf1[1],
                                                precf1[2]))


evaluation using training data
Confusion Matrix:
[[1559  197]
 [ 180  256]]
accuracy: 0.8280109489051095
accuracy by sklearn.metric: 0.8280109489051095
precision: [0.89649224 0.56512141]
recall: [0.88781321 0.58715596]
F1: [0.89213162 0.57592801]

evaluation using test data
Confusion Matrix:
[[390  46]
 [ 75  37]]
accuracy: 0.7791970802919708
accuracy by sklearn.metric: 0.7791970802919708
precision: [0.83870968 0.44578313]
recall: [0.89449541 0.33035714]
F1: [0.86570477 0.37948718]
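The test-set figures above can be reproduced by hand from the confusion matrix, which also gives the accuracy of a Zero-R baseline that always predicts the majority class. The sketch below copies the test confusion matrix from the output. The variable names are illustrative, and class 0 is taken to be the majority class, as the row totals indicate.

```python
# Test-set confusion matrix from the output above (rows = true class, columns = predicted class)
cm = [[390, 46],
      [75, 37]]

tn, fp = cm[0]
fn, tp = cm[1]
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of the predicted strokes, how many were real
recall = tp / (tp + fn)      # of the real strokes, how many were found
f1 = 2 * precision * recall / (precision + recall)

# Zero-R always predicts the majority class (class 0 here)
zero_r_accuracy = max(tn + fp, fn + tp) / total

print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))
print(round(zero_r_accuracy, 4))
```

On this particular random split, Zero-R reaches an accuracy of about 0.796 on the test set, which is slightly higher than the 0.779 obtained by the Naive Bayes model, even though Zero-R never identifies a single stroke case.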


## Questions (you may respond in a cell or cells below):

You should respond to questions 1-4. In question 2 (b) you can choose between two options. A response to a question should take about 100--200 words, and make reference to the data wherever possible.

### Question 1: Data exploration

- a) Explore the data and summarise different aspects of the data. Can you see any interesting characteristic in features, classes or categories? What is the main issue with the data? Considering the issue, how would the Naive Bayes classifier work on this data? Discuss your answer based on the Naive Bayes' formulation.
- b) Is accuracy an appropriate metric to evaluate the models created for this data? Justify your answer. Explain which metric(s) would be more appropriate, and contrast their utility against accuracy. [no programming required]



### Question 2: Naive Bayes concepts and formulation

- a) Explain the independence assumption underlying Naive Bayes. What are the advantages and disadvantages of this assumption? Elaborate your answers using the features of the provided data. [no programming required]
- b) Implement the Naive Bayes classifier. You need to decide how you are going to apply Naive Bayes for nominal and numeric attributes. You can combine both Gaussian and Categorical Naive Bayes (option 1) or use only Categorical Naive Bayes (option 2). Explain your decision. For Categorical Naive Bayes, you can choose either epsilon or Laplace smoothing for this calculation. Evaluate the classifier using accuracy and appropriate metric(s) on test data. Explain your observations on how the classifiers have performed based on the metric(s). Discuss the performance of the classifiers in comparison with the Zero-R baseline.
- c) Explain the difference between epsilon and Laplace smoothing. [no programming required]
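As an illustration of the two smoothing options named in 2 (b) and 2 (c), the sketch below applies each to the value counts of a single feature within a single class. The function names, the toy counts and the default values `alpha=1` and `eps=1e-6` are assumptions made for this example only.

```python
def laplace_smoothed(counts, alpha=1):
    """Add alpha to every count, then renormalise. counts: dict value -> count."""
    total = sum(counts.values())
    k = len(counts)  # number of distinct values the feature can take
    return {v: (c + alpha) / (total + alpha * k) for v, c in counts.items()}

def epsilon_smoothed(counts, eps=1e-6):
    """Leave non-zero estimates untouched and replace zero estimates by eps."""
    total = sum(counts.values())
    return {v: (c / total if c > 0 else eps) for v, c in counts.items()}

# Toy counts for one feature within one class: "Rural" was never observed
toy_counts = {"Rural": 0, "Urban": 4}

laplace = laplace_smoothed(toy_counts)
epsilon = epsilon_smoothed(toy_counts)
print(laplace)
print(epsilon)
```

In this sketch Laplace smoothing changes every estimate and the result still sums to one, while epsilon smoothing only touches the zero entry, so the estimates no longer sum exactly to one.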

### Question 3: Model Comparison
- a) Implement the K-NN classifier, and find the optimal value for K. 
- b) Based on the obtained value for K in question 3 (a), evaluate the classifier using accuracy and chosen metric(s) on test data. Explain your observations on how the classifiers have performed based on the metric(s). Discuss the performance of the classifiers in comparison with the Zero-R baseline.
- c) Is K-NN sensitive to imbalanced data? Justify your answer. [no programming required]
- d) Compare the classifiers (Naive Bayes and K-NN) based on metrics' results. Provide a comparative discussion on the results. [no programming required]
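For orientation on 3 (a), the following is a minimal sketch of a K-NN classifier with Euclidean distance and majority voting, written with the standard library only. The function `knn_predict`, the toy data and the default `k=3` are hypothetical. Choices such as feature scaling, the handling of the nominal integer codes, and the search for the best K are deliberately left out.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(x, X_train, y_train, k=3):
    """Majority vote among the k training instances closest to x."""
    distances = sorted(
        (euclidean(x, row), label) for row, label in zip(X_train, y_train)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Toy data: three instances of class 0 near the origin, two of class 1 further away
toy_X = [[0, 0], [0, 1], [1, 0], [5, 5], [6, 5]]
toy_y = [0, 0, 0, 1, 1]

near_origin = knn_predict([0.2, 0.2], toy_X, toy_y, k=3)
far_away = knn_predict([5, 6], toy_X, toy_y, k=3)
print(near_origin, far_away)
```

One way to pick K would be to hold out part of the training data as a validation set and compare a metric such as F1 for several odd values of K. This is a suggestion, not a requirement of the project specification.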