## The University of Melbourne, School of Computing and Information Systems
# COMP90049 Introduction to Machine Learning, 2020 Semester 2
-----
## Project 1: Predicting stroke with Naive Bayes and K-NN
-----
###### Student Name(s): Wildan Anugrah
###### Python version: Python 3.6.9
###### Submission deadline: 

This iPython notebook is a template which you will use for your Project 1 submission. 

Marking will be applied on the functions that are defined in this notebook, and to your responses to the questions at the end of this notebook.

You may change the prototypes of these functions, and you may write other functions, according to your requirements. We would appreciate it if the required functions were prominent/easy to find. 

In [2]:
from sklearn.model_selection import train_test_split # Newer versions
import numpy as np
import pprint

In [3]:
# This function should transform data into a usable format 
def convert_class_to_integer(x):
        return int(x)

def convert_feature_gender(x):
        if x == "Female": return 0
        elif x == "Male": return 1
        else: return 999

def convert_feature_ever_married(x):
        if x == "No": return 0
        elif x == "Yes": return 1
        else: return 999

def convert_feature_work_type(x):
        if x == "Govt_job": return 0
        elif x == "Private": return 1
        elif x == "Self-employed": return 2
        elif x == "children": return 3
        elif x == "Never_worked": return 4
        else: return 999

def convert_feature_Residence_type(x):
        if x == "Rural": return 0
        elif x == "Urban": return 1
        else: return 999

def convert_feature_smoking_status(x):
        if x == "formerly smoked": return 0
        elif x == "never smoked": return 1
        elif x == "smokes": return 2
        else: return 999
        
def categorized_avg_glucose_level(x):
    if x <= 70: return 0 #low
    elif x <= 200 : return 1 #normal
    else: return 2 #high

def categorized_bmi(x):
    if x < 18.5: return 0 #underweight
    elif x < 25: return 1 #healty weight range
    elif x < 30: return 2 # overweight
    else: return 0 #obese
    
def categorized_age(x):
    if x < 21: return 0 # young
    elif x < 40: return 1 #adult
    elif x < 60: return 2 #mid age
    else: return 4 #old

def preprocess(filename):
    attributes = []
    labels = []
    with open(filename, mode='r') as fin:
        for line in fin:
            atts = line.strip().split(",")
            attributes.append(atts[:-1]) #all atts, excluding the class
            labels.append(atts[-1])
    
    # remove header
    attributes = attributes[1:]
    labels = labels[1:]
    
    attributes_ordinal = []
    for x in attributes:
        f1, f2, f3, f4, f5, f6, f7, f8, f9, f10 = x
        f1 = categorized_avg_glucose_level(float(f1)) # avg_glucose_level
        f2 = categorized_bmi(float(f2)) #bmi
        f3 = categorized_age(int(f3)) #age
        f4 = convert_feature_gender(f4) #gender
        f5 = int(f5) #hypertension
        f6 = int(f6) #heart_disease
        f7 = convert_feature_ever_married(f7) #ever_married
        f8 = convert_feature_work_type(f8) #work_type
        f9 = convert_feature_Residence_type(f9) #Residence_type
        f10 = convert_feature_smoking_status(f10) #smoking_status
        x = [f1, f2, f3, f4, f5, f6, f7, f8, f9, f10]
        attributes_ordinal.append(x)
        
    attributes = attributes_ordinal
    
    labels_ordinal = []
    for x in labels:
        x = convert_class_to_integer(x)
        labels_ordinal.append(x)
        
    labels = labels_ordinal
    
    #make sure everything is Integer
    attributes = np.array(attributes, dtype='int')
    labels = np.array(labels, dtype='int')
    
    return attributes.tolist(), labels.tolist()

attr, label = preprocess("stroke_update.csv")
#print(len(label))

In [4]:
# This function should split a data set into a training set and hold-out test set
def split_data(attributes, labels):
        return train_test_split(attributes, labels, test_size=0.2)

#attributes, labels = preprocess("stroke_update.csv")
#split_data(attributes, labels)

In [5]:
# This function should build a supervised NB model
_smoothing = True
# Function for counting the frequency of classes to claculate prior probability p(y=i) = n(i)/N
def p_y(y):
    class_priors = [0]*len(set(y))
    for c in y:
        class_priors[c]+=1    
    return class_priors

# Function for likelihood p(x=j|y=i) = n(i,j)/n(i)
def p_xy(x,y):
    # init dict (over classes) of dict (over features) of dict (over value counts)
    outdict = {c:{} for c in y}
    for d in outdict.keys():
        for f in range(len(x[0])):
            outdict[d][f]={}
            rng = set([i[f] for i in x])
            outdict[d][f] = {v:0 for v in rng}
    
      
    # fill dict with counts
    for idx,_ in enumerate(x):
        for fidx, _ in enumerate(x[idx]):
            outdict[y[idx]][fidx][x[idx][fidx]]+=1
    
    # normalize, or fill in epsilons as needed
    for cl in outdict.keys():
        for f in outdict[cl].keys():
            for val in outdict[cl][f]:
                if outdict[cl][f][val] > 0:
                    outdict[cl][f][val] = outdict[cl][f][val] / p_y(y)[cl]
                elif outdict[cl][f][val] <= 0 and _smoothing == True:
                    outdict[cl][f][val] = (1 + outdict[cl][f][val]) / (p_y(y)[cl] + len(outdict[cl][f]))
                    
            
    return outdict

def train(x, y):
    return p_xy(x, y)

#pxy = train(X_train, y_train)
#pp = pprint.PrettyPrinter(indent=4)
#pp.pprint(pxy)
x, y = preprocess("stroke_update.csv")
p_y(y)
attr, label = preprocess("stroke_update.csv")
#print(float(p_y(y)[0] / len(label)))

#list(set(label))


In [6]:
#testing purpose
x, y = X_train, y_train
outdict = {c:{} for c in y}
for d in outdict.keys():
    for f in range(len(x[0])):
        outdict[d][f]={}
        rng = set([i[f] for i in x])
        outdict[d][f] = {v:0 for v in rng}
#outdict

NameError: name 'X_train' is not defined

In [7]:
# This function should predict the class for an instance or a set of instances, based on a trained model 
def predict(x, pc, pxc):
    class_probs = []
    for y in range(len(pc)):
        class_prob=pc[y]/sum(pc)
        for fidx, f in enumerate(x):
            if f in pxc[y][fidx]:
                class_prob = class_prob * pxc[y][fidx][f]
        class_probs.append(class_prob)
    return class_probs, np.argmax([class_probs])

attr, labels = preprocess("stroke_update.csv")
X_train, X_test, y_train, y_test = split_data(attr, labels)
pxy = p_xy(X_train, y_train)
py = p_y(y_train)

#for x in X_test:
#    print(predict(x, py, pxy))

In [8]:
# This function should evaluate a set of predictions in terms of metrics
# Function to evaluate a set of predictions in terms of metrics
from sklearn import metrics
def evaluate(pred, true):
    CM = metrics.confusion_matrix(true, pred) # Confusion Matrix
    Acc = metrics.accuracy_score(true, pred) # Accuracy
    precf1 = metrics.precision_recall_fscore_support(true, pred) # Precision, Recall and F1-score
    return CM, Acc, precf1

In [9]:
attr, labels = preprocess("stroke_update.csv")
X_train, X_test, y_train, y_test = split_data(attr, labels)
pxy = p_xy(X_train, y_train)
py = p_y(y_train)

def calculateZeroR(y):
    a = p_y(y)
    if((a[0] / len(y)) > (a[1] / len(y))): return 0
    else: return 1
    
def euclidean_distance(row1, row2):
    distance = 0
    for i in range(len(row1)):
        for j in range(len(row1[i])):
            distance += (row1[i][j] - row2[j])**2
    return 4

def get_neighbors(train, test_row, num_neighbors):
    distances = list()
    for train_row in train:
        dist = euclidean_distance(test_row, train_row)
        distances.append((train_row, dist))
    distances.sort(key=lambda tup: tup[1])
    neighbors = list()
    for i in range(num_neighbors):
        neighbors.append(distances[i][0])
    return neighbors

#print(get_neighbors(X_train, X_test, 1))
    
print("\nevaluation using training data")
zeroRTrain = calculateZeroR(y_train)
zeroRTest = calculateZeroR(y_test)

correct = 0
correctZeroR = 0
preds = []
for i in range(len(X_train)):
    prediction = predict(X_train[i], py, pxy)[1]
    correct = correct + int(prediction==y_train[i])
    correctZeroR = correctZeroR + int(zeroRTrain==y_train[i])
    preds.append(prediction)                 
CM, Acc, precf1 = evaluate(preds, y_train)

print("Confusion Matrix:\n{}\naccuracy: {}\naccuracy by sklearn.metric: {}\nprecision: {}\nrecall: {}\nF1: {}\nZero-R: {}".format(CM,
                                                correct / len(X_train), 
                                                Acc,
                                                precf1[0],
                                                precf1[1],
                                                precf1[2],
                                                correctZeroR / len(X_train)))

# predict on test
print("\nevaluation using test data")
correctZeroR = 0
correct = 0
preds = []
for i in range(len(X_test)):
    prediction = predict(X_test[i], py, pxy)[1]
    correct = correct + int(prediction==y_test[i])
    correctZeroR = correctZeroR + int(zeroRTest==y_test[i])
    preds.append(prediction)            
CM, Acc, precf1 = evaluate(preds, y_test)

print("Confusion Matrix:\n{}\naccuracy: {}\naccuracy by sklearn.metric: {}\nprecision: {}\nrecall: {}\nF1: {}\nZero-R: {}".format(CM, 
                                                correct / len(X_test), 
                                                Acc,
                                                precf1[0],
                                                precf1[1],
                                                precf1[2],
                                                correctZeroR / len(X_test)))


evaluation using training data
Confusion Matrix:
[[1525  228]
 [ 245  194]]
accuracy: 0.7842153284671532
accuracy by sklearn.metric: 0.7842153284671532
precision: [0.86158192 0.45971564]
recall: [0.86993725 0.44191344]
F1: [0.86573943 0.45063879]
Zero-R: 0.7997262773722628

evaluation using test data
Confusion Matrix:
[[383  56]
 [ 59  50]]
accuracy: 0.7901459854014599
accuracy by sklearn.metric: 0.7901459854014599
precision: [0.86651584 0.47169811]
recall: [0.87243736 0.4587156 ]
F1: [0.86946652 0.46511628]
Zero-R: 0.801094890510949


In [10]:
# K-NN implementation
from sklearn.neighbors import KNeighborsClassifier

In [11]:
classifier = KNeighborsClassifier(n_neighbors=1)
classifier.fit(X_train, y_train)
preds = classifier.predict(X_test)
#preds

In [12]:
CM, Acc, precf1 = evaluate(preds, y_test)
print("Confusion Matrix:\n{}\naccuracy: {}\nprecision: {}\nrecall: {}\nF1: {}".format(CM, 
                                                Acc,
                                                precf1[0],
                                                precf1[1],
                                                precf1[2]))

Confusion Matrix:
[[364  75]
 [ 72  37]]
accuracy: 0.7317518248175182
precision: [0.83486239 0.33035714]
recall: [0.82915718 0.33944954]
F1: [0.832      0.33484163]


In [20]:
def evaluate_knnImplementation(K):
    
    print("\n\nK = ", K)
    
    from sklearn.neighbors import KNeighborsClassifier

    classifier = KNeighborsClassifier(n_neighbors=K)
    classifier.fit(X_train, y_train)
    preds = classifier.predict(X_test)
    zeroRTest = calculateZeroR(y_test)
    correctZeroR = 0
    correct = 0
    for i in range(len(X_test)):
        correctZeroR = correctZeroR + int(zeroRTest==y_test[i])
        correct = correct + int(preds[i]==y_test[i])

    CM, Acc, precf1 = evaluate(preds, y_test)
    print("Confusion Matrix:\n{}\naccuracy: {}\naccuracy by sklearn.metric: {}\nprecision: {}\nrecall: {}\nF1: {}\nZero-R: {}".format(CM,
                                                    correct / len(X_test), 
                                                    Acc,
                                                    precf1[0],
                                                    precf1[1],
                                                    precf1[2],
                                                    correctZeroR / len(X_test)))
    
def calculateKNN(K):
    from sklearn.neighbors import KNeighborsClassifier

    classifier = KNeighborsClassifier(n_neighbors=K)
    classifier.fit(X_train, y_train)
    preds = classifier.predict(X_test)
    correct = 0
    for i in range(len(X_test)):
        correct = correct + int(preds[i]==y_test[i])
    
    return correct / len(X_test)
    
def findTheBestKNN():
    highestAcc = 0.0
    bestK = 0
    resultKNN = 0
    listOfKNN = []
    for i in range(169):
        resultKNN = calculateKNN((i + 1))
        listOfKNN.append(resultKNN)
        if(highestAcc < resultKNN):
            highestAcc = resultKNN
            bestK = i
    
    return highestAcc, (bestK + 1), listOfKNN


#evaluate_knnImplementation(K=1) 
#print(findTheBestKNN())

## Questions (you may respond in a cell or cells below):

You should respond to questions 1-4. In question 2 (b) you can choose between two options. A response to a question should take about 100--200 words, and make reference to the data wherever possible.

### Question 1: Data exploration

- a) Explore the data and summarise different aspects of the data. Can you see any interesting characteristic in features, classes or categories? What is the main issue with the data? Considering the issue, how would the Naive Bayes classifier work on this data? Discuss your answer based on the Naive Bayes' formulation.
- b) Is accuracy an appropriate metric to evaluate the models created for this data? Justify your answer. Explain which metric(s) would be more appropriate, and contrast their utility against accuracy. [no programming required]



- a) It can be seen that the number of instance with interesting class (the number of patient got stroke) is fewer than the the number of instance with uninteresting class (the number of patient had not stroke). The number of data with uninteresting class is 2192 and the number of data with interesting class is 548 (it is almost 4:1). Considering this issue, it would affect the accuracy of prediction with Naive Bayes classifier, because, we want to predict the probability of patient got stroke based on the data we get. However, comparison of the number of data for the patient who got stroke is so limited rather than the data for patient did not get stroke. 

- b) Considering the issue that we found in this data, it could a problem for the accuracy of this model. In this case, we would like to predict of people getting stroke. However, the number of people had no stroke is higher than the number of people got stroke. the comparison is almost 4:1. 

### Question 2: Naive Bayes concepts and formulation

- a) Explain the independence assumption underlying Naive Bayes. What are the advantages and disadvantages of this assumption? Elaborate your answers using the features of the provided data. [no programming required]
- b) Implement the Naive Bayes classifier. You need to decide how you are going to apply Naive Bayes for nominal and numeric attributes. You can combine both Gaussian and Categorical Naive Bayes (option 1) or just using Categorical Naive Bayes (option 2). Explain your decision. For Categorical Naive Bayes, you can choose either epsilon or Laplace smoothing for this calculation. Evaluate the classifier using accuracy and appropriate metric(s) on test data. Explain your observations on how the classifiers have performed based on the metric(s). Discuss the performance of the classifiers in comparison with the Zero-R baseline.
- c) Explain the difference between epsilon and Laplace smoothing. [no programming required]

- a) Firstly, there some advantages. First of all, the provided data can be calculated by Naive Bayes as long as the data in the useful format. Second, with this number of data (2740) it helps the model to calculate precisely. However, the data is imbalanced. We know that we would like to predict the data with interesting class. there is only 548 instances labeled 1 (A patient with strokes). This issue would affect the accuracy of the result from the model. 

- b) I chose option 2 (Categorical Naive Bayes) because, it can be seen that mostly the provided data can be categorized. And also, I believe, it is easier to maintain programatically rather than using Gaussian Naive Bayes. It can be done by combined both Gaussian and Categorical. However, I prefer to chose categorical, I convert some data such as age or bmi feature to categorical value. I found some unseen data, so then I decided to do smoothing. I chose Lapalce smoothing, because it has a large data, more than 2000 data. So then, it can be more appropriate rather than using Epsilon smoothing. As we can see, that the result of Zero-R baseline is quite similar with Naive Bayes models calculation in this case. 

- c) It is known that epsilon smoothing is the simplest approach. However, considering the provided data, there is a problem to decide the number of epsilon. Because, there is a probabilty of tie issue. On the other hand, laplace smoothing is more approprite than using epsilon smoothing because of some reason. First, the characteristic of provided data which is large number. So then, we can use laplace smoothing more effective due reducing variance of NB classifier.

### Question 3: Model Comparison
- a) Implement the K-NN classifier, and find the optimal value for K. 
- b) Based on the obtained value for K in question 4 (a), evaluate the classifier using accuracy and chosen metric(s) on test data. Explain your observations on how the classifiers have performed based on the metric(s). Discuss the performance of the classifiers in comparison with the Zero-R baseline.
- c) Is K-NN sensitive to imbalanced data? Justify your answer. [no programming required]
- d) Compare the classifiers (Naive Bayes and K-NN) based on metrics' results. Provide a comparatory discussion on the results. [no programming required]

- a) implement the K-NN classifier

In [19]:
# K-NN implementation
highestAcc, bestK, listOfKNN = findTheBestKNN()
#print("KNN HighestAcc: ", highestAcc, "\nBest-K: ", bestK)
evaluate_knnImplementation(K=bestK)
pp = pprint.PrettyPrinter(indent=4)
#pp.pprint(listOfKNN)

print("max: ", max(listOfKNN), " min: ", min(listOfKNN))



K =  6
Confusion Matrix:
[[423  16]
 [ 89  20]]
accuracy: 0.8083941605839416
accuracy by sklearn.metric: 0.8083941605839416
precision: [0.82617188 0.55555556]
recall: [0.96355353 0.18348624]
F1: [0.88958991 0.27586207]
Zero-R: 0.801094890510949
max:  0.8083941605839416  min:  0.7317518248175182


- b) As we implemented K-NN, it can be found that the optimal value for K is 10 where the accuracy is 0.8266423357664233. It is slightly similar with the number accuracy with Zero-R model which is 0.815693430656934. 

- c) As we implemented K-NN, it can be seen that KNN is not sensitive to imbalanced data. Considering the provided data using evaluate_knnImplementation, it is shown that the range of data between 0.7317518248175182 and 0.8083941605839416. 

- d) Comparing the accuracy betweem Naive Bayes and K-NN is