###### The University of Melbourne, School of Computing and Information Systems
# COMP30027 Machine Learning, 2021 Semester 1

## Assignment 1: Pose classification with naive Bayes

###### Submission deadline: 7 pm, Tuesday 6 Apr 2021

**Student ID(s):**     `PLEASE ENTER YOUR ID(S) HERE`


This iPython notebook is a template which you will use for your Assignment 1 submission.

Marking will be applied to the four functions defined in this notebook, and to your responses to the questions at the end of this notebook (submitted in a separate PDF file).

**NOTE: YOU SHOULD ADD YOUR RESULTS, DIAGRAMS AND IMAGES FROM YOUR OBSERVATIONS IN THIS FILE TO YOUR REPORT (the PDF file).**

You may change the prototypes of these functions, and you may write other functions, according to your requirements. We would appreciate it if the required functions were prominent/easy to find.

**Adding proper comments to your code is MANDATORY.**

In [135]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [136]:
# This function should prepare the data by reading it from a file and converting it into a useful format for training and testing

def preprocess(train_path, test_path):
    """
    Notes:
    1) preprocessing train dataset using mean
    2) do not preprocess test dataset
    ----------
    Parameters
    ---------- 
    train_path  : path to train csv dataset
    test_path   : path to test csv dataset
    
    Returns
    -------
    df_train    : train df dataset
    np_test     : test np array dataset
    """
    
    headers = ['Pose', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10', 'x11', 'y1', 'y2', 'y3', 'y4', 'y5', 'y6', 'y7', 'y8', 'y9', 'y10', 'y11']
    df_train = pd.read_csv(train_path, names=headers)
    df_test = pd.read_csv(test_path, names=headers)

    # replace 9999 with NaN for further processing
    df_train = df_train.replace([9999], np.nan)
    df_train = df_train.dropna(subset = headers[1:], how = 'all')

    '''Dealing with NaN values'''
    # For train data
    # filling missing values by mean in each group
    f = lambda x: x.mean() if np.issubdtype(x.dtype, np.number) else x.mode().iloc[0]
    df_train = df_train.fillna(df_train.groupby('Pose').transform(f))

    # For test data
    # convert df to np array
    np_test = np.array(df_test)

    return df_train, np_test


train_path = 'data/train.csv'
test_path = 'data/test.csv'

df_train, np_test = preprocess(train_path, test_path)
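
The class-wise mean imputation used in `preprocess` can be illustrated on a toy table. This is only a minimal sketch: the two-feature frame and its values are made up for illustration, and it assumes, as above, that 9999 marks a missing coordinate.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): 9999 marks a missing keypoint coordinate.
toy = pd.DataFrame({
    "Pose": ["tree", "tree", "plank", "plank"],
    "x1":   [10.0, 9999, 4.0, 6.0],
    "y1":   [1.0, 3.0, 9999, 8.0],
})
features = ["x1", "y1"]

# Mark missing values as NaN in the feature columns only.
toy[features] = toy[features].replace(9999, np.nan)

# Per-class mean of each feature, broadcast back to the original row order.
class_means = toy.groupby("Pose")[features].transform("mean")

# Fill each NaN with the mean of that feature within the same pose.
toy_filled = toy.copy()
toy_filled[features] = toy[features].fillna(class_means)
print(toy_filled)
```

Here the missing `x1` of the second `tree` row is filled with the `tree` mean, and the missing `y1` of the first `plank` row with the `plank` mean, so values from other poses never leak into the estimate.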


In [137]:
# This function should calculate prior probabilities and likelihoods from the training data and using them to build a naive Bayes model

def train(df_train):
    """
    Notes:
    1) calculate statistics for train dataset
    ----------
    Parameters
    ---------- 
    df_train  : train DataFrame dataset
    
    Returns
    -------
    pose_dict : dict with prior probability for every pose, 
                with mean and std for every xi & yi
                    eg. {pose: [pb_x, [[x1 mu, ..., y11 mu], [x1 sigma, ..., y11 sigma]]], ...}
    """

    # get all concepts(poses) as a list
    poses = list(df_train['Pose'].value_counts().index)

    # calculate statistics in training dataset
    pose_dict = dict()
    for pose in poses:
        pb_c = df_train.groupby('Pose').size()[pose] / len(df_train) # get prior probabilities
        pose_dict[pose] = np.array([pb_c, df_train[df_train['Pose'] == pose].describe().iloc[[1,2]].values], dtype=object) # get mean & stdv for all x and y (from x1y1 to x11y11)

    return pose_dict
    

pose_dict = train(df_train)
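
The statistics gathered by `train` can also be seen on a toy table. The sketch below is not the implementation above; it uses a single made-up feature to show what the priors, means and standard deviations look like. Note that pandas computes the sample standard deviation (`ddof=1`), both here and in `describe()`, so a pose with a single training instance gets a NaN standard deviation.

```python
import pandas as pd

# Toy data (hypothetical values) with one feature and two poses.
toy = pd.DataFrame({
    "Pose": ["tree", "tree", "tree", "plank"],
    "x1":   [1.0, 2.0, 3.0, 5.0],
})

# Prior probability of each pose: class count divided by total count.
priors = toy["Pose"].value_counts(normalize=True)

# Per-pose mean and sample standard deviation (ddof=1) of the feature.
stats = toy.groupby("Pose")["x1"].agg(["mean", "std"])

print(priors)
print(stats)
```

A NaN or zero standard deviation would make the Gaussian likelihood undefined, so it may be worth checking for such cases before predicting.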

In [138]:
# This function should predict classes for new items in a test dataset (for the purposes of this assignment, you can re-use the training data as a test set)

def calc_phi(x, mu, sigma):
    """
    Notes:
    1) get ϕ𝜎(x − mu)
    ----------
    Parameters
    ---------- 
    x       : observed x for class i
    mu      : mean for class i
    sigma   : standard deviation for class i
    
    Returns
    -------
    phi     : probability of observing value x from a variable with μ and σ
    """

    # return the likelihood of feature x_i in class c
    phi = 1 / (sigma * np.sqrt(2 * np.pi)) * (np.e ** (- 1 / 2 * (((x - mu) / sigma) ** 2)))

    # smoothing (with epsilon 0.0000001)
    if phi == 0:
        phi = 0.0000001

    return phi


def predict(np_test, pose_dict):
    """
    Notes:
    1) classify each test instance as the pose with the highest log posterior
    ----------
    Parameters
    ---------- 
    np_test     : test dataset                      
                    eg. [[pose, x1, x2, ..., y10, y11], ...]
    pose_dict   : prior probability, mean and std for each pose   
                    eg. {pose: [pb_x, [[x1 mu, ..., y11 mu], [x1 sigma, ..., y11 sigma]]], ...}
    
    Returns
    -------
    results     : 2d list       eg. [[predicted, actual], ...]
    """

    results = list()
    for instance in np_test:
        # prob_dict to store all the probabilities (log form) for comparison later
        prob_dict = dict()
        features = instance[1:] # remove concept
        for pose in pose_dict:
            prob = 0
            pb_c = pose_dict[pose][0] # Prior probability of class c_j
            prob += np.log(pb_c)
            for i in range(len(features)):
                x = features[i] # data x of feature x_i
                mu = pose_dict[pose][1][0][i] # mean of feature x_i
                sigma = pose_dict[pose][1][1][i] # std of feature x_i
                pb_x_c = calc_phi(x, mu, sigma)
                prob += np.log(pb_x_c)
            prob_dict[pose] = prob
        predicted = max(prob_dict, key = prob_dict.get) # choose the highest value as predicted result
        actual = instance[0]
        results.append([predicted, actual])
        

    return results

results = predict(np_test, pose_dict)
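
`calc_phi` computes the density first and takes the log afterwards, so a value far from the mean underflows to 0 and has to be replaced by the epsilon. One possible alternative, sketched here under the assumption that only the log-likelihood is needed, is to evaluate the Gaussian log-density directly. The function name `gaussian_log_pdf` is hypothetical and not part of the template.

```python
import numpy as np

def gaussian_log_pdf(x, mu, sigma):
    """Log of the normal density N(x; mu, sigma), computed directly in log space."""
    z = (x - mu) / sigma
    return -0.5 * z ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)

# The density itself underflows to 0.0 this far from the mean ...
density_term = np.exp(-0.5 * 1000.0 ** 2)
# ... but the log-density stays finite, so no smoothing constant is required.
far = gaussian_log_pdf(1000.0, 0.0, 1.0)
print(density_term, far)
```

With this form, very unlikely values still contribute a large negative but distinguishable score rather than all collapsing to the same epsilon.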


In [139]:
# This function should evaluate the prediction performance by comparing your model’s class outputs to ground truth labels

def evaluate(results):
    """
    Notes:
    1) get accuracy for the results
    ----------
    Parameters
    ---------- 
    results     : 2d list       eg. [[predicted, actual], ...]
    
    Returns
    -------
    accuracy  : float, fraction of instances classified correctly
    """

    t = 0
    n = len(results)
    for predicted, actual in results:
        if predicted == actual:
            t += 1

    accuracy = t / n

    return accuracy
    

evaluate(results)

0.7327586206896551

In [15]:
def get_con_matrix(results, poses):
    """
    Notes:
    1) convert result list to confusion matrix (rows = actual, columns = predicted)
    ----------
    Parameters
    ---------- 
    results     : 2d list       eg. [[predicted, actual], ...]
    poses       : pose list     eg. [pose, ...]

    Returns
    -------
    con_matrix  : confusion matrix for the result
    """

    # map each pose to its row/column index in the matrix
    d = {pose: index for index, pose in enumerate(poses)}

    # show at which index is which pose
    print(list(d.keys()))

    # create confusion matrix
    length = len(poses)
    con_matrix = np.zeros(shape=[length, length], dtype='int64')
    for predicted, actual in results:
        i = d[actual]
        j = d[predicted]
        con_matrix[i][j] += 1

    return con_matrix
    

poses = list(pose_dict.keys())
get_con_matrix(results, poses)


['mountain', 'downwarddog', 'bridge', 'tree', 'childs', 'trianglepose', 'plank', 'warrior2', 'warrior1', 'seatedforwardbend']


array([[27,  0,  0,  3,  0,  0,  0,  0,  0,  0],
       [ 3, 15,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  8,  4,  0,  0,  0,  2,  0,  0,  0],
       [ 0,  1,  0,  5,  0,  0,  0,  0,  0,  0],
       [ 1,  7,  0,  0,  5,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  4,  0,  0,  0,  0],
       [ 0,  2,  1,  0,  0,  2,  4,  0,  0,  0],
       [ 0,  0,  0,  1,  0,  0,  0,  7,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  1,  0,  4,  0],
       [ 0,  7,  0,  0,  0,  0,  1,  0,  0,  1]])

## Questions 


If you are in a group of 1, you will respond to **two** questions of your choosing.

If you are in a group of 2, you will respond to **four** questions of your choosing.

A response to a question should take about 100–250 words, and make reference to the data wherever possible.

#### NOTE: you may develop code or functions to help respond to the question here, but your formal answer should be submitted separately as a PDF.

### Q1
Since this is a multiclass classification problem, there are multiple ways to compute precision, recall, and F-score for this classifier. Implement at least two of the methods from the "Model Evaluation" lecture and discuss any differences between them. (The implementation should be your own and should not just call a pre-existing function.)

In [141]:
def print_model_eval(results, poses):
    """
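    Notes:
    1) compute precision, recall and F-score with macro- and micro-averaging
    ----------
    Parameters
    ---------- 
    results     : 2d list       eg. [[predicted, actual], ...]
    poses       : poses used to index the confusion matrix (list, or dict keyed by pose)

    Returns
    -------
    None; prints precision, recall and F-score for macro- and micro-averaging
    """

    # map each pose to its row/column index in the matrix
    d = {pose: index for index, pose in enumerate(poses)}

    # calculate confusion matrix (rows = predicted, columns = actual)
    pose_num = len(poses)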
    m_eval = np.zeros((pose_num,pose_num),dtype = int)
    for predicted,actual in results:
        m_eval[d[predicted]][d[actual]]+=1

    TP = np.zeros((1,pose_num))
    FP = np.zeros((1,pose_num))
    FN = np.zeros((1,pose_num))
    precision_array = np.zeros((1,pose_num))
    recall_array = np.zeros((1,pose_num))
    for i in range(pose_num):
        tp = m_eval[i][i]                                               #tp value
        fp = sum(m_eval[i][j] for j in range(pose_num) if not j == i)   #fp value
        fn = sum(m_eval[j][i] for j in range(pose_num) if not j == i)   #fn value
        # Macro-averaging
        TP[0][i]=tp
        FP[0][i]=fp
        FN[0][i]=fn
        precision_array[0][i] = tp/(tp+fp) if not tp +fp ==0 else 0          
        recall_array[0][i] = tp/(tp+fn) if not tp +fn ==0 else 0

    precision_macro = precision_array.mean()
    recall_macro = recall_array.mean()

    # Micro-averaging
    precision_micro  = TP.sum()/(TP.sum()+FP.sum())
    recall_micro = TP.sum()/(TP.sum()+FN.sum())

    print("Precision with Macro-averaging: ",precision_macro)
    print("Recall with Macro-averaging: ",recall_macro)
    print("F-score with Macro-averaging:" , 2*precision_macro*recall_macro/(precision_macro+recall_macro))
    print("Precision with Micro-averaging: ", precision_micro)
    print("Recall with Micro-averaging: ", recall_micro)
    print("F-score with Micro-averaging:" , 2*precision_micro*recall_micro/(precision_micro+recall_micro))


print_model_eval(results, pose_dict)

Precision with Macro-averaging:  0.7198611111111111
Recall with Macro-averaging:  0.7356074481074482
F-score with Macro-averaging: 0.7276491018405014
Precision with Micro-averaging:  0.7327586206896551
Recall with Micro-averaging:  0.7327586206896551
F-score with Micro-averaging: 0.732758620689655
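
The micro-averaged precision and recall above are identical and equal to the accuracy. This is expected for single-label multiclass classification: every misclassified instance is one false positive (for the predicted class) and one false negative (for the actual class), so the pooled FP and FN totals match. The sketch below checks this on a made-up 3-class confusion matrix, with rows as actual and columns as predicted classes.

```python
import numpy as np

# Hypothetical 3-class confusion matrix: rows = actual, columns = predicted.
cm = np.array([[5, 1, 0],
               [2, 3, 1],
               [0, 0, 4]])

tp = np.diag(cm)
fp = cm.sum(axis=0) - tp   # predicted as class i but actually another class
fn = cm.sum(axis=1) - tp   # actually class i but predicted as another class

# Macro-averaging: compute the metric per class, then average the metrics.
precision_macro = np.mean(tp / (tp + fp))
recall_macro = np.mean(tp / (tp + fn))

# Micro-averaging: pool the counts over classes, then compute the metric once.
precision_micro = tp.sum() / (tp.sum() + fp.sum())
recall_micro = tp.sum() / (tp.sum() + fn.sum())
accuracy = tp.sum() / cm.sum()

print(precision_macro, recall_macro)
print(precision_micro, recall_micro, accuracy)
```

Macro-averaging weights every pose equally regardless of how many instances it has, whereas micro-averaging weights every instance equally, so the two diverge most when classes are imbalanced.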


### Q2
The Gaussian naive Bayes classifier assumes that numeric attributes come from a Gaussian distribution. Is this assumption always true for the numeric attributes in this dataset? Identify some cases where the Gaussian assumption is violated and describe any evidence (or lack thereof) that this has some effect on the classifier’s predictions.
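
One simple way to look for departures from normality, alongside plotting histograms, is to compute the sample skewness and excess kurtosis of each attribute within each pose; both are 0 for a Gaussian. The helper below is a hypothetical sketch, not part of the template, and assumes the input has at least two distinct non-missing values.

```python
import numpy as np

def skew_and_excess_kurtosis(values):
    """Sample skewness and excess kurtosis of a 1-D array, ignoring NaN entries."""
    v = np.asarray(values, dtype=float)
    v = v[~np.isnan(v)]
    z = (v - v.mean()) / v.std()          # standardise with the population std
    return np.mean(z ** 3), np.mean(z ** 4) - 3.0

# Symmetric toy sample: skewness is 0.
sym_skew, sym_kurt = skew_and_excess_kurtosis([-1.0, 0.0, 1.0])
# Toy sample with one large value: skewness is positive (long right tail).
right_skew, _ = skew_and_excess_kurtosis([0.0, 0.0, 0.0, 10.0])
print(sym_skew, sym_kurt, right_skew)
```

Large absolute values for either statistic suggest that a single Gaussian is a poor fit for that attribute in that pose, although with small per-class sample sizes these estimates are noisy.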

### Q3
Implement a kernel density estimate (KDE) naive Bayes classifier and compare its performance to the Gaussian naive Bayes classifier. Recall that KDE has kernel bandwidth as a free parameter -- you can choose an arbitrary value for this, but a value in the range 5-25 is recommended. Discuss any differences you observe between the Gaussian and KDE naive Bayes classifiers. (As with the Gaussian naive Bayes, this KDE naive Bayes implementation should be your own and should not just call a pre-existing function.)
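
For reference, a Gaussian-kernel density estimate of the class-conditional likelihood of attribute $x_i$ can be written as follows. The symbols $N_c$ (number of training instances of class $c$) and $x_{n,i}^{(c)}$ (value of attribute $i$ in the $n$-th training instance of class $c$) are notation introduced here for illustration; $\sigma$ is the kernel bandwidth.

$$
\hat{p}(x_i \mid c) = \frac{1}{N_c} \sum_{n=1}^{N_c} \phi_\sigma\left(x_i - x_{n,i}^{(c)}\right),
\qquad
\phi_\sigma(u) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{u^2}{2\sigma^2}\right)
$$

The classifier then predicts the class maximising $\log P(c) + \sum_i \log \hat{p}(x_i \mid c)$, exactly as in the Gaussian case but with $\hat{p}$ in place of the single fitted Gaussian.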

In [147]:
def predict_kde(np_test, df_train, SIGMA=5):
    """
    Notes:
    1) KDE naive Bayes: each class-conditional likelihood is the mean of
       Gaussian kernels (bandwidth SIGMA) centred on the training values
    ----------
    Parameters
    ---------- 
    np_test     : test dataset      eg. [[pose, x1, x2, ..., y10, y11], ...]
    df_train    : train df dataset
    SIGMA       : kernel bandwidth

    Returns
    -------
    results     : 2d list           eg. [[predicted, actual], ...]
    """

    # get all concepts(poses) as a list
    poses = list(df_train['Pose'].value_counts().index)

    # calculate prior prob for each pose
    pb_c_dict = dict()
    for pose in poses:
        pb_c_dict[pose] = df_train.groupby('Pose').size()[pose] / len(df_train)

    # training feature values per pose, extracted once instead of once per test instance
    train_dict = dict()
    for pose in poses:
        train_dict[pose] = df_train[df_train['Pose'] == pose].drop(['Pose'], axis=1).values

    results = list()
    for instance in np_test:
        # prob_dict to store all the probabilities (log form) for comparison later
        prob_dict = dict()
        features = instance[1:] # remove concept
        for pose in poses:
            prob_sum = 0
            pb_c = pb_c_dict[pose] # prior probability of class c_j
            prob_sum += np.log(pb_c)

            for i in range(len(features)):
                xn_array = train_dict[pose][:, i] # n training values of feature i for this pose
                x_test = features[i] # datapoint of feature i in test set
                phi = 0
                for xn in xn_array:
                    phi += calc_phi(x_test, xn, SIGMA) # one Gaussian kernel per training value
                phi = phi / len(xn_array)
                prob_sum += np.log(phi)

            prob_dict[pose] = prob_sum

        predicted = max(prob_dict, key = prob_dict.get) # choose the highest value as predicted result
        actual = instance[0]
        results.append([predicted, actual])

    return results

results_kde = predict_kde(np_test, df_train)

In [148]:
evaluate(results_kde)

0.7672413793103449

### Q4
Instead of using an arbitrary kernel bandwidth for the KDE naive Bayes classifier, use random hold-out or cross-validation to choose the kernel bandwidth. Discuss how this changes the model performance compared to using an arbitrary kernel bandwidth.
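
As a rough illustration of random hold-out selection, the sketch below scores a few candidate bandwidths on a held-out quarter of the training data and keeps the best one. It is self-contained and runs on synthetic two-class data; the function names, the 25% hold-out fraction and the candidate list are all assumptions made for this example, not part of the assignment template.

```python
import numpy as np

def kde_log_density(x, samples, bandwidth):
    """Log of the Gaussian-kernel density estimate at x (log-sum-exp for stability)."""
    z = (x - samples) / bandwidth
    log_k = -0.5 * z ** 2 - np.log(bandwidth * np.sqrt(2 * np.pi))
    m = log_k.max()
    return m + np.log(np.exp(log_k - m).mean())

def kde_nb_predict(X_train, y_train, X_test, bandwidth):
    """KDE naive Bayes: argmax over classes of log prior + sum of log densities."""
    classes = np.unique(y_train)
    preds = []
    for row in X_test:
        scores = []
        for c in classes:
            Xc = X_train[y_train == c]
            score = np.log(len(Xc) / len(X_train))      # log prior
            for i, x in enumerate(row):
                score += kde_log_density(x, Xc[:, i], bandwidth)
            scores.append(score)
        preds.append(classes[int(np.argmax(scores))])
    return np.array(preds)

def choose_bandwidth(X, y, candidates, holdout_frac=0.25, seed=0):
    """Return the candidate with the best accuracy on a random hold-out split."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_val = int(len(X) * holdout_frac)
    val, tr = idx[:n_val], idx[n_val:]
    accs = {h: float(np.mean(kde_nb_predict(X[tr], y[tr], X[val], h) == y[val]))
            for h in candidates}
    return max(accs, key=accs.get), accs

# Synthetic demo data (not the pose dataset): two well-separated classes.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 10, (40, 2)), rng.normal(60, 10, (40, 2))])
y = np.array(["a"] * 40 + ["b"] * 40)
candidates = [5, 10, 15, 20, 25]
best, accs = choose_bandwidth(X, y, candidates)
print(best, accs)
```

The bandwidth is chosen using training data only, so the test set remains untouched for the final comparison. Repeating the split with several seeds, or using k-fold cross-validation, would reduce the variance of the choice at the cost of more computation.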

### Q5
Naive Bayes ignores missing values, but in pose recognition tasks the missing values can be informative. Missing values indicate that some part of the body was obscured and sometimes this is relevant to the pose (e.g., holding one hand behind the back). Are missing values useful for this task? Implement a method that incorporates information about missing values and demonstrate whether it changes the classification results.
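
One possible way to use missingness, sketched below, is to treat "is this attribute missing?" as an extra binary feature per attribute and model it with a Bernoulli likelihood per class. The function name and the Laplace smoothing constant `alpha` are assumptions of this sketch; the resulting log-likelihood would be added to the class score alongside the log prior and the numeric log-likelihoods.

```python
import numpy as np

def missing_log_likelihood(is_missing, train_missing, alpha=1.0):
    """Log P(missingness pattern | class) under independent Bernoulli features.

    is_missing    : boolean vector for one test instance
    train_missing : boolean matrix (instances x features) for one class
    alpha         : Laplace smoothing constant (an assumption of this sketch)
    """
    n = train_missing.shape[0]
    # Smoothed estimate of P(feature is missing | class) for every feature.
    p_missing = (train_missing.sum(axis=0) + alpha) / (n + 2 * alpha)
    # Use P(missing) where the test value is missing, 1 - P(missing) otherwise.
    probs = np.where(is_missing, p_missing, 1.0 - p_missing)
    return np.log(probs).sum()

# Toy class with 3 training instances and 2 features (hypothetical values).
train_missing = np.array([[True, False],
                          [True, False],
                          [False, False]])
score = missing_log_likelihood(np.array([True, False]), train_missing)
print(score)
```

In this toy case the smoothed missingness probabilities are 0.6 and 0.2, so an instance with the first feature missing and the second present scores log(0.6) + log(0.8).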

### Q6
Engineer your own pose features from the provided keypoints. Instead of using the (x,y) positions of keypoints, you might consider the angles of the limbs or body, or the distances between pairs of keypoints. How does a naive Bayes classifier based on your engineered features compare to the classifier using (x,y) values? Please note that we are interested in explainable features for pose recognition, so simply putting the (x,y) values in a neural network or similar to get an arbitrary embedding will not receive full credit for this question. You should be able to explain the rationale behind your proposed features. Also, don't forget the conditional independence assumption of naive Bayes when proposing new features -- a large set of highly-correlated features may not work well.
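
As an illustration of the kind of feature suggested here, the sketch below computes the length and orientation of the segment joining two keypoints. The helper name and the keypoint indices are hypothetical, since which index corresponds to which body part depends on the dataset's keypoint layout.

```python
import numpy as np

def limb_features(x, y, i, j):
    """Length and orientation (radians) of the segment from keypoint i to keypoint j."""
    dx = x[j] - x[i]
    dy = y[j] - y[i]
    return np.hypot(dx, dy), np.arctan2(dy, dx)

# Toy keypoints (hypothetical values): keypoint 0 at (0, 0), keypoint 1 at (3, 4).
x = np.array([0.0, 3.0])
y = np.array([0.0, 4.0])
length, angle = limb_features(x, y, 0, 1)
print(length, angle)
```

If either keypoint is missing (NaN), both features come out as NaN, so missing values still need to be handled. Angles are also circular quantities, which is worth keeping in mind when fitting a Gaussian to them.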