# [Machine Learning] Binary Classifier to Classify Digit 0


### Student ID : 20144367, Name : Lee, Donghyun


## 1. Problem

#### [Binary Classifier to Classify Digit 0]

Build a binary classifier that distinguishes digit 0 from all the other digits in the MNIST dataset.

Let $x = (x_1, x_2, ... , x_m)$ be a vector representing an image in the dataset.

The prediction function $f_w(x)$ is defined as the linear combination of the augmented data $(1, x)$ and the model parameter $w$:

$$f_w(x) = w_0 \cdot 1 + w_1 x_1 + w_2 x_2 + \cdots + w_m x_m, \quad w = (w_0, w_1, \ldots, w_m)$$
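
As a small illustration of the formula above, the prediction is the constant term plus a dot product between the remaining weights and the image vector. The function name `predict` and the demo values are hypothetical and are not part of the notebook's code.

```python
import numpy as np

def predict(w, x):
    """Evaluate f_w(x) = w_0 + w_1*x_1 + ... + w_m*x_m for one image vector x.

    Hypothetical helper for illustration; w has length m + 1, x has length m.
    """
    return w[0] + np.dot(w[1:], x)

# Tiny made-up example with m = 2
demo_w = np.array([0.5, 1.0, -2.0])
demo_x = np.array([3.0, 1.0])
demo_value = predict(demo_w, demo_x)  # 0.5 + 1.0*3.0 - 2.0*1.0 = 1.5
print(demo_value)
```

A value at or above zero would then be read as "digit 0", and a negative value as "not digit 0".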

The prediction function $f_w(x)$ should have the following values:

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$f_w(x) = +1$ if $label(x) = 0$ <br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$f_w(x) = -1$ if $label(x) \neq 0$

The optimal model parameter $w$ is obtained by minimizing the following objective function, where $y^{(i)}$ is the target value ($+1$ or $-1$) of the $i$-th image $x^{(i)}$:
$$\sum_i \left( f_w(x^{(i)}) - y^{(i)} \right)^2$$
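
The minimizer can be written in closed form. The derivation below is the standard least-squares argument, not something stated in the assignment text. It assumes the rows $(1, x^{(i)})$ are stacked into a matrix $A$ and the targets $y^{(i)}$ into a vector $b$.

```latex
\begin{aligned}
E(w) &= \sum_i \left( f_w(x^{(i)}) - y^{(i)} \right)^2 = \lVert A w - b \rVert^2 \\
\nabla_w E(w) &= 2 A^\top (A w - b) = 0
  \;\Longrightarrow\; A^\top A \, w = A^\top b \\
w^\ast &= A^{+} b
\end{aligned}
```

Here $A^{+}$ is the Moore-Penrose pseudo-inverse, which equals $(A^\top A)^{-1} A^\top$ when $A^\top A$ is invertible. When $A^\top A$ is singular (for example, when some pixel is zero in every image), $A^{+} b$ is the minimizer with the smallest norm.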

1. Compute an optimal model parameter using the training dataset.
2. Using the computed optimal model parameter, compute the following for (1) the training dataset and (2) the testing dataset: <br/>
    (1) True Positive, <br/>
    (2) False Positive, <br/>
    (3) True Negative, <br/>
    (4) False Negative


## 2. Code

#### Libraries

In [4]:
import numpy as np
import matplotlib.pyplot as plt
import copy

#### Global Variables

In [5]:
ROWS = 28
COLS = 28

#### Function readFile : Read the file at the given file path

In [6]:
def readFile(filePath) :
    # Each line holds one image: the label first, then ROWS * COLS pixel values
    with open(filePath, 'r') as inputFile :
        dataset = inputFile.readlines()
    
    datasetCnt = len(dataset)
    list_labels = np.zeros(datasetCnt, dtype = int)
    list_images = np.zeros((datasetCnt, ROWS * COLS), dtype = float)
    
    for idx, data in enumerate(dataset) :
        list_data = data.split(',')
        
        # The string fields are converted to int / float on assignment
        list_labels[idx] = list_data[0]
        list_images[idx] = list_data[1:]
    
    return list_labels, list_images, datasetCnt
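
The same parsing can probably be delegated to NumPy. The sketch below is a hypothetical alternative, not the notebook's method. It assumes the CSV has no header row and that every row has the label in the first column, and the demo file contents are made up.

```python
import os
import tempfile

import numpy as np

def read_file_loadtxt(file_path):
    """Hypothetical alternative to readFile: parse a label-first CSV with NumPy."""
    # ndmin=2 keeps a 2-D array even if the file has a single row
    data = np.loadtxt(file_path, delimiter=",", ndmin=2)
    labels = data[:, 0].astype(int)   # first column: digit label
    images = data[:, 1:]              # remaining columns: pixel values
    return labels, images, data.shape[0]

# Made-up demo file: 2 rows with 4 "pixels" each
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("0,0,255,128,0\n7,12,0,0,34\n")
    demo_path = tmp.name

demo_labels, demo_images, demo_count = read_file_loadtxt(demo_path)
os.remove(demo_path)
print(demo_labels, demo_images.shape, demo_count)
```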

#### Function initEquationSetup : Set up matrix A and vector b for the equation Ax = b

In [7]:
def initEquationSetup (list_labels, list_images, datasetCnt) :
    rowCnt = datasetCnt
    
    # Matrix A : one image per row (pixel values only, no constant column for w_0)
    mtrx_A = copy.deepcopy(list_images)
    
    # Vector b : +1 for digit 0, -1 for every other digit
    vector_b = np.zeros(rowCnt, dtype = float)
    for i in range(rowCnt) :
        if list_labels[i] == 0 : vector_b[i] = 1
        else : vector_b[i] = -1
    
    return mtrx_A, vector_b
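
The problem statement defines $f_w(x)$ on the augmented data $(1, x)$, while the function above builds A from the pixel values alone. A variant that includes the constant term might look like the sketch below. The name `init_equation_setup_with_bias` is hypothetical, and using it would change the fitted parameter, so the counts reported at the end of this notebook would no longer apply.

```python
import numpy as np

def init_equation_setup_with_bias(labels, images):
    """Hypothetical variant of initEquationSetup that includes the constant term."""
    images = np.asarray(images, dtype=float)
    # Prepend a column of ones so that the first weight acts as w_0
    ones = np.ones((images.shape[0], 1), dtype=float)
    A = np.hstack([ones, images])
    # +1 for digit 0, -1 for every other digit
    b = np.where(np.asarray(labels) == 0, 1.0, -1.0)
    return A, b

# Made-up demo: 3 "images" with 2 pixels each
demo_A, demo_b = init_equation_setup_with_bias(
    np.array([0, 3, 0]),
    np.array([[5.0, 6.0], [7.0, 8.0], [9.0, 1.0]]),
)
print(demo_A)
print(demo_b)
```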

#### Function leastSquare : Solve the least-squares problem for matrix A and vector b

In [8]:
def leastSquare(mtrx_A, vector_b) :
    
    A = mtrx_A
    b = vector_b
    
    # Least-squares solution x = A^+ b via the Moore-Penrose pseudo-inverse
    A_pInv = np.linalg.pinv(A)
    x = np.matmul(A_pInv, b)
    
    return x
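
An alternative that avoids forming the pseudo-inverse explicitly is `np.linalg.lstsq`, which also returns the minimum-norm least-squares solution. The sketch below is an assumption about how it could be swapped in, not what the notebook ran. The demo matrix is random and deliberately has an all-zero column to mimic a pixel that is never set.

```python
import numpy as np

def least_square_lstsq(A, b):
    """Hypothetical alternative to leastSquare based on np.linalg.lstsq."""
    x, residuals, rank, singular_values = np.linalg.lstsq(A, b, rcond=None)
    return x

rng = np.random.default_rng(0)
demo_A = rng.normal(size=(20, 3))
demo_A = np.hstack([demo_A, np.zeros((20, 1))])  # all-zero column: rank-deficient
demo_b = rng.normal(size=20)

x_pinv = np.matmul(np.linalg.pinv(demo_A), demo_b)  # same approach as leastSquare
x_lstsq = least_square_lstsq(demo_A, demo_b)
print(np.allclose(x_pinv, x_lstsq))
```

Both routes should agree up to floating-point error, and the weight of the all-zero column comes out as zero in the minimum-norm solution.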

#### Function calculateAccuracy : Count TP, FP, FN, and TN from the estimated vector and the answer vector

In [9]:
def calculateAccuracy(estimated_vector, answer_vector) :
    TP = FP = TN = FN = 0

    for i in range(len(estimated_vector)) :
        if estimated_vector[i] >= 0 : # Judged as zero
            if answer_vector[i] == 1 : # Actually zero (TP)
                TP += 1
            elif answer_vector[i] == -1 : # Actually non-zero (FP, wrong)
                FP += 1
        else : # Judged as non-zero
            if answer_vector[i] == 1 : # Actually zero (FN, wrong)
                FN += 1
            elif answer_vector[i] == -1 : # Actually non-zero (TN)
                TN += 1
 
    # Note the return order: TP, FP, FN, TN
    return TP, FP, FN, TN
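
The same counting can be expressed with boolean masks instead of an explicit loop. The sketch below uses a hypothetical name, `confusion_counts`, and keeps the same threshold (`>= 0`) and the same return order. One difference: it treats every answer other than `1` as negative, whereas the loop above ignores answers that are neither `1` nor `-1`.

```python
import numpy as np

def confusion_counts(estimated_vector, answer_vector):
    """Hypothetical vectorized variant; returns (TP, FP, FN, TN) in that order."""
    predicted_zero = np.asarray(estimated_vector) >= 0  # judged as zero
    actual_zero = np.asarray(answer_vector) == 1        # actually zero
    TP = int(np.sum(predicted_zero & actual_zero))
    FP = int(np.sum(predicted_zero & ~actual_zero))
    FN = int(np.sum(~predicted_zero & actual_zero))
    TN = int(np.sum(~predicted_zero & ~actual_zero))
    return TP, FP, FN, TN

# Made-up scores and answers
demo_counts = confusion_counts([0.8, 0.1, -0.3, -0.9, 0.0], [1, -1, 1, -1, 1])
print(demo_counts)  # (2, 1, 1, 1)
```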

#### Function zeroClassfier : Classify zero and non-zero

In [10]:
def zeroClassfier(trainingFilePath, testingFilePath) :
    # Load the training set and the testing set (testing names are prefixed with T)
    list_labels, list_images, datasetCnt = readFile(trainingFilePath)
    Tlist_labels, Tlist_images, TdatasetCnt = readFile(testingFilePath)
    
    # Build A and b for both datasets
    mtrx_A, vector_b = initEquationSetup(list_labels, list_images, datasetCnt)
    Tmtrx_A, Tvector_b = initEquationSetup(Tlist_labels, Tlist_images, TdatasetCnt)
    
    # Fit the model parameter on the training set only
    x = leastSquare(mtrx_A, vector_b)

    # Evaluate on the training set
    estimated_vector = np.matmul(mtrx_A, x)
    (TP,FP,FN,TN) = calculateAccuracy(estimated_vector, vector_b)

    print("[ Accuracy for Training Data set] \n")
    print("True Positive : %d / %d, (%.3f)" % (TP, datasetCnt, round(TP / datasetCnt, 3)))
    print("False Positive : %d / %d, (%.3f)" % (FP, datasetCnt, round(FP / datasetCnt, 3)))
    print("True Negative : %d / %d, (%.3f)" % (TN, datasetCnt, round(TN / datasetCnt, 3)))
    print("False Negative : %d / %d, (%.3f)" % (FN, datasetCnt, round(FN / datasetCnt, 3)))
    
    estimated_vector = np.matmul(Tmtrx_A, x)    
    (TP,FP,FN,TN) = calculateAccuracy(estimated_vector, Tvector_b)

    print("\n[ Accuracy for Test Data set ]\n")
    print("True Positive : %d / %d, (%.3f)" % (TP, TdatasetCnt, round(TP / TdatasetCnt, 3)))
    print("False Positive : %d / %d, (%.3f)" % (FP, TdatasetCnt, round(FP / TdatasetCnt, 3)))
    print("True Negative : %d / %d, (%.3f)" % (TN, TdatasetCnt, round(TN / TdatasetCnt, 3)))
    print("False Negative : %d / %d, (%.3f)" % (FN, TdatasetCnt, round(FN / TdatasetCnt, 3)))


#### Running the zero classifier

In [11]:
trainingFilePath = "mnist_train.csv"
testingFilePath = "mnist_test.csv"

zeroClassfier(trainingFilePath, testingFilePath)

[ Accuracy for Training Data set] 

True Positive : 5347 / 60000, (0.089)
False Positive : 318 / 60000, (0.005)
True Negative : 53759 / 60000, (0.896)
False Negative : 576 / 60000, (0.010)

[ Accuracy for Test Data set ]

True Positive : 917 / 10000, (0.092)
False Positive : 61 / 10000, (0.006)
True Negative : 8959 / 10000, (0.896)
False Negative : 63 / 10000, (0.006)
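
The four counts above can be condensed into summary metrics. The sketch below plugs in the numbers printed by the notebook. The function name `summarize` and the choice of metrics (accuracy, precision, recall) are additions for illustration and were not part of the assignment.

```python
def summarize(TP, FP, TN, FN):
    """Hypothetical helper: derive summary metrics from the four counts."""
    total = TP + FP + TN + FN
    return {
        "accuracy": (TP + TN) / total,   # fraction of all images classified correctly
        "precision": TP / (TP + FP),     # of images judged as zero, fraction that are zero
        "recall": TP / (TP + FN),        # of actual zeros, fraction that were found
    }

# Counts copied from the output above
train_summary = summarize(TP=5347, FP=318, TN=53759, FN=576)
test_summary = summarize(TP=917, FP=61, TN=8959, FN=63)

for name, summary in (("training", train_summary), ("testing", test_summary)):
    print(name, {key: round(value, 4) for key, value in summary.items()})
```

With these counts the accuracy is about 0.985 on the training set and about 0.988 on the testing set.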
