# [Machine Learning] Digit Classifier_(1)


### Student ID : 20144367, Name : Lee, Donghyun


## 1. Problem

#### [Binary Classifier]

Build a binary classifier for each digit against all the other digits in the MNIST dataset.

Let $ x = (x_1, x_2, ... , x_m) $ be a vector representing an image in the dataset.

The prediction function $f_d(x; w)$ is defined for each digit $d$ as the linear combination of the data $(1, x)$ and the model parameter $w$:
$$f_d(x; w) = w_0 \cdot 1 + w_1 x_1 + w_2 x_2 + \dots + w_m x_m, \quad \text{where } w = (w_0, w_1, \dots, w_m)$$

The prediction function $f_d(x; w)$ should take the following values:
$$f_d(x; w) = +1 \quad \text{if } \mathrm{label}(x) = d$$
$$f_d(x; w) = -1 \quad \text{if } \mathrm{label}(x) \neq d$$

The optimal model parameter $w$ is obtained by minimizing the following objective function for each digit $d$, where $y^{(i)}$ is $+1$ if the label of $x^{(i)}$ is $d$ and $-1$ otherwise:
$$\sum_i ( f_d(x^{(i)}; w) - y^{(i)} )^2$$

and the label of input x is given by:

$$\arg\max_d f_d(x; w)$$
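
In matrix form, the minimizer has a closed-form expression. As a sketch, let $A$ (a symbol chosen here to match the `Ax = b` naming used in the code) be the matrix whose $i$-th row is $(1, x^{(i)})$, and let $b$ be the vector of targets $y^{(i)}$ for digit $d$. The objective and its minimizer can then be written as:

$$\sum_i ( f_d(x^{(i)}; w) - y^{(i)} )^2 = \| A w - b \|^2$$
$$A^T A w = A^T b \quad \Rightarrow \quad w = A^{+} b$$

Here $A^{+}$ denotes the Moore-Penrose pseudo-inverse. When $A^T A$ is not invertible, $A^{+} b$ is the minimizer with the smallest norm.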

1. Compute an optimal model parameter using the training dataset for each classifier $f_d(x; w)$.
2. Compute the (1) true positive rate and (2) error rate on (a) the training dataset and (b) the testing dataset.
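
The whole procedure can be illustrated on a toy problem before applying it to MNIST. The following is a minimal sketch, not the notebook's implementation: the helper names `fit_one_vs_all` and `predict`, and the three synthetic 2-D clusters, are assumptions made for illustration. It includes the constant feature $1$ from the problem statement as a column of ones.

```python
import numpy as np

def fit_one_vs_all(images, labels, num_classes):
    # Hypothetical helper: one least-squares classifier per class.
    # Prepend a column of ones so that w_0 acts as the bias term.
    A = np.hstack([np.ones((images.shape[0], 1)), images])
    # Column d of the target matrix is +1 where the label is d, -1 elsewhere.
    targets = np.where(labels[:, None] == np.arange(num_classes)[None, :], 1.0, -1.0)
    # One pseudo-inverse serves all classes; column d of W is the parameter of f_d.
    return np.linalg.pinv(A) @ targets

def predict(W, images):
    # Hypothetical helper: the label is the class with the largest score f_d(x; w).
    A = np.hstack([np.ones((images.shape[0], 1)), images])
    return np.argmax(A @ W, axis=1)

# Toy data (an assumption for illustration): three well-separated 2-D clusters.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
toy_labels = np.repeat(np.arange(3), 20)
toy_images = centers[toy_labels] + 0.1 * rng.standard_normal((60, 2))

W = fit_one_vs_all(toy_images, toy_labels, 3)
toy_pred = predict(W, toy_images)
print("toy error rate:", np.mean(toy_pred != toy_labels))
```

On MNIST the same two calls would take the image matrix and label vector in place of the toy arrays, with `num_classes` set to 10.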

## 2. Codes

#### Libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import copy

#### Global Variables

In [2]:
ROWS = 28
COLS = 28

#### Function readFile : Read the dataset file at the given file path

In [3]:
def readFile(filePath) :
    # Each line holds one sample: the label followed by ROWS * COLS pixel values
    with open(filePath, 'r') as inputFile :
        dataset = inputFile.readlines()

    datasetCnt = len(dataset)
    list_labels = np.zeros(datasetCnt, dtype = int)
    list_images = np.zeros((datasetCnt, ROWS * COLS), dtype = float)

    for idx, data in enumerate(dataset) :
        list_data = data.strip().split(',')
        list_labels[idx] = int(list_data[0]) # First field is the digit label
        list_images[idx] = np.array(list_data[1:], dtype = float) # Remaining fields are pixels

    return list_labels, list_images, datasetCnt
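
As an alternative to parsing each line by hand, NumPy can load a purely numeric CSV directly. The sketch below assumes the file has no header row and that every line has the same number of fields; `load_mnist_csv` is a hypothetical name, and the tiny temporary file only stands in for `mnist_train.csv`.

```python
import os
import tempfile
import numpy as np

def load_mnist_csv(file_path):
    # Hypothetical helper: read the whole CSV into one 2-D float array.
    data = np.loadtxt(file_path, delimiter=',', ndmin=2)
    labels = data[:, 0].astype(int)  # first column: digit label
    images = data[:, 1:]             # remaining columns: pixel values
    return labels, images, data.shape[0]

# Demo on a two-line file with 4 "pixels" per sample (illustrative data only).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.csv")
    with open(demo_path, "w") as demo_file:
        demo_file.write("7,0,255,0,0\n3,0,0,128,0\n")
    demo_labels, demo_images, demo_count = load_mnist_csv(demo_path)

print(demo_labels, demo_images.shape, demo_count)
```

It returns the same three values as `readFile`, so it could be swapped in without changing the later cells.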

#### Function initEquationSetup : Set up vector b for the equation Ax = b

In [4]:
def initEquationSetup (list_labels, datasetCnt, digit) :
    rowCnt = datasetCnt

    # Target vector b: +1 where the label equals the digit, -1 otherwise
    vector_b = np.zeros(rowCnt, dtype = float)

    for i in range(rowCnt) :
        if list_labels[i] == digit : vector_b[i] = 1
        else : vector_b[i] = -1

    return vector_b
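
The same target vector can be built without an explicit loop. This is a minimal sketch using `np.where`; the name `make_targets` and the example labels are assumptions for illustration.

```python
import numpy as np

def make_targets(labels, digit):
    # Hypothetical helper: +1 where the label equals the digit, -1 elsewhere.
    return np.where(np.asarray(labels) == digit, 1.0, -1.0)

example_labels = np.array([0, 3, 0, 7])
example_targets = make_targets(example_labels, 0)
print(example_targets)
```

The comparison `labels == digit` yields a boolean array, and `np.where` maps it to the two target values in a single pass.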

#### Function pseudoInv : Calculate pseudo inverse

In [5]:
def pseudoInv(mtrx_A) :
    # Report completion only after the pseudo inverse has actually been computed
    A_pInv = np.linalg.pinv(mtrx_A)
    print("Pseudo inverse calculated.")
    return A_pInv

#### Function leastSquare : Solve the least-squares problem from the pseudo inverse of A and vector b

In [6]:
def leastSquare(A_pInv, vector_b) :
    b = vector_b
    
    x = np.matmul(A_pInv, b)
    return x
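
Multiplying by the pseudo inverse is one way to obtain the least-squares solution; `np.linalg.lstsq` solves the same minimization without forming the pseudo inverse explicitly. The sketch below checks that the two agree on a small random, full-rank system. The `_demo` arrays are made up for illustration, and on a rank-deficient matrix the two routes may differ slightly because their default cutoffs for small singular values are not the same.

```python
import numpy as np

rng = np.random.default_rng(0)
A_demo = rng.standard_normal((50, 5))      # 50 samples, 5 features
b_demo = rng.choice([-1.0, 1.0], size=50)  # +1 / -1 targets

# Route 1: explicit pseudo inverse, as in the notebook.
x_pinv = np.linalg.pinv(A_demo) @ b_demo

# Route 2: dedicated least-squares solver.
x_lstsq, residuals, rank, singular_values = np.linalg.lstsq(A_demo, b_demo, rcond=None)

print(np.allclose(x_pinv, x_lstsq))
```

Because all ten classifiers share the same matrix, computing the pseudo inverse once and reusing it, as the notebook does, is a reasonable design choice.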

#### Function calculateAccuracy : Calculate accuracy using estimated vector and answer vector

In [7]:
def calculateAccuracy(estimated_vector, answer_vector) :
    TP = FP = TN = FN = 0

    for i in range(len(estimated_vector)) :
        if estimated_vector[i] >= 0 : # Predicted as the target digit
            if answer_vector[i] == 1 : # Actually the target digit (TP)
                TP += 1
            elif answer_vector[i] == -1 : # Actually another digit (FP, wrong)
                FP += 1
        else : # Predicted as not the target digit
            if answer_vector[i] == 1 : # Actually the target digit (FN, wrong)
                FN += 1
            elif answer_vector[i] == -1 : # Actually another digit (TN)
                TN += 1

    return TP, FP, FN, TN
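
The four counts can also be computed with boolean masks, and the two rates requested in the problem follow directly from them. This is a sketch under the assumption that any answer other than `1` is treated as negative; `confusion_counts` and the demo arrays are hypothetical.

```python
import numpy as np

def confusion_counts(scores, answers):
    # Hypothetical helper: same return order as calculateAccuracy (TP, FP, FN, TN).
    predicted_positive = np.asarray(scores) >= 0
    actual_positive = np.asarray(answers) == 1
    TP = int(np.sum(predicted_positive & actual_positive))
    FP = int(np.sum(predicted_positive & ~actual_positive))
    FN = int(np.sum(~predicted_positive & actual_positive))
    TN = int(np.sum(~predicted_positive & ~actual_positive))
    return TP, FP, FN, TN

demo_scores = np.array([0.8, -0.2, 0.1, -0.9, -0.4])
demo_answers = np.array([1, 1, -1, -1, -1])
TP, FP, FN, TN = confusion_counts(demo_scores, demo_answers)

# True positive rate: fraction of actual positives that were detected.
true_positive_rate = TP / (TP + FN)
# Error rate: fraction of all samples that were misclassified.
error_rate = (FP + FN) / (TP + FP + FN + TN)
print(TP, FP, FN, TN, true_positive_rate, error_rate)
```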

#### Digit Classifying

In [8]:
trainingFilePath = "mnist_train.csv"
testingFilePath = "mnist_test.csv"

list_labels, list_images, datasetCnt = readFile(trainingFilePath)
Tlist_labels, Tlist_images, TdatasetCnt = readFile(testingFilePath)

list_digitClassifier = []
list_trainingEstVector = []
list_testEstVector = []
list_answer = []
list_Tanswer = []


mtrx_A = copy.deepcopy(list_images)
Tmtrx_A = copy.deepcopy(Tlist_images)
A_pInv = pseudoInv(mtrx_A)

for digit in range(10) :
    vector_b = initEquationSetup(list_labels, datasetCnt, digit)
    Tvector_b = initEquationSetup(Tlist_labels, TdatasetCnt, digit)
    list_answer.append(vector_b)
    list_Tanswer.append(Tvector_b)

    x = leastSquare(A_pInv, vector_b)
    list_digitClassifier.append(x)

    estimated_vector = np.matmul(mtrx_A, x)
    list_trainingEstVector.append(estimated_vector)

    Testimated_vector = np.matmul(Tmtrx_A, x)    
    list_testEstVector.append(Testimated_vector)
print("Digit classifying finished.")

Pseudo inverse calculated.
Digit classifying finished.


#### Accuracy for Training Data set

In [9]:
estimated_labels = []
TP = TN = FP = FN = 0

for i in range(datasetCnt) :
    estVal = np.zeros(10, dtype = float)
    for j in range(10) :
        estVal[j] = list_trainingEstVector[j][i]
    estIdx = np.argmax(estVal)

    estimated_labels.append(estIdx)
    if estVal[estIdx] >= 0 and list_answer[estIdx][i] == 1 :
        TP += 1
    elif estVal[estIdx] < 0 and list_answer[estIdx][i] == -1 :
        TN += 1
    elif estVal[estIdx] >= 0 and list_answer[estIdx][i] == -1 :
        FP += 1
    elif estVal[estIdx] < 0 and list_answer[estIdx][i] == 1 :
        FN += 1
    
print("[ Accuracy for Training Data set] \n")
print("True Positive : %d / %d, (%.1f" % (TP, datasetCnt, round(TP / datasetCnt * 100, 1)), "%)")
print("False Positive : %d / %d, (%.1f" % (FP, datasetCnt, round(FP / datasetCnt * 100, 1)), "%)")
print("True Negative : %d / %d, (%.1f" % (TN, datasetCnt, round(TN / datasetCnt * 100, 1)), "%)")
print("False Negative : %d / %d, (%.1f" % (FN, datasetCnt, round(FN / datasetCnt * 100, 1)), "%)")

[ Accuracy for Training Data set] 

True Positive : 42506 / 60000, (70.8 %)
False Positive : 3463 / 60000, (5.8 %)
True Negative : 5421 / 60000, (9.0 %)
False Negative : 8610 / 60000, (14.3 %)


#### Accuracy for Testing Data set

In [10]:
estimated_labels = []
TP = TN = FP = FN = 0

for i in range(TdatasetCnt) :
    estVal = np.zeros(10, dtype = float)
    for j in range(10) :
        estVal[j] = list_testEstVector[j][i]
    estIdx = np.argmax(estVal)

    estimated_labels.append(estIdx)
    if estVal[estIdx] >= 0 and list_Tanswer[estIdx][i] == 1 :
        TP += 1
    elif estVal[estIdx] < 0 and list_Tanswer[estIdx][i] == -1 :
        TN += 1
    elif estVal[estIdx] >= 0 and list_Tanswer[estIdx][i] == -1 :
        FP += 1
    elif estVal[estIdx] < 0 and list_Tanswer[estIdx][i] == 1 :
        FN += 1
    
print("[ Accuracy for Test Data set] \n")
print("True Positive : %d / %d, (%.1f" % (TP, TdatasetCnt, round(TP / TdatasetCnt * 100, 1)), "%)")
print("False Positive : %d / %d, (%.1f" % (FP, TdatasetCnt, round(FP / TdatasetCnt * 100, 1)), "%)")
print("True Negative : %d / %d, (%.1f" % (TN, TdatasetCnt, round(TN / TdatasetCnt * 100, 1)), "%)")
print("False Negative : %d / %d, (%.1f" % (FN, TdatasetCnt, round(FN / TdatasetCnt * 100, 1)), "%)")

[ Accuracy for Test Data set] 

True Positive : 7013 / 10000, (70.1 %)
False Positive : 564 / 10000, (5.6 %)
True Negative : 902 / 10000, (9.0 %)
False Negative : 1521 / 10000, (15.2 %)
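
The counts above are based on the sign of the winning score. Another reading of the task compares the argmax label directly with the true label, which gives an overall error rate and a true positive rate per digit. The sketch below shows that computation on made-up labels; `multiclass_rates` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def multiclass_rates(predicted_labels, true_labels, num_classes=10):
    # Hypothetical helper: overall error rate and per-class true positive rate.
    predicted_labels = np.asarray(predicted_labels)
    true_labels = np.asarray(true_labels)
    error_rate = float(np.mean(predicted_labels != true_labels))
    tpr = np.full(num_classes, np.nan)  # NaN for classes absent from true_labels
    for d in range(num_classes):
        is_d = true_labels == d
        if is_d.any():
            # Fraction of samples of class d that were predicted as d.
            tpr[d] = np.mean(predicted_labels[is_d] == d)
    return error_rate, tpr

# Illustrative labels only, not results from the notebook.
demo_true = np.array([0, 0, 1, 1, 2, 2])
demo_pred = np.array([0, 1, 1, 1, 2, 0])
demo_error, demo_tpr = multiclass_rates(demo_pred, demo_true, num_classes=3)
print(demo_error, demo_tpr)
```

In this notebook the call would presumably be `multiclass_rates(estimated_labels, Tlist_labels)` after the testing cell, or with `list_labels` after the training cell.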
