# [Machine Learning] Binary Classifier_(2)


### Student ID : 20144367, Name : Lee, Donghyun


## 1. Problem

Build a binary classifier based on k random features for each digit against all the other digits in the MNIST dataset.

Let $x = (x_1, x_2, ... , x_m)$ be a vector representing an image in the dataset.

The prediction function $f_d(x; w)$ for each digit $d$ is defined as a linear combination of basis functions of the input vector $x$, weighted by the model parameter $w$:

$$f_d(x; w) = w_0 \cdot 1 + w_1 \cdot g_1 + w_2 \cdot g_2 + \dots + w_k \cdot g_k$$

where $w = (w_0, w_1, \dots, w_k)$ and each basis function $g_j$, $j = 1, \dots, k$, is defined as the inner product of a random vector $r_j$ and the input vector $x$.

You may want to try $g_j = \max(\langle r_j, x \rangle, 0)$ to see if it improves the performance.
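
The rectified random features described above can be sketched in a few lines of NumPy. This is only an illustration: the helper name `random_relu_features`, the seed, the stand-in data and the choice of k = 16 are assumptions, not part of the assignment.

```python
import numpy as np

def random_relu_features(X, R):
    """Return max(<r_j, x>, 0) for every row x of X and every column r_j of R."""
    return np.maximum(X @ R, 0.0)

rng = np.random.default_rng(0)          # fixed seed, chosen only for reproducibility
X = rng.random((5, 784))                # 5 stand-in "images" with m = 784 pixels each
R = rng.normal(0.0, 1.0, (784, 16))     # k = 16 random vectors stored as columns

G = random_relu_features(X, R)          # one row of k features per image
print(G.shape)
```

Dropping the `np.maximum` call gives the plain inner-product features of the first definition.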

The prediction function $f_d(x; w)$ should have the following values:

$$f_d(x; w) = +1 \quad \text{if } \text{label}(x) = d$$
$$f_d(x; w) = -1 \quad \text{if } \text{label}(x) \neq d$$

The optimal model parameter $w$ is obtained by minimizing the following objective function for each digit $d$:

$$\sum_i ( f_d(x^{(i)}; w) - y^{(i)} )^2$$

where $y^{(i)}$ is $+1$ if the label of the $i$-th image $x^{(i)}$ is $d$ and $-1$ otherwise. The label of an input $x$ is then given by:

$$\arg\max_d f_d(x; w)$$
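
As a rough sketch of the two formulas above, the least-squares problem can be solved for all digits at once and the label read off with an arg-max. The function names `fit_one_vs_all` and `predict` and the toy data are hypothetical; the constant column mirrors the $w_0$ term of the prediction function.

```python
import numpy as np

def with_bias(G):
    # Prepend the constant feature 1 that multiplies w_0
    return np.hstack([np.ones((G.shape[0], 1)), G])

def fit_one_vs_all(G, labels, n_classes):
    """Least-squares weights, one column per class."""
    # Targets: +1 where the label equals the class, -1 elsewhere
    Y = np.where(labels[:, None] == np.arange(n_classes), 1.0, -1.0)
    # lstsq minimizes the sum of squared residuals for every column of Y
    W, *_ = np.linalg.lstsq(with_bias(G), Y, rcond=None)
    return W

def predict(G, W):
    # The label is the class whose prediction function is largest
    return np.argmax(with_bias(G) @ W, axis=1)

# Toy data: three well-separated classes described by two features
G = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [5.1, 0.0], [0.0, 5.0], [0.0, 5.1]])
labels = np.array([0, 0, 1, 1, 2, 2])

W = fit_one_vs_all(G, labels, n_classes=3)
print(predict(G, W))
```

`np.linalg.lstsq` and multiplying by `np.linalg.pinv` both return a least-squares solution; `lstsq` simply avoids forming the pseudo-inverse explicitly.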

1. Compute an optimal model parameter using the training dataset for each classifier $f_d(x; w)$.
2. Compute (1) the true positive rate and (2) the error rate on (1) the training dataset and (2) the testing dataset.
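
One possible reading of the two metrics for a single binary classifier is sketched below. It assumes that a score of at least 0 counts as a positive prediction; the helper name `tpr_and_error_rate` and the sample scores are made up for illustration.

```python
import numpy as np

def tpr_and_error_rate(scores, labels, digit):
    """scores: values of f_d(x; w) for one digit d; labels: the true digits."""
    predicted_pos = scores >= 0                 # assumption: the sign of f_d decides the class
    actual_pos = labels == digit
    tp = np.sum(predicted_pos & actual_pos)     # positives that were found
    fn = np.sum(~predicted_pos & actual_pos)    # positives that were missed
    tpr = tp / (tp + fn)                        # undefined if the digit never occurs
    error_rate = np.mean(predicted_pos != actual_pos)
    return tpr, error_rate

scores = np.array([0.9, -0.2, 0.3, -0.8, -0.1])
labels = np.array([3, 3, 5, 7, 1])
tpr, err = tpr_and_error_rate(scores, labels, digit=3)
print(tpr, err)  # 0.5 0.4
```

Here one of the two images of digit 3 is detected, and two of the five images are classified incorrectly.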

## 2. Code

#### Libraries

In [1]:
import numpy as np
import copy
import random

#### Global Variables

In [2]:
ROWS = 28
COLS = 28
trainingFilePath = "mnist_train.csv"
testingFilePath = "mnist_test.csv"

#### Class DigitClassifier : Data structures and functions for Digit classifying

In [3]:
class DigitClassifier :
    list_imageLabels = []     # Label of each image in the MNIST data set
    list_images = []          # Pixel values of every image in the MNIST data set
    featureMtrx = []          # Model parameters per digit (10 x (ROWS * COLS) matrix)
    answerMtrx = []           # Answer matrix (10 x datasetCnt matrix of +1 / -1)
    estimatedMtrx = []        # Estimated matrix computed from the input images and feature matrix
    estimatedLabels = []      # Labels determined by the values in the estimated matrix
    weightMtrx = []           # Random projection matrix with Gaussian entries N(0, stddev^2)
    list_weightedImages = []  # Images projected by weightMtrx and rectified with max(0, .)
    datasetCnt = 0            # Total image count
    
    # Initialize related variables using input file in filepath
    def __init__(self, filePath) :
        # Read all lines up front. A missing or unreadable file raises here
        # instead of failing later on an undefined variable.
        with open(filePath, 'r') as inputFile :
            dataset = inputFile.readlines()
        
        datasetCnt = len(dataset)
        list_labels = np.zeros(datasetCnt, dtype = int)
        list_images = np.zeros((datasetCnt, ROWS * COLS), dtype = float)

        idx = 0    
        for data in dataset :        
            list_data = data.split(',')
            label = list_data[0]
            image = list_data[1:]

            list_labels[idx] = label
            list_images[idx] = image
            idx += 1
        
        self.list_images = list_images
        self.list_imageLabels = list_labels
        self.datasetCnt = datasetCnt
#        print("File loaded and labels, pixels info and dataset count have been updated.")
#        print("Total dataset count : %d" % self.datasetCnt)
    
    # Create answersheet for each digit using imageLabels
    def createAnswerSheet(self) :    
        self.answerMtrx = np.zeros((10,self.datasetCnt), dtype = int)
        for digit in range(10) :
            for idx in range(self.datasetCnt) :
                if self.list_imageLabels[idx] == digit : self.answerMtrx[digit][idx] = 1
                else : self.answerMtrx[digit][idx] = -1
#        print("Answersheet has been created.")
        
    # Create weight matrix with Gaussian random values
    def createWeightMtrx(self, stddev) :

        self.weightMtrx = np.zeros((ROWS * COLS, ROWS * COLS), dtype = float)
        for i in range(ROWS * COLS) :
            for j in range(ROWS * COLS) :
                self.weightMtrx[i][j] = random.gauss(0.0,stddev)
#        print("Weight matrix has been created.")
    
    # Solve the least-squares problem for each digit with the pseudo-inverse of the
    # rectified random features; featureMtrx holds the resulting model parameters
    def createFeatureMtrx(self) :
        self.featureMtrx = np.zeros((10, ROWS * COLS), dtype = float)
        self.list_weightedImages = np.matmul(self.list_images, self.weightMtrx)
        
        for i in range(self.datasetCnt) :
            for j in range(ROWS * COLS) :
                self.list_weightedImages[i][j] = max(0, self.list_weightedImages[i][j])
                
        pseudoInv = np.linalg.pinv(self.list_weightedImages)
#        print("Pseudo Inv - rows : %d, cols : %d" % (len(pseudoInv), len(pseudoInv[0])))

        for digit in range(10) :
            self.featureMtrx[digit] = np.matmul(pseudoInv, self.answerMtrx[digit])
#        print("Feature matrix has been created.")
    
    # Import the matrices weightMtrx and featureMtrx obtained from the training data
    def importMetrices(self, weightMtrx, featureMtrx) :
        self.weightMtrx = weightMtrx
        self.featureMtrx = featureMtrx
        self.list_weightedImages = np.matmul(self.list_images, self.weightMtrx)     
        
        for i in range(self.datasetCnt) :
            for j in range(ROWS * COLS) :
                self.list_weightedImages[i][j] = max(0, self.list_weightedImages[i][j])


    # Classify the digit images based on feature matrix
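    def classifyDigits(self) :
        # Start from a fresh list: estimatedLabels is declared at class level, so appending
        # without resetting would share one list across all DigitClassifier instances
        self.estimatedLabels = []
        self.estimatedMtrx = np.zeros((10, self.datasetCnt), dtype = float)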

        for digit in range(10) :
#            self.estimatedMtrx[digit] = np.matmul(self.list_images,self.featureMtrx[digit])
            self.estimatedMtrx[digit] = np.matmul(self.list_weightedImages, self.featureMtrx[digit])
#        print("Estimated matrix has been calculated.")

        TP = TN = FP = FN = 0
        for i in range(self.datasetCnt) :
            estVal = np.zeros(10, dtype = float)
            answers = np.zeros(10,dtype = float)
            for j in range(10) :
                estVal[j] = self.estimatedMtrx[j][i]
                answers[j] = self.answerMtrx[j][i]
            estIdx = np.argmax(estVal)
            #print("Estimated Values : ",estVal,", Estimated index : ", estIdx, ", Answer : ", self.answerMtrx[estIdx][i])
            #print("Real values : ", answers)

            self.estimatedLabels.append(estIdx)
            if estVal[estIdx] >= 0 and self.answerMtrx[estIdx][i] == 1 :
                TP += 1
            elif estVal[estIdx] < 0 and self.answerMtrx[estIdx][i] == -1 :
                TN += 1
            elif estVal[estIdx] >= 0 and self.answerMtrx[estIdx][i] == -1 :
                FP += 1
            elif estVal[estIdx] < 0 and self.answerMtrx[estIdx][i] == 1 :
                FN += 1

        return TP, TN, FP, FN
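
The element-wise Python loops in `createAnswerSheet` and `createWeightMtrx` can probably be replaced by single NumPy calls, which tends to be much faster on 60000 images. The sketch below is not the code used to produce the results in this notebook; the function names `answer_matrix` and `gaussian_matrix` are hypothetical, and the random draws come from NumPy's generator rather than `random.gauss`.

```python
import numpy as np

def answer_matrix(labels, n_digits=10):
    # Row d holds +1 where the label equals d and -1 elsewhere (broadcast comparison)
    return np.where(np.arange(n_digits)[:, None] == labels[None, :], 1, -1)

def gaussian_matrix(n, stddev, rng):
    # One call draws all n * n entries from N(0, stddev^2)
    return rng.normal(0.0, stddev, size=(n, n))

labels = np.array([3, 0, 9, 3])
A = answer_matrix(labels)
print(A[3])  # [ 1 -1 -1  1]

rng = np.random.default_rng(0)   # seed chosen only for reproducibility
M = gaussian_matrix(28 * 28, 2.0, rng)
print(M.shape)  # (784, 784)
```

In the same spirit, `np.maximum(matrix, 0)` applies the `max(0, .)` step to a whole matrix at once.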

#### Accuracy for Training Data set

In [4]:
trainingDigitClassifier = DigitClassifier(trainingFilePath)
trainingDigitClassifier.createAnswerSheet()
trainingDigitClassifier.createWeightMtrx(2)
trainingDigitClassifier.createFeatureMtrx()
(TP, TN, FP, FN) = trainingDigitClassifier.classifyDigits()

datasetCnt = trainingDigitClassifier.datasetCnt
print("[ Accuracy for Training Data set] \n")
print("True Positive : %d / %d, (%.1f" % (TP, datasetCnt, round(TP / datasetCnt * 100, 2)), "%)")
print("False Positive : %d / %d, (%.1f" % (FP, datasetCnt, round(FP / datasetCnt * 100, 2)), "%)")
print("True Negative : %d / %d, (%.1f" % (TN, datasetCnt, round(TN / datasetCnt * 100, 2)), "%)")
print("False Negative : %d / %d, (%.1f" % (FN, datasetCnt, round(FN / datasetCnt * 100, 2)), "%)")

[ Accuracy for Training Data set] 

True Positive : 51451 / 60000, (85.8 %)
False Positive : 1105 / 60000, (1.8 %)
True Negative : 2569 / 60000, (4.3 %)
False Negative : 4875 / 60000, (8.1 %)


#### Accuracy for Testing Data set

In [5]:
testingDigitClassifier = DigitClassifier(testingFilePath)
testingDigitClassifier.createAnswerSheet()
testingDigitClassifier.importMetrices(trainingDigitClassifier.weightMtrx, trainingDigitClassifier.featureMtrx)
(TP,TN,FP,FN) = testingDigitClassifier.classifyDigits()

datasetCnt = testingDigitClassifier.datasetCnt
print("[ Accuracy for Testing Data set] \n")
print("True Positive : %d / %d, (%.1f" % (TP, datasetCnt, round(TP / datasetCnt * 100, 2)), "%)")
print("False Positive : %d / %d, (%.1f" % (FP, datasetCnt, round(FP / datasetCnt * 100, 2)), "%)")
print("True Negative : %d / %d, (%.1f" % (TN, datasetCnt, round(TN / datasetCnt * 100, 2)), "%)")
print("False Negative : %d / %d, (%.1f" % (FN, datasetCnt, round(FN / datasetCnt * 100, 2)), "%)")


[ Accuracy for Testing Data set] 

True Positive : 8586 / 10000, (85.9 %)
False Positive : 202 / 10000, (2.0 %)
True Negative : 410 / 10000, (4.1 %)
False Negative : 802 / 10000, (8.0 %)
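
The four counts can be combined into an overall label accuracy. In `classifyDigits`, the TP and FN branches are the cases where the arg-max digit equals the true label (they differ only in the sign of the winning score), while FP and TN are the cases where it does not. Under that reading, the helper below (a hypothetical name, `label_accuracy`) recomputes the rates from the counts printed above.

```python
def label_accuracy(tp, tn, fp, fn):
    # TP + FN: arg-max digit matches the true label; FP + TN: it does not
    total = tp + tn + fp + fn
    return (tp + fn) / total, (fp + tn) / total

train_acc, train_err = label_accuracy(tp=51451, tn=2569, fp=1105, fn=4875)
test_acc, test_err = label_accuracy(tp=8586, tn=410, fp=202, fn=802)

print(f"train: {train_acc:.2%} correct, {train_err:.2%} wrong")  # train: 93.88% correct, 6.12% wrong
print(f"test:  {test_acc:.2%} correct, {test_err:.2%} wrong")    # test:  93.88% correct, 6.12% wrong
```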
