**Assignment 1 - Handwriting recognition on MNIST Dataset using the Naive Bayes Classifier (Training Images - 5000, Testing Images - 1000)**

**Author - Snehal Utage, Z1888637**


Formulas used to estimate the probabilities (training) and to classify the images:
1. Prior probability:

   P(class = c) = (# of training examples where class = c) / (# of training examples)

2. Likelihood with Laplace smoothing (smoothing constant k in the range 0.1 to 10; the code uses k = 0.1):

   P(Fi,j = f | class = c) = (k + # of times Fi,j = f when class = c) / (2k + # of training examples where class = c)

   The 2k in the denominator comes from each pixel feature Fi,j taking one of two values (0 or 1).

3. Log posterior (up to an additive constant that is the same for every class):

   log(P(class)) + log(P(f1,1 | class)) + log(P(f1,2 | class)) + ... + log(P(f28,28 | class))

4. Classify each image using MAP (maximum a posteriori) classification: assign the class with the largest log posterior.
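
The four steps above can be sketched on a toy problem. This is a minimal illustration, not the assignment's implementation: the function names `train_naive_bayes` and `classify`, the 2 x 2 toy images, and the vectorized NumPy formulation are all assumptions made for this example, and it assumes every class appears at least once in the training labels.

```python
import numpy as np

def train_naive_bayes(images, labels, num_classes, k=0.1):
    """Estimate log priors and Laplace-smoothed log likelihoods.

    Hypothetical helper: images has shape (N, H, W) with values 0/1,
    labels has shape (N,). Assumes every class occurs at least once.
    """
    images = np.asarray(images)
    labels = np.asarray(labels)
    log_prior = np.empty(num_classes)
    log_one = np.empty((num_classes,) + images.shape[1:])
    log_zero = np.empty_like(log_one)
    for c in range(num_classes):
        class_images = images[labels == c]
        n_c = len(class_images)
        # Step 1: prior = (# examples of class c) / (# examples)
        log_prior[c] = np.log(n_c / len(labels))
        # Step 2: smoothed likelihood for pixel value 1 and pixel value 0
        ones = class_images.sum(axis=0)
        log_one[c] = np.log((k + ones) / (2 * k + n_c))
        log_zero[c] = np.log((k + n_c - ones) / (2 * k + n_c))
    return log_prior, log_zero, log_one

def classify(image, log_prior, log_zero, log_one):
    """Return the MAP class and the per-class log posteriors for one image."""
    image = np.asarray(image)
    # Step 3: pick the log likelihood matching each pixel value, then sum
    per_pixel = np.where(image == 1, log_one, log_zero)
    log_posterior = log_prior + per_pixel.sum(axis=(1, 2))
    # Step 4: MAP classification
    return int(np.argmax(log_posterior)), log_posterior

# Toy data: class 0 has its left column set, class 1 its right column
toy_images = [[[1, 0], [1, 0]], [[1, 0], [0, 0]],
              [[0, 1], [0, 1]], [[0, 1], [1, 1]]]
toy_labels = [0, 0, 1, 1]
log_prior, log_zero, log_one = train_naive_bayes(toy_images, toy_labels, 2)
predicted, scores = classify([[1, 0], [1, 0]], log_prior, log_zero, log_one)
print(predicted, scores)
```

Working in log space avoids the numerical underflow that multiplying 784 small probabilities would cause, which is why step 3 is written as a sum of logs.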


In [14]:
import numpy as np
import math
import requests
from io import StringIO  

################################################################
# Function - InputImageSlicing                                 #
# Details  - Read the input file from the URL provided and     # 
#            Convert the image text file into 28 * 28 matrix   #
#            Replace " " with 0 and "#","+" with 1             #
# Input    - filename_URL - Filename URL(GIT path url)         #
# Output.  - Return a list of 28 x 28 arrays, one per image.   #
################################################################
def InputImageSlicing(filename_URL):
  #Get the file url and get the http request output
  url = filename_URL
  data = requests.get(url)
  #Stop early with an HTTPError if the download failed
  data.raise_for_status()
  #Split the lines and convert it to list
  each_line = list(map(list,data.text.splitlines()))
  #Convert to numpy array
  each_line=np.array(each_line,dtype="object")
  #Replace the space,#,+ with respective 0 and 1
  for line in each_line:
    for i,val in enumerate(line):
      if val == " ":
        line[i] = 0
      else:
        line[i] = 1  
  #Split the text images into 28 * 28 list 
  dataset = [each_line[i:i+28] for i in range(0,len(each_line),28)]
  return dataset

################################################################
# Function - GetLabels                                         #
# Details  - Read the labels text file                         #
# Input    - filename_URL - Labels Filename URL (GIT path url) #
# Output.  - Return a numpy array of all the Labels as int     #
################################################################     
def GetLabels(filename_URL):
  #Get the HTTP response from the labels file url and convert to numpy array
  url = filename_URL
  data = requests.get(url)
  #Stop early with an HTTPError if the download failed
  data.raise_for_status()
  input_data = StringIO(data.text)
  return np.loadtxt(input_data).astype(int)

################################################################
# Function - GetCountLabels                                    #
# Details  - Read the labels text file and count the number of # 
#            samples belonging to each class(0-9)              #
# Input    - labels - numpy array of labels,                   #
#            classes - # of classes(10)                        #
# Output.  - Return a dictionary with count of image samples   #
#            belonging to each class                           #
################################################################
def GetCountLabels(labels,classes):
  count={}
  for i in range(classes):
    count[i] = np.count_nonzero(labels == i)
  return count
  
################################################################
# Function - GetClassPriorProbability                          #
# Details  - From the training labels file find the Prior      # 
#             probability for each class(0-9).                 #
# Input    - classes - # of classes(10),                       #
#            training_labels - numpy array of training labels, #
#            total_training_data - # of training samples.      #
# Output.  - Return a dictionary for the prior probability of  #
#            each class(0-9)                                   #    
################################################################
def GetClassPriorProbability(classes,training_labels,total_training_data):
  prior_probability={}
  #For each class(0-9) find the prior probability
  for i in range(classes):
    prior_probability[i] = (np.count_nonzero(training_labels == i))/ total_training_data
  return prior_probability

################################################################
# Function - GetCombinedClassSet                               #
# Details  - Read the training data array and combine all the  # 
#            images for respective class in a dictionary       #
# Input    - training_data - numpy array of training images    #
#            classes - # of classes(10)                        #                 
# Output.  - Return a dictionary class_set which has all the   #
#            images belonging to the same class                #
################################################################
def GetCombinedClassSet(training_data,classes):
  class_set={}
  #For each class(0-9) combine the images belonging to respective class
  for class_id in range(classes):
    #Collect every (image,label) pair whose label matches this class
    class_items=[]
    for item in training_data:
      if item[1] == class_id:
        class_items.append(item)
    class_set[class_id]=class_items
  return class_set

################################################################
# Function - GetLikelihood                                     #
# Details  - Calculate the likelihood of each training pixel   #
#            (28*28=784) for each class for both features (0,1)#
# Input    - combined_classset - Dictionary of combined images #
#            for each class                                    #
#            classes - # of classes                            #
# Output.  - Return two dictionaries with likelihood for each  # 
#            pixel for each feature(0,1) . Taken Log values of # 
#            prob and used Laplace smoothing factor k = 0.1    #
################################################################
def GetLikelihood(combined_classset,classes):
  zero_prob={}
  one_prob={}
  #Laplace smoothing factor
  k=0.1 
  #Create dictionary having key as class(0-9) having the values of likelihood for each 
  #pixel (28*28=784)of training images for respective class for feature {0,1}
  for value in range(classes):
    item=np.array(combined_classset[value], dtype="object")
    zero_values=[]
    one_values=[]
    for m in range(28):
      for n in range(28):
        #Find the value of each pixel in all the images of each class and add to a list
        rows = [i[m][n] for i in item[:,0]]
        #Calculate the log of likelihood for 0 and 1 and append to list
        zero_values.append(math.log((k + rows.count(0)) / (2 * k + len(rows))))
        one_values.append(math.log((k + rows.count(1)) / (2 * k + len(rows))))
    #Append the likelihoods list for each class in dictionary
    zero_prob[value] = np.array(zero_values).reshape(28,28)
    one_prob[value] = np.array(one_values).reshape(28,28)
  return zero_prob,one_prob

################################################################
# Function - GetProbabilityValues                              #
# Details  - Get the likelihood for each pixel of the testing  #
#            image sample                                      #
# Input    - testing_images- sliced test img(28*28) numpy array#
#            zero_prob - Likelihood, dictionary for each class #
#                       for feature 0                          #
#            one_prob  - Likelihood, dictionary for each class #
#                       for feature 1                          #
#            classes - # of classes(10)                        #
# Output.  - Return a dictionary of dictionary for all the     #
#            images in testing file, inner dict having the     # 
#            likelihood found for each pixel for each class    #
################################################################  
def GetProbabilityValues(testing_images,zero_prob,one_prob,classes):
  test_img_likelihood={}
  #For each image in testing file find the likelihood based on the
  #value of the pixel, if it is 0 then get the Likelihood for feature0
  #else if it is 1 then get the Likelihood for feature1
  #for each class
  for num, img in enumerate(testing_images):
    final={}
    for class_id in range(classes):
      prob_array=[]
      #Get the pixel
      for i in range(28):
        for j in range(28):
          #check the pixel value
          if img[i][j] == 0:
            #Get the likelihood for 0
            prob_array.append(zero_prob[class_id][i][j])
          else:
            #Get likelihood for 1
            prob_array.append(one_prob[class_id][i][j])
      #Add to dictionary
      final[class_id] = prob_array
    #Add to dictionary for each image
    test_img_likelihood[num]=final         
  return test_img_likelihood

################################################################
# Function - GetPosteriorProbability                           #
# Details  - Calculate the posterior probability for each image#
#            in testing images file using the likelihood found #
#            in training, and the class prior                  #
# Input    - test_img_likelihood - Dict of dict with likelihood#
#            for each image for each class                     #
#            class_priori - Dict of prior prob of each class   #                    
# Output.  - Return a dictionary of posterior prob of each img #
################################################################
def GetPosteriorProbability(test_img_likelihood,class_priori):
  all_images_posterior={}
  #From the test image likelihood dictionary find the posterior
  #for all images for each class
  for k,v in test_img_likelihood.items():
    final_posterior={}
    for k1,v1 in v.items():
      final_posterior[k1] = math.log(class_priori[k1])+sum(v1)
    all_images_posterior[k]=final_posterior
  return all_images_posterior

################################################################
# Function - ClassifyImage                                     #
# Details  - Classify each test image based on the posterior   #
#            probability calculated, assign the class which has#
#            max posterior probability                         #
# Input    - all_images_posterior- Dict of posterior prob of   #
#            test img                                          #
# Output.  - Return a dictionary of each image classified(0-9) #
################################################################   
def ClassifyImage(all_images_posterior):
  classification_result={}
  #For posterior calculated for each class, find the max posterior
  #probability value and assign the classified class id to the image
  for k,v in all_images_posterior.items():
    class_id = max(v, key=v.get) 
    classification_result[k]=class_id
  return classification_result

################################################################
# Function - Accuracy                                          #
# Details  - Find the accuracy(%) by comparing the classified  #
#            result with actual test labels                    #
# Input    - classification_result - Dict of each test image   #
#             classified(0-9).                                 #
#            testing_labels - array of actual test labels      #
# Output.  - Returns % accuracy, count of correctly classified #
#            images, total testing images                      #
################################################################
def Accuracy(classification_result,testing_labels):
  count=0
  #Find the correct_classified/total percent
  for i, val in classification_result.items():
    if val == testing_labels[i]:
      count += 1
  total_accuracy = (count / len(testing_labels)) * 100
  return total_accuracy,count,len(testing_labels)

################################################################
# Function - CombinedOutput                                    #
# Details  - Combine the classified class value and the actual # 
#            value of image class for displaying               #
# Input    - classification_result - Dict of each test img     #
#            classified(0-9).                                  #
#            labels - Actual(True) Labels array                # 
# Output.  - Return a dictionary with the values as classified #
#            value, actual value for each test image           #
################################################################
def CombinedOutput(classification_result,labels):
  #For printing the output as image,actual image class, classified class,
  #replace each value of the classified dictionary in place with the
  #tuple (classified class, actual class)
  for i,val in classification_result.items():
    val = (val,labels[i])
    classification_result[i]=val
  return classification_result

################################################################
# Function - ConfusionMatrix                                   #
# Details  - Create a 10*10 confusion matrix for our model.    #
# Input    - combined_classification - Dict for all test images#
#            with values as a tuple                            #
#            (classified class, actual class)                  #
# Output.  - Return a confusion matrix(number based, % based,  #
#            rounded %based)                                   #
################################################################
def ConfusionMatrix(combined_classification):
  num_conf_matrix=np.zeros((10,10))
  #Rows are the actual class, columns are the classified class
  for classified,actual in combined_classification.values():
    num_conf_matrix[actual][classified] +=1
  #Convert each row to percentages of the images of that actual class
  percent_conf_matrix=num_conf_matrix/num_conf_matrix.sum(axis=1, keepdims=True)*100
  #Round the percentages to the nearest whole number
  rounded_conf_matrix=np.around(percent_conf_matrix)
  return num_conf_matrix,percent_conf_matrix,rounded_conf_matrix

################################################################
# Function - FindAccuracyValues                                #
# Details  - Find the accuracy values for the input files      #
#            - training and testing                            #
# Input    - images_filename - URL for the test images file    #
#            labels_filename - URL for actual test labels file.#
#            zero_prob -Likelihood prob dictionary for feature0#
#            one_prob - Likelihood prob dictionary for feature1#
#            class_priori - Prior prob for each class(0-9).    #
#            classes - # of classes(10)                        #
# Output.  - Returns                                           #
#            total_accuracy - Accuracy for the input test file #
#            correct_count - # of correctly classified images. #
#            total_count - # of total test images              #
#            classification_result - Classified and actual     #
#            classes dict for each image                       #
################################################################     
def FindAccuracyValues(images_filename,labels_filename,zero_prob,one_prob,class_priori,classes):
  #Read the testing data labels
  testing_labels = GetLabels(labels_filename)

  #Get the testing images sliced in 28 * 28 format
  testing_images = InputImageSlicing(images_filename)

  #Find the posterior probability for test images
  test_img_likelihood=GetProbabilityValues(testing_images,zero_prob,one_prob,classes)
  all_images_posterior=GetPosteriorProbability(test_img_likelihood,class_priori)

  #Get the classification of image
  classification_result=ClassifyImage(all_images_posterior)

  #Get Accuracy for the testing file passed
  total_accuracy,correct_count,total_count=Accuracy(classification_result,testing_labels)

  return total_accuracy,correct_count,total_count,classification_result
  

################################################################
# Function - main()                                            #
# Details  - 1.Create a training model by calculating the      #
#            class prior probability and the likelihood prob.  #
#            using the training data samples.                  #
#            2. Calls the FindAccuracyValues() to find the     #
#            accuracy of model on training data and test data. #
#            3. Print the output result of classification for  # 
#            each image.                                       #
#            4. Prints the confusion matrix                    #
# Input    - NA                                                #
# Output.  - NA                                                #
################################################################ 
def main():
  total_training_data = 5000
  total_testing_data  = 1000
  classes = 10

  #URLs for the training and test files present on GITHUB
  training_images_file = 'https://raw.githubusercontent.com/snehalutage/data/master/trainingimages.txt'
  training_labels_file = 'https://raw.githubusercontent.com/snehalutage/data/master/traininglabels.txt'
  testing_images_file = 'https://raw.githubusercontent.com/snehalutage/data/master/testimages.txt'
  testing_labels_file = 'https://raw.githubusercontent.com/snehalutage/data/master/testlabels.txt'
  
  #Get the training labels list
  training_labels = GetLabels(training_labels_file)
  
  #Get the testing data labels list
  testing_labels = GetLabels(testing_labels_file)

  #Preprocessing
  #Get the training images sliced in 28 * 28 array 
  #Replace the " " with 0 and "+","#" with 1
  training_images = InputImageSlicing(training_images_file)

  #Calculate the prior probability for each class from the training images
  class_priori= GetClassPriorProbability(classes,training_labels,total_training_data)

  #Combine training image with its actual training labels
  training_data = list(zip(training_images,training_labels))
  
  #Get dictionary which has all occurrences of each class's images stored under the respective class key
  class_set=GetCombinedClassSet(training_data,classes)

  #Get the Likelihood for each class for each feature(0,1)
  zero_prob,one_prob = GetLikelihood(class_set,classes)

  #Find the accuracy of model on Training data
  train_accuracy,train_correct_count,train_total_count,train_classification_result=FindAccuracyValues(training_images_file,training_labels_file,zero_prob,one_prob,class_priori,classes)
                                
  #Find the accuracy of model on Testing data
  test_accuracy,test_correct_count,test_total_count,test_classification_result=FindAccuracyValues(testing_images_file,testing_labels_file,zero_prob,one_prob,class_priori,classes)

  #Print the accuracy
  print("\n************** OUTPUT OF TRAINED NAIVE BAYE'S CLASSIFIER MODEL **************\n")
  print("Total number of training sample images : ",train_total_count)
  print("Accuracy with the training dataset (correct/total) : {0}/{1} , {2} %\n".format(train_correct_count,train_total_count,train_accuracy))
  print("Total number of testing sample images : ",test_total_count)
  print("Accuracy with the testing dataset (correct/total) : {0}/{1} , {2} %\n".format(test_correct_count,test_total_count,test_accuracy))

  #Print the combined result of actual image class and the classified class using our model for all the Testing images
  print_outcome=CombinedOutput(test_classification_result,testing_labels)
  print("Result of classification on Testing Dataset (Tabular) :: ")
  print ("{:<10} {:<10} {:<10}".format("#IMAGE" , 'ORIGINAL', 'PREDICTED'))
  for key, value in print_outcome.items(): 
    num, original, classified = key+1,value[1],value[0] 
    print ("{:<10} {:<10} {:<10}".format(num, original, classified))

  #Print the number of images belonging to each class in the Testing file  
  count=GetCountLabels(testing_labels,classes)
  print("\nTotal count of each class image in the Testing Dataset :: \n",count)
  
  #Print the confusion matrix
  num_conf_matrix,percent_conf_matrix,rounded_conf_matrix=ConfusionMatrix(test_classification_result)
  print("\nConfusion Matrix (Numbers) :: ")
  print(num_conf_matrix)
  print("\nConfusion Matrix (Percentage) :: ")
  print(percent_conf_matrix)
  print("\nConfusion Matrix (Percentage) (rounded values) :: ")
  print(rounded_conf_matrix)
   
#Call to main() to run the python code
main()


************** OUTPUT OF TRAINED NAIVE BAYE'S CLASSIFIER MODEL **************

Total number of training sample images :  5000
Accuracy with the training dataset (correct/total) : 4214/5000 , 84.28 %

Total number of testing sample images :  1000
Accuracy with the testing dataset (correct/total) : 773/1000 , 77.3 %

Result of classification on Testing Dataset (Tabular) :: 
#IMAGE     ORIGINAL   PREDICTED 
1          9          7         
2          0          0         
3          2          2         
4          5          3         
5          1          1         
6          9          9         
7          7          7         
8          8          8         
9          1          1         
10         0          0         
11         4          4         
12         1          1         
13         7          9         
14         9          9         
15         6          4         
16         4          9         
17         2          2         
18         6          2       