<a href="https://colab.research.google.com/github/sherifmost/CSED2021_Projects/blob/master/Face_Recognition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **About the project**

This project was done for the Information Systems and Software course. Our objective is to perform face recognition on the ORL dataset using PCA and LDA for dimensionality reduction together with KNN classifiers, and to test the resulting accuracy.

# **About the data**

We used the ORL dataset for face recognition, which contains 40 different subjects with 10 images each. Each image is a 92x112 image in PGM (Portable Gray Map) format. Images are classified by being placed in different directories: the images in folder sx belong to subject number x (x between 1 and 40). An image for a given subject is named Y.pgm, where Y is the image number (between 1 and 10).
Credits to *AT&T Laboratories Cambridge* for providing the data.
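
As a rough illustration of the layout described above, the sketch below builds the path of one image and shows the size of a flattened image and of the full data matrix. The helper name `image_path` and the root folder name are hypothetical, and an all-zero array stands in for a real image (assuming 92 is the width and 112 the height; the flattened length is the same either way).

```python
import os
import numpy as np

def image_path(root, subject, image):
    # hypothetical helper: folder s<subject>, file <image>.pgm
    return os.path.join(root, f"s{subject}", f"{image}.pgm")

example = image_path("Data set", 3, 7)

# placeholder standing in for one grayscale image (112 rows, 92 columns)
placeholder = np.zeros((112, 92), dtype=np.uint8)
flat = placeholder.reshape(-1)

n_features = flat.shape[0]   # 92 * 112 pixels per image
n_images = 40 * 10           # 40 subjects, 10 images each
```

Under these assumptions the data matrix has `n_images` rows and `n_features` columns.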

# **Needed library imports**

In [1]:
from google.colab import drive
# used to manipulate the folders containing the images and read them out
import os
import matplotlib.image as mpimg 
import numpy as np
import matplotlib.pyplot as plt
from numpy import linalg as lg 
from sklearn.neighbors import KNeighborsClassifier

# **Labels and constants**

In [2]:
# file paths
path_data = '/content/drive/My Drive/Information systems/Assignment 1/Data set';

# symbols
delim = '/';

# image dimensions
image_len = 92;
image_width = 112;

# constant numbers
training = 1;
testing = -1;
num_subjects = 40;

# **Helper functions**

## Helper functions for manipulating data

In [3]:
# functions used as keys for sort function
def numeric_key_folders(x):
  return int(x[1:]);
def numeric_key_images(x):
  # we want the number before .pgm, so remove the last 4 characters from the string
  return int(x[0:(len(x)-4)]);
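
The numeric keys matter because a plain string sort orders names character by character. The sketch below shows the difference on a few made-up names; `numeric_key_any_extension` is a hypothetical variant that strips the extension with `os.path.splitext` instead of assuming it is exactly 4 characters long.

```python
import os

names = ["s10", "s2", "s1"]
lexicographic = sorted(names)                            # string order
numeric = sorted(names, key=lambda name: int(name[1:]))  # subject-number order

def numeric_key_any_extension(filename):
    # hypothetical variant: drop the extension, whatever its length
    stem, _ext = os.path.splitext(filename)
    return int(stem)

images = sorted(["10.pgm", "2.pgm", "1.pgm"], key=numeric_key_any_extension)
```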

## Helper functions for LDA dimensionality reduction

In [4]:
# function calculates the class means (mean of each class) given an array containing the number of samples per class (assuming the data matrix is sorted accordingly)
def get_class_means(D,num_samples):
  means = [];
  # keep the beginning of the class from which the mean is calculated
  class_begin = 0;
  for curr_num in num_samples:
    curr_num = int(curr_num);
    means.append(np.mean(D[class_begin : class_begin + curr_num,:],axis = 0));
    class_begin = class_begin + curr_num;
  return np.array(means);
# function that calculates the S_b matrix given the number of samples for each class and the class means
def get_S_b(num_samples,class_means):
  # calculating the overall sample mean as the mean of the class means weighted by the class sizes
  # (equal to the plain mean of the class means when all classes have the same size)
  overall_mean = np.average(class_means,axis = 0,weights = num_samples);
  # S_b has the same dimensions as B, which are d x d (where d is the number of dimensions, i.e. the length of a flattened image)
  S_b = np.zeros(shape = (overall_mean.shape[0],overall_mean.shape[0]));
  # looping to calculate S_b
  for i in range(0,num_samples.shape[0]):
    diff = (class_means[i,:] - overall_mean).reshape(overall_mean.shape[0],1);
    S_b = S_b + num_samples[i] * np.dot(diff,diff.T);
  return S_b;
# function that centers the data given the data matrix, the class means and number of samples per class
def get_data_centered(D,num_samples,class_means):
  data_centered = [];
  class_begin = 0;
  mean_location = 0; 
  for curr_num in num_samples:
    curr_num = int(curr_num);
    data_centered.append(D[class_begin : class_begin + curr_num,:] - class_means[mean_location,:]);
    mean_location = mean_location + 1;
    class_begin = class_begin + curr_num;
  # stacking the per-class blocks back into one matrix (this also works when classes have different sizes)
  return np.vstack(data_centered);
# function that obtains S matrix given the centered data and the number of samples per class
def get_S(D_centered,num_samples):
  S = np.zeros(shape = (D_centered.shape[1],D_centered.shape[1]));
  class_begin = 0;
  for curr_num in num_samples:
    curr_num = int(curr_num);
    S = S + D_centered[class_begin : class_begin + curr_num].T @ D_centered[class_begin : class_begin + curr_num];
    class_begin = class_begin + curr_num;
  return S;
# function given a matrix returns the n dominant eigen vectors
def get_dom_eig_vec(mat,n):
  # Getting eigen values and eigen vectors
  # As we don't know whether mat is symmetric or not, eig is used.
  eig_val,eig_vec = lg.eig(mat);
  # As dominant eigen vectors should be taken according to the magnitude of eigen values 
  # (as negative values only indicate reverse of the vector direction), we should consider the
  # absolute value of the eigen values.
  eig_val = np.absolute(eig_val);
  # using argsort to get dominant eigen vectors according to largest eigen values
  sorted_indices = eig_val.argsort()[::-1];
  eig_vec = eig_vec[:,sorted_indices];
  # when checking the results, eigen vectors may include imaginary parts.
  # we are only concerned with the real parts
  eig_vec = np.real(eig_vec);
  # getting first n dominant eigen vectors to be the projection matrix
  P = eig_vec[:,:n];
  return P;
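
One way to sanity-check scatter matrices like the ones computed by these helpers is the identity S_t = S_w + S_b, where S_t is the total scatter around the overall mean. The sketch below verifies it on small random data; all names are this sketch's own (S_w plays the role of S above), and the overall mean is taken over all samples, which coincides with the mean of the class means when every class has the same size.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [4, 6, 5]   # unequal class sizes on purpose
D = np.vstack([rng.normal(loc=c, size=(n, 3)) for c, n in enumerate(sizes)])
labels = np.repeat(np.arange(len(sizes)), sizes)

overall_mean = D.mean(axis=0)
S_w = np.zeros((3, 3))   # within-class scatter
S_b = np.zeros((3, 3))   # between-class scatter
for c, n in enumerate(sizes):
    X = D[labels == c]
    mu = X.mean(axis=0)
    S_w += (X - mu).T @ (X - mu)
    diff = (mu - overall_mean).reshape(-1, 1)
    S_b += n * (diff @ diff.T)

# total scatter around the overall mean
S_t = (D - overall_mean).T @ (D - overall_mean)
identity_holds = np.allclose(S_t, S_w + S_b)
```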


# **Obtaining the data and cleaning it**

In [5]:
# I uploaded the data to Google Drive as a zip file in order to use it here
# Mounting the drive
drive.mount('/content/drive/')


Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [None]:
# unzipping the file, to be run only once
!unzip '/content/drive/My Drive/Information systems/Assignment 1/orl_dataset.zip' -d '/content/drive/My Drive/Information systems/Assignment 1/Data set'

## Reading the data to generate the data matrix and the label vector

In [6]:
def generate_data():
  # first obtaining the directories and sorting them
  subjects_dir = os.listdir(path_data);
  # Note that I manually removed the readme file from my drive after unzipping the data
  # sorting the directories to obtain the subjects' data sorted from 1 to 40
  subjects_dir.sort(key = numeric_key_folders);
  # converting the images to the flattened format and filling the D and Y matrices as required
  D = [];
  Y = [];
  flatten_dim = image_len * image_width;
  for current_dir in subjects_dir:
    current_label = numeric_key_folders(current_dir);
    subject_images = os.listdir(path_data + delim + current_dir);
    # sorting the images to obtain the current subject's images sorted from 1 to 10
    subject_images.sort(key = numeric_key_images);
    for current_image in subject_images:
      # image is reshaped to be flattened as a vector
      D.append(mpimg.imread(path_data + delim + current_dir + delim + current_image).reshape(flatten_dim));
      Y.append(current_label);
  return np.array(D), np.array(Y);
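
Instead of removing the readme file by hand, the directory listing could be filtered in code. The following is a minimal sketch, not the approach used in this notebook: `list_subject_dirs` is a hypothetical helper, and a temporary directory with made-up entries stands in for the dataset folder.

```python
import os
import tempfile

def list_subject_dirs(root):
    # keep only sub-directories named like "s<number>", sorted numerically
    dirs = [
        name for name in os.listdir(root)
        if os.path.isdir(os.path.join(root, name))
        and name.startswith("s") and name[1:].isdigit()
    ]
    return sorted(dirs, key=lambda name: int(name[1:]))

with tempfile.TemporaryDirectory() as root:
    for name in ["s10", "s2", "s1"]:
        os.mkdir(os.path.join(root, name))
    # a stray file that should be ignored
    with open(os.path.join(root, "README"), "w") as handle:
        handle.write("placeholder")
    found = list_subject_dirs(root)
```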

## splitting the data and labels to training and testing

In [7]:
# this function splits the data into training and testing sets by alternating between the two, as controlled by train_each, test_each and start.
# to send the rows at odd indices (1, 3, 5, ...) to training and the rows at even indices (0, 2, 4, ...) to testing,
# make train_each = test_each = 1 and start = testing (as matrices and vectors are 0 indexed).
def split_data(D,Y,train_each = 1,test_each = 1,start = testing):
  # flag checks whether data is in training or testing
  destination = start;
  # counter checks how many samples were taken
  taken = 0;
  D_train = [];
  Y_train = [];
  num_samples_train = np.array(np.zeros(num_subjects));
  D_test = [];
  Y_test = [];
  num_samples_test = np.array(np.zeros(num_subjects));
  for i in range(0, Y.shape[0]):
    taken = taken + 1;
    if destination == training:
      D_train.append(D[i,:]);
      Y_train.append(Y[i]);
      num_samples_train[Y[i] - 1] = num_samples_train[Y[i] - 1] + 1; 
      if taken == train_each:
        destination = testing;
        taken = 0;
    else:
      D_test.append(D[i,:]);
      Y_test.append(Y[i]);
      num_samples_test[Y[i] - 1] = num_samples_test[Y[i] - 1] + 1; 
      if taken == test_each:
        destination = training;
        taken = 0;
  return np.array(D_train),np.array(Y_train),num_samples_train,np.array(D_test),np.array(Y_test),num_samples_test;    
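
For the default alternating split (first row to testing, second to training, and so on), the same partition can be written with NumPy slicing. This is a sketch on made-up toy data, not a replacement for the general function above.

```python
import numpy as np

# toy data: 2 subjects, 4 images each, 3 features per image
D = np.arange(8 * 3).reshape(8, 3)
Y = np.repeat([1, 2], 4)

# rows 0, 2, 4, ... go to testing; rows 1, 3, 5, ... go to training
D_test, Y_test = D[0::2], Y[0::2]
D_train, Y_train = D[1::2], Y[1::2]

# number of training samples per subject (labels start at 1)
train_counts = np.bincount(Y_train)[1:]
```

Because each subject has an even number of images, the alternation never drifts across subject boundaries, so every subject contributes the same number of rows to each set.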
        


# **Dimensionality reduction using LDA**

In [14]:
def reduce_lda(D_train,D_test,num_samples_train):
  # getting the class means for the training data
  means_class_train = get_class_means(D_train,num_samples_train);
  # getting S_b (which replaces the between-class scatter matrix B in case of multiclass LDA)
  S_b_train = get_S_b(num_samples_train,means_class_train); 
  # getting S using centered data and number of samples
  S_train = get_S(get_data_centered(D_train,num_samples_train,means_class_train),num_samples_train);
  # getting S^-1 * S_b
  # pinv is used to overcome numerical errors (by approximation); S is also singular here,
  # as there are far fewer training samples than dimensions.
  # when trying this, I found that the results are not correct (numbers don't make sense) unless pinv is told that the matrix is
  # hermitian, which holds because S is symmetric by construction.
  mul_res_train = lg.pinv(S_train,hermitian = True) @ S_b_train;
  # getting the dominant m - 1 eigen vectors where m is number of classes as the projection matrix
  P = get_dom_eig_vec(mul_res_train,num_samples_train.shape[0] - 1);
  # finally we just return the data in reduced form
  return np.real(D_train @ P), np.real(D_test @ P);
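
The choice of m - 1 eigen vectors follows from the rank of S_b: it is built from m class means centered around their overall mean, so at most m - 1 eigen values of S^-1 * S_b can be non-zero. The sketch below checks this on small random data; the variable names and the relative threshold of 1e-8 are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, n = 3, 5, 20   # classes, features, samples per class
X = [rng.normal(loc=3 * c, size=(n, d)) for c in range(m)]

means = np.array([x.mean(axis=0) for x in X])
overall = np.vstack(X).mean(axis=0)
S_w = sum((x - mu).T @ (x - mu) for x, mu in zip(X, means))
S_b = sum(n * np.outer(mu - overall, mu - overall) for mu in means)

# eigen value magnitudes of S_w^-1 S_b, largest first
eig_val = np.abs(np.linalg.eigvals(np.linalg.pinv(S_w) @ S_b))
eig_val = np.sort(eig_val)[::-1]

# count the eigen values that are not numerically zero
n_significant = int(np.sum(eig_val > 1e-8 * eig_val[0]))
```

Here `n_significant` should not exceed `m - 1`, which is why keeping more than m - 1 directions adds no discriminative information.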


# **Classification using KNN**

In [8]:
# given the train data, train labels, test data, test labels and number of neighbours 
# it returns the test accuracy using KNN with this number of neighbours.
# for the tie breaking, we decided to keep the default strategy.
# for weights we used distance as a parameter, so that the nearer the neighbour the more impact it has
# on the classification.
def classify_KNN(D_train,Y_train,D_test,Y_test,n_neigh = 1):
  classifier = KNeighborsClassifier(n_neighbors = n_neigh, weights = 'distance');
  classifier.fit(D_train,Y_train);
  acc_test = classifier.score(D_test, Y_test);
  return acc_test;
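
As a small usage sketch of the same classifier settings, the example below fits distance-weighted KNN on two well-separated made-up clusters and records the test accuracy for several values of k. The data and variable names are illustrative only.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# two well-separated clusters with labels 1 and 2
X_train = np.array([[0, 0], [0, 1], [1, 0],
                    [10, 10], [10, 11], [11, 10]], dtype=float)
y_train = np.array([1, 1, 1, 2, 2, 2])
X_test = np.array([[0.5, 0.5], [10.5, 10.5]])
y_test = np.array([1, 2])

accuracies = {}
for k in (1, 3, 5):
    # weights='distance' makes closer neighbours count more in the vote
    clf = KNeighborsClassifier(n_neighbors=k, weights="distance")
    clf.fit(X_train, y_train)
    accuracies[k] = clf.score(X_test, y_test)
```

With k = 5 the vote includes neighbours from the far cluster, but their inverse-distance weights are small, so the nearby points still decide the label.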

# **Scripts used to run the functions and give the required outputs**

In [9]:
# script to generate the data and split it
D,Y = generate_data(); 
D_train,Y_train,n_train,D_test,Y_test,n_test = split_data(D,Y);

In [15]:
# script to perform the LDA reduction
D_train_reduced,D_test_reduced = reduce_lda(D_train,D_test,n_train);

In [16]:
# script to get the KNN accuracy for LDA using first NN
acc_test = classify_KNN(D_train_reduced,Y_train,D_test_reduced,Y_test);
print('accuracy of LDA using first nearest neighbour:');
print(acc_test);

accuracy of LDA using first nearest neighbour:
0.96
