# SC1015 Mini Project: Signature Forgery Detection

  Welcome to our mini project! This first section gives an overview of what our project is about. We decided to follow the data pipeline taught to us in the Data Science portion of this module:
  
  1. __Practical Motivation:__ Why do we choose this problem?
  2. __Sample Collection:__ Which dataset do we use?
  3. __Problem Formulation:__ Rephrasing the problem as a data science problem
  4. __Data Preparation:__ Steps needed to set up the dataset
  5. __Statistical Description:__ Statistics to describe our input data
  6. __Exploratory Analysis:__ What can we analyse from the statistics of our data?
  7. __Pattern Recognition:__ What patterns/features can we find from our data?
  8. __Analytic Visualization:__ How can we visualise what we're working with?
  9. __Machine Learning:__ Which model do we use to solve our problem?
  10. __Algorithmic Optimization:__ How can we optimise our algorithm? 
  11. __Statistical Inference:__ What can we infer from the results?
  12. __Information Presentation:__ Side-by-side comparison of the results

## Imports
  The cell below imports all the libraries needed for this project. Before running it, install the following prerequisites from your Python or Anaconda command prompt:
  
  * __pip install "tensorflow<2.11"__
  * __pip install opencv-python__
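
Before running the import cell, it can help to confirm that both prerequisites are visible to the current Python environment. The snippet below is a minimal sketch using only the standard library; `missing_modules` and `REQUIRED_MODULES` are hypothetical names introduced here for illustration.

```python
import importlib.util

# Import names for the two pip packages listed above:
# "tensorflow" provides the module `tensorflow`,
# "opencv-python" provides the module `cv2`.
REQUIRED_MODULES = ["tensorflow", "cv2"]


def missing_modules(names):
    """Return the module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All prerequisites found.")
```

If a module is reported missing, installing it with the matching `pip install` command above should resolve a `ModuleNotFoundError` in the import cell.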

# Import relevant libraries
import numpy as np
import os
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import matplotlib.cm as cm
from scipy import ndimage
from skimage.measure import regionprops
from skimage import io
from skimage.filters import threshold_otsu   # For finding the threshold for grayscale to binary conversion
import tensorflow as tf
import pandas as pd
from time import time
import keras
import cv2 
from PIL import Image


## Practical Motivation
  Signature forgery easily slips past an untrained eye, and not everyone has the patience to meticulously check the authenticity of thousands of documents. In fact, a Singaporean man was able to forge documents that helped him get 38 jobs in 4 years ([CNA](https://www.channelnewsasia.com/singapore/man-forge-documents-nus-degree-get-jobs-38-companies-890881), 2019).
  ![Signature Forgery](https://www.martypearce.com/wp-content/uploads/2019/01/3e406f_00f3feb43e17449c8bfacc0f850bf362mv2.jpg)
  Forgery is a real concern, and humans are bound to make mistakes when checking for forged documents, especially on a tiring day. Our group therefore aims to build a working model that can consistently detect signature forgery.

## Sample Collection
  The dataset we will be using is from Kaggle: [https://www.kaggle.com/datasets/robinreni/signature-verification-dataset](https://www.kaggle.com/datasets/robinreni/signature-verification-dataset). 
  It has the following directory structure:
* sign_data
  * test
  * train
  * test_data.csv
  * train_data.csv
  
  
1. There are 2,149 images in total (PNG).
2. The training set contains the signatures of 69 different individuals, separated into genuine and forged images.
3. The test set contains the signatures of 21 different individuals, separated into genuine and forged images.
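
Counts like these can be checked directly against the downloaded folders. The following is a minimal sketch, assuming the dataset has been extracted to a local `sign_data` folder in its original layout; `summarise_split` is a hypothetical helper, and the extension check is case-insensitive in case some files use `.PNG`.

```python
from pathlib import Path


def summarise_split(split_dir):
    """Count PNG files and distinct signers under one split (train or test)."""
    split_dir = Path(split_dir)
    n_images = sum(
        1 for p in split_dir.rglob("*")
        if p.is_file() and p.suffix.lower() == ".png"
    )
    # Forged folders carry a "_forg" suffix in the original layout,
    # so stripping it leaves one ID per individual.
    signers = {
        p.name.replace("_forg", "")
        for p in split_dir.iterdir() if p.is_dir()
    }
    return n_images, len(signers)


# Assumed local path; adjust to wherever the dataset was extracted.
root = Path("sign_data")
for split in ("train", "test"):
    if (root / split).is_dir():
        print(split, summarise_split(root / split))
```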

## Problem Formulation
  In data science terms, this is a classification problem, and we will be working with unstructured data, namely images. Phrased as a question, the problem becomes:
  
  __Is this image a genuine or forged signature?__
  
  This is a binary classification problem: each image is either genuine or forged.
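
To make the binary target concrete, each image can be mapped to a label from its folder name, since forged signatures in this dataset sit in folders ending in `_forg`. This is only a sketch: the label values `0` and `1` and the helper name `label_for` are assumptions chosen for illustration.

```python
from pathlib import Path

GENUINE, FORGED = 0, 1  # label values chosen for illustration


def label_for(image_path):
    """Return FORGED if the image sits in a '*_forg' folder, else GENUINE."""
    folder = Path(image_path).parent.name
    return FORGED if folder.endswith("_forg") else GENUINE


print(label_for("train/001/001_01.png"))           # genuine signature
print(label_for("train/001_forg/0119001_01.png"))  # forged signature
```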

## Data Preparation

Before we start with the analyses of the dataset, we first restructure the folder layout of the dataset.

### **Before** 
```
Root_Folder
└── Dataset/
    ├── test/
    │   ├── 049/
    │   │   ├── 01_049.png
    │   │   ├── 02_049.png
    │   │   └── ...
    │   ├── 049_forg/
    │   │   ├── 01_0114049.png
    │   │   ├── 01_0206049.png
    │   │   └── ...
    │   ├── 050/
    │   ├── 050_forg/
    │   └── ...
    ├── train/
    │   ├── 001/
    │   │   ├── 001_01.png
    │   │   ├── 001_02.png
    │   │   └── ...
    │   ├── 001_forg/
    │   │   ├── 0119001_01.png
    │   │   ├── 0119001_02.png
    │   │   └── ...
    │   ├── 002/
    │   ├── 002_forg/
    │   └── ...
    ├── test_data.csv
    └── train_data.csv
```

### **After**
```
Root_Folder
└── Dataset/
    ├── test/
    │   ├── forged/
    │   │   ├── 049_forg/
    │   │   │   ├── 049_forg_01.png
    │   │   │   ├── 049_forg_02.png
    │   │   │   └── ...
    │   │   ├── 050_forg/
    │   │   ├── 051_forg/
    │   │   └── ...
    │   └── real/
    │       ├── 049/
    │       │   ├── 01_049.png
    │       │   ├── 02_049.png
    │       │   └── ...
    │       ├── 050/
    │       ├── 051/
    │       └── ...
    ├── train/
    │   ├── forged/
    │   │   ├── 001_forg/
    │   │   │   ├── 001_forg_01.png
    │   │   │   ├── 001_forg_02.png
    │   │   │   └── ...
    │   │   ├── 002_forg/
    │   │   └── ...
    │   └── real/
    │       ├── 001/
    │       │   ├── 001_01.png
    │       │   ├── 001_02.png
    │       │   └── ...
    │       ├── 002/
    │       └── ...
    ├── test_data.csv
    └── train_data.csv
```


We added two folders (`real` and `forged`) inside each of the **train** and **test** folders to categorise the signatures for easier access.
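
One way this restructuring could be scripted is sketched below. It assumes the original layout shown under **Before**; `restructure_split` is a hypothetical helper, and it only moves the folders. It does not rename the forged image files as shown in the **After** tree.

```python
import shutil
from pathlib import Path


def restructure_split(split_dir):
    """Move each signer folder of one split into real/ or forged/."""
    split_dir = Path(split_dir)
    # Collect the folders first so the new real/ and forged/ folders
    # are never treated as signer folders themselves.
    signer_dirs = [
        p for p in split_dir.iterdir()
        if p.is_dir() and p.name not in ("real", "forged")
    ]
    for folder in signer_dirs:
        target = "forged" if folder.name.endswith("_forg") else "real"
        dest = split_dir / target
        dest.mkdir(exist_ok=True)
        shutil.move(str(folder), str(dest / folder.name))


# Assumed local path; adjust to the actual dataset location.
for split in ("train", "test"):
    if (Path("Dataset") / split).is_dir():
        restructure_split(Path("Dataset") / split)
```

Running it a second time leaves the layout unchanged, because the `real` and `forged` folders are skipped.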

Next, we moved on to the actual preparation of the data. As the data is already split into **train** and **test** sets, no further splitting is needed.

Additionally, no data cleaning is required, as the images do not contain any noise (they are high-definition images of different signatures). All the signatures have also already been categorised by individual, so no further classification of the images is needed.

However, we do need to convert each image into numerical values so that the extracted data can be processed.

We came up with two ways to convert the images into numerical values:

1. Converting images into a 2D array of RGB values
2. Converting images into different properties (covered in more depth later)

These 2 approaches will be used in the different solutions to address our problem.

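### Image Conversion Method 1 - Converting images into a 2D array of RGB values

This is done with the existing `cv2` library (OpenCV): `imread` reads an image file and returns its pixel values as a NumPy array. Note that `imread` loads colour images in BGR channel order by default, so the channels have to be reordered if RGB order is needed.

A detailed example is shown below.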

```python
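import cv2

# Path taken from the restructured layout above
filename = "Dataset/train/real/001/001_01.png"
img = cv2.imread(filename)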
"""
Example value of img:

[[250 250 250 ... 250 250 250]
 [250 250 250 ... 250 250 250]
 [250 250 250 ... 250 250 250]
 ...
 [250 250 250 ... 250 250 250]
 [250 250 250 ... 250 250 250]
 [250 250 250 ... 250 250 250]]
"""
```

However, as our images have different dimensions, we need to resize (scale) them so that they share the same dimensions. An example of the scaling function is shown below.

```python
import cv2


def image_resize(imageA, imageB):
    """
    imageA: pixel values of the first image
    imageB: pixel values of the second image
    return: both images, with the smaller one scaled up
            to the dimensions of the bigger one
    """
    if imageA.shape != imageB.shape:
        # Compare by pixel count (height * width). Comparing the shape
        # tuples directly would be lexicographic: height first, then width.
        area_a = imageA.shape[0] * imageA.shape[1]
        area_b = imageB.shape[0] * imageB.shape[1]
        if area_a < area_b:
            # cv2.resize expects the target size as (width, height)
            imageA = cv2.resize(imageA, (imageB.shape[1], imageB.shape[0]))
        else:
            imageB = cv2.resize(imageB, (imageA.shape[1], imageA.shape[0]))

    return imageA, imageB
```

## Statistical Description

## Exploratory Analysis

## Pattern Recognition

## Analytic Visualization

## Machine Learning

## Algorithmic Optimization

## Statistical Inference

## Information Presentation