# Building a Random Forest for photo classification using Scikit-learn: An introduction 


In this notebook, we build a Random Forest Classifier (RFC) in an attempt to classify bird species using scikit-learn and seaborn. 

Assumed background: Beginners in Machine Learning, some basic python knowledge (although not mandatory).

Note, using RFCs to classify images is not the standard method of practice, but is useful to provide insights and comparisons between Convolutional Neural Networks (CNNs) and traditional methods. See ../Q2_CNN/Q2_CNN_Birds.ipynb for CNN methods. 

In [None]:
# First, this will ensure that your machine has the corrext packages
!pip install -r ../dependancies_Q1.txt

## Introduction
Image classification is a common and very useful technique in computing, and this skill is becoming increasingly in demand. Whilst Convolutional Neural Networks are a popular choice for image classification, this notebook combined with ../Q2_CNN/Q2_CNN_Birds.ipynb will provides insights into both traditional methods and CNNs. Before we begin, below is a quick overview of the dataset we will be using and how we will build the Random Forest Classifier:

Random Forest classifiers (RFCs) are a traditional Machine learning methods that identifies patterns in data by building many decision trees. Each tree infers the data and the model combines these to classify new unseen data. A classification tree splits data at each step by seperating the group by a certain feature. Each path from the root (top) to the leaf (bottom) represents rules used to classsify data. 

A useful thought experiment: Think about a room full of people. There is a flowchart that asks yes/no questoions - the first question is "Do you have brown hair?" - based on the answer you give, you are instructed to move to the left or to the right. There are now 2 groups, those with brown hair and those with not. Both groups now get asked further questions that split them up repeatidly until a certain criteria is met (eg how many people in a single group). This is similar to how RFCs work. 


Dataset:
- We will use the Birds classification dataset from Rahma Sleam, Kaggle. This will remain constant across Q1, Q2 and Q3. 
- There are 6 types of bird we need to predict, these are called "classes" or "Targets" as they are what we want to predict givien the features of an image.
    - American Goldfinch
    - Barn Owl
    - Carmine Bee-Eater
    - Downy Woodpecker
    - Emperor Penguin
    - Flamingo
- All these birds have their own distinct features as they range over across a wide range of habitats - see link for more detail.
- This will be very beneficial for our decision tree classifier. 

We Will:
- Load an image dataset through extracting hog features
- Explore how the dataset is distributed among each class (ie, how many images there are for each dataset)
    - This will help us limit any biases
- Perform Preprocessing
- Split data into testing and training
- Train a RFC
- Evaluate its performance

_Link to dataset: https://www.kaggle.com/datasets/rahmasleam/bird-speciees-dataset All rights to their respective owners_

## Importing Libraries

Here, we import the revelent packages to help us visualise and manipulate data

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import os
import random
import numpy as np
import cv2

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from skimage.feature import hog
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc

## Feature extraction using HOGs
Traditional ML models cant work directly with images, therefore we use Histograms of Orientated Gradients (HOGs) to interpret edges, texture, shape and convert these findings to numerical feature vectors that then can be put into the model

In [3]:
def extract_hog_features(image):
    ## Take a single image as an input and extractes its hog features
    ## hog(...): Histogram of Orientated Gradients (HOG)
    ## orientations = 9: Gradient directions are divided into 9 bins
    ## Pixel per cell = (8,8): Each cell is used to compute a HOG
    hog_features = hog(
        image,
        orientations = 9,
        pixels_per_cell = (8, 8),
        cells_per_block = (2, 2),
        visualize=False
    )
    return hog_features

Now, we loop through each class folder, resize them to be consistent with CNN notebook, then extract the HOG features.

In [4]:
def load_and_extract_features(directory):
    ## loads and extracts features
    ## Directory: path to dataset
    ## Create empty lists
        ## x = list of HOG features
        ## y = list of class labels
    x, y = [], []
    for label, class_name in enumerate(os.listdir(directory)):
        ## Loops through each folder in the directory
        ## os.listdir(directory) returns folder names (each folder is a class)
        ## enumerate: changes class label from string to integer
        ## class_dir: stores directory to each class
        class_dir = os.path.join(directory, class_name)
        for filename in os.listdir(class_dir):
            ## Loops through each image in the class folder
            ## image_path: Stores the full direcotry
            ## img = cv2....: Reads the image using OpenCV, result numpy array
            ## if img is None: Skips any images that fail to load
            ## img_resized = resizes images to 128, 128 to be fair with CNN Notebook
            ## hog_features: Calls the function defined in the cell above
            ## Appends x and y with features and labels 
            image_path = os.path.join(class_dir, filename)
            img = cv2.imread(image_path)
            if img is None:
                continue
            img_resized = cv2.resize(img, (128,128))
            img_gray = cv2.cvtColor(img_resized, cv2.COLOR_BGR2GRAY) 
            ## HOG is better suited for grayscale images and sees better performance this way
            ## Hence the deviation from the color images used in the CNNs
            hog_features = extract_hog_features(img_gray)
            x.append(hog_features)
            y.append(label)
    return np.array(x), np.array(y)

## Loading data
Now we can use the two helper functions defined above to load and extract the hog features ready to be processed

In [5]:
data_path = "../../birds_dataset/Bird Speciees Dataset/"
x, y = load_and_extract_features(data_path)
print(">>> Finished loading data")

>>> Finished loading data


## Train / Test Split

We now will split the dataset into a 80% : 20% split for training and testing
- 80% of the total data will be used to train the model
- 20% of the total data will be used to test the model and produce accuracy scores

This is common practice in ML because:
- It can help to ensure that the RFC is not simply memorising patterns from training. This helps to prevent overfitting:
- Overfitting occurs when the RFC fits the training data too closely, that it cannot provide good predictions on unseen data

In [6]:
x_train, x_test, y_train, y_test = train_test_split(
    ## Using sklearn's built in data splitter
    ## Test size = 20% (Consistent with CNN)
    ## Train size = 80% (consistent with CNN)
    ## Random state is set to promote reproducability
    ## Stratify ensures that class distribution in train and test sets matches original
    x, y,
    test_size = 0.2,
    random_state = 42,
    stratify = y
)


## Standard Scaling
RFC models work more efficiently when the images are scaled, therefore we use standard scaler on HOG feature so that mean = 0 and std = 1.

In [7]:
## Using the standard scaler from scikit-learn
## Normalises data so that mean = 0 and standard deviation = 1
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.fit_transform(x_test)

## Definition of Random Forest Classifier

Now that the data has been preprocessed, we can now define our model arcitecture

In [8]:
rf_clf = RandomForestClassifier(
    ## Definition of the Random Forest Classifier
    ## n_estimators = 100: Number of trees in the forest
    ## n_jobs = 1: number of CPU cores
    ## max_depth: Limits the depth of each decision tree, prevents overfitting
    n_estimators = 100,
    random_state = 42,
    n_jobs = 1,
    max_depth = 5
)

## Training the random forest classifier on the training dataset
print(">>> Training Random Forest Classifier")
rf_clf.fit(x_train_scaled, y_train)
print(">>> Training complete")

>>> Training Random Forest Classifier
>>> Training complete


## Model Testing and evaluation
Now that our model is trained, its time to test it on the testing dataset (Remember we split the data into 80% training and 20% testing). We will print out the accuract scores for each class in the next section

In [9]:
## Testing Random Forest Classifier
## y_pred: array of predicted classes
## accuracy_score = accuracy = 100* (# Correct predictions / # Total predictions)
print(">>> Testing model")
y_pred = rf_clf.predict(x_test_scaled)
print(">>> Testing completed, results follow")

>>> Testing model
>>> Testing completed, results follow


# Interpreting results and evaluating performance

Now that we have trained our Random Forest Classifier, we can evaluate how well it performed during testing

In [11]:
## Defining some variables that will help us to interprest results
## target_names: names of each class
data_path = "../../birds_dataset/Bird Speciees Dataset/"
target_names = sorted(os.listdir(data_path))

## Classification Report

A classification report gives a summary on how the Random Forest Classifier performed for each class.

Precision = # True Positive / # True and False Positives. Low precision would mean tha tthe model often confuses other birds for the true result

Recall = # True Positive / # True Positive + # False Negative. High recall would mean that the model only gets a few false negatives (does not miss many real exampled)

F1-Score = 2 x Precision*Recall / Precision+Recall. Balance between precision and recall

Support = How many images of birds where in the test set. Useful to elimate biases if needed

In [12]:
## Classification Report
## class_report: uses the built in classification report
## accuracy: uses accuracy_score to obtain the overall perfomance: 100*(Correct predictions / Total predictions)
class_report = classification_report(y_test, y_pred, target_names=target_names)
print(">>> Classification Report\n")
print(class_report)
accuracy = 100*accuracy_score(y_test, y_pred)
print(f"Overall accuracy: {accuracy:.4}%")

>>> Classification Report

                    precision    recall  f1-score   support

AMERICAN GOLDFINCH       0.59      0.66      0.62        29
          BARN OWL       0.56      0.38      0.45        26
 CARMINE BEE-EATER       0.52      0.42      0.47        26
  DOWNY WOODPECKER       0.50      0.61      0.55        28
   EMPEROR PENGUIN       0.71      0.79      0.75        28
          FLAMINGO       0.41      0.42      0.42        26

          accuracy                           0.55       163
         macro avg       0.55      0.55      0.54       163
      weighted avg       0.55      0.55      0.55       163

Overall accuracy: 55.21%


Interpretation:

American Goldfinch
- High recall - Catches most American Goldfinches
- Sometimes predicts goldfinches when it is not

Barn Owl
- Low recall - Misses many Barn Owls
- Low precision - May get confused with birds of similar feature

Carmine Bee-eater
- High precision - catches most Bee-eater
- Fairly reasonable recall - Misses some bee eaters

Downy Woodpecker
- Fairly high recall - catches most woodpeckers
- Low precision - confuses many other birds as woodpeckers

Emperor Penguin
- High precision and recall due to distinct features

Flamingo
- Moderaate precision and recall.
- Struggles to distinguish flamingoes from rest of set. 




## Confusion Matrix
A confusion matrix shows the number of true positives, true negatives, false positives and false negatives between testing and training

Rows: True classes in order of American Goldfinch, Barn Owl, Carmine Bee-eater, Downy Woodpecker, Emperor Penguin and Flamingo
Columns: Prediction classes of the same order as above

The number in the leading diagnonal show correct predictions, any other entry is incorrect.

In [13]:
## Confusion matrix
## conf_matrix uses the built in function to compute the confusion matrix between y_test and y_pred
conf_matrix = confusion_matrix(y_test, y_pred)
print(">>> Confusion matrix\n")
print(conf_matrix)


>>> Confusion matrix

[[19  1  3  2  2  2]
 [ 1 10  2  9  1  3]
 [ 7  1 11  2  2  3]
 [ 1  4  2 17  1  3]
 [ 0  0  0  1 22  5]
 [ 4  2  3  3  3 11]]


## Interpretting the confusion matrix
HOG focuses on Edges, Shapes and main silhouettes and largely ignores fine texture and color. It captures couarse texture but not fine-grain texture or spatial hierarchies. Therefore, the model performs better on classes that have very distinct features such as the Emperor Penguin. Classes with more subtle features can become confused, for example Barn Owls are often mistaken for Downy Woodpackers, suggesting that edge-based features alone are often insufficient for fine image classification. This is a key limitation for Random Forest Classifiers used with HOG extraction. 

This limitation is well documented alongside the fact that Random Forests do not learn hierarchial visual features. However, there is an alternative method - Convolutional Neural Networks, To view how we approach this, see notebook ../Q2_CNN/Q2_CNN_Birds.ipynb .

# References used in this notebook

**Code inspired by**
1. Class Notes and lecture notes: 24th December to 5th December, Coleman K.
    - General workflow
    - Specific inspiration for the definition of 
        - Transforms
        - Loading data
        - General information regarding Radom Forest Classifiers
2. Interpreting Random Forest Results, Geeks for Geeks
    - Aid in analysing results from the Random Forest
    - Link: https://www.geeksforgeeks.org/machine-learning/interpreting-random-forest-classification-results/
3. HOG Feature Visualisation
    - Research into what HOG is and how to use it - Justification for using gray scale
    - Link: https://www.geeksforgeeks.org/machine-learning/hog-feature-visualization-in-python-using-skimage/
4. Random Forest Classifiers, six sided dice
    - Research into the justification for using RFCs for image classification
    - Aid in coding general workflow
    - Link: https://www.sixsideddice.com/Blog/MLByExample/RandomForestsForImageClassification.html
5. Random Forest for Image Classification using open cv
    - How to use open cv
    - Link: https://www.geeksforgeeks.org/machine-learning/random-forest-for-image-classification-using-opencv/
6. Random Forest Classifier using scikit learn
    - General workfloy - how to run the forest classifier
    - Link:https://www.geeksforgeeks.org/dsa/random-forest-classifier-using-scikit-learn/
7. How to import Kaggle datasets
    - Research into how to import kaggle datasets
    - Link: https://www.geeksforgeeks.org/python/how-to-import-kaggle-datasets-directly-into-google-colab/
