# Building a Random Forest for photo classification using Scikit-learn: An introduction 


In this notebook, we build a Random Forest Classifier (RFC) in an attempt to classify bird species using scikit-learn and seaborn. 

Assumed background: beginners in machine learning. Some basic Python knowledge is helpful but not mandatory.

Note: using RFCs to classify images is not standard practice, but it gives useful insight into how traditional methods compare with Convolutional Neural Networks (CNNs). See ../Q2_CNN/Q2_CNN_Birds.ipynb for the CNN approach.

## Introduction
Image classification is a common and useful technique in computing, and the skill is increasingly in demand. Whilst Convolutional Neural Networks are a popular choice for image classification, this notebook, combined with ../Q2_CNN/Q2_CNN_Birds.ipynb, provides insight into both traditional methods and CNNs.

Before we begin, below is a quick overview of how Random Forest Classifiers work, the dataset we will be using, and the steps we will follow:

Random Forest Classifiers (RFCs) are a traditional machine learning method that identifies patterns in data by building many decision trees. Each tree is trained on the data and makes its own prediction, and the model combines the trees' predictions, typically by a majority vote, to classify new, unseen data. A classification tree splits the data at each step by separating it according to a certain feature. Each path from the root (top) to a leaf (bottom) represents the rules used to classify the data.

A useful thought experiment: think about a room full of people. A flowchart asks yes/no questions. The first question is "Do you have brown hair?", and based on your answer you are instructed to move to the left or to the right. There are now two groups: those with brown hair and those without. Both groups are then asked further questions that split them up repeatedly until a certain criterion is met (e.g. a limit on how many people are in a single group). This is similar to how each decision tree in an RFC works.
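
The thought experiment can be sketched in a few lines of plain Python. This is only an illustration: the person, the questions and the three hand-written "trees" are hypothetical, whereas a real RFC learns its questions from the training data.

```python
from collections import Counter

# Hypothetical "trees": each one is a hand-written chain of yes/no questions
def tree_a(person):
    return "group 1" if person["brown_hair"] else "group 2"

def tree_b(person):
    if person["height_cm"] > 170:
        return "group 1" if person["brown_hair"] else "group 2"
    return "group 2"

def tree_c(person):
    return "group 1" if person["wears_glasses"] else "group 2"

def forest_predict(person, trees):
    # Every tree votes and the most common answer wins
    votes = [tree(person) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

person = {"brown_hair": True, "height_cm": 180, "wears_glasses": False}
prediction = forest_predict(person, [tree_a, tree_b, tree_c])
print(prediction)
```

Two of the three trees answer "group 1", so that is the forest's answer. scikit-learn's forest averages the trees' predicted class probabilities rather than counting hard votes, but the idea is the same.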


Dataset:
- We will use the Birds classification dataset from Rahma Sleam, Kaggle. This will remain constant across Q1, Q2 and Q3. 
- There are 6 types of bird to predict. These are called "classes" or "targets", as they are what we want to predict given the features of an image.
    - American Goldfinch
    - Barn Owl
    - Carmine Bee-Eater
    - Downy Woodpecker
    - Emperor Penguin
    - Flamingo
- These birds have distinct features, as they come from a wide range of habitats (see the dataset link below for more detail).
- This should be beneficial for our tree-based classifier.

We will:
- Load the image dataset and extract HOG (Histogram of Oriented Gradients) features
- Explore how the dataset is distributed among the classes (i.e. how many images there are in each class)
    - This will help us limit any biases
- Perform preprocessing
- Split the data into training and testing sets
- Train an RFC
- Evaluate its performance
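
As a sketch of the class-distribution check in the list above, the number of images per class can be counted from a label array. The labels below are made-up toy values standing in for the real labels, and the `class_names` list assumes that labels 0 to 5 follow the alphabetical order of the class list above.

```python
import numpy as np

# Toy labels standing in for the real label array (hypothetical values)
y_toy = np.array([0, 0, 1, 2, 2, 2, 3, 4, 4, 5])

# Assumed label order: alphabetical, matching the class list above
class_names = [
    "American Goldfinch", "Barn Owl", "Carmine Bee-Eater",
    "Downy Woodpecker", "Emperor Penguin", "Flamingo",
]

# np.unique with return_counts=True gives each label and how often it occurs
labels, counts = np.unique(y_toy, return_counts=True)
distribution = {class_names[l]: int(c) for l, c in zip(labels, counts)}

for name, count in distribution.items():
    print(f"{name}: {count} images")
```

A strongly uneven distribution would suggest that overall accuracy alone could be misleading, which is one reason to check it before training.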

_Link to dataset: https://www.kaggle.com/datasets/rahmasleam/bird-speciees-dataset All rights to their respective owners_

## Importing Libraries

Here, we import the relevant packages to help us load, manipulate and visualise the data, and to build the model.

In [9]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import os
import random
import numpy as np
import cv2

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from skimage.feature import hog

In [10]:
def extract_hog_features(image):
    ## Takes a single grayscale image as input and extracts its HOG features
    ## hog(...): Histogram of Oriented Gradients (HOG)
    ## orientations = 9: gradient directions are divided into 9 bins
    ## pixels_per_cell = (8, 8): a gradient histogram is computed for each 8x8-pixel cell
    ## cells_per_block = (2, 2): histograms are normalised over blocks of 2x2 cells
    hog_features = hog(
        image,
        orientations = 9,
        pixels_per_cell = (8, 8),
        cells_per_block = (2, 2),
        visualize=False
    )
    return hog_features
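
To get a feel for what `hog` returns, the length of the feature vector can be worked out by hand. The sketch below assumes a 128x128 grayscale input (the size used later in this notebook) and blocks that slide one cell at a time, which is how skimage's `hog` lays out its blocks. The helper name `hog_feature_length` is hypothetical.

```python
def hog_feature_length(image_size, pixels_per_cell, cells_per_block, orientations):
    # Number of cells along each axis
    cells_y = image_size[0] // pixels_per_cell[0]
    cells_x = image_size[1] // pixels_per_cell[1]
    # Blocks overlap and move one cell at a time
    blocks_y = cells_y - cells_per_block[0] + 1
    blocks_x = cells_x - cells_per_block[1] + 1
    # Each block holds one histogram per cell it covers
    per_block = cells_per_block[0] * cells_per_block[1] * orientations
    return blocks_y * blocks_x * per_block

n_features = hog_feature_length((128, 128), (8, 8), (2, 2), 9)
print(n_features)  # 15 * 15 * 36 = 8100
```

Under these assumptions each image becomes a vector of 8100 numbers, so the feature matrix has one row per image and 8100 columns.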

In [15]:
def load_and_extract_features(directory):
    ## loads and extracts features
    ## Directory: path to dataset
    ## Create empty lists
        ## x = list of HOG features
        ## y = list of class labels
    x, y = [], []
    for label, class_name in enumerate(sorted(os.listdir(directory))):
        ## Loops through each folder in the directory
        ## os.listdir(directory) returns folder names (each folder is a class) in arbitrary order,
        ## so sorted(...) keeps the label-to-class mapping the same on every run
        ## enumerate: assigns each class folder an integer label (0, 1, 2, ...)
        ## class_dir: stores the path to each class folder
        class_dir = os.path.join(directory, class_name)
        for filename in os.listdir(class_dir):
            ## Loops through each image in the class folder
            ## image_path: stores the full path to the image
            ## img = cv2.imread(...): reads the image using OpenCV; the result is a NumPy array (BGR)
            ## if img is None: skips any files that fail to load
            ## img_resized: resizes the image to 128x128, for a fair comparison with the CNN notebook
            ## img_gray: converts the image to grayscale before HOG extraction
            ## hog_features: calls the function defined in the cell above
            ## Appends the features to x and the label to y
            image_path = os.path.join(class_dir, filename)
            img = cv2.imread(image_path)
            if img is None:
                continue
            img_resized = cv2.resize(img, (128,128))
            img_gray = cv2.cvtColor(img_resized, cv2.COLOR_BGR2GRAY)
            hog_features = extract_hog_features(img_gray)
            x.append(hog_features)
            y.append(label)
    return np.array(x), np.array(y)

In [16]:
x, y = load_and_extract_features("../../../Data_birds/birds_dataset/Bird Speciees Dataset/")

In [17]:
x_train, x_test, y_train, y_test = train_test_split(
    ## Using sklearn's built-in data splitter
    ## Test size = 20% (consistent with CNN)
    ## Train size = 80% (consistent with CNN)
    ## random_state is set for reproducibility
    ## stratify = y ensures the class distribution in the train and test sets matches the original
    x, y,
    test_size = 0.2,
    random_state = 42,
    stratify = y
)
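
To illustrate what stratification does, the sketch below performs a simple hand-rolled stratified split with NumPy on toy labels and checks that the class proportions match in both parts. It is not scikit-learn's implementation, and `stratified_indices` is a hypothetical helper.

```python
import numpy as np

def stratified_indices(y, test_size, seed):
    # Split the indices of each class separately so proportions are preserved
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for cls in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == cls))
        n_test = int(round(test_size * len(idx)))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.array(train_idx), np.array(test_idx)

# Toy, imbalanced labels: 50 of class 0, 30 of class 1, 20 of class 2
y_toy = np.array([0] * 50 + [1] * 30 + [2] * 20)
train_idx, test_idx = stratified_indices(y_toy, test_size=0.2, seed=42)

# Share of each class in the two parts
train_share = np.bincount(y_toy[train_idx]) / len(train_idx)
test_share = np.bincount(y_toy[test_idx]) / len(test_idx)
print(train_share)
print(test_share)
```

Both parts keep the 50/30/20 balance of the toy labels. Without stratification, a random split could leave a small class under-represented in the test set.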


In [6]:
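## Using the standard scaler from scikit-learn
## Standardises each feature so that mean = 0 and standard deviation = 1
## The scaler is fitted on the training data only; the test data is transformed
## with the same training statistics to avoid leaking test information
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)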
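
Standardisation subtracts each feature's mean and divides by its standard deviation. The NumPy sketch below shows the arithmetic on toy data, with the statistics estimated from the training rows only and then reused for the test rows. The arrays are made-up values.

```python
import numpy as np

# Toy feature matrices (rows = samples, columns = features); values are made up
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

# Statistics come from the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0))  # approximately 0 for each feature
print(train_scaled.std(axis=0))   # approximately 1 for each feature
print(test_scaled)
```

The scaled test values need not have mean 0 or standard deviation 1. That is expected, because the test set is treated as unseen data.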

In [18]:
rf_clf = RandomForestClassifier(
    ## Definition of the Random Forest Classifier
    ## n_estimators = 100: number of trees in the forest
    ## random_state = 42: fixes the randomness so results are reproducible
    ## n_jobs = 1: number of CPU cores used in parallel
    ## max_depth = 5: limits the depth of each decision tree, which helps limit overfitting
    n_estimators = 100,
    random_state = 42,
    n_jobs = 1,
    max_depth = 5
)

## Training the random forest classifier on the training dataset
rf_clf.fit(x_train_scaled, y_train)

In [32]:
## Testing the Random Forest Classifier
## y_pred: array of predicted classes
## accuracy_score returns (# correct predictions / # total predictions); multiplied by 100 to give a percentage
print(">>> Testing completed, results follow")
y_pred = rf_clf.predict(x_test_scaled)
accuracy = 100*accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}%")


Accuracy: 63.19%
>>> Testing completed, results follow
Accuracy for class: American Goldfinch is nan %
Accuracy for class: Barn Owl is nan %
Accuracy for class: Carmime Bee-eater is nan %
Accuracy for class: Downy Woodpecker is nan %
Accuracy for class: Emperor Penguin is nan %
Accuracy for class: Flamingo is nan %
Total Accuracy: 63.19%


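Note: every per-class accuracy above is nan because `total` was 0 for each class, so `100 * correct / total` evaluated 0/0. This typically means `cls` never matched any entry of `y_test`, for example if `cls` held the class name (a string) while `y_test` holds the integer labels assigned by `enumerate`. The computation was:

  total = np.sum(y_test == cls)
  correct = np.sum((y_test == cls) & (y_pred == cls))
  accuracy = 100 * correct / total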
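
A per-class breakdown needs the integer class index, not the class name, in the comparison with the label arrays. The sketch below is a minimal version under assumptions: the toy `y_test_toy` and `y_pred_toy` arrays are made-up values, not results from this notebook, and the `class_names` order assumes alphabetical folder names.

```python
import numpy as np

# Assumed label order (alphabetical folder names); adjust to match your data
class_names = [
    "American Goldfinch", "Barn Owl", "Carmine Bee-Eater",
    "Downy Woodpecker", "Emperor Penguin", "Flamingo",
]

# Toy true and predicted labels (made-up values, not the notebook's results)
y_test_toy = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5])
y_pred_toy = np.array([0, 1, 1, 1, 2, 0, 3, 3, 4, 5, 5, 5])

per_class = {}
for cls, name in enumerate(class_names):
    mask = y_test_toy == cls  # samples whose true label is this class
    total = np.sum(mask)
    correct = np.sum(mask & (y_pred_toy == cls))
    # Guard against classes that are absent from the test labels
    per_class[name] = 100 * correct / total if total > 0 else float("nan")
    print(f"Accuracy for class: {name} is {per_class[name]:.1f} %")

overall = 100 * np.mean(y_test_toy == y_pred_toy)
print(f"Total Accuracy: {overall:.2f}%")
```

The same loop could be run on the notebook's `y_test` and `y_pred` once the label order has been confirmed against the folder names.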

