# Machine Learning Project : To bee or not to bee
___

## 1. Business Understanding

* Context : Pollinator insects such as bees and bumblebees are important for ecosystem diversity, and by fertilizing flowers they play a vital role in the global food chain.
* Problem : Distinguishing bee types among pollinators is crucial in this context. Traditional methods are mostly manual and quite time-consuming, especially for a dataset like ours of 250 images, although they may be less prone to error.
* Solution : This project aims to leverage Machine Learning (ML), Deep Learning, and AI & Optimization techniques to automate the identification process: given an image (and its mask), recognize the bug type and, at a later stage, its species.
* Stakeholders : This project could interest several groups: researchers, the agricultural sector, and educators.

Now that we are familiar with the subject of this project, we can get into the technical details.

## 2. Data Understanding

In [3]:
# We need to import the necessary libraries 

import cv2
import os
import numpy as np
from tqdm.notebook import tqdm 
import re
import matplotlib.pyplot as plt

### Configuration

In [6]:
IMAGE_FOLDER = 'tobeeornottobee_train_v2/train' 
MASK_FOLDER = 'tobeeornottobee_train_v2/train/masks'  
MASK_FILE_EXTENSION = '.tif'
EXPECTED_TRAINING_FILES = 250
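
Before loading anything, it can help to confirm that the configured folders exist and roughly match the expected file counts. The snippet below is a minimal sketch of such a check; `check_dataset_layout` is a hypothetical helper introduced here for illustration, not part of the original notebook, and it assumes images end in `.jpg` as in the loading cell.

```python
import os

def check_dataset_layout(image_folder, mask_folder, mask_extension=".tif"):
    """Return (n_images, n_masks); raise FileNotFoundError if a folder is missing.

    Hypothetical helper: counts .jpg files in image_folder and files ending
    with mask_extension in mask_folder, ignoring case.
    """
    for folder in (image_folder, mask_folder):
        if not os.path.isdir(folder):
            raise FileNotFoundError(f"Folder not found: '{folder}'")
    n_images = sum(f.lower().endswith(".jpg") for f in os.listdir(image_folder))
    n_masks = sum(f.lower().endswith(mask_extension) for f in os.listdir(mask_folder))
    return n_images, n_masks
```

It could be called as `check_dataset_layout(IMAGE_FOLDER, MASK_FOLDER, MASK_FILE_EXTENSION)` and the two counts compared with `EXPECTED_TRAINING_FILES`.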

### Dataset Loading

In [7]:
training_data = []  # each entry holds an image, its mask, and a unique ID taken from the filename (e.g. '1', '123')

try:
    # list the files in IMAGE_FOLDER.
    all_files_in_image_folder = os.listdir(IMAGE_FOLDER)

    # keep only the .jpg files (this also filters out the masks subfolder)
    image_filenames = [f for f in all_files_in_image_folder if f.lower().endswith('.jpg')]

    # sort the filenames numerically
    # the names are not zero-padded (1, 2, ..., 100), so a plain string sort would put 100 before 2
    def sort_key(filename):
        match = re.match(r'(\d+)\.jpg$', filename, re.IGNORECASE)  # digits before the extension
        # names without a leading number sort last instead of raising AttributeError
        return int(match.group(1)) if match else float('inf')
    image_filenames.sort(key=sort_key)

    # keep at most EXPECTED_TRAINING_FILES files and warn if fewer were found
    files_to_load = image_filenames[:EXPECTED_TRAINING_FILES]
    if len(files_to_load) < EXPECTED_TRAINING_FILES:
        print(f"Warning: Found only {len(files_to_load)} of {EXPECTED_TRAINING_FILES} expected files.")
    else:
        print(f"Loading the {len(files_to_load)} files based on sorted filenames.")

    for img_filename in tqdm(files_to_load, desc="Processing Files"):

        # getting the name that will serve as ID as well
        image_id = os.path.splitext(img_filename)[0]

        # creating full path to the image file
        image_path = os.path.join(IMAGE_FOLDER, img_filename)

        # finding expected filename for the corresponding mask.
        mask_filename = "binary_" + image_id + MASK_FILE_EXTENSION # e.g., "binary_1.tif"
        # creating full path to the mask file
        mask_path = os.path.join(MASK_FOLDER, mask_filename)
        
        # load the image using OpenCV (BGR format by default in OpenCV)
        image = cv2.imread(image_path)
        if image is None:
            print(f"  Warning: Could not read image file: '{image_path}'. Skipping this pair.") # check if loading works
            continue 

        # load the mask as Grayscale (single channel) using OpenCV
        mask = cv2.imread(mask_path, cv2.IMREAD_GRAYSCALE)
        if mask is None:
            print(f"  Warning: Could not read mask file: '{mask_path}'. Skipping this pair.") # check if loading worked.
            continue

        # finally, store the loaded data in our list.
        data_item = {
            'id': image_id,        # The identifier (e.g., '1', '123')
            'image': image,        # The loaded image 
            'mask': mask           # The loaded mask (corresponding to the image)
        }
        training_data.append(data_item)

    print(f"\nFinished loading. Successfully loaded {len(training_data)} image/mask pairs.")

except FileNotFoundError:
    print("Please double-check the paths")
    training_data = [] # Ensure data list is empty if loading failed
except Exception as e:
    print("An unexpected error occurred")
    print(e)
    training_data = []

Loading the 250 files based on sorted filenames.


Finished loading. Successfully loaded 249 image/mask pairs.
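
The output above reports 249 loaded pairs out of 250 listed files, so one image or mask could not be read. The sketch below shows one way to find which ID was skipped and to run a basic consistency check on each loaded pair. Both `find_missing_ids` and `check_pair` are hypothetical helpers, not part of the original notebook, and the expectation that each mask contains at most two distinct pixel values is an assumption based on the `binary_` filename prefix.

```python
import os
import numpy as np

def find_missing_ids(expected_filenames, loaded_items):
    """Return IDs that were listed for loading but are absent from loaded_items."""
    expected_ids = [os.path.splitext(name)[0] for name in expected_filenames]
    loaded_ids = {item['id'] for item in loaded_items}
    return [image_id for image_id in expected_ids if image_id not in loaded_ids]

def check_pair(item):
    """Return a list of problems found in one image/mask pair (empty if none)."""
    problems = []
    image, mask = item['image'], item['mask']
    # the mask should cover the image pixel for pixel (height, width)
    if image.shape[:2] != mask.shape[:2]:
        problems.append("size mismatch")
    # assumption: a binary mask holds at most two distinct values
    if len(np.unique(mask)) > 2:
        problems.append("mask not binary")
    return problems
```

With the variables from the loading cell, `find_missing_ids(files_to_load, training_data)` would list the skipped ID, and `[(d['id'], check_pair(d)) for d in training_data if check_pair(d)]` would list any inconsistent pairs.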


### Feature Extraction