## Goal
Here, we will train a classification model based on the ResNet-34 architecture to predict the age range of the person in an image.

The age ranges follow the definition used by the Health Promotion Board of Singapore:
- (A0) Young Children: 0-6 years
- (A1) Children and Youth: 7-17 years
- (A2) Young Adults: 18-25 years
- (A3) Adults: 26-49 years
- (A4) Older Adults: 50 years and above

The training data comes from UTKFace. We will use the in-the-wild images, since that subset contains the most data and we already have a face cropping engine ready.
Link: https://susanqq.github.io/UTKFace/

## Data

In [2]:
import os
os.getcwd()

'/home/jupyter/fd_widerface_yolov8'

In [2]:
!gsutil ls gs://oak_datastorage

gs://oak_datastorage/utkface_dataset.zip
gs://oak_datastorage/yolov8-face-20epoch.pt


In [5]:
!gsutil cp gs://oak_datastorage/utkface_dataset.zip datasets/

Copying gs://oak_datastorage/utkface_dataset.zip...
- [1 files][  1.3 GiB/  1.3 GiB]   60.2 MiB/s                                   
Operation completed over 1 objects/1.3 GiB.                                      


In [9]:
# Commented out since the archive has already been unzipped and the lengthy unzip output would flood JupyterLab
# !unzip datasets/utkface_dataset.zip -d datasets/

In [3]:
imglist = os.listdir("datasets/utkface_dataset")
print("Total number of images available:", len(imglist))
print("Example directory:", imglist[0])

Total number of images available: 24109
Example directory: 58_0_0_20170120222516888.jpg


The file name of each image follows the format \<age>_\<gender>_\<race>_\<datetime>.jpg.

Since we are trying to develop an age classification model, let us leave out all the other information and extract out only the age.
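
As a minimal sketch of that extraction, the age is simply the first underscore-separated field of the file name. The helper name `parse_age` is hypothetical and is not part of the notebook.

```python
import os

def parse_age(filename):
    """Return the age stored as the first '_'-separated field of a UTKFace file name."""
    # Drop any directory part and the extension before splitting
    stem = os.path.splitext(os.path.basename(filename))[0]
    return int(stem.split("_")[0])

print(parse_age("58_0_0_20170120222516888.jpg"))  # 58
```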

## Face Extraction

Let us extract the face from each image before doing age classification, so that the classifier only looks at facial information. Focusing on the face removes model biases that could originate from other potentially misleading cues, such as clothing.

We will extract the faces into a separate folder, and then randomly reorganize the data into a train, validation and test folder structure.

We will drop images in which more than one face is detected (<5% of the whole dataset). In those images we cannot be sure which face the label refers to, so keeping them could mislead the model.

In [77]:
# Since we have 5 age ranges, we have 5 labels. We will incorporate the label into the cropped file names
def labelling(cropstr):
    age = int(cropstr.split("_")[0])
    if age<=6:
        agestr = "A0"
    elif age<=17:
        agestr = "A1"
    elif age<=25:
        agestr = "A2"
    elif age<=49:
        agestr = "A3"
    else:
        agestr = "A4"
    return agestr
    

In [78]:
import math
import cv2

save_directory = "datasets/cropped_utkface/"
classNames = ["face"]

def cropimgs(results, save_directory, expand_ratio = 0.2, CI = 0.7):
    # results: list of results for all the images after running the model
    # save_directory: location of output directory
    # expand_ratio: fraction of the box width/height by which the box is expanded on each side
    # CI: confidence threshold of the prediction. Only crop the face if its confidence is higher than this threshold
    
    count = 0
    dup_count = 0
    
    for r in results:
        boxes = r.boxes
        imgpath = r.path
        img = cv2.imread(imgpath)
        
        cropstr = imgpath.split("/")[-1].split(".jpg")[0]
        agestr = labelling(cropstr)
        
        img_index = 0
        
        # Skip images in which more than one face is detected
        if len(boxes)>1:
            # print("More than one face detected!")
            dup_count += 1
            continue

        for box in boxes:
            # bounding box
            x1, y1, x2, y2 = box.xyxy[0]
            x1, y1, x2, y2 = int(x1), int(y1), int(x2), int(y2) # convert to int values
            
            # Calculate expanded coordinates. The width and height are computed first,
            # so that every side grows by the same fraction of the original box.
            box_w, box_h = x2 - x1, y2 - y1
            x1 = max(0, int(x1 - expand_ratio * box_w))
            y1 = max(0, int(y1 - expand_ratio * box_h))
            x2 = min(img.shape[1], int(x2 + expand_ratio * box_w))
            y2 = min(img.shape[0], int(y2 + expand_ratio * box_h))

            # No rectangle is drawn on img here: the outline would end up inside the saved crop

            # confidence
            confidence = math.ceil((box.conf[0]*100))/100
            # print("Confidence --->",confidence)

            # # class name
            cls = int(box.cls[0])
            # print("Class name -->", classNames[cls])

            if classNames[cls] == "face" and confidence > CI:
                # Crop the face from the image
                cropped_person = img[y1:y2, x1:x2]

                # Save the cropped face image, with the age label appended to the file name
                filename = f"{save_directory}{cropstr}_{agestr}.jpg"
                cv2.imwrite(filename, cropped_person)
                # img_index += 1
                
#                 if img_index > 1:
#                     print("Possible duplicate faces at", filename)
                
        count += 1
        
        if count%1000 == 0:
            print("Success count:", count)
            print("Duplicate count:", dup_count)
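
The box expansion step can also be isolated into a small pure function, which makes it easy to check on known coordinates. The sketch below is an illustration only: `expand_box` is a hypothetical helper, not part of the notebook. It measures the width and height before moving any corner, so that all four sides grow by the same fraction of the original box, and it clips the result to the image bounds.

```python
def expand_box(x1, y1, x2, y2, img_w, img_h, expand_ratio=0.2):
    """Grow a box by expand_ratio of its width/height on every side, clipped to the image."""
    # Measure the original box before any corner is moved
    dx = int(expand_ratio * (x2 - x1))
    dy = int(expand_ratio * (y2 - y1))
    return (max(0, x1 - dx), max(0, y1 - dy),
            min(img_w, x2 + dx), min(img_h, y2 + dy))

print(expand_box(100, 100, 200, 200, 640, 480))  # (80, 80, 220, 220)
print(expand_box(10, 10, 110, 110, 120, 120))    # (0, 0, 120, 120)
```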


In [79]:
# Create directory first if it does not exist
if not os.path.exists(save_directory):
    os.makedirs(save_directory)
    print(f"Directory '{save_directory}' created.")
else:
    print(f"Directory '{save_directory}' already exists.")

Directory 'datasets/cropped_utkface/' created.


In [80]:
# Load the best available face detection model from the local weights file
from ultralytics import YOLO
model = YOLO('model/best.pt')
img_dir = "/home/jupyter/fd_widerface_yolov8/datasets/utkface_dataset/"
model.info()


Model summary: 225 layers, 3011043 parameters, 0 gradients, 8.2 GFLOPs


(225, 3011043, 0, 8.1941504)

In [81]:
# We will use a generator (stream=True) for prediction, so that results are produced one image at a time instead of all being held in memory:
results = model(img_dir, stream=True, verbose=False)
print("Done creating generator")

Done creating generator


In [82]:
# Process results generator
cropimgs(results, save_directory)

Success count: 1000
Duplicate count: 39


Corrupt JPEG data: bad Huffman code
Corrupt JPEG data: bad Huffman code


Success count: 2000
Duplicate count: 71
Success count: 3000
Duplicate count: 124
Success count: 4000
Duplicate count: 163
Success count: 5000
Duplicate count: 219
Success count: 6000
Duplicate count: 298
Success count: 7000
Duplicate count: 342
Success count: 8000
Duplicate count: 376
Success count: 9000
Duplicate count: 408
Success count: 10000
Duplicate count: 450
Success count: 11000
Duplicate count: 486
Success count: 12000
Duplicate count: 534
Success count: 13000
Duplicate count: 573
Success count: 14000
Duplicate count: 605
Success count: 15000
Duplicate count: 644
Success count: 16000
Duplicate count: 677
Success count: 17000
Duplicate count: 714
Success count: 18000
Duplicate count: 749
Success count: 19000
Duplicate count: 787
Success count: 20000
Duplicate count: 837


Corrupt JPEG data: premature end of data segment
Corrupt JPEG data: premature end of data segment


Success count: 21000
Duplicate count: 880
Success count: 22000
Duplicate count: 928
Success count: 23000
Duplicate count: 973
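
As a rough check of the "<5%" figure, the share of multi-face images can be computed from the counters at the last logged checkpoint. This is only an approximation, since the final totals are not printed. Note that `count` is incremented only for images that were not skipped, so the number of images processed so far is the sum of both counters.

```python
# Counters taken from the last checkpoint printed above
non_dup_count = 23000  # "Success count"
dup_count = 973        # "Duplicate count"

# Images with more than one detected face, as a share of all images processed so far
dup_share = dup_count / (non_dup_count + dup_count)
print(f"{dup_share:.2%}")  # 4.06%
```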


### Generation of dataset

Here, we will create the train, validation and test folders and copy our data into them.

In [None]:
# Since we have 5 age ranges, this means we have 5 labels.

# Creating relevant directory
train_dir = "datasets/labcrop_utkface/train"
val_dir = "datasets/labcrop_utkface/val"
test_dir = "datasets/labcrop_utkface/test"
dirlist = [train_dir, val_dir, test_dir]
lablist = ["/A0", "/A1", "/A2", "/A3", "/A4"]

# Create one sub-folder per label inside each of the train, val and test folders
for datadir in dirlist:
    for lab in lablist:
        directory = datadir + lab
        if not os.path.exists(directory):
            os.makedirs(directory)
            print(f"Directory '{directory}' created.")
        else:
            print(f"Directory '{directory}' already exists.")


In [None]:
source_path = "datasets/cropped_utkface"
allimages = os.listdir(source_path)

In [None]:
import random

def split_data(data, train_percent=0.8, validation_percent=0.1, test_percent=0.1):
    total_size = len(data)
    train_size = int(total_size * train_percent)
    validation_size = int(total_size * validation_percent)
    
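    # The test set takes whatever remains, so that the sizes add up to the total size
    # (test_percent is therefore not used directly)
    test_size = total_size - train_size - validation_size

    # Shuffle the data randomly (note: this reorders the caller's list in place)
    random.shuffle(data)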
    
    # Split the data into three sets
    train_data = data[:train_size]
    validation_data = data[train_size:train_size + validation_size]
    test_data = data[train_size + validation_size:]
    
    return train_data, validation_data, test_data
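
Because `random.shuffle` is unseeded here, each run of the notebook produces a different split. If a reproducible split is wanted, one option is sketched below. It assumes a fixed seed is acceptable; `split_data_seeded` and its `seed` parameter are hypothetical and not part of the notebook.

```python
import random

def split_data_seeded(data, train_percent=0.8, validation_percent=0.1, seed=42):
    """Split data into train/validation/test lists; the test set takes the remainder."""
    items = list(data)                  # copy, so the caller's list is not reordered
    random.Random(seed).shuffle(items)  # private generator: same seed, same order
    train_size = int(len(items) * train_percent)
    validation_size = int(len(items) * validation_percent)
    train = items[:train_size]
    validation = items[train_size:train_size + validation_size]
    test = items[train_size + validation_size:]
    return train, validation, test

files = [f"img_{i}.jpg" for i in range(25)]
train, validation, test = split_data_seeded(files)
print(len(train), len(validation), len(test))  # 20 2 3
```

Calling the function again with the same seed returns the same three lists, which keeps the test set fixed between experiments.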


In [None]:
import shutil

def copy_file(source_path, destination_path):
    try:
        shutil.copy(source_path, destination_path)
        # print(f"File copied successfully from {source_path} to {destination_path}")
    except FileNotFoundError:
        print(f"Source file not found: {source_path}")
    except PermissionError:
        print("Permission error. Make sure you have the necessary permissions.")
    except Exception as e:
        print(f"An error occurred: {e}")
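
Since the cropped file names end with the age label (for example `..._A4.jpg`), the target class folder can be read back from the name itself. The sketch below only illustrates that idea under this naming assumption. Both helper names, `label_from_cropped_name` and `destination_for`, are hypothetical and are not the notebook's own functions.

```python
import os

def label_from_cropped_name(filename):
    """Return the age label (e.g. 'A4') stored as the last '_'-separated field."""
    stem = os.path.splitext(os.path.basename(filename))[0]
    return stem.rsplit("_", 1)[-1]

def destination_for(split_dir, filename):
    """Build the path <split_dir>/<label>/<file name> for a cropped image."""
    name = os.path.basename(filename)
    return os.path.join(split_dir, label_from_cropped_name(name), name)

print(label_from_cropped_name("58_0_0_20170120222516888_A4.jpg"))  # A4
print(destination_for("datasets/labcrop_utkface/train",
                      "58_0_0_20170120222516888_A4.jpg"))
```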

In [None]:


def assign_folders(savedir, dataset):
    
    