# Data Preparation

## Objectives

* Ensure data only contains image files
* Split train, validation and test sets

## Inputs

* Raw dataset - inputs/datasets/raw/cherry-leaves

## Outputs

* Split datasets - test, train and validation in inputs/datasets/cherry-leaves

---

## Change working directory

Change working directory to project root directory

In [None]:
import os
current_dir = os.getcwd()
current_dir

In [None]:
os.chdir(os.path.dirname(current_dir))

# confirm new directory
current_dir = os.getcwd()
current_dir
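
Running the `os.chdir` cell a second time moves one more level up, away from the project root. A minimal sketch of a guard, assuming the notebook sits one level below the root and that the root is the folder containing `inputs` (the helper name `move_to_project_root` and its `marker` parameter are hypothetical):

```python
import os


def move_to_project_root(marker="inputs"):
    """Move up one level only if the current directory lacks the marker folder."""
    cwd = os.getcwd()
    if not os.path.isdir(os.path.join(cwd, marker)):
        # assumption: the project root is the direct parent of the notebook folder
        os.chdir(os.path.dirname(cwd))
    return os.getcwd()
```

Calling `move_to_project_root()` repeatedly should then leave the working directory unchanged once the root has been reached.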

---

## Data cleaning

Check for and remove non-image files

In [None]:
# function taken from Code Institute walkthrough projects
# e.g. https://github.com/Code-Institute-Solutions/WalkthroughProject01
def remove_non_image_files(my_data_dir):
    """Delete every file in each class folder of my_data_dir that lacks an image extension."""
    image_extension = (".png", ".jpg", ".jpeg")
    folders = os.listdir(my_data_dir)
    for folder in folders:
        folder_path = os.path.join(my_data_dir, folder)
        image_count = 0
        non_image_count = 0
        for given_file in os.listdir(folder_path):
            if not given_file.lower().endswith(image_extension):
                os.remove(os.path.join(folder_path, given_file))  # remove non-image file
                non_image_count += 1
            else:
                image_count += 1
        print(f"Folder: {folder} - image files: {image_count}")
        print(f"Folder: {folder} - non-image files removed: {non_image_count}")

In [None]:
remove_non_image_files("inputs/datasets/raw/cherry-leaves")
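
Because `os.remove` is irreversible, it can help to list what would be deleted first. A minimal sketch of such a dry run, assuming the same one-folder-per-class layout; `find_non_image_files` is a hypothetical helper, not part of the walkthrough code:

```python
import os


def find_non_image_files(my_data_dir, image_extension=(".png", ".jpg", ".jpeg")):
    """Return paths of files that would be removed, without deleting anything."""
    non_image_files = []
    for folder in sorted(os.listdir(my_data_dir)):
        folder_path = os.path.join(my_data_dir, folder)
        if not os.path.isdir(folder_path):
            continue  # skip stray files at the top level
        for given_file in sorted(os.listdir(folder_path)):
            if not given_file.lower().endswith(image_extension):
                non_image_files.append(os.path.join(folder_path, given_file))
    return non_image_files


raw_dir = "inputs/datasets/raw/cherry-leaves"
if os.path.isdir(raw_dir):  # only runs where the raw dataset exists
    print(find_non_image_files(raw_dir))
```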

---

## Split train, validation and test sets

Install `joblib` library to enable file/folder changes

In [None]:
# quote the requirement so the shell does not treat ">" as an output redirect
%pip install "joblib>=0.11"

In [None]:
import os
import shutil
import random
import joblib


# function taken from Code Institute walkthrough projects
# e.g. https://github.com/Code-Institute-Solutions/WalkthroughProject01
def split_train_validation_test_images(
    my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio
):
    """Creates folders in my_data_dir for train, validation and test sets, shuffles images and moves them
    into these folders in the passed rations, then deletes the original class label folders
    """

    # compare with a tolerance: an exact float comparison can fail through rounding error
    if abs(train_set_ratio + validation_set_ratio + test_set_ratio - 1.0) > 1e-9:
        print(
            "train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0"
        )
        return

    # get class labels
    labels = os.listdir(my_data_dir)  # expects my_data_dir to contain only class folders

    # check if data already split into desired sets
    if "test" in labels:
        pass
    else:
        # create train, validation and test folders, each with sub-folders for each label class
        for folder in ["train", "validation", "test"]:
            for label in labels:
                os.makedirs(name=my_data_dir + "/" + folder + "/" + label)

        for label in labels:
            # get all images from each class and shuffle them
            files = os.listdir(my_data_dir + "/" + label)
            random.shuffle(files)

            train_set_files_qty = int(len(files) * train_set_ratio)
            validation_set_files_qty = int(len(files) * validation_set_ratio)

            count = 1
            for file_name in files:
                if count <= train_set_files_qty:
                    # move a given file to the train set
                    shutil.move(
                        my_data_dir + "/" + label + "/" + file_name,
                        my_data_dir + "/train/" + label + "/" + file_name,
                    )

                elif count <= (train_set_files_qty + validation_set_files_qty):
                    # move a given file to the validation set
                    shutil.move(
                        my_data_dir + "/" + label + "/" + file_name,
                        my_data_dir + "/validation/" + label + "/" + file_name,
                    )

                else:
                    # move given file to test set
                    shutil.move(
                        my_data_dir + "/" + label + "/" + file_name,
                        my_data_dir + "/test/" + label + "/" + file_name,
                    )

                count += 1

            os.rmdir(my_data_dir + "/" + label)
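
`random.shuffle` produces a different split on every run. If a reproducible split is wanted, one option is a seeded generator; the sketch below assumes a hypothetical helper `shuffled_copy` with a `seed` argument, neither of which exists in the function above:

```python
import random


def shuffled_copy(files, seed=42):
    """Return a shuffled copy of files that is identical for a given seed."""
    rng = random.Random(seed)  # local generator, leaves the global random state alone
    shuffled = sorted(files)   # sort first, because os.listdir order is arbitrary
    rng.shuffle(shuffled)
    return shuffled
```

Inside the split function, `files = shuffled_copy(os.listdir(...))` could stand in for the `random.shuffle(files)` call.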

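Remove the stray `=0.11` file, which the shell creates when the `>=` in the install command is left unquoted (`-f` keeps this cell quiet if the file does not exist)

In [None]:
! rm -f =0.11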

Split data into conventional ratios: 70% train, 10% validation, 20% test

In [None]:
split_train_validation_test_images(
    my_data_dir="inputs/datasets/raw/cherry-leaves",
    train_set_ratio=0.7,
    validation_set_ratio=0.1,
    test_set_ratio=0.2,
)
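
The function truncates the train and validation counts with `int()`, so any remainder lands in the test set. A small sketch of that arithmetic, with a hypothetical helper `split_sizes` that mirrors the counting logic above:

```python
def split_sizes(n_files, train_set_ratio, validation_set_ratio):
    """Return (train, validation, test) counts as the split function would assign them."""
    train_qty = int(n_files * train_set_ratio)            # truncated, as in the function
    validation_qty = int(n_files * validation_set_ratio)  # truncated, as in the function
    test_qty = n_files - train_qty - validation_qty       # test set takes the remainder
    return train_qty, validation_qty, test_qty


# ratios chosen to be exact in binary floating point
print(split_sizes(10, 0.5, 0.25))  # (5, 2, 3)
```

Note that `test_set_ratio` only takes part in the sum check; the test count is whatever is left after the other two sets are filled.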

Move the split dataset folders out of the `raw` directory and delete `raw`

In [None]:
! mv inputs/datasets/raw/cherry-leaves inputs/datasets/cherry-leaves \
  && rmdir inputs/datasets/raw
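
A quick way to confirm the result is to count the files in each split and class folder. A minimal sketch, assuming the `train`/`validation`/`test` layout created above; `count_files_per_split` is a hypothetical helper:

```python
import os


def count_files_per_split(split_dir, splits=("train", "validation", "test")):
    """Return {split: {label: file_count}} for the split folders that exist."""
    counts = {}
    for split in splits:
        split_path = os.path.join(split_dir, split)
        if not os.path.isdir(split_path):
            continue  # split folder missing, e.g. the split has not been run yet
        counts[split] = {
            label: len(os.listdir(os.path.join(split_path, label)))
            for label in sorted(os.listdir(split_path))
            if os.path.isdir(os.path.join(split_path, label))
        }
    return counts


data_dir = "inputs/datasets/cherry-leaves"
if os.path.isdir(data_dir):  # only runs where the split dataset exists
    print(count_files_per_split(data_dir))
```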

---

## Conclusions and next steps

Data is cleaned and split into train, test and validation sets

Next - image visualization study

---