# **Data Collection**

## Objectives

* Fetch the cherry leaves dataset from Kaggle, clean it, and split it into train, validation and test sets for the later notebooks.

## Inputs

* Kaggle JSON file (kaggle.json) - the authentication token.

## Outputs

* Generate Dataset: inputs/mildew_dataset

## Additional Comments | Insights | Conclusions


* No additional comments.


---

# Import packages

In [1]:
import numpy as np
import os

## Change the working directory

In [None]:
os.chdir('D:\\vscode-projects\\mildew-detector-v1')
print("You set a new current directory")

In [None]:

current_dir = os.getcwd()
current_dir

# Install Kaggle

In [None]:
# install kaggle package
%pip install kaggle==1.5.12

---

Run the cell below **to set the Kaggle configuration directory to the current working directory**, so that the Kaggle client looks for the authentication JSON (kaggle.json) there.

In [15]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
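
Before downloading, it can help to confirm that the token is where the Kaggle client will look for it. The sketch below is not part of the original notebook: `kaggle_token_exists` is a hypothetical helper, and it assumes the token file is named `kaggle.json` and sits directly inside the configuration directory.

```python
import os


def kaggle_token_exists(config_dir):
    """Return True if a kaggle.json file sits directly inside config_dir."""
    return os.path.isfile(os.path.join(config_dir, "kaggle.json"))


# Fall back to the working directory if KAGGLE_CONFIG_DIR has not been set yet
config_dir = os.environ.get("KAGGLE_CONFIG_DIR", os.getcwd())
if not kaggle_token_exists(config_dir):
    print(f"kaggle.json not found in {config_dir}; the download is likely to fail.")
```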

* Get the dataset path from the [Kaggle URL](https://www.kaggle.com/datasets/codeinstitute/cherry-leaves). When viewing the dataset on Kaggle, the path is the part of the URL after https://www.kaggle.com/ (or after https://www.kaggle.com/datasets/ when that prefix is present). Assign it to KaggleDatasetPath.

Set the Kaggle Dataset and Download it.

In [None]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/mildew_dataset"
!kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Unzip the downloaded file, and delete the zip file.

In [17]:
import zipfile
zip_path = os.path.join(DestinationFolder, 'cherry-leaves.zip')
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)
os.remove(zip_path)

---

# Data Preparation

---

## Data cleaning

### Check and remove non-image files

In [8]:
def remove_non_image_file(my_data_dir):
    image_extensions = ('.png', '.jpg', '.jpeg')
    folders = os.listdir(my_data_dir)
    
    for folder in folders:
        folder_path = os.path.join(my_data_dir, folder)
        if not os.path.isdir(folder_path):
            continue

        image_count = 0
        non_image_count = 0

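        for filename in os.listdir(folder_path):
            file_path = os.path.join(folder_path, filename)
            # Skip subfolders: os.remove() raises an error on a directory
            if not os.path.isfile(file_path):
                continue
            if not filename.lower().endswith(image_extensions):
                os.remove(file_path)
                non_image_count += 1
            else:
                image_count += 1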

        print(f"Folder: {folder} - Image files: {image_count}")
        print(f"Folder: {folder} - Non-image files removed: {non_image_count}")

In [None]:
remove_non_image_file(my_data_dir='inputs/mildew_dataset/cherry-leaves')
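
The check above relies on the file extension only, so a renamed or truncated file could still pass. As a minimal sketch of a stricter test, assuming only PNG and JPEG files are expected, the hypothetical helper `looks_like_image` below inspects the first bytes of a file for the standard PNG or JPEG signature. It is not part of the original notebook and does not prove that the whole file decodes correctly.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"   # fixed 8-byte header of every PNG file
JPEG_SIGNATURE = b"\xff\xd8\xff"       # start-of-image marker of a JPEG file


def looks_like_image(file_path):
    """Return True if the file starts with a PNG or JPEG signature."""
    with open(file_path, "rb") as fh:
        header = fh.read(8)
    return header.startswith(PNG_SIGNATURE) or header.startswith(JPEG_SIGNATURE)
```

A file that fails this check while carrying an image extension is worth inspecting by hand before training.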

## Split train validation test set

In [12]:
import shutil
import random

def split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):
    if round(train_set_ratio + validation_set_ratio + test_set_ratio, 2) != 1.0:
        print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return

    labels = [label for label in os.listdir(my_data_dir) 
              if os.path.isdir(os.path.join(my_data_dir, label)) 
              and label not in ['train', 'validation', 'test']]

    for folder in ['train', 'validation', 'test']:
        for label in labels:
            os.makedirs(os.path.join(my_data_dir, folder, label), exist_ok=True)

    for label in labels:
        label_path = os.path.join(my_data_dir, label)
        files = os.listdir(label_path)
        random.shuffle(files)

        train_qty = int(len(files) * train_set_ratio)
        valid_qty = int(len(files) * validation_set_ratio)

        for idx, file_name in enumerate(files):
            src = os.path.join(label_path, file_name)

            if idx < train_qty:
                dst = os.path.join(my_data_dir, 'train', label, file_name)
            elif idx < train_qty + valid_qty:
                dst = os.path.join(my_data_dir, 'validation', label, file_name)
            else:
                dst = os.path.join(my_data_dir, 'test', label, file_name)

            shutil.move(src, dst)

        os.rmdir(label_path)
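
Because `random.shuffle` is called without a seed, each run of the function produces a different split. If a repeatable split is wanted, one option is a seeded shuffle. The sketch below is an assumption, not the notebook's method: `shuffled_copy` and its `seed` parameter are hypothetical names.

```python
import random


def shuffled_copy(items, seed):
    """Return a shuffled copy of items; the same seed gives the same order."""
    rng = random.Random(seed)  # private generator, global random state untouched
    result = sorted(items)     # os.listdir order is arbitrary, so sort first
    rng.shuffle(result)
    return result
```

Inside the split function, `files = shuffled_copy(os.listdir(label_path), seed)` could stand in for the `os.listdir` and `random.shuffle` pair.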

A conventional split is:
* Training set: 0.70 of the data.
* Validation set: 0.10 of the data.
* Test set: 0.20 of the data.
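
The split function truncates the train and validation quantities with `int()` and sends every remaining file to the test set, so the test share can be slightly above its nominal ratio. The sketch below reproduces that arithmetic in a hypothetical helper, `split_counts`. The figure of 105 files is an illustrative number, not a count from the dataset.

```python
def split_counts(n_files, train_set_ratio, validation_set_ratio):
    """Return (train, validation, test) counts using the same rule as the split function."""
    train_qty = int(n_files * train_set_ratio)
    valid_qty = int(n_files * validation_set_ratio)
    test_qty = n_files - train_qty - valid_qty  # the test set takes the remainder
    return train_qty, valid_qty, test_qty


print(split_counts(105, 0.7, 0.1))  # (73, 10, 22)
```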

In [13]:
split_train_validation_test_images(
    my_data_dir='inputs/mildew_dataset/cherry-leaves',
    train_set_ratio=0.7,
    validation_set_ratio=0.1,
    test_set_ratio=0.2
)
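
After the split, a quick count per set and label shows whether the files landed where expected. The following is a minimal sketch and not part of the original notebook: `count_split_files` is a hypothetical helper that assumes the `train`, `validation` and `test` folder layout created above.

```python
import os


def count_split_files(my_data_dir):
    """Return {(split, label): file_count} for the train/validation/test folders."""
    counts = {}
    for split in ("train", "validation", "test"):
        split_path = os.path.join(my_data_dir, split)
        if not os.path.isdir(split_path):
            continue  # this split folder has not been created
        for label in sorted(os.listdir(split_path)):
            label_path = os.path.join(split_path, label)
            if os.path.isdir(label_path):
                counts[(split, label)] = len(os.listdir(label_path))
    return counts


for (split, label), qty in count_split_files("inputs/mildew_dataset/cherry-leaves").items():
    print(f"{split} - {label}: {qty} files")
```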