# **Data Collection**

## Objectives

* Fetch data from Kaggle and save as raw data
* Inspect the data and save it under inputs/mildew_dataset

## Inputs

* Kaggle JSON file - the authentication token

## Outputs

* Generate Dataset: inputs/mildew_dataset

## Additional Comments

* The dataset contains more than 4,000 images taken from the client's crop fields. The images show healthy cherry leaves and cherry leaves that have powdery mildew, a fungal disease that affects many plant species.
* The client provided the data under an NDA (non-disclosure agreement), so the data should only be shared with professionals who are officially involved in the project.

---

## Change working directory

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [3]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\All\\Desktop\\Zoe_Deve\\ML-Mildew-Detector-\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [4]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [5]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\All\\Desktop\\Zoe_Deve\\ML-Mildew-Detector-'

Create the folder that will hold the dataset, then download the zip file from the Kaggle website into it.

In [6]:
import os
# create the inputs/mildew_dataset folder; exist_ok avoids an error if it already exists
os.makedirs('inputs/mildew_dataset', exist_ok=True)
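
The download itself is not shown in this notebook. As a rough sketch, assuming the Kaggle command-line tool is installed and the kaggle.json token sits in the project root, it could be scripted as below. The helper names and the dataset_slug argument are hypothetical, and the real dataset identifier has to be supplied by the reader. The zip file written by the CLI may also have a different name from archive.zip.

```python
import os
import subprocess


def build_kaggle_download_command(dataset_slug, destination_folder):
    """Build the Kaggle CLI call that downloads a dataset zip into a folder."""
    # dataset_slug is a placeholder for the "<owner>/<dataset-name>" identifier
    return ["kaggle", "datasets", "download",
            "-d", dataset_slug,
            "-p", destination_folder]


def download_kaggle_dataset(dataset_slug, destination_folder, config_dir="."):
    """Download a dataset zip, assuming kaggle.json is stored in config_dir."""
    # the Kaggle CLI looks for kaggle.json in the folder named by KAGGLE_CONFIG_DIR
    os.environ["KAGGLE_CONFIG_DIR"] = os.path.abspath(config_dir)
    os.makedirs(destination_folder, exist_ok=True)
    subprocess.run(build_kaggle_download_command(dataset_slug, destination_folder),
                   check=True)
```

Calling download_kaggle_dataset with the dataset identifier and 'inputs/mildew_dataset' would then place the zip file in the dataset folder.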

Unzip the downloaded file, and delete the zip file.

In [7]:
import zipfile
with zipfile.ZipFile('inputs/mildew_dataset' + '/archive.zip', 'r') as zip_ref:
    zip_ref.extractall('inputs/mildew_dataset')

os.remove('inputs/mildew_dataset' + '/archive.zip')

---

## Data Preparation

### Data cleaning

#### Check and remove non-image files

In [8]:
import os
from PIL import Image

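def is_image(filename):
    """Return True if Pillow can open and verify the file as an image."""
    try:
        with Image.open(filename) as img:
            img.verify()  # checks file integrity without decoding the pixel data
        return True
    except Exception:
        return False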

def remove_non_images_and_count(directory):
    for root, _, files in os.walk(directory):
        image_count = 0
        non_image_count = 0
        
        for file in files:
            file_path = os.path.join(root, file)
            if is_image(file_path):
                image_count += 1
            else:
                print(f"Removing non-image file: {file_path}")
                os.remove(file_path)
                non_image_count += 1
        
        print(f"Folder: {root}")
        print(f"Total image files: {image_count}")
        print(f"Non-image files removed: {non_image_count}")
        print("\n")


In [10]:
remove_non_images_and_count('inputs/mildew_dataset/cherry-leaves')

Folder: inputs/mildew_dataset/cherry-leaves
Total image files: 0
Non-image files removed: 0


Folder: inputs/mildew_dataset/cherry-leaves\healthy
Total image files: 2104
Non-image files removed: 0


Folder: inputs/mildew_dataset/cherry-leaves\powdery_mildew
Total image files: 2104
Non-image files removed: 0




#### Split train validation test set

In [11]:
# The function below was taken from the CI training notebook

import math
import os
import shutil
import random


def split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):

    # compare with a tolerance: sums of float ratios are subject to rounding error
    if not math.isclose(train_set_ratio + validation_set_ratio + test_set_ratio, 1.0):
        print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return

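    # get the class labels from the sub-folder names
    labels = os.listdir(my_data_dir)  # expected to contain only the class folders
    if 'test' in labels:
        # the split folders already exist, so the split was done before
        pass
    else:
        # create train, validation and test folders, each with one sub-folder per class label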
        for folder in ['train', 'validation', 'test']:
            for label in labels:
                os.makedirs(name=my_data_dir + '/' + folder + '/' + label)

        for label in labels:

            files = os.listdir(my_data_dir + '/' + label)
            random.shuffle(files)

            train_set_files_qty = int(len(files) * train_set_ratio)
            validation_set_files_qty = int(len(files) * validation_set_ratio)

            count = 1
            for file_name in files:
                if count <= train_set_files_qty:
                    # move a given file to the train set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/train/' + label + '/' + file_name)

                elif count <= (train_set_files_qty + validation_set_files_qty):
                    # move a given file to the validation set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/validation/' + label + '/' + file_name)

                else:
                    # move given file to test set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/test/' + label + '/' + file_name)

                count += 1

            os.rmdir(my_data_dir + '/' + label)
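
A sum of float ratios is safest compared with a tolerance rather than with exact equality, because binary floats cannot represent most decimal fractions exactly. A minimal sketch of such a check follows; the helper name ratios_sum_to_one is hypothetical and not part of the notebook.

```python
import math


def ratios_sum_to_one(*ratios):
    """Return True when the ratios add up to 1.0 within floating-point tolerance."""
    return math.isclose(sum(ratios), 1.0)


print(0.1 + 0.2 == 0.3)                  # False: each term is rounded in binary
print(ratios_sum_to_one(0.7, 0.1, 0.2))  # True
print(ratios_sum_to_one(0.5, 0.2, 0.2))  # False
```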

We then split the dataset using a conventional ratio: 0.70 for the train set, 0.10 for the validation set and 0.20 for the test set.

In [13]:
split_train_validation_test_images(my_data_dir="inputs/mildew_dataset/cherry-leaves",
                                   train_set_ratio=0.7,
                                   validation_set_ratio=0.1,
                                   test_set_ratio=0.2
                                   )
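
Because the split function truncates the train and validation counts with int(), the test set receives whatever is left over. The sketch below works through the arithmetic for one class of 2,104 images, the per-class count reported in the cleaning step. The helper name split_sizes is hypothetical and only mirrors the counting logic of the function above.

```python
def split_sizes(n_files, train_ratio, validation_ratio):
    """Return (train, validation, test) counts for one class folder."""
    train = int(n_files * train_ratio)            # truncated, as in the split function
    validation = int(n_files * validation_ratio)  # truncated as well
    test = n_files - train - validation           # the remainder goes to the test set
    return train, validation, test


print(split_sizes(2104, 0.7, 0.1))  # (1472, 210, 422)
```

With two classes, this gives 2,944 train, 420 validation and 844 test images in total.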

---

## Push files to Repo

In [2]:
!git add .

In [3]:
!git commit -am"add new slots"

On branch main
Your branch is up to date with 'origin/main'.

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	../outputs/v1-mildew/avg_diff.png
	../outputs/v1-mildew/avg_var_healthy.png
	../outputs/v1-mildew/avg_var_powdery_mildew.png

nothing added to commit but untracked files present (use "git add" to track)


In [4]:
!git push

Everything up-to-date
