# **Data Collection**

## Objectives

* Retrieve data from the Kaggle image dataset provided by Farmy & Foods and prepare it for subsequent processing.

## Inputs

* Kaggle JSON file - authentication token

## Outputs

* Dataset Generation: https://www.kaggle.com/datasets/codeinstitute/cherry-leaves

### Additional Comments

* The client provided the data under a non-disclosure agreement (NDA), so it may be shared only with professionals involved in the project.
* The dataset calls for binary image classification: distinguishing healthy cherry leaves from leaves infected with powdery mildew.



---

## Import Packages

In [1]:
%pip install -r /workspace/mildew-detection-in-cherry-leaves/requirements.txt


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49m/home/gitpod/.pyenv/versions/3.12.7/bin/python3 -m pip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [2]:
import numpy
import os

## Change Working directory 

* Change the working directory from the `jupyter_notebooks` folder to the project root

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/mildew-detection-in-cherry-leaves/jupyter_notebooks'

In [4]:
# Move from the notebooks folder up to the project root
# (os was already imported in cell 2)
os.chdir('/workspace/mildew-detection-in-cherry-leaves')
print("You set a new current directory")

You set a new current directory


* Confirm the new current directory

In [5]:
current_dir = os.getcwd()
current_dir

'/workspace/mildew-detection-in-cherry-leaves'
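
The hard-coded path above only works inside this particular workspace. As a hedged alternative, the sketch below derives the project root from the notebook's own location. It assumes the notebooks live in a `jupyter_notebooks` folder directly under the root, as the output of cell 3 suggests; `find_project_root` is a hypothetical helper, not part of the project code.

```python
import os


def find_project_root(current_dir, notebooks_folder="jupyter_notebooks"):
    """Return the parent of `current_dir` if it is the notebooks folder,
    otherwise return `current_dir` unchanged (hypothetical helper)."""
    normalized = os.path.normpath(current_dir)
    if os.path.basename(normalized) == notebooks_folder:
        return os.path.dirname(normalized)
    return normalized


# Usage inside the notebook (commented out so the sketch has no side effects):
# os.chdir(find_project_root(os.getcwd()))
```

Because the helper returns the directory unchanged when it is not the notebooks folder, running the cell twice does not climb above the project root.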

## Install Kaggle 

In [6]:
%pip install kaggle==1.5.12


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49m/home/gitpod/.pyenv/versions/3.12.7/bin/python3 -m pip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


---

* Modify the Kaggle configuration directory and adjust the permissions of the Kaggle authentication JSON file.

In [7]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
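
The same two steps can be written in pure Python, which also allows a check that the token file is actually present. This is a minimal sketch, assuming `kaggle.json` sits in the directory passed in (the current working directory in the cell above); `configure_kaggle` is a hypothetical helper name.

```python
import os
import stat


def configure_kaggle(config_dir):
    """Point the Kaggle client at `config_dir` and restrict kaggle.json
    to owner read/write (hypothetical helper)."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"Expected an API token at {token_path}")
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    # Equivalent to `chmod 600 kaggle.json`: owner may read and write, nobody else
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return token_path


# Usage inside the notebook:
# configure_kaggle(os.getcwd())
```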



#### Important consideration 
During the development of the ML model, I encountered a problem related to the compatibility of label names with the text displayed as predictions in the Powdery Mildew detector page. The names of the folders extracted from the Kaggle dataset were 'healthy' and 'powdery-mildew', resulting in label names ['powdery-mildew', 'healthy'].

    labels = os.listdir(train_path)
    print('Label for the images are',labels)

    output : 
    Label for the images are ['powdery-mildew', 'healthy']

However, when rendering the Powdery Mildew detector page, I noticed that the prediction text reads naturally for the 'healthy' label but not for the 'powdery-mildew' one.

Code snippet from the *load_model_and_predict* function in the 'predictive_analysis.py' file:

    st.write(
        f'The predictive analysis indicates the leaf image is '
        f'**{pred_class.lower()}** from Powderly Mildew.'
    )

This renders as:
* 'The predictive analysis indicates the leaf image is 'healthy' from Powderly Mildew'

versus:
* 'The predictive analysis indicates the leaf image is 'powdery-mildew' from Powderly Mildew'

After consulting with colleagues and brainstorming solutions, I decided to change the label for the 'powdery-mildew' class to 'fungal-infected'. This involved renaming the folder containing the images and updating the label names accordingly.

Here's the process I followed:

1. Downloaded and extracted the dataset zip file.
2. Renamed the 'powdery-mildew' folder to 'fungal-infected', which changed the path of every image inside it (the file names themselves stayed the same).
3. Removed the original zip file after ensuring all changes were made.
        
        labels = os.listdir(train_path)
        print('Label for the images are',labels)

        output : 
        Label for the images are ['fungal-infected', 'healthy']

4. Repeated subsequent processes with the updated label names.

To ensure the Jupyter notebooks run correctly with the renamed label, skip the steps that download and extract the zip file through the notebook.
The corresponding cells are retained, but commented out, for documentation purposes.
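
Step 2 of the process above can also be scripted so that the rename is repeatable. This is a minimal sketch, assuming the extracted class folders sit directly under the dataset directory; `rename_label_folder` is a hypothetical helper, not part of the project code.

```python
import os


def rename_label_folder(dataset_dir, old_label, new_label):
    """Rename one class folder in place and return its new path."""
    old_path = os.path.join(dataset_dir, old_label)
    new_path = os.path.join(dataset_dir, new_label)
    if os.path.isdir(new_path):
        # Already renamed on a previous run: nothing to do
        return new_path
    os.rename(old_path, new_path)
    return new_path


# Example call (dataset path as used in the cells below):
# rename_label_folder('inputs/cherry-leaves_dataset/cherry-leaves',
#                     'powdery-mildew', 'fungal-infected')
```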


* Get the dataset path from the [Kaggle url](https://www.kaggle.com/datasets/codeinstitute/cherry-leaves) and set your destination folder.

*Information on the download is given in the previous cell.*

In [11]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/cherry-leaves_dataset/"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading cherry-leaves.zip to inputs/cherry-leaves_dataset
 95%|███████████████████████████████████▉  | 52.0M/55.0M [00:02<00:00, 34.2MB/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:02<00:00, 22.9MB/s]


---

In [14]:
import zipfile
with zipfile.ZipFile('inputs/cherry-leaves_dataset/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall('inputs/cherry-leaves_dataset')
os.remove('inputs/cherry-leaves_dataset/cherry-leaves.zip')
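
Because the download step is meant to be skipped on later runs, the extraction can be guarded so that the cell is safe to re-run when the archive is no longer there. This is a sketch under that assumption; `extract_and_remove` is a hypothetical helper name.

```python
import os
import zipfile


def extract_and_remove(zip_path, destination):
    """Extract `zip_path` into `destination`, then delete the archive.
    Returns False without doing anything if the archive is absent."""
    if not os.path.isfile(zip_path):
        return False
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(destination)
    os.remove(zip_path)
    return True


# Example call with the paths used above:
# extract_and_remove('inputs/cherry-leaves_dataset/cherry-leaves.zip',
#                    'inputs/cherry-leaves_dataset')
```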

## Data Preparation 

* Check for and remove non-image files

In [15]:
def remove_non_image_file(my_data_dir):
    """Delete non-image files from each class folder and report the counts."""
    image_extension = ('.png', '.jpg', '.jpeg')
    folders = os.listdir(my_data_dir)
    for folder in folders:
        folder_path = os.path.join(my_data_dir, folder)
        files = os.listdir(folder_path)

        image_count = 0
        non_image_count = 0
        for given_file in files:
            if not given_file.lower().endswith(image_extension):
                # Not an image: delete it
                os.remove(os.path.join(folder_path, given_file))
                non_image_count += 1
            else:
                image_count += 1
        print(f"Folder: {folder} - has image file", image_count)
        print(f"Folder: {folder} - has non-image file", non_image_count)
    
    

In [14]:
remove_non_image_file(my_data_dir='inputs/cherry-leaves_dataset/cherry-leaves')

Folder: fungal-infected - has image file 2104
Folder: fungal-infected - has non-image file 0
Folder: healthy - has image file 2104
Folder: healthy - has non-image file 0


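IsADirectoryError: [Errno 21] Is a directory: 'inputs/cherry-leaves_dataset/cherry-leaves/test/fungal-infected'

* Note: the path in this error contains the `test` folder, which indicates the cell was re-run after the split performed below. At that point the entries inside each top-level folder are directories, which `os.remove` cannot delete.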
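
One way to make this cleanup step safe to re-run is to skip sub-directories rather than pass them to `os.remove`. The variant below is a sketch only: `count_and_remove_non_images` is a hypothetical name, and it handles a single class folder rather than the whole dataset.

```python
import os

IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg')


def count_and_remove_non_images(class_dir):
    """Delete non-image files in `class_dir`, leaving sub-directories alone.
    Returns (image_count, removed_count)."""
    image_count = 0
    removed_count = 0
    for name in os.listdir(class_dir):
        path = os.path.join(class_dir, name)
        if not os.path.isfile(path):
            # Sub-directories (for example after a split) are skipped, not removed
            continue
        if name.lower().endswith(IMAGE_EXTENSIONS):
            image_count += 1
        else:
            os.remove(path)
            removed_count += 1
    return image_count, removed_count
```

Returning the counts instead of printing them keeps the helper easy to check; the caller can still print them per folder as the original cell does.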

## Split Train, Validation and Test Sets

* The `split_train_validation_test_images` function divides the dataset into three subsets:
    * Training set
    * Validation set
    * Test set

In [16]:
import math
import os
import random
import shutil


def split_train_validation_test_images(my_data_dir, train_set_ratio,
                                       validation_set_ratio, test_set_ratio):
    """Move the images of each class folder into train/validation/test sub-folders."""
    # Compare with a tolerance: a sum of float ratios is not always exactly 1.0
    ratio_sum = train_set_ratio + validation_set_ratio + test_set_ratio
    if not math.isclose(ratio_sum, 1.0):
        print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return

    # Class labels are the folder names directly under my_data_dir
    labels = os.listdir(my_data_dir)
    if 'test' in labels:
        # The dataset has already been split: no class folders are left to move
        print("Dataset is already split; nothing to do")
        return

    # Create train, validation and test folders with one sub-folder per class label
    for folder in ['train', 'validation', 'test']:
        for label in labels:
            os.makedirs(os.path.join(my_data_dir, folder, label))

    for label in labels:
        label_dir = os.path.join(my_data_dir, label)
        files = os.listdir(label_dir)
        random.shuffle(files)

        train_set_files_qty = int(len(files) * train_set_ratio)
        validation_set_files_qty = int(len(files) * validation_set_ratio)

        for count, file_name in enumerate(files, start=1):
            if count <= train_set_files_qty:
                target_set = 'train'
            elif count <= train_set_files_qty + validation_set_files_qty:
                target_set = 'validation'
            else:
                # Whatever remains goes to the test set
                target_set = 'test'
            shutil.move(os.path.join(label_dir, file_name),
                        os.path.join(my_data_dir, target_set, label, file_name))

        # The class folder is empty once all of its files have been moved
        os.rmdir(label_dir)

    # Print the number of files in each set after splitting
    separator = ""
    for set_name in ['train', 'validation', 'test']:
        print(f"{separator}Number of files in {set_name.capitalize()} set:")
        for label in labels:
            set_files = os.listdir(os.path.join(my_data_dir, set_name, label))
            print(f"Class {label}: {len(set_files)}")
        separator = "\n"

* By convention:

    * The training set comprises 70% of the data.
    * The validation set comprises 10% of the data.
    * The test set comprises 20% of the data.
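
As a quick check of what these ratios mean in practice, the sketch below mirrors the integer truncation used by the split function and applies it to the 2104 images per class reported earlier. `split_sizes` is a hypothetical helper introduced only for this illustration.

```python
def split_sizes(total_files, train_set_ratio, validation_set_ratio):
    """Return (train, validation, test) counts; the test set takes the remainder."""
    train = int(total_files * train_set_ratio)
    validation = int(total_files * validation_set_ratio)
    test = total_files - train - validation
    return train, validation, test


print(split_sizes(2104, 0.7, 0.1))  # (1472, 210, 422)
```

Because the train and validation counts are truncated, any rounding remainder ends up in the test set, which is why it holds 422 images rather than 420.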

In [17]:
split_train_validation_test_images(
    my_data_dir="inputs/cherry-leaves_dataset/cherry-leaves",
    train_set_ratio=0.7,
    validation_set_ratio=0.1,
    test_set_ratio=0.2,
)


* Check the number of images in each of the train, validation and test sets.

In [18]:
test_dir_healthy = 'inputs/cherry-leaves_dataset/cherry-leaves/test/healthy/'
test_dir_powdery = 'inputs/cherry-leaves_dataset/cherry-leaves/test/fungal-infected/'

train_dir_healthy= 'inputs/cherry-leaves_dataset/cherry-leaves/train/healthy/'
train_dir_powdery = 'inputs/cherry-leaves_dataset/cherry-leaves/train/fungal-infected/'
validation_dir_healthy = 'inputs/cherry-leaves_dataset/cherry-leaves/validation/healthy/'
validation_dir_powdery = 'inputs/cherry-leaves_dataset/cherry-leaves/validation/fungal-infected/'

# Function to count the number of files in a directory
def count_files(my_data_dir):
    return len(os.listdir(my_data_dir))

# Count the images per class in each directory. Each count needs its own
# variable: reusing one name per set would overwrite the healthy count.
test_count_healthy = count_files(test_dir_healthy)
test_count_powdery = count_files(test_dir_powdery)
train_count_healthy = count_files(train_dir_healthy)
train_count_powdery = count_files(train_dir_powdery)
validation_count_healthy = count_files(validation_dir_healthy)
validation_count_powdery = count_files(validation_dir_powdery)

# Print the counts for the fungal-infected class
# (the healthy counts are available in the *_healthy variables)
print("Number of images in the test set:", test_count_powdery)
print("Number of images in the train set:", train_count_powdery)
print("Number of images in the validation set:", validation_count_powdery)

Number of images in the test set: 422
Number of images in the train set: 1472
Number of images in the validation set: 210
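
A per-class breakdown makes it easier to confirm that both labels were split the same way. The sketch below assumes the `train`/`validation`/`test` layout created by the split function; `count_images_per_set` is a hypothetical helper, not part of the project code.

```python
import os


def count_images_per_set(my_data_dir, sets=('train', 'validation', 'test')):
    """Return {set_name: {label: file_count}} for a split dataset folder."""
    counts = {}
    for set_name in sets:
        set_dir = os.path.join(my_data_dir, set_name)
        counts[set_name] = {
            label: len(os.listdir(os.path.join(set_dir, label)))
            for label in sorted(os.listdir(set_dir))
        }
    return counts


# Example call:
# counts = count_images_per_set('inputs/cherry-leaves_dataset/cherry-leaves')
# for set_name, per_label in counts.items():
#     print(set_name, per_label, 'total:', sum(per_label.values()))
```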


## Push files to Repo 

In [20]:
import os
try:
    # Change directory to your repository
    os.chdir("/workspace/mildew-detection-in-cherry-leaves/")

    # Note: os.system returns the command's exit status instead of raising,
    # so the except clause below only catches errors raised by os.chdir

    # Add all files to the staging area
    os.system("git add .")

    # Commit the changes with a message
    commit_message = "Save the data Collection notebook"
    os.system(f"git commit -m '{commit_message}'")

    # Push changes to the origin
    os.system("git push origin main")
except Exception as e:
    print(e)

[main 7413745] Save the data Collection notebook
 3 files changed, 100 insertions(+), 76 deletions(-)
 delete mode 100644 inputs/cherry-leaves_dataset/archive.zip


To https://github.com/FerchaPombo/mildew-detection-in-cherry-leaves.git
   8dbd721..7413745  main -> main


---

## Conclusions and Next Steps

* The dataset acquired from Farmy & Foods comprises two folders containing images of healthy and diseased leaves. No missing data was identified, and all files are images (jpg, jpeg, png), so no additional data cleaning was necessary.

* The dataset has been partitioned into train, validation and test sets using a 70/10/20 split.

* The next stage of the data cycle is data visualization, to give a graphical overview of the dataset.