# **Data Collection**

## Objectives

* Fetch the data from Kaggle, unzip it, prepare it and store it for further analysis.

## Inputs

* Kaggle JSON file, authentication token.

## Outputs

* Generate Dataset: inputs/datasets/cherry-leaves

## Additional Comments

* Once prepared, the data must be stored. Remove any file that is not an image, then split the data into train, validation and test folders.


---

### Import packages

In [2]:
import numpy
import os

**Install the requirements here, in the notebook, rather than in the terminal; otherwise the kaggle package causes problems.**

In [2]:
%pip install -r /workspace/milestone-project-mildew-detection-in-cherry-leaves/requirements.txt

Note: you may need to restart the kernel to use updated packages.


### Change working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [3]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/milestone-project-mildew-detection-in-cherry-leaves/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [4]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [5]:
current_dir = os.getcwd()
current_dir

'/workspace/milestone-project-mildew-detection-in-cherry-leaves'

### Install kaggle

In [6]:
%pip install kaggle

Note: you may need to restart the kernel to use updated packages.


---

* Set the Kaggle configuration directory to the current working directory and restrict the permissions of the Kaggle authentication JSON file.

In [7]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
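
The same configuration can be done from Python, which makes it easy to fail early when the token is missing. This is a minimal sketch, assuming `kaggle.json` sits in the working directory as in this notebook; the helper name `secure_kaggle_token` is hypothetical and not part of the kaggle package.

```python
import os
import stat


def secure_kaggle_token(config_dir, token_name="kaggle.json"):
    """Point Kaggle at config_dir and restrict the token to owner read/write.

    Hypothetical helper for illustration; it mirrors the cell above.
    """
    token_path = os.path.join(config_dir, token_name)
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"Kaggle token not found: {token_path}")
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    # Equivalent to `chmod 600`: the owner may read and write, nobody else
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return token_path
```

Calling `secure_kaggle_token(os.getcwd())` would replace both lines of the cell above.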

### Set Kaggle Dataset and Download it

In [11]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/datasets"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading cherry-leaves.zip to inputs/datasets
 76%|█████████████████████████████         | 42.0M/55.0M [00:01<00:00, 30.8MB/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:01<00:00, 38.3MB/s]


* Unzip the downloaded file, then delete the zip file

In [12]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/cherry-leaves.zip')
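
The unzip-and-delete step can also be wrapped in a small function that reports what the archive contained, which helps confirm the folder name used later in the notebook. This is a sketch under stated assumptions: the function name `extract_and_remove` is hypothetical, and it only uses the standard `zipfile` and `os` modules.

```python
import os
import zipfile


def extract_and_remove(zip_path, destination_folder):
    """Extract zip_path into destination_folder, delete the archive and
    return the sorted top-level names found inside it."""
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        # Archive member names always use forward slashes
        top_level = sorted({name.split('/')[0] for name in zip_ref.namelist()})
        zip_ref.extractall(destination_folder)
    # Only remove the archive once extraction finished without error
    os.remove(zip_path)
    return top_level
```

For this dataset the call would be `extract_and_remove(DestinationFolder + '/cherry-leaves.zip', DestinationFolder)`.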

# Data Preparation

---

## Data cleaning

### Check and remove non-image files

In [10]:
def remove_non_image_file(my_data_dir):
    """Remove every file whose extension is not .png, .jpg or .jpeg,
    then report the image and non-image counts for each folder."""
    image_extension = ('.png', '.jpg', '.jpeg')
    folders = os.listdir(my_data_dir)

    for folder in folders:
        files = os.listdir(os.path.join(my_data_dir, folder))
        image_count = 0
        non_image_count = 0
        # Iterate over every file in each folder of the dataset
        for given_file in files:
            if not given_file.lower().endswith(image_extension):
                file_location = os.path.join(my_data_dir, folder, given_file)
                os.remove(file_location)  # remove non-image file
                non_image_count += 1
            else:
                image_count += 1
        print(f"Folder: {folder} - has image file", image_count)
        print(f"Folder: {folder} - has non-image file", non_image_count)

In [13]:
remove_non_image_file(my_data_dir='inputs/datasets/cherry-leaves')

Folder: healthy - has image file 2104
Folder: healthy - has non-image file 0
Folder: powdery_mildew - has image file 2104
Folder: powdery_mildew - has non-image file 0
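
Because the cleaning step deletes files, it can be useful to list the candidates first. The following is a minimal dry-run sketch that applies the same extension rule but removes nothing; the name `find_non_image_files` is hypothetical and not part of the notebook.

```python
import os

IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg')


def find_non_image_files(my_data_dir):
    """Return the paths of files that the cleaning step would delete."""
    non_images = []
    for folder in sorted(os.listdir(my_data_dir)):
        folder_path = os.path.join(my_data_dir, folder)
        if not os.path.isdir(folder_path):
            continue  # skip stray files at the top level
        for file_name in sorted(os.listdir(folder_path)):
            # Same case-insensitive extension test as the cleaning function
            if not file_name.lower().endswith(IMAGE_EXTENSIONS):
                non_images.append(os.path.join(folder_path, file_name))
    return non_images
```

An empty list from `find_non_image_files('inputs/datasets/cherry-leaves')` would match the zero non-image counts printed above.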


### Split train, validation and test sets

In [16]:
import math
import os
import random
import shutil


def split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):
    """Split each class folder into train, validation and test sets.
    A conventional split is 70%, 10% and 20% of the data respectively."""

    # Compare with a tolerance: sums of float ratios are not always exactly 1.0
    if not math.isclose(train_set_ratio + validation_set_ratio + test_set_ratio, 1.0):
        print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return

    # Get the class labels; my_data_dir should contain only the class folders
    labels = os.listdir(my_data_dir)
    if 'test' in labels:
        # A test folder means the split has already been done
        return

    # Create train, validation and test folders with a sub-folder per class label
    for folder in ['train', 'validation', 'test']:
        for label in labels:
            os.makedirs(os.path.join(my_data_dir, folder, label))

    for label in labels:
        label_dir = os.path.join(my_data_dir, label)
        files = os.listdir(label_dir)
        random.shuffle(files)

        train_set_files_qty = int(len(files) * train_set_ratio)
        validation_set_files_qty = int(len(files) * validation_set_ratio)

        for count, file_name in enumerate(files, start=1):
            if count <= train_set_files_qty:
                target = 'train'
            elif count <= train_set_files_qty + validation_set_files_qty:
                target = 'validation'
            else:
                target = 'test'
            # Move the file into its set, keeping the class label sub-folder
            shutil.move(os.path.join(label_dir, file_name),
                        os.path.join(my_data_dir, target, label, file_name))

        # The original class folder is empty at this point
        os.rmdir(label_dir)
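
The split shuffles the files with `random.shuffle`, so each run assigns different images to each set. If a reproducible split is wanted, one option is to sort the names and shuffle with a seeded generator. This is only a sketch: the helper `shuffled_copy` and the seed value are assumptions for illustration, not something the notebook uses.

```python
import random


def shuffled_copy(file_names, seed=42):
    """Return the file names in a shuffled order that depends only on the seed."""
    # os.listdir returns names in arbitrary order, so sort before shuffling
    files = sorted(file_names)
    # A dedicated Random instance leaves the global random state untouched
    random.Random(seed).shuffle(files)
    return files
```

Inside the split function, `files = shuffled_copy(os.listdir(...))` would take the place of the `random.shuffle(files)` call.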
    

* Split in train(70%), validation(10%) and test(20%) sets

In [17]:
split_train_validation_test_images(my_data_dir="inputs/datasets/cherry-leaves",
                        train_set_ratio = 0.7,
                        validation_set_ratio=0.1,
                        test_set_ratio=0.2
                        )
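
To check the result of the split, the expected set sizes can be computed with the same integer truncation the function uses and then compared with what is on disk. This is a minimal sketch; the helper names `expected_split_sizes` and `count_split_files` are hypothetical, and 2104 is the per-class image count reported by the cleaning step.

```python
import os


def expected_split_sizes(n_files, train_set_ratio, validation_set_ratio):
    """Mirror the split arithmetic: train and validation are truncated,
    and the test set receives whatever is left."""
    train = int(n_files * train_set_ratio)
    validation = int(n_files * validation_set_ratio)
    return train, validation, n_files - train - validation


def count_split_files(my_data_dir):
    """Count the files in every <set>/<label> folder after the split."""
    counts = {}
    for split in ('train', 'validation', 'test'):
        split_dir = os.path.join(my_data_dir, split)
        for label in sorted(os.listdir(split_dir)):
            counts[(split, label)] = len(os.listdir(os.path.join(split_dir, label)))
    return counts


print(expected_split_sizes(2104, 0.7, 0.1))  # (1472, 210, 422)
```

With 2104 images per class, each label should therefore hold 1472 training, 210 validation and 422 test images, and `count_split_files('inputs/datasets/cherry-leaves')` can confirm this.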

---

## Conclusions and next steps

* The data has been downloaded and cleaned as expected.
* We now have three folders inside the inputs/datasets/cherry-leaves folder (train, validation and test). Each of them contains two sub-folders of images: one with the healthy leaves and one with the leaves infected by powdery mildew.
* The next steps are to visualize the different kinds of leaves, compute their average and variability images, examine the contrast between them and try to answer business requirement 1.
