# **Data Collection**

## Objectives

* Retrieve data from the Kaggle image dataset provided by Farmy & Foods and prepare it for subsequent processing.

## Inputs

* Kaggle JSON file - authentication token

## Outputs

* Dataset Generation: https://www.kaggle.com/codeinstitute/cherry-leaves

## Additional Comments

* The client has provided the data under a non-disclosure agreement (NDA), necessitating its restricted sharing solely with project-involved professionals.
Regarding the dataset's nature, it implies binary image classification, distinguishing between healthy cherry leaves and those infected with powdery mildew.


---

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

## Import Packages

In [1]:
%pip install -r /workspace/mildew-detection-in-cherry-leaves/requirements.txt


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.0.1[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [2]:
import numpy
import os

## Change Working directory 

* We are assuming you will store the notebooks in a subfolder, therefore when running the notebook in the editor, you will need to change the working directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/mildew-detection-in-cherry-leaves/jupyter_notebooks'

In [4]:
import os
current_dir = os.getcwd()
current_dir
os.chdir('/workspace/mildew-detection-in-cherry-leaves')
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [5]:
current_dir = os.getcwd()
current_dir

'/workspace/mildew-detection-in-cherry-leaves'

# Install Kaggle 

Section 1 content

In [6]:
%pip install kaggle==1.5.12

Collecting kaggle==1.5.12
  Downloading kaggle-1.5.12.tar.gz (58 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m59.0/59.0 kB[0m [31m1.3 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[?25h  Preparing metadata (setup.py) ... [?25ldone
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... [?25ldone
[?25h  Created wheel for kaggle: filename=kaggle-1.5.12-py3-none-any.whl size=73049 sha256=a7e467c4d87b7648e984c324bc4ca556f8e5afe2dbd7d09cf02ddc0627c3148f
  Stored in directory: /workspace/.pyenv_mirror/pip_cache/wheels/29/da/11/144cc25aebdaeb4931b231e25fd34b394e6a5725cbb2f50106
Successfully built kaggle
Installing collected packages: kaggle
  Attempting uninstall: kaggle
    Found existing installation: kaggle 1.6.6
    Uninstalling kaggle-1.6.6:
      Successfully uninstalled kaggle-1.6.6
Successfully installed kaggle-1.5.12

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.0.

---

* Modify the Kaggle configuration directory and adjust the permissions of the Kaggle authentication JSON file.

In [7]:
os.environ['/workspace/mildew-detection-in-cherry-leaves/jupyter_notebooks'] = os.getcwd()
! chmod 600 kaggle.json

* Get the dataset path from the [Kaggle url](https://www.kaggle.com/codeinstitute/cherry-leaves) and set your destination folder.

In [11]:
KaggleDatasetPath = "/codeinstitute/cherry-leaves_dataset"
DestinationFolder = "inputs/cherry-leaves_dataset"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Traceback (most recent call last):
  File "/home/gitpod/.pyenv/versions/3.8.18/bin/kaggle", line 5, in <module>
    from kaggle.cli import main
  File "/home/gitpod/.pyenv/versions/3.8.18/lib/python3.8/site-packages/kaggle/__init__.py", line 23, in <module>
    api.authenticate()
  File "/home/gitpod/.pyenv/versions/3.8.18/lib/python3.8/site-packages/kaggle/api/kaggle_api_extended.py", line 164, in authenticate
    raise IOError('Could not find {}. Make sure it\'s located in'
OSError: Could not find kaggle.json. Make sure it's located in /home/gitpod/.kaggle. Or use the environment method.


---

* Unzip the downloaded file, delete the zip file

In [13]:
import zipfile
with zipfile.ZipFile('/workspace/mildew-detection-in-cherry-leaves/inputs/cherry-leaves_dataset/archive.zip', 'r') as zip_ref:
    zip_ref.extractall('inputs/cherry-leaves_dataset')


## Data Preparation 

* Checking and removing non image files

In [14]:
def remove_non_image_file(my_data_dir):
    image_extension = ('.png', '.jpg', '.jpeg')
    folders = os.listdir(my_data_dir) 
    for folder in folders:
        files = os.listdir(my_data_dir + '/' + folder)
        
            #print(files)
        i = []
        j = []
        for given_file in files:
            if not given_file.lower().endswith(image_extension):
                file_location = my_data_dir + '/' + folder + '/' + given_file
                os.remove(file_location) # remove non image file
                i.append(1)
            else:
                j.append(1)
                pass
        print(f"Folder: {folder} - has image file",len(j))
        print(f"Folder: {folder} - has non-image file",len(i))
     

In [18]:
remove_non_image_file(my_data_dir='inputs/cherry-leaves_dataset/cherry-leaves')

Folder: healthy - has image file 2104
Folder: healthy - has non-image file 0
Folder: powdery_mildew - has image file 2104
Folder: powdery_mildew - has non-image file 0


# Split train validation test set

* We are dividing the data set into 3 subsets by using the "split_train_validation_test_images" function: <br/> ** Training Set<br/> ** Test Set  <br/>** Validation Set 

In [16]:
import os
import shutil
import random
import joblib

def split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):
  
  if train_set_ratio + validation_set_ratio + test_set_ratio != 1.0:
    print("train_set_ratio + validation_set_ratio + test_set_ratio should sum 1.0")
    return

  # gets classes labels
  labels = os.listdir(my_data_dir) # it should get only the folder name
  if 'test' in labels:
    pass
  else: 
    # create train, test folders with classess labels sub-folder
    for folder in ['train','validation','test']:
      for label in labels:
        os.makedirs(name=my_data_dir+ '/' + folder + '/' + label)

    for label in labels:

      files = os.listdir(my_data_dir + '/' + label)
      random.shuffle(files)

      train_set_files_qty = int(len(files) * train_set_ratio)
      validation_set_files_qty = int(len(files) * validation_set_ratio)

      count = 1
      for file_name in files:
        if count <= train_set_files_qty:
          # move given file to train set
          shutil.move(my_data_dir + '/' + label + '/' + file_name,
                      my_data_dir + '/train/' + label + '/' + file_name)
          

        elif count <= (train_set_files_qty + validation_set_files_qty ):
          # move given file to validation set
          shutil.move(my_data_dir + '/' + label + '/' + file_name,
                      my_data_dir + '/validation/' + label + '/' + file_name)

        else:
          # move given file to test set
          shutil.move(my_data_dir + '/' + label + '/' + file_name,
                  my_data_dir + '/test/' +label + '/'+ file_name)
          
        count += 1

      os.rmdir(my_data_dir + '/' + label)
    

*By convention,

The training set comprises 70% of the data.
The validation set comprises 10% of the data.
The test set comprises 20% of the data.

In [19]:
split_train_validation_test_images(my_data_dir = f"inputs/cherry-leaves_dataset/cherry-leaves",
                        train_set_ratio = 0.7,
                        validation_set_ratio=0.1,
                        test_set_ratio=0.2
                        )

NOTE

* You may add as many sections as you want, as long as it supports your project workflow.
* All notebook's cells should be run top-down (you can't create a dynamic wherein a given point you need to go back to a previous cell to execute some task, like go back to a previous cell and refresh a variable content)

---

# Push files to Repo

* If you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
    # create here your folder
    # os.makedirs(name='')
except Exception as e:
    print(e)
