# **Data Collection**

## Objectives

* Fetch data from Kaggle and prepare it for further processing

## Inputs

* Kaggle JSON file - the authentication token

## Outputs

* Generated dataset: inputs/cherry_dataset/cherry-leaves




---

# Install required packages

In [2]:
%pip install -r /workspaces/mildew-detection-in-cherry-leaves/requirements.txt

Collecting typing-extensions~=3.7.4 (from tensorflow-cpu==2.6.0->-r /workspaces/mildew-detection-in-cherry-leaves/requirements.txt (line 10))
  Using cached typing_extensions-3.7.4.3-py3-none-any.whl (22 kB)
Installing collected packages: typing-extensions
  Attempting uninstall: typing-extensions
    Found existing installation: typing_extensions 4.8.0
    Uninstalling typing_extensions-4.8.0:
      Successfully uninstalled typing_extensions-4.8.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
astroid 3.0.1 requires typing-extensions>=4.0.0; python_version < "3.11", but you have typing-extensions 3.7.4.3 which is incompatible.
async-lru 2.0.4 requires typing-extensions>=4.0.0; python_version < "3.11", but you have typing-extensions 3.7.4.3 which is incompatible.
mypy 1.6.1 requires typing-extensions>=4.1.0, but you have typing-extensions 3.7.4.3 which is

# Change working directory

* The notebook runs from the jupyter_notebooks subfolder, so the working directory must be changed to the project root
* Get the current directory with os.getcwd()

In [3]:
import os
current_dir = os.getcwd()
current_dir

'/workspaces/mildew-detection-in-cherry-leaves/jupyter_notebooks'

Make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [4]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [5]:
current_dir = os.getcwd()
current_dir

'/workspaces/mildew-detection-in-cherry-leaves'
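
Running the `os.chdir` cell above a second time would move one level further up. A minimal sketch of a repeat-safe alternative, assuming the project root can be recognised by a marker file. The helper name `find_project_root` and the choice of `requirements.txt` as the marker are illustrative assumptions, not part of the original notebook:

```python
import os


def find_project_root(start_dir, marker="requirements.txt"):
    """Walk upwards from start_dir until a folder containing marker is found.

    Hypothetical helper: the marker file name is an assumption.
    """
    current = os.path.abspath(start_dir)
    while True:
        if os.path.exists(os.path.join(current, marker)):
            return current
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root without a match
            raise FileNotFoundError(
                f"No folder containing {marker!r} at or above {start_dir!r}"
            )
        current = parent
```

It could then be used as `os.chdir(find_project_root(os.getcwd()))`, which gives the same result however many times the cell is run.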

# Section 1

## Install Kaggle package

In [7]:
!pip install kaggle


Collecting kaggle
  Downloading kaggle-1.5.16.tar.gz (83 kB)
  Preparing metadata (setup.py) ... done
Collecting tqdm (from kaggle)
  Downloading tqdm-4.66.1-py3-none-any.whl.metadata (57 kB)
Collecting python-slugify (from kaggle)
  Downloading python_slugify-8.0.1-py2.py3-none-any.whl (9.7 kB)
Collecting text-unidecode>=1.3 (from python-slugify->kaggle)
  Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
Downloading tqdm-4.66.1-py3-none-any.whl (78 kB)
Building

## Set the Kaggle configuration directory to the current working directory

In [8]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
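
The cell above assumes `kaggle.json` already sits in the working directory. A minimal sketch of the same setup in pure Python that fails early if the token is missing. The function name `configure_kaggle` is an illustrative assumption, and the `os.chmod` call is meant to mirror `chmod 600` on POSIX systems:

```python
import os
import stat


def configure_kaggle(config_dir):
    """Point the Kaggle client at config_dir and restrict the token's permissions."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"Expected the Kaggle token at {token_path}")
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    # Owner read/write only, equivalent to `chmod 600` on POSIX systems
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return token_path
```

Calling `configure_kaggle(os.getcwd())` would replace both lines of the cell above.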

## Download the cherry leaves Kaggle dataset

In [9]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/cherry_dataset"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading cherry-leaves.zip to inputs/cherry_dataset
 98%|█████████████████████████████████████▎| 54.0M/55.0M [00:47<00:00, 3.67MB/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:47<00:00, 1.20MB/s]


## Unzip the downloaded file and delete the zip

In [13]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)
os.remove(DestinationFolder + '/cherry-leaves.zip')
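
If the download was interrupted, the cell above would raise part-way through. A minimal sketch of the same unzip-and-delete step that first checks the archive is readable and only removes it after a successful extraction. The helper name `extract_and_remove` is an illustrative assumption:

```python
import os
import zipfile


def extract_and_remove(zip_path, destination):
    """Extract zip_path into destination, then delete the archive."""
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"{zip_path} is not a readable zip archive")
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        names = zip_ref.namelist()
        zip_ref.extractall(destination)
    os.remove(zip_path)  # only reached if extraction succeeded
    return names
```

With the variables defined earlier, the call would be `extract_and_remove(os.path.join(DestinationFolder, 'cherry-leaves.zip'), DestinationFolder)`.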


# Section 2

## Data Preparation

### Data Cleaning

* Check and remove non-image files

In [37]:
import os


def remove_non_image_file(cherry_data_dir):
    """Delete every file under cherry_data_dir that lacks an image extension."""
    image_extension = ('.png', '.jpg', '.jpeg')
    for root, folders, files in os.walk(cherry_data_dir):
        image_count = 0
        non_image_count = 0
        for given_file in files:
            # str.endswith accepts a tuple, so one call checks every extension
            if not given_file.lower().endswith(image_extension):
                file_location = os.path.join(root, given_file)
                os.remove(file_location)  # Remove non-image file
                non_image_count += 1
            else:
                image_count += 1
        # the non-image count reports files that were found and removed
        print(f"Folder: {root} - has {image_count} image files")
        print(f"Folder: {root} - has {non_image_count} non-image files")





In [38]:
remove_non_image_file('inputs/cherry_dataset/cherry-leaves')


Folder: inputs/cherry_dataset/cherry-leaves - has 0 image files
Folder: inputs/cherry_dataset/cherry-leaves - has 0 non-image files
Folder: inputs/cherry_dataset/cherry-leaves/healthy - has 2104 image files
Folder: inputs/cherry_dataset/cherry-leaves/healthy - has 0 non-image files
Folder: inputs/cherry_dataset/cherry-leaves/powdery_mildew - has 2104 image files
Folder: inputs/cherry_dataset/cherry-leaves/powdery_mildew - has 0 non-image files
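
Because `remove_non_image_file` deletes files as it goes, a non-destructive preview can be useful before running it. A minimal sketch, assuming a hypothetical helper `count_extensions` that only tallies file extensions and removes nothing:

```python
import os
from collections import Counter


def count_extensions(data_dir):
    """Return a Counter of lower-cased file extensions found under data_dir."""
    counts = Counter()
    for root, _folders, files in os.walk(data_dir):
        for name in files:
            counts[os.path.splitext(name)[1].lower()] += 1
    return counts
```

Calling `count_extensions('inputs/cherry_dataset/cherry-leaves')` before the cleaning step would show which extensions are present and how many files each one covers.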


---

### Split train, validation and test sets

In [39]:
import os
import shutil
import random


def split_train_validation_test_images(cherry_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):

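    # compare with a tolerance: an exact float comparison can fail for ratios that do sum to 1
    if abs(train_set_ratio + validation_set_ratio + test_set_ratio - 1.0) > 1e-9:
        print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return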

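    # get the class labels (the sub-folder names)
    labels = os.listdir(cherry_data_dir)  # expected to contain only the class folders
    if 'test' in labels:
        # the split has already been performed, so there is nothing to do
        pass
    else:
        # create train, validation and test folders, each with one sub-folder per class label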
        for folder in ['train', 'validation', 'test']:
            for label in labels:
                os.makedirs(name=cherry_data_dir + '/' + folder + '/' + label)

        for label in labels:

            files = os.listdir(cherry_data_dir + '/' + label)
            random.shuffle(files)

            train_set_files_qty = int(len(files) * train_set_ratio)
            validation_set_files_qty = int(len(files) * validation_set_ratio)

            count = 1
            for file_name in files:
                if count <= train_set_files_qty:
                    # move a given file to the train set
                    shutil.move(cherry_data_dir + '/' + label + '/' + file_name,
                                cherry_data_dir + '/train/' + label + '/' + file_name)

                elif count <= (train_set_files_qty + validation_set_files_qty):
                    # move a given file to the validation set
                    shutil.move(cherry_data_dir + '/' + label + '/' + file_name,
                                cherry_data_dir + '/validation/' + label + '/' + file_name)

                else:
                    # move given file to test set
                    shutil.move(cherry_data_dir + '/' + label + '/' + file_name,
                                cherry_data_dir + '/test/' + label + '/' + file_name)

                count += 1

            os.rmdir(cherry_data_dir + '/' + label)
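
`random.shuffle` is called without a seed, so each fresh run of the function assigns different files to each set. A minimal sketch of a reproducible shuffle, assuming a hypothetical helper `shuffled_copy` and an arbitrary seed of 42, neither of which is part of the original notebook:

```python
import random


def shuffled_copy(file_names, seed=42):
    """Return file_names in a shuffled order that is identical for a given seed."""
    ordered = sorted(file_names)  # os.listdir order is arbitrary, so sort first
    random.Random(seed).shuffle(ordered)  # private generator, global state untouched
    return ordered
```

Inside the function, `files = shuffled_copy(os.listdir(...))` could stand in for the `os.listdir` and `random.shuffle` pair.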

Note: the data is split with the following ratios.

* The training set receives 0.70 of the data.
* The validation set receives 0.10 of the data.
* The test set receives 0.20 of the data.
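
As a worked example of these ratios, take the 2104 images per class reported by the cleaning step. The function truncates the train and validation counts with `int()` and sends every remaining file to the test set, so the test share ends up slightly above 0.20. The helper name `split_counts` is illustrative only:

```python
def split_counts(n_files, train_ratio, validation_ratio):
    """Mirror the counting logic of split_train_validation_test_images."""
    train = int(n_files * train_ratio)
    validation = int(n_files * validation_ratio)
    test = n_files - train - validation  # everything left over goes to the test set
    return train, validation, test


print(split_counts(2104, 0.7, 0.1))  # (1472, 210, 422)
```

So each class would contribute 1472 training, 210 validation and 422 test images.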

In [41]:
split_train_validation_test_images(cherry_data_dir="inputs/cherry_dataset/cherry-leaves",
                                   train_set_ratio=0.7,
                                   validation_set_ratio=0.1,
                                   test_set_ratio=0.2
                                   )
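
The split cell prints nothing, so a quick count per folder can confirm the result. A minimal sketch, assuming the `train`, `validation` and `test` folder layout created by the function above. The helper name `count_split_images` is an illustrative assumption:

```python
import os


def count_split_images(data_dir, splits=("train", "validation", "test")):
    """Return {split: {label: file_count}} for the folder layout created above."""
    summary = {}
    for split in splits:
        split_dir = os.path.join(data_dir, split)
        summary[split] = {
            label: len(os.listdir(os.path.join(split_dir, label)))
            for label in sorted(os.listdir(split_dir))
        }
    return summary
```

Calling `count_split_images('inputs/cherry_dataset/cherry-leaves')` returns a nested dictionary that can be compared against the expected ratios.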

---


# Push files to Repo

!git add .

In [42]:
!git commit -m " Data collection and preparation"

On branch main
Your branch is up to date with 'origin/main'.

Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	deleted:    inputs/cherry_dataset/cherry-leaves/validation/healthy/002efba9-09b3-43de-93b7-5c2460185cde___JR_HL 9655.JPG
	deleted:    inputs/cherry_dataset/cherry-leaves/validation/healthy/01958ee7-f585-4956-90aa-a40dc79102d4___JR_HL 9836.JPG
	deleted:    inputs/cherry_dataset/cherry-leaves/validation/healthy/02a7466b-4847-4a18-bbd9-f0278e7b8d20___JR_HL 9582.JPG
	deleted:    inputs/cherry_dataset/cherry-leaves/validation/healthy/02d14b15-897d-4081-8953-e7d83189cff4___JR_HL 9723.JPG
	deleted:    inputs/cherry_dataset/cherry-leaves/validation/healthy/0580bdc7-e60e-4ba9-87dc-1202e57b94aa___JR_HL 4159.JPG
	deleted:    inputs/cherry_dataset/cherry-leaves/validation/healthy/0a68b3a9-e38f-45b5-8664-9c60e557a41d___JR_HL 951

In [43]:
!git push

Everything up-to-date
