This notebook takes the "dogs-vs-cats-redux-kernels-edition" data and saves it as output with the following structure under /kaggle/working/:

* .../train_validate (a directory with images for training and validation)
    * .../cats (a sub-directory with 12500 cat images)
    * .../dogs (a sub-directory with 12500 dog images)
* .../test (a directory with 12500 unlabelled images for testing)
* .../test_dir/test (a second copy of the same 12500 unlabelled test images)


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Unzip raw data

Unzip "train.zip" to "/kaggle/working/train" 

In [None]:
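import zipfile
# train data: the archive holds a top-level "train" folder,
# so extracting into /kaggle/working creates /kaggle/working/train
local_zip = '/kaggle/input/dogs-vs-cats-redux-kernels-edition/train.zip'
with zipfile.ZipFile(local_zip, 'r') as zip_ref:
    zip_ref.extractall('/kaggle/working')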
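The extraction step can also be wrapped in a small helper. The following is a minimal sketch, not part of the original notebook: the function name `extract_zip` and its return value are assumptions made for illustration. It opens the archive in a `with` block so the file is closed even if extraction fails, and it takes an explicit destination instead of relying on the current working directory.

```python
import zipfile
from pathlib import Path


def extract_zip(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir.

    Returns the number of regular files (not directories) in the archive.
    Hypothetical helper for illustration.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(dest)
        return sum(1 for info in zip_ref.infolist() if not info.is_dir())
```

In this notebook it could be called as `extract_zip('/kaggle/input/dogs-vs-cats-redux-kernels-edition/train.zip', '/kaggle/working')`.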

Create a second copy of the test data by unzipping "test.zip" and moving the result to "/kaggle/working/test_dir/test" (the directory structure needed to load test images using ```tf.keras.preprocessing.image_dataset_from_directory```)

In [None]:
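import zipfile
# test data: the archive holds a top-level "test" folder
local_zip = '/kaggle/input/dogs-vs-cats-redux-kernels-edition/test.zip'
with zipfile.ZipFile(local_zip, 'r') as zip_ref:
    zip_ref.extractall('/kaggle/working')

# Each "!" command runs in its own subshell, so "!cd" does not persist
# between lines; use absolute paths instead.
!mkdir -p /kaggle/working/test_dir
!mv /kaggle/working/test /kaggle/working/test_dir/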

In [None]:
import os
if len(os.listdir('/kaggle/working/test_dir/test'))==12500:
    print("Test images are saved into {}".format('/kaggle/working/test_dir/test'))
else:
    raise ValueError("The number of test images in '/kaggle/working/test_dir/test' is not 12500")
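The count checks in this notebook all follow the same pattern, so they could be folded into one function. This is a minimal sketch under that assumption; `check_file_count` is a hypothetical name, and like the cells here it counts every entry returned by `os.listdir`, not only image files.

```python
import os


def check_file_count(directory, expected):
    """Raise ValueError unless directory holds exactly `expected` entries.

    Returns the number of entries found. Hypothetical helper for
    illustration; it does not filter by file type.
    """
    found = len(os.listdir(directory))
    if found != expected:
        raise ValueError(
            "Expected {} entries in '{}', found {}".format(expected, directory, found)
        )
    return found
```

A call such as `check_file_count('/kaggle/working/test_dir/test', 12500)` would then replace the explicit `if`/`else` check.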

Unzip "test.zip" again, this time leaving it in "/kaggle/working/test" (the directory structure needed to load test images using ```tensorflow.keras.preprocessing.image.ImageDataGenerator.flow_from_dataframe```)

In [None]:
import zipfile
# test data: extracting into /kaggle/working creates /kaggle/working/test
local_zip = '/kaggle/input/dogs-vs-cats-redux-kernels-edition/test.zip'
with zipfile.ZipFile(local_zip, 'r') as zip_ref:
    zip_ref.extractall('/kaggle/working')

In [None]:
import os
if len(os.listdir('/kaggle/working/test'))==12500:
    print("Test images are saved into {}".format('/kaggle/working/test'))
else:
    raise ValueError("The number of test images in '/kaggle/working/test' is not 12500")
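Loading with `flow_from_dataframe` requires a table of file names, and `os.listdir` returns names in arbitrary order. The sketch below shows one way to build a stable, numerically ordered list. It assumes the test files are named `<id>.jpg` (for example `1.jpg`), which this notebook does not verify, and `list_test_files` is a hypothetical helper name.

```python
import os


def list_test_files(test_dir):
    """Return the .jpg file names in test_dir, ordered by numeric id.

    Assumes names of the form '<id>.jpg'; names with a non-numeric stem
    or a different extension are skipped. Hypothetical helper.
    """
    ids = []
    for name in os.listdir(test_dir):
        stem, ext = os.path.splitext(name)
        if ext.lower() == '.jpg' and stem.isdigit():
            ids.append((int(stem), name))
    # sort by the integer id so that '2.jpg' comes before '10.jpg'
    return [name for _, name in sorted(ids)]
```

The resulting list can be placed in a single-column data frame later, when the test images are actually loaded.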

Create the "train_validate" folder with "cats" and "dogs" subfolders to hold the images from the "train" folder.

In [None]:
import os
# set the base directory to be the project directory
base_dir = '/kaggle/working/'
# initialize a directory with subdirectories for cats and dogs;
# exist_ok=True makes this safe to re-run, and any other OSError
# (e.g. a permission problem) is allowed to surface instead of being hidden
for sub_directory in ('train_validate/cats', 'train_validate/dogs'):
    path = os.path.join(base_dir, sub_directory)
    os.makedirs(path, exist_ok=True)

Sort the images from the "train" folder into cats ("train_validate/cats") and dogs ("train_validate/dogs").

In [None]:
from shutil import copyfile
import os

# set the directories according to the structure defined before
source_dir = '/kaggle/working/train/'
dst_dir = '/kaggle/working/train_validate/'

# list the source files once so the total is not recomputed inside the loop
files = os.listdir(source_dir)
total = len(files)

# no random split happens here, so no random seed is needed
for counter, file in enumerate(files, start=1):
    src = source_dir + file
    # file names start with their label ("cat" or "dog")
    if file.startswith('cat'):
        copyfile(src, dst_dir + 'cats/' + file)
    elif file.startswith('dog'):
        copyfile(src, dst_dir + 'dogs/' + file)
    # print progress every 100th iteration
    if counter % 100 == 0:
        print("\r Progress: {:.1f} %".format(counter / total * 100), end="")
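The sorting step can also be written as a reusable function that reports how many files went into each class. This is a minimal sketch rather than the notebook's own method: `sort_by_prefix` and its `prefixes` parameter are hypothetical, and the target folder name is assumed to be the prefix plus "s" (so "cat" goes to "cats").

```python
import shutil
from pathlib import Path


def sort_by_prefix(source_dir, dst_dir, prefixes=('cat', 'dog')):
    """Copy files whose names start with a prefix into dst_dir/<prefix>s/.

    Returns a dict mapping each prefix to the number of files copied.
    Hypothetical helper; files matching no prefix are left untouched.
    """
    counts = {prefix: 0 for prefix in prefixes}
    for src in Path(source_dir).iterdir():
        if not src.is_file():
            continue
        for prefix in prefixes:
            if src.name.startswith(prefix):
                target_dir = Path(dst_dir) / (prefix + 's')
                target_dir.mkdir(parents=True, exist_ok=True)
                shutil.copyfile(src, target_dir / src.name)
                counts[prefix] += 1
                break
    return counts
```

Returning the counts lets the caller compare them directly with the expected 12500 images per class.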

In [None]:
import os
if len(os.listdir('/kaggle/working/train_validate/cats'))==12500:
    print("Cat images for training and validation are saved into {}".format(dst_dir+'cats'))
else:
    raise ValueError("The number of cat images read is not 12500")
# check the dogs folder (not the cats folder a second time)
if len(os.listdir('/kaggle/working/train_validate/dogs'))==12500:
    print("Dog images for training and validation are saved into {}".format(dst_dir+'dogs'))
else:
    raise ValueError("The number of dog images read is not 12500")

In [None]:
import os
print("Number of images in cats folder :", len(os.listdir("/kaggle/working/train_validate/cats")))
print("Number of images in dogs folder :", len(os.listdir("/kaggle/working/train_validate/dogs")))

Notebook's output:

* "train_validate" folder contains images for training and validation. They are sorted into two classes (cats and dogs) but are not yet split into training and validation sets.

* "test" folder contains test images we will use later for testing (directory structure needed to load data using ```tensorflow.keras.preprocessing.image.ImageDataGenerator.flow_from_dataframe```)

* "test_dir/test" folder contains test images we will use later for testing (directory structure needed to load data using ```tf.keras.preprocessing.image_dataset_from_directory```)

* "train" folder contains images which we already copied to "train_validate" folder and can be deleted. 

In [None]:
!rm -rf /kaggle/working/train/

In [None]:
import os
if not os.path.exists("/kaggle/working/train/"):
    print("The 'train' folder {} was removed".format("/kaggle/working/train/"))
else:
    raise ValueError("train folder was not removed")
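The output summary notes that the images in "train_validate" are not yet split into training and validation sets. One possible way to do that later is sketched below, assuming a reproducible random split by file name. The function `split_filenames`, the 0.2 validation fraction, and the seed value are illustrative assumptions, not choices made in this notebook.

```python
import random


def split_filenames(filenames, validation_fraction=0.2, seed=1):
    """Split file names into (train, validation) lists reproducibly.

    The names are sorted first so the result does not depend on the
    order returned by os.listdir. Hypothetical helper for illustration.
    """
    names = sorted(filenames)
    rng = random.Random(seed)  # local generator; global random state is untouched
    rng.shuffle(names)
    n_validation = int(len(names) * validation_fraction)
    return names[n_validation:], names[:n_validation]
```

Using a local `random.Random(seed)` keeps the split identical across runs without affecting other code that uses the `random` module.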