# Split Catmon Folders

This notebook splits the catmon input folders into train, validation and test datasets.

It uses the Python `split-folders` module, which expects input of the form:

```
input/
    class1/
        <unique>.jpg
        ...
    class2/
        <unique>.jpg
        ...
    ...
```

This gives output of the form:
```
output/
    train/
        class1/
            <unique>.jpg
            ...
        class2/
            <unique>.jpg
            ...
    val/
        class1/
            <unique>.jpg
            ...
        class2/
            <unique>.jpg
            ...
    test/
        class1/
            <unique>.jpg
            ...
        class2/
            <unique>.jpg
            ...
```
---
For catmon, the input is already in the required form:

```
catmon_input/
    boo/
        <unique>.jpg
        ...
    simba/
        <unique>.jpg
        ...
    unknown/
        <unique>.jpg
        ...
```
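
Before splitting, it can help to confirm that the input really has one subfolder per class with files in each. The following is a minimal sketch using only `pathlib`; `summarise_class_folders` is a hypothetical helper introduced here for illustration and is not part of split-folders.

```python
from pathlib import Path


def summarise_class_folders(root):
    """Return {class_name: file_count} for each immediate subfolder of root."""
    root = Path(root)
    if not root.is_dir():
        raise FileNotFoundError(f"input folder not found: {root}")
    counts = {}
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # Count only regular files directly inside the class folder.
        counts[class_dir.name] = sum(1 for f in class_dir.iterdir() if f.is_file())
    return counts
```

Calling `summarise_class_folders('datasets/catmon_input')` should then return one entry each for `boo`, `simba` and `unknown`, assuming the folder layout shown above.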

## Set-up

In [1]:
import splitfolders
from pathlib import Path

In [2]:
splitfolders.__version__

'0.5.1'

## Check understanding of the split-folders algorithm

### Create a set of dummy files to test the split-folders algorithm

In [3]:
# define dummy structure (raw strings avoid invalid-escape warnings for the backslashes)
DUMMY_ROOT_INPUT = r'.\datasets\dummy_catmon_input'
DUMMY_ROOT_OUTPUT = r'.\datasets\dummy_catmon'
DUMMY_LABELS = ['boo', 'simba', 'unknown']
DUMMY_NUM_FILES = 10

# create input folders (parents=True also creates .\datasets if it is missing)
Path(DUMMY_ROOT_INPUT).mkdir(parents=True)
for label in DUMMY_LABELS:
    folder = f'{DUMMY_ROOT_INPUT}/{label}'
    Path(folder).mkdir()
    
# create input files
for label in DUMMY_LABELS:
    for file_num in range(DUMMY_NUM_FILES):
        file_path = f'{DUMMY_ROOT_INPUT}/{label}/{label}{file_num}.txt'
        print(file_path)
        Path(file_path).touch()
        
# show input file structure
!dir /s {DUMMY_ROOT_INPUT}

.\datasets\dummy_catmon_input/boo/boo0.txt
.\datasets\dummy_catmon_input/boo/boo1.txt
.\datasets\dummy_catmon_input/boo/boo2.txt
.\datasets\dummy_catmon_input/boo/boo3.txt
.\datasets\dummy_catmon_input/boo/boo4.txt
.\datasets\dummy_catmon_input/boo/boo5.txt
.\datasets\dummy_catmon_input/boo/boo6.txt
.\datasets\dummy_catmon_input/boo/boo7.txt
.\datasets\dummy_catmon_input/boo/boo8.txt
.\datasets\dummy_catmon_input/boo/boo9.txt
.\datasets\dummy_catmon_input/simba/simba0.txt
.\datasets\dummy_catmon_input/simba/simba1.txt
.\datasets\dummy_catmon_input/simba/simba2.txt
.\datasets\dummy_catmon_input/simba/simba3.txt
.\datasets\dummy_catmon_input/simba/simba4.txt
.\datasets\dummy_catmon_input/simba/simba5.txt
.\datasets\dummy_catmon_input/simba/simba6.txt
.\datasets\dummy_catmon_input/simba/simba7.txt
.\datasets\dummy_catmon_input/simba/simba8.txt
.\datasets\dummy_catmon_input/simba/simba9.txt
.\datasets\dummy_catmon_input/unknown/unknown0.txt
.\datasets\dummy_catmon_input/unknown/unknown1.tx

### Split input data into output data using splitfolders

In [4]:
# Test split with a ratio of (train, val, test).
# Only `seed` and `ratio` differ from the library defaults here.
splitfolders.ratio(
    input=DUMMY_ROOT_INPUT, output=DUMMY_ROOT_OUTPUT,
    seed=4242, ratio=(.5, .3, .2), group_prefix=None, move=False)
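
Rather than reading a long directory listing, the result of the split can be checked by counting the files in each output folder. This is a minimal sketch, assuming the `train`/`val`/`test` layout described at the top of this notebook; `count_split_files` is a hypothetical helper, not a split-folders function.

```python
from pathlib import Path


def count_split_files(output_root, splits=("train", "val", "test")):
    """Return {split: {class_name: file_count}} for a split output folder."""
    output_root = Path(output_root)
    counts = {}
    for split in splits:
        split_dir = output_root / split
        if not split_dir.is_dir():
            continue  # skip splits that were not produced
        counts[split] = {
            class_dir.name: sum(1 for f in class_dir.iterdir() if f.is_file())
            for class_dir in sorted(split_dir.iterdir())
            if class_dir.is_dir()
        }
    return counts
```

With 10 dummy files per class and `ratio=(.5, .3, .2)`, `count_split_files(DUMMY_ROOT_OUTPUT)` should come out at about 5, 3 and 2 files per class for train, val and test respectively.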

In [5]:
# show output file structure
!dir /s {DUMMY_ROOT_OUTPUT}

 Volume in drive C is OS
 Volume Serial Number is 5470-E9D2

 Directory of C:\Users\Terry\Documents\Python\project-catmon-img-classifier\datasets\dummy_catmon

11/08/2022  14:47    <DIR>          .
11/08/2022  14:47    <DIR>          ..
11/08/2022  14:47    <DIR>          test
11/08/2022  14:47    <DIR>          train
11/08/2022  14:47    <DIR>          val
               0 File(s)              0 bytes

 Directory of C:\Users\Terry\Documents\Python\project-catmon-img-classifier\datasets\dummy_catmon\test

11/08/2022  14:47    <DIR>          .
11/08/2022  14:47    <DIR>          ..
11/08/2022  14:47    <DIR>          boo
11/08/2022  14:47    <DIR>          simba
11/08/2022  14:47    <DIR>          unknown
               0 File(s)              0 bytes

 Directory of C:\Users\Terry\Documents\Python\project-catmon-img-classifier\datasets\dummy_catmon\test\boo

11/08/2022  14:47    <DIR>          .
11/08/2022  14:47    <DIR>          ..
11/08/2022  14:47                 0 boo2.txt
11/08/202

In [6]:
# tidy up, remove DUMMY_ROOT_OUTPUT
!rmdir {DUMMY_ROOT_OUTPUT} /q /s

In [7]:
# Test split with a fixed number of items for the val and test sets, e.g. `(100, 100)`.
# To split into training and validation sets only, pass a single number to `fixed`, e.g. `10`.
# Pass 3 values, e.g. `(300, 100, 100)`, to also limit the number of training items.
# Only `fixed` differs from the library defaults here.
splitfolders.fixed(
    input=DUMMY_ROOT_INPUT, output=DUMMY_ROOT_OUTPUT,
    seed=1337, fixed=(1, 2), oversample=False, group_prefix=None, move=False)
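
The arithmetic behind a two-value `fixed` split can be sketched as follows. This assumes, as the comment above suggests, that the two numbers are the per-class val and test sizes in that order and that all remaining files go to train; `expected_fixed_counts` is a hypothetical helper written for this notebook, not part of split-folders.

```python
def expected_fixed_counts(num_files, fixed):
    """Estimate per-class split sizes, assuming fixed=(num_val, num_test)."""
    num_val, num_test = fixed
    num_train = num_files - num_val - num_test  # remainder goes to train
    if num_train < 0:
        raise ValueError("fixed sizes exceed the number of files in the class")
    return {"train": num_train, "val": num_val, "test": num_test}
```

For the dummy data, `expected_fixed_counts(10, (1, 2))` gives 7 train, 1 val and 2 test files per class under that assumption.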

In [8]:
# show output file structure
!dir /s {DUMMY_ROOT_OUTPUT}

 Volume in drive C is OS
 Volume Serial Number is 5470-E9D2

 Directory of C:\Users\Terry\Documents\Python\project-catmon-img-classifier\datasets\dummy_catmon

11/08/2022  14:47    <DIR>          .
11/08/2022  14:47    <DIR>          ..
11/08/2022  14:47    <DIR>          test
11/08/2022  14:47    <DIR>          train
11/08/2022  14:47    <DIR>          val
               0 File(s)              0 bytes

 Directory of C:\Users\Terry\Documents\Python\project-catmon-img-classifier\datasets\dummy_catmon\test

11/08/2022  14:47    <DIR>          .
11/08/2022  14:47    <DIR>          ..
11/08/2022  14:47    <DIR>          boo
11/08/2022  14:47    <DIR>          simba
11/08/2022  14:47    <DIR>          unknown
               0 File(s)              0 bytes

 Directory of C:\Users\Terry\Documents\Python\project-catmon-img-classifier\datasets\dummy_catmon\test\boo

11/08/2022  14:47    <DIR>          .
11/08/2022  14:47    <DIR>          ..
11/08/2022  14:47                 0 boo8.txt
11/08/202

In [9]:
# tidy up, remove DUMMY_ROOT_INPUT and DUMMY_ROOT_OUTPUT
!rmdir {DUMMY_ROOT_INPUT} /q /s
!rmdir {DUMMY_ROOT_OUTPUT} /q /s

# show output file structure (should be File Not Found)
!dir /s {DUMMY_ROOT_INPUT}
!dir /s {DUMMY_ROOT_OUTPUT}

 Volume in drive C is OS
 Volume Serial Number is 5470-E9D2

 Directory of C:\Users\Terry\Documents\Python\project-catmon-img-classifier\datasets\tmp

08/07/2022  00:41    <DIR>          dummy_catmon_input
               0 File(s)              0 bytes

     Total Files Listed:
               0 File(s)              0 bytes
               1 Dir(s)  259,181,576,192 bytes free
 Volume in drive C is OS
 Volume Serial Number is 5470-E9D2

 Directory of C:\Users\Terry\Documents\Python\project-catmon-img-classifier\datasets\tmp

08/07/2022  11:25    <DIR>          dummy_catmon
               0 File(s)              0 bytes

     Total Files Listed:
               0 File(s)              0 bytes
               1 Dir(s)  259,181,576,192 bytes free
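
The `rmdir /q /s` and `dir /s` shell commands above only work on Windows. As a sketch of an alternative that stays in Python, the standard library's `shutil.rmtree` can remove a folder tree; `remove_tree` is a hypothetical wrapper introduced here so that a missing folder is reported rather than raising an error.

```python
import shutil
from pathlib import Path


def remove_tree(folder):
    """Delete folder and everything in it; return True if it existed."""
    folder = Path(folder)
    if not folder.exists():
        return False
    shutil.rmtree(folder)
    return True
```

After `remove_tree(DUMMY_ROOT_OUTPUT)`, a second call should return `False`, which plays the same role as the "File Not Found" check above.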


## Split catmon folders

In [10]:
CATMON_ROOT_INPUT = r'.\datasets\catmon_input'
CATMON_ROOT_OUTPUT = r'.\datasets\catmon'
CATMON_RATIO = (.8, .1, .1)

In [11]:
# Split with selected ratio
splitfolders.ratio(
    input=CATMON_ROOT_INPUT, output=CATMON_ROOT_OUTPUT,
    seed=4242, ratio=CATMON_RATIO, group_prefix=None, move=False)
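
Since the output feeds a classifier, it may be worth checking that no image ended up in more than one of the train, val and test sets. The following is a minimal sketch that compares file names per class across the split folders; `find_split_overlaps` is a hypothetical helper, and it assumes that file names are unique within a class, as in the `<unique>.jpg` layout above.

```python
from pathlib import Path


def find_split_overlaps(output_root, splits=("train", "val", "test")):
    """Return the (class_name, file_name) pairs found in more than one split."""
    output_root = Path(output_root)
    seen = set()
    overlaps = set()
    for split in splits:
        split_dir = output_root / split
        if not split_dir.is_dir():
            continue
        # Files sit exactly two levels down: <split>/<class>/<file>.
        for path in split_dir.glob("*/*"):
            if not path.is_file():
                continue
            key = (path.parent.name, path.name)
            if key in seen:
                overlaps.add(key)
            seen.add(key)
    return overlaps
```

An empty set from `find_split_overlaps(CATMON_ROOT_OUTPUT)` would indicate that each image appears in exactly one split.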

In [12]:
!dir /s {CATMON_ROOT_OUTPUT} | findstr /c:"File(s)" /c:"Dir"

 Directory of C:\Users\Terry\Documents\Python\project-catmon-img-classifier\datasets\catmon
               0 File(s)              0 bytes
 Directory of C:\Users\Terry\Documents\Python\project-catmon-img-classifier\datasets\catmon\test
               0 File(s)              0 bytes
 Directory of C:\Users\Terry\Documents\Python\project-catmon-img-classifier\datasets\catmon\test\boo
             100 File(s)     22,747,669 bytes
 Directory of C:\Users\Terry\Documents\Python\project-catmon-img-classifier\datasets\catmon\test\simba
             100 File(s)     22,261,611 bytes
 Directory of C:\Users\Terry\Documents\Python\project-catmon-img-classifier\datasets\catmon\test\unknown
              80 File(s)     10,049,099 bytes
 Directory of C:\Users\Terry\Documents\Python\project-catmon-img-classifier\datasets\catmon\train
               0 File(s)              0 bytes
 Directory of C:\Users\Terry\Documents\Python\project-catmon-img-classifier\datasets\catmon\train\boo
             800 File(s)  

In [13]:
# Show output file structure
!dir /s {CATMON_ROOT_OUTPUT}

 Volume in drive C is OS
 Volume Serial Number is 5470-E9D2

 Directory of C:\Users\Terry\Documents\Python\project-catmon-img-classifier\datasets\catmon

11/08/2022  14:47    <DIR>          .
11/08/2022  14:47    <DIR>          ..
11/08/2022  14:47    <DIR>          test
11/08/2022  14:47    <DIR>          train
11/08/2022  14:47    <DIR>          val
               0 File(s)              0 bytes

 Directory of C:\Users\Terry\Documents\Python\project-catmon-img-classifier\datasets\catmon\test

11/08/2022  14:47    <DIR>          .
11/08/2022  14:47    <DIR>          ..
11/08/2022  14:47    <DIR>          boo
11/08/2022  14:47    <DIR>          simba
11/08/2022  14:47    <DIR>          unknown
               0 File(s)              0 bytes

 Directory of C:\Users\Terry\Documents\Python\project-catmon-img-classifier\datasets\catmon\test\boo

11/08/2022  14:47    <DIR>          .
11/08/2022  14:47    <DIR>          ..
30/04/2021  14:44           255,858 2021-04-30_144447.jpg
07/05/2021  12

14/05/2021  21:27           230,599 2021-05-14_212714.jpg
15/05/2021  22:45           195,651 2021-05-15_224543.jpg
16/05/2021  10:12           257,681 2021-05-16_101241.jpg
17/05/2021  08:30           250,633 2021-05-17_083042.jpg
18/05/2021  11:15           225,696 2021-05-18_111532.jpg
19/05/2021  20:53           224,811 2021-05-19_205350.jpg
20/05/2021  13:02           245,506 2021-05-20_130242.jpg
21/05/2021  10:33           237,786 2021-05-21_103321.jpg
21/05/2021  11:22           259,099 2021-05-21_112210.jpg
21/05/2021  14:02           256,442 2021-05-21_140242.jpg
21/05/2021  22:21           244,637 2021-05-21_222136.jpg
22/05/2021  13:23           238,818 2021-05-22_132301.jpg
22/05/2021  20:32           241,927 2021-05-22_203217.jpg
22/05/2021  22:46           243,928 2021-05-22_224636.jpg
22/05/2021  23:24           245,211 2021-05-22_232455.jpg
23/05/2021  00:46           242,524 2021-05-23_004621.jpg
23/05/2021  08:46           237,827 2021-05-23_084626.jpg
23/05/2021  11

22/03/2022  13:39           242,238 2022-03-22_123919.jpg
23/03/2022  01:21           218,517 2022-03-23_002139.jpg
23/03/2022  06:40           216,146 2022-03-23_054034.jpg
23/03/2022  09:00           227,086 2022-03-23_080056.jpg
23/03/2022  11:40           249,836 2022-03-23_104014.jpg
23/03/2022  16:13           252,402 2022-03-23_151333.jpg
23/03/2022  21:51           218,995 2022-03-23_205123.jpg
24/03/2022  05:29           220,000 2022-03-24_042947.jpg
24/03/2022  07:50           206,430 2022-03-24_065055.jpg
24/03/2022  16:01           255,143 2022-03-24_150154.jpg
24/03/2022  18:07           240,416 2022-03-24_170721.jpg
25/03/2022  00:44           220,209 2022-03-24_234454.jpg
25/03/2022  06:05           219,536 2022-03-25_050541.jpg
25/03/2022  08:28           219,738 2022-03-25_072814.jpg
25/03/2022  11:05           259,383 2022-03-25_100531.jpg
25/03/2022  20:15           233,108 2022-03-25_191509.jpg
25/03/2022  23:19           233,280 2022-03-25_221954.jpg
26/03/2022  18

06/03/2021  21:42           118,443 2021-03-06_204202.jpg
11/03/2021  23:48           118,066 2021-03-11_224847.jpg
12/03/2021  21:39           145,391 2021-03-12_203904.jpg
13/03/2021  08:00           247,280 2021-03-13_070012.jpg
17/03/2021  00:06           119,854 2021-03-16_230609.jpg
17/03/2021  00:37           113,732 2021-03-16_233712.jpg
18/03/2021  06:20           113,794 2021-03-18_052039.jpg
18/03/2021  21:38           116,917 2021-03-18_203857.jpg
19/03/2021  20:50           132,874 2021-03-19_195030.jpg
21/03/2021  22:38           119,771 2021-03-21_213819.jpg
22/03/2021  23:22           108,926 2021-03-22_222256.jpg
23/03/2021  23:31           110,578 2021-03-23_223106.jpg
28/03/2021  06:29           118,347 2021-03-28_062920.jpg
02/04/2021  20:24           122,411 2021-04-02_202431.jpg
05/04/2021  15:41           244,935 2021-04-05_154138.jpg
07/04/2021  22:27           115,899 2021-04-07_222723.jpg
09/04/2021  21:42           130,889 2021-04-09_214208.jpg
12/04/2021  00