# Week 4. Create directories and datasets
Cognitive Systems for Health Technology Applications<br>
Sakari Lukkarinen & Juha Kopu, 6.2.2018<br>
[Helsinki Metropolia University of Applied Sciences](http://metropolia.fi/en)

## Introduction
This is a helper script that:
1. Checks how many samples are in the original downloaded dataset
2. Defines the names of the dataset directories
3. Creates the directories
4. Splits the data into train, validation, and test sets
5. Copies the data into the directories

Before running the script you need to download the dataset from: https://github.com/Nomikxyz/retinopathy-dataset and extract the data to your computer. 

This script assumes that the original dataset is extracted to a folder named `retinopathy-dataset-master` directly under `Documents`, and that this script is located two folder levels below `Documents` (for example in `Documents/Python scripts/Case2/`).

After the script has been executed, the directory structure is the following:
```
Documents/
    retinopathy-dataset-master/
        nosymptoms/
        symptoms/
    Python scripts/
        Case2/
            This script.ipynb
    dataset2/
        train/
            nosymptoms/
            symptoms/
        validation/
            nosymptoms/
            symptoms/
        test/
            nosymptoms/
            symptoms/
```



## 1. Check the downloaded data
First check that we have access to the downloaded and extracted data. This script reads the filenames in the `nosymptoms` and `symptoms` subfolders and counts how many observations there are in each class.

In [40]:
import os, shutil

In [41]:
# List all filenames in the master dataset and count how many samples there are
original_dir = '\\Users\\Sanni Tolonen\\Documents\\Kolmosvuoden jutut\\Cognitive Applications\\retinopathy-dataset-master'

class1 = 'nosymptoms'
original_nosymptoms_dir = os.path.join(original_dir, class1)
nosymptoms_fnames = os.listdir(original_nosymptoms_dir)

class2 = 'symptoms'
original_symptoms_dir = os.path.join(original_dir, class2)
symptoms_fnames = os.listdir(original_symptoms_dir)

len(nosymptoms_fnames), len(symptoms_fnames)

(1468, 595)
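
The two counts also show that the classes are imbalanced: 1468 of the 2063 images (about 71%) are healthy cases. As a minimal sketch, the same check could be wrapped into a small helper that counts only regular files and reports the class shares. The function names `count_files` and `class_shares` are hypothetical and not part of the original script.

```python
import os

def count_files(root_dir, class_names):
    """Return {class_name: number of regular files} for each class folder (hypothetical helper)."""
    counts = {}
    for name in class_names:
        class_dir = os.path.join(root_dir, name)
        # Count only regular files so that stray subfolders are not treated as samples
        counts[name] = sum(
            os.path.isfile(os.path.join(class_dir, entry))
            for entry in os.listdir(class_dir)
        )
    return counts

def class_shares(counts):
    """Return the fraction of all samples that belongs to each class."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}
```

With the variables of the cell above, `count_files(original_dir, [class1, class2])` would be expected to give the same two counts, provided the class folders contain only image files.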

## 2. Directory structure
Next we create the names for the directories and save them into variables.

In [42]:
# Base directory is where the datasets will be created
base_dir = '..\\..\\dataset2'

# For training set
sub_dir = 'train'
train_dir = os.path.join(base_dir, sub_dir)
train_nosymptoms_dir = os.path.join(base_dir, sub_dir, class1)
train_symptoms_dir = os.path.join(base_dir, sub_dir, class2)

# For validation set
sub_dir = 'validation'
validation_dir = os.path.join(base_dir, sub_dir)
validation_nosymptoms_dir = os.path.join(base_dir, sub_dir, class1)
validation_symptoms_dir = os.path.join(base_dir, sub_dir, class2)

# For test set
sub_dir = 'test'
test_dir = os.path.join(base_dir, sub_dir)
test_nosymptoms_dir = os.path.join(base_dir, sub_dir, class1)
test_symptoms_dir = os.path.join(base_dir, sub_dir, class2)
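
The directory variables above all follow the pattern `base_dir/<split>/<class>`. A compact alternative, shown here only as a sketch, is to build the class folders in a nested dictionary. The helper name `build_split_dirs` and the dictionary layout are assumptions made for illustration; the rest of this script keeps using the individual variables.

```python
import os

def build_split_dirs(base_dir,
                     splits=('train', 'validation', 'test'),
                     classes=('nosymptoms', 'symptoms')):
    """Return {split: {class: path}} without touching the file system (hypothetical helper)."""
    return {
        split: {cls: os.path.join(base_dir, split, cls) for cls in classes}
        for split in splits
    }

# Example: look up the folder for the validation images with symptoms
dirs = build_split_dirs('dataset2')
print(dirs['validation']['symptoms'])
```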

## 3. Create directories
The following cell checks whether the base folder already exists. If it does not, i.e. this is the first time the script is run, the cell creates the directory structure for the train, validation, and test sets.

See also: 
- [Helper Python scripts](https://github.com/geekcomputers/Python)
- [How to delete the contents of a folder in Python?](https://stackoverflow.com/questions/185936/how-to-delete-the-contents-of-a-folder-in-python)

In [43]:
if not os.path.exists(base_dir):
    print('Creating dataset folders in:', base_dir)
    os.mkdir(base_dir)
    os.mkdir(train_dir)
    os.mkdir(train_nosymptoms_dir)
    os.mkdir(train_symptoms_dir)
    os.mkdir(validation_dir)
    os.mkdir(validation_nosymptoms_dir)
    os.mkdir(validation_symptoms_dir)
    os.mkdir(test_dir)
    os.mkdir(test_nosymptoms_dir)
    os.mkdir(test_symptoms_dir)
else:
    print(base_dir, 'already exists!')

..\..\dataset2 already exists!
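
The cell above creates the folders only when `base_dir` does not exist yet, so a partially created structure is not repaired on a second run. A possible alternative, sketched here under the assumption that re-running should be harmless, uses `os.makedirs` with `exist_ok=True`, which creates missing parent folders and does not raise an error for folders that already exist. The function name `make_dataset_dirs` is hypothetical.

```python
import os

def make_dataset_dirs(base_dir,
                      splits=('train', 'validation', 'test'),
                      classes=('nosymptoms', 'symptoms')):
    """Ensure base_dir/<split>/<class> exists for every combination (hypothetical helper)."""
    paths = []
    for split in splits:
        for cls in classes:
            path = os.path.join(base_dir, split, cls)
            # exist_ok=True: no error if the folder is already there
            os.makedirs(path, exist_ok=True)
            paths.append(path)
    return paths
```

Note that this variant never deletes anything: files left over from an earlier run stay in place.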


## 4. Split the data filenames into train, validation and test sets
Now that the directories are ready, it is time to split the original dataset into training, validation and test sets. For that we use the `train_test_split` function from the `scikit-learn` library. First we split the dataset 80%-20%, and then we split the 80% part further into 60% and 20% of the total. In the end we get:
- 60% training set
- 20% validation set, and 
- 20% test set.

This needs to be done separately for the healthy (`nosymptoms`) and the disease (`symptoms`) cases.
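
The arithmetic of the two-stage split can be checked without any library. The sketch below assumes that the held-out part is rounded up to a whole sample at each stage; to our understanding this is how `train_test_split` treats a fractional `test_size`, and it reproduces the counts reported in the cells of this section. The function name `split_sizes` is hypothetical.

```python
def split_sizes(n_samples):
    """Return (train, validation, test) sizes for a 60/20/20 two-stage split."""
    # Stage 1: hold out 20% for testing, rounded up (integer ceiling of n / 5)
    n_test = -(-n_samples // 5)
    remaining = n_samples - n_test
    # Stage 2: hold out 25% of the remaining 80% (= 20% of the total) for validation
    n_validation = -(-remaining // 4)
    n_train = remaining - n_validation
    return n_train, n_validation, n_test

print(split_sizes(595))   # symptoms:   (357, 119, 119)
print(split_sizes(1468))  # nosymptoms: (880, 294, 294)
```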

In [44]:
from sklearn.model_selection import train_test_split

In [45]:
# Disease (symptoms) cases split

# Take 20 % out for testing
train_symptoms_fnames, test_symptoms_fnames = train_test_split(symptoms_fnames, test_size = 0.2)

# From the remaining 80% take 0.25 (=0.8*0.25 = 20% of total) out for validation
train_symptoms_fnames, validation_symptoms_fnames = train_test_split(train_symptoms_fnames, test_size = 0.25)

len(train_symptoms_fnames), len(validation_symptoms_fnames), len(test_symptoms_fnames)
# For debugging purposes, remove the comment marks.
# print(train_symptoms_fnames)
# print(validation_symptoms_fnames)
# print(test_symptoms_fnames)

(357, 119, 119)

In [46]:
# Healthy (nosymptoms) cases split

# Take 20 % out for testing
train_nosymptoms_fnames, test_nosymptoms_fnames = train_test_split(nosymptoms_fnames, test_size = 0.2)

# From the remaining 80% take 0.25 (20% of total) out for validation
train_nosymptoms_fnames, validation_nosymptoms_fnames = train_test_split(train_nosymptoms_fnames, test_size = 0.25)

len(train_nosymptoms_fnames), len(validation_nosymptoms_fnames), len(test_nosymptoms_fnames)
# For debugging purposes, remove the comment marks.
# print(train_nosymptoms_fnames)
# print(validation_nosymptoms_fnames)
# print(test_nosymptoms_fnames)

(880, 294, 294)
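
Before copying any files it can be worth confirming that the three filename lists do not overlap and together cover the original list. The following is a minimal sketch of such a check; `check_split` is a hypothetical helper, not part of the original script.

```python
def check_split(all_names, train, validation, test):
    """Raise ValueError if the subsets overlap or miss files; return their sizes."""
    subsets = {'train': set(train), 'validation': set(validation), 'test': set(test)}
    names = list(subsets)
    # Every pair of subsets must be disjoint
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            common = subsets[a] & subsets[b]
            if common:
                raise ValueError('{} and {} share {} file(s)'.format(a, b, len(common)))
    # Together the subsets must contain exactly the original filenames
    if set().union(*subsets.values()) != set(all_names):
        raise ValueError('the subsets do not cover the original file list')
    return tuple(len(subsets[n]) for n in names)
```

For example, `check_split(symptoms_fnames, train_symptoms_fnames, validation_symptoms_fnames, test_symptoms_fnames)` would be expected to return the sizes shown above. Because `train_test_split` shuffles the data, re-running the cells gives a different split unless a fixed `random_state` is passed.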

## 5. Copy data into the directories
The last thing to do is to copy the original files into the training, validation and test directories. As this might take some time, we measure the time spent on it.

In [47]:
import time

In [48]:
tStart = time.time()

# Copy the original files into the dataset folders

# Training set
# Disease 
for fname in train_symptoms_fnames:
    src = os.path.join(original_symptoms_dir, fname)
    dst = os.path.join(train_symptoms_dir, fname)
    shutil.copyfile(src, dst)
# Healthy 
for fname in train_nosymptoms_fnames:
    src = os.path.join(original_nosymptoms_dir, fname)
    dst = os.path.join(train_nosymptoms_dir, fname)
    shutil.copyfile(src, dst)

# Validation set
# Disease 
for fname in validation_symptoms_fnames:
    src = os.path.join(original_symptoms_dir, fname)
    dst = os.path.join(validation_symptoms_dir, fname)
    shutil.copyfile(src, dst)
# Healthy
for fname in validation_nosymptoms_fnames:
    src = os.path.join(original_nosymptoms_dir, fname)
    dst = os.path.join(validation_nosymptoms_dir, fname)
    shutil.copyfile(src, dst)

# Test set
# Disease
for fname in test_symptoms_fnames:
    src = os.path.join(original_symptoms_dir, fname)
    dst = os.path.join(test_symptoms_dir, fname)
    shutil.copyfile(src, dst)
# Healthy
for fname in test_nosymptoms_fnames:
    src = os.path.join(original_nosymptoms_dir, fname)
    dst = os.path.join(test_nosymptoms_dir, fname)
    shutil.copyfile(src, dst)

tStop = time.time()
tElapsed = tStop - tStart
print('Time elapsed: {:.2f} sec'.format(tElapsed))

Time elapsed: 72.42 sec
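
The six copy loops above differ only in the filename list and the two folders involved. As a sketch of one way to shorten them, the loop body can be moved into a small function; `copy_files` is a hypothetical name, and the behaviour is intended to be the same as in the cell above.

```python
import os
import shutil

def copy_files(fnames, src_dir, dst_dir):
    """Copy each file in fnames from src_dir to dst_dir; return the number of files copied."""
    for fname in fnames:
        src = os.path.join(src_dir, fname)
        dst = os.path.join(dst_dir, fname)
        # copyfile copies the file contents only and overwrites an existing destination file
        shutil.copyfile(src, dst)
    return len(fnames)
```

Each loop then becomes a single call, for example `copy_files(train_symptoms_fnames, original_symptoms_dir, train_symptoms_dir)`.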


## Conclusion
Now our Case 2 directories and datasets are ready for training, validating and testing convolutional neural networks that predict retinopathy.