### Setting up paths

We strongly recommend [setting up paths in a platform-independent way](https://medium.com/@ageitgey/python-3-quick-tip-the-easy-way-to-deal-with-file-paths-on-windows-mac-and-linux-11a072b58d5f), because team members work on different operating systems but all share the same code.

Platform-agnostic paths can be set up by (a) creating an environment variable with a path to the root directory of this project on your local machine, and (b) using ```os``` and ```pathlib``` modules. I call the environment variable ```SKIN_LESION_CLASSIFICATION```. 

The custom ```utils``` module provides ```path_setup```, whose ```subfolders``` function creates a dictionary with paths to all directories under the root directory.

In [1]:
project_environment_variable = "SKIN_LESION_CLASSIFICATION"

import os
from pathlib import Path
project_path = Path(os.environ[project_environment_variable])  # A KeyError here means the environment variable is not set.

scripts_path = project_path.joinpath("scripts")

import sys
sys.path.append(str(scripts_path)) 

from utils import path_setup
path = path_setup.subfolders(project_path)

path['project'] : D:\projects\skin-lesion-classification
path['images'] : D:\projects\skin-lesion-classification\images
path['models'] : D:\projects\skin-lesion-classification\models
path['expository'] : D:\projects\skin-lesion-classification\expository
path['literature'] : D:\projects\skin-lesion-classification\literature
path['notebooks'] : D:\projects\skin-lesion-classification\notebooks
path['presentation'] : D:\projects\skin-lesion-classification\presentation
path['scripts'] : D:\projects\skin-lesion-classification\scripts
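
The source of ```path_setup.subfolders``` is not shown here. As a rough idea of what such a helper might look like, the following is a minimal sketch: the name ```subfolders_sketch``` and its exact behaviour (only immediate, non-hidden subdirectories are listed) are assumptions, not the project's actual code.

```python
from pathlib import Path
from typing import Dict


def subfolders_sketch(root: Path) -> Dict[str, Path]:
    """Map 'project' to the root and each immediate subdirectory name to its path."""
    paths = {"project": root}
    for child in sorted(root.iterdir()):
        # Assumption: skip plain files and hidden directories such as .git.
        if child.is_dir() and not child.name.startswith("."):
            paths[child.name] = child
    return paths
```

Because every value is a ```pathlib.Path``` object, the same dictionary works unchanged on Windows, macOS, and Linux.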


### Loading and pre-processing data

The custom ```processing``` module contains the ```process``` class. We can work with instances of this class, which have attributes like the path to the directory with the ```metadata.csv``` file (which is the same as the directory with the images), train/val set ratio, whether we want a stratified partition, the lesion types we are interested in classifying, and more. The partitioning is done automatically as soon as we instantiate the class. If we wish to experiment with different set-ups, we simply create another instance of the class with different arguments. This keeps each configuration self-contained and avoids re-running the loading and partitioning steps by hand.

In [2]:
from processing import process

In [3]:
data_dir: Path = path["images"] # Path to directory containing metadata.csv file
filename: str = "metadata.csv"  # The filename
tvr: int = 3                    # Ratio of training set to validation set. See discussion below for explanation.
seed: int = 0                   # Random seed for parts of the process where randomness is called for.
keep_first: bool = False        # If False, the one-image-per-lesion representative of each lesion is chosen at random; if True, it is the first image listed in the table.
stratified: bool = True         # If True, we stratify classes so that the proportions remain as stable as possible after train/val split. 
                                # If False, the proportions will be roughly similar.
to_classify: list = ["mel",     # These are the lesion types we are interested in classifying. Any missing ones will be grouped together as the 0-label class.
                     "bcc", 
                     "akiec", 
                     "nv"]

In [4]:
# Create an instance of the process class with the above parameters for the attributes.
metadata = process(data_dir, 
                   filename, 
                   tvr, 
                   seed, 
                   keep_first,
                   stratified,
                   to_classify,)

Successfully loaded file 'D:\projects\skin-lesion-classification\images\metadata.csv'.
Inserted 'num_images' column in dataframe, to the right of 'lesion_id' column.
Created label_dict (maps labels to indices).
Inserted 'label' column in dataframe, to the right of 'dx' column.
Added 'set' column to dataframe, with values 't1', 'v1', 'ta', and 'va', to the right of 'localization' column.


In [5]:
# Let's have a look at our metadata dataframe, which is now just an attribute of the metadata instance of the process class.
metadata.df.head()

Unnamed: 0,lesion_id,num_images,image_id,dx,label,dx_type,age,sex,localization,set
0,HAM_0000118,2,ISIC_0027419,bkl,0,histo,80.0,male,scalp,ta
1,HAM_0000118,2,ISIC_0025030,bkl,0,histo,80.0,male,scalp,t1
2,HAM_0002730,2,ISIC_0026769,bkl,0,histo,80.0,male,scalp,va
3,HAM_0002730,2,ISIC_0025661,bkl,0,histo,80.0,male,scalp,v1
4,HAM_0001466,2,ISIC_0031633,bkl,0,histo,75.0,male,ear,va


Notice the ```label``` column. If we wish to classify all seven classes of lesion, labels will run from $0$ to $6$. If we specify $n < 7$ classes in our ```to_classify``` attribute, then the labels will range from $0$ to $n$, with $0$ representing all lesions that do not fall into any of the specified classes. We can see the correspondence in the ```label_dict``` attribute.

In [6]:
metadata.label_dict

{'bkl': 0, 'vasc': 0, 'df': 0, 'mel': 1, 'nv': 2, 'bcc': 3, 'akiec': 4}

In [7]:
metadata.df['label'].value_counts()

2    6705
0    1356
1    1113
3     514
4     327
Name: label, dtype: int64
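
To make the labelling rule concrete, here is a minimal sketch of how such a mapping could be built. The function name ```make_label_dict``` is hypothetical, and assigning indices in the order of ```to_classify``` is an assumption: the ```label_dict``` shown above evidently uses a different ordering for the specified classes.

```python
from typing import Dict, List

# The seven diagnosis codes that appear in label_dict above.
ALL_DX: List[str] = ["akiec", "bcc", "bkl", "df", "mel", "nv", "vasc"]


def make_label_dict(to_classify: List[str], all_dx: List[str] = ALL_DX) -> Dict[str, int]:
    """Map each diagnosis code to an integer label."""
    # Every class we are not interested in is pooled into label 0.
    label_dict = {dx: 0 for dx in all_dx if dx not in to_classify}
    # If nothing was pooled (all seven classes requested), labels start at 0.
    start = 1 if label_dict else 0
    for index, dx in enumerate(to_classify, start=start):
        label_dict[dx] = index
    return label_dict
```

With four classes specified, this yields labels $0$ to $4$; with all seven specified, labels $0$ to $6$.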

Notice also the ```set``` column. ```t``` stands for ```train```, ```v``` for ```validate```, while ```1``` signifies 'one-image-per-lesion', and ```a``` signifies 'all images'. To train a model on one image per lesion, we restrict to rows with set value ```t1```. To train a model on all images of every lesion in the training set, we restrict to rows with ```t1``` or ```ta```. Similarly for validation: see discussion below for particulars.

The ```dx_dist``` function within the ```process``` class gives us a breakdown of the number (and relative frequency) of lesions (or images) of a given class in the entire set, the training set, and the validation set. As we can see below, with the stratified split, the proportions remain as constant as possible between overall/train/validation sets.

In [9]:
for across in ["lesions", "images"]:
    for subset in ["all", "train", "val"]:
        process.dx_dist(metadata, subset = subset, across = across)

DISTRIBUTION OF LESIONS BY DIAGNOSIS: OVERALL


dx,nv,other,mel,bcc,akiec
freq,5403.0,898.0,614.0,327.0,228.0
%,72.33,12.02,8.22,4.38,3.05


Total lesions: 7470.

DISTRIBUTION OF LESIONS BY DIAGNOSIS: TRAIN


dx,nv,other,mel,bcc,akiec
freq,4052.0,673.0,460.0,245.0,171.0
%,72.34,12.02,8.21,4.37,3.05


Total lesions: 5601 (74.98% of all lesions).

DISTRIBUTION OF LESIONS BY DIAGNOSIS: VAL


dx,nv,other,mel,bcc,akiec
freq,1351.0,225.0,154.0,82.0,57.0
%,72.28,12.04,8.24,4.39,3.05


Total lesions: 1869 (25.02% of all lesions).

DISTRIBUTION OF IMAGES BY DIAGNOSIS: OVERALL


dx,nv,other,mel,bcc,akiec
freq,6705.0,1356.0,1113.0,514.0,327.0
%,66.95,13.54,11.11,5.13,3.27


Total images: 10015.

DISTRIBUTION OF IMAGES BY DIAGNOSIS: TRAIN


dx,nv,other,mel,bcc,akiec
freq,5007.0,1008.0,831.0,384.0,250.0
%,66.94,13.48,11.11,5.13,3.34


Total images: 7480 (74.69% of all images).

DISTRIBUTION OF IMAGES BY DIAGNOSIS: VAL


dx,nv,other,mel,bcc,akiec
freq,1698.0,348.0,282.0,130.0,77.0
%,66.98,13.73,11.12,5.13,3.04


Total images: 2535 (25.31% of all images).
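
The implementation of ```dx_dist``` is not shown here, but the kind of breakdown it prints can be sketched with plain pandas. In the following, the function name ```dx_breakdown``` is hypothetical; it assumes a dataframe with ```lesion_id``` and ```label``` columns like ```metadata.df```, and counts each lesion once by keeping a single row per ```lesion_id```.

```python
import pandas as pd


def dx_breakdown(df: pd.DataFrame, across: str = "lesions") -> pd.DataFrame:
    """Return frequency and percentage per label, counting lesions or images."""
    if across == "lesions":
        # One row per lesion: keep a single image for each lesion_id.
        df = df.drop_duplicates(subset="lesion_id")
    freq = df["label"].value_counts()
    pct = (100 * freq / freq.sum()).round(2)
    return pd.DataFrame({"freq": freq, "%": pct})
```

To restrict the breakdown to training or validation, one would first filter the dataframe on its ```set``` column and then pass the result to the function.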



To show the beauty of classes, suppose we wish to set up everything just as above, with one exception: no stratified train/val split. All we need is the following.

In [16]:
metadata2 = process(data_dir, 
                   filename, 
                   tvr, 
                   seed, 
                   keep_first,
                   stratified = False,
                   to_classify = to_classify,)
for across in ["lesions", "images"]:
    for subset in ["all", "train", "val"]:
        process.dx_dist(metadata2, subset = subset, across = across)

Successfully loaded file 'D:\projects\skin-lesion-classification\images\metadata.csv'.
Inserted 'num_images' column in dataframe, to the right of 'lesion_id' column.
Created label_dict (maps labels to indices).
Inserted 'label' column in dataframe, to the right of 'dx' column.
Added 'set' column to dataframe, with values 't1', 'v1', 'ta', and 'va', to the right of 'localization' column.
DISTRIBUTION OF LESIONS BY DIAGNOSIS: OVERALL


dx,nv,other,mel,bcc,akiec
freq,5403.0,898.0,614.0,327.0,228.0
%,72.33,12.02,8.22,4.38,3.05


Total lesions: 7470.

DISTRIBUTION OF LESIONS BY DIAGNOSIS: TRAIN


dx,nv,other,mel,bcc,akiec
freq,4024.0,681.0,476.0,251.0,170.0
%,71.83,12.16,8.5,4.48,3.03


Total lesions: 5602 (74.99% of all lesions).

DISTRIBUTION OF LESIONS BY DIAGNOSIS: VAL


dx,nv,other,mel,bcc,akiec
freq,1379.0,217.0,138.0,76.0,58.0
%,73.82,11.62,7.39,4.07,3.1


Total lesions: 1868 (25.01% of all lesions).

DISTRIBUTION OF IMAGES BY DIAGNOSIS: OVERALL


dx,nv,other,mel,bcc,akiec
freq,6705.0,1356.0,1113.0,514.0,327.0
%,66.95,13.54,11.11,5.13,3.27


Total images: 10015.

DISTRIBUTION OF IMAGES BY DIAGNOSIS: TRAIN


dx,nv,other,mel,bcc,akiec
freq,4999.0,1025.0,865.0,390.0,240.0
%,66.48,13.63,11.5,5.19,3.19


Total images: 7519 (75.08% of all images).

DISTRIBUTION OF IMAGES BY DIAGNOSIS: VAL


dx,nv,other,mel,bcc,akiec
freq,1706.0,331.0,248.0,124.0,87.0
%,68.35,13.26,9.94,4.97,3.49


Total images: 2496 (24.92% of all images).



### Train/val split

We partition our dataset based on ```lesion_id```, **not** on ```image_id```: that way, every lesion will be represented in training or in validation, but not both.

For each binary classification task, we will train a model on either
* **exactly one** image for every lesion in our training set; or
* **all** images of every lesion in our training set.

In both cases, we will validate our model on either
* **exactly one** image for every lesion in our validation set; or
* **all** images of every lesion in our validation set.

**However**, we will make only one prediction per lesion (```lesion_id```) in our validation set, i.e. in the second case (validate on all images), if there are multiple images of a lesion in the validation set, we will combine the predictions for the multiple images into a single prediction for the lesion.
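
How the image-level predictions are combined is not specified above. One simple option, shown here purely as an assumption, is to average the predicted probabilities of all images belonging to a lesion and then threshold the average. The function name ```combine_by_lesion```, the column name ```prob_positive```, and the lesion IDs in the usage example are hypothetical.

```python
import pandas as pd


def combine_by_lesion(preds: pd.DataFrame, threshold: float = 0.5) -> pd.Series:
    """Collapse image-level probabilities into one 0/1 prediction per lesion."""
    # Assumption: the mean probability over a lesion's images is its score.
    lesion_prob = preds.groupby("lesion_id")["prob_positive"].mean()
    return (lesion_prob >= threshold).astype(int)


# Hypothetical image-level predictions: one row per validation image.
example = pd.DataFrame({
    "lesion_id": ["HAM_A", "HAM_A", "HAM_B"],
    "prob_positive": [0.75, 0.125, 0.875],
})
lesion_pred = combine_by_lesion(example)
```

Other aggregation rules (maximum probability, majority vote) fit the same ```groupby``` pattern.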

Accordingly, we proceed as follows: 
1. Randomly select (without replacement) a proportion of our $7470$ distinct ```lesion_id```s and label them with ```t``` (train).
2. Label the remaining ```lesion_id```s with ```v``` (validate).
3. For each ```lesion_id``` labeled with a ```t```:
    * Select an ```image_id``` and label it ```t1```.
    * Label all (if any) remaining ```image_id```s corresponding to this ```lesion_id``` with ```ta```.
4. For each ```lesion_id``` labeled with a ```v```:
    * Select an ```image_id``` and label it ```v1```.
    * Label all (if any) remaining ```image_id```s corresponding to this ```lesion_id``` with ```va```.

In Step 1, the number of ```lesion_id```s randomly selected to be labeled ```t``` will be such that the ratio of ```t```s to ```v```s (in each class if 'stratified' is selected) is as close as possible to a specified ratio (we default to $3$, i.e. $\approx75\%$ of lesions are represented in training). In Steps 3 and 4, the first substep can be done randomly (our default choice), or we can simply choose the "first" image in our table that corresponds to the lesion. 
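
The procedure above can be sketched roughly as follows. This is not the ```process``` class's actual code: the function name ```label_sets```, the rounding rule, and the use of NumPy's random generator are assumptions. It also assumes the dataframe has a unique index and that every image of a lesion carries the same ```label```.

```python
import numpy as np
import pandas as pd


def label_sets(df: pd.DataFrame, tvr: float = 3, seed: int = 0,
               stratified: bool = True, keep_first: bool = False) -> pd.Series:
    """Return a Series of 't1'/'ta'/'v1'/'va' values aligned with df's index."""
    rng = np.random.default_rng(seed)
    train_frac = tvr / (tvr + 1)  # a ratio of 3 gives 3/4 = 0.75

    # Steps 1 and 2: choose the training lesions (within each class if stratified).
    lesions = df.drop_duplicates("lesion_id")[["lesion_id", "label"]]
    groups = lesions.groupby("label") if stratified else [(None, lesions)]
    train_ids = set()
    for _, group in groups:
        ids = group["lesion_id"].to_numpy()
        n_train = int(round(train_frac * len(ids)))
        train_ids.update(rng.choice(ids, size=n_train, replace=False))

    # Steps 3 and 4: one image per lesion gets the '1' tag, the rest get 'a'.
    # Shuffling the rows first makes the chosen image random.
    order = df if keep_first else df.sample(frac=1, random_state=seed)
    is_chosen = (~order.duplicated("lesion_id")).reindex(df.index)

    prefix = df["lesion_id"].isin(train_ids).map({True: "t", False: "v"})
    suffix = is_chosen.map({True: "1", False: "a"})
    return prefix + suffix
```

Because the split is drawn over ```lesion_id``` values rather than rows, no lesion can end up in both training and validation.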

The four train/val scenarios we consider are:
* ```t1v1```: train on precisely those images labeled ```t1``` and validate on precisely those labeled ```v1```.
* ```t1va```: train on precisely those images labeled ```t1``` and validate on precisely those labeled ```v1``` **or** ```va```.
* ```tav1```: train on precisely those images labeled ```t1``` **or** ```ta``` and validate on precisely those labeled ```v1```.
* ```tava```: train on precisely those images labeled ```t1``` **or** ```ta``` and validate on precisely those labeled ```v1``` **or** ```va```.

The mnemonic is ```t``` for training, ```v``` for validation, ```1``` for one-image-per-lesion, and ```a``` for all images. 
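
In pandas terms, each scenario is just a pair of filters on the ```set``` column. The following is a minimal sketch: the names ```SCENARIOS``` and ```select``` are hypothetical helpers, not part of the ```processing``` module.

```python
from typing import Tuple

import pandas as pd

# Values of the 'set' column used for training and validation in each scenario.
SCENARIOS = {
    "t1v1": (["t1"], ["v1"]),
    "t1va": (["t1"], ["v1", "va"]),
    "tav1": (["t1", "ta"], ["v1"]),
    "tava": (["t1", "ta"], ["v1", "va"]),
}


def select(df: pd.DataFrame, scenario: str) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Return the (train, validation) rows of df for the given scenario."""
    train_sets, val_sets = SCENARIOS[scenario]
    train = df[df["set"].isin(train_sets)]
    val = df[df["set"].isin(val_sets)]
    return train, val
```

For example, ```select(metadata.df, "tava")``` would return all training images and all validation images.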