# Preprocessing

Here we present the preprocessing applied to the dataset. There are basically two relevant tasks, cleaning and splitting, which change slightly depending on the ML method (`feature-based` or `deep-learning`) we are using. The basic steps are the following:

1) Understand the cleaning of the dataset (just conceptually).
2) **Train/Test split** - here we need to split by patient while also accounting for class imbalance.
3) Apply cleaning based on the **training set**.
4) Create a **processing pipeline**. There are two ways to apply the pipeline (`on-loading` or storing the processed images). We will check which works best for our case.

**Note 1:** For step (4) it could make sense to have a `DatasetClass` able to apply _different pipelines_. Applying the pipeline `on-loading` is attractive because the processed files would not need to be stored and moved to another environment (if the models are trained in Colab or other services).
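As a minimal sketch of the `DatasetClass` idea from Note 1, the class below applies a swappable pipeline at load time. The names `ImageDataset`, `Pipeline`, `loader` and `to_unit_range` are illustrative assumptions, not part of the project code. The loader is injected so the example runs without image files; in practice it could wrap `cv2.imread`.

```python
from typing import Callable, Optional, Sequence

import numpy as np

# Hypothetical type alias: a pipeline is any function mapping image -> image.
Pipeline = Callable[[np.ndarray], np.ndarray]


class ImageDataset:
    """Minimal on-loading dataset: images are read and processed on access."""

    def __init__(self, files: Sequence[str], labels: Sequence[str],
                 loader: Callable[[str], np.ndarray],
                 pipeline: Optional[Pipeline] = None):
        if len(files) != len(labels):
            raise ValueError("files and labels must have the same length")
        self.files = list(files)
        self.labels = list(labels)
        self.loader = loader        # how a file name becomes an array
        self.pipeline = pipeline    # can be swapped between experiments

    def __len__(self) -> int:
        return len(self.files)

    def __getitem__(self, idx: int):
        image = self.loader(self.files[idx])
        if self.pipeline is not None:
            image = self.pipeline(image)  # applied on-loading, nothing is stored
        return image, self.labels[idx]


def to_unit_range(image: np.ndarray) -> np.ndarray:
    """Example pipeline: scale 8-bit intensities to [0, 1]."""
    return image.astype(np.float32) / 255.0


# Toy usage with an in-memory "disk" (made-up data) instead of real files.
fake_disk = {"a_0.png": np.full((4, 4), 255, dtype=np.uint8)}
dataset = ImageDataset(["a_0.png"], ["N"],
                       loader=fake_disk.__getitem__,
                       pipeline=to_unit_range)
image, label = dataset[0]
```

Because the pipeline is just a callable, the same file list can be reused with a different pipeline for each classification approach.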

In [46]:
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from tqdm import tqdm
import seaborn as sns
import pandas as pd
import numpy as np
import scipy
import cv2
import os

BASE_DATA_DIR = os.path.join('data', 'train')

## 1. Cleaning Idea

Note that the previously defined `perturbations` are, in practice, nothing more than a kind of `data augmentation`. Data augmentation is very helpful for Machine Learning algorithms, so it is something we want to use.

We have two ways to do this (**note:** we want to develop a process useful for the three different classification approaches we are using):

1) We keep the dataset as it is (with some minor preprocessing on top) and, if wanted, apply _data augmentation techniques_ on top of it. This keeps the training data closer to the actual `test dataset`, so it could be beneficial to our performance.

2) We clean the dataset to ideally have only "good images" and apply _data augmentation_ on top of it. This could allow us to perform more "aggressive" augmentation techniques (as we start from an "unperturbed" image). Also, if the cleaning is not "perfect" it wouldn't be a real problem, as the remaining perturbation would act as a small augmentation. The downside is that we would have to clean each `test` sample too.

**Idea:** We will develop a cleaning procedure, but ultimately we will test both pipelines to check which yields better results.
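The two options can be sketched as two compositions of the same building blocks. This is only an illustration: `compose`, `minor_preprocess`, `clean` and `make_augment` are hypothetical names, and their bodies are placeholders (intensity scaling, clipping, a random horizontal flip) standing in for the actual cleaning and augmentation steps, which are not defined in this section. Sharing the minor preprocessing between both options is also an assumption.

```python
from functools import reduce
from typing import Callable

import numpy as np

Step = Callable[[np.ndarray], np.ndarray]


def compose(*steps: Step) -> Step:
    """Chain steps left to right into a single image -> image function."""
    return lambda image: reduce(lambda acc, step: step(acc), steps, image)


def minor_preprocess(image: np.ndarray) -> np.ndarray:
    return image.astype(np.float32) / 255.0    # placeholder: scale to [0, 1]


def clean(image: np.ndarray) -> np.ndarray:
    return np.clip(image, 0.05, 0.95)          # placeholder for the real cleaning


def make_augment(rng: np.random.Generator) -> Step:
    def augment(image: np.ndarray) -> np.ndarray:
        # placeholder augmentation: horizontal flip with probability 0.5
        return image[:, ::-1] if rng.random() < 0.5 else image
    return augment


rng = np.random.default_rng(0)
pipeline_raw = compose(minor_preprocess, make_augment(rng))           # option 1
pipeline_clean = compose(minor_preprocess, clean, make_augment(rng))  # option 2

# Made-up 4x4 test image with intensities 0, 16, ..., 240.
image = np.arange(16, dtype=np.uint8).reshape(4, 4) * 16
out_raw = pipeline_raw(image)
out_clean = pipeline_clean(image)
```

With this layout, comparing the two options only requires training each model once per pipeline on the same split.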

## 2. Train/Test Split

Ideally we want to define a single, fixed pair of `training` and `testing` datasets so that the methods we are developing can be compared in an unbiased way.

* As we are working with patients that can have more than one image, we have to split by patient ID, so that no patient ends up in both sets.
* As we are working with an imbalanced dataset, we have to stratify the split so that both sets keep the same class balance.

In [36]:
labels_file = pd.read_csv( os.path.join(BASE_DATA_DIR, 'labels_train.csv'))
labels_file['patient_id'] = labels_file['file'].str.split('_').str[0]
print("Total number of images:", len(labels_file) )

# keep one row per patient: drop the per-image file column, then deduplicate
# (drop_duplicates keeps the first label found for each patient_id)
patients_file = labels_file.drop(['file'], axis=1)
patients_file = patients_file.drop_duplicates(subset=['patient_id']).reset_index(drop=True)

# show stats
print("Total number of distinct patients:", len(patients_file) )

Total number of images: 15470
Total number of distinct patients: 12086
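Since `drop_duplicates` keeps only the first row per `patient_id`, the label used for stratification is only meaningful if every patient has a single label. A quick check could look like the sketch below; it runs on a made-up toy frame (`toy_labels`) with the same `file` and `label` columns, and the same idea would apply to `labels_file`.

```python
import pandas as pd

# Toy stand-in for labels_file (same columns; the rows are made up).
toy_labels = pd.DataFrame({
    "file": ["p1_0.png", "p1_1.png", "p2_0.png", "p3_0.png", "p3_1.png"],
    "label": ["N", "N", "P", "T", "N"],
})
toy_labels["patient_id"] = toy_labels["file"].str.split("_").str[0]

# Number of distinct labels per patient; anything above 1 is ambiguous.
labels_per_patient = toy_labels.groupby("patient_id")["label"].nunique()
ambiguous = labels_per_patient[labels_per_patient > 1].index.tolist()
print("Patients with more than one label:", ambiguous)
```

If the list is not empty, we would need a rule for picking the patient-level label before stratifying.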


Now we split by `patient_id`, taking the class distribution into account:

In [47]:
classes, counts = np.unique(patients_file['label'], return_counts=True)
original_dist = { classes[i]: counts[i] / len(patients_file) for i in range(len(classes)) }
print("Original class distribution:", original_dist )

Original class distribution: {'N': 0.6057421810359093, 'P': 0.27345689227205033, 'T': 0.12080092669204037}


In [54]:
TRAIN_SIZE = 0.80

# make the actual split
training_ids, testing_ids = train_test_split(
    patients_file['patient_id'].values,
    train_size=TRAIN_SIZE,
    stratify=patients_file['label'].values
)
training_data = labels_file[ labels_file['patient_id'].isin( training_ids )]
testing_data = labels_file[ labels_file['patient_id'].isin( testing_ids )]
print("Training data size:", len(training_data), "samples" )
print("Testing data size:", len(testing_data), "samples" )



Training data size: 12386 samples
Testing data size: 3084 samples
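Two properties of the split are worth verifying: no patient appears in both sets, and the class proportions match. The sketch below checks both on a made-up toy table (`toy_patients`) with the same columns as `patients_file`. It also passes `random_state` to `train_test_split`. The seed value is an assumption and is not used in the cell above, but fixing one makes the split repeatable across runs.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy patient table (made-up rows) with the same columns as patients_file.
toy_patients = pd.DataFrame({
    "patient_id": [f"p{i}" for i in range(20)],
    "label": ["N"] * 12 + ["P"] * 6 + ["T"] * 2,
})

train_ids, test_ids = train_test_split(
    toy_patients["patient_id"].values,
    train_size=0.5,
    stratify=toy_patients["label"].values,
    random_state=42,  # assumption: any fixed seed makes the split repeatable
)

# 1) No patient may appear on both sides of the split.
overlap = set(train_ids) & set(test_ids)


# 2) Class proportions should be close in both sets.
def class_distribution(ids):
    subset = toy_patients[toy_patients["patient_id"].isin(ids)]
    return subset["label"].value_counts(normalize=True).sort_index()


train_dist = class_distribution(train_ids)
test_dist = class_distribution(test_ids)
print("Overlap:", overlap)
print(pd.DataFrame({"train": train_dist, "test": test_dist}))
```

Note that stratification here is done per patient. Since patients can have different numbers of images, the image-level proportions in `training_data` and `testing_data` may differ slightly from the patient-level ones.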
