# Pre-processing

## Table of Contents 

### Goals: 
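The primary goal here is to transform both datasets so that they can potentially be merged. As noted during EDA, the two sets differ in several notable ways (array dimensions, aspect ratio, size/scale, color channels, and possibly pixel intensities), each of which is addressed in the steps below.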

### Steps:
  #### Dummy/Indicator Features for Categorical Variables
  + [] Convert the target (`dx_class`) labels from `object` type to numerical/boolean for more convenient analysis
  + [] Same for `laterality`
  + [] The columns actually relevant to classification will be:
    + The Target: `dx_class` 
    + Laterality: `laterality` - essentially the only tabular "feature"; don't expect much of a correlation with the target here, but will consider it.
    + Set: `set` - to assess whether SET 1 and SET 2 differ statistically in ways that make them ultimately unmergeable when tested during modeling
    + Patient: `patient_id` - each patient may have associated confounding factors (e.g., a past medical history that includes myopia, which can make glaucoma harder to detect); may be useful to look at depending on how the results turn out.
  #### Standardization of Volume/Image Arrays 
  ##### Array Dimensions
  SET 1 consists of 3D Volumes, while SET 2 consists of 2D Images
  + [] 3D Volume slices need to be extracted that are within an equivalent plane to the B-scans
  + [] may need to consider the index number in the B-scan images and calculate the `slice_depth` equivalent.
  ##### Aspect Ratio
  2D Images represent full B-Scans including a broader area around the optic nerve, while the 3D volumes are optic-nerve-head centered scans
  + [] Crop 2D images to a W:H ratio of 1:2
  + [] Consider shaded regions within the B-scan images themselves vs. unshaded B-scan images that are accompanied by "cup"/"disc" images (masks/graphical labels). 
  ##### Size/Scale
  The 3D Volumes have been downscaled to 64x128x64, while the 2D Images appear to be original scale or at least much larger 
  + [] downscale 2D Images to 64w x 128h after cropping
  ##### Color Channels
  The volumes are all 3D Grayscale, while the images are all 2D Color
  + [] convert 2D images to grayscale using `cv2`
  ##### Pixel Intensities
  + [] check pixel intensity histograms again
  + [] normalize pixel intensities
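
The array steps listed above can be sketched in NumPy as follows. This is only a rough illustration under stated assumptions, not the notebook's final implementation: the helper names (`extract_slice`, `center_crop_to_ratio`, `to_grayscale`, `min_max_normalize`) are hypothetical, the volume axis that corresponds to the B-scan plane is assumed to be axis 0, and the grayscale step uses Rec. 601 luma weights on RGB-ordered input as a stand-in for the planned `cv2` conversion. The downscaling step is not shown.

```python
import numpy as np

def extract_slice(volume, index=None, axis=0):
    """Return one 2D slice from a 3D volume (the middle slice by default)."""
    if index is None:
        index = volume.shape[axis] // 2
    return np.take(volume, index, axis=axis)

def center_crop_to_ratio(image, w_over_h=0.5):
    """Center-crop an (H, W) or (H, W, C) array to the given W:H ratio."""
    h, w = image.shape[:2]
    target_w = int(round(h * w_over_h))
    if target_w <= w:
        # Image is too wide for the ratio: trim columns evenly from both sides
        x0 = (w - target_w) // 2
        return image[:, x0:x0 + target_w]
    # Image is too narrow for the ratio: trim rows instead
    target_h = int(round(w / w_over_h))
    y0 = (h - target_h) // 2
    return image[y0:y0 + target_h, :]

def to_grayscale(image):
    """Collapse an (H, W, 3) RGB array to (H, W); assumes RGB channel order."""
    if image.ndim == 2:
        return image.astype(np.float64)
    weights = np.array([0.299, 0.587, 0.114])
    return image[..., :3].astype(np.float64) @ weights

def min_max_normalize(image):
    """Rescale intensities to the range [0, 1]."""
    image = image.astype(np.float64)
    lo, hi = image.min(), image.max()
    if hi == lo:
        return np.zeros_like(image)
    return (image - lo) / (hi - lo)

# Synthetic stand-ins for a SET 1 volume and a SET 2 color image
rng = np.random.default_rng(0)
volume = rng.integers(0, 256, size=(64, 128, 64), dtype=np.uint8)
image = rng.integers(0, 256, size=(400, 600, 3), dtype=np.uint8)

vol_slice = min_max_normalize(extract_slice(volume))
img_gray = min_max_normalize(to_grayscale(center_crop_to_ratio(image)))
print(vol_slice.shape, img_gray.shape)
```

With real data, the cropped image would still need to be resized to 64w x 128h so that it matches the volume slices, and images loaded with `cv2.imread` arrive in BGR order, so the channel weights above would need to be reversed or the array converted first.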




## Import Statements

In [90]:
import pandas as pd
import numpy as np
import re

In [91]:
from warnings import filterwarnings
filterwarnings(action='ignore')

## Loading Metadata Table 

In [146]:
metadata = pd.read_csv('../datasrc/compositeOCT_metadata.csv')
metadata.head()

Unnamed: 0,dx_class,patient_id,laterality,image_type,set,filepath
0,glaucoma,1978,OS,volume,1,../datasrc/volumesOCT/POAG-001978-2012-02-08-O...
1,normal,2743,OS,volume,1,../datasrc/volumesOCT/Normal-002743-2009-03-26...
2,glaucoma,1086,OS,volume,1,../datasrc/volumesOCT/POAG-001086-2008-08-04-O...
3,glaucoma,92,OS,volume,1,../datasrc/volumesOCT/POAG-000092-2010-12-14-O...
4,glaucoma,3223,OD,volume,1,../datasrc/volumesOCT/POAG-003223-2014-01-10-O...


## Dummy/Indicator Features for Categorical Variables 

Although some ML classification frameworks are flexible about feature data types, the columns potentially relevant to classification are converted to numerical types here as a precaution.

### Target Variable: `dx_class` --> `glaucoma`

In [147]:
ntarg = 'glaucoma'
otarg = 'dx_class'
oldcols = list(metadata.columns)

if ntarg not in list(metadata.columns):
    newcolmap = {otarg:ntarg} # Save mapping for future reference
    new_target = pd.get_dummies(metadata[otarg], dtype='int')[ntarg] # Series for the new numerical column
    loc = metadata.columns.get_loc(otarg) + 1 # Add 1 to place it after the original target
    metadata.insert(loc,ntarg,new_target) # Insert the series after `dx_class` as the `glaucoma` column
    
print(newcolmap,'\n')
print(metadata.dtypes,'\n')
metadata.head()

{'dx_class': 'glaucoma'} 

dx_class      object
glaucoma       int64
patient_id    object
laterality    object
image_type    object
set            int64
filepath      object
dtype: object 



Unnamed: 0,dx_class,glaucoma,patient_id,laterality,image_type,set,filepath
0,glaucoma,1,1978,OS,volume,1,../datasrc/volumesOCT/POAG-001978-2012-02-08-O...
1,normal,0,2743,OS,volume,1,../datasrc/volumesOCT/Normal-002743-2009-03-26...
2,glaucoma,1,1086,OS,volume,1,../datasrc/volumesOCT/POAG-001086-2008-08-04-O...
3,glaucoma,1,92,OS,volume,1,../datasrc/volumesOCT/POAG-000092-2010-12-14-O...
4,glaucoma,1,3223,OD,volume,1,../datasrc/volumesOCT/POAG-003223-2014-01-10-O...
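
A quick way to confirm that an indicator column like this lines up with its source labels is a cross-tabulation. The sketch below uses a small made-up DataFrame (`toy`) rather than the real metadata, so the counts are illustrative only.

```python
import pandas as pd

# Toy stand-in for the metadata table (values are made up)
toy = pd.DataFrame({"dx_class": ["glaucoma", "normal", "glaucoma", "normal", "normal"]})

# Same conversion as in the cell above: keep only the 'glaucoma' indicator column
toy["glaucoma"] = pd.get_dummies(toy["dx_class"], dtype="int")["glaucoma"]

# Every 'glaucoma' row should fall under 1 and every 'normal' row under 0
check = pd.crosstab(toy["dx_class"], toy["glaucoma"])
print(check)

# Equivalent element-wise check
is_consistent = (toy["glaucoma"] == (toy["dx_class"] == "glaucoma").astype(int)).all()
print(is_consistent)
```

Any nonzero off-diagonal count in the cross-tabulation would indicate a mislabeled row.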


### Patient: `patient_id` - residual cleaning

#### Attempt `object` --> `int` 

In [148]:
N_patients = len(metadata.patient_id.unique()); print(f"Number of Patients: {N_patients}")

Number of Patients: 650


In [149]:
# Preview N_patients if patient_id converted to int
len(metadata.patient_id\
    .str.replace('P_','')\
    .astype('int')\
    .sort_values().unique())

647

The count drops by three once the `P_` prefix is stripped, which suggests that three SET 2 patient numbers coincide with SET 1 patient numbers.
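
One way to see exactly which numbers collide is to compare the integer IDs of the two sets directly. The sketch below uses a small made-up table (`toy`), so the specific IDs are hypothetical; only the `P_` prefix convention and the `patient_id` and `set` column names are taken from the metadata above.

```python
import pandas as pd

# Toy stand-in: SET 1 IDs are plain numbers, SET 2 IDs carry a 'P_' prefix
toy = pd.DataFrame({
    "patient_id": ["12", "45", "300", "P_12", "P_7", "P_300"],
    "set": [1, 1, 1, 2, 2, 2],
})

# Strip the prefix and convert to integers, as in the preview cell above
as_int = toy["patient_id"].str.replace("P_", "", regex=False).astype(int)

# IDs that appear in both sets after conversion
ids_set1 = set(as_int[toy["set"] == 1].tolist())
ids_set2 = set(as_int[toy["set"] == 2].tolist())
overlap = sorted(ids_set1 & ids_set2)
print(overlap)

# Number of distinct patients "lost" by the naive conversion
n_lost = toy["patient_id"].nunique() - as_int.nunique()
print(n_lost)
```

If the lost count equals the size of the overlap, the collisions are fully explained by shared numbering between the two sets.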

#### Splitting & Rejoining Original Sets 

In [150]:
# Split into original sets (copy so that new columns can be added without a SettingWithCopyWarning)
idf = metadata[metadata['set'] == 2].copy()
vdf = metadata[metadata['set'] == 1].copy()

# Create new column for patient_id's converted to integers (pid_int) for both dataframes
pidint = 'pid_int'
opid = 'patient_id'

if pidint not in list(metadata.columns):
    # Append to newcolmap for later reference
    newcolmap[opid] = pidint

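    # Convert patient_id's to integers in the new column (pid_int)
    # SET 2 IDs are offset by 4000 to keep them from colliding with SET 1 IDs
    # (assumes every SET 1 ID is below 4000; the patient-count check below confirms no collisions)
    vdf[pidint] = vdf[opid].astype('int')
    idf[pidint] = idf[opid].str.replace('P_','').astype('int') + 4000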

    # Re-join / concatenate
    metadata = pd.concat([vdf,idf],axis=0)

    # Re-order columns so that pid_int is next to patient_id
    loc = metadata.columns.get_loc(opid) + 1
    pidintSer = metadata.pop(pidint)
    metadata.insert(loc,pidint,pidintSer)

    # Sort by pid_int (i.e., numerical sort of patient_id "by proxy")
    metadata = metadata.sort_values(pidint)
    
print(newcolmap,'\n')
print(metadata.dtypes,'\n')
metadata.head()

{'dx_class': 'glaucoma', 'patient_id': 'pid_int'} 

dx_class      object
glaucoma       int64
patient_id    object
pid_int        int64
laterality    object
image_type    object
set            int64
filepath      object
dtype: object 



Unnamed: 0,dx_class,glaucoma,patient_id,pid_int,laterality,image_type,set,filepath
826,normal,0,2,2,OD,volume,1,../datasrc/volumesOCT/Normal-000002-2009-10-28...
813,normal,0,2,2,OS,volume,1,../datasrc/volumesOCT/Normal-000002-2009-10-28...
53,glaucoma,1,8,8,OD,volume,1,../datasrc/volumesOCT/POAG-000008-2009-02-03-O...
426,glaucoma,1,14,14,OD,volume,1,../datasrc/volumesOCT/POAG-000014-2009-11-30-O...
747,glaucoma,1,14,14,OS,volume,1,../datasrc/volumesOCT/POAG-000014-2009-11-30-O...


#### Re-map ordered patient_id's to proxy identification (PIN)

As the `patient_id` column reflects the original identification used in the respective research studies, there are some missing numbers (e.g., maybe representing patients later excluded from their study).  For our sake, we can keep this in order to trace it back to the original datasets/studies if needed later on, but create a new column for our own proxy identification -- we can call this `PIN` for "Patient Identification Number":

In [151]:
# Double check that the number of patients matches the original
uniq_pidints = metadata.pid_int.unique()
N_patients == len(uniq_pidints)

True

In [152]:
# Map each pid_int to a zero-padded ordinal string ('001', '002', ...), following the sorted order of pid_int
pid2pin = {pid:f"{i+1:03d}" for i,pid in enumerate(uniq_pidints)}

# Create new column PIN using mapping of pid_int to proxy identification (1 to N_patients)
pin = 'PIN'
if pin not in list(metadata.columns):
    # Update newcolmap for future reference (replaces the earlier patient_id -> pid_int entry)
    newcolmap['patient_id']=pin
    pinSer = metadata[pidint].map(pid2pin) # re-map pidint to create new 'PIN' series

    # Replace pid_int with the re-mapped 'PIN' series 
    loc = metadata.columns.get_loc(pidint) # get column index of pid_int
    metadata = metadata.drop(pidint,axis=1) # drop pid_int (going to replace it)
    metadata.insert(loc,pin,pinSer) # place re-mapped series at the original location of pid_int, and name it 'PIN'

print(newcolmap,'\n')
print(metadata.dtypes,'\n')
metadata.head()

{'dx_class': 'glaucoma', 'patient_id': 'PIN'} 

dx_class      object
glaucoma       int64
patient_id    object
PIN           object
laterality    object
image_type    object
set            int64
filepath      object
dtype: object 



Unnamed: 0,dx_class,glaucoma,patient_id,PIN,laterality,image_type,set,filepath
826,normal,0,2,1,OD,volume,1,../datasrc/volumesOCT/Normal-000002-2009-10-28...
813,normal,0,2,1,OS,volume,1,../datasrc/volumesOCT/Normal-000002-2009-10-28...
53,glaucoma,1,8,2,OD,volume,1,../datasrc/volumesOCT/POAG-000008-2009-02-03-O...
426,glaucoma,1,14,3,OD,volume,1,../datasrc/volumesOCT/POAG-000014-2009-11-30-O...
747,glaucoma,1,14,3,OS,volume,1,../datasrc/volumesOCT/POAG-000014-2009-11-30-O...
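
As an alternative to building the mapping dictionary by hand, `pd.factorize` with `sort=True` assigns consecutive integer codes in sorted order. The sketch below is a hedged illustration on a made-up `pid_int` column (`toy`); it is not the approach used in the cell above, though it should produce the same style of zero-padded PIN.

```python
import pandas as pd

# Toy stand-in for the pid_int column (values are made up)
toy = pd.DataFrame({"pid_int": [14, 2, 8, 14, 2, 4003]})

# codes are 0-based positions into the sorted unique values
codes, uniques = pd.factorize(toy["pid_int"], sort=True)

# Shift to 1-based numbering and zero-pad to three digits
toy["PIN"] = [f"{c + 1:03d}" for c in codes]
print(toy)

# Each pid_int should map to exactly one PIN and vice versa
one_to_one = toy.groupby("pid_int")["PIN"].nunique().eq(1).all() and \
             toy.groupby("PIN")["pid_int"].nunique().eq(1).all()
print(one_to_one)
```

Unlike the dictionary approach, this does not depend on the rows having been sorted beforehand, since the sorting happens inside `factorize`.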


### Laterality: `laterality`

In [153]:
l_eye = 'left_eye'
lat = 'laterality'

if l_eye not in list(metadata.columns):
    newcolmap[lat]=l_eye
    leyeSer = pd.get_dummies(metadata[lat],dtype='int')['OS'] # select the OS indicator explicitly: 1 = OS (left eye), 0 = OD (right eye)
    loc = metadata.columns.get_loc(lat) + 1
    metadata.insert(loc,l_eye,leyeSer)

print(newcolmap,'\n')
print(metadata.dtypes,'\n')
metadata.head()

{'dx_class': 'glaucoma', 'patient_id': 'PIN', 'laterality': 'left_eye'} 

dx_class      object
glaucoma       int64
patient_id    object
PIN           object
laterality    object
left_eye       int64
image_type    object
set            int64
filepath      object
dtype: object 



Unnamed: 0,dx_class,glaucoma,patient_id,PIN,laterality,left_eye,image_type,set,filepath
826,normal,0,2,1,OD,0,volume,1,../datasrc/volumesOCT/Normal-000002-2009-10-28...
813,normal,0,2,1,OS,1,volume,1,../datasrc/volumesOCT/Normal-000002-2009-10-28...
53,glaucoma,1,8,2,OD,0,volume,1,../datasrc/volumesOCT/POAG-000008-2009-02-03-O...
426,glaucoma,1,14,3,OD,0,volume,1,../datasrc/volumesOCT/POAG-000014-2009-11-30-O...
747,glaucoma,1,14,3,OS,1,volume,1,../datasrc/volumesOCT/POAG-000014-2009-11-30-O...


### Subsetting New 'modeling-ready' DataFrame

In [155]:
newcolmap

{'dx_class': 'glaucoma', 'patient_id': 'PIN', 'laterality': 'left_eye'}

In [156]:
oldcols

['dx_class', 'patient_id', 'laterality', 'image_type', 'set', 'filepath']

In [157]:
untouched = [ c for c in oldcols if c not in list(newcolmap.keys()) ]; untouched

['image_type', 'set', 'filepath']

In [158]:
newcols = [ newcolmap[c] for c in oldcols if c not in untouched ]; newcols

['glaucoma', 'PIN', 'left_eye']

In [160]:
# Subset columns for "modeling dataframe" (mdf):
mdf = metadata[newcols+untouched]

# Reset index values (drop=True discards the old index instead of adding it as a column)
mdf = mdf.reset_index(drop=True)

# Reorder columns so that `set` is lumped in with the other numerical columns
loc = mdf.columns.get_loc('image_type')
setSer = mdf.pop('set')
mdf.insert(loc,'set',setSer)

# Preview
mdf.head()

Unnamed: 0,glaucoma,PIN,left_eye,set,image_type,filepath
0,0,1,0,1,volume,../datasrc/volumesOCT/Normal-000002-2009-10-28...
1,0,1,1,1,volume,../datasrc/volumesOCT/Normal-000002-2009-10-28...
2,1,2,0,1,volume,../datasrc/volumesOCT/POAG-000008-2009-02-03-O...
3,1,3,0,1,volume,../datasrc/volumesOCT/POAG-000014-2009-11-30-O...
4,1,3,1,1,volume,../datasrc/volumesOCT/POAG-000014-2009-11-30-O...


Will leave `image_type` and `filepath` alone for now, as they are subject to changes in subsequent sections of this notebook. 

## Volume/Image Array Standardization 

### Array Dimensions

### Aspect Ratio 

### Size/Scale

### Color Channels 

### Pixel Intensities 

## Summary 