# EDA

The model is trained on the Skin Cancer MNIST: HAM10000 dataset, which is available on Kaggle. The kernels for this dataset include several good EDA notebooks, and the EDA here is largely based on [this notebook](https://www.kaggle.com/sid321axn/step-wise-approach-cnn-model-77-0344-accuracy).

Major TODOs:
- Create a Dockerfile and clean up the README
- Split this notebook so that model training is separate from EDA, with further notebooks for model performance analysis and localization/visualization (each taking in a trained model)
- Look at class activation maps and other localization techniques
- Automate performance tracking: create a CSV that stores the settings/hyperparameters used for each run
- Set up hyperparameter tuning using Keras Tuner
- Create a pipeline using TFRecords, TF datasets, and TensorBoard
- Create an iOS mobile app
- Allow various models to be trained
- Try incorporating non-image data (i.e. patient info) into a single end-to-end model
- Add class weighting (with a description)

## Data Processing

In [None]:
import os
from glob import glob
import pandas as pd
import numpy as np

import altair as alt
import matplotlib.pyplot as plt
from PIL import Image

In [None]:
alt.data_transformers.disable_max_rows()

Build a dictionary that maps each image ID to its file path, and create a lookup table of readable names for our classes.

In [None]:
base_dir = os.path.join('..', 'data')

# Merge images from both folders (HAM10000_images_part1 and HAM10000_images_part2)
# into one dictionary keyed by image ID (the file name without its extension)

image_path_dict = {os.path.splitext(os.path.basename(x))[0]: x
                   for x in glob(os.path.join(base_dir, '*', '*.jpg'))}

# This dictionary is useful for displaying more human-friendly labels later on

lesion_type_dict = {
    'nv': 'Melanocytic nevi',
    'mel': 'Melanoma',
    'bkl': 'Benign keratosis-like lesions',
    'bcc': 'Basal cell carcinoma',
    'akiec': 'Actinic keratoses',
    'vasc': 'Vascular lesions',
    'df': 'Dermatofibroma'
}
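
To make the output of the dictionary comprehension concrete, here is a minimal sketch that runs the same pattern against a temporary directory. The folder names and file names are made up for illustration; only the `glob` pattern and the stem-to-path mapping mirror the cell above.

```python
import os
import tempfile
from glob import glob

# Hypothetical layout: two image folders under one base directory,
# mimicking the two HAM10000 image parts.
with tempfile.TemporaryDirectory() as demo_dir:
    for folder, name in [('part_1', 'ISIC_0000001.jpg'),
                         ('part_2', 'ISIC_0000002.jpg')]:
        os.makedirs(os.path.join(demo_dir, folder), exist_ok=True)
        open(os.path.join(demo_dir, folder, name), 'w').close()

    # Same pattern as above: file name without extension -> full path
    demo_paths = {os.path.splitext(os.path.basename(p))[0]: p
                  for p in glob(os.path.join(demo_dir, '*', '*.jpg'))}

print(sorted(demo_paths))
```

Because the key is only the file stem, two files with the same name in different folders would silently overwrite each other in the dictionary.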

In [None]:
print(f'There are {len(image_path_dict)} images in our dataset')

Here we read the metadata and add columns for the image path, a readable class name, and an integer class index. The class index will serve as the label later.

In [None]:
skin_df = pd.read_csv(os.path.join(base_dir, 'datasets_54339_104884_HAM10000_metadata.csv'))

# Add columns for the image path, the readable class name, and an integer class index

skin_df['path'] = skin_df['image_id'].map(image_path_dict.get)
skin_df['cell_type'] = skin_df['dx'].map(lesion_type_dict.get)
skin_df['cell_type_idx'] = pd.Categorical(skin_df['cell_type']).codes
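
It can help to know how `pd.Categorical` assigns the integer codes, since they become the class indices. The sketch below uses a toy series (not the real metadata) to show that categories are sorted lexically and that the codes index into that sorted list. The name `idx_to_label` is an illustrative helper, not part of the original notebook.

```python
import pandas as pd

# Toy labels for illustration, not the real metadata
demo_labels = pd.Series(['Melanoma', 'Melanocytic nevi',
                         'Melanoma', 'Dermatofibroma'])

demo_cat = pd.Categorical(demo_labels)

# Categories are sorted lexically; each code is a position in that list
idx_to_label = dict(enumerate(demo_cat.categories))

print(demo_cat.codes.tolist())
print(idx_to_label)
```

A reverse lookup like this is handy when turning model predictions back into readable names. Note that a `dx` value missing from `lesion_type_dict` would end up as a missing `cell_type` and receive the code `-1`.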

In [None]:
skin_df.head()

Next, check for null values. Any column with missing values is a candidate for imputation, and different imputation methods can be compared.

In [None]:
skin_df.isnull().sum()
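
As a starting point for comparing imputation methods, here is a minimal sketch on a toy frame. It assumes that `age` is the column with missing values (check the output of the null count above), and the numbers are made up for illustration.

```python
import pandas as pd

# Toy frame for illustration; the values are made up
demo_df = pd.DataFrame({'age': [40.0, None, 50.0, 90.0]})

# Option 1: fill missing ages with the median of the observed ages
median_filled = demo_df['age'].fillna(demo_df['age'].median())

# Option 2: fill with the mean instead (more sensitive to outliers)
mean_filled = demo_df['age'].fillna(demo_df['age'].mean())

print(median_filled.tolist())
print(mean_filled.tolist())
```

Both options keep every row. Dropping the rows with missing values is a third option worth comparing when only a few values are missing.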

In [None]:
print(skin_df.dtypes)

## EDA

First look at the distribution of our target variable.

In [None]:
alt.Chart(skin_df, height=300).mark_bar().encode(
    x='count()',
    y='cell_type',
    color='cell_type',
    tooltip='count()'
)
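
If the chart shows a strongly imbalanced target, class weighting (listed in the TODOs) is one way to compensate during training. The sketch below computes "balanced" weights from label counts, using the same formula as scikit-learn's `class_weight='balanced'` option. The labels and counts are made up for illustration and are not the real class frequencies.

```python
from collections import Counter

# Made-up labels for illustration, not the real class frequencies
demo_labels = ['nv'] * 6 + ['mel'] * 3 + ['df'] * 1

counts = Counter(demo_labels)
n_samples = len(demo_labels)
n_classes = len(counts)

# Balanced heuristic: weight = n_samples / (n_classes * class_count),
# so rarer classes get proportionally larger weights
class_weights = {label: n_samples / (n_classes * count)
                 for label, count in counts.items()}

print(class_weights)
```

For training with Keras, the weights would need to be keyed by the integer class index (for example `cell_type_idx`) rather than by the string label.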

The ground truth labels in this dataset were established by four different methods:

1. Histopathology (Histo): Histopathologic diagnoses of excised lesions were performed by specialized dermatopathologists.
2. Confocal: Reflectance confocal microscopy is an in-vivo imaging technique with near-cellular resolution, and some facial benign keratoses were verified by this method.
3. Follow-up: If nevi monitored by digital dermatoscopy did not show any changes during 3 follow-up visits or 1.5 years, the dataset authors accepted this as evidence of biologic benignity. Only nevi, and no other benign diagnoses, were labeled with this type of ground truth because dermatologists usually do not monitor dermatofibromas, seborrheic keratoses, or vascular lesions.
4. Consensus: For typical benign cases without histopathology or follow-up, the dataset authors provide an expert-consensus rating by authors PT and HK. They applied the consensus label only if both authors independently gave the same unequivocal benign diagnosis. Lesions with this type of ground truth were usually photographed for educational reasons and did not need further follow-up or biopsy for confirmation.

In [None]:
alt.Chart(skin_df, height=300).mark_bar().encode(
    x='count()',
    y='dx_type',
    color='dx_type',
    tooltip='count()'
)

Look at the distribution of the `localization` field.

In [None]:
alt.Chart(skin_df, height=400).mark_bar().encode(
    x='count()',
    y='localization',
    color='localization',
    tooltip='count()'
)

Look at the distribution of patient age, excluding rows where age is missing.

In [None]:
alt.Chart(skin_df[skin_df['age'].notna()]).mark_bar().encode(
    alt.X("age:Q", bin=True),
    y='count()',
)

Look at the distribution of sex in our data.

In [None]:
alt.Chart(skin_df, height=400).mark_bar().encode(
    x='count()',
    y='sex',
    color='sex',
    tooltip='count()'
)

Look at the median age for each cell type (the target), again excluding rows where age is missing.

In [None]:
alt.Chart(skin_df[skin_df['age'].notna()], height=400).mark_bar().encode(
    x='median(age)',
    y='cell_type',
    color='cell_type'
)
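
To read off the exact numbers behind a chart like this, the same aggregation can be done in pandas. This is a minimal sketch on a toy frame with made-up values; the column names match `skin_df`, but the data does not.

```python
import pandas as pd

# Toy frame for illustration; the values are made up
demo_df = pd.DataFrame({
    'cell_type': ['Melanoma', 'Melanoma', 'Melanocytic nevi',
                  'Melanocytic nevi', 'Melanocytic nevi'],
    'age': [70.0, 60.0, 30.0, None, 50.0],
})

# Drop rows with missing age, then take the median age per cell type
median_age = (demo_df.dropna(subset=['age'])
                     .groupby('cell_type')['age']
                     .median()
                     .sort_values(ascending=False))

print(median_age)
```

Sorting the result makes it easier to compare against the bar lengths in the chart.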