# Proof of concept for SIIM-ISIC Melanoma Challenge using RADTorch

### Please upvote if you find this notebook useful! Thanks

## 1. The Challenge

### 1.A. Your Skin, Get to Know it

![](https://i.ibb.co/Wpnr9gJ/peau.png)


The skin is the body’s largest organ. It protects against heat, sunlight, injury, and infection. Skin also helps control body temperature and stores water, fat, and vitamin D. The skin has several layers, but the two main layers are the epidermis (upper or outer layer) and the dermis (lower or inner layer). Skin cancer begins in the epidermis, which is made up of three kinds of cells:

* *Squamous cells*: Thin, flat cells that form the top layer of the epidermis.
* *Basal cells*: Round cells under the squamous cells.
* *Melanocytes*: Cells that make melanin and are found in the lower part of the epidermis. Melanin is the pigment that gives skin its natural color. When skin is exposed to the sun, melanocytes make more pigment and cause the skin to darken.


(Source: https://www.dana-farber.org/skin-cancer/)



### 1.B. Skin Cancer

![](https://i.ibb.co/68n6TvG/GB-couleurs.png)


Skin cancer is the most prevalent type of cancer. Melanoma, specifically, is responsible for 75% of skin cancer deaths, despite being the least common skin cancer. The American Cancer Society estimates over 100,000 new melanoma cases will be diagnosed in 2020. It's also expected that almost 7,000 people will die from the disease. As with other cancers, early and accurate detection—potentially aided by data science—can make treatment more effective.

<br>

---


## 2. RADTorch

RADTorch (https://www.radtorch.com) is an ongoing project that provides a framework of higher-level classes and functions that aim to significantly reduce the time needed for the implementation of different machine and deep learning algorithms on DICOM medical images.


RADTorch is built upon widely used machine learning and deep learning frameworks. These include:
* PyTorch for Deep Learning and Neural Networks.
* Scikit-learn for Data Management and Machine Learning Algorithms.
* PyDICOM for handling of DICOM data.
* Bokeh, Matplotlib and Seaborn for Data Visualization.


In [None]:
!git clone -b nightly https://github.com/radtorch/radtorch/ -q
!pip install radtorch/. -q
!rm -r radtorch

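from radtorch.settings import *
from radtorch import core, utils
from sklearn import preprocessing

# Explicit imports for names used later, in case the star import above does not provide them.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample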


In [None]:
utils.set_random_seed(100)



## 3. Approach

My approach to solve this challenge will include the following steps:

#### A. Model Training

1. Extract imaging features from the provided train dataset using one of the well-known CNN architectures with ImageNet pre-trained weights (e.g. vgg16, resnet50, ...).
2. Pre-process the patient clinical features, including creating dummy features and interaction terms between categorical features.
3. Combine both feature sets into a single feature input vector.
4. Since the data is **extremely unbalanced**, balance it using upsampling, downsampling or SMOTE.
5. Train a regular machine learning classifier (e.g. logistic regression, ...).

#### B. Inference
1. Perform steps 1-3 of the model training pipeline on the test images.
2. Use the trained classifier to classify the test feature vectors.
3. Combine the results into a submission CSV.

### A. Model Training

### Step 1 : Imaging Features Extraction

Here I use training image features that I extracted earlier with the AlexNet architecture.

In [None]:
train_img_features = pd.read_csv('/kaggle/input/radtorch-challenges-data/train_imaging_features_alexnet.csv')

In [None]:
train_img_features.head()

### Step 2: Clinical Data Pre-processing

In [None]:
def create_patient_data(csv, normalize_age=True, test=False, drop_missing=True, root='/kaggle/input/siim-isic-melanoma-classification/jpeg/train/', ext='.jpg'):
    # Note: the `test` argument is accepted but not used inside this function.
    patient_features = pd.read_csv(csv)
    if drop_missing:
        patient_features = patient_features.dropna()
    # Build the full image path so clinical rows can be merged with the imaging features.
    patient_features['IMAGE_PATH'] = root + patient_features['image_name'] + ext
    # One-hot encode the categorical clinical features.
    cat_columns = ['sex', 'anatom_site_general_challenge']
    dummy_data = pd.get_dummies(patient_features[cat_columns])
    patient_features = pd.concat([patient_features, dummy_data], axis=1)
    dummy_col = ['sex_female', 'sex_male',
                 'anatom_site_general_challenge_head/neck',
                 'anatom_site_general_challenge_lower extremity',
                 'anatom_site_general_challenge_oral/genital',
                 'anatom_site_general_challenge_palms/soles',
                 'anatom_site_general_challenge_torso',
                 'anatom_site_general_challenge_upper extremity']
    if normalize_age:
        # Rescale age to the [0, 1] range.
        min_max_scaler = preprocessing.MinMaxScaler()
        patient_features[['age_approx']] = min_max_scaler.fit_transform(patient_features[['age_approx']])
    return patient_features[['IMAGE_PATH', 'age_approx'] + dummy_col]

In [None]:
train_clinical_features=create_patient_data('/kaggle/input/siim-isic-melanoma-classification/train.csv', drop_missing=False, normalize_age=False)

In [None]:
train_clinical_features.head()
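
Step 2 of the approach also mentions interaction terms between categorical features, which `create_patient_data` does not build. A minimal sketch of one way to add them is shown below: each dummy column from one group is multiplied by each dummy column from another group. The helper `add_interaction_terms` and the toy DataFrame are hypothetical and only illustrate the idea.

```python
import pandas as pd


def add_interaction_terms(df, left_prefix, right_prefix):
    """Append pairwise products of two groups of dummy columns (hypothetical helper)."""
    left_cols = [c for c in df.columns if c.startswith(left_prefix)]
    right_cols = [c for c in df.columns if c.startswith(right_prefix)]
    out = df.copy()
    for left in left_cols:
        for right in right_cols:
            # The product is 1 only when both dummy features are active.
            out[f'{left}_x_{right}'] = df[left].astype(int) * df[right].astype(int)
    return out


# Toy clinical data with two categorical features.
toy = pd.DataFrame({
    'sex': ['male', 'female', 'male'],
    'site': ['torso', 'torso', 'head/neck'],
})
dummies = pd.get_dummies(toy, columns=['sex', 'site'])
with_interactions = add_interaction_terms(dummies, 'sex_', 'site_')
print(with_interactions.filter(like='_x_'))
```

With the clinical table above, the prefixes would presumably be `'sex_'` and `'anatom_site_general_challenge_'`.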

### Steps 3 and 4: Combine Imaging and Clinical Features + Solve Class Imbalance

In [None]:
def balance_data(df, label_col='IMAGE_LABEL', method='upsample'):
    # Number of rows per class.
    counts = df[label_col].value_counts()
    if method == 'upsample':
        # Every other class is resampled up to the size of the largest class.
        ref_class, n_samples = counts.idxmax(), counts.max()
    elif method == 'downsample':
        # Every other class is resampled down to the size of the smallest class.
        ref_class, n_samples = counts.idxmin(), counts.min()
    else:
        raise ValueError("method must be 'upsample' or 'downsample'")
    resampled_subsets = [df[df[label_col] == ref_class]]
    for label in counts.index:
        if label == ref_class:
            continue
        class_subset = df[df[label_col] == label]
        # sklearn's resample draws with replacement by default, also when downsampling.
        resampled_subsets.append(resample(class_subset, n_samples=n_samples, random_state=100))
    return pd.concat(resampled_subsets)

def create_data(img_features, pt_features, test_split, balance='upsample'):
    # Scale all imaging columns after the first two to [0, 1].
    # Note: this modifies img_features in place.
    feature_cols = img_features.columns.tolist()[2:]
    min_max_scaler = preprocessing.MinMaxScaler()
    img_features[feature_cols] = min_max_scaler.fit_transform(img_features[feature_cols])

    # Join clinical and imaging features on the image path.
    combined = pd.merge(left=pt_features, right=img_features, on='IMAGE_PATH')

    feature_names = [x for x in combined.columns.tolist() if x not in ['IMAGE_PATH', 'IMAGE_LABEL']]

    if test_split:
        train, test = train_test_split(combined, test_size=test_split, random_state=100)
    else:
        train = combined

    # Balance only the training portion so the held-out split keeps the real class ratio.
    if balance:
        train = balance_data(train, method=balance)

    if test_split:
        return {
            'train': {'features': train[feature_names], 'features_names': feature_names, 'labels': train.IMAGE_LABEL.tolist()},
            'test': {'features': test[feature_names], 'features_names': feature_names, 'labels': test.IMAGE_LABEL.tolist()},
        }
    return train


In [None]:
train_data = create_data(train_img_features, train_clinical_features, 0.25, balance='downsample')
train_data['train']['features'].head()
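
`balance_data` covers upsampling and downsampling, but not the SMOTE option listed in the approach. The sketch below is a simplified illustration of SMOTE's core idea: new minority rows are created by interpolating between a minority sample and one of its nearest minority neighbours. The helper `smote_oversample` and the toy data are hypothetical, and a maintained implementation such as the one in the imbalanced-learn package would be the safer choice in practice.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_oversample(minority, n_new, k=5, random_state=100):
    """Create n_new synthetic rows by interpolating between minority neighbours.

    Hypothetical helper illustrating the core idea of SMOTE.
    """
    minority = np.asarray(minority, dtype=float)
    rng = np.random.default_rng(random_state)
    k = min(k, len(minority) - 1)
    # Ask for k + 1 neighbours because each point is returned as its own neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(minority)
    neighbour_idx = nn.kneighbors(minority, return_distance=False)[:, 1:]
    # Pick a random base row and one of its k neighbours for every new sample.
    base = rng.integers(0, len(minority), size=n_new)
    chosen = neighbour_idx[base, rng.integers(0, k, size=n_new)]
    # Place each synthetic point at a random position between base and neighbour.
    gap = rng.random((n_new, 1))
    return minority[base] + gap * (minority[chosen] - minority[base])


toy_rng = np.random.default_rng(0)
minority_features = toy_rng.normal(size=(10, 4))  # toy stand-in for melanoma rows
synthetic = smote_oversample(minority_features, n_new=30)
print(synthetic.shape)
```

As with the other balancing methods, this would be applied to the training split only, so the held-out split keeps the original class ratio.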

### Step 5: Train Classifier

In [None]:
classifier_hyper = {'tree_method': 'gpu_hist',
                    'learning_rate': 0.25,
                    'eval_metric': 'auc',  # XGBoost's name for this option is eval_metric
                    'booster': 'dart',
                    'random_state': 100,
                    'n_estimators': 800
                    }

In [None]:
clf = core.Classifier(extracted_feature_dictionary=train_data, 
                      type='xgboost',
                      parameters=classifier_hyper,
                      cv=True,
                      num_splits=5)

In [None]:
train = clf.run()

In [None]:
clf.confusion_matrix()

In [None]:
clf.roc()
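
The competition is scored on the area under the ROC curve, so it can help to compute that number directly as well as plotting the curve. A minimal sketch with scikit-learn's `roc_auc_score` on toy values follows. In this notebook the labels would presumably come from `train_data['test']['labels']` and the scores from column 1 of `clf.classifier.predict_proba(...)` on the held-out features, which is an assumption about how the pieces fit together.

```python
from sklearn.metrics import roc_auc_score

# Toy ground-truth labels (1 = malignant) and predicted probabilities.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC is the fraction of (malignant, benign) pairs ranked in the right order.
auc = roc_auc_score(y_true, y_score)
print(f'AUC: {auc:.2f}')  # AUC: 0.75
```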

### B. Inference

In [None]:
# Build the test feature table the same way as for training, without splitting or balancing.
test_clinical_features = create_patient_data('/kaggle/input/siim-isic-melanoma-classification/test.csv', root='/kaggle/input/siim-isic-melanoma-classification/jpeg/test/', normalize_age=False, test=True, drop_missing=False)
test_imaging_features = pd.read_csv('/kaggle/input/radtorch-challenges-data/test_imaging_features_alexnet.csv')
test_features = create_data(test_imaging_features, test_clinical_features, test_split=False, balance=False)
feature_names = [x for x in test_features.columns.tolist() if x not in ['IMAGE_PATH', 'IMAGE_LABEL']]
test_features = test_features[feature_names]

# Column 1 of predict_proba is used as the probability of malignancy.
predictions = clf.classifier.predict_proba(test_features)
prediction_of_malignancy = [i[1] for i in predictions]

# This assignment relies on test_features having the same row order as sample_submission.csv.
submission = pd.read_csv('/kaggle/input/siim-isic-melanoma-classification/sample_submission.csv')
submission['target'] = prediction_of_malignancy
submission.head()


In [None]:
submission.to_csv('submission.csv', index=False)
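
The submission above is filled by position, which is only correct if the predictions are in the same row order as `sample_submission.csv`. A more defensive option is to join on the image name. The sketch below shows that on toy DataFrames. The column names match the competition files, but the data is made up, and in this notebook `image_name` would first have to be derived from `IMAGE_PATH`.

```python
import pandas as pd

# Toy stand-ins: predictions keyed by image name, deliberately out of order.
predicted = pd.DataFrame({
    'image_name': ['ISIC_3', 'ISIC_1', 'ISIC_2'],
    'target': [0.9, 0.1, 0.5],
})
sample_submission = pd.DataFrame({
    'image_name': ['ISIC_1', 'ISIC_2', 'ISIC_3'],
    'target': [0, 0, 0],
})

# Join on the key instead of assigning by position, then check nothing is missing.
aligned = sample_submission[['image_name']].merge(
    predicted, on='image_name', how='left', validate='one_to_one')
assert aligned['target'].notna().all(), 'some test images have no prediction'
print(aligned)
```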

In [None]:
clf.export('trained_classifier.pkl')