# MELANOMA CLASSIFICATION

This kernel demonstrates the steps followed in building a classifier for the skin images shared under the SIIM-ISIC Melanoma Classification competition. The final model predicts the probability of malignancy for the lesion in each image. Let's start!

# About Melanoma

Skin cancer is the most prevalent type of cancer. Melanoma, specifically, is responsible for 75% of skin cancer deaths, despite being the least common skin cancer. The American Cancer Society estimates over 100,000 new melanoma cases will be diagnosed in 2020. It's also expected that almost 7,000 people will die from the disease. As with other cancers, early and accurate detection—potentially aided by data science—can make treatment more effective.

Currently, dermatologists evaluate every one of a patient's moles to identify outlier lesions or “ugly ducklings” that are most likely to be melanoma. Existing AI approaches have not adequately considered this clinical frame of reference. Dermatologists could enhance their diagnostic accuracy if detection algorithms take into account “contextual” images within the same patient to determine which images represent a melanoma. If successful, classifiers would be more accurate and could better support dermatological clinic work.

As the leading healthcare organization for informatics in medical imaging, the Society for Imaging Informatics in Medicine (SIIM)'s mission is to advance medical imaging informatics through education, research, and innovation in a multi-disciplinary community. SIIM is joined by the International Skin Imaging Collaboration (ISIC), an international effort to improve melanoma diagnosis. The ISIC Archive contains the largest publicly available collection of quality-controlled dermoscopic images of skin lesions.

Melanoma is a deadly disease, but if caught early, most melanomas can be cured with minor surgery. Image analysis tools that automate the diagnosis of melanoma will improve dermatologists' diagnostic accuracy. Better detection of melanoma has the opportunity to positively impact millions of people.

# About the Data

The dataset used is under CC BY-NC 4.0 with the following attribution:

The ISIC 2020 Challenge Dataset https://doi.org/10.34970/2020-ds01 (c) by ISDIS, 2020

Creative Commons Attribution-Non Commercial 4.0 International License.

The dataset was generated by the International Skin Imaging Collaboration (ISIC) and images are from the following sources: Hospital Clínic de Barcelona, Medical University of Vienna, Memorial Sloan Kettering Cancer Center, Melanoma Institute Australia, The University of Queensland, and the University of Athens Medical School.

You should have received a copy of the license along with this work.

If not, see https://creativecommons.org/licenses/by-nc/4.0/legalcode.txt.

# EXPLORATORY DATA ANALYSIS (EDA)

I started by making some imports.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
import tempfile
#for dirname, _, filenames in os.walk('/kaggle/input'):
    #for filename in filenames:
        #print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.image import imread
import cv2
# Technically not necessary in newest versions of jupyter
%matplotlib inline

Next I read the contents of the train and test CSV files into two dataframes, train and test. Here I have worked only with the images under the jpeg folders. As a start, I extracted image features for the images in both the train and test folders using a DenseNet121 architecture. The kernel used for this is referenced below:

https://www.kaggle.com/siddhartamukherjee/melanoma-classification-image-feature-extraction

The image feature CSV files are included here under the folder Image Features. They were loaded into two dataframes, train_features and test_features, as shown below:

In [None]:
train = pd.read_csv("../input/siim-isic-melanoma-classification/train.csv")
test = pd.read_csv("../input/siim-isic-melanoma-classification/test.csv")
submission = pd.read_csv("../input/siim-isic-melanoma-classification/sample_submission.csv")
train_features = pd.read_csv("../input/siimisicimagefeaturesextracted/Image Features/train_img_features.csv")
test_features = pd.read_csv("../input/siimisicimagefeaturesextracted/Image Features/test_img_features.csv")

In both train_features and test_features the image names appeared under a column called **"Unnamed: 0"**, so I renamed that column to **"image_name"** in both dataframes.

In [None]:
train_features = train_features.rename(columns={"Unnamed: 0" : "image_name"})

In [None]:
test_features = test_features.rename(columns={"Unnamed: 0" : "image_name"})
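A column named "Unnamed: 0" usually appears when a CSV was written with an unnamed DataFrame index and then read back without `index_col`. The sketch below reproduces this on a small made-up table (the image names and feature values are hypothetical, not taken from the feature files) and applies the same rename.

```python
import io

import pandas as pd

# Hypothetical stand-in for the extracted-features file: the image names were
# the (unnamed) DataFrame index when the CSV was written.
features = pd.DataFrame(
    {"0": [0.12, 0.56], "1": [0.34, 0.78]},
    index=["ISIC_0000001", "ISIC_0000002"],
)
csv_text = features.to_csv()

# Reading without index_col turns the empty header cell into "Unnamed: 0".
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns.tolist())

# Renaming gives the column a proper key name for later merges.
renamed = raw.rename(columns={"Unnamed: 0": "image_name"})
print(renamed.columns.tolist())
```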

To check that the change took effect, I displayed the first five rows of both the train_features and the test_features dataframes. From the results below it can also be seen that both dataframes contain 256 features per image.

In [None]:
train_features.head()

In [None]:
test_features.head()

Next I displayed the first five rows of the train set and also checked the individual column information as shown below.

In [None]:
train.head()

The train set has 33126 rows and 8 columns.

In [None]:
train.shape

In [None]:
train.info()

From the above results it can be seen that the columns sex, age_approx and anatom_site_general_challenge in the train dataframe have some null values. It would be interesting to see how we can deal with these missing values.
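The output of `info()` only shows non-null counts, so it can help to tabulate the missing values directly. The following is a minimal sketch on a toy dataframe with made-up values; the column names match train.csv, but the numbers do not.

```python
import numpy as np
import pandas as pd

# Toy frame with the three columns that contain nulls in train.csv;
# all values are invented for illustration.
toy = pd.DataFrame({
    "sex": ["male", None, "female", "male"],
    "age_approx": [45.0, np.nan, 60.0, np.nan],
    "anatom_site_general_challenge": ["torso", "torso", None, "head/neck"],
    "target": [0, 0, 1, 0],
})

# Count and percentage of missing values per column.
missing = toy.isnull().sum()
summary = pd.DataFrame({
    "n_missing": missing,
    "pct_missing": (100 * missing / len(toy)).round(2),
})
print(summary)
```

Running the same two lines on `train` and `test` would give the per-column counts for the real data.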

Just like the train set I then displayed the first rows of the test set as shown below.

In [None]:
test.head()

The test set has 10982 rows and 5 columns as shown below. The test set doesn't have the columns diagnosis, benign_malignant and target.

In [None]:
test.shape

From the test dataframe's info output, it seems that the anatom_site_general_challenge column has some null values.

In [None]:
test.info()

Next I checked the count of images that we have in train and test set. The train set has 33126 images and test set has 10982 images.

In [None]:
path, dirs, files = next(os.walk("/kaggle/input/siim-isic-melanoma-classification/jpeg/train"))
file_count = len(files)
file_count

In [None]:
path, dirs, files = next(os.walk("/kaggle/input/siim-isic-melanoma-classification/jpeg/test"))
file_count = len(files)
file_count

After this I checked how many unique patient ids are present. From the results below, the train set has 2056 unique patient ids.

In [None]:
train['patient_id'].nunique()
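With roughly 33,000 images and about 2,000 patients, most patients contribute several images, which is what makes the "ugly duckling" comparison within a patient possible. A small sketch of how the images-per-patient distribution could be inspected is given below; the ids are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical image/patient pairs; the real ids come from train.csv.
toy = pd.DataFrame({
    "image_name": ["img_a", "img_b", "img_c", "img_d", "img_e"],
    "patient_id": ["IP_1", "IP_1", "IP_1", "IP_2", "IP_3"],
})

# Number of images contributed by each patient.
images_per_patient = toy.groupby("patient_id")["image_name"].count()
print(images_per_patient.describe())
```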

Next I made a data analysis of both the train and test set based on gender as shown below.

In [None]:
fig, ax = plt.subplots(1,2,figsize=(20,5))
sns.countplot(x='sex',data=train,ax=ax[0])
ax[0].set_xlabel(" ")
ax[0].set_title("Gender counts in train set")

sns.countplot(x='sex',data=test,ax=ax[1])
ax[1].set_xlabel(" ")
ax[1].set_title("Gender counts in test set")

Male patients outnumber female patients in both the train and test sets, and the gap looks even wider in the test set.

In [None]:
train.loc[train['sex'].isnull(), 'target'].value_counts()

As shown above, 65 records in the training set have a null sex value, and all of them have target 0. Since the vast majority of images belong to target 0, we may drop these 65 records during model building.

Next I made a count plot to check the age distribution in both train and test set.

In [None]:
fig, ax = plt.subplots(1,2,figsize=(20,5))
sns.countplot(x='age_approx',data=train,ax=ax[0])
ax[0].set_xlabel(" ")
ax[0].set_title("Age distribution in train set")

sns.countplot(x='age_approx',data=test,ax=ax[1])
ax[1].set_xlabel(" ")
ax[1].set_title("Age distribution in test set")

The age distribution appears uneven in the test set.

Next I made a graphical display of the location of the images belonging to both train and test set.

In [None]:
temp_train = train.anatom_site_general_challenge.value_counts().sort_values(ascending=False)
temp_test = test.anatom_site_general_challenge.value_counts().sort_values(ascending=False)

fig, ax = plt.subplots(1,2,figsize=(20,5))
sns.barplot(x=temp_train.index.values, y=temp_train.values,ax=ax[0])
ax[0].set_xlabel(" ")
labels = ax[0].get_xticklabels()
ax[0].set_xticklabels(labels, rotation=90)
ax[0].set_title("Image location in train set")

sns.barplot(x=temp_test.index.values, y=temp_test.values,ax=ax[1])
ax[1].set_xlabel(" ")
labels = ax[1].get_xticklabels()
ax[1].set_xticklabels(labels, rotation=90)
ax[1].set_title("Image location in test set")

The distribution of image locations looks similar in the train and test sets.

Next, let's see what the diagnosis column of the train set has to tell us.

In [None]:
chart = sns.countplot(x='diagnosis', data = train)
chart.set_xticklabels(chart.get_xticklabels(), rotation=45, horizontalalignment='right')

For most images the diagnosis is unknown, so we may drop this column during model building.

Next, let's analyze the columns **target** and **benign_malignant**. Both columns carry the same information. Hence, during model building I will only use target. The column **benign_malignant** will be dropped.

In [None]:
sns.countplot(x='target',data=train)

In [None]:
sns.countplot(x='benign_malignant',data=train)

From both count plots we can see that the dataset is highly imbalanced, i.e. very few images belong to the malignant class.

Since there are comparatively few malignant images, we will try resampling to increase the number of malignant samples during data preprocessing.

In [None]:
#Paths to train and test images
train_img_path = '/kaggle/input/siim-isic-melanoma-classification/jpeg/train/'
test_img_path = '/kaggle/input/siim-isic-melanoma-classification/jpeg/test/'

Next I will display some images from the train and test sets. The code for this was adapted from the kernel given below:

https://www.kaggle.com/siddhartamukherjee/siim-isic-melanoma-analysis-eda-prediction

**Let's take a look at some benign tumours from the train set.**

In [None]:
fig = plt.figure(figsize=(50, 50))
for i,idx in enumerate(np.random.choice(train[train['benign_malignant']=='benign'].index,8)):
    img = cv2.imread(train_img_path+str(train.loc[idx,'image_name'])+'.jpg')
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    fig.add_subplot(2, 4, i+1)
    plt.imshow(img)
    plt.title('Patient_id: '+train.loc[idx,'patient_id']+'\n'\
              +'Site: '+str(train.loc[idx,'anatom_site_general_challenge'])+'\n'\
              +'Sex: '+str(train.loc[idx,'sex'])+'\n'\
              +'Approximate Age: '+str(train.loc[idx,'age_approx'])+'\n'\
              +'Diagnosis: '+str(train.loc[idx,'diagnosis']),fontsize=30)
    plt.axis("off")
    plt.tight_layout()

**Now let's take a look at some malignant tumours from the train set.**

In [None]:
fig = plt.figure(figsize=(50, 50))
for i,idx in enumerate(np.random.choice(train[train['benign_malignant']=='malignant'].index,8)):
    img = cv2.imread(train_img_path+str(train.loc[idx,'image_name'])+'.jpg')
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    fig.add_subplot(2, 4, i+1)
    plt.imshow(img)
    plt.title('Patient_id: '+train.loc[idx,'patient_id']+'\n'\
              +'Site: '+str(train.loc[idx,'anatom_site_general_challenge'])+'\n'\
              +'Sex: '+str(train.loc[idx,'sex'])+'\n'\
              +'Approximate Age: '+str(train.loc[idx,'age_approx'])+'\n'\
              +'Diagnosis: '+str(train.loc[idx,'diagnosis']),fontsize=30)
    plt.axis("off")
    plt.tight_layout()

**Finally, let's check what we have to predict.**

In [None]:
fig = plt.figure(figsize=(50, 50))
for i,idx in enumerate(np.random.choice(test.index,8)):
    img = cv2.imread(test_img_path+str(test.loc[idx,'image_name'])+'.jpg')
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    fig.add_subplot(2, 4, i+1)
    plt.imshow(img)
    plt.title('Patient_id: '+test.loc[idx,'patient_id']+'\n'\
              +'Site: '+str(test.loc[idx,'anatom_site_general_challenge'])+'\n'\
              +'Sex: '+str(test.loc[idx,'sex'])+'\n'\
              +'Approximate Age: '+str(test.loc[idx,'age_approx']),fontsize=30)
    plt.axis("off")
    plt.tight_layout()

# DATA PREPROCESSING

From the train set I took the columns image_name, sex, age_approx, anatom_site_general_challenge and target and created a new dataframe df. The columns diagnosis and benign_malignant are not used for model building: diagnosis is mostly unknown and benign_malignant duplicates target.

In [None]:
df = train[['image_name','sex', 'age_approx','anatom_site_general_challenge','target']]

Next I merged the dataframe df with the train features by image_name.

In [None]:
df = pd.merge(df, train_features, on='image_name')

After merging the dataframe looks as shown below.

In [None]:
df.head()

Next, the same steps were repeated for the test set and a dataframe df1 was created. Note that df1 has no target column, as this is what the model will predict.

In [None]:
df1 = test[['image_name','sex','age_approx','anatom_site_general_challenge']]

In [None]:
df1 = pd.merge(df1, test_features, on='image_name')

In [None]:
df1.head()
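`pd.merge` defaults to an inner join, so any image without a matching feature row would be dropped silently. One way to guard against this, sketched here on hypothetical toy frames, is to merge with `how="left"`, `validate="one_to_one"` and `indicator=True`, then look for unmatched rows.

```python
import pandas as pd

# Hypothetical metadata and feature tables; "img_c" has no feature row.
meta = pd.DataFrame({
    "image_name": ["img_a", "img_b", "img_c"],
    "sex": ["male", "female", "male"],
})
feats = pd.DataFrame({
    "image_name": ["img_a", "img_b"],
    "f0": [0.1, 0.2],
})

# validate raises if image_name is duplicated on either side;
# indicator adds a "_merge" column describing where each row came from.
merged = pd.merge(
    meta, feats, on="image_name", how="left",
    validate="one_to_one", indicator=True,
)
unmatched = merged.loc[merged["_merge"] == "left_only", "image_name"].tolist()
print(len(meta), len(merged), unmatched)
```

If `unmatched` is empty and the row counts agree, the inner merge used in the notebook loses nothing.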

As already highlighted in the EDA section, I dropped the 65 records that have a null value in the sex column of the dataframe df. This won't affect the model's performance much, as all of these records belong to target class 0. The target class is highly imbalanced, with the vast majority of images belonging to target 0, so removing 65 of them has little impact.

In [None]:
df = df.dropna(axis=0, subset=['sex'])

The sex column has 2 unique values, **"male"** and **"female"**. Since the model needs numeric inputs, I converted the values to numbers, with male replaced by 0 and female replaced by 1. This was done in both dataframes df and df1.

In [None]:
sex = {"male":0, "female":1}
df['sex'] = df['sex'].map(sex)
df1['sex'] = df1['sex'].map(sex)

Coming to the age_approx column, I already highlighted during EDA that it has some null values, so I replaced the null values with the mean age. This was done in both dataframes df and df1.

In [None]:
# Assign back rather than calling fillna(inplace=True) on a column selection
df['age_approx'] = df['age_approx'].fillna(df['age_approx'].mean())

In [None]:
# Assign back rather than calling fillna(inplace=True) on a column selection
df1['age_approx'] = df1['age_approx'].fillna(df1['age_approx'].mean())
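The cells above fill each dataframe with its own mean. A common alternative, shown here only as a sketch on made-up ages, is to compute the mean on the training data once and reuse it for the test data, so that both sets are imputed with the same value.

```python
import numpy as np
import pandas as pd

# Made-up ages; the real values come from df and df1.
train_toy = pd.DataFrame({"age_approx": [40.0, 50.0, np.nan, 60.0]})
test_toy = pd.DataFrame({"age_approx": [30.0, np.nan]})

# Statistic is estimated on the training data only.
train_mean = train_toy["age_approx"].mean()

train_toy["age_approx"] = train_toy["age_approx"].fillna(train_mean)
test_toy["age_approx"] = test_toy["age_approx"].fillna(train_mean)
print(train_mean, test_toy["age_approx"].tolist())
```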

Next I replaced the null values in anatom_site_general_challenge with the value unknown and then assigned an integer code to each value of this column as shown below. This was done for both dataframes df and df1.

In [None]:
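# Assign back rather than calling fillna(inplace=True) on a column selection
df['anatom_site_general_challenge'] = df['anatom_site_general_challenge'].fillna('unknown')
df1['anatom_site_general_challenge'] = df1['anatom_site_general_challenge'].fillna('unknown')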
img_loc = {'head/neck':1, 'upper extremity':2, 'lower extremity':3, 'torso':4, 'palms/soles':5, 'oral/genital':6, 'unknown':7}
df['anatom_site_general_challenge'] = df['anatom_site_general_challenge'].map(img_loc)
df1['anatom_site_general_challenge'] = df1['anatom_site_general_challenge'].map(img_loc)

Next I one hot encoded the columns sex and anatom_site_general_challenge as shown below for both the datasets df and df1.

In [None]:
df = pd.get_dummies(df, columns=["sex"])
df1 = pd.get_dummies(df1, columns=["sex"])
df = pd.get_dummies(df, columns=["anatom_site_general_challenge"])
df1 = pd.get_dummies(df1, columns=["anatom_site_general_challenge"])
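Because `get_dummies` is called on each dataframe separately, a category that occurs in only one of them would produce different column sets, and the model would then see mismatched features at prediction time. The sketch below, on hypothetical toy data with a shortened column name `site`, shows one way to align the test columns to the training columns with `reindex`.

```python
import pandas as pd

# Hypothetical data: site code 6 occurs only in the training frame.
train_toy = pd.DataFrame({"site": [1, 4, 6], "age_approx": [40.0, 55.0, 60.0]})
test_toy = pd.DataFrame({"site": [1, 4], "age_approx": [35.0, 70.0]})

train_enc = pd.get_dummies(train_toy, columns=["site"], dtype=int)
test_enc = pd.get_dummies(test_toy, columns=["site"], dtype=int)

# Reorder test columns to match training; dummy columns that are
# missing in the test frame are created and filled with 0.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_enc.columns.tolist())
```

In this notebook the alignment would be done against the columns of X, since df also carries target and image_name.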

The results after one hot encoding are shown below for both dataframes df and df1.

In [None]:
df.head()

In [None]:
df1.head()

Next I calculated the percentage of target 1 records in the dataframe. The target 1 records form only 1.77% of the entire dataset. This suggests that we should resample to increase the target 1 records, so that the model does not become biased toward target class 0.

In [None]:
neg, pos = np.bincount(df['target'])
total = neg + pos
print('Examples:\n    Total: {}\n    Positive: {} ({:.2f}% of total)\n'.format(
    total, pos, 100 * pos / total))

Now, let's make the X and y sets. All the columns except image_name and target go into X. The target column is our y.

In [None]:
X = df.drop(columns=['target', 'image_name'])
y = df['target']
df_test = df1.drop(columns=['image_name'])

Next I made a train test split of our X and y dataset.

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.33, random_state=101)
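With under 2% positives, a purely random split can leave the two parts with somewhat different malignant rates. `train_test_split` accepts a `stratify` argument that keeps the class ratio roughly equal in both parts. The following is a sketch on synthetic data (the arrays `X_toy` and `y_toy` are invented), not a change to the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data with 2% positives, loosely mimicking the imbalance.
rng = np.random.default_rng(101)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.zeros(1000, dtype=int)
y_toy[:20] = 1

# stratify=y_toy preserves the positive rate in both parts of the split.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=101, stratify=y_toy
)
print(y_tr.mean(), y_val.mean())
```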

I will use **Adaptive Synthetic Sampling (ADASYN)** to oversample the minority class. It is an extension of the **Synthetic Minority Oversampling Technique (SMOTE)** that generates more synthetic examples in regions of the feature space where the density of minority examples is low, and fewer or none where the density is high. More details on the different SMOTE variants can be found at the link below:

https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/

After applying ADASYN we can see that the samples belonging to target class 1 increased from 382 to 21754 in the X_train dataset. ADASYN is applied only to the training split and not to the hold-out test split, because we want to keep only real data in the test split.

In [None]:
from imblearn.over_sampling import SMOTE, SVMSMOTE, ADASYN,BorderlineSMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.under_sampling import NearMiss 
from numpy import mean
from imblearn.pipeline import Pipeline
from collections import Counter
# summarize class distribution
counter = Counter(y_train)
print(counter)
# Oversample the minority class with ADASYN (the under-sampling pipeline below is left commented out)
over = ADASYN()
#under = RandomUnderSampler(sampling_strategy=0.5)
#steps = [('o', over), ('u', under)]
#pipeline = Pipeline(steps=steps)
X_train, y_train= over.fit_resample(X_train, y_train)
# summarize the new class distribution
counter = Counter(y_train)
print(counter)
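To make the idea behind ADASYN more concrete, here is a simplified NumPy sketch of its two steps: weight each minority point by the share of majority points among its nearest neighbours, then interpolate toward nearby minority points in proportion to that weight. The function `adasyn_sketch` and the toy data are illustrative assumptions; this is not the imblearn implementation, which handles more cases.

```python
import numpy as np


def adasyn_sketch(X, y, k=5, beta=1.0, seed=0):
    """Simplified ADASYN for binary labels where 1 marks the minority class."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == 1)
    X_min = X[min_idx]
    n_min = len(min_idx)
    n_maj = len(y) - n_min
    n_to_generate = int((n_maj - n_min) * beta)

    # Difficulty of each minority point: share of majority-class points among
    # its k nearest neighbours in the full dataset (the point itself excluded).
    dist_all = np.linalg.norm(X_min[:, None, :] - X[None, :, :], axis=2)
    dist_all[np.arange(n_min), min_idx] = np.inf
    nn_all = np.argsort(dist_all, axis=1)[:, :k]
    ratio = (y[nn_all] == 0).mean(axis=1)
    if n_to_generate <= 0 or ratio.sum() == 0:
        return X.copy(), y.copy()

    # Harder points (more majority neighbours) receive more synthetic samples.
    per_point = np.rint(ratio / ratio.sum() * n_to_generate).astype(int)

    # Interpolation partners: the k nearest minority-class neighbours.
    dist_min = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist_min, np.inf)
    nn_min = np.argsort(dist_min, axis=1)[:, :k]

    synthetic = []
    for i, count in enumerate(per_point):
        for _ in range(count):
            partner = X_min[rng.choice(nn_min[i])]
            lam = rng.random()  # position on the segment between the two points
            synthetic.append(X_min[i] + lam * (partner - X_min[i]))
    if not synthetic:
        return X.copy(), y.copy()

    X_res = np.vstack([X, np.array(synthetic)])
    y_res = np.concatenate([y, np.ones(len(synthetic), dtype=y.dtype)])
    return X_res, y_res


# Toy data: 200 majority points and 20 overlapping minority points.
rng = np.random.default_rng(42)
X_toy = np.vstack([
    rng.normal(0.0, 1.0, size=(200, 2)),
    rng.normal(1.5, 1.0, size=(20, 2)),
])
y_toy = np.array([0] * 200 + [1] * 20)

X_res, y_res = adasyn_sketch(X_toy, y_toy)
print(np.bincount(y_toy), np.bincount(y_res))
```

Because the per-point counts are rounded, the resampled classes end up only approximately balanced, which matches the slightly unequal counts reported for X_train above.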

Next I displayed the shape of X_train and X_test to check how many records we now have in train and test set. 

In [None]:
X_train.shape

In [None]:
X_test.shape

From the above results, the train set has 43522 samples and the hold-out test set has 10911 samples.

# MODEL BUILDING


Now it's time to build the model. I chose the LightGBM classifier. The "Light" in LightGBM refers to its speed: it handles large datasets and needs relatively little memory to run. It is also popular for the accuracy it typically achieves, and it supports GPU training.

In this model I will also tune the hyperparameters of LGBMClassifier with a random search followed by a grid search. All of these steps are shown below. The implementation was adapted from the kernel below:

https://www.kaggle.com/mlisovyi/lightgbm-hyperparameter-optimisation-lb-0-761


* # Prepare learning rate shrinkage

In [None]:
def learning_rate_010_decay_power_099(current_iter):
    base_learning_rate = 0.1
    lr = base_learning_rate  * np.power(.99, current_iter)
    return lr if lr > 1e-3 else 1e-3

def learning_rate_010_decay_power_0995(current_iter):
    base_learning_rate = 0.1
    lr = base_learning_rate  * np.power(.995, current_iter)
    return lr if lr > 1e-3 else 1e-3

def learning_rate_005_decay_power_099(current_iter):
    base_learning_rate = 0.05
    lr = base_learning_rate  * np.power(.99, current_iter)
    return lr if lr > 1e-3 else 1e-3
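All three functions follow the same rule: the learning rate at iteration t is max(base · decay^t, 1e-3). The sketch below writes that rule as one parameterised function (the name `decayed_learning_rate` is my own, not from the referenced kernel) and prints a few values to show how quickly the rate shrinks.

```python
import numpy as np


def decayed_learning_rate(current_iter, base_learning_rate=0.1,
                          decay=0.99, floor=1e-3):
    """Exponentially decayed learning rate that never drops below `floor`."""
    lr = base_learning_rate * np.power(decay, current_iter)
    return max(lr, floor)


# Compare the two decay factors used above at a few boosting iterations.
for it in (0, 100, 300, 500, 1000):
    print(it,
          round(decayed_learning_rate(it, decay=0.99), 5),
          round(decayed_learning_rate(it, decay=0.995), 5))
```

With a base rate of 0.1 and a decay of 0.99, the floor of 1e-3 is reached after roughly 460 iterations; with a decay of 0.995 it takes about twice as long.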

* # Set up HyperParameter search
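We use random search, which samples a fixed number of parameter combinations from the distributions below and is typically cheaper than an exhaustive grid search over the same ranges.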

In [None]:
from scipy.stats import randint as sp_randint
from scipy.stats import uniform as sp_uniform
param_test ={'num_leaves': sp_randint(6, 50), 
             'min_child_samples': sp_randint(100, 500), 
             'min_child_weight': [1e-5, 1e-3, 1e-2, 1e-1, 1, 1e1, 1e2, 1e3, 1e4],
             'subsample': sp_uniform(loc=0.2, scale=0.8), 
             'colsample_bytree': sp_uniform(loc=0.4, scale=0.6),
             'reg_alpha': [0, 1e-1, 1, 2, 5, 7, 10, 50, 100],
             'reg_lambda': [0, 1e-1, 1, 5, 10, 20, 50, 100]}
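The scipy distributions above are parameterised by `loc` and `scale`, which can be easy to misread: `sp_uniform(loc=0.2, scale=0.8)` covers the interval from 0.2 to 1.0, and `sp_randint(6, 50)` draws integers from 6 up to and including 49. A quick sketch that samples from them to confirm the ranges:

```python
from scipy.stats import randint as sp_randint
from scipy.stats import uniform as sp_uniform

# Draw many samples from the same distributions used in param_test.
num_leaves = sp_randint(6, 50).rvs(size=1000, random_state=0)
subsample = sp_uniform(loc=0.2, scale=0.8).rvs(size=1000, random_state=0)
colsample = sp_uniform(loc=0.4, scale=0.6).rvs(size=1000, random_state=0)

print("num_leaves:", num_leaves.min(), num_leaves.max())
print("subsample:", round(subsample.min(), 3), round(subsample.max(), 3))
print("colsample_bytree:", round(colsample.min(), 3), round(colsample.max(), 3))
```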

In [None]:
#This parameter defines the number of HP points to be tested
n_HP_points_to_test = 100

import lightgbm as lgb
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

#n_estimators is set to a "large value". The actual number of trees built depends on early stopping; 5000 is only the absolute maximum
clf = lgb.LGBMClassifier(max_depth=-1, random_state=314, silent=True, metric='None', n_jobs=4, n_estimators=5000)
gs = RandomizedSearchCV(
    estimator=clf, param_distributions=param_test, 
    n_iter=n_HP_points_to_test,
    scoring='roc_auc',
    cv=3,
    refit=True,
    random_state=314,
    verbose=True)

* # Use test subset for early stopping criterion
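This allows us to avoid overtraining, and we do not need to optimise the number of trees. Note that passing early_stopping_rounds and verbose to fit(), as done below, relies on the LightGBM 3.x API; LightGBM 4.0 and later expect the lgb.early_stopping and lgb.log_evaluation callbacks instead.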

In [None]:
fit_params={"early_stopping_rounds":30, 
            "eval_metric" : 'auc', 
            "eval_set" : [(X_test,y_test)],
            'eval_names': ['valid'],
            #'callbacks': [lgb.reset_parameter(learning_rate=learning_rate_010_decay_power_099)],
            'verbose': 100,
            'categorical_feature': 'auto'}

The hyperparameter optimization using random search was run as shown below.

In [None]:
gs.fit(X_train, y_train, **fit_params)
print('Best score reached: {} with params: {} '.format(gs.best_score_, gs.best_params_))

The optimal parameters from the search were saved in the dictionary opt_parameters.

In [None]:
opt_parameters = {'colsample_bytree': 0.9023523372315546, 
                  'min_child_samples': 237, 
                  'min_child_weight': 0.01, 
                  'num_leaves': 39, 
                  'reg_alpha': 10, 
                  'reg_lambda': 0.1, 
                  'subsample': 0.7187028219151861}

* # Some more tuning

In [None]:
clf_sw = lgb.LGBMClassifier(**clf.get_params())
#set optimal parameters
clf_sw.set_params(**opt_parameters)

In [None]:
gs_sample_weight = GridSearchCV(estimator=clf_sw, 
                                param_grid={'scale_pos_weight':[1,2,6,12]},
                                scoring='roc_auc',
                                cv=5,
                                refit=True,
                                verbose=True)

In [None]:
gs_sample_weight.fit(X_train, y_train, **fit_params)
print('Best score reached: {} with params: {} '.format(gs_sample_weight.best_score_, gs_sample_weight.best_params_))

* # Build the final model
I used the tuned parameter values but a smaller learning rate to allow smoother convergence to the minimum.

In [None]:
#Configure from the HP optimisation
clf_final = lgb.LGBMClassifier(**gs.best_estimator_.get_params())

#Configure locally from hardcoded values
#clf_final = lgb.LGBMClassifier(**clf.get_params())
#set optimal parameters
clf_final.set_params(**opt_parameters)

#Train the final model with learning rate decay
clf_final.fit(X_train, y_train, **fit_params, callbacks=[lgb.reset_parameter(learning_rate=learning_rate_010_decay_power_0995)])

# Plot feature importance

In [None]:
feat_imp = pd.Series(clf_final.feature_importances_, index=X.columns)
feat_imp.nlargest(20).plot(kind='barh', figsize=(8,10))

# PREDICTIONS

Next the predictions were made on the df_test dataset which contains the unknown images.

In [None]:
from sklearn.metrics import classification_report,confusion_matrix,roc_auc_score

In [None]:
#Prediction
y_pred=clf_final.predict_proba(df_test)

In [None]:
y_pred = y_pred[:,1]
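`predict_proba` returns one column per class, and column 1 holds the probability of the positive (malignant) class, which is why it is selected above. The same positive-class scores can be passed to `roc_auc_score`, which is imported above but not used. The sketch below uses invented labels and probabilities; in this notebook the analogous call would use y_test and the second column of clf_final.predict_proba(X_test), keeping in mind that this split was also used for early stopping, so the score would be somewhat optimistic.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Invented labels and class probabilities in the layout of predict_proba:
# column 0 = P(benign), column 1 = P(malignant).
y_true = np.array([0, 0, 1, 0, 1, 0])
proba = np.array([
    [0.90, 0.10],
    [0.60, 0.40],
    [0.30, 0.70],
    [0.80, 0.20],
    [0.65, 0.35],
    [0.50, 0.50],
])

# AUC is computed from the positive-class scores only.
pos_scores = proba[:, 1]
auc = roc_auc_score(y_true, pos_scores)
print(auc)
```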

# SUBMISSIONS 

In [None]:
submission = pd.DataFrame({
    "image_name": df1.image_name, 
    "target": y_pred
})

In [None]:
submission.head()

In [None]:
submission.to_csv('submission.csv', index=False)
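Before uploading, a few quick checks on the submission frame can catch common problems such as missing rows or probabilities outside [0, 1]. This is a sketch on a hypothetical three-row frame; the same checks could be run on `submission`.

```python
import pandas as pd

# Hypothetical submission rows; real names and scores come from df1 and y_pred.
sub = pd.DataFrame({
    "image_name": ["ISIC_a", "ISIC_b", "ISIC_c"],
    "target": [0.02, 0.87, 0.10],
})

checks = {
    "columns_ok": list(sub.columns) == ["image_name", "target"],
    "no_missing": not sub.isnull().values.any(),
    "unique_names": bool(sub["image_name"].is_unique),
    "valid_range": bool(sub["target"].between(0, 1).all()),
}
print(checks)
```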