![SIIM-ISIC Melanoma](https://siim.org/resource/resmgr/SIIM_logo-600x315.png)

## Introduction
As the leading healthcare organization for informatics in medical imaging, the **Society for Imaging Informatics in Medicine (SIIM)**'s mission is to advance medical imaging informatics through education, research, and innovation in a multi-disciplinary community. SIIM is joined by the International Skin Imaging Collaboration (ISIC), an international effort to improve melanoma diagnosis. The ISIC Archive contains the largest publicly available collection of quality-controlled dermoscopic images of skin lesions.

### What is Melanoma?
Melanoma is a type of skin cancer that develops when melanocytes (the cells that give the skin its tan or brown color) start to grow out of control.

Cancer starts when cells in the body begin to grow out of control. Cells in nearly any part of the body can become cancer, and can then spread to other areas of the body.

Melanoma is much less common than some other types of skin cancers. But melanoma is more dangerous because it’s much more likely to spread to other parts of the body if not caught and treated early.
![](https://www.cancer.org/cancer/melanoma-skin-cancer/about/what-is-melanoma/_jcr_content/par/textimage/image.img.jpg/1568208521537.jpg)

### Signs and Symptoms of Melanoma
The most important warning sign of melanoma is a new spot on the skin or a spot that is changing in size, shape, or color.

Another important sign is a spot that looks different from all of the other spots on your skin (known as the *ugly duckling sign*).

If you have one of these warning signs, have your skin checked by a doctor.

The **ABCDE** rule is another guide to the usual signs of melanoma. Be on the lookout and tell your doctor about spots that have any of the following features:

* A is for Asymmetry: One half of a mole or birthmark does not match the other.
* B is for Border: The edges are irregular, ragged, notched, or blurred.
* C is for Color: The color is not the same all over and may include different shades of brown or black, or sometimes with patches of pink, red, white, or blue.
* D is for Diameter: The spot is larger than 6 millimeters across (about ¼ inch – the size of a pencil eraser), although melanomas can sometimes be smaller than this.
* E is for Evolving: The mole is changing in size, shape, or color.

source: [cancer.org](https://www.cancer.org/cancer/melanoma-skin-cancer/)

# The Competition

### Objective
The objective of this competition is to identify melanoma in images of skin lesions. The data consists of the images together with metadata for each patient from whom they were acquired. It is a **Binary Classification** problem, where the two target labels are **0 - Benign** and **1 - Malignant**.

### Evaluation
Our submissions will be evaluated using **area under the ROC curve**. An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters:
![AUC-ROC](https://imgur.com/yNeAG4M.png)
An ROC curve plots TPR vs. FPR at different classification thresholds. Lowering the classification threshold classifies more items as positive, thus increasing both False Positives and True Positives. The following figure shows a typical ROC curve.
<img src="https://developers.google.com/machine-learning/crash-course/images/ROCCurve.svg" width="400">
AUC provides an aggregate measure of performance across all possible classification thresholds. One way of interpreting AUC is as the probability that the model ranks a random positive example more highly than a random negative example.

For more information, feel free to check out this [tutorial](https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc) by Google.
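
To make the ranking interpretation of AUC concrete, here is a minimal sketch that computes it by brute force over every (positive, negative) pair. The function name `pairwise_auc` and the toy labels and scores are made up for illustration; this is not the competition's scoring code.

```python
import numpy as np

def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    correct = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(correct / diff.size)

# Toy labels and scores, invented for this example.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(labels, scores)
print(toy_auc)  # 3 of the 4 pairs are ordered correctly -> 0.75
```

The brute-force comparison needs memory proportional to (number of positives) x (number of negatives), so it is only meant for building intuition on small samples.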

# 1. Importing necessary libraries

We begin this EDA by importing the necessary libraries. If you're wondering what `colored` is, it is just a Python package for printing colored and formatted strings in the terminal.

In [None]:
!pip install colored --upgrade

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from colored import fg, bg, attr

sns.set_style("whitegrid")
sns.set_palette("pastel")
%matplotlib inline

# 2. Loading data

In [None]:
basepath = "../input/siim-isic-melanoma-classification/"
train_df = pd.read_csv(basepath + "train.csv")
test_df = pd.read_csv(basepath + "test.csv")

print("Train df shape: %s%s%s%s" % (attr(1), fg(156), str(train_df.shape), attr(0)))
display(train_df.head())
print("Test df shape: %s%s%s%s" % (attr(1), fg(156), str(test_df.shape), attr(0)))
display(test_df.head())

# 3. Exploration

## Column information

In [None]:
print("%s%s ====== train_df INFO ====== %s" % (attr(1), fg(156), attr(0)))
train_df.info()  # info() prints directly and returns None, so no print() wrapper is needed
print("%s%s ====== test_df INFO ====== %s" % (attr(1), fg(156), attr(0)))
test_df.info()  # info() prints directly and returns None, so no print() wrapper is needed

In [None]:
missing_train = train_df.isnull().sum()
missing_test = test_df.isnull().sum()

print("%s%sMissing values in %straining data: %s" % (attr(1), fg(197), fg(156), attr(0)))
print(missing_train[missing_train > 0].sort_values(ascending=False))

print("%s%sMissing values in %stest data: %s" % (attr(1), fg(197), fg(156), attr(0)))
print(missing_test[missing_test > 0].sort_values(ascending=False))

We observe that there are missing values in both training data and test data. We will have to deal with this later for modelling purposes if we wish to use the metadata to make predictions.
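
As a rough idea of what dealing with the gaps could look like, here is a minimal sketch. It assumes the missing values sit in the `sex`, `age_approx` and `anatom_site_general_challenge` columns; the helper name `fill_missing_metadata`, the `"unknown"` placeholder and the median fill are my own choices for illustration, not something the competition prescribes.

```python
import numpy as np
import pandas as pd

def fill_missing_metadata(df, age_fill):
    """Return a copy of df with missing metadata replaced by simple placeholders."""
    out = df.copy()
    # Categorical gaps get an explicit "unknown" category.
    out["sex"] = out["sex"].fillna("unknown")
    out["anatom_site_general_challenge"] = out["anatom_site_general_challenge"].fillna("unknown")
    # Numeric gaps get a single fill value supplied by the caller.
    out["age_approx"] = out["age_approx"].fillna(age_fill)
    return out

# Tiny made-up frame standing in for train_df.
toy_meta = pd.DataFrame({
    "sex": ["male", None, "female"],
    "age_approx": [50.0, np.nan, 70.0],
    "anatom_site_general_challenge": ["torso", "head/neck", None],
})

# Compute the fill value once so the same number can be reused on the test set.
toy_filled = fill_missing_metadata(toy_meta, age_fill=toy_meta["age_approx"].median())
print(toy_filled)
```

Passing the age fill value in explicitly makes it easy to reuse the training-set statistic on the test data instead of recomputing it there.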

## Unique patients

In [None]:
print("%sNo. of patients in training data: %s%s%s" % (attr(1), fg(45), train_df.patient_id.nunique(), attr(0)))
print("%sNo. of patients in testing data: %s%s%s" % (attr(1), fg(45), test_df.patient_id.nunique(), attr(0)))

## Visualizations

### Gender Distribution

In [None]:
fig, ax = plt.subplots(1,2, figsize=(19,6))

sns.countplot(x=train_df.sex, ax=ax[0], palette=['dodgerblue','lightpink'], alpha=0.7)  # pass the column as x explicitly
ax[0].set_title("Training data")

sns.countplot(x=test_df.sex, ax=ax[1], palette=['dodgerblue', 'lightpink'], alpha=0.7)  # pass the column as x explicitly
ax[1].set_title("Test data")

plt.show()

The gender distribution seems to be more balanced in the training data. The number of males is higher in both datasets, but the test data has a noticeably higher proportion of males.

### No. of images per patient

In [None]:
train_patient_count = train_df.groupby('patient_id').image_name.count()
test_patient_count = test_df.groupby('patient_id').image_name.count()

fig, ax = plt.subplots(2,2,figsize=(19,11))

sns.histplot(train_patient_count, bins=30, ax=ax[0,0], color="mediumspringgreen")  # histplot replaces the deprecated distplot(kde=False)
ax[0,0].set_title("Training Data")
ax[0,0].set_xlabel("")
ax[0,0].set_ylabel("Frequency")

sns.histplot(test_patient_count, bins=30, ax=ax[0,1], color='mediumorchid')  # histplot replaces the deprecated distplot(kde=False)
ax[0,1].set_title("Test Data")
ax[0,1].set_xlabel("")
ax[0,1].set_ylabel("Frequency")

sns.boxplot(x=train_patient_count, ax=ax[1,0], color="mediumspringgreen")  # x= keeps the box horizontal
ax[1,0].set_xlabel("No. of images")

sns.boxplot(x=test_patient_count, ax=ax[1,1], color='mediumorchid')  # x= keeps the box horizontal
ax[1,1].set_xlabel("No. of images")
plt.show()

We observe that there are multiple images for every patient. Most patients have fewer than about 20 images, but there are patients with up to approximately **120** images in the training data and **250** images in the test data.

### Patient Age Distribution

In [None]:
train_patient_ages = list(train_df.groupby("patient_id").age_approx.unique())
train_mean_patient_ages = [np.mean(ages) for ages in train_patient_ages]

test_patient_ages = list(test_df.groupby("patient_id").age_approx.unique())
test_mean_patient_ages = [np.mean(ages) for ages in test_patient_ages]

fig, ax = plt.subplots(1,2, figsize=(19,6))

sns.histplot(train_mean_patient_ages, bins=15, kde=True, stat="density", ax=ax[0], color='mediumspringgreen')  # replaces the deprecated distplot
ax[0].set_title("Age Distribution in Training Data")
ax[0].set_xlabel("Mean age")
ax[0].set_ylabel("Frequency")

sns.histplot(test_mean_patient_ages, bins=15, kde=True, stat="density", ax=ax[1], color='mediumorchid')  # replaces the deprecated distplot
ax[1].set_title("Age Distribution in Test Data")
ax[1].set_xlabel("Mean age")
ax[1].set_ylabel("Frequency")

plt.show()

We observe that patients might have multiple ages recorded in the metadata, probably because of varying sources of the images. Hence, I decided to take the mean of the recorded ages of each patient and plot a histogram.

Both training data and test data seem to have a similar bell-shaped distribution of age, with most of the patients being around the age of **40-60 years**, though there is a slightly higher number of older patients in the test data. This sort of imbalance might be important to consider while modelling.

### Location Distribution

In [None]:
fig, ax = plt.subplots(1,2, figsize=(19,6))

train_locations = train_df.anatom_site_general_challenge.value_counts().sort_values(ascending=False)
test_locations = test_df.anatom_site_general_challenge.value_counts().sort_values(ascending=False)

sns.barplot(x=train_locations.index.values, y=train_locations.values, ax=ax[0], color="mediumspringgreen", alpha=0.5);
ax[0].set_xlabel("");
labels = ax[0].get_xticklabels();
ax[0].set_xticklabels(labels, rotation=45);
ax[0].set_title("Image locations in Training data");

sns.barplot(x=test_locations.index.values, y=test_locations.values, ax=ax[1], color="mediumorchid", alpha=0.5);
ax[1].set_xlabel("");
labels = ax[1].get_xticklabels();
ax[1].set_xticklabels(labels, rotation=45);
ax[1].set_title("Image locations in Test data");

plt.show()

Both training and test data have a similar distribution of images across locations, with the **Torso** region having the largest number of images, followed by **Lower Extremity**.

### Diagnosis Distribution

In [None]:
train_diagnosis = train_df.diagnosis.value_counts().sort_values(ascending=False)

fig, ax = plt.subplots(1,2, figsize=(19,6))

sns.barplot(x=train_diagnosis.index.values, y=train_diagnosis.values, ax=ax[0], color="mediumspringgreen", alpha=0.5);
ax[0].set_xlabel("");
labels = ax[0].get_xticklabels();
ax[0].set_xticklabels(labels, rotation=45);
ax[0].set_title("Diagnoses in Training data");

sns.countplot(x=train_df.benign_malignant, ax=ax[1], palette=['tomato','red'], alpha=0.7)  # pass the column as x explicitly
ax[1].set_title("Benign vs. malignant in Training data")
plt.show()

In [None]:
benign = train_df.groupby('benign_malignant').image_name.count()['benign'] / train_df.shape[0] * 100
malignant = train_df.groupby('benign_malignant').image_name.count()['malignant'] / train_df.shape[0] * 100

print("%sTarget Distribution in Training data%s" % (attr(1), attr(0)))
print("%s%sBenign: %s%s%%%s" % (attr(1), fg(197), fg(156), np.round(benign, 2), attr(0)))
print("%s%sMalignant: %s%s%%%s" % (attr(1), fg(197), fg(156), np.round(malignant, 2), attr(0)))

We observe that most of the diagnoses are **unknown** and there is a very high imbalance in the target variable, with around 98% of the data being labeled as **benign**. This imbalance will play a key role in how our model makes predictions, so we have to handle it carefully.
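
One common way to handle such an imbalance is to weight the rare class more heavily in the loss. The sketch below only computes the weight; the helper name `positive_class_weight` and the toy targets are made up for illustration, and class weighting is just one option among several (oversampling is another).

```python
import numpy as np

def positive_class_weight(targets):
    """Ratio of negative to positive samples, usable as a weight for the positive class."""
    targets = np.asarray(targets)
    n_pos = int((targets == 1).sum())
    n_neg = int((targets == 0).sum())
    if n_pos == 0:
        raise ValueError("no positive samples to weight")
    return n_neg / n_pos

# Made-up targets with roughly the 98% / 2% split seen above.
toy_targets = np.array([0] * 98 + [1] * 2)
toy_weight = positive_class_weight(toy_targets)
print(toy_weight)  # 98 negatives / 2 positives -> 49.0
```

On the real data the targets could be derived as `(train_df.benign_malignant == "malignant").astype(int)`. A value like this is the kind of number that could be passed to the `pos_weight` argument of PyTorch's `BCEWithLogitsLoss`, though whether weighting actually helps AUC here would have to be checked by validation.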

### Age/Gender Distribution

In [None]:
fig, ax = plt.subplots(1,2, figsize=(19,6))

sns.countplot(x=train_df.age_approx, hue=train_df.sex, palette='Greens', ax=ax[0], alpha=0.7)
ax[0].set_title("Training data")

sns.countplot(x=test_df.age_approx, hue=test_df.sex, palette='Purples', ax=ax[1], alpha=0.7)
ax[1].set_title("Test data")
plt.show()

These graphs are pretty interesting. In the training data, there are **more images from females** at younger ages of around **25-40 years**, whereas in the test data, most of the age groups have **more images from males**, with a few exceptions.

But what's most interesting is that, in both datasets, at the age of **70 years**, the number of males is **significantly higher** than the number of females. This pattern should be kept in mind while modelling so that our model does not overfit to it.

### Correlations with Age

In [None]:
fig, ax = plt.subplots(1,2, figsize=(19,6))

sns.boxplot(y=train_df.age_approx, x=train_df.diagnosis, ax=ax[1], palette='PuRd', boxprops=dict(alpha=.7))
ax[1].set_title("Age/Diagnosis in Training data")
ax[1].set_xlabel("");
labels = ax[1].get_xticklabels();
ax[1].set_xticklabels(labels, rotation=45);

sns.boxplot(y=train_df.age_approx, x=train_df.anatom_site_general_challenge, ax=ax[0], palette='GnBu', boxprops=dict(alpha=.7))
ax[0].set_title("Age/Location in Training data")  # this panel is drawn from train_df
ax[0].set_xlabel("");
labels = ax[0].get_xticklabels();
ax[0].set_xticklabels(labels, rotation=45);

plt.show()

The distribution of age seems to be almost the same for every location, whereas it varies considerably between diagnoses. ***Solar lentigo and Lichenoid keratosis*** seem to occur more at older ages whereas ***Nevus*** seems to occur more at younger ages.

The unknown class seems to be more uniformly distributed.

### Gender/Diagnosis Distribution

In [None]:
female_diagnosis = list(train_df.groupby(['sex','diagnosis']).image_name.count()[:7])
z = 0
female_diagnosis = [z]*2 + female_diagnosis

male_diagnosis = list(train_df.groupby(['sex', 'diagnosis']).image_name.count()[7:])

diagnosis = ['atypical melanocytic proliferation', 'cafe-au-lait macule', 'lentigo NOS', 'lichenoid keratosis', 'melanoma', 'nevus', 'seborrheic keratosis', 'solar lentigo', 'unknown']

assert train_df.sex.value_counts()['male'] == np.sum(male_diagnosis)
assert train_df.sex.value_counts()['female'] == np.sum(female_diagnosis)

male_diagnosis = (male_diagnosis / np.sum(male_diagnosis)) * 100
female_diagnosis = (female_diagnosis / np.sum(female_diagnosis)) * 100

print("%s%sMale Diagnosis Statistics%s" % (attr(1), fg(45), attr(0)))
for i in range(len(diagnosis)):
    print("%s%s=> " % (attr(1), fg(14)) + diagnosis[i] + ": %s" % (fg(156)) + str(np.round(male_diagnosis[i],2)) + "%%%s" % (attr(0)))
    
print("%s%s=============================================%s" % (attr(1), fg(250), attr(0)))

print("%s%sFemale Diagnosis Statistics%s" % (attr(1), fg(213), attr(0)))
for i in range(len(diagnosis)):
    print("%s%s=> " % (attr(1), fg(205)) + diagnosis[i] + ": %s" % (fg(156)) + str(np.round(female_diagnosis[i],2)) + "%%%s" % (attr(0)))

Instead of a graph, I decided to calculate the percentage of males and females having each of the diagnoses, which gives a better understanding of whether either gender is more likely to receive a particular diagnosis.

Most of the data is similar for both genders, but there are no cases of **atypical melanocytic proliferation and cafe-au-lait macule** in females, at least in the training data.
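
The percentages above rely on slicing the grouped counts by position and padding the female list with zeros by hand. As an alternative, here is a minimal sketch using `pd.crosstab`, which fills in the missing combinations with zeros on its own. The helper name `diagnosis_share_by_sex` and the toy rows are made up for illustration.

```python
import pandas as pd

def diagnosis_share_by_sex(df):
    """Percentage of each diagnosis within each sex (every column sums to 100)."""
    return pd.crosstab(df["diagnosis"], df["sex"], normalize="columns") * 100

# Small made-up frame standing in for train_df.
toy_diag = pd.DataFrame({
    "sex": ["male", "male", "male", "male", "female", "female"],
    "diagnosis": ["unknown", "unknown", "nevus", "melanoma", "unknown", "nevus"],
})

toy_share = diagnosis_share_by_sex(toy_diag)
print(toy_share.round(2))
```

In this toy frame there is no female row with a melanoma diagnosis, and the table still shows that cell as 0 without any manual padding.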

## Conclusion

* The most important observation from the exploration is the **high imbalance of the target variable**.
* The fact that there are multiple images for every patient should be considered together with the above point when validating your model, so that the validation score reflects how well it generalizes to unseen patients.
* The imbalance in gender distributions at the age of **70 years** is also a worthwhile observation.
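
To illustrate the validation point above, here is a minimal sketch of a patient-level split, where all images of one patient land in the same fold. The helper name `assign_patient_folds`, the fold count and the toy frame are my own assumptions for illustration; it does not balance the target across folds.

```python
import numpy as np
import pandas as pd

def assign_patient_folds(df, n_folds=5, seed=0):
    """Return a fold index per row such that each patient belongs to exactly one fold."""
    rng = np.random.default_rng(seed)
    patients = df["patient_id"].drop_duplicates().tolist()
    # Shuffle the patients, then deal them out to folds in round-robin order.
    order = rng.permutation(len(patients))
    fold_of_patient = {patients[j]: i % n_folds for i, j in enumerate(order)}
    return df["patient_id"].map(fold_of_patient)

# Made-up frame standing in for train_df: 4 patients, 8 images.
toy_images = pd.DataFrame({
    "image_name": [f"img_{i}" for i in range(8)],
    "patient_id": ["p1", "p1", "p1", "p2", "p2", "p3", "p4", "p4"],
})
toy_images["fold"] = assign_patient_folds(toy_images, n_folds=2)
print(toy_images)
```

With so few malignant images, it would probably be worth also stratifying on the target; scikit-learn's `StratifiedGroupKFold` is one ready-made option for that.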

# 4. Modelling

I'll be extending this notebook to demonstrate how to get started with building a **Convolutional Neural Network with PyTorch** very soon.

> Please feel free to let me know in the **comments section** what you think about this notebook, if I've made a mistake somewhere and also how I could make my notebooks better. If you liked this notebook, please **upvote** it! I'm only getting started with Kaggle, and it would mean a lot if you guys did so. Thank you again!