# Description:

Skin cancer is the most prevalent type of cancer. Melanoma, specifically, is responsible for 75% of skin cancer deaths, despite being the least common skin cancer. The American Cancer Society estimates over 100,000 new melanoma cases will be diagnosed in 2020. It's also expected that almost 7,000 people will die from the disease. As with other cancers, early and accurate detection—potentially aided by data science—can make treatment more effective.

Currently, dermatologists evaluate every one of a patient's moles to identify outlier lesions or “ugly ducklings” that are most likely to be melanoma. Existing AI approaches have not adequately considered this clinical frame of reference. Dermatologists could enhance their diagnostic accuracy if detection algorithms take into account “contextual” images within the same patient to determine which images represent a melanoma. If successful, classifiers would be more accurate and could better support dermatological clinic work.

As the leading healthcare organization for informatics in medical imaging, the Society for Imaging Informatics in Medicine (SIIM)'s mission is to advance medical imaging informatics through education, research, and innovation in a multi-disciplinary community. SIIM is joined by the International Skin Imaging Collaboration (ISIC), an international effort to improve melanoma diagnosis. The ISIC Archive contains the largest publicly available collection of quality-controlled dermoscopic images of skin lesions.

In this competition, you’ll identify melanoma in images of skin lesions. In particular, you’ll use images within the same patient and determine which are likely to represent a melanoma. Using patient-level contextual information may help the development of image analysis tools, which could better support clinical dermatologists.

Melanoma is a deadly disease, but if caught early, most melanomas can be cured with minor surgery. Image analysis tools that automate the diagnosis of melanoma will improve dermatologists' diagnostic accuracy. Better detection of melanoma has the opportunity to positively impact millions of people.

## Evaluation
Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.

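An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters, the true positive rate (TPR = TP / (TP + FN)) against the false positive rate (FPR = FP / (FP + TN)):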

<img src="https://imgur.com/yNeAG4M.png">
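To make the metric concrete, here is a minimal sketch of how the area under the ROC curve can be computed from labels and predicted probabilities. It uses the rank-sum (Mann-Whitney U) formulation rather than integrating the curve; the function name `roc_auc` and the toy inputs are assumptions made for illustration and are not part of the competition code.

```python
def roc_auc(y_true, y_score):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) formulation."""
    pairs = sorted(zip(y_score, y_true))
    n = len(pairs)

    # Assign 1-based ranks, giving tied scores their average rank.
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1

    n_pos = sum(1 for _, label in pairs if label == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative example")

    # Sum of the ranks of the positive examples, converted to the U statistic.
    rank_sum_pos = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Hypothetical labels (1 = malignant) and predicted probabilities.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))
```

Because the metric depends only on how the predictions are ranked, any monotonic rescaling of the probabilities leaves the score unchanged.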

# Data:

## What should I expect the data format to be?
The images are provided in DICOM format. This can be accessed using commonly-available libraries like `pydicom`, and contains both image and metadata. It is a commonly used medical imaging data format.

Images are also provided in `JPEG` and `TFRecord` format (in the jpeg and tfrecords directories, respectively). Images in TFRecord format have been resized to a uniform 1024x1024.

Metadata is also provided outside of the DICOM format, in CSV files. See the `Columns` section for a description.

## What am I predicting?
You are predicting a binary target for each image. Your model should predict the probability (floating point) between 0.0 and 1.0 that the lesion in the image is malignant (the target). In the training data, train.csv, the value 0 denotes benign, and 1 indicates malignant.

## Files
1. train.csv - the training set
2. test.csv - the test set
3. sample_submission.csv - a sample submission file in the correct format

## Columns
1. image_name - unique identifier, points to filename of related DICOM image
2. patient_id - unique patient identifier
3. sex - the sex of the patient (when unknown, will be blank)
4. age_approx - approximate patient age at time of imaging
5. anatom_site_general_challenge - location of imaged site
6. diagnosis - detailed diagnosis information (train only)
7. benign_malignant - indicator of malignancy of imaged lesion
8. target - binarized version of the target variable

Download the dataset from [here](https://www.kaggle.com/c/siim-isic-melanoma-classification/data)

or use the Kaggle API and run the following command:

    kaggle competitions download -c siim-isic-melanoma-classification



## What is Melanoma?

Melanoma is a type of skin cancer that develops when melanocytes (the cells that give the skin its tan or brown color) start to grow out of control. Cancer starts when cells in the body begin to grow out of control. Cells in nearly any part of the body can become cancer, and can then spread to other areas of the body. 

Melanoma is more dangerous than most other skin cancers because it’s much more likely to spread to other parts of the body if not caught and treated early.

The stage of a cancer at diagnosis will indicate how far it has already spread and what kind of treatment will be suitable.

Melanoma is commonly described in five stages, from 0 to 4:

**Stage 0:** The cancer is only present in the outermost layer of skin. Doctors refer to this stage as “melanoma in situ.”

**Stage 1:** The cancer is up to 2 millimeters (mm) thick. It has not yet spread to lymph nodes or other sites, and it may or may not be ulcerated.

**Stage 2:** The cancer is at least 1 mm thick but may be thicker than 4 mm. It may or may not be ulcerated, and it has not yet spread to lymph nodes or other sites.

**Stage 3:** The cancer has spread to one or more lymph nodes or nearby lymphatic channels but not distant sites. The original cancer may no longer be visible. If it is visible, it may be thicker than 4 mm and also ulcerated.

**Stage 4:** The cancer has spread to distant lymph nodes or organs, such as the brain, lungs, or liver.

<img src="https://www.mayoclinic.org/-/media/kcms/gbs/patient-consumer/images/2013/11/15/17/43/ds00190_-ds00439_im04411_mcdc7_melanomathu_jpg.jpg">

In [None]:
# importing libraries
import os
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
import missingno as msno
from PIL import Image
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
sns.set(style="whitegrid")

#bokeh
from bokeh.models import ColumnDataSource, HoverTool, FactorRange
from bokeh.plotting import figure
from bokeh.io import output_notebook, show, output_file
from bokeh.palettes import Spectral6

import warnings
warnings.filterwarnings('ignore')

## Setup Directory and Files path

In [None]:
# set up directory and files path

base_dir = "../input/siim-isic-melanoma-classification/"
train_csv = os.path.join(base_dir, "train.csv")
test_csv = os.path.join(base_dir, "test.csv")
jpeg_train_images = os.path.join(base_dir, "jpeg/train")
jpeg_test_images = os.path.join(base_dir, "jpeg/test")

train_df = pd.read_csv(train_csv)
test_df = pd.read_csv(test_csv)

train_df.head(5)

We are predicting a binary target for each image. The model should predict the probability (a floating point value between 0.0 and 1.0) that the lesion in the image is malignant (the target). So, let's check how many images are benign and how many are malignant in the training dataframe.

In [None]:
train_df["benign_malignant"].value_counts()

We have far more benign examples than malignant ones, so the classes are imbalanced. We may need to oversample the malignant class or undersample the benign class.
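As a rough illustration of the oversampling option, the sketch below duplicates randomly chosen minority-class rows until every class is as large as the biggest one. The helper name `oversample_minority` and the toy dataframe are assumptions for illustration; a real pipeline would normally apply this to the training split only, so that duplicated rows do not leak into validation.

```python
import pandas as pd


def oversample_minority(df, label_col, seed=0):
    """Randomly duplicate rows of smaller classes until all classes match the largest."""
    counts = df[label_col].value_counts()
    majority_size = counts.max()
    parts = []
    for label, count in counts.items():
        group = df[df[label_col] == label]
        if count < majority_size:
            # Draw the missing rows with replacement from the same class.
            extra = group.sample(n=majority_size - count, replace=True, random_state=seed)
            group = pd.concat([group, extra])
        parts.append(group)
    # Shuffle so that the classes are interleaved.
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)


# Toy stand-in for train_df: 8 benign rows and 2 malignant rows.
toy = pd.DataFrame({
    "image_name": [f"img_{i}" for i in range(10)],
    "target": [0] * 8 + [1] * 2,
})
balanced = oversample_minority(toy, "target")
print(balanced["target"].value_counts())
```

Undersampling works the other way round, by sampling the majority class down to the minority size, at the cost of discarding data.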

**Benign:** Benign tumors are normal cells that divide and grow too much, but do not interfere with the function of normal cells around them.

**Malignant:** Malignant tumors are overgrowths of abnormal cells (cancer) that divide without control and order.
They do not stop growing, even when they come into contact with nearby cells.

In [None]:
benign = train_df[train_df['benign_malignant']=='benign']
malignant = train_df[train_df['benign_malignant']=='malignant']

In [None]:
# Extract 9 random images from benign lesions
random_images = [np.random.choice((benign['image_name'].values)+'.jpg') for i in range(9)]

print('Display benign Images')

# Adjust the size of your images
plt.figure(figsize=(10,8))

# Iterate and plot random images
for i in range(9):
    plt.subplot(3, 3, i + 1)
    img = plt.imread(os.path.join(jpeg_train_images, random_images[i]))
    plt.imshow(img, cmap='gray')
    plt.axis('off')
    
# Adjust subplot parameters to give specified padding
plt.tight_layout()   

In [None]:
# Extract 9 random images from malignant lesions
random_images = [np.random.choice((malignant['image_name'].values)+'.jpg') for i in range(9)]

print('Display malignant Images')

# Adjust the size of your images
plt.figure(figsize=(10,8))

# Iterate and plot random images
for i in range(9):
    plt.subplot(3, 3, i + 1)
    img = plt.imread(os.path.join(jpeg_train_images, random_images[i]))
    plt.imshow(img, cmap='gray')
    plt.axis('off')
    
# Adjust subplot parameters to give specified padding
plt.tight_layout()   
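The two cells above draw each filename independently, so the same image can appear more than once in a grid. A minimal sketch of sampling without replacement is shown here; the helper name `sample_image_files`, its parameters, and the toy dataframe are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def sample_image_files(df, n=9, seed=None, extension=".jpg"):
    """Pick up to n distinct image filenames from the image_name column."""
    rng = np.random.default_rng(seed)
    names = df["image_name"].to_numpy()
    n = min(n, len(names))  # never ask for more images than exist
    chosen = rng.choice(names, size=n, replace=False)
    return [name + extension for name in chosen]


# Toy stand-in for the benign / malignant dataframes.
toy = pd.DataFrame({"image_name": [f"ISIC_{i:07d}" for i in range(20)]})
files = sample_image_files(toy, n=9, seed=0)
print(files)
```

The returned list can be passed to the same plotting loop in place of `random_images`, and fixing `seed` makes a grid reproducible.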

## Data Exploration

In [None]:
# check for missing values
print(train_df.isnull().any())

msno.matrix(train_df, color=(207/255, 196/255, 171/255), fontsize=10)

There are missing values in the `sex`, `age_approx`, and `anatom_site_general_challenge` columns. Let's check how many missing values we have.

In [None]:
# Number of missing values in sex column
print("Number of missing values in sex column is {}".format(train_df.shape[0] - train_df['sex'].count()))
print("--------------------------------------------------")
# Number of missing values in age_approx column
print("Number of missing values in age_approx column is {}".format(train_df.shape[0] - train_df['age_approx'].count()))
print("--------------------------------------------------")
# Number of missing values in anatom_site_general_challenge column
print("Number of missing values in anatom_site_general_challenge column is {}".format(train_df.shape[0] - train_df['anatom_site_general_challenge'].count()))
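The same counts can be obtained for every column at once with `isnull().sum()`. The sketch below wraps that in a small helper that also reports percentages; the name `missing_summary` and the toy dataframe are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Count and percentage of missing values, for columns that have any."""
    missing = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": missing,
        "percent": (100 * missing / len(df)).round(2),
    })
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy stand-in for train_df.
toy = pd.DataFrame({
    "sex": ["male", None, "female", "male"],
    "age_approx": [45.0, np.nan, np.nan, 60.0],
    "target": [0, 0, 1, 0],
})
summary = missing_summary(toy)
print(summary)
```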

In [None]:
print(test_df.isnull().any())

msno.matrix(test_df, color=(207/255, 196/255, 171/255), fontsize=10)

There are missing values in the `anatom_site_general_challenge` column of the test dataset. Let's check how many missing values we have.

In [None]:
# Number of missing values in anatom_site_general_challenge column
print("Number of missing values in anatom_site_general_challenge column is {}".format(test_df.shape[0] - test_df['anatom_site_general_challenge'].count()))

In [None]:
# Total number of training and testing images
print("Total images in Train set:", train_df["image_name"].count())
print("Total images in Test set:", test_df["image_name"].count())

So, the images are split roughly 75/25 between the train and test sets.

In [None]:
# unique number of patients
print("Total patients ids are {}".format(train_df["patient_id"].count()))
print("Unique patients ids are {}".format(len(train_df["patient_id"].unique())))

The total number of patient IDs is much larger than the number of unique patients, which means we have multiple records for the same patient.
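Because one patient contributes several images, a random row-level split can place images of the same patient in both the training and validation parts. The sketch below splits on unique patient IDs instead; the helper name `split_by_patient`, the 20% validation fraction, and the toy dataframe are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def split_by_patient(df, valid_fraction=0.2, seed=0):
    """Split rows so that every patient lands entirely in train or in validation."""
    rng = np.random.default_rng(seed)
    patients = np.array(sorted(df["patient_id"].unique()), dtype=object)
    patients = rng.permutation(patients)
    n_valid = max(1, int(round(valid_fraction * len(patients))))
    valid_patients = list(patients[:n_valid])
    is_valid = df["patient_id"].isin(valid_patients)
    return df[~is_valid], df[is_valid]


# Toy stand-in for train_df: 5 patients with 3 images each.
toy = pd.DataFrame({
    "image_name": [f"img_{i}" for i in range(15)],
    "patient_id": [f"IP_{i // 3}" for i in range(15)],
})
print(toy.groupby("patient_id").size())  # images per patient

train_part, valid_part = split_by_patient(toy)
print(len(train_part), len(valid_part))
```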

In [None]:
# exploring the target column
train_df["target"].value_counts()

In [None]:
# This function plots a histogram with Bokeh. It takes a dataframe, the column to plot,
# a color palette, the number of bins and a title, and displays the histogram

# For more information on how histograms work follow this blog
# https://towardsdatascience.com/interactive-histograms-with-bokeh-202b522265f3

def hist_hover(dataframe, column, colors=["#94c8d8", "#ea5e51"], bins=30, title=''):
    hist, edges = np.histogram(dataframe[column], bins = bins)
    
    hist_df = pd.DataFrame({column: hist,
                            "left": edges[:-1],
                            "right": edges[1:]})
    hist_df["interval"] = ["%d to %d" % (left, right) for left,
                           right in zip(hist_df["left"], hist_df["right"])]
    
    src = ColumnDataSource(hist_df)
    plot = figure(plot_height = 400, plot_width = 600,
                  title = title,
                  x_axis_label = column,
                  y_axis_label = "Count")    
    plot.quad(bottom = 0, top = column,left = "left",
              right = "right", source = src, fill_color = colors[0],
              line_color = "#35838d", fill_alpha = 0.7,
              hover_fill_alpha = 0.7, hover_fill_color = colors[1])
    
    hover = HoverTool(tooltips = [('Interval', '@interval'), ('Count', str("@" + column))])
    plot.add_tools(hover)
    output_notebook()
    show(plot)

In [None]:
# histogram of Target column in training set
hist_hover(train_df, 'target', bins=3, title='Distribution of the Target column in the training set')

In [None]:
# Sex-wise distribution of target in training set

Sex = ["Female", "Male"]
Target = ['0', '1']

# counts indexed by (target, sex); sex values in the data are lowercase
g = train_df.groupby(["target", "sex"]).size()

x = [(sex, target) for sex in Sex for target in Target]
counts = [int(g.loc[(int(target), sex.lower())]) for sex, target in x]

# one color per bar, so every column in the source has the same length
source = ColumnDataSource(data=dict(x=x, counts=counts, color=list(Spectral6)[:len(x)]))

p = figure(x_range=FactorRange(*x), plot_height=400, plot_width=800, title="Distribution of target with respect to sex",
           tools="hover, pan, box_zoom, wheel_zoom, reset, save", tooltips= ("@x: @counts"))

p.vbar(x='x', top='counts', width=0.9, color='color', source=source)

p.xgrid.grid_line_color = None

show(p)

So, we have more images from male patients than from female patients in both target categories.

In [None]:
# location of image anatom site
train_df["anatom_site_general_challenge"].value_counts(sort=True)

In [None]:
# Distribution of anatom site general challenge column in training set

site_counts = train_df["anatom_site_general_challenge"].value_counts(sort=True)
# take the labels from value_counts so they always line up with the counts
Categories = list(site_counts.index)
counts = list(site_counts.values)

source = ColumnDataSource(data=dict(Categories=Categories, counts=counts, color=Spectral6))

p = figure(x_range=Categories, y_range=(0,22000), plot_height=300, title="Distribution of the anatom_site_general_challenge in the training set",
           tools="hover, pan, box_zoom, wheel_zoom, reset, save", tooltips= ("@Categories: @counts"))

p.vbar(x='Categories', top='counts', width=0.9, color='color', legend_field="Categories", source=source)

p.xgrid.grid_line_color = None
p.legend.orientation = "horizontal"
p.legend.location = "top_center"

show(p)

Most examples come from the torso and the fewest from oral/genital sites, but let's look at the distribution with respect to sex.

In [None]:
# Gender wise distribution of anatom site column in training set
print(train_df.groupby(["sex", "anatom_site_general_challenge"]).size())

In [None]:
# Gender wise distribution of anatom site column in training set
Categories = ["head/neck", "lower extremity", "oral/genital", "palms/soles", "torso", "upper extremity"]
Sex = ["Male", "Female"]

# groupby sorts the sites alphabetically, which matches the order of Categories
g = train_df.groupby(["sex", "anatom_site_general_challenge"]).size()
male = list(g["male"].values)
female = list(g["female"].values)

data = {'Categories':Categories,
        'Male':male,
        'Female':female}

x = [(categories, sex) for categories in Categories for sex in Sex]
counts = sum(zip(data['Male'], data['Female']), ())

# repeat two colors (male, female) so the color column has one entry per bar
source = ColumnDataSource(data=dict(x=x, counts=counts, color=list(Spectral6)[:2] * len(Categories)))

p = figure(x_range=FactorRange(*x), plot_height=400, plot_width=800, title="Location of image site with respect to sex",
           tools="hover, pan, box_zoom, wheel_zoom, reset, save", tooltips= ("@x: @counts"))

p.vbar(x='x', top='counts', width=0.9, color='color', source=source)

p.xgrid.grid_line_color = None

show(p)

The torso is still the most common site and oral/genital the least common for both sexes, and the overall distributions are similar. However, males have more torso cases while females have more lower extremity cases. Let's visualize images from the torso and lower extremity.

In [None]:
# Extract 9 random images from malignant lesions with growth in torso

torsomale = train_df[(train_df['benign_malignant']=='malignant') & (train_df['anatom_site_general_challenge'] == 'torso') & (train_df['sex'] == 'male')]

random_images = [np.random.choice((torsomale['image_name'].values)+'.jpg') for i in range(9)]

print('Display malignant torso Images with Male')

# Adjust the size of your images
plt.figure(figsize=(10,8))

# Iterate and plot random images
for i in range(9):
    plt.subplot(3, 3, i + 1)
    img = plt.imread(os.path.join(jpeg_train_images, random_images[i]))
    plt.imshow(img, cmap='gray')
    plt.axis('off')
    
# Adjust subplot parameters to give specified padding
plt.tight_layout()   

In [None]:
# Extract 9 random images from malignant lesions with growth in lower extremity

lowerextremity = train_df[(train_df['benign_malignant']=='malignant') & (train_df['anatom_site_general_challenge'] == 'lower extremity') & (train_df['sex'] == 'female')]

random_images = [np.random.choice((lowerextremity['image_name'].values)+'.jpg') for i in range(9)]

print('Display malignant lower extremity Images with Female')

# Adjust the size of your images
plt.figure(figsize=(10,8))

# Iterate and plot random images
for i in range(9):
    plt.subplot(3, 3, i + 1)
    img = plt.imread(os.path.join(jpeg_train_images, random_images[i]))
    plt.imshow(img, cmap='gray')
    plt.axis('off')
    
# Adjust subplot parameters to give specified padding
plt.tight_layout()   

In [None]:
# Distribution of age_approx column in training set
# we have missing values in age_approx so we need to fix that before plotting histograms

# work on a copy so that train_df itself is left unchanged
training_df = train_df.copy()
training_df['age_approx'] = training_df['age_approx'].fillna(45.0) # 45 is mode of age_approx
print(training_df['age_approx'].isnull().any())
hist_hover(training_df, 'age_approx', title='Age Distribution of patients')

The age column is roughly normally distributed in the training data: there are fewer patients at the youngest and oldest ages and more in the middle. Let's visualize images from all three age groups.

In [None]:
hist_hover(test_df, 'age_approx', title='Age Distribution of patients')

The test set does not follow a normal distribution as closely, but again most patients are middle-aged.

In [None]:
# Extract 9 random images from malignant lesions with age 21 or younger

youngerage = train_df[(train_df['benign_malignant']=='malignant') & (train_df['age_approx'] <= 21)]

random_images = [np.random.choice((youngerage['image_name'].values)+'.jpg') for i in range(9)]

print('Display malignant younger age Images')

# Adjust the size of your images
plt.figure(figsize=(10,8))

# Iterate and plot random images
for i in range(9):
    plt.subplot(3, 3, i + 1)
    img = plt.imread(os.path.join(jpeg_train_images, random_images[i]))
    plt.imshow(img, cmap='gray')
    plt.axis('off')
    
# Adjust subplot parameters to give specified padding
plt.tight_layout()   

In [None]:
# Extract 9 random images from malignant lesions with age between 45 and 48

middleage = train_df[(train_df['benign_malignant']=='malignant') & (train_df['age_approx'] >= 45) & (train_df['age_approx'] <= 48)]

random_images = [np.random.choice((middleage['image_name'].values)+'.jpg') for i in range(9)]

print('Display malignant middle age Images')

# Adjust the size of your images
plt.figure(figsize=(10,8))

# Iterate and plot random images
for i in range(9):
    plt.subplot(3, 3, i + 1)
    img = plt.imread(os.path.join(jpeg_train_images, random_images[i]))
    plt.imshow(img, cmap='gray')
    plt.axis('off')
    
# Adjust subplot parameters to give specified padding
plt.tight_layout()   

In [None]:
# Extract 9 random images from malignant lesions with age 78 or older

oldage = train_df[(train_df['benign_malignant']=='malignant') & (train_df['age_approx'] >= 78)]

random_images = [np.random.choice((oldage['image_name'].values)+'.jpg') for i in range(9)]

print('Display malignant old age Images')

# Adjust the size of your images
plt.figure(figsize=(10,8))

# Iterate and plot random images
for i in range(9):
    plt.subplot(3, 3, i + 1)
    img = plt.imread(os.path.join(jpeg_train_images, random_images[i]))
    plt.imshow(img, cmap='gray')
    plt.axis('off')
    
# Adjust subplot parameters to give specified padding
plt.tight_layout()   

In [None]:
# distribution of diagnosis column in training set
train_df['diagnosis'].value_counts()

In [None]:
# Distribution of diagnosis column in training set

diagnosis_counts = train_df["diagnosis"].value_counts()
# take the labels from value_counts so they always line up with the counts
Categories = list(diagnosis_counts.index)
counts = list(diagnosis_counts.values)

# cycle the six-color palette so there is one color per category
source = ColumnDataSource(data=dict(Categories=Categories, counts=counts, color=(list(Spectral6) * 2)[:len(Categories)]))

# the y-axis is capped at 300, so bars for the most frequent diagnoses extend beyond the visible range
p = figure(x_range=Categories, y_range=(0,300), plot_width=800, plot_height=300, title="Distribution of the diagnosis in the training set",
           tools="hover, pan, box_zoom, wheel_zoom, reset, save", tooltips= ("@Categories: @counts"))

p.vbar(x='Categories', top='counts', width=0.9, color='color', legend_field="Categories", source=source)

p.xgrid.grid_line_color = None
p.legend.orientation = "horizontal"
p.legend.location = "top_center"
show(p)

The most common diagnosis is unknown, with 27124 images, while only 584 images are diagnosed as melanoma, which is what we are predicting. Let's visualize images of both.

In [None]:
# Extract 9 random images with unknown diagnosis

unknown = train_df[train_df['diagnosis'] == 'unknown']

random_images = [np.random.choice((unknown['image_name'].values)+'.jpg') for i in range(9)]

print('Display unknown diagnosis Images')

# Adjust the size of your images
plt.figure(figsize=(10,8))

# Iterate and plot random images
for i in range(9):
    plt.subplot(3, 3, i + 1)
    img = plt.imread(os.path.join(jpeg_train_images, random_images[i]))
    plt.imshow(img, cmap='gray')
    plt.axis('off')
    
# Adjust subplot parameters to give specified padding
plt.tight_layout()   

In [None]:
# Extract 9 random images with melanoma diagnosis

melanoma = train_df[train_df['diagnosis'] == 'melanoma']

random_images = [np.random.choice((melanoma['image_name'].values)+'.jpg') for i in range(9)]

print('Display melanoma diagnosis Images')

# Adjust the size of your images
plt.figure(figsize=(10,8))

# Iterate and plot random images
for i in range(9):
    plt.subplot(3, 3, i + 1)
    img = plt.imread(os.path.join(jpeg_train_images, random_images[i]))
    plt.imshow(img, cmap='gray')
    plt.axis('off')
    
# Adjust subplot parameters to give specified padding
plt.tight_layout()

## Label Encoding

The sex, anatom_site_general_challenge, and diagnosis columns hold categorical values, so we need to encode them. But first we need to handle the missing values.

In [None]:
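# assign the result back instead of using inplace=True on a column selection,
# which may not modify train_df in recent pandas versions
train_df['sex'] = train_df['sex'].fillna("male")
train_df['age_approx'] = train_df['age_approx'].fillna(50)
train_df['anatom_site_general_challenge'] = train_df['anatom_site_general_challenge'].fillna('torso')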

In [None]:
categorical = ['sex', 'anatom_site_general_challenge', 'diagnosis']

label_encoder = LabelEncoder()

for column in categorical:
    train_df[column] = label_encoder.fit_transform(train_df[column])
    
# we do not need benign_malignant column as information is already present in target
train_df.drop(['benign_malignant'], axis = 1, inplace = True)

In [None]:
test_df['anatom_site_general_challenge'] = test_df['anatom_site_general_challenge'].fillna('torso')

In [None]:
categorical = ['sex', 'anatom_site_general_challenge']

label_encoder = LabelEncoder()

for column in categorical:
    test_df[column] = label_encoder.fit_transform(test_df[column])
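Fitting a separate encoder on the train and test frames only produces matching codes when both frames contain exactly the same set of values in a column. A minimal sketch of one way to guarantee consistency is to take the category list from the training column and reuse it for the test column. The helper name `encode_with_train_categories` and the toy series are assumptions for illustration; this uses `pd.Categorical` rather than the `LabelEncoder` used above.

```python
import pandas as pd


def encode_with_train_categories(train_col, test_col):
    """Encode both columns with the category order learned from the training column."""
    categories = sorted(train_col.dropna().unique())
    train_codes = pd.Categorical(train_col, categories=categories).codes
    # Values not seen in training are encoded as -1.
    test_codes = pd.Categorical(test_col, categories=categories).codes
    return categories, train_codes, test_codes


# Toy stand-ins: the test column lacks "head/neck", so an encoder fitted on it
# alone would assign different integers than one fitted on the training column.
toy_train = pd.Series(["torso", "head/neck", "torso", "palms/soles"])
toy_test = pd.Series(["palms/soles", "torso"])

categories, train_codes, test_codes = encode_with_train_categories(toy_train, toy_test)
print(categories)
print(list(train_codes), list(test_codes))
```

With scikit-learn the equivalent pattern is to call `fit` on the training column once and then `transform` on both columns with the same encoder instance.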

From the visualizations we can see that the images have different shapes. So, let's look at the image shapes.

In [None]:
images_shape = []

for image_name in train_df['image_name']:
    with Image.open(jpeg_train_images + "/" + image_name + '.jpg') as image:
        images_shape.append(image.size)  # PIL returns (width, height)

images_shape_df = pd.DataFrame(data = images_shape, columns = ['W', 'H'], dtype='object')
images_shape_df['Size'] = '[' + images_shape_df['W'].astype(str) + ',' + images_shape_df['H'].astype(str) + ']'

In [None]:
images_shape_df.head()

In [None]:
print("We have {} types of different shapes in training images".format(len(list(images_shape_df['Size'].unique()))))

In [None]:
# Distribution of shapes in training set

# We have 88 types of unique shapes but many of them contain only few samples. so we will plot only 10 with 
# highest number of samples

Categories = list(images_shape_df['Size'].value_counts().keys())[0:10]
counts = list(images_shape_df['Size'].value_counts().values)[0:10]

# cycle the six-color palette so there is one color per category
source = ColumnDataSource(data=dict(Categories=Categories, counts=counts, color=(list(Spectral6) * 2)[:len(Categories)]))

p = figure(x_range=Categories, y_range=(0,22000), plot_width = 1000, plot_height=300, title="Images shape in training set",
           tools="hover, pan, box_zoom, wheel_zoom, reset, save", tooltips= ("@Categories: @counts"))

p.vbar(x='Categories', top='counts', width=0.9, color='color', legend_field="Categories", source=source)

p.xgrid.grid_line_color = None
p.legend.orientation = "horizontal"
p.legend.location = "top_center"

show(p)

The most common shape is 6000x4000, shared by 14703 images.
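Since the images come in many shapes, they usually have to be brought to one common size before being fed to a model. The sketch below shows one way to do that with PIL; the helper name `resize_image` and the 256x256 target size are assumptions for illustration, and resizing straight to a square distorts the aspect ratio of non-square images.

```python
from PIL import Image


def resize_image(image, size=(256, 256)):
    """Return an RGB copy of the image resized to `size` (width, height)."""
    return image.convert("RGB").resize(size)


# Toy stand-in for a training image; a real one would come from Image.open(path).
original = Image.new("RGB", (600, 400))
resized = resize_image(original)
print(original.size, "->", resized.size)
```

If the aspect ratio matters, cropping or padding to a square before resizing is a common alternative.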
