## Context    
> * Skin cancer is the most prevalent type of cancer.    
> * Melanoma, specifically, is responsible for 75% of skin cancer deaths, despite being the least common skin cancer.    
> * The American Cancer Society estimates over 100,000 new melanoma cases will be diagnosed in 2020.    
> * It's also expected that almost 7,000 people will die from the disease.    
> * As with other cancers, early and accurate detection—potentially aided by data science—can make treatment more effective. 
   
## Task    
> * In this competition, you’ll identify melanoma in images of skin lesions.    
> * In particular, you’ll use images from the same patient and determine which are likely to represent a melanoma.    
> * Using patient-level contextual information may help the development of image analysis tools, which could better support clinical dermatologists.    
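
One way to make the patient-level context concrete is to attach per-patient summaries to each image row. The sketch below is a minimal illustration on a toy frame. The helper name `add_patient_context` and the derived column names are hypothetical; only `patient_id`, `image_name` and `age_approx` come from the competition's CSV columns.

```python
import pandas as pd

def add_patient_context(df):
    """Return a copy of df with simple per-patient summary columns (hypothetical names)."""
    out = df.copy()
    grouped = df.groupby('patient_id')
    # Number of images available for the same patient
    out['patient_n_images'] = grouped['image_name'].transform('count')
    # Mean approximate age across that patient's images
    out['patient_mean_age'] = grouped['age_approx'].transform('mean')
    return out

# Toy data standing in for train.csv
toy = pd.DataFrame({
    'image_name': ['img_a', 'img_b', 'img_c'],
    'patient_id': ['IP_1', 'IP_1', 'IP_2'],
    'age_approx': [40.0, 50.0, 60.0],
})
toy_ctx = add_patient_context(toy)
print(toy_ctx)
```

Whether such features actually help a model is something to validate rather than assume.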
   
## What should I expect the data format to be?    
> * The images are provided in DICOM format, a commonly used medical imaging data format.    
> * DICOM files contain both image data and metadata, and can be read with commonly available libraries such as pydicom.    
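
As a rough sketch of reading one DICOM file with pydicom, the helper below returns the pixel data together with a few header fields. The function name `read_dicom` is hypothetical, the file path is left to the caller, and whether each of the listed header keywords is populated in this dataset is an assumption.

```python
def read_dicom(path):
    """Read one DICOM file; return (pixel array, dict of a few header fields)."""
    import pydicom  # third-party; imported lazily so the sketch loads without it

    ds = pydicom.dcmread(path)
    # Standard DICOM keywords; missing ones come back as None
    keys = ('PatientID', 'PatientSex', 'PatientAge', 'BodyPartExamined')
    meta = {key: ds.get(key) for key in keys}
    # pixel_array decodes the image into a NumPy array
    return ds.pixel_array, meta
```

Decoding compressed pixel data may need additional image-handler packages, depending on how the files were encoded.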
   
   
> * Images are also provided in JPEG and TFRecord format (in the jpeg and tfrecords directories, respectively).    
> * Images in TFRecord format have been resized to a uniform 1024x1024.    
   
   
> * Metadata is also provided outside of the DICOM format, in CSV files.    
> * See the Columns section for a description. 
   
## What am I predicting?    
> * You are predicting a binary target for each image.    
> * Your model should predict the probability (floating point) between 0.0 and 1.0 that the lesion in the image is malignant (the target).    
> * In the training data, train.csv, the value 0 denotes benign, and 1 indicates malignant.    
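
To illustrate the expected output, here is a minimal sketch that turns per-image scores into a submission table. It assumes the submission columns are `image_name` and `target`, matching the column names described for the CSV files; the helper name `make_submission` and the clipping step are illustrative choices, not part of the competition rules.

```python
import numpy as np
import pandas as pd

def make_submission(image_names, probabilities):
    """Build a submission frame with one malignancy probability per image."""
    # Guard against scores that fall outside the valid probability range
    probs = np.clip(np.asarray(probabilities, dtype=float), 0.0, 1.0)
    return pd.DataFrame({'image_name': list(image_names), 'target': probs})

# Toy scores standing in for model predictions
submission = make_submission(['img_a', 'img_b', 'img_c'], [0.02, 0.97, 1.3])
print(submission)
# submission.to_csv('submission.csv', index=False)
```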

## Libraries

In [None]:
import os
import glob

import numpy as np 
import pandas as pd 

import matplotlib.pyplot as plt
import seaborn as sns

import matplotlib.image as mpimg

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.image import ImageDataGenerator, load_img
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dense, Flatten, Dropout

## Data

In [None]:
! ls ../input/siim-isic-melanoma-classification

### Files    
> * train.csv - the training set    
> * test.csv - the test set    
> * sample_submission.csv - a sample submission file in the correct format 
   
### Columns    
> * **image_name** - unique identifier, points to the filename of the related DICOM image    
> * **patient_id** - unique patient identifier    
> * **sex** - the sex of the patient (when unknown, will be blank)    
> * **age_approx** - approximate patient age at time of imaging    
> * **anatom_site_general_challenge** - location of imaged site    
> * **diagnosis** - detailed diagnosis information (train only)    
> * **benign_malignant** - indicator of malignancy of imaged lesion    
> * **target** - binarized version of the target variable    
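
Because unknown values are left blank in the CSV files, `pd.read_csv` loads them as `NaN` by default. A small sketch for summarising how much is missing per column follows; the helper name `missing_summary` is hypothetical and the frame is toy data, not the real `train.csv`.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Count and share of missing values per column, largest first."""
    n_missing = df.isna().sum()
    summary = pd.DataFrame({'n_missing': n_missing, 'share': n_missing / len(df)})
    return summary.sort_values('n_missing', ascending=False)

# Toy data with blanks already parsed as NaN
toy = pd.DataFrame({
    'sex': ['male', np.nan, 'female', np.nan],
    'age_approx': [45.0, 50.0, np.nan, 60.0],
    'target': [0, 0, 1, 0],
})
summary = missing_summary(toy)
print(summary)
```

On the real data the same call would be `missing_summary(train)`.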

### Read dataset

In [None]:
train = pd.read_csv('../input/siim-isic-melanoma-classification/train.csv')
train.head()

In [None]:
test = pd.read_csv('../input/siim-isic-melanoma-classification/test.csv')
test.head()

In [None]:
sample_submission = pd.read_csv('../input/siim-isic-melanoma-classification/sample_submission.csv')
sample_submission.head()

### Explore datasets

In [None]:
print('No. of images in the train dataset :', train.shape[0])
print('No. of unique patients in train dataset :', train['patient_id'].nunique())

In [None]:
print('No. of images in the test dataset :', test.shape[0])
print('No. of unique patients in test dataset :', test['patient_id'].nunique())
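
Since both files carry a `patient_id`, it can be worth checking whether any patient appears in both sets. The sketch below shows the idea on toy frames; `patient_overlap` is a hypothetical helper, and no claim is made here about what the real data contains.

```python
import pandas as pd

def patient_overlap(train_df, test_df):
    """Return the set of patient_id values present in both frames."""
    train_ids = set(train_df['patient_id'].dropna())
    test_ids = set(test_df['patient_id'].dropna())
    return train_ids & test_ids

# Toy frames standing in for train.csv and test.csv
toy_train = pd.DataFrame({'patient_id': ['IP_1', 'IP_1', 'IP_2']})
toy_test = pd.DataFrame({'patient_id': ['IP_2', 'IP_3']})
shared = patient_overlap(toy_train, toy_test)
print(len(shared), 'patient(s) appear in both sets')
```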

### Value counts

In [None]:
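def count_hbar(col, title, pal='Dark2'):
    # Name both columns explicitly: the defaults produced by
    # value_counts().reset_index() differ between pandas versions
    df = train[col].value_counts().rename_axis('category').reset_index(name='count')

    sns.set_style('whitegrid')
    sns.barplot(data=df, x='count', y='category', palette=pal)

    # Annotate each bar with its count
    for ind, row in df.iterrows():
        plt.text(row['count'] + 500, ind, row['count'])

    sns.despine(bottom=True)
    plt.title(title)
    plt.xlabel('')
    plt.ylabel('')
    plt.show()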

In [None]:
count_hbar('sex', 'Sex', ['royalblue', 'deeppink'])

In [None]:
count_hbar('anatom_site_general_challenge', 'Anatomical site of the lesion')

In [None]:
count_hbar('diagnosis', 'Diagnosis of the lesion')

In [None]:
count_hbar('benign_malignant', 'Benign or Malignant', ['dimgray', 'orangered'])
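
Alongside the bar chart, the class balance can be expressed as counts and fractions of the binary `target`. This is a minimal sketch on toy labels; `class_balance` is a hypothetical helper and the numbers below are not the real dataset's.

```python
import pandas as pd

def class_balance(df, col='target'):
    """Return counts and fractions of each class in a label column."""
    counts = df[col].value_counts().sort_index()
    return pd.DataFrame({'count': counts, 'fraction': counts / counts.sum()})

# Toy labels: 0 = benign, 1 = malignant
toy = pd.DataFrame({'target': [0, 0, 0, 1]})
balance = class_balance(toy)
print(balance)
```

On the real data the call would be `class_balance(train)`.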

### EDA

In [None]:
# One row per patient: first recorded sex and mean approximate age
temp = train.groupby('patient_id').agg({'sex': 'first', 'age_approx': 'mean'}).reset_index()

plt.figure(figsize=(12, 5))
# fill=True replaces the deprecated shade=True (seaborn 0.11 and later)
sns.kdeplot(temp[temp['sex']=='male']['age_approx'], label='Male', fill=True, color='royalblue')
sns.kdeplot(temp[temp['sex']=='female']['age_approx'], label='Female', fill=True, color='deeppink')
plt.title('Age distribution of male and female patients', 
          loc='left', fontsize=16)
plt.legend()
plt.show()

In [None]:
df = pd.DataFrame(train.groupby(['anatom_site_general_challenge'])['target'].mean()) \
        .sort_values('target', ascending=False) \
        .reset_index() 
df['target'] = round(df['target'], 4)

plt.figure(figsize=(12, 5))
sns.set_style('darkgrid')

sns.barplot(data=df, x='target', y='anatom_site_general_challenge', palette='Set2')

for ind, row in df.iterrows():
    plt.text(row['target']+0.0001, ind+0.1, row['target'])

sns.despine(bottom=True)
plt.title('Malignancy rate by anatomical site of the lesion', 
          loc='left', fontsize=16)
plt.xlabel('')
plt.ylabel('')
plt.show()
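
A mean of `target` per group is easier to interpret next to the group sizes, since a rate computed from few images is noisy. The sketch below computes the rate together with the counts on toy data; `rate_with_counts` and its output column names are hypothetical.

```python
import pandas as pd

def rate_with_counts(df, group_col, target_col='target'):
    """Per-group image count, malignant count and malignancy rate, highest rate first."""
    grouped = df.groupby(group_col)[target_col]
    out = grouped.agg(n_images='count', n_malignant='sum', rate='mean')
    return out.sort_values('rate', ascending=False).reset_index()

# Toy data standing in for train.csv
toy = pd.DataFrame({
    'anatom_site_general_challenge': ['torso', 'torso', 'head/neck',
                                      'head/neck', 'head/neck', 'head/neck'],
    'target': [0, 1, 0, 0, 0, 1],
})
rates = rate_with_counts(toy, 'anatom_site_general_challenge')
print(rates)
```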

## Images

In [None]:
def plot_images(diagnosis, title, n):
    temp = train[train['diagnosis']==diagnosis]
    # Sample at most n images so rare diagnoses do not raise an error
    names = temp['image_name'].sample(min(n, len(temp)))
    img_ids = ['../input/siim-isic-melanoma-classification/jpeg/train/'+i+'.jpg' for i in names]

    # One row of n axes; squeeze=False keeps `axes` 2-D even when n == 1
    fig, axes = plt.subplots(1, n, figsize=(24, 5), squeeze=False)
    fig.suptitle(title, fontsize=24)
    for ax in axes[0]:
        ax.axis('off')
    for ax, img in zip(axes[0], img_ids):
        ax.imshow(plt.imread(img)) # read and show image
    plt.show()

In [None]:
def plot_image(diagnosis, title):
    temp = train[train['diagnosis']==diagnosis]
    img_ids = ['../input/siim-isic-melanoma-classification/jpeg/train/'+i+'.jpg' for i in temp['image_name']]
    
    plt.figure(figsize = (4, 4))
    image = plt.imread(img_ids[0])
    plt.axis('off')
    plt.title(title, fontsize=16)
    plt.imshow(image)
    plt.show()

In [None]:
plot_images('melanoma', 'Melanoma', 5)

In [None]:
plot_images('seborrheic keratosis', 'Seborrheic Keratosis', 5)

In [None]:
plot_images('lichenoid keratosis', 'Lichenoid Keratosis', 5)

In [None]:
plot_images('lentigo NOS', 'Lentigo NOS', 5)

In [None]:
plot_images('solar lentigo', 'Solar Lentigo', 5)

In [None]:
plot_image('cafe-au-lait macule', 'Cafe-au-lait Macule')

In [None]:
plot_image('atypical melanocytic proliferation', 'Atypical Melanocytic Proliferation')