# SIIM Playground EDA

In this notebook, I use EDA to explore the tabular data that accompanies the images, and I compare the properties of the training and test sets to see how similar they are.

## 1. Introduction: Overview of the data

### 1.1 Imports

In [None]:
import numpy as np
import pandas as pd
import cv2
import os
import shutil
import matplotlib.pyplot as plt

input_dir = "/kaggle/input/siim-isic-melanoma-classification"
train_dir = os.path.join(input_dir, "jpeg/train")
test_dir = os.path.join(input_dir, "jpeg/test")


train_csv = pd.read_csv(os.path.join(input_dir,"train.csv"))
test_csv = pd.read_csv(os.path.join(input_dir,"test.csv"))

In [None]:
train_csv.head()

| Column                        | Type        | Description                                                                                    |
|-------------------------------|-------------|------------------------------------------------------------------------------------------------|
| image_name                    | String      | Name of the image file corresponding to the row. Expected to be unique.                        |
| patient_id                    | String      | Identifier of the patient the skin picture was taken from. May repeat across rows.             |
| sex                           | Categorical | Identifies the sex of the patient.                                                             |
| age_approx                    | Integer     | Age rounded to the nearest multiple of 5.                                                      |
| anatom_site_general_challenge | Categorical | Defines the location of the picture on the patient's body.                                     |
| diagnosis                     | Categorical | Provides details about the diagnosis. Only present in the train dataset.                       |
| benign_malignant              | Categorical | Target classes to be predicted (benign=0, malignant=1)                                         |

### 1.2 Features analysis

In this part, I will work on the previously described features and try to gain insights on data properties.

#### 1.2.a Patient Id
---
The '*patient_id*' column is a patient identifier. With it, we can tell whether the same patient appears several times and see how often identifiers repeat within the dataset.

We first split it to check that the format always stays the same (i.e. '*IP_XXXX*', where '*XXXX*' stands for 7 digits).

In [None]:
train_csv["preffix_id"] = train_csv["patient_id"].apply(lambda x: x.split("_")[0])
train_csv["suffix_id"] = train_csv["patient_id"].apply(lambda x: x.split("_")[-1])

test_csv["preffix_id"] = test_csv["patient_id"].apply(lambda x: x.split("_")[0])
test_csv["suffix_id"] = test_csv["patient_id"].apply(lambda x: x.split("_")[-1])

In [None]:
train_csv["preffix_id"].value_counts()

In [None]:
test_csv["preffix_id"].value_counts()

We see here that the *id* field only ever uses a single prefix, *IP*.

In [None]:
print("\tTRAIN SUFFIX ID VALUE COUNTS\n---\n")
print(train_csv["suffix_id"].value_counts())
print("-----------------------------------")
print("\tTEST SUFFIX ID VALUE COUNTS\n---\n")
print(test_csv["suffix_id"].value_counts())

Ids are repeated: the same patients seem to be present multiple times. Are the ids always the same length (7 characters)?

In [None]:
train_csv["suffix_id"].apply(lambda x: len(x)).value_counts()
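The prefix and length checks above can also be combined into a single pattern match. The sketch below uses a small hypothetical sample of identifiers (not taken from the competition files) and assumes the expected format is `IP_` followed by exactly 7 digits; in the notebook, the same call could be applied to `train_csv["patient_id"]`.

```python
import pandas as pd

# Hypothetical identifiers; the last two deliberately break the expected format.
sample_ids = pd.Series(["IP_7279968", "IP_3075186", "IP_12345", "XX_0000001"])

# True where the whole string is "IP_" followed by exactly 7 digits.
is_well_formed = sample_ids.str.fullmatch(r"IP_\d{7}")

# Keep the identifiers that do not follow the format.
malformed = sample_ids[~is_well_formed].tolist()
print(malformed)
```

An empty list would mean every identifier follows the same format.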

Are some patients in both the test **and** the train set?

In [None]:
list(set(train_csv["suffix_id"].unique()) & set(test_csv["suffix_id"].unique()))

It seems like none are shared between the two sets.

#### 1.2.b Patient age 
---
Type: Integer

Description: Age rounded to the nearest multiple of 5.

In [None]:
plt.figure(figsize=(25,10))

# Overlay both histograms on the same axes; alpha keeps the overlap visible.
ax = train_csv["age_approx"].plot(kind='hist', alpha=0.5, label="Train set")
test_csv["age_approx"].plot(kind='hist', alpha=0.5, label="Test set", ax=ax)

# Set the axis labels after plotting, since the pandas histogram sets its own y label.
ax.set_ylabel("Age count (in # of entries)")
ax.set_xlabel("Age")
ax.legend()

plt.show()

In [None]:
plt.close()

In [None]:
print("\tTRAIN AGE DESCRIPTION\n---\n")
print(train_csv["age_approx"].describe())
print('-----------------------------')
print("\tTEST AGE DESCRIPTION\n---\n")
print(test_csv["age_approx"].describe())

The two distributions are very similar. The descriptions show that:

- the mean is slightly higher in the test set, although the difference is barely noticeable;
- neither distribution is noticeably **skewed**, as the mean and the median are almost equal in both sets;
- the training set contains very young patients, while the test set only contains patients older than 10.
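Comparing the mean and the median is a quick heuristic for skewness; the sample skewness gives a direct number. The sketch below uses two hypothetical age series (not the competition data) to show how both checks behave on a symmetric and on a right-skewed sample.

```python
import pandas as pd

# Hypothetical ages (multiples of 5), symmetric around 50.
symmetric_ages = pd.Series([30, 35, 40, 45, 45, 50, 55, 55, 60, 65, 70], dtype=float)
# Hypothetical ages with a long right tail.
skewed_ages = pd.Series([30] * 8 + [80, 85], dtype=float)

for name, ages in [("symmetric", symmetric_ages), ("skewed", skewed_ages)]:
    gap = ages.mean() - ages.median()  # close to 0 for a symmetric sample
    skewness = ages.skew()             # sample skewness, 0 for a symmetric sample
    print(f"{name}: mean - median = {gap:.2f}, skewness = {skewness:.2f}")
```

In the notebook, `train_csv["age_approx"].skew()` and `test_csv["age_approx"].skew()` would give the corresponding values; `skew()` ignores NaN entries by default.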

In [None]:
print("Train age unique values: ",np.sort(train_csv["age_approx"].unique()))
print("Test age unique values: ",np.sort(test_csv["age_approx"].unique()))

The training set seems to contain NaN ages. How many are there?

In [None]:
print(f"{train_csv['age_approx'].isnull().sum()} rows have a 'nan' age out of {len(train_csv)} total rows")

Which patients have a NaN age?

In [None]:
train_csv[train_csv['age_approx'].isnull()]

The sex column also seems to be empty (NaN here) on these rows.

In [None]:
train_csv[train_csv["age_approx"].isnull()]["suffix_id"].value_counts()

Three patients have a null age. Let's now check their sex column.

In [None]:
train_csv[train_csv["age_approx"].isnull() & train_csv["sex"].isnull()]["suffix_id"].value_counts()

These two patients have neither a sex nor an age recorded. Do all entries of the patients listed above lack an age?

In [None]:
nan_age = ['5205991', '9835712', '0550106']
train_csv[train_csv["suffix_id"].isin(nan_age)]

The dataframes are the **same size**, so a patient with a missing age never has an age recorded on any other row.
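Comparing dataframe sizes works, but the same question can be answered per patient in one step. The sketch below is a minimal illustration on hypothetical rows (the patient ids `A`, `B`, `C` are made up): it computes, for each patient, the share of rows with a missing age and lists patients for whom the age is only sometimes missing.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: patient "A" never has an age, "B" always does, "C" is mixed.
df = pd.DataFrame({
    "patient_id": ["A", "A", "B", "B", "C", "C"],
    "age_approx": [np.nan, np.nan, 40.0, 45.0, np.nan, 50.0],
})

# Fraction of rows with a missing age, per patient.
missing_share = df["age_approx"].isna().groupby(df["patient_id"]).mean()

# Patients whose age is missing on some rows but not on all of them.
partially_missing = missing_share[(missing_share > 0) & (missing_share < 1)].index.tolist()
print(partially_missing)
```

An empty list on the real data would confirm that missing ages are consistent within each patient.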

#### 1.2.c Patient sex
---
Type: Categorical

Description: Identifies the sex of the patient

We noticed NaN values earlier: how often do they occur? What is the distribution of men and women in both sets? Is any sex missing in the test set?

In [None]:
train_csv_cp = train_csv.copy()
train_csv_cp["sex"] = train_csv_cp["sex"].fillna("null")
test_csv_cp = test_csv.copy()
test_csv_cp["sex"] = test_csv_cp["sex"].fillna("null")

In [None]:
print("\tTRAIN SEX VALUE COUNTS\n---\n")
print(train_csv_cp["sex"].value_counts())
print("-----------------------------------")
print("\tTEST SEX VALUE COUNTS\n---\n")
print(test_csv_cp["sex"].value_counts())

Which individuals have a missing sex?

In [None]:
train_csv[train_csv["sex"].isnull()]["suffix_id"].value_counts()

The individuals with a missing sex are the same as those with a missing age.

#### 1.2.d anatom_site_general_challenge
---
Type: Categorical

Description: Defines the location of the picture on the patient's body.

In [None]:
print("\tTRAIN ANATOM SITE VALUE COUNTS:\n---\n")
print(train_csv["anatom_site_general_challenge"].fillna("null").value_counts())
print('--------------------------------------')
print("\tTEST ANATOM SITE VALUE COUNTS:\n---\n")
print(test_csv["anatom_site_general_challenge"].fillna("null").value_counts())

Both train and test sets have missing values. The distributions look similar; let's plot the frequency of each value for both sets.

In [None]:
# Build one table so that both sets share the same category order on the x axis.
site_freq = pd.DataFrame({
    "Train set": train_csv["anatom_site_general_challenge"].fillna("null").value_counts(normalize=True),
    "Test set": test_csv["anatom_site_general_challenge"].fillna("null").value_counts(normalize=True),
})

# Side-by-side bars: one group per anatomy site, one bar per set.
ax = site_freq.plot(kind="bar", figsize=(25,10))
ax.set_ylabel("Anatomy site count (in frequency)")
ax.set_xlabel("Anatomy site")

plt.show()
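The visual comparison can be complemented with a single number. The sketch below computes the total variation distance between two categorical frequency tables (0 means identical distributions, 1 means no overlap). The site labels are hypothetical sample values, and choosing this particular distance is an assumption rather than something the notebook prescribes.

```python
import pandas as pd

# Hypothetical site labels for two small samples.
train_sites = pd.Series(["torso", "torso", "lower extremity", "head/neck"])
test_sites = pd.Series(["torso", "lower extremity", "lower extremity", "head/neck"])

train_freq = train_sites.value_counts(normalize=True)
test_freq = test_sites.value_counts(normalize=True)

# Align on the union of categories; a category absent from one side counts as 0.
aligned = pd.concat([train_freq, test_freq], axis=1, keys=["train", "test"]).fillna(0.0)

# Total variation distance: half the sum of absolute frequency differences.
tv_distance = 0.5 * (aligned["train"] - aligned["test"]).abs().sum()
print(tv_distance)
```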

Distributions are indeed very similar.
Do the null values come from the same patients (as for age and sex)? This seems unlikely here, unless several spots were photographed on the same body part of the same patient.

In [None]:
print("\tTRAIN NULL VALUE COUNTS:\n---\n")
print(train_csv[train_csv["anatom_site_general_challenge"].isnull()]["suffix_id"].value_counts())
print("-----------------------")
print("\tTEST NULL VALUE COUNTS:\n---\n")
print(test_csv[test_csv["anatom_site_general_challenge"].isnull()]["suffix_id"].value_counts())

- Train missing values come from 220 different patients, with 102 rows coming from a single patient, '3057277'.
- Test missing values come from 61 different patients, with 240 rows coming from a single patient, '3579794'.

In [None]:
print(train_csv[train_csv["anatom_site_general_challenge"].isnull()]["benign_malignant"].value_counts())

#### 1.2.e Diagnosis
---
Type: Categorical

Description: Provide details about the diagnosis. Is only found in the train dataset.


In [None]:
print("\tBENIGN DIAGNOSIS VALUE COUNT:\n---\n")
print(train_csv[train_csv["benign_malignant"] == "benign"]["diagnosis"].value_counts())
print('------------------------')
print("\tMALIGNANT DIAGNOSIS VALUE COUNT:\n---\n")
print(train_csv[train_csv["benign_malignant"] != "benign"]["diagnosis"].value_counts())

This seems to indicate that a benign diagnosis can be left unknown, as long as it is certain that what is observed is **NOT** a melanoma.
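The two filtered `value_counts` calls can also be read side by side in a single contingency table. The sketch below is a minimal illustration on hypothetical rows (the counts are made up); in the notebook, the same call would take `train_csv["diagnosis"]` and `train_csv["benign_malignant"]`.

```python
import pandas as pd

# Hypothetical rows, only to show the shape of the resulting table.
df = pd.DataFrame({
    "diagnosis": ["unknown", "nevus", "unknown", "melanoma", "melanoma"],
    "benign_malignant": ["benign", "benign", "benign", "malignant", "malignant"],
})

# Rows: diagnosis values; columns: benign / malignant; cells: number of images.
table = pd.crosstab(df["diagnosis"], df["benign_malignant"])
print(table)
```

A diagnosis that only ever appears in one column is fully determined by the benign/malignant label.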

#### 1.2.f Benign/Malignant
---
Type: Categorical

Description: Target variable, describes if the spot is benign or not.

In [None]:
train_csv["benign_malignant"].value_counts()

## 2. Using ML algorithms to find correlations between variables

We will now try to build a predictive algorithm that uses only the tabular data to predict benign/malignant, in order to uncover potential correlations. The diagnosis column is excluded from this procedure, and new features are created.

In [None]:
new_train_csv = train_csv.copy()
to_drop = ["image_name", "diagnosis", "benign_malignant", "preffix_id", "patient_id"]

#### 2.a Sex
---

In [None]:
# Encode sex as 0 (male), 1 (female) or -1 (missing).
new_train_csv["is_female"] = new_train_csv["sex"].str.lower().map({"male": 0, "female": 1}).fillna(-1).astype(int)
to_drop.append("sex")
new_train_csv.head()

#### 2.b Anatom Site General Challenge
---

In [None]:
# Replace missing sites with an explicit "null" label so that NaN is never used as a dictionary key.
anatom_site = new_train_csv["anatom_site_general_challenge"].fillna("null")
anatom_classes = {
    val: i for i, val in enumerate(anatom_site.unique())
}
print("String to class: "+str(anatom_classes))
new_train_csv["anatom_classes"] = anatom_site.map(anatom_classes)
to_drop.append("anatom_site_general_challenge")
new_train_csv.head()
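Integer codes give the body sites an arbitrary order and spacing, which distance-based methods such as the KMeans used later will treat as meaningful. One common alternative is one-hot encoding, sketched below on a hypothetical series of site labels; using it here is a suggestion, not what the notebook does.

```python
import pandas as pd

# Hypothetical site labels, including one missing value.
sites = pd.Series(["torso", "head/neck", None, "torso"], name="site")

# One 0/1 column per category; missing values get their own "null" column.
one_hot = pd.get_dummies(sites.fillna("null"), prefix="site", dtype=int)
print(one_hot)
```

With this encoding, every pair of distinct sites is equally far apart, so no site is artificially closer to another.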

#### 2.c Age Approx
---

In [None]:
# Ages are multiples of 5, so divide by 5 to get compact integer bins; missing ages become -1.
new_train_csv["reduced_age"] = (new_train_csv["age_approx"] // 5).fillna(-1).astype(int)
to_drop.append("age_approx")
new_train_csv.head()

#### 2.d Patient id
---

In [None]:
suffix_classes = {
    val: i for i, val in enumerate(new_train_csv["suffix_id"].unique())
}

new_train_csv["suffix_classes"] = new_train_csv["suffix_id"].apply(lambda x: suffix_classes[x])
to_drop.append("suffix_id")
# The encoded id is dropped as well, so it is not used as a feature below.
to_drop.append("suffix_classes")
new_train_csv.head()


#### 2.e Cleaning/splitting data and creating classifier

In [None]:
# After dropping, the first remaining column holds the label; the others are the engineered features.
features = new_train_csv.drop(to_drop, axis=1)
y = features.iloc[:, 0].to_numpy()
X = features.iloc[:, 1:].to_numpy()

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, roc_auc_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
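Since the same patients appear on several rows, a purely random row split can place images of one patient on both sides. A group-aware split avoids this. The sketch below uses scikit-learn's `GroupShuffleSplit` on hypothetical toy arrays (the `toy_` names and patient labels are made up); in the notebook, the groups could be the `suffix_id` column.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical toy data: 10 rows coming from 4 patients.
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.array([0, 0, 0, 1, 0, 0, 1, 0, 0, 0])
toy_groups = np.array(["p1", "p1", "p1", "p2", "p2", "p3", "p3", "p3", "p4", "p4"])

# test_size applies to the number of groups, not the number of rows.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.33, random_state=42)
train_idx, test_idx = next(splitter.split(toy_X, toy_y, groups=toy_groups))

# No patient should appear on both sides of the split.
shared_patients = set(toy_groups[train_idx]) & set(toy_groups[test_idx])
print(shared_patients)
```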

In [None]:
from sklearn.cluster import KMeans
N_CLUSTERS = 200

In [None]:
clf = KMeans(N_CLUSTERS)

In [None]:
kmeans = clf.fit(X_train)

In [None]:
class_per_cluster = {}
preds = kmeans.predict(X_train)

for i in range(N_CLUSTERS):
    classes = y_train[preds == i]

    # Count both classes explicitly, so index 0 is always benign and index 1
    # always malignant, even when a cluster contains a single class.
    counts = np.bincount(classes, minlength=2)

    class_per_cluster[i] = counts / counts.sum()

In [None]:
y_preds = kmeans.predict(X_test)

In [None]:
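y_preds_classes = []
for pred in y_preds:
    probs = class_per_cluster[pred]

    # Sample a class index according to the class frequencies of the cluster.
    c = np.random.choice(len(probs), p=probs)

    y_preds_classes.append(c)

# roc_auc_score expects the true labels first, then the predictions.
print(roc_auc_score(y_test, y_preds_classes))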
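ROC AUC is designed for continuous scores, so sampling a hard class from each cluster discards information and adds noise. A possible variant, sketched below on hypothetical toy blobs with `toy_`-prefixed names that are not part of the notebook, scores each held-out point with the share of positive labels in its cluster.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical toy data: two well-separated blobs, the second one labelled 1.
rng = np.random.default_rng(0)
toy_X = np.vstack([rng.normal(0, 1, size=(100, 2)), rng.normal(6, 1, size=(100, 2))])
toy_y = np.array([0] * 100 + [1] * 100)
toy_X_tr, toy_X_te, toy_y_tr, toy_y_te = train_test_split(
    toy_X, toy_y, test_size=0.33, random_state=42
)

toy_kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(toy_X_tr)

# Share of positive labels among the training points of each cluster.
toy_train_clusters = toy_kmeans.labels_
toy_positive_rate = np.array(
    [toy_y_tr[toy_train_clusters == k].mean() for k in range(toy_kmeans.n_clusters)]
)

# Score each held-out point with the positive rate of its assigned cluster.
toy_scores = toy_positive_rate[toy_kmeans.predict(toy_X_te)]
toy_auc = roc_auc_score(toy_y_te, toy_scores)
print(toy_auc)
```

The score is deterministic for a fitted model, unlike the sampled classes, so repeated evaluations give the same AUC.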