# VGG16 + XGBoost (images + tabular data)

In this notebook I try a very simple approach that combines a VGG16 image classifier with an XGBoost classifier.
It consists of the following steps:
1. Train a VGG16 model on the training images (using [this notebook](https://www.kaggle.com/ibtesama/siim-baseline-keras-vgg16) as a reference);
2. Use the model to predict the diagnosis of each image in the training set;
3. Add the predictions as an extra column to the training data (which contains sex, age, etc.);
4. Train an XGBoost classifier on the new dataframe.

It was intended just as a learn-by-doing exercise, and I think it is a good starting point for newcomers.

### Imports

In [None]:
import tensorflow

In [None]:
!pip install livelossplot

In [None]:
!pip install tornado==4.5.3

In [None]:
import gc
import os

import cv2
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import PIL
import plotly.graph_objects as go
import skimage.io
import tensorflow as tf
import xgboost as xgb
from IPython.display import Image, display
from livelossplot import PlotLossesKeras
from matplotlib import pyplot

from sklearn.impute import KNNImputer
from sklearn.metrics import (accuracy_score, auc, cohen_kappa_score, f1_score,
                             mean_squared_error, precision_recall_curve,
                             roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

from keras.applications.resnet50 import ResNet50
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.callbacks import Callback, EarlyStopping, ModelCheckpoint
from keras.layers import (Activation, BatchNormalization, Conv2D, Dense, Dropout,
                          Flatten, GlobalMaxPooling2D, Input, MaxPooling2D)
from keras.models import Model, Sequential, load_model
from keras.optimizers import Adam, RMSprop, SGD
from keras.preprocessing.image import img_to_array, load_img
from keras.utils import to_categorical
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.python.keras import backend as K


def clean_dataset(df):
    """
    Removes rows containing NaN or Inf values and casts the rest to float64.
    """
    assert isinstance(df, pd.DataFrame), "df needs to be a pd.DataFrame"
    df.dropna(inplace=True)
    # axis is passed by keyword because newer pandas versions reject it as a positional argument
    indices_to_keep = ~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)
    return df[indices_to_keep].astype(np.float64)

# VGG-16

Let's start by using VGG16 to perform prediction using the images. Then we will add the results from the VGG prediction to the rest of the data.

In [None]:
TEST_CSV_PATH = '../input/siim-isic-melanoma-classification/test.csv'
TRAIN_CSV_PATH = '../input/siim-isic-melanoma-classification/train.csv'
TEST_JPEG_PATH = '../input/siim-isic-melanoma-classification/jpeg/test/'
TRAIN_JPEG_PATH = '../input/siim-isic-melanoma-classification/jpeg/train/'

train=pd.read_csv(TRAIN_CSV_PATH)
test=pd.read_csv(TEST_CSV_PATH)
train.head()


In [None]:
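# look the counts up by label instead of relying on the order returned by value_counts()
counts = train['target'].value_counts()
n_benign, n_malignant = counts[0], counts[1]
total = n_benign + n_malignant
print(f"{n_benign / total * 100:.2f}% of the data is benign.")
print(f"{n_malignant / total * 100:.2f}% of the data is malignant.")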

Only about 1.7% of the data consists of malignant examples.
As we already know, that can be a problem (see the Classifier notebook). There we tried using SMOTE to deal with the imbalance, and it did not work well. This time let's be more direct (and a little drastic): I will keep all malignant examples and only a small random sample of the benign ones for training.

In [None]:
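df_0=train[train['target']==0].sample(2000, random_state=1234)  # fixed seed so the subset is reproducible
df_1=train[train['target']==1]
train=pd.concat([df_0,df_1])
train=train.reset_index(drop=True)  # drop the old index instead of keeping it as an extra column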
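
The undersampling step above can be wrapped in a small helper. This is only a sketch on synthetic data: the `undersample` function, its parameters and the toy frame are hypothetical and not part of the original notebook.

```python
import pandas as pd

def undersample(df, label_col="target", n_majority=2000, seed=0):
    """Keep every positive row and a fixed-size random sample of negatives."""
    negatives = df[df[label_col] == 0]
    positives = df[df[label_col] == 1]
    # Never ask for more negative rows than exist.
    n = min(n_majority, len(negatives))
    sampled = negatives.sample(n=n, random_state=seed)
    # Shuffle so the positives are not all at the end, then drop the old index.
    mixed = pd.concat([sampled, positives]).sample(frac=1, random_state=seed)
    return mixed.reset_index(drop=True)

# Toy data: 1000 rows, one positive every 50 rows (2% positives).
toy = pd.DataFrame({
    "image_name": [f"img_{i}" for i in range(1000)],
    "target": [1 if i % 50 == 0 else 0 for i in range(1000)],
})
balanced = undersample(toy, n_majority=100, seed=42)
print(balanced["target"].value_counts())
```

Fixing the seed means a rerun of the notebook trains on the same benign subset, which makes results easier to compare.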

Preparing datasets...

In [None]:
# Build the full image paths for the training subset and keep the labels alongside
df=pd.DataFrame({'images': TRAIN_JPEG_PATH + train['image_name'] + '.jpg',
                 'target': train['target']})

# Same for the test set, which has no labels
df_test=pd.DataFrame({'images': TEST_JPEG_PATH + test['image_name'] + '.jpg'})

Here I cast the targets with `.astype(np.float32)` to avoid the following error:
``TypeError: Input 'y' of 'Mul' Op has type int64 that does not match type float32 of argument 'x'``

I **DID NOT** have this problem when running this notebook on Kaggle, where the TensorFlow version is 2.1.0 (I am now running the newer version 2.3).

In [None]:
X_train, X_val, y_train, y_val = train_test_split(df['images'],df['target'], test_size=0.2, random_state=1234)

train=pd.DataFrame(X_train)
train.columns=['images']
train['target']=y_train.astype(np.float32)

validation=pd.DataFrame(X_val)
validation.columns=['images']
validation['target']=y_val.astype(np.float32)

I'll do some very basic preprocessing:
* normalizing (rescaling pixel values to [0, 1])
* resizing to 224x224
* augmentation (only for the training data)

In [None]:
train_datagen = ImageDataGenerator(rescale=1./255,rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,horizontal_flip=True)
val_datagen=ImageDataGenerator(rescale=1./255)
train_generator = train_datagen.flow_from_dataframe(
    train,
    x_col='images',
    y_col='target',
    target_size=(224, 224),
    batch_size=8,
    shuffle=True,
    class_mode='raw')

validation_generator = val_datagen.flow_from_dataframe(
    validation,
    x_col='images',
    y_col='target',
    target_size=(224, 224),
    shuffle=False,
    batch_size=8,
    class_mode='raw')

### VGG modelling + training

I'm using a pretrained VGG-16 with a single dense output layer added on top. The competition is evaluated on AUC scores, so we'll use that as a metric.

In [None]:
def vgg16_model(num_classes=None):
    # num_classes is unused: the head is a single sigmoid unit for binary classification

    base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
    x=Flatten()(base.output)
    output=Dense(1,activation='sigmoid')(x) # one probability of malignancy per image; the AUC is computed from these scores
    model=Model(base.input,output)
    
    return model

vgg_conv=vgg16_model(1)

Also, because of class imbalance it's better to use **focal loss** rather than normal **binary_crossentropy**. You can read more about it [here](https://arxiv.org/abs/1708.02002).


In [None]:
def focal_loss(alpha=0.25, gamma=2.0):
    def focal_crossentropy(y_true, y_pred):
        # per-sample binary cross-entropy, i.e. -log(p_t)
        bce = K.binary_crossentropy(y_true, y_pred)
        
        # p_t is the predicted probability of the true class
        y_pred = K.clip(y_pred, K.epsilon(), 1.- K.epsilon())
        p_t = (y_true*y_pred) + ((1-y_true)*(1-y_pred))

        # alpha weights the positive class, (1-alpha) the negative class
        alpha_factor = y_true*alpha + ((1-alpha)*(1-y_true))
        # (1-p_t)^gamma down-weights examples that are already classified well
        modulating_factor = K.pow((1-p_t), gamma)

        # compute the final loss and return
        return K.mean(alpha_factor*modulating_factor*bce, axis=-1)
    return focal_crossentropy
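
To see what the modulating factor does, here is a NumPy re-implementation of the same formula applied to two positive examples, one easy and one hard. It is a sketch for illustration only: `focal_loss_np` is a hypothetical helper and is not used for training.

```python
import numpy as np

def focal_loss_np(y_true, y_pred, alpha=0.25, gamma=2.0, eps=1e-7):
    """Per-sample focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.clip(np.asarray(y_pred, dtype=np.float64), eps, 1.0 - eps)
    # p_t is the probability the model assigns to the true class.
    p_t = y_true * y_pred + (1.0 - y_true) * (1.0 - y_pred)
    alpha_t = y_true * alpha + (1.0 - y_true) * (1.0 - alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

y_true = np.array([1.0, 1.0])
y_pred = np.array([0.9, 0.1])      # easy positive, hard positive
losses = focal_loss_np(y_true, y_pred)
plain_bce = -np.log(y_pred)        # cross-entropy of the same two examples

# Ratio of focal loss to plain cross-entropy for each example.
ratios = losses / plain_bce
print(ratios)
```

The ratio equals `alpha * (1 - p_t) ** gamma`, so the well-classified example keeps a much smaller share of its cross-entropy than the misclassified one.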

In [None]:
opt = Adam(learning_rate=1e-5)  # `lr` is a deprecated alias of `learning_rate`
vgg_conv.compile(loss=focal_loss(), metrics=[tf.keras.metrics.AUC()],optimizer=opt)

In [None]:
nb_epochs = 20
batch_size = 8  # must match the batch_size given to flow_from_dataframe, otherwise each epoch covers only part of the data
nb_train_steps = train.shape[0]//batch_size
nb_val_steps=validation.shape[0]//batch_size
print("Number of training and validation steps: {} and {}".format(nb_train_steps,nb_val_steps))
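
The number of steps per epoch depends on the batch size the generators actually use. A small sketch of the arithmetic, where the sample count is a hypothetical figure and `steps_per_epoch` is an illustrative helper:

```python
import math

def steps_per_epoch(n_samples, batch_size, drop_last=False):
    """Number of batches needed to go through n_samples once."""
    if drop_last:
        # Integer division ignores the final, smaller batch.
        return n_samples // batch_size
    return math.ceil(n_samples / batch_size)

n_train = 2067  # hypothetical number of training images

full_pass = steps_per_epoch(n_train, 8)              # every image seen once
floor_pass = steps_per_epoch(n_train, 8, True)       # last partial batch skipped
half_pass = steps_per_epoch(n_train, 16, True)       # steps computed with a larger batch size
print(full_pass, floor_pass, half_pass)
```

If the steps are computed with a batch size twice as large as the one the generator yields, each epoch goes through only about half of the images.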

And let's create a checkpoint callback for the best model. I'll use the validation loss as the monitor. That will save us some time in case something breaks during a long training session: in that case we can just reload the checkpointed model and resume training from its weights.

In [None]:
checkpoint = ModelCheckpoint("best_model.hdf5", monitor='val_loss', verbose=1,
    save_best_only=True, mode='auto')  # checks after every epoch by default; the deprecated `period` argument is not needed

In [None]:
cb=[PlotLossesKeras(), checkpoint]
# fit_generator is deprecated in recent TensorFlow releases; Model.fit accepts generators directly
vgg_conv.fit(
    train_generator,
    steps_per_epoch=nb_train_steps,
    epochs=nb_epochs,
    validation_data=validation_generator,
    callbacks=cb,
    validation_steps=nb_val_steps)

In [None]:
# serialize model to JSON
model_json = vgg_conv.to_json()
with open("last_model.json", "w") as json_file:
    json_file.write(model_json)
# serialize weights to HDF5
vgg_conv.save_weights("last_model.h5")
print("Saved model to disk")

# XGBoost

### Prepare data 

* We drop all columns that are not used for classification, namely the patient id and `benign_malignant` (which just gives a name to each label and is already represented by the target column).
* We also have to drop the diagnosis because it directly reveals whether the lesion is a cancer (obviously it is not present in the evaluation data, AND using it would lead to an accuracy of 1).
* Use one-hot encoding for sex and anatomy (the part of the body where the photo was taken).
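
As a quick illustration of the one-hot encoding step, here is what `pd.get_dummies` with `drop_first=True` produces on a toy frame. The rows are made up; only the column names follow the dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "sex": ["male", "female", None],
    "anatom_site_general_challenge": ["torso", "head/neck", "torso"],
})

# drop_first=True removes the first category (alphabetically), which then
# acts as the baseline; a missing value becomes a row of zeros.
sex_encoded = pd.get_dummies(toy["sex"], prefix="sex", drop_first=True)
anatom_encoded = pd.get_dummies(
    toy["anatom_site_general_challenge"], prefix="anatom", drop_first=True
)

encoded = pd.concat([sex_encoded, anatom_encoded], axis=1).astype(int)
print(encoded)
```

Note that with `drop_first=True` a missing value and the dropped baseline category are encoded identically (all zeros).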


In [None]:
data_df = pd.read_csv(TRAIN_CSV_PATH) 
data_df = data_df.drop(columns=['benign_malignant', 'diagnosis', 'patient_id'])

# one-hot encoding
data_df = pd.concat([data_df,pd.get_dummies(data_df['sex'], prefix='sex', drop_first=True)],axis=1)
data_df = pd.concat([data_df,pd.get_dummies(data_df['anatom_site_general_challenge'], prefix='anatom', drop_first=True)],axis=1)
data_df.drop(['sex', 'anatom_site_general_challenge'],axis=1, inplace=True)

# normalization
column_names_to_normalize = ['age_approx']
x = data_df[column_names_to_normalize].values
scaler = MinMaxScaler() 
x_scaled = scaler.fit_transform(x)
df_temp = pd.DataFrame(x_scaled, columns=column_names_to_normalize, index = data_df.index)
data_df[column_names_to_normalize] = df_temp

data_df.head()


### Predict training data (adding VGG predictions)

Now we use the newly trained VGG model to extract predictions for the training data.

In [None]:
# custom_objects must map the loss name to the loss function itself, so call the factory
loaded_vgg = load_model('./best_model.hdf5', custom_objects={'focal_crossentropy': focal_loss()})

def vgg_prediction(image_name):
    #print(str(TRAIN_JPEG_PATH + image_name + '.jpg'))
    img = cv2.imread(str(TRAIN_JPEG_PATH + image_name + '.jpg'))
    img = cv2.resize(img, (224,224))
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    img = img.astype(np.float32)/255.
    img = np.reshape(img,(1,224,224,3))
    prediction = loaded_vgg.predict(img)
    return prediction[0][0]

In [None]:
data_df['vgg_prediction'] = data_df['image_name'].apply(vgg_prediction)
data_df

It can take quite some time to get the predictions for all images, so let's save them to a CSV file just to be sure we won't lose them.


In [None]:
data_df.to_csv("train_with_vgg.csv")
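
Calling `predict` once per image is one reason this step is slow. Below is a sketch of grouping images into batches before predicting. `predict_in_batches`, `fake_load` and `fake_predict` are hypothetical stand-ins: in the notebook the loader would wrap the `cv2` preprocessing and the predictor would be `loaded_vgg.predict`.

```python
import numpy as np

def predict_in_batches(items, load_fn, predict_fn, batch_size=32):
    """Load items in chunks, predict each chunk at once, return a 1-D score array."""
    outputs = []
    for start in range(0, len(items), batch_size):
        chunk = items[start:start + batch_size]
        # Stack the loaded arrays into one (n, H, W, C) batch.
        batch = np.stack([load_fn(item) for item in chunk])
        preds = np.asarray(predict_fn(batch)).reshape(len(chunk), -1)[:, 0]
        outputs.append(preds)
    return np.concatenate(outputs) if outputs else np.empty(0)

# Stand-in loader: a tiny constant "image" whose value is the name length.
def fake_load(name):
    return np.full((4, 4, 3), float(len(name)), dtype=np.float32)

# Stand-in model: returns the mean pixel value with shape (n, 1), like a sigmoid head.
def fake_predict(batch):
    return batch.mean(axis=(1, 2, 3)).reshape(-1, 1)

names = ["a", "bb", "ccc", "dddd", "eeeee"]
scores = predict_in_batches(names, fake_load, fake_predict, batch_size=2)
print(scores)
```

The result could then be assigned as a column in one go, instead of going through `apply` row by row.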

### Splitting train and test data

In [None]:
y = data_df['target']
X_data = data_df.drop(columns=['target', 'image_name'])

# split the dataset into train and Test
seed = 7
test_size = 0.3
Xtrain, Xtest, ytrain, ytest = train_test_split(X_data, y, test_size=test_size, random_state=seed)

Xtrain.head()
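
With so few positive examples, a purely random split can leave the two parts with different class ratios. One option, shown here as a sketch on synthetic data rather than as what the notebook does, is the `stratify` argument of `train_test_split`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))   # synthetic features
y_toy = np.zeros(1000, dtype=int)
y_toy[:20] = 1                       # 2% positives, similar in spirit to the dataset

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=7, stratify=y_toy
)

# Both parts keep (approximately) the original share of positives.
print(y_tr.mean(), y_te.mean())
```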

### XGBoost model definition

In [None]:
data_dmmatrix= xgb.DMatrix(data=Xtrain,label=ytrain)

workers=4

param_bin = {
    'nthread':workers,
    'max_depth': 500,
    'eta': 0.01,
    'gamma':0,
    'subsample':0.8,
    'colsample_bytree':0.8,
    'objective': 'binary:logistic'}
epochs = 1000

model_bin = xgb.train(param_bin, data_dmmatrix, epochs)


In [None]:
xgb_test = xgb.DMatrix(Xtest, label=ytest)
predictions_bin = model_bin.predict(xgb_test)

auc_bin = roc_auc_score(ytest, predictions_bin)
print('binary:logistic ROC AUC=%.3f' % (auc_bin))

# calculate roc curves
fpr_bin, tpr_bin, _ = roc_curve(ytest, predictions_bin)

# plot the ROC curve of the binary:logistic model
pyplot.plot(fpr_bin, tpr_bin, linestyle='--', label='binary:logistic')
pyplot.xlabel('False Positive Rate')
pyplot.ylabel('True Positive Rate')
pyplot.legend()
pyplot.show()
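
For intuition about the metric: the ROC AUC is the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. A small NumPy sketch of that definition follows (`roc_auc_pairwise` is a hypothetical helper; for real data `roc_auc_score` is the better choice because this version compares all pairs):

```python
import numpy as np

def roc_auc_pairwise(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=np.float64)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score with every negative score; ties count as half.
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

y_toy = [0, 0, 1, 1]
s_toy = [0.1, 0.4, 0.35, 0.8]
auc_value = roc_auc_pairwise(y_toy, s_toy)
print(auc_value)
```

Because only the ranking matters, the AUC does not change if the scores are rescaled monotonically.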

## Create submission file
Now we use the test.csv file, get the predictions and generate a submission file.

In [None]:
def vgg_prediction_test(image_name):
    img = cv2.imread(str(TEST_JPEG_PATH + image_name + '.jpg'))
    img = cv2.resize(img, (224,224))
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    img = img.astype(np.float32)/255.
    img = np.reshape(img,(1,224,224,3))
    prediction = loaded_vgg.predict(img)
    return prediction[0][0]
 
eval_data_df = pd.read_csv(TEST_CSV_PATH)
imageNames_lst = eval_data_df['image_name']

eval_data_df = pd.concat([eval_data_df,pd.get_dummies(eval_data_df['sex'], prefix='sex', drop_first=True)],axis=1)
eval_data_df = pd.concat([eval_data_df,pd.get_dummies(eval_data_df['anatom_site_general_challenge'], drop_first=True, prefix='anatom')],axis=1)

eval_data_df.drop(['sex', 'anatom_site_general_challenge', 'patient_id'],axis=1, inplace=True)

# normalize the age with the scaler already fitted on the training data,
# so that train and test ages share the same scale
column_names_to_normalize = ['age_approx']
x = eval_data_df[column_names_to_normalize].values
x_scaled = scaler.transform(x)
df_temp = pd.DataFrame(x_scaled, columns=column_names_to_normalize, index = eval_data_df.index)
eval_data_df[column_names_to_normalize] = df_temp

# take the vgg prediction and add it to the dataframe...
eval_data_df['vgg_prediction'] = eval_data_df['image_name'].apply(vgg_prediction_test)

# now we don't need the image name anymore...
eval_data_df = eval_data_df.drop(columns=['image_name'])
X = eval_data_df

xgb_X = xgb.DMatrix(X)
predictions_bin = model_bin.predict(xgb_X)

sub_data_bin = {'image_name': imageNames_lst, 'target':predictions_bin}
sub_df_bin = pd.DataFrame(sub_data_bin) 
sub_df_bin

In [None]:
sub_df_bin.to_csv('submission.csv', index=False)
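
The test features need the same columns, in the same order, as the features the XGBoost model was trained on. If a category were missing from one of the files, `get_dummies` would produce a different set of columns. A defensive sketch using `reindex` follows; the toy frames are made up, and `train_cols` stands in for `X_data.columns`.

```python
import pandas as pd

# Stand-in for X_data.columns (the training feature columns).
train_cols = ["age_approx", "sex_male", "anatom_torso",
              "anatom_upper extremity", "vgg_prediction"]

# Toy test features: different column order and one dummy column missing.
test_features = pd.DataFrame({
    "vgg_prediction": [0.2, 0.7],
    "age_approx": [0.5, 0.8],
    "sex_male": [1, 0],
    "anatom_torso": [0, 1],
})

# Reorder to the training layout and fill absent dummy columns with 0.
aligned = test_features.reindex(columns=train_cols, fill_value=0)
print(aligned)
```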