![Cancer](https://media2.giphy.com/media/sCqnpiUFN228E/giphy.gif)

# Introduction

Human health is among the most important areas of study, and methods for preventing and detecting health problems have attracted a lot of interest. Cancer is one of the most common illnesses with a significant impact on human health; a cancerous tumor is called a malignant tumor. Colon cancer, alongside breast cancer and lung cancer, is one of the deadliest cancers in the United States: it is the third most deadly and killed 49,190 people in 2016 [1]. It begins in the colon (the large intestine), which is the last part of the digestive system.

In this assignment, machine learning techniques are used to help detect malignant cells and to distinguish cell types in colon cancer. Deep learning models such as AlexNet, ResNet50, and VGG19 are developed and evaluated in this notebook, with XGBoost as the sole non-deep-learning approach.

# Import necessary libraries

In [None]:
conda install -c conda-forge keras-preprocessing


In [None]:
pip install pydot

In [None]:
pip install opencv-python

In [None]:
pip install plotly

In [None]:
pip install tensorflow

In [None]:
pip install xgboost


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn as sk
from sklearn.utils import resample

import seaborn as sns
import warnings

from tensorflow.keras.layers.experimental import preprocessing



from sklearn.model_selection import train_test_split

warnings.filterwarnings("ignore")

## Load dataset

In [None]:
df_label = pd.read_csv("data_labels_mainData.csv")
df_label_extra = pd.read_csv("data_labels_extraData.csv")

In [None]:
from zipfile import ZipFile
ZipFile("Image_classification_data.zip").extractall(".")

In [None]:
df_label

In [None]:
df_label['cellType']

In [None]:
df_label_extra

In [None]:
cancer_data = df_label.groupby('patientID').any()

sns.set_theme(style="whitegrid")
fig, ax1 = plt.subplots(figsize = (8 , 8))
graph = sns.countplot(ax=ax1,x='isCancerous', data=cancer_data, palette='tab10')
graph.set_title("Positive vs Negative cancerous patients", fontsize=20)
graph.set_xticklabels(graph.get_xticklabels(),rotation=0)
ax1.set_ylim([0, 50])
for p in graph.patches:
    height = p.get_height()
    graph.text(p.get_x()+p.get_width()/2., height + 1, height, ha="center")

51 out of a total of 77 patients have cancer.

In [None]:
import plotly.express as px 

pie=px.pie(data_frame=df_label,
           names='cellTypeName',
           color_discrete_sequence=px.colors.qualitative.Pastel,
           width=550,
           height=550)
pie.update_layout(title_text='Distribution of cell types', title_x=0.5)
pie.update_traces(textinfo='value+label+percent')
pie

In [None]:
sns.set_theme(style="whitegrid")
fig, ax1 = plt.subplots(figsize = (10 , 5))
cell_types_graph = sns.countplot(ax=ax1,x='isCancerous', hue='cellTypeName', data=df_label, palette='tab10')
cell_types_graph.set_title("Cell types of positive vs negative patients", fontsize=20)
cell_types_graph.set_xticklabels(cell_types_graph.get_xticklabels(),rotation=0)
ax1.set_ylim([0, 5000])
for p in cell_types_graph.patches:
    if p.get_height() > 0:
        height = p.get_height()  
        cell_types_graph.text(p.get_x()+p.get_width()/2., height + 100, int(height), ha="center")
    else:
        cell_types_graph.text(p.get_x()+p.get_width()/2, 100, '0', ha="center")

From the graph, we can conclude that epithelial is the cancerous cell type: every cancerous sample is an epithelial cell, and no other cell type appears among the cancerous samples.
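
The same relationship can be checked numerically rather than read off the plot. The sketch below is a minimal illustration on a small hypothetical frame: `toy` and its rows are made up for the example, and only the column names `cellTypeName` and `isCancerous` come from the dataset. Applied to `df_label`, the same `pd.crosstab` call would tabulate the real counts.

```python
import pandas as pd

# Hypothetical toy rows; only the column names mirror the real label file.
toy = pd.DataFrame({
    "cellTypeName": ["epithelial", "epithelial", "fibroblast", "inflammatory", "others"],
    "isCancerous": [1, 1, 0, 0, 0],
})

# Rows: cell type, columns: isCancerous (0/1), values: number of images.
counts = pd.crosstab(toy["cellTypeName"], toy["isCancerous"])
print(counts)

# Cell types that have at least one cancerous image.
cancerous_types = counts.index[counts[1] > 0].tolist()
print(cancerous_types)
```

If only one cell type shows a non-zero count in the `1` column, the table supports the same conclusion as the bar chart.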

In [None]:
sns.set_theme(style="whitegrid")
fig, ax1 = plt.subplots(figsize = (8 , 8))
graph = sns.countplot(ax=ax1,x='isCancerous', data=df_label, palette='tab10')
graph.set_title("Positive vs Negative cancerous patients", fontsize=20)
graph.set_xticklabels(graph.get_xticklabels(),rotation=0)
ax1.set_ylim([0, 6000])
for p in graph.patches:
    height = p.get_height()
    graph.text(p.get_x()+p.get_width()/2., height + 1, height, ha="center")

In [None]:
sns.set_theme(style="whitegrid")
fig, ax1 = plt.subplots(figsize = (8 , 8))
graph = sns.countplot(ax=ax1,x='isCancerous', data=df_label_extra, palette='tab10')
graph.set_title("Positive vs Negative cancerous patients", fontsize=20)
graph.set_xticklabels(graph.get_xticklabels(),rotation=0)
ax1.set_ylim([0, 8000])
for p in graph.patches:
    height = p.get_height()
    graph.text(p.get_x()+p.get_width()/2., height + 1, height, ha="center")

In [None]:
sns.set_theme(style="whitegrid")
fig, ax1 = plt.subplots(figsize = (7 , 5))
graph = sns.countplot(ax=ax1,x='cellTypeName', data=df_label, palette='tab10')
graph.set_title("Cell types number", fontsize=20)
graph.set_xticklabels(graph.get_xticklabels(),rotation=0)
ax1.set_ylim([0, 5000])
for p in graph.patches:
    height = p.get_height()
    graph.text(p.get_x()+p.get_width()/2., height + 100, height, ha="center")

In [None]:
from tensorflow.keras.preprocessing import image

CELL_TYPE_SAMPLE_SIZE = 5

for cell_type_name in df_label['cellTypeName'].unique():
    df_sample = df_label[df_label['cellTypeName'] == cell_type_name].sample(CELL_TYPE_SAMPLE_SIZE)
    plt.figure(figsize=(CELL_TYPE_SAMPLE_SIZE ** 2, CELL_TYPE_SAMPLE_SIZE))
    for image_index, image_name in enumerate(df_sample['ImageName']):
        plt.subplot(1, CELL_TYPE_SAMPLE_SIZE + 1, image_index+1)
        plt.grid(None)
        img = image.load_img('./patch_images/' + image_name, target_size=(27, 27))
        plt.imshow(img)
        plt.title(cell_type_name)

## *Task 1: Classify images according to whether a given cell image represents a cancerous cell or not (isCancerous)*

# Data Processing 


In [None]:
# document: https://keras.io/api/preprocessing/image/#imagedatagenerator-class
from keras_preprocessing.image import ImageDataGenerator

def get_dataframe_iterator(dataframe, 
                            image_shape = (27, 27), 
                            batch_size = 64,
                            x_col = "ImageName",
                            y_col = "cellTypeName",
                            classes = ["fibroblast", "inflammatory", "epithelial", "others"]):
    dataframe[y_col] = dataframe[y_col].apply(str)
    generator = ImageDataGenerator(
        rescale = 1./255, 
        rotation_range = 20,
        width_shift_range=0.2,
        height_shift_range=0.2,
        zoom_range=0.2,
        horizontal_flip=True
    ) 
    iterator = generator.flow_from_dataframe(
        dataframe = dataframe,
        directory = "./patch_images", 
        x_col = x_col,
        y_col = y_col,
        classes = classes, 
        class_mode = "categorical", 
        target_size = image_shape, 
        batch_size = batch_size,
    )
    return iterator

In [None]:
# Check duplicate

In [None]:
import os
file_list = os.listdir('./patch_images/')
print(len(file_list))

In [None]:
import hashlib, os
duplicates = []
hash_keys = dict()
# Iterate over file_list (not a fresh os.listdir call) so the stored indexes match file_list
for index, filename in enumerate(file_list):
    path = os.path.join('./patch_images/', filename)
    if os.path.isfile(path):
        with open(path, 'rb') as f:
            filehash = hashlib.md5(f.read()).hexdigest()
        if filehash not in hash_keys:
            hash_keys[filehash] = index
        else:
            # (index of the duplicate, index of the first file with the same hash)
            duplicates.append((index, hash_keys[filehash]))
            print(filename)


In [None]:
from imageio import imread
for file_indexes in duplicates[:30]:
    try:
    
        plt.subplot(121),plt.imshow(imread('./patch_images/'+ file_list[file_indexes[1]]))
        plt.title(file_indexes[1]), plt.xticks([]), plt.yticks([])

        plt.subplot(122),plt.imshow(imread('./patch_images/'+ file_list[file_indexes[0]]))
        plt.title(str(file_indexes[0]) + ' duplicate'), plt.xticks([]), plt.yticks([])
        plt.show()
    
    except OSError as e:
        continue

In [None]:
# Remove duplicate
for index in duplicates:
    os.remove('./patch_images/' + file_list[index[0]] )

In [None]:
print(df_label)

In [None]:
is_cancer_class_count = df_label.isCancerous.value_counts()
amount_for_balance = abs(is_cancer_class_count[0] - is_cancer_class_count[1])
df_random_cancer_from_extra = df_label_extra[df_label_extra['isCancerous'] == 1].sample(amount_for_balance)
for index in duplicates:
    df_label = df_label[df_label.ImageName  != file_list[index[0]]]
    df_random_cancer_from_extra = df_random_cancer_from_extra[df_random_cancer_from_extra.ImageName  != file_list[index[0]]]

In [None]:
df_label_task2 = df_label

In [None]:

df_label = pd.concat([df_label, df_random_cancer_from_extra], ignore_index=True)
df_label.isCancerous.value_counts()

In [None]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df_label, test_size=0.2, random_state=9999)


# Drop celltype 

In [None]:
train_df = train_df.drop(['cellTypeName', 'cellType'], axis=1)

In [None]:
test_df = test_df.drop(['cellTypeName', 'cellType'], axis=1)

In [None]:
train_df, test_df

In [None]:
train_df, val_df = train_test_split(train_df, test_size=0.25, random_state=9999)

print("Train data : {}, Val Data: {}, Test Data: {}".format(train_df.shape[0], val_df.shape[0], test_df.shape[0]))

In [None]:
train_iterator = get_dataframe_iterator(train_df, y_col='isCancerous', classes=['0','1'])
val_iterator = get_dataframe_iterator(val_df, y_col='isCancerous', classes=['0','1'])
test_iterator = get_dataframe_iterator(test_df, y_col='isCancerous', classes=['0','1'])

In [None]:
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import confusion_matrix, roc_curve, auc, roc_auc_score

# Tensorflow
#from sklearn.preprocessing import OneHotEncoder
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.layers import Conv2D, MaxPooling2D,ZeroPadding2D, AveragePooling2D, BatchNormalization
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, ModelCheckpoint
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import Dense, Dropout, Flatten, Activation, Add
from keras import regularizers
from keras import Input
from tensorflow.keras import initializers
from keras.initializers import GlorotUniform

from keras import Model
from keras import backend as K

In [None]:
def fit_model(model, iterator, val_iterator, 
              epochs = 100, 
              export_dir = './export',
              name = 'default'):
    es = EarlyStopping(monitor='val_accuracy', 
                       mode='max', 
                       verbose=1, 
                       patience=10, 
                       restore_best_weights=True)
    mc = ModelCheckpoint('{}/model_{}.h5'.format(export_dir, name), 
                         monitor='val_accuracy', 
                         mode='max', 
                         save_best_only=True)

    # Model.fit accepts generators directly; fit_generator is deprecated in TF 2.x
    history = model.fit(
        iterator,
        validation_data = val_iterator,
        epochs = epochs,
        verbose = 1,
        callbacks=[mc,es]
    )
    return history

In [None]:
def recall_m(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    recall = true_positives / (possible_positives + K.epsilon())
    return recall

def precision_m(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    return precision

def f1_m(y_true, y_pred):
    precision = precision_m(y_true, y_pred)
    recall = recall_m(y_true, y_pred)
    return 2*((precision*recall)/(precision+recall+K.epsilon()))

In [None]:
METRICS = ['accuracy', precision_m, recall_m, f1_m]

In [None]:
import cv2
import os
list_of_images = []

for path in df_label['ImageName']:
  image_path = os.path.join("./patch_images", path)
  image = cv2.imread(image_path , cv2.IMREAD_GRAYSCALE)
  list_of_images.append(image)

list_of_images = np.asarray(list_of_images)
np.array(list_of_images).shape

In [None]:
list_of_images = np.reshape(list_of_images,  (-1 , 27 * 27))
list_of_images = pd.DataFrame(list_of_images)

In [None]:
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical

train_x_cancer , validate_x_cancer, train_y_cancer , validate_y_cancer = train_test_split(list_of_images, df_label['isCancerous'], test_size=0.2 , random_state = 42, shuffle = True)

print("Train X shape: " , train_x_cancer.shape)
print("Train Y shape: " , train_y_cancer.shape)
print("Validate X shape: " , validate_x_cancer.shape)
print("Validate Y shape: " , validate_y_cancer.shape)

In [None]:
# https://medium.com/mlearning-ai/implementation-of-googlenet-on-keras-d9873aeed83c
from keras.models import Model
from keras.layers import Input, Conv2D, MaxPooling2D, AveragePooling2D, GlobalAveragePooling2D
from keras.layers import Flatten, Dense, Dropout, BatchNormalization
from tensorflow.keras.optimizers import Adam
# concatenate is exported from keras.layers; the keras.layers.merge module path is missing in newer Keras releases
from keras.layers import Concatenate, concatenate

## XGBoost

### 1. Default model (without any parameters)

In [None]:
import xgboost as xgb
from sklearn.metrics import classification_report
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score

# binary:logistic is the objective for binary classification, which matches this task
xgbr = xgb.XGBClassifier(objective='binary:logistic')
xgbr.fit(train_x_cancer, train_y_cancer)

y_pred_validate = xgbr.predict(validate_x_cancer)
prediction_validate = [round(value) for value in y_pred_validate]
# evaluate predictions
accuracy = accuracy_score(validate_y_cancer, prediction_validate)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
mse = mean_squared_error(validate_y_cancer, y_pred_validate)
print("RMSE: %.2f" % (mse**(1/2.0)))
# accuracy = accuracy_score(validate_y_cancer, ypred)
# print("Accuracy: %.2f%%" % (accuracy * 100.0))

In [None]:
y_pred_train = xgbr.predict(train_x_cancer)
prediction_train = [round(value) for value in y_pred_train]
print("Validate report")
print(classification_report(validate_y_cancer, prediction_validate))

### 🔬 Observation: 
The algorithm produces a high accuracy and a low RMSE. However, we should try to improve this model further by finding the best hyperparameters for it.
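
Because `XGBClassifier.predict` returns hard 0/1 labels, the RMSE printed above carries the same information as accuracy: each squared error is either 0 or 1, so the MSE equals the error rate and RMSE = sqrt(1 - accuracy). A minimal sketch with hypothetical labels (the arrays `y_true` and `y_pred` are made up for illustration, not taken from the notebook's data):

```python
import numpy as np

# Hypothetical hard labels, not taken from the notebook's data.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

accuracy = np.mean(y_true == y_pred)
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# For 0/1 labels every squared error is 0 or 1, so MSE is the error rate.
assert np.isclose(rmse, np.sqrt(1 - accuracy))
print(accuracy, rmse)
```

A lower RMSE and a higher accuracy are therefore two views of the same improvement when both are computed on predicted labels rather than predicted probabilities.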


### 2. Using GridSearchCV to find best params and calculate RMSE

####  **Hyperparameter tuning chosen**


In [None]:

from sklearn.model_selection import GridSearchCV

params = { 'max_depth': [3,6,10],
           'learning_rate': [0.01, 0.05, 0.1],
           'n_estimators': [100, 500, 1000],
           'colsample_bytree': [0.3, 0.7]}
gs_xgbr = xgb.XGBClassifier(seed = 20)
grid_search_cv = GridSearchCV(estimator=gs_xgbr, 
                   param_grid=params,
                   scoring='neg_mean_squared_error', 
                   verbose=1)
grid_search_cv.fit(train_x_cancer, train_y_cancer)
print("Best parameters:", grid_search_cv.best_params_)
print("Lowest RMSE: ", (-grid_search_cv.best_score_)**(1/2.0))

In [None]:
## Predict on the validation set with the tuned model
gs_y_pred_validate = grid_search_cv.predict(validate_x_cancer)
gs_prediction_validate = [round(value) for value in gs_y_pred_validate]

print("Validation report")
print(classification_report(validate_y_cancer, gs_prediction_validate))

### 3. Using Randomized Search CV to find best params and calculate lowest RMSE

#### **Hyperparameter tuning chosen**


In [None]:

from sklearn.model_selection import RandomizedSearchCV

params = { 'max_depth': [3, 5, 6, 10, 15, 20],
           'learning_rate': [0.01, 0.1, 0.2, 0.3],
           'subsample': np.arange(0.5, 1.0, 0.1),
           'colsample_bytree': np.arange(0.4, 1.0, 0.1),
           'colsample_bylevel': np.arange(0.4, 1.0, 0.1),
           'n_estimators': [100, 500, 1000]}
rs_xgbr = xgb.XGBClassifier(seed = 20)
random_search = RandomizedSearchCV(estimator=rs_xgbr,
                         param_distributions=params,
                         scoring='neg_mean_squared_error',
                         n_iter=25,
                         verbose=1)
random_search.fit(train_x_cancer, train_y_cancer)
print("Best parameters:", random_search.best_params_)
print("Lowest RMSE: ", (-random_search.best_score_)**(1/2.0))

In [None]:
## Predict on the validation set with the tuned model
rd_y_pred_validate = random_search.predict(validate_x_cancer)
rd_prediction_validate = [round(value) for value in rd_y_pred_validate]

print("Validation report")
print(classification_report(validate_y_cancer, rd_prediction_validate))

### 🔬 Observation: 
- After trying two approaches, grid search and random search, random search gives the higher accuracy ***(83% > 82%)*** and the lower RMSE ***(0.421 < 0.423)***. Therefore, we decided to use random search instead of grid search to build our final model.
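
One way to put this comparison in context is to count how many model fits each search performs. The sketch below assumes scikit-learn's default 5-fold cross-validation (neither search sets `cv`) and reuses the grid sizes and `n_iter=25` from the cells above; the variable names are illustrative.

```python
import math

# Number of candidate values per hyperparameter in the grid-search cell.
grid_sizes = {"max_depth": 3, "learning_rate": 3, "n_estimators": 3, "colsample_bytree": 2}

n_folds = 5          # assumed: scikit-learn's default when cv is not given
n_iter_random = 25   # n_iter passed to RandomizedSearchCV above

grid_candidates = math.prod(grid_sizes.values())   # every combination is tried
grid_fits = grid_candidates * n_folds
random_fits = n_iter_random * n_folds              # only n_iter sampled combinations

print(grid_candidates, grid_fits, random_fits)
```

Under these assumptions the random search reaches its result with fewer than half the cross-validation fits of the grid search, while sampling from a wider range of hyperparameters.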

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay
fig, ax1 = plt.subplots(1, 1, figsize=(20, 8))
# plot_confusion_matrix was removed in scikit-learn 1.2; from_estimator is its replacement
ConfusionMatrixDisplay.from_estimator(random_search, validate_x_cancer, validate_y_cancer, ax=ax1)
ax1.title.set_text('Validation Confusion Matrix')

In [None]:
## Draw a summary table of cancerous vs not-cancerous predictions
pd.DataFrame(confusion_matrix(validate_y_cancer,rd_y_pred_validate),\
            columns=["Predicted Not-Cancerous", "Predicted Cancerous"],\
            index=["Not-Cancerous","Cancerous"])

### 🔬 Observation: 
For this model, of the 1108 not-cancerous cells, 929 are predicted correctly (84%). Of the 1219 cancerous cells, 999 are predicted correctly (82%). Since the task is to diagnose whether a cell is cancerous, ***false negatives*** are more ***important*** than ***false positives***. If a normal cell is diagnosed as positive, the patient pays extra medical fees and takes up a hospital place, whereas if a cancerous cell is diagnosed as negative, the patient may lose their life. Weighing money against human life, we should pay more attention to false negatives, that is, to recall on the cancerous class. By that measure the model is acceptable, with a recall of 82% (999 of 1219).
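
The percentages can be recomputed directly from the four counts quoted in the observation. This is a small arithmetic sketch, not part of the modelling pipeline; the variable names are illustrative.

```python
# Counts quoted in the observation above.
tn, n_negative = 929, 1108   # correctly predicted / total not-cancerous cells
tp, n_positive = 999, 1219   # correctly predicted / total cancerous cells

fp = n_negative - tn   # not-cancerous cells flagged as cancerous
fn = n_positive - tp   # cancerous cells that were missed

specificity = tn / n_negative   # recall of the not-cancerous class
recall = tp / n_positive        # recall (sensitivity) of the cancerous class
precision = tp / (tp + fp)      # share of "cancerous" predictions that are right

print(fp, fn)
print(round(specificity, 3), round(recall, 3), round(precision, 3))
```

The false-negative count `fn` is the quantity the argument above cares about most, and it is tied to `recall` on the cancerous class rather than to overall accuracy.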

In [None]:
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score


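# XGBoost reports no loss here; the leading None presumably keeps the same layout as the Keras results ([loss, accuracy, precision, recall, f1])
evaluation_xgboost_t1 = [None]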
evaluation_xgboost_t1.append(accuracy_score(validate_y_cancer,rd_y_pred_validate))
evaluation_xgboost_t1.append(precision_score(validate_y_cancer, rd_y_pred_validate))
evaluation_xgboost_t1.append(recall_score(validate_y_cancer, rd_y_pred_validate))
evaluation_xgboost_t1.append(f1_score(validate_y_cancer, rd_y_pred_validate))

## InceptionV3

In [None]:
!pip install keras_vggface

In [None]:
import keras
!pip install keras_applications
from keras_applications.imagenet_utils import _obtain_input_shape

In [None]:
from tensorflow import keras
from tensorflow.python.util import tf_inspect
from keras.applications.inception_v3 import InceptionV3
from keras.models import Model
from keras.layers import Lambda, Input
from keras.layers import Dense, GlobalAveragePooling2D

In [None]:
model = InceptionV3(include_top=False, input_shape=(75, 75, 3), weights='imagenet')

# Resize Input images to 75x75
newInput = Input(batch_shape=(None, 27, 27, 3))
resizedImg = Lambda(lambda image: tf.compat.v1.image.resize_images(image, (75, 75)))(newInput)
newOutputs = model(resizedImg)
model = Model(newInput, newOutputs)

# Freeze all the layers
for layer in model.layers[:]:
    layer.trainable = False

# Add a Dense layer to classify the two classes (not cancerous / cancerous)
output = model.output
output = GlobalAveragePooling2D()(output)
output = Dense(units=2, activation='softmax')(output)
model_inceptionv3 = Model(model.input, output)

model_inceptionv3.summary()

In [None]:
opt = Adam(learning_rate=1e-4)  # `lr` is a deprecated alias of `learning_rate`
model_inceptionv3.compile(optimizer=opt, loss='categorical_crossentropy', metrics=METRICS)

history_inception_v3 = fit_model(model_inceptionv3, train_iterator, val_iterator, 
                                export_dir='.',
                                name="Inception_Task1")

# GoogLeNet

In [None]:
def Inception_block(input_layer, f1, f2_conv1, f2_conv3, f3_conv1, f3_conv5, f4): 
  # Input: 
  # - f1: number of filters of the 1x1 convolutional layer in the first path
  # - f2_conv1, f2_conv3 are number of filters corresponding to the 1x1 and 3x3 convolutional layers in the second path
  # - f3_conv1, f3_conv5 are the number of filters corresponding to the 1x1 and 5x5  convolutional layer in the third path
  # - f4: number of filters of the 1x1 convolutional layer in the fourth path

  # 1st path:
  path1 = Conv2D(filters=f1, kernel_size = (1,1), padding = 'same', activation = 'relu')(input_layer)

  # 2nd path
  path2 = Conv2D(filters = f2_conv1, kernel_size = (1,1), padding = 'same', activation = 'relu')(input_layer)
  path2 = Conv2D(filters = f2_conv3, kernel_size = (3,3), padding = 'same', activation = 'relu')(path2)

  # 3rd path
  path3 = Conv2D(filters = f3_conv1, kernel_size = (1,1), padding = 'same', activation = 'relu')(input_layer)
  path3 = Conv2D(filters = f3_conv5, kernel_size = (5,5), padding = 'same', activation = 'relu')(path3)

  # 4th path
  path4 = MaxPooling2D((3,3), strides= (1,1), padding = 'same')(input_layer)
  path4 = Conv2D(filters = f4, kernel_size = (1,1), padding = 'same', activation = 'relu')(path4)

  output_layer = concatenate([path1, path2, path3, path4], axis = -1)

  return output_layer

In [None]:
def GoogLeNet():
  # input layer 
  input_layer = Input(shape = (27, 27, 3))

  # convolutional layer: filters = 64, kernel_size = (7,7), strides = 2
  X = Conv2D(filters = 64, kernel_size = (7,7), strides = 2, padding = 'valid', activation = 'relu')(input_layer)

  X = ZeroPadding2D(padding=(10, 10))(X)

  # max-pooling layer: pool_size = (3,3), strides = 1
  X = MaxPooling2D(pool_size = (3,3), strides = 1)(X)

  # convolutional layer: filters = 64, strides = 1
  X = Conv2D(filters = 64, kernel_size = (1,1), strides = 1, padding = 'same', activation = 'relu')(X)

  # convolutional layer: filters = 192, kernel_size = (3,3)
  X = Conv2D(filters = 192, kernel_size = (3,3), padding = 'same', activation = 'relu')(X)

  # max-pooling layer: pool_size = (3,3), strides = 2
  X = MaxPooling2D(pool_size= (3,3), strides = 2)(X)

  # 1st Inception block
  X = Inception_block(X, f1 = 64, f2_conv1 = 96, f2_conv3 = 128, f3_conv1 = 16, f3_conv5 = 32, f4 = 32)

  # 2nd Inception block
  X = Inception_block(X, f1 = 128, f2_conv1 = 128, f2_conv3 = 192, f3_conv1 = 32, f3_conv5 = 96, f4 = 64)

  # max-pooling layer: pool_size = (3,3), strides = 2
  X = MaxPooling2D(pool_size= (3,3), strides = 2)(X)

  # 3rd Inception block
  X = Inception_block(X, f1 = 192, f2_conv1 = 96, f2_conv3 = 208, f3_conv1 = 16, f3_conv5 = 48, f4 = 64)

  # Extra network 1:
  X1 = AveragePooling2D(pool_size = (5,5), strides = 3)(X)
  X1 = Conv2D(filters = 128, kernel_size = (1,1), padding = 'same', activation = 'relu')(X1)
  X1 = Flatten()(X1)
  X1 = Dense(1024, activation = 'relu')(X1)
  X1 = Dropout(0.7)(X1)
  X1 = Dense(2, activation = 'softmax')(X1) # <----- changed 1000 to 2

  
  # 4th Inception block
  X = Inception_block(X, f1 = 160, f2_conv1 = 112, f2_conv3 = 224, f3_conv1 = 24, f3_conv5 = 64, f4 = 64)

  # 5th Inception block
  X = Inception_block(X, f1 = 128, f2_conv1 = 128, f2_conv3 = 256, f3_conv1 = 24, f3_conv5 = 64, f4 = 64)

  # 6th Inception block
  X = Inception_block(X, f1 = 112, f2_conv1 = 144, f2_conv3 = 288, f3_conv1 = 32, f3_conv5 = 64, f4 = 64)

  # Extra network 2:
  X2 = AveragePooling2D(pool_size = (5,5), strides = 3)(X)
  X2 = Conv2D(filters = 128, kernel_size = (1,1), padding = 'same', activation = 'relu')(X2)
  X2 = Flatten()(X2)
  X2 = Dense(1024, activation = 'relu')(X2)
  X2 = Dropout(0.7)(X2)
  X2 = Dense(2, activation = 'softmax')(X2) # <----- changed 1000 to 2
  
  
  # 7th Inception block
  X = Inception_block(X, f1 = 256, f2_conv1 = 160, f2_conv3 = 320, f3_conv1 = 32, 
                      f3_conv5 = 128, f4 = 128)

  # max-pooling layer: pool_size = (3,3), strides = 2
  X = MaxPooling2D(pool_size = (3,3), strides = 2)(X)

  # 8th Inception block
  X = Inception_block(X, f1 = 256, f2_conv1 = 160, f2_conv3 = 320, f3_conv1 = 32, f3_conv5 = 128, f4 = 128)

  # 9th Inception block
  X = Inception_block(X, f1 = 384, f2_conv1 = 192, f2_conv3 = 384, f3_conv1 = 48, f3_conv5 = 128, f4 = 128)

  # Global Average pooling layer 
  X = GlobalAveragePooling2D(name = 'GAPL')(X)

  # Dropoutlayer 
  X = Dropout(0.4)(X)

  # output layer 
  X = Dense(2, activation = 'softmax')(X) # <------ changed from 1000 to 2 
  
  # model
  model = Model(input_layer, [X, X1, X2], name = 'GoogLeNet')

  return model

In [None]:
model_googlenet_t1 = GoogLeNet()

In [None]:
from tensorflow.keras.utils import plot_model

model_googlenet_t1.summary()

In [None]:
plot_model(model_googlenet_t1, show_shapes=True)

In [None]:
opt = Adam(learning_rate=0.00045, amsgrad = True)
model_googlenet_t1.compile(optimizer=opt, loss='binary_crossentropy',metrics=METRICS)

history_googlenet_t1 = fit_model(model_googlenet_t1, train_iterator, val_iterator,
                                export_dir='.',
                                name="GoogLeNet_Task1")

In [None]:
model_googlenet_t1.save('./');

In [None]:
evaluation_googlenet_t1 = model_googlenet_t1.evaluate(test_iterator)

In [None]:
evaluation_googlenet_t1 = [
    evaluation_googlenet_t1[1], # dense_4_loss
    evaluation_googlenet_t1[4], # dense_4_accuracy
    evaluation_googlenet_t1[5], # dense_4_precision
    evaluation_googlenet_t1[6], # dense_4_recall
    evaluation_googlenet_t1[7]  # dense_4_f1
]

# ResNet50
## Defining an identity block

In [None]:
# https://medium.com/analytics-vidhya/understanding-and-implementation-of-residual-networks-resnets-b80f9a507b9c
def identity_block(X, f, filters, block, activation='relu'):
    """
    Implementation of the identity block
    
    Arguments:
    X: input tensor
    f: shape for middle CONV kernel size param
    filters: list of number of filters in the CONV layers of the main path
    block: name of this block
    
    Returns:
    X: output, returns a tensor
    """
    
    
    conv_name = 'conv' + block
    bn_name = 'batchNorm' + block
    
    # get filters from parameter
    F1, F2, F3 = filters
    
    # copy the original shape to add it back to the main path
    X_copy = X
    
    # First component of main path
    X = Conv2D(filters = F1, kernel_size = (1, 1), strides = (1,1), padding = 'valid', name = conv_name + 'a', kernel_initializer = GlorotUniform(seed = 0))(X)
    X = BatchNormalization(axis = 3, name = bn_name + 'a')(X)
    X = Activation(activation)(X)
    
    
    # Second component of main path
    X = Conv2D(filters = F2, kernel_size = (f, f), strides = (1,1), padding = 'same', name = conv_name + 'b', kernel_initializer = GlorotUniform(seed = 0))(X)
    X = BatchNormalization(axis = 3, name = bn_name + 'b')(X)
    X = Activation(activation)(X)

    # Third component of main path 
    X = Conv2D(filters = F3, kernel_size = (1, 1), strides = (1,1), padding = 'valid', name = conv_name + 'c', kernel_initializer = GlorotUniform(seed = 0))(X)
    X = BatchNormalization(axis = 3, name = bn_name + 'c')(X)

    # add shortcut back to main path, and use relu activation
    X = Add()([X, X_copy])
    X = Activation(activation)(X)
    
    return X

## Defining a convolutional block

In [None]:
def convolutional_block(X, f, filters, block, s = 2, activation='relu'):
    """
    Implementation of the convolutional block
    
    Arguments:
    X: input tensor
    f: shape for middle CONV kernel size param
    filters: list of number of filters in the CONV layers of the main path
    block: name of this block
    s: stride param to be used for shortcut component
    
    Returns:
    X: output, returns a tensor

    """

    conv_name = 'conv' + block
    bn_name = 'batchNorm' + block
    
    # Retrieve Filters
    F1, F2, F3 = filters
    
    # Save the input value
    X_copy = X

    # First component
    X = Conv2D(F1, (1, 1), strides = (s,s), name = conv_name + 'a', kernel_initializer  = GlorotUniform(seed= 0))(X)
    X = BatchNormalization(axis = 3, name = bn_name + 'a')(X)
    X = Activation(activation)(X)

    # Second component
    X = Conv2D(F2, (f,f), strides = (1,1), padding = 'same', name = conv_name + 'b', kernel_initializer = GlorotUniform(seed = 0))(X)
    X = BatchNormalization(axis = 3, name = bn_name + 'b')(X)
    X = Activation(activation)(X)

    # Third component
    X = Conv2D(F3, (1,1), strides = (1,1), padding = 'valid', name = conv_name + 'c', kernel_initializer = GlorotUniform(seed = 0))(X)
    X = BatchNormalization(axis = 3, name = bn_name + 'c')(X)

    # Shortcut
    X_copy = Conv2D(F3, (1,1), strides = (s,s), padding = 'valid', name = conv_name + 'd', kernel_initializer = GlorotUniform(seed = 0))(X_copy)
    X_copy = BatchNormalization(axis = 3, name = bn_name + 'd')(X_copy)

    # add shortcut back to main path, and use relu activation
    X = Add()([X, X_copy])
    X = Activation(activation)(X)
    
    return X

## Implementing a ResNet50 architecture

In [None]:
def ResNet50(input_shape = (27, 27, 3), classes = 2):
    """
    Implementation using the following architecture:
    CONV2D -> BATCHNORM -> RELU -> MAXPOOL -> CONVBLOCK -> IDBLOCK*2 -> CONVBLOCK -> IDBLOCK*3
    -> AVGPOOL -> TOPLAYER

    The typical 5 stages are reduced to 3 to save time and memory.

    Arguments:
    input_shape: shape of image, currently is 27x27
    classes: integer, number of classes to identify
    
    returns a Model() instance.
    """
    
    # set x_input as a tensor with shape input_shape
    X_input = Input(input_shape)

    
    # add padding for tensor
    X = ZeroPadding2D((3, 3))(X_input)

    # ResNet only works for images of 30x30 pixels or larger, so the padding above enlarges the 27x27 input
    
    # first stage
    X = Conv2D(64, (7, 7), strides = (2, 2), name = 'conv1', kernel_initializer = initializers.RandomNormal(stddev=0.01))(X)
    X = BatchNormalization(axis = 3, name = 'bn_conv1')(X)
    X = Activation('relu')(X)
    X = MaxPooling2D((3, 3), strides=(2, 2))(X)

    # second
    X = convolutional_block(X, f = 3, filters = [64, 64, 256], block='2a', s = 1)
    X = identity_block(X, 3, [64, 64, 256], block='2b')
    X = identity_block(X, 3, [64, 64, 256], block='2c')

    # pad the stage 2 output (padding X_input here would discard the first two stages)
    X = ZeroPadding2D((1, 1))(X)
    
    # third
    X = convolutional_block(X, f = 3, filters = [128, 128, 512], block='3a', s = 2)
    X = identity_block(X, 3, [128, 128, 512], block='3b')
    X = identity_block(X, 3, [128, 128, 512], block='3c')
    X = identity_block(X, 3, [128, 128, 512], block='3d')
    
    # avg pooling
    X = AveragePooling2D()(X)

    # output layer
    X = Flatten()(X)
    X = Dense(classes, activation='sigmoid', name='fc' + str(classes), kernel_initializer = GlorotUniform(seed=0))(X)
    
    
    # Create model
    model = Model(inputs = X_input, outputs = X, name='ResNet50')

    return model
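
As a rough check on how much spatial resolution survives the three stages, the width of a 27x27 patch can be traced with the usual output-size formula `floor((n + 2p - k) / s) + 1`. The sketch below is plain Python with a hypothetical helper `out_size`. It assumes that the second `ZeroPadding2D` is applied to the stage 2 output and that each convolutional block downsamples with a 1x1 convolution of stride `s`.

```python
def out_size(n, kernel, stride, pad=0):
    """Output width of a conv/pool layer with 'valid' padding plus explicit zero padding."""
    return (n + 2 * pad - kernel) // stride + 1

n = 27
n = n + 2 * 3                        # ZeroPadding2D((3, 3))
n = out_size(n, kernel=7, stride=2)  # conv1: 7x7, stride 2
n = out_size(n, kernel=3, stride=2)  # max pooling: 3x3, stride 2
stage2 = n                           # stage 2 uses s = 1, so the size is unchanged
n = n + 2 * 1                        # ZeroPadding2D((1, 1))
n = out_size(n, kernel=1, stride=2)  # stage 3 shortcut/main path: 1x1, stride 2
n = out_size(n, kernel=2, stride=2)  # AveragePooling2D with its default 2x2 pool
flat_features = n * n * 512          # Flatten() over 512 channels

print(stage2, n, flat_features)
```

Under these assumptions very little spatial extent is left before `Flatten()`, which is one reason for stopping at three stages on patches this small.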

In [None]:
model_resnet50_t1 = ResNet50()

In [None]:
model_resnet50_t1.summary()

In [None]:
opt = Adam(learning_rate=0.0045, amsgrad = True)
model_resnet50_t1.compile(optimizer=opt, loss='binary_crossentropy', metrics=METRICS)

history_resnet50_t1 = fit_model(model_resnet50_t1, train_iterator, val_iterator, 
                                export_dir="",
                                name="resnet50_t1")

In [None]:
evaluation_resnet50_t1 = model_resnet50_t1.evaluate(test_iterator)

# AlexNet

In [None]:
# Import libraries (all from tensorflow.keras, to avoid mixing it with standalone keras)
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout, Flatten, Conv2D, MaxPooling2D
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.optimizers import SGD, Adagrad
import numpy as np

np.random.seed(1000)

#Instantiation
AlexNet = Sequential()

#1st Convolutional Layer
AlexNet.add(Conv2D(filters=96, input_shape=(27, 27, 3), kernel_size=(11,11), strides=(4,4), padding='same'))
AlexNet.add(BatchNormalization())
AlexNet.add(Activation('relu'))
AlexNet.add(MaxPooling2D(pool_size=(3,3), strides=(2,2), padding='same'))

#2nd Convolutional Layer
AlexNet.add(Conv2D(filters=256, kernel_size=(5, 5), strides=(1,1), padding='same'))
AlexNet.add(BatchNormalization())
AlexNet.add(Activation('relu'))
AlexNet.add(MaxPooling2D(pool_size=(3,3), strides=(2,2), padding='same'))

#3rd Convolutional Layer
AlexNet.add(Conv2D(filters=384, kernel_size=(3,3), strides=(1,1), padding='same'))
AlexNet.add(BatchNormalization())
AlexNet.add(Activation('relu'))

#4th Convolutional Layer
AlexNet.add(Conv2D(filters=384, kernel_size=(3,3), strides=(1,1), padding='same'))
AlexNet.add(BatchNormalization())
AlexNet.add(Activation('relu'))

#5th Convolutional Layer
AlexNet.add(Conv2D(filters=256, kernel_size=(3,3), strides=(1,1), padding='same'))
AlexNet.add(BatchNormalization())
AlexNet.add(Activation('relu'))
AlexNet.add(MaxPooling2D(pool_size=(3,3), strides=(1,1), padding='same'))

#Passing it to a Fully Connected layer
AlexNet.add(Flatten())
# 1st Fully Connected Layer
AlexNet.add(Dense(4096))  # the input size is inferred from Flatten()
AlexNet.add(BatchNormalization())
AlexNet.add(Activation('relu'))
# Add Dropout to prevent overfitting
AlexNet.add(Dropout(0.5))

#2nd Fully Connected Layer
AlexNet.add(Dense(4096))
AlexNet.add(BatchNormalization())
AlexNet.add(Activation('relu'))
#Add Dropout
AlexNet.add(Dropout(0.7))

#3rd Fully Connected Layer
AlexNet.add(Dense(4096))
AlexNet.add(BatchNormalization())
AlexNet.add(Activation('relu'))
#Add Dropout
AlexNet.add(Dropout(0.9))

#Output Layer
AlexNet.add(Dense(10))
AlexNet.add(BatchNormalization())
AlexNet.add(Activation('softmax'))

#Model Summary
AlexNet.summary()

In [None]:
# https://towardsdatascience.com/implementing-alexnet-cnn-architecture-using-tensorflow-2-0-and-keras-2113e090ad98

from tensorflow.keras.optimizers import SGD, Adagrad

def AlexNet(input_shape=(27, 27, 3), classes = 2):
    model = keras.models.Sequential([
        keras.layers.Conv2D(filters=96, kernel_size=(1,1), strides=(1,1), activation='relu', input_shape=input_shape),
        keras.layers.BatchNormalization(),
        
        keras.layers.Conv2D(filters=96, kernel_size=(3,3), strides=(1,1), activation='relu', padding="same"),
        keras.layers.BatchNormalization(),

        keras.layers.MaxPool2D(pool_size=(3,3), strides=(2,2)),
        
        keras.layers.Conv2D(filters=256, kernel_size=(5,5), strides=(1,1), activation='relu', padding="same"),
        keras.layers.BatchNormalization(),
        
        keras.layers.MaxPool2D(pool_size=(3,3), strides=(2,2)),
        
        keras.layers.Conv2D(filters=384, kernel_size=(3,3), strides=(1,1), activation='relu', padding="same"),
        keras.layers.BatchNormalization(),
        
        keras.layers.Conv2D(filters=384, kernel_size=(3,3), strides=(1,1), activation='relu', padding="same"),
        keras.layers.BatchNormalization(),
        
        keras.layers.Conv2D(filters=256, kernel_size=(3,3), strides=(1,1), activation='relu', padding="same"),
        keras.layers.BatchNormalization(),
        
        keras.layers.MaxPool2D(pool_size=(3,3), strides=(1,1)),
        
        keras.layers.Flatten(),
        
        keras.layers.Dense(4096, activation='relu'),
        keras.layers.Dropout(0.5),
        
        keras.layers.Dense(4096, activation='relu'),
        keras.layers.Dropout(0.7), # change to 0.7 to hit 93% @ 48 epochs

        keras.layers.Dense(classes, activation='sigmoid')
    ])

    return model

In [None]:
opt = Adam(learning_rate=0.0035)  # note: not used below, the model is compiled with plain SGD

alex_model = AlexNet()
alex_model.compile(optimizer="SGD", loss='binary_crossentropy', metrics=METRICS)

alex_model.summary()

In [None]:
history_alexnet_t1 = fit_model(alex_model, train_iterator, val_iterator, name="AlexNet_Task1")

# *Task 2: Classify images according to cell type, such as fibroblast, inflammatory, epithelial, or others*

In [None]:
df_label = df_label_task2

<a id ="IV"></a>
<h2 style = "text-align:center; color:white; font-weight:600; padding:0.5em; background-color: #B02B00; border-radius:25px ; box-shadow: 0 0 20px 0 #ACA7CB; margin-right:3em; margin-bottom:1em">Ⅰ. Cleaning dataset  </h2>

#### Before cleaning the data, let's display the dataset to get a general overview

In [None]:
df_label

<a id ="IV.A1"></a>

### *1. Check Data type*

In [None]:
## Summarize each column's dtype and non-null count
df_label.info()

### 📚 Reason: 
We check the data types to get an overall understanding of the dataset and to identify which columns to keep or convert for later encoding and better model performance.

<a id ="IV.A2"></a>

### *2. Checking missing values*

### 📚 Reason: 
Missing values can cause errors during training, so we need to check whether any exist and fill them if so.

In [None]:
# Check whether the dataset has any missing values
df_label.isnull().values.any()

### 🔬 Observation: 
Since the returned value is `False`, we can conclude that the dataset has no missing values. However, we should double-check every column.

In [None]:
# Count the missing values in each column
df_label.isnull().sum()

### 🔬 Observation: 
No column in the dataset has missing values, so we do not need to fill anything.

<a id ="IV.A3"></a>

### *3. Check for typos*

In [None]:
celltype_name_values = df_label['cellTypeName'].nunique(dropna=False)
print(celltype_name_values)

In [None]:
print(df_label['cellTypeName'].unique())

### 📚 Reason: 
Typos create redundant categories and waste run time and storage (for example, `fibroblast` and `fibreblast` would be treated as two different values with the same meaning), so we check for typos to prevent duplicated categories.

### 🔬 Observation: 
The `cellTypeName` column has exactly 4 distinct values (fibroblast, inflammatory, epithelial, and others), so there are no typos.

<a id ="IV.A4"></a>

### *4. Convert string column to uppercase*

### 🔬 Observation: 
Since the `cellTypeName` column contains only 4 values (fibroblast, inflammatory, epithelial, and others), we do not need to convert it to lowercase or uppercase. However, in a larger dataset with many values, we should convert everything to a single case to avoid duplicates.

<a id ="IV.A5"></a>

### 5. Eliminate extra white spaces 

### 🔬 Observation: 
Since the `cellTypeName` column has exactly 4 distinct values (fibroblast, inflammatory, epithelial, and others), there is no extra white space.
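
On a larger or messier label column, the typo, case, and white-space checks from steps 3 to 5 could be combined by normalizing the strings and comparing the number of distinct values before and after. The following is a minimal sketch on a toy `pandas` Series with deliberately planted variants; it is not the real `cellTypeName` column.

```python
import pandas as pd

# Toy column with deliberate case and white-space variants (not the real data)
names = pd.Series(["fibroblast", "Fibroblast ", " epithelial", "epithelial", "others"])

raw_unique = names.nunique()

# Strip surrounding white space and lowercase everything
cleaned = names.str.strip().str.lower()
clean_unique = cleaned.nunique()

print(raw_unique, clean_unique)
```

If the two counts differ, some categories only differed by case or white space and were merged by the normalization.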

<a id ="IV.A6"></a>

### *6. Check duplication*

In [None]:
# An empty DataFrame means there are no duplicated rows in df_label
duplicate_values = df_label[df_label.duplicated()]
print(duplicate_values)

### 📚 Reason: 
Duplicated data wastes run time and storage, so we need to check whether any rows are duplicated and drop them.

### 🔬 Observation: 
There are no duplicated rows in the dataset, so we do not need to drop any.

<a id ="IV.A7a"></a>

### 7. Check impossible values 

### 📚 Reason: 
Sometimes a dataset contains impossible values, such as a negative age, so we check for them and drop or fix them to improve the accuracy of the machine learning models.

### 🔬 Observation: 
In this dataset, all values are reasonable, so there are no impossible values.
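
One way such a check might be automated is to compare a column against the set of values it is allowed to take. The sketch below uses a toy DataFrame; the valid code set `{0, 1, 2, 3}` for `cellType` is an assumption made for illustration and should be confirmed against the real labels.

```python
import pandas as pd

# Toy labels; the valid code set {0, 1, 2, 3} is an assumption for illustration
toy = pd.DataFrame({"cellType": [0, 1, 2, 3, 7, -1]})
valid_codes = {0, 1, 2, 3}

# Rows whose code falls outside the allowed set
impossible = toy[~toy["cellType"].isin(valid_codes)]
print(impossible)
```

An empty result would support the observation that no impossible values are present.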

<a id ="IV.B8"></a>

### 8. Check outlier 

In [None]:
plt.rcParams['figure.figsize'] = [10, 7.5]
# plot the boxplot to see the outlier of each numerical column
sns.boxplot(data=df_label,orient="v")
plt.title("Box Plot Distribution", y = 1, fontsize = 20, pad = 40);

### 🔬 Observation:
According to the box plots, there are no outliers in this dataset.
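
The visual check can be backed by the rule a box plot conventionally uses for its whiskers: points more than 1.5 times the interquartile range (IQR) beyond the quartiles are flagged. A minimal sketch on toy numbers (with one planted outlier, not taken from the dataset) might look like this:

```python
import pandas as pd

values = pd.Series([10, 12, 11, 13, 12, 11, 40])  # toy data, 40 is a planted outlier

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the whisker range are treated as outliers
outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())
```

Applying the same rule to each numerical column of `df_label` should return empty results if the plot is read correctly.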

### 📚 Reason: 
After cleaning the data, we should display it again to double-check it and to find the count, mean, min, 25%, 50%, 75%, and max in preparation for the EDA step in the next section.

<a id ="VI"></a>

<h2 style = "text-align:center; color:white; font-weight:600; padding:0.5em; background-color: #B02B00; border-radius:25px ; box-shadow: 0 0 20px 0 #ACA7CB; margin-right:3em; margin-bottom:1em">II. Exploratory Data Analysis (EDA)
</h2>

<a id ="V.1"></a>

<h3 style = "color : #F06200; font-style:italic; letter-spacing:0.075em;">1. My hypotheses </h3>

1. The number of samples will differ between cell types.

2. The `others` class will have the fewest samples compared to the other classes.


In [None]:
import seaborn as sns
sns.set_theme(style="darkgrid")
ax = sns.countplot(y="cellTypeName", data=df_label)
ax.set_title("Bar chart to display the total number of each type in cell type name", fontsize=15)
for bars in ax.containers:
    ax.bar_label(bars)
    

### 🔬 Observation: 
In this plot, the vertical axis is cellTypeName and the horizontal axis is the count (total number of samples). Overall, the epithelial class has the most samples (4079), whereas the others class has the fewest (1386). In addition, the differences between classes are large, so we should consider addressing this class imbalance in the feature engineering step.

After observation, we can conclude that our hypotheses are correct.
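
To put a number on the imbalance, the two counts quoted in the observation can be turned into a ratio. This is only a small arithmetic sketch; the counts of the two remaining classes are not repeated here, so they are left out.

```python
# Counts quoted in the observation above
largest_class = 4079   # epithelial
smallest_class = 1386  # others

imbalance_ratio = largest_class / smallest_class
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
```

The largest class is close to three times the size of the smallest one, which supports treating the imbalance explicitly.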

<a id ="VI"></a>

<h2 style = "text-align:center; color:white; font-weight:600; padding:0.5em; background-color: #B02B00; border-radius:25px ; box-shadow: 0 0 20px 0 #ACA7CB; margin-right:3em; margin-bottom:1em">III. Feature Engineering</h2>

<a id ="VI.1"></a>

<h3 style = "color : #F06200; font-style:italic; letter-spacing:0.075em;">1. Drop columns unrelated to the target</h3>

### 🔬 Observation: 
Since all columns in the dataset are necessary, we do not need to drop any columns.

<a id ="VI.2"></a>

<h3 style = "color : #F06200; font-style:italic; letter-spacing:0.075em;">2. Class Imbalances</h3>

In [None]:
print(df_label['cellTypeName'].value_counts())
print(df_label['cellTypeName'].value_counts(normalize=True, dropna=False))

### 📚 Reason: 
We need to rebalance the classes because the accuracy of the cell-type prediction can be affected by the number of samples in each class. In other words, if one class has more samples than the others, the model is likely to predict that class better, and the predictions for the remaining classes may be worse. In this particular situation, since the differences between the classes are large, we can rebalance them using the <strong>upsampling method</strong>.

### 🔬 Observation: 
After upsampling, every class has the same number of samples, so we can move on to the next step.
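
A minimal sketch of upsampling with `sklearn.utils.resample` is shown here on a toy label table. The toy DataFrame and the choice to match every class to the largest class size are assumptions for illustration; the notebook's own upsampling step may differ.

```python
import pandas as pd
from sklearn.utils import resample

# Toy label table (not the real data): one majority class and two smaller ones
toy = pd.DataFrame({
    "ImageName": [f"img_{i}.png" for i in range(9)],
    "cellTypeName": ["epithelial"] * 5 + ["fibroblast"] * 3 + ["others"] * 1,
})

target_size = toy["cellTypeName"].value_counts().max()

# Resample every class with replacement up to the size of the largest class
balanced = pd.concat([
    resample(group, replace=True, n_samples=target_size, random_state=42)
    for _, group in toy.groupby("cellTypeName")
])

print(balanced["cellTypeName"].value_counts())
```

One design point worth considering: upsampling only the training split, after the train/test split, avoids copies of the same image ending up in both sets.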

In [None]:
import cv2
import os
list_of_images = []

for path in df_label['ImageName']:
  image_path = os.path.join("./patch_images", path)
  image = cv2.imread(image_path , cv2.IMREAD_GRAYSCALE)
  list_of_images.append(image)

list_of_images = np.asarray(list_of_images)
np.array(list_of_images).shape

In [None]:
list_of_images = np.reshape(list_of_images,  (-1 , 27 * 27))
list_of_images = pd.DataFrame(list_of_images)
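
The reshape above turns each 27x27 grayscale patch into one row of 729 pixel features, which is the tabular layout XGBoost expects. The sketch below illustrates the idea on randomly generated stand-in patches (an assumption, so that it runs without the image files).

```python
import numpy as np

# Five random 27x27 grayscale patches standing in for the images loaded with cv2
rng = np.random.default_rng(0)
patches = rng.integers(0, 256, size=(5, 27, 27), dtype=np.uint8)

flat = patches.reshape(-1, 27 * 27)   # one row of 729 pixel features per image
restored = flat.reshape(-1, 27, 27)   # reshaping back recovers the original patches

print(flat.shape)
```

Because the reshape is lossless, the only information given up at this stage is the color, which was dropped when the images were read as grayscale.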

<a id ="VII"></a>
<h2 style = "text-align:center; color:white; font-weight:600; padding:0.5em; background-color: #B02B00; border-radius:25px ; box-shadow: 0 0 20px 0 #ACA7CB; margin-right:3em; margin-bottom:1em">IV. Model Building</h2>

<a id ="VII.1"></a>

<h3 style = "color : #F06200; font-style:italic; letter-spacing:0.075em;">1.Split dataframe </h3>

In [None]:
train_df_bl, test_df_bl = train_test_split(np.array(df_label), test_size=0.2, random_state=42)
train_df_bl, val_df_bl = train_test_split(train_df_bl, test_size=0.2, random_state=42)
print(train_df_bl.shape)
print(test_df_bl.shape)
print(val_df_bl.shape)

### 🔬 Observation: 
After splitting the data, we can start training the models.

In [None]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df_label, test_size=0.2, random_state=9999)
train_df, val_df = train_test_split(train_df, test_size=0.25, random_state=9999)
print("Train data : {}, Val Data: {}, Test Data: {}".format(train_df.shape[0], val_df.shape[0], test_df.shape[0]))
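
Because the second split is taken from the remaining training portion, its `test_size` is a fraction of that portion and not of the full dataset. The small sketch below, using a hypothetical helper `split_fractions`, works out the resulting overall proportions for two chained splits.

```python
def split_fractions(test_size, val_size):
    """Return (train, val, test) fractions of the full dataset for two chained splits."""
    test = test_size
    val = (1 - test_size) * val_size  # val_size applies to what is left after the test split
    train = 1 - test - val
    return train, val, test

# 0.2 followed by 0.25, as used for the iterators
train, val, test = split_fractions(0.2, 0.25)
print(f"train={train:.2f}, val={val:.2f}, test={test:.2f}")

# 0.2 followed by 0.2, as used for the array-based split
train_bl, val_bl, test_bl = split_fractions(0.2, 0.2)
print(f"train={train_bl:.2f}, val={val_bl:.2f}, test={test_bl:.2f}")
```

So a `test_size` of 0.25 in the second call gives equal-sized validation and test sets, while 0.2 gives a slightly smaller validation set.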

In [None]:
train_iterator = get_dataframe_iterator(train_df)
val_iterator = get_dataframe_iterator(val_df)
test_iterator = get_dataframe_iterator(test_df)

<h3 style = "color : #F06200; font-style:italic; letter-spacing:0.075em;">2. XG Boost</h3>

In [None]:
from sklearn.model_selection import train_test_split

cancer_training_X , cancer_test_X, cancer_training_Y , cancer_test_Y = train_test_split(
    list_of_images, 
    df_label['cellType'], 
    train_size = 0.8, 
    random_state = 9999, 
    shuffle = True)

print("Training X shape: " , cancer_training_X.shape)
print("Training Y shape: " , cancer_training_Y.shape)
print("Testing X shape: " , cancer_test_X.shape)
print("Testing Y shape: " , cancer_test_Y.shape)

### 1. Default model (without any parameters)

In [None]:
import xgboost as xgb
from sklearn.metrics import classification_report
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score

# multi:softprob is the multi-class objective; cellType has four classes in this task
xgbr_task2 = xgb.XGBClassifier(objective='multi:softprob')
xgbr_task2.fit(cancer_training_X, cancer_training_Y)

y_pred_validate_task2 = xgbr_task2.predict(cancer_test_X)
prediction_validate_task2 = [round(value) for value in y_pred_validate_task2]
# evaluate predictions
accuracy = accuracy_score(cancer_test_Y, prediction_validate_task2)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
mse = mean_squared_error(cancer_test_Y, y_pred_validate_task2)
print("RMSE: %.2f" % (mse**(1/2.0)))
# accuracy = accuracy_score(validate_y_cancer, ypred)
# print("Accuracy: %.2f%%" % (accuracy * 100.0))

In [None]:
print("Validate report")
print(classification_report(cancer_test_Y, prediction_validate_task2))

### 🔬 Observation: 
The algorithm produces a high accuracy and a low RMSE. However, we should try to improve the model by finding the best parameters for this algorithm.


### 2. Using GridSearchCV to find best params and calculate RMSE

####  **Hyperparameter tuning chosen**


In [None]:
from sklearn.model_selection import GridSearchCV

params = { 'max_depth': [3,6,10],
           'learning_rate': [0.01, 0.05, 0.1],
           'n_estimators': [100, 500, 1000],
           'colsample_bytree': [0.3, 0.7]}
gs_xgbr_task2 = xgb.XGBClassifier(seed = 20)
grid_search_cv_task2 = GridSearchCV(estimator=gs_xgbr_task2, 
                   param_grid=params,
                   scoring='neg_mean_squared_error', 
                   verbose=1)
grid_search_cv_task2.fit(cancer_training_X, cancer_training_Y)
print("Best parameters:", grid_search_cv_task2.best_params_)
print("Lowest RMSE: ", (-grid_search_cv_task2.best_score_)**(1/2.0))
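
Before launching a grid search it can help to estimate its cost. The sketch below counts the parameter combinations in the grid above with `itertools.product` and multiplies by the number of cross-validation folds; it assumes the scikit-learn default of 5 folds, since no `cv` argument is passed.

```python
from itertools import product

params = {"max_depth": [3, 6, 10],
          "learning_rate": [0.01, 0.05, 0.1],
          "n_estimators": [100, 500, 1000],
          "colsample_bytree": [0.3, 0.7]}

# Every combination of one value per parameter is a candidate model
n_candidates = len(list(product(*params.values())))
cv_folds = 5  # GridSearchCV's default when cv is not passed
total_fits = n_candidates * cv_folds

print(n_candidates, total_fits)
```

With `n_estimators` going up to 1000 on 729 pixel features, this many fits can take a long time, which is part of the motivation for trying a randomized search as well.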

In [None]:
## Predict on the held-out split with the tuned model
gs_y_pred_validate_task2 = grid_search_cv_task2.predict(cancer_test_X)
gs_prediction_validate_task2 = [round(value) for value in gs_y_pred_validate_task2]

print("Validation report")
print(classification_report(cancer_test_Y, gs_prediction_validate_task2))

### 3. Using Randomized Search CV to find best params and calculate lowest RMSE

#### **Hyperparameter tuning chosen**


In [None]:
from sklearn.model_selection import RandomizedSearchCV

params = { 'max_depth': [3, 5, 6, 10, 15, 20],
           'learning_rate': [0.01, 0.1, 0.2, 0.3],
           'subsample': np.arange(0.5, 1.0, 0.1),
           'colsample_bytree': np.arange(0.4, 1.0, 0.1),
           'colsample_bylevel': np.arange(0.4, 1.0, 0.1),
           'n_estimators': [100, 500, 1000]}
rs_xgbr_task2 = xgb.XGBClassifier(seed = 20)
random_search_task2 = RandomizedSearchCV(estimator=rs_xgbr_task2,
                         param_distributions=params,
                         scoring='neg_mean_squared_error',
                         n_iter=25,
                         verbose=1)
random_search_task2.fit(cancer_training_X, cancer_training_Y)
print("Best parameters:", random_search_task2.best_params_)
print("Lowest RMSE: ", (-random_search_task2.best_score_)**(1/2.0))

In [None]:
## Predict on the held-out split with the tuned model
rd_y_pred_validate_task2 = random_search_task2.predict(cancer_test_X)
rd_prediction_validate_task2 = [round(value) for value in rd_y_pred_validate_task2]

print("Validation report")
print(classification_report(cancer_test_Y, rd_prediction_validate_task2))

### 🔬 Observation: 


In [None]:
# plot_confusion_matrix was removed from recent scikit-learn releases; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay
fig, ax1 = plt.subplots(1, 1, figsize=(20, 8))
ConfusionMatrixDisplay.from_estimator(random_search_task2, cancer_test_X, cancer_test_Y, ax=ax1)

ax1.set_title('Test Confusion Matrix')

In [None]:
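from sklearn.metrics import confusion_matrix

## Draw a summary table of actual versus predicted cell types
class_labels = sorted(cancer_test_Y.unique())
pd.DataFrame(confusion_matrix(cancer_test_Y, rd_y_pred_validate_task2, labels=class_labels),
             columns=["Predicted {}".format(c) for c in class_labels],
             index=["Actual {}".format(c) for c in class_labels])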

### 🔬 Observation: 


In [None]:
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score

# Task 2 is multi-class, so precision, recall, and F1 need an averaging mode;
# macro averaging (an assumption here) weights every cell type equally
evaluation_xgboost_t2 = [None]
evaluation_xgboost_t2.append(accuracy_score(cancer_test_Y, rd_y_pred_validate_task2))
evaluation_xgboost_t2.append(precision_score(cancer_test_Y, rd_y_pred_validate_task2, average='macro'))
evaluation_xgboost_t2.append(recall_score(cancer_test_Y, rd_y_pred_validate_task2, average='macro'))
evaluation_xgboost_t2.append(f1_score(cancer_test_Y, rd_y_pred_validate_task2, average='macro'))
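
With four cell types, precision, recall, and F1 need an averaging strategy, since scikit-learn's default `average='binary'` only applies to two-class targets. The sketch below uses made-up labels (not model output) to show per-class precision next to its macro average.

```python
from sklearn.metrics import accuracy_score, precision_score

# Toy labels for four classes (illustrative only)
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 0, 1, 0, 2, 2, 3, 1]

accuracy = accuracy_score(y_true, y_pred)
per_class = precision_score(y_true, y_pred, average=None)   # one value per class
macro = precision_score(y_true, y_pred, average="macro")    # unweighted mean of per_class

print(accuracy, per_class, macro)
```

Macro averaging treats every class equally, which may suit an imbalanced label set; `average="weighted"` would instead weight each class by its support.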

<h3 style = "color : #F06200; font-style:italic; letter-spacing:0.075em;">3. ResNet</h3>

In [None]:
from tensorflow.keras.regularizers import l2
from tensorflow.keras.layers import Dropout

def ResNet50_t2(input_shape = (27, 27, 3), classes = 2):
    """
    Implementation using the following architecture:
    CONV2D -> BATCHNORM -> RELU -> MAXPOOL -> CONVBLOCK -> IDBLOCK*3 -> CONVBLOCK -> IDBLOCK*3
    AVGPOOL -> TOPLAYER

    reducing the typical 5 stage to 3 stage to reduce time and memory expense

    Arguments:
    input_shape: shape of image, currently is 27x27
    classes: integer, number of classes to identify
    
    returns a Model() instance.
    """
    
    # set x_input as a tensor with shape input_shape
    X_input = Input(input_shape)

    
    # add padding for tensor
    X = ZeroPadding2D((3, 3))(X_input)

    # Since ResNet only works for images that are 30x30 pixels or larger, we add padding pixels so the algorithm can run
    
    # first stage
    X = Conv2D(64, (7, 7), strides = (2, 2), name = 'conv1', kernel_initializer = initializers.RandomNormal(stddev=0.01))(X)
    X = BatchNormalization(axis = 3, name = 'bn_conv1')(X)
    X = Activation('relu')(X)
    X = MaxPooling2D((3, 3), strides=(2, 2))(X)

    # second
    X = convolutional_block(X, f = 3, filters = [64, 64, 256], block='2a', s = 1)
    X = identity_block(X, 3, [64, 64, 256], block='2b')
    X = identity_block(X, 3, [64, 64, 256], block='2c')
    X = identity_block(X, 3, [64, 64, 256], block='2d')

    # third
    X = convolutional_block(X, f = 3, filters = [128, 128, 512], block='3a', s = 2)
    X = identity_block(X, 3, [128, 128, 512], block='3b')
    X = identity_block(X, 3, [128, 128, 512], block='3c')
    X = identity_block(X, 3, [128, 128, 512], block='3d')
    
    # avg pooling
    X = AveragePooling2D()(X)

    # output layer
    X = Flatten()(X)
    X = Dense(1024, activation='relu')(X)
    X = Dropout(0.2)(X)
    X = Dense(classes, activation='softmax', name='fc', kernel_initializer = GlorotUniform(seed=0) , kernel_regularizer=l2(0.01))(X)
    
    
    # Create model
    model = Model(inputs = X_input, outputs = X, name='ResNet50')

    return model

In [None]:
model_resnet50_t2 = ResNet50_t2(classes=4)

In [None]:
opt = Adam(learning_rate=0.00052)
model_resnet50_t2.compile(optimizer=opt, loss='categorical_crossentropy', 
                          metrics=METRICS)

history_resnet50_t2 = fit_model(model_resnet50_t2, train_iterator, val_iterator,
                                export_dir="",
                                name='ResNet50_Task2')

In [None]:
evaluation_resnet50_t2 = model_resnet50_t2.evaluate(test_iterator)

<h3 style = "color : #F06200; font-style:italic; letter-spacing:0.075em;">4. GoogleNet</h3>

<h3 style = "color : #F06200; font-style:italic; letter-spacing:0.075em;">5. Alex Net</h3>

<h3 style = "color : #F06200; font-style:italic; letter-spacing:0.075em;">5. Alex Net (Supervised)</h3>