![Cancer](https://media2.giphy.com/media/sCqnpiUFN228E/giphy.gif)

# Introduction

Human health is one of the most important concerns worldwide, and methods for preventing and detecting disease attract considerable research interest. Cancer is among the illnesses with the greatest impact on human health; a malignant tumor is a cancerous growth that can invade surrounding tissue. Colon cancer, alongside breast cancer and lung cancer, is one of the deadliest cancers in the United States: it ranks third and killed 49,190 people in 2016 [1]. It begins in the colon, the main part of the large intestine, which forms the final section of the digestive system.

This assignment applies machine learning to two tasks in colon cancer: detecting malignant cells and differentiating cell types. The notebook develops and evaluates three deep learning architectures (AlexNet, ResNet50, and VGG19), with XGBoost as the sole non-deep-learning option.

# Import necessary libraries

In [1]:
conda install -c conda-forge keras-preprocessing

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.


Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn as sk
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

## Load dataset

In [3]:
df_label = pd.read_csv("data_labels_mainData.csv")
df_label_extra = pd.read_csv("data_labels_extraData.csv")

# Data Processing 


In [4]:
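# Count samples per class in the main data (0 = non-cancerous, 1 = cancerous)
is_cancer_class_count = df_label.isCancerous.value_counts()
# Number of extra samples needed to balance the two classes
amount_for_balance = abs(is_cancer_class_count[0] - is_cancer_class_count[1])
# Top up with cancerous samples from the extra data (assumes the cancerous class
# is the minority); a fixed seed keeps the draw reproducible
df_random_cancer_from_extra = df_label_extra[df_label_extra['isCancerous'] == 1].sample(amount_for_balance, random_state=9999)
df_label = pd.concat([df_label, df_random_cancer_from_extra], ignore_index=True)
df_label.isCancerous.value_counts()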

0    5817
1    5817
Name: isCancerous, dtype: int64
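
The balancing step can be tried on toy data. The following is a minimal sketch with made-up frames (`main` and `extra` are hypothetical stand-ins for `df_label` and `df_label_extra`); it assumes the cancerous class is the minority and that the extra data holds enough cancerous rows to fill the gap.

```python
import pandas as pd

# Toy stand-ins for the main and extra label tables (hypothetical data)
main = pd.DataFrame({"ImageName": [f"m{i}.png" for i in range(10)],
                     "isCancerous": [0] * 7 + [1] * 3})
extra = pd.DataFrame({"ImageName": [f"e{i}.png" for i in range(8)],
                      "isCancerous": [1] * 6 + [0] * 2})

# Gap between the two classes in the main table: 7 - 3 = 4
counts = main["isCancerous"].value_counts()
shortfall = int(abs(counts.loc[0] - counts.loc[1]))

# Draw exactly that many cancerous rows from the extra table and append them
top_up = extra[extra["isCancerous"] == 1].sample(shortfall, random_state=0)
balanced = pd.concat([main, top_up], ignore_index=True)
print(balanced["isCancerous"].value_counts().to_dict())
```

Because only existing extra rows are added, no image is duplicated or synthesized; the classes end up equal in size as long as the extra table is large enough.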

In [5]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df_label, test_size=0.2, random_state=9999)
train_df, val_df = train_test_split(train_df, test_size=0.25, random_state=9999)

print("Train data : {}, Val Data: {}, Test Data: {}".format(train_df.shape[0], val_df.shape[0], test_df.shape[0]))

Train data : 6980, Val Data: 2327, Test Data: 2327
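
The two successive splits give a 60/20/20 partition: 20% of the rows go to the test set, then 25% of the remaining 80% (another 20% of the total) go to validation. The sketch below reproduces the printed sizes with plain arithmetic. It assumes the held-out share is rounded up to a whole number of rows and the remainder is kept; `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_kept, n_held_out) for a fractional test_size.

    Assumption: the held-out count is rounded up and the rest is kept,
    which matches the sizes printed above.
    """
    n_held_out = math.ceil(n_samples * test_size)
    return n_samples - n_held_out, n_held_out

n_total = 5817 * 2  # balanced dataset: two classes of 5817 rows each
n_train_val, n_test = split_sizes(n_total, 0.2)   # first split: hold out the test set
n_train, n_val = split_sizes(n_train_val, 0.25)   # second split: hold out validation
print(n_train, n_val, n_test)
```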


In [6]:
# document: https://keras.io/api/preprocessing/image/#imagedatagenerator-class
from keras_preprocessing.image import ImageDataGenerator

def get_dataframe_iterator(dataframe, 
                            image_shape = (27, 27), 
                            batch_size = 64,
                            x_col = "ImageName",
                            y_col = "cellTypeName",
                            classes = ("fibroblast", "inflammatory", "epithelial", "others"),
                            augment = False):
    # Work on a copy so the caller's dataframe is not modified;
    # class_mode="categorical" requires string labels
    dataframe = dataframe.copy()
    dataframe[y_col] = dataframe[y_col].astype(str)
    # Random augmentation is applied only on request (training data);
    # otherwise images are only rescaled to [0, 1]
    augmentation = dict(
        rotation_range = 20,
        width_shift_range=0.2,
        height_shift_range=0.2,
        zoom_range=0.2,
        horizontal_flip=True
    ) if augment else {}
    generator = ImageDataGenerator(rescale = 1./255, **augmentation)
    iterator = generator.flow_from_dataframe(
        dataframe = dataframe,
        directory = "./patch_images", 
        x_col = x_col,
        y_col = y_col,
        classes = list(classes), 
        class_mode = "categorical", 
        target_size = image_shape, 
        batch_size = batch_size,
    )
    return iterator

In [7]:
# Augment the training data only, so validation and test metrics are measured on unmodified images
train_iterator = get_dataframe_iterator(train_df, y_col='isCancerous', classes=['0','1'], augment=True)
val_iterator = get_dataframe_iterator(val_df, y_col='isCancerous', classes=['0','1'])
test_iterator = get_dataframe_iterator(test_df, y_col='isCancerous', classes=['0','1'])

Found 6980 validated image filenames belonging to 2 classes.
Found 2327 validated image filenames belonging to 2 classes.
Found 2327 validated image filenames belonging to 2 classes.


In [8]:
# Check for duplicate images

In [9]:
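import hashlib, os

image_dir = './patch_images/'
duplicates = []      # (index of the duplicate, index of the first file with the same content)
hash_keys = dict()   # MD5 digest -> index of the first file with that content
# Sort the listing so the indexes are stable across runs (os.listdir order is arbitrary)
for index, filename in enumerate(sorted(os.listdir(image_dir))):
    path = os.path.join(image_dir, filename)
    if os.path.isfile(path):
        with open(path, 'rb') as f:
            filehash = hashlib.md5(f.read()).hexdigest()
        if filehash not in hash_keys: 
            hash_keys[filehash] = index
        else:
            duplicates.append((index, hash_keys[filehash]))
            print(filename)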


15848.png
18581.png
4971.png


In [10]:
duplicates

[(5843, 5794), (8473, 8472), (15164, 14897)]
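
The index pairs above are hard to read on their own. As an alternative, here is a sketch that groups file names by content hash so that each duplicate set is reported by name. `find_duplicate_groups` is a hypothetical helper, and the demonstration runs on a temporary directory with made-up files rather than on `./patch_images`.

```python
import hashlib
import os
import tempfile
from collections import defaultdict

def find_duplicate_groups(directory):
    """Map each MD5 digest shared by two or more files to their sorted names."""
    groups = defaultdict(list)
    for filename in sorted(os.listdir(directory)):
        path = os.path.join(directory, filename)
        if os.path.isfile(path):
            with open(path, "rb") as f:
                groups[hashlib.md5(f.read()).hexdigest()].append(filename)
    # Keep only digests that occur more than once
    return {digest: names for digest, names in groups.items() if len(names) > 1}

# Demonstration on throwaway files: a.png and c.png contain the same bytes
with tempfile.TemporaryDirectory() as tmp:
    for name, payload in [("a.png", b"same"), ("b.png", b"different"), ("c.png", b"same")]:
        with open(os.path.join(tmp, name), "wb") as f:
            f.write(payload)
    duplicate_groups = find_duplicate_groups(tmp)

print(list(duplicate_groups.values()))
```

Run against the image directory, the second and later names in each group are the candidates to drop from the label dataframe, ideally before the train/validation/test split so that identical images cannot land in different sets.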

In [11]:
hash_keys

{'00d5f90bf22d694a58f8cf7ad98cedf9': 0,
 '7e472ee2df0b286c4302d597bc74d1cd': 1,
 'a770fa49a7baa1d2c8f1f8cba56c1d72': 2,
 'f3dabfcea8c9807136cb493bf35c743b': 3,
 'ef117899afcc9b518239882943ad00ef': 4,
 '0526bb187c92b6ae69dee06d1e22b02c': 5,
 '776e48c5e6164b1d9fd5d3b2248382e9': 6,
 'aeac839641cde44842c5f5d4e5116e72': 7,
 '5d19878863cfe1dc9bda629c31c61711': 8,
 'a1064bef08f94f232ffb48925b048da4': 9,
 'e3157e31d56d86876536f7c85e9f5033': 10,
 'a8b297482923d18d93adfe38e9a16929': 11,
 'd2ecec4745dff3285ceb3619a92919f4': 12,
 '73956ddcf342dcc35d79923507f1b4c9': 13,
 '7c90c8c4632259a532dbb8377ce32f1c': 14,
 'd4d5e9967eec97e36e087a69b4cb716b': 15,
 '931896b226188040c6e09f68d95467cd': 16,
 '0d6c713ab7aa641bb9ecd9f5fe2d9916': 17,
 '2c3eee4d9a64c5807dc8ae1a8d12261d': 18,
 '4c60ab5202ed246de1ae8c675ac3dbe2': 19,
 '3b5cea0995c127c530004b4c6dd37f74': 20,
 '065fd04601b17661c1c7a9c01d4960b2': 21,
 'dcb644664623facddc6b2f3cb136ea56': 22,
 '3534f0e059509dbd3c3e652bffae48a0': 23,
 ...

In [12]:
# Sorted listing, so positions match the indexes stored in `duplicates`
file_list = sorted(os.listdir('./patch_images/'))
# Print only the first few names; the full listing is very long
print(file_list[:10])

['1.png', '10.png', '100.png', '1000.png', '10000.png', '10001.png', '10002.png', '10003.png', '10004.png', '10005.png']

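In [None]:
# Show each duplicate pair side by side. plt.imread is used because
# scipy.misc.imread was removed from recent SciPy releases, so no SciPy downgrade is needed.
for file_indexes in duplicates[:30]:
    try:
        # file_list holds bare names, so prepend the image directory
        original_path = os.path.join('./patch_images/', file_list[file_indexes[1]])
        duplicate_path = os.path.join('./patch_images/', file_list[file_indexes[0]])

        plt.subplot(121)
        plt.imshow(plt.imread(original_path))
        plt.title(str(file_indexes[1])), plt.xticks([]), plt.yticks([])

        plt.subplot(122)
        plt.imshow(plt.imread(duplicate_path))
        plt.title(str(file_indexes[0]) + ' duplicate'), plt.xticks([]), plt.yticks([])
        plt.show()

    except OSError as e:
        # Skip pairs whose files cannot be read
        continue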