This kernel is my first try at building a NN with Keras and applying it to the cancer cell competition. Comments on any topic are more than welcome, as I am a very early beginner in data science :-)

**Things that I learned**
This is my first deep learning code, so obviously, it can only be a learning experience. Since I've read a few Kernels already, I figured if any other beginner like me stumbles upon my Kernel, maybe this might be helpful. If not, well at least I get to write feedback for myself ;-)

- I started by trying to build a [57k, 96, 96, 3] np.ndarray holding the arrays of all the images we need to classify. That seemed to work with a smaller set (I tried 25k and it was fine), but at 57k the Kernel just crashes. After some investigation (*puts Sherlock's hat on*) the issue seems to be memory overload. I mean, I'm just turning 57,000+ images into 96x96x3 arrays, what could go wrong?
- Next step is to try doing the prediction inside the for loop. Here's the idea: I'm still training my model (well, not really mine, rather the one used in week 2 of Course 4 of Andrew Ng's Coursera Deep Learning course) with a small amount of data, just to see if it's working. I'm taking baby steps, and I'll gradually add more data as things work (eventually, I hope). Once I have my first model, the prediction is done image by image: I take an image, convert it to an array, and predict it. Repeat ~57,000 times. This means no storing of the arrays, and (hopefully) no memory crash.
- I saw a tweet from Andrej Karpathy a few weeks ago saying that you should start with a small model and little data, train until it overfits, and only then move on. This lets you check that the model is working, and helps find potential (as a beginner, I'd replace 'potential' with 'numerous', but maybe it's just me) sources of error.
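To put a number on the memory issue described above, here is a rough back-of-the-envelope sketch. The helper `array_size_gib` is a name made up for this illustration, 57,000 is only the approximate image count, and I am assuming the arrays end up as float32 (which, as far as I know, is what `img_to_array` returns by default).

```python
import numpy as np

def array_size_gib(n_images, height=96, width=96, channels=3, dtype=np.float32):
    """Memory needed for an (n_images, height, width, channels) array, in GiB."""
    n_values = n_images * height * width * channels
    return n_values * np.dtype(dtype).itemsize / 1024**3

# Compare the 25k run that worked with the ~57k run that crashed
for n in (25_000, 57_000):
    for dtype in (np.uint8, np.float32, np.float64):
        size = array_size_gib(n, dtype=dtype)
        print(f"{n:>6} images as {np.dtype(dtype).name:<7}: {size:5.2f} GiB")
```

Under these assumptions the full float32 array needs roughly 5.9 GiB before the model or anything else is loaded, which would be consistent with the crash. Keeping the images as uint8 would need about a quarter of that.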

**TO DO**
Problem at the moment: memory overload.
- Train the model with about 2,000 training examples and 500 test examples, then do the prediction inside the loop, so no image arrays are stored for the submission set. In the submission loop: img_to_array, then predict, then put the prediction into sample_submission['label']. I think this is the file that has to be submitted, but I have to check that.
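The submission loop from the TO DO could look roughly like the sketch below. It is only an illustration: `predict_one_by_one`, `fake_load_image` and `fake_predict_proba` are hypothetical names, and the two fake functions are stand-ins so the snippet runs without the competition data or a trained model.

```python
import numpy as np
import pandas as pd

def predict_one_by_one(image_ids, load_image, predict_proba):
    """Predict each image separately so only one image array is in memory at a time.

    load_image(image_id) -> array of shape (96, 96, 3) with values in 0..255
    predict_proba(batch) -> array of shape (1, 2) with class probabilities
    """
    labels = []
    for image_id in image_ids:
        img = load_image(image_id)                    # one image only
        batch = np.expand_dims(img, axis=0) / 255.0   # shape (1, 96, 96, 3), rescaled to 0..1
        proba = predict_proba(batch)
        labels.append(int(np.argmax(proba, axis=1)[0]))
    return labels

# Stand-in loader: ids starting with 'a' give a white image, all others a black one
def fake_load_image(image_id):
    value = 255.0 if image_id.startswith('a') else 0.0
    return np.full((96, 96, 3), value, dtype=np.float32)

# Stand-in model: the "cancer" probability is simply the mean brightness
def fake_predict_proba(batch):
    p = float(batch.mean())
    return np.array([[1.0 - p, p]])

sample_submission = pd.DataFrame({'id': ['a1', 'b2', 'a3'], 'label': 0})
sample_submission['label'] = predict_one_by_one(
    sample_submission['id'], fake_load_image, fake_predict_proba)
print(sample_submission)
```

With the real data, `load_image` would presumably wrap `load_img` and `img_to_array` on the `.tif` path of each id, and `predict_proba` would be the trained model's `predict` method. Predicting one image per call is slow, so predicting in small batches is a possible middle ground.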

In [None]:
import numpy as np 
import math
import pandas as pd 
import cv2
import matplotlib.pyplot as plt
from PIL import Image

from keras.preprocessing.image import ImageDataGenerator, array_to_img, img_to_array, load_img
from keras import layers
from keras.layers import Input, Add, Dense, Activation, ZeroPadding2D, BatchNormalization, Flatten, Conv2D
from keras.layers import AveragePooling2D, MaxPooling2D, Dropout, GlobalMaxPooling2D, GlobalAveragePooling2D
from keras.models import Model, Sequential
from keras.optimizers import Adam
from keras.preprocessing import image
from keras.utils import layer_utils
from keras.utils.data_utils import get_file
from keras.applications import ResNet50
from keras.applications.imagenet_utils import preprocess_input
import pydot
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot
from keras.utils import plot_model

import keras.backend as K
K.set_image_data_format('channels_last')
from matplotlib.pyplot import imshow

%matplotlib inline

import time

import os
print(os.listdir("../input"))

In [None]:
train_labels = pd.read_csv('../input/train_labels.csv', dtype=str)
test_labels = pd.read_csv('../input/sample_submission.csv', dtype=str)

#print('train : ','\n', train_labels.head(5))
#print('test : ','\n', test_labels.head(5))

Very early remarks:

1. We have 220,025 images in train_labels, of which 89,117 are labelled as having a cancer pixel in the 32x32 center zone.
2. There are about 57.5k images in sample_submission.
3. Images are of size 96x96x3 (meaning RGB, for a total of 27,648 values per image).
4. The only information we have is whether or not there is a cancer cell (in the 32x32 center zone, according to the data description). We do not know what it looks like, which pixel is identified as the cancer, or what makes or doesn't make a cancer cell. This will make EDA relatively fast in my opinion, because there isn't much information we are going to be able to look at, apart from looking at pictures labelled 1 and visually trying to find patterns that differ from pictures labelled 0.
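The numbers in these remarks can be checked with a few lines. This is a minimal sketch: `center_crop` is a helper name invented here, and the image is a placeholder array rather than a real slide.

```python
import numpy as np

# Class balance from remark 1
n_total, n_positive = 220_025, 89_117
positive_fraction = n_positive / n_total
print(f"positive fraction: {positive_fraction:.3f}")

def center_crop(img, size=32):
    """Return the size x size zone in the middle of an (H, W, C) image."""
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

# Placeholder image: zeros everywhere, ones in the 32x32 center zone
img = np.zeros((96, 96, 3), dtype=np.uint8)
img[32:64, 32:64] = 1
crop = center_crop(img)

print("values per image:", img.size)
print("center zone shape:", crop.shape)
```

So about 40.5% of the training images are labelled positive, and the labelled zone covers rows and columns 32 to 63 of each 96x96 image.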


In [None]:
#This image is labelled as having a cancer cell.
image = plt.imread('../input/train/c18f2d887b7ae4f6742ee445113fa1aef383ed77.tif')
plt.imshow(image)
plt.show()

In [None]:
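num_classes = 2
my_model = Sequential()
# pooling='avg' collapses the ResNet50 feature map to one vector per image,
# so the Dense layer outputs shape (batch, num_classes) and matches the one-hot labels
my_model.add(ResNet50(include_top=False, pooling='avg', weights='imagenet'))
my_model.add(Dense(num_classes, activation = 'softmax'))
# Freeze the pretrained ResNet50 weights; only the new Dense layer is trained
my_model.layers[0].trainable = False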

In [None]:
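# categorical_crossentropy matches the 2-unit softmax output and class_mode='categorical' labels
my_model.compile(optimizer = Adam(lr=0.0001), loss = 'categorical_crossentropy', metrics = ['accuracy'])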

In [None]:
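sample_size = 2000
# Fraction of the sample used for training
ratio = 0.9
sample_train = train_labels[:sample_size]
sample_test = test_labels[:sample_size]

size_train = math.ceil(ratio*sample_train.shape[0])
# .copy() so that editing the 'id' column later does not touch a slice of the original frames
train_df = sample_train[:size_train].copy()
# Start at size_train (not size_train+1) so that no row is skipped
test_df = sample_test[size_train:].copy()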

print('sample size : ', sample_size, '\n',
      'ratio : ', ratio,'\n',
      'train size : ', train_df.shape[0],'\n',
      'test size : ', test_df.shape[0])

In [None]:
# used in the reference url: https://medium.com/@vijayabhaskar96/tutorial-on-keras-flow-from-dataframe-1fd4493d237c
def append_ext(fn):
    return fn + '.tif'

In [None]:
train_df['id']=train_df['id'].apply(append_ext)
test_df['id']=test_df['id'].apply(append_ext)

In [None]:
train_df.head()

In [None]:
train_batch_size = 10
val_batch_size = 10
valid_ratio = 0.25

# Only (1 - valid_ratio) of train_df ends up in the training subset
train_steps = np.ceil((train_df.shape[0]*(1 - valid_ratio)) / train_batch_size)
val_steps = np.ceil((train_df.shape[0]*valid_ratio) / val_batch_size)

In [None]:
data_generator = ImageDataGenerator(rescale = 1./255., validation_split=valid_ratio)

train_generator = data_generator.flow_from_dataframe(dataframe = train_df, 
                                                directory = '../input/train/',
                                               x_col = 'id',
                                               y_col = 'label',
                                               subset = 'training',
                                               batch_size = train_batch_size,
                                               shuffle = True,
                                               class_mode = 'categorical',
                                               target_size = (96, 96))
validation_generator = data_generator.flow_from_dataframe(dataframe = train_df,
                                                         directory = '../input/train/',
                                                        x_col = 'id',
                                                        y_col = 'label',
                                                        subset = 'validation',
                                                        batch_size=val_batch_size,
                                                        shuffle = True,
                                                        class_mode = 'categorical',
                                                        target_size = (96, 96))

test_datagen = ImageDataGenerator(rescale = 1./255.)

test_generator = test_datagen.flow_from_dataframe(dataframe = test_df,
                                                directory = '../input/test/',
                                               x_col = 'id',
                                               y_col = 'label',
                                               class_mode = None,
                                               shuffle = False,
                                               target_size = (96, 96))

In [None]:
STEP_SIZE_TRAIN=train_generator.n//train_generator.batch_size
STEP_SIZE_VALID=validation_generator.n//validation_generator.batch_size
STEP_SIZE_TEST=test_generator.n//test_generator.batch_size

print('step size for : ', '\n', 'train : ', STEP_SIZE_TRAIN,
     '\n', 'valid : ', STEP_SIZE_VALID,
     '\n', 'test : ', STEP_SIZE_TEST)

my_model.fit_generator(generator = train_generator,
                       steps_per_epoch=STEP_SIZE_TRAIN,
                       validation_data = validation_generator,
                       validation_steps=STEP_SIZE_VALID,
                       epochs = 3)

In [None]:
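# Evaluate on the validation split, since no separate labelled test arrays exist in this notebook
evaluation = my_model.evaluate_generator(generator = validation_generator, steps = STEP_SIZE_VALID)
print()
print ("Loss = " + str(evaluation[0]))
print ("Validation Accuracy = " + str(evaluation[1]))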

In [None]:
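# Predict on the test subset and keep the most probable class for each image
test_generator.reset()
pred = my_model.predict_generator(test_generator, steps = int(np.ceil(test_generator.n / test_generator.batch_size)))
test_sample = test_df.copy()
test_sample['label'] = np.argmax(pred, axis=1)  # column order follows train_generator.class_indices
print('number of images labelled with cancer : ',test_sample[test_sample['label']==1].shape[0],
      ' out of ', test_sample.shape[0], ' examples')

In [None]:
test_sample.to_csv('test_predictions.csv', index=False)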