#### First I'll show how to train on the smaller dataset, which fits in one file, then I'll show the method for the larger one.

#### I made a mistake during preprocessing: I should have shuffled and split the dataset once, so that every time I load it the same images are fed through the neural network in the same order. This is important when comparing different models.

#### It is not a big deal while we are learning, but it will matter in a production environment.
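
A minimal sketch of the fix described above, assuming the data is an array of (image, label) rows like `train`. The function name `shuffle_and_split`, the seed value and the 80/20 ratio are illustrative choices, not what was used in the preprocessing step. Seeding the generator makes the permutation, and therefore the split, identical on every run.

```python
import numpy as np

def shuffle_and_split(data, test_fraction=0.2, seed=42):
    """Shuffle rows with a fixed seed, then split into train and test parts."""
    rng = np.random.default_rng(seed)        # seeded generator -> same order every run
    order = rng.permutation(len(data))       # shuffled row indices
    n_test = int(len(data) * test_fraction)  # number of rows held out for testing
    test_idx, train_idx = order[:n_test], order[n_test:]
    return data[train_idx], data[test_idx]

# toy stand-in for the real dataset: 10 rows of (feature, label)
toy = np.arange(20).reshape(10, 2)
train_a, test_a = shuffle_and_split(toy)
train_b, test_b = shuffle_and_split(toy)
print(np.array_equal(train_a, train_b), np.array_equal(test_a, test_b))
```

Saving both parts with `np.save` right after splitting means later runs load exactly the same arrays.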

#### Also, the training below is an introduction to deep learning and a simplification of what is needed. For our final results we are not predicting whether or not a patient has pulmonary fibrosis; we are predicting their FVC value for a given week based on a set of images.

In [1]:
#we will be using keras with tensorflow backend

import numpy as np
import os
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Dense, Activation, MaxPooling2D, Flatten
import pandas as pd

In [2]:
#load in the data
train = np.load('D:/train.npy', allow_pickle = True)
test = np.load('D:/test.npy', allow_pickle = True)

In [4]:
#quick look at the class distribution of the training set

n_train = len(train)
#each row of train is an (image, label) pair; count the rows labelled 1
count = sum(1 for _, label in train if label == 1)
print("Training Data")
print(count)
print(n_train - count)
print(count / n_train)
print((n_train - count) / n_train)

Training Data
17512
7488
0.70048
0.29952


In [5]:
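#quick look at the class distribution of the test set
#NOTE: the output recorded below was produced with 25000 hardcoded as the set size.
#model.evaluate later reports 219 batches of 32, so the test set holds at most 7,008 samples
#and the second count and both ratios shown below are not reliable.
n_test = len(test)
count = sum(1 for _, label in test if label == 1)
print("Test Data")
print(count)
print(n_test - count)
print(count / n_test)
print((n_test - count) / n_test)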

Test Data
4937
20063
0.19748
0.80252
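
A sketch of a reusable helper for the same check, which derives the total from the labels themselves rather than from a fixed number. The name `class_balance` is made up for illustration, and the labels below are toy values, not the real data.

```python
from collections import Counter

def class_balance(labels):
    """Return {label: (count, fraction)} for any sequence of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.items()}

# toy labels, not the real data
balance = class_balance([1, 1, 1, 0])
print(balance)  # {1: (3, 0.75), 0: (1, 0.25)}
```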


In [6]:
X = [] #our list of images
y = [] #our list of labels
#iterate over the dataset and separate the images from the labels
#Keras expects them as separate arrays: https://keras.io/guides/sequential_model/
for features, label in train:
    X.append(features)
    y.append(label)

In [7]:
X = np.array(X)
y = np.array(y)

In [8]:
print(X.shape)
print(y.shape)

(25000, 256, 256)
(25000,)


In [9]:
#we need to reshape our data so that it can be fed to our neural network
#https://www.machinecurve.com/index.php/2020/04/05/how-to-find-the-value-for-keras-input_shape-input_dim/
#https://machinelearningmastery.com/index-slice-reshape-numpy-arrays-machine-learning-python/
X = X.reshape(-1,256,256,1) 

In [10]:
print(X.shape)

#we have 25,000 samples, each 256x256 with a single channel (one slice)
#if we had kept the images together by patient, our shape would be something like
# (176, 256, 256, 100), assuming 176 patients, a size of 256x256 and 100 slices per patient

(25000, 256, 256, 1)
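
To make the reshape step concrete, here is a small sketch on a toy array (4 images of 8x8 rather than 25,000 of 256x256). It shows that `reshape` with a trailing 1 and `np.expand_dims` both add the channel axis that `Conv2D` expects.

```python
import numpy as np

# toy stand-in: 4 grayscale images of 8x8 instead of 25,000 of 256x256
images = np.zeros((4, 8, 8))

# -1 lets NumPy infer the number of samples; the trailing 1 is the channel axis
with_channel = images.reshape(-1, 8, 8, 1)

# np.expand_dims adds the same axis without restating the other dimensions
also_with_channel = np.expand_dims(images, axis=-1)

print(with_channel.shape, also_with_channel.shape)  # (4, 8, 8, 1) (4, 8, 8, 1)
```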


In [None]:
# A note about training in notebooks: Jupyter notebooks work fine if you are
# training once or twice, but they don't handle memory very well and
# you'll eventually run out of memory. I recommend using a plain Python file instead.

In [11]:
#refer to the sentdex video regarding this model

#sequential model
model = Sequential()

model.add(Conv2D(32, (3,3), input_shape = (256, 256, 1)))  # 32 filters of size 3x3
model.add(Activation("relu"))               # ReLU: max(0, x), worth reading up on
model.add(MaxPooling2D(pool_size = (2,2)))  # keeps the max of each 2x2 window, worth reading up on


model.add(Flatten()) 


model.add(Dense(32))
model.add(Activation("relu"))


#output layer
model.add(Dense(1)) 
model.add(Activation('sigmoid')) 

model.compile(loss = "binary_crossentropy",
                         optimizer = 'adam',
                         metrics = ['accuracy'])

model.fit(X, y, batch_size = 32, epochs =5, validation_split=.1)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x14be99ef808>

In [12]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 254, 254, 32)      320       
_________________________________________________________________
activation (Activation)      (None, 254, 254, 32)      0         
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 127, 127, 32)      0         
_________________________________________________________________
flatten (Flatten)            (None, 516128)            0         
_________________________________________________________________
dense (Dense)                (None, 32)                16516128  
_________________________________________________________________
activation_1 (Activation)    (None, 32)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 3
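
The shapes and parameter counts in the summary can be reproduced by hand. The sketch below assumes the Keras defaults used in the model (no padding, stride 1 for `Conv2D`, and a 2x2 pool that halves each side); the helper names are made up for illustration.

```python
def conv2d_out(size, kernel=3):
    """Output width/height of a 'valid' (no padding), stride-1 convolution."""
    return size - kernel + 1

def conv2d_params(in_channels, filters, kernel=3):
    """Weights per filter (kernel*kernel*in_channels) plus one bias, times filters."""
    return (kernel * kernel * in_channels + 1) * filters

def dense_params(inputs, units):
    """One weight per input plus one bias, for each unit."""
    return (inputs + 1) * units

side = conv2d_out(256)            # 254
pooled = side // 2                # 2x2 max pooling -> 127
flat = pooled * pooled * 32       # values after Flatten

print(conv2d_params(1, 32))       # 320
print(flat)                       # 516128
print(dense_params(flat, 32))     # 16516128
```

Almost all of the parameters sit in the first `Dense` layer, because it connects every one of the 516,128 flattened values to each of its 32 units.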

In [13]:
#evaluate our results on our test set

In [14]:
Xtest = []
ytest = []
for features_test, label_test in test:
    Xtest.append(features_test)
    ytest.append(label_test)
Xtest = np.array(Xtest).reshape(-1,256,256,1)
ytest = np.array(ytest)

In [15]:
# let's see how accurate our model is:
test_loss, test_acc = model.evaluate(Xtest, ytest, verbose=2)
print(f'Test Loss: {test_loss}. Test Accuracy: {test_acc}')

219/219 - 7s - loss: 0.6055 - accuracy: 0.7063
Test Loss: 0.6054663062095642. Test Accuracy: 0.7062947154045105
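
Accuracy on an imbalanced dataset is easier to judge next to the score a model would get by always predicting the most common class. A sketch of that comparison, using a hypothetical helper `majority_baseline_accuracy` and toy labels rather than the real test set:

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# toy labels, not the real test set: 7 of 10 are class 1
toy_labels = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
print(majority_baseline_accuracy(toy_labels))  # 0.7
```

If `majority_baseline_accuracy(ytest)` comes out close to the test accuracy above, the network may have learned little beyond the class proportions.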


In [16]:
#we can also save our model for further usage or training later
model.save('model_1.keras')

In [17]:
#and load it back in
test_model = tf.keras.models.load_model('model_1.keras')


In [18]:
#evaluate it again, exactly as in the cell a few lines up
test_loss, test_acc = test_model.evaluate(Xtest, ytest, verbose=2)
print(f'Test Loss: {test_loss}. Test Accuracy: {test_acc}')
#and we get the same results

219/219 - 7s - loss: 0.6055 - accuracy: 0.7063
Test Loss: 0.6054663062095642. Test Accuracy: 0.7062947154045105


In [None]:
#to find the best model, we want to train with different layer sizes, numbers of layers
#and anything else we think may increase our accuracy
#so at this point it becomes trial and error

# https://stackoverflow.com/questions/51704808/what-is-the-difference-between-loss-accuracy-validation-loss-validation-accur
#check this out for understanding val loss and val acc from our keras results

#I ran a few different sized models and wrote the results to csv files
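
The trial-and-error loop can be organised roughly as below. This is a sketch under assumptions, not the code that produced the csv files: `run_grid`, the column names and the file name are made up, and `train_and_score` stands for a caller-supplied function that builds, trains and evaluates one Keras model.

```python
import csv
import itertools

def run_grid(train_and_score, conv_layers, dense_layers, layer_sizes, csv_path):
    """Try every combination and write one CSV row per model.

    train_and_score is a caller-supplied function (hypothetical here) that
    builds and trains a model for the given settings and returns
    (val_loss, val_accuracy).
    """
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["conv_layers", "dense_layers", "layer_size", "val_loss", "val_acc"])
        for n_conv, n_dense, size in itertools.product(conv_layers, dense_layers, layer_sizes):
            val_loss, val_acc = train_and_score(n_conv, n_dense, size)
            writer.writerow([n_conv, n_dense, size, val_loss, val_acc])

# stand-in scorer so the sketch runs without training anything
def fake_score(n_conv, n_dense, size):
    return 0.0, 0.0

run_grid(fake_score, conv_layers=[1, 2], dense_layers=[1, 2], layer_sizes=[32, 64],
         csv_path="model_log_example.csv")
```

Replacing `fake_score` with a function that calls `model.fit` and reads the last validation metrics would give one logged row per architecture.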

In [21]:
model = Sequential()

model.add(Conv2D(32, (3,3), input_shape = (256, 256, 1)))
model.add(Activation("relu"))
model.add(MaxPooling2D(pool_size = (2,2)))

model.add(Conv2D(32, (3,3)))
model.add(Activation("relu"))
model.add(MaxPooling2D(pool_size = (2,2)))

model.add(Flatten()) 

model.add(Dense(32))
model.add(Activation("relu"))

model.add(Dense(32))
model.add(Activation("relu"))


#output layer
model.add(Dense(1)) 
model.add(Activation('sigmoid')) 

model.compile(loss = "binary_crossentropy",
                         optimizer = 'adam',
                         metrics = ['accuracy'])

model.fit(X, y, batch_size = 32, epochs =10, validation_split=.1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x14c7b904288>

In [22]:
model.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_3 (Conv2D)            (None, 254, 254, 32)      320       
_________________________________________________________________
activation_8 (Activation)    (None, 254, 254, 32)      0         
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 127, 127, 32)      0         
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 125, 125, 32)      9248      
_________________________________________________________________
activation_9 (Activation)    (None, 125, 125, 32)      0         
_________________________________________________________________
max_pooling2d_4 (MaxPooling2 (None, 62, 62, 32)        0         
_________________________________________________________________
flatten_2 (Flatten)          (None, 123008)           

In [24]:
#ran out of memory in the notebook, as I mentioned earlier; switching to an actual Python file.
#I forced this error by increasing the batch size...
#Restarting the kernel fixes this, or just use a Python file

model = Sequential()

model.add(Conv2D(64, (3,3), input_shape = (256, 256, 1)))
model.add(Activation("relu"))
model.add(MaxPooling2D(pool_size = (2,2)))


model.add(Flatten()) 


model.add(Dense(64))
model.add(Activation("relu"))


#output layer
model.add(Dense(1)) 
model.add(Activation('sigmoid')) 

model.compile(loss = "binary_crossentropy",
                         optimizer = 'adam',
                         metrics = ['accuracy'])

model.fit(X, y, batch_size = 64, epochs =10, validation_split=.1)

Epoch 1/10


ResourceExhaustedError:  OOM when allocating tensor with shape[64,64,254,254] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[node sequential_5/conv2d_6/Conv2D (defined at <ipython-input-24-831e0526ddc7>:25) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
 [Op:__inference_train_function_31359]

Function call stack:
train_function
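
The size of the tensor named in the error can be worked out directly from its shape. A short sketch, assuming 4 bytes per value (the "type float" in the message is 32-bit); `tensor_bytes` is a name made up for illustration.

```python
def tensor_bytes(shape, bytes_per_value=4):
    """Memory needed for one dense tensor; float32 uses 4 bytes per value."""
    total = bytes_per_value
    for dim in shape:
        total *= dim
    return total

# shape from the error message: batch of 64, 64 filters, 254x254 feature maps
needed = tensor_bytes([64, 64, 254, 254])
print(needed)                    # 1057030144
print(round(needed / 2**30, 2))  # 0.98
```

That is about 1 GiB for a single activation tensor. With a batch size of 32 and 32 filters, as in the earlier models, the same tensor is a quarter of that size.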


#### To take this further, see the file named "train_all.py". It was run on AWS on a p3.2xlarge EC2 instance for a few hours. The results for the models are in "model_log_train1603151129.csv" and can be used to evaluate the effectiveness of each model.