Ryan's notebook: https://drive.google.com/open?id=1MC9Swt97bI_KK9huLOrrFBj6XHci-v1o

# Importing, Downloading, and Unzipping

Using CheXpert dataset: https://stanfordmlgroup.github.io/competitions/chexpert/

In [1]:
#! pip install tensorflow==2.0.0
#!pip install tensorflow_io

%tensorflow_version 2.x

import gdown
import zipfile, os
import h5py
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from PIL import Image
#keras = tf.keras
from tensorflow.keras import applications, Model, optimizers
from tensorflow.keras.utils import HDF5Matrix
from tensorflow.keras import backend as k 
from tensorflow.keras.layers import Dropout, Flatten, Dense, GlobalAveragePooling2D
from tensorflow.keras.preprocessing import image
from tensorflow.keras.callbacks import ModelCheckpoint
#import tensorflow_io.hdf5
from tqdm import tqdm

TensorFlow 2.x selected.


In [0]:
from google.colab import drive
drive.mount('/content/drive')

Setting up our Dataset:

Download and unzip the dataset files from Google Drive to the local machine.
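
As a rough sketch of the unzip step, the snippet below extracts an archive only when the target folder does not exist yet, so the cell can be re-run safely. The temporary directory and the tiny stand-in archive are assumptions made for illustration; the real archive is the 11.6GB dataset.zip.

```python
import os
import tempfile
import zipfile

# Hypothetical scratch locations standing in for /content/dataset.zip and /content/data.
work_dir = tempfile.mkdtemp()
archive_path = os.path.join(work_dir, "dataset.zip")
extract_dir = os.path.join(work_dir, "data")

# Build a tiny stand-in archive so the sketch runs without the real download.
with zipfile.ZipFile(archive_path, mode="w") as archive:
    archive.writestr("CheXpert-v1.0-small/train.csv", "Path,No Finding\n")

# Extract only if the target folder is missing; the context manager closes the file.
if not os.path.isdir(extract_dir):
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(extract_dir)

extracted = os.path.join(extract_dir, "CheXpert-v1.0-small", "train.csv")
print(os.path.isfile(extracted))
```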




In [0]:
#Downloads the CheXpert Archived Dataset
file_id = "1MhlQ1D4TUdvYImYpjYMYAZNfnbriOvGZ"
url = f'https://drive.google.com/uc?id={file_id}'
output = 'dataset.zip'

gdown.download(url, output, quiet=False) 

#Unzips the CheXpert Archived Dataset (the with-block closes the archive automatically)
with zipfile.ZipFile('/content/dataset.zip') as zip_ref:
  zip_ref.extractall('/content/data')

Downloading...
From: https://drive.google.com/uc?id=1MhlQ1D4TUdvYImYpjYMYAZNfnbriOvGZ
To: /content/dataset.zip
11.6GB [01:12, 160MB/s]


Now, let's move the files from their default location to the folder where our preprocessing functions expect them to be.
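
The same move can also be sketched in plain Python, which avoids errors when the target folders already exist. The temporary root and the empty train.csv below are placeholders for illustration only; the folder names mirror the ones used in this notebook.

```python
import os
import shutil
import tempfile

# Hypothetical root standing in for /content/data.
root = tempfile.mkdtemp()

# Recreate the nested layout produced by unzipping the archive.
src = os.path.join(root, "CheXpert-v1.0-small", "CheXpert-v1.0-small")
os.makedirs(src)
open(os.path.join(src, "train.csv"), "w").close()

# Create manual/chexpert (and any missing parents) without failing if it exists.
dest_parent = os.path.join(root, "manual", "chexpert")
os.makedirs(dest_parent, exist_ok=True)

# Because dest_parent is an existing directory, src is moved inside it.
shutil.move(src, dest_parent)

moved_csv = os.path.join(dest_parent, "CheXpert-v1.0-small", "train.csv")
print(os.listdir(os.path.join(dest_parent, "CheXpert-v1.0-small")))
```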

In [0]:
#create directory (including parents) and move dataset files to it
!mkdir -p "/content/data/manual/chexpert"
!mv "/content/data/CheXpert-v1.0-small/CheXpert-v1.0-small" "/content/data/manual/chexpert"
!ls "/content/data/manual/chexpert/CheXpert-v1.0-small"

train  train.csv  valid  valid.csv


# Training Data Setup

In this section we take our training data and combine it into one feature array and one label array, which can be used to fit our model.  These arrays are stored in an HDF5 file.  

(This section only needs to be run once; after that, the dataset can be read from Google Drive.)

In [0]:
#Read in the csv of training data to a Pandas dataframe
train = pd.read_csv('/content/data/manual/chexpert/CheXpert-v1.0-small/train.csv')

In [0]:
#Get the number of images in the training dataset
trange = train.shape[0]
print(trange)

223414


Now, we iterate through the training dataframe in 20 sections, fetching the images as (299, 299, 3) arrays and storing them to a larger array.  
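
To make the split concrete, here is a small sketch of how 223414 rows divide into 20 sections, with the last section absorbing the rows left over after integer division. The helper name chunk_bounds is made up for this illustration.

```python
def chunk_bounds(n_rows, n_chunks):
    """Return (start, end) row ranges; the last chunk takes any remainder."""
    size = n_rows // n_chunks
    bounds = []
    for j in range(n_chunks):
        start = j * size
        end = n_rows if j == n_chunks - 1 else (j + 1) * size
        bounds.append((start, end))
    return bounds


bounds = chunk_bounds(223414, 20)
print(bounds[0], bounds[-1])
```

The first 19 sections hold 11170 images each, and the last one holds 11184.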

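Holding all of the images in RAM at once is not feasible: a (223414, 299, 299, 3) float32 array takes roughly 240 GB. The next cell therefore creates a memory-mapped file on the local disk, although the loop further down writes each of the 20 sections to its own HDF5 dataset (X_train_0 through X_train_19) instead of filling the memory map.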
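
The size of the full array can be checked with a little arithmetic. This sketch assumes 4 bytes per value, i.e. float32, the dtype used for the memory-mapped array.

```python
# Shape of the full training array.
n_images, height, width, channels = 223414, 299, 299, 3
bytes_per_value = 4  # float32

total_bytes = n_images * height * width * channels * bytes_per_value
total_gb = total_bytes / 1e9       # decimal gigabytes
total_gib = total_bytes / 2 ** 30  # binary gibibytes

print(round(total_gb, 1), round(total_gib, 1))
```

Either way of counting puts the array far beyond the RAM of a typical Colab machine.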

In [0]:
X_train = np.memmap('memmap.dat', mode='w+', shape=(223414, 299, 299, 3), dtype=np.float32)

In [0]:
f = h5py.File('/content/drive/My Drive/Colab Notebooks/data/manual/chexpert.hdf5', mode='w')

OSError: ignored

In [0]:
#Load the images in 20 sections and write each section to its own HDF5 dataset
chunk = trange // 20
for j in range(20):

  print('Starting part', j+1, 'of the training data')
  start = j*chunk
  end = (j+1)*chunk
  if j == 19:
    end = trange  #the last section also takes the 14 leftover rows
  t = []
  for i in tqdm(range(start, end)):
      img = image.load_img('/content/data/manual/chexpert/' + train['Path'][i], target_size=(299,299))
      img = image.img_to_array(img)
      img = img/255  #scale pixel values to [0, 1]
      t.append(img)
  T = np.array(t)
  print(T.shape)
  #X_train[start:end, :, :, :] = T
  dname = 'X_train_{j}'.format(j=j)
  dset = f.create_dataset(dname, data=T, chunks=True, dtype=np.float32)

In [0]:
'X_train {j}'.format(j=j)

In [0]:
!rm -r '/content/data/manual/chexpert'

In [0]:
X_train

In [0]:
#f.close()
f = h5py.File('/content/drive/My Drive/Colab Notebooks/data/manual/chexpert.hdf5', mode='r')
for key in f.keys():
  print(key)
f.close()

In [0]:
del t, img

In [0]:
f2

In [0]:
!ls '/content'
!mv '/content/chexpert.hdf5' '/content/drive/My Drive/Colab Notebooks/data/manual'

In [0]:
f = h5py.File('/content/drive/My Drive/Colab Notebooks/data/manual/chexpert.hdf5', mode='a')
y_train = f.create_dataset('y_train', (223414, 14))

In [0]:
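#Binarize the 14 class labels: missing (NaN) and uncertain (-1) entries become 0
y = train[['No Finding', 'Enlarged Cardiomediastinum', 'Cardiomegaly', 'Lung Opacity', 'Lung Lesion', 'Edema', 'Consolidation', 'Pneumonia', 'Atelectasis', 'Pneumothorax', 'Pleural Effusion', 'Pleural Other', 'Fracture', 'Support Devices']]
y = y.fillna(0)
y = y.replace(to_replace=-1, value=0)
#Write the labels into the HDF5 dataset created above; assigning to the name y_train would only rebind it
y_train[...] = y.to_numpy(dtype=np.float32)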
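
The label cleaning in the cell above can be tried on a toy frame. The two column names are taken from the label list; the values are made up for illustration. In the CheXpert CSVs a blank means the finding was not mentioned and -1 marks an uncertain label, so both are mapped to 0 here.

```python
import numpy as np
import pandas as pd

# Made-up rows: 1 = positive, 0 = negative, -1 = uncertain, NaN = not mentioned.
toy = pd.DataFrame({
    "Edema": [1.0, -1.0, np.nan, 0.0],
    "Fracture": [np.nan, 1.0, -1.0, 1.0],
})

# Treat both missing and uncertain entries as negative.
cleaned = toy.fillna(0).replace(to_replace=-1, value=0)
labels = cleaned.to_numpy(dtype=np.float32)

print(labels)
```

Each row can have several positive labels at once, so this is a multi-label target rather than a one-hot encoding.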


In [0]:
print(y_train.shape)
for key in f.keys():
  print(key)

In [0]:
f.close()
del train, y, y_train

# Validation Data Setup

In this section, we store our validation data in HDF5 format, just like the training data. We then move the file into Google Drive so it can be referenced in the future without extracting the data again.
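
A minimal sketch of writing an array into an HDF5 dataset and reading it back, assuming h5py and NumPy are installed. The file path, the tiny array shape, and the random values are placeholders; the real X_val has shape (234, 299, 299, 3).

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical stand-in for the validation images.
images = np.random.rand(4, 8, 8, 3).astype(np.float32)

path = os.path.join(tempfile.mkdtemp(), "demo.hdf5")

with h5py.File(path, mode="w") as f:
    dset = f.create_dataset("X_val", shape=images.shape, dtype=np.float32)
    # Slice assignment copies the values into the file.
    # Writing `dset = images` would only rebind the Python name.
    dset[...] = images

with h5py.File(path, mode="r") as f:
    stored = f["X_val"][:]

print(stored.shape, np.allclose(stored, images))
```

Opening the file in a with-block also guarantees it is closed, even if an error occurs while writing.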

In [0]:
valid = pd.read_csv('/content/data/manual/chexpert/CheXpert-v1.0-small/valid.csv')

In [0]:

f = h5py.File('/content/drive/My Drive/Colab Notebooks/data/manual/chexpert.hdf5', mode='a')
X_val = f.create_dataset('X_val', (234, 299, 299, 3))

In [0]:
t = []
T = 0
for i in tqdm(range(valid.shape[0])):
      img = image.load_img('/content/data/manual/chexpert/' + valid['Path'][i], target_size=(299,299))
      img = image.img_to_array(img)
      img = img/255
      t.append(img)
T = np.array(t)
print(T.shape)
X_val[...] = T  #write into the HDF5 dataset; X_val = T would only rebind the name

In [0]:
for key in f.keys():
  print(key)
print(f['X_val'])

In [0]:
f.close()
del T, t, X_val

In [0]:
f = h5py.File('/content/drive/My Drive/Colab Notebooks/data/manual/chexpert.hdf5', mode='a')
y_val = f.create_dataset('y_val', (234, 14))

In [0]:
y = valid[['No Finding', 'Enlarged Cardiomediastinum', 'Cardiomegaly', 'Lung Opacity', 'Lung Lesion', 'Edema', 'Consolidation', 'Pneumonia', 'Atelectasis', 'Pneumothorax', 'Pleural Effusion', 'Pleural Other', 'Fracture', 'Support Devices']]
y = y.fillna(0.0)
y = y.replace(to_replace=-1, value=0.0)
#Write the labels into the HDF5 dataset created above; assigning to the name y_val would only rebind it
y_val[...] = y.to_numpy(dtype=np.float32)

In [0]:
for key in f.keys():
  print(key)
f.close()

In [0]:
f.close()
del y, valid, y_val

# Building the Model

In [0]:
# Let's set up our model, starting with the convolutional layers of ResNet50 with their ImageNet weights
model = applications.resnet.ResNet50(include_top=False, weights='imagenet', classes=14, input_shape=(299, 299, 3))
x = model.output
x = Flatten()(x)
x = Dropout(rate=0.5)(x)
x = Dense(1024, activation="relu")(x)
# Sigmoid gives each of the 14 labels an independent probability, as binary cross-entropy expects
predictions = Dense(14, activation="sigmoid")(x)

In [0]:
#How many layers are there in the ResNet50 base?
print(len(model.layers))

In [0]:

model.layers[175-25:]

In [0]:
for i in range(len(model.layers)):
  model.layers[i].trainable = False


model_final = Model(inputs = model.input, outputs = predictions)
model_final.compile(optimizer=optimizers.Adam(), loss="binary_crossentropy", metrics=["AUC"])
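
For intuition about the binary_crossentropy loss used above, here is a small NumPy sketch. It assumes each model output is a probability in (0, 1), for example from a sigmoid, and it treats every label as an independent yes/no prediction. The logits and labels are made up.

```python
import numpy as np


def sigmoid(z):
    """Map raw scores to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over all labels and samples."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)  # avoid log(0)
    losses = -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
    return float(np.mean(losses))


# One sample with three labels (made-up values).
logits = np.array([[2.0, -1.0, 0.0]])
labels = np.array([[1.0, 0.0, 1.0]])

probs = sigmoid(logits)
loss = binary_crossentropy(labels, probs)
print(probs, loss)
```

An unbounded activation such as ReLU can output values outside (0, 1), for which this loss is not well defined.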

In [0]:
# The file name records the validation loss; lower is better, so keep the checkpoint with the minimum
filepath = '/content/drive/My Drive/Colab Notebooks/chexpertCNN/epochs:{epoch:03d}-val_loss:{val_loss:.3f}.hdf5'
checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
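
The save-best-only rule for a quantity that should decrease, such as val_loss, can be sketched in a few lines of plain Python. The loss values and the file name template below are illustrative assumptions, not output from this model.

```python
# Illustrative file name template with the epoch number and validation loss.
template = "epochs:{epoch:03d}-val_loss:{val_loss:.3f}.hdf5"

# Made-up validation losses for four epochs.
val_losses = [0.52, 0.47, 0.49, 0.45]

best = float("inf")
saved = []
for epoch, val_loss in enumerate(val_losses, start=1):
    # Save only when the monitored value improves, i.e. gets smaller.
    if val_loss < best:
        best = val_loss
        saved.append(template.format(epoch=epoch, val_loss=val_loss))

print(saved)
```

Epoch 3 is skipped because its loss is worse than the best value seen so far.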

# Training the Model

In [0]:
f = h5py.File('/content/drive/My Drive/Colab Notebooks/data/manual/chexpert.hdf5', mode='r+')


In [0]:
f.close()

In [0]:
#dataset = h5py.File('/content/drive/My Drive/Colab Notebooks/data/manual/chexpert.hdf5', mode='r')
hdf5_path = '/content/drive/My Drive/Colab Notebooks/data/manual/chexpert.hdf5'
#One HDF5Matrix per training section, X_train_0 through X_train_19
X_train = tuple(HDF5Matrix(hdf5_path, 'X_train_{j}'.format(j=j)) for j in range(20))
y_train = HDF5Matrix(hdf5_path, 'y_train')
X_val = HDF5Matrix(hdf5_path, 'X_val')
y_val = HDF5Matrix(hdf5_path, 'y_val')

In [0]:
X_train[0]


In [0]:
y_train

In [0]:
for i in range(10):
  for k in range(20):  #all 20 sections; range(19) would skip the last one
    start = k*11170
    end = (k+1)*11170
    if k == 19:
      end += 14  #the last section holds the 14 leftover rows
    data = X_train[k]
    model_final.fit(data,y_train[start:end], epochs=1, validation_data=(X_val, y_val), callbacks=callbacks_list, shuffle='batch')

# FastAI Image Classification

See separate file here (anyone at CWRU with the link can access): https://colab.research.google.com/drive/1hOFSD2mtdvqcioLnrGsAmNEZQ7y_79kJ