<a href="https://colab.research.google.com/github/yeshwanth32/2048/blob/master/Final_project_v1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Chest X-ray Analyser**


---



The title of my final project is "Chest X-ray Analyser". The goal is to train a convolutional neural network on a dataset of chest X-ray images, each labeled with the names of the diseases that can be identified from the image. Given a chest X-ray image, the network should predict whether a disease is present. The dataset was provided by the National Institutes of Health; the images and the details of how they were obtained can be found [here](https://nihcc.app.box.com/v/ChestXray-NIHCC/folder/36938765345).

Before I explain my current approach, let me describe two logistical problems I am facing with the dataset. The first is that each image is very large: 1024x1024 pixels, and there are nearly 100,000 images in the dataset. This makes it very costly to train the network on all the images. The second issue follows from the first: the entire dataset is too big. If I were to use all the images, I would have to figure out a way to submit a 42 GB folder to the professor so that the code can be run.
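
To put the size problem in numbers, here is a rough back-of-the-envelope calculation. It assumes single-channel images with 8-bit pixels on disk and uses the rounded figure of 100,000 images; the constant and function names are mine. The float32 case is included because rescaled pixel values are typically held as 32-bit floats.

```python
# Rough memory estimate for holding the whole dataset in RAM as arrays.
# Assumptions (not measured): 1 grayscale channel, 100,000 images.
NUM_IMAGES = 100_000
HEIGHT = WIDTH = 1024
CHANNELS = 1


def dataset_gib(bytes_per_pixel):
    """Return the size in GiB of all images at the given bytes per pixel."""
    total_bytes = NUM_IMAGES * HEIGHT * WIDTH * CHANNELS * bytes_per_pixel
    return total_bytes / 2**30


raw_gib = dataset_gib(1)    # uint8 pixels, assumed for the decoded PNG files
float_gib = dataset_gib(4)  # float32 pixels, after rescaling to [0, 1]

print(f"uint8:   {raw_gib:.1f} GiB")
print(f"float32: {float_gib:.1f} GiB")
```

Under these assumptions the decoded images come to roughly 98 GiB, and about four times that as float32, which is why training on a subset is the practical option here.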

To solve these issues, I am not going to train a multi-label image classification model like the one in the paper. Instead, I will train a model to identify the occurrence of only one of the 14 irregularities in the dataset. For example, Pleural Thickening occurs only 5172 times in the dataset, so my understanding is that it would be less time consuming to train a model to accurately identify just this one disease. This could also be useful because it lets us see whether it is more effective to have different models each specialize in one disease rather than having one model identify all of them. To reduce the time it takes to process each image, I will use the TensorFlow library and Google Colab to run my code. Since Google Colab provides cloud GPUs that TensorFlow can use, this should significantly speed up the run time of my code. I am also planning to pre-process the data by standardizing the images and saving the resulting arrays, to reduce the processing time.
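
As a sketch of what "only one of the 14 irregularities" means for the labels: assuming each image's findings are stored as a single `|`-separated string (which I believe is how the NIH metadata lists them, but treat both that format and the spelling `Pleural_Thickening` as assumptions), a multi-label entry can be reduced to a 0/1 label as follows. The function name and the sample rows are made up for illustration.

```python
TARGET = "Pleural_Thickening"  # assumed spelling of the label


def to_binary_label(findings, target=TARGET):
    """Return 1 if `target` is one of the '|'-separated findings, else 0."""
    return int(target in findings.split("|"))


# made-up example rows, not taken from the dataset
sample_findings = [
    "No Finding",
    "Pleural_Thickening",
    "Effusion|Pleural_Thickening|Mass",
    "Effusion|Mass",
]
binary_labels = [to_binary_label(f) for f in sample_findings]
print(binary_labels)
```

Splitting on the separator before comparing avoids accidental substring matches between label names.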

To train the model I will use the Keras API built into TensorFlow to design a custom convolutional neural network for this task. In the paper, the researchers used existing architectures pre-trained on ImageNet: AlexNet, GoogLeNet, VGGNet-16 and ResNet-50. I haven't finalized the exact structure of the neural network I will be developing, but I will definitely take inspiration from the networks used in the paper and apply my own ideas and changes to create a new model.
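
Since the structure is not finalized, the following is only a placeholder sketch of what a small custom network could look like in Keras, not the final design. The layer counts, filter sizes and the function name `build_model` are arbitrary choices of mine; the input shape assumes 1024x1024 grayscale images, and the single sigmoid output assumes one disease is being detected.

```python
import tensorflow as tf
from tensorflow.keras import layers


def build_model(img_size=1024):
    """Small binary-classification CNN; all sizes are illustrative guesses."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(img_size, img_size, 1)),
        # strided first convolution shrinks the large input early
        layers.Conv2D(16, 7, strides=2, padding="same", activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        # average over the spatial grid so the head stays small
        layers.GlobalAveragePooling2D(),
        layers.Dense(1, activation="sigmoid"),  # probability of the disease
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model


model = build_model()
model.summary()
```

Global average pooling is used here instead of flattening so that the parameter count does not depend on the image size, which matters with inputs this large.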

My goal is to get as high a success rate as possible, preferably higher than 90 percent. I cannot commit to an exact target yet, because only once I build and start training my model will I get a better understanding of the project and of whether a 90 percent success rate is even possible. I will use a graph to display the success rate of my program as training progresses. I will also use some form of CNN visualization tool (like the one described [here](https://medium.com/@falaktheoptimist/want-to-look-inside-your-cnn-we-have-just-the-right-tool-for-you-ad1e25b30d90)) to get a more in-depth understanding of how my network works.
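
One caveat about the 90 percent target, worked out with the numbers quoted above: if evaluation used the full dataset (roughly 100,000 images, of which 5172 show Pleural Thickening), a model that always answers "no disease" would already exceed 90 percent accuracy while detecting nothing. The sketch below only does that arithmetic; the rounded dataset size is an assumption, and a balanced subset would behave differently.

```python
# Accuracy and recall of a trivial "always negative" classifier.
TOTAL_IMAGES = 100_000  # rounded figure, assumed
POSITIVES = 5_172       # Pleural Thickening count quoted above

true_negatives = TOTAL_IMAGES - POSITIVES  # every image without the disease is "right"
true_positives = 0                         # no image with the disease is ever found

accuracy = (true_positives + true_negatives) / TOTAL_IMAGES
recall = true_positives / POSITIVES

print(f"accuracy: {accuracy:.2%}")
print(f"recall:   {recall:.2%}")
```

So alongside accuracy it is probably worth tracking recall (or precision) for the positive class when judging the model.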

Status report: 

Currently, I have finished downloading the full 42 GB of data to my laptop and uploaded it to my Google Drive for easy access. I have set up a Google Colab project and finished testing the various ways to read the data. I am also researching the best way to structure my CNN.

Since the project is nowhere near completion, there is a lot left to be done. I need to finish pre-processing the data and divide it into training and testing blocks. I also need to build a CNN and finish training and testing it on the dataset. Finally, I need to finish setting up the visualization code, such as drawing the graphs.
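
A minimal sketch of the train/test division mentioned above, using a seeded NumPy shuffle. The 80/20 ratio, the seed, the function name and the tiny stand-in arrays are all assumptions for illustration; the real inputs would be the saved image and label arrays.

```python
import numpy as np


def train_test_split(images, labels, test_fraction=0.2, seed=0):
    """Shuffle images and labels together, then split off a test block."""
    assert len(images) == len(labels)
    order = np.random.default_rng(seed).permutation(len(images))
    n_test = int(len(images) * test_fraction)
    test_idx, train_idx = order[:n_test], order[n_test:]
    return images[train_idx], labels[train_idx], images[test_idx], labels[test_idx]


# stand-in data: 10 tiny 4x4 grayscale "images" with 0/1 labels
images = np.zeros((10, 4, 4, 1), dtype=np.float32)
labels = np.arange(10).reshape(-1, 1) % 2

x_train, y_train, x_test, y_test = train_test_split(images, labels)
print(x_train.shape, y_train.shape, x_test.shape, y_test.shape)
```

Indexing both arrays with the same permutation keeps every image paired with its label, and the fixed seed makes the split reproducible between runs.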


In [6]:
# all imports
import glob
import time
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
import numpy as np
import pandas as pd
import pickle

t0 = time.time()

In [7]:
# Needed when running from Google Colab: mount Google Drive so that we
# can access the image files stored there. The mount point must match the
# "/content/drive" prefix used in the paths below.
#from google.colab import drive
#drive.mount('/content/drive')

# loading all the file paths in the different sub folders into a single list
base_dir = "/content/drive/MyDrive/247_final_project_data/Temporary_folder_2/images"

all_files_paths = []
for k in range(1, 13):  # sub folders Images1 ... Images12
  all_files_paths += glob.glob(base_dir + "/Images" + str(k) + "/images/*.png")
print(len(all_files_paths))


In [5]:
#time test with TensorFlow

# mirrored_strategy = tf.distribute.MirroredStrategy()

# t0 = time.time()
# files_names = glob.glob("/content/drive/My Drive/247_final_project_data/Temporary_folder/images/*.png")
# path = "/content/drive/My Drive/247_final_project_data/Temporary_folder/images/"
# normalization_layer = layers.experimental.preprocessing.Rescaling(1./255)
# for i in range(0, len(files_names)):
#   image = tf.keras.preprocessing.image.load_img(files_names[i], color_mode = "grayscale") #
#   input_arr =tf.keras.preprocessing.image.img_to_array(image)
#   standardized_image_arr = normalization_layer(input_arr)
# t1 = time.time()
# print(t1-t0)
# print("Number of files:" + str(len(files_names)))


In [None]:
#creating another list that only contains the paths to the subset of images we want to train on

dataset = pd.read_csv("/content/drive/MyDrive/247_final_project_data/Temporary_folder_2/partial_dataset.csv")
file_paths = []
for temp_file_name in dataset['File Name']:
  found = False  # must start as False so the check below can actually fire
  for path in all_files_paths:
    if temp_file_name in path:
      found = True
      file_paths.append(path)
      break  # keep exactly one path per row so images stay aligned with labels
  if not found:
    print("Error! No image found for " + str(temp_file_name))
    break

print(len(file_paths))
print("Done")

with open("/content/drive/MyDrive/247_final_project_data/Temporary_folder_2/file_paths.txt", "wb") as fp:
  pickle.dump(file_paths, fp)

In [None]:
#reading the images from the path and standardizing them

dataset = pd.read_csv("/content/drive/MyDrive/247_final_project_data/Temporary_folder_2/partial_dataset.csv")

with open("/content/drive/MyDrive/247_final_project_data/Temporary_folder_2/file_paths.txt", "rb") as fp:  
  file_paths = pickle.load(fp)


# rescale pixel values from [0, 255] to [0, 1]
normalization_layer = layers.experimental.preprocessing.Rescaling(1./255)
standardized_images_all = []
for path in file_paths:
  image = tf.keras.preprocessing.image.load_img(path, color_mode="grayscale")
  input_arr = tf.keras.preprocessing.image.img_to_array(image)  # shape (height, width, 1)
  standardized_image_arr = normalization_layer(input_arr)
  standardized_images_all.append(standardized_image_arr.numpy())

standardized_images_all = np.array(standardized_images_all)
print(standardized_images_all.shape)

labels_all = np.array([[i] for i in dataset['Label']])
print(labels_all.shape)

In [None]:
# saving the arrays to drive, then reading them back to check that the round trip works
with open("/content/drive/MyDrive/247_final_project_data/Temporary_folder_2/standardized_images_all.txt", "wb") as fp:
  pickle.dump(standardized_images_all, fp)

with open("/content/drive/MyDrive/247_final_project_data/Temporary_folder_2/standardized_images_all.txt", "rb") as fp:  
  a = pickle.load(fp)

with open("/content/drive/MyDrive/247_final_project_data/Temporary_folder_2/labels_all.txt", "wb") as fp:
  pickle.dump(labels_all, fp)

with open("/content/drive/MyDrive/247_final_project_data/Temporary_folder_2/labels_all.txt", "rb") as fp:  
  b = pickle.load(fp)

print(a.shape)
print(b.shape)
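
As a possible alternative to pickle for the NumPy arrays, `np.save` and `np.load` write and read the `.npy` format directly, and `mmap_mode` lets a large array be read lazily from disk. This is a small sketch only: it uses a temporary directory and a made-up array as stand-ins for the real Drive paths and image data.

```python
import os
import tempfile

import numpy as np

# made-up stand-in for the standardized image array
images = np.random.default_rng(0).random((3, 8, 8, 1), dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "standardized_images_all.npy")
    np.save(path, images)
    # mmap_mode="r" maps the file instead of loading it all into RAM
    loaded = np.load(path, mmap_mode="r")
    round_trip_ok = np.array_equal(images, loaded)
    loaded_shape = loaded.shape
    del loaded  # release the memory map before the directory is removed

print(loaded_shape, round_trip_ok)
```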

In [None]:
t1 = time.time()
print(t1-t0)