## DS420 Final Project
## NIH Chest X-RAY Image Disease Classification

#### Team members: Yordanos Alemu, Kelsey Dinndorf, Jorania Ferreria Alves, & Rabab Mohamed Nafe

Data: https://www.kaggle.com/nickuzmenkov/nih-chest-xrays-tfrecords?select=preprocessed_data.csv

The dataset contains chest x-ray images labeled with disease diagnoses. There are 15 label categories (14 diseases plus "None") and 112,120 images (600 x 600), stored across 256 TFRecord files. Additional patient information such as age and sex is not included.

Disease Categories: None, Atelectasis, Consolidation, Infiltration, Pneumothorax, Edema, Emphysema, Fibrosis, Effusion, Pneumonia, Pleural Thickening, Cardiomegaly, Nodule, Mass, Hernia
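
Because one x-ray can carry several diagnoses at once, each image's label is best thought of as a vector with one 0/1 slot per disease rather than a single class. The sketch below illustrates that idea only; `DISEASES` is copied from the list above and `to_multi_hot` is a hypothetical helper, not part of the dataset or the notebook.

```python
# Category names copied from the list above; "None" is represented by an all-zero vector.
DISEASES = [
    "Atelectasis", "Consolidation", "Infiltration", "Pneumothorax", "Edema",
    "Emphysema", "Fibrosis", "Effusion", "Pneumonia", "Pleural Thickening",
    "Cardiomegaly", "Nodule", "Mass", "Hernia",
]


def to_multi_hot(findings):
    """Return a 0/1 list with one slot per disease (hypothetical helper)."""
    unknown = set(findings) - set(DISEASES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return [1 if name in findings else 0 for name in DISEASES]


example = to_multi_hot(["Edema", "Pneumonia"])
print(example)
# An image with no finding has no slot set.
print("No finding:", sum(to_multi_hot([])) == 0)
```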

In [None]:
#import libraries

import numpy as np 
import pandas as pd
import seaborn as sns
import os
import tensorflow as tf
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

### Exploratory Data Analysis and Visualization

In [None]:
# Read the disease csv file (contains True/False)
df= pd.read_csv('/kaggle/input/nih-chest-xrays-tfrecords/preprocessed_data.csv')
df.head()

In [None]:
#Define disease categories as data
data = df.iloc[:, 1:].copy()  # copy so the later encoding does not modify a slice of df

In [None]:
data

In [None]:
#Show data shape
df.shape

#There are 16 columns and 112,120 rows

In [None]:
# Show data info
df.info()

#There are no missing values
#all of the attributes are type boolean

In [None]:
# Countplot of No finding category
sns.countplot(x='No Finding', data=df)

In [None]:
#define data columns, number of columns, and character columns
cols = data.columns
num_cols = data._get_numeric_data().columns
char_cols=list(set(cols) - set(num_cols))
char_cols

In [None]:
# Label encoding: map each True/False column to 0/1 (LabelEncoder does not one-hot encode)
from sklearn.preprocessing import LabelEncoder
le= LabelEncoder()
def encode(df):
    for i in cols:
        df[i]= le.fit_transform(df[i])
    return df

In [None]:
#Encode the disease categories as 0/1
encode(data)

In [None]:
#concat the original image urls with the 0/1-encoded data
df=pd.concat([df.iloc[:,0],data], axis=1)

In [None]:
#Histogram of disease types
data.hist(bins=20, figsize=(15,10))

In [None]:
#import libraries
import IPython.display as display
import random
from functools import partial
import sys
from numpy import load
from keras.models import Sequential
from keras.layers import Conv2D
from keras.layers import MaxPooling2D
from keras.layers import Dense
from keras.layers import Flatten
import time as timer

In [None]:
# Correlation matrix
sns.set(rc={'figure.figsize':(20,10)})
correlation_matrix = data.corr().round(2)
sns.heatmap(data=correlation_matrix, annot=True)

'''
There are large negative correlations between the No Finding category and most of the other categories, as expected because an image labeled No Finding carries no other label.
There is a fairly large correlation (0.18) between Emphysema and Pneumothorax.
There is a fairly large correlation (0.17) between Pneumonia and Edema.
There is a fairly large correlation (0.17) between Atelectasis and Effusion.
'''

In [None]:
# Pie chart of percent of x-ray images classified as a Mass disease
df.groupby('Mass').size().plot(kind='pie', autopct='%.2f')

# 5.16% of the x-ray images are classified as Mass

In [None]:
# Define list of disease column headers (skips the image column and 'No Finding')
heads = list(df.columns)[2:]
heads

### Split Image

In [None]:
file_loc = '/kaggle/input/nih-chest-xrays-tfrecords/'

image_loc = file_loc + 'data/'

image = os.listdir(image_loc)

print('There are ' + str(len(image)) + ' TFRecord files in the data directory')

#There are 256 TFRecord files, each holding many x-ray images

In [None]:
img = [image_loc + x for x in image]

file_name = tf.io.gfile.glob(img)

Randomly split the list of TFRecord files 80/20 into a train+validation set and a test set, then randomly set aside 10% of the train+validation files as a validation set.
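
The split described above can be sketched with the standard library alone. This is a minimal illustration under assumptions: `split_indices` and its `seed` parameter are hypothetical (the notebook cell below does not seed its sampling), and 256 is the number of TFRecord files reported earlier.

```python
import random


def split_indices(n_items, train_valid_frac=0.8, train_frac=0.9, seed=0):
    """Split range(n_items) into train/valid/test index lists (hypothetical helper)."""
    rng = random.Random(seed)  # seeding makes the split repeatable between runs
    all_idx = list(range(n_items))
    # 80% of the files go to train+validation, the rest to test.
    train_valid = rng.sample(all_idx, int(n_items * train_valid_frac))
    test = sorted(set(all_idx) - set(train_valid))
    # 90% of train+validation is kept for training, the rest for validation.
    train = rng.sample(train_valid, int(len(train_valid) * train_frac))
    valid = sorted(set(train_valid) - set(train))
    return train, valid, test


train, valid, test = split_indices(256)
print(len(train), len(valid), len(test))  # 183 21 52
```

Fixing the seed is a design choice for reproducibility; without it, every run trains and tests on a different set of files.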

In [None]:
#Define training and test sets (index)
ALL = list(range(len(file_name)))

train_valid = random.sample(ALL, int(len(ALL) * 0.8))
test_index = list(set(ALL) - set(train_valid))

train_index = random.sample(train_valid, int(len(train_valid) * 0.9))
valid_index = list(set(train_valid) - set(train_index))

In [None]:
#Define training and test image file names
TRAINING_FILENAMES, VALID_FILENAMES, TEST_FILENAMES = [file_name[index] for index in train_index], [file_name[index] for index in valid_index], [file_name[index] for index in test_index]
TRAINING_FILENAMES

In [None]:
print("Train TFRecord Files:", len(TRAINING_FILENAMES))
print("Validation TFRecord Files:", len(VALID_FILENAMES))
print("Test TFRecord Files:", len(TEST_FILENAMES))

### Reducing Image Dimensionality

In [None]:
feature_description = {}

for elem in list(df.columns)[2:]:
    feature_description[elem] = tf.io.FixedLenFeature([], tf.int64)
    
feature_description['image'] = tf.io.FixedLenFeature([], tf.string)

Here we reduce the image size to 50 x 50 pixels.
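
To see how much this resizing shrinks the input, the arithmetic can be worked out directly. The sketch assumes 3 color channels (as requested when the JPEGs are decoded below) and the 600 x 600 source size stated in the introduction; `flattened_size` is a hypothetical helper.

```python
def flattened_size(height, width, channels=3):
    """Number of values per image once it is flattened to one row."""
    return height * width * channels


original = flattened_size(600, 600)  # size of a full-resolution image
reduced = flattened_size(50, 50)     # size after resizing to 50 x 50
print(original, reduced, original // reduced)  # 1080000 7500 144
```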

In [None]:
BATCH_SIZE = 32
IMAGE_ONE_AXIS = 50
IMAGE_SIZE = [IMAGE_ONE_AXIS, IMAGE_ONE_AXIS]
AUTOTUNE = tf.data.experimental.AUTOTUNE

In [None]:
# Functions to read the data
def read_tfrecord(example):
    example = tf.io.parse_single_example(example, feature_description)
    image = tf.io.decode_jpeg(example["image"], channels=3)
    image = tf.image.resize(image, IMAGE_SIZE)
    image = tf.cast(image, tf.float32) / 255.0
    
    label = []
    
    for val in heads:
        label.append(example[val])
    
    return image, label

In [None]:
def load_dataset(filenames):
    ignore_order = tf.data.Options()
    ignore_order.experimental_deterministic = False
    dataset = tf.data.TFRecordDataset(filenames)
    dataset = dataset.with_options(ignore_order)
    dataset = dataset.map(read_tfrecord)
    
    return dataset

In [None]:
def get_dataset(filenames):
    dataset = load_dataset(filenames)
    dataset = dataset.shuffle(2048)
    # Batch first, then prefetch, so whole batches are prepared ahead of the training step
    dataset = dataset.batch(BATCH_SIZE)
    dataset = dataset.prefetch(buffer_size=AUTOTUNE)
    
    return dataset

In [None]:
#Define train, valid, and test datasets
train_dataset = get_dataset(TRAINING_FILENAMES)
valid_dataset = get_dataset(VALID_FILENAMES)
test_dataset = get_dataset(TEST_FILENAMES)

In [None]:
#Show the images for the training set to visualize
image_viz, label_viz = next(iter(train_dataset))

def show_batch(X, Y):
    plt.figure(figsize=(20, 20))
    for n in range(25):
        ax = plt.subplot(5, 5, n + 1)
        plt.imshow(X[n])
        
        result = [x for i, x in enumerate(heads) if Y[n][i]]
        title = "+".join(result)
        
        if result == []: title = "No Finding"
        
        plt.title(title)
        plt.axis("off")

show_batch(image_viz.numpy(), label_viz.numpy())

In [None]:
image_viz.numpy()

In [None]:
label_viz.numpy()

## PCA Model

In [None]:
from PIL import Image
from IPython.display import display
display(image)

In [None]:
'''# Read and print data:
sess=tf.compat.v1.InteractiveSession()

# Read TFRecord file
reader = tf.compat.v1.TFRecordReader()
#tf.compat.v1.python_io
filename_queue = tf.train.string_input_producer(['180-438.tfrec'])
_, serialized_example = reader.read(filename_queue)

# Define features
read_features = {
    'image/height': tf.FixedLenFeature([], dtype=tf.int64),
    'image/width': tf.FixedLenFeature([], dtype=tf.int64),
    'image/colorspace': tf.FixedLenFeature([], dtype=tf.string),
    'image/class/label': tf.FixedLenFeature([], dtype=tf.int64),
    'image/class/raw': tf.FixedLenFeature([], dtype=tf.int64),
    'image/class/source': tf.FixedLenFeature([], dtype=tf.int64),
    'image/class/text': tf.FixedLenFeature([], dtype=tf.string),
    'image/format': tf.FixedLenFeature([], dtype=tf.string),
    'image/filename': tf.FixedLenFeature([], dtype=tf.string),
    'image/id': tf.FixedLenFeature([], dtype=tf.int64),
    'image/encoded': tf.FixedLenFeature([], dtype=tf.string)
}

# Extract features from serialized data
read_data = tf.parse_single_example(serialized=serialized_example,
                                features=read_features)

# Many tf.train functions use tf.train.QueueRunner,
# so we need to start it before we read
tf.train.start_queue_runners(sess)

# Print features
for name, tensor in read_data.items():
    print('{}: {}'.format(name, tensor.eval()))
    '''

## Tree Classification Model (Machine Learning)

In [None]:
# Import libraries
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics

In [None]:
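#Define y as response (No Finding category)
#Take the labels from the same shuffled batch as the images (label_viz),
#so that every label belongs to its image; the first 32 rows of the CSV do not
no_finding = (label_viz.numpy().sum(axis=1) == 0).astype(int)
y = pd.DataFrame({'No Finding': no_finding})
y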

In [None]:
#define y as numpy array
y = y.values
y

In [None]:
# Define x as explanatory variables
x = image_viz

In [None]:
#Check shape of x
x.shape

In [None]:
#Check shape of y
y.shape

In [None]:
#convert x tensor to a NumPy array
x = image_viz.numpy()

# x has shape (32, 50, 50, 3): batch, height, width, channels

In [None]:
x.shape

In [None]:
#reshape x from 4D to 2D
reshaped = x.reshape(32, 7500)
reshaped.shape

In [None]:
# define the explanatory data as newx
newx=reshaped

In [None]:
#Check shape
print(newx.shape)
print(y.shape)

In [None]:
# Split the train and test sets
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(newx,y, test_size = 0.2, random_state = 4)
print('Train set:', x_train.shape)
print('Test set:', x_test.shape)
print('Train set:', y_train.shape)
print('Test set:', y_test.shape)

In [None]:
# Decision Tree classifier
tree_clf = DecisionTreeClassifier(max_depth=4, random_state=42)
tree_clf.fit(x_train, y_train)

In [None]:
#Accuracy evaluation
y_predict=tree_clf.predict(x_test)

print("Train set Accuracy: ", metrics.accuracy_score(y_train, tree_clf.predict(x_train)))
print("Test set Accuracy: ", metrics.accuracy_score(y_test, y_predict))

Using a decision tree classifier to label each image as No Finding or disease found gave a training set accuracy of 100%. The test set accuracy was only about 42%, so the model overfits and is not good at predicting disease from unseen x-ray images. The tree was fit on a single batch of 32 images, so the test split is very small and this estimate is noisy.
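
The accuracy figures above rest on very few images, which limits how precisely they can be read. The sketch below works out the split sizes for a 32-image batch with `test_size = 0.2`, assuming scikit-learn rounds the test share up (which is our understanding of `train_test_split`), and lists every accuracy value such a test set can produce.

```python
import math

n_samples = 32   # one batch of images, as used for the tree
test_size = 0.2  # same fraction as in the split above

# Assumption: the test share is rounded up and the rest is used for training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)  # 25 7

# With n_test images, accuracy can only take the values k / n_test.
possible = [round(k / n_test, 3) for k in range(n_test + 1)]
print(possible)
```

One extra correct or incorrect prediction therefore moves the test accuracy by about 14 percentage points.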

In [None]:
#Confusion matrix
from sklearn.metrics import confusion_matrix
cm=confusion_matrix(y_test,y_predict)
print(cm)

## CNN Model

In [None]:
initial_learning_rate = 0.01
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate, decay_steps=5, decay_rate=0.96, staircase=True
)
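
The schedule above can be reproduced by hand to see how quickly the learning rate shrinks. This is a sketch that assumes the staircase form `initial_lr * decay_rate ** floor(step / decay_steps)`; `staircase_decay` is a hypothetical helper, not a Keras function.

```python
def staircase_decay(step, initial_lr=0.01, decay_steps=5, decay_rate=0.96):
    """Learning rate at a given optimizer step under staircase exponential decay."""
    return initial_lr * decay_rate ** (step // decay_steps)


# The rate stays flat for 5 steps, then drops by 4% at every 5th step.
for step in (0, 4, 5, 50, 500):
    print(step, staircase_decay(step))
```

With `decay_steps=5`, the rate after 500 steps is already below 2% of its starting value, so most of training runs with a very small step size.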


In [None]:
def define_model(in_shape=(IMAGE_SIZE[0], IMAGE_SIZE[1], 3), out_shape=len(heads)):
    model = Sequential()
    model.add(Conv2D(32, (3, 3), activation='relu', kernel_initializer='he_uniform', padding='same', input_shape=in_shape))
    model.add(Conv2D(32, (3, 3), activation='relu', kernel_initializer='he_uniform', padding='same'))
    model.add(MaxPooling2D((2, 2)))
    model.add(Conv2D(64, (3, 3), activation='relu', kernel_initializer='he_uniform', padding='same'))
    model.add(Conv2D(64, (3, 3), activation='relu', kernel_initializer='he_uniform', padding='same'))
    model.add(MaxPooling2D((2, 2)))
    model.add(Conv2D(128, (3, 3), activation='relu', kernel_initializer='he_uniform', padding='same'))
    model.add(Conv2D(128, (3, 3), activation='relu', kernel_initializer='he_uniform', padding='same'))
    model.add(MaxPooling2D((2, 2)))
    model.add(Flatten())
    model.add(Dense(128, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dense(out_shape, activation='sigmoid'))

    model.compile(optimizer=tf.keras.optimizers.Adadelta(learning_rate=lr_schedule),
                  loss='binary_crossentropy',
                  metrics=[tf.keras.metrics.AUC(name="auc")])
    return model

In [None]:
train_size = sum(1 for _ in tf.data.TFRecordDataset(TRAINING_FILENAMES))
validation_size = sum(1 for _ in tf.data.TFRecordDataset(VALID_FILENAMES))

epoch_steps = int(np.ceil(train_size/BATCH_SIZE))
validation_steps = int(np.ceil(validation_size/BATCH_SIZE))

epochs = 5

print("steps_per_epoch: " + str(epoch_steps))
print("validation_steps: " + str(validation_steps))

In [None]:
model = define_model()

history = model.fit(
    train_dataset,
    epochs=epochs,
    validation_data=valid_dataset,
    validation_steps = validation_steps
)

In [None]:
_, test_auc = model.evaluate(test_dataset, verbose=0)

print('Test auc:', test_auc)

The CNN model reaches a test AUC of about 0.70 when predicting x-ray images. The reported metric is AUC (area under the ROC curve), not accuracy.
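
The score printed above is an AUC, which measures how well the model ranks positive cases above negative ones and is not the same thing as accuracy. The sketch below computes an exact pairwise AUC on made-up scores to show that the two can differ; `roc_auc` and `accuracy` are hypothetical helpers, and Keras approximates AUC with a threshold grid rather than computing it this way.

```python
def roc_auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Share of correct predictions after thresholding the scores."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Made-up example: the ranking is perfect, yet the positive falls below 0.5.
labels = [0, 0, 0, 1]
scores = [0.1, 0.2, 0.3, 0.4]
print(roc_auc(labels, scores), accuracy(labels, scores))  # 1.0 0.75
```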

## Conclusions

Conclusion:

The machine learning model we used is a decision tree classifier.

For the final tree classifier we used only one category, “No Finding”, as the response. The values in this column state whether or not a disease was detected, so the model makes a binary prediction instead of covering all 15 categories.

The tree classifier reached 100% accuracy on the training set, while the test set accuracy was far lower (about 27% in one run and about 42% in the run reported above; it varies because the 32-image batch is drawn at random). This suggests overfitting: it is unlikely that a model could reach 100% training accuracy and such poor test accuracy unless it had memorized the training images.


The deep learning model we used is a CNN.

The CNN reached a test AUC of about 0.74 when classifying the disease images (about 0.70 in the run reported above). AUC and accuracy are not directly comparable, but this suggests that the deep learning model (CNN) performed better than the decision tree.

While working on the project we ran into some issues, particularly with PCA.
We tried to apply PCA, but we encountered problems when converting the image files from TFRecord (.tfrec) to JPEG.

Further Improvement:

We could implement a PCA model and use all of the categories to see how the accuracy changes for both the tree classifier and the CNN.
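
One possible route to PCA that avoids converting files to JPEG is to apply it to image arrays that are already decoded and flattened. The sketch below is only an illustration under assumptions: random numbers stand in for a batch of 32 flattened 50 x 50 x 3 images, and `pca_fit_transform` is a hypothetical helper built on NumPy's SVD, not something we ran on the x-ray data.

```python
import numpy as np


def pca_fit_transform(X, n_components):
    """Project rows of X onto their top principal components (hypothetical helper)."""
    X = np.asarray(X, dtype=np.float64)
    centered = X - X.mean(axis=0)  # PCA requires mean-centered features
    # Rows of Vt are the principal directions, ordered by singular value.
    _, S, Vt = np.linalg.svd(centered, full_matrices=False)
    components = Vt[:n_components]
    explained_ratio = (S ** 2) / np.sum(S ** 2)
    return centered @ components.T, components, explained_ratio[:n_components]


# Stand-in data: 32 "images", each flattened to 50 * 50 * 3 = 7500 values.
rng = np.random.default_rng(0)
fake_batch = rng.random((32, 50 * 50 * 3))

projected, components, ratio = pca_fit_transform(fake_batch, n_components=10)
print(projected.shape, components.shape)  # (32, 10) (10, 7500)
```

The projected rows could then replace the 7500 raw pixel values as input to the tree classifier.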



Reference: 
https://www.kaggle.com/hemanthhari/cv-hemanth
