# Introduction

Thanks to Kaggle for providing this dataset.

The purpose of my work is to build a machine learning model that can predict whether a person has corona from their chest X-ray. This can assist doctors when they diagnose corona.

The dataset includes X-ray images of patients with corona from multiple sources: viruses such as COVID-19 and SARS, bacteria such as Streptococcus, and stress/smoking-related conditions such as ARDS.

I get a high validation accuracy (about 0.96) on a hold-out split of the training data. However, applying the model with the highest validation accuracy to the test data gives only about 0.75 accuracy.

# Import initial libraries

In [None]:

import numpy as np # linear algebra
import pandas as pd # data processing
import os
import tensorflow as tf

# Load the data

In [None]:
df = pd.read_csv("/kaggle/input/coronahack-chest-xraydataset/Chest_xray_Corona_Metadata.csv")

## View the data
From viewing the data, I find that Dataset_type has 2 values, TRAIN and TEST, so I split the rows into df_train and df_test.

In [None]:
df

In [None]:
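#Splitting train and test data (.copy() makes independent frames, so later column assignments do not raise SettingWithCopyWarning)
df_train = df[df["Dataset_type"] == "TRAIN"].copy()
df_test = df[df["Dataset_type"] == "TEST"].copy()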

# Exploratory Data Analysis/Data cleaning
It's important to understand the data :)


## Missing values
From the missing-value graph, we see that most of the virus names (Label_2_Virus_category) are missing, and some of the virus categories (Label_1_Virus_category) are missing.

There are 1576 missing values in Label_1_Virus_category. That just means the person is normal, so we can fill them with "None". For those same rows we can also set Label_2_Virus_category to "None".


In [None]:
import missingno
missingno.matrix(df, figsize = (30,10))


In [None]:
df.isnull().sum() #check for number of null values

In [None]:
df['Label_1_Virus_category']= df['Label_1_Virus_category'].fillna("None")
df_train['Label_1_Virus_category']= df_train['Label_1_Virus_category'].fillna("None")
df_test['Label_1_Virus_category']=df_test['Label_1_Virus_category'].fillna("None")

In [None]:
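#Use .loc so the assignment is written to the dataframe itself; df[mask][col] = value only changes a temporary copy
df.loc[df['Label_1_Virus_category'] == "None", "Label_2_Virus_category"] = "None"
df_train.loc[df_train['Label_1_Virus_category'] == "None", "Label_2_Virus_category"] = "None"
df_test.loc[df_test['Label_1_Virus_category'] == "None", "Label_2_Virus_category"] = "None"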
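
One thing to watch for in the cell above is whether the assignment actually reaches the dataframe. Here is a small sketch on a made-up frame (the column names mirror the metadata, the rows are invented) showing mask-based assignment through `.loc`, which selects and assigns in one step:

```python
import pandas as pd

# Toy frame: column names mirror the metadata file, the rows are invented.
toy = pd.DataFrame({
    "Label_1_Virus_category": ["None", "Virus", "None", "bacteria"],
    "Label_2_Virus_category": [None, "COVID-19", None, None],
})

# .loc[row_mask, column] selects and assigns in a single step,
# so the write goes to `toy` itself and not to a temporary copy.
normal_rows = toy["Label_1_Virus_category"] == "None"
toy.loc[normal_rows, "Label_2_Virus_category"] = "None"

filled = toy["Label_2_Virus_category"].tolist()
print(filled)
```

Only the rows matched by the mask are changed; the last row keeps its missing value.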

## Check for values in each column
So we have 1576 normal patients to compare with 4334 corona patients.

We don't know the source of corona for most patients; only a few labels are provided in Label_2_Virus_category.

In [None]:
df.Label.value_counts()

In [None]:
df.Dataset_type.value_counts()

In [None]:
print(df_train.Label_2_Virus_category.value_counts())

In [None]:
print(df_test.Label_2_Virus_category.value_counts())

In [None]:
print(df_train.Label_1_Virus_category.value_counts())

In [None]:
print(df_test.Label_1_Virus_category.value_counts())
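
The two classes are imbalanced, which matters when reading accuracy numbers later. As a quick sanity check, here is plain arithmetic on the two counts quoted above (they cover the whole metadata file, not the TRAIN or TEST subset on its own):

```python
# Class counts quoted in the text above.
n_normal = 1576
n_corona = 4334

total = n_normal + n_corona
# Accuracy of a model that always predicts the majority class.
majority_baseline = max(n_normal, n_corona) / total
print(f"{total} images, majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any accuracy should be read against that baseline. The baseline for the TEST subset alone may differ and can be computed the same way from `df_test.Label.value_counts()`.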

# Load images
In short, I load the images and put them into NumPy arrays to feed to the CNN model.
I resize every image to 200x200, which can create some potential problems when dealing with test data.


In [None]:
#get test and train dir
test_img_dir = '/kaggle/input/coronahack-chest-xraydataset/Coronahack-Chest-XRay-Dataset/Coronahack-Chest-XRay-Dataset/test'
train_img_dir = '/kaggle/input/coronahack-chest-xraydataset/Coronahack-Chest-XRay-Dataset/Coronahack-Chest-XRay-Dataset/train'

In [None]:
#Here I am loading the file names of all the images
image_train = sorted(os.listdir(train_img_dir))
#Sort the metadata by file name too, so rows and images are in the same order
df_train = df_train.sort_values("X_ray_image_name")

image_test = sorted(os.listdir(test_img_dir))
df_test = df_test.sort_values("X_ray_image_name")

train_images_name = df_train["X_ray_image_name"]
test_images_name = df_test["X_ray_image_name"]

In [None]:
#Now I am building the numpy array for train images
import cv2
TrainImages = []
for i in image_train:
    if i in train_images_name.values:
        img = cv2.imread(train_img_dir+'/'+i)
        #If I don't resize to (200,200), the images don't fit in memory. It's also good to have all the images in the same size.
        img = cv2.resize(img, (200,200))
        TrainImages.append(img)
TrainImages= np.array(TrainImages)
TrainImages.shape

In [None]:
#I build the numpy array for test images
TestImages = []
for i in image_test:
    if i in test_images_name.values:
        img = cv2.imread(test_img_dir+'/'+i)
        img = cv2.resize(img, (200,200))    
        TestImages.append(img)
TestImages= np.array(TestImages)
TestImages.shape
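
The loops above take images in sorted file-name order, while the labels are later read from the sorted dataframes, so the two only line up if every name in the metadata has exactly one file on disk. A minimal sketch of a check for that, where `check_alignment` is a hypothetical helper and not part of the original notebook:

```python
def check_alignment(file_names, metadata_names):
    """Compare image files on disk with the names listed in the metadata.

    Returns (missing, aligned): the metadata names with no file on disk, and
    whether the filtered, sorted file list matches the metadata order exactly.
    """
    wanted = set(metadata_names)
    loaded = [name for name in sorted(file_names) if name in wanted]
    on_disk = set(file_names)
    missing = [name for name in metadata_names if name not in on_disk]
    aligned = loaded == list(metadata_names)
    return missing, aligned

# Hypothetical file names, for illustration only.
files = ["b.jpeg", "a.jpeg", "c.jpeg", "extra.jpeg"]
print(check_alignment(files, ["a.jpeg", "b.jpeg", "c.jpeg"]))
print(check_alignment(files, ["a.jpeg", "b.jpeg", "d.jpeg"]))
```

In the notebook it could be called as `check_alignment(image_train, df_train["X_ray_image_name"].tolist())`, and likewise for the test set. If `aligned` comes back False, the image array and the label array are not in the same order.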

In [None]:
#Let's view some of the images. The image still looks fine after resizing (I don't know how X-ray images work)
import matplotlib.pyplot as plt
plt.figure()
plt.imshow(TrainImages[0])
plt.colorbar()
plt.grid(False)
plt.show()

In [None]:
#Let's view another image. It also still looks fine after resizing
plt.figure()
plt.imshow(TrainImages[1])
plt.colorbar()
plt.grid(False)
plt.show()

# Data Preprocessing
We create NumPy arrays with 1 for pneumonia and 0 for normal.
We also scale the pixel values to the range 0 to 1.

## Create dummy labels

In [None]:
#Create train and test labels for neural network models
#"Pnemonia" is the spelling used in the Label column, so it must be matched exactly
train_labels = df_train["Label"] == "Pnemonia"
train_labels = np.array(train_labels).astype(int)
test_labels = df_test["Label"] == "Pnemonia"
test_labels = np.array(test_labels).astype(int)

## Data Augmentation
Actually, I got lower validation accuracy when using data augmentation. My theory is that X-rays are taken in a fairly standard way, so augmenting them hurts the prediction.

In [None]:
#from keras.preprocessing.image import ImageDataGenerator 
#aug = ImageDataGenerator(rotation_range=20, zoom_range=0.15, width_shift_range=0.2, 
#                         height_shift_range=0.2, shear_range=0.15, 
#                         horizontal_flip=True, fill_mode="nearest")

## Scale images

In [None]:
#Scaling the image
TrainImages = TrainImages/255
TestImages = TestImages/255
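
Since RAM is tight on Kaggle, the dtype of the scaled arrays matters. A small sketch on a random stand-in batch (not the real images) comparing the memory used by the division as written above with a `float32` variant, which is an option the notebook does not use:

```python
import numpy as np

# Stand-in batch: 4 random "images" with the same shape and dtype cv2 produces.
batch = np.random.randint(0, 256, size=(4, 200, 200, 3), dtype=np.uint8)

as_float64 = batch / 255                        # NumPy promotes uint8 / int to float64
as_float32 = batch.astype(np.float32) / 255     # stays float32

for name, arr in [("uint8", batch), ("float64", as_float64), ("float32", as_float32)]:
    print(f"{name:8s} {arr.nbytes / 1e6:.2f} MB")
```

The `float64` result takes 8 times the memory of the original `uint8` images, and the `float32` version takes half of that.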

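## Split into training and validation data

In [None]:
from sklearn.model_selection import train_test_split
#Hold out 20% of the training images for validation (X_test/y_test here are the validation split, not the TEST dataset)
X_train, X_test, y_train, y_test = train_test_split(
    TrainImages, train_labels, test_size=0.2, random_state=0)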
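
Because the classes are imbalanced, a random split can leave the two parts with slightly different class ratios. The sketch below uses made-up data and the `stratify` argument of `train_test_split`, which the notebook does not use, to keep the ratio the same on both sides:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 tiny "images", 75 positive and 25 negative labels.
X = np.zeros((100, 4, 4, 3), dtype=np.float32)
y = np.array([1] * 75 + [0] * 25)

# stratify=y keeps the class ratio the same in both parts of the split.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

print(y_tr.mean(), y_val.mean())
```

Whether this changes the results here is untested; it only removes one source of variation between the training and validation parts.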

# Delete the dataframes to release some memory

Basically, Kaggle has a 13 GB RAM limit, so I delete some dataframes and arrays to save RAM.


In [None]:
import gc
#Drop the names that are no longer needed, then force a garbage collection.
#X_train, X_test, y_train, y_test, TestImages and test_labels are still used below.
del df, df_train, df_test, TrainImages, train_labels, img
del train_images_name, test_images_name, image_train, image_test
gc.collect()
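
What frees memory in Python is dropping the last reference to an object, not deleting a container that happens to hold it. A small sketch with a stand-in class (`Payload` and `demo` are made up for illustration):

```python
import gc
import weakref

class Payload:
    """Stand-in for a large array or dataframe."""

def demo():
    data = Payload()
    alive = weakref.ref(data)      # lets us observe whether the object still exists

    holder = [data]
    del holder                     # removes only the list, `data` still refers to the object
    gc.collect()
    after_list_del = alive() is not None

    del data                       # drops the last reference
    gc.collect()
    after_name_del = alive() is not None
    return after_list_del, after_name_del

print(demo())
```

The object survives the deletion of the list and is only released once the name that refers to it is deleted or rebound.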

In [None]:
df = pd.DataFrame()

# Build the model
For the model, as in other image recognition projects, I use a CNN.

By trying different numbers of Dense nodes, I find that 30 nodes yield better accuracy than 50, 100, or 150 nodes.

From previous experience with image recognition, I use only one hidden Dense layer.

I try 2 convolution layers, each followed by max pooling, and find that this works best.

I also find that 1 epoch works best.



In [None]:
from tensorflow import keras
#Initialize the model
model = keras.Sequential()
#Convolutional layers
model.add(keras.layers.Conv2D(filters = 32, kernel_size = (3, 3),activation='relu', input_shape= (200,200,3)))
model.add(keras.layers.MaxPooling2D(pool_size=(2, 2)))

model.add(keras.layers.Conv2D(filters = 64, kernel_size = (3, 3),activation='relu')) #input_shape is only needed on the first layer
model.add(keras.layers.MaxPooling2D(pool_size = 2, strides=2))
model.add(keras.layers.Dropout(0.5))

#Dense layers
model.add(keras.layers.Flatten())


model.add(keras.layers.Dense(30, activation='relu'))
model.add(keras.layers.Dropout(0.5))
model.add(keras.layers.Dense(2, activation='softmax'))

#Compile the model: choose optimizer, loss and metric
model.compile(optimizer = 'adam', 
              loss='sparse_categorical_crossentropy',
              metrics = ['accuracy'])

In [None]:
#What the model looks like
model.summary()
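
The shapes and parameter counts that `model.summary()` reports can be checked by hand. The sketch below is plain arithmetic for the layers defined above, assuming the Keras defaults that the model relies on ('valid' padding and stride 1 for the convolutions, non-overlapping 2x2 pooling); `conv_out` and `pool_out` are helper names made up for this check:

```python
def conv_out(size, kernel=3):
    # 'valid' convolution with stride 1
    return size - kernel + 1

def pool_out(size, pool=2):
    # non-overlapping max pooling, any leftover row/column is dropped
    return size // pool

side = pool_out(conv_out(pool_out(conv_out(200))))   # 200 -> 198 -> 99 -> 97 -> 48
flat = side * side * 64                              # 64 filters in the second conv layer

# weights + biases per layer
params = {
    "conv1": 3 * 3 * 3 * 32 + 32,
    "conv2": 3 * 3 * 32 * 64 + 64,
    "dense1": flat * 30 + 30,
    "dense2": 30 * 2 + 2,
}
total_params = sum(params.values())
print(side, flat, params, total_params)
```

Almost all of the weights sit in the first Dense layer, because the flattened feature map is so large.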

## Train and Evaluate the model
The best validation accuracy I can get is about 0.97.

In [None]:
#Here I found out that my data augmentation wasn't helping, so the augmented fit is commented out.
#history = model.fit_generator(aug.flow(X_train, y_train, batch_size= 32),
#                    epochs = 6, validation_data= (X_test, y_test))
model.fit(X_train, y_train, epochs = 5,validation_data=(X_test, y_test))

# Use the model on the test data
I get about 0.74 accuracy on the test data, which is a big drop from the validation accuracy. I am not sure why this is the case.

In [None]:
model.evaluate(TestImages, test_labels)
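
A single accuracy number does not show which class the errors come from. One way to look into the gap is a confusion matrix with per-class recall. The sketch below runs on made-up labels and softmax outputs, and `confusion_matrix_2x2` is a hypothetical helper, not something from the notebook:

```python
import numpy as np

def confusion_matrix_2x2(y_true, probs):
    """Rows are true classes (0, 1), columns are predicted classes (0, 1)."""
    y_pred = np.argmax(probs, axis=1)          # softmax output -> class index
    cm = np.zeros((2, 2), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Made-up labels and softmax outputs, for illustration only.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])
probs = np.array([[0.9, 0.1], [0.3, 0.7], [0.4, 0.6], [0.2, 0.8],
                  [0.1, 0.9], [0.3, 0.7], [0.6, 0.4], [0.2, 0.8]])

cm = confusion_matrix_2x2(y_true, probs)
recall_per_class = cm.diagonal() / cm.sum(axis=1)
print(cm)
print(recall_per_class)
```

With the real model, `probs` would be `model.predict(TestImages)` and `y_true` would be `test_labels`. A low recall for one class would suggest that the test errors are concentrated there.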