# X-Ray Binary Classification

This project is used for generating x-ray binary classifications. 
The dataset is coming from X-Ray Chest Images by Tolga
More detail about dataset: https://www.kaggle.com/datasets/tolgadincer/labeled-chest-xray-images

The objective about this project is building based model from CNN about image classifications that classifying an image has pneumonia or not.

Talking about pneumonia, you can read more detail about this on this article:


In [37]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plta
import tensorflow as tf
import requests
import os
import zipfile
import shutil

from tensorflow.keras.preprocessing.image import ImageDataGenerator, load_img, img_to_array, array_to_img
from tensorflow.keras import models, layers

# Importing Dataset

Download and unzip the datasets

In [4]:
# Alternatif Download from Github -> WIP
# data = requests.get("")

Extract data into data path

In [6]:
extracted_data_path = 'data/'
zip_path = 'data/archive.zip'

with zipfile.ZipFile(zip_path, 'r') as zipfile:
    zipfile.extractall(extracted_data_path)

print(f"Data was extracted into {extracted_data_path} path") 

Data was extracted into data/ path


Finding the directory informations

In [None]:
dataset_path = 'data/chest_xray'

for dirpath, _, filenames in os.walk(dataset_path):
    folder_name = os.path.basename(dirpath)
    file_count = len([f for f in filenames if not f.startswith(".")]) 
    print(f"Folder: {folder_name}, Banyaknya berkas: {file_count}")

Folder: chest_xray, Banyaknya berkas: 0
Folder: test, Banyaknya berkas: 0
Folder: PNEUMONIA, Banyaknya berkas: 390
Folder: NORMAL, Banyaknya berkas: 234
Folder: train, Banyaknya berkas: 0
Folder: PNEUMONIA, Banyaknya berkas: 3883
Folder: NORMAL, Banyaknya berkas: 1349


## Problem - Imbalenced datasets

In the directory informations, we found that data training with NORMAL labels has smaller than pneumonia. 
This means, the training has imbalanced datasets. 

This problems need to be solved because we are facing the health issues. 
The models need to classify the minority class, which means the models need to good for classifying the NORMAL cases.

**The impact is big if it's wrong**

We are going to implement simple data augmentations with focusing on data rotations around 10 degrees.

### Minority IDG

In [17]:
# Seperate the data based on labels -> only choose Normal (this is the minority)
minority_aug = ImageDataGenerator(
    rescale=1./255,
    rotation_range=10,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1,
    brightness_range=[0.8, 1.2]
)

In [28]:
normal_dir = './data/chest_xray/train/NORMAL'
aug_dir = './data/chest_xray/train/NORMAL_augmented'
os.makedirs(aug_dir,exist_ok=True)

In [31]:
images = os.listdir(normal_dir)
target_total = len(images) * 3 #Stopper bcs the objective is multiply the dataset.
generated = 0

for img_name in images:
    img_path = os.path.join(normal_dir, img_name)
    img = load_img(img_path, target_size=[150,150])
    img_array = img_to_array(img)
    img_array = img_array.reshape((1,) + img_array.shape) # dibutuhkan karena kebutuhan .flow nanti formatnya (1,150,150,3)
    
    for batch in minority_aug.flow(img_array, batch_size=1, save_to_dir=aug_dir, save_prefix='aug', save_format='jpeg'):
        generated += 1
        if generated >= target_total:
            break
    if generated >= target_total:
        break

### Gabungkan dataset yang sudah augmented dengan direktori asli.

In [None]:
for fname in os.listdir(aug_dir):
    shutil.move(os.path.join(aug_dir, fname), os.path.join(normal_dir, fname))

print(f"Sebanyak {len(os.listdir(normal_dir))} data ditemukan pada berkas {normal_dir}")

Sebanyak 4693 data ditemukan.


# Training

## Build ImageDataGenerator

In [44]:
# Tanpa augmentasi biar ga over.
train_datagen = ImageDataGenerator(
    rescale= 1/.255,
    validation_split=0.2
)

test_datagen = ImageDataGenerator(
    rescale=1/.255,
)

## Build Generator for training & Validation

In [46]:
train_dir = './data/chest_xray/train'
test_dir = './data/chest_xray/test'

train_gen = train_datagen.flow_from_directory(
    train_dir,
    subset='training',
    class_mode='binary',
    target_size=[150,150],
    batch_size=32,
    shuffle=True
)

val_gen = train_datagen.flow_from_directory(
    train_dir,
    subset='validation',
    target_size=[150,150],
    batch_size=32,
    class_mode='binary',
    shuffle=True
)

test_gen = test_datagen.flow_from_directory(
    test_dir,
    target_size=[150,150],
    batch_size=32,
    class_mode='binary',
    shuffle=False # because important to evaluate.
)

Found 6862 images belonging to 3 classes.
Found 1714 images belonging to 3 classes.
Found 624 images belonging to 2 classes.


## Membangun Arsitektur

In [None]:
model = models.Sequential([
    layers.Conv2D(32, (3,3), activation='relu', input_shape=[150,150,1]),
    layers.MaxPooling2D(2,2),
    layers.Conv2D(64, (3,3), activation='relu'),
    layers.MaxPooling2D(2,2),
    layers.Conv2D(128, (3,3), activation='relu'),
    layers.MaxPooling2D(2,2),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

  super().__init__(activity_regularizer=activity_regularizer, **kwargs)


In [49]:
history = model.fit(
    train_gen,
    validation_data=val_gen,
    epochs=20
)

Epoch 1/50


  self._warn_if_super_not_called()


[1m215/215[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m52s[0m 237ms/step - accuracy: 0.1323 - loss: -35787576.0000 - val_accuracy: 0.1179 - val_loss: -1252522112.0000
Epoch 2/50
[1m215/215[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m48s[0m 223ms/step - accuracy: 0.1818 - loss: -12693776384.0000 - val_accuracy: 0.0496 - val_loss: -92014493696.0000
Epoch 3/50
[1m215/215[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m52s[0m 241ms/step - accuracy: 0.2283 - loss: -275902791680.0000 - val_accuracy: 0.0158 - val_loss: -820350812160.0000
Epoch 4/50
[1m215/215[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m45s[0m 209ms/step - accuracy: 0.2078 - loss: -2133038071808.0000 - val_accuracy: 0.0123 - val_loss: -3631139258368.0000
Epoch 5/50
[1m215/215[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m46s[0m 212ms/step - accuracy: 0.2326 - loss: -8343480434688.0000 - val_accuracy: 0.0088 - val_loss: -10711380000768.0000
Epoch 6/50
[1m215/215[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1

KeyboardInterrupt: 

In [None]:
loss, acc = model.evaluate(test_gen)
print(f"Test accuracy: {acc: .4f}")

In [None]:
plt.plot(history.history['accuracy'], label='train_acc')
plt.plot(history.history['val_accuracy'], label='val_acc')
plt.legend()
plt.title("Accuracy per epoch")
plt.show()