
In [None]:
# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.
import kagglehub
tawsifurrahman_tuberculosis_tb_chest_xray_dataset_path = kagglehub.dataset_download('tawsifurrahman/tuberculosis-tb-chest-xray-dataset')

print('Data source import complete.')


# Background 🤢
Tuberculosis, or TB, is a highly contagious disease caused by the bacterium *Mycobacterium tuberculosis*. It primarily affects the lungs but can spread to other organs in the body. It is transmitted through the air when an infected person speaks, coughs, or sneezes, and it causes severe coughing, fever, pain, weight loss, night sweats, coughing up blood, and potentially death when left untreated.

As someone who has known people who died from TB, I built a convolutional neural network (CNN) to diagnose patients from chest X-rays, using the "Tuberculosis (TB) Chest X-ray Database" dataset by Kaggle user [Tawsifur Rahman](https://www.kaggle.com/tawsifurrahman). The goal is to help save lives by keeping patients with TB from going untreated and developing the symptoms mentioned above.

This dataset contains 3500 chest X-rays of patients without tuberculosis and 700 of patients with tuberculosis. They all look similar to the image below, and I will be using them to train this CNN.

![](https://storage.googleapis.com/kagglesdsdata/datasets/891819/2332307/TB_Chest_Radiography_Database/Normal/Normal-1019.png?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=databundle-worker-v2%40kaggle-161607.iam.gserviceaccount.com%2F20240713%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20240713T160233Z&X-Goog-Expires=345600&X-Goog-SignedHeaders=host&X-Goog-Signature=81e3c3b9b9fa4dfc6654ee819f45aa2a24fd64e87130250359a86f23723efcc10185744e2eb48faf4f82833472a62c4bb2488ed8452ad9bc8ae1e35d46b12e35e845e4ee5be597e3d326fae4b3a5b9e6c0afa958c0d405ae07f03ee0ed70b144d9f2859a8f37529cf2190f5d79df12587c1ad049075555cce2e0561289cacb6c4a9f1a20905092913a2879c439119184186d2f0dc115dd057413f36ec40fbdedaafdac108c0cc62ec31102803194787ac4897ba10359e3422f919932ff72e9e68a936b2c85add5b102f2490b7c4fb1991a4b2957943a0ac2e761c32ce32fef9c0eba91399b394fa799c73017847b9205231fa9f083875579a6da2cd890c31355)

# Image Processing 🖼️
The main problem with the data I wanted to fix was the class imbalance (3500 Normal images vs. 700 TB images).
In the code below, I first read all the image files with OpenCV, then transform them into a format suitable for SMOTE upsampling, which evens out the class counts in the training set.
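
To give a rough idea of what SMOTE does under the hood, here is a minimal NumPy sketch of its core step: picking a minority sample, picking one of its nearest minority neighbours, and creating a new point somewhere on the line between them. This is only an illustration, not imbalanced-learn's actual implementation; the toy data and the helper name `smote_like_sample` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy minority class: 5 samples with 4 features each
# (hypothetical stand-ins for flattened TB images).
minority = rng.random((5, 4))

def smote_like_sample(samples, k=2, rng=rng):
    """Create one synthetic sample between a random sample and one of its k nearest neighbours."""
    base = samples[rng.integers(len(samples))]
    # Euclidean distance from the chosen sample to every sample.
    dists = np.linalg.norm(samples - base, axis=1)
    # Skip position 0 of the sorted order: that is the sample itself (distance 0).
    neighbour_ids = np.argsort(dists)[1:k + 1]
    neighbour = samples[rng.choice(neighbour_ids)]
    # Pick a random point on the segment between the two samples.
    lam = rng.random()
    return base + lam * (neighbour - base)

synthetic = smote_like_sample(minority)
print(synthetic.shape)  # (4,)
```

Because every synthetic sample is a blend of two real minority samples, the new "images" stay inside the range of the real ones rather than being exact copies.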

In [None]:
#Importing the necessary libraries:
import cv2 as cv
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
import os

In [None]:
#Initializing the values needed for all the image files
normaldir = '/kaggle/input/tuberculosis-tb-chest-xray-dataset/TB_Chest_Radiography_Database/Normal'
tbdir = '/kaggle/input/tuberculosis-tb-chest-xray-dataset/TB_Chest_Radiography_Database/Tuberculosis'
images = []
labels = []
imagesize = 256

In [None]:
#Reading every image as grayscale, resizing it to imagesize x imagesize, and storing it in the 'images' list, with a matching label in 'labels': 1 for TB images, 0 for normal images.
for x in os.listdir(normaldir):
    imagedir = os.path.join(normaldir, x)
    image = cv.imread(imagedir, cv.IMREAD_GRAYSCALE)
    image = cv.resize(image, (imagesize, imagesize))
    images.append(image)
    labels.append(0)

for y in os.listdir(tbdir):
    imagedir = os.path.join(tbdir, y)
    image = cv.imread(imagedir, cv.IMREAD_GRAYSCALE)
    image = cv.resize(image, (imagesize, imagesize))
    images.append(image)
    labels.append(1)

In [None]:
#Converting to NumPy arrays, which support the splitting, reshaping and vectorized math used below
images = np.array(images)
labels = np.array(labels)

#Splitting the images and labels into training (70%) and testing (30%) sets, then scaling the pixel values from the 0-255 range to the 0-1 range, which helps training converge
imagetrain, imagetest, labeltrain, labeltest = train_test_split(images, labels, test_size=0.3, random_state=42)
imagetrain = (imagetrain.astype('float32'))/255
imagetest = (imagetest.astype('float32'))/255

In [None]:
#Flattening each training image into a 1D vector, giving a 2D array of [number of images] x [imagesize*imagesize pixels] (2940 images here), since SMOTE expects 2D input
imagetrain = imagetrain.reshape(len(imagetrain), -1)

#Performing oversampling
smote = SMOTE(random_state=42)
imagetrain, labeltrain = smote.fit_resample(imagetrain, labeltrain)

#Unflattening the images now to use them for convolutional neural network (4914 images of 256x256 size, with 1 color channel (grayscale, as compared to RGB with 3 color channels))
imagetrain = imagetrain.reshape(-1, imagesize, imagesize, 1)
print(imagetrain.shape)

In [None]:
#Classes balanced - equal counts of each label
print(np.unique(labeltrain, return_counts=True))
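
The flatten/unflatten round trip used around SMOTE does not lose any pixel data. A small sketch on a made-up batch of 3 random "images" (not the real dataset) shows the shapes at each step:

```python
import numpy as np

imagesize = 256

# Hypothetical stand-in for a batch of 3 normalized grayscale images.
batch = np.random.default_rng(0).random((3, imagesize, imagesize)).astype('float32')

# One row per image, all pixels in a single vector (the layout SMOTE expects).
flat = batch.reshape(len(batch), -1)

# Back to image form, with a trailing channel axis for the CNN.
restored = flat.reshape(-1, imagesize, imagesize, 1)

print(flat.shape)      # (3, 65536)
print(restored.shape)  # (3, 256, 256, 1)
print(np.array_equal(restored[..., 0], batch))  # True
```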

# CNN Time 🧠
Here I use the Keras Sequential API (part of TensorFlow) to build a CNN that aims to diagnose the patients in the testing set with high accuracy.

In [None]:
#Importing the necessary libraries
import tensorflow as tf
import keras
from keras import layers
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

In [None]:
#The CNN has 3 convolutional layers (3x3 filters, ReLU activation), each followed by max pooling to downsample the feature maps. The filter count starts at 16 and doubles at each layer (16, 32, 64), a common convention that lets deeper layers learn more features as the spatial resolution shrinks.
cnn = keras.Sequential(
    [
    #Input layer, same shape as all the images (256x256x1):
    keras.Input(shape=(imagesize, imagesize, 1)),

    #1st convolutional layer:
    Conv2D(16, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),

    #2nd convolutional layer:
    Conv2D(32, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),

    #3rd convolutional layer:
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),

    #Flattening layer for the dense layers:
    Flatten(),

    #1st dense layer following the convolutional layers:
    Dense(64, activation='relu'),

    #Dropout layer with a heavy dropout rate (50%) to reduce overfitting
    Dropout(0.5),

    #Output layer with sigmoid activation, which maps each image to a probability between 0 and 1 (thresholded later to get the 0 or 1 label)
    Dense(1, activation='sigmoid')
    ]
)
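
To see how the image shrinks on its way through the network, the layer output sizes can be worked out by hand. The sketch below assumes the Keras defaults used above: `Conv2D` with `'valid'` padding and stride 1, and `MaxPooling2D` with a stride equal to its pool size. The helper names `conv_out` and `pool_out` are made up for this example.

```python
def conv_out(size, kernel=3):
    # 'valid' padding, stride 1: the filter cannot hang over the edge.
    return size - kernel + 1

def pool_out(size, pool=2):
    # 2x2 pooling with stride 2 halves the size, rounding down.
    return size // pool

size = 256
channels = 1
for filters in (16, 32, 64):
    size = pool_out(conv_out(size))
    channels = filters
    print(f"after Conv2D({filters}) + MaxPooling2D: {size}x{size}x{channels}")

flattened = size * size * channels
# Weights plus biases of the Dense(64) layer that follows Flatten().
dense_params = flattened * 64 + 64
print(flattened)     # 57600
print(dense_params)  # 3686464
```

Most of the model's parameters sit in that first dense layer, which is one reason the dropout layer is placed right after it. Calling `cnn.summary()` should show the same shapes.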

In [None]:
#Compiling the model with parameters best suited for the task at hand:
cnn.compile(
    loss='binary_crossentropy', #Best for binary classification
    optimizer = keras.optimizers.Adam(learning_rate=0.001), #Adam's default learning rate, a common starting point
    metrics=['accuracy'], #Tracking accuracy during training and evaluation
)

In [None]:
#Setting up the ReduceLROnPlateau callback, which lowers the learning rate whenever the monitored metric (here, training accuracy) stops improving, so the optimizer takes smaller steps
#patience=1 means the reduction is triggered after a single epoch without improvement, since only 10 epochs are used
#factor=0.1 means each reduction multiplies the learning rate by 0.1 (dividing it by 10); min_lr=0.00001 keeps it from going any lower
from keras.callbacks import ReduceLROnPlateau
reduce_lr = ReduceLROnPlateau(monitor='accuracy', factor=0.1, patience=1, min_lr=0.00001, verbose=1)

#Fitting the model w/ the callback. In VS Code, a batch size of 16 makes each epoch take around a minute w/ good accuracy, so the whole training process takes about 10 min. On Kaggle it should take longer due to fewer computational resources:
cnn.fit(imagetrain, labeltrain, batch_size=16, epochs=10, verbose=2, callbacks = [reduce_lr])
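
To make the callback's behaviour more concrete, here is a simplified pure-Python simulation of the schedule with the same settings (`factor=0.1`, `patience=1`, `min_lr=0.00001`). It is a sketch of the idea only: the real Keras callback also has `min_delta` and `cooldown` options that are ignored here, and the accuracy values are made up, not taken from this model's training run.

```python
def simulate_reduce_lr(accuracies, lr=0.001, factor=0.1, patience=1, min_lr=0.00001):
    """Return the learning rate in effect after each epoch's accuracy is seen."""
    best = float('-inf')
    wait = 0
    history = []
    for acc in accuracies:
        if acc > best:
            # Improvement: remember it and reset the patience counter.
            best = acc
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                # No improvement for `patience` epochs: shrink the rate, but not below min_lr.
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history

# Hypothetical per-epoch training accuracies.
fake_accuracies = [0.90, 0.95, 0.94, 0.96, 0.96, 0.95]
lr_history = simulate_reduce_lr(fake_accuracies)
for epoch, lr in enumerate(lr_history, start=1):
    print(f"epoch {epoch}: lr = {lr:.0e}")
```

With these made-up values, the rate drops to about 1e-4 after the third epoch (accuracy dipped) and reaches the 1e-5 floor after the fifth (accuracy stalled), then stays there.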

In [None]:
#Evaluating the model on the testing data w/ multiple types of metrics
print('TESTING DATA:')
cnn.evaluate(imagetest, labeltest, batch_size=32, verbose=2)

print('ADVANCED TESTING METRICS:')
from sklearn.metrics import classification_report, confusion_matrix
predictions = cnn.predict(imagetest, batch_size=32)
predicted_labels = (predictions > 0.5).astype('int32')
print(classification_report(labeltest, predicted_labels))
print(confusion_matrix(labeltest, predicted_labels))

The model reaches very good accuracy and loss values on the testing set, with a great balance of precision between the 0 (Normal) and 1 (TB) classes.
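
For anyone unsure how to read the confusion matrix, the sketch below derives precision, recall, and specificity from one. The counts are illustrative only and are NOT this model's results. It uses scikit-learn's layout, where rows are true classes and columns are predicted classes.

```python
import numpy as np

# Illustrative counts only, not results from this notebook.
# Row 0 = truly Normal, row 1 = truly TB; column 0 = predicted Normal, column 1 = predicted TB.
cm = np.array([[90, 10],
               [ 5, 45]])

tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)    # of the patients flagged as TB, how many really have it
recall = tp / (tp + fn)       # of the patients with TB, how many were caught
specificity = tn / (tn + fp)  # of the healthy patients, how many were cleared

print(f"precision={precision:.3f} recall={recall:.3f} specificity={specificity:.3f}")
# precision=0.818 recall=0.900 specificity=0.900
```

For a screening task like this one, recall on the TB class deserves particular attention, since a false negative means a patient with TB goes untreated.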

# Conclusion 🤓
Though there are obviously better models out there for diagnosing TB from X-ray scans, this model can hopefully work as a starting point to show some of you guys how a CNN can be used for medical purposes and how it can benefit our society. Hopefully one day, a 100% accurate TB diagnosis model can be made and properly implemented in the healthcare system so that deaths from TB can be avoided altogether.

**But overall, thanks for reading this notebook - it means a lot. If you liked it please make sure to leave an upvote, and check out my other work as well!**