<a href="https://colab.research.google.com/github/sakshamhooda/PneumoniaDetection/blob/main/main_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://aws.amazon.com/sagemaker/" target="_blank">
  <img src="https://a0.awsstatic.com/libra-css/images/logos/aws_logo_smile_1200x630.png" alt="Powered by AWS SageMaker" style="height: 40px;">
</a>


*Work was moved to AWS SageMaker because of compute and kernel-stability limitations.*

# Setting up Kaggle for AWS SageMaker

In [6]:
!pip install --upgrade pip setuptools wheel


Collecting wheel
  Using cached wheel-0.44.0-py3-none-any.whl.metadata (2.3 kB)
Using cached wheel-0.44.0-py3-none-any.whl (67 kB)
Installing collected packages: wheel
  Attempting uninstall: wheel
    Found existing installation: wheel 0.43.0
    Uninstalling wheel-0.43.0:
      Successfully uninstalled wheel-0.43.0
Successfully installed wheel-0.44.0


In [7]:
# Step 1: Install Kaggle API
!pip install kaggle

# Step 2: Upload kaggle.json (your Kaggle API token) to the working directory manually

# Step 3: Move kaggle.json to the correct folder
import os
os.makedirs(os.path.expanduser('~/.kaggle'), exist_ok=True)

# Move the kaggle.json file to the ~/.kaggle/ folder (replace with the correct path if not in the root)
!mv kaggle.json ~/.kaggle/

# Step 4: Set permissions for kaggle.json
!chmod 600 ~/.kaggle/kaggle.json

# Step 5: Verify Kaggle API setup
!kaggle competitions list

# Step 6: Accept competition rules manually on the Kaggle website

# Step 7: Download the competition data
!kaggle competitions download -c 123-of-ai-presents-pneumonia-detection-from-xray

# Step 8: Unzip the data to a specific directory
!unzip 123-of-ai-presents-pneumonia-detection-from-xray.zip -d ~/PneumoniaDetection/data/

# Step 9: Verify the files
!ls ~/PneumoniaDetection/data/


Collecting kaggle
  Using cached kaggle-1.6.17.tar.gz (82 kB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [56 lines of output]
      !!
      
              ********************************************************************************
              Usage of dash-separated 'description-file' will not be supported in future
              versions. Please use the underscore name 'description_file' instead.
      
              See https://setuptools.pypa.io/en/latest/userguide/declarative_config.html for details.
              ********************************************************************************
      
      !!
        opt = self.warn_dash_deprecation(opt, section)
      run

In [1]:
import zipfile
import os

# Paths
zip_file_path = "123-of-ai-presents-pneumonia-detection-from-xray.zip"
extract_dir = "data/"

# Create the extract directory if it doesn't exist
os.makedirs(extract_dir, exist_ok=True)

# Unzip the file
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(extract_dir)

print(f"Files successfully extracted to {extract_dir}")


Files successfully extracted to data/
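
Before extracting a large archive, it can help to check what it contains. The following is a minimal sketch using only the standard library. `summarize_zip` is a hypothetical helper name, and the demo builds a tiny throwaway archive with made-up file names so that it runs without the competition data.

```python
import os
import tempfile
import zipfile
from collections import Counter


def summarize_zip(zip_path):
    """Count archive members per top-level entry without extracting them."""
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        # Skip directory entries, which end with a slash
        names = [n for n in zip_ref.namelist() if not n.endswith("/")]
    # Group members by their first path component
    return Counter(name.split("/")[0] for name in names)


# Demo on a throwaway archive (file names are illustrative only)
demo_dir = tempfile.mkdtemp()
demo_zip = os.path.join(demo_dir, "demo.zip")
with zipfile.ZipFile(demo_zip, "w") as zf:
    zf.writestr("processed_train_set/img_1.png", b"")
    zf.writestr("processed_train_set/img_2.png", b"")
    zf.writestr("train_metadata.csv", "path,class\n")

summary = summarize_zip(demo_zip)
print(dict(summary))
```

Pointing `summarize_zip` at the competition archive would give a quick count of images per folder before committing disk space to the extraction.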


# Approach

1. Understanding the Dataset

    Directories and files:

    - `processed_train_set/`: contains the training X-ray images.
    - `processed_test_set/`: contains the test X-ray images.
    - `train_metadata.csv`: maps image names to their classes (healthy or pneumonia).
    - `test_files.csv`: lists the test image names.
    - `sample_submission.csv`: a sample submission file.

    Data columns:

    - `path`: the image file name.
    - `class`: the ground-truth label (healthy or pneumonia).

2. Data Preparation

    - Load `train_metadata.csv` and `test_files.csv`.
    - Ensure that the image paths and labels are correctly mapped.
    - Use data generators that match the dataset structure.

3. Model Selection

    - Use Inception V3 as the base model with ImageNet weights.
    - Optionally, ensemble with EfficientNetB0 for improved performance.

4. Training Strategy

    - Freeze the base model layers initially and train only the top layers.
    - Unfreeze some layers for fine-tuning.
    - Use data augmentation to reduce overfitting.
    - Monitor the F1 score, which is the competition metric.

5. Evaluation and Submission

    - Evaluate the model on a validation set.
    - Generate predictions on the test set.
    - Prepare the submission file in the required format.
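
Since the plan above tracks the F1 score, here is a small worked sketch of how that metric is computed from binary labels. It assumes pneumonia is the positive class (label 1), which should be checked against the competition rules. The function name `binary_f1` and the toy labels are illustrative only.

```python
def binary_f1(y_true, y_pred):
    """F1 = 2 * precision * recall / (precision + recall) for the positive class (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined)
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels: 3 true positives, 1 false positive, 1 false negative
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]
score = binary_f1(y_true, y_pred)
print(f"F1: {score:.4f}")
```

Unlike accuracy, this score ignores true negatives, so it is less flattering when one class dominates the data.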


In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Dropout, GlobalAveragePooling2D
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau
from sklearn.metrics import f1_score

In [None]:
# Load the dataframes
train_df = pd.read_csv('data/1. train_metadata.csv')
test_df = pd.read_csv('data/2. test_files.csv')

# Display class distribution
train_df['class'].value_counts().plot(kind='bar', title='Class Distribution')
plt.show()


In [None]:
# Add full path to image files
train_df['path'] = 'data/processed_train_set/' + train_df['path']
test_df['path'] = 'data/processed_test_set/' + test_df['path']


In [None]:
print(train_df.head())


In [None]:
missing_files = train_df[~train_df['path'].apply(os.path.exists)]
print(f"Missing files:\n{missing_files}")

In [None]:
# Define the image size and batch size
IMAGE_SIZE = (299, 299)  # InceptionV3 default size
BATCH_SIZE = 32

# Training data generator with augmentation
train_datagen = ImageDataGenerator(
    #preprocessing_function=preprocess_input,
    rescale=1./255,
    rotation_range=20,
    shear_range=0.1,
    zoom_range=0.1,
    horizontal_flip=True,
    vertical_flip=False,
    validation_split=0.2  # 20% for validation
)

# Test data generator
test_datagen = ImageDataGenerator(
    #preprocessing_function=preprocess_input,
    rescale=1./255
)

# Create training and validation generators
train_generator = train_datagen.flow_from_dataframe(
    dataframe=train_df,
    x_col='path',
    y_col='class',
    target_size=IMAGE_SIZE,
    batch_size=BATCH_SIZE,
    class_mode='binary',
    subset='training',
    shuffle=True
)

validation_generator = train_datagen.flow_from_dataframe(
    dataframe=train_df,
    x_col='path',
    y_col='class',
    target_size=IMAGE_SIZE,
    batch_size=BATCH_SIZE,
    class_mode='binary',
    subset='validation',
    shuffle=False
)

# Create test generator
test_generator = test_datagen.flow_from_dataframe(
    dataframe=test_df,
    x_col='path',
    y_col=None,
    target_size=IMAGE_SIZE,
    batch_size=BATCH_SIZE,
    class_mode=None,
    shuffle=False
)
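
As far as I can tell, `validation_split` in `flow_from_dataframe` slices the dataframe by row position and does not take the class column into account. If a class-balanced hold-out set is wanted, one option is to split the metadata per class before building the generators. The sketch below shows that idea with the standard library only. `stratified_split`, the seed, and the toy rows are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_key="class", val_fraction=0.2, seed=42):
    """Split rows into (train, val) while keeping class proportions roughly equal."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)

    rng = random.Random(seed)
    train, val = [], []
    for items in by_class.values():
        items = items[:]  # copy so the caller's list is not reordered
        rng.shuffle(items)
        n_val = round(len(items) * val_fraction)
        val.extend(items[:n_val])
        train.extend(items[n_val:])
    return train, val


# Toy metadata: 10 healthy and 30 pneumonia rows (made-up file names)
rows = [{"path": f"h_{i}.png", "class": "healthy"} for i in range(10)]
rows += [{"path": f"p_{i}.png", "class": "pneumonia"} for i in range(30)]

train_rows, val_rows = stratified_split(rows)
print(len(train_rows), len(val_rows))
```

The two resulting lists could be turned into separate dataframes and passed to two generators without `validation_split`.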


In [None]:
# Class-imbalance check: compute balanced class weights for training

from sklearn.utils import class_weight
import numpy as np

class_weights = class_weight.compute_class_weight(
    class_weight='balanced',
    classes=np.unique(train_generator.classes),
    y=train_generator.classes
)

class_weights = dict(enumerate(class_weights))
class_weights
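
For reference, the `balanced` heuristic weights each class by `n_samples / (n_classes * count_for_that_class)`, so the rarer class receives the larger weight. The sketch below reproduces that arithmetic in plain Python on made-up counts. `balanced_class_weights` is a hypothetical helper, not part of scikit-learn.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in sorted(counts.items())}


# Made-up imbalance: 25 samples of class 0 and 75 of class 1
labels = [0] * 25 + [1] * 75
weights = balanced_class_weights(labels)
print(weights)
```

With these counts the minority class is weighted three times as heavily as the majority class, which is what counteracts the imbalance in the loss.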

In [None]:
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input

# Load InceptionV3 with pre-trained ImageNet weights
base_model = InceptionV3(weights='imagenet', include_top=False, input_shape=(*IMAGE_SIZE, 3))


In [None]:
# Add global average pooling and output layers
x = base_model.output
x = GlobalAveragePooling2D()(x)
x = Dropout(0.5)(x)
predictions = Dense(1, activation='sigmoid')(x)

# Define the full model
model = Model(inputs=base_model.input, outputs=predictions)


In [None]:
# Freeze all layers in the base model
for layer in base_model.layers:
    layer.trainable = False


In [None]:
model.compile(optimizer=Adam(learning_rate=0.001), loss='binary_crossentropy', metrics=['accuracy'])


In [None]:
# Early stopping to prevent overfitting
earlystop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

# Save the best model
checkpoint = ModelCheckpoint('inception_v3_best_model.keras', monitor='val_loss', save_best_only=True)

# Reduce learning rate when a metric has stopped improving
reduce_lr = ReduceLROnPlateau(monitor='val_loss', patience=3, factor=0.2, min_lr=1e-7)

callbacks = [earlystop, checkpoint, reduce_lr]
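
To make the interaction of `patience` and `factor` in the learning-rate callback concrete, the following simplified simulation mimics the reduce-on-plateau idea in plain Python. It is an approximation for illustration, not Keras's implementation (it ignores options such as `min_delta` and `cooldown`), and the loss values are invented.

```python
def simulate_reduce_on_plateau(val_losses, lr=1e-3, factor=0.2, patience=3, min_lr=1e-7):
    """Return the learning rate in effect after each epoch."""
    best = float("inf")
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best:
            # Improvement: remember it and reset the patience counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                # Plateau: shrink the learning rate, but never below min_lr
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history


# Invented validation losses: improvement stalls after the second epoch
losses = [0.70, 0.60, 0.61, 0.62, 0.63, 0.59]
lrs = simulate_reduce_on_plateau(losses)
print(lrs)
```

Because early stopping uses a patience of 5 and the learning-rate reduction a patience of 3, the learning rate gets one chance to drop before training is halted.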


In [None]:
# Custom callback that reports the validation F1 score after each epoch

from tensorflow.keras.callbacks import Callback

class F1ScoreCallback(Callback):
    def __init__(self, validation_generator):
        super().__init__()
        self.validation_generator = validation_generator

    def on_epoch_end(self, epoch, logs=None):
        self.validation_generator.reset()
        val_preds = self.model.predict(self.validation_generator)
        val_preds = (val_preds > 0.5).astype(int).reshape(-1)
        val_f1 = f1_score(self.validation_generator.classes, val_preds)
        print(f' - val_f1: {val_f1:.4f}')
 
callbacks.append(F1ScoreCallback(validation_generator))


In [None]:
history = model.fit(
    train_generator,
    class_weight=class_weights,
    steps_per_epoch=train_generator.samples // BATCH_SIZE,
    validation_data=validation_generator,
    validation_steps=validation_generator.samples // BATCH_SIZE,
    epochs=20,
    callbacks=callbacks
)
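
The step counts above use floor division, which leaves out the final partial batch whenever the sample count is not a multiple of the batch size. Covering every sample exactly once takes the ceiling instead. A small sketch of the arithmetic, with made-up sample counts:

```python
import math


def floor_steps(n_samples, batch_size):
    """Number of full batches only; a trailing partial batch is skipped."""
    return n_samples // batch_size


def full_coverage_steps(n_samples, batch_size):
    """Number of batches needed to see every sample exactly once."""
    return math.ceil(n_samples / batch_size)


# Compare floor, ceiling, and the "floor + 1" shortcut for two sample counts
for n in (1000, 1024):
    print(n, floor_steps(n, 32), full_coverage_steps(n, 32), n // 32 + 1)
```

Note that `n // batch_size + 1` overshoots by one step when the count divides evenly, so the ceiling form is the safer choice when predictions must line up with the labels.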

**Fine-Tuning**

In [None]:
# Unfreeze the top 50 layers of the model
for layer in base_model.layers[-50:]:
    layer.trainable = True


In [None]:
# Re-compile the model with a lower learning rate
model.compile(optimizer=Adam(learning_rate=1e-5), loss='binary_crossentropy', metrics=['accuracy'])


In [None]:
history_fine = model.fit(
    train_generator,
    steps_per_epoch=train_generator.samples // BATCH_SIZE,
    validation_data=validation_generator,
    validation_steps=validation_generator.samples // BATCH_SIZE,
    epochs=10,
    callbacks=callbacks
)


**Evaluation**

In [None]:
# Reset validation generator and get predictions
validation_generator.reset()
# Use the ceiling so every validation sample is predicted exactly once
val_preds = model.predict(validation_generator, steps=int(np.ceil(validation_generator.samples / BATCH_SIZE)))
val_preds = (val_preds > 0.5).astype(int).reshape(-1)


In [None]:
# True labels
val_labels = validation_generator.classes


In [None]:
from sklearn.metrics import f1_score

# Calculate F1 score
val_f1 = f1_score(val_labels, val_preds)
print('Validation F1 Score:', val_f1)


**Generate Predictions on Test Set**

In [None]:
# Predict on the test set
test_generator.reset()
# Use the ceiling so every test sample is predicted exactly once
test_preds = model.predict(test_generator, steps=int(np.ceil(test_generator.samples / BATCH_SIZE)))
test_preds = (test_preds > 0.5).astype(int).reshape(-1)


In [None]:
# Get the mapping from class indices to labels
class_indices = train_generator.class_indices
reverse_class_indices = {v: k for k, v in class_indices.items()}

# Map predictions to class names
test_labels = [reverse_class_indices[pred] for pred in test_preds]
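
As a concrete illustration of the mapping step, the sketch below inverts a `class_indices`-style dictionary and applies it to thresholded probabilities. The dictionary contents are an assumption: Keras generators typically assign indices in sorted label order, which would give `healthy` 0 and `pneumonia` 1, but the actual values should be read from `train_generator.class_indices`. The probabilities are made up.

```python
# Assumed mapping; verify against train_generator.class_indices
example_class_indices = {"healthy": 0, "pneumonia": 1}

# Invert the mapping so that integer predictions can be turned back into labels
index_to_label = {idx: label for label, idx in example_class_indices.items()}

# Made-up sigmoid outputs; values strictly above 0.5 count as the positive class
example_probs = [0.12, 0.87, 0.50, 0.66]
binary_preds = [int(p > 0.5) for p in example_probs]
example_labels = [index_to_label[i] for i in binary_preds]
print(example_labels)
```

A probability of exactly 0.5 falls on the negative side here because the comparison is strict.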


In [None]:
# Prepare submission DataFrame
submission = pd.DataFrame({
    'ID': test_df.index,
    'class': test_labels
})

# Ensure it matches the sample submission format
submission = submission[['ID', 'class']]
#submission.columns = ['path', 'class']

# Save to CSV
submission.to_csv('submission.csv', index=False)


### Ensemble with EfficientNetB0

In [None]:
from tensorflow.keras.applications import EfficientNetB0

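# Load EfficientNetB0
# Note: Keras EfficientNet models include their own input rescaling and expect pixel values in [0, 255];
# the generators above rescale to [0, 1], so a separate generator without `rescale` may be needed here.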
effnet_base = EfficientNetB0(weights='imagenet', include_top=False, input_shape=(*IMAGE_SIZE, 3))

# Add custom layers
x = effnet_base.output
x = GlobalAveragePooling2D()(x)
x = Dropout(0.5)(x)
predictions = Dense(1, activation='sigmoid')(x)

# Define the model
effnet_model = Model(inputs=effnet_base.input, outputs=predictions)

# Freeze base model layers
for layer in effnet_base.layers:
    layer.trainable = False

The rest of the pipeline is kept the same as for Inception V3.

In [None]:
effnet_model.compile(optimizer=Adam(learning_rate=0.001), loss='binary_crossentropy', metrics=['accuracy'])


In [None]:
# Early stopping to prevent overfitting
earlystop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

# Save the best EfficientNetB0 model under its own file name so the Inception V3 checkpoint is kept
checkpoint = ModelCheckpoint('efficientnet_b0_best_model.keras', monitor='val_loss', save_best_only=True)

# Reduce learning rate when a metric has stopped improving
reduce_lr = ReduceLROnPlateau(monitor='val_loss', patience=3, factor=0.2, min_lr=1e-7)

callbacks = [earlystop, checkpoint, reduce_lr]


In [None]:
# Reuse the F1ScoreCallback class defined earlier to monitor the F1 score during training
callbacks.append(F1ScoreCallback(validation_generator))


In [None]:
# Train the new head on top of the frozen EfficientNetB0 base
effnet_history = effnet_model.fit(
    train_generator,
    class_weight=class_weights,
    steps_per_epoch=train_generator.samples // BATCH_SIZE,
    validation_data=validation_generator,
    validation_steps=validation_generator.samples // BATCH_SIZE,
    epochs=20,
    callbacks=callbacks
)

In [None]:
# Unfreeze the top 50 layers of the EfficientNetB0 base
for layer in effnet_base.layers[-50:]:
    layer.trainable = True

# Re-compile the model with a lower learning rate
effnet_model.compile(optimizer=Adam(learning_rate=1e-5), loss='binary_crossentropy', metrics=['accuracy'])

effnet_history_fine = effnet_model.fit(
    train_generator,
    steps_per_epoch=train_generator.samples // BATCH_SIZE,
    validation_data=validation_generator,
    validation_steps=validation_generator.samples // BATCH_SIZE,
    epochs=10,
    callbacks=callbacks
)



In [None]:
# Reset validation generator and get EfficientNetB0 predictions
validation_generator.reset()
val_preds = effnet_model.predict(validation_generator, steps=int(np.ceil(validation_generator.samples / BATCH_SIZE)))
val_preds = (val_preds > 0.5).astype(int).reshape(-1)

# True labels
val_labels = validation_generator.classes

# Calculate F1 score
val_f1 = f1_score(val_labels, val_preds)
print('Validation F1 Score:', val_f1)


In [None]:
# Number of steps needed to cover every test sample exactly once
test_steps = int(np.ceil(test_generator.samples / BATCH_SIZE))

# Predictions from Inception V3
test_generator.reset()
inception_preds = model.predict(test_generator, steps=test_steps)

# Predictions from EfficientNetB0
test_generator.reset()
effnet_preds = effnet_model.predict(test_generator, steps=test_steps)


In [None]:
# Average the predictions
ensemble_preds = (inception_preds + effnet_preds) / 2
ensemble_preds = (ensemble_preds > 0.5).astype(int).reshape(-1)
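
Averaging the two models' probabilities before thresholding is often called soft voting. The toy sketch below, with invented probabilities, shows how the averaged decision can differ from what either model would predict alone. `soft_vote` is an illustrative helper name.

```python
def soft_vote(probs_a, probs_b, threshold=0.5):
    """Average two lists of probabilities element-wise, then threshold."""
    if len(probs_a) != len(probs_b):
        raise ValueError("prediction lists must have the same length")
    averaged = [(a + b) / 2 for a, b in zip(probs_a, probs_b)]
    return [int(p > threshold) for p in averaged]


# Invented probabilities from two hypothetical models
model_a = [0.9, 0.4, 0.3, 0.6]
model_b = [0.7, 0.7, 0.2, 0.3]
votes = soft_vote(model_a, model_b)
print(votes)
```

In the second and fourth positions the models disagree, and the more confident one decides the outcome, which is the main appeal of averaging probabilities rather than hard labels.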


In [None]:
# Map predictions to class names
ensemble_labels = [reverse_class_indices[pred] for pred in ensemble_preds]


In [None]:
# Prepare submission DataFrame
submission = pd.DataFrame({
    'ID': test_df.index,
    'class': ensemble_labels
})

# Ensure it matches the sample submission format
submission = submission[['ID', 'class']]
#submission.columns = ['path', 'class']

# Save to CSV
submission.to_csv('submission.csv', index=False)
