# **IBM Z Datathon - Datasmiths**

---

## Problem Statement:

The objective of this challenge is to develop an automated diagnostic system that detects chronic diseases such as Chronic Obstructive Pulmonary Disease (COPD) and cancer using deep learning techniques.

Manual diagnosis of chronic diseases like COPD and cancer is time-consuming,
resource-intensive, and prone to human error. Current diagnostic processes scale poorly and
often lead to delayed treatment, which worsens patient outcomes.


---

## Team Members:
- Chandravel Saravanan (Team Leader)
- Chanakya R
- Nithish Kumar S
- Vijay Srinivas K


---

## Dataset:

The dataset consists of images of lung tissue from patients with either COPD or cancer. Each image is labeled with its disease status (benign, malignant, or normal). The data is divided into training and testing sets, with 80% used for training and 20% for testing. The public dataset is available at https://www.kaggle.com/datasets/waseemnagahhenes/lung-cancer-dataset-iq-othnccd


---

In [1]:
!pip install tensorflow

Defaulting to user installation because normal site-packages is not writeable
Collecting gast<=0.4.0,>=0.2.1 (from tensorflow)
  Downloading gast-0.4.0-py3-none-any.whl.metadata (1.1 kB)
Collecting protobuf<3.20,>=3.9.2 (from tensorflow)
  Downloading protobuf-3.19.6-py2.py3-none-any.whl.metadata (828 bytes)
Downloading gast-0.4.0-py3-none-any.whl (9.8 kB)
Downloading protobuf-3.19.6-py2.py3-none-any.whl (162 kB)
Installing collected packages: protobuf, gast
  Attempting uninstall: protobuf
    Found existing installation: protobuf 3.20.2
    Uninstalling protobuf-3.20.2:
      Successfully uninstalled protobuf-3.20.2
  Attempting uninstall: gast
    Found existing installation: gast 0.5.3
    Uninstalling gast-0.5.3:
      Successfully uninstalled gast-0.5.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
onnx 1.16.2 requires protobuf>=3.20.2, but you hav

---

## Lung Cancer Dataset Splitting Script

This script automates the process of splitting a dataset into training and testing subsets. It can be used to split images (or any files) from different classes into two directories: one for training and another for testing. The script uses an 80/20 split by default but can be adjusted as needed.

### Steps:

#### Directory Setup:
- Checks if the train and test directories exist, and creates them if not.

#### Class Subfolder Processing:
- Loops through each class (e.g., "benign", "malignant") in the source directory.
- Creates corresponding subfolders in the train and test directories.

#### Random Split:
- Shuffles the images in each class.
- Splits them into training and testing sets according to the specified ratio.

#### File Transfer:
- Moves the files to the appropriate train/test folders for each class.

This ensures a clean and randomized split for building and evaluating models.
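The shuffle-and-slice arithmetic from the steps above can be sketched in isolation. This is a minimal illustration rather than the notebook's script: the helper name `split_names`, the placeholder file names, and the seeded `random.Random` are assumptions added here so that the shuffle is repeatable.

```python
import random

def split_names(names, split_ratio=0.8, seed=0):
    """Return (train, test) lists after a seeded shuffle. Hypothetical helper."""
    shuffled = list(names)                 # copy so the caller's sequence is untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    split_index = int(len(shuffled) * split_ratio)
    return shuffled[:split_index], shuffled[split_index:]

# Placeholder file names; 614 is the 'Normal cases' total reported in the script output.
train, test = split_names([f"img_{i}.jpg" for i in range(614)])
print(len(train), len(test))  # 491 123
```

Because `int()` truncates, any remainder goes to the test set, which is why 614 files give 491 training and 123 testing images.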

In [3]:
import os
import shutil
import random

def split_dataset(source_dir, train_dir, test_dir, split_ratio=0.8):
    # Create train and test directories if they don't exist
    if not os.path.exists(train_dir):
        os.makedirs(train_dir)
    
    if not os.path.exists(test_dir):
        os.makedirs(test_dir)

    # Loop through each class folder (e.g., benign, malignant, normal)
    for class_name in os.listdir(source_dir):
        class_path = os.path.join(source_dir, class_name)
        
        # Ensure the path is a directory (skip non-folder items)
        if os.path.isdir(class_path):
            # Create corresponding train/test subfolders for the class
            train_class_dir = os.path.join(train_dir, class_name)
            test_class_dir = os.path.join(test_dir, class_name)

            os.makedirs(train_class_dir, exist_ok=True)
            os.makedirs(test_class_dir, exist_ok=True)

            # Get all files in the class folder
            all_files = os.listdir(class_path)

            # Shuffle the files to randomize the splitting
            random.shuffle(all_files)

            # Calculate the split index
            split_index = int(len(all_files) * split_ratio)

            # Split files into train and test
            train_files = all_files[:split_index]
            test_files = all_files[split_index:]

            # Move files to the corresponding train and test folders
            for file in train_files:
                shutil.move(os.path.join(class_path, file), os.path.join(train_class_dir, file))

            for file in test_files:
                shutil.move(os.path.join(class_path, file), os.path.join(test_class_dir, file))

            print(f"Class '{class_name}' split into {len(train_files)} training and {len(test_files)} testing images.")

# Define the source folder and the destination train/test folders
source_dir = 'Dataset/Data/Lung_cancer_dataset'    # Original dataset folder containing class subfolders
train_dir = 'Dataset/Data/train'
test_dir = 'Dataset/Data/test'

# Call the function to split the dataset with an 80/20 train-test split
split_dataset(source_dir, train_dir, test_dir, split_ratio=0.8)


Class 'Normal cases' split into 491 training and 123 testing images.
Class 'Benign cases' split into 96 training and 25 testing images.
Class '.ipynb_checkpoints' split into 0 training and 1 testing images.
Class 'Malignant cases' split into 1071 training and 268 testing images.


---

## Model Creation, Training and Evaluation

This script implements a Convolutional Neural Network (CNN) for multi-class image classification using TensorFlow and Keras. It includes:

- Data Preprocessing: Uses ImageDataGenerator to rescale image pixel values and set up generators for the training and validation datasets.
- Model Architecture: A sequential CNN model with three convolutional layers, max-pooling layers, a flattening layer, and a fully connected dense layer, ending with a softmax output for 4 classes.
- Compilation and Training: The model is compiled using the Adam optimizer and categorical cross-entropy loss. It is trained for 30 epochs with specified steps per epoch and evaluated on the validation data.
- Model Evaluation: The final validation accuracy is printed after model evaluation.
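The "specified steps per epoch" can be derived from the image counts instead of being hard-coded. The sketch below is an illustration under stated assumptions: `steps_for` is a hypothetical helper, and the counts are the ones `flow_from_directory` reports in this notebook's output.

```python
import math

def steps_for(num_samples, batch_size):
    """Batches needed to see every sample once (the last batch may be partial)."""
    return math.ceil(num_samples / batch_size)

# Image counts as reported by flow_from_directory in this notebook.
print(steps_for(1658, 32))  # 52
print(steps_for(416, 32))   # 13
```

With a batch size of 32, one pass over the data is 52 training steps and 13 validation steps, so the hard-coded `steps_per_epoch=100` and `validation_steps=50` in the cell below ask for more batches than a single pass provides.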

In [5]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras import layers, models

# Split dataset with 80% for training and 20% for testing
split_dataset(source_dir, train_dir, test_dir, split_ratio=0.8)

# Preprocessing: rescale pixel values to [0, 1] (no augmentation in this version)
train_datagen = ImageDataGenerator(rescale=1./255)
test_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(
    train_dir,
    target_size=(150, 150),
    batch_size=32,
    class_mode='categorical')  # For multiclass classification

validation_generator = test_datagen.flow_from_directory(
    test_dir,
    target_size=(150, 150),
    batch_size=32,
    class_mode='categorical')  # Ensure it's categorical for multi-class

# Define CNN model
model = models.Sequential()

model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(150, 150, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(128, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))

model.add(layers.Flatten())
model.add(layers.Dense(512, activation='relu'))

# Output layer for 4 classes (update the number based on actual classes)
model.add(layers.Dense(4, activation='softmax'))

# Compile the model
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# Train the model
history = model.fit(
    train_generator,
    steps_per_epoch=100,  # Adjust based on your dataset size
    epochs=30,
    validation_data=validation_generator,
    validation_steps=50)

# Evaluate the model
loss, accuracy = model.evaluate(validation_generator)
print(f'Validation Accuracy: {accuracy * 100:.2f}%')


Class 'Normal cases' split into 0 training and 0 testing images.
Class 'Benign cases' split into 0 training and 0 testing images.
Class '.ipynb_checkpoints' split into 0 training and 0 testing images.
Class 'Malignant cases' split into 0 training and 0 testing images.
Found 1658 images belonging to 4 classes.
Found 416 images belonging to 4 classes.
Epoch 1/30
Validation Accuracy: 82.69%


---
The output provides insights into the dataset splitting, model training, and validation performance:

Dataset Splitting:
- The classes Normal cases, Benign cases, and Malignant cases report 0 images because this cell calls `split_dataset` a second time. The first run moved every file out of the source folders with `shutil.move`, so the source class folders were already empty. The generators still read the images that the first run placed in the train and test directories.
- The .ipynb_checkpoints folder, a hidden folder created by Jupyter, was mistakenly treated as a class. It should be ignored in future runs.

Image Counts:
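- Training Set: 1658 images were loaded across 4 classes. The count matches the 491 + 96 + 1071 training images of the three real classes; the fourth class is the .ipynb_checkpoints folder, which holds no training images.
- Validation Set: 416 images were found across 4 classes.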

Training Output (Epoch 1):
- Accuracy: The model achieved 73.58% accuracy during training after the first epoch.
- Loss: The training loss was 0.7117, which measures how far predictions are from the actual labels. Lower loss is better.

Validation Output:
- Val Accuracy: The model achieved 82.69% accuracy on the validation set, meaning it correctly predicted 82.69% of validation images.
- Val Loss: The validation loss was 0.4526, indicating that the model performs well on unseen data.

Overall, the model performs well, with a validation accuracy of 82.69% reported by the final `model.evaluate` call. However, the dataset split has issues (classes reporting 0 images on the repeated split, and the stray .ipynb_checkpoints folder counted as a fourth class), which need to be addressed.
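One way to address the split issues is sketched below. It is not the notebook's own fix: `split_dataset_safe` is a hypothetical variant of `split_dataset` that skips hidden folders such as .ipynb_checkpoints, copies files instead of moving them, and uses a seeded shuffle (an assumption added here) so that a re-run reproduces the same split.

```python
import os
import random
import shutil

def split_dataset_safe(source_dir, train_dir, test_dir, split_ratio=0.8, seed=0):
    """Copy files into train/test class folders. Hypothetical variant of split_dataset."""
    rng = random.Random(seed)
    counts = {}
    for class_name in sorted(os.listdir(source_dir)):
        class_path = os.path.join(source_dir, class_name)
        # Skip hidden entries such as .ipynb_checkpoints and anything that is not a folder
        if class_name.startswith('.') or not os.path.isdir(class_path):
            continue
        files = sorted(f for f in os.listdir(class_path)
                       if os.path.isfile(os.path.join(class_path, f)))
        rng.shuffle(files)
        split_index = int(len(files) * split_ratio)
        for subset_dir, subset in ((train_dir, files[:split_index]),
                                   (test_dir, files[split_index:])):
            dest = os.path.join(subset_dir, class_name)
            os.makedirs(dest, exist_ok=True)
            for name in subset:
                # Copy instead of move so the source stays intact on a re-run
                shutil.copy2(os.path.join(class_path, name), os.path.join(dest, name))
        counts[class_name] = (split_index, len(files) - split_index)
    return counts
```

Calling `split_dataset_safe(source_dir, train_dir, test_dir)` twice with the same seed yields the same counts and the same files in each folder. Changing the seed between runs without clearing the output folders would mix training and testing images, so the seed should stay fixed.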

---

### Improved Model:

The improved model adds batch normalization, dropout, and light data augmentation to make it more robust, and it is trained after the dataset split was fixed.

In [9]:
from tensorflow.keras import layers, models
from tensorflow.keras.layers import BatchNormalization, Dropout
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Define an improved CNN model
model = models.Sequential()

train_datagen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=10,          # Reduced rotation
    width_shift_range=0.1,      # Reduced horizontal shifts
    height_shift_range=0.1,     # Reduced vertical shifts
    shear_range=0.1,            # Reduced shear
    zoom_range=0.1,             # Reduced zoom
    horizontal_flip=True,
    fill_mode='nearest'
)

train_generator = train_datagen.flow_from_directory(
    train_dir,
    target_size=(150, 150),
    batch_size=32,
    class_mode='categorical')  # For multiclass classification

validation_generator = test_datagen.flow_from_directory(
    test_dir,
    target_size=(150, 150),
    batch_size=32,
    class_mode='categorical')  # Ensure it's categorical for multi-class

# Block 1
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(150, 150, 3)))
model.add(BatchNormalization())
model.add(layers.Conv2D(32, (3, 3), activation='relu'))
model.add(BatchNormalization())
model.add(layers.MaxPooling2D((2, 2)))
model.add(Dropout(0.25))  # Adding dropout to prevent overfitting

# Block 2
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(BatchNormalization())
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(BatchNormalization())
model.add(layers.MaxPooling2D((2, 2)))
model.add(Dropout(0.25))

# Block 3
model.add(layers.Conv2D(128, (3, 3), activation='relu'))
model.add(BatchNormalization())
model.add(layers.Conv2D(128, (3, 3), activation='relu'))
model.add(BatchNormalization())
model.add(layers.MaxPooling2D((2, 2)))
model.add(Dropout(0.25))

# Fully connected layers
model.add(layers.Flatten())
model.add(layers.Dense(512, activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.5))  # Adding dropout before the final layer

# Output layer for 4 classes (softmax for multiclass classification)
model.add(layers.Dense(4, activation='softmax'))

# Compile the model
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# Summary of the model
model.summary()
# Number of images found by each generator
total_train_images = train_generator.samples
total_validation_images = validation_generator.samples

# Batch size used by the generators
batch_size = train_generator.batch_size

# Calculate steps per epoch
train_steps = total_train_images // batch_size
validation_steps = total_validation_images // batch_size

# Train the model with correct steps
history = model.fit(
    train_generator,
    steps_per_epoch=train_steps,           # Set based on the number of training samples
    epochs=30,
    validation_data=validation_generator,
    validation_steps=validation_steps      # Set based on the number of validation samples
)

# Evaluate the model
loss, accuracy = model.evaluate(validation_generator)
print(f'Validation Accuracy: {accuracy * 100:.2f}%')

Found 1660 images belonging to 4 classes.
Found 416 images belonging to 4 classes.
Model: "sequential_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d_25 (Conv2D)          (None, 148, 148, 32)      896       
                                                                 
 batch_normalization_21 (Bat  (None, 148, 148, 32)     128       
 chNormalization)                                                
                                                                 
 conv2d_26 (Conv2D)          (None, 146, 146, 32)      9248      
                                                                 
 batch_normalization_22 (Bat  (None, 146, 146, 32)     128       
 chNormalization)                                                
                                                                 
 max_pooling2d_16 (MaxPoolin  (None, 73, 73, 32)       0         
 g2D)                                

This output provides detailed information about the model architecture and performance after training:

### 1. **Dataset Information**:
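   - **Training Set**: 1660 images are loaded across 4 classes. Since the dataset has only three disease classes, the fourth is likely the .ipynb_checkpoints folder noted earlier.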
   - **Validation Set**: 416 images are available for validation, also across 4 classes.

### 2. **Model Architecture**:
   - The model is sequential with multiple layers, including convolutional, batch normalization, max-pooling, and dropout layers.
   - **Conv2D Layers**: The network has multiple convolutional layers (Conv2D), each followed by batch normalization, to ensure stable training and faster convergence.
   - **Dropout Layers**: These help reduce overfitting by randomly deactivating some neurons during training, adding regularization.
   - **Total Parameters**: The model summary lists the parameters in each layer (weights, biases, etc.). These values indicate how complex the model is.
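The parameter counts and output shapes in the summary can be checked by hand. The helpers below are illustrative names, not Keras functions. They assume 3x3 kernels with no padding and stride 1, as in the model definition, and four values per channel for batch normalization (scale, offset, moving mean, moving variance).

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return (kernel_h * kernel_w * in_channels + 1) * filters

def batchnorm_params(channels):
    """Scale, offset, moving mean and moving variance for each channel."""
    return 4 * channels

def conv_valid_size(size, kernel=3):
    """Output width/height of an unpadded, stride-1 convolution."""
    return size - kernel + 1

print(conv2d_params(3, 3, 3, 32))    # 896  (conv2d_25)
print(conv2d_params(3, 3, 32, 32))   # 9248 (conv2d_26)
print(batchnorm_params(32))          # 128  (batch_normalization_21)
print(conv_valid_size(150), conv_valid_size(148), 146 // 2)  # 148 146 73
```

These values match the first rows of the summary above: each convolution trims two pixels from the width and height, and the 2x2 max-pooling layer halves 146 to 73.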

### 3. **Training Output**:
   - **Final Epoch (30/30)**: 
     - **Training Loss**: The loss decreased to **0.2236**, indicating good convergence during training.
     - **Training Accuracy**: The model achieved **91.58% accuracy** on the training set, showing strong performance on the data it was trained on.

### 4. **Validation Output**:
   - **Validation Loss**: After the last epoch, the validation loss is **0.5488**, showing a moderate degree of error on unseen data.
   - **Validation Accuracy**: The model achieved **84.62% accuracy** on the validation set, indicating it generalizes fairly well to new data.
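To make the validation accuracy concrete, it can be converted into a count of images. This is a small arithmetic sketch that assumes the evaluation covered all 416 validation images; `correct_count` is a hypothetical helper.

```python
def correct_count(accuracy, total):
    """Nearest whole number of correct predictions for a reported accuracy."""
    return round(accuracy * total)

n_correct = correct_count(0.8462, 416)
print(n_correct, 416 - n_correct)        # 352 64
print(f"{n_correct / 416 * 100:.2f}%")   # 84.62%
```

Under that assumption, 84.62% corresponds to 352 correct predictions and 64 misclassified images.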

Overall, the model trained successfully, reaching 91.58% training accuracy and 84.62% validation accuracy. The gap between the two, together with the higher validation loss (0.5488 versus 0.2236), suggests some overfitting and leaves room to improve validation performance.


---

## **Image Prediction using a Pre-trained Model**

This script allows you to load an image, preprocess it, and make predictions using a pre-trained model (such as a CNN). The predicted class and its confidence level are then displayed, along with the image.

#### Key Components:
1. **Libraries Used:**
   - `numpy`: Used for numerical operations and array manipulations.
   - `tensorflow.keras.preprocessing`: For loading and preprocessing images.
   - `matplotlib.pyplot`: For displaying the image and the prediction.

2. **Class Labels:**
   - The script maps the predicted index to a label such as `'benign'`, `'malignant'`, or `'normal'`. The label list must follow the index order assigned during training (available as `train_generator.class_indices`) and must have one entry per model output.

3. **Image Preprocessing:**
   - The image is resized to `(150, 150)` pixels to match the input shape of the model.
   - The image is converted to a NumPy array and normalized by dividing the pixel values by 255 to ensure they are between 0 and 1.

4. **Prediction:**
   - The `predict()` function of the pre-trained model is used to make a prediction.
   - The class with the highest probability is chosen as the predicted label using `np.argmax()`.

5. **Display:**
   - The predicted class and its confidence level are printed.
   - The original image is displayed with the prediction as the title using `matplotlib`.
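The label lookup described in the list above can be sketched without TensorFlow. Everything here is illustrative: `label_from_probs` is a hypothetical helper, the probabilities are made up, and the mapping assumes the four folder names seen in this notebook with indices assigned in sorted order.

```python
def label_from_probs(probs, class_indices):
    """Return (label, confidence) for one row of softmax output."""
    index_to_label = {index: name for name, index in class_indices.items()}
    best = max(range(len(probs)), key=lambda i: probs[i])  # same role as np.argmax
    return index_to_label[best], probs[best]

# Assumed mapping: the four folder names from this notebook, in sorted order.
class_indices = {'.ipynb_checkpoints': 0, 'Benign cases': 1,
                 'Malignant cases': 2, 'Normal cases': 3}

# Made-up probabilities for a single image.
label, confidence = label_from_probs([0.01, 0.04, 0.90, 0.05], class_indices)
print(f"{label} ({confidence * 100:.2f}%)")  # Malignant cases (90.00%)
```

Reading the mapping from the generator keeps the labels aligned with the model outputs. A hard-coded three-item list would be shifted by one position if the hidden folder occupies index 0.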

---

In [None]:
import numpy as np
from tensorflow.keras.preprocessing import image
import matplotlib.pyplot as plt

# Load the trained model
# Assuming your model is already loaded in 'model'

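# Class labels in the index order used during training. flow_from_directory
# assigns one index per class folder, so read the mapping from train_generator
# (defined in the training cells above) instead of hard-coding it.
class_labels = sorted(train_generator.class_indices, key=train_generator.class_indices.get)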

def predict_image(img_path):
    # Load and preprocess the image
    img = image.load_img(img_path, target_size=(150, 150))  # Resize to match model input size
    img_array = image.img_to_array(img)  # Convert to numpy array
    img_array = np.expand_dims(img_array, axis=0)  # Add batch dimension
    img_array /= 255.0  # Normalize the image
    
    # Make the prediction
    prediction = model.predict(img_array)
    
    # Get the class with the highest probability
    predicted_class = np.argmax(prediction, axis=1)[0]
    predicted_label = class_labels[predicted_class]
    
    # Print and show the prediction
    print(f"Predicted: {predicted_label} (Confidence: {np.max(prediction)*100:.2f}%)")
    
    # Display the image
    plt.imshow(img)
    plt.title(f"Prediction: {predicted_label}")
    plt.axis('off')
    plt.show()

# Example usage:
img_path = 'path_to_your_test_image.jpg'  # Replace with the actual path to your image
predict_image(img_path)
