**Image Augmentation (No Resize) for Balancing Lung Cancer Classification Training Data**

This step focuses solely on augmenting the original Train dataset to address class imbalance without modifying the input image resolution. Ten different geometric and color transformations (including horizontal flip, vertical flip, rotation, shear, and zoom) are applied, ensuring each class is boosted to 1200 samples. The output zip file contains the augmented images in their original size, which will be used for Fine-Tuning the EfficientNet model in subsequent steps.

In [1]:
import tensorflow as tf
import numpy as np
import os
import shutil
import cv2
from tensorflow.keras.preprocessing.image import ImageDataGenerator, load_img, img_to_array

# --- 1. Custom Preprocessing Function (CLAHE + Gamma) ---
def apply_clahe_and_gamma(img):
    """
    Applies CLAHE and Gamma Correction (gamma=0.5) for contrast enhancement.
    This function is designed to be passed to the ImageDataGenerator.
    
    IMPORTANT: The ImageDataGenerator expects the output to be in the 
    range [0, 255] (float or int) for subsequent rescaling.
    """
    
    # Ensure image is in the correct format (CV2 expects uint8 for processing)
    img = img.astype(np.uint8)

    # 1. CLAHE (Contrast Limited Adaptive Histogram Equalization)
    if len(img.shape) == 3 and img.shape[2] == 3:
        # --- COLOR IMAGE PROCESSING (L*a*b* space) ---
        img_lab = cv2.cvtColor(img, cv2.COLOR_RGB2LAB)
        l, a, b = cv2.split(img_lab)
        
        # Apply CLAHE to L channel
        clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8,8))
        cl = clahe.apply(l)
        
        # Merge back and convert to RGB
        limg = cv2.merge((cl, a, b))
        img_clahe = cv2.cvtColor(limg, cv2.COLOR_LAB2RGB)
    else:
        # --- GRAYSCALE IMAGE PROCESSING ---
        clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8,8))
        img_clahe = clahe.apply(img)


    # 2. Gamma Correction (gamma=0.5)
    gamma = 0.5
    inv_gamma = 1.0 / gamma
    # Create lookup table
    table = np.array([((i / 255.0) ** inv_gamma) * 255
                      for i in np.arange(256)]).astype("uint8")
    
    # Apply gamma correction
    # Note: cv2.LUT takes the input image and the table, using the 'uint8' image
    img_corrected = cv2.LUT(img_clahe, table)
    
    # Return the corrected image as float32 in [0, 255] range
    return img_corrected.astype(np.float32)

# --- Configuration & Path Setup ---
BASE_INPUT_DIR = '/kaggle/input/without-augmentation-lung-cancer-dataset/Lung Cancer Dataset'
TRAIN_DIR = os.path.join(BASE_INPUT_DIR, 'Train')
TARGET_OUTPUT_DIR = '/kaggle/working/Augmented_Dataset_Enhanced/Train'
TARGET_COUNT_PER_CLASS = 1200
CLASS_NAMES = ['Bengin cases', 'Malignant cases', 'Normal cases']

print("Configuration and Path Setup complete.")

# --- 2. Create the Augmentation Generator (WITH Preprocessing and Normalization) ---

# The steps for ImageDataGenerator are applied in this order:
# 1. Random Augmentations (Rotation, Shift, Zoom, etc.)
# 2. Custom Preprocessing Function (apply_clahe_and_gamma runs here)
# 3. Rescaling/Normalization (rescale=1./255 runs last)

datagen = ImageDataGenerator(
    # --- Normalization (Rescaling) ---
    rescale=1./255,                 # Standard normalization: converts [0, 255] to [0.0, 1.0]
    
    # --- Custom Preprocessing ---
    preprocessing_function=apply_clahe_and_gamma, # Apply CLAHE and Gamma Correction
    
    # --- Augmentation Parameters ---
    rotation_range=25,
    width_shift_range=0.15,
    height_shift_range=0.15,
    shear_range=0.15,
    zoom_range=[0.8, 1.2],
    horizontal_flip=True,
    vertical_flip=True,
    brightness_range=[0.7, 1.3],
    channel_shift_range=30.0,
    fill_mode='nearest',
)
print("Augmentation Generator with CLAHE/Gamma and Normalization created.")

# --- Data Augmentation Logic and Saving ---
print("üöÄ Starting Data Augmentation...")

# Create the target output directory structure
os.makedirs(TARGET_OUTPUT_DIR, exist_ok=True)

# Note on Saving: When using the generator's .flow() method with a preprocessing_function
# and rescale, the saved images will reflect the full transformation (CLAHE/Gamma AND Normalization). 
# This is usually NOT desirable for saving, as you want to train your model on 
# normalized data, but save un-normalized images to view them easily.

# Since the goal here is to save the FINAL (normalized) output ready for model consumption:

for class_name in CLASS_NAMES:
    input_class_path = os.path.join(TRAIN_DIR, class_name)
    output_class_path = os.path.join(TARGET_OUTPUT_DIR, class_name)
    
    os.makedirs(output_class_path, exist_ok=True)
    
    # Collect existing data files
    image_files = [f for f in os.listdir(input_class_path) if f.lower().endswith(('.png', '.jpg', '.jpeg', '.webp'))]
    current_count = len(image_files)
    
    print(f"\nüìÅ Class: {class_name} (Current Count: {current_count})")

    # 1. Copy existing data (We will skip this to only save the augmented/processed data)
    # The generator will create the augmented/processed versions, so we only need to 
    # generate the NEEDED augmented data.

    if current_count >= TARGET_COUNT_PER_CLASS:
        print(f"    Data count is sufficient. Skipping augmentation.")
        continue

    # 2. Calculate needed augmentation
    needed_augmentation = TARGET_COUNT_PER_CLASS - current_count
    total_generated = 0
    
    # 3. Augment and Save (Looping through images until target is met)
    for filename in image_files:
        if total_generated >= needed_augmentation:
            break
            
        img_path = os.path.join(input_class_path, filename)
        
        try:
            # Load image WITHOUT specifying target_size to preserve original size
            img = load_img(img_path)
            x = img_to_array(img)
            # Reshape: (1, height, width, channels)
            x = x.reshape((1,) + x.shape) 
            
            # Generate and save images (these saved images are NOW preprocessed and normalized)
            for batch in datagen.flow(x, batch_size=1, 
                                      save_to_dir=output_class_path, 
                                      save_prefix=f'{class_name}_aug_{total_generated}', 
                                      save_format='png'):
                total_generated += 1
                
                if total_generated >= needed_augmentation:
                    break
                    
        except Exception as e:
            print(f"    ‚ö†Ô∏è Error processing file {filename}: {e}")
            
    # IMPORTANT: Since we did not copy the original images, we must now calculate 
    # how many images we must generate to reach the target COUNT.
    
    # To simplify, we will just generate enough to cover the shortfall + originals.
    
    # Re-run the augmentation loop until the target is met (this ensures the target is hit)
    while total_generated < needed_augmentation:
        for filename in image_files:
            if total_generated >= needed_augmentation:
                break
            
            img_path = os.path.join(input_class_path, filename)
            
            try:
                img = load_img(img_path)
                x = img_to_array(img)
                x = x.reshape((1,) + x.shape) 
                
                for batch in datagen.flow(x, batch_size=1, 
                                          save_to_dir=output_class_path, 
                                          save_prefix=f'{class_name}_aug_{total_generated}', 
                                          save_format='png'):
                    total_generated += 1
                    if total_generated >= needed_augmentation:
                        break
                        
            except Exception as e:
                pass # Skip error handling for simplicity in the second loop

    # Now we copy the originals and ensure we have enough total files
    
    # Copying originals is essential since the generator only creates *new* augmented images.
    for img_file in image_files:
        shutil.copy(os.path.join(input_class_path, img_file), output_class_path)
    
    total_final_count = len(os.listdir(output_class_path))
    print(f"    ‚úÖ New augmented data created: {total_generated} files. Total Data (Originals + Augmented): {total_final_count} files.")


# --- Zip the folder for download ---
print("\nüì¶ Zipping the dataset...")
# Compress the entire Augmented_Dataset folder into a ZIP file
shutil.make_archive('Augmented_Dataset_Enhanced', 'zip', '/kaggle/working/Augmented_Dataset_Enhanced')
print("    ‚≠ê Augmented dataset successfully created as 'Augmented_Dataset_Enhanced.zip'.")

print("\n--- Pipeline Complete ---")

2025-12-02 15:52:31.966578: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1764690752.201968      47 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1764690752.273076      47 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'

AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'

AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'

AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'

AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'

Configuration and Path Setup complete.
Augmentation Generator with CLAHE/Gamma and Normalization created.
üöÄ Starting Data Augmentation...

üìÅ Class: Bengin cases (Current Count: 100)
    ‚úÖ New augmented data created: 1100 files. Total Data (Originals + Augmented): 1140 files.

üìÅ Class: Malignant cases (Current Count: 1000)
    ‚úÖ New augmented data created: 200 files. Total Data (Originals + Augmented): 1195 files.

üìÅ Class: Normal cases (Current Count: 500)
    ‚úÖ New augmented data created: 700 files. Total Data (Originals + Augmented): 1173 files.

üì¶ Zipping the dataset...
    ‚≠ê Augmented dataset successfully created as 'Augmented_Dataset_Enhanced.zip'.

--- Pipeline Complete ---
