We chose the **ISIC 2024 - Skin Cancer Detection task with 3D-TBP**, and this notebook is our first milestone. In it, we download the data, sort the images into class folders, rescale the pixel values, and set up a data generator with a training/validation split. Detailed explanations and code follow.

First, we import the necessary libraries.

In [None]:
# Colab notebook magic; remove the next line when running outside a notebook
%tensorflow_version 2.x
import os

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from tensorflow.keras import datasets, layers, models

keras = tf.keras

We first uploaded the .zip file provided by Kaggle to Google Drive and download it from there to speed up the workflow. We unzip the file in the Google Colab environment because that is significantly faster than on our own setup. The next code blocks only document our workflow; **there’s no need to run them**. They process the data, organize it into class folders, and zip the result within the environment. Further down, you’ll find a code block that downloads the processed and organized data.
The unzipping and organizing code blocks are very time-consuming to run, so we advise you to download only the organized zip later.

In [None]:
# Install gdown if you haven't already
!pip install gdown

# File ID
file_id = '17ezz6Onjpe48XvPrBcYHhZaJ8kAvRSG3'
download_link = f'https://drive.google.com/uc?id={file_id}'

# Download the ZIP file (the --id flag is deprecated in recent gdown releases)
!gdown "{download_link}" --output dataset.zip

# Unzip the downloaded file
!unzip dataset.zip -d /content/dataset
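
If the download is interrupted, or Drive returns something other than the archive, `unzip` fails with a less obvious error. The following is a minimal sketch of a check that could run between the download and the unzip step. The helper name `looks_like_zip` is our own for illustration; only the standard-library `zipfile` module is used.

```python
import os
import zipfile


def looks_like_zip(path):
    """Return True only if path exists and carries a valid ZIP signature."""
    return os.path.isfile(path) and zipfile.is_zipfile(path)


archive_path = "dataset.zip"  # same name as the --output value used above
if looks_like_zip(archive_path):
    with zipfile.ZipFile(archive_path) as archive:
        print(f"{archive_path}: {len(archive.namelist())} entries")
else:
    print(f"{archive_path} is missing or is not a valid ZIP archive")
```

This only inspects the archive's directory; it does not extract anything, so it is cheap even for a large file.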

In this section, we sort the images into class folders, using the isic_id keys of the HDF5 file and the target column of the metadata CSV. This folder layout is needed to load the images later with the ImageDataGenerator class from Keras. You don't need to run this code: we saved the organized data as a zip file (see the section after this one), and you can get that file by running the marked code.

In [None]:
import io
import os

import h5py
import pandas as pd
from PIL import Image

# Define paths and parameters
hdf5_file_path = 'dataset/train-image.hdf5'  # Path to your HDF5 file
metadata_file_path = 'dataset/train-metadata.csv'  # Path to your metadata CSV
image_folder = 'images'  # Base folder for images
folder1_name = 'benign'  # Name of folder for images with target value 0
folder2_name = 'malignant'  # Name of folder for images with target value 1
target_column = 'target'  # Column in metadata containing the target value

# Create folders if they don't exist
os.makedirs(os.path.join(image_folder, folder1_name), exist_ok=True)
os.makedirs(os.path.join(image_folder, folder2_name), exist_ok=True)

# Read metadata (one row per image: isic_id and target value)
metadata_df = pd.read_csv(metadata_file_path)

# Open HDF5 file
with h5py.File(hdf5_file_path, 'r') as hdf5_file:

    # Iterate through metadata
    for index, row in metadata_df.iterrows():
        isic_id = row['isic_id']
        target_value = row[target_column]

        # Get image data from HDF5
        image_data = hdf5_file[isic_id][()]

        # Convert to PIL Image
        image = Image.open(io.BytesIO(image_data))

        # Determine destination folder
        destination_folder = folder1_name if target_value == 0 else folder2_name

        # Save image to destination folder
        image_path = os.path.join(image_folder, destination_folder, f"{isic_id}.jpg")
        image.save(image_path)

        print(f"Saved image {isic_id} to {destination_folder}")

print("Image separation complete!")

In [None]:
import shutil
import os

# Define paths
image_folder = 'images'  # Path to the images folder
zip_file_name = 'images.zip'  # Name of the zip file

# Create zip file
shutil.make_archive(zip_file_name[:-4], 'zip', image_folder)

print(f"Zip file '{zip_file_name}' created successfully.")
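
Because the archive is built from inside the `images` folder, the class folders should sit at the top level of the zip. A minimal sketch for confirming this before uploading the file is shown here. The helper name `top_level_entries` is hypothetical, and the expected folder names are taken from the organizing code above.

```python
import os
import zipfile


def top_level_entries(zip_path):
    """Return the sorted first path components stored in the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return sorted({name.split("/")[0] for name in archive.namelist() if name})


# Only inspect the archive if it has actually been created.
if os.path.exists("images.zip"):
    print(top_level_entries("images.zip"))
```

If the list contains anything other than the two class folders, a directory-based loader would treat each extra folder as an additional class.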

### YOU NEED TO RUN THIS TO GET THE DATA ###

The code block below downloads the zip file that contains the organized data and unzips it, which may take some time. The result is two folders: "benign", which contains images of non-cancerous moles, and "malignant", which contains images of cancerous moles.

In [None]:
# Install gdown if you haven't already
!pip install gdown

# File ID from the provided link
file_id = '1UlGjhWlDpL9nIT0CdyJGnMxCJc5VZuCx'
download_link = f'https://drive.google.com/uc?id={file_id}'

# Download the ZIP file (the --id flag is deprecated in recent gdown releases)
!gdown "{download_link}" --output dataset.zip

# Create a folder for the organized dataset (-p: no error if it already exists)
!mkdir -p Organized_Dataset

# Unzip the downloaded file into the Organized_Dataset folder
!unzip dataset.zip -d Organized_Dataset

Only the last 5000 lines of the streaming output are visible.
  inflating: Organized_Dataset/benign/ISIC_1399219.jpg  
  inflating: Organized_Dataset/benign/ISIC_0632500.jpg  
  inflating: Organized_Dataset/benign/ISIC_2725785.jpg  
  inflating: Organized_Dataset/benign/ISIC_7632164.jpg  
  inflating: Organized_Dataset/benign/ISIC_6491430.jpg  
  inflating: Organized_Dataset/benign/ISIC_5811800.jpg  
  inflating: Organized_Dataset/benign/ISIC_7078164.jpg  
  inflating: Organized_Dataset/benign/ISIC_5460061.jpg  
  inflating: Organized_Dataset/benign/ISIC_8336137.jpg  
  inflating: Organized_Dataset/benign/ISIC_6265352.jpg  
  inflating: Organized_Dataset/benign/ISIC_7430767.jpg  
  inflating: Organized_Dataset/benign/ISIC_7746968.jpg  
  inflating: Organized_Dataset/benign/ISIC_7739668.jpg  
  inflating: Organized_Dataset/benign/ISIC_2178954.jpg  
  inflating: Organized_Dataset/benign/ISIC_0172797.jpg  
  inflating: Organized_Dataset/benign/ISIC_6941272.jpg  
  inflating: Organized_
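
Since the unzip output is truncated, it does not show whether both class folders were extracted completely. As a minimal sketch, the images per class folder can be counted as follows. The function name `count_images_per_class` and the list of file extensions are our own assumptions; the folder name `Organized_Dataset` comes from the cell above.

```python
import os


def count_images_per_class(root, extensions=(".jpg", ".jpeg", ".png")):
    """Return {class_folder_name: number_of_image_files} for each subfolder of root."""
    counts = {}
    if not os.path.isdir(root):
        return counts  # nothing to count if the folder does not exist yet
    for entry in sorted(os.listdir(root)):
        class_dir = os.path.join(root, entry)
        if os.path.isdir(class_dir):
            counts[entry] = sum(
                1 for name in os.listdir(class_dir)
                if name.lower().endswith(extensions)
            )
    return counts


print(count_images_per_class("Organized_Dataset"))
```

The totals should match the number of rows per target value in the metadata CSV; a mismatch would suggest an incomplete download or extraction.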

Here we load the data using the ImageDataGenerator class from tensorflow-keras. We rescale the pixel values from the range [0, 255] to [0, 1], and use 80% of the images for training and 20% for validation.

In [None]:
# Import libraries
import tensorflow as tf

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Create an ImageDataGenerator instance
datagen = ImageDataGenerator(
    rescale=1./255,
    validation_split=0.2)

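# Folder created by the download cell above; use '/content/images' instead
# if you ran the organizing code yourself
data_dir = '/content/Organized_Dataset'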

train_generator = datagen.flow_from_directory(
    data_dir,
    target_size=(256, 256),
    batch_size=32,
    class_mode='binary',
    subset='training')

validation_generator = datagen.flow_from_directory(
    data_dir,
    target_size=(256, 256),
    batch_size=32,
    class_mode='binary',
    subset='validation')
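
Skin-lesion datasets of this kind usually contain far more benign than malignant images, so it may help to weight the classes during training. The sketch below computes the common "balanced" heuristic, n_samples / (n_classes * class_count), in plain Python. The function name `balanced_class_weights` is hypothetical, and the toy labels stand in for `train_generator.classes`, where `flow_from_directory` assigns class indices in alphabetical folder order (benign = 0, malignant = 1).

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(int(label) for label in labels)
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {
        cls: n_samples / (n_classes * count)
        for cls, count in sorted(counts.items())
    }


# Toy labels for illustration only; with the real data one would pass
# train_generator.classes instead.
toy_labels = [0] * 98 + [1] * 2
print(balanced_class_weights(toy_labels))
```

The resulting dictionary has the shape that the `class_weight` argument of Keras `model.fit` expects, so it could be passed there once a model is defined in a later milestone.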
