# Exploring the ISIC Dataset


Here is a community-written guide to the competition: [Link](https://www.kaggle.com/competitions/isic-2024-challenge/discussion/543310)

The author's main takeaways:
 - Use state-of-the-art foundation models
 - Clean, augment, and batch the data


## Metabasics of the Data 
- The data is split into two sets: train and test.
- Both sets pair each image with metadata. The metadata covers basics such as age, plus fields that describe the lesion in more detail: shape, color, perimeter, location on the body, etc.
- The images come from a device that performs total-body photography (TBP).
- Every image covers a uniform 15x15 mm field of view. This is a physical area on the skin, not a pixel size.
> Vectra WB360, a 3D TBP product from Canfield Scientific, captures the complete visible cutaneous surface area in one macro-quality resolution tomographic image. An AI-based software then identifies individual lesions on a given 3D capture. This allows for the image capture and identification of all lesions on a patient, which are exported as individual 15x15 mm field-of-view cropped photos. The dataset contains every lesion from a subset of thousands of patients seen between the years 2015 and 2024 across nine institutions and three continents.

The images are stored in an HDF5 file. HDF5 makes it easy to store large amounts of data and to retrieve individual items. To read it from Python we use the `h5py` library. Within the file, however, each image is stored as encoded JPEG bytes rather than as a pixel array.

In [43]:
#Imports 
import os
import h5py
import numpy as np
from PIL import Image

In [25]:
file_dict = {
    "sample_sub": "sample_submission.csv",
    "test_images": "test-image.hdf5",
    "test_metadata": "test-metadata.csv",
    "train_images": "train-image.hdf5",
    "train_metadata": "train_metadata.csv"
}
for key in file_dict.keys():
    path = os.path.join(os.getcwd(), '..', 'data/isic-2024-challenge', file_dict[key])
    path = os.path.normpath(path)
    file_dict[key] = path
file_dict

{'sample_sub': '/Users/rakin/Desktop/Agent-O/data/isic-2024-challenge/sample_submission.csv',
 'test_images': '/Users/rakin/Desktop/Agent-O/data/isic-2024-challenge/test-image.hdf5',
 'test_metadata': '/Users/rakin/Desktop/Agent-O/data/isic-2024-challenge/test-metadata.csv',
 'train_images': '/Users/rakin/Desktop/Agent-O/data/isic-2024-challenge/train-image.hdf5',
 'train_metadata': '/Users/rakin/Desktop/Agent-O/data/isic-2024-challenge/train_metadata.csv'}

In [44]:
# Code Attribution: https://www.geeksforgeeks.org/hdf5-files-in-python/


path = file_dict["train_images"]
f = h5py.File(path, 'r')
print("Count of data: ", len(f.keys()))
list(f.keys())[:5]



Count of data:  401059


['ISIC_0015670',
 'ISIC_0015845',
 'ISIC_0015864',
 'ISIC_0015902',
 'ISIC_0024200']

In the cell above we list the keys of the HDF5 file and print the first five. There are about 400k keys (401,059), and each one names a dataset holding a single lesion image. Each key also matches the unique identifier of the corresponding row in the metadata.
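Because each HDF5 key doubles as the metadata identifier, images and metadata rows can be paired with a simple lookup. The sketch below uses only the standard library and an in-memory stand-in for the metadata CSV. The column names (`isic_id`, `age_approx`, `anatom_site_general`), the helper names, and all row values are assumptions made up for illustration, not taken from the real file.

```python
import csv
import io

# Invented stand-in for the metadata CSV; column names and values are assumptions.
SAMPLE_METADATA_CSV = """isic_id,age_approx,anatom_site_general
ISIC_0015670,60,lower extremity
ISIC_0015845,60,head/neck
ISIC_0015864,80,posterior torso
"""


def index_metadata(csv_text, id_column="isic_id"):
    """Map each image ID to its metadata row (a dict of column name -> string)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[id_column]: row for row in reader}


def missing_metadata(image_keys, metadata_by_id):
    """Return the image keys that have no matching metadata row."""
    return [key for key in image_keys if key not in metadata_by_id]


metadata_by_id = index_metadata(SAMPLE_METADATA_CSV)

# In the notebook this list would come from the HDF5 file, e.g. list(f.keys()).
image_keys = ["ISIC_0015670", "ISIC_0015845", "ISIC_0015902"]

print(metadata_by_id["ISIC_0015670"]["anatom_site_general"])
print(missing_metadata(image_keys, metadata_by_id))
```

Checking for keys without a metadata row (and the reverse) is a cheap sanity test before building a training pipeline on top of the pairing.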


In [45]:
keys = list(f.keys())
imageo = f[keys[0]]


In [46]:
print(imageo.dtype)
print(imageo.shape)
print(imageo.size)
print(imageo.ndim)

|S3325
()
1
0
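
Read together, the four values above describe a scalar dataset: `|S3325` is a fixed-length byte string of 3325 bytes, the shape `()` and `ndim` of 0 mean there are no axes, and a `size` of 1 means there is a single element. A minimal NumPy-only sketch of the same layout follows; the 4-byte payload is a stand-in, not data read from the file.

```python
import numpy as np

# Stand-in payload: the first four bytes of a JPEG/JFIF stream.
payload = b"\xff\xd8\xff\xe0"

# Wrapping a bytes object gives a zero-dimensional array of fixed-length bytes.
scalar = np.array(payload)
print(scalar.dtype, scalar.shape, scalar.size, scalar.ndim)

# Indexing with an empty tuple extracts the single stored value.
value = scalar[()]
print(type(value).__name__, len(value))
```

The same empty-tuple indexing is what `imageo[()]` does on the h5py dataset in the next cell.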


In [52]:
imageo[()]

b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00\x00\x01\x00\x01\x00\x00\xff\xdb\x00C\x00\x08\x06\x06\x07\x06\x05\x08\x07\x07\x07\t\t\x08\n\x0c\x14\r\x0c\x0b\x0b\x0c\x19\x12\x13\x0f\x14\x1d\x1a\x1f\x1e\x1d\x1a\x1c\x1c $.\' ",#\x1c\x1c(7),01444\x1f\'9=82<.342\xff\xdb\x00C\x01\t\t\t\x0c\x0b\x0c\x18\r\r\x182!\x1c!22222222222222222222222222222222222222222222222222\xff\xc0\x00\x11\x08\x00\x8b\x00\x8b\x03\x01"\x00\x02\x11\x01\x03\x11\x01\xff\xc4\x00\x1f\x00\x00\x01\x05\x01\x01\x01\x01\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\xff\xc4\x00\xb5\x10\x00\x02\x01\x03\x03\x02\x04\x03\x05\x05\x04\x04\x00\x00\x01}\x01\x02\x03\x00\x04\x11\x05\x12!1A\x06\x13Qa\x07"q\x142\x81\x91\xa1\x08#B\xb1\xc1\x15R\xd1\xf0$3br\x82\t\n\x16\x17\x18\x19\x1a%&\'()*456789:CDEFGHIJSTUVWXYZcdefghijstuvwxyz\x83\x84\x85\x86\x87\x88\x89\x8a\x92\x93\x94\x95\x96\x97\x98\x99\x9a\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xd2\xd3\
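
The raw value above starts with `\xff\xd8` (the JPEG start-of-image marker) followed by a JFIF header, which confirms that the payload is an encoded JPEG. As a rough sketch, the pixel dimensions can be read straight from that header without decoding the image. The function name `jpeg_dimensions` is hypothetical, and the sample bytes are an abbreviated header assembled from the APP0 and SOF0 segments visible in the output above, with the quantization tables in between left out.

```python
import struct

# Start-of-frame markers handled by this sketch: baseline, extended, progressive.
SOF_MARKERS = {0xC0, 0xC1, 0xC2}


def jpeg_dimensions(data):
    """Return (width, height) by walking JPEG header segments.

    Simplified sketch: assumes every segment after SOI carries a length field,
    which holds for typical headers but not for every valid JPEG stream.
    """
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing start-of-image marker")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = data[pos + 1]
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if marker in SOF_MARKERS:
            # Segment body: precision (1 byte), height (2 bytes), width (2 bytes).
            _precision, height, width = struct.unpack(">BHH", data[pos + 4:pos + 9])
            return width, height
        pos += 2 + length  # skip the marker plus the segment it introduces
    raise ValueError("no start-of-frame segment found")


# Abbreviated header: SOI, APP0 (JFIF), SOF0. Not a complete, decodable image.
SAMPLE_HEADER = (
    b"\xff\xd8"
    b"\xff\xe0\x00\x10JFIF\x00\x01\x01\x00\x00\x01\x00\x01\x00\x00"
    b"\xff\xc0\x00\x11\x08\x00\x8b\x00\x8b\x03"
    b"\x01\x22\x00\x02\x11\x01\x03\x11\x01"
)

print(jpeg_dimensions(SAMPLE_HEADER))  # (139, 139)
```

The `\x00\x8b\x00\x8b` pair after the SOF0 marker encodes a height and width of 139 pixels each for this first image.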

In [53]:

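import io

# Preferred ways to read a dataset's value:
ds_obj = f[keys[0]]       # h5py Dataset object (a handle; nothing is read yet)
ds_val = f[keys[0]][()]   # the stored value; here a NumPy bytes scalar holding the encoded JPEG

# Image.fromarray(ds_val) fails with
#   TypeError: Cannot handle this data type: (1, 1), |S3325
# because the value is encoded JPEG bytes, not a pixel array.
# Decode the bytes through a file-like wrapper instead:
retrieved_image = Image.open(io.BytesIO(ds_val))
retrieved_image.show()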
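
Since each dataset holds encoded JPEG bytes, one way to get pixel data for a model is to decode the bytes with Pillow and convert the result to a NumPy array. The sketch below is self-contained: it encodes a small solid-color image in memory as a stand-in for `f[key][()]`, so the helper name `decode_jpeg_bytes`, the 139x139 size, and the color are all illustrative assumptions.

```python
import io

import numpy as np
from PIL import Image


def decode_jpeg_bytes(jpeg_bytes):
    """Decode encoded JPEG bytes into an (H, W, 3) uint8 array."""
    with Image.open(io.BytesIO(jpeg_bytes)) as img:
        return np.asarray(img.convert("RGB"))


# Stand-in for f[key][()]: encode a small solid-color image as JPEG in memory.
buffer = io.BytesIO()
Image.new("RGB", (139, 139), color=(180, 120, 100)).save(buffer, format="JPEG")
jpeg_bytes = buffer.getvalue()

pixels = decode_jpeg_bytes(jpeg_bytes)
print(pixels.shape, pixels.dtype)
```

With the real file, the same helper would be called as `decode_jpeg_bytes(f[keys[0]][()])`.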