In this file we obtain the data from [Kaggle](https://www.kaggle.com/paultimothymooney/kermany2018/) and do some wrangling to create a Pandas DataFrame and then store it as CSV.

This notebook was run locally (not in Colab), so it assumes the dataset has been downloaded and extracted into the same directory as this notebook.

If you are registered on Kaggle, you can use the direct [download link](https://www.kaggle.com/paultimothymooney/kermany2018/download).

# Import stuff

In [None]:
import hashlib
import os

import numpy as np
import pandas as pd
import PIL.Image  # used further down to read each image's dimensions

# List folders in the downloaded dataset

In [None]:
os.listdir("kermany2018_downloaded_from_kaggle/OCT2017 "), os.listdir(
    "kermany2018_downloaded_from_kaggle/oct2017/OCT2017 "
)

(['.DS_Store', 'test', 'train', 'val'], ['.DS_Store', 'test', 'train', 'val'])

# Create a Pandas DataFrame

## Create the actual DataFrame, empty for now

In [None]:
df = pd.DataFrame(
    columns=[
        "file_name",
        "dataset",
        "condition",
        "file_location",
        "patient_id",
        "md5",
        "dimensions",
    ]
)
df

Unnamed: 0,file_name,dataset,condition,file_location,patient_id,md5,dimensions


## Fill the Pandas DataFrame

The DataFrame is filled by walking through the folders of the downloaded dataset.

The downloaded dataset is structured like so:
- `main_dirs`, ['kermany2018_downloaded_from_kaggle/OCT2017 ', 'kermany2018_downloaded_from_kaggle/oct2017/OCT2017 '], are the containing folders
  - `data_set_dir` represents the folders inside: 'test', 'train' and 'val'
    - `data_type_dir` represents the condition: 'CNV', 'DME', 'DRUSEN' and 'NORMAL'
      - `file_name` is the name of the file

All files will be kept together inside the DataFrame.
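
The nesting described above can be sketched on a throwaway directory tree. This is only an illustration, not the cell used in this notebook: `iter_image_files` and the `demo_root` folder are hypothetical names, and the empty files stand in for the real images.

```python
import tempfile
from pathlib import Path


def iter_image_files(main_dir):
    """Yield (dataset, condition, file_name) for every visible file under main_dir."""
    for data_set_dir in sorted(Path(main_dir).iterdir()):
        if not data_set_dir.is_dir():
            continue  # e.g. a stray .DS_Store next to test/train/val
        for data_type_dir in sorted(data_set_dir.iterdir()):
            if not data_type_dir.is_dir():
                continue
            for file_path in sorted(data_type_dir.iterdir()):
                if file_path.name.startswith("."):
                    continue  # skip hidden files
                yield data_set_dir.name, data_type_dir.name, file_path.name


# Build a tiny stand-in tree with the same three levels as the dataset.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "demo_root"  # hypothetical folder name
    samples = [
        ("test", "CNV", "CNV-4283050-2.jpeg"),
        ("val", "NORMAL", "NORMAL-5193994-1.jpeg"),
    ]
    for dataset, condition, name in samples:
        folder = root / dataset / condition
        folder.mkdir(parents=True, exist_ok=True)
        (folder / name).touch()
    # Hidden files at two levels, both of which should be ignored.
    (root / ".DS_Store").touch()
    (root / "test" / "CNV" / ".DS_Store").touch()

    found = list(iter_image_files(root))

print(found)
```

Sorting each level only makes the order reproducible; `os.listdir` does not guarantee any particular order.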

In [None]:
main_dirs = [
    "kermany2018_downloaded_from_kaggle/OCT2017 ",
    "kermany2018_downloaded_from_kaggle/oct2017/OCT2017 ",
]
# Collect one dict per image and build the DataFrame once at the end:
# DataFrame.append was removed in pandas 2.0 and is slow inside a loop.
rows = []
for main_dir in main_dirs:
    for data_set_dir in os.listdir(main_dir):
        if not os.path.isdir(f"{main_dir}/{data_set_dir}"):
            continue

        for data_type_dir in os.listdir(f"{main_dir}/{data_set_dir}"):
            if not os.path.isdir(f"{main_dir}/{data_set_dir}/{data_type_dir}"):
                continue

            for file_name in os.listdir(f"{main_dir}/{data_set_dir}/{data_type_dir}"):
                # skip hidden files such as .DS_Store
                if file_name.startswith("."):
                    continue
                file_location = f"{main_dir}/{data_set_dir}/{data_type_dir}/{file_name}"
                # md5 stuff: https://stackoverflow.com/a/16876405/1071459
                with open(file_location, "rb") as file_to_check:
                    # read contents of the file
                    data = file_to_check.read()
                    # hash the contents of the file
                    md5 = hashlib.md5(data).hexdigest()
                    # rewind so the image is read from the start of the file
                    file_to_check.seek(0)
                    dimensions = PIL.Image.open(file_to_check).size
                rows.append(
                    {
                        "file_name": file_name,
                        "dataset": data_set_dir,
                        "condition": data_type_dir,
                        "file_location": file_location,
                        # file names look like CNV-4283050-2.jpeg; the middle part is the patient id
                        "patient_id": os.path.splitext(file_name)[0].split("-")[1],
                        "md5": md5,
                        "dimensions": dimensions,
                    }
                )

df = pd.DataFrame(rows, columns=df.columns)

df

Unnamed: 0,file_name,dataset,condition,file_location,patient_id,md5,dimensions
0,CNV-4283050-2.jpeg,test,CNV,kermany2018_downloaded_from_kaggle/OCT2017 /te...,4283050,194c039768e730812cf77c2072821f83,"(512, 496)"
1,CNV-909994-1.jpeg,test,CNV,kermany2018_downloaded_from_kaggle/OCT2017 /te...,909994,5b35e52a54e99ef5195e4a715054ac09,"(512, 496)"
2,CNV-5861916-2.jpeg,test,CNV,kermany2018_downloaded_from_kaggle/OCT2017 /te...,5861916,4266f7daa216b0d41db0c72330d4ced0,"(768, 496)"
3,CNV-2959614-4.jpeg,test,CNV,kermany2018_downloaded_from_kaggle/OCT2017 /te...,2959614,3b98f769746a5b1940e01d78e17cc432,"(768, 496)"
4,CNV-4974377-1.jpeg,test,CNV,kermany2018_downloaded_from_kaggle/OCT2017 /te...,4974377,174645709e1ac2f93849162c39ff729d,"(512, 496)"
...,...,...,...,...,...,...,...
168963,NORMAL-5193994-1.jpeg,val,NORMAL,kermany2018_downloaded_from_kaggle/oct2017/OCT...,5193994,c452deb7fe847610d4aa1ee41c4af55f,"(512, 496)"
168964,NORMAL-5324912-1.jpeg,val,NORMAL,kermany2018_downloaded_from_kaggle/oct2017/OCT...,5324912,2ee72e2c1e0458646b2b011a4c2a2ae4,"(512, 496)"
168965,NORMAL-9053621-1.jpeg,val,NORMAL,kermany2018_downloaded_from_kaggle/oct2017/OCT...,9053621,8781d05a185082abe914bb42807e05b9,"(512, 496)"
168966,NORMAL-5156112-1.jpeg,val,NORMAL,kermany2018_downloaded_from_kaggle/oct2017/OCT...,5156112,0a37613255e44f8a4f985964fd8fe438,"(512, 496)"
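
The `patient_id` column is taken from the middle part of the file name. A minimal sketch of that split is shown here; `parse_file_name` is a hypothetical helper, and reading the trailing number as a per-patient image counter is an assumption, since the notebook only uses the middle part.

```python
import os


def parse_file_name(file_name):
    """Split 'CNV-4283050-2.jpeg' into (condition, patient_id, image_number)."""
    stem = os.path.splitext(file_name)[0]  # drop the .jpeg extension
    # Raises ValueError if the name does not have exactly three dash-separated parts.
    condition, patient_id, image_number = stem.split("-")
    # patient_id stays a string, as in the DataFrame above.
    return condition, patient_id, int(image_number)


parsed = parse_file_name("CNV-4283050-2.jpeg")
print(parsed)  # ('CNV', '4283050', 2)
```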


## Save to CSV

In [None]:
df.to_csv("kaggle_dataset_filelist.csv")
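
When the CSV is read back, the `dimensions` tuples come back as plain strings and `patient_id` is inferred as an integer unless told otherwise. The sketch below shows one possible way to restore both, using a one-row stand-in DataFrame and an in-memory buffer instead of the real file; it is not part of the original workflow.

```python
import ast
import io

import pandas as pd

# One-row stand-in for the real DataFrame.
sample = pd.DataFrame(
    {
        "file_name": ["CNV-4283050-2.jpeg"],
        "patient_id": ["4283050"],
        "dimensions": [(512, 496)],
    }
)

# Write to an in-memory buffer instead of kaggle_dataset_filelist.csv.
buffer = io.StringIO()
sample.to_csv(buffer)
buffer.seek(0)

restored = pd.read_csv(
    buffer,
    index_col=0,  # the unnamed first column holds the saved index
    dtype={"patient_id": str},  # keep ids as strings
    converters={"dimensions": ast.literal_eval},  # "(512, 496)" -> (512, 496)
)
print(restored.loc[0, "dimensions"])
```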