This notebook downloads the dataset from [Mendeley](https://data.mendeley.com/datasets/rscbjbr9sj/2), wrangles the file listing into a pandas DataFrame, and stores it as CSV.

# Import stuff

In [1]:
import hashlib
import os

import numpy as np
import pandas as pd

# Download and extract dataset

The next commands are meant to be run on Google Colab. They download and extract the whole dataset; on Google Colab this means the dataset is stored on the virtual machine that hosts the notebook.

In [2]:
_URL = "https://data.mendeley.com/public-files/datasets/rscbjbr9sj/files/5699a1d8-d1b6-45db-bb92-b61051445347/file_downloaded"
!wget -nc {_URL} -O OCT2017.tar.gz
!tar -xf OCT2017.tar.gz

--2022-03-09 22:33:04--  https://data.mendeley.com/public-files/datasets/rscbjbr9sj/files/5699a1d8-d1b6-45db-bb92-b61051445347/file_downloaded
Resolving data.mendeley.com (data.mendeley.com)... 162.159.133.86, 162.159.130.86
Connecting to data.mendeley.com (data.mendeley.com)|162.159.133.86|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://md-datasets-public-files-prod.s3.eu-west-1.amazonaws.com/9cfd5550-a37d-4404-9441-860ee091bc67 [following]
--2022-03-09 22:33:04--  https://md-datasets-public-files-prod.s3.eu-west-1.amazonaws.com/9cfd5550-a37d-4404-9441-860ee091bc67
Resolving md-datasets-public-files-prod.s3.eu-west-1.amazonaws.com (md-datasets-public-files-prod.s3.eu-west-1.amazonaws.com)... 52.218.52.219
Connecting to md-datasets-public-files-prod.s3.eu-west-1.amazonaws.com (md-datasets-public-files-prod.s3.eu-west-1.amazonaws.com)|52.218.52.219|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5793183169 (5.4G) [applicat
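
Outside Colab, `wget` and `tar` may not be available (for example on Windows). The same two steps could be sketched in plain Python as follows, assuming only the standard library. The helper names `download_once` and `extract_archive` are hypothetical and not part of the original notebook.

```python
import tarfile
import urllib.request
from pathlib import Path


def download_once(url: str, target: str) -> Path:
    """Download url to target unless the file already exists (similar to wget -nc)."""
    path = Path(target)
    if not path.exists():
        urllib.request.urlretrieve(url, str(path))
    return path


def extract_archive(archive: str, dest: str = ".") -> None:
    """Extract a (possibly compressed) tar archive into dest (similar to tar -xf)."""
    with tarfile.open(archive) as tar:  # Compression is detected automatically
        tar.extractall(dest)


# Intended usage, with _URL as defined in the cell above:
# archive = download_once(_URL, "OCT2017.tar.gz")
# extract_archive(str(archive))
```

The archive is about 5.4G, so skipping the download when the file already exists saves a lot of time on reruns.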

## List downloaded dir contents

In [3]:
!ls OCT2017

test  train


## Mount Google Drive or use locally

**Note**: This was originally run inside a Google Drive directory, whose path has been removed from the code below. If you plan to run it in Google Drive, replace `[REPLACE THIS WITH THE LOCATION IN YOUR OWN GOOGLE DRIVE]` with a directory inside your own Google Drive.

In [9]:
BASE_PATH = !pwd
BASE_PATH = BASE_PATH[0]

# Note
#   If using DATA_CSV_DIR with ! or % , quote it so the spaces are respected
#   e.g. `!head "{DATA_CSV_DIR}"`

try:  # Mount Google Drive
    import os

    from google.colab import drive

    drive.mount("/content/gdrive")
    NOTEBOOK_DIR = "/content/gdrive/My Drive/[REPLACE THIS WITH THE LOCATION IN YOUR OWN GOOGLE DRIVE]"
    DATA_CSV_DIR = NOTEBOOK_DIR
    # !ln -s "{NOTEBOOK_DIR}" NOTEBOOK_DIR
    !if [ -e NOTEBOOK_DIR ]; then echo 'NOTEBOOK_DIR link already exists'; else ln -s "{NOTEBOOK_DIR}" NOTEBOOK_DIR; fi
except ImportError:  # google.colab is not installed: locally run Jupyter
    NOTEBOOK_DIR = f"{BASE_PATH}"
    DATA_CSV_DIR = NOTEBOOK_DIR

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
NOTEBOOK_DIR link already exists


# Create a Pandas DataFrame

## Create the actual DataFrame, empty for now

In [8]:
df = pd.DataFrame(
    columns=[
        "file_name",
        "dataset",
        "condition",
        "file_location",
        "patient_id",
        "md5",
        "dimensions",
    ]
)
df

Unnamed: 0,file_name,dataset,condition,file_location,patient_id,md5,dimensions


## Fill the Pandas DataFrame

The DataFrame is filled by walking through the folders of the downloaded dataset.

The downloaded dataset is structured like so:
- `main_dir`: 'OCT2017', the main folder
  - `data_set_dir`: the folders inside it, 'test' and 'train'
    - `data_type_dir`: the condition, one of 'CNV', 'DME', 'DRUSEN' and 'NORMAL'
      - `file_name`: the name of the file

All files will be kept together inside the DataFrame.
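
The layout above can also be walked with `pathlib` instead of nested `os.listdir` calls. This is a minimal sketch, assuming every image sits exactly three levels below `main_dir` and is named `CONDITION-PATIENTID-N.jpeg`. `iter_image_files` is a hypothetical helper, not part of the original notebook.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_image_files(main_dir: str) -> Iterator[Tuple[str, str, str, Path]]:
    """Yield (dataset, condition, patient_id, path) for every image file."""
    # The pattern mirrors main_dir / data_set_dir / data_type_dir / file_name
    for path in sorted(Path(main_dir).glob("*/*/*")):
        if not path.is_file() or path.name.startswith("."):
            continue  # Skip directories and hidden files
        dataset, condition = path.parts[-3], path.parts[-2]
        # "CNV-1016042-1.jpeg" -> stem "CNV-1016042-1" -> patient id "1016042"
        patient_id = path.stem.split("-")[1]
        yield dataset, condition, patient_id, path
```

Sorting the paths makes the iteration order reproducible, which `os.listdir` does not guarantee.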

In [None]:
import PIL.Image  # Pillow, used to read the image dimensions

main_dir = "OCT2017"
rows = []  # One dict per image file

for data_set_dir in os.listdir(main_dir):
    data_set_path = f"{main_dir}/{data_set_dir}"
    if not os.path.isdir(data_set_path):
        continue

    for data_type_dir in os.listdir(data_set_path):
        data_type_path = f"{data_set_path}/{data_type_dir}"
        if not os.path.isdir(data_type_path):
            continue

        for file_name in os.listdir(data_type_path):
            if file_name.startswith("."):  # Skip hidden files
                continue

            file_location = f"{data_type_path}/{file_name}"

            # md5 stuff: https://stackoverflow.com/a/16876405/1071459
            with open(file_location, "rb") as file_to_check:
                data = file_to_check.read()
                md5 = hashlib.md5(data).hexdigest()  # md5sum
                file_to_check.seek(0)  # Rewind so Pillow reads from the start
                dimensions = PIL.Image.open(file_to_check).size  # (width, height)

            rows.append(
                {
                    "file_name": file_name,
                    "dataset": data_set_dir,
                    "condition": data_type_dir,
                    "file_location": file_location,
                    "patient_id": os.path.splitext(file_name)[0].split("-")[1],
                    "md5": md5,
                    "dimensions": dimensions,
                }
            )

# DataFrame.append was removed in pandas 2.0, so build the DataFrame from all rows at once
df = pd.DataFrame(rows, columns=df.columns)
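
The loop reads each file fully into memory before hashing, which is fine for small JPEGs. For larger files the digest can be computed in chunks instead. This is a minimal sketch, assuming a hypothetical helper `md5_of_file` and an arbitrary chunk size of 1 MiB.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex md5 digest of a file, reading it chunk by chunk."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops at the empty bytes object returned at EOF
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The result is the same digest as hashing the whole file at once, so it can be compared directly with the `md5` column.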

In [26]:
df = df.sort_values(["dataset", "file_location"]).reset_index(
    drop=True
)  # Keep order test first, train second
df

Unnamed: 0,file_name,dataset,condition,file_location,patient_id,md5,dimensions
0,CNV-1016042-1.jpeg,test,CNV,OCT2017/test/CNV/CNV-1016042-1.jpeg,1016042,8878b3c48d6252464d388feeddf07259,"(512, 496)"
1,CNV-1016042-2.jpeg,test,CNV,OCT2017/test/CNV/CNV-1016042-2.jpeg,1016042,2fe168b795c02e7a675f835f0930abd2,"(512, 496)"
2,CNV-1016042-3.jpeg,test,CNV,OCT2017/test/CNV/CNV-1016042-3.jpeg,1016042,6bcd80b40786b6760724d082098f513f,"(768, 496)"
3,CNV-1016042-4.jpeg,test,CNV,OCT2017/test/CNV/CNV-1016042-4.jpeg,1016042,4693ad1edc383053e72563f8212a94ce,"(512, 496)"
4,CNV-103044-1.jpeg,test,CNV,OCT2017/test/CNV/CNV-103044-1.jpeg,103044,bcd67009e1a0f7d540840a057f6334b2,"(512, 496)"
...,...,...,...,...,...,...,...
84479,NORMAL-9997680-2.jpeg,train,NORMAL,OCT2017/train/NORMAL/NORMAL-9997680-2.jpeg,9997680,31f918fd7fe2f0d02d6a6b9f6f44bcf5,"(512, 512)"
84480,NORMAL-9997680-3.jpeg,train,NORMAL,OCT2017/train/NORMAL/NORMAL-9997680-3.jpeg,9997680,ac491500b3d2616aaa6976d87505269a,"(512, 512)"
84481,NORMAL-9997680-4.jpeg,train,NORMAL,OCT2017/train/NORMAL/NORMAL-9997680-4.jpeg,9997680,9d961b691ce6f2484642f5d8118748c7,"(512, 512)"
84482,NORMAL-9997680-5.jpeg,train,NORMAL,OCT2017/train/NORMAL/NORMAL-9997680-5.jpeg,9997680,dd3be99ae7e602565aa91aa89ea06daa,"(512, 512)"


## Save all to CSV

In [27]:
df.to_csv(f"{DATA_CSV_DIR}/mendeley_filelist.csv")  # Save df to csv

# Do further analysis on the DataFrame

In [18]:
df.value_counts(["dataset", "condition"])

dataset  condition
train    CNV          37205
         NORMAL       26315
         DME          11348
         DRUSEN        8616
test     CNV            250
         DME            250
         DRUSEN         250
         NORMAL         250
dtype: int64

In [19]:
# Unique md5 (file) count
len(df["md5"].unique())

76902

In [20]:
# Number of duplicate rows (rows beyond the first copy of each md5)
df.shape[0] - len(df["md5"].unique())

7582

In [21]:
# df['md5'].value_counts() # Get md5 count by md5
# (df['md5'].value_counts()>1) # Get boolean Series by md5 where the count is bigger than 1, i.e. non-unique

# Get non unique md5
df["md5"].value_counts()[df["md5"].value_counts() > 1].index

Index(['2a2b0254a706719e6ef5899576c11334', '66b674e8e6e9c0c8e08689dcbaa08d19',
       '57eaf5cca279fa131814709e7d423c0d', '00e9c9a68ccdbdc311d4dd500a773f1e',
       '312562577d196d0ed8b66a36ff8de512', '87d81332ed4e6f753685983c3d1e61e1',
       '941b16320bc77e2957813e00583db823', '11f82331814898bff76e8be28d0a5db2',
       'b29d63b72317c8dc94a102fa54277895', 'e557d526182ad807ca00d852b19a713d',
       ...
       '564eb4d52afc749ecbabb18a036ecb10', '8431a06cbf9c28a2ef05d12d6a143f0d',
       '6e26c3e2ff923f51812a600b977ea6a6', 'cd0e102aaf9bff75d4359d4042a2b7e6',
       '006486dd2bc5516ab65c1198fab5ff31', '278f65557267078b1b55577adba8e773',
       '739a9d0fa830e5afaf22f2112a762d0f', '09f6ba7a01ad66c4b4ba37ad1ad6994c',
       'f14f60e19bbf1f2cb45bb75789bdcd9b', 'ac32b240cb8d66d080271f79dd9da25c'],
      dtype='object', length=6779)
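
The three counts above fit together. The following is a short worked check using only numbers printed in the outputs; the total row count is taken as the sum of the per-class counts from `value_counts`.

```python
total_rows = 37205 + 26315 + 11348 + 8616 + 4 * 250  # 84484 rows in df
unique_md5 = 76902    # Distinct md5 values
repeated_md5 = 6779   # md5 values that occur more than once

extra_copies = total_rows - unique_md5      # Rows beyond the first copy of each md5
singleton_md5 = unique_md5 - repeated_md5   # md5 values seen exactly once
rows_in_duplicate_groups = total_rows - singleton_md5

print(extra_copies)              # 7582
print(rows_in_duplicate_groups)  # 14361
```

So the 6779 repeated digests cover 14361 rows: one first copy each, plus the 7582 extra copies.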

In [22]:
# df[df['md5'] == '2a2b0254a706719e6ef5899576c11334'] # Get df for a specific md5

# Get df for all md5 that are duplicated
# df[df['md5'].isin(df['md5'].value_counts()[df['md5'].value_counts()>1].index)].sort_values('md5')
# ↑↓ same
df[df.duplicated("md5", keep=False)].sort_values("md5")

Unnamed: 0,file_name,dataset,condition,file_location,patient_id,md5,dimensions
21184,CNV-6493580-272.jpeg,train,CNV,OCT2017/train/CNV/CNV-6493580-272.jpeg,6493580,000aadc4aa4917fe9a46ea82e3f3b52a,"(512, 496)"
21183,CNV-6493580-271.jpeg,train,CNV,OCT2017/train/CNV/CNV-6493580-271.jpeg,6493580,000aadc4aa4917fe9a46ea82e3f3b52a,"(512, 496)"
39476,DME-30521-87.jpeg,train,DME,OCT2017/train/DME/DME-30521-87.jpeg,30521,001337313eacd1dccc0906a4349c2118,"(512, 512)"
39477,DME-30521-88.jpeg,train,DME,OCT2017/train/DME/DME-30521-88.jpeg,30521,001337313eacd1dccc0906a4349c2118,"(512, 512)"
21128,CNV-6493580-221.jpeg,train,CNV,OCT2017/train/CNV/CNV-6493580-221.jpeg,6493580,0014c28e50263bc18551cd50dcd003f0,"(768, 496)"
...,...,...,...,...,...,...,...
21424,CNV-6566667-144.jpeg,train,CNV,OCT2017/train/CNV/CNV-6566667-144.jpeg,6566667,fffe343c38819c5c2a74fc32cc1a7745,"(768, 496)"
21425,CNV-6566667-145.jpeg,train,CNV,OCT2017/train/CNV/CNV-6566667-145.jpeg,6566667,fffe343c38819c5c2a74fc32cc1a7745,"(768, 496)"
1120,CNV-1016042-207.jpeg,train,CNV,OCT2017/train/CNV/CNV-1016042-207.jpeg,1016042,ffff871ecd494bc2bbc551427f2e225b,"(512, 496)"
1121,CNV-1016042-208.jpeg,train,CNV,OCT2017/train/CNV/CNV-1016042-208.jpeg,1016042,ffff871ecd494bc2bbc551427f2e225b,"(512, 496)"


In [23]:
# Get df without duplicates on ['dataset', 'condition', 'md5']
df.drop_duplicates(subset=["dataset", "condition", "md5"], ignore_index=True)

Unnamed: 0,file_name,dataset,condition,file_location,patient_id,md5,dimensions
0,CNV-1016042-1.jpeg,test,CNV,OCT2017/test/CNV/CNV-1016042-1.jpeg,1016042,8878b3c48d6252464d388feeddf07259,"(512, 496)"
1,CNV-1016042-2.jpeg,test,CNV,OCT2017/test/CNV/CNV-1016042-2.jpeg,1016042,2fe168b795c02e7a675f835f0930abd2,"(512, 496)"
2,CNV-1016042-3.jpeg,test,CNV,OCT2017/test/CNV/CNV-1016042-3.jpeg,1016042,6bcd80b40786b6760724d082098f513f,"(768, 496)"
3,CNV-1016042-4.jpeg,test,CNV,OCT2017/test/CNV/CNV-1016042-4.jpeg,1016042,4693ad1edc383053e72563f8212a94ce,"(512, 496)"
4,CNV-103044-1.jpeg,test,CNV,OCT2017/test/CNV/CNV-103044-1.jpeg,103044,bcd67009e1a0f7d540840a057f6334b2,"(512, 496)"
...,...,...,...,...,...,...,...
77717,NORMAL-9997680-2.jpeg,train,NORMAL,OCT2017/train/NORMAL/NORMAL-9997680-2.jpeg,9997680,31f918fd7fe2f0d02d6a6b9f6f44bcf5,"(512, 512)"
77718,NORMAL-9997680-3.jpeg,train,NORMAL,OCT2017/train/NORMAL/NORMAL-9997680-3.jpeg,9997680,ac491500b3d2616aaa6976d87505269a,"(512, 512)"
77719,NORMAL-9997680-4.jpeg,train,NORMAL,OCT2017/train/NORMAL/NORMAL-9997680-4.jpeg,9997680,9d961b691ce6f2484642f5d8118748c7,"(512, 512)"
77720,NORMAL-9997680-5.jpeg,train,NORMAL,OCT2017/train/NORMAL/NORMAL-9997680-5.jpeg,9997680,dd3be99ae7e602565aa91aa89ea06daa,"(512, 512)"


# Create a new DataFrame without duplicates

Based on the analysis above, two (or more) files are considered duplicates when they have the same condition and the same md5 (that is, the binary files are identical).
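
Here is a toy illustration of that rule, with made-up file names and digests (none of the values come from the dataset). Rows sharing both `condition` and `md5` collapse to one, while the same `md5` under a different condition is kept.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "file_name": ["CNV-1-1.jpeg", "CNV-1-2.jpeg", "DME-1-1.jpeg", "DME-2-1.jpeg"],
        "condition": ["CNV", "CNV", "DME", "DME"],
        "md5": ["aaa", "aaa", "aaa", "bbb"],  # Made-up digests
    }
)

# Same condition and same md5: only the first of the two CNV rows survives
deduped = toy.drop_duplicates(subset=["condition", "md5"], ignore_index=True)
print(deduped["file_name"].tolist())
# ['CNV-1-1.jpeg', 'DME-1-1.jpeg', 'DME-2-1.jpeg']

# The same md5 under two conditions is still present and can be listed afterwards
cross = deduped[deduped.duplicated(subset="md5", keep=False)]
print(cross["condition"].tolist())
# ['CNV', 'DME']
```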

In [28]:
df_nodupes_combo_cond_md5 = df.drop_duplicates(
    subset=["condition", "md5"], ignore_index=True
)
df_nodupes_combo_cond_md5 = df_nodupes_combo_cond_md5.sort_values(
    ["dataset", "file_location"]
).reset_index(
    drop=True
)  # Keep order test first, train second
df_nodupes_combo_cond_md5

Unnamed: 0,file_name,dataset,condition,file_location,patient_id,md5,dimensions
0,CNV-1016042-1.jpeg,test,CNV,OCT2017/test/CNV/CNV-1016042-1.jpeg,1016042,8878b3c48d6252464d388feeddf07259,"(512, 496)"
1,CNV-1016042-2.jpeg,test,CNV,OCT2017/test/CNV/CNV-1016042-2.jpeg,1016042,2fe168b795c02e7a675f835f0930abd2,"(512, 496)"
2,CNV-1016042-3.jpeg,test,CNV,OCT2017/test/CNV/CNV-1016042-3.jpeg,1016042,6bcd80b40786b6760724d082098f513f,"(768, 496)"
3,CNV-1016042-4.jpeg,test,CNV,OCT2017/test/CNV/CNV-1016042-4.jpeg,1016042,4693ad1edc383053e72563f8212a94ce,"(512, 496)"
4,CNV-103044-1.jpeg,test,CNV,OCT2017/test/CNV/CNV-103044-1.jpeg,103044,bcd67009e1a0f7d540840a057f6334b2,"(512, 496)"
...,...,...,...,...,...,...,...
77122,NORMAL-9997680-2.jpeg,train,NORMAL,OCT2017/train/NORMAL/NORMAL-9997680-2.jpeg,9997680,31f918fd7fe2f0d02d6a6b9f6f44bcf5,"(512, 512)"
77123,NORMAL-9997680-3.jpeg,train,NORMAL,OCT2017/train/NORMAL/NORMAL-9997680-3.jpeg,9997680,ac491500b3d2616aaa6976d87505269a,"(512, 512)"
77124,NORMAL-9997680-4.jpeg,train,NORMAL,OCT2017/train/NORMAL/NORMAL-9997680-4.jpeg,9997680,9d961b691ce6f2484642f5d8118748c7,"(512, 512)"
77125,NORMAL-9997680-5.jpeg,train,NORMAL,OCT2017/train/NORMAL/NORMAL-9997680-5.jpeg,9997680,dd3be99ae7e602565aa91aa89ea06daa,"(512, 512)"


## Save the deduplicated DataFrame to CSV

In [29]:
df_nodupes_combo_cond_md5.to_csv(f"{DATA_CSV_DIR}/mendeley_filelist_combo_cond_md5.csv")

## Final check

In [30]:
# Get duplicate md5 (file) but with different condition
df_nodupes_combo_cond_md5[
    df_nodupes_combo_cond_md5.duplicated(subset="md5", keep=False)
].sort_values("md5")

Unnamed: 0,file_name,dataset,condition,file_location,patient_id,md5,dimensions
48233,DRUSEN-7563760-10.jpeg,train,DRUSEN,OCT2017/train/DRUSEN/DRUSEN-7563760-10.jpeg,7563760,004c8ea4d1f342d163074ac12e5ea48e,"(512, 496)"
22868,CNV-7563760-17.jpeg,train,CNV,OCT2017/train/CNV/CNV-7563760-17.jpeg,7563760,004c8ea4d1f342d163074ac12e5ea48e,"(512, 496)"
14663,CNV-4973751-20.jpeg,train,CNV,OCT2017/train/CNV/CNV-4973751-20.jpeg,4973751,024f82a3f61ec8610e36ef60d14e4a54,"(512, 496)"
37442,DME-4973751-24.jpeg,train,DME,OCT2017/train/DME/DME-4973751-24.jpeg,4973751,024f82a3f61ec8610e36ef60d14e4a54,"(512, 496)"
49930,DRUSEN-8986660-102.jpeg,train,DRUSEN,OCT2017/train/DRUSEN/DRUSEN-8986660-102.jpeg,8986660,0380b37c745e613c0f0a2a0b45dc75fa,"(1536, 496)"
...,...,...,...,...,...,...,...
48131,DRUSEN-7513011-9.jpeg,train,DRUSEN,OCT2017/train/DRUSEN/DRUSEN-7513011-9.jpeg,7513011,fc82ffac5e76c4238183a7429dfb572d,"(768, 496)"
48142,DRUSEN-7531689-1.jpeg,train,DRUSEN,OCT2017/train/DRUSEN/DRUSEN-7531689-1.jpeg,7531689,fc94cf10d8f029242463a4eed3087d98,"(512, 496)"
22685,CNV-7531689-94.jpeg,train,CNV,OCT2017/train/CNV/CNV-7531689-94.jpeg,7531689,fc94cf10d8f029242463a4eed3087d98,"(512, 496)"
22747,CNV-7555604-151.jpeg,train,CNV,OCT2017/train/CNV/CNV-7555604-151.jpeg,7555604,fd9f28dad00632f644525582e8e57573,"(768, 496)"


In [31]:
df_nodupes_combo_cond_md5.groupby(["dataset", "condition"]).count()

dataset,condition,file_name,file_location,patient_id,md5,dimensions
test,CNV,250,250,250,250,250
test,DME,250,250,250,250,250
test,DRUSEN,250,250,250,250,250
test,NORMAL,250,250,250,250,250
train,CNV,31399,31399,31399,31399,31399
train,DME,10898,10898,10898,10898,10898
train,DRUSEN,7776,7776,7776,7776,7776
train,NORMAL,26054,26054,26054,26054,26054
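
Comparing these counts with the per-class counts before deduplication shows where the removed rows came from. The following is a short worked check using only numbers printed in this notebook.

```python
before = {"CNV": 37205, "DME": 11348, "DRUSEN": 8616, "NORMAL": 26315}  # train, all rows
after = {"CNV": 31399, "DME": 10898, "DRUSEN": 7776, "NORMAL": 26054}  # train, deduplicated

removed = {condition: before[condition] - after[condition] for condition in before}
print(removed)                # {'CNV': 5806, 'DME': 450, 'DRUSEN': 840, 'NORMAL': 261}
print(sum(removed.values()))  # 7357

# Rows left after deduplication, including the 4 * 250 test rows
remaining = sum(after.values()) + 4 * 250
print(remaining)              # 77127

# Rows that repeat an md5 already present under another condition
print(remaining - 76902)      # 225
```

All 1000 test rows survive. The DataFrame is sorted with `test` first and `drop_duplicates` keeps the first occurrence by default, so a file that appears in both splits under the same condition is dropped from `train`, not from `test`.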
