# About the dataset
Ocular Disease Intelligent Recognition (ODIR) is a structured ophthalmic database of 5,000 patients with age, color fundus photographs from left and right eyes, and doctors' diagnostic keywords.<br>

This dataset is meant to represent a "real-life" set of patient information collected by Shanggong Medical Technology Co., Ltd. from different hospitals/medical centers in China. In these institutions, fundus images are captured by various cameras on the market, such as Canon, Zeiss and Kowa, resulting in varied image resolutions.<br>
Annotations were labeled by trained human readers with quality control management. They classify each patient into eight labels:<br>

* Normal (N),
* Diabetes (D),
* Glaucoma (G),
* Cataract (C),
* Age-related Macular Degeneration (A),
* Hypertension (H),
* Pathological Myopia (M),
* Other diseases/abnormalities (O)
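The label codes above can be kept in a small lookup table for later cells. This is only a sketch: the names `LABEL_NAMES`, `LABEL_CODES` and `encode_labels` are hypothetical and not part of the dataset, and the multi-hot encoding assumes that a patient may carry more than one label.

```python
# Hypothetical lookup table built from the label list above.
LABEL_NAMES = {
    "N": "Normal",
    "D": "Diabetes",
    "G": "Glaucoma",
    "C": "Cataract",
    "A": "Age-related Macular Degeneration",
    "H": "Hypertension",
    "M": "Pathological Myopia",
    "O": "Other diseases/abnormalities",
}
LABEL_CODES = list(LABEL_NAMES)  # keeps the order of the list above


def encode_labels(codes):
    """Return a multi-hot list over LABEL_CODES for the given label codes."""
    present = set(codes)
    unknown = present - set(LABEL_CODES)
    if unknown:
        raise ValueError(f"unknown label codes: {sorted(unknown)}")
    return [1 if code in present else 0 for code in LABEL_CODES]


print(encode_labels(["D", "C"]))  # [0, 1, 0, 1, 0, 0, 0, 0]
```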

In [None]:
%%capture
!pip install openpyxl

In [None]:
import numpy as np
import pandas as pd 
import os
import cv2
from tqdm import tqdm

# File structure

In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    print(dirname)

In [None]:
BASE_DIR = '/kaggle/input/ocular-disease-recognition-odir5k/'
TRAIN_DIR = '/kaggle/input/ocular-disease-recognition-odir5k/ODIR-5K/ODIR-5K/Training Images/'
TEST_DIR = '/kaggle/input/ocular-disease-recognition-odir5k/ODIR-5K/ODIR-5K/Testing Images/'

In [None]:
os.listdir(BASE_DIR)

In [None]:
os.listdir('/kaggle/input/ocular-disease-recognition-odir5k/ODIR-5K/ODIR-5K')

## CSV vs Excel
Looks like there are 2 files describing the data: one is in CSV format and the other is an Excel sheet. Let's see if there are any differences between the two.

In [None]:
df_csv = pd.read_csv(os.path.join(BASE_DIR, "full_df.csv"))
print(df_csv.shape)
df_csv.head()

In [None]:
df_csv.iloc[10]

In [None]:
df_csv['filepath'][0]

Each row of this table represents a patient with both their eyes checked. For some reason the last 4 columns only take into account the right eye, but since they don't really add any new information, we can just drop them if we decide to use the csv. <br><br>
Let's now have a look at the Excel sheet.

In [None]:
df_excel = pd.read_excel('/kaggle/input/ocular-disease-recognition-odir5k/ODIR-5K/ODIR-5K/data.xlsx')
print(df_excel.shape)
df_excel.head()

This looks like the same table but with the questionable columns already dropped. Also, the number of entries differs from the first table.

In [None]:
# shape[0] is the row count; printing the whole shape tuple would also show the column count
print(f'# of entries in the csv: {df_csv.shape[0]}')
print(f'# of entries in the excel: {df_excel.shape[0]}')

In [None]:
print(f'# of unique IDs in the csv: {len(df_csv.ID.unique())}')
print(f'# of unique IDs in the excel: {len(df_excel.ID.unique())}')

In [None]:
print(df_csv.ID.unique())
print(df_excel.ID.unique())

In [None]:
df_csv = df_csv.sort_values(by='ID')
df_excel = df_excel.sort_values(by='ID')

print(df_csv.ID.unique())
print(df_excel.ID.unique())

In [None]:
df_csv.ID.value_counts()

In [None]:
df_csv.loc[(df_csv.ID == 2895)]

In [None]:
df_csv.loc[(df_csv.ID == 2400)]

The CSV is structured with the intention of having a separate entry for each image. This is not a good way to organize such data:
* a lot of information is repeated for both eyes
* it's easy to forget to create an entry for the other eye, which is exactly what happened in this table
* from a medical point of view, both eyes of a patient should be considered together
<br><br>
Right now it seems the Excel sheet may be the refined version of the CSV:
it's better structured, it's sorted, and the number of IDs equals the number of entries, which is a round number. Also, the CSV has fewer unique IDs than the Excel sheet.
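The comparison above can be condensed into a single report. The sketch below uses toy frames in place of `df_csv` and `df_excel`; the helper `compare_ids` is hypothetical, and the only assumption about the real tables is that both have an `ID` column.

```python
import pandas as pd

# Toy stand-ins: one row per image (like the CSV) and one row per patient (like the Excel sheet).
toy_csv = pd.DataFrame({"ID": [0, 0, 1, 2, 2]})
toy_excel = pd.DataFrame({"ID": [0, 1, 2, 3]})


def compare_ids(per_image, per_patient):
    """Summarize how the ID columns of two tables differ."""
    ids_image = set(per_image["ID"])
    ids_patient = set(per_patient["ID"])
    rows_per_id = per_image["ID"].value_counts()
    return {
        # IDs present in one table but not in the other
        "only_in_per_image": sorted(ids_image - ids_patient),
        "only_in_per_patient": sorted(ids_patient - ids_image),
        # IDs that have a single row, i.e. an entry for only one eye
        "ids_with_single_row": sorted(rows_per_id[rows_per_id == 1].index),
    }


report = compare_ids(toy_csv, toy_excel)
print(report)
```

Calling `compare_ids(df_csv, df_excel)` on the real tables would list the patients that the CSV lacks entirely and those for which it has only one eye.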

## Image files

In [None]:
train_paths = sorted(os.listdir(TRAIN_DIR))
test_paths = sorted(os.listdir(TEST_DIR))
preprocessed_paths = sorted(os.listdir(os.path.join(BASE_DIR, 'preprocessed_images')))

print(f'train images: {len(train_paths)}')
print(f'test images: {len(test_paths)}')
print(f'preprocessed images: {len(preprocessed_paths)}')

Note that the number of preprocessed images exactly matches the number of entries in the CSV,<br>
while the number of files in the "Training Images" folder corresponds to 3500 patients, which is exactly the number of entries in the Excel sheet.

In [None]:
preprocessed_paths[:10]

In [None]:
train_paths[:10]

Note that 1005_left.jpg is missing from the preprocessed images.
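Gaps like this can be found programmatically instead of by eye. The sketch below assumes that every file name follows the `<ID>_<side>.jpg` pattern seen in `1005_left.jpg`, with `left` and `right` as the only sides; the helpers `eyes_by_patient` and `missing_eyes` are hypothetical.

```python
from collections import defaultdict


def eyes_by_patient(file_names):
    """Map each patient ID to the set of eye sides found in the file names."""
    eyes = defaultdict(set)
    for name in file_names:
        stem = name.rsplit(".", 1)[0]            # "1005_left.jpg" -> "1005_left"
        patient_id, _, side = stem.partition("_")
        eyes[int(patient_id)].add(side)
    return eyes


def missing_eyes(file_names, expected=("left", "right")):
    """Return {patient_id: [missing sides]} for patients lacking an image."""
    result = {}
    for patient_id, sides in eyes_by_patient(file_names).items():
        missing = sorted(set(expected) - sides)
        if missing:
            result[patient_id] = missing
    return result


toy_files = ["1004_left.jpg", "1004_right.jpg", "1005_right.jpg"]
print(missing_eyes(toy_files))  # {1005: ['left']}
```

Running `missing_eyes(preprocessed_paths)` would list every patient with only one preprocessed image, provided the naming assumption holds for the whole folder.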

In [None]:
test_paths[:14]

In [None]:
df_csv.loc[(df_csv.ID == 1000)]

In [None]:
# Build the mask from df_excel itself; a mask from df_csv would not align with df_excel's index
df_excel.loc[(df_excel.ID == 1000)]

Seems like we don't have labels for the test images. We can either try to contact the owner of the data, or just use a fraction of the training folder as test images.
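If we go with the second option, splitting by patient ID rather than by image keeps both eyes of a patient on the same side of the split. A minimal sketch, assuming integer patient IDs; the helper `split_patient_ids`, the 20% fraction and the seed are all hypothetical choices.

```python
import numpy as np


def split_patient_ids(patient_ids, test_fraction=0.2, seed=0):
    """Shuffle the unique patient IDs and hold out a fraction of them as a test set."""
    ids = np.unique(np.asarray(patient_ids))
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(ids)
    n_test = int(round(len(ids) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Toy example with ten patient IDs.
train_ids, test_ids = split_patient_ids(range(10), test_fraction=0.2)
print(len(train_ids), len(test_ids))  # 8 2
```

With the real data, something like `split_patient_ids(df_excel.ID)` would give the two ID sets, which can then be used to filter both the table and the image file names.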

In [None]:
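heights = []
widths = []
for file_name in tqdm(train_paths):
    image = cv2.imread(os.path.join(TRAIN_DIR, file_name))
    if image is None:  # cv2.imread returns None when a file cannot be read or decoded
        continue
    heights.append(image.shape[0])  # shape is (height, width, channels)
    widths.append(image.shape[1])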

In [None]:
heights_pd = pd.Series(heights)
widths_pd = pd.Series(widths)
print(f'min height: {heights_pd.min()}')
print(f'max height: {heights_pd.max()}')
print(f'min width: {widths_pd.min()}')
print(f'max width: {widths_pd.max()}')

In [None]:
pd.set_option('display.max_rows', None)
heights_pd.value_counts()

In [None]:
vertical_images = 0
horizontal_images = 0
square_images = 0
for i in range(len(heights)):
    if heights[i] > widths[i]:
        vertical_images+=1
    elif heights[i] < widths[i]:
        horizontal_images+=1
    else:
        square_images+=1
print(f'vertical images: {vertical_images}')
print(f'horizontal images: {horizontal_images}')
print(f'square images: {square_images}')