# Introduction

This kernel takes a quick look at the image sizes in the training dataset, and at some of the images at the lowest and highest diagnosis levels, to see whether anything is easily visible that helps explain what a doctor might be looking at when making a classification.

There is also a [previous competition](https://www.kaggle.com/c/diabetic-retinopathy-detection) on the same topic, with the exact same training labels. It seems to have a much larger training dataset, and it was mentioned multiple times in the [external data thread](). I had trouble adding that competition as a data source (an error about loading the data), so I downloaded the data and set it up as a [separate dataset](https://www.kaggle.com/donkeys/retinopathy-train-2015). I had to downscale the images quite a bit, to a maximum of 896x896 pixels, to fit within the 20GB dataset size limit, but the set still seems potentially useful.
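
The downscaling step described above could look roughly like the following. This is only a sketch, not the exact script used to build the dataset: the helper name `downscale_to_max` and the constant `MAX_SIDE` are made up for illustration, and it assumes the aspect ratio should be preserved.

```python
from PIL import Image

MAX_SIDE = 896  # assumed limit for the longest side, taken from the text above


def downscale_to_max(img, max_side=MAX_SIDE):
    """Return a copy of img whose longest side is at most max_side pixels."""
    resized = img.copy()
    # thumbnail() resizes in place, keeps the aspect ratio, and never upscales
    resized.thumbnail((max_side, max_side))
    return resized
```

Images that are already smaller than the limit come back unchanged in size, so the helper can be applied to every file without checking sizes first.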

I am not quite sure how to check the exact date of an old competition here on Kaggle, so I just picked the number of years it displays in the past and went with 2015. So I will call the older set the *2015* set here, or the *past* set, versus the current set for the *present* time.


In [None]:
import os
import cv2
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import json
import math
import PIL
from PIL import ImageOps
from keras.models import Sequential, Model
from keras.layers import Dense, Flatten, Activation, Dropout, GlobalAveragePooling2D
from keras.preprocessing.image import ImageDataGenerator
from keras import optimizers, applications
from keras.callbacks import ModelCheckpoint, LearningRateScheduler, TensorBoard, EarlyStopping
from keras import backend as K 
from keras.utils.np_utils import to_categorical
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import keras

from tqdm.auto import tqdm
tqdm.pandas()

In [None]:
!ls -l ../input/

## Number of files in train vs test vs the 2015 training set

In [None]:
!ls -l ../input/aptos2019-blindness-detection/train_images | wc -l

In [None]:
!ls -l ../input/aptos2019-blindness-detection/test_images | wc -l

In [None]:
!ls -l ../input/retinopathy-train-2015/rescaled_train_896/rescaled_train_896 | wc -l
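
One thing to keep in mind with the counts above: `ls -l` prints a leading `total` line for a directory, so piping it to `wc -l` reports one more than the number of entries. A small Python alternative is sketched below; the helper name `count_files` is only illustrative.

```python
import os


def count_files(directory):
    """Count regular files directly inside directory (no recursion)."""
    with os.scandir(directory) as entries:
        # Subdirectories are skipped so only the files themselves are counted
        return sum(1 for entry in entries if entry.is_file())
```

For example, `count_files("../input/aptos2019-blindness-detection/train_images")` should give the number of training images without the extra line.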

## Basic metadata

In [None]:
train_path_2015 = "../input/retinopathy-train-2015/rescaled_train_896/rescaled_train_896/"
train_path = "../input/aptos2019-blindness-detection/train_images/"
test_path = "../input/aptos2019-blindness-detection/test_images/"


In [None]:
df_train = pd.read_csv("../input/aptos2019-blindness-detection/train.csv")
df_train.head()

In [None]:
df_test = pd.read_csv("../input/aptos2019-blindness-detection/test.csv")
df_test.head()

In [None]:
df_train_2015 = pd.read_csv("../input/retinopathy-train-2015/rescaled_train_896/trainLabels.csv")
df_train_2015.head()

The first 10 lines of un-ordered file listings for the past and present training sets, to check that the filenames match the csv columns ("id_code" and "image"):

In [None]:
!ls -lU ../input/aptos2019-blindness-detection/train_images/ | head -10

In [None]:
!ls -lU ../input/retinopathy-train-2015/rescaled_train_896/rescaled_train_896 | head -10

It's a match.
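
Eyeballing ten filenames only covers a small part of each set. A sketch of a full check is below, assuming every image is stored as `<id>.png` directly inside the image directory; the helper name `missing_files` is hypothetical.

```python
import os


def missing_files(ids, directory, extension=".png"):
    """Return the ids that have no '<id><extension>' file in directory."""
    present = set(os.listdir(directory))
    return [i for i in ids if f"{i}{extension}" not in present]
```

An empty list from `missing_files(df_train["id_code"], train_path)` and from `missing_files(df_train_2015["image"], train_path_2015)` would confirm the match for every row.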

## Collect all metadata to single dataframe(s)

In [None]:
n_rows = df_train.shape[0]
n_rows

In [None]:
df_train["filename"] = df_train["id_code"]+".png"
df_train["path"] = [train_path]*n_rows
#the year is just to be able to easily separate the past and present datasets later
df_train["year"] = [2019]*n_rows
df_train.head()

In [None]:
n_rows_2015 = df_train_2015.shape[0]
n_rows_2015

In [None]:
df_train_2015["filename"] = df_train_2015["image"]+".png"
df_train_2015["path"] = [train_path_2015]*n_rows_2015
df_train_2015["year"] = [2015]*n_rows_2015
df_train_2015.head()

In [None]:
df_train_2015.columns = ["id_code", "diagnosis", "filename", "path", "year"]
df_train_2015.head()

In [None]:
#drop=True avoids keeping the old per-dataset row numbers as an extra "index" column
df_train_all = pd.concat([df_train,df_train_2015], axis=0, sort=False).reset_index(drop=True)
df_train_all.head()

In [None]:
df_train_all.tail()

In [None]:
#replacing df_train with the full set to calculate features and do visualizations all at once, keeping the original (present) just in case
df_train_orig = df_train
df_train = df_train_all

## Calculate Aspect Ratios etc.

In [None]:
%%time
img_sizes = []
widths = []
heights = []
aspect_ratios = []

for index, row in tqdm(df_train.iterrows(), total=df_train.shape[0]):
    filename = row["filename"]
    path = row["path"]
    img_path = os.path.join(path, filename)
    with open(img_path, 'rb') as f:
        img = PIL.Image.open(f)
        img_size = img.size
        img_sizes.append(img_size)
        widths.append(img_size[0])
        heights.append(img_size[1])
        aspect_ratios.append(img_size[0]/img_size[1])

df_train["width"] = widths
df_train["height"] = heights
df_train["aspect_ratio"] = aspect_ratios
df_train["size"] = img_sizes

In [None]:
df_train.head()

## Aspect Ratios

Check whether any images have an aspect ratio that is hugely different from the others:

In [None]:
df_sorted = df_train.sort_values(by="aspect_ratio")

In [None]:
df_sorted.head()

### Past

In [None]:
df_sorted[df_sorted["year"] == 2015].head()

### Present

In [None]:
df_sorted[df_sorted["year"] == 2019].head()

The aspect ratios in the past and present seem very close to each other.

In [None]:
df_sorted.tail()

In [None]:
df_sorted[df_sorted["year"] == 2015].tail()

In [None]:
df_sorted[df_sorted["year"] == 2019].tail()
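
The heads and tails above only show the extremes. To compare the whole range, the aspect ratios can be rounded and counted per year. This is a sketch that assumes the `aspect_ratio` and `year` columns built earlier; the helper name `aspect_ratio_counts` is made up.

```python
import pandas as pd


def aspect_ratio_counts(df, decimals=1):
    """Count images per rounded aspect ratio (rows) and year (columns)."""
    rounded = df["aspect_ratio"].round(decimals)
    return pd.crosstab(rounded, df["year"])
```

Calling `aspect_ratio_counts(df_train)` gives one row per rounded ratio, which makes it easy to see whether both sets cluster around the same values.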

# Look at the Images / Eyes

In [None]:
#This just shows a single image in the notebook
def show_img(filename, path):
    img = PIL.Image.open(f"{path}/{filename}")
    npa = np.array(img)
    print(npa.shape)
    #to drop an alpha channel, see:
    #https://stackoverflow.com/questions/35902302/discarding-alpha-channel-from-images-stored-as-numpy-arrays
    #npa3 = npa[ :, :, :3]
    print(filename)
    plt.imshow(npa)


In [None]:
import matplotlib

#'normal' is not a font family name in matplotlib, so use the generic 'sans-serif'
font = {'family' : 'sans-serif',
        'weight' : 'normal',
        'size'   : 22}

matplotlib.rc('font', **font)

## A Random Eye

Visualize the first image (in aspect ratio order) in the past and present sets to see if they are at all alike:


### Present

In [None]:
row = df_sorted[df_sorted["year"] == 2019].iloc[0]
show_img(row.filename, row.path)

### Past

In [None]:
row = df_sorted[df_sorted["year"] == 2015].iloc[0]
show_img(row.filename, row.path)

## 9-Eyes

Visualize 9 images from a set at a time, to learn a bit more about the set at once.

In [None]:
def plot_first_9(df_to_plot):
    plt.figure(figsize=[30,30])
    for x in range(9):
        path = df_to_plot.iloc[x].path
        filename = df_to_plot.iloc[x].filename
        img = PIL.Image.open(f"{path}/{filename}")
        print(filename)
        plt.subplot(3, 3, x+1)
        plt.imshow(img)
        title_str = filename+", diagnosis: "+str(df_to_plot.iloc[x].diagnosis)
        plt.title(title_str)

## Smallest Aspect Ratio

There seem to be no images with aspect ratio < 1, so plotting the smallest aspect ratios (practically the ratio is then 1) should show the most "square" images:

In [None]:
del df_sorted
df_sorted = df_train.sort_values(by="aspect_ratio", ascending=True)

### Present

In [None]:
plot_first_9(df_sorted[df_sorted["year"] == 2019])

### Past

In [None]:
plot_first_9(df_sorted[df_sorted["year"] == 2015])

Generally, the past and present images seem very similar. There are some color differences, although some of the later pictures show that both sets contain the more "orange" and "greenish" images as well. A deeper investigation of how colors are distributed in the different sets could be interesting.
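
As a possible starting point for that color investigation, the mean red, green and blue values can be computed per image and then compared between the two sets. This is only a sketch: the helper name `mean_rgb` is made up, and any dark border around the eye is included in the average.

```python
import numpy as np
from PIL import Image


def mean_rgb(img):
    """Return the mean R, G and B values (0-255) of a PIL image."""
    # convert() makes sure there are exactly three channels, dropping any alpha
    pixels = np.asarray(img.convert("RGB"), dtype=np.float64)
    return pixels.reshape(-1, 3).mean(axis=0)
```

Collecting these three numbers for a sample of images from each year and plotting them side by side would show whether one set leans more "orange" or "greenish" overall.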

## Highest Aspect Ratios

These should be the least "square" images:

In [None]:
del df_sorted
df_sorted = df_train.sort_values(by="aspect_ratio", ascending=False)

### Present

In [None]:
plot_first_9(df_sorted[df_sorted["year"] == 2019])

### Past

In [None]:
plot_first_9(df_sorted[df_sorted["year"] == 2015])

## Diagnosis Values

A look at the highest vs lowest diagnosis values/levels given in the training set. Can we spot some differences?

### Highest / Most Severe Diagnosis:

In [None]:
del df_sorted
df_sorted = df_train.sort_values(by="diagnosis", ascending=False)
df_sorted.head()

### Present

In [None]:
plot_first_9(df_sorted[df_sorted["year"] == 2019])

### Past

In [None]:
plot_first_9(df_sorted[df_sorted["year"] == 2015])

### Lowest / Healthiest Diagnosis:

In [None]:
del df_sorted
df_sorted = df_train.sort_values(by="diagnosis", ascending=True)
df_sorted.head()

### Present

In [None]:
plot_first_9(df_sorted[df_sorted["year"] == 2019])

### Past

In [None]:
plot_first_9(df_sorted[df_sorted["year"] == 2015])

I guess the healthier ones look more "clean".
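
Sorting by diagnosis and taking the first nine rows always shows the same few images for a level. A random sample per diagnosis level may give a broader impression. Below is a sketch using the `diagnosis` and `year` columns from above; the helper name `sample_for_diagnosis` and its defaults are made up.

```python
import pandas as pd


def sample_for_diagnosis(df, diagnosis, year, n=9, seed=0):
    """Return up to n random rows with the given diagnosis level and year."""
    subset = df[(df["diagnosis"] == diagnosis) & (df["year"] == year)]
    # A fixed seed keeps the same sample between notebook runs
    return subset.sample(n=min(n, len(subset)), random_state=seed)
```

The result can be passed straight to the plotting helper, for example `plot_first_9(sample_for_diagnosis(df_train, 4, 2019))`, as long as at least nine matching rows exist.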

# Final Size Statistics

On average, are the images about the same size? It would actually make sense to look at the past and present sets separately, since I had to downsize the past set significantly. But the idea is there, and it already shows whether there are some really small ones.

In [None]:
df_train.describe()
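
Because the past set was downscaled, the combined statistics mix two different size ranges. A per-year summary is sketched below, assuming the `width`, `height` and `year` columns computed earlier; the helper name `size_summary_by_year` is made up.

```python
import pandas as pd


def size_summary_by_year(df):
    """describe() statistics of width and height, one row per year."""
    return df.groupby("year")[["width", "height"]].describe()
```

In the output of `size_summary_by_year(df_train)`, the maximum width and height for 2015 should not exceed the 896 pixel limit used for the downscaled set.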

Are the smallest images (by width) still valid images?

In [None]:
df_sorted = df_train.sort_values(by="width", ascending=True)

plot_first_9(df_sorted[df_sorted["year"] == 2019])

In [None]:

plot_first_9(df_sorted[df_sorted["year"] == 2015])
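
Plotting shows that the nine narrowest images open fine, but it does not check the rest. A sketch of a validity check for any file is below, using Pillow's `verify()`, which checks the file for damage without decoding the pixel data. The helper name `is_readable_image` is made up.

```python
from PIL import Image


def is_readable_image(path):
    """Return True if Pillow can open the file and finds no damage in it."""
    try:
        with Image.open(path) as img:
            img.verify()  # raises an exception if the file looks broken
        return True
    except Exception:
        # Missing files and files that are not images also end up here
        return False
```

Applying this to every `path` plus `filename` in `df_train` and counting the `False` results would show whether any file needs to be excluded.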

# Conclusions

The images from both sets seem to be quite similar, with possibly some color differences and other minor differences.