## This assignment is designed for automated pathology detection in medical images in a realistic setup, i.e. each image may have multiple pathologies/disorders.
### The goal, for you as an MLE, is to design models and methods to predictively detect pathological images and explain the pathology sites in the image data.

## Data for this assignment is taken from a Kaggle contest: https://www.kaggle.com/c/vietai-advance-course-retinal-disease-detection/overview
Explanation of the data set:
The training data set contains 3435 retinal images that represent multiple pathological disorders. The pathology classes and their labels are included in the 'train.csv' file, and each image can have more than one class (multiple pathologies).
The classes and their label indices are:

```
- opacity (0)
- diabetic retinopathy (1)
- glaucoma (2)
- macular edema (3)
- macular degeneration (4)
- retinal vascular occlusion (5)
- normal (6)
```
The test data set contains 350 unlabelled images.
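Because an image can carry several pathologies, each row of labels is a 0/1 vector over the seven classes rather than a single class index. The following is a minimal sketch of converting between such vectors and class names; the helper names `decode_labels` and `encode_labels` are illustrative and not part of the assignment.

```python
import numpy as np

# Class order taken from the list above (index -> name).
CLASS_NAMES = [
    "opacity",
    "diabetic retinopathy",
    "glaucoma",
    "macular edema",
    "macular degeneration",
    "retinal vascular occlusion",
    "normal",
]

def decode_labels(multi_hot):
    """Return the class names whose entry in a 0/1 label vector is set."""
    multi_hot = np.asarray(multi_hot)
    return [CLASS_NAMES[i] for i in np.flatnonzero(multi_hot)]

def encode_labels(names):
    """Inverse of decode_labels: build a 0/1 vector from class names."""
    vec = np.zeros(len(CLASS_NAMES), dtype=np.int64)
    for name in names:
        vec[CLASS_NAMES.index(name)] = 1
    return vec

# Illustrative row: an image showing both opacity and macular degeneration.
example = [1, 0, 0, 0, 1, 0, 0]
print(decode_labels(example))  # → ['opacity', 'macular degeneration']
```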

# For this assignment, you are working with specialists for Diabetic Retinopathy and Glaucoma only, and your client is interested in a predictive learning model along with feature explainability and self-learning for Diabetic Retinopathy and Glaucoma vs. Normal images.
# Design models and methods for the following tasks. Each task should be accompanied by code, plots/images (if applicable), tables (if applicable) and text:
## Task 1: Build a classification model for Diabetic Retinopathy and Glaucoma vs. Normal images. You may consider multi-class classification vs. one-vs-all classification. Clearly state your choice and share details of your model, parameters and hyper-parameter tuning process. (60 points)
```
a. Perform 70/30 data split and report performance scores on the test data set.
b. You can choose to apply any data augmentation strategy. 
Explain your methods and rationale behind parameter selection.
c. Show training-validation curves to demonstrate that overfitting and underfitting are avoided.
```
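One way to approach step (a) is a stratified split, so that each class keeps roughly the same 70/30 proportion in both parts. The sketch below uses only NumPy; `stratified_split` is a hypothetical helper, and it assumes each image has already been reduced to a single integer class (for example 0 = diabetic retinopathy, 1 = glaucoma, 2 = normal).

```python
import numpy as np

def stratified_split(targets, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx) with about `test_fraction` of each class in test."""
    targets = np.asarray(targets)
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for cls in np.unique(targets):
        # Shuffle the indices of this class, then cut off the test share.
        idx = rng.permutation(np.flatnonzero(targets == cls))
        n_test = int(round(test_fraction * len(idx)))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.sort(np.array(train_idx, dtype=int)), np.sort(np.array(test_idx, dtype=int))

# Illustrative class vector with an imbalanced distribution.
targets = np.array([0] * 50 + [1] * 30 + [2] * 20)
train_idx, test_idx = stratified_split(targets)
print(len(train_idx), len(test_idx))
```

If scikit-learn is available, `sklearn.model_selection.train_test_split` with `test_size=0.3` and the `stratify` argument gives a comparable split.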
## Task 2: Visualize the heatmap/saliency/features using any method of your choice to demonstrate what regions of interest contribute to Diabetic Retinopathy and Glaucoma, respectively. (25 points)
```
Submit images/folder of images with heatmaps/features aligned on top of the images, or corresponding bounding boxes, and report what regions of interest in your opinion represent the pathological sites.
```
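Since any visualization method is allowed, one model-agnostic option is occlusion sensitivity: mask one patch of the image at a time and record how much the class score drops. The sketch below is NumPy-only; `occlusion_map` and `toy_score` are illustrative names, and `toy_score` merely stands in for a trained classifier.

```python
import numpy as np

def occlusion_map(image, score_fn, patch=16, stride=16, fill=0.0):
    """Occlusion sensitivity: score drop observed when each patch is masked.

    image    : H x W x C float array
    score_fn : callable mapping an image to a scalar class score
    """
    h, w = image.shape[:2]
    base = score_fn(image)
    heat = np.zeros((h, w), dtype=np.float32)
    hits = np.zeros((h, w), dtype=np.float32)
    for top in range(0, h, stride):
        for left in range(0, w, stride):
            occluded = image.copy()
            occluded[top:top + patch, left:left + patch] = fill
            drop = base - score_fn(occluded)
            # Accumulate the drop over every pixel covered by this patch.
            heat[top:top + patch, left:left + patch] += drop
            hits[top:top + patch, left:left + patch] += 1
    # Average where patches overlap (only relevant when stride < patch).
    return heat / np.maximum(hits, 1)

# Toy stand-in for a model: the "score" is the mean brightness of one region.
def toy_score(img):
    return float(img[16:32, 16:32].mean())

img = np.ones((64, 64, 3), dtype=np.float32)
heat = occlusion_map(img, toy_score)
print(heat.shape, heat.max())
```

With a Keras classifier, `score_fn` could be something like `lambda x: float(model.predict(x[None], verbose=0)[0, class_index])`, where `model` and `class_index` are assumed to exist. The resulting map can then be resized and overlaid on the retinal image.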

## Task 3: Using the unlabelled data set in the 'test' folder, augment the training data (semi-supervised learning) and report the variation in classification performance on the test data set. (15 points)
[You may use any method of your choice, one possible way is mentioned below.] 

```
Hint: 
a. Train a model using the 'train' split.
b. Pass the unlabelled images through the trained model and retrieve the dense-layer features prior to the classification layer. Using these features as a representation of each image, apply label propagation to obtain labels for the unlabelled data.
c. Next, concatenate the train data with the unlabelled data (that has now been self labelled) and retrain the network.
d. Report classification performance on test data
Use the unlabelled test data to improve classification performance by using a semi-supervised label-propagation/self-labelling approach. (20 points)
```
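Step (b) of the hint can be sketched as iterative label propagation over a similarity graph built from the dense-layer features. The version below is a minimal NumPy sketch, not a definitive implementation: `propagate_labels`, the RBF width `sigma` and the iteration count are assumptions chosen for illustration, and the small 2-D points stand in for real feature vectors.

```python
import numpy as np

def propagate_labels(features, labels, n_classes, sigma=1.0, n_iter=50):
    """Iterative label propagation over an RBF affinity graph.

    features : (n, d) array, e.g. dense-layer activations
    labels   : (n,) int array, with -1 marking unlabelled rows
    Returns the predicted class index for every row.
    """
    features = np.asarray(features, dtype=np.float64)
    labels = np.asarray(labels)
    labelled = labels >= 0

    # Pairwise squared distances -> RBF affinities -> row-normalised transitions.
    norms = (features ** 2).sum(axis=1)
    sq = np.maximum(norms[:, None] + norms[None, :] - 2 * features @ features.T, 0.0)
    W = np.exp(-sq / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    T = W / W.sum(axis=1, keepdims=True)

    # One-hot rows for labelled points, uniform rows for unlabelled ones.
    Y = np.full((len(labels), n_classes), 1.0 / n_classes)
    Y[labelled] = np.eye(n_classes)[labels[labelled]]
    clamp = Y[labelled].copy()

    for _ in range(n_iter):
        Y = T @ Y
        Y[labelled] = clamp  # keep the known labels fixed
    return Y.argmax(axis=1)

# Two well-separated clusters with one labelled point each.
feats = np.array([[0, 0], [5, 5], [0.2, 0.1], [4.8, 5.1], [0.1, -0.2], [5.2, 4.9]])
labs = np.array([0, 1, -1, -1, -1, -1])
pred = propagate_labels(feats, labs, n_classes=2)
print(pred)
```

scikit-learn offers a library version of this idea in `sklearn.semi_supervised.LabelPropagation`, which also uses -1 to mark unlabelled samples.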

In [None]:
# importing the required modules
import pandas as pd
import numpy as np
import cv2
import PIL.Image
import os

import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

In [None]:
# find path for the data
data_path = '/content/drive/MyDrive/4BAI/Week 8/Mid Term/Data'

### Process the Training Data

In [None]:
# read the training labels
train_labels = pd.read_csv(os.path.join(data_path, 'train/train.csv'))
train_labels.head()

Unnamed: 0,filename,opacity,diabetic retinopathy,glaucoma,macular edema,macular degeneration,retinal vascular occlusion,normal
0,c24a1b14d253.jpg,0,0,0,0,0,1,0
1,9ee905a41651.jpg,0,0,0,0,0,1,0
2,3f58d128caf6.jpg,0,0,1,0,0,0,0
3,4ce6599e7b20.jpg,1,0,0,0,1,0,0
4,0def470360e4.jpg,1,0,0,0,1,0,0
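
Since the client is only interested in Diabetic Retinopathy and Glaucoma vs. Normal, the label table can be reduced to those three columns before modelling. The sketch below shows one possible reduction; `select_three_class` is a hypothetical helper, and the decision to keep only rows with exactly one of the three target labels set (dropping, for instance, images marked with both DR and glaucoma) is an assumption, not a requirement of the assignment.

```python
import pandas as pd

# Column names as they appear in the train.csv preview above.
TARGET_COLS = ["diabetic retinopathy", "glaucoma", "normal"]

def select_three_class(labels: pd.DataFrame) -> pd.DataFrame:
    """Keep rows with exactly one target label set and add an integer
    `target` column (0 = DR, 1 = glaucoma, 2 = normal)."""
    subset = labels[labels[TARGET_COLS].sum(axis=1) == 1].copy()
    subset["target"] = subset[TARGET_COLS].to_numpy().argmax(axis=1)
    return subset[["filename", "target"]].reset_index(drop=True)

# Tiny stand-in frame; in the notebook this would be `train_labels`.
demo = pd.DataFrame(
    {
        "filename": ["a.jpg", "b.jpg", "c.jpg", "d.jpg"],
        "opacity": [0, 0, 1, 0],
        "diabetic retinopathy": [1, 0, 0, 1],
        "glaucoma": [0, 1, 0, 1],
        "normal": [0, 0, 0, 0],
    }
)
three_class = select_three_class(demo)
print(three_class)
```

Keeping the multi-label form and training one sigmoid output per class is an equally valid alternative when overlapping diagnoses should not be discarded.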


### Create Data Pipelines

In [None]:
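# load an image as a float32 numpy array scaled to [0, 1]
def img_to_np(img_path):
  # cv2.imread returns a BGR array, or None if the file cannot be read
  img_temp = cv2.imread(img_path, cv2.IMREAD_COLOR)
  if img_temp is None:
    raise FileNotFoundError(f'Could not read image: {img_path}')
  img_temp = (img_temp/255.0).astype(np.float32)
  return img_temp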

In [None]:
# define image size and channels
IMAGE_SIZE = 512
CHANNELS = 3

In [None]:
# training image dir
train_img_dir = os.path.join(data_path, 'train/train')

# testing image dir
test_img_dir = os.path.join(data_path, 'test/test')

In [None]:
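# temp code to check image loading and normalization
# note: this loads every image in img_dir into memory; np.array only yields
# a regular 4-D array if all images share the same shape
def get_img_iterator(img_dir):
  images_list = []
  # sort so the image order is reproducible across runs
  temp_file_list = sorted(os.listdir(img_dir))
  for item in temp_file_list:
    images_list.append(img_to_np(os.path.join(img_dir, item)))

  return np.array(images_list)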

In [None]:
train_np = get_img_iterator(train_img_dir)

In [None]:
train_np.shape

In [None]:
# using tf.data to iterate over the images (labels still need to be matched to images by filename)
train_datagen = tf.data.Dataset.from_tensor_slices(train_np).as_numpy_iterator()