# Images DQA  
In this notebook we evaluate the quality of the images in the dataset. For the statistical analysis we only look at the image count and compare it with the figures on the [Kaggle page](https://www.kaggle.com/c/cassava-leaf-disease-classification/data):
- Training images: 55.5k
- Validation images: 13.9k


Additionally we will have a look at the quality of the images themselves. For this, we will use the [cleanvision Imagelab library](https://github.com/cleanlab/cleanvision).  
With this library we can easily check for the following issues:
- Duplicates
- Near duplicates
- Blurry images
- Low information
- Dark images
- Light images
- Grayscale images
- Odd aspect ratio
- Odd image size

## 1 - Setup

In [None]:
from os import listdir
from os.path import isfile, join

from cleanvision import Imagelab

from matplotlib.pyplot import subplots
from matplotlib.image import imread

import pandas as pd

In [None]:
BASE_PATH = "../data/"
TRAIN_IMAGES = BASE_PATH + "train_images/"
VAL_IMAGES = BASE_PATH + "test_images/"

### 1.1 - Helper Functions

In [None]:
def parse_id_from_path(image_path):
    return image_path.split("/")[-1].split(".")[0]

def pretty_print(ids):
    return "\n".join(ids)

## 2 - Data Overview

For both the training and the validation images, we walk through the folder and count the files.

In [None]:
for folder in [TRAIN_IMAGES, VAL_IMAGES]:
    files = [f for f in listdir(folder) if isfile(join(folder, f))]
    print(f"Number of files in {folder}: {len(files)}")

These are exactly the numbers we found on Kaggle.
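The comparison with the Kaggle figures can also be written as an explicit check instead of reading the printed output. The following is a minimal sketch, assuming the Kaggle page quotes counts in thousands rounded to one decimal; `count_files` and `matches_rounded` are hypothetical helpers and not part of the original notebook.

```python
from pathlib import Path


def count_files(folder):
    """Count regular files directly inside `folder` (subfolders are ignored)."""
    return sum(1 for entry in Path(folder).iterdir() if entry.is_file())


def matches_rounded(actual, expected_thousands):
    """Return True if `actual` rounds to the figure quoted in thousands.

    Assumption: the quoted figure is rounded to one decimal place (e.g. 55.5k).
    """
    return round(actual / 1000, 1) == expected_thousands
```

In the notebook this could be used as `assert matches_rounded(count_files(TRAIN_IMAGES), 55.5)`, so that a changed or incomplete download fails loudly.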

## 3 - Image quality: train dataset

### 3.1 - Load images and analyze

In [None]:
imagelab = Imagelab(TRAIN_IMAGES)
imagelab.find_issues()

### 3.2 - Image sizes
First, let's have a look at the distribution of image sizes.

In [None]:
imagelab.info["statistics"]["size"]

Each and every image has a size of 512x512 pixels, therefore no cleaning has to be done based on the image size.

### 3.3 - Report
Imagelab tells us that there are 84 issues in our training data, so we check the Imagelab report:

In [None]:
imagelab.report()

### 3.4 - Issue analysis
#### 3.4.1 - Duplicates  
The report states that there are 50 duplicates in the training data. We will have a look at the duplicates.

In [None]:
for image1, image2 in imagelab.info["exact_duplicates"]["sets"]:
    fig, ax = subplots(1, 2)
    ax[0].imshow(imread(image1))
    ax[0].set_title(image1.split("/")[-1])
    ax[1].imshow(imread(image2))
    ax[1].set_title(image2.split("/")[-1])
    fig.show()

As we can see in the plots above, there are really only 25 duplicate pairs: the report counts both images of each pair, which gives 50. Considering that the training set has 55.5k entries, we simply remove one image of each pair from the tabular data.

In [None]:
ids = [parse_id_from_path(image2) for _, image2 in imagelab.info["exact_duplicates"]["sets"]]

print(f"Number of duplicates to remove: {len(ids)}:\n{pretty_print(sorted(ids, reverse=True))}")
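The unpacking `for _, image2 in ...` above assumes that every duplicate set contains exactly two paths. A hedged sketch of a variant that works for sets of any size is shown below. It assumes that `imagelab.info["exact_duplicates"]["sets"]` is a list of lists of file paths, as the loop above suggests; `ids_to_drop` is a hypothetical helper. Note that it keeps the alphabetically first path of each set, which may differ from the image kept by the cell above.

```python
def parse_id_from_path(image_path):
    """Same helper as in section 1.1, repeated to keep the sketch self-contained."""
    return image_path.split("/")[-1].split(".")[0]


def ids_to_drop(duplicate_sets):
    """Keep one image per duplicate set and return the ids of all others."""
    drop = []
    for group in duplicate_sets:
        # Sorting makes the choice of the kept image deterministic.
        redundant = sorted(group)[1:]
        drop.extend(parse_id_from_path(path) for path in redundant)
    return drop


example_sets = [
    ["../data/train_images/b.jpg", "../data/train_images/a.jpg"],
    ["../data/train_images/c.jpg", "../data/train_images/d.jpg", "../data/train_images/e.jpg"],
]
print(ids_to_drop(example_sets))  # ['b', 'd', 'e']
```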

#### 3.4.2 - Other issues
The report states that there are 34 other issues. Considering that the near duplicates are counted twice, this leaves us with 26 issues. Given the 55.5k training images, we're not going to analyze them individually, but rather remove them altogether.

In [None]:
ids += [parse_id_from_path(image2) for _, image2 in imagelab.info["near_duplicates"]["sets"]]

ids += [parse_id_from_path(image) for image in imagelab.issues[imagelab.issues["is_blurry_issue"] == True].index.tolist()]

ids += [parse_id_from_path(image) for image in imagelab.issues[imagelab.issues["is_dark_issue"] == True].index.tolist()]

ids += [parse_id_from_path(image) for image in imagelab.issues[imagelab.issues["is_low_information_issue"] == True].index.tolist()]

ids += [parse_id_from_path(image) for image in imagelab.issues[imagelab.issues["is_light_issue"] == True].index.tolist()]
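The four nearly identical lines above can be folded into a loop over the issue names. This is only a sketch: `flagged_ids` is a hypothetical helper, and it assumes that the issues table is a DataFrame indexed by image path with boolean `is_<name>_issue` columns, as used in the cell above. The toy DataFrame stands in for `imagelab.issues`.

```python
import pandas as pd


def flagged_ids(issues, issue_names):
    """Collect the ids of all images flagged for any of the given issue names."""
    ids = []
    for name in issue_names:
        column = f"is_{name}_issue"
        # Select the rows where the flag is set and read the paths from the index.
        paths = issues.loc[issues[column].astype(bool)].index
        ids.extend(path.split("/")[-1].split(".")[0] for path in paths)
    return ids


# Toy stand-in for imagelab.issues
toy_issues = pd.DataFrame(
    {"is_blurry_issue": [True, False, False], "is_dark_issue": [False, False, True]},
    index=["../data/train_images/1.jpg", "../data/train_images/2.jpg", "../data/train_images/3.jpg"],
)
print(flagged_ids(toy_issues, ["blurry", "dark"]))  # ['1', '3']
```

An image flagged for several issues appears more than once in the result, which is why the ids are deduplicated before saving.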

To remove duplicate ids, we cast the list to a set and then save the result to a CSV file, which is imported in the tabular data DQA.

In [None]:
print(len(ids))
pd.DataFrame({"id": sorted(list(set(ids)), reverse=True)}).to_csv(BASE_PATH + "train_ids_to_remove.csv")
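For orientation, the following sketch shows how the saved file might be consumed on the tabular side. The in-memory CSV strings stand in for the real files, and the label table with its `image_id` and `label` columns is an assumption made for illustration. Because `to_csv` is called without `index=False` above, the file contains an extra unnamed index column, which is read back with `index_col=0`. The ids are read as strings so that numeric-looking ids are not converted to integers.

```python
import io

import pandas as pd

# Stand-in for train_ids_to_remove.csv as written by the cell above.
ids_csv = io.StringIO(",id\n0,3\n1,1\n")
# Hypothetical label table; the column names are assumptions.
labels_csv = io.StringIO("image_id,label\n1.jpg,0\n2.jpg,3\n3.jpg,1\n")

to_remove = pd.read_csv(ids_csv, index_col=0, dtype={"id": str})
labels = pd.read_csv(labels_csv)

# Strip the file extension so that the ids are comparable.
stems = labels["image_id"].str.split(".").str[0]
cleaned = labels[~stems.isin(to_remove["id"])].reset_index(drop=True)
print(cleaned)
```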

## 4 - Image quality: validation dataset

### 4.1 - Load images and analyze

In [None]:
imagelab = Imagelab(VAL_IMAGES)
imagelab.find_issues()

### 4.2 - Image sizes
Again, let's have a look at the distribution of image sizes.

In [None]:
imagelab.info["statistics"]["size"]

As with the training data, all images in the validation dataset have the same size of 512x512 pixels.

### 4.3 - Issue analysis
Compared to the training dataset, there are even fewer issues. We're not going to analyze them further and just remove them from the dataset.

In [None]:
imagelab.issue_summary

In [None]:
ids = [parse_id_from_path(image2) for _, image2 in imagelab.info["exact_duplicates"]["sets"]]

ids += [parse_id_from_path(image2) for _, image2 in imagelab.info["near_duplicates"]["sets"]]

ids += [parse_id_from_path(image) for image in imagelab.issues[imagelab.issues["is_blurry_issue"] == True].index.tolist()]

ids += [parse_id_from_path(image) for image in imagelab.issues[imagelab.issues["is_dark_issue"] == True].index.tolist()]

As with the training dataset, we're casting the list to a set to remove duplicate ids and then saving them to a CSV file.

In [None]:
print(len(ids))
pd.DataFrame({"id": sorted(list(set(ids)), reverse=True)}).to_csv(BASE_PATH + "val_ids_to_remove.csv")