# 2. Data Cleaning

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from pathlib import Path
import seaborn as sns
import plotly.express as px
import os
%matplotlib inline

#### Corrupt image detection

Detecting corrupted images is crucial because they can degrade the performance of a machine learning model or computer vision system. Corrupted images may contain noise, artifacts, or other anomalies that lead to misclassifications or output errors, and they can bias the model, producing erroneous or unfair results. Detecting and removing them from the dataset improves the model's accuracy and reliability.

In [2]:
from pathlib import Path
import imageio.v2 as imageio

corrupted_image = []  # paths of images that could not be decoded
dataset_path = "Flowers/Flowers"
accu = 0  # number of image files checked

for root, dirs, files in os.walk(dataset_path):
    for name in dirs:
        print(os.path.join(root, name))
        for image_file in Path(os.path.join(root, name)).glob('*.jpg'):
            accu = accu + 1
            try:
                # Decoding the whole file raises an exception if the image is unreadable
                image = imageio.imread(image_file)
            except Exception:
                print(f'Cannot read image {image_file}')
                corrupted_image.append(image_file)
print("Total number of images : ", accu)

Flowers/Flowers\Babi
Flowers/Flowers\Calimerio
Flowers/Flowers\Chrysanthemum
Flowers/Flowers\Hydrangeas
Flowers/Flowers\Lisianthus
Flowers/Flowers\Pingpong
Flowers/Flowers\Rosy
Flowers/Flowers\Tana
Total number of images :  0


In [3]:
len(corrupted_image)

0
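As a lightweight complement to fully decoding each file, a truncated download can often be spotted from the JPEG markers alone. The sketch below is a heuristic under stated assumptions: the helper names `looks_like_complete_jpeg` and `find_suspect_jpegs` are hypothetical, and the check only confirms that a file starts with the start-of-image marker (`FF D8`) and ends with the end-of-image marker (`FF D9`). Some valid files carry trailing bytes after the end marker, so a failure is a reason to inspect a file, not proof of corruption.

```python
from pathlib import Path

JPEG_SOI = b"\xff\xd8"  # start-of-image marker
JPEG_EOI = b"\xff\xd9"  # end-of-image marker


def looks_like_complete_jpeg(path):
    """Return True if the file starts with SOI and ends with EOI.

    Heuristic only: the image data between the markers is not validated.
    """
    data = Path(path).read_bytes()
    return len(data) >= 4 and data.startswith(JPEG_SOI) and data.endswith(JPEG_EOI)


def find_suspect_jpegs(dataset_path):
    """List .jpg files under dataset_path that fail the marker check."""
    return sorted(
        p for p in Path(dataset_path).rglob("*.jpg")
        if not looks_like_complete_jpeg(p)
    )
```

Calling `find_suspect_jpegs("Flowers/Flowers")` would return the candidate files without modifying the dataset, so they can be reviewed before anything is deleted.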

#### Duplicate image detection

Duplicate image detection is essential for multiple reasons:

* <b>Reducing storage</b>: Storing duplicate images wastes storage space, and detecting and removing them can help reduce storage costs.

* <b>Improving efficiency</b>: Processing or analyzing duplicate images is inefficient and time-consuming. Removing duplicates can improve processing and analysis efficiency.

* <b>Enhancing accuracy</b>: Duplicate images can bias the results of image-based analysis, such as object detection or image classification. Removing duplicates can improve the accuracy of these analyses.

* <b>Maintaining data integrity</b>: Duplicates can lead to confusion and inconsistency in data, especially when dealing with large image datasets. Removing duplicates helps to maintain data integrity and consistency.
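Before deleting anything, it can help to list duplicates in a dry run. The following is a minimal sketch using only the standard library; the function names `file_digest` and `find_exact_duplicates` are hypothetical. It groups files whose bytes are identical via SHA-256, so unlike a perceptual hash it will not match resized or re-encoded copies of the same picture.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_exact_duplicates(dataset_path, pattern="*.jpg"):
    """Group byte-identical files; return only groups with two or more members.

    Nothing is deleted: the caller decides what to do with each group.
    """
    groups = defaultdict(list)
    for path in sorted(Path(dataset_path).rglob(pattern)):
        groups[file_digest(path)].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```

Each returned group lists the paths that share the same content, which makes it easy to check whether duplicates sit in the same class folder or across different classes.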

In [4]:
from PIL import Image
import imagehash
import glob

# Define a function to compute the perceptual hash of an image file
def compute_hash(filepath):
    with Image.open(filepath) as img:
        return str(imagehash.phash(img))

# Define a function to find and remove duplicated images
def remove_duplicates(rootdir):
    hashes = {}      # perceptual hash -> first file name seen with that hash
    duplicated = []  # file names removed as duplicates
    # Expand the pattern into the list of class folders
    for folder in glob.glob(rootdir):
        print(folder)
        for image_dir in glob.glob(os.path.join(folder, '*.jpg')):
            # Compute the hash of the image file
            file_hash = compute_hash(image_dir)
            file = os.path.basename(image_dir)
            # Check if this hash has already been seen
            if file_hash in hashes:
                # This file is a duplicate, so remove it
                os.remove(image_dir)
                print(f'Removed duplicate file: {file}')
                duplicated.append(file)
            else:
                # This file is not a duplicate, so remember its hash
                hashes[file_hash] = file
    return duplicated

# Usage: specify the root directory to search for duplicates
duplicated = remove_duplicates('Flowers/Flowers/*')

<generator object Path.glob at 0x000001F9AE398430>
<generator object Path.glob at 0x000001F9AE398430>
<generator object Path.glob at 0x000001F9AE398430>
<generator object Path.glob at 0x000001F9AE398430>
<generator object Path.glob at 0x000001F9AE398430>
<generator object Path.glob at 0x000001F9AE398430>
<generator object Path.glob at 0x000001F9AE398430>
<generator object Path.glob at 0x000001F9AE398430>
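The function above treats two images as duplicates only when their perceptual hashes are identical. A looser criterion compares hashes by Hamming distance, the number of bits in which they differ. The sketch below assumes the hashes are the equal-length hexadecimal strings produced by `str(imagehash.phash(img))`; the helper names and the `max_distance` default are hypothetical and would need tuning on the actual dataset.

```python
def hamming_distance(hash_a, hash_b):
    """Count the differing bits between two equal-length hex hash strings."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


def is_near_duplicate(hash_a, hash_b, max_distance=5):
    # max_distance is an illustrative threshold, not a recommended value
    return hamming_distance(hash_a, hash_b) <= max_distance
```

A distance of 0 reproduces the exact-match behaviour; larger thresholds catch more resized or recompressed copies at the cost of more false positives.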
