<a href="https://colab.research.google.com/github/tselane2110/SSCLNet-Implementation/blob/main/dataset-preparation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[Link to the Dataset](https://drive.google.com/drive/folders/17iNx6mt5FTt3cxwrsUvVoEhyODNX0gyi?usp=drive_link)

# Dataset Deduplication Process

## 🎯 Objective
Remove exact (byte-identical) duplicate images from the dataset so that repeated copies of the same scan do not bias the model, and to keep the data clean.

## 🔍 Method Used: MD5 Hashing

### How It Works
1. **Read Binary Content**: Each image file is read as raw binary data
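2. **Generate Hash**: The MD5 algorithm computes a 128-bit fingerprint of the file content; identical content always yields the same hash
3. **Compare Hashes**: Files with identical hashes are treated as exact duplicates (an accidental MD5 collision between two different images is extremely unlikely)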

### Technical Process
```python
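import hashlib
from collections import defaultdict

hashes = defaultdict(list)  # maps each hash to the paths that share that content

# image_paths: placeholder for the list of every image file found in the dataset
for image_path in image_paths:
    # Step 1: Hash the raw bytes of each image
    with open(image_path, 'rb') as file:
        file_hash = hashlib.md5(file.read()).hexdigest()

    # Step 2: Group paths by hash
    hashes[file_hash].append(image_path)

# Step 3: Identify duplicates
for hash_value, paths in hashes.items():
    if len(paths) > 1:
        keep_first = paths[0]      # Preserve first occurrence
        remove_rest = paths[1:]    # Mark others for deletion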
```
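
The snippet above reads each file into memory in one call. As a minimal sketch of the same hashing step for larger files, the helper below feeds the file to MD5 in fixed-size chunks. The name `md5_of_file` and the `chunk_size` default are illustrative assumptions, not part of the project's script.

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"" (end of file)
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The digest is the same as hashing the whole file at once, so the two approaches can be mixed without changing which files count as duplicates.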

## Folder Structure Processed

```
Dataset-Brain-MRI/
├── 2-class/
│   ├── yes/
│   └── no/
└── 5-class/
    ├── Glioblastoma/
    ├── glioma_tumor/
    ├── meningioma_tumor/
    ├── no_tumor/
    └── pituitary_tumor/
```

## Rules Applied
* Keep: First occurrence of each unique image
* Delete: All subsequent exact copies
* Cross-check: Compare across all folders and subfolders
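
The rules above can be sketched as a single pass over the whole dataset tree. This is an illustration under stated assumptions, not the project's `FolderDeduplicator`: the function name `plan_deduplication`, the `IMAGE_EXTENSIONS` set, and the choice to sort paths so that the "first" copy is deterministic are all hypothetical. The function only plans the result; it does not delete anything.

```python
import hashlib
import os
from collections import defaultdict

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}  # assumed set; adjust to the dataset

def plan_deduplication(root):
    """Return (keep, delete) path lists for all images under root."""
    groups = defaultdict(list)
    # Cross-check: walk every folder and subfolder below the root
    for folder, _subdirs, files in os.walk(root):
        for name in sorted(files):
            if os.path.splitext(name)[1].lower() in IMAGE_EXTENSIONS:
                path = os.path.join(folder, name)
                with open(path, "rb") as f:
                    groups[hashlib.md5(f.read()).hexdigest()].append(path)

    keep, delete = [], []
    for paths in groups.values():
        paths.sort()              # assumption: "first" means first in sorted order
        keep.append(paths[0])     # Keep: one copy of each unique image
        delete.extend(paths[1:])  # Delete: every other exact copy
    return keep, delete
```

Separating the plan from the deletion makes it possible to review the `delete` list before any file is touched.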

## 1. Loading the dataset

In [1]:
!gdown --fuzzy "https://drive.google.com/file/d/1S2tya8E_Sn_HmBM4iKx0wCXjWe6B1mGr/view?usp=drive_link"
!unzip -q /content/Dataset-Brain-MRI.zip

Downloading...
From (original): https://drive.google.com/uc?id=1S2tya8E_Sn_HmBM4iKx0wCXjWe6B1mGr
From (redirected): https://drive.google.com/uc?id=1S2tya8E_Sn_HmBM4iKx0wCXjWe6B1mGr&confirm=t&uuid=3f3beaa4-7f4e-4435-b782-f456241c7e1b
To: /content/Dataset-Brain-MRI.zip
100% 209M/209M [00:03<00:00, 57.6MB/s]
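
When the extraction cell is re-run and the target folder already exists, `unzip` asks before overwriting each file. A minimal sketch of a non-interactive alternative uses Python's standard `zipfile` module, which overwrites existing files without prompting. The helper name `extract_archive` is an assumption made for this example.

```python
import zipfile

def extract_archive(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir, overwriting existing files."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()
```

In this notebook it could be called as `extract_archive("/content/Dataset-Brain-MRI.zip", "/content")`.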

## 2. Cloning the Git repo for this project

In [20]:
#!rm -rf /content/SSCLNet-Implementation

In [1]:
!git clone https://github.com/tselane2110/SSCLNet-Implementation

Cloning into 'SSCLNet-Implementation'...
remote: Enumerating objects: 68, done.
remote: Counting objects: 100% (68/68), done.
remote: Compressing objects: 100% (60/60), done.
remote: Total 68 (delta 35), reused 11 (delta 4), pack-reused 0 (from 0)
Receiving objects: 100% (68/68), 84.67 KiB | 4.98 MiB/s, done.
Resolving deltas: 100% (35/35), done.


In [2]:
%cd SSCLNet-Implementation

/content/SSCLNet-Implementation


In [3]:
!git pull

Already up to date.


In [4]:
import sys
sys.path.append('/content/SSCLNet-Implementation/')

In [5]:
# importing the script to deduplicate the dataset
import deduplicate_img_dataset as script

In [6]:
main_folder = "/content/Dataset-Brain-MRI/"

# First run the debug to confirm everything is correct
script.debug_folder_structure(main_folder)

# If the debug shows your folders and images correctly, then run:
print("\n" + "="*50)
deduplicator = script.FolderDeduplicator(main_folder)
duplicates = deduplicator.run_deduplication(backup=True)

=== DEBUGGING FOLDER STRUCTURE ===
Main folder exists: True
2-class folder exists: True
5-class folder exists: True
2-class subfolders: ['no', 'yes']
  no: 98 images
  yes: 155 images
5-class subfolders: ['pituitary_tumor', 'no_tumor', 'meningioma_tumor', 'glioma_tumor', 'Glioblastoma']
  pituitary_tumor: 1512 images
  no_tumor: 951 images
  meningioma_tumor: 1674 images
  glioma_tumor: 1437 images
  Glioblastoma: 936 images

Starting deduplication in: /content/Dataset-Brain-MRI/
Folder structure: i/{2-class,5-class}/*/[images]
Scanning for all images...
Found 6763 total images across all subfolders
Checking for duplicate images...
Duplicate found (3 copies):
  KEEP: /content/Dataset-Brain-MRI/2-class/no/No17.jpg
  DELETE: /content/Dataset-Brain-MRI/2-class/no/No15.jpg
  DELETE: /content/Dataset-Brain-MRI/5-class/no_tumor/Tr-no_1019.jpg
Duplicate found (2 copies):
  KEEP: /content/Dataset-Brain-MRI/2-class/no/No19.jpg
  DELETE: /content/Dataset-Brain-MRI/5-class/no_tumor/Tr-no_1022.jpg
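
After a run like the one logged above, it can help to recount the images per class folder and compare the numbers with the counts printed before deduplication. The sketch below is a hypothetical stand-in for that check, not the script's `debug_folder_structure`; the function name `count_images` and the extension list are assumptions.

```python
import os

def count_images(root, extensions=(".jpg", ".jpeg", ".png")):
    """Map each folder that contains images (path relative to root) to its image count."""
    counts = {}
    for folder, _subdirs, files in os.walk(root):
        # str.endswith accepts a tuple, so one call covers every extension
        n = sum(1 for name in files if name.lower().endswith(extensions))
        if n:
            counts[os.path.relpath(folder, root)] = n
    return counts
```

Calling `count_images("/content/Dataset-Brain-MRI/")` returns a dictionary keyed by paths such as `2-class/no`.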

## 3. Zipping the folders again

In [10]:
!zip -r /content/deduplicated_dataset.zip /content/Dataset-Brain-MRI

Streaming output truncated to the last 5000 lines.
  adding: content/Dataset-Brain-MRI/5-class/no_tumor/Tr-no_0443.jpg (deflated 2%)
  adding: content/Dataset-Brain-MRI/5-class/no_tumor/2.jpg (deflated 0%)
  adding: content/Dataset-Brain-MRI/5-class/no_tumor/image(313).jpg (deflated 5%)
  adding: content/Dataset-Brain-MRI/5-class/no_tumor/image(43).jpg (deflated 3%)
  adding: content/Dataset-Brain-MRI/5-class/no_tumor/Tr-no_1475.jpg (deflated 2%)
  adding: content/Dataset-Brain-MRI/5-class/no_tumor/image(322).jpg (deflated 9%)
  adding: content/Dataset-Brain-MRI/5-class/no_tumor/Tr-no_1256.jpg (deflated 4%)
  adding: content/Dataset-Brain-MRI/5-class/no_tumor/Tr-no_0690.jpg (stored 0%)
  adding: content/Dataset-Brain-MRI/5-class/no_tumor/Tr-no_0959.jpg (deflated 1%)
  adding: content/Dataset-Brain-MRI/5-class/no_tumor/image(285).jpg (deflated 2%)
  adding: content/Dataset-Brain-MRI/5-class/no_tumor/image(421).jpg (deflated 17%)
  adding: content/Dataset-Brain-MRI/5-class/
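
Because `zip -r` is given an absolute path, every entry in the archive is stored under `content/Dataset-Brain-MRI/...`, as the log shows. If an archive rooted at the dataset folder itself is preferred, a minimal sketch with the standard library's `shutil.make_archive` looks like this. The helper name `zip_folder` is an assumption for illustration.

```python
import os
import shutil

def zip_folder(folder, archive_base):
    """Zip `folder` so archive entries start at its own name; return the .zip path."""
    folder = os.path.abspath(folder)  # also strips any trailing slash
    return shutil.make_archive(
        archive_base,                       # output path without the ".zip" suffix
        "zip",
        root_dir=os.path.dirname(folder),   # directory the archive paths are relative to
        base_dir=os.path.basename(folder),  # only this folder is added
    )
```

For example, `zip_folder("/content/Dataset-Brain-MRI", "/content/deduplicated_dataset")` would write `/content/deduplicated_dataset.zip` with entries that begin with `Dataset-Brain-MRI/`.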

## 4. Uploading the archive to Google Drive

In [11]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [12]:
path = ""  # destination folder on the mounted Drive; set this before running
!mv /content/deduplicated_dataset.zip "$path"
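
The move can also be done in plain Python, which makes it easy to fail early when the destination folder does not exist (for example, if Drive is not mounted). This is a sketch only: `move_to_folder` is a hypothetical helper, and the destination must be supplied by the user.

```python
import os
import shutil

def move_to_folder(src_file, dest_dir):
    """Move src_file into dest_dir and return the new path."""
    if not os.path.isdir(dest_dir):
        raise FileNotFoundError(f"Destination folder not found: {dest_dir!r}")
    return shutil.move(src_file, os.path.join(dest_dir, os.path.basename(src_file)))
```

With the variable from the cell above, the call would be `move_to_folder("/content/deduplicated_dataset.zip", path)`.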