<a href="https://colab.research.google.com/github/tselane2110/Brain-Tumor-Detection-And-Classification/blob/main/dataset-preparation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[Link to the Dataset](https://drive.google.com/drive/folders/17iNx6mt5FTt3cxwrsUvVoEhyODNX0gyi?usp=drive_link)

# Dataset Deduplication Process

## Objective
Remove exact (byte-identical) duplicate images from the dataset, so that repeated samples do not bias the model and data quality is preserved.

## Method Used: MD5 Hashing

### How It Works
1. **Read Binary Content**: Each image file is read as raw binary data
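2. **Generate Hash**: The MD5 algorithm computes a 128-bit fingerprint from the file content; identical content always produces the same hash
3. **Compare Hashes**: Files with identical hashes are treated as exact duplicates (an accidental MD5 collision between two different images is extremely unlikely)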

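The steps above compare raw bytes, so only byte-identical files match; an image that was re-saved or resized would get a different hash. A minimal sketch of this behaviour, assuming a hypothetical helper `file_md5` (not part of the project's script) that reads each file in chunks, with stand-in byte strings in place of real MRI images:

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in 1 MiB chunks (hypothetical helper)."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_bytes(folder, name, data):
    """Write stand-in bytes to folder/name and return the path."""
    path = os.path.join(folder, name)
    with open(path, "wb") as fh:
        fh.write(data)
    return path


with tempfile.TemporaryDirectory() as tmp:
    original = write_bytes(tmp, "a.jpg", b"same bytes")
    exact_copy = write_bytes(tmp, "copy_of_a.jpg", b"same bytes")
    altered = write_bytes(tmp, "c.jpg", b"same bytes!")  # one extra byte

    same = file_md5(original) == file_md5(exact_copy)
    different = file_md5(original) != file_md5(altered)

print(same, different)
```

The file name plays no role in the comparison; only the content does.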
### Technical Process
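```python
import hashlib
from collections import defaultdict

image_paths = []            # fill with the paths of the image files to check
hashes = defaultdict(list)  # maps each hash to the paths that share it

for image_path in image_paths:
    # Step 1: Generate a hash from the raw bytes of each image
    with open(image_path, 'rb') as file:
        file_hash = hashlib.md5(file.read()).hexdigest()

    # Step 2: Store hashes in dictionary
    hashes[file_hash].append(image_path)

# Step 3: Identify duplicates
for hash_value, paths in hashes.items():
    if len(paths) > 1:
        keep_first = paths[0]      # Preserve first occurrence
        remove_rest = paths[1:]    # Mark others for deletion
```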

## Folder Structure Processed

```
Dataset-Brain-MRI/
├── 2-class/
│   ├── yes/
│   └── no/
└── 4-class/
    ├── glioma_tumor/
    ├── meningioma_tumor/
    ├── no_tumor/
    └── pituitary_tumor/
```

## Rules Applied
* Keep: First occurrence of each unique image
* Delete: All subsequent exact copies
* Cross-check: Compare across all folders and subfolders

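The rules above can be sketched end to end as follows. This is an illustration under stated assumptions, not the repository's `deduplicate_img_dataset` script: the function name `find_duplicates`, the extension list, and the sorted traversal order (which decides which copy counts as the "first occurrence") are all choices made for this example.

```python
import hashlib
import os
import tempfile
from collections import defaultdict

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}  # assumed set of image extensions


def find_duplicates(root):
    """Map each kept file to the byte-identical copies found later in the walk."""
    by_hash = defaultdict(list)
    for folder, subfolders, files in os.walk(root):
        subfolders.sort()  # make the traversal order deterministic
        for name in sorted(files):
            if os.path.splitext(name)[1].lower() in IMAGE_EXTENSIONS:
                path = os.path.join(folder, name)
                with open(path, "rb") as fh:
                    by_hash[hashlib.md5(fh.read()).hexdigest()].append(path)
    return {paths[0]: paths[1:] for paths in by_hash.values() if len(paths) > 1}


# Tiny stand-in dataset: the same bytes appear in two different class folders.
samples = [
    (("2-class", "no", "a.jpg"), b"x"),
    (("4-class", "no_tumor", "b.jpg"), b"x"),
    (("4-class", "glioma_tumor", "c.jpg"), b"y"),
]

with tempfile.TemporaryDirectory() as root:
    for parts, data in samples:
        path = os.path.join(root, *parts)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as fh:
            fh.write(data)

    duplicates = find_duplicates(root)
    report = {
        os.path.relpath(keep, root): [os.path.relpath(p, root) for p in extras]
        for keep, extras in duplicates.items()
    }
    for extras in duplicates.values():
        for extra in extras:
            os.remove(extra)  # delete every copy after the first occurrence
    remaining = sum(len(files) for _, _, files in os.walk(root))

print(report, remaining)
```

Because all files are hashed into one dictionary, a copy in `4-class/` is detected even when the kept original lives in `2-class/`.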
### 1. Loading the dataset

In [1]:
!gdown --fuzzy "https://drive.google.com/file/d/1Li4wCGDheUZ41dHr_9D5-wDemHw6BvYt/view?usp=drive_link"
!unzip -q Dataset-Brain-MRI.zip

Downloading...
From (original): https://drive.google.com/uc?id=1Li4wCGDheUZ41dHr_9D5-wDemHw6BvYt
From (redirected): https://drive.google.com/uc?id=1Li4wCGDheUZ41dHr_9D5-wDemHw6BvYt&confirm=t&uuid=0326b7cb-459c-4f92-ab56-1c6137aa78fb
To: /content/Dataset-Brain-MRI.zip
100% 429M/429M [00:03<00:00, 138MB/s]


In [None]:
# Count the files in each class folder (the archive unzips into /content/content/)

!for d in /content/content/Dataset-Brain-MRI/*/*/ ; do \
    echo "$d" $(find "$d" -type f | wc -l) "files"; \
done

/content/content/Dataset-Brain-MRI/2-class/no/ 2993 files
/content/content/Dataset-Brain-MRI/2-class/yes/ 3039 files
/content/content/Dataset-Brain-MRI/4-class/glioma_tumor/ 2373 files
/content/content/Dataset-Brain-MRI/4-class/meningioma_tumor/ 1674 files
/content/content/Dataset-Brain-MRI/4-class/no_tumor/ 951 files
/content/content/Dataset-Brain-MRI/4-class/pituitary_tumor/ 1512 files

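The same per-folder count can be done in Python, which makes it easy to compare the numbers before and after deduplication. A small sketch, assuming the two-level `<group>/<class>/` layout shown earlier; `count_files` is a hypothetical helper, and the demo uses an empty stand-in tree rather than the real dataset:

```python
import os
import tempfile


def count_files(root):
    """Return {"<group>/<class>": number of files} for each second-level folder."""
    counts = {}
    for group in sorted(os.listdir(root)):
        group_path = os.path.join(root, group)
        if not os.path.isdir(group_path):
            continue
        for label in sorted(os.listdir(group_path)):
            label_path = os.path.join(group_path, label)
            if os.path.isdir(label_path):
                # Count files recursively, like `find -type f | wc -l`
                counts[f"{group}/{label}"] = sum(
                    len(files) for _, _, files in os.walk(label_path)
                )
    return counts


with tempfile.TemporaryDirectory() as root:
    for group, label, n_files in [("2-class", "no", 2), ("2-class", "yes", 1)]:
        folder = os.path.join(root, group, label)
        os.makedirs(folder)
        for i in range(n_files):
            open(os.path.join(folder, f"img_{i}.jpg"), "wb").close()
    counts = count_files(root)

print(counts)
```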

In [2]:
# Move the dataset folder up one level, out of the nested /content/content/
source_path = "/content/content/Dataset-Brain-MRI"
destination_path = "/content/"
# Prefix Python variables with $ so IPython substitutes their values into the shell command
!mv "$source_path" "$destination_path"

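A Python-only alternative avoids passing variables to the shell altogether. The sketch below uses `shutil.move`, which places the source inside the destination when the destination is an existing directory; the temporary directory stands in for `/content/` and is an assumption of this example:

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as content:
    # Stand-in for the nested folder created by unzip: <content>/content/Dataset-Brain-MRI
    source_path = os.path.join(content, "content", "Dataset-Brain-MRI")
    destination_path = content
    os.makedirs(os.path.join(source_path, "2-class", "no"))

    moved_to = shutil.move(source_path, destination_path)

    moved_ok = os.path.isdir(os.path.join(content, "Dataset-Brain-MRI", "2-class", "no"))
    source_gone = not os.path.exists(source_path)

print(moved_ok, source_gone)
```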

In [11]:
# Delete the leftover folder we no longer need (it is empty after the move)
!rm -r /content/content

### 2. Cloning the git repo for this project

In [None]:
#!rm -rf /content/Brain-Tumor-Detection-And-Classification

In [3]:
!git clone https://github.com/tselane2110/Brain-Tumor-Detection-And-Classification

Cloning into 'Brain-Tumor-Detection-And-Classification'...
remote: Enumerating objects: 301, done.
remote: Counting objects: 100% (158/158), done.
remote: Compressing objects: 100% (127/127), done.
remote: Total 301 (delta 91), reused 71 (delta 31), pack-reused 143 (from 1)
Receiving objects: 100% (301/301), 551.30 KiB | 6.89 MiB/s, done.
Resolving deltas: 100% (166/166), done.


In [4]:
%cd Brain-Tumor-Detection-And-Classification

/content/Brain-Tumor-Detection-And-Classification


In [5]:
!git pull

Already up to date.


In [6]:
import sys
sys.path.append('/content/Brain-Tumor-Detection-And-Classification/')

In [7]:
# importing the script to deduplicate the dataset
import deduplicate_img_dataset as script

In [8]:
main_folder = "/content/Dataset-Brain-MRI/"

# First run the debug to confirm everything is correct
script.debug_folder_structure(main_folder)

# If the debug shows your folders and images correctly, then run:
print("\n" + "="*50)
deduplicator = script.FolderDeduplicator(main_folder)
duplicates = deduplicator.run_deduplication(backup=True)

=== DEBUGGING FOLDER STRUCTURE ===
Main folder exists: True
2-class folder exists: True
4-class folder exists: True
2-class subfolders: ['no', 'yes']
  no: 2993 images
  yes: 3039 images
4-class subfolders: ['pituitary_tumor', '.ipynb_checkpoints', 'meningioma_tumor', 'glioma_tumor', 'no_tumor']
  pituitary_tumor: 1512 images
  .ipynb_checkpoints: 0 images
  meningioma_tumor: 1674 images
  glioma_tumor: 2373 images
  no_tumor: 951 images

Starting deduplication in: /content/Dataset-Brain-MRI/
Folder structure: i/{2-class,4-class}/*/[images]
Scanning for all images...
Found 12542 total images across all subfolders
Checking for duplicate images...
Duplicate found (2 copies):
  KEEP: /content/Dataset-Brain-MRI/2-class/no/no 96.jpg
  DELETE: /content/Dataset-Brain-MRI/4-class/no_tumor/Tr-no_1011.jpg
Duplicate found (3 copies):
  KEEP: /content/Dataset-Brain-MRI/2-class/no/N3.jpg
  DELETE: /content/Dataset-Brain-MRI/2-class/no/no 4.jpg
  DELETE: /content/Dataset-Brain-MRI/4-class/no_tumor/T

### 3. Zipping the folders again

In [16]:
!zip -r /content/deduplicated_dataset.zip "/content/Dataset-Brain-MRI"

Streaming output truncated to the last 5000 lines.
  adding: content/Dataset-Brain-MRI/2-class/no/no@363.jpg (deflated 1%)
  adding: content/Dataset-Brain-MRI/2-class/no/no@197.jpg (deflated 1%)
  adding: content/Dataset-Brain-MRI/2-class/no/no@4567.jpg (deflated 1%)
  adding: content/Dataset-Brain-MRI/2-class/no/no@2714.jpg (deflated 2%)
  adding: content/Dataset-Brain-MRI/2-class/no/no@130.jpg (deflated 1%)
  adding: content/Dataset-Brain-MRI/2-class/no/no@778.jpg (deflated 1%)
  adding: content/Dataset-Brain-MRI/2-class/no/no@4081.jpg (deflated 1%)
  adding: content/Dataset-Brain-MRI/2-class/no/no@1474.jpg (deflated 2%)
  adding: content/Dataset-Brain-MRI/2-class/no/no@1614.jpg (deflated 2%)
  adding: content/Dataset-Brain-MRI/2-class/no/no@3618.jpg (deflated 2%)
  adding: content/Dataset-Brain-MRI/2-class/no/no@2638.jpg (deflated 1%)
  adding: content/Dataset-Brain-MRI/2-class/no/no@2038.jpg (deflated 1%)
  adding: content/Dataset-Brain-MRI/2-class/no/no@1013.jpg (def

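The log above shows entries stored as `content/Dataset-Brain-MRI/...`, because `zip -r` was given an absolute path; this is likely why unzipping in `/content` produces a nested `/content/content/` folder. If archive paths starting at `Dataset-Brain-MRI/` are preferred, one option is `shutil.make_archive` with `root_dir` and `base_dir`. A sketch on a stand-in folder (the temporary directory and the dummy file are assumptions of this example):

```python
import os
import shutil
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in dataset with a single dummy file
    class_dir = os.path.join(tmp, "Dataset-Brain-MRI", "2-class", "no")
    os.makedirs(class_dir)
    with open(os.path.join(class_dir, "N3.jpg"), "wb") as fh:
        fh.write(b"stand-in image bytes")

    archive = shutil.make_archive(
        base_name=os.path.join(tmp, "deduplicated_dataset"),  # ".zip" is appended
        format="zip",
        root_dir=tmp,                  # archive paths are relative to this folder
        base_dir="Dataset-Brain-MRI",  # only this folder is archived
    )
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()

print(names)
```

Extracting such an archive in `/content` would place the data directly at `/content/Dataset-Brain-MRI/`, so no follow-up move is needed.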
### 4. Uploading it to the Google Drive location

In [10]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [15]:
path = "/content/drive/MyDrive/MS-AI/Project-Datasets/SSCLNet-Implementation/"
# Prefix the Python variable with $ so IPython substitutes its value
!mv /content/deduplicated_dataset.zip "$path"