<a href="https://colab.research.google.com/github/tselane2110/Brain-Tumor-Classification-Using-SSL/blob/main/dataset-preparation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[Link to the Dataset](https://drive.google.com/drive/folders/17iNx6mt5FTt3cxwrsUvVoEhyODNX0gyi?usp=drive_link)

# Dataset Deduplication Process

## Objective
Remove exact (byte-identical) duplicate images from the dataset so that repeated samples do not skew training or evaluation.

## Method Used: MD5 Hashing

### How It Works
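1. **Read Binary Content**: Each image file is read as raw binary data
2. **Generate Hash**: The MD5 algorithm computes a 128-bit fingerprint of the file content; identical content always produces the same hash
3. **Compare Hashes**: Files with identical hashes are treated as exact duplicates, since an accidental collision between different files is extremely unlikely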
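
As a small illustration of these steps, the sketch below hashes in-memory byte strings instead of real image files (the byte values and the helper name `md5_of` are made up for the example). It shows that identical bytes give the same digest while a single extra byte gives a different one, which is why this method catches byte-for-byte copies only, not resized or re-encoded versions of an image.

```python
import hashlib

def md5_of(data: bytes) -> str:
    """Return the hexadecimal MD5 digest of a byte string."""
    return hashlib.md5(data).hexdigest()

# Hypothetical stand-ins for the raw contents of three image files.
original = b"\x89PNG-fake-image-bytes"
exact_copy = bytes(original)
one_byte_changed = original + b"\x00"

print(md5_of(original) == md5_of(exact_copy))        # identical content, identical hash
print(md5_of(original) == md5_of(one_byte_changed))  # different content, different hash
```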

### Technical Process
```python
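import hashlib
from collections import defaultdict

image_paths = []            # placeholder: list of image file paths to scan
hashes = defaultdict(list)  # maps MD5 hash -> paths of files with that content

for image_path in image_paths:
    # Step 1: Generate a hash for each image
    with open(image_path, 'rb') as file:
        file_hash = hashlib.md5(file.read()).hexdigest()

    # Step 2: Store paths in a dictionary, grouped by hash
    hashes[file_hash].append(image_path)

# Step 3: Identify duplicates
for hash_value, paths in hashes.items():
    if len(paths) > 1:
        keep_first = paths[0]      # Preserve first occurrence
        remove_rest = paths[1:]    # Mark others for deletion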
```
## Folder Structure Processed

```
Dataset-Brain-MRI/
├── 2-class/
│   ├── yes/
│   └── no/
└── 4-class/
    ├── glioma_tumor/
    ├── meningioma_tumor/
    ├── no_tumor/
    └── pituitary_tumor/
```

## Rules Applied
* Keep: First occurrence of each unique image
* Delete: All subsequent exact copies
* Cross-check: Compare across all folders and subfolders
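
The rules above can be sketched as a pair of small functions. This is a minimal sketch, not the project's `deduplicate_img_dataset` script: the names `find_duplicates` and `remove_duplicates` are hypothetical, every file under the root is hashed regardless of extension, and "first occurrence" is assumed to mean first in sorted path order.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_duplicates(root):
    """Group files under `root` by MD5 hash; return {kept_path: [duplicate_paths]}."""
    by_hash = defaultdict(list)
    # sorted() makes "first occurrence" deterministic across runs (an assumption)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            digest = hashlib.md5(path.read_bytes()).hexdigest()
            by_hash[digest].append(path)
    # Keep the first path of every group that has more than one member
    return {paths[0]: paths[1:] for paths in by_hash.values() if len(paths) > 1}

def remove_duplicates(root, dry_run=True):
    """Delete every duplicate except the first; with dry_run=True only report them."""
    removed = []
    for kept, extras in find_duplicates(root).items():
        for extra in extras:
            if not dry_run:
                extra.unlink()
            removed.append(extra)
    return removed
```

Because `rglob` walks the whole tree, a copy in `4-class/` of an image that already exists in `2-class/` is flagged as well. Calling `remove_duplicates(root)` with the default `dry_run=True` lists what would be deleted without touching any file.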

### 1. Loading the dataset

In [None]:
!gdown --fuzzy "https://drive.google.com/file/d/1Li4wCGDheUZ41dHr_9D5-wDemHw6BvYt/view?usp=drive_link"
!unzip -q Dataset-Brain-MRI.zip

Downloading...
From (original): https://drive.google.com/uc?id=1Li4wCGDheUZ41dHr_9D5-wDemHw6BvYt
From (redirected): https://drive.google.com/uc?id=1Li4wCGDheUZ41dHr_9D5-wDemHw6BvYt&confirm=t&uuid=0326b7cb-459c-4f92-ab56-1c6137aa78fb
To: /content/Dataset-Brain-MRI.zip
100% 429M/429M [00:03<00:00, 138MB/s]


In [None]:
# Count the files in each class folder (assumes the dataset is already at /content/Dataset-Brain-MRI)

!for d in /content/Dataset-Brain-MRI/*/*/ ; do \
    echo "$d" $(find "$d" -type f | wc -l) "files"; \
done
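
The same per-folder count can also be done in Python, which makes it easy to compare the numbers before and after deduplication. A minimal sketch, assuming the two-level `<task>/<class>/` layout shown earlier; the function name `count_files_per_class` is hypothetical.

```python
from pathlib import Path

def count_files_per_class(root):
    """Return {"task/class": file_count} for every second-level folder under root."""
    root = Path(root)
    counts = {}
    for class_dir in sorted(root.glob("*/*")):
        if class_dir.is_dir():
            # Count regular files at any depth inside the class folder
            n_files = sum(1 for p in class_dir.rglob("*") if p.is_file())
            counts[class_dir.relative_to(root).as_posix()] = n_files
    return counts

for folder, n_files in count_files_per_class("/content/Dataset-Brain-MRI").items():
    print(f"{folder}: {n_files} files")
```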

In [None]:
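# Move the dataset folder out of the nested /content/content directory
source_path = "/content/content/Dataset-Brain-MRI"
destination_path = "/content/"
# Prefix Python variables with $ so IPython expands them in the shell command
!mv "$source_path" "$destination_path"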

In [None]:
# Delete the nested folder, which is no longer needed now that it is empty
!rm -r /content/content

### 2. Cloning the git repo for this project

In [None]:
#!rm -rf /content/Brain-Tumor-Detection-And-Classification

In [None]:
!git clone https://github.com/tselane2110/Brain-Tumor-Detection-And-Classification

In [None]:
%cd Brain-Tumor-Detection-And-Classification

/content/Brain-Tumor-Detection-And-Classification


In [None]:
!git pull

Already up to date.


In [None]:
import sys
sys.path.append('/content/Brain-Tumor-Detection-And-Classification/')

In [None]:
# importing the script to deduplicate the dataset
import deduplicate_img_dataset as script

In [None]:
main_folder = "/content/Dataset-Brain-MRI/"

# First run the debug helper to confirm the folder structure is detected correctly
script.debug_folder_structure(main_folder)

# If the debug output lists your folders and images correctly, run the deduplication:
print("\n" + "="*50)
deduplicator = script.FolderDeduplicator(main_folder)
duplicates = deduplicator.run_deduplication(backup=True)

### 3. Zipping the folders again

In [None]:
!zip -r /content/deduplicated_dataset.zip "/content/Dataset-Brain-MRI"
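
Before uploading, it can be worth confirming that the archive is readable and contains the expected number of files. A minimal sketch using the standard-library `zipfile` module; the helper name `check_archive` is hypothetical.

```python
import zipfile

def check_archive(zip_path):
    """Return (number_of_files, first_corrupt_member_or_None) for a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        # Directory entries are stored too, so count regular files only
        n_files = sum(1 for info in archive.infolist() if not info.is_dir())
        # testzip() re-reads every member; None means all CRC checks passed
        bad_member = archive.testzip()
    return n_files, bad_member
```

For example, `check_archive("/content/deduplicated_dataset.zip")` should return the file count of the deduplicated dataset together with `None` if the archive is intact.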

### 4. Uploading the archive to Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
path = "/content/drive/MyDrive/MS-AI/Project-Datasets/SSCLNet-Implementation/"
# Prefix the Python variable with $ so IPython expands it in the shell command
!mv /content/deduplicated_dataset.zip "$path"