# AI Health Hackathon: Step 1 - Project Foundation + Deep Data Understanding

## 🎯 Goal for this Notebook
Before training, we must fully understand our dataset. Real ML teams spend significant time here because **bad data understanding = bad model**.

### Why this matters (Industry Style):
- **Confirm dataset quality**
- **Understand labels**
- **Detect class imbalance early**
- **Decide preprocessing strategy**
- **Avoid wasting compute in Colab**

## 1️⃣ Mount Google Drive
Run this cell only if your dataset is stored in Google Drive, for example because it is too large to upload to the Colab session directly.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## 2️⃣ Install Required Libraries

In [None]:
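!pip install nibabel opencv-python timm albumentations scikit-learn pandas matplotlib seaborn

# Why these packages:
# nibabel        -> read MRI volumes in NIfTI format (.nii)
# opencv-python  -> load and convert 2D images (cv2)
# timm           -> pretrained vision models, including Swin Transformer
# albumentations -> image augmentation pipelines
# pandas         -> summary table of class counts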

## 3️⃣ Load Dataset & Check Classes
Locate the data directory and identify the classification categories.

In [None]:
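import os

# Update this path to your extracted dataset location
# (paths on the mounted Drive start with /content/drive/)
data_dir = "datasets/Training"

classes = []  # defined up front so `classes` exists even if the path is wrong
if os.path.isdir(data_dir):
    # Each subfolder is one class; skip stray files such as .DS_Store
    classes = sorted(
        d for d in os.listdir(data_dir)
        if os.path.isdir(os.path.join(data_dir, d))
    )
    print("Classes found:", classes)
else:
    print("Error: Dataset directory not found. Please check the path.")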

## 4️⃣ Count Samples per Class (Detect Imbalance)
Hospital data is often imbalanced. We must know the distribution before training.

In [None]:
import pandas as pd

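results = []
for c in classes:
    path = os.path.join(data_dir, c)
    if not os.path.isdir(path):
        continue  # skip anything that is not a class folder
    # Count regular files only, so nested folders are not counted as samples
    count = sum(os.path.isfile(os.path.join(path, f)) for f in os.listdir(path))
    print(f"{c}: {count} samples")
    results.append({'Class': c, 'Count': count})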

df = pd.DataFrame(results)
print("\nSummary Table:")
print(df)
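
The per-class counts can be condensed into two numbers that make imbalance easier to judge: the ratio between the largest and smallest class, and a weight per class. The following is a minimal sketch and not part of the original notebook. The function name `imbalance_report` and the example counts are hypothetical, and the weight formula `total / (n_classes * count)` is the inverse-frequency heuristic that scikit-learn uses for `class_weight="balanced"`.

```python
def imbalance_report(counts):
    """Summarize class balance from a {class_name: sample_count} mapping."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("No samples found in any class.")
    n_classes = len(counts)
    largest = max(counts.values())
    smallest = min(counts.values())
    # Largest class divided by smallest class; 1.0 means perfectly balanced.
    ratio = largest / smallest if smallest > 0 else float("inf")
    # Inverse-frequency weights: rare classes get weights above 1.0.
    weights = {
        name: total / (n_classes * count) if count > 0 else 0.0
        for name, count in counts.items()
    }
    return {"total": total, "imbalance_ratio": ratio, "class_weights": weights}


# Hypothetical counts for illustration only; substitute your own class counts.
example_counts = {"class_a": 300, "class_b": 100}
report = imbalance_report(example_counts)
print("Imbalance ratio:", report["imbalance_ratio"])
print("Class weights:", report["class_weights"])
```

With the notebook's own table, the mapping could be built as `dict(zip(df["Class"], df["Count"]))`. What ratio counts as "too imbalanced" depends on the task, so treat the numbers as a prompt for a decision rather than a threshold.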

## 5️⃣ Visual Inspection
Doctors inspect every scan, and so must we. Look for noise, artifacts, and resolution consistency.

In [None]:
import cv2
import matplotlib.pyplot as plt

def show_sample(target_class):
    folder = os.path.join(data_dir, target_class)
    files = sorted(os.listdir(folder))  # sorted so the same sample is shown on every run
    if not files:
        print(f"No files found in: {folder}")
        return
    img_path = os.path.join(folder, files[0])

    img = cv2.imread(img_path)  # returns None if the file is not a readable image
    if img is not None:
        plt.figure(figsize=(6, 6))
        # OpenCV loads images as BGR; matplotlib expects RGB
        plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
        plt.title(f"Sample MRI: {target_class}")
        plt.axis("off")
        plt.show()
    else:
        print(f"Failed to load image: {img_path}")

# Show a sample from the first class
if classes:
    show_sample(classes[0])
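
Viewing one sample cannot show whether resolutions are consistent across the dataset. As a rough sketch under stated assumptions, the helper below walks every class folder and counts how many files share each `(height, width)`. The names `image_shape` and `resolution_counts` are hypothetical, and the sketch assumes the same `root/<class>/<file>` layout used above and that the files are 2D images OpenCV can read.

```python
import os
from collections import Counter


def image_shape(path):
    """Return (height, width) of an image file, or None if OpenCV cannot read it."""
    import cv2  # imported here so resolution_counts can also run with a custom reader

    img = cv2.imread(path, cv2.IMREAD_UNCHANGED)
    return None if img is None else tuple(img.shape[:2])


def resolution_counts(root, read_shape=image_shape):
    """Count how many files share each (height, width) under root/<class>/<file>."""
    counts = Counter()
    for class_name in sorted(os.listdir(root)):
        class_dir = os.path.join(root, class_name)
        if not os.path.isdir(class_dir):
            continue  # skip stray files next to the class folders
        for file_name in sorted(os.listdir(class_dir)):
            file_path = os.path.join(class_dir, file_name)
            if os.path.isfile(file_path):
                counts[read_shape(file_path)] += 1
    return counts


data_dir = "datasets/Training"  # same path as in section 3
if os.path.isdir(data_dir):
    for shape, n in resolution_counts(data_dir).most_common():
        print(f"{shape}: {n} files")  # a None key means unreadable files
```

A single dominant shape suggests a fixed resize is enough. Many distinct shapes, or a non-zero `None` count, are worth noting before preprocessing decisions are made.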

## 6️⃣ Metadata Check (For .nii files)
Run this cell only if you are working with raw MRI volumes in NIfTI format (`.nii`).

In [None]:
import nibabel as nib

# scan = nib.load("sample.nii")
# print(scan.header)

print("Metadata check cell ready. Uncomment lines above if using NIfTI files.")
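
Printing the full header is verbose, so it can help to pull out only the fields that affect preprocessing. The sketch below is one possible way to do that, not the notebook's own method. `nifti_summary` is a hypothetical helper, `sample.nii` is the placeholder file name from the cell above, and the code relies on the `shape` attribute and the `header.get_zooms()` and `header.get_data_dtype()` methods of a loaded nibabel image.

```python
import os


def nifti_summary(img):
    """Collect basic geometry and dtype info from a loaded nibabel image."""
    header = img.header
    return {
        "shape": tuple(img.shape),
        # Voxel spacing along each axis (plus the time step for 4D data)
        "voxel_size": tuple(float(z) for z in header.get_zooms()),
        "dtype": str(header.get_data_dtype()),
    }


sample_path = "sample.nii"  # placeholder file name; replace with a real scan
if os.path.exists(sample_path):
    import nibabel as nib

    print(nifti_summary(nib.load(sample_path)))
```

Comparing these summaries across scans shows whether volumes share the same grid and spacing, which is the 3D counterpart of the resolution check for 2D images.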

---
## 🔬 Data Audit Report Summary

| Question | Answer |
| :--- | :--- |
| **Is dataset balanced?** | [Check Table Above] |
| **Are labels reliable?** | Assumed from folder names (not verified in this notebook) |
| **Consistent resolutions?** | Yes/No (Manual Check Required) |
| **Missing files?** | [Fill in after checking every class folder] |

**Next Step:**  
👉 **Step 2: Deep Cleaning + Standardization**