# Data Exploration - X-ray Bone Fracture Detection

This notebook explores the MURA dataset and prepares it for preprocessing.

## Steps:
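1. Import libraries
2. Load dataset
3. Dataset statistics
4. Load sample images
5. Visualize class distribution
6. Display sample images
7. Analyze individual images
8. Examine single image in detail
9. Compare normal vs fractured
10. Data quality check
11. Create complete exploration report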

## 1. Import Libraries

In [None]:
import sys
sys.path.append('..')

import numpy as np
import cv2
import matplotlib.pyplot as plt
import os
from pathlib import Path

# Import our custom utilities
from utils.data_loader import DatasetLoader
from utils.visualization import XRayVisualizer, create_data_exploration_report

# Set matplotlib style (this style name requires Matplotlib 3.6+;
# older releases call it 'seaborn-darkgrid')
plt.style.use('seaborn-v0_8-darkgrid')
%matplotlib inline

print("✅ Libraries imported successfully!")

## 2. Load Dataset

In [None]:
# Define data directory
DATA_DIR = '../data'

# Create dataset loader
loader = DatasetLoader(DATA_DIR)

print("Dataset loader created!")
print(f"Data directory: {DATA_DIR}")

## 3. Dataset Statistics

In [None]:
# Print dataset information
loader.print_dataset_info()

In [None]:
# Get detailed statistics
stats = loader.get_dataset_statistics()

print("\nDetailed Statistics:")
print("=" * 60)
for split in ['train', 'validation', 'test']:
    print(f"\n{split.upper()}:")
    print(f"  Total images: {stats[split]['total']:,}")
    if stats[split]['total'] > 0:
        frac_pct = 100 * stats[split]['fractured'] / stats[split]['total']
        print(f"  Fractured: {stats[split]['fractured']:,} ({frac_pct:.1f}%)")
        print(f"  Normal: {stats[split]['normal']:,} ({100-frac_pct:.1f}%)")
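
The fractured/normal percentages above can also be turned into per-class weights for later training. The following is a minimal sketch, not part of `DatasetLoader`: `class_balance` is a hypothetical helper, it assumes each split dictionary has the `total`, `fractured`, and `normal` keys used above, and the counts in `example` are made up for illustration. The weights follow the common "balanced" heuristic, `total / (n_classes * class_count)`.

```python
def class_balance(split_stats):
    """Return the fracture rate and inverse-frequency class weights for one split.

    Hypothetical helper: expects a dict with 'total', 'fractured', 'normal' counts.
    Returns None when a weight cannot be computed (empty split or empty class).
    """
    total = split_stats['total']
    fractured = split_stats['fractured']
    normal = split_stats['normal']
    if total == 0 or fractured == 0 or normal == 0:
        return None
    return {
        'fractured_rate': fractured / total,
        # "Balanced" heuristic for two classes: total / (2 * class_count)
        'weight_normal': total / (2 * normal),
        'weight_fractured': total / (2 * fractured),
    }


# Made-up counts for illustration only, not real dataset statistics
example = {'total': 1000, 'fractured': 400, 'normal': 600}
balance = class_balance(example)
print(balance)
```

In this notebook it could be called as `class_balance(stats['train'])`. A weight above 1 marks the under-represented class.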

## 4. Load Sample Images

In [None]:
# Load training data paths
train_paths, train_labels = loader.load_data_paths('train')

print(f"Loaded {len(train_paths):,} training images")
print(f"Number of labels: {len(train_labels)}")
print(f"Unique labels: {np.unique(train_labels)}")

## 5. Visualize Class Distribution

In [None]:
# Create visualizer
viz = XRayVisualizer()

# Show class distribution
viz.show_class_distribution(train_labels, class_names=['Normal', 'Fractured'])

## 6. Display Sample Images

In [None]:
# Show sample images from each class
viz.show_sample_images(train_paths, train_labels, samples_per_class=5)

## 7. Analyze Individual Images

In [None]:
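# Load up to 20 randomly chosen sample images
n_samples = min(20, len(train_paths))
sample_indices = np.random.choice(len(train_paths), n_samples, replace=False)
sample_images = []

for idx in sample_indices:
    # cv2.imread returns None (no exception) when a file is missing or unreadable
    img = cv2.imread(str(train_paths[idx]), cv2.IMREAD_GRAYSCALE)
    if img is not None:
        sample_images.append(img)

print(f"Loaded {len(sample_images)} of {n_samples} sample images")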

In [None]:
# Show image statistics
viz.show_image_statistics(sample_images)
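
To put numbers on how much image sizes vary before choosing a resize target, the sample shapes can be summarized directly. This is a rough sketch that does not rely on `XRayVisualizer`: `summarize_shapes` is a hypothetical helper, and `example_shapes` holds made-up `(height, width)` pairs.

```python
from collections import Counter


def summarize_shapes(shapes):
    """Summarize a list of (height, width) tuples."""
    if not shapes:
        raise ValueError("no shapes given")
    heights = [h for h, _ in shapes]
    widths = [w for _, w in shapes]
    return {
        'count': len(shapes),
        'distinct_sizes': len(set(shapes)),
        # (shape, occurrences) for the most frequent size
        'most_common': Counter(shapes).most_common(1)[0],
        'height_range': (min(heights), max(heights)),
        'width_range': (min(widths), max(widths)),
    }


# Made-up shapes for illustration only
example_shapes = [(512, 406), (512, 406), (420, 512), (512, 512)]
summary = summarize_shapes(example_shapes)
print(summary)
```

With the images loaded above, the input would be `[img.shape[:2] for img in sample_images]`.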

## 8. Examine Single Image in Detail

In [None]:
# Pick a random fractured X-ray
fractured_indices = [i for i, label in enumerate(train_labels) if label == 1]
random_fractured_idx = np.random.choice(fractured_indices)
fractured_path = train_paths[random_fractured_idx]

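# Load and display (cv2.imread returns None instead of raising on failure)
fractured_img = cv2.imread(str(fractured_path), cv2.IMREAD_GRAYSCALE)
if fractured_img is None:
    raise FileNotFoundError(f"Could not read image: {fractured_path}")
viz.show_image(fractured_img, "Sample Fractured X-ray")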

In [None]:
# Show histogram
viz.show_histogram(fractured_img, "Fractured X-ray - Pixel Intensity Distribution")

In [None]:
# Pick a random normal X-ray
normal_indices = [i for i, label in enumerate(train_labels) if label == 0]
random_normal_idx = np.random.choice(normal_indices)
normal_path = train_paths[random_normal_idx]

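# Load and display (cv2.imread returns None instead of raising on failure)
normal_img = cv2.imread(str(normal_path), cv2.IMREAD_GRAYSCALE)
if normal_img is None:
    raise FileNotFoundError(f"Could not read image: {normal_path}")
viz.show_image(normal_img, "Sample Normal X-ray")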

## 9. Compare Normal vs Fractured

In [None]:
# Compare histograms
viz.compare_histograms(
    [normal_img, fractured_img],
    ['Normal', 'Fractured'],
    title="Pixel Intensity Comparison: Normal vs Fractured"
)
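
A visual comparison of two histograms can be backed by a single number. The sketch below computes histogram intersection with NumPy; `histogram_intersection` is a hypothetical helper, not a method of `XRayVisualizer`, and it assumes 8-bit grayscale inputs. One normal/fractured pair says little about the classes as a whole, so treat the value as a sanity check rather than a finding.

```python
import numpy as np


def histogram_intersection(img_a, img_b, bins=256):
    """Overlap of two normalized intensity histograms.

    Returns 1.0 for identical distributions and 0.0 for disjoint ones.
    Assumes 8-bit grayscale images (values 0..255).
    """
    hist_a, _ = np.histogram(img_a, bins=bins, range=(0, 256))
    hist_b, _ = np.histogram(img_b, bins=bins, range=(0, 256))
    p = hist_a / hist_a.sum()
    q = hist_b / hist_b.sum()
    return float(np.minimum(p, q).sum())


# Synthetic images for illustration only
dark = np.full((4, 4), 10, dtype=np.uint8)
bright = np.full((4, 4), 200, dtype=np.uint8)
same = histogram_intersection(dark, dark)
different = histogram_intersection(dark, bright)
print(same, different)
```

For the images loaded above, the call would be `histogram_intersection(normal_img, fractured_img)`.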

## 10. Data Quality Check

In [None]:
# Verify data integrity (this may take a while for large datasets)
# Uncomment to run full verification
# results = loader.verify_data_integrity()

print("⚠️  Data integrity check can take a long time for large datasets.")
print("Uncomment the code above to run a full check.")
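
When the full check is too slow, a cheaper first pass is to read only the first few bytes of each file. The sketch below is an assumption-laden alternative, not what `verify_data_integrity()` does: it assumes the images are stored as PNG files (as in the public MURA release), and `find_suspect_pngs` is a hypothetical helper. It catches missing, empty, or mislabeled files, but not images that are truncated after the header.

```python
import tempfile
from pathlib import Path

# Every valid PNG file starts with this 8-byte signature
PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'


def find_suspect_pngs(paths):
    """Return paths that are unreadable or do not start with the PNG signature."""
    suspects = []
    for path in map(Path, paths):
        try:
            with path.open('rb') as f:
                header = f.read(len(PNG_SIGNATURE))
        except OSError:
            # Missing file, permission problem, etc.
            suspects.append(str(path))
            continue
        if header != PNG_SIGNATURE:
            suspects.append(str(path))
    return suspects


# Small demonstration on temporary files
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / 'good.png'
    bad = Path(tmp) / 'bad.png'
    missing = Path(tmp) / 'missing.png'
    good.write_bytes(PNG_SIGNATURE + b'rest of file')
    bad.write_bytes(b'not an image')
    suspects = find_suspect_pngs([good, bad, missing])
    suspect_names = [Path(p).name for p in suspects]

print(suspect_names)
```

In this notebook the input would be `train_paths`. Files flagged here are worth inspecting before running the slower full verification.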

## 11. Create Complete Exploration Report

In [None]:
# Create comprehensive report
# This will save visualizations to a reports folder
create_data_exploration_report(DATA_DIR, output_dir='../reports')

## Summary

### Key Findings:
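1. **Dataset Size**: Record the per-split image counts printed in Section 3.
2. **Class Balance**: Note the fractured-to-normal ratio from the distribution chart in Section 5.
3. **Image Properties**: Note how much image sizes vary across the Section 7 sample.
4. **Quality**: List any corrupted or unreadable images found in Section 10.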

### Next Steps:
1. Move to `02_preprocessing.ipynb` to preprocess images
2. Apply data augmentation if needed
3. Prepare data for model training

### Notes:
- Save any important observations
- Document any data quality issues
- Plan preprocessing strategy based on findings