# 01 - Data Exploration

**Purpose:** Load and explore sample data from the Valheim WorldGenerator.

**Scope:**
- Load sample JSON file
- Display basic statistics
- Visualize raw biome distribution
- Analyze height distribution
- Create spatial overview maps

**Prerequisites:**
- Sample data file: `../output/samples/*-samples-1024.json`

**Outputs:**
- Summary statistics
- Raw biome distribution pie chart
- Height histogram
- Spatial heatmap

**Estimated Time:** 5 minutes

## Setup

In [None]:
# Standard imports
import sys
sys.path.append('.')  # Ensure local imports work

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Local utilities
from biome_utils import *
from config import *

# Jupyter display settings
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 11

print("✓ Setup complete")

## Load Data

In [None]:
# Update this path to your sample file
SAMPLE_PATH = '../output/samples/hkLycKKCMI-samples-1024.json'

# Alternative: auto-select the first matching sample file
# sample_files = list(Path('../output/samples/').glob('*-samples-1024.json'))
# if sample_files:
#     SAMPLE_PATH = str(sample_files[0])
#     print(f"Auto-selected: {SAMPLE_PATH}")

# Load samples
df = load_samples(SAMPLE_PATH)

# Display first few rows
df.head()
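The hard-coded `SAMPLE_PATH` stops working as soon as a different world seed is sampled, and the commented-out alternative takes whichever glob match happens to come first. A minimal sketch of a more defensive lookup is shown below. `find_latest_sample` is a hypothetical helper, not part of `biome_utils`, and choosing the most recently modified match is an assumption about which file is wanted.

```python
from pathlib import Path


def find_latest_sample(directory, pattern="*-samples-1024.json"):
    """Return the most recently modified file matching `pattern`.

    Hypothetical helper for illustration; raises if nothing matches
    instead of silently leaving SAMPLE_PATH pointing at a stale file.
    """
    matches = sorted(Path(directory).glob(pattern), key=lambda p: p.stat().st_mtime)
    if not matches:
        raise FileNotFoundError(f"No files matching {pattern!r} in {directory}")
    return matches[-1]


# Possible usage in this notebook (path taken from the Prerequisites section):
# SAMPLE_PATH = str(find_latest_sample('../output/samples/'))
```

Failing loudly on an empty directory makes a missing prerequisite visible at the load step rather than later inside `load_samples`.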

## Summary Statistics

In [None]:
# Print comprehensive summary
print_summary_stats(df)

In [None]:
# DataFrame info (df.info() prints directly and returns None, so it is not wrapped in print)
print("DataFrame Info:")
df.info()

print("\nDescriptive Statistics:")
print(df.describe())

## Biome Distribution

In [None]:
# Raw biome distribution (pie chart)
fig = plot_biome_distribution(df, "Raw API Data - Biome Distribution")
plt.show()

In [None]:
# Detailed statistics table
stats = calculate_biome_distribution(df)

stats_df = pd.DataFrame(stats).T
stats_df = stats_df.sort_values('percentage', ascending=False)
print("\nBiome Distribution Table:")
print(stats_df[['count', 'percentage']].to_string())
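For readers without `biome_utils` at hand, the table above can be approximated with plain pandas. This is a sketch only: the column name `Biome` is an assumption (the notebook confirms only `Height`), and `biome_distribution` is a hypothetical stand-in that mirrors the `{biome: {'count', 'percentage'}}` shape implied by the cell above, not necessarily how `calculate_biome_distribution` is implemented.

```python
import pandas as pd


def biome_distribution(df, biome_col="Biome"):
    """Count non-null samples per biome and express each count as a percentage."""
    counts = df[biome_col].value_counts()  # NaN labels are dropped by default
    total = int(counts.sum())
    return {
        biome: {"count": int(n), "percentage": float(n / total * 100)}
        for biome, n in counts.items()
    }


# Synthetic data for illustration only; not real WorldGenerator output.
demo_biomes = pd.DataFrame({"Biome": ["Meadows", "Ocean", "Ocean", "DeepNorth"]})
demo_stats = biome_distribution(demo_biomes)
print(pd.DataFrame(demo_stats).T.sort_values("percentage", ascending=False))
```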

## Height Distribution

In [None]:
# Height histogram
fig = plot_height_histogram(df, bins=100)
plt.show()

In [None]:
# Overall height statistics and split around sea level
print("Height Statistics:")
print(f"  Min: {df['Height'].min():.1f}m")
print(f"  Max: {df['Height'].max():.1f}m")
print(f"  Mean: {df['Height'].mean():.1f}m")
print(f"  Median: {df['Height'].median():.1f}m")

# Compute each count once instead of repeating the comparison inside the f-strings
n_below = (df['Height'] < SEA_LEVEL_METERS).sum()
n_above = (df['Height'] >= SEA_LEVEL_METERS).sum()
print(f"\nSamples below sea level ({SEA_LEVEL_METERS}m): {n_below:,} ({n_below / len(df) * 100:.1f}%)")
print(f"Samples at or above sea level: {n_above:,} ({n_above / len(df) * 100:.1f}%)")

## Spatial Overview

In [None]:
# All biomes spatial heatmap
fig = plot_spatial_heatmap(df, title="All Biomes - Spatial Distribution")
plt.show()

## Distance Ring Analysis

In [None]:
# Analyze biome distribution by distance rings
fig = plot_distance_rings(df)
plt.show()

In [None]:
# Detailed ring statistics
ring_stats = analyze_by_distance_ring(df)

# Loop variable is named `ring` so it does not overwrite `stats` from the biome table cell
for label, ring in ring_stats.items():
    print(f"\n{label}: {ring['total']:,} samples")
    print("-" * 50)
    for biome, data in sorted(ring['biomes'].items(), key=lambda x: x[1]['percentage'], reverse=True):
        print(f"  {biome:<15} {data['count']:>8,} ({data['percentage']:>5.1f}%)")

## Key Findings

**Based on raw API data analysis:**

1. **Polar Biome Over-representation:**
   - DeepNorth: ~31% of world
   - Ashlands: ~15% of world
   - Combined: ~46% of world (expected: ~15-20%)

2. **Mistlands Starvation:**
   - Only ~5.5% of world
   - Expected: 25-30% (outer ring biome)
   - Cause: `GetBiome()` checks Mistlands only after the polar biomes, so the polar biomes claim contested samples first

3. **Outer Ring Analysis (6-10km):**
   - Ashlands dominates (43.9%)
   - DeepNorth second (28.9%)
   - Mistlands nearly absent (rounds to 0.0%)
   - This is where filters will have most impact
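The outer-ring percentages above can be cross-checked independently of `analyze_by_distance_ring`. The sketch below rests on several assumptions: samples carry world coordinates in columns named `X` and `Z` (in metres) and a biome label in `Biome`, and the 6-10km ring is taken as 6000 <= distance < 10000 from the world origin. `ring_biome_share` is a hypothetical helper, and the demo rows are synthetic.

```python
import numpy as np
import pandas as pd


def ring_biome_share(df, inner_m, outer_m, x_col="X", z_col="Z", biome_col="Biome"):
    """Percentage of samples per biome with inner_m <= distance from origin < outer_m."""
    distance = np.hypot(df[x_col], df[z_col])  # Euclidean distance from (0, 0)
    in_ring = df[(distance >= inner_m) & (distance < outer_m)]
    return in_ring[biome_col].value_counts(normalize=True) * 100


# Synthetic rows for illustration only; column names are assumed, not confirmed.
demo_rings = pd.DataFrame({
    "X": [0, 7000, 0, 9000, 500],
    "Z": [6500, 0, -8000, 0, 500],
    "Biome": ["Ashlands", "Ashlands", "DeepNorth", "Mistlands", "Meadows"],
})
outer_share = ring_biome_share(demo_rings, 6000, 10000)
print(outer_share.round(1))
```

Running the same function on `df` before and after a filter is applied would show directly how much of the outer ring the filter hands back to Mistlands.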

**Next Steps:**
- Notebook 02: Tune sea level threshold
- Notebook 03: Apply polar biome filters
- Notebook 05: Compare filter strategies