# Latent Cartography: Pipeline Walkthrough

This notebook walks through the core analysis pipeline for our cross-mapper beatmap study. The goal: by comparing how **different mappers** annotate the **same songs**, we can identify which aspects of rhythm representation are **perceptually universal** (all mappers agree) vs. **stylistic** (mapper-specific).

The pipeline has two tracks and a final synthesis step:
- **Track A** — Compare mappers in a 9-dimensional interpretable space (onset, combo, slider, etc.)
- **Track B** — Compare mappers in a 32-dimensional VAE latent space (learned representation)
- **Synthesis** — Correlate the two spaces to understand what the VAE learns

---

## Background: What are we measuring?

In osu!, a **beatmap** is a human-authored annotation of a song — placing hit objects in time and space to match perceived rhythm. When multiple mappers independently create beatmaps for the same song, their maps reflect a mix of:

1. **Perceptual agreement** — aspects of rhythm that all humans hear the same way (e.g., a loud drum hit → everyone places an onset there)
2. **Stylistic choice** — artistic decisions that vary by mapper (e.g., cursor movement patterns, difficulty preferences)

Our metric is **Pearson correlation (r)** between pairs of mappers on the same song:

| Range | Interpretation |
|-------|---------------|
| r > 0.5 | **Strong agreement** — likely perceptual |
| r = 0.3–0.5 | **Moderate agreement** |
| r = 0.1–0.3 | **Weak agreement** — mixed |
| r < 0.1 | **No agreement** — stylistic / mapper-specific |
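
For reference, the standard sample Pearson correlation between two mappers' signals for one dimension can be written as below. The symbols $a_t$ and $b_t$ (the two signals at frame $t$), $\bar{a}$ and $\bar{b}$ (their means), and $T$ (the number of shared frames) are notation introduced here for illustration; they are not taken from the pipeline code.

$$
r = \frac{\sum_{t=1}^{T} (a_t - \bar{a})(b_t - \bar{b})}{\sqrt{\sum_{t=1}^{T} (a_t - \bar{a})^2}\,\sqrt{\sum_{t=1}^{T} (b_t - \bar{b})^2}}
$$

The value is undefined when either signal is constant, because the corresponding sum of squares in the denominator is zero.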

## The 9 Interpretable Dimensions

Each beatmap is encoded as a 9-channel temporal signal (one value per audio frame):

| # | Dimension | What it captures |
|---|-----------|------------------|
| 0 | **ONSET** | Is there a hit object at this moment? (binary) |
| 1 | **COMBO** | New combo flag — mapper's phrasing of the rhythm |
| 2 | **SLIDE** | Is a slider active? (sustained hits) |
| 3 | **SUSTAIN** | Slider body / hold duration |
| 4 | **WHISTLE** | Whistle hitsound (high-pitched accent) |
| 5 | **FINISH** | Finish hitsound (cymbal crash) |
| 6 | **CLAP** | Clap hitsound (snare-like accent) |
| 7 | **X** | Horizontal cursor position (0–512) |
| 8 | **Y** | Vertical cursor position (0–384) |

---
## Step 1: Find Multi-Mapper Songs

We scan a folder of `.osz` beatmap archives and group them by song (matching on title + artist). We keep only songs that have been mapped by **2 or more different mappers** — these are our natural experiments.

In [None]:
import zipfile, re, json
from collections import defaultdict
from pathlib import Path

def scan_osz_folder(dataset_dir):
    """Read metadata from each .osz file and group by song."""
    songs = defaultdict(list)
    
    for osz_path in sorted(Path(dataset_dir).glob("*.osz")):
        try:
            with zipfile.ZipFile(osz_path) as z:
                # Each .osz contains one or more .osu files (difficulty versions);
                # metadata is read from the first one only
                osu_files = [n for n in z.namelist() if n.endswith(".osu")]
                if not osu_files:
                    continue  # archive without any difficulty file
                content = z.read(osu_files[0]).decode("utf-8", errors="replace")
        except zipfile.BadZipFile:
            continue  # skip corrupt archives

        # Extract key fields from the .osu text format
        fields = {}
        for name in ("Title", "Artist", "Creator"):
            m = re.search(rf"^{name}:(.+)$", content, re.MULTILINE)
            fields[name.lower()] = m.group(1).strip() if m else None
        if not all(fields.values()):
            continue  # missing or empty metadata; this set cannot be grouped

        # Group case-insensitively on (title, artist)
        songs[(fields["title"].lower(), fields["artist"].lower())].append({
            **fields, "filename": osz_path.name,
        })
    
    # Keep only songs with 2+ different mappers
    multi_mapper = []
    for (title, artist), entries in songs.items():
        mappers = set(e["creator"] for e in entries)
        if len(mappers) >= 2:
            multi_mapper.append({
                "title": entries[0]["title"],
                "artist": entries[0]["artist"],
                "num_mappers": len(mappers),
                "beatmapsets": entries,
            })
    
    return multi_mapper

# Example: multi_mapper = scan_osz_folder("/path/to/osz/files/")

## Step 2: Build the Beatmap Registry

For each multi-mapper song, we extract the `.osu` files and parse their metadata. We filter to **Mode 0** (osu!standard, the classic circle-clicking mode) and select one **representative** beatmap per mapper per song: the difficulty with the highest Overall Difficulty (OD). OD measures the timing precision a map requires, so it serves here as a simple proxy for the mapper's hardest version.

Selecting one representative means each mapper contributes a single map per song, intended to be their hardest version of it.

In [None]:
def parse_osu_metadata(osu_path):
    """Parse an .osu file for difficulty settings and metadata."""
    content = Path(osu_path).read_text(encoding="utf-8", errors="replace")
    
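    def get(name):
        # [ \t]* instead of \s* so an empty value cannot swallow the next line
        m = re.search(rf"^{name}:[ \t]*(.+)$", content, re.MULTILINE)
        return m.group(1).strip() if m else None
    
    # Only keep Mode 0 (osu!standard). A missing Mode line is treated as
    # Mode 0, on the assumption that older .osu files omit it.
    if (get("Mode") or "0") != "0":
        return None
    
    audio = get("AudioFilename")
    return {
        "title": get("Title"),
        "artist": get("Artist"),
        "creator": get("Creator"),
        "version": get("Version"),        # difficulty name, e.g. "Hard", "Insane"
        "od": float(get("OverallDifficulty") or 0),  # timing precision required
        "ar": float(get("ApproachRate") or 0),        # how fast objects appear
        "cs": float(get("CircleSize") or 0),          # click target size
        # AudioFilename is relative to the beatmap folder, so resolve it there
        "audio_path": str(Path(osu_path).parent / audio) if audio else None,
        "osu_path": str(osu_path),  # Step 3 re-opens the beatmap from this path
    }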

def select_representatives(registry):
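    """One beatmap per mapper per song: pick the highest OD.

    Each record needs "song_idx" and "creator" keys. parse_osu_metadata() does
    not set "song_idx"; the caller must add it when building the registry.
    """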
    groups = defaultdict(list)
    for r in registry:
        groups[(r["song_idx"], r["creator"])].append(r)
    
    return [max(beatmaps, key=lambda b: b["od"]) for beatmaps in groups.values()]
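
The cells above do not show how registry records receive their `song_idx`. The sketch below is one way this could be done, assuming each archive is extracted to disk first. `extract_osu_files`, `build_registry`, and the `parse_fn` parameter are hypothetical names introduced here for illustration; they are not the repository's actual API.

```python
import tempfile
import zipfile
from pathlib import Path

def extract_osu_files(osz_path, out_root):
    """Extract one .osz archive; return the .osu paths found inside it."""
    out_dir = Path(out_root) / Path(osz_path).stem
    with zipfile.ZipFile(osz_path) as z:
        z.extractall(out_dir)
    return sorted(out_dir.rglob("*.osu"))

def build_registry(multi_mapper, dataset_dir, out_root, parse_fn):
    """One record per parsed difficulty, tagged with its song index (hypothetical helper)."""
    registry = []
    for song_idx, song in enumerate(multi_mapper):
        for entry in song["beatmapsets"]:
            osz_path = Path(dataset_dir) / entry["filename"]
            for osu_path in extract_osu_files(osz_path, out_root):
                meta = parse_fn(osu_path)  # e.g. parse_osu_metadata; None means "skip"
                if meta is not None:
                    registry.append({**meta, "song_idx": song_idx, "osu_path": str(osu_path)})
    return registry

# Toy run with a stand-in parser and a one-file archive
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    with zipfile.ZipFile(tmp / "demo.osz", "w") as z:
        z.writestr("demo [Hard].osu", "Mode: 0\nCreator:someone\n")
    songs = [{"beatmapsets": [{"filename": "demo.osz"}]}]
    registry = build_registry(songs, tmp, tmp / "extracted",
                              parse_fn=lambda p: {"creator": "someone", "od": 7.0})
    print(len(registry), registry[0]["song_idx"])  # 1 0
```

Passing the parser in as `parse_fn` keeps the sketch self-contained; in the notebook it would be `parse_osu_metadata`.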

## Step 3: Encode to 9-dim Signals

Each representative beatmap is converted to a `[9, T]` numpy array — 9 channels over `T` audio frames (~172 fps). This uses the [osu-dreamer](https://github.com/jaswon/osu-dreamer) library's encoding:

```
Audio file ──→ Spectrogram (T frames at ~172 fps)
                    │
.osu file  ──→ encode_beatmap(beatmap, frame_times) ──→ [9, T] array
```

For songs with multiple mappers, we ensure all encodings share the same `T` (trimming to the shortest) so they're directly comparable frame-by-frame.
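
As a minimal sketch of the trimming step, the helper below cuts a list of `[C, T_i]` arrays to the shortest length. `trim_to_common_length` is a hypothetical name used for illustration; the pipeline itself performs the equivalent trim inside the comparison loop in Step 4.

```python
import numpy as np

def trim_to_common_length(arrays):
    """Trim [C, T_i] arrays along the time axis to the shortest T_i."""
    if not arrays:
        return []
    min_T = min(a.shape[1] for a in arrays)
    return [a[:, :min_T] for a in arrays]

# Toy example: two mappers' 9-channel encodings with slightly different lengths
a = np.zeros((9, 1000))
b = np.zeros((9, 997))
trimmed = trim_to_common_length([a, b])
print([t.shape for t in trimmed])  # [(9, 997), (9, 997)]
```

Trimming only equalizes length. If two archives ship different cuts of the audio, their frames can still be offset from each other.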

In [None]:
import numpy as np
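# The osu-dreamer imports are commented out so this notebook renders without
# the library installed. Uncomment them before calling encode_representatives().
# from osu_dreamer.osu.beatmap import Beatmap
# from osu_dreamer.data.load_audio import load_audio, get_frame_times
# from osu_dreamer.data.beatmap.encode import encode_beatmap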

def encode_representatives(representatives):
    """Encode each representative beatmap to a 9-dim temporal signal."""
    encodings = []
    
    for rep in representatives:
        # Load audio → spectrogram (determines frame count T)
        spec = load_audio(Path(rep["audio_path"]))
        frame_times = get_frame_times(spec.shape[1])
        
        # Parse .osu file and encode hit objects to 9-dim signal
        beatmap = Beatmap(Path(rep["osu_path"]))
        encoded = encode_beatmap(beatmap, frame_times)  # shape: [9, T]
        
        # Each row is one dimension over time:
        # encoded[0] = ONSET signal  (1 where a hit object starts, 0 elsewhere)
        # encoded[7] = X position    (normalized 0-1 over the playfield)
        # ... etc.
        
        encodings.append({"data": encoded, **rep})
    
    return encodings

## Step 4: Track A — Cross-Mapper Comparison (9-dim)

The core analysis. For every pair of mappers who mapped the same song, we compute Pearson correlation **per dimension**.

For example, a mean r of 0.58 for ONSET across all mapper pairs means mappers consistently agree on **when** to place hit objects, which points to a perceptual basis. A mean r of 0.02 for X position means horizontal placement is essentially uncorrelated between mappers, which points to stylistic choice.

In [None]:
DIM_NAMES = ["ONSET", "COMBO", "SLIDE", "SUSTAIN", "WHISTLE", "FINISH", "CLAP", "X", "Y"]

def pearson_corr(a, b):
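    """Pearson correlation between two 1-D time series.

    Returns 0.0 when r is undefined (fewer than 2 samples or a constant
    signal), so such pairs count as "no agreement" in the averages.
    """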
    if len(a) < 2 or np.std(a) == 0 or np.std(b) == 0:
        return 0.0
    return float(np.corrcoef(a, b)[0, 1])

def cross_mapper_analysis(encodings_by_song):
    """
    Compare every cross-mapper pair on the same song.
    Returns per-dimension Pearson correlations averaged across all pairs.
    """
    results = {dim: [] for dim in DIM_NAMES}
    
    for song_idx, song_encodings in encodings_by_song.items():
        # Get minimum length across all encodings for this song
        min_T = min(enc["data"].shape[1] for enc in song_encodings)
        
        # Compare all cross-mapper pairs
        for i in range(len(song_encodings)):
            for j in range(i + 1, len(song_encodings)):
                a = song_encodings[i]
                b = song_encodings[j]
                
                if a["creator"] == b["creator"]:
                    continue  # Skip same-mapper pairs
                
                # Compare each dimension independently
                for d, dim_name in enumerate(DIM_NAMES):
                    r = pearson_corr(
                        a["data"][d, :min_T],
                        b["data"][d, :min_T]
                    )
                    results[dim_name].append(r)
    
    # Average across all pairs
    summary = {}
    for dim_name in DIM_NAMES:
        if results[dim_name]:
            summary[dim_name] = {
                "pearson_mean": np.mean(results[dim_name]),
                "pearson_std": np.std(results[dim_name]),
                "n_pairs": len(results[dim_name]),
            }
    
    return summary

# Example output (from our 368-song dataset):
# ONSET    r=0.58  [PERCEPTUAL]  ← mappers agree on timing
# SLIDE    r=0.33  [PERCEPTUAL]  ← sliders placed similarly
# CLAP     r=0.12  [MIXED]       ← some agreement on hitsounds
# X        r=0.02  [STYLISTIC]   ← cursor patterns are personal
# Y        r=0.01  [STYLISTIC]   ← same
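
The bracketed labels in the example output can be reproduced with a small threshold function. The cutoffs below (0.3 and 0.1) are an assumption inferred from the interpretation table and the example labels above, where r=0.33 is already tagged PERCEPTUAL; they are not confirmed pipeline constants, and `classify_agreement` is a hypothetical name.

```python
def classify_agreement(r, perceptual_min=0.3, mixed_min=0.1):
    """Map a mean Pearson r to a label.

    Thresholds are assumptions inferred from the example output,
    not constants taken from the pipeline code.
    """
    if r >= perceptual_min:
        return "PERCEPTUAL"
    if r >= mixed_min:
        return "MIXED"
    return "STYLISTIC"

# Mean r values copied from the example output above
example_means = {"ONSET": 0.58, "SLIDE": 0.33, "CLAP": 0.12, "X": 0.02, "Y": 0.01}
labels = {dim: classify_agreement(r) for dim, r in example_means.items()}
for dim, r in example_means.items():
    print(f"{dim:<8} r={r:.2f}  [{labels[dim]}]")
```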

## Step 5: Encode Through VAE (Track B Setup)

We pass each 9-dim encoding through a pre-trained **WaveNet VAE** (from the [osu-dreamer](https://github.com/jaswon/osu-dreamer) project). This compresses the 9-dim signal into a 32-dim latent representation:

```
[9, T] ──→ VAE encoder ──→ [32, T/18]
```

The temporal resolution drops by ~18x (the model's downsampling factor). Track B asks: **do these learned dimensions also show cross-mapper agreement?** A latent dimension with a high Pearson r suggests the VAE has captured something perceptually meaningful, even though it was never explicitly trained on cross-mapper data.
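
To make the resolution change concrete, the sketch below works through the arithmetic for a 3-minute song using the approximate figures quoted in this notebook (172 fps, 18x). It assumes exact floor division, which is an assumption: the real encoder may pad or round differently. `latent_frames` is a hypothetical helper.

```python
FRAME_RATE_HZ = 172   # approximate frame rate quoted in Step 3
DOWNSAMPLE = 18       # approximate encoder downsampling factor

def latent_frames(T, factor=DOWNSAMPLE):
    """Estimated latent length, assuming floor division (the encoder may differ)."""
    return T // factor

T = 3 * 60 * FRAME_RATE_HZ                    # 3-minute song: 30960 input frames
ms_per_latent_frame = 1000 * DOWNSAMPLE / FRAME_RATE_HZ

print(latent_frames(T))                       # 1720
print(round(ms_per_latent_frame, 1))          # 104.7
```

Under these figures, each latent frame summarizes roughly a tenth of a second of the beatmap.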

In [None]:
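# Commented out so the notebook renders without torch / osu-dreamer installed.
# Uncomment both imports before calling encode_latent().
# import torch
# from osu_dreamer.latent_model.model import Model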

def encode_latent(encodings_9dim, checkpoint_path):
    """Pass 9-dim encodings through the VAE encoder → 32-dim latent."""
    model = Model.load_from_checkpoint(checkpoint_path)
    model.eval()
    
    # Use Apple Silicon (MPS), CUDA, or CPU
    device = "mps" if torch.backends.mps.is_available() else \
             "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    
    latent_encodings = []
    for enc in encodings_9dim:
        x = torch.tensor(enc["data"]).float().unsqueeze(0).to(device)  # [1, 9, T]
        with torch.no_grad():
            z = model.encode(x)  # [1, 32, T/18]
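        # Spread enc first so the latent array replaces the 9-dim "data" entry
        # (with **enc last, the original 9-dim array would overwrite the latent one)
        latent_encodings.append({**enc, "data": z.squeeze(0).cpu().numpy()})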
    
    return latent_encodings

## Step 6: Track B — Cross-Mapper Comparison (32-dim latent)

Same methodology as Track A, but on the 32-dim latent space. We compute Pearson r for each of the 32 latent dimensions across all cross-mapper pairs.

The logic is the same as `cross_mapper_analysis()` above. The only changes are that the 9 names in `DIM_NAMES` are swapped for 32 latent labels and the latent encodings are used as input.

In [None]:
N_LATENT = 32

# Track B uses the exact same cross-mapper comparison as Track A,
# just operating on the 32-dim latent encodings instead of 9-dim.
#
# The output classifies each latent dimension:
#   z_12  r=0.41  [PERCEPTUAL]  ← this latent dim captures something universal
#   z_05  r=0.22  [MIXED]
#   z_28  r=0.01  [STYLISTIC]   ← this dim captures mapper-specific style
#
# Key finding: some VAE dimensions DO show cross-mapper agreement,
# meaning the model learned perceptually meaningful features from data alone.
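
A minimal sketch of that comparison, generalized to any list of dimension names, is given below. `safe_pearson` and `cross_mapper_means` are hypothetical names; the logic mirrors Track A, except that lengths are trimmed per pair rather than per song. The toy data is synthetic and only demonstrates the mechanics, not real results.

```python
import numpy as np
from itertools import combinations

def safe_pearson(a, b):
    """Pearson r, returning 0.0 for constant or too-short signals."""
    if len(a) < 2 or np.std(a) == 0 or np.std(b) == 0:
        return 0.0
    return float(np.corrcoef(a, b)[0, 1])

def cross_mapper_means(encodings_by_song, dim_names):
    """Mean Pearson r per dimension over all cross-mapper pairs."""
    per_dim = {name: [] for name in dim_names}
    for song_encodings in encodings_by_song.values():
        for a, b in combinations(song_encodings, 2):
            if a["creator"] == b["creator"]:
                continue  # skip same-mapper pairs
            T = min(a["data"].shape[1], b["data"].shape[1])
            for d, name in enumerate(dim_names):
                per_dim[name].append(safe_pearson(a["data"][d, :T], b["data"][d, :T]))
    return {name: float(np.mean(rs)) for name, rs in per_dim.items() if rs}

# Synthetic toy data: dimension 0 is shared between mappers, the rest is noise
rng = np.random.default_rng(0)
shared = rng.normal(size=200)
latent_names = [f"z_{i:02d}" for i in range(32)]

def toy_latent(creator):
    data = rng.normal(size=(32, 200))
    data[0] = shared + 0.1 * rng.normal(size=200)
    return {"creator": creator, "data": data}

toy = {0: [toy_latent("mapper_a"), toy_latent("mapper_b")]}
means = cross_mapper_means(toy, latent_names)
print(means["z_00"] > 0.9)  # True: the shared dimension agrees strongly
```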

## Step 7: Synthesis — Connecting the Two Spaces

Finally, we build a **32 x 9 correlation matrix** to understand what each latent dimension encodes in terms of the interpretable dimensions.

For each beatmap that has both encodings, we:
1. Downsample the 9-dim signal to match the latent temporal resolution (average 18-frame blocks)
2. Compute Pearson r between each latent dim and each interpretable dim
3. Average across all beatmaps

This tells us: if latent dimension z_12 is perceptual (high cross-mapper r), **what** does it encode? Maybe it correlates strongly with ONSET — meaning the VAE independently learned to represent hit timing.

In [None]:
def build_correlation_matrix(encodings_9dim, encodings_latent):
    """
    Build a [32 x 9] matrix: correlation between each latent dim
    and each interpretable dim, averaged across all beatmaps.
    """
    corr_matrix = np.zeros((N_LATENT, len(DIM_NAMES)))
    count = 0
    
    for enc_9, enc_z in zip(encodings_9dim, encodings_latent):
        T_latent = enc_z["data"].shape[1]
        n_dims = len(DIM_NAMES)
        
        # Downsample 9-dim to match latent temporal resolution by averaging
        # 18-frame blocks (the VAE's encoder downsamples by ~18x)
        enc_9_downsampled = np.zeros((n_dims, T_latent))
        for t in range(T_latent):
            block = enc_9["data"][:, t * 18:(t + 1) * 18]  # slicing clamps at the end
            if block.shape[1] > 0:  # leave 0 if the latent runs past the 9-dim signal
                enc_9_downsampled[:, t] = block.mean(axis=1)
        
        # Correlate each latent dim with each interpretable dim.
        # Undefined correlations (constant signals) add nothing, so they count as 0 in the average.
        for z_dim in range(N_LATENT):
            for i_dim in range(n_dims):
                z_signal = enc_z["data"][z_dim]
                i_signal = enc_9_downsampled[i_dim]
                if np.std(z_signal) > 0 and np.std(i_signal) > 0:
                    r = np.corrcoef(z_signal, i_signal)[0, 1]
                    if not np.isnan(r):
                        corr_matrix[z_dim, i_dim] += r
        count += 1
    
    if count > 0:
        corr_matrix /= count
    
    return corr_matrix  # [32, 9] — rows are latent dims, columns are interpretable dims
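
One way to read the resulting matrix is to look up, for each latent dimension, the interpretable dimension with the largest absolute correlation. The sketch below does this on a toy matrix. `strongest_match` is a hypothetical helper, and the values 0.62 and -0.40 are made-up placeholders, not results from the study.

```python
import numpy as np

DIM_NAMES = ["ONSET", "COMBO", "SLIDE", "SUSTAIN", "WHISTLE", "FINISH", "CLAP", "X", "Y"]

def strongest_match(corr_matrix, dim_names=DIM_NAMES):
    """For each latent dim, return (interpretable dim name, r) with the largest |r|."""
    best = np.abs(corr_matrix).argmax(axis=1)
    return [(dim_names[j], float(corr_matrix[i, j])) for i, j in enumerate(best)]

# Toy [32, 9] matrix with two placeholder entries
toy_matrix = np.zeros((32, 9))
toy_matrix[12, 0] = 0.62    # z_12 vs ONSET
toy_matrix[5, 2] = -0.40    # z_05 vs SLIDE (negative correlations count too)

matches = strongest_match(toy_matrix)
print(matches[12])  # ('ONSET', 0.62)
print(matches[5])   # ('SLIDE', -0.4)
```

Using the absolute value matters because a latent dimension can encode a feature with flipped sign. Rows that are all zero fall back to the first column, so a low `r` in the result should be read as "no clear match".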

## Putting It All Together

The full pipeline runs these steps in sequence:

```
┌─────────────────────────────────────────────────────┐
│  .osz files (beatmap archives)                      │
└──────────────────────┬──────────────────────────────┘
                       │
            Step 1: Find multi-mapper songs
            Step 2: Parse metadata, select representatives
                       │
            Step 3: Encode to 9-dim signals
                       │
              ┌────────┴────────┐
              │                 │
     Step 4: Track A      Step 5: VAE encode
     (9-dim Pearson r)    (9-dim → 32-dim latent)
              │                 │
              │            Step 6: Track B
              │            (32-dim Pearson r)
              │                 │
              └────────┬────────┘
                       │
            Step 7: Synthesis
            (32 x 9 correlation matrix)
```

Track B is optional: if no VAE checkpoint is provided, only Track A runs, because Steps 5 through 7 all depend on the latent encodings.

## Running It

**GUI** (recommended — interactive results viewer):
```bash
git clone -b core-pipeline https://github.com/recxa/osu-dreamer-cartography.git
cd osu-dreamer-cartography
python3 -m venv .venv && source .venv/bin/activate
pip install -e .
python run_gui.py
```

**CLI** (headless, for scripting):
```bash
python experiment/run_analysis.py /path/to/your/osz/files/
```

Results are saved to `experiment/output/results/` as JSON + PNG plots.

---

*Pipeline code: [recxa/osu-dreamer-cartography](https://github.com/recxa/osu-dreamer-cartography) (core-pipeline branch)*