# **360x Dataset EDA**

Quick exploratory checks on the 360x panoramic dataset: verify structure, inspect sample entries, and capture basic summary stats.

## 1. Resolve dataset root

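Uses the same resolution logic as the quickstart notebook: prefer the `CAPTIONQA_DATASETS` environment variable; otherwise search upward from the current working directory for the repository root (the first directory containing `pyproject.toml` or `.git`) and use its `datasets` folder. If no repository root is found, fall back to `./datasets`.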

In [1]:
from pathlib import Path
from itertools import islice
import json
import os

def resolve_dataset_root() -> Path:
    env_root = os.environ.get('CAPTIONQA_DATASETS')
    if env_root:
        return Path(env_root).expanduser().resolve()
    cwd = Path.cwd()
    for candidate in [cwd, *cwd.parents]:
        if (candidate / 'pyproject.toml').exists() or (candidate / '.git').exists():
            return (candidate / 'datasets').resolve()
    return (cwd / 'datasets').resolve()

DATASET_ROOT = resolve_dataset_root()
print(f'Dataset root: {DATASET_ROOT}')
HR_ROOT = DATASET_ROOT / '360x' / '360x_dataset_HR'
LR_ROOT = DATASET_ROOT / '360x' / '360x_dataset_LR'
print(f'HR root: {HR_ROOT}')
print(f'LR root: {LR_ROOT}')

Dataset root: D:\CaptionQA\data
HR root: D:\CaptionQA\data\360x\360x_dataset_HR
LR root: D:\CaptionQA\data\360x\360x_dataset_LR
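
When the resolved root is not what you expect, it can help to see which branch of the logic produced it. In the recorded output the root ends in `data` rather than `datasets`, which suggests the environment variable was set for that run. The sketch below mirrors `resolve_dataset_root` and also returns a label for the branch taken; `explain_dataset_root` and its labels (`'env'`, `'repo'`, `'cwd'`) are hypothetical names introduced here for illustration, not part of the `captionqa` package.

```python
import os
from pathlib import Path
from typing import Tuple


def explain_dataset_root(env_var: str = 'CAPTIONQA_DATASETS') -> Tuple[str, Path]:
    """Return (source, path); source names the branch that produced the path."""
    env_root = os.environ.get(env_var)
    if env_root:
        # Branch 1: explicit override through the environment variable.
        return 'env', Path(env_root).expanduser().resolve()
    cwd = Path.cwd()
    for candidate in [cwd, *cwd.parents]:
        # Branch 2: first ancestor that looks like a repository root.
        if (candidate / 'pyproject.toml').exists() or (candidate / '.git').exists():
            return 'repo', (candidate / 'datasets').resolve()
    # Branch 3: no repository root found; use ./datasets.
    return 'cwd', (cwd / 'datasets').resolve()


source, root = explain_dataset_root()
print(f'Dataset root resolved via {source}: {root}')
```

To override the location for a single session, set `os.environ['CAPTIONQA_DATASETS']` before running the cell that calls `resolve_dataset_root()`.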


## 2. Validate presence & summarize

Run the next cell to verify the expected folder layout and list up to 10 entries per split. If a split is missing, the cell prints the download command instead of raising an error.

In [4]:
def describe_directory(root: Path, label: str, limit: int = 10):
    print(f'\n{label} -> {root}')
    if not root.exists():
        print('[missing] Not found. Download with: python -m captionqa.data.download 360x --output', DATASET_ROOT)
        return
    entries = sorted(root.iterdir())
    print(f'Total entries: {len(entries)}')
    for path in islice(entries, limit):
        print(' -', path.name)
    if len(entries) > limit:
        print(' ...')

describe_directory(HR_ROOT, 'High-resolution split')
describe_directory(LR_ROOT, 'Low-resolution split')


High-resolution split -> D:\CaptionQA\data\360x\360x_dataset_HR
Total entries: 5
 - .cache
 - .gitattributes
 - binocular
 - README.md
 - TAL_annotations

Low-resolution split -> D:\CaptionQA\data\360x\360x_dataset_LR
[missing] Not found. Download with: python -m captionqa.data.download 360x --output D:\CaptionQA\data
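
The listing above only covers the top level of each split. For a first summary statistic, one option is to count files by extension across the whole tree. This is a minimal sketch; `count_by_suffix` is a hypothetical helper, and grouping extensionless files (including dotfiles such as `.gitattributes`) under `'<none>'` is a choice made here for illustration.

```python
from collections import Counter
from pathlib import Path


def count_by_suffix(root: Path) -> Counter:
    """Count regular files under root, grouped by lowercased extension."""
    counts: Counter = Counter()
    if not root.exists():
        # Mirror the notebook's behaviour: a missing split is not an error.
        return counts
    for path in root.rglob('*'):
        if path.is_file():
            # Files without an extension are grouped under '<none>'.
            counts[path.suffix.lower() or '<none>'] += 1
    return counts
```

In the notebook, `count_by_suffix(HR_ROOT).most_common()` would return the extensions ordered from most to least frequent.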


## 3. Sample metadata (optional)

If JSON metadata files are available, the next cell loads the first match (in sorted path order) under each split and prints the first 2,000 characters of its pretty-printed contents. Pass a different `suffix` argument if the dataset uses a different file extension.

In [5]:
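def load_sample_metadata(root: Path, suffix: str = '.json'):
    """Preview the first file under root whose name ends with suffix."""
    if not root.exists():
        print('[skip] Root missing; nothing to load.')
        return
    for path in sorted(root.rglob(f'*{suffix}')):
        print(f'Previewing {path}')
        try:
            with path.open('r', encoding='utf-8') as handle:
                snippet = json.load(handle)
        except (OSError, ValueError) as exc:
            # ValueError covers json.JSONDecodeError and UnicodeDecodeError.
            print('Failed to parse JSON:', exc)
            return
        # Cap the preview so large annotation files do not flood the output.
        print(json.dumps(snippet, indent=2)[:2000])
        return  # only the first match is previewed
    print('[skip] No files matching suffix found.')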

load_sample_metadata(HR_ROOT)
load_sample_metadata(LR_ROOT)

Previewing D:\CaptionQA\data\360x\360x_dataset_HR\TAL_annotations\019cc67f-512f-4b8a-96ef-81f806c86ce1.json
{
  "file": {
    "1": {
      "fid": "1",
      "fname": "360_panoramic.mp4",
      "type": 4,
      "loc": 1,
      "src": ""
    }
  },
  "metadata": {
    "1_1_1": {
      "duration": [
        5.516,
        7.3905
      ],
      "action": {
        "1": "sitting"
      }
    },
    "1_2_2": {
      "duration": [
        12.312,
        17.47384
      ],
      "action": {
        "1": "drinking"
      }
    },
    "1_2_3": {
      "duration": [
        51.417,
        56.46075
      ],
      "action": {
        "1": "drinking"
      }
    },
    "1_2_4": {
      "duration": [
        235.334,
        237.98142
      ],
      "action": {
        "1": "drinking"
      }
    },
    "1_3_5": {
      "duration": [
        9.558,
        116.33559
      ],
      "action": {
        "1": "speaking"
      }
    },
    "1_3_6": {
      "duration": [
        135.211,
        167.50226
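
The preview above is cut off by the 2,000-character limit, but the visible part is enough to sketch a parser. The example below flattens the `metadata` block into a list of labelled segments. It assumes, based only on the preview, that `duration` is a `[start, end]` pair in seconds and that `action` maps an index to a label; `Segment` and `extract_segments` are hypothetical names, and the sample dictionary copies the first two entries shown above.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    key: str      # metadata key, e.g. '1_1_1'
    action: str   # action label, e.g. 'sitting'
    start: float  # assumed to be seconds from the start of the video
    end: float

    @property
    def length(self) -> float:
        return self.end - self.start


def extract_segments(annotation: dict) -> List[Segment]:
    """Flatten the 'metadata' block into one Segment per action label."""
    segments = []
    for key, entry in annotation.get('metadata', {}).items():
        duration = entry.get('duration', [])
        if len(duration) != 2:
            continue  # skip entries without a [start, end] pair
        start, end = duration
        for action in entry.get('action', {}).values():
            segments.append(Segment(key, action, float(start), float(end)))
    return segments


# First two entries copied from the preview above.
sample = {
    'metadata': {
        '1_1_1': {'duration': [5.516, 7.3905], 'action': {'1': 'sitting'}},
        '1_2_2': {'duration': [12.312, 17.47384], 'action': {'1': 'drinking'}},
    }
}

for seg in extract_segments(sample):
    print(f'{seg.key}: {seg.action} {seg.start:.3f}-{seg.end:.3f} ({seg.length:.3f}s)')
```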

## 4. Next steps

- Drill down into a representative panorama to inspect frame counts and available modalities (video, audio, annotations).
- Compute dataset-level aggregates (duration, resolution, annotation coverage).
- Integrate results into the main README or reporting pipeline once satisfied.
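
As a starting point for the annotation-coverage aggregate, the sketch below totals segment counts and annotated seconds per action across every JSON file in a directory. It rests on the same assumptions as the preview in section 3 (`duration` as a `[start, end]` pair in seconds, `action` as an index-to-label mapping); `summarize_actions` is a hypothetical helper, and whether all files in `TAL_annotations` share this schema has not been verified here.

```python
import json
from collections import defaultdict
from pathlib import Path
from typing import Dict


def summarize_actions(annotation_dir: Path) -> Dict[str, Dict[str, float]]:
    """Total segment count and annotated seconds per action label."""
    if not annotation_dir.is_dir():
        return {}
    totals = defaultdict(lambda: {'segments': 0, 'seconds': 0.0})
    for path in sorted(annotation_dir.glob('*.json')):
        try:
            with path.open('r', encoding='utf-8') as handle:
                data = json.load(handle)
        except (OSError, ValueError):
            continue  # unreadable or malformed file; skip it
        if not isinstance(data, dict):
            continue  # unexpected top-level structure
        for entry in data.get('metadata', {}).values():
            duration = entry.get('duration', [])
            if len(duration) != 2:
                continue  # no [start, end] pair to measure
            seconds = float(duration[1]) - float(duration[0])
            for action in entry.get('action', {}).values():
                totals[action]['segments'] += 1
                totals[action]['seconds'] += seconds
    return dict(totals)
```

In the notebook this could be called as `summarize_actions(HR_ROOT / 'TAL_annotations')`. Skipping malformed files silently keeps the sketch short; a fuller version would probably report how many files were skipped.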

> Tip: Run this notebook from the repo root with the `captionqa` uv environment activated so that imports resolve. Create the environment with `uv venv captionqa`, then activate it in PowerShell with `./captionqa/Scripts/Activate.ps1`.