CERN Open Data NanoAOD Test Workspace

Lightweight local workspace for downloading, inspecting, extracting, plotting, and converting a few CMS NanoAOD ROOT files from CERN Open Data. This is intentionally scoped for research prototyping, not CERN-scale production processing.

Primary starter dataset:

  • CMS Open Data record: https://opendata.cern.ch/record/30563
  • Dataset: /SingleMuon/Run2016H-UL2016_MiniAODv2_NanoAODv9-v1/NANOAOD
  • DOI: 10.7483/OPENDATA.CMS.4BUS.64MV
  • The default sample file is the CMS Open Data workshop Run2016H/SingleMuon/NANOAOD example.

Structure

cern_workspace/
|-- data/
|   |-- adoc/
|   |-- raw/
|   |-- processed/
|   `-- parquet/
|-- docs/
|-- notebooks/
|-- src/
|   |-- ingestion/
|   |-- physics/
|   |-- features/
|   |-- utils/
|   `-- visualization/
|-- scripts/
|   |-- download_sample.py
|   |-- inspect_root.py
|   |-- extract_muons.py
|   `-- convert_to_parquet.py
|-- outputs/
|-- requirements.txt
|-- README.md
`-- .gitignore

Setup

Python 3.11 or newer is recommended.

Linux/macOS:

cd cern_workspace
python3.11 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -r requirements.txt

Windows PowerShell with Python 3.11 installed:

cd cern_workspace
py -3.11 -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
python -m pip install -r requirements.txt

If py -3.11 is not available but python points to Python 3.11 or newer:

cd cern_workspace
python -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
python -m pip install -r requirements.txt

Optional future packages are listed but commented out in requirements.txt: coffea, dask, and polars.

Example Workflow

Inspect a remote NanoAOD file through the HTTPS EOS endpoint:

python scripts/inspect_root.py --limit 60

If your Windows network has a custom/self-signed certificate chain and HTTPS reads fail with CERTIFICATE_VERIFY_FAILED, allow insecure SSL for this test session:

$env:CERN_ALLOW_INSECURE_SSL = "1"
python scripts/inspect_root.py --limit 60

Download the default sample file to data/raw/:

python scripts/download_sample.py

Inspect the local file:

python scripts/inspect_root.py --input data/raw/61FC1E38-F75C-6B44-AD19-A9894155874E.root

Extract a few chunked batches and save a muon pT histogram:

python scripts/extract_muons.py \
  --input data/raw/61FC1E38-F75C-6B44-AD19-A9894155874E.root \
  --step-size "25 MB" \
  --max-batches 2 \
  --output outputs/muon_pt.png
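The script can keep memory flat by filling one histogram incrementally, batch by batch, rather than loading all events at once. Below is a dependency-free sketch of that accumulate-per-batch pattern; the bin edges and pT values are made-up illustration data, not the real NanoAOD branches, and the actual extract_muons.py may bin differently (e.g. with hist/numpy).

```python
def fill_histogram(counts, edges, values):
    """Add each value to its bin; under/overflow values are dropped."""
    for v in values:
        for i in range(len(edges) - 1):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return counts

edges = [0, 25, 50, 100]   # illustrative GeV bin edges
counts = [0, 0, 0]
# Each inner list stands in for one chunked batch of Muon_pt values.
for batch in ([12.0, 30.0], [70.0, 5.0, 55.0]):
    fill_histogram(counts, edges, batch)
print(counts)  # [2, 1, 2]
```

Because the histogram is the only state carried between batches, peak memory is set by --step-size, not by the file size.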

Convert selected branches to chunked Parquet:

python scripts/convert_to_parquet.py \
  --input data/raw/61FC1E38-F75C-6B44-AD19-A9894155874E.root \
  --branches Muon_pt Muon_eta Muon_phi Muon_mass MET_pt Jet_pt \
  --step-size "25 MB" \
  --max-batches 2 \
  --output-dir data/parquet
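Writing one part file per batch keeps the conversion streaming-friendly. The sketch below shows that loop shape only; json stands in for Parquet so the example stays dependency-free, and the part-file naming is an assumption, not necessarily what convert_to_parquet.py produces (the real script would hand each batch to a Parquet writer such as pyarrow).

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

def write_parts(batches, output_dir):
    """Write each batch to its own numbered part file."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, batch in enumerate(batches):
        path = out / f"part-{i:04d}.json"   # .parquet in the real flow
        path.write_text(json.dumps(batch))
        paths.append(path)
    return paths

with TemporaryDirectory() as tmp:
    parts = write_parts([{"Muon_pt": [31.2, 12.9]}, {"Muon_pt": [54.0]}], tmp)
    print([p.name for p in parts])  # ['part-0000.json', 'part-0001.json']
```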

Adoc Manual File Workflow

Use data/adoc/ for ROOT files you downloaded manually for testing:

data/adoc/my_test_nanoaod.root

The same scripts work with Adoc files:

python scripts/inspect_root.py --input data/adoc/my_test_nanoaod.root --limit 80
python scripts/extract_muons.py --input data/adoc/my_test_nanoaod.root --max-batches 2 --output outputs/adoc_muon_pt.png
python scripts/convert_to_parquet.py --input data/adoc/my_test_nanoaod.root --max-batches 2 --output-dir data/parquet/adoc_test

Notebook:

notebooks/adoc_basic_etl.ipynb

Detailed doc:

docs/ADOC_WORKFLOW.md

Memory Model

The scripts use uproot tree iteration and selected branch reads. Defaults are deliberately small:

  • only selected branches are read,
  • event data is processed in batches,
  • --max-batches limits early exploration,
  • Parquet output is written as separate part files.

Increase --step-size and --max-batches only after validating local memory and disk usage.

Modules

  • src/ingestion/root_io.py: open ROOT files, list branches, iterate selected event batches.
  • src/ingestion/samples.py: dataset metadata and default sample URLs.
  • src/physics/muons.py: small muon summary helpers.
  • src/visualization/histograms.py: CMS-style histogram plotting with hist and mplhep.
  • src/features/parquet.py: chunked ROOT-to-Parquet conversion.
  • src/lakehouse/: PySpark ETL for the muon_db bronze, silver, and gold Parquet lakehouse.
  • src/utils/paths.py: portable workspace-relative paths.
  • src/utils/logging_config.py: consistent logging setup.
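A helper in the spirit of src/utils/paths.py might look like the sketch below: all scripts join onto one workspace anchor so relative paths resolve the same way from any working directory. The anchor and function name here are assumptions for illustration; the real module likely derives the root from its own file location.

```python
from pathlib import Path

# Assumed anchor; a real paths module would typically use
# Path(__file__).resolve().parents[n] instead of the current directory.
WORKSPACE_ROOT = Path.cwd()

def workspace_path(*parts: str) -> Path:
    """Join path segments onto the workspace root."""
    return WORKSPACE_ROOT.joinpath(*parts)

print(workspace_path("data", "raw"))
```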

Future Extension Points

The package layout leaves room for:

  • Delphes simulation output ingestion,
  • MadGraph/Pythia generated event workflows,
  • feature extraction for ML and GNN models,
  • event graph construction,
  • Parquet lakehouse partitioning,
  • Airflow or other orchestration layers,
  • distributed execution with Dask or Coffea.

muon_db PySpark Lakehouse ETL

After converting ROOT files to Parquet, build the data-engineering lakehouse with:

python scripts/run_lakehouse_etl.py --source-parquet data/parquet/adoc_test --output-root data/muon_db

This writes partitioned Parquet tables under data/muon_db/bronze, data/muon_db/silver, and data/muon_db/gold. Local PySpark requires Java and a configured JAVA_HOME. See docs/MUON_DB_LAKEHOUSE.md for table details and scope.
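Partitioned Parquet tables use Hive-style key=value directories, which is what Spark's DataFrameWriter.partitionBy produces. The sketch below shows only that directory layout in plain Python; the partition column and run numbers are illustrative assumptions, not the actual muon_db schema.

```python
from collections import defaultdict

def hive_partitions(rows, key):
    """Group rows under Hive-style key=value partition directory names.

    Layout sketch only; the real job delegates this to Spark's
    DataFrameWriter.partitionBy when writing bronze/silver/gold tables.
    """
    parts = defaultdict(list)
    for row in rows:
        parts[f"{key}={row[key]}"].append(row)
    return dict(parts)

rows = [{"run": 283171, "pt": 31.2}, {"run": 283171, "pt": 12.9},
        {"run": 283270, "pt": 54.0}]
layout = hive_partitions(rows, "run")
print(sorted(layout))  # ['run=283171', 'run=283270']
```

Partition pruning then lets queries that filter on the partition column skip whole directories instead of scanning every part file.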

Query or visualize the generated tables with:

python scripts/query_muon_db.py --list
python scripts/query_muon_db.py --table event_summary --limit 10
python scripts/plot_muon_db.py --plot dimuon_mass
python sandbox/muon_db_playground.py

See docs/MUON_DB_ACCESS.md for the full access and sandbox structure.

Notebook sandboxes are also available:

notebooks/muon_db_00_catalog_and_layers.ipynb
notebooks/muon_db_01_sql_sandbox.ipynb
notebooks/muon_db_02_visualization_sandbox.ipynb

Keep future additions modular and batch-oriented so exploratory scripts do not turn into full-scale processing jobs by accident.
