# Exploration (Colab-ready)

This notebook automatically sets up the repository in Google Colab and runs the project.

**Assumption:** the entry script lives at `scripts/cyclotron_run.py`.

## What it does
1. Clone or pull the GitHub repo
2. Install dependencies + editable install
3. Add `src/` and `scripts/` to `sys.path`
4. Run the pipeline via `from cyclotron_run import run` (or CLI)


In [None]:
# =========================
# 1) Clone / pull + cd
# =========================
import os, subprocess, pathlib

# --- EDIT THESE IF NEEDED ---
REPO_URL = "https://github.com/tommasocarzaniga/CNM_CycNucMed.git"
REPO_DIR = "CNM_CycNucMed"

def sh(cmd, cwd=None):
    print(">>", cmd)
    subprocess.check_call(cmd, shell=True, cwd=cwd)

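# If cloning fails with an authentication error, the repo is probably private.
# In that case, clone with a GitHub Personal Access Token (PAT). Call subprocess
# directly instead of sh(), because sh() prints the command (and so the token):
#   from getpass import getpass
#   token = getpass('GitHub PAT (will not be shown): ').strip()
#   subprocess.check_call(["git", "clone", f"https://{token}@github.com/tommasocarzaniga/CNM_CycNucMed.git", REPO_DIR])

if not os.path.exists(REPO_DIR):
    # Name the target directory explicitly so it always matches REPO_DIR
    sh(f"git clone {REPO_URL} {REPO_DIR}")
else:
    sh("git pull", cwd=REPO_DIR)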

%cd {REPO_DIR}
print("Now in:", pathlib.Path().resolve())
sh("ls -la")


In [None]:
# =========================
# 2) Install dependencies + package (editable)
# =========================
import sys, subprocess, pathlib

def pip(*args):
    # Pass arguments as a list so paths containing spaces are not split
    print(">> pip", " ".join(args))
    subprocess.check_call([sys.executable, "-m", "pip", *args])

pip("install", "-U", "pip", "setuptools", "wheel")

root = pathlib.Path('.').resolve()
if (root / 'requirements.txt').exists():
    pip("install", "-r", str(root / 'requirements.txt'))

# Editable install (uses pyproject.toml if present)
pip("install", "-e", ".")


In [None]:
# =========================
# 3) Ensure imports work (src/ + scripts/)
# =========================
import sys, pathlib

ROOT = pathlib.Path('.').resolve()
SRC = ROOT / 'src'
SCRIPTS = ROOT / 'scripts'

for p in [ROOT, SRC, SCRIPTS]:
    p = str(p)
    if p not in sys.path:
        sys.path.insert(0, p)

print('sys.path now includes:')
print(' -', ROOT)
print(' -', SRC)
print(' -', SCRIPTS)

# Quick sanity checks
import iaea_project
print('Imported iaea_project from:', iaea_project.__file__)
print('scripts/cyclotron_run.py exists:', (SCRIPTS / 'cyclotron_run.py').exists())


In [None]:
# =========================
# 4) Run the pipeline
# =========================

# Option A (recommended in notebooks): import and call
from cyclotron_run import run

# Run all countries
# pdf_path = run()

# Run selected countries
pdf_path = run(["Switzerland"])  # edit as needed

print('Generated PDF at:', pdf_path)

# Option B (CLI style):
# !python scripts/cyclotron_run.py --countries Switzerland Germany


## 0. Automatic Setup (Local or Google Colab)

Run the next cell first. It detects whether you are running locally or in Google Colab, installs dependencies if needed, and exposes the single entry point `run()` for reproducing the pipeline.


In [None]:
# --- Automatic environment setup (Local or Colab) ---

import sys
from pathlib import Path

# Detect if running in Google Colab
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    print('Running in Google Colab')

    # The project is cloned from GitHub below. If you received it as a zip instead,
    # upload and unzip it, %cd into the folder, and comment out the next two lines.
    !git clone https://github.com/tommasocarzaniga/CNM_CycNucMed.git
    %cd CNM_CycNucMed

    # Install dependencies
    !pip -q install -r requirements.txt
    !pip -q install -e .

    # After %cd, the working directory is the project root
    PROJECT_ROOT = Path().resolve()
else:
    print('Running locally')

    # Local Jupyter starts in notebooks/, so the project root is one level up
    PROJECT_ROOT = Path().resolve().parent

# Make the project root and scripts/ (home of cyclotron_run.py) importable
for p in (PROJECT_ROOT, PROJECT_ROOT / 'scripts'):
    if str(p) not in sys.path:
        sys.path.insert(0, str(p))

from cyclotron_run import run

print('Setup complete. You can now call run()')


# IAEA Cyclotron Project – Professor Guide (Jupyter Notebook)

This notebook walks through a **reproducible, modular** Python project that:

1. **Scrapes** the IAEA cyclotron table (Playwright)
2. **Cleans & canonicalizes** key fields (countries, manufacturers)
3. **Analyzes** global and per‑country patterns
4. **Optionally maps** locations
5. **Builds a PDF report** with global overview + per‑country executive summaries

Project layout (important for grading):

- `src/iaea_project/` – reusable modules (scrape / clean / analyze / plot / PDF)
- `scripts/` – entry points (usable from the CLI); the recommended single entry point is `run()` in `scripts/cyclotron_run.py`
- `data/` – raw + processed datasets
- `outputs/` – figures + reports
- `tests/` – unit tests


## 1) Setup

Run the next cell **once** to install dependencies and install the package in editable mode.

Notes:
- Playwright requires a one-time browser install.
- Map generation uses optional geospatial libraries; if they fail to install in your environment, you can still run everything else with `--disable-maps` (CLI) or `run(disable_maps=True)` (Python).
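
One way to decide up front whether maps should be disabled is to check that the optional libraries can be found before running the pipeline. The sketch below is illustrative only: the module names in `OPTIONAL_MAP_MODULES` are placeholders (an assumption, not taken from this project), so replace them with the geospatial packages actually listed in the project's requirements.

```python
import importlib.util

# ASSUMPTION: placeholder names; replace with the geospatial packages
# that the project's requirements actually list.
OPTIONAL_MAP_MODULES = ["geopandas", "folium"]

def missing_modules(names):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_modules(OPTIONAL_MAP_MODULES)
disable_maps = bool(missing)  # disable maps if any optional module is absent

print("Missing map dependencies:", missing or "none")
print("Suggested setting: disable_maps =", disable_maps)
```

The resulting `disable_maps` value can then be passed to `run(disable_maps=disable_maps)`.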


In [None]:
!pip -q install -r ../requirements.txt
!pip -q install -e ..

# Playwright browser install (one-time)
!playwright install chromium

## 2) Quick smoke test (no scraping)

If the raw CSV exists already, we can quickly test the cleaning pipeline without scraping.


In [None]:
# Put the project root and scripts/ on sys.path so we can import cyclotron_run
import sys
from pathlib import Path

PROJECT_ROOT = Path().resolve().parent  # notebooks/ -> project root
for p in (PROJECT_ROOT, PROJECT_ROOT / 'scripts'):
    if str(p) not in sys.path:
        sys.path.insert(0, str(p))

from cyclotron_run import run
print('Project root:', PROJECT_ROOT)


In [None]:
from pathlib import Path
import pandas as pd

raw_csv = Path('..') / 'data' / 'raw' / 'iaea_cyclotrons_raw.csv'
raw_csv.exists(), raw_csv

In [None]:
from iaea_project.cleaning import clean_cyclotron_df

if raw_csv.exists():
    df_raw = pd.read_csv(raw_csv)
    df_clean = clean_cyclotron_df(df_raw)
    display(df_clean.head())  # a bare expression inside an if-block is not rendered
else:
    print('Raw CSV not found yet. Run the full pipeline (section 4) to scrape & create it.')

## 3) Run the included test script

This script reads the raw CSV (if present), cleans it, and prints a small preview.


In [None]:
!python ../scripts/run_test.py

## 4) Run the full pipeline (scrape → clean → analyze → PDF)

This will:
- scrape the IAEA cyclotron page
- save raw CSV to `data/raw/`
- clean & save processed CSV to `data/processed/`
- create figures in `outputs/figures/`
- build a PDF report in `outputs/reports/`

By default, it includes executive summaries for the **top 10 countries** by row count.


In [None]:
# Run the full pipeline
run()

### 4.1) Customize which countries appear in the PDF

Examples (Python API):

- **All countries** (default):
  - `run()`

- **Choose specific countries**:
  - `run(['Switzerland', 'Germany', 'Italy'])`

- **Top N countries** (by row count in the dataset):
  - `run(top_n_countries=20)`

- **Disable maps** (if geospatial dependencies are missing):
  - `run(disable_maps=True)`

- **Skip scraping** and reuse existing raw CSV:
  - `run(skip_scrape=True)`
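
The `skip_scrape` choice can be derived from the state of the working tree instead of being set by hand. The following is a minimal sketch, assuming the raw CSV lives at `data/raw/iaea_cyclotrons_raw.csv` under the project root (the path used elsewhere in this notebook); the helper name `suggest_skip_scrape` is hypothetical and not part of the project.

```python
from pathlib import Path

def suggest_skip_scrape(project_root):
    """Return True when a non-empty raw CSV already exists under project_root."""
    raw_csv = Path(project_root) / "data" / "raw" / "iaea_cyclotrons_raw.csv"
    return raw_csv.is_file() and raw_csv.stat().st_size > 0

# When working from notebooks/, the project root is one level up.
skip_scrape = suggest_skip_scrape("..")
print("skip_scrape =", skip_scrape)

# Then, for example:
# run(skip_scrape=skip_scrape)
```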


In [None]:
# Example: only these countries, no maps
run(['Switzerland','Germany','Italy'], disable_maps=True, out_pdf='outputs/reports/IAEA_Cyclotron_Report_selected.pdf')

## 5) Optional: Enable LLM assistance (demonstration)

The project is designed to work without any LLM.

If you want to demonstrate controlled, *pluggable* LLM usage:
- Install the OpenAI SDK: `pip install openai`
- Set `OPENAI_API_KEY` in your environment
- Run `python scripts/cyclotron_run.py --enable-llm` (or use `run(enable_llm=True, ...)` in Python)

The LLM is used only as a **fallback** for:
1) rare country strings that do not map deterministically
2) manufacturer names, which it canonicalizes by choosing from a fixed canonical list

All results are cached in `outputs/cache/` to ensure reproducibility across runs.
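
The "deterministic first, fallback second, cache the answer" pattern described above can be sketched as follows. This is not the project's implementation: `canonicalize`, `KNOWN`, `fake_fallback`, and the JSON cache layout are all hypothetical names chosen for illustration, and the fallback is a stub standing in for a real LLM call.

```python
import json
import tempfile
from pathlib import Path

def canonicalize(raw, known, fallback, cache_path):
    """Map a raw string to a canonical name.

    Order of resolution: deterministic lookup, on-disk cache, fallback().
    """
    key = raw.strip().lower()
    if key in known:  # deterministic path, the fallback is never called
        return known[key]

    cache_path = Path(cache_path)
    cache = {}
    if cache_path.exists():
        cache = json.loads(cache_path.read_text(encoding="utf-8"))

    if key not in cache:
        cache[key] = fallback(raw)  # hypothetical stand-in for an LLM call
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        cache_path.write_text(
            json.dumps(cache, indent=2, sort_keys=True), encoding="utf-8"
        )
    return cache[key]

# Hypothetical deterministic table and a stub fallback that records its calls.
KNOWN = {"switzerland": "Switzerland", "schweiz": "Switzerland"}
calls = []

def fake_fallback(raw):
    calls.append(raw)
    return "Germany"

with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "cache" / "countries.json"
    print(canonicalize(" Schweiz ", KNOWN, fake_fallback, cache_file))    # deterministic
    print(canonicalize("Deutschland", KNOWN, fake_fallback, cache_file))  # fallback, then cached
    print(canonicalize("Deutschland", KNOWN, fake_fallback, cache_file))  # served from the cache
    print("fallback calls:", len(calls))
```

Because the second lookup of the same string is answered from the cache file, a rerun with the same cache gives the same result without calling the fallback again, which is the property that makes cached LLM output reproducible.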


In [None]:
# Optional LLM demonstration (requires OPENAI_API_KEY)
# Uncomment to run:
# run(enable_llm=True, llm_model='gpt-4.1-mini', disable_maps=True, out_pdf='outputs/reports/IAEA_Cyclotron_Report_llm_demo.pdf')

## 6) Inspect outputs

After running the pipeline you should see:
- `data/raw/iaea_cyclotrons_raw.csv`
- `data/processed/iaea_cyclotrons_clean.csv`
- `outputs/reports/IAEA_Cyclotron_Report.pdf`


In [None]:
from pathlib import Path

base = Path('..')
paths = [
    base/'data/raw/iaea_cyclotrons_raw.csv',
    base/'data/processed/iaea_cyclotrons_clean.csv',
    base/'outputs/reports/IAEA_Cyclotron_Report.pdf',
]
[(p, p.exists(), p.stat().st_size if p.exists() else None) for p in paths]

## 7) Minimal analysis inside the notebook

Here we load the processed data and reproduce a few summary tables.


In [None]:
from pathlib import Path

import pandas as pd
from iaea_project.analysis import global_comparison_tables, country_summary

clean_csv = Path('..') / 'data' / 'processed' / 'iaea_cyclotrons_clean.csv'
df = pd.read_csv(clean_csv)

top_countries, top_manu, energy_country = global_comparison_tables(df)

top_countries.head(10), top_manu.head(10)

In [None]:
# Example country summary
cs = country_summary(df, 'Switzerland', top_n=10)
cs['total_cyclotrons'], cs['cities_top'].head(10)

## 8) Run unit tests

This verifies a few cleaning functions and guards against regressions.


In [None]:
!pytest -q ../tests