# 00 — Setup & Sanity Checks

**What/Why**: Establish a *known‑good* environment (pinned versions), print system info, seed all RNGs, and validate dataset access.

You can run this on Colab (GPU recommended). Code cells carry **line‑by‑line comments**, and most end with a **How to read results** note.

## Install pinned dependencies (Colab)
Skip this cell if you have already created a local virtualenv and installed the packages from `requirements.txt`. On Colab, uncomment the lines and run the cell.

In [0]:
# %% If running on Google Colab, uncomment the next lines
# !pip install -q transformers==4.41.2 datasets==2.20.0 evaluate==0.4.2 accelerate==0.31.0
# Quote the torch spec: an unquoted '>=' is treated by the shell as an output redirect.
# !pip install -q "torch>=2.1" numpy==1.26.4 pandas==2.1.4 scipy==1.11.4 tqdm==4.66.4
# !pip install -q nltk==3.8.1 typer==0.12.3 rich==13.7.1 bert-score==0.3.13 rouge-score==0.1.2
# !pip install -q sentencepiece==0.2.0 sacrebleu==2.4.0 huggingface-hub==0.24.5 orjson==3.10.1
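
After installing, it can help to confirm that the environment actually matches the pins. The following is a minimal sketch, not part of the original notebook: `PINS`, `installed_version`, and `check_pins` are hypothetical names, and the dictionary copies only a few of the versions listed above (torch is left out because it is given as a range rather than an exact pin).

```python
from importlib import metadata

# Hypothetical subset of the pins from the install cell; extend as needed.
PINS = {
    "transformers": "4.41.2",
    "datasets": "2.20.0",
    "evaluate": "0.4.2",
    "accelerate": "0.31.0",
    "numpy": "1.26.4",
}

def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

def check_pins(pins, get_version=installed_version):
    """Return {package: (expected, found)} for every pin that does not match."""
    mismatches = {}
    for package, expected in pins.items():
        found = get_version(package)
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches

# Report any drift; no output means every listed pin matched.
for package, (expected, found) in check_pins(PINS).items():
    print(f"{package}: expected {expected}, found {found or 'not installed'}")
```

If the loop prints anything on Colab, restarting the runtime after the install cell is usually worth trying before investigating further.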

## Imports & environment summary

In [0]:
# Standard imports
import os, sys, json, random, time, platform, textwrap, pathlib

# Core libs
import numpy as np

# Hugging Face ecosystem
import datasets, transformers, evaluate

# Other tools
import nltk, torch

# Print versions for reproducibility
print('Python  :', sys.version.split()[0])
print('Platform:', platform.platform())
print('PyTorch :', torch.__version__)
print('Transformers:', transformers.__version__)
print('Datasets:', datasets.__version__)
print('Evaluate:', evaluate.__version__)
print('NLTK    :', nltk.__version__)

# How to read results:
# This shows the exact versions used in this session. Copy these to your report's appendix and logs.
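
Rather than copying the printed versions by hand, the same information can be written to a file. This is a sketch under assumptions: `collect_env` and `write_env` are hypothetical helpers, and the output path is whatever you choose (the notebook's `results/` folder is only created in the next cell).

```python
import json
import platform
import sys
from importlib import metadata

def collect_env(packages):
    """Gather interpreter, OS, and package versions into a plain dict."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }

def write_env(path, packages):
    """Write the environment summary to path as JSON and return it."""
    env = collect_env(packages)
    with open(path, "w") as f:
        json.dump(env, f, indent=2)
    return env

# Example call (path is an assumption; run after the folder exists):
# write_env("results/env.json", ["torch", "transformers", "datasets", "evaluate", "nltk"])
```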


## Create folders & global config

In [0]:
# Create standard folders (no error if they already exist)
for d in ['configs', 'runs', 'results']:
    os.makedirs(d, exist_ok=True)

# Fixed seeds for reproducibility (HF Trainer will also receive this seed)
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

# Create a tiny run config to track settings consistently
run_cfg = {
    'seed': SEED,
    'dataset_id': 'cnn_dailymail',
    'dataset_repo': 'ccdv/cnn_dailymail',
    'dataset_config': '3.0.0',
    'baseline_model': 'facebook/bart-base',
    'created_at': time.strftime('%Y-%m-%d %H:%M:%S'),
}
with open('configs/run.json', 'w') as f:
    json.dump(run_cfg, f, indent=2)

print('Wrote configs/run.json')

# How to read results:
# 'configs/run.json' records critical identifiers (dataset, model, seed). We will append to it later.
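
If later notebooks need to re-seed, the seeding lines above can be wrapped in one function. The sketch below is an illustration, not the notebook's own code: `seed_everything` and its `deterministic` flag are hypothetical, and NumPy and PyTorch are seeded only when they can be imported. The cuDNN flags trade speed for repeatability and do not guarantee bit-identical results across different hardware or library versions.

```python
import random

def seed_everything(seed, deterministic=False):
    """Seed Python's RNG, plus NumPy and PyTorch when they are importable."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
        if deterministic:
            # Prefer repeatable cuDNN kernels over the fastest ones.
            torch.backends.cudnn.deterministic = True
            torch.backends.cudnn.benchmark = False
    except ImportError:
        pass  # PyTorch not installed; nothing to seed
    return seed

seed_everything(42)
first = random.random()
seed_everything(42)
second = random.random()
print("Same draw after re-seeding:", first == second)
```

Transformers also ships `transformers.set_seed(seed)`, which covers the same three libraries; the wrapper here mainly adds the optional cuDNN switches.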


## Download NLTK data (tokenizers, etc.)

In [0]:
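# We use NLTK for tokenization in some metrics. This downloads small helper data.
# nltk.download returns False on failure instead of raising, so check the result.
ok_punkt = nltk.download('punkt', quiet=True)
ok_punkt_tab = nltk.download('punkt_tab', quiet=True)  # newer NLTK releases load 'punkt_tab' instead of 'punkt'
print('NLTK punkt    :', 'ok' if ok_punkt else 'FAILED')
print('NLTK punkt_tab:', 'ok' if ok_punkt_tab else 'FAILED')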

# How to read results:
# If downloads fail (offline), you can run metrics that don't require NLTK, or re-run with internet.


## Load CNN/DailyMail (config 3.0.0) and show 3 redacted examples

In [0]:
from datasets import load_dataset

# Load the dataset by the Hub id recorded as 'dataset_id' in configs/run.json
# (default split names: train/validation/test)
ds = load_dataset('cnn_dailymail', '3.0.0')

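# Crude redaction for previews (demo only, not real PII detection):
# every capitalized word becomes [NAME], including sentence-initial words, and every digit becomes 0,
# e.g. 'Daniel Radcliffe turns 18 on Monday' -> '[NAME] [NAME] turns 00 on [NAME]'.
import re
def redact(text):
    text = re.sub(r'\b[A-Z][a-z]+\b', '[NAME]', text)  # capitalized words
    text = re.sub(r'\d', '0', text)                     # digits
    return text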

for i in range(3):
    art = ds['train'][i]['article'][:400].replace('\n', ' ')
    summ = ds['train'][i]['highlights'][:200].replace('\n', ' ')
    print(f'Example {i+1}\nArticle  :', redact(art))
    print('Summary :', redact(summ))
    print('-'*80)

# Save small samples so later notebooks can run quick dry-runs
os.makedirs('results', exist_ok=True)
ds['validation'].select(range(64)).to_json('results/val.sample.jsonl', orient='records', lines=True)
print('Wrote results/val.sample.jsonl (64 items).')

# How to read results:
# You should see 3 preview examples with simple redactions. The small validation sample enables quick experiments.
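
Beyond eyeballing the previews, dataset access can be checked programmatically. The following sketch assumes only that `ds` behaves like a mapping from split name to an object with `column_names` and `len()`, which is how a `DatasetDict` behaves; `validate_dataset` is a hypothetical helper, and the column names are the two fields read in the cell above.

```python
def validate_dataset(ds, required_columns, required_splits=("train", "validation", "test")):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    for split in required_splits:
        if split not in ds:
            problems.append(f"missing split: {split}")
            continue
        if len(ds[split]) == 0:
            problems.append(f"empty split: {split}")
        missing = [c for c in required_columns if c not in ds[split].column_names]
        if missing:
            problems.append(f"split {split} lacks columns: {missing}")
    return problems

# Example call inside the notebook (ds comes from load_dataset above):
# problems = validate_dataset(ds, ["article", "highlights"])
# assert not problems, problems
```

Returning a list instead of raising on the first failure makes it easier to see every issue in one run.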


## (Optional) Record dataset license

In [0]:
# Record the dataset license. The value is hard-coded here; verify it against the Hugging Face dataset card.
with open('configs/run.json') as f:
    cfg = json.load(f)
cfg['dataset_license'] = 'Apache-2.0'
with open('configs/run.json', 'w') as f:
    json.dump(cfg, f, indent=2)
print("Updated configs/run.json with dataset license 'Apache-2.0'.")

# How to read results:
# The generated configs/run.json now includes the dataset's Apache-2.0 license for audit trails.
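
Since `configs/run.json` is meant to be extended over time, the read-modify-write pattern above could be factored into a small helper. This is a sketch only: `update_run_config` is a hypothetical name, and writing to a temporary file before replacing the original is a design choice made here so that an interrupted write does not leave a truncated config.

```python
import json
import os

def update_run_config(path, **updates):
    """Merge updates into the JSON config at path and write it back."""
    with open(path) as f:
        cfg = json.load(f)
    cfg.update(updates)
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(cfg, f, indent=2)
    os.replace(tmp_path, path)  # swap in the new file only after a full write
    return cfg

# Example call (the key shown is the one recorded in the cell above):
# update_run_config("configs/run.json", dataset_license="Apache-2.0")
```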
