# 00 · Colab Setup

Set up dependencies, authentication, and guardrails before running downstream notebooks.

> **Data Protection:** Access to MIMIC datasets requires PhysioNet credentials and human oversight. Never sync protected health information (PHI) to this repo.

## Guardrails
- Use this Colab only within approved clinical research scopes.
- Mount Google Drive with restricted sharing.
- Update configuration flags to disable exporting raw text artifacts.
- Double-check that only anonymized metrics leave the runtime.
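
The last guardrail can be partly automated. Below is a minimal sketch of a pre-export check, assuming metrics are collected in a flat dictionary of numeric scalars. The helper name `assert_metrics_only`, the key-length limit, and the example values are hypothetical placeholders, not part of the repo.

```python
from numbers import Real
from typing import Any, Mapping


def assert_metrics_only(payload: Mapping[str, Any], max_key_len: int = 64) -> None:
    """Raise ValueError if payload holds anything other than scalar numeric metrics."""
    for key, value in payload.items():
        # Long or non-string keys may carry free text, so reject them outright.
        if not isinstance(key, str) or len(key) > max_key_len:
            raise ValueError(f"Suspicious metric name: {key!r}")
        # bool is a subclass of int; reject it so flags are not mistaken for metrics.
        if isinstance(value, bool) or not isinstance(value, Real):
            raise ValueError(f"Non-numeric value under {key!r}: {type(value).__name__}")


# Made-up placeholder values, for illustration only.
metrics = {"auc": 0.5, "tpr_at_low_fpr": 0.01}
assert_metrics_only(metrics)  # passes silently when every value is numeric
```

A check like this does not replace manual review; it only catches strings, lists, or nested objects that slip into an export payload.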

In [4]:
!git clone https://github.com/sehajbath/secure-llm-mia.git
%cd secure-llm-mia


Cloning into 'secure-llm-mia'...
remote: Enumerating objects: 84, done.
remote: Counting objects: 100% (84/84), done.
remote: Compressing objects: 100% (72/72), done.
remote: Total 84 (delta 7), reused 84 (delta 7), pack-reused 0 (from 0)
Receiving objects: 100% (84/84), 36.74 KiB | 1.67 MiB/s, done.
Resolving deltas: 100% (7/7), done.
/content/secure-llm-mia


In [7]:
import os
import sys
from pathlib import Path

PROJECT_ROOT = Path(os.getcwd()).resolve()  # -> /content/secure-llm-mia

if str(PROJECT_ROOT) not in sys.path:
    sys.path.append(str(PROJECT_ROOT))

print(f"Project root: {PROJECT_ROOT}")


Project root: /content/secure-llm-mia


In [8]:
# Install dependencies from the repo environment spec
import subprocess
import sys

requirements_path = PROJECT_ROOT / 'env' / 'requirements.txt'
if requirements_path.exists():
    print('Installing dependencies...')
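    result = subprocess.run([sys.executable, '-m', 'pip', 'install', '-q', '-U', '-r', str(requirements_path)], check=False)
    if result.returncode != 0:
        # Surface failures instead of silently continuing with a partial environment.
        print(f'pip exited with code {result.returncode}; review the output above before continuing.')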
else:
    print('requirements.txt not found. Verify repo checkout or adjust path.')

Installing dependencies...
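
The quiet install prints very little, so it can help to confirm that key packages actually resolved. The sketch below uses the standard library's `importlib.metadata`. The helper `installed_versions` is hypothetical, and the package names are assumptions that should be replaced with those listed in `env/requirements.txt`.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Package names are assumptions; substitute the entries from env/requirements.txt.
for name, version in installed_versions(['torch', 'transformers', 'bitsandbytes']).items():
    print(f'{name}: {version or "NOT INSTALLED"}')
```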


In [10]:
# Authenticate with Hugging Face Hub. getpass keeps the token out of the notebook output,
# but login() still caches it on the runtime's local disk.
from getpass import getpass
try:
    from huggingface_hub import login
except Exception as exc:  # pragma: no cover
    print(f'huggingface_hub not available: {exc}')
else:
    hf_token = getpass('Enter Hugging Face token (press ENTER to skip): ')
    if hf_token:
        # Do not also copy the token into the git credential store.
        login(token=hf_token, add_to_git_credential=False)
    else:
        print('Token not provided. Access to meta-llama/Llama-3.1-8B will fail if gated.')

Enter Hugging Face token (press ENTER to skip): ··········


In [11]:
# Optional: initialize Weights & Biases logging
try:
    import wandb
    enable_wandb = False  # TODO: flip to True after configuring wandb settings.
    if enable_wandb:
        wandb.login()
        wandb.init(project='secure-llm-mia', config={'notebook': '00_colab_setup'})
    else:
        print('W&B disabled by default. Set enable_wandb=True to log runs.')
except Exception as exc:  # pragma: no cover
    print(f'wandb not available: {exc}')

W&B disabled by default. Set enable_wandb=True to log runs.


In [12]:
# Mount Google Drive if executing inside Colab
try:
    from google.colab import drive  # type: ignore
    DRIVE_ROOT = Path('/content/drive')
    if not DRIVE_ROOT.exists():
        drive.mount('/content/drive', force_remount=False)
        print('Drive mounted at /content/drive.')
    else:
        print('Drive already mounted.')
except ImportError:
    print('Not running inside Google Colab; skipping Drive mount.')

Mounted at /content/drive
Drive mounted at /content/drive.
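
The mount cell treats the existence of `/content/drive` as evidence of a mount. A slightly stricter check is sketched below. The helper `drive_is_mounted` is hypothetical, and the `MyDrive` folder name is an assumption about how Colab lays out the mount point.

```python
import os
from pathlib import Path


def drive_is_mounted(mount_point: str = '/content/drive') -> bool:
    """Return True if mount_point looks like a live Drive mount."""
    root = Path(mount_point)
    # 'MyDrive' is the folder Colab normally exposes under the mount point (assumption).
    return os.path.ismount(root) or (root / 'MyDrive').is_dir()


print(f'Drive mounted: {drive_is_mounted()}')
```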


In [13]:
# Determine runtime device capabilities
import torch


def describe_device() -> None:
    if not torch.cuda.is_available():
        print('CUDA not available. Colab may be on CPU-only instance.')
        return
    device = torch.cuda.get_device_name(0)
    capability = torch.cuda.get_device_capability(0)
    print(f'CUDA device: {device} (cc {capability})')
    print(f'BF16 support: {torch.cuda.is_bf16_supported()}')
    # Native FP16 arithmetic requires compute capability 5.3 or higher.
    print(f'FP16 support: {capability >= (5, 3)}')


describe_device()

CUDA not available. Colab may be on CPU-only instance.
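
Downstream notebooks may want to turn these capability checks into a concrete precision choice. One possible mapping is sketched below: prefer BF16 where supported, fall back to FP16 on other GPUs, and use FP32 on CPU. The helper `pick_compute_dtype` is hypothetical and is not taken from the repo.

```python
def pick_compute_dtype(cuda_available: bool, bf16_supported: bool) -> str:
    """Return the name of a torch dtype suited to the detected hardware."""
    if not cuda_available:
        return 'float32'
    return 'bfloat16' if bf16_supported else 'float16'


try:
    import torch

    cuda_ok = torch.cuda.is_available()
    # Only query BF16 support when a CUDA device is actually present.
    dtype_name = pick_compute_dtype(cuda_ok, cuda_ok and torch.cuda.is_bf16_supported())
    print(f'Suggested compute dtype: {getattr(torch, dtype_name)}')
except ImportError:
    print('torch not installed; skipping dtype selection.')
```

Keeping the decision in a pure function makes it easy to test without a GPU.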


In [14]:
# Set deterministic seeds, create project directories, and silence the bitsandbytes banner
from src.utils.seed import set_global_seed
from src.constants import ensure_directories, DATA_CACHE_DIR, ARTIFACT_ROOT

PROJECT_TEMP = PROJECT_ROOT / 'colab_temp'
PROJECT_TEMP.mkdir(exist_ok=True)
set_global_seed(17)
ensure_directories()
print(f'Data cache directory: {DATA_CACHE_DIR}')
print(f'Artifact root: {ARTIFACT_ROOT}')
os.environ['BITSANDBYTES_NOWELCOME'] = '1'  # suppress the bitsandbytes welcome banner

Data cache directory: /content/secure-llm-mia/data_cache
Artifact root: /content/secure-llm-mia/artifacts
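
The body of `set_global_seed` lives in `src/utils/seed.py` and is not shown in this notebook. As a rough illustration of what such a helper typically covers, here is a hypothetical stand-in named `seed_everything_sketch`; the repo's actual implementation may differ.

```python
import os
import random


def seed_everything_sketch(seed: int) -> None:
    """Hypothetical stand-in; the repo's set_global_seed may differ."""
    # PYTHONHASHSEED only affects Python processes started after this point.
    os.environ['PYTHONHASHSEED'] = str(seed)
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # safe to call without a GPU
    except ImportError:
        pass


seed_everything_sketch(17)
first = random.random()
seed_everything_sketch(17)
print(f'Reseeding reproduces the same draw: {first == random.random()}')
```

Seeding alone does not guarantee bitwise-identical GPU results; some CUDA kernels remain nondeterministic unless deterministic algorithms are also requested.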


✅ **Next steps:** Proceed to `01_data_intake_and_clean.ipynb`, update dataset paths, and run the synthetic smoke test before working with credentialed data.