# tDS Quickstart Notebook (Standalone)

This notebook is a **standalone**, local-Jupyter-friendly quickstart for common Data Strategist workflows on **Google Cloud**:
- Auth (ADC / service account)
- BigQuery query + (optional) write
- (Optional) Cloud Storage smoke actions
- Reproducible configuration + safe-by-default write guards

**Design goals**
- Runs outside Colab (no `google.colab` dependencies)
- Safe defaults (no accidental writes)
- Clear “what/why” for each step

_Last updated: 2026-01-10_


## 0) Prerequisites

### Local environment
- Python 3.10+
- Jupyter installed
- Google Cloud SDK set up (recommended)

### Authentication (choose one)
- **Application Default Credentials (recommended)**:
  - `gcloud auth application-default login`
- **Service account JSON**:
  - Set `GOOGLE_APPLICATION_CREDENTIALS=/path/to/key.json`
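
Before running anything that needs auth, it can help to see which of the two options above is actually configured on this machine. The following is a minimal sketch using only the standard library. The helper name `describe_credential_source` is hypothetical, and the ADC file location is an assumption: it is the default gcloud config directory on Linux/macOS, and it differs on Windows or when gcloud uses a custom config directory. The check only looks for files and does not validate the credentials.

```python
import os
from pathlib import Path


def describe_credential_source(env=None, home=None):
    """Report which local credential source is configured (hypothetical helper).

    Only checks that the files exist; it does not validate the credentials.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    key_path = env.get("GOOGLE_APPLICATION_CREDENTIALS", "")
    if key_path:
        if Path(key_path).is_file():
            return f"credentials file: {key_path}"
        return f"GOOGLE_APPLICATION_CREDENTIALS is set but the file is missing: {key_path}"

    # Assumption: default gcloud config directory on Linux/macOS.
    adc_file = home / ".config" / "gcloud" / "application_default_credentials.json"
    if adc_file.is_file():
        return f"gcloud ADC file: {adc_file}"

    return "no local credentials found"


print(describe_credential_source())
```

The environment variable is checked first, since a key file set through `GOOGLE_APPLICATION_CREDENTIALS` takes priority over the file written by `gcloud auth application-default login`.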

> If you don't have GCP credentials configured, read cells will fail with an auth error.


In [None]:
# Optional: install dependencies (uncomment if running in a fresh environment)
# `db-dtypes` is needed by recent google-cloud-bigquery releases for `to_dataframe()`.
# !pip install -U google-cloud-bigquery google-cloud-storage pandas pyarrow db-dtypes

import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    # ---- Project / location ----
    project_id: str = os.environ.get("GOOGLE_PROJECT_ID", "")  # set this
    location: str = os.environ.get("GOOGLE_LOCATION", "US")

    # ---- BigQuery (optional write) ----
    allow_writes: bool = os.environ.get("ALLOW_WRITES", "0") == "1"
    dataset_id: str = os.environ.get("BQ_DATASET_ID", "tds_sandbox")
    table_id: str = os.environ.get("BQ_TABLE_ID", "quickstart_table")

    # ---- GCS (optional) ----
    bucket: str = os.environ.get("GCS_BUCKET", "")

cfg = Config()
cfg


## 1) Authentication + Clients

We use **google-auth default credentials**. In production, you'd typically run this in:
- Cloud Run / GKE (Workload Identity)
- Airflow/Composer (service account)
- CI (short-lived identity)

This cell verifies auth and prints the detected project.
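
One way to make the failure modes of this step easier to read is to split project resolution from credential loading. The sketch below assumes the same precedence the notebook uses (the configured project first, then the project detected by ADC). The function names `resolve_project` and `load_credentials` are hypothetical. `google.auth` is imported lazily so the first helper can be used even where the library is not installed.

```python
def resolve_project(configured, detected):
    """Prefer the explicitly configured project; fall back to the one ADC detected."""
    project = (configured or "").strip() or (detected or "").strip()
    if not project:
        raise ValueError(
            "No project detected. Set GOOGLE_PROJECT_ID env var or configure gcloud ADC."
        )
    return project


def load_credentials(configured_project=""):
    """Load ADC and resolve the project, turning a missing-credentials error into a hint."""
    # Imported lazily so resolve_project works without google-auth installed.
    import google.auth
    from google.auth.exceptions import DefaultCredentialsError

    try:
        creds, detected = google.auth.default()
    except DefaultCredentialsError as exc:
        raise RuntimeError(
            "No credentials found. Run `gcloud auth application-default login` "
            "or set GOOGLE_APPLICATION_CREDENTIALS."
        ) from exc
    return creds, resolve_project(configured_project, detected)
```

With credentials configured, `creds, project = load_credentials(cfg.project_id)` would stand in for the first few lines of the cell below.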


In [None]:
from google.auth import default as google_auth_default
from google.cloud import bigquery

creds, detected_project = google_auth_default()
project = cfg.project_id or detected_project

if not project:
    raise ValueError(
        "No project detected. Set GOOGLE_PROJECT_ID env var or configure gcloud ADC."
    )

bq = bigquery.Client(project=project, credentials=creds, location=cfg.location)

print("Authenticated ✅")
print("Project:", project)
print("Location:", cfg.location)


## 2) BigQuery: Read Example (Public Dataset)

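This is a safe, read-only query against a public dataset. The dataset lives in the `US` multi-region, so the query only runs when `cfg.location` is `US` (the default).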
It demonstrates:
- query execution
- returning a pandas DataFrame
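
Before executing a query, a dry run can report how many bytes BigQuery estimates it would scan, without running or billing the query. The sketch below is an optional addition, not part of the original notebook flow. The helper names `format_bytes` and `estimate_query_bytes` are hypothetical, and `client` is assumed to be a `bigquery.Client` such as the `bq` object created in section 1.

```python
def format_bytes(num_bytes):
    """Render a byte count with a binary unit suffix."""
    value = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        # Stop at the first unit that fits, or at the largest unit available.
        if value < 1024 or unit == "TiB":
            return f"{value:.1f} {unit}"
        value /= 1024


def estimate_query_bytes(client, sql):
    """Dry-run `sql` and return the number of bytes BigQuery estimates it would scan."""
    # Imported lazily so format_bytes works without the BigQuery library installed.
    from google.cloud import bigquery

    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=job_config)  # a dry run does not execute the query
    return job.total_bytes_processed
```

Once `bq` and `sql` are defined, `print(format_bytes(estimate_query_bytes(bq, sql)))` would show the estimate.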


In [None]:
import pandas as pd

sql = """
SELECT
  name,
  SUM(number) AS total
FROM `bigquery-public-data.usa_names.usa_1910_2013`
WHERE state = 'CA'
GROUP BY name
ORDER BY total DESC
LIMIT 10
"""

df = bq.query(sql).to_dataframe()
df


## 3) BigQuery: Optional Write (Guarded)

Writes are **disabled by default**. To enable, set:
- `ALLOW_WRITES=1`

This pattern avoids accidental writes during review or CI.
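
The guard can be sketched as a small helper that any write path calls first. This is a minimal illustration, assuming the same rule as the `Config` class above: only the exact string `1` enables writes. The names `WritesDisabledError`, `writes_allowed`, and `require_writes` are hypothetical and do not appear elsewhere in this notebook.

```python
import os


class WritesDisabledError(RuntimeError):
    """Raised when a write is attempted while ALLOW_WRITES is not '1'."""


def writes_allowed(env=None):
    """Return True only when ALLOW_WRITES is exactly '1' (unset means disabled)."""
    env = os.environ if env is None else env
    return env.get("ALLOW_WRITES", "0") == "1"


def require_writes(action, env=None):
    """Raise before a write happens unless writes were explicitly enabled."""
    if not writes_allowed(env):
        raise WritesDisabledError(
            f"Refusing to {action}: set ALLOW_WRITES=1 to enable writes."
        )


# Values such as "true" or "yes" do not enable writes under this rule.
print(writes_allowed({"ALLOW_WRITES": "true"}))
```

Treating anything other than `1` as "disabled" keeps the default safe when the variable is unset or mistyped.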


In [None]:
import datetime  # used below to timestamp the inserted row

from google.api_core.exceptions import NotFound

dataset_ref = bigquery.Dataset(f"{project}.{cfg.dataset_id}")
dataset_ref.location = cfg.location

table_ref = f"{project}.{cfg.dataset_id}.{cfg.table_id}"

if not cfg.allow_writes:
    print("Writes are disabled (ALLOW_WRITES=0). Skipping dataset/table creation.")
else:
    # Create dataset if missing
    try:
        bq.get_dataset(dataset_ref)
        print("Dataset exists:", dataset_ref.dataset_id)
    except NotFound:
        bq.create_dataset(dataset_ref)
        print("Created dataset:", dataset_ref.dataset_id)

    # Create table (a tiny example) if missing
    schema = [
        bigquery.SchemaField("run_ts", "TIMESTAMP", mode="REQUIRED"),
        bigquery.SchemaField("message", "STRING", mode="REQUIRED"),
    ]
    try:
        bq.get_table(table_ref)
        print("Table exists:", table_ref)
    except NotFound:
        table = bigquery.Table(table_ref, schema=schema)
        bq.create_table(table)
        print("Created table:", table_ref)

    # Insert one row
    rows = [{"run_ts": datetime.datetime.utcnow().isoformat(), "message": "hello from tDS quickstart"}]
    errors = bq.insert_rows_json(table_ref, rows)
    if errors:
        raise RuntimeError(errors)
    print("Inserted 1 row into:", table_ref)


## 4) GCS: Optional Smoke Actions (List / Upload)

This section is optional and permission-dependent.

To enable upload/list:
- Set `GCS_BUCKET` to a bucket you have access to
- Ensure your credentials include Storage permissions

Listing is read-only and needs only `GCS_BUCKET`. Uploads additionally require `ALLOW_WRITES=1`.
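
The upload guard can also be written as a function that skips the temporary file and refuses to overwrite an existing object. The sketch below is a variation on the notebook's approach, not the code the notebook runs. The function name `guarded_upload` is hypothetical, and `bucket` is assumed to behave like a `google.cloud.storage.Bucket`.

```python
def guarded_upload(bucket, blob_name, text, allow_writes):
    """Upload `text` to `blob_name` unless writes are disabled.

    Returns the gs:// URI after an upload, or None when the upload was skipped.
    """
    if not allow_writes:
        print("Writes are disabled (ALLOW_WRITES=0). Skipping upload.")
        return None

    blob = bucket.blob(blob_name)
    # if_generation_match=0 asks GCS to reject the upload if the object already exists.
    blob.upload_from_string(text, content_type="text/plain", if_generation_match=0)
    return f"gs://{bucket.name}/{blob_name}"
```

A call such as `guarded_upload(gcs.bucket(cfg.bucket), "quickstart/tds_quickstart_example.txt", "hello from tDS quickstart\n", cfg.allow_writes)` would mirror the upload in the cell below. If the object already exists, the precondition fails and the client library raises an error instead of overwriting it.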


In [None]:
from google.cloud import storage

if not cfg.bucket:
    print("GCS_BUCKET not set. Skipping GCS section.")
else:
    gcs = storage.Client(project=project, credentials=creds)
    bucket = gcs.bucket(cfg.bucket)

    # List first 20 blobs (read-only)
    print(f"Listing blobs in gs://{cfg.bucket} (first 20):")
    for i, blob in enumerate(gcs.list_blobs(cfg.bucket, max_results=20), start=1):
        print(f"{i:02d}. {blob.name}")

    # Optional upload (guarded)
    if not cfg.allow_writes:
        print("Writes are disabled (ALLOW_WRITES=0). Skipping upload.")
    else:
        tmp_path = "tds_quickstart_example.txt"
        with open(tmp_path, "w", encoding="utf-8") as f:
            f.write("hello from tDS quickstart\n")

        blob = bucket.blob("quickstart/tds_quickstart_example.txt")
        blob.upload_from_filename(tmp_path)
        print("Uploaded:", f"gs://{cfg.bucket}/{blob.name}")


## 5) Production Notes

If this were running as a production job, you'd typically add:

- **Orchestration**
  - Airflow/Composer or Prefect schedules
  - Backfills + retries + idempotency

- **Observability**
  - Structured logs
  - Metrics (rows processed, runtime, error rate)
  - Data quality checks (nulls, duplicates, range checks)

- **Artifact/version management**
  - Timestamped outputs
  - Metadata tables (run_id, inputs, outputs, git SHA)

- **Security**
  - Workload Identity (avoid long-lived keys)
  - Least-privilege IAM
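
Several of the items above (run metadata, structured logs, basic data quality checks) can be prototyped with the standard library before adopting an orchestrator. The sketch below is illustrative only. The function names, the record fields beyond `run_id`, inputs, outputs, and git SHA, and the example table names are assumptions, and the quality check works on plain lists of dicts rather than on BigQuery tables.

```python
import json
import subprocess
import uuid
from datetime import datetime, timezone


def current_git_sha():
    """Return the current commit SHA, or 'unknown' outside a git checkout."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return "unknown"
    return out.stdout.strip()


def build_run_record(inputs, outputs, rows_processed, git_sha=None):
    """Assemble one row for a run-metadata table."""
    return {
        "run_id": uuid.uuid4().hex,
        "run_ts": datetime.now(timezone.utc).isoformat(),
        "git_sha": git_sha if git_sha is not None else current_git_sha(),
        "inputs": list(inputs),
        "outputs": list(outputs),
        "rows_processed": rows_processed,
    }


def check_rows(rows, key, required):
    """Count null required fields and repeated keys in a list of dict rows."""
    seen, duplicates, nulls = set(), 0, 0
    for row in rows:
        nulls += sum(1 for field in required if row.get(field) is None)
        if row.get(key) in seen:
            duplicates += 1
        seen.add(row.get(key))
    return {"null_values": nulls, "duplicate_keys": duplicates}


record = build_run_record(
    inputs=["bigquery-public-data.usa_names.usa_1910_2013"],
    outputs=["tds_sandbox.quickstart_table"],
    rows_processed=10,
)
# One JSON object per line is straightforward for log aggregators to parse.
print(json.dumps({"event": "run_finished", **record}, sort_keys=True))
```

In a real job, the record would typically be written to a metadata table and the check results emitted as metrics. Both are left out here to keep the sketch free of cloud dependencies.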
