# Applied Psychological Data Science
## Course Notebook - Your Workspace for the Semester

**Welcome!** This notebook is your workspace for the entire course.

### Quick Start
1. **Save a copy**: File → Save a copy in Drive
2. **Run the setup cell below** (once per session)
3. **Follow your lab instructions** and paste code here

---

## Step 1: Run This Setup Cell (Once Per Session)

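This cell downloads the course dataset catalog and defines the `load_dataset()` and `list_datasets()` helpers. Run it every time you open the notebook, because the helpers do not persist between sessions.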
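
The setup cell below reads a JSON manifest and looks datasets up by name. As a rough sketch of the structure it appears to expect, inferred only from the keys the code accesses (`datasets`, `canonical_name`, `access.public_url`, `stats.rows`), the following uses a made-up catalog. The dataset names, URLs, row counts, and the `find_url` helper are hypothetical and not part of the course code.

```python
# Hypothetical manifest with the same keys the setup cell reads.
# Names, URLs, and row counts are invented for illustration only.
example_catalog = {
    "datasets": [
        {
            "canonical_name": "alex_survey_responses",
            "access": {"public_url": "https://example.com/alex_survey_responses.csv"},
            "stats": {"rows": 1200},
        },
        {
            "canonical_name": "sam_forum_posts",
            "access": {"public_url": "https://example.com/sam_forum_posts.csv"},
            # No "stats" key here: the row count would be treated as unknown.
        },
    ]
}


def find_url(catalog, name):
    """Return the public URL for an exact canonical_name match, else None."""
    for ds in catalog["datasets"]:
        if ds["canonical_name"] == name:
            return ds["access"]["public_url"]
    return None


print(find_url(example_catalog, "alex_survey_responses"))
print(find_url(example_catalog, "missing_dataset"))
```

The lookup is an exact, case-sensitive match on `canonical_name`, which is why dataset names need to be copied precisely.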

In [None]:
# ============================================================
# COURSE SETUP - Run this cell first!
# ============================================================

import pandas as pd
import numpy as np
import json
import urllib.request

# Course dataset configuration
GCS_BUCKET = "variable-resolution-applied-computational-psychology-course"
GCS_BASE_URL = f"https://storage.googleapis.com/{GCS_BUCKET}"
CATALOG_URL = f"{GCS_BASE_URL}/manifest.json"

# Load the dataset catalog
print("Loading course dataset catalog...")
with urllib.request.urlopen(CATALOG_URL) as response:
    CATALOG = json.loads(response.read().decode())

print(f"✓ Connected! {len(CATALOG['datasets'])} datasets available")

# ============================================================
# MAIN FUNCTION: load_dataset()
# ============================================================

def load_dataset(name, nrows=None):
    """
    Load a course dataset by name.
    
    Parameters:
    -----------
    name : str
        The dataset name (e.g., "andrea_reddit_results_andrea_2025_03_13")
    nrows : int, optional
        Maximum number of rows to read, counted from the top of the file
        (useful for large datasets). Reads all rows if omitted.
    
    Returns:
    --------
    pandas.DataFrame
    
    Example:
    --------
    >>> df = load_dataset("andrea_reddit_results_andrea_2025_03_13", nrows=5000)
    >>> df.head()
    """
    for ds in CATALOG['datasets']:
        if ds['canonical_name'] == name:
            url = ds['access']['public_url']
            df = pd.read_csv(url, nrows=nrows)
            print(f"✓ Loaded {len(df):,} rows from '{name}'")
            return df
    
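    # Dataset not found - suggest names that contain the input,
    # falling back to close spellings so typos still get a hint
    import difflib
    names = [ds['canonical_name'] for ds in CATALOG['datasets']]
    similar = [n for n in names if name.lower() in n.lower()][:5]
    similar = similar or difflib.get_close_matches(name, names, n=5)
    raise ValueError(f"Dataset '{name}' not found. Similar: {similar}")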


def list_datasets(contributor=None, search=None):
    """
    Print available datasets (first 20 matches), optionally filtered.
    
    Parameters:
    -----------
    contributor : str, optional
        Filter by contributor name (e.g., "Andrea", "Peter"), matched
        case-insensitively against the dataset name
    search : str, optional
        Case-insensitive text to search for in dataset names
    
    Example:
    --------
    >>> list_datasets(contributor="Andrea")
    >>> list_datasets(search="reddit")
    """
    results = []
    for ds in CATALOG['datasets']:
        name = ds['canonical_name']
        if contributor and contributor.lower() not in name.lower():
            continue
        if search and search.lower() not in name.lower():
            continue
        rows = ds.get('stats', {}).get('rows', 'unknown')
        results.append(f"{name} ({rows:,} rows)" if isinstance(rows, int) else f"{name}")
    
    print(f"Found {len(results)} datasets:")
    for r in results[:20]:
        print(f"  - {r}")
    if len(results) > 20:
        print(f"  ... and {len(results) - 20} more")


# ============================================================
# READY!
# ============================================================
print("")
print("Ready! You can now use:")
print("  - load_dataset('dataset_name')     → Load a dataset")
print("  - load_dataset('name', nrows=1000) → Load first 1000 rows")
print("  - list_datasets()                  → See all datasets")
print("  - list_datasets(contributor='Andrea') → Filter by contributor")
print("")
print("Recommended datasets for each module:")
print("  M1: andrea_reddit_results_andrea_2025_03_13")
print("  M2: agatha_ballet_dancemoms_agatha")
print("  M3: yashita_yashita_data, kaitlyn_merged_data_overview_kaitlyn_master")
print("  M5: clara_bert_embeddings")
print("  M6: raymond_umap_dbscan_results_20250223_154628")

---

## Step 2: Test It Works

Run this cell to make sure everything is connected:

In [None]:
# Test: Load a small sample from Andrea's Reddit data
df_test = load_dataset("andrea_reddit_results_andrea_2025_03_13", nrows=5)
df_test.head()
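
The test cell reads only five rows because `load_dataset()` passes `nrows` straight to `pd.read_csv`. A minimal sketch of that behavior, using a small in-memory CSV (the `id` and `value` columns are invented for illustration) so it runs without a network connection:

```python
import io

import pandas as pd

# Tiny made-up CSV standing in for a course dataset.
csv_text = "id,value\n1,10\n2,20\n3,30\n4,40\n"

# Without nrows, every data row is read.
df_all = pd.read_csv(io.StringIO(csv_text))

# With nrows=2, reading stops after the first two data rows.
df_head = pd.read_csv(io.StringIO(csv_text), nrows=2)

print(len(df_all), len(df_head))
```

Loading a small sample first is a cheap way to check column names before pulling a full dataset.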

---

## Your Workspace

Add new cells below as you work through each module's lab instructions.
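
A common first step in a new cell is a quick look at whatever was just loaded. The sketch below shows one way to do that; it uses a stand-in DataFrame with hypothetical columns (`post_id`, `text`, `score`), since the real column names depend on the dataset you load.

```python
import pandas as pd

# Stand-in for a DataFrame returned by load_dataset().
# The columns and values are hypothetical.
df_demo = pd.DataFrame({
    "post_id": [1, 2, 3, 4, 5],
    "text": ["a", "b", "c", None, "e"],
    "score": [0.1, 0.5, 0.9, 0.3, 0.7],
})

print(df_demo.shape)         # (number of rows, number of columns)
print(df_demo.dtypes)        # data type of each column
print(df_demo.isna().sum())  # count of missing values per column
```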

### Module 1: Data Foundations
Follow the lab instructions and add your code here:

In [None]:
# Module 1 - Your code here
# Follow lab_instructions.md in the module_1_data_foundations folder


### Module 2: Linear Regression

In [None]:
# Module 2 - Your code here
# Follow lab_instructions.md in the module_2_linear_regression folder


### Module 3: LLMs as Raters

In [None]:
# Module 3 - Your code here
# You'll need your own OpenAI API key for this module.
# Enter it at runtime rather than saving it in the notebook:
# from getpass import getpass
# API_KEY = getpass("OpenAI API key: ")


### Module 4: APIs & Data Collection

In [None]:
# Module 4 - Your code here


### Module 5: Embeddings

In [None]:
# Module 5 - Your code here
# Load Clara's pre-computed embeddings:
# df_embed = load_dataset("clara_bert_embeddings")


### Module 6: PCA & UMAP Visualization

In [None]:
# Module 6 - Your code here
# Load Raymond's pre-computed UMAP:
# df_umap = load_dataset("raymond_umap_dbscan_results_20250223_154628")
