# PEEC AI + Awin Connector

This notebook connects **PEEC AI citation data** with **Awin affiliate transaction data** to help you understand where AI models cite your publishers and identify recruitment opportunities.

**What you can do with this tool:**
- Pull AI citation data from PEEC (which domains/URLs do AI models cite?)
- Pull Awin transaction data for your advertiser account
- Match PEEC-cited domains to Awin publisher domains
- Build enriched reports combining citation + transaction metrics
- Run gap analysis to find high-citation domains with no Awin relationship

---

| Step | What you'll do |
|------|----------------|
| 1 | Set up workspace, download scripts, configure session |
| 2 | Load styles & PEEC client |
| 3 | Pull PEEC citation data |
| 4 | Generate domain & URL reports |
| 5 | Pull Awin transaction data |
| 6 | Match domains & build enriched report |
| 7 | Gap analysis — find recruitment targets |

---

**Run each cell in order.** Step 1 handles all setup automatically.

# Step 1: Set Up & Configure Session

This cell handles everything you need to get started:
1. Installs dependencies
2. Detects your environment (Colab or local)
3. Sets up a workspace folder
4. Downloads the latest scripts from GitHub
5. Configures your API keys, project, date range, and advertiser ID

---

<details>
<summary><b>API Keys — Colab setup (click to expand)</b></summary>

1. Click the key icon in the left sidebar (Secrets)
2. Add two secrets:
   - Name: `PEEC_API_KEY` — Value: your PEEC AI API key
   - Name: `AWAPI` — Value: your Awin API token
3. Toggle "Notebook access" ON for both
4. Run this cell

</details>

<details>
<summary><b>API Keys — Local setup (click to expand)</b></summary>

1. Create a `.env` file in your project folder
2. Add:
   ```
   PEEC_API_KEY=your-peec-key-here
   AWAPI=your-awin-token-here
   ```
3. Run this cell

</details>
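
A minimal sketch of how the two keys could be resolved once they are stored as described above. The helper names `parse_env_text` and `load_api_keys` are hypothetical, and the downloaded `cell_01_session_config.py` may resolve keys differently. The notebook installs `python-dotenv`, which would normally replace the hand-rolled parser used here to keep the example dependency-free.

```python
import os
from pathlib import Path


def parse_env_text(text):
    """Parse KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def load_api_keys(env_path=".env", names=("PEEC_API_KEY", "AWAPI")):
    """Return {name: value or None}, trying Colab secrets, env vars, then .env."""
    try:
        # Only importable inside Google Colab.
        from google.colab import userdata  # type: ignore
    except ImportError:
        userdata = None

    path = Path(env_path)
    file_values = parse_env_text(path.read_text(encoding="utf-8")) if path.is_file() else {}

    keys = {}
    for name in names:
        value = None
        if userdata is not None:
            try:
                value = userdata.get(name)
            except Exception:  # secret missing, or "Notebook access" is off
                value = None
        keys[name] = value or os.environ.get(name) or file_values.get(name)
    return keys
```

A value of `None` for either name means the key was not found in any of the three places, which is worth checking before running the later steps.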

---

**Run the cell below**, choose a storage location, and click **"Set Up Workspace"**. When the session form appears, set your project, date range, and advertiser ID, then click **"Confirm Settings"**.

In [None]:
# =============================================================================
# BOOTSTRAP: Pip installs + Workspace + Script Download + Session Config
# =============================================================================

import json
import os
import subprocess
import sys
import urllib.request
import urllib.error
from datetime import datetime
from pathlib import Path

# ── Pip installs (silent) ──────────────────────────────────────────
subprocess.check_call(
    [sys.executable, "-m", "pip", "install", "--quiet",
     "requests", "pandas", "python-dotenv", "ipywidgets"],
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL,
)
print("\u2705 Dependencies installed.")

import ipywidgets as widgets
from IPython.display import display, clear_output

# ── Environment detection ─────────────────────────────────────────
try:
    from google.colab import drive  # type: ignore
except ImportError:
    drive = None

IN_COLAB = "google.colab" in sys.modules

# ── GitHub settings ───────────────────────────────────────────────
GITHUB_REPO = "smartaces/peec-awin-connector"
GITHUB_BRANCH = "main"
GITHUB_RAW_BASE = f"https://raw.githubusercontent.com/{GITHUB_REPO}/{GITHUB_BRANCH}/scripts"

SCRIPT_FILES = [
    "cell_00_pip_installs.py",
    "cell_01_session_config.py",
    "cell_02_css_styling.py",
    "cell_03_peec_client.py",
    "cell_04_peec_data_pull.py",
    "cell_05_domain_report.py",
    "cell_06_url_report.py",
    "cell_07_awin_transactions.py",
    "cell_08_domain_match.py",
    "cell_09_enriched_report.py",
    "cell_10_gap_analysis.py",
]


# ── Detect local scripts in scripts/ subfolder next to notebook ───
def _local_scripts_available():
    """Check if cell scripts exist in a scripts/ folder next to the notebook."""
    return (Path.cwd() / "scripts" / "cell_01_session_config.py").is_file()


# ── Location selection ────────────────────────────────────────────
def _get_location_options():
    if IN_COLAB:
        return [
            ("Google Drive (/content/drive/MyDrive)", "drive"),
            ("Colab Temporary (/content)", "colab"),
            ("Local Folder (current directory)", "local"),
        ]
    return [("Local Folder (current directory)", "local")]


def _resolve_base_path(selection):
    if selection == "drive":
        mount_point = Path("/content/drive")
        if not mount_point.exists() or not os.path.ismount(mount_point):
            print("\U0001f50c Mounting Google Drive...")
            drive.mount(str(mount_point))
        return mount_point / "MyDrive"
    elif selection == "colab":
        return Path("/content")
    return Path.cwd()


def _download_scripts(target_dir):
    """Download all scripts from GitHub into target_dir."""
    target_dir.mkdir(parents=True, exist_ok=True)
    print(f"\U0001f4e5 Downloading latest scripts from GitHub into {target_dir} ...")
    success = 0
    for filename in SCRIPT_FILES:
        url = f"{GITHUB_RAW_BASE}/{filename}"
        dest = target_dir / filename
        try:
            req = urllib.request.Request(url, headers={"User-Agent": "PEEC-Awin-Connector"})
            with urllib.request.urlopen(req, timeout=30) as resp:
                content = resp.read()
            with open(dest, "wb") as fp:
                fp.write(content)
            print(f"   \u2022 {filename} \u2713")
            success += 1
        except OSError as e:  # URLError and HTTPError are both OSError subclasses
            print(f"   \u2022 {filename} \u2717 ({e})")
    print(f"\u2705 Downloaded {success}/{len(SCRIPT_FILES)} scripts.")
    return success == len(SCRIPT_FILES)


# ── Main setup UI ────────────────────────────────────────────────
import __main__

_loc_dd = widgets.Dropdown(
    options=_get_location_options(),
    value=_get_location_options()[0][1],
    description="Location:",
    style={"description_width": "70px"},
    layout=widgets.Layout(width="420px"),
)

_setup_btn = widgets.Button(
    description="  Set Up Workspace", button_style="info",
    icon="folder-open", layout=widgets.Layout(width="200px", height="36px"),
)
_update_btn = widgets.Button(
    description="  Download Latest from GitHub", button_style="warning",
    icon="download", layout=widgets.Layout(width="260px", height="36px"),
)
_setup_output = widgets.Output()
_config_output = widgets.Output()


def _on_setup(b):
    with _setup_output:
        clear_output()
        try:
            base = _resolve_base_path(_loc_dd.value)
            workspace = (base / "peec_awin_workspace").resolve()
            workspace.mkdir(parents=True, exist_ok=True)

            # Create output and logs folders in workspace
            for name in ["output", "logs"]:
                (workspace / name).mkdir(parents=True, exist_ok=True)

            # Decide where scripts live
            if _local_scripts_available():
                # Local dev: use scripts/ subfolder from the repo
                scripts_dir = Path.cwd() / "scripts"
                print(f"\U0001f4c2 Using local scripts from: {scripts_dir}")
            else:
                # Colab / standalone: download into workspace
                scripts_dir = workspace / "scripts"
                scripts_dir.mkdir(parents=True, exist_ok=True)
                existing = sum(1 for f in SCRIPT_FILES if (scripts_dir / f).is_file())
                if existing < len(SCRIPT_FILES):
                    _download_scripts(scripts_dir)
                else:
                    print(f"\u2705 {existing}/{len(SCRIPT_FILES)} scripts already in workspace.")

            paths = {
                "scripts": scripts_dir,
                "output": workspace / "output",
                "logs": workspace / "logs",
            }

            print(f"\U0001f4c2 Workspace: {workspace}")
            print(f"   Scripts: {scripts_dir}")
            print(f"   Output:  {paths['output']}")

            # Set globals
            __main__.IN_COLAB = IN_COLAB
            __main__.WORKSPACE_ROOT = workspace
            __main__.PATHS = paths
            os.environ["WORKSPACE_ROOT"] = str(workspace)

            print("\n\u23f3 Loading session configuration...\n")

            # Load session config
            config_script = scripts_dir / "cell_01_session_config.py"
            if config_script.exists():
                with _config_output:
                    clear_output()
                    # read_text closes the file and avoids platform-default encodings
                    exec(compile(config_script.read_text(encoding="utf-8"), str(config_script), "exec"))
            else:
                print(f"\u26a0\ufe0f Session config script not found: {config_script}")

        except Exception as e:
            print(f"\u274c Error: {e}")


def _on_update(b):
    with _setup_output:
        clear_output()
        # Download into wherever scripts currently point
        if hasattr(__main__, "PATHS") and __main__.PATHS is not None:
            target = Path(__main__.PATHS["scripts"])
        elif _local_scripts_available():
            target = Path.cwd() / "scripts"
        else:
            print("\u26a0\ufe0f Set up workspace first.")
            return
        _download_scripts(target)


_setup_btn.on_click(_on_setup)
_update_btn.on_click(_on_update)

display(
    widgets.HTML("<h3>\U0001f4c1 Workspace & Session Setup</h3>"),
    widgets.HTML("<p>Choose where to store project files, then click <b>Set Up Workspace</b>.</p>"),
    widgets.HBox([_loc_dd]),
    widgets.HBox(
        [_setup_btn, _update_btn],
        layout=widgets.Layout(margin="8px 0 10px 0"),
    ),
    _setup_output,
    _config_output,
)

# Step 2: Load Styles & PEEC Client

This cell loads the visual styling and initialises the PEEC API client with your project's lookup tables (prompts, tags, topics, models).

**Run the cell below.** You should see confirmation that the client is ready.

In [None]:
import __main__
from pathlib import Path

scripts_dir = Path(__main__.PATHS["scripts"])

# Load CSS styling
exec(compile((scripts_dir / "cell_02_css_styling.py").read_text(encoding="utf-8"),
     str(scripts_dir / "cell_02_css_styling.py"), "exec"))

# Load PEEC client, helpers, and lookups
exec(compile((scripts_dir / "cell_03_peec_client.py").read_text(encoding="utf-8"),
     str(scripts_dir / "cell_03_peec_client.py"), "exec"))

# Step 3: Pull PEEC Citation Data

This cell fetches citation data from PEEC AI for your configured date range and project. It pulls:
- Domain classifications
- URL-level citation report broken down by prompt and model

**Run the cell below, then click "Pull Data"** to fetch the data. Once the pull completes, you can run the reports in Step 4.

In [None]:
import __main__
from pathlib import Path

scripts_dir = Path(__main__.PATHS["scripts"])
exec(compile((scripts_dir / "cell_04_peec_data_pull.py").read_text(encoding="utf-8"),
     str(scripts_dir / "cell_04_peec_data_pull.py"), "exec"))

# Step 4: PEEC Reports

This step provides two report views of your citation data:

| Report | What it shows |
|--------|---------------|
| **Domain Report** | One row per domain — total citations, avg position, unique pages, models |
| **URL Report** | One row per URL — with clickable links and prompt counts |

Use the filters to drill down, then download as CSV.
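
To make the Domain Report columns concrete, here is a rough sketch of the aggregation, assuming URL-level rows with `domain`, `url`, `model`, and `position` fields. Those field names, the `build_domain_report` helper, and the sample rows are illustrative assumptions, not the actual PEEC response schema or the logic in `cell_05_domain_report.py`.

```python
from collections import defaultdict


def build_domain_report(rows):
    """Collapse URL-level citation rows into one summary row per domain."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["domain"]].append(row)

    report = []
    for domain, items in grouped.items():
        positions = [r["position"] for r in items if r.get("position") is not None]
        report.append({
            "domain": domain,
            "total_citations": len(items),
            "avg_position": round(sum(positions) / len(positions), 2) if positions else None,
            "unique_pages": len({r["url"] for r in items}),
            "models": sorted({r["model"] for r in items}),
        })
    # Most-cited domains first; ties broken alphabetically.
    report.sort(key=lambda r: (-r["total_citations"], r["domain"]))
    return report


# Made-up sample rows, for illustration only.
sample_rows = [
    {"domain": "example.com", "url": "https://example.com/a", "model": "model-a", "position": 1},
    {"domain": "example.com", "url": "https://example.com/a", "model": "model-b", "position": 3},
    {"domain": "example.com", "url": "https://example.com/b", "model": "model-a", "position": 2},
    {"domain": "example.org", "url": "https://example.org/x", "model": "model-a", "position": 4},
]
domain_report = build_domain_report(sample_rows)
```

With these sample rows, `example.com` comes first with 3 citations across 2 unique pages and 2 models.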

In [None]:
import __main__
from pathlib import Path

scripts_dir = Path(__main__.PATHS["scripts"])
exec(compile(open(scripts_dir / "cell_05_domain_report.py").read(),
     str(scripts_dir / "cell_05_domain_report.py"), "exec"))

In [None]:
import __main__
from pathlib import Path

scripts_dir = Path(__main__.PATHS["scripts"])
exec(compile(open(scripts_dir / "cell_06_url_report.py").read(),
     str(scripts_dir / "cell_06_url_report.py"), "exec"))

# Step 5: Pull Awin Transaction Data

This cell fetches transaction-level data from the Awin API for your configured advertiser and date range.

The Awin API accepts a maximum date window of 31 days per request, so the tool automatically splits longer date ranges into consecutive chunks.
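
The chunking described above can be sketched as follows. `chunk_date_range` is a hypothetical helper, not necessarily what `cell_07_awin_transactions.py` does. It assumes both endpoints are inclusive and caps each window at 31 calendar days, so the gap between a window's start and end never exceeds 30 days.

```python
from datetime import date, timedelta


def chunk_date_range(start, end, max_days=31):
    """Split [start, end] (inclusive) into consecutive windows of at most max_days days."""
    if end < start:
        raise ValueError("end must not be before start")
    windows = []
    cursor = start
    while cursor <= end:
        window_end = min(cursor + timedelta(days=max_days - 1), end)
        windows.append((cursor, window_end))
        cursor = window_end + timedelta(days=1)  # next window starts the following day
    return windows


for window_start, window_end in chunk_date_range(date(2024, 1, 1), date(2024, 3, 15)):
    print(window_start.isoformat(), "to", window_end.isoformat())
# 2024-01-01 to 2024-01-31
# 2024-02-01 to 2024-03-02
# 2024-03-03 to 2024-03-15
```

Each window would then be sent as its own request and the results concatenated. Because the windows never overlap, no transaction is fetched twice.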

**Run the cell below, then click "Pull Transactions"** to fetch the data. You can optionally filter by transaction status.

In [None]:
import __main__
from pathlib import Path

scripts_dir = Path(__main__.PATHS["scripts"])
exec(compile((scripts_dir / "cell_07_awin_transactions.py").read_text(encoding="utf-8"),
     str(scripts_dir / "cell_07_awin_transactions.py"), "exec"))

# Step 6: Domain Match & Enriched Report

This step has two parts:

1. **Domain Match** — Matches PEEC-cited domains to Awin publisher domains using normalised hostnames (strips protocol, `www.`, paths)
2. **Enriched Report** — Combines the matched domains with Awin transaction metrics and AI model citation data

Run both cells in order. Use the exclude filter to remove noise domains (e.g. google, facebook).
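
As an illustration of the hostname normalisation described above, the sketch below strips the protocol, `www.`, and path before comparing the two sets. The names `normalise_domain` and `match_domains` are hypothetical, and the real logic in `cell_08_domain_match.py` may differ, for example in how it treats subdomains. This version only matches hosts that are identical after normalisation.

```python
import re
from urllib.parse import urlsplit


def normalise_domain(value):
    """Reduce a URL or bare domain to a lowercase hostname without 'www.'."""
    text = value.strip().lower()
    # Prefix bare domains with '//' so urlsplit parses them as a host, not a path.
    if not re.match(r"^([a-z][a-z0-9+.\-]*:)?//", text):
        text = "//" + text
    host = urlsplit(text).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return host


def match_domains(cited, publishers):
    """Return the sorted hostnames present in both inputs after normalisation."""
    cited_hosts = {normalise_domain(d) for d in cited} - {""}
    publisher_hosts = {normalise_domain(d) for d in publishers} - {""}
    return sorted(cited_hosts & publisher_hosts)


# Made-up inputs, for illustration only.
matched = match_domains(
    ["https://www.example.com/reviews/best", "blog.example.org"],
    ["http://example.com", "www.example.net"],
)
print(matched)  # ['example.com']
```

Exact matching is the conservative choice: `blog.example.org` does not match `example.org`, which avoids crediting a publisher for an unrelated subdomain but can miss genuine matches.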

In [None]:
import __main__
from pathlib import Path

scripts_dir = Path(__main__.PATHS["scripts"])
exec(compile((scripts_dir / "cell_08_domain_match.py").read_text(encoding="utf-8"),
     str(scripts_dir / "cell_08_domain_match.py"), "exec"))

In [None]:
import __main__
from pathlib import Path

scripts_dir = Path(__main__.PATHS["scripts"])
exec(compile((scripts_dir / "cell_09_enriched_report.py").read_text(encoding="utf-8"),
     str(scripts_dir / "cell_09_enriched_report.py"), "exec"))

# Step 7: Gap Analysis

This cell identifies domains and pages that AI models cite frequently but where you have **no Awin publisher relationship**. These are your potential recruitment targets.

Use the filters to:
- Focus on specific domain types (e.g. editorial, blog)
- Search for domains containing specific keywords
- Exclude noise domains (google, wikipedia, youtube, etc.)
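
At its core the gap analysis is a set difference with filters applied. The sketch below assumes a mapping of normalised domain to citation count plus a set of known Awin publisher domains. The `find_gaps` helper, its `min_citations` parameter, and the sample figures are illustrative assumptions rather than the behaviour of `cell_10_gap_analysis.py`.

```python
def find_gaps(citation_counts, publisher_domains, exclude_keywords=(), min_citations=1):
    """Return (domain, citations) pairs that are cited but absent from Awin publishers."""
    known = set(publisher_domains)
    gaps = []
    for domain, count in citation_counts.items():
        if domain in known or count < min_citations:
            continue
        if any(keyword in domain for keyword in exclude_keywords):
            continue  # noise domain such as a search engine or encyclopedia
        gaps.append((domain, count))
    # Highest-cited recruitment targets first; ties broken alphabetically.
    gaps.sort(key=lambda item: (-item[1], item[0]))
    return gaps


# Made-up citation counts, for illustration only.
citations = {"example.com": 40, "example.org": 25, "google.com": 90, "example.net": 3}
targets = find_gaps(
    citations,
    publisher_domains={"example.com"},
    exclude_keywords=("google", "wikipedia", "youtube"),
    min_citations=5,
)
print(targets)  # [('example.org', 25)]
```

Here `example.com` is dropped as an existing publisher, `google.com` as noise, and `example.net` for falling below the citation threshold.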

In [None]:
import __main__
from pathlib import Path

scripts_dir = Path(__main__.PATHS["scripts"])
exec(compile((scripts_dir / "cell_10_gap_analysis.py").read_text(encoding="utf-8"),
     str(scripts_dir / "cell_10_gap_analysis.py"), "exec"))