# Screener Pipeline Demo: Data Science Researcher Role

This notebook walks through the **screener pipeline** step by step:
1. **Extract** — Fetch GitHub repo artifacts (commits, PRs, issues) and normalize them.
2. **Evaluate** — Score the candidate's artifacts against a job description using AI (Gemini).

**Use case:** Score a candidate's GitHub repository for a **Data Science Researcher** role.

---

## Prerequisites

- **Next.js app running:** In a terminal run `npm run dev` so the API is at `http://localhost:3000`.
- **GitHub token:** A Personal Access Token (PAT) with `repo` scope, or complete OAuth in the browser and use the token from the redirect URL.
- **Environment:** The app must have `GEMINI_API_KEY` set for the evaluate step.


## Setup: imports and base URL


In [1]:
import os
import json
import requests

# Base URL of your Next.js app (default: local dev server)
BASE_URL = os.environ.get("SCREENER_API_URL", "http://localhost:3000")

def api(path):
    return f"{BASE_URL}{path}"

print(f"API base: {BASE_URL}")


API base: http://localhost:3000


## Step 1: GitHub token

You need a GitHub token to call the Extract API.

- **Option A (recommended for notebooks):** Create a [Personal Access Token](https://github.com/settings/tokens) with `repo` scope and set it in your environment as `GITHUB_TOKEN`, or paste it below (do not commit).
- **Option B:** Open `http://localhost:3000/api/auth/github` in a browser, authorize, then copy the `github_token` from the redirect URL and set it below.
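
For Option B, the token can be pulled out of the pasted redirect URL instead of being copied by hand. The sketch below assumes the token arrives as a `github_token` query parameter (the text above names the parameter but not where in the URL it sits). `token_from_redirect` and the example URL are illustrative and not part of the app.

```python
from urllib.parse import urlparse, parse_qs


def token_from_redirect(redirect_url: str) -> str:
    """Return the `github_token` query parameter, or "" if it is absent."""
    query = parse_qs(urlparse(redirect_url).query)
    values = query.get("github_token", [])
    return values[0] if values else ""


# Hypothetical redirect URL, for illustration only
example_redirect = "http://localhost:3000/?github_token=abc123&state=xyz"
print(token_from_redirect(example_redirect))
```

If the app returns the token in the URL fragment (after `#`) instead, read `urlparse(redirect_url).fragment` in place of `.query`.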


In [2]:
# Set your GitHub token (from env or paste here for testing — do not commit)
GITHUB_TOKEN = os.environ.get("GITHUB_TOKEN", "")

if not GITHUB_TOKEN:
    print("Set GITHUB_TOKEN in your environment or assign it in this cell.")
else:
    print(f"GitHub token is set (length = {len(GITHUB_TOKEN)})")


GitHub token is set (length = 40)


## Step 2: Extract — fetch and normalize repo artifacts

We call `POST /api/screener/extract` with a **repo URL** and the **token**. The API returns an **ArtifactBundle** (repo metadata, commits, PRs, issues, activity signals).

Use any public repo; for a Data Science researcher you might use a repo with notebooks, Python, or ML code.
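
Repo references are often written as `owner/repo` shorthand. Whether the extract endpoint accepts that form is not confirmed here, so the minimal sketch below assumes a full `https://github.com/...` URL is the safe input and normalizes to it. `to_repo_url` is a hypothetical helper, not part of the API.

```python
import re

# owner: letters, digits, hyphens; repo: letters, digits, dots, underscores, hyphens
_SHORTHAND = re.compile(r"^[A-Za-z0-9][A-Za-z0-9-]*/[A-Za-z0-9._-]+$")


def to_repo_url(ref: str) -> str:
    """Normalize 'owner/repo' or a GitHub URL to 'https://github.com/owner/repo'."""
    ref = ref.strip().rstrip("/")
    if ref.endswith(".git"):
        ref = ref[: -len(".git")]
    if ref.startswith("https://github.com/"):
        return ref
    if _SHORTHAND.match(ref):
        return f"https://github.com/{ref}"
    raise ValueError(f"Not a recognizable GitHub repo reference: {ref!r}")


print(to_repo_url("facebook/react"))
print(to_repo_url("https://github.com/eshaanmathakari/blockdawg/"))
```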


In [6]:
# Example: a candidate's GitHub repo (replace with a real public repo for testing)
REPO_URL = "https://github.com/eshaanmathakari/blockdawg"  # or e.g. facebook/react, owner/repo

response = requests.post(
    api("/api/screener/extract"),
    json={
        "repoUrl": REPO_URL,
        "token": GITHUB_TOKEN,
        "options": {
            "maxCommits": 50,
            "sinceDays": 90,
            "includeIssues": True,
            "includePRs": True,
        },
    },
    headers={"Content-Type": "application/json"},
    timeout=120,
)

if not response.ok:
    print("Extract failed:", response.status_code)
    # The error body may not be JSON (e.g. an HTML error page), so print the raw text
    print(response.text)
else:
    out = response.json()
    artifact_bundle = out["data"]
    print("Extract success. ArtifactBundle id:", artifact_bundle["id"])
    print("Repo:", artifact_bundle["repoMeta"]["fullName"])
    print("Commits:", len(artifact_bundle["commits"]))
    print("PRs:", len(artifact_bundle["pullRequests"]))
    print("Issues:", len(artifact_bundle["issues"]))
    print("Activity — commit freq (per week):", artifact_bundle["activitySignals"]["commitFrequency"])
    print("Languages:", list(artifact_bundle["repoMeta"]["languages"].keys())[:10])


Extract success. ArtifactBundle id: artifact_gz6ovh_mlcsf1ta
Repo: eshaanmathakari/blockdawg
Commits: 3
PRs: 0
Issues: 0
Activity — commit freq (per week): 0.3
Languages: ['Python', 'TypeScript', 'JavaScript', 'Solidity', 'CSS']


### Inspect artifact bundle (optional)

Pretty-print a subset of the bundle to understand what the evaluator will see.


In [7]:
# Show repo meta and first few commits
if response.ok:
    bundle = response.json()["data"]
    print("=== Repo meta ===")
    print(json.dumps(bundle["repoMeta"], indent=2))
    print("\n=== Sample commits (first 3) ===")
    for c in bundle["commits"][:3]:
        print(f"  {c['sha'][:7]} | {c['message'][:60]} | +{c['additions']}/-{c['deletions']}")


=== Repo meta ===
{
  "name": "blockdawg",
  "fullName": "eshaanmathakari/blockdawg",
  "description": "a sniffer that helps us retrace scammed txns and scammer's wallets",
  "languages": {
    "Python": 81.5,
    "TypeScript": 16.2,
    "JavaScript": 1.7,
    "Solidity": 0.4,
    "CSS": 0.2
  },
  "stars": 0,
  "forks": 0,
  "defaultBranch": "main",
  "createdAt": "2025-11-18T22:10:26Z",
  "updatedAt": "2026-01-28T15:51:24Z"
}

=== Sample commits (first 3) ===
  4cc66a0 | added minor changes to week 1 planned architecture | +251/-50
  60358b8 | Initial project setup with backend, frontend, and contracts | +20906/-0
  3072df2 | Initial commit | +209/-0
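
A quick way to put the commit list in context is to check how much of the total churn sits in a single commit. The sketch below relies only on the `sha`, `additions`, and `deletions` fields printed above; `commit_churn` is a hypothetical helper, and the sample data copies the three commits from the output (with the short SHAs, since the full ones are not shown).

```python
def commit_churn(commits):
    """Return total additions/deletions and the share of churn in the largest commit."""
    total_add = sum(c["additions"] for c in commits)
    total_del = sum(c["deletions"] for c in commits)
    total = total_add + total_del
    # Largest commit by lines touched; None when the list is empty
    largest = max(commits, key=lambda c: c["additions"] + c["deletions"], default=None)
    largest_size = largest["additions"] + largest["deletions"] if largest else 0
    return {
        "additions": total_add,
        "deletions": total_del,
        "largest_sha": largest["sha"][:7] if largest else None,
        "largest_share": round(largest_size / total, 3) if total else 0.0,
    }


# Values copied from the sample output above
sample_commits = [
    {"sha": "4cc66a0", "additions": 251, "deletions": 50},
    {"sha": "60358b8", "additions": 20906, "deletions": 0},
    {"sha": "3072df2", "additions": 209, "deletions": 0},
]
print(commit_churn(sample_commits))
```

For this repo, the project-setup commit accounts for roughly 98% of all changed lines, which is worth keeping in mind when reading the scores in Step 4.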


## Step 3: Define the job description (Data Science Researcher)

We define a **JobDescription** that matches the types expected by the Evaluate API. This one is tailored to a Data Science researcher role.


In [8]:
JOB_DESCRIPTION = {
    "id": "job_ds_researcher_001",
    "title": "Data Science Researcher",
    "company": "Research Lab",
    "description": "Conduct research in machine learning and statistical modeling. Publish and present findings; collaborate with product and engineering teams to deploy models.",
    "requirements": "PhD or equivalent in a quantitative field. Strong background in ML (supervised/unsupervised, deep learning). Proficiency in Python, PyTorch or TensorFlow, and SQL. Experience with large-scale data and reproducible research (version control, notebooks).",
    "dailyTasks": "Design experiments, implement and evaluate models, write papers and technical reports, review code and mentor junior researchers.",
    "expectedOutcomes": "Novel methods or improvements documented and published; production-ready models where applicable; clear documentation and reproducibility.",
    "techStack": ["Python", "PyTorch", "TensorFlow", "scikit-learn", "SQL", "Jupyter", "Git"],
    "experienceLevel": "mid to senior",
}

print("Job:", JOB_DESCRIPTION["title"], "at", JOB_DESCRIPTION["company"])


Job: Data Science Researcher at Research Lab
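
Before sending a job description to the API, it can help to catch missing or empty fields locally. The sketch below treats the keys used in `JOB_DESCRIPTION` above as the expected shape; that is an assumption drawn from this notebook, and the API's actual schema may be stricter or looser. `job_description_problems` is a hypothetical helper.

```python
# Assumed shape, mirroring the keys of JOB_DESCRIPTION in this notebook
EXPECTED_FIELDS = {
    "id": str,
    "title": str,
    "company": str,
    "description": str,
    "requirements": str,
    "dailyTasks": str,
    "expectedOutcomes": str,
    "techStack": list,
    "experienceLevel": str,
}


def job_description_problems(job: dict) -> list:
    """Return a list of human-readable problems; an empty list means the dict looks complete."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in job:
            problems.append(f"missing field: {name}")
        elif not isinstance(job[name], expected_type):
            problems.append(f"wrong type for {name}: expected {expected_type.__name__}")
        elif not job[name]:
            problems.append(f"empty field: {name}")
    return problems


# Deliberately incomplete example to show the report
incomplete_job = {"id": "job_001", "title": "Data Science Researcher", "techStack": "Python"}
for problem in job_description_problems(incomplete_job):
    print(problem)
```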


## Step 4: Evaluate — score candidate against the job

We call `POST /api/screener/evaluate` with the **ArtifactBundle** from Step 2 and the **JobDescription**. The API uses Gemini to analyze the artifacts and produce component scores plus an overall score (0–100) and explanation.


In [9]:
eval_response = requests.post(
    api("/api/screener/evaluate"),
    json={
        "artifactBundle": artifact_bundle,
        "jobDescription": JOB_DESCRIPTION,
    },
    headers={"Content-Type": "application/json"},
    timeout=120,
)

if not eval_response.ok:
    print("Evaluate failed:", eval_response.status_code)
    # The error body may not be JSON (e.g. an HTML error page), so print the raw text
    print(eval_response.text)
else:
    eval_data = eval_response.json()["data"]
    print("Evaluation id:", eval_data["id"])
    print("Overall score (0–100):", eval_data["overallScore"])
    print("Confidence:", eval_data["confidence"])
    print("\nExplanation:", eval_data["explanation"])
    print("\nComponent scores:")
    for cs in eval_data["componentScores"]:
        print(f"  {cs['category']}: {cs['score']} — {cs['reasoning']}")
    if eval_data.get("flaggedConcerns"):
        print("\nFlagged concerns:", eval_data["flaggedConcerns"])


Evaluation id: eval_artifact_job_ds_r_mlcsg1dt
Overall score (0–100): 10
Confidence: medium

Explanation: The candidate demonstrates foundational Python skills and the ability to set up multi-technology projects, but their limited experience (0.5 years) and focus on initial scaffolding of a blockchain application are a poor fit for a mid-to-senior Data Science Researcher role. There is a significant lack of evidence for core ML expertise, research experience, and collaborative development practices required by the job description.

Component scores:
  skills_alignment: 10 — The candidate demonstrates intermediate Python skills but lacks evidence of proficiency in core ML frameworks (PyTorch, TensorFlow, scikit-learn), SQL, or experience with large-scale data, which are critical for this role.
  code_quality: 15 — While commit messages are clear, the candidate shows poor or no evidence of PR documentation, testing practices, or code review participation, which are crucial for reproducib
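
To see at a glance which components drag the overall score down, the component list can be filtered and sorted. This is a minimal sketch: it assumes component scores share the 0 to 100 scale of the overall score, the threshold of 50 is arbitrary, and `components_below` is a hypothetical helper. The sample data copies only the categories and scores from the output above.

```python
def components_below(evaluation: dict, threshold: int = 50) -> list:
    """Return (category, score) pairs scoring below `threshold`, lowest first."""
    low = [
        (cs["category"], cs["score"])
        for cs in evaluation["componentScores"]
        if cs["score"] < threshold
    ]
    return sorted(low, key=lambda pair: pair[1])


# Categories and scores copied from the output above; other fields omitted
sample_evaluation = {
    "overallScore": 10,
    "componentScores": [
        {"category": "code_quality", "score": 15},
        {"category": "skills_alignment", "score": 10},
    ],
}
for category, score in components_below(sample_evaluation):
    print(f"{category}: {score}")
```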

## Step 5: Full pipeline in one go (optional)

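You can also run **extract** and **evaluate** back-to-back. The function below wraps both calls: it extracts the bundle, evaluates it against the given job, and raises on any HTTP error instead of printing it. The example call is commented out; uncomment it to run the pipeline and print the score and explanation.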


In [None]:
def run_full_pipeline(repo_url: str, token: str, job: dict):
    """Extract artifacts from repo, then evaluate against job. Returns (artifact_bundle, evaluation)."""
    r1 = requests.post(
        api("/api/screener/extract"),
        json={"repoUrl": repo_url, "token": token, "options": {"maxCommits": 50, "sinceDays": 90}},
        headers={"Content-Type": "application/json"},
        timeout=120,
    )
    r1.raise_for_status()
    bundle = r1.json()["data"]

    r2 = requests.post(
        api("/api/screener/evaluate"),
        json={"artifactBundle": bundle, "jobDescription": job},
        headers={"Content-Type": "application/json"},
        timeout=120,
    )
    r2.raise_for_status()
    evaluation = r2.json()["data"]
    return bundle, evaluation


# Uncomment and run to test full pipeline with your repo + token:
# bundle, evaluation = run_full_pipeline(REPO_URL, GITHUB_TOKEN, JOB_DESCRIPTION)
# print("Score:", evaluation["overallScore"], "|", evaluation["explanation"])


## Summary

| Step | API | Input | Output |
|------|-----|--------|--------|
| 1 | — | GitHub token (env or OAuth) | Token for API calls |
| 2 | `POST /api/screener/extract` | `repoUrl`, `token`, `options` | `ArtifactBundle` |
| 3 | — | Role description | `JobDescription` dict |
| 4 | `POST /api/screener/evaluate` | `artifactBundle`, `jobDescription` | `EvaluationResult` (score, components, explanation) |

To screen a **different repository**, change `REPO_URL` and re-run from Step 2. To test a **different role**, edit `JOB_DESCRIPTION` in Step 3, then re-run Steps 3 and 4 (or run the full pipeline from Step 5).
