# Grok Recruiter - Taste-Graph Talent Discovery

This notebook walks through each phase of the discovery pipeline:

1. **Setup** - Load config and initialize clients
2. **Phase 1** - Resolve seed accounts (xAI employees)
3. **Phase 2** - Expand graph (following > retweets > replies; likes are skipped on the Free/Basic X API tier)
4. **Phase 3** - Hydrate user profiles & filter candidates
5. **Phase 4a** - Fast LLM screening (bio + pinned tweet)
6. **Phase 4b** - PageRank ranking (on fast-screened candidates)
7. **Phase 5** - Deep Evaluation with xAI Search Tools
8. **Phase 6** - Export & UI

## Setup - Load Config & Initialize Clients

In [5]:
import sys
from pathlib import Path

import yaml
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Import our modules
from src.x_client import XClient
from src.grok_client import GrokClient, load_criteria
from src.graph_builder import GraphBuilder, Node, Edge
from src.evaluator import CandidateEvaluator, PreFilterResult, print_evaluation_summary
from src.ranking import compute_rankings, export_ranked_nodes

print("Modules loaded successfully!")

Modules loaded successfully!


In [6]:
# Load configuration
CONFIG_PATH = "seeds.yaml"
CRITERIA_PATH = "criteria.yaml"
CACHE_DIR = "data/raw"
EVAL_CACHE_DIR = "data/evaluations"
OUTPUT_DIR = "data/processed"

with open(CONFIG_PATH, "r") as f:
    config = yaml.safe_load(f)

roots = config.get("roots", [])
settings = config.get("settings", {})

print(f"Loaded {len(roots)} seed accounts")
print(f"Settings: {settings}")

Loaded 90 seed accounts
Settings: {'max_depth': 1, 'max_followers_candidate': 50000, 'min_followers_candidate': 50, 'max_following_per_root': 500, 'max_liked_tweets': 100, 'max_root_tweets': 50, 'max_likers_per_tweet': 100, 'max_retweeters_per_tweet': 100, 'max_replies_per_conversation': 50}


In [7]:
# Initialize X API client
x_client = XClient(cache_dir=CACHE_DIR)
print("X API client initialized")

# Initialize Grok client
grok_client = GrokClient(cache_dir=EVAL_CACHE_DIR)
print("Grok client initialized")
print(f"  Fast model: {grok_client.fast_model}")
print(f"  Full model: {grok_client.model}")

X API client initialized
Grok client initialized
  Fast model: grok-4-1-fast-non-reasoning
  Full model: grok-4-1-fast-reasoning


## Phase 1 - Resolve Seed Accounts

Convert seed handles to user objects with IDs and metadata.

**No LLM calls** - just X API
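
The output of the resolution cell shows lookups served from files such as `user_elonmusk.json`. Below is a minimal sketch of that read-through cache pattern, assuming one JSON file per handle. `cached_lookup` and `fetch_fn` are hypothetical names used for illustration; the real `XClient` internals are not shown in this notebook and may differ.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Optional


def cached_lookup(cache_dir: Path, handle: str,
                  fetch_fn: Callable[[str], Optional[dict]]) -> Optional[dict]:
    """Return the user dict for `handle`, reading a JSON cache file first.

    Hypothetical helper: the file naming mirrors the log lines
    (user_<handle>.json), but the real XClient may work differently.
    """
    cache_file = cache_dir / f"user_{handle}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())

    user = fetch_fn(handle)  # the only place an API call would happen
    if user is not None:
        cache_dir.mkdir(parents=True, exist_ok=True)
        cache_file.write_text(json.dumps(user))
    return user


# Demo with a stand-in fetcher that records how often it is called.
calls = []


def fake_fetch(handle: str) -> Optional[dict]:
    calls.append(handle)
    return {"id": "1", "username": handle}


demo_dir = Path(tempfile.mkdtemp())
first = cached_lookup(demo_dir, "example", fake_fetch)
second = cached_lookup(demo_dir, "example", fake_fetch)  # served from disk
```

With a cache like this, re-running the phase after a kernel restart costs no API quota for handles that were already resolved.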

In [8]:
# Initialize graph builder
builder = GraphBuilder(
    x_client=x_client,
    roots=roots,
    settings=settings,
)

print(f"Graph builder initialized with {len(roots)} roots")

Graph builder initialized with 90 roots


In [9]:
# Phase 1: Resolve root accounts
print("=" * 60)
print("PHASE 1: Resolving root accounts")
print("=" * 60)

root_users = {}

for handle in roots:
    user = x_client.get_user_by_username(handle)
    if not user:
        print(f"  [warn] Could not resolve @{handle}")
        continue
    
    root_users[handle] = user
    builder._add_or_update_node(user, is_root=True)
    followers = user.get('public_metrics', {}).get('followers_count', 0)
    print(f"  [root] @{handle} (ID: {user['id']}, followers: {followers:,})")

print(f"\nResolved {len(root_users)}/{len(roots)} root accounts")

PHASE 1: Resolving root accounts
  Fetching user: @elonmusk
  [cache] Loading from user_elonmusk.json
  [root] @elonmusk (ID: 44196397, followers: 229,420,382)
  Fetching user: @A5_Wagyu
  [cache] Loading from user_A5_Wagyu.json
  [root] @A5_Wagyu (ID: 265152280, followers: 409)
  Fetching user: @APrerepa
  [cache] Loading from user_APrerepa.json
  [root] @APrerepa (ID: 1146180676091170817, followers: 2,254)
  Fetching user: @AdamSliwakowski
  [cache] Loading from user_AdamSliwakowski.json
  [root] @AdamSliwakowski (ID: 1513813650, followers: 371)
  Fetching user: @a2xai
  [cache] Loading from user_a2xai.json
  [root] @a2xai (ID: 1173974870, followers: 2,081)
  Fetching user: @adityagupta
  [cache] Loading from user_adityagupta.json
  [root] @adityagupta (ID: 2269061347, followers: 7,935)
  Fetching user: @aknowntheory
  [cache] Loading from user_aknowntheory.json
  [root] @aknowntheory (ID: 630619183, followers: 516)
  Fetching user: @Alexjpeng
  [cache] Loading from user_Alexjpeng.js

## Phase 2 - Expand Graph

Expand from seeds using priority order:
1. **Following** (weight 5.0) - Who seeds chose to follow
2. **Retweets** (weight 3.0) - Who retweeted seed tweets
3. **Replies** (weight 2.5) - Who replied to seed tweets

**Note:** The likes endpoint is not available on the Free/Basic X API tier, so like-based expansion is skipped.

**No LLM calls** - just X API
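
To make the weights above concrete, here is a small sketch of how per-edge weights could be combined into a per-candidate score. The plain sum is an assumption for illustration; how `GraphBuilder` and the ranking module actually aggregate weights is not shown in this notebook. `INTERACTION_WEIGHTS` and `score_candidates` are hypothetical names.

```python
from collections import defaultdict

# Weights copied from the list above; "like" is omitted because that
# endpoint is skipped on this API tier.
INTERACTION_WEIGHTS = {"follow": 5.0, "retweet": 3.0, "reply": 2.5}


def score_candidates(edges):
    """Sum interaction weights per destination user.

    `edges` is an iterable of (src_id, dst_id, interaction_type) tuples.
    Unknown interaction types contribute nothing.
    """
    scores = defaultdict(float)
    for _src, dst, interaction_type in edges:
        scores[dst] += INTERACTION_WEIGHTS.get(interaction_type, 0.0)
    return dict(scores)


demo_edges = [
    ("seed_a", "cand_1", "follow"),
    ("seed_b", "cand_1", "reply"),
    ("seed_a", "cand_2", "retweet"),
]
demo_scores = score_candidates(demo_edges)
# cand_1 gets 5.0 + 2.5, cand_2 gets 3.0
```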

In [10]:
# Phase 2a: Expand via FOLLOWING (strongest signal)
# Processing ALL seeds (not just first 5)
print("=" * 60)
print("PHASE 2a: Expanding via FOLLOWING (strongest signal)")
print("=" * 60)

for handle, root_user in root_users.items():  # ALL seeds
    root_id = root_user["id"]
    print(f"\n[root] @{handle} - fetching following...")
    builder._expand_following(root_id)

print(f"\nNodes so far: {len(builder.nodes)}")
print(f"Edges so far: {len(builder.edges)}")

PHASE 2a: Expanding via FOLLOWING (strongest signal)

[root] @elonmusk - fetching following...
  Fetching following for user 44196397
  [cache] Loading from following_44196397_page0.json
    Found 500 accounts followed
  [following] Processing 500 accounts followed by root

[root] @A5_Wagyu - fetching following...
  Fetching following for user 265152280
  [cache] Loading from following_265152280_page0.json
    Found 480 accounts followed
  [following] Processing 480 accounts followed by root

[root] @APrerepa - fetching following...
  Fetching following for user 1146180676091170817
  [cache] Loading from following_1146180676091170817_page0.json
    Found 378 accounts followed
  [following] Processing 378 accounts followed by root

[root] @AdamSliwakowski - fetching following...
  Fetching following for user 1513813650
  [cache] Loading from following_1513813650_page0.json
    Found 491 accounts followed
  [following] Processing 491 accounts followed by root

[root] @a2xai - fetching fo

In [11]:
# Phase 2c: Expand via ROOT TWEETS (retweets, replies)
# This phase is OPTIONAL - following data (Phase 2a) is the strongest signal
# Skip this if hitting rate limits

ENABLE_PHASE_2C = False  # Set to True to enable (will hit rate limits with 90 seeds)

if ENABLE_PHASE_2C:
    print("=" * 60)
    print("PHASE 2c: Expanding via ROOT TWEETS (retweets, replies)")
    print("=" * 60)
    
    for handle, root_user in root_users.items():
        root_id = root_user["id"]
        print(f"\n[root] @{handle} - fetching tweet engagements...")
        builder._expand_root_tweets(root_id)
    
    print(f"\nNodes so far: {len(builder.nodes)}")
    print(f"Edges so far: {len(builder.edges)}")
else:
    print("=" * 60)
    print("PHASE 2c: SKIPPED (rate limit protection)")
    print("=" * 60)
    print("  [info] Following data from Phase 2a is the strongest signal")
    print("  [info] Set ENABLE_PHASE_2C = True to enable retweet/reply expansion")
    print(f"\nNodes so far: {len(builder.nodes)}")
    print(f"Edges so far: {len(builder.edges)}")

In [12]:
# Phase 3: Hydrate pending users
print("=" * 60)
print("PHASE 3: Hydrating discovered users")
print("=" * 60)

builder._hydrate_pending_users()

# Count results
total_nodes = len(builder.nodes)
total_edges = len(builder.edges)
root_count = sum(1 for n in builder.nodes.values() if n.is_root)
candidate_count = sum(1 for n in builder.nodes.values() if n.is_candidate)

# Edge breakdown
edge_counts = {}
for edge in builder.edges:
    edge_counts[edge.interaction_type] = edge_counts.get(edge.interaction_type, 0) + 1

print("\n" + "=" * 60)
print("DISCOVERY COMPLETE (Phase 1-3)")
print("=" * 60)
print(f"  Total nodes: {total_nodes:,}")
print(f"  Total edges: {total_edges:,}")
print(f"  Root accounts: {root_count}")
print(f"  Candidate accounts: {candidate_count:,}")
print(f"  Edge breakdown: {edge_counts}")

# Show filtered xAI/X employees
if hasattr(builder, 'filtered_employees') and builder.filtered_employees:
    print(f"\n  [FILTERED] {len(builder.filtered_employees)} xAI/X employees excluded:")
    for handle in builder.filtered_employees[:20]:
        print(f"    - @{handle}")
    if len(builder.filtered_employees) > 20:
        print(f"    ... and {len(builder.filtered_employees) - 20} more")

PHASE 3: Hydrating discovered users

DISCOVERY COMPLETE (Phase 1-3)
  Total nodes: 15,748
  Total edges: 27,266
  Root accounts: 89
  Candidate accounts: 10,694
  Edge breakdown: {'follow': 27266}


## Phase 3b - Multi-depth Expansion

Skip any earlier multi-depth expansion cell and run the one below (cell-14), which uses rate-limit-aware settings.
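
The expansion cell picks, at each depth, the candidates with the most incoming follow edges and expands only those. The sketch below isolates that selection step and adds a rough request budget, assuming one following request per expanded candidate (pagination would raise the count). `top_k_by_in_degree` and `max_following_requests` are hypothetical helper names, not part of `GraphBuilder`.

```python
from collections import Counter


def top_k_by_in_degree(follow_edges, eligible, k):
    """Return the k eligible users followed by the most sources.

    `follow_edges` is an iterable of (src_id, dst_id) pairs and
    `eligible` is the set of ids that are allowed to be expanded.
    """
    in_degree = Counter(dst for _src, dst in follow_edges if dst in eligible)
    return in_degree.most_common(k)


def max_following_requests(max_depth, top_k):
    """Worst-case request count, assuming one request per expanded user."""
    return (max_depth - 1) * top_k  # depths 2..max_depth inclusive


demo = top_k_by_in_degree(
    [("s1", "a"), ("s2", "a"), ("s1", "b"), ("s2", "c"), ("s3", "a"), ("s3", "b")],
    eligible={"a", "b"},  # "c" is ignored even though it has an edge
    k=2,
)
budget = max_following_requests(max_depth=5, top_k=15)
```

With the values used in this notebook (depth 5, top 15 per depth), the budget works out to 60 following requests under that one-request assumption.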

In [13]:
# Phase 3b: Multi-depth expansion (up to depth 5)
# RATE LIMIT AWARE: Reduced parameters to stay under API limits

MAX_DEPTH = 5           # Go up to 5 hops from seeds
TOP_K_PER_DEPTH = 15    # Expand from top 15 candidates at each depth
MAX_FOLLOWING = 50      # Fetch up to 50 accounts each follows

print("=" * 60)
print(f"MULTI-DEPTH EXPANSION (up to depth {MAX_DEPTH})")
print("=" * 60)
print(f"  Config: top_k={TOP_K_PER_DEPTH}, max_following={MAX_FOLLOWING}")

for depth in range(2, MAX_DEPTH + 1):
    print(f"\n{'='*40}")
    print(f"DEPTH {depth}")
    print(f"{'='*40}")
    
    # Find candidates from previous depth with highest in-degree
    candidate_scores = {}
    
    for edge in builder.edges:
        if edge.interaction_type == "follow" and edge.depth == depth - 1:
            dst = edge.dst_user_id
            if dst in builder.nodes and builder.nodes[dst].is_candidate:
                candidate_scores[dst] = candidate_scores.get(dst, 0) + 1
    
    if not candidate_scores:
        print(f"  No candidates with incoming edges at depth {depth-1}, stopping expansion")
        break
    
    # Sort by score and take top k
    top_candidates = sorted(candidate_scores.items(), key=lambda x: x[1], reverse=True)[:TOP_K_PER_DEPTH]
    
    print(f"  Expanding from top {len(top_candidates)} depth-{depth-1} candidates")
    
    new_discoveries = 0
    
    for user_id, score in top_candidates:
        node = builder.nodes.get(user_id)
        if not node:
            continue
        
        print(f"    @{node.handle} (score: {score})...", end=" ")
        
        # Get who this candidate follows
        following = builder.x.get_user_following(user_id, max_results=MAX_FOLLOWING)
        
        added = 0
        for user in following:
            fol_id = user.get("id")
            if not fol_id or fol_id == user_id:
                continue
            
            # Skip if already in graph
            if fol_id in builder.nodes:
                continue
            
            # Add new node
            new_node = builder._add_or_update_node(user, discovered_via=f"depth{depth}_following")
            if new_node:
                new_discoveries += 1
                added += 1
                
                # Add edge with current depth
                builder._add_edge(
                    src_user_id=user_id,
                    dst_user_id=fol_id,
                    interaction_type="follow",
                    tweet_id="",
                    created_at="",
                    depth=depth,
                )
        
        print(f"+{added} new")
    
    print(f"\n  Total discovered at depth {depth}: {new_discoveries}")
    
    # Hydrate new users
    if new_discoveries > 0:
        builder._hydrate_pending_users()
    else:
        print(f"  No new discoveries at depth {depth}, stopping expansion")
        break

# Final summary
print(f"\n{'='*60}")
print("MULTI-DEPTH EXPANSION COMPLETE")
print(f"{'='*60}")
print(f"  Total nodes: {len(builder.nodes):,}")
print(f"  Total edges: {len(builder.edges):,}")

# Count by depth
depth_counts = {}
for node in builder.nodes.values():
    via = node.discovered_via
    if via.startswith("depth") and via.endswith("_following"):
        d = via.replace("depth", "").replace("_following", "")
        depth_counts[f"depth_{d}"] = depth_counts.get(f"depth_{d}", 0) + 1
    elif via == "followed_by_root":
        depth_counts["depth_1"] = depth_counts.get("depth_1", 0) + 1
    elif via in ["retweeted_root", "replied_to_root"]:
        depth_counts["depth_1_engagement"] = depth_counts.get("depth_1_engagement", 0) + 1

print("\nCandidate breakdown by depth:")
for d, count in sorted(depth_counts.items()):
    print(f"  {d}: {count:,}")

MULTI-DEPTH EXPANSION (up to depth 5)
  Config: top_k=15, max_following=50

DEPTH 2
  Expanding from top 15 depth-1 candidates
    @jimmybajimmyba (score: 48)...   Fetching following for user 1588323862110162944
  [cache] Loading from following_1588323862110162944_page0.json
    Found 100 accounts followed
+11 new
    @ericzelikman (score: 43)...   Fetching following for user 137463715
  [cache] Loading from following_137463715_page0.json
    Found 100 accounts followed
+32 new
    @Yuhu_ai_ (score: 42)...   Fetching following for user 881959726958862337
  [cache] Loading from following_881959726958862337_page0.json
    Found 100 accounts followed
+3 new
    @HeinrichKuttler (score: 38)...   Fetching following for user 1260917258970451969
  [cache] Loading from following_1260917258970451969_page0.json
    Found 100 accounts followed
+19 new
    @veggie_eric (score: 37)...   Fetching following for user 1219282049070063617
  [cache] Loading from following_1219282049070063617_page0.json
 

In [14]:
# Prepare candidates for evaluation
# Priority: followed_by_root > liked_by_root > retweeted_root > replied_to_root > liked_root_tweet
# Any other value (e.g. depthN_following) falls back to 99 and sorts last
discovery_priority = {
    "followed_by_root": 0,
    "liked_by_root": 1,
    "retweeted_root": 2,
    "replied_to_root": 3,
    "liked_root_tweet": 4,
}

candidates = [
    {
        "user_id": node.user_id,
        "handle": node.handle,
        "bio": node.bio,
        "followers_count": node.followers_count,
        "tweet_count": node.tweet_count,
        "discovered_via": node.discovered_via,
    }
    for node in builder.nodes.values()
    if node.is_candidate
]

# Sort by discovery priority
candidates.sort(key=lambda c: discovery_priority.get(c.get("discovered_via", ""), 99))

print(f"Prepared {len(candidates):,} candidates for evaluation")
print(f"\nDiscovery method breakdown:")
discovery_counts = {}
for c in candidates:
    via = c.get("discovered_via", "unknown")
    discovery_counts[via] = discovery_counts.get(via, 0) + 1
for via, count in sorted(discovery_counts.items(), key=lambda x: discovery_priority.get(x[0], 99)):
    print(f"  {via}: {count:,}")

Prepared 12,050 candidates for evaluation

Discovery method breakdown:
  followed_by_root: 10,694
  depth2_following: 204
  depth3_following: 314
  depth4_following: 403
  depth5_following: 435


In [15]:
import networkx as nx
import matplotlib.pyplot as plt
from collections import Counter

def visualize_graph(builder: GraphBuilder, max_nodes: int = 500, interaction_filter: list = None, 
                    layout: str = "spring", show_labels: bool = False):
    '''
    Visualize the graph showing different connection types.
    
    Args:
        builder: GraphBuilder instance with nodes and edges
        max_nodes: Maximum number of nodes to display (for performance)
        interaction_filter: List of interaction types to show (e.g., ['retweet', 'reply'])
                          If None, shows all types
        layout: Layout algorithm ('spring', 'kamada_kawai', 'circular')
        show_labels: Whether to show node labels (can be slow for large graphs)
    '''
    G = nx.DiGraph()
    
    edge_type_colors = {
        'follow': '#1DA1F2',    # Twitter blue
        'retweet': '#17BF63',   # Green 
        'reply': '#FFAD1F',     # Orange
        'like': '#E0245E',      # Red/pink
        'quote': '#794BC4'      # Purple
    }
    
    edge_type_widths = {
        'follow': 2.0,
        'retweet': 1.5,
        'reply': 1.2,
        'like': 0.8,
        'quote': 1.0
    }
    
    root_ids = {n.user_id for n in builder.nodes.values() if n.is_root}
    
    # Build full graph first
    for edge in builder.edges:
        if interaction_filter and edge.interaction_type not in interaction_filter:
            continue
        G.add_edge(edge.src_user_id, edge.dst_user_id, 
                   interaction_type=edge.interaction_type, weight=edge.weight)
    
    # Add node attributes
    for node_id in G.nodes():
        if node_id in builder.nodes:
            node = builder.nodes[node_id]
            G.nodes[node_id]['handle'] = node.handle
            G.nodes[node_id]['is_root'] = node.is_root
            G.nodes[node_id]['is_candidate'] = node.is_candidate
            G.nodes[node_id]['followers'] = node.followers_count
    
    print(f"Full graph: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges")
    
    # If too large, sample while preserving connectivity
    if G.number_of_nodes() > max_nodes:
        print(f"Sampling {max_nodes} most connected nodes (keeping all roots)...")
        
        # Always keep roots
        nodes_to_keep = set(root_ids & set(G.nodes()))
        
        # Get nodes by degree (most connected first)
        node_degrees = dict(G.degree())
        sorted_nodes = sorted(node_degrees.items(), key=lambda x: x[1], reverse=True)
        
        # Add highest-degree nodes until we hit max
        for node_id, degree in sorted_nodes:
            if len(nodes_to_keep) >= max_nodes:
                break
            nodes_to_keep.add(node_id)
        
        # Create subgraph
        G = G.subgraph(nodes_to_keep).copy()
        print(f"Sampled graph: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges")
    
    # Count edges by type
    edge_counts_by_type = Counter()
    for _, _, d in G.edges(data=True):
        edge_counts_by_type[d.get('interaction_type', 'unknown')] += 1
    
    print(f"Edge breakdown: {dict(edge_counts_by_type)}")
    
    # Choose layout
    print(f"Computing {layout} layout...")
    if layout == "kamada_kawai":
        pos = nx.kamada_kawai_layout(G)
    elif layout == "circular":
        pos = nx.circular_layout(G)
    else:  # spring
        pos = nx.spring_layout(G, k=1.5/len(G.nodes())**0.5, iterations=50, seed=42)
    
    # Create figure
    fig, ax = plt.subplots(figsize=(16, 12))
    
    # Categorize nodes
    root_nodes = [n for n in G.nodes() if G.nodes[n].get('is_root', False)]
    candidate_nodes = [n for n in G.nodes() if G.nodes[n].get('is_candidate', False) and n not in root_nodes]
    other_nodes = [n for n in G.nodes() if n not in root_nodes and n not in candidate_nodes]
    
    # Draw edges by type (bottom layer)
    for interaction_type in ['like', 'reply', 'retweet', 'follow', 'quote']:
        edges_of_type = [(u, v) for u, v, d in G.edges(data=True) 
                         if d.get('interaction_type') == interaction_type]
        if edges_of_type:
            nx.draw_networkx_edges(
                G, pos, edgelist=edges_of_type, ax=ax,
                edge_color=edge_type_colors.get(interaction_type, '#CCCCCC'),
                width=edge_type_widths.get(interaction_type, 1.0),
                alpha=0.3,
                arrowsize=8,
                arrowstyle='-|>',
                connectionstyle='arc3,rad=0.1'
            )
    
    # Draw nodes (top layer)
    nx.draw_networkx_nodes(G, pos, nodelist=other_nodes, node_color='#B8D4E3', 
                          node_size=30, alpha=0.5, ax=ax)
    nx.draw_networkx_nodes(G, pos, nodelist=candidate_nodes, node_color='#4ECDC4', 
                          node_size=80, alpha=0.7, ax=ax)
    nx.draw_networkx_nodes(G, pos, nodelist=root_nodes, node_color='#FF6B6B', 
                          node_size=200, alpha=0.9, ax=ax)
    
    # Draw labels for roots
    if show_labels or len(root_nodes) <= 20:
        root_labels = {n: G.nodes[n].get('handle', '')[:12] for n in root_nodes}
        nx.draw_networkx_labels(G, pos, root_labels, font_size=7, font_weight='bold', ax=ax)
    
    # Create legend
    from matplotlib.lines import Line2D
    from matplotlib.patches import Patch
    
    legend_elements = [
        Patch(facecolor='#FF6B6B', label=f'Seeds ({len(root_nodes)})'),
        Patch(facecolor='#4ECDC4', label=f'Candidates ({len(candidate_nodes)})'),
        Patch(facecolor='#B8D4E3', label=f'Other ({len(other_nodes)})'),
        Line2D([0], [0], color='white', label=''),  # Spacer
    ]
    
    # Add edge type legends
    for itype in ['follow', 'retweet', 'reply', 'like', 'quote']:
        if edge_counts_by_type.get(itype, 0) > 0:
            legend_elements.append(
                Line2D([0], [0], color=edge_type_colors[itype], 
                       linewidth=edge_type_widths[itype], 
                       label=f'{itype.capitalize()} ({edge_counts_by_type[itype]:,})')
            )
    
    ax.legend(handles=legend_elements, loc='upper left', fontsize=9, framealpha=0.9)
    
    ax.set_title(f'Taste Graph: {G.number_of_nodes():,} nodes, {G.number_of_edges():,} edges', 
                fontsize=14, fontweight='bold')
    ax.axis('off')
    plt.tight_layout()
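    # Write into OUTPUT_DIR and create it first so savefig cannot fail on a missing folder
    out_path = Path(OUTPUT_DIR) / "graph_visualization.png"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    plt.savefig(out_path, dpi=150, bbox_inches='tight')
    print(f"Saved to {out_path}")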
    plt.show()
    
    return G

print("Graph visualization function ready!")

Graph visualization function ready!


In [16]:
# Multi-depth visualization function (supports depth 1-5)
def visualize_by_depth(builder: GraphBuilder, max_nodes: int = 1000, max_depth: int = 5):
    """Visualize graph colored by discovery depth (supports up to depth 5)."""
    import networkx as nx
    import matplotlib.pyplot as plt
    from collections import Counter
    
    G = nx.DiGraph()
    
    # Build graph
    for edge in builder.edges:
        G.add_edge(edge.src_user_id, edge.dst_user_id, 
                   depth=edge.depth, interaction_type=edge.interaction_type)
    
    # Add node attributes
    for node_id in G.nodes():
        if node_id in builder.nodes:
            node = builder.nodes[node_id]
            G.nodes[node_id]['handle'] = node.handle
            G.nodes[node_id]['is_root'] = node.is_root
            G.nodes[node_id]['is_candidate'] = node.is_candidate
            G.nodes[node_id]['discovered_via'] = node.discovered_via
            
            # Parse depth from discovered_via
            via = node.discovered_via
            if node.is_root:
                G.nodes[node_id]['depth'] = 0
            elif via == "followed_by_root" or via in ["retweeted_root", "replied_to_root"]:
                G.nodes[node_id]['depth'] = 1
            elif via.startswith("depth") and via.endswith("_following"):
                try:
                    G.nodes[node_id]['depth'] = int(via.replace("depth", "").replace("_following", ""))
                except ValueError:  # non-numeric depth label
                    G.nodes[node_id]['depth'] = 99
            else:
                G.nodes[node_id]['depth'] = 99
    
    # Sample if needed
    if G.number_of_nodes() > max_nodes:
        root_ids = {n.user_id for n in builder.nodes.values() if n.is_root}
        nodes_to_keep = set(root_ids & set(G.nodes()))
        node_degrees = dict(G.degree())
        sorted_nodes = sorted(node_degrees.items(), key=lambda x: x[1], reverse=True)
        for node_id, _ in sorted_nodes:
            if len(nodes_to_keep) >= max_nodes:
                break
            nodes_to_keep.add(node_id)
        G = G.subgraph(nodes_to_keep).copy()
    
    print(f"Visualizing {G.number_of_nodes()} nodes, {G.number_of_edges()} edges")
    
    # Layout
    pos = nx.spring_layout(G, k=2.0/len(G.nodes())**0.5, iterations=50, seed=42)
    
    fig, ax = plt.subplots(figsize=(18, 14))
    
    # Color palette for depths
    depth_colors = {
        0: '#E74C3C',   # Red - Seeds
        1: '#3498DB',   # Blue - Depth 1
        2: '#9B59B6',   # Purple - Depth 2
        3: '#27AE60',   # Green - Depth 3
        4: '#F39C12',   # Orange - Depth 4
        5: '#1ABC9C',   # Teal - Depth 5
        99: '#BDC3C7',  # Gray - Other
    }
    
    depth_sizes = {0: 250, 1: 100, 2: 70, 3: 50, 4: 40, 5: 35, 99: 20}
    
    # Categorize by depth
    nodes_by_depth = {d: [] for d in range(max_depth + 1)}
    nodes_by_depth[99] = []  # Other
    
    for n in G.nodes():
        d = G.nodes[n].get('depth', 99)
        if d in nodes_by_depth:
            nodes_by_depth[d].append(n)
        else:
            nodes_by_depth[99].append(n)
    
    # Draw edges by depth
    for d in range(1, max_depth + 1):
        edges_at_depth = [(u, v) for u, v, data in G.edges(data=True) if data.get('depth', 1) == d]
        if edges_at_depth:
            nx.draw_networkx_edges(G, pos, edgelist=edges_at_depth, ax=ax,
                                  edge_color=depth_colors.get(d, '#CCCCCC'), 
                                  width=0.8, alpha=0.25, arrowsize=5)
    
    # Draw nodes by depth (from deepest to seeds so seeds are on top)
    for d in sorted(nodes_by_depth.keys(), reverse=True):
        if nodes_by_depth[d]:
            nx.draw_networkx_nodes(G, pos, nodelist=nodes_by_depth[d], 
                                  node_color=depth_colors.get(d, '#BDC3C7'),
                                  node_size=depth_sizes.get(d, 30), 
                                  alpha=0.7 if d > 0 else 0.95, ax=ax)
    
    # Labels for seeds
    seed_labels = {n: G.nodes[n].get('handle', '')[:10] for n in nodes_by_depth[0]}
    nx.draw_networkx_labels(G, pos, seed_labels, font_size=6, font_weight='bold', ax=ax)
    
    # Build legend
    from matplotlib.patches import Patch
    legend_elements = []
    for d in range(max_depth + 1):
        if nodes_by_depth[d]:
            label = "Seeds" if d == 0 else f"Depth {d}"
            legend_elements.append(Patch(facecolor=depth_colors[d], label=f'{label} ({len(nodes_by_depth[d])})'))
    if nodes_by_depth[99]:
        legend_elements.append(Patch(facecolor=depth_colors[99], label=f'Other ({len(nodes_by_depth[99])})'))
    
    ax.legend(handles=legend_elements, loc='upper left', fontsize=10)
    
    # Title showing the cascade
    title_parts = [f"{len(nodes_by_depth[0])} seeds"]
    for d in range(1, max_depth + 1):
        if nodes_by_depth[d]:
            title_parts.append(f"{len(nodes_by_depth[d])} depth-{d}")
    
    ax.set_title(f'Taste Graph: {" → ".join(title_parts)}', fontsize=14, fontweight='bold')
    ax.axis('off')
    plt.tight_layout()
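    # Write into OUTPUT_DIR and create it first so savefig cannot fail on a missing folder
    out_path = Path(OUTPUT_DIR) / "graph_by_depth.png"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    plt.savefig(out_path, dpi=150, bbox_inches='tight')
    print(f"Saved to {out_path}")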
    plt.show()
    
    return nodes_by_depth

print("Multi-depth visualization function ready!")

Multi-depth visualization function ready!


## Phase 4a - Fast LLM Screening

Use `grok-4-1-fast-non-reasoning` to quickly filter candidates based on:
- Bio
- Pinned tweet (if available)

**First LLM calls happen here!**
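
As an illustration of what a bio-plus-pinned-tweet screen consumes, the sketch below assembles the short text a fast model might read. This is an assumption about the input shape only: `build_screen_input` and `max_chars` are hypothetical, and the actual prompt built inside `GrokClient.fast_screen` is not shown in this notebook.

```python
from typing import Optional


def build_screen_input(handle: str, bio: Optional[str],
                       pinned_tweet: Optional[str] = None,
                       max_chars: int = 500) -> str:
    """Assemble the short text a fast screening model would read.

    Hypothetical helper: missing fields are marked explicitly and long
    fields are truncated so the cost per candidate stays bounded.
    """
    def clean(text: Optional[str]) -> str:
        text = (text or "").strip()
        return text[:max_chars] if text else "(none)"

    return "\n".join([
        f"Handle: @{handle}",
        f"Bio: {clean(bio)}",
        f"Pinned tweet: {clean(pinned_tweet)}",
    ])


demo_text = build_screen_input("example_user", "  ML systems engineer  ")
```

Marking an absent pinned tweet as `(none)` keeps the input format stable whether or not the field was fetched.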

In [17]:
# Initialize evaluator
evaluator = CandidateEvaluator(
    x_client=x_client,
    grok_client=grok_client,
    criteria_path=CRITERIA_PATH,
    min_followers=settings.get("min_followers_candidate", 50),
    max_followers=settings.get("max_followers_candidate", 50000),
    min_tweets=50,
    seed_handles=set(roots),  # Exclude known xAI employees
    use_fast_screen=True,
)

print(f"Evaluator initialized")
print(f"  Min followers: {evaluator.min_followers}")
print(f"  Max followers: {evaluator.max_followers}")
print(f"  Min tweets: {evaluator.min_tweets}")
print(f"  Seed handles excluded: {len(evaluator.seed_handles)}")

Evaluator initialized
  Min followers: 50
  Max followers: 50000
  Min tweets: 50
  Seed handles excluded: 89


In [18]:
# Fast screening - PRIORITIZE UNDERRATED CANDIDATES
# 1. Depth 2+ (not direct follows - hidden gems)
# 2. Under 10k followers (truly underrated)

print("=" * 60)
print("PHASE 4a: Fast LLM Screening (underrated candidates: depth 2+, <10k followers)")
print("=" * 60)

# Filter for underrated candidates: depth 2+ AND under 10k followers
UNDERRATED_MAX_FOLLOWERS = 10000

underrated_candidates = [
    c for c in candidates 
    if c.get("discovered_via", "").startswith("depth") 
    and c.get("followers_count", 0) < UNDERRATED_MAX_FOLLOWERS
    and c.get("followers_count", 0) >= 50  # Min threshold
]

# Also include depth-1 candidates with low followers for comparison
# (they sort last, so they are screened only if MAX_FAST_SCREEN leaves room)
depth1_underrated = [
    c for c in candidates
    if c.get("discovered_via") in ["followed_by_root", "retweeted_root", "replied_to_root"]
    and c.get("followers_count", 0) < UNDERRATED_MAX_FOLLOWERS
    and c.get("followers_count", 0) >= 50
]

print(f"Underrated candidate pool:")
print(f"  Depth 2+ with <{UNDERRATED_MAX_FOLLOWERS:,} followers: {len(underrated_candidates):,}")
print(f"  Depth 1 with <{UNDERRATED_MAX_FOLLOWERS:,} followers: {len(depth1_underrated):,}")

# Prioritize depth 2+ first, then depth 1
candidates_to_screen = underrated_candidates + depth1_underrated

# Sort by depth (deeper = more underrated)
depth_priority = {
    "depth5_following": 0,
    "depth4_following": 1,
    "depth3_following": 2,
    "depth2_following": 3,
    "followed_by_root": 4,
    "retweeted_root": 5,
    "replied_to_root": 6,
}
candidates_to_screen.sort(key=lambda c: (
    depth_priority.get(c.get("discovered_via", ""), 99),
    c.get("followers_count", 0)  # Within same depth, lower followers first
))
print("candidates_to_screen", len(candidates_to_screen))

# How many to screen
MAX_FAST_SCREEN = 1000

print(f"\nScreening {min(MAX_FAST_SCREEN, len(candidates_to_screen))} underrated candidates...")
print(f"\nBreakdown of candidates to screen:")
for via in ["depth2_following", "depth3_following", "depth4_following", "depth5_following", "followed_by_root"]:
    count = sum(1 for c in candidates_to_screen[:MAX_FAST_SCREEN] if c.get("discovered_via") == via)
    if count > 0:
        print(f"  {via}: {count}")

fast_screen_results = []
depth_pass_counts = {}
depth_fail_counts = {}

for i, candidate in enumerate(candidates_to_screen[:MAX_FAST_SCREEN]):
    handle = candidate["handle"]
    bio = candidate["bio"]
    via = candidate.get("discovered_via", "unknown")
    followers = candidate.get("followers_count", 0)
    
    # Run fast screen
    result = grok_client.fast_screen(
        handle=handle,
        bio=bio,
        pinned_tweet=None,  # not fetched here, so this pass screens on the bio alone
        location=None,
    )
    
    fast_screen_results.append((candidate, result))
    
    # Track by depth
    if result.pass_filter:
        depth_pass_counts[via] = depth_pass_counts.get(via, 0) + 1
    else:
        depth_fail_counts[via] = depth_fail_counts.get(via, 0) + 1
    
    # Progress update every 50
    if (i + 1) % 50 == 0:
        passed = sum(1 for _, r in fast_screen_results if r.pass_filter)
        print(f"  Processed {i+1}/{min(MAX_FAST_SCREEN, len(candidates_to_screen))} - {passed} passed so far...")

# Summary
passed = sum(1 for _, r in fast_screen_results if r.pass_filter)
screened = len(fast_screen_results)
pass_rate = 100 * passed / screened if screened else 0.0  # avoid ZeroDivisionError on an empty run
print(f"\n{'='*60}")
print(f"FAST SCREEN COMPLETE: {passed}/{screened} passed ({pass_rate:.1f}%)")
print(f"{'='*60}")

print(f"\nPass rate by depth:")
for via in ["depth5_following", "depth4_following", "depth3_following", "depth2_following", "followed_by_root"]:
    p = depth_pass_counts.get(via, 0)
    f = depth_fail_counts.get(via, 0)
    total = p + f
    if total > 0:
        print(f"  {via}: {p}/{total} passed ({100*p/total:.1f}%)")

# Show passed candidates with follower counts
passed_candidates = [(c, r) for c, r in fast_screen_results if r.pass_filter]
print(f"\nPassed underrated candidates ({len(passed_candidates)}):")
for c, r in sorted(passed_candidates, key=lambda x: x[0].get("followers_count", 0))[:25]:
    followers = c.get("followers_count", 0)
    print(f"  ✓ @{c['handle']} ({followers:,} followers, {c['discovered_via']}) - {r.potential_role}")
if len(passed_candidates) > 25:
    print(f"  ... and {len(passed_candidates) - 25} more")

PHASE 4a: Fast LLM Screening (underrated candidates: depth 2+, <10k followers)
Underrated candidate pool:
  Depth 2+ with <10,000 followers: 1,108
  Depth 1 with <10,000 followers: 7,377
candidates_to_screen 8485

Screening 1000 underrated candidates...

Breakdown of candidates to screen:
  depth2_following: 59
  depth3_following: 258
  depth4_following: 341
  depth5_following: 342
  Processed 50/1000 - 22 passed so far...
  Processed 100/1000 - 34 passed so far...
  Processed 150/1000 - 46 passed so far...
  Processed 200/1000 - 65 passed so far...
  Processed 250/1000 - 77 passed so far...
  Processed 300/1000 - 93 passed so far...
  Processed 350/1000 - 110 passed so far...
  Processed 400/1000 - 139 passed so far...
  Processed 450/1000 - 172 passed so far...
  Processed 500/1000 - 202 passed so far...
  Processed 550/1000 - 228 passed so far...
  Processed 600/1000 - 247 passed so far...
  Processed 650/1000 - 271 passed so far...
  Processed 700/1000 - 294 passed so far...
  Proc

In [None]:
# # Visualize graph by depth (after fast LLM screening)
# print("=" * 60)
# print("GRAPH VISUALIZATION (Post Fast-Screen)")
# print("=" * 60)

# # Show multi-depth graph
# nodes_by_depth = visualize_by_depth(builder, max_nodes=10000, max_depth=5)

# # Summary of LLM screening results
# passed_handles = {c["handle"] for c, r in fast_screen_results if r.pass_filter}
# failed_handles = {c["handle"] for c, r in fast_screen_results if not r.pass_filter}

# print(f"\nFast LLM Screening Summary:")
# print(f"  Total screened: {len(fast_screen_results)}")
# print(f"  Passed (engineering candidates): {len(passed_handles)}")
# print(f"  Failed (non-technical/filtered): {len(failed_handles)}")

# # Show depth breakdown of passed candidates
# print(f"\nPassed candidates by depth:")
# passed_by_depth = {}
# for c, r in fast_screen_results:
#     if r.pass_filter:
#         via = c.get("discovered_via", "unknown")
#         passed_by_depth[via] = passed_by_depth.get(via, 0) + 1
# for via, count in sorted(passed_by_depth.items()):
#     print(f"  {via}: {count}")

GRAPH VISUALIZATION (Post Fast-Screen)
Visualizing 10000 nodes, 21039 edges


KeyboardInterrupt: 

## Phase 4b - Full Grok Evaluation (SKIP)

**SKIP THIS SECTION** - We use Deep Evaluation with xAI Search Tools instead (Phase 5).

The cells below (In [21] through In [23]) are kept for reference but should NOT be run.

In [21]:
# SKIP - Using Deep Evaluation with xAI Search Tools instead
# # Full evaluation on candidates that passed fast screen
# print("=" * 60)
# print("PHASE 4b: Full Grok Evaluation (on fast-screen passed candidates)")
# print("=" * 60)
# 
# # Get candidates that passed fast screening
# passed_candidate_handles = {c["handle"] for c, r in fast_screen_results if r.pass_filter}
# candidates_to_eval = [c for c in candidates if c["handle"] in passed_candidate_handles]
# 
# print(f"Candidates that passed fast screen: {len(candidates_to_eval)}")
# 
# # How many to fully evaluate
# MAX_EVAL = 500  # Full eval is more expensive, so limit
# 
# results = evaluator.evaluate_batch(candidates_to_eval, max_candidates=MAX_EVAL)
# 
# # Update nodes with evaluations
# evaluated_count = 0
# relevant_count = 0
# 
# for candidate, pre_filter, evaluation in results:
#     user_id = candidate.get("user_id")
#     if user_id in builder.nodes:
#         node = builder.nodes[user_id]
#         node.grok_evaluated = True
#         evaluated_count += 1
#         
#         if evaluation:
#             node.grok_relevant = evaluation.relevant
#             node.grok_score = evaluation.score
#             node.grok_reasoning = evaluation.reasoning
#             node.grok_skills = ",".join(evaluation.detected_skills)
#             node.grok_role = evaluation.recommended_role
#             node.grok_exceptional_work = evaluation.exceptional_work
#             node.grok_red_flags = ",".join(evaluation.red_flags)
#             
#             if evaluation.relevant:
#                 relevant_count += 1
#                 via = candidate.get("discovered_via", "unknown")
#                 print(f"  ✓ @{node.handle} ({via}) - score: {evaluation.score:.2f} - {evaluation.recommended_role}")
# 
# print(f"\n{'='*60}")
# print(f"Evaluation complete: {evaluated_count} evaluated, {relevant_count} marked relevant")

In [22]:
# SKIP - Part of old Phase 4b (Full Grok Evaluation)
# # Debug: Check evaluation results
# print("=" * 60)
# print("EVALUATION RESULTS BREAKDOWN")
# print("=" * 60)
# 
# passed_prefilter = 0
# failed_prefilter = 0
# got_full_eval = 0
# marked_relevant = 0
# 
# prefilter_reasons = {}
# 
# for candidate, pre_filter, evaluation in results:
#     if pre_filter.passed:
#         passed_prefilter += 1
#         if evaluation:
#             got_full_eval += 1
#             if evaluation.relevant:
#                 marked_relevant += 1
#     else:
#         failed_prefilter += 1
#         reason = pre_filter.reason.split(':')[0] if ':' in pre_filter.reason else pre_filter.reason
#         prefilter_reasons[reason] = prefilter_reasons.get(reason, 0) + 1
# 
# print(f"Pre-filter results:")
# print(f"  Passed: {passed_prefilter}")
# print(f"  Failed: {failed_prefilter}")
# print(f"\nPre-filter failure reasons:")
# for reason, count in sorted(prefilter_reasons.items(), key=lambda x: -x[1])[:10]:
#     print(f"  {reason}: {count}")
# 
# print(f"\nFull evaluation results:")
# print(f"  Got full eval: {got_full_eval}")
# print(f"  Marked relevant: {marked_relevant}")
# 
# # Show some examples of candidates that passed pre-filter but weren't marked relevant
# print(f"\nExamples of passed pre-filter but not relevant:")
# shown = 0
# for candidate, pre_filter, evaluation in results:
#     if pre_filter.passed and evaluation and not evaluation.relevant and shown < 5:
#         print(f"  @{candidate['handle']}: {evaluation.reasoning[:80]}...")
#         shown += 1

In [23]:
# SKIP - Part of old Phase 4b (Full Grok Evaluation)
# # Print evaluation summary
# print_evaluation_summary(results)

## Phase 5 - Rank & Export

Compute underratedness scores and export results.
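
The internals of `compute_rankings` are not shown in this notebook, so the following is only a minimal sketch of the idea, assuming a plain power-iteration Personalized PageRank that teleports to the seed accounts and the `PPR / log(1 + followers)` formula used later in the deep-evaluation cell. The function names, the damping factor of 0.85, the iteration count, and the toy graph are all illustrative assumptions, not the project's actual implementation.

```python
import math


def personalized_pagerank(edges, seeds, damping=0.85, iterations=50):
    """Power-iteration PageRank whose teleport step returns only to `seeds`.

    edges: list of (src, dst, weight) tuples with positive weights (assumption).
    seeds: non-empty collection of node ids used as the personalization set.
    """
    seed_set = set(seeds)
    nodes = set(seed_set)
    out_weight = {}
    for src, dst, weight in edges:
        nodes.update((src, dst))
        out_weight[src] = out_weight.get(src, 0.0) + weight

    # Teleport distribution: uniform over seeds, zero everywhere else.
    teleport = {n: (1.0 / len(seed_set) if n in seed_set else 0.0) for n in nodes}
    rank = dict(teleport)

    for _ in range(iterations):
        # Rank sitting on nodes with no out-edges is handed back to the seeds.
        dangling = sum(rank[n] for n in nodes if n not in out_weight)
        new_rank = {
            n: (1.0 - damping + damping * dangling) * teleport[n] for n in nodes
        }
        for src, dst, weight in edges:
            new_rank[dst] += damping * rank[src] * weight / out_weight[src]
        rank = new_rank
    return rank


def underratedness(ppr_score, followers):
    """High PageRank with few followers yields a high score."""
    # Clamp to one follower so the divisor log(1 + followers) is never zero.
    return ppr_score / math.log(1 + max(followers, 1))


# Toy graph (made-up ids): one seed follows "a" and "b", and "a" follows "b".
edges = [("seed", "a", 1.0), ("seed", "b", 1.0), ("a", "b", 1.0)]
ppr = personalized_pagerank(edges, seeds=["seed"])
followers = {"a": 200, "b": 20000}
scores = {n: underratedness(ppr[n], followers[n]) for n in followers}
```

Dividing by the logarithm rather than the raw follower count keeps large accounts from being penalized linearly, which is presumably why a few accounts with tens of thousands of followers can still appear in the top 20.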

In [24]:
# Compute rankings
print("=" * 60)
print("PHASE 5: Computing Rankings")
print("=" * 60)

nodes = compute_rankings(builder.nodes, builder.edges)

PHASE 5: Computing Rankings

COMPUTING RANKINGS
  Root nodes (seeds): 89
  Method: Personalized PageRank

  Top 20 candidates by UNDERRATEDNESS score:
  (High PageRank + Low Followers = Hidden Gem)
  ----------------------------------------------------------------------
   1. @theVincentStark      | followers:   2,206 | ppr: 0.002927 | underrated: 0.000380
   2. @512x512              | followers:  47,933 | ppr: 0.003198 | underrated: 0.000297
   3. @LiLiDuc22            | followers:   3,864 | ppr: 0.001268 | underrated: 0.000154
   4. @JiachengHong         | followers:     531 | ppr: 0.000915 | underrated: 0.000146
   5. @MichelleShieh        | followers:   2,916 | ppr: 0.000949 | underrated: 0.000119
   6. @ericzelikman         | followers:  22,370 | ppr: 0.000981 | underrated: 0.000098
   7. @arnogau              | followers:   9,326 | ppr: 0.000860 | underrated: 0.000094
   8. @dboedijono           | followers:   1,178 | ppr: 0.000654 | underrated: 0.000092
   9. @xu63006           

In [25]:
# Export results
print("=" * 60)
print("PHASE 6: Exporting Results")
print("=" * 60)

output_dir = Path(OUTPUT_DIR)
output_dir.mkdir(parents=True, exist_ok=True)

nodes_path = output_dir / "nodes.csv"
edges_path = output_dir / "edges.csv"

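# NOTE: both calls below write to nodes_path, so builder.export_to_csv
# overwrites the ranked export written by export_ranked_nodes.
export_ranked_nodes(nodes, str(nodes_path))
builder.export_to_csv(str(nodes_path), str(edges_path))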

print(f"\nExported to:")
print(f"  {nodes_path}")
print(f"  {edges_path}")

PHASE 6: Exporting Results
  Exported 17323 ranked nodes to data/processed/nodes.csv
  Exported 17323 nodes to data/processed/nodes.csv
  Exported 28841 edges to data/processed/edges.csv

Exported to:
  data/processed/nodes.csv
  data/processed/edges.csv


In [26]:
# Export graph in multiple formats for PageRank analysis
import networkx as nx
import pickle
import json

print("=" * 60)
print("EXPORTING GRAPH FOR PAGERANK ANALYSIS")
print("=" * 60)

# Build NetworkX graph with all attributes
G = nx.DiGraph()

# Add edges with weights
for edge in builder.edges:
    G.add_edge(
        edge.src_user_id,
        edge.dst_user_id,
        weight=edge.weight,
        interaction_type=edge.interaction_type,
        depth=edge.depth,
        tweet_id=edge.tweet_id,
    )

# Add node attributes
for node_id, node in builder.nodes.items():
    if node_id in G:
        G.nodes[node_id].update({
            'handle': node.handle,
            'name': node.name,
            'bio': node.bio[:200] if node.bio else '',  # Truncate long bios
            'followers_count': node.followers_count,
            'following_count': node.following_count,
            'tweet_count': node.tweet_count,
            'is_root': node.is_root,
            'is_candidate': node.is_candidate,
            'discovered_via': node.discovered_via,
        })

print(f"Graph built: {G.number_of_nodes():,} nodes, {G.number_of_edges():,} edges")

# Create output directory
export_dir = Path("data/graph_export")
export_dir.mkdir(parents=True, exist_ok=True)

# 1. Edge list CSV (simplest format for PageRank)
# csv.writer quotes fields that contain commas, quotes, or newlines
import csv

edge_list_path = export_dir / "edges.csv"
with open(edge_list_path, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f, lineterminator='\n')
    writer.writerow(["source", "target", "weight", "interaction_type", "depth"])
    for u, v, data in G.edges(data=True):
        writer.writerow([
            u, v,
            data.get('weight', 1.0),
            data.get('interaction_type', ''),
            data.get('depth', 1),
        ])
print(f"  ✓ Edge list CSV: {edge_list_path}")

# 2. Node list CSV (with attributes)
node_list_path = export_dir / "nodes.csv"
node_fields = [
    ('handle', ''), ('name', ''), ('followers_count', 0), ('following_count', 0),
    ('tweet_count', 0), ('is_root', False), ('is_candidate', False), ('discovered_via', ''),
]
with open(node_list_path, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f, lineterminator='\n')
    writer.writerow(["user_id"] + [name for name, _ in node_fields])
    for node_id, attrs in G.nodes(data=True):
        writer.writerow([node_id] + [attrs.get(name, default) for name, default in node_fields])
print(f"  ✓ Node list CSV: {node_list_path}")

# 3. GraphML (for Gephi, NetworkX - includes all attributes)
graphml_path = export_dir / "graph.graphml"
nx.write_graphml(G, graphml_path)
print(f"  ✓ GraphML: {graphml_path}")

# 4. GML (for igraph, NetworkX)
# Note: GML doesn't support all attribute types, so we simplify
G_gml = G.copy()
for node_id in G_gml.nodes():
    # Convert booleans to int for GML compatibility
    G_gml.nodes[node_id]['is_root'] = int(G_gml.nodes[node_id].get('is_root', False))
    G_gml.nodes[node_id]['is_candidate'] = int(G_gml.nodes[node_id].get('is_candidate', False))
gml_path = export_dir / "graph.gml"
nx.write_gml(G_gml, gml_path)
print(f"  ✓ GML: {gml_path}")

# 5. Pickle (fastest to load in Python)
pickle_path = export_dir / "graph.pickle"
with open(pickle_path, 'wb') as f:
    pickle.dump(G, f)
print(f"  ✓ Pickle: {pickle_path}")

# 6. Adjacency list (compact text format)
adjlist_path = export_dir / "graph.adjlist"
nx.write_adjlist(G, adjlist_path)
print(f"  ✓ Adjacency list: {adjlist_path}")

# 7. JSON format (for web visualization / JavaScript)
json_data = {
    "nodes": [
        {
            "id": node_id,
            **{k: v for k, v in attrs.items() if k != 'bio'}  # Exclude long bios
        }
        for node_id, attrs in G.nodes(data=True)
    ],
    "edges": [
        {"source": u, "target": v, **data}
        for u, v, data in G.edges(data=True)
    ],
    "metadata": {
        "total_nodes": G.number_of_nodes(),
        "total_edges": G.number_of_edges(),
        "roots": sum(1 for _, d in G.nodes(data=True) if d.get('is_root')),
        "candidates": sum(1 for _, d in G.nodes(data=True) if d.get('is_candidate')),
    }
}
json_path = export_dir / "graph.json"
with open(json_path, 'w') as f:
    json.dump(json_data, f)
print(f"  ✓ JSON: {json_path}")

print(f"\n{'='*60}")
print("EXPORT COMPLETE")
print(f"{'='*60}")
print(f"All files saved to: {export_dir}/")
print(f"""
Files:
  edges.csv      - Edge list (source, target, weight) for PageRank
  nodes.csv      - Node attributes (handle, followers, is_root, etc.)
  graph.graphml  - Full graph for Gephi/NetworkX
  graph.gml      - Full graph for igraph
  graph.pickle   - Python pickle (fastest to load)
  graph.adjlist  - Compact adjacency list
  graph.json     - JSON for web visualization

Quick PageRank example:
  import networkx as nx
  import pickle
  
  with open('data/graph_export/graph.pickle', 'rb') as f:
      G = pickle.load(f)
  
  # Run PageRank with seeds as personalization
  seeds = [n for n, d in G.nodes(data=True) if d.get('is_root')]
  personalization = {{n: 1.0 for n in seeds}}
  pr = nx.pagerank(G, personalization=personalization, weight='weight')
  
  # Get top candidates by PageRank
  candidates = [(n, pr[n]) for n, d in G.nodes(data=True) if d.get('is_candidate')]
  top_candidates = sorted(candidates, key=lambda x: x[1], reverse=True)[:20]
""")

EXPORTING GRAPH FOR PAGERANK ANALYSIS
Graph built: 17,322 nodes, 28,361 edges
  ✓ Edge list CSV: data/graph_export/edges.csv
  ✓ Node list CSV: data/graph_export/nodes.csv
  ✓ GraphML: data/graph_export/graph.graphml
  ✓ GML: data/graph_export/graph.gml
  ✓ Pickle: data/graph_export/graph.pickle
  ✓ Adjacency list: data/graph_export/graph.adjlist
  ✓ JSON: data/graph_export/graph.json

EXPORT COMPLETE
All files saved to: data/graph_export/

Files:
  edges.csv      - Edge list (source, target, weight) for PageRank
  nodes.csv      - Node attributes (handle, followers, is_root, etc.)
  graph.graphml  - Full graph for Gephi/NetworkX
  graph.gml      - Full graph for igraph
  graph.pickle   - Python pickle (fastest to load)
  graph.adjlist  - Compact adjacency list
  graph.json     - JSON for web visualization

Quick PageRank example:
  import networkx as nx
  import pickle

  with open('data/graph_export/graph.pickle', 'rb') as f:
      G = pickle.load(f)

  # Run PageRank with seeds as p

## Export Graph for PageRank Analysis

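Export the network graph in multiple formats for your teammate to run PageRank:
- **Edge list CSV** - Simple (source, target, weight, interaction_type, depth) format
- **Node list CSV** - Node attributes (handle, follower counts, `is_root`, `is_candidate`)
- **GraphML** - Rich XML format with node attributes (for Gephi, NetworkX)
- **GML** - Graph Modelling Language (for igraph, NetworkX)
- **Adjacency list** - Compact text format (topology only, no weights or attributes)
- **Pickle** - Python NetworkX graph object (fastest to load)
- **JSON** - Nodes, edges, and summary metadata (for web visualization)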
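
For a consumer that does not want to depend on NetworkX, the edge list CSV can be read with the standard library alone. The sketch below assumes the header written by the export cell (`source,target,weight,interaction_type,depth`); the helper name `load_edge_list` and the sample rows, including the `follow` and `like` interaction labels, are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def load_edge_list(handle):
    """Read an exported edge list into {source: {target: weight}}.

    Node ids come back as strings because CSV carries no type information.
    """
    adjacency = defaultdict(dict)
    for row in csv.DictReader(handle):
        src, dst = row["source"], row["target"]
        # Sum weights in case the same (source, target) pair appears twice.
        adjacency[src][dst] = adjacency[src].get(dst, 0.0) + float(row["weight"])
    return dict(adjacency)


# Hypothetical sample standing in for data/graph_export/edges.csv.
sample = io.StringIO(
    "source,target,weight,interaction_type,depth\n"
    "1,2,1.0,follow,1\n"
    "1,3,0.5,like,1\n"
    "2,3,1.0,follow,2\n"
)
adjacency = load_edge_list(sample)

# Total incoming weight per node, a crude first look before running PageRank.
in_weight = defaultdict(float)
for targets in adjacency.values():
    for dst, weight in targets.items():
        in_weight[dst] += weight
```

With a real export, replace the `io.StringIO` sample with `open(path, newline="")` on the exported file.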

In [27]:
# Final summary (after fast screen, before deep eval)
print("=" * 60)
print("PIPELINE SUMMARY (Pre-Deep Evaluation)")
print("=" * 60)

# Count from fast screen results
passed_count = sum(1 for _, r in fast_screen_results if r.pass_filter)
failed_count = sum(1 for _, r in fast_screen_results if not r.pass_filter)

print(f"""
Graph Discovery:
  Total accounts discovered: {len(builder.nodes):,}
  Total interactions mapped: {len(builder.edges):,}
  Candidate accounts: {sum(1 for n in builder.nodes.values() if n.is_candidate):,}

Fast LLM Screening:
  Screened: {len(fast_screen_results)}
  Passed: {passed_count}
  Failed: {failed_count}
  Pass rate: {100*passed_count/len(fast_screen_results):.1f}%

Next: Run PageRank + Deep Evaluation with xAI Search Tools
""")

PIPELINE SUMMARY (Pre-Deep Evaluation)

Graph Discovery:
  Total accounts discovered: 17,323
  Total interactions mapped: 28,841
  Candidate accounts: 12,050

Fast LLM Screening:
  Screened: 1000
  Passed: 457
  Failed: 543
  Pass rate: 45.7%

Next: Run PageRank + Deep Evaluation with xAI Search Tools



## View Top Candidates

In [28]:
# Show top candidates from fast screen (before deep eval)
print("TOP CANDIDATES FROM FAST SCREEN")
print("=" * 60)

passed_candidates = [(c, r) for c, r in fast_screen_results if r.pass_filter]

# Sort by followers ascending (low follower count is used here as a rough proxy for "underrated")
passed_candidates.sort(key=lambda x: x[0].get('followers_count', 0))

print(f"\nTotal passed: {len(passed_candidates)}")
print(f"\nTop 20 most underrated (lowest followers):")
print("-" * 60)

for i, (c, r) in enumerate(passed_candidates[:20], 1):
    handle = c.get('handle', 'unknown')
    followers = c.get('followers_count', 0)
    via = c.get('discovered_via', 'unknown')
    role = r.potential_role
    print(f"{i:2}. @{handle:<20} | {followers:>5,} followers | {via} | {role}")

print("-" * 60)
print(f"\nReady for PageRank + Deep Evaluation")

TOP CANDIDATES FROM FAST SCREEN

Total passed: 457

Top 20 most underrated (lowest followers):
------------------------------------------------------------
 1. @austen_liao          |    50 followers | depth3_following | research
 2. @zzzoooeee321         |    53 followers | depth3_following | research
 3. @rajivranjanmars      |    54 followers | depth5_following | engineering
 4. @j777ro               |    54 followers | depth3_following | research
 5. @ShuibaiZ69721        |    60 followers | depth3_following | research
 6. @BairoliyaShivam      |    62 followers | depth3_following | engineering
 7. @Gooskying            |    64 followers | depth4_following | research
 8. @dan_sci_phil         |    66 followers | depth5_following | research
 9. @SeKim1112            |    66 followers | depth4_following | research
10. @infiniter3grets      |    68 followers | depth5_following | engineering
11. @tangerinecoder       |    71 followers | depth4_following | research
12. @allanjienlp     

## Phase 6 - Deep Evaluation with xAI Search Tools

Use `grok-4-1-fast` with `web_search` + `x_search` to deeply evaluate top candidates:
- Search X for their posts and discussions
- Search GitHub for their repos
- Search LinkedIn for their background
- Score on 5-criterion rubric (0-100 scale)
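
How `DeepEvaluator` combines the five criteria into `final_score` is not shown in this notebook. As a rough sketch only, the snippet below assumes each criterion receives its own 0-100 score and that the final score is a weighted mean with equal weights by default. The criterion names are taken from the export cell further down; `CriterionScore`, its fields, and `final_score` as a free function are hypothetical.

```python
from dataclasses import dataclass

# Criterion names as they appear in the deep-evaluation export cell.
CRITERIA = (
    "technical_depth",
    "project_evidence",
    "mission_alignment",
    "exceptional_ability",
    "communication",
)


@dataclass
class CriterionScore:
    """Hypothetical container for one rubric entry."""
    score: float        # assumed to be on a 0-100 scale
    evidence: str = ""  # short justification gathered from search results


def final_score(scores, weights=None):
    """Weighted mean of the five criterion scores (equal weights by default)."""
    weights = weights or {name: 1.0 for name in CRITERIA}
    missing = [name for name in CRITERIA if name not in scores]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    total_weight = sum(weights[name] for name in CRITERIA)
    weighted_sum = sum(weights[name] * scores[name].score for name in CRITERIA)
    return weighted_sum / total_weight


# Made-up scores for illustration only.
example = {
    "technical_depth": CriterionScore(80, "writes about kernel optimizations"),
    "project_evidence": CriterionScore(70, "two active repositories"),
    "mission_alignment": CriterionScore(90),
    "exceptional_ability": CriterionScore(60),
    "communication": CriterionScore(50),
}
overall = final_score(example)
```

Passing a `weights` dict would let a reviewer emphasize, say, `technical_depth` over `communication` without changing the rubric itself.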

In [29]:
# Phase 6: Deep Evaluation using xAI Search Tools
# Uses grok-4-1-fast with web_search + x_search to deeply analyze candidates

from src.deep_evaluator import DeepEvaluator, print_evaluation
from src.ranking import compute_pagerank, get_top_candidates
import pickle

print("=" * 60)
print("PHASE 6: Deep Evaluation with xAI Search Tools")
print("=" * 60)

# Load graph for PageRank
with open('data/graph_export/graph.pickle', 'rb') as f:
    G = pickle.load(f)

print(f"Loaded graph: {G.number_of_nodes():,} nodes, {G.number_of_edges():,} edges")

# Compute PageRank with seed personalization
seed_ids = [n for n, d in G.nodes(data=True) if d.get('is_root')]
print(f"Computing PageRank with {len(seed_ids)} seeds as personalization...")

pagerank_scores = compute_pagerank(G, seed_ids)
print(f"PageRank computed for {len(pagerank_scores)} nodes")

# Get top candidates by underratedness (PageRank / log(1 + followers))
# Focus on candidates that passed fast screening
import math

passed_handles = {c["handle"] for c, r in fast_screen_results if r.pass_filter}

candidates_for_deep_eval = []
for node_id, ppr_score in pagerank_scores.items():
    if node_id not in G.nodes:
        continue
    attrs = G.nodes[node_id]
    if not attrs.get('is_candidate'):
        continue
    
    handle = attrs.get('handle', '')
    if handle not in passed_handles:
        continue
    
    followers = attrs.get('followers_count', 1)
    # Underratedness = PPR / log(1 + followers)
    # Clamp to at least 1 follower so log(1 + 0) == 0 is never the divisor
    underratedness = ppr_score / math.log(1 + max(followers or 0, 1))
    
    candidates_for_deep_eval.append({
        'user_id': node_id,
        'handle': handle,
        'bio': builder.nodes[node_id].bio if node_id in builder.nodes else '',
        'followers_count': followers,
        'pagerank_score': ppr_score,
        'underratedness': underratedness,
        'discovered_via': attrs.get('discovered_via', ''),
    })

# Sort by underratedness (hidden gems first)
candidates_for_deep_eval.sort(key=lambda x: x['underratedness'], reverse=True)

print(f"\nTop 20 underrated candidates for deep eval:")
for i, c in enumerate(candidates_for_deep_eval[:20], 1):
    print(f"  {i}. @{c['handle']} ({c['followers_count']:,} followers, PPR: {c['pagerank_score']:.6f})")

# How many to deep evaluate
MAX_DEEP_EVAL = 50  # Deep eval is expensive (uses search tools)

print(f"\n{'='*60}")
print(f"Running deep evaluation on top {MAX_DEEP_EVAL} candidates...")
print("This uses web_search + x_search for each candidate")
print("=" * 60)

PHASE 6: Deep Evaluation with xAI Search Tools
Loaded graph: 17,322 nodes, 28,361 edges
Computing PageRank with 88 seeds as personalization...
PageRank computed for 17322 nodes

Top 20 underrated candidates for deep eval:
  1. @juleszqiu (92 followers, PPR: 0.000067)
  2. @WesleyYue (534 followers, PPR: 0.000081)
  3. @Ani_nlp (84 followers, PPR: 0.000046)
  4. @BahlAnuraag (531 followers, PPR: 0.000057)
  5. @tianyue_01 (100 followers, PPR: 0.000041)
  6. @guangyi_l (380 followers, PPR: 0.000046)
  7. @alexzhuang_ (100 followers, PPR: 0.000033)
  8. @LiaoZeyi (312 followers, PPR: 0.000041)
  9. @MatricesAI (422 followers, PPR: 0.000041)
  10. @einsums (167 followers, PPR: 0.000034)
  11. @shmkane (504 followers, PPR: 0.000041)
  12. @nifleisch (77 followers, PPR: 0.000027)
  13. @BillZheng155508 (92 followers, PPR: 0.000027)
  14. @the_philbert (330 followers, PPR: 0.000034)
  15. @mihiranand (492 followers, PPR: 0.000033)
  16. @kauterry (165 followers, PPR: 0.000027)
  17. @jaewon_c

In [32]:
# Initialize deep evaluator and run evaluations
try:
    deep_evaluator = DeepEvaluator(cache_dir="data/enriched")
    print("DeepEvaluator initialized successfully")
    print(f"  Model: {deep_evaluator.model}")
    print(f"  Cache dir: {deep_evaluator.cache_dir}")
    
    # Run deep evaluation
    deep_results = deep_evaluator.evaluate_batch(
        candidates=candidates_for_deep_eval,
        max_candidates=MAX_DEEP_EVAL,
        verbose=True,
    )
    
    print(f"\n{'='*60}")
    print("DEEP EVALUATION COMPLETE")
    print("=" * 60)
    
except ImportError as e:
    print(f"[error] xai_sdk not installed: {e}")
    print('Run: pip install "xai-sdk>=1.3.1"')
    deep_results = []
except Exception as e:
    print(f"[error] Deep evaluation failed: {e}")
    import traceback
    traceback.print_exc()
    deep_results = []

DeepEvaluator initialized successfully
  Model: grok-4-1-fast
  Cache dir: data/enriched

[deep_eval] Evaluating 50 candidates

[1/50] @juleszqiu
  [eval] @juleszqiu - searching X, GitHub, LinkedIn...
    -> x_keyword_search: {"query":"from:juleszqiu","limit":30,"mode":"Lates...
    -> x_keyword_search: {"query":"from:juleszqiu (AI OR ML OR \"machine le...
    -> x_semantic_search: {"query":"technical discussions on AI, math, engin...
    -> x_user_search: {"query":"juleszqiu"}...
    -> web_search: {"query":"\"juleszqiu\" site:github.com","num_resu...
    -> web_search: {"query":"\"jules qiu\" github","num_results":10}...
    -> web_search: {"query":"\"jules qiu\" site:linkedin.com/in","num...
    -> web_search: {"query":"\"juleszqiu\" site:linkedin.com","num_re...
    -> browse_page: {"url": "https://www.linkedin.com/in/jules-qiu"}...
    -> web_search: {"query":"\"jules qiu\" radical ventures github OR...
    -> web_search: {"query":"\"jules qiu\" harvard math thesis OR pap...
    -

Traceback (most recent call last):
  File "/var/folders/6j/q7vlb7k917jcgd428x526prc0000gn/T/ipykernel_64163/4149176246.py", line 9, in <module>
    deep_results = deep_evaluator.evaluate_batch(
        candidates=candidates_for_deep_eval,
        max_candidates=MAX_DEEP_EVAL,
        verbose=True,
    )
  File "/Users/hungtran/code/xai-hack/src/deep_evaluator.py", line 388, in evaluate_batch
    if verbose:
    
  File "/Users/hungtran/code/xai-hack/src/deep_evaluator.py", line 327, in evaluate
    response_text += chunk.content
    ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/hungtran/code/xai-hack/src/deep_evaluator.py", line 140, in _save_cache
    # Convert to dict and handle non-JSON-serializable types
  File "/Users/hungtran/miniconda3/lib/python3.13/json/__init__.py", line 179, in dump
    for chunk in iterable:
                 ^^^^^^^^
  File "/Users/hungtran/miniconda3/lib/python3.13/json/encoder.py", line 432, in _iterencode
    yield from _iterencode_dict(o, _current_indent_leve

In [31]:
# Export deep evaluation results
import json as json_module
from pathlib import Path
from dataclasses import asdict

if deep_results:
    print("=" * 60)
    print("EXPORTING DEEP EVALUATION RESULTS")
    print("=" * 60)
    
    # Create output directory
    output_dir = Path("data/enriched")
    output_dir.mkdir(parents=True, exist_ok=True)
    
    # Export as JSON for API consumption
    export_data = []
    for evaluation in deep_results:
        candidate_data = next(
            (c for c in candidates_for_deep_eval if c['handle'] == evaluation.handle),
            {}
        )
        user_id = candidate_data.get('user_id', '')
        node_name = builder.nodes[user_id].name if user_id in builder.nodes else ''
        
        export_data.append({
            'handle': evaluation.handle,
            'name': node_name,
            'bio': evaluation.bio,
            'followers_count': evaluation.followers,
            'pagerank_score': candidate_data.get('pagerank_score', 0),
            'underratedness_score': candidate_data.get('underratedness', 0),
            'final_score': evaluation.final_score,
            'recommended_role': evaluation.recommended_role,
            'summary': evaluation.summary,
            'strengths': evaluation.strengths,
            'concerns': evaluation.concerns,
            'technical_depth': asdict(evaluation.technical_depth),
            'project_evidence': asdict(evaluation.project_evidence),
            'mission_alignment': asdict(evaluation.mission_alignment),
            'exceptional_ability': asdict(evaluation.exceptional_ability),
            'communication': asdict(evaluation.communication),
            'github_url': evaluation.github_url,
            'linkedin_url': evaluation.linkedin_url,
            'top_repos': evaluation.top_repos,
        })
    
    export_data.sort(key=lambda x: x['final_score'], reverse=True)
    
    json_path = output_dir / "candidates_deep_eval.json"
    with open(json_path, 'w') as f:
        json_module.dump(export_data, f, indent=2)
    print(f"Exported to: {json_path}")
    
    print(f"\nTop 10 candidates:")
    for i, c in enumerate(export_data[:10], 1):
        print(f"{i}. @{c['handle']} - Score: {c['final_score']:.1f} ({c['recommended_role']})")
else:
    print("No deep evaluation results to export")

No deep evaluation results to export
