This notebook covers three things:

1. LinkedIn scraping via Proxycurl
2. Encoding data into ordinals (sufficient for the naive matrix) and a 26-dimensional feature vector (one-hot by ordinal)
3. Computing a simple score (from the original evaluation matrix) and prioritizing founders by that score

In [19]:
# !pip install requests pandas python-dotenv openai -q

In [16]:
import os 
import requests
import pandas as pd
import numpy as np
import json
import re
from openai import OpenAI
from dotenv import load_dotenv
import sys

sys.path.append('..')
from src.config.config import cfg

## LinkedIn Search

Change the top-of-funnel search parameters here.

In [None]:
load_dotenv()

API_KEY = os.getenv("PROXYCURL_API_KEY")
if not API_KEY:
    raise ValueError('proxycurl api key unset')
# Do not print the key itself: notebook outputs get saved and shared
print('env loaded with proxycurl api key')

PERPLEXITY_API_KEY = os.getenv('PERPLEXITY_API_KEY')
if not PERPLEXITY_API_KEY:
    raise ValueError('perplexity api key unset')
print('env loaded with perplexity api key')

np.set_printoptions(precision=2, suppress=True, linewidth=120)

MATRIX = cfg.MATRIX

In [22]:
N = 7 # limit results
params = {
    'country' : 'US',
    'education_school_name' : 'Georgia Institute of Technology', # can add college of computing, isye, etc. later
    'current_role_title': "Founder OR Co-Founder OR \"Founding Engineer\" OR CEO OR CTO OR Stealth",
    'enrich_profiles' : 'enrich',
    'page_size' : N,
    'use_cache' : 'if-present' # should be if-recent for final 
}
# more opts: https://nubela.co/proxycurl/docs?python#search-api-person-search-endpoint
headers = {'Authorization' : f'Bearer {API_KEY}'}
response = requests.get('https://nubela.co/proxycurl/api/v2/search/person', params=params, headers=headers, timeout=60)
response.raise_for_status()  # fail early on auth, quota, or server errors
data = response.json()

In [19]:
# Raw Proxycurl data: load a saved sample response, overriding the live `data` above
with open("../data/raw/proxy_sample.json", "r") as json_file:
    data = json.load(json_file)
data

{'results': [{'linkedin_profile_url': 'https://www.linkedin.com/in/cantino',
   'profile': {'public_identifier': 'cantino',
    'profile_pic_url': 'https://s3.us-west-000.backblazeb2.com/proxycurl/person/cantino/profile?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=0004d7f56a0400b0000000001%2F20250303%2Fus-west-000%2Fs3%2Faws4_request&X-Amz-Date=20250303T090109Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=b0afb8ea625a1d9f40f6915d3257254c4b15b8d89d051da173aca9105bb437a8',
    'background_cover_image_url': None,
    'first_name': 'Andrew',
    'last_name': 'Cantino',
    'full_name': 'Andrew Cantino',
    'follower_count': 1258,
    'occupation': 'Founder / Chief Strategy Officer at Overview Energy',
    'headline': 'Climate ◊ Space ◊ Science ◊ Software',
    'summary': "I am a hands-on technical leader specializing in strategic planning, technical prioritization, and engineering practices. I've founded companies, raised money, and built teams. I've also managed phila

In [20]:
# Parse raw data to df

def extract_undergrad_school(edu_list):
    for ed in edu_list:
        deg = (ed.get("degree_name") or "").lower()
        fos = (ed.get('field_of_study') or "").lower()
        # NOTE: substring match, so short keywords over-match (e.g. "ba" also matches "mba")
        if any((keyword in deg or keyword in fos) for keyword in ["bs", "ba", "bachelor", "bse", "bsba"]):
            return ed.get("school")
    return None

def extract_grad_school(edu_list):
    for ed in edu_list:
        fos = (ed.get('field_of_study') or "").lower()
        deg = (ed.get("degree_name") or "").lower()
        # NOTE: substring match, so short keywords over-match (e.g. "ms" also matches "systems")
        if any((keyword in deg or keyword in fos) for keyword in ["master", "mba", "ms", "phd"]):
            return ed.get("school")
    return None

def extract_current_experience(experiences):
    # Look for the experience with ends_at=None
    current_exp = next((exp for exp in experiences if exp.get("ends_at") is None), None)
    if current_exp:
        return (current_exp.get("company"), current_exp.get("title"))
    
    return (None, None)

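def extract_previous_experience(experiences):
    # Skip the first entry, which is assumed to be the current role
    if len(experiences) > 1:
        return [(exp.get("company"), exp.get("title")) for exp in experiences[1:]]

    # Return an empty list (not a tuple of Nones) so callers can iterate over pairs safely
    return []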

records = []
for result in data["results"]:
    # `or` guards against keys that are present but null in the API response
    profile = result.get("profile") or {}
    full_name = profile.get("full_name")
    edu_list = profile.get("education") or []
    exp_list = profile.get("experiences") or []
    linkedin = result.get('linkedin_profile_url')
    
    undergrad = extract_undergrad_school(edu_list)
    grad = extract_grad_school(edu_list)
    current_company, current_title = extract_current_experience(exp_list)
    previous_experiences_titles = extract_previous_experience(exp_list) 
    
    row = {
        "Name": full_name,
        "Undergrad School": undergrad,
        "Graduate School": grad,
        "Current Company": current_company,
        "Current Title": current_title,
        "Previous Companies": [exp[0] for exp in previous_experiences_titles],
        "Previous Titles" : [exp[1] for exp in previous_experiences_titles],
        "Linkedin" : linkedin
    }
    records.append(row)

df = pd.DataFrame(records)

In [21]:
# Unranked Founder Data
df

Unnamed: 0,Name,Undergrad School,Graduate School,Current Company,Current Title,Previous Companies,Previous Titles,Linkedin
0,Andrew Cantino,Haverford College,Georgia Institute of Technology,Overview Energy,Founder / Chief Strategy Officer,"[The Navigation Fund, The Orbital Index, Overv...","[Climate Program Consultant, Co-Creator, Found...",https://www.linkedin.com/in/cantino


## Scoring Configuration

Consider [this](https://docs.google.com/document/d/1D7Zjma2FrrnSuoQTMsI0ec_5ENTzjN6wIpKoI4O1ASE/edit?tab=t.0) evaluation matrix. 

We could consider learning the tiers, but hardcoding it seems fine since we don't have enough data.

Here's the current matrix scoring schema:

| Category | Encoding | Example | Dimension |
|----------|----------|---------|-----------|
| Undergrad | One-hot | [Tier 3, Tier 2, Tier 1 (other)] | 3 |
| Graduate | One-hot | [Tier 3, Tier 2, Tier 1 (other), Fallback/None] | 4 |
| Previous Exit | One-hot | [100m+, 25m-100m, 1-25m, Fallback/None] | 4 |
| Previous Founder | One-hot | [yes - success, yes, no] | 3 |
| Prior Startup Exp | One-hot | [early + success, early, no] | 3 |
| Company as Employee Quality | One-hot | [Tier 3, Tier 2, Tier 1 (other)] | 3 |
| Seniority | One-hot | [Tier 3, Tier 2, Tier 1 (other)] | 3 |
| Expertise | One-hot | [Tier 3, Tier 2, Tier 1 (other)] | 3 |
| **Total** | | | 26 |
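
As a sanity check on the dimensions in the table, here is a minimal sketch of how an ordinal maps to a one-hot block. `EXAMPLE_DIMENSIONS` and `one_hot` are hypothetical names used only for illustration; the real values live in `cfg.MATRIX`, and the category order there may differ. Categories with dimension 3 use ordinals 1 to 3, while those with dimension 4 use 0 to 3, with 0 as the fallback.

```python
# Hypothetical stand-in for the DIMENSION entries in cfg.MATRIX (illustration only).
EXAMPLE_DIMENSIONS = {
    "UNDERGRAD": 3,
    "GRADUATE": 4,
    "EXIT": 4,
    "FOUNDER": 3,
    "STARTUP": 3,
    "COMPANY": 3,
    "SENIORITY": 3,
    "EXPERTISE": 3,
}

def one_hot(ordinal, dimension):
    """Map an ordinal to a one-hot list; 3-dim categories start at 1, 4-dim at 0."""
    index = ordinal - 1 if dimension == 3 else ordinal
    return [1 if i == index else 0 for i in range(dimension)]

total_dim = sum(EXAMPLE_DIMENSIONS.values())
undergrad_tier2 = one_hot(2, EXAMPLE_DIMENSIONS["UNDERGRAD"])
graduate_none = one_hot(0, EXAMPLE_DIMENSIONS["GRADUATE"])
print(total_dim, undergrad_tier2, graduate_none)
```

Concatenating one such block per category gives the 26-dimensional feature vector used for scoring.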


### Scoring

UndergradScore, GraduateScore, CompanyQuality, SeniorityScore, and ExpertiseScore all come from the Proxycurl data.

Evaluations of exit size, previous startup experience, and previous founder experience come from Perplexity. This costs roughly 7 seconds of processing time per founder, since `sonar-pro` seems to be needed for accurate results.
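
Because these three ratings come back from a language model, it may be worth validating them before they are written to the dataframe. The sketch below is an assumption and not part of the original pipeline: `sanitize_evaluation`, `ALLOWED_RANGES`, and `FALLBACK` are hypothetical names, and the ranges follow the schema table above (exit: 0 to 3, founder and startup experience: 1 to 3).

```python
# Hypothetical allowed ranges, following the schema table: the exit rating has
# a 0 fallback, the other two ratings start at 1.
ALLOWED_RANGES = {
    "exited_founder": (0, 3),
    "previous_founder": (1, 3),
    "startup_experience": (1, 3),
}
FALLBACK = {"exited_founder": 0, "previous_founder": 1, "startup_experience": 1}

def sanitize_evaluation(evaluation):
    """Coerce each rating to an int inside its allowed range, else use the fallback."""
    clean = {}
    for key, (low, high) in ALLOWED_RANGES.items():
        try:
            value = int(evaluation.get(key))
        except (TypeError, ValueError):
            # Missing key or non-numeric text
            value = FALLBACK[key]
        clean[key] = value if low <= value <= high else FALLBACK[key]
    return clean

example = sanitize_evaluation({"exited_founder": "2", "previous_founder": 7})
print(example)
```

A check like this keeps an unexpected reply (a string rating, a missing key, an out-of-range number) from breaking the one-hot indexing later on.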

In [22]:
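# Ordinals
def get_tier(category, tier):
    """Return the configured entry for `tier` in `category`, or [] if it is missing."""
    try:
        return MATRIX[category]['TIERS'][tier]
    except (KeyError, IndexError, TypeError):
        # Return an empty list so .isin(...) and `in` checks still work downstream
        print(f'could not get tier {tier} for category {category}')
        return []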

# Undergrad
df["UNDERGRAD"] = np.where(
    df["Undergrad School"].isin(get_tier('UNDERGRAD', 3)), 3,
    np.where(df["Undergrad School"].isin(get_tier('UNDERGRAD', 2)), 2, 1)
)

# Graduate
df["GRADUATE"] = np.where(
    df["Graduate School"].isin(get_tier('GRADUATE', 3)), 3, 
    np.where(df["Graduate School"].isin(get_tier('GRADUATE', 2)), 2,
        np.where(df["Graduate School"].notnull() & (df["Graduate School"] != "None"), 1, 0)
    )
)

# Company Quality
df["COMPANY"] = df.apply(
    lambda row: 3 if any(company in get_tier('COMPANY', 3) for company in row["Previous Companies"] if company) else
                2 if any(company in get_tier('COMPANY', 2) for company in row["Previous Companies"] if company) else 1,
    axis=1
)

df["SENIORITY"] = df.apply(
    lambda row: 3 if any(title and any(keyword.lower() in title.lower() for keyword in get_tier('SENIORITY', 3)) 
                         for title in row["Previous Titles"] if title) else
                2 if any(title and any(keyword.lower() in title.lower() for keyword in get_tier('SENIORITY', 2)) 
                         for title in row["Previous Titles"] if title) else 1,
    axis=1
)

df["EXPERTISE"] = df.apply(
    lambda row: 3 if any(title and any(kw.lower() in word.lower() 
                         for word in title.split() for kw in get_tier('EXPERTISE', 3))
                         for title in row["Previous Titles"] if title) else
                2 if any(title and any(kw.lower() in word.lower() 
                         for word in title.split() for kw in get_tier('EXPERTISE', 2))
                         for title in row["Previous Titles"] if title) else 1,
    axis=1
)


In [23]:
# Ordinal representation for api result columns
client = OpenAI(api_key=PERPLEXITY_API_KEY, base_url='https://api.perplexity.ai')

def get_ai_evaluation(person_data):
    
    # Profile
    name = person_data.get("Name", "Unknown")
    titles = ", ".join([t for t in person_data.get("Previous Titles", []) if t]) if person_data.get("Previous Titles") else "Unknown"
    companies = ", ".join([c for c in person_data.get("Previous Companies", []) if c]) if person_data.get("Previous Companies") else "Unknown"
   
    # prompt
    messages = [
        {
            "role": "system",
            "content": (
                "You are a venture research assistant. Your task is to evaluate a person's founder and startup experience "
                "based on the provided information. Provide concise and structured evaluations for the categories: "
                "'Previously an exited founder?', 'Previously a founder?', and 'Prior Startup Experience'. Use the following rating scales:\n\n"
                f"- Previously an exited founder?\n  3: {get_tier('EXIT', 3)}, 2: {get_tier('EXIT', 2)}, 1: {get_tier('EXIT', 1)}, 0: {get_tier('EXIT', 0)}\n"
                f"- Previously a founder?\n  3: {get_tier('FOUNDER', 3)}, 2: {get_tier('FOUNDER', 2)}, 1: {get_tier('FOUNDER', 1)}\n"
                f"- Prior Startup Experience\n  3: {get_tier('STARTUP', 3)}, 2: {get_tier('STARTUP', 2)}, 1: {get_tier('STARTUP', 1)}.\n"
                "Do not consider a person's current experience as prior startup experience or previously a founder.\n\n"
                "Provide your response in JSON format with keys 'exited_founder', 'previous_founder', and 'startup_experience'.\n"
            ),
        },
        {
            "role": "user",
            "content": (
                f"Evaluate the following person's founder and startup experience:\n\n"
                f"Name: {name}\n"
                f"Titles: {titles}\n"
                f"Experiences: {companies}\n\n"
                "Provide ratings for the categories as described."
            ),
        },
    ]
    
    # Query response 
    fallback = {"exited_founder": 0, "previous_founder": 1, "startup_experience": 1}
    try:
        response = client.chat.completions.create(
            model="sonar-pro",
            messages=messages,
        )

        json_match = re.search(r'\{.*?\}', response.choices[0].message.content, re.DOTALL)
        if json_match:
            evaluation = json.loads(json_match.group())
            return evaluation
        else:
            print(f"Could not extract JSON for {name}")
            return fallback
        
    except Exception as e:
        print(f"Error evaluating {name}: {e}")
        return fallback

ai_evaluations = df.apply(get_ai_evaluation, axis=1)
df["EXIT"] = ai_evaluations.apply(lambda x: x.get("exited_founder", 0))
df["FOUNDER"] = ai_evaluations.apply(lambda x: x.get("previous_founder", 1))
df["STARTUP"] = ai_evaluations.apply(lambda x: x.get("startup_experience", 1))

In [24]:
# Ordinal representation
df

Unnamed: 0,Name,Undergrad School,Graduate School,Current Company,Current Title,Previous Companies,Previous Titles,Linkedin,UNDERGRAD,GRADUATE,COMPANY,SENIORITY,EXPERTISE,EXIT,FOUNDER,STARTUP
0,Andrew Cantino,Haverford College,Georgia Institute of Technology,Overview Energy,Founder / Chief Strategy Officer,"[The Navigation Fund, The Orbital Index, Overv...","[Climate Program Consultant, Co-Creator, Found...",https://www.linkedin.com/in/cantino,1,1,2,3,3,0,3,3


In [25]:
def one_hot_encode_column(values, dimension):
    if dimension == 3:
        # Ordinals are [1,2,3] => want indices [0,1,2]
        indices = values - 1
    else:
        # dimension = 4 => Ordinals are [0,1,2,3] => want indices [0..3]
        indices = values
    
    # Clip just in case
    indices = np.clip(indices, 0, dimension - 1)
    return np.eye(dimension, dtype=int)[indices]


# Apply one-hot encoding
one_hot_matrices = []
for cat, cat_cfg in MATRIX.items():  # `cat_cfg` avoids shadowing the imported `cfg`
    dim = cat_cfg['DIMENSION']
    values = df[cat].to_numpy()  # Ordinals: {1,2,3} if dim == 3, {0,1,2,3} if dim == 4
    matrix = one_hot_encode_column(values, dim)
    one_hot_matrices.append(matrix)

feature_matrix = np.concatenate(one_hot_matrices, axis=1)

df["feature_vector"] = list(feature_matrix)
feature_matrix.shape

(1, 26)

In [26]:
print(f'{feature_matrix}\n')
df[['Name','Current Company', 'Current Title', 'Linkedin', 'feature_vector']]

[[1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 0 0 1 0 0 1]]



Unnamed: 0,Name,Current Company,Current Title,Linkedin,feature_vector
0,Andrew Cantino,Overview Energy,Founder / Chief Strategy Officer,https://www.linkedin.com/in/cantino,"[1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, ..."


### Ranking 

$W$: $26 \times 26$

`feature_matrix`: $N \times 26$

Compute scores as $\mathrm{score}_i = x_i^T \left(\frac{W + W^T}{2}\right) x_i$ for each row $x_i$.

Weight matrix initialization:
- Diagonal elements are the individual contribution of each feature.
- Off-diagonal elements are interactions between different features. $w_{ij} > 0 \Rightarrow$ having $i, j$ active together increases the score more than their individual contributions alone.

Note:
- The initialization below works because every category is one-hot and its top ordinal is 3. If that changes, the diagonal construction needs adjusting.
- Using a quadratic form limits the model to individual (diagonal) terms and pairwise interactions between features.
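
To make the roles of the diagonal and off-diagonal weights concrete, here is a small numeric sketch. `W_toy` is a 4-feature matrix with made-up values, not the notebook's $26 \times 26$ matrix. For a binary vector $x$ with active index set $A$ and symmetric $W$, the quadratic form reduces to $\sum_{i \in A} w_{ii} + 2 \sum_{i < j,\; i, j \in A} w_{ij}$, which the code checks numerically.

```python
import numpy as np

# Toy symmetric weight matrix with made-up values (illustration only).
W_toy = np.array([
    [1.0, 0.0, 0.1, 0.0],
    [0.0, 2.0, 0.0, 0.3],
    [0.1, 0.0, 3.0, 0.0],
    [0.0, 0.3, 0.0, 4.0],
])
x = np.array([1, 0, 1, 0])  # features 0 and 2 are active

# Quadratic form for a single row; equivalent to the row-wise
# np.sum((F @ W) * F, axis=1) computation applied to one row.
score = float(x @ W_toy @ x)

# Decomposition: individual terms plus twice each pairwise interaction.
active = np.flatnonzero(x)
diagonal_part = float(W_toy[active, active].sum())
pairwise_part = 2 * sum(float(W_toy[i, j]) for i in active for j in active if i < j)

print(score, diagonal_part + pairwise_part)
```

The factor of 2 appears because a symmetric matrix counts each pair once as $w_{ij}$ and once as $w_{ji}$.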

In [32]:
K = feature_matrix.shape[1]
W = np.zeros((K, K))

start_idx = 0
for cat, cat_cfg in MATRIX.items():  # `cat_cfg` avoids shadowing the imported `cfg`
    weight = cat_cfg['WEIGHT']
    dim = cat_cfg['DIMENSION']
    end_idx = start_idx + dim
    # Diagonal entries are ordinal * weight: [1,2,3] for dim 3, [0,1,2,3] for dim 4
    tiers = np.arange(4 - dim, 4) * weight
    indices = np.arange(start_idx, end_idx)
    W[indices, indices] = tiers
    start_idx = end_idx

# Add small random noise to off-diagonal elements
np.random.seed(42) 
noise = np.random.normal(0, 0.005, (K, K))
np.fill_diagonal(noise, 0)
W += noise

# Enforce symmetry
W = 0.5 * (W + W.T)

scores = np.sum((feature_matrix @ W) * feature_matrix, axis=1)
df["score"] = scores
results = df.sort_values(by="score", ascending=False)

# print(W)

In [28]:
# results.to_csv('out/results.csv')
results

Unnamed: 0,Name,Undergrad School,Graduate School,Current Company,Current Title,Previous Companies,Previous Titles,Linkedin,UNDERGRAD,GRADUATE,COMPANY,SENIORITY,EXPERTISE,EXIT,FOUNDER,STARTUP,feature_vector,score
0,Andrew Cantino,Haverford College,Georgia Institute of Technology,Overview Energy,Founder / Chief Strategy Officer,"[The Navigation Fund, The Orbital Index, Overv...","[Climate Program Consultant, Co-Creator, Found...",https://www.linkedin.com/in/cantino,1,1,2,3,3,0,3,3,"[1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, ...",36.022944
