This notebook walks through the entire ranking workflow (no training). It includes:

1. LinkedIn scraping via Proxycurl
2. Encoding data into ordinals (sufficient for the naive matrix) and a $1\times26$ feature vector (one-hot by ordinal)
3. Computing a simple score and prioritizing profiles by that score (from the original matrix)

In [1]:

import numpy as np
import json
import sys

sys.path.append("..")

from src.config.config import cfg

from src.clients.perplexity_client import PerplexityClient
from src.clients.proxycurl_client import ProxycurlClient
from src.data.profile_transforms import ProfileTransforms
from src.utils.model_utils import initialize_weight_matrix, score_feature_matrix
from src.core.ranking import search_founders


np.set_printoptions(precision=2, suppress=True, linewidth=120)

### LinkedIn Search

In [None]:
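px = ProxycurlClient()

SEARCH = False  # True: run a live search; False: load the cached sample from disk
N = 1  # limit passed to search_founders when SEARCH is True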


In [2]:
# Either run a live founder search or fall back to the cached sample response.
if SEARCH:
    data = search_founders(px=px, limit=N)
else:
    with open("../data/proxycurl/proxy_sample.json", "r") as json_file:
        data = json.load(json_file)

In [None]:
data

### Scoring Configuration

Consider [this](https://docs.google.com/document/d/1D7Zjma2FrrnSuoQTMsI0ec_5ENTzjN6wIpKoI4O1ASE/edit?tab=t.0) evaluation matrix. 

We could consider learning the tiers, but hardcoding them seems fine since we don't have enough data to learn them.

Here's the current matrix scoring schema:

| Category                    | Encoding | Example                                         | Dimension |
|-----------------------------|----------|-------------------------------------------------|-----------|
| Undergrad                   | One hot  | [Tier 3, Tier 2, Tier 1 (other)]                | 3         |
| Graduate                    | One hot  | [Tier 3, Tier 2, Tier 1 (other), Fallback/None] | 4         |
| Previous Exit               | One hot  | [100m+, 25m-100m, 1-25m, Fallback/None]         | 4         |
| Previous Founder            | One hot  | [yes - success, yes, no]                        | 3         |
| Prior Startup Exp           | One hot  | [early + success, early, no]                    | 3         |
| Company as Employee Quality | One hot  | [Tier 3, Tier 2, Tier 1 (other)]                | 3         |
| Seniority                   | One hot  | [Tier 3, Tier 2, Tier 1 (other)]                | 3         |
| Expertise                   | One hot  | [Tier 3, Tier 2, Tier 1 (other)]                | 3         |
| Total                       |          |                                                 | 26        |
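
To make the schema concrete, here is a minimal sketch of how eight ordinals could be expanded into the 26-dimensional one-hot vector. It is not the `ProfileTransforms` implementation, which is not shown here. The category keys, the `one_hot_by_ordinal` helper, and the convention that ordinal `k` selects position `k` within a block are all assumptions made for illustration; only the block sizes come from the table.

```python
import numpy as np

# Block sizes copied from the schema table. The key names are hypothetical.
SCHEMA = {
    "undergrad": 3,
    "graduate": 4,
    "previous_exit": 4,
    "previous_founder": 3,
    "prior_startup_exp": 3,
    "company_quality": 3,
    "seniority": 3,
    "expertise": 3,
}


def one_hot_by_ordinal(ordinals):
    """Concatenate one one-hot block per category into a single vector.

    Assumes ordinal k marks position k inside its category's block.
    """
    blocks = []
    for name, size in SCHEMA.items():
        idx = ordinals[name]
        if not 0 <= idx < size:
            raise ValueError(f"{name}: ordinal {idx} outside [0, {size})")
        block = np.zeros(size)
        block[idx] = 1.0
        blocks.append(block)
    return np.concatenate(blocks)


# Hypothetical profile: first slot everywhere, except the last graduate slot.
example = {name: 0 for name in SCHEMA}
example["graduate"] = 3
x = one_hot_by_ordinal(example)
print(x.shape, int(x.sum()))
```

Each profile activates exactly one slot per category, so every feature vector under this scheme has eight nonzero entries out of 26.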


In [None]:
pc = PerplexityClient()
T = ProfileTransforms(data)

df = T.process_profiles(profiles=data, perplexity_client=pc)

### Ranking 

`W`: $26 \times 26$

`feature_matrix`: $N \times 26$

Compute scores as $\mathrm{score}_i = x_i^\top\left(\frac{W+W^\top}{2}+\mathrm{E}\right)x_i$ for each row $x_i$ of `feature_matrix`.

Weight matrix initialization:
- Diagonal elements are the individual contribution of each feature.
- Off-diagonal elements are pairwise interactions between different features: $w_{ij} > 0$ means that having features $i$ and $j$ active together increases the score beyond their individual contributions alone.
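
The quadratic form above can be sketched in a few lines of NumPy. This is an illustrative stand-in, not the repository's `score_feature_matrix`, whose implementation is not shown here. The `quadratic_scores` helper, the optional `E` argument, and the toy $3 \times 3$ weights are assumptions chosen so the arithmetic is easy to check by hand.

```python
import numpy as np


def quadratic_scores(X, W, E=None):
    """Return x_i^T ((W + W^T) / 2 + E) x_i for every row x_i of X."""
    M = (W + W.T) / 2.0
    if E is not None:
        M = M + E
    # One quadratic form per row, without materializing an N x N product.
    return np.einsum("ni,ij,nj->n", X, M, X)


# Toy weights: diagonal = individual contributions, W[0, 1] = interaction term.
W = np.array([
    [1.0, 0.5, 0.0],
    [0.0, 2.0, 0.0],
    [0.0, 0.0, 3.0],
])

# Two toy rows: features 0 and 1 active together, then feature 2 alone.
X = np.array([
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

scores = quadratic_scores(X, W)
order = np.argsort(-scores)  # row indices, highest score first
print(scores)  # [3.5 3. ]
print(order)   # [0 1]
```

The first row scores $1 + 2 + 0.5 = 3.5$: the two diagonal contributions plus the interaction weight. Symmetrizing $W$ does not change the value of the quadratic form; it only splits each interaction weight evenly between $w_{ij}$ and $w_{ji}$.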

In [6]:
feature_matrix = T.get_feature_matrix()

# Number of feature columns
K = feature_matrix.shape[1]

W = initialize_weight_matrix(K, cfg.MATRIX, seed=42, eps=0)

# Score every profile, then rank from highest to lowest score
df["score"] = score_feature_matrix(feature_matrix, W)

results = df.sort_values(by="score", ascending=False)

# print(W)

In [None]:
results