# Founder Rank
This notebook implements all ranking workflows. Data pipeline to be added.

### Preprocessing

1. Encoding
   - Convert founder profiles to numerical values (ordinals) using:
     - Configuration mappings from `src.config.Config.MATRIX`
     - Profile evaluation via `src.clients.perplexity_client.eval_person`
   - Transform ordinals into feature vectors:
     - Each ordinal value is converted to one-hot encoding
     - All one-hot vectors are concatenated into a single feature vector
     - Implementation: `src.processing.transforms.transform`
2. Dataset Creation
   - Synthetic Data (`src.datagen.datagen.DataGenerator`):
     - Generate founder attributes using pdfs defined in `src.config.Config.SYNTH`
     - Assign binary success labels based on predefined criteria (exit or Series B)
   
   - YC Dataset:
     - Scraped batches 2012-2021 and top companies via `src.clients.yc_client`; exits and funding are evaluated via `src.clients.perplexity_client.eval_company`.
     - Top YC Companies:
        - Known to be successful.
     - Batches
        - Only the YC W21 and S21 batches have targets for now.
     - Implementation: `notebooks.live-data.ipynb` & `src.clients.yc_client`
3. Split
    - Training: synthetic data plus some batch data
    - Validation/Test: top YC companies plus batch data
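
The encoding step above (ordinal per category, one-hot per ordinal, then concatenation) can be sketched as follows. This is a minimal illustration, not the code in `src.processing.transforms.transform`: the category names, their dimensions, and the zero-based ordinals are assumptions made for the example, standing in for the real entries of `src.config.Config.MATRIX`.

```python
from typing import Dict, List

# Hypothetical stand-in for the MATRIX config: each category maps to the
# number of ordinal levels it can take, i.e. its one-hot dimension.
EXAMPLE_DIMENSIONS: Dict[str, int] = {
    "education": 3,
    "experience": 4,
    "prior_exit": 2,
}


def one_hot(ordinal: int, dimension: int) -> List[int]:
    """Return a one-hot list of length `dimension` with a 1 at index `ordinal`."""
    if not 0 <= ordinal < dimension:
        raise ValueError(f"ordinal {ordinal} is outside [0, {dimension})")
    vec = [0] * dimension
    vec[ordinal] = 1
    return vec


def encode_profile(ordinals: Dict[str, int], dimensions: Dict[str, int]) -> List[int]:
    """Concatenate the one-hot encodings of every category, in config order."""
    features: List[int] = []
    for category, dimension in dimensions.items():
        features.extend(one_hot(ordinals[category], dimension))
    return features


# Assumed zero-based ordinals for a single founder profile.
profile = {"education": 2, "experience": 0, "prior_exit": 1}
feature_vector = encode_profile(profile, EXAMPLE_DIMENSIONS)
print(feature_vector)  # [0, 0, 1, 1, 0, 0, 0, 0, 1]
```

The resulting vector has one slot per ordinal level across all categories, so its length equals the sum of the category dimensions and exactly one entry per category is set.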

### Model 

1. Architecture
    - Models (`src.models.quadratic`):
        - `QuadraticModel`: Learns pairwise feature interactions via W matrix
        - `QuadMLP`: adds an MLP term to capture nonlinear or higher-order effects (marginal improvement)
            - Quadratic: x^T W x captures explicit pairwise interactions
            - MLP: [64] -> LN -> GELU -> D(0.2) -> [32] -> LN -> GELU -> D(0.1) -> [1]
    - Confidence scoring:
        - Raw score: $f(x)$ = quadratic term + MLP term
        - Probability: $P(\text{success} \mid x) = \sigma(f(x))$
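
As an illustration of the quadratic term and the confidence score, the sketch below computes $x^T W x$ for a small batch and passes it through a sigmoid. It is a NumPy approximation under stated assumptions, not the `QuadraticModel` implementation: the MLP term is omitted, and the random `W` and `x` are placeholders for learned weights and real feature vectors.

```python
import numpy as np


def sigmoid(z: np.ndarray) -> np.ndarray:
    """Logistic function mapping raw scores to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def quadratic_score(x: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Compute x_n^T W x_n for every row x_n of a batch.

    x has shape (n, d) and W has shape (d, d); the result has shape (n,).
    """
    return np.einsum("ni,ij,nj->n", x, W, x)


# Placeholder data: random weights and features (assumptions for the sketch).
rng = np.random.default_rng(0)
d = 4
W = rng.normal(size=(d, d))
x = rng.normal(size=(3, d))

scores = quadratic_score(x, W)   # raw score f(x), quadratic term only
probs = sigmoid(scores)          # P(success | x) = sigma(f(x))

# Only the symmetric part of W affects the score, since
# x^T W x = x^T ((W + W^T) / 2) x.
W_sym = (W + W.T) / 2
assert np.allclose(scores, quadratic_score(x, W_sym))
```

Entry $W_{ij}$ weights the product $x_i x_j$, which is how a single matrix expresses every pairwise feature interaction. Because only the symmetric part of $W$ contributes, $W_{ij}$ and $W_{ji}$ act together as one interaction weight.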


In [None]:
import os
import sys

import numpy as np
import pandas as pd
import requests
from dotenv import load_dotenv

# Make the parent directory importable so the `src` package resolves
sys.path.append('..')

from src.clients.perplexity_client import PerplexityClient
from src.clients.proxycurl_client import ProxycurlClient
from src.config.config import cfg
from src.processing.transforms import ProfileTransforms

MATRIX = cfg.MATRIX
load_dotenv()

In [2]:
px = ProxycurlClient()

founders = pd.read_csv('../data/live/yc/W21.csv')
# Drop rows where both exit value and total funding are zero
founders = founders[~((founders['exit_value_usd'] == 0) & (founders['total_funding_usd'] == 0))]

# Keep only the first three founders
founders = founders.head(3)

# Request setup does not change between iterations, so build it once
api_key = os.getenv('PROXYCURL_API_KEY')
if not api_key:
    raise RuntimeError('PROXYCURL_API_KEY is not set')
headers = {'Authorization': 'Bearer ' + api_key}
api_endpoint = 'https://nubela.co/proxycurl/api/v2/linkedin'

linkedin_profiles = []

for idx, row in founders.iterrows():
    params = {
        'linkedin_profile_url': row['LinkedIn'],
        'use_cache': 'if-present',
    }
    response = requests.get(api_endpoint, params=params, headers=headers)
    linkedin_profiles.append(response.json())

# Convert to DataFrame for easier analysis
linkedin_df = pd.DataFrame(linkedin_profiles)


In [None]:
linkedin_profiles

In [4]:
T = ProfileTransforms(data={}, matrix=MATRIX)

df = T.transform_person_endpt(profile_list=linkedin_profiles, cutoff_date=2020)
T.df = df


In [None]:
df

In [6]:
pc = PerplexityClient()
df = T.transform(pc, process_profiles=False)

In [None]:
df

In [None]:
feature_matrix = T.create_feature_matrix()
df["feature_vector"] = list(feature_matrix)

df[["Name", "Current Company", "Current Title", "Linkedin", "feature_vector"]]

In [None]:
df

In [10]:
# Get dimensions for each category from MATRIX config
dimensions = {}
start_idx = 0
for cat in cfg.MATRIX:
    dim = cfg.MATRIX[cat]["DIMENSION"]
    dimensions[cat] = (start_idx, start_idx + dim)
    start_idx += dim

# Extract feature vectors into separate columns
feature_vectors = np.array(df["feature_vector"].tolist())

# Create new columns for each feature
feature_names = []
for cat, (start, end) in dimensions.items():
    for i in range(end - start):
        if cfg.MATRIX[cat]["DIMENSION"] == 3:
            col_name = f"{cat}_{i+1}"
        else:
            col_name = f"{cat}_{i}"
        df[col_name] = feature_vectors[:, start + i]
        feature_names.append(col_name)

# Drop the feature_vector column
df = df.drop("feature_vector", axis=1)

In [None]:
df

In [None]:
import torch
from src.models.quadratic import QuadMLP 
from sklearn.preprocessing import StandardScaler 

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_path = "../models/founder_rank.pt"
model = QuadMLP(input_dim=26, hidden_dim=64) 

# Load the state dict onto the active device, so a checkpoint saved on GPU also loads on CPU
state_dict = torch.load(model_path, map_location=device)
model.load_state_dict(state_dict)  # The model class already has quad_scale and mlp_scale defined
model.to(device)
model.eval()

# Prepare features
# Note: the scaler is fit on the rows being scored here, not on the training set
X = df[feature_names].values
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_tensor = torch.FloatTensor(X_scaled).to(device)

# Evaluation
with torch.no_grad():
    logits = model(X_tensor)
    
    temperature = 2.0
    scaled_logits = logits / temperature
    
    probs = torch.sigmoid(scaled_logits).cpu().numpy()
    preds = (probs > 0.5).astype(int)

# Add predictions to dataframe    
df["raw_logits"] = logits.cpu().numpy()
df["success_probability"] = probs
df["predicted_success"] = preds

# Sort by probability to get rankings
ranked_founders = df.sort_values("success_probability", ascending=False)
print(ranked_founders[["Name", "Current Company", "success_probability", "predicted_success"]])