# Embedding Methods for Recommenders

| System Design |
|----|
| *Offline*: user events → **embedding training** → item/user vectors (versioned) |
| *Storage*: embedding store + ANN index (same version) |
| *Online*: request → user vector → ANN top-K → ranking → rules → response |

  

Embedding families used in recommender systems:

1. **Co-occurrence / PMI + SVD** (fast baseline)
2. **Item2Vec** (sequence-based)
3. **Graph embeddings via random walks** (structure)
4. **Multimodal content embedding fusion** (text/image signals, cold start)


## 0) Setup

In [1]:

import numpy as np
from numpy.random import default_rng
from collections import Counter
import math
import time

rng = default_rng(7)

def l2_normalize(x, axis=1, eps=1e-12):
    n = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(n, eps)

def recall_at_k(query_emb, item_emb, positives, k=20):
    """Mean Recall@K for one positive per query (fast + simple)."""
    q = l2_normalize(query_emb, axis=1)
    it = l2_normalize(item_emb, axis=1)
    scores = q @ it.T
    topk = np.argpartition(-scores, kth=k-1, axis=1)[:, :k]
    hits = 0
    for i, pos in enumerate(positives):
        if int(pos) in topk[i]:
            hits += 1
    return hits / len(positives)
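
As a quick check of the metric, the sketch below re-implements it in vectorized form and runs it on a toy catalog. The helper name `recall_at_k_vectorized` and the toy arrays are assumptions made for illustration; they are not part of the notebook.

```python
import numpy as np

def l2_normalize(x, axis=1, eps=1e-12):
    n = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(n, eps)

def recall_at_k_vectorized(query_emb, item_emb, positives, k=20):
    """Same metric as recall_at_k, with the hit test done in NumPy (hypothetical helper)."""
    scores = l2_normalize(query_emb) @ l2_normalize(item_emb).T
    topk = np.argpartition(-scores, kth=k - 1, axis=1)[:, :k]
    positives = np.asarray(positives).reshape(-1, 1)
    # A query is a hit if its positive item id appears anywhere in its top-K row.
    return float((topk == positives).any(axis=1).mean())

# Toy catalog: 4 items with orthogonal embeddings.
items = np.eye(4, dtype=np.float32)
queries = items[[0, 1]]      # query i is item i itself
positives = [0, 2]           # first query hits at k=1, second misses
toy_recall = recall_at_k_vectorized(queries, items, positives, k=1)
print(toy_recall)  # 0.5
```

One of the two toy queries retrieves its positive, so the recall is one half.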


## 1) Synthetic data: users, items, sessions
Sessions are item sequences sampled from a mixture of latent user-item affinity and item popularity.

In [2]:

# Size knobs
n_users = 1500
n_items = 400
latent_dim = 16

# Latent factors that drive "true" affinity
U_true = rng.normal(size=(n_users, latent_dim)).astype(np.float32)
V_true = rng.normal(size=(n_items, latent_dim)).astype(np.float32)

# Item popularity weights: rng.power(a=2) has density 2x on [0, 1], so this is a mild skew, not a heavy long tail
pop = rng.power(a=2.0, size=n_items).astype(np.float32)
pop = pop / pop.sum()

def sample_item_for_user(u_id):
    # mixture: latent affinity + popularity
    scores = U_true[u_id] @ V_true.T
    scores = (scores - scores.max())
    p_aff = np.exp(scores).astype(np.float64)
    p_aff = p_aff / p_aff.sum()
    p = 0.8 * p_aff + 0.2 * pop
    return int(rng.choice(n_items, p=p))

# Build sessions (item sequences). Each session stands for the views/clicks in one visit.
n_sessions = 7000
session_len = 20

sessions = []
for _ in range(n_sessions):
    u = int(rng.integers(0, n_users))
    seq = [sample_item_for_user(u) for _ in range(session_len)]
    sessions.append(seq)

flat = [i for s in sessions for i in s]
item_counts = Counter(flat)

print("sessions:", len(sessions), "avg len:", round(np.mean([len(s) for s in sessions]), 2))
print("top items:", item_counts.most_common(5))


sessions: 7000 avg len: 20.0
top items: [(102, 2898), (156, 2463), (308, 2029), (28, 2012), (139, 1981)]


## 2) Method A: Co-occurrence → PMI → SVD embedding
A fast baseline that is often hard to beat.

Steps:
- Count item-item co-occurrence within a context window
- Convert to PPMI (positive PMI)
- Run SVD to get dense vectors
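
For reference, the three steps can be written out as follows. The notation ($C_{ij}$ for the co-occurrence count, $E$ for the embedding matrix, $k$ for the embedding dimension) is chosen here for illustration, and the small $\varepsilon$ terms used in the code for numerical stability are omitted.

$$
p(i,j) = \frac{C_{ij}}{\sum_{a,b} C_{ab}}, \qquad
p(i) = \sum_{j} p(i,j), \qquad
p(j) = \sum_{i} p(i,j)
$$

$$
\mathrm{PMI}(i,j) = \log \frac{p(i,j)}{p(i)\,p(j)}, \qquad
\mathrm{PPMI}(i,j) = \max\bigl(\mathrm{PMI}(i,j),\, 0\bigr)
$$

$$
\mathrm{PPMI} \approx U_k \Sigma_k V_k^{\top}, \qquad
E = U_k \Sigma_k
$$

Scaling the left singular vectors by $\Sigma_k$ is one common convention; other weightings of the singular values are possible and may be worth comparing.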

In [3]:

def build_cooccurrence(sessions, n_items, window=5):
    C = np.zeros((n_items, n_items), dtype=np.float32)
    for seq in sessions:
        L = len(seq)
        for i, center in enumerate(seq):
            left = max(0, i-window)
            right = min(L, i+window+1)
            for j in range(left, right):
                if j == i:
                    continue
                context = seq[j]
                C[center, context] += 1.0
    return C

t0 = time.time()
C = build_cooccurrence(sessions, n_items, window=5)
print("cooc built in", round(time.time()-t0, 2), "s; nnz≈", int((C > 0).sum()))

row_sum = C.sum(axis=1, keepdims=True)
col_sum = C.sum(axis=0, keepdims=True)
total = C.sum()

eps = 1e-8
p_ij = C / (total + eps)
p_i = row_sum / (total + eps)
p_j = col_sum / (total + eps)

PMI = np.log((p_ij + eps) / (p_i @ p_j + eps))
PPMI = np.maximum(PMI, 0.0).astype(np.float32)

k = 32
U, S, VT = np.linalg.svd(PPMI, full_matrices=False)
item_emb_ppmi = (U[:, :k] * S[:k]).astype(np.float32)

print("item_emb_ppmi:", item_emb_ppmi.shape, "S[0:5]:", np.round(S[:5], 3))


cooc built in 0.61 s; nnz≈ 127644
item_emb_ppmi: (400, 32) S[0:5]: [95.062 27.541 26.924 26.232 25.471]


### Sanity check for retrieval
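Predict the next item from the previous item: a fast sanity check that the embeddings carry structure. The pairs come from sessions that were also used to build the embeddings, so this is not a held-out evaluation.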

In [4]:

pairs = []
for seq in sessions[:3000]:
    for a, b in zip(seq[:-1], seq[1:]):
        pairs.append((a, b))

pairs = np.array(pairs, dtype=np.int32)
prev_items = pairs[:, 0]
next_items = pairs[:, 1]

r20 = recall_at_k(item_emb_ppmi[prev_items], item_emb_ppmi, next_items, k=20)
print("PPMI+SVD Recall@20:", round(r20, 4))


PPMI+SVD Recall@20: 0.3354
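
Because the evaluation pairs overlap with the training sessions, the number above is likely optimistic. A minimal sketch of a session-level holdout is shown below; `split_sessions`, `next_item_pairs`, and the toy sessions are hypothetical names and data, not part of the notebook.

```python
import numpy as np

def split_sessions(sessions, test_frac=0.2, seed=0):
    """Shuffle session indices and hold out a fraction for evaluation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(sessions))
    n_test = int(round(test_frac * len(sessions)))
    test_idx = set(idx[:n_test].tolist())
    train = [s for i, s in enumerate(sessions) if i not in test_idx]
    test = [s for i, s in enumerate(sessions) if i in test_idx]
    return train, test

def next_item_pairs(sessions):
    """(previous item, next item) pairs from consecutive positions in each session."""
    return np.array(
        [(a, b) for s in sessions for a, b in zip(s[:-1], s[1:])],
        dtype=np.int32,
    )

toy_sessions = [[0, 1, 2], [2, 3, 0], [1, 1, 3], [3, 2, 1], [0, 2, 3]]
train_s, test_s = split_sessions(toy_sessions, test_frac=0.2, seed=0)
eval_pairs = next_item_pairs(test_s)
print(len(train_s), len(test_s), eval_pairs.shape)
```

With this split, embeddings would be fit on `train_s` only and `eval_pairs` would replace the pairs built from the training sessions.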


## 3) Method B: Item2Vec (skip-gram + negative sampling)
Sequence-based embeddings.

Implementation:
- Build (center, context) pairs from sessions
- Train embeddings with negative sampling (plain per-pair SGD, a few passes)
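
The per-pair objective and gradients that the training loop implements can be written as follows. The notation is an assumption for this note: $v_c$ is the row of `W_in` for the center item, $u_o$ and $u_n$ are rows of `W_out` for the context item and a sampled negative, $N$ is the set of sampled negatives, and $\sigma$ is the logistic sigmoid.

$$
\ell(c, o) = -\log \sigma\bigl(v_c^{\top} u_o\bigr) \;-\; \sum_{n \in N} \log\bigl(1 - \sigma(v_c^{\top} u_n)\bigr)
$$

$$
\frac{\partial \ell}{\partial v_c} = \bigl(\sigma(v_c^{\top} u_o) - 1\bigr)\, u_o + \sum_{n \in N} \sigma\bigl(v_c^{\top} u_n\bigr)\, u_n
$$

$$
\frac{\partial \ell}{\partial u_o} = \bigl(\sigma(v_c^{\top} u_o) - 1\bigr)\, v_c, \qquad
\frac{\partial \ell}{\partial u_n} = \sigma\bigl(v_c^{\top} u_n\bigr)\, v_c
$$

Negatives are drawn with probability proportional to $\mathrm{count}(n)^{0.75}$.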

In [5]:

def build_skipgram_pairs(seqs, window=4, max_pairs=250_000):
    pairs = []
    for seq in seqs:
        L = len(seq)
        for i, center in enumerate(seq):
            left = max(0, i-window)
            right = min(L, i+window+1)
            for j in range(left, right):
                if j == i:
                    continue
                pairs.append((center, seq[j]))
                if len(pairs) >= max_pairs:
                    return np.array(pairs, dtype=np.int32)
    return np.array(pairs, dtype=np.int32)

pairs_sg = build_skipgram_pairs(sessions, window=4, max_pairs=220_000)
print("skipgram pairs:", pairs_sg.shape)

counts = np.bincount(np.array(flat, dtype=np.int32), minlength=n_items).astype(np.float64)
neg_dist = counts ** 0.75
neg_dist = neg_dist / neg_dist.sum()

def train_item2vec(pairs, n_items, dim=32, lr=0.05, epochs=2, neg_k=10, seed=7, steps=120_000):
    rng = default_rng(seed)
    W_in = (0.01 * rng.normal(size=(n_items, dim))).astype(np.float32)
    W_out = (0.01 * rng.normal(size=(n_items, dim))).astype(np.float32)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    idx = np.arange(len(pairs))
    for ep in range(epochs):
        rng.shuffle(idx)
        loss = 0.0
        for t in idx[:steps]:
            c, o = pairs[t]
            v_c = W_in[c]
            v_o = W_out[o]

            # positive label = 1
            score_pos = float(v_c @ v_o)
            p_pos = sigmoid(score_pos)
            loss += -math.log(max(p_pos, 1e-10))
            grad_pos = (p_pos - 1.0)

            g_in = grad_pos * v_o
            g_out = grad_pos * v_c

            # negatives label = 0 (sampled from the module-level neg_dist defined above)
            negs = rng.choice(n_items, size=neg_k, replace=True, p=neg_dist)
            for n in negs:
                v_n = W_out[n]
                score_neg = float(v_c @ v_n)
                p_neg = sigmoid(score_neg)
                loss += -math.log(max(1.0 - p_neg, 1e-10))
                grad_neg = p_neg
                g_in += grad_neg * v_n
                W_out[n] -= lr * (grad_neg * v_c)

            W_in[c] -= lr * g_in
            W_out[o] -= lr * g_out

        # average over the steps actually run (idx can be shorter than steps)
        print(f"epoch {ep+1}/{epochs} avg_loss≈{loss/min(steps, len(idx)):.4f}")
    return W_in

t0 = time.time()
item_emb_i2v = train_item2vec(pairs_sg, n_items, dim=32, lr=0.04, epochs=2, neg_k=8, seed=7)
print("trained in", round(time.time()-t0, 2), "s")

r20 = recall_at_k(item_emb_i2v[prev_items], item_emb_i2v, next_items, k=20)
print("Item2Vec Recall@20:", round(r20, 4))


skipgram pairs: (220000, 2)
epoch 1/2 avg_loss≈3.1329
epoch 2/2 avg_loss≈2.5874
trained in 53.6 s
Item2Vec Recall@20: 0.2991
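
Most of the training time above goes into the per-pair Python loop. One possible speed-up is a mini-batch update, sketched below. This is a variant rather than an exact equivalent of the loop: all gradients in a batch are computed from the same weights before any update is applied. The function name `sgns_minibatch_step` and the toy batch are assumptions for illustration.

```python
import numpy as np

def sgns_minibatch_step(W_in, W_out, centers, contexts, negs, lr=0.05):
    """One mini-batch SGD step for skip-gram with negative sampling.

    centers, contexts: int arrays of shape (B,); negs: int array of shape (B, K).
    Updates W_in and W_out in place and returns the mean loss per pair.
    """
    v_c = W_in[centers]                       # (B, D) copies, not views
    u_pos = W_out[contexts]                   # (B, D)
    u_neg = W_out[negs]                       # (B, K, D)

    s_pos = np.einsum("bd,bd->b", v_c, u_pos)
    s_neg = np.einsum("bd,bkd->bk", v_c, u_neg)
    p_pos = 1.0 / (1.0 + np.exp(-s_pos))
    p_neg = 1.0 / (1.0 + np.exp(-s_neg))

    loss = -np.log(np.maximum(p_pos, 1e-10)).sum()
    loss -= np.log(np.maximum(1.0 - p_neg, 1e-10)).sum()

    g_pos = (p_pos - 1.0)[:, None]            # d loss / d s_pos
    g_neg = p_neg[:, :, None]                 # d loss / d s_neg
    grad_in = g_pos * u_pos + (g_neg * u_neg).sum(axis=1)
    grad_out_pos = g_pos * v_c
    grad_out_neg = g_neg * v_c[:, None, :]

    # np.add.at accumulates correctly when an index repeats within the batch.
    np.add.at(W_in, centers, -lr * grad_in)
    np.add.at(W_out, contexts, -lr * grad_out_pos)
    np.add.at(W_out, negs.reshape(-1), -lr * grad_out_neg.reshape(-1, W_out.shape[1]))
    return float(loss / len(centers))

rng = np.random.default_rng(0)
n_items, dim = 6, 4
W_in = 0.1 * rng.normal(size=(n_items, dim))
W_out = 0.1 * rng.normal(size=(n_items, dim))

# Toy batch: items 0-2 co-occur with each other, items 3-5 serve as negatives.
centers = np.array([0, 1, 2, 0, 1, 2])
contexts = np.array([1, 2, 0, 2, 0, 1])
negs = rng.integers(3, 6, size=(len(centers), 2))

losses = [sgns_minibatch_step(W_in, W_out, centers, contexts, negs, lr=0.05)
          for _ in range(200)]
print(round(losses[0], 3), "->", round(losses[-1], 3))
```

In a real run, each step would draw a fresh batch of pairs and fresh negatives; the fixed batch here only shows that the update reduces the loss.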


## 4) Method C: Graph embeddings (random walks + skip-gram)
Build an item-item graph from co-occurrence, run random walks, then train Item2Vec on the walks.

Walks can link items that never share a session window, so the embeddings can capture longer-range structure than a fixed window.

In [6]:

# Adjacency from co-occurrence counts: top-N neighbors per item
topN = 30
adj = [[] for _ in range(n_items)]
for i in range(n_items):
    row = C[i].copy()
    row[i] = 0
    nbrs = np.argpartition(-row, kth=topN)[:topN]
    nbrs = nbrs[row[nbrs] > 0]
    adj[i] = nbrs.tolist()

def random_walks(adj, n_walks=6, walk_len=25, seed=7):
    rng = default_rng(seed)
    walks = []
    for start in range(len(adj)):
        if not adj[start]:
            continue
        for _ in range(n_walks):
            w = [start]
            cur = start
            for _ in range(walk_len - 1):
                nbrs = adj[cur]
                if not nbrs:
                    break
                cur = int(rng.choice(nbrs))
                w.append(cur)
            walks.append(w)
    return walks

walks = random_walks(adj, n_walks=6, walk_len=25, seed=7)
print("walks:", len(walks), "avg len:", round(np.mean([len(w) for w in walks]), 2))

pairs_walk = build_skipgram_pairs(walks, window=3, max_pairs=240_000)
print("walk skipgram pairs:", pairs_walk.shape)

t0 = time.time()
item_emb_graph = train_item2vec(pairs_walk, n_items, dim=32, lr=0.04, epochs=2, neg_k=8, seed=9)
print("trained in", round(time.time()-t0, 2), "s")

r20 = recall_at_k(item_emb_graph[prev_items], item_emb_graph, next_items, k=20)
print("Graph-walk Recall@20:", round(r20, 4))


walks: 2400 avg len: 25.0
walk skipgram pairs: (240000, 2)
epoch 1/2 avg_loss≈3.2574
epoch 2/2 avg_loss≈2.8827
trained in 56.69 s
Graph-walk Recall@20: 0.2667
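
The walks above choose the next neighbor uniformly, which discards the co-occurrence counts once the top-N neighbors are selected. One variant worth trying is to sample the next item in proportion to those counts. The sketch below assumes a dense co-occurrence matrix like `C`; the function name `weighted_random_walks` and the toy matrix are hypothetical.

```python
import numpy as np

def weighted_random_walks(C, top_n=30, n_walks=6, walk_len=25, seed=7):
    """Random walks where the next item is drawn in proportion to co-occurrence counts."""
    rng = np.random.default_rng(seed)
    n_items = C.shape[0]
    nbrs, probs = [], []
    for i in range(n_items):
        row = C[i].astype(np.float64).copy()
        row[i] = 0.0                                   # no self-loops
        keep = np.argsort(-row)[:top_n]                # strongest neighbors first
        keep = keep[row[keep] > 0]
        nbrs.append(keep)
        probs.append(row[keep] / row[keep].sum() if len(keep) else None)

    walks = []
    for start in range(n_items):
        if len(nbrs[start]) == 0:                      # isolated item: no walk
            continue
        for _ in range(n_walks):
            walk, cur = [start], start
            for _ in range(walk_len - 1):
                if len(nbrs[cur]) == 0:
                    break
                cur = int(rng.choice(nbrs[cur], p=probs[cur]))
                walk.append(cur)
            walks.append(walk)
    return walks

# Toy graph: 0-1 strongly linked, 2 weakly linked to 0, item 3 isolated.
C_toy = np.array([[0, 9, 1, 0],
                  [9, 0, 0, 0],
                  [1, 0, 0, 0],
                  [0, 0, 0, 0]], dtype=np.float32)
toy_walks = weighted_random_walks(C_toy, top_n=2, n_walks=3, walk_len=5, seed=0)
print(len(toy_walks), toy_walks[0])
```

Whether weighting helps depends on the data; it tends to keep walks inside dense neighborhoods, which may or may not be what the downstream task needs.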


## 5) Method D: Fuse with content-based embeddings
In practice we often have precomputed text/image embeddings. We can use them directly (cold start), or fuse with behavior embeddings.

Here we simulate text and image vectors correlated with item latent factors, then compare retrieval from fused embeddings.

In [7]:

d_mm = 64
A_txt = rng.normal(size=(latent_dim, d_mm)).astype(np.float32)
A_img = rng.normal(size=(latent_dim, d_mm)).astype(np.float32)

text_emb = (V_true @ A_txt + 0.2 * rng.normal(size=(n_items, d_mm))).astype(np.float32)
img_emb  = (V_true @ A_img + 0.2 * rng.normal(size=(n_items, d_mm))).astype(np.float32)

# Simple fusion: concat + fixed random linear projection (not learned)
X = np.concatenate([l2_normalize(text_emb), l2_normalize(img_emb)], axis=1)
proj = rng.normal(size=(X.shape[1], 32)).astype(np.float32)
item_emb_mm = (X @ proj).astype(np.float32)

r20 = recall_at_k(item_emb_mm[prev_items], item_emb_mm, next_items, k=20)
print("Multimodal Recall@20:", round(r20, 4))

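# Fuse behavior + multimodal (common in production).
# Note: the blocks are concatenated before a single normalization, so whichever
# block has the larger norm dominates the cosine score.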
item_emb_fused = l2_normalize(np.concatenate([item_emb_i2v, item_emb_mm], axis=1))
r20 = recall_at_k(item_emb_fused[prev_items], item_emb_fused, next_items, k=20)
print("Fused (behavior + multimodal) Recall@20:", round(r20, 4))


Multimodal Recall@20: 0.3164
Fused (behavior + multimodal) Recall@20: 0.3194
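
The fused score is almost identical to the multimodal one, which is what one would expect if one block dominates the concatenated vector. When blocks are concatenated before a single normalization, the block with the larger norm carries most of the cosine similarity. A minimal sketch of per-block normalization with explicit weights follows; `fuse_blocks`, the weights, and the stand-in arrays are assumptions for illustration.

```python
import numpy as np

def l2_normalize(x, axis=1, eps=1e-12):
    n = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(n, eps)

def fuse_blocks(blocks, weights):
    """Normalize each block per row, scale by sqrt(weight), concatenate, renormalize.

    With weights summing to 1, the fused cosine similarity is the weighted
    average of the per-block cosine similarities.
    """
    scaled = [np.sqrt(w) * l2_normalize(b) for b, w in zip(blocks, weights)]
    return l2_normalize(np.concatenate(scaled, axis=1))

rng = np.random.default_rng(0)
behavior = 0.05 * rng.normal(size=(5, 8))   # small-norm block (stand-in for item_emb_i2v)
content = 10.0 * rng.normal(size=(5, 8))    # large-norm block (stand-in for item_emb_mm)

naive = l2_normalize(np.concatenate([behavior, content], axis=1))
balanced = fuse_blocks([behavior, content], weights=[0.5, 0.5])

# Share of each fused vector's squared norm that comes from the behavior block.
naive_share = (naive[:, :8] ** 2).sum(axis=1)
balanced_share = (balanced[:, :8] ** 2).sum(axis=1)
print(naive_share.round(4), balanced_share.round(4))
```

The weights are a tuning knob; equal weights are only a starting point.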


## 6) Production notes
**Embedding choice**
- Need fast + stable: PPMI+SVD
- Strong sequences: Item2Vec
- Graph structure matters: walk-based graph embeddings
- Cold start: multimodal content embeddings, then fuse with behavior

**Common failure modes**
- Popularity collapse: everything maps to head items
- Leakage in session construction: future events sneaking in
- Staleness: catalog shifts faster than refresh cadence
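
One rough way to watch for popularity collapse is to measure how much of the catalog ever shows up in top-K results. The sketch below uses item-to-item retrieval with exact search; `topk_coverage` and the random embeddings are hypothetical, and a low value is a signal to investigate rather than proof of collapse.

```python
import numpy as np

def topk_coverage(item_emb, k=10):
    """Fraction of the catalog that appears in at least one item's top-K neighbor list."""
    norms = np.linalg.norm(item_emb, axis=1, keepdims=True)
    x = item_emb / np.maximum(norms, 1e-12)
    scores = x @ x.T
    np.fill_diagonal(scores, -np.inf)          # exclude the query item itself
    topk = np.argpartition(-scores, kth=k - 1, axis=1)[:, :k]
    return len(np.unique(topk)) / item_emb.shape[0]

rng = np.random.default_rng(0)
toy_emb = rng.normal(size=(50, 8))
coverage = topk_coverage(toy_emb, k=5)
print(round(coverage, 3))
```

Tracking this number per embedding version gives a simple trend line to pair with recall.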

**Serving**
- Store vectors in embedding store + query via ANN index
- Version vectors + ANN index with model + feature snapshot
- Monitor drift + retrieval coverage + define rollback triggers
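
The versioning point can be sketched as a check at query time: a user vector produced by one model version should only be searched against an index built from the same version. The classes below are hypothetical, and the exact search stands in for a real ANN index.

```python
from dataclasses import dataclass
import numpy as np

@dataclass(frozen=True)
class EmbeddingSnapshot:
    version: str
    item_vectors: np.ndarray      # (n_items, dim), assumed L2-normalized

class BruteForceIndex:
    """Exact top-K search; a stand-in for an ANN index in this sketch."""

    def __init__(self, snapshot):
        self.version = snapshot.version
        self._vectors = snapshot.item_vectors

    def search(self, query, k, expected_version):
        # Refuse to mix a query vector and an index from different versions.
        if expected_version != self.version:
            raise ValueError(
                f"version mismatch: query {expected_version!r} vs index {self.version!r}"
            )
        scores = self._vectors @ query
        top = np.argpartition(-scores, kth=k - 1)[:k]
        return top[np.argsort(-scores[top])]   # best first

snapshot = EmbeddingSnapshot(version="v1", item_vectors=np.eye(4))
index = BruteForceIndex(snapshot)
query = np.array([0.0, 0.0, 1.0, 0.0])
print(index.search(query, k=1, expected_version="v1"))  # [2]
```

A mismatch raising an error is one choice; falling back to the previous version pair is another, and it ties into the rollback triggers mentioned above.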