# RRF-Enhanced Paper Recommendation Feed with Decay Learning

## Architecture
- **4 User Embeddings**: Complete, Subject1, Subject2, Subject3
- **RRF Fusion**: Fuse rankings from semantic query, complete profile, subject-focused, and recency
- **MMR Diversification**: Apply Maximal Marginal Relevance for balanced recommendations
- **Decay Learning**: Time-weighted interaction updates to vectors
- **Feed API**: Paginated personalized feed with interaction logging

In [25]:
# Install dependencies
!pip install -q feedparser
!pip uninstall -y xformers fastai easyocr timm 2>/dev/null
!pip install --extra-index-url https://download.pytorch.org/whl/cpu torch==2.9.0+cpu torchvision==0.24.0+cpu torchaudio==2.9.0+cpu
!pip install transformers==4.53.0 --upgrade
!pip install -q sentence-transformers qdrant-client numpy

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cpu


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [63]:
import os
import math
import json
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional, Dict, Tuple
from enum import Enum
import numpy as np

# Qdrant credentials
QDRANT_URL = "https://553c68d4-edc5-4369-aef3-83ac014d1682.eu-central-1-0.aws.cloud.qdrant.io"
QDRANT_API_KEY = "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJhY2Nlc3MiOiJtIn0.yB3LJ-xh91hA4h-LKuX4XPpYt8Dp7BXDOyeykt6G2Zc"
COLLECTION_NAME = "arxiv_stella_1024_recommendations"

# Model config
MODEL_NAME = "NovaSearch/stella_en_400M_v5"
EMBEDDING_DIM = 1024

# Learning parameters from userfeed.py
ALPHA = 0.10  # Learning rate
DECAY_RATE = 0.1  # Exponential decay
MIN_WEIGHT_THRESHOLD = 0.05
MAX_INTERACTION_AGE_DAYS = 365

# Interaction weights
BASE_WEIGHTS = {"LIKE": 1.0, "BOOKMARK": 0.8, "VIEW": 0.3, "DISLIKE": -0.5}

# RRF parameters
RRF_K = 60  # Standard RRF constant
RRF_RETRIEVER_LIMIT = 100  # Results per retriever

print("‚úì Configuration loaded")

‚úì Configuration loaded


In [64]:
# Enum definitions
class VectorType(Enum):
    COMPLETE = "complete"
    SUBJECT1 = "subject1"
    SUBJECT2 = "subject2"
    SUBJECT3 = "subject3"

class InteractionType(Enum):
    LIKE = "like"
    DISLIKE = "dislike"
    VIEW = "view"
    BOOKMARK = "bookmark"

@dataclass
class Interaction:
    """User interaction with temporal decay"""
    arxiv_id: str
    interaction_type: InteractionType
    timestamp: datetime
    subject_area: Optional[str] = None
    base_weight: float = field(init=False)
    
    def __post_init__(self):
        self.base_weight = BASE_WEIGHTS.get(self.interaction_type.value.upper(), 0.3)
    
    def age_days(self) -> float:
        return (datetime.now() - self.timestamp).total_seconds() / 86400
    
    def get_decayed_weight(self) -> float:
        age = self.age_days()
        if age > MAX_INTERACTION_AGE_DAYS:
            return 0.0
        decayed = self.base_weight * math.exp(-DECAY_RATE * age)
        return 0.0 if abs(decayed) < MIN_WEIGHT_THRESHOLD else decayed

@dataclass
class MMRResult:
    """Result from MMR ranking"""
    arxiv_id: str
    relevance_score: float
    diversity_score: float
    mmr_score: float
    rank: int
    payload: Dict = field(default_factory=dict)

@dataclass
class RRFResult:
    """Result from RRF fusion"""
    arxiv_id: str
    rrf_score: float
    vector: Optional[np.ndarray] = None
    payload: Dict = field(default_factory=dict)
    retriever_ranks: Dict[str, int] = field(default_factory=dict)  # Track rank from each retriever

@dataclass
class FeedItem:
    """Item in personalized feed"""
    arxiv_id: str
    title: str
    authors: List[str]
    abstract: str
    rrf_score: float
    mmr_score: float
    relevance_score: float
    diversity_score: float
    rank: int
    pdf_url: str
    abs_url: str
    published: str
    categories: List[str] = field(default_factory=list)

print("‚úì Data structures defined")

‚úì Data structures defined


In [65]:
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client import models

client = QdrantClient(url=QDRANT_URL, api_key=QDRANT_API_KEY, timeout=60)
model = SentenceTransformer(
    MODEL_NAME,
    trust_remote_code=True,
    device="cpu",
    config_kwargs={"use_memory_efficient_attention": False, "unpad_inputs": False}
)

print("‚úì Clients initialized")

Some weights of the model checkpoint at NovaSearch/stella_en_400M_v5 were not used when initializing NewModel: ['pooler.dense.bias', 'pooler.dense.weight']
- This IS expected if you are initializing NewModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing NewModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


‚úì Clients initialized


In [66]:
# Core utility functions

def l2_norm(x: np.ndarray) -> np.ndarray:
    n = np.linalg.norm(x)
    return x if n == 0 else x / n

def encode_text(text: str, is_query: bool = True) -> np.ndarray:
    """Encode text using Stella model"""
    v = model.encode(
        text,
        prompt_name="s2p_query" if is_query else "s2p_passage",
        convert_to_numpy=True
    ).astype(np.float32)
    return l2_norm(v)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Compute cosine similarity"""
    return float(np.dot(a, b))

def retrieve_vector_by_arxiv_id(arxiv_id: str) -> Optional[np.ndarray]:
    """Retrieve paper vector from Qdrant"""
    flt = models.Filter(
        must=[
            models.FieldCondition(
                key="arxiv_id",
                match=models.MatchValue(value=str(arxiv_id))
            )
        ]
    )
    resp = client.query_points(
        collection_name=COLLECTION_NAME,
        query_filter=flt,
        with_payload=True,
        with_vectors=True,
        limit=1
    )
    pts = getattr(resp, "points", []) or []
    if not pts:
        return None
    return l2_norm(np.array(pts[0].vector, dtype=np.float32))

print("‚úì Helper functions defined")

‚úì Helper functions defined


In [67]:
# arXiv API integration
import time
from functools import lru_cache
import urllib.parse
import feedparser

ARXIV_API = "http://export.arxiv.org/api/query"

def normalize_arxiv_id(aid: str) -> str:
    aid = (aid or "").strip()
    if aid.lower().startswith("arxiv:"):
        aid = aid[6:]
    return aid

def parse_entry(entry):
    """Parse arXiv API entry"""
    fullid = entry.id.rsplit("/", 1)[-1]
    baseid = fullid.split("v")[0]
    
    title = entry.title.strip()
    summary = getattr(entry, "summary", "").strip()
    published = getattr(entry, "published", None)
    updated = getattr(entry, "updated", None)
    authors = [a.name for a in getattr(entry, "authors", [])]
    
    absurl, pdfurl, doiurl = None, None, None
    for link in getattr(entry, "links", []):
        rel = getattr(link, "rel", "")
        titleattr = getattr(link, "title", "")
        href = getattr(link, "href", "")
        if rel == "alternate" and "abs" in href:
            absurl = href
        if titleattr == "pdf" or (rel == "related" and getattr(link, "type", "") == "application/pdf"):
            pdfurl = href
        if titleattr == "doi":
            doiurl = href
    
    categories = [t.get("term") for t in getattr(entry, "tags", []) if isinstance(t, dict) and t.get("term")]
    primarycat = entry.arxiv_primary_category.get("term") if hasattr(entry, "arxiv_primary_category") else None
    comment = getattr(entry, "arxiv_comment", None)
    journalref = getattr(entry, "arxiv_journal_ref", None)
    doi = getattr(entry, "arxiv_doi", None) or doiurl
    
    return {
        "arxiv_id_full": fullid,
        "arxiv_id": baseid,
        "title": title,
        "summary": summary,
        "authors": authors,
        "published": published,
        "updated": updated,
        "categories": categories,
        "primary_category": primarycat,
        "comment": comment,
        "journal_ref": journalref,
        "doi": doi,
        "absurl": absurl or f"https://arxiv.org/abs/{fullid}",
        "pdfurl": pdfurl or f"https://arxiv.org/pdf/{fullid}.pdf",
    }

def fetch_batch(ids_chunk):
    params = {"id_list": ",".join(ids_chunk)}
    url = f"{ARXIV_API}?{urllib.parse.urlencode(params)}"
    feed = feedparser.parse(url)
    results = {}
    for e in getattr(feed, "entries", []):
        if getattr(e, "title", "").strip().lower() == "error":
            continue
        r = parse_entry(e)
        results[r["arxiv_id"]] = r
    return results

def fetch_arxiv_by_ids(ids, chunksize=100, sleep_between=0.0):
    normed = [normalize_arxiv_id(x) for x in ids if x]
    out = {}
    for i in range(0, len(normed), chunksize):
        chunk = normed[i:i+chunksize]
        out.update(fetch_batch(chunk))
        if sleep_between > 0:
            time.sleep(sleep_between)
    return out

@lru_cache(maxsize=8192)
def fetch_arxiv_one(arxiv_id: str):
    res = fetch_arxiv_by_ids([arxiv_id])
    return res.get(normalize_arxiv_id(arxiv_id))

print("‚úì arXiv API client ready")

‚úì arXiv API client ready


In [68]:


class UserProfile:
    """User profile with 4 embeddings, decay learning, and RRF support"""
    
    def __init__(self, userid: str):
        self.userid = userid
        self.embedding_dim = EMBEDDING_DIM
        
        # Initialize 4 user vectors
        self.vectors = {
            VectorType.COMPLETE: np.zeros(EMBEDDING_DIM, dtype=np.float32),
            VectorType.SUBJECT1: np.zeros(EMBEDDING_DIM, dtype=np.float32),
            VectorType.SUBJECT2: np.zeros(EMBEDDING_DIM, dtype=np.float32),
            VectorType.SUBJECT3: np.zeros(EMBEDDING_DIM, dtype=np.float32),
        }
        
        # Subject metadata
        self.subjects = {
            VectorType.SUBJECT1: {"name": "Large Language Models", "keywords": "llm, gpt, transformer"},
            VectorType.SUBJECT2: {"name": "Reinforcement Learning", "keywords": "rl, policy, reward"},
            VectorType.SUBJECT3: {"name": "Computer Vision", "keywords": "image, vision, cnn"},
        }
        
        # Interaction history
        self.interactions: List[Interaction] = []
        
        # MMR parameters
        self.mmr_lambda = 0.7  # 70% relevance, 30% diversity
        
        print(f"‚úì Created user profile {userid}")
    
    def onboard_from_topics(self, topic_weights: Dict[str, float], k_per_topic: int = 50):
        """Initialize user vectors from topic interests"""
        print(f"Onboarding user with topics {topic_weights}")
        
        total = sum(max(0, w) for w in topic_weights.values())
        if total == 0:
            raise ValueError("topic_weights must have positive mass")
        norm_weights = {t: max(0, w) / total for t, w in topic_weights.items()}
        
        centroids = {}
        for topic, weight in norm_weights.items():
            if weight == 0:
                continue
            print(f"  Searching {topic}...")
            q = encode_text(topic)
            
            results = client.query_points(
                collection_name=COLLECTION_NAME,
                query=q.tolist(),
                limit=k_per_topic,
                with_vectors=True
            )
            
            vecs = [l2_norm(np.array(r.vector, dtype=np.float32)) for r in results.points if r.vector]
            if vecs:
                centroids[topic] = l2_norm(np.mean(np.stack(vecs), axis=0))
        
        # Weighted combination for complete vector
        complete_vec = np.zeros(EMBEDDING_DIM, dtype=np.float32)
        for topic, weight in norm_weights.items():
            if topic in centroids:
                complete_vec += weight * centroids[topic]
        self.vectors[VectorType.COMPLETE] = l2_norm(complete_vec)
        
        # Assign centroids to subject slots
        topic_list = list(centroids.keys())
        for i, vtype in enumerate([VectorType.SUBJECT1, VectorType.SUBJECT2, VectorType.SUBJECT3]):
            if i < len(topic_list):
                self.vectors[vtype] = centroids[topic_list[i]]
                self.subjects[vtype]["name"] = topic_list[i]
        
        print("‚úì Onboarding complete!")

    def add_interaction(self, arxiv_id: str, interaction_type: str, timestamp: datetime = None):
        """Record interaction and update vectors via decay learning"""
        if timestamp is None:
            timestamp = datetime.now()
        
        interaction = Interaction(
            arxiv_id=arxiv_id,
            interaction_type=InteractionType[interaction_type.upper()],
            timestamp=timestamp
        )
        self.interactions.append(interaction)
        
        # Get paper vector
        paper_vec = retrieve_vector_by_arxiv_id(arxiv_id)
        if paper_vec is None:
            print(f"Paper {arxiv_id} not found, skipping update")
            return
        
        # Apply decay learning to complete vector
        weight = interaction.get_decayed_weight()
        if weight != 0:
            self.vectors[VectorType.COMPLETE] = l2_norm(
                self.vectors[VectorType.COMPLETE] + ALPHA * weight * paper_vec
            )

        
        subject_types = [VectorType.SUBJECT1, VectorType.SUBJECT2, VectorType.SUBJECT3]
        
        # 1. Calculate positive similarities
        sims = {vtype: cosine_similarity(paper_vec, self.vectors[vtype]) for vtype in subject_types}
        pos_sims = {vtype: sim for vtype, sim in sims.items() if sim > 0}

        if pos_sims:
            # 2. Normalize positive similarities (so they sum to 1)
            total_sim = sum(pos_sims.values())
            norm_sims = {vtype: (sim / total_sim) for vtype, sim in pos_sims.items()}
            
            # 3. Apply weighted updates to all relevant subjects
            for vtype, norm_weight in norm_sims.items():
                if weight != 0:
                    # Update is gated by base weight (LIKE/DISLIKE) and proportional weight
                    proportional_update = ALPHA * weight * norm_weight * paper_vec
                    self.vectors[vtype] = l2_norm(
                        self.vectors[vtype] + proportional_update
                    )
            
            print(f"‚úì Proportional update for {interaction_type} on {arxiv_id} across {len(pos_sims)} subjects.")
        
        else:
            print(f"‚úì No subject vectors were relevant for {interaction_type} on {arxiv_id}.")


    def get_personalized_query(self, original_query: str, 
                             complete_weight: float = 0.5,
                             subject_weight: float = 0.3) -> np.ndarray:
        """Blend user preferences with query"""
        query_vec = l2_norm(encode_text(original_query))
        
        # Find most relevant subject
        subject_vec = None
        best_sim = -1
        for vtype in [VectorType.SUBJECT1, VectorType.SUBJECT2, VectorType.SUBJECT3]:
            sim = cosine_similarity(query_vec, self.vectors[vtype])
            if sim > best_sim:
                best_sim = sim
                subject_vec = self.vectors[vtype]
        
        if subject_vec is None:
            subject_vec = np.zeros(EMBEDDING_DIM, dtype=np.float32)
        
        # Weighted blend
        base_weight = 1.0 - complete_weight - subject_weight
        personalized = (
            base_weight * query_vec
            + complete_weight * self.vectors[VectorType.COMPLETE]
            + subject_weight * subject_vec
        )
        return l2_norm(personalized)
    
    def apply_mmr_ranking(self, search_results: List[Dict], query_vec: np.ndarray,
                        lambda_param: float = None, max_results: int = 20) -> List[MMRResult]:
        """Apply MMR for diversity"""
        if lambda_param is None:
            lambda_param = self.mmr_lambda
        
        if not search_results:
            return []
        
        result_vecs = []
        result_meta = []
        for r in search_results:
            if "vector" in r:
                result_vecs.append(r["vector"])
                result_meta.append(r)
        
        if not result_vecs:
            return []
        
        result_vecs = np.array(result_vecs)
        
        # MMR algorithm
        selected = []
        remaining = list(range(len(result_vecs)))
        
        for rank in range(min(max_results, len(result_vecs))):
            best_score = -float('inf')
            best_idx = None
            
            for idx in remaining:
                relevance = np.dot(result_vecs[idx], query_vec)
                
                if selected:
                    selected_vecs = result_vecs[[r.rank for r in selected]]
                    max_sim = np.max(np.dot(selected_vecs, result_vecs[idx]))
                    diversity = 1 - max_sim
                else:
                    diversity = 0
                
                mmr_score = lambda_param * relevance + (1 - lambda_param) * diversity
                if mmr_score > best_score:
                    best_score = mmr_score
                    best_idx = idx
            
            if best_idx is not None:
                relevance = float(np.dot(result_vecs[best_idx], query_vec))
                diversity = 1 - max_sim if selected else 0
                
                selected.append(MMRResult(
                    arxiv_id=result_meta[best_idx].get("arxiv_id", f"paper_{best_idx}"),
                    relevance_score=relevance,
                    diversity_score=diversity,
                    mmr_score=float(best_score),
                    rank=best_idx,
                    payload=result_meta[best_idx]
                ))
                remaining.remove(best_idx)
        
        return selected

print("‚úì UserProfile class defined")

‚úì UserProfile class defined


In [69]:
# RRF (Reciprocal Rank Fusion) Implementation

def rrf_fuse(ranked_lists: List[List[Dict]], k: int = RRF_K) -> List[RRFResult]:
    """
    Fuse multiple ranked lists using Reciprocal Rank Fusion.
    
    RRF Score = Œ£ (1 / (k + rank_i))
    
    Args:
        ranked_lists: List of ranked paper lists from different retrievers
        k: RRF constant (default 60)
    
    Returns:
        Sorted list of fused results by RRF score
    """
    rrf = {}
    pos_maps = []
    
    # --- MODIFICATION ---
    # Dynamically get retriever names from the input
    retriever_names = [r["name"] for r in ranked_lists]
    
    # Build position maps for each retriever
    for list_idx, retriever in enumerate(ranked_lists):
        lst = retriever["list"]
        pos = {item["arxiv_id"]: (i + 1) for i, item in enumerate(lst)}
        pos_maps.append(pos)
    
    # Get union of all arxiv_ids
    keys = set()
    for pos in pos_maps:
        keys.update(pos.keys())
    
    # Compute RRF scores
    for aid in keys:
        score = 0.0
        retriever_ranks = {}
        for list_idx, pos in enumerate(pos_maps):
            if aid in pos:
                rank = pos[aid]
                score += 1.0 / (k + rank)
                retriever_ranks[retriever_names[list_idx]] = rank
        rrf[aid] = (score, retriever_ranks)
    
    # Build fused list with metadata cache
    cache = {}
    for retriever in ranked_lists:
        for item in retriever["list"]:
            cache.setdefault(item["arxiv_id"], item)
    
    fused = []
    for aid, (score, retriever_ranks) in sorted(rrf.items(), key=lambda x: x[1][0], reverse=True):
        item = dict(cache.get(aid, {"arxiv_id": aid}))
        fused.append(RRFResult(
            arxiv_id=aid,
            rrf_score=score,
            vector=item.get("vector"),
            payload=item.get("payload", {}),
            retriever_ranks=retriever_ranks
        ))
    
    return fused
    
#
# üõë REPLACE ALL OF CELL 10 WITH THIS CODE üõë
#

def build_retrievers(user: UserProfile, query_vec: np.ndarray, has_query_text: bool) -> Dict[str, List[Dict]]:
    """
    Build ranked lists from 5-6 complementary retrievers for RRF.

    Retrievers:
    1. Semantic Query: (OPTIONAL) Personalized query + user preferences
    2. Complete Profile: (ALWAYS) Pure user profile vector
    3. Subject 1: (ALWAYS) Vector for subject 1
    4. Subject 2: (ALWAYS) Vector for subject 2
    5. Subject 3: (ALWAYS) Vector for subject 3
    6. Recency: (ALWAYS) Recently published papers from all retrievers
    """
    retrievers = {}

    # 1. Semantic Query Retriever (CONDITIONAL)
    list_sem_query = []
    if has_query_text:
        print("Building semantic query retriever...")
        raw_results = client.query_points(
            collection_name=COLLECTION_NAME,
            query=query_vec.tolist(),
            limit=RRF_RETRIEVER_LIMIT,
            with_vectors=True,
            with_payload=True
        )
        for i, r in enumerate(raw_results.points):
            list_sem_query.append({
                "arxiv_id": r.payload.get("arxiv_id", f"unknown_{i}"),
                "vector": l2_norm(np.array(r.vector, dtype=np.float32)),
                "payload": r.payload,
                "rank": i + 1,
                "raw_score": r.score
            })
    retrievers["semantic_query"] = list_sem_query
    print(f"  ‚úì Found {len(list_sem_query)} results (Semantic Query)")

    # 2. Complete Profile Retriever (ALWAYS RUNS)
    print("Building complete profile retriever...")
    raw_results_complete = client.query_points(
        collection_name=COLLECTION_NAME,
        query=user.vectors[VectorType.COMPLETE].tolist(),
        limit=RRF_RETRIEVER_LIMIT,
        with_vectors=True,
        with_payload=True
    )
    list_complete = []
    for i, r in enumerate(raw_results_complete.points):
        list_complete.append({
            "arxiv_id": r.payload.get("arxiv_id", f"unknown_{i}"),
            "vector": l2_norm(np.array(r.vector, dtype=np.float32)),
            "payload": r.payload,
            "rank": i + 1,
            "raw_score": r.score
        })
    retrievers["complete_profile"] = list_complete
    print(f"  ‚úì Found {len(list_complete)} results (Complete Profile)")

    # ------------------------------------------------------------------
    # ‚¨áÔ∏è START OF MAJOR CODE CHANGE (Replacing "Subject-Focused")
    # ------------------------------------------------------------------

    # 3. Subject 1 Retriever (ALWAYS RUNS)
    print("Building Subject 1 retriever...")
    raw_results_s1 = client.query_points(
        collection_name=COLLECTION_NAME,
        query=user.vectors[VectorType.SUBJECT1].tolist(),
        limit=RRF_RETRIEVER_LIMIT,
        with_vectors=True,
        with_payload=True
    )
    list_subject1 = []
    for i, r in enumerate(raw_results_s1.points):
        list_subject1.append({
            "arxiv_id": r.payload.get("arxiv_id", f"unknown_{i}"),
            "vector": l2_norm(np.array(r.vector, dtype=np.float32)),
            "payload": r.payload,
            "rank": i + 1,
            "raw_score": r.score
        })
    retrievers["subject1"] = list_subject1
    print(f"  ‚úì Found {len(list_subject1)} results ({user.subjects[VectorType.SUBJECT1]['name']})")


    # 4. Subject 2 Retriever (ALWAYS RUNS)
    print("Building Subject 2 retriever...")
    raw_results_s2 = client.query_points(
        collection_name=COLLECTION_NAME,
        query=user.vectors[VectorType.SUBJECT2].tolist(),
        limit=RRF_RETRIEVER_LIMIT,
        with_vectors=True,
        with_payload=True
    )
    list_subject2 = []
    for i, r in enumerate(raw_results_s2.points):
        list_subject2.append({
            "arxiv_id": r.payload.get("arxiv_id", f"unknown_{i}"),
            "vector": l2_norm(np.array(r.vector, dtype=np.float32)),
            "payload": r.payload,
            "rank": i + 1,
            "raw_score": r.score
        })
    retrievers["subject2"] = list_subject2
    print(f"  ‚úì Found {len(list_subject2)} results ({user.subjects[VectorType.SUBJECT2]['name']})")


    # 5. Subject 3 Retriever (ALWAYS RUNS)
    print("Building Subject 3 retriever...")
    raw_results_s3 = client.query_points(
        collection_name=COLLECTION_NAME,
        query=user.vectors[VectorType.SUBJECT3].tolist(),
        limit=RRF_RETRIEVER_LIMIT,
        with_vectors=True,
        with_payload=True
    )
    list_subject3 = []
    for i, r in enumerate(raw_results_s3.points):
        list_subject3.append({
            "arxiv_id": r.payload.get("arxiv_id", f"unknown_{i}"),
            "vector": l2_norm(np.array(r.vector, dtype=np.float32)),
            "payload": r.payload,
            "rank": i + 1,
            "raw_score": r.score
        })
    retrievers["subject3"] = list_subject3
    print(f"  ‚úì Found {len(list_subject3)} results ({user.subjects[VectorType.SUBJECT3]['name']})")



    print("Building Discovery retriever (serendipity)...")
    
    # 1. Create a "perturbed" vector by adding small random noise
    # This "nudges" the search into adjacent vector space
    noise = np.random.normal(0, 0.05, EMBEDDING_DIM).astype(np.float32)
    discovery_vec = l2_norm(user.vectors[VectorType.COMPLETE] + noise)

    # 2. Query with this new discovery vector
    raw_results_disc = client.query_points(
        collection_name=COLLECTION_NAME,
        query=discovery_vec.tolist(),
        limit=RRF_RETRIEVER_LIMIT,
        with_vectors=True,
        with_payload=True
    )
    list_discovery = []
    for i, r in enumerate(raw_results_disc.points):
        list_discovery.append({
            "arxiv_id": r.payload.get("arxiv_id", f"unknown_{i}"),
            "vector": l2_norm(np.array(r.vector, dtype=np.float32)),
            "payload": r.payload,
            "rank": i + 1,
            "raw_score": r.score
        })
    retrievers["discovery"] = list_discovery
    print(f"  ‚úì Found {len(list_discovery)} results (Discovery)")

    
    # ------------------------------------------------------------------
    # ‚¨ÜÔ∏è END OF MAJOR CODE CHANGE
    # ------------------------------------------------------------------

    # 6. Recency Retriever (ALWAYS RUNS)
    print("Building recency retriever...")

    # --- THIS IS THE KEY CHANGE for RECENCY---
    # The union must now include all 5 lists (semantic, complete, s1, s2, s3)
    union_ids = list(set(
        [item["arxiv_id"] for item in list_sem_query + list_complete + list_subject1 + list_subject2 + list_subject3 + list_discovery
         if "unknown" not in item["arxiv_id"]]
    ))

    
    if union_ids:
        meta = fetch_arxiv_by_ids(union_ids, sleep_between=0.1)
        recency_sorted = sorted(
            meta.keys(),
            key=lambda aid: (meta[aid].get("updated") or meta[aid].get("published") or ""),
            reverse=True
        )
        list_recency = []
        for i, aid in enumerate(recency_sorted):
            # Get vector if available from cache
            vec = None
            # (Update this loop to check all lists)
            for item in list_sem_query + list_complete + list_subject1 + list_subject2 + list_subject3 + list_discovery:
                if item["arxiv_id"] == aid:
                    vec = item["vector"]
                    break
            list_recency.append({
                "arxiv_id": aid,
                "vector": vec,
                "payload": {"arxiv_meta": meta[aid]},
                "rank": i + 1,
                "raw_score": 1.0 / (i + 1)
            })
        retrievers["recency"] = list_recency
        print(f"  ‚úì Found {len(list_recency)} unique papers (Recency)")
    else:
        retrievers["recency"] = []
        print("  ‚úì No papers for recency ranking")

    return retrievers

print("‚úì RRF and retriever functions defined")

‚úì RRF and retriever functions defined


In [74]:
#
# üõë REPLACE ALL OF CELL 70 WITH THIS CODE üõë
#

class FeedService:
    """Personalized feed service with RRF + MMR + decay learning"""

    def __init__(self, user: UserProfile):
        self.user = user
        self.feed_impressions = {}  # Track impressions for analytics

    
    def _get_cold_start_feed(self, page: int = 1, page_size: int = 10) -> List[FeedItem]:
        """
        Generates a non-personalized feed for new users.
        Bypasses RRF and serves popular/recent generic items.
        """
        print("  - Executing Cold Start: Serving generic 'LLM' feed.")
        try:
            # Query for a generic, high-quality term
            query_vec = encode_text("Large Language Models and transformers", is_query=True)
            
            raw_results = client.query_points(
                collection_name=COLLECTION_NAME,
                query=query_vec.tolist(),
                limit=page_size * page, # Get enough for pagination
                with_vectors=True,
                with_payload=True
            )
            
            # Paginate
            start = (page - 1) * page_size
            end = start + page_size
            page_results = raw_results.points[start:end]
            
            # Enrich and return (simplified enrichment)
            arxiv_ids = [r.payload.get("arxiv_id") for r in page_results if r.payload]
            meta = fetch_arxiv_by_ids(arxiv_ids, sleep_between=0.05) if arxiv_ids else {}
            
            feed_items = []
            for rank, result in enumerate(page_results, start=1):
                aid = result.payload.get("arxiv_id", "unknown")
                m = meta.get(aid, {})
                item = FeedItem(
                    arxiv_id=aid,
                    title=m.get("title", result.payload.get("title", "Unknown Title")),
                    authors=m.get("authors", [])[:5],
                    abstract=m.get("summary", "")[:200],
                    rrf_score=0.0,
                    mmr_score=result.score, # Use raw score as MMR score
                    relevance_score=result.score,
                    diversity_score=0.0,
                    rank=rank,
                    pdf_url=m.get("pdfurl", ""),
                    abs_url=m.get("absurl", ""),
                    published=m.get("published", ""),
                    categories=m.get("categories", [])
                )
                feed_items.append(item)
            
            print(f"‚úÖ Cold Start feed generated with {len(feed_items)} items\n")
            return feed_items

        except Exception as e:
            print(f"  - ‚ö†Ô∏è Cold Start Error: {e}")
            return [] # Return empty list on failure
    

    
    def get_feed(self, query: Optional[str] = None, page: int = 1, page_size: int = 20,
                 apply_mmr: bool = True, mmr_lambda: float = 0.7) -> List[FeedItem]:
        """
        Generate personalized feed using RRF + MMR pipeline.
        
        Pipeline:
        1. Determine Mode: "Search" (if query) or "Feed" (if no query)
        2. Set Base Vector: Blended query OR pure profile vector
        3. Build 5-6 retrievers (semantic, complete, s1, s2, s3, recency)
        4. Fuse rankings with RRF
        5. Apply MMR for diversity
        6. Filter seen items
        7. Paginate and enrich with metadata
        """
        
        print(f"\nüìù Generating feed for page {page}...")

        # --- NEW "A+" COLD START CHECK ---
        has_query_text = (query is not None) and (query.strip() != "")
        is_cold_start = np.linalg.norm(self.user.vectors[VectorType.COMPLETE]) < 0.01

        if is_cold_start and not has_query_text:
            print("  - Mode: Cold Start (No profile, no query)")
            # Call the new helper function and exit early
            return self._get_cold_start_feed(page, page_size)
            
        # Step 1: Determine Mode and Set Base Query Vector
        # has_query_text = (query is not None) and (query.strip() != "") # <--- This was redundant, I removed it

        if has_query_text:
            print("  - Mode: Personalized Search (Query + Profile)")
            query_vec = self.user.get_personalized_query(query, 0.5, 0.3)
        else:
            print("  - Mode: Recommendation Feed (Profile Only)")
            query_vec = self.user.vectors[VectorType.COMPLETE]

        # Step 2: Build retrievers
        print("\n1Ô∏è‚É£ Building retrievers for RRF fusion...")
        retrievers = build_retrievers(self.user, query_vec, has_query_text)

        # Step 3: RRF fusion
        print("\n2Ô∏è‚É£ Applying RRF fusion...")
        
        ranked_lists = []
        if retrievers["semantic_query"]:
            ranked_lists.append({"name": "semantic_query", "list": retrievers["semantic_query"]})
            print("  - Fusing: Semantic Query list")
        if retrievers["complete_profile"]:
            ranked_lists.append({"name": "complete_profile", "list": retrievers["complete_profile"]})
            print("  - Fusing: Complete Profile list")
        if retrievers["subject1"]:
            ranked_lists.append({"name": "subject1", "list": retrievers["subject1"]})
            print("  - Fusing: Subject 1 list")
        if retrievers["subject2"]:
            ranked_lists.append({"name": "subject2", "list": retrievers["subject2"]})
            print("  - Fusing: Subject 2 list")
        if retrievers["subject3"]:
            ranked_lists.append({"name": "subject3", "list": retrievers["subject3"]})
            print("  - Fusing: Subject 3 list")
        if retrievers["discovery"]:
            ranked_lists.append({"name": "discovery", "list": retrievers["discovery"]})
            print("  - Fusing: Discovery list")
        if retrievers["recency"]:
            ranked_lists.append({"name": "recency", "list": retrievers["recency"]})
            print("  - Fusing: Recency list")

        fused = rrf_fuse(ranked_lists, k=RRF_K)
        print(f"  ‚úì RRF fused {len(fused)} papers from {len(ranked_lists)} lists")

        # Step 4: Backfill vectors and apply MMR
        print("\n3Ô∏è‚É£ Preparing for MMR ranking...")
        search_results = []
        for item in fused[:min(200, len(fused))]: 
            if item.vector is None:
                item.vector = retrieve_vector_by_arxiv_id(item.arxiv_id)
            if item.vector is not None:
                search_results.append({
                    "arxiv_id": item.arxiv_id,
                    "vector": item.vector,
                    "payload": item.payload,
                    "rrf_score": item.rrf_score,
                    "retriever_ranks": item.retriever_ranks
                })

        # Step 5: Apply MMR if requested
        if apply_mmr:
            print(f"\n4Ô∏è‚É£ Applying MMR diversity filter (top {len(search_results)} by RRF)...")
            buffer_size = 50 # Fetch a larger buffer to account for filtering
            mmr_results = self.user.apply_mmr_ranking(
                search_results,
                query_vec,
                lambda_param=mmr_lambda,
                max_results=buffer_size 
            )
        else:
            mmr_results = [MMRResult(
                arxiv_id=sr["arxiv_id"],
                relevance_score=sr.get("rrf_score", 0.0),
                diversity_score=0.0,
                mmr_score=sr.get("rrf_score", 0.0),
                rank=idx,
                payload=sr["payload"]
            ) for idx, sr in enumerate(search_results[:50])] # Apply buffer size here too
        
        print(f"  ‚úì MMR selected {len(mmr_results)} papers for buffer.")

        # Step 6: Filter interacted items (from the larger buffer)
        interacted_ids_to_hide = set(
            interaction.arxiv_id for interaction in self.user.interactions
            if interaction.interaction_type in (InteractionType.LIKE, InteractionType.BOOKMARK, InteractionType.DISLIKE)
        )
        
        filtered_results = [r for r in mmr_results if r.arxiv_id not in interacted_ids_to_hide]
        print(f"  ‚úì Filtered out {len(mmr_results) - len(filtered_results)} already-interacted items from buffer.")

        # Step 7: Paginate (from the *filtered* list)
        start = (page - 1) * page_size
        end = start + page_size
        page_results = filtered_results[start:end] 

        # Step 8: Enrich with metadata
        print(f"\n5Ô∏è‚É£ Enriching {len(page_results)} items with metadata...")
        arxiv_ids = [r.arxiv_id for r in page_results]
        
        # --- THIS IS THE MISSING LINE THAT CAUSED THE NameError ---
        meta = fetch_arxiv_by_ids(arxiv_ids, sleep_between=0.05) if arxiv_ids else {}
        # ---------------------------------------------------------

        feed_items = []
        for rank, result in enumerate(page_results, start=1):
            m = meta.get(result.arxiv_id, {})
            item = FeedItem(
                arxiv_id=result.arxiv_id,
                title=m.get("title", "Unknown Title"),
                authors=m.get("authors", [])[:5],
                abstract=m.get("summary", "")[:200],
                rrf_score=result.payload.get("rrf_score", 0.0),
                mmr_score=result.mmr_score,
                relevance_score=result.relevance_score,
                diversity_score=result.diversity_score,
                rank=rank,
                pdf_url=m.get("pdfurl", ""),
                abs_url=m.get("absurl", ""),
                published=m.get("published", ""),
                categories=m.get("categories", [])
            )
            feed_items.append(item)

        # --- THIS BLOCK IS NOW REDUNDANT AND HAS BEEN REMOVED ---
        # interacted_ids_to_hide = set(...)
        # final_feed = [item for item in feed_items ...]
        # print(f"  ‚úì Filtered out ...")
        # --------------------------------------------------------
        
        print(f"‚úÖ Feed generated with {len(feed_items)} items\n")
        return feed_items # This list is now correctly filtered AND paginated

    def log_interaction(self, arxiv_id: str, interaction_type: str):
        """Log user interaction and update profile"""
        self.user.add_interaction(arxiv_id, interaction_type)
        if arxiv_id not in self.feed_impressions:
            self.feed_impressions[arxiv_id] = {}
        self.feed_impressions[arxiv_id][interaction_type] = datetime.now()
        print(f"‚úì Logged {interaction_type} on {arxiv_id}")

print("‚úì FeedService class defined")

‚úì FeedService class defined


## Example: Initialize User & Generate Initial Feed

In [79]:
# Create user and onboard from topics
print("üöÄ Initializing user profile...\n")
user = UserProfile("demo_user_001")

# Onboard with interests
topic_weights = {
    "Large Language Models ": 0.35,
    "Reinforcement Learning": 0.3,
    "Economics with Machine learning": 0.35
}

user.onboard_from_topics(topic_weights, k_per_topic=50)

# Display initialized vectors
print("\nüìä User Vector Status:")
for vtype, vec in user.vectors.items():
    norm = np.linalg.norm(vec)
    print(f"  {vtype.value:15s} | norm: {norm:.4f}")

üöÄ Initializing user profile...

‚úì Created user profile demo_user_001
Onboarding user with topics {'Large Language Models ': 0.35, 'Reinforcement Learning': 0.3, 'Economics with Machine learning': 0.35}
  Searching Large Language Models ...


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

  Searching Reinforcement Learning...


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

  Searching Economics with Machine learning...


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

‚úì Onboarding complete!

üìä User Vector Status:
  complete        | norm: 1.0000
  subject1        | norm: 1.0000
  subject2        | norm: 1.0000
  subject3        | norm: 1.0000


In [80]:
# Initialize feed service
print("\nüéØ Initializing Feed Service...\n")
feed_service = FeedService(user)

# Generate initial feed
initial_feed = feed_service.get_feed(
    query=None,
    page=1,
    page_size=10,
    apply_mmr=True,
    mmr_lambda=0.7
)


üéØ Initializing Feed Service...


üìù Generating feed for page 1...
  - Mode: Recommendation Feed (Profile Only)

1Ô∏è‚É£ Building retrievers for RRF fusion...
  ‚úì Found 0 results (Semantic Query)
Building complete profile retriever...
  ‚úì Found 100 results (Complete Profile)
Building Subject 1 retriever...
  ‚úì Found 100 results (Large Language Models )
Building Subject 2 retriever...
  ‚úì Found 100 results (Reinforcement Learning)
Building Subject 3 retriever...
  ‚úì Found 100 results (Economics with Machine learning)
Building Discovery retriever (serendipity)...
  ‚úì Found 100 results (Discovery)
Building recency retriever...
  ‚úì Found 0 unique papers (Recency)

2Ô∏è‚É£ Applying RRF fusion...
  - Fusing: Complete Profile list
  - Fusing: Subject 1 list
  - Fusing: Subject 2 list
  - Fusing: Subject 3 list
  - Fusing: Discovery list
  ‚úì RRF fused 450 papers from 5 lists

3Ô∏è‚É£ Preparing for MMR ranking...

4Ô∏è‚É£ Applying MMR diversity filter (top 200 by RRF)...
  

In [81]:
# Display initial feed with detailed metrics
print("\n" + "="*120)
print("üì∞ INITIAL PERSONALIZED FEED (Top 10)")
print("="*120)

for item in initial_feed:
    print(f"\n{item.rank}. [{item.arxiv_id}] {item.title[:80]}..." if len(item.title) > 80 else f"\n{item.rank}. [{item.arxiv_id}] {item.title}")
    print(f"   Authors: {', '.join(item.authors[:3])}{'...' if len(item.authors) > 3 else ''}")
    print(f"   Abstract: {item.abstract[:100]}..." if len(item.abstract) > 100 else f"   Abstract: {item.abstract}")
    print(f"   Published: {item.published}")
    print(f"   üìä Scores | RRF: {item.rrf_score:.4f} | Relevance: {item.relevance_score:.4f} | Diversity: {item.diversity_score:.4f} | MMR: {item.mmr_score:.4f}")
    print(f"   üîó {item.abs_url}")

print("\n" + "="*120)


üì∞ INITIAL PERSONALIZED FEED (Top 10)

1. [2412.07031] Large Language Models: An Applied Econometric Framework
   Authors: Jens Ludwig, Sendhil Mullainathan, Ashesh Rambachan
   Abstract: How can we use the novel capacities of large language models (LLMs) in
empirical research? And how c...
   Published: 2024-12-09T22:37:48Z
   üìä Scores | RRF: 0.0325 | Relevance: 0.8569 | Diversity: 0.0000 | MMR: 0.5998
   üîó http://arxiv.org/abs/2412.07031v2

2. [1810.06339] Deep Reinforcement Learning
   Authors: Yuxi Li
   Abstract: We discuss deep reinforcement learning in an overview style. We draw a big
picture, filled with deta...
   Published: 2018-10-15T13:20:56Z
   üìä Scores | RRF: 0.0306 | Relevance: 0.8363 | Diversity: 0.3304 | MMR: 0.7097
   üîó http://arxiv.org/abs/1810.06339v1

3. [2504.20997] Toward Efficient Exploration by Large Language Model Agents
   Authors: Dilip Arumugam, Thomas L. Griffiths
   Abstract: A burgeoning area within reinforcement learning (RL) is the desig

## Simulate User Interactions & Update Vectors

In [82]:
# Simulate user interactions with decay learning
print("\n" + "="*120)
print("üîÑ SIMULATING USER INTERACTIONS (with Decay Learning)")
print("="*120)

interactions = [
    (initial_feed[0].arxiv_id, "LIKE", datetime.now() - timedelta(days=1)),
    (initial_feed[1].arxiv_id, "BOOKMARK", datetime.now() - timedelta(days=2)),
    (initial_feed[3].arxiv_id, "VIEW", datetime.now() - timedelta(days=0)),
    (initial_feed[4].arxiv_id, "DISLIKE", datetime.now() - timedelta(days=3)),
    (initial_feed[7].arxiv_id, "LIKE", datetime.now() - timedelta(days=0.5)),
]

print("\nLogging interactions with decay-weighted updates...\n")
for arxiv_id, action, timestamp in interactions:
    feed_service.log_interaction(arxiv_id, action)
    time.sleep(0.5)  # Brief pause for readability

print("\nüìä Updated Vector Status (After Interactions):")
for vtype, vec in user.vectors.items():
    norm = np.linalg.norm(vec)
    print(f"  {vtype.value:15s} | norm: {norm:.4f}")


üîÑ SIMULATING USER INTERACTIONS (with Decay Learning)

Logging interactions with decay-weighted updates...

‚úì Proportional update for LIKE on 2412.07031 across 3 subjects.
‚úì Logged LIKE on 2412.07031
‚úì Proportional update for BOOKMARK on 1810.06339 across 3 subjects.
‚úì Logged BOOKMARK on 1810.06339
‚úì Proportional update for VIEW on 1706.06302 across 3 subjects.
‚úì Logged VIEW on 1706.06302
‚úì Proportional update for DISLIKE on 1712.00409 across 3 subjects.
‚úì Logged DISLIKE on 1712.00409
‚úì Proportional update for LIKE on 1903.10075 across 3 subjects.
‚úì Logged LIKE on 1903.10075

üìä Updated Vector Status (After Interactions):
  complete        | norm: 1.0000
  subject1        | norm: 1.0000
  subject2        | norm: 1.0000
  subject3        | norm: 1.0000


In [83]:
# Generate feed AFTER learning
print("\n" + "="*120)
print("üîÑ GENERATING PERSONALIZED FEED (AFTER LEARNING)")
print("="*120)

learned_feed = feed_service.get_feed(
    query=None,
    page=1,
    page_size=10,
    apply_mmr=True,
    mmr_lambda=0.7
)


üîÑ GENERATING PERSONALIZED FEED (AFTER LEARNING)

üìù Generating feed for page 1...
  - Mode: Recommendation Feed (Profile Only)

1Ô∏è‚É£ Building retrievers for RRF fusion...
  ‚úì Found 0 results (Semantic Query)
Building complete profile retriever...
  ‚úì Found 100 results (Complete Profile)
Building Subject 1 retriever...
  ‚úì Found 100 results (Large Language Models )
Building Subject 2 retriever...
  ‚úì Found 100 results (Reinforcement Learning)
Building Subject 3 retriever...
  ‚úì Found 100 results (Economics with Machine learning)
Building Discovery retriever (serendipity)...
  ‚úì Found 100 results (Discovery)
Building recency retriever...
  ‚úì Found 10 unique papers (Recency)

2Ô∏è‚É£ Applying RRF fusion...
  - Fusing: Complete Profile list
  - Fusing: Subject 1 list
  - Fusing: Subject 2 list
  - Fusing: Subject 3 list
  - Fusing: Discovery list
  - Fusing: Recency list
  ‚úì RRF fused 439 papers from 6 lists

3Ô∏è‚É£ Preparing for MMR ranking...

4Ô∏è‚É£ Applying M

In [84]:
# Display learned feed
print("\n" + "="*120)
print("üì∞ LEARNED PERSONALIZED FEED (Top 10)")
print("="*120)

for item in learned_feed:
    print(f"\n{item.rank}. [{item.arxiv_id}] {item.title[:80]}..." if len(item.title) > 80 else f"\n{item.rank}. [{item.arxiv_id}] {item.title}")
    print(f"   Authors: {', '.join(item.authors[:3])}{'...' if len(item.authors) > 3 else ''}")
    print(f"   üìä Scores | RRF: {item.rrf_score:.4f} | Relevance: {item.relevance_score:.4f} | Diversity: {item.diversity_score:.4f} | MMR: {item.mmr_score:.4f}")

print("\n" + "="*120)
print("‚úÖ Feed generation complete!")


üì∞ LEARNED PERSONALIZED FEED (Top 10)

1. [1706.06302] Deep Learning in (and of) Agent-Based Models: A Prospectus
   Authors: Sander van der Hoog
   üìä Scores | RRF: 0.0464 | Relevance: 0.8592 | Diversity: 0.1634 | MMR: 0.6690

2. [2504.20997] Toward Efficient Exploration by Large Language Model Agents
   Authors: Dilip Arumugam, Thomas L. Griffiths
   üìä Scores | RRF: 0.0149 | Relevance: 0.8355 | Diversity: 0.1634 | MMR: 0.6618

3. [2108.07783] Toward a `Standard Model' of Machine Learning
   Authors: Zhiting Hu, Eric P. Xing
   üìä Scores | RRF: 0.0297 | Relevance: 0.8299 | Diversity: 0.1634 | MMR: 0.6531

4. [2406.04344] Verbalized Machine Learning: Revisiting Machine Learning with Language
  Models
   Authors: Tim Z. Xiao, Robert Bamler, Bernhard Sch√∂lkopf...
   üìä Scores | RRF: 0.0254 | Relevance: 0.8338 | Diversity: 0.1634 | MMR: 0.6527

5. [2401.07345] Learning to be Homo Economicus: Can an LLM Learn Preferences from Choice
   Authors: Jeongbin Kim, Matthew Kovach, Ky

## Analysis: How RRF Enhanced Recommendations

In [85]:
# Compare initial vs learned feeds
print("\n" + "="*120)
print("üìä COMPARISON: Initial vs Learned Feeds")
print("="*120)

initial_ids = set(item.arxiv_id for item in initial_feed)
learned_ids = set(item.arxiv_id for item in learned_feed)

common = initial_ids & learned_ids
new_items = learned_ids - initial_ids
dropped = initial_ids - learned_ids

print(f"\nüìà Feed Statistics:")
print(f"  Initial feed: {len(initial_ids)} items")
print(f"  Learned feed: {len(learned_ids)} items")
print(f"  Items in common: {len(common)}")
print(f"  New items after learning: {len(new_items)}")
print(f"  Items dropped: {len(dropped)}")

print(f"\nüîÑ How RRF Fusion Improved Ranking:")
print(f"  RRF combines 4 independent ranking signals:")
print(f"    1. Semantic Query: Personalized text query + user profile")
print(f"    2. Complete Profile: Pure aggregated user interests")
print(f"    3. Subject-Focused: Most relevant subject area embedding")
print(f"    4. Recency: Recently published papers")
print(f"\n  Result: More robust, diverse, and well-ranked recommendations!")
print(f"\nüí° Decay Learning Updates:")
print(f"  - LIKE interactions: +1.0 weight (decays over time)")
print(f"  - BOOKMARK interactions: +0.8 weight")
print(f"  - VIEW interactions: +0.3 weight")
print(f"  - DISLIKE interactions: -0.5 weight")
print(f"\n‚úÖ All vectors updated with ALPHA={ALPHA} and DECAY_RATE={DECAY_RATE}")
print("\n" + "="*120)


üìä COMPARISON: Initial vs Learned Feeds

üìà Feed Statistics:
  Initial feed: 10 items
  Learned feed: 10 items
  Items in common: 5
  New items after learning: 5
  Items dropped: 5

üîÑ How RRF Fusion Improved Ranking:
  RRF combines 4 independent ranking signals:
    1. Semantic Query: Personalized text query + user profile
    2. Complete Profile: Pure aggregated user interests
    3. Subject-Focused: Most relevant subject area embedding
    4. Recency: Recently published papers

  Result: More robust, diverse, and well-ranked recommendations!

üí° Decay Learning Updates:
  - LIKE interactions: +1.0 weight (decays over time)
  - BOOKMARK interactions: +0.8 weight
  - VIEW interactions: +0.3 weight
  - DISLIKE interactions: -0.5 weight

‚úÖ All vectors updated with ALPHA=0.1 and DECAY_RATE=0.1

