**TL;DR:** A tutorial on creating a simple paper recommender system using embeddings from paper abstracts.

Recently, I shared a tool that helps people find relevant papers among 17,000+ ICLR 2026 submissions. To my surprise, the post attracted quite a bit of attention, much more than any of my posts about [my research](https://wenhangao21.github.io/tech_blogs/)😢.

- [ICLR2026 Paper Finder on Hugging Face](https://huggingface.co/spaces/wenhanacademia/ICLR2026_PaperFinder)
- [Open Source Repository](https://github.com/wenhangao21/ICLR26_Paper_Finder?tab=readme-ov-file)

A few people reached out asking how the app was built. It's actually very simple, so I thought I'd write a short blog/tutorial about it. **The algorithm took me less than 30 minutes to write (with help from GPT-5-Instant)**, though the user interface ended up taking an entire afternoon (also with help from GPT-5-Instant). **You can run this notebook for free on Google Colab if you don't have Jupyter Notebook installed.**

# Build a Simple Paper Recommender with Language Embedding Models

## Overall Pipeline

1. **Data Retrieval:** Collect all paper submissions from OpenReview (for other venues, you can look for an API or write a web scraper).
2. **Data Processing:** Clean and structure the retrieved data into a usable format for downstream tasks.
3. **Vector Database Construction:** Generate abstract embeddings with language embedding models and store them in a vector database to enable fast semantic similarity search (approximate k-NN).
4. **Inference:** Query the database to identify the top-K most relevant submissions based on semantic similarity.

Note:
- **Data Retrieval and Processing:** We are given a collection of text documents (a collection of abstracts in our case):
$$
\mathcal{T}=\left\{t_1, t_2, \ldots, t_N\right\}.
$$

- **Vector Database Construction:** An embedding model maps a given text $t$ (an abstract in our case) into a high-dimensional continuous vector:
  $$
  \operatorname{Embedding}_\theta: t_i \rightarrow e_i \in \mathbb{R}^d,
  $$
  where $\operatorname{Embedding}_\theta$ is the embedding model (you can think of the text embedding as the last feature before the LM head in a modern LLM), and $d$ is a fixed embedding dimension.

  Now we have a collection of embeddings $\mathcal{E}=\left\{e_1, e_2, \ldots, e_N\right\}$, and we can run approximate k-NN algorithms (e.g., [FAISS](https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/)) on it for semantic search: embed the query text with the same model, then return the stored embeddings closest to it.
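To make the search step concrete, here is a minimal sketch of exact (brute-force) top-K retrieval with cosine similarity in plain Python. It is only an illustration: the toy 3-dimensional vectors and the helper names `cosine_similarity` and `top_k` are made up for this example, and a real vector database replaces the linear scan with an approximate index.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for zero vectors in this sketch
    return dot / (norm_u * norm_v)

def top_k(query, embeddings, k=3):
    """Return (index, score) pairs for the k embeddings most similar to query."""
    scored = [(i, cosine_similarity(query, e)) for i, e in enumerate(embeddings)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy "embeddings" (hypothetical values); real ones have hundreds of dimensions.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [1.0, 0.05, 0.0]

neighbors = top_k(toy_query, toy_embeddings, k=2)
for index, score in neighbors:
    print(f"embedding {index}: similarity {score:.4f}")
```

The linear scan costs one similarity computation per stored abstract for every query, which is why approximate indexes become attractive as the collection grows.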


## Install and Import Dependencies

In [None]:
! pip install chromadb markdown google-generativeai sentence_transformers openreview-py

In [23]:
import openreview
from openreview import tools
import json
import re
import chromadb
import google.generativeai as genai
import chromadb.utils.embedding_functions as embedding_functions
import textwrap
from collections import Counter
from datetime import datetime
from tqdm import tqdm

## Data Retrieval

In [16]:
# Connect to the OpenReview API client
client = openreview.api.OpenReviewClient(
    baseurl="https://api2.openreview.net",
    username="<your openreview email>",  # enter your openreview email and password here
    password="<your openreview password>"
)
# Extract all submissions to ICLR 2026
notes = list(tools.iterget_notes(
    client,
    invitation="ICLR.cc/2026/Conference/-/Submission"
))

Getting Notes: 100%|█████████▉| 19714/19734 [00:11<00:00, 1789.27it/s]


In [17]:
# Check how many submissions there are
print(f"Total submissions: {len(notes)}")
# Count how often each attribute appears across all notes
key_counter = Counter(k for note in notes for k in note.content.keys())
print("\nAttribute occurrence counts:")
for key, count in key_counter.items():
    print(f"  {key}: {count}")

Total submissions: 19734

Attribute occurrence counts:
  title: 19734
  keywords: 19734
  abstract: 19734
  primary_area: 19734
  venue: 19734
  venueid: 19734
  pdf: 19655
  _bibtex: 19734
  supplementary_material: 7678
  TLDR: 10126
  authors: 141
  authorids: 141
  paperhash: 141
  code_of_ethics: 83
  submission_guidelines: 83
  anonymous_url: 83
  no_acknowledgement_section: 83
  reciprocal_reviewing_author: 3
  reciprocal_reviewing_exemption: 3
  resubmission: 3
  student_author: 3
  large_language_models: 3
  reciprocal_reviewing_exemption_reason: 3


## Data Processing

In [18]:
json_data = []
for note in notes:
    # Extract title value safely
    title_val = note.content.get("title", "")
    if isinstance(title_val, dict) and "value" in title_val:
        title_val = title_val["value"]
    # Skip notes where title starts with 'Null' followed by any non-space chars (one word)
    if isinstance(title_val, str) and re.fullmatch(r"Null\S+", title_val):
        continue
    entry = {}
    for key, value in note.content.items():
        # Safely extract inner value if present
        if isinstance(value, dict) and "value" in value:
            val = str(value["value"])
        else:
            val = str(value)
        # Special handling for 'pdf' field
        if key == "pdf":
            # Prepend if it's not already a full URL
            if val and not val.startswith("https://openreview.net/"):
                val = "https://openreview.net/" + val.lstrip("/")
        entry[key] = val
    json_data.append(entry)

# Save to JSON file
timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")  # the submission list keeps changing, so each snapshot gets a timestamp
filename = f"notes_{timestamp}.json"
with open(filename, "w", encoding="utf-8") as f:
    json.dump(json_data, f, ensure_ascii=False, indent=2)

In [19]:
# Check how many submissions there are
print(f"Total non-empty submissions: {len(json_data)}")

Total non-empty submissions: 19733
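The processing loop above does three things: it unwraps fields stored as `{"value": ...}`, skips one-word placeholder titles starting with `Null`, and stringifies everything. Below is a small standalone sketch of the same logic on a made-up content dict. The helper names (`unwrap`, `is_placeholder_title`, `flatten_content`) and the sample data are hypothetical and only meant to make the behavior easy to check.

```python
import re

def unwrap(value):
    """Return the inner payload of a {"value": ...} field, or the value itself."""
    if isinstance(value, dict) and "value" in value:
        return value["value"]
    return value

def is_placeholder_title(title):
    """True for one-word titles that start with 'Null', which the loop skips."""
    return isinstance(title, str) and re.fullmatch(r"Null\S+", title) is not None

def flatten_content(content):
    """Turn a note's content mapping into a flat dict of strings."""
    return {key: str(unwrap(value)) for key, value in content.items()}

# Hypothetical content dict, shaped like the attributes counted earlier.
sample_content = {
    "title": {"value": "A Toy Paper"},
    "keywords": {"value": ["embeddings", "retrieval"]},
    "abstract": {"value": "We study toy problems."},
}

flat = flatten_content(sample_content)
print(flat["title"])
print(flat["keywords"])  # the list becomes its string representation
print(is_placeholder_title("NullPaper123"), is_placeholder_title("Null results in RL"))
```

Note that list-valued fields such as keywords end up as plain strings, which keeps every metadata value a simple scalar when the records are stored later.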


## Vector Database Construction

In [25]:
with open(filename, "r", encoding="utf-8") as f:  # match the encoding used when saving
    note_list = json.load(f)
# We use only the first 1000 for demonstration purposes
note_list = note_list[:1000]

In [26]:
# Create a persistent chromadb client
client = chromadb.PersistentClient(path="ICLR2026")
# We use a small, free, local model; you can swap in other models, including API-gated ones
embedding_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
collection_name = "MiniLM"
# Sanity check: see whether a collection with this name already exists,
# e.g., because a previous run was interrupted before all papers were inserted
existing = [c.name for c in client.list_collections()]
print(existing)
count = 0
if collection_name in existing:
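    # Pass the embedding function explicitly so resumed inserts use the same model
    collection = client.get_collection(name=collection_name, embedding_function=embedding_fn)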
    count = collection.count()
    print(f"Collection '{collection_name}' already exists with {count} entries.")

    if count > 0:
        cont = input("Do you want to continue adding to the existing collection? (y/n): ").strip().lower()
        if cont != "y":
            confirm = input("Do you want to delete and recreate it instead? (y/n): ").strip().lower()
            if confirm == "y":
                client.delete_collection(name=collection_name)
                print(f"Deleted old collection '{collection_name}'.")
                collection = client.create_collection(name=collection_name, embedding_function=embedding_fn)
                count = 0  # the new collection is empty, so insert from the first paper again
                print(f"Recreated collection '{collection_name}'.")
            else:
                print("Operation cancelled.")
                raise SystemExit  # stop this cell without inserting anything
        else:
            print("Continuing to add to existing collection.")
    else:
        print("Collection exists but has no entries; continuing.")
else:
    collection = client.create_collection(name=collection_name, embedding_function=embedding_fn)
    print(f"Created new collection '{collection_name}'.")

# Build a ChromaDB collection that automatically stores embeddings with metadata (e.g., title, keywords) and enables fast retrieval.
batch_size = 10  # adjust depending on memory
for start in tqdm(range(count, len(note_list), batch_size), desc="Inserting papers in batches"):
    batch = note_list[start:start + batch_size]
    ids = [str(i) for i in range(start, start + len(batch))]
    abstracts = [p.get("abstract", "") for p in batch]
    metadatas = [{k: v for k, v in p.items() if k != "abstract"} for p in batch]

    collection.add(
        ids=ids,
        documents=abstracts, # we use abstracts for embeddings
        metadatas=metadatas
    )

print(f"Inserted {len(note_list)} papers into ChromaDB in batches of {batch_size}.")
print(f"Final collection size: {collection.count()} entries.")

['MiniLM']
Collection 'MiniLM' already exists with 210 entries.
Do you want to continue adding to the existing collection? (y/n): y
Continuing to add to existing collection.


Inserting papers in batches: 100%|██████████| 79/79 [01:47<00:00,  1.36s/it]

Inserted 1000 papers into ChromaDB in batches of 10.
Final collection size: 1000 entries.
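Resuming works because the ids are just positional indices: if the collection already holds `count` entries, the loop starts at index `count` and walks to the end in steps of `batch_size`. Here is a tiny sketch of that index arithmetic, using the numbers from the run above (210 entries already stored, 1000 papers, batches of 10). The helper `batch_ranges` is a hypothetical name, and the sketch assumes the stored entries are exactly the first `count` papers of the same list.

```python
def batch_ranges(start, total, batch_size):
    """Yield (begin, end) index pairs covering [start, total) in fixed-size steps."""
    for begin in range(start, total, batch_size):
        yield begin, min(begin + batch_size, total)

already_stored = 210  # what collection.count() reported before resuming
ranges = list(batch_ranges(already_stored, 1000, 10))

print(len(ranges))   # number of batches left to insert
print(ranges[0])     # first batch picks up right after the stored entries
print(ranges[-1])    # last batch ends at the final paper
```

This gives 79 remaining batches, matching the `79/79` shown in the progress bar. If the underlying list changes order between runs, this positional scheme no longer lines up, so it is safer to resume only against the same saved JSON file.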





## Inference

In [28]:
# Connect to the ChromaDB collection
# Not necessary here, but included for reference in a production environment
client = chromadb.PersistentClient(path="ICLR2026")
embedding_fn = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")
COLLECTION_NAME = "MiniLM"
collection = client.get_collection(name=COLLECTION_NAME, embedding_function=embedding_fn)

# Query
abstract = input("Enter your abstract: ")
results = collection.query(
    query_texts=[abstract],
    n_results=3
)

# See outputs
for doc_id, doc, meta in zip(results["ids"][0], results["documents"][0], results["metadatas"][0]):
    print(f"🧩 ID: {doc_id}")
    print(f"📘 Title: {meta.get('title', 'N/A')}")
    print(f"🏷️ Keywords: {meta.get('keywords', 'N/A')}")
    print(f"📍 Venue: {meta.get('venue', 'N/A')}")
    print(f"🌐 PDF Link: {meta.get('pdf', 'N/A')}")  # already a full URL after the processing step
    print("🧠 Abstract:")
    print(textwrap.fill(doc.strip(), width=100))  # wrap nicely at 100 chars per line

    print("-" * 100)

Enter your abstract: In recent years, neural operators have emerged as a prominent approach for learning mappings between function spaces, such as the solution operators of parametric PDEs. A notable example is the Fourier Neural Operator (FNO), which models the integral kernel as a convolution operator and uses the Convolution Theorem to learn the kernel directly in the frequency domain. The parameters are decoupled from the resolution of the data, allowing the FNO to take inputs of different resolutions. However, training at a lower resolution and inferring at a finer resolution does not guarantee consistent performance, nor can fine details, present only in fine-scale data, be learned solely from coarse data. In this work, we address this misconception by defining and examining the discretization mismatch error: the discrepancy between the outputs of the neural operator when using different discretizations of the input data. We demonstrate that neural operators may suffer from discr
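Besides documents and metadata, a query result also carries a `distances` entry by default (smaller means closer, and the exact meaning depends on the distance metric the collection uses). Here is a minimal sketch of turning one query's results into a ranked list. The `fake_results` dict is a hand-written stand-in with made-up values that only mimics the nested list-per-query shape, and `rank_results` is a hypothetical helper, not part of ChromaDB.

```python
def rank_results(results, query_index=0):
    """Flatten the results of one query into a list of ranked records."""
    ids = results["ids"][query_index]
    distances = results["distances"][query_index]
    metadatas = results["metadatas"][query_index]
    ranked = []
    for rank, (doc_id, dist, meta) in enumerate(zip(ids, distances, metadatas), start=1):
        ranked.append({
            "rank": rank,
            "id": doc_id,
            "distance": dist,
            "title": (meta or {}).get("title", "N/A"),
        })
    return ranked

# Hypothetical response for a single query with n_results=3 (values are made up).
fake_results = {
    "ids": [["12", "845", "3"]],
    "distances": [[0.41, 0.57, 0.63]],
    "metadatas": [[{"title": "Paper A"}, {"title": "Paper B"}, {}]],
}

ranked = rank_results(fake_results)
for row in ranked:
    print(f"{row['rank']}. [{row['id']}] {row['title']} (distance={row['distance']:.2f})")
```

Showing the distance next to each title makes it easier to judge whether the third hit is nearly as relevant as the first or already a stretch.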

## Contact


My collaborator [Jingxiang Qu](https://qujx.github.io/), my undergraduate mentee [Yichi Zhang](https://yichixiaoju.github.io/YichiZhang.github.io/), and I (and GPT) are actively expanding this system to include more venues, employ advanced similarity-matching algorithms, train specialized models, support multi-agent collaboration, handle batch inputs/outputs, and provide an improved user interface. We welcome feedback and collaborations; feel free to [contact me](https://wenhangao21.github.io/).