# 02 – Embedding and FAISS Index Creation

This notebook:
- Loads the cleaned dataset from Step 1 (`clean_papers.csv`)
- Converts each row into a LangChain Document with metadata
- Generates OpenAI embeddings
- Stores embeddings in a FAISS index for semantic retrieval
- Saves the index locally for reuse

In [None]:
# pip install faiss-cpu

In [8]:
import os
import pandas as pd
from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document

# Load API key
load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
assert OPENAI_API_KEY, "Please set your OPENAI_API_KEY in a .env file"

# Paths
DATA_PATH = "../data/clean_papers.csv"
FAISS_PATH = "../data/faiss_index"

# Model and limits
EMBEDDING_MODEL = "text-embedding-3-small"
MAX_PAPERS = None  # set to an integer to sample a subset for testing; None uses all papers

Load Cleaned Dataset

In [9]:
df = pd.read_csv(DATA_PATH)
print("Loaded dataset:", df.shape)
df.head(3)

Loaded dataset: (136154, 9)


Unnamed: 0,id,title,summary,category,category_code,published_date,authors,text,year
0,cs-9308101v1,Dynamic Backtracking,Because of their occasional need to return to ...,Artificial Intelligence,cs.AI,1993-08-01,['M. L. Ginsberg'],Dynamic Backtracking. Because of their occasio...,1993
1,cs-9308102v1,A Market-Oriented Programming Environment and ...,Market price systems constitute a well-underst...,Artificial Intelligence,cs.AI,1993-08-01,['M. P. Wellman'],A Market-Oriented Programming Environment and ...,1993
2,cs-9309101v1,An Empirical Analysis of Search in GSAT,We describe an extensive study of search in GS...,Artificial Intelligence,cs.AI,1993-09-01,"['I. P. Gent', 'T. Walsh']",An Empirical Analysis of Search in GSAT. We de...,1993


Convert Rows to Documents

In [10]:
from typing import List

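def build_documents(df: pd.DataFrame, max_papers=None) -> List[Document]:
    """Convert each DataFrame row into a Document (text plus metadata)."""
    # Optionally work on a reproducible random subset for quick tests
    if max_papers and len(df) > max_papers:
        df = df.sample(max_papers, random_state=42).reset_index(drop=True)

    docs = []
    for _, row in df.iterrows():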
        metadata = {
            "id": row["id"],
            "title": row["title"],
            "category": row["category"],
            "category_code": row["category_code"],
            "published_date": row["published_date"],
            "authors": row["authors"],
            "year": int(row["year"]) if not pd.isna(row["year"]) else None,
        }
        docs.append(Document(page_content=row["text"], metadata=metadata))
    return docs

docs = build_documents(df, max_papers=MAX_PAPERS)
print(f"Prepared {len(docs)} documents.")
print("Example metadata:", docs[0].metadata)

Prepared 136154 documents.
Example metadata: {'id': 'cs-9308101v1', 'title': 'Dynamic Backtracking', 'category': 'Artificial Intelligence', 'category_code': 'cs.AI', 'published_date': '1993-08-01', 'authors': "['M. L. Ginsberg']", 'year': 1993}
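
The example metadata above shows that `authors` comes back from the CSV as a string (`"['M. L. Ginsberg']"`), not as a list. If a real list is wanted in the metadata, one option is to parse it before building the documents. The helper below is a minimal sketch; `parse_authors` is a hypothetical name that does not appear in the notebook, and the fallback behavior for unparsable values is an assumption.

```python
import ast
from typing import List


def parse_authors(raw: object) -> List[str]:
    """Turn a stringified list such as "['A', 'B']" into a real list of names."""
    if isinstance(raw, list):
        return [str(name) for name in raw]
    if not isinstance(raw, str) or not raw.strip():
        # Missing values (for example NaN from pandas) become an empty list.
        return []
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Not a Python literal: treat the whole string as a single author.
        return [raw.strip()]
    if isinstance(parsed, (list, tuple)):
        return [str(name) for name in parsed]
    return [str(parsed)]


print(parse_authors("['I. P. Gent', 'T. Walsh']"))
```

Inside `build_documents`, this could replace `row["authors"]` with `parse_authors(row["authors"])`, at the cost of one extra parse per row.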


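Create Embeddings and FAISS Index (DO NOT UNCOMMENT)
The cells below are left commented out because embedding the full dataset takes a long time to run. The outputs shown are from an earlier run.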

In [None]:
# embeddings = OpenAIEmbeddings(
#     model=EMBEDDING_MODEL,
#     api_key=OPENAI_API_KEY,
# )

# # Build FAISS index
# vectorstore = FAISS.from_documents(docs, embedding=embeddings)
# print("FAISS index built successfully.")

FAISS index built successfully.
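
Because the build step is slow, it may help to feed the documents to the index in batches so that progress is visible and a failure does not lose everything. The sketch below shows only the batching logic on placeholder strings; `iter_batches` and the batch size are hypothetical and not part of the notebook. In the notebook, the first batch could presumably go to `FAISS.from_documents` and later ones to `vectorstore.add_documents`, but that wiring is not executed here.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Stand-in for the `docs` list; real Document objects would work the same way.
fake_docs = [f"doc-{i}" for i in range(10)]
for n, batch in enumerate(iter_batches(fake_docs, batch_size=4), start=1):
    # Here each batch would be embedded and added to the index.
    print(f"batch {n}: {len(batch)} documents")
```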


Test Semantic Search

In [None]:
# query = "recent papers on graph neural networks for molecular property prediction"
# results = vectorstore.similarity_search(query, k=3)

# for doc in results:
#     print("🔹 TITLE:", doc.metadata.get("title"))
#     print("📘 CATEGORY:", doc.metadata.get("category_code"))
#     print("🧠 YEAR:", doc.metadata.get("year"))
#     print("📝 SNIPPET:", doc.page_content[:250].replace("\n", " "), "...")
#     print("-" * 100)

🔹 TITLE: Gated Graph Recursive Neural Networks for Molecular Property Prediction
📘 CATEGORY: cs.LG
🧠 YEAR: 2019
📝 SNIPPET: Gated Graph Recursive Neural Networks for Molecular Property Prediction. Molecule property prediction is a fundamental problem for computer-aided drug discovery and materials science. Quantum-chemical simulations such as density functional theory (DF ...
----------------------------------------------------------------------------------------------------
🔹 TITLE: Analyzing Learned Molecular Representations for Property Prediction
📘 CATEGORY: cs.LG
🧠 YEAR: 2019
📝 SNIPPET: Analyzing Learned Molecular Representations for Property Prediction. Advancements in neural machinery have led to a wide range of algorithmic solutions for molecular property prediction. Two classes of models in particular have yielded promising resu ...
----------------------------------------------------------------------------------------------------
🔹 TITLE: Fast Quant
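
Conceptually, a similarity search embeds the query and returns the stored vectors that lie closest to it. The toy sketch below illustrates that idea with an exhaustive Euclidean-distance scan in plain Python. The 2-D vectors and paper names are made up for illustration and are not OpenAI embeddings, and the real index relies on FAISS rather than a Python loop.

```python
import math
from typing import Dict, List, Tuple


def l2_distance(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query: List[float], vectors: Dict[str, List[float]], k: int) -> List[Tuple[str, float]]:
    """Return the k (name, distance) pairs closest to the query, nearest first."""
    scored = [(name, l2_distance(query, vec)) for name, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Hypothetical toy "index": three papers with hand-written 2-D vectors.
toy_index = {
    "paper_a": [1.0, 0.0],
    "paper_b": [0.0, 1.0],
    "paper_c": [0.9, 0.1],
}

for name, distance in top_k([1.0, 0.0], toy_index, k=2):
    print(name, round(distance, 3))
```

With this query, `paper_a` (identical vector) ranks first and `paper_c` second, while `paper_b` is left out because it points in a different direction.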

Save FAISS Index

In [None]:
# os.makedirs(FAISS_PATH, exist_ok=True)
# vectorstore.save_local(FAISS_PATH)
# print(f"FAISS index saved to: {FAISS_PATH}")

FAISS index saved to: ../data/faiss_index
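
Since the index is saved for reuse, a small guard can decide whether the expensive build step needs to run at all. The sketch below assumes that `save_local` writes `index.faiss` and `index.pkl` into the target folder (LangChain's default file names, as far as I know); `faiss_index_exists` is a hypothetical helper, demonstrated here on a temporary directory rather than on `FAISS_PATH`.

```python
import os
import tempfile


def faiss_index_exists(folder: str, index_name: str = "index") -> bool:
    """Return True if both files expected from save_local are present."""
    expected = (f"{index_name}.faiss", f"{index_name}.pkl")
    return all(os.path.isfile(os.path.join(folder, name)) for name in expected)


with tempfile.TemporaryDirectory() as tmp:
    print(faiss_index_exists(tmp))  # empty folder, so nothing to reload
    for name in ("index.faiss", "index.pkl"):
        # Create empty placeholder files to simulate a saved index.
        open(os.path.join(tmp, name), "wb").close()
    print(faiss_index_exists(tmp))
```

In the notebook, a check like `faiss_index_exists(FAISS_PATH)` could choose between reloading the saved index and rebuilding it.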


Reload FAISS and Check

In [None]:
# # Test reloading the index (ensures serialization works)
# reloaded_vs = FAISS.load_local(FAISS_PATH, embeddings, allow_dangerous_deserialization=True)

# query = "transformers for natural language understanding"
# results = reloaded_vs.similarity_search(query, k=2)

# for doc in results:
#     print("🔹", doc.metadata["title"])

🔹 HuggingFace's Transformers: State-of-the-art Natural Language Processing
🔹 Transformadores: Fundamentos teoricos y Aplicaciones
