# Phase 2: Vector Search & Embeddings

## Overview
This notebook builds the semantic search foundation for the book recommender. We'll convert book descriptions into vector embeddings, store them in a vector database, and enable similarity search to find books based on semantic meaning rather than exact keyword matches.

## Objectives
1. Load cleaned dataset and prepare tag descriptions
2. Create text embeddings using OpenAI's embedding model
3. Build Chroma vector database from embeddings
4. Persist vector database to disk for fast loading
5. Test similarity search functionality
6. Create reusable recommendation function

## Expected Output
- **Vector Database**: `chroma_index/` directory (persisted for reuse)
- **Text File**: `data/tag_descriptions.txt` (ISBN + descriptions)
- **Function**: `retrieve_semantic_recommendations()` for semantic book search

In [4]:
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain_openai import OpenAIEmbeddings   
from langchain_chroma import Chroma # vector database

## Step 1: Load Environment Variables

Load OpenAI API key from `.env` file for authentication.

In [5]:
from dotenv import load_dotenv

load_dotenv()

True

Load the dataset.

In [8]:
import pandas as pd
from pathlib import Path

# Load cleaned dataset
data_path = Path("../data/books_cleaned.csv")
books = pd.read_csv(data_path)

print(f"✓ Dataset loaded: {books.shape[0]} rows, {books.shape[1]} columns")

✓ Dataset loaded: 5197 rows, 14 columns


In [10]:
books

Unnamed: 0,isbn13,isbn10,title,authors,categories,thumbnail,description,published_year,average_rating,num_pages,ratings_count,words_in_description,title_and_subtitle,tag_description
0,9780002005883,0002005883,Gilead,Marilynne Robinson,Fiction,http://books.google.com/books/content?id=KQZCP...,A NOVEL THAT READERS and critics have been eag...,2004.0,3.85,247.0,361.0,199,Gilead,9780002005883 A NOVEL THAT READERS and critics...
1,9780002261982,0002261987,Spider's Web,Charles Osborne;Agatha Christie,Detective and mystery stories,http://books.google.com/books/content?id=gA5GP...,A new 'Christie for Christmas' -- a full-lengt...,2000.0,3.83,241.0,5164.0,205,Spider's Web: A Novel,9780002261982 A new 'Christie for Christmas' -...
2,9780006178736,0006178731,Rage of angels,Sidney Sheldon,Fiction,http://books.google.com/books/content?id=FKo2T...,"A memorable, mesmerizing heroine Jennifer -- b...",1993.0,3.93,512.0,29532.0,57,Rage of angels,"9780006178736 A memorable, mesmerizing heroine..."
3,9780006280897,0006280897,The Four Loves,Clive Staples Lewis,Christian life,http://books.google.com/books/content?id=XhQ5X...,Lewis' work on the nature of love divides love...,2002.0,4.15,170.0,33684.0,45,The Four Loves,9780006280897 Lewis' work on the nature of lov...
4,9780006280934,0006280935,The Problem of Pain,Clive Staples Lewis,Christian life,http://books.google.com/books/content?id=Kk-uV...,"""In The Problem of Pain, C.S. Lewis, one of th...",2002.0,4.09,176.0,37569.0,75,The Problem of Pain,"9780006280934 ""In The Problem of Pain, C.S. Le..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5192,9788172235222,8172235224,Mistaken Identity,Nayantara Sahgal,Indic fiction (English),http://books.google.com/books/content?id=q-tKP...,On A Train Journey Home To North India After L...,2003.0,2.93,324.0,0.0,288,Mistaken Identity,9788172235222 On A Train Journey Home To North...
5193,9788173031014,8173031010,Journey to the East,Hermann Hesse,Adventure stories,http://books.google.com/books/content?id=rq6JP...,This book tells the tale of a man who goes on ...,2002.0,3.70,175.0,24.0,63,Journey to the East,9788173031014 This book tells the tale of a ma...
5194,9788179921623,817992162X,The Monk Who Sold His Ferrari: A Fable About F...,Robin Sharma,Health & Fitness,http://books.google.com/books/content?id=c_7mf...,"Wisdom to Create a Life of Passion, Purpose, a...",2003.0,3.82,198.0,1568.0,117,The Monk Who Sold His Ferrari: A Fable About F...,9788179921623 Wisdom to Create a Life of Passi...
5195,9788185300535,8185300534,I Am that,Sri Nisargadatta Maharaj;Sudhakar S. Dikshit,Philosophy,http://books.google.com/books/content?id=Fv_JP...,This collection of the timeless teachings of o...,1999.0,4.51,531.0,104.0,174,I Am that: Talks with Sri Nisargadatta Maharaj,9788185300535 This collection of the timeless ...


## Step 2: Prepare Tag Descriptions for Vector Database

When we query the vector database, it returns book descriptions. However, users need book titles and authors, not descriptions.

By prepending the ISBN to each description, we can extract the ISBN from search results and map it back to the full book metadata (title, author, etc.) efficiently. This avoids slow string matching on full descriptions.

Save tag descriptions to a text file (one per line) for loading into the vector database.

In [None]:
books["tag_description"].to_csv("tag_description.txt",
                                    sep = "\n",
                                    index = False,
                                    header = False)
                        

## Step 3: Load and Split Text Documents

Load the tag descriptions file and split it into individual book descriptions. We use `CharacterTextSplitter` with:
- **Large chunk_size**: Set to a high value (e.g., 10000) so it never splits based on size
- **Separator**: Newline (`\n`) - each line is one book description (ISBN + description)
- **No overlap**: Each description stays as a single document chunk

This ensures each book description remains intact as one document for embedding, rather than being split across multiple chunks.

In [17]:
raw_documents = TextLoader("tag_description.txt").load()
text_splitter = CharacterTextSplitter(
    chunk_size=10000,  # Large enough to keep each description as one chunk
    chunk_overlap=0, 
    separator="\n"
)
documents = text_splitter.split_documents(raw_documents)

In [18]:
# sanity check
documents[0]

Document(metadata={'source': 'tag_description.txt'}, page_content='9780002005883 A NOVEL THAT READERS and critics have been eagerly anticipating for over a decade, Gilead is an astonishingly imagined story of remarkable lives. John Ames is a preacher, the son of a preacher and the grandson (both maternal and paternal) of preachers. It’s 1956 in Gilead, Iowa, towards the end of the Reverend Ames’s life, and he is absorbed in recording his family’s story, a legacy for the young son he will never see grow up. Haunted by his grandfather’s presence, John tells of the rift between his grandfather and his father: the elder, an angry visionary who fought for the abolitionist cause, and his son, an ardent pacifist. He is troubled, too, by his prodigal namesake, Jack (John Ames) Boughton, his best friend’s lost son who returns to Gilead searching for forgiveness and redemption. Told in John Ames’s joyous, rambling voice that finds beauty, humour and truth in the smallest of life’s details, Gilea

## Step 4: Create Document Embeddings and Store in Vector Database

Convert tag descriptions into vector embeddings using OpenAI's embedding model. Chroma handles both embedding creation and storage in one step - when we add documents, it automatically generates embeddings and stores them in the vector database. The database is persisted to disk for fast loading in future sessions.

In [24]:
from tqdm import tqdm
import tiktoken

# Initialize embedding function
embeddings = OpenAIEmbeddings()

# Create tokenizer for counting tokens
# OpenAI uses cl100k_base encoding
encoding = tiktoken.get_encoding("cl100k_base")

# Create Chroma DB with persist directory
persist_directory = "../data/chroma_index"
db_books = Chroma(
    persist_directory=persist_directory,
    embedding_function=embeddings
)

all_texts = [doc.page_content for doc in documents]
total_docs = len(all_texts)

# Process with token-based batching
# Target: max 250K tokens per batch (safe margin under 300K limit)
max_tokens_per_batch = 250000
current_batch = []
current_tokens = 0
batch_num = 0

for i, text in enumerate(tqdm(all_texts, desc="Processing documents")):
    # Count tokens for this text
    token_count = len(encoding.encode(text))
    
    # If adding this text would exceed limit, process current batch first
    if current_tokens + token_count > max_tokens_per_batch and current_batch:
        # Embed current batch
        batch_embeddings = embeddings.embed_documents(current_batch)
        ids = [f"doc_{batch_num}_{j}" for j in range(len(current_batch))]
        db_books.add_texts(texts=current_batch, ids=ids)
        
        # Start new batch
        current_batch = [text]
        current_tokens = token_count
        batch_num += 1
    else:
        current_batch.append(text)
        current_tokens += token_count

# Process final batch
if current_batch:
    batch_embeddings = embeddings.embed_documents(current_batch)
    ids = [f"doc_{batch_num}_{j}" for j in range(len(current_batch))]
    db_books.add_texts(texts=current_batch, ids=ids)

print(f"✓ Vector database created with {total_docs} documents")
print(f"✓ Saved to: {persist_directory}")

Processing documents: 100%|██████████| 272/272 [00:15<00:00, 17.60it/s]


✓ Vector database created with 272 documents
✓ Saved to: ../data/chroma_index


In [32]:
query = "cooking books"
docs = db_books.similarity_search(query, k = 10)
docs

[Document(id='doc_1_102', metadata={}, page_content='"9781400043460 A memoir begun just months before Child\'s death describes the legendary food expert\'s years in Paris, Marseille, and Provence and her journey from a young woman from Pasadena who cannot cook or speak any French to the publication of her legendary Mastering cookbooks and her winning the hearts of America as ""The French Chef."" 150,000 first printing."\n9781400044160 Re-creates the 1960s struggle of Biafra to establish an independent republic in Nigeria, following the intertwined lives of the characters through a military coup, the Biafran secession, and the resulting civil war.\n"9781400044733 A story of life in France under the Nazi occupation includes two parts--""Storm in June,"" set amid the chaotic 1940 exodus from Paris, and ""Dolce,"" set in a German-occupied village rife with resentment, resistance, and collaboration."\n9781400044740 A behind-the-scenes look at the art of French breadmaking includes sixteen r

In [39]:
# Check what's actually in the page_content
print(repr(docs[0].page_content.split()[0]))  # repr() shows hidden characters

'"9781400043460'


## Step 5: Retrieve Semantic Recommendations Function

This function performs semantic search on the book database and returns full book metadata for the top recommendations.

**How it works:**
1. **Vector Search**: Queries the Chroma vector database using `similarity_search()` to find book descriptions semantically similar to the query
2. **Extract ISBNs**: Parses the ISBN from each search result (the first token in the tagged description)
3. **Map to Full Metadata**: Filters the books DataFrame to return complete book information (title, author, description, ratings, etc.) for the recommended books

**Input**: Natural language query string (e.g., "books about space for children")
**Output**: DataFrame with top_k recommended books including all metadata columns

In [44]:
def retrieve_semantic_recommendations(query: str, top_k: int = 10) -> pd.DataFrame:
    """
    Retrieve semantic book recommendations based on a query.
    """
    # Search vector database
    recs = db_books.similarity_search(query, k=top_k)
    
    # Extract ISBNs from search results (handle quotes)
    isbn_list = [
        int(rec.page_content.split()[0].strip().strip('"').strip("'"))
        for rec in recs
        if rec.page_content.split()[0].strip().strip('"').strip("'").isdigit()
    ]
    
    # Return matching books from dataframe
    return books[books["isbn13"].isin(isbn_list)]

In [46]:
 retrieve_semantic_recommendations("books on war and politics")

Unnamed: 0,isbn13,isbn10,title,authors,categories,thumbnail,description,published_year,average_rating,num_pages,ratings_count,words_in_description,title_and_subtitle,tag_description
756,9780141185163,0141185163,Orwell in Spain,George Orwell,Fiction,http://books.google.com/books/content?id=uVNpA...,"Including Homage to Catalonia, Orwell's profou...",2001.0,4.33,416.0,203.0,48,Orwell in Spain: the full text of Homage to Ca...,"9780141185163 Including Homage to Catalonia, O..."
1006,9780195168952,019516895X,Battle Cry of Freedom,James M. McPherson,History,http://books.google.com/books/content?id=09FkZ...,Filled with fresh interpretations and informat...,2005.0,4.34,867.0,22318.0,112,Battle Cry of Freedom: The Civil War Era,9780195168952 Filled with fresh interpretation...
1386,9780330340199,0330340190,In Pharaoh's Army,Tobias Wolff,"Authors, American",http://books.google.com/books/content?id=TO77D...,Having survived the extraordinary childhood re...,1995.0,4.08,224.0,21.0,215,In Pharaoh's Army: Memories of a Lost War,9780330340199 Having survived the extraordinar...
1716,9780375727061,037572706X,Julian,Gore Vidal,Fiction,http://books.google.com/books/content?id=RCFiA...,An insightful historical novel recreates the b...,2003.0,4.19,528.0,5035.0,34,Julian: A Novel,9780375727061 An insightful historical novel r...
2443,9780452282827,0452282829,We Were the Mulvaneys,Joyce Carol Oates,Fiction,http://books.google.com/books/content?id=FBtVG...,"The Mulvaneys, at first a close and very lucky...",1996.0,3.72,454.0,83736.0,34,We Were the Mulvaneys,"9780452282827 The Mulvaneys, at first a close ..."
3180,9780688085872,0688085873,A Short History of World War II,James L. Stokesbury,History,http://books.google.com/books/content?id=uDBhl...,"Despite the numerous books on World War II, un...",1980.0,3.93,416.0,454.0,127,A Short History of World War II,9780688085872 Despite the numerous books on Wo...
3961,9780812532630,0812532635,The Ships of Earth,Orson Scott Card,Fiction,http://books.google.com/books/content?id=5Vo-m...,"The City of Basilica has fallen. Now Wetchik, ...",1995.0,3.54,351.0,9143.0,85,The Ships of Earth: Homecoming:,9780812532630 The City of Basilica has fallen....
3993,9780812968378,0812968379,Funny in Farsi,Firoozeh Dumas,Biography & Autobiography,http://books.google.com/books/content?id=PNh9-...,An autobiography of growing up as an Iranian-A...,2004.0,3.79,240.0,12072.0,31,Funny in Farsi: A Memoir of Growing Up Iranian...,9780812968378 An autobiography of growing up a...
4132,9780843955828,0843955821,Blood Moon Over Britain,Morag McKendrick Pippin,Fiction,http://books.google.com/books/content?id=Eswze...,When Cicely Winterborne's best friend is murde...,2005.0,3.44,323.0,17.0,32,Blood Moon Over Britain,9780843955828 When Cicely Winterborne's best f...
5020,9781852421175,1852421177,The Chomsky Reader,Noam Chomsky;James Peck,Estados Unidos - Relaciones exteriores - 1945-...,http://books.google.com/books/content?id=pc5zQ...,At the centre of pratically every major debate...,1987.0,3.97,492.0,1564.0,202,The Chomsky Reader,9781852421175 At the centre of pratically ever...


## Phase 2 Complete: Vector Search Summary

Built semantic search foundation with ~5,200 book embeddings stored in Chroma vector database. Created `retrieve_semantic_recommendations()` function for finding books based on semantic meaning. Database persisted to `data/chroma_index/` for fast future loading.

**Next:** Phase 3 - Zero-shot category classification to normalize 500+ categories into 4 main categories.