<a href="https://colab.research.google.com/github/tuanqpham0921/Semantic-Book-Recommender-/blob/main/vector_search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <center> 🔎 Vector Search </center>

In this section, we implement a **vector-based recommendation system** using **OpenAI embeddings** to find books with similar descriptions.

We begin by using the **cleaned book dataset**, which includes a `tagged_description` column. These tagged descriptions provide a consistent and preprocessed textual representation of each book, which we embed using OpenAI’s model.

### 🧠 Model & Tools

- **Embedding Model**: [`text-embedding-3-small`](https://platform.openai.com/docs/guides/embeddings/embedding-models)
- **Vector Store**: [LangChain + Chroma](https://docs.langchain.com/docs/integrations/vectorstores/chroma/)

The process works as follows:
1. Convert all `tagged_description` entries into vector embeddings using the OpenAI model.
2. Store these embeddings in a **Chroma database**.
3. When a user inputs a query, it is also embedded using the same model.
4. Retrieve the **most semantically similar books** by comparing vector similarity (e.g., cosine distance) in the Chroma vector store.

This approach enables **semantic search** — allowing the system to find relevant book recommendations even when the query uses different wording from the actual book descriptions.
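
To make step 4 concrete, here is a minimal sketch of ranking by cosine similarity. The three-dimensional vectors and their labels are made up for illustration (real embeddings have far more dimensions), and `cosine_similarity` and `rank_by_similarity` are hypothetical helpers. Chroma does this comparison internally, using whichever distance metric the collection is configured with.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, candidates, k=2):
    """Return the k candidate keys most similar to query_vec, best first."""
    scored = sorted(
        candidates.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [key for key, _ in scored[:k]]


# Toy vectors (hypothetical, not real embeddings).
toy_embeddings = {
    "nature picture book": [0.9, 0.1, 0.0],
    "courtroom thriller": [0.0, 0.2, 0.9],
    "backyard animals guide": [0.8, 0.3, 0.1],
}
toy_query = [1.0, 0.2, 0.0]

ranking = rank_by_similarity(toy_query, toy_embeddings)
print(ranking)
```

The two child-and-nature style entries point in nearly the same direction as the query, so they rank ahead of the thriller even though no words are compared.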


In [3]:
!pip install langchain-openai
!pip install langchain-chroma
!pip install langchain-community



In [4]:
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

In [5]:
# Read the OpenAI API key from Colab's Secrets panel (google.colab.userdata)
from google.colab import userdata
OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')

In [6]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [7]:
import pandas as pd

# Path to the cleaned dataset in Google Drive; adjust it to match your own Drive layout
file_path = '/content/drive/MyDrive/Colab Notebooks/Semantic Book Recommender/books_cleaned.csv'

try:
    books = pd.read_csv(file_path)
    print("CSV file loaded successfully!")
    display(books.head())
except FileNotFoundError:
    print(f"Error: The file was not found at {file_path}")
except Exception as e:
    print(f"An error occurred: {e}")

CSV file loaded successfully!


Unnamed: 0,isbn13,isbn10,title,authors,categories,thumbnail,description,published_year,average_rating,num_pages,ratings_count,title_and_subtiles,tagged_description
0,9780002005883,2005883,Gilead,Marilynne Robinson,Fiction,http://books.google.com/books/content?id=KQZCP...,A NOVEL THAT READERS and critics have been eag...,2004.0,3.85,247.0,361.0,Gilead,9780002005883 A NOVEL THAT READERS and critics...
1,9780002261982,2261987,Spider's Web,Charles Osborne;Agatha Christie,Detective and mystery stories,http://books.google.com/books/content?id=gA5GP...,A new 'Christie for Christmas' -- a full-lengt...,2000.0,3.83,241.0,5164.0,Spider's Web: A Novel,9780002261982 A new 'Christie for Christmas' -...
2,9780006178736,6178731,Rage of angels,Sidney Sheldon,Fiction,http://books.google.com/books/content?id=FKo2T...,"A memorable, mesmerizing heroine Jennifer -- b...",1993.0,3.93,512.0,29532.0,Rage of angels,"9780006178736 A memorable, mesmerizing heroine..."
3,9780006280897,6280897,The Four Loves,Clive Staples Lewis,Christian life,http://books.google.com/books/content?id=XhQ5X...,Lewis' work on the nature of love divides love...,2002.0,4.15,170.0,33684.0,The Four Loves,9780006280897 Lewis' work on the nature of lov...
4,9780006280934,6280935,The Problem of Pain,Clive Staples Lewis,Christian life,http://books.google.com/books/content?id=Kk-uV...,"""In The Problem of Pain, C.S. Lewis, one of th...",2002.0,4.09,176.0,37569.0,The Problem of Pain,"9780006280934 ""In The Problem of Pain, C.S. Le..."


In [9]:
books["tagged_description"]

Unnamed: 0,tagged_description
0,9780002005883 A NOVEL THAT READERS and critics...
1,9780002261982 A new 'Christie for Christmas' -...
2,"9780006178736 A memorable, mesmerizing heroine..."
3,9780006280897 Lewis' work on the nature of lov...
4,"9780006280934 ""In The Problem of Pain, C.S. Le..."
...,...
5192,9788172235222 On A Train Journey Home To North...
5193,9788173031014 This book tells the tale of a ma...
5194,9788179921623 Wisdom to Create a Life of Passi...
5195,9788185300535 This collection of the timeless ...


## 🔎 Vector Search with Tagged Descriptions

In this section, we implement a **semantic search system** using OpenAI embeddings to find books with similar content based on user queries.

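Our input data includes a `tagged_description` column, where each entry is the book's ISBN-13 followed by a space and the description, for example `9780002005883 A NOVEL THAT READERS and critics...`.

The ISBN prefix is what lets us link a search result back to its row in the DataFrame. In the code below, the ISBN is copied into each document's metadata, while the embedded text keeps the full tagged description, ISBN prefix included.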

### ⚙️ Workflow

1. **Prepare Documents**  
   From each `tagged_description`, we read the leading ISBN into the document's metadata and keep the full line as the text to embed.

2. **Generate Embeddings**  
   We embed each document's text using OpenAI’s [`text-embedding-3-small`](https://platform.openai.com/docs/guides/embeddings/embedding-models) model.

3. **Store in Vector Database**  
   The embeddings are stored in a **Chroma** vector database using LangChain for easy retrieval.

4. **Querying the System**  
   When a user enters a natural-language query (e.g., *"books about overcoming adversity with humor"*), we:
   - Embed the query using the same OpenAI model
   - Perform a similarity search in Chroma
   - Retrieve the **most similar descriptions**

5. **Extract Recommendations**  
   From the matched descriptions, we extract the corresponding **ISBN13** values to retrieve full book information (title, author, etc.) from the original DataFrame.

---

This setup enables a powerful **semantic recommendation system** where users can receive relevant book suggestions, even if their queries don’t exactly match the phrasing in the dataset.
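
Steps 1 and 5 both rely on the `<isbn13> <description>` layout of each entry. Here is a minimal sketch of that split, assuming every entry starts with a run of digits followed by a space. `split_tagged_description` is a hypothetical helper, not part of the notebook.

```python
def split_tagged_description(tagged):
    """Split '<isbn13> <description>' into (isbn13 as int, description)."""
    # Lines written through the CSV writer may be wrapped in double quotes.
    cleaned = tagged.strip().strip('"')
    isbn_part, _, description = cleaned.partition(" ")
    if not isbn_part.isdigit():
        raise ValueError(f"No leading ISBN found in: {tagged[:30]!r}")
    return int(isbn_part), description.strip()


sample = "9780002005883 A NOVEL THAT READERS and critics have been eagerly anticipating"
isbn13, description = split_tagged_description(sample)
print(isbn13)
print(description)
```

Returning the ISBN as an integer matches how `isbn13` is compared against the DataFrame later in the notebook.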


In [10]:
# Write one tagged description per line to a plain-text file
books["tagged_description"].to_csv("tagged_descriptions.txt",
                                   sep="\n",
                                   index=False,
                                   header=False)

In [None]:
import os
import shutil

# Define the source file path (the file in your current Colab environment)
source_file = 'tagged_descriptions.txt'

# Define the destination folder in your Google Drive; adjust it to your desired path
destination_dir = '/content/drive/MyDrive/Colab Notebooks/Semantic Book Recommender/'
destination_file = os.path.join(destination_dir, source_file)

# Ensure the destination directory exists
os.makedirs(destination_dir, exist_ok=True)

try:
    # Copy the file
    shutil.copyfile(source_file, destination_file)
    print(f"'{source_file}' successfully saved to '{destination_file}'")
except FileNotFoundError:
    print(f"Error: Source file '{source_file}' not found.")
except Exception as e:
    print(f"An error occurred: {e}")

'tagged_descriptions.txt' successfully saved to '/content/drive/MyDrive/Colab Notebooks/Semantic Book Recommender/tagged_descriptions.txt'


In [None]:
# raw_documents = TextLoader(file_path = "tagged_descriptions.txt").load()
# # The chunk_size must be > 0. We set a large size since we prioritize splitting on newline.
# text_splitter = CharacterTextSplitter(chunk_size=99999, chunk_overlap=0, separator="\n")
# documents = text_splitter.split_documents(raw_documents)

In [11]:
from langchain_core.documents import Document
import re
from langchain_community.document_loaders import TextLoader

raw_documents = TextLoader(file_path = "tagged_descriptions.txt").load()
seen_line = set()
documents = []

for doc in raw_documents:
  lines = doc.page_content.split("\n")
  for line in lines:
    stripped = line.strip()
    if not stripped:
      continue

    # Extract the ISBN from the start of the line.
    # The CSV writer wraps lines that contain quote characters in double quotes,
    # so allow an optional leading quote before the digits.
    match = re.match(r'^"?(\d+)', stripped)
    isbn = match.group(1) if match else None

    # Deduplicate on the full stripped line, so each unique tagged description
    # is stored exactly once
    if stripped not in seen_line:
      # Store the extracted ISBN in metadata if found
      metadata = {"source": "tagged_descriptions.txt"}
      if isbn:
          metadata["isbn"] = isbn # Store as "isbn" in metadata

      documents.append(Document(page_content=stripped, metadata=metadata))
      seen_line.add(stripped) # Deduplicate based on the full stripped line

print(f"Created {len(documents)} documents after deduplication.")

Created 5197 documents after deduplication.


In [12]:
documents[0]

Document(metadata={'source': 'tagged_descriptions.txt', 'isbn': '9780002005883'}, page_content='9780002005883 A NOVEL THAT READERS and critics have been eagerly anticipating for over a decade, Gilead is an astonishingly imagined story of remarkable lives. John Ames is a preacher, the son of a preacher and the grandson (both maternal and paternal) of preachers. It’s 1956 in Gilead, Iowa, towards the end of the Reverend Ames’s life, and he is absorbed in recording his family’s story, a legacy for the young son he will never see grow up. Haunted by his grandfather’s presence, John tells of the rift between his grandfather and his father: the elder, an angry visionary who fought for the abolitionist cause, and his son, an ardent pacifist. He is troubled, too, by his prodigal namesake, Jack (John Ames) Boughton, his best friend’s lost son who returns to Gilead searching for forgiveness and redemption. Told in John Ames’s joyous, rambling voice that finds beauty, humour and truth in the smal

We use the **`Document` object** to feed data into the LangChain and OpenAI pipeline. The `Document` class is designed for handling **unstructured text data**, such as book descriptions or other natural language inputs, which we later convert into vector embeddings.

### Why Use a `Document` Object?
- **Unstructured Data Handling**: `Document` is well-suited for text content like book summaries or descriptions that don’t fit neatly into tabular structures.
- **Metadata Storage**: Each `Document` can hold metadata (e.g., `isbn13`, `title`, `author`), which is essential for **RAG** (Retrieval-Augmented Generation) pipelines and **context-based answers**.
- **Seamless Integration**: LangChain’s vector store APIs, such as **Chroma**, expect `Document` objects, making it easy to store and retrieve embeddings along with their metadata.

This approach ensures that when we retrieve similar content via vector search, we also retain critical context (like ISBN or title) to link the result back to the original book.


In [13]:
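db_books = Chroma.from_documents(
    documents,
    # Name the model explicitly; otherwise OpenAIEmbeddings uses its default model
    embedding=OpenAIEmbeddings(model="text-embedding-3-small",
                               openai_api_key=OPENAI_API_KEY))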

Now that we've successfully created the **vector embedding database**, we can move forward by testing it with a **sample query**.



In [14]:
query = "A book to teach children about nature"
docs = db_books.similarity_search(query, k = 4)
docs

[Document(id='f57acdf7-a6d4-4088-a826-8c084c4a73a2', metadata={'isbn': '9780786808069', 'source': 'tagged_descriptions.txt'}, page_content='9780786808069 Children will discover the exciting world of their own backyard in this introduction to familiar animals from cats and dogs to bugs and frogs. The combination of photographs, illustrations, and fun facts make this an accessible and delightful learning experience.'),
 Document(id='ad43601b-6f4a-4f14-aca6-d5bc8cf997f1', metadata={'source': 'tagged_descriptions.txt', 'isbn': '9780786808380'}, page_content="9780786808380 Introduce your babies to birds, cats, dogs, and babies through fine art, illustration, and photographs. These books are a rare opportunity to expose little ones to a range of images on a single subject, from simple child's drawings and abstract art to playful photos. A brief text accompanies each image, introducing the baby to some basic -- and sometimes playful -- information about the subjects."),
 Document(id='e58a3274

### 🧠 Observations from Query Results

As we can see, the returned descriptions are relevant and semantically aligned with the given query — which confirms that our vector search is working effectively.

However, one issue becomes apparent: **many of the top results are nearly identical**, differing only in minor details such as ISBN numbers, punctuation, or spelling variations. These may be duplicates or different editions of the same book.

### ⚖️ Next Steps: Handling Near-Duplicates

We’ll need to decide whether or not to **filter out near-duplicate entries**. This could involve:

- Comparing **titles** and **ISBN13** values
- Checking for **high textual overlap**
- Removing entries that offer no meaningful variation
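
One way to check for high textual overlap is a character-level similarity ratio. The sketch below uses `difflib.SequenceMatcher` from the standard library. The 0.9 threshold, the helper names, and the sample strings are assumptions for illustration, not values tuned on this dataset.

```python
from difflib import SequenceMatcher


def is_near_duplicate(text_a, text_b, threshold=0.9):
    """Return True when two texts are at least `threshold` similar (0 to 1)."""
    ratio = SequenceMatcher(None, text_a.lower(), text_b.lower()).ratio()
    return ratio >= threshold


def drop_near_duplicates(descriptions, threshold=0.9):
    """Keep the first occurrence of each group of near-identical descriptions."""
    kept = []
    for text in descriptions:
        if not any(is_near_duplicate(text, seen, threshold) for seen in kept):
            kept.append(text)
    return kept


# Illustrative strings: the first two differ only in spacing and punctuation.
samples = [
    "Children will discover the exciting world of their own backyard.",
    "Children will discover the exciting world of their own back yard!",
    "A memorable, mesmerizing heroine fights for justice.",
]
unique = drop_near_duplicates(samples)
print(len(unique))
```

Each text is compared against every kept text, so the cost grows quadratically. That makes this more practical on the handful of results returned for a query than on the full catalogue.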

This step will improve the **diversity and quality of recommendations**, especially in cases where redundancy adds little value to the user experience.

In [16]:
books[books["isbn13"] == int(docs[0].metadata["isbn"])]

Unnamed: 0,isbn13,isbn10,title,authors,categories,thumbnail,description,published_year,average_rating,num_pages,ratings_count,title_and_subtiles,tagged_description
3747,9780786808069,786808063,Baby Einstein: Neighborhood Animals,Marilyn Singer;Julie Aigner-Clark,Juvenile Fiction,http://books.google.com/books/content?id=X9a4P...,Children will discover the exciting world of t...,2001.0,3.89,16.0,180.0,Baby Einstein: Neighborhood Animals,9780786808069 Children will discover the excit...


In [19]:
import pandas as pd

def retrive_semantic_recomendations(
    query: str,
    top_k: int = 10,
) -> pd.DataFrame:
  recs = db_books.similarity_search(query, k = 50) # Retrieve more results to have enough after potential filtering

  books_isbn = []

  for doc in recs:
    # Safely get the ISBN from metadata

    isbn_str = doc.metadata.get("isbn") # Use .get() for safety, assuming metadata key is now "isbn"
    if isbn_str:
      # Clean the ISBN string (remove quotes and whitespace)
      cleaned_isbn_str = isbn_str.strip().strip('"')

      try:
        # Attempt conversion to integer using standard int()
        isbn_int = int(cleaned_isbn_str)
        books_isbn.append(isbn_int)
      except ValueError:
        # Handle cases where conversion fails after cleaning
        print(f"Warning: Could not convert ISBN '{isbn_str}' to integer. Skipping.")
        continue

  # Ensure we have enough ISBNs after cleaning
  if not books_isbn:
      print("No valid ISBNs found in the recommendations.")
      return pd.DataFrame() # Return empty DataFrame if no valid ISBNs

  # Filter the main books DataFrame by the collected ISBNs.
  # Note: .isin() keeps rows in DataFrame order, not similarity order, so the
  # .head(top_k) below returns the first top_k matches by row position.
  recommended_books = books[books["isbn13"].isin(books_isbn)]

  # Handle cases where fewer than top_k matches are found
  if len(recommended_books) < top_k:
      print(f"Warning: Found only {len(recommended_books)} unique books matching the recommended ISBNs.")

  return recommended_books.head(top_k)

In [20]:
retrive_semantic_recomendations(query)

Unnamed: 0,isbn13,isbn10,title,authors,categories,thumbnail,description,published_year,average_rating,num_pages,ratings_count,title_and_subtiles,tagged_description
59,9780007151240,7151241,The Family Way,Tony Parsons,Parenthood,http://books.google.com/books/content?id=dJEIx...,It should be the most natural thing in the wor...,2005.0,3.51,400.0,2095.0,The Family Way,9780007151240 It should be the most natural th...
143,9780060546571,60546573,Three Rotten Eggs,Gregory Maguire,Juvenile Fiction,http://books.google.com/books/content?id=t2pWl...,The students of Miss Earth's class in rural Ve...,2005.0,3.74,240.0,76.0,Three Rotten Eggs,9780060546571 The students of Miss Earth's cla...
429,9780064434980,64434982,The Deer in the Wood,Laura Ingalls Wilder,Juvenile Fiction,http://books.google.com/books/content?id=V7YDW...,Even the youngest child can enjoy a special ad...,1999.0,4.17,32.0,302.0,The Deer in the Wood,9780064434980 Even the youngest child can enjo...
692,9780140448009,140448004,Three Tales,Gustave Flaubert;Roger Whitehouse;Geoffrey Wall,Fiction,http://books.google.com/books/content?id=XFzga...,Features short fiction by the French naturalis...,2005.0,3.71,110.0,3050.0,Three Tales,9780140448009 Features short fiction by the Fr...
707,9780140568196,140568190,The Giraffe and the Pelly and Me,Roald Dahl;Quentin Blake,Candy,http://books.google.com/books/content?id=J7FdI...,"A Dahl story in which the giraffe, the pelican...",2001.0,3.81,32.0,16265.0,The Giraffe and the Pelly and Me,9780140568196 A Dahl story in which the giraff...
711,9780140621624,140621628,The Railway Children,E. Nesbit,Fiction,http://books.google.com/books/content?id=fFesd...,"When their father is sent away to prison, thre...",1995.0,4.0,212.0,178.0,The Railway Children,9780140621624 When their father is sent away t...
763,9780141186078,141186070,The Log from the Sea of Cortez,John Steinbeck,Biography & Autobiography,http://books.google.com/books/content?id=9CrIf...,This light-hearted journal tells of John Stein...,2001.0,3.84,288.0,3226.0,The Log from the Sea of Cortez,9780141186078 This light-hearted journal tells...
812,9780142302279,142302279,Dirty Beasts,Roald Dahl,Juvenile Nonfiction,,Poems tell the stories of a smart pig who outw...,2002.0,4.02,32.0,3953.0,Dirty Beasts,9780142302279 Poems tell the stories of a smar...
855,9780143037392,143037390,The Read-aloud Handbook,Jim Trelease,Language Arts & Disciplines,http://books.google.com/books/content?id=B2_yU...,Explains the importance of reading aloud to ch...,2006.0,4.4,432.0,4122.0,The Read-aloud Handbook,9780143037392 Explains the importance of readi...
1078,9780241003008,241003008,The Very Hungry Caterpillar,Eric Carle,Babytime resource,http://books.google.com/books/content?id=DpGEQ...,Eric Carle's children's classic is the story o...,1994.0,4.29,26.0,340101.0,The Very Hungry Caterpillar,9780241003008 Eric Carle's children's classic ...
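
One thing to watch in the output above: the rows come back in DataFrame order (index 59, 143, 429, ...), because `.isin()` filters without regard to similarity rank. Here is a minimal sketch of one way to keep the ranking, written in plain Python with made-up ISBNs and titles. `order_by_rank` is a hypothetical helper, and the records stand in for DataFrame rows.

```python
def order_by_rank(ranked_isbns, records, top_k=10):
    """Return records ordered by first appearance of their ISBN in ranked_isbns."""
    # dict.fromkeys removes repeated ISBNs while preserving the search order.
    rank = {isbn: pos for pos, isbn in enumerate(dict.fromkeys(ranked_isbns))}
    matched = [rec for rec in records if rec["isbn13"] in rank]
    matched.sort(key=lambda rec: rank[rec["isbn13"]])
    return matched[:top_k]


# Hypothetical catalogue rows, in storage order.
catalogue = [
    {"isbn13": 111, "title": "Book A"},
    {"isbn13": 222, "title": "Book B"},
    {"isbn13": 333, "title": "Book C"},
]
# Hypothetical search result: most similar first, with one repeat and one unknown ISBN.
search_order = [333, 111, 333, 999]

top = order_by_rank(search_order, catalogue, top_k=2)
print([rec["title"] for rec in top])
```

The same idea carries over to the DataFrame: map each ISBN to its position in the search results and sort by that position before taking the first `top_k` rows.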


### ✅ Relevance of Results & Next Steps

Our book recommender is working well — it returns relevant matches based on the query and effectively handles cases where multiple books have **very similar descriptions**. Upon closer inspection, we confirmed that these are indeed **distinct books** with different titles, despite the near-identical wording.

### 🧩 Enhancing Search with Text Classification

To further improve the recommendation system, we now introduce a **text classification layer**. By classifying each book based on its description (e.g., **Fiction** vs. **Non-Fiction**), we can:

- **Group books into meaningful categories**
- **Narrow down search space**
- **Make filtering and recommendations more efficient**

### 🧠 Zero-Shot Classification

We will use **zero-shot text classification** — a technique that allows **large pre-trained language models (LLMs)** to classify text into user-defined labels without any additional training data.

- No dataset preparation
- No fine-tuning required
- Just a prompt and a model (e.g., Hugging Face’s `facebook/bart-large-mnli` or similar)

This approach is powerful and cost-effective, and it's particularly useful when working with new labels or evolving classification needs.
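
As a rough sketch of what this could look like with the Hugging Face `transformers` zero-shot pipeline: the helper names and the two labels are assumptions for illustration. Building the classifier needs `transformers` and a backend such as PyTorch, and downloads the model on first use.

```python
def build_classifier(model_name="facebook/bart-large-mnli"):
    """Create a zero-shot classification pipeline for the given model."""
    # Imported lazily so the helpers below work without transformers installed.
    from transformers import pipeline
    return pipeline("zero-shot-classification", model=model_name)


def top_label(result):
    """Return the highest-scoring label from a pipeline result dict."""
    best_score, best_label = max(zip(result["scores"], result["labels"]))
    return best_label


def classify_description(classifier, description, labels=("Fiction", "Non-Fiction")):
    """Classify one description into one of the candidate labels."""
    result = classifier(description, candidate_labels=list(labels))
    return top_label(result)


# Example usage (not run here because it downloads a large model):
# classifier = build_classifier()
# classify_description(classifier, books["description"][0])
```

The pipeline returns the candidate labels alongside their scores, so `top_label` only has to pick the label with the highest score.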
