# Board Game Rule Q&A Using Retrieval Augmented Generation (RAG)
-- A project by Rutuja Jangle

## Project Overview
This project demonstrates a Retrieval Augmented Generation (RAG) application tailored to help users quickly understand board game rules by applying several Gen AI capabilities. The project transforms lengthy board game rule PDFs into manageable text chunks, creates embeddings for each chunk, and uses a custom retrieval process (via cosine similarity–based vector search) to find the most relevant information in response to a user query. Finally, it synthesizes an answer using Google Generative AI's text generation API.


## Use Case and Innovation

### Use Case
**Problem:** Many board games come with extensive, complex rulebooks that can overwhelm new players. Finding the answer to a specific rule-related query often means manually searching through multiple pages of dense text.

**Solution:**  
- Automatically extract and split the text from PDF rulebooks (document understanding).  
- Converting text chunks into semantic vector representations (embeddings).  
- Retrieving the most relevant segments through a custom retrieval algorithm (RAG).
- Synthesizing a concise answer based on the retrieved context (controlled generation).

### Innovation & Impact
- **Novel Approach:** The pipeline leverages state-of-the-art Gen AI capabilities in a novel way by integrating document understanding, semantic embeddings, and retrieval augmented generation.  
- **Impact:** The solution helps lower the learning curve for board games by providing quick answers derived from official rulebooks. While focused on board game rules, the underlying pipeline (PDF ingestion, embedding generation, and RAG) is highly generic and can be repurposed for any domain that involves dense textual documents (legal texts, manuals, research articles).  

## Environment Setup & Data Discovery

Before we start processing rulebook PDFs, we need to prepare our environment and verify what data is available:

- **Import core libraries**  
  - `numpy` (`np`): for efficient numerical operations (e.g., vector math later)  
  - `pandas` (`pd`): for structured data handling (e.g., tabular metadata or results)

- **Inspect the Kaggle input directory**  
  - We use Python’s `os.walk()` to recursively traverse `/kaggle/input`,  
  - Printing each file path lets us confirm that all PDF rulebooks (or other assets) are present and correctly mounted.

Ensuring the input files are accessible is a crucial first step in our RAG pipeline—if the raw documents aren’t there, nothing else can proceed!  


In [1]:
import numpy as np 
import pandas as pd 

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/boardgame-rulebooks/BoardGame_Rulebooks/Corrosion rulebook.pdf
/kaggle/input/boardgame-rulebooks/BoardGame_Rulebooks/Metal Gear rulebook.pdf
/kaggle/input/boardgame-rulebooks/BoardGame_Rulebooks/The Thing rulebook.pdf
/kaggle/input/boardgame-rulebooks/BoardGame_Rulebooks/Intarsia rulebook.pdf
/kaggle/input/boardgame-rulebooks/BoardGame_Rulebooks/Azul rulebook.pdf
/kaggle/input/boardgame-rulebooks/BoardGame_Rulebooks/PointCity rulebook.pdf
/kaggle/input/boardgame-rulebooks/BoardGame_Rulebooks/Eldritch rulebook.pdf
/kaggle/input/boardgame-rulebooks/BoardGame_Rulebooks/Chess rulebook.pdf
/kaggle/input/boardgame-rulebooks/BoardGame_Rulebooks/Kick-Ass rulebook.pdf


## Setup & Dependency Installation

Before diving into the RAG pipeline, we need to install and pin all the required libraries for the end‑to‑end RAG pipeline, everything from PDF parsing and vector search to the Google GenAI SDK (Gemini). After the installs, it imports the SDK and prints its version to confirm that we’re working with the exact client version required for embedding generation and answer synthesis.  


In [None]:

!pip install "jupyterlab>=3.1.0,<4.0.0"


!pip install protobuf==3.19.6


!pip install "google-api-core>=2.19.1,<3.0.0"


!pip install "rich<14"


!pip install PyPDF2 faiss-cpu openai tqdm


In [2]:
!pip uninstall -qqy google-genai
!pip install -qU google-genai==1.7.0


In [3]:
from google import genai
from google.genai import types

from IPython.display import Markdown

genai.__version__

'1.7.0'

## Setting up API key

I used an API key generated from the Google AI Studio and then stored it in Kaggle secrets inside the Add-ons menu and named it as `GOOGLE_API_KEY`

In [10]:
import google.generativeai as genai
from kaggle_secrets import UserSecretsClient

# Retrieve your API key
GOOGLE_API_KEY = UserSecretsClient().get_secret("GOOGLE_API_KEY")

# Configure the client with your API key
genai.configure(api_key=GOOGLE_API_KEY)

## Initializing GenAI Client & Listing Embedding Models

Here I have first created my GenAI client with my API key and then listed all models that support `embedContent` API method so I can pick the best one for embedding in my RAG pipeline.  


In [5]:
client = genai.Client(api_key=GOOGLE_API_KEY)

for m in client.models.list():
    if m.supported_actions is not None and "embedContent" in m.supported_actions:
        print(m.name)


models/embedding-001
models/text-embedding-004
models/gemini-embedding-exp-03-07
models/gemini-embedding-exp


## Data Ingestion: Extracting & Loading Rulebook PDFs

 
In this cell, I pointed to my input folder where all the board game rulebook PDFs are stored and used PyPDF2 to pull the raw text out of each file. First, I defined an `extract_text_from_pdf()` helper that opens a PDF in binary mode, loops through every page, and safely grabs whatever text is there (falling back to an empty string if a page is blank or extraction fails). Then I walked the directory to find every `.pdf`, run it through that function, and build a list of dictionaries, each with the original filename under `source` and the full extracted text under `content`. Finally, I printed how many documents loaded successfully so I know my RAG pipeline has the raw rulebook text ready for chunking, embedding, and retrieval.  


In [6]:
import os
import PyPDF2
from IPython.display import Markdown

PDF_FOLDER = "/kaggle/input/boardgame-rulebooks/BoardGame_Rulebooks"

def extract_text_from_pdf(pdf_path):
    """
    Extract text from a PDF using PyPDF2.
    """
    text = []
    try:
        with open(pdf_path, "rb") as f:
            reader = PyPDF2.PdfReader(f)
            for page in reader.pages:
                page_text = page.extract_text() or ""
                text.append(page_text)
    except Exception as e:
        print(f"Error reading {pdf_path}: {e}")
    
    return "\n".join(text)

# Load all PDFs in the folder
documents = []
for file_name in os.listdir(PDF_FOLDER):
    if file_name.lower().endswith(".pdf"):
        full_path = os.path.join(PDF_FOLDER, file_name)
        pdf_text = extract_text_from_pdf(full_path)
        if pdf_text:
            documents.append({
                "source": file_name,
                "content": pdf_text
            })

print(f"Loaded {len(documents)} PDF documents.")


Loaded 9 PDF documents.


## Text chunking

Here, I just broke down each rulebook’s full text into bite‑sized pieces for embedding and retrieval. I started with a simple `split_into_chunks()` function that splits any long string into segments of roughly 300 words by counting space‑delimited tokens. Then, I looped over every document I loaded earlier, running its content through that splitter, and collected each segment into `all_chunks`, annotating it with the original filename (`source`), the chunk text itself, and a unique `chunk_id` (filename plus chunk index). Finally, I printed the total number of chunks to confirm that my RAG pipeline has a nicely partitioned set of passages to embed and search over. This was done because splitting the text improves both embedding quality and retrieval performance.


In [7]:
def split_into_chunks(text, max_words=300):
    """
    Naive splitting by word count.
    Splits the text into segments of ~300 words each.
    """
    words = text.split()
    chunks = []
    current = []
    
    for w in words:
        current.append(w)
        if len(current) >= max_words:
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    
    return chunks

all_chunks = []
for doc in documents:
    chunks = split_into_chunks(doc["content"], max_words=300)
    for i, chunk_text in enumerate(chunks):
        all_chunks.append({
            "source": doc["source"],
            "text": chunk_text,
            "chunk_id": f"{doc['source']}_chunk_{i}"
        })

print(f"Total chunks created: {len(all_chunks)}")


Total chunks created: 219


## Embedding Generation

In this cell, I turned each 300‑word chunk into a semantic vector using Google’s GenAI embedding API. I started by defining `generate_embeddings()`, which takes a batch of texts and calls `genai.embed_content()` with the `text-embedding-004` model. If the response includes an `embedding` field, I return that list of vectors; otherwise I just loged an unexpected format or any errors and returned a list of `None` placeholders.

To keep the requests manageable, I looped over `all_chunks` in batches of 10, extracted the raw texts, and fed each batch into `generate_embeddings()` function. For each chunk that successfully returns an embedding vector, I built a new record including the original source, chunk text, chunk ID, and its embedding—and append it to `embedded_chunks`. At the end, I printed the total number of embedded chunks to verify that my semantic index is ready for similarity search.  


In [11]:
def generate_embeddings(batch_of_texts, embedding_model="models/text-embedding-004"):
    """
    Given a list of strings, returns a list of embedding vectors
    using google.generativeai's embed_content method.
    """
    try:
        
        response = genai.embed_content(content=batch_of_texts, model=embedding_model)      
        if "embedding" in response:
            return response["embedding"]
        else:
            print("Unexpected response format:", response)
            return [None] * len(batch_of_texts)
    except Exception as e:
        print(f"Embedding error: {e}")
        return [None] * len(batch_of_texts)




# Embedding all chunks (in small batches to avoid large requests)
embedded_chunks = []
BATCH_SIZE = 10

for i in range(0, len(all_chunks), BATCH_SIZE):
    batch = all_chunks[i:i+BATCH_SIZE]
    texts = [item["text"] for item in batch]
    embeddings = generate_embeddings(texts)
    
    for item, emb in zip(batch, embeddings):
        if emb is not None:
            embedded_chunks.append({
                "source": item["source"],
                "text": item["text"],
                "chunk_id": item["chunk_id"],
                "embedding": emb
            })

print(f"Successfully embedded {len(embedded_chunks)} chunks.")


Successfully embedded 219 chunks.


## Retrieval Augmented Generation (RAG) pipeline

In this cell, I implemented the core of the Retrieval‑Augmented Generation (RAG) pipeline: given a free‑form user query, I (1) embeded it into the same vector space as the document chunks, (2) computed cosine similarity between the query embedding and each chunk’s embedding, and (3) returned the **top _k_** most relevant chunks as context for generation.

1. *Helper: `cosine_similarity(vec_a, vec_b)`* 
   Computes the cosine of the angle between two vectors which is a normalized dot product so that higher values which are closer to 1 indicate greater semantic similarity.

2. *Function: `retrieve_top_k(query, data, k=3)`* 
   - Embed the query by calling `genai.embed_content()` with the same embedding model (`text-embedding-004`) used for the documents.  
   - Score each chunk by pairing its precomputed embedding with the query embedding, then measuring similarity.  
   - Sort and select the top _k_ chunks with the highest scores. These chunks are the most semantically related pieces of the rulebook to our question.

3. *Quick Test*  
   I then tried to run a sample query (“How do I set up the game board?”), which retrieved the top 3 chunks, and printed a preview of each to verify that our retrieval layer is correctly surfacing relevant rulebook passages.

This retrieval step ensures that before we hand anything to the text generation model, the vast rulebook is distilled into a few highly relevant excerpts making the subsequent answer both accurate and concise.  


In [12]:
import math
def cosine_similarity(vec_a, vec_b):
    """
    Compute the cosine similarity between two vectors.
    """
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    mag_a = math.sqrt(sum(a * a for a in vec_a))
    mag_b = math.sqrt(sum(b * b for b in vec_b))
    return dot / (mag_a * mag_b) if (mag_a != 0 and mag_b != 0) else 0.0

def retrieve_top_k(query, data, k=3, embedding_model="models/text-embedding-004"):
    """
    1) Embed the query using genai.embed_content.
    2) Compute cosine similarity with each chunk's embedding.
    3) Return the top-k chunks.
    """
    try:
        response = genai.embed_content(content=[query], model=embedding_model)
        if "embedding" in response:
            query_emb = response["embedding"][0]
        else:
            print("Unexpected response format when embedding query:", response)
            return []
    except Exception as e:
        print(f"Embedding error in retrieve_top_k: {e}")
        return []

    # Score each chunk
    scored = []
    for item in data:
        sim = cosine_similarity(query_emb, item["embedding"])
        scored.append((item, sim))

    # Sort by similarity in descending order
    scored.sort(key=lambda x: x[1], reverse=True)
    return [x[0] for x in scored[:k]]


# Quick test
test_query = "How do I set up the game board?"
top_chunks = retrieve_top_k(test_query, embedded_chunks, k=3)
for i, chunk in enumerate(top_chunks, 1):
    print(f"[{i}] Source: {chunk['source']} | Preview: {chunk['text'][:200]}...\n")


[1] Source: Intarsia rulebook.pdf | Preview: 4 frames per color. In a 3-player game , remove 2 frames per color. In a 4-player game , remove no frames. In a 2-player game , return all tool tiles with a 3(+)-player icon or 4-player icon on their ...

[2] Source: Intarsia rulebook.pdf | Preview: round, gain 1 point per connector on your floor board. If it’s the end of the 2nd round, gain 2 points per connector on your floor board. If it’s the end of the 3rd round, gain 3 points per connector ...

[3] Source: PointCity rulebook.pdf | Preview: and place them in piles with their resource side faceup. Based on the number of players, remove/use the following number of cards from each tier: Return the removed cards to the game box, they will no...



## Text Generation with Gemini-2.0-Flash

Here, the prompt is built by combining the retrieved context with the question, which is then passed to the Gemini-2.0-Flash text generation model. The output demonstrates Controlled Generation, using few-shot prompting wherein my prompt includes a few examples. This step shows how Gen AI can synthesize an answer from a given context.


In this cell, I implemented the synthesis phase of the RAG pipeline by taking the top‑K retrieved context chunks and the user’s query to produce a concise, accurate answer.

1. *Context Assembly*  
   I concatenated each retrieved chunk with its source filename and text preview into a single `context_text` block. This ensures the model sees all relevant rulebook excerpts at once.

2. *Prompt Construction using Few‑Shot Prompting*  
   I crafted a clear, few‑shot‑inspired prompt that:
   - Frames the assistant as an expert on board game rules  
   - Injects the concatenated context under a `CONTEXT:` header  
   - Poses the user’s question under `QUESTION:`  
   - Leaves an open `ANSWER:` slot for the model’s response  

This structure helps Gemini‑2.0‑Flash to stay focused on the given facts and deliver a directly relevant answer. Finally, I rendered the model’s response as a Markdown cell so it appears formatted and easy to read in the notebook.


In [13]:
def generate_answer(query, retrieved_context, text_model="gemini-2.0-flash"):
    """
    Use Google GenAI's generate_content to produce an answer to `query`
    by incorporating the retrieved context.
    """
    # Combine the retrieved context into one text block.
    context_text = "\n\n".join(
        [f"Source: {c['source']}\n{c['text']}" for c in retrieved_context]
    )
    
    # Build the prompt that includes context and the query using few shot prompting.
    prompt = f"""
You are an expert on board game rules.
Use the following context to answer the user's question accurately and concisely.

CONTEXT:
{context_text}

QUESTION: {query}

ANSWER:
"""
    try:
        answer = client.models.generate_content(
            model=text_model,
            contents=prompt
        )
        return answer.text  
    except Exception as e:
        return f"Error generating text: {e}"


final_answer = generate_answer("How do I set up the game board?", top_chunks)
from IPython.display import Markdown
Markdown(f"**Answer:** {final_answer}")


**Answer:** To set up the game board for Intarsia, do the following:

1.  Each player chooses a scoring marker and places it on space 0 of the score track on the score board.
2.  Place one of the light brown markers as the reward marker onto the reward track space with the arrow.
3.  Stack the 4 point tiles in the center of the score board.
4.  Place the second light brown marker as the round marker onto space 1 of the round track.
5.  Each player takes the floor board whose wall color matches their chosen scoring marker. Decide together which side to use (A or B). Then, each player places their floor board with that side facing up in front of them.
6.  Each player takes 1 connector from the supply and places it in their floor board’s center.


## Interactive QnA Loop

In this final cell, I tried to put everything together into a simple command‑line interface so users can ask arbitrary board game rule questions. For each query, I called the retrieval function to embed the user’s question with `text-embedding-004` model, then compute cosine similarities against our pre‑computed chunk embeddings, and pick the top 3 most relevant passages. Then, I passed those retrieved chunks plus the original question into my text generation function where Gemini‑2.0‑Flash helps to produce a concise, context‑grounded answer.

This loop makes our RAG pipeline fully interactive letting anyone ask “How do I set up the game board?”, “What happens on a tie?”, or any other rule query and get back a sub‑second, accurate answer drawn straight from the official PDF rulebooks.  


In [14]:
while True:
    user_query = input("\nEnter your board game rule question (or type 'exit'): ")
    if user_query.lower() in ["exit", "quit"]:
        break
    
   
    context_chunks = retrieve_top_k(user_query, embedded_chunks, k=3)
    
    answer = generate_answer(user_query, context_chunks)
    
    print("\nRetrieved Context (preview):")
    for idx, chunk in enumerate(context_chunks, 1):
        print(f"\n[{idx}] Source: {chunk['source']} | Preview: {chunk['text'][:200]}...")
    print("\nFinal Answer:\n", answer)



Enter your board game rule question (or type 'exit'):  chess rules



Retrieved Context (preview):

[1] Source: Chess rulebook.pdf | Preview: BASIC RULES OFCHESS Introduction Chess isagame ofstrategy believ edtohavebeen invented mor ethen 1500 years ago inIndia. Itisagame fortwo play - ers,one with the light pieces and one with the darkpiec...

[2] Source: Chess rulebook.pdf | Preview: during agame and only when certain conditions aremet. Castling isaspecial move that lets aplay ermovetwo pieces atonce -theKingand one 2 Rook. Incastling, the play ermoveshisKing twosquar esei- ther t...

[3] Source: Chess rulebook.pdf | Preview: one light-squar eBishop .The Bishop 'smoves areshownbythehighlighted squar esinthefollo wing chess- boar d.This black Bishop can captur ethewhite pawn but its path isblocked bytheblack Knight ifitwa...

Final Answer:
 Okay, I can help you with chess rules. Here's a summary of the basic rules, based on the provided text:

**Objective:**

*   The main goal of chess is to checkmate the opponent's King.

**Board Setup:**

*   The c


Enter your board game rule question (or type 'exit'):  azul rules



Retrieved Context (preview):

[1] Source: Azul rulebook.pdf | Preview: Rulebook Center of the table FGame Setup 1. Give each player a player board (A). Flip your board to the side with the colored wall. (See Variant Play to play with the gray side of the player board). E...

[2] Source: Azul rulebook.pdf | Preview: palace in Portugal be decorated with similar wall tiles. Azul brings you, a tile laying artist, to embellish the walls of the Royal Palace of Evora. A GAME BY MICHAEL KIESLING Wall Pattern lines Score...

[3] Source: Eldritch rulebook.pdf | Preview: game if they solve three of the Ancient One’s Mysteries. If the Ancient One awakens, investigators will also need to solve the Final Mystery in order to win the game (see page 16). Azathoth Mystery Ca...

Final Answer:
 Azul is a tile-laying game where players compete to embellish the walls of the Royal Palace of Evora. Here's a breakdown of the gameplay:

**Setup:**

1.  Each player takes a player board and places their scoring


Enter your board game rule question (or type 'exit'):  exit


## Gen AI Capabilities Demonstrated

### 1. Document Understanding
- **PDF Extraction:** Uses PyPDF2 to automatically extract text from board game rule PDFs.  
- **Chunking:** Splits long texts into manageable chunks to make retrieval more efficient and context‑relevant.

### 2. Embeddings
- **Semantic Vectors:** Generates embeddings via Google GenAI’s `embed_content` API (model: `text-embedding-004`), turning each text chunk into a high‑dimensional vector that captures its meaning.

### 3. Retrieval Augmented Generation (RAG) & Vector Search
- **Custom Retrieval:**  Implements a custom retrieval function that computes cosine similarity between a query embedding and each chunk's embedding, then retrieves the top-k most relevant chunks.
- **In‑Memory Vector Search:** This simple in-memory retrieval strategy exemplifies vector search techniques.

### 4. Controlled Generation (Few‑Shot Prompting)
- **Structured Prompting:** Builds a clear, few‑shot prompt combining retrieved context with the user’s question.  
- **GenAI Synthesis:** Sends the prompt to a text generation endpoint using Gemini‑2.0‑Flash via `client.models.generate_content` to produce a concise, accurate answer grounded in the rulebook.


# Project Summary

## Innovation
- **Real‑world Impact:** Helps new players rapidly understand complex board‑game rulebooks without manual page‑by‑page searching.  
- **Integrated Pipeline:** The pipeline intelligently integrates document understanding, embeddings, and custom vector search to improve data retrieval, and controlled text generation to produce concise answers in seconds.

## Gen AI Capabilities Demonstrated
1. **Document Understanding**: Automated extraction and chunking of lengthy PDF rulebooks using PyPDF2.  
2. **Embeddings & Vector Search**: Converting text chunks into semantic embeddings and retrieving them via cosine similarity based top‑k retrieval for in‑memory vector search.  
3. **Retrieval Augmented Generation (RAG)**: Retrieved the most relevant context chunks to ground the answer.  
4. **Controlled Generation (Few‑Shot Prompting)**: Constructing a detailed prompt that guides the text generation process.

## Documentation Quality

The notebook clearly outlines the problem, describes the solution’s workflow, and explains every step with code and markdown. It communicates how each Gen AI capability contributes to solving the problem, ensuring that readers understand both the technical implementation and the innovation behind the solution.
