# FAISS Indexing and Query

This notebook loads the precomputed block embeddings from `workspace_with_embeddings.json` into a FAISS index and demonstrates semantic search over them.

In [1]:
import json
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
from typing import List, Dict, Tuple


In [2]:
# Load the workspace with embeddings
with open('workspace_with_embeddings.json', 'r') as f:
    workspace_data = json.load(f)

print(f"Document ID: {workspace_data['doc_id']}")
print(f"Number of pages: {workspace_data['num_pages']}")
print(f"Number of blocks: {len(workspace_data['blocks'])}")

Document ID: chess_pdf
Number of pages: 95
Number of blocks: 1943


In [3]:
# Extract embeddings and metadata
embeddings = []
valid_blocks = []

for block in workspace_data['blocks']:
    if block.get('embedding') is not None:
        embeddings.append(block['embedding'])
        valid_blocks.append(block)

print(f"Blocks with valid embeddings: {len(embeddings)}")
print(f"Embedding dimension: {len(embeddings[0])}")

Blocks with valid embeddings: 1941
Embedding dimension: 384
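The dimension printed above is read from the first embedding only. Before converting the list to a NumPy array, it can be worth confirming that every vector has the same length and contains only finite numbers, since a ragged or NaN-containing list tends to surface later as a confusing shape or search error. The following is a minimal pure-Python sketch; `check_embeddings` and the toy vectors are hypothetical and not part of the notebook.

```python
import math
from typing import List, Sequence


def check_embeddings(vectors: Sequence[Sequence[float]]) -> int:
    """Return the shared dimension, or raise ValueError on malformed input."""
    if not vectors:
        raise ValueError("no embeddings found")
    dim = len(vectors[0])
    for i, vec in enumerate(vectors):
        # Every vector must match the dimension of the first one.
        if len(vec) != dim:
            raise ValueError(f"embedding {i} has length {len(vec)}, expected {dim}")
        # NaN or infinite components would corrupt distance computations.
        if not all(math.isfinite(x) for x in vec):
            raise ValueError(f"embedding {i} contains NaN or infinity")
    return dim


# Toy data standing in for the `embeddings` list built above (made-up values).
toy_embeddings: List[List[float]] = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
dim = check_embeddings(toy_embeddings)
```

In the notebook, the same check could be applied to `embeddings` directly before the `np.array(...)` call.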


In [4]:
# Convert embeddings to numpy array
embeddings_array = np.array(embeddings, dtype='float32')
print(f"Embeddings array shape: {embeddings_array.shape}")

Embeddings array shape: (1941, 384)


In [5]:
# Create FAISS index
dimension = embeddings_array.shape[1]

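# IndexFlatL2 performs exact (brute-force) search and returns squared L2 distances.
# Alternative: faiss.IndexFlatIP returns inner products, which equal cosine similarity for normalized vectors.
index = faiss.IndexFlatL2(dimension)

# Normalize embeddings in place to unit length; on unit vectors, ranking by
# L2 distance gives the same order as ranking by cosine similarity.
faiss.normalize_L2(embeddings_array)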

# Add vectors to the index
index.add(embeddings_array)

print(f"FAISS index created with {index.ntotal} vectors")

FAISS index created with 1941 vectors
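A flat index does no clustering or compression: it compares the query against every stored vector. As documented by FAISS, `IndexFlatL2` reports squared Euclidean distances, smallest first. The pure-Python sketch below mimics that behaviour on toy data to make the semantics concrete. `flat_l2_search` is a hypothetical helper, not a FAISS function, and is far slower than the real index.

```python
from typing import List, Sequence, Tuple


def flat_l2_search(
    vectors: Sequence[Sequence[float]], query: Sequence[float], k: int
) -> List[Tuple[int, float]]:
    """Exhaustive nearest-neighbour search returning (index, squared L2 distance)."""
    scored = [
        (sum((q - x) ** 2 for q, x in zip(query, vec)), i)
        for i, vec in enumerate(vectors)
    ]
    scored.sort()  # ascending distance, ties broken by position
    return [(i, d) for d, i in scored[:k]]


# Made-up unit vectors standing in for the normalized embeddings.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
hits = flat_l2_search(toy_vectors, [1.0, 0.0], k=2)
```

Here the query matches the first vector exactly, so it comes back at distance 0, followed by the third vector.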


In [7]:
# Load the same model used for creating embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')
print(f"Model loaded: {model}")

Model loaded: SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False, 'architecture': 'BertModel'})
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)


In [8]:
def search_documents(query: str, k: int = 5) -> List[Tuple[Dict, float]]:
    """
    Search for the top-k most similar documents to the query.
    
    Args:
        query: Search query text
        k: Number of results to return
    
    Returns:
        List of (block, distance) tuples, nearest first; distance is the
        squared L2 distance between the normalized query and block vectors
    """
    # Encode the query
    query_embedding = model.encode([query], convert_to_numpy=True)
    query_embedding = query_embedding.astype('float32')
    
    # Normalize for cosine similarity
    faiss.normalize_L2(query_embedding)
    
    # Search the index
    distances, indices = index.search(query_embedding, k)
    
    # Retrieve matching blocks (FAISS pads with index -1 when fewer than k results exist)
    results = []
    for idx, distance in zip(indices[0], distances[0]):
        if idx == -1:
            continue
        results.append((valid_blocks[idx], float(distance)))
    
    return results

In [9]:
def display_results(results: List[Tuple[Dict, float]]):
    """
    Display search results in a readable format.
    """
    for i, (block, distance) in enumerate(results, 1):
        similarity_score = 1 - distance  # distance is squared L2; for unit vectors this equals 2*cos - 1 (range -1 to 1), not the cosine itself
        print(f"\n{'='*80}")
        print(f"Result #{i} (Similarity: {similarity_score:.4f})")
        print(f"{'='*80}")
        print(f"Page: {block['page_num']}")
        print(f"Block Index: {block['block_idx']}")
        print(f"Type: {block['type']}")
        if block.get('section_path'):
            print(f"Section: {block['section_path']}")
        print(f"\nText:\n{block['text']}")
        print(f"\nCharacter Count: {block['char_count']}")
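For unit-length vectors, squared L2 distance and cosine similarity are tied together by d² = 2 − 2·cos, so cos = 1 − d²/2. Since `IndexFlatL2` returns squared distances, the `1 - distance` value printed by `display_results` corresponds to 2·cos − 1 rather than the cosine itself. The sketch below checks the identity numerically on two made-up vectors; the helper names are hypothetical and only the standard library is used.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_from_squared_l2(d2: float) -> float:
    """Recover cosine similarity from squared L2 distance (unit vectors only)."""
    return 1.0 - d2 / 2.0


a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 0.5, 1.0])

d2 = squared_l2(a, b)
cos_direct = sum(x * y for x, y in zip(a, b))  # dot product of unit vectors
cos_from_d2 = cosine_from_squared_l2(d2)
```

If a true cosine score is wanted in the display, `1 - distance / 2` would be the matching conversion under the same unit-vector assumption.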

In [10]:
# Example Query 1: Search for opening moves
query1 = "How should I start a chess game? What are good opening moves?"
print(f"Query: {query1}\n")

results1 = search_documents(query1, k=5)
display_results(results1)

Query: How should I start a chess game? What are good opening moves?


Result #1 (Similarity: 0.1230)
Page: 0
Block Index: 9
Type: h1
Section: CHESS FUNDAMENTALS

Text:
CHESS FUNDAMENTALS

Character Count: 18

Result #2 (Similarity: 0.1230)
Page: 3
Block Index: 19
Type: h1
Section: CHESS FUNDAMENTALS

Text:
CHESS FUNDAMENTALS

Character Count: 18

Result #3 (Similarity: 0.1144)
Page: 12
Block Index: 6
Type: body
Section: EXAMPLE 13. > 7. B - Kt 5

Text:
First, the complete development of the opening has taken only seven moves. (This varies up to ten or twelve moves in some very exceptional cases. As a rule, eight should be enough.) Second, Black has {28} been compelled to exchange a Bishop for a Knight, but as a compensation he has isolated White's Q R P and doubled a Pawn. (This, at such an early stage of the game, is rather an advantage for White, as the Pawn is doubled towards the centre of the board.) Third, White by the exchange brings up a Pawn to control the square Q 4, puts Bla

In [11]:
# Example Query 2: Search for endgame strategy
query2 = "What are the key principles in chess endgame?"
print(f"Query: {query2}\n")

results2 = search_documents(query2, k=5)
display_results(results2)

Query: What are the key principles in chess endgame?


Result #1 (Similarity: 0.5253)
Page: 1
Block Index: 23
Type: h1
Section: FURTHER PRINCIPLES IN END-GAME PLAY

Text:
FURTHER PRINCIPLES IN END-GAME PLAY

Character Count: 35

Result #2 (Similarity: 0.5253)
Page: 15
Block Index: 0
Type: h1
Section: FURTHER PRINCIPLES IN END-GAME PLAY

Text:
FURTHER PRINCIPLES IN END-GAME PLAY

Character Count: 35

Result #3 (Similarity: 0.3934)
Page: 0
Block Index: 9
Type: h1
Section: CHESS FUNDAMENTALS

Text:
CHESS FUNDAMENTALS

Character Count: 18

Result #4 (Similarity: 0.3934)
Page: 3
Block Index: 19
Type: h1
Section: CHESS FUNDAMENTALS

Text:
CHESS FUNDAMENTALS

Character Count: 18

Result #5 (Similarity: 0.3357)
Page: 2
Block Index: 21
Type: h1
Section: END-GAME STRATEGY

Text:
END-GAME STRATEGY

Character Count: 17
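
The result lists above contain repeated heading blocks: the same h1 text is returned from two different pages with identical scores. One possible post-processing step, not part of the original notebook, is to keep only the first (nearest) hit per distinct text. A minimal sketch, assuming results are `(block, distance)` tuples ordered nearest first as returned by `search_documents`; `dedupe_by_text` is a hypothetical helper and the distances in the toy data are made up.

```python
from typing import Dict, List, Tuple


def dedupe_by_text(results: List[Tuple[Dict, float]]) -> List[Tuple[Dict, float]]:
    """Keep the first occurrence of each distinct block text, preserving order."""
    seen = set()
    unique = []
    for block, distance in results:
        key = block["text"].strip().lower()  # case-insensitive, whitespace-trimmed
        if key in seen:
            continue
        seen.add(key)
        unique.append((block, distance))
    return unique


# Toy results shaped like the output of search_documents (made-up distances).
toy_results = [
    ({"text": "CHESS FUNDAMENTALS", "page_num": 0}, 0.6),
    ({"text": "CHESS FUNDAMENTALS", "page_num": 3}, 0.6),
    ({"text": "END-GAME STRATEGY", "page_num": 2}, 0.7),
]
unique_results = dedupe_by_text(toy_results)
```

Because deduplication can shrink the list below `k`, it may help to request more than `k` hits from the index before filtering.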


In [12]:
# Example Query 3: Search for pawn structure
query3 = "Tell me about pawn structure and pawn advantages"
print(f"Query: {query3}\n")

results3 = search_documents(query3, k=5)
display_results(results3)

Query: Tell me about pawn structure and pawn advantages


Result #1 (Similarity: 0.2238)
Page: 5
Block Index: 11
Type: body
Section: 2...K - B 1; 3 K - B 3, K - K 1; 4 K - K 4, K - Q 1; 5 K - Q 5, K - B 1; 6 K - Q 6. > 2. PAWN PROMOTION

Text:
The gain of a Pawn is the smallest material advantage that can be obtained in a game; and it often is sufficient to win, even when the Pawn is the only remaining unit, apart from the Kings. It is essential, speaking generally, that

Character Count: 231

Result #2 (Similarity: 0.2109)
Page: 52
Block Index: 6
Type: h2
Section: FURTHER OPENINGS AND MIDDLE-GAMES > 31. SOME SALIENT POINTS ABOUT PAWNS

Text:
31. SOME SALIENT POINTS ABOUT PAWNS

Character Count: 35

Result #3 (Similarity: 0.1843)
Page: 27
Block Index: 0
Type: body
Section: PLANNING A WIN IN MIDDLE-GAME PLAY > 19. WINNING BY INDIRECT ATTACK

Text:
Pawns.

Character Count: 6

Result #4 (Similarity: 0.1843)
Page: 89
Block Index: 0
Type: body
Section: GAME 14. QUEEN'S GAMBIT DECLINED > 32.

In [13]:
# Save the FAISS index for later use
faiss.write_index(index, 'chess_pdf.faiss')
print("FAISS index saved to chess_pdf.faiss")

FAISS index saved to chess_pdf.faiss


In [None]:
# Optional: Load the index later
# loaded_index = faiss.read_index('chess_pdf.faiss')
# print(f"Loaded index with {loaded_index.ntotal} vectors")

## Statistics and Analysis

In [14]:
# Analyze blocks by type
from collections import Counter

block_types = Counter(block['type'] for block in valid_blocks)
print("Block distribution by type:")
for block_type, count in block_types.most_common():
    print(f"  {block_type}: {count}")

# Analyze blocks by page
blocks_per_page = Counter(block['page_num'] for block in valid_blocks)
print(f"\nAverage blocks per page: {np.mean(list(blocks_per_page.values())):.2f}")
print(f"Max blocks on a page: {max(blocks_per_page.values())}")
print(f"Min blocks on a page: {min(blocks_per_page.values())}")

Block distribution by type:
  body: 1039
  h2: 422
  skip: 408
  h1: 66
  h3: 6

Average blocks per page: 20.43
Max blocks on a page: 34
Min blocks on a page: 8
