# Basic RAG Pipeline Implementation


### Intro

This notebook implements a Retrieval-Augmented Generation (RAG) pipeline using classes from the src directory. The implementation is divided into the following pipeline stages:


1. Document Processing: Token-aware chunking with overlap
2. Embedding Generation: Semantic vectorization
3. Vector Storage: PostgreSQL + pgvector indexing
4. Retrieval: Cosine similarity search
5. Generation: Context-augmented LLM completion

### Components

**TextProcessor** - Text segmentation and token budget management.

* Token-based chunking (512 tokens, 50 token overlap)
* Uses cl100k_base tokenizer for GPT compatibility
* Adaptive context assembly within token budgets
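
The overlap scheme described above can be sketched as a sliding window over a token list. This is a minimal illustration, not the actual `TextProcessor` code: the function name `chunk_tokens` is hypothetical, and plain strings stand in for cl100k_base token ids so that the example runs without a tokenizer. The output keys mirror the chunk dicts printed later in this notebook.

```python
from typing import Dict, List


def chunk_tokens(tokens: List[str], chunk_size: int = 512, overlap: int = 50) -> List[Dict]:
    """Split a token sequence into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each iteration
    chunks: List[Dict] = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunks.append({
            "chunk_id": len(chunks),
            "tokens": tokens[start:end],
            "token_count": end - start,
            "start_token": start,
            "end_token": end,
        })
        if end == len(tokens):
            break  # the window reached the end; stop before emitting a redundant tail
        start += step
    return chunks


# Stand-in tokens; a real pipeline would pass tokenizer output here.
demo = chunk_tokens([f"t{i}" for i in range(1000)], chunk_size=512, overlap=50)
print([(c["start_token"], c["end_token"]) for c in demo])
```

With 512-token windows and a 50-token overlap, each window starts 462 tokens after the previous one, so a document shorter than 512 tokens always yields a single chunk.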

**HuggingFaceClient** - Embedding generation and LLM inference.

* Embeddings: Local sentence-transformers (all-MiniLM-L6-v2, 384-dim)
* Generation: Remote Hugging Face Inference API (default: Mistral-7B-Instruct)

**PgVectorClient** - PostgreSQL interface with vector similarity search.

* Stores embeddings as VECTOR(384) with chunk metadata
* Uses ivfflat indexing for approximate nearest neighbor search
* Nearest-neighbor search via the <=> cosine distance operator (similarity = 1 - distance)
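
In pgvector, `<=>` returns cosine distance rather than similarity, which is why the SQL queries later in this notebook compute `1 - (embedding <=> ...)`. The relationship can be checked in plain Python. The helper names below are illustrative and are not part of the project's `src` code.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cos(angle between a and b), the quantity pgvector's <=> computes."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Convert a cosine distance back into a similarity score."""
    return 1.0 - cosine_distance(a, b)


print(f"{cosine_distance([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]):.4f}")
print(f"{cosine_distance([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]):.4f}")
```

The two test vectors are the same ones used in the pgvector operator check in the Debug search section, so the printed distances can be compared against the database output there.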


In [44]:
import logging
import os
import sys
from typing import List, Dict, Optional
# from pathlib import Path


# import psycopg2
# from psycopg2.extras import execute_values, Json
# from pgvector.psycopg2 import register_vector
# from huggingface_hub import InferenceClient
# from transformers import pipeline


logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


sys.path.insert(0, '../src/')


from dotenv import load_dotenv

# Setup
load_dotenv()


True

In [45]:
from text_processor import TextProcessor
from pgvector_client import PgVectorClient
from hf_client import HuggingFaceClient
# from rag_handler import PgVectorRAG


In [46]:
PG_CONN_STRING = os.getenv("PG_CONNECTION_STRING")
HF_TOKEN = os.getenv("HF_TOKEN")

file_paths = [
    "../documents/policy.txt",
    "../documents/basic_info.md",
    # Add more files
]

BATCH_SIZE = 32      # texts per embedding batch
EMBEDDING_DIM = 384  # output dimension of all-MiniLM-L6-v2

CHUNK_SIZE = 512
CHUNK_OVERLAP = 50
MAX_CONTEXT_TOKENS = 2000

EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
LLM_MODEL = "mistralai/Mistral-7B-Instruct-v0.2"

# Example questions for querying the system
questions = [
    "What is mario's email?",
    "How long does shipping take?",
    "Where there any projects with recommendation systems done by Mario?",
    "Does mario like data science?"
]


SIMILARITY_THRESHOLD=0.2
K=10
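
To make the roles of `SIMILARITY_THRESHOLD` and `K` concrete, here is a small in-memory sketch of thresholded top-k retrieval. It assumes the database search behaves like the raw SQL used later in this notebook (keep rows whose similarity is strictly above the threshold, order by similarity, then limit to k). The `top_k` helper and the toy 2-dimensional embeddings are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def _cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query: Sequence[float], rows: List[Dict], k: int = 10,
          similarity_threshold: float = 0.2) -> List[Dict]:
    """Score every row, drop rows at or below the threshold, return the k best."""
    scored = [{**row, "similarity": _cosine_similarity(query, row["embedding"])} for row in rows]
    kept = [row for row in scored if row["similarity"] > similarity_threshold]
    kept.sort(key=lambda row: row["similarity"], reverse=True)
    return kept[:k]


# Toy rows; real rows would carry 384-dimensional embeddings.
rows = [
    {"chunk_id": 0, "embedding": [1.0, 0.0]},
    {"chunk_id": 1, "embedding": [0.6, 0.8]},
    {"chunk_id": 2, "embedding": [0.0, 1.0]},
]
hits = top_k([1.0, 0.0], rows, k=10, similarity_threshold=0.2)
print([h["chunk_id"] for h in hits])
```

The threshold filters before the limit applies, so a threshold that is too high returns an empty list regardless of `K`.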

In [47]:
# Initialize components 
text_processor = TextProcessor(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP,
    max_context_tokens=MAX_CONTEXT_TOKENS
)

db_client = PgVectorClient(
    connection_string=PG_CONN_STRING,
    embedding_dim=EMBEDDING_DIM  # Match embedding model output
)

hf_client = HuggingFaceClient(
    hf_token=HF_TOKEN,
    embedding_model=EMBEDDING_MODEL,
    llm_model=LLM_MODEL,
    # use_remote_llm=True
)


INFO:pgvector_client:Database connected
INFO:pgvector_client:pgvector extension enabled
INFO:pgvector_client:Database schema created
INFO:hf_client:Initialized embedding model: sentence-transformers/all-MiniLM-L6-v2
INFO:hf_client:Initialized LLM: mistralai/Mistral-7B-Instruct-v0.2


## Load data

In [12]:
all_chunks = []  # list of chunk dicts (content, chunk_id, token_count, start_token, end_token, source)

# Process each file
for file_path in file_paths:
    chunks = text_processor.chunk_file(file_path)
    all_chunks.extend(chunks)
    logger.info(f"Processed {file_path}: {len(chunks)} chunks")
    logger.info(f"chunk length {len(chunks)}, {chunks[:10]}")
    logger.info(f"total chunks: {len(all_chunks)}, {all_chunks[-10:]}")

logger.info(f"Total chunks from all files: {len(all_chunks)}")

INFO:text_processor:Chunked policy.txt: 1 chunks
INFO:__main__:Processed ../documents/policy.txt: 1 chunks
INFO:__main__:chunk length 1, [{'content': 'Our refund policy allows returns within 35 days of purchase. Full refunds are provided for unopened items.\n', 'chunk_id': 0, 'token_count': 21, 'start_token': 0, 'end_token': 21, 'source': 'policy.txt'}]
INFO:__main__:total chunks: 1, [{'content': 'Our refund policy allows returns within 35 days of purchase. Full refunds are provided for unopened items.\n', 'chunk_id': 0, 'token_count': 21, 'start_token': 0, 'end_token': 21, 'source': 'policy.txt'}]
INFO:text_processor:Chunked basic_info.md: 2 chunks
INFO:__main__:Processed ../documents/basic_info.md: 2 chunks
INFO:__main__:chunk length 2, [{'c

In [13]:
all_chunks[0]

{'content': 'Our refund policy allows returns within 35 days of purchase. Full refunds are provided for unopened items.\n',
 'chunk_id': 0,
 'token_count': 21,
 'start_token': 0,
 'end_token': 21,
 'source': 'policy.txt'}

In [18]:
# Generate embeddings for all chunks
texts = [chunk['content'] for chunk in all_chunks]
embeddings = hf_client.get_embeddings(texts)

logger.info(f"Generated embeddings for {len(embeddings)} chunks")

# Add embeddings to chunks
for i, chunk in enumerate(all_chunks):
    embedding = embeddings[i]
    if hasattr(embedding, 'tolist'):
        embedding = embedding.tolist()
    chunk['embedding'] = embedding
    # logger.info(f"Generated embeddings for {len(embeddings)} chunks")

INFO:hf_client:Generated 3 embeddings
INFO:__main__:Generated embeddings for 3 chunks


In [None]:

# Insert all chunks
db_client.insert_chunks(all_chunks)
logger.info(f"Inserted {len(all_chunks)} chunks into database")


In [5]:
# chunks = []    
# for file_path in file_paths:        
#     try:            
#         file_chunks = text_processor.chunk_file(file_path)            
#         chunks.extend(file_chunks)        
#     except Exception as e:            
#         logger.error(f"Failed to chunk {file_path}: {e}")



INFO:text_processor:Chunked policy.txt: 1 chunks
INFO:text_processor:Chunked basic_info.md: 2 chunks


In [19]:
for ind, chunk in enumerate(chunks):
    logger.info(f"chunk number {ind}, chunk len {len(chunk['content'])}: {chunk}")
    logger.info("\n")

INFO:__main__:chunk number 0, chunk len 2646: {'content': '# Cover letter\n\n## Basic info\n\n\nThis is a document with CV summary\n\n\nName: Mario (Marvin) Theplumber\n\nemail: mario.thplumber@gmail.com\nphone: +72787226083\n\nNorth Holland, Netherlands\nhttps://github.com/razmarrus\nhttps://www.linkedin.com/in/razmarrus/\n\nnpm i jsonresume-theme-caffeine-tweaked\nresume export --theme caffeine-tweaked resume.pdf\n\nadress: Beethovenstraat 22-1 1099 LK Rotterdam\nNetherlands \n\nEducation: Brazil State University of Informatics\n\n\n## About me / Role description\n\nAs a results-oriented Data Scientist, I believe that combining Machine Learning and Statistics provides businesses with solid answers to their questions. While staying close to the data, I work closely with my business-oriented colleagues to translate business goals into achievable objectives using models.\n\nMy focus is on customer and financial analytics, where I assist in improving marketing strategies with a better un

In [7]:

# #  Generate embeddings in batches (HuggingFaceClient)
# texts = [chunk["content"] for chunk in chunks]
# embeddings = []

# for i in range(0, len(texts), BATCH_SIZE):
#     batch = texts[i:i + BATCH_SIZE]
#     batch_embeddings = hf_client.get_embeddings(batch)
#     embeddings.extend(batch_embeddings)   # adds each embedding vector
#     logger.info(f"Embedded batch {i // BATCH_SIZE + 1}/{(len(texts) + BATCH_SIZE - 1) // BATCH_SIZE}")

INFO:hf_client:Generated 3 embeddings
INFO:__main__:Embedded batch 1/1
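
The batching loop above can be factored into a small reusable helper. This is a sketch under stated assumptions: `embed_fn` stands for any callable that maps a list of texts to a list of vectors (such as `hf_client.get_embeddings`), and `fake_embed` is a made-up stand-in so the example runs without a model.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int = 32,
) -> List[List[float]]:
    """Embed texts batch by batch, preserving input order."""
    embeddings: List[List[float]] = []
    for batch in batched(texts, batch_size):
        embeddings.extend(embed_fn(batch))
    if len(embeddings) != len(texts):
        raise RuntimeError("embed_fn returned a different number of vectors than texts")
    return embeddings


def fake_embed(batch: Sequence[str]) -> List[List[float]]:
    """Stand-in embedder: one 1-dimensional vector per text (its length)."""
    return [[float(len(text))] for text in batch]


vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(vectors)
```

The length check at the end guards against an embedder that silently drops inputs, which would otherwise misalign chunks and vectors at insert time.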


In [20]:
# # Insert into database (PgVectorDB)
# db_client.insert_chunks(chunks) # embeddings

In [19]:


# # Log statistics (TextProcessor)
# stats = text_processor.get_chunk_stats(chunks)
# logger.info(
#     f"Loaded {stats['total_chunks']} chunks, "
#     f"avg {stats['avg_tokens']:.0f} tokens/chunk"
# )

## Query the model

In [None]:
# def query_rag(question, text_processor, hf_client, db_client, k=5):
#     """
#     Execute RAG query: embed → search → assemble context → generate answer.
    
#     Args:
#         question: User query
#         text_processor: TextProcessor instance
#         hf_client: HuggingFaceClient instance
#         db_client: PgVectorDB instance
#         k: Number of chunks to retrieve
        
#     Returns:
#         dict: {"answer": str, "sources": list, "num_chunks": int}
#     """

#     # 1. Embed question
#     query_embedding = hf_client.get_embeddings([question])[0]
#     if not isinstance(query_embedding, list):
#         query_embedding = query_embedding.tolist()
    
#     # 2. Search database
#     chunks = db_client.search(
#         query_embedding, 
#         k=k, 
#         similarity_threshold=0.3
#     )
    
#     if not chunks:
#         return {
#             "answer": "No relevant information found.",
#             "sources": [],
#             "num_chunks": 0
#         }
    
#     # 3. Assemble context
#     context = text_processor.assemble_context(chunks, question=question)
    
#     # 4. Generate answer
#     try:
#         answer = hf_client.generate_answer(question, context)
#         if not answer or len(answer) < 10:
#             answer = text_processor.create_fallback(chunks)
#     except Exception:
#         answer = text_processor.create_fallback(chunks)
    
#     return {
#         "answer": answer,
#         "sources": chunks,
#         "num_chunks": len(chunks)
#     }



In [50]:

# result = query_rag(
#     question="what is refund policy?",
#     text_processor=text_processor,
#     hf_client=hf_client,
#     db_client=db_client,
#     k=5
# )

# print(result["answer"])


### Create embeddings

In [28]:
question = "Cover letter"  # alternative: "refund policy"

print(hf_client.get_embeddings([question]))

query_embedding = hf_client.get_embeddings([question])[0]
# if not isinstance(query_embedding, list):
#     query_embedding = query_embedding.tolist()
query_embedding


INFO:hf_client:Generated 1 embeddings
INFO:hf_client:Generated 1 embeddings


[[-0.07321703433990479, 0.10643117874860764, 0.07387036830186844, 0.078147754073143, 0.046925924718379974, 0.0721328854560852, 0.013906000182032585, -0.02369273081421852, -0.0615106038749218, -0.06274979561567307, -0.018684348091483116, 0.025482309982180595, 0.012852526269853115, -0.04796987399458885, -0.04774804040789604, -0.01256596390157938, -0.011400102637708187, -0.04338683560490608, -0.05103318765759468, -0.01570521667599678, -0.06017197296023369, 0.08440078794956207, 0.08455660939216614, -0.03912951424717903, -0.06467852741479874, 0.045355647802352905, 0.018806571140885353, -0.045225538313388824, -0.07094831764698029, -0.04065645858645439, 0.033490847796201706, 0.021872708573937416, 0.025714468210935593, 0.021797820925712585, 0.09821108728647232, 0.07279461622238159, -0.07994870096445084, 0.059448935091495514, -0.01713724248111248, 0.0026676850393414497, -0.016397647559642792, -0.05995016172528267, -0.03052023984491825, 0.051142700016498566, 0.017188958823680878, 0.0141248665750

[-0.07321703433990479,
 0.10643117874860764,
 0.07387036830186844,
 0.078147754073143,
 0.046925924718379974,
 0.0721328854560852,
 0.013906000182032585,
 -0.02369273081421852,
 -0.0615106038749218,
 -0.06274979561567307,
 -0.018684348091483116,
 0.025482309982180595,
 0.012852526269853115,
 -0.04796987399458885,
 -0.04774804040789604,
 -0.01256596390157938,
 -0.011400102637708187,
 -0.04338683560490608,
 -0.05103318765759468,
 -0.01570521667599678,
 -0.06017197296023369,
 0.08440078794956207,
 0.08455660939216614,
 -0.03912951424717903,
 -0.06467852741479874,
 0.045355647802352905,
 0.018806571140885353,
 -0.045225538313388824,
 -0.07094831764698029,
 -0.04065645858645439,
 0.033490847796201706,
 0.021872708573937416,
 0.025714468210935593,
 0.021797820925712585,
 0.09821108728647232,
 0.07279461622238159,
 -0.07994870096445084,
 0.059448935091495514,
 -0.01713724248111248,
 0.0026676850393414497,
 -0.016397647559642792,
 -0.05995016172528267,
 -0.03052023984491825,
 0.051142700016498

In [36]:
chunks = db_client.search_new(query_embedding, k=K, similarity_threshold=SIMILARITY_THRESHOLD)

# results = db_client.search(query_embedding, k=5, similarity_threshold=0.2)
# print(f"Found: {len(results)} results")

chunks

[{'id': 2,
  'content': '# Cover letter\n\n## Basic info\n\n\nThis is a document with CV summary\n\n\nName: Mario (Marvin) Theplumber\n\nemail: mario.thplumber@gmail.com\nphone: +72787226083\n\nNorth Holland, Netherlands\nhttps://github.com/razmarrus\nhttps://www.linkedin.com/in/razmarrus/\n\nnpm i jsonresume-theme-caffeine-tweaked\nresume export --theme caffeine-tweaked resume.pdf\n\nadress: Beethovenstraat 22-1 1099 LK Rotterdam\nNetherlands \n\nEducation: Brazil State University of Informatics\n\n\n## About me / Role description\n\nAs a results-oriented Data Scientist, I believe that combining Machine Learning and Statistics provides businesses with solid answers to their questions. While staying close to the data, I work closely with my business-oriented colleagues to translate business goals into achievable objectives using models.\n\nMy focus is on customer and financial analytics, where I assist in improving marketing strategies with a better understanding of customers’ preferen

In [14]:
context_text = "\n".join([d["content"][:400] for d in chunks])
prompt = f"Context: {context_text[:600]}\n\nQ: what is in {question}?\nA:"
prompt

'Context: # Cover letter\n\n## Basic info\n\n\nThis is a document with CV summary\n\n\nName: Mario (Marvin) Theplumber\n\nemail: mario.thplumber@gmail.com\nphone: +72787226083\n\nNorth Holland, Netherlands\nhttps://github.com/razmarrus\nhttps://www.linkedin.com/in/razmarrus/\n\nnpm i jsonresume-theme-caffeine-tweaked\nresume export --theme caffeine-tweaked resume.pdf\n\nadress: Beethovenstraat 22-1 1099 LK Rotterdam\nNetherland\n\nQ: what is in Cover letter?\nA:'
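
The cell above trims the context by character count, which can cut a chunk mid-word and ignores the `MAX_CONTEXT_TOKENS` budget. A token-budgeted alternative might look like the sketch below. It is not the project's `assemble_context` implementation: the greedy skip-if-too-large policy is an assumption, and whitespace splitting stands in for a real tokenizer.

```python
from typing import Callable, Dict, List


def assemble_context(
    chunks: List[Dict],
    max_tokens: int,
    count_tokens: Callable[[str], int],
) -> str:
    """Greedily add chunks (assumed sorted by relevance) while they fit the budget."""
    parts: List[str] = []
    used = 0
    for chunk in chunks:
        cost = count_tokens(chunk["content"])
        if used + cost > max_tokens:
            continue  # this chunk does not fit; a later, smaller one still might
        parts.append(chunk["content"])
        used += cost
    return "\n\n".join(parts)


def whitespace_tokens(text: str) -> int:
    """Stand-in token counter; a real pipeline would count tokenizer tokens."""
    return len(text.split())


ranked = [
    {"content": "a b c"},
    {"content": "d e f g"},
    {"content": "h"},
]
context = assemble_context(ranked, max_tokens=4, count_tokens=whitespace_tokens)
print(repr(context))
```

Separator tokens are not counted here, so a production version would leave some headroom in the budget.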

In [48]:
# # answer = hf_client.generate_answer(question, prompt)

# max_new_tokens: int = 512
# temperature: float = 0.7

# response = hf_client.llm_client.conversational(
#     prompt,
#     max_new_tokens=max_new_tokens,
#     temperature=temperature,
#     do_sample=True,
#     top_p=0.9,
#     return_full_text=False
# )

# # Extract the generated text from the response
# answer = response.generated_text.strip()

# answer = response.strip()
# logger.info(f"Generated answer ({len(answer)} chars)")

In [49]:
# from huggingface_hub import InferenceClient

# # Replace with your Hugging Face token and prompt
# # hf_token = "your_hf_token_here"
# prompt = "Context:"
# context_text = "\n".join([d["content"][:400] for d in chunks])


# #"Your context text here"
# question = f"Q: what is in {question}?\nA:" #"Your question here"

# # Initialize the client with a supported model
# llm_client = InferenceClient(model="mistralai/Mistral-7B-Instruct-v0.2", token=HF_TOKEN)

# # Format the prompt as a chat-like instruction
# messages = [
#     {
#         "role": "user",
#         "content": f"""<s>[INST] Based on this context:

# {context_text[:800]}

# Answer: {question} [/INST]"""
#     }
# ]

# # Convert the messages to a single prompt string
# prompt = messages[0]["content"]

# # Generate the answer
# response = llm_client.text_generation(
#     prompt=prompt,
#     max_new_tokens=512,
#     temperature=0.7,
#     do_sample=True,
#     top_p=0.9,
#     return_full_text=False
# )

# # Extract the answer
# answer = response.strip()
# print(answer)


## Query the model

In [40]:
from huggingface_hub import InferenceClient

llm_client = InferenceClient(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    token=HF_TOKEN,
)

In [None]:
llm_question="what is this cover letter about"

messages = [
    {
        "role": "user",
        "content": f"Based on this context:\n\n{context_text}\n\nAnswer: {llm_question}?"
    }
]

response = llm_client.chat_completion(
    messages=messages,
    max_tokens=500,
    temperature=0.7
)

answer = response.choices[0].message.content.strip()

In [43]:
response

ChatCompletionOutput(choices=[ChatCompletionOutputComplete(finish_reason='stop', index=0, message=ChatCompletionOutputMessage(role='assistant', content=" This cover letter does not provide sufficient information to determine its specific content. It appears to be accompanying a resume or CV and includes some basic contact information and a link to the person's GitHub and LinkedIn profiles. The last two lines suggest that the resume or CV has been generated using a specific tool and saved as a PDF file. However, there is no explicit statement in the cover letter about the purpose of the document or what it is requesting or offering.", reasoning=None, tool_call_id=None, tool_calls=None), logprobs=None)], created=1767726252925, id='1d9c3a1f-6b9d-476e-b878-40c412e17d3b', model='mistralai/Mistral-7B-Instruct-v0.2', system_fingerprint='', usage=ChatCompletionOutputUsage(completion_tokens=94, prompt_tokens=135, total_tokens=229), object='chat.completion')

In [23]:
answer

'This cover letter does not provide sufficient information to determine the specific topic or purpose of the letter. It includes basic contact information for Mario Theplumber, his email address, phone number, and links to his GitHub and LinkedIn profiles. It also mentions the use of a JSON resume and exporting it as a PDF using the "caffeine-tweaked" theme. However, there is no text in the cover letter explaining why Mario is writing the letter or to whom it is being addressed.'

---

## Old code

In [11]:
# 3. Assemble context
context = text_processor.assemble_context(chunks, question=question)

# 4. Generate answer
# try:
    
#     if not answer or len(answer) < 10:
#         answer = text_processor.create_fallback(chunks)
# except:
#     answer = text_processor.create_fallback(chunks)
answer = hf_client.generate_answer(question, context)
print(answer)

INFO:text_processor:Context: 512/1348 tokens, 1/1 chunks
ERROR:hf_client:Answer generation failed: Model mistralai/Mistral-7B-Instruct-v0.2 is not supported for task text-generation and provider featherless-ai. Supported task: conversational.


RuntimeError: Failed to generate answer: Model mistralai/Mistral-7B-Instruct-v0.2 is not supported for task text-generation and provider featherless-ai. Supported task: conversational.
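
The error above says the provider serves this model only for the conversational task, so the prompt has to be sent as chat messages rather than as a raw text-generation prompt. A minimal sketch of building those messages follows. The `build_chat_messages` name and the prompt wording are assumptions, and a single user message is used because not every chat template accepts a system role.

```python
from typing import Dict, List


def build_chat_messages(question: str, context: str) -> List[Dict[str, str]]:
    """Wrap retrieved context and the question into one user message."""
    content = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context.strip()}\n\n"
        f"Question: {question.strip()}"
    )
    return [{"role": "user", "content": content}]


messages = build_chat_messages(
    question="What is the refund window?",
    context="Our refund policy allows returns within 35 days of purchase.",
)
print(messages[0]["content"])
```

The resulting list can be passed as the `messages` argument of `llm_client.chat_completion(...)`, as done in the "Query the model" section of this notebook.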

In [None]:
query_embedding_str = f"[{','.join(map(str, query_embedding))}]"

with db_client.conn.cursor() as cur:
    cur.execute("""
        SELECT 
            content, source, chunk_id, 
            start_token, end_token, token_count,
            1 - (embedding <=> %s::vector) AS similarity
        FROM documents
        WHERE 1 - (embedding <=> %s::vector) > %s
        ORDER BY embedding <=> %s::vector
        LIMIT %s
    """, (query_embedding_str, query_embedding_str, SIMILARITY_THRESHOLD, query_embedding_str, K))
    
    results = cur.fetchall()
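
The `query_embedding_str` expression is repeated in several cells. It could be pulled into a helper that also validates the values. This is an illustrative sketch: the `to_vector_literal` name is hypothetical, and the string it returns is meant to be passed as a bound parameter for `%s::vector`, not interpolated into the SQL text.

```python
import math
from typing import Sequence


def to_vector_literal(embedding: Sequence[float]) -> str:
    """Format an embedding as pgvector's text form, e.g. '[0.1,0.2,0.3]'."""
    values = [float(x) for x in embedding]  # also converts NumPy scalars to Python floats
    if not values:
        raise ValueError("embedding must not be empty")
    if not all(math.isfinite(v) for v in values):
        raise ValueError("embedding contains NaN or infinite values")
    return "[" + ",".join(repr(v) for v in values) + "]"


print(to_vector_literal([1, 0.25, -2.5]))
```

Rejecting NaN and infinite values up front gives a clearer error than letting the database reject the cast.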

In [31]:
with db_client.conn.cursor() as cur:
    # cur.execute("""
    #     SELECT 
    #         content, source, chunk_id, 
    #         (embedding <=> %s::vector) AS distance
    #     FROM documents
    #     ORDER BY distance
    #     LIMIT %s
    # """, (query_embedding_str, k))
    cur.execute("SELECT COUNT(*) FROM documents;")

    results = cur.fetchall()
    print("Results with distances:", results)


Results with distances: [(3,)]


### query examples

In [32]:
query_embedding_str = f"[{','.join(map(str, query_embedding))}]"

sql = """
    SELECT
        id,
        content,
        source,
        chunk_id,
        start_token,
        end_token,
        token_count,
        embedding <=> %s::vector AS cosine_distance
    FROM documents
    ORDER BY cosine_distance
"""

# if limit is not None:
#     sql += f" LIMIT {limit}"

with db_client.conn.cursor() as cur:
    cur.execute(sql, (query_embedding_str,))
    results = cur.fetchall()

results

[(2,
  '# Cover letter\n\n## Basic info\n\n\nThis is a document with CV summary\n\n\nName: Mario (Marvin) Theplumber\n\nemail: mario.thplumber@gmail.com\nphone: +72787226083\n\nNorth Holland, Netherlands\nhttps://github.com/razmarrus\nhttps://www.linkedin.com/in/razmarrus/\n\nnpm i jsonresume-theme-caffeine-tweaked\nresume export --theme caffeine-tweaked resume.pdf\n\nadress: Beethovenstraat 22-1 1099 LK Rotterdam\nNetherlands \n\nEducation: Brazil State University of Informatics\n\n\n## About me / Role description\n\nAs a results-oriented Data Scientist, I believe that combining Machine Learning and Statistics provides businesses with solid answers to their questions. While staying close to the data, I work closely with my business-oriented colleagues to translate business goals into achievable objectives using models.\n\nMy focus is on customer and financial analytics, where I assist in improving marketing strategies with a better understanding of customers’ preferences and measuring

In [21]:
query_embedding_str = f"[{','.join(map(str, query_embedding))}]"

with db_client.conn.cursor() as cur:
    cur.execute("""
        SELECT 
            content, source, chunk_id, 
            start_token, end_token, token_count, embedding
        FROM documents
    """)
    
    results = cur.fetchall()
results 

[('Our refund policy allows returns within 35 days of purchase. Full refunds are provided for unopened items.\n',
  'policy.txt',
  0,
  0,
  21,
  21,
  '[-0.049646314,0.00788142,0.06498598,0.02755123,0.05388102,0.0034853122,0.0019658671,-0.021383384,-0.028284602,0.03796265,0.08782472,0.008704987,-0.0036289226,-0.11326598,-0.03446046,-0.030774001,-0.0074088993,-0.03812605,-0.03755244,-0.010578815,0.04534926,-0.058244433,0.017379388,-0.053542886,0.04122955,0.08540669,-0.056329153,-0.034305654,-0.014205925,-0.01184808,0.044347055,-0.054839652,-0.066744395,-0.054778058,-0.044496957,0.0525469,-0.060063083,-0.09016178,-0.03886948,0.015047273,-0.10945757,0.09787436,-0.0068044807,0.0814983,0.034812823,-0.025477463,0.044552438,-0.044960745,0.07981487,0.024243109,0.0077475174,0.06295065,-0.039939594,0.03283787,0.0010958338,0.062451098,-0.0153706875,0.001572128,-0.028900862,-0.094697304,0.026326738,-0.12996931,-0.032205645,-0.10682113,-0.043133542,-0.06842565,-0.07529022,-0.035768274,-0.0463195

In [8]:


result = query_rag(
    question="Any information on recommendation systems?",
    text_processor=text_processor,
    hf_client=hf_client,
    db_client=db_client,
    k=5
)

print(result["answer"])


INFO:hf_client:Generated 1 embeddings


No relevant information found.


In [29]:


for question in questions:
    print(f"\n Question: {question}")
    
    result = query_rag(
        question=question,
        text_processor=text_processor,
        hf_client=hf_client,
        db_client=db_client,
        k=5
    )
    
    print(f"\nAnswer:\n{result['answer']}")
    print(f"\nSources ({result['num_chunks']} chunks):")
    for src in result['sources']:
        print(
            f"  - {src['source']} (chunk {src['chunk_id']}, "
            f"tokens {src['start_token']}-{src['end_token']}, "
            f"score {src['similarity']})"
        )


 Question: What is mario's email?


INFO:hf_client:Generated 1 embeddings
INFO:hf_client:Generated 1 embeddings



Answer:
No relevant information found.

Sources (0 chunks):

 Question: How long does shipping take?


INFO:hf_client:Generated 1 embeddings



Answer:
No relevant information found.

Sources (0 chunks):

 Question: Where there any projects with recommendation systems done by Mario?

Answer:
No relevant information found.

Sources (0 chunks):

 Question: Does mario like data science?


INFO:hf_client:Generated 1 embeddings



Answer:
No relevant information found.

Sources (0 chunks):


## Debug search

In [13]:
# Test whether the pgvector operators work (inlined from test_vector_operations).

with db_client.conn.cursor() as cur:
    # Test 1: Create dummy vectors
    cur.execute("""
        SELECT 
            ARRAY[1.0, 2.0, 3.0]::vector <=> ARRAY[1.0, 2.0, 3.0]::vector AS identical,
            ARRAY[1.0, 2.0, 3.0]::vector <=> ARRAY[4.0, 5.0, 6.0]::vector AS different
    """)
    identical, different = cur.fetchone()
    
    logger.info(f"Vector distance (identical): {identical:.4f} (should be 0.0)")
    logger.info(f"Vector distance (different): {different:.4f} (should be > 0)")
    
    # Test 2: Check if index exists
    cur.execute("""
        SELECT indexname, indexdef 
        FROM pg_indexes 
        WHERE tablename = 'documents' AND indexname = 'idx_embedding'
    """)
    index = cur.fetchone()
    if index:
        logger.info(f"Vector index exists: {index[0]}")
    else:
        logger.warning("Vector index not found")



INFO:__main__:Vector distance (identical): 0.0000 (should be 0.0)
INFO:__main__:Vector distance (different): 0.0254 (should be > 0)
INFO:__main__:Vector index exists: idx_embedding


In [14]:
# def test_raw_search(db_client, hf_client, question="email"):
# """Test search with NO filtering."""

question= "This is a document with CV summary" #"email"

logger.info(f"Testing raw search for: '{question}'")
logger.info("=" * 80)

# Generate embedding
query_embedding = hf_client.get_embeddings([question])[0]
if not isinstance(query_embedding, list):
    query_embedding = query_embedding.tolist()

logger.info(f"Query embedding generated: {len(query_embedding)} dimensions")
logger.info(f"Sample values: {query_embedding[:5]}")

# Search without ANY threshold
with db_client.conn.cursor() as cur:
    cur.execute("""
        SELECT 
            id,
            source,
            LEFT(content, 100) as preview,
            1 - (embedding <=> %s::vector) AS similarity
        FROM documents
        ORDER BY embedding <=> %s::vector
        LIMIT 5
    """, (query_embedding, query_embedding))
    
    results = cur.fetchall()

if not results:
    logger.error("No results returned at all!")
    # return
else:
    logger.info(f"Found {len(results)} results:")
    for i, (doc_id, source, preview, sim) in enumerate(results, 1):
        logger.info(f"{i}. Similarity: {sim:.4f}")
        logger.info(f"   Source: {source} (ID: {doc_id})")
        logger.info(f"   Preview: {preview}...")




INFO:__main__:Testing raw search for: 'This is a document with CV summary'
INFO:hf_client:Generated 1 embeddings
INFO:__main__:Query embedding generated: 384 dimensions
INFO:__main__:Sample values: [-0.07514718919992447, 0.1503158062696457, -0.06211642175912857, 0.020352579653263092, 0.053853780031204224]
ERROR:__main__:No results returned at all!


In [40]:
# Get actual text from database
with db_client.conn.cursor() as cur:
    cur.execute("SELECT content, embedding FROM documents WHERE id = 1")
    stored_content, stored_embedding = cur.fetchone()

logger.info(f"Stored content: {stored_content[:100]}...")
logger.info(f"Stored embedding type: {type(stored_embedding)}")
logger.info(f"Stored embedding sample: {stored_embedding[:5]}")

# Generate NEW embedding for same text
new_embedding = hf_client.get_embeddings([stored_content])[0]
if not isinstance(new_embedding, list):
    new_embedding = new_embedding.tolist()

logger.info(f"New embedding type: {type(new_embedding)}")
logger.info(f"New embedding sample: {new_embedding[:5]}")

# Compare them
with db_client.conn.cursor() as cur:
    cur.execute("""
        SELECT embedding <=> %s::vector AS distance
        FROM documents 
        WHERE id = 1
    """, (new_embedding,))
    
    distance = cur.fetchone()[0]
    logger.info(f"Distance between stored and regenerated: {distance:.6f}")
    logger.info(f"Similarity score: {1 - distance:.6f}")


INFO:__main__:Stored content: Our refund policy allows returns within 35 days of purchase. Full refunds are provided for unopened ...
INFO:__main__:Stored embedding type: <class 'str'>
INFO:__main__:Stored embedding sample: [-0.0
INFO:hf_client:Generated 1 embeddings
INFO:__main__:New embedding type: <class 'list'>
INFO:__main__:New embedding sample: [-0.04964632913470268, 0.007881461642682552, 0.06498601287603378, 0.02755117043852806, 0.05388107895851135]
INFO:__main__:Distance between stored and regenerated: 0.000000
INFO:__main__:Similarity score: 1.000000


In [15]:
with db_client.conn.cursor() as cur:
    cur.execute("""
        SELECT column_name, data_type, udt_name
        FROM information_schema.columns
        WHERE table_name = 'documents' AND column_name = 'embedding'
    """)
    
    col_info = cur.fetchone()
    logger.info(f"Column name: {col_info[0]}")
    logger.info(f"Data type: {col_info[1]}")
    logger.info(f"UDT name: {col_info[2]}")


INFO:__main__:Column name: embedding
INFO:__main__:Data type: USER-DEFINED
INFO:__main__:UDT name: vector


## Insert text v2

In [35]:
with db_client.conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM documents")
    print(f"Total documents: {cur.fetchone()[0]}")


Total documents: 3


In [38]:
with db_client.conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM documents WHERE embedding IS NOT NULL")
    print(f"Documents with embeddings: {cur.fetchone()[0]}")

Documents with embeddings: 3


In [11]:
# Test 1: Check data exists
with db_client.conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM documents")
    print(f"Total docs: {cur.fetchone()[0]}")
    
    cur.execute("SELECT content FROM documents LIMIT 1")
    row = cur.fetchone()
    if row:
        print(f"Sample content: {row[0][:100]}")

# Test 2: Check embedding dimension (fixed)
with db_client.conn.cursor() as cur:
    cur.execute("SELECT vector_dims(embedding) FROM documents LIMIT 1")
    stored_dim = cur.fetchone()[0]
    print(f"Stored dim: {stored_dim}, Expected: {db_client.embedding_dim}")

# Test 3: Raw vector search (no threshold)
query_embedding = hf_client.get_embeddings(["recommendation systems"])[0]
if hasattr(query_embedding, 'tolist'):
    query_embedding = query_embedding.tolist()

query_embedding_str = f"[{','.join(map(str, query_embedding))}]"

with db_client.conn.cursor() as cur:
    cur.execute("""
        SELECT content, 1 - (embedding <=> %s::vector) AS similarity
        FROM documents
        ORDER BY embedding <=> %s::vector
        LIMIT 5
    """, (query_embedding_str, query_embedding_str))
    
    for content, sim in cur.fetchall():
        print(f"Similarity: {sim:.4f} | {content[:80]}")


INFO:hf_client:Generated 1 embeddings


Total docs: 3
Sample content: Our refund policy allows returns within 35 days of purchase. Full refunds are provided for unopened 
Stored dim: 384, Expected: 384
