# Retrieval Experiments

The relevance of the context returned from search greatly impacts the quality of our RAG system. This notebook explores the retrieval methods available to us, with the goal of returning the most relevant content for a user's query.

In [1]:
import sys
import subprocess

# get root of current repo and add to our path
root_dir = subprocess.check_output(["git", "rev-parse", "--show-toplevel"], stderr=subprocess.DEVNULL).decode("utf-8").strip()

sys.path.append(root_dir)

## Setup 

Before we get started, let's initialize our clients, models, and test params for reuse throughout this notebook.

In [2]:
# for consistency, define a set of test queries 
test_queries = [
    "tell me about yourself", 
    "what is your educational background", 
    "why are you seeking a new position", 
    "what experience do you have with data pipelines", 
]

In [69]:
import os

from utils.postgres import PostgresClient

# initialize a postgres client 
pg = PostgresClient(
    pg_host=os.getenv("PG_HOST"),
    pg_user=os.getenv("PG_USER"),
    pg_password=os.getenv("PG_PASSWORD"),
    pg_db="resume_rag"
)
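
One thing to watch in the cell above: `os.getenv` returns `None` when a variable is unset, so a missing `PG_HOST` would only surface later as a connection error. A minimal sketch of an up-front check follows. The `require_env` helper and the fake environment mapping are hypothetical and not part of `utils.postgres`.

```python
import os


def require_env(names, environ=None):
    """Return the requested variables, raising if any are missing or empty.

    `environ` defaults to os.environ; passing a dict makes the helper easy to test.
    """
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: environ[name] for name in names}


# Demo with a fake mapping so the example does not depend on the real environment.
fake_env = {"PG_HOST": "localhost", "PG_USER": "demo", "PG_PASSWORD": "secret"}
settings = require_env(["PG_HOST", "PG_USER", "PG_PASSWORD"], environ=fake_env)
print(sorted(settings))
```

In the notebook itself, calling `require_env(["PG_HOST", "PG_USER", "PG_PASSWORD"])` before constructing the client would fail fast with a readable message.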

## Cosine Similarity Search 

Let's start simple with `pgvector` cosine similarity search. This approach takes a query embedding and calculates the cosine similarity between it and each observation in our database, then returns the closest matches.
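
As a rough illustration of the computation, here is a minimal sketch of cosine similarity ranking in plain Python. The toy three-dimensional embeddings and the `rank_by_cosine` helper are made up for this example. This is not how `pg.semantic_search` is implemented: pgvector does the work inside Postgres, where its `<=>` operator returns cosine distance (1 minus the similarity).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_cosine(query_embedding, documents, n_results=3):
    """Score every (text, embedding) pair against the query and keep the top n."""
    scored = [
        (cosine_similarity(query_embedding, embedding), text)
        for text, embedding in documents
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:n_results]


# Toy embeddings, invented for illustration; real ones have hundreds of dimensions.
toy_documents = [
    ("university degree", [0.9, 0.1, 0.0]),
    ("data pipeline project", [0.1, 0.9, 0.2]),
    ("education analytics role", [0.7, 0.5, 0.1]),
]
toy_query = [1.0, 0.0, 0.0]

top = rank_by_cosine(toy_query, toy_documents, n_results=2)
for score, text in top:
    print(f"{score:.3f}  {text}")
```

Note that the "education analytics role" vector still scores fairly high against an education-flavored query, which mirrors the confusion described in the weaknesses for this method.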


**Strengths:**
- Excellent results for the "why are you seeking a new position" and "tell me about yourself" queries. Documents in the knowledge base speak directly to these topics, and this metric does a good job of surfacing them.

**Weaknesses:**
- Results for educational background are not what we want. As expected, there is some confusion between "Education Analytics" and formal education at a university.
- Results for experience with pipelines miss the mark a bit. The first result is great, but the next two aren't really related to the question.

**Next Steps:**
- Find a way to differentiate Education Analytics from "education" when referring to college or university.
- Add additional documents that speak to specific experience with technologies or projects. 


In [8]:
for query in test_queries:
    print("Test Query: ", query)

    results = pg.semantic_search(query, n_results=3)
    print("Top 3 Results:")
    for result in results:
        print(f"- {result[2]}")
    print("\n")


Test Query:  tell me about yourself
Top 3 Results:
- skills that directly support her current work in data engineering and machine learning while studying psychology sophie developed a deep interest in the mechanisms of human cognition which naturally led her to explore fields like artificial intelligence and machine learning her coursework in
- sophie marshall graduated from the university of wisconsin madison in 2022 with a bachelor of science in psychology and economics with a mathematical emphasis her academic training reflects a strong interdisciplinary foundation in human cognition data modeling and statistical analysis skills that
- her coursework in economics paired with a focus on mathematical modeling helped her build a solid foundation in systems thinking multivariate analysis and data driven decision making during her undergraduate years sophie was a division i athlete competing as a four year member of the wisconsin


Test Query:  what is your educational background
Top 3 

## Hybrid Search (Cosine Similarity + Fuzzy Lexical Search w/ Tags)

To address some of the shortcomings of semantic search, let's try a hybrid search approach that takes tags into account. This approach builds on semantic search without replacing it entirely: a fuzzy lexical search across each document's tags is run alongside it and used to augment the results returned by semantic search.
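
To make the idea concrete, here is a minimal sketch of one way a tag match could boost a semantic score. Everything in it is an assumption for illustration: the `tag_match_score` and `hybrid_score` helpers, the 0.8 threshold, the 0.3 tag weight, and the candidate scores are invented, and `difflib` stands in for whatever fuzzy matching the database performs. It is not the implementation behind `pg.hybrid_search`.

```python
from difflib import SequenceMatcher


def tag_match_score(query, tags, threshold=0.8):
    """Best fuzzy ratio between any query word and any tag, or 0.0 if below threshold."""
    words = query.lower().split()
    best = 0.0
    for tag in tags:
        for word in words:
            ratio = SequenceMatcher(None, word, tag.lower()).ratio()
            best = max(best, ratio)
    return best if best >= threshold else 0.0


def hybrid_score(semantic_score, lexical_score, tag_weight=0.3):
    """Weighted blend of the two signals (the weight is an arbitrary choice here)."""
    return (1 - tag_weight) * semantic_score + tag_weight * lexical_score


query = "what is your educational background"

# Hypothetical candidates: (text, made-up semantic score, tags).
candidates = [
    ("education analytics role", 0.62, ["analytics", "work"]),
    ("university degree", 0.58, ["education"]),
]

ranked = sorted(
    (
        (hybrid_score(semantic, tag_match_score(query, tags)), text)
        for text, semantic, tags in candidates
    ),
    reverse=True,
)
for score, text in ranked:
    print(f"{score:.4f}  {text}")
```

In this toy setup the "university degree" chunk overtakes the "education analytics role" chunk even though its semantic score is lower, because "educational" in the query fuzzily matches the `education` tag.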

**Strengths:**
- Our education problem looks to have been addressed! Because I used tags to describe the type of experience in a document, the education tag was matched from the query and the matching documents were boosted in our results.


**Weaknesses:**
- No glaring gaps like before. Overall, I think the system needs more documents to give better answers to more nuanced questions, but for now this seems like a great improvement.

**Next Steps:**
- Build out the knowledge base as time goes on! Add projects and typed responses to known interview and screening questions.

In [80]:
for query in test_queries:
    print(f"Test Query: {query}")
    print("Top 3 Results:")
    hybrid_results = pg.hybrid_search(query, n_results=3)
    for result in hybrid_results:
        print(f"- {result['clean_text']} (Hybrid Score: {result['hybrid_score']:.4f})")
    print("\n")

Test Query: tell me about yourself
Top 3 Results:
- skills that directly support her current work in data engineering and machine learning while studying psychology sophie developed a deep interest in the mechanisms of human cognition which naturally led her to explore fields like artificial intelligence and machine learning her coursework in (Hybrid Score: 0.1061)
- sophie marshall graduated from the university of wisconsin madison in 2022 with a bachelor of science in psychology and economics with a mathematical emphasis her academic training reflects a strong interdisciplinary foundation in human cognition data modeling and statistical analysis skills that (Hybrid Score: 0.0998)
- her coursework in economics paired with a focus on mathematical modeling helped her build a solid foundation in systems thinking multivariate analysis and data driven decision making during her undergraduate years sophie was a division i athlete competing as a four year member of the wisconsin (Hybrid Scor