# Retrieval Experiments

Relevance of context returned from search greatly impacts the quality of our RAG system. This notebook will explore what retrieval methods are available to us with the goal of producing the most relevant content to a users query.

In [1]:
import sys 
import subprocess

# get root of current repo and add to our path
root_dir = subprocess.check_output(["git", "rev-parse", "--show-toplevel"], stderr=subprocess.DEVNULL).decode("utf-8").strip()

sys.path.append(root_dir)

## Setup 

Before we get started, let's initialize our clients, models, and test params for re-use throguhout this notebook

In [5]:
# for consistency, define a set of test queries 
test_queries = [
    "tell me about yourself", 
    "what is your educational background", 
    "why are you seeking a new position", 
    "what experience do you have with data pipelines", 
]

In [3]:
from sentence_transformers import SentenceTransformer

# instantiate the model
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

  from .autonotebook import tqdm as notebook_tqdm


In [43]:
from utils.postgres import PostgresClient
import  os 

# initialize a postgres client 
pg = PostgresClient(
    pg_host=os.getenv("PG_HOST"),
    pg_user=os.getenv("PG_USER"),
    pg_password=os.getenv("PG_PASSWORD"),
    pg_db="resume_rag"
)

## Cosine Similarity Search 

Let's start simple with `pgvector` cosine similarity search. This approach will take a query embedding and calculate the cosine similarity between each observation in our database. 


**Strengths:**
- Excellent response to why searching and tell me about yourself. There are documents in the knowledge base speaking directly to these topics and this metric does a good job surfacing them 

**Weaknesses:**
- Results for educational background are not what we want. As expected, there is some confusion between "Education Analytics" and formal education at university
- Results for experience with pipelines misses the mark a bit. The first result is great, but the subsequent two aren't really related to the question. 

**Next Steps:**
- Find a way to differentiate Education Analytics from "education" when referring to college or unitversity
- Add additional documents that speak to specific experience with technologies or projects. 


In [17]:
for query in test_queries:
    print("Test Query: ", query)

    query_embedding = model.encode(query)
    results = pg.semantic_search(query_embedding, n_results=3)
    print("Top 3 Results:")
    for result in results:
        print(f"- {result[2]}")
    print("\n")


Test Query:  tell me about yourself
Top 3 Results:
- skills that directly support her current work in data engineering and machine learning while studying psychology sophie developed a deep interest in the mechanisms of human cognition which naturally led her to explore fields like artificial intelligence and machine learning her coursework in
- sophie marshall graduated from the university of wisconsin madison in 2022 with a bachelor of science in psychology and economics with a mathematical emphasis her academic training reflects a strong interdisciplinary foundation in human cognition data modeling and statistical analysis skills that
- her coursework in economics paired with a focus on mathematical modeling helped her build a solid foundation in systems thinking multivariate analysis and data driven decision making during her undergraduate years sophie was a division i athlete competing as a four year member of the wisconsin


Test Query:  what is your educational background
Top 3 

## Tag Augementation 

We want some way to bolster responses if a word from the query either exactly matches or closely matches a word from a tag. The hope is that this addresses the confusion between "Education Analytics" and education as it relates to college. 

An additional layer of security we can add later down the line is to explicitly provide this information in the system prompt. However, this doesnt address the problem of irrelevant information taking the place of relevant information in the 3 returned DB results.

In [44]:
education_query = test_queries[1]

# cosine search with a bigger set of results 
hybrid_search_results = pg.hybrid_search(education_query, query_embedding, n_results=3)

In [45]:
hybrid_search_results

[('pbs.txt',
  ['job', 'professional', 'experience', 'work history'],
  'infrastructure data pipelines and ml integrated systems that bring together siloed data sources across pbs her work includes building robust data workflows using python apache airflow and aws step functions with projects spanning etl pipelines vector database integration and retrieval augmented',
  0.6486596191297384,
  None),
 ('summary.txt',
  ['summary', 'professional summary', 'elevator pitch'],
  'sophie is a data engineer passionate about building intelligent resilient pipelines that help teams keep pace with today s rapidly evolving data landscape she brings a strong foundation in ml integrated systems backend development and full stack prototyping currently sophie serves as the data',
  0.6078358082184105,
  None),
 ('hive.txt',
  ['job', 'professional', 'experience', 'work history'],
  'data engineering infrastructure',
  0.5554656121412531,
  None)]

In [25]:
for result in cosine_similarity_results:
    print(result[1])

['job', 'professional', 'experience', 'work history']
['summary', 'professional summary', 'elevator pitch']
['job', 'professional', 'experience', 'work history']
['job', 'professional', 'experience', 'work history']
['summary', 'professional summary', 'elevator pitch']
['job', 'professional', 'experience', 'work history']
['education', 'university', 'college', 'degree']
['job', 'professional', 'experience', 'work history']
['job', 'professional', 'experience', 'work history']
['job', 'professional', 'experience', 'work history']
['job', 'professional', 'experience', 'work history']
['summary', 'professional summary', 'elevator pitch']
['job', 'professional', 'experience', 'work history']
['job', 'professional', 'experience', 'work history', 'internship']
['job', 'professional', 'experience', 'work history', 'internship']
['looking for', 'job_search', 'professional']
['education', 'university', 'college', 'degree']
['job', 'professional', 'experience', 'work history', 'internship']
['jo