# RAG System Testing 

Now that we have our database set up (see `database-setup` notebook) and populated (see `pipeline-testing` notebook), we're ready to start developing our retrieval and generation components!

## Retrieval 

We'll compare and contrast three retrieval methods:
- Semantic search using `pgvector`'s built-in similarity search functionality
- Lexical search
- Hybrid search (with and without tags) 

In [1]:
import sys 

sys.path.append("/Users/srmarshall/Desktop/code/personal/resume-rag/")

In [2]:
from utils.database import PgClient
import os 

# instantiate client
pg_client = PgClient(
    pg_host = os.getenv("PG_HOST"), 
    pg_user = os.getenv("PG_USER"), 
    pg_password = os.getenv("PG_PASSWORD"), 
    pg_db = "resume_rag"
)

In [5]:
# set query 
query = "tell me about your education"

### Semantic Search with `pgvector`

Semantic search lets us query our data in natural language. Where lexical search relies on naive, direct string comparison to find and surface results, semantic search compares the meaning of phrases using vector operations, which makes for more robust searching.
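To make "comparing meaning using vector operations" concrete, here is a minimal sketch of cosine similarity, one common way to score how close two embeddings are. The three-dimensional vectors are made up for illustration (real embeddings from `all-MiniLM-L6-v2` have 384 dimensions), and the names `query_vec`, `education_chunk`, and `unrelated_chunk` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy "embeddings"; these are not real model output.
query_vec = [0.9, 0.1, 0.0]
education_chunk = [0.8, 0.2, 0.1]
unrelated_chunk = [0.0, 0.1, 0.9]

# The chunk pointing in roughly the same direction as the query scores higher.
print(cosine_similarity(query_vec, education_chunk))
print(cosine_similarity(query_vec, unrelated_chunk))
```

A score near 1 means the vectors point in nearly the same direction, while a score near 0 means they are close to orthogonal, which we read as unrelated.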

In [9]:
from sentence_transformers import SentenceTransformer

# instantiate the model
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# generate query embedding
query_embedding = model.encode(query)


In [10]:
# use embeddings to search database
semantic_results = pg_client.semantic_search(query_embedding, "content_embeddings")
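The internals of `PgClient.semantic_search` live in `utils/database` and aren't shown in this notebook. As a rough sketch of what a method like it might run, the helper below builds a nearest-neighbour query using `pgvector`'s cosine distance operator `<=>`. The function name `build_semantic_query`, the `embedding` column, and the `top_k` parameter are all assumptions for illustration, not the actual implementation.

```python
def build_semantic_query(query_embedding, table, top_k=5):
    """Return (sql, params) for a cosine-distance nearest-neighbour query.

    Hypothetical helper: the `embedding` column name is an assumption.
    """
    # pgvector accepts a vector as a text literal such as '[0.1,0.2,0.3]'.
    vector_literal = "[" + ",".join(str(float(x)) for x in query_embedding) + "]"
    # `<=>` is pgvector's cosine distance operator; smaller means more similar.
    # The table name is interpolated directly, so it must come from trusted
    # code and never from user input.
    sql = (
        f"SELECT *, embedding <=> %s::vector AS distance "
        f"FROM {table} "
        f"ORDER BY distance "
        f"LIMIT %s"
    )
    return sql, (vector_literal, top_k)


sql, params = build_semantic_query([0.1, 0.2, 0.3], "content_embeddings")
print(sql)
```

The returned pair could then be passed to a database cursor's `execute(sql, params)`, so the embedding and limit are sent as bound parameters rather than spliced into the SQL string.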

When we print the results, we see that almost all of them describe my educational background, whether it be the actual degree I obtained or the reason I decided to study that subject.

In [12]:
print(f"User Query: {query}\n")

# print results
print("Semantic Search Results: ")
for index, item in enumerate(semantic_results):
    print(f"  {index + 1}.) {item[3]}")

User Query: tell me about your education

Semantic Search Results: 
  1.) education i graduated from the university of wisconsin madison in may of 2022 with bachelors of science in economics with a mathematical emphasis and psychology coursework from both degrees are highly relevant to my current area of work i draw on knowledge of human cognition while working alongside
  2.) now that embedding powered technology is on the rise similarly it enables me to quickly ingest new information and apply it to prototypes and projects education i graduated from the university of wisconsin madison in may of 2022 with degrees in psychology and economics with a mathematical emphasis
  3.) and cutting edge research helps me contextualize and quickly apply new techniques and conecepts as they are published the mathematical coursework i completed as part of my economics degree is something i use almost daily statics and linear algebra are everywhere especially now that embedding
  4.) emphasis i opted

### Lexical Search with `tsvector`

`tsvector` is a data type provided by Postgres that lets us store pre-processed documents for full-text search. Read more in the [Postgres full-text search documentation](https://www.postgresql.org/docs/current/textsearch-intro.html).

In [13]:
# use the raw query and the generated tsvector column to run a full-text search
lexical_results = pg_client.lexical_search(query, table="content_embeddings")

While we would expect at least some results as our document corpus grows, this is an example of why lexical search alone may not get the job done. If we relied on plain lexical search alone, we'd conclude there are no relevant documents in our database!

While this isn't the most accurate representation, it goes to show that semantic search can drastically outperform a plain-text search.

In [15]:
print(f"User Query: {query}\n")

print("Lexical Search Results: ")
if len(lexical_results) > 0:
    for index, item in enumerate(lexical_results):
        print(f"  {index + 1}.) {item[3]}")
else:
    print("No results found.")

User Query: tell me about your education

Lexical Search Results: 
No results found.


The way Postgres parses our query may be unfairly hurting our assessment of lexical search on its own. For now, especially with such a small corpus, let's run a more targeted lexical search and see what comes back.

In [None]:
single_word_lexical_results = pg_client.lexical_search("education", table="content_embeddings")

In [None]:
print("User Query: education\n")

print("Lexical Search Results: ")
if len(single_word_lexical_results) > 0:
    # iterate over the single-word results, not the earlier full-sentence results
    for index, item in enumerate(single_word_lexical_results):
        print(f"  {index + 1}.) {item[3]}")
else:
    print("No results found.")
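One plausible reason the full-sentence query found nothing: Postgres's `plainto_tsquery` drops stop words and then joins the surviving terms with AND, so a document has to contain every one of them. Whether `lexical_search` uses `plainto_tsquery` is an assumption here, since the method body isn't shown. The toy model below imitates that AND behavior in plain Python. It skips stemming, its stop-word set is a small illustrative subset, and `tokenize` and `matches_all_terms` are hypothetical helpers.

```python
# Illustrative subset only; Postgres ships its own, longer English stop-word list.
STOP_WORDS = {"me", "about", "your", "i", "the", "of", "in", "with", "from"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words (no stemming here)."""
    return {word for word in text.lower().split() if word not in STOP_WORDS}


def matches_all_terms(query, document):
    """True only if every surviving query term appears in the document."""
    return tokenize(query) <= tokenize(document)


# Opening words of the first semantic search result above.
chunk = "education i graduated from the university of wisconsin madison"

# "tell" survives stop-word removal but is not in the chunk, so the AND fails.
print(matches_all_terms("tell me about your education", chunk))  # False
# A single term only needs to appear once.
print(matches_all_terms("education", chunk))  # True
```

Under this model, the word "tell" alone is enough to rule out a chunk that is otherwise clearly about education, which is why a single-term query is a fairer test of lexical search on a small corpus.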