# `ResumeRAG` Data Pipeline Testing 

Let's explore how we might:
- Gather data to serve as our knowledge base 
- Transform and enrich the data to meet our use case's specific needs
- Load the data into a vector store of choice

In [1]:
import sys 
import subprocess

# get root of current repo and add to our path
root_dir = subprocess.check_output(["git", "rev-parse", "--show-toplevel"], stderr=subprocess.DEVNULL).decode("utf-8").strip()

sys.path.append(root_dir)

## Extraction 

The purpose of our `ResumeRAG` system is to allow others to ask the system questions about your professional history. To do this successfully, the system needs to be well hydrated with accurate and detailed information about that history.

Step 1 is to create some documents detailing the information you'd like to be available to your users and read them into your workspace.

In [13]:
from pathlib import Path

data_dir = Path(f"{root_dir}/data")

data_dict = {}

for file_name in data_dir.iterdir():
    with open(file_name, "r", encoding="utf-8") as file:
        content = file.read()

    data_dict[file_name.name] = {
        "raw_content": content
    }
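
The loop above reads every entry in `data/`, so a stray subdirectory or binary file would raise an error. Here is a minimal sketch of a more defensive loader, assuming the knowledge base consists only of `.txt` files (the `load_text_files` name is hypothetical and not part of the repo):

```python
from pathlib import Path


def load_text_files(data_dir):
    """Read every .txt file directly inside data_dir into a dict keyed by file name."""
    data = {}
    # sorted() makes the load order deterministic across platforms
    for path in sorted(Path(data_dir).glob("*.txt")):
        # skip anything that matches the pattern but is not a regular file
        if path.is_file():
            data[path.name] = {"raw_content": path.read_text(encoding="utf-8")}
    return data
```

Calling `load_text_files(data_dir)` would produce the same shape as `data_dict` above.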

## Transformation 

Before we're ready to embed, we want to: 
- Standardize the text data
- Enrich it with content tags to help our search results later 

In [23]:
from utils.helpers import strip_text

# clean up raw content a bit
for file_name, content in data_dict.items():
    clean_content = strip_text(content["raw_content"])
    data_dict[file_name]["clean_content"] = clean_content
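
`strip_text` lives in `utils.helpers` and its body is not shown in this notebook. As a rough idea of what this kind of cleanup usually involves, here is a hypothetical stand-in (`strip_text_sketch` is an illustrative name, and the real helper may behave differently) that normalizes Unicode and collapses runs of whitespace:

```python
import re
import unicodedata


def strip_text_sketch(text):
    """Hypothetical cleanup: normalize Unicode, collapse whitespace, trim the ends."""
    # NFKC folds compatibility characters (e.g. non-breaking spaces) into plain forms
    text = unicodedata.normalize("NFKC", text)
    # collapse any run of whitespace, including newlines and tabs, into one space
    text = re.sub(r"\s+", " ", text)
    return text.strip()
```

One trade-off worth noting: collapsing newlines removes the paragraph breaks that a recursive splitter would otherwise prefer as split points, so a gentler cleanup may suit longer documents better.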

### Tagging 

This will be pretty manual, but next I'll add tags to each piece of content. The goal of these tags is to help ensure our retrieval mechanism returns relevant information.

I decided to add tags that add context to the ___ of the text within the document. My hope is that this will help ensure that when people ask about "work" they only get "work" or "professional" tagged content. It should also help when someone asks about my education: tags let us mitigate the confusion that might arise between an employer with "Education" in its name and my actual university education.

If we see results are better or worse than we expect, we can always modify these tags as one method of improvement.

In [29]:
data_dict["looking_for.txt"]["tags"] = ["looking for", "job_search", "professional"]
data_dict["education.txt"]["tags"] = ["education", "university", "college", "degree"]
data_dict["summary.txt"]["tags"] = ["summary", "professional summary", "elevator pitch"]
data_dict["personal.txt"]["tags"] = ["personal", "interests", "hobbies", "outside work"]
data_dict["pbs.txt"]["tags"] = ["job", "professional", "experience", "work history"]
data_dict["education_analytics.txt"]["tags"] = ["job", "professional", "experience", "work history", "internship"]
data_dict["hive.txt"]["tags"] = ["job", "professional", "experience", "work history"]
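
To illustrate how tags could narrow retrieval, here is a small sketch that filters candidate chunks by tag before any similarity ranking happens. The `filter_by_tags` function and the record layout are assumptions for illustration; the actual retrieval code is not part of this notebook.

```python
def filter_by_tags(records, wanted_tags):
    """Keep records that share at least one tag with wanted_tags (case-insensitive)."""
    wanted = {tag.lower() for tag in wanted_tags}
    return [
        record for record in records
        if wanted & {tag.lower() for tag in record["tags"]}
    ]


# toy records mirroring two of the tag lists above
records = [
    {"document_id": "education.txt",
     "tags": ["education", "university", "college", "degree"]},
    {"document_id": "education_analytics.txt",
     "tags": ["job", "professional", "experience", "work history", "internship"]},
]

matches = filter_by_tags(records, ["education"])
print([record["document_id"] for record in matches])  # ['education.txt']
```

Because `education_analytics.txt` is tagged as a job rather than as education, a question about schooling would not pull in that employer's content.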

### Chunking

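In RAG systems, splitting text into chunks tends to improve retrieval accuracy, because each embedding then represents a focused passage rather than a whole document. To do this we'll use LangChain's `RecursiveCharacterTextSplitter` to split our text into chunks of up to 300 characters. Additionally, we'll add an overlap of 20 characters between chunks to preserve continuity and minimize loss of context.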

In [39]:
import pandas as pd

# convert to DF for ease 
df = pd.DataFrame.from_dict(data_dict, orient="index")

In [43]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# instantiate text splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=20)

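# add a new column holding the list of chunks for each document
# (named "split_texts" to match the column referenced in later cells)
df["split_texts"] = df["clean_content"].apply(text_splitter.split_text)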
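
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window chunker. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on paragraph, line, and word boundaries); it only illustrates how overlapping windows repeat the tail of one chunk at the head of the next. `sliding_chunks` is a hypothetical name.

```python
def sliding_chunks(text, chunk_size=300, chunk_overlap=20):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk begins with the last three characters of the previous one, which is the continuity the overlap is meant to provide.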

In [54]:
# cast long so each row is a single split text
df_long = df.explode('split_texts').reset_index()

# quick rename
df_long = df_long.rename(columns={'index': 'document_id'})

### Embedding

Finally, we'll embed our cleaned and chunked text! This is where the magic of RAG really lies. By embedding the text, we make it machine interpretable. This will help us bridge the gap between human language and computer understanding. 

In [59]:
from sentence_transformers import SentenceTransformer

# instantiate the model
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")


In [None]:
# embed each chunk 
# encode all chunks in one batched call; list() yields one vector per row
df_long["embedding"] = list(model.encode(df_long["split_texts"].tolist()))
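
Once text is embedded, relevance becomes a geometric question: chunks whose vectors point in a similar direction to the query vector are treated as related. Below is a minimal sketch of cosine similarity, a common way to compare embeddings, using toy 3-dimensional vectors in place of real model output (`all-MiniLM-L6-v2` produces 384-dimensional vectors). The metric actually used by the retrieval step is not shown in this notebook, so treat this as an illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# toy vectors standing in for a query and two chunks
query = [1.0, 0.0, 1.0]
chunk_a = [2.0, 0.0, 2.0]  # same direction as the query, so similarity is close to 1
chunk_b = [0.0, 1.0, 0.0]  # orthogonal to the query, so similarity is 0

print(cosine_similarity(query, chunk_a), cosine_similarity(query, chunk_b))
```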

## Loading 

The final step is to get this content into our selected vector store! For this project I picked PostgreSQL. If you want to learn more about how the database was set up, check out the `database_setup` notebook!

In [83]:
# subset to the columns of interest
subset = df_long[["document_id", "tags", "split_texts", "embedding"]]

subset = subset.rename(columns={'split_texts': 'clean_text'})

In [85]:
# convert to list for client
data = subset.to_dict(orient="records")
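
Before sending records to the database, a quick sanity check can catch missing chunks or inconsistent embedding sizes early. The sketch below is an assumption about what is worth validating, not something `PostgresClient` requires; `validate_records` is a hypothetical helper, and the default of 384 matches the output dimension of `all-MiniLM-L6-v2`.

```python
def validate_records(records, expected_dim=384):
    """Return a list of human-readable problems found in the records (empty if none)."""
    required = ("document_id", "tags", "clean_text", "embedding")
    problems = []
    for i, record in enumerate(records):
        missing = [key for key in required if key not in record]
        if missing:
            problems.append(f"record {i}: missing keys {missing}")
            continue
        text = record["clean_text"]
        if not isinstance(text, str) or not text.strip():
            problems.append(f"record {i}: empty or non-string clean_text")
        embedding = record["embedding"]
        # a missing chunk can surface as a float NaN, which has no length
        size = len(embedding) if hasattr(embedding, "__len__") else None
        if size != expected_dim:
            problems.append(f"record {i}: embedding size {size}, expected {expected_dim}")
    return problems
```

Running `validate_records(data)` and confirming it returns an empty list is one way to gain confidence before inserting.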

In [88]:
from utils.postgres import PostgresClient
import os 

pg = PostgresClient(
    pg_host=os.getenv("PG_HOST"),
    pg_user=os.getenv("PG_USER"),
    pg_password=os.getenv("PG_PASSWORD"),
    pg_db="resume_rag"
)
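
`os.getenv` returns `None` when a variable is unset, which can lead to a confusing connection error later on. Here is a small sketch that fails fast instead (`require_env` is a hypothetical helper, not part of `utils.postgres`):

```python
import os


def require_env(*names):
    """Return the named environment variables as a dict, raising if any are unset or empty."""
    missing = [name for name in names if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in names}
```

For example, `env = require_env("PG_HOST", "PG_USER", "PG_PASSWORD")` could run before constructing the client, with `env["PG_HOST"]` and friends passed in place of the `os.getenv` calls.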

In [89]:
pg.insert_content_embeddings(data)