# `ResumeRAG` Pipeline Testing

**Goal:** Test individual pipeline components here before moving them into scripts and, eventually, exposing the pipeline as an API.

## Extracting Data

For this example, we will pull our text data from a Google Doc. We can access this file programmatically using a Google API service account and the available Python client library.

In [1]:
import sys

sys.path.append("/Users/srmarshall/Desktop/code/personal/resume-rag/")

In [2]:
import os 
from utils.google import GoogleDocClient

# instantiate a client
docs_client = GoogleDocClient(
    service_account_json="../credentials.json", 
    scopes=['https://www.googleapis.com/auth/documents.readonly']
)

# fetch document
response = docs_client.fetch_document(document_id=os.getenv("RESUME_RAG_DOCUMENT_ID"))

# extract text 
raw_text = docs_client.extract_text(google_doc_repsonse=response)
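To make the extraction step less of a black box, here is a minimal sketch of how text can be pulled out of a Docs API `documents.get` response. It is not the actual `GoogleDocClient.extract_text` implementation: `extract_plain_text` and `sample_response` are hypothetical names, and the sketch only walks paragraphs (tables and other structural elements are skipped).

```python
def extract_plain_text(document: dict) -> str:
    """Concatenate the text runs found in a Docs API document response."""
    pieces = []
    for block in document.get("body", {}).get("content", []):
        paragraph = block.get("paragraph")
        if paragraph is None:
            # Assumption: this sketch ignores tables, section breaks, etc.
            continue
        for element in paragraph.get("elements", []):
            text_run = element.get("textRun")
            if text_run is not None:
                pieces.append(text_run.get("content", ""))
    return "".join(pieces)


# Hand-written stand-in for a real API response (illustrative data only)
sample_response = {
    "body": {
        "content": [
            {"sectionBreak": {}},
            {"paragraph": {"elements": [{"textRun": {"content": "Jane Doe\n"}}]}},
            {"paragraph": {"elements": [
                {"textRun": {"content": "Data "}},
                {"textRun": {"content": "Engineer\n"}},
            ]}},
        ]
    }
}

sample_text = extract_plain_text(sample_response)
```

Each paragraph in the response carries its own trailing newline, so joining the runs without a separator keeps the document's line structure.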

## Transforming Data

Our vector database has the following columns: `document_id`, `chunk_id`, `tags`, `clean_text`, and `embedding`. We will need to generate content to match each of these fields in our database.

In [3]:
from utils.helpers import strip_text

# strip our document's raw text to clean it up a bit and ensure uniformity
clean_text = strip_text(raw_text)
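For reference, a cleanup helper along these lines might look like the sketch below. This is an assumption about the kind of normalization involved, not the real `utils.helpers.strip_text`; `strip_text_sketch` is a hypothetical name.

```python
import re


def strip_text_sketch(text: str) -> str:
    """Normalize whitespace so chunks look uniform before embedding."""
    # Normalize Windows and old-Mac line endings to "\n"
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces and tabs into a single space
    text = re.sub(r"[ \t]+", " ", text)
    # Trim leading/trailing whitespace on every line
    text = "\n".join(line.strip() for line in text.split("\n"))
    # Collapse three or more newlines into a single paragraph break
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


cleaned_sample = strip_text_sketch("  Jane   Doe \r\n\r\n\r\n\tData\tEngineer  ")
```

Keeping paragraph breaks (double newlines) intact is deliberate: a separator-aware splitter can use them as preferred split points.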

Next, we'll split our large document into workable chunks. Chunking our text not only improves the accuracy and relevance of the information returned by our retrieval mechanism, but also ensures we won't bump up against any token limits when we go to embed our content.

Each split text will represent a row in our database.

In [4]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# instantiate text splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=20)

# split texts 
split_texts = text_splitter.split_text(clean_text)
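To build intuition for `chunk_size` and `chunk_overlap`, here is a plain sliding-window chunker. It is a simplification and not what `RecursiveCharacterTextSplitter` does internally (that splitter prefers to break on separators such as paragraph and line breaks); `sliding_window_chunks` is a hypothetical helper.

```python
def sliding_window_chunks(text: str, chunk_size: int = 300, chunk_overlap: int = 20) -> list:
    """Cut text into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# Tiny sizes so the overlap is easy to see
demo_chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
# demo_chunks == ["abcd", "defg", "ghij"]
```

The last character of each chunk reappears at the start of the next one, which is how overlap helps preserve context that would otherwise be cut at a chunk boundary.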

Once we've cleaned and split our text, we're ready to embed. There are a host of embedding models available (even multi-modal ones if you'd like to include non-text documents in your knowledge base). For this project, we'll use `all-MiniLM-L6-v2`, which is free to use through the `sentence-transformers` library!

In [5]:
from sentence_transformers import SentenceTransformer

# instantiate the model
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# generate content embeddings for each chunk of split text
embeddings = model.encode(split_texts)
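Embeddings pay off at query time, when we compare a query vector against each chunk vector. The sketch below ranks chunks by cosine similarity using only the standard library. The toy three-dimensional vectors are made up for illustration, and `cosine_similarity` and `rank_chunks` are hypothetical helpers, not part of the pipeline.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)


def rank_chunks(query_vector, chunk_vectors):
    """Return (chunk_index, score) pairs, best match first."""
    scores = [
        (index, cosine_similarity(query_vector, vector))
        for index, vector in enumerate(chunk_vectors)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Made-up vectors standing in for real embeddings
toy_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
ranking = rank_chunks([1.0, 0.1, 0.0], toy_vectors)
best_index = ranking[0][0]
```

In the real pipeline the query vector would come from the same model, for example `model.encode(["some question"])[0]`, and the comparison would typically be pushed down into the database rather than done in Python.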

Now it's time to stitch our content together to create full rows for our database. We'll create a list of dictionaries to iterate over and insert into Postgres. Each dictionary represents a row where the `key` is the column name and the `value` is that column's value for the row:
- `document_id` comes from Google Drive and will enable document reconstruction in the future
- `chunk_id` is generated by us and will also play a role in document reconstruction. Think "for each `document_id`, grab all rows, then order by `chunk_id`" to reconstruct the full document
- `tags` are also generated by us, and can be used to enhance our retrieval process
- `clean_text` is a single text chunk generated by the text splitter above
- `embedding` is the vector representation of the `clean_text` field 

In [6]:
# instantiate a list to hold transformed data
transformed_data = []

# populate list
for index, text in enumerate(split_texts):

    # instantiate single record
    record = {}

    # populate dict for this text
    record["document_id"] = os.getenv("RESUME_RAG_DOCUMENT_ID") 
    record["chunk_id"] = index 
    record["tags"] = "resume" 
    record["clean_text"] = text
    record["embedding"] = embeddings[index]
    
    # add to master list 
    transformed_data.append(record)
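The reconstruction idea described above (group rows by `document_id`, then order by `chunk_id`) can be sketched in plain Python. `chunks_by_document` and `toy_records` are hypothetical and use made-up data, not rows from the real database.

```python
from collections import defaultdict


def chunks_by_document(records):
    """Map each document_id to its chunk texts, ordered by chunk_id."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["document_id"]].append(record)
    return {
        doc_id: [row["clean_text"] for row in sorted(rows, key=lambda row: row["chunk_id"])]
        for doc_id, rows in grouped.items()
    }


# Deliberately out of order to show the sort doing its job
toy_records = [
    {"document_id": "doc-a", "chunk_id": 1, "clean_text": "second"},
    {"document_id": "doc-b", "chunk_id": 0, "clean_text": "other"},
    {"document_id": "doc-a", "chunk_id": 0, "clean_text": "first"},
]

ordered = chunks_by_document(toy_records)
```

Because the splitter was configured with a non-zero `chunk_overlap`, simply concatenating the ordered chunks would repeat the overlapping text, so a faithful reconstruction would also need to trim those overlaps.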

## Loading the Data

The final step is iterating over our list of dictionaries and adding them to our Postgres database. Once we verify our data is available in Postgres, we're ready to start querying!

In [7]:
from utils.database import PgClient

# instantiate client
pg_client = PgClient(
    pg_host = os.getenv("PG_HOST"), 
    pg_user = os.getenv("PG_USER"), 
    pg_password = os.getenv("PG_PASSWORD"), 
    pg_db = "resume_rag"
)

# insert data
pg_client.insert_content_embeddings(transformed_data)
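For a sense of what an insert like this involves, here is a sketch that turns one record into parameters for a parameterized `INSERT`. Everything here is an assumption rather than the actual `PgClient` code: the table name `content_embeddings` is hypothetical, the `%s` placeholders assume a DB-API driver such as psycopg2, and the bracketed text form assumes the `embedding` column uses the pgvector extension's `vector` type.

```python
# Hypothetical table name; column names follow the schema described earlier
INSERT_SQL = (
    "INSERT INTO content_embeddings "
    "(document_id, chunk_id, tags, clean_text, embedding) "
    "VALUES (%s, %s, %s, %s, %s)"
)


def to_vector_literal(embedding) -> str:
    """Format a sequence of numbers as pgvector's text form, e.g. '[0.5,-1.0]'."""
    return "[" + ",".join(str(float(value)) for value in embedding) + "]"


def to_insert_params(record: dict) -> tuple:
    """Order a record's values to match the placeholders in INSERT_SQL."""
    return (
        record["document_id"],
        record["chunk_id"],
        record["tags"],
        record["clean_text"],
        to_vector_literal(record["embedding"]),
    )


# Made-up record with a tiny embedding for illustration
example_params = to_insert_params({
    "document_id": "doc-a",
    "chunk_id": 0,
    "tags": "resume",
    "clean_text": "first chunk",
    "embedding": [0.5, -1, 2.25],
})
```

With a live connection, each tuple would be passed to something like `cursor.execute(INSERT_SQL, params)`. Depending on the driver setup, the embedding placeholder may need an explicit `::vector` cast.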