# `ResumeRAG` Pipeline Testing

**Goal:** Test individual pipeline components here before turning them into scripts and, eventually, an API.

## Extracting Data

For this example, all data comes from a Google Doc, accessed through a Google API service account and the associated Python API client.

In [1]:
import sys

sys.path.append("/Users/srmarshall/Desktop/code/personal/resume-rag/")

In [2]:
import os 
from utils.google import GoogleDocClient

# instantiate a client
docs_client = GoogleDocClient(
    service_account_json="../credentials.json", 
    scopes=['https://www.googleapis.com/auth/documents.readonly']
)

# fetch document
response = docs_client.fetch_document(document_id=os.getenv("RESUME_RAG_DOCUMENT_ID"))

# extract text 
raw_text = docs_client.extract_text(google_doc_repsonse=response)
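The body of `GoogleDocClient.extract_text` is not shown in this notebook. As a rough sketch of what such a step might involve, assuming the response follows the Google Docs API document resource layout (`body.content` → `paragraph.elements` → `textRun.content`), text can be collected like this. The function name `extract_text_from_doc` is hypothetical and is not the project's helper.

```python
def extract_text_from_doc(document: dict) -> str:
    """Concatenate the text runs found in a Google Docs API document resource.

    Hypothetical stand-in for GoogleDocClient.extract_text; the real helper
    may handle tables, lists, or other elements differently.
    """
    pieces = []
    for structural_element in document.get("body", {}).get("content", []):
        paragraph = structural_element.get("paragraph")
        if paragraph is None:
            # Skip non-paragraph elements such as section breaks or tables.
            continue
        for element in paragraph.get("elements", []):
            text_run = element.get("textRun")
            if text_run is not None:
                pieces.append(text_run.get("content", ""))
    return "".join(pieces)


# Tiny hand-built document used only for illustration.
sample_document = {
    "body": {
        "content": [
            {"sectionBreak": {}},
            {"paragraph": {"elements": [{"textRun": {"content": "Jane Doe\n"}}]}},
            {"paragraph": {"elements": [
                {"textRun": {"content": "Data "}},
                {"textRun": {"content": "Engineer\n"}},
            ]}},
        ]
    }
}
sample_text = extract_text_from_doc(sample_document)
```

This version only reads paragraph text runs, so content inside tables would be skipped.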

## Transforming Data

Let's clean up our raw text a bit to prepare for the embedding step. We'll remove 

In [3]:
from utils.helpers import strip_text

# strip our document's raw text
clean_text = strip_text(raw_text)
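`strip_text` lives in `utils.helpers` and its body is not shown here. As an illustration of what a cleanup step like this might do, the sketch below assumes it only normalizes whitespace. That is an assumption, and `strip_text_sketch` is a hypothetical name rather than the actual helper.

```python
import re


def strip_text_sketch(text: str) -> str:
    """Normalize whitespace in raw document text (illustrative only)."""
    # Use a single newline style.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces and tabs into one space.
    text = re.sub(r"[ \t]+", " ", text)
    # Trim leading and trailing whitespace on every line.
    text = "\n".join(line.strip() for line in text.split("\n"))
    # Collapse three or more newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


cleaned_sample = strip_text_sketch("  Hello \t world \r\n\r\n\r\n\r\nNext  line  ")
```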

Next, we'll split our large document into workable chunks. Chunking not only improves retrieval accuracy, but also ensures we won't bump up against any token limits when we embed our content.

In [4]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# instantiate text splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=20)

# split texts 
split_texts = text_splitter.split_text(clean_text)
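To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window chunker in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraph breaks, newlines, and spaces); it only illustrates how a size limit and an overlap interact. The name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 300, chunk_overlap: int = 20) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Simplified stand-in for illustration; it cuts at exact character
    offsets and ignores word or paragraph boundaries.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


example_chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
```

Each chunk shares its first `chunk_overlap` characters with the end of the previous one, which helps keep context that would otherwise be cut at a chunk boundary.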

Once we've cleaned and split our text, we're ready to embed. There are a host of embedding models available (even multimodal ones, if you'd like to include non-text documents in your knowledge base). For this project, we'll use `all-MiniLM-L6-v2`.

In [5]:
from sentence_transformers import SentenceTransformer

# instantiate the model
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# generate embeddings using the model
embeddings = model.encode(split_texts)

In [6]:
# instantiate a list to hold transformed data
transformed_data = []

# populate list
for index, text in enumerate(split_texts):
    # instantiate single record
    record = {}

    # populate dict for this text
    record["document_id"] = os.getenv("RESUME_RAG_DOCUMENT_ID")
    record["chunk_id"] = index
    record["tags"] = ["resume"]
    record["clean_text"] = text
    record["embedding"] = embeddings[index]
    
    # add to master list 
    transformed_data.append(record)
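The loop above can also be wrapped in a small function. The sketch below pairs each chunk with its vector using `zip` and converts every vector to a plain list of floats, on the assumption that the database layer may not accept NumPy arrays directly (`PgClient` may already handle this; its code is not shown). The name `build_records` is hypothetical.

```python
def build_records(document_id, chunks, embeddings, tags=("resume",)):
    """Build one record per chunk, pairing each chunk with its embedding."""
    if len(chunks) != len(embeddings):
        raise ValueError("chunks and embeddings must have the same length")

    return [
        {
            "document_id": document_id,
            "chunk_id": index,
            "tags": list(tags),
            "clean_text": text,
            # Works for NumPy rows and plain lists alike.
            "embedding": [float(value) for value in vector],
        }
        for index, (text, vector) in enumerate(zip(chunks, embeddings))
    ]


# Illustrative call with made-up two-dimensional vectors.
example_records = build_records("doc-1", ["first", "second"], [[0.1, 0.2], [0.3, 0.4]])
```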

## Loading the Data

In [18]:
from utils.database import PgClient

# instantiate client
pg_client = PgClient(
    pg_host = os.getenv("PG_HOST"), 
    pg_user = os.getenv("PG_USER"), 
    pg_password = os.getenv("PG_PASSWORD"), 
    pg_db = "resume_rag"
)

# insert data
pg_client.insert_content_embeddings(transformed_data)
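`PgClient.insert_content_embeddings` is defined in `utils.database` and is not shown here. As a sketch of what an insert like this might look like, the code below assumes the pgvector extension, a hypothetical `content_embeddings` table, and psycopg-style `%s` placeholders. None of these are confirmed by this notebook.

```python
# Hypothetical table and column names; adjust to the real schema.
INSERT_SQL = """
INSERT INTO content_embeddings (document_id, chunk_id, tags, clean_text, embedding)
VALUES (%s, %s, %s, %s, %s::vector)
"""


def to_vector_literal(vector) -> str:
    """Format a sequence of numbers in pgvector's text form, e.g. '[0.5,1.0]'."""
    return "[" + ",".join(repr(float(value)) for value in vector) + "]"


def build_insert_params(record: dict) -> tuple:
    """Turn one transformed record into a parameter tuple for INSERT_SQL."""
    return (
        record["document_id"],
        record["chunk_id"],
        record["tags"],
        record["clean_text"],
        to_vector_literal(record["embedding"]),
    )


# Illustrative record with a made-up three-dimensional vector.
example_params = build_insert_params({
    "document_id": "doc-1",
    "chunk_id": 0,
    "tags": ["resume"],
    "clean_text": "Jane Doe",
    "embedding": [0.5, 1, -2.25],
})
```

With an open psycopg cursor, the parameter tuples could then be passed to `cursor.executemany(INSERT_SQL, [build_insert_params(r) for r in transformed_data])`.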