January 6th, 2024  
RAG-bot  
Code Guru: Samantha Roberts 


## Outline
* PDF Plumber
    * Read in the PDF Text (chunk text by page and newline)
    * Read in the PDF Tables (extract tables from the PDF)
    * Generate text embeddings with data in DF (generate text df with embeddings)
    * Generate table embeddings with data in DF (generate table df with embeddings)
    * Combine text and table dataframes (combine text and table)
---
* Weaviate
    * Create a tenant in the Manuals class (add tenant)  
    * Add data to Weaviate cloud (add data)  
---
* In Dashboard
    * Get prompt from user
    * Vectorize prompt
    * Query Weaviate cloud (query weaviate)
    * Format for App output (retrieve formatted)

In [1]:
# this is needed to import the module ragbot 

import sys
sys.path.append('/home/sng/RAG-bot/ragbot') 

In [None]:
from ragbot.config import get_client 
from ragbot import weav
from ragbot import plumb
from ragbot import utils
import pandas as pd

%load_ext autoreload
%autoreload 2

### Initialize the weaviate client

In [12]:
client = get_client()
client.is_ready()

True

### Enter your Weaviate Class name below

In [None]:
WEAVIATE_CLASS = 'Manuals'

### Read in the data with embeddings

In [None]:
df = pd.read_parquet('data/text_and_embeds.parquet', engine='pyarrow')
tenant_list = list(df.filename.unique())
tenant_list

In [None]:
df.info()

In [None]:
list(df.columns)

### Create the class in the weaviate database

In [None]:
weav.create_class(WEAVIATE_CLASS)

In [None]:
weav.get_schema(WEAVIATE_CLASS)

### Create Tenants - 1 for each manual

In [None]:
for tenant in tenant_list:
    weav.add_tenant(tenant, WEAVIATE_CLASS)

In [None]:
weav.write_tenants(WEAVIATE_CLASS)

### Upload the data to Weaviate, 1 tenant at a time

In [None]:
for tenant in tenant_list:
    temp = df[df.filename == tenant]
    print(f'Now adding {tenant}')
    weav.add_pdf_data_objects(temp, WEAVIATE_CLASS, tenant)  # same class the tenants were created in
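
The internals of `weav.add_pdf_data_objects` are not shown in this notebook. As a rough sketch of the per-tenant preparation step, the snippet below groups rows by tenant and separates each embedding from the rest of the row's properties. The helper `group_by_tenant` and the column names `text`, `page_number`, and `embedding` are assumptions for illustration; only `filename` is confirmed by the cells above. With a dataframe, `df.to_dict("records")` produces rows in this shape.

```python
from collections import defaultdict


def group_by_tenant(rows, tenant_key="filename", vector_key="embedding"):
    """Group row dicts by tenant and split each row into (properties, vector).

    Hypothetical helper: the key names are assumptions, not the ragbot API.
    """
    grouped = defaultdict(list)
    for row in rows:
        # Everything except the vector is treated as object properties.
        properties = {k: v for k, v in row.items() if k != vector_key}
        grouped[row[tenant_key]].append((properties, row[vector_key]))
    return dict(grouped)


# Toy rows standing in for df.to_dict("records").
rows = [
    {"filename": "manual_a.pdf", "page_number": 1, "text": "Intro", "embedding": [0.1, 0.2]},
    {"filename": "manual_b.pdf", "page_number": 1, "text": "Setup", "embedding": [0.3, 0.4]},
    {"filename": "manual_a.pdf", "page_number": 2, "text": "Usage", "embedding": [0.5, 0.6]},
]

payloads = group_by_tenant(rows)
for tenant, objects in payloads.items():
    print(f"{tenant}: {len(objects)} objects")
```

Keeping the vector separate from the properties mirrors how vector databases generally store an object: the properties are the searchable metadata, and the vector is attached alongside them.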

## Query Weaviate
1. Get user prompt
2. Send the prompt to the OpenAI embeddings API and extract the vector
3. Send the vector to Weaviate to find nearest neighbors
4. Get the retrieved text with the associated metadata, scores, page_number and filename
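
To make steps 3 and 4 concrete, here is a minimal brute-force sketch of nearest-neighbor retrieval by cosine similarity. It is an illustration only: Weaviate uses an approximate nearest-neighbor index rather than a full scan, and the distance metric is set in the class configuration. The function names and the `embedding` key are assumptions, and the toy 3-dimensional vectors stand in for real embedding vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, records, k):
    """Return the k records most similar to query_vector, best first.

    Each record is a dict with an 'embedding' key (assumed name) plus
    metadata; the embedding is dropped from the result and a score added.
    """
    scored = []
    for rec in records:
        result = {key: val for key, val in rec.items() if key != "embedding"}
        result["score"] = cosine_similarity(query_vector, rec["embedding"])
        scored.append(result)
    return sorted(scored, key=lambda r: r["score"], reverse=True)[:k]


records = [
    {"text": "Reset the device", "page_number": 4,
     "filename": "manual_a.pdf", "embedding": [1.0, 0.0, 0.0]},
    {"text": "Replace the filter", "page_number": 9,
     "filename": "manual_a.pdf", "embedding": [0.0, 1.0, 0.0]},
    {"text": "Factory reset steps", "page_number": 5,
     "filename": "manual_a.pdf", "embedding": [0.9, 0.1, 0.0]},
]

results = top_k([1.0, 0.0, 0.0], records, k=2)
for r in results:
    print(r["page_number"], r["text"], round(r["score"], 3))
```

Each result carries the text, its metadata, and a score, which is the shape described in step 4.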

### Fill in your prompt info below 

In [None]:
prompt = 'type your questions here'
k = 10 # number of retrievals
tenant_name = 'your_tenant_name'
weaviate_class = 'your_class_name'

In [None]:
retrieved_texts = weav.query_weaviate(prompt, k, weaviate_class, tenant_name)
retrieved_texts

### Text Ingestion Pipeline
1. Create a dataframe from the Google Sheet containing the list of PDFs and the corresponding tenant names
2. Scrape the text and tables using PDF Plumber
3. Create the embeddings for the text and tables with the OpenAI embeddings API
4. Store these in a dataframe
5. Write the dataframe to a file (so you do not have to recreate the pickle files)

In [None]:
pdf_path = 'path to your PDF file'
chunk_size = 500 # or other integer

chunked_df = plumb.pdf_to_df(pdf_path, chunk_size)
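
How `plumb.pdf_to_df` chunks text internally is not shown here. The outline describes chunking "by page and newline", so the following is one possible sketch of that idea, assuming `chunk_size` counts characters (it may instead count tokens or words in the real module). The function `chunk_pages` is hypothetical and takes already-extracted page strings, so it runs without a PDF.

```python
def chunk_pages(pages, chunk_size):
    """Split page texts into chunks of at most chunk_size characters.

    pages: list of strings, one per PDF page.
    Returns a list of dicts with 'page_number' (1-based) and 'text'.
    Lines are never split, so a single line longer than chunk_size
    becomes its own oversized chunk. Chunks never span two pages.
    """
    chunks = []
    for page_number, page_text in enumerate(pages, start=1):
        current = ""
        for line in page_text.split("\n"):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            candidate = f"{current}\n{line}" if current else line
            if len(candidate) > chunk_size and current:
                # Adding this line would overflow: close the current chunk.
                chunks.append({"page_number": page_number, "text": current})
                current = line
            else:
                current = candidate
        if current:
            chunks.append({"page_number": page_number, "text": current})
    return chunks


pages = ["Safety notes\nUnplug before service\nWear gloves", "Warranty terms"]
chunks = chunk_pages(pages, chunk_size=35)
for chunk in chunks:
    print(chunk["page_number"], repr(chunk["text"]))
```

Keeping chunks within a single page means every retrieved chunk maps to exactly one `page_number`, which keeps the metadata returned at query time unambiguous.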

### Configure and upload to Weaviate
1. Create the tenants for each of the PDFs
2. Upload each of the embeddings with the metadata to Weaviate cloud
3. Verify the upload in the dashboard