# Parse the PDFs, make a vector database

In [1]:
import pdfplumber
from os import listdir
from os.path import isfile, join

In [2]:
class Chunk:
    def __init__(self, id, pdf_name, text):
        self.metadata = {
            "id": id,
            "pdf_source": pdf_name
        }
        self.text = text

In [3]:
folder_path = "./demo-pdfs/"

# list every file in the folder; each one is parsed as a PDF below
pdfs = [pdf for pdf in listdir(folder_path) if isfile(join(folder_path, pdf))]

# list of Chunks
text_chunks = []

counter = 0

for pdf_path in pdfs:
    with pdfplumber.open(join(folder_path, pdf_path)) as pdf:
        # remove the ".pdf" extension to get the PDF name; rstrip(".pdf") would
        # also strip trailing "p", "d", "f" and "." characters from the name itself
        pdf_name = pdf_path.removesuffix(".pdf")  # requires Python 3.9+
        for page in pdf.pages:
            # pages without a text layer may yield an empty result
            page_text = page.extract_text() or ""
            # split the page text into one chunk per line
            for chunk in page_text.split('\n'):
                # skip blank lines, which carry nothing to embed
                if not chunk.strip():
                    continue
                # wrap each line in a Chunk with a running id
                text_chunks.append(Chunk(counter, pdf_name, chunk))
                counter += 1

text_chunks

[<__main__.Chunk at 0x7f4cc1bdc590>,
 <__main__.Chunk at 0x7f4cc1bdc7d0>,
 <__main__.Chunk at 0x7f4cd807bcb0>,
 <__main__.Chunk at 0x7f4cc178b0e0>,
 <__main__.Chunk at 0x7f4cc178b140>,
 <__main__.Chunk at 0x7f4cc178b170>,
 <__main__.Chunk at 0x7f4cc178b1a0>,
 <__main__.Chunk at 0x7f4cc178b1d0>,
 <__main__.Chunk at 0x7f4cc178b200>,
 <__main__.Chunk at 0x7f4cc178b230>,
 <__main__.Chunk at 0x7f4cc178b290>,
 <__main__.Chunk at 0x7f4cc178b2c0>,
 <__main__.Chunk at 0x7f4cc178b2f0>,
 <__main__.Chunk at 0x7f4cc178b260>,
 <__main__.Chunk at 0x7f4cc178b350>,
 <__main__.Chunk at 0x7f4cc178b380>,
 <__main__.Chunk at 0x7f4cc178b3b0>,
 <__main__.Chunk at 0x7f4cc178b3e0>,
 <__main__.Chunk at 0x7f4cc178b410>,
 <__main__.Chunk at 0x7f4cc178b440>,
 <__main__.Chunk at 0x7f4cc178b470>,
 <__main__.Chunk at 0x7f4cc178b4a0>,
 <__main__.Chunk at 0x7f4cc178b4d0>,
 <__main__.Chunk at 0x7f4cc178b320>,
 <__main__.Chunk at 0x7f4cc178b110>,
 <__main__.Chunk at 0x7f4cc178b5c0>,
 <__main__.Chunk at 0x7f4cc178b620>,
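
The default repr above only shows memory addresses, which makes the parsed chunks hard to inspect. A minimal sketch of a more readable variant, assuming a dataclass is acceptable here; `ReadableChunk` and the example PDF name are hypothetical and not part of the notebook:

```python
from dataclasses import dataclass, field


@dataclass
class ReadableChunk:
    """Hypothetical variant of Chunk whose repr shows id, source and text."""
    id: int
    pdf_name: str
    text: str
    # Built in __post_init__ and hidden from the repr to keep it short.
    metadata: dict = field(init=False, repr=False)

    def __post_init__(self):
        # Same metadata layout as the Chunk class above.
        self.metadata = {"id": self.id, "pdf_source": self.pdf_name}


# "example-paper" is an illustrative name, not one of the demo PDFs.
example = ReadableChunk(0, "example-paper", "3. Prior to electrophoresis, leave the gel at room temperature")
print(example)
```

Because the `metadata` attribute keeps the same shape, such a class could be dropped into the parsing loop without changing the later cells.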
 

## Use the vectordb library to build the vector database

In [4]:
# if importing is an issue, try running this:
# pip install --upgrade tensorflow_hub
from vectordb import Memory

# Memory is where all content you want to store/search goes.
memory = Memory()

memory_texts = [text_chunk.text for text_chunk in text_chunks]

# list of metadata dicts (id and PDF source), one per chunk
memory_metadata = [text_chunk.metadata for text_chunk in text_chunks]

memory.save(
    texts=memory_texts,
    metadata=memory_metadata
)


Initiliazing embeddings:  normal
OK.


In [5]:
# Search for top n relevant results, automatically using embeddings
query = "Give me a step-by-step protocol for running native gel electrophoresis."
results = memory.search(query, top_n = 100, unique=True)

# join the text results into one long string (no separator, so adjacent chunks run together)
combined_context = ''.join([result['chunk'] for result in results])

combined_context

'amide gel electrophoresis under such “native” conditions, a tech-250 on native PAGE gels following electrophoretic mobility shift assays. The method ispolyacrylamide gel electrophoresis. Proc Natl Acad Sci USA dermeeren M, Mercken M, Luo J, Sweet RW, Gilliland GLpolyacrylamide gel electrophoresis (Figure S3).Biji T. Kurien and R. Hal Scofi eld (eds.), Western Blotting: Methods and Protocols, Methods in Molecular Biology,Biji T. Kurien and R. Hal Scofield (eds.), Protein Gel Detection and Imaging: Methods and Protocols, Methods in Molecular Biology,R. New methods based on capillary electrophoresis for in vitro3. Prior to electrophoresis, leave the gel at room temperatureElectrophoresis 24:1347–1352assays [ 1 , 13 – 15 , 17 ] before performing the electrophoreticImproved staining of proteins in polyacrylamide sulfate-polyacrylamide gel electrophoresis. Bull(1979) A new electrophoretic method for theElectrophoresis Separationresolution gels (possibly certifi ed for molecular biology, ide
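
The output above runs chunks together (for example `tech-250 on native PAGE`) because they are joined with an empty string. A possible sketch of a helper that joins with a separator and stops at a character budget; `build_context`, `separator` and `max_chars` are hypothetical names, and the budget of 4000 characters is an arbitrary assumption. Only the `'chunk'` key comes from the search results used above.

```python
def build_context(results, separator="\n", max_chars=4000):
    """Join result chunks with a separator, stopping at a character budget."""
    parts = []
    used = 0
    for result in results:
        chunk = result["chunk"].strip()
        if not chunk:
            continue  # ignore empty or whitespace-only chunks
        # The separator only costs characters once there is a previous part.
        cost = len(chunk) + (len(separator) if parts else 0)
        if used + cost > max_chars:
            break
        parts.append(chunk)
        used += cost
    return separator.join(parts)


# Stand-in data shaped like the search results (dicts with a 'chunk' key).
sample_results = [
    {"chunk": "amide gel electrophoresis under such native conditions, a tech-"},
    {"chunk": "   "},
    {"chunk": "250 on native PAGE gels following electrophoretic mobility shift assays."},
]
print(build_context(sample_results))
```

Keeping chunk boundaries visible should make it easier for a small model to tell separate fragments apart, and the budget keeps the system prompt from growing with `top_n`.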

# Run the query on the locally hosted DeepSeek model

In [9]:
# make a call to the locally hosted DeepSeek model
from ollama import chat, ChatResponse

query = "Give me a step-by-step protocol for running native gel electrophoresis."

background_prompt = (
    "You are a lab assistant and have read research papers about the topic the researcher is asking you about. "
    "In particular, you recall the following information: " + combined_context + "\n\n"
)
background_prompt += (
    "Please respond to the following user query in succinct bullet points, using 3-5 sentences in total. "
    "Also, please limit the thinking phase to a few sentences, with a maximum of 10 sentences."
)

# execute the request
response: ChatResponse = chat(model='deepseek-r1:1.5b', messages=[
    {
        'role': 'system',
        'content': background_prompt
    },
    {
        'role': 'user',
        'content': query
    },
])

print(response.message.content)

<think>
Alright, so I need to figure out how to run native gel electrophoresis properly. I'm not very experienced with this, but I know it's used for separating proteins in biological samples. Let me think through the steps.

First, I remember something about the standard buffer. That makes sense because if you use the wrong buffer, your proteins might denature or stick together. So maybe I should start by thinking about what the right buffer is. I think it has to have phosphates for ionization and nucleotides for the DNA component. Oh right, so phosphoric acids with phosphate groups and nucleotide bases like ADP, ATP, ribose.

Next, I need a pH level. Native gels are usually done at around pH 7.2, which is slightly basic. That's where proteins stay in their native form. So I should probably set the pH correctly before adding anything else.

Now, about the electrophoresis machine. I think it has different components: a motor for moving the sample, a current source, a pulldown column or

# Performance optimization: cache the results into a database, use that database in all future queries
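
One reading of this note is to store each answer keyed by its query, so a repeated query skips both the vector search and the model call. A minimal sketch under that assumption, using SQLite from the standard library; `QueryCache`, `cached_answer` and the `answers` table are hypothetical names, and `compute` stands in for whatever function runs the search and the chat call.

```python
import hashlib
import sqlite3


class QueryCache:
    """Hypothetical answer cache backed by a SQLite table."""

    def __init__(self, path=":memory:"):
        # Pass a file path instead of ":memory:" to persist across sessions.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS answers ("
            "key TEXT PRIMARY KEY, query TEXT, answer TEXT)"
        )

    @staticmethod
    def _key(query):
        # Normalize case and whitespace so trivial variations share one entry.
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, query):
        row = self.conn.execute(
            "SELECT answer FROM answers WHERE key = ?", (self._key(query),)
        ).fetchone()
        return row[0] if row else None

    def put(self, query, answer):
        self.conn.execute(
            "INSERT OR REPLACE INTO answers VALUES (?, ?, ?)",
            (self._key(query), query, answer),
        )
        self.conn.commit()


def cached_answer(cache, query, compute):
    """Return the cached answer, calling compute(query) only on a miss."""
    answer = cache.get(query)
    if answer is None:
        answer = compute(query)
        cache.put(query, answer)
    return answer
```

An exact-match cache like this only helps for repeated queries; matching paraphrased queries would need a similarity lookup, which is not covered here.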

# Optional: add support to immediately load up PDF text

In [None]:
""" 
Flow of communication:
1. Ask to load up a particular PDF (or just ask for which PDFs are available)
2. Ask which page of the PDF to show (also report how many pages the PDF has)
3. Ask for summaries of a particular page in the PDF
4. Return the text of the PDF (scrolling element) - front end mostly
"""
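
The back-end part of steps 1, 2 and 4 above could be sketched as a small lookup object. This is an illustration only: `PdfLibrary` and its methods are hypothetical, and the page texts are passed in as plain strings (for instance, collected from `page.extract_text()` during parsing) so the sketch does not depend on any PDF library. Step 3, summarization, would go through the model and is left out.

```python
class PdfLibrary:
    """Hypothetical in-memory store: PDF name -> list of page texts."""

    def __init__(self, pages_by_pdf):
        self._pages = {name: list(pages) for name, pages in pages_by_pdf.items()}

    def available(self):
        # Step 1: which PDFs can be loaded.
        return sorted(self._pages)

    def page_count(self, pdf_name):
        # Step 2: how many pages the chosen PDF has.
        return len(self._pages[pdf_name])

    def page_text(self, pdf_name, page_number):
        # Step 4: return the text of one page (page numbers start at 1).
        pages = self._pages[pdf_name]
        if not 1 <= page_number <= len(pages):
            raise ValueError(
                f"{pdf_name} has {len(pages)} pages, got page {page_number}"
            )
        return pages[page_number - 1]


# Illustrative data; the names and texts are placeholders.
library = PdfLibrary({"paper-a": ["first page", "second page"], "paper-b": ["only page"]})
print(library.available(), library.page_count("paper-a"))
```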

# Optional: add database support

In [None]:
""" 
Flow of communication:
1. Ask to retrieve a particular table (based on ID/name) - front end
2. Retrieve the list of column names, return those
3. Input one entry at a time (listing out the different attributes) - front end
4. Delete previous entry/undo - front end/back end
"""
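
The back-end side of this flow (column names, single-entry insert, undo) could look roughly like the sketch below. It assumes SQLite, which the notebook does not specify; the function names and the `samples` table are hypothetical.

```python
import sqlite3


def column_names(conn, table):
    """Step 2: return the column names of an existing table."""
    known = {row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")}
    if table not in known:
        raise ValueError(f"unknown table: {table}")
    # Table names cannot be bound as parameters, hence the check above.
    return [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]


def insert_entry(conn, table, entry):
    """Step 3: insert one entry (a dict of column -> value), return its rowid."""
    columns = column_names(conn, table)
    if not entry or set(entry) - set(columns):
        raise ValueError("entry must use at least one known column and no unknown ones")
    names = list(entry)
    quoted = ", ".join(f'"{name}"' for name in names)
    placeholders = ", ".join("?" for _ in names)
    cursor = conn.execute(
        f'INSERT INTO "{table}" ({quoted}) VALUES ({placeholders})',
        [entry[name] for name in names],
    )
    conn.commit()
    return cursor.lastrowid


def undo_entry(conn, table, rowid):
    """Step 4: delete a previously inserted entry by rowid."""
    column_names(conn, table)  # validates the table name
    conn.execute(f'DELETE FROM "{table}" WHERE rowid = ?', (rowid,))
    conn.commit()


# Illustrative usage with an in-memory database and a made-up table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (name TEXT, volume_ul REAL)")
last_id = insert_entry(conn, "samples", {"name": "lysate", "volume_ul": 20.0})
```

Returning the rowid from `insert_entry` gives the front end a handle it can pass back to `undo_entry`.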
