## Use PDF documents with RAG
This tutorial downloads a set of PDF files from the Analog Devices website; you can substitute your own documents later. The documents serve as our RAG knowledge base, so that we can ask the LLM questions about the information they contain.

This tutorial covers four steps:
1. Download the documents.
2. Extract text from the documents and split it into chunks.
3. Vectorize the text chunks so the vector database can search them efficiently (semantic search).
4. Load the vectors and text chunks into the database.
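
To make the idea of semantic search in step 3 concrete, here is a minimal sketch that ranks text chunks by cosine similarity to a query vector. The three-dimensional vectors and chunk labels are made up for illustration; a real embedding model produces much larger vectors, and the vector database performs this ranking for us.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_chunks(query, chunks):
    """Return chunk labels ordered from most to least similar to the query."""
    return sorted(
        chunks,
        key=lambda name: cosine_similarity(query, chunks[name]),
        reverse=True,
    )

# Hypothetical toy "embeddings" (assumption: not produced by any real model)
toy_chunks = {
    "compiler optimization switches": [0.9, 0.1, 0.0],
    "UART register description": [0.1, 0.9, 0.2],
    "linker memory map": [0.2, 0.3, 0.9],
}
query_vector = [0.8, 0.2, 0.1]  # pretend embedding of a question about optimization

ranked = rank_chunks(query_vector, toy_chunks)
print(ranked[0])
```

The chunk whose vector points in nearly the same direction as the query ranks first, regardless of whether the two texts share any exact words.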

### Document download
We will download the documents linked from [this page](https://www.analog.com/en/lp/001/blackfin-manuals.html). We fetch the page with `wget`, use BeautifulSoup to find the PDF links, and download each PDF with `requests`.

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from tqdm import tqdm
from urllib.parse import urlparse
import os
import subprocess

url = "https://www.analog.com/en/lp/001/blackfin-manuals.html"
file_name = "blackfin-manuals.html"
directory = "docs2"

try:
    # download the index page; check=True raises CalledProcessError if wget fails
    subprocess.run(["wget", "-O", file_name, url], check=True)
    #subprocess.run(["curl", "-L", "--tlsv1.2", "-O", url], check=True)
    parsed_url = urlparse(url)
    file_name = os.path.basename(parsed_url.path)
    # open the file
    with open(file_name, 'r', encoding='utf-8') as file:
        file_content = file.read()
    
    # parse the webpage
    soup = BeautifulSoup(file_content, "html.parser")
    
    # find pdf links
    pdf_links = []
    for link in soup.find_all("a", href=True):
        href = link["href"]
        if href.endswith(".pdf"):
            pdf_links.append(href)

    # make directory to store the pdfs in
    os.makedirs(directory, exist_ok = True)
    
    # download pdfs
    for cur_pdf in tqdm(pdf_links, total=len(pdf_links), desc="Downloading", unit="files"):
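        response = requests.get(cur_pdf)
        # raise HTTPError on a 4xx/5xx response so the handler below can report it
        response.raise_for_status()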
        # get file name
        parsed_url = urlparse(cur_pdf)
        file_name = directory + "/" + os.path.basename(parsed_url.path)     
        with open(file_name, "wb") as file:
            file.write(response.content)

except requests.exceptions.HTTPError as http_err:
    print(f"HTTP error occurred: {http_err}")
except Exception as err:
    print(f"An error occurred: {err}")

--2024-11-06 10:37:58--  https://www.analog.com/en/lp/001/blackfin-manuals.html
Resolving www.analog.com (www.analog.com)... 23.197.199.106
Connecting to www.analog.com (www.analog.com)|23.197.199.106|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘blackfin-manuals.html’

     0K .......... .......... .......... .......... .......... 3.26M
    50K .......... .......... .......... .......... .......... 16.2M
   100K .......... .......... ..                               3.38M=0.02s

2024-11-06 10:37:58 (4.88 MB/s) - ‘blackfin-manuals.html’ saved [125639]

Downloading: 100%|███████████████████████████| 91/91 [00:12<00:00,  7.13files/s]
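
The cell above assumes that every `href` is an absolute URL ending in `.pdf`. If a page uses relative links or appends query strings, a slightly more defensive extraction may help. The following is a minimal sketch using only the standard library (`html.parser` instead of BeautifulSoup); the class name `PdfLinkParser`, the sample HTML, and the `example.com` URLs are hypothetical.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class PdfLinkParser(HTMLParser):
    """Collect absolute URLs of all <a href> targets whose path ends in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Compare only the path, case-insensitively, so "?rev=2" or ".PDF" still match
        if href and urlparse(href).path.lower().endswith(".pdf"):
            # urljoin leaves absolute URLs unchanged and resolves relative ones
            self.pdf_links.append(urljoin(self.base_url, href))

# Made-up sample page for illustration
sample_html = """
<a href="/media/en/manual_a.pdf">Manual A</a>
<a href="https://example.com/docs/manual_b.PDF?rev=2">Manual B</a>
<a href="/en/index.html">Home</a>
"""

parser = PdfLinkParser("https://example.com/en/lp/page.html")
parser.feed(sample_html)

# Same file-name derivation as in the cell above: basename of the URL path
local_names = [os.path.basename(urlparse(u).path) for u in parser.pdf_links]
print(parser.pdf_links)
print(local_names)
```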


### Break documents into text chunks

In [11]:
print(directory)

docs2


In [13]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders.pdf import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

# load the documents
def load_documents():
    document_loader = DirectoryLoader(directory, show_progress=True, loader_cls=PyPDFLoader)
    return document_loader.lazy_load()

# split documents into manageable chunks
def split_documents(documents: list[Document]):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size = 800,
        chunk_overlap = 80,
        length_function = len,
        is_separator_regex=False,
    )
    return text_splitter.split_documents(documents)
    
documents = load_documents()
#print("Loaded " + str(len(documents)) + " documents")

chunks = split_documents(documents)
print(chunks[0])




100%|███████████████████████████████████████████| 91/91 [09:06<00:00,  6.00s/it]


page_content='a
W 5.0
C/C++ Compiler and Library Manual
 for Blackfin® Processors
Revision 5.4, January 2011
Part Number
82-000410-03
Analog Devices, Inc.
One T echnology Way
Norwood, Mass. 02062-9106' metadata={'source': 'docs2/50_bf_cc_rtl_mn_rev_5.4.pdf', 'page': 0}
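
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch of fixed-size chunking with overlap. It is not how `RecursiveCharacterTextSplitter` works internally: that splitter also prefers to break on separators such as paragraph and line breaks, which this sketch ignores. The function name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into pieces of at most chunk_size characters,
    where each piece repeats the last chunk_overlap characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current chunk already reaches the end of the text
        start += step
    return chunks

pieces = chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(pieces)
```

The overlap means that a sentence cut at a chunk boundary still appears, at least partly, in both neighboring chunks, which reduces the chance that a relevant passage is lost to an unlucky split.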


In [14]:
print (chunks[330])

page_content='switch turns on optimization, and -O0 turns off all optimizations.
Invoke this switch by selecting the Enable optimization  check box in the 
Project Options  dialog box ( Compile : General page).
-Oa
The -Oa (automatic function inlining) sw itch enables the inline expansion 
of C/C++ functions, which are not necess arily declared inline in the source 
code. The amount of auto-inlining the compiler performs is controlled 
using the –Ov (optimize for speed versus size) switch ( on page 1-61 ). 
Therefore, the use of -Ov100 indicates that as many functions as possible 
will be auto-inlined, whereas –Ov0 prevents any function from being 
auto-inlined. Specifying -Oa implies the use of -O.
Invoke this switch with the Automatic option button located in the' metadata={'source': 'docs2/50_bf_cc_rtl_mn_rev_5.4.pdf', 'page': 117}


Let's check how many chunks were generated from our documents...

In [15]:
len(chunks)

68448
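
Beyond the total, it can be useful to see how the chunks are distributed across the source documents. The following is a minimal sketch that groups chunks by their `source` metadata; `FakeDocument` is a hypothetical stand-in for the LangChain `Document` class so that the example runs on its own, and the sample paths are made up.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class FakeDocument:
    """Stand-in (assumption) with the same two attributes used in this tutorial."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def chunks_per_source(chunks):
    """Count how many chunks came from each source file."""
    return Counter(chunk.metadata["source"] for chunk in chunks)

sample_chunks = [
    FakeDocument("first chunk", {"source": "docs2/a.pdf", "page": 0}),
    FakeDocument("second chunk", {"source": "docs2/a.pdf", "page": 1}),
    FakeDocument("third chunk", {"source": "docs2/b.pdf", "page": 0}),
]

counts = chunks_per_source(sample_chunks)
for source, count in counts.most_common():
    print(f"{source}: {count}")
```

The same function should work on the real `chunks` list, since it only relies on the `metadata['source']` key shown in the output above.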

In [18]:
print(chunks[330].page_content)
print("Source: " + chunks[330].metadata['source'])
print("Page: " + str(chunks[330].metadata['page']))

switch turns on optimization, and -O0 turns off all optimizations.
Invoke this switch by selecting the Enable optimization  check box in the 
Project Options  dialog box ( Compile : General page).
-Oa
The -Oa (automatic function inlining) sw itch enables the inline expansion 
of C/C++ functions, which are not necess arily declared inline in the source 
code. The amount of auto-inlining the compiler performs is controlled 
using the –Ov (optimize for speed versus size) switch ( on page 1-61 ). 
Therefore, the use of -Ov100 indicates that as many functions as possible 
will be auto-inlined, whereas –Ov0 prevents any function from being 
auto-inlined. Specifying -Oa implies the use of -O.
Invoke this switch with the Automatic option button located in the
Source: docs2/50_bf_cc_rtl_mn_rev_5.4.pdf
Page: 117


Let's save the chunks in case we need them again...

In [19]:
import pickle

with open("docs2/text_chunks.pkl", "wb") as file:  # 'wb' means write in binary mode
    pickle.dump(chunks, file)
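
Loading the chunks back later is the mirror image of the cell above. The following round-trip sketch uses a temporary directory and plain dictionaries as made-up sample data so that it runs on its own. For the real file, open `docs2/text_chunks.pkl` in `"rb"` mode instead, and note that LangChain must be importable for the `Document` objects to unpickle.

```python
import os
import pickle
import tempfile

# Made-up stand-in data; the real list holds LangChain Document objects
sample_chunks = [
    {"page_content": "example text", "metadata": {"source": "docs2/example.pdf", "page": 0}},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "text_chunks.pkl")

    with open(path, "wb") as file:  # write in binary mode
        pickle.dump(sample_chunks, file)

    with open(path, "rb") as file:  # read in binary mode
        restored_chunks = pickle.load(file)

print(len(restored_chunks))
```

Only unpickle files you created yourself or otherwise trust, because loading a pickle can execute arbitrary code.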

### Embedding the chunks
Now we need to embed the chunks and store them in the database.

As you can see, each chunk is a [LangChain Document](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html#langchain_core.documents.base.Document) object.
Each Document has two attributes:
* ```page_content``` - the text of the chunk
* ```metadata``` - a dictionary which, in our case, contains the source document path and the page on which the chunk appeared.

We need to store this information in our vector database, Weaviate, so we start with a very simple schema containing three properties:
* chunk_content
* chunk_document_name
* chunk_document_page
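
The mapping from a Document to these three properties can be sketched as a small helper. `ChunkStandIn` is a hypothetical stand-in for the LangChain `Document` class and `to_properties` is an illustrative name; neither is part of LangChain or Weaviate.

```python
from dataclasses import dataclass, field

@dataclass
class ChunkStandIn:
    """Stand-in (assumption) exposing page_content and metadata like a Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def to_properties(chunk):
    """Map one chunk onto the three schema properties listed above."""
    return {
        "chunk_content": chunk.page_content,
        "chunk_document_name": chunk.metadata["source"],
        # the page is stored as an integer property in the schema
        "chunk_document_page": int(chunk.metadata["page"]),
    }

example = ChunkStandIn(
    page_content="Specifying -Oa implies the use of -O.",
    metadata={"source": "docs2/50_bf_cc_rtl_mn_rev_5.4.pdf", "page": 117},
)
properties = to_properties(example)
print(properties)
```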

Weaviate will handle the embedding for us, using the vectorizer module we specify when creating the collection (OpenAI).

In [20]:
import weaviate.classes.config as wc
import weaviate
import os

# Read the OpenAI API key from the OPENAI_API_KEY environment variable
headers = {
    "X-OpenAI-Api-Key": os.getenv("OPENAI_API_KEY")
}

client = weaviate.connect_to_local(headers=headers)

client.collections.create(
    name="ADI_DOCS",
    properties=[
        wc.Property(name="chunk_content", data_type=wc.DataType.TEXT),
        wc.Property(name="chunk_document_name", data_type=wc.DataType.TEXT),
        wc.Property(name="chunk_document_page", data_type=wc.DataType.INT),
    ],
    # Define the vectorizer module
    vectorizer_config=wc.Configure.Vectorizer.text2vec_openai(),
    # Define the generative module
    generative_config=wc.Configure.Generative.openai()
)

client.close()

            Sleeping for 62 seconds.
            Sleeping for 62 seconds.
            Sleeping for 62 seconds.
            Sleeping for 62 seconds.
            Sleeping for 62 seconds.
            Sleeping for 124 seconds.
            Sleeping for 62 seconds.
            Sleeping for 62 seconds.
            Sleeping for 124 seconds.


#### Load Weaviate with the data!

In [21]:
from tqdm import tqdm
from weaviate.util import generate_uuid5

try:    
    # connect to database
    client = weaviate.connect_to_local(headers=headers)
           
    # Get the collection
    adi_docs = client.collections.get("ADI_DOCS")
    
    # Enter context manager
    with adi_docs.batch.dynamic() as batch:
        # Loop through the chunks; i is a 1-based running index
        for i, chunk in enumerate(tqdm(chunks, total=len(chunks)), start=1):
            source = chunk.metadata['source']
            page = chunk.metadata['page']

            chunk_obj = {
                "chunk_content": chunk.page_content,
                "chunk_document_name": source,
                "chunk_document_page": page,
            }

            # Deterministic seed (document, page, running index), so that
            # re-running the import produces the same UUIDs
            seed = source + ":" + str(page) + ":" + str(i)
    
            # Add object to batch queue
            batch.add_object(
                properties=chunk_obj,
                uuid=generate_uuid5(seed)
                # references=reference_obj  # You can add references here
            )
            # Batcher automatically sends batches
    
    # Check for failed objects
    if len(adi_docs.batch.failed_objects) > 0:
        print(f"Failed to import {len(adi_docs.batch.failed_objects)} objects")
finally:
    client.close()

100%|████████████████████████████████████| 68448/68448 [11:15<00:00, 101.32it/s]
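
The import cell derives each object's UUID from a seed string, so the same chunk receives the same ID on every run. The sketch below illustrates that property with the standard library's `uuid.uuid5` as a stand-in for `generate_uuid5`. The namespace used here is an assumption, so the values it prints are not guaranteed to match the ones Weaviate generates. The helper names are hypothetical.

```python
import uuid

def make_seed(source, page, index):
    """Build a seed from the document path, page and running index."""
    return f"{source}:{page}:{index}"

def seed_to_uuid(seed):
    """Name-based (version 5) UUID: the same seed always maps to the same UUID."""
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, seed))

seed_a = make_seed("docs2/50_bf_cc_rtl_mn_rev_5.4.pdf", 117, 331)
seed_b = make_seed("docs2/50_bf_cc_rtl_mn_rev_5.4.pdf", 117, 332)

uuid_a_first = seed_to_uuid(seed_a)
uuid_a_second = seed_to_uuid(seed_a)
uuid_b = seed_to_uuid(seed_b)

print(uuid_a_first == uuid_a_second)  # same seed, same UUID
print(uuid_a_first == uuid_b)         # different seed, different UUID
```

Including the running index in the seed keeps the IDs unique even when one page yields several chunks.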


Let's verify that the records are in the vector database:

In [22]:
try:    
    # connect to database
    client = weaviate.connect_to_local(headers=headers)
           
    # Get the collection
    adi_docs = client.collections.get("ADI_DOCS")
    response = adi_docs.aggregate.over_all(total_count=True)
    print(response.total_count)

finally:
    client.close()

68448
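
With the records in place, a semantic query against the collection could look roughly like the sketch below. It assumes the Weaviate Python client's `collection.query.near_text(query=..., limit=...)` call and the property names defined earlier. The helper `search_chunks`, the sample question, and the offline stand-in collection are hypothetical and only exist so that the sketch runs without a server.

```python
from types import SimpleNamespace

def search_chunks(collection, question, limit=3):
    """Return (document, page, text) tuples for the chunks closest to the question."""
    response = collection.query.near_text(query=question, limit=limit)
    return [
        (
            obj.properties["chunk_document_name"],
            obj.properties["chunk_document_page"],
            obj.properties["chunk_content"],
        )
        for obj in response.objects
    ]

# Offline stand-in for a collection (assumption), mimicking only the call shape
def _fake_near_text(query, limit):
    hits = [
        SimpleNamespace(properties={
            "chunk_document_name": "docs2/example.pdf",
            "chunk_document_page": 117,
            "chunk_content": "The -Oa switch enables automatic function inlining.",
        })
    ]
    return SimpleNamespace(objects=hits[:limit])

fake_collection = SimpleNamespace(query=SimpleNamespace(near_text=_fake_near_text))

results = search_chunks(fake_collection, "How do I enable automatic inlining?")
for name, page, text in results:
    print(f"{name} (page {page}): {text}")
```

Against the real database, pass `client.collections.get("ADI_DOCS")` as the collection while the client is connected, and close the client afterwards as in the cells above.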
