## **Setup**

In [None]:
%pip install indexify indexify-extractor-sdk indexify-langchain langchain langchain-openai

# Download Indexify Server
!curl https://www.tensorlake.ai | sh

# Install Poppler (required for PDF extraction)
# On macOS, use `brew install poppler` instead.
!sudo apt-get install -y poppler-utils

# Download Extractors
!indexify-extractor download hub://text/chunking
!indexify-extractor download hub://embedding/minilm-l6
!indexify-extractor download hub://pdf/pdf-extractor

After installing the libraries and downloading the server and the extractors, restart the runtime. Then run Indexify Server together with the extractors.

Open two terminals and run the following commands:

```bash
# Terminal 1
./indexify server -d

# Terminal 2
indexify-extractor join-server
```

## **Test the extractors**

We will try the PDFExtractor first. It extracts the values from both text and tables in a single pass and hands them to the next extractors in the chain, whose output can then be used for question answering.

In [None]:
from indexify_extractor_sdk import load_extractor, Content

# Load the PDF extractor that was downloaded in the setup step.
pdfextractor, pdfconfig_cls = load_extractor("pdf-extractor.pdf_extractor:PDFExtractor")
# The PDF must already exist in the working directory.
content = Content.from_file("uber-20231231.pdf")

pdf_result = pdfextractor.extract(content)
# Decode the first plain-text item returned by the extractor.
text_content = next(item.data.decode('utf-8') for item in pdf_result if item.content_type == 'text/plain')
text_content
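
The `next(...)` call above raises `StopIteration` when the extractor returns no `text/plain` item. A minimal sketch of a more defensive selection follows. `FakeContent` and `first_text` are hypothetical names; `FakeContent` is only a stand-in for the SDK's result items, assumed to expose `content_type` and `data` as used above.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class FakeContent:
    """Hypothetical stand-in for one item returned by an extractor."""
    content_type: str
    data: bytes


def first_text(results: Iterable[FakeContent]) -> Optional[str]:
    """Return the first text/plain payload decoded as UTF-8, or None."""
    for item in results:
        if item.content_type == "text/plain":
            return item.data.decode("utf-8")
    return None


results = [
    FakeContent("application/json", b'{"rows": 3}'),
    FakeContent("text/plain", b"Annual report text"),
]
print(first_text(results))  # Annual report text
print(first_text([]))       # None
```

Returning `None` lets the caller decide how to handle a document that produced no plain text instead of failing inside the generator expression.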

## **Create Extraction Graph**
Instantiate the Indexify client.

In [None]:
from indexify import IndexifyClient
client = IndexifyClient()

### **Extraction Graph Setup**

1. Import the `ExtractionGraph` class from the `indexify` package.

2. Define the extraction graph specification in YAML format:
   - Set the name of the extraction graph to "pdf".
   - Define the extraction policies:
     - Use the "tensorlake/pdf-extractor" extractor for PDF parsing and name it "pdf-extraction".
     - Use the "tensorlake/chunk-extractor" extractor for text chunking and name it "chunks".
       - Set the input parameters for the chunker:
         - `chunk_size`: 1000 (size of each text chunk)
         - `overlap`: 100 (overlap between consecutive chunks)
       - Set `content_source` to "pdf-extraction" so that the chunker consumes the PDF extractor's output.
     - Use the "tensorlake/minilm-l6" extractor for embedding and name it "get-embeddings".
       - Set `content_source` to "chunks" so that embeddings are computed for each chunk.

3. Create an `ExtractionGraph` object from the YAML specification using `ExtractionGraph.from_yaml()`.

4. Create the extraction graph on the Indexify client using `client.create_extraction_graph()`.
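
To make `chunk_size` and `overlap` concrete, here is a minimal character-based sketch. It is an assumption about what the two parameters mean, not the chunk extractor's actual implementation, which may split on separators or tokens instead. `chunk_text` is a hypothetical helper.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into windows of chunk_size characters, where each window
    repeats the last `overlap` characters of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "x" * 2500
chunks = chunk_text(sample)
print([len(c) for c in chunks])  # [1000, 1000, 700]
```

With these settings each window starts 900 characters after the previous one, so a sentence cut at a chunk boundary still appears whole in at least one neighboring chunk as long as it is shorter than the overlap.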

In [None]:
from indexify import ExtractionGraph

extraction_graph_spec = """
name: 'pdf'
extraction_policies:
   - extractor: 'tensorlake/pdf-extractor'
     name: 'pdf-extraction'
   - extractor: 'tensorlake/chunk-extractor'
     name: 'chunks'
     input_params:
        chunk_size: 1000
        overlap: 100
     content_source: 'pdf-extraction'
   - extractor: 'tensorlake/minilm-l6'
     name: 'get-embeddings'
     content_source: 'chunks'
"""

extraction_graph = ExtractionGraph.from_yaml(extraction_graph_spec)
client.create_extraction_graph(extraction_graph)
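
Each `content_source` in the spec has to name another policy in the same graph; a policy without one is assumed here to consume the uploaded file directly. The sketch below checks this on a plain dictionary that mirrors the YAML above and derives an order in which the policies can run. `execution_order` is a hypothetical helper and does not call Indexify.

```python
spec = {
    "name": "pdf",
    "extraction_policies": [
        {"extractor": "tensorlake/pdf-extractor", "name": "pdf-extraction"},
        {"extractor": "tensorlake/chunk-extractor", "name": "chunks",
         "content_source": "pdf-extraction"},
        {"extractor": "tensorlake/minilm-l6", "name": "get-embeddings",
         "content_source": "chunks"},
    ],
}


def execution_order(spec: dict) -> list[str]:
    """Order policy names so each one comes after its content_source."""
    policies = {p["name"]: p for p in spec["extraction_policies"]}
    ordered: list[str] = []

    def visit(name: str, seen: tuple) -> None:
        if name in ordered:
            return
        if name in seen:
            raise ValueError(f"cycle through policy {name!r}")
        source = policies[name].get("content_source")
        if source is not None:
            if source not in policies:
                raise ValueError(f"{name!r} refers to unknown source {source!r}")
            visit(source, seen + (name,))
        ordered.append(name)

    for name in policies:
        visit(name, ())
    return ordered


print(execution_order(spec))
# ['pdf-extraction', 'chunks', 'get-embeddings']
```

A misspelled `content_source` raises a `ValueError` here, which is easier to diagnose than a policy that never receives any content.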

### **Upload a FORM 10-K PDF File**

In [None]:
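import requests

# Download the Form 10-K PDF from the SEC website.
req = requests.get("https://www.sec.gov/files/form10-k.pdf")
req.raise_for_status()  # fail here rather than save an error page as a PDF

with open('form10-k.pdf','wb') as f:
    f.write(req.content)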

In [None]:
# Upload the file into the "pdf" extraction graph, then wait for extraction to finish.
content_id = client.upload_file("pdf", path="form10-k.pdf")
client.wait_for_extraction(content_id)

## **What is happening behind the scenes**

Indexify responds to ingestion events by evaluating the existing extraction policies and triggering the extractors that apply. Once the PDF extractor has finished extracting text, bytes, and JSON from the document, the chunk extractor splits the text into chunks, and the embedding extractor then computes an embedding for each chunk and populates an index.

You can upload hundreds of PDF files at once, and Indexify will extract and index their contents without manual intervention. To speed up extraction, deploy multiple instances of the extractors; Indexify's built-in scheduler distributes the work among them.
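
A bulk upload can be sketched as a loop over a folder. The only client call used is `upload_file`, with the same arguments as in the upload cell above. `upload_all` and `StubClient` are hypothetical names, and `StubClient` exists only so that the sketch runs without a server.

```python
import tempfile
from pathlib import Path


def upload_all(client, graph: str, folder: str) -> list[str]:
    """Upload every PDF in `folder` to `graph` and return the content ids."""
    return [
        client.upload_file(graph, path=str(pdf))
        for pdf in sorted(Path(folder).glob("*.pdf"))
    ]


class StubClient:
    """Hypothetical stand-in that records uploads instead of sending them."""

    def __init__(self):
        self.uploaded = []

    def upload_file(self, graph, path):
        self.uploaded.append((graph, Path(path).name))
        return f"content-{len(self.uploaded)}"


# Create a throwaway folder with two PDFs and one unrelated file.
with tempfile.TemporaryDirectory() as folder:
    for name in ("a.pdf", "b.pdf", "notes.txt"):
        Path(folder, name).write_bytes(b"")
    stub = StubClient()
    ids = upload_all(stub, "pdf", folder)

print(ids)            # ['content-1', 'content-2']
print(stub.uploaded)  # [('pdf', 'a.pdf'), ('pdf', 'b.pdf')]
```

With a running server, pass the `IndexifyClient` instance in place of `StubClient` and call `client.wait_for_extraction` on each returned id before querying.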

## **Perform RAG**
Initialize the LangChain retriever.

In [None]:
from indexify_langchain import IndexifyRetriever
# The index name starts with the extraction graph name ("pdf"), followed by the policy name.
params = {"name": "pdf.get-embeddings.embedding", "top_k": 3}
retriever = IndexifyRetriever(client=client, params=params)
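
The index name passed to the retriever appears to follow the pattern `<graph name>.<policy name>.<output name>`. This is inferred from the example, not taken from a specification, and the helper below is hypothetical. Building the name from the graph and policy names used in the spec avoids a mismatch between the two.

```python
def index_name(graph: str, policy: str, output: str = "embedding") -> str:
    """Join the three parts of an index name, rejecting empty or dotted parts."""
    parts = (graph, policy, output)
    for part in parts:
        if not part or "." in part:
            raise ValueError(f"invalid index name part: {part!r}")
    return ".".join(parts)


# Graph name and policy name as defined in the extraction graph spec.
print(index_name("pdf", "get-embeddings"))  # pdf.get-embeddings.embedding
```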

Now create a chain that prompts OpenAI with the data retrieved from Indexify to build a simple Q&A bot. `ChatOpenAI` reads the API key from the `OPENAI_API_KEY` environment variable.

In [None]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

In [None]:
template = """Answer the question based only on the following context:
{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

model = ChatOpenAI()

chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)
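
What the chain does can be sketched in plain Python: retrieve the `top_k` most relevant chunks, fill the prompt template, and pass the prompt to a model. Everything below is a stand-in and an assumption for illustration: `retrieve` ranks by shared words instead of embeddings, `fake_model` replaces the chat model, and the chunks are made-up strings. It is not how LangChain or Indexify implement these steps.

```python
import re

TEMPLATE = """Answer the question based only on the following context:
{context}

Question: {question}
"""


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric words in text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Toy retriever: rank chunks by the number of words shared with the question."""
    q = tokens(question)
    ranked = sorted(chunks, key=lambda c: len(q & tokens(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, chunks: list[str], top_k: int = 3) -> str:
    """Fill the template with the retrieved chunks and the question."""
    context = "\n".join(retrieve(question, chunks, top_k))
    return TEMPLATE.format(context=context, question=question)


def fake_model(prompt: str) -> str:
    """Stand-in for the chat model: echoes the first context line."""
    return prompt.splitlines()[1]


chunks = [
    "alpha chunk about foreign subsidiaries",
    "beta chunk about exhibits",
    "gamma chunk about signatures",
]
question = "What about foreign subsidiaries?"
prompt = build_prompt(question, chunks, top_k=2)
answer = fake_model(prompt)
print(prompt)
print(answer)  # alpha chunk about foreign subsidiaries
```

The model only ever sees the retrieved chunks, so the quality of the answer depends on whether the retriever surfaces the relevant passage within `top_k` results.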

Now ask any question about the ingested Form 10-K PDF document.

In [None]:
chain.invoke("What are the disclosures with respect to Foreign Subsidiaries?")
# Example answer: It may be omitted to the extent that the required disclosure would be detrimental to the registrant.