### Install the Indexify Extractor SDK, the LangChain Retriever, and the Indexify Client

In [None]:
%%capture
!pip install indexify-extractor-sdk indexify-langchain indexify

### Start the Indexify Server

In [None]:
!./indexify server -d

### Download an Embedding Extractor
In another terminal, download and start the embedding extractor, which we will use to index the text extracted from the invoice.

In [None]:
!indexify-extractor download hub://embedding/minilm-l6
!indexify-extractor join-server

### Download the Donut Invoice Extractor
In another terminal, download and start the invoice extractor, which we will use to get text out of invoice documents.

In [None]:
!indexify-extractor download hub://invoices/donut_invoice
!indexify-extractor join-server

### Create Extraction Policies
Instantiate the Indexify client.

In [None]:
from indexify import IndexifyClient
client = IndexifyClient()

First, create a policy that extracts text and content from the invoices.

In [None]:
client.add_extraction_policy(extractor='tensorlake/donut-invoice', name="invoice-extraction")

Second, create an embedding-based index from the extracted text and content.

In [None]:
client.add_extraction_policy(extractor='tensorlake/minilm-l6', name="get-embeddings", content_source="invoice-extraction")

### Upload an Invoice File

In [None]:
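import requests

# Download the sample invoice and fail early on an HTTP error.
response = requests.get("https://extractor-files.diptanu-6d5.workers.dev/invoice-example.jpg")
response.raise_for_status()

with open('invoice-example.jpg','wb') as f:
    f.write(response.content)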

In [None]:
client.upload_file(path="invoice-example.jpg")

### What is happening behind the scenes

Indexify reacts to ingestion events by evaluating every existing policy and triggering the extractors those policies name. Once the invoice extractor finishes extracting text from the document, the embedding extractor is triggered automatically: it chunks the content, computes embeddings, and populates an index.

You can upload hundreds of invoice files at once, and Indexify will extract and index their contents without manual intervention. To speed up extraction, deploy multiple instances of the extractors; Indexify's built-in scheduler distributes the work among them transparently.
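
To make the chaining of policies concrete, here is a toy model in plain Python. It is only a sketch of the idea, not Indexify's implementation: the two `fake_*` functions, the `POLICIES` list, the `"ingestion"` source label, and `ingest` are all hypothetical names invented for this illustration. Each policy listens to one source, and every output is fed back in as a new event, so the second policy runs as soon as the first produces something.

```python
from collections import deque

# Hypothetical stand-ins for the two extractors (the real ones run as separate processes).
def fake_invoice_extractor(content):
    return [f"text extracted from {content}"]

def fake_embedding_extractor(content):
    return [f"embedding of ({content})"]

# Each policy names the source it listens to.
# "ingestion" is this sketch's label for freshly uploaded files.
POLICIES = [
    {"name": "invoice-extraction", "source": "ingestion",
     "extractor": fake_invoice_extractor},
    {"name": "get-embeddings", "source": "invoice-extraction",
     "extractor": fake_embedding_extractor},
]

def ingest(content):
    """Run every policy whose source matches, feeding outputs back in as new events."""
    log = []
    queue = deque([("ingestion", content)])
    while queue:
        source, item = queue.popleft()
        for policy in POLICIES:
            if policy["source"] == source:
                for output in policy["extractor"](item):
                    log.append((policy["name"], output))
                    # The output becomes an event that downstream policies can react to.
                    queue.append((policy["name"], output))
    return log

events = ingest("invoice-example.jpg")
for policy_name, output in events:
    print(policy_name, "->", output)
```

In this model, the `source` field plays the role that `content_source="invoice-extraction"` plays in the second `add_extraction_policy` call above.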

### Perform RAG
Initialize the LangChain retriever.

In [None]:
from indexify_langchain import IndexifyRetriever
params = {"name": "get-embeddings.embedding", "top_k": 3}
retriever = IndexifyRetriever(client=client, params=params)
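
The `top_k` parameter limits how many chunks the retriever returns for a query. As a rough intuition for what a top-k lookup over an embedding index involves, here is a minimal sketch in plain Python. It assumes cosine similarity as the ranking measure, which may differ from what Indexify uses internally; the three-dimensional vectors, the chunk labels, and the `top_k` helper are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vector, index, k):
    """Return the texts of the k entries most similar to the query."""
    ranked = sorted(
        index,
        key=lambda entry: cosine_similarity(query_vector, entry["vector"]),
        reverse=True,
    )
    return [entry["text"] for entry in ranked[:k]]

# Made-up toy index; real embeddings have hundreds of dimensions.
toy_index = [
    {"text": "chunk A", "vector": [0.9, 0.1, 0.0]},
    {"text": "chunk B", "vector": [0.0, 1.0, 0.0]},
    {"text": "chunk C", "vector": [0.1, 0.2, 0.9]},
    {"text": "chunk D", "vector": [0.8, 0.3, 0.1]},
]

results = top_k([1.0, 0.0, 0.0], toy_index, k=3)
print(results)
```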

Now create a chain that prompts OpenAI with data retrieved from Indexify, producing a simple Q&A bot.

In [None]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

In [None]:
template = """Answer the question based only on the following context:
{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

model = ChatOpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)
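
The dictionary at the head of the chain sends the same input question to both the retriever and `RunnablePassthrough()`, and the results fill the `{context}` and `{question}` slots of the template. The sketch below mimics that data flow in plain Python so it can be run without any services. `stub_retriever` and `build_prompt` are hypothetical helpers, the placeholder chunks are invented, and joining the chunks with newlines is a choice made for this sketch rather than what LangChain does with the retrieved documents.

```python
template = """Answer the question based only on the following context:
{context}

Question: {question}
"""

def stub_retriever(question):
    # Hypothetical stand-in for the real retriever, which would query the index.
    return ["placeholder chunk one", "placeholder chunk two"]

def build_prompt(question):
    # The same question feeds both slots: once through the retriever, once unchanged.
    inputs = {
        "context": "\n".join(stub_retriever(question)),
        "question": question,
    }
    return template.format(**inputs)

prompt_text = build_prompt("What is on the invoice?")
print(prompt_text)
```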

Now ask any question about the ingested invoice.

In [None]:
chain.invoke("How much does the Camisol Top cost?")
# The Eggshell Camisol Top costs $123