# RAG on Multiple Terms and Conditions Documents Varying By Geography
In this demo we build a RAG pipeline over terms and conditions documents that vary by geography.

Approach:
* Label documents during ingestion
* Propagate the labels on the documents all the way into the vector store
* During retrieval, have the LLM generate label filters based on the question
* Pass the label filters into the vector store for retrieval 
* Make the LLM cite the sources of the response during response synthesis
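
The approach above amounts to carrying labels alongside every chunk and filtering on them before any similarity ranking happens. A minimal, self-contained sketch of that idea follows. The `Chunk` class, the `filter_by_labels` helper, and the sample texts are hypothetical and only stand in for what the vector store does internally.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    content_id: str
    text: str
    labels: dict = field(default_factory=dict)


def filter_by_labels(chunks, required):
    """Keep only chunks whose labels contain every required key/value pair."""
    return [
        c for c in chunks
        if all(c.labels.get(k) == v for k, v in required.items())
    ]


# Hypothetical chunks; each one inherits the label of its source document.
chunks = [
    Chunk("a1", "Sample clause from the California document.", {"state": "california"}),
    Chunk("b1", "Sample clause from the Hawaii document.", {"state": "hawaii"}),
    Chunk("c1", "Sample clause from the Illinois document.", {"state": "illinois"}),
]

matches = filter_by_labels(chunks, {"state": "hawaii"})
print([c.content_id for c in matches])
```

Only the chunks that survive the label filter would then be ranked by embedding similarity, so a question about Hawaii never sees California text.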

### Install the Indexify Extractor SDK, the Indexify Client and the OpenAI Client

In [None]:
%%capture
!pip install indexify-extractor-sdk indexify openai

### Start the Indexify Server

In [None]:
!./indexify server -d

### Download an Embedding Extractor
In another terminal, download and start the embedding extractor, which we will use to index text from the PDF documents.

In [None]:
!indexify-extractor download hub://embedding/minilm-l6
!indexify-extractor join-server

### Download a Chunk Extractor
The documents must be chunked so that no passage is longer than what the embedding model can support.

In [None]:
!indexify-extractor download hub://text/chunking
!indexify-extractor join-server

### Download the PDF Extractor
In another terminal, install the necessary dependencies and start the PDF extractor, which we will use to get text, bytes, or JSON out of PDF documents.

Install Poppler on your machine

In [None]:
!sudo apt-get install -y poppler-utils

Download and start the PDF extractor

In [None]:
!indexify-extractor download hub://pdf/pdf-extractor
!indexify-extractor join-server

### Create Extraction Policies
Instantiate the Indexify Client

In [None]:
from indexify import IndexifyClient
client = IndexifyClient()

First, create a policy to extract text and other content from the PDFs.

In [None]:
client.add_extraction_policy(extractor='tensorlake/pdf-extractor', name="pdf-extraction")

Lastly, chunk the extracted text and create embeddings from the chunks.

In [None]:
client.add_extraction_policy(extractor='tensorlake/chunk-extractor', name="chunks", content_source="pdf-extraction", input_params={"chunk_size": 512, "overlap": 150})

In [None]:
client.add_extraction_policy(extractor='tensorlake/minilm-l6', name="terms", content_source="chunks")
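
To build intuition for the `chunk_size` and `overlap` parameters passed to the chunking policy, here is a rough character-based sliding-window sketch. It is an illustration only: `chunk_text` is a hypothetical helper, and the actual chunk extractor may count tokens or split on separators rather than raw characters.

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 150) -> list[str]:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "x" * 1000
pieces = chunk_text(sample)
print([len(p) for p in pieces])  # [512, 512, 276]
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.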

### Upload the PDF Files

In [None]:
import requests

# Download the Sixt terms and conditions for three states
req = requests.get("https://www.sixt.com/shared/t-c/sixt_US_en_CALIFORNIA.pdf")
req1 = requests.get("https://www.sixt.com/shared/t-c/sixt_US_en_HAWAII.pdf")
req2 = requests.get("https://www.sixt.com/shared/t-c/sixt_US_en_ILLINOIS.pdf")

# Write each response to its own file
with open('sixt_US_en_CALIFORNIA.pdf', 'wb') as f:
    f.write(req.content)

with open('sixt_US_en_HAWAII.pdf', 'wb') as f:
    f.write(req1.content)

with open('sixt_US_en_ILLINOIS.pdf', 'wb') as f:
    f.write(req2.content)

In [None]:
client.upload_file(path="sixt_US_en_CALIFORNIA.pdf", labels={"state": "california"})
client.upload_file(path="sixt_US_en_HAWAII.pdf", labels={"state": "hawaii"})
client.upload_file(path="sixt_US_en_ILLINOIS.pdf", labels={"state": "illinois"})
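
With more than a handful of files, typing each label by hand gets error-prone. One option is to derive the label from the file name. The sketch below assumes every file follows the `sixt_US_en_<STATE>.pdf` pattern seen above; `state_label_from_filename` is a hypothetical helper, not part of the Indexify client.

```python
from pathlib import Path

# Assumed naming convention, taken from the three files used in this demo.
PREFIX = "sixt_US_en_"


def state_label_from_filename(path: str) -> dict:
    """Derive {"state": "<name>"} from a file named like sixt_US_en_CALIFORNIA.pdf."""
    stem = Path(path).stem  # e.g. "sixt_US_en_CALIFORNIA"
    if not stem.startswith(PREFIX):
        raise ValueError(f"unexpected file name: {path}")
    state = stem[len(PREFIX):].replace("_", " ").lower()
    return {"state": state}


print(state_label_from_filename("sixt_US_en_CALIFORNIA.pdf"))  # {'state': 'california'}
```

The returned dictionary has the same shape as the `labels` argument passed to `client.upload_file` above, so it could be supplied there directly.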

### What is happening behind the scenes

Indexify responds to ingestion events by evaluating all existing extraction policies and triggering the matching extractors. Once the PDF extractor finishes extracting text, bytes, and JSON from a document, Indexify automatically runs the chunk extractor on its output and then the embedding extractor, which populates an index.

You can upload hundreds of PDF files at once, and Indexify will extract and index their contents without manual intervention. To speed up extraction, you can deploy multiple instances of the extractors; Indexify's built-in scheduler distributes the workload among them.

### Perform RAG
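Initialize the OpenAI client and define a function that generates a label filter from the question, retrieves the matching chunks, and synthesizes an answer with citations.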

In [None]:
import os
from openai import OpenAI

oai_client = OpenAI(
    # This is the default and can be omitted
    api_key=os.environ.get("OPENAI_API_KEY"),
)

def answer_question(question) -> str:
    # Step 1: ask the LLM to turn the question into a label filter
    chat_completion = oai_client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": f"given the question {question}, if there is the name of a US state, generate a predicate such as state=texas or state=new york. The predicate name and value should be in small letters.",
            }
        ],
        model="gpt-3.5-turbo",
    )
    query_filter = chat_completion.choices[0].message.content.strip()
    # Step 2: retrieve only the chunks whose labels match the filter
    search_results = client.search_index("terms.embedding", question, top_k=5, filters=[query_filter])
    # Step 3: build the context, keeping each chunk's content_id for citation
    context = ""
    for result in search_results:
        context += f"content_id: {result['content_id']}\n text: {result['text']}\n"
    # Step 4: ask the LLM to answer from the context and cite its sources
    chat_completion = oai_client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": f" Answer the question based on the context provided below and provide citation in the response as 'Citation: '. The context has the citation to content_ids and the text below it. \n Context: {context} \n \n Question: {question}",
            }
        ],
        model="gpt-3.5-turbo",
    )
    answer = chat_completion.choices[0].message.content
    print(answer)
    return answer
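
The filter string comes straight from the LLM, so it may arrive with extra words, quotes, or capitalization. A defensive sketch for normalizing it before it reaches `search_index` is shown below. `normalize_state_filter` and `KNOWN_STATES` are hypothetical names, and returning an empty list when no known state is found is an assumption about how you would want to fall back, not documented Indexify behavior.

```python
import re

# The labels uploaded in this demo; extend as more documents are ingested.
KNOWN_STATES = ("california", "hawaii", "illinois")


def normalize_state_filter(llm_output: str) -> list[str]:
    """Extract a clean 'state=<name>' predicate from free-form LLM output."""
    text = " ".join(llm_output.lower().split())  # lowercase, collapse whitespace
    match = re.search(r"state\s*=\s*['\"]?([a-z ]+)", text)
    if match is None:
        return []
    candidate = match.group(1).strip()
    for state in KNOWN_STATES:
        # Accept an exact match or a known state followed by trailing words.
        if candidate == state or candidate.startswith(state + " "):
            return [f"state={state}"]
    return []


print(normalize_state_filter("The predicate is: State = Hawaii."))  # ['state=hawaii']
```

Checking against the set of labels that were actually ingested also guards against the model inventing a predicate for a state with no uploaded document.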


In [88]:
answer_question("If I rent a car from Sixt in California, how many days do I have to return the vehicle before being considered overdue??")

Based on the information provided, if you rent a car from Sixt in California, you have until the third day of no response to return the vehicle before being considered overdue. On the third day of no response, you will be informed that the vehicle must be returned to a Sixt location within 24 hours to avoid being considered overdue.

Citation: content_id: 92a7afa15d284599


In [89]:
answer_question("If I rent a car from Sixt in Hawaii, how many days do I have to return the vehicle before being considered overdue??")

Based on the context provided, if you rent a car from Sixt in Hawaii and do not respond or contact them back within three consecutive days after being informed about the failed authorization, you must return the vehicle to a Sixt location within 24 hours on the third day to avoid being considered overdue. 

Citation: content_id: 635044f6a376745a
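
Because the answers end with a `Citation: content_id: ...` line, the cited IDs can be pulled out and compared with the chunks that were actually retrieved. The sketch below is one way to do that; `extract_citations` and `unknown_citations` are hypothetical helpers, and the regular expression assumes the model keeps the `content_id: <id>` format shown in the outputs above.

```python
import re


def extract_citations(answer: str) -> list[str]:
    """Return every content_id mentioned in the answer, in order of appearance."""
    return re.findall(r"content_id:\s*(\w+)", answer)


def unknown_citations(answer: str, retrieved_ids: set[str]) -> list[str]:
    """Return cited content_ids that were not part of the retrieved context."""
    return [cid for cid in extract_citations(answer) if cid not in retrieved_ids]


sample_answer = "You must return the vehicle within 24 hours.\n\nCitation: content_id: 92a7afa15d284599"
print(extract_citations(sample_answer))  # ['92a7afa15d284599']
```

A non-empty result from `unknown_citations` would indicate that the model cited something outside the retrieved context.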
