## **Installation and Setup**


1. Install the `indexify-extractor-sdk` package using pip.

In [None]:
%pip install -q indexify-extractor-sdk

2. Download the required extractors:
   - `hub://embedding/minilm-l6`: An embedding extractor based on the MiniLM-L6 model.
   - `hub://text/chunking`: A text chunking extractor.
   - `hub://pdf/marker`: A PDF extractor based on Marker that converts PDFs to Markdown.

In [None]:
!indexify-extractor download hub://embedding/minilm-l6
!indexify-extractor download hub://text/chunking
!indexify-extractor download hub://pdf/marker

3. Start the Indexify extractor server with the `indexify-extractor join-server` command. Run it in a separate terminal: the command keeps running, so executing it in a notebook cell blocks the cells that follow.

In [None]:
!indexify-extractor join-server

4. Install the `indexify` package using pip.

In [None]:
%pip install -q indexify
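
Before moving on, it can help to confirm that both packages installed. The following is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical and not part of Indexify.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Distribution names taken from the pip commands above.
for name in ("indexify-extractor-sdk", "indexify"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If either line reports "not installed", rerun the matching install cell and restart the kernel.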

## **Indexify Client Setup**

1. Import the `IndexifyClient` class from the `indexify` package.
2. Create an instance of the `IndexifyClient` called `client`.

In [4]:
from indexify import IndexifyClient
client = IndexifyClient()

## **Create an Extraction Graph**

1. Import the `ExtractionGraph` class from the `indexify` package.
2. Define the extraction graph specification in YAML format:
   - Set the name of the extraction graph to "pdfqa".
   - Define the extraction policies:
     - Use the "tensorlake/marker" extractor to convert the PDF to Markdown and name it "mdextract".
     - Use the "tensorlake/chunk-extractor" for text chunking and name it "chunker".
       - Set the input parameters for the chunker:
         - `chunk_size`: 1000 (size of each text chunk)
         - `overlap`: 100 (overlap between chunks)
       - Set the content source for chunking to "mdextract", so the chunker consumes the output of the Markdown extraction.
     - Use the "tensorlake/minilm-l6" extractor for embedding and name it "pdfembedding".
       - Set the content source for embedding to "chunker".
3. Create an `ExtractionGraph` object from the YAML specification using `ExtractionGraph.from_yaml()`.
4. Create the extraction graph on the Indexify client using `client.create_extraction_graph()`.

In [13]:
from indexify import ExtractionGraph

extraction_graph_spec = """
name: 'pdfqa'
extraction_policies:
   - extractor: 'tensorlake/marker'
     name: 'mdextract'
   - extractor: 'tensorlake/chunk-extractor'
     name: 'chunker'
     input_params:
        chunk_size: 1000
        overlap: 100
     content_source: 'mdextract'
   - extractor: 'tensorlake/minilm-l6'
     name: 'pdfembedding'
     content_source: 'chunker'
"""

extraction_graph = ExtractionGraph.from_yaml(extraction_graph_spec)
client.create_extraction_graph(extraction_graph)
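
The `content_source` keys are what chain the three policies together. As a rough illustration, the sketch below mirrors the YAML as plain Python data and walks those links to recover the processing order. It does not use the Indexify API; `pipeline_order` is a hypothetical helper, and it assumes a single linear chain that starts at the one policy without a `content_source`.

```python
from typing import Dict, List

# The same policies as the YAML above, written as plain Python data.
policies: List[Dict[str, str]] = [
    {"name": "mdextract", "extractor": "tensorlake/marker"},
    {"name": "chunker", "extractor": "tensorlake/chunk-extractor", "content_source": "mdextract"},
    {"name": "pdfembedding", "extractor": "tensorlake/minilm-l6", "content_source": "chunker"},
]


def pipeline_order(policies: List[Dict[str, str]]) -> List[str]:
    """Follow content_source links from the root policy to the final one."""
    names = {p["name"] for p in policies}
    # Every content_source must refer to a policy defined in the same graph.
    for p in policies:
        source = p.get("content_source")
        if source is not None and source not in names:
            raise ValueError(f"{p['name']} reads from unknown policy {source!r}")
    # Map each source to the policy that consumes it (a linear chain is assumed).
    consumer = {p["content_source"]: p["name"] for p in policies if "content_source" in p}
    roots = [p["name"] for p in policies if "content_source" not in p]
    order = [roots[0]]
    while order[-1] in consumer:
        order.append(consumer[order[-1]])
    return order


print(" -> ".join(pipeline_order(policies)))
```

A misspelled `content_source` raises a `ValueError` here, which makes this kind of check a cheap way to catch wiring mistakes before creating the graph.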

## **Document Ingestion**

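1. Upload the PDF document to the "pdfqa" extraction graph using `client.upload_file()`, which returns the content ID.
2. Block until extraction has finished for that content using `client.wait_for_extraction()`.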

In [None]:
content_id = client.upload_file("pdfqa", "chess.pdf")
client.wait_for_extraction(content_id)
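
To ingest several PDFs, the same two calls can be wrapped in a loop. This is a minimal sketch: `ingest_files` is a hypothetical helper, and it relies only on the `upload_file` and `wait_for_extraction` calls shown above.

```python
from typing import Iterable, List


def ingest_files(client, graph_name: str, paths: Iterable[str]) -> List[str]:
    """Upload each file to the graph and wait for its extraction to finish."""
    content_ids = []
    for path in paths:
        content_id = client.upload_file(graph_name, path)
        client.wait_for_extraction(content_id)
        content_ids.append(content_id)
    return content_ids


# Usage with the notebook's client (requires a running Indexify server):
# content_ids = ingest_files(client, "pdfqa", ["chess.pdf"])
```

Waiting after each upload keeps the sketch simple; uploading everything first and waiting afterwards would be a reasonable variation.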

## **Context Retrieval Function**

1. Define a function called `get_context` that takes a question, an index name, and `top_k` (the number of passages to retrieve, 3 by default) as parameters.

2. Search the specified index using `client.search_index()` with the given question and top_k.

3. Concatenate the retrieved passages into a single context string.

4. Return the context string.

In [32]:
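def get_context(question: str, index: str, top_k: int = 3) -> str:
    # Retrieve the top_k passages most similar to the question.
    results = client.search_index(name=index, query=question, top_k=top_k)
    context = ""
    for result in results:
        # Label each passage with its content id so the source stays traceable.
        context = context + f"content id: {result['content_id']} \n\n passage: {result['text']}\n"
    return context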
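
To see what the concatenated context looks like without a running server, the formatting step can be tried on its own. The sketch below repeats the loop from `get_context` in a hypothetical helper, `format_context`, and feeds it made-up results that only mimic the two keys the function reads.

```python
from typing import Dict, List


def format_context(results: List[Dict[str, str]]) -> str:
    """Join search results into one context string, as get_context does."""
    context = ""
    for result in results:
        context += f"content id: {result['content_id']} \n\n passage: {result['text']}\n"
    return context


# Made-up results, shaped like the dictionaries get_context reads.
sample_results = [
    {"content_id": "abc123", "text": "First passage."},
    {"content_id": "def456", "text": "Second passage."},
]
print(format_context(sample_results))
```

An empty result list yields an empty string, so the prompt would carry no context when the search finds nothing.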

## **Prompt Creation Function**

1. Define a function called `create_prompt` that takes a question and context as parameters.

2. Create a prompt string that includes the question and context.

3. Return the prompt string.

In [33]:
def create_prompt(question, context):
    return f"Answer the question, based on the context.\n question: {question} \n context: {context}"

## **Question Answering**

1. Define a question string.
2. Call the `get_context` function with the question and the index name ("pdfqa.pdfembedding.embedding") to retrieve the relevant context; `top_k` is left at its default of 3.

In [34]:
question = "Who is the greatest player of all time and what is his record?"
context = get_context(question, "pdfqa.pdfembedding.embedding")
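
The index name used above reads as the graph name, the policy name, and the extractor's output name joined by dots. Assuming that pattern holds, a small hypothetical helper such as `index_name` avoids typing the string by hand; it is not part of the Indexify client.

```python
def index_name(graph: str, policy: str, output: str = "embedding") -> str:
    """Build an index name of the form '<graph>.<policy>.<output>'."""
    return ".".join([graph, policy, output])


print(index_name("pdfqa", "pdfembedding"))
```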

## **Set Up the OpenAI Client**

1. Import the `OpenAI` class from the `openai` package.
2. Create an instance of the `OpenAI` client called `client_openai` with the API key.

In [None]:
from openai import OpenAI
# Replace the empty string with your OpenAI API key.
client_openai = OpenAI(api_key="")
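
Rather than pasting the key into the notebook, it can be read from an environment variable. This is a minimal sketch: `load_api_key` is a hypothetical helper, and the variable name `OPENAI_API_KEY` is a common convention assumed here rather than something this notebook sets.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Read the API key from the environment and fail early if it is unset."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before creating the client.")
    return key


# Usage (requires the variable to be set):
# client_openai = OpenAI(api_key=load_api_key())
```

Failing early with a clear message is easier to debug than an authentication error returned by the first API call.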

## **Answering the Question with OpenAI**

1. Call the `create_prompt` function with the question and retrieved context to generate the prompt.
2. Use the `client_openai.chat.completions.create()` method to send the prompt to the OpenAI API.
   - Set the model to "gpt-3.5-turbo".
   - Pass the prompt as a message with the "user" role.
3. Print the generated answer from the API response.

In [None]:
prompt = create_prompt(question, context)

chat_completion = client_openai.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": prompt,
        }
    ],
    model="gpt-3.5-turbo",
)
print(chat_completion.choices[0].message.content)
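
The retrieval, prompt, and completion steps can also be folded into one function. The sketch below is one possible arrangement, not the notebook's own API: `answer_question` is a hypothetical helper that receives the retrieval and completion steps as callables, so it can be exercised without Indexify or OpenAI. The prompt template is copied from `create_prompt`.

```python
from typing import Callable


def answer_question(
    question: str,
    retrieve: Callable[[str], str],
    complete: Callable[[str], str],
) -> str:
    """Retrieve context, build the prompt, and return the model's answer."""
    context = retrieve(question)
    prompt = f"Answer the question, based on the context.\n question: {question} \n context: {context}"
    return complete(prompt)


# Wiring it to the notebook's objects (requires both services):
# retrieve = lambda q: get_context(q, "pdfqa.pdfembedding.embedding")
# complete = lambda p: client_openai.chat.completions.create(
#     messages=[{"role": "user", "content": p}], model="gpt-3.5-turbo"
# ).choices[0].message.content
# print(answer_question(question, retrieve, complete))
```

Passing the two steps in as arguments keeps the function easy to test and makes it simple to swap the index or the model later.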