# **Introduction**

<div class="align-center">
  <a href="https://getindexify.ai/"><img src="https://getindexify.ai/Indexify_Logo_Wordmark.svg" width="145"></a>
  <a href="https://discord.com/invite/kF8UZACA7r"><img src="https://raw.githubusercontent.com/rishiraj/random/main/Discord%20button.png" width="145"></a><br>
  Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/tensorlakeai/indexify">GitHub</a></i> ⭐
</div>

This notebook shows how Indexify can quickly extract insights from complex, real-world PowerPoint presentations. As an example, it uses the slides from the talk "[A little guide to building Large Language Models in 2024](https://docs.google.com/presentation/d/1IkzESdOwdmwvPxIELYJi8--K3EZ98_cL6c5ZcLKSyVg/edit?usp=sharing)" by Thomas Wolf, co-founder of Hugging Face, and builds a question-answering pipeline over them with the Indexify library.

## **Setup**

In [None]:
%pip install indexify indexify-extractor-sdk

# Download Indexify Server
!curl https://getindexify.ai | sh

# Download Extractors
!indexify-extractor download hub://text/chunking
!indexify-extractor download hub://embedding/minilm-l6
!indexify-extractor download hub://pdf/presentations

After installing the libraries and downloading the server and the extractors, restart the runtime. Then run the Indexify server together with the extractors.

Open two terminals and run the following commands:

```bash
# Terminal 1
./indexify server -d

# Terminal 2
indexify-extractor join-server
```

## **Test the extractors**

We will try the PPTExtractor first. It extracts the text and the table values from the slides in one pass and hands them to the next extractors in the chain, which prepare the content for question answering.

We'll start by downloading the talk's slides.

In [1]:
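import requests

# Fetch the sample deck and stop on an HTTP error instead of saving an error page.
req = requests.get("https://raw.githubusercontent.com/tensorlakeai/indexify/main/docs/docs/files/test.pptx")
req.raise_for_status()

with open('test.pptx', 'wb') as f:
    f.write(req.content)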

In [2]:
from indexify_extractor_sdk import load_extractor, Content

pptextractor, pptconfig_cls = load_extractor("presentations.ppt_extractor:PPTExtractor")
content = Content.from_file("test.pptx")
config = pptconfig_cls()

ppt_result = pptextractor.extract(content, config)
# Take the first plain-text item returned by the extractor.
text_content = next(item.data.decode('utf-8') for item in ppt_result if item.content_type == 'text/plain')

In [3]:
print(text_content)

A little guide to building Large Language Modelsin 2024
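
The cell above keeps only the first `text/plain` item. If you want the text of every slide, a small helper can join all plain-text items instead. The sketch below is a minimal illustration: `Part` is a hypothetical stand-in for the SDK's content objects (it models only the `content_type` and `data` attributes used above), and the sample items are made up.

```python
from dataclasses import dataclass


@dataclass
class Part:
    # Hypothetical stand-in for an extractor result; only the two
    # attributes used in the cell above are modelled.
    content_type: str
    data: bytes


def all_text(parts):
    """Join the decoded text of every 'text/plain' part, in order."""
    return "\n".join(
        p.data.decode("utf-8") for p in parts if p.content_type == "text/plain"
    )


# Made-up sample items, not real extractor output.
parts = [
    Part("text/plain", b"Slide 1 title"),
    Part("application/json", b'{"rows": []}'),
    Part("text/plain", b"Slide 2 body"),
]
print(all_text(parts))
```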


## **Create a Client**
Instantiate the Indexify client, which the rest of the notebook uses to talk to the server.

In [4]:
from indexify import IndexifyClient
client = IndexifyClient()

## **Question Answering Task**

### **Extraction Graph Setup**

1. Import the `ExtractionGraph` class from the `indexify` package.

2. Define the extraction graph specification in YAML format:
   - Set the name of the extraction graph to "pptqa".
   - Define the extraction policies:
     - Use the "tensorlake/ppt" extractor for PPT parsing and name it "docextractor".
     - Use the "tensorlake/chunk-extractor" for text chunking and name it "chunker".
       - Set the input parameters for the chunker:
         - `chunk_size`: 1000 (size of each text chunk)
         - `overlap`: 100 (overlap between chunks)
       - Set the content source for chunking to "docextractor".
     - Use the "tensorlake/arctic" extractor for embedding and name it "embedder".
       - Set the content source for embedding to "chunker".

3. Create an `ExtractionGraph` object from the YAML specification using `ExtractionGraph.from_yaml()`.

4. Create the extraction graph on the Indexify client using `client.create_extraction_graph()`.

In [5]:
from indexify import ExtractionGraph

extraction_graph_spec = """
name: 'pptqa'
extraction_policies:
   - extractor: 'tensorlake/ppt'
     name: 'docextractor'
   - extractor: 'tensorlake/chunk-extractor'
     name: 'chunker'
     input_params:
        chunk_size: 1000
        overlap: 100
     content_source: 'docextractor'
   - extractor: 'tensorlake/arctic'
     name: 'embedder'
     content_source: 'chunker'
"""

extraction_graph = ExtractionGraph.from_yaml(extraction_graph_spec)
client.create_extraction_graph(extraction_graph)
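
To see how `content_source` chains the three policies, the sketch below mirrors the spec as plain Python data and derives the order in which the policies can run. It is an illustration only, not how Indexify resolves graphs internally. `resolve_order` is a hypothetical helper, and it assumes that a policy without `content_source` reads the uploaded file directly.

```python
# Illustrative only: the 'pptqa' policies from the YAML spec as plain data.
policies = [
    {"name": "docextractor", "extractor": "tensorlake/ppt"},
    {"name": "chunker", "extractor": "tensorlake/chunk-extractor",
     "content_source": "docextractor"},
    {"name": "embedder", "extractor": "tensorlake/arctic",
     "content_source": "chunker"},
]


def resolve_order(policies):
    """Return policy names so that each source comes before its consumers."""
    names = {p["name"] for p in policies}
    for p in policies:
        source = p.get("content_source")
        if source is not None and source not in names:
            raise ValueError(f"{p['name']!r} reads from unknown policy {source!r}")

    ordered, placed = [], set()
    while len(ordered) < len(policies):
        progressed = False
        for p in policies:
            source = p.get("content_source")
            # Assumption: no content_source means "reads the ingested file".
            if p["name"] not in placed and (source is None or source in placed):
                ordered.append(p["name"])
                placed.add(p["name"])
                progressed = True
        if not progressed:
            raise ValueError("cycle in content_source references")
    return ordered


print(" -> ".join(resolve_order(policies)))
# prints: docextractor -> chunker -> embedder
```

A misspelled `content_source` (for example `chunks` instead of `chunker`) raises a `ValueError` here, which is the kind of mismatch worth checking before creating the graph.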

### **Upload the talk's PPT slides**

In [None]:
content_id = client.upload_file("pptqa", "test.pptx")
client.wait_for_extraction(content_id)

### **What is happening behind the scenes**

Indexify reacts to ingestion events: when content arrives, it evaluates the existing extraction policies and triggers the extractors that apply. Once the PPT extractor has extracted text, bytes, and JSON from the document, Indexify automatically runs the chunker on that output and then the embedding extractor, which computes embeddings and populates an index.
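
To make the `chunk_size` and `overlap` parameters concrete, here is a minimal sliding-window sketch. It splits on raw characters, which is an assumption: this notebook does not say how the chunk extractor actually splits text, and `chunk_text` is a hypothetical helper, not part of Indexify.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> List[str]:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "x" * 2500
print([len(c) for c in chunk_text(sample)])
# prints: [1000, 1000, 700]
```

With the values from the spec, consecutive chunks share 100 characters, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.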

You can upload hundreds of PPT files at once, and Indexify will extract and index their contents without manual intervention. To speed up extraction, deploy multiple instances of the extractors; Indexify's built-in scheduler distributes the work among them transparently.
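
A batch upload could look like the sketch below. It relies only on the two client calls already used in this notebook, `upload_file` and `wait_for_extraction`. The helper name `upload_presentations` and the folder layout are hypothetical.

```python
from pathlib import Path


def upload_presentations(client, graph_name, folder):
    """Upload every .pptx file in `folder`, then wait for each extraction."""
    content_ids = []
    # Upload everything first so extraction of all files can proceed together.
    for path in sorted(Path(folder).glob("*.pptx")):
        content_ids.append(client.upload_file(graph_name, str(path)))
    # Block until each uploaded file has been processed.
    for content_id in content_ids:
        client.wait_for_extraction(content_id)
    return content_ids


# Usage in this notebook (requires the running server from the setup step;
# "slides" is a hypothetical folder name):
# content_ids = upload_presentations(client, "pptqa", "slides")
```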

### **Perform RAG with OpenAI**

In [7]:
def get_context(question: str, index: str, top_k=3):
    results = client.search_index(name=index, query=question, top_k=top_k)
    context = ""
    for result in results:
        context = context + f"content id: {result['content_id']} \n\n passage: {result['text']}\n"
    return context

In [8]:
question = "What are the things to keep in mind to finetune a pretrained model?"
context = get_context(question, "pptqa.embedder.embedding")
context

'content id: 629da2991e51ccc0 \n\n passage: Focus on efficient pretraining while taking a holistic view of model life-cycle\ncontent id: bdfec28ce8239696 \n\n passage: Start by test existing models on your domain and task(s) of interest\ncontent id: c4fc86ea0bf467a4 \n\n passage: When the model is too big:\nTensor Parallelism\n'
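
Conceptually, `search_index` returns the `top_k` passages whose embeddings are closest to the embedding of the question. The sketch below illustrates that idea with cosine similarity over toy three-dimensional vectors. It is not Indexify's implementation, which this notebook does not describe, and the vectors are made up for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, index, k=3):
    """Return the k passages whose vectors are most similar to query_vec."""
    scored = [(cosine(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Passages shortened from the output above; the vectors are invented.
index = [
    ("Tensor Parallelism", [0.9, 0.1, 0.0]),
    ("Test existing models on your domain", [0.1, 0.9, 0.2]),
    ("Focus on efficient pretraining", [0.0, 0.2, 0.9]),
]
print(top_k([0.2, 0.8, 0.1], index, k=2))
# prints: ['Test existing models on your domain', 'Tensor Parallelism']
```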

In [9]:
def create_prompt(question, context):
    return f"Answer the question, based on the context.\n question: {question} \n context: {context}"

prompt = create_prompt(question, context)

In [10]:
from openai import OpenAI
client_openai = OpenAI()

Now you can ask any question about the ingested slides.

In [11]:
chat_completion = client_openai.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": prompt,
        }
    ],
    model="gpt-3.5-turbo",
)
print(chat_completion.choices[0].message.content)

Based on the context provided, to finetune a pretrained model, one should focus on efficient pretraining and take a holistic view of the model's life cycle. It is also important to start by testing existing models on your specific domain and tasks of interest. Additionally, when the model is too big, one strategy is to use Tensor Parallelism.
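
The retrieval, prompt, and completion steps above can be folded into one function, so that asking a new question is a single call. The sketch below is one possible arrangement: `answer_question` is a hypothetical helper, and the retrieval and completion steps are passed in as callables so the example runs here with stand-ins, without a server or an API key.

```python
def answer_question(question, retrieve, complete):
    """Retrieve context for the question, build the prompt, and return the reply."""
    context = retrieve(question)
    prompt = (
        "Answer the question, based on the context.\n"
        f" question: {question} \n context: {context}"
    )
    return complete(prompt)


# Stand-ins for the retrieval and model calls, used only to run the sketch.
def fake_retrieve(question):
    return "passage: Start by testing existing models on your domain"


def fake_complete(prompt):
    return f"(model reply to a prompt of {len(prompt)} characters)"


print(answer_question("How do I finetune a model?", fake_retrieve, fake_complete))
```

In this notebook, `retrieve` could be `lambda q: get_context(q, "pptqa.embedder.embedding")`, and `complete` could be a small function that wraps the `client_openai.chat.completions.create` call shown above and returns `choices[0].message.content`.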
