# RAG on Resume PDF files with Gemini's multimodal capabilities

<div class="align-center">
  <a href="https://getindexify.ai/"><img src="https://getindexify.ai/Indexify_Logo_Wordmark.svg" width="145"></a>
  <a href="https://discord.com/invite/kF8UZACA7r"><img src="https://raw.githubusercontent.com/rishiraj/random/main/Discord%20button.png" width="145"></a><br>
  Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/tensorlakeai/indexify">GitHub</a></i> ⭐
</div>

### Step 1: Direct Data Extraction from PDF with Gemini

The first step in Indexify's pipeline is to extract data, such as text, from sources like PDF files. Unstructured data is hard to process, and conventional OCR-based solutions often fail to produce coherent, complete content. This pipeline therefore uses Gemini's multimodal capabilities to perform the extraction.

### Step 2: Enhanced Chunking with RecursiveCharacterTextSplitter

The pipeline then chunks the extracted text with the RecursiveCharacterTextSplitter algorithm. The algorithm splits large texts on progressively finer separators (paragraph breaks first, then line breaks, then spaces), so that each chunk stays within a specified maximum chunk size while related text is kept together where possible.
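
To make the idea of recursive splitting concrete, here is a minimal sketch in plain Python. It is not the code the chunking extractor runs: the function name `recursive_split` and the separator order are assumptions chosen for illustration, and the sketch ignores chunk overlap entirely.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters.

    Coarse separators are tried first; a piece that is still too long
    is split again with the next, finer separator.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed width
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            # The part still fits, so keep growing the current chunk
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # The part alone is too long: recurse with finer separators
            chunks.extend(recursive_split(part, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "aaa bbb\n\nccc ddd eee"
print(recursive_split(sample, chunk_size=8))
```

With a chunk size of 8, the sample is first split at the paragraph break, and the second paragraph is then split again at spaces because it is still too long.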

### Step 3: Embedding Creation with Snowflake's Arctic Model

The final step in Indexify's pipeline is the creation of embeddings using Snowflake's Arctic embedding model. Embeddings are critical for enabling efficient similarity search and retrieval of relevant information from the chunked text.
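
As a rough illustration of how embeddings support similarity search, the sketch below ranks a few chunks by cosine similarity to a query vector. The three-dimensional vectors and the chunk names are invented for the example; real embeddings have far more dimensions, and the actual search is performed by the index, not by this code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return the names of the k chunks most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in chunk_vecs.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Toy embeddings (hypothetical values, for illustration only)
chunks = {
    "projects": [0.9, 0.1, 0.0],
    "education": [0.1, 0.9, 0.0],
    "skills": [0.7, 0.3, 0.1],
}
query = [1.0, 0.0, 0.0]
ranked = top_k(query, chunks, k=2)
print(ranked)
```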

## Creating a PDF Extraction Pipeline is Simple with Indexify

#### Install Indexify, Start the Server & Download the Extractors

In [1]:
%pip install indexify indexify-extractor-sdk

# Download Indexify Server
!curl https://getindexify.ai | sh

# Download Extractors
!indexify-extractor download hub://text/gemini
!indexify-extractor download hub://text/chunking
!indexify-extractor download hub://embedding/arctic

Note: you may need to restart the kernel to use updated packages.


After installing the libraries and downloading the server and the extractors, restart the runtime. Then start the Indexify server and the extractors.

Open two terminals and run the following commands:

```bash
# Terminal 1
./indexify server -d

# Terminal 2
indexify-extractor join-server
```

#### Create a Client, Define Extraction Graph & Ingest Contents

In [2]:
from indexify import IndexifyClient
client = IndexifyClient()

In [3]:
from indexify import ExtractionGraph

extraction_graph_spec = """
name: 'geminiresume'
extraction_policies:
   - extractor: 'tensorlake/gemini'
     name: 'pdfprocessor'
     input_params:
        model_name: 'gemini-1.5-flash-latest'
        prompt: 'Extract all text from the document.'
   - extractor: 'tensorlake/chunk-extractor'
     name: 'chunker'
     input_params:
        chunk_size: 1000
        overlap: 100
     content_source: 'pdfprocessor'
   - extractor: 'tensorlake/arctic'
     name: 'embedder'
     content_source: 'chunker'
"""

extraction_graph = ExtractionGraph.from_yaml(extraction_graph_spec)
client.create_extraction_graph(extraction_graph)
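
Each policy in the graph reads from the policy named in its `content_source`, and a policy without one reads the ingested file directly. The sketch below mirrors the three policies as hand-written dictionaries and derives the order in which they run. The helper `policy_chain` is hypothetical and not part of Indexify; it only illustrates how the links fit together.

```python
def policy_chain(policies):
    """Return policy names in execution order by following content_source links."""
    names = {p["name"] for p in policies}
    by_source = {}
    for p in policies:
        source = p.get("content_source")  # None means "reads the ingested content"
        if source is not None and source not in names:
            raise ValueError(f"{p['name']} reads from unknown policy {source!r}")
        by_source.setdefault(source, []).append(p["name"])

    order, frontier = [], [None]
    while frontier:
        source = frontier.pop(0)
        for name in by_source.get(source, []):
            order.append(name)
            frontier.append(name)
    return order


# Hand-written mirror of the YAML spec above (not parsed from it)
policies = [
    {"name": "pdfprocessor"},
    {"name": "chunker", "content_source": "pdfprocessor"},
    {"name": "embedder", "content_source": "chunker"},
]
print(policy_chain(policies))
```

A misspelled `content_source` raises a `ValueError` here, which is the kind of mistake worth checking for before creating the graph.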

In [None]:
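import requests

# Download the sample resume and fail loudly if the request does not succeed
response = requests.get("https://www.overleaf.com/latex/templates/iit-dhanbad-resume-oncampus/sdtkcgtgxhtg.pdf", timeout=30)
response.raise_for_status()

# Save the PDF next to the notebook
with open('resume.pdf', 'wb') as f:
    f.write(response.content)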

In [4]:
content_id = client.upload_file("geminiresume", "resume.pdf")
client.wait_for_extraction(content_id)

'29e347f7f00d02ad'

## Performing RAG with OpenAI

In [5]:
def get_context(question: str, index: str, top_k: int = 2) -> str:
    """Fetch the top_k passages most similar to the question and join them into one string."""
    results = client.search_index(name=index, query=question, top_k=top_k)
    context = ""
    for result in results:
        context += f"content id: {result['content_id']} \n\n passage: {result['text']}\n"
    return context
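
To see what the context string looks like without a running server, the formatting step can be exercised on hand-made results. The two dictionaries below are invented; they only assume that each search hit exposes `content_id` and `text` keys, as the function above does. The helper name `format_context` is also hypothetical.

```python
def format_context(results):
    """Join search hits into one context string, one passage per hit."""
    return "".join(
        f"content id: {r['content_id']} \n\n passage: {r['text']}\n" for r in results
    )


# Invented search hits, for illustration only
mock_results = [
    {"content_id": "demo-1", "text": "Built a React dashboard for sales data."},
    {"content_id": "demo-2", "text": "Wrote a Node.js REST API for a bookstore."},
]
context_preview = format_context(mock_results)
print(context_preview)
```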

In [None]:
question = "What are the JavaScript-related projects he has done?"
context = get_context(question, "geminiresume.embedder.embedding")
context
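
The index name used above appears to combine the graph name, the policy name, and the output name, separated by dots. Assuming that pattern holds, a small helper (hypothetical, not part of the Indexify client) avoids typos when the graph or policy is renamed:

```python
def embedding_index_name(graph: str, policy: str, output: str = "embedding") -> str:
    """Build an index name of the form <graph>.<policy>.<output> (assumed pattern)."""
    return f"{graph}.{policy}.{output}"


index_name = embedding_index_name("geminiresume", "embedder")
print(index_name)
```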

In [13]:
def create_prompt(question, context):
    return f"Answer the question, based on the context.\n question: {question} \n context: {context}"

prompt = create_prompt(question, context)

In [14]:
from openai import OpenAI
client_openai = OpenAI()

In [None]:
chat_completion = client_openai.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": prompt,
        }
    ],
    model="gpt-3.5-turbo",
)
print(chat_completion.choices[0].message.content)
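
The retrieval and generation steps can also be wrapped in one function so that the flow is easy to test. The sketch below is one possible arrangement, not the notebook's own code: `answer_question` takes the retriever and the model call as plain callables, and the two stubs stand in for Indexify and OpenAI so the example runs offline.

```python
from typing import Callable


def answer_question(
    question: str,
    retrieve: Callable[[str], str],
    complete: Callable[[str], str],
) -> str:
    """Retrieve context for the question, build the prompt, and ask the model."""
    context = retrieve(question)
    prompt = (
        "Answer the question, based on the context.\n"
        f" question: {question} \n context: {context}"
    )
    return complete(prompt)


# Stubs standing in for the real services (invented data)
def fake_retrieve(question: str) -> str:
    return "content id: demo-1 \n\n passage: Built a to-do app in JavaScript.\n"


def fake_complete(prompt: str) -> str:
    return "PROMPT SEEN: " + prompt


reply = answer_question("Which JavaScript projects are listed?", fake_retrieve, fake_complete)
print(reply)
```

In the notebook, `retrieve` would wrap `get_context` with the index name, and `complete` would wrap the `client_openai.chat.completions.create` call and return the message content.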