# Exploring OpenAI's multimodal capabilities with Indexify extractor

<div class="align-center">
  <a href="https://getindexify.ai/"><img src="https://getindexify.ai/images/logos/base2.svg" width="145"></a>
  <a href="https://discord.com/invite/kF8UZACA7r"><img src="https://raw.githubusercontent.com/rishiraj/random/main/Discord%20button.png" width="145"></a><br>
  Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/tensorlakeai/indexify">GitHub</a></i> ⭐
</div>

## Creating an Extraction Pipeline is Simple with Indexify

#### Install Indexify, Start the Server & Download the Extractors

In [1]:
%pip install indexify indexify-extractor-sdk

# Download Indexify Server
!curl https://getindexify.ai | sh

# Download Extractors
!indexify-extractor download tensorlake/openai
!indexify-extractor download tensorlake/chunk-extractor
!indexify-extractor download tensorlake/arctic

Note: you may need to restart the kernel to use updated packages.


After installing the libraries and downloading the server and the extractors, restart the runtime. Then run the Indexify server together with the extractors.

Open 2 terminals and run the following commands:

```bash
# Terminal 1
./indexify server -d

# Terminal 2
indexify-extractor join-server
```
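
Before moving on, it can help to confirm that the server is accepting connections. The snippet below is a minimal sketch using only the standard library; the host `localhost` and port `8900` are assumptions about a default local setup, and `is_port_open` is a hypothetical helper, not part of Indexify.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


# Assumption: the server listens on localhost:8900; adjust to your setup.
print(is_port_open("localhost", 8900))
```

If this prints `False`, check the output of Terminal 1 before creating a client.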

#### Create a Client, Define Extraction Graph & Ingest Contents

In [1]:
from indexify import IndexifyClient
client = IndexifyClient()

### Direct Data Extraction from **Texts** with OpenAI

The first example uses Indexify's pipeline to extract data, such as the names of programming languages, from plain text using OpenAI.

In [2]:
from indexify import ExtractionGraph

extraction_graph_spec = """
name: 'resume_text'
extraction_policies:
   - extractor: 'tensorlake/openai'
     name: 'pdfprocessor'
     input_params:
        model_name: 'gpt-3.5-turbo'
        key: 'OPENAI_API_KEY'
        prompt: 'Extract names of all programming languages from the text.'
"""

extraction_graph = ExtractionGraph.from_yaml(extraction_graph_spec)
client.create_extraction_graph(extraction_graph)

In [3]:
# The resume is one long string; adjacent literals inside the parentheses are concatenated.
text = (
    "**First Last** \nAdm. No. 22JEXXXX  \nfirstlast@gmail.com   \nXXX-XXX-XXXX\nlinkedin.com/in/firstlast\ngithub.com/firstlast \n\n**Education**\n**University Name**\nBachelor of Science in Computer Science (GPA: 4.00 / 4.00)\n* **Relevant Coursework:** Data Structures and Algorithms (C++), Prob & Stat in CS (Python), Intro to CS II (C++), Linear Algebra w/Computational Applications (Python)\n\nExpected | May 20XX\n------- | --------\nCity, State\n\n**Experience**\n\n**Company Name 1** \nJan 20XX - May 20XX\nSoftware Engineer\nCity, State\n* Implemented microservices architecture using Node.js and Express, improving API response time by 25% and reducing server load by 30%.\n* Led a cross-functional team in implementing a new feature using React and Redux, resulting in a 20% increase in user engagement within the first month.\n* Optimized MySQL database queries, reducing page load times by 15% and enhancing overall application performance.\n\n**Projects**\n\n**Project Name 1** | React.js, Angular, Vue.js, Django, Flask, Ruby on Rails\n* Led the development of a microservices-based e-commerce platform using Node.js, resulting in a 40% increase in daily transactions within the first quarter.\n* Designed and deployed a scalable RESTful API using Django and Django REST Framework, achieving a 30% improvement in data retrieval speed.\n* Implemented a real-time chat feature using WebSocket and Socket.io, enhancing user engagement and reducing response time by 20%.\n\n**Project Name 2** | Spring Boot, Express.js, TensorFlow, PyTorch, jQuery, Bootstrap\n* Developed a data visualization dashboard using D3.js, providing stakeholders with real-time insights and improving decision-making processes.\n* Built a CI/CD pipeline using Jenkins and Docker, reducing deployment time by 40% and ensuring consistent and reliable releases.\n\n**Technical Skills**\n**Languages:** Rust, Kotlin, Swift, Go, Scala, TypeScript, R, Perl, Haskell, Groovy, Julia, Dart\n**Technologies:** "
    "React.js, Angular, Vue.js, Django, Flask, Ruby on Rails, Spring Boot, Express.js, TensorFlow, PyTorch, jQuery, Bootstrap, Laravel, Flask, ASP.NET, Node.js, Electron, Android SDK, iOS SDK, Symfony\n**Concepts:** Compiler, Operating System, Virtual Memory, Cache Memory, Encryption, Decryption, Artificial Intelligence, Machine Learning, Neural Networks, API, Database Normalization, Agile Methodology, Cloud Computing\n\n**Achievements**\n* Pls Add your Achievements here e.g., Hackathons, Exam Ranks, etc. \n\n**Social Engagements**\n**Vice-President:** Of Association of Exploration Geophysicist - Student Chapter, IIT Dhanbad\n**Club Member:** at CYBER LABS -tech society of IIT Dhanbad\n**Volunteer:** at KARTAVYA - NGO run by students of IIT Dhanbad to educate underprivileged childrens.\n**Organiser:** Concetto'22 (Tech-fest) Khanan'22 (Geo-Mining fest).\n**Sports-Engagements:** Badminton(state-level), chess, cricket, table-tennis. \n"
)
content_id = client.add_documents("resume_text", text)
client.wait_for_extraction(content_id)

In [4]:
client.get_extracted_content(content_id, 'resume_text', 'pdfprocessor')

[{'id': '4cf0d70a9a64dbc7',
  'content': b'Here are the programming languages from the text, formatted as "First Last":\n\n* **C++**\n* **Python**\n* **Node.js** \n* **React**\n* **Redux**\n* **MySQL**\n* **React.js**\n* **Angular**\n* **Vue.js**\n* **Django**\n* **Flask**\n* **Ruby on Rails**\n* **Spring Boot**\n* **Express.js**\n* **TensorFlow**\n* **PyTorch**\n* **jQuery**\n* **Bootstrap**\n* **D3.js**\n* **Jenkins**\n* **Docker**\n* **Rust**\n* **Kotlin**\n* **Swift**\n* **Go**\n* **Scala**\n* **TypeScript**\n* **R**\n* **Perl**\n* **Haskell**\n* **Groovy**\n* **Julia**\n* **Dart**\n* **Laravel**\n* **ASP.NET** \n* **Electron**\n* **Android SDK**\n* **iOS SDK**\n* **Symfony** \n'}]
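
The extracted `content` is a byte string holding a Markdown bullet list, so it usually needs decoding and parsing before further use. The following is a minimal sketch of that step; `parse_bulleted_names` is a hypothetical helper (not part of Indexify), and it assumes the model keeps answering with `* **name**` bullets, which is not guaranteed.

```python
import re


def parse_bulleted_names(content: bytes) -> list[str]:
    """Decode extractor output and return the items of its Markdown bullet list."""
    text = content.decode("utf-8")
    names = []
    for line in text.splitlines():
        # Match lines such as "* **C++**", ignoring trailing whitespace.
        match = re.match(r"^\*\s+(.*\S)\s*$", line)
        if match:
            # Drop the surrounding bold markers.
            names.append(match.group(1).strip("*").strip())
    return names


sample = b"Here are the languages:\n\n* **C++**\n* **Python**\n* **Node.js** \n"
print(parse_bulleted_names(sample))  # ['C++', 'Python', 'Node.js']
```

Note that the model output above also lists frameworks and tools (for example React and Docker) alongside languages, so a filtering step may still be needed after parsing.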

### Direct Data Extraction from **Images** with OpenAI

The second example uses Indexify's pipeline to extract the same kind of data, such as the names of programming languages, from visual sources like images using OpenAI.

In [5]:
from indexify import ExtractionGraph

extraction_graph_spec = """
name: 'resume_images'
extraction_policies:
   - extractor: 'tensorlake/openai'
     name: 'pdfprocessor'
     input_params:
        model_name: 'gpt-4o'
        prompt: 'Extract names of all programming languages from the image.'
"""

extraction_graph = ExtractionGraph.from_yaml(extraction_graph_spec)
client.create_extraction_graph(extraction_graph)

In [6]:
content_id = client.upload_file("resume_images", "resume.jpg")
client.wait_for_extraction(content_id)

In [7]:
client.get_extracted_content(content_id, 'resume_images', 'pdfprocessor')

[{'id': 'f5faaadd623b8632',
  'content': b'The programming languages mentioned in the image are:\n\n* **React.js**\n* **Angular.js**\n* **Vue.js**\n* **Django**\n* **Flask**\n* **Ruby on Rails**\n* **Node.js**\n* **Express.js**\n* **TensorFlow**\n* **PyTorch**\n* **jQuery**\n* **Bootstrap**\n* **D3.js**\n* **Rust**\n* **Kotlin**\n* **Swift**\n* **Go**\n* **Scala**\n* **TypeScript**\n* **R**\n* **Perl**\n* **Haskell**\n* **Groovy**\n* **Julia**\n* **Dart** \n'}]

### Building an end-to-end RAG pipeline with **PDF**
#### Step 1: Direct Data Extraction from PDF with OpenAI

The first step in Indexify's pipeline is to extract data, such as text, from sources like PDF files. Unstructured data poses a significant challenge, and regular OCR-based solutions can't always produce coherent and complete content, so we use OpenAI's multimodal capabilities for the extraction.

#### Step 2: Enhanced Chunking with RecursiveCharacterTextSplitter

Next, Indexify's pipeline chunks the extracted text using the RecursiveCharacterTextSplitter algorithm. This algorithm is designed to handle large texts and create meaningful chunks that stay within a specified maximum chunk size (`chunk_size: 1000` with `overlap: 100` in the graph below).
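
To make the idea concrete, here is a simplified sketch of recursive character splitting: try a coarse separator first and fall back to finer ones only for pieces that are still too long. It is an illustration under stated assumptions, not the chunk extractor's actual implementation; the function name `recursive_split` and the separator order are hypothetical, and chunk overlap is left out for brevity.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters,
    trying coarse separators first and finer ones only when needed."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            # The part still fits into the chunk being built.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # The part alone is too long: retry with finer separators.
            chunks.extend(recursive_split(part, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks


print(recursive_split("aaa bbb ccc\n\nddd eee", chunk_size=7))
# ['aaa bbb', 'ccc', 'ddd eee']
```

Splitting on paragraph breaks before spaces keeps related sentences together where possible, which is why the second paragraph ends up as its own chunk in the example.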

#### Step 3: Embedding Creation with Snowflake's Arctic Model

The final step in Indexify's pipeline is the creation of embeddings using Snowflake's Arctic embedding model. Embeddings are critical for enabling efficient similarity search and retrieval of relevant information from the chunked text.
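
As a rough illustration of why embeddings enable similarity search, the sketch below ranks a few chunks by cosine similarity to a query vector. The three-dimensional vectors are made up for the example (real embedding vectors have far more dimensions), and `cosine_similarity` and `top_k` are hypothetical helpers, not Indexify APIs.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return the names of the k chunks most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in chunk_vecs.items()]
    return [name for _, name in sorted(scored, reverse=True)[:k]]


# Toy vectors, invented for illustration only.
chunks = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.7, 0.7, 0.0],
    "chunk_c": [0.0, 0.0, 1.0],
}
print(top_k([1.0, 0.1, 0.0], chunks))  # ['chunk_a', 'chunk_b']
```

In the pipeline itself this ranking is done by the index that stores the embeddings, so the sketch only shows the underlying idea.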

In [2]:
from indexify import ExtractionGraph

extraction_graph_spec = """
name: 'resumerag'
extraction_policies:
   - extractor: 'tensorlake/openai'
     name: 'pdfprocessor'
     input_params:
        model_name: 'gpt-4o'
        prompt: 'Extract all text from the document.'
   - extractor: 'tensorlake/chunk-extractor'
     name: 'chunker'
     input_params:
        chunk_size: 1000
        overlap: 100
     content_source: 'pdfprocessor'
   - extractor: 'tensorlake/arctic'
     name: 'embedder'
     content_source: 'chunker'
"""

extraction_graph = ExtractionGraph.from_yaml(extraction_graph_spec)
client.create_extraction_graph(extraction_graph)

In [3]:
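import requests

# Download the sample resume; fail early on an HTTP error instead of saving an error page.
req = requests.get("https://www.overleaf.com/latex/templates/iit-dhanbad-resume-oncampus/sdtkcgtgxhtg.pdf")
req.raise_for_status()

with open('resume.pdf', 'wb') as f:
    f.write(req.content)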

In [4]:
content_id = client.upload_file("resumerag", "resume.pdf")
client.wait_for_extraction(content_id)

In [5]:
client.get_extracted_content(content_id, 'resumerag', 'pdfprocessor')

[{'id': '41c9b925a44596c3',
  'content': b"**First Last** \nAdm. No. 22JEXXXX  \nfirstlast@gmail.com   \nXXX-XXX-XXXX\nlinkedin.com/in/firstlast\ngithub.com/firstlast \n\n**Education**\n**University Name**\nBachelor of Science in Computer Science (GPA: 4.00 / 4.00)\n* **Relevant Coursework:** Data Structures and Algorithms (C++), Prob & Stat in CS (Python), Intro to CS II (C++), Linear Algebra w/Computational Applications (Python)\n\nExpected | May 20XX\n------- | --------\nCity, State\n\n**Experience**\n\n**Company Name 1** \nJan 20XX - May 20XX\nSoftware Engineer\nCity, State\n* Implemented microservices architecture using Node.js and Express, improving API response time by 25% and reducing server load by 30%.\n* Led a cross-functional team in implementing a new feature using React and Redux, resulting in a 20% increase in user engagement within the first month.\n* Optimized MySQL database queries, reducing page load times by 15% and enhancing overall application performance.\n\n**Pr

## Performing RAG with OpenAI

In [5]:
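def get_context(question: str, index: str, top_k: int = 2) -> str:
    """Search the embedding index and join the top_k passages into one context string."""
    results = client.search_index(name=index, query=question, top_k=top_k)
    context = ""
    for result in results:
        # Keep the content id next to each passage so answers can be traced back.
        context += f"content id: {result['content_id']} \n\n passage: {result['text']}\n"
    return context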

In [None]:
question = "What are the javascript related projects he has done?"
context = get_context(question, "resumerag.embedder.embedding")
context

In [13]:
def create_prompt(question, context):
    return f"Answer the question, based on the context.\n question: {question} \n context: {context}"

prompt = create_prompt(question, context)

In [14]:
from openai import OpenAI
client_openai = OpenAI()

In [None]:
chat_completion = client_openai.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": prompt,
        }
    ],
    model="gpt-3.5-turbo",
)
print(chat_completion.choices[0].message.content)