In [1]:
import os
from dotenv import load_dotenv

load_dotenv()
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
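
If the key is missing from the environment, os.getenv returns None and the problem would typically surface only later, when the first request to OpenAI is made. A minimal sketch of an early check follows; require_env is a hypothetical helper and is not part of Indox or python-dotenv.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail early if it is unset.

    Hypothetical helper for illustration; not part of Indox or python-dotenv.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; add it to your .env file."
        )
    return value
```

After load_dotenv(), it could be used as OPENAI_API_KEY = require_env('OPENAI_API_KEY').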

In [2]:
from Indox import IndoxRetrievalAugmentation

## Using IndoxRetrievalAugmentation for PDF files

First, we initialize an instance of IndoxRetrievalAugmentation (IRA) to leverage its document processing capabilities:

In [3]:
IRA = IndoxRetrievalAugmentation()

Next, let's take a look at the configuration options available for IRA:

In [4]:
IRA.config

{'clustering': {'dim': 10, 'threshold': 0.1},
 'embedding_model': 'openai',
 'postgres': {'conn_string': 'postgresql+psycopg2://postgres:xxx@localhost:port/db_name'},
 'prompts': {'summary_model': {'content': 'You are a helpful assistant. Give a detailed summary of the documentation provided'}},
 'qa_model': {'temperature': 0},
 'splitter': 'semantic-text-splitter',
 'summary_model': {'max_tokens': 100,
  'min_len': 30,
  'model_name': 'gpt-3.5-turbo-0125'},
 'vector_store': 'chroma'}
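
The configuration is a nested dictionary, so changing a single option (for example max_tokens of the summary model) means updating a nested key without discarding its siblings. The sketch below works on a plain copy of some of the values printed above. deep_update is a hypothetical helper, and whether IRA.config is meant to be modified in place or through another mechanism is not shown in this notebook, so treat that part as an assumption.

```python
from copy import deepcopy


def deep_update(base: dict, overrides: dict) -> dict:
    """Return a copy of base with overrides merged in, recursing into nested dicts."""
    merged = deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections key by key instead of replacing them.
            merged[key] = deep_update(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# A subset of the configuration shown in the output above.
config = {
    "summary_model": {
        "max_tokens": 100,
        "min_len": 30,
        "model_name": "gpt-3.5-turbo-0125",
    },
    "vector_store": "chroma",
}

# Only max_tokens changes; min_len and model_name are preserved.
new_config = deep_update(config, {"summary_model": {"max_tokens": 200}})
print(new_config["summary_model"])
```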

The unstructured argument of create_chunks controls how the document is read:

- If unstructured is set to True, IRA extracts data from document types such as PDF, text, HTML, Markdown, or LaTeX using the unstructured library.
- If unstructured is set to False, IRA expects the document to be either a PDF or a text file. In this mode, you can also add an extra clustering layer to the document processing.

This lets you handle a wide range of document types and tailor the processing approach to your needs.
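
Later cells in this notebook pass content_type="html" for a web page and content_type="md" for a Markdown file, and omit the argument for the PDF. A small sketch of choosing that value from the source string is shown here; guess_content_type and its mapping are hypothetical, and only the two values that actually appear in this notebook are used.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlparse

# Only "html" and "md" appear in this notebook; no other values are assumed.
_CONTENT_TYPES = {".html": "html", ".htm": "html", ".md": "md"}


def guess_content_type(source: str) -> Optional[str]:
    """Guess a content_type value for a path or URL (hypothetical helper).

    Returns None when nothing is known, in which case the argument can be
    omitted, as the PDF example in this notebook does.
    """
    parsed = urlparse(source)
    is_url = parsed.scheme in ("http", "https")
    path = parsed.path if is_url else source
    suffix = PurePosixPath(path).suffix.lower()
    if suffix in _CONTENT_TYPES:
        return _CONTENT_TYPES[suffix]
    # Web pages without a recognised suffix are treated as HTML.
    return "html" if is_url else None
```

The result could then be forwarded to create_chunks only when it is not None.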

In [5]:
file_path = "Benchmarking_large_language_models.pdf" 

In [6]:
documents = IRA.create_chunks(file_path=file_path, unstructured=True)

Starting processing...


2024-05-02 14:06:53,243 - INFO - Reading PDF for file: Benchmarking_large_language_models.pdf ...
2024-05-02 14:06:54,255 - INFO - Detecting page elements ...
2024-05-02 14:06:55,475 - INFO - Detecting page elements ...
2024-05-02 14:06:56,678 - INFO - Detecting page elements ...
2024-05-02 14:06:57,982 - INFO - Detecting page elements ...
2024-05-02 14:06:59,175 - INFO - Detecting page elements ...
2024-05-02 14:07:00,577 - INFO - Detecting page elements ...
2024-05-02 14:07:02,188 - INFO - Detecting page elements ...
2024-05-02 14:07:03,852 - INFO - Detecting page elements ...
2024-05-02 14:07:05,483 - INFO - Detecting page elements ...
2024-05-02 14:07:07,221 - INFO - Detecting page elements ...
2024-05-02 14:07:08,804 - INFO - Detecting page elements ...
2024-05-02 14:07:10,379 - INFO - Detecting page elements ...
2024-05-02 14:07:11,901 - INFO - Detecting page elements ...
2024-05-02 14:07:13,474 - INFO - Detecting page elements ...
2024-05-02 14:07:14,973 - INFO - Detecting page 

End Chunking process


In [7]:
documents

[Document(page_content='Article Benchmarking large language models from open and closed source models to apply data annotation for free-text criteria in the healthcare\n\nAli Nemati1,†,‡ , Mohammad Assadi Shalmani1,†,‡ , Qiang Lu, and Jake Luo2,*\n\n1 Ali Nemati, nemati@uwm.edu; Tel.: +1-971-400-2132 2 Mohammad Assadi Shalmani, assadis2@uwm.edu; Tel.: +1-414-406-9052 3 Qiang Lu, luqiang@cup.edu.cn * † Current address: Affiliation. ‡\n\nThese authors contributed equally to this work.', metadata={'filename': 'Benchmarking_large_language_models.pdf', 'filetype': 'application/pdf', 'last_modified': '2024-04-10T18:40:10', 'page_number': 1}),
 Document(page_content='Abstract: LLMs can significantly improve data annotation for free-text healthcare records, but thorough evaluation is critical to ensure they meet strict accuracy and reliability requirements, particularly in understanding patient characteristics commonly used in clinical research. In this study, we aim to assess LLM performance 

### Connecting to the Vector Database and Storing Data

Step 1: Connect to the Vector Database:
Start by extracting the connection settings from your configuration file. These settings should include the database connection string and any other necessary parameters. Then just pass the collection name.

Step 2: Store Chunks:
Use the store_in_vectorstore method to store the prepared chunks.
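
The chunking, connection, and storage steps can be wrapped into one function, sketched below. index_document is a hypothetical helper; it relies only on the three methods called in this notebook (create_chunks, connect_to_vectorstore, store_in_vectorstore) and assumes they accept the keyword arguments used here.

```python
def index_document(ira, file_path, collection_name, **chunk_options):
    """Chunk a document and store the chunks in a named collection.

    Hypothetical helper: ira is expected to be an IndoxRetrievalAugmentation
    instance, or any object exposing the same three methods.
    """
    chunks = ira.create_chunks(file_path=file_path, **chunk_options)
    if not chunks:
        # Stop before touching the vector store when nothing was extracted.
        raise ValueError(f"No chunks were produced for {file_path!r}")
    ira.connect_to_vectorstore(collection_name=collection_name)
    ira.store_in_vectorstore(chunks=chunks)
    return chunks
```

For the PDF in this section, the call would look like index_document(IRA, file_path, "sample_pdf", unstructured=True).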

In [8]:
IRA.connect_to_vectorstore(collection_name="sample_pdf")

2024-05-02 14:08:08,211 - INFO - Anonymized telemetry enabled. See                     https://docs.trychroma.com/telemetry for more information.


Connection established successfully.


In [9]:
IRA.store_in_vectorstore(chunks=documents)

2024-05-02 14:08:11,664 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-05-02 14:08:15,281 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-05-02 14:08:15,897 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-05-02 14:08:16,401 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-05-02 14:08:16,856 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-05-02 14:08:17,283 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-05-02 14:08:17,664 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-05-02 14:08:18,070 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-05-02 14:08:18,498 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-05-02 14:08:18,958 - INFO - HTTP

<Indox.vectorstore.ChromaVectorStore at 0x1f754cd96a0>

Execute a query and retrieve the response, along with the scores of the retrieved chunks and the context that was sent to the large language model (LLM).

In [10]:
query = "which model's result is better for gender criteria "

In [11]:
response = IRA.answer_question(query=query, top_k=5)

2024-05-02 14:08:22,188 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-05-02 14:08:25,266 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


In [15]:
response[0]

'The GPT4 model for clean data is the better model for the Gender criteria, as it has the highest Human Validation score of 0.59 compared to the GPT4 model for raw data.'
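
The answer is the first element of the response. The sketch below prints it next to the retrieved chunks and their scores. It assumes the response has the shape (answer, (contexts, scores)) with numeric scores; that shape is inferred from the response[1] output in the Markdown section of this notebook and is not confirmed by the Indox documentation. describe_response is a hypothetical helper.

```python
def describe_response(response, max_chars=80):
    """Format an answer with its retrieved contexts and scores.

    Assumes response == (answer, (contexts, scores)); this shape is an
    assumption inferred from the notebook output, not a documented contract.
    """
    answer, retrieved = response[0], response[1]
    contexts, scores = retrieved[0], retrieved[1]
    lines = [f"Answer: {answer}"]
    for rank, (text, score) in enumerate(zip(contexts, scores), start=1):
        # Collapse whitespace so each chunk fits on one line.
        snippet = " ".join(text.split())[:max_chars]
        lines.append(f"{rank}. score={score:.4f} {snippet}")
    return "\n".join(lines)
```

Calling print(describe_response(response)) after answer_question would then list the top_k chunks in the order they were returned.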

## Indox for HTML

In [17]:
IRA_2 = IndoxRetrievalAugmentation()

In [18]:
IRA_2.config

{'clustering': {'dim': 10, 'threshold': 0.1},
 'embedding_model': 'openai',
 'postgres': {'conn_string': 'postgresql+psycopg2://postgres:xxx@localhost:port/db_name'},
 'prompts': {'summary_model': {'content': 'You are a helpful assistant. Give a detailed summary of the documentation provided'}},
 'qa_model': {'temperature': 0},
 'splitter': 'semantic-text-splitter',
 'summary_model': {'max_tokens': 100,
  'min_len': 30,
  'model_name': 'gpt-3.5-turbo-0125'},
 'vector_store': 'chroma'}

In [6]:
html = "https://www.python.org/"

In [20]:
chunks = IRA_2.create_chunks(file_path=html, unstructured=True, content_type="html")

Starting processing...


2024-05-02 14:09:39,971 - INFO - Reading document from string ...
2024-05-02 14:09:40,053 - INFO - Reading document ...


End Chunking process


In [21]:
chunks

[Document(page_content='Notice: While JavaScript is not essential for this website, your interaction with the content will be limited. Please turn JavaScript on for the full experience.\n\nSkip to content\n\n▼ Close\n\nPython\n\nPSF\n\nDocs\n\nPyPI\n\nJobs\n\nCommunity\n\n▲ The Python Network\n\nDonate', metadata={'emphasized_text_contents': "['Notice:', '▼', '▼', '▲', '▲']", 'emphasized_text_tags': "['strong', 'span', 'span', 'span', 'span']", 'filetype': 'text/html', 'link_texts': "['Skip to content', '\\n                    ', 'Python', 'PSF', 'Docs', 'PyPI', 'Jobs', 'Community', '\\n                    ', 'Donate']", 'link_urls': "['#content', '#python-network', '/', 'https://www.python.org/psf/', 'https://docs.python.org', 'https://pypi.org/', '/jobs/', '/community-landing/', '#top', 'https://psfmember.org/civicrm/contribute/transact?reset=1&id=2']", 'page_number': 1, 'url': 'https://www.python.org/'}),
 Document(page_content='≡ Menu\n\nA A\n                                    \n 

In [22]:
IRA_2.connect_to_vectorstore(collection_name="sample_html")

Connection established successfully.


In [23]:
IRA_2.store_in_vectorstore(chunks=chunks)

2024-05-02 14:09:46,856 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-05-02 14:09:47,394 - INFO - Document added successfully to the vector store.


<Indox.vectorstore.ChromaVectorStore at 0x1f71ca9c380>

In [24]:
query = "what is python?"

In [25]:
response = IRA_2.answer_question(query=query,top_k=3)

2024-05-02 14:09:55,891 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-05-02 14:09:59,735 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


In [26]:
response[0]

"Python is a programming language that allows users to work quickly and efficiently integrate systems. It is easy to learn and use, making it suitable for both beginners and experienced developers. Python offers a variety of applications, including web development with frameworks like Django and Flask, GUI development with libraries such as tkInter and PyQt, scientific and numeric computing with tools like SciPy and Pandas, software development with platforms like Buildbot and Trac, and system administration with tools like Ansible and OpenStack. Additionally, Python Enhancement Proposals (PEPs) discuss the future of Python. Python's clean syntax and indentation structure make it quick and easy to learn, appealing to both experienced programmers and beginners in other languages."

## Indox for Markdown

In [3]:
IRA_3 = IndoxRetrievalAugmentation()

In [4]:
IRA_3.config

{'clustering': {'dim': 10, 'threshold': 0.1},
 'embedding_model': 'openai',
 'postgres': {'conn_string': 'postgresql+psycopg2://postgres:xxx@localhost:port/db_name'},
 'prompts': {'summary_model': {'content': 'You are a helpful assistant. Give a detailed summary of the documentation provided'}},
 'qa_model': {'temperature': 0},
 'splitter': 'semantic-text-splitter',
 'summary_model': {'max_tokens': 100,
  'min_len': 30,
  'model_name': 'gpt-3.5-turbo-0125'},
 'vector_store': 'chroma'}

In [7]:
md = "README.md"
chunks_md = IRA_3.create_chunks(file_path=md, unstructured=True, content_type="md")

2024-05-02 15:10:30,891 - INFO - Reading document from string ...
2024-05-02 15:10:30,891 - INFO - Reading document ...


Starting processing...


2024-05-02 15:10:31,043 - INFO - HTML element instance has no attribute type
2024-05-02 15:10:31,043 - INFO - HTML element instance has no attribute type
2024-05-02 15:10:31,043 - INFO - HTML element instance has no attribute type
2024-05-02 15:10:31,043 - INFO - HTML element instance has no attribute type
2024-05-02 15:10:31,043 - INFO - HTML element instance has no attribute type
2024-05-02 15:10:31,043 - INFO - HTML element instance has no attribute type
2024-05-02 15:10:31,043 - INFO - HTML element instance has no attribute type
2024-05-02 15:10:31,043 - INFO - HTML element instance has no attribute type
2024-05-02 15:10:31,043 - INFO - HTML element instance has no attribute type
2024-05-02 15:10:31,043 - INFO - HTML element instance has no attribute type
2024-05-02 15:10:31,043 - INFO - HTML element instance has no attribute type
2024-05-02 15:10:31,043 - INFO - HTML element instance has no attribute type
2024-05-02 15:10:31,043 - INFO - HTML element instance has no attribute type

End Chunking process


In [8]:
chunks_md

[Document(page_content='inDox : Advance Search and Retrieval Augmentation Generative\n\nOverview\n\nThis project combines advanced clustering techniques provided by Raptor with the efficient retrieval capabilities of pgvector and other vectorstores. It allows users to interact with and visualize their data in a PostgreSQL database. The solution involves segmenting text data into manageable chunks, enhancing retrieval through a custom model, and providing an interface for querying and retrieving relevant information.', metadata={'filename': 'README.md', 'filetype': 'text/markdown', 'last_modified': '2024-04-30T15:43:37', 'page_number': 1}),
 Document(page_content='Prerequisites\n\nBefore you can run this project, you need the following installed:\n- Python 3.8+\n- PostgreSQL (if you want to store your data on postgres)\n- OpenAI API Key (if using OpenAI embedding model)\n\nEnsure your system also meets the following requirements:\n- Access to environmental variables for sensitive inform

In [9]:
IRA_3.connect_to_vectorstore(collection_name="sample_md")

2024-05-02 15:10:35,355 - INFO - Anonymized telemetry enabled. See                     https://docs.trychroma.com/telemetry for more information.


Connection established successfully.


In [10]:
IRA_3.store_in_vectorstore(chunks=chunks_md)

2024-05-02 15:10:37,479 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-05-02 15:10:37,970 - INFO - Document added successfully to the vector store.


<Indox.vectorstore.ChromaVectorStore at 0x1de74bb42f0>

In [11]:
query = "how summary model works?"

In [12]:
response = IRA_3.answer_question(query=query,top_k=3)

2024-05-02 15:10:38,384 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-05-02 15:10:41,781 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


In [13]:
response[0]

'The summary model works by generating a concise summary of a given input text based on specified parameters such as minimum and maximum number of tokens, model name, and temperature. The model condenses the input text into a shorter version while maintaining the key information. The summary model can be customized by adjusting parameters such as min_len, max_tokens, and model_name to control the length and quality of the generated summary. The temperature parameter influences the diversity of the output, with higher values leading to more diverse but potentially nonsensical summaries, and lower values resulting in more focused but less diverse summaries.'

In [14]:
response[1]

(['min_len: Minimum number of tokens that summary model generates.\n\nmodel_name: The model you want to use as your summary model. the defualt is gpt-3.5-turbo-0125. you can use huggingface models that support summarize pipeline, for example, you can use:\n\npython\n  {"summary_model": {"max_tokens":   100, "min_len": 30, "model_name": "Falconsai/medical_summarization"}}\n- vector_store: Specify which vectorstore you want to use. defualt is pgvector, but you also can use "chroma" and "faiss" instead:',
  'temperature: The temperature of Question Answering model. if this parameter is high, the diversity of the would be high but probability of nonsense and hallucinations would incearse, and if this parameter is low, the diversity of the output would be low but also probability of hallucinations would decrease.\n\nsummary_model\n\nmax_tokens: Specifies max number of tokens that summary model could generate. more tokens means more cost and also potentially more quality.',
  'Roadmap\n\n[x]

In [15]:
IRA_3.evaluate()

BertScore scores:
   Precision@3: 0.4171
   Recall@3: 0.4959
   F1@3: 0.5013


In [16]:
IRA_3.get_tokens_info()


                    Overview of All Tokens Used:
                    Tokens used in the embedding section that were sent to the database: 1358
                               
