# Dense-X-Retrieval Pack

This notebook walks through using the `DenseXRetrievalPack`, which parses documents into nodes and then generates propositions from each node to assist with retrieval.

This follows the idea from the paper [Dense X Retrieval: What Retrieval Granularity Should We Use?](https://arxiv.org/abs/2312.06648).

From the paper, a proposition is described as:

```
Propositions are defined as atomic expressions within text, each encapsulating a distinct factoid and presented in a concise, self-contained natural language format.
```

We use the provided OpenAI prompt from their paper to generate propositions, which are then embedded and used to retrieve their parent node chunks.
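
To make the proposition-to-parent lookup concrete, here is a minimal, self-contained sketch. It is not the pack's implementation: the chunk ids, texts, and function names are made up for illustration, and a bag-of-words cosine similarity stands in for the real embedding model.

```python
import math
import re
from collections import Counter

# Toy data (hypothetical): each proposition points back to the chunk it came from.
PARENT_CHUNKS = {
    "chunk-1": "CBAS was founded in 2021. It supports the UN Sustainable Development Goals.",
    "chunk-2": "Digital public goods include open-source software and open data.",
}
PROPOSITIONS = [
    ("CBAS was founded in 2021.", "chunk-1"),
    ("CBAS supports the UN Sustainable Development Goals.", "chunk-1"),
    ("Digital public goods include open-source software.", "chunk-2"),
    ("Digital public goods include open data.", "chunk-2"),
]


def bag_of_words(text):
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve_parent_chunks(query, top_k=2):
    """Rank propositions against the query, then return their de-duplicated parent chunks."""
    query_vec = bag_of_words(query)
    ranked = sorted(
        PROPOSITIONS,
        key=lambda item: cosine(query_vec, bag_of_words(item[0])),
        reverse=True,
    )
    parent_ids = []
    for _, parent_id in ranked[:top_k]:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [PARENT_CHUNKS[parent_id] for parent_id in parent_ids]


print(retrieve_parent_chunks("When was CBAS founded?"))
```

The match happens on the short propositions, but the text handed back is the full parent chunk, so the LLM still sees the surrounding context.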

In [None]:
!pip install python-dotenv llama-index llama-hub

In [None]:
!pip install "unstructured[pdf]"

In [None]:
import nest_asyncio
nest_asyncio.apply()

## Setup

In [None]:
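# Path of the .env file that will hold the API key
dotenv_path = "env"

# Create the .env file; replace the placeholder with your own key
# and never commit a real key to a shared notebook
with open(dotenv_path, "w") as f:
    f.write('OPENAI_API_KEY="<your-openai-api-key>"\n')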

In [None]:
from dotenv import load_dotenv

# Load the environment variables
load_dotenv(dotenv_path=dotenv_path)

True
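
`load_dotenv` returning `True` only says the file was read. A small check like the sketch below can confirm the key actually reached the environment without printing it in full. The helper name `describe_api_key` is hypothetical and not part of `python-dotenv` or LlamaIndex.

```python
import os


def describe_api_key(env_var="OPENAI_API_KEY"):
    """Return a masked summary of the key, or raise if it is missing."""
    key = os.environ.get(env_var, "")
    if not key:
        raise RuntimeError(f"{env_var} is not set; check your .env file")
    # Show only the prefix and the last four characters.
    return f"{key[:3]}...{key[-4:]} ({len(key)} chars)"
```

Calling `print(describe_api_key())` right after `load_dotenv(...)` fails fast if the key is missing, instead of failing later inside the first OpenAI call.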

For this demo we use a simple `PDFReader` to read the PDF and extract one document per page. The `UnstructuredReader` cell after it shows a more advanced loader that extracts the complete document from the same file.

In [None]:
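from pathlib import Path
from llama_index import download_loader

# Download the PDFReader loader from LlamaHub
PDFReader = download_loader("PDFReader")

loader = PDFReader()
# Pass a Path object, which is what the loader's `file` argument is typed to take
documents = loader.load_data(file=Path("/content/220929_Chinese_Academy_of_Sciences.pdf"))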

In [None]:
from llama_hub.file.unstructured import UnstructuredReader

documents_2 = UnstructuredReader().load_data("/content/220929_Chinese_Academy_of_Sciences.pdf")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [None]:
documents

[Document(id_='b3bd255a-6be3-4b6a-a31c-44f03c4775c4', embedding=None, metadata={'page_label': '1', 'file_name': '220929_Chinese_Academy_of_Sciences.pdf'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='e0b3349960b78112847b7dbade68f6570f66ceae4eb1cd0d402cd4f1d6abd16c', text='3\n', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'),
 Document(id_='c39fc8fd-deca-4a06-ba9c-833e8416c0dd', embedding=None, metadata={'page_label': '2', 'file_name': '220929_Chinese_Academy_of_Sciences.pdf'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='4c198e728906188a720b7e8de2e32384fa117cf9b0cf9a28cdc5b703a05dea69', text="4About CBAS\nThe International Research Center of Big Data for Sustainable \nDevelopment Goals (CBAS), founded in 2021, was established \nin response to the call by Chinese President Xi Jinping at the 75th UN Genera

## Run the DenseXRetrievalPack

The `DenseXRetrievalPack` creates both a retriever and a query engine.

First, we download the pack:

In [None]:
from llama_index.llama_pack import download_llama_pack

DenseXRetrievalPack = download_llama_pack("DenseXRetrievalPack", "./dense_pack")

Now we create the retriever and the query engine from the `DenseXRetrievalPack`, using `gpt-3.5-turbo` as the LLM both for proposition extraction and for answering queries.

In [None]:
from llama_index.llms import OpenAI
from llama_index.text_splitter import SentenceSplitter

dense_pack = DenseXRetrievalPack(
    documents,
    proposition_llm=OpenAI(model="gpt-3.5-turbo", max_tokens=750),
    query_llm=OpenAI(model="gpt-3.5-turbo", max_tokens=256),
    text_splitter=SentenceSplitter(chunk_size=1024),
)
dense_query_engine = dense_pack.query_engine

Let's create a baseline query engine over the same documents to compare the results.

In [None]:
from llama_index import VectorStoreIndex

base_index = VectorStoreIndex.from_documents(documents)
base_query_engine = base_index.as_query_engine()
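
For a like-for-like comparison, it can help to send the same question to both engines and collect the answers side by side. The sketch below assumes only that each engine has a `query` method whose result exposes the answer text as `.response` (as the responses printed in this notebook do). `compare_engines` and `EchoEngine` are hypothetical names, and `EchoEngine` is a stand-in so the example runs without an API key.

```python
def compare_engines(engines, question):
    """Ask every engine the same question and collect the answer text by engine name."""
    answers = {}
    for name, engine in engines.items():
        response = engine.query(question)
        # Use the `.response` attribute when present, otherwise fall back to str().
        answers[name] = getattr(response, "response", str(response))
    return answers


class EchoEngine:
    """Stand-in engine (hypothetical) that just echoes the question."""

    def __init__(self, prefix):
        self.prefix = prefix

    def query(self, question):
        return f"{self.prefix}: {question}"


demo = compare_engines(
    {"dense_x": EchoEngine("dense"), "base": EchoEngine("base")},
    "What are digital public goods?",
)
for name, answer in demo.items():
    print(f"[{name}] {answer}")
```

With the real engines, the call would look like `compare_engines({"dense_x": dense_query_engine, "base": base_query_engine}, question)`.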

## Solve a Query

### Can you tell me about Digital Public Goods for SDG Indicators?

In [None]:
response = dense_query_engine.query("Can you tell me about Digital Public Goods for SDG Indicators?")

In [None]:
print(response.response)

Digital Public Goods for SDG Indicators are defined as open-source software, open data, open artificial intelligence models, open standards, and open content that are publicly accessible and contribute to the achievement of the Sustainable Development Goals (SDGs). These digital resources are intended to promote awareness, knowledge, research, and innovative solutions to global and regional challenges. The goal is to continually upgrade and standardize these digital public goods to improve global monitoring of SDG progress, fill data gaps, and enhance global scientific and technical capacities. The development of these digital public goods is guided by core principles such as universality of science, scalability, inclusivity, availability and accessibility, quality acceptability, and enhancing training and upskilling across sectors and nations. The Digital Public Good Alliance (DPGA) plays a significant role in promoting the concept of Digital Public Goods and establishing standards fo

In [None]:
response = base_query_engine.query("Can you tell me about the key actions taken for the core principle digital public goods should be scalable?")
print(response.response)

The key actions taken for the core principle that digital public goods should be scalable include incorporating multi-stakeholder engagement and a mechanism for a bottom-up approach in the development process of digital public goods for SDG indicators. This ensures that the development process is inclusive and allows for innovation and creativity. Additionally, global SMART objectives should be created, aligning with national goals. Participation and engagement of relevant stakeholders from developing countries and local communities should be ensured. Efforts should be made to identify and engage projects and organizations already developing similar or related digital products to avoid duplication. Interdisciplinary collaborations and pooling of digital and human resources between participants, leading organizations, and public sector institutions should be promoted.


In [None]:
print(response.metadata)

{'8c6eee25-7009-48d9-9fff-c019d34c55e8': {'page_label': '9', 'file_name': '220929_Chinese_Academy_of_Sciences.pdf'}, '9680ac74-f7cb-486d-aab7-e1071ad599ed': {'page_label': '8', 'file_name': '220929_Chinese_Academy_of_Sciences.pdf'}}
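
The metadata is a dictionary keyed by node id, so pulling out the cited pages takes a few lines. The sketch below is one way to do it, using the dictionary printed above as sample input. `cited_pages` is a hypothetical helper, not a LlamaIndex function.

```python
def cited_pages(metadata):
    """Return the sorted, de-duplicated page labels found in response metadata."""
    pages = {m["page_label"] for m in metadata.values() if "page_label" in m}
    # Numeric labels sort by value and come first; other labels sort alphabetically.
    return sorted(
        pages,
        key=lambda p: (not p.isdigit(), int(p) if p.isdigit() else 0, p),
    )


example_metadata = {
    "8c6eee25-7009-48d9-9fff-c019d34c55e8": {
        "page_label": "9",
        "file_name": "220929_Chinese_Academy_of_Sciences.pdf",
    },
    "9680ac74-f7cb-486d-aab7-e1071ad599ed": {
        "page_label": "8",
        "file_name": "220929_Chinese_Academy_of_Sciences.pdf",
    },
}

print(cited_pages(example_metadata))  # ['8', '9']
```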


In [None]:
for srcnode in response.source_nodes:
    # Print a separator, then the raw text of each retrieved source node
    print("###########")
    # To include the metadata too, use srcnode.get_content(metadata_mode="all")
    print(srcnode.text)

###########
7challenges. Secondly, improvement of global scientific and technical capabilities through an open science 
approach was extensively suggested by different experts in several of the responses received. Table 1 provides a summarized list of Core Principles with associated Key Actions to facilitate and guide the development process of proposed digital public goods for SDG indicators. 
Table 1: Core Principles and Key Actions suggested by experts for developing digital public goods for 
SDG Indicators
Core Principles Key Actions
1Universality of science1.1 Promote open science, open data, and open knowledge
1.2Encourage scientific partnerships, collaborations, and cooperation and 
multi-disciplinary stakeholder engagements
2The digital public goods should be scalable 2.1Development process of digital public goods for SDG indicators should incorporate multi-stakeholder engagement and a mechanism for a bottom to top approach 
3The development process should be inclusive and inno