# Main points RAG

**An algorithm for global and accurate retrieval-augmented generation (RAG)**

The algorithm has the following steps:

1. Extract points of information from a paper that are relevant to answering a query.
2. Identify chunks in the paper that are semantically close to the extracted points.
3. Select only the chunks with semantic distance under a threshold.
4. Build context information from the selected chunks.
5. Give the LLM the context information to answer the query.
6. Check if the answer has unsupported claims.
7. Revise the answer to address any identified unsupported claims.
8. Repeat steps 6 and 7 until there are no unsupported claims.
9. Return the answer and retrieved chunks.
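
The steps above can be sketched as a single function. This is a minimal sketch, not the actual `main_points_rag` implementation: the callables `extract_points`, `distance`, `generate`, `find_unsupported` and `fix` are hypothetical placeholders for the LLM and embedding calls, the paper is assumed to be split into chunks already, and bounding the loop with `max_iterations` is an assumption suggested by the parameter of the same name in the usage example.

```python
from typing import Callable, List, Tuple


def main_points_rag_sketch(
    query: str,
    chunks: List[str],                                  # paper already split into chunks (assumption)
    extract_points: Callable[[str, str], List[str]],    # hypothetical LLM call (step 1)
    distance: Callable[[str, str], float],              # hypothetical semantic distance (step 2)
    generate: Callable[[str, str], str],                # hypothetical LLM call (step 5)
    find_unsupported: Callable[[str, str], List[str]],  # hypothetical LLM call (step 6)
    fix: Callable[[str, List[str], str], str],          # hypothetical LLM call (step 7)
    threshold: float = 0.4,
    max_iterations: int = 3,
) -> Tuple[str, List[str]]:
    # Step 1: extract points from the full paper text.
    paper_text = " ".join(chunks)
    points = extract_points(paper_text, query)

    # Steps 2-3: keep the closest chunk for each point, but only if its
    # distance to the point is under the threshold.
    selected: List[str] = []
    for point in points if chunks else []:
        best = min(chunks, key=lambda chunk: distance(point, chunk))
        if distance(point, best) < threshold and best not in selected:
            selected.append(best)

    # Step 4: build the context from the selected chunks.
    context = "\n\n".join(selected)

    # Step 5: answer the query from the context only.
    answer = generate(context, query)

    # Steps 6-8: check and fix until no unsupported claims remain
    # (or the iteration budget runs out).
    for _ in range(max_iterations):
        unsupported = find_unsupported(answer, context)
        if not unsupported:
            break
        answer = fix(answer, unsupported, context)

    # Step 9: return the answer together with its supporting chunks.
    return answer, selected
```

Passing the model calls in as functions keeps the control flow visible and lets the loop be exercised with simple stubs before any real LLM is involved.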

## Global context

The LLM is first given the full document text and asked to extract points of information that are present in the paper and relevant to the query (step 1). An individual point may not be semantically similar to the query, but it is information needed for the answer that is likely present in the paper.

This contrasts with standard RAG, which retrieves only text that is similar to the query: here, information that is not directly similar to the query can be logically combined to generate a nuanced answer. Other advanced RAG algorithms also divide the query into sub-queries. The difference here is that the sub-queries (points) are confined to information that must be explicitly present in the paper. This makes it possible to verify quantitatively, through similarity metrics, whether a point or sub-query is valid.

## Accurate retrieval

Each extracted point is compared with literal chunks from the paper (step 2). Only chunks whose semantic distance to a point is under the threshold are selected (step 3). This mitigates hallucinations by considering only points that are closely related to explicit text in the paper.
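
Steps 2 and 3 can be illustrated with a toy example. The sketch below uses a bag-of-words cosine distance as a stand-in for a real embedding model, which this document does not specify. The names `cosine_distance` and `select_chunks` are hypothetical, and the defaults for `threshold` and `num_chunks_per_point` only mirror the parameters shown in the usage example.

```python
import math
import re
from collections import Counter
from typing import List


def bag_of_words(text: str) -> Counter:
    # Toy "embedding": lowercase word counts (assumption, not a real model).
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_distance(a: str, b: str) -> float:
    va, vb = bag_of_words(a), bag_of_words(b)
    dot = sum(va[word] * vb[word] for word in va)
    norm_a = math.sqrt(sum(count * count for count in va.values()))
    norm_b = math.sqrt(sum(count * count for count in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat empty text as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def select_chunks(
    points: List[str],
    chunks: List[str],
    threshold: float = 0.4,
    num_chunks_per_point: int = 1,
) -> List[str]:
    selected: List[str] = []
    for point in points:
        # Step 2: rank chunks by distance to the point.
        ranked = sorted(chunks, key=lambda chunk: cosine_distance(point, chunk))
        # Step 3: keep the closest ones only if they are under the threshold.
        for chunk in ranked[:num_chunks_per_point]:
            if cosine_distance(point, chunk) < threshold and chunk not in selected:
                selected.append(chunk)
    return selected
```

A point with no close chunk contributes nothing to the context, which is how a point the LLM invented in step 1 would be filtered out.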

## Caveats

The LLM must be able to process the entire document, so the document must fit within the model's maximum context length. The algorithm is therefore intended for standard-length papers that current LLMs can process (context lengths of ~100K tokens).
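
A rough pre-check can flag documents that are likely too long. The sketch below assumes about four characters per token, a common rule of thumb for English text that varies by tokenizer, and a budget of 100,000 tokens taken from the figure above. `estimate_tokens`, `fits_in_context` and `reserved_tokens` are illustrative names, not part of the package.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    # Heuristic estimate only; a real tokenizer gives the exact count.
    return int(len(text) / chars_per_token) + 1


def fits_in_context(
    text: str,
    context_length: int = 100_000,  # assumed budget, in tokens
    reserved_tokens: int = 4_000,   # assumed room for the prompt and the answer
) -> bool:
    return estimate_tokens(text) + reserved_tokens <= context_length
```

For example, `fits_in_context(paper_text)` could be called before step 1 to fail early on a document that is too long.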

Only points that are close to text appearing explicitly in the paper are considered, which could limit the LLM's ability to answer some queries. The verification and fix steps (6 and 7) are intended to mitigate hallucinations when the LLM has too little information to answer a query.

# Example usage

## Setup

- Install Ollama by running the following command in a terminal:

curl -fsSL https://ollama.com/install.sh | sh

- Pull an LLM, for instance gpt-oss:

ollama pull gpt-oss

- Recreate the conda environment from `environment.yml`:

conda env create -f environment.yml --name deepresearch

- Activate the environment:

conda activate deepresearch

## Get a paper text

In [1]:
with open('paper_text.txt', 'r') as fp:
    paper_text = fp.read()

print(paper_text[:500], '\n\n...\n\n', paper_text[-500:])

 Cell Cycle
ISSN: 1538-4101 (Print) 1551-4005 (Online) Journal homepage: www.tandfonline.com/journals/kccy20
Acetic acid eﬀects on aging in budding yeast: Are
they relevant to aging in higher eukaryotes?
William C. Burhans & Martin Weinberger
To cite this article: William C. Burhans & Martin Weinberger (2009) Acetic acid eﬀects on aging
in budding yeast: Are they relevant to aging in higher eukaryotes?, Cell Cycle, 8:14, 2300-2302,
DOI: 10.4161/cc.8.14.8852
To link to this article:  https://doi. 

...

  8. Chu IM, et al. Nat Rev Cancer 2008; 8:253-67.
 9. Beales IL, et al. BMC Cancer 2007; 7:97.
 10. Huang WC, et al. Curr Biol 2008; 18:781-5.
 11. Burhans WC, et al. Nucleic Acids Res 2007; 35:7545-56.
 12. Miyauchi H, et al. EMBO J 2004; 23:212-20.
 13. Nogueira V, et al. Cancer Cell 2008; 14:458-70.
 14. Halazonetis TD, et al. Science 2008; 319:1352-5.
 15. Brunet A, et al. Science 2004; 303:2011-5.
 16. Wang F, et al. Aging Cell 2007; 6:505-14.
 17. Jones RG, et al. Mol Cell 2005; 1

## Run

In [5]:
import main_points_rag


answer, retrieved_chunks = main_points_rag.run(
    query='What mechanisms by which acetic acid contributes to aging in yeast can be extrapolated to human aging?',
    paper_text=paper_text,
    model='gpt-oss',
    num_predict=4000,
    num_tries=5,
    threshold=0.4,
    num_chunks_per_point=1,
    window_size=3,
    max_iterations=3,
    verbose=True,
)

Your task is to extract the most relevant points of information from a paper in order to answer a query. The points must closely correspond to literal excerpts from the paper. Both the paper and the query are given below:

<paper>
 Cell Cycle
ISSN: 1538-4101 (Print) 1551-4005 (Online) Journal homepage: www.tandfonline.com/journals/kccy20
Acetic acid eﬀects on aging in budding yeast: Are
they relevant to aging in higher eukaryotes?
William C. Burhans & Martin Weinberger
To cite this article: Will 

...

  be extrapolated to human aging?
</query>

Provide your answer in the following format:

<thoughts>
Your reasoning on what pieces of information are needed to answer the query. Are any of those pieces of information present in the paper? Which ones?
</thoughts>

<points>
A list of one-sentence distinct points of information from the paper (if any) that are relevant to the query. In as much as possible, the points must be independent from each other. Example:
- point 1.
- point 2.
...
</
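
A response in this format can be parsed with a regular expression. This is a sketch under the assumption that the model closes the `<points>` tag and writes one point per `- ` bullet. `parse_points` is a hypothetical helper, not a function of `main_points_rag`.

```python
import re
from typing import List


def parse_points(response: str) -> List[str]:
    # Grab everything between the (assumed) <points> ... </points> tags.
    match = re.search(r"<points>(.*?)</points>", response, flags=re.DOTALL)
    if match is None:
        return []  # no well-formed points section in the response
    points: List[str] = []
    for line in match.group(1).splitlines():
        line = line.strip()
        if line.startswith("- "):
            points.append(line[2:].strip())
    return points
```

Returning an empty list when the tags are missing lets the caller decide whether to retry the request.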

## Retrieved chunks

In [6]:
for chunk in retrieved_chunks: print(chunk, '\n\n')

Abrogation of this checkpoint leads to growth arrest in S phase followed by apoptosis. 17 In summary, Burtner et al. raise an important question about the relevance of acetic acid effects in the yeast chronological aging model to aging in higher eukaryotes. indicate that in yeast, accumulation of acetic acid in stationary phase cultures stimulates highly conserved growth signaling pathways and increases oxidative stress and replication stress, all of which have been implicated in aging and/or age-related diseases in more complex organisms. Low pH also stimulates growth signaling pathways in mammals. Although the reduced production of acetic acid identified by Burtner et al. The remarkable parallels between regulation of chronological aging in yeast and of aging in more complex organisms suggest that conserved growth signaling pathways impact aging in all eukaryotes via dual effects on oxida- tive and replication stress. 


This conclusion is supported by the recent discovery of a disti

## Answer

In [7]:
print(answer)

In yeast, the accumulation of acetic acid during stationary phase activates conserved growth‑signaling pathways (e.g., Sch9/Sic1, the AKT/RAS/cAMP axis), drives cells into S‑phase even when nutrients are scarce, and thereby creates **replication stress**.  Acetic acid also lowers extracellular pH, which in yeast and in mammals stimulates mitochondrial signaling and increases **oxidative stress**.  These two stresses—replication and oxidative—are known to accelerate cellular senescence and apoptosis in yeast.

Because the same signaling molecules (AKT, RAS, cAMP, IGF‑1 pathways) and stress responses are conserved in mammals, the mechanisms observed in yeast can be extrapolated to human aging in the following ways:

1. **Chronic low‑pH or acetic‑acid‑like metabolic states** can continuously stimulate growth‑signaling pathways in human cells, promoting cell cycle entry and proliferation in an environment that may lack sufficient nutrients or DNA‑repair capacity, leading to replication str

The answer can be checked directly against the retrieved paper chunks, which are returned alongside it.