# Setup
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pprados/langchain-references/blob/master/langchain_reference.ipynb)


In [1]:
!python -m pip -q install --upgrade pip

In [2]:
# Document loading, retrieval methods and text splitting
%pip install -qU wikipedia

%pip install -qU langchain-references
%pip install -qU langchain-community
%pip install -qU langchain-text-splitters

# Local vector store via Chroma
%pip install -qU langchain-chroma

# Inference and embeddings
%pip install -qU langchain-openai

Note: you may need to restart the kernel to use updated packages.


In [3]:
import langchain_references

langchain_references.__version__

'0.2.35'

# Document loading, retrieval methods and text splitting
Load documents from Wikipedia. Each document is truncated to 2,000 characters (`doc_content_chars_max`), so no further splitting is needed here.

In [4]:
import os
from pprint import pprint

os.environ["USER_AGENT"] = "langchain-references"

from langchain_community.retrievers import WikipediaRetriever

documents = WikipediaRetriever(
    top_k_results=10,
    doc_content_chars_max=2000,
).invoke("mathematic")






In [5]:
pprint([(doc.metadata["title"], doc.metadata["source"]) for doc in documents])

[('Mathematics', 'https://en.wikipedia.org/wiki/Mathematics'),
 ('Mathematical analysis',
  'https://en.wikipedia.org/wiki/Mathematical_analysis'),
 ('History of mathematics',
  'https://en.wikipedia.org/wiki/History_of_mathematics'),
 ('Indian mathematics', 'https://en.wikipedia.org/wiki/Indian_mathematics'),
 ('Group (mathematics)', 'https://en.wikipedia.org/wiki/Group_(mathematics)'),
 ('Limit (mathematics)', 'https://en.wikipedia.org/wiki/Limit_(mathematics)'),
 ('Subtraction', 'https://en.wikipedia.org/wiki/Subtraction'),
 ('Applied mathematics', 'https://en.wikipedia.org/wiki/Applied_mathematics')]


In [6]:
import os
from getpass import getpass

import dotenv

dotenv.load_dotenv()

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass()

In [7]:
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
model = ChatOpenAI(model="gpt-4o-mini")

Load the documents into a vector store.

In [8]:
from langchain_chroma import Chroma

vectorstore = Chroma.from_documents(documents=documents,
                                    embedding=embeddings,
                                    )

Combine the documents into a single context string, tagging each document with a unique numeric identifier.

In [9]:
def format_docs(docs):
    # return "\n\n".join(doc.page_content for doc in docs)
    return "\n".join(
        # Add a document id so that LLM can reference it 
        [f"<document id={i + 1}>\n{doc.page_content}\n</document>\n" for i, doc in
         enumerate(docs)]
    )
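
As a quick check of the layout the model will see, here is a minimal sketch that runs the same formatting logic on two placeholder documents. `FakeDoc` and its sample texts are hypothetical stand-ins for LangChain documents, used only so the snippet runs without any dependency.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    # Hypothetical stand-in for a LangChain Document; only page_content is used.
    page_content: str


def format_docs(docs):
    # Same logic as the cell above: ids start at 1 and follow the list order.
    return "\n".join(
        f"<document id={i + 1}>\n{doc.page_content}\n</document>\n"
        for i, doc in enumerate(docs)
    )


formatted = format_docs(
    [FakeDoc("Mathematics studies numbers."), FakeDoc("Subtraction removes objects.")]
)
print(formatted)
```

The id is the 1-based position in the list, so the order of the documents passed to the chain determines which number the model cites.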


# Manage references with langchain-references

Create a prompt with `{format_references}`, `{context}` and `{question}` placeholders.

In [10]:
# Specific format for references.
from langchain_references import FORMAT_REFERENCES

pprint(FORMAT_REFERENCES)

('When referencing the documents, add a citation right after. Use '
 '"【ID_NUMBER†source】" for the citation (e.g. "The Space Needle is in Seattle '
 '【1†source】【2†source】.").')
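
To make the marker format concrete, the following sketch extracts the cited document ids from a piece of text with a regular expression. It is an illustration only and not necessarily how `langchain-references` parses the markers; the `CITATION` pattern and the `cited_ids` helper are assumptions introduced for this example.

```python
import re

# Pattern for the citation marker described by FORMAT_REFERENCES.
CITATION = re.compile(r"【(\d+)†source】")


def cited_ids(text):
    """Return the document ids cited in `text`, in order of first appearance."""
    seen = []
    for match in CITATION.finditer(text):
        doc_id = int(match.group(1))
        if doc_id not in seen:
            seen.append(doc_id)
    return seen


sample = "The Space Needle is in Seattle 【1†source】【2†source】. It is a landmark 【1†source】."
print(cited_ids(sample))  # [1, 2]
```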


In [11]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

rag_prompt = ChatPromptTemplate.from_template(
    """
You are an assistant for question-answering tasks. Use the following pieces of retrieved documents to answer the question.
If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.

{format_references}

<documents>
{context}
</documents>

Answer the following question:

{question}""",  # noqa
    partial_variables={"format_references": FORMAT_REFERENCES})

Build the `context` step, which adds the formatted documents to the chain input under the `context` key.

In [12]:
context = RunnablePassthrough.assign(
    context=lambda input: format_docs(input["documents"]),
)


Create a chain with the `context`, `rag_prompt` and `model`.

In [13]:
from langchain_core.output_parsers import StrOutputParser

# Build the chain without `manage_references()`
chain = (
        context
        | rag_prompt
        | model
)

Select documents similar to the question.

In [28]:
question = "What is the difference aspect of mathematics, cite 2 aspects?"

docs = vectorstore.similarity_search(question, k=6)
pprint([(d.metadata["title"], d.metadata["source"]) for d in docs])

[('Mathematics', 'https://en.wikipedia.org/wiki/Mathematics'),
 ('History of mathematics',
  'https://en.wikipedia.org/wiki/History_of_mathematics'),
 ('Applied mathematics', 'https://en.wikipedia.org/wiki/Applied_mathematics'),
 ('Indian mathematics', 'https://en.wikipedia.org/wiki/Indian_mathematics'),
 ('Mathematical analysis',
  'https://en.wikipedia.org/wiki/Mathematical_analysis'),
 ('Subtraction', 'https://en.wikipedia.org/wiki/Subtraction')]


Invoke the chain with the `documents` and `question`, but without `manage_references()`. The answer contains raw `【1†source】` citation markers.

In [29]:
answer = (chain | StrOutputParser()).invoke({"documents": docs, "question": question})
answer

'The difference aspect of mathematics can be understood through the lens of subtraction, which represents the removal of objects from a collection, and proof, which is a critical method for establishing the truth of mathematical statements. Subtraction can be applied to numbers, fractions, and other mathematical entities, highlighting its versatility, while proof is essential for validating assumptions and discovering theorems within both pure and applied mathematics【1†source】【6†source】.'

Invoke the chain with the `documents` and `question` with `manage_references()`.

In [30]:
from langchain_references import manage_references

managed_chain = context | manage_references(rag_prompt | model)

In [31]:

answer = (managed_chain | StrOutputParser()).invoke(
    {"documents": docs, "question": question})
pprint(answer)

('The difference aspect of mathematics involves operations such as '
 'subtraction, which signifies the removal of objects from a collection, and '
 'introduces concepts like negative numbers and abstract quantities<a '
 'href="#fn_1" id="1">[1]</a></sup><a href="#fn_2" id="2">[2]</a></sup>. '
 'Additionally, it is relevant to operations in algebra and analysis, where it '
 'is crucial for understanding relationships between quantities and solving '
 'equations<a href="#fn_1" id="1">[1]</a></sup><a href="#fn_3" '
 'id="3">[3]</a></sup>.\n'
 '\n'
 '<sup id="fn1" style="font-size: 0.7em;">1.</a> '
 '[Mathematics](https://en.wikipedia.org/wiki/Mathematics)</sup></small>  \n'
 '<sup id="fn2" style="font-size: 0.7em;">2.</a> '
 '[Subtraction](https://en.wikipedia.org/wiki/Subtraction)</sup></small>  \n'
 '<sup id="fn3" style="font-size: 0.7em;">3.</a> [Mathematical '
 'analysis](https://en.wikipedia.org/wiki/Mathematical_analysis)</sup></small>  \n')


In [32]:
# Print in markdown format
from IPython.display import Markdown, display

display(Markdown(answer))

The difference aspect of mathematics involves operations such as subtraction, which signifies the removal of objects from a collection, and introduces concepts like negative numbers and abstract quantities<a href="#fn_1" id="1">[1]</a></sup><a href="#fn_2" id="2">[2]</a></sup>. Additionally, it is relevant to operations in algebra and analysis, where it is crucial for understanding relationships between quantities and solving equations<a href="#fn_1" id="1">[1]</a></sup><a href="#fn_3" id="3">[3]</a></sup>.

<sup id="fn1" style="font-size: 0.7em;">1.</a> [Mathematics](https://en.wikipedia.org/wiki/Mathematics)</sup></small>  
<sup id="fn2" style="font-size: 0.7em;">2.</a> [Subtraction](https://en.wikipedia.org/wiki/Subtraction)</sup></small>  
<sup id="fn3" style="font-size: 0.7em;">3.</a> [Mathematical analysis](https://en.wikipedia.org/wiki/Mathematical_analysis)</sup></small>  
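
The outputs above suggest that the raw document ids are renumbered in order of first appearance and that a source list is appended. The sketch below reproduces that idea in plain Python, under the assumption that marker ids are 1-based positions in the document list. The `renumber` helper and its plain-text output format are hypothetical and do not reflect the library's actual implementation or its HTML output.

```python
import re

CITATION = re.compile(r"【(\d+)†source】")


def renumber(text, docs):
    """Replace raw markers with [n] in order of first appearance, then list sources.

    `docs` is a list of (title, url) tuples; marker ids are 1-based positions in it.
    """
    order = {}  # raw document id -> displayed reference number

    def substitute(match):
        doc_id = int(match.group(1))
        if not 1 <= doc_id <= len(docs):
            return ""  # drop markers that point at no document
        number = order.setdefault(doc_id, len(order) + 1)
        return f"[{number}]"

    body = CITATION.sub(substitute, text)
    notes = [f"{n}. [{docs[i - 1][0]}]({docs[i - 1][1]})" for i, n in order.items()]
    return body + "\n\n" + "\n".join(notes)


docs_info = [
    ("Mathematics", "https://en.wikipedia.org/wiki/Mathematics"),
    ("History of mathematics", "https://en.wikipedia.org/wiki/History_of_mathematics"),
    ("Subtraction", "https://en.wikipedia.org/wiki/Subtraction"),
]
result = renumber("Subtraction removes objects 【3†source】【1†source】.", docs_info)
print(result)
```

Here document 3 is cited first, so it is displayed as `[1]`, and document 2 is never cited, so it does not appear in the source list.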


# Chain with retriever
The previous chain requires the list of documents to be retrieved before the chain is invoked.
It is also possible to plug in a retriever, so that retrieval and generation happen in a single step.

In [33]:
from operator import itemgetter

from langchain_core.runnables import RunnableParallel

retriever = vectorstore.as_retriever(search_kwargs={"k": 6})
context = (
    RunnableParallel(
        # Get list of documents, necessary for reference analysis
        documents=(itemgetter("question") | retriever),
        # and question
        question=itemgetter("question"),
    ).assign(
        context=lambda input: format_docs(input["documents"]),
        format_references=lambda _: FORMAT_REFERENCES,
    )
)

In [34]:
context.invoke({"question": question}).keys()

dict_keys(['documents', 'question', 'context', 'format_references'])
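
For readers less familiar with LCEL, the data flow of this `context` step can be approximated in plain Python. This is only an analogy, assuming that `RunnableParallel` builds a dict from the same input and that `.assign` adds keys computed from that dict. The `build_context` function, the lambda retriever and the placeholder strings are all hypothetical.

```python
def build_context(inputs, retrieve, format_docs, format_references):
    # Step 1 (RunnableParallel): derive independent values from the same input.
    state = {
        "documents": retrieve(inputs["question"]),
        "question": inputs["question"],
    }
    # Step 2 (.assign): add keys computed from the state built so far.
    state["context"] = format_docs(state["documents"])
    state["format_references"] = format_references
    return state


state = build_context(
    {"question": "What is subtraction?"},
    retrieve=lambda q: ["doc about " + q],     # hypothetical retriever
    format_docs=lambda docs: "\n".join(docs),  # simplified formatter
    format_references="Cite sources.",         # placeholder instruction
)
print(list(state))  # ['documents', 'question', 'context', 'format_references']
```

Keeping `documents` in the state matters because the reference step needs the original documents, not just the formatted string, to resolve each cited id.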

In [35]:
chain = (
        context
        | rag_prompt
        | model
)

In [36]:
pprint((chain | StrOutputParser()).invoke({"question": question}))

('Subtraction is one of the four fundamental arithmetic operations, '
 'representing the removal of objects from a collection, and it is considered '
 'the inverse of addition. It is characterized by properties such as being '
 'anticommutative, meaning the order of the numbers affects the sign of the '
 'result, and it is not associative, indicating that the order of operations '
 'matters when subtracting more than two numbers【6†source】. Additionally, '
 'subtraction can apply to various types of numbers, including natural '
 'numbers, negative numbers, and fractions【6†source】.')


In [37]:
answer = (context | manage_references(rag_prompt | model) | StrOutputParser()).invoke(
    {"question": question})
pprint(answer)

('The difference aspect of mathematics includes subtraction, which is the '
 'operation representing the removal of objects from a collection, and the '
 'study of continuous functions through analysis, which deals with limits, '
 'differentiation, and integration<a href="#fn_1" id="1">[1]</a></sup><a '
 'href="#fn_2" id="2">[2]</a></sup>.\n'
 '\n'
 '<sup id="fn1" style="font-size: 0.7em;">1.</a> '
 '[Mathematics](https://en.wikipedia.org/wiki/Mathematics)</sup></small>  \n'
 '<sup id="fn2" style="font-size: 0.7em;">2.</a> [Mathematical '
 'analysis](https://en.wikipedia.org/wiki/Mathematical_analysis)</sup></small>  \n')


In [38]:
display(Markdown(answer))

The difference aspect of mathematics includes subtraction, which is the operation representing the removal of objects from a collection, and the study of continuous functions through analysis, which deals with limits, differentiation, and integration<a href="#fn_1" id="1">[1]</a></sup><a href="#fn_2" id="2">[2]</a></sup>.

<sup id="fn1" style="font-size: 0.7em;">1.</a> [Mathematics](https://en.wikipedia.org/wiki/Mathematics)</sup></small>  
<sup id="fn2" style="font-size: 0.7em;">2.</a> [Mathematical analysis](https://en.wikipedia.org/wiki/Mathematical_analysis)</sup></small>  


## Customize the references format
It's possible to customize the reference format and to drop some references entirely. For example, a reference can be removed when its chunk has no anchor inside the source document and that document has too many pages for the citation to be useful.

In [40]:
from typing import List, Optional, Tuple

from langchain_core.documents.base import BaseMedia

from langchain_references import ReferenceStyle


class MyReferenceStyle(ReferenceStyle):
    # If the document has no "header 1" and total_pages > 3, remove the reference
    source_id_key = "source"
    chunk_anchor_key = "header 1"
    total_pages_key = "total_pages"
    max_total_pages = 3

    def format_reference(self, ref: int, media: BaseMedia) -> Optional[str]:
        if media.metadata.get(self.chunk_anchor_key) is None:
            # The chunk has no specific anchor in its document.
            # If the document is too long, remove the reference.
            get_total_pages = self._get_key_assigner(self.total_pages_key)
            total_pages = get_total_pages(media)
            if total_pages and total_pages > self.max_total_pages:
                return None
        return f"【{ref}】"

    def format_all_references(self, refs: List[Tuple[int, BaseMedia]]) -> str:
        if not refs:
            return ""
        result = []
        for ref, media in refs:
            source = media.metadata[self.source_id_key]
            if media.metadata.get(self.chunk_anchor_key):
                # Append the chunk anchor to the source URL
                source += "#" + media.metadata[self.chunk_anchor_key]
            if "title" in media.metadata:
                result.append(f"- **[{ref}]** {media.metadata['title']} ({source})\n")
            else:
                result.append(f"- **[{ref}]** {source}\n")
        if not result:
            return ""
        return "\n\n" + "".join(result)


answer = (context | manage_references(
    rag_prompt | model,
    style=MyReferenceStyle()) | StrOutputParser()).invoke({"question": question})
display(Markdown(answer))

The difference aspect of mathematics refers to the operations used to find how much one quantity differs from another, exemplified by subtraction. Two key aspects of mathematics involved in this are arithmetic operations, such as addition and subtraction, and abstract algebra, which studies these operations and their properties【1】【2】.

- **[1]** Subtraction (https://en.wikipedia.org/wiki/Subtraction)
- **[2]** Mathematics (https://en.wikipedia.org/wiki/Mathematics)
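
The removal rule used in `format_reference` can be checked in isolation. The sketch below mirrors it on plain metadata dicts; `keep_reference` is a hypothetical helper, and it reads `total_pages` directly with `dict.get`, which is an assumption that may differ from what `_get_key_assigner` does in the library.

```python
def keep_reference(metadata, anchor_key="header 1", pages_key="total_pages", max_pages=3):
    """Return True when a citation to this chunk should be kept."""
    if metadata.get(anchor_key) is not None:
        return True  # the chunk has an anchor, so the citation is precise
    total_pages = metadata.get(pages_key)
    # Without an anchor, keep the citation only for short documents.
    return not (total_pages and total_pages > max_pages)


print(keep_reference({"header 1": "Intro", "total_pages": 50}))  # True
print(keep_reference({"total_pages": 50}))                       # False
print(keep_reference({"total_pages": 2}))                        # True
print(keep_reference({}))                                        # True
```

The Wikipedia documents used in this notebook carry no `total_pages` metadata, which would explain why both references are kept in the answer above.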
