## Lost in the middle: The problem with long contexts

Regardless of model architecture, performance degrades substantially once you include 10 or more retrieved documents.<br>
In brief: when the relevant information sits in the middle of a long context, models tend to ignore the provided documents. <br>
See: https://arxiv.org/abs/2307.03172

<br>To mitigate this, you can reorder the documents after retrieval so that the most relevant ones sit at the beginning and end of the context.

In [4]:
import os
import chromadb
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.document_transformers import LongContextReorder
from langchain.chains import StuffDocumentsChain, LLMChain
from langchain.prompts import PromptTemplate
from langchain.llms import OpenAI

In [3]:
os.environ['OPENAI_API_KEY'] = 'sk-'

In [6]:
# Get embeddings.
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

In [7]:
texts = [
    "Basquetball is a great sport.",
    "Fly me to the moon is one of my favourite songs.",
    "The Celtics are my favourite team.",
    "This is a document about the Boston Celtics",
    "I simply love going to the movies",
    "The Boston Celtics won the game by 20 points",
    "This is just a random text.",
    "Elden Ring is one of the best games in the last 15 years.",
    "L. Kornet is one of the best Celtics players.",
    "Larry Bird was an iconic NBA player.",
]

In [8]:
# Create a retriever
retriever = Chroma.from_texts(texts, embedding=embeddings).as_retriever(
    search_kwargs={"k": 10}
)

In [9]:
query = "What can you tell me about the Celtics?"

In [10]:
# Get relevant documents ordered by relevance score
docs = retriever.get_relevant_documents(query)
docs

[Document(page_content='This is a document about the Boston Celtics'),
 Document(page_content='The Celtics are my favourite team.'),
 Document(page_content='L. Kornet is one of the best Celtics players.'),
 Document(page_content='The Boston Celtics won the game by 20 points'),
 Document(page_content='Larry Bird was an iconic NBA player.'),
 Document(page_content='Elden Ring is one of the best games in the last 15 years.'),
 Document(page_content='Basquetball is a great sport.'),
 Document(page_content='I simply love going to the movies'),
 Document(page_content='Fly me to the moon is one of my favourite songs.'),
 Document(page_content='This is just a random text.')]

> Reorder the documents:<br>
> Less relevant documents are placed in the middle of the list, and more relevant ones at the beginning and end.<br>
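
The reordering idea can be sketched in plain Python. This is a minimal illustration, not necessarily how `LongContextReorder` is implemented internally; the function name `reorder_for_long_context` is hypothetical, and integers stand in for documents ranked from most relevant (0) to least relevant (9).

```python
def reorder_for_long_context(items):
    """Place the most relevant items at both ends, the least relevant in the middle.

    `items` must be sorted from most to least relevant.
    """
    reordered = []
    # Walk from least to most relevant, alternating between the two ends,
    # so that later (more relevant) items land closest to the edges.
    for i, item in enumerate(reversed(items)):
        if i % 2 == 1:
            reordered.append(item)
        else:
            reordered.insert(0, item)
    return reordered


ranks = list(range(10))  # 0 = most relevant, 9 = least relevant
print(reorder_for_long_context(ranks))
# → [1, 3, 5, 7, 9, 8, 6, 4, 2, 0]
```

With ten inputs, the four best-ranked items (0 to 3) end up in the first two and last two positions.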

In [13]:
reordering = LongContextReorder()

In [14]:
reordered_docs = reordering.transform_documents(docs)

> Confirm that the 4 relevant documents (the ones that mention the Celtics) are at the beginning and end of the list. <br>

In [15]:
reordered_docs

[Document(page_content='The Celtics are my favourite team.'),
 Document(page_content='The Boston Celtics won the game by 20 points'),
 Document(page_content='Elden Ring is one of the best games in the last 15 years.'),
 Document(page_content='I simply love going to the movies'),
 Document(page_content='This is just a random text.'),
 Document(page_content='Fly me to the moon is one of my favourite songs.'),
 Document(page_content='Basquetball is a great sport.'),
 Document(page_content='Larry Bird was an iconic NBA player.'),
 Document(page_content='L. Kornet is one of the best Celtics players.'),
 Document(page_content='This is a document about the Boston Celtics')]
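
The same confirmation can be done programmatically instead of by eye. The sketch below copies the page contents from the output above into a plain list and checks where the Celtics documents landed; the helper name `keyword_positions` and the choice of "first two or last two slots" as the definition of an edge are assumptions made for illustration.

```python
# Page contents copied from the reordered output above.
reordered_texts = [
    "The Celtics are my favourite team.",
    "The Boston Celtics won the game by 20 points",
    "Elden Ring is one of the best games in the last 15 years.",
    "I simply love going to the movies",
    "This is just a random text.",
    "Fly me to the moon is one of my favourite songs.",
    "Basquetball is a great sport.",
    "Larry Bird was an iconic NBA player.",
    "L. Kornet is one of the best Celtics players.",
    "This is a document about the Boston Celtics",
]


def keyword_positions(texts, keyword):
    """Return the indices of the texts that mention `keyword` (case-insensitive)."""
    return [i for i, text in enumerate(texts) if keyword.lower() in text.lower()]


positions = keyword_positions(reordered_texts, "Celtics")
# Treat the first two and last two slots as the "edges" of the context.
n = len(reordered_texts)
edges = {0, 1, n - 2, n - 1}

print(positions)               # → [0, 1, 8, 9]
print(set(positions) <= edges)  # → True
```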

## Prepare and run a custom Stuff chain with the reordered docs as context

In [16]:
# Override prompts
document_prompt = PromptTemplate(
    input_variables=["page_content"], template="{page_content}"
)

In [17]:
document_variable_name = "context"

In [18]:
llm = OpenAI()

In [19]:
stuff_prompt_override = """Given these text extracts:
-----
{context}
-----
Please answer the following question:
{query}"""

In [20]:
prompt = PromptTemplate(
    template=stuff_prompt_override, input_variables=["context", "query"]
)

In [21]:
# Instantiate the chain
llm_chain = LLMChain(llm=llm, prompt=prompt)

In [22]:
chain = StuffDocumentsChain(
    llm_chain=llm_chain,
    document_prompt=document_prompt,
    document_variable_name=document_variable_name,
)
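
Conceptually, a "stuff" chain renders each document with `document_prompt`, joins the results into one string, and substitutes that string for the `document_variable_name` placeholder (`context` here) in the final prompt. The sketch below mimics this with plain string formatting. It is an illustration under assumptions, not the library's code: the function name `build_stuffed_prompt`, the blank-line separator, and the shortened template are all hypothetical stand-ins.

```python
def build_stuffed_prompt(page_contents, query, separator="\n\n"):
    """Join the documents into one context string and fill the prompt template."""
    # Hypothetical, simplified stand-in for the prompt template used in this notebook.
    template = (
        "-----\n{context}\n-----\n"
        "Please answer the following question:\n{query}"
    )
    # The document prompt here is just "{page_content}", so each text is used as-is.
    context = separator.join(page_contents)
    return template.format(context=context, query=query)


stuffed = build_stuffed_prompt(
    [
        "The Celtics are my favourite team.",
        "This is a document about the Boston Celtics",
    ],
    "What can you tell me about the Celtics?",
)
print(stuffed)
```

Because every document ends up in a single prompt, the order of `page_contents` directly determines where each document sits in the model's context, which is why the reordering step matters.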

In [24]:
res = chain.run(input_documents=reordered_docs, query=query)

In [25]:
print(res)



The Celtics are a professional basketball team based in Boston, Massachusetts. They have won the NBA championship 17 times, the most recent being in 2018. They have had many iconic players throughout their history, such as Larry Bird and L. Kornet. They have recently won a game by 20 points. They are a popular team with many fans around the world.
