# Long-Context Reorder
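- Regardless of model architecture, performance degrades once 10 or more retrieved documents are included in the context.
- When the relevant information sits in the middle of a long context, models tend to overlook the provided documents.
- Reference: https://arxiv.org/abs/2307.03172
- To avoid this, reorder the documents after retrieval to limit the performance degradation.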

In [9]:
%pip install --upgrade --quiet sentence-transformers langchain-chroma langchain langchain-community langchain-openai python-dotenv > /dev/null

Note: you may need to restart the kernel to use updated packages.


In [10]:
from langchain.chains import LLMChain, StuffDocumentsChain
from langchain_chroma import Chroma
from langchain_community.document_transformers import (
    LongContextReorder,
)
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_core.prompts import PromptTemplate
from langchain_openai import OpenAI
from dotenv import load_dotenv
load_dotenv('../dot.env')

# Get embeddings.
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

texts = [
    "Basketball is a great sport.",
    "Fly me to the moon is one of my favourite songs.",
    "The Celtics are my favourite team.",
    "This is a document about the Boston Celtics",
    "I simply love going to the movies",
    "The Boston Celtics won the game by 20 points",
    "This is just a random text.",
    "Elden Ring is one of the best games in the last 15 years.",
    "L. Kornet is one of the best Celtics players.",
    "Larry Bird was an iconic NBA player.",
]

# Create a retriever
retriever = Chroma.from_texts(texts, embedding=embeddings).as_retriever(
    search_kwargs={"k": 10}
)
query = "What can you tell me about the Celtics?"

# Get relevant documents ordered by relevance score
docs = retriever.invoke(query)
docs

[Document(page_content='This is a document about the Boston Celtics'),
 Document(page_content='This is a document about the Boston Celtics'),
 Document(page_content='This is a document about the Boston Celtics'),
 Document(page_content='The Celtics are my favourite team.'),
 Document(page_content='The Celtics are my favourite team.'),
 Document(page_content='The Celtics are my favourite team.'),
 Document(page_content='L. Kornet is one of the best Celtics players.'),
 Document(page_content='L. Kornet is one of the best Celtics players.'),
 Document(page_content='L. Kornet is one of the best Celtics players.'),
 Document(page_content='The Boston Celtics won the game by 20 points')]
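Each text appears three times in the output above. A likely cause (an assumption, not confirmed by the notebook) is that the cell was re-run and the same texts were added again to the same in-memory Chroma collection. One way to work around it is to drop duplicates before reordering. The sketch below uses a hypothetical `Doc` stand-in instead of the real `Document` class so that it runs without any dependencies; `dedupe_by_content` is an illustrative helper, not a LangChain function.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document (page_content only)."""
    page_content: str


def dedupe_by_content(docs):
    """Keep the first occurrence of each page_content, preserving order."""
    seen = set()
    unique = []
    for doc in docs:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            unique.append(doc)
    return unique


retrieved = [
    Doc("This is a document about the Boston Celtics"),
    Doc("This is a document about the Boston Celtics"),
    Doc("The Celtics are my favourite team."),
    Doc("This is a document about the Boston Celtics"),
    Doc("The Boston Celtics won the game by 20 points"),
]
unique_docs = dedupe_by_content(retrieved)
for doc in unique_docs:
    print(doc.page_content)
```

Because the relevance order is preserved, the deduplicated list can be passed to the reordering step unchanged.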

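# Less relevant documents move to the middle; more relevant ones move to the beginning and end
- Why? Is this ordering easier for the LLM to pick up? The paper cited above reports that models use information best when it appears at the beginning or end of the input context.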

In [11]:
# Reorder the documents:
# Less relevant documents will be in the middle of the list and more
# relevant documents at the beginning / end.
reordering = LongContextReorder()
reordered_docs = reordering.transform_documents(docs)

# Confirm that the most relevant documents are at the beginning and end.
reordered_docs

[Document(page_content='This is a document about the Boston Celtics'),
 Document(page_content='The Celtics are my favourite team.'),
 Document(page_content='The Celtics are my favourite team.'),
 Document(page_content='L. Kornet is one of the best Celtics players.'),
 Document(page_content='The Boston Celtics won the game by 20 points'),
 Document(page_content='L. Kornet is one of the best Celtics players.'),
 Document(page_content='L. Kornet is one of the best Celtics players.'),
 Document(page_content='The Celtics are my favourite team.'),
 Document(page_content='This is a document about the Boston Celtics'),
 Document(page_content='This is a document about the Boston Celtics')]
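To make the ordering above easier to follow, here is a minimal sketch of one interleaving scheme that reproduces it. It is an assumption about the behavior, not the library's actual code, and `LongContextReorder` may be implemented differently across versions. The function name `reorder_for_long_context` is hypothetical. Integers stand in for documents, with 1 being the most relevant.

```python
def reorder_for_long_context(items):
    """Place high-ranked items at both ends and low-ranked ones in the middle.

    `items` is assumed to be sorted from most to least relevant.
    """
    reordered = []
    # Walk from least to most relevant, alternately pushing items to the
    # front and the back, so the last (most relevant) items end up outermost.
    for i, item in enumerate(reversed(items)):
        if i % 2 == 1:
            reordered.append(item)
        else:
            reordered.insert(0, item)
    return reordered


ranks = list(range(1, 11))  # 1 = most relevant, 10 = least relevant
result = reorder_for_long_context(ranks)
print(result)  # [2, 4, 6, 8, 10, 9, 7, 5, 3, 1]
```

Applied to the ten retrieved documents, this places the three top-ranked copies of "This is a document about the Boston Celtics" at positions 1, 9, and 10, and the lowest-ranked document in the middle, matching the output shown above.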

In [12]:
# We prepare and run a custom Stuff chain with reordered docs as context.

# Override prompts
document_prompt = PromptTemplate(
    input_variables=["page_content"], template="{page_content}"
)
document_variable_name = "context"
llm = OpenAI()

stuff_prompt_override = """Given these text extracts:
-----
{context}
-----
Please answer the following question in Korean:
{query}"""

prompt = PromptTemplate(
    template=stuff_prompt_override, input_variables=["context", "query"]
)

# Instantiate the chain
llm_chain = LLMChain(llm=llm, prompt=prompt)  # this chain uses stuff_prompt_override
chain = StuffDocumentsChain(
    llm_chain=llm_chain,
    document_prompt=document_prompt,
    document_variable_name=document_variable_name,
)
print(chain.run(input_documents=reordered_docs, query=query))



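당신에게 Celtics에 대해 말할 수 있는 내용은 무엇인가요?

- Note: this output is only the query rendered in Korean ("What can you tell me about the Celtics?"); the model restated the question instead of answering it from the extracts.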
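To see what the LLM actually receives, it can help to build the "stuffed" prompt by hand. The sketch below is a dependency-free approximation of the chain above, not its implementation: `build_stuff_prompt` is a hypothetical helper, the template is modeled on the one in the cell above, and the `"\n\n"` separator between documents is an assumption about the chain's default.

```python
STUFF_TEMPLATE = """Given these text extracts:
-----
{context}
-----
Please answer the following question in Korean:
{query}"""


def build_stuff_prompt(page_contents, query, separator="\n\n"):
    """Join all document texts into one context and fill in the template."""
    context = separator.join(page_contents)
    return STUFF_TEMPLATE.format(context=context, query=query)


query_text = "What can you tell me about the Celtics?"
prompt_text = build_stuff_prompt(
    [
        "This is a document about the Boston Celtics",
        "The Boston Celtics won the game by 20 points",
        "The Celtics are my favourite team.",
    ],
    query_text,
)
print(prompt_text)
```

Printing the prompt this way makes it easy to check where each document lands in the context and whether the instruction and question are phrased clearly enough for the model.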
