<a href="https://colab.research.google.com/github/sugarforever/LlamaIndex-Tutorials/blob/main/llamaindex_llmlingua_prompt_compression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Prompt Compression with LlamaIndex and LlmLingua

In `RAG` (Retrieval-Augmented Generation), a user query is typically sent to a vector store to fetch the most similar pieces of information. The retrieved context is then added to the prompt, so the input tokens usually consume the most resources: depending on the documents retrieved for the user's query, the prompt (input) can reach thousands of tokens.

`Prompt compression` is a technique for shortening the original prompt while keeping the most important information, which also speeds up the language model's response generation.

The underlying idea is that natural language often includes unnecessary repetition.

## LLMLingua

https://github.com/microsoft/LLMLingua

LLMLingua utilizes a compact, well-trained language model (e.g., GPT2-small, LLaMA-7B) to identify and remove non-essential tokens in prompts. This approach enables efficient inference with large language models (LLMs), achieving up to 20x compression with minimal performance loss.
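
To make the idea of removing non-essential tokens concrete, here is a toy sketch. It is not LLMLingua's algorithm: as a loudly stated assumption, whitespace-separated words and unigram frequencies taken from the text itself stand in for the small language model, and the function name `prune_low_information_tokens` is hypothetical. The sketch only illustrates the principle of scoring tokens by how predictable they are and dropping the most predictable ones.

```python
import math
from collections import Counter


def prune_low_information_tokens(text: str, keep_ratio: float = 0.6) -> str:
    """Toy compressor: keep the tokens with the highest self-information.

    Assumption: unigram frequencies computed from `text` itself replace the
    small language model that a real compressor such as LLMLingua would use.
    """
    tokens = text.split()
    if not tokens:
        return ""
    counts = Counter(t.lower() for t in tokens)
    total = len(tokens)
    # Self-information -log2 p(token): frequent tokens score low.
    scores = [-math.log2(counts[t.lower()] / total) for t in tokens]
    n_keep = max(1, round(len(tokens) * keep_ratio))
    # Rank token positions by score (stable sort keeps earlier positions first on ties).
    ranked = sorted(range(len(tokens)), key=lambda i: -scores[i])
    # Restore the original word order for the tokens that survive.
    kept = sorted(ranked[:n_keep])
    return " ".join(tokens[i] for i in kept)


sample = "the ban was the result of the disaster and the ban lasted five years"
print(prune_low_information_tokens(sample, keep_ratio=0.7))
```

With this sample the repeated word "the" is the first to be dropped, because it is the most frequent and therefore the least informative token under the unigram assumption.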

In [1]:
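# The imports in this notebook use the pre-0.10 package layout (`from llama_index import ...`),
# so pin llama-index below 0.10.
!pip install "llama-index<0.10" openai -q -U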

In [2]:
import openai

from google.colab import userdata
openai.api_key = userdata.get('OPENAI_API_KEY')

In [3]:
from llama_index import ServiceContext, set_global_service_context
from llama_index.llms import OpenAI
from llama_index.callbacks import CallbackManager, TokenCountingHandler
import tiktoken

OPENAI_MODEL_NAME = "gpt-3.5-turbo-16k"

llm = OpenAI(model=OPENAI_MODEL_NAME)

token_counter = TokenCountingHandler(
    tokenizer=tiktoken.encoding_for_model(OPENAI_MODEL_NAME).encode
)
callback_manager = CallbackManager([token_counter])

service_context = ServiceContext.from_defaults(
    llm=llm, callback_manager=callback_manager
)
set_global_service_context(service_context)

In [4]:
from llama_index import VectorStoreIndex, download_loader

WikipediaReader = download_loader("WikipediaReader")
loader = WikipediaReader()
documents = loader.load_data(pages=['Premier League'])

In [5]:
print(len(documents))

1


In [6]:
retriever = VectorStoreIndex.from_documents(documents).as_retriever(similarity_top_k=3)

In [7]:
question = "Why did English clubs get banned from European competition?"

In [8]:
relevant_documents = retriever.retrieve(question)

In [9]:
contexts = [n.get_content() for n in relevant_documents]
contexts

['== History ==\n\n\n=== Origins ===\nDespite significant European success in the 1970s and early 1980s, the late 1980s marked a low point for English football. Stadiums were deteriorating and supporters endured poor facilities, hooliganism was rife, and English clubs had been banned from European competition for five years following the Heysel Stadium disaster in 1985. The Football League First Division, the top level of English football since 1888, was behind leagues such as Italy\'s Serie A and Spain\'s La Liga in attendances and revenues, and several top English players had moved abroad.By the turn of the 1990s, the downward trend was starting to reverse. At the 1990 FIFA World Cup, England reached the semi-finals; UEFA, European football\'s governing body, lifted the five-year ban on English clubs playing in European competitions in 1990, resulting in Manchester United lifting the Cup Winners\' Cup in 1991. The Taylor Report on stadium safety standards, which proposed expensive up

In [10]:
from llama_index.prompts import PromptTemplate

template = (
    "Given the context information below: \n"
    "---------------------\n"
    "{context_str}"
    "\n---------------------\n"
    "Please answer the question: {query_str}\n"
)

qa_template = PromptTemplate(template)
prompt = qa_template.format(context_str="\n\n".join(contexts), query_str=question)

response = llm.complete(prompt)

In [11]:
response.text

'English clubs were banned from European competition for five years following the Heysel Stadium disaster in 1985.'

In [12]:
print(
    "Embedding Tokens: ",
    token_counter.total_embedding_token_count,
    "\n",
    "LLM Prompt Tokens: ",
    token_counter.prompt_llm_token_count,
    "\n",
    "LLM Completion Tokens: ",
    token_counter.completion_llm_token_count,
    "\n",
    "Total LLM Token Count: ",
    token_counter.total_llm_token_count,
    "\n",
)

Embedding Tokens:  13551 
 LLM Prompt Tokens:  2363 
 LLM Completion Tokens:  22 
 Total LLM Token Count:  2385 



In [13]:
!pip install llmlingua accelerate -q -U

In [14]:
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.response_synthesizers import CompactAndRefine
from llama_index.indices.postprocessor import LongLLMLinguaPostprocessor

node_postprocessor = LongLLMLinguaPostprocessor(
    instruction_str="Given the context, please answer the final question",
    target_token=300,  # token budget for the compressed context
    rank_method="longllmlingua",
    # Extra keyword arguments forwarded to LLMLingua's `compress_prompt`
    additional_compress_kwargs={
        "condition_compare": True,
        "condition_in_question": "after",
        "context_budget": "+100",
        "reorder_context": "sort",  # enable document reordering
        "dynamic_context_compression_ratio": 0.3,
    },
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]



In [15]:
retrieved_nodes = retriever.retrieve(question)
synthesizer = CompactAndRefine()

In [16]:
retrieved_nodes

[NodeWithScore(node=TextNode(id_='6ec9417e-decb-4290-a9bf-62630e844bb4', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='28ebd3bd-9540-4e8a-9bff-6976c3aae981', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='49f783bdfd0ac70d7235553dac9a4b23a646a54a3fa319a87aceffd06b3c1310'), <NodeRelationship.PREVIOUS: '2'>: RelatedNodeInfo(node_id='5f5926b3-c364-4754-b171-a3620c0827d9', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='6a2a1d44721d8e629023dcd71977bcc3bdba201f052ded1ccd93271ba73f4178'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='25597b8b-b1c9-4769-bc2d-ee1bada663af', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='05d3b56d1e30dbad1d491ac9f1dcddeaa4be4f5001d73f5c22c7153de2ede65b')}, hash='38b2bd679b8bd0bcf92b1b878dd1dce93b1e13fb3774cd7ef562e6e1e9236019', text='== History ==\n\n\n=== Origins ===\nDespite significant European success in the 1970

In [17]:
from llama_index.indices.query.schema import QueryBundle

new_retrieved_nodes = node_postprocessor.postprocess_nodes(
    retrieved_nodes, query_bundle=QueryBundle(query_str=question)
)

In [18]:
original_contexts = "\n\n".join([n.get_content() for n in retrieved_nodes])
compressed_contexts = "\n\n".join([n.get_content() for n in new_retrieved_nodes])
# Count tokens with the compressor's own tokenizer.
# Note: `_llm_lingua` is a private attribute of the postprocessor and may change between versions.
original_tokens = node_postprocessor._llm_lingua.get_token_length(original_contexts)
compressed_tokens = node_postprocessor._llm_lingua.get_token_length(compressed_contexts)
print(compressed_contexts)
print()
print("Original Tokens:", original_tokens)
print("Compressed Tokens:", compressed_tokens)
print("Compression Ratio:", f"{original_tokens/(compressed_tokens + 1e-5):.2f}x")


 ===Desite significant European success in the 197 and the late9s marked a low for footballs andters poor facilities,ism wasife, and English had been European in5 The First level of English since, wass A attendues, top English moved abroad.By9 theend was reverse9,s; UEFA European, lifted the five-year on English playing in European competitions in19 in Cupinners1 Report on safety, proposed to stadiums the, in09s, major English begun to commercial club administrationvenue Edwards Unitedolarur and were the this transformation The commercial imperative the clubs seeking to increase their power the away from League in so, to increase their voting and gain more arrangement50 ofship income in18. They demanded companies for of football re from in receivedyear in98, but by, deal price leading clubs taking ofash. Sch, whoations ofals each First5 rights800 in6,08.188 the clubs form a " but were eventually to, top taking theion deal Theations also in receive needed take whole of First Division o

In [19]:
token_counter.reset_counts()

In [20]:
response = synthesizer.synthesize(question, new_retrieved_nodes)

In [21]:
response

Response(response='English clubs were banned from European competition due to poor facilities, hooliganism, and a history of violence.', source_nodes=[NodeWithScore(node=TextNode(id_='d1b09aff-6ac7-40f0-9ef6-797a03e872e3', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='ea6d1aeb9d0b0098d02bbc88f3338265ea944b0ec66744e065e8296a7dfe1068', text='\n ===Desite significant European success in the 197 and the late9s marked a low for footballs andters poor facilities,ism wasife, and English had been European in5 The First level of English since, wass A attendues, top English moved abroad.By9 theend was reverse9,s; UEFA European, lifted the five-year on English playing in European competitions in19 in Cupinners1 Report on safety, proposed to stadiums the, in09s, major English begun to commercial club administrationvenue Edwards Unitedolarur and were the this transformation The commercial imperative the clubs seeking to increase

In [22]:
response.response

'English clubs were banned from European competition due to poor facilities, hooliganism, and a history of violence.'

In [23]:
print(
    "Embedding Tokens: ",
    token_counter.total_embedding_token_count,
    "\n",
    "LLM Prompt Tokens: ",
    token_counter.prompt_llm_token_count,
    "\n",
    "LLM Completion Tokens: ",
    token_counter.completion_llm_token_count,
    "\n",
    "Total LLM Token Count: ",
    token_counter.total_llm_token_count,
    "\n",
)

Embedding Tokens:  0 
 LLM Prompt Tokens:  525 
 LLM Completion Tokens:  23 
 Total LLM Token Count:  548 
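
The two token reports (cell 12 without compression, cell 23 with compression) can be compared with a little arithmetic. The sketch below uses the prompt token counts printed in those cells; the helper name `compression_stats` is hypothetical. The comparison is only approximate, since cell 10 uses the custom QA template while the synthesizer in cell 20 uses its own default prompt.

```python
def compression_stats(original_tokens: int, compressed_tokens: int) -> tuple[float, float]:
    """Return (compression ratio, fraction of tokens saved) for two token counts."""
    if original_tokens <= 0 or compressed_tokens <= 0:
        raise ValueError("token counts must be positive")
    ratio = original_tokens / compressed_tokens
    saved = 1 - compressed_tokens / original_tokens
    return ratio, saved


# Prompt token counts taken from the outputs of cells 12 and 23.
ratio, saved = compression_stats(2363, 525)
print(f"Prompt tokens: {ratio:.2f}x smaller, {saved:.1%} saved")
```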



In [24]:
# Repeat retrieval and compression, then answer with the custom QA template from cell 10
# instead of the CompactAndRefine synthesizer.
retrieved_nodes = retriever.retrieve(question)

new_retrieved_nodes = node_postprocessor.postprocess_nodes(
    retrieved_nodes, query_bundle=QueryBundle(query_str=question)
)

contexts = [n.get_content() for n in new_retrieved_nodes]
prompt = qa_template.format(context_str="\n\n".join(contexts), query_str=question)
response = llm.complete(prompt)

In [25]:
new_retrieved_nodes

[NodeWithScore(node=TextNode(id_='bfc9ecb3-ad7d-4d50-a47f-86ab58af44db', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='ea6d1aeb9d0b0098d02bbc88f3338265ea944b0ec66744e065e8296a7dfe1068', text='\n ===Desite significant European success in the 197 and the late9s marked a low for footballs andters poor facilities,ism wasife, and English had been European in5 The First level of English since, wass A attendues, top English moved abroad.By9 theend was reverse9,s; UEFA European, lifted the five-year on English playing in European competitions in19 in Cupinners1 Report on safety, proposed to stadiums the, in09s, major English begun to commercial club administrationvenue Edwards Unitedolarur and were the this transformation The commercial imperative the clubs seeking to increase their power the away from League in so, to increase their voting and gain more arrangement50 ofship income in18. They demanded companies for of footb

In [26]:
response.text

'English clubs were banned from European competition due to a series of incidents and issues in the 1980s. These included poor facilities, hooliganism, and a lack of safety measures in stadiums. The ban was imposed by UEFA, the governing body for European football, in an effort to address these problems and improve the overall image and safety of the sport. The ban lasted for five years, from 1985 to 1990.'
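
Both compressed answers (cells 22 and 26) no longer mention the Heysel Stadium disaster that the uncompressed answer in cell 11 cited, and the compressed context printed in cell 18 appears to have dropped that phrase. A quick way to spot this kind of loss is to check whether key terms survive compression. The sketch below is a simple substring check, not part of LlamaIndex or LLMLingua; the helper `missing_terms` and the choice of key terms are assumptions, and the two excerpts are copied from the outputs of cells 9 and 18.

```python
def missing_terms(text: str, terms: list[str]) -> list[str]:
    """Return the terms that do not occur in the text (case-insensitive substring match)."""
    lowered = text.lower()
    return [t for t in terms if t.lower() not in lowered]


# Excerpts copied from the original context (cell 9) and the compressed context (cell 18).
original_excerpt = (
    "English clubs had been banned from European competition for five years "
    "following the Heysel Stadium disaster in 1985."
)
compressed_excerpt = "and English had been European in5 The First level of English since"

key_terms = ["Heysel", "banned", "European"]  # hypothetical terms chosen for this question
print(missing_terms(original_excerpt, key_terms))    # []
print(missing_terms(compressed_excerpt, key_terms))  # ['Heysel', 'banned']
```

If an important term goes missing, raising `target_token` in the postprocessor is one setting to experiment with.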