## Online Meeting

<a target="_blank" href="https://colab.research.google.com/github/microsoft/LLMLingua/blob/main/examples/OnlineMeeting.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

Using generative AI such as ChatGPT in online meetings (e.g., **Teams**) can greatly improve work efficiency. However, the context in such applications tends to be conversational, with a high degree of redundancy and a large number of tokens (more than **40k**). Compressing prompts with LLMLingua significantly reduces prompt length, which in turn reduces latency. This makes the AI more responsive in real-time communication scenarios such as online meetings. We use meeting transcripts from the [**MeetingBank** dataset](https://huggingface.co/datasets/lytang/MeetingBank-transcript) as an example to demonstrate the capabilities of LLMLingua.

### MeetingBank Dataset

Next, we demonstrate LongLLMLingua on the **MeetingBank** dataset, where it can achieve similar or even better performance with significantly fewer tokens. The online meeting scenario is quite similar to RAG: it also suffers from the "lost in the middle" issue, where noise at the beginning or end of the prompt interferes with the LLM's ability to extract key information. This dataset closely resembles real-world online meeting scenarios, with prompts exceeding **60k** tokens at their longest.

The original dataset can be found at https://huggingface.co/datasets/lytang/MeetingBank-transcript
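
To make the reordering idea concrete, here is a minimal sketch that sorts context chunks by a relevance score so that the most relevant ones come first. It only illustrates the idea behind the `reorder_context="sort"` option used later in this notebook and is not LLMLingua's implementation; the helper `reorder_by_relevance` and the scores are hypothetical.

```python
def reorder_by_relevance(contexts, scores):
    """Return contexts ordered from most to least relevant.

    `scores` is a hypothetical per-context relevance score (higher is better).
    Ties keep their original relative order because sorted() is stable.
    """
    if len(contexts) != len(scores):
        raise ValueError("contexts and scores must have the same length")
    order = sorted(range(len(contexts)), key=lambda i: -scores[i])
    return [contexts[i] for i in order]


# Made-up chunks and scores, for illustration only.
chunks = ["opening remarks", "crime statistics", "closing remarks"]
print(reorder_by_relevance(chunks, [0.1, 0.9, 0.2]))
```

Placing the highest-scoring chunks at the start keeps key information away from the middle of a long prompt, which is the position the "lost in the middle" issue refers to.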

In [1]:
# Install dependency.
!pip install llmlingua datasets
!pip install accelerate



In [2]:
# Download the original prompt and dataset
from datasets import load_dataset
dataset = load_dataset("lytang/MeetingBank-transcript")["train"]


In [3]:
# Using the OpenAI API
import openai
openai.api_key = "<insert_openai_key>"

In [4]:
import os
from dotenv import load_dotenv
from openai import AzureOpenAI

load_dotenv()
# Save the credentials in the .env file
client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version="2023-12-01-preview",
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
)
# The custom name you chose for your deployment when you deployed a model.
deployment_name = os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME")
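
Because the client above is configured entirely from environment variables, an unset variable only surfaces later as an authentication or routing error. The following is a small sketch of an upfront check; `missing_settings` is a hypothetical helper, and the variable names are the ones read in the cell above.

```python
import os

# Names read by the AzureOpenAI setup cell above.
REQUIRED_VARS = (
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_DEPLOYMENT_NAME",
)


def missing_settings(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_settings()
if missing:
    print("Missing settings:", ", ".join(missing))
```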

### Setup Data

In [5]:
# select an example from MeetingBank
contexts = dataset[1]["source"]
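
Before compressing, it can help to check how large the selected transcript is. The sketch below reports character, non-empty line, and whitespace-separated word counts; `transcript_stats` is a hypothetical helper and the sample string is made up. Word count is only a rough proxy for tokens, and the line count approximates the number of chunks produced by the `contexts.split("\n")` call used in the compression cells.

```python
def transcript_stats(text):
    """Return rough size statistics for a transcript string."""
    non_empty_lines = [line for line in text.split("\n") if line.strip()]
    return {
        "chars": len(text),
        "lines": len(non_empty_lines),
        "words": len(text.split()),  # whitespace-separated, not model tokens
    }


# Made-up two-line transcript; in the notebook you would pass `contexts`.
sample = "Speaker 1: Good evening.\nSpeaker 2: Thank you, Mayor.\n"
print(transcript_stats(sample))
```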

### Q1

In [6]:
question = "Question: How much did the crime rate increase last year?\nAnswer:"
reference = "5.4%"

In [7]:
# The response from original prompt, using GPT-4
import json
prompt = "\n\n".join([contexts, question])

message = [
    {"role": "system", "content": "You are a helpful assistant designed to output JSON."},
    {"role": "user", "content": prompt},
]
request_data = {
    "max_tokens": 100,
    "temperature": 0,
    "top_p": 1,
    "n": 1,
    "stream": False,
}

response = client.chat.completions.create(
    model=deployment_name,
    response_format={"type": "json_object"},
    messages=message,
    **request_data,
)
print(json.dumps(response.choices[0].message.content, indent=4))


"{\n  \"response\": \"According to the information provided in the budget presentations, the city experienced a 5.4% increase in violent crime year to date. This was after starting the year with a 17.4% increase in violent crime in January.\"\n}"

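The answer above can be compared with `reference` by eye, but a small check makes the comparison repeatable across questions. This is a minimal sketch, assuming the model returns a JSON object whose string values contain the answer; `answer_contains` is a hypothetical helper, and the sample is an abridged copy of the output shown above.

```python
import json


def answer_contains(raw_content, reference):
    """Report whether the reference string appears in the model's answer.

    `raw_content` is the message content returned in JSON mode. If it does
    not parse as JSON, fall back to a plain substring check.
    """
    try:
        payload = json.loads(raw_content)
    except json.JSONDecodeError:
        return reference in raw_content
    values = payload.values() if isinstance(payload, dict) else [payload]
    return any(reference in str(value) for value in values)


# Abridged copy of the response printed above.
sample = '{"response": "The city experienced a 5.4% increase in violent crime year to date."}'
print(answer_contains(sample, "5.4%"))
```

An exact substring match suits short references such as `"5.4%"`; longer free-form references like the one in Q3 would need a looser comparison.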

In [9]:
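# Setup LLMLingua
# PromptCompressor loads its model on CUDA by default. On a machine whose
# PyTorch build has no CUDA support this raises the AssertionError shown
# below; passing device_map="cpu" runs the compressor on CPU instead (slower).
from llmlingua import PromptCompressor
llm_lingua = PromptCompressor()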

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]


AssertionError: Torch not compiled with CUDA enabled

In [None]:
# Compress the context (target_token=20 with a "+100" context budget)
compressed_prompt = llm_lingua.compress_prompt(
    contexts.split("\n"),
    instruction="",
    question=question,
    target_token=20,
    condition_compare=True,
    condition_in_question='after',
    rank_method='llmlingua',
    use_sentence_level_filter=False,
    context_budget="+100",
    dynamic_context_compression_ratio=0.4, # enable dynamic_context_compression_ratio
    reorder_context="sort"
)



In [None]:
import json
print(json.dumps(compressed_prompt, indent=4))
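
The dictionary printed above also carries token counts, which give a quick view of how much was saved. The sketch below summarizes them; it assumes the result has `origin_tokens` and `compressed_tokens` keys, as in the example output in the LLMLingua README. `compression_summary` is a hypothetical helper and the numbers in `example_result` are made up.

```python
def compression_summary(result):
    """Summarize token savings from a compress_prompt-style result dict."""
    origin = result.get("origin_tokens")
    compressed = result.get("compressed_tokens")
    if not origin or not compressed:
        return "token counts unavailable"
    return f"{origin} -> {compressed} tokens ({origin / compressed:.1f}x)"


# Made-up values for illustration; in the notebook pass `compressed_prompt`.
example_result = {
    "compressed_prompt": "...",
    "origin_tokens": 30000,
    "compressed_tokens": 300,
}
print(compression_summary(example_result))
```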


In [None]:
# The response from the compressed prompt
message = [
    {"role": "system", "content": "You are a helpful assistant designed to output JSON."},
    {"role": "user", "content": compressed_prompt["compressed_prompt"]},
]
request_data = {
    "max_tokens": 100,
    "temperature": 0,
    "top_p": 1,
    "n": 1,
    "stream": False,
}

response = client.chat.completions.create(
    model=deployment_name,
    response_format={"type": "json_object"},
    messages=message,
    **request_data,
)
print(json.dumps(response.choices[0].message.content, indent=4))

### Q2

In [None]:
question = "Question: What is the homicide clearance rate?\nAnswer:"
reference = "77%"

In [None]:
# The response from original prompt, using GPT-4-32k
import json
prompt = "\n\n".join([contexts, question])

message = [
    {"role": "user", "content": prompt},
]

request_data = {
    "messages": message,
    "max_tokens": 100,
    "temperature": 0,
    "top_p": 1,
    "n": 1,
    "stream": False,
}
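# openai.ChatCompletion was removed in openai>=1.0, which the AzureOpenAI
# import above requires, so reuse the client created earlier. The deployment
# should be a long-context model (e.g. GPT-4-32k) so the full transcript fits.
response = client.chat.completions.create(model=deployment_name, **request_data)
print(response.model_dump_json(indent=4))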

In [None]:
# 200 Compression
compressed_prompt = llm_lingua.compress_prompt(
    contexts.split("\n"),
    instruction="",
    question=question,
    target_token=200,
    condition_compare=True,
    condition_in_question='after',
    rank_method='longllmlingua',
    use_sentence_level_filter=True,
    context_budget="+100",
    reorder_context="sort"
)
message = [
    {"role": "user", "content": compressed_prompt["compressed_prompt"]},
]

request_data = {
    "messages": message,
    "max_tokens": 100,
    "temperature": 0,
    "top_p": 1,
    "n": 1,
    "stream": False,
}
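# openai.ChatCompletion was removed in openai>=1.0; reuse the AzureOpenAI
# client created earlier.
response = client.chat.completions.create(model=deployment_name, **request_data)

print(json.dumps(compressed_prompt, indent=4))
print("Response:", response)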
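
The cells above rebuild the same message list and request settings for every question. One possible way to factor that out is sketched below; `build_chat_request` is a hypothetical helper, not part of LLMLingua or the OpenAI SDK, and it only assembles keyword arguments without calling any API.

```python
def build_chat_request(prompt, max_tokens=100, json_mode=False):
    """Build keyword arguments for a chat completion call.

    Mirrors the settings used in the cells above: deterministic decoding
    (temperature 0), a single choice, and no streaming.
    """
    messages = []
    if json_mode:
        # JSON mode is paired with a system message that mentions JSON,
        # as in the Q1 cells.
        messages.append({
            "role": "system",
            "content": "You are a helpful assistant designed to output JSON.",
        })
    messages.append({"role": "user", "content": prompt})
    request = {
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": 0,
        "top_p": 1,
        "n": 1,
        "stream": False,
    }
    if json_mode:
        request["response_format"] = {"type": "json_object"}
    return request


request = build_chat_request("Question: What is the homicide clearance rate?\nAnswer:")
# With the client from the setup cell, the call would then be:
# response = client.chat.completions.create(model=deployment_name, **request)
```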

### Q3

In [None]:
question = "Question: what are the arrangements the Police Department will make this year?"
reference = "enhancing community engagement and internal communication models, building a culture of accountability and transparency, and prioritizing recruitment and retention."

In [None]:
# The response from original prompt, using GPT-4-32k
import json
prompt = "\n\n".join([contexts, question])

message = [
    {"role": "user", "content": prompt},
]

request_data = {
    "messages": message,
    "max_tokens": 500,
    "temperature": 0,
    "top_p": 1,
    "n": 1,
    "stream": False,
}
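# openai.ChatCompletion was removed in openai>=1.0, which the AzureOpenAI
# import above requires, so reuse the client created earlier. The deployment
# should be a long-context model (e.g. GPT-4-32k) so the full transcript fits.
response = client.chat.completions.create(model=deployment_name, **request_data)
print(response.model_dump_json(indent=4))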

In [None]:
# 2000 Compression
compressed_prompt = llm_lingua.compress_prompt(
    contexts.split("\n"),
    instruction="",
    question=question,
    target_token=2000,
    condition_compare=True,
    condition_in_question='after',
    rank_method='longllmlingua',
    use_sentence_level_filter=False,
    context_budget="+100",
    dynamic_context_compression_ratio=0.4, # enable dynamic_context_compression_ratio
    reorder_context="sort"
)
message = [
    {"role": "user", "content": compressed_prompt["compressed_prompt"]},
]

request_data = {
    "messages": message,
    "max_tokens": 500,
    "temperature": 0,
    "top_p": 1,
    "n": 1,
    "stream": False,
}
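# openai.ChatCompletion was removed in openai>=1.0; reuse the AzureOpenAI
# client created earlier.
response = client.chat.completions.create(model=deployment_name, **request_data)

print(json.dumps(compressed_prompt, indent=4))
print("Response:", response)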