# Basic Usage on UDA Benchmark Suite

The demonstration encompasses several essential steps:

* Prepare the question-answer-document triplet data-item
* Extract and segment the document content
* Build indexes and retrieve data segments
* Generate answers with LLMs
* Evaluate the accuracy of the responses using the corresponding metric

## Load and view the Q&A labels

The Q&A labels are available either as CSV files in the `dataset/qa` directory or as a HuggingFace dataset in the repository `qinchuanhui/UDA-QA`.

In [1]:
# Get the q&a labels from the csv file
import pandas as pd
from uda.utils import preprocess

DATASET_NAME = "fin"

csv_file_path = f"./dataset/qa/{DATASET_NAME}_qa.csv"
df = pd.read_csv(csv_file_path, sep="|", na_filter=False)
qas_dict = preprocess.qa_df_to_dict(DATASET_NAME, df)


Alternatively, load the Q&A labels from the HuggingFace dataset:

In [2]:
# # Load the Q&A labels from the HuggingFace dataset (uncomment to use)

# import pandas as pd
# from uda.utils import preprocess
# from datasets import load_dataset

# DATASET_NAME = "tat"

# hf_dataset = load_dataset("qinchuanhui/UDA-QA", DATASET_NAME)
# hf_data = hf_dataset["test"]
# df = hf_data.to_pandas()
# qas_dict = preprocess.qa_df_to_dict(DATASET_NAME, df)


In [3]:
# View the snapshot of qas_dict
for key in list(qas_dict.keys())[:2]:
    print("Document Name: ", key)
    print("Its Q&A pairs: ", qas_dict[key][:2])
    print("=========================================")

Document Name:  ADI_2009
Its Q&A pairs:  [{'question': 'what is the the interest expense in 2009?', 'answers': {'str_answer': '380', 'exe_answer': '3.8'}, 'q_uid': 'ADI/2009/page_49.pdf-1'}, {'question': 'what is the expected growth rate in amortization expense in 2010?', 'answers': {'str_answer': '-27.0%', 'exe_answer': '-0.26689'}, 'q_uid': 'ADI/2009/page_59.pdf-2'}]
Document Name:  ABMD_2012
Its Q&A pairs:  [{'question': 'during the 2012 year , did the equity awards in which the prescribed performance milestones were achieved exceed the equity award compensation expense for equity granted during the year?', 'answers': {'str_answer': '', 'exe_answer': 'yes'}, 'q_uid': 'ABMD/2012/page_75.pdf-1'}, {'question': 'for equity awards where the performance criteria has been met in 2012 , what is the average compensation expense per year over which the cost will be expensed?', 'answers': {'str_answer': '1719526', 'exe_answer': '1714285.71429'}, 'q_uid': 'ABMD/2012/page_75.pdf-2'}]
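
The snapshot above shows that `qas_dict` maps each document name to a list of Q&A dictionaries. When a per-question view is more convenient, the nesting can be flattened. The sketch below uses a hand-copied sample in place of the real `qas_dict`, and `flatten_qas` is a hypothetical helper, not part of `uda.utils`.

```python
# Hand-copied sample mirroring the structure printed above (not the full dataset).
sample_qas = {
    "ADI_2009": [
        {
            "question": "what is the the interest expense in 2009?",
            "answers": {"str_answer": "380", "exe_answer": "3.8"},
            "q_uid": "ADI/2009/page_49.pdf-1",
        },
        {
            "question": "what is the expected growth rate in amortization expense in 2010?",
            "answers": {"str_answer": "-27.0%", "exe_answer": "-0.26689"},
            "q_uid": "ADI/2009/page_59.pdf-2",
        },
    ],
}


def flatten_qas(qas):
    """Hypothetical helper: turn {doc_name: [qa, ...]} into a flat list of rows."""
    rows = []
    for doc_name, qa_list in qas.items():
        for qa in qa_list:
            rows.append(
                {
                    "doc": doc_name,
                    "q_uid": qa["q_uid"],
                    "question": qa["question"],
                    "answers": qa["answers"],
                }
            )
    return rows


rows = flatten_qas(sample_qas)
for row in rows:
    print(row["doc"], row["q_uid"], row["answers"]["exe_answer"])
```

Each row keeps the document name next to the question, which is the same information the result dictionaries in the evaluation step carry.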


## Prepare the document data

In [4]:
# Get the local path of the example pdf file
example_doc_name = list(qas_dict.keys())[0]
example_qa = qas_dict[example_doc_name][1] # or set the index to 0
pdf_path = preprocess.get_example_pdf_path(
    DATASET_NAME, example_doc_name
)  # interchangeable with get_pdf_path()
if pdf_path is None:
    print("No pdf found for this document")

### Basic data extraction

Use the `PyPDF2` library to extract the raw text from the PDF files. Tabular structure survives only as layout markers, such as line breaks and spaces.

In [5]:
import PyPDF2
# Extract text from pdf
pdf_text = ""
with open(pdf_path, "rb") as file:
    # Create a PDF file reader object
    reader = PyPDF2.PdfReader(file, strict=False)
    for page in reader.pages:
        pdf_text += page.extract_text()
# Show a snapshot of the text
print(pdf_text[8000:8500])


signiﬁcantly decreasing their
inventories. In response to these
unprecedented revenue declines, we
substantially decreased production levels to
reduce our inventory levels. This action, of
course, had the effect of temporarily lowering
our gross margins, which reached a trough of
54.1% in the third quarter.  We reacted
quickly to the business environment, reducingPresident’s Letter
0.000.100.200.300.400.500.60FY2009 Product Revenue and Diluted EPS
From Continuing Operations by Quarter
$0$100$200


### Data segmentation

Use `RecursiveCharacterTextSplitter` from `langchain.text_splitter` to recursively split the text into overlapping chunks, with a 10% overlap and a preference for breaking at explicit separators such as `\n\n` and `\n`.

In [6]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=3000,  # chunk size in characters, not words
    chunk_overlap=300,  # 300 characters, i.e. 10% overlap
)
text_chunks = text_splitter.split_text(pdf_text)
print("chunk_num:", len(text_chunks))
avg_chunk_word_counts = sum([len(chunk.split()) for chunk in text_chunks]) / len(text_chunks)
print("chunk_word_counts:", avg_chunk_word_counts)


chunk_num: 141
chunk_word_counts: 436.5531914893617
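
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification and not the `RecursiveCharacterTextSplitter` algorithm, which additionally tries to break at separators; `chunk_with_overlap` is a hypothetical helper written for illustration.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into character windows; consecutive windows share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


demo_text = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = chunk_with_overlap(demo_text, chunk_size=10, chunk_overlap=1)
print(demo_chunks)
```

With a 10% overlap, the last character of each window reappears as the first character of the next one, so a sentence cut at a chunk boundary is still partly visible in both chunks.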


## Conduct indexing and retrieval
In this demo, we use the standard dense-embedding approach: the `all-MiniLM-L6-v2` model from the `SentenceTransformer` framework, with the embeddings stored in the vector database `ChromaDB`.

Both queries and document segments are embedded into vectors, upon which cosine similarity measures are computed to retrieve the segments with the highest relevance.
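
The ranking step can be sketched without a vector database. The example below scores toy 2-dimensional vectors by cosine similarity and returns the indices of the best matches; the vectors and the helper names are made up for illustration, whereas the real pipeline uses the model's embeddings and ChromaDB's index.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_by_cosine(query_vec, segment_vecs, k):
    """Return the indices of the k segments most similar to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), idx)
        for idx, vec in enumerate(segment_vecs)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy embeddings (assumed values, not produced by any model).
toy_segments = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query = [2.0, 0.1]
ranked = top_k_by_cosine(toy_query, toy_segments, k=2)
print(ranked)
```

Because cosine similarity ignores vector length, the query `[2.0, 0.1]` ranks segment 0 first even though their magnitudes differ.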

In [7]:
import chromadb
import torch
import chromadb.utils.embedding_functions as embedding_functions

# Create the vector_db collection 
# and store the embeddings
model_name = "all-MiniLM-L6-v2"
chroma_client = chromadb.Client()
device_info = "cuda" if torch.cuda.is_available() else "cpu"
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name=model_name, device=device_info
)
collection = chroma_client.create_collection(
    "demo_vdb", embedding_function=ef, metadata={"hnsw:space": "cosine"}
)
id_list = [str(i) for i in range(len(text_chunks))]
collection.add(documents=text_chunks, ids=id_list)

In [8]:
# Fetch the top_k most similar chunks according to the query
top_k = 5
question = example_qa["question"]
fetch_res = collection.query(query_texts=[question], n_results=top_k)
contexts = fetch_res["documents"][0]

# Show a snapshot of the context
print(f"The most relevant contexts to the question: {question}")
for idx, context in enumerate(contexts):
    print(f"===== Context {idx+1} =======")
    print(context[:100], "...")


The most relevant contexts to the question: what is the expected growth rate in amortization expense in 2010?
Amortization expense from continuing operations, related to intangibles was $7.4 million, $9.3 milli ...
Service (cost), interest (cost), and expected return on assets ............................ (338)
d. ...
2009 2008
Discount rate .............................................................. 6.60% 5.64%
E ...
Interest expense . . . ....................................... 4,094 — —
Interest income . . . ..... ...
Amortization or curtailment recognition of prior service cost .................... ( 5 ) ( 9 )
Amort ...


## Perform LLM answering
We pass the retrieved contexts together with the question to an LLM to generate the final response. The process is illustrated with both a locally hosted open-source LLM and a commercial GPT model.

We access the GPT-4 model through the Azure OpenAI API and the local LLM through HuggingFace. Set them up with your own API key or token in the file [uda/utils/access_config.py](./uda/utils/access_config.py)
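
As a rough sketch, such a configuration module could read its values from environment variables rather than hard-coding secrets. The names `GPT_API_KEY`, `GPT_ENDPOINT` and `GPT_MODEL` are the ones the GPT cell in this demo reads from `access_config`; the environment variable names and `HF_TOKEN` are assumptions made for illustration and may not match the actual file.

```python
import os

# Attribute names used by the GPT cell in this demo.
# The environment variable names are assumptions, not part of the repository.
GPT_API_KEY = os.environ.get("AZURE_OPENAI_API_KEY", "")
GPT_ENDPOINT = os.environ.get("AZURE_OPENAI_ENDPOINT", "")
GPT_MODEL = os.environ.get("AZURE_OPENAI_DEPLOYMENT", "gpt-4")

# Hypothetical name for the HuggingFace access token; check the real file.
HF_TOKEN = os.environ.get("HF_TOKEN", "")

if not GPT_API_KEY:
    print("GPT_API_KEY is empty; set it before calling the API")
```

Keeping credentials in the environment avoids committing them to version control by accident.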

### Demonstration of GPT-4 Model

In [9]:
# Example on GPT model
from uda.utils import llm
from openai import AzureOpenAI
from uda.utils import access_config

llm_type = "gpt-4"
# Create the prompt tailored for different datasets and LLMs  
context_text = "\n".join(contexts)
llm_message = llm.make_prompt(question=question, context=context_text, task_name=DATASET_NAME, llm_type=llm_type)

# Call GPT-4/GPT-3.5 through Azure OpenAI API
# You should replace the following parameters with your own configurations
# You can also use other API platforms here

client = AzureOpenAI(
    api_key = access_config.GPT_API_KEY,
    api_version = "2024-04-01-preview",
    azure_endpoint = access_config.GPT_ENDPOINT,
)
raw_response = client.chat.completions.create(
    model = access_config.GPT_MODEL,
    messages = llm_message,
    temperature = 0.1,
)

gpt_response = raw_response.choices[0].message.content

# Show the response
print(f"Question: {question}")
print(f"Ground Truth Reference: {example_qa['answers']}")
print(f"LLM Response: {gpt_response}")

Question: what is the expected growth rate in amortization expense in 2010?
Ground Truth Reference: {'str_answer': '-27.0%', 'exe_answer': '-0.26689'}
LLM Response: The expected amortization expense for fiscal year 2010 is $5,425, and for fiscal year 2009 it was $7.4 million. To find the expected growth rate, we can use the formula:

Growth Rate = (New Value - Old Value) / Old Value * 100%

Plugging in the values:

Growth Rate = ($5,425 - $7,400,000) / $7,400,000 * 100%

First, we convert $5,425 to the same scale as $7,400,000, which is $5,425,000 (since the values in the report are in millions and the 2010 value is likely also meant to be in millions but is missing the appropriate notation).

Growth Rate = ($5,425,000 - $7,400,000) / $7,400,000 * 100%
Growth Rate = (-$1,975,000) / $7,400,000 * 100%
Growth Rate = -26.69%

The expected growth rate in amortization expense in 2010 is a decrease of 26.69%.

The answer is: -26.69%


### Demonstration of Local-LLM (Llama-3-8B)

In [10]:
# Example on Llama model
from uda.utils import llm
from uda.utils import inference

llm_name = "meta-llama/Meta-Llama-3-8B-Instruct"
llm_type = "llama-8B"
# Create the prompt tailored for different datasets and LLMs  
context_text = "\n".join(contexts)
llm_message = llm.make_prompt(question=question, context=context_text, task_name=DATASET_NAME, llm_type=llm_type)

# Local Inference
llm_service = inference.LLM(llm_name)
llm_service.init_llm()
llama_response = llm_service.infer(llm_message)

# Show the response
print(f"Question: {question}")
print(f"Ground Truth Reference: {example_qa['answers']}")
print(f"LLM Response: {llama_response}")





2024-07-05 12:23:34 CallLLM
Question: what is the expected growth rate in amortization expense in 2010?
Ground Truth Reference: {'str_answer': '-27.0%', 'exe_answer': '-0.26689'}
LLM Response: Based on the provided information, the amortization expense for intangible assets is expected to decrease from $9.3 million in 2008 to $5.425 million in 2010.

The expected growth rate in amortization expense in 2010 would be:

((5.425 - 9.3) / 9.3) * 100% ≈ -42.1%

So, the expected growth rate in amortization expense in 2010 is approximately -42.1%.
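
The two responses can be checked by recomputing the growth rate from the figures they quote. The sketch below assumes all amounts are in millions of dollars, as the responses themselves do, and compares each result with the reference `exe_answer`.

```python
def growth_rate(new_value, old_value):
    """Relative change from old_value to new_value, as a fraction."""
    return (new_value - old_value) / old_value


# Figures quoted in the responses above, assumed to be in millions of dollars.
rate_2009_base = growth_rate(5.425, 7.4)  # base year used in the GPT-4 response
rate_2008_base = growth_rate(5.425, 9.3)  # base year used in the Llama response
reference = -0.26689                      # exe_answer from the Q&A labels

print(f"2009 base: {rate_2009_base:.5f}")
print(f"2008 base: {rate_2008_base:.5f}")
print(f"matches reference: {abs(rate_2009_base - reference) < 1e-5}")
```

Using the 2009 figure as the base reproduces the reference value. With the 2008 figure the same formula gives about -41.7%, so the Llama response differs from the reference in its choice of base year and also in its arithmetic.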


## Evaluation of the Responses

To assess the accuracy of the LLM responses, we use several metrics and evaluation techniques.

The results should be organized as a list of dictionaries, each containing the response and the ground-truth answers.

For details on the evaluation code, refer to the `uda.eval` module in our repository.
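
To give an idea of what a numeric check can look like, here is a minimal sketch that extracts the last number from a response and compares it with `exe_answer` under a relative tolerance. It is an illustration under stated assumptions, not the logic of `uda.eval`: the helper names, the regular expression and the 1% tolerance are all hypothetical, and non-numeric answers such as `yes` are simply treated as unsupported.

```python
import re


def extract_final_number(response):
    """Return the last number in the response as a float; percentages become fractions."""
    matches = re.findall(r"(-?\d[\d,]*\.?\d*)\s*(%?)", response)
    if not matches:
        return None
    number_text, percent_sign = matches[-1]
    value = float(number_text.replace(",", ""))
    return value / 100 if percent_sign else value


def is_numeric_match(response, exe_answer, rel_tol=0.01):
    """Hypothetical check: does the final number agree with exe_answer within rel_tol?"""
    try:
        target = float(exe_answer)
    except ValueError:
        return False  # non-numeric reference (e.g. "yes") is out of scope here
    predicted = extract_final_number(response)
    if predicted is None:
        return False
    return abs(predicted - target) <= rel_tol * max(abs(target), 1e-9)


demo_response = "The answer is: -26.69%"
print(is_numeric_match(demo_response, "-0.26689"))
```

Taking the last number relies on the response ending with its final answer, as the GPT-4 response in this demo does; a response that ends with an intermediate figure would need a stricter extraction rule.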

In [12]:
# Format the result
res_dict = {
    "question": question,
    "response": gpt_response,
    "doc": example_doc_name,
    "q_uid": example_qa["q_uid"],
    "answers": example_qa["answers"],
}
res_data = [res_dict]

print(res_data)

# Evaluate the result
from uda.eval.my_eval import eval_main
eval_main(DATASET_NAME, res_data)


[{'question': 'what is the expected growth rate in amortization expense in 2010?', 'response': 'The expected amortization expense for fiscal year 2010 is $5,425, and for fiscal year 2009 it was $7.4 million. To find the expected growth rate, we can use the formula:\n\nGrowth Rate = (New Value - Old Value) / Old Value * 100%\n\nPlugging in the values:\n\nGrowth Rate = ($5,425 - $7,400,000) / $7,400,000 * 100%\n\nFirst, we convert $5,425 to the same scale as $7,400,000, which is $5,425,000 (since the values in the report are in millions and the 2010 value is likely also meant to be in millions but is missing the appropriate notation).\n\nGrowth Rate = ($5,425,000 - $7,400,000) / $7,400,000 * 100%\nGrowth Rate = (-$1,975,000) / $7,400,000 * 100%\nGrowth Rate = -26.69%\n\nThe expected growth rate in amortization expense in 2010 is a decrease of 26.69%.\n\nThe answer is: -26.69%', 'doc': 'ADI_2009', 'q_uid': 'ADI/2009/page_59.pdf-2', 'answers': {'str_answer': '-27.0%', 'exe_answer': '-0.266