<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/output_parsing/table_qa.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tables QA program

In this example, we show how to perform table question answering with three baseline approaches.

Three approaches:
1. Pass the raw text output (the whole page from the PDF parser) to a query engine.
2. Use recursive retrieval to retrieve the relevant nodes, including table nodes, to answer the question.
3. Generate a dataframe for each table and use a `PandasQueryEngine` per table as a node for the recursive retriever.

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

In [None]:
%pip install llama-index

In [None]:
import os

OPENAI_API_TOKEN = "sk-"  # Your OpenAI API token here
os.environ["OPENAI_API_KEY"] = OPENAI_API_TOKEN

### Loading Raw Text from PDF parser as document nodes

In [None]:
# from llama_index.llms import MockLLM
from llama_index.node_parser import (
    MarkdownElementNodeParser,
)
from llama_index.schema import Document, IndexNode, TextNode


test_table_document = Document(
    text="""
|Model|Size|Code|Commonsense Reasoning|World Knowledge|Reading Comprehension|Math MMLU|BBH|AGI Eval|
|---|---|---|---|---|---|---|---|---|
|MPT|7B|20.5|57.4|41.0|57.5|4.9|26.8|31.0|
|MPT|30B|28.9|64.9|50.0|64.7|9.1|46.9|38.0|
|Falcon|7B|5.6|56.1|42.8|36.0|4.6|26.2|28.0|
|Falcon|40B|15.2|69.2|56.7|65.7|12.6|55.4|37.1|
|Falcon|7B|14.1|60.8|46.2|58.5|6.95|35.1|30.3|
|Llama 1|13B|18.9|66.1|52.6|62.3|10.9|46.9|37.0|
|Llama 1|33B|26.0|70.0|58.4|67.6|21.4|57.8|39.8|
|Llama 1|65B|30.7|70.7|60.5|68.6|30.8|63.4|43.5|
|Llama 1|7B|16.8|63.9|48.9|61.3|14.6|45.3|32.6|
|Llama 2|13B|24.5|66.9|55.4|65.8|28.7|54.8|39.4|
|Llama 2|34B|27.8|69.9|58.7|68.0|24.2|62.6|44.1|
|Llama 2|70B|37.5|71.9|63.6|69.4|35.2|68.9|51.2|

    Table 3: Overall performance on grouped academic benchmarks compared to open-source base models.

    • Popular Aggregated Benchmarks. We report the overall results for MMLU (5 shot) (Hendrycks et al., 2020), Big Bench Hard (BBH) (3 shot) (Suzgun et al., 2022), and AGI Eval (3–5 shot) (Zhong et al., 2023). For AGI Eval, we only evaluate on the English tasks and report the average.

    As shown in Table 3, Llama 2 models outperform Llama 1 models. In particular, Llama 2 70B improves the results on MMLU and BBH by ≈5 and ≈8 points, respectively, compared to Llama 1 65B. Llama 2 7B and 30B models outperform MPT models of the corresponding size on all categories besides code benchmarks. For the Falcon models, Llama 2 7B and 34B outperform Falcon 7B and 40B models on all categories of benchmarks. Additionally, Llama 2 70B model outperforms all open-source models.

    In addition to open-source models, we also compare Llama 2 70B results to closed-source models. As shown in Table 4, Llama 2 70B is close to GPT-3.5 (OpenAI, 2023) on MMLU and GSM8K, but there is a significant gap on coding benchmarks. Llama 2 70B results are on par or better than PaLM (540B) (Chowdhery et al., 2022) on almost all benchmarks. There is still a large gap in performance between Llama 2 70B and GPT-4 and PaLM-2-L.

    We also analysed the potential data contamination and share the details in Section A.6.

|Benchmark (shots)|GPT-3.5|GPT-4|PaLM|PaLM-2-L|Llama 2|
|---|---|---|---|---|---|
|MMLU (5-shot)|70.0|86.4|69.3|78.3|68.9|
|TriviaQA (1-shot)|–|–|81.4|86.1|85.0|
|Natural Questions (1-shot)|–|–|29.3|37.5|33.0|
|GSM8K (8-shot)|57.1|92.0|56.5|80.7|56.8|
|HumanEval (0-shot)|48.1|67.0|26.2|–|29.9|
|BIG-Bench Hard (3-shot)|–|–|52.3|65.7|51.2|

    Table 4: Comparison to closed-source models on academic benchmarks. Results for GPT-3.5 and GPT-4 are from OpenAI (2023). Results for the PaLM model are from Chowdhery et al. (2022). Results for the PaLM-2-L are from Anil et al. (2023).

    3 Fine-tuning

    Llama 2-Chat is the result of several months of research and iterative applications of alignment techniques, including both instruction tuning and RLHF, requiring significant computational and annotation resources. In this section, we report on our experiments and findings using supervised fine-tuning (Section 3.1), as well as initial and iterative reward modeling (Section 3.2.2) and RLHF (Section 3.2.3). We also share a new technique, Ghost Attention (GAtt), which we find helps control dialogue flow over multiple turns (Section 3.3). See Section 4.2 for safety evaluations on fine-tuned models.

    
    We initialize our reward models from pretrained chat model checkpoints, as it ensures that both models
    benefit from knowledge acquired in pretraining. In short, the reward model “knows” what the chat model
    ---
    # Statistics of human preference data for reward modeling

    The reward model takes a model response and its corresponding prompt (including contexts from previous turns) as inputs and outputs a scalar score to indicate the quality (e.g., helpfulness and safety) of the model generation. Leveraging such response scores as rewards, we can optimize Llama 2-Chat during RLHF for better human preference alignment and improved helpfulness and safety.

    Others have found that helpfulness and safety sometimes trade off (Bai et al., 2022a), which can make it challenging for a single reward model to perform well on both. To address this, we train two separate reward models, one optimized for helpfulness (referred to as Helpfulness RM) and another for safety (Safety RM).

    We initialize our reward models from pretrained chat model checkpoints, as it ensures that both models benefit from knowledge acquired in pretraining. In short, the reward model “knows” what the chat model

|Dataset|Num. of Comparisons|Avg. # Turns per Dialogue|Avg. # Tokens per Example|Avg. # Tokens in Prompt|Avg. # Tokens in Response|
|---|---|---|---|---|---|
|Anthropic Helpful|122,387|3.0|251.5|17.7|88.4|
|Anthropic Harmless|43,966|3.0|152.5|15.7|46.4|
|OpenAI Summarize|176,625|1.0|371.1|336.0|35.1|
|OpenAI WebGPT|13,333|1.0|237.2|48.3|188.9|
|StackExchange|1,038,480|1.0|440.2|200.1|240.2|
|Stanford SHP|74,882|1.0|338.3|199.5|138.8|
|Synthetic GPT-J|33,139|1.0|123.3|13.0|110.3|
|Meta (Safety & Helpfulness)|1,418,091|3.9|798.5|31.4|234.1|
|Total|2,919,326|1.6|595.7|108.2|216.9|

    Table 6: Statistics of human preference data for reward modeling. We list both the open-source and internally collected human preference data used for reward modeling. Note that a binary human preference comparison contains 2 responses (chosen and rejected) sharing the same prompt (and previous dialogue). Each example consists of a prompt (including previous dialogue if available) and a response, which is the input of the reward model. We report the number of comparisons, the average number of turns per dialogue, the average number of tokens per example, per prompt and per response. More details on Meta helpfulness and safety data per batch can be found in Appendix A.3.1.

    knows. This prevents cases where, for instance, the two models would have an information mismatch, which could result in favoring hallucinations. The model architecture and hyper-parameters are identical to those of the pretrained language models, except that the classification head for next-token prediction is replaced with a regression head for outputting a scalar reward.

    Training Objectives. To train the reward model, we convert our collected pairwise human preference data into a binary ranking label format (i.e., chosen & rejected) and enforce the chosen response to have a higher score than its counterpart. We used a binary ranking loss consistent with Ouyang et al. (2022):

    L_ranking = −log(σ(r_θ(x, y_c) − r_θ(x, y_r))) (1)

    where r_θ(x, y) is the scalar score output for prompt x and completion y with model weights θ. y_c is the preferred response that annotators choose and y_r is the rejected counterpart.

    Built on top of this binary ranking loss, we further modify it separately for better helpfulness and safety reward models as follows. Given that our preference ratings is decomposed as a scale of four points (e.g., significantly better), as presented in Section 3.2.1, it can be useful to leverage this information to explicitly teach the reward model to assign more discrepant scores to the generations that have more differences. To do so, we further add a margin component in the loss:
        """
)

## Baseline 1: Using PDF parser Raw Output (containing multiple tables) as Input for Query Engine

In [None]:
from llama_index import (
    VectorStoreIndex,
    get_response_synthesizer,
)
from llama_index.retrievers import VectorIndexRetriever
from llama_index.query_engine import RetrieverQueryEngine

# build index
index = VectorStoreIndex.from_documents([test_table_document])

# configure retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=2,
)
# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
)

# query different questions
response_1 = query_engine.query(
    "What is MPT 30b performance for common sense reasoning?"
)
print(response_1)

response_2 = query_engine.query("What is PaLM-2-L performance for TriviaQA?")
print(response_2)


response_3 = query_engine.query("What is LLAMA 2 performance for HumanEval?")
print(response_3)

response_4 = query_engine.query("What is LLAMA 2 performance for AGI Eval?")
print(response_4)

response_5 = query_engine.query(
    "What is LLAMA 2 performance for HumanEval and AGI Eval?"
)
print(response_5)

The context information does not provide the performance of MPT 30B specifically for common sense reasoning.
PaLM-2-L performance for TriviaQA is 86.1.
Llama 2's performance for HumanEval is 29.9.
Llama 2's performance for AGI Eval is not mentioned in the given context.
LLAMA 2's performance for HumanEval and AGI Eval is not mentioned in the given context.


Observation: The Baseline 1 approach failed to give correct answers for Questions 1, 4, and 5.
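
For intuition about the `similarity_top_k=2` setting used in Baseline 1, here is a toy sketch of top-k retrieval. It is not how `VectorIndexRetriever` is implemented: word counts stand in for real embeddings, and `bag_of_words`, `cosine`, `top_k`, and the three chunk strings are hypothetical names and data made up for illustration. The point is only that the LLM sees the k best-scoring chunks and nothing else, so a table row that lands in a lower-ranked chunk cannot be used in the answer.

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase token counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query, best first."""
    q = bag_of_words(query)
    ranked = sorted(
        chunks, key=lambda chunk: cosine(q, bag_of_words(chunk)), reverse=True
    )
    return ranked[:k]


# Made-up chunk descriptions, loosely modeled on the tables in the document.
chunks = [
    "Table 3 reports AGI Eval and BBH scores for Llama 2 and MPT",
    "Table 4 reports HumanEval and TriviaQA scores for Llama 2 and PaLM",
    "Table 6 lists statistics of human preference data for reward modeling",
]
hits = top_k("Llama 2 HumanEval score", chunks, k=2)
print(hits)
```

With `k=2`, the third chunk never reaches the answer-synthesis step, whatever it contains.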

## Baseline 2: Apply `MarkdownElementNodeParser` for parsing table/text nodes and using `Recursive Retriever` to retrieve relevant nodes

### Parsing nodes using `MarkdownElementNodeParser`

In [None]:
node_parser = MarkdownElementNodeParser()

doc_nodes = node_parser.get_nodes_from_documents([test_table_document])

Embeddings have been explicitly disabled. Using MockEmbedding.


0it [00:00, ?it/s]

3it [00:19,  6.34s/it]


In [None]:
print(len(doc_nodes))

9
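
To make the idea of table and text elements concrete, the sketch below splits a markdown string into alternating text and table blocks. It is a simplified illustration, not the logic of `MarkdownElementNodeParser` (which, judging by the metadata used later in this notebook, also attaches summaries and dataframes to table nodes). The function name `split_markdown_elements` and the rule "a line starting with `|` belongs to a table" are assumptions made for this example.

```python
def split_markdown_elements(markdown: str) -> list[tuple[str, str]]:
    """Group consecutive lines into ("table", block) or ("text", block) elements."""
    elements: list[tuple[str, list[str]]] = []
    for line in markdown.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank lines carry no content in this toy version
        kind = "table" if stripped.startswith("|") else "text"
        if elements and elements[-1][0] == kind:
            # Same kind as the previous line: extend the current element.
            elements[-1][1].append(stripped)
        else:
            # Kind changed: start a new element.
            elements.append((kind, [stripped]))
    return [(kind, "\n".join(lines)) for kind, lines in elements]


sample = """
Intro paragraph.

|Model|Size|AGI Eval|
|---|---|---|
|Llama 2|70B|51.2|

Table 3: Overall performance.
"""
elements = split_markdown_elements(sample)
kinds = [kind for kind, _ in elements]
print(kinds)
```

Keeping each table as one element means its header row and data rows stay together.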


### Get index nodes and child nodes mapping for recursive retriever

In [None]:
base_nodes, node_mappings = node_parser.get_base_nodes_and_mappings(doc_nodes)
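
The role of the node mapping can be sketched with plain Python. This is a one-hop toy, not the `RecursiveRetriever` implementation: `ToyNode`, `ToyIndexNode`, and `resolve` are hypothetical names, and the assumption is that a retrieved summary node carries the id of the node it points to, which is then looked up in a dictionary.

```python
from dataclasses import dataclass


@dataclass
class ToyNode:
    node_id: str
    text: str


@dataclass
class ToyIndexNode(ToyNode):
    index_id: str = ""  # id of the node this summary points to


def resolve(retrieved: list[ToyNode], node_dict: dict[str, ToyNode]) -> list[ToyNode]:
    """Swap every index node for the node it references; keep other nodes as-is."""
    resolved = []
    for node in retrieved:
        if isinstance(node, ToyIndexNode) and node.index_id in node_dict:
            resolved.append(node_dict[node.index_id])
        else:
            resolved.append(node)
    return resolved


table = ToyNode("table_0", "|Model|Size|AGI Eval|\n|Llama 2|70B|51.2|")
summary = ToyIndexNode(
    "summary_0", "Benchmark scores per model and size", index_id="table_0"
)
paragraph = ToyNode("text_0", "Llama 2 models outperform Llama 1 models.")

# The retriever matched the short summary, but the full table is what gets returned.
resolved = resolve([summary, paragraph], {"table_0": table})
print([node.node_id for node in resolved])
```

In this sketch the short summary is what gets matched against the query, while the full table text is what would be passed on for answer synthesis.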

## Table Retrieval and Question Answering using Recursive Retrieval Query Engine

In [None]:
from llama_index import VectorStoreIndex
from llama_index.llms import OpenAI
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.retrievers import RecursiveRetriever
from llama_index.service_context import ServiceContext

# construct top-level vector index + query engine
vector_index = VectorStoreIndex(doc_nodes)
vector_retriever = vector_index.as_retriever(similarity_top_k=3)
vector_query_engine = vector_index.as_query_engine(similarity_top_k=3)


recursive_retriever = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever},
    node_dict=node_mappings,
    verbose=True,
)

llm = OpenAI(temperature=0, model="gpt-4")
service_context = ServiceContext.from_defaults(llm=llm)

recursive_query_engine = RetrieverQueryEngine.from_args(
    recursive_retriever, service_context=service_context
)

In [None]:
# query different questions
response_1 = recursive_query_engine.query(
    "What is MPT 30b performance for common sense reasoning?"
)
print(response_1)

response_2 = recursive_query_engine.query(
    "What is PaLM-2-L performance for TriviaQA?"
)
print(response_2)


response_3 = recursive_query_engine.query(
    "What is LLAMA 2 performance for HumanEval?"
)
print(response_3)

response_4 = recursive_query_engine.query(
    "What is LLAMA 2 performance for AGI Eval?"
)
print(response_4)

response_5 = recursive_query_engine.query(
    "What is LLAMA 2 performance for HumanEval and AGI Eval?"
)
print(response_5)

Retrieving with query id None: What is MPT 30b performance for common sense reasoning?
Retrieving text node: This table provides information on different models and their performance in various tasks such as commonsense reasoning, world knowledge, reading comprehension, math MMLU, BBH, and AGI evaluation. The table includes the model names, sizes, and scores for each task.,
with the following table title:
Model Performance Comparison,
with the following columns:
- Model: The name of the model
- Size: The size of the model
- Code: The code score for the model
- Commonsense Reasoning: The score for commonsense reasoning task
- World Knowledge: The score for world knowledge task
- Reading Comprehension: The score for reading comprehension task
- Math MMLU: The score for math MMLU task
- BBH: The score for BBH task
- AGI Eval: The score for AGI evaluation task

|Model|Size|Code|Commonsense Reasoning|World Knowledge|Reading Comprehension|Math MMLU|BBH|AGI Eval|
|

In [None]:
print(response_1)
print(response_2)
print(response_3)
print(response_4)
print(response_5)

The performance of the MPT 30B model for the commonsense reasoning task is 64.9.
The PaLM-2-L model scored 86.1 on the TriviaQA benchmark.
The performance of the Llama 2 model for HumanEval (0-shot) is 29.9.
The performance of the Llama 2 model for AGI Eval is 51.2.
The Llama 2 model scored 29.9 on the HumanEval (0-shot) benchmark. For the AGI Eval task, the Llama 2 70B model scored 51.2.


Observation: The Baseline 2 approach answers all 5 questions. However, its answer to the 4th question is only partial: Llama 2 comes in several sizes, and `51.2` is the score for the Llama 2 70B model only.

## Baseline 3: Recursive Retrieval + Pandas Query Engine for Table Nodes

In [None]:
import ast

import pandas as pd

# Collect a dataframe and a summary for every table node produced by the parser.
table_dfs = []
table_summaries = []
for node in doc_nodes:
    if "table" in node.id_:
        if "table_df" in node.metadata:
            # The dataframe is stored in the metadata as a string, so parse it
            # back into a dict before building the DataFrame.
            table_dfs.append(
                pd.DataFrame.from_dict(
                    ast.literal_eval(node.metadata["table_df"])
                )
            )
        if "table_summary" in node.metadata:
            table_summaries.append(node.metadata["table_summary"])
print(len(table_dfs))
print(len(table_summaries))

3
3
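
The `ast.literal_eval(node.metadata["table_df"])` call above turns a string back into a Python dict. The sketch below shows that step in isolation using only the standard library. The string `table_df_str` and its `{column: {row_index: value}}` layout are assumptions made for illustration; the actual metadata value may be laid out differently.

```python
import ast

# Hypothetical metadata value: a dict literal stored as a string,
# laid out as {column: {row_index: value}}.
table_df_str = (
    "{'Model': {0: 'MPT', 1: 'MPT'}, "
    "'Size': {0: '7B', 1: '30B'}, "
    "'Code': {0: 20.5, 1: 28.9}}"
)

# literal_eval only parses Python literals; it does not execute arbitrary code.
columns = ast.literal_eval(table_df_str)

# Rebuild one dict per row from the column-oriented layout.
row_ids = sorted(next(iter(columns.values())))
rows = [{name: values[i] for name, values in columns.items()} for i in row_ids]
print(rows)
```

Unlike `eval`, `ast.literal_eval` rejects anything that is not a plain literal, which makes it the safer choice for metadata strings.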


In [None]:
table_dfs[2]

| |Dataset|Num. of Comparisons|Avg. # Turns per Dialogue|Avg. # Tokens per Example|Avg. # Tokens in Prompt|Avg. # Tokens in Response|
|---|---|---|---|---|---|---|
|0|Anthropic Helpful|122387|3.0|251.5|17.7|88.4|
|1|Anthropic Harmless|43966|3.0|152.5|15.7|46.4|
|2|OpenAI Summarize|176625|1.0|371.1|336.0|35.1|
|3|OpenAI WebGPT|13333|1.0|237.2|48.3|188.9|
|4|StackExchange|1038480|1.0|440.2|200.1|240.2|
|5|Stanford SHP|74882|1.0|338.3|199.5|138.8|
|6|Synthetic GPT-J|33139|1.0|123.3|13.0|110.3|
|7|Meta (Safety & Helpfulness)|1418091|3.9|798.5|31.4|234.1|
|8|Total|2919326|1.6|595.7|108.2|216.9|


In [None]:
table_summaries

['This table provides information on different models and their performance in various tasks such as commonsense reasoning, world knowledge, reading comprehension, math MMLU, BBH, and AGI evaluation. The table includes the model names, sizes, and scores for each task.,\nwith the following table title:\nModel Performance Comparison,\nwith the following columns:\n- Model: The name of the model\n- Size: The size of the model\n- Code: The code score for the model\n- Commonsense Reasoning: The score for commonsense reasoning task\n- World Knowledge: The score for world knowledge task\n- Reading Comprehension: The score for reading comprehension task\n- Math MMLU: The score for math MMLU task\n- BBH: The score for BBH task\n- AGI Eval: The score for AGI evaluation task\n',
 'This table provides benchmark scores for different language models, including GPT-3.5, GPT-4, PaLM, PaLM-2-L, and Llama 2. The scores are measured in various tasks such as MMLU, TriviaQA, Natural Questions, GSM8K, HumanE

## Create Pandas Query Engines

We create a pandas query engine over each structured table (`dataframe`).

These can be executed on their own to answer queries about each table.
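
To see why a structured table helps with questions such as the AGI Eval one, here is a hand-written pandas lookup over a few rows copied from Table 3 of the sample document. It only illustrates the kind of dataframe operation that can answer these questions. The expressions below are written by hand and are not output generated by `PandasQueryEngine`, and the variable names are made up for this sketch.

```python
import pandas as pd

# A few rows copied from Table 3 in the sample document above.
table_df = pd.DataFrame(
    {
        "Model": ["MPT", "Llama 2", "Llama 2", "Llama 2"],
        "Size": ["30B", "13B", "34B", "70B"],
        "Commonsense Reasoning": [64.9, 66.9, 69.9, 71.9],
        "AGI Eval": [38.0, 39.4, 44.1, 51.2],
    }
)

# "What is MPT 30b performance for common sense reasoning?"
mpt_30b = table_df[(table_df["Model"] == "MPT") & (table_df["Size"] == "30B")]
mpt_score = mpt_30b["Commonsense Reasoning"].iloc[0]

# "What is LLAMA 2 performance for AGI Eval?" matches one row per model size.
llama2_agi = table_df.loc[table_df["Model"] == "Llama 2", ["Size", "AGI Eval"]]
llama2_scores = dict(zip(llama2_agi["Size"], llama2_agi["AGI Eval"]))

print(mpt_score)
print(llama2_scores)
```

A filter on the `Model` column returns every matching row, so a question about "Llama 2" naturally yields a score for each size.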

In [None]:
from llama_index.query_engine import PandasQueryEngine

# `table_dfs` and `service_context` are defined in the cells above.

df_query_engines = [
    PandasQueryEngine(table_df, service_context=service_context)
    for table_df in table_dfs
]

In [None]:
# define index nodes

df_index_nodes = [
    IndexNode(text=summary, index_id=f"pandas{idx}")
    for idx, summary in enumerate(table_summaries)
]

df_id_query_engine_mapping = {
    f"pandas{idx}": df_query_engine
    for idx, df_query_engine in enumerate(df_query_engines)
}
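
The mapping from `pandas{idx}` ids to query engines can be read as a small routing table. The toy below shows that idea with plain callables. `route` and the two lambda "engines" are hypothetical stand-ins, not LlamaIndex objects, and the sketch assumes the retrieved index node supplies the id used for the lookup.

```python
from typing import Callable


def route(index_id: str, query: str, engines: dict[str, Callable[[str], str]]) -> str:
    """Send the query to the engine registered under index_id."""
    if index_id not in engines:
        raise KeyError(f"no query engine registered for {index_id!r}")
    return engines[index_id](query)


# Hypothetical stand-ins for the per-table query engines.
engines = {
    "pandas0": lambda q: f"table 0 answered: {q}",
    "pandas1": lambda q: f"table 1 answered: {q}",
}

# Pretend the retriever matched the summary whose index_id is "pandas1".
answer = route("pandas1", "What is PaLM-2-L performance for TriviaQA?", engines)
print(answer)
```

Because the ids in the index nodes and in the mapping are built from the same `enumerate` order, each table summary resolves to the engine for its own dataframe.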

In [None]:
# construct top-level vector index + query engine
vector_index = VectorStoreIndex(doc_nodes + df_index_nodes)
vector_retriever = vector_index.as_retriever(similarity_top_k=3)

In [None]:
from llama_index.response_synthesizers import get_response_synthesizer


recursive_retriever = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever},
    query_engine_dict=df_id_query_engine_mapping,
    node_dict=node_mappings,
    verbose=True,
)

response_synthesizer = get_response_synthesizer(
    service_context=service_context, response_mode="compact"
)

recursive_query_engine = RetrieverQueryEngine.from_args(
    recursive_retriever, response_synthesizer=response_synthesizer
)

In [None]:
# query different questions
response_1 = recursive_query_engine.query(
    "What is MPT 30b performance for common sense reasoning?"
)
print(response_1)

response_2 = recursive_query_engine.query(
    "What is PaLM-2-L performance for TriviaQA?"
)
print(response_2)


response_3 = recursive_query_engine.query(
    "What is LLAMA 2 performance for HumanEval?"
)
print(response_3)

response_4 = recursive_query_engine.query(
    "What is LLAMA 2 performance for AGI Eval?"
)
print(response_4)

response_5 = recursive_query_engine.query(
    "What is LLAMA 2 performance for HumanEval and AGI Eval?"
)
print(response_5)

Retrieving with query id None: What is MPT 30b performance for common sense reasoning?
Retrieving text node: This table provides information on different models and their performance in various tasks such as commonsense reasoning, world knowledge, reading comprehension, math MMLU, BBH, and AGI evaluation. The table includes the model names, sizes, and scores for each task.,
with the following table title:
Model Performance Comparison,
with the following columns:
- Model: The name of the model
- Size: The size of the model
- Code: The code score for the model
- Commonsense Reasoning: The score for commonsense reasoning task
- World Knowledge: The score for world knowledge task
- Reading Comprehension: The score for reading comprehension task
- Math MMLU: The score for math MMLU task
- BBH: The score for BBH task
- AGI Eval: The score for AGI evaluation task

|Model|Size|Code|Commonsense Reasoning|World Knowledge|Reading Comprehension|Math MMLU|BBH|AGI Eval|
|

In [None]:
print(response_1)
print(response_2)
print(response_3)
print(response_4)
print(response_5)

The performance of MPT 30B for common sense reasoning is 64.9.
The performance of PaLM-2-L for TriviaQA is 86.1.
The performance of LLAMA 2 for HumanEval is 29.9.
The performance of the Llama 2 model for the AGI Eval task varies depending on the size of the model. The Llama 2 13B model scored 39.4, the Llama 2 34B model scored 44.1, and the Llama 2 70B model scored 51.2.
The Llama 2 model scored 29.9 on the HumanEval task and 51.2 on the AGI Eval task.


Observation: The Baseline 3 approach answers all 5 questions. For the 4th question, it reports the AGI Eval score for each Llama 2 size in the table (13B, 34B, and 70B).