[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/generation/structured-data/vectorizing-structured-data.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/generation/structured-data/vectorizing-structured-data.ipynb)

# Setup

Install the following libraries to work with this notebook.

Note: You will need two API keys to run this notebook: a [Pinecone](https://www.pinecone.io/) serverless API key, which you can get at app.pinecone.io after signing up for an account, and an OpenAI API key, which you can get at [OpenAI](https://openai.com/blog/openai-api).


In [2]:
# This notebook runs on Python version:
!python3 --version

Python 3.12.11


In [6]:
# Installs, 1
!pip install -qU \
    "pinecone-client[grpc]" \
    "unstructured[pdf]" \
    langchain \
    llama-index \
    llama-index-vector-stores-pinecone \
    pillow \
    pytesseract \
    --upgrade --no-cache-dir --force-reinstall


In [None]:
# Import libs you'll need:
import json
import os
import re
from typing import Any
import requests

from bs4 import BeautifulSoup, ResultSet
from copy import deepcopy
from IPython.display import HTML, display
import pandas as pd
from pathlib import Path
from pinecone import ServerlessSpec
from pinecone.grpc import PineconeGRPC


from langchain.document_loaders import TextLoader
from llama_index.core.indices.vector_store.base import VectorStoreIndex
from llama_index.core.readers import download_loader
from llama_index.core.ingestion.pipeline import IngestionPipeline
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.schema import Document, TransformComponent
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.readers.file import PDFReader
from llama_index.vector_stores.pinecone import PineconeVectorStore
from unstructured.partition.pdf import partition_pdf

# If you run into issues with LlamaIndex and LLM or VectorStore, run this command in a new cell:
# !pip install llama-index --upgrade --no-cache-dir --force-reinstall

In [None]:
# The following simply makes print statements wrap text in Google Colab.
def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))

get_ipython().events.register('pre_run_cell', set_css)

# This ensures wrapped lines are also displayed within Pandas dataframes
pd.set_option('display.max_colwidth', 400)


# Table extraction with [Unstructured](https://unstructured-io.github.io/unstructured/index.html)

You will start by extracting embedded tables from a PDF using `Unstructured`. The strategy in this section largely follows the one outlined in [this blog post](https://unstructured.io/blog/mastering-table-extraction-revolutionize-your-earnings-reports-analysis-with-ai) by `Unstructured`.

Note the following:
- PDFs need the `hi_res` strategy parameter.
- You will use [`"yolox"`](https://unstructured-io.github.io/unstructured/best_practices/models.html), a table-specific ML model for extracting embedded tables from PDFs.
- You will set the `infer_table_structure` parameter to `True`, as per `Unstructured`'s instructions for using [`partition_pdf`](https://unstructured-io.github.io/unstructured/best_practices/table_extraction_pdf.html#method-1-using-partition-pdf).

The PDF you'll be using is [Explain then Rank: Scale Calibration of Neural Rankers Using Natural Language Explanations from Large Language Models](https://arxiv.org/pdf/2402.12276.pdf). It has already been [uploaded to Github](https://github.com/pinecone-io/examples/tree/master/learn/generation/semi-structured-data) for easy access.

In [None]:
def download_from_github(gh_dir: str, file_name: str):
    """
    Download file from Github.

    :param gh_dir: Github directory that houses the file,
        e.g. https://github.com/pinecone-io/examples/blob/master/learn/generation/structured-data/

        Note trailing "/".

    :param file_name: Name of file (including file extension) you want to download.
    """
    # Convert GitHub URL to raw content URL
    raw_url = gh_dir.replace("https://github.com/", "https://raw.githubusercontent.com/").replace("/blob", "") + file_name

    # Use requests to download the file
    response = requests.get(raw_url)

    # Check if the request was successful
    if response.status_code == 200:
        # Write the content to a file
        with open(file_name, 'wb') as file:
            file.write(response.content)
        print(f"File '{file_name}' downloaded successfully.")
    else:
        print(f"Failed to download the file. Status code: {response.status_code}")
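
The URL rewrite inside `download_from_github` can be checked on its own, without any network call. The sketch below is a minimal illustration; `github_blob_to_raw` is a hypothetical helper name that is not part of this notebook. It replaces only the first `/blob/` segment, so a directory name that happens to contain `blob` further down the path is left alone.

```python
def github_blob_to_raw(gh_dir: str, file_name: str) -> str:
    """Build the raw-content URL for a file inside a GitHub 'blob' directory URL."""
    # Swap the host for the raw-content host (first occurrence only).
    raw_dir = gh_dir.replace("https://github.com/", "https://raw.githubusercontent.com/", 1)
    # Drop only the first "/blob/" path segment; later path parts are left untouched.
    raw_dir = raw_dir.replace("/blob/", "/", 1)
    return raw_dir + file_name


url = github_blob_to_raw(
    "https://github.com/pinecone-io/examples/blob/master/learn/generation/structured-data/",
    "scale-calibration-of-neural-rankers.pdf",
)
print(url)
```

Printing the URL before downloading is a quick way to confirm that the directory string ends with a trailing "/".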

In [None]:
# Download file from Github
github_dir = "https://github.com/pinecone-io/examples/blob/master/learn/generation/structured-data/"
filename = "scale-calibration-of-neural-rankers.pdf"

download_from_github(github_dir, filename)

File 'scale-calibration-of-neural-rankers.pdf' downloaded successfully.


In [None]:
# Note: this cell takes ~1-2mins to run in Colab.
elements = partition_pdf(
    filename="scale-calibration-of-neural-rankers.pdf",
    strategy="hi_res",
    infer_table_structure=True,
    model_name="yolox_quantized",  # A bit faster than plain Yolox model
)

In [None]:
# Save table elements
tables = [el for el in elements if el.category == "Table"]

In [None]:
# You are going to save the extracted Table elements to a .txt file that will be used by BeautifulSoup downstream.
TEXT_FILE = "scale-calibration-of-neural-rankers.txt"

# Save HTML to .txt file
with open(TEXT_FILE, 'w') as output_file:
    for t in tables:
        content = t.metadata.text_as_html
        output_file.write(content + "\n\n")

In [None]:
# You will now use LangChain to load your "documents" (i.e. your tables)
loader = TextLoader(TEXT_FILE)
documents = loader.load()

In [None]:
# You will use BeautifulSoup to parse the HTML in your .txt file
html_tables = BeautifulSoup(documents[0].page_content).select('table')  # documents is only of len 1

In [None]:
# Note the number of embedded tables you've extracted from the PDF.
print(f'You\'ve extracted {len(html_tables)} tables from your PDF!')

You've extracted 5 tables from your PDF!


In [None]:
# You will now extract structured data from the tables you've extracted from the PDF.

def extract_cols_and_rows(tables: ResultSet) -> tuple[list, list]:
    """
    Grab column headers and rows from table elements.

    :param tables: Table elements selected from the HTML text.
    :return: Tuple of (headers, rows). `headers` holds one list of header cells per table;
        `rows` is a single flat list of rows across all tables.
    """
    headers = []
    rows = []
    # Iterate over each table
    for table in tables:
        # Extract headers
        headers.append([th.text for th in table.find_all('th')])
        # Extract rows. A <tr> with no <td> cells (e.g. a header row) yields an empty list,
        # which is kept so that the row slices used later in the notebook stay aligned.
        for tr in table.find_all('tr'):
            rows.append([td.text for td in tr.find_all('td')])
    return headers, rows

In [None]:
headers, rows = extract_cols_and_rows(html_tables)

You will only be playing with two tables in this example notebook, simply because it's easier than dealing with all 5.

The two you will be using for your experiments are "Table 1" and "Table 2" in the PDF.

Since nothing in table extraction is perfect (yet), you'll have to massage the extracted headers and rows a bit to get them into a clean, structured format.

#### Table 1 Construction

In [None]:
# Through manual investigation, you find their headers and rows in the "headers" and "rows" variables
t1_headers = headers[1]

In [None]:
t1_headers

['Metric', 'TREC-DL', 'NTCIR-14']

In [None]:
# You find the correct rows
t1_rows = rows[15:21]

In [None]:
t1_rows

[['# Queries (Train/Val/Test)', '| 97/53/67', '48/16/16'],
 ['Avg. # docs per query', '282.7', '345.3'],
 ['Levels of relevance', '4', '5'],
 ['Label dist. (low to high)', '58/22/14/6', '—48/23/17/8/3'],
 ['Avg. query length', '8.0', '22.0'],
 ['Avg. doc. length', '70.9', '493.2']]

In [None]:
# You make your headers and rows into a dataframe for easy parsing downstream.
df1 = pd.DataFrame(data=t1_rows, columns=t1_headers)

In [None]:
# Take a look at your constructed table:
df1

Unnamed: 0,Metric,TREC-DL,NTCIR-14
0,# Queries (Train/Val/Test),| 97/53/67,48/16/16
1,Avg. # docs per query,282.7,345.3
2,Levels of relevance,4,5
3,Label dist. (low to high),58/22/14/6,—48/23/17/8/3
4,Avg. query length,8.0,22.0
5,Avg. doc. length,70.9,493.2


#### Table 2 Construction

In [None]:
# Headers[2] is not super well formed (likely bc the headers are nested), so you will manually overwrite these in the next cell
headers[2]

['Metric',
 'Ranking',
 'Calibration',
 'Ranking',
 'Calibration',
 '',
 'nDCG',
 'nDCG@10',
 'CB-ECE',
 'ECE',
 'MSE |',
 'nDCG',
 'nDCG@10',
 'CB-ECE',
 'ECE',
 'MSE']

In [None]:
# Manually concatenate the nested headers to maintain semantic relations in a single header per column
# This is a design choice that you will need to discuss with stakeholders. You can structure the data extracted from your tables
# in any way that makes sense to you.

t2_headers = ['Method',
              'TREC-ndcg',
              'TREC-ndcg@10',
              'TREC-CB-ECE',
              'TREC-ECE',
              'TREC-MSE',
              'NTCIR-ndcg',
              'NTCIR-ndcg@10',
              'NTCIR-CB-ECE',
              'NTCIR-ECE',
              'NTCIR-MSE']
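
If you prefer not to type the flattened headers by hand, the same list can be generated from the two levels of the nested header. This is a small sketch, assuming the collection names and metric names below match what you want in your final column labels; `collections`, `metrics`, and `flat_headers` are illustrative names.

```python
from itertools import product

# Top level of the nested header (the two collections) and the metrics under each.
collections = ["TREC", "NTCIR"]
metrics = ["ndcg", "ndcg@10", "CB-ECE", "ECE", "MSE"]

# One flat label per column: "<collection>-<metric>", preceded by the method column.
flat_headers = ["Method"] + [f"{c}-{m}" for c, m in product(collections, metrics)]
print(flat_headers)
```

`product` iterates over every metric for the first collection before moving to the second, which matches the left-to-right column order of the table.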

In [None]:
t2_rows = rows[21:-9]

In [None]:
df2 = pd.DataFrame(data=t2_rows, columns=t2_headers)

In [None]:
# Our extraction technique did not grab the unlabeled index column in Table 2, which contains the
# classification category each of the listed methods falls into (A, B, C...)

# So, add this in:
categories = ['A', 'B', 'C', 'C', 'D', 'E', 'F', 'F']  # Note we have to duplicate some categories since the table has two values per row in some places

df2.insert(1, 'Category', categories)

In [None]:
# You can see some weirdness here w/missing decimal points and blank cells, so you will manually clean your dataframe below.
df2

Unnamed: 0,Method,Category,TREC-ndcg,TREC-ndcg@10,TREC-CB-ECE,TREC-ECE,TREC-MSE,NTCIR-ndcg,NTCIR-ndcg@10,NTCIR-CB-ECE,NTCIR-ECE,NTCIR-MSE
0,Uncalibrated monoBERT,A,0.799,0.494,1.205,—0.320.-——0.773,,| 0.735,0.337,1.757,0.799,1.824
1,Post hoc + monoBERT,B,0.799,0.494,1.141,0.125,0.684 |,0.735,0.337,1.624,0.457_—«1.462,
2,Finetune monoBERT,C,0.776,0.422,1.093,0.221,«0.721 |,0.696,0.268,1.843,0.709,‘1.874
3,Finetune BERT,C,0.738,0.327,1.253,0.266,~=—-0.785 |,0.727,0.285,1.756,0.546,«1.416
4,LLM prompting w/ rubrics,D,0.786,0.457,1.000,1.246,2.137 |,0.728,0.328,1.2947,1.194,2.773
5,Post hoc + MC Sampling LLM,E,0.790,0.473,1.165,0.145,0.673,| 0.736,"=~ 0.364""",1.677,0.472,‘1.540
6,Literal Explanation + BERT,F,0.815',"0.529""",0.996°,"0.067""","0.602"" |",0.742,0.340,"1.534""",0.355,1.3307
7,Conditional Explanation + BERT,F,0.822,0.5347,0.862',0.428,0.832 |,0.720,0.322,1.405',0.2577,1.2907


In [None]:
# From PDF, you know the actual values these cells need to be, so set them here
# Use .loc with (row label, column) so the assignment writes to df2 itself rather than to a temporary copy
df2.loc[0, "TREC-ECE"] = '0.320'
df2.loc[0, "TREC-MSE"] = '0.773'
df2.loc[1, 'NTCIR-ECE'] = '0.457'
df2.loc[1, 'NTCIR-MSE'] = '1.462'
df2.loc[0, 'TREC-CB-ECE'] = '1.205'
df2.loc[1, 'TREC-CB-ECE'] = '1.141'
df2.loc[5, 'TREC-CB-ECE'] = '1.165'

In [None]:
# Great! You can leave the special chars, etc. They shouldn't matter too much.
df2

Unnamed: 0,Method,Category,TREC-ndcg,TREC-ndcg@10,TREC-CB-ECE,TREC-ECE,TREC-MSE,NTCIR-ndcg,NTCIR-ndcg@10,NTCIR-CB-ECE,NTCIR-ECE,NTCIR-MSE
0,Uncalibrated monoBERT,A,0.799,0.494,1.205,0.320,0.773,| 0.735,0.337,1.757,0.799,1.824
1,Post hoc + monoBERT,B,0.799,0.494,1.141,0.125,0.684 |,0.735,0.337,1.624,0.457,1.462
2,Finetune monoBERT,C,0.776,0.422,1.093,0.221,«0.721 |,0.696,0.268,1.843,0.709,‘1.874
3,Finetune BERT,C,0.738,0.327,1.253,0.266,~=—-0.785 |,0.727,0.285,1.756,0.546,«1.416
4,LLM prompting w/ rubrics,D,0.786,0.457,1.000,1.246,2.137 |,0.728,0.328,1.2947,1.194,2.773
5,Post hoc + MC Sampling LLM,E,0.790,0.473,1.165,0.145,0.673,| 0.736,"=~ 0.364""",1.677,0.472,‘1.540
6,Literal Explanation + BERT,F,0.815',"0.529""",0.996°,"0.067""","0.602"" |",0.742,0.340,"1.534""",0.355,1.3307
7,Conditional Explanation + BERT,F,0.822,0.5347,0.862',0.428,0.832 |,0.720,0.322,1.405',0.2577,1.2907
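
If you would rather strip the stray characters than leave them in, a regular expression that keeps the first decimal number in each cell handles most of the noise shown above. This is a minimal sketch; `first_decimal` is a hypothetical helper, and the sample cells are copied from the table output.

```python
import re

DECIMAL = re.compile(r"\d+\.\d+")


def first_decimal(cell: str) -> str:
    """Return the first decimal number found in a noisy cell, or the cell unchanged if none."""
    match = DECIMAL.search(cell)
    return match.group(0) if match else cell


noisy = ["«0.721 |", "| 0.735", "‘1.874", "0.815'", "0.684 |"]
cleaned = [first_decimal(c) for c in noisy]
print(cleaned)
```

This only removes characters around a number. It cannot repair cells where a marker appears to have been read as an extra digit (for example `1.2947`) or where two values were merged into one cell, so those still need a manual check against the PDF.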


# LlamaIndex and Pinecone
Now that you have the 2 tables for experimentation, you will use [LlamaIndex](https://docs.llamaindex.ai/en/stable/) to turn the rest of the PDF (including the embedded tables, any diagrams, images, etc.) into [`Document`](https://docs.llamaindex.ai/en/stable/module_guides/loading/documents_and_nodes/root.html) objects.

This step is necessary because your earlier `documents` object, which you generated from your `.txt` file using `LangChain`, contains only your extracted *tables*, not the rest of the PDF's content.

The main tools you'll use from LlamaIndex are as follows:
- [`PDFReader`](https://github.com/run-llama/llama_index/blob/50806ba526dde4a054842394fe32e3880646fe6d/llama-index-legacy/llama_index/legacy/readers/file/docs_reader.py#L16) from LlamaHub
- [`SemanticSplitterNodeParser`](https://docs.llamaindex.ai/en/stable/api/llama_index.core.node_parser.SemanticSplitterNodeParser.html#semanticsplitternodeparser), which splits a Document into Nodes, with each node being a group of semantically related sentences.
- [`IngestionPipeline`](https://docs.llamaindex.ai/en/stable/module_guides/loading/ingestion_pipeline/root.html) to build an ETL flow that chunks your PDF, embeds it (i.e. vectorizes it), and then stores it in Pinecone in a specific [`namespace`](https://docs.pinecone.io/docs/namespaces).
- [`RetrieverQueryEngine`](https://github.com/run-llama/llama_index/blob/v0.10.12/llama-index-core/llama_index/core/query_engine/retriever_query_engine.py#L27) for querying your LLM in the RAG pipeline you'll build.

Learn more about using LlamaIndex with Pinecone on our [Integrations page](https://github.com/pinecone-io/examples/blob/master/learn/generation/llama-index/using-llamaindex-with-pinecone.ipynb).

## Load all PDF contents with LlamaIndex


In [None]:
# Read in your PDF

loader = PDFReader()
path = Path('scale-calibration-of-neural-rankers.pdf')
ctrl_docs = loader.load_data(file=path)

In [None]:
# Preview a Document
ctrl_docs[6]

Document(id_='b375ddec-d6c1-4046-b99d-f6c48513c62b', embedding=None, metadata={'page_label': '7', 'file_name': 'scale-calibration-of-neural-rankers.pdf'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Explain then Rank: Scale Calibration of Neural Rankers Using Natural Language Explanations from Large Language Models Conference’17, July 2017, Washington, DC, USA\nTable 2: Ranking and scale calibration performance of baseline methods and our approaches on two scale calibration datasets\nTREC and NTCIR. Note that lower is better with calibration metrics (CB-ECE, ECE and MSE). Statistically significant improve-\nments over “Platt Scaling monoBERT” are marked with†.\nCollection TREC NTCIR\nMetricRanking Calibration Ranking Calibration\nnDCG nDCG@10 CB-ECE ECE MSE nDCG nDCG@10 CB-ECE ECE MSE\nA Uncalibrated monoBERT 0.799 0.494 1.205 0.320 0.773 0.735 0.337 1.757 0.799 1.824\nB Post hoc + monoBERT 0.799 0.494 1.141 0.125 0.684 0.735 0.337 1.624 0.45

## Create Pinecone serverless index to store and retrieve Document Nodes


In [None]:
os.environ['PINECONE_API_KEY'] = "<your-key-from-app.pinecone.io>"  # REPLACE THIS WITH YOUR API KEY!
pinecone_api_key = os.getenv("PINECONE_API_KEY")

# Initialize connection to Pinecone
pc = PineconeGRPC(api_key=pinecone_api_key)
index_name = "structured-data-example"

if index_name not in pc.list_indexes().names():
    pc.create_index(
        index_name,
        dimension=1536,  # Dimensions match encoder (embedder/vectorizer) you will use downstream, ada-002 from OpenAI.
        spec=ServerlessSpec(cloud="aws", region="us-west-2"),
    )

# Initialize your index
pinecone_index = pc.Index(index_name)

In [None]:
# Confirm creation of your index & that (if new) it has no vectors in it yet.
pinecone_index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 0}},
 'total_vector_count': 0}

In [None]:
# If for any reason you want to delete your Pinecone index and start over, execute this code:
# pc.delete_index(index_name)

## Connect to Pinecone via LlamaIndex and build indexing pipeline

Below, you will build an indexing pipeline via LlamaIndex. You will upload your initial batch of vectors into a Pinecone index, in the `"control"` namespace. You will then use this namespace to compare and contrast downstream LLM answers to variants in your experiment.

Note: You will need an OpenAI API key for this step.

In [None]:
# Set/Get your OpenAI API Key

os.environ['OPENAI_API_KEY'] = "<your-openai-key>"  # REPLACE THIS WITH YOUR API KEY!
openai_api_key = os.getenv("OPENAI_API_KEY")

In [None]:
# Declare embedding model you will use throughout notebook:
# OpenAI's ada-002 text embedding model is the model you will use both for Node parsing and for vectorization of PDF contents
EMBED_MODEL = OpenAIEmbedding(api_key=openai_api_key)

In [None]:
# You will need to re-define Pinecone as a LlamaIndex PineconeVectorStore obj when you add namespaces, so build a
# function to help you do that:
def initialize_vector_store(index: PineconeGRPC, namespace: str) -> PineconeVectorStore:
    """
    Initialize Pinecone index as a VectorStore obj.

    :param index: Pinecone serverless index.
    :param namespace: Namespace constraint you want on your queries, indexing operations, etc. when using this vector store.
    :return: PineconeVectorStore obj.
    """
    return PineconeVectorStore(pinecone_index=index, namespace=namespace)


In [None]:
def run_indexing_pipeline(vector_store, documents, embed_model=EMBED_MODEL):
    # Define pipeline stages
    pipeline = IngestionPipeline(
        transformations=[
            # CleanTextForOpenAI(),  # Clean doc text
            SemanticSplitterNodeParser(
                buffer_size=1,
                breakpoint_percentile_threshold=95,
                embed_model=embed_model,
                ),
            embed_model,  # Vectorize nodes
            ],
        vector_store=vector_store # Index into Pinecone
        )

    # Run documents through pipeline
    return pipeline.run(documents=documents)
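
To build some intuition for `breakpoint_percentile_threshold=95`, the sketch below shows the general idea of percentile-based semantic splitting: measure the distance between neighbouring sentence embeddings and start a new chunk wherever that distance exceeds the chosen percentile of all distances. It is a toy illustration with hand-written 2-D vectors and hypothetical helper names, not LlamaIndex's actual implementation, which may differ in its details (for example in how `buffer_size` groups sentences).

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)


def percentile(values, pct):
    """Linearly interpolated percentile, with pct in [0, 100]."""
    ordered = sorted(values)
    rank = (pct / 100) * (len(ordered) - 1)
    lower, upper = math.floor(rank), math.ceil(rank)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def split_by_breakpoints(sentences, vectors, pct=95):
    """Group sentences, breaking where neighbour distance exceeds the percentile threshold."""
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > threshold:  # large semantic jump: close the current chunk
            chunks.append(current)
            current = []
        current.append(sentence)
    chunks.append(current)
    return chunks


# Toy data: two sentences about ranking, then two about calibration.
sentences = ["Rankers order documents.", "nDCG measures ranking.",
             "ECE measures calibration.", "Lower ECE is better."]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
chunks = split_by_breakpoints(sentences, vectors)
print(chunks)
```

A higher percentile produces fewer, larger chunks, because only the most abrupt topic changes count as breakpoints.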


In [None]:
# Declare namespace you will put your first batch of vectors into:
ctrl_namespace = 'control'

# Initialize vector store w/control namespace
ctrl_vector_store = initialize_vector_store(pinecone_index, ctrl_namespace)

# Run pipeline
output = run_indexing_pipeline(ctrl_vector_store, ctrl_docs)

Upserted vectors: 100%|████████████████████████████████████████████| 48/48 [00:00<00:00, 88.20it/s]


In [None]:
# Confirm your docs made it to the index, in the right namespace
pinecone_index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'control': {'vector_count': 48}},
 'total_vector_count': 48}

# Build RAG pipeline, background on experiments

You will run a variety of RAG experiments to figure out which way of vectorizing table data works best (i.e. provides the most accurate answers).

You will run two families of experiments:
1. Baseline RAG experiment where you do not do anything special to your PDF (this is the "control" variant)
2. Experiments where you explicitly vectorize the extracted table elements (`df1` and `df2`) in different ways. The different ways you will experiment with are:
- Concatenating all row data (`v1`)
- Concatenating all row data with header data, too (`v2`)
- Concatenating all row data with header data, and with table description data (`v3`)
- Injecting table values into a natural language template (`v4`)
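
To make the four strategies concrete, here is what each one might produce for a single row of Table 1. Only the `v1` format (values joined by ", ") is taken from this notebook; the exact formats shown for `v2`, `v3`, and `v4`, the placeholder table description, and all variable names are assumptions for illustration.

```python
example_headers = ["Metric", "TREC-DL", "NTCIR-14"]
example_row = ["Avg. query length", "8.0", "22.0"]
table_description = "Table 1 (placeholder description)"  # hypothetical; substitute the real caption

# v1: row values only
v1_text = ", ".join(example_row)

# v2: each value paired with its column header (assumed format)
v2_text = ", ".join(f"{h}: {v}" for h, v in zip(example_headers, example_row))

# v3: v2 plus a description of the table (assumed format)
v3_text = f"{table_description}. {v2_text}"

# v4: values injected into a natural-language template (assumed template)
v4_text = (f"In {example_headers[1]}, the {example_row[0].lower()} is {example_row[1]}; "
           f"in {example_headers[2]}, it is {example_row[2]}.")

for name, text in [("v1", v1_text), ("v2", v2_text), ("v3", v3_text), ("v4", v4_text)]:
    print(f"{name}: {text}")
```

Each step adds more context to the text that gets embedded, which is the variable these experiments are designed to test.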


You will ask your LLM the same 7 questions (defined below) across all experiment variants.

## Questions

You will ask the following 7 questions during each experiment. The answers were given by humans who read the article.

In [None]:
QUERIES = [
    "How does the average query length compare to the average document length in table 1?",
    "What are the Literal Explanation + BERT method's ndcg@10 scores on both datasets in table 2?",
    "What impact do natural language explanations (NLEs) have on improving the calibration and overall effectiveness of these models in document ranking tasks?",
    "How do i interpret table 2's calibration and ranking scores?",
    "What are the weights of \"Uncalibrated monoBERT\" tuned on?",
    "What category was used to build and train literal explanation + BERT? what does this category mean?",
    "Is the 'trec-dl' in table 1 the same as the 'trec' in table 2?"
]

ANSWERS = [
    "The average query length is shorter in TREC (8) than it is in NTCIR (22). The average doc length is also shorter in TREC (70.9 vs 493.2)",
    "TREC: 0.529; NTCIR: 0.340.",
    "NLEs lead to better calibrated neural rankers while maintaining or even boosting ranking performance in most scenarios",
    "Lower is better for calibration, higher is better for ranking",
    "MSMarco",
    "Category F: training nle-based neural rankers on calibration data.",
    "Yes"
]

## Build Control RAG pipeline

You will use the following RAG pipeline for each of your experiments. It fetches the top `5` semantic search results from Pinecone to use as context to send to your LLM.
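
As a rough picture of what "top `5` semantic search results" means, the sketch below ranks a handful of stored vectors by similarity to a query vector and keeps the best `k`. It assumes cosine similarity as the metric and uses tiny hand-written vectors; the function names and chunk IDs are hypothetical, and in the notebook this step is performed by Pinecone, not by local code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vec, stored, k=5):
    """Return the IDs of the k stored vectors most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in stored.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


stored = {
    "chunk-a": [1.0, 0.0],
    "chunk-b": [0.8, 0.6],
    "chunk-c": [0.0, 1.0],
}
nearest = top_k([1.0, 0.1], stored, k=2)
print(nearest)
```

The text behind the returned IDs is what gets passed to the LLM as context, so a table row can only influence an answer if its vector lands in the top `k`.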


In [None]:
def run_rag_pipeline(vector_store, queries, k=5, filters=None):
    """
    Send queries to an LLM, having it take context from a vector store (and namespace).

    :param vector_store: Your Pinecone vector store.
    :param queries: The queries you want to ask your LLM.
    :param k: The number of results you want retrieved as context from your Pinecone index.
    :param filters: Option to add metadata filters to request if desired.
    :return: Tuple of responses from your LLM.
    """

    # Instantiate VectorStoreIndex object from our vector_store object
    vector_index = VectorStoreIndex.from_vector_store(vector_store=vector_store)

    if not filters:
        retriever = VectorIndexRetriever(index=vector_index, similarity_top_k=k, namespace=vector_store.namespace)
    else:
        retriever = VectorIndexRetriever(index=vector_index, similarity_top_k=k, namespace=vector_store.namespace, filters=filters)

    # Query engine
    query_engine = RetrieverQueryEngine(retriever=retriever)

    # Pass our 7 test queries
    responses = ()
    for i in queries:
        response = query_engine.query(i).response
        responses += (response, )

    return responses


# Run experiments

## Control variant RAG pipeline

In [None]:
# Run RAG pipeline for control use case
one_ctrl, two_ctrl, three_ctrl, four_ctrl, five_ctrl, six_ctrl, seven_ctrl = run_rag_pipeline(ctrl_vector_store, QUERIES)

In [None]:
print(f"One: {one_ctrl}\n-----\nTwo: {two_ctrl}\n-----\nThree: {three_ctrl}\n-----\nFour: {four_ctrl}\n-----\n"
      f"Five: {five_ctrl}\n-----\nSix: {six_ctrl}\n-----\nSeven: {seven_ctrl}")

One: The average query length is shorter compared to the average document length in Table 1.
-----
Two: 0.529 and 0.534
-----
Three: Natural language explanations (NLEs) significantly enhance the scale calibration of neural rankers, often maintaining or even boosting ranking performance in most scenarios. They provide valuable insights for document differentiation and lead to statistically significant improvements in scale calibration, resulting in lower calibration error values compared to other calibration methods. Additionally, NLEs help in elucidating the rationale behind system decisions and enhancing task efficacy, ultimately improving the overall effectiveness of these models in document ranking tasks.
-----
Four: Interpreting Table 2's calibration and ranking scores involves understanding that lower values are better for calibration metrics such as CB-ECE, ECE, and MSE. The table compares the performance of different methods on two scale calibration datasets, TREC and NTCIR, us

In [None]:
# Start a dataframe that you can continue to add your results to as you run future experiments
ctrl_responses = [{'ANSWER': ANSWERS[0], 'control': one_ctrl},
                {'ANSWER': ANSWERS[1], 'control': two_ctrl},
                {'ANSWER': ANSWERS[2], 'control': three_ctrl},
                {'ANSWER': ANSWERS[3], 'control': four_ctrl},
                {'ANSWER': ANSWERS[4], 'control': five_ctrl},
                {'ANSWER': ANSWERS[5], 'control': six_ctrl},
                {'ANSWER': ANSWERS[6], 'control': seven_ctrl}]

exp_results = pd.DataFrame(data=ctrl_responses, index=[QUERIES[0], QUERIES[1], QUERIES[2], QUERIES[3],
                                                       QUERIES[4], QUERIES[5], QUERIES[6]])

exp_results

Unnamed: 0,ANSWER,control
How does the average query length compare to the average document length in table 1?,The average query length is shorter in TREC (8) than it is in NTCIR (22). The average doc length is also shorter in TREC (70.9 vs 493.2),The average query length is shorter compared to the average document length in Table 1.
What are the Literal Explanation + BERT method's ndcg@10 scores on both datasets in table 2?,TREC: 0.529; NTCIR: 0.340.,0.529 and 0.534
What impact do natural language explanations (NLEs) have on improving the calibration and overall effectiveness of these models in document ranking tasks?,NLEs lead to better calibrated neural rankers while maintaining or even boosting ranking performance in most scenarios,"Natural language explanations (NLEs) significantly enhance the scale calibration of neural rankers, often maintaining or even boosting ranking performance in most scenarios. They provide valuable insights for document differentiation and lead to statistically significant improvements in scale calibration, resulting in lower calibration error values compared to other calibration methods. Additi..."
How do i interpret table 2's calibration and ranking scores?,"Lower is better for calibration, higher is better for ranking","Interpreting Table 2's calibration and ranking scores involves understanding that lower values are better for calibration metrics such as CB-ECE, ECE, and MSE. The table compares the performance of different methods on two scale calibration datasets, TREC and NTCIR, using metrics like nDCG, nDCG@10, CB-ECE, ECE, and MSE. Significant improvements over a baseline method are marked with a dagger ..."
"What are the weights of ""Uncalibrated monoBERT"" tuned on?",MSMarco,"The weights of ""Uncalibrated monoBERT"" are fine-tuned on MS MARCO."
What category was used to build and train literal explanation + BERT? what does this category mean?,Category F: training nle-based neural rankers on calibration data.,"Category F was used to build and train the literal explanation + BERT. This category involves training NLE-based neural rankers on calibration data. In this method, a BERT model is finetuned to process meta NLEs and produce scale-calibrated ranking scores. Specifically, in the scenario of the literal explanation approach, where each input is represented with two meta NLEs (one for relevance an..."
Is the 'trec-dl' in table 1 the same as the 'trec' in table 2?,Yes,"Yes, the 'trec-dl' in Table 1 is the same as the 'TREC' in Table 2."
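
Reading long responses side by side makes it easy to miss a wrong number. For the numeric questions, a quick check is to list which numbers from the reference answer never appear in the response. This is a small sketch, not a full evaluation method; `missing_numbers` is a hypothetical helper, and the two strings are copied from the second row of the table above.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def missing_numbers(expected: str, response: str) -> list[str]:
    """Return numbers that appear in the expected answer but not in the response."""
    found = set(NUMBER.findall(response))
    return [n for n in NUMBER.findall(expected) if n not in found]


expected_answer = "TREC: 0.529; NTCIR: 0.340."
control_response = "0.529 and 0.534"
missing = missing_numbers(expected_answer, control_response)
print(missing)
```

An empty list does not prove that a response is correct, since the right numbers could be attached to the wrong dataset, but a non-empty list is a reliable flag for a row worth re-reading.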


## Variant 1: concatenate row values

For this variant, you will concatenate each row of your extracted tables (stored in `df1` and `df2`). You will then create vectors of each of these rows, upsert them into a Pinecone namespace, and run a RAG pipeline to see how your LLM's responses differ from the actual answers and the control variant's answers, given this vectorization strategy.

In [None]:
# Quick reminder what your extracted tables look like:
# Table 1
df1.head()

Unnamed: 0,Metric,TREC-DL,NTCIR-14
0,# Queries (Train/Val/Test),| 97/53/67,48/16/16
1,Avg. # docs per query,282.7,345.3
2,Levels of relevance,4,5
3,Label dist. (low to high),58/22/14/6,48/23/17/8/3
4,Avg. query length,8.0,22.0


In [None]:
# Table 2
df2.head()

Unnamed: 0,Method,Category,TREC-ndcg,TREC-ndcg@10,TREC-CB-ECE,TREC-ECE,TREC-MSE,NTCIR-ndcg,NTCIR-ndcg@10,NTCIR-CB-ECE,NTCIR-ECE,NTCIR-MSE
0,Uncalibrated monoBERT,A,0.799,0.494,1.205,0.32,0.773,| 0.735,0.337,1.757,0.799,1.824
1,Post hoc + monoBERT,B,0.799,0.494,1.141,0.125,0.684,| 0.735,0.337,1.624,0.457,1.462
2,Finetune monoBERT,C,0.776,0.422,1.093,0.221,-~—«0.721 |,0.696,0.268,1.843,0.709,‘1.874
3,Finetune BERT,C,0.738,0.327,1.253,0.266,~=—0.785_ |,_ 0.727,0.285,1.756,0.546,«1.416
4,LLM prompting w/ rubrics,D,0.786,0.457,1.000',1.246,«2.137,| 0.728,0.328,1.2947,1.194,2.773


In [None]:
# Define a function that iterates through a dataframe and joins each row's values into a single string:

def concat_row_values(dataframe: pd.DataFrame) -> list[str]:
    """
    Concatenate all values per row in a dataframe, separated by ", ".

    :param dataframe: Dataframe containing rows you want to concatenate.
    :return: Concatenated row values.
    """
    return dataframe.apply(lambda row: ', '.join(row.astype(str)), axis=1).tolist()

In [None]:
df1_concat_rows = concat_row_values(df1)
df2_concat_rows = concat_row_values(df2)

In [None]:
# Preview, nice!
df2_concat_rows

['Uncalibrated monoBERT, A, 0.799, 0.494, 1.205, 0.320, 0.773, | 0.735, 0.337, 1.757, 0.799, 1.824',
 'Post hoc + monoBERT, B, 0.799, 0.494, 1.141, 0.125, 0.684, | 0.735, 0.337, 1.624, 0.457, 1.462',
 'Finetune monoBERT, C, 0.776, 0.422, 1.093, 0.221, -~—«0.721 |, 0.696, 0.268, 1.843, 0.709, ‘1.874',
 'Finetune BERT, C, 0.738, 0.327, 1.253, 0.266, ~=—0.785_ |, _ 0.727, 0.285, 1.756, 0.546, «1.416',
 "LLM prompting w/ rubrics, D, 0.786, 0.457, 1.000', 1.246, «2.137, | 0.728, 0.328, 1.2947, 1.194, 2.773",
 'Post hoc + MC Sampling LLM, E, 0.790, 0.473, 1.165, 0.145, 0.673, | 0.736, 0.364", 1.677, 0.472, ‘1.540',
 'Literal Explanation + BERT, F, 0.815*, 0.529%, 0.996°, 0.067*, 0.602" |, 0.742, 0.340, 1.534", 0.355, 1.3307',
 "Conditional Explanation + BERT, F, 0.822, —0.534*, 0.862', 0.428, ~—0.832_ |, 0.720, 0.322, 1.405', 0.2577, 1.2907"]

In [None]:
# Now you need to write a function to turn items into LlamaIndex Document objs so they can go in your indexing pipeline downstream:

def turn_data_into_documents(rows: list[str]) -> list[Document]:
    """
    Transform data into LlamaIndex Document objects.
    Document obj: llama_index >> core >> schema.py

    :param rows: Data you want to turn into Documents.
    :return: Document objects.
    """
    docs = []
    for i in rows:
        doc = Document(text=i)
        docs.append(doc)
    return docs

In [None]:
# Turn each concatenated row into Document objs:
df1_concat_rows_docs = turn_data_into_documents(df1_concat_rows)
df2_concat_rows_docs = turn_data_into_documents(df2_concat_rows)

### Run indexing pipeline

In [None]:
# Define namespace for Variant 1
v1_namespace = 'v1'

# Initialize vector store w/v1 namespace
v1_vector_store = initialize_vector_store(pinecone_index, v1_namespace)

# Define docs you'll send through indexing pipeline into v1_namespace
# You will combine the table contents you defined above w/the regular PDF contents from control
v1_docs = df1_concat_rows_docs + df2_concat_rows_docs + ctrl_docs

# Run pipeline
output = run_indexing_pipeline(v1_vector_store, v1_docs)

Upserted vectors:   0%|          | 0/63 [00:00<?, ?it/s]

In [None]:
# Confirm your v1 docs made it to the index, in the correct namespace
pinecone_index.describe_index_stats()

### Run RAG pipeline

In [None]:
# Run RAG pipeline for v1 use case
one_v1, two_v1, three_v1, four_v1, five_v1, six_v1, seven_v1 = run_rag_pipeline(v1_vector_store, QUERIES)


In [None]:
# Add variant 1's responses to `exp_results` dataframe:

v1_responses = [one_v1, two_v1, three_v1, four_v1, five_v1, six_v1, seven_v1]

exp_results['v1'] = v1_responses

exp_results

Unnamed: 0,ANSWER,control,v1
How does the average query length compare to the average document length in table 1?,The average query length is shorter in TREC (8) than it is in NTCIR (22). The average doc length is also shorter in TREC (70.9 vs 493.2),The average query length is shorter compared to the average document length in Table 1.,The average query length is shorter compared to the average document length in table 1.
What are the Literal Explanation + BERT method's ndcg@10 scores on both datasets in table 2?,TREC: 0.529; NTCIR: 0.340.,0.529 and 0.534,The Literal Explanation + BERT method's nDCG@10 scores on both datasets in Table 2 are 0.529 for the TREC dataset and 0.602 for the NTCIR dataset.
What impact do natural language explanations (NLEs) have on improving the calibration and overall effectiveness of these models in document ranking tasks?,NLEs lead to better calibrated neural rankers while maintaining or even boosting ranking performance in most scenarios,"Natural language explanations (NLEs) significantly enhance the scale calibration of neural rankers, often maintaining or even boosting ranking performance in most scenarios. They provide valuable insights for document differentiation and lead to statistically significant improvements in scale calibration, resulting in lower calibration error values compared to other calibration methods. Additi...","Natural language explanations (NLEs) significantly enhance the scale calibration of neural rankers, often maintaining or even boosting ranking performance in most scenarios. They provide valuable insights for document differentiation and lead to statistically significant improvements in scale calibration, resulting in lower calibration error values compared to other calibration methods. Additi..."
How do i interpret table 2's calibration and ranking scores?,"Lower is better for calibration, higher is better for ranking","Interpreting Table 2's calibration and ranking scores involves understanding that lower values are better for calibration metrics such as CB-ECE, ECE, and MSE. The table compares the performance of different methods on two scale calibration datasets, TREC and NTCIR, using metrics like nDCG, nDCG@10, CB-ECE, ECE, and MSE. Significant improvements over a baseline method are marked with a dagger ...","Interpret the calibration and ranking scores in Table 2 by noting that lower values are better for calibration metrics such as CB-ECE, ECE, and MSE. Statistically significant improvements over the ""Platt Scaling monoBERT"" baseline are marked with a dagger symbol. The table presents the performance of different methods on two scale calibration datasets, TREC and NTCIR, with metrics like nDCG, n..."
"What are the weights of ""Uncalibrated monoBERT"" tuned on?",MSMarco,"The weights of ""Uncalibrated monoBERT"" are fine-tuned on MS MARCO.","The weights of ""Uncalibrated monoBERT"" are fine-tuned on MS MARCO."
What category was used to build and train literal explanation + BERT? what does this category mean?,Category F: training nle-based neural rankers on calibration data.,"Category F was used to build and train the literal explanation + BERT. This category involves training NLE-based neural rankers on calibration data. In this method, a BERT model is finetuned to process meta NLEs and produce scale-calibrated ranking scores. Specifically, in the scenario of the literal explanation approach, where each input is represented with two meta NLEs (one for relevance an...","Category F was used to build and train literal explanation + BERT. This category involves training NLE-based neural rankers on calibration data. In this method, a BERT model is finetuned to process meta NLEs and produce scale-calibrated ranking scores. The approach includes generating and aggregating natural language explanations for query-document pairs, and then using these explanations to r..."
Is the 'trec-dl' in table 1 the same as the 'trec' in table 2?,Yes,"Yes, the 'trec-dl' in Table 1 is the same as the 'TREC' in Table 2.","Yes, the 'trec-dl' in Table 1 is the same dataset as the 'TREC' dataset mentioned in Table 2."


## Variant 2: Concatenate row values with header data

In [None]:
def concat_rows_with_headers(dataframe: pd.DataFrame) -> list[str]:
    """
    For each row, for each value, concatenate it with its column header.
    """
    return dataframe.apply(lambda row: ', '.join(f"{col}: {row[col]}" for col in dataframe.columns), axis=1).tolist()
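
To make the difference between Variant 1 and Variant 2 concrete, here is a minimal plain-Python sketch (no pandas) that builds both strings for one row. The `row` dict is a stand-in for a dataframe row, using the "Avg. doc. length" values from Table 1; it is an illustration, not part of the pipeline.

```python
# Stand-in for one dataframe row: column header -> cell value.
row = {"Metric": "Avg. doc. length", "TREC-DL": "70.9", "NTCIR-14": "493.2"}

# Variant 1: only the cell values, joined with ", ".
values_only = ", ".join(str(val) for val in row.values())

# Variant 2: each cell value paired with its column header.
with_headers = ", ".join(f"{col}: {val}" for col, val in row.items())

print(values_only)
print(with_headers)
```

In Variant 2 every number carries the name of the dataset it belongs to, so the embedded text no longer relies on column position alone.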


In [None]:
df1_rows_w_headers = concat_rows_with_headers(df1)
df2_rows_w_headers = concat_rows_with_headers(df2)

In [None]:
df1_rows_w_headers

['Metric: # Queries (Train/Val/Test), TREC-DL: | 97/53/67, NTCIR-14: 48/16/16',
 'Metric: Avg. # docs per query, TREC-DL: 282.7, NTCIR-14: 345.3',
 'Metric: Levels of relevance, TREC-DL: 4, NTCIR-14: 5',
 'Metric: Label dist. (low to high), TREC-DL: 58/22/14/6, NTCIR-14: 48/23/17/8/3',
 'Metric: Avg. query length, TREC-DL: 8.0, NTCIR-14: 22.0',
 'Metric: Avg. doc. length, TREC-DL: 70.9, NTCIR-14: 493.2']

In [None]:
df2_rows_w_headers

['Method: Uncalibrated monoBERT, Category: A, TREC-ndcg: 0.799, TREC-ndcg@10: 0.494, TREC-CB-ECE: 1.205, TREC-ECE: 0.320, TREC-MSE: 0.773, NTCIR-ndcg: | 0.735, NTCIR-ndcg@10: 0.337, NTCIR-CB-ECE: 1.757, NTCIR-ECE: 0.799, NTCIR-MSE: 1.824',
 'Method: Post hoc + monoBERT, Category: B, TREC-ndcg: 0.799, TREC-ndcg@10: 0.494, TREC-CB-ECE: 1.141, TREC-ECE: 0.125, TREC-MSE: 0.684, NTCIR-ndcg: | 0.735, NTCIR-ndcg@10: 0.337, NTCIR-CB-ECE: 1.624, NTCIR-ECE: 0.457, NTCIR-MSE: 1.462',
 'Method: Finetune monoBERT, Category: C, TREC-ndcg: 0.776, TREC-ndcg@10: 0.422, TREC-CB-ECE: 1.093, TREC-ECE: 0.221, TREC-MSE: -~—«0.721 |, NTCIR-ndcg: 0.696, NTCIR-ndcg@10: 0.268, NTCIR-CB-ECE: 1.843, NTCIR-ECE: 0.709, NTCIR-MSE: ‘1.874',
 'Method: Finetune BERT, Category: C, TREC-ndcg: 0.738, TREC-ndcg@10: 0.327, TREC-CB-ECE: 1.253, TREC-ECE: 0.266, TREC-MSE: ~=—0.785_ |, NTCIR-ndcg: _ 0.727, NTCIR-ndcg@10: 0.285, NTCIR-CB-ECE: 1.756, NTCIR-ECE: 0.546, NTCIR-MSE: «1.416',
 "Method: LLM prompting w/ rubrics, Catego

In [None]:
# Now turn your items into Document objects like before, so they can go into the indexing pipeline downstream
df1_rows_w_headers_docs = turn_data_into_documents(df1_rows_w_headers)
df2_rows_w_headers_docs = turn_data_into_documents(df2_rows_w_headers)

In [None]:
# Preview one of your Document objs
df1_rows_w_headers_docs[0]

Document(id_='cb245e9f-1204-498e-bafc-f0ef4f493202', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Metric: # Queries (Train/Val/Test), TREC-DL: | 97/53/67, NTCIR-14: 48/16/16', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')

### Run indexing pipeline

In [None]:
# Declare namespace
v2_namespace = 'v2'

# Initialize vector store w/v2 namespace
v2_vector_store = initialize_vector_store(pinecone_index, v2_namespace)

# Set up your docs
v2_docs = df1_rows_w_headers_docs + df2_rows_w_headers_docs + ctrl_docs

# Run your pipeline
output = run_indexing_pipeline(v2_vector_store, v2_docs)


Upserted vectors:   0%|          | 0/63 [00:00<?, ?it/s]

In [None]:
# Awesome, you have vectorized each row's values and column headers (per table), and
# upserted them all into Pinecone along with the 'control' vectors.

pinecone_index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'control': {'vector_count': 48},
                'v1': {'vector_count': 63},
                'v2': {'vector_count': 0}},
 'total_vector_count': 111}
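
The stats above still report 0 vectors for `v2` right after the upsert, while the later stats output reports 63, which suggests the count can lag behind the write. If a later step depends on the count, one option is to poll until it catches up. The helper below is a hypothetical sketch, not part of the Pinecone client: it assumes `get_stats` is a callable you supply that returns a plain dict shaped like the output above.

```python
import time
from typing import Callable


def wait_for_vector_count(
    get_stats: Callable[[], dict],
    namespace: str,
    expected: int,
    timeout_s: float = 60.0,
    poll_s: float = 2.0,
) -> bool:
    """Poll index stats until `namespace` holds at least `expected` vectors.

    Returns True once the count is reached, False if the timeout expires first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        stats = get_stats()
        # A namespace that does not exist yet is treated as holding 0 vectors.
        count = stats.get("namespaces", {}).get(namespace, {}).get("vector_count", 0)
        if count >= expected:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

How you turn the client's stats response into a plain dict depends on your client version, so that conversion is left to the `get_stats` callable.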

### Run RAG pipeline

In [None]:
# Run RAG pipeline for v2 use case
one_v2, two_v2, three_v2, four_v2, five_v2, six_v2, seven_v2 = run_rag_pipeline(v2_vector_store, QUERIES)


In [None]:
# Add variant 2's responses to `exp_results` dataframe:
v2_responses = [one_v2, two_v2, three_v2, four_v2, five_v2, six_v2, seven_v2]

exp_results['v2'] = v2_responses

exp_results

Unnamed: 0,ANSWER,control,v1,v2
How does the average query length compare to the average document length in table 1?,The average query length is shorter in TREC (8) than it is in NTCIR (22). The average doc length is also shorter in TREC (70.9 vs 493.2),The average query length is shorter compared to the average document length in Table 1.,The average query length is shorter compared to the average document length in table 1.,The average query length is significantly lower than the average document length in table 1.
What are the Literal Explanation + BERT method's ndcg@10 scores on both datasets in table 2?,TREC: 0.529; NTCIR: 0.340.,0.529 and 0.534,The Literal Explanation + BERT method's nDCG@10 scores on both datasets in Table 2 are 0.529 for the TREC dataset and 0.602 for the NTCIR dataset.,The Literal Explanation + BERT method's ndcg@10 scores on both datasets in table 2 are 0.529% for the TREC dataset and 0.340 for the NTCIR dataset.
What impact do natural language explanations (NLEs) have on improving the calibration and overall effectiveness of these models in document ranking tasks?,NLEs lead to better calibrated neural rankers while maintaining or even boosting ranking performance in most scenarios,"Natural language explanations (NLEs) significantly enhance the scale calibration of neural rankers, often maintaining or even boosting ranking performance in most scenarios. They provide valuable insights for document differentiation and lead to statistically significant improvements in scale calibration, resulting in lower calibration error values compared to other calibration methods. Additi...","Natural language explanations (NLEs) significantly enhance the scale calibration of neural rankers, often maintaining or even boosting ranking performance in most scenarios. They provide valuable insights for document differentiation and lead to statistically significant improvements in scale calibration, resulting in lower calibration error values compared to other calibration methods. Additi...","Natural language explanations (NLEs) significantly enhance the scale calibration of neural rankers, often maintaining or even boosting ranking performance in most scenarios. They provide valuable insights for document differentiation and lead to statistically significant improvements in scale calibration, resulting in lower calibration error values compared to other calibration methods. Additi..."
How do i interpret table 2's calibration and ranking scores?,"Lower is better for calibration, higher is better for ranking","Interpreting Table 2's calibration and ranking scores involves understanding that lower values are better for calibration metrics such as CB-ECE, ECE, and MSE. The table compares the performance of different methods on two scale calibration datasets, TREC and NTCIR, using metrics like nDCG, nDCG@10, CB-ECE, ECE, and MSE. Significant improvements over a baseline method are marked with a dagger ...","Interpret the calibration and ranking scores in Table 2 by noting that lower values are better for calibration metrics such as CB-ECE, ECE, and MSE. Statistically significant improvements over the ""Platt Scaling monoBERT"" baseline are marked with a dagger symbol. The table presents the performance of different methods on two scale calibration datasets, TREC and NTCIR, with metrics like nDCG, n...","Interpret the calibration and ranking scores in Table 2 by noting that lower values are better for calibration metrics such as CB-ECE, ECE, and MSE. Look for statistically significant improvements over the ""Platt Scaling monoBERT"" baseline, which are marked with a dagger symbol (†). Pay attention to the values in the columns for Ranking and Calibration metrics for different methods and dataset..."
"What are the weights of ""Uncalibrated monoBERT"" tuned on?",MSMarco,"The weights of ""Uncalibrated monoBERT"" are fine-tuned on MS MARCO.","The weights of ""Uncalibrated monoBERT"" are fine-tuned on MS MARCO.","The weights of ""Uncalibrated monoBERT"" are fine-tuned on MS MARCO."
What category was used to build and train literal explanation + BERT? what does this category mean?,Category F: training nle-based neural rankers on calibration data.,"Category F was used to build and train the literal explanation + BERT. This category involves training NLE-based neural rankers on calibration data. In this method, a BERT model is finetuned to process meta NLEs and produce scale-calibrated ranking scores. Specifically, in the scenario of the literal explanation approach, where each input is represented with two meta NLEs (one for relevance an...","Category F was used to build and train literal explanation + BERT. This category involves training NLE-based neural rankers on calibration data. In this method, a BERT model is finetuned to process meta NLEs and produce scale-calibrated ranking scores. The approach includes generating and aggregating natural language explanations for query-document pairs, and then using these explanations to r...","Category F was used to build and train literal explanation + BERT. This category involves training NLE-based neural rankers on calibration data. In this method, a BERT model is finetuned to process meta NLEs and produce scale-calibrated ranking scores. For the literal explanation approach, this category involves generating and aggregating natural language explanations for query-document pairs ..."
Is the 'trec-dl' in table 1 the same as the 'trec' in table 2?,Yes,"Yes, the 'trec-dl' in Table 1 is the same as the 'TREC' in Table 2.","Yes, the 'trec-dl' in Table 1 is the same dataset as the 'TREC' dataset mentioned in Table 2.","Yes, the 'trec-dl' in table 1 is the same as the 'TREC' in table 2."


## Variant 3: Concatenate row values w/header data *and* table description data


In [None]:
# Make function to extract table descriptions from the PDF
# Note: this function includes transforming the table descriptions into LlamaIndex Document objs, so no need to do that later

def extract_table_description(docs: list[Document], start_phrase: str, end_phrase: str) -> list[Document]:
    """
    Extract descriptions of embedded tables.

    :param docs: LlamaIndex documents you want to search through to find table descriptions.
    :param start_phrase: The starting boundary of your table description (inclusive).
    :param end_phrase: The ending boundary of your table description (inclusive).
    :return: Table descriptions, as LlamaIndex Document objects.
    """
    # Note: the phrases are interpolated as-is, so any regex metacharacters in them are treated as regex syntax.
    pattern = fr"(.|^)({start_phrase}.*?{end_phrase})"
    table_desc = []
    for d in docs:
        match = re.search(pattern, d.text, re.DOTALL)
        if match:
            table_desc.append(match.group(2))
    return turn_data_into_documents(table_desc)  # Turn descriptions into LlamaIndex Document objs so they work in the pipeline
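
The extraction relies on a non-greedy match (`.*?`) combined with `re.DOTALL`, so the match can span line breaks but stops at the first occurrence of the end phrase. The sketch below shows that behavior on a made-up string. `find_description` is a hypothetical helper, and wrapping the phrases in `re.escape` is a variation on the notebook's code, added so that phrases containing characters like `.` or `(` are matched literally.

```python
import re
from typing import Optional


def find_description(text: str, start_phrase: str, end_phrase: str) -> Optional[str]:
    """Return the first span from start_phrase through end_phrase (inclusive), or None."""
    # re.escape makes both phrases match literally, even if they contain regex metacharacters.
    pattern = rf"{re.escape(start_phrase)}.*?{re.escape(end_phrase)}"
    # re.DOTALL lets "." match newlines, so the description may span several lines.
    match = re.search(pattern, text, re.DOTALL)
    return match.group(0) if match else None


# Made-up text: the end phrase appears twice, and the description wraps across a line break.
sample = (
    "Intro text.\n"
    "Table 1: Toy statistics for\n"
    "two datasets, capped at 512 tokens. More text mentioning 512 tokens again."
)

desc = find_description(sample, "Table 1", "512 tokens")
missing = find_description(sample, "Table 9", "512 tokens")
print(repr(desc))
print(missing)
```

Because the match is non-greedy, `desc` ends at the first "512 tokens" rather than the second one.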


In [None]:
# Extract descriptions for Table 1 and Table 2 from `ctrl_docs`
t1_desc = extract_table_description(ctrl_docs, "Table 1", "512 tokens")
t2_desc = extract_table_description(ctrl_docs, "Table 2", "marked with")  # Don't worry about special char in actual PDF at end of desc.


In [None]:
t1_desc[0].text

'Table 1: Statistics of the TREC-DL 2019-2022 and NTCIR-14\nWWW-2 Datasets. The lengths of queries and documents are\nquantified using BERT tokenization. For the NTCIR dataset,\ndocuments sourced from ClueWeb have undergone prepro-\ncessing to retain only the initial 512 tokens'

In [None]:
# Make deep copies of the original `df1_rows_w_headers_docs` and `df2_rows_w_headers_docs`, so the in-place updates below don't alter the Variant 2 documents.
copy_df1_rows_w_headers_docs = deepcopy(df1_rows_w_headers_docs)
copy_df2_rows_w_headers_docs = deepcopy(df2_rows_w_headers_docs)
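
A short sketch of why `deepcopy` is used here rather than a plain list copy. The `Doc` dataclass is a hypothetical stand-in for a LlamaIndex Document, used only so the example runs without extra dependencies.

```python
from copy import copy, deepcopy
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a document object with a mutable .text attribute."""
    text: str


originals = [Doc("row 1"), Doc("row 2")]

shallow = copy(originals)   # New list, but it holds the SAME Doc objects.
deep = deepcopy(originals)  # New list holding NEW Doc objects.

# Editing through the shallow copy also changes the original object.
shallow[0].text += ". description"

# Editing through the deep copy leaves the original untouched.
deep[1].text += ". description"

print(originals[0].text)
print(originals[1].text)
```

The first original picks up the appended description through the shallow copy, while the second stays unchanged because only its deep copy was edited.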

In [None]:
# Before in-place update, t1
copy_df1_rows_w_headers_docs[0].text

'Metric: # Queries (Train/Val/Test), TREC-DL: | 97/53/67, NTCIR-14: 48/16/16'

In [None]:
# Before in-place update, t2
copy_df2_rows_w_headers_docs[0].text

'Method: Uncalibrated monoBERT, Category: A, TREC-ndcg: 0.799, TREC-ndcg@10: 0.494, TREC-CB-ECE: 1.205, TREC-ECE: 0.320, TREC-MSE: 0.773, NTCIR-ndcg: | 0.735, NTCIR-ndcg@10: 0.337, NTCIR-CB-ECE: 1.757, NTCIR-ECE: 0.799, NTCIR-MSE: 1.824'

In [None]:
# Update .text attributes of each Document obj in-place to include the extracted table descriptions:

def add_in_table_description(docs: list[Document], desc: list[Document]) -> None:
    """
    Add description for embedded table to Document item (in-place).

    :param docs: Documents whose .text attribute you want to update with a Table description.
    :param desc: The Table description you want to add to a Document's .text attribute.
    """
    # Only the first extracted description (desc[0]) is appended to each Document.
    for doc in docs:
        doc.text += f". {desc[0].text}"

In [None]:
# Add Table 1 desc to Table 1 concatenated rows and headers
add_in_table_description(copy_df1_rows_w_headers_docs, t1_desc)

# Add Table 2 desc to Table 2 concatenated rows and headers
add_in_table_description(copy_df2_rows_w_headers_docs, t2_desc)

In [None]:
# After in-place update, t1
copy_df1_rows_w_headers_docs[0].text

'Metric: # Queries (Train/Val/Test), TREC-DL: | 97/53/67, NTCIR-14: 48/16/16. Table 1: Statistics of the TREC-DL 2019-2022 and NTCIR-14\nWWW-2 Datasets. The lengths of queries and documents are\nquantified using BERT tokenization. For the NTCIR dataset,\ndocuments sourced from ClueWeb have undergone prepro-\ncessing to retain only the initial 512 tokens'

In [None]:
# After in-place update, t2
copy_df2_rows_w_headers_docs[0].text

'Method: Uncalibrated monoBERT, Category: A, TREC-ndcg: 0.799, TREC-ndcg@10: 0.494, TREC-CB-ECE: 1.205, TREC-ECE: 0.320, TREC-MSE: 0.773, NTCIR-ndcg: | 0.735, NTCIR-ndcg@10: 0.337, NTCIR-CB-ECE: 1.757, NTCIR-ECE: 0.799, NTCIR-MSE: 1.824. Table 2: Ranking and scale calibration performance of baseline methods and our approaches on two scale calibration datasets\nTREC and NTCIR. Note that lower is better with calibration metrics (CB-ECE, ECE and MSE). Statistically significant improve-\nments over “Platt Scaling monoBERT” are marked with'

In [None]:
# Rename your vars so they reflect the addition of the table descriptions
df1_rows_w_headers_desc = copy_df1_rows_w_headers_docs
df2_rows_w_headers_desc = copy_df2_rows_w_headers_docs

In [None]:
# Preview one doc after adding the Table 2 description
df2_rows_w_headers_desc[-2].text

'Method: Literal Explanation + BERT, Category: F, TREC-ndcg: 0.815*, TREC-ndcg@10: 0.529%, TREC-CB-ECE: 0.996°, TREC-ECE: 0.067*, TREC-MSE: 0.602" |, NTCIR-ndcg: 0.742, NTCIR-ndcg@10: 0.340, NTCIR-CB-ECE: 1.534", NTCIR-ECE: 0.355, NTCIR-MSE: 1.3307. Table 2: Ranking and scale calibration performance of baseline methods and our approaches on two scale calibration datasets\nTREC and NTCIR. Note that lower is better with calibration metrics (CB-ECE, ECE and MSE). Statistically significant improve-\nments over “Platt Scaling monoBERT” are marked with'

### Run indexing pipeline

In [None]:
# Declare namespace
v3_namespace = 'v3'

# Initialize vector store w/v3 namespace
v3_vector_store = initialize_vector_store(pinecone_index, v3_namespace)

# Join docs
v3_docs = df1_rows_w_headers_desc + df2_rows_w_headers_desc + ctrl_docs

# Run through embedding and indexing pipeline
output = run_indexing_pipeline(v3_vector_store, v3_docs)

Upserted vectors:   0%|          | 0/76 [00:00<?, ?it/s]

In [None]:
pinecone_index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'control': {'vector_count': 48},
                'v1': {'vector_count': 63},
                'v2': {'vector_count': 63},
                'v3': {'vector_count': 0}},
 'total_vector_count': 174}

### Run RAG pipeline

In [None]:
# Run RAG pipeline for v3 use case
one_v3, two_v3, three_v3, four_v3, five_v3, six_v3, seven_v3 = run_rag_pipeline(v3_vector_store, QUERIES)


In [None]:
# Add variant 3's responses to `exp_results` dataframe:
v3_responses = [one_v3, two_v3, three_v3, four_v3, five_v3, six_v3, seven_v3]

exp_results['v3'] = v3_responses

exp_results

Unnamed: 0,ANSWER,control,v1,v2,v3,v4
How does the average query length compare to the average document length in table 1?,The average query length is shorter in TREC (8) than it is in NTCIR (22). The average doc length is also shorter in TREC (70.9 vs 493.2),The average query length is shorter compared to the average document length in Table 1.,The average query length is shorter compared to the average document length in table 1.,The average query length is significantly lower than the average document length in table 1.,The average query length is shorter than the average document length in Table 1.,The average query length is significantly lower than the average document length in table 1.
What are the Literal Explanation + BERT method's ndcg@10 scores on both datasets in table 2?,TREC: 0.529; NTCIR: 0.340.,0.529 and 0.534,The Literal Explanation + BERT method's nDCG@10 scores on both datasets in Table 2 are 0.529 for the TREC dataset and 0.602 for the NTCIR dataset.,The Literal Explanation + BERT method's ndcg@10 scores on both datasets in table 2 are 0.529% for the TREC dataset and 0.340 for the NTCIR dataset.,"The Literal Explanation + BERT method's ndcg@10 scores on the TREC dataset and NTCIR dataset in table 2 are 0.529% and 0.340%, respectively.",The Literal Explanation + BERT method's nDCG@10 score on the TREC dataset is 0.529% and on the NTCIR dataset is 0.340.
What impact do natural language explanations (NLEs) have on improving the calibration and overall effectiveness of these models in document ranking tasks?,NLEs lead to better calibrated neural rankers while maintaining or even boosting ranking performance in most scenarios,"Natural language explanations (NLEs) significantly enhance the scale calibration of neural rankers, often maintaining or even boosting ranking performance in most scenarios. They provide valuable insights for document differentiation and lead to statistically significant improvements in scale calibration, resulting in lower calibration error values compared to other calibration methods. Additi...","Natural language explanations (NLEs) significantly enhance the scale calibration of neural rankers, often maintaining or even boosting ranking performance in most scenarios. They provide valuable insights for document differentiation and lead to statistically significant improvements in scale calibration, resulting in lower calibration error values compared to other calibration methods. Additi...","Natural language explanations (NLEs) significantly enhance the scale calibration of neural rankers, often maintaining or even boosting ranking performance in most scenarios. They provide valuable insights for document differentiation and lead to statistically significant improvements in scale calibration, resulting in lower calibration error values compared to other calibration methods. Additi...","Natural language explanations (NLEs) significantly enhance the scale calibration of neural rankers, often maintaining or even boosting ranking performance in most scenarios. They provide valuable insights for document differentiation and lead to statistically significant improvements in scale calibration, exhibiting lower error values compared to other calibration methods. 
Additionally, NLEs c...","Natural language explanations (NLEs) significantly enhance the scale calibration of neural rankers, often maintaining or even boosting ranking performance in most scenarios. They provide valuable insights for document differentiation and lead to statistically significant improvements in scale calibration, exhibiting lower error values compared to other calibration methods. Additionally, NLEs h..."
How do i interpret table 2's calibration and ranking scores?,"Lower is better for calibration, higher is better for ranking","Interpreting Table 2's calibration and ranking scores involves understanding that lower values are better for calibration metrics such as CB-ECE, ECE, and MSE. The table compares the performance of different methods on two scale calibration datasets, TREC and NTCIR, using metrics like nDCG, nDCG@10, CB-ECE, ECE, and MSE. Significant improvements over a baseline method are marked with a dagger ...","Interpret the calibration and ranking scores in Table 2 by noting that lower values are better for calibration metrics such as CB-ECE, ECE, and MSE. Statistically significant improvements over the ""Platt Scaling monoBERT"" baseline are marked with a dagger symbol. The table presents the performance of different methods on two scale calibration datasets, TREC and NTCIR, with metrics like nDCG, n...","Interpret the calibration and ranking scores in Table 2 by noting that lower values are better for calibration metrics such as CB-ECE, ECE, and MSE. Look for statistically significant improvements over the ""Platt Scaling monoBERT"" baseline, which are marked with a dagger symbol (†). Pay attention to the values in the columns for Ranking and Calibration metrics for different methods and dataset...","The calibration scores in Table 2, represented by CB-ECE, ECE, and MSE, are metrics used to evaluate the alignment of the ranking scores produced by different methods with the target scale. Lower values for these metrics indicate better calibration performance. On the other hand, the ranking scores, such as nDCG and nDCG@10, measure the effectiveness of the ranking produced by the methods, whe...","Interpret the calibration and ranking scores in Table 2 by noting that lower values are better for calibration metrics such as CB-ECE, ECE, and MSE. 
Look for statistically significant improvements over the ""Platt Scaling monoBERT"" baseline, which are marked with a dagger symbol (†). Pay attention to the values in the columns for Ranking and Calibration metrics for different methods and dataset..."
"What are the weights of ""Uncalibrated monoBERT"" tuned on?",MSMarco,"The weights of ""Uncalibrated monoBERT"" are fine-tuned on MS MARCO.","The weights of ""Uncalibrated monoBERT"" are fine-tuned on MS MARCO.","The weights of ""Uncalibrated monoBERT"" are fine-tuned on MS MARCO.","The weights of ""Uncalibrated monoBERT"" are fine-tuned on MS MARCO.","The weights of ""Uncalibrated monoBERT"" are fine-tuned on MS MARCO."
What category was used to build and train literal explanation + BERT? what does this category mean?,Category F: training nle-based neural rankers on calibration data.,"Category F was used to build and train the literal explanation + BERT. This category involves training NLE-based neural rankers on calibration data. In this method, a BERT model is finetuned to process meta NLEs and produce scale-calibrated ranking scores. Specifically, in the scenario of the literal explanation approach, where each input is represented with two meta NLEs (one for relevance an...","Category F was used to build and train literal explanation + BERT. This category involves training NLE-based neural rankers on calibration data. In this method, a BERT model is finetuned to process meta NLEs and produce scale-calibrated ranking scores. The approach includes generating and aggregating natural language explanations for query-document pairs, and then using these explanations to r...","Category F was used to build and train literal explanation + BERT. This category involves training NLE-based neural rankers on calibration data. In this method, a BERT model is finetuned to process meta NLEs and produce scale-calibrated ranking scores. For the literal explanation approach, this category involves generating and aggregating natural language explanations for query-document pairs ...","Category F was used to build and train the literal explanation + BERT model. This category involves training NLE-based neural rankers on calibration data. In this approach, a BERT model is finetuned to process meta natural language explanations (NLEs) and generate scale-calibrated ranking scores.","Category F was used to build and train the ""Literal Explanation + BERT"" method. This category involves training NLE-based neural rankers on calibration data. In this method, a BERT model is finetuned to process meta NLEs and generate scale-calibrated ranking scores. 
The literal explanation approach utilizes natural language explanations for query-document pairs to enhance the ranking performan..."
Is the 'trec-dl' in table 1 the same as the 'trec' in table 2?,Yes,"Yes, the 'trec-dl' in Table 1 is the same as the 'TREC' in Table 2.","Yes, the 'trec-dl' in Table 1 is the same dataset as the 'TREC' dataset mentioned in Table 2.","Yes, the 'trec-dl' in table 1 is the same as the 'TREC' in table 2.",The 'trec-dl' in Table 1 is not the same as the 'trec' in Table 2.,"No, the 'trec-dl' in table 1 is not the same as the 'trec' in table 2."


## Variant 4: Natural language injection

For this variant, you will inject the structured data values from your table into a phrase (or sentence) and vectorize that.
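
As a minimal illustration of the idea before the dataframe-based functions, the sketch below fills a sentence template from a single row using plain `str.format`. The `row` dict and its key names are hypothetical; the values are the "Avg. query length" figures from Table 1.

```python
# Sentence template with named placeholders for one table row.
template = (
    "The {metric} in the {dataset_1} dataset is {val_1}, "
    "while it's {val_2} in the {dataset_2} dataset"
)

# Hypothetical row: one metric measured on two datasets.
row = {
    "metric": "Avg. query length",
    "dataset_1": "TREC-DL",
    "val_1": "8.0",
    "dataset_2": "NTCIR-14",
    "val_2": "22.0",
}

sentence = template.format(**row)
print(sentence)
```

The resulting sentence states which value belongs to which dataset in ordinary language, which is the text that gets vectorized in this variant.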

In [None]:
# Reminder of what df1 looks like
df1

Unnamed: 0,Metric,TREC-DL,NTCIR-14
0,# Queries (Train/Val/Test),| 97/53/67,48/16/16
1,Avg. # docs per query,282.7,345.3
2,Levels of relevance,4,5
3,Label dist. (low to high),58/22/14/6,48/23/17/8/3
4,Avg. query length,8.0,22.0
5,Avg. doc. length,70.9,493.2


In [None]:
# Write func that comes up with a natural language phrase/sentence/paragraph that makes Table 1's data make sense.
# You could also do this via an LLM if scaling this workflow is important.

def inject_df1_vals_into_template(dataframe: pd.DataFrame) -> list[str]:
    """
    Inject values from Table 1 dataframe (`df1`) into a natural-language template.

    :param dataframe: Dataframe representing Table 1 from PDF.
    :return: Populated natural-language templates.
    """
    filled_templates = []
    for _, row in dataframe.iterrows():
        # Use .iloc for positional access; integer keys on a labeled Series are deprecated in recent pandas.
        metric_name = row.iloc[0]
        val_1 = row.iloc[1]
        val_2 = row.iloc[2]
        dataset_1 = dataframe.columns[1]
        dataset_2 = dataframe.columns[2]
        template = f"The {metric_name} in the {dataset_1} dataset is {val_1}, while it's {val_2} in the {dataset_2} dataset"
        filled_templates.append(template)
    return filled_templates


In [None]:
df1_filled_templates = inject_df1_vals_into_template(df1)


In [None]:
# Check out one of the phrases you made:
df1_filled_templates[0]

"The # Queries (Train/Val/Test) in the TREC-DL dataset is | 97/53/67, while it's 48/16/16 in the NTCIR-14 dataset"

In [None]:
# Do the same with df2
# Reminder of what df2 looks like
df2

Unnamed: 0,Method,Category,TREC-ndcg,TREC-ndcg@10,TREC-CB-ECE,TREC-ECE,TREC-MSE,NTCIR-ndcg,NTCIR-ndcg@10,NTCIR-CB-ECE,NTCIR-ECE,NTCIR-MSE
0,Uncalibrated monoBERT,A,0.799,0.494,1.205,0.320,0.773,| 0.735,0.337,1.757,0.799,1.824
1,Post hoc + monoBERT,B,0.799,0.494,1.141,0.125,0.684,| 0.735,0.337,1.624,0.457,1.462
2,Finetune monoBERT,C,0.776,0.422,1.093,0.221,-~—«0.721 |,0.696,0.268,1.843,0.709,‘1.874
3,Finetune BERT,C,0.738,0.327,1.253,0.266,~=—0.785_ |,_ 0.727,0.285,1.756,0.546,«1.416
4,LLM prompting w/ rubrics,D,0.786,0.457,1.000',1.246,«2.137,| 0.728,0.328,1.2947,1.194,2.773
5,Post hoc + MC Sampling LLM,E,0.790,0.473,1.165,0.145,0.673,| 0.736,"0.364""",1.677,0.472,‘1.540
6,Literal Explanation + BERT,F,0.815*,0.529%,0.996°,0.067*,"0.602"" |",0.742,0.340,"1.534""",0.355,1.3307
7,Conditional Explanation + BERT,F,0.822,—0.534*,0.862',0.428,~—0.832_ |,0.720,0.322,1.405',0.2577,1.2907


In [None]:
def inject_df2_vals_into_template(dataframe: pd.DataFrame) -> list[str]:
    """
    Inject values from Table 2 dataframe (`df2`) into a natural-language template.

    :param dataframe: Dataframe representing Table 2 from PDF.
    :return: Populated natural-language templates.
    """
    template = ("Against the {dataset} dataset, the \"{method}\" method (from the \"{category}\" category) got a nDCG score of {nDCG}, "
                "a nDCG@10 score of {nDCG10}, a CB-ECE score of {CB_ECE}, "
                "an ECE score of {ECE}, and an MSE score of {MSE}.")

    paragraphs = []  # To store the final paragraphs

    # Iterate over the dataframe
    for index, row in dataframe.iterrows():
        method = row['Method']
        category = row['Category']
        trec_values = {col.split('-', 1)[1]: val for col, val in row.items() if col.startswith('TREC')}
        ntcir_values = {col.split('-', 1)[1]: val for col, val in row.items() if col.startswith('NTCIR')}

        # Inject the values into the template for TREC
        trec_paragraph = template.format(
            dataset="TREC",
            method=method,
            category=category,
            nDCG=trec_values.get('ndcg', 'N/A'),
            nDCG10=trec_values.get('ndcg@10', 'N/A'),
            CB_ECE=trec_values.get('CB-ECE', 'N/A'),
            ECE=trec_values.get('ECE', 'N/A'),
            MSE=trec_values.get('MSE', 'N/A')
        )

        # Inject the values into the template for NTCIR
        ntcir_paragraph = template.format(
            dataset="NTCIR",
            method=method,
            category=category,
            nDCG=ntcir_values.get('ndcg', 'N/A'),
            nDCG10=ntcir_values.get('ndcg@10', 'N/A'),
            CB_ECE=ntcir_values.get('CB-ECE', 'N/A'),
            ECE=ntcir_values.get('ECE', 'N/A'),
            MSE=ntcir_values.get('MSE', 'N/A')
        )

        # Combine the TREC and NTCIR paragraphs
        combined_paragraph = trec_paragraph + " " + ntcir_paragraph
        paragraphs.append(combined_paragraph)

    return paragraphs


In [None]:
df2_filled_templates = inject_df2_vals_into_template(df2)

In [None]:
# Preview
df2_filled_templates[0]

'Against the TREC dataset, the "Uncalibrated monoBERT" method (from the "A" category) got a nDCG score of 0.799, a nDCG@10 score of 0.494, a CB-ECE score of 1.205, an ECE score of 0.320, and an MSE score of 0.773. Against the NTCIR dataset, the "Uncalibrated monoBERT" method (from the "A" category) got a nDCG score of | 0.735, a nDCG@10 score of 0.337, a CB-ECE score of 1.757, an ECE score of 0.799, and an MSE score of 1.824.'

In [None]:
df2_filled_templates[-2]

'Against the TREC dataset, the "Literal Explanation + BERT" method (from the "F" category) got a nDCG score of 0.815*, a nDCG@10 score of 0.529%, a CB-ECE score of 0.996°, an ECE score of 0.067*, and an MSE score of 0.602" |. Against the NTCIR dataset, the "Literal Explanation + BERT" method (from the "F" category) got a nDCG score of 0.742, a nDCG@10 score of 0.340, a CB-ECE score of 1.534", an ECE score of 0.355, and an MSE score of 1.3307.'
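The preview above shows OCR debris (stray `|`, `*`, `%`, and quote characters) carried straight into the sentences. One optional step, which is not part of the original experiment, would be to strip such characters from the edges of each value cell before templating. The sketch below uses a hypothetical helper, `strip_ocr_edge_noise`, and removes only leading and trailing non-alphanumeric characters:

```python
import re

# Runs of non-alphanumeric characters at the very start or very end of a cell.
_EDGE_NOISE = re.compile(r"^[^0-9A-Za-z]+|[^0-9A-Za-z]+$")


def strip_ocr_edge_noise(cell: object) -> str:
    """Remove stray symbols from both ends of a cell, leaving the middle untouched."""
    return _EDGE_NOISE.sub("", str(cell))


# Sample cells copied from the df2 output above
samples = ["| 0.735", "0.529%", "0.815*", "«2.137", '0.602" |', "1.3307"]
cleaned = [strip_ocr_edge_noise(s) for s in samples]
print(cleaned)
# ['0.735', '0.529', '0.815', '2.137', '0.602', '1.3307']
```

This is deliberately conservative. It cannot repair a value such as `1.3307`, where a footnote marker seems to have been read as a digit, and it would also drop a leading minus sign, so it should be applied only to columns known to hold non-negative scores.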

In [None]:
# Now turn both lists of filled-in templates into LlamaIndex Document objects:
df1_filled_templates_docs = turn_data_into_documents(df1_filled_templates)
df2_filled_templates_docs = turn_data_into_documents(df2_filled_templates)

### Run indexing pipeline

In [None]:
# Declare new namespace
v4_namespace = 'v4'

# Initialize vector store w/v4 namespace
v4_vector_store = initialize_vector_store(pinecone_index, v4_namespace)

# Join docs
v4_docs = df1_filled_templates_docs + df2_filled_templates_docs + ctrl_docs

# Run through embedding and indexing pipeline
output = run_indexing_pipeline(v4_vector_store, v4_docs)

Upserted vectors:   0%|          | 0/63 [00:00<?, ?it/s]

In [None]:
pinecone_index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'control': {'vector_count': 48},
                'v1': {'vector_count': 63},
                'v2': {'vector_count': 63},
                'v3': {'vector_count': 76},
                'v4': {'vector_count': 0}},
 'total_vector_count': 250}
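The stats above report 0 vectors in the `v4` namespace right after the upsert, which may simply mean the index had not yet reflected the new writes when the stats were read. A minimal sketch of a polling helper follows. `wait_for_vector_count` is a hypothetical name, and the helper assumes only that you can supply a callable returning a plain dict shaped like the output above:

```python
import time
from typing import Callable


def wait_for_vector_count(get_stats: Callable[[], dict], namespace: str, expected: int,
                          timeout_s: float = 60.0, poll_s: float = 2.0) -> bool:
    """Poll `get_stats` until `namespace` holds at least `expected` vectors or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        namespaces = get_stats().get("namespaces", {})
        count = namespaces.get(namespace, {}).get("vector_count", 0)
        if count >= expected:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


# Stand-in for the real stats call: reports 0 vectors twice, then 63.
_fake_counts = iter([0, 0, 63])


def fake_stats() -> dict:
    return {"namespaces": {"v4": {"vector_count": next(_fake_counts)}}}


ready = wait_for_vector_count(fake_stats, "v4", expected=63, timeout_s=5.0, poll_s=0.01)
print(ready)
```

In the notebook, `get_stats` could wrap `pinecone_index.describe_index_stats()`, assuming the response can be converted to a plain dict in your client version. Running the RAG pipeline only after this returns `True` would help rule out an empty namespace as the cause of a poor answer.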

### Run RAG pipeline

In [None]:
# Run RAG pipeline for v4 use case
one_v4, two_v4, three_v4, four_v4, five_v4, six_v4, seven_v4 = run_rag_pipeline(v4_vector_store, QUERIES)


In [None]:
# Add variant 4's responses to `exp_results` dataframe:
v4_responses = [one_v4, two_v4, three_v4, four_v4, five_v4, six_v4, seven_v4]

exp_results['v4'] = v4_responses

exp_results

Unnamed: 0,ANSWER,control,v1,v2,v3,v4
How does the average query length compare to the average document length in table 1?,The average query length is shorter in TREC (8) than it is in NTCIR (22). The average doc length is also shorter in TREC (70.9 vs 493.2),The average query length is shorter compared to the average document length in Table 1.,The average query length is shorter compared to the average document length in table 1.,The average query length is significantly lower than the average document length in table 1.,The average query length is shorter than the average document length in Table 1.,The average query length is significantly lower than the average document length in table 1.
What are the Literal Explanation + BERT method's ndcg@10 scores on both datasets in table 2?,TREC: 0.529; NTCIR: 0.340.,0.529 and 0.534,The Literal Explanation + BERT method's nDCG@10 scores on both datasets in Table 2 are 0.529 for the TREC dataset and 0.602 for the NTCIR dataset.,The Literal Explanation + BERT method's ndcg@10 scores on both datasets in table 2 are 0.529% for the TREC dataset and 0.340 for the NTCIR dataset.,"The Literal Explanation + BERT method's ndcg@10 scores on the TREC dataset and the NTCIR dataset in table 2 are 0.529% and 0.340%, respectively.",The Literal Explanation + BERT method's nDCG@10 score on the TREC dataset is 0.529% and on the NTCIR dataset is 0.340.
What impact do natural language explanations (NLEs) have on improving the calibration and overall effectiveness of these models in document ranking tasks?,NLEs lead to better calibrated neural rankers while maintaining or even boosting ranking performance in most scenarios,"Natural language explanations (NLEs) significantly enhance the scale calibration of neural rankers, often maintaining or even boosting ranking performance in most scenarios. They provide valuable insights for document differentiation and lead to statistically significant improvements in scale calibration, resulting in lower calibration error values compared to other calibration methods. Additi...","Natural language explanations (NLEs) significantly enhance the scale calibration of neural rankers, often maintaining or even boosting ranking performance in most scenarios. They provide valuable insights for document differentiation and lead to statistically significant improvements in scale calibration, resulting in lower calibration error values compared to other calibration methods. Additi...","Natural language explanations (NLEs) significantly enhance the scale calibration of neural rankers, often maintaining or even boosting ranking performance in most scenarios. They provide valuable insights for document differentiation and lead to statistically significant improvements in scale calibration, resulting in lower calibration error values compared to other calibration methods. Additi...","Natural language explanations (NLEs) significantly enhance the scale calibration of neural rankers, often maintaining or even boosting ranking performance in most scenarios. They provide valuable insights for document differentiation and lead to statistically significant improvements in scale calibration, resulting in lower calibration error values compared to other calibration methods. Additi...","Natural language explanations (NLEs) significantly enhance the scale calibration of neural rankers, often maintaining or even boosting ranking performance in most scenarios. They provide valuable insights for document differentiation and lead to statistically significant improvements in scale calibration, exhibiting lower error values compared to other calibration methods. Additionally, NLEs h..."
How do i interpret table 2's calibration and ranking scores?,"Lower is better for calibration, higher is better for ranking","Interpreting Table 2's calibration and ranking scores involves understanding that lower values are better for calibration metrics such as CB-ECE, ECE, and MSE. The table compares the performance of different methods on two scale calibration datasets, TREC and NTCIR, using metrics like nDCG, nDCG@10, CB-ECE, ECE, and MSE. Significant improvements over a baseline method are marked with a dagger ...","Interpret the calibration and ranking scores in Table 2 by noting that lower values are better for calibration metrics such as CB-ECE, ECE, and MSE. Statistically significant improvements over the ""Platt Scaling monoBERT"" baseline are marked with a dagger symbol. The table presents the performance of different methods on two scale calibration datasets, TREC and NTCIR, with metrics like nDCG, n...","Interpret the calibration and ranking scores in Table 2 by noting that lower values are better for calibration metrics such as CB-ECE, ECE, and MSE. Look for statistically significant improvements over the ""Platt Scaling monoBERT"" baseline, which are marked with a dagger symbol (†). Pay attention to the values in the columns for Ranking and Calibration metrics for different methods and dataset...","The calibration and ranking scores in Table 2 are evaluated based on different metrics such as nDCG, nDCG@10, CB-ECE, ECE, and MSE. Lower values are considered better for calibration metrics like CB-ECE, ECE, and MSE. The table compares the performance of baseline methods and new approaches on two scale calibration datasets, TREC and NTCIR. The results show how different methods perform in ter...","Interpret the calibration and ranking scores in Table 2 by noting that lower values are better for calibration metrics such as CB-ECE, ECE, and MSE. Look for statistically significant improvements over the ""Platt Scaling monoBERT"" baseline, which are marked with a dagger symbol (†). Pay attention to the values in the columns for Ranking and Calibration metrics for different methods and dataset..."
"What are the weights of ""Uncalibrated monoBERT"" tuned on?",MSMarco,"The weights of ""Uncalibrated monoBERT"" are fine-tuned on MS MARCO.","The weights of ""Uncalibrated monoBERT"" are fine-tuned on MS MARCO.","The weights of ""Uncalibrated monoBERT"" are fine-tuned on MS MARCO.","The weights of ""Uncalibrated monoBERT"" are fine-tuned on MS MARCO.","The weights of ""Uncalibrated monoBERT"" are fine-tuned on MS MARCO."
What category was used to build and train literal explanation + BERT? what does this category mean?,Category F: training nle-based neural rankers on calibration data.,"Category F was used to build and train the literal explanation + BERT. This category involves training NLE-based neural rankers on calibration data. In this method, a BERT model is finetuned to process meta NLEs and produce scale-calibrated ranking scores. Specifically, in the scenario of the literal explanation approach, where each input is represented with two meta NLEs (one for relevance an...","Category F was used to build and train literal explanation + BERT. This category involves training NLE-based neural rankers on calibration data. In this method, a BERT model is finetuned to process meta NLEs and produce scale-calibrated ranking scores. The approach includes generating and aggregating natural language explanations for query-document pairs, and then using these explanations to r...","Category F was used to build and train literal explanation + BERT. This category involves training NLE-based neural rankers on calibration data. In this method, a BERT model is finetuned to process meta NLEs and produce scale-calibrated ranking scores. For the literal explanation approach, this category involves generating and aggregating natural language explanations for query-document pairs ...","Category F was used to build and train the literal explanation + BERT model. This category involves training NLE-based neural rankers on calibration data. In this context, it means that the model was fine-tuned to process meta natural language explanations (NLEs) for query-document pairs and generate scale-calibrated ranking scores.","Category F was used to build and train the ""Literal Explanation + BERT"" method. This category involves training NLE-based neural rankers on calibration data. In this method, a BERT model is finetuned to process meta NLEs and generate scale-calibrated ranking scores. The literal explanation approach utilizes natural language explanations for query-document pairs to enhance the ranking performan..."
Is the 'trec-dl' in table 1 the same as the 'trec' in table 2?,Yes,"Yes, the 'trec-dl' in Table 1 is the same as the 'TREC' in Table 2.","Yes, the 'trec-dl' in Table 1 is the same dataset as the 'TREC' dataset mentioned in Table 2.","Yes, the 'trec-dl' in table 1 is the same as the 'TREC' in table 2.","Yes, the 'trec-dl' in Table 1 is the same as the 'TREC' in Table 2.","No, the 'trec-dl' in table 1 is not the same as the 'trec' in table 2."
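Comparing long responses by eye is error-prone. As a rough first pass for fact-style questions, you could check whether each response contains every expected value. The sketch below applies that idea to the nDCG@10 question, using excerpts of the responses from the table above. `contains_all` is a hypothetical helper, and plain substring matching is an assumption rather than the notebook's evaluation method:

```python
def contains_all(response: str, expected_values: list[str]) -> bool:
    """Return True if every expected value appears somewhere in the response."""
    return all(value in response for value in expected_values)


# Expected nDCG@10 scores for "Literal Explanation + BERT" (TREC, NTCIR)
expected = ["0.529", "0.340"]

# Excerpts of each variant's response to the nDCG@10 question
responses = {
    "control": "0.529 and 0.534",
    "v1": "0.529 for the TREC dataset and 0.602 for the NTCIR dataset.",
    "v2": "0.529% for the TREC dataset and 0.340 for the NTCIR dataset.",
    "v3": "0.529% and 0.340%, respectively.",
    "v4": "on the TREC dataset is 0.529% and on the NTCIR dataset is 0.340.",
}

hits = {name: contains_all(text, expected) for name, text in responses.items()}
print(hits)
# {'control': False, 'v1': False, 'v2': True, 'v3': True, 'v4': True}
```

Substring matching is lenient: it accepts `0.529%` even though the `%` is an OCR artifact, and it says nothing about whether a value is attributed to the right dataset. Treat it as a filter for obvious misses, not a replacement for reading the answers.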


In [None]:
# The end!