## **Dataset Exploration**
In this section we analyse the datasets used for our research. Understanding them up front enables proper pre-processing and a sound research design that yields meaningful insights. 
The primary aims are to understand the structure of each dataset, so that they can be unified into the final dataset, and to analyse the data distribution and characteristics, which enables efficient sampling. 

The datasets used for our research are FinQA, ConvFinQA, and FinDER.


In [51]:
# import all relevant libraries
import json
import random
import os

### **1. First Dataset Inspection**

**Load and Inspect the Samples**

In [52]:
# Import necessary libraries
import sys
sys.path.append('../src')
import json
from pathlib import Path
# For better display in notebooks
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

#### FinQA

In [53]:
import json
from pathlib import Path

finQA_train_file = Path("/Users/christel/Desktop/Thesis/thesis_repo/data/FinQA-main/dataset/train.json")
with open(finQA_train_file, 'r') as f:
    finQA_train_data = json.load(f)

print(f"Loaded {len(finQA_train_data)} training examples")
print(f"Data type: {type(finQA_train_data)}")

Loaded 6251 training examples
Data type: <class 'list'>


In [54]:
# Inspect the structure of the first sample
finQA_first_sample = finQA_train_data[0]
print(f"Sample type: {type(finQA_first_sample)}")
print(f"Sample keys: {list(finQA_first_sample.keys())}")
print(f"Number of keys: {len(finQA_first_sample.keys())}")

Sample type: <class 'dict'>
Sample keys: ['pre_text', 'post_text', 'filename', 'table_ori', 'table', 'qa', 'id', 'table_retrieved', 'text_retrieved', 'table_retrieved_all', 'text_retrieved_all']
Number of keys: 11


In [55]:
print(finQA_first_sample)

{'pre_text': ['interest rate to a variable interest rate based on the three-month libor plus 2.05% ( 2.05 % ) ( 2.34% ( 2.34 % ) as of october 31 , 2009 ) .', 'if libor changes by 100 basis points , our annual interest expense would change by $ 3.8 million .', 'foreign currency exposure as more fully described in note 2i .', 'in the notes to consolidated financial statements contained in item 8 of this annual report on form 10-k , we regularly hedge our non-u.s .', 'dollar-based exposures by entering into forward foreign currency exchange contracts .', 'the terms of these contracts are for periods matching the duration of the underlying exposure and generally range from one month to twelve months .', 'currently , our largest foreign currency exposure is the euro , primarily because our european operations have the highest proportion of our local currency denominated expenses .', 'relative to foreign currency exposures existing at october 31 , 2009 and november 1 , 2008 , a 10% ( 10 % )

In [56]:
# Detailed inspection of the first sample
for key, value in finQA_first_sample.items():
    print(f"\n📋 {key}:")
    if isinstance(value, str):
        print(f"   Type: string (length: {len(value)})")
        print(f"   Preview: {value[:100]}{'...' if len(value) > 100 else ''}")
    elif isinstance(value, list):
        print(f"   Type: list (length: {len(value)})")
        if len(value) > 0:
            print(f"   First item type: {type(value[0])}")
            if isinstance(value[0], dict):
                print(f"   First item keys: {list(value[0].keys())}")
    elif isinstance(value, dict):
        print(f"   Type: dict (keys: {list(value.keys())})")
    else:
        print(f"   Type: {type(value)}")
        print(f"   Value: {value}")


📋 pre_text:
   Type: list (length: 15)
   First item type: <class 'str'>

📋 post_text:
   Type: list (length: 35)
   First item type: <class 'str'>

📋 filename:
   Type: string (length: 20)
   Preview: ADI/2009/page_49.pdf

📋 table_ori:
   Type: list (length: 4)
   First item type: <class 'list'>

📋 table:
   Type: list (length: 4)
   First item type: <class 'list'>

📋 qa:
   Type: dict (keys: ['question', 'answer', 'explanation', 'ann_table_rows', 'ann_text_rows', 'steps', 'program', 'gold_inds', 'exe_ans', 'tfidftopn', 'program_re', 'model_input'])

📋 id:
   Type: string (length: 22)
   Preview: ADI/2009/page_49.pdf-1

📋 table_retrieved:
   Type: list (length: 2)
   First item type: <class 'dict'>
   First item keys: ['score', 'ind']

📋 text_retrieved:
   Type: list (length: 3)
   First item type: <class 'dict'>
   First item keys: ['score', 'ind']

📋 table_retrieved_all:
   Type: list (length: 4)
   First item type: <class 'dict'>
   First item keys: ['score', 'ind']

📋 text_retrie

**FinQA: Each training example is a dictionary with 11 keys:** <br>
"pre_text": the text before the table;<br>
"post_text": the text after the table;<br>
"filename": name of the PDF file;<br>
"table_ori": the original version of the table, as extracted from the document, before any preprocessing or normalization;<br>
"table": the table;<br>
"qa": {<br>
  "question": the question;<br>
  "answer": the final numeric/textual answer to the question;<br>
  "explanation": optional human-written explanation for the answer (often empty in FinQA);<br>
  "ann_table_rows": indices of table rows that are annotated as relevant (if the answer comes from a table);<br>
  "ann_text_rows": indices of relevant text passages (e.g., [1] refers to text_1) from model_input;<br>
  "steps": the symbolic execution steps used to compute the answer; each step has "op" (operation), "arg1" and "arg2" (operands), and "res" (result of the operation);<br>
  "program": the reasoning program;<br>
  "gold_inds": the gold supporting facts;<br>
  "exe_ans": the gold execution result;<br>
  "tfidftopn": top-n retrieved text chunks using a TF-IDF baseline;<br>
  "program_re": the reasoning program in nested format;<br>
  "model_input": a list of text chunks (tuples of text ID and content) used as input to the model.<br>
}<br>
"id": unique example id;<br>
"table_retrieved": a list of tables retrieved by a retriever model (e.g., BM25, DPR), each with a similarity score and ind (identifier);<br>
"text_retrieved": a list of retrieved text passages (usually from pre_text + post_text), sorted by similarity score;<br>
"table_retrieved_all": a complete list of table candidates along with their retrieval scores;<br>
"text_retrieved_all": all candidate text chunks (with scores), potentially from the whole document, ranked by relevance.<br>
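As a small illustration of how the keys listed above fit together, the sketch below flattens one record into the fields most relevant for QA evaluation. The record is a toy example with invented values, `summarize_finqa` is a hypothetical helper, and `gold_inds` is assumed to be a dict keyed by evidence id (e.g. `table_1`).

```python
# Toy record mirroring the FinQA keys described above; all values are invented.
sample = {
    "id": "DEMO/2020/page_1.pdf-1",
    "pre_text": ["revenue grew in 2020 ."],
    "post_text": ["see note 3 for details ."],
    "table": [["", "2020", "2019"], ["revenue", "120", "100"]],
    "qa": {
        "question": "what was the percentage change in revenue?",
        "answer": "20%",
        "program": "subtract(120, 100), divide(#0, 100)",
        # Assumption: gold_inds maps an evidence id to the supporting content.
        "gold_inds": {"table_1": "the revenue of 2020 is 120 ; the revenue of 2019 is 100 ;"},
        "exe_ans": 0.2,
    },
}


def summarize_finqa(sample):
    """Collect the QA-relevant fields of one record into a flat dict (hypothetical helper)."""
    qa = sample["qa"]
    return {
        "id": sample["id"],
        "question": qa["question"],
        "answer": qa["answer"],
        "program": qa.get("program"),
        "evidence_keys": sorted(qa.get("gold_inds", {})),
        "n_text": len(sample.get("pre_text", [])) + len(sample.get("post_text", [])),
        "n_table_rows": len(sample.get("table", [])),
    }


summary = summarize_finqa(sample)
print(summary)
```

Note that the question, answer, and program all live under `qa`, while the raw context (`pre_text`, `post_text`, `table`) sits at the top level of the record.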

#### ConvFinQA

In [57]:
print(os.listdir("/Users/christel/Desktop/Thesis/thesis_repo/data/ConvFinQA-main"))

['LICENSE', 'code', 'README.md', 'data']


In [58]:
ConvfinQA_turn_train_file = Path("/Users/christel/Desktop/Thesis/thesis_repo/data/ConvFinQA-main/data/train_turn.json")
with open(ConvfinQA_turn_train_file, 'r') as f:
    ConvfinQA_turn_train_data = json.load(f)

print(f"Loaded {len(ConvfinQA_turn_train_data)} training examples")
print(f"Data type: {type(ConvfinQA_turn_train_data)}")

Loaded 11104 training examples
Data type: <class 'list'>


In [59]:
ConvfinQA_train_file = Path("/Users/christel/Desktop/Thesis/thesis_repo/data/ConvFinQA-main/data/train.json")
with open(ConvfinQA_train_file, 'r') as f:
    ConvfinQA_train_data = json.load(f)

print(f"Loaded {len(ConvfinQA_train_data)} training examples")
print(f"Data type: {type(ConvfinQA_train_data)}")

Loaded 3037 training examples
Data type: <class 'list'>


In [60]:
for i, sample in enumerate(ConvfinQA_turn_train_data[:5]):
    dialogue = sample.get("annotation", {}).get("dialogue_break", [])
    print(f"\nSample {i}: Dialogue length = {len(dialogue)}")
    print(dialogue)



Sample 0: Dialogue length = 4
['what is the net cash from operating activities in 2009?', 'what about in 2008?', 'what is the difference?', 'what percentage change does this represent?']

Sample 1: Dialogue length = 4
['what is the net cash from operating activities in 2009?', 'what about in 2008?', 'what is the difference?', 'what percentage change does this represent?']

Sample 2: Dialogue length = 4
['what is the net cash from operating activities in 2009?', 'what about in 2008?', 'what is the difference?', 'what percentage change does this represent?']

Sample 3: Dialogue length = 4
['what is the net cash from operating activities in 2009?', 'what about in 2008?', 'what is the difference?', 'what percentage change does this represent?']

Sample 4: Dialogue length = 4
['what were revenues in 2008?', 'what were they in 2007?', 'what was the net change?', 'what is the percent change?']


In [61]:
for i, sample in enumerate(ConvfinQA_train_data[:5]):
    dialogue = sample.get("annotation", {}).get("dialogue_break", [])
    print(f"\nSample {i}: Dialogue length = {len(dialogue)}")
    print(dialogue)


Sample 0: Dialogue length = 4
['what is the net cash from operating activities in 2009?', 'what about in 2008?', 'what is the difference?', 'what percentage change does this represent?']

Sample 1: Dialogue length = 4
['what were revenues in 2008?', 'what were they in 2007?', 'what was the net change?', 'what is the percent change?']

Sample 2: Dialogue length = 4
['what was the total of net sales in 2001?', 'and what was that in 2000?', 'what was, then, the change in the total of net sales over the year?', 'and how much does this change represent in relation to that total in 2000, in percentage?']

Sample 3: Dialogue length = 6
['what was the change in the performance of the united parcel service inc . from 2004 to 2009?', 'and how much does this change represent in relation to that performance in 2004, in percentage?', 'what was the performance value of the s&p 500 index in 2009?', 'what was, then, the change in that performance from 2004 to 2009?', 'and how much does this change re

**train.json (Conversation-Level Format):** <br>
Each entry in this file represents a full multi-turn dialogue between a user and a system. It contains multiple interrelated QA pairs (dialogue_break) that often require the model to reason across dialogue history. This format is ideal for training and evaluating systems designed to handle conversational memory and context-aware reasoning.<br>
**train_turn.json (Turn-Level Format):**<br>
This version contains individual QA pairs, each treated as an independent training instance. While each turn includes metadata about the full dialogue (e.g., dialogue_break, turn_program), the structure is flattened to focus on single-turn question answering. It aligns closely with traditional QA datasets like FinQA and FinDER.<br>


The evaluation requires a unified dataset format that:<br>

- Ensures consistency across multiple QA datasets (FinQA, FinDER, ConvFinQA),
- Supports scalable benchmarking without additional engineering overhead,
- Enables clean input-output tracking across different RAG pipelines.<br>

The turn-level format (train_turn.json) satisfies these requirements by providing structurally uniform, self-contained QA pairs that are directly comparable to FinQA and FinDER. This consistency allows for streamlined preprocessing, batching, and evaluation across all models and datasets.<br>

Additionally, using the turn-level format avoids the added complexity of reconstructing dialogue context or implementing query-rewriting logic, an important consideration given the limited timeline of the project.<br>

To still account for conversational realism, a small subset of context-dependent examples from train.json may be used in a complementary analysis, providing qualitative insights into retriever performance under dialogue-aware conditions.<br>
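The relationship between the two files can be made concrete by grouping turn-level records back into conversations. The sketch below is a minimal illustration on toy records; it assumes that a turn-level `id` ends in `_<turn index>` (as the ids in the samples appear to), and `conversation_id` and `group_turns` are hypothetical helper names.

```python
from collections import defaultdict

# Toy turn-level records; ids are invented and only the keys used below are present.
turn_records = [
    {"id": "Single_DEMO/2009/page_1.pdf-1_1", "annotation": {"turn_ind": 1}},
    {"id": "Single_DEMO/2009/page_1.pdf-1_0", "annotation": {"turn_ind": 0}},
    {"id": "Single_DEMO/2010/page_7.pdf-2_0", "annotation": {"turn_ind": 0}},
]


def conversation_id(turn_id):
    """Strip the trailing '_<turn index>' (assumed id convention)."""
    return turn_id.rpartition("_")[0]


def group_turns(records):
    """Group turn-level records by conversation and order them by turn index."""
    groups = defaultdict(list)
    for record in records:
        groups[conversation_id(record["id"])].append(record)
    return {
        conv_id: sorted(turns, key=lambda r: r["annotation"]["turn_ind"])
        for conv_id, turns in groups.items()
    }


conversations = group_turns(turn_records)
for conv_id, turns in conversations.items():
    print(conv_id, [t["annotation"]["turn_ind"] for t in turns])
```

If the id convention holds on the real files, the number of groups should be comparable to the number of entries in train.json, which is a quick sanity check before committing to the turn-level format.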

In [62]:
# Inspect the structure of the first sample
ConvfinQA_first_sample = ConvfinQA_turn_train_data[0]
print(f"Sample type: {type(ConvfinQA_first_sample)}")
print(f"Sample keys: {list(ConvfinQA_first_sample.keys())}")
print(f"Number of keys: {len(ConvfinQA_first_sample.keys())}")

Sample type: <class 'dict'>
Sample keys: ['pre_text', 'post_text', 'filename', 'table_ori', 'table', 'qa', 'id', 'annotation']
Number of keys: 8


In [63]:
print(ConvfinQA_first_sample)

{'pre_text': ['26 | 2009 annual report in fiscal 2008 , revenues in the credit union systems and services business segment increased 14% ( 14 % ) from fiscal 2007 .', 'all revenue components within the segment experienced growth during fiscal 2008 .', 'license revenue generated the largest dollar growth in revenue as episys ae , our flagship core processing system aimed at larger credit unions , experienced strong sales throughout the year .', 'support and service revenue , which is the largest component of total revenues for the credit union segment , experienced 34 percent growth in eft support and 10 percent growth in in-house support .', 'gross profit in this business segment increased $ 9344 in fiscal 2008 compared to fiscal 2007 , due primarily to the increase in license revenue , which carries the highest margins .', 'liquidity and capital resources we have historically generated positive cash flow from operations and have generally used funds generated from operations and short

In [64]:
# Detailed inspection of the first sample
for key, value in ConvfinQA_first_sample.items():
    print(f"\n📋 {key}:")
    if isinstance(value, str):
        print(f"   Type: string (length: {len(value)})")
        print(f"   Preview: {value[:100]}{'...' if len(value) > 100 else ''}")
    elif isinstance(value, list):
        print(f"   Type: list (length: {len(value)})")
        if len(value) > 0:
            print(f"   First item type: {type(value[0])}")
            if isinstance(value[0], dict):
                print(f"   First item keys: {list(value[0].keys())}")
    elif isinstance(value, dict):
        print(f"   Type: dict (keys: {list(value.keys())})")
    else:
        print(f"   Type: {type(value)}")
        print(f"   Value: {value}")


📋 pre_text:
   Type: list (length: 9)
   First item type: <class 'str'>

📋 post_text:
   Type: list (length: 15)
   First item type: <class 'str'>

📋 filename:
   Type: string (length: 21)
   Preview: JKHY/2009/page_28.pdf

📋 table_ori:
   Type: list (length: 8)
   First item type: <class 'list'>

📋 table:
   Type: list (length: 7)
   First item type: <class 'list'>

📋 qa:
   Type: dict (keys: ['question', 'answer', 'explanation', 'ann_table_rows', 'ann_text_rows', 'steps', 'program', 'gold_inds', 'exe_ans', 'program_re'])

📋 id:
   Type: string (length: 32)
   Preview: Single_JKHY/2009/page_28.pdf-3_0

📋 annotation:
   Type: dict (keys: ['amt_table', 'amt_pre_text', 'amt_post_text', 'original_program', 'step_list', 'answer_list', 'dialogue_break', 'turn_program_ori', 'dialogue_break_ori', 'turn_program', 'qa_split', 'exe_ans_list', 'cur_program', 'cur_dial', 'exe_ans', 'cur_type', 'turn_ind', 'gold_ind'])


**ConvFinQA: Each training example (turn-level file) is a dictionary with 8 keys:** <br>
"pre_text": the text before the table;<br>
"post_text": the text after the table;<br>
"filename": name of the PDF file;<br>
"table_ori": the original version of the table, as extracted from the document, before any preprocessing or normalization;<br>
"table": the table;<br>
"qa": {<br>
  "question": the question;<br>
  "answer": the final numeric/textual answer to the question;<br>
  "ann_table_rows": indices of table rows that are annotated as relevant (if the answer comes from a table);<br>
  "ann_text_rows": indices of relevant text passages (e.g., [1] refers to text_1) from model_input;<br>
  "steps": the symbolic execution steps used to compute the answer; each step has "op" (operation), "arg1" and "arg2" (operands), and "res" (result of the operation);<br>
  "program": the reasoning program;<br>
  "gold_inds": the gold supporting facts;<br>
  "exe_ans": the execution results of each question turn.<br>
}<br>
"id": unique example id;<br>
"annotation": {<br>
  "original_program": original FinQA question;<br>
  "dialogue_break": the conversation, as a list of question turns;<br>
  "turn_program": the ground truth program for each question, corresponding to the list in "dialogue_break";<br>
  "cur_program": current program for this turn;<br>
  "cur_dial": current dialogue turn;<br>
  "gold_ind": highlighted content for evidence;<br>
  "turn_ind": index of this turn in the full dialogue;<br>
  "exe_ans_list": the execution results of each question turn.<br>
}<br>
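To show how the annotation fields relate, the following sketch recovers the current question, its dialogue history, and its program from a turn-level record using `turn_ind`. The record is a toy example with invented numbers, and `current_turn` is a hypothetical helper; it relies only on `dialogue_break` and `turn_program` being parallel lists, as described above.

```python
# Toy turn-level record; questions follow the style of the samples, numbers are invented.
sample = {
    "id": "Single_DEMO/2009/page_1.pdf-1_2",
    "annotation": {
        "dialogue_break": [
            "what were revenues in 2008?",
            "what were they in 2007?",
            "what was the net change?",
            "what is the percent change?",
        ],
        "turn_program": [
            "120",
            "100",
            "subtract(120, 100)",
            "subtract(120, 100), divide(#0, 100)",
        ],
        "turn_ind": 2,
    },
}


def current_turn(sample):
    """Return the question, prior turns, and program for the turn this record represents."""
    ann = sample["annotation"]
    idx = ann["turn_ind"]
    return {
        "question": ann["dialogue_break"][idx],
        "history": ann["dialogue_break"][:idx],
        "program": ann["turn_program"][idx],
    }


turn = current_turn(sample)
print(turn["question"])
print(turn["history"])
print(turn["program"])
```

The example also makes the context dependence visible: "what was the net change?" is only answerable together with the two preceding turns in `history`.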

#### FinDER

In [65]:
import json
from pathlib import Path

finder_train_file = Path("/Users/christel/Desktop/Thesis/thesis_repo/data/FinDER/train.jsonl")
finder_train_data = []
with open(finder_train_file, 'r') as f:
    for line in f:
        finder_train_data.append(json.loads(line))

print(f"Loaded {len(finder_train_data)} training examples")
print(f"Data type: {type(finder_train_data)}")

Loaded 5703 training examples
Data type: <class 'list'>


In [66]:
# Inspect the structure of the first sample
finder_first_sample = finder_train_data[0]
print(f"Sample type: {type(finder_first_sample)}")
print(f"Sample keys: {list(finder_first_sample.keys())}")
print(f"Number of keys: {len(finder_first_sample.keys())}")

Sample type: <class 'dict'>
Sample keys: ['_id', 'text', 'reasoning', 'category', 'references', 'answer', 'type']
Number of keys: 7


In [67]:
print(finder_train_data[0])

{'_id': 'b33fcee7', 'text': 'Delta in CBOE Data & Access Solutions rev from 2021-23.', 'reasoning': True, 'category': 'Financials', 'references': ['Cboe Global Markets, Inc. and Subsidiaries\n\nConsolidated Statements of Income\n\nYears ended December 31, 2023, 2022, and 2021\n\n(In millions, except per share data)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n    \n\n2023\n\n    \n\n2022\n\n    \n\n2021\n\n \n\nRevenues:\n\n\n\n\n\n\n\n\n\n\n\nCash and spot markets\n\n\n$\n\n1,445.1\n\n\n$\n\n1,777.6\n\n\n$\n\n1,660.5\n\n\nData and access solutions\n\n\n\n539.2\n\n\n\n497.0\n\n\n\n427.7\n\n\nDerivatives markets\n\n\n \n\n1,789.2\n\n\n \n\n1,683.9\n\n\n \n\n1,406.6\n\n\nTotal revenues\n\n\n \n\n3,773.5\n\n\n \n\n3,958.5\n\n\n \n\n3,494.8\n\n\nCost of revenues:\n\n\n\n\n\n\n\n\n\n\n\n  Liquidity payments\n\n\n \n\n1,385.8\n\n\n \n\n1,670.2\n\n\n \n\n1,650.7\n\n\n  Routing and clearing\n\n\n\n79.1\n\n\n\n83.2\n\n\n\n87.8\n\n\n  Section 31 fees\n\n\n\n185.7\n\n\n\n329.8\n\n\n\n179.6\n\n\n  Royalty fees an

In [68]:
# Detailed inspection of the first sample
for key, value in finder_first_sample.items():
    print(f"\n📋 {key}:")
    if isinstance(value, str):
        print(f"   Type: string (length: {len(value)})")
        print(f"   Preview: {value[:100]}{'...' if len(value) > 100 else ''}")
    elif isinstance(value, list):
        print(f"   Type: list (length: {len(value)})")
        if len(value) > 0:
            print(f"   First item type: {type(value[0])}")
            if isinstance(value[0], dict):
                print(f"   First item keys: {list(value[0].keys())}")
    elif isinstance(value, dict):
        print(f"   Type: dict (keys: {list(value.keys())})")
    else:
        print(f"   Type: {type(value)}")
        print(f"   Value: {value}")


📋 _id:
   Type: string (length: 8)
   Preview: b33fcee7

📋 text:
   Type: string (length: 55)
   Preview: Delta in CBOE Data & Access Solutions rev from 2021-23.

📋 reasoning:
   Type: <class 'bool'>
   Value: True

📋 category:
   Type: string (length: 10)
   Preview: Financials

📋 references:
   Type: list (length: 1)
   First item type: <class 'str'>

📋 answer:
   Type: string (length: 133)
   Preview: The Data and Access Solutions revenue increased by $111.5 million from 2021 to 2023, calculated as 5...

📋 type:
   Type: string (length: 8)
   Preview: Subtract


**FinDER: Each training example is a dictionary with 7 keys:** <br>
"_id": unique identifier.<br>
"text": query that the model is expected to answer.<br>
"reasoning": indicates whether the question requires reasoning (e.g. logical inference, arithmetic operations) rather than simple lookup. True = reasoning required.<br>
"category": the semantic category of the question (e.g., Financials, Company overview, Footnotes, etc.).<br>
"references": the source text passages (e.g., extracted from tables or footnotes) that the model should consider when answering the question. <br>
"answer": the reference answer that the model should produce.<br>
"type": indicates the type of reasoning required to arrive at the answer. <br>
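Because FinDER carries explicit `category`, `reasoning`, and `type` labels, its distribution can be profiled with a few counters. The sketch below reads JSONL from an in-memory buffer so that it runs without the dataset; the records are toy examples, and label values other than "Financials", "Company overview", and "Subtract" are placeholders rather than actual FinDER labels.

```python
import io
import json
from collections import Counter

# Toy records following the FinDER keys above; texts and some labels are placeholders.
toy_records = [
    {"_id": "demo0001", "text": "Delta in revenue from 2021-23.", "reasoning": True,
     "category": "Financials", "references": ["toy passage"], "answer": "toy answer",
     "type": "Subtract"},
    {"_id": "demo0002", "text": "Total revenue in 2023?", "reasoning": False,
     "category": "Financials", "references": ["toy passage"], "answer": "toy answer",
     "type": "PlaceholderType"},
    {"_id": "demo0003", "text": "Main business segments?", "reasoning": True,
     "category": "Company overview", "references": ["toy passage"], "answer": "toy answer",
     "type": "PlaceholderType"},
]
jsonl_text = "\n".join(json.dumps(record) for record in toy_records) + "\n"


def read_jsonl(handle):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in handle if line.strip()]


records = read_jsonl(io.StringIO(jsonl_text))
category_counts = Counter(r["category"] for r in records)
type_counts = Counter(r["type"] for r in records if r["reasoning"])
reasoning_share = sum(r["reasoning"] for r in records) / len(records)

print(category_counts.most_common())
print(type_counts.most_common())
print(f"reasoning share: {reasoning_share:.2f}")
```

The same three statistics computed on the real file would give the strata needed for balanced sampling across categories and reasoning types.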


### **2. Construct the final dataset**
Since the original datasets differ in structure, we define a canonical schema so that dataset-specific formats do not confound the comparison and identical signals are logged across runs. It contains only the fields that are relevant for retrieval, answer checking, and analysis. The goal is to construct a dataset in which each row can be fed straight into each of the RAG models with no dataset-specific branches. <br>



The final dataset has the following structure: <br>

{
  "qid"          : "string",     // dataset-prefix + original id <br>
  "dataset"      : "FinQA | ConvFinQA | FinDER",<br>
  "question"     : "string",<br>
  "answer"       : "string",     // canonicalised (see §4)<br>
  "context_text" : ["string"],   // list of text passages (sentences or 100-token chunks)<br>
  "context_table": [["string"]], // normalised table (may be [])<br>
  "reasoning"    : true|false,   // FinDER field → others: len(steps)>1<br>
  "reason_type"  : "string|null",// FinDER.type or Conv/FinQA program tag<br>
  "gold_text_id" : ["string"],   // evidence indices, empty if not provided<br>
  "gold_table_row":[int],        // ^ <br>
  "meta"         : { ... }       // any extra fields you still need<br>
}
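A lightweight check can catch rows that drift from this schema before they reach a RAG pipeline. The following is a minimal sketch, not part of the original pipeline: `REQUIRED_FIELDS`, `ALLOWED_DATASETS`, and `validate_row` are hypothetical names, and only top-level types are checked.

```python
# Expected top-level type for each required field of the canonical schema.
REQUIRED_FIELDS = {
    "qid": str,
    "dataset": str,
    "question": str,
    "answer": str,
    "context_text": list,
    "context_table": list,
    "reasoning": bool,
    "gold_text_id": list,
    "gold_table_row": list,
    "meta": dict,
}
ALLOWED_DATASETS = {"FinQA", "ConvFinQA", "FinDER"}


def validate_row(row):
    """Return a list of schema violations; an empty list means the row conforms."""
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in row:
            errors.append(f"missing field: {name}")
        elif not isinstance(row[name], expected):
            errors.append(f"{name}: expected {expected.__name__}")
    if "dataset" in row and row["dataset"] not in ALLOWED_DATASETS:
        errors.append(f"unknown dataset: {row['dataset']}")
    # reason_type may be a string or null, so it is checked separately.
    if "reason_type" not in row:
        errors.append("missing field: reason_type")
    elif row["reason_type"] is not None and not isinstance(row["reason_type"], str):
        errors.append("reason_type: expected str or None")
    return errors


good_row = {
    "qid": "FinDER_demo0001",
    "dataset": "FinDER",
    "question": "Delta in revenue from 2021-23.",
    "answer": "toy answer",
    "context_text": ["toy passage"],
    "context_table": [],
    "reasoning": True,
    "reason_type": "Subtract",
    "gold_text_id": [],
    "gold_table_row": [],
    "meta": {},
}
bad_row = {k: v for k, v in good_row.items() if k != "answer"}
bad_row["dataset"] = "Other"

print(validate_row(good_row))
print(validate_row(bad_row))
```

Running the validator over every normalised row is cheap and keeps the "no dataset-specific branches" goal honest when further datasets are added.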

**2.1 Normalize datasets**

2.1.1 FinQA

We sentence-split each pre_text/post_text entry, then greedily concatenate adjacent sentences for as long as the segment stays within 100 BPE tokens. A single sentence that already exceeds the budget is kept whole as its own segment. This follows best practice in prior RAG work (Lewis 2020; Izacard 2021) and balances retrieval precision with embedding quality.
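The packing rule can be traced on a toy input. This sketch swaps the BPE tokenizer for a whitespace word count so that it runs without extra dependencies; `pack_sentences` and `whitespace_count` are hypothetical names, and the budget of 5 is chosen only to keep the example short.

```python
def pack_sentences(sentences, max_tokens, count_tokens):
    """Greedily merge adjacent sentences while the merged segment fits the budget."""
    segments = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if not current or count_tokens(candidate) <= max_tokens:
            current = candidate
        else:
            # The candidate would exceed the budget: close the segment, start a new one.
            segments.append(current)
            current = sentence
    if current:
        segments.append(current)
    return segments


def whitespace_count(text):
    """Stand-in for a BPE token count (assumption: one token per whitespace-separated word)."""
    return len(text.split())


sentences = ["a b c", "d e", "f g h i j k l", "m"]
segments = pack_sentences(sentences, max_tokens=5, count_tokens=whitespace_count)
print(segments)
```

The second segment has seven words even though the budget is five: a sentence that is too long on its own is never split, so the limit is a target for merged segments rather than a hard upper bound.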

In [69]:
# pre-process finqa with the function in data_utils.py

import json
import random
from pathlib import Path
from data_utils import preprocess_finqa_dataset




In [71]:
def preprocess_finqa_sample(sample, tokenizer=None, max_bpe_tokens=100):
    # Set up tokenizer if not provided
    if tokenizer is None:
        tokenizer = tiktoken.get_encoding("cl100k_base")  # OpenAI's default

    # Helper: sentence split (simple, can be improved)
    def sentence_split(text):
        import re
        # Split on period, question mark, exclamation, or newline
        return [s.strip() for s in re.split(r'(?<=[.?!])\s+|\n', text) if s.strip()]

    # Helper: concatenate sentences ≤ max_bpe_tokens
    def segment_sentences(sentences):
        segments = []
        current = ""
        for sent in sentences:
            if not current:
                current = sent
            else:
                # Try adding the next sentence
                test = current + " " + sent
                if len(tokenizer.encode(test)) <= max_bpe_tokens:
                    current = test
                else:
                    segments.append(current)
                    current = sent
        if current:
            segments.append(current)
        return segments

    # 1. qid
    qid = "FinQA_" + str(sample["id"])
    # 2. dataset
    dataset = "FinQA"
    # 3. question
    question = sample["qa"]["question"]
    # 4. answer
    answer = sample["qa"]["answer"]
    # 5. context_text
    pre_text = sample.get("pre_text", [])
    post_text = sample.get("post_text", [])
    
    # Handle pre_text and post_text as lists of strings
    if isinstance(pre_text, list):
        pre_sentences = []
        for text_chunk in pre_text:
            if isinstance(text_chunk, str):
                pre_sentences.extend(sentence_split(text_chunk))
    else:
        pre_sentences = sentence_split(pre_text) if isinstance(pre_text, str) else []
    
    if isinstance(post_text, list):
        post_sentences = []
        for text_chunk in post_text:
            if isinstance(text_chunk, str):
                post_sentences.extend(sentence_split(text_chunk))
    else:
        post_sentences = sentence_split(post_text) if isinstance(post_text, str) else []
    
    sentences = pre_sentences + post_sentences
    context_text = segment_sentences(sentences)
    # 6. context_table
    context_table = sample.get("table")
    # 7. reasoning
    reasoning = len(sample["qa"].get("steps", [])) > 1
    # 8. reason_type
    steps = sample["qa"].get("steps", [])
    reason_type = steps[0]["op"] if steps else None
    # 9. gold_text_id
    gold_text_id = ["text_" + str(i) for i in sample["qa"].get("ann_text_rows",[])]
    # 10. gold_table_row
    gold_table_row = sample["qa"].get("ann_table_rows", [])
    # 11. meta
    meta = {
        # "tfidftopn" is nested under "qa"; the retrieval lists are top-level keys
        "tfidftopn": sample["qa"].get("tfidftopn"),
        "table_retrieved": sample.get("table_retrieved"),
        "text_retrieved": sample.get("text_retrieved"),
    }

    return {
        "qid": qid,
        "dataset": dataset,
        "question": question,
        "answer": answer,
        "context_text": context_text,
        "context_table": context_table,
        "reasoning": reasoning,
        "reason_type": reason_type,
        "gold_text_id": gold_text_id,
        "gold_table_row": gold_table_row,
        "meta": meta,
    }

# Wrapper to process a whole dataset
def preprocess_finqa_dataset(finqa_data, tokenizer=None, max_bpe_tokens=100):
    return [preprocess_finqa_sample(sample, tokenizer, max_bpe_tokens) for sample in finqa_data]

In [74]:
# preprocess_finqa_sample falls back to tiktoken when no tokenizer is passed,
# so the module must be imported before the first call
import tiktoken

In [None]:
# pre-process the finqa dataset
finqa_processed = preprocess_finqa_dataset(finQA_train_data)

# Analysis of the processed data
print(f"Successfully processed {len(finqa_processed)} FinQA samples")
print(f"Sample structure: {list(finqa_processed[0].keys())}")

# Show first processed sample
print("\n📊 First processed sample:")
first_processed = finqa_processed[0]
for key, value in first_processed.items():
    if key == 'context_text':
        print(f"  {key}: {len(value)} text segments")
        print(f"    First segment: {value[0][:100]}...")
    elif key == 'context_table':
        print(f"  {key}: {len(value)} table rows")
    elif key == 'meta':
        print(f"  {key}: {list(value.keys())}")
    else:
        print(f"  {key}: {value}")

# Dataset statistics
reasoning_count = sum(1 for sample in finqa_processed if sample['reasoning'])
print(f"\n📈 Dataset Statistics:")
print(f"  Total samples: {len(finqa_processed)}")
print(f"  Reasoning samples: {reasoning_count} ({reasoning_count/len(finqa_processed)*100:.1f}%)")
print(f"  Non-reasoning samples: {len(finqa_processed) - reasoning_count}")

# Reason type distribution
reason_types = {}
for sample in finqa_processed:
    reason_type = sample.get('reason_type')
    if reason_type:
        reason_types[reason_type] = reason_types.get(reason_type, 0) + 1

print(f"\n🔍 Reason Type Distribution:")
for reason_type, count in sorted(reason_types.items(), key=lambda x: x[1], reverse=True)[:10]:
    print(f"  {reason_type}: {count} samples")