# Creating SEC 10-Q Dataset

This notebook takes the Docugami KG-RAG dataset for SEC 10-Q filings, maintained at https://github.com/docugami/KG-RAG-datasets/tree/main/sec-10-q, and converts it into the format required by the llama_dataset repo.

Instructions adapted from https://github.com/run-llama/llama_index/blob/main/docs/examples/llama_dataset/ragdataset_submission_template.ipynb

In [15]:
!pip install --quiet --upgrade llama-index pypdf


## Get Latest Dataset artifacts

In [10]:
!rm -rf temp
!git clone https://github.com/docugami/KG-RAG-datasets.git temp

Cloning into 'temp'...
remote: Enumerating objects: 178, done.
remote: Counting objects: 100% (71/71), done.
remote: Compressing objects: 100% (54/54), done.
remote: Total 178 (delta 10), reused 60 (delta 8), pack-reused 107
Receiving objects: 100% (178/178), 56.52 MiB | 16.66 MiB/s, done.
Resolving deltas: 100% (11/11), done.


In [17]:
!rm -rf source_files
!cp -r temp/sec-10-q/data/v1/docs source_files

## Create a LabelledRagDataset

In [12]:
import os
from pathlib import Path

DOCS_DIR = Path(os.getcwd()) / "temp/sec-10-q/data/v1/docs"
QNA_CSV = Path(os.getcwd()) / "temp/sec-10-q/data/v1/qna_data.csv"
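
Before building the dataset, it can help to confirm that the clone produced the inputs these two paths point at. The following is a minimal sketch, not part of the original notebook: `summarize_inputs` is a hypothetical helper, and it assumes the documents are PDF files (suggested by the `pypdf` install above, but not stated explicitly).

```python
from pathlib import Path


def summarize_inputs(docs_dir: Path, qna_csv: Path) -> dict:
    """Report whether the expected inputs exist and how many PDFs were found.

    Assumption: the source documents are *.pdf files directly inside docs_dir.
    """
    docs_ok = docs_dir.is_dir()
    pdfs = sorted(docs_dir.glob("*.pdf")) if docs_ok else []
    return {
        "docs_dir_exists": docs_ok,
        "qna_csv_exists": qna_csv.is_file(),
        "pdf_count": len(pdfs),
    }
```

Calling `summarize_inputs(DOCS_DIR, QNA_CSV)` right after this cell should report both inputs as present; a `pdf_count` of zero would suggest the clone or the path is off.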

In [13]:
import pandas as pd
from llama_index.llama_dataset import (
    CreatedBy,
    CreatedByType,
    LabelledRagDataExample,
    LabelledRagDataset,
)


# Read the question/answer pairs from the source CSV
source_df = pd.read_csv(QNA_CSV)

examples = []
for ix, row in source_df.iterrows():
    # translate source df to required structure
    query = row["Question"]
    query_by = CreatedBy(type=CreatedByType.HUMAN)
    reference_answer = row["Answer"]
    reference_answer_by = CreatedBy(
        model_name="gpt-4-turbo (with human review)", type=CreatedByType.AI
    )
    reference_contexts = None  # reference contexts are not set for this dataset

    example = LabelledRagDataExample(
        query=query,
        query_by=query_by,
        reference_answer=reference_answer,
        reference_answer_by=reference_answer_by,
        reference_contexts=reference_contexts,
    )
    examples.append(example)

rag_dataset = LabelledRagDataset(examples=examples)

# save this dataset as it is required for the submission
rag_dataset.save_json("rag_dataset.json")
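
Since the saved file is required for the submission, a quick read-back can confirm it was written and holds one entry per CSV row. This is a minimal sketch using only the standard library; `count_saved_examples` is a hypothetical helper, and the top-level `"examples"` key is an assumption about the layout that `save_json` writes, so adjust it if your file differs.

```python
import json
from pathlib import Path


def count_saved_examples(path: str) -> int:
    """Count the examples stored in a saved dataset file.

    Assumption: the file is a JSON object with a top-level "examples" list.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return len(data["examples"])
```

In this notebook, `count_saved_examples("rag_dataset.json")` would be expected to equal `len(source_df)`.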

## Generate a Baseline Evaluation Result

In [16]:
from llama_index.readers import SimpleDirectoryReader
from llama_index import VectorStoreIndex
from llama_index.llama_pack import download_llama_pack

documents = SimpleDirectoryReader(input_dir=str(DOCS_DIR)).load_data()

index = VectorStoreIndex.from_documents(documents=documents)
query_engine = index.as_query_engine()

RagEvaluatorPack = download_llama_pack("RagEvaluatorPack", "./temp/pack")
rag_evaluator = RagEvaluatorPack(
    query_engine=query_engine,
    rag_dataset=rag_dataset,
)
benchmark_df = await rag_evaluator.arun()
benchmark_df

Batch processing of predictions: 100%|██████████| 10/10 [00:10<00:00,  1.08s/it]
Batch processing of predictions: 100%|██████████| 10/10 [00:27<00:00,  2.77s/it]
Batch processing of predictions: 100%|██████████| 10/10 [00:11<00:00,  1.20s/it]
Batch processing of predictions: 100%|██████████| 10/10 [00:09<00:00,  1.09it/s]
Batch processing of predictions: 100%|██████████| 10/10 [00:06<00:00,  1.48it/s]
Batch processing of predictions: 100%|██████████| 10/10 [00:08<00:00,  1.16it/s]
Batch processing of predictions: 100%|██████████| 10/10 [00:05<00:00,  1.77it/s]
Batch processing of predictions: 100%|██████████| 10/10 [00:07<00:00,  1.29it/s]
Batch processing of predictions: 100%|██████████| 10/10 [00:04<00:00,  2.29it/s]
Batch processing of predictions: 100%|██████████| 10/10 [00:08<00:00,  1.14it/s]
Batch processing of predictions: 100%|██████████| 10/10 [00:07<00:00,  1.30it/s]
Batch processing of predictions: 100%|██████████| 10/10 [00:09<00:00,  1.02it/s]

| metrics | base_rag |
| --- | --- |
| mean_correctness_score | 2.702564 |
| mean_relevancy_score | 0.825641 |
| mean_faithfulness_score | 0.897436 |
| mean_context_similarity_score | |


## Prepare card.json and README.md

In [None]:
!rm -rf temp

In [18]:
from llama_index.llama_pack import download_llama_pack

LlamaDatasetMetadataPack = download_llama_pack(
    "LlamaDatasetMetadataPack", "./temp/pack"
)

metadata_pack = LlamaDatasetMetadataPack()

dataset_description = (
    "A labelled RAG dataset with SEC 10-Q documents for major tech companies including queries "
    "across multiple docs and chunks, with reference answers. "
    "See https://github.com/docugami/KG-RAG-datasets for details."
)

# This creates and saves a card.json and README.md to the same
# directory where you're running this notebook.
metadata_pack.run(
    name="Docugami KG-RAG - SEC 10-Q",
    description=dataset_description,
    rag_dataset=rag_dataset,
    index=index,
    benchmark_df=benchmark_df,
    baseline_name="llamaindex",
)
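
Once the metadata pack has run, the working directory should hold everything produced in this notebook. The sketch below checks for those artifacts; `missing_artifacts` is a hypothetical helper, and the list of expected names is an assumption drawn from the steps above (`rag_dataset.json`, `card.json`, `README.md`, and the `source_files` directory), not an official submission checklist.

```python
from pathlib import Path

# Assumption: these are the artifacts produced by the cells above.
EXPECTED_FILES = ("rag_dataset.json", "card.json", "README.md")
EXPECTED_DIRS = ("source_files",)


def missing_artifacts(directory: str = ".") -> list:
    """Return the names of expected files and directories that are absent."""
    base = Path(directory)
    missing = [name for name in EXPECTED_FILES if not (base / name).is_file()]
    missing += [name + "/" for name in EXPECTED_DIRS if not (base / name).is_dir()]
    return missing
```

An empty list from `missing_artifacts()` means every expected artifact is present in the notebook's directory.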