# 📘 Warren Buffett Shareholder Letters: QnA Dataset

![Warren Buffett](image/main_wb.png)

This dataset contains curated **question-answer-reasoning (QAR)** triplets extracted from **Warren Buffett’s shareholder letters**, processed using Large Language Models (LLMs) for educational and research purposes.

## 🚀 Dataset Overview

Each data sample contains:
- A **Question** derived from a selected paragraph in the letter.
- A **Contextual Answer** based on the content of the letter.
- A detailed **Reasoning** showing how the answer can be derived from the paragraph.

This is especially useful for:
- Building retrieval-augmented generation (RAG) pipelines.
- Training/fine-tuning LLMs for educational Q&A.
- Studying how financial philosophy can be encoded in AI reasoning.

---

## 🛠️ How This Dataset Was Created

### 1. 📄 Scraping PDFs Using Mistral AI

I used a custom document processing pipeline powered by **Mistral AI** to:
- Load Warren Buffett's shareholder letters in PDF format.
- Segment them into clean paragraph-level markdown blocks.
- Structure each letter into a list of parsed pages for downstream tasks.

Each letter was stored as a list of page dictionaries, and all letters were collected in one outer list:
```python
array_of_pages = [[{"index": 0, "markdown": "..."}], [...], ...]
```
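As a rough illustration of how this nested structure can be traversed, the sketch below flattens it into `(letter index, page index, markdown)` tuples. The helper name `iter_pages` and the sample data are made up for illustration; only the `index` and `markdown` keys come from the format above.

```python
from typing import Iterator

def iter_pages(array_of_pages: list[list[dict]]) -> Iterator[tuple[int, int, str]]:
    """Yield (letter_idx, page_idx, markdown) for every parsed page."""
    for letter_idx, pages in enumerate(array_of_pages):
        for page in pages:
            # A page without a "markdown" key is treated as empty text.
            yield letter_idx, page["index"], page.get("markdown", "")

# Toy stand-in for real OCR output (hypothetical content).
sample_pages = [
    [{"index": 0, "markdown": "# Letter A, page 1"}, {"index": 1, "markdown": "Page 2"}],
    [{"index": 0, "markdown": "# Letter B, page 1"}],
]
flat_pages = list(iter_pages(sample_pages))
```

Flattening like this makes it easy to count pages or to sample a subset before spending API calls on them.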

### 2. 💬 Generating QnA Using Together API

I instantiated a `ChatBot` class using the Together API and sequentially queried the following prompts for each paragraph:

- **Q (Question Prompt):**
  > "What is a good question worth being asked from this paragraph? Please provide one sentence."

- **A (Answer Prompt):**
  > "What is a good answer that can be derived from above paragraph and question? Please provide answer."

- **R (Reasoning Prompt):**
  > "Give me reasoning to show how to arrive with answer above from paragraph and question. Please provide reasoning."

The response was streamed and collapsed into a structured dictionary.
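The three-step prompting can be sketched without any API access by passing in a callable that stands in for the model. The names `build_qar`, `QAR_PROMPTS`, and `fake_ask` are hypothetical. One deliberate difference from the notebook code further down: this sketch appends each reply to the history so that later prompts can see the earlier question and answer. That is an assumption about the intended behavior, not a description of what the notebook does.

```python
from typing import Callable

# Prompts copied from the list above.
QAR_PROMPTS = {
    "question": "What is a good question worth being asked from this paragraph? Please provide one sentence.",
    "answer": "What is a good answer that can be derived from above paragraph and question? Please provide answer.",
    "reasoning": "Give me reasoning to show how to arrive with answer above from paragraph and question. Please provide reasoning.",
}

def build_qar(paragraph: str, ask: Callable[[list[dict]], str]) -> dict[str, str]:
    """Run the three prompts in order, carrying the conversation forward."""
    history = [{
        "role": "assistant",
        "content": f"Here is some paragraph written by investor Warren Buffett: {paragraph}",
    }]
    record: dict[str, str] = {}
    for key, prompt in QAR_PROMPTS.items():
        history.append({"role": "user", "content": prompt})
        reply = ask(history)
        # Assumption: keep the reply in the history for the next turn.
        history.append({"role": "assistant", "content": reply})
        record[key] = reply.strip()
    return record

def fake_ask(history: list[dict]) -> str:
    """Stand-in for a real model call; reports the history length it saw."""
    return f" reply {len(history)} "

qar_record = build_qar("Some paragraph.", fake_ask)
```

In real use, `ask` would wrap a chat-completion call; the stub only makes the control flow visible.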

### 3. 🧠 Curating Dataset Using Deepseek R1

To keep the reasoning quality high, I used the **`deepseek-ai/DeepSeek-V3`** model served by Together AI, chosen for its reasoning ability on financial text. It is the default model for every `invoke_api()` call in the QAR curation loop. Note that DeepSeek-V3 and DeepSeek-R1 are separate models: the notebook headings and the dataset name mention R1, but the code below calls DeepSeek-V3.

---

## 🗃️ Format

Each sample is structured as:

```json
{
  "question": "Why does Warren Buffett emphasize the importance of admitting mistakes in his shareholder letters?",
  "answer": "Buffett believes acknowledging mistakes fosters transparency and accountability...",
  "reasoning": "Buffett values transparency with shareholders. He believes that by admitting errors..."
}
```
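For fine-tuning, a record in this format usually has to be reshaped into chat messages. The sketch below shows one possible layout; the function name `to_chat_example` and the choice to put the reasoning before the answer in the assistant turn are assumptions, not something the dataset prescribes.

```python
REQUIRED_KEYS = ("question", "answer", "reasoning")

def to_chat_example(record: dict[str, str]) -> list[dict[str, str]]:
    """Validate a QAR record and convert it into a two-message chat example."""
    missing = [key for key in REQUIRED_KEYS if not record.get(key, "").strip()]
    if missing:
        raise ValueError(f"record is missing fields: {missing}")
    return [
        {"role": "user", "content": record["question"]},
        # Assumed layout: reasoning first, then the final answer.
        {"role": "assistant", "content": f"{record['reasoning']}\n\n{record['answer']}"},
    ]

sample_record = {
    "question": "Why does Warren Buffett emphasize the importance of admitting mistakes?",
    "answer": "Buffett believes acknowledging mistakes fosters transparency and accountability.",
    "reasoning": "Buffett values transparency with shareholders.",
}
chat_messages = to_chat_example(sample_record)
```

Raising on empty fields is a cheap way to catch failed generations before they reach a training run.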

---

## 📤 Hosting on Hugging Face

The dataset was:
- Converted into a Hugging Face `datasets.Dataset` with `Dataset.from_list()`.
- Saved locally with `dataset.save_to_disk()`.
- Wrapped in a `DatasetDict` with a single `train` split and uploaded with `push_to_hub()`.

---

## ✨ Credits

- **Created by**: [Yiqiao Yin](https://huggingface.co/eagle0504)
- **Source Text**: Warren Buffett Shareholder Letters (publicly available on the Berkshire Hathaway website)
- **Models Used**: Mistral AI for OCR & parsing, DeepSeek-V3 via Together API for QnA

---

## 🧪 Future Work

- Expand coverage across more years of letters.
- Integrate RAG model finetuning tutorials using this dataset.
- Add multi-language support via translation pipelines.

---

### 📎 Upload Notes
If you're planning to upload this dataset to Hugging Face Hub:
- Ensure that the `image/main_wb.png` file is included in the dataset repo root.
- Place this `README.md` in the same directory as the dataset folder or specify it explicitly when uploading.



## Mistral OCR

In [None]:
%%capture

! pip install mistralai

Instructions from Mistral to access their API are presented [here](https://docs.mistral.ai/getting-started/quickstart/).

In [None]:
from google.colab import userdata
MISTRAL_API_KEY = userdata.get('MISTRAL_API_KEY')

The letters are listed on the Berkshire Hathaway website [here](https://www.berkshirehathaway.com/letters/letters.html).

In [None]:
list_of_letters = [
    "https://www.berkshirehathaway.com/letters/2024ltr.pdf",
    "https://www.berkshirehathaway.com/letters/2023ltr.pdf",
    "https://www.berkshirehathaway.com/letters/2022ltr.pdf",
    "https://www.berkshirehathaway.com/letters/2021ltr.pdf",
    "https://www.berkshirehathaway.com/letters/2020ltr.pdf",
    "https://www.berkshirehathaway.com/letters/2019ltr.pdf",
    "https://www.berkshirehathaway.com/letters/2018ltr.pdf",
    "https://www.berkshirehathaway.com/letters/2017ltr.pdf",
    "https://www.berkshirehathaway.com/letters/2016ltr.pdf",
    "https://www.berkshirehathaway.com/letters/2015ltr.pdf",
    "https://www.berkshirehathaway.com/letters/2014ltr.pdf",
    "https://www.berkshirehathaway.com/letters/2013ltr.pdf",
    "https://www.berkshirehathaway.com/letters/2012ltr.pdf",
    "https://www.berkshirehathaway.com/letters/2011ltr.pdf",
    "https://www.berkshirehathaway.com/letters/2010ltr.pdf",
    "https://www.berkshirehathaway.com/letters/2009ltr.pdf",
    "https://www.berkshirehathaway.com/letters/2008ltr.pdf",
    "https://www.berkshirehathaway.com/letters/2007ltr.pdf",
    "https://www.berkshirehathaway.com/letters/2006ltr.pdf",
    "https://www.berkshirehathaway.com/letters/2005ltr.pdf",
    "https://www.berkshirehathaway.com/letters/2004ltr.pdf",
    "https://www.berkshirehathaway.com/letters/2003ltr.pdf",
    "https://www.berkshirehathaway.com/letters/2002pdf.pdf",
    "https://www.berkshirehathaway.com/letters/2001pdf.pdf",
    "https://www.berkshirehathaway.com/letters/2000pdf.pdf",
    "https://www.berkshirehathaway.com/letters/1999pdf.pdf",
    "https://www.berkshirehathaway.com/letters/1998pdf.pdf"
]

print(f"Number of letters: {len(list_of_letters)}")

Number of letters: 27
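The same list can be generated rather than typed out. The sketch below relies only on the pattern visible in the list above (`<year>ltr.pdf` from 2003 onward, `<year>pdf.pdf` for 1998 to 2002); it has not been checked against the website, and the helper name `build_letter_url` is hypothetical.

```python
BASE_URL = "https://www.berkshirehathaway.com/letters"

def build_letter_url(year: int) -> str:
    """Build a letter URL using the naming pattern seen in the hand-written list."""
    # Assumption: the suffix switches from "pdf" to "ltr" starting with 2003.
    suffix = "ltr" if year >= 2003 else "pdf"
    return f"{BASE_URL}/{year}{suffix}.pdf"

# Newest first, 2024 down to 1998, matching the order of list_of_letters.
generated_letters = [build_letter_url(year) for year in range(2024, 1997, -1)]
```

Comparing `generated_letters` with `list_of_letters` is a quick way to spot a typo in either one.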


In [None]:
import os
from mistralai import Mistral

client = Mistral(api_key=MISTRAL_API_KEY)

ocr_response = client.ocr.process(
    model="mistral-ocr-latest",
    document={
        "type": "document_url",
        "document_url": list_of_letters[0]
    },
    include_image_base64=True
)

In [None]:
len(ocr_response.model_dump()['pages'])


15

In [None]:
ocr_response.model_dump()['pages']


[{'index': 0,
  'markdown': '# BERKSHIRE HATHAWAY INC. \n\nTo the Shareholders of Berkshire Hathaway Inc.:\nThis letter comes to you as part of Berkshire\'s annual report. As a public company, we are required to periodically tell you many specific facts and figures.\n"Report," however, implies a greater responsibility. In addition to the mandated data, we believe we owe you additional commentary about what you own and how we think. Our goal is to communicate with you in a manner that we would wish you to use if our positions were reversed - that is, if you were Berkshire\'s CEO while I and my family were passive investors, trusting you with our savings.\n\nThis approach leads us to an annual recitation of both good and bad developments at the many businesses you indirectly own through your Berkshire shares. When discussing problems at specific subsidiaries, we do, however, try to follow the advice Tom Murphy gave to me 60 years ago: "praise by name, criticize by category."\n\n## Mistak

Now we loop over every letter and run OCR on each one.

In [None]:
# Create the client once and reuse it for every letter
client = Mistral(api_key=MISTRAL_API_KEY)
array_of_pages = []

for i, url in enumerate(list_of_letters):
    try:
        # OCR the PDF directly from its URL
        ocr_response = client.ocr.process(
            model="mistral-ocr-latest",
            document={
                "type": "document_url",
                "document_url": url
            },
            include_image_base64=True
        )

        # Keep only the per-page results for this letter
        array_of_pages.append(ocr_response.model_dump()['pages'])

        # Checkpoint
        print(f"--- finished with document: {i}, {url}")
    except Exception as exc:
        # A failed letter is skipped, so array_of_pages can be shorter than list_of_letters
        print(f"--- failed with document: {i}, {url}: {exc}")


--- finished with document: 0, https://www.berkshirehathaway.com/letters/2024ltr.pdf
--- finished with document: 1, https://www.berkshirehathaway.com/letters/2023ltr.pdf
--- finished with document: 2, https://www.berkshirehathaway.com/letters/2022ltr.pdf
--- finished with document: 3, https://www.berkshirehathaway.com/letters/2021ltr.pdf
--- finished with document: 4, https://www.berkshirehathaway.com/letters/2020ltr.pdf
--- finished with document: 5, https://www.berkshirehathaway.com/letters/2019ltr.pdf
--- finished with document: 6, https://www.berkshirehathaway.com/letters/2018ltr.pdf
--- finished with document: 7, https://www.berkshirehathaway.com/letters/2017ltr.pdf
--- finished with document: 8, https://www.berkshirehathaway.com/letters/2016ltr.pdf
--- finished with document: 9, https://www.berkshirehathaway.com/letters/2015ltr.pdf
--- finished with document: 10, https://www.berkshirehathaway.com/letters/2014ltr.pdf
--- finished with document: 11, https://www.berkshirehathaway.co
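
A failed letter in the loop above is simply skipped. One possible refinement is to retry a few times with exponential backoff before giving up. The helper below is a generic sketch: `with_retries` and its parameters are hypothetical and not part of the `mistralai` client, and the flaky function only simulates a transient error.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_retries(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn until it succeeds, doubling the wait after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last error.
            sleep(base_delay * 2 ** attempt)

# Simulated flaky call: fails twice, then succeeds.
call_count = {"n": 0}

def flaky_call() -> str:
    call_count["n"] += 1
    if call_count["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

recorded_delays: list[float] = []
retry_result = with_retries(flaky_call, attempts=3, base_delay=0.5, sleep=recorded_delays.append)
```

In the OCR loop, the call to `client.ocr.process(...)` could be wrapped in a `lambda` and passed as `fn`. Injecting `sleep` keeps the example fast to run.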

## Together AI - Usage of Deepseek R1

Install the Together AI client.

In [None]:
! pip install together

Collecting together
  Downloading together-1.5.5-py3-none-any.whl.metadata (14 kB)
Downloading together-1.5.5-py3-none-any.whl (87 kB)
Installing collected packages: together
Successfully installed together-1.5.5


In [None]:
from google.colab import userdata
TOGETHER_API_KEY = userdata.get('TOGETHER_API_KEY')

Define the chatbot class.

In [None]:
from together import Together

class ChatBot:
    """
    A simple ChatBot class to interact with a Together LLM model.

    Attributes:
        api_key (str): The API key used to authenticate with the Together API.
        client (Together): A Together client for making requests.
        history (list[dict]): A list of dictionaries representing the conversation history.
    """

    def __init__(self, api_key: str) -> None:
        """
        Initializes the ChatBot with a given API key and an empty conversation history.
        Also creates a Together client instance for making requests.

        Args:
            api_key (str): The API key for Together.
        """
        self.api_key: str = api_key
        self.client: Together = Together(api_key=self.api_key)
        self.history: list[dict] = []

    def append_history(self, role: str, content: str) -> None:
        """
        Appends a new message entry to the conversation history.

        Args:
            role (str): The role of the message sender, e.g., "user" or "assistant".
            content (str): The message content to be appended.
        """
        self.history.append({"role": role, "content": content})

    def invoke_api(
        self,
        model: str = "deepseek-ai/DeepSeek-V3",
        max_tokens: int = 1024,
        temperature: float = 0.7,
        top_p: float = 0.7,
        top_k: int = 50,
        repetition_penalty: float = 1.0,
        stop: list[str] = ["<｜end▁of▁sentence｜>"]
    ) -> str:
        """
        Invokes the Together chat API using the stored conversation history.

        Args:
            model (str, optional): The name of the Together model to use. Defaults to "deepseek-ai/DeepSeek-V3".
            max_tokens (int, optional): The maximum number of tokens in the response. Defaults to 1024.
            temperature (float, optional): The sampling temperature. Defaults to 0.7.
            top_p (float, optional): The top_p sampling parameter. Defaults to 0.7.
            top_k (int, optional): The top_k sampling parameter. Defaults to 50.
            repetition_penalty (float, optional): The repetition penalty parameter. Defaults to 1.0.
            stop (list[str], optional): A list of stop tokens. Defaults to ["<｜end▁of▁sentence｜>"].

        Returns:
            str: The collapsed string response from the API.
        """
        response = self.client.chat.completions.create(
            model=model,
            messages=self.history,
            max_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p,
            top_k=top_k,
            repetition_penalty=repetition_penalty,
            stop=stop,
            stream=True
        )
        answer: str = self.collapse_response(response)
        return answer

    def collapse_response(self, response) -> str:
        """
        Collapses a streaming response from the Together API into a single string.

        Args:
            response: The streaming response object from the Together API.

        Returns:
            str: A single string containing the concatenated content from each token in the response.
        """
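        answer: str = ""
        for token in response:
            # Skip chunks that carry no choices
            if not getattr(token, "choices", None):
                continue
            # delta.content can be None on some chunks, so only append real text
            content = getattr(token.choices[0].delta, "content", None)
            if content:
                answer += content
        return answer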

    def show_history(self) -> None:
        """
        Prints the entire conversation history.
        """
        print(self.history)
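
The collapsing step can be checked offline, without an API key, by feeding it stand-in chunk objects. The sketch below mirrors the logic of `collapse_response` as a free function; `collapse_chunks` and `fake_chunk` are hypothetical names, and the fake chunks only imitate the `choices[0].delta.content` shape that the method reads.

```python
from types import SimpleNamespace
from typing import Iterable

def collapse_chunks(chunks: Iterable) -> str:
    """Concatenate the text carried by each streamed chunk."""
    parts: list[str] = []
    for chunk in chunks:
        choices = getattr(chunk, "choices", None)
        if not choices:
            continue  # Nothing to read from this chunk.
        content = getattr(choices[0].delta, "content", None)
        if content:
            parts.append(content)
    return "".join(parts)

def fake_chunk(text):
    """Build an object shaped like a streamed chunk (for testing only)."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])

fake_stream = [
    fake_chunk("Hello"),
    fake_chunk(None),               # Chunk with no text.
    SimpleNamespace(choices=[]),    # Chunk with no choices.
    fake_chunk(", world"),
]
collapsed_text = collapse_chunks(fake_stream)
```

Collecting parts in a list and joining once avoids repeated string concatenation on long responses.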


Test one endpoint call.

In [None]:
# Instantiate the ChatBot
bot = ChatBot(api_key=TOGETHER_API_KEY)

bot.history = [{"role": "assistant", "content": "You always provide reasoning. Your answer starts from <think>xxx</think> and <response>."}]

# Provided by data
current_question = "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?"
current_answer = "72"
augmented_content = f"Provide reasoning how to answer question: {current_question} and to arrive with answer: {current_answer}. Use <think></think> for reasoning and <response></response> for final answer."
print(augmented_content)

# Append augmented content to history
bot.append_history(role="user", content=augmented_content)
bot.invoke_api()

Provide reasoning how to answer question: Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? and to arrive with answer: 72. Use <think></think> for reasoning and <response></response> for final answer.


'<think>  \n1. **April Sales**: Natalia sold clips to 48 friends in April. This means she sold 48 clips in April.  \n2. **May Sales**: She sold half as many clips in May as she did in April. Half of 48 is calculated as \\( \\frac{48}{2} = 24 \\). So, she sold 24 clips in May.  \n3. **Total Sales**: To find the total number of clips sold in April and May, add the two amounts together: \\( 48 \\text{ (April)} + 24 \\text{ (May)} = 72 \\).  \n</think>  \n\n<response>  \n72  \n</response>'
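
A response in this tagged form can be split into its reasoning and final answer with two regular expressions. This is a minimal sketch: the function name `split_think_response` is hypothetical, and the sample string is a shortened stand-in for the output above.

```python
import re
from typing import Optional

def split_think_response(text: str) -> dict[str, Optional[str]]:
    """Extract the contents of the <think> and <response> tags, if present."""
    think = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    response = re.search(r"<response>(.*?)</response>", text, flags=re.DOTALL)
    return {
        "think": think.group(1).strip() if think else None,
        "response": response.group(1).strip() if response else None,
    }

raw_output = (
    "<think>  \nHalf of 48 is 24, and 48 + 24 = 72.  \n</think>  \n\n"
    "<response>  \n72  \n</response>"
)
parsed_output = split_think_response(raw_output)
```

Returning `None` for a missing tag lets the caller decide whether to retry or discard the sample.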

In [None]:
# for i in range(len(array_of_pages)):
#     tmp_doc = array_of_pages[i]
#     for j in range(len(tmp_doc)):
#         tmp_page = tmp_doc[j]
i=0
tmp_doc = array_of_pages[i]
j=0
tmp_page = tmp_doc[j]

In [None]:
tmp_page

{'index': 0,
 'markdown': '# BERKSHIRE HATHAWAY INC. \n\nTo the Shareholders of Berkshire Hathaway Inc.:\nThis letter comes to you as part of Berkshire\'s annual report. As a public company, we are required to periodically tell you many specific facts and figures.\n"Report," however, implies a greater responsibility. In addition to the mandated data, we believe we owe you additional commentary about what you own and how we think. Our goal is to communicate with you in a manner that we would wish you to use if our positions were reversed - that is, if you were Berkshire\'s CEO while I and my family were passive investors, trusting you with our savings.\n\nThis approach leads us to an annual recitation of both good and bad developments at the many businesses you indirectly own through your Berkshire shares. When discussing problems at specific subsidiaries, we do, however, try to follow the advice Tom Murphy gave to me 60 years ago: "praise by name, criticize by category."\n\n## Mistakes

In [None]:
# Instantiate the ChatBot
bot = ChatBot(api_key=TOGETHER_API_KEY)

bot.history = [{"role": "assistant", "content": f"Here is some paragraph written by investor Warren Buffett: {tmp_page}."}]

# Append augmented content to history
bot.append_history(role="user", content="What is a good question worth being asked from this paragraph? Please provide one sentence.")
question = bot.invoke_api(temperature=0.9)

# Append augmented content to history
bot.append_history(role="user", content="What is a good answer that can be derived from above paragraph and question? Please provide answer.")
answer = bot.invoke_api(temperature=0.1)

# Append augmented content to history
bot.append_history(role="user", content="Give me reasoning to show how to arrive with answer above from paragraph and question. Please provide reasoning.")
reasoning_content = bot.invoke_api(temperature=0.1)

print(f"Question: {question}")
print(f"Answer: {answer}")
print(f"Reasoning: {reasoning_content}")

Question: **"How does Warren Buffett's principle of 'praise by name, criticize by category' influence transparency and accountability in Berkshire Hathaway's shareholder communications?"**
Answer: **Question:** Why does Warren Buffett emphasize the importance of admitting mistakes in his shareholder letters?  

**Answer:** Buffett believes that openly acknowledging mistakes fosters transparency and accountability, allowing shareholders to better understand the company's challenges and decision-making, rather than masking problems with overly optimistic language.
Reasoning: **Question:** *Why does Warren Buffett emphasize admitting mistakes in his shareholder letters, while many other large companies avoid doing so?*  

**Answer:** Buffett believes transparency and accountability are crucial for shareholders, as openly discussing mistakes—both in investments and management—helps build trust and ensures corrective action is taken, unlike companies that avoid admitting errors to maintain 

## Data Curation Using Deepseek-R1

Now we gather all data together.

In [None]:
%%capture

! pip install datasets

In [None]:
from google.colab import userdata
HF_TOKEN = userdata.get('HF_TOKEN')

In [None]:
from datasets import Dataset
from typing import List, Dict

# List to collect results
num_of_sampling = 20
results: List[Dict[str, str]] = []

# Double for-loop to iterate over array_of_pages
for i in range(len(array_of_pages)):
    tmp_doc = array_of_pages[i]
    for j in range(len(tmp_doc)):
        tmp_page = tmp_doc[j]
        markdown_text = tmp_page.get("markdown", "")

        # Sampling
        for k in range(num_of_sampling):
            try:
                # Instantiate the ChatBot
                bot = ChatBot(api_key=TOGETHER_API_KEY)
                bot.history = [{"role": "assistant", "content": f"Here is some paragraph written by investor Warren Buffett: {markdown_text}"}]

                bot.append_history(role="user", content="What is a good question worth being asked from this paragraph? Please provide one sentence. Please only provide question.")
                question = bot.invoke_api(temperature=0.9)

                bot.append_history(role="user", content="What is a good answer that can be derived from above paragraph and question? Please only provide answer.")
                answer = bot.invoke_api(temperature=0.1)

                bot.append_history(role="user", content="Give me reasoning to show how to arrive with answer above from paragraph and question. Please only provide reasoning.")
                reasoning = bot.invoke_api(temperature=0.1)

                print(f"[{i}-{j}-{k}] ✅ Success\nQ: {question}\nA: {answer}\nR: {reasoning}\n")

                results.append({
                    "question": question.strip(),
                    "answer": answer.strip(),
                    "reasoning": reasoning.strip()
                })
            except Exception as exc:
                # Report why the sample failed instead of hiding the error
                print(f"[{i}-{j}-{k}] ❌ Failed: {exc}")

Streaming output truncated to the last 5000 lines.
R: The reasoning is derived from Buffett's emphasis on the value of retained earnings and unrealized gains in the "Big Four" investments, where he explains that even though Berkshire only reports dividends, the retained earnings (used for stock repurchases and business growth) significantly enhance Berkshire's future earnings and capital gains, making them equally valuable.

[10-4-6] ✅ Success
Q: How does Berkshire Hathaway's strategy of owning substantial but non-controlling stakes in excellent companies compare to acquiring full ownership of average businesses?
A: Berkshire Hathaway's strategy of owning substantial but non-controlling stakes in excellent companies (like American Express, Coca-Cola, IBM, and Wells Fargo) allows it to benefit from retained earnings, stock repurchases, and long-term growth, which is more valuable than owning 100% of mediocre businesses.
R: The reasoning involves analyzing Warren Buffett's 

In [None]:
len(results)
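
Because the curation loop makes many API calls and keeps `results` only in memory, an interrupted session loses everything collected so far. One option is to append each record to a JSON Lines file as it is produced. The helpers and the file name below are hypothetical, and the records are placeholders.

```python
import json
import os
import tempfile

def append_jsonl(path: str, record: dict) -> None:
    """Append one record as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

def load_jsonl(path: str) -> list[dict]:
    """Read all records back; a missing file yields an empty list."""
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Demo in a temporary directory with placeholder records.
checkpoint_path = os.path.join(tempfile.mkdtemp(), "qar_checkpoint.jsonl")
append_jsonl(checkpoint_path, {"question": "Q1", "answer": "A1", "reasoning": "R1"})
append_jsonl(checkpoint_path, {"question": "Q2", "answer": "A2", "reasoning": "R2"})
restored_results = load_jsonl(checkpoint_path)
```

Calling `append_jsonl` right after `results.append(...)` in the loop would let a later session rebuild `results` with `load_jsonl`.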

## Save on HuggingFace

In [None]:
from datasets import Dataset

In [None]:
%%time

# Convert to HF dataset
dataset = Dataset.from_list(results)

# Save the updated dataset (optional)
output_data_test_name = "wb_dataset"
dataset.save_to_disk(output_data_test_name)

In [None]:
from datasets import load_from_disk, DatasetDict

# Load dataset from the saved folder
aug_data = load_from_disk(output_data_test_name)

# Convert it into a DatasetDict format
dataset_dict = DatasetDict(
    {
        "train": aug_data # ,
        # "test": aug_data_test
    }
)

# Save again in correct structure
final_output_data_name = f"{output_data_test_name}-correct-format"
dataset_dict.save_to_disk(final_output_data_name)

# Define user and repo details
user_name = "eagle0504"
repo_name = "warren-buffett-letters-qna-r1-enhanced-1998-2024"  # Change this to your desired repository name

# Push to Hugging Face, authenticating with the token loaded earlier
dataset_dict.push_to_hub(f"{user_name}/{repo_name}", token=HF_TOKEN)

# Print the URL of the pushed dataset
print(f"Dataset pushed to: https://huggingface.co/datasets/{user_name}/{repo_name}")