# Extracting Stoic Quotes from "The Daily Stoic" Using LLMs and OpenAI Batch API

This notebook demonstrates a robust workflow for extracting structured data (specifically the date, quote, author, and reflection) from each page of "The Daily Stoic" PDF. Instead of relying on traditional PDF parsing or OCR heuristics, we leverage Large Language Models (LLMs) for their stronger grasp of document layout and semantics.

The process involves:
- Converting each PDF page to an image.
- Crafting a prompt with a one-shot example to guide the LLM in extracting the required fields.
- Using the OpenAI API (including the Batch API for cost efficiency) to process all pages in parallel.
- Handling large input files by splitting them to meet API constraints.
- Aggregating and cleaning the results into a single, unified JSON file.

This approach enables accurate, scalable extraction of daily stoic quotes and reflections, with minimal manual intervention and high reliability.

## Extract the stoic quote, author, reflection for each day from the PDF document

**Objective**: Use an LLM to extract the date, quote, author, and reflection from each page of the PDF.

**Why**: PDF reader libraries typically access the text directly, so they may struggle to understand the layout and tell the four elements apart. Making them work would likely mean adding brittle heuristics or regex patterns. Hence LLMs.

**How**: Single PDF Page -> Image -> LLM -> structured output

**Notes**: We will also try the OpenAI Batch API, both to test it and to compare cost savings (if any). Its workflow differs from the regular API, so we might as well give it a shot!

In [None]:
from pdf2image import convert_from_path

# Path to your PDF
pdf_path = "assets/The_Daily_Stoic_Book.pdf"

# Convert all pages to images
images = convert_from_path(pdf_path, dpi=300)  # dpi=300 for good quality

# Save each page as an image
for i, image in enumerate(images):
    image.save(f"assets/page_{i+1}.png", "PNG")

PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?
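The error above means `pdf2image` could not find Poppler, whose command-line tools it calls to read the PDF. A minimal pre-flight check is sketched below, assuming the executables are named `pdfinfo` and `pdftoppm` and should be on `PATH`; `missing_poppler_tools` is a hypothetical helper name, not part of `pdf2image`.

```python
import shutil

# pdf2image calls these Poppler command-line tools under the hood.
POPPLER_TOOLS = ("pdfinfo", "pdftoppm")

def missing_poppler_tools(which=shutil.which):
    """Return the Poppler executables that cannot be found on PATH."""
    return [tool for tool in POPPLER_TOOLS if which(tool) is None]

missing = missing_poppler_tools()
if missing:
    print(f"Poppler tools not found on PATH: {', '.join(missing)}")
else:
    print("Poppler looks available; convert_from_path should be able to run.")
```

If tools are missing, installing Poppler through the system package manager usually resolves it. Alternatively, `convert_from_path` accepts a `poppler_path` argument pointing at the folder that contains the binaries.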

### Trying out the prompt first, with an example

Now that the individual pages have been extracted as images, let's create the prompt. We will use a one-shot example to guide the model.

In [28]:
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Access the API key
openai_api_key = os.getenv("OPENAI_API_KEY")

# Set up the OpenAI client
from openai import OpenAI
client = OpenAI(api_key=openai_api_key)

In [None]:
import openai
import base64

def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

# Encode images as base64
example_image_b64 = encode_image("assets/page_152.png")
target_image_b64 = encode_image("assets/page_156.png")

# Your hand-crafted JSON for the example
example_json = """
{
  "05-12": {
    "title": "KINDNESS IS ALWAYS THE RIGHT RESPONSE",
    "quote": "“Kindness is invincible, but only when it's sincere, with no hypocrisy or faking. For what can even the most malicious person do if you keep showing kindness and, if given the chance, you gently point out where they went wrong - right as they are trying to harm you?”",
    "author": "MARCUS AURELIUS, MEDITATIONS, 11.18.5.9a",
    "reflection": "What if the next time you were treated meanly you didn't just restrain yourself from fighting back - what if you responded with unmitigated kindness? What if you could \"love your enemies, do good to those who hate you\"? What kind of effect do you think that would have? \nThe Bible says that when you can do something nice and caring to a hateful enemy, it is like \"heap[ing] burning coal on his head.\" The expected reaction to hatred is more hatred. When someone says something pointed or mean today, they expect you to respond in kind - not with kindness. When that doesn't happen, they are embarrassed. It's a shock to their system - it makes them and you better. \nMost rudeness, meanness, and cruelty are a mask for deep-seated weakness. Kindness in these situations is only possible for people of great strength. You have that strength. Use it."
  }
}
"""

# Compose the messages
messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant that extracts structured data from images of book pages."
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Here is an example image of a Stoic quote page:"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{example_image_b64}"}},
            {"type": "text", "text": f"The extracted JSON for this page is:\n{example_json}"}
        ]
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Now, extract the JSON for this new image, following the same format if it is a Stoic quote page. If the image is not a quote page (for example, a title or copyright page), please return the string SKIP."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{target_image_b64}"}}
        ]
    }
]

# Call the API through the client configured above
response = client.chat.completions.create(
    model="gpt-4o",  # or "gpt-4o-mini" if available
    messages=messages,
    max_tokens=512,
    temperature=0,
)

print(response.choices[0].message.content)

```json
{
  "05-16": {
    "title": "THE CHAIN METHOD",
    "quote": "“If you don’t wish to be a hot-head, don’t feed your habit. Try as a first step to remain calm and count the days you haven’t been angry. I used to be angry every day, now every other day, then every third or fourth . . . if you make it as far as 30 days, thank God! For habit is first weakened and then obliterated. When you can say ‘I didn’t lose my temper today, or the next day, or for three or four months, but kept my cool under provocation,’ you will know you are in better health.”",
    "author": "EPICTETUS, DISCOURSES, 2.18.11b-14",
    "reflection": "The comedian Jerry Seinfeld once gave a young comic named Brad Isaac some advice about how to write and create material. Keep a calendar, he told him, and each day that you write jokes, put an X. Soon enough, you get a chain going—and then your job is to simply not break the chain. Success becomes a matter of momentum. Once you get a little, it’s easier to keep it 

In [6]:
import re
import json

response_text = response.choices[0].message.content

# This regex finds the content between ```json ... ```
match = re.search(r"```json\s*(\{.*?\})\s*```", response_text, re.DOTALL)
if match:
    json_str = match.group(1)
else:
    # Fallback: try to find any JSON object in the text
    match = re.search(r"(\{.*\})", response_text, re.DOTALL)
    if match:
        json_str = match.group(1)
    else:
        raise ValueError("No JSON object found in the response.")

stoic_quote = json.loads(json_str)
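The parsing above can be wrapped in a small reusable function that also handles the `SKIP` reply requested in the prompt. This is a sketch only: `parse_model_reply` is a hypothetical helper name, and it assumes the model returns either the bare string `SKIP`, a fenced json block, or a bare JSON object.

```python
import json
import re

FENCE = "`" * 3  # the triple-backtick marker the model wraps its JSON in

def parse_model_reply(text):
    """Return the parsed JSON dict from a model reply, or None for SKIP."""
    text = text.strip()
    if text == "SKIP":
        return None
    # Prefer a fenced block; otherwise fall back to the outermost braces.
    fenced = re.search(FENCE + r"(?:json)?\s*(\{.*\})\s*" + FENCE, text, re.DOTALL)
    if fenced:
        return json.loads(fenced.group(1))
    bare = re.search(r"\{.*\}", text, re.DOTALL)
    if bare:
        return json.loads(bare.group(0))
    raise ValueError("No JSON object found in the response.")

reply = FENCE + 'json\n{"05-16": {"title": "THE CHAIN METHOD"}}\n' + FENCE
print(parse_model_reply(reply))
print(parse_model_reply("SKIP"))
```

Returning `None` for `SKIP` lets the caller tell non-quote pages apart from genuine parsing failures, which still raise an exception.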

In [7]:
stoic_quote

{'05-16': {'title': 'THE CHAIN METHOD',
  'quote': '“If you don’t wish to be a hot-head, don’t feed your habit. Try as a first step to remain calm and count the days you haven’t been angry. I used to be angry every day, now every other day, then every third or fourth . . . if you make it as far as 30 days, thank God! For habit is first weakened and then obliterated. When you can say ‘I didn’t lose my temper today, or the next day, or for three or four months, but kept my cool under provocation,’ you will know you are in better health.”',
  'author': 'EPICTETUS, DISCOURSES, 2.18.11b-14',
  'reflection': 'The comedian Jerry Seinfeld once gave a young comic named Brad Isaac some advice about how to write and create material. Keep a calendar, he told him, and each day that you write jokes, put an X. Soon enough, you get a chain going—and then your job is to simply not break the chain. Success becomes a matter of momentum. Once you get a little, it’s easier to keep it going. Whereas Seinfel

### Now we create the batch JSON file for the OpenAI Batch job

We follow the instructions in the OpenAI Batch API guide: https://platform.openai.com/docs/guides/batch

In [8]:
import os
import base64
import json

# === CONFIGURATION ===

# Path to the folder with target images (images to process)
target_image_folder = "assets"

# Path to a fixed example image (used in every request)
example_image_path = "assets/page_152.png"

# Path to save the output .jsonl file
output_jsonl_path = "assets/stoic_requests.jsonl"

# The example JSON string (used in prompt)
example_json = """
{
  "05-12": {
    "title": "KINDNESS IS ALWAYS THE RIGHT RESPONSE",
    "quote": "“Kindness is invincible, but only when it's sincere, with no hypocrisy or faking. For what can even the most malicious person do if you keep showing kindness and, if given the chance, you gently point out where they went wrong - right as they are trying to harm you?”",
    "author": "MARCUS AURELIUS, MEDITATIONS, 11.18.5.9a",
    "reflection": "What if the next time you were treated meanly you didn't just restrain yourself from fighting back - what if you responded with unmitigated kindness? What if you could \"love your enemies, do good to those who hate you\"? What kind of effect do you think that would have? \nThe Bible says that when you can do something nice and caring to a hateful enemy, it is like \"heap[ing] burning coal on his head.\" The expected reaction to hatred is more hatred. When someone says something pointed or mean today, they expect you to respond in kind - not with kindness. When that doesn't happen, they are embarrassed. It's a shock to their system - it makes them and you better. \nMost rudeness, meanness, and cruelty are a mask for deep-seated weakness. Kindness in these situations is only possible for people of great strength. You have that strength. Use it."
  }
}
"""

# === HELPER FUNCTIONS ===

def encode_image_to_base64(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

# === MAIN PROCESSING ===

# Encode the static example image
example_image_b64 = encode_image_to_base64(example_image_path)

# Open the output file
with open(output_jsonl_path, "w") as outfile:
    for idx, filename in enumerate(sorted(os.listdir(target_image_folder))):
        if filename.lower().endswith((".png", ".jpg", ".jpeg")):
            image_path = os.path.join(target_image_folder, filename)
            target_image_b64 = encode_image_to_base64(image_path)

            messages = [
                {
                    "role": "system",
                    "content": "You are a helpful assistant that extracts structured data from images of book pages."
                },
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": "Here is an example image of a Stoic quote page:"},
                        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{example_image_b64}"}},
                        {"type": "text", "text": f"The extracted JSON for this page is:\n{example_json}"}
                    ]
                },
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": "Now, extract the JSON for this new image, following the same format if it is a Stoic quote page. If the image is not a quote page (for example, a title or copyright page), please return the string SKIP."},
                        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{target_image_b64}"}}
                    ]
                }
            ]

            request_payload = {
                "custom_id": f"request-{idx+1}",
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": "gpt-4o",
                    "messages": messages,
                    "max_tokens": 512,
                    "temperature": 0
                }
            }

            outfile.write(json.dumps(request_payload) + "\n")

print(f"✅ JSONL file created: {output_jsonl_path}")

✅ JSONL file created: assets/stoic_requests.jsonl
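One caveat with the loop above: `sorted(os.listdir(...))` sorts names as strings, so `page_10.png` comes before `page_2.png`, and the `idx`-based `custom_id` does not equal the page number. The sketch below shows one way to order pages numerically and derive an ID from the page number instead. `page_number` and `ordered_pages` are hypothetical helpers, and the `request-page-N` format is only an illustration.

```python
import re

def page_number(filename):
    """Extract the integer N from names like 'page_N.png'; None if no match."""
    match = re.fullmatch(r"page_(\d+)\.png", filename)
    return int(match.group(1)) if match else None

def ordered_pages(filenames):
    """Return (page_number, filename) pairs in true page order."""
    pages = [(page_number(name), name) for name in filenames]
    return sorted(pair for pair in pages if pair[0] is not None)

names = ["page_10.png", "page_2.png", "stoic_requests.jsonl", "page_1.png"]
for number, name in ordered_pages(names):
    print(f"request-page-{number}", name)
```

With IDs tied to page numbers, a result line can be traced back to its source image without relying on file order.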


### File size issue: the Batch API only allows a maximum file size of 200MB

Because every request embeds base64-encoded images, our request file came to about 320MB, well over that limit, so we split it into two files and try again.

In [20]:
import os

input_file = "assets/stoic_requests.jsonl"
max_lines_per_file = 220  # adjust based on average size per line

with open(input_file, 'r') as infile:
    lines = infile.readlines()

for i in range(0, len(lines), max_lines_per_file):
    chunk = lines[i:i+max_lines_per_file]
    chunk_filename = f"stoic_requests_part_{i//max_lines_per_file + 1}.jsonl"
    # Write each chunk into assets/, where the upload step below reads it from
    with open(os.path.join("assets", chunk_filename), 'w') as outfile:
        outfile.writelines(chunk)
    print(f"Wrote {chunk_filename}")

Wrote stoic_requests_part_1.jsonl
Wrote stoic_requests_part_2.jsonl
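A fixed line count works here because every request is roughly the same size. As an alternative, the sketch below splits on accumulated byte size, so the limit is respected even when request sizes vary. `split_by_size` is a hypothetical helper, and the `max_bytes` value you pass should leave some headroom under the 200MB limit mentioned above.

```python
def split_by_size(lines, max_bytes):
    """Group JSONL lines into chunks whose encoded size stays <= max_bytes."""
    chunks, current, current_size = [], [], 0
    for line in lines:
        size = len(line.encode("utf-8"))
        if size > max_bytes:
            raise ValueError("A single request exceeds the size limit.")
        # Start a new chunk when adding this line would cross the limit.
        if current and current_size + size > max_bytes:
            chunks.append(current)
            current, current_size = [], 0
        current.append(line)
        current_size += size
    if current:
        chunks.append(current)
    return chunks

# Three 41-byte lines with a 100-byte limit: two fit together, one spills over.
demo = ["a" * 40 + "\n", "b" * 40 + "\n", "c" * 40 + "\n"]
print([len(chunk) for chunk in split_by_size(demo, max_bytes=100)])
```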


In [54]:
from openai import OpenAI
client = OpenAI()

batch_input_file = client.files.create(
    file=open("assets/stoic_requests_part_2.jsonl", "rb"),
    purpose="batch"
)

print(batch_input_file)

FileObject(id='file-NAfWRkN2HVwcj61RX2ivAE', bytes=150049667, created_at=1747670171, filename='stoic_requests_part_2.jsonl', object='file', purpose='batch', status='processed', status_details=None, expires_at=None)


- Our file IDs are file-N9h3oEKMbgVkbtqHNPuGuJ (part 1) and file-NAfWRkN2HVwcj61RX2ivAE (part 2).
- We need these IDs when creating the batch jobs.
- So far, three requests have cost roughly 2 cents. Extrapolating to 400 pages (400 requests), I expect between 2 and 3 USD at regular pricing. The Batch API promises a 50% discount in exchange for waiting, so the cost should land between 1 and 1.50 USD. We have to wait and watch!
- **Final result**: splitting the file was an extra hoop, but the overall cost came to 1.44 USD, right in the range we had predicted. It makes good sense to use the Batch API.
- The response was also quite fast. It is a batch job, so not immediate, but both batches finished within 10 minutes.
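The back-of-the-envelope estimate in the notes can be written out as a small calculation. All inputs are the rough figures quoted above (about 2 cents for three requests, about 400 requests, a 50% batch discount); `estimate_cost` is a hypothetical helper.

```python
# Rough figures taken from the notes above.
observed_cost_usd = 0.02   # about 2 cents ...
observed_requests = 3      # ... for three trial requests
planned_requests = 400     # roughly one request per page
batch_discount = 0.50      # discount quoted for the Batch API

def estimate_cost(observed_cost, observed_n, planned_n, discount):
    """Scale an observed cost linearly, then apply the batch discount."""
    per_request = observed_cost / observed_n
    full_price = per_request * planned_n
    return full_price, full_price * (1 - discount)

full, discounted = estimate_cost(
    observed_cost_usd, observed_requests, planned_requests, batch_discount
)
print(f"Regular API estimate: ${full:.2f}")
print(f"Batch API estimate:   ${discounted:.2f}")
```

The linear scaling assumes every page costs about the same, which is reasonable here because each request carries the same example image and one target page.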

In [55]:
from openai import OpenAI
client = OpenAI()

batch_input_file_id = batch_input_file.id
client.batches.create(
    input_file_id=batch_input_file_id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
    metadata={
        "description": "stoic extraction job"
    }
)

Batch(id='batch_682b54bae3e88190827800445fae2b47', completion_window='24h', created_at=1747670202, endpoint='/v1/chat/completions', input_file_id='file-NAfWRkN2HVwcj61RX2ivAE', object='batch', status='validating', cancelled_at=None, cancelling_at=None, completed_at=None, error_file_id=None, errors=None, expired_at=None, expires_at=1747756602, failed_at=None, finalizing_at=None, in_progress_at=None, metadata={'description': 'stoic extraction job'}, output_file_id=None, request_counts=BatchRequestCounts(completed=0, failed=0, total=0))

### Batch Status Check

Now, let's check the status. The same code can be executed at any time to keep track.

- 1st batch request: batch_682b5019b17c8190808a26d7fede9ba5
- 2nd batch request: batch_682b54bae3e88190827800445fae2b47

In [56]:
batch_id_number = 'batch_682b54bae3e88190827800445fae2b47'

In [60]:
batch = client.batches.retrieve(batch_id_number)
print(batch)

Batch(id='batch_682b54bae3e88190827800445fae2b47', completion_window='24h', created_at=1747670202, endpoint='/v1/chat/completions', input_file_id='file-NAfWRkN2HVwcj61RX2ivAE', object='batch', status='completed', cancelled_at=None, cancelling_at=None, completed_at=1747670501, error_file_id=None, errors=None, expired_at=None, expires_at=1747756602, failed_at=None, finalizing_at=1747670486, in_progress_at=1747670271, metadata={'description': 'stoic extraction job'}, output_file_id='file-E6Q1YS8jaSTkdD4aX9BXfn', request_counts=BatchRequestCounts(completed=187, failed=0, total=187))
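Rather than re-running the status cell by hand, the check can be wrapped in a polling loop. The sketch below takes the retrieve function as a parameter so it can be exercised without network access. `wait_for_batch`, the polling interval, and the poll limit are assumptions for illustration; the status names are the ones the Batch API reports.

```python
import time

# Statuses after which a batch will not change any further.
TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}

def wait_for_batch(retrieve, batch_id, poll_seconds=30, max_polls=120,
                   sleep=time.sleep):
    """Poll retrieve(batch_id) until the batch reaches a terminal status."""
    for _ in range(max_polls):
        batch = retrieve(batch_id)
        if batch.status in TERMINAL_STATUSES:
            return batch
        sleep(poll_seconds)
    raise TimeoutError(f"Batch {batch_id} still running after {max_polls} polls.")

# Usage in this notebook (needs the `client` created earlier):
# batch = wait_for_batch(client.batches.retrieve, batch_id_number)
# print(batch.status, batch.output_file_id)
```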


### Retrieve batch job

Once the job has completed, the batch object includes an output_file_id field. Use this ID to retrieve the output.

- Part1 output_file_id=file-5bVxF77RWpnEzgRAAQpvx4
- Part2 output_file_id=file-E6Q1YS8jaSTkdD4aX9BXfn 

In [61]:
output_file_id='file-E6Q1YS8jaSTkdD4aX9BXfn'

file_response = client.files.content(output_file_id)

output_path = "assets/stoic_results_part_2.jsonl"

# Save the streamed content into a local file
with open(output_path, "wb") as f:
    f.write(file_response.read())

print(f"✅ Output saved to {output_path}")

✅ Output saved to assets/stoic_results_part_2.jsonl
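Each line of the saved results file carries the `custom_id` of the request it answers. The Batch API guide notes that output order is not guaranteed to match input order, so it is safer to look results up by `custom_id` than by line position. A minimal sketch follows, using made-up sample lines that mimic the structure read elsewhere in this notebook; `index_by_custom_id` and `fake_line` are hypothetical helpers.

```python
import json

def index_by_custom_id(jsonl_lines):
    """Map each request's custom_id to the model's reply text."""
    replies = {}
    for line in jsonl_lines:
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)
        message = record["response"]["body"]["choices"][0]["message"]
        replies[record["custom_id"]] = message["content"]
    return replies

def fake_line(custom_id, content):
    """Build a sample result line (illustrative data only)."""
    body = {"choices": [{"message": {"content": content}}]}
    return json.dumps({"custom_id": custom_id, "response": {"body": body}})

sample = [fake_line("request-2", "SKIP"), fake_line("request-1", "{}")]
replies = index_by_custom_id(sample)
print(replies["request-1"], replies["request-2"])
```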


### Check results

Read in the result for a randomly chosen row (one page) and see what the response looks like.

In [63]:
import random
import json

# Path to your jsonl file
jsonl_file_path = "assets/stoic_results_part_2.jsonl"

# Step 1: Load all lines
with open(jsonl_file_path, "r", encoding="utf-8") as f:
    lines = f.readlines()

# Step 2: Pick a random line
random_line = random.choice(lines)

# Step 3: Parse the JSON line
response_obj = json.loads(random_line)

# Step 4: Inspect the object
quote_json = response_obj["response"]["body"]["choices"][0]["message"]["content"]
print(quote_json)

```json
{
  "11-26": {
    "title": "THE ALTAR OF NO DIFFERENCE",
    "quote": "“We are like many pellets of incense falling on the same altar. Some collapse sooner, others later, but it makes no difference.”",
    "author": "MARCUS AURELIUS, MEDITATIONS, 4.15",
    "reflection": "What's the difference between you and the richest person in the world? One has a little more money than the other. What's the difference between you and the oldest person in the world? One has been around a little longer than the other. Same goes for the tallest, smartest, fastest, and on down the line. Measuring ourselves against other people makes acceptance difficult, because we want what they have, or we want how things could have gone, not what we happen to have. But that makes no difference. Some might see this line from Marcus as pessimistic, whereas others see it as optimistic. It's really just truth. We're all here and we're all going to leave this earth eventually, so let's not concern ourselves wit

### Putting it all together into a single JSON file

In [65]:
import json
import os

# List of input .jsonl files
input_files = [
    "assets/stoic_results_part_1.jsonl",
    "assets/stoic_results_part_2.jsonl"
]

# Output file
output_path = "stoic_quotes.json"

valid_quotes = []
skipped_lines = []

for input_path in input_files:
    with open(input_path, "r", encoding="utf-8") as f:
        for idx, line in enumerate(f, start=1):
            try:
                response_obj = json.loads(line)
                content = response_obj["response"]["body"]["choices"][0]["message"]["content"].strip()

                # Strip triple backticks and possible `json` marker
                if content.startswith("```json"):
                    content = content.removeprefix("```json").removesuffix("```").strip()
                elif content.startswith("```"):
                    content = content.removeprefix("```").removesuffix("```").strip()

                if content == "SKIP":
                    skipped_lines.append(f"{input_path} — Line {idx}: SKIP")
                    continue

                # Parse the inner string as JSON
                quote_data = json.loads(content)
                valid_quotes.append(quote_data)

            except Exception as e:
                skipped_lines.append(f"{input_path} — Line {idx}: Error - {str(e)}")

# Save all valid JSONs into one unified JSON file
with open(output_path, "w", encoding="utf-8") as out_f:
    json.dump(valid_quotes, out_f, indent=2, ensure_ascii=False)

# Summary
print(f"✅ Extracted {len(valid_quotes)} quotes from {len(input_files)} files.")
print(f"⚠️ Skipped {len(skipped_lines)} lines.")
if skipped_lines:
    print("\nDetails of skipped lines:")
    for line in skipped_lines:
        print(line)

✅ Extracted 367 quotes from 2 files.
⚠️ Skipped 40 lines.

Details of skipped lines:
assets/stoic_results_part_1.jsonl — Line 1: SKIP
assets/stoic_results_part_1.jsonl — Line 2: SKIP
assets/stoic_results_part_1.jsonl — Line 11: SKIP
assets/stoic_results_part_1.jsonl — Line 13: SKIP
assets/stoic_results_part_1.jsonl — Line 24: SKIP
assets/stoic_results_part_1.jsonl — Line 35: SKIP
assets/stoic_results_part_1.jsonl — Line 45: SKIP
assets/stoic_results_part_1.jsonl — Line 46: SKIP
assets/stoic_results_part_1.jsonl — Line 47: SKIP
assets/stoic_results_part_1.jsonl — Line 82: SKIP
assets/stoic_results_part_1.jsonl — Line 112: SKIP
assets/stoic_results_part_1.jsonl — Line 117: SKIP
assets/stoic_results_part_1.jsonl — Line 152: SKIP
assets/stoic_results_part_1.jsonl — Line 187: SKIP
assets/stoic_results_part_1.jsonl — Line 188: SKIP
assets/stoic_results_part_2.jsonl — Line 2: SKIP
assets/stoic_results_part_2.jsonl — Line 3: SKIP
assets/stoic_results_part_2.jsonl — Line 39: SKIP
assets/stoic_r

Wow! Exactly 367 quotes were extracted, which is the icing on the cake. There were no errors as such!
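A count alone does not show whether every calendar day appears exactly once, so a quick coverage check can be useful. The sketch below merges the single-key dicts into one date-keyed dict and reports repeated and missing days. It assumes keys follow the `MM-DD` format of the one-shot example and uses 2024 only as an arbitrary leap year so that `02-29` is included. `merge_by_date` and `missing_days` are hypothetical helpers, and the sample data is made up.

```python
from datetime import date, timedelta

def merge_by_date(entries):
    """Merge single-key {"MM-DD": {...}} dicts; also return repeated keys."""
    merged, duplicates = {}, []
    for entry in entries:
        for key, value in entry.items():
            if key in merged:
                duplicates.append(key)
            merged[key] = value
    return merged, duplicates

def missing_days(merged):
    """List the MM-DD keys of a leap year that are absent from merged."""
    day, expected = date(2024, 1, 1), []
    while day.year == 2024:
        expected.append(day.strftime("%m-%d"))
        day += timedelta(days=1)
    return [key for key in expected if key not in merged]

sample = [{"05-12": {"title": "A"}}, {"05-16": {"title": "B"}}, {"05-12": {"title": "A"}}]
merged, duplicates = merge_by_date(sample)
print(len(merged), duplicates, len(missing_days(merged)))
```

Running the same two functions on `valid_quotes` would show whether any date was extracted twice or is absent.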