In [1]:
# Required for PydanticAI to work with Jupyter (nested event loops)
import nest_asyncio
nest_asyncio.apply()

In [2]:
from pastel.classifiers import classify_insurance_image
from pastel.parsers import parse_assertion, parse_evidence, check_grammar
from pastel.evaluation import consolidate_evaluations
from IPython.display import display, Markdown

import pandas as pd
import logfire
import os
from pastel.helpers import (
    load_images_from_directory,
    create_consolidated_image,
    export_evaluation_to_markdown,
)
from pastel.models import (
    InsightPlots,
    InputModel,
)

if os.getenv("PYDANTIC_LOGFIRE_TOKEN"):
    logfire.configure(token=os.getenv("PYDANTIC_LOGFIRE_TOKEN"))
    logfire.instrument_openai()
    logfire.instrument_anthropic()

[1mLogfire[0m project URL: ]8;id=918853;https://logfire.pydantic.dev/xmandeng/pastel\[4;36mhttps://logfire.pydantic.dev/xmandeng/pastel[0m]8;;\


# Input Selection
Select an `index` value from 0 to 14 to choose which insight to evaluate.

In [3]:
# Select from 0 to 14
index = 0
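
An out-of-range `index` would only surface later, as a lookup error on the insights table. The guard below is a minimal sketch of failing early instead. `validate_index` is a hypothetical helper (not part of `pastel`), and the row count of 15 is an assumption inferred from the stated 0 to 14 range.

```python
def validate_index(index: int, n_rows: int) -> int:
    """Return `index` unchanged if it addresses a row, otherwise raise.

    Hypothetical helper for illustration; not part of the pastel package.
    """
    # bool is a subclass of int, so reject it explicitly
    if isinstance(index, bool) or not isinstance(index, int):
        raise TypeError(f"index must be an int, got {type(index).__name__}")
    if not 0 <= index < n_rows:
        raise IndexError(f"index must be between 0 and {n_rows - 1}, got {index}")
    return index


# Assumed row count: 15, matching the 0 to 14 range above
checked_index = validate_index(0, n_rows=15)
```

In the notebook itself, `n_rows` could be taken from `len(insights_df)` once the workbook is loaded, rather than hard-coded.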

# Insight Validation
- Load datasets
- Collect images
- Parse Insight
  - conclusion
  - supporting premises
- Evaluate premises
- Check grammar

In [4]:
insights_df = pd.read_excel("../data/Insights.xlsx")

input_model = InputModel(
    name=insights_df["programname"][index],
    insight=insights_df["insight"][index],
    line_of_business=insights_df["line_of_business"][index],
)

assertion = await parse_assertion(input_model)
insight = await parse_evidence(assertion)

# Retrieve data sets
lrs = pd.read_excel("../data/lrs.xlsx")
lrs_data = lrs.loc[lrs["programname"] == insight.name].drop(columns=["programname"])

images = InsightPlots(plots=load_images_from_directory(f"../data/{insight.name.replace('/', '-')}"))
consolidated_image = create_consolidated_image(lrs_data, images, insight)
premises = await classify_insurance_image(consolidated_image, insight.evidence)
grammar = await check_grammar(insight)
final_evaluation = await consolidate_evaluations(insight, premises, grammar)

15:49:12.362 Chat Completion with 'gpt-4o-mini' [LLM]
15:49:13.487 Chat Completion with 'gpt-4o-mini' [LLM]
15:49:15.127 Message with 'claude-3-7-sonnet-latest' [LLM]
15:49:28.842 Message with 'claude-3-5-sonnet-latest' [LLM]
15:49:30.072 Message with 'claude-3-7-sonnet-latest' [LLM]
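
In the cell above, the premise classification and the grammar check are awaited one after the other, although neither appears to consume the other's result. If they really are independent, they could be awaited together. The following is a minimal sketch of that pattern using `asyncio.gather`; `fake_classify` and `fake_grammar` are stand-in coroutines invented for illustration and do not reflect the real `pastel` signatures or return types.

```python
import asyncio


async def fake_classify(evidence: list[str]) -> list[str]:
    """Stand-in for an LLM-backed premise classifier (hypothetical)."""
    await asyncio.sleep(0.01)  # simulate network latency
    return [f"checked: {claim}" for claim in evidence]


async def fake_grammar(text: str) -> list[str]:
    """Stand-in for an LLM-backed grammar check (hypothetical)."""
    await asyncio.sleep(0.01)  # simulate network latency
    return []  # no errors found


async def run_checks(evidence: list[str], text: str) -> tuple[list[str], list[str]]:
    # gather schedules both coroutines concurrently and returns
    # their results in the order the awaitables were passed in
    premises, grammar = await asyncio.gather(
        fake_classify(evidence),
        fake_grammar(text),
    )
    return premises, grammar
```

Inside a notebook cell this would be called as `premises, grammar = await run_checks(...)`; in a plain script, `asyncio.run(run_checks(...))` does the same. Whether the saving is worthwhile depends on how long each call takes and on any provider rate limits.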


# Show Validation Results
- Display report
- Write markdown to `../reports/` folder
---

In [14]:
display(Markdown("---"))
display(Markdown("# Overall Assessment\n\n---"))
display(Markdown(f'#### "{insight.insight}"<br>'))
display(Markdown(f"## {final_evaluation.overall_valid}"))
display(Markdown(final_evaluation.reasoning))

display(Markdown("---"))
display(Markdown("## Conclusion\n\n"))
display(Markdown(f'#### "{insight.conclusion}"<br><br>'))

display(Markdown("---"))
display(Markdown("## Premises\n\n"))

if not premises:
    # A space after the hashes is required for Markdown to render a heading
    display(Markdown("#### No premises found"))

else:
    for i, premise in enumerate(premises):
        display(Markdown(f"#### {i+1}. {premise.claim}"))
        display(Markdown(f"**Status** <br>{premise.status} <br>{premise.confidence} confidence"))
        display(Markdown(f"**Rationale** <br>{premise.reasoning}<br><br>"))

display(Markdown("---"))
if grammar.errors:
    error_list = "\n".join([f"- {error}" for error in grammar.errors])
    display(Markdown("## Grammar\n\n" + error_list))
else:
    display(Markdown("## Grammar\n\n"))
    display(Markdown("#### No grammatical errors found"))

---

# Overall Assessment

---

#### "This is a poorly performing auto liability program (EULR > 0.7) with stable pricing and similar performance for frequency and severity of claims across underwriting years."<br>

## True

The insight is generally valid, though with some qualifications. The core claim about this being a poorly performing auto liability program with EULR > 0.7 is strongly supported by data showing values of 0.99, 0.86, and 0.76 across years. The claim about stable pricing is partially true - while there is baseline consistency, there are notable periodic adjustments. The claim about similar performance for frequency of claims is fully supported, with almost identical trajectories across policy years. The claim about similar severity performance is partially true - there are similarities but with more significant variations, especially after day 700. There are no grammatical errors in the insight. Overall, the main assertions are either true or partially true, with no false claims, making the insight valid with minor caveats about the degree of pricing stability and severity performance similarity.

---

## Conclusion



#### "This is a poorly performing auto liability program"<br><br>

---

## Premises



#### 1. The Expected Ultimate Loss Ratio (EULR) is greater than 0.7.

**Status** <br>True <br>High confidence

**Rationale** <br>The LRS Data table clearly shows the 'uf' column which appears to represent the Ultimate Loss Ratio. For 2021, the value is 0.99, for 2022 it's 0.86, and for 2023 it's 0.76. All of these values are greater than 0.7. Even though 2024 shows 'nan' (likely because it's incomplete data), the historical data strongly supports that the EULR is consistently above 0.7.<br><br>

#### 2. The program has stable pricing.

**Status** <br>Partially True <br>Medium confidence

**Rationale** <br>The pricing chart (titled 'Written Premium per Risk and Days Covered') shows some fluctuations over time. While there is a baseline consistency with the 360 Rolling Median hovering around 200-300 range, there are notable spikes throughout the timeline, particularly around the Treaty markers. These spikes indicate periodic pricing adjustments rather than complete stability. The pricing appears to remain within a general range but with regular adjustments, suggesting partial stability with planned periodic changes.<br><br>

#### 3. There is similar performance for frequency of claims.

**Status** <br>True <br>High confidence

**Rationale** <br>The frequency chart ('CLAIM FREQUENCY PER POLICY') shows very similar patterns across different policy years (indicated by the colored lines 1-4). All lines follow almost identical trajectories, starting at 0 and increasing steadily over time toward a plateau of approximately 1.5 frequency per policy. The lines largely overlap, especially in the earlier periods, indicating consistent claim frequency performance across different policy years.<br><br>

#### 4. There is similar performance for severity of claims.

**Status** <br>Partially True <br>Medium confidence

**Rationale** <br>The severity chart ('LARGE CLAIMS/TOTAL CLAIMS') shows some similarity but with more variation than the frequency chart. There are two main trend lines (likely representing different claim size brackets: >50K and >100K). While the overall pattern is similar across years, there are more noticeable divergences, particularly after day 700 where the lines separate more distinctly. The general trend is comparable, but the variations are significant enough that it can only be considered partially similar in performance.<br><br>

---

## Grammar



#### No grammatical errors found

In [6]:

md_result = export_evaluation_to_markdown(insight, final_evaluation, premises, grammar)
display(Markdown(f"**{md_result}**"))

**Markdown report saved to: ../reports/evaluation_pdl6muw8mJl9DL7bVO40nFOroodOnSFBG5e7zw+nAW32k7BiKehq6oLHwyItBjfw.md**