# Input Selection
Select an `index` value from 0 to 14 to choose which row of the insights dataset to validate.

In [1]:
# Assign the index manually if needed
# index = int(input("Enter an index value (0-14): "))
index = 0
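
If the index is entered interactively, a bounds check can catch an out-of-range value before any model calls are made. The following is a minimal sketch, assuming the dataset has 15 rows (0 to 14) as stated above; `validate_index` and its `n_rows` parameter are hypothetical and not part of `verifyai`.

```python
def validate_index(index: int, n_rows: int = 15) -> int:
    """Return index unchanged if it is a valid row position, else raise.

    n_rows defaults to 15 because the notebook states the range 0-14;
    this is an assumption about the dataset size, not read from the file.
    """
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(index, bool) or not isinstance(index, int):
        raise TypeError(f"index must be an int, got {type(index).__name__}")
    if not 0 <= index < n_rows:
        raise ValueError(f"index must be between 0 and {n_rows - 1}, got {index}")
    return index


index = validate_index(0)
```

Failing early here is cheaper than discovering a bad row after the first LLM request.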

In [2]:
# Required for PydanticAI to work with Jupyter (nested event loops)
import nest_asyncio
nest_asyncio.apply()

In [3]:
from verifyai.classifiers import classify_insurance_image
from verifyai.parsers import parse_assertion, parse_evidence, check_grammar
from verifyai.evaluation import consolidate_evaluations
from IPython.display import display, Markdown

import pandas as pd
import logfire
import os
from verifyai.helpers import (
    load_images_from_directory,
    create_consolidated_image,
    export_evaluation_to_markdown,
)
from verifyai.models import (
    InsightPlots,
    InputModel,
)

if os.getenv("PYDANTIC_LOGFIRE_TOKEN"):
    logfire.configure(token=os.getenv("PYDANTIC_LOGFIRE_TOKEN"))
    logfire.instrument_openai()
    logfire.instrument_anthropic()

[1mLogfire[0m project URL: ]8;id=7193;https://logfire.pydantic.dev/xmandeng/verifyai\[4;36mhttps://logfire.pydantic.dev/xmandeng/verifyai[0m]8;;\


# Insight Validation
- Load datasets
- Collect images
- Parse Insight
  - conclusion
  - supporting premises
- Evaluate premises
- Check grammar

In [4]:
insights_df = pd.read_excel("../data/Insights.xlsx")

input_model = InputModel(
    name=insights_df["programname"][index],
    insight=insights_df["insight"][index],
    line_of_business=insights_df["line_of_business"][index],
)

assertion = await parse_assertion(input_model)
insight = await parse_evidence(assertion)

# Retrieve data sets
lrs = pd.read_excel("../data/lrs.xlsx")
lrs_data = lrs.loc[lrs["programname"] == insight.name].drop(columns=["programname"])

images = InsightPlots(plots=load_images_from_directory(f"../data/{insight.name.replace('/', '-')}"))
consolidated_image = create_consolidated_image(lrs_data, images, insight)
premises = await classify_insurance_image(consolidated_image, insight.evidence)
grammar = await check_grammar(insight)
final_evaluation = await consolidate_evaluations(insight, premises, grammar)

02:19:07.642 Chat Completion with 'gpt-4o-mini' [LLM]
02:19:09.851 Chat Completion with 'gpt-4o-mini' [LLM]
02:19:13.883 Message with 'claude-3-7-sonnet-latest' [LLM]
02:19:32.369 Message with 'claude-3-5-sonnet-latest' [LLM]
02:19:35.525 Message with 'claude-3-7-sonnet-latest' [LLM]
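
The cell above maps a program name to its plot folder by replacing `/` with `-`. The sketch below isolates that step and adds a guard for a missing folder. It is only an illustration: `plots_directory` and `list_plot_files` are hypothetical helpers, the `../data` root is copied from the cell, and the `.png` filter is an assumption based on the file names mentioned in the report (`pricing.png`, `frequency.png`, `severity.png`). How `load_images_from_directory` actually reads files is not shown in this notebook.

```python
from pathlib import Path


def plots_directory(program_name: str, data_root: str = "../data") -> Path:
    """Map a program name to its plot folder, replacing '/' as the notebook does."""
    safe_name = program_name.replace("/", "-")
    return Path(data_root) / safe_name


def list_plot_files(directory: Path) -> list[Path]:
    """Return PNG files in sorted order, or an empty list if the folder is missing."""
    if not directory.is_dir():
        return []
    return sorted(directory.glob("*.png"))


# Hypothetical program name containing a slash.
folder = plots_directory("Auto/Liability")
plot_files = list_plot_files(folder)
```

Returning an empty list for a missing folder lets the caller decide whether to stop or continue without plots.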


# Show Validation Results
- Display report
- Write markdown to `../reports/` folder
---

In [5]:
display(Markdown("---"))
display(Markdown("# Overall Assessment\n\n---"))
display(Markdown(f'#### "{insight.insight}"<br>'))
display(Markdown(f"## {final_evaluation.overall_valid}"))
display(Markdown(final_evaluation.reasoning))

display(Markdown("---"))
display(Markdown("## Conclusion\n\n"))
display(Markdown(f'#### "{insight.conclusion}"<br><br>'))

display(Markdown("---"))
display(Markdown("## Premises\n\n"))

if not premises:
    display(Markdown("#### No premises found"))

else:
    for i, premise in enumerate(premises):
        display(Markdown(f"#### {i+1}. {premise.claim}"))
        display(Markdown(f"**Status** <br>{premise.status} <br>{premise.confidence} confidence"))
        display(Markdown(f"**Rationale** <br>{premise.reasoning}<br><br>"))

display(Markdown("---"))
if grammar.errors:
    error_list = "\n".join([f"- {error}" for error in grammar.errors])
    display(Markdown("## Grammar\n\n" + error_list))
else:
    display(Markdown("## Grammar\n\n"))
    display(Markdown("#### No grammatical errors found"))

---

# Overall Assessment

---

#### "This is a poorly performing auto liability program (EULR > 0.7) with stable pricing and similar performance for frequency and severity of claims across underwriting years."<br>

## False

The insight contains significant inaccuracies. While it correctly identifies high EULR values (>0.7) and similar claim frequency performance across years, it incorrectly characterizes the program as having "stable pricing" when the evidence shows considerable volatility in pricing with significant fluctuations and spikes. Additionally, the claim about similar severity performance is only partially true, as there are notable divergences in severity patterns between different years, especially after day 700. These factual errors undermine the overall validity of the insight despite some accurate observations.

---

## Conclusion



#### "This is a poorly performing auto liability program"<br><br>

---

## Premises



#### 1. The Expected Ultimate Loss Ratio (EULR) is greater than 0.7.

**Status** <br>Partially True <br>Medium confidence

**Rationale** <br>From the LRS Data table at the top of the image, we can see the 'ulf' (which appears to be Ultimate Loss Factor or Ultimate Loss Ratio) values for multiple years. For 2021, the ulf is 0.99, for 2022 it's 0.86, and for 2023 it's 0.76. All of these values are indeed greater than 0.7. However, the 2024 value is marked as 'nan' (not a number), indicating no data is available yet. Since the claim doesn't specify a time period and one year has missing data, I've marked this as partially true rather than completely true.<br><br>

#### 2. The program has stable pricing.

**Status** <br>False <br>High confidence

**Rationale** <br>The 'pricing.png' chart shows 'Written Premium per Risk and Days Covered' over time with a 360 Rolling Median. The chart displays significant fluctuations in pricing throughout the shown period, with multiple noticeable spikes (especially around Treaty 0, Treaty 2, and Treaty 3 markers). The premium values range from approximately 100 to over 500, showing considerable volatility rather than stability. There are periods of relative stability between the spikes, but the overall pattern demonstrates that pricing is not stable across the program's lifetime.<br><br>

#### 3. There is similar performance for frequency of claims.

**Status** <br>True <br>High confidence

**Rationale** <br>The 'frequency.png' chart shows claim frequency per policy over days. The lines for different years (color-coded as 1, 2, 3, and 4 in the legend, which appear to correspond to years 2021-2024) follow very similar trajectories. All lines start near zero and increase in a similar pattern, eventually plateauing around 1.4-1.5 frequency. While there are minor variations between the years, the overall pattern and trend of claim frequency performance is remarkably consistent across the observed periods, supporting the statement about similar performance.<br><br>

#### 4. There is similar performance for severity of claims.

**Status** <br>Partially True <br>Medium confidence

**Rationale** <br>The 'severity.png' chart shows large claims/total claims data. There are visible differences in severity patterns between different years (represented by colored lines 1-4). While all lines show an upward trend over time, there are notable divergences, especially after around day 700 where one line (likely representing 2021) continues climbing more steeply while another line (likely 2022) plateaus at a lower level. The early performance (first ~350 days) shows more similarity between years with some volatility. Given these mixed patterns - similar directional trends but different magnitudes - the statement is partially true.<br><br>

---

## Grammar



#### No grammatical errors found
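
The premises section of the report above can also be built as a single Markdown string, which is easier to test than a series of `display` calls. This is a rough sketch: the `Premise` dataclass is a hypothetical stand-in that mirrors only the attributes the notebook reads (`claim`, `status`, `confidence`, `reasoning`), and `premises_to_markdown` is not a `verifyai` function.

```python
from dataclasses import dataclass


@dataclass
class Premise:
    # Hypothetical stand-in for the objects returned by the classifier.
    claim: str
    status: str
    confidence: str
    reasoning: str


def premises_to_markdown(premises: list[Premise]) -> str:
    """Render premises in the same layout as the notebook's display loop."""
    if not premises:
        return "#### No premises found"
    blocks = []
    for number, premise in enumerate(premises, start=1):
        blocks.append(
            f"#### {number}. {premise.claim}\n\n"
            f"**Status** <br>{premise.status} <br>{premise.confidence} confidence\n\n"
            f"**Rationale** <br>{premise.reasoning}"
        )
    return "\n\n".join(blocks)


example = premises_to_markdown(
    [Premise("The program has stable pricing.", "False", "High", "Pricing fluctuates.")]
)
```

The resulting string can be passed to `display(Markdown(...))` once, or written to a file unchanged.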

In [6]:

# Save to folder
md_result = export_evaluation_to_markdown(insight, final_evaluation, premises, grammar)
display(Markdown(f"**{md_result}**"))

**Markdown report saved to: ../reports/evaluation_pdl6muw8mJl9DL7bVO40nFOroodOnSFBG5e7zw+nAW32k7BiKehq6oLHwyItBjfw.md**