
## SecureFinAI Contest 2025 Task 3: **FinGPT-Powered Agents for MultiModal Financial Data** 

### Objective
Participants will develop **FinGPT agents** capable of:
- Processing **financial document images** (in base64-encoded PNG format).  
- Converting images to **structured HTML** format (OCR + HTML generation task).  
- Note: This is an image-to-HTML conversion task, not a QA or reasoning task.  
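
The input format can be sanity-checked before any OCR runs. The sketch below uses only the standard library to decode a base64 string, verify the PNG signature, and read the width and height from the IHDR chunk. The helper names `png_size_from_base64` and `make_png_header` are hypothetical and not part of the contest starter kit, and the dimensions in the sample are arbitrary.

```python
import base64
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def png_size_from_base64(b64_img: str) -> tuple[int, int]:
    """Return (width, height) of a base64-encoded PNG, or raise ValueError."""
    raw = base64.b64decode(b64_img)
    if not raw.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG image")
    # After the 8-byte signature: 4-byte length, 4-byte chunk type, then IHDR data.
    if len(raw) < 24 or raw[12:16] != b"IHDR":
        raise ValueError("missing IHDR chunk")
    width, height = struct.unpack(">II", raw[16:24])
    return width, height

def make_png_header(width: int, height: int) -> bytes:
    """Build only the signature and IHDR chunk (enough for the size check, not a viewable image)."""
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    chunk = b"IHDR" + ihdr
    crc = struct.pack(">I", zlib.crc32(chunk))
    return PNG_SIGNATURE + struct.pack(">I", len(ihdr)) + chunk + crc

# Arbitrary dimensions, used only to exercise the helper.
sample_b64 = base64.b64encode(make_png_header(1653, 2339)).decode("ascii")
print(png_size_from_base64(sample_b64))
```

Every base64-encoded PNG starts with `iVBORw0KGgo`, the encoding of the PNG signature, so a quick prefix check on the `image` column is often enough to catch corrupted rows.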

### Ground Truth
The datasets `TheFinAI/SecureFinAI_Contest_2025-Task_3_EnglishOCR` and `TheFinAI/SecureFinAI_Contest_2025-Task_3_SpanishOCR` each provide two key columns:
- `image` → base64-encoded PNG image (input).  
- `matched_html` → structured HTML generated from OCR + postprocessing (ground truth).  

### Evaluation
Your model’s predicted HTML is evaluated against `matched_html` using **ROUGE-1**.
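
To make the metric concrete, here is a minimal sketch of a ROUGE-1 F1 score computed as unigram overlap. It is a simplification: the tokenizer is a plain regular expression and no stemming is applied, so its numbers will not exactly match the `evaluate` implementation used later in this notebook. The function name `rouge1_f1` and the two sample strings are made up for illustration.

```python
import re
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between two strings (simplified ROUGE-1)."""
    # Lowercase and keep alphanumeric runs; HTML tag names therefore count as tokens.
    pred_tokens = re.findall(r"[a-z0-9]+", prediction.lower())
    ref_tokens = re.findall(r"[a-z0-9]+", reference.lower())
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

pred = "<p>Total revenue for Q1 was $1.2B.</p>"
ref = "<p>Total revenue for Q1 was $1.3B.</p>"
print(round(rouge1_f1(pred, ref), 4))
```

Because markup is tokenized along with the text, a prediction that reproduces the reference tags earns credit for them, and one that adds unrelated tags loses precision.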

---


## 0. Config

In [18]:
hf_repo = "TheFinAI/SecureFinAI_Contest_2025-Task_3_EnglishOCR" # hf repo of OCR dataset
pred_dir = "./preds_baseline"                 # Output directory for predictions
model_name = "baseline"                        # Used in prediction filenames
lang = "en"
eval_output = "./eval_rouge_baseline.csv"
max_samples = 5  # Set None for all rows
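
The config sets `lang = "en"`, while the baseline agent later passes the Tesseract code `"eng"` directly. One way to keep the two in sync when switching to the Spanish dataset is a small lookup, sketched below. The `"es"` key, the `LANG_SETTINGS` table, and `resolve_lang` are assumptions for illustration; `spa` is Tesseract's Spanish language code and needs the matching language data installed.

```python
# Hypothetical lookup: config `lang` -> (Hugging Face repo, Tesseract language code).
LANG_SETTINGS = {
    "en": ("TheFinAI/SecureFinAI_Contest_2025-Task_3_EnglishOCR", "eng"),
    "es": ("TheFinAI/SecureFinAI_Contest_2025-Task_3_SpanishOCR", "spa"),
}

def resolve_lang(lang: str) -> tuple[str, str]:
    """Return (repo, tesseract_lang) for a config language, or raise ValueError."""
    try:
        return LANG_SETTINGS[lang]
    except KeyError:
        raise ValueError(
            f"unsupported lang {lang!r}; expected one of {sorted(LANG_SETTINGS)}"
        ) from None

repo, tess_lang = resolve_lang("es")
print(repo, tess_lang)
```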

## 1. Environment Setup

In [9]:
from datasets import load_dataset
import os, base64, io, re, math
import pandas as pd
from tqdm import tqdm
from PIL import Image

# OCR engine
try:
    import pytesseract
except ImportError:
    !pip -q install pytesseract pillow
    import pytesseract

# Rouge evaluation requires 'evaluate'
try:
    import evaluate
except ImportError:
    !pip -q install evaluate
    import evaluate

print("Pandas:", pd.__version__)
print("Pytesseract:", pytesseract.get_tesseract_version())


Pandas: 1.4.2
Pytesseract: 4.1.1


## 2. Load Dataset

In [19]:
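try:
    dataset = load_dataset(hf_repo, split="test")
    # Convert to pandas DataFrame for convenience
    data = dataset.to_pandas()
except Exception as e:
    # Fall back to a toy dataset if the repo cannot be loaded.
    # All columns must have the same length, so the images are empty strings.
    data = pd.DataFrame({
        "image": ["", ""],
        # Plain text of the same snippets; the evaluation cell reads this column.
        "text": [
            "Total revenue for Q1 was $1.2B.",
            "Operating income increased by 12% year-over-year."
        ],
        "matched_html": [
            "<html><body><p>Total revenue for Q1 was $1.2B.</p></body></html>",
            "<html><body><p>Operating income increased by 12% year-over-year.</p></body></html>"
        ]
    })
    print(f"'{hf_repo}' could not be loaded ({e}); created toy data")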


if max_samples is not None and len(data) > max_samples:
    data = data.head(max_samples)

print("Rows:", len(data))
data.head(2)


Rows: 5


|   | image | text | matched_html |
|---|-------|------|--------------|
| 0 | `iVBORw0KGgoAAAANSUhEUgAABnUAAAkjCAIAAABYnF2RAA...` | `Last Name\nFirst Name\nMiddle Name\nFaltysova\...` | `<p class="SectionTitle">9. Type(s) of Securiti...` |
| 1 | `iVBORw0KGgoAAAANSUhEUgAABnUAAAkjCAIAAABYnF2RAA...` | `SECURITIES AND EXCHANGE COMMISSION \nWashingto...` | `<div class="contentwrapper"><table id="scheade...` |


## 3. Baseline Agent (OCR → HTML)

In [4]:

def baseline_agent_from_image(b64_img: str) -> str:
    """Decode base64 image → OCR → wrap into structured HTML."""
    if not isinstance(b64_img, str) or not b64_img.strip():
        return "<html><body><p></p></body></html>"
    try:
        img_data = base64.b64decode(b64_img)
        img = Image.open(io.BytesIO(img_data)).convert("RGB")
        text = pytesseract.image_to_string(img, lang="eng")
    except Exception:
        # Decoding or OCR failed: fall back to an empty prediction.
        text = ""
    return f"<html><body><p>{text.strip()}</p></body></html>"
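
The baseline places the raw OCR text into a single `<p>` without escaping it, so characters such as `&` or `<` end up as invalid markup. A minimal sketch of a slightly more careful wrapper follows: it escapes the text and emits one `<p>` per blank-line-separated block. The function name `text_to_html_paragraphs` is hypothetical, and whether this changes the ROUGE-1 score has not been measured here.

```python
import html
import re

def text_to_html_paragraphs(text: str) -> str:
    """Escape OCR text and wrap each blank-line-separated block in its own <p>."""
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]
    if not blocks:
        # Same empty document the baseline returns for missing input.
        return "<html><body><p></p></body></html>"
    paragraphs = "".join("<p>" + html.escape(block) + "</p>" for block in blocks)
    return f"<html><body>{paragraphs}</body></html>"

# Made-up OCR output with characters that need escaping.
ocr_text = "Revenue & costs\nQ1 < Q2\n\nOperating income increased."
print(text_to_html_paragraphs(ocr_text))
```

It could replace the final `return` of `baseline_agent_from_image` by passing the OCR `text` through this helper.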


## 4. Generate Predictions

In [5]:
os.makedirs(pred_dir, exist_ok=True)

written = 0
for i in tqdm(data.index, desc="Predicting"):
    b64_img = data.loc[i, "image"] if "image" in data.columns else ""
    pred_html = baseline_agent_from_image(b64_img)
    out_path = os.path.join(pred_dir, f"{model_name}_pred_{i}.html")
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(pred_html)
    written += 1

print(f"Wrote {written} predictions to {pred_dir}")


Wrote 5 predictions to ./preds_baseline





## 5. Evaluation

In [None]:
def run_html_eval(df, pred_dir, model_name, lang, output_csv):
    """
    Evaluate predicted structured HTML against ground truth matched_html using ROUGE-1.
    """
    import pandas as pd, os
    import evaluate

    rouge = evaluate.load("rouge")

    results = []

    for i in df.index:
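        # NOTE: the contest metric is defined against `matched_html`, but this
        # demo run scores against the plain-text `text` column instead.
        # Swap the two lines below to score against the HTML ground truth.
        # gt_html = df.loc[i, "matched_html"]
        gt_html = df.loc[i, "text"]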
        pred_path = os.path.join(pred_dir, f"{model_name}_pred_{i}.html")
        
        if not os.path.exists(pred_path):
            results.append({"index": i, "ROUGE-1": None})
            continue

        with open(pred_path, "r", encoding="utf-8") as f:
            pred_html = f.read()

        score = rouge.compute(
            predictions=[pred_html],
            references=[gt_html],
            use_stemmer=True
        )["rouge1"]

        results.append({"index": i, "ROUGE-1": score})

    df_out = pd.DataFrame(results)
    df_out.to_csv(output_csv, index=False)
    return df_out
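
The function above reads its reference from the `text` column, while the contest metric is defined on `matched_html`. When comparing scores from the two references, it can help to see how much of an HTML string is visible text rather than markup. The sketch below extracts the text with the standard library's `html.parser`; it is a diagnostic aid under that assumption, not part of the contest metric, and the names `_TextCollector` and `html_to_text` are hypothetical.

```python
from html.parser import HTMLParser

class _TextCollector(HTMLParser):
    """Collect the non-empty text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())

def html_to_text(markup: str) -> str:
    """Return the visible text of `markup`, with text nodes joined by spaces."""
    collector = _TextCollector()
    collector.feed(markup)
    collector.close()
    return " ".join(collector.parts)

sample = "<html><body><p>Total revenue for Q1 was $1.2B.</p></body></html>"
print(html_to_text(sample))
```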


In [13]:
df_eval = run_html_eval(data, pred_dir, model_name, lang, eval_output)
df_eval.head()

|   | index | ROUGE-1 |
|---|-------|----------|
| 0 | 0 | 0.959906 |
| 1 | 1 | 0.559486 |
| 2 | 2 | 0.586667 |
| 3 | 3 | 0.815152 |
| 4 | 4 | 0.930796 |


## 6. Summary

In [15]:
if len(df_eval):
    macro_rouge1 = df_eval["ROUGE-1"].dropna().mean() if not df_eval["ROUGE-1"].dropna().empty else float('nan')
    print(f"Average ROUGE-1: {macro_rouge1:.4f}" if not math.isnan(macro_rouge1) else "No valid ROUGE scores.")
else:
    print("No evaluation results.")

print("End of Demo!")

Average ROUGE-1: 0.7704
End of Demo!
