terallite/yasumail

yasumail

Synthetic Japanese business email generator for ML training data. Produces labeled emails (JSONL) with line-by-line annotations for training email parsers.

Installation

git clone <repo-url>
cd yasumail
uv sync

Usage

Generate emails

# Generate 1000 emails to file
yasumail generate -n 1000 --seed 42 -o training.jsonl

# With noise injection (0.0-1.0)
yasumail generate -n 1000 --noise-level 0.5 -o noisy.jsonl

# Specific template only
yasumail generate -n 100 -t reply_chain -o replies.jsonl

Preview a sample

yasumail sample --seed 42

Output:

============================================================
Template: formal_full
Formality: formal
============================================================

[GREETING ]      お世話になっております。
[OTHER    ]
[BODY     ]      先日ご依頼いただいた件について、ご報告いたします。
[BODY     ]
[CLOSING  ]      何卒よろしくお願い申し上げます。
[SIGNATURE]
[SIGNATURE]      株式会社テスト
[SIGNATURE]      田中太郎

============================================================

View statistics

yasumail stats training.jsonl

Output:

=== Dataset Statistics ===
Total emails: 1000
Total lines: 17305

--- Templates ---
  formal_full: 577 (57.7%)
  formal_minimal: 143 (14.3%)
  informal: 89 (8.9%)
  ...

--- Labels ---
  BODY: 9909 (57.3%)
  SIGNATURE: 4841 (28.0%)
  ...
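
The counts above can also be recomputed directly from the JSONL, for example to check label balance inside a custom pipeline. The following is a minimal sketch, assuming the record layout described under Output Format (`lines[].label`, `metadata.template_type`); `dataset_stats` is a hypothetical helper and not part of the yasumail API.

```python
import json
from collections import Counter
from typing import Iterable


def dataset_stats(jsonl_lines: Iterable[str]) -> dict:
    """Count emails, lines, templates, and labels in yasumail-style JSONL.

    Hypothetical helper: it only relies on the fields shown in this README.
    """
    templates: Counter = Counter()
    labels: Counter = Counter()
    total_emails = 0
    for raw in jsonl_lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines between records
        record = json.loads(raw)
        total_emails += 1
        templates[record["metadata"]["template_type"]] += 1
        labels.update(line["label"] for line in record["lines"])
    return {
        "total_emails": total_emails,
        "total_lines": sum(labels.values()),
        "templates": templates,
        "labels": labels,
    }


# Tiny in-memory example; for a real file, pass an open file handle instead.
sample = [
    json.dumps(
        {
            "email_text": "お世話になっております。\n\n本文です。",
            "lines": [
                {"text": "お世話になっております。", "label": "GREETING", "quote_depth": 0},
                {"text": "", "label": "BODY", "quote_depth": 0},
                {"text": "本文です。", "label": "BODY", "quote_depth": 0},
            ],
            "metadata": {
                "template_type": "formal_minimal",
                "formality": "formal",
                "has_signature": False,
                "body_block_count": 1,
            },
        },
        ensure_ascii=False,
    )
]
stats = dataset_stats(sample)
for label, count in stats["labels"].most_common():
    print(f"{label}: {count} ({count / stats['total_lines']:.1%})")
```

Because the function accepts any iterable of strings, `dataset_stats(open("training.jsonl", encoding="utf-8"))` would work the same way on a generated file.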

Output Format

Each line in the JSONL output is a single JSON object with the following fields (pretty-printed here for readability):

{
  "email_text": "お世話になっております。\n\n先日...",
  "lines": [
    {
      "text": "お世話になっております。",
      "label": "GREETING",
      "quote_depth": 0
    },
    { "text": "", "label": "BODY", "quote_depth": 0 },
    {
      "text": "先日ご依頼いただいた件について...",
      "label": "BODY",
      "quote_depth": 0
    }
  ],
  "metadata": {
    "template_type": "formal_full",
    "formality": "formal",
    "has_signature": true,
    "body_block_count": 2
  }
}
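
A record in this format can be loaded with the standard library alone. The sketch below parses one record and checks that joining the per-line texts with newlines rebuilds `email_text`. That invariant is an assumption inferred from the example above, not a documented guarantee, and `parse_record` and `lines_match_text` are hypothetical helper names.

```python
import json


def parse_record(raw: str) -> dict:
    """Parse one JSONL line and verify the top-level fields are present."""
    record = json.loads(raw)
    for key in ("email_text", "lines", "metadata"):
        if key not in record:
            raise ValueError(f"missing field: {key}")
    return record


def lines_match_text(record: dict) -> bool:
    """Assumed invariant: line texts joined by newlines equal email_text."""
    rebuilt = "\n".join(line["text"] for line in record["lines"])
    return rebuilt == record["email_text"]


# Example record built in memory; real data would come from a .jsonl file.
raw = json.dumps(
    {
        "email_text": "お世話になっております。\n\n先日の件です。",
        "lines": [
            {"text": "お世話になっております。", "label": "GREETING", "quote_depth": 0},
            {"text": "", "label": "BODY", "quote_depth": 0},
            {"text": "先日の件です。", "label": "BODY", "quote_depth": 0},
        ],
        "metadata": {
            "template_type": "formal_minimal",
            "formality": "formal",
            "has_signature": False,
            "body_block_count": 1,
        },
    },
    ensure_ascii=False,
)
record = parse_record(raw)
print(lines_match_text(record))
```

If the check fails on real output (for instance with noise enabled), treat `lines` as the source of truth for training, since that is where the labels live.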

Labels

| Label | Content |
|-----------|---------|
| GREETING | Opening line |
| BODY | Main content, structured blocks, separators |
| CLOSING | Closing phrase |
| SIGNATURE | Sender identification |
| QUOTE | Quoted content (with quote_depth) |
| OTHER | Quote/forward headers |
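
For training a line classifier, the string labels usually need to be mapped to integer class IDs. A minimal sketch, assuming the six labels listed above are the complete label set; the ID ordering and the `encode_lines` helper are arbitrary illustrative choices, not something yasumail defines.

```python
from typing import List, Tuple

# Order is an arbitrary choice for this example, not defined by yasumail.
LABELS = ["GREETING", "BODY", "CLOSING", "SIGNATURE", "QUOTE", "OTHER"]
LABEL_TO_ID = {name: i for i, name in enumerate(LABELS)}


def encode_lines(lines: List[dict]) -> List[Tuple[str, int, int]]:
    """Turn annotated lines into (text, label_id, quote_depth) triples."""
    return [
        (line["text"], LABEL_TO_ID[line["label"]], line["quote_depth"])
        for line in lines
    ]


example_lines = [
    {"text": "お世話になっております。", "label": "GREETING", "quote_depth": 0},
    {"text": "> 先日の件ですが", "label": "QUOTE", "quote_depth": 1},
    {"text": "田中太郎", "label": "SIGNATURE", "quote_depth": 0},
]
encoded = encode_lines(example_lines)
for text, label_id, depth in encoded:
    print(label_id, depth, text)
```

An unknown label raises `KeyError`, which is a deliberate choice here so that schema changes surface early instead of being silently mapped to a default class.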

Templates

| Template | Weight | Description |
|----------------|--------|-------------|
| formal_full | 55% | Greeting, body, closing, signature |
| formal_minimal | 15% | No signature |
| informal | 10% | Casual, optional components |
| mobile_reply | 10% | Short mobile-style |
| reply_chain | 7% | With quoted previous email |
| forward_only | 3% | Forward with minimal content |
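
Weights like these are typically applied by weighted random sampling. The sketch below illustrates the idea with `random.choices`; it is not taken from the yasumail source, and `TEMPLATE_WEIGHTS` and `pick_template` are hypothetical names. Only the template names and percentages come from the table above.

```python
import random
from collections import Counter

# Percentages copied from the Templates table; they sum to 100.
TEMPLATE_WEIGHTS = {
    "formal_full": 55,
    "formal_minimal": 15,
    "informal": 10,
    "mobile_reply": 10,
    "reply_chain": 7,
    "forward_only": 3,
}


def pick_template(rng: random.Random) -> str:
    """Draw one template name with probability proportional to its weight."""
    names = list(TEMPLATE_WEIGHTS)
    weights = list(TEMPLATE_WEIGHTS.values())
    return rng.choices(names, weights=weights, k=1)[0]


# A seeded generator makes the draw sequence reproducible.
rng = random.Random(42)
counts = Counter(pick_template(rng) for _ in range(1000))
for name, count in counts.most_common():
    print(f"{name}: {count} ({count / 1000:.1%})")
```

With a finite sample the observed shares drift from the nominal weights, which is consistent with the stats output shown earlier (57.7% formal_full against a 55% weight).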

Noise Injection

The --noise-level flag takes a value from 0.0 to 1.0 and applies realistic mutations in four categories:

  • Character: fullwidth conversion, halfwidth katakana, typos
  • Whitespace: trailing spaces, zero-width chars
  • Punctuation: ASCII/Japanese mixing (。↔.)
  • Line: hard wrapping, blank line insertion/removal

All noise preserves label correctness.
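
To illustrate what label-preserving noise means, the sketch below mutates only the `text` of each line and copies `label` and `quote_depth` through unchanged. It covers two of the mutations listed above (punctuation mixing and trailing spaces) and treats the noise level as a per-line probability. Both the `add_noise` helper and that interpretation of the level are assumptions made for this example; the README does not describe how yasumail implements noise internally.

```python
import random
from typing import List


def add_noise(lines: List[dict], noise_level: float, rng: random.Random) -> List[dict]:
    """Return noisy copies of the lines; labels and quote depth are untouched."""
    noisy = []
    for line in lines:
        text = line["text"]
        # Punctuation mixing: swap the Japanese full stop for an ASCII period.
        if text and rng.random() < noise_level:
            text = text.replace("。", ".")
        # Whitespace noise: append a single trailing space.
        if text and rng.random() < noise_level:
            text = text + " "
        noisy.append({**line, "text": text})
    return noisy


clean = [
    {"text": "お世話になっております。", "label": "GREETING", "quote_depth": 0},
    {"text": "", "label": "BODY", "quote_depth": 0},
    {"text": "よろしくお願いいたします。", "label": "CLOSING", "quote_depth": 0},
]
noisy = add_noise(clean, noise_level=1.0, rng=random.Random(0))
for before, after in zip(clean, noisy):
    print(after["label"], repr(before["text"]), "->", repr(after["text"]))
```

Line-level mutations such as hard wrapping need extra care, because a wrapped line produces several output lines that must all inherit the original label.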

Component Library

The generator uses 1,133 LLM-generated components:

  • 100 greetings (formal, semi-formal, informal)
  • 105 closings
  • 928 body paragraphs (9 topics × 3 formality levels)

Names, companies, contacts, and signatures are created by procedural generators, so generated emails contain no placeholder values.

Development

# Type check
uv run ty check

# Lint
uv run ruff check .

# Test
uv run pytest
