Synthetic Japanese business email generator for ML training data. Produces labeled emails (JSONL) with line-by-line annotations for training email parsers.
git clone <repo-url>
cd yasumail
uv sync

# Generate 1000 emails to file
yasumail generate -n 1000 --seed 42 -o training.jsonl
# With noise injection (0.0-1.0)
yasumail generate -n 1000 --noise-level 0.5 -o noisy.jsonl
# Specific template only
yasumail generate -n 100 -t reply_chain -o replies.jsonl

yasumail sample --seed 42

Output:
============================================================
Template: formal_full
Formality: formal
============================================================
[GREETING ] お世話になっております。
[OTHER ]
[BODY ] 先日ご依頼いただいた件について、ご報告いたします。
[BODY ]
[CLOSING ] 何卒よろしくお願い申し上げます。
[SIGNATURE]
[SIGNATURE] 株式会社テスト
[SIGNATURE] 田中太郎
============================================================
yasumail stats training.jsonl

Output:
=== Dataset Statistics ===
Total emails: 1000
Total lines: 17305
--- Templates ---
formal_full: 577 (57.7%)
formal_minimal: 143 (14.3%)
informal: 89 (8.9%)
...
--- Labels ---
BODY: 9909 (57.3%)
SIGNATURE: 4841 (28.0%)
...
Each line of the JSONL output is a single JSON object with the following structure (pretty-printed here for readability):
{
"email_text": "お世話になっております。\n\n先日...",
"lines": [
{
"text": "お世話になっております。",
"label": "GREETING",
"quote_depth": 0
},
{ "text": "", "label": "BODY", "quote_depth": 0 },
{
"text": "先日ご依頼いただいた件について...",
"label": "BODY",
"quote_depth": 0
}
],
"metadata": {
"template_type": "formal_full",
"formality": "formal",
"has_signature": true,
"body_block_count": 2
}
}

| Label | Content |
|---|---|
| GREETING | Opening line |
| BODY | Main content, structured blocks, separators |
| CLOSING | Closing phrase |
| SIGNATURE | Sender identification |
| QUOTE | Quoted content (with quote_depth) |
| OTHER | Quote/forward headers |

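
As a rough illustration of consuming this format, the sketch below tallies labels across a JSONL stream, similar in spirit to the Labels section of `yasumail stats`. It relies only on the `lines` and `label` fields documented above. The helper name `count_labels` and the in-memory sample record are assumptions made for the example, not part of yasumail.

```python
import io
import json
from collections import Counter
from typing import TextIO


def count_labels(stream: TextIO) -> Counter:
    """Count line labels across all records in a JSONL stream."""
    counts: Counter = Counter()
    for raw in stream:
        raw = raw.strip()
        if not raw:  # tolerate blank lines between records
            continue
        record = json.loads(raw)
        counts.update(line["label"] for line in record["lines"])
    return counts


# Hypothetical one-record sample mirroring the structure shown above.
sample = json.dumps(
    {
        "email_text": "お世話になっております。\n\n先日...",
        "lines": [
            {"text": "お世話になっております。", "label": "GREETING", "quote_depth": 0},
            {"text": "", "label": "BODY", "quote_depth": 0},
            {"text": "先日...", "label": "BODY", "quote_depth": 0},
        ],
    },
    ensure_ascii=False,
)

label_counts = count_labels(io.StringIO(sample + "\n"))
print(label_counts.most_common())
```

For a real dataset, the same function can be given a file handle, for example `open("training.jsonl", encoding="utf-8")`.
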
| Template | Weight | Description |
|---|---|---|
| formal_full | 55% | Greeting, body, closing, signature |
| formal_minimal | 15% | No signature |
| informal | 10% | Casual, optional components |
| mobile_reply | 10% | Short mobile-style |
| reply_chain | 7% | With quoted previous email |
| forward_only | 3% | Forward with minimal content |

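
The README does not document how the generator samples templates internally. The sketch below shows one plausible way the weights could translate into seeded random draws. It also suggests why observed shares in a 1000-email dataset (such as 57.7% for `formal_full`) can drift a little from the nominal weights. `TEMPLATE_WEIGHTS` and `pick_templates` are hypothetical names for illustration only.

```python
import random

# Weights copied from the table above (percentages).
TEMPLATE_WEIGHTS = {
    "formal_full": 55,
    "formal_minimal": 15,
    "informal": 10,
    "mobile_reply": 10,
    "reply_chain": 7,
    "forward_only": 3,
}


def pick_templates(n: int, seed: int) -> list[str]:
    """Draw n template names in proportion to their weights."""
    rng = random.Random(seed)  # local RNG keeps runs reproducible
    names = list(TEMPLATE_WEIGHTS)
    weights = list(TEMPLATE_WEIGHTS.values())
    return rng.choices(names, weights=weights, k=n)


picks = pick_templates(1000, seed=42)
# Observed counts fluctuate around the nominal weights.
print({name: picks.count(name) for name in TEMPLATE_WEIGHTS})
```

Using a dedicated `random.Random(seed)` instance rather than the module-level functions keeps the draw reproducible without touching global state.
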
The --noise-level flag (0.0-1.0) applies realistic mutations:
- Character: fullwidth conversion, halfwidth katakana, typos
- Whitespace: trailing spaces, zero-width chars
- Punctuation: ASCII/Japanese mixing (。↔.)
- Line: hard wrapping, blank line insertion/removal
All noise preserves label correctness.
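
To make "preserves label correctness" concrete, here is a minimal sketch of label-preserving noise. It covers only two of the mutation kinds listed above (punctuation swapping and trailing spaces), each of which maps one line to one line. Mutations that change the line count, such as hard wrapping, would additionally need to copy the label onto the new lines. Every name here is hypothetical, and treating the noise level as a per-line mutation probability is an assumption, not documented behavior.

```python
import random


# Hypothetical line-level mutations. Each maps one line of text to one line,
# so the label attached to that line stays valid.
def swap_punctuation(text: str, rng: random.Random) -> str:
    """Replace the Japanese full stop with an ASCII period."""
    return text.replace("。", ".")


def add_trailing_space(text: str, rng: random.Random) -> str:
    """Append one to three trailing spaces."""
    return text + " " * rng.randint(1, 3)


MUTATIONS = [swap_punctuation, add_trailing_space]


def add_noise(lines: list[dict], noise_level: float, seed: int) -> list[dict]:
    """Return a noisy copy of `lines`; labels and quote depths are untouched."""
    if not 0.0 <= noise_level <= 1.0:
        raise ValueError("noise_level must be between 0.0 and 1.0")
    rng = random.Random(seed)
    noisy = []
    for line in lines:
        text = line["text"]
        # Assumption: noise_level acts as a per-line mutation probability.
        if text and rng.random() < noise_level:
            text = rng.choice(MUTATIONS)(text, rng)
        noisy.append({**line, "text": text})
    return noisy


lines = [
    {"text": "お世話になっております。", "label": "GREETING", "quote_depth": 0},
    {"text": "", "label": "BODY", "quote_depth": 0},
    {"text": "何卒よろしくお願い申し上げます。", "label": "CLOSING", "quote_depth": 0},
]
noisy = add_noise(lines, noise_level=1.0, seed=0)
for before, after in zip(lines, noisy):
    print(after["label"], repr(before["text"]), "->", repr(after["text"]))
```
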
The generator uses 1,133 LLM-generated components:
- 100 greetings (formal, semi-formal, informal)
- 105 closings
- 928 body paragraphs (9 topics × 3 formality levels)
Procedural generators create names, companies, contacts, and signatures without placeholder values.
# Type check
uv run ty check
# Lint
uv run ruff check .
# Test
uv run pytest