bunkatsu (“division; segmentation” in Japanese) turns raw Japanese text
into human‑sized word chunks.
It builds on kuromojin, which exposes the excellent Kuromoji tokenizer, and then glues many of the overly fine-grained splits back together so the result is easier for language learners to read.
It is currently used by [Lexirise](https://lexirise.app) to segment text parsed from comics.

- Kuromoji treats every grammatical morpheme as its own token. The passive verb 「食べられる」 becomes four pieces, (食べ)(ら)(れ)(る), which makes it hard to look up in a dictionary.
- Common colloquial contractions such as 「ちゃった」 or sentence endings like 「じゃん」 are also broken apart.

bunkatsu patches these cases (currently ~40 heuristic rules) so you get one
token per semantically useful unit, while still exposing the raw sub‑tokens
for features such as furigana or fine‑grained grammar pop‑ups.
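
To make the idea of a merge pass concrete, here is a minimal sketch of gluing adjacent morphemes with a single rule. It is not bunkatsu's implementation: the `RawToken` shape, the `shouldGlue` rule, and the `mergeAdjacent` function are hypothetical names invented for this illustration, and the part-of-speech tags on the sample input are hand-written assumptions rather than real Kuromoji output.

```ts
// Hypothetical, simplified token shape; real Kuromoji tokens carry more fields.
interface RawToken {
  surface: string; // text of the morpheme
  pos: string; // coarse part of speech, e.g. "動詞" (verb) or "助動詞" (auxiliary)
}

interface MergedToken {
  segment: string; // surface form after gluing
  subTokens: RawToken[]; // the raw morphemes that were glued together
}

// Illustrative rule only: attach an auxiliary to a preceding verb or auxiliary.
function shouldGlue(prev: RawToken, next: RawToken): boolean {
  return next.pos === "助動詞" && (prev.pos === "動詞" || prev.pos === "助動詞");
}

// Walk the tokens once, extending the last merged token whenever the rule fires.
function mergeAdjacent(tokens: RawToken[]): MergedToken[] {
  const merged: MergedToken[] = [];
  for (const token of tokens) {
    const last = merged[merged.length - 1];
    const prev = last ? last.subTokens[last.subTokens.length - 1] : undefined;
    if (last && prev && shouldGlue(prev, token)) {
      last.segment += token.surface;
      last.subTokens.push(token);
    } else {
      merged.push({ segment: token.surface, subTokens: [token] });
    }
  }
  return merged;
}

// Hand-written input mirroring the four-way split described above (assumed tags).
const raw: RawToken[] = [
  { surface: "食べ", pos: "動詞" },
  { surface: "ら", pos: "助動詞" },
  { surface: "れ", pos: "助動詞" },
  { surface: "る", pos: "助動詞" },
  { surface: "。", pos: "記号" },
];

const result = mergeAdjacent(raw);
console.log(result.map((t) => t.segment));
// → [ '食べられる', '。' ]
```

Keeping the raw morphemes on each merged token is one way to serve both use cases at once: dictionary lookup works on `segment`, while per-morpheme features can still iterate over `subTokens`.
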

```bash
npm i bunkatsu # or pnpm / yarn / bun add
```

```ts
import { segmentJapanese } from "bunkatsu";

const segments = await segmentJapanese("食べられちゃったんだよ!");
console.log(segments.map((s) => s.segment));
// → [ '食べられちゃった', 'ん', 'だよ', '!' ]
```

```ts
async function segmentJapanese(
  text: string,
  options?: {
    /** Skip the merge pass and return every raw morpheme */
    fullBreakdown?: boolean;
    /** Options forwarded verbatim to kuromojin.getTokenizer() */
    kuromojiBuilderOptions?: import("kuromojin").TokenizerBuilderOptions;
  }
): Promise<SegmentedToken[]>;

interface SegmentedToken {
  segment: string; // final surface form (after merging)
  isWordLike: boolean; // true ≅ POS ≠ 記号
  index: number; // position in the returned array
  start: number; // start offset in the original string
  end: number; // end offset (exclusive)
  subTokenCount: number; // how many Kuromoji morphemes were merged
}
```

- The low-level helpers `mergeTokens()` and `shouldMergeForward()` are re-exported so you can craft your own segmentation strategy.
- A cached Kuromoji tokenizer is created lazily on the first call. If you need a custom dictionary, just pass `kuromojiBuilderOptions` once (e.g. `{ dicPath: "/path/to/ipadic" }`). Subsequent calls must omit the option or provide the exact same one.
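
The `start` and `end` offsets in `SegmentedToken` are handy for mapping segments back onto the original string, for example to wrap each word in a clickable element. The sketch below is one possible way to do that and is not part of bunkatsu: `renderSegments` and `escapeHtml` are hypothetical helpers, the offsets are assumed to be compatible with `String.prototype.slice`, and the sample tokens are hand-written to match the quick-start output (their `isWordLike` flags are assumptions, not captured library output).

```ts
// Only the fields this sketch needs, named as in the SegmentedToken interface.
interface TokenSpan {
  index: number;
  start: number;
  end: number; // exclusive
  isWordLike: boolean;
}

// Escape characters that are unsafe inside HTML text and attribute values.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Wrap word-like tokens in <span> elements and pass everything else through.
function renderSegments(source: string, tokens: TokenSpan[]): string {
  let html = "";
  let cursor = 0;
  for (const token of tokens) {
    // Preserve any text that falls between two tokens.
    html += escapeHtml(source.slice(cursor, token.start));
    const text = escapeHtml(source.slice(token.start, token.end));
    html += token.isWordLike
      ? `<span class="word" data-index="${token.index}">${text}</span>`
      : text;
    cursor = token.end;
  }
  // Preserve any trailing text after the last token.
  return html + escapeHtml(source.slice(cursor));
}

// Hand-written sample tokens (assumed values, see the note above).
const source = "食べられちゃったんだよ!";
const tokens: TokenSpan[] = [
  { index: 0, start: 0, end: 8, isWordLike: true },
  { index: 1, start: 8, end: 9, isWordLike: true },
  { index: 2, start: 9, end: 11, isWordLike: true },
  { index: 3, start: 11, end: 12, isWordLike: false },
];

const markup = renderSegments(source, tokens);
console.log(markup);
```

In a real application the `tokens` array would come from `await segmentJapanese(source)`, and the `data-index` attribute could be used to look the clicked token up again.
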

Contributions are welcome.

License: MIT

Made by Vince
