GitHub - vincelwt/bunkatsu: 🇯🇵🍜 learner‑friendly Japanese tokenizer / segmenter

bunkatsu (“division; segmentation” in Japanese) turns raw Japanese text into human‑sized word chunks.

It builds on top of kuromojin, which exposes the great Kuromoji tokenizer, and then glues many of the overly fine‑grained splits back together so the output is easier for language learners to read.

It is currently used by [Lexirise](https://lexirise.app) to segment text parsed from comics.

Why not use Kuromoji directly?

Kuromoji treats every grammatical morpheme as its own token. The passive verb 「食べられる」 becomes four pieces (食べ)(ら)(れ)(る), which makes the word hard to look up in a dictionary.
Common colloquial contractions such as 「ちゃった」 or sentence endings like 「じゃん」 are also broken apart.

bunkatsu patches these cases (currently ~40 heuristic rules) so you get one token per semantically useful unit, while still exposing the raw sub‑tokens for features such as furigana or fine‑grained grammar pop‑ups.
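To make the idea of a merge pass concrete, here is a minimal sketch. It is not bunkatsu's implementation: the `RawToken` and `MergedToken` shapes, the `shouldGlue` rule, and the `mergePass` function are all hypothetical names invented for illustration, and the single rule shown (attach an auxiliary to a chunk that began with a verb) is far cruder than the real heuristics.

```typescript
// Hypothetical raw-morpheme shape; real Kuromoji tokens carry many more fields.
interface RawToken {
  surface: string;
  pos: string; // coarse part of speech, e.g. "動詞" (verb), "助動詞" (auxiliary)
}

// Hypothetical merged chunk that keeps its raw morphemes around,
// e.g. for furigana or grammar pop-ups.
interface MergedToken {
  segment: string;
  subTokens: RawToken[];
}

// Toy rule (an assumption, not one of bunkatsu's rules): an auxiliary
// attaches to the previous chunk if that chunk started with a verb.
function shouldGlue(prev: MergedToken, next: RawToken): boolean {
  return prev.subTokens[0]?.pos === "動詞" && next.pos === "助動詞";
}

// Walk the morphemes left to right, gluing each one onto the previous
// chunk when the rule fires and starting a new chunk otherwise.
function mergePass(tokens: RawToken[]): MergedToken[] {
  const out: MergedToken[] = [];
  for (const tok of tokens) {
    const last = out[out.length - 1];
    if (last !== undefined && shouldGlue(last, tok)) {
      last.segment += tok.surface;
      last.subTokens.push(tok);
    } else {
      out.push({ segment: tok.surface, subTokens: [tok] });
    }
  }
  return out;
}

// Hand-written input for illustration.
const merged = mergePass([
  { surface: "食べ", pos: "動詞" },
  { surface: "た", pos: "助動詞" },
  { surface: "よ", pos: "助詞" },
]);
console.log(merged.map((m) => m.segment)); // [ '食べた', 'よ' ]
```

Keeping `subTokens` alongside the merged surface form is what lets a caller show one dictionary-friendly chunk while still having the fine-grained pieces available.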

Installation

npm i bunkatsu  # or pnpm / yarn / bun add

Usage

import { segmentJapanese } from "bunkatsu";

const segments = await segmentJapanese("食べられちゃったんだよ！");
console.log(segments.map((s) => s.segment));
// → [ '食べられちゃった', 'ん', 'だよ', '！' ]

API

async function segmentJapanese(
  text: string,
  options?: {
    /** Skip the merge pass and return every raw morpheme */
    fullBreakdown?: boolean;

    /** Options forwarded verbatim to kuromojin.getTokenizer() */
    kuromojiBuilderOptions?: import("kuromojin").TokenizerBuilderOptions;
  }
): Promise<SegmentedToken[]>;

interface SegmentedToken {
  segment: string; // final surface form (after merging)
  isWordLike: boolean; // roughly: true when the part of speech is not 記号 (symbol)
  index: number; // position in the returned array
  start: number; // start offset in the original string
  end: number; // end offset (exclusive)
  subTokenCount: number; // how many Kuromoji morphemes were merged
}
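
The `start` and `end` offsets make it possible to map segments back onto the original string. The sketch below wraps each word‑like segment in brackets. It assumes the offsets are UTF‑16 code‑unit indices usable with `String.prototype.slice` and that tokens arrive in order without overlapping; neither is confirmed by this README. The `bracketWords` helper and the sample tokens are hypothetical, and the interface is a local copy so the snippet runs without installing anything.

```typescript
// Local mirror of the SegmentedToken shape documented above.
interface SegmentedToken {
  segment: string;
  isWordLike: boolean;
  index: number;
  start: number;
  end: number;
  subTokenCount: number;
}

// Wrap every word-like segment in brackets and copy everything else verbatim.
function bracketWords(text: string, tokens: SegmentedToken[]): string {
  let out = "";
  let cursor = 0;
  for (const t of tokens) {
    out += text.slice(cursor, t.start); // any text not covered by a token
    const piece = text.slice(t.start, t.end);
    out += t.isWordLike ? `[${piece}]` : piece;
    cursor = t.end;
  }
  return out + text.slice(cursor);
}

// Hand-written tokens for illustration; not actual bunkatsu output.
const sampleText = "猫だ！";
const sampleTokens: SegmentedToken[] = [
  { segment: "猫", isWordLike: true, index: 0, start: 0, end: 1, subTokenCount: 1 },
  { segment: "だ", isWordLike: true, index: 1, start: 1, end: 2, subTokenCount: 1 },
  { segment: "！", isWordLike: false, index: 2, start: 2, end: 3, subTokenCount: 1 },
];
console.log(bracketWords(sampleText, sampleTokens)); // [猫][だ]！
```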

Advanced usage

The low‑level helpers mergeTokens() and shouldMergeForward() are re‑exported so you can craft your own segmentation strategy.
A cached Kuromoji tokenizer is created lazily on the first call. If you need a custom dictionary, pass kuromojiBuilderOptions on that first call (e.g. { dicPath: "/path/to/ipadic" }). Because the tokenizer is cached, subsequent calls must either omit the option or provide exactly the same one.
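
The lazy‑cache behaviour described above can be pictured with a generic sketch like the one below. This is an illustration of the pattern only, not bunkatsu's code: `BuilderOptions`, `Tokenizer`, `buildTokenizer`, and `getCachedTokenizer` are hypothetical stand‑ins, and rejecting on mismatched options is one possible policy (the README does not say what bunkatsu does in that case).

```typescript
// Hypothetical option and tokenizer shapes, for illustration only.
interface BuilderOptions {
  dicPath?: string;
}
interface Tokenizer {
  dicPath: string;
}

let cached: Promise<Tokenizer> | undefined;
let cachedKey: string | undefined;
let buildCount = 0;

// Stand-in for an expensive dictionary load.
async function buildTokenizer(options: BuilderOptions): Promise<Tokenizer> {
  buildCount += 1;
  return { dicPath: options.dicPath ?? "(default)" };
}

// Build on first use, then hand back the same promise on every later call.
function getCachedTokenizer(options?: BuilderOptions): Promise<Tokenizer> {
  // JSON.stringify is a crude equality key: it is sensitive to property order.
  const key = options === undefined ? undefined : JSON.stringify(options);
  if (cached === undefined) {
    cachedKey = key;
    cached = buildTokenizer(options ?? {});
    return cached;
  }
  if (key !== undefined && key !== cachedKey) {
    // Assumed policy: refuse options that differ from the first call.
    return Promise.reject(new Error("tokenizer already built with different options"));
  }
  return cached;
}
```

Caching the promise rather than the resolved tokenizer means concurrent first calls share a single build instead of each starting their own.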

Contributing

Contributions are welcome.

Licence

MIT

Credits

Made by Vince
