auto-canto:0.2.3 #4255
Merged
8 commits
8ff61ec [Typship] Add README.md (VincentTam)
1158dac [Typship] Add lib.typ (VincentTam)
74b83d6 [Typship] Add jyutcitzi.png (VincentTam)
b0b934f [Typship] Add typst.toml (VincentTam)
0ea1db1 [Typship] Add example.png (VincentTam)
0129c8b [Typship] Add rust_canto.wasm (VincentTam)
24f203e [Typship] Add LICENSE (VincentTam)
ef7e297 perf: reduce WASM filesize (VincentTam)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,21 @@ | ||
| MIT License | ||
|
|
||
| Copyright (c) 2026 Vincent Tam | ||
|
|
||
| Permission is hereby granted, free of charge, to any person obtaining a copy | ||
| of this software and associated documentation files (the "Software"), to deal | ||
| in the Software without restriction, including without limitation the rights | ||
| to use, copy, modify, merge, publish, distribute, sublicense, and/or sell | ||
| copies of the Software, and to permit persons to whom the Software is | ||
| furnished to do so, subject to the following conditions: | ||
|
|
||
| The above copyright notice and this permission notice shall be included in all | ||
| copies or substantial portions of the Software. | ||
|
|
||
| THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR | ||
| IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, | ||
| FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE | ||
| AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER | ||
| LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, | ||
| OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE | ||
| SOFTWARE. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,147 @@ | ||
| ## auto-canto | ||
|
|
||
| This Typst package provides automatic Cantonese segmentation and romanization | ||
| (Jyutping (粵拼) and Yale (耶魯)) by wrapping the | ||
| [`rust-canto`](https://crates.io/crates/rust-canto) Rust crate as a WebAssembly | ||
| plugin. It integrates seamlessly with the | ||
| [`pycantonese-parser`](https://github.com/VincentTam/pycantonese-parser/) | ||
| package to render beautiful Cantonese text with ruby characters. | ||
|
|
||
| --- | ||
|
|
||
| ### Features | ||
|
|
||
| * **Automatic Segmentation**: Breaks Cantonese sentences into meaningful words | ||
| using a dictionary-based trie. | ||
| * **Multiple Romanizations**: Supports both **Jyutping** and **Yale** (numeric | ||
| or diacritics). | ||
| * **High Performance**: Powered by a Rust-compiled WASM plugin for fast | ||
| processing. | ||
| * **Typst Integration**: Provides a `quick-render` function that handles both | ||
| segmentation and styling in one go. | ||
|
|
||
| --- | ||
|
|
||
| ### Usage | ||
|
|
||
| To use this package, ensure the `rust_canto.wasm` file is in your project directory. | ||
|
|
||
| ```typ | ||
| #import "@preview/auto-canto:0.2.3": quick-render | ||
|
|
||
| // 36pt font | ||
| // use Libertinus Serif first (for ruby text) | ||
| // before falling back to Noto Serif CJK HK (for Chinese characters) | ||
| #set text(36pt, font: ("Libertinus Serif", "Noto Serif CJK HK")) | ||
|
|
||
| // 1. Basic rendering (defaults to Jyutping) | ||
| #quick-render[都會大學入面3%人識用AB膠] | ||
|
|
||
| // 2. Rendering with Yale romanization | ||
| #quick-render(romanization: "yale")[ | ||
| 平時會成日睇書 | ||
| ] | ||
|
|
||
| // 3. Customizing the underlying parser's style | ||
| #let my-text = "廣東話好難學" | ||
| #let my-style = (rb-size: 0.7em, rb-color: blue) | ||
| #let quicker-renderer = quick-render.with(style: my-style, visual-tones: false) | ||
| #quicker-renderer(my-text) | ||
| ``` | ||
|
|
||
|  | ||
|
|
||
| Live demo on YouTube: https://youtu.be/ivUu91eDfvY | ||
|
|
||
| #### Jyutcitzi (粵切字) support (optional) | ||
|
|
||
| This package can render Jyutcitzi above Chinese characters, provided that the | ||
| user has imported the | ||
| [`se-jyutcitzi`](https://typst.app/universe/package/se-jyutcitzi) Typst package. | ||
|
|
||
| To keep this package's dependencies clean, the user has to pass the `jyutcitzi()` | ||
| function from the `se-jyutcitzi` package to the `jyutcit-ruby()` function in this | ||
| package. | ||
|
|
||
| ```typ | ||
| #import "@preview/se-jyutcitzi:0.3.2": * | ||
| #import "@preview/auto-canto:0.2.3": * | ||
| // #set page(height: auto, width: auto, margin: 1pt) | ||
| #set text(24pt, font: "Chiron GoRound TC") | ||
| #set par(justify: true) | ||
|
|
||
| // Customize Jyutcitzi display | ||
| #let default-style = ( | ||
| rb-color: rgb("#ff0000"), // Annotation text color | ||
| rb-size: 0.8em, // Annotation text size | ||
| word-sep: 0.2em, // Chinese words separation | ||
| char-jp-sep: 0.2em, // vertical space between words and Jyutping above | ||
| ) | ||
|
|
||
| #let mytxt = [ | ||
| 你識唔識講廣東話?就算你識講廣東話都好,都可以遇到啲好𠮩𠹌嘅字,就算係粵語母語者都好,都未必識得寫,最後要用abcd先得,就好似「bibu車」噉。 | ||
|
|
||
| 所以,我呢個package一定幫到你。仲唔快啲下載? | ||
| ] | ||
| #jyutcit-ruby(mytxt, jyutcitzi: jyutcitzi, style: default-style) | ||
| ``` | ||
|
|
||
|  | ||
|
|
||
| --- | ||
|
|
||
| ### API Reference | ||
|
|
||
| #### `quick-render(it, ..args)` | ||
|
|
||
| The primary high-level function. It fetches data from the WASM plugin and | ||
| forwards it to the parser. | ||
|
|
||
| * `it`: The item containing the Cantonese string to process. | ||
| * `..args`: Named arguments forwarded to [`render-word-groups`](https://github.com/VincentTam/pycantonese-parser/blob/7ed67e5d/src/renderer.typ#L10-L15) | ||
| (e.g. `romanization`, `style`). | ||
|
|
||
| #### `jyutcit-ruby(it, jyutcitzi: none, style: (:))` | ||
|
|
||
| Renders Cantonese text with **Jyutcitzi** annotations above each word. | ||
|
|
||
| * **Note**: Requires the | ||
| [`jyutcitzi`](https://github.com/VincentTam/jyutcitzi/blob/f0083491/lib.typ#L5) | ||
| function from the `se-jyutcitzi` package passed as an argument. | ||
| * `it`: The item containing the Cantonese string to process. | ||
| * `jyutcitzi`: Named argument for the Jyutcitzi function. | ||
| * `style`: A dictionary with the following four keys; any key left out keeps its default value | ||
| - `rb-color`: ruby text color | ||
| - `rb-size`: ruby text size (in em) | ||
| - `word-sep`: horizontal separation between words (in em) | ||
| - `char-jp-sep`: vertical separation between ruby text and main text (in em) | ||
|
|
||
| #### `annotate(txt)` | ||
|
|
||
| Returns the raw segmented data as an array of dictionaries. | ||
|
|
||
| * **Return format**: `array` of `{word: str, jyutping: str, yale: array}`. | ||
|
|
||
| #### `to-yale-numeric(jp-str)` / `to-yale-diacritics(jp-str)` | ||
|
|
||
| Utility functions to convert space-delimited Jyutping strings into Yale format. | ||
|
|
||
| * `numeric`: "gwong2 dung1 waa2" → "gwong2 dung1 wa2". | ||
| * `diacritics`: "gwong2 dung1 waa2" → "gwóngdūngwá". | ||
|
|
||
| --- | ||
|
|
||
| ### Project Structure | ||
|
|
||
| * `lib.typ`: The main entry point containing the Typst wrappers. | ||
| * `rust_canto.wasm`: The WebAssembly binary compiled from the `rust-canto` | ||
| crate. | ||
| * `typst.toml`: Package metadata and dependencies. | ||
|
|
||
| ### License | ||
|
|
||
| MIT | ||
|
|
||
| ### Contributing | ||
|
|
||
| Contributions are welcome! Please open an issue or submit a pull request. |
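The README's API reference describes `annotate` and the two Yale converters without showing a call. A minimal sketch of how they might be used, based only on the signatures and return format documented in the README; the actual output depends on the bundled dictionary, so none is shown here:

```typ
#import "@preview/auto-canto:0.2.3": annotate, to-yale-numeric, to-yale-diacritics

// Segment a sentence. Per the README, each item is a dictionary
// with the keys `word`, `jyutping` and `yale`.
#let data = annotate("廣東話好難學")

// List each segmented word next to its Jyutping.
#for item in data [
  - #item.word: #item.jyutping
]

// Convert a space-delimited Jyutping string to the two Yale variants.
#to-yale-numeric("gwong2 dung1 waa2")

#to-yale-diacritics("gwong2 dung1 waa2")
```

Working with `annotate` directly may be useful when the default layout of `quick-render` does not fit and the caller wants to lay out the words and romanization manually.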
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,102 @@ | ||
| #import "@preview/canto-parser:0.2.1": * | ||
|
|
||
| // Load the plugin | ||
| #let canto = plugin("rust_canto.wasm") | ||
|
|
||
| /// Internal helper to convert string to bytes for the WASM plugin | ||
| #let _to_bytes(txt) = { | ||
| if type(txt) == str { bytes(txt) } else { txt } | ||
| } | ||
|
|
||
| /// Internal helper to extract text | ||
| #let _extract-text(it) = { | ||
| if type(it) == str { | ||
| it | ||
| } else if type(it) == content { | ||
| if it == parbreak() { | ||
| "\n\n" // double newline → WASM plugin's segmenter emits two \n tokens | ||
| } else if it == linebreak() { | ||
| "\n" // single newline → WASM plugin's segmenter emits one \n token | ||
| } else if it.has("text") { | ||
| it.text | ||
| } else if it.has("children") { | ||
| it.children.map(_extract-text).join("") | ||
| } else if it.has("body") { | ||
| _extract-text(it.body) | ||
| } else { | ||
| "" | ||
| } | ||
| } else { | ||
| "" | ||
| } | ||
| } | ||
|
|
||
| /// Annotates text into a list of dictionaries containing word, jyutping, and Yale. | ||
| /// Returns: array of {word: str, jyutping: str, yale: array} | ||
| #let annotate(txt) = { | ||
| json(canto.annotate(_to_bytes(txt))) | ||
| } | ||
|
|
||
| /// Converts a space-delimited Jyutping string to Yale with tone numbers. | ||
| /// Example: "gwong2 dung1 waa2" → "gwong2 dung1 wa2" | ||
| #let to-yale-numeric(jp-str) = { | ||
| str(canto.to_yale_numeric(_to_bytes(jp-str))) | ||
| } | ||
|
|
||
| /// Converts a space-delimited Jyutping string to Yale with diacritics. | ||
| /// Example: "gwong2 dung1 waa2" → "gwóngdūngwá" | ||
| #let to-yale-diacritics(jp-str) = { | ||
| str(canto.to_yale_diacritics(_to_bytes(jp-str))) | ||
| } | ||
|
|
||
| /// A flexible wrapper that segments text and forwards all styling | ||
| /// parameters to the parser's rendering function. | ||
| /// - it: The item containing Cantonese string to process | ||
| /// - args: Captures named arguments like romanization: "yale" or "jyutping" | ||
| #let quick-render(it, ..args) = { | ||
| // 1. Extract text from item | ||
| let txt = _extract-text(it) | ||
|
|
||
| // 2. Get the data from the WASM plugin | ||
| let data = json(canto.annotate(bytes(txt))) | ||
|
|
||
| // 3. Forward the data and all extra arguments to the parser | ||
| render-word-groups(data, ..args) | ||
| } | ||
|
|
||
| /// Render Cantonese text with jyutcitzi annotations above each word. | ||
| /// The caller must pass the `jyutcitzi` function from @preview/se-jyutcitzi. | ||
| /// | ||
| /// Example: | ||
| /// #import "@preview/se-jyutcitzi:0.3.2": jyutcitzi | ||
| /// #import "@preview/auto-canto:0.2.3": jyutcit-ruby | ||
| /// #jyutcit-ruby(it, jyutcitzi: jyutcitzi) | ||
| #let jyutcit-ruby(it, jyutcitzi: none, style: (:)) = { | ||
| assert(jyutcitzi != none, message: "jyutcit-ruby requires the jyutcitzi function from @preview/se-jyutcitzi") | ||
| let default-style = ( | ||
| rb-color: rgb("#ff0000"), | ||
| rb-size: 0.8em, | ||
| word-sep: 0.2em, | ||
| char-jp-sep: 0.2em, | ||
| ) | ||
| let s = default-style + style | ||
| let data = annotate(_extract-text(it)) | ||
| [ | ||
| #for item in data { | ||
| if item.word == "\n" { text[\ ]; continue } | ||
| let ruby-txt = jyutcitzi(item.jyutping) | ||
| if ruby-txt != none { | ||
| box(stack( | ||
| dir: ttb, | ||
| spacing: s.char-jp-sep, | ||
| align(center, text(s.rb-size, s.rb-color, ruby-txt)), | ||
| align(bottom + center, box(height: 1em, text(1em, item.word))), | ||
| )) | ||
| } else { | ||
| text(1em, item.word) | ||
| } | ||
| h(s.word-sep) | ||
| } | ||
| #h(-s.word-sep) | ||
| ] | ||
| } |
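In `jyutcit-ruby`, the line `let s = default-style + style` joins two dictionaries, so keys supplied in the caller's `style` override the defaults and omitted keys keep their default values. A minimal sketch of a partial override, assuming `se-jyutcitzi` 0.3.2 is available as in the README example:

```typ
#import "@preview/se-jyutcitzi:0.3.2": jyutcitzi
#import "@preview/auto-canto:0.2.3": jyutcit-ruby

// Only `rb-color` and `rb-size` are overridden here; `word-sep` and
// `char-jp-sep` fall back to the defaults defined in lib.typ (0.2em each).
#jyutcit-ruby(
  "廣東話好難學",
  jyutcitzi: jyutcitzi,
  style: (rb-color: blue, rb-size: 0.7em),
)
```

Because the merge happens inside the function, callers do not need to repeat the full four-key dictionary just to change one value.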
Binary file not shown.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,28 @@ | ||
| [package] | ||
| name = "auto-canto" | ||
| version = "0.2.3" | ||
| entrypoint = "lib.typ" | ||
| authors = ["Vincent Tam <https://github.com/VincentTam>"] | ||
| license = "MIT" | ||
| description = "Automatic conversion to Cantonese romanizations from Chinese characters." | ||
| repository = "https://github.com/VincentTam/auto-canto" | ||
| keywords = [ | ||
| "Cantonese", | ||
| "jyutping", | ||
| "yale", | ||
| "Chinese", | ||
| "conversion", | ||
| ] | ||
| categories = [ | ||
| "text", | ||
| "languages", | ||
| "utility", | ||
| ] | ||
| disciplines = [ | ||
| "education", | ||
| "linguistics", | ||
| ] | ||
| compiler = "0.14.0" | ||
| exclude = ["example.png"] | ||
|
|
||
| [tool] |
Have you considered ways to compact that file? In particular, since it is mostly data that may contain repetitive patterns, do you think compressing it with some well-known format and decompressing it in memory could result in a net reduction in file size? Also, have you looked into wasm-opt?

Now it's 1.5MB. Hope it's acceptable. The WASM file can be generated from the build script in my Rust crate rust-canto.