| 1 | +# Contributing to ztoken |
| 2 | + |
| 3 | +Thank you for your interest in contributing to ztoken, the BPE tokenizer library with HuggingFace compatibility for the Zerfoo ML ecosystem. This guide will help you get started. |
| 4 | + |
| 5 | +## Table of Contents |
| 6 | + |
| 7 | +- [Development Setup](#development-setup) |
| 8 | +- [Building from Source](#building-from-source) |
| 9 | +- [Running Tests](#running-tests) |
| 10 | +- [Code Style](#code-style) |
| 11 | +- [Commit Conventions](#commit-conventions) |
| 12 | +- [Pull Request Process](#pull-request-process) |
| 13 | +- [Issue Reporting](#issue-reporting) |
| 14 | +- [Good First Issues](#good-first-issues) |
| 15 | +- [Key Conventions](#key-conventions) |
| 16 | + |
| 17 | +## Development Setup |
| 18 | + |
| 19 | +### Prerequisites |
| 20 | + |
| 21 | +- **Go 1.25+** |
| 22 | +- **Git** |
| 23 | + |
| 24 | +No GPU, C compiler, or external libraries are required. The only dependency outside the Go standard library is `golang.org/x/text`. |
| 25 | + |
| 26 | +### Clone and Verify |
| 27 | + |
| 28 | +```bash |
| 29 | +git clone https://github.com/zerfoo/ztoken.git |
| 30 | +cd ztoken |
| 31 | +go mod tidy |
| 32 | +go test ./... |
| 33 | +``` |
| 34 | + |
| 35 | +## Building from Source |
| 36 | + |
| 37 | +```bash |
| 38 | +go build ./... |
| 39 | +``` |
| 40 | + |
| 41 | +ztoken compiles on every platform Go supports with no additional setup. |
| 42 | + |
| 43 | +## Running Tests |
| 44 | + |
| 45 | +```bash |
| 46 | +# Run all tests |
| 47 | +go test ./... |
| 48 | + |
| 49 | +# Run tests with race detector |
| 50 | +go test -race ./... |
| 51 | + |
| 52 | +# Run tests with coverage |
| 53 | +go test -cover ./... |
| 54 | +go test -coverprofile=coverage.out ./... |
| 55 | +go tool cover -html=coverage.out -o coverage.html |
| 56 | +``` |
| 57 | + |
| 58 | +All new code must have tests. Aim for at least 80% coverage on new packages. |
| 59 | + |
| 60 | +## Code Style |
| 61 | + |
| 62 | +### Formatting and Linting |
| 63 | + |
| 64 | +- **`gofmt`** — all code must be formatted with `gofmt` |
| 65 | +- **`goimports`** — imports must be organized (stdlib, external) |
| 66 | +- **`golangci-lint`** — run `golangci-lint run` before submitting |
| 67 | + |
| 68 | +### Go Conventions |
| 69 | + |
| 70 | +- Follow standard Go naming: PascalCase for exported symbols, camelCase for unexported |
| 71 | +- Use table-driven tests with `t.Run` subtests |
| 72 | +- Write documentation comments for all exported functions, types, and methods |
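
As a minimal sketch of the table-driven layout described above, the test below uses only the standard `testing` package. The function under test, `collapseSpaces`, is a hypothetical helper invented for this illustration and is not part of the ztoken API.

```go
package main

import (
	"strings"
	"testing"
)

// collapseSpaces is a hypothetical helper used only for illustration; it is
// not part of ztoken. It trims the input and collapses each run of
// whitespace into a single space.
func collapseSpaces(s string) string {
	return strings.Join(strings.Fields(s), " ")
}

// TestCollapseSpaces shows the table-driven layout: one slice of named
// cases, and one t.Run subtest per case.
func TestCollapseSpaces(t *testing.T) {
	tests := []struct {
		name  string
		input string
		want  string
	}{
		{name: "empty", input: "", want: ""},
		{name: "single word", input: "hello", want: "hello"},
		{name: "inner runs", input: "a  \t b", want: "a b"},
		{name: "leading and trailing", input: "  a b\n", want: "a b"},
	}

	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			if got := collapseSpaces(tt.input); got != tt.want {
				t.Errorf("collapseSpaces(%q) = %q, want %q", tt.input, got, tt.want)
			}
		})
	}
}
```

Naming each case lets a single failing case be rerun with `go test -run 'TestCollapseSpaces/inner_runs'` (spaces in subtest names become underscores).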
| 73 | + |
| 74 | +## Commit Conventions |
| 75 | + |
| 76 | +We use [Conventional Commits](https://www.conventionalcommits.org/) for automated versioning with release-please. |
| 77 | + |
| 78 | +### Format |
| 79 | + |
| 80 | +``` |
| 81 | +<type>(<scope>): <description> |
| 82 | + |
| 83 | +[optional body] |
| 84 | + |
| 85 | +[optional footer(s)] |
| 86 | +``` |
| 87 | + |
| 88 | +### Types |
| 89 | + |
| 90 | +| Type | Description | |
| 91 | +|------|-------------| |
| 92 | +| `feat` | A new feature | |
| 93 | +| `fix` | A bug fix | |
| 94 | +| `perf` | A performance improvement | |
| 95 | +| `docs` | Documentation only changes | |
| 96 | +| `test` | Adding or correcting tests | |
| 97 | +| `chore` | Maintenance tasks, CI, dependencies | |
| 98 | +| `refactor` | Code change that neither fixes a bug nor adds a feature | |
| 99 | + |
| 100 | +### Examples |
| 101 | + |
| 102 | +``` |
| 103 | +feat(bpe): add support for Llama 3 tokenizer format |
| 104 | +fix(decode): handle surrogate pairs in Unicode decoding |
| 105 | +perf(encode): optimize merge priority queue for long sequences |
| 106 | +docs: update HuggingFace compatibility notes |
| 107 | +test: add round-trip encoding/decoding tests for Gemma vocabulary |
| 108 | +``` |
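
To make the header format concrete, here is a small sketch that checks the first line of a commit message against the types listed above. It is an illustration, not a tool shipped with the repository: the name `validHeader`, the characters allowed in a scope, and the optional `!` breaking-change marker (taken from the Conventional Commits specification) are assumptions.

```go
package main

import "regexp"

// headerPattern matches "<type>(<scope>): <description>" where the scope is
// optional. The type list mirrors the table in this guide; the scope
// character set (lowercase letters, digits, "_" and "-") is an assumption.
var headerPattern = regexp.MustCompile(
	`^(feat|fix|perf|docs|test|chore|refactor)(\([a-z0-9_-]+\))?!?: \S.*$`,
)

// validHeader reports whether the first line of a commit message follows
// the format above. It does not inspect the body or footers.
func validHeader(header string) bool {
	return headerPattern.MatchString(header)
}
```

For example, `validHeader("feat(bpe): add support for Llama 3 tokenizer format")` is true, while `validHeader("update docs")` is false because the type prefix is missing.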
| 109 | + |
| 110 | +## Pull Request Process |
| 111 | + |
| 112 | +1. **One logical change per PR** — keep PRs focused and reviewable |
| 113 | +2. **Branch from `main`** and keep your branch up to date by rebasing onto `main` |
| 114 | +3. **All CI checks must pass** — tests, linting, formatting |
| 115 | +4. **Rebase and merge** — we do not use squash merges or merge commits |
| 116 | +5. **Reference related issues** — use `Fixes #123` or `Closes #123` in the PR description |
| 117 | +6. **Respond to review feedback** promptly |
| 118 | + |
| 119 | +### Before Submitting |
| 120 | + |
| 121 | +```bash |
| 122 | +go test ./... |
| 123 | +go test -race ./... |
| 124 | +go vet ./... |
| 125 | +golangci-lint run |
| 126 | +``` |
| 127 | + |
| 128 | +## Issue Reporting |
| 129 | + |
| 130 | +### Bug Reports |
| 131 | + |
| 132 | +Please include: |
| 133 | + |
| 134 | +- **Description**: Clear summary of the bug |
| 135 | +- **Steps to reproduce**: Minimal code with the tokenizer model file used |
| 136 | +- **Expected behavior**: Expected token IDs or decoded text |
| 137 | +- **Actual behavior**: Actual token IDs or decoded text |
| 138 | +- **Environment**: Go version, OS |
| 139 | +- **Tokenizer model**: Which HuggingFace model's tokenizer was used |
| 140 | + |
| 141 | +### Feature Requests |
| 142 | + |
| 143 | +Please include: |
| 144 | + |
| 145 | +- **Problem statement**: What problem does this solve? |
| 146 | +- **Proposed solution**: How should it work? |
| 147 | +- **Alternatives considered**: Other approaches you thought about |
| 148 | +- **Use case**: How would you use this feature in practice? |
| 149 | + |
| 150 | +## Good First Issues |
| 151 | + |
| 152 | +Look for issues labeled [`good first issue`](https://github.com/zerfoo/ztoken/labels/good%20first%20issue) on GitHub. These are scoped, well-defined tasks suitable for new contributors. |
| 153 | + |
| 154 | +Good areas for first contributions: |
| 155 | + |
| 156 | +- Adding test coverage for edge cases in encoding/decoding |
| 157 | +- Documentation improvements |
| 158 | +- Supporting additional HuggingFace tokenizer configurations |
| 159 | +- Performance optimizations in the BPE merge loop |
| 160 | + |
| 161 | +## Key Conventions |
| 162 | + |
| 163 | +These conventions are critical to maintaining consistency across the codebase: |
| 164 | + |
| 165 | +### HuggingFace compatibility |
| 166 | + |
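| 167 | +ztoken must produce token IDs identical to those of the HuggingFace `tokenizers` library for all supported models. When adding support for a new tokenizer format, verify against the Python reference implementation. `encode` adds the model's special tokens by default, so apply the same setting on both sides when comparing: |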
| 168 | + |
| 169 | +```python |
| 170 | +from transformers import AutoTokenizer |
| 171 | +tok = AutoTokenizer.from_pretrained("model-name") |
| 172 | +print(tok.encode("test string")) |
| 173 | +``` |
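
On the Go side, the reference IDs printed by the Python snippet can be pasted into a test as the expected values. The sketch below shows only the comparison step, under stated assumptions: `firstMismatch` and `describeMismatch` are hypothetical helpers, and the call that produces the Go token IDs is left out because ztoken's API is not described in this guide.

```go
package main

import "fmt"

// firstMismatch returns the index of the first position where got and want
// differ, or -1 when the two ID sequences are identical. A length
// difference counts as a mismatch at the end of the shorter sequence.
// Hypothetical helper, not part of ztoken.
func firstMismatch(got, want []int) int {
	n := len(got)
	if len(want) < n {
		n = len(want)
	}
	for i := 0; i < n; i++ {
		if got[i] != want[i] {
			return i
		}
	}
	if len(got) != len(want) {
		return n
	}
	return -1
}

// describeMismatch returns a message suitable for t.Errorf, or the empty
// string when the sequences match.
func describeMismatch(got, want []int) string {
	i := firstMismatch(got, want)
	if i < 0 {
		return ""
	}
	return fmt.Sprintf("token IDs diverge at index %d: got %v, want %v", i, got, want)
}
```

Reporting the first diverging index, rather than only "not equal", tends to point directly at the merge or pre-tokenization step that differs from the reference.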
| 174 | + |
| 175 | +### Zero external dependencies |
| 176 | + |
| 177 | +ztoken depends only on the Go standard library and `golang.org/x/text`. Do not add third-party dependencies. This keeps the library lightweight and easy to embed. |
| 178 | + |
| 179 | +### Stdlib-only testing |
| 180 | + |
| 181 | +Tests use only the `testing` package from the standard library. Do not introduce test frameworks like testify. |