Commit 60a1f99

docs: add CONTRIBUTING.md
Contributor guide covering development setup, testing, code style, commit conventions, and key conventions including HuggingFace compatibility, zero external dependencies, and stdlib-only testing.
1 parent 2a44b53 commit 60a1f99

1 file changed

Lines changed: 181 additions & 0 deletions

CONTRIBUTING.md

@@ -0,0 +1,181 @@
# Contributing to ztoken

Thank you for your interest in contributing to ztoken, the BPE tokenizer library with HuggingFace compatibility for the Zerfoo ML ecosystem. This guide will help you get started.

## Table of Contents

- [Development Setup](#development-setup)
- [Building from Source](#building-from-source)
- [Running Tests](#running-tests)
- [Code Style](#code-style)
- [Commit Conventions](#commit-conventions)
- [Pull Request Process](#pull-request-process)
- [Issue Reporting](#issue-reporting)
- [Good First Issues](#good-first-issues)
- [Key Conventions](#key-conventions)

## Development Setup

### Prerequisites

- **Go 1.25+**
- **Git**

No GPU, C compiler, or external libraries are required. ztoken's only dependency outside the Go standard library is `golang.org/x/text`.

### Clone and Verify

```bash
git clone https://github.com/zerfoo/ztoken.git
cd ztoken
go mod tidy
go test ./...
```

## Building from Source

```bash
go build ./...
```

ztoken compiles on every platform Go supports with no additional setup.

## Running Tests

```bash
# Run all tests
go test ./...

# Run tests with race detector
go test -race ./...

# Run tests with coverage
go test -cover ./...
go test -coverprofile=coverage.out ./...
go tool cover -html=coverage.out -o coverage.html
```

All new code must have tests. Aim for at least 80% coverage on new packages.
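
One way to check the 80% target mechanically is to read the summary line printed by `go tool cover -func=coverage.out`. The sketch below is only an illustration and is not part of ztoken: `TotalCoverage` is a hypothetical helper, and it assumes the tool's last line starts with `total:` and ends with a percentage.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// TotalCoverage extracts the percentage from the "total:" line of
// `go tool cover -func=coverage.out` output. The line layout is an
// assumption: "total:", some filler, then a value such as "82.5%".
func TotalCoverage(funcOutput string) (float64, error) {
	for _, line := range strings.Split(funcOutput, "\n") {
		fields := strings.Fields(line)
		if len(fields) < 2 || fields[0] != "total:" {
			continue
		}
		pct := strings.TrimSuffix(fields[len(fields)-1], "%")
		return strconv.ParseFloat(pct, 64)
	}
	return 0, errors.New("no total line found")
}

func main() {
	// Sample text shaped like the tool's output; the numbers are made up.
	sample := "example.go:10:\tEncode\t\t75.0%\ntotal:\t\t(statements)\t82.5%\n"
	pct, err := TotalCoverage(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("total %.1f%%, meets 80%% target: %t\n", pct, pct >= 80)
}
```

In practice the input would come from running the tool on a real `coverage.out` rather than from a hard-coded string.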

## Code Style

### Formatting and Linting

- **`gofmt`**: all code must be formatted with `gofmt`
- **`goimports`**: imports must be grouped with the standard library first, then external packages
- **`golangci-lint`**: run `golangci-lint run` before submitting

### Go Conventions

- Follow standard Go naming: PascalCase for exported symbols, camelCase for unexported
- Use table-driven tests with `t.Run` subtests
- Write documentation comments for all exported functions, types, and methods
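
The three conventions above can be shown together in one small file. This is a minimal sketch and not code from ztoken: `SplitWords` and `isSeparator` are hypothetical names chosen for illustration, and the test sits next to the code only to keep the example self-contained (in a real package it would live in a `_test.go` file).

```go
package main

import (
	"fmt"
	"slices"
	"testing"
	"unicode"
)

// SplitWords splits s on Unicode whitespace and returns the non-empty pieces.
// It is a hypothetical helper used only to illustrate the conventions.
func SplitWords(s string) []string {
	var words []string
	start := -1 // byte index where the current word began, or -1 if none
	for i, r := range s {
		switch {
		case isSeparator(r):
			if start >= 0 {
				words = append(words, s[start:i])
				start = -1
			}
		case start < 0:
			start = i
		}
	}
	if start >= 0 {
		words = append(words, s[start:])
	}
	return words
}

// isSeparator reports whether r separates words (unexported, so camelCase).
func isSeparator(r rune) bool {
	return unicode.IsSpace(r)
}

// TestSplitWords is table-driven: each case runs as a named subtest.
func TestSplitWords(t *testing.T) {
	tests := []struct {
		name string
		in   string
		want []string
	}{
		{name: "empty", in: "", want: nil},
		{name: "single word", in: "hello", want: []string{"hello"}},
		{name: "extra spaces", in: "  a  b ", want: []string{"a", "b"}},
		{name: "non-ASCII", in: "héllo wörld", want: []string{"héllo", "wörld"}},
	}
	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			got := SplitWords(tt.in)
			if !slices.Equal(got, tt.want) {
				t.Errorf("SplitWords(%q) = %q, want %q", tt.in, got, tt.want)
			}
		})
	}
}

func main() {
	fmt.Printf("%q\n", SplitWords("  a  b "))
}
```

Naming each case makes a failure report point at the exact input, and a single subtest can be selected with `go test -run 'TestSplitWords/extra_spaces'`.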

## Commit Conventions

We use [Conventional Commits](https://www.conventionalcommits.org/) for automated versioning with release-please.

### Format

```
<type>[optional scope]: <description>

[optional body]

[optional footer(s)]
```

### Types

| Type | Description |
|------|-------------|
| `feat` | A new feature |
| `fix` | A bug fix |
| `perf` | A performance improvement |
| `docs` | Documentation only changes |
| `test` | Adding or correcting tests |
| `chore` | Maintenance tasks, CI, dependencies |
| `refactor` | Code change that neither fixes a bug nor adds a feature |

### Examples

```
feat(bpe): add support for Llama 3 tokenizer format
fix(decode): handle surrogate pairs in Unicode decoding
perf(encode): optimize merge priority queue for long sequences
docs: update HuggingFace compatibility notes
test: add round-trip encoding/decoding tests for Gemma vocabulary
```
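
A commit header can be checked against this format before pushing. The following is a rough sketch under stated assumptions, not a tool that ships with ztoken: `ValidHeader` is a hypothetical helper, the allowed scope characters are a guess, and the optional `!` breaking-change marker comes from the Conventional Commits specification rather than from this guide.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// headerPattern matches "<type>[(scope)][!]: <description>" using the types
// from the table above. The scope character set is an assumption.
var headerPattern = regexp.MustCompile(
	`^(feat|fix|perf|docs|test|chore|refactor)(\([a-z0-9_-]+\))?!?: \S.*$`)

// ValidHeader reports whether the first line of a commit message follows
// the format. Body and footers are ignored.
func ValidHeader(msg string) bool {
	first, _, _ := strings.Cut(msg, "\n")
	return headerPattern.MatchString(first)
}

func main() {
	headers := []string{
		"feat(bpe): add support for Llama 3 tokenizer format",
		"docs: update HuggingFace compatibility notes",
		"Update README", // no type prefix, so this one is rejected
	}
	for _, h := range headers {
		fmt.Println(ValidHeader(h), h)
	}
}
```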

## Pull Request Process

1. **One logical change per PR**: keep PRs focused and reviewable
2. **Branch from `main`** and keep your branch up to date with rebase
3. **All CI checks must pass**: tests, linting, formatting
4. **Rebase and merge**: we do not use squash merges or merge commits
5. **Reference related issues**: use `Fixes #123` or `Closes #123` in the PR description
6. **Respond to review feedback** promptly

### Before Submitting

```bash
go test ./...
go test -race ./...
go vet ./...
golangci-lint run
```

## Issue Reporting

### Bug Reports

Please include:

- **Description**: Clear summary of the bug
- **Steps to reproduce**: Minimal code with the tokenizer model file used
- **Expected behavior**: Expected token IDs or decoded text
- **Actual behavior**: Actual token IDs or decoded text
- **Environment**: Go version, OS
- **Tokenizer model**: Which HuggingFace model's tokenizer was used

### Feature Requests

Please include:

- **Problem statement**: What problem does this solve?
- **Proposed solution**: How should it work?
- **Alternatives considered**: Other approaches you thought about
- **Use case**: How would you use this feature in practice?

## Good First Issues

Look for issues labeled [`good first issue`](https://github.com/zerfoo/ztoken/labels/good%20first%20issue) on GitHub. These are scoped, well-defined tasks suitable for new contributors.

Good areas for first contributions:

- Adding test coverage for edge cases in encoding/decoding
- Documentation improvements
- Supporting additional HuggingFace tokenizer configurations
- Performance optimizations in the BPE merge loop

## Key Conventions

These conventions are critical to maintaining consistency across the codebase:

### HuggingFace compatibility

ztoken must produce identical token IDs to the HuggingFace `tokenizers` library for all supported models. When adding support for a new tokenizer format, verify against the Python reference implementation:

```python
from transformers import AutoTokenizer

# Replace "model-name" with the HuggingFace model ID being verified.
tok = AutoTokenizer.from_pretrained("model-name")
# encode() adds special tokens by default; pass add_special_tokens=False to compare without them.
print(tok.encode("test string"))
```
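
When the two outputs disagree, it helps to know where they first diverge. The sketch below is a small comparison helper, offered as an illustration only: `FirstMismatch` is a hypothetical name, the IDs are placeholders, and ztoken's own encoding API is deliberately left out because it is not shown in this guide.

```go
package main

import "fmt"

// FirstMismatch returns the index of the first position where got and want
// differ, or -1 if the sequences are identical. A length difference counts
// as a mismatch at the length of the shorter sequence.
func FirstMismatch(got, want []int) int {
	n := min(len(got), len(want))
	for i := 0; i < n; i++ {
		if got[i] != want[i] {
			return i
		}
	}
	if len(got) != len(want) {
		return n
	}
	return -1
}

func main() {
	// Placeholder IDs: want would be pasted from the Python output,
	// got would come from the Go tokenizer under test.
	want := []int{101, 2023, 2003, 102}
	got := []int{101, 2023, 2004, 102}
	if i := FirstMismatch(got, want); i >= 0 {
		fmt.Printf("token IDs diverge at index %d\n", i)
	}
}
```

Reporting the first diverging index usually narrows the problem to a single merge or pre-tokenization step, which is easier to debug than a full-sequence diff.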

### Zero external dependencies

ztoken depends only on the Go standard library and `golang.org/x/text`. Do not add third-party dependencies. This keeps the library lightweight and easy to embed.
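
This rule could be enforced with a small check over `go.mod`. What follows is a simplified sketch, not an existing ztoken tool: `DisallowedModules` is a hypothetical helper, it scans lines rather than fully parsing the file, and the module path and versions in the sample are placeholders.

```go
package main

import (
	"fmt"
	"strings"
)

// allowed lists the only module that may appear in a require directive.
var allowed = map[string]bool{"golang.org/x/text": true}

// DisallowedModules returns the module paths required by goMod that are not
// in allowed. It handles single-line and block require directives only.
func DisallowedModules(goMod string) []string {
	var bad []string
	inBlock := false
	for _, line := range strings.Split(goMod, "\n") {
		line, _, _ = strings.Cut(line, "//") // drop trailing comments
		fields := strings.Fields(line)
		switch {
		case len(fields) == 0:
			continue
		case inBlock && fields[0] == ")":
			inBlock = false
		case inBlock:
			if !allowed[fields[0]] {
				bad = append(bad, fields[0])
			}
		case fields[0] == "require" && len(fields) >= 2 && fields[1] == "(":
			inBlock = true
		case fields[0] == "require" && len(fields) >= 3:
			if !allowed[fields[1]] {
				bad = append(bad, fields[1])
			}
		}
	}
	return bad
}

func main() {
	// Sample go.mod content; the versions are placeholders.
	sample := `module github.com/zerfoo/ztoken

require (
	golang.org/x/text v0.0.0
	github.com/stretchr/testify v0.0.0
)
`
	fmt.Println(DisallowedModules(sample))
}
```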

### Stdlib-only testing

Tests use only the `testing` package from the standard library. Do not introduce test frameworks like testify.
