Add DataGen synthetic dataset generation to the BitNet b1.58 CLI #16
Co-authored-by: sharpninja <16146732+sharpninja@users.noreply.github.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: eaa009cb12
Pull request overview
Adds an offline “DataGen” workflow to the BitNetSharp CLI and core library, enabling synthetic JSONL dataset generation from JSON seed examples while recording BitNet model provenance.
Changes:
- Introduces `datagen` CLI command that parses options and writes JSONL output.
- Adds `BitNetSharp.Core` `DataGenGenerator` for loading seed aliases and emitting structured synthetic records.
- Updates docs + examples and adds targeted tests covering seed parsing, option parsing, and JSONL writing.
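To illustrate the option parsing mentioned above, here is a minimal sketch that accepts both the `--option=value` and `--option value` forms named in the PR description. The class `OptionParserSketch` and its error handling are hypothetical and are not taken from `DataGenCommand.cs`, whose actual implementation is not shown in this thread.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not the PR's DataGenCommand): collects
// "--name=value" and "--name value" pairs into a dictionary.
public static class OptionParserSketch
{
    public static Dictionary<string, string> Parse(string[] args)
    {
        var options = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        for (var i = 0; i < args.Length; i++)
        {
            var arg = args[i];
            if (!arg.StartsWith("--", StringComparison.Ordinal))
            {
                throw new ArgumentException($"Unexpected argument: {arg}");
            }

            var equalsIndex = arg.IndexOf('=');
            if (equalsIndex >= 0)
            {
                // "--name=value" form: split on the first '='.
                options[arg.Substring(2, equalsIndex - 2)] = arg.Substring(equalsIndex + 1);
            }
            else if (i + 1 < args.Length)
            {
                // "--name value" form: the next token is taken as the value.
                options[arg.Substring(2)] = args[++i];
            }
            else
            {
                throw new ArgumentException($"Missing value for {arg}");
            }
        }

        return options;
    }
}
```

Calling `OptionParserSketch.Parse(new[] { "--domain=medical-diagnosis", "--count", "50000" })` would yield two entries, `domain` and `count`. Treating every option as requiring a value is an assumption of this sketch; a real command may also support boolean flags.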
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| tests/BitNetSharp.Tests/BitNetModelTests.cs | Adds unit tests for seed alias loading, generator output shape, option parsing, and JSONL writing. |
| src/BitNetSharp.Core/DataGenGenerator.cs | Implements seed loading + synthetic record generation (domain/variation/model/tags/LoRA metadata). |
| src/BitNetSharp.App/Program.cs | Wires the new datagen command into the CLI entry point. |
| src/BitNetSharp.App/DataGenCommand.cs | Implements CLI option parsing and JSONL dataset writing for datagen. |
| examples/seed-examples.json | Provides a sample seed file demonstrating supported alias fields. |
| docs/usage.md | Documents datagen usage examples and links to the new guide. |
| docs/datagen-guide.md | Adds a dedicated DataGen guide (seed aliases, output schema, usage). |
| docs/architecture.md | Mentions DataGenGenerator and the CLI command in the architecture overview. |
| docs/SUMMARY.md | Adds the DataGen guide to GitBook navigation. |
| docs/README.md | Updates quick start + feature list to include DataGen and adds a docs map entry. |

@copilot apply changes based on the comments in this thread
Applied on the branch. The review-thread fixes landed in |
This PR closes the gap between the repository’s stated BitNet b1.58 reference surface and the implemented app surface by adding the missing DataGen workflow. It introduces an offline synthetic dataset generator that expands minimal JSON seed sets into JSONL instruction/response corpora while preserving BitNet model provenance.
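As a rough illustration of the seed loading step, the sketch below reads a JSON seed set and resolves the alias fields listed in the PR description (`instruction`, `prompt`, `input` for the instruction; `response`, `output`, `answer` for the response). The type names, the assumption that the seed file is a JSON array of objects, the alias precedence order, and the error behavior are all hypothetical; the actual `DataGenGenerator` code is not shown in this thread.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical seed shape: one instruction/response pair.
public sealed record SeedExampleSketch(string Instruction, string Response);

public static class SeedLoaderSketch
{
    // Alias names come from the PR description; the lookup order is an assumption.
    private static readonly string[] InstructionAliases = { "instruction", "prompt", "input" };
    private static readonly string[] ResponseAliases = { "response", "output", "answer" };

    public static List<SeedExampleSketch> Load(string json)
    {
        using var document = JsonDocument.Parse(json);
        if (document.RootElement.ValueKind != JsonValueKind.Array)
        {
            throw new FormatException("Seed file must be a JSON array (assumed layout).");
        }

        var seeds = new List<SeedExampleSketch>();
        foreach (var element in document.RootElement.EnumerateArray())
        {
            var instruction = FirstString(element, InstructionAliases);
            var response = FirstString(element, ResponseAliases);
            if (instruction is null || response is null)
            {
                throw new FormatException("Each seed needs an instruction alias and a response alias.");
            }

            seeds.Add(new SeedExampleSketch(instruction, response));
        }

        return seeds;
    }

    // Returns the first alias that is present as a JSON string, or null if none is.
    private static string? FirstString(JsonElement element, string[] aliases)
    {
        if (element.ValueKind != JsonValueKind.Object)
        {
            return null;
        }

        foreach (var alias in aliases)
        {
            if (element.TryGetProperty(alias, out var value) && value.ValueKind == JsonValueKind.String)
            {
                return value.GetString();
            }
        }

        return null;
    }
}
```

With this sketch, a seed written as `{"prompt": "...", "answer": "..."}` loads the same way as one written as `{"instruction": "...", "response": "..."}`.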
CLI surface
- `datagen` command to `BitNetSharp.App`
- `--option=value` and `--option value` forms

Core DataGen implementation
- `DataGenGenerator` in `BitNetSharp.Core`
- Instruction aliases: `instruction`, `prompt`, `input`
- Response aliases: `response`, `output`, `answer`

Model alignment

Docs and examples
- `docs/datagen-guide.md`
- `docs/README.md` and `docs/SUMMARY.md`
- `docs/usage.md`
- `examples/seed-examples.json`

Targeted coverage
Example:

```bash
dotnet run --project src/BitNetSharp.App/BitNetSharp.App.csproj -- datagen \
  --domain "medical-diagnosis" \
  --count 50000 \
  --seeds examples/seed-examples.json \
  --output data/synthetic-medical.jsonl \
  --lora medical-lora.bin
```

Example output shape:

```json
{
  "domain": "medical-diagnosis",
  "instruction": "...",
  "response": "...",
  "seedInstruction": "...",
  "seedResponse": "...",
  "variation": "pattern-1",
  "generatorModel": "bitnet-b1.58-sharp",
  "loraAdapter": "medical-lora.bin",
  "tags": ["synthetic", "offline", "pattern-1", "medical", "diagnosis"]
}
```
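One way records of the shape above could be written as JSONL (one JSON object per line) is sketched below with `System.Text.Json`. The record type `SyntheticRecordSketch` and the writer class are hypothetical names that mirror the fields in the example output; they are not the types used in `DataGenCommand.cs`.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

// Hypothetical record mirroring the fields of the example output shape.
public sealed record SyntheticRecordSketch(
    string Domain,
    string Instruction,
    string Response,
    string SeedInstruction,
    string SeedResponse,
    string Variation,
    string GeneratorModel,
    string? LoraAdapter,
    IReadOnlyList<string> Tags);

public static class JsonlWriterSketch
{
    // camelCase names match the keys in the example; compact output keeps one object per line.
    private static readonly JsonSerializerOptions Options = new()
    {
        PropertyNamingPolicy = JsonNamingPolicy.CamelCase,
        WriteIndented = false
    };

    public static void Write(TextWriter writer, IEnumerable<SyntheticRecordSketch> records)
    {
        foreach (var record in records)
        {
            writer.Write(JsonSerializer.Serialize(record, Options));
            writer.Write('\n'); // explicit LF so the output is the same on every platform
        }
    }
}
```

A caller could pass `File.CreateText("data/synthetic-medical.jsonl")` as the writer. Note that the default encoder escapes some characters (for example non-ASCII text) as `\uXXXX` sequences, which remains valid JSON.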