The bundled entities.json is sourced from https://www.w3.org/TR/html5/entities.json.
Modelled on Philip Jackson's entities crate for Rust.
The core datatypes are:
pub const Entity = struct {
    entity: []u8,
    codepoints: Codepoints,
    characters: []u8,
};

pub const Codepoints = union(enum) {
    Single: u32,
    Double: [2]u32,
};
The list of entities is directly exposed, as well as a binary search function:
pub const ENTITIES: [_]Entity
pub fn lookup(entity: []const u8) ?Entity
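
To make the shape of these types and the binary search concrete, here is a minimal, self-contained sketch. It does not use the library itself: `SAMPLE`, `sampleLookup` and `firstCodepoint` are hypothetical names invented for illustration, the types are local stand-ins (with `[]const u8` slices so they can hold string literals), and the real `lookup` may well be implemented differently.

```zig
const std = @import("std");

// Local stand-ins mirroring the datatypes described above (assumption:
// const slices are used here so the table can be built from literals).
const Codepoints = union(enum) {
    Single: u32,
    Double: [2]u32,
};

const Entity = struct {
    entity: []const u8,
    codepoints: Codepoints,
    characters: []const u8,
};

// A tiny hand-written table, sorted bytewise by `entity` so that a
// binary search over it is valid. Not the library's ENTITIES list.
const SAMPLE = [_]Entity{
    .{ .entity = "&NotEqualTilde;", .codepoints = .{ .Double = .{ 8770, 824 } }, .characters = "\u{2242}\u{338}" },
    .{ .entity = "&amp;", .codepoints = .{ .Single = 38 }, .characters = "&" },
    .{ .entity = "&eacute;", .codepoints = .{ .Single = 233 }, .characters = "\u{e9}" },
    .{ .entity = "&lt;", .codepoints = .{ .Single = 60 }, .characters = "<" },
};

// Hypothetical helper: binary search over SAMPLE by entity name.
fn sampleLookup(name: []const u8) ?Entity {
    var lo: usize = 0;
    var hi: usize = SAMPLE.len;
    while (lo < hi) {
        const mid = lo + (hi - lo) / 2;
        switch (std.mem.order(u8, SAMPLE[mid].entity, name)) {
            .lt => lo = mid + 1,
            .gt => hi = mid,
            .eq => return SAMPLE[mid],
        }
    }
    return null;
}

// Hypothetical helper: unpack the tagged union with a switch.
fn firstCodepoint(cp: Codepoints) u32 {
    return switch (cp) {
        .Single => |c| c,
        .Double => |pair| pair[0],
    };
}

pub fn main() void {
    if (sampleLookup("&NotEqualTilde;")) |e| {
        std.debug.print("{s} -> U+{X:0>4} ({s})\n", .{ e.entity, firstCodepoint(e.codepoints), e.characters });
    }
}
```

Returning an optional means a caller can handle unknown entities with `orelse` or `if (...) |e|` rather than unwrapping with `.?`.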
Add it to your build.zig.zon:
zig fetch --save git+https://nossa.ee/~talya/htmlentities.zig
In your build.zig:
const htmlentities_dep = b.dependency("htmlentities.zig", .{ .target = target, .optimize = optimize });
exe.root_module.addImport("htmlentities", htmlentities_dep.module("htmlentities"));
In your main.zig:
const std = @import("std");
const htmlentities = @import("htmlentities");

pub fn main() !void {
    // Entities are looked up by their full name, including '&' and ';'.
    const eacute = htmlentities.lookup("&eacute;").?;
    std.debug.print("eacute: {}\n", .{eacute});
}
Output:
eacute: Entity{ .entity = &eacute;, .codepoints = Codepoints{ .Single = 233 }, .characters = é }
Ideally we'd do the JSON parsing and struct creation at comptime. The std JSON
tokeniser uses ~80GB of RAM and millions of backtracks to handle the whole
entities.json at comptime, so it's not gonna happen yet. Maybe once we get a
comptime allocator we can use the regular parser.
As it is, we do codegen. Ideally we'd piece together an AST and render
that instead of just writing Zig directly -- I did try it with a 'template'
input string (see some broken WIP at 63b9393), but it's hard to do since
std.zig.render expects all tokens, including string literals, to be available
in the originally parsed source. At the moment we parse our generated source
and format it, so we can at least validate it syntactically in the build step.