declarative parsing and writing #9

Open
ScanMountGoat opened this issue Mar 9, 2024 · 7 comments

@ScanMountGoat

I've been experimenting with making the reading and writing code for modern bdat files more self-documenting using binrw. I've linked a gist with code using binrw 0.13.3 that can read and rebuild all bdat files in Xenoblade 3 1:1 with the originals. This code doesn't include converting to and from the ModernTable type, since it looks like you're reworking that on a separate branch. It should also be doable to store enum values instead of bytes for each row and still rebuild the data correctly.
https://gist.github.com/ScanMountGoat/a406b36a8d103eed035663ec42c3b9f5
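
Here's a minimal sketch of the idea. The field names and layout below are simplified placeholders for illustration, not the actual gist or file format:

use binrw::binrw;

// Hypothetical, heavily simplified layout; the real gist models the full modern bdat format.
#[binrw]
#[brw(magic = b"BDAT", little)]
#[derive(Debug, PartialEq)]
struct Bdat {
    table_count: u32,
    #[br(count = table_count)]
    table_offsets: Vec<u32>,
    // ... remaining header fields, tables, and string data
}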

The implementation is a lot shorter, but the main advantage in my opinion is making the code easier to reason about. The basic process for reading would be to parse the bdat data into an owned Bdat. Writing this back to the file without any changes should produce identical bytes. Converting the Bdat to a friendlier representation like ModernTable can be done with from_ and to_ conversion functions. Splitting the read/write logic from the conversion logic also means the conversions can be tested by comparing Bdat structs instead of diffing binary files.
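
For example, a round-trip test along these lines (reusing the hypothetical Bdat type sketched above) could verify 1:1 rebuilding without diffing binary files:

use binrw::{BinRead, BinWrite};
use std::io::Cursor;

fn roundtrip(bytes: &[u8]) -> binrw::BinResult<()> {
    // Parse into an owned Bdat, write it back out, and compare the raw bytes.
    let bdat = Bdat::read(&mut Cursor::new(bytes))?;
    let mut out = Cursor::new(Vec::new());
    bdat.write(&mut out)?;
    assert_eq!(bytes, out.into_inner().as_slice());

    // Conversions could then be tested by comparing Bdat structs, e.g.
    // assert_eq!(Bdat::from(ModernTable::from(&bdat)), bdat);
    Ok(())
}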

If we want the types to be an implementation detail, we could just have functions that take a reader or bytes and return Vec<ModernTable>, Vec<LegacyTable>, or an enum with both. This would avoid all the current complexity on the user side of dealing with opaque readers or slices that will probably get converted immediately to tables anyway. The bdat files are tiny, so the performance cost of parsing and converting all tables is low.

@ScanMountGoat
Author

This doesn't account for endianness and the naming could be better, but this is the general idea of what I think user code could look like.

// Parse either format, then inspect or rewrite the result.
let bdat = bdat::from_bytes(bytes)?;
match &bdat {
    bdat::Bdat::Legacy(tables) => (),
    bdat::Bdat::Modern(tables) => (),
}
bdat.write(writer)?;

// Or target a specific format directly.
let legacy_tables = bdat::legacy::from_bytes(bytes)?;
legacy_tables.write(writer)?;

let modern_tables = bdat::modern::from_bytes(bytes)?;
modern_tables.write(writer)?;

@roccodev
Owner

I'll have to experiment with it, especially for legacy formats. Thanks!

@roccodev
Owner

If I understand correctly, it seems that this implementation could replace the current io::legacy and io::modern modules. After 0.5 is released, I'll check whether this is feasible for legacy formats.

Right now, legacy formats are the bulk of the library's complexity (both in parsing and in the type system), which is why I wanted to separate the two formats as much as possible in 0.5.

roccodev added this to the 0.6 milestone on Mar 30, 2024
@ScanMountGoat
Author

I would personally define separate types for the legacy and modern bdats. It looks like the fields and types may be slightly different based on the docs. The modern binrw code could at least be a starting point for both versions. Deriving the reading and writing logic and using owned instead of borrowed data should at least reduce a lot of the code complexity in io::legacy and io::modern.

@roccodev
Owner

Yeah, that was my intention. I'm a bit puzzled as to why you would use owned data, though. Currently, lifetimes in the public API can be elided in function signatures, so they only complicate the API in struct fields. They give users the flexibility to either keep the borrow or clone the data to extend its lifetime, which is noticeable when transcoding large message BDATs to JSON or CSV. Admittedly, it's not as beneficial in legacy formats, as you need to clone the data anyway if you don't own a mutable source (because text is originally scrambled).
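
To illustrate that point with made-up types (not the actual bdat API): the lifetime disappears from function signatures, but anything that stores the result has to carry it along:

// Hypothetical types, purely for illustration.
struct Table<'b> {
    name: &'b str, // borrowed from the source buffer
}

// The lifetime is elided here, so callers barely notice it...
fn read_table(bytes: &[u8]) -> Table<'_> {
    todo!()
}

// ...but any struct holding a Table has to carry the lifetime parameter.
struct GameData<'b> {
    tables: Vec<Table<'b>>,
}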

@ScanMountGoat
Author

How expensive is cloning the entire bdat for the largest file?

@roccodev
Owner

Not much; from very rudimentary tests, it seems to be a couple hundred ms at most for a single-threaded extract of the entire XC3 2.2.0 dump. I admit I might have given the performance factor too much focus in my initial analysis.

Still, accepting borrowed data is consistent with the data model of other libraries (most notably serde). Obviously I'm not aiming for total zero-copy, as endianness may vary and values are not properly aligned, but being able to avoid copies for strings is quite nice. In the future, I might also import the hash table from the BDAT source directly for modern tables (up to endianness) if mutable access isn't required (with clone-on-write semantics), so that's another thing that could be borrowed from the data.
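
One way to sketch that borrow-or-own flexibility is with Cow; the field names here are made up for illustration:

use std::borrow::Cow;

// Hypothetical layout: data stays borrowed from the source until it has to be owned.
struct ModernTable<'b> {
    name: Cow<'b, str>,        // zero-copy when reading, owned after edits
    hash_table: Cow<'b, [u8]>, // imported from the BDAT source, cloned only on mutation
}

impl ModernTable<'_> {
    // Detach from the source buffer when the borrow can't be kept alive.
    fn into_owned(self) -> ModernTable<'static> {
        ModernTable {
            name: Cow::Owned(self.name.into_owned()),
            hash_table: Cow::Owned(self.hash_table.into_owned()),
        }
    }
}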

I might also experiment with a custom Serializer/Deserializer for BDAT, so you can model BDAT rows as structs using serde. Some types (i/u64, f64, etc.) may be unsupported and some might alias, so I need to see how to best integrate that with the BDAT type system.
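
Purely as a sketch of what that could look like for users (the derives are standard serde; the row type and entry point are hypothetical):

use serde::{Deserialize, Serialize};

// A hypothetical row layout; a custom bdat Deserializer would map column
// names or hashes to these fields and check that the value types match.
#[derive(Serialize, Deserialize)]
struct ItemRow {
    id: u32,
    name: String,
    price: u16,
}

// e.g. let rows: Vec<ItemRow> = bdat::from_table(&table)?; // hypothetical entry point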
