NBT does not use UTF-8, it's MUTF-8. #144

Open

TkTech opened this issue Oct 12, 2020 · 11 comments

Comments


TkTech commented Oct 12, 2020

NBT uses MUTF-8, not UTF-8. Valid game-generated files will result in UnicodeDecodeErrors when using Twoolie's NBT. Minimal reproduction file with an embedded MUTF-8 NULL: encoded.dat.gz

I'd normally send you a PR to use my MUTF-8 encoder, but being dependency-free seems to be a project goal. There's a pure-python version in there you can just copy.
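
For anyone who just wants to see the difference in action, here's a minimal, hedged sketch (not the reproduction file above) of why a strict UTF-8 decode fails on MUTF-8 bytes, using the `decode_modified_utf8`/`encode_modified_utf8` functions from the mutf8 package that also appear in the patch further down this thread:

```python
# Illustrative only: a NUL encoded the MUTF-8 way (C0 80) is rejected by Python's
# strict UTF-8 decoder, but round-trips fine through the mutf8 package.
import mutf8

raw = b"hello \xc0\x80 world"   # MUTF-8 bytes containing an encoded U+0000

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)  # 'utf-8' codec can't decode byte 0xc0 ...

text = mutf8.decode_modified_utf8(raw)
assert text == "hello \x00 world"
assert mutf8.encode_modified_utf8(text) == raw
```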


macfreek (Collaborator) commented Oct 15, 2020

@TkTech Thanks for the report! And great suggestion, I would support a fix. (Unfortunately, I'm not actively maintaining this package anymore, so I won't do it myself, at least not anytime soon, I'm afraid.)

Interesting topic, I thought I'd seen it all after the different line-endings, different normalizations in UTF, and the BOM-or-no-BOM. Another variant I was not aware of. Seems like one of the original Java programmers had a field day torturing the original UTF-8 and UTF-16 specs. Alas, that's what we have to deal with.

[Edit, seems I was wrong at first]

Just for the record, am I correct to assume the following? If so, there are two differences between this format and UTF-8 (see the sketch below):

  1. It encodes U+0000 as 2 bytes instead of 1 byte (as also described in the rejected Python issue 2857).
  2. It encodes code points outside the Basic Multilingual Plane (i.e. code points >= U+10000) as 6 bytes instead of 4 bytes, like CESU-8 does, and as described by Unicode Technical Report #26.
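
For illustration only (this isn't the nbt library's code, and a real decoder needs proper error handling), a rough Python sketch of those two deviations:

```python
# Hedged sketch of the two MUTF-8 deviations from UTF-8 described above.
def mutf8_encode(text: str) -> bytes:
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0x0000:
            # Difference 1: NUL is encoded as the overlong 2-byte sequence C0 80.
            out += b"\xc0\x80"
        elif cp < 0x10000:
            # Other BMP code points encode exactly as in UTF-8.
            out += ch.encode("utf-8")
        else:
            # Difference 2: supplementary code points become a UTF-16 surrogate
            # pair, each surrogate encoded as 3 bytes (6 bytes total, CESU-8 style).
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            for s in (high, low):
                out += bytes([0xE0 | (s >> 12),
                              0x80 | ((s >> 6) & 0x3F),
                              0x80 | (s & 0x3F)])
    return bytes(out)

print(mutf8_encode("\x00").hex())        # c080 (plain UTF-8 would be 00)
print(mutf8_encode("\U0001F600").hex())  # eda0bdedb880 (plain UTF-8 would be f09f9880)
```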

TkTech (Author) commented Oct 15, 2020

That would be correct. I know it can be confusing, especially since some of the first posts you see when searching (such as on Stack Overflow) just suggest replacing NULLs, which is incorrect.
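
To make that concrete, a small hedged example: even after replacing the two-byte NUL encoding, strings containing supplementary-plane characters still fail to decode as strict UTF-8, because of the surrogate-pair encoding described above:

```python
# Illustrative only: MUTF-8 bytes for U+1F600 (a 6-byte surrogate pair).
# Replacing the C0 80 NUL encoding does nothing here, and a strict UTF-8 decode
# still fails because UTF-8 forbids encoded surrogates.
mutf8_grin = b"\xed\xa0\xbd\xed\xb8\x80"
patched = mutf8_grin.replace(b"\xc0\x80", b"\x00")
try:
    patched.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)  # 'utf-8' codec can't decode byte 0xed in position 0: ...
```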

TkTech (Author) commented Jan 22, 2021

Since this library wants to be dependency-free and Python 2.7 compatible (which mutf8 is not), instead of a PR here's a patch that anyone stumbling on this with an unreadable file can use (as long as you're on Python 3):

```diff
diff --git a/nbt/nbt.py b/nbt/nbt.py
index 947a65e..8f633bd 100644
--- a/nbt/nbt.py
+++ b/nbt/nbt.py
@@ -4,12 +4,13 @@ Handle the NBT (Named Binary Tag) data format
 For more information about the NBT format:
 https://minecraft.gamepedia.com/NBT_format
 """
-
 from struct import Struct, error as StructError
 from gzip import GzipFile
 from collections import MutableMapping, MutableSequence, Sequence
 import sys
 
+import mutf8
+
 _PY3 = sys.version_info >= (3,)
 if _PY3:
     unicode = str
@@ -353,10 +354,10 @@ class TAG_String(TAG, Sequence):
         read = buffer.read(length.value)
         if len(read) != length.value:
             raise StructError()
-        self.value = read.decode("utf-8")
+        self.value = mutf8.decode_modified_utf8(read)
 
     def _render_buffer(self, buffer):
-        save_val = self.value.encode("utf-8")
+        save_val = mutf8.encode_modified_utf8(self.value)
         length = TAG_Short(len(save_val))
         length._render_buffer(buffer)
         buffer.write(save_val)
diff --git a/setup.py b/setup.py
index e6a7cd5..4338408 100755
--- a/setup.py
+++ b/setup.py
@@ -13,6 +13,7 @@ setup(
   license          = open("LICENSE.txt").read(),
   long_description = open("README.txt").read(),
   packages         = ['nbt'],
+  install_requires = ['mutf8'],
   classifiers      = [
         "Development Status :: 5 - Production/Stable",
         "Intended Audience :: Developers",
```


ghost commented Oct 14, 2021

Alright, so it looks like I'll need to fork the project and apply this patch. My block-entity scanning script currently crashes when trying to iterate over HermitCraft Season 7's world due to this bug.


ghost commented Oct 14, 2021

Turns out Hermitcraft 6 has a region file which breaks mutf8. I've isolated the broken region file to be r.14.-2.mca. I'll have to see if I can isolate which block entity is causing the breakage so I can figure out why (e.g. if it's corruption or just mutf8 not being able to read it properly). Edit: It's in the overworld chunks.

[Screenshot attached]


ghost commented Oct 14, 2021

Interestingly enough, the message doesn't show up in the mutf8 source code. However, I believe the message is supposed to come from line 65 of mutf8.py, as that's the message attached to the if statement that checks for byte 0xED. Edit: Yep, it's line 65. I deleted the (cython?) binary and it gave me the exact message from the source, plus the line number 65.

The problem with the way the exceptions are handled right now is that I have no way of finding out which chunk or block is corrupted. So I have to play "guess which block is corrupted", since I can't get the coordinates from the stack trace. I'm about to wipe out every block that isn't bedrock or marked as axe-mineable (for chests). If the error goes away, then it'll most likely be a hopper, furnace, blast furnace, etc. To make it easier to debug, it might do me well to create a datapack that includes every vanilla block that isn't a block entity and use that for wiping out blocks. It may not even be a block that's causing the exception, but I can't load the region file in https://irath96.github.io/webNBT/ or in the NBT plugin I have installed in IntelliJ IDEA. So it may do well to build an editor that can log every exception without breaking the chunk loop (see the sketch below).

I should also mention that neither Minecraft nor Amulet detects a problem with loading these chunks. Even using the Optimize World feature to upgrade the region file from 1.14.4 to 1.17.1 doesn't fix the issue. So, most likely, the data is valid; it's just that mutf8 can't handle it.
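
For what it's worth, here's a rough, hedged sketch of the kind of per-chunk exception logging mentioned above. It assumes the patched Twoolie nbt package and that nbt.region.RegionFile.get_chunk(x, z) raises when a chunk is missing or can't be decoded; it isn't taken from any script in this thread:

```python
# Hedged sketch: scan one region file chunk by chunk, logging failures instead of
# letting a single bad chunk abort the whole loop.
from nbt import region

reg = region.RegionFile("r.14.-2.mca")
for x in range(32):
    for z in range(32):
        try:
            reg.get_chunk(x, z)
        except Exception as exc:  # missing chunk, corruption, or a MUTF-8 decode error
            print(f"chunk ({x}, {z}): {type(exc).__name__}: {exc}")
```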

[Screenshot attached]

TkTech (Author) commented Oct 14, 2021

Can you attach the region file? It may or may not be an issue with mutf8, might just be a genuinely corrupted tag.


ghost commented Oct 14, 2021

Yes. I've located the chest causing the problem too. It turns out Docm had created a series of books with weird characters and named them Alien Tech. Even Minecraft froze for a second when I ran the /data get block ... command on the chest.

I copied the chest to every hotbar save, so you can run something like x+1 to get the chest which breaks mutf8.

Test Corruption.zip

hotbar.nbt.zip

[Two in-game screenshots attached]


ghost commented Oct 14, 2021

As I may have accidentally uploaded a copy from after I broke the chest (to confirm that the problem was with the chest), here's an unedited copy of the region file. Also, here's the command, which has the chest's coordinates in it: /data get block 7580 68 -976.

r.14.-2.mca.zip

Edit: If it helps, this is the scanner I'm working on (https://github.com/alexis-evelyn/WorldScanner/blob/master/scanner.py). I'm currently using the patch you provided at #144 (comment). I haven't uploaded the patched version of NBT yet, but I can do so if you don't already have a fork with a patch. (If I add my own patches, I can include yours too if you'd like.)

@Netherwhal

Still getting the same issue though:

UnicodeDecodeError: 'mutf-8' codec can't decode byte 0xed in position 630: 6-byte codepoint started, but input too short to finish.

OpenBagTwo pushed a commit to OpenBagTwo/NBT that referenced this issue Feb 9, 2024
Offroaders123 added a commit to Offroaders123/NBTify that referenced this issue May 14, 2024
This was an interesting one! Thankfully I found an issue page about MUTF-8 handling on the repo for Twoolie/NBT, the Python project. It gave me some insight and a file to test against. I wrote my own script to slim it down a bunch, and dedupe the tags that are used multiple times. It's crazy how big just book text can get!

I used this actual version of NBTify in this commit to write the new content to the file. That's also why I diffed it: I wanted to make sure that, when I slimmed it down, the content coming out of it was actually what it was supposed to be. With older NBTify it didn't work correctly, because MUTF-8 handles things differently than standard UTF-8.

```js
// @ts-check

import { readFile, writeFile } from "node:fs/promises";
import * as NBT from "./NBTify/src/index.ts";

const data = await readFile("./hotbar.nbt");

// Slice out the raw bytes at the book's offsets in the original file, for comparison later.
const trimmed = data.subarray(0x000BAE96, 0x000CA7C2);
console.log(trimmed);

/** @type {NBT.NBTData<any>} */
const hotbar = await NBT.read(data);

// The book item inside the hotbar data's block entity inventory.
const book = hotbar.data[0][1].tag.BlockEntityTag.Items[12];
console.log(book);

// Re-encode just the book with NBTify, then compare it (minus the leading byte and
// the trailing two bytes) against the bytes sliced from the original file.
const mutf8Demo = await NBT.write(book);
console.log(mutf8Demo);

const demoDiff = mutf8Demo.subarray(1, -2);
console.log(Buffer.compare(trimmed, demoDiff));

await writeFile("./alien-book.nbt", mutf8Demo);
```

#42
#44
twoolie/NBT#144 (comment)
twoolie/NBT#144

I'm still not sure whether I'm going to use the dependency itself or just embed it into NBTify on its own. I think I may just use it as a dependency, as I've been trying to get more used to not reinventing the wheel for everything unless doing so has benefits. The MUTF-8 library already does everything I need it to, and it's ESM TypeScript, so I'm not sure what other reason I have not to just use it; it's great! Eventually I want to move my compression handling into a separate module too, so I will have to use module resolution for that down the road either way. I say heck to it! Let's do it :) Gonna look into whether there's anything I'm forgetting before doing that, though. I really like having the ability to use projects like these (NBTify) without needing a transpilation or build step. Modern CDNs seem to handle this nicely, so we'll see.
@Offroaders123

Just wanted to stop by and say thanks for documenting this! I'm working on an NBT library as well, and MUTF-8 does have a notable difference in output for the character ranges that it handles compared to UTF-8. Having that hotbar.nbt file to test on really helped with ensuring that it works.
