NBT does not use UTF-8, it's MUTF-8. #144

Open

TkTech opened this issue Oct 12, 2020 · 11 comments

Comments


TkTech commented Oct 12, 2020

NBT uses MUTF-8, not UTF-8. Valid game-generated files will result in UnicodeDecodeErrors when using Twoolie's NBT. Minimal reproduction file with an embedded MUTF-8 NULL: encoded.dat.gz

I'd normally send you a PR to use my MUTF-8 encoder, but being dependency-free seems to be a project goal. There's a pure-python version in there you can just copy.
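
For anyone who just wants to see the difference in action, here's a minimal, hedged sketch (not the reproduction file above) of why a strict UTF-8 decode fails on MUTF-8 bytes, using the `decode_modified_utf8`/`encode_modified_utf8` functions from the mutf8 package that also appear in the patch further down this thread:

```python
# Illustrative only: a NUL encoded the MUTF-8 way (C0 80) is rejected by Python's
# strict UTF-8 decoder, but round-trips fine through the mutf8 package.
import mutf8

raw = b"hello \xc0\x80 world"   # MUTF-8 bytes containing an encoded U+0000

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)  # 'utf-8' codec can't decode byte 0xc0 ...

text = mutf8.decode_modified_utf8(raw)
assert text == "hello \x00 world"
assert mutf8.encode_modified_utf8(text) == raw
```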


macfreek (Collaborator) commented Oct 15, 2020

@TkTech Thanks for the report! And great suggestion, I would support a fix. (Unfortunately, I'm not actively maintaining this package anymore, so I won't do it myself, at least not anytime soon, I'm afraid.)

Interesting topic, I thought I'd seen it all after the different line-endings, different normalizations in UTF, and the BOM-or-no-BOM. Another variant I was not aware of. Seems like one of the original Java programmers had a field day torturing the original UTF-8 and UTF-16 specs. Alas, that's what we have to deal with.

[Edit, seems I was wrong at first]

Just for the record, am I correct to assume the following? If so, there are two differences between this format and UTF-8 (see the sketch below):

  1. It encodes U+0000 as 2 bytes instead of 1 byte (as also described in the rejected Python issue 2857).
  2. It encodes code points outside the Basic Multilingual Plane (i.e. code points >= U+10000) as 6 bytes instead of 4 bytes, like CESU-8 does, and as described by Unicode Technical Report #26.
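
For illustration only (this isn't the nbt library's code, and a real decoder needs proper error handling), a rough Python sketch of those two deviations:

```python
# Hedged sketch of the two MUTF-8 deviations from UTF-8 described above.
def mutf8_encode(text: str) -> bytes:
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0x0000:
            # Difference 1: NUL is encoded as the overlong 2-byte sequence C0 80.
            out += b"\xc0\x80"
        elif cp < 0x10000:
            # Other BMP code points encode exactly as in UTF-8.
            out += ch.encode("utf-8")
        else:
            # Difference 2: supplementary code points become a UTF-16 surrogate
            # pair, each surrogate encoded as 3 bytes (6 bytes total, CESU-8 style).
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            for s in (high, low):
                out += bytes([0xE0 | (s >> 12),
                              0x80 | ((s >> 6) & 0x3F),
                              0x80 | (s & 0x3F)])
    return bytes(out)

print(mutf8_encode("\x00").hex())        # c080 (plain UTF-8 would be 00)
print(mutf8_encode("\U0001F600").hex())  # eda0bdedb880 (plain UTF-8 would be f09f9880)
```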

TkTech (Author) commented Oct 15, 2020

That would be correct. I know it can be confusing, especially since some of the first posts you see when searching (such as on Stack Overflow) just suggest replacing NULLs, which is incorrect.
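
To make that concrete, a small hedged example: even after replacing the two-byte NUL encoding, strings containing supplementary-plane characters still fail to decode as strict UTF-8, because of the surrogate-pair encoding described above:

```python
# Illustrative only: MUTF-8 bytes for U+1F600 (a 6-byte surrogate pair).
# Replacing the C0 80 NUL encoding does nothing here, and a strict UTF-8 decode
# still fails because UTF-8 forbids encoded surrogates.
mutf8_grin = b"\xed\xa0\xbd\xed\xb8\x80"
patched = mutf8_grin.replace(b"\xc0\x80", b"\x00")
try:
    patched.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)  # 'utf-8' codec can't decode byte 0xed in position 0: ...
```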

TkTech (Author) commented Jan 22, 2021

Since this library wants to be dependency-free and Python 2.7 compatible (which mutf8 is not), instead of a PR here's a patch that anyone stumbling on this with an unreadable file can use (as long as you're on Python 3):

```diff
diff --git a/nbt/nbt.py b/nbt/nbt.py
index 947a65e..8f633bd 100644
--- a/nbt/nbt.py
+++ b/nbt/nbt.py
@@ -4,12 +4,13 @@ Handle the NBT (Named Binary Tag) data format
 For more information about the NBT format:
 https://minecraft.gamepedia.com/NBT_format
 """
-
 from struct import Struct, error as StructError
 from gzip import GzipFile
 from collections import MutableMapping, MutableSequence, Sequence
 import sys
 
+import mutf8
+
 _PY3 = sys.version_info >= (3,)
 if _PY3:
     unicode = str
@@ -353,10 +354,10 @@ class TAG_String(TAG, Sequence):
         read = buffer.read(length.value)
         if len(read) != length.value:
             raise StructError()
-        self.value = read.decode("utf-8")
+        self.value = mutf8.decode_modified_utf8(read)
 
     def _render_buffer(self, buffer):
-        save_val = self.value.encode("utf-8")
+        save_val = mutf8.encode_modified_utf8(self.value)
         length = TAG_Short(len(save_val))
         length._render_buffer(buffer)
         buffer.write(save_val)
diff --git a/setup.py b/setup.py
index e6a7cd5..4338408 100755
--- a/setup.py
+++ b/setup.py
@@ -13,6 +13,7 @@ setup(
   license          = open("LICENSE.txt").read(),
   long_description = open("README.txt").read(),
   packages         = ['nbt'],
+  install_requires = ['mutf8'],
   classifiers      = [
         "Development Status :: 5 - Production/Stable",
         "Intended Audience :: Developers",
```


ghost commented Oct 14, 2021

Alright, so it looks like I'll need to fork the project and apply this patch. My block-entity scanning script currently crashes when trying to iterate over HermitCraft Season 7's world due to this bug.


ghost commented Oct 14, 2021

Turns out Hermitcraft 6 has a region file which breaks mutf8. I've isolated the broken region file to be r.14.-2.mca. I'll have to see if I can isolate which block entity is causing the breakage so I can figure out why (e.g. if it's corruption or just mutf8 not being able to read it properly). Edit: It's in the overworld chunks.

[Screenshot attached]


ghost commented Oct 14, 2021

Interestingly enough, the message doesn't show up in the mutf8 source code. However, I believe the message is supposed to come from line 65 of mutf8.py, as that's the message attached to the if statement that checks for byte 0xED. Edit: Yep, it's line 65. I deleted the (cython?) binary and it gave me the exact message from the source, plus the line number 65.

The problem with the way the exceptions are handled right now is that I have no way of finding out which chunk or block is corrupted. So I have to play "guess which block is corrupted", since I can't get the coordinates from the stack trace. I'm about to wipe out every block that isn't bedrock or marked as axe-mineable (for chests). If the error goes away, then it'll most likely be a hopper, furnace, blast furnace, etc. To make it easier to debug, it might do me well to create a datapack that includes every vanilla block that isn't a block entity and use that for wiping out blocks. It may not even be a block that's causing the exception, but I can't load the region file in https://irath96.github.io/webNBT/ or in the NBT plugin I have installed in IntelliJ IDEA. So it may do well to build an editor that can log every exception without breaking the chunk loop (see the sketch below).

I should also mention that neither Minecraft nor Amulet detects a problem with loading these chunks. Even using the Optimize World feature to upgrade the region file from 1.14.4 to 1.17.1 doesn't fix the issue. So, most likely, the data is valid; it's just that mutf8 can't handle it.
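
For what it's worth, here's a rough, hedged sketch of the kind of per-chunk exception logging mentioned above. It assumes the patched Twoolie nbt package and that nbt.region.RegionFile.get_chunk(x, z) raises when a chunk is missing or can't be decoded; it isn't taken from any script in this thread:

```python
# Hedged sketch: scan one region file chunk by chunk, logging failures instead of
# letting a single bad chunk abort the whole loop.
from nbt import region

reg = region.RegionFile("r.14.-2.mca")
for x in range(32):
    for z in range(32):
        try:
            reg.get_chunk(x, z)
        except Exception as exc:  # missing chunk, corruption, or a MUTF-8 decode error
            print(f"chunk ({x}, {z}): {type(exc).__name__}: {exc}")
```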

[Screenshot attached]

TkTech (Author) commented Oct 14, 2021

Can you attach the region file? It may or may not be an issue with mutf8, might just be a genuinely corrupted tag.


ghost commented Oct 14, 2021

Yes. I've located the chest causing the problem too. It turns out Docm had created a series of books with weird characters and named them Alien Tech. Even Minecraft froze for a second when I ran the /data get block ... command on the chest.

I copied the chest to every hotbar save, so you can run something like x+1 to get the chest which breaks mutf8.

Test Corruption.zip

hotbar.nbt.zip

[Two in-game screenshots attached]


ghost commented Oct 14, 2021

As I may have accidentally uploaded a copy from after I broke the chest (to confirm that the problem was with the chest), here's an unedited copy of the region file. Also, here's the command, which has the chest's coordinates in it: /data get block 7580 68 -976.

r.14.-2.mca.zip

Edit: If it helps, this is the scanner I'm working on (https://github.com/alexis-evelyn/WorldScanner/blob/master/scanner.py). I'm currently using the patch you provided at #144 (comment). I haven't uploaded the patched version of NBT yet, but I can do so if you don't already have a fork with a patch. (If I add my own patches, I can include yours too if you'd like.)

@Netherwhal

Still getting the same issue though:

UnicodeDecodeError: 'mutf-8' codec can't decode byte 0xed in position 630: 6-byte codepoint started, but input too short to finish.

OpenBagTwo pushed a commit to OpenBagTwo/NBT that referenced this issue Feb 9, 2024
Offroaders123 added a commit to Offroaders123/NBTify that referenced this issue May 14, 2024
This was an interesting one! Thankfully I found an issue page about MUTF-8 handling on the repo for Twoolie/NBT, the Python project. It gave me some insight and a file to test against. I wrote my own script to slim it down a bunch, and dedupe the tags that are used multiple times. It's crazy how big just book text can get!

I used this actual version of NBTify in this commit to write the new content to the file. That's also why I diffed it: I wanted to make sure that, when I slimmed it down, the content coming out of it was actually what it was supposed to be. With older NBTify it didn't work correctly, because MUTF-8 handles things differently than standard UTF-8.

```js
// @ts-check

import { readFile, writeFile } from "node:fs/promises";
import * as NBT from "./NBTify/src/index.ts";

const data = await readFile("./hotbar.nbt");

// Slice out the raw bytes at the book's offsets in the original file, for comparison later.
const trimmed = data.subarray(0x000BAE96, 0x000CA7C2);
console.log(trimmed);

/** @type {NBT.NBTData<any>} */
const hotbar = await NBT.read(data);

// The book item inside the hotbar data's block entity inventory.
const book = hotbar.data[0][1].tag.BlockEntityTag.Items[12];
console.log(book);

// Re-encode just the book with NBTify, then compare it (minus the leading byte and
// the trailing two bytes) against the bytes sliced from the original file.
const mutf8Demo = await NBT.write(book);
console.log(mutf8Demo);

const demoDiff = mutf8Demo.subarray(1, -2);
console.log(Buffer.compare(trimmed, demoDiff));

await writeFile("./alien-book.nbt", mutf8Demo);
```

#42
#44
twoolie/NBT#144 (comment)
twoolie/NBT#144

I'm still not sure whether I'm going to use the dependency itself or just embed it into NBTify on its own. I think I may just use it as a dependency, as I've been trying to get more used to not reinventing the wheel for everything unless doing so has benefits. The MUTF-8 library already does everything I need it to, and it's ESM TypeScript, so I'm not sure what other reason I have not to just use it; it's great! Eventually I want to move my compression handling into a separate module too, so I will have to use module resolution for that down the road either way. I say heck to it! Let's do it :) Gonna look into whether there's anything I'm forgetting before doing that, though. I really like having the ability to use projects like these (NBTify) without needing a transpilation or build step. Modern CDNs seem to handle this nicely, so we'll see.
@Offroaders123

Just wanted to stop by and say thanks for documenting this! I'm working on an NBT library as well, and MUTF-8 does have a notable difference in output for the character ranges that it handles compared to UTF-8. Having that hotbar.nbt file to test on really helped with ensuring that it works.
