# Sound archives

First up are `soundsL.zbd` and `soundsH.zbd`. Fun fact: the disassembled executable hints at a third, medium-quality sounds file, `soundsM.zbd`. The demo ships with only the medium sounds. I don't know why that file wasn't included in the release version, but perhaps it just wasn't worth it (the low variants sound fine to me).

Since the patch simply dumps loose `.wav` files into the directory, we should expect Wave files inside some kind of container or archive. Might as well start at the start of the files:

In [1]:
from pathlib import Path

lo_path = Path("install/v1.2-us-pre/zbd/soundsL.zbd")
hi_path = Path("install/v1.2-us-pre/zbd/soundsH.zbd")

with lo_path.open("rb") as f:
    print(f.read(16))
with hi_path.open("rb") as f:
    print(f.read(16))

b'RIFF \xe0\x02\x00WAVEfmt '
b'RIFF \xe0\x02\x00WAVEfmt '


This already looks very promising with the magic "RIFF" header, and a "WAVE" format. Let's try the most obvious thing first. Since they seem to be actual `*.wav` files, let's just rename a file.

In [2]:
wav_path = Path("soundsL.wav")
try:
    wav_path.symlink_to(lo_path)
except FileExistsError:
    pass

from IPython.display import Audio, display

Audio(url=str(wav_path))

It works! Well, sort of. It turns out to be MechWarrior 3 music, but only 17 seconds of it, which is a bit short for such a large file. Let's parse the file.

In [3]:
import wave
from io import BytesIO

out = BytesIO()
with wave.open(str(lo_path), "rb") as reader, wave.open(out, "wb") as writer:
    writer.setparams(reader.getparams())
    writer.writeframes(reader.readframes(reader.getnframes()))
Audio(data=out.getvalue())

In [4]:
lo_path.stat().st_size

46187446

In [5]:
out.tell()

188456

Again, 17 seconds. And the parsed data is significantly smaller than the file on disk. Wonder what's after that first wave file?

In [6]:
size = out.tell()
with lo_path.open("rb") as f:
    f.seek(size)
    print(f.read(16))

b'RIFF\xec\x01\x00\x00WAVEfmt '


It's another wave file. Better check how many there are...
```console
$ strings "soundsL.wav" | grep "RIFF" | wc -l
     999
```
Ah. So we could probably just read them out sequentially.
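
Reading them out sequentially does not strictly need the `wave` module. The four bytes after the `RIFF` magic are a little-endian chunk size that excludes the 8-byte header, so each file should span `size + 8` bytes. The following is a minimal sketch, assuming the wave files are packed back to back with no padding in between; `iter_riff_spans` is a hypothetical helper, not something the game or the standard library provides:

```python
from struct import unpack_from


def iter_riff_spans(blob: bytes):
    """Yield (start, end) for each RIFF file packed back to back in blob."""
    offset = 0
    # stop at the first position that does not begin with a full RIFF header
    while offset + 8 <= len(blob) and blob[offset : offset + 4] == b"RIFF":
        (size,) = unpack_from("<I", blob, offset + 4)
        end = offset + 8 + size  # the size field excludes the 8-byte header
        yield offset, end
        offset = end


# the first header printed earlier: b"RIFF" followed by b" \xe0\x02\x00"
(first_size,) = unpack_from("<I", b"RIFF \xe0\x02\x00", 4)
print(first_size + 8)  # 188456
```

Note that the space after `RIFF` in the printed header is really the low byte (`0x20`) of the size field. The result, 188456, is the same number `out.tell()` reported for the first file.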

In [7]:
data = hi_path.read_bytes()

positions = []
with BytesIO(data) as f:
    while True:
        position = f.tell()
        positions.append(position)
        # advance f by "reading" the wave file
        try:
            with wave.open(f, "rb") as reader:
                reader.readframes(reader.getnframes())
        except wave.Error:
            break
len(positions) - 1  # off by one because we add the last position before we fail

999

So the counts match exactly. But MechWarrior 3 is a 90s game. It didn't have the luxury of parsing the whole archive every time it needed a sound, so it's likely there's some kind of lookup table somewhere. It's not at the front of the file...

In [8]:
data[-16:]

b'\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\xe7\x03\x00\x00'

That looks like little-endian integers.

In [9]:
from struct import unpack_from

unpack_from("<I", data, len(data) - 4)[0]

999

Coincidence? I think not!

As an aside, having the table at the end makes sense. If you put it at the start, and you add a file, now you have to rewrite all files (or pad the table and hope you don't run out). At the end, you simply write the file, and then write the updated table again.
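
To make that concrete, here is a toy sketch of appending to an archive with a trailing table. The layout (`ENTRY`, `FOOTER`) and the `append_file` helper are made up for illustration and are not the `.zbd` format; the point is only that existing payloads never move:

```python
import struct

ENTRY = struct.Struct("<2I64s")  # start, length, NUL-padded name (toy layout)
FOOTER = struct.Struct("<I")  # number of entries (toy layout)


def append_file(archive: bytes, name: str, payload: bytes) -> bytes:
    """Return a new archive with payload added after the existing payloads."""
    if archive:
        (count,) = FOOTER.unpack_from(archive, len(archive) - FOOTER.size)
        table_start = len(archive) - FOOTER.size - count * ENTRY.size
        body = archive[:table_start]
        table = archive[table_start : len(archive) - FOOTER.size]
    else:
        count, body, table = 0, b"", b""
    # the new payload lands where the old table began; old offsets stay valid
    entry = ENTRY.pack(len(body), len(payload), name.encode("ascii"))
    return body + payload + table + entry + FOOTER.pack(count + 1)


archive = append_file(b"", "a.wav", b"AAAA")
archive = append_file(archive, "b.wav", b"BB")
print(archive[:6])
```

A real tool would seek to the old table start and write in place rather than build new byte strings, but the bookkeeping is the same.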

ID3v1 tags for MP3 files are probably the most famous example of metadata at the end of a file. It was done to maintain compatibility with existing players, but proved simple and reliable. In other words, it was a brilliant hack. In contrast, the much-needed but flawed successor, ID3v2, is stored at the start of the file, and relies on padding or on rewriting the file as the tag grows.

Let's look at the last position, where `wave.Error` terminated the loop but which we still added to the list.

In [10]:
position = positions[-1]
data[position : position + 16]

b'\x00\x00\x00\x00(\xe0\x02\x00demointr'

In [11]:
data[position : position + 32]

b'\x00\x00\x00\x00(\xe0\x02\x00demointr.wav\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'

Very interesting. I think we've found that table. I've written a quick helper function that works a bit like `strings | grep`:

In [12]:
from helpers import find_all

table = data[position:]

indices = find_all(table, b".wav")
len(indices)

999

That confirms it. For efficiency, tables usually have a fixed record size.

In [13]:
len(table) / len(indices)

148.008008008008

So close, but it should be an integer... ah, the count at the end of the file!

In [14]:
(len(table) - 4) / len(indices)

148.004004004004

...or two...

In [15]:
(len(table) - 8) / len(indices)

148.0

In [16]:
unpack_from("<2I", data, len(data) - 8)

(1, 999)

Got there. It seems like each entry in the table is 148 bytes long, although I have no idea what the value `1` at `len(data) - 8` signifies.
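
With the record size and the footer known, the table can be located from the end of the file alone, without scanning through the wave files first. This is a minimal sketch under the layout guessed so far (148-byte entries, then an unknown `uint32` and the entry count); `table_bounds` is a hypothetical helper, and the demo runs on synthetic bytes rather than the real archive:

```python
from struct import calcsize, pack, unpack_from

ENTRY_SIZE = 148  # record size worked out above
FOOTER_FMT = "<2I"  # unknown value (1 in this file), then the entry count


def table_bounds(data: bytes) -> tuple[int, int, int]:
    """Return (table_start, table_end, count) computed from the 8-byte footer."""
    table_end = len(data) - calcsize(FOOTER_FMT)
    _unknown, count = unpack_from(FOOTER_FMT, data, table_end)
    table_start = table_end - count * ENTRY_SIZE
    if table_start < 0:
        raise ValueError("entry count does not fit inside the file")
    return table_start, table_end, count


# synthetic stand-in for an archive: 10 payload bytes, 3 blank entries, footer
fake = b"x" * 10 + bytes(3 * ENTRY_SIZE) + pack(FOOTER_FMT, 1, 3)
print(table_bounds(fake))  # (10, 454, 3)
```

If the guess is right, `table_bounds(data)[0]` on the real archive should equal `positions[-1]`.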

In [17]:
table[0:148]

b'\x00\x00\x00\x00(\xe0\x02\x00demointr.wav\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00{\x9d\xf7\xbf\x1fA\xf7\xbfp\x14\xfc\xbf\x96\x9d\xf7\xbf\x1fA\xf7\xbfp\x14\xfc\xbfL\x9e\xf7\xbf\x01\x00\x00\x00\xa8\x80t\x81\x94\xf8\xee\x01\xaaZ\xf7\xbf\x00\x00\x00\x00,\x00\x00\x00\xb8\x04;\x08(\xe0\x02\x00\x01\x01\x02\x00\xa8\x80t\x81\x00\x00\x00\x00\x00\x00\x00\x00'

In [18]:
table[148 : 148 * 2]

b'(\xe0\x02\x00\xbc\x03\x00\x00cheesit.wav\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00{\x9d\xf7\xbf\x1fA\xf7\xbfp\x14\xfc\xbf\x96\x9d\xf7\xbf\x1fA\xf7\xbfp\x14\xfc\xbfL\x9e\xf7\xbf\x01\x00\x00\x00\x10\x81t\x81\x94\xf8\xee\x01\xaaZ\xf7\xbf\x00\x00\x00\x00,\x00\x00\x00\x1c\xea\xe2\x06\xbc\x03\x00\x00\x01\x01\x02x\x10\x81t\x81\x00\x00\x00\x00\x00\x00\x00\x00'

Taking a simple guess at how to unpack this, glossing over the data after the filename for now:

In [19]:
unpack_from("<2I64s", table, 0)

(0,
 188456,
 b'demointr.wav\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00')

In [20]:
unpack_from("<2I64s", table, 148)

(188456,
 956,
 b'cheesit.wav\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00')

In [21]:
unpack_from("<2I64s", table, 148 * 2)

(189412,
 244,
 b'sfx_button1.wav\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00')

In [22]:
189412 - 956

188456

So each entry gives us start, length, and filename: the second file's start plus its length is exactly the third file's start. (It could instead have been start, end, and filename, although that's less useful.)

In [23]:
offset = len(data) - 8
_, count = unpack_from("<2I", data, offset)
table = {}
for _ in range(count):
    offset -= 148  # walk the table backwards
    start, length, name = unpack_from("<2I64s", data, offset)
    name = name.rstrip(b"\0").decode("ascii")
    table[name] = (start, start + length)
len(table)

931

Er, what? 999 entries, but only 931 unique names. Looks like there are duplicates :/ Also, I should look at the data after the filename at some point. I've creatively called it `extra`.

In [24]:
from collections import defaultdict


def extract_table(data):
    offset = len(data) - 8
    _, count = unpack_from("<2I", data, offset)
    for _ in range(count):
        offset -= 148  # walk the table backwards
        start, length, name, extra = unpack_from("<2I64s76s", data, offset)
        name = name.rstrip(b"\0").decode("ascii")
        yield (name, start, start + length, extra)


table = defaultdict(list)
for name, start, end, extra in extract_table(data):
    table[name].append((start, end, extra))

special = {name: values for name, values in table.items() if len(values) > 1}

dupe_count = sum(len(values) - 1 for values in special.values())
print(count - len(table), dupe_count)

68 68


In [25]:
for name, values in special.items():
    start, end, _ = values[0]
    # all entries with the same name point to different locations...
    assert all(start != s and end != e for s, e, _ in values[1:])
    # but then the data is the same
    assert all(data[start:end] == data[s:e] for s, e, _ in values[1:])

They seem to be true duplicates (same name, same bytes, stored at different offsets), and so safe to discard. Time to compare low and high:

In [26]:
lo_data = lo_path.read_bytes()
hi_data = hi_path.read_bytes()

framerates = set()
for lo, hi in zip(extract_table(lo_data), extract_table(hi_data)):
    lo_name, lo_start, lo_end, lo_extra = lo
    hi_name, hi_start, hi_end, hi_extra = hi

    assert lo_name == hi_name
    with BytesIO(lo_data[lo_start:lo_end]) as f, wave.open(f, "rb") as reader:
        lo_params = reader.getparams()
    with BytesIO(hi_data[hi_start:hi_end]) as f, wave.open(f, "rb") as reader:
        hi_params = reader.getparams()
    assert lo_params.comptype == "NONE"
    assert hi_params.comptype == "NONE"
    assert lo_params.nchannels == hi_params.nchannels

    if lo_params.sampwidth != hi_params.sampwidth:
        print(
            f"{lo_name:<32}",
            "sampwidth",
            lo_params.sampwidth,
            hi_params.sampwidth,
        )

    framerates.add(lo_params.framerate)
    framerates.add(hi_params.framerate)

    # check the length is the same
    assert (
        lo_params.nframes // lo_params.framerate
        == hi_params.nframes // hi_params.framerate
    )

print(framerates)

soil_grass_2s.wav                sampwidth 1 2
soil_in_water_2s.wav             sampwidth 1 2
soil_water_2s.wav                sampwidth 1 2
soil_mud_2s.wav                  sampwidth 1 2
soil_grass_2s.wav                sampwidth 1 2
soil_dirt_2s.wav                 sampwidth 1 2
soil_concrete_2s.wav             sampwidth 1 2
mech_footfall_2s.wav             sampwidth 1 2
{22000, 11025, 22050}


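Nothing too shocking: the low and high archives hold the same sounds, with matching names, channel counts, and durations (to the whole second). Only a handful of `*_2s.wav` files differ in sample width (1 byte in low, 2 bytes in high). No point in messing with the low-quality sounds file then.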

And, okay, I lied. I don't care about the extra fields in the table at all. Maybe later.

So now we have everything we need to extract the sound files. There are two options: either we copy the data blindly, or we read the wave files and write them back out. As usual, I'm going to try the easy option first. However, I don't really want 900+ files littering my hard drive, so I'm going to put them in a `.zip` archive. Some players, such as [VLC](https://www.videolan.org/vlc/index.html), can play files inside an archive, so that suits me well.

In [27]:
from zipfile import ZipFile
import warnings

warnings.filterwarnings("ignore", category=UserWarning)

with ZipFile("sounds.zip", "w") as z:
    for name, start, end, _ in extract_table(hi_data):
        z.writestr(name, hi_data[start:end])
    # include any loose files
    for path in Path("install/v1.2-us-post/zbd/").glob("*.wav"):
        print(path.name)
        z.writestr(path.name, path.read_bytes())

mech_engine_np.wav
mech_footfall_2s.wav
soil_concrete_2s.wav
soil_dirt_2s.wav
soil_grass_2s.wav
soil_in_water_2s.wav
soil_metal_2s.wav
soil_mud_2s.wav
soil_snow_2s.wav
soil_water_2s.wav


I've also muted `UserWarning`s, which `zipfile` emits because of the duplicate names. So we would have caught the duplicates here anyway (at least in newer Python versions).
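
Since the duplicates were shown to be byte-identical, an alternative to muting the warning is to skip names that have already been written. This is a minimal sketch; `write_unique` is a hypothetical helper, and the demo uses made-up entries and an in-memory archive:

```python
import warnings
from io import BytesIO
from zipfile import ZipFile


def write_unique(zf: ZipFile, entries) -> int:
    """Write (name, payload) pairs, skipping repeated names; return the skip count."""
    seen = set()
    skipped = 0
    for name, payload in entries:
        if name in seen:
            skipped += 1  # later copies are dropped; the first one wins
            continue
        seen.add(name)
        zf.writestr(name, payload)
    return skipped


entries = [("a.wav", b"one"), ("b.wav", b"two"), ("a.wav", b"one")]
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # record every warning instead of hiding it
    with ZipFile(BytesIO(), "w") as zf:
        skipped = write_unique(zf, entries)
print(skipped, len(caught))
```

"First one wins" is only safe because the copies are identical; if they ever differed, the order in which the table is walked would start to matter.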

There doesn't seem to be any downside to just copying the sound files. Although there is a binary difference in the output files between the two approaches, in VLC at least they sound identical.
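
Listening is one check; comparing the decoded audio is another. Below is a minimal sketch that treats two wave files as equal when their parameters and sample data match, regardless of other byte-level differences. `read_audio` and `same_audio` are hypothetical helpers, and the demo builds its own test files rather than using the game's:

```python
import wave
from io import BytesIO
from struct import pack


def read_audio(blob: bytes):
    """Return (params, frames) for a wave file held in memory."""
    with wave.open(BytesIO(blob), "rb") as reader:
        return reader.getparams(), reader.readframes(reader.getnframes())


def same_audio(a: bytes, b: bytes) -> bool:
    """True when both blobs decode to identical parameters and sample data."""
    return read_audio(a) == read_audio(b)


# demo: the same samples, once plain and once with an extra trailing chunk
buf = BytesIO()
with wave.open(buf, "wb") as writer:
    writer.setnchannels(1)
    writer.setsampwidth(2)
    writer.setframerate(22050)
    writer.writeframes(b"\x01\x00" * 100)
plain = buf.getvalue()
extra_chunk = b"LIST" + pack("<I", 4) + b"INFO"
riff_size = len(plain) - 8 + len(extra_chunk)
padded = plain[:4] + pack("<I", riff_size) + plain[8:] + extra_chunk
print(plain == padded, same_audio(plain, padded))  # False True
```

Whether the real copied and rewritten files differ in this way (extra chunks) or some other way is not something this sketch establishes; it only separates "different bytes" from "different audio".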

For total rigour, let's at least validate that the table parsing works for all versions:

In [28]:
for sound in Path("install").rglob("sounds*.zbd"):
    print(sound)
    data = sound.read_bytes()
    for name, start, end, _ in extract_table(data):
        with BytesIO(data[start:end]) as f, wave.open(f, "rb") as reader:
            reader.getparams()
            count = reader.getnframes()
            reader.readframes(count + 1)

install/v1.0-de-post/zbd/soundsH.zbd
install/v1.0-de-post/zbd/soundsL.zbd
install/v1.0-de-pre/zbd/soundsH.zbd
install/v1.0-de-pre/zbd/soundsL.zbd
install/v1.0-us-post/zbd/soundsH.zbd
install/v1.0-us-post/zbd/soundsL.zbd
install/v1.0-us-pre/zbd/soundsH.zbd
install/v1.0-us-pre/zbd/soundsL.zbd
install/v1.1-us-post/zbd/soundsH.zbd
install/v1.1-us-post/zbd/soundsL.zbd
install/v1.1-us-pre/zbd/soundsH.zbd
install/v1.1-us-pre/zbd/soundsL.zbd
install/v1.2-us-post/zbd/soundsH.zbd
install/v1.2-us-post/zbd/soundsL.zbd
install/v1.2-us-pre/zbd/soundsH.zbd
install/v1.2-us-pre/zbd/soundsL.zbd


Brilliant.

## Next up

[Texture extraction](06-textures.ipynb)