
New format with different endianness for better performance #5

Closed
zommiommy opened this issue Apr 4, 2023 · 8 comments
Comments

@zommiommy
Collaborator

During today's review of the code we noticed that we read codes in MSB-to-LSB order, which implies the need to use big-endian words. Since most modern computers are little-endian, we might want to add backends supporting LSB-to-MSB, thus little-endian, order, which might result in better performance.

We will discuss this further in the future, as it might lead only to marginal improvements and needs to be carefully benchmarked.

@vigna
Owner

vigna commented Apr 4, 2023

Yes, let's make this more explicit. When we had the first meeting I had a completely twisted memory of how the bitstream is implemented, and I now believe that it is fundamental to change the bit endianness of the current format, which was conceived around Java's inherently big-endian design.

The modifications are as follows:

  • Instead of bit k of the stream being the bit of index 7 − (k mod 8) of byte k div 8, bit k of the stream should be the bit of index k mod 8 of byte k div 8. That is, we are reversing bit endianness at the byte level. With this change, we can indeed read files (on little-endian architectures) with words of arbitrary size. At the moment that is true only of big-endian architectures, and it's not good.
  • However, all fixed-size reads or writes of bit blocks should happen with the bits reversed, because this is the "natural" order with little-endian bits. That is, reading bit blocks from the bit buffer, or writing them to it, should happen with a couple of logical operations; writing fixed-size bit blocks in a true little-endian order would be prohibitively slow.

This has the consequence that we will not be able to convert between the two formats just by reversing the bits in each byte, and in particular that no reasonably fast on-the-fly conversion will be possible (but I might be wrong on this).

More precisely: if we write the ɣ code for 4, namely 00101, in a bit stream, at present we get the byte 00101000. We would like to switch to a format in which the same code is written as 00001100. Note that the unary part is reversed, but the "payload" of the ɣ code is always written in the same direction, which is essential for quick insertion and extraction.
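As a sanity check on those two layouts, here is a hedged sketch (the helper names are illustrative, not the webgraph-rs API) that builds both bytes for the ɣ code of n, which codes n + 1, assuming the code fits in a single byte:

```rust
// Illustrative sketch, not the actual webgraph-rs code: byte layouts of the
// ɣ code of n (coding x = n + 1), for codes of at most 8 bits.

fn gamma_byte_be(n: u64) -> u8 {
    let x = n + 1;
    let lambda = 63 - x.leading_zeros(); // ⌊log2 x⌋
    let len = 2 * lambda + 1;
    // MSB-first: λ zeros, a one, then the λ low bits of x. This is just
    // x written in 2λ + 1 bits, left-aligned in the byte.
    (x << (8 - len)) as u8
}

fn gamma_byte_le(n: u64) -> u8 {
    let x = n + 1;
    let lambda = 63 - x.leading_zeros();
    // LSB-first: λ zeros and a terminating one at bit λ, then the payload
    // (the λ low bits of x) written least-significant-bit first.
    let payload = x & ((1u64 << lambda) - 1);
    ((1u64 << lambda) | (payload << (lambda + 1))) as u8
}

fn main() {
    assert_eq!(gamma_byte_be(4), 0b00101000); // the current (big-endian) byte
    assert_eq!(gamma_byte_le(4), 0b00001100); // the proposed (little-endian) byte
    println!("ok");
}
```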

@zommiommy, please correct me if I'm wrong.

@zacchiro
Collaborator

zacchiro commented Apr 5, 2023

It's great that you have designed a new/better format. YAY for fresh looks at old problems!
About this (emphasis mine):

This has the consequence that we will not be able to convert between the two formats just by reversing the bits in each byte, and in particular that no reasonably fast on-the-fly conversion will be possible (but I might be wrong on this).

We can still have a batch and slow conversion tool from the current webgraph-java format to the new upcoming webgraph-rs format, correct?
Because if not, that would force us to postpone experimenting with the current compressed Software Heritage graph (which uses the previous webgraph-java format) until we also have an implementation of compression in webgraph-rs, which would be unfortunate.

@vigna
Owner

vigna commented Apr 5, 2023

Yes. I was thinking about this issue this morning.

I think that the correct thing to do is to work on readers and writers in parallel, for both formats. It is a bit more work initially, but it has the advantage that, from the start, we can unit-test reading and writing in both formats, and in particular converting graphs. (I'm not saying we should develop compression now, just bit-writing code, even not particularly efficient code.)

An alternative is to write the conversion software in Java. But I think it would also be useful to have a big-endian Rust version, as people would then be able to use the current graphs (more slowly, if my intuition is correct).

So I would suggest that @zommiommy starts, in parallel, a reader and writer for the big- and little-endian formats. For the WordReader, I guess a const generic parameter is sufficient. In the end, the only difference is a single call to the byte-rearrangement function when the native endianness and the format of the file disagree. We will have to add an endianness field to the graph .properties file, and assume big-endian if it is not specified.
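A minimal sketch of that WordReader idea, assuming a boolean const generic and the standard from_be_bytes/from_le_bytes conversions, which compile to a plain load when the file and host endianness agree and to a single byte swap otherwise (the struct and method names are illustrative, not the actual webgraph-rs API):

```rust
// Illustrative sketch: a word reader parameterized on the file's
// endianness by a const generic. BE = true means big-endian words on disk.

struct WordReader<'a, const BE: bool> {
    data: &'a [u8],
    pos: usize,
}

impl<'a, const BE: bool> WordReader<'a, BE> {
    fn new(data: &'a [u8]) -> Self {
        Self { data, pos: 0 }
    }

    fn read_word(&mut self) -> u64 {
        let bytes: [u8; 8] = self.data[self.pos..self.pos + 8].try_into().unwrap();
        self.pos += 8;
        // The byte rearrangement happens only when the native endianness
        // and the file format disagree; otherwise this is a no-op.
        if BE {
            u64::from_be_bytes(bytes)
        } else {
            u64::from_le_bytes(bytes)
        }
    }
}

fn main() {
    let data = [0, 0, 0, 0, 0, 0, 0, 42u8];
    let mut be = WordReader::<true>::new(&data);
    assert_eq!(be.read_word(), 42);
    let mut le = WordReader::<false>::new(&data);
    assert_eq!(le.read_word(), 42u64 << 56);
    println!("ok");
}
```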

For the bit reader, code is slightly different—in the big-endian case we keep the bit buffer filled "at the top" (i.e., the most significant bits are correct), whereas in the little-endian case we keep the bit buffer filled "at the bottom" (i.e., the least significant bits are correct). In the big-endian case, to read a unary code we count the number of leading zeros, whereas in the little-endian case we count the number of trailing zeros. In the big-endian case, to extract k bits we take the k most significant bits of the bit buffer, whereas in the little-endian case we take the k least significant bits of the bit buffer.
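The four operations above can be sketched on a u64 bit buffer as follows (illustrative names; buffer refilling is omitted, so we assume the buffer holds enough valid bits, in particular at least one set bit when reading a unary code):

```rust
// Illustrative sketch of the two read paths on a u64 bit buffer.

fn read_unary_be(buffer: &mut u64) -> u32 {
    let n = buffer.leading_zeros();
    *buffer <<= n + 1; // drop the zeros and the terminating one
    n
}

fn read_unary_le(buffer: &mut u64) -> u32 {
    let n = buffer.trailing_zeros();
    *buffer >>= n + 1;
    n
}

fn read_bits_be(buffer: &mut u64, k: u32) -> u64 {
    let bits = *buffer >> (64 - k); // the k most significant bits
    *buffer <<= k;
    bits
}

fn read_bits_le(buffer: &mut u64, k: u32) -> u64 {
    let bits = *buffer & ((1u64 << k) - 1); // the k least significant bits
    *buffer >>= k;
    bits
}

fn main() {
    // Decode the ɣ code for 4, i.e. 00101 (coding 5), from a big-endian buffer...
    let mut be = 0b00101u64 << 59;
    let lambda = read_unary_be(&mut be);
    let x = (1u64 << lambda) | read_bits_be(&mut be, lambda);
    assert_eq!(x - 1, 4);
    // ...and its little-endian counterpart 00001100 from the other buffer.
    let mut le = 0b00001100u64;
    let lambda = read_unary_le(&mut le);
    let x = (1u64 << lambda) | read_bits_le(&mut le, lambda);
    assert_eq!(x - 1, 4);
    println!("ok");
}
```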

But this is where the differences end. All other codes can be read using unary codes and bit blocks, so the rest of the code would be common (including minimal binary codes, albeit in the end the sequence of written bits would be different in the two cases—Tommaso, we can talk about this).

Note that the weird lack of symmetry between little and big endianness in my previous message is due to the same lack of symmetry at the hardware level: big-endian and little-endian architectures order bits in a byte in the same way, and that's essentially a little-endian way—bit 0 is the least significant bit.

@zommiommy
Collaborator Author

OK, I implemented both readers. Note that I explicitly use to_be and to_le, so these should work on both big-endian and little-endian machines. Of course, you will get better performance if the format and the machine endianness match. Now I think I'll proceed with implementing the writers, if you agree.

Something I'm not sure about yet is whether we should have a single struct with a const generic to set the endianness, or two different structs.

My concerns are the following:
Two structs: code duplication.
Const generic: currently, Rust supports only integers, char, and bool as const generics. The most explicit way would be to have an enumeration with two variants, but that requires the generic_const_exprs feature, which is unstable and incomplete.

@vigna
Owner

vigna commented Apr 5, 2023

Maybe have a macro for the time being, hoping that generic_const_exprs becomes stable? Or use 0/1 as the const and replace them later with a nice enumeration (0 is smaller, so little-endian; 1 is bigger, so big-endian)?

I'd say that a reasonable immediate target is to arrive at a point where we have unit tests writing and reading ɣ codes in both formats. Then the rest should be relatively easy. And at that point we can also do some benchmarking.

@zommiommy
Collaborator Author

zommiommy commented Apr 5, 2023

In code, the alternatives I see are:

Generic const enum (sorry, the feature I mentioned above is the wrong one; the right one is adt_const_params):

#![feature(adt_const_params)]

#[derive(PartialEq, Eq)]
enum Endianness {
    Big,
    Little,
}

struct Reader<const E: Endianness>;

fn main() {
    let _r = Reader::<{ Endianness::Big }>;
}

Generic const bool:

struct Reader<const LITTLE_ENDIAN: bool>;

Two structs:

struct ReaderLittleEndian;
struct ReaderBigEndian;

Define two marker types as generics, as bitvec does:

trait Endianness {}

struct LittleEndian;
struct BigEndian;

impl Endianness for LittleEndian {}
impl Endianness for BigEndian {}

struct Reader<E: Endianness>(std::marker::PhantomData<E>);

fn main() {
    let _r = Reader::<LittleEndian>(std::marker::PhantomData);
}

@zommiommy
Collaborator Author

a reasonable immediate target is to arrive at a point where we have unit tests

Sure, let's start with any of them, and then, once the proof of concept works, migrate the code to the form we think is best.

@vigna
Owner

vigna commented Apr 5, 2023

I'd follow the bitvec path. Good point. If possible, maybe we can use the same structs as bitvec—certainly we will need it at some point.

Repository owner locked and limited conversation to collaborators Apr 7, 2023
@vigna vigna converted this issue into discussion #6 Apr 7, 2023
