Feature request: Serialization stream #39

Closed
Mortal opened this Issue Jan 10, 2013 · 20 comments

Comments

Projects
None yet
4 participants
Collaborator

Mortal commented Jan 10, 2013

We should have a new serialization framework that supports some kind of file format. I am currently implementing new serialization and a serialization_stream in the serialization branch.

It should support pipelining, stream reversing, and sorting.

Mortal was assigned Jan 10, 2013

@Mortal Mortal added a commit that referenced this issue Jan 10, 2013

@Mortal Mortal Add serialization_stream. (#39)
This is a work in progress, so not many features are implemented yet;
however, the core methods that are implemented should be reasonably
stable:

Supports read/write mode only.

open(std::string) only, with an optional `requireCleanClose` parameter
defaulting to true.

Block size fixed to 2 MB.

offset(), size() and can_read() supported.

No seek, skip, rewind, reverse read implemented.
ba31444
Collaborator

Mortal commented Jan 11, 2013

A bit more research has led me to a Stack Overflow answer on overloads/specializations across namespaces.

It seems the right way to enable serialization for user types is to provide a template specialization in the user's namespace, and for the parts of TPIE that use serialization to find the specialization with ADL.

Collaborator

Mortal commented Jan 11, 2013

What should we do when opening a stream for reverse reading and finding out it was not written in reverse (or vice versa)? Ignore the open flag and use the stream order; issue a warning; throw an exception?

Collaborator

antialize commented Jan 11, 2013

Perhaps only use the reverse flag when opening for writing, and document that it is only for writing.

Owner

thomasmoelhave commented Jan 12, 2013

@antialize are you proposing basically that the stream is "written" in reverse (as far as the user knows) and then whenever that stream is opened it will automatically be read backwards (which it'll know do do by the reverse flag)? That sounds like a decent idea. Alternatively, if "reverse" needs to be specified both for reading and writing I recommend throwing an exception if there's a mismatch.

Collaborator

antialize commented Jan 12, 2013

Actualy we have decided changing the interface so that we have 4 classes instead of one:
serialization_writer, serialization_reader, serialization_reverse_writer, serialization_reverse_reader.

This way more errors can be caught in the type system (since a stream cannot be opened for reading and writing at the same time) and the code can be made simpler and more efficient.

Collaborator

Mortal commented Jan 16, 2013

I have reconsidered the serialization stream class hierarchy. In short:

class serialization_header {
public:
    serialization_header(file_accessor::raw_file_accessor &);
    header_size(); // Reserve this many bytes in file for the header.
    // For reading a header:
    read();
    get_size();
    get_clean_close();
    verify();
    // For writing a header:
    set_size(sz); // if not set, writes size=0 in header.
    write(bool cleanClose);
};

class serialization_writer_base {
protected:
    open(path, reverse);
    write_block(buffer, bufSize); // always writes next block
    close();
};

class serialization_writer : public serialization_writer_base {
public:
    open(path);
    serialize(...);
    close();
};

(as well as a reverse writer and forwards/reverse reader.)

Collaborator

Mortal commented Jan 16, 2013

@antialize Please review serialization_stream.h@7836ffcb7 and let me know what you think.

Collaborator

Mortal commented Jan 16, 2013

I have implemented reverse reading and writing in 1e727a4.

Collaborator

Mortal commented Jan 17, 2013

I have implemented sorting. I have tested correctness by comparing the output with LC_ALL=C /usr/bin/sort. I still need to handle memory correctly, but at least there is an example usage in the test/sort program.

Owner

thomasmoelhave commented Jan 18, 2013

How good is the performance of the sorting? Is it very dependent on the size of the items being serialized? If for some reason all elementes were, say, ints - how does the sorting compare to a standard file_stream?

Collaborator

Mortal commented Jan 18, 2013

That's a good question - I'll look into that on Tuesday. So far I have tested scanning and reversing against file_stream, and the CPU overhead is significant when reversing, but not so in ordinary scanning, and when I use actual disks instead of a RAM disk I have not observed a difference.

I'm a bit wary of comparing the two sorters before I can make sure I only use the memory I am allotted in the new sorter, but I'll write a test program for that nonetheless.

Collaborator

Mortal commented Jan 28, 2013

I have run a single speed comparison test pitting tpie::sort against the new tpie::serialization_sorter on identical 50 GB inputs. The test program outputs 21 fractiles from a stream of numbers (which it has to sort), and I'm happy to see that the two programs output the same fractiles and so are probably equally correct.

Both runs have an input scan overhead, because the input numbers are written to a stream by a different program beforehand. The two stream types are different, but the numbers are generated from the same random seed.

Both runs have an output scan overhead. For tpie::sort this is due to the API; for serialization_stream this is due to me not handling the final merge step correctly yet.

Both runs took roughly the same time: 80 minutes (tpie::sort 80:03.3, serialization_sort 80:48.8).

I will fix the output scan overhead in the serialization sorter and then run a test against the new pipelining merge_sorter which does not have this output scan overhead.

Collaborator

antialize commented Feb 19, 2013

Would you create a pipeline wrapper for serialization sort?

Collaborator

Mortal commented Apr 3, 2013

I have implemented evacuation for the serialization sorter. It is tested by ut-serialization_sort which uses the same test case parameters (phase 1 mem/phase 2 mem/phase 3 mem/extra working space/input size) as the ut-merge_sort test (which I wrote for the ordinary pipelining merge sorter).

Collaborator

Mortal commented May 8, 2013

I think it is time to merge serialization into master.

Jakob and I have rebased the serialization onto the current tip of master, and the result is the serialization_rebase branch.

It includes a change to the pipelining merge sorter to handle empty input better.

This is the diffstat:

 doc/CMakeLists.txt                               |    3 +
 doc/Doxyfile.in                                  |    1 +
 doc/code/code.cpp                                |    7 +
 doc/code/serialization.inl                       |   76 ++
 doc/serialization.dox.in                         |  101 +++
 test/CMakeLists.txt                              |    8 +
 test/lines.cpp                                   |   46 ++
 test/sort.cpp                                    |   80 ++
 test/speed_regression/CMakeLists.txt             |   16 +
 test/speed_regression/fractile.h                 |  111 +++
 test/speed_regression/fractile_serialization.cpp |   65 ++
 test/speed_regression/fractile_tpie.cpp          |   53 ++
 test/speed_regression/numbergen.cpp              |  125 +++
 test/speed_regression/serialization.cpp          |  342 +++++++++
 test/unit/CMakeLists.txt                         |    6 +-
 test/unit/merge_sort.h                           |  195 +++++
 test/unit/test_merge_sort.cpp                    |  173 +----
 test/unit/test_pipelining_serialization.cpp      |  204 +++++
 test/unit/test_serialization.cpp                 |  281 ++++++-
 test/unit/test_serialization_sort.cpp            |   45 ++
 tpie/CMakeLists.txt                              |    4 +
 tpie/is_simple_iterator.h                        |   54 ++
 tpie/pipelining.h                                |    2 +
 tpie/pipelining/CMakeLists.txt                   |    1 +
 tpie/pipelining/merge_sorter.h                   |    8 +-
 tpie/pipelining/serialization.h                  |  221 ++++++
 tpie/pipelining/serialization_sort.h             |  536 +++++++++++++
 tpie/serialization2.h                            |  268 +++++++
 tpie/serialization_sort.h                        |  890 ++++++++++++++++++++++
 tpie/serialization_stream.cpp                    |  395 ++++++++++
 tpie/serialization_stream.h                      |  402 ++++++++++
 31 files changed, 4557 insertions(+), 162 deletions(-)
Owner

thomasmoelhave commented May 8, 2013

That works for me, this is probably a good time to do this merge.

Collaborator

freekvw commented May 8, 2013

Shouldn't we wait until we've finished the release?

Collaborator

antialize commented May 8, 2013

We point to a specific commit, and if an issue should arise, we can alwayes branch from there.

Collaborator

freekvw commented May 8, 2013

Was just thinking that's not so pretty, but yeah, whatever :-)

Mortal referenced this issue May 9, 2013

Closed

Releasing TPIE 1.1 #85

Collaborator

Mortal commented Jun 20, 2013

Merged in 51f4da6.

Mortal closed this Jun 20, 2013

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment