Binary I/O #83

Merged
merged 15 commits into from
Mar 23, 2022
ardasener
Contributor

SBFF (SparseBase File Format) (naming is open for debate)

This pull request adds a custom binary format for input and output of Format objects.

Goals

This format is designed with the following goals in mind:

  • As simple as possible
  • Easy to read elsewhere without needing the library
  • As efficient and as small as possible
  • Able to deal with architectural differences
  • Type safe

Specifications

Overview

The general structure of the file format is shown in the figure below.
[Figure: sparsebase_file_format]

As can be seen from the figure there are 3 entities in each file:

  • File header (only one at the very top)
  • Array header (multiple)
  • Array (multiple)

File Header

The file header is a JSON object encoded in ASCII and padded with space characters to be exactly 1KB in length. The object contains the following fields:

  • name: Name of the written structure (for example: CSR, COO)
  • array_count: Number of arrays written to this file
  • dimensions: Dimensions of the structure
  • endian: Either "little" or "big" depending on the architecture's byte order
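
As a rough illustration of the file header described above, here is a minimal Python sketch (not the library's own code) that serializes the four fields as ASCII JSON and pads the result with spaces. It assumes that "1KB" means 1024 bytes and that `dimensions` is stored as a JSON list; `encode_header` and `decode_header` are hypothetical helper names.

```python
import json
import sys

HEADER_SIZE = 1024  # assumption: "1KB" means 1024 bytes


def encode_header(fields: dict) -> bytes:
    """Serialize a header as ASCII JSON, right-padded with spaces."""
    raw = json.dumps(fields).encode("ascii")
    if len(raw) > HEADER_SIZE:
        raise ValueError("header does not fit in the fixed-size block")
    return raw.ljust(HEADER_SIZE, b" ")


def decode_header(block: bytes) -> dict:
    """Parse a padded header; trailing spaces are ordinary JSON whitespace."""
    return json.loads(block.decode("ascii"))


file_header = encode_header({
    "name": "CSR",
    "array_count": 3,
    "dimensions": [4, 4],     # assumption: stored as a JSON list
    "endian": sys.byteorder,  # "little" or "big" on the writing machine
})
```

Because the padding is plain whitespace, any JSON parser can read the block without stripping it first, which fits the goal of reading the format without the library.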

Array Header

Array headers are identical to file headers in structure (a JSON object encoded in ASCII and padded). However, they contain different fields:

  • name: Name of the array (for example: for a CSR this could be row_ptr, col or vals)
  • type: Could be "signed", "unsigned" or "float" depending on the type of the array
  • type_size: Number of bytes used to represent one element of the array (for example: for a double array this is 8)
  • array_size: Number of elements inside the array (i.e., the length of the array)
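
To show how a reader outside the library might use these fields, the sketch below maps `type` and `type_size` onto format characters of Python's `struct` module and derives the payload length from `type_size` and `array_size`. The lookup table and the function names are hypothetical, and the set of supported sizes is an assumption rather than something the format mandates.

```python
import struct

# Hypothetical lookup from (type, type_size) to struct format characters.
_FORMATS = {
    ("signed", 1): "b", ("signed", 2): "h",
    ("signed", 4): "i", ("signed", 8): "q",
    ("unsigned", 1): "B", ("unsigned", 2): "H",
    ("unsigned", 4): "I", ("unsigned", 8): "Q",
    ("float", 4): "f", ("float", 8): "d",
}


def struct_format(array_header: dict, endian: str) -> str:
    """Build a struct format string for one array, honoring the file's endianness."""
    prefix = "<" if endian == "little" else ">"
    code = _FORMATS[(array_header["type"], array_header["type_size"])]
    return f"{prefix}{array_header['array_size']}{code}"


def payload_size(array_header: dict) -> int:
    """Number of raw bytes that follow this array header."""
    return array_header["type_size"] * array_header["array_size"]


vals_header = {"name": "vals", "type": "float", "type_size": 8, "array_size": 3}
fmt = struct_format(vals_header, "little")
```

Storing the kind and the byte width separately is what lets a reader reject a mismatch (for example, asking for 4-byte integers from an 8-byte float array) before interpreting any data.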

Array

The actual data of the arrays is written directly to the disk. Depending on the OS, this is done in one of two ways:

  • On Windows, the arrays will be cast to a char pointer and written using an output stream in standard C++.
  • On UNIX, the arrays will be cast in the same way but written using PIGO's faster write routines.

Since PIGO does not support Windows, reading and writing these files will be slower on Windows. There sadly isn't much we can do about this.
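
Putting the pieces together, the following Python sketch writes and reads a small file to an in-memory stream. It is only an illustration under assumptions: headers are taken to be 1024 bytes, each array header is assumed to be followed immediately by its raw array bytes, and `write_sbff`, `read_sbff`, and the partial `CODES` table are hypothetical names that are not part of the library. It uses plain stream I/O and does not involve PIGO.

```python
import io
import json
import struct
import sys

HEADER_SIZE = 1024  # assumption: "1KB" means 1024 bytes
# Partial, illustrative mapping from (type, type_size) to struct codes.
CODES = {("signed", 4): "i", ("signed", 8): "q", ("float", 8): "d"}


def _header(fields):
    """ASCII JSON padded with spaces to the fixed header size."""
    return json.dumps(fields).encode("ascii").ljust(HEADER_SIZE, b" ")


def write_sbff(stream, name, dimensions, arrays):
    """arrays: list of (name, type, type_size, values) tuples."""
    stream.write(_header({"name": name, "array_count": len(arrays),
                          "dimensions": dimensions, "endian": sys.byteorder}))
    for arr_name, kind, size, values in arrays:
        stream.write(_header({"name": arr_name, "type": kind,
                              "type_size": size, "array_size": len(values)}))
        # "=" selects native byte order with standard sizes and no padding.
        stream.write(struct.pack(f"={len(values)}{CODES[(kind, size)]}", *values))


def read_sbff(stream):
    """Return the file header and a dict of array name -> list of values."""
    file_header = json.loads(stream.read(HEADER_SIZE).decode("ascii"))
    prefix = "<" if file_header["endian"] == "little" else ">"
    arrays = {}
    for _ in range(file_header["array_count"]):
        h = json.loads(stream.read(HEADER_SIZE).decode("ascii"))
        fmt = f"{prefix}{h['array_size']}{CODES[(h['type'], h['type_size'])]}"
        payload = stream.read(h["type_size"] * h["array_size"])
        arrays[h["name"]] = list(struct.unpack(fmt, payload))
    return file_header, arrays


buf = io.BytesIO()
write_sbff(buf, "CSR", [2, 2], [
    ("row_ptr", "signed", 4, [0, 1, 2]),
    ("col", "signed", 4, [0, 1]),
    ("vals", "float", 8, [1.5, 2.5]),
])
buf.seek(0)
header, arrays = read_sbff(buf)
```

The reader picks its byte-order prefix from the `endian` field rather than from the local machine, so a file written on one architecture can still be decoded on another.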

@ardasener ardasener added priority: soon High priority state: pending Taking action type: feature Brand new functionality, features, workflows, endpoints, etc labels Mar 23, 2022
Contributor

@AmroAlJundi AmroAlJundi left a comment

The names of the Read and Write Functions in sparse_file_format should be more clear.

Also, I recommend merging this PR with #82 (i.e., closing that one and moving its PR comments here) since the code for the latter is already merged here.

@ardasener
Contributor Author

As suggested by @AmroAlJundi, all the changes from the pull request #82 are also here due to a necessary merge. So that pull request is closed and all the features discussed there are part of this request.

@ardasener ardasener mentioned this pull request Mar 23, 2022
@ardasener ardasener added state: review needed and removed state: pending Taking action labels Mar 23, 2022
@ardasener ardasener merged commit 8c55640 into develop Mar 23, 2022
@ardasener ardasener deleted the feature/binary_io branch March 23, 2022 16:21
SinanEkm pushed a commit that referenced this pull request Aug 19, 2024
Labels
priority: soon High priority state: review needed type: feature Brand new functionality, features, workflows, endpoints, etc
2 participants