More disk support? #15
I'm starting the branch master --> hdf5. The complexity will drop a lot by doing the following:
---
I have mapped out all the requirements to continue the hdf5 branch in this notebook and think I've got it all covered. Datatypes = check.

---
Some notes:
To overcome these issues, I'd like to implement a TaskManager that maintains a "global name space" of columns. As this global name space is made for multiprocessing, the TaskManager (main process) can delegate tasks to a multiprocessing pool where the work can be sliced in various ways.

As we expect each operation to be atomic and incremental, the work can be split per operation:

Data Import
We can load data using Tablite's filereaders and load CSV data in chunks (already implemented). Each worker can load a different chunk.

Concatenate <--- Implemented 2022/03/13
Concatenation of tables of identical datatypes is trivial insofar as the TaskManager maintains the global namespace of columns:

tbl_4.columns == ['A','B']  # strict order of headings

Filter <--- Implemented 2022/03/13
---
Multiprocessing - a batch of one.
Assume tbl_7 has the columns the function needs.
The interpretation of this function takes place in the TaskManager in the following steps:
A sample task list for batch-like operations is sketched below.
Each task contains (table, slice_start, slice_end, function, batch=True) and implies: select the slice from the table and give it to the function as a batch. To process the chunks in batches of one (e.g. each row), the task list would instead hold one task per row:
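A minimal sketch of both task lists, assuming tasks are plain tuples; tbl_7, my_func and the row counts are placeholders, not tablite API:

```python
# Illustrative sketch: tasks as plain tuples per the description above.
def my_func(a, b):
    # receives columns of data, returns a tuple of columns
    return ([x + y for x, y in zip(a, b)],)

rows, chunk = 1_000_000, 250_000

# Batch-like operation: one task per chunk of rows.
batch_tasks = [
    ("tbl_7", start, min(start + chunk, rows), my_func, True)
    for start in range(0, rows, chunk)
]

# A batch of one: one task per row, i.e. slice_end = slice_start + 1.
single_row_tasks = [("tbl_7", i, i + 1, my_func, True) for i in range(rows)]
```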
The only requirement this approach imposes on the user is to ensure that the arguments are acceptable as columns of data, such as:
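For instance (an illustrative example, with hypothetical names):

```python
def add(a, b):
    # 'a' and 'b' arrive as columns: equal-length sequences of values.
    column = [x + y for x, y in zip(a, b)]
    return (column,)  # the result is a tuple holding one column
```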
And that the result is a tuple of columns. We do not restrict the signature from including the chunk index, nor the function from returning multiple columns. Here is another example:
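Again illustrative; the names are hypothetical:

```python
def sums_and_diffs(a, b, chunk_index=None):
    # the signature may include the chunk index, and the function
    # may return more than one column
    sums = [x + y for x, y in zip(a, b)]
    diffs = [x - y for x, y in zip(a, b)]
    return (sums, diffs)  # two result columns
```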
|
Which tasks happen in which process?
---
POC of memory manager now done.

---
Only show importable compatible formats.

---
We have a test like this in the test suite:
My question: Is this test still valid?

Conclusions
A few hours later.... <--- Implemented.
Tables are now mutable and permit updates. These will be quite slow, so the most efficient approach is to do slice updates rather than individual values. Fastest of course is to create a new column and drop the old.
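To make the performance point concrete, a minimal sketch of the three update styles, using a plain dict of columns as a stand-in for a table (tablite's own syntax may differ):

```python
# Stand-in for a table: a dict of columns.
t = {'A': list(range(1_000))}
doubled = [v * 2 for v in t['A']]

# Slowest: individual value updates, one write per cell.
for i, v in enumerate(doubled):
    t['A'][i] = v

# Faster: a single slice update writes the whole range in one operation.
t['A'][:] = doubled

# Fastest: create a new column and drop the old.
t['A2'] = doubled
del t['A']
```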
---

Implemented in commit #4844bc87.

---
Tablite - more (?) or better disk support?
The benefit of tablite is that it doesn't care about whether the data is in RAM or on disk. The user's usage is the same. Right now SQLite is used for the disk operation.
That's ok, but limited to usage with tempdir.
Sometimes users want to use another directory, store the tables in a particular state and then continue their work at a later point in time. Whilst tempdir can be used for that, tablite has all the data from memory sitting in the same SQLite blob. So persisting individual tables means moving data, which costs time.
It is probably better to allow the user to:
Table.set_working_dir(pathlib.Path)
which allows the user to have all data in there. Set
Table.metadata['localfile'] = "tbl_0.h5"
which allows tablite to conclude that "working_dir/tbl_0.h5" is where the user wants the file. Even if the user doesn't want to manage the file name but just wants to be sure that the data doesn't cause a memory overflow, the user can do:

Table.new_tables_use_disk = True
Table.from_file(filename,...)
and all files will be named automatically in the metadata. If the filename already is in h5 format, the data is merely "imported" by reading the h5 format. This requires only that the metadata is stored in
.attrs
(which is dict-like anyway).

I keep writing h5, because I like hdf5's SWMR function for multiprocessing, in contrast to SQLite's blocking mode.
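For reference, a minimal sketch of SWMR (single-writer-multiple-reader) in plain h5py; the file and dataset names are illustrative:

```python
import h5py
import numpy as np

# Writer: SWMR requires the latest file format; after enabling swmr_mode,
# readers may open the file concurrently without blocking.
f = h5py.File("tbl_0.h5", "w", libver="latest")
dset = f.create_dataset("A", shape=(0,), maxshape=(None,), chunks=True, dtype="i8")
f.swmr_mode = True

# Reader: opens the same file while the writer still holds it.
r = h5py.File("tbl_0.h5", "r", swmr=True)
column_a = r["A"]

# Writer appends a chunk; readers refresh to see the new data.
dset.resize((100,))
dset[:] = np.arange(100)
dset.flush()
column_a.refresh()
assert len(column_a) == 100
```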
This brings me to some missing features:
Table.from_file(filename..., limit=100)
should return the head (100 rows) of the file only.

Table.from_file(filename..., columns=['a','c','d'])
could return the columns a, c, d (not b, e, f, ...) only.

Table.from_file(filename..., columns=['a','c','d'], datatypes=[int,str,date], error=None)
could return the columns a, c, d with corresponding datatypes and use None for exceptions.

Table.to_hdf5(filename=None, ...)

is an alias for to_hdf5 in tempfile mode.

Table.from_file(....)

should print the file location if use_disk==True.
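As a feasibility note: reading a head or a column subset is cheap in HDF5, since datasets are read lazily. A sketch in plain h5py, with illustrative file and column names:

```python
import h5py

with h5py.File("tbl_0.h5", "r") as f:
    head_a = f["a"][:100]  # limit=100: only the first 100 rows touch the disk
    subset = {name: f[name][:] for name in ["a", "c", "d"]}  # columns=[...]
```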

Consequences?
This will require that the StoredColumn inherits the Table's localfile for data access, but otherwise there's no problem.
The file_reader function find_format can be done in parallel, as each worker can read the h5 file with focus on a single column of values and return the appropriate dataformat in isolation.

When exiting, an atexit hook is required for flush() and close() so that the h5 file is closed properly.

When opening a file that already exists, the loading time should be limited to the time it takes to read the metadata of the h5 file, as everything else already is in place.
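A minimal sketch of such an atexit hook (the handle name is illustrative):

```python
import atexit
import h5py

h5 = h5py.File("tbl_0.h5", "a")

def _shutdown():
    # flush pending writes and close the handle, so the h5 file is left
    # in a consistent state at interpreter exit
    h5.flush()
    h5.close()

atexit.register(_shutdown)
```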