
Does Tablite support different datasets concurrently? #57

Closed
akash-goel opened this issue Apr 11, 2023 · 6 comments

Comments

@akash-goel

Hi Team,

Can Tablite support different datasets, store them at different locations, and process them concurrently?

Regards,
Akash

@root-11
Owner

root-11 commented Apr 11, 2023

Hi @akash-goel - It could probably be done. Could you describe your use case in a little more detail?

@akash-goel
Author

Hi,

We have a use case in which multiple users of a webapp work on different files. At the server end we want to read each file, process it, and return the results to the users concurrently.

When I checked Tablite I could not see where it stores its data. Is each dataset stored in a separate file or in the same file, and can we change the storage location?

Please let me know whether this use case works with Tablite.

Regards,
Akash

@root-11
Owner

root-11 commented Apr 12, 2023

Data is stored in tmp/tablite.hdf5. This file - just like a sqlite3 database - can contain any number of datasets, limited only by disk space.
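The one-file-many-datasets layout is the same model sqlite3 uses, so the idea can be illustrated with the standard library alone (the table names here are hypothetical, not tablite internals): several independent datasets live side by side in a single file and are addressed by name.

```python
import os
import sqlite3
import tempfile

# One file holds many independent datasets, analogous to tmp/tablite.hdf5.
path = os.path.join(tempfile.mkdtemp(), "store.db")
con = sqlite3.connect(path)

# Two unrelated datasets stored side by side in the same file.
con.execute("CREATE TABLE users (name TEXT)")
con.execute("CREATE TABLE orders (amount REAL)")
con.execute("INSERT INTO users VALUES ('alice'), ('bob')")
con.execute("INSERT INTO orders VALUES (9.99)")
con.commit()

# Each dataset is looked up by name, independently of the others.
names = [row[0] for row in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(names)  # ['orders', 'users']
con.close()
```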

@akash-goel
Author

akash-goel commented Apr 12, 2023

Thanks for your response.

I tried with a small dataset and the functionality works fine, but when I try with a big dataset I get the error below.

Code Block

Table.reset_storage()
t3 = Table.import_file('Data_test.csv')
t3.show()

Error

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "\lib\multiprocessing\spawn.py", line 125, in _main 
    prepare(preparation_data)
  File "\lib\multiprocessing\spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "\lib\multiprocessing\spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "\lib\runpy.py", line 265, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "\lib\runpy.py", line 97, in _run_module_code       
    _run_code(code, mod_globals, init_globals,
  File "\lib\runpy.py", line 87, in _run_code
importing: processing 'Data...est.csv': 66.99%|██████████████████████████████▏              | [00:04<00:01] 
   exec(code, run_globals)
  File "c\\API\tablite_test.py", line 11, in <module>
    t3 = Table.import_file('Data_test.csv')    
  File "\lib\site-packages\tablite\core.py", line 1756, in 
import_file
    t = reader(**config, **additional_configs)
  File "\lib\site-packages\tablite\core.py", line 481, in text_reader
    with TaskManager(cpu_count - 1) as tm:
  File "\lib\site-packages\mplite\__init__.py", line 79, in __enter__
    self.start()
  File "\lib\site-packages\mplite\__init__.py", line 89, in start
    worker.start()
  File "\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "\lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "\lib\multiprocessing\context.py", line 326, in _Popen
    return Popen(process_obj)
  File "\lib\multiprocessing\popen_spawn_win32.py", line 45, in __init__
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "\lib\multiprocessing\spawn.py", line 154, in get_preparation_data
    _check_not_importing_main()
  File "\lib\multiprocessing\spawn.py", line 134, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.
importing: saving 'Data...est.csv' to disk: 70.00%|████████████████████████████▋            | [00:04<00:01]

Can you please advise what configuration is needed to load the file?

@root-11
Owner

root-11 commented Apr 12, 2023

As noted in the config (line 21), the single-process limit is 1_000_000 rows. When the data exceeds this number of rows, tablite switches to multiprocessing.

As you are using Windows, this means you need to make your module importable by the Windows subprocess.

The easiest way to do this is to wrap your code block in a function, like this:

from tablite import Table

def main():
    Table.reset_storage()
    t3 = Table.import_file('Data_test.csv')
    t3.show()

if __name__ == "__main__":
    main()
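The guard matters because Windows starts child processes with the "spawn" method, which re-imports the main module in every worker; without the `if __name__ == "__main__":` idiom, that re-import would try to start workers recursively, which is exactly the RuntimeError in the traceback above. A minimal standard-library sketch, unrelated to tablite, showing a pool started safely under spawn:

```python
import multiprocessing as mp

def square(x):
    # Runs in the child process.
    return x * x

def main():
    # Force the "spawn" start method (the Windows default). Each child
    # re-imports this module; the __main__ guard below prevents that
    # re-import from launching new children recursively.
    ctx = mp.get_context("spawn")
    with ctx.Pool(2) as pool:
        print(pool.map(square, [1, 2, 3]))  # [1, 4, 9]

if __name__ == "__main__":
    main()
```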

@root-11
Owner

root-11 commented Apr 21, 2023

Closing this issue as there has been no news since April 12th.

@root-11 root-11 closed this as completed Apr 21, 2023