
Feature request: Allow textX to work across processes (multiprocessing, ProcessPoolExecutor) #295

Open
stanislaw opened this issue Oct 27, 2020 · 4 comments

@stanislaw
Contributor

stanislaw commented Oct 27, 2020

Hello,

First of all, thank you very much for creating this tool. I am building my own tool on top of textX and so far I have had a great experience using it!

To improve performance in my project, I have tried to parallelize both reading text files based on my custom textX grammar and writing text files from textX in-memory models. In both cases, the attempt to parallelize results in various pickling errors similar to the following:

Traceback (most recent call last):
  File "/Users/Stanislaw/.pyenv/versions/3.6.0/lib/python3.6/multiprocessing/queues.py", line 241, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/Users/Stanislaw/.pyenv/versions/3.6.0/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
TypeError: cannot serialize '_io.TextIOWrapper' object

For writing objects that I previously obtained with textX, I have managed to work around the pickling errors by stripping out some of textX's metadata as follows:

# HACK:
# ProcessPoolExecutor doesn't work because of non-picklable parts
# of textx. The offending fields are stripped down because they
# are not used anyway when writing using our generator.
document._tx_parser = None
document._tx_attrs = None
document._tx_metamodel = None
document._tx_peg_rule = None

This is the code that I am using to parallelize:

with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
    executor.map(my_processing_function, list_of_things)

I am wondering if there is a good recommendation or fix by the developers/maintainers of textX as to how I could achieve the parallelization without using the hack above.

P.S. Does the project accept donations? It is the core building block of my project and I would be absolutely happy to send a few coins to support textX.

@igordejanovic
Member

Hello @stanislaw. Thanks for your kind words.

I haven't hit a use-case where I would need to pickle models so far, so your approach seems like a good solution at the moment. Thanks for providing the info on what needs to be stripped for serialization to work.

P.S. Does the project accept donations? It is the core building block of my project and I would be absolutely happy to send a few coins to support textX.

Thanks. We haven't set up any donation channel so far so the only way to help the project at the moment is through contributions, like this one :)

@stanislaw
Contributor Author

Another part of this report: if I switch to thread-based parallelization instead of process-based parallelization, i.e. ThreadPoolExecutor instead of ProcessPoolExecutor, I start getting random crashes (see the bottom of this comment), which makes me think that the metamodel_from_str and meta_model.model_from_str methods are not thread-safe.

This is roughly the code that I am using inside each thread.

reader = SDReader()
document = reader.read_from_file(doc_full_path)

and the reader class is roughly as follows:

class SDReader:
    def __init__(self):
        self.meta_model = metamodel_from_str(
            STRICTDOC_GRAMMAR, classes=DOCUMENT_MODELS, use_regexp_group=True
        )
        obj_processors = {
          # some processors I am sure are thread-safe
        }

        self.meta_model.register_obj_processors(obj_processors)

    def read(self, input):
        document = self.meta_model.model_from_str(input)
        return document

    def read_from_file(self, file_path):
        with open(file_path, 'r') as file:
            sdoc_content = file.read()

        try:
            sdoc = self.read(sdoc_content)
            return sdoc
        except Exception:
            # (error handling elided from the original snippet)
            raise

I am under the impression that there is some shared state in metamodel.py and that my threads are therefore corrupting each other's state inside textX's code.

I am quite sure that thread-based parallelization would not speed things up much, as is well known with Python, but I thought I would pass this information along anyway in case there is a simple way to make this API thread-safe, just for the sake of having it side-effect free, with no shared state.
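Until the shared state is addressed inside textX, one possible client-side workaround is to give each thread its own metamodel via `threading.local`, so no parser is ever shared between threads. A sketch under that assumption, with `build_metamodel` as a hypothetical stand-in for the real `metamodel_from_str(...)` call:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def build_metamodel():
    # Stand-in for metamodel_from_str(GRAMMAR, classes=..., ...);
    # any object works to demonstrate the per-thread caching.
    return object()

_tls = threading.local()

def get_metamodel():
    # Lazily build and cache one metamodel per thread, so no parser
    # state is shared across threads.
    mm = getattr(_tls, "metamodel", None)
    if mm is None:
        mm = build_metamodel()
        _tls.metamodel = mm
    return mm

def parse(text):
    mm = get_metamodel()
    # A real implementation would call mm.model_from_str(text) here.
    return (threading.get_ident(), id(mm))

with ThreadPoolExecutor(max_workers=4) as ex:
    results = list(ex.map(parse, ["doc"] * 16))

# Within one thread the metamodel is stable; threads never share one.
per_thread = {}
for tid, mm_id in results:
    per_thread.setdefault(tid, set()).add(mm_id)
assert all(len(ids) == 1 for ids in per_thread.values())
```

This trades memory (one metamodel per worker thread) for isolation, which is usually acceptable for a small pool.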

Thanks.


Some examples of the crashes:

  File "/Users/Stanislaw/.pyenv/versions/3.6.0/lib/python3.6/site-packages/textx/metamodel.py", line 43, in metamodel_from_str
    language_from_str(lang_desc, metamodel)
  File "/Users/Stanislaw/.pyenv/versions/3.6.0/lib/python3.6/site-packages/textx/lang.py", line 932, in language_from_str
    raise TextXSyntaxError(text(e), line, col)
textx.exceptions.TextXSyntaxError: None:136:1: error: Expected EOF at position (136, 1) => '$/ ' ' ;  *ReferenceT'.
  File "/Users/Stanislaw/.pyenv/versions/3.6.0/lib/python3.6/site-packages/textx/metamodel.py", line 43, in metamodel_from_str
    language_from_str(lang_desc, metamodel)
  File "/Users/Stanislaw/.pyenv/versions/3.6.0/lib/python3.6/site-packages/textx/lang.py", line 932, in language_from_str
    raise TextXSyntaxError(text(e), line, col)
textx.exceptions.TextXSyntaxError: None:114:1: error: Expected EOF at position (114, 1) => 'eleted';  *ReqComment'.
  File "/Users/Stanislaw/.pyenv/versions/3.6.0/lib/python3.6/site-packages/textx/metamodel.py", line 43, in metamodel_from_str
    language_from_str(lang_desc, metamodel)
  File "/Users/Stanislaw/.pyenv/versions/3.6.0/lib/python3.6/site-packages/textx/lang.py", line 932, in language_from_str
    raise TextXSyntaxError(text(e), line, col)
textx.exceptions.TextXSyntaxError: None:146:1: error: Expected rule_name or '*' or '?' or '+' or '#' or '-' or attribute or '!' or '&' or ''((\\')|[^'])*'' or '"((\\")|[^"])*"' or re_match or rule_ref or '(' or '|' or ')' at position (146, 1) => '\]/ ' ' ; *'.

igordejanovic added a commit that referenced this issue Oct 29, 2020
This should make metamodel_from_* calls thread-safe
@igordejanovic
Member

Thanks for reporting. Indeed, grammar parsing is not thread-safe, as we cache/reuse parsers for efficiency, so different threads might use the same parser concurrently.

I've just pushed a possible fix. Issues like this are always hard to test, so please install from the branch and check whether the issue is fixed.

@igordejanovic
Member

Discussion continued on #297
