
Feature request: Allow textX to work across processes (multiprocessing, ProcessPoolExecutor) #295

Open
stanislaw opened this issue Oct 27, 2020 · 4 comments

@stanislaw
Contributor

stanislaw commented Oct 27, 2020

Hello,

First of all, thank you very much for creating this tool. I am building my own tool on top of textX and so far I have had a great experience using it!

To improve performance in my project, I have tried to parallelize both reading text files based on my custom textX grammar and writing text files from textX in-memory models. In both cases, the attempt to parallelize results in various pickling errors similar to the following:

Traceback (most recent call last):
  File "/Users/Stanislaw/.pyenv/versions/3.6.0/lib/python3.6/multiprocessing/queues.py", line 241, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/Users/Stanislaw/.pyenv/versions/3.6.0/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
TypeError: cannot serialize '_io.TextIOWrapper' object

For writing objects that I previously obtained with textX, I have managed to work around the pickling errors by stripping out some of textX's metadata as follows:

# HACK:
# ProcessPoolExecutor doesn't work because of non-picklable parts
# of textx. The offending fields are stripped down because they
# are not used anyway when writing using our generator.
document._tx_parser = None
document._tx_attrs = None
document._tx_metamodel = None
document._tx_peg_rule = None

This is the code that I am using to parallelize:

with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
    executor.map(my_processing_function, list_of_things)

I am wondering if there is a good recommendation or fix by the developers/maintainers of textX as to how I could achieve the parallelization without using the hack above.

P.S. Does the project accept donations? It is the core building block of my project and I would be absolutely happy to send a few coins to support textX.

@igordejanovic
Member

Hello @stanislaw. Thanks for your kind words.

I haven't hit a use-case where I would need to pickle models so far, so your approach seems like a good solution at the moment. Thanks for providing the info on what needs to be stripped for serialization to work.

P.S. Does the project accept donations? It is the core building block of my project and I would be absolutely happy to send a few coins to support textX.

Thanks. We haven't set up any donation channel so far so the only way to help the project at the moment is through contributions, like this one :)

@stanislaw
Contributor Author

Another part of this report: if I switch to thread-based parallelization instead of process-based parallelization, i.e. ThreadPoolExecutor instead of ProcessPoolExecutor, I start getting random crashes (see the bottom of this comment), which makes me think that the metamodel_from_str and meta_model.model_from_str methods are not thread-safe.

This is roughly the code that I am using inside each thread.

reader = SDReader()
document = reader.read_from_file(doc_full_path)

and the reader class is roughly as follows:

class SDReader:
    def __init__(self):
        self.meta_model = metamodel_from_str(
            STRICTDOC_GRAMMAR, classes=DOCUMENT_MODELS, use_regexp_group=True
        )
        obj_processors = {
          # some processors I am sure are thread-safe
        }

        self.meta_model.register_obj_processors(obj_processors)

    def read(self, input):
        document = self.meta_model.model_from_str(input)
        return document

    def read_from_file(self, file_path):
        with open(file_path, 'r') as file:
            sdoc_content = file.read()

        try:
            sdoc = self.read(sdoc_content)
            return sdoc
        except Exception:
            # (error handling elided from the original snippet)
            raise

I am under the impression that there is some shared state in metamodel.py and that my threads are therefore corrupting each other's state inside textX's code.

I am quite sure that thread-based parallelization would not speed things up much, as is well known with Python, but I thought I would pass this information along anyway in case there is a simple way to make this API thread-safe, just for the sake of having it side-effect free, with no shared state.
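Until the shared state is addressed inside textX, one possible client-side workaround is to give each thread its own metamodel via `threading.local`, so no parser is ever shared between threads. A sketch under that assumption, with `build_metamodel` as a hypothetical stand-in for the real `metamodel_from_str(...)` call:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def build_metamodel():
    # Stand-in for metamodel_from_str(GRAMMAR, classes=..., ...);
    # any object works to demonstrate the per-thread caching.
    return object()

_tls = threading.local()

def get_metamodel():
    # Lazily build and cache one metamodel per thread, so no parser
    # state is shared across threads.
    mm = getattr(_tls, "metamodel", None)
    if mm is None:
        mm = build_metamodel()
        _tls.metamodel = mm
    return mm

def parse(text):
    mm = get_metamodel()
    # A real implementation would call mm.model_from_str(text) here.
    return (threading.get_ident(), id(mm))

with ThreadPoolExecutor(max_workers=4) as ex:
    results = list(ex.map(parse, ["doc"] * 16))

# Within one thread the metamodel is stable; threads never share one.
per_thread = {}
for tid, mm_id in results:
    per_thread.setdefault(tid, set()).add(mm_id)
assert all(len(ids) == 1 for ids in per_thread.values())
```

This trades memory (one metamodel per worker thread) for isolation, which is usually acceptable for a small pool.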

Thanks.


Some examples of the crashes:

  File "/Users/Stanislaw/.pyenv/versions/3.6.0/lib/python3.6/site-packages/textx/metamodel.py", line 43, in metamodel_from_str
    language_from_str(lang_desc, metamodel)
  File "/Users/Stanislaw/.pyenv/versions/3.6.0/lib/python3.6/site-packages/textx/lang.py", line 932, in language_from_str
    raise TextXSyntaxError(text(e), line, col)
textx.exceptions.TextXSyntaxError: None:136:1: error: Expected EOF at position (136, 1) => '$/ ' ' ;  *ReferenceT'.
  File "/Users/Stanislaw/.pyenv/versions/3.6.0/lib/python3.6/site-packages/textx/metamodel.py", line 43, in metamodel_from_str
    language_from_str(lang_desc, metamodel)
  File "/Users/Stanislaw/.pyenv/versions/3.6.0/lib/python3.6/site-packages/textx/lang.py", line 932, in language_from_str
    raise TextXSyntaxError(text(e), line, col)
textx.exceptions.TextXSyntaxError: None:114:1: error: Expected EOF at position (114, 1) => 'eleted';  *ReqComment'.
  File "/Users/Stanislaw/.pyenv/versions/3.6.0/lib/python3.6/site-packages/textx/metamodel.py", line 43, in metamodel_from_str
    language_from_str(lang_desc, metamodel)
  File "/Users/Stanislaw/.pyenv/versions/3.6.0/lib/python3.6/site-packages/textx/lang.py", line 932, in language_from_str
    raise TextXSyntaxError(text(e), line, col)
textx.exceptions.TextXSyntaxError: None:146:1: error: Expected rule_name or '*' or '?' or '+' or '#' or '-' or attribute or '!' or '&' or ''((\\')|[^'])*'' or '"((\\")|[^"])*"' or re_match or rule_ref or '(' or '|' or ')' at position (146, 1) => '\]/ ' ' ; *'.

igordejanovic added a commit that referenced this issue Oct 29, 2020
This should make metamodel_from_* calls thread-safe
@igordejanovic
Member

Thanks for reporting. Indeed, grammar parsing is not thread-safe, as we cache/reuse parsers for efficiency, so different threads might use the same parser concurrently.

I've just pushed a possible fix. Issues like this are always hard to test, so please install from the branch and check whether the issue is fixed.

@igordejanovic
Member

Discussion continued on #297
