TransformDataset doesn't process the data in parallel (uses only a single worker) #146

Open
wsuchy opened this issue Nov 1, 2019 · 1 comment

@wsuchy

wsuchy commented Nov 1, 2019

When using multiple input files and FnApiRunner / SUBPROCESS_SDK runner:

pipeline_options = PipelineOptions(['--direct_num_workers', str(workers)])
return beam.Pipeline(
    options=pipeline_options,
    runner=fn_api_runner.FnApiRunner(
        default_environment=beam_runner_api_pb2.Environment(
            urn=python_urns.SUBPROCESS_SDK,
            payload=b'%s -m apache_beam.runners.worker.sdk_worker_main'
                    % sys.executable.encode('ascii'))))

tft_beam.AnalyzeAndTransformDataset does use all workers and generates multiple output files, which makes processing quite fast (gist: analyze_and_transform()).

tft_beam.TransformDataset, however, uses only one worker and produces only one output file (gist: transform_only()). This makes it almost impossible to process the test and validation datasets within a reasonable amount of time.
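The difference in output sharding can be checked by counting the files each pipeline wrote. This is only a minimal sketch, assuming the outputs were written with Beam's default shard naming (`<prefix>-SSSSS-of-NNNNN`, optionally followed by a file suffix). The helper name `count_output_shards` and any prefixes passed to it are hypothetical and not taken from the gist.

```python
import glob
import re

# Assumption: shards follow Beam's default "-SSSSS-of-NNNNN" naming template,
# directly after the output prefix and optionally followed by a suffix.
_SHARD_RE = re.compile(r"-\d{5}-of-\d{5}")


def count_output_shards(output_prefix):
    """Count files named '<output_prefix>-SSSSS-of-NNNNN[suffix]'."""
    # Escape the prefix so glob metacharacters in the path are taken literally.
    candidates = glob.glob(glob.escape(output_prefix) + "*")
    # Keep only paths whose remainder after the prefix starts with a shard tag.
    return sum(
        1 for path in candidates
        if _SHARD_RE.match(path[len(output_prefix):])
    )
```

Calling it once with the output prefix of the analyze-and-transform run and once with the prefix of the transform-only run should show several shards for the former and a single shard for the latter if the behaviour described here is reproduced.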

Is there a problem with my code or is it a bug?

Gist: https://gist.github.com/wsuchy/0c89b27a72b457ae6c904d8786658d2e
The dataset comes from https://www.kaggle.com/generall/oneshotwikilinks and was preprocessed with the prepare_dataset function.

wsuchy changed the title from "TransformDataset doesn't process the data in parallel (uses only single core)" to "TransformDataset doesn't process the data in parallel (uses only single worker)" on Nov 1, 2019
rmothukuru self-assigned this on Nov 4, 2019
rmothukuru assigned zoyahav and unassigned rmothukuru on Nov 4, 2019
rmothukuru added the type:performance (Performance Issue) label and removed the type:bug label on Nov 4, 2019
@schmidt-jake

I'm also running into this issue.
