Cannot process dataset with 800+ features: Job graph is too large #57

Closed
fabito opened this issue May 14, 2018 · 17 comments

@fabito

fabito commented May 14, 2018

Hi,

I'm trying to submit a job to process a dataset (~850 features) in Cloud Dataflow.
The preprocessing_fn looks like this:

import tensorflow as tf
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    """Preprocess input columns into transformed columns."""
    outputs = {}

    # _discrete_features(), _continuous_features(), _float_features() and
    # _string_features() are helpers (not shown) that return lists of column names.
    for key in _discrete_features():  # 395 features
        x = inputs[key]
        # Writes a vocabulary file for the feature; the return value is unused here.
        tft.uniques(tf.as_string(x), vocab_filename=key, store_frequency=True)
        outputs[key] = tft.scale_to_z_score(x)

    for key in _continuous_features():  # 216 features
        x = inputs[key]
        # `t.nanmean` appears to be a custom NaN-ignoring mean analyzer (not shown).
        nanmean = t.nanmean(x)
        x = tf.where(tf.is_nan(x), tf.fill(tf.shape(x), nanmean), x)
        outputs[key] = tft.scale_to_z_score(x)

    for key in _float_features():  # 59 features
        outputs[key] = tft.scale_to_z_score(inputs[key])

    for key in _string_features():  # 191 features
        outputs[key] = tft.string_to_int(inputs[key], vocab_filename=key)

    outputs[LABEL_KEY] = inputs[LABEL_KEY]

    return outputs

After a few minutes the job submission fails with "The job graph is too large."
Has anyone seen this before? How can I work around it?

Detailed logs below:

INFO:root:Starting the size estimation of the input
INFO:oauth2client.transport:Attempting refresh to obtain initial access_token
INFO:root:Finished the size estimation of the input at 1 files. Estimation took 0.172410011292 seconds
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
INFO:tensorflow:SavedModel written to: gs://bucket/datasets/v2.0-rc4/tmp/tftransform_tmp/11f97ee8c0fd4197bca9d2c0361ebf49/saved_model.pb
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
INFO:tensorflow:SavedModel written to: gs://bucket/datasets/v2.0-rc4/tmp/tftransform_tmp/66e87beb6715476288308d47028d92e6/saved_model.pb
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
INFO:tensorflow:SavedModel written to: gs://bucket/datasets/v2.0-rc4/tmp/tftransform_tmp/5493bd9dd95845aba7924d331a416bf3/saved_model.pb
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
INFO:tensorflow:SavedModel written to: gs://bucket/datasets/v2.0-rc4/tmp/tftransform_tmp/48aa71f6a4664a6da46ded569656bb7c/saved_model.pb
INFO:root:Starting the size estimation of the input
INFO:oauth2client.transport:Attempting refresh to obtain initial access_token
INFO:root:Finished the size estimation of the input at 10 files. Estimation took 0.0985808372498 seconds
INFO:root:Starting GCS upload to gs://bucket/df/data/tft-top-100-20180514202806.1526329784.381043/pipeline.pb...
INFO:oauth2client.transport:Attempting refresh to obtain initial access_token
INFO:root:Completed GCS upload to gs://bucket/df/data/tft-top-100-20180514202806.1526329784.381043/pipeline.pb
INFO:root:Executing command: ['/home/user/npd/venv/bin/python', 'setup.py', 'sdist', '--dist-dir', '/tmp/tmpt5flm5']

(...)

INFO:root:Starting GCS upload to gs://bucket/df/data/tft-top-100-20180514202806.1526329784.381043/workflow.tar.gz...
INFO:root:Completed GCS upload to gs://bucket/df/data/tft-top-100-20180514202806.1526329784.381043/workflow.tar.gz
INFO:root:Starting GCS upload to gs://bucket/df/data/tft-top-100-20180514202806.1526329784.381043/pickled_main_session...
INFO:root:Completed GCS upload to gs://bucket/df/data/tft-top-100-20180514202806.1526329784.381043/pickled_main_session
INFO:root:Staging the SDK tarball from PyPI to gs://bucket/df/data/tft-top-100-20180514202806.1526329784.381043/dataflow_python_sdk.tar
INFO:root:Executing command: ['/home/user/npd/venv/bin/python', '-m', 'pip', 'download', '--dest', '/tmp/tmpt5flm5', 'google-cloud-dataflow==2.4.0', '--no-binary', ':all:', '--no-deps']
Collecting google-cloud-dataflow==2.4.0
  Using cached https://files.pythonhosted.org/packages/3b/6b/165eb940a26b16ee27cee2643938e23955c54f6042e7e241b2d6afea8cea/google-cloud-dataflow-2.4.0.tar.gz
  Saved /tmp/tmpt5flm5/google-cloud-dataflow-2.4.0.tar.gz
Successfully downloaded google-cloud-dataflow
INFO:root:file copy from /tmp/tmpt5flm5/google-cloud-dataflow-2.4.0.tar.gz to gs://bucket/df/data/tft-top-100-20180514202806.1526329784.381043/dataflow_python_sdk.tar.
INFO:oauth2client.transport:Attempting refresh to obtain initial access_token
Traceback (most recent call last):
  File "preprocess/transform_v2.py", line 62, in <module>
    main()
  File "preprocess/transform_v2.py", line 57, in main
    transform_data(pipeline_options, known_args.input_dir, known_args.output_dir, top_features)
  File "preprocess/transform_v2.py", line 31, in transform_data
    | CreateSegmentDataset(segment_name, converter, output_dir, make_preprocessing_fn(top=top_features), RAW_DATA_METADATA))
  File "/home/user/npd/venv/local/lib/python2.7/site-packages/apache_beam/pipeline.py", line 389, in __exit__
    self.run().wait_until_finish()
  File "/home/user/npd/venv/local/lib/python2.7/site-packages/apache_beam/pipeline.py", line 369, in run
    self.to_runner_api(), self.runner, self._options).run(False)
  File "/home/user/npd/venv/local/lib/python2.7/site-packages/apache_beam/pipeline.py", line 382, in run
    return self.runner.run_pipeline(self)
  File "/home/user/npd/venv/local/lib/python2.7/site-packages/apache_beam/runners/dataflow/dataflow_runner.py", line 324, in run_pipeline
    self.dataflow_client.create_job(self.job), self)
  File "/home/user/npd/venv/local/lib/python2.7/site-packages/apache_beam/utils/retry.py", line 180, in wrapper
    return fun(*args, **kwargs)
  File "/home/user/npd/venv/local/lib/python2.7/site-packages/apache_beam/runners/dataflow/internal/apiclient.py", line 475, in create_job
    return self.submit_job_description(job)
  File "/home/user/npd/venv/local/lib/python2.7/site-packages/apache_beam/utils/retry.py", line 180, in wrapper
    return fun(*args, **kwargs)
  File "/home/user/npd/venv/local/lib/python2.7/site-packages/apache_beam/runners/dataflow/internal/apiclient.py", line 523, in submit_job_description
    response = self._client.projects_locations_jobs.Create(request)
  File "/home/user/npd/venv/local/lib/python2.7/site-packages/apache_beam/runners/dataflow/internal/clients/dataflow/dataflow_v1b3_client.py", line 643, in Create
    config, request, global_params=global_params)
  File "/home/user/npd/venv/local/lib/python2.7/site-packages/apitools/base/py/base_api.py", line 722, in _RunMethod
    return self.ProcessHttpResponse(method_config, http_response, request)
  File "/home/user/npd/venv/local/lib/python2.7/site-packages/apitools/base/py/base_api.py", line 728, in ProcessHttpResponse
    self.__ProcessHttpResponse(method_config, http_response, request))
  File "/home/user/npd/venv/local/lib/python2.7/site-packages/apitools/base/py/base_api.py", line 599, in __ProcessHttpResponse
    http_response, method_config=method_config, request=request)
apitools.base.py.exceptions.HttpBadRequestError: HttpError accessing <https://dataflow.googleapis.com/v1b3/projects/my-project/locations/us-central1/jobs?alt=json>: response: <{'status': '400', 'content-length': '229', 'x-xss-protection': '1; mode=block', 'x-content-type-options': 'nosniff', 'transfer-encoding': 'chunked', 'vary': 'Origin, X-Origin, Referer', 'server': 'ESF', '-content-encoding': 'gzip', 'cache-control': 'private', 'date': 'Mon, 14 May 2018 20:30:37 GMT', 'x-frame-options': 'SAMEORIGIN', 'content-type': 'application/json; charset=UTF-8'}>, content <{
  "error": {
    "code": 400,
    "message": "(73447f6f3d9a02de): The job graph is too large. Please try again with a smaller job graph, or split your job into two or more smaller jobs.",
    "status": "INVALID_ARGUMENT"
  }
}
@fabito changed the title from "Cannot process dataset with 800+ features: generated Dataflow job is too large" to "Cannot process dataset with 800+ features: Job graph is too large" on May 14, 2018
@KesterTong
Contributor

Is it possible to use a few multi-dimensional features instead of many single-dimensional features? This may greatly improve performance. If your features are somehow grouped by categories, this may also be a logical way to arrange your features.
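A minimal sketch of this workaround (not from the thread): stack the scalar columns into one wide tensor so a single analyzer covers all of them. The feature names below are hypothetical, and it assumes a tensorflow_transform version whose scale_to_z_score supports the elementwise argument.

import tensorflow as tf
import tensorflow_transform as tft

NUMERIC_KEYS = ['feat_a', 'feat_b', 'feat_c']  # hypothetical scalar feature names

def preprocessing_fn(inputs):
    outputs = {}
    # Stack the [batch] scalars into a single [batch, num_features] tensor.
    stacked = tf.stack([inputs[key] for key in NUMERIC_KEYS], axis=1)
    # One analyzer covers every column; elementwise=True keeps per-column statistics.
    outputs['numeric_scaled'] = tft.scale_to_z_score(stacked, elementwise=True)
    return outputs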

@Debasish-Das-CK
Contributor

I pushed 2,000 features through by translating 2,000 one-dimensional features into a single 2,000-dimensional feature.

@frarito

frarito commented Sep 17, 2018

Any update on this issue? I'm in a similar situation.

@KesterTong
Contributor

Are you able to use a similar workaround?

@debasish83

This workaround worked fine for us and actually makes training faster as well. The overall idea is to run tf-transform on a high-dimensional tensor for the numeric/categorical features, and at the end of the flow either unstack it or let the TF model fit on the high-dimensional tensor directly, as in image training.
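A rough sketch of the "unstack at the end" idea, under the same assumptions as the snippet above (hypothetical feature names, elementwise support in scale_to_z_score):

import tensorflow as tf
import tensorflow_transform as tft

FEATURE_KEYS = ['feat_a', 'feat_b', 'feat_c']  # hypothetical feature names

def preprocessing_fn(inputs):
    stacked = tf.stack([inputs[key] for key in FEATURE_KEYS], axis=1)
    scaled = tft.scale_to_z_score(stacked, elementwise=True)
    # Split the [batch, n] tensor back into n [batch] columns with the original
    # names, e.g. for models such as boosted trees that expect per-feature inputs.
    columns = tf.unstack(scaled, axis=1)
    return {key: col for key, col in zip(FEATURE_KEYS, columns)}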

@debasish83

I think tf-transform could fuse this automatically, but we haven't had time to push the fix in. Providing an unstacked view is also good for tf.boosted_trees, but I don't think we can make it generic.

@Harshini-Gadige

The team is working on this, and we encourage you to use the workaround provided above for now.
Will keep you posted.

@cyc

cyc commented May 9, 2019

Any updates or workarounds for this feature request? I am now hitting the "graph too large" limit, but unfortunately the transformations I use cannot be concatenated together to be applied element-wise.

@KesterTong
Contributor

Which transformations can't be concatenated? We are currently working on #110 and #62; will those be enough?

@cyc

cyc commented May 9, 2019

Yes, #110 would suffice for our needs. Just wondering if there are any workarounds for folks running into this issue with the current release.

@KesterTong
Contributor

No, there is no workaround for #110.

@RuhuaJiang

RuhuaJiang commented Feb 3, 2021

Any update on this? The workaround works for the case where the same transform applies to N features, so we can concat the N features first and then apply the transform once; basically we have to do tf.sparse.concat first in the preprocessing_fn. For the use case of doing per-feature processing first, the workaround is not going to work.

Another question: is there any rule-of-thumb guidance for avoiding "graph is too large", such as "the number of features should be less than X" or "a given transformation should appear fewer than M times"? Right now, as a user, I am in constant fear of the graph being too large; the only way to find out is to launch the job on Dataflow (and wait several minutes) and hope for the best. This experience is not great.
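For reference, a rough sketch of the tf.sparse.concat variant described above, assuming the categorical inputs arrive as SparseTensors that can share one vocabulary (feature names are hypothetical):

import tensorflow as tf
import tensorflow_transform as tft

SPARSE_KEYS = ['cat_a', 'cat_b']  # hypothetical sparse categorical feature names

def preprocessing_fn(inputs):
    # Concatenate the sparse columns along the feature axis so a single
    # vocabulary analyzer covers all of them.
    combined = tf.sparse.concat(axis=1, sp_inputs=[inputs[key] for key in SPARSE_KEYS])
    return {'categorical_ids': tft.compute_and_apply_vocabulary(combined)}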

@zoyahav
Member

zoyahav commented Feb 4, 2021

There have been several improvements since the original issue.
If you're still running into this issue, could you please open a separate issue and provide as many details as possible (a description of the input features, how many there are, what analysis is performed on them, the full stack trace of errors, etc.)?

Re: guidance on a rule of thumb for avoiding the "graph is too large" error, we can't provide one because (a) it depends on many factors that are specific to each user's inputs and preprocessing_fn, and (b) this is a Dataflow error that we can't really control.

@RuhuaJiang

Thanks @zoyahav

@UsharaniPagadala

@fabito

Could you please confirm whether this issue can be closed. Thanks!

@gaikwadrahul8

@fabito

Apologies for the delay. This seems to be a duplicate of issue #223, and developers there report that using runner_v1 with upload_graph solves the problem, as per this comment. Could you please try that workaround and let us know whether it resolves your issue?
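For reference, a hedged sketch of passing that experiment from Python; the project, region, and bucket names below are placeholders. As I understand it, upload_graph makes Dataflow stage the job graph to GCS rather than embedding it in the job creation request.

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',
    region='us-central1',
    temp_location='gs://my-bucket/tmp',
    experiments=['upload_graph'],
)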

Could you please confirm whether this issue is resolved for you, and feel free to close it if so. If the issue still persists, please share the error log so we can investigate the root cause.

Thank you!

@gaikwadrahul8

Hi, @fabito

Closing this issue due to a lack of recent activity over the past couple of weeks. Please feel free to reopen the issue or post a comment if you need any further assistance or updates.

Thank you!
