Cannot process dataset with 800+ features: Job graph is too large #57

Closed
fabito opened this issue May 14, 2018 · 17 comments

@fabito

fabito commented May 14, 2018

Hi,

I'm trying to submit a job to process a dataset (~850 features) in Cloud Dataflow.
The preprocessing_fn looks like this:

import tensorflow as tf
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    """Preprocess input columns into transformed columns."""
    outputs = {}

    # _discrete_features(), _continuous_features(), _float_features() and
    # _string_features() are helpers (not shown) that return lists of column names.
    for key in _discrete_features():  # 395 features
        x = inputs[key]
        # Writes a vocabulary file for the feature; the return value is unused here.
        tft.uniques(tf.as_string(x), vocab_filename=key, store_frequency=True)
        outputs[key] = tft.scale_to_z_score(x)

    for key in _continuous_features():  # 216 features
        x = inputs[key]
        # `t.nanmean` appears to be a custom NaN-ignoring mean analyzer (not shown).
        nanmean = t.nanmean(x)
        x = tf.where(tf.is_nan(x), tf.fill(tf.shape(x), nanmean), x)
        outputs[key] = tft.scale_to_z_score(x)

    for key in _float_features():  # 59 features
        outputs[key] = tft.scale_to_z_score(inputs[key])

    for key in _string_features():  # 191 features
        outputs[key] = tft.string_to_int(inputs[key], vocab_filename=key)

    outputs[LABEL_KEY] = inputs[LABEL_KEY]

    return outputs

After a few minutes the job submission fails with "The job graph is too large."
Has anyone seen this before? How can I work around it?

Detailed logs below:

INFO:root:Starting the size estimation of the input
INFO:oauth2client.transport:Attempting refresh to obtain initial access_token
INFO:root:Finished the size estimation of the input at 1 files. Estimation took 0.172410011292 seconds
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
INFO:tensorflow:SavedModel written to: gs://bucket/datasets/v2.0-rc4/tmp/tftransform_tmp/11f97ee8c0fd4197bca9d2c0361ebf49/saved_model.pb
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
INFO:tensorflow:SavedModel written to: gs://bucket/datasets/v2.0-rc4/tmp/tftransform_tmp/66e87beb6715476288308d47028d92e6/saved_model.pb
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
INFO:tensorflow:SavedModel written to: gs://bucket/datasets/v2.0-rc4/tmp/tftransform_tmp/5493bd9dd95845aba7924d331a416bf3/saved_model.pb
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
INFO:tensorflow:SavedModel written to: gs://bucket/datasets/v2.0-rc4/tmp/tftransform_tmp/48aa71f6a4664a6da46ded569656bb7c/saved_model.pb
INFO:root:Starting the size estimation of the input
INFO:oauth2client.transport:Attempting refresh to obtain initial access_token
INFO:root:Finished the size estimation of the input at 10 files. Estimation took 0.0985808372498 seconds
INFO:root:Starting GCS upload to gs://bucket/df/data/tft-top-100-20180514202806.1526329784.381043/pipeline.pb...
INFO:oauth2client.transport:Attempting refresh to obtain initial access_token
INFO:root:Completed GCS upload to gs://bucket/df/data/tft-top-100-20180514202806.1526329784.381043/pipeline.pb
INFO:root:Executing command: ['/home/user/npd/venv/bin/python', 'setup.py', 'sdist', '--dist-dir', '/tmp/tmpt5flm5']

(...)

INFO:root:Starting GCS upload to gs://bucket/df/data/tft-top-100-20180514202806.1526329784.381043/workflow.tar.gz...
INFO:root:Completed GCS upload to gs://bucket/df/data/tft-top-100-20180514202806.1526329784.381043/workflow.tar.gz
INFO:root:Starting GCS upload to gs://bucket/df/data/tft-top-100-20180514202806.1526329784.381043/pickled_main_session...
INFO:root:Completed GCS upload to gs://bucket/df/data/tft-top-100-20180514202806.1526329784.381043/pickled_main_session
INFO:root:Staging the SDK tarball from PyPI to gs://bucket/df/data/tft-top-100-20180514202806.1526329784.381043/dataflow_python_sdk.tar
INFO:root:Executing command: ['/home/user/npd/venv/bin/python', '-m', 'pip', 'download', '--dest', '/tmp/tmpt5flm5', 'google-cloud-dataflow==2.4.0', '--no-binary', ':all:', '--no-deps']
Collecting google-cloud-dataflow==2.4.0
  Using cached https://files.pythonhosted.org/packages/3b/6b/165eb940a26b16ee27cee2643938e23955c54f6042e7e241b2d6afea8cea/google-cloud-dataflow-2.4.0.tar.gz
  Saved /tmp/tmpt5flm5/google-cloud-dataflow-2.4.0.tar.gz
Successfully downloaded google-cloud-dataflow
INFO:root:file copy from /tmp/tmpt5flm5/google-cloud-dataflow-2.4.0.tar.gz to gs://bucket/df/data/tft-top-100-20180514202806.1526329784.381043/dataflow_python_sdk.tar.
INFO:oauth2client.transport:Attempting refresh to obtain initial access_token
Traceback (most recent call last):
  File "preprocess/transform_v2.py", line 62, in <module>
    main()
  File "preprocess/transform_v2.py", line 57, in main
    transform_data(pipeline_options, known_args.input_dir, known_args.output_dir, top_features)
  File "preprocess/transform_v2.py", line 31, in transform_data
    | CreateSegmentDataset(segment_name, converter, output_dir, make_preprocessing_fn(top=top_features), RAW_DATA_METADATA))
  File "/home/user/npd/venv/local/lib/python2.7/site-packages/apache_beam/pipeline.py", line 389, in __exit__
    self.run().wait_until_finish()
  File "/home/user/npd/venv/local/lib/python2.7/site-packages/apache_beam/pipeline.py", line 369, in run
    self.to_runner_api(), self.runner, self._options).run(False)
  File "/home/user/npd/venv/local/lib/python2.7/site-packages/apache_beam/pipeline.py", line 382, in run
    return self.runner.run_pipeline(self)
  File "/home/user/npd/venv/local/lib/python2.7/site-packages/apache_beam/runners/dataflow/dataflow_runner.py", line 324, in run_pipeline
    self.dataflow_client.create_job(self.job), self)
  File "/home/user/npd/venv/local/lib/python2.7/site-packages/apache_beam/utils/retry.py", line 180, in wrapper
    return fun(*args, **kwargs)
  File "/home/user/npd/venv/local/lib/python2.7/site-packages/apache_beam/runners/dataflow/internal/apiclient.py", line 475, in create_job
    return self.submit_job_description(job)
  File "/home/user/npd/venv/local/lib/python2.7/site-packages/apache_beam/utils/retry.py", line 180, in wrapper
    return fun(*args, **kwargs)
  File "/home/user/npd/venv/local/lib/python2.7/site-packages/apache_beam/runners/dataflow/internal/apiclient.py", line 523, in submit_job_description
    response = self._client.projects_locations_jobs.Create(request)
  File "/home/user/npd/venv/local/lib/python2.7/site-packages/apache_beam/runners/dataflow/internal/clients/dataflow/dataflow_v1b3_client.py", line 643, in Create
    config, request, global_params=global_params)
  File "/home/user/npd/venv/local/lib/python2.7/site-packages/apitools/base/py/base_api.py", line 722, in _RunMethod
    return self.ProcessHttpResponse(method_config, http_response, request)
  File "/home/user/npd/venv/local/lib/python2.7/site-packages/apitools/base/py/base_api.py", line 728, in ProcessHttpResponse
    self.__ProcessHttpResponse(method_config, http_response, request))
  File "/home/user/npd/venv/local/lib/python2.7/site-packages/apitools/base/py/base_api.py", line 599, in __ProcessHttpResponse
    http_response, method_config=method_config, request=request)
apitools.base.py.exceptions.HttpBadRequestError: HttpError accessing <https://dataflow.googleapis.com/v1b3/projects/my-project/locations/us-central1/jobs?alt=json>: response: <{'status': '400', 'content-length': '229', 'x-xss-protection': '1; mode=block', 'x-content-type-options': 'nosniff', 'transfer-encoding': 'chunked', 'vary': 'Origin, X-Origin, Referer', 'server': 'ESF', '-content-encoding': 'gzip', 'cache-control': 'private', 'date': 'Mon, 14 May 2018 20:30:37 GMT', 'x-frame-options': 'SAMEORIGIN', 'content-type': 'application/json; charset=UTF-8'}>, content <{
  "error": {
    "code": 400,
    "message": "(73447f6f3d9a02de): The job graph is too large. Please try again with a smaller job graph, or split your job into two or more smaller jobs.",
    "status": "INVALID_ARGUMENT"
  }
}
@fabito changed the title from "Cannot process dataset with 800+ features: generated Dataflow job is too large" to "Cannot process dataset with 800+ features: Job graph is too large" on May 14, 2018
@KesterTong
Contributor

Is it possible to use a few multi-dimensional features instead of many single-dimensional features? This may greatly improve performance. If your features are somehow grouped by categories, this may also be a logical way to arrange your features.
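A minimal sketch of this workaround (not from the thread): stack the scalar columns into one wide tensor so a single analyzer covers all of them. The feature names below are hypothetical, and it assumes a tensorflow_transform version whose scale_to_z_score supports the elementwise argument.

import tensorflow as tf
import tensorflow_transform as tft

NUMERIC_KEYS = ['feat_a', 'feat_b', 'feat_c']  # hypothetical scalar feature names

def preprocessing_fn(inputs):
    outputs = {}
    # Stack the [batch] scalars into a single [batch, num_features] tensor.
    stacked = tf.stack([inputs[key] for key in NUMERIC_KEYS], axis=1)
    # One analyzer covers every column; elementwise=True keeps per-column statistics.
    outputs['numeric_scaled'] = tft.scale_to_z_score(stacked, elementwise=True)
    return outputs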

@Debasish-Das-CK
Contributor

I pushed 2,000 features through by translating 2,000 one-dimensional features into a single 2,000-dimensional feature.

@frarito

frarito commented Sep 17, 2018

Any update on this issue? I'm in a similar situation.

@KesterTong
Contributor

Are you able to use a similar workaround?

@debasish83

This workaround worked fine for us and actually makes training faster as well. The overall idea is to run tf-transform on a high-dimensional tensor for the numeric/categorical features, and at the end of the flow either unstack it or let the TF model fit on the high-dimensional tensor directly, as in image training.
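A rough sketch of the "unstack at the end" idea, under the same assumptions as the snippet above (hypothetical feature names, elementwise support in scale_to_z_score):

import tensorflow as tf
import tensorflow_transform as tft

FEATURE_KEYS = ['feat_a', 'feat_b', 'feat_c']  # hypothetical feature names

def preprocessing_fn(inputs):
    stacked = tf.stack([inputs[key] for key in FEATURE_KEYS], axis=1)
    scaled = tft.scale_to_z_score(stacked, elementwise=True)
    # Split the [batch, n] tensor back into n [batch] columns with the original
    # names, e.g. for models such as boosted trees that expect per-feature inputs.
    columns = tf.unstack(scaled, axis=1)
    return {key: col for key, col in zip(FEATURE_KEYS, columns)}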

@debasish83

I think tf-transform could fuse this automatically, but we haven't had time to push the fix in. Providing an unstacked view is also good for tf.boosted_trees, but I don't think we can make it generic.

@Harshini-Gadige

The team is working on this, and we encourage you to use the workaround provided above for now.
Will keep you posted.

@cyc

cyc commented May 9, 2019

Any updates or workarounds for this feature request? I am now hitting the "graph too large" limit, but unfortunately the transformations I use cannot be concatenated together to be applied element-wise.

@KesterTong
Contributor

Which transformations can't be concatenated? We are currently working on #110 and #62; will those be enough?

@cyc

cyc commented May 9, 2019

Yes, #110 would suffice for our needs. Just wondering if there are any workarounds for folks running into this issue with the current release.

@KesterTong
Contributor

No, there is no workaround for #110.

@RuhuaJiang

RuhuaJiang commented Feb 3, 2021

Any update on this? The workaround works for the case where the same transform applies to N features, so we can concat the N features first and then apply the transform once; basically we have to do tf.sparse.concat first in the preprocessing_fn. For the use case of doing per-feature processing first, the workaround is not going to work.

Another question: is there any rule-of-thumb guidance for avoiding "graph is too large", such as "the number of features should be less than X" or "a given transformation should appear fewer than M times"? Right now, as a user, I am in constant fear of the graph being too large; the only way to find out is to launch the job on Dataflow (and wait several minutes) and hope for the best. This experience is not great.
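For reference, a rough sketch of the tf.sparse.concat variant described above, assuming the categorical inputs arrive as SparseTensors that can share one vocabulary (feature names are hypothetical):

import tensorflow as tf
import tensorflow_transform as tft

SPARSE_KEYS = ['cat_a', 'cat_b']  # hypothetical sparse categorical feature names

def preprocessing_fn(inputs):
    # Concatenate the sparse columns along the feature axis so a single
    # vocabulary analyzer covers all of them.
    combined = tf.sparse.concat(axis=1, sp_inputs=[inputs[key] for key in SPARSE_KEYS])
    return {'categorical_ids': tft.compute_and_apply_vocabulary(combined)}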

@zoyahav
Member

zoyahav commented Feb 4, 2021

There have been several improvements since the original issue.
If you're still running into this issue, could you please open a separate issue and provide as many details as possible (a description of the input features, how many there are, what analysis is performed on them, the full stack trace of errors, etc.)?

Re: guidance on a rule of thumb for avoiding the "graph is too large" error, we can't provide one because (a) it depends on many factors that are specific to each user's inputs and preprocessing_fn, and (b) this is a Dataflow error that we can't really control.

@RuhuaJiang

Thanks @zoyahav

@UsharaniPagadala

@fabito

Could you please confirm whether this issue can be closed. Thanks!

@gaikwadrahul8

@fabito

Apologies for the delay. This seems to be a duplicate of issue #223, and developers there report that using runner_v1 with upload_graph solves the problem, as per this comment. Could you please try that workaround and let us know whether it resolves your issue?
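For reference, a hedged sketch of passing that experiment from Python; the project, region, and bucket names below are placeholders. As I understand it, upload_graph makes Dataflow stage the job graph to GCS rather than embedding it in the job creation request.

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',
    region='us-central1',
    temp_location='gs://my-bucket/tmp',
    experiments=['upload_graph'],
)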

Could you please confirm whether this issue is resolved for you, and feel free to close it if so. If the issue still persists, please share the error log so we can investigate the root cause.

Thank you!

@gaikwadrahul8

Hi, @fabito

Closing this issue due to a lack of recent activity over the past couple of weeks. Please feel free to reopen the issue or post a comment if you need any further assistance or updates.

Thank you!
