How to use TF Hub on a distributed setting? #48
Comments
The issue seems to be that the module is cached in a directory that is not accessible to all the machines that initialize variables. Can you try to specify a location for module caching that all the jobs can read from? You can look at https://www.tensorflow.org/hub/basics section "Caching Modules" for instructions. |
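A minimal sketch of the caching suggestion above, assuming the cache location is set through the `TFHUB_CACHE_DIR` environment variable before the first `hub.Module(...)` call. The helper name and the shared path are hypothetical; any location that every job can read would do.

```python
import os


def configure_hub_cache(cache_dir):
    """Point TF Hub at a cache directory that every job can read.

    Call this before the first hub.Module(...) is constructed.
    """
    os.environ["TFHUB_CACHE_DIR"] = cache_dir
    return os.environ["TFHUB_CACHE_DIR"]


# Hypothetical shared location; replace with a path visible to all workers.
configure_hub_cache("/mnt/shared/tfhub_modules")
```

Setting the variable in the job's environment rather than in code should have the same effect, as long as it is set on every machine that initializes variables.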
@svsgoogle thanks! Would a GS bucket be a valid location? |
It should be :) |
It should not :) #50 |
I have the same issue with Google ML Engine. In my opinion, the issue is caused by I have made a repository to reproduce the issue. We are also discussing it on the Google issue tracker. |
For those interested, the issue can be solved by saving the module into GCP. You can do it with the following steps, on some local machine:
1. [optional] Clear all modules: `rm -fr /tmp/tfhub_modules`
2. Run some code that downloads the module: `hub.Module(...)`
3. Upload the module to some GCP bucket:
gsutil -m cp -R /tmp/tfhub_modules/{module_hash} gs://bucket/some/path/to/module
Now in your code you can pass the location of the bucket to TF Hub:
hub.Module("gs://bucket/some/path/to/module") |
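The `{module_hash}` placeholder in the upload step has to be looked up in the local cache. The sketch below is one way to do that, assuming the cache holds one directory per downloaded module. Both helper names are hypothetical, and the second one only builds the `gsutil` argument list without running it.

```python
import os


def cached_module_dirs(cache_dir="/tmp/tfhub_modules"):
    """Return the names of module directories found in a local TF Hub cache."""
    if not os.path.isdir(cache_dir):
        return []
    return sorted(
        name
        for name in os.listdir(cache_dir)
        if os.path.isdir(os.path.join(cache_dir, name))
    )


def upload_command(cache_dir, module_hash, destination):
    """Build the gsutil copy command for one cached module as an argument list."""
    source = os.path.join(cache_dir, module_hash)
    return ["gsutil", "-m", "cp", "-R", source, destination]
```

The returned list can be passed to `subprocess.run(...)` on a machine where `gsutil` is installed and authenticated.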
This was fixed in 6860621, but users will have to wait for a new pypi release + picked up by cloud. |
I suppose.. is it working also on s3? |
It should, though I haven't explicitly tested on S3, so please let us know if you see any issues.
|
Isn't 6860621#diff-781a53e648f3df8d16a08ec083b04bf4R18 a little bit incorrect? We are using TensorFlow file://, as I see in 6860621#diff-cd783e402e7064fee42578c5b35d1c3c. |
What error are you observing with the fix in question? The test uses file:// because we cannot use GCS or S3 in a unit test, but in offline testing, we did test gs://. The breakage that was addressed by the fix applies to all custom filesystems in TF (file://, gs://, etc.).
|
The issue is that it claims GCS in the changelog, but file:// is not GCS only. |
Yep, we will be more clear with the cl description next time ))
|
This approach does not work for me. Did you manage to make it run? I have version 0.1.1. |
What error are you seeing?
|
I'm trying to do the caching through gs://, but the model gets stuck at loading the module, as in #50. |
One option for further debugging is to try to run a small test program that reads from and writes to the GCS bucket in question using tf.gfile (https://www.tensorflow.org/api_docs/python/tf/gfile/GFile). This will make sure that the bucket and the local machine are set up correctly.
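A small probe along those lines could look like the sketch below. It is only an illustration: `check_roundtrip` is a hypothetical helper, and the file-opening function is passed in so the same code can be tried against a local path with the built-in `open` before pointing it at a bucket.

```python
def check_roundtrip(path, open_fn, payload="tfhub-probe"):
    """Write a short string to `path`, read it back, and report whether it matches.

    `open_fn` is any callable with the open(path, mode) interface,
    for example the built-in open or TensorFlow's GFile class.
    """
    with open_fn(path, "w") as f:
        f.write(payload)
    with open_fn(path, "r") as f:
        return f.read() == payload
```

With TensorFlow installed, the GCS check would be something like `check_roundtrip("gs://bucket/some/path/probe.txt", tf.gfile.GFile)` on TF 1.x, or `tf.io.gfile.GFile` on newer releases. The bucket path here is a placeholder. A failure or a hang at this step would suggest that the problem lies in the bucket permissions or machine setup rather than in TF Hub.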
|
EDITED: It actually works. I just added the inner module hash to the gs:// path. I mean this:
hub.Module("gs://bucket/some/path/to/module/{module_hash}") |
Automatically closing due to lack of recent activity. Please update the issue when new information becomes available, and we will reopen the issue. Thanks! |
I want to use the ResNet-101-v2 feature vectors to do some transfer learning. I am training with the Estimator API on GCP, and I call the hub Module at the beginning of the model_fn. When I run on a single node ("basic-gpu") all is well; however, when I run the same code in distributed mode ("standard-1") I get this error:
How should I structure my code for TF Hub to work with the Estimator API for distributed training?