Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add path exist, is directory, stat #41282

Merged
merged 1 commit into from
Jul 13, 2020

Conversation

vnghia
Copy link
Contributor

@vnghia vnghia commented Jul 10, 2020

@mihaimaruseac
As I said, this PR can not mimic the current implementation because google-cloud-cpp is missing one feature. I am working on it googleapis/google-cloud-cpp#4494. We will have to wait but it won't take too much time. After that PR is merged, I will do a refactor with the gcs code

@google-ml-butler google-ml-butler bot added the size:M CL Change Size: Medium label Jul 10, 2020
@gbaned gbaned self-assigned this Jul 10, 2020
@gbaned gbaned added this to Assigned Reviewer in PR Queue via automation Jul 10, 2020
Comment on lines +570 to +579
for (auto&& metadata :
gcs_file->gcs_client.ListObjects(bucket, gcs::Prefix(object))) {
if (!metadata) {
TF_SetStatusFromGCSStatus(metadata.status(), status);
return;
}
// We consider a path exists if there is at least one object whose key
// contains the path.
return TF_SetStatus(status, TF_OK, "");
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This loop runs at most only one time. Is this intended?

Copy link
Contributor Author

@vnghia vnghia Jul 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. As I comment, if there is at least one object satisfies the requirement, the path exists.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, if path = "/path/to/dir/" and there is an object "path/to/dir/file", we consider that this path exists. gcs::Prefix will return the prefix equal to that path ( i.e object name is path+name )

PR Queue automation moved this from Assigned Reviewer to Approved by Reviewer Jul 10, 2020
@google-ml-butler google-ml-butler bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Jul 10, 2020
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Jul 10, 2020
@vnghia
Copy link
Contributor Author

vnghia commented Jul 12, 2020

@mihaimaruseac
I just found a bug in the current gcs filesystem. It is not so common but it is one reason to reconsider the concept of directory. Here is my bucket on GCS server.
image
windows is a file and windows/ is a directory ( it is allowed on cloud filesystem ).

Here is my test with tensorflow 2.2.0
image

What do you think ?

@tensorflow-copybara tensorflow-copybara merged commit 92658e6 into tensorflow:master Jul 13, 2020
PR Queue automation moved this from Approved by Reviewer to Merged Jul 13, 2020
@vnghia vnghia deleted the gcs-stat branch July 13, 2020 05:07
@mihaimaruseac
Copy link
Collaborator

I think that is a bug in tf.io.gfile.isdir due to the change from file_exists to path_exists?

@vnghia
Copy link
Contributor Author

vnghia commented Jul 13, 2020

I think no. I am testing with prebuilt tensorflow 2.2.0 so there is nothing releated to path_exists ?

@mihaimaruseac
Copy link
Collaborator

Oh, it's a preexisting bug. I would expect that tf.io.gfile.isdir(x) should always return the same boolean as tf.io.gfile.stat(x).is_directory, assuming no errors.

@vnghia
Copy link
Contributor Author

vnghia commented Jul 13, 2020

So I would like to ask you which value do you expect in the returned result in this case ? I think it should be false even there is a directory because if it is true, we will lose the access to windows object.

@mihaimaruseac
Copy link
Collaborator

In general we should preserve the invariant from above as otherwise there would be subtle to debug issues, even security related ones.

In this particular case when we have both a file and a directory with the same name, we can given precedence to directory (as it usually contains more data) and not allowing creation of files/directories if the other one exists (from TF, outside TF user that creates both is responsible for the mess)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes ready to pull PR ready for merge process size:M CL Change Size: Medium
Projects
PR Queue
  
Merged
Development

Successfully merging this pull request may close these issues.

None yet

6 participants