Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrong error code from Xrootd bridge for file exists #1167

Closed
bbockelm opened this issue Mar 28, 2020 · 13 comments
Closed

Wrong error code from Xrootd bridge for file exists #1167

bbockelm opened this issue Mar 28, 2020 · 13 comments
Assignees

Comments

@bbockelm
Copy link
Contributor

When a file exists, the cmsd returns the wrong error code (or maybe just doesn't provide a distinct error code?) resulting in the wrong xrootd error code.

See the following example (note PID 2072999 is the cmsd and 2075105 is the xrootd):

200328 09:18:45 2072999 unknown.1:22@example do_Select: failed; Unable to create new file; file already exists. /prefix/test_folder4
200328 09:18:45 2075105 Receive ???????? 52 bytes on 3071
200328 09:18:45 2075105 Decode example gave unknown.1:22@example err -2 'Unable to create new file; file already exists.' /prefix/test_folder4
200328 09:18:45 2075096 unknown.1:22@example XrootdResponse: sending err 3011: Unable to create new file; file already exists.

I think what happens is the -2 error code in the cmsd response is interpreted as a Unix errno by the xrootd daemon, which corresponds to ENOENT and then mapped to error code 3011.

The wrong response (3011 is kXR_NotFound or file not found) is causing the HTTP protocol to return the wrong response and clients to take the wrong action.

The right fix isn't readily apparent -- it looks like giving the "right" error code from the cmsd would be a large-scale protocol break, which I'm not confident in doing immediately. May have to do a short-term workaround based on the error message in the meantime.

bbockelm added a commit to bbockelm/xrootd that referenced this issue Mar 28, 2020
This avoids the issue in xrootd#1167, which causes the wrong error code
to be given for a known error text.  This simply looks for the known
error text and corrects the HTTP status code.
@abh3
Copy link
Member

abh3 commented Mar 28, 2020

I left a comment in the pull request but I will post here as well.

I looked, indeed, the cmsd does not differentiate the between ENOENT and EEXISTS.
That can be fixed. I don't particularly like relying on message text as that tends
to change over time as people complain that the message isn't sufficiently
descriptive.

The correct error code should have been returned by the cmsd and, in fact, it's supposed to do that. I agree that it isn't obvious because error code handling becomes complicated when the cmsd attempts a retry and gets yet another error. However, in this case it's pretty straightforward as a retry is not possible. So, it's not too complicated to fix this.

That does that change your opinion on the fix?

@abh3 abh3 self-assigned this Mar 28, 2020
@bbockelm
Copy link
Contributor Author

I spent about an hour tinkering with different approaches to fixing the return code from the cmsd. However, it was quickly becoming a wholesale rewrite of all return codes between the server and client - it kept snowballing into pieces of the codebase I wasn't familiar with.

So, I agree with the idea that we really should fix the problem at the root.

That said - this is somewhat critical / urgent because it causes problems with the gfal2 recursive directory creation and causes failures in the corresponding FTS transfers. gfal2 will try a mkdir on each directory path, starting with the file's destination directory and walking up the tree until the return code switches from "file not found" to "file exists". Because Xrootd always returns ENOENT, gfal2 keeps walking up the tree until it hits a top-level directory (or ultimately /) and gets a permission error.

Given doing a reasonable fix looked like a large-scale project (which are best not rushed) -- and needing a short-term fix for 4.11.x -- led me to do the small patch based on error message. I sure hope we are long done with tinkering with error message texts for 4.11.x!

@abh3
Copy link
Member

abh3 commented Mar 28, 2020

OK, sounds reasonable. We will have to remember to remove this, I suppose (though we will likely forget). My take is that you want this back ported to 4.11.x but we have already released it so it would look like 4.11.4 which we don't mind doing. Of course, that won't be as quick as I think you expect, given the circumstances.

@abh3
Copy link
Member

abh3 commented Mar 28, 2020

One more thing, EEXISTS is now mapped to kXR_InvalidRequest which is used to indicate what is meant (e.g. duplicate logins, bypassing authentication, bad request code). So, dumping EEXISTS in that mess is awkward but not devastating. I suppose you'd want a separate error code for that when we fix this in R5, right?

@bbockelm
Copy link
Contributor Author

That'd be a reasonable thing to do.

That said, in XrdHTTP, we currently decipher the meaning based on the context -- if we've successfully logged in to the bridge and get a kXR_InvalidRequest in response to a specific command, we assume it is really EEXIST.

@abh3
Copy link
Member

abh3 commented Mar 29, 2020

So, maybe I'll leave it that way. Anyway, how come gfal2 doesn't use the mkpath option of mkdir. That way it would all just get done in a single try. Is that possible to do with WebDav?

@bbockelm
Copy link
Contributor Author

No - to the best of my knowledge, there's nothing equivalent to the mkpath flag for WebDAV. I suspect they don't use it for the xrootd protocol because they don't expose the flag in the plugin interface.

@abh3
Copy link
Member

abh3 commented Mar 30, 2020

Well, that can be corrected since the gfal person is also the xroot person, funny that. I'll bring it to their attention. Anyway, it won't be difficult to return the correct error code after all. I'm working on it now. It also gives me a chance to cleanup up some ugly code.

@abh3
Copy link
Member

abh3 commented Mar 30, 2020

OK, the pull request you have will work in all releases prior to R5. In R5 the text changes (which is why text scraping is bad) to one of:
Unable to create directory; directory already exists.
Unable to create new file; file already exists.

So, the line:
if (httpStatusText == "Unable to create new file; file already exists.\n")
won't work. Since you have std:string here why not use:
if (httpStatusText.find("already exists."))

That way you will be immune in case the code doesn't change for R5 to take advantage of the fact the the right error code (kXR_InvalidRequest) will be returned.

I changed the text once I realized it was completely wrong for directories.

bbockelm added a commit to bbockelm/xrootd that referenced this issue Mar 30, 2020
This avoids the issue in xrootd#1167, which causes the wrong error code
to be given for a known error text.  This simply looks for the known
error text and corrects the HTTP status code.
@bbockelm
Copy link
Contributor Author

@abh3 - since I'm a little worried about what other error responses include the phrase already exists, I simply enumerated the two known cases. Rebased and retested it against the latest master.

That said, after your latest correction of the correct error code, this patch is superfluous in R5. I'm going to close the PR and redo it against the stable branch.

bbockelm added a commit to bbockelm/xrootd that referenced this issue Mar 30, 2020
This avoids the issue in xrootd#1167, which causes the wrong error code
to be given for a known error text.  This simply looks for the known
error text and corrects the HTTP status code.
@abh3
Copy link
Member

abh3 commented Mar 30, 2020 via email

@bbockelm
Copy link
Contributor Author

It does! That was the work in #1164. This ticket is a follow-up because #1164 only fixed the case for a data server and not the redirector.

@abh3
Copy link
Member

abh3 commented May 18, 2020

This has been corrected, though it doesn't seem the http code tests for this particular error code. However, the way it's implemented it sort of works anyway.

@abh3 abh3 closed this as completed May 18, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants