-
-
Notifications
You must be signed in to change notification settings - Fork 31.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
nntplib cleanup #53606
Comments
The patch performs an extensive cleanup of nntplib:
|
I haven't looked at the patch in detail, but if my quick skim is correct, it looks like you are decoding using latin1. If you don't provide any way to get at the original bytes (which I presume you don't if you have "converted the API to return strings"), then it will be difficult and non-obvious to correctly decode messages using a content transfer encoding of 8bit and some other character set. (Of course, it is currently impossible to do this using the Py3k email package either, but I'm working on fixing that). |
This is an issue only for the actual article content, right? I'll be happy to extend the API to allow getting the original bytes of the article content, while keeping the rest of API (like group names) string-based. |
Correct, if by 'article content' you mean what is returned by the article command. So I think it is necessary to be able to get bytes for two commands: article and body. Then for symmetry it should also be possible to get bytes from the head command. (It is actually possible for headers to include 8bit data, but it is a violation of the RFC and both rare outside of spam and impossible to handle sensibly in most cases...and it may be the case that most if not all nttpservers sanitize such headers, I don't have any experience in that area.) On the other hand, this tells me that a possible use case I had been wondering about in email is in fact a valid concern: being able to parse a data stream that contains both strings and bytes :( That is, one might use nntplib in a mode where you pull the headers as text and the body as bytes and want to hand it off to email to build a Message object from those parts. |
Note that according to RFC 3977, “The character set for all NNTP commands is UTF-8”. But it also says this about multi-line data blocks: Note that texts using an encoding (such as UTF-16 or UTF-32) that may IMO, it should decode/encode by default using utf-8 (with the "surrogateescape" error handler for easy round-tripping with non-compliant servers), except for raw articles (bodies / envelopes) where bytes should be returned. |
Here is an updated patch, but it's a work in progress. Since we are breaking compatibility anyway, I think a larger cleanup is deserved. For example:
Additional useful features could be added:
Also, we should add an internal mock NNTP server to easily test features. Relying on gmane is nice for simple basic tests, but not much more. Does it sound reasonable? |
Assuming we can break backward compatibility, it sounds fine to me. |
Here is a further, still work-in-progress, patch. I'm posting it here so that interested people can give advice. |
Here is the current state of the cleanup work. Coding is roughly finished; most methods are unit-tested. The documentation remains to be done. Review welcome. |
To make the distinction easier to remember, would it help if the methods that are currently set to return bytes instead accepted the typical encoding+errors parameters, with parallel *b APIs to get at the raw bytes? My concern with the current API is that there isn't a clear indicator during normal programming as to which APIs return strings and which return the raw bytes and hence require further decoding. |
Not really, no. For raw messages, which encoding+errors must be used Apart from raw message bodies, NNTP data has well-defined encodings and
That's a documentation issue. I haven't touched the docs yet :) |
I tested this against my existing py3k nttp client code. Why is it a good thing to make file a keyword only parameter? But given that it is, line 720 needs to use it as such, which omission means your missing at least one test :) On line 193 you index fmt, and *then* you check the length. When the number of tokens is too long, an IndexError is raised. (Note that the offending overview line comes from the gmane group comp.lang.python.mime.devel, and the offense is an extra '' field on one of the records. No idea why it got added to that one record. Looks like it is message 701, artid <4111BBA9.3040302@harvee.org>) Could the 'date' field in the xover headers also be a DateTime rather than a string? And :bytes and :lines be ints? Or is that being to DWIMish? If the date field isn't returned as a datetime, though, should there be a helper method for converting it, or should we just assume that email.utils mktime_tz and parsedate_tz? Am I correct that the purpose of _NNTPBase is to make testing easier? There seem to be only three lines of code in common between the file and the non-file case in _getlongresp. I think it would be clearer to make them two different routines, or at least move the three common lines to the top and then do an if test on file is None (that is, put the simpler, non-file case first). My little nttp client script doesn't really test very much of the nttplib interface, nor is it very complex. The change to xover considerably simplified that part of my script, so I very much like that change. I was also able to drop my implementation of decode_header. So overall this patch is significant improvement, IMO. I haven't given the code as thorough a review as might be optimal, but I certainly think you are headed in the right direction. |
Gah. I reviewed patch4, didn't noticed today's update :( |
Oops, I'm sorry for that. It turns out most of your comments are useful anyway.
I was thinking that if we want to add other function parameters, it will You are right that testing that parameter is still on the TODO list...
Yes, this is fixed in patch5 :)
We could indeed parse the xover/over fields, but the issue is that there Also, people could want to get at the non-parsed representation
parsedate or parsedate_tz is the right thing to use indeed (dates follow
Yes.
I'll take a look.
Thank you :) |
OK. It doesn't seem all that likely that we'll be adding parameters, but you never know. So I'm OK with file becoming keyword only. As for the overview headers...if they aren't decoded through decode_header already, why are they unicode? If they haven't been decoded, I'd expect them to be bytes. Or does the standard really say that these headers, extracted from the bytes message data, will be valid utf-8? (I realize that as things stand now with email5 you'll in fact *have* to decode them in order to process them with decode_header, but that doesn't change the fact that if they are unicode I would expect that they'd be *fully* decoded.) But I take your point about the datatypes otherwise. Also, when we get to email6 all headers should be turned into Header objects, and there will be a way to get a datetime out of a Date Header object. |
Yes, unescaped utf-8 is explicitly allowed in headers by RFC 3977: The content of a header SHOULD be in UTF-8. However, if an (I guess this means NNTP is now more modern than SMTP :-)). Actually, there's a test with an actual utf-8 header in the unit tests. |
Here is a patch adding tests for article(), head() and body(), as well as for the "file" parameter. |
And here is a patch integrating doc fixes and updates. |
I've committed the latest patch in r85111. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: