Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nntplib cleanup #53606

Closed
DmitryJemerov mannequin opened this issue Jul 23, 2010 · 19 comments
Closed

nntplib cleanup #53606

DmitryJemerov mannequin opened this issue Jul 23, 2010 · 19 comments
Labels
stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@DmitryJemerov
Copy link
Mannequin

DmitryJemerov mannequin commented Jul 23, 2010

BPO 9360
Nosy @ncoghlan, @pitrou, @giampaolo, @bitdancer
Files
  • nntplib_cleanup.patch
  • nntplib_cleanup2.patch
  • nntplib_cleanup3.patch
  • nntplib_cleanup4.patch
  • nntplib_cleanup5.patch
  • nntplib_cleanup6.patch
  • nntplib_cleanup7.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2010-09-29.15:05:01.245>
    created_at = <Date 2010-07-23.16:54:15.762>
    labels = ['type-feature', 'library']
    title = 'nntplib cleanup'
    updated_at = <Date 2010-09-29.15:05:01.244>
    user = 'https://bugs.python.org/DmitryJemerov'

    bugs.python.org fields:

    activity = <Date 2010-09-29.15:05:01.244>
    actor = 'pitrou'
    assignee = 'none'
    closed = True
    closed_date = <Date 2010-09-29.15:05:01.245>
    closer = 'pitrou'
    components = ['Library (Lib)']
    creation = <Date 2010-07-23.16:54:15.762>
    creator = 'Dmitry.Jemerov'
    dependencies = []
    files = ['18159', '18872', '18882', '18894', '18931', '19002', '19011']
    hgrepos = []
    issue_num = 9360
    keywords = ['patch']
    message_count = 19.0
    messages = ['111364', '111369', '111383', '111404', '115794', '116342', '116349', '116400', '116878', '116884', '116886', '116903', '116904', '116959', '117016', '117208', '117331', '117377', '117618']
    nosy_count = 5.0
    nosy_names = ['ncoghlan', 'pitrou', 'giampaolo.rodola', 'r.david.murray', 'Dmitry.Jemerov']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue9360'
    versions = ['Python 3.2']

    @DmitryJemerov
    Copy link
    Mannequin Author

    DmitryJemerov mannequin commented Jul 23, 2010

    The patch performs an extensive cleanup of nntplib:

    • Change API methods to return strings instead of bytes. This breaks API compatibility, but given that the parameters need to be passed as strings and many of the returned values would need to be passed to other API methods, I consider the current API to be broken. I've discussed this with Brett at the EuroPython sprint, and he agrees.
    • Add tests.
    • Add pending deprecation warnings for xgtitle() and xpath() methods, which are not useful in modern environments.
    • Use named tuples for returned values where appopriate.
    • Modernize the implementation a little bit.
    • Clean up the docstrings.

    @DmitryJemerov DmitryJemerov mannequin added the stdlib Python modules in the Lib dir label Jul 23, 2010
    @bitdancer
    Copy link
    Member

    bitdancer commented Jul 23, 2010

    I haven't looked at the patch in detail, but if my quick skim is correct, it looks like you are decoding using latin1. If you don't provide any way to get at the original bytes (which I presume you don't if you have "converted the API to return strings"), then it will be difficult and non-obvious to correctly decode messages using a content transfer encoding of 8bit and some other character set. (Of course, it is currently impossible to do this using the Py3k email package either, but I'm working on fixing that).

    @DmitryJemerov
    Copy link
    Mannequin Author

    DmitryJemerov mannequin commented Jul 23, 2010

    This is an issue only for the actual article content, right? I'll be happy to extend the API to allow getting the original bytes of the article content, while keeping the rest of API (like group names) string-based.

    @bitdancer
    Copy link
    Member

    bitdancer commented Jul 24, 2010

    Correct, if by 'article content' you mean what is returned by the article command. So I think it is necessary to be able to get bytes for two commands: article and body. Then for symmetry it should also be possible to get bytes from the head command. (It is actually possible for headers to include 8bit data, but it is a violation of the RFC and both rare outside of spam and impossible to handle sensibly in most cases...and it may be the case that most if not all nttpservers sanitize such headers, I don't have any experience in that area.)

    On the other hand, this tells me that a possible use case I had been wondering about in email is in fact a valid concern: being able to parse a data stream that contains both strings and bytes :( That is, one might use nntplib in a mode where you pull the headers as text and the body as bytes and want to hand it off to email to build a Message object from those parts.

    @bitdancer bitdancer added the type-feature A feature request or enhancement label Jul 24, 2010
    @pitrou
    Copy link
    Member

    pitrou commented Sep 7, 2010

    Note that according to RFC 3977, “The character set for all NNTP commands is UTF-8”.

    But it also says this about multi-line data blocks:

    Note that texts using an encoding (such as UTF-16 or UTF-32) that may
    contain the octets NUL, LF, or CR other than a CRLF pair cannot be
    reliably conveyed in the above format (that is, they violate the MUST
    requirement above). However, except when stated otherwise, this
    specification does not require the content to be UTF-8, and therefore
    (subject to that same requirement) it MAY include octets above and
    below 128 mixed arbitrarily.

    IMO, it should decode/encode by default using utf-8 (with the "surrogateescape" error handler for easy round-tripping with non-compliant servers), except for raw articles (bodies / envelopes) where bytes should be returned.

    @pitrou
    Copy link
    Member

    pitrou commented Sep 13, 2010

    Here is an updated patch, but it's a work in progress.

    Since we are breaking compatibility anyway, I think a larger cleanup is deserved. For example:

    • remove old exception aliases
    • make return types consistent (for example, newgroups() should returned structured data as list() does)
    • use datetime.datetime objects instead of consuming and producing dates and times as 6-character strings

    Additional useful features could be added:

    • automatic querying of CAPABILITIES on connection
    • a higher-level over() method able to parse the response as a list of header dicts (using LIST OVERVIEW.FMT), with appropriately decoded values (thanks email.header.decode_header())
    • ...

    Also, we should add an internal mock NNTP server to easily test features. Relying on gmane is nice for simple basic tests, but not much more.

    Does it sound reasonable?

    @bitdancer
    Copy link
    Member

    bitdancer commented Sep 13, 2010

    Assuming we can break backward compatibility, it sounds fine to me.

    @pitrou
    Copy link
    Member

    pitrou commented Sep 14, 2010

    Here is a further, still work-in-progress, patch. I'm posting it here so that interested people can give advice.

    @pitrou
    Copy link
    Member

    pitrou commented Sep 19, 2010

    Here is the current state of the cleanup work. Coding is roughly finished; most methods are unit-tested. The documentation remains to be done. Review welcome.

    @ncoghlan
    Copy link
    Contributor

    ncoghlan commented Sep 19, 2010

    To make the distinction easier to remember, would it help if the methods that are currently set to return bytes instead accepted the typical encoding+errors parameters, with parallel *b APIs to get at the raw bytes?

    My concern with the current API is that there isn't a clear indicator during normal programming as to which APIs return strings and which return the raw bytes and hence require further decoding.

    @pitrou
    Copy link
    Member

    pitrou commented Sep 19, 2010

    To make the distinction easier to remember, would it help if the
    methods that are currently set to return bytes instead accepted the
    typical encoding+errors parameters, with parallel *b APIs to get at
    the raw bytes?

    Not really, no. For raw messages, which encoding+errors must be used
    depends on the returned contents, it's not something the client can know
    up front; moreover, different parts of the returned bytes may need
    decoding using different encodings (for example if there are several
    MIME parts to the message). People should use the email package to parse
    the raw messages, as I assume they already do in 2.x.

    Apart from raw message bodies, NNTP data has well-defined encodings and
    that's why I can take and return unicode (although as stated, I also use
    surrogateescape to be fault-tolerant in the face of broken servers).

    My concern with the current API is that there isn't a clear indicator
    during normal programming as to which APIs return strings and which
    return the raw bytes and hence require further decoding.

    That's a documentation issue. I haven't touched the docs yet :)

    @bitdancer
    Copy link
    Member

    bitdancer commented Sep 20, 2010

    I tested this against my existing py3k nttp client code.

    Why is it a good thing to make file a keyword only parameter? But given that it is, line 720 needs to use it as such, which omission means your missing at least one test :)

    On line 193 you index fmt, and *then* you check the length. When the number of tokens is too long, an IndexError is raised. (Note that the offending overview line comes from the gmane group comp.lang.python.mime.devel, and the offense is an extra '' field on one of the records. No idea why it got added to that one record. Looks like it is message 701, artid <4111BBA9.3040302@harvee.org>)

    Could the 'date' field in the xover headers also be a DateTime rather than a string? And :bytes and :lines be ints? Or is that being to DWIMish? If the date field isn't returned as a datetime, though, should there be a helper method for converting it, or should we just assume that email.utils mktime_tz and parsedate_tz?

    Am I correct that the purpose of _NNTPBase is to make testing easier?

    There seem to be only three lines of code in common between the file and the non-file case in _getlongresp. I think it would be clearer to make them two different routines, or at least move the three common lines to the top and then do an if test on file is None (that is, put the simpler, non-file case first).

    My little nttp client script doesn't really test very much of the nttplib interface, nor is it very complex. The change to xover considerably simplified that part of my script, so I very much like that change. I was also able to drop my implementation of decode_header. So overall this patch is significant improvement, IMO.

    I haven't given the code as thorough a review as might be optimal, but I certainly think you are headed in the right direction.

    @bitdancer
    Copy link
    Member

    bitdancer commented Sep 20, 2010

    Gah. I reviewed patch4, didn't noticed today's update :(

    @pitrou
    Copy link
    Member

    pitrou commented Sep 20, 2010

    Gah. I reviewed patch4, didn't noticed today's update :(

    Oops, I'm sorry for that. It turns out most of your comments are useful anyway.

    Why is it a good thing to make file a keyword only parameter?

    I was thinking that if we want to add other function parameters, it will
    be possible to do so while keeping the (mnemonically useful) convention
    that the file parameter is last.

    You are right that testing that parameter is still on the TODO list...

    On line 193 you index fmt, and *then* you check the length. When the
    number of tokens is too long, an IndexError is raised.

    Yes, this is fixed in patch5 :)

    Could the 'date' field in the xover headers also be a DateTime rather
    than a string? And :bytes and :lines be ints? Or is that being to
    DWIMish?

    We could indeed parse the xover/over fields, but the issue is that there
    can be extra fields which therefore won't get parsed (or not all of them
    if we decide to recognize some e.g. "Xref").

    Also, people could want to get at the non-parsed representation
    (especially for the date field), which means we would need two APIs with
    two different levels of abstraction.

    If the date field isn't returned as a datetime, though, should there
    be a helper method for converting it, or should we just assume that
    email.utils mktime_tz and parsedate_tz?

    parsedate or parsedate_tz is the right thing to use indeed (dates follow
    the same format as in e-mails). As for the from and subject fields, the
    provided decode_header() is very recommended (as the __main__ block
    shows).

    Am I correct that the purpose of _NNTPBase is to make testing easier?

    Yes.

    There seem to be only three lines of code in common between the file
    and the non-file case in _getlongresp. I think it would be clearer to
    make them two different routines, or at least move the three common
    lines to the top and then do an if test on file is None (that is, put
    the simpler, non-file case first).

    I'll take a look.

    My little nttp client script doesn't really test very much of the
    nttplib interface, nor is it very complex. The change to xover
    considerably simplified that part of my script, so I very much like
    that change. I was also able to drop my implementation of
    decode_header. So overall this patch is significant improvement, IMO.

    Thank you :)

    @bitdancer
    Copy link
    Member

    bitdancer commented Sep 21, 2010

    OK. It doesn't seem all that likely that we'll be adding parameters, but you never know. So I'm OK with file becoming keyword only.

    As for the overview headers...if they aren't decoded through decode_header already, why are they unicode? If they haven't been decoded, I'd expect them to be bytes. Or does the standard really say that these headers, extracted from the bytes message data, will be valid utf-8? (I realize that as things stand now with email5 you'll in fact *have* to decode them in order to process them with decode_header, but that doesn't change the fact that if they are unicode I would expect that they'd be *fully* decoded.)

    But I take your point about the datatypes otherwise. Also, when we get to email6 all headers should be turned into Header objects, and there will be a way to get a datetime out of a Date Header object.

    @pitrou
    Copy link
    Member

    pitrou commented Sep 23, 2010

    Yes, unescaped utf-8 is explicitly allowed in headers by RFC 3977:

    The content of a header SHOULD be in UTF-8. However, if an
    implementation receives an article from elsewhere that uses octets in
    the range 128 to 255 in some other manner, it MAY pass it to a client
    or server without modification. Therefore, implementations MUST be
    prepared to receive such headers, and data derived from them (e.g.,
    in the responses from the OVER command, Section 8.3), and MUST NOT
    assume that they are always UTF-8.

    (I guess this means NNTP is now more modern than SMTP :-)).

    Actually, there's a test with an actual utf-8 header in the unit tests.
    (as for “implementations MUST be prepared to receive such headers, and data derived from them”, this is accounted for by using surrogateescape).

    @pitrou
    Copy link
    Member

    pitrou commented Sep 24, 2010

    Here is a patch adding tests for article(), head() and body(), as well as for the "file" parameter.
    As far as code is concerned, this should now be complete. Documentation remains to be updated :)

    @pitrou
    Copy link
    Member

    pitrou commented Sep 25, 2010

    And here is a patch integrating doc fixes and updates.

    @pitrou
    Copy link
    Member

    pitrou commented Sep 29, 2010

    I've committed the latest patch in r85111.

    @pitrou pitrou closed this as completed Sep 29, 2010
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    stdlib Python modules in the Lib dir type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants