Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xml.sax and xml.dom fetch DTDs by default #61520

Closed
rsandwick3 mannequin opened this issue Feb 28, 2013 · 3 comments
Closed

xml.sax and xml.dom fetch DTDs by default #61520

rsandwick3 mannequin opened this issue Feb 28, 2013 · 3 comments
Labels
expert-XML performance Performance or resource usage

Comments

@rsandwick3
Copy link
Mannequin

rsandwick3 mannequin commented Feb 28, 2013

BPO 17318
Nosy @tiran, @mcepl, @bitdancer
Superseder
  • bpo-17239: XML vulnerabilities in Python
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2021-10-20.12:52:47.908>
    created_at = <Date 2013-02-28.00:50:37.950>
    labels = ['expert-XML', 'performance']
    title = 'xml.sax and xml.dom fetch DTDs by default'
    updated_at = <Date 2021-10-20.12:52:47.907>
    user = 'https://bugs.python.org/rsandwick3'

    bugs.python.org fields:

    activity = <Date 2021-10-20.12:52:47.907>
    actor = 'christian.heimes'
    assignee = 'none'
    closed = True
    closed_date = <Date 2021-10-20.12:52:47.908>
    closer = 'christian.heimes'
    components = ['XML']
    creation = <Date 2013-02-28.00:50:37.950>
    creator = 'rsandwick3'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 17318
    keywords = []
    message_count = 3.0
    messages = ['183193', '183200', '404444']
    nosy_count = 4.0
    nosy_names = ['christian.heimes', 'mcepl', 'r.david.murray', 'rsandwick3']
    pr_nums = []
    priority = 'normal'
    resolution = 'duplicate'
    stage = 'resolved'
    status = 'closed'
    superseder = '17239'
    type = 'resource usage'
    url = 'https://bugs.python.org/issue17318'
    versions = ['Python 2.6', 'Python 2.7']

    @rsandwick3
    Copy link
    Mannequin Author

    rsandwick3 mannequin commented Feb 28, 2013

    Note that URIs in the following are only meant as links when in parentheses; otherwise, they are identifiers and mostly will not yield useful results. I have only worked with xml.sax in Python 2.6 and 2.7, so I cannot speak to its current state in later versions.

    The condition described in Python issue bpo-2124 (http://bugs.python.org/issue2124) may yet be a defect, and is at the least a reasonably important enhancement, but apparently was not sufficiently specified, so I will attempt to clarify. As an aside, it is similar to a libxml2 issue on which I have also commented today (https://bugzilla.gnome.org/show_bug.cgi?id=162776), whose statement of issue actually contains what I would expect to be correct behavior if the toggling action were setting an option/feature rather than importing an additional module.

    The most common case, and the reason w3c has been inundated with the described requests, is that every time any user anywhere uses xml.sax in its default form to parse an XHTML document containing a doctype declaration, a request is sent to www.w3.org for the contents of that DTD from the URI in its system identifier. This is not documented anywhere (which would be the primary reason to call this a defect), and is confusing because it has the effect of using the terms parser and validator (or "validating parser," whichever is the preferred name) interchangeably.

    The w3c is largely to blame, since their own definition document for XML (http://www.w3.org/TR/REC-xml/#sec-external-ent) defines the DTD as a "special kind of external entity," and then goes on to say that XML processors *MAY* use any combination of pubid+sysid to find an alternative method of resolving the reference, but otherwise *MUST* use the URI.

    However, this is only necessary when *validating* XML. The DTD is a "mostly useless, but required" (http://en.wikipedia.org/wiki/Document_Type_Declaration) entity in HTML5, e.g., but is not required in XML generally. Even when present, the only time a processor should consult the DTD is during validation, not parsing. If the default parser revealed by xml.sax is a validator rather than just a parser, that should be communicated clearly to the user. When we discuss a CSV parser, we expect it to accept lines separated by some character, each with columns separated by commas. We do not expect it to verify that certain values are found in certain columns of the first line unless we specify that it should. In specifying that it should, we have asked for a validator rather than a parser. This issue is related to the XML analogue of that distinction.

    The most valid and important complaint in the referenced blog post is: "don't fetch stuff unless you actually need it," which is what xml.sax users may be unwittingly doing if validation is the default behavior. Further, if xml.sax were actually *not* conducting validation by default, there is no reason whatever to retrieve the DTD, since any external entity references can remain unresolved in well-formed XML prior to validation.

    Note that the features, http://xml.org/sax/features/external-general-entities, .../external-parameter-entities, and .../validation have no specified defaults (http://www.saxproject.org/apidoc/org/xml/sax/package-summary.html#package_description). Making these enabled by default causes network-required side effects, which I would contend is improper: unless a user asks for network activity, none should occur. An implicit request for network activity, such as validation, should be fully and widely-visibly documented as a legitimate side effect.

    The set of primary use cases for the xml.sax parsers certainly include validation, but users will often be unaware that it is the default, and more importantly be unaware that the parser will therefore request the DTD from its URI. While the feature, .../external-general-entities, partially solves the problem, it is not a full solution, because a well-formed XML document can contain external entities regardless of the location of DTD subsets. The w3c's description ("special kind of external entity") is important here - the DTD is special for a reason, and has its own tag/specifier as a result: resolving general external entities after intentionally omitting an external DTD subset is an acceptable use case, especially in a non-validating parser.

    My proposal would be to enhance/fix xml.sax by doing the following:

    1. allow toggling of external DTD subset loading via a feature such as http://apache.org/xml/features/nonvalidating/load-external-dtd (http://xerces.apache.org/xerces-j/features.html),
    2. cause the feature, http://xml.org/sax/features/validation, to automatically enable the DTD loading feature as well, just as it does for the two currently implemented external entity features,
    3. document the default behavior, specially noting that users can expect URIs to be resolved, across the network/internet if necessary, after either the DTD feature or the validation feature is toggled to the enabled state,

    and in my opinion:

    1. disable the DTD feature by default, so the xml.sax-uninitiated developer who arrives upon the module as a solution doesn't start testing/using it without realizing these requests will be sent.

    Sufficient documentation could override #4, since there is a backward-compatibility issue, but I think the detriment to the w3c is enough reason to rethink it nevertheless. Catalogs are a nice solution as well when validation is needed, but when it is not needed, there is no reason to require the extra work of building a catalog (that can't be guaranteed to be writable in situ without sysadmin access) when it is essentially purposeless.

    I am continuing to search for the entry point for "<!" (and would appreciate any pointers) but have resorted to subclassing ExpatParser (which, again, kills the nice abstraction xml.sax touts) and omitting external entites with public identifiers starting with "-//W3C//DTD " - this is not a general solution, as DTDs are not guaranteed to come from w3c, and a correct solution would apply the appropriate omissions to the full "<!DOCTYPE ...>" entity, either by leaving external subsets as unparsed and orphaned entities in the document, or (only as a secondary potential solution since internal subsets could still be present and would thus become broken) by ignoring it completely. It might not even be reasonable to consider the latter, though when parsing only and not validating it would be a correctly-working result, so if the former is unachievable, the latter would be a decent improvement for that situation.

    As a final note, in case it's helpful: my approach to a fix has been to examine ways to treat the DOCTYPE declaration itself, but another approach would be to have EntityResolver.resolveEntity receive a declaration type alongside the public and system identifiers, and thus the DOCTYPE declaration type could receive appropriate treatment within the current framework quite easily.

    @rsandwick3 rsandwick3 mannequin added expert-XML performance Performance or resource usage labels Feb 28, 2013
    @bitdancer
    Copy link
    Member

    bitdancer commented Feb 28, 2013

    I believe this is a subset of bpo-17239, and it may be appropriate to close it as a duplicate. I'll let that up to Chris, though, since he knows what still needs to be specified/worked out.

    @tiran tiran removed their assignment Jun 12, 2016
    @tiran
    Copy link
    Member

    tiran commented Oct 20, 2021

    I'm closing this as duplicate.

    @tiran tiran closed this as completed Oct 20, 2021
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    expert-XML performance Performance or resource usage
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants