Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

/simple serves HTML that can't be parsed by Python's xml.etree if package has yanked releases #7886

Closed
boegel opened this issue May 4, 2020 · 19 comments · Fixed by #7916
Closed
Labels
needs discussion a product management/policy issue maintainers and users should discuss

Comments

@boegel
Copy link

boegel commented May 4, 2020

Describe the bug

Parsing HTML served by /simple endpoint results in xml.etree.ElementTree.ParseError.

Expected behavior

No parse error, as it was before when there were no yanked releases yet or with packages that don't have any yanked releases (yet).

To Reproduce

  • Python script test.py that contains:

    import requests
    from xml.etree import ElementTree
    simple_pip = requests.get('https://pypi.python.org/simple/pip')
    ElementTree.fromstring(simple_pip.text)
  • run it with python test.py, for example (on macOS):

    $ python3 test.py
    Traceback (most recent call last):
      File "test.py", line 4, in <module>
        ElementTree.fromstring(simple_pip.text)
      File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/xml/etree/ElementTree.py", line 1315, in XML
      parser.feed(text)
    xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 143, column 306
    

The problem is the data-yanked part in lines like:

<a href="https://files.pythonhosted.org/packages/8c/5c/c18d58ab5c1a702bf670e0bd6a77cd4645e4aeca021c6118ef850895cc96/pip-20.0.tar.gz#sha256=5128e9a9401f1d16c1d15b2ed766a79d7813db1538428d0b0ce74838249e3a41" data-requires-python="&gt;=2.7,!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*,!=3.4.*" data-yanked>pip-20.0.tar.gz</a><br/>

My Platform

  • macOS 10.15.4 with Python 2.7.16 or 3.7.7 (but same issue occurs on other platforms too)

Additional context

@di
Copy link
Member

di commented May 4, 2020

Hi @boegel thanks for the report. From https://docs.python.org/3/library/xml.etree.elementtree.html:

The xml.etree.ElementTree module implements a simple and efficient API for parsing and creating XML data.

PEP 503, which defines the simple API, says that this page is HTML, not XML:

Within a repository, the root URL (/ for this PEP which represents the base URL) MUST be a valid HTML5 page with a single anchor element per project in the repository.

Therefore I wouldn't expect to be able to parse a /simple page (HTML) with ElementTree (for XML).

As is, this page is valid HTML5, so I don't think there's anything for us to do here.

I'd recommend using html.parser instead for parsing HTML.

@boegel
Copy link
Author

boegel commented May 5, 2020

Thanks for the feedback!

Unfortunately html.parser is not really an option for us currently, since we still support Python 2...

It's a shame that there's no interest in fixing this on the PyPI side, since it seems like it could be an easy fix, for example by indicating yanked releases with data-yanked='yes' or something, similar to how data-requires-python is specified.

I didn't see anything about the data-yanked attribute in PEP 503, is how yanked releases are specified in /simple elsewhere?

@di
Copy link
Member

di commented May 5, 2020

Yes, in PEP 592. Unfortunately adding a value to this attribute has meaning according to the PEP, so we can't just make it "yes" or something similar.

@boegel
Copy link
Author

boegel commented May 5, 2020

@di How about using data-yanked='' if no reason for yanking was provided? It seems like that would still match what is specified in PEP 592 (depending on how "no value" is interpreted, I guess).

@di
Copy link
Member

di commented May 5, 2020

Yes, I think that would depend on how pip is currently interpreting this value.

@boegel
Copy link
Author

boegel commented May 5, 2020

data-yanked and data-yanked='' are equivalent to pip, based on the tests in https://github.com/pypa/pip/blob/327a31536ff71b99e5aa454bd325b3daf75b666d/tests/unit/test_collector.py#L336-L341

@dstufft
Copy link
Member

dstufft commented May 5, 2020

I'm not super enthused by attempting to maintain compatibility with an XML parser and I think it's likely going to be error prone since this response is html5 and not XML. In this case we can possibly do it, but I'm not sure that holds true moving forward.

Maybe simple changes infrequently enough and is weird enough that it isn't a big deal and it's worth doing, I dunno. I just worry that long term it's a a futile effort.

@boegel
Copy link
Author

boegel commented May 5, 2020

We've been relying on /simple for quite some time now in EasyBuild (I think after the simple API was strongly recommended to us), and we've never run into problems with it up until recently when packages started yanking releases.

The issue I reported is quite annoying for us, since it effectively break the auto-download-from-PyPI feature we have in all existing EasyBuild releases.
We have a workaround for it for the upcoming release, but it's still annoying to many.

That shouldn't be the big motivation here though, of course, but I suspect other people be running into this too (and it's not trivial to pinpoint the exact issue either if you see the error popping up, it look like a fluke in PyPI at first to be honest).

Maybe it's sufficient to have a test somewhere that checks whether a page like https://pypi.python.org/simple/pip can be parsed by xml.etree.ElementTree, with a pointer to this issue?
If some breakage then pops up again in the future, and it's too much effort to avoid it, then so be it.

If I can help with that in any way, I'd love to hear it.

I'll try and make sure we don't rely on ElementTree anymore in EasyBuild for parsing what is served by /simple, since that clearly wasn't the best solution, to be more robust against changes in PyPI in the future, but that leaves the issue in existing EasyBuild releases, which can also be resolved by moving data-yanked='' rather than data-yanked.

@di di added the needs discussion a product management/policy issue maintainers and users should discuss label May 5, 2020
@Flamefire
Copy link

To add on

data-yanked and data-yanked='' are equivalent to pip

Those are not only equivalent to pip they are in fact defined to be equivalent in HTML5: https://stackoverflow.com/a/23752239/1930508

So I see no downside in adding that.

Question however: PIP for Python2 is still (kinda) supported (at least up to some version). Does that access the /simple-endpoint too? How does it parse the page?

@boegel
Copy link
Author

boegel commented May 7, 2020

To add on

data-yanked and data-yanked='' are equivalent to pip

Those are not only equivalent to pip they are in fact defined to be equivalent in HTML5: https://stackoverflow.com/a/23752239/1930508

So I see no downside in adding that.

Oh, even better. Hopefully that makes it less of an issue to change to data-yanked=''.

Question however: PIP for Python2 is still (kinda) supported (at least up to some version). Does that access the /simple-endpoint too? How does it parse the page?

Latest pip still supports Python 2, so no issue there?

@di
Copy link
Member

di commented May 7, 2020

Question however: PIP for Python2 is still (kinda) supported (at least up to some version). Does that access the /simple-endpoint too? How does it parse the page?

Pip uses https://pypi.org/project/html5lib/ to parse /simple.

@daneah
Copy link
Contributor

daneah commented May 9, 2020

@di and I discussed the details of #7856 today so that I could pick it up, and it feels like the natural outcome of that work will be data-yanked="<optional reason for yank>".

@boegel
Copy link
Author

boegel commented May 11, 2020

@daneah Thanks a lot for looking into this!

If no reason was provided, would the changes in #7916 result in PyPI specifying data-yanked="" on the page served by /simple, or something else?

@Flamefire
Copy link

Looking at https://github.com/pypa/warehouse/pull/7916/files#diff-5e24a0b38c92bf7d3bb2982c25b10156R22 then yes it will be data-yanked="<optional>" or not existing at all

@daneah
Copy link
Contributor

daneah commented May 11, 2020

@boegel @Flamefire that's correct—the attribute after the change would now be one of: absent, data-yanked="", or data-yanked="some reason".

@di di closed this as completed in #7916 May 18, 2020
@boegel
Copy link
Author

boegel commented May 18, 2020

@di I understand that this issue got closed via the merge of #7916, but it seems that the fix isn't live on PyPI yet.

Out of curiosity, when would the fix also be active on PyPI itself?
No rush, please don't read this as me asking to deploy this ASAP, that's not why I'm asking this.

@di
Copy link
Member

di commented May 18, 2020

@boegel Correct. The deployment takes about ~5 minutes from merge, I can let you know once it's live if you'd like.

In addition, it takes about ~24 hours for these pages to fall out of our cache, so it might be a while until 100% of /simple pages exhibit this behavior.

@boegel
Copy link
Author

boegel commented May 18, 2020

OK, I can keep an eye on this, give it another shot tomorrow, thanks!

@di
Copy link
Member

di commented May 18, 2020

This is now deployed, you can see it on https://pypi.org/simple/pip/ for example, for which I have manually purged the cache.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs discussion a product management/policy issue maintainers and users should discuss
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants