Implement pep 503 for simple repository. #506

Carreau · 2016-08-14T21:41:50Z

Implement pep 503 data-requires for simple repository.

This exposes the data-requires-python field when the packages registers it.

This also expand the readme with some basic instruction on how to connect
pypi-legacy to a running warehouse setup.

SQL seem to be ~2x to 5x slower, but likely still reasonable time

SQL Before :

    explain analyse select filename, python_version, md5_digest from release_files where name='ipython_sql'

    Index Scan using release_files_name_idx on release_files  (cost=0.29..15.60 rows=3 width=66) (actual time=0.028..0.028 rows=0 loops=1)
      Index Cond: (name = 'ipython_sql'::text)
    Planning time: 0.224 ms
    Execution time: 0.058 ms

SQL After :

    explain analyse select filename, releases.requires_python, md5_digest
    from release_files
    inner join releases
        on release_files.version=releases.version
        and release_files.name=releases.name
    where release_files.name='ipython_sql'

    Nested Loop  (cost=4.60..31.74 rows=1 width=61) (actual time=0.013..0.013 rows=0 loops=1)
      Join Filter: (release_files.version = releases.version)
      ->  Index Scan using release_files_name_idx on release_files  (cost=0.29..15.60 rows=3 width=77) (actual time=0.012..0.012 rows=0 loops=1)
            Index Cond: (name = 'ipython_sql'::text)
      ->  Materialize  (cost=4.31..16.01 rows=3 width=18) (never executed)
            ->  Bitmap Heap Scan on releases  (cost=4.31..16.00 rows=3 width=18) (never executed)
                  Recheck Cond: (name = 'ipython_sql'::text)
                  ->  Bitmap Index Scan on release_name_idx  (cost=0.00..4.31 rows=3 width=0) (never executed)
                        Index Cond: (name = 'ipython_sql'::text)
    Planning time: 0.668 ms
    Execution time: 0.092 ms

Likely longer for project with more releases.

ewdurbin · 2016-08-15T09:37:08Z

@Carreau

thanks for the major updates to the README.

I'm going to add some notes to specific changes here, but the overarching theme of my notes will be "small steps please". Given the nature of PyPI and the simple index, I'm hesitant to have too much change in any given update to core components. I'm positive that adding data-requires-python will
brick some people's parsers/scrapers and minimizing change will help us support them better, since we can point them to PEP 503 and this change to understand what's gone on. Multiple changes at once are likely confuse and bewilder people, instigating more pitchforks and burdening support.

ewdurbin · 2016-08-15T09:38:51Z

webui.py

@@ -1036,22 +1036,21 @@ def simple_body(self, path):
            raise NotFound, path + " does not have any releases"

        html = []
-        html.append("""<html><head><title>Links for %s</title><meta name="api-version" value="2" /></head>"""
+        html.append("""<!DOCTYPE html><html><head><title>Links for %s</title></head>"""


I'm not sure if the meta tag here is used, can we leave it until it's value is understood?

The !DOCTYPE addition is sound, however I'd like to stage it in a separate PR if possible.

@dstufft the api-version meta node was added in be17fea#diff-19fadc30e1b17100568adbd8c6c3cc13, do you have any information on it's use?

Removed DOCTYPE and reeadded meta attributes.

ewdurbin · 2016-08-15T09:53:31Z

just saw #505 and it helped color some of the decisions made here.

regarding select performance, I did want to point out that things get significantly slower when working with packages that have a longer release history. though i have no suggestions on improvement:

running on a local copy of the production release_files/releases tables:

pypi=# explain analyze select filename, releases.requires_python, md5_digest from release_files inner join releases on release_files.name=releases.name and release_files.version=releases.version where release_files.name='setuptools';
                                                                      QUERY PLAN                                                                       
-------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.85..178.97 rows=50 width=61) (actual time=0.068..4.953 rows=632 loops=1)
   ->  Index Scan using releases_pkey on releases  (cost=0.42..60.60 rows=14 width=19) (actual time=0.034..0.219 rows=202 loops=1)
         Index Cond: (name = 'setuptools'::text)
   ->  Index Scan using release_files_name_version_idx on release_files  (cost=0.42..8.45 rows=1 width=78) (actual time=0.019..0.022 rows=3 loops=202)
         Index Cond: ((name = 'setuptools'::text) AND (version = releases.version))
 Planning time: 5.652 ms
 Execution time: 5.038 ms
(7 rows)

pypi=# explain analyze select filename, python_version, md5_digest from release_files where name='setuptools';
                                                                 QUERY PLAN                                                                 
--------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on release_files  (cost=17.02..2132.36 rows=593 width=66) (actual time=0.274..0.982 rows=632 loops=1)
   Recheck Cond: (name = 'setuptools'::text)
   Heap Blocks: exact=628
   ->  Bitmap Index Scan on release_files_name_version_idx  (cost=0.00..16.87 rows=593 width=0) (actual time=0.176..0.176 rows=632 loops=1)
         Index Cond: (name = 'setuptools'::text)
 Planning time: 0.119 ms
 Execution time: 1.039 ms
(7 rows)

ewdurbin · 2016-08-15T09:56:07Z

I'd really like to have @dstufft take a pass here, as this is the most significant change to /simple we've implemented in PyPI-Legacy in the past... well a long time.

ewdurbin · 2016-08-15T10:08:15Z

webui.py

            if href.startswith('http://cheeseshop.python.org/pypi') or \
                    href.startswith('http://pypi.python.org/pypi') or \
                    href.startswith('http://www.python.org/pypi'):
                # Suppress URLs that point to us
                continue
-            if rel:


If I remember correctly, removing rel="internal" tag will break older versions of pip. I'm not opposed, but it is asking for pitchforks.

reintroduced.

ewdurbin · 2016-08-15T10:10:45Z

I'd like to propose that we break out the following ancillary changes into a separate PR that can be deployed after a lengthy "warning" period. We can put it in the status page as a scheduled maintenance to alert people that it's changing ahead of time and have somewhere to point when they get angry and flashmob us.

Addition of the !DOCTYPE element
Removal of the api-version meta tag
Removal of the rel="internal" attribute on package links

Carreau · 2016-08-15T16:55:30Z

Multiple changes at once are likely confuse and bewilder people, instigating more pitchforks and burdening support.

No problem, I completely understand. I'll take care of sending incremental updates.

regarding select performance, I did want to point out that things get significantly slower when working with packages that have a longer release history. though i have no suggestions on improvement

One of the changes possible would be to actually update the releases_files table and duplicate the requires_python information into another column.

Anyway, now that I have that working, it should be easy to pull that out in smaller chunks.

I have to work a bit on other stuff this week but I can try to spend some cycles.

dstufft · 2016-08-15T17:38:48Z

One of the changes possible would be to actually update the releases_files table and duplicate the requires_python information into another column.

This would not be the end of the world. Our data model is a little weird anyways, techincally all metadata is per file (right now) and it's entirely possible to have wildly different metadata for different files. We just sort of pretend this doesn't exist, and use the metadata for one of the files as all of the metadata for a particular release. This was done by basically splitting the metadata with a semi arbitrary line and saying "this metadata is likely to be different per-release, and this metadata is likely to be different per-file".

So there are two sort of ways to handle this, one is that it's just denormalizing the release centric data onto release_files (and if we do that, it should be done via a database trigger, I can show you an example in Warehouse) or decide that requires_python is not a release centric piece of data but is actually a file centric piece of data and just move the column.

Changes to the DB have to go through Wareh

Carreau · 2016-08-19T04:35:28Z

So there are two sort of ways to handle this, one is that it's just denormalizing the release centric data onto release_files (and if we do that, it should be done via a database trigger, I can show you an example in Warehouse) or decide that requires_python is not a release centric piece of data but is actually a file centric piece of data and just move the column.

I will trust you on this one. i honestly doubt people should/would bother to have various files with different requires-python; and I think that per release make sens or at least it is easier to migrate to per-file later if we decide; or at least simpler than deciding that per-file was a mistake and try to gather all at one back at a release level.

Changes to the DB have to go through Warehouse

Sure, just tell me if you want to make the changes or if the performance hit on Postgres because of the Join is acceptable.

Can you also benchmark the following ?

explain analyze select filename, requires_python, md5_digest
from (select requires_python, version from releases where name='setuptools') as s1,
     (select filename, version, md5_digest from release_files where name='setuptools') as s2
where s2.version=s1.version

I don't get the ~200 nested loops I see on your benchmark as I don't have the full db, but this query might be quicker as the cross product should be smaller if we filter by name first; though I don't know how much the query-planner can optimize...

Or maybe even the following that don't make the join explicit :

explain analyse select filename, releases.requires_python, md5_digest
from release_files,releases
where release_files.version=releases.version 
        and release_files.name='setuptools'
        and releases.name='setuptools'

Thanks !

Carreau · 2016-08-23T02:29:30Z

So there are two sort of ways to handle this, one is that it's just denormalizing the release centric data onto release_files (and if we do that, it should be done via a database trigger, I can show you an example in Warehouse) or decide that requires_python is not a release centric piece of data but is actually a file centric piece of data and just move the column.

How often are each releases and release_files productiondb updated ? It seem like I could make a materialized view that contain the above join and refresh it on insert/updates of each release / release_files ?

mpacer · 2016-08-23T22:15:22Z

README.rst

+Quick development setup
+-----------------------
+
+Make sure you read http://wiki.python.org/moin/CheeseShopDev#DevelopmentEnvironmentHints


Can you remove/update this line it links to a page that states that it is immediately out of date and that people should come back here to learn what to do. Infinite loops are sad.

…files Assuming there are much less updates of the `release` and `release_file` table than reads then we take the overhead of building this materialized view only rarely. Building it concurrently is slow but does not block reads so the building cost might be invisible. The more efficient alternative would be : - To make this an actual table and update only the affected row but seem much more complicated and prone to errors and confusions for maintenance as this table should not be manipulated directly. But would not be that hard since Posgres 9.5 supports UPSERTs. It would be easy to convert back-and from a virtual-table to a real table without actually changing Python code. - To duplicate the `requires_python` column on both `releases` and `release_files` which seem redundant and might lead to inconsistencies. - Transfer `requires_python` to `release_files`, though it is also queried from `releases` for we might still have the overhead of join but in the other direction and will require careful checking of the legacy-pypi codebase. Related: pypi/legacy#506

Carreau · 2016-09-16T23:46:49Z

Rebase and updated to work with pypi/warehouse#1448 (once accepted)

Carreau · 2016-10-05T22:50:20Z

Just checked "Allow edits from maintainers" and gave pushed right to my branch to @michaelpacer .

end This exposes the data-requires-python field when the packages registers it.

cgi.escape() does not escape single quotes, so we need to make sure any arguments parsed with cgi.escape() are wrapped with double quotes. Using cgi.escape(quote=True) escapes double quotes. Though this is unlikely to occur in a python_requires value, we want to protect against double quote based sql injections. The html parsing was separated into its own private function so that it could be unit tested.

mpacer · 2016-10-06T00:23:21Z

I took out the html formatting parts of the simple_body() function and put them in a private function outisde of the main class so that they can be more easily tested without needing to instantiate the full object.

Would you prefer me to include a test to go with these updates or to leave this as is?

Carreau · 2016-10-06T17:44:52Z

I opened #535 which resurect some of the testing and pass on travis.

mpacer · 2016-10-06T19:21:02Z

If the time-to-merge of this solution would be extended because of needing to wait for tests, I would support not including tests and simply merging as is.

At the least, it supports escaping all of the information that is passed now. Previously the rel argument was not escaped (but just passed in as a raw string), and so an injection could have occurred fairly easily if bad data had been included.

I especially would prefer not to write tests if there is no long term intention of maintaining a testing suite for this repo. I understand not wanting to start supporting a testing suite given that the intention is to deprecate this when it becomes feasible. But until the warehouse lives, I also understand wanting to keep the cheese shop resilient.

So I support whatever the maintainers decide in either direction.

Carreau · 2016-11-01T21:41:24Z

Hello there !

I hope that @dstufft has found a new position that give him some time to still work on Open Source and that he can still work remotely from home ! I'm pretty sure he can't discuss it while contract are being signed. Just giving my encouragement.

I made a couple of PRs to PyPI legacy this one that adds limited functionality for Python 3 only packages, so I was wondering if there is any hope to get that soonish, and if there is anything I can do to help.

I've tried as well to simplify and document how to run PyPI legacy locally and can do more cleanup work if this is what is needed for things to move forward. There are a couple of library out there that wish to release Python 3 only version and hope not to use gross hack or break Python 2 user workflow.

Of course we need pip 8.2 to be out, but that would be neat to be able to at least get that into the wild on test it a bit.

Thanks !

Carreau · 2016-11-02T20:31:48Z

Of course we need pip 8.2 to be out, but that would be neat to be able to at least get that into the wild on test it a bit.

Thanks @dstufft for releasing pip 9 ! 🍰

Carreau · 2016-11-23T23:53:56Z

Hello there ! Anything we can do here ?

dstufft · 2016-11-30T12:21:09Z

I think that will automatically deploy in ~15 minutes or so. If not then I'll go looking to see why not.

Carreau · 2016-11-30T18:55:21Z

Thanks a lot, really appreciate it.

I just tested on testpypi with py3onlytest, it does not seem to show the data-attributes.

I've tried with the most recent stable setuptools (and twine).

dstufft · 2016-11-30T23:47:53Z

Hmm, none of those has a requires_python in the DB:

warehouse-test::DATABASE=> SELECT name, version, filename, requires_python FROM release_files WHERE name = 'py3onlytest';
    name     | version |             filename             | requires_python
-------------+---------+----------------------------------+-----------------
 py3onlytest | 0.1     | py3onlytest-0.1.tar.gz           |
 py3onlytest | 0.1     | py3onlytest-0.1-py3-none-any.whl |
 py3onlytest | 0.2     | py3onlytest-0.2-py3-none-any.whl |
 py3onlytest | 0.2     | py3onlytest-0.2.tar.gz           |
(4 rows)

warehouse-test::DATABASE=> SELECT name, requires_python FROM releases WHERE name = 'py3onlytest';
    name     | requires_python
-------------+-----------------
 py3onlytest |
 py3onlytest |
(2 rows)

So either we're not properly capturing that value when it is sent, or twine/setuptools is not actually sending it, or both.

Carreau · 2016-12-01T03:20:05Z

I'm confused... can run the same for flit? as it seem to have that set :

But it does not have the data in the /simple/flit page.

I'll try to dig out where the upload path is and what could go wrong there.

dstufft · 2016-12-01T03:21:21Z

Is that on regular PyPI or TestPyPI?

Carreau · 2016-12-01T03:30:12Z

Is that on regular PyPI or TestPyPI?

Sorry this one is regular pypi. It's also on testpypi, but with Requires Python: 3 not >=3. Though the one on tespypi does not show data attributes either

dstufft · 2016-12-01T03:51:54Z

@Carreau It's not deployed to regular PyPI yet. The flit page was cached though, I purged it and it shows up now. This suggests the problem is in twine or setuptools or both and not PyPI though.

Carreau · 2016-12-01T04:14:02Z

@Carreau It's not deployed to regular PyPI yet. The flit page was cached though, I purged it and it shows up now. This suggests the problem is in twine or setuptools or both and not PyPI though.

Good, thanks a lot. I'll investigate both and report back. Thanks !

Carreau · 2016-12-01T05:09:05Z

So

in setuptools I was using requires_python instead of python_requires. I'm astonished nothing complained.
The listing in testPyPI shows Python Requires : >=3

Though the /simple/ page still does not have the data attribute. I might be missing something. Will have a look tomorrow.

Thanks.

Carreau · 2016-12-01T18:41:09Z

Hum, so the only thing I can see is that the db-trigger are not triggered correctly.

As the main page pull the Requires-Python from the releases table and is correct, and the listing page is pulling things from the releases_file the only thing I can see leading to this a DB inconsistency.

I'm trying to re-run the Warehouse/PyPI-legacy locally but seem to have issues with make initdb for now.

dstufft · 2016-12-01T21:05:48Z

@Carreau That seems to be the issue yes:

warehouse-test::DATABASE=> SELECT name, version, filename, requires_python FROM release_files WHERE name = 'py3onlytest';
    name     | version |             filename             | requires_python
-------------+---------+----------------------------------+-----------------
 py3onlytest | 0.1     | py3onlytest-0.1.tar.gz           |
 py3onlytest | 0.1     | py3onlytest-0.1-py3-none-any.whl |
 py3onlytest | 0.2     | py3onlytest-0.2-py3-none-any.whl |
 py3onlytest | 0.2     | py3onlytest-0.2.tar.gz           |
 py3onlytest | 0.3     | py3onlytest-0.3-py3-none-any.whl |
 py3onlytest | 0.3     | py3onlytest-0.3.tar.gz           |
 py3onlytest | 0.4     | py3onlytest-0.4-py3-none-any.whl |
 py3onlytest | 0.5     | py3onlytest-0.5-py3-none-any.whl |
 py3onlytest | 0.5     | py3onlytest-0.5.tar.gz           |
(9 rows)

warehouse-test::DATABASE=> SELECT name, version, requires_python FROM releases WHERE name = 'py3onlytest';
    name     | version | requires_python
-------------+---------+-----------------
 py3onlytest | 0.4     | >=3,<5
 py3onlytest | 0.5     | >=3
 py3onlytest | 0.3     | >=3,<5
 py3onlytest | 0.2     |
 py3onlytest | 0.1     |
(5 rows)

Carreau · 2016-12-01T21:51:56Z

Ok, thanks for confirming. I'm unsure where this could be coming from. I'm trying to replicate locally. Fighting with legacy codebase to have it running without S3 and reddis.

Carreau · 2016-12-01T23:54:17Z

Sorry previous comment was in a wrong window. But I think I know what is happening at least for me locally.

When uploading through the legacy pypi interface. It seem when calling python setup.py register is does create an entry in release but does not set requires_python. Thus when twine uploads files, these twine does send the python_requires that have no effect.

I'm still unsure of why on testpipy it does seem to work at least in the webUI.

I'm giving up for today

takluyver · 2016-12-18T12:32:22Z

Thanks both for getting this in place; I can see the data-requires-python on flit's simple index page now, and with support in pip, it should be working :-)

ewdurbin reviewed Aug 15, 2016
View reviewed changes

Carreau force-pushed the pep-503 branch 2 times, most recently from 806bfed to 09b4978 Compare August 19, 2016 03:05

mpacer reviewed Aug 23, 2016
View reviewed changes

Carreau mentioned this pull request Aug 31, 2016

Create requires_python column on release_files to efficiently query requires_python with files pypi/warehouse#1448

Merged

Carreau force-pushed the pep-503 branch from 09b4978 to fcbaa09 Compare September 16, 2016 23:45

Carreau force-pushed the pep-503 branch from fcbaa09 to 4ee1a70 Compare September 19, 2016 15:36

Carreau and others added 2 commits October 5, 2016 16:03

Implement pep 503 data-requires for simple repository.

3e1ce8f

end This exposes the data-requires-python field when the packages registers it.

mpacer force-pushed the pep-503 branch from 4ee1a70 to 8e7c3ac Compare October 6, 2016 00:20

Carreau mentioned this pull request Oct 6, 2016

Resurrect some tests #535

Closed

Escape rel attribute if present

9055bf8

Carreau mentioned this pull request Oct 6, 2016

Make it easier to run local version of PyPI. #536

Closed

Escape double quotes in rel attribute

1c4dbfe

Carreau mentioned this pull request Oct 30, 2016

Practicalities section points to nonexistant information? python3statement/python3statement.github.io#53

Open

dstufft merged commit 12b6e3a into pypi:master Nov 30, 2016

Carreau deleted the pep-503 branch November 30, 2016 18:25

Carreau mentioned this pull request Dec 2, 2016

Prevent inconsistency between release and release_files tables. pypi/warehouse#1513

Merged

jamesrhester mentioned this pull request Jan 27, 2017

Fix python version check pypa/pip#4254

Closed

Implement pep 503 for simple repository. #506

Implement pep 503 for simple repository. #506

Conversation

Carreau commented Aug 14, 2016 • edited Loading

ewdurbin commented Aug 15, 2016

ewdurbin Aug 15, 2016

Choose a reason for hiding this comment

ewdurbin Aug 15, 2016

Choose a reason for hiding this comment

ewdurbin Aug 15, 2016

Choose a reason for hiding this comment

Carreau Aug 19, 2016

Choose a reason for hiding this comment

ewdurbin commented Aug 15, 2016 • edited Loading

ewdurbin commented Aug 15, 2016

ewdurbin Aug 15, 2016 • edited Loading

Choose a reason for hiding this comment

Carreau Aug 19, 2016

Choose a reason for hiding this comment

ewdurbin commented Aug 15, 2016 • edited Loading

Carreau commented Aug 15, 2016

dstufft commented Aug 15, 2016

Carreau commented Aug 19, 2016 • edited Loading

Carreau commented Aug 23, 2016

mpacer Aug 23, 2016

Choose a reason for hiding this comment

Carreau commented Sep 16, 2016

Carreau commented Oct 5, 2016

mpacer commented Oct 6, 2016

Carreau commented Oct 6, 2016

mpacer commented Oct 6, 2016

Carreau commented Nov 1, 2016

Carreau commented Nov 2, 2016

Carreau commented Nov 23, 2016

dstufft commented Nov 30, 2016

Carreau commented Nov 30, 2016

dstufft commented Nov 30, 2016

Carreau commented Dec 1, 2016

dstufft commented Dec 1, 2016

Carreau commented Dec 1, 2016

dstufft commented Dec 1, 2016

Carreau commented Dec 1, 2016

Carreau commented Dec 1, 2016

Carreau commented Dec 1, 2016

dstufft commented Dec 1, 2016

Carreau commented Dec 1, 2016

Carreau commented Dec 1, 2016

takluyver commented Dec 18, 2016

Carreau commented Aug 14, 2016 •

edited

Loading

ewdurbin commented Aug 15, 2016 •

edited

Loading

ewdurbin Aug 15, 2016 •

edited

Loading

ewdurbin commented Aug 15, 2016 •

edited

Loading

Carreau commented Aug 19, 2016 •

edited

Loading