Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create requires_python column on release_files to efficiently query requires_python with files #1448

Merged
merged 12 commits into from
Oct 5, 2016

Conversation

Carreau
Copy link
Contributor

@Carreau Carreau commented Aug 31, 2016

Assuming there are much less updates of the release and release_file
table than reads then we take the overhead of building this materialized
view only rarely.

Building it concurrently is slow but does not block reads so the
building cost might be invisible.

The more efficient alternative would be :

   - To make this an actual table and update only the affected row but
     seem much more complicated and prone to errors and confusions for
     maintenance as this table should not be manipulated directly. But
     would not be that hard since Posgres 9.5 supports UPSERTs.

     It would be easy to convert back-and from a virtual-table to a real
     table without actually changing Python code.

   - To duplicate the `requires_python` column on both `releases` and
     `release_files` which seem redundant and might lead to
     inconsistencies.

   - Transfer `requires_python` to `release_files`, though it is also
     queried from `releases` for we might still have the overhead of
     join but in the other direction and will require careful checking
     of the legacy-pypi codebase.

Working with @michaelpacer on pypi/legacy#506 we came up with this solution for now, and woudl appreciate any feedback and timing statistics on replicate of production DB.

@Carreau
Copy link
Contributor Author

Carreau commented Aug 31, 2016

Pushed a change to try to make the linter happy.

@Carreau
Copy link
Contributor Author

Carreau commented Aug 31, 2016

The test on 3.5.2 seem to have failed before starting can one of you restart this specific build of travis ?

Thanks !

@di
Copy link
Member

di commented Aug 31, 2016

@Carreau I restarted it for you.

@dstufft
Copy link
Member

dstufft commented Aug 31, 2016

Why not just add a column to the release_files table and setup a trigger to cascade the values over? Using a Materialized view here seems to be harder than that?

@Carreau
Copy link
Contributor Author

Carreau commented Aug 31, 2016

Why not just add a column to the release_files table and setup a trigger to cascade the values over? Using a Materialized view here seems to be harder than that?

I'm not sure how to cascade on the same table, I can give it a try; @michaelpacer and I basically learn SQL for that, and I can see many way to get that wrong:
- recursion ? I don't know if it's possible for triggers ?
How to react to row update, if row update in release_files there is and it update a release_files.requires_python, does the trigger revert back to the corresponding value in release.requires_python ? Globally I'm concern that would would get internal source of truth and that any further development later on warehouse might get confused to know which of the two column should be updated.

I might just misunderstand the features available in SQL, though and there might be way to make sure there is consistency.

@Carreau
Copy link
Contributor Author

Carreau commented Sep 14, 2016

Updated with @michaelpacer work, this now adds a column to release_files and update only concern rows on UPDATE or INSERT into releases or release_files.

@Carreau
Copy link
Contributor Author

Carreau commented Sep 15, 2016

Travis seem to be stuck on Running setup.py bdist_wheel for psycopg2 ... can you restart this build please ?

I improved the codestyle to pass the linter, and I assume after rebasing I had to update the database revision number (I assume that what's the following was trying to convey:

self = <alembic.script.base.ScriptDirectory object at 0x7fa690fc8dd8>
ancestor = 'Destination %(end)s is not a valid upgrade target from current head(s)'
multiple_heads = "Multiple head revisions are present for given argument 'head'; please specify a specific target revision, '<branchname>@head' to narrow to a specific head, or 'heads' for all heads"

Is there a simpler way to update the db revision number than to re-run docker-compose run web python -m warehouse db revision and move the content from one file to another ?

@dstufft
Copy link
Member

dstufft commented Sep 15, 2016

I think multiple heads means that other changes to the DB has landed since this change and you'll need to either rebase your DB revision so it follows linearly to that one, or you'll need to create a merge db revision. See http://alembic.zzzcomputing.com/en/latest/branches.html

@Carreau
Copy link
Contributor Author

Carreau commented Sep 15, 2016

Thanks for restarting the build.

I think multiple heads means that other changes to the DB has landed since this change and you'll need to either rebase your DB revision so it follows linearly to that one, or you'll need to create a merge db revision. See http://alembic.zzzcomputing.com/en/latest/branches.html

Thanks, it appears I still had the revision wrong but because of a missing key... re-attempting.

@Carreau Carreau changed the title Create a materialized view to efficiently query requires_python with files Create requires_python column on release_files to efficiently query requires_python with files Sep 15, 2016
@Carreau
Copy link
Contributor Author

Carreau commented Sep 15, 2016

Yeah ! 🍰 🎊 Test are passing.

Not sure if you prefer commits to be squashed a bit, and/or if extra test are needed for the DB as this will mostly be used by Legacy PyPI, nor exactly what need to be tested.

@Carreau
Copy link
Contributor Author

Carreau commented Sep 16, 2016

Why not just add a column to the release_files table and setup a trigger to cascade the values over? Using a Materialized view here seems to be harder than that?

That's now implemented and seem to work.

[edit] fix typo spotted in next comment.

@mpacer
Copy link

mpacer commented Sep 17, 2016

Why not just add a column to the release_files table and setup a trigger to cascade the values over? Using a Materialized view here seems to be harder than that?

I think what @Carreau meant was that it's now implemented. Since it is not not implemented. 😄

Is there anything more that needs to be done on this for it to be included?

@Carreau
Copy link
Contributor Author

Carreau commented Sep 17, 2016

I think what @Carreau meant was that it's now implemented. Since it is not not implemented. 😄

Oops. typed to fast. I edited my comment.

@Carreau
Copy link
Contributor Author

Carreau commented Sep 19, 2016

I'll appreciate comment or hints on the next step to take on this. In order to tackle the things on PyPI sides as well.

Thanks !

@Carreau
Copy link
Contributor Author

Carreau commented Sep 26, 2016

Hello there !

It's Monday 🎊 , usually the day of the week most people hate because it's time to start working again !

I hope you are all fine, and have spent a nice week-end. I personally went to the beach (in Moss landing) and saw otters. I did a wrong move and my broken for is hurting again, so I used that as an excuse to watch Mr Robot on Sunday. I've also read that Postgres 9.6 is going to have BDR, which is cool I guess.

I'm now routinely looking at this piece of code and looking to get any feedback on things to improve (or to fix). Hopefully get things merged soon enough to be able to finish my patch on legacy PyPI.

I'm now going to attempt inserting a Cat Gif here to cheer you up and give you the strength to review.
Sorry if you are a dog person, the dog gif is for next week.

cat-drinks-water-2

Thanks for all your hard work !

@Carreau
Copy link
Contributor Author

Carreau commented Oct 3, 2016

Morning everyone !

Is is monday again ! I know, I know it's often monday. But according to rough calculation only about 14.28% of the time ! I definitively need more samples to be sure.

Also good news it 💧 rained 💧 this morning in California ! Also my ankle did not hurt before it started raining. I'm sad cause I wanted to be like the old grandpa in movies that can predict rain just depending on whether* their rheumatism are hurting. [*or should I say weather, Haha]

I'm also kind of anti-participating to hacktoberfest: instead of submitting PRs I'm trying to get mine merged or closed as I feel too many opened PRs can be scarry to new contributors, and depressing for maintainers. I know the feeling.

Let me tell you a joke to cheer you up and give you the courage to merge this Pull request. You make it hard on me because I have to find one relevant for this Pull request...

A SQL request walk into a bar, where some of it's friends are hanging in the back.
it approaches the tables where they are having multiple row of drinks and ask:
"Can I join you ?"

Well ok, that was lame. But I hope you enjoyed it and got some courage to press the nice green button you can see below. Also if you are afraid to ask from some change or don't have the courage to write a long reply, do not worry I can take a blunt and short answer even negative one. All I want is this thing to move forward. And I know that writing a long nice message that is politically correct can be tough.

Anyway thanks for your hard work, and hopping we can move this forward.

Also I'm still unsure if you are a dog or a cat person, so I'm going to try a dog gif this time. If it does not work I'll likely try squirel, llama, rabbits, raptors or something else next week.

59627412

See you before next monday hopefully.

@dstufft
Copy link
Member

dstufft commented Oct 4, 2016

Sorry, this is on my list of things to review, I've just had a really crappy couple of weeks and haven't been able to focus much.

BEGIN
IF (TG_OP = 'INSERT'
OR
OLD.requires_python IS DISTINCT FROM NEW.requires_python)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can ditch the IF ... THEN statement here and just rely on the trigger to only evaluate when this needs to be evaluated (see below).

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caught that the if statement ended after I tried to rebuild the db. Now it seems to successfully migrate!

# release_files with the appropriate requires_python values.
op.execute(
""" CREATE TRIGGER releases_requires_python
AFTER INSERT OR UPDATE ON releases
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you change this to AFTER INSERT OR UPDATE OF requires_python ON releases then this trigger will only fire on either an insert, or when requires_python is updated. This would allow you to remove the logic from the procedure up above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, nice ! Thanks for the tip !

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is much more elegant; thank you!

Changes should be addressed by 9061cbe.

# if someone changes the requires_python value, it is regenerated from
# releases.
op.execute(
""" CREATE TRIGGER release_files_requires_python
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I probably wouldn't bother with this. This should be read only from the Python code in Warehouse and so it shouldn't ever got modified anyways. If it does we can always repair the data by executing:

UPDATE release_files
SET requires_python = releases.requires_python
FROM releases
WHERE
    release_files.name=releases.name
    AND release_files.version=releases.version

again to fix it.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does that mean that those commands should be added to the downgrade() function? Or just that there's a clean way to address the issue if it does happen to come up.

Thanks for the feedback either way!

@Carreau
Copy link
Contributor Author

Carreau commented Oct 4, 2016

Sorry, this is on my list of things to review, I've just had a really crappy couple of weeks and haven't been able to focus much.

No worries, we all go through these phase. Hope things will go better for you. I was secretly hoping you were not reviewing because you liked the weekly email and were hoping to get more.

Thanks a lot for your time we really appreciate !

Copy link
Member

@dstufft dstufft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This overall looks good. One thing I missed is that you need to add this to the File module, something like:

class File(db.Model):
    ...

    requires_python = Column(Text)

    @validates("requires_python")
    def validate_requires_python(self, *args, **kwargs):
        raise RuntimeError("Cannot set File.requires_python")

@mpacer
Copy link

mpacer commented Oct 5, 2016

Is the validates decorator that you're suggesting the same as the sqlalchemy.orm.validates described here or is there a place in the code where you define your own validates decorator that I'm not finding?

@dstufft
Copy link
Member

dstufft commented Oct 5, 2016

Yea, it's the SQLAlchemy one, and it will just make it so any attempt in the Warehouse ORM to change the value will bomb out (but the trigger will work fine).

@mpacer
Copy link

mpacer commented Oct 5, 2016

And it should also be included in the File module?

@dstufft
Copy link
Member

dstufft commented Oct 5, 2016

It should only be on the File model. Modifying Release.requires_python is perfectly fine, the trigger will keep File.requires_python up to date, but modifying File.requires_python should trigger a validation error.

@mpacer
Copy link

mpacer commented Oct 5, 2016

Got it, thank you for the pointers! I wouldn't have known to construct it in this way without your help.

@mpacer
Copy link

mpacer commented Oct 5, 2016

I see now! The validation is accomplishing what we were trying to directly overwrite with triggers.

I agree that raising an error is a better than silently overwriting what someone did.

Also, I just realised I never imported it.

@mpacer
Copy link

mpacer commented Oct 5, 2016

@dstufft Does the codecov issue mean that a test needs to be added that hits the RuntimeError and expects a failure?

@dstufft
Copy link
Member

dstufft commented Oct 5, 2016

Yes.

@mpacer
Copy link

mpacer commented Oct 5, 2016

Would that go in the tests/unit/packaging/test_models.py under the TestFile class?

@dstufft
Copy link
Member

dstufft commented Oct 5, 2016

Yup.

@willingc
Copy link
Contributor

willingc commented Oct 5, 2016

@dstufft Thanks for helping out the Jupyter folks.

Sorry about the last few weeks. Your Python friends ❤️ all that you do. Take care. Feel free to ping any of us if we can help. You know how much I love docs ;-)

@mpacer
Copy link

mpacer commented Oct 5, 2016

Thank you for merging this!

When can we rely on this being available from test_pypi and pypi so we can make progress on integrating requires_python into pypi-legacy while using more efficient sql requests (in re: pypi/legacy#506)?

@Carreau Carreau deleted the mat_view branch October 5, 2016 18:55
@dstufft
Copy link
Member

dstufft commented Oct 5, 2016

Now.

Carreau added a commit to Carreau/warehouse that referenced this pull request Dec 2, 2016
The work in PR pypi#1448 was meant to replicate the information of the
`requires_python` of the `release` table to the `release_files` table
for efficiency when generating the list of available packages for pip.

While the work on pypi#1448 seem sufficient for warehouse itself, it need to
consider that legacy-pypi also access the same database, and legacy-pypi
violate some constraint.

In particular when using `setup.py register` followed by `twine upload`,
the file upload insert files into `release_files` after inserting into
`releases`. Thus the value in releases is not propagated, leading to an
inconsistency and a listing in pip missing information about
python-version compatibility.

While I doubt there are any packages released between the merge of pypi#1448
and a fix, this update the tables, and bind an already existing trigger
to update the information during insertion in `release_files`
Carreau added a commit to Carreau/warehouse that referenced this pull request Dec 2, 2016
The work in PR pypi#1448 was meant to replicate the information of the
`requires_python` of the `release` table to the `release_files` table
for efficiency when generating the list of available packages for pip.

While the work on pypi#1448 seem sufficient for warehouse itself, it need to
consider that legacy-pypi also access the same database, and legacy-pypi
violate some constraint.

In particular when using `setup.py register` followed by `twine upload`,
the file upload insert files into `release_files` after inserting into
`releases`. Thus the value in releases is not propagated, leading to an
inconsistency and a listing in pip missing information about
python-version compatibility.

While I doubt there are any packages released between the merge of pypi#1448
and a fix, this update the tables, and bind an already existing trigger
to update the information during insertion in `release_files`
Carreau added a commit to Carreau/warehouse that referenced this pull request Dec 3, 2016
The work in PR pypi#1448 was meant to replicate the information of the
`requires_python` of the `release` table to the `release_files` table
for efficiency when generating the list of available packages for pip.

While the work on pypi#1448 seem sufficient for warehouse itself, it need to
consider that legacy-pypi also access the same database, and legacy-pypi
violate some constraint.

In particular when using `setup.py register` followed by `twine upload`,
the file upload insert files into `release_files` after inserting into
`releases`. Thus the value in releases is not propagated, leading to an
inconsistency and a listing in pip missing information about
python-version compatibility.

While I doubt there are any packages released between the merge of pypi#1448
and a fix, this update the tables, and bind an already existing trigger
to update the information during insertion in `release_files`
Carreau added a commit to Carreau/warehouse that referenced this pull request Dec 3, 2016
The work in PR pypi#1448 was meant to replicate the information of the
`requires_python` of the `release` table to the `release_files` table
for efficiency when generating the list of available packages for pip.

While the work on pypi#1448 seem sufficient for warehouse itself, it need to
consider that legacy-pypi also access the same database, and legacy-pypi
violate some constraint.

In particular when using `setup.py register` followed by `twine upload`,
the file upload insert files into `release_files` after inserting into
`releases`. Thus the value in releases is not propagated, leading to an
inconsistency and a listing in pip missing information about
python-version compatibility.

While I doubt there are any packages released between the merge of pypi#1448
and a fix, this update the tables, and bind an already existing trigger
to update the information during insertion in `release_files`
Carreau added a commit to Carreau/warehouse that referenced this pull request Dec 5, 2016
The work in PR pypi#1448 was meant to replicate the information of the
`requires_python` of the `release` table to the `release_files` table
for efficiency when generating the list of available packages for pip.

While the work on pypi#1448 seem sufficient for warehouse itself, it need to
consider that legacy-pypi also access the same database, and legacy-pypi
violate some constraint.

In particular when using `setup.py register` followed by `twine upload`,
the file upload insert files into `release_files` after inserting into
`releases`. Thus the value in releases is not propagated, leading to an
inconsistency and a listing in pip missing information about
python-version compatibility.

While I doubt there are any packages released between the merge of pypi#1448
and a fix, this update the tables, and bind an already existing trigger
to update the information during insertion in `release_files`
Carreau added a commit to Carreau/warehouse that referenced this pull request Dec 5, 2016
The work in PR pypi#1448 was meant to replicate the information of the
`requires_python` of the `release` table to the `release_files` table
for efficiency when generating the list of available packages for pip.

While the work on pypi#1448 seem sufficient for warehouse itself, it need to
consider that legacy-pypi also access the same database, and legacy-pypi
violate some constraint.

In particular when using `setup.py register` followed by `twine upload`,
the file upload insert files into `release_files` after inserting into
`releases`. Thus the value in releases is not propagated, leading to an
inconsistency and a listing in pip missing information about
python-version compatibility.

While I doubt there are any packages released between the merge of pypi#1448
and a fix, this update the tables, and bind an already existing trigger
to update the information during insertion in `release_files`
dstufft pushed a commit that referenced this pull request Dec 5, 2016
The work in PR #1448 was meant to replicate the information of the
`requires_python` of the `release` table to the `release_files` table
for efficiency when generating the list of available packages for pip.

While the work on #1448 seem sufficient for warehouse itself, it need to
consider that legacy-pypi also access the same database, and legacy-pypi
violate some constraint.

In particular when using `setup.py register` followed by `twine upload`,
the file upload insert files into `release_files` after inserting into
`releases`. Thus the value in releases is not propagated, leading to an
inconsistency and a listing in pip missing information about
python-version compatibility.

While I doubt there are any packages released between the merge of #1448
and a fix, this update the tables, and bind an already existing trigger
to update the information during insertion in `release_files`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants