Update projects on per-backend base #239

Closed

Conversation

@MichaelMraka

This pull request adds an update_projects() method to the backend classes. The method updates projects in a much faster and more scalable way by calling the upstream server's (pypi.python.org, cpan.org, etc.) API or fetching its RSS feed of updated projects. So instead of polling all projects (many thousands) on every update, it polls only the projects that reported some modification.

It also automatically adds new projects found in the feed.

There is a replacement for the current anitya_cron.py - the anitya_cron_backends.py script - which uses the backends' update_projects() method.

This is very useful for, e.g., the automatic rebuilding of upstream packages into RPMs (in COPR), which I'm testing.
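For illustration, the feed-based lookup for a backend like PyPI might look roughly like this (a minimal sketch; the feed URL, the XML layout, and the list_recent_project_names helper are assumptions for illustration, not code from this pull request):

    import xml.etree.ElementTree as ET
    import requests

    # illustrative feed URL; a real backend would use whatever
    # "recently updated" feed or API the upstream server offers
    PYPI_RSS = 'https://pypi.org/rss/updates.xml'

    def list_recent_project_names():
        """Return the names of projects that reported changes upstream."""
        response = requests.get(PYPI_RSS, timeout=30)
        response.raise_for_status()
        root = ET.fromstring(response.content)
        names = set()
        for item in root.iter('item'):
            # feed item titles are typically of the form "name version"
            title = item.findtext('title') or ''
            names.add(title.split(' ')[0])
        return sorted(names)

Polling one feed like this replaces thousands of per-project version checks with a single HTTP request per backend.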

anitya/lib/backends/__init__.py
+        session = anitya.app.SESSION
+        projects = self.list_recent_projects(session)
+        anitya.LOG.info(projects)
+        p = multiprocessing.Pool(anitya.app.APP.config.get('CRON_POOL', 10))


@ralphbean

ralphbean Dec 9, 2015

Contributor

Any thoughts on using multiprocessing.pool.ThreadPool here instead? Using threads in Python is no good when the work done is CPU-bound -- the GIL kills performance. However, this is mostly an IO-bound thing, so we should be okay (fingers crossed).
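For reference, the drop-in swap being suggested might look like this (a hedged sketch; update_project here is a stub standing in for the real worker):

    from multiprocessing.pool import ThreadPool

    def update_project(project_id):
        # stub for the real per-project worker, which is
        # mostly blocking network IO
        print('updating', project_id)

    # ThreadPool shares the Pool API, so only the constructor changes;
    # threads avoid the fork overhead and are fine for IO-bound work
    pool = ThreadPool(10)
    pool.map(update_project, range(5))
    pool.close()
    pool.join()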



@ralphbean

ralphbean Dec 9, 2015

Contributor

With a threadpool (instead of a multiproc pool) you might be able to pass the session object into update_project and thereby avoid re-initializing it for every work item.
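A minimal sketch of that idea, assuming the worker takes the session as its first argument and functools.partial does the plumbing (note that a plain SQLAlchemy session is not thread-safe, so a scoped_session would be the safer variant):

    import functools
    from multiprocessing.pool import ThreadPool

    def update_project(session, project_id):
        # stub for the real worker; reuses the shared session instead of
        # re-initializing one per work item
        print(session, project_id)

    session = {'fake': 'session'}  # stand-in for the SQLAlchemy session
    pool = ThreadPool(10)
    pool.map(functools.partial(update_project, session), range(5))
    pool.close()
    pool.join()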



@MichaelMraka

MichaelMraka Dec 11, 2015

Well, I simply reused code from the current anitya_cron.py here. But yes, I do see cron job failures from time to time, so a different parallel execution module might be a better fit. I'll give it a try.



@ralphbean

ralphbean Dec 15, 2015

Contributor

Any luck with ThreadPool?



@xsuchy

xsuchy Feb 25, 2016

ThreadPool is an obsolete module and should not be used for new projects. @MichaelMraka, if multiprocessing causes you problems, you may try asyncio: https://docs.python.org/3/library/asyncio.html
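A rough sketch of the asyncio route (assuming Python 3.5+; update_project is again a stand-in for the real blocking worker, pushed onto an executor so it needs no rewrite):

    import asyncio

    def update_project(project_id):
        # stand-in for the real worker; blocking network IO in reality
        print('updating', project_id)

    async def update_all(project_ids):
        loop = asyncio.get_event_loop()
        # fan the blocking workers out to the default thread pool executor
        # and wait for all of them to finish
        tasks = [loop.run_in_executor(None, update_project, pid)
                 for pid in project_ids]
        await asyncio.gather(*tasks)

    asyncio.get_event_loop().run_until_complete(update_all(range(5)))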



@ralphbean

ralphbean Feb 26, 2016

Contributor

@xsuchy, can you point to some docs on the deprecation? I don't see any mention of it in the multiprocessing.pool.ThreadPool docstring.



@xsuchy

xsuchy Feb 28, 2016

@ralphbean I meant https://pypi.python.org/pypi/threadpool, but that is something different from multiprocessing.pool.ThreadPool, which is, on the other hand, barely documented.


@ralphbean

ralphbean Dec 9, 2015

Contributor

Implementation aside for a moment, we currently run the full-scan cronjob about twice a day.

With something like this, we could run it much more frequently -- say, every hour or more. We could then still run the full-scan cronjob twice a day to catch anything that might have fallen through the cracks.
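As an illustration of that cadence, the crontab could look something like this (paths and times are hypothetical):

    # hourly feed-based scan via the new script
    0 * * * *     /usr/bin/python /usr/share/anitya/anitya_cron_backends.py
    # existing full scan, still twice a day, as a safety net
    0 3,15 * * *  /usr/bin/python /usr/share/anitya/anitya_cron.py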


@MichaelMraka

MichaelMraka Dec 11, 2015

Thanks Ralph for the comments.

With something like this, we could run it much more frequently -- say, every hour or more? We could then also run the full-scan cronjob still twice a day in order to catch anything else that might have fallen through the cracks.

Exactly. Moreover, I have a testing instance with more than 15k packages (automatically added PyPI modules updated in the last 3 months); the current full-scan cron job takes 25 minutes on it, while the new API scan finishes in 2-3 minutes.
So a simple calculation says that at somewhere around 400k projects a full scan will take roughly 12 hours (25 min × 400k / 15k ≈ 11 hours) :).
Well, 400k might seem like an insane number now, but with automatic registration of new projects it could be reached in a couple of months (there are 70k modules on PyPI, 150k on CPAN, 200k on npmjs, ...).


anitya/lib/backends/__init__.py
@@ -265,6 +266,30 @@ def call_url(self, url, insecure=False):
return requests.get(url, headers=headers, verify=not insecure)
+    @classmethod
+    def list_recent_projects(cls, session):
+        return anitya.lib.model.Project.by_backend(session, cls.name)


@ralphbean

ralphbean Dec 15, 2015

Contributor

Can you write an inline comment here explaining how child classes of the BaseBackend can override this?

Can you add to that comment some description of the default functionality -- i.e., what happens for the child classes that do not override this?
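For illustration, the requested comment plus the default could read along these lines (a hedged sketch, not the pull request's wording):

    import anitya.lib.model

    class BaseBackend(object):
        name = None  # each child backend sets its own name

        @classmethod
        def list_recent_projects(cls, session):
            """List the projects whose upstream reported recent changes.

            Child backends that can ask the upstream server which projects
            changed recently (via an API call or an RSS feed) should
            override this and return only those projects.  Child classes
            that do not override it get this default, which returns every
            project known for the backend -- i.e. they fall back to a
            full scan.
            """
            return anitya.lib.model.Project.by_backend(session, cls.name)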


@ralphbean

ralphbean Dec 15, 2015

Contributor

Thinking about this some more -- this change is pretty major. It doesn't actually change a lot of the existing code, but it does add a whole new mode that the Backend classes need to update themselves for over time.

Can you add a narrative blurb (one or two paragraphs) to the docs (maybe the README?) describing the two different modes and what methods the backends need to implement in order to take advantage of them?


@ralphbean

ralphbean Jan 30, 2016

Contributor

Any updates here @MichaelMraka?


@MichaelMraka

MichaelMraka Feb 23, 2016

Unfortunately, I have had no luck finding a better solution than multiprocessing (one that would eliminate all the cron job failures).

