
Enable multi-workers, and make data loading faster #12

Merged · 1 commit · May 3, 2013
Conversation

sykp241095
Contributor

  • Make data loading faster

I cache this data ("targets_all", "graphs_all") to disk files when update_metrics.py runs,
then delete the urlopen call in update_metrics.py, because app.py will reload the data by comparing the mtime of these files when responding to /index.

  • Enable multi-workers

When I tested multiple workers with gunicorn, I found that refresh_data ran in a SINGLE worker only. So I made app.py compare the mtime of the data files on each request to /index and reload them if the files had been modified.

It still works with a single worker as before, but several minutes faster. :)

@Dieterbe
Contributor

The performance of computing the metrics data structures is currently the single biggest PITA when using graph-explorer with a large number of metrics. It is a single-threaded operation that could easily be parallelized (by splitting the metrics list into chunks).
Another fix I've been looking into is creating a tag database (in something like SQLite) instead of in-memory Python structures; this should be more efficient in a few areas (possibly in addition to multi-threading).
Ideally, long term, metrics should be submitted into graphite as lists of key/value pairs instead of "dumb" strings. I've started working towards this goal but it's going to take me a while (see for example https://answers.launchpad.net/graphite/+question/223956). Once that's in place we can get rid of structured_metrics and many things will be easier. (Tag-based metrics in graphite are the future, and everyone will benefit.)
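The chunking idea above could be sketched roughly like this. This is a minimal illustration, not graph-explorer code: `build_structures` is a hypothetical stand-in for the real per-metric work, and `multiprocessing.dummy` is used so the sketch runs anywhere (it exposes the same Pool API backed by threads; for CPU-bound work you would swap in `multiprocessing.Pool` to sidestep the GIL):

```python
# Sketch: parallelise the metrics computation by splitting the metrics
# list into chunks and mapping a worker over the chunks.
# multiprocessing.dummy provides the same Pool API backed by threads;
# for real CPU parallelism, use multiprocessing.Pool instead.
from multiprocessing.dummy import Pool

def build_structures(chunk):
    # Hypothetical stand-in for the expensive per-metric computation
    # (regex matching, building the targets/graphs structures, ...).
    return [m.upper() for m in chunk]

def chunked(items, n_chunks):
    """Split items into roughly n_chunks equal pieces."""
    size = max(1, len(items) // n_chunks)
    return [items[i:i + size] for i in range(0, len(items), size)]

metrics = ["servers.web1.cpu", "servers.web2.mem", "servers.db1.io"]
pool = Pool(2)
parts = pool.map(build_structures, chunked(metrics, 2))
pool.close()
pool.join()
# Merge the per-chunk results back into one flat list; map preserves
# the order of the input chunks.
targets_all = [t for part in parts for t in part]
print(targets_all)
```

Because each chunk is independent, no locking is needed; only the final merge touches shared state.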

Anyway, looking at your PR:

  1. I like the removal of the lambda functions (and all configure functions) when storing metrics: basically we should only store what's fundamental to the metric (graphite metric, target, tags); that suffices to use the metric. The configure functions were only kept so that you could see them when inspecting a metric (to debug what happened), but that's a separate, secondary concern.
    Since metrics have a plugin tag, it's fairly easy to look up this info after the fact anyway.

  2. I don't understand how your commit does anything wrt parallelisation (multi-workers? How does that work? I've never used gunicorn. Would it be better to always run under gunicorn instead of paste, as I do now?)

  3. Can you explain in more detail how the speedups are achieved?

thanks

@sykp241095
Contributor Author

Hi:

To 2): Gunicorn and paste serve the same function in this case: managing workers, single or multiple.

Without my commit I can also start multiple workers with paste or gunicorn, but those workers cannot hold the same data (targets_all, graphs_all) at the same time.

For example, if I start 2 workers A and B with the paste server and send "GET /refresh_data" to rebuild the data, worker A serves my request, so "targets_all" and "graphs_all" get rebuilt in worker A, but they stay stale in worker B, because B didn't serve my "GET /refresh_data" this time.

To 3): The speedup is achieved ONLY when rebuilding the data (targets_all, graphs_all), through these steps:

(1) Instead of rebuilding the data in a web app worker, I build it in cron (update_metrics.py). When that cron job runs, it not only downloads the metrics but also rebuilds the two big data structures "targets_all" and "graphs_all", then pickle.dumps them to disk files.

(2) Instead of urllib.urlopen('/refresh_data'), on each request the workers TRY TO reload the data (the targets_all and graphs_all just dumped) by comparing the mtime of those files with the timestamp stored at the last reload.
There is very little overhead in comparing timestamps.

(3) Finally, workers can easily RELOAD the data by pickle.loads-ing the disk files. In my case this costs 3 seconds with 20,000 metrics.

PS: As pickle cannot dump lambdas, I copied targets_all and deleted the lambdas from the copy.
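The mtime-check-and-reload pattern described in steps (2) and (3) could look roughly like this. This is an illustrative sketch: the file name and the module-level cache are made up, not the actual app.py code:

```python
# Sketch: reload the pickled data only when the cache file on disk is
# newer than the copy loaded last time.
import os
import pickle

CACHE_FILE = "targets_all.cache"  # hypothetical path, not the real config value

_loaded_mtime = None  # mtime of the file at the last reload
_targets_all = None   # the cached, unpickled data

def load_if_changed():
    """Cheap check on every request: compare mtimes, reload only on change."""
    global _loaded_mtime, _targets_all
    mtime = os.path.getmtime(CACHE_FILE)
    if _loaded_mtime is None or mtime > _loaded_mtime:
        with open(CACHE_FILE, "rb") as f:
            _targets_all = pickle.load(f)
        _loaded_mtime = mtime
    return _targets_all

# Pretend the cron job (update_metrics.py) just dumped fresh data.
with open(CACHE_FILE, "wb") as f:
    pickle.dump({"demo.metric": "demo target"}, f)
data = load_if_changed()
print(data)
```

Every worker process runs this check independently, which is why multiple workers can all converge on the same data without talking to each other.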

@sykp241095
Contributor Author

We have already used this patch in our production environment for a week, and it seems good.

    if not (os.path.isfile(config.targets_all_cache_file) and
            os.path.isfile(config.graphs_all_cache_file) and
            os.path.isfile(config.filename_metrics)):
        backend.update_data(s_metrics)
@Dieterbe
Contributor

could this be a race condition with multiple workers/threads ?

@sykp241095
Contributor Author

Well, since the workers are independent, I think they will work fine as long as the data (targets_all.cache & graphs_all.cache) is already prepared.
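For what it's worth, one standard way to close any window in which a worker could read a half-written cache file is to pickle into a temporary file and atomically rename it over the real one. This is a sketch of that general technique, not code from the PR:

```python
# Sketch: atomic cache write via temp file + rename, so a concurrent
# reader sees either the complete old file or the complete new one,
# never a partially written pickle.
import os
import pickle
import tempfile

def dump_atomically(obj, path):
    """Pickle obj to path via a temp file + rename (atomic on POSIX)."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(obj, f)
        os.replace(tmp, path)  # atomic swap into place
    except BaseException:
        os.unlink(tmp)  # clean up the temp file on failure
        raise

dump_atomically({"targets": ["a", "b"]}, "targets_all.cache")
with open("targets_all.cache", "rb") as f:
    data = pickle.load(f)
print(data)
```

The temp file is created in the same directory as the target so the rename stays on one filesystem, which is what makes it atomic.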

@Dieterbe
Contributor

Dieterbe commented May 2, 2013

Hey, so like I mentioned earlier, I wanted to get rid of the lambda functions. As it turns out this was pretty easy to implement, and I pushed it to master: Dieterbe/graph-explorer@c01a8e9. So now you can pickle the list directly.

thanks for the explanation, that clears up a lot for me.

I'm personally very inexperienced with web servers such as paste and gunicorn. It probably makes sense to have a config option that specifies the number of workers, but that would probably be a different implementation for paste vs gunicorn... maybe a note in the deployment section of the readme about multiple workers?

I'm suffering a lot from GE being unresponsive, and I wonder if maybe that's because of the old refresh_data implementation; I hope your patch fixes that, so I'll try it out.

Btw, I think there's a race condition; I added a comment on the code line.

@sykp241095
Contributor Author

Because the old 'refresh_data' needs a lot of computation (regex matching), it will occupy the whole process as a result of the GIL in Python, so the worker becomes unresponsive.

The pickle.dumps & pickle.loads approach, or saving this data to a database as you said, will help make this faster.

Good luck !

@Dieterbe Dieterbe merged commit 77b4643 into vimeo:master May 3, 2013
@Dieterbe
Contributor

Dieterbe commented May 3, 2013

I merged it and applied some fixes (please check the commit log and have a look), but there are still some bugs to be fixed:

  1. when you run update_metrics, the log messages emitted from backend.update_data(s_metrics) aren't shown.

  2. if you have no cache files present, then start GE, then run update_metrics.py (successfully), and then go to a page like /debug, /debug/metrics, or /inspect, it will keep waiting for the data files to be loaded, but they will never load, because only the index page does that.

  3. (would be nice to have some documentation on how to use multiple workers, pref. with paste)

@Dieterbe
Contributor

Dieterbe commented May 3, 2013

Wow... GitHub closed this automatically and doesn't let me reopen it. @huoxy can you reopen?

@sykp241095
Contributor Author

I cannot reopen it either...

To 3): I have used gunicorn.

In my case, I edited app.py, appended 'app = bottle.default_app()' to it, and then ran app.py with gunicorn like this, without graph-explorer.py:

    # /usr/bin/gunicorn -w 4 app:app -b 0.0.0.0:8080

or edit graph-explorer.py:

    run('app', host=config.listen_host, port=config.listen_port, server='gunicorn', workers=4)

Then it will start 4 processes.

see this: http://blog.yprez.com/running-a-bottle-app-with-gunicorn.html

@Dieterbe
Contributor

Dieterbe commented May 9, 2013

OK, thanks. I created separate issues (see above) for these.
