
Documentation request for diagnosing/restarting/monitoring pipelines #133

Open
cfhammill opened this issue Jan 2, 2019 · 2 comments

@cfhammill

Hi all,

I think it would be helpful to have some documentation on how to monitor, diagnose, and restart an ongoing pipeline, particularly in the Redis case.

Currently I connect to the Redis coordinator with redis-cli and track the job_running and jobs_queue keys to get a sense of what's happening.
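For context, a minimal sketch of the monitoring commands I run; this assumes both keys are Redis lists, which I'm only inferring from the fact that LTRIM/LPOP work on job_running:

```sh
# Count queued and running jobs (assuming both keys are Redis lists)
redis-cli LLEN jobs_queue
redis-cli LLEN job_running

# Dump the raw entries of job_running; the values come back as encoded
# blobs that I don't know how to decode (see point 1 below)
redis-cli LRANGE job_running 0 -1
```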

If things break, I try a clean restart with:

  1. chmod +w store #give write access to the store
  2. rm store/lock/* #remove all locks
  3. ls -d store/pending* | parallel -j<ncores> 'chmod -R +w {}; rm -r {}' #remove pending jobs which can prevent running
  4. redis-cli LTRIM "job_running" 0 0; redis-cli LPOP "job_running". As an aside, I think I should probably use DEL or FLUSHALL instead; this is just my Redis inexperience. If I don't do this, the "job_running" key grows with each run, potentially triggering the same job to run multiple times, which has caused failures. (A consolidated sketch of these steps follows below.)
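Putting those steps together, this is roughly the clean-restart script I run today. The store layout and key names are just what I've observed on my system, and using DEL in place of the LTRIM/LPOP dance is my guess, not documented behaviour:

```sh
#!/usr/bin/env bash

chmod +w store          # give write access to the store
rm store/lock/*         # remove all locks

# Remove pending jobs which can prevent running;
# replace <ncores> with the number of cores to use
ls -d store/pending* | parallel -j<ncores> 'chmod -R +w {}; rm -r {}'

# Clear the running-jobs key; DEL is probably cleaner than LTRIM + LPOP,
# but I haven't verified how the coordinator reacts to the key being absent
redis-cli DEL job_running
```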

This strategy was acquired through somewhat painful trial and error.

Documentation that would help me (and I suspect others):

  1. How to inspect the contents of "job_running". GET returns some encoded data that I don't know how to decode.
  2. Understanding what "jobs_queue" is for; everything seems to go to "job_running" after startup.
  3. An explanation of how logging works; my STDOUT and STDERR end up in my cluster log files, not in store/metadata/hash-<hash>/{stdout,stderr}. Although this may just be a Torque cluster idiosyncrasy.
  4. What's in metadata.db; I'm happy to poke at the SQLite tables if that's what's required (a generic SQLite inspection sketch follows this list). Alternatively, a pointer to a human-readable set of stages, preferably divided into queued, running, completed, and failed.
  5. Garbage collection and caching examples. I don't really have disk space for multiple copies of my pipeline. Alternatively, I could avoid caching certain steps, but how to do so isn't immediately clear.
  6. Tips for avoiding unnecessary re-runs. I've had to run the same pipeline on multiple hardware sets, and it seems that this has triggered re-runs (or perhaps my store-manipulation tomfoolery; I'm not sure).
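For point 4, the closest thing I have to a recipe is generic SQLite inspection of the database. The table names and the exact location of metadata.db within the store are unknown to me, so this only reveals whatever happens to be there:

```sh
# Run from wherever metadata.db lives in your store
sqlite3 metadata.db ".tables"   # list the tables
sqlite3 metadata.db ".schema"   # dump their schemas
```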
@cfhammill (Author)

Also, a convenient way to issue interrupts to executors would be handy.

@dorranh (Contributor) commented Jul 22, 2021

Thanks for raising this issue! We are no longer using external-executor, but I think the points here are important to keep in mind if we end up implementing distributed execution in future versions of funflow. As such, I'll tag this issue and leave it up for reference.
