find_probable_cause_of_failure() is bad at fetching logs #2

Closed
coyotemarin opened this issue Oct 13, 2010 · 8 comments
@coyotemarin
Collaborator

We currently grab EMR logs from S3. This only works for job flows that shut down after running your job. Technically, it's not supposed to work at all: according to AWS's documentation (http://developer.amazonwebservices.com/connect/entry.jspa?externalID=3938&categoryID=265), logs aren't copied to S3 until they've been untouched for 5 minutes.

Rather than grabbing logs from S3 directly, we need to download the relevant logs via ssh if the job flow is still running (or from S3 if it's not), and parse the log files locally.
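
A rough sketch of the intended control flow, with made-up helper names (fetch_logs_via_ssh(), fetch_logs_from_s3(), parse_logs()) standing in for whatever we actually end up with:

def find_probable_cause_of_failure(job_flow, steps):
    # If the job flow is still alive, its logs may not have reached S3 yet
    # (the 5-minute delay above), so pull them off the master node over ssh;
    # otherwise read them from the job flow's S3 log URI.
    if job_flow.state in ('RUNNING', 'WAITING'):
        log_paths = fetch_logs_via_ssh(job_flow, steps)
    else:
        log_paths = fetch_logs_from_s3(job_flow, steps)

    # Either way, parsing happens locally on the downloaded files.
    return parse_logs(log_paths)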

@irskep
Contributor

irskep commented Jun 13, 2011

Would it be more appropriate to call out to the ssh utility or to add a dependency on something like paramiko?

@irskep
Contributor

irskep commented Jun 14, 2011

I'm refactoring the S3 and SSH log fetcher functionality to subclass LogFetcher in a new submodule.

from mrjob.logfetch.ssh import SSHLogFetcher
# etc.

This will probably also involve breaking a lot of S3-related code out of EMRJobRunner, which isn't a bad thing since that class is currently a couple thousand lines long.
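
Roughly this shape (only LogFetcher and SSHLogFetcher are settled names; S3LogFetcher and the method signatures below are just how I'm picturing it):

class LogFetcher(object):
    """Common interface for listing and downloading EMR log files."""

    def ls(self, path=''):
        """Yield log paths under *path*."""
        raise NotImplementedError

    def get(self, path, dest):
        """Download the log at *path* to the local file *dest*."""
        raise NotImplementedError


class S3LogFetcher(LogFetcher):
    """Fetch logs from the job flow's S3 log URI (terminated job flows)."""


class SSHLogFetcher(LogFetcher):
    """Fetch logs from the master node over ssh (running job flows)."""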

@coyotemarin
Collaborator Author

Yup, that sounds good. Another good way to approach this is to start out by building a standalone utility (in mrjob.tools.emr) that fetches and analyzes logs, and then patch it into EMRJobRunner.

And please, use scp; don't add another library dependency. :)
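
For example, just shelling out to the ssh/scp binaries (the helper names, the hadoop user, and the key handling here are illustrative, not final):

import subprocess

def ssh_cat(host, key_pair_file, remote_path):
    """Return the contents of a remote log file via the ssh binary."""
    return subprocess.Popen(
        ['ssh', '-i', key_pair_file, '-o', 'StrictHostKeyChecking=no',
         'hadoop@%s' % host, 'cat', remote_path],
        stdout=subprocess.PIPE).communicate()[0]

def scp_get(host, key_pair_file, remote_path, local_path):
    """Copy a remote log file down with scp."""
    subprocess.check_call(
        ['scp', '-i', key_pair_file, '-o', 'StrictHostKeyChecking=no',
         'hadoop@%s:%s' % (host, remote_path), local_path])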

@irskep
Contributor

irskep commented Jun 14, 2011

Can do. My current strategy is to copy the relevant functions (ls/get for S3, local, and SSH, plus their dependencies) into instance methods and helpers on the fetchers so that logfetch can be used independently. Then I'll write a tool around it, verify it by hand on various cases, add mocking for SSH plus automated tests, and finally insert it into EMRJobRunner, removing the redundant functions.
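
In other words, the standalone tool would just drive a fetcher's ls()/get() interface, something like this sketch (the traceback scan is a placeholder for the real parsing):

import os
import tempfile

def fetch_and_scan_logs(fetcher):
    """Download each task-attempt log and scan it for Python tracebacks."""
    local_dir = tempfile.mkdtemp()
    for remote_path in fetcher.ls('task-attempts'):
        local_path = os.path.join(local_dir, os.path.basename(remote_path))
        fetcher.get(remote_path, local_path)
        for line in open(local_path):
            if 'Traceback (most recent call last)' in line:
                print('possible cause of failure in %s' % remote_path)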

@coyotemarin
Collaborator Author

Sounds like a good plan.

ghost assigned irskep Jun 14, 2011
@irskep
Contributor

irskep commented Jun 14, 2011

New info: logs have slightly different paths on S3 vs local. Here's a quickref I'll put in the comments:

S3 location             Local location
/daemons                / (root)
/jobs                   /history
/node                   <not present>
/steps                  /steps
/task-attempts          /userlogs
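
The same mapping as a lookup, roughly how I'd record it in the code (the constant name is just illustrative):

S3_TO_LOCAL_LOG_PATHS = {
    '/daemons':       '/',          # root of the local log dir
    '/jobs':          '/history',
    '/node':          None,         # no local equivalent
    '/steps':         '/steps',
    '/task-attempts': '/userlogs',
}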

irskep pushed a commit to irskep/mrjob that referenced this issue Jun 18, 2011
coyotemarin pushed a commit that referenced this issue Jul 27, 2011
@irskep
Contributor

irskep commented Aug 1, 2011

I believe this can be closed unless it also encompasses a log fetching/parsing refactor.

@coyotemarin
Collaborator Author

Yup, thanks!

anusha-r added a commit that referenced this issue Mar 8, 2015
Pulling master from Yelp/mrjob
coyotemarin pushed a commit to coyotemarin/mrjob that referenced this issue Apr 24, 2016
merge master into Google dataproc