find_probable_cause_of_failure() is bad at fetching logs #2

Closed
coyotemarin opened this issue Oct 13, 2010 · 8 comments
@coyotemarin
Collaborator

We currently grab EMR logs from S3. This only works for job flows that shut down after running your job. Technically, it's not supposed to work at all: according to AWS's documentation (http://developer.amazonwebservices.com/connect/entry.jspa?externalID=3938&categoryID=265), logs aren't copied to S3 until they've been untouched for 5 minutes.

Rather than grabbing logs from S3 directly, we need to download the relevant logs via ssh if the job flow is still running (or from S3 if it's not), and parse the log files locally.
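
A rough sketch of the intended control flow, with made-up helper names (fetch_logs_via_ssh(), fetch_logs_from_s3(), parse_logs()) standing in for whatever we actually end up with:

def find_probable_cause_of_failure(job_flow, steps):
    # If the job flow is still alive, its logs may not have reached S3 yet
    # (the 5-minute delay above), so pull them off the master node over ssh;
    # otherwise read them from the job flow's S3 log URI.
    if job_flow.state in ('RUNNING', 'WAITING'):
        log_paths = fetch_logs_via_ssh(job_flow, steps)
    else:
        log_paths = fetch_logs_from_s3(job_flow, steps)

    # Either way, parsing happens locally on the downloaded files.
    return parse_logs(log_paths)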

@irskep
Contributor

irskep commented Jun 13, 2011

Would it be more appropriate to call out to the ssh utility or to add a dependency on something like paramiko?

@irskep
Contributor

irskep commented Jun 14, 2011

I'm refactoring the S3 and SSH log fetcher functionality to subclass LogFetcher in a new submodule.

from mrjob.logfetch.ssh import SSHLogFetcher
# etc.

This will probably also involve breaking a lot of S3-related code out of EMRJobRunner, which isn't a bad thing since that class is currently a couple thousand lines long.
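
Roughly this shape (only LogFetcher and SSHLogFetcher are settled names; S3LogFetcher and the method signatures below are just how I'm picturing it):

class LogFetcher(object):
    """Common interface for listing and downloading EMR log files."""

    def ls(self, path=''):
        """Yield log paths under *path*."""
        raise NotImplementedError

    def get(self, path, dest):
        """Download the log at *path* to the local file *dest*."""
        raise NotImplementedError


class S3LogFetcher(LogFetcher):
    """Fetch logs from the job flow's S3 log URI (terminated job flows)."""


class SSHLogFetcher(LogFetcher):
    """Fetch logs from the master node over ssh (running job flows)."""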

@coyotemarin
Collaborator Author

Yup, that sounds good. Another good way to approach this is to start out by building a standalone utility (in mrjob.tools.emr) that fetches and analyzes logs, and then patch it into EMRJobRunner.

And please, use scp; don't add another library dependency. :)
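
For example, just shelling out to the ssh/scp binaries (the helper names, the hadoop user, and the key handling here are illustrative, not final):

import subprocess

def ssh_cat(host, key_pair_file, remote_path):
    """Return the contents of a remote log file via the ssh binary."""
    return subprocess.Popen(
        ['ssh', '-i', key_pair_file, '-o', 'StrictHostKeyChecking=no',
         'hadoop@%s' % host, 'cat', remote_path],
        stdout=subprocess.PIPE).communicate()[0]

def scp_get(host, key_pair_file, remote_path, local_path):
    """Copy a remote log file down with scp."""
    subprocess.check_call(
        ['scp', '-i', key_pair_file, '-o', 'StrictHostKeyChecking=no',
         'hadoop@%s:%s' % (host, remote_path), local_path])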

@irskep
Contributor

irskep commented Jun 14, 2011

Can do. My current strategy is to copy the relevant functions (ls/get for S3, local, and SSH, plus their dependencies) into instance methods and helpers on the fetchers so that logfetch can be used independently. Then I'll write a tool around it, verify it by hand on various cases, add mocking for SSH plus automated tests, and finally insert it into EMRJobRunner, removing the redundant functions.
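
In other words, the standalone tool would just drive a fetcher's ls()/get() interface, something like this sketch (the traceback scan is a placeholder for the real parsing):

import os
import tempfile

def fetch_and_scan_logs(fetcher):
    """Download each task-attempt log and scan it for Python tracebacks."""
    local_dir = tempfile.mkdtemp()
    for remote_path in fetcher.ls('task-attempts'):
        local_path = os.path.join(local_dir, os.path.basename(remote_path))
        fetcher.get(remote_path, local_path)
        for line in open(local_path):
            if 'Traceback (most recent call last)' in line:
                print('possible cause of failure in %s' % remote_path)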

@coyotemarin
Collaborator Author

Sounds like a good plan.

ghost assigned irskep Jun 14, 2011
@irskep
Contributor

irskep commented Jun 14, 2011

New info: logs have slightly different paths on S3 vs local. Here's a quickref I'll put in the comments:

S3 location             Local location
/daemons                / (root)
/jobs                   /history
/node                   <not present>
/steps                  /steps
/task-attempts          /userlogs
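
The same mapping as a lookup, roughly how I'd record it in the code (the constant name is just illustrative):

S3_TO_LOCAL_LOG_PATHS = {
    '/daemons':       '/',          # root of the local log dir
    '/jobs':          '/history',
    '/node':          None,         # no local equivalent
    '/steps':         '/steps',
    '/task-attempts': '/userlogs',
}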

irskep pushed a commit to irskep/mrjob that referenced this issue Jun 18, 2011
coyotemarin pushed a commit that referenced this issue Jul 27, 2011
@irskep
Contributor

irskep commented Aug 1, 2011

I believe this can be closed unless it also encompasses a log fetching/parsing refactor.

@coyotemarin
Collaborator Author

Yup, thanks!

anusha-r added a commit that referenced this issue Mar 8, 2015
Pulling master from Yelp/mrjob
coyotemarin pushed a commit to coyotemarin/mrjob that referenced this issue Apr 24, 2016
merge master into Google dataproc