Sometimes one process node owns 2 tasks #8

Open
lyogavin opened this Issue · 4 comments

2 participants

Gavin Li Matthieu Morel
Gavin Li

When using S4, we found that one process node sometimes ends up owning 2 tasks. I did some investigation, and the handling of ConnectionLossException when creating the ephemeral node appears to be the problem. When the response from the ZooKeeper server times out, zookeeper.create() fails with ConnectionLossException even though the creation request may already have been sent to the server (see http://svn.apache.org/viewvc/hadoop/zookeeper/trunk/src/java/main/org/apache/zookeeper/ClientCnxn.java line 830). Our logs show that this is the case we ran into.

Maybe we should handle it the way HBase does (http://svn.apache.org/viewvc/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKUtil.java?view=markup): simply exit the process when that exception occurs, so that the whole process restarts.

Gavin Li

To be clearer, what happened was: a process node called zookeeper.create() to acquire a task, and the request was successfully sent to the ZooKeeper server, but the ZooKeeper I/O loop timed out before the response arrived, so zookeeper.create() failed with ConnectionLossException. The process node then ignored this exception and tried to acquire another task, which left it holding 2 tasks.
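The sequence above can be reproduced without a real ZooKeeper ensemble. The following is only a minimal sketch: `FakeCoordinator`, its `dropNextResponse()` switch, the local `ConnectionLossException` class, and the task paths are all hypothetical stand-ins invented for illustration, not S4 or ZooKeeper APIs. It compares a loop that ignores the lost response with one that stops as soon as the outcome of a create is unknown.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TaskAcquisitionDemo {

    /** Hypothetical stand-in for KeeperException.ConnectionLossException. */
    static class ConnectionLossException extends Exception {
        ConnectionLossException(String path) {
            super("connection lost while creating " + path);
        }
    }

    /** Hypothetical in-memory stand-in for the server: task path -> owner. */
    static class FakeCoordinator {
        private final Map<String, String> nodes = new LinkedHashMap<>();
        private boolean dropNextResponse;

        /** Makes the next successful create() lose its response. */
        void dropNextResponse() {
            dropNextResponse = true;
        }

        /** Applies the create first, then optionally "loses" the reply. */
        boolean create(String path, String owner) throws ConnectionLossException {
            if (nodes.containsKey(path)) {
                return false; // task already taken
            }
            nodes.put(path, owner); // the server has accepted the request
            if (dropNextResponse) {
                dropNextResponse = false;
                throw new ConnectionLossException(path); // client never sees the reply
            }
            return true;
        }

        /** Returns the task paths currently recorded for the given owner. */
        List<String> ownedBy(String owner) {
            List<String> owned = new ArrayList<>();
            for (Map.Entry<String, String> e : nodes.entrySet()) {
                if (e.getValue().equals(owner)) {
                    owned.add(e.getKey());
                }
            }
            return owned;
        }
    }

    /** Ignores the exception and moves on to the next task (the reported behaviour). */
    static void acquireIgnoringLoss(FakeCoordinator zk, String owner, List<String> tasks) {
        for (String task : tasks) {
            try {
                if (zk.create(task, owner)) {
                    return; // got a task
                }
            } catch (ConnectionLossException e) {
                // Swallowed: the loop continues although the create may have succeeded.
            }
        }
    }

    /** Stops at the first ambiguous failure; a real process would exit and restart here. */
    static boolean acquireFailFast(FakeCoordinator zk, String owner, List<String> tasks) {
        for (String task : tasks) {
            try {
                if (zk.create(task, owner)) {
                    return true;
                }
            } catch (ConnectionLossException e) {
                return false; // outcome unknown, so do not try another task
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> tasks = List.of("/tasks/task-0", "/tasks/task-1");

        FakeCoordinator first = new FakeCoordinator();
        first.dropNextResponse();
        acquireIgnoringLoss(first, "node-A", tasks);
        System.out.println("ignoring loss: " + first.ownedBy("node-A"));

        FakeCoordinator second = new FakeCoordinator();
        second.dropNextResponse();
        acquireFailFast(second, "node-A", tasks);
        System.out.println("fail fast:     " + second.ownedBy("node-A"));
    }
}
```

In this simulation the fail-fast variant still leaves one node behind. With real ZooKeeper, exiting the process ends the session, and the server removes that session's ephemeral nodes once the session is closed or expires, which is why restarting the whole process is a safe recovery.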

Gavin Li

I created a patch:

diff --git a/s4-comm/src/main/java/io/s4/comm/zk/ZkTaskManager.java b/s4-comm/src/main/java/io/s4/comm/zk/ZkTaskManager.java
index 8c0c120..052701c 100644
--- a/s4-comm/src/main/java/io/s4/comm/zk/ZkTaskManager.java
+++ b/s4-comm/src/main/java/io/s4/comm/zk/ZkTaskManager.java
@@ -128,6 +128,7 @@ public class ZkTaskManager extends DefaultWatcher implements TaskManager {
                 map.put("taskSize", "" + tasks.size());
                 map.put("tasksRootNode", tasksListRoot);
                 map.put("processRootNode", processListRoot);
+
                 String create = zk.create(pNode,
                         JSONUtil.toJsonString(map)
                                 .getBytes(),
@@ -142,10 +143,15 @@ public class ZkTaskManager extends DefaultWatcher implements TaskManager {
                 logger.info("No task available to take up. Going to wait");
                 mutex.wait();
             }
+        } catch (KeeperException.NodeExistsException e) {
+            logger.info("znode already exist, retry..." + e.getMessage(), e);
         } catch (KeeperException e) {
-            logger.info("Warn:mostly ignorable " + e.getMessage(), e);
+            logger.error("Unexpected zookeeper exception. Exit!!" + e.getMessage(), e);
+            System.exit(5);
         } catch (InterruptedException e) {
             logger.info("Warn:mostly ignorable " + e.getMessage(), e);
+            Thread.currentThread().interrupt();
+
         }
     }
 }

Matthieu Morel
Collaborator

Thanks for taking the time to clearly identify that issue.

We have just moved to Apache... could you report this on the Apache JIRA, so that it is tracked?

See: https://issues.apache.org/jira/browse/S4

Gavin Li

Sure, S4-3 created.
