Sometimes one process node owns 2 tasks #8

Open
lyogavin opened this Issue Oct 9, 2011 · 4 comments

lyogavin commented Oct 9, 2011

When using S4, we found that one process node sometimes ends up owning 2 tasks. After some investigation, it seems that the handling of ConnectionLossException when creating the ephemeral node is problematic. When the response from the ZooKeeper server times out, zookeeper.create() fails with ConnectionLossException even though the creation request may already have been sent to the server (see http://svn.apache.org/viewvc/hadoop/zookeeper/trunk/src/java/main/org/apache/zookeeper/ClientCnxn.java line 830). Our logs show that this is the case we ran into.

Maybe we should handle it the way HBase does (http://svn.apache.org/viewvc/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKUtil.java?view=markup): simply exit the process when that exception occurs, so that the whole process restarts.

lyogavin commented Oct 9, 2011

To be clearer, what happened was this: a process node called zookeeper.create() to acquire a task, and the request was successfully sent to the ZooKeeper server, but the ZooKeeper IO loop timed out before the response arrived. zookeeper.create() therefore failed with ConnectionLossException. The process node ignored this exception and tried to acquire another task, so it ended up with 2 tasks.
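
The sequence above can be reproduced without a ZooKeeper cluster. The following is only a minimal sketch under stated assumptions: `FakeStore`, `ConnectionLost`, `NodeExists`, `claimIgnoringErrors` and `claimFailFast` are hypothetical names invented for illustration, not S4 or ZooKeeper APIs. `FakeStore` is an in-memory stand-in for the server that applies a create and then drops the reply, and throwing `IllegalStateException` stands in for the process exit proposed above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TaskClaimSketch {

    /** Hypothetical stand-in for KeeperException.ConnectionLossException. */
    static class ConnectionLost extends Exception {}

    /** Hypothetical stand-in for KeeperException.NodeExistsException. */
    static class NodeExists extends Exception {}

    /** In-memory stand-in for the ZooKeeper server: task path -> owning session. */
    static class FakeStore {
        private final Map<String, String> owners = new HashMap<>();
        int repliesToDrop; // how many successful creates lose their reply

        void create(String path, String session) throws ConnectionLost, NodeExists {
            if (owners.containsKey(path)) {
                throw new NodeExists();
            }
            owners.put(path, session);      // the server applies the request...
            if (repliesToDrop > 0) {
                repliesToDrop--;
                throw new ConnectionLost(); // ...but the reply never reaches the client
            }
        }

        /** Ephemeral nodes disappear when their session ends. */
        void endSession(String session) {
            owners.values().removeIf(session::equals);
        }

        List<String> ownedBy(String session) {
            List<String> paths = new ArrayList<>();
            for (Map.Entry<String, String> e : owners.entrySet()) {
                if (e.getValue().equals(session)) {
                    paths.add(e.getKey());
                }
            }
            return paths;
        }
    }

    /** Behaviour described in the issue: swallow the failure and try the next task. */
    static String claimIgnoringErrors(FakeStore store, List<String> tasks, String session) {
        for (String task : tasks) {
            try {
                store.create(task, session);
                return task;
            } catch (ConnectionLost | NodeExists e) {
                // ignored, so a task that was in fact created is forgotten
            }
        }
        return null;
    }

    /** Proposed behaviour: only "node exists" is safe to skip; anything else aborts. */
    static String claimFailFast(FakeStore store, List<String> tasks, String session) {
        for (String task : tasks) {
            try {
                store.create(task, session);
                return task;
            } catch (NodeExists e) {
                // taken by another process, try the next task
            } catch (ConnectionLost e) {
                // a real process would exit here; the sketch throws instead
                throw new IllegalStateException("outcome of create() is unknown", e);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> tasks = Arrays.asList("/tasks/task-0", "/tasks/task-1");

        FakeStore buggy = new FakeStore();
        buggy.repliesToDrop = 1;
        claimIgnoringErrors(buggy, tasks, "session-A");
        System.out.println("ignoring errors: tasks owned = " + buggy.ownedBy("session-A").size());

        FakeStore fixed = new FakeStore();
        fixed.repliesToDrop = 1;
        try {
            claimFailFast(fixed, tasks, "session-A");
        } catch (IllegalStateException e) {
            fixed.endSession("session-A");            // process exits, its session ends
            claimFailFast(fixed, tasks, "session-B"); // restarted process claims again
        }
        System.out.println("fail fast: tasks owned = " + fixed.ownedBy("session-B").size());
    }
}
```

In this sketch the first variant leaves session-A owning two task nodes, while the second ends with the restarted process owning exactly one. The fail-fast variant relies on the task nodes being ephemeral, so that ending the session removes whatever the failed create() left behind.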

lyogavin commented Oct 9, 2011

I created a patch:

diff --git a/s4-comm/src/main/java/io/s4/comm/zk/ZkTaskManager.java b/s4-comm/src/main/java/io/s4/comm/zk/ZkTaskManager.java
index 8c0c120..052701c 100644
--- a/s4-comm/src/main/java/io/s4/comm/zk/ZkTaskManager.java
+++ b/s4-comm/src/main/java/io/s4/comm/zk/ZkTaskManager.java
@@ -128,6 +128,7 @@ public class ZkTaskManager extends DefaultWatcher implements TaskManager {
map.put("taskSize", "" + tasks.size());
map.put("tasksRootNode", tasksListRoot);
map.put("processRootNode", processListRoot);
+
String create = zk.create(pNode,
JSONUtil.toJsonString(map)
.getBytes(),
@@ -142,10 +143,15 @@ public class ZkTaskManager extends DefaultWatcher implements TaskManager {
logger.info("No task available to take up. Going to wait");
mutex.wait();
}
+            } catch (KeeperException.NodeExistsException e) {
+                logger.info("znode already exist, retry..." + e.getMessage(), e);
             } catch (KeeperException e) {
-                logger.info("Warn:mostly ignorable " + e.getMessage(), e);
+                logger.error("Unexpected zookeeper exception. Exit!!" + e.getMessage(), e);
+                System.exit(5);
             } catch (InterruptedException e) {
                 logger.info("Warn:mostly ignorable " + e.getMessage(), e);
+                Thread.currentThread().interrupt();
+        }
         }
 
      }

Thanks for taking the time to clearly identify that issue.

We have just moved to Apache. Could you report this on the Apache JIRA, to make sure it is tracked?

See: https://issues.apache.org/jira/browse/S4

Sure, S4-3 created.