
Add support for `yarn-cluster` with high availability #905

Merged

javierluraschi merged 7 commits into master from feature/yarn-high-availability on Sep 7, 2017

Conversation

@javierluraschi (Member) commented Aug 7, 2017:

See #903; this adds support for yarn-cluster in high-availability configurations.

When a cluster is configured for HA, it may list multiple resource managers; the active one is tracked by yarn.resourcemanager.ha.id, which can then be used to retrieve the actual address from yarn.resourcemanager.address.<haid>.

See also: https://www.cloudera.com/documentation/enterprise/5-4-x/topics/cdh_hag_rm_ha_config.html#concept_skx_n1z_vl__table_lbx_g21_wl
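
The resolution logic amounts to: if yarn.resourcemanager.ha.enabled is true, read yarn.resourcemanager.ha.id and use it to suffix the address key. A minimal sketch of that lookup in Python (the property names are real YARN keys; the helper functions are illustrative, not sparklyr's actual implementation):

import xml.etree.ElementTree as ET

def yarn_site_properties(path):
    """Parse yarn-site.xml into a plain {name: value} dict."""
    root = ET.parse(path).getroot()
    return {prop.findtext("name"): prop.findtext("value")
            for prop in root.findall("property")}

def resource_manager_address(props):
    """Resolve the RM address, honoring HA configuration when present."""
    if props.get("yarn.resourcemanager.ha.enabled") == "true":
        # In HA mode, yarn.resourcemanager.ha.id names the RM to use, and
        # its address lives under yarn.resourcemanager.address.<haid>.
        ha_id = props["yarn.resourcemanager.ha.id"]
        return props["yarn.resourcemanager.address." + ha_id]
    # Non-HA clusters use the plain property.
    return props.get("yarn.resourcemanager.address")

Against config (2) below, this returns ip-172-30-1-13.us-west-2.compute.internal:8032; against (3), it returns the unreachable rm2 address, which is what motivates the fallback logic discussed later.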

Tested a couple of scenarios:

  1. Basic non-HA in yarn-cluster.
  2. HA with no fallback: the main resource manager responds and is used.
  3. HA with fallback: the main resource manager fails to respond.

For reference, config files for (2) and (3) follow:

HA with no fallback:

<?xml version="1.0" encoding="UTF-8"?>

<!--Autogenerated by Cloudera Manager-->
<configuration>
  <property>
    <name>yarn.acl.enable</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.admin.acl</name>
    <value>*</value>
  </property>
  <!-- HA Start-->
  <property>
    <name>yarn.resourcemanager.ha.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.rm-ids</name>
    <value>rm1,rm2</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.id</name>
    <value>rm1</value>
  </property>
  <!-- HA End -->
  <!-- RM1 Start -->
  <property>
    <name>yarn.resourcemanager.address.rm1</name>
    <value>ip-172-30-1-13.us-west-2.compute.internal:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address.rm1</name>
    <value>ip-172-30-1-13.us-west-2.compute.internal:8033</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address.rm1</name>
    <value>ip-172-30-1-13.us-west-2.compute.internal:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address.rm1</name>
    <value>ip-172-30-1-13.us-west-2.compute.internal:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address.rm1</name>
    <value>ip-172-30-1-13.us-west-2.compute.internal:8088</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.https.address.rm1</name>
    <value>ip-172-30-1-13.us-west-2.compute.internal:8090</value>
  </property>
  <!-- RM1 End -->
  <!-- RM2 Start -->
  <property>
    <name>yarn.resourcemanager.address.rm2</name>
    <value>bad-ip-172-30-1-13.us-west-2.compute.internal:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address.rm2</name>
    <value>bad-ip-172-30-1-13.us-west-2.compute.internal:8033</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address.rm2</name>
    <value>bad-ip-172-30-1-13.us-west-2.compute.internal:8030</value>
  </property>
  <property>
<name>yarn.resourcemanager.resource-tracker.address.rm2</name>
    <value>bad-ip-172-30-1-13.us-west-2.compute.internal:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address.rm2</name>
    <value>bad-ip-172-30-1-13.us-west-2.compute.internal:8088</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.https.address.rm2</name>
    <value>bad-ip-172-30-1-13.us-west-2.compute.internal:8090</value>
  </property>
  <!-- RM2 End -->
  <property>
    <name>yarn.resourcemanager.client.thread-count</name>
    <value>50</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.client.thread-count</name>
    <value>50</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.client.thread-count</name>
    <value>1</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1024</value>
  </property>
  <property>
    <name>yarn.scheduler.increment-allocation-mb</name>
    <value>512</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>65536</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-vcores</name>
    <value>1</value>
  </property>
  <property>
    <name>yarn.scheduler.increment-allocation-vcores</name>
    <value>1</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>32</value>
  </property>
  <property>
    <name>yarn.resourcemanager.amliveliness-monitor.interval-ms</name>
    <value>1000</value>
  </property>
  <property>
    <name>yarn.am.liveness-monitor.expiry-interval-ms</name>
    <value>600000</value>
  </property>
  <property>
    <name>yarn.resourcemanager.am.max-attempts</name>
    <value>2</value>
  </property>
  <property>
    <name>yarn.resourcemanager.container.liveness-monitor.interval-ms</name>
    <value>600000</value>
  </property>
  <property>
    <name>yarn.resourcemanager.nm.liveness-monitor.interval-ms</name>
    <value>1000</value>
  </property>
  <property>
    <name>yarn.nm.liveness-monitor.expiry-interval-ms</name>
    <value>600000</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.client.thread-count</name>
    <value>50</value>
  </property>
  <property>
    <name>yarn.application.classpath</name>
    <value>$HADOOP_CLIENT_CONF_DIR,$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,$HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.max-completed-applications</name>
    <value>10000</value>
  </property>
  <property>
    <name>yarn.nodemanager.remote-app-log-dir</name>
    <value>/tmp/logs</value>
  </property>
  <property>
    <name>yarn.nodemanager.remote-app-log-dir-suffix</name>
    <value>logs</value>
  </property>
</configuration>

(3) is the same as (2), but changing:

  <property>
    <name>yarn.resourcemanager.ha.id</name>
    <value>rm2</value>
  </property>
@javierluraschi (Member, Author) commented Aug 7, 2017:

This needs more work; it is not sufficient to pick based on the .id property. We need to use the REST API to find which RM is up and where the application is running.
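
As a sketch of that approach: each configured RM exposes a REST endpoint, /ws/v1/cluster/info, that reports its HA state, so a client can probe the candidates and keep the one that answers ACTIVE. A minimal illustration in Python, assuming the yarn.resourcemanager.webapp.address.* values from the configs above (the endpoint is part of the YARN ResourceManager REST API; the helper name and timeout are made up):

import json
import urllib.request

def active_resource_manager(webapp_addresses, timeout=5):
    """Return the first RM webapp address reporting haState ACTIVE."""
    for addr in webapp_addresses:
        url = "http://%s/ws/v1/cluster/info" % addr
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                info = json.load(resp)["clusterInfo"]
        except OSError:
            continue  # RM unreachable; try the next candidate (the fallback case)
        if info.get("haState") == "ACTIVE":
            return addr
    return None

For config (3), probing the rm2 address fails while rm1 answers ACTIVE, so rm1 is selected even though ha.id points at rm2.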

@javierluraschi javierluraschi changed the title Add support for `yarn-cluster` with high availability WIP: Add support for `yarn-cluster` with high availability Aug 8, 2017

@javierluraschi javierluraschi force-pushed the feature/yarn-high-availability branch from d2bacef to 7c35eb0 Sep 7, 2017

@javierluraschi javierluraschi force-pushed the feature/yarn-high-availability branch from 7c35eb0 to 288c9ef Sep 7, 2017

@javierluraschi javierluraschi changed the title WIP: Add support for `yarn-cluster` with high availability Add support for `yarn-cluster` with high availability Sep 7, 2017

@javierluraschi javierluraschi merged commit 2e54d26 into master Sep 7, 2017

2 checks passed:

continuous-integration/travis-ci/pr: The Travis CI build passed
continuous-integration/travis-ci/push: The Travis CI build passed