
Refresh Project Nodes does not always work #3967

Closed
eblikstad opened this issue Aug 31, 2018 · 8 comments


Describe the bug
The job step "Refresh Project Nodes" does not always work. In a workflow where a node resource is added and then a refresh step is run, a third step that refers to the newly added node sometimes fails with "no nodes matched" (roughly 1 run in 10).

My Rundeck detail

  • Rundeck version: 2.10.6
  • install type: launcher
  • OS Name/version: Windows 2012 R2
  • DB Type/version: Microsoft SQL Server

To Reproduce
Steps to reproduce the behavior:

  1. Create a new job
  2. Add a job step which adds a node
  3. Add a refresh project nodes step
  4. Add a job reference step which has a node override for the newly added node
  5. Re-run this job until it fails

Expected behavior
The refresh project nodes step should always synchronously refresh the project's node resource XML files.

Screenshots
(screenshot attached in the original issue)


n-cc commented Mar 23, 2020

I believe I'm encountering a similar issue, and after discussion in #rundeck on freenode, I was directed here.

Describe the bug
I have a job that runs a command on a node in project A that creates a node in project B, then cross-references jobs in project B in order to run Ansible against the new node. To make the new node available, I cross-reference a job in project B that runs "Refresh Project Nodes" to refresh project B's node list. However, no sleep period for "Refresh Project Nodes" (I've tested up to 1 hour) makes the new node available to the cross-referenced jobs in project B. The following error is encountered:

No nodes matched for the filters: NodeSet{includes={name=nodename, dominant=false, }}

Immediately after the job fails with the above error, I can go to project B's node list in the web UI and confirm that nodename doesn't show up. After waiting ~3 minutes, it then shows up. This ~3 minute wait is consistent whether the "Refresh Project Nodes" job has a sleep period of 5 minutes or 1 hour. It seems the new node only shows up after the job fails.

My Rundeck detail

  • Rundeck version: 3.2.3-20200221
  • install type: deb
  • OS Name/version: Ubuntu 18.04
  • DB Type/version: local mysql

To Reproduce
Steps to reproduce the behavior:

  1. Create a new job in project A that:
  • Creates a node in project B
  • Runs a cross-project job in project B to run "Refresh Project Nodes," with any sleep period
  • Attempts to contact the new node via a cross-project job in project B
  2. Run this job and wait for the error pasted above
  3. Enter the Nodes tab of project B, confirm the new node doesn't show up, wait ~3 minutes, and it shows up

Expected behavior
The new node would be available after running "Refresh Project Nodes" (instead, it only shows up ~3 minutes after the job fails, regardless of the sleep value).

Screenshots
See the above post; it's essentially the same thing.

I was able to recreate this issue with all jobs in the same project.


MegaDrive68k commented Mar 23, 2020

Hi @n-cc

I found something with an example to reproduce (three jobs):

JobA (that adds a new server on Ansible node sources).

<joblist>
  <job>
    <defaultTab>nodes</defaultTab>
    <description></description>
    <executionEnabled>true</executionEnabled>
    <id>2f71b2c0-991b-4191-9373-feec9c8c7514</id>
    <loglevel>INFO</loglevel>
    <name>JobA</name>
    <nodeFilterEditable>false</nodeFilterEditable>
    <plugins />
    <scheduleEnabled>true</scheduleEnabled>
    <sequence keepgoing='false' strategy='node-first'>
      <command>
        <exec>echo "updating..."</exec>
      </command>
      <command>
        <fileExtension>.sh</fileExtension>
        <script><![CDATA[echo -e "\n192.168.33.22" >> /home/user/Downloads/hosts]]></script>
        <scriptargs />
        <scriptinterpreter>/bin/bash</scriptinterpreter>
      </command>
      <command>
        <step-plugin type='source-refresh-plugin'>
          <configuration>
            <entry key='sleep' value='15' />
          </configuration>
        </step-plugin>
      </command>
      <command>
        <exec>echo "done"</exec>
      </command>
    </sequence>
    <uuid>2f71b2c0-991b-4191-9373-feec9c8c7514</uuid>
  </job>
</joblist>

JobB (that executes a command on the future new node).

<joblist>
  <job>
    <defaultTab>nodes</defaultTab>
    <description></description>
    <dispatch>
      <excludePrecedence>true</excludePrecedence>
      <keepgoing>false</keepgoing>
      <rankOrder>ascending</rankOrder>
      <successOnEmptyNodeFilter>false</successOnEmptyNodeFilter>
      <threadcount>1</threadcount>
    </dispatch>
    <executionEnabled>true</executionEnabled>
    <id>7a616d10-bf61-4953-a09e-d294287269c4</id>
    <loglevel>INFO</loglevel>
    <name>JobB</name>
    <nodeFilterEditable>false</nodeFilterEditable>
    <nodefilters>
      <filter>192.168.33.22</filter>
    </nodefilters>
    <nodesSelectedByDefault>true</nodesSelectedByDefault>
    <plugins />
    <scheduleEnabled>true</scheduleEnabled>
    <sequence keepgoing='false' strategy='node-first'>
      <command>
        <exec>echo "hi"</exec>
      </command>
    </sequence>
    <uuid>7a616d10-bf61-4953-a09e-d294287269c4</uuid>
  </job>
</joblist>

ParentJob (that calls JobA and JobB).

<joblist>
  <job>
    <defaultTab>nodes</defaultTab>
    <description></description>
    <executionEnabled>true</executionEnabled>
    <id>5ff04e74-156c-4525-a666-76d3e5d30e72</id>
    <loglevel>INFO</loglevel>
    <name>Parent</name>
    <nodeFilterEditable>false</nodeFilterEditable>
    <plugins />
    <scheduleEnabled>true</scheduleEnabled>
    <sequence keepgoing='false' strategy='node-first'>
      <command>
        <jobref name='JobA' nodeStep='true'>
          <uuid>2f71b2c0-991b-4191-9373-feec9c8c7514</uuid>
        </jobref>
      </command>
      <command>
        <exec>echo "hi"</exec>
      </command>
      <command>
        <jobref name='JobB' nodeStep='true'>
          <uuid>7a616d10-bf61-4953-a09e-d294287269c4</uuid>
        </jobref>
      </command>
    </sequence>
    <uuid>5ff04e74-156c-4525-a666-76d3e5d30e72</uuid>
  </job>
</joblist>

Executing that parent job, I obtained the same error as reported:

(screenshot: failed execution)

But putting a single "sleep 10" in the parent job before the JobB execution makes it work like a charm (of course, keep in mind that the 192.168.33.22 node doesn't exist before the execution of JobB).

<joblist>
  <job>
    <defaultTab>nodes</defaultTab>
    <description></description>
    <executionEnabled>true</executionEnabled>
    <id>5ff04e74-156c-4525-a666-76d3e5d30e72</id>
    <loglevel>INFO</loglevel>
    <name>Parent</name>
    <nodeFilterEditable>false</nodeFilterEditable>
    <plugins />
    <scheduleEnabled>true</scheduleEnabled>
    <sequence keepgoing='false' strategy='node-first'>
      <command>
        <jobref name='JobA' nodeStep='true'>
          <uuid>2f71b2c0-991b-4191-9373-feec9c8c7514</uuid>
        </jobref>
      </command>
      <command>
        <exec>sleep 10</exec>
      </command>
      <command>
        <jobref name='JobB' nodeStep='true'>
          <uuid>7a616d10-bf61-4953-a09e-d294287269c4</uuid>
        </jobref>
      </command>
    </sequence>
    <uuid>5ff04e74-156c-4525-a666-76d3e5d30e72</uuid>
  </job>
</joblist>

Result:

(screenshot: successful execution)

Perhaps the timing of the "Refresh Project Nodes" step needs some adjustment to work correctly.

Workaround: put a "sleep 10" before the step that points to the new remote node.

Hope it helps!
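A variation on the fixed sleep is to poll until the node is actually visible, instead of guessing a delay. This is only a sketch: the RD_URL, API_TOKEN, and PROJECT variables are hypothetical, and the resource-info endpoint/API version should be verified against your Rundeck install before relying on it.

```shell
#!/bin/sh
# Sketch of a polling workaround (hypothetical env vars RD_URL, API_TOKEN,
# PROJECT): retry until Rundeck reports the node, rather than a fixed sleep.
wait_for_node() {
  node="$1"
  tries="${2:-30}"   # number of polls before giving up
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -f makes curl return non-zero on HTTP errors (e.g. 404 while missing)
    if curl -sf -H "X-Rundeck-Auth-Token: $API_TOKEN" \
        "$RD_URL/api/14/project/$PROJECT/resource/$node" >/dev/null; then
      echo "node $node is visible"
      return 0
    fi
    i=$((i + 1))
    sleep 10
  done
  echo "node $node never appeared" >&2
  return 1
}
```

A step like `wait_for_node 192.168.33.22` could then replace the fixed `sleep 10` between the JobA and JobB references.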

@jtobard jtobard self-assigned this Mar 24, 2020

n-cc commented Apr 2, 2020

Thanks for the tip - looks like the sleep allows the node to be found when all jobs are in the same project, but I'm still having issues getting it to work cross-project. I'll do a bit more testing.


n-cc commented Apr 29, 2020

I'm not sure what changed on my side, but the local sleep is no longer working for new nodes when all jobs are within the same project. No length of sleep, before or after the Refresh Project Nodes step (or even without it), allows new nodes to be recognized from within the same job that creates them. Edit: not entirely true; see the failure rates for different sleep values below.


n-cc commented May 1, 2020

A bit more insight into the problem: we are using the Ansible plugin, and in the project's Nodes -> Sources tab we have an Ansible Resource Model source with the inventory file path set to a local directory on the Rundeck server. Inside this directory is a script that Ansible runs to grab a list of remote hosts/nodes (scripts in inventory directories get run automatically, and the returned hosts are added to the inventory). The plugin parses this output to create the project's list of nodes.

The job I'm testing is simple - it runs a command to create a new node on a remote host. I can run the script inside the inventory directory to confirm the new node is immediately showing up. It then refreshes the project nodes, which in the log output generates a bunch of these lines:

13:17:38 |   | node1 -> localhost]
13:17:39 |   | node2 -> localhost]
13:17:40 |   | node3 -> localhost]

The newly created node DOES show up in this output, but after sleeping for X minutes (see the failure-rate data below), Rundeck fails with the "No nodes matched for the filters..." error when it attempts to contact the node.

In short:

Create node -> refresh project nodes -> sleep X -> attempt to contact node

Here are my results of testing different sleep values after the refresh project nodes step (a success means it could contact the node and run steps, failure means it hit the "No nodes matched for the filters" error):

10 second sleep: 3 out of 8 jobs succeeded

3 minute sleep: 3 out of 8 jobs succeeded

10 minute sleep: 12 out of 12 succeeded

Given the consistent success of jobs that sleep for 10 minutes, I'm wondering if there's something internal to Rundeck that's causing a 10-or-so minute delay in allowing nodes to be reliably referenced.


n-cc commented May 5, 2020

Today I created a fresh Rundeck server, configured a project similar to the project on the server above, gave it the same list of nodes, and gave it the same jobs. I was able to run the same job (create node, access node) that produced a ~50% failure rate as noted above with a 100% success rate. I tried using 1 and 3 minute sleeps after the Refresh Project nodes, and was unable to recreate the "No nodes matched for the filters" error.

Since the node list, jobs, and project configuration are all identical between the two Rundeck servers, the issue with the first server outlined in the above comment must lie somewhere else. Is it possible some sort of garbage buildup on servers that have been running for a while causes the delay in node recognition we've been seeing? That's my first thought, since the only things dissimilar between the two servers are their history and database.


n-cc commented Jul 24, 2020

Ping on ^... The fact that a new Rundeck server doesn't experience the same delay in node recognition as a long-running Rundeck server, even when both have the same node list and the same project/job configuration, leads me to believe there's an issue on the application's side. Are there any tunables we can tweak on long-running servers, or anything else we can look into to help debug this issue?
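One set of tunables that may be relevant here are Rundeck's project node-cache settings. The property names below are taken from Rundeck's node-cache documentation, but I'm not certain they explain the delay seen in this thread, so treat this as something to verify rather than a confirmed fix:

```properties
# project.properties (names per Rundeck's node-cache docs; verify for your version)
project.nodeCache.enabled=true
# seconds before cached node data is considered stale
project.nodeCache.delay=30
# load the node source synchronously on first request
project.nodeCache.firstLoadSynch=true
```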


stale bot commented Jul 24, 2021

In an effort to focus on bugs and issues that impact currently supported versions of Rundeck, we have elected to notify GitHub issue creators if their issue is classified as stale and close the issue. An issue is identified as stale when there have been no new comments, responses or other activity within the last 12 months. If a closed issue is still present please feel free to open a new Issue against the current version and we will review it. If you are an enterprise customer, please contact your Rundeck Support to assist in your request.
Thank you, The Rundeck Team


4 participants