The CI infra is busted, late June edition #21108

Closed
SimonSapin opened this issue Jun 30, 2018 · 6 comments


SimonSapin commented Jun 30, 2018

The homu queue was stuck this morning, with 6 PRs ready but none pending, and buildbot being idle. After restarting them both, homu went through the entire queue and everything failed with:

error in RunProcess._startCommand
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/buildslave/runprocess.py", line 426, in start
    self._startCommand()
  File "/usr/local/lib/python2.7/dist-packages/buildslave/runprocess.py", line 550, in _startCommand
    usePTY=self.usePTY)
  File "/usr/local/lib/python2.7/dist-packages/buildslave/runprocess.py", line 572, in _spawnProcess
    processProtocol, uid, gid, childFDs)
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/process.py", line 728, in __init__
    self._fork(path, uid, gid, executable, args, environment, fdmap=fdmap)
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/process.py", line 405, in _fork
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory
SimonSapin commented Jun 30, 2018

Free RAM on the Linux builders:

% for host in $(curl -s http://build.servo.org/json/slaves|jq 'keys | .[]' -r -|grep linux); do echo ====== $host; ssh -o 'ConnectTimeout 2' $host.servo.org free -h|grep buffers; done 
====== servo-linux-cross1
             total       used       free     shared    buffers     cached
-/+ buffers/cache:       538M       6.8G
====== servo-linux-cross2
             total       used       free     shared    buffers     cached
-/+ buffers/cache:       506M       6.8G
====== servo-linux-cross3
             total       used       free     shared    buffers     cached
-/+ buffers/cache:       5.0G       2.3G
====== servo-linux1
             total       used       free     shared    buffers     cached
-/+ buffers/cache:       1.6G        27G
====== servo-linux2
             total       used       free     shared    buffers     cached
-/+ buffers/cache:       2.3G        27G
====== servo-linux3
ssh: connect to host servo-linux3.servo.org port 22: Connection timed out
====== servo-linux4
             total       used       free     shared    buffers     cached
-/+ buffers/cache:       1.4G        27G
====== servo-linux5
             total       used       free     shared    buffers     cached
-/+ buffers/cache:       1.6G        27G
====== servo-linux6
             total       used       free     shared    buffers     cached
-/+ buffers/cache:       3.1G        26G

servo-linux-cross3 is the outlier, and it seems to match the failing jobs. Its buildslave process (/usr/bin/python /usr/local/bin/buildslave start --nodaemon /home/servo/buildbot/slave) is using 4.7 GB RSS while idle. For comparison, the same process on servo-linux-cross2 takes 22 MB.
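For anyone repeating this check, here is a rough sketch of how the report above could be scanned automatically instead of by eye. It assumes the exact output layout shown above (a `====== host` line followed by the old-style `-/+ buffers/cache:` row from `free -h`). The function names and the 4 GiB threshold are hypothetical, picked only so that the numbers above single out one host; they are not part of the Servo infra.

```python
import re

# Multipliers that convert the size suffixes printed by `free -h` to GiB.
_UNIT_TO_GIB = {"K": 1.0 / 1024 ** 2, "M": 1.0 / 1024, "G": 1.0, "T": 1024.0}


def to_gib(size):
    """Convert a string such as '538M' or '6.8G' to a float in GiB."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([KMGT])", size)
    if match is None:
        raise ValueError("unrecognised size: %r" % size)
    return float(match.group(1)) * _UNIT_TO_GIB[match.group(2)]


def used_memory_by_host(report):
    """Map each host to its 'used' figure from the -/+ buffers/cache row.

    Hosts with no such row (for example, an ssh timeout) are left out.
    """
    used = {}
    host = None
    for line in report.splitlines():
        line = line.strip()
        if line.startswith("======"):
            host = line.lstrip("=").strip()
        elif line.startswith("-/+ buffers/cache:") and host is not None:
            fields = line.split(":", 1)[1].split()
            used[host] = to_gib(fields[0])
    return used


def heavy_hosts(used, threshold_gib=4.0):
    """Return hosts whose used memory is at or above the (hypothetical) threshold."""
    return sorted(h for h, gib in used.items() if gib >= threshold_gib)
```

Feeding the report above to `heavy_hosts(used_memory_by_host(report))` would flag only servo-linux-cross3, the one host at or above 4 GiB used. A fixed threshold is a crude heuristic, since the cross builders and the regular builders have different amounts of RAM.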

Clicking "Graceful shutdown" on http://build.servo.org/buildslaves/servo-linux-cross3 solves the RAM issue, but now that builder is disconnected.

Running salt "servo-linux-cross3" state.highstate 2>&1 --force-color | tee salt_highstate_issue_servo_21108.log on servo-master did not help. It ended with Failed: 2; the failures (shown in red) include:

          ID: /home/servo/buildbot/slave
    Function: file.recurse
      Result: False
     Comment: Recurse failed: none of the specified sources were found
     Started: 10:12:17.842137
    Duration: 2026.132 ms
          ID: buildbot-slave
    Function: service.running
      Result: False
     Comment: One or more requisite failed: buildbot.slave./home/servo/buildbot/slave

SimonSapin commented Jun 30, 2018

Now seeing this on servo-mac9:

exceptions.Exception: Actual commit (ef4246537da6888de0d7a4274ebd26faf52c9bf2) differs from requested commit (299cd13a4f4d5138f4032b20c41ddd2dfcb3ffd0)

Steps suggested by the wiki show… no good sign:

salt 'servo-mac*' cmd.run 'find /Users/servo/buildbot/slave -path "*/.git/index.lock"'
servo-mac8:
servo-mac3:
servo-mac4:
servo-mac5:
    Minion did not return. [Not connected]
servo-mac7:
    Minion did not return. [Not connected]
servo-mac6:
    Minion did not return. [Not connected]
servo-mac1:
    Minion did not return. [Not connected]
servo-mac2:
    Minion did not return. [Not connected]
servo-mac9:
    Minion did not return. [Not connected]

SSHing directly is somewhat better:

% for i in $(seq 1 9); do echo $i; ssh servo-mac$i.servo.org -o 'ConnectTimeout 2' 'find /Users/servo/buildbot/slave -path "*/.git/index.lock"'; done         
1
ssh: connect to host servo-mac1.servo.org port 22: Connection timed out
2
ssh: connect to host servo-mac2.servo.org port 22: Connection timed out
3
4
5
find: /Users/servo/buildbot/slave: No such file or directory
6
7
8
9
/Users/servo/buildbot/slave/mac-rel-wpt4/build/.git/index.lock

This seems to help:

ssh servo-mac9.servo.org rm /Users/servo/buildbot/slave/mac-rel-wpt4/build/.git/index.lock
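As a rough local equivalent of the `find ... -path "*/.git/index.lock"` check above, the following sketch walks a directory tree and lists leftover lock files. The function name is hypothetical, and it is only an approximation of the `find` invocation: it reports regular entries named `index.lock` directly inside a `.git` directory. Git creates this file while it updates the index, so a lock should only be removed when no git process is still running in that checkout.

```python
import os


def find_index_locks(root):
    """Return the sorted paths of every .git/index.lock file under root."""
    locks = []
    for dirpath, _dirnames, filenames in os.walk(root):
        # Only a lock sitting directly inside a .git directory counts.
        if os.path.basename(dirpath) == ".git" and "index.lock" in filenames:
            locks.append(os.path.join(dirpath, "index.lock"))
    return sorted(locks)


if __name__ == "__main__":
    # Path taken from the commands above; adjust for the host being checked.
    for path in find_index_locks("/Users/servo/buildbot/slave"):
        print(path)
```

Unlike `find`, `os.walk` silently yields nothing when the root directory does not exist, so an empty result here does not distinguish "no locks" from "no buildbot directory" (the servo-mac5 case above).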

SimonSapin commented Jun 30, 2018

Current status: infra seems to be unblocked (#21102 has landed), but it is running at reduced capacity. servo-linux-cross3 is not connected to buildbot because I shut it down and didn’t figure out how to bring it back up. servo-mac1 and servo-mac5 are also disconnected, but I don’t know why.


edunham commented Jul 2, 2018

Cross3 is back -- service buildbot-slave start on the host fixed it, though highstate errored.


SimonSapin commented Jul 2, 2018

Thank you!
