Time to blog about the awesomeness going on with the worker and new vm setup #196

Merged
merged 5 commits into from Jan 24, 2013

Owner

joshk commented Jan 24, 2013

Comment away!

We have a total of 26 worker boxes.

"maintenance"

@roidrage roidrage commented on an outdated diff Jan 24, 2013

blog/_posts/2013-01-25-the-worker-gets-a-revamp.md
+VirtualBox has got us very far. It was great for development, had some fantastic features like snapshots and immutable disk images, and had some great tools built around it like [vagrant](http://www.vagrantup.com/).
+
+As we grew from one worker box to our current 24, maintenance of the VMs became a pain. Updating a worker took up to an hour as each VM on the host had to be provisioned and primed for use. Also, because of how VirtualBox works, we had to plan for how many Ruby boxes we would run, or how many Perl/Python/PHP (PPP) or JVM boxes we needed. To make a long story short, we could not easily dynamically decide what builds a host box could or would run.
+
+And of course there are the API issues and VirtualBox-specific errors, like trying to shut down VMs which looked like they were shutting down but were actually stuck. Initially we implemented a crude 'kill -9' trick which would detect this error and then, well, shell out and kill the VM process using its process ID, which seemed to work for a while, but was not foolproof by any means. In fact, it was merely a band-aid around a more complicated issue of 'what does the future architecture of Travis look like?'
+
+The great thing is we finally have the answer to this question, and are very happy to say we now know what our next-gen architecture will look like. In fact, we have been testing it with the Rails and Spree queues, and since a week ago, the JVM queue. And over the next week or two we will be moving all queues to this new setup.
+
+And don't worry [Travis Pro customers](http://about.travis-ci.org/blog/2012-10-25-the-travis-plans/), we have been running a beta setup for a small set of customers too and it is working beautifully!
+
+So what is this new setup you ask?
+
+We will save most of these details for a later blog post after we've ironed out some of the bugs, but we will be partnering with a server hosting provider in the States who will be running a private cloud for us. This private cloud, backed with SSDs, will allow us to offer the awesome users of Travis greater resource allocations (3 GB of RAM, double what we currently offer), and we are also looking at offering users the ability to pick their VM type, like the ability to test on 32-bit Ubuntu AS WELL AS 64-bit.
+
+This is a huge maintainence relief for us as we can now focus on Travis features instead of having to maintain servers. We also pledge to update VMs more often so you have the latest and greatest services available for you to test against!
+
@roidrage

roidrage Jan 24, 2013

Owner

maintenance

@roidrage roidrage commented on an outdated diff Jan 24, 2013

blog/_posts/2013-01-25-the-worker-gets-a-revamp.md
+
+About a month ago our amazing Sven had an idea, he thought it was a bit crazy at first so coded it mostly as an experiment, but it was such a super smart idea we just had to use it as soon as possible. Mind blowingly smart! (mind blown pic)
+
+<figure class="small right">
+ [ ![Spend 5 minutes with Sven and this is what happens to you!](http://www.reactiongifs.com/wp-content/uploads/2011/09/mind_blown.gif) ](http://www.reactiongifs.com/wp-content/uploads/2011/09/mind_blown.gif)
+ <figcaption>Spend 5 minutes with Sven and this is what happens to you!</figcaption>
+</figure>
+
+Instead of us running command after command using net-ssh-shell, we now create a shell script which includes all the commands we need to run, upload that to the VM, and then execute it! Boom! This means we now only need to run one command and capture the output and exit code, all of which is covered by the standard SSH spec. Even better, we now have a script you can run locally on a Linux or Mac machine to replicate exactly what we do!
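
To make the idea in the paragraph above concrete, here is a minimal Ruby sketch. It is not the actual travis-build code: `compile_script` and `run_script` are hypothetical names, and the script is run locally with `sh` instead of being uploaded to a VM over SSH. It only illustrates the point being made, which is one script, one invocation, one output stream and one exit code.

```ruby
require "open3"
require "tempfile"

# Hypothetical helper: join the build commands into a single shell script.
# `set -e` stops the script at the first failing command.
def compile_script(commands)
  (["#!/bin/sh", "set -e"] + commands).join("\n") + "\n"
end

# Hypothetical helper: run the script with one command and return the
# combined output (STDOUT and STDERR) together with the exit code.
# Assumption: runs locally with `sh`; the real worker would upload the
# script to the VM and execute it there.
def run_script(script)
  Tempfile.create(["build", ".sh"]) do |file|
    file.write(script)
    file.flush
    output, status = Open3.capture2e("sh", file.path)
    [output, status.exitstatus]
  end
end

script = compile_script(["export GREETING=hello", 'echo "$GREETING"'])
output, exit_code = run_script(script)
puts output
puts "exit code: #{exit_code}"
```

Because every command runs inside the same shell, the exported variable is still visible to the later `echo`, so no tricks are needed to preserve the environment between commands.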
+
+Welcome to the new Travis Build, which can be found on the [sf-compile-sh](https://github.com/travis-ci/travis-build/tree/sf-compile-sh) branch (for the time being). You can read more about it [here](https://github.com/travis-ci/travis-build/pull/60), which also includes links to example build scripts we generate.
+
+
+**What we run your tests on**
+
+VirtualBox has got us very far. It was great for development, had some fantastic features like snapshots and immutable disk images, and had some great tools built around it like [vagrant](http://www.vagrantup.com/).
@roidrage

roidrage Jan 24, 2013

Owner

it's called Vagrant

@roidrage roidrage commented on an outdated diff Jan 24, 2013

blog/_posts/2013-01-25-the-worker-gets-a-revamp.md
+SSH was also a logical explanation as maybe the connection was flickering and (maybe) net-ssh was trying to be forgiving when it should have just exploded and raised connection errors.
+
+And lastly there was net-ssh-shell. Maybe we should explain how this works a little.
+
+Since the dawn of Travis, the Worker has been using a Ruby gem called net-ssh-shell to help run your tests. This gem works around the issue of SSH not allowing you to run multiple commands after each other while also preserving the environment. net-ssh-shell effectively starts an echoless shell and then pipes in commands via STDIN while capturing the output (STDOUT) and listening for a little code it adds to figure out when the command has finished and what the exit code is. You can see the main code at work [here](https://github.com/mitchellh/net-ssh-shell/blob/master/lib/net/ssh/shell/process.rb#L44-46).
+
+This has worked great, but it's also a bit of a hack which isn't 100% reliable. All you need is for net-ssh-shell to miss a little bit of output, or for another process on the VM to print to STDOUT at the same time, mixing up the code it is waiting for, and you end up with a stalled job.
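
The marker technique described above can be sketched in a few lines of Ruby. This is an illustration only, not net-ssh-shell's implementation: `MarkerShell` is a hypothetical class, and it talks to a local `sh` process through pipes where the real gem drives a shell over an SSH channel. After each command it asks the shell to echo a random marker followed by `$?`, then reads output until that marker line shows up.

```ruby
require "open3"
require "securerandom"

# Hypothetical class illustrating the "echo a marker, then wait for it" trick.
class MarkerShell
  def initialize
    # One long-lived shell; STDOUT and STDERR arrive merged on @out.
    @stdin, @out, @wait_thr = Open3.popen2e("sh")
  end

  # Runs one command and returns [output, exit_code].
  def run(command)
    marker = "done-#{SecureRandom.hex(8)}"
    @stdin.puts(command)
    @stdin.puts("echo #{marker} $?") # the shell expands $? to the last exit code
    @stdin.flush

    output = "".dup
    while (line = @out.gets)
      match = line.chomp.match(/\A#{marker} (\d+)\z/)
      return [output, match[1].to_i] if match
      output << line
    end
    raise "shell exited before the marker was seen"
  end

  def close
    @stdin.close
    @wait_thr.value # wait for the shell to exit
    @out.close
  end
end

shell = MarkerShell.new
shell.run("export GREETING=hello")
output, exit_code = shell.run('echo "$GREETING"')
puts output
puts "exit code: #{exit_code}"
shell.close
```

The weak spot is the `while` loop: it only finishes when the marker arrives on a line of its own. If other output lands on the same line, for instance from a command that prints without a trailing newline, the marker is never recognized and the reader keeps waiting, which is the kind of stalled job described above.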
+
+At the end of the day there is no single reason we can pinpoint and be sure of when it comes to false timeouts. It is highly likely that a mixture of the technologies and libraries used contributes to the problem, and that it is compounded by Travis code.
+
+So what have we been doing about it?
+------------------------------------
+
+**How we run your tests**
+
+About a month ago our amazing Sven had an idea, he thought it was a bit crazy at first so coded it mostly as an experiment, but it was such a super smart idea we just had to use it as soon as possible. Mind blowingly smart! (mind blown pic)
@roidrage

roidrage Jan 24, 2013

Owner

(mind blown pic)

Owner

roidrage commented Jan 24, 2013

:shipit:

joshk merged commit 599e64a into master Jan 24, 2013
