
Migration to Dedicated Redis Boxes


Deprecated

Introduction

In September 2018, all robot suites and robot-master (in all environments) were re-configured and re-deployed to use dedicated Redis boxes (sul-robots-redis-{ENV} instead of sul-lyberservices-{ENV}). This was done to reduce the load and the number of dependencies on the lyberservices boxes and to hand off knowledge about working with robots from the Access team (cbeer in particular) to the Infrastructure team. Kudos to Chris Beer and Christina Harlow for making this happen!

The migration is documented here.

Sequence

Determine scope

The first phase (the only phase completed as of Sept. 2018) would include pointing all robot suites and robot-master at dedicated Redis boxes instead of sul-lyberservices-{ENV}. That is the work that is described here.

Future work was also identified, including moving robot-master off sul-lyberservices-{ENV} to new boxes. (Note that the robot suites were already on their own boxes when this document was created.) This will require new Puppet work and will also require ensuring robot-master can still access the workflow table (in Oracle) via the workflow-services HTTP endpoint (the Java codebase).

Define strategy for migration

We decided that the best strategy for the migration would be to migrate the robot suites (Redis/resque consumers) first and robot-master (the Redis/resque producer) last. This allows robot-master to continue working off jobs while the robot suites begin populating the new Redis instances with new jobs. robot-master is cut over once it has 0 jobs in progress or queued up.

This would be done first in dev (though most of the dev infrastructure had already been retired), and then in stage as a test run. prod would be done last, after the change had been communicated to affected teams and scheduled in advance so as not to disrupt work.
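
The "0 jobs" cut-over check above can be approximated straight from the old Redis instance. The following is a hedged sketch, assuming Resque's default key layout (a resque:queues set plus one resque:queue:<name> list per queue, in the default database), not necessarily the exact check that was run:

$ # print each queue on the old box with its backlog; every count should be 0 before cutting robot-master over
$ for q in $(redis-cli -h sul-lyberservices-prod smembers resque:queues); do
>   echo "$q: $(redis-cli -h sul-lyberservices-prod llen resque:queue:$q)"
> done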

Define strategy for communication

We presented the migration strategy and plan to Ben Albritton, who gave us the go-ahead and communicated it out to other teams and folks who would be affected. We kept Ben in the loop as we worked through the strategy, and ultimately scheduled a half-day of downtime a couple of weeks out -- Ben took the lead on communicating this out, and would do the same for the all-clear message.

Create new VMs for Redis in Puppet

  1. https://github.com/sul-dlss/puppet/blob/production/hieradata/node/sul-robots-redis-dev.stanford.edu.eyaml
  2. https://github.com/sul-dlss/puppet/blob/production/hieradata/node/sul-robots-redis-stage.stanford.edu.eyaml
  3. https://github.com/sul-dlss/puppet/blob/production/hieradata/node/sul-robots-redis-prod.stanford.edu.eyaml

Each of these uses the Redis profile in Puppet, ensuring they all share common configuration and monitoring.

These Puppet configs were then spun up, creating the new VMs.
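
Once Puppet had converged and the VMs were up, a quick sanity check (a sketch; the stage box is shown, but the same applies to dev and prod) is to confirm that Redis is answering on its standard port:

$ redis-cli -h sul-robots-redis-stage ping
PONG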

Compare Redis configurations

The Redis configs on the old boxes and the new boxes were compared to look for areas of divergence. Findings:

  • Diffs
    • auto-aof-rewrite-min-size
      • 64mb on lyberservices, 64min on robots-redis. The latter is a typo from upstream; a Puppet PR to correct it has been submitted.
    • bind
      • localhost and the eth0 IP for lyberservices, 0.0.0.0 for robots. I suspect the value for robots is superior -- 0.0.0.0 binds to both localhost and the eth0 IP, which future-proofs the config should the IP change.
    • daemonize
      • yes on lyberservices, no on robots. I suspect this is a mistake, and I have created a Puppet PR to correct it on robots.
    • pidfile
      • I don’t think we care if this value varies between lyberservices and the new robots boxes, as long as the pidfile can be written by the redis service account.
    • save 60
      • 1 on lyberservices, 10000 on robots. The values for save on robots are the new Redis defaults, so continue using those.
    • slowlog-max-len
      • 128 on lyberservices, 1024 on robots. 128 is the Redis default. 1024 is the puppet-redis default. On lyberservices-prod, there were but 10 entries in the slowlog, and none were particularly large... so the value of 1024 shouldn’t be an issue and it seems valuable to use upstream defaults where we can (and those are working for us now).
  • Only on lyberservices
    • aof-rewrite-incremental-fsync yes
    • hz 10
    • maxmemory-policy volatile-lru
      • volatile-lru is the default value, so no change is needed here
    • notify-keyspace-events ""
      • "" is the default value, so no change is needed here
    • repl-disable-tcp-nodelay no
      • no is the default value, so no change is needed here
    • tcp-keepalive 0
      • The default value in Redis until 3.2.1 was 0. For 3.2.1 and greater, the default value is 300. The current value of 0 is working for us -- as far as we know, we haven't experienced any issues around TCP keepalives, so don't change this now.
  • Only on robots
    • repl-timeout 60
      • 60 is the default value. Leave it be.
    • save 300 10
      • This is in the default Redis config. Keep this.
    • save 900 1
      • This is in the default Redis config. Keep this.
    • syslog-enabled no
      • no is the default value. Leave it be.
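
One way to produce a comparison like the one above (a sketch, not necessarily how this diff was generated) is to dump each instance's effective configuration and diff the results. redis-cli prints CONFIG GET output as alternating parameter/value lines, so paste pairs them up:

$ redis-cli -h sul-lyberservices-prod config get '*' | paste - - | sort > lyberservices.conf
$ redis-cli -h sul-robots-redis-prod config get '*' | paste - - | sort > robots-redis.conf
$ diff lyberservices.conf robots-redis.conf

Note that CONFIG GET reflects the running configuration, so it also surfaces defaults that never appear in the config files.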

Update Puppet profile class for Redis

Based on the findings above, we updated the puppet-redis library to pick up the latest configurations: https://github.com/sul-dlss/puppet/pull/3126

Confirm monitoring is set up

The Redis Puppet profile in place for the new boxes already included a Nagios check for Redis that alerts when either:

  1. Redis is down, or
  2. Redis is consuming an inordinate amount of memory: https://github.com/sul-dlss/puppet/blob/production/modules/profile/files/nagios/client/nrpe/check_redis#L54-L77

The latter check defaults to 200MB for warnings and 300MB for critical alerts. FWIW, the Redis instance on sul-lyberservices-prod reports it is using 5.05MB with a peak of 11.19MB.
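
The memory figures quoted above come from Redis's INFO output; a quick way to spot-check them (a sketch):

$ redis-cli -h sul-lyberservices-prod info memory | grep -E 'used_memory_human|used_memory_peak_human'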

We concluded that monitoring of the new boxes was sufficient and in line with other services.

Ensure network access

Before beginning to make configuration changes to the robot suites and robot-master, we confirmed that the VMs running this software (in all environments) could connect to their corresponding sul-robots-redis-{ENV} box. For instance, to test this from sul-robots1-prod, run redis-cli -h sul-robots-redis-prod client list.
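
If redis-cli is not available on a given robots box, a bare TCP reachability check makes a reasonable fallback (a sketch; assumes netcat is installed):

$ nc -zv sul-robots-redis-prod 6379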

Migrate robot suites

Determine and document where robot suite configs live

We asked the shared_configs repo which branches point at the sul-lyberservices-{ENV} Redis instances (shown here for prod):

$ for commit in $(git grep sul-lyberservices-prod.stanford.edu:6379 $(git rev-list --all) | cut -d: -f1 | sort | uniq); do
>   git branch -r --contains $commit
> done | sort | uniq

It reported the following robot suites used shared_configs to specify Redis connection information (for prod):

origin/kurma-robots1-prod
origin/preservation_robots_prod
origin/sdr-preservation-core_prod
origin/sul-robots-prod
origin/was-robots1-prod

We documented our findings on the robot-master wiki and made the necessary changes in shared_configs. These PRs were made on the day the migration for a given environment (stage/prod) took place. For robot suites not using shared_configs, we changed the Redis connection information on each robot suite box (e.g., in ~/{ROBOT_SUITE}/shared/config/environments/production.rb). (See the robots FAQ for more detail on this.)
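
For the suites configured on-box, a quick way to confirm which Redis a deployed suite currently points at (a sketch, using the path convention above) is to grep the deployed config:

$ grep -in redis ~/{ROBOT_SUITE}/shared/config/environments/production.rb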

Re-deploy robot suites to point them at new Redis location

This was documented on the robot-master wiki.
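
One way to spot-check that a re-deployed suite actually picked up the new Redis (a sketch, again assuming Resque's default key layout) is to look for its workers registering on the new box:

$ redis-cli -h sul-robots-redis-prod smembers resque:workers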

Migrate robot-master

Determine and document where robot-master config lives

We documented our findings on the robot-master wiki. Note: robot-master does not use shared_configs. Configs for robot-master live in ~/robot-master/shared/config/ on sul-lyberservices-{ENV}.
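
A quick way to locate the Redis connection information within that directory (a sketch):

$ grep -rin redis ~/robot-master/shared/config/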

Re-deploy robot-master to point at new Redis location

This was documented on the robot-master wiki.

🍾 Success 🍾

We scheduled the production migration for September 25th. It began shortly after 8am and was completed around 9:45am. Ben Albritton then ran some end-to-end tests and determined that the migration was a success. An all-clear message was sent out to affected teams shortly thereafter.