# Migration to Dedicated Redis Boxes

**Deprecated**
In September 2018, all robot suites and `robot-master` (in all environments) were re-configured and re-deployed to use dedicated Redis boxes (`sul-robots-redis-{ENV}` instead of `sul-lyberservices-{ENV}`). This was done to reduce the load and the number of dependencies on the lyberservices boxes, and to hand off knowledge about working with robots from the Access team (cbeer in particular) to the Infrastructure team. Kudos to Chris Beer and Christina Harlow for making this happen!
The migration is documented here.
The first phase (the only phase completed as of Sept. 2018) would include pointing all robot suites and `robot-master` at dedicated Redis boxes instead of `sul-lyberservices-{ENV}`. That is the work that is described here.
Future work was also identified, including moving `robot-master` off `sul-lyberservices-{ENV}` to new boxes. (Note that the robot suites were already on their own boxes when this document was created.) This will require new Puppet work, and will also require ensuring `robot-master` can still access the workflow table (in Oracle) via the workflow-services HTTP endpoint (the Java codebase).
We decided that the best strategy for the migration would be to migrate the robot suites (Redis/Resque consumers) first, and `robot-master` (the Redis/Resque producer) last. This allows `robot-master` to continue working off jobs while the robot suites begin populating the new Redis instances with new jobs. `robot-master` is cut over once it has 0 jobs in progress or queued up.
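As a sketch of how that cut-over condition might be verified (assuming Resque's default key layout, no custom Redis namespace, and `sul-lyberservices-prod` as the old host):

```sh
# List every Resque queue on the old Redis and its remaining depth; all
# queues should report 0 before robot-master is cut over. Key names assume
# Resque's defaults (resque:queues, resque:queue:<name>).
for q in $(redis-cli -h sul-lyberservices-prod smembers resque:queues); do
  echo "$q: $(redis-cli -h sul-lyberservices-prod llen resque:queue:$q)"
done
```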
This would be done first in `dev` (though most of the `dev` infrastructure had already been retired), and then in `stage` as a test run. `prod` would be done last, after having been communicated to affected teams and scheduled in advance so as not to disrupt work.
We presented the migration strategy and plan to Ben Albritton, who gave us the go-ahead and communicated it out to other teams and folks who would be affected. We kept Ben in the loop as we worked through the strategy, and ultimately scheduled a half-day of downtime a couple of weeks out -- Ben took the lead on announcing this, and would do the same for the all-clear message.
New hieradata entries were created in Puppet, one per environment, for the dedicated Redis boxes:

- https://github.com/sul-dlss/puppet/blob/production/hieradata/node/sul-robots-redis-dev.stanford.edu.eyaml
- https://github.com/sul-dlss/puppet/blob/production/hieradata/node/sul-robots-redis-stage.stanford.edu.eyaml
- https://github.com/sul-dlss/puppet/blob/production/hieradata/node/sul-robots-redis-prod.stanford.edu.eyaml
Each of these uses the Redis profile in Puppet, ensuring they all share common configuration and monitoring.
These Puppet configs were then spun up, creating the new VMs.
The Redis configs on the old boxes and the new boxes were compared to look for areas of divergence. Findings:
- Diffs
  - `auto-aof-rewrite-min-size`: `64mb` on lyberservices, `64min` on robots-redis. This is a typo from upstream, and a Puppet PR to correct it has been submitted.
  - `bind`: `localhost` and the eth0 IP on lyberservices, `0.0.0.0` on robots. I suspect the value for robots is superior -- it binds to both localhost and the eth0 IP in a way that future-proofs things should the IP change.
  - `daemonize`: `yes` on lyberservices, `no` on robots. I suspect this is a mistake, and I have created a Puppet PR to correct this on robots.
  - `pidfile`: I don't think we care if this value varies between lyberservices and the new robots boxes, as long as the pidfile can be written by the redis service account.
  - `save 60`: `1` on lyberservices, `10000` on robots. The values for `save` on robots are the new Redis defaults, so continue using those.
  - `slowlog-max-len`: `128` on lyberservices, `1024` on robots. `128` is the Redis default; `1024` is the puppet-redis default. On lyberservices-prod there were only 10 entries in the slowlog, and none were particularly large, so the value of `1024` shouldn't be an issue, and it seems valuable to use upstream defaults where we can (and those are working for us now).
- Only on lyberservices
  - `aof-rewrite-incremental-fsync yes`: This was added in Redis 3.0, and was documented upstream here: https://github.com/arioch/puppet-redis/pull/163. This is the default value, so no change is needed here.
  - `hz 10`: This was added upstream as well: https://github.com/arioch/puppet-redis/pull/50. This is the default value, so no change is needed here.
  - `maxmemory-policy volatile-lru`: `volatile-lru` is the default value, so no change is needed here.
  - `notify-keyspace-events ""`: `""` is the default value, so no change is needed here.
  - `repl-disable-tcp-nodelay no`: `no` is the default value, so no change is needed here.
  - `tcp-keepalive 0`: The default value in Redis until 3.2.1 was `0`. For 3.2.1 and greater, the default value is `300`. The current value of `0` is working for us -- as far as we know, we haven't experienced any issues around TCP keepalives, so don't change this now.
- Only on robots
  - `repl-timeout 60`: `60` is the default value. Leave it be.
  - `save 300 10`: This is in the default Redis config. Keep this.
  - `save 900 1`: This is in the default Redis config. Keep this.
  - `syslog-enabled no`: `no` is the default value. Leave it be.
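For reference, one way to reproduce this comparison (hostnames shown for `prod`; the `paste`/`sort` pipeline is just one approach):

```sh
# redis-cli's `config get '*'` prints parameter names and values on
# alternating lines; paste joins each pair onto one line so the two
# dumps can be diffed directly.
redis-cli -h sul-lyberservices-prod config get '*' | paste - - | sort > lyberservices.conf
redis-cli -h sul-robots-redis-prod config get '*' | paste - - | sort > robots-redis.conf
diff lyberservices.conf robots-redis.conf
```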
Due to the findings above, we updated the puppet-redis library to pick up the latest configurations: https://github.com/sul-dlss/puppet/pull/3126
The Redis Puppet profile in place for the new boxes already included a Nagios check for Redis that alerts when either:
- Redis is down, or
- Redis is consuming an inordinate amount of memory: https://github.com/sul-dlss/puppet/blob/production/modules/profile/files/nagios/client/nrpe/check_redis#L54-L77
This latter value defaults to 200MB for warnings and 300MB for critical alerts. FWIW, the Redis instance on sul-lyberservices-prod reports it is using 5.05MB with a peak of 11.19MB.
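Those memory figures come straight from Redis; a quick spot-check against the Nagios thresholds might look like:

```sh
# Report current and peak memory usage for the Redis instance on the old box.
redis-cli -h sul-lyberservices-prod info memory | grep -E 'used_memory_human|used_memory_peak_human'
```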
We concluded that monitoring of the new boxes was sufficient and in line with other services.
Before beginning to make configuration changes to the robot suites and `robot-master`, we confirmed that the VMs running this software (in all environments) could connect to their corresponding `sul-robots-redis-{ENV}` box. For instance, to test this, connect to `sul-robots1-prod` and run `redis-cli -h sul-robots-redis-prod client list`.
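A lighter-weight variant of the same check, run from each robot suite box against its own environment's Redis (cross-environment access may be firewalled):

```sh
# PING round-trips through the full network path, so a PONG confirms the
# box can reach the new Redis.
redis-cli -h sul-robots-redis-prod ping   # expect: PONG
```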
We asked the `shared_configs` repo which branches point at the `sul-lyberservices-{ENV}` Redis instances via (for `prod`):
```
$ for commit in $(git grep sul-lyberservices-prod.stanford.edu:6379 $(git rev-list --all) | cut -d: -f1 | sort | uniq); do
>   git branch -r --contains $commit
> done | sort | uniq
```
It reported that the following robot suites used `shared_configs` to specify Redis connection information (for `prod`):

```
origin/kurma-robots1-prod
origin/preservation_robots_prod
origin/sdr-preservation-core_prod
origin/sul-robots-prod
origin/was-robots1-prod
```
We documented our findings on the `robot-master` wiki, and made the necessary changes in `shared_configs`. These PRs were done on the day the migration for a given env (`stage`/`prod`) was done. For robot suites not in `shared_configs`, we changed the Redis connection information on each robot suite box (in, e.g., `~/{ROBOT_SUITE}/shared/config/environments/production.rb`). (Again, see the robots FAQ for more detail on this.)
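After deploying a suite's updated config, a sanity check along these lines can confirm the change took effect (the suite name `common-accessioning` is just an example):

```sh
# Confirm the deployed config now points at the new Redis box.
grep -n 'redis' ~/common-accessioning/shared/config/environments/production.rb
# Resque registers each worker in the resque:workers set (default key layout
# assumed); seeing this suite's workers on the new host confirms they
# reconnected there.
redis-cli -h sul-robots-redis-prod smembers resque:workers
```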
We documented our findings on the `robot-master` wiki. Note: `robot-master` does not use `shared_configs`. Configs for `robot-master` live in `~/robot-master/shared/config/` on `sul-lyberservices-{ENV}`.
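The analogous check for `robot-master` on `sul-lyberservices-{ENV}`:

```sh
# Find every config file that mentions a Redis host, to be sure nothing
# still points at the old box after the cut-over.
grep -rn 'redis' ~/robot-master/shared/config/
```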
We scheduled the production migration for September 25th. It began shortly after 8am and was completed around 9:45am. Ben Albritton then ran some end-to-end tests and determined that the migration was a success. An all-clear message was sent out to affected teams shortly thereafter.