Maxwell loses connection during RDS backups #326

Closed
ckampfe opened this Issue Apr 28, 2016 · 13 comments

ckampfe commented Apr 28, 2016

We're experiencing outages on Maxwell that correlate with the timing of our RDS backups.

The RDS docs say:

A brief I/O freeze, typically lasting a few seconds, occurs during both automated backups and DB snapshot operations on Single-AZ DB instances.

The stacktraces look like:

...
14:20:08,287 ERROR MaxwellReplicator - Missing binlog 'mysql-bin-changelog.003907' on $REPLICA_HOST
14:20:08,287 ERROR MaxwellReplicator - Transport exception #1236
com.google.code.or.net.TransportException: Could not find first log file name in binary log index file
        at com.google.code.or.OpenReplicator.dumpBinlog(OpenReplicator.java:288)
        at com.google.code.or.OpenReplicator.start(OpenReplicator.java:104)
        at com.zendesk.maxwell.MaxwellReplicator.beforeStart(MaxwellReplicator.java:87)
        at com.zendesk.maxwell.RunLoopProcess.runLoop(RunLoopProcess.java:25)
        at com.zendesk.maxwell.Maxwell.run(Maxwell.java:103)
        at com.zendesk.maxwell.Maxwell.main(Maxwell.java:109)
14:20:08,307 INFO  SchemaPosition - Storing final position: null

Our setup is currently RDS Master DB -> RDS Replica DB -> Maxwell. The replica is a Single-AZ instance, so we believe this I/O suspension is causing the failure: the outages are frequent, and they line up consistently with our scheduled backups.

  • Is it possible that Maxwell's connection to the RDS Replica is timing out?
  • If so, is there a way to raise this timeout?
  • Is anyone using Maxwell with Amazon RDS in a similar setup?

We're currently looking at pointing Maxwell at our production DB instance, or spinning up a manual replica that is Multi-AZ (not Amazon's "read replica"-style instance). Any guidance is much appreciated! Thank you.

Collaborator

osheroff commented Apr 28, 2016

There's some info about Amazon's binlog retention policy here that might help:

#298

Let me know if that works. I need to write that up into the official docs, I think.

ckampfe commented Apr 29, 2016

We have enabled binlog retention, and the process seems to be working fine in circumstances other than these I/O suspension backup periods.

I'm not sure that there's much Maxwell (or any other replicating process) could do to fix this but I figured it was worth a shot to get a discussion going. I've filed an AWS support ticket as well to see if there are workarounds for this behavior.

I've perused the Maxwell source a bit and haven't been able to find anything regarding a configurable connection timeout. Is my assessment accurate?

Many thanks!

MartinAmps commented Apr 29, 2016

@ckampfe I'm assuming raising this will help you:

https://github.com/zendesk/maxwell/blob/master/src/main/java/com/zendesk/maxwell/MaxwellReplicator.java#L88

I'm curious why you're running it against the replica? Are you writing maxwell's binlog position to the replica too? As you said, pointing maxwell at your master would also solve this.

Finally, we use runit to manage Maxwell; you could look at doing the same, or anything similar, e.g. supervisord. That way the process will restart if it bails for any reason.

ethangunderson commented Apr 29, 2016

(@ckampfe and I are on the same team)

@MartinAmps We run cascading replication to ensure isolation of the primary database host. We have a different host that Maxwell uses to manage state.

We also use runit, but in this case, the binlog is gone when maxwell tries to reconnect, causing a crash loop.
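
The crash loop described here comes from a supervisor restarting Maxwell into the same unrecoverable state. Below is a minimal sketch, assuming a wrapper script can read Maxwell's log output, of how that failure could be told apart from a transient one. The function name and the matching rules are hypothetical and based only on the log lines quoted at the top of this issue; they are not part of Maxwell or runit.

```python
import re

# Patterns taken from the log excerpt in this issue. This is an assumption,
# not an exhaustive list of what Maxwell can print.
MISSING_BINLOG_PATTERNS = (
    re.compile(r"Missing binlog '(?P<name>[^']+)'"),
    re.compile(r"Transport exception #1236"),
    re.compile(r"Could not find first log file name in binary log index file"),
)


def find_missing_binlog(log_lines):
    """Return (is_missing, binlog_name) for a batch of Maxwell log lines.

    is_missing is True if any line matches one of the patterns above.
    binlog_name is the file named in a "Missing binlog" line, or None.
    """
    is_missing = False
    binlog_name = None
    for line in log_lines:
        for pattern in MISSING_BINLOG_PATTERNS:
            match = pattern.search(line)
            if match:
                is_missing = True
                # Only the first pattern defines the named group "name".
                if "name" in match.groupdict():
                    binlog_name = match.group("name")
    return is_missing, binlog_name
```

A wrapper could use the result to stop restarting and alert a human instead, since a restart cannot bring back a binlog the server no longer has.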

Collaborator

osheroff commented Apr 29, 2016

Does Maxwell crash before it tries to reconnect? When does Maxwell try to reconnect? Immediately afterwards? After this happens, what does SHOW BINARY LOGS on the replica say?
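
The last question can be checked mechanically. This is a sketch, assuming you have already fetched the Log_name column of SHOW BINARY LOGS from the replica and know which binlog file Maxwell last stored. The helper names are hypothetical, and the check assumes binlog names end in a numeric sequence suffix, as in the mysql-bin-changelog.003907 name from the log in this issue.

```python
def binlog_sequence(name):
    """Return the numeric suffix of a binlog file name as an int."""
    _, _, suffix = name.rpartition(".")
    return int(suffix)


def stored_binlog_status(stored_name, available_names):
    """Classify Maxwell's stored binlog against the server's current list.

    Returns "present" if the file is still listed, "purged" if every listed
    file has a higher sequence number than the stored one, and "unknown"
    otherwise (including when the list is empty).
    """
    if stored_name in available_names:
        return "present"
    if not available_names:
        return "unknown"
    oldest = min(binlog_sequence(n) for n in available_names)
    if binlog_sequence(stored_name) < oldest:
        return "purged"
    return "unknown"
```

A "purged" result would match the "Missing binlog" error above: the file Maxwell wants to resume from is older than anything the replica still holds.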

Collaborator

osheroff commented Apr 30, 2016

I just went and set up RDS, and I had to do the following:

call mysql.rds_set_configuration('binlog retention hours', 24);

as otherwise RDS was getting rid of binlogs very aggressively.

ethangunderson commented May 3, 2016

AWS Support told us that bin logs are rotated during an I/O freeze (regardless of binlog retention being set). So that explains why we're getting that error. They opened a feature request for us to address this problem, but don't have an ETA.

To mitigate it, we'll either need to run Maxwell against the primary DB, or spin up a manual replica where we can enable bin logs without needing backups enabled. We'll probably be doing the former.

Collaborator

osheroff commented May 4, 2016

Interesting, ok. Well, I might reproduce your setup and see if there's a way for Maxwell to work around it. To clarify, it's just setting up Maxwell chained off an RDS replica, and then taking snapshots of that?

ethangunderson commented May 4, 2016

Yep, the setup is Primary -> Replica -> Maxwell.

Bin log retention is set to 3 hours, and snapshots are enabled to enable bin logs.

Collaborator

osheroff commented May 4, 2016

does the connection loss issue repro when you manually trigger a snapshot?

ethangunderson commented May 4, 2016

I'm not sure, I haven't tested that scenario.

Collaborator

osheroff commented May 4, 2016

okey doke. thx for the info.

Collaborator

osheroff commented Jul 25, 2016

I tried to repro this for a while, and I couldn't. Did you ever hear back from AWS?

osheroff closed this Sep 23, 2016
