
RFD 115: Discussion #80

davepacheco opened this issue Feb 22, 2018 · 6 comments

@davepacheco

This issue represents an opportunity for discussion of RFD 115 Improving Manta Data Path Availability while it remains in a pre-published state.

@chudley commented Feb 23, 2018

This document reads well. I think it provides enough background on what availability means to us, and the right amount of information about how soft our perceived uptime numbers may be, while still showing our interest in ensuring the system is available.

It does seem like MORAY-437 is underrepresented in the document, because I believe the change here led to the observed error rate going up for a period of time. The paragraph it's in mentions "the above changes to fix pathological PostgreSQL performance", but the section above is about PostgreSQL replication performance. I understood that MORAY-437 was put in place to prevent us from queuing too much work in Moray, a backlog caused by PostgreSQL performance problems that weren't directly related to replication lag.

If you're still looking for help on tickets around these incidents then I can do some JIRA sleuthing to find more SCI, INC, and CM tickets. I'm not sure how well that'll go (I imagine we'll be missing some as the SCI process was getting to its feet) or whether it's still worth doing because the document has many references already, but I can take a look if you think it's worthwhile!

@kellymclaughlin

One relevant situation I thought of while reading this (and that apparently I've failed to make a ticket for): if all of the muskie instances on a webapi zone fail or go into maintenance, but registrar is still running on that zone, then muppet will continue to route requests to that zone even when other webapi zones with functioning muskie instances are available. This will result in unnecessary expenditure from our error budget, but it's something I believe we can fix. I think this situation perhaps becomes more likely to occur, and more important to resolve, as we look at canary deployments. I will file a Jira ticket to cover this issue today.

One concern that I had when reading through was that we don't have a real plan for major version updates to PostgreSQL. It seems risky and limiting to resign ourselves to being on 9.6 indefinitely. There are potential performance improvements and new features with each major release that might give us tools to better address issues with the system. One example is the UPSERT support added in 9.5 and the possibilities it enables, like MANTA-3464. Moving up from 9.6 is still very tough and I don't claim to have a great answer right away, but the logical replication introduced in PostgreSQL 10 could facilitate major version upgrades with much lower impact (assuming it performs well enough compared to streaming replication). Maybe it is worth considering now how we could move forward from 9.6 and beyond, even if we don't choose to actually move to a later version for some time.

@kellymclaughlin

One relevant situation I thought of while reading this (and that apparently I've failed to make a ticket for): if all of the muskie instances on a webapi zone fail or go into maintenance, but registrar is still running on that zone, then muppet will continue to route requests to that zone even when other webapi zones with functioning muskie instances are available. This will result in unnecessary expenditure from our error budget, but it's something I believe we can fix. I think this situation perhaps becomes more likely to occur, and more important to resolve, as we look at canary deployments. I will file a Jira ticket to cover this issue today.

Filed as MANTA-3589
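
(A minimal sketch of the kind of per-zone liveness gate that could close this gap, assuming a hypothetical local health probe; the ports and the /ping path are illustrative, not the actual muskie or registrar interface.)

```python
import urllib.request

# Hypothetical list of local muskie ports; the real per-zone layout differs.
MUSKIE_PORTS = [8081, 8082, 8083, 8084]

def any_muskie_healthy(timeout=2.0):
    """Return True if at least one local muskie instance answers a probe.

    A registrar-like agent could refuse to keep the zone registered in DNS
    when this returns False, so muppet stops sending requests to a zone
    whose muskie processes are all down or in maintenance.
    """
    for port in MUSKIE_PORTS:
        try:
            with urllib.request.urlopen(
                    "http://127.0.0.1:%d/ping" % port, timeout=timeout) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            continue  # this instance is down or unresponsive; try the next
    return False
```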

@davepacheco

@chudley

It does seem like MORAY-437 is underrepresented in the document, because I believe the change here led to the observed error rate going up for a period of time. The paragraph it's in mentions "the above changes to fix pathological PostgreSQL performance", but the section above is about PostgreSQL replication performance. I understood that MORAY-437 was put in place to prevent us from queuing too much work in Moray, a backlog caused by PostgreSQL performance problems that weren't directly related to replication lag.

You raise some good questions. It's true that we had a number of requests fail after MORAY-437, but I don't think those failures are an issue in themselves; rather, they're a sign of the system doing the best it can when it can't meet current demand. That is, the problem isn't the dropped requests after MORAY-437 -- it's whatever's causing those requests to queue up in the first place. I don't know that we ever proved a connection, but I believe these "overload" failures disappeared completely after the other changes we made for pathological PostgreSQL performance (e.g., the record size rewrites).

There was a degree to which MORAY-437 was a problem in itself. When we initially rolled out the change, the max queue length was too short and caused a number of requests to fail that otherwise would have succeeded in a reasonable time. I think that's relatively minor, though.
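
(For illustration only, the general shape of that kind of overload guard is a bounded work queue that fails fast once it's full; this is not the actual Moray change, and the limit here is made up.)

```python
import queue

# Illustrative limit only; the real MORAY-437 limit and mechanism differ.
MAX_QUEUED_REQUESTS = 500
work_queue = queue.Queue(maxsize=MAX_QUEUED_REQUESTS)

class OverloadedError(Exception):
    """Raised so the caller can fail the request fast (e.g., with a 503)."""

def enqueue_request(req):
    """Queue a request for the backend, shedding load when the queue is full.

    Rejecting immediately keeps the backlog bounded when the database can't
    keep up, but a limit that's too small (as in the initial rollout) will
    also reject requests that would have completed in a reasonable time.
    """
    try:
        work_queue.put_nowait(req)
    except queue.Full:
        raise OverloadedError("request queue full; shedding load")
```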

If you're still looking for help on tickets around these incidents then I can do some JIRA sleuthing to find more SCI, INC, and CM tickets. I'm not sure how well that'll go (I imagine we'll be missing some as the SCI process was getting to its feet) or whether it's still worth doing because the document has many references already, but I can take a look if you think it's worthwhile!

In general, I would love to quantify the downtime associated with each major issue. I think that's pretty time-consuming to do retroactively because the data we have is not that well organized for this. It's not that easy to even find the SCI tickets corresponding to Manta unavailability, let alone connect each of them to the various other tickets (in NOC, INC, MANTA, and OPS) that explain the problem. I seeded the list of issues here by searching through SCI tickets mentioning Manta, skimming them, and looking at related tickets. That's a little error-prone. If you wanted to take another pass through the SCI tickets to make sure I didn't miss anything big, that would certainly be welcome, but don't feel obligated!

I think the most useful thing going forward is the suggested process change to "establish better ways of associating specific periods of downtime with specific issues". I haven't fleshed this out yet. I think we want to record enough information after each incident in a queryable form that we can efficiently summarize downtime over longer periods, but we also don't want to create a bunch of make-work for each incident. Ideas include:

  • Make sure we can keep historical metrics for Muskie error rates, latency, and throughput for at least a year in a form that's efficient for querying.
  • Make sure that SCI tickets are reliably annotated with a start time, end time, and some characterization of the impact. (This may already be true today.)
  • Make sure that CM tickets are similarly tagged for duration and expected impact.
  • Make sure that SCI tickets are also annotated with two sets of issues: one is a set of issues that contributed to the duration or impact of the incident; and the other is the set of issues without which the incident wouldn't have happened at all.

With these in place, we could write a system that queries JIRA for relevant SCI tickets, cross-references them with periods of request failures or significant performance degradations, and generates a report summarizing unavailability caused by various known issues (or changes).
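
(A very rough sketch of what that reporting tool might look like; the JIRA URL and JQL are placeholders, the fields read here are only the standard ones, and the cross-referencing against Muskie metrics is left out.)

```python
import requests

JIRA_URL = "https://jira.example.com"   # placeholder
JQL = 'project = SCI AND text ~ "Manta" ORDER BY created ASC'

def fetch_sci_tickets(session):
    """Page through JIRA's search API for SCI tickets mentioning Manta."""
    tickets, start = [], 0
    while True:
        resp = session.get(
            "%s/rest/api/2/search" % JIRA_URL,
            params={"jql": JQL, "startAt": start, "maxResults": 100,
                    "fields": "summary,created,resolutiondate"})
        resp.raise_for_status()
        page = resp.json()
        tickets.extend(page["issues"])
        start += len(page["issues"])
        if not page["issues"] or start >= page["total"]:
            return tickets

def report(tickets):
    """Print a crude summary; a real version would cross-reference each
    ticket's annotated start/end window and impact with the historical
    Muskie error-rate and latency metrics."""
    for t in tickets:
        f = t["fields"]
        print(t["key"], f["created"], f.get("resolutiondate"), f["summary"])

if __name__ == "__main__":
    with requests.Session() as s:       # auth setup omitted
        report(fetch_sci_tickets(s))
```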

There's also danger in trying to be overly quantitative about it. I wouldn't try to associate individual issues with specific request failures, or even try to break down what percent of failures within an incident were caused by which issues. But I think the above would give us a lot of insight while still being pretty lightweight.

I haven't thought all that much about this, though, and I think it merits further discussion.

@davepacheco

@kellymclaughlin

One relevant situation I thought of while reading this (and that apparently I've failed to make a ticket for): if all of the muskie instances on a webapi zone fail or go into maintenance, but registrar is still running on that zone, then muppet will continue to route requests to that zone even when other webapi zones with functioning muskie instances are available. This will result in unnecessary expenditure from our error budget, but it's something I believe we can fix. I think this situation perhaps becomes more likely to occur, and more important to resolve, as we look at canary deployments. I will file a Jira ticket to cover this issue today.
Filed as MANTA-3589

Thanks for filing that ticket!

One concern that I had when reading through was that we don't have a real plan for major version updates to PostgreSQL. It seems risky and limiting to resign ourselves to being on 9.6 indefinitely. There are potential performance improvements and new features with each major release that might give us tools to better address issues with the system. One example is the UPSERT support added in 9.5 and the possibilities it enables, like MANTA-3464. Moving up from 9.6 is still very tough and I don't claim to have a great answer right away, but the logical replication introduced in PostgreSQL 10 could facilitate major version upgrades with much lower impact (assuming it performs well enough compared to streaming replication). Maybe it is worth considering now how we could move forward from 9.6 and beyond, even if we don't choose to actually move to a later version for some time.

Yeah, it's a problem. We've generally assumed that if we stay on a version, we likely won't suddenly start running into new critical issues. We've been on 9.2.4 in JPC since launch in 2013, though we ultimately needed 9.6 in order to manage the lag issues we started seeing at much higher loads last year. We do have a procedure for doing that upgrade, and we tested it last year in a production context, but it's costly for availability. As I understand it, there's no way to do a major upgrade without significant downtime (at least for the pg_upgrade, and likely for the replica rebuilds as well). On the plus side, 9.6 is supported through 2021.

Part of the challenge with planning a major upgrade before we know what version we're going to is that we don't necessarily know the constraints. Future major versions could invalidate our plan (e.g., if "pg_upgrade" suddenly doesn't work across a major version) or even make it much easier (if they provide an online migration tool).

@kellymclaughlin

When it is time to make the leap from PostgreSQL 9.6 to something else, it might be worth some time and effort to explore the possibility of incorporating pglogical to do an online migration, and then perhaps consider moving to the official PostgreSQL logical replication afterwards. From reading this post, it seems that the new logical replication shares some authors and ideas in common with pglogical. I don't have any idea of the performance characteristics, but for the possibility of an online upgrade it might be worth checking out.

FWIW, I did a quick test and was able to build pglogical for postgres 9.6 on a SmartOS zone after a bit of tinkering.
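
(For reference, the pglogical route would look roughly like the sketch below. Hostnames and the database name are placeholders, and prerequisites such as wal_level = logical, pg_hba entries, and creating the schema on the new cluster first are glossed over; treat it as an outline of the steps rather than a tested procedure.)

```python
import psycopg2

# Placeholder DSNs: "old" is the existing 9.6 primary, "new" is the cluster
# running the newer major version, with the schema already created on it.
OLD_DSN = "host=old-primary dbname=moray"
NEW_DSN = "host=new-primary dbname=moray"

def run(dsn, statements):
    conn = psycopg2.connect(dsn)
    conn.autocommit = True      # pglogical setup calls manage their own state
    with conn.cursor() as cur:
        for stmt in statements:
            cur.execute(stmt)
    conn.close()

# Provider side (PostgreSQL 9.6 with the pglogical extension built for it).
run(OLD_DSN, [
    "CREATE EXTENSION IF NOT EXISTS pglogical",
    "SELECT pglogical.create_node(node_name := 'provider', "
    "  dsn := 'host=old-primary dbname=moray')",
    "SELECT pglogical.replication_set_add_all_tables('default', ARRAY['public'])",
])

# Subscriber side (the new major version, also with pglogical). The initial
# data copy plus ongoing change streaming happen here, which is what would
# keep the eventual cutover window short.
run(NEW_DSN, [
    "CREATE EXTENSION IF NOT EXISTS pglogical",
    "SELECT pglogical.create_node(node_name := 'subscriber', "
    "  dsn := 'host=new-primary dbname=moray')",
    "SELECT pglogical.create_subscription(subscription_name := 'upgrade_sub', "
    "  provider_dsn := 'host=old-primary dbname=moray')",
])
```

Once the subscriber caught up, the cutover would presumably be: stop writes on the old primary, let replication drain, and repoint Moray at the new cluster.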
