
Scaling down should remove instances from the container with the most imbalance #1560

Closed
billonahill opened this issue Nov 14, 2016 · 28 comments

@billonahill
Contributor

billonahill commented Nov 14, 2016

The repacking algorithms currently remove instances in a round-robin fashion from container 1, which can result in the imbalanced packing described below. Instead, instances should be removed from the containers that have the most imbalance.

For example:

  1. Deploy the ExclamationTopology:
    ~/bin/heron submit local ~/.heron/examples/heron-examples.jar com.twitter.heron.examples.ExclamationTopology ExclamationTopology --deploy-deactivated --verbose

    (screenshot: heron_admin-0)

  2. Scale up from 4 to 6:
    ~/bin/heron update local ExclamationTopology --component-parallelism=exclaim1:6

    (screenshot: heron_admin-1)

  3. Scale back down to 4:
    ~/bin/heron update local ExclamationTopology --component-parallelism=exclaim1:4

    (screenshot: heron_admin-2)

Instead, the packing in step 3 should look like the packing in step 1. The proposed algorithm is as follows:

First, compute the idealized allocation (representation?) factor for component A on a given container as the ratio of component A's parallelism to the sum of all component parallelisms. For example, after Step 2 above the idealized allocation factor for exclaim1 would be 6/8 or 0.75, since the topology has 8 total instances of which 6 are As.

Second, compute the actual allocation factor for component A on a given container as the ratio of A's instance count to the count of all instances on the container. Following the example above, after Step 2 the allocation factors for A on each of the containers are [2/3, 2/3, 2/2] or [0.67, 0.67, 1.0].

Finally, the offset from the ideal could be computed as [-0.08, -0.08, +0.25], which shows that container 3 is the most over-allocated and should be targeted for removal of an A instance.
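To make the arithmetic concrete, here's a minimal sketch of the offset computation in Java. The container representation (a plain list of component names per container) and all names here are hypothetical, not the actual Heron packing API:

```java
import java.util.Arrays;
import java.util.List;

public class AllocationOffset {

  /** Returns the index of the container most over-allocated for the given component. */
  static int mostOverAllocated(List<List<String>> containers, String component) {
    long total = containers.stream().mapToLong(List::size).sum();
    long componentTotal = containers.stream()
        .flatMap(List::stream)
        .filter(component::equals)
        .count();
    double ideal = (double) componentTotal / total;  // e.g. 6/8 = 0.75 after Step 2

    int target = -1;
    double maxOffset = Double.NEGATIVE_INFINITY;
    for (int i = 0; i < containers.size(); i++) {
      List<String> c = containers.get(i);
      double actual = (double) c.stream().filter(component::equals).count() / c.size();
      double offset = actual - ideal;  // [-0.08, -0.08, +0.25] in the example
      if (offset > maxOffset) {
        maxOffset = offset;
        target = i;
      }
    }
    return target;
  }

  public static void main(String[] args) {
    List<List<String>> containers = Arrays.asList(
        Arrays.asList("A", "A", "B"),
        Arrays.asList("A", "A", "B"),
        Arrays.asList("A", "A"));
    System.out.println(mostOverAllocated(containers, "A"));  // prints 2 (container 3)
  }
}
```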

@billonahill billonahill self-assigned this Nov 14, 2016
@avflor
Contributor

avflor commented Nov 14, 2016

@billonahill This is an interesting idea. I think the algorithm can be simplified to work as follows:
Before you do the scaling down (step 2 in this case), compute the following for each container: the number of instances of the component you will scale divided by the total number of instances in the container. So in your example the ratios will be [2/3, 2/3, 2/2], as you also said in step 2 of your algorithm. Then sort the containers in decreasing order of these ratios and remove the component from the first container, update the first container's ratio, and adjust its position in the ordered list if needed. Then remove again from the first container, and so on until you have removed all the instances you want. This version avoids step 1 of your algorithm and also lets you decide at any point in time which container should be targeted next. Thoughts?
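A rough sketch of that greedy loop in Java (the container representation and all names are hypothetical, not Heron's actual types); instead of maintaining a sorted list, it simply re-selects the highest-ratio container after every removal, which is equivalent:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class GreedyScaleDown {

  static void scaleDown(List<List<String>> containers, String component, int toRemove) {
    for (int n = 0; n < toRemove; n++) {
      // Re-pick the container with the highest component-to-total ratio each
      // iteration, since every removal changes that container's ratio.
      List<String> target = containers.stream()
          .filter(c -> c.contains(component))
          .max(Comparator.comparingDouble((List<String> c) ->
              (double) c.stream().filter(component::equals).count() / c.size()))
          .orElseThrow(() -> new IllegalStateException("no instances left to remove"));
      target.remove(component);           // drop one instance of the component
      containers.removeIf(List::isEmpty); // a fully drained container is released
    }
  }

  public static void main(String[] args) {
    List<List<String>> containers = new ArrayList<>(Arrays.asList(
        new ArrayList<>(Arrays.asList("A", "A", "B")),
        new ArrayList<>(Arrays.asList("A", "A", "B")),
        new ArrayList<>(Arrays.asList("A", "A"))));
    scaleDown(containers, "A", 2);  // scale exclaim1 back down from 6 to 4
    System.out.println(containers); // [[A, A, B], [A, A, B]]
  }
}
```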

@billonahill
Contributor Author

+1

I was thinking the same thing: step 1 isn't really necessary, since what matters is the relative allocation factor, not whether it's above or below the ideal. It's conceptually helpful for describing the ideal we're striving for, but not required when implementing.

@avflor
Contributor

avflor commented Nov 14, 2016

+1

@ashvina
Contributor

ashvina commented Nov 14, 2016

I see one issue with the proposed approach. Since scaling up (mostly) results in an increase in the number of containers, I think scaling down should prioritize reducing the number of containers.

The above approach will result in the following if one instance each of A and B is added and then removed:

T0: deploy 4 As and 2 Bs
Packing: (A A B) (A A B)

T1: scale up 1 A and 1 B
new packing: (A A B) (A A B) (A B)

T2: scale down A by 1
new packing: (A B) (A A B) (A B)

T3: scale down B by 1
new packing: (A A B) (A) (A B)

@billonahill
Contributor Author

@ashvina if T3 came after T2 you'd end up with (A B) (A B) (A B) right?

Either way, yes this algorithm will be optimizing for balance over resource utilization. I think that's acceptable for now until we see how this feature is used. It's more tailored for the case where a single component is being scaled up and down by large amounts, and the balanced approach would free the previously added containers.

When scaling down from T1 to T2, what you show seems better to me than going to (A A B) (A A B) (B) IMO. I'm inclined to try to achieve balance; if we need to, at some point we could implement a repack that isn't biased towards least disruption and instead repacks everything as efficiently as possible, to prune wasted space. Kind of like a full GC that could happen if too many scale up/downs have occurred and we're fragmented.

@ashvina
Contributor

ashvina commented Nov 14, 2016

@billonahill - In the example, event T3 is scaling down component B. So it will be (A A B) (A) (A B) if the allocation factors mentioned in the description are used.

An increase in the number of containers can increase the total resource cost of a topology, especially on Aurora, where all containers will be of the same size (the largest one). So the allocation-factors approach can result in higher cost. For example, if the cost at deployment is 100 GB and the cost after temporarily scaling up is 200 GB, the cost after restoring the original size could still be 200 GB.

Also should we design the default scale down approach for a specific use case? Is scaling up a single component and scaling it down expected to be the most common operation? IMO, no. A user might scale up all the components.

I think scale down should try to come as close as possible to what the packing algorithm would produce.

@ashvina
Contributor

ashvina commented Nov 15, 2016

Thinking aloud...

A scale-down event will result in some disruption: processes are killed, some in-memory state will be lost, etc. How about performing rebalancing and defragmentation (GC) whenever scale down is invoked?

@billonahill
Contributor Author

My reason for suspecting that a single component will be scaled is that when we see issues in production it's typically a single component that appears to be causing back pressure. As a result I suspect it will be fairly common that a) the initial topology will be fairly well packed and b) users will adjust one component at a time. When that happens, it's very likely that new, fairly unbalanced containers will be added, which should be removed upon a scale down.

Repacking the entire topology is something we should experiment with and possibly iterate towards, but until we have stateful processing, we (i.e. Twitter) want to optimize for least disruption in terms of plan changes, speed of execution and state loss.

@wangli1426
Contributor

I agree. IMO, least disruption in terms of plan changes is important even if the underlying engine supports elastic operations (e.g., operator scaling, task migration), because plan changes always come with a cost.

@avflor
Contributor

avflor commented Nov 15, 2016

I thought more about this based on @ashvina's comments, and I think that essentially @billonahill also wants to remove the containers that consist of the same components whenever that is possible. However, the solution he suggests goes through the imbalance metric. The problem with this solution is that it might lose opportunities for removing containers, because it attempts to remove components from the most loaded containers. I think we can still achieve what Bill wants (removing from newly added containers) without using the imbalance metric.

The overall idea is that during scale down we want to remove as many containers as possible (essentially, from all the sets of possible solutions we want the one that minimizes the minimum container load). I think we can write an integer program that does that, not only for single-component scale downs but also for scaling down multiple components in one command. Another approach is to use a heuristic that operates on a single component at a time. Let's take an example:

We have 9 containers with 2 As and 2 Bs (A A B B) each, plus 1 container (A B), 1 container (A A A A A), and 1 container (A A A) as a result of a scale up operation. The user wants to scale down A 10 times. The heuristic works as follows:

First pick the containers that contain As only (2 containers in this example) and sort them by increasing number of instances. Start removing from the first container in the list, (A A A), moving on to the next one, (A A A A A), until both containers are removed or we do not need to remove more instances. Note that in this example we still need to remove two more instances of A.

Now we look at the containers that contain As and other components as well. We sort them in increasing order of load (total number of instances per container) and break ties by sorting them in increasing number of As. So the first container in the list will be (A B). We remove A from this container (result is container (B)) and then remove A from the next container (A A B B) to get container (A B B).

I think this heuristic will have the expected behavior that Bill mentioned in his original example (in cases where users want to scale up/down one component at a time) and at the same time will attempt to remove as many containers as possible. In @ashvina's example above, the container (A B) will be removed, resulting in the original packing plan.
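A sketch of this heuristic in Java, again with a hypothetical container representation: homogeneous containers are drained first, smallest first, then mixed containers in increasing order of load, with ties broken by the count of the scaled-down component:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class CategorizedScaleDown {

  static long count(List<String> c, String component) {
    return c.stream().filter(component::equals).count();
  }

  static void scaleDown(List<List<String>> containers, String component, int toRemove) {
    Comparator<List<String>> drainOrder = Comparator
        // 1. homogeneous containers (only the scaled component) come first
        .<List<String>>comparingInt(c -> count(c, component) == c.size() ? 0 : 1)
        // 2. within a category, least loaded containers first
        .thenComparingInt(List::size)
        // 3. break ties by fewest instances of the scaled component
        .thenComparingLong(c -> count(c, component));

    for (int n = 0; n < toRemove; n++) {
      List<String> target = containers.stream()
          .filter(c -> c.contains(component))
          .min(drainOrder)
          .orElseThrow(() -> new IllegalStateException("nothing left to remove"));
      target.remove(component);
      containers.removeIf(List::isEmpty);
    }
  }

  public static void main(String[] args) {
    List<List<String>> containers = new ArrayList<>();
    for (int i = 0; i < 9; i++) {
      containers.add(new ArrayList<>(Arrays.asList("A", "A", "B", "B")));
    }
    containers.add(new ArrayList<>(Arrays.asList("A", "B")));
    containers.add(new ArrayList<>(Arrays.asList("A", "A", "A", "A", "A")));
    containers.add(new ArrayList<>(Arrays.asList("A", "A", "A")));
    scaleDown(containers, "A", 10);
    // (A A A) and (A A A A A) are drained and released, (A B) becomes (B),
    // and one (A A B B) becomes (A B B).
    System.out.println(containers);
  }
}
```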

Any thoughts? We can also create an integer program and solve it through some optimization framework if we want to be more accurate/fancy.

@billonahill
Contributor Author

@avflor what you're describing sounds similar to the scoring algorithm that I described, with the addition of a secondary sort by total number of instances ascending in the event of a scoring tie. This way you favor draining homogeneous containers first. Sounds reasonable to me.

+1

@avflor
Contributor

avflor commented Nov 15, 2016

@billonahill Yes, the basic differences are the following:

  1. We categorize the containers based on their type (containers that contain only the component we want to scale down and containers that contain a mix)
  2. The heuristic sorts the containers of each category in increasing load (total number of instances) and not decreasing.
  3. Ties are broken by sorting the containers in increasing order of number of instances of the component we want to remove.
  4. We operate on the category of homogeneous containers first and next on the category with the mix.

Using this heuristic we remove from the least loaded containers, thus increasing the chances of removing containers overall, while at the same time achieving the behavior we want for the scenarios you described, thanks to the container categorization.

@ashvina
Contributor

ashvina commented Nov 15, 2016

@avflor - +1 looks good to me

@billonahill @wangli1426 - Could you please help me understand the impact of the redistribution process? Restarting an instance would require rebuilding in-memory state from an external state store. Is there any other Heron system-level or user-level cost? Could you please share some more details?

@billonahill
Contributor Author

@avflor no need to specially categorize in 1) as you describe, since the homogeneous containers will have a score of 1.0 and so will naturally sort to the top, above the heterogeneous ones.

@ashvina restarting processes has startup costs in terms of time taken to shut down/deregister plus time taken to reinitialize, as well as load on other components during service discovery and re-registration. As a result, moving and restarting 1000s of processes when adding or removing only a few, for example, would incur unnecessary cost.

@avflor
Contributor

avflor commented Nov 15, 2016

@billonahill Note that I don't use the ratio as the score. The scoring metric now is load (total number of instances in a container). The ratio score will result in removing from overloaded containers, which is not correct, since you will lose opportunities for container removal. With container categorization plus a different scoring function you avoid these problems while still getting the behavior you want in the scenario you described. Let me give an example:

Let's take the case where we have the containers (A A B B) (A A B B) (A A B B) and the user wants to remove 2 As and 2 Bs. Your heuristic (ratio as the scoring function), after removal of the 2 As, will result in:

(A B B) (A B B) (A A B B), and after removal of the 2 Bs will result in (A B) (A B) (A A B B). We lost an opportunity here to remove one container.

The heuristic I propose will produce (A A B B) (A A B B).
At the same time, it will work as you want in the example you gave at the beginning of this thread, due to the container categorization.

@wangli1426
Contributor

@ashvina Heron currently does not support restoring operator state (what you referred to as in-memory state) after restarting a Heron instance, but I believe it will be achieved in the near future.

There are two ways to scale an operator. The first way is to kill all the instances of the operator and then restart the desired number of instances. To guarantee operator state consistency, before an instance is killed, the associated state should be written to a persistent store. When the new instances start, the state is restored from the persistent store. Additional effort is needed to decide how to re-partition the state across the instances.

The second way is to redistribute the state from the instances being deleted among the existing instances, or to migrate partial state from existing instances to the instances being created. By carefully controlling the routing rules in the stream manager and the process of state migration, live migration can be achieved without restarting instances. Compared with the first method, this significantly reduces the interruption to data processing, but comes with high implementation complexity. So I believe we can implement the first method first.
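To illustrate the first approach, here is a purely hypothetical sketch; StateStore, Instance, and the naive round-robin re-partitioning are all assumed for illustration and do not correspond to any existing Heron API:

```java
import java.util.List;

// Assumed interfaces, for illustration only.
interface StateStore {
  void write(String instanceId, byte[] state);
  byte[] read(String instanceId);
}

interface Instance {
  String id();
  byte[] snapshotState();          // serialize in-memory operator state
  void restoreState(byte[] state); // merge a restored partition into this instance
  void kill();
}

class OperatorRescaler {
  private final StateStore store;

  OperatorRescaler(StateStore store) {
    this.store = store;
  }

  /** Persist each old instance's state, then kill it. */
  void drain(List<Instance> oldInstances) {
    for (Instance instance : oldInstances) {
      store.write(instance.id(), instance.snapshotState()); // checkpoint before the kill
      instance.kill();
    }
  }

  /** Restore the persisted partitions into the new instances. */
  void restore(List<String> oldIds, List<Instance> newInstances) {
    // A real implementation would re-partition state by key; here the old
    // partitions are simply distributed round-robin over the new instances.
    for (int i = 0; i < oldIds.size(); i++) {
      Instance target = newInstances.get(i % newInstances.size());
      target.restoreState(store.read(oldIds.get(i)));
    }
  }
}
```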

@wangli1426
Contributor

Is there any plan to support HeronInstance migration among different containers? With this feature, we could easily design an algorithm to remove under-utilized containers.

For instance, suppose we have two containers: (A A B B) (B B A). After B is scaled down to only one instance, the containers look like (A A B) (A). If HeronInstance migration were supported, we could move the A from the second container to the first one and release the second container.

@ashvina
Contributor

ashvina commented Nov 17, 2016

@wangli1426 - I am not aware of anyone working on HeronInstance migration. It is a good idea and was discussed informally a couple of times in the past.

@billonahill - you mentioned this in one of the comments above: "moving and restarting 1000s of processes when adding or removing only a few, for example, would incur unnecessary cost". I am not sure I understand this. If scale down is removing only a few instances, then very few container plans will change, which means defragmentation will impact only a few processes.

@billonahill
Contributor Author

@ashvina only the container plans of the instances removed would be affected if we just removed a few of them. If we instead reshuffled all instances across all containers, then every container plan would be affected.

@avflor I see what you're saying now; I misread it the first time. Yes, we should take into account all requested component changes when selecting a container. I'll follow up on #1584

@billonahill
Contributor Author

Actually @avflor, the algorithm you propose only works in examples like the ones you've shown, where you have a balanced plan to start with, which we've found is often not the case after scaling up. Take for example this plan:

[AABB, AABB, AABB, AABB, AAAA, BBBB]

If a request is made to remove 4 As and 4 Bs, the user would expect the last two containers to be removed, but instead we would get this:

[AABB, AABB, AAAA, BBBB]

If they did that again, they would get this:

[AAAA, BBBB]

If instead they scaled down the 4 As and 4 Bs in two separate scale down requests, they would get back to:

[AABB, AABB, AABB, AABB]

I think we'd need one more set of tie breakers (see 2):

  1. score based on homogeneity taking all components requested into account
  2. break tie based on homogeneity taking just the component being scaled into account
  3. break tie by count of the component ascending
  4. break tie by count of all components ascending
  5. break tie by container id descending

I believe this would work for both scenarios that you and I presented.
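As a sketch, the ordering could be expressed as a comparator chain like the one below; here "homogeneity" is taken to mean the fraction of a container's instances that belong to the component(s) being scaled down (higher drains first), and the Container shape and all names are hypothetical:

```java
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Set;

class Container {
  final int id;
  final List<String> instances;  // component name per instance

  Container(int id, List<String> instances) {
    this.id = id;
    this.instances = instances;
  }

  double homogeneity(Set<String> components) {
    long matching = instances.stream().filter(components::contains).count();
    return (double) matching / instances.size();
  }

  long count(String component) {
    return instances.stream().filter(component::equals).count();
  }
}

class ScaleDownOrder {
  /** Containers that sort first are drained first. */
  static Comparator<Container> drainOrder(Set<String> allScaledComponents, String component) {
    return Comparator
        // 1. homogeneity over all components being scaled down, descending
        .<Container>comparingDouble(c -> -c.homogeneity(allScaledComponents))
        // 2. homogeneity over just the component currently being removed, descending
        .thenComparingDouble(c -> -c.homogeneity(Collections.singleton(component)))
        // 3. count of that component, ascending
        .thenComparingLong(c -> c.count(component))
        // 4. total instance count, ascending
        .thenComparingInt(c -> c.instances.size())
        // 5. container id, descending (prefer releasing recently added containers)
        .thenComparing(Comparator.comparingInt((Container c) -> c.id).reversed());
  }
}
```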

@avflor
Contributor

avflor commented Nov 30, 2016

@billonahill I think we need a call to clarify all this :). My algorithm works on one component at a time, like yours (not on multiple ones). If a user has submitted a single request to remove 4 As and 4 Bs, then the algorithm will work on the first A, second A, third A, fourth A, first B, second B, third B, and fourth B, in that order. Let's trace the algorithm:

1st A removal: we categorize the containers into 2 groups.
Group that contains only As: [AAAA]
Group that contains a mix of A with other stuff: [AABB, AABB, AABB, AABB]

We start with the first group and remove 1 A.

The result would be [AABB, AABB, AABB, AABB, AAA, BBBB].

2nd A removal: creating again two groups:
Group that contains only As (homogeneous): [AAA]
Group that contains a mix of A with other stuff (heterogeneous): [AABB, AABB, AABB, AABB]
We remove 1 A from the container in the first group and the result is:
[AABB, AABB, AABB, AABB, AA, BBBB]

It is easy to see that the 3rd and 4th removals will end up removing one container, resulting in:
[AABB, AABB, AABB, AABB, BBBB]

Then we get to the 1st B. We have the following groups:
Group that contains only Bs: [BBBB]
Group that contains a mix of B with other stuff: [AABB, AABB, AABB, AABB]

We remove B from group 1, resulting in:
[AABB, AABB, AABB, AABB, BBB]

As previously, after the fourth removal we will have [AABB, AABB, AABB, AABB].

Note that if you had multiple containers to pick from in a group, you would pick the one with the least number of instances. In this example the homogeneous group had only one container every time, so that didn't come into play.

@billonahill
Contributor Author

For some reason I thought you were previously proposing an initial scoring taking into account all components being requested, which I see wasn't your intent.

The challenge with the primary scoring being based on total number of instances ascending is that it biases away from removing imbalanced containers. For example, if someone makes a request to remove 3 As and a B from this:

[ABC, ABC, ABC, AAAB]

they would get this:

[C, BC, BC, AAAB]

Imbalance appears as a result of scale up, which is why, if we want scale down to effectively back out those changes, we should focus on removing imbalance first IMO.

@avflor
Contributor

avflor commented Nov 30, 2016

@billonahill The result of scale up (which is homogeneous containers) will be undone by both your algorithm and mine, since they both remove the homogeneous containers (yours because of the imbalance metric and mine because of categorization). In this scenario both algorithms will produce the old state (the state before the scale up operation). The problem comes when you operate on mixed containers. If your goal for mixed containers is to remove imbalance, then that's OK, but if your goal is to remove as many containers as possible during scale down, then the imbalance metric will not do that. IMO the latter provides a better user experience than the former, because users ultimately scale down to release resources.

@avflor
Contributor

avflor commented Nov 30, 2016

The example you give is a use case where your heuristic will remove a container and mine will not. However, note that these are heuristics; there are cases where they will not work. The observation here is that optimizing for removing imbalance doesn't necessarily mean that you remove containers; in fact it can prevent container removal. If you want to optimize for removing containers, then a heuristic that removes from the least loaded container will, in the general case, result in more removals than a heuristic that optimizes for imbalance. If we agree on these observations and you still think that removing imbalance is the first goal here, then that's OK. But I want to make sure you understand my point that these are two different and conflicting objectives, and that removing imbalance results in lost opportunities for removing containers in the general case.

@ashvina
Contributor

ashvina commented Nov 30, 2016

Folks, would it be worth creating a list of "popular" use cases, and compare the algorithms against that list?

Also, what do you think about adding a policy configuration to choose between container optimization and balanced scale down?

@avflor
Contributor

avflor commented Nov 30, 2016

@ashvina We need to agree on the objective function first, which means that we need to agree that optimizing for imbalance and optimizing for container removal are different things. Once we agree on the objective function, we can determine the best algorithm for our use cases.

@billonahill
Contributor Author

At some point we should consider adopting scale up/down strategies, but not until we see how people actually use the feature. Until then we can provide what we think is a reasonable default.

@avflor I've updated the algorithm in #1584 per your suggestion. After reading it repeatedly I think I finally grokked all the scoring stages and wrote something that handles the examples we've discussed.

@wangli1426
Contributor

I agree with @ashvina. It seems that we have mixed different objectives together, namely load balance and minimization of the number of containers.

IMO, if we consider the problem we're discussing as a bin packing problem, minimization of the number of containers used implicitly covers the load balance objective, because you can remove more containers only if the containers are well balanced. The difference is that in our scenario there is already a packing plan, so we also want to minimize the instance movement cost (removing an instance has no cost, but moving an instance does) when doing the bin packing. By proposing an algorithm that solves the bin packing problem while minimizing the instance movement overhead, we can achieve load balance, container removal, and operator scale up/down simultaneously.
