Propose Alameda design for rook integration #2182

Closed
Binary file added design/alameda/architecture_rook_integration.png
34 changes: 34 additions & 0 deletions design/alameda/design.md
@@ -0,0 +1,34 @@
## What is Alameda

Alameda is an intelligent resource orchestrator for Kubernetes that provides autonomous balancing, scaling, and scheduling using machine learning. Alameda continuously learns from the changing computing resources of a Kubernetes cluster, predicts future resource demands for pods and nodes, and orchestrates the underlying computing resources without manual configuration.

For more details, please refer to https://github.com/containers-ai/Alameda

## How Alameda works with Rook

![integration](./architecture_rook_integration.png)
(Note: as of release 0.2, the *AlamedaResource* CRD is renamed *AlamedaScaler* and the *AlamedaResourcePrediction* CRD is renamed *AlamedaRecommendation*.)

This figure illustrates Alameda's architecture and how it can work with Rook.
The primary purpose of Alameda is to **recommend optimal computing resource configuration for Kubernetes**.
To achieve that, users of Alameda (Rook in this case) specify which Pods require metrics prediction and resource configuration recommendation by creating *AlamedaResource* CRs ([example](https://github.com/containers-ai/alameda/blob/master/example/samples/nginx/alameda_deployment.yaml)).
After some machine learning computation, users can see metrics predictions and recommendations in the *AlamedaResourcePrediction* CRs; an example can be found [here](https://github.com/containers-ai/alameda/blob/master/docs/quickstart.md#example).
Please note that starting from Alameda release 0.2, the metrics predictions are stored in a time-series DB instead of CRs due to performance and size considerations of etcd.

the example AlamedaResource CRD link is broken (404): https://github.com/containers-ai/alameda/blob/master/example/samples/nginx/alameda_deployment.yaml

I'd recommend breaking this paragraph up into one line per sentence; comments can be added per line, and since this is an entire paragraph right now it's difficult to make scoped comments.


> the metrics predictions will be store in a time-series DB

What does the interface look like for an orchestrator to consume the predictions and recommendations? Does it call a specific Alameda API? How does it discover that endpoint?

How big are the predictions and recommendations? I can see Alameda gathering a lot of metrics/stats and that taking up a lot of space, but I would expect the recommendations to have a small footprint comparatively. Maybe a CRD approach could still work if the predictions are small and each CRD is over time updated with the latest recommendations, as opposed to new recommendation CRDs constantly being generated.


Is it possible to add an example of what a recommendation looks like? What data/fields are exposed?


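For illustration, a minimal *AlamedaResource* CR might look like the following sketch (the `apiVersion` group/version and the `spec` fields are assumptions for illustration only, not a confirmed schema; see the linked quickstart for the authoritative example):

```
apiVersion: autoscaling.containers.ai/v1alpha1   # hypothetical group/version
kind: AlamedaResource
metadata:
  name: alameda-nginx
  namespace: webapp
spec:
  # Hypothetical selector: ask Alameda to predict metrics and
  # recommend resources for Pods matching these labels.
  selector:
    matchLabels:
      app: nginx
```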
Another function that Alameda provides for Rook is **disk failure prediction**.
This function can be switched on/off when deploying Alameda or in Alameda's configmap.
When it is enabled, Alameda requires [prometheus_disk_log_exporter](https://github.com/containers-ai/prometheus_disk_log_exporter) to expose disk data such as S.M.A.R.T. attributes to Prometheus.
Once Alameda has disk failure prediction results, it exposes them in Alameda's recommendation.
Since physical disks can be identified by their World Wide Names (WWN) or serial numbers, Rook can pick up from here, or Alameda can write the result to Rook's configmap that is named with the *device-in-use* prefix.
For example, Rook maintains a configmap to track the devices in use, and the Crane component of Alameda can record the disk failure prediction by adding data to that configmap.

In the interest of a decoupled system, I'd lean towards the failure predictions being exposed through Alameda's recommendations, instead of Alameda having internal knowledge about Rook's internal config structures.

![rook_device_configmap](./rook_cm.png)
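As a sketch of the configmap handoff described above (the configmap name, namespace, and data layout here are hypothetical; Rook's actual *device-in-use* configmaps may differ), Alameda's Crane component could record a prediction like:

```
apiVersion: v1
kind: ConfigMap
metadata:
  # Hypothetical name following Rook's device-in-use prefix convention
  name: device-in-use-node1
  namespace: rook-ceph
data:
  devices: |
    [{"name": "sdb",
      "wwn": "500a075118cb0318",
      "near_failure": "Good",
      "predicted": "2018-05-30 18:33:12"}]
```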

An example disk failure prediction result is:
```
{
  "near_failure": "Good",
  "disk_wwn": "500a075118cb0318",
  "serial_number": "174718CB0318",
  "predicted": "2018-05-30 18:33:12"
}
```
The *near_failure* attribute indicates disk life expectancy, as shown in the following table.

near_failure | Life expectancy
------------ | ---------------
Good         | > 6 weeks
Warning      | 2 ~ 6 weeks
Bad          | < 2 weeks
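As a sketch of how Rook might consume such a prediction (the helper name and the evacuation policy are hypothetical, not part of either project's API), the *near_failure* value can be mapped to an operator decision:

```python
import json

# Life-expectancy buckets from the table above.
NEAR_FAILURE_WEEKS = {
    "Good": "> 6 weeks",
    "Warning": "2 ~ 6 weeks",
    "Bad": "< 2 weeks",
}

def should_evacuate(prediction_json: str) -> bool:
    """Return True when the predicted disk life is short enough that
    the operator should start migrating data off the device."""
    prediction = json.loads(prediction_json)
    status = prediction["near_failure"]
    if status not in NEAR_FAILURE_WEEKS:
        raise ValueError(f"unknown near_failure value: {status!r}")
    # Only "Bad" (< 2 weeks) triggers evacuation in this sketch;
    # "Warning" could instead surface an alert to the administrator.
    return status == "Bad"

sample = json.dumps({
    "near_failure": "Good",
    "disk_wwn": "500a075118cb0318",
    "serial_number": "174718CB0318",
    "predicted": "2018-05-30 18:33:12",
})
```

Here only `Bad` triggers evacuation; a real integration would likely also react to `Warning`, for example by raising an alert.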


Binary file added design/alameda/rook_cm.png