DRA: Integrates with DRA and CDI #3329

Merged: 13 commits, Apr 29, 2024

cyclinder
Collaborator

Thanks for contributing!

What type of PR is this?

  • release/feature

What this PR does / why we need it:

Spiderpool now integrates with DRA (Dynamic Resource Allocation) and CDI (Container Device Interface), which enables more complex scheduling and finer control of hardware resources through DRA.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:


codecov bot commented Mar 27, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.31%. Comparing base (237b89d) to head (88f6b3b).
Report is 8 commits behind head on main.

@@           Coverage Diff           @@
##             main    #3329   +/-   ##
=======================================
  Coverage   81.31%   81.31%           
=======================================
  Files          50       50           
  Lines        4352     4352           
=======================================
  Hits         3539     3539           
  Misses        661      661           
  Partials      152      152           
Flag Coverage Δ
unittests 81.31% <ø> (ø)

Flags with carried forward coverage won't be shown.

@cyclinder changed the title from "feature: Integrates with DRA and CDI" to "DRA: Integrates with DRA and CDI" Mar 27, 2024
@cyclinder force-pushed the dra/feature1 branch 9 times, most recently from 94c4e52 to d75b9c1, March 28, 2024 10:36
// +kubebuilder:validation:Optional
MultusNames []string `json:"multusNames,omitempty"`

// +kubebuilder:validation:Optional
Collaborator

Please add comments explaining what these parameters are for.

Collaborator Author

These fields are not fully settled yet. I don't think this PR needs to finalize them.

@weizhoublue
Collaborator

Still needed: E2E test cases and documentation.

return &driver{spiderClientset: spiderClientset}
}

func (d driver) GetClassParameters(ctx context.Context, class *resourcev1alpha2.ResourceClass) (interface{}, error) {
Collaborator

A beginner question: when is this function triggered? For example, when a ResourceClass is created?

Collaborator Author

This function is actually called by the dra-controller when it runs allocate. I have not yet seen it used in the current DRA implementations. My guess is that it could be used when a ResourceClass is created: after the dra-plugin reads the parameters, it could perform some initialization of the hardware resources on the node that the ResourceClass represents.

}

func (d driver) allocate(ctx context.Context, claim *resourcev1alpha2.ResourceClaim, claimParameters interface{}, class *resourcev1alpha2.ResourceClass, classParameters interface{}, selectedNode string) (*resourcev1alpha2.AllocationResult, error) {
if selectedNode == "" {
Collaborator

From this logic, does each selectedNode represent one node? During matching, selectedNode is not matched or filtered against claimParameters and the like, so there is no scheduling effect yet?

Collaborator Author

The current code does not involve scheduling yet; selectedNode is set by kube-scheduler. DRA supports two allocation modes: immediate allocation and delayed allocation. Immediate allocation (allocating as soon as the ResourceClaim is created) is not supported for now.

"k8s.io/dynamic-resource-allocation/kubeletplugin"
)

func StartDRAPlugin(logger *zap.Logger, cdiRoot, so string) (kubeletplugin.DRAPlugin, error) {
Collaborator

The so string parameter could be changed to a map, which would make it easier to mount multiple .so files in the future:

map[featureNameString]SoPathString

Collaborator Author

Wouldn't a slice be enough?

Collaborator Author

@cyclinder Apr 9, 2024

Does mounting multiple .so files even make sense? How should the LD_PRELOAD variable be set when several are mounted?

Collaborator

I am thinking of an easily extensible framework, so that new .so files can be added conveniently later.

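Collaborator Author

@cyclinder Apr 9, 2024

What you probably want is to add a new .so easily without changing any code. But a new feature needs more than a .so; it also needs ENV settings and other prerequisites, so this cannot be fixed in advance. With the current framework, developing a new feature requires the following code changes:

  1. Add a new field to spiderclaimparameter
  2. Modify the dra-plugin code to change the CDI file

Controlling feature switches through spiderclaimparameter is the more standard approach. The approach you describe takes effect globally: even if some Pods do not need a feature, once it is specified at install time it is mounted whenever a Pod is created, so finer-grained control is not possible.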
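
The per-claim switch argued for above can be sketched roughly as follows. This is an illustration only: ClaimParameter, its EnablePreload field, ContainerEdits, and the library path are hypothetical names, not the real SpiderClaimParameter schema or the CDI library API.

```go
package main

import "fmt"

// ClaimParameter is a hypothetical stand-in for the SpiderClaimParameter
// spec; the field name is illustrative, not the real CRD schema.
type ClaimParameter struct {
	EnablePreload bool // hypothetical per-claim feature switch
}

// ContainerEdits is a simplified, hypothetical model of the changes a CDI
// device entry can apply to a container (environment variables and mounts).
type ContainerEdits struct {
	Env    []string
	Mounts []string
}

// buildEdits returns the edits for one claim. The library is injected only
// when that claim enables the feature, so other Pods are left untouched.
func buildEdits(p ClaimParameter, soPath string) ContainerEdits {
	var edits ContainerEdits
	if !p.EnablePreload || soPath == "" {
		return edits
	}
	edits.Env = append(edits.Env, "LD_PRELOAD="+soPath)
	edits.Mounts = append(edits.Mounts, soPath)
	return edits
}

func main() {
	so := "/usr/lib/example/libexample.so" // hypothetical path

	// A claim that opts in gets the env var and the mount.
	fmt.Printf("%+v\n", buildEdits(ClaimParameter{EnablePreload: true}, so))
	// A claim that does not opt in gets no edits at all.
	fmt.Printf("%+v\n", buildEdits(ClaimParameter{}, so))
}
```

In this shape, adding a feature means adding a field and a branch in buildEdits, which matches the two code changes listed in the reply; a global map of .so paths would instead apply to every Pod.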

@cyclinder force-pushed the dra/feature1 branch 4 times, most recently from 39867d1 to 45e1169, April 21, 2024 03:42
@cyclinder force-pushed the dra/feature1 branch 3 times, most recently from 932793c to 03d6780, April 23, 2024 08:01
@cyclinder
Collaborator Author

/cc @weizhoublue

The PR is ready to merge.

6. Create resource files such as workloads and resourceClaim.

```
~# export NAME=demo
Collaborator

export NAME=demo
followed by a blob of YAML: this cannot possibly run successfully as written.


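Spiderpool now integrates the DRA framework, which enables capabilities including but not limited to:

* Pods can be scheduled automatically to a suitable node based on the subnet and NIC information they use, avoiding Pods that fail to start after being scheduled to a node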
Collaborator

Please explain the principle or the preconditions: is scheduling decided by combining three pieces of information, namely the master NICs reported by the node, the master interface in multusconfigure, and the ippool?

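Spiderpool now integrates the DRA framework, which enables capabilities including but not limited to:

* Pods can be scheduled automatically to a suitable node based on the subnet and NIC information they use, avoiding Pods that fail to start after being scheduled to a node
* A unified way to declare resources across multiple device-plugins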
Collaborator

Please state which device-plugins are meant and give reference links. Also explain which field corresponds to which device-plugin.

Collaborator Author

"Explain which field corresponds to which device-plugin"

Which field do you mean?

Collaborator

@weizhoublue Apr 28, 2024

The parameter CRD. Its usage needs a description.

Signed-off-by: cyclinder <qifeng.guo@daocloud.io>

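Spiderpool now integrates the DRA framework, which enables capabilities including but not limited to:

* Based on the NIC and subnet information reported by each node, combined with the SpiderMultusConfig used by the Pod, Pods are scheduled automatically to a suitable node, avoiding Pods that fail to start after being scheduled to a node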
Collaborator

@weizhoublue Apr 28, 2024

(1) This is too brief. I need more detail: how it works, how to troubleshoot (is NIC and IP information reported to some CRD?), and what exactly the SpiderMultusConfig configuration is, or an example showing what can be scheduled.
(2) Host 1 has eth0 on subnet 10, and the Pod declares macvlan with master eth1 and requires subnet 10. Can it still be scheduled to host 1?

Alternatively, explain this in detail in a dedicated section of this document.

Collaborator Author

This content will be added in one go once the feature is implemented.

1. Prepare a recent Kubernetes cluster (a version above v1.29.0 is recommended) with the DRA feature gate enabled
2. Kubectl and [Helm](https://helm.sh/docs/intro/install/) are installed

## Quick Start
Collaborator

@weizhoublue Apr 28, 2024

State here what this quick start is meant to demonstrate (subnet scheduling? obtaining SR-IOV VFs?), so that readers do not have to go through the whole procedure to find out.

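Spiderpool now integrates the DRA framework, which enables capabilities including but not limited to:

* Based on the NIC and subnet information reported by each node, combined with the SpiderMultusConfig used by the Pod, Pods are scheduled automatically to a suitable node, avoiding Pods that fail to start after being scheduled to a node
* SpiderClaimParameter unifies how resources from multiple device-plugins, such as [sriov-network-device-plugin](https://github.com/k8snetworkplumbingwg/sriov-network-device-plugin) and [k8s-rdma-shared-dev-plugin](https://github.com/Mellanox/k8s-rdma-shared-dev-plugin), are used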
Collaborator

Too brief. A beginner needs to know how to use this and how to troubleshoot it. For example, which field in SpiderClaimParameter enables a given device plugin's functionality? And if the DRA feature is enabled and the Pod also declares resources, which one takes effect first?

@weizhoublue merged commit 074c1d0 into spidernet-io:main Apr 29, 2024
49 of 50 checks passed
Labels
kind/feature release/feature-new release note for new feature
2 participants