Skip to content

Container Load Balancer - DiffTracker Integration #8464

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

georgeedward2000
Copy link

@georgeedward2000 georgeedward2000 commented Feb 27, 2025

What type of PR is this?

/kind features

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?


Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot
Copy link
Contributor

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Copy link

linux-foundation-easycla bot commented Feb 27, 2025

CLA Not Signed

@k8s-ci-robot k8s-ci-robot added the do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. label Feb 27, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: georgeedward2000
Once this PR has been reviewed and has the lgtm label, please assign nilo19 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. do-not-merge/needs-kind needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 27, 2025
@k8s-ci-robot
Copy link
Contributor

Hi @georgeedward2000. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Feb 27, 2025
az.diffTracker = diffTrackerState
// Print syncOps to surpress unused variable error
klog.V(2).Infof("Sync operations: %v", syncOperations)
// TODO (enechitoaia): call NRP APIs (including ServiceGateway) to update the state of the NRP

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think better to list a TODO for the k8 API calls as well here. The actual implementation could be done in a different function to keep it modular.

@@ -183,6 +183,10 @@ func (az *Config) IsLBBackendPoolTypeNodeIP() bool {
return strings.EqualFold(az.LoadBalancerBackendPoolConfigurationType, consts.LoadBalancerBackendPoolConfigurationTypeNodeIP)
}

func (az *Config) IsLBBackendPoolTypePodIP() bool {
return strings.EqualFold(az.LoadBalancerBackendPoolConfigurationType, consts.LoadBalancerBackendPoolConfigurationTypeNodeIP) && az.UseStandardV2LoadBalancer()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BackendPoolConfigurationType should be PodIP.

return updateK8Resource(input, dt.K8s.Egresses, ResourceTypeEgress)
}

func updateK8Resource(input UpdateK8sResource, set *utilsets.IgnoreCaseSet, resourceType string) error {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think utilsets.IgnoreCaseSet is not thread-safe. Better to make the update functions thread-safe as multiple service reconcile threads could update in parallel. Also, a read maybe happening in batch processor go routine when write via update function is taking place.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'd just lock/defer unlock at the entry points, might get complicated otherwise

continue
}

if location == "" {
Copy link

@kartickmsft kartickmsft Mar 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be the first check at line 41. We need to continue.

continue
}

if location == "" {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be the first check. Continue.

NATGatewayUpdates SyncNRPServicesReturnType
LocationData LocationData
}

Copy link

@kartickmsft kartickmsft Mar 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need DTOs for Service (referring to LB/NATGW) also.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's discuss

}

type NRPState struct {
LoadBalancers *utilsets.IgnoreCaseSet
Copy link

@kartickmsft kartickmsft Mar 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if NRPState should represent Service state in NRP. Service state in NRP will have references to LBBackendpool/NATGWs. To fit into the NRP API for Service/Location, we may need to store more information like LB name etc so that we can construct the URI.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's discuss

return true
}

// Map LocationData to LocationDataDTO to be used as payload in ServiceGateway API calls

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something similar is needed for Service API

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's discuss

Copy link

@david-kow david-kow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some initial thoughts

@@ -183,6 +183,10 @@ func (az *Config) IsLBBackendPoolTypeNodeIP() bool {
return strings.EqualFold(az.LoadBalancerBackendPoolConfigurationType, consts.LoadBalancerBackendPoolConfigurationTypeNodeIP)
}

func (az *Config) IsLBBackendPoolTypePodIP() bool {
return strings.EqualFold(az.LoadBalancerBackendPoolConfigurationType, consts.LoadBalancerBackendPoolConfigurationTypeNodeIP) && az.UseStandardV2LoadBalancer()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if pool type is podip but lb sku is just standard (not v2)?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's discuss

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to throw validation error in aks-rp before we pass on the settings to cloud-provider from aks-rp. There may be self-managed clusters wherein, aks-rp is not involved. So, maybe, throw validation error during cloud-provider init so that cloud-provider init fails and customer corrects the config.


// Initialize the diff tracker state and get the necessary operations to sync the cluster with NRP
diffTrackerState, syncOperations := difftracker.InitializeDiffTrackerState(K8sState, NRPState)
az.diffTracker = diffTrackerState

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

naming: difftracker vs difftrackerstate might be confusing (the latter sounds like a part of difftracker itself) - should we align on the name? probably just diffTracker?

diffTrackerState, syncOperations := difftracker.InitializeDiffTrackerState(K8sState, NRPState)
az.diffTracker = diffTrackerState
// Print syncOps to surpress unused variable error
klog.V(2).Infof("Sync operations: %v", syncOperations)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can just do:

diffTrackerState, _ := difftracker.InitializeDiffTrackerState(K8sState, NRPState)

and avoid unnecessary print.

K8s: k8sState,
NRP: nrpState,
}
syncOperations := diffTrackerState.GetSyncDiffTrackerState()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any specific reason to do that? While probably this will be a common pattern to call get that right after initialization, maybe we leave it to do to the user?

K8s: k8sState,
NRP: nrpState,
}
syncOperations := diffTrackerState.GetSyncDiffTrackerState()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be called getSyncOperations?

return updateK8Resource(input, dt.K8s.Egresses, ResourceTypeEgress)
}

func updateK8Resource(input UpdateK8sResource, set *utilsets.IgnoreCaseSet, resourceType string) error {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'd just lock/defer unlock at the entry points, might get complicated otherwise

ResourceTypeEgress = "Egress"
)

func (dt *DiffTrackerState) UpdateK8service(input UpdateK8sResource) error {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would we expect to have more details here about the service config or this is not necessary?

2. Added locking at entry points
3. Added DTOs and Mappers for LocationData and Service within ServiceGateway API
4. Clean code + updated initialization/sync cloud object
5. Added/Updated Tests
package difftracker

func (dt *DiffTracker) handleService(input UpdateK8sResource) SyncServicesReturnType {
dt.mu.Lock()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think better to move the locks within the UpdateK8service. Same comment applies for the other handlers.

defer dt.mu.Unlock()

dt.UpdateK8service(input)
SyncDiffTrackerReturnType := dt.GetSyncLoadBalancerServices()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we may not need the Diff. It's just update K8 state. So, I think we don't need the Get functions and as a result, locks can be made more granular.

return syncServices
}

func (dt *DiffTracker) GetSyncLoadBalancerServices() SyncServicesReturnType {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think better to have lock for Get Diff functions as UpdateK8 state and GetDiff will be called independently from reconcile and batchprocessor threads respectively.

utilsets "sigs.k8s.io/cloud-provider-azure/pkg/util/sets"
)

func (dt *DiffTracker) UpdateNRPLoadBalancers() {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed offline, we need to pass in the Diff State that needs to be updated, as current implementation will sync NRP to the latest k8 state. Not the k8 state that we updated in NRP. Same comment applies for all Update functions.

if dt.NRPResources.LoadBalancers.Has(service) {
continue
}
dt.NRPResources.LoadBalancers.Insert(service)
Copy link

@kartickmsft kartickmsft Mar 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if we insert a duplicate entry? I guess it will just replace with the latest entry, which won't impact any impact if the entry is already present. If my understanding is correct, then we can remove the find in line 15. Will avoid cpu cycles for the new entries.
Similar comment applies for other Update functions as well.

continue
}
dt.NRPResources.NATGateways.Insert(egress)
fmt.Printf("Added egress %s to NRP NATGateways\n", egress)
Copy link

@kartickmsft kartickmsft Mar 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we not using klog here?
Comment applies to line 48 also.

serviceRef := createServiceRef(pod)
addressData := Address{ServiceRef: serviceRef}

if !exists || !serviceRef.Equals(nrpLocation.Addresses[address].Services) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we check if nrpLocation.Addresses[address] is present before comparing services within the address?

}

// Initialize the diff tracker state and get the necessary operations to sync the cluster with NRP
az.diffTracker = difftracker.InitializeDiffTracker(K8s, NRP)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: naming, should be k8s, nrp?

// loadBalancerServicesDTO := difftracker.MapLoadBalancerUpdatesToServiceDataDTO(syncOperations.LoadBalancerUpdates)
// natGatewayServicesDTO := difftracker.MapNATGatewayUpdatesToServiceDataDTO(syncOperations.NATGatewayUpdates)
// servicesDTO := difftracker.ServiceDataDTO{
// Action: difftracker.PartialUpdate,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a full update since we know full k8s state at this point?

// }

// TODO (enechitoaia): Implement the logic for ServiceGatewayClient
// Call ServiceGate APIs (including ServiceGateway) to update the state of the NRP

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before that we'd need to create LBs, NATs, LB rules, etc.

case REMOVE:
set.Delete(input.ID)
default:
return fmt.Errorf("error Update%s, Operation=%s and ID=%s", resourceType, input.Operation, input.ID)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd change that as I think the intention was to indicate func name that was called. But in case of rename, or calling from a different func this will be misleading (eg. right now this will print error UpdateService, while the func name is Update_K8s_Service). I'd print out all params as they are, and let callers decide how to present it

},
},
},
NRPResources: NRP{

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be more readable if you have a helper func to generate these (eg. pass string map like in expectedNRP). Will shorten this to:

Suggested change
NRPResources: NRP{
K8sResources: gen("node1": { "10.0.0.1": { "service1"}}),

Copy link

@david-kow david-kow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some thoughts

Service: service,
LoadBalancerBackendPools: []LoadBalancerBackendPoolDTO{
{
Id: fmt.Sprintf("%s-backendpool", service),

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to specify the full uri of the LB backendpool resource in NRP.

serviceDTO := ServiceDTO{
Service: service,
PublicNatGateway: NatGatewayDTO{
Id: fmt.Sprintf("%s-natgateway", service),

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to specify the full URI of NATGateway resource.

1. Drop event handlers and keep divided update k8s, get sync ops, update nrp flows
2. "Update nrp" now uses sync ops
3. Granular locking to address divded flows
4. Update DTO mappings to include URI formatting
5. Updated tests
existingAddress := nrpLocation.Addresses[addressKey]
if !serviceRefs.Equals(existingAddress.Services) {
existingAddress.Services = serviceRefs
nrpLocation.Addresses[addressKey] = existingAddress
Copy link

@kartickmsft kartickmsft Apr 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't need to add the existingAddress back as it's already retrieved in line 78, which means it already exists in nrpLocation.Addresses.

)

// SyncServices handles the synchronization of services between K8s and NRP
func syncServices(k8sServices, Services *utilsets.IgnoreCaseSet) SyncServicesReturnType {
Copy link

@kartickmsft kartickmsft Apr 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we rename it to GetServicesToSync to imply that it returns the services to sync?

Pods map[string]Pod
}

type K8s struct {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we rename it to K8s_State and NRP_State for better readability?

}

// LocationDataDTO represents the DTO for LocationData
type LocationDataDTO struct {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LocationsDataDTO is better?

isDelete bool
}

type ServiceDataDTO struct {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ServicesDataDTO is better?

func (dt *DiffTracker) DeepEqual() bool {
// Compare Services with LoadBalancers
if dt.K8sResources.Services.Len() != dt.NRPResources.LoadBalancers.Len() {
klog.Errorf("Services and LoadBalancers length mismatch")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mismatch between k8 and NRP is not really an error. It could be just a transient state where we didn't sync NRP with k8 yet. So, suggesting to not log mismatches as an error in this function.

klog.Errorf("Identity %s not found in Services for pod %s in node %s\n", identity, podKey, nodeKey)
return false
}
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we check if services in nrp are different from k8?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It checks on line 72 (Inbound) and line 90 (Outbound) for services in NRP_State and not in K8s_State

subscriptionID,
resourceGroup,
service,
fmt.Sprintf("%s-backendpool", service),
Copy link

@kartickmsft kartickmsft Apr 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check the name of the backendpool that's used for CLB in createOrUpdateBackendPool. You could refer to getLocalServiceBackendPoolName()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type property has been introduced for Service. We could pass Type as "Inbound" as we are mapping LB to Service.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what I found in a test for createOrUpdateBackendPool, which follows the same structure as the one I used:
ptr.To(fmt.Sprintf("/subscriptions/subscriptionID/resourceGroups/rg/providers/Microsoft.Network/loadBalancers/%s/backendAddressPools/%s", lbName, bpName)),

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought line 350 should be fmt.Sprintf("%s",service). I mean not "%s-backendpool".

for _, service := range natGatewayUpdates.Additions.UnsortedList() {
serviceDTO := ServiceDTO{
Service: service,
PublicNatGateway: NatGatewayDTO{
Copy link

@kartickmsft kartickmsft Apr 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type property has been introduced for Service. We could pass Type as "Outbound" as we are mapping NatGw to Service.

@github-actions github-actions bot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Apr 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. do-not-merge/needs-kind do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants