Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoiding an outage from a wrong configuration #7449

Open
tty101 opened this issue Mar 3, 2025 · 2 comments
Open

Avoiding an outage from a wrong configuration #7449

tty101 opened this issue Mar 3, 2025 · 2 comments
Labels
proposal An issue that proposes a feature request

Comments

@tty101
Copy link

tty101 commented Mar 3, 2025

Problem:

The problem described below is related to most of the resources managed by NIC, but I will describe it based on VirtualServer.

Currently, if an existing VirtualServer resource becomes invalid, NIC removes the corresponding configuration from Nginx (code, doc), resulting in an outage. And it wouldn't be so bad and scary if there was an interface to make "comprehensive validation" before updating VirtualServer resource. But there is no built-in and reliable way to validate VirtualServer resource before applying it - only structural validation by OpenAPI schema (correct me if I'm wrong).

We are migrating to the NIC as the main ingress for an external and internal traffic to our clusters. At the same time, we provide to our users an interface to change VS resource (for example, to add/change routes). How can we protect ourselves and our users from allowing someone to break important ingress endpoints because they added a duplicated path in routes or made some other syntax mistake (which passed structural validation but failed during the comprehensive validation phase)?

Preferred solutions:

  • Do not remove the corresponding nginx configuration in case if VirtualServer becomes invalid
    There is no sense in removing it, and it's difficult to imagine how it can make a situation better - too big a penalty for a configuration error. Should be enough just to mark it “Invalid” and then responsible ppl or automation (by alerts, cd pipelines errors, etc…) can decide what to do with it. But let me know if I'm missing something and there's a legitimate reason for the current behavior.
  • Provide interface for "comprehensive validation" before applying VirtualServer resource
    For example, it could be done via ValidatingAdmissionWebhook for usual and dry-run mode (oss nginx-ingress has something similar). With this approach, there will be much fewer possibilities to make VirtualServer invalid and there will be a possibility to make "pre-validation" with something like "kubectl apply –dry-run=server".
  • (optionally) To think about auto rollback for invalid resources
    It seems this is not necessary if proper "pre-validation" is implemented in the way described above. But maybe there are some cases where it can be useful.

Alternative workarounds:

Please let me know if you see any good workarounds for the current behavior. From what I see:

  • Install additional NIC in isolated namespace with separate "ingressClassName" for validation purpose only, and integrate it with our cd pipeline (to first create resource in this isolated namespace with dedicated "ingressClassName" for validation purpose only, and only then proceed to change the main resource if validation passed).
    This should work, but seems a bit complicated, requires additional “validation” NIC, and duplicates already existing logic.
  • Create a dedicated validator based on NIC logic and trigger it either by cd pipeline or by policy managers like kyverno.
    The same comment: seems complicated, and duplicates already existing logic in multiple places. And there are other problems, like keeping this tool in sync with upstream validation.

Additional comments:

Ideally, to change the behavior for other resources as well (like transportserver, ingress, etc..), but let's focus on VirtualServer first. We can discuss other resources later, if you are agree with the proposed changes for VirtualServer.

@tty101 tty101 added the proposal An issue that proposes a feature request label Mar 3, 2025
Copy link

github-actions bot commented Mar 3, 2025

Hi @tty101 thanks for reporting!

Be sure to check out the docs and the Contributing Guidelines while you wait for a human to take a look at this 🙂

Cheers!

@shaun-nx shaun-nx moved this from Todo ☑ to Prioritized backlog in NGINX Ingress Controller Mar 11, 2025
@nginx-discourse-bot
Copy link

This issue has been mentioned on NGINX Community Forum. There might be relevant details there:

https://community.nginx.org/t/nginx-ingress-controller-virtual-server-validation/1030/2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
proposal An issue that proposes a feature request
Projects
Status: Prioritized backlog
Development

No branches or pull requests

2 participants