This repository has been archived by the owner on Mar 17, 2021. It is now read-only.

Tool for monitoring untracked workspace deployments #1690

Closed
1 of 4 tasks
skabashnyuk opened this issue Nov 11, 2019 · 11 comments

Comments

@skabashnyuk
Collaborator

Issue problem:
During testing of eclipse-che/che#15006 on che.openshift.io I ran into a situation where an exception occurred during workspace stop/delete. As a result:

  • Che no longer knows about the deployments (or possibly other resources) related to the workspace.
  • The resources are still running, and there is no code tracking their lifetime.
    [Screenshot at 09:44:01]
    I believe this is a complex issue that has happened before and will happen again. I suspect that right now there are multiple deployments that Che users are not aware of.

Red Hat Che version:

version: (help/about menu)

  • I can reproduce it on the latest official image

Reproduction Steps:

Describe how to reproduce the problem

Runtime:

runtime used:

  • minishift (include output of minishift version)
  • OpenShift.io
  • Openshift Container Platform (include output of oc version)
@skabashnyuk
Collaborator Author

I think #1691 can cause untracked workspace deployments too.

@ibuziuk
Member

ibuziuk commented Nov 12, 2019

Untracked deployments are a side effect of eclipse-che/che#15006.
I will proceed with cleanup once we have a fix in production.

@ibuziuk ibuziuk self-assigned this Nov 12, 2019
@skabashnyuk
Collaborator Author

This is not the first time we have had something like this, and it will not be the last. Is it possible to have a general solution to report and clean up such resources?

@amisevsk
Collaborator

is that possible to have some general solution to report and clean up such things?

It may be possible, but this would be hard to envision as a Che feature; how would we check users and their (unique) namespaces in a way that scales to thousands of users?

@ibuziuk ibuziuk removed their assignment Nov 27, 2019
@ibuziuk ibuziuk added this to the 7.6.0 milestone Nov 27, 2019
@amisevsk amisevsk self-assigned this Dec 4, 2019
@amisevsk
Collaborator

Before sinking significant time into implementing something along these lines, I think some more design discussion is required.

As I see it, there are three options for implementing this functionality, in terms of where and how this service would run.

  1. The workspace tracker runs separately from all clusters, potentially as a cron job or similar.

    Pros:

    • Fairly easy to implement -- could be as simple as, effectively:
      pods = oc get pods -l che.workspace_id
      for each pod in pods:
        check whether the pod's workspace is RUNNING in the Che database
        remove the workspace's resources if it isn't

    Cons:

    • Not clear how permissions would be managed (needs cluster-write access to the tenant clusters, plus the database secret and cluster access for the dsaas cluster).
    • Would have to manage multiple clusters and connections (get pods from tenant clusters, get running workspaces from dsaas database).
    • Managing deployment/running would require SD input
  2. Runs as a separate service in the dsaas/preview cluster

    Pros:

    • Same deployment strategy as other services (e.g. k8s-image-puller)
    • Could reuse existing functionality (DB connection, rhche secret) to automatically look into tenant clusters and get configuration that Che uses.

    Cons:

    • Another service, image, and CI to maintain.
    • Would have to share rhche secrets and would be another service using the rhche SA token
  3. Service is a scheduled job of Che server.

    Pros:

    • Could be implemented fairly easily since e.g. database communication logic is available
    • Would be easier to upstream, and simplify deployments if we chose to upstream it

    Cons:

    • Could be a long-running job that bogs down normal Che functionality just to find a few stray workspaces out of thousands
    • Upstream utility is not clear
    • If kept downstream, it would potentially need to be maintained for each Che version.
    • Upstreaming might not be meaningful, since namespaces are handled differently there.

Personally, I'm leaning towards option 2, but that's because we already know how to deploy and update such a service. I don't think it's suitable to plan this sort of thing for upstream, since the assumptions are very different.
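Whichever option is chosen, the core reconciliation step is the same. A minimal sketch in Python (the function and its inputs are hypothetical; a real service would feed it from `oc get pods -l che.workspace_id` and from the Che database):

```python
# Hypothetical sketch: the set difference between workspaces that have pods
# on the cluster and workspaces the Che DB considers RUNNING is exactly the
# set of untracked deployments.

def find_untracked(pod_workspace_ids, running_workspace_ids):
    """Workspace IDs that have pods on the cluster but no RUNNING entry
    in the Che database, i.e. untracked deployments."""
    return sorted(set(pod_workspace_ids) - set(running_workspace_ids))

# Example: two workspaces have pods, but only one is RUNNING in the DB.
print(find_untracked(["workspaceA1", "workspaceB2"], ["workspaceA1"]))
# -> ['workspaceB2']
```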

@ibuziuk ibuziuk changed the title Untracked workspace deployments Tool for monitoring untracked workspace deployments Dec 11, 2019
@ibuziuk
Member

ibuziuk commented Dec 11, 2019

@amisevsk IMO we should stick to a solution that could be reused upstream (not necessarily embedded in che-server; it could be an auxiliary deployment like k8s-image-puller).
Speaking about option 3 (the service as a scheduled job of the Che server): isn't that something @sleshchenko already implemented upstream, so that we just need to make sure it works properly on the Hosted Che side?

@ibuziuk
Member

ibuziuk commented Dec 12, 2019

Talked with @sleshchenko, and what we currently have upstream is RuntimeHangingDetector: https://github.com/eclipse/che/blob/master/infrastructures/kubernetes/src/main/java/org/eclipse/che/workspace/infrastructure/kubernetes/RuntimeHangingDetector.java

It tracks STARTING/STOPPING runtimes and forcibly stops them if they do not change status before a timeout is reached.
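Roughly, the check it performs looks like this (the actual implementation is the Java class linked above; this is only an illustrative Python sketch with made-up names):

```python
# Illustrative sketch of a hanging-runtime timeout check, loosely modelled on
# what RuntimeHangingDetector is described as doing. All names are
# hypothetical, not the real Che API.

def find_hanging_runtimes(runtimes, timeout_seconds, now_seconds):
    """runtimes: iterable of (workspace_id, status, in_status_since_seconds).
    Returns IDs stuck in STARTING/STOPPING longer than the timeout; those
    would be forcibly stopped."""
    return [
        workspace_id
        for workspace_id, status, since in runtimes
        if status in ("STARTING", "STOPPING")
        and now_seconds - since > timeout_seconds
    ]

runtimes = [("ws1", "STARTING", 0), ("ws2", "RUNNING", 0), ("ws3", "STOPPING", 900)]
print(find_hanging_runtimes(runtimes, timeout_seconds=600, now_seconds=1000))
# -> ['ws1']  (ws2 is RUNNING; ws3 has only been STOPPING for 100s)
```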

@amisevsk
Collaborator

amisevsk commented Dec 12, 2019

Yeah, the RuntimeHangingDetector is a different case, since there we can look for STARTING and STOPPING workspaces in the Che DB and track those; for untracked deployments, the workspace may not even be in the database as STOPPED, e.g. if the user has deleted it.

The flow I would follow for this would be:

  1. For each user, get everything with the label che.workspace_id in their <username>-che namespace.
  2. For each workspace ID found, check whether the workspace has a RUNNING entry in the database.
  3. If it doesn't, remove all resources labelled che.workspace_id=<workspaceId> in that user's namespace.

The worry comes in when we have to scale to thousands of users.
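The three steps can be sketched like this for a single user (the oc and database calls are stubbed as hypothetical callables; a real service would back them with the Kubernetes API and the Che DB connection):

```python
# Hedged sketch of the per-user flow. get_labelled_resources, is_running_in_db
# and delete_resources are hypothetical stubs, not real Che/OpenShift APIs.

def cleanup_user_namespace(username, get_labelled_resources,
                           is_running_in_db, delete_resources):
    namespace = f"{username}-che"
    # Step 1: everything labelled che.workspace_id in <username>-che,
    # grouped as {workspace_id: [resources]}
    resources_by_workspace = get_labelled_resources(namespace)
    removed = []
    for workspace_id, resources in resources_by_workspace.items():
        # Step 2: does this workspace have a RUNNING entry in the database?
        if not is_running_in_db(workspace_id):
            # Step 3: remove everything labelled che.workspace_id=<id>
            delete_resources(namespace, resources)
            removed.append(workspace_id)
    return removed

# Example with stubs: ws2 has pods but no RUNNING entry, so it gets cleaned.
deleted = []
removed = cleanup_user_namespace(
    "alice",
    get_labelled_resources=lambda ns: {"ws1": ["deploy/ws1"], "ws2": ["deploy/ws2"]},
    is_running_in_db=lambda wid: wid == "ws1",
    delete_resources=lambda ns, res: deleted.extend(res),
)
print(removed)  # -> ['ws2']
```

This is one namespace query plus up to one DB check per workspace, per user, which is where the scaling concern above comes from.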

@ibuziuk
Member

ibuziuk commented Dec 13, 2019

The worry comes in where we have to scale to thousands of users.

Could we not go the other way round:

  1. Get all workspace pods on the cluster, e.g. oc get pod --all-namespaces -o wide --selector che.workspace_id
  2. (optional) Identify those which have been running for more than n hours (e.g. 24h).
  3. Check whether the workspace dedicated to the pod exists (based on the pod name) and check its state (we might need to use an admin account for that).
  4. If the workspace does not exist or its status is STOPPED, remove all resources labelled che.workspace_id=<workspaceId> in that user's namespace.
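A sketch of this cluster-wide variant (the pod tuples and the workspace_status lookup are hypothetical stand-ins for the oc query and the admin-level workspace status check):

```python
# Hedged sketch of the reversed flow: one cluster-wide query instead of one
# per user. Inputs are hypothetical stand-ins for
# `oc get pod --all-namespaces --selector che.workspace_id` and a workspace
# status lookup; times are plain hour counts to keep the sketch simple.

def find_stale_workspace_pods(pods, workspace_status, now_hours,
                              max_age_hours=24):
    """pods: iterable of (workspace_id, namespace, started_at_hours) tuples.
    workspace_status maps workspace_id -> status string; a missing entry
    means the workspace no longer exists."""
    stale = []
    for workspace_id, namespace, started_at in pods:
        # Step 2 (optional): only consider pods older than max_age_hours
        if now_hours - started_at < max_age_hours:
            continue
        # Steps 3-4: workspace missing or STOPPED means untracked resources
        status = workspace_status.get(workspace_id)
        if status is None or status == "STOPPED":
            stale.append((workspace_id, namespace))
    return stale

# Example: ws1 is old but RUNNING, ws2 is old and deleted, ws3 is fresh.
pods = [("ws1", "alice-che", 0), ("ws2", "bob-che", 0), ("ws3", "carol-che", 47)]
print(find_stale_workspace_pods(pods, {"ws1": "RUNNING"}, now_hours=48))
# -> [('ws2', 'bob-che')]
```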

@amisevsk
Collaborator

@ibuziuk The issue is an access problem:

get all workspace pods on the cluster e.g oc get pod --all-namespaces -o wide --selector che.workspace_id

This requires admin access to the tenant clusters, so we're no longer talking about something that runs in dsaas without new configuration (I don't know if SD supports this flow).

  • If we want to run somewhere in dsaas (whether as a service or as part of Che), we have to go through oso-proxy, which prevents something like oc get pod --all-namespaces AFAIK. Even if it were allowed, we would have to do something hacky like we do for the k8s-image-puller to check all the tenant clusters (we use four test accounts that we know proxy into the desired clusters).
  • If we don't care about being a dsaas service, then we have to deal with getting access to the Che database for step 3, and it's unclear how such a job would be managed/automated.

@ibuziuk ibuziuk removed this from the 7.6.0 milestone Dec 18, 2019
@ibuziuk
Member

ibuziuk commented Jul 15, 2020

Closing; untracked deployments are currently expected to be tracked manually.

@ibuziuk ibuziuk closed this as completed Jul 15, 2020