This repository has been archived by the owner on Mar 17, 2021. It is now read-only.

Tool for monitoring untracked workspace deployments #1690

Closed
1 of 4 tasks
skabashnyuk opened this issue Nov 11, 2019 · 11 comments

Comments

@skabashnyuk
Collaborator

Issue problem:
During testing of eclipse-che/che#15006 on che.openshift.io I ran into a situation where an exception occurred during workspace stop/delete. As a result:

  • Che no longer knows about the deployments (or possibly other resources) related to the workspace.
  • The resources are still running, and there is no code tracking their lifetime.
    [Screenshot at 09:44:01]
    I believe this is a complex issue that has happened before and will happen again. I suspect that right now there are multiple deployments that Che users are not aware of.

Red Hat Che version:

version: (help/about menu)

  • I can reproduce it on the latest official image

Reproduction Steps:

Describe how to reproduce the problem

Runtime:

runtime used:

  • minishift (include output of minishift version)
  • OpenShift.io
  • Openshift Container Platform (include output of oc version)
@skabashnyuk
Collaborator Author

I think #1691 can cause untracked workspace deployments too.

@ibuziuk
Member

ibuziuk commented Nov 12, 2019

Untracked deployments are a side effect of eclipse-che/che#15006.
I will proceed with cleanup once we have a fix in production.

@ibuziuk ibuziuk self-assigned this Nov 12, 2019
@skabashnyuk
Collaborator Author

This is not the first time we have had something like this, and it will not be the last. Is it possible to have a general solution to report and clean up such resources?

@amisevsk
Collaborator

is that possible to have some general solution to report and clean up such things?

It may be possible, but this would be hard to envision as a Che feature; how would we check users and their (unique) namespaces in a way that scales to thousands of users?

@ibuziuk ibuziuk removed their assignment Nov 27, 2019
@ibuziuk ibuziuk added this to the 7.6.0 milestone Nov 27, 2019
@amisevsk amisevsk self-assigned this Dec 4, 2019
@amisevsk
Collaborator

Before sinking significant time into implementing something along these lines, I think some more design discussion is required.

As I see it, there are three options for implementing this functionality, in terms of where and how this service would run.

  1. The workspace tracker runs separately from all clusters, potentially as a cron job or similar.

    Pros:

    • Fairly easy to implement -- could be as simple as, effectively:
      pods = oc get pods -l che.workspace_id
      for each pod in pods:
        check whether the pod's workspace is RUNNING in the Che database
        remove the workspace's resources if it isn't

    Cons:

    • Not clear how permissions would be managed (needs cluster-write access to the tenant clusters, plus the database secret and cluster access for the dsaas cluster).
    • Would have to manage multiple clusters and connections (get pods from tenant clusters, get running workspaces from dsaas database).
    • Managing deployment/running would require SD input
  2. Runs as a separate service in the dsaas/preview cluster

    Pros:

    • Same deployment strategy as other services (e.g. k8s-image-puller)
    • Could reuse existing functionality (DB connection, rhche secret) to automatically look into tenant clusters and get configuration that Che uses.

    Cons:

    • Another service, image, and CI to maintain.
    • Would have to share rhche secrets and would be another service using the rhche SA token
  3. Service is a scheduled job of Che server.

    Pros:

    • Could be implemented fairly easily since e.g. database communication logic is available
    • Would be easier to upstream, and simplify deployments if we chose to upstream it

    Cons:

    • Could be a long-running job that bogs down normal Che functionality just to find a few stray workspaces out of thousands
    • Upstream utility is not clear
    • If kept downstream, it would potentially need to be maintained for each Che version.
    • Upstreaming might not be meaningful, since namespaces are handled differently there.

Personally, I'm leaning towards option 2, but that's because we already know how to deploy and update such a service. I don't think it's suitable to plan this sort of thing for upstream, since the assumptions are very different.
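Whichever option is chosen, the core reconciliation step is the same. A minimal sketch in Python (the function and its inputs are hypothetical; a real service would feed it from `oc get pods -l che.workspace_id` and from the Che database):

```python
# Hypothetical sketch: the set difference between workspaces that have pods
# on the cluster and workspaces the Che DB considers RUNNING is exactly the
# set of untracked deployments.

def find_untracked(pod_workspace_ids, running_workspace_ids):
    """Workspace IDs that have pods on the cluster but no RUNNING entry
    in the Che database, i.e. untracked deployments."""
    return sorted(set(pod_workspace_ids) - set(running_workspace_ids))

# Example: two workspaces have pods, but only one is RUNNING in the DB.
print(find_untracked(["workspaceA1", "workspaceB2"], ["workspaceA1"]))
# -> ['workspaceB2']
```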

@ibuziuk ibuziuk changed the title Untracked workspace deployments Tool for monitoring untracked workspace deployments Dec 11, 2019
@ibuziuk
Member

ibuziuk commented Dec 11, 2019

@amisevsk IMO we should stick to a solution that could be reused upstream (not necessarily embedded in che-server; it could be an auxiliary deployment like k8s-image-puller).
Speaking about option 3 (the service as a scheduled job of the Che server): isn't that something @sleshchenko already implemented upstream, so that we just need to make sure it works properly on the Hosted Che side?

@ibuziuk
Member

ibuziuk commented Dec 12, 2019

Talked with @sleshchenko, and what we currently have upstream is RuntimeHangingDetector: https://github.com/eclipse/che/blob/master/infrastructures/kubernetes/src/main/java/org/eclipse/che/workspace/infrastructure/kubernetes/RuntimeHangingDetector.java

It tracks STARTING/STOPPING runtimes and forcibly stops them if they do not change status before a timeout is reached.
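Roughly, the check it performs looks like this (the actual implementation is the Java class linked above; this is only an illustrative Python sketch with made-up names):

```python
# Illustrative sketch of a hanging-runtime timeout check, loosely modelled on
# what RuntimeHangingDetector is described as doing. All names are
# hypothetical, not the real Che API.

def find_hanging_runtimes(runtimes, timeout_seconds, now_seconds):
    """runtimes: iterable of (workspace_id, status, in_status_since_seconds).
    Returns IDs stuck in STARTING/STOPPING longer than the timeout; those
    would be forcibly stopped."""
    return [
        workspace_id
        for workspace_id, status, since in runtimes
        if status in ("STARTING", "STOPPING")
        and now_seconds - since > timeout_seconds
    ]

runtimes = [("ws1", "STARTING", 0), ("ws2", "RUNNING", 0), ("ws3", "STOPPING", 900)]
print(find_hanging_runtimes(runtimes, timeout_seconds=600, now_seconds=1000))
# -> ['ws1']  (ws2 is RUNNING; ws3 has only been STOPPING for 100s)
```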

@amisevsk
Collaborator

amisevsk commented Dec 12, 2019

Yeah, the RuntimeHangingDetector is a different case, since there we can look for STARTING and STOPPING workspaces in the Che DB and track those; for untracked deployments, the workspace may not even be in the database as STOPPED, e.g. if the user has deleted it.

The flow I would follow for this would be:

  1. For each user, get everything with the label che.workspace_id in their <username>-che namespace.
  2. For each workspace ID found, check whether the workspace has a RUNNING entry in the database.
  3. If it doesn't, remove all resources labelled che.workspace_id=<workspaceId> in that user's namespace.

The worry comes in when we have to scale to thousands of users.
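The three steps can be sketched like this for a single user (the oc and database calls are stubbed as hypothetical callables; a real service would back them with the Kubernetes API and the Che DB connection):

```python
# Hedged sketch of the per-user flow. get_labelled_resources, is_running_in_db
# and delete_resources are hypothetical stubs, not real Che/OpenShift APIs.

def cleanup_user_namespace(username, get_labelled_resources,
                           is_running_in_db, delete_resources):
    namespace = f"{username}-che"
    # Step 1: everything labelled che.workspace_id in <username>-che,
    # grouped as {workspace_id: [resources]}
    resources_by_workspace = get_labelled_resources(namespace)
    removed = []
    for workspace_id, resources in resources_by_workspace.items():
        # Step 2: does this workspace have a RUNNING entry in the database?
        if not is_running_in_db(workspace_id):
            # Step 3: remove everything labelled che.workspace_id=<id>
            delete_resources(namespace, resources)
            removed.append(workspace_id)
    return removed

# Example with stubs: ws2 has pods but no RUNNING entry, so it gets cleaned.
deleted = []
removed = cleanup_user_namespace(
    "alice",
    get_labelled_resources=lambda ns: {"ws1": ["deploy/ws1"], "ws2": ["deploy/ws2"]},
    is_running_in_db=lambda wid: wid == "ws1",
    delete_resources=lambda ns, res: deleted.extend(res),
)
print(removed)  # -> ['ws2']
```

This is one namespace query plus up to one DB check per workspace, per user, which is where the scaling concern above comes from.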

@ibuziuk
Member

ibuziuk commented Dec 13, 2019

The worry comes in where we have to scale to thousands of users.

Could we not go the other way round:

  1. Get all workspace pods on the cluster, e.g. oc get pod --all-namespaces -o wide --selector che.workspace_id
  2. (optional) Identify those which have been running for more than n hours (e.g. 24h).
  3. Check whether the workspace dedicated to the pod exists (based on the pod name) and check its state (we might need to use an admin account for that).
  4. If the workspace does not exist or its status is STOPPED, remove all resources labelled che.workspace_id=<workspaceId> in that user's namespace.
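A sketch of this cluster-wide variant (the pod tuples and the workspace_status lookup are hypothetical stand-ins for the oc query and the admin-level workspace status check):

```python
# Hedged sketch of the reversed flow: one cluster-wide query instead of one
# per user. Inputs are hypothetical stand-ins for
# `oc get pod --all-namespaces --selector che.workspace_id` and a workspace
# status lookup; times are plain hour counts to keep the sketch simple.

def find_stale_workspace_pods(pods, workspace_status, now_hours,
                              max_age_hours=24):
    """pods: iterable of (workspace_id, namespace, started_at_hours) tuples.
    workspace_status maps workspace_id -> status string; a missing entry
    means the workspace no longer exists."""
    stale = []
    for workspace_id, namespace, started_at in pods:
        # Step 2 (optional): only consider pods older than max_age_hours
        if now_hours - started_at < max_age_hours:
            continue
        # Steps 3-4: workspace missing or STOPPED means untracked resources
        status = workspace_status.get(workspace_id)
        if status is None or status == "STOPPED":
            stale.append((workspace_id, namespace))
    return stale

# Example: ws1 is old but RUNNING, ws2 is old and deleted, ws3 is fresh.
pods = [("ws1", "alice-che", 0), ("ws2", "bob-che", 0), ("ws3", "carol-che", 47)]
print(find_stale_workspace_pods(pods, {"ws1": "RUNNING"}, now_hours=48))
# -> [('ws2', 'bob-che')]
```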

@amisevsk
Collaborator

@ibuziuk The issue is an access problem:

get all workspace pods on the cluster e.g oc get pod --all-namespaces -o wide --selector che.workspace_id

This requires admin access to the tenant clusters, so we're no longer talking about something that runs in dsaas without new configuration (I don't know if SD supports this flow).

  • If we want to run somewhere in dsaas (whether as a service or as part of Che), we have to go through oso-proxy, which prevents something like oc get pod --all-namespaces AFAIK. Even if it were allowed, we would have to do something hacky like we do for the k8s-image-puller to check all the tenant clusters (we use four test accounts that we know proxy into the desired clusters).
  • If we don't care about being a dsaas service, then we have to deal with getting access to the Che database for step 3, and it's unclear how such a job would be managed/automated.

@ibuziuk ibuziuk removed this from the 7.6.0 milestone Dec 18, 2019
@ibuziuk
Member

ibuziuk commented Jul 15, 2020

Closing; untracked deployments are currently expected to be tracked manually.

@ibuziuk ibuziuk closed this as completed Jul 15, 2020