
Additional AWS volumes not deleted when using --destroy option when cluster name is longer than 21 characters #75

Closed
mbukatov opened this issue May 21, 2019 · 12 comments

Comments

@mbukatov
Contributor

mbukatov commented May 21, 2019

Description of the problem

Extra volumes are not deleted during cluster destroy.

Steps to reproduce

  1. Create cluster, but disable cluster destroy: python run.py --suite suites/ocs_basic_install.yml --log-level info --cluster-name=loginname-ocs4-testing-cluster --no-email --no-destroy
  2. Destroy cluster via: python run.py --log-level DEBUG --cluster-path /tmp/loginname-ocs4-testing-cluster-88528/ --destroy

Actual results

The 3 extra volumes created in step 1 still exist:

mbukatov-ocs4-testing-fmc7p-worker-us-east-2a-9hr9r_extra_volume
mbukatov-ocs4-testing-fmc7p-worker-us-east-2b-rgfx5_extra_volume
mbukatov-ocs4-testing-fmc7p-worker-us-east-2c-97vjn_extra_volume

Expected results

There are no extra volumes left.

@mbukatov
Contributor Author

Note: @dahorak has seen this in CI today as well.

@clacroix12
Contributor

@mbukatov do you have logs for the run? We should be logging the volumes we found with the pattern we are using. It would be helpful to see if there's a disconnect between your cluster name and what we're using to pattern match the volumes.
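For reference, a minimal sketch of what such a tag-based volume lookup could look like with boto3; the helper name, pattern, and region here are assumptions for illustration, not the exact code ocs-ci runs:

import boto3

def find_extra_volumes_by_pattern(pattern, region="us-east-2"):
    # Hypothetical illustration: search EBS volumes whose Name tag matches
    # the given wildcard pattern, the way a destroy step might look for
    # leftover extra volumes. Pattern and region are assumptions.
    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.describe_volumes(
        Filters=[{"Name": "tag:Name", "Values": [pattern]}]
    )
    return [v["VolumeId"] for v in response["Volumes"]]

# e.g. find_extra_volumes_by_pattern("loginname-ocs4-testing-cluster*")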

@mbukatov
Contributor Author

mbukatov commented May 21, 2019

@clacroix12 Do you mean logs from the installation or the destroy run? I kept the logs from the installation only. Edit: Ah, that should be good enough. I will locate and check the logs tomorrow.

@clacroix12
Contributor

clacroix12 commented May 21, 2019

@mbukatov also are you sure that those volumes are for this exact run? I just spun up a cluster and the name of my volumes includes the random id that we generate.

ocs-ci cmd:

python run.py --suite suites/ocs_basic_install.yml --cluster-name clacroix-test --no-email --no-destroy

worker instance names:

clacroix-test-65942-qbk7z-worker-us-east-2b-d2gfh
clacroix-test-65942-qbk7z-worker-us-east-2a-rz9tt
clacroix-test-65942-qbk7z-worker-us-east-2c-dds5h

extra volume names:

clacroix-test-65942-qbk7z-worker-us-east-2b-d2gfh_extra_volume
clacroix-test-65942-qbk7z-worker-us-east-2a-rz9tt_extra_volume
clacroix-test-65942-qbk7z-worker-us-east-2c-dds5h_extra_volume

Notice how both have 65942 after the name I provided with --cluster-name. Your extra volumes do not include this string, which makes me think they're actually not the volumes created with that installation.

I ran the same commands as you, only changing the provided cluster name, and all volumes were destroyed. I verified from my logs that the pattern being used to find volumes was clacroix-test-65942* and all three of my extra volumes were found.

destroy cmd:

python run.py --log-level DEBUG --cluster-path /tmp/clacroix-test-65942 --destroy

I was not able to reproduce this issue.

@nehaberry
Contributor

@clacroix12 +1

The cluster destroy worked for me and deleted the 3 extra volumes as part of cleanup, so I was also unable to reproduce this issue.

This was the command I used to create the cluster:
python run.py --cluster-name=nberry-may17 --suite=suites/ocs_basic_install.yml --cluster-path=/home/nberry/aws-install/may21 --no-email --no-destroy

Command to destroy the cluster:
python run.py --destroy --cluster-path /home/nberry/aws-install/may21/

The extra volumes were deleted:

nberry-may21-dsltb-worker-us-east-2a-9ns2k_extra_volume
nberry-may21-dsltb-worker-us-east-2b-k55kt_extra_volume
nberry-may21-dsltb-worker-us-east-2c-7sr59_extra_volume

@dahorak
Contributor

dahorak commented May 22, 2019

I have a theory about why it didn't work for me that one time... but I haven't been able to install a cluster since then, so I haven't been able to reproduce it yet.
If I manage to reproduce it, in my case it will probably be connected with the length of the cluster name.

@dahorak
Contributor

dahorak commented May 22, 2019

I was able to reproduce it, and as I wrote above, the reason in my case might be the long cluster name.
The cluster is created using following command:

 python3 run.py --log-level DEBUG --suite suites/ocs_basic_install.yml --cluster-name ocs-ci-jenkins-dahorak-cluster --cluster-path /tmp/ocs-ci-jenkins-dahorak-cluster --no-destroy

And should be destroyed via this command:

python3 run.py --log-level DEBUG --cluster-path /tmp/ocs-ci-jenkins-dahorak-cluster --destroy

But in AWS I can see that it shrinks the cluster name, so for example the instances are named this way:

ocs-ci-jenkins-dahora-bvxbp-master-0
ocs-ci-jenkins-dahora-bvxbp-master-1
ocs-ci-jenkins-dahora-bvxbp-master-2
ocs-ci-jenkins-dahora-bvxbp-worker-us-east-2a-kmvqj
ocs-ci-jenkins-dahora-bvxbp-worker-us-east-2b-fkcsw
ocs-ci-jenkins-dahora-bvxbp-worker-us-east-2c-xrcmk

And after the destroy command, the following 3 volumes remain in AWS:

ocs-ci-jenkins-dahora-bvxbp-worker-us-east-2a-kmvqj_extra_volume  vol-03b9b83a1ef047e70 100 GiB gp2 300 May 22, 2019 at 10:43:30 AM UTC+2 us-east-2a  available None  Okay  Not Encrypted
ocs-ci-jenkins-dahora-bvxbp-worker-us-east-2b-fkcsw_extra_volume  vol-059392d8112372476 100 GiB gp2 300 May 22, 2019 at 10:43:18 AM UTC+2 us-east-2b  available None  Okay  Not Encrypted
ocs-ci-jenkins-dahora-bvxbp-worker-us-east-2c-xrcmk_extra_volume  vol-0fb20441fb036b461 100 GiB gp2 300 May 22, 2019 at 10:43:24 AM UTC+2 us-east-2c  available None

The full output of the destroy command is here:
https://gist.github.com/dahorak/59e1a4d8e7c31ac01773204f7855f6d4

@mbukatov
Contributor Author

I located the logs from 2019-05-20, as the volumes listed in the description of this issue belong to the cluster I created on that date.

This can be demonstrated via:

$ cd cluster-2019-05-20/
$ grep mbukatov-ocs4-testing-fmc7p-worker-us-east-2a-9hr9r_extra_volume cephci-run-1558358218636/install_OCS_0.log | wc -l
15

And the first line where creation of the volume is logged:

2019-05-20 15:51:30,404 - botocore.endpoint - DEBUG - Making request for OperationModel(name=CreateVolume) with params: {'url_path': '/', 'query_string': '', 'method': 'POST', 'headers': {'Content-Type': 'application/x-www-form-urlencoded; charset=utf-8', 'User-Agent': 'Boto3/1.9.143 Python/3.7.3 Linux/5.0.13-300.fc30.x86_64 Botocore/1.12.143'}, 'body': {'Action': 'CreateVolume', 'Version': '2016-11-15', 'AvailabilityZone': 'us-east-2a', 'Encrypted': 'false', 'Size': 100, 'VolumeType': 'gp2', 'TagSpecification.1.ResourceType': 'volume', 'TagSpecification.1.Tag.1.Key': 'Name', 'TagSpecification.1.Tag.1.Value': 'mbukatov-ocs4-testing-fmc7p-worker-us-east-2a-9hr9r_extra_volume'}, 'url': 'https://ec2.us-east-2.amazonaws.com/', 'context': {'client_region': 'us-east-2', 'client_config': <botocore.config.Config object at 0x7f7c268154e0>, 'has_streaming_input': False, 'auth_type': None}}
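For context, the logged request corresponds to a boto3 call roughly like the following; this is reconstructed from the parameters above for readability, not the actual ocs-ci code:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-2")
# Reconstructed from the CreateVolume parameters in the log entry above.
ec2.create_volume(
    AvailabilityZone="us-east-2a",
    Encrypted=False,
    Size=100,
    VolumeType="gp2",
    TagSpecifications=[{
        "ResourceType": "volume",
        "Tags": [{
            "Key": "Name",
            "Value": "mbukatov-ocs4-testing-fmc7p-worker-us-east-2a-9hr9r_extra_volume",
        }],
    }],
)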

Daniel's observation about the long cluster name is a good catch. As you can see, this applies to my case as well. The full name of my cluster was mbukatov-ocs4-testing-cluster-88528, as can be seen here:

2019-05-20 15:16:58,818 - utility.utils - INFO - Executing command: mkdir -p /tmp/mbukatov-ocs4-testing-cluster-88528

while the volume name prefix is just mbukatov-ocs4-testing.

It seems to me that we should delete the extra volumes using cluster_id instead of cluster_name, because that is what is actually used as the prefix in the names of all AWS resources. We could also consider rejecting cluster names that are too long.
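For illustration, a rough sketch of that idea, assuming the infra ID can be read from the metadata.json that openshift-install writes into the cluster directory (the field name and file layout are assumptions to verify against the installer version in use):

import json
import os

import boto3

def get_infra_id(cluster_path):
    # Assumption: openshift-install leaves a metadata.json with an "infraID"
    # field (e.g. "mbukatov-ocs4-testing-fmc7p") in the cluster directory.
    with open(os.path.join(cluster_path, "metadata.json")) as f:
        return json.load(f)["infraID"]

def find_extra_volumes(cluster_path, region="us-east-2"):
    # Filter on the infra ID prefix instead of the (possibly truncated)
    # cluster name, so the pattern matches what AWS actually stores.
    infra_id = get_infra_id(cluster_path)
    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.describe_volumes(
        Filters=[{"Name": "tag:Name", "Values": [f"{infra_id}*"]}]
    )
    return [v["VolumeId"] for v in response["Volumes"]]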

@mbukatov mbukatov changed the title Additional AWS volumes not deleted when using --destroy option Additional AWS volumes not deleted when using --destroy option when cluster name is longer than 21 characters May 22, 2019
@clacroix12
Contributor

@mbukatov @dahorak It looks like either terraform or openshift-install truncates the name if it's too long, so this does actually create an issue if we're using the name we think the volumes will have to pattern match them. I like the idea of using the cluster_id instead. We could also limit the name length so we don't end up with a cluster directory name that doesn't match the corresponding cluster in AWS. @petr-balogh thoughts?
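A minimal sketch of the name-length guard mentioned above; the 21-character limit is only inferred from the truncation observed in this issue, not taken from installer documentation:

# Hypothetical guard for the CLI layer; 21 is inferred from the truncated
# names observed in AWS and should be confirmed against the installer.
MAX_CLUSTER_NAME_LENGTH = 21

def validate_cluster_name(cluster_name):
    if len(cluster_name) > MAX_CLUSTER_NAME_LENGTH:
        raise ValueError(
            f"Cluster name '{cluster_name}' has {len(cluster_name)} characters; "
            f"names longer than {MAX_CLUSTER_NAME_LENGTH} get truncated in AWS "
            "resource names, which breaks volume cleanup on --destroy."
        )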

@petr-balogh
Member

@clacroix12 Yep, I think we should propagate this cluster_id from the cluster directory to ENV_DATA and use it for deleting the extra volumes, which will solve the issue. Regarding the cluster name, I think we should also limit its length, since we are also appending a random CID to the name. Currently the CID is a 5-digit integer; we could shrink it to 3 characters containing both letters and digits to get more combinations.
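A sketch of the shorter alphanumeric suffix idea; the helper name and alphabet are illustrative only:

import random
import string

def generate_cluster_id(length=3):
    # Illustrative only: 3 mixed-case alphanumeric characters give
    # 62**3 = 238,328 combinations, more than the 90,000 possible
    # 5-digit integers, while shaving two characters off the name.
    alphabet = string.ascii_letters + string.digits
    return "".join(random.choices(alphabet, k=length))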

@clacroix12
Contributor

@petr-balogh the purpose of the 5 digits we add to the cluster name was to ensure that the created directory is unique, because by default all cluster directories go to /tmp. However, if we started placing the cluster directory inside the "run directory" that we create for logs and potentially other artifacts of the entire automation run, we wouldn't need the 5-digit cid anymore. Since we would have a new parent directory for each individual automation run, we wouldn't need to worry about uniqueness at the cluster directory level.

Currently the cluster dir exists here:

/tmp/my-cluster-name-12345

Proposed solution:

/logs-ceph-mount-point/ocsci-jenkins/run-1559079670/my-cluster-name-12345

Note that run-1559079670 is an example of what our "run" directory would be. Basically run-<timestamp> or something similar, ensuring uniqueness between automation runs.

Long story short, we can cut the 5 digits out of the equation if we wanted to use the run directory as the parent for cluster directories. This obviously doesn't cover the case where the user provides a cluster_path at the CLI level, but we could always do something like enforce that the directory is already created before specifying it.
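A sketch of how the proposed layout could be assembled; the base path and naming follow the example above and are not a committed design:

import os
import time

def build_cluster_path(base_dir, cluster_name):
    # Assumption: base_dir is something like /logs-ceph-mount-point/ocsci-jenkins.
    # A timestamped run directory keeps automation runs unique, so the
    # cluster directory itself no longer needs a random 5-digit suffix.
    run_dir = f"run-{int(time.time())}"
    return os.path.join(base_dir, run_dir, cluster_name)

# e.g. build_cluster_path("/logs-ceph-mount-point/ocsci-jenkins", "my-cluster-name")
# -> /logs-ceph-mount-point/ocsci-jenkins/run-1559079670/my-cluster-name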

@vasukulkarni
Contributor

This should be fixed; it's an old issue.
