
Additional AWS volumes not deleted when using --destroy option when cluster name is longer than 21 characters #75

Closed
mbukatov opened this issue May 21, 2019 · 12 comments

Comments

@mbukatov
Contributor

mbukatov commented May 21, 2019

Description of the problem

Extra volumes are not deleted during cluster destroy.

Steps to reproduce

  1. Create cluster, but disable cluster destroy: python run.py --suite suites/ocs_basic_install.yml --log-level info --cluster-name=loginname-ocs4-testing-cluster --no-email --no-destroy
  2. Destroy cluster via: python run.py --log-level DEBUG --cluster-path /tmp/loginname-ocs4-testing-cluster-88528/ --destroy

Actual results

The 3 extra volumes created in step 1 still exist:

mbukatov-ocs4-testing-fmc7p-worker-us-east-2a-9hr9r_extra_volume
mbukatov-ocs4-testing-fmc7p-worker-us-east-2b-rgfx5_extra_volume
mbukatov-ocs4-testing-fmc7p-worker-us-east-2c-97vjn_extra_volume

Expected results

There are no extra volumes left.

@mbukatov
Contributor Author

Note: @dahorak has seen this in CI today as well.

@clacroix12
Contributor

@mbukatov do you have logs for the run? We should be logging the volumes we found with the pattern we are using. It would be helpful to see if there's a disconnect between your cluster name and what we're using to pattern match the volumes.
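For reference, a minimal sketch of what such a tag-based volume lookup could look like with boto3; the helper name, pattern, and region here are assumptions for illustration, not the exact code ocs-ci runs:

import boto3

def find_extra_volumes_by_pattern(pattern, region="us-east-2"):
    # Hypothetical illustration: search EBS volumes whose Name tag matches
    # the given wildcard pattern, the way a destroy step might look for
    # leftover extra volumes. Pattern and region are assumptions.
    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.describe_volumes(
        Filters=[{"Name": "tag:Name", "Values": [pattern]}]
    )
    return [v["VolumeId"] for v in response["Volumes"]]

# e.g. find_extra_volumes_by_pattern("loginname-ocs4-testing-cluster*")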

@mbukatov
Contributor Author

mbukatov commented May 21, 2019

@clacroix12 Do you mean logs from the installation or the destroy run? I kept the logs from the installation only. Edit: Ah, that should be good enough. I will locate and check the logs tomorrow.

@clacroix12
Contributor

clacroix12 commented May 21, 2019

@mbukatov also are you sure that those volumes are for this exact run? I just spun up a cluster and the name of my volumes includes the random id that we generate.

ocs-ci cmd:

python run.py --suite suites/ocs_basic_install.yml --cluster-name clacroix-test --no-email --no-destroy

worker instance names:

clacroix-test-65942-qbk7z-worker-us-east-2b-d2gfh
clacroix-test-65942-qbk7z-worker-us-east-2a-rz9tt
clacroix-test-65942-qbk7z-worker-us-east-2c-dds5h

extra volume names:

clacroix-test-65942-qbk7z-worker-us-east-2b-d2gfh_extra_volume
clacroix-test-65942-qbk7z-worker-us-east-2a-rz9tt_extra_volume
clacroix-test-65942-qbk7z-worker-us-east-2c-dds5h_extra_volume

Notice how both have 65942 after the name I provided with --cluster-name. Your extra volumes do not include this string, which makes me think they're actually not the volumes created with that installation.

I ran the same commands as you, only changing the provided cluster name, and all volumes were destroyed. I verified from my logs that the pattern being used to find volumes was clacroix-test-65942* and all three of my extra volumes were found.

destroy cmd:

python run.py --log-level DEBUG --cluster-path /tmp/clacroix-test-65942 --destroy

I was not able to reproduce this issue.

@nehaberry
Contributor

@clacroix12 +1

The cluster destroy worked for me and deleted the 3 extra volumes as part of cleanup, so I was also unable to reproduce this issue.

This was the command I used to create the cluster:
python run.py --cluster-name=nberry-may17 --suite=suites/ocs_basic_install.yml --cluster-path=/home/nberry/aws-install/may21 --no-email --no-destroy

Command to destroy the cluster:
python run.py --destroy --cluster-path /home/nberry/aws-install/may21/

The extra volumes were deleted:

nberry-may21-dsltb-worker-us-east-2a-9ns2k_extra_volume
nberry-may21-dsltb-worker-us-east-2b-k55kt_extra_volume
nberry-may21-dsltb-worker-us-east-2c-7sr59_extra_volume

@dahorak
Contributor

dahorak commented May 22, 2019

I have a theory about why it didn't work for me that one time... but I haven't been able to install a cluster since then, so I haven't been able to reproduce it yet.
If I manage to reproduce it, in my case it will probably be connected with the length of the cluster name.

@dahorak
Contributor

dahorak commented May 22, 2019

I was able to reproduce it, and as I wrote above, the reason in my case might be the long cluster name.
The cluster is created using following command:

 python3 run.py --log-level DEBUG --suite suites/ocs_basic_install.yml --cluster-name ocs-ci-jenkins-dahorak-cluster --cluster-path /tmp/ocs-ci-jenkins-dahorak-cluster --no-destroy

And should be destroyed via this command:

python3 run.py --log-level DEBUG --cluster-path /tmp/ocs-ci-jenkins-dahorak-cluster --destroy

But in AWS I can see that it shrinks the cluster name, so for example the instances are named this way:

ocs-ci-jenkins-dahora-bvxbp-master-0
ocs-ci-jenkins-dahora-bvxbp-master-1
ocs-ci-jenkins-dahora-bvxbp-master-2
ocs-ci-jenkins-dahora-bvxbp-worker-us-east-2a-kmvqj
ocs-ci-jenkins-dahora-bvxbp-worker-us-east-2b-fkcsw
ocs-ci-jenkins-dahora-bvxbp-worker-us-east-2c-xrcmk

And after the destroy command, the following 3 volumes remain in AWS:

ocs-ci-jenkins-dahora-bvxbp-worker-us-east-2a-kmvqj_extra_volume  vol-03b9b83a1ef047e70 100 GiB gp2 300 May 22, 2019 at 10:43:30 AM UTC+2 us-east-2a  available None  Okay  Not Encrypted
ocs-ci-jenkins-dahora-bvxbp-worker-us-east-2b-fkcsw_extra_volume  vol-059392d8112372476 100 GiB gp2 300 May 22, 2019 at 10:43:18 AM UTC+2 us-east-2b  available None  Okay  Not Encrypted
ocs-ci-jenkins-dahora-bvxbp-worker-us-east-2c-xrcmk_extra_volume  vol-0fb20441fb036b461 100 GiB gp2 300 May 22, 2019 at 10:43:24 AM UTC+2 us-east-2c  available None

The full output of the destroy command is here:
https://gist.github.com/dahorak/59e1a4d8e7c31ac01773204f7855f6d4

@mbukatov
Contributor Author

I located the logs from 2019-05-20, as the volumes listed in the description of this issue belong to the cluster I created on that date.

This can be demonstrated via:

$ cd cluster-2019-05-20/
$ grep mbukatov-ocs4-testing-fmc7p-worker-us-east-2a-9hr9r_extra_volume cephci-run-1558358218636/install_OCS_0.log | wc -l
15

And the first line where creation of the volume is logged:

2019-05-20 15:51:30,404 - botocore.endpoint - DEBUG - Making request for OperationModel(name=CreateVolume) with params: {'url_path': '/', 'query_string': '', 'method': 'POST', 'headers': {'Content-Type': 'application/x-www-form-urlencoded; charset=utf-8', 'User-Agent': 'Boto3/1.9.143 Python/3.7.3 Linux/5.0.13-300.fc30.x86_64 Botocore/1.12.143'}, 'body': {'Action': 'CreateVolume', 'Version': '2016-11-15', 'AvailabilityZone': 'us-east-2a', 'Encrypted': 'false', 'Size': 100, 'VolumeType': 'gp2', 'TagSpecification.1.ResourceType': 'volume', 'TagSpecification.1.Tag.1.Key': 'Name', 'TagSpecification.1.Tag.1.Value': 'mbukatov-ocs4-testing-fmc7p-worker-us-east-2a-9hr9r_extra_volume'}, 'url': 'https://ec2.us-east-2.amazonaws.com/', 'context': {'client_region': 'us-east-2', 'client_config': <botocore.config.Config object at 0x7f7c268154e0>, 'has_streaming_input': False, 'auth_type': None}}
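For context, the logged request corresponds to a boto3 call roughly like the following; this is reconstructed from the parameters above for readability, not the actual ocs-ci code:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-2")
# Reconstructed from the CreateVolume parameters in the log entry above.
ec2.create_volume(
    AvailabilityZone="us-east-2a",
    Encrypted=False,
    Size=100,
    VolumeType="gp2",
    TagSpecifications=[{
        "ResourceType": "volume",
        "Tags": [{
            "Key": "Name",
            "Value": "mbukatov-ocs4-testing-fmc7p-worker-us-east-2a-9hr9r_extra_volume",
        }],
    }],
)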

Daniel's observation about the long cluster name is a good catch. As you can see, this applies to my case as well. The full name of my cluster was mbukatov-ocs4-testing-cluster-88528, as can be seen here:

2019-05-20 15:16:58,818 - utility.utils - INFO - Executing command: mkdir -p /tmp/mbukatov-ocs4-testing-cluster-88528

while the volume name prefix is just mbukatov-ocs4-testing.

It seems to me that we should delete the extra volumes using cluster_id instead of cluster_name, because that is what is actually used as the prefix in the names of all AWS resources. We could also consider rejecting cluster names that are too long.
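For illustration, a rough sketch of that idea, assuming the infra ID can be read from the metadata.json that openshift-install writes into the cluster directory (the field name and file layout are assumptions to verify against the installer version in use):

import json
import os

import boto3

def get_infra_id(cluster_path):
    # Assumption: openshift-install leaves a metadata.json with an "infraID"
    # field (e.g. "mbukatov-ocs4-testing-fmc7p") in the cluster directory.
    with open(os.path.join(cluster_path, "metadata.json")) as f:
        return json.load(f)["infraID"]

def find_extra_volumes(cluster_path, region="us-east-2"):
    # Filter on the infra ID prefix instead of the (possibly truncated)
    # cluster name, so the pattern matches what AWS actually stores.
    infra_id = get_infra_id(cluster_path)
    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.describe_volumes(
        Filters=[{"Name": "tag:Name", "Values": [f"{infra_id}*"]}]
    )
    return [v["VolumeId"] for v in response["Volumes"]]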

@mbukatov mbukatov changed the title Additional AWS volumes not deleted when using --destroy option Additional AWS volumes not deleted when using --destroy option when cluster name is longer than 21 characters May 22, 2019
@clacroix12
Contributor

@mbukatov @dahorak It looks like either terraform or openshift-install truncates the name if it's too long, so this does actually create an issue if we're using the name we think the volumes will have to pattern match them. I like the idea of using the cluster_id instead. We could also limit the name length so we don't end up with a cluster directory name that doesn't match the corresponding cluster in AWS. @petr-balogh thoughts?
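A minimal sketch of the name-length guard mentioned above; the 21-character limit is only inferred from the truncation observed in this issue, not taken from installer documentation:

# Hypothetical guard for the CLI layer; 21 is inferred from the truncated
# names observed in AWS and should be confirmed against the installer.
MAX_CLUSTER_NAME_LENGTH = 21

def validate_cluster_name(cluster_name):
    if len(cluster_name) > MAX_CLUSTER_NAME_LENGTH:
        raise ValueError(
            f"Cluster name '{cluster_name}' has {len(cluster_name)} characters; "
            f"names longer than {MAX_CLUSTER_NAME_LENGTH} get truncated in AWS "
            "resource names, which breaks volume cleanup on --destroy."
        )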

@petr-balogh
Member

@clacroix12 Yep, I think we should propagate this cluster_id from the cluster directory to ENV_DATA and use it for deleting the extra volumes, which will solve the issue. Regarding the cluster name, I think we should also limit its length, since we are also appending a random CID to the name. Currently the CID is a 5-digit integer; we could shrink it to 3 characters containing both letters and digits to get more combinations.
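A sketch of the shorter alphanumeric suffix idea; the helper name and alphabet are illustrative only:

import random
import string

def generate_cluster_id(length=3):
    # Illustrative only: 3 mixed-case alphanumeric characters give
    # 62**3 = 238,328 combinations, more than the 90,000 possible
    # 5-digit integers, while shaving two characters off the name.
    alphabet = string.ascii_letters + string.digits
    return "".join(random.choices(alphabet, k=length))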

@clacroix12
Contributor

@petr-balogh the purpose of the 5 digits we add to the cluster name was to ensure that the created directory is unique, because by default all cluster directories go to /tmp. However, if we started placing the cluster directory inside the "run directory" that we create for logs and potentially other artifacts of the entire automation run, we wouldn't need the 5-digit cid anymore. Since we would have a new parent directory for each individual automation run, we wouldn't need to worry about uniqueness at the cluster directory level.

Currently the cluster dir exists here:

/tmp/my-cluster-name-12345

Proposed solution:

/logs-ceph-mount-point/ocsci-jenkins/run-1559079670/my-cluster-name-12345

Note that run-1559079670 is an example of what our "run" directory would be. Basically run-<timestamp> or something similar, ensuring uniqueness between automation runs.

Long story short, we can cut the 5 digits out of the equation if we wanted to use the run directory as the parent for cluster directories. This obviously doesn't cover the case where the user provides a cluster_path at the CLI level, but we could always do something like enforce that the directory is already created before specifying it.
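A sketch of how the proposed layout could be assembled; the base path and naming follow the example above and are not a committed design:

import os
import time

def build_cluster_path(base_dir, cluster_name):
    # Assumption: base_dir is something like /logs-ceph-mount-point/ocsci-jenkins.
    # A timestamped run directory keeps automation runs unique, so the
    # cluster directory itself no longer needs a random 5-digit suffix.
    run_dir = f"run-{int(time.time())}"
    return os.path.join(base_dir, run_dir, cluster_name)

# e.g. build_cluster_path("/logs-ceph-mount-point/ocsci-jenkins", "my-cluster-name")
# -> /logs-ceph-mount-point/ocsci-jenkins/run-1559079670/my-cluster-name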

@vasukulkarni
Contributor

This should be fixed; it's an old issue.
