etcd-snapshot save without local state file waits for state file to get copied #2492

Closed
superseb opened this issue Mar 25, 2021 · 3 comments

Comments

@superseb
Contributor

superseb commented Mar 25, 2021

RKE version:
v1.2.6

Steps to Reproduce:

  1. Create cluster.yml
  2. Run rke up
  3. Remove cluster.rkestate
  4. Run rke etcd-snapshot save

Results:
The process tries to copy the state file and waits for it to be copied (around 20-30 seconds). It still creates the snapshot successfully, but waiting for the state file is unnecessary.

Without a local state file, we should probably error out to indicate that the state file is missing, add a flag to override this for the case where only the etcd snapshot is needed, and skip the copy-state step to save time (see the sketch below).

gz#14991
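
A minimal sketch of the proposed flow, assuming hypothetical function and flag names (this is not the actual RKE code):

```go
// Hypothetical sketch of the proposed snapshot-save behavior; all names
// and signatures are illustrative, not the actual RKE implementation.
package main

import (
	"fmt"
	"os"
)

const stateFilePath = "./cluster.rkestate"

// saveSnapshot fails fast when the local state file is missing, unless
// ignoreMissingState (the proposed override flag) is set, in which case
// the state-file deploy step is skipped entirely.
func saveSnapshot(ignoreMissingState bool) error {
	if _, err := os.Stat(stateFilePath); os.IsNotExist(err) {
		if !ignoreMissingState {
			// Error out instead of waiting ~30s for a copy that can never succeed.
			return fmt.Errorf("cluster state file [%s] not found; use the override flag to snapshot without it", stateFilePath)
		}
		fmt.Printf("WARN state file [%s] missing, skipping state file deploy\n", stateFilePath)
		return runEtcdSnapshot() // snapshot only, no cluster-state-deployer step
	}
	if err := deployStateFile(); err != nil {
		return err
	}
	return runEtcdSnapshot()
}

// Stand-ins for the real deploy and snapshot steps.
func deployStateFile() error { fmt.Println("deploying state file"); return nil }
func runEtcdSnapshot() error { fmt.Println("saving etcd snapshot"); return nil }

func main() {
	if err := saveSnapshot(true); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```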

@superseb
Contributor Author

This needs to be tested to confirm that, without a state file, the logging about waiting for the state file to be copied is no longer seen.

A regular snapshot-save with the state file present should still work as normal.
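
A rough sketch of the kind of assertion such a test could make, with a stubbed runner standing in for the real command (hypothetical names; not RKE's actual test suite):

```go
// Hypothetical test sketch: run snapshot-save without a state file and
// assert that the "Waiting for file" log line never appears. The stub
// runSnapshotSave stands in for invoking `rke etcd snapshot-save` and
// capturing its log output; it is not part of RKE.
package main

import (
	"strings"
	"testing"
)

func runSnapshotSave(stateFilePresent bool) string {
	// In a real test this would execute the CLI and return its logs.
	if stateFilePresent {
		return "Waiting for file [/etc/kubernetes/cluster.rkestate] to be successfully copied"
	}
	return "Snapshot will be created without cluster state file"
}

func TestSnapshotSaveWithoutStateFile(t *testing.T) {
	if out := runSnapshotSave(false); strings.Contains(out, "Waiting for file") {
		t.Fatalf("snapshot-save should not wait for the state file copy, got: %q", out)
	}
}
```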

@anupama2501

Reproduced on RKE version v1.2.6:

  1. On an AWS EC2 instance, install Docker.
  2. Create a cluster.yml; in the YAML file, enter the IP address and internal address of the instance.
  3. Run rke up and, once the process finishes successfully, run rke etcd snapshot-save:
WARN[0000] Name of the snapshot is not specified using [rke_etcd_snapshot_2021-05-28T08:17:05-07:00]
INFO[0000] Starting saving snapshot on etcd hosts
INFO[0000] [dialer] Setup tunnel for host []
INFO[0000] Removing container [cluster-state-deployer] on host [], try #1
INFO[0001] [remove/cluster-state-deployer] Successfully removed container on host []
INFO[0001] [state] Deploying state file to [/etc/kubernetes/rke_etcd_snapshot_2021-05-28T08:17:05-07:00.rkestate] on host []
INFO[0001] Image [rancher/rke-tools:v0.1.72] exists on host []
INFO[0002] Starting container [cluster-state-deployer] on host [], try #1
INFO[0002] [state] Successfully started [cluster-state-deployer] container on host []
INFO[0003] Waiting for [cluster-state-deployer] container to exit on host []
INFO[0003] Container [cluster-state-deployer] is still running on host []: stderr: [], stdout: [Waiting for file [/etc/kubernetes/cluster.rkestate] to be successfully copied to this container, retry count 1
]
INFO[0004] Waiting for [cluster-state-deployer] container to exit on host []
INFO[0004] Container [cluster-state-deployer] is still running on host []: stderr: [], stdout: [Waiting for file [/etc/kubernetes/cluster.rkestate] to be successfully copied to this container, retry count 1
]
INFO[0005] Waiting for [cluster-state-deployer] container to exit on host []
INFO[0005] Container [cluster-state-deployer] is still running on host []: stderr: [], stdout: [Waiting for file [/etc/kubernetes/cluster.rkestate] to be successfully copied to this container, retry count 1
]
INFO[0006] Waiting for [cluster-state-deployer] container to exit on host []
INFO[0006] Container [cluster-state-deployer] is still running on host []: stderr: [], stdout: [Waiting for file [/etc/kubernetes/cluster.rkestate] to be successfully copied to this container, retry count 1
]
INFO[0007] Waiting for [cluster-state-deployer] container to exit on host []
INFO[0008] Removing container [cluster-state-deployer] on host [], try #1
INFO[0008] [remove/cluster-state-deployer] Successfully removed container on host []
INFO[0008] [etcd] Running snapshot save once on host []
INFO[0008] Image [rancher/rke-tools:v0.1.72] exists on host []
INFO[0009] Starting container [etcd-snapshot-once] on host [], try #1
INFO[0009] [etcd] Successfully started [etcd-snapshot-once] container on host []
INFO[0009] Waiting for [etcd-snapshot-once] container to exit on host []
INFO[0009] Container [etcd-snapshot-once] is still running on host []: stderr: [time="2021-05-28T15:17:15Z" level=info msg="Initializing Onetime Backup" name="rke_etcd_snapshot_2021-05-28T08:17:05-07:00"
], stdout: []
INFO[0010] Waiting for [etcd-snapshot-once] container to exit on host []
INFO[0011] Removing container [etcd-snapshot-once] on host [], try #1
INFO[0011] Finished saving/uploading snapshot [rke_etcd_snapshot_2021-05-28T08:17:05-07:00] on all etcd hosts

Then removed cluster.rkestate and ran rke etcd snapshot-save again. The following logs are seen:

INFO[0048] Waiting for [cluster-state-deployer] container to exit on host []
INFO[0048] Container [cluster-state-deployer] is still running on host []: stderr: [], stdout: [Waiting for file [/etc/kubernetes/cluster.rkestate] to be successfully copied to this container, retry count 10]
INFO[0049] Waiting for [cluster-state-deployer] container to exit on host []
INFO[0049] Container [cluster-state-deployer] is still running on host []: stderr: [], stdout: [Waiting for file [/etc/kubernetes/cluster.rkestate] to be successfully copied to this container, retry count 10]
INFO[0050] Waiting for [cluster-state-deployer] container to exit on host []
INFO[0050] Container [cluster-state-deployer] is still running on host []: stderr: [], stdout: [Waiting for file [/etc/kubernetes/cluster.rkestate] to be successfully copied to this container, retry count 10
]
INFO[0051] Waiting for [cluster-state-deployer] container to exit on host []
INFO[0051] Container [cluster-state-deployer] is still running on host []: stderr: [], stdout: [Waiting for file [/etc/kubernetes/cluster.rkestate] to be successfully copied to this container, retry count 10
]
INFO[0052] Waiting for [cluster-state-deployer] container to exit on host []
INFO[0052] Container [cluster-state-deployer] is still running on host []: stderr: [], stdout: [Waiting for file [/etc/kubernetes/cluster.rkestate] to be successfully copied to this container, retry count 11
]
INFO[0053] Waiting for [cluster-state-deployer] container to exit on host []
INFO[0053] Container [cluster-state-deployer] is still running on host []: stderr: [], stdout: [Waiting for file [/etc/kubernetes/cluster.rkestate] to be successfully copied to this container, retry count 11
]
INFO[0054] Waiting for [cluster-state-deployer] container to exit on host []
INFO[0054] Container [cluster-state-deployer] is still running on host []: stderr: [], stdout: [Waiting for file [/etc/kubernetes/cluster.rkestate] to be successfully copied to this container, retry count 11
]
INFO[0055] Waiting for [cluster-state-deployer] container to exit on host []
INFO[0055] Container [cluster-state-deployer] is still running on host []: stderr: [], stdout: [Waiting for file [/etc/kubernetes/cluster.rkestate] to be successfully copied to this container, retry count 11
]
INFO[0056] Waiting for [cluster-state-deployer] container to exit on host []
INFO[0057] Container [cluster-state-deployer] is still running on host []: stderr: [], stdout: [Waiting for file [/etc/kubernetes/cluster.rkestate] to be successfully copied to this container, retry count 11
]
INFO[0058] Waiting for [cluster-state-deployer] container to exit on host []
INFO[0058] Container [cluster-state-deployer] is still running on host []: stderr: [], stdout: [Waiting for file [/etc/kubernetes/cluster.rkestate] to be successfully copied to this container, retry count 12
]
INFO[0059] Waiting for [cluster-state-deployer] container to exit on host []
INFO[0059] Container [cluster-state-deployer] is still running on host []: stderr: [], stdout: [Waiting for file [/etc/kubernetes/cluster.rkestate] to be successfully copied to this container, retry count 12
]
INFO[0060] Waiting for [cluster-state-deployer] container to exit on host []
INFO[0060] Container [cluster-state-deployer] is still running on host []: stderr: [], stdout: [Waiting for file [/etc/kubernetes/cluster.rkestate] to be successfully copied to this container, retry count 12
]
INFO[0061] Waiting for [cluster-state-deployer] container to exit on host []
INFO[0061] Container [cluster-state-deployer] is still running on host []: stderr: [], stdout: [Waiting for file [/etc/kubernetes/cluster.rkestate] to be successfully copied to this container, retry count 12
]
INFO[0062] Waiting for [cluster-state-deployer] container to exit on host []
INFO[0062] Removing container [cluster-state-deployer] on host [], try #1
INFO[0062] [remove/cluster-state-deployer] Successfully removed container on host []
INFO[0062] [etcd] Running snapshot save once on host []
INFO[0062] Image [rancher/rke-tools:v0.1.72] exists on host []
INFO[0062] Starting container [etcd-snapshot-once] on host [], try #1
INFO[0063] [etcd] Successfully started [etcd-snapshot-once] container on host []
INFO[0063] Waiting for [etcd-snapshot-once] container to exit on host []
INFO[0063] Container [etcd-snapshot-once] is still running on host []: stderr: [time="2021-05-28T15:21:18Z" level=info msg="Initializing Onetime Backup" name="rke_etcd_snapshot_2021-05-28T08:20:15-07:00"
], stdout: []
INFO[0064] Waiting for [etcd-snapshot-once] container to exit on host []
INFO[0064] Removing container [etcd-snapshot-once] on host [], try #1
INFO[0064] Finished saving/uploading snapshot [rke_etcd_snapshot_2021-05-28T08:20:15-07:00] on all etcd hosts
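
For context, the repeated "retry count" lines above come from the cluster-state-deployer container polling for the state file until a retry limit is exhausted. A minimal illustration of that kind of bounded wait (an assumption based on the log output, not the actual rke-tools source):

```go
// Illustrative bounded poll for a file, mirroring the "retry count"
// behavior visible in the logs above; this is an assumption based on
// the log output, not the actual rke-tools implementation.
package main

import (
	"fmt"
	"os"
	"time"
)

// waitForFile polls for path until it exists or maxRetries is exhausted.
func waitForFile(path string, maxRetries int, interval time.Duration) bool {
	for retry := 1; retry <= maxRetries; retry++ {
		if _, err := os.Stat(path); err == nil {
			return true
		}
		fmt.Printf("Waiting for file [%s] to be successfully copied to this container, retry count %d\n", path, retry)
		time.Sleep(interval)
	}
	return false
}

func main() {
	// With the state file absent, the full maxRetries * interval budget
	// is spent before giving up, which is the delay reported in this issue.
	if !waitForFile("/etc/kubernetes/cluster.rkestate", 12, 5*time.Second) {
		os.Exit(1)
	}
}
```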

@anupama2501

Verified on RKE version v1.3.0-rc2:

  • On an AWS EC2 instance, install Docker.
  • Create a cluster.yml; in the YAML file, enter the IP address and internal address of the instance.
  • Run rke up and, once the process finishes successfully, run rke etcd snapshot-save.
  • Then remove cluster.rkestate and run rke etcd snapshot-save again.
    Logs seen in the console:
INFO[0000] Running RKE version: v1.3.0-rc2
**WARN[0000] Name of the snapshot is not specified, using [rke_etcd_snapshot_2021-05-28T08:41:11-07:00]**
INFO[0000] Starting saving snapshot on etcd hosts
INFO[0000] [dialer] Setup tunnel for host []
**WARN[0001] Could not read cluster state file from [./cluster.rkestate], file does not exist. Snapshot will be created without cluster state file. You can retrieve the cluster state file using 'rke util get-state-file'**
INFO[0001] [etcd] Running snapshot save once on host []
INFO[0001] Image [rancher/rke-tools:v0.1.74] exists on host []
INFO[0001] Starting container [etcd-snapshot-once] on host [], try #1
INFO[0002] [etcd] Successfully started [etcd-snapshot-once] container on host []
INFO[0002] Waiting for [etcd-snapshot-once] container to exit on host []
INFO[0002] Container [etcd-snapshot-once] is still running on host []: stderr: [time="2021-05-28T15:41:13Z" level=info msg="Initializing Onetime Backup" name="rke_etcd_snapshot_2021-05-28T08:41:11-07:00"
], stdout: []
INFO[0003] Waiting for [etcd-snapshot-once] container to exit on host []
INFO[0003] Removing container [etcd-snapshot-once] on host [], try #1
INFO[0003] Finished saving/uploading snapshot [rke_etcd_snapshot_2021-05-28T08:41:11-07:00] on all etcd hosts
