mount.lustre failed: Cannot send after transport endpoint shutdown #107
Strange that the version of Lustre was updated during the test run: could this be because the version on the public Lustre repo changed mid-test?
Yes, that is the case: https://build.hpdd.intel.com/job/lustre-master/3608/changes. We also now have https://git.hpdd.intel.com/gitweb?p=fs/lustre-release.git;a=commit;h=a9d45ef39a471595709a31b3d60b0c67b7af0c91, which is on a new branch. However, Lustre is supposed to be backward compatible across a range of previous releases, so the kind of "upgrade" that this test run experienced should be supported.

So, looking at the test failure: the first "Cannot send after transport endpoint shutdown" happened during the "test_copytool_remove" test, which according to the messages log on lotus-55vm18 ran in the window 15:45:17->16:00:18, but the Lustre "upgrade" happened at ~13:56:35, almost two hours earlier. I think the Lustre upgrade is a red herring.

Looking at the failure more closely, the error suggests the MGS is not running, which it needs to be to complete the registration. Looking at lotus-55vm15's messages file during the same time window, we can see that the MGS was started at 15:59:10 but that even as late as 15:59:52 it was still restoring connections. It restored connections to lotus-55vm17 and lotus-55vm16 but never to lotus-55vm18. I'm not positive that it should have, but it's a data point to consider.

It might be worth looking this test over to see whether we have some kind of race or other timing assumption that may sometimes be invalid. Maybe we are assuming that just because we start the MGS it is immediately available, and perhaps we need an availability test before trying to register a new target. If we did find that to be the case, however, I would strongly suspect a Lustre bug.
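For illustration only, here is a minimal sketch of the kind of availability test floated above, assuming the test harness knows the MGS NID. The NID, timeout and interval values are invented, and `lctl ping` only proves LNet reachability to the node, not that the MGS service has finished recovery:

```python
# Hedged sketch: poll the MGS NID with "lctl ping" before attempting to
# register a new target, instead of assuming the MGS is usable the moment
# it has been started. The NID and timings below are hypothetical.
import subprocess
import time


def wait_for_mgs(mgs_nid, timeout=120, interval=5):
    """Return True once the MGS NID answers an LNet ping, False on timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if subprocess.call(["lctl", "ping", mgs_nid]) == 0:
            return True
        time.sleep(interval)
    return False


if __name__ == "__main__":
    # e.g. the MGS on lotus-55vm15; this NID is a made-up example
    if not wait_for_mgs("10.14.82.15@tcp"):
        raise SystemExit("MGS never became pingable; target registration would likely fail")
```

If something like this were ever to go beyond debugging, a check against the MGS service itself (rather than just LNet) would be more meaningful; that is what the later comments discuss.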
The failure still appears on subsequent runs with Lustre packages installed at version 2.10.50 from the start:
|
/var/log/messages from the lotus-58vm5 node on the failing run: https://gist.github.com/tanabarr/95ec25ea9262f79348dddf2da7aaeea4
|
Here's another occurrence, where this time there is nothing new on the Lustre branch we are pulling from, so upgrading during the test is definitely a red herring.
|
|
Summary: we've discounted the upgrade-during-test cause and think the problem is to do with MGS availability/state.

We can see that the MGS mount command returns rc=0, with stderr "mount.lustre: increased /sys/block/sdb/queue/max_sectors_kb from 512 to 16384". It should be noted that we don't actually register the MGS (the steps are empty for RegisterTargetJob with target=MGS).

As @brianjmurrell has suggested, it might be prudent to verify the availability of the MGS when registering other targets. During RegisterTargetJob.get_deps() we do already verify the MGS is in a mounted state when registering filesystem member targets, but maybe verifying the mounted state alone is not sufficient (a rough sketch of a stronger check follows below).
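As a purely illustrative sketch of a check that is stronger than "the MGT is mounted", something like the step below could run on the MGS host before a member target is registered. The step class, its run() signature and how it would hook into RegisterTargetJob are assumptions, not the actual job/step API:

```python
# Sketch under assumptions: VerifyMgsAvailableStep and its run() signature
# are illustrative, not the real step API. The idea is to confirm an MGS
# obd device is actually UP on the MGS host, rather than relying solely on
# the MGT being recorded as "mounted".
import subprocess


class VerifyMgsAvailableStep(object):
    """Would run on the MGS host before registering a filesystem member target."""

    def run(self, kwargs):
        # "lctl dl" lists local obd devices and their status, e.g.
        #   0 UP mgs MGS MGS_UUID 5
        devices = subprocess.check_output(["lctl", "dl"]).decode()
        if not any(" UP mgs " in line for line in devices.splitlines()):
            raise RuntimeError(
                "MGT is mounted but no MGS obd device is UP; registering "
                "target %s would fail" % kwargs.get("target")
            )
```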
The max_sectors_kb message is informational and benign.
No, that's not what I was saying. We really shouldn't be adding more "sanity checks" (of stuff that should be dependably working) to the process, as sanity checks are what make it take so long already. What I was suggesting, as a temporary debugging measure to figure out what is going wrong here, was to verify that the MGS format/registration/etc. is successful before we move on to other steps.
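To make that concrete (illustrative only), a temporary debugging helper along these lines could be run on the MGS host immediately after the MGS format/registration steps, so a failure is caught there rather than at the later OST mount. The mgs.MGS.live.* parameter is one plausible place to look, and "testfs" is just the filesystem name seen in the failing mount:

```python
# Temporary debugging aid, sketched under assumptions: dump the MGS's view
# of the filesystem right after the registration steps, so a failed or
# silently skipped registration shows up here instead of as a later
# "Cannot send after transport endpoint shutdown" at OST mount time.
import subprocess


def log_mgs_view(fsname="testfs"):
    """Print the MGS live config for fsname; raise if the MGS knows nothing about it."""
    try:
        live = subprocess.check_output(
            ["lctl", "get_param", "-n", "mgs.MGS.live.%s" % fsname]
        ).decode()
    except subprocess.CalledProcessError:
        raise RuntimeError(
            "MGS has no live config for %s; registration did not take effect" % fsname
        )
    print("MGS view of %s:\n%s" % (fsname, live))
```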
|
|
|
|
This is being tracked in: https://jira.hpdd.intel.com/browse/LU-9838
|
|
|
Seeing this in 20 of the last 20 runs of SSI. |
|
Moved to #230 |
This is being tracked in: https://jira.hpdd.intel.com/browse/LU-9838
The pertinent failure string seems to be
mount.lustre: mount /dev/sdb at /mnt/testfs-OST0001 failed: Cannot send after transport endpoint shutdown
Failure is repeatable and can be seen on SSI runs 501, 502, 504
The lustre package versions are as follows:
lustre-client-2.9.59_35_gc1d70a4
lustre-2.9.59_35_gc1d70a4
http://jenkins.lotus.hpdd.lab.intel.com/job/integration-tests-shared-storage-configuration/arch=x86_64,distro=el7/501//consoleFull