fix deadlock in clean up cache #1151
Conversation
I am uneasy about the lack of testing for the cache cleanup code generally. Do you see a way to exercise this as a unit test?
continue;
}

if(ent->IsMultiOpened()){
Can we remove the definition of IsMultiOpened?
S3FS_PRN_DBG("cleaned up: %s", next_path.c_str());
FdManager::DeleteCacheFile(next_path.c_str());
If I understand this correctly, we do not need to check FdEntity::pagelist.IsModified, since there are no opens and thus no modified data?
This is looking good!
test/test-utils.sh (Outdated)
@@ -258,6 +258,11 @@ function check_content_type() {
fi
}

function get_disk_avail_size() {
DISK_AVAIL_SIZE=`df $1 --output=avail |tail -n 1|tr -dc '0-9'`
This does not work on macOS. Instead try:
BLOCK_SIZE=$((1024 * 1024)) df $1 | awk '{print $4}' | tail -n 1
test/integration-test-main.sh (Outdated)
dd if=/dev/urandom of=$dir/file-$x bs=1048576 count=1
done

file_cnt=$(ls -1 $dir | wc -l)
The -1 is unnecessary in non-interactive shells/pipelines.
test/integration-test-main.sh (Outdated)
@@ -597,6 +621,10 @@ function add_all_tests {
add_tests test_concurrency
add_tests test_concurrent_writes
add_tests test_open_second_fd
ENSURE_DISKFREE_SIZE=`ps -ef|grep ensure_diskfree|awk '{print $NF}' |tr -dc '0-9'`
if [ ! -z "$ENSURE_DISKFREE_SIZE" ]; then
Could you simplify this to:
if pidof s3fs | grep -q ensure_diskfree; then
Please add
test/integration-test-main.sh (Outdated)
@@ -597,6 +621,9 @@ function add_all_tests {
add_tests test_concurrency
add_tests test_concurrent_writes
add_tests test_open_second_fd
if pidof s3fs | xargs -I {} ps -o cmd -fp {} | grep -q ensure_diskfree; then
macOS Travis complains that cmd is not found -- is there another way to do this?
https://travis-ci.org/s3fs-fuse/s3fs-fuse/jobs/585372211#L3704
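For what it's worth, one portable alternative might be the `command` format keyword, which to my knowledge both procps on Linux and BSD ps on macOS accept, unlike the procps-only `cmd`. A minimal sketch, with the current shell's PID standing in for the s3fs process:

```shell
#!/bin/sh
# Sketch, assuming "command" is a ps format keyword on both Linux
# (procps) and macOS (BSD ps); "cmd" is procps-only. The current
# shell's PID stands in for the s3fs process here.
pid=$$
cmdline=$(ps -o command= -p "$pid")   # "command=" also suppresses the header line
echo "$cmdline"
```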
@liuyongqing #1169 will solve the build error on OSX. Please try it.
@liuyongqing I'm sorry that issue #1171 is causing a build error now; please wait for it to be fixed.
@ggtakec, the -o ensure_diskfree option seems to have a problem on macOS; even the simplest ensure_diskfree mount cannot pass the test case: https://github.com/s3fs-fuse/s3fs-fuse/pull/1170/commits
@ggtakec, do you know how to fix the ensure_diskfree mount failing the test case on macOS?
@liuyongqing do we have a path forward on this pull request?
@gaul, can we ignore the check on macOS temporarily? According to this test pull request, https://github.com/s3fs-fuse/s3fs-fuse/pull/1170/commits, the failing test case is not related to this commit.
@liuyongqing Sorry for my late review.
One reason for this test failure is:
DISK_AVAIL_SIZE=`BLOCK_SIZE=$((1024 * 1024)) df $1 | awk '{print $4}' | tail -n 1`
"BLOCK_SIZE" is not valid for df on Mac OS, but we can use "BLOCKSIZE" instead.
The environment variable "BLOCKSIZE" works on Mac OS as well as on other OSes; for Mac OS df, it is the only acceptable environment variable name.
The other reason is the following part:
ENSURE_DISKFREE_SIZE=$((CACHE_DISK_AVAIL_SIZE - 256))
200MB is defined as free space, but the test_clean_up_cache test uses 256MB, so "- 256" seems to be correct here.
Please confirm these points in the conversation on the source code.
(I made the same fix on master as you and confirmed that TravisCI no longer fails.)
Please try rebasing this PR onto the latest master.
Thanks in advance for your help.
file_cnt=$(ls $dir | wc -l)
if [ $file_cnt != $count ]; then
echo "Expected $count files but got $file_cnt"
return 1
It is better to call "rm -rf $dir" before returning, for the sake of the next test.
fi
CACHE_DISK_AVAIL_SIZE=`get_disk_avail_size $CACHE_DIR`
if [ "$CACHE_DISK_AVAIL_SIZE" -lt "$ENSURE_DISKFREE_SIZE" ];then
echo "Cache disk avail size:$CACHE_DISK_AVAIL_SIZE less than ensure_diskfree size:$ENSURE_DISKFREE_SIZE"
It is better to call "rm -rf $dir" before returning here as well, for the sake of the next test.
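The cleanup-before-return pattern suggested here could be sketched as follows; the function and variable names are illustrative, not the actual test code:

```shell
#!/bin/sh
# Sketch of the review suggestion: remove the test directory on every
# return path so leftover files cannot affect the next test. Names here
# (check_file_count, dir, count) are illustrative, not from the suite.
check_file_count() {
    dir=$1
    count=$2
    file_cnt=$(ls "$dir" | wc -l)
    if [ "$file_cnt" -ne "$count" ]; then
        echo "Expected $count files but got $file_cnt"
        rm -rf "$dir"    # clean up even on the failure path
        return 1
    fi
    rm -rf "$dir"        # clean up on success, too
    return 0
}
```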
#reserve 200MB for data cache
source test-utils.sh
CACHE_DISK_AVAIL_SIZE=`get_disk_avail_size $CACHE_DIR`
ENSURE_DISKFREE_SIZE=$((CACHE_DISK_AVAIL_SIZE - 200))
The test_clean_up_cache function uses 256MB, so I think this should be "- 256".
test/test-utils.sh (Outdated)
@@ -258,6 +258,11 @@ function check_content_type() {
fi
}

function get_disk_avail_size() {
DISK_AVAIL_SIZE=`BLOCK_SIZE=$((1024 * 1024)) df $1 | awk '{print $4}' | tail -n 1`
Please change the environment variable name to "BLOCKSIZE" instead of "BLOCK_SIZE".
This is the environment variable name common to MacOS and other OSes.
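Putting that advice together, the helper might end up looking like this sketch (assuming, per the discussion above, that BLOCKSIZE is honored by both GNU coreutils df and BSD/macOS df):

```shell
#!/bin/sh
# Sketch of get_disk_avail_size after the review advice: BLOCKSIZE is
# the block-size environment variable shared by GNU and BSD df, unlike
# BLOCK_SIZE. Prints the available space, in MB, of the filesystem
# holding the given path.
get_disk_avail_size() {
    DISK_AVAIL_SIZE=$(BLOCKSIZE=$((1024 * 1024)) df "$1" | awk '{print $4}' | tail -n 1)
    echo "$DISK_AVAIL_SIZE"
}

get_disk_avail_size /
```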
@ggtakec, hi, I changed the code according to the above advice, but it still failed on macOS.
@liuyongqing Thanks for your help. I think the reason is that we run some other tests and after them we run the final test with "use_cache=${CACHE_DIR} -o ensure_diskfree=${ENSURE_DISKFREE_SIZE}".
We determined the ensure_diskfree size before the tests started, but after some tests the disk usage is different. What about running the ensure_diskfree test first to solve this?
I have confirmed that this procedure does not cause any errors. Please change your code accordingly and run the tests again.
@ggtakec, the test still failed, complaining: "The job exceeded the maximum log length, and has been terminated."
@liuyongqing Thanks for your report. I found the following error in the TravisCI logs.
The test_clean_up_cache test passed, but an error occurred in that test script. I think we need to find out why this script error is occurring.
@liuyongqing I found the reason for the following error message.
The reason is not clear, but on MacOS, the variables CACHE_DIR and ... In addition to this problem, some test units (functions) forgot to delete their test files. Even after solving the above two problems, the tests after test_chown still seem to fail.
@liuyongqing I merged #1232. Thanks in advance for your help.
@ggtakec, hi, the macOS tests now run successfully, but the test case fails on the ppc64le platform.
@liuyongqing It was "The build has been terminated".
@gaul, all tests pass; the commits are squashed into one commit.
@liuyongqing Thank you for your cooperation.
@gaul, the previous pull request was #1146; it still had some situations that were not fully considered, so I closed it.
Deadlocks can also appear in the following situation:
Thread A calls s3fs_write and already holds fdent_data_lock and cache_cleanup_lock (for cleaning up the cache dir); after cleaning one file, it tries to take fd_manager_lock in order to Close the FdEntity.
Thread B calls s3fs_open for the same file and already holds fd_manager_lock and fdent_lock, and tries to take fdent_data_lock.
A deadlock occurs between thread A and thread B because each thread wants a lock held by the other.
So when cleaning up a cache file, we should hold fd_manager_lock until the cleanup finishes; if we cannot take it, we can skip that file temporarily.
Another deadlock condition is in NoCacheLoadAndPost, because it tries to take fd_manager_lock while already holding fdent_data_lock.
In short, we should always lock in the order fd_manager_lock -> fdent_lock -> fdent_data_lock; if we need fd_manager_lock after taking fdent_lock or fdent_data_lock, we should use try_lock. I did not find a nice way to solve the NoCacheLoadAndPost deadlock, but it is only triggered when there is not enough disk space even after the cache has been cleaned up.