
Add tests checking copying of many small files #7411

Merged
merged 1 commit into openzfs:master on Apr 30, 2018

Conversation

@aerusso (Contributor) commented Apr 8, 2018

Issue #7401 identified data loss when many small files are copied. This adds tests to check for this and other similar conditions.

I have not interacted with the automated testing infrastructure before, so this is a WIP.

This adds a cp functional test (the original reproducer); a random_creation test, which uses ZFS's native list order to create the ZAP hash collision; and a direct test with a specific sequence that has been demonstrated to cause the hash collision.

Description

~~Maybe I should also add some cache options?~~ This does not appear to be important, though should we consider it for future bugs?

How Has This Been Tested?

This is a test.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)
  • This is a new test.

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted and contain Signed-off-by.
  • Change has been approved by a ZFS on Linux member.

@loli10K (Contributor) commented Apr 8, 2018

We usually add new tests to prevent regressions after fixing a bug, especially because we first need to understand the issue and its cause: adding new tests that merely seem to reproduce it (we don't even know why it seems only reproducible on CentOS) doesn't look good.

@aerusso force-pushed the pulls/cp_many_files-test branch 3 times, most recently from 0d44065 to 36f5db0 on April 8, 2018 at 20:22
@trisk (Contributor) commented Apr 9, 2018

#7401 is only easily reproducible with particular versions of coreutils (as shipped in RHEL/CentOS 7) that copy files in a pseudorandom order. A simpler reproducer is creating the same names in the same order with touch: https://gist.github.com/trisk/9966159914d9d5cd5772e44885112d30
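For reference, a minimal sketch of that style of reproducer (the paths, counts, and file names here are illustrative, not taken from the gist):

# Create many files in one directory, then re-create a subset of those
# names, in ZAP (ls -U) order, in a second directory. On an affected
# 0.7.7 system the second directory can end up missing entries.
SRC=/tank/fs/src; DST=/tank/fs/dst
mkdir -p $SRC $DST
for i in $(seq 1 10000); do
        touch $SRC/file.$RANDOM.$i
done
ls -U $SRC | head -n 4096 | while read name; do
        touch $DST/$name
done
echo "expected 4096, found $(ls $DST | wc -l)"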

@aerusso (Contributor, Author) commented Apr 9, 2018

This latest version uses a randomized ordering. However, I wasn't able to get either my gist or @trisk's to reproduce the issue on the Debian machine I left running 0.7.7:

Linux 4.15.0-2-amd64 #1 SMP Debian 4.15.11-1 (2018-03-20) x86_64 GNU/Linux

If this randomized version doesn't reproduce on the testbots, I'll put the specific ordered case in.

@aerusso force-pushed the pulls/cp_many_files-test branch 2 times, most recently from d496e72 to dc4120a on April 10, 2018 at 01:39
@trisk (Contributor) commented Apr 10, 2018

@aerusso I updated the script to include a list of 10,000 names produced by the offending version of cp, instead of just the first 2049. If cp of a directory with 10k files reproduces the issue on your system, chances are this will too.

@behlendorf (Contributor) commented:

@aerusso thanks for taking the time to write test cases for this issue for the ZTS.

If you want to iterate faster using the buildbot, I'd suggest adding the following line to your top git commit message. It will request that the buildbot run only the "cp_many_files" test group from the ZFS Test Suite.

TEST_ZFSTESTS_TAGS="cp_many_files"
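For example, the tail of the commit message might look like this (the wording is illustrative; only the TEST_ZFSTESTS_TAGS line matters):

Add tests checking copying of many small files

<commit body describing the tests>

TEST_ZFSTESTS_TAGS="cp_many_files"
Signed-off-by: Antonio Russo <antonio.e.russo@gmail.com>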

@aerusso force-pushed the pulls/cp_many_files-test branch 2 times, most recently from 681dbc8 to 3a43d70 on April 10, 2018 at 12:09
@tonyhutter (Contributor) commented:

I gave this a try, comparing the test results between master and 0.7.7. A couple observations:

  1. It timed out when run as-is (with FILE_COUNT=300,000):
     Test: /home/hutter/zfs-0.7.7/tests/zfs-tests/tests/functional/many_small_files/setup (run as root) [00:01] [PASS]
     Test: /home/hutter/zfs-0.7.7/tests/zfs-tests/tests/functional/many_small_files/cp_test (run as root) [07:36] [FAIL]
     Test: /home/hutter/zfs-0.7.7/tests/zfs-tests/tests/functional/many_small_files/random_creation (run as root) [19:37] [KILLED]
  2. I reduced FILE_COUNT to 30,000 and ran it 10 times against both 0.7.7 (results) and master (results) on CentOS 7.4. All tests correctly passed on master, and at least one of the three individual tests correctly failed in each run on 0.7.7. However, each test did "pass" at least once on 0.7.7, meaning that run did not reproduce the orphan file issue.
  3. I re-ran with FILE_COUNT=10,000 and got similar results as with 30,000.
  4. I re-ran with FILE_COUNT=100,000 and didn't see random_creation or cp_test incorrectly "pass" on 0.7.7, which was good, although I only had time to run the tests 6 times.
  5. When the test fails, I never see cleanup pass. It does pass on success, though.

@aerusso (Contributor, Author) commented Apr 11, 2018

@tonyhutter Thanks for looking at that. I'll look into increasing some timeout threshold, though I'm a little less than thrilled about having a test run for ~30 minutes given how many other tests are already in this lineup. I'll try having the test bail out as soon as a problem starts and see if that makes things better.
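One way to bail out early might look like this (a sketch only; the variable and directory names are illustrative, not the test's actual ones):

# Verify the destination directory right after the copy and fail
# immediately on a mismatch instead of waiting out the full run.
expected=$FILE_COUNT
actual=$(ls $TESTDIR/dst | wc -l)
if [[ $actual -ne $expected ]]; then
        log_fail "expected $expected files, found only $actual"
fi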

@behlendorf (Contributor) left a review comment:

@aerusso let's go ahead and structure this as a complementary test to #7421, adding coverage above and beyond the targeted test cases being added there. A few notes:

  • Since we want the test cases to always run as part of the ZTS, we need to bring the run time down to a few minutes if at all possible.
  • cp_test.sh and zap_collision.sh are now covered by the test @tuxoko added, which runs fast, so why don't we drop these.
  • random_creation.sh offers slightly different test coverage, which is good, but let's move it into mv_files and leverage what we can of the existing helper functions. And it sounds like dropping it to 30,000 files is sufficient.

@aerusso (Contributor, Author) commented Apr 11, 2018

These all sound good to me. I'll move it into mv_files and get the test time reduced.

@aerusso force-pushed the pulls/cp_many_files-test branch 3 times, most recently from 7bb077f to 8c6c892 on April 13, 2018 at 00:15
codecov bot commented Apr 13, 2018

Codecov Report

Merging #7411 into master will increase coverage by 0.11%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #7411      +/-   ##
==========================================
+ Coverage   76.58%   76.69%   +0.11%     
==========================================
  Files         336      336              
  Lines      107137   107137              
==========================================
+ Hits        82047    82165     +118     
+ Misses      25090    24972     -118
Flag      Coverage Δ
#kernel   76.92% <ø> (-0.02%) ⬇️
#user     66.02% <ø> (+0.2%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 089500e...4c1e9b1.

@behlendorf (Contributor) commented:

@aerusso can you please rebase this on master with only the additional test cases? Then we can get a clean run from the bots.

@aerusso (Contributor, Author) commented Apr 20, 2018

@behlendorf Before rebasing, here is a summary of the test times in the failure case (the test never succeeded).

Ubuntu 17.10: 0:30
Ubuntu 17.04: 0:58
Ubuntu 16.04: 0:26
Fedora Rawhide (build failed)
Fedora 2 x86_64: 0:49
CentOS 7 x86_64: 0:44
CentOS 7 x86_64: 0:45
CentOS 6 x86_64: 0:39
Amazon 2 x86_64: 0:26

This was with 2^15 (32,768) files created and 2^12 (4,096) of those re-created in ls -U order in another directory. Test time should presumably grow roughly in proportion to the number of files.

@aerusso (Contributor, Author) commented Apr 24, 2018

@behlendorf Are there any more changes needed in this before applying it to master?

@tonyhutter (Contributor) commented:

@aerusso I'd like to give it a spin first. Unfortunately, I'm out of the office this week, so testing may be sporadic. Can you squash the commits, though?

source "${STF_SUITE}/tests/functional/mv_files/mv_files.cfg"

log_assert "Check that creating many files quickly is safe"

A contributor commented on this line:

Can you add a comment like "This will test the #7401 regression."
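Applied to the snippet above, that might look something like this (a sketch, not the final change):

source "${STF_SUITE}/tests/functional/mv_files/mv_files.cfg"

# This tests the #7401 regression: copying or creating many small files
# in quick succession could previously leave some of them lost.
log_assert "Check that creating many files quickly is safe"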

@tonyhutter (Contributor) commented:

I finally got around to testing this out on my CentOS 7.4 VM:

0.7.7 + this testcase, RC_PASS1=32k, RC_PASS2=4k (defaults):
16/100 incorrectly passed
28sec test time

0.7.7 + this testcase, RC_PASS1=64k, RC_PASS2=8k (defaults doubled):
1/100 incorrectly passed
48sec test time

0.7.8 + this testcase, RC_PASS1=32k, RC_PASS2=4k (defaults):
100/100 correctly passed

So you'll want to double the RC_PASS* values to 64k/8k.


# controls the "random_creation" test
export RC_PASS1=$((2**16)) # <number of files to create in the first directory>
export RC_PASS2=$((2**13)) # <process this many files into the second directory>
A contributor commented on these lines:

Can you write out the full numbers for these?

- export RC_PASS1=$((2**16)) # <number of files to create in the first directory>
- export RC_PASS2=$((2**13)) # <process this many files into the second directory>
+ export RC_PASS1=65536 # <number of files to create in the first directory>
+ export RC_PASS2=8192  # <process this many files into the second directory>

That would match the convention for the rest of the entries in this file.

Data loss was identified in openzfs#7401 when many small files were copied.
This adds a reproducer for this bug and other similar ones: randomly
generate N files. Then, listing M of them by `ls -U` order, produce
those same files in a directory of the same name.

This triggers the bug consistently, provided N and M are large enough.
Here, N=2^16 and M=2^13.

Signed-off-by: Antonio Russo <antonio.e.russo@gmail.com>
@aerusso (Contributor, Author) commented Apr 26, 2018

@tonyhutter Thanks for the repeatability analysis. Let me know if there is anything else that should be adjusted.

@tonyhutter (Contributor) left a review comment:

Looks good. Thanks again for putting together this test!

@behlendorf behlendorf merged commit c83ccb3 into openzfs:master Apr 30, 2018
@tonyhutter (Contributor) commented:

I'll put this in 0.7.9

tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request May 3, 2018
Data loss was identified in openzfs#7401 when many small files were copied.
This adds a reproducer for this bug and other similar ones: randomly
generate N files. Then, listing M of them by `ls -U` order, produce
those same files in a directory of the same name.

This triggers the bug consistently, provided N and M are large enough.
Here, N=2^16 and M=2^13.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Antonio Russo <antonio.e.russo@gmail.com>
Closes openzfs#7411
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request May 3, 2018 (same commit message as above)
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request May 4, 2018 (same commit message as above)
tonyhutter pushed a commit that referenced this pull request May 10, 2018 (same commit message as above; Closes #7411)