
pytest with datafile #38

Closed
cboulay wants to merge 1 commit into master from pytest_datafile

Conversation

@cboulay (Contributor) commented Sep 20, 2019

This is a first attempt. I would like to NOT include the datafile in the repo and instead download the file on demand. We can put the (small) datafile in a GitHub release, otherwise we need to host them somewhere.

@tstenner , how do you think we should get azure to download our test data? Within a script? Or within a pytest fixture? Something else?

Review comment on the diff:

    vla(b'\x00')

    FIXTURE_DIR = os.path.join(os.path.dirname(os.path.realpath(__file__)), 'test_files')
Contributor:

Can you do this with pathlib?
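For reference, a pathlib version of that line might look like this (a sketch, not code from the PR):

```python
from pathlib import Path

# Same idea as the os.path version: the test-data directory
# lives next to this test module.
FIXTURE_DIR = Path(__file__).resolve().parent / "test_files"
```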

Contributor (author):

I was just following the most straightforward example from pytest-datafiles. These small details aren't important until we iron out the bigger details, which may require using a completely different way of downloading/loading data.

@cbrnr (Contributor) commented Sep 20, 2019

Would git-lfs or git-annex be a viable option? Why don't you want to include (small) test files directly in this repo? If they are small, I don't see the problem (the test files should not be packaged of course).

In MNE, we have a separate repository just for our testing data, but the main problem is that downloading is super slow. Maybe this is faster when you attach files to a release, but isn't this difficult to manage? Do we have to move/copy all testing files for each new release then?

@cboulay (Contributor, author) commented Sep 20, 2019

Why don't you want to include (small) test files directly in this repo?

(1) The way the package is currently structured, the tests travel with the Python package to PyPI. We could probably put the test data in a different folder, or manually remove it from the packages before uploading (but that seems like silly over-engineering). This is also exactly the same test file that's in the parent repo.
(2) I don't want to inflate the git history with binaries. It seems not so bad right now, but it can become a problem quickly, especially because...
(3) there are plans (SCCN/Intheon) to add many more test files to the test suite, especially 'problematic' files.

Attaching to a release feels like the wrong solution. The main reason I want the test files in there at all is to automatically test file loading whenever someone makes a PR. If a new feature requires a new test file, how does one verify the feature before the test file is in a release?

@cbrnr (Contributor) commented Sep 20, 2019

I totally agree that we need a test suite that automatically runs with a PR. The more files we have in there the better. I think git-lfs should work pretty well for our purposes.

@tstenner (Contributor) commented Sep 20, 2019

I would like to NOT include the datafile in the repo and instead download the file on demand.

Seconded. It's not always practical to clone a shallow copy, and processing every data file ever present in the repository slows everything down (or, as Clemens put it: "super slow"). With Git LFS it would be reasonable, but I think it's overkill. Also, once a big file gets pushed, it's in the repository practically forever.

Also, once we isolate the broken part of a file it should be possible to create a synthetic test file that triggers the bug, but is much shorter.

For the benchmarks we need huge files, so I have small scripts that generate big files on the fly to exercise the different code paths and then delete them afterwards. Those scripts could stay in the repository.
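The generate-then-delete pattern could be sketched like this (stdlib only; the names and sizes are illustrative, not taken from the actual benchmark scripts):

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def synthetic_file(size=1024 * 1024):
    """Create a temporary file of `size` pseudo-random bytes, deleting it afterwards."""
    fd, path = tempfile.mkstemp(suffix=".bin")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(os.urandom(size))
        yield path
    finally:
        os.remove(path)
```

In a pytest suite this would more naturally be a fixture, but the create/clean-up logic is the same.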

@tstenner , how do you think we should get azure to download our test data?

It all depends on what the CI should test. For one job (i.e., only one operating system), it doesn't really matter. For parallel jobs on Azure, I'd create one stage to download the files and then copy them to each job, so the files get downloaded (and possibly cached) once per run and copying them to the CI workers is fast.

Within a script? Or within a pytest fixture? Something else?

I'd suggest a script, unless pytest handles downloading changed/new files efficiently. rsync would do this, but it'd need a dedicated host; alternatively, a simple folder listing served via HTTP and wget.
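A minimal sketch of the script approach, skipping files that are already cached (the base URL and file names below are placeholders, not a real hosting location):

```python
import os
import urllib.request

BASE_URL = "https://example.com/xdf-test-data"  # placeholder host, not real
FILES = ["minimal.xdf", "clock_resets.xdf"]     # illustrative file names
DEST = "test_files"


def fetch_missing(base_url=BASE_URL, files=FILES, dest=DEST):
    """Download any test file that is not already present in `dest`."""
    os.makedirs(dest, exist_ok=True)
    for name in files:
        target = os.path.join(dest, name)
        if os.path.exists(target):
            continue  # cached from a previous CI run
        urllib.request.urlretrieve(f"{base_url}/{name}", target)
```

Detecting *changed* files would need checksums or HTTP caching headers on top of this; rsync gets that for free, at the cost of a dedicated host.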

@cboulay mentioned this pull request Oct 9, 2019
@cbrnr (Contributor) commented Oct 10, 2019

I have about 1.7 GB of test files. I'm not sure any online/cloud solution is suitable, because CIs would need to download the data each time. Sure, we could restrict ourselves to small files, but it would still be nice to also test with real-world large files.

@cboulay (Contributor, author) commented Oct 10, 2019

but it'd still be nice to also test with real-world large files

Not with CI systems though. If you want to include some tests in the repo that are only run on your local computer (e.g., after checking for the presence of a specific folder and/or only invoked with a special argument to pytest) then that's fine. If you want the tests to be run on the CI servers then the files should be small or generated on the fly.
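The local-only pattern described here could be sketched as follows (the directory, environment variable, and test body are hypothetical):

```python
import os

import pytest

# Hypothetical location of the large, local-only test files.
LARGE_DATA_DIR = os.environ.get(
    "XDF_LARGE_DATA", os.path.expanduser("~/xdf-large-files")
)

# Tests carrying this marker are skipped on machines (including CI
# workers) that do not have the large files checked out.
needs_large_files = pytest.mark.skipif(
    not os.path.isdir(LARGE_DATA_DIR),
    reason="large test files not available on this machine",
)


@needs_large_files
def test_load_large_file():
    # Placeholder: load one of the large files and check the result.
    assert os.path.isdir(LARGE_DATA_DIR)
```

A dedicated command-line option (via a `conftest.py` hook) would work similarly; the skip condition just changes.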

@cbrnr (Contributor) commented Oct 10, 2019

Yes, obviously we can't test large files with CIs, so I'm fine with only using small files there. We should still try to run our tests on large files, because these might reveal issues that small files don't. We could run our tests on all files (small and large) manually before a new release. I'm open to the implementation details of how we want to make this work. We also still need to decide where we store our large files, since I assume you are all OK with including small files either in this repo or in a separate xdf-testdata repo (I'd prefer the separate repo, because that way other repos can re-use the same test files).

@cboulay (Contributor, author) commented Oct 10, 2019

Agreed, not within this repo. I don't have enough experience with git-lfs. Does it have better performance with large binary files than normal git?

@tstenner (Contributor):

Does it have better performance with large binary files than normal git?

It basically uploads the files separately and replaces them with links, so it's a lot faster than normal git, but only as fast as the host the files get uploaded to.

@cbrnr (Contributor) commented Oct 10, 2019

Exactly. And we still don't know where to host the large files.

@cbrnr (Contributor) commented Oct 10, 2019

@cboulay could you go ahead and create an xdf-testdata repo within the xdf-modules organization?

@tstenner (Contributor):

https://github.com/xdf-modules/example-files

@cbrnr (Contributor) commented Oct 11, 2019

Regarding how to do the actual testing, I think pytest-datafiles is not even required. That package's main purpose is to copy files to a temp dir in order to avoid modifying the original files. We don't need this, because (1) we're pulling our test files from a git repo (and we don't push), and (2) we don't modify our test files anyway.

I'd clone the example-files repo in an Azure script, and we can then use the test files with the @pytest.mark.parametrize decorator.
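That approach might look roughly like this (the clone location is an assumption, and the actual XDF load call is left as a comment rather than invented):

```python
import glob
import os

import pytest

# Assumes CI has cloned https://github.com/xdf-modules/example-files here.
EXAMPLE_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), "example-files")
XDF_FILES = sorted(glob.glob(os.path.join(EXAMPLE_DIR, "*.xdf")))


@pytest.mark.parametrize("path", XDF_FILES)
def test_file_loads(path):
    # In the real suite this would call the XDF loader (e.g. pyxdf.load_xdf(path))
    # and assert on the returned streams; here only existence is checked.
    assert os.path.exists(path)
```

Each file in the cloned repo then shows up as its own test case in the CI report.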

@cbrnr closed this in #49 on Oct 23, 2019
@cbrnr deleted the pytest_datafile branch on October 23, 2019 at 10:25
