Improve regression detection with continuous testing #8496
I've been working on a Python version integrated with Travis here: It is not ready for a pull request yet, since this is a bit more complex than I'd have wished, but it already does the job nicely:
Warnings are treated as errors for now, since they could interfere with regression detection, but I'm not sure what to do about this yet. Any ideas? Last but not least, this splits the test suite into multiple parts, because we might otherwise run into the Travis time limit. It contains a generic solution to do that with nosetests. I originally wanted to do this work separately, but the Travis two-hour time limit forced my hand.
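The splitting idea could be sketched roughly like this (a minimal illustration, not the actual script; `split_tests` and the test names are hypothetical):

```python
def split_tests(test_names, n_chunks):
    """Distribute enumerated test names round-robin into n_chunks
    buckets, so each Travis job runs roughly the same number of tests."""
    buckets = [[] for _ in range(n_chunks)]
    for i, name in enumerate(test_names):
        buckets[i % n_chunks].append(name)
    return buckets

# Example: split 7 tests across 3 Travis jobs.
chunks = split_tests(
    ["test_ACast", "test_Youtube", "test_Vimeo",
     "test_Dailymotion", "test_Soundcloud", "test_Arte", "test_BBC"],
    3,
)
```

Each bucket would then be passed to a separate `nosetests` invocation in its own Travis job.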
For parallel tests, see #7267. It uses nose's built-in parallel feature.
Looks nice. I had trouble using nose's processes (they did not see all tests and weren't deterministic), so I should test your branch. Note that both solutions can be used in parallel (pun unintended): my solution splits the work at the Travis level, allowing you to have different jobs (running on different VMs), while the nose-based solution splits the work inside a single job/VM. The nose solution could be made generic, though, if Travis integrated automatic nose support like it does with RSpec, Cucumber and Minitest.
This is a copy of the message in #9235: I understand that due to the nature of the project, tests depend on a third party (the website), and that website may well be quite flaky. But I want to at least eliminate the most unreliable tests. For example, I ran the full test suite (with regression detection) 75 times. Out of those runs, a few tests had more than 2 detected regressions (false positives):
Some have as many as 19(!). It even generates user-reported issues, for example Streetvoice (#9219).
Well, I ran:

```shell
for i in $(seq 1 75); do
    python3.6 -Werror test/test_download.py TestDownload.test_ACast || break
done
```

and there were no errors. Could you give an example of the error messages?
Oh I found one: https://travis-ci.org/anisse/youtube-dl/jobs/123862117#L388
503 is indeed quite strange.
streetvoice.py was updated with the new API in 4dccea8, which hopefully fixes the 403 errors. Please leave a comment if the problem persists.
I've improved the script in the regdetect branch: I've squashed the commits and it's getting closer to being ready for a merge. It now tests for reliability by going back and forth between the before-push and after-push states, which should reduce the number of false positives. It also does automatic regression bisecting, so we don't need to look at the various commits in a push to guess which one introduced a regression. As you have seen, I have put automatic merging in place in my tree, and it has already found a few regressions (#9991, #10018, #10030, #10048, #10064, #10096 for example). I'm hoping the latest changes will make it even easier to use, since it will automatically pinpoint the bad commits, produce fewer false positives, and tell us which tests/websites are not reliable.
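The automatic bisecting amounts to a binary search over the commits in a push; a minimal sketch, assuming a hypothetical `test_passes` predicate that is stable (non-flaky) across the range:

```python
def bisect_regression(commits, test_passes):
    """Find the first bad commit in a chronological list, assuming
    the first commit passes and the last one fails (hypothetical
    helper, not the actual regdetect implementation)."""
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] passes, commits[hi] fails
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if test_passes(commits[mid]):
            lo = mid
        else:
            hi = mid
    return commits[hi]
```

With N commits in a push, this needs only about log2(N) test runs instead of N to pinpoint the offending commit.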
It runs tests and parses nosetests output to detect failures, then checks them for regressions against a reference version. If it finds a regression, it is automatically bisected. Unstable or flaky tests are detected and ignored automatically by running them multiple times in a row.

We keep the original test suite around, but mark it as allowed to fail. It serves as a dashboard of current test statuses, but since tests can fail for reasons outside our control (this is the essence of this project), we don't want it to be blocking. Using the regression detection as a failure condition means that any failing build needs to be examined. Even if the next build is "fixed", it does not mean that the regression has been fixed. This is a change in semantics when analyzing build history.

We map/reduce by splitting the test suite into 7 parts, abusing Travis' matrix feature. The reduce step is done by hand, by analyzing the Travis dashboard, any failing test being critical.

Because nosetests --processes just doesn't work with the youtube-dl test suite (yet), we route around it by first enumerating tests. This takes a while because nosetests needs to find and load all test files in order to enumerate them, but it should be at most 30 seconds, while the full test suite can take more than 2 hours on Travis' infrastructure. By doing this, we ensure we'll be able to run the tests faster, since they are mostly I/O (network) bound due to the nature of the project.

Closes #8496
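The flaky-test filtering described above (running a failing test multiple times in a row) could look roughly like this sketch; `run_test` is a hypothetical callable returning True on pass:

```python
def classify_test(run_test, runs=5):
    """Run a test several times in a row and classify it:
    'pass' if it always passes, 'fail' if it always fails,
    'flaky' if results are mixed. Flaky tests are ignored for
    regression detection (illustrative sketch, not the real script)."""
    results = {run_test() for _ in range(runs)}
    if results == {True}:
        return "pass"
    if results == {False}:
        return "fail"
    return "flaky"
```

Only a consistent "fail" then proceeds to the before/after-push comparison; a "flaky" test is excluded so it cannot trigger a false regression.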
It seems the youtube-dl project has abandoned the idea of using tests on Travis CI to keep all IEs always working (the last passing build was 2 years ago, #2571). Not that abandoning it was a bad idea, since it's hard to keep up with all those sites. But a side effect is that it's harder to tell whether a test failure is due to a new commit or to a website change.
What I'd like to propose is to keep the current testing infrastructure as a dashboard of working/non-working IEs, and to add another "test suite" that would test each commit for regressions: if a test fails, the previous commit would be tested as well, and if it fails there too, it's not considered a regression.
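The core of the proposal is a simple decision rule, which can be sketched as:

```python
def is_regression(passes_now, passed_before):
    """A failure only counts as a regression if the same test passed
    on the previous commit; if it failed there too, the breakage is
    assumed to come from the website, not the new commit (sketch of
    the proposed rule, not an existing API)."""
    return (not passes_now) and passed_before
```

Everything else (bisecting, flakiness filtering) refines how reliably `passes_now` and `passed_before` are measured.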
I've made a simple proof of concept in bash: https://gist.github.com/anisse/6093f8b5814ab3ce7140. This could be improved/rewritten in order to be integrated with Travis CI.
What do you guys think?