Improve regression detection with continuous testing #8496
I've been working on a Python version integrated with Travis here: It is not ready for a pull request yet, since this is a bit more complex than I'd have wished, but it already does the job nicely:
Warnings are treated as errors for now, since they could interfere with regression detection, but I'm not sure what to do about this yet. Any ideas? Last but not least, this splits the test suite into multiple parts, because we might otherwise run into the Travis time limit. It contains a generic solution to do that with nosetests. I originally wanted to do this work separately, but the Travis two-hour time limit forced my hand.
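The splitting idea could be sketched roughly like this (a minimal illustration, not the actual script; `split_tests` and the test names are hypothetical):

```python
def split_tests(test_names, n_chunks):
    """Distribute enumerated test names round-robin into n_chunks
    buckets, so each Travis job runs roughly the same number of tests."""
    buckets = [[] for _ in range(n_chunks)]
    for i, name in enumerate(test_names):
        buckets[i % n_chunks].append(name)
    return buckets

# Example: split 7 tests across 3 Travis jobs.
chunks = split_tests(
    ["test_ACast", "test_Youtube", "test_Vimeo",
     "test_Dailymotion", "test_Soundcloud", "test_Arte", "test_BBC"],
    3,
)
```

Each bucket would then be passed to a separate `nosetests` invocation in its own Travis job.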
For parallel tests, see #7267. It uses nose's built-in parallel feature.
Looks nice. I had trouble using nose's processes (they did not see all tests and weren't deterministic), so I should test your branch. Note that both solutions can be used in parallel (pun unintended): my solution splits the work at the Travis level, allowing you to have different jobs (running on different VMs), while the nose-based solution splits the work inside a single job/VM. The nose solution could be made generic, though, if Travis integrated automatic nose support like it does with RSpec, Cucumber and Minitest.
This is a copy of the message in #9235: I understand that due to the nature of the project, tests depend on a third party (the website), and that website may well be quite flaky. But I want to at least eliminate the most unreliable tests. For example, I ran the full test suite (with regression detection) 75 times. Out of those runs, a few tests had more than 2 detected regressions (false positives):
Some have as many as 19(!). It even generates user-reported issues, for example Streetvoice (#9219).
Well, I ran:

```shell
for i in $(seq 1 75); do
    python3.6 -Werror test/test_download.py TestDownload.test_ACast || break
done
```

and there were no errors. Could you give an example of the error messages?
Oh I found one: https://travis-ci.org/anisse/youtube-dl/jobs/123862117#L388
503 is indeed quite strange.
streetvoice.py was updated with the new API in 4dccea8, which hopefully fixes the 403 errors. Please leave a comment if the problem persists.
I've improved the script in the regdetect branch: I've squashed the commits and it's getting closer to being ready for a merge. It now tests for reliability by going back and forth between the before-push and after-push states, which should reduce the number of false positives. It also does automatic regression bisecting, so we don't need to look at the various commits in a push to guess which one introduced a regression. As you have seen, I have put automatic merging in place in my tree, and it has already found a few regressions (#9991, #10018, #10030, #10048, #10064, #10096 for example). I'm hoping the latest changes will make it even easier to use, since it will automatically pinpoint the bad commits, produce fewer false positives, and tell us which tests/websites are not reliable.
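The automatic bisecting amounts to a binary search over the commits in a push; a minimal sketch, assuming a hypothetical `test_passes` predicate that is stable (non-flaky) across the range:

```python
def bisect_regression(commits, test_passes):
    """Find the first bad commit in a chronological list, assuming
    the first commit passes and the last one fails (hypothetical
    helper, not the actual regdetect implementation)."""
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] passes, commits[hi] fails
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if test_passes(commits[mid]):
            lo = mid
        else:
            hi = mid
    return commits[hi]
```

With N commits in a push, this needs only about log2(N) test runs instead of N to pinpoint the offending commit.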
It runs tests and parses nosetests output to detect failures, then checks them for regressions against a reference version. If it finds a regression, it is automatically bisected. Unstable or flaky tests are detected and ignored automatically by running them multiple times in a row.

We keep the original test suite around, but mark it as allowed to fail. It serves as a dashboard of current test statuses, but since tests can fail for reasons outside our control (this is the essence of this project), we don't want it to be blocking. Using the regression detection as a failure condition means that any failing build needs to be examined. Even if the next build is "fixed", it does not mean that the regression has been fixed. This is a change in semantics when analyzing build history.

We map/reduce by splitting the test suite into 7 parts, abusing Travis' matrix feature. The reduce step is done by hand, by analyzing the Travis dashboard, any failing test being critical.

Because nosetests --processes just doesn't work with the youtube-dl test suite (yet), we route around it by first enumerating tests. This takes a while because nosetests needs to find and load all test files in order to enumerate them, but it should be at most 30 seconds, while the full test suite can take more than 2 hours on Travis' infrastructure. By doing this, we ensure we'll be able to run the tests faster, since they are mostly I/O (network) bound due to the nature of the project.

Closes #8496
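The flaky-test filtering described above (running a failing test multiple times in a row) could look roughly like this sketch; `run_test` is a hypothetical callable returning True on pass:

```python
def classify_test(run_test, runs=5):
    """Run a test several times in a row and classify it:
    'pass' if it always passes, 'fail' if it always fails,
    'flaky' if results are mixed. Flaky tests are ignored for
    regression detection (illustrative sketch, not the real script)."""
    results = {run_test() for _ in range(runs)}
    if results == {True}:
        return "pass"
    if results == {False}:
        return "fail"
    return "flaky"
```

Only a consistent "fail" then proceeds to the before/after-push comparison; a "flaky" test is excluded so it cannot trigger a false regression.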
It seems the youtube-dl project has abandoned the idea of using tests on Travis CI to keep all IEs always working (the last passing build was 2 years ago, #2571). Not that abandoning it was a bad idea, since it's hard to keep up with all those sites. But a side effect is that it's harder to tell whether a test failure is due to a new commit or to a website change.
What I'd like to propose is to keep the current testing infrastructure as a dashboard of working/non-working IEs, and to add another "test suite" that would test each commit for regressions: if a test fails, the previous commit would be tested as well, and if it fails there too, it's not considered a regression.
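The core of the proposal is a simple decision rule, which can be sketched as:

```python
def is_regression(passes_now, passed_before):
    """A failure only counts as a regression if the same test passed
    on the previous commit; if it failed there too, the breakage is
    assumed to come from the website, not the new commit (sketch of
    the proposed rule, not an existing API)."""
    return (not passes_now) and passed_before
```

Everything else (bisecting, flakiness filtering) refines how reliably `passes_now` and `passed_before` are measured.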
I've made a simple proof of concept in bash: https://gist.github.com/anisse/6093f8b5814ab3ce7140. This could be improved/rewritten in order to be integrated with Travis CI.
What do you guys think?