
Improve regression detection with continuous testing #8496

Open
anisse opened this issue Feb 10, 2016 · 8 comments

@anisse
Contributor

anisse commented Feb 10, 2016

It seems the youtube-dl project has abandoned the idea of using tests with travis-ci to keep all IEs always working (the last passing build was 2 years ago, #2571). Not that abandoning it is a bad idea, since it's hard to keep up with all those sites. But a side effect is that it's harder to tell whether a test failure is due to a new commit or to a website change.

What I'd like to propose is to keep the current testing infrastructure as a dashboard of working/non-working IEs, and to add another "testsuite" that tests each commit for regressions: if a test fails, the previous commit is tested as well, and if it fails there too, it's not considered a regression.

I've made a simple proof of concept in bash: https://gist.github.com/anisse/6093f8b5814ab3ce7140. It could be improved or rewritten in order to be integrated with travis-ci.
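
To make the idea concrete, here is a minimal Python sketch of that check (this is not the gist above; the nosetests invocation and commit handling are simplified assumptions):

# Minimal sketch, assuming tests are run through nosetests and that the
# parent commit (HEAD~1) is the reference version to compare against.
import subprocess

def test_passes(test_name):
    """Run a single test; True if nosetests exits successfully."""
    return subprocess.call(['nosetests', '-v', test_name]) == 0

def is_regression(test_name):
    if test_passes(test_name):
        return False  # nothing to report on the current commit
    # The test fails here; re-check it on the previous commit.
    subprocess.check_call(['git', 'checkout', '-q', 'HEAD~1'])
    try:
        passed_before = test_passes(test_name)
    finally:
        subprocess.check_call(['git', 'checkout', '-q', '-'])
    # Only a regression if the same test passed on the previous commit.
    return passed_before

if __name__ == '__main__':
    print(is_regression('test.test_download:TestDownload.test_ACast'))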

What do you guys think?

@anisse changed the title from "Make continuous testing great again" to "Improve regression detection with continuous testing" on Mar 15, 2016
@anisse
Contributor Author

anisse commented Apr 15, 2016

I've been working on a python version integrated with travis here:
https://github.com/anisse/youtube-dl

It is not ready for a pull request yet, since this is a bit more complex than I'd have wished, but it already does the job nicely:

  • it runs the original test suite, but marks it as allowed to fail in travis
  • it runs all tests on the pull-request/pushed commit and, if they fail, runs them on the original commit to see whether the failure is a regression. It tries them a few times to be sure it is not a temporary error (see the sketch after this list).
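
A minimal sketch of that retry logic, assuming a nosetests invocation and an arbitrary retry count:

# Sketch: only report a failure if the test fails on every attempt, so a
# transient network error is not mistaken for a real failure.
import subprocess

ATTEMPTS = 3  # assumed retry count

def passes_eventually(test_name, attempts=ATTEMPTS):
    for _ in range(attempts):
        if subprocess.call(['nosetests', '-v', test_name]) == 0:
            return True
    return False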

Warnings are treated as errors for now, since they could interfere with regression detection, but I'm not sure what to do about this yet. Any ideas?

Last but not least, this splits the test suite into multiple parts, because we might otherwise run into the travis time limit. It contains a generic solution to do that with nosetests. I originally wanted to do this work separately, but the travis 2-hour time limit forced my hand.
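
A rough sketch of that splitting, assuming hypothetical PART/PARTS environment variables set by the Travis build matrix (this is not the actual branch code):

# Sketch: enumerate tests with nosetests --collect-only, then give each
# Travis job its own slice of the full list.
import os
import subprocess

def list_tests():
    """Collect test ids without running them."""
    out = subprocess.check_output(
        ['nosetests', '--collect-only', '-v', 'test'],
        stderr=subprocess.STDOUT).decode('utf-8', 'replace')
    tests = []
    for line in out.splitlines():
        # Lines look like: "test_ACast (test.test_download.TestDownload) ... ok"
        if ' ... ' not in line or '(' not in line:
            continue
        name, rest = line.split(' (', 1)
        module, klass = rest.split(')', 1)[0].rsplit('.', 1)
        tests.append('%s:%s.%s' % (module, klass, name))
    return tests

def my_slice(tests):
    """Return the chunk of tests this job should run."""
    part = int(os.environ.get('PART', '0'))    # assumed env vars set in
    parts = int(os.environ.get('PARTS', '1'))  # the Travis matrix
    return tests[part::parts]

if __name__ == '__main__':
    subprocess.call(['nosetests', '-v'] + my_slice(list_tests()))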

@yan12125
Collaborator

For parallel tests, see #7267. It uses nose's built-in parallel feature.

@anisse
Contributor Author

anisse commented Apr 15, 2016

Looks nice. I had trouble using nose's processes feature (it did not see all tests and wasn't deterministic); I should test your branch.

Note that both solutions can be used in parallel (pun unintended): my solution splits the work at the travis level, allowing different jobs (running on different VMs), while the nose-based solution splits the work inside a single job/VM.

The nose solution could also be made generic if travis integrated automatic nose support like it does with RSpec, Cucumber and Minitest.

@anisse
Contributor Author

anisse commented Apr 18, 2016

This is a copy of the message in #9235:
The problem is that some tests are so unreliable that it's very difficult to run them twice and get the same result.

I understand that due to the nature of the project, tests depend on a third party (the website), and that website may well be quite flaky. But I want to at least eliminate the most unreliable tests.

For example, I ran the full test suite (with regression detection) 75 times. Out of those runs, a number of tests had 2 or more detected regressions (false positives):

      2 test.test_download:TestDownload.test_WeiqiTV ERROR
      2 test.test_download:TestDownload.test_YesJapan ERROR
      2 test.test_subtitles:TestYoutubeSubtitles.test_youtube_translated_subtitles ERROR
      3 test.test_download:TestDownload.test_DouyuTV_2 ERROR
      3 test.test_download:TestDownload.test_KuwoMv FAIL
      3 test.test_download:TestDownload.test_NRKTV ERROR
      3 test.test_download:TestDownload.test_Sohu_3 ERROR
      3 test.test_download:TestDownload.test_Vodlocker ERROR
      3 test.test_subtitles:TestNPOSubtitles.test_allsubtitles ERROR
      4 test.test_download:TestDownload.test_Chaturbate ERROR
      4 test.test_download:TestDownload.test_ElPais ERROR
      4 test.test_download:TestDownload.test_Jpopsuki ERROR
      4 test.test_download:TestDownload.test_Telegraaf ERROR
      4 test.test_download:TestDownload.test_TudouPlaylist FAIL
      4 test.test_download:TestDownload.test_Viddler FAIL
      5 test.test_download:TestDownload.test_NPO_4 ERROR
      5 test.test_download:TestDownload.test_NRKTV_1 ERROR
      6 test.test_download:TestDownload.test_C56_1 ERROR
      6 test.test_download:TestDownload.test_Smotri_5 ERROR
      7 test.test_download:TestDownload.test_AdobeTVVideo ERROR
      7 test.test_download:TestDownload.test_LePlaylist FAIL
      7 test.test_download:TestDownload.test_MiTele FAIL
      7 test.test_download:TestDownload.test_NPO_5 ERROR
      9 test.test_download:TestDownload.test_Yahoo_7 FAIL
     14 test.test_download:TestDownload.test_GodTube FAIL
     15 test.test_download:TestDownload.test_Mwave FAIL
     16 test.test_download:TestDownload.test_LivestreamOriginal_1 ERROR
     16 test.test_download:TestDownload.test_Sexu ERROR
     19 test.test_download:TestDownload.test_ACast ERROR
     19 test.test_download:TestDownload.test_StreetVoice ERROR

Some have as many as 19(!).
ERROR means that there was an error running the test; maybe we should ignore those. FAIL means that a test passed, then failed on the same code revision. You can see how this translates into regression detection here:
StreetVoice: https://travis-ci.org/anisse/youtube-dl/jobs/123862153 https://travis-ci.org/anisse/youtube-dl/jobs/123289890
MWave: https://travis-ci.org/anisse/youtube-dl/jobs/123289900
GodTube: https://travis-ci.org/anisse/youtube-dl/jobs/123289914

It even generates user issues, for example StreetVoice: #9219.

@yan12125
Collaborator

Well, I ran test_ACast 75 times:

for i in $(seq 1 75) ; do
    python3.6 -Werror test/test_download.py TestDownload.test_ACast || break
done

And there were no errors. Could you give an example of the error messages?

@yan12125
Collaborator

Oh I found one: https://travis-ci.org/anisse/youtube-dl/jobs/123862117#L388

ERROR: test_ACast (test.test_download.TestDownload)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/anisse/youtube-dl/test/test_download.py", line 141, in test_template
    force_generic_extractor=params.get('force_generic_extractor', False))
  File "/home/travis/build/anisse/youtube-dl/youtube_dl/YoutubeDL.py", line 683, in extract_info
    self.report_error(compat_str(e), e.format_traceback())
  File "/home/travis/build/anisse/youtube-dl/youtube_dl/YoutubeDL.py", line 543, in report_error
    self.trouble(error_message, tb)
  File "/home/travis/build/anisse/youtube-dl/youtube_dl/YoutubeDL.py", line 513, in trouble
    raise DownloadError(message, exc_info)
DownloadError: ERROR: Unable to download webpage: HTTP Error 503: Success (caused by HTTPError()); please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
-------------------- >> begin captured stdout << ---------------------
[acast] -where-are-you-taipei-101-taiwan: Downloading webpage
ERROR: Unable to download webpage: HTTP Error 503: Success (caused by HTTPError()); please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
  File "/home/travis/build/anisse/youtube-dl/youtube_dl/extractor/common.py", line 365, in _request_webpage
    return self._downloader.urlopen(url_or_request)
  File "/home/travis/build/anisse/youtube-dl/youtube_dl/YoutubeDL.py", line 1929, in urlopen
    return self._opener.open(req, timeout=self._socket_timeout)
  File "/opt/python/2.6.9/lib/python2.6/urllib2.py", line 397, in open
    response = meth(req, response)
  File "/opt/python/2.6.9/lib/python2.6/urllib2.py", line 510, in http_response
    'http', request, response, code, msg, hdrs)
  File "/opt/python/2.6.9/lib/python2.6/urllib2.py", line 435, in error
    return self._call_chain(*args)
  File "/opt/python/2.6.9/lib/python2.6/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/opt/python/2.6.9/lib/python2.6/urllib2.py", line 518, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)

503 is indeed quite strange.

@yan12125
Collaborator

streetvoice.py was updated to use the new API in 4dccea8. Hopefully that fixes the 403 errors. Please leave a comment if the problem is still there.

@anisse mentioned this issue Jun 26, 2016
@anisse
Contributor Author

anisse commented Jul 17, 2016

I've improved the script in the regdetect branch:
https://github.com/anisse/youtube-dl/tree/regdetect

I've squashed the commits and it's getting closer to being ready for a merge.

It now tests for reliability by going back and forth between the before-push and after-push revisions, to hopefully reduce the number of false positives. It also does automatic regression bisecting, so we don't need to look at the various commits in a push to guess which one introduced a regression.
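
For reference, the bisecting step can be approximated with git bisect run; a hedged sketch, not the exact code in the branch:

# Sketch: once a regression is confirmed, let git bisect find the first
# bad commit between the known-good base and the tip.
import subprocess
import sys

def bisect_regression(good, bad, test_name):
    subprocess.check_call(['git', 'bisect', 'start', bad, good])
    try:
        # "git bisect run" re-runs the failing test at every step;
        # nosetests exits non-zero when the test fails.
        subprocess.check_call(
            ['git', 'bisect', 'run', 'nosetests', '-v', test_name])
        subprocess.call(['git', 'bisect', 'log'])
    finally:
        subprocess.check_call(['git', 'bisect', 'reset'])

if __name__ == '__main__':
    # e.g. bisect_regression('origin/master', 'HEAD',
    #                        'test.test_download:TestDownload.test_ACast')
    bisect_regression(*sys.argv[1:4])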

As you have seen, I have put automatic merging in place in my tree, and it has already found a few regressions (#9991, #10018, #10030, #10048, #10064, #10096 for example).

I'm hoping the latest changes will make it even easier to use, since it will automatically pinpoint the bad commits, produce fewer false positives, and tell us which tests/websites are not reliable.

anisse referenced this issue in anisse/youtube-dl Sep 27, 2016
It runs tests and parses nosetests output to detect failures and test
them for regressions against a reference version. If it finds a
regression, it is automatically bisected.

Unstable or flaky tests are detected and ignored automatically by
running them multiple times in a row.

We keep the original test suite around, but mark it as allowed to fail.
It serves as a dashboard of current test statuses, but since tests can fail for reasons outside our control (this is the essence of this project), we don't want it to be blocking.

Using the regression detection as the failure criterion means that any failing build needs to be examined. Even if the next build is "fixed", it does not mean that the regression has been fixed. This is a change in semantics when analyzing build history.

We map/reduce by splitting the test suite into 7 parts, abusing travis'
matrix feature. The reduce step is done by hand, by analyzing the travis
dashboard: any failing test is critical.

Because nosetests --processes just doesn't work with the youtube-dl test
suite (yet), we route around it by first enumerating tests. This takes a
while, because nosetests needs to find and load all test files in order
to enumerate them, but it should be at most 30 seconds, while the test
suite can take more than 2 hours on travis' infrastructure.

By doing this, we ensure we'll be able to run the tests faster, since
they are mostly I/O (network) bound due to the nature of the project.

Closes #8496
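
As an illustration of the output parsing described above (the result-line format is assumed from nosetests' verbose output; this is not the parser from the branch):

# Sketch: pick FAIL/ERROR entries out of nosetests -v output so they can
# be re-tested against the reference revision.
import re

# nose verbose result lines look like:
#   "test_ACast (test.test_download.TestDownload) ... ERROR"
RESULT_RE = re.compile(
    r'^(?P<name>\w+) \((?P<path>[\w.]+)\) \.\.\. (?P<status>ok|FAIL|ERROR)$')

def failing_tests(nose_output):
    """Return nose-style selectors for every failed or errored test."""
    failures = []
    for line in nose_output.splitlines():
        m = RESULT_RE.match(line.strip())
        if m and m.group('status') != 'ok':
            module, klass = m.group('path').rsplit('.', 1)
            failures.append('%s:%s.%s' % (module, klass, m.group('name')))
    return failures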