Update regression tests #122
Conversation
Adobe Naskh is a lot prettier, especially with mixed Arabic RTL text and LTR numerals (which seems to be the point of this test), but it's non-free and hard to come by. The Noto series, on the other hand, is easier to install on Travis for the test suite. I have copies of both and compared the output to see that it was laying out the same apart from the different font metrics.
The original Minion Pro font for the chapter break might not have shown this problem, but on systems without that font the fallback might have a different x-height. On my system this test would wrap pages, which introduced a new problem that may need a test of its own; in the spirit of testing one thing at a time, this reduces the total content to fit on one page even if the selected font isn't around.
In the spirit of testing one thing at a time I pared this test down to just the code that is essential for the test, notably typesetting mixed LTR/RTL frames and the same Unicode in different Arabic scripts. I also added the expected output as a benchmark.
Also trim out the now-obsolete trailing comments used to suppress whitespace.
This was generated on an OSX machine with the TTC font in question, so it should be the _correct_ output. Note I was unable to duplicate the fail state on this because the TTC loader only wants to work for me. Also note the test that checks this needs to be skipped if not on OSX.
Although it works in parallel, Travis launches jobs from the top down in the matrix. During busy times, and if availability is low, this can mean the early ones in the list finish some time before the others. Since I'm most interested in seeing the tests run on the latest versions and can wait to hear about regression issues later, sort the matrix in reverse order with the newest versions on top.
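A hypothetical `.travis.yml` fragment illustrating the reordering (the env variable name and version list are illustrative, not the project's actual matrix):

```yaml
# Newest interpreters first so the results that matter most
# come back earliest from the Travis queue.
env:
  - LUA=lua5.3
  - LUA=lua5.2
  - LUA=lua5.1
  - LUA=luajit
```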
Newer systems can use ~/.local/share/fonts, but in spite of the documentation, Ubuntu 12.04 as used by Travis CI does not actually look there by default. It does check /home/<username>/.fonts, so...
The Ubuntu 12.04 LTS repos have very, very old versions of the SIL fonts (e.g. 1.001 from 2009 instead of 2.020 from 2014), and the metrics have changed quite a bit, especially for the Arabic Scheherazade font, though less so for some of the others. This makes the regression tests hard to run because it's not the font version any developer is likely running on their system. Rather than benchmark the old ones, this will just install the newest available versions into the Travis test environment. (Note mkdir is not necessary for the manual font install since unzip will create up to one directory on extract.)
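A sketch of what that manual install might look like in the Travis config; the URL is a placeholder, not the actual release location:

```yaml
before_install:
  # Hypothetical URL: fetch the current SIL release rather than the
  # ancient package in the Precise repos. No mkdir needed: unzip -d
  # creates the target directory itself on extract.
  - wget -O /tmp/scheherazade.zip "<sil-release-url>/Scheherazade.zip"
  - unzip /tmp/scheherazade.zip -d "$HOME/.fonts"
  - fc-cache -f "$HOME/.fonts"
```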
Whatever is wrong with the toolchain that generated the last one (see issue sile-typesetter#124) it's definitely broken (the diacritics are positioned all wrong) and this one is—at the very least—less broken.
This is a package from Trusty because the one in the Precise repos is way out of date and has some bad metrics, which kills some of the regression tests that depend on it. Since there isn't much in the package besides the font, the deb file installs fine without much help.
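The deb-based variant might look like this; the pool URL and package name are placeholders, not the actual ones used:

```yaml
before_install:
  # Hypothetical package URL: pull the newer .deb straight from the
  # Trusty pool. The package contains little besides the font files,
  # so installing it directly needs no dependency resolution.
  - wget -O /tmp/font.deb "<trusty-pool-url>/fonts-sil-package.deb"
  - sudo dpkg -i /tmp/font.deb
```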
The Travis environment is really hard (and slow) to debug, so it's more useful to get data about all the regression tests at once rather than waiting to find out others failed after fixing one. This will:

* Still fail immediately for any failure in the normal compile checks.
* Keep processing regression checks after a failure, but output the right exit code in the end.
* Not be quite as verbose. The actual/expected data is in the file system anyway, and the diff is far more useful for debugging from Travis logs than the full dumps, which are hard to even scroll past.
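A minimal sketch of the "fail at the end" behavior, assuming a layout where each test leaves a `.actual` file next to its `.expected` benchmark (the names and layout are illustrative):

```shell
# Run every regression comparison, remember any failures, and only
# report the combined status once all of them have had a chance to run.
run_regressions() {
    status=0
    for expected in tests/*.expected; do
        actual="${expected%.expected}.actual"
        if ! diff -u "$expected" "$actual"; then
            echo "FAIL: $expected"
            status=1    # remember the failure but keep going
        fi
    done
    return $status
}
```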
Not that we want to go breaking examples and such, but these will need a different testing system than the one we run using attack.pl, triggered by the existence of .expected files. Here we actually have the PDFs in the repository anyway. If there _are_ regression tests that need to be run on the content of these examples, they should be moved into tests/.
The logic of checking whether all the stuff compiles vs. whether it exactly matches the reference implementation was getting too convoluted. This separates the concerns so that only documentation and examples get compile checks while the tests get compile and regression checks (depending on the mode flag at run time). Travis, for example, runs both modes, but separately, so what failed is more apparent than if they were lumped in together.
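The split might look something like this sketch, where `$SILE_CMD` stands in for the real compile step; the mode names and file layout are assumptions, not the actual harness:

```shell
# compile mode: only check that the document builds at all;
# regression mode: also compare the output against the benchmark.
check_file() {
    mode="$1"
    src="$2"
    ${SILE_CMD:-sile} "$src" > "${src%.sil}.actual" || return 2  # compile check
    [ "$mode" = compile ] && return 0        # docs/examples stop here
    diff -u "${src%.sil}.expected" "${src%.sil}.actual"  # tests also compare
}
```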
Having a callout to a debug function is great in local code, but it shouldn't be in the committed code unless it is behind a switch of some sort. This particular instance breaks the way the test suite works because of the extra unexpected output.
Oley oley! Happy dance—I see green! This should be ready for merge. Finally. But @simoncozens, you might say, “It looks red to me?!” And you'd be right. But if you look closer you'll find it's actually supposed to be red, because there actually is a regression issue (just reported as #125) between Lua versions: Lua 5.3 handles float math differently than earlier versions. As fixing the actual existing regression is something for the master branch rather than the test code, I'll leave that to be fixed in master or a different PR. This set of changes is, to the best of my knowledge, doing what it should be doing, and merging it to master makes sense even with the red X, because the issue it found actually exists in master.
This is incredibly cool (and pretty scary). Thank you very much for all this work. I came up with the idea of regression tests when I was mucking about with the linebreaking algorithm - that is some very scary code and I didn't want to disturb it unintentionally - but I think this will also help us catch plenty of other problems along the way. (Or we could be drowning in a sea of red crosses for the near future. I don't know. Let's find out. :-) )
The sea of red crosses is a definite possibility. I think it can be managed, however. I have some shell scripts I wrote while setting this up that might help. One of them, for example, runs through the regression tests, catches ones that fail, generates overlay images for visual review, pops up the diff and the overlay image, asks if it's okay, then regenerates the expected output if it gets an okay. The debug output is pretty cool actually, and I think as long as we pay attention to keeping it deterministic and have the right tooling on hand, it shouldn't be too bad to manage. In any event it will be easier if we start now rather than trying to add this on farther down the road.
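As a rough sketch of that review loop: the overlay-generation step is commented out since it needs ImageMagick, and `$VIEWER` plus the file layout are assumptions, not the actual scripts:

```shell
# Walk the regression tests; for each failure, show the diff (and,
# optionally, an overlay image), then re-bless the expected output
# only if the reviewer approves.
review_failures() {
    for expected in tests/*.expected; do
        actual="${expected%.expected}.actual"
        diff -u "$expected" "$actual" && continue       # still passing
        # compare "$expected.png" "$actual.png" overlay.png  # visual diff
        # ${VIEWER:-xdg-open} overlay.png
        printf 'Accept new output for %s? [y/N] ' "$expected"
        read -r answer
        [ "$answer" = y ] && cp "$actual" "$expected"   # regenerate benchmark
    done
}
```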
So this includes a whole bunch of changes to get the test suite gears running a bit more smoothly. Here's the basic idea.
* Adds `--regression` testing for the build tests in Travis. If changes to the core change the output of existing documents, there should be a reason for it. This will mean updating the expected outputs a lot more as things change, but that extra review work should be done. I have some tooling I've been using locally for updating tests that I'll try to adapt for broader use and submit later.
* Several expected outputs needed updating because the `debug-output` format, or the default margins for the class, or the way the folio was arranged changed for one reason or another.
* Fonts in `tests/*`: I would like to be able to compile anything coming out of the core repository without needing proprietary fonts on my system. None of the items in the test suite seem to depend on the licensed fonts, so I've replaced them with reasonable free ones that can be installed in the Travis environment as well as by anyone wanting them. The only exception to this is the OSX-specific font issue. As this font format is only used on OSX and it's a default system font, I left that one alone and marked the test to run only on Darwin (which means it won't be part of Travis's tests for now).

Most of these are fully tested in that I actually reproduced the errors running old versions of SILE and made sure the expected output was actually the desired thing. There are a couple of exceptions (marked "fail state untested") where I couldn't figure out the fail state (or it looked like I didn't have a working state).
Most of the commits have been isolated to individual tests. I squashed a few of them that were closely related, but if problems ever come up with the tests I wanted `git blame` and `git revert` to be useful, so they are mostly test-specific.