
Realistic Page Load Time Test #10452

Open
shinglyu opened this issue Apr 7, 2016 · 59 comments

@shinglyu shinglyu commented Apr 7, 2016

I had some discussion with @larsbergstrom about this, and here is my proposal. Feedback is welcome!


Goal

Measure Servo's page-load performance on top websites (e.g. the Alexa 500), and compare it to other browsers' performance.

Existing Solutions

  • Firefox Desktop
    • Talos - tp5 suite
    • PerfHerder (dashboard)
  • Firefox OS
    • b2gperf (deprecated)
    • Raptor (dashboard)

Proposal

Measurements

  • Page load time
    • Using window.performance API
  • (Optional) Record profiling data during test

Test Environment

  • (By priority) Linux, OSX, Windows, Android, ARM embedded, other

Test Harness

  • Python test runner (unittest or py.test)
    • wpt-runner?
  • Test pages are served from a local server to minimize network latency noise
  • Use WebDriver to execute performance.timing and collect the results
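
Whatever transport ends up delivering the timing data, the load-time metric itself is just a delta between `performance.timing` fields. A minimal sketch, assuming the harness has already obtained the timing dict as JSON (the sample values are invented):

```python
import json

def load_time_ms(timing):
    """Page load time: domComplete relative to navigationStart.

    `timing` is a dict of the window.performance.timing fields,
    e.g. parsed from JSON collected via WebDriver or stdout.
    """
    return timing["domComplete"] - timing["navigationStart"]

# A hypothetical line the harness might collect from a page load:
raw = '{"navigationStart": 1460000000000, "responseEnd": 1460000000150, "domComplete": 1460000000900}'
print(load_time_ms(json.loads(raw)))  # 900
```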

Test Cases

  • tp5
    • 51 manually selected and cleaned sites from the Alexa top 500 as of April 8th, 2011
    • Outdated, but a good starting point
    • Served from local server
  • Web Page Replay
    • Record and replay Alexa top 500 sites
    • Need to select representative pages (i.e. not login or landing pages)
    • Need a lot of manual labor
    • We should also include target sites from our June release plan:
      • github.com
      • duckduckgo
      • hackernews
      • reddit

Visualization & Notification

  • Raptor
    • What we used in B2G
    • A drag-and-drop dashboard solution with influxDB backend
  • (Alternative) PerfHerder?
  • (Alternative) Another AreWe...Yet from scratch?
  • Automated regression identification and bug reporting

Plan

  • Phase 1:
    • Serve tp5 pages on a local server
    • Run Servo (Linux 64) against tp5 and collect performance.timing data
    • Push the data to an influxDB instance
    • Plot them in Raptor (hosted on heroku)
    • Run against nightly builds on a local Jenkins server
  • Phase 2:
    • Record selected Alexa 500 sites with Web Page Replay
    • Automate regression finding and bug reporting
    • Compare with Gecko
    • Integrate with existing Servo test infrastructure
  • Phase 3
    • Bring up more platforms

@autrilla autrilla commented Apr 7, 2016

Is this something you plan on doing yourself, or are you looking for someone to work on this? If so, I'm interested :)


@larsbergstrom larsbergstrom commented Apr 7, 2016

I'd be very interested in seeing page load time as the most important piece of information but also capturing the raw output of the profiling data. @rzambre and I are working on some patches to make that easier to grab for automated systems, which we hope to land very soon.

The only other obvious piece of information that would be really helpful is the output of memory profiling, though I don't know how practical that is to collect for initial page load scenarios. @nnethercote can you comment on whether that would be a useful measure to track or if we need more steady-state browsing data?

CC: @jgraham @jdm @metajack @Ms2ger @edunham whom I expect to have other feedback

Thanks for writing this up - I'm very excited for this work!


@jgraham jgraham commented Apr 7, 2016

So, generally the idea of using something like tp5 and measuring the load times seems like a sensible first step. I would encourage you to build as little infrastructure as possible though; we already have solutions for monitoring performance data and I suspect that anything you invent now will be about the same amount of effort to get working as reporting to perfherder, but will be more of a maintenance burden in the future. @wlach is the expert here and will be able to provide hints.

For harnesses, wptrunner already provides a mechanism to launch Servo, but that's pretty much all you'd be using it for in this case. It would be possible to adapt it to your use case, but if your plan is literally just to launch Servo for each URL and read the timing data from stdout (which seems like the easiest implementation for now), then it's probably overkill. I think purely custom Python code is quite defensible here. I would avoid unittest.

I agree that in the future using recorded loads instead of static copies of sites will be a much better simulation of the real world.


@wlach wlach commented Apr 7, 2016

Yeah, I'd really encourage you to consider using Perfherder, which solves not only the problem of storing and visualizing performance data, but also of acting on it. I've spent the last few quarters working on a performance sheriffing view, which we've been using to track regressions in Talos and other things:

https://treeherder.mozilla.org/perf.html#/alerts?status=-1&framework=1

Perfherder automatically detects regressions and provides a simple method for filing bugs on them based on a template. We'd probably need to make some minor adaptations to support Servo, but nothing major. I'm in the middle of a similar effort to make Perfherder a good solution for sheriffing AreWeFastYet data, which I think should cover most of your use case: http://wlach.github.io/blog/2016/03/are-we-fast-yet-and-perfherder/

Submitting data to Perfherder is not hard: all that's involved is creating a standard treeherder job and adding a "performance artifact" to it (there's plenty of sample code for this). We've used treeherder successfully with GitHub projects before (bugzilla, gaia), so I don't see why Servo would be a problem.


@edunham edunham commented Apr 7, 2016

I'm +1 on using Perfherder, since you'll almost certainly get better support and performance from a tool that people focus on full time than from a one-off competing with many other projects for my, Jack's, and Lars's time. @wlach, does Perfherder expose a public API for the data it collects, as well as the built-in metrics visualization?


@nnethercote nnethercote commented Apr 8, 2016

Memory usage on page load would be reasonably useful. Tracking that would be a lot better than tracking nothing.


@shinglyu shinglyu commented Apr 8, 2016

Let me summarize the above discussion:

  • Use tp5 as a starting point
  • Use a dead-simple Python script as the test harness
  • Use Perfherder for data collection and reporting
  • Additional metrics we can collect: raw profiling output and memory usage on page load

The technique used in AreWeSlimYet seems daunting to me. I'd appreciate it if anyone could point me to any tools or documents I can study.

@autrilla : Any help would be most welcome :) I'll open a new repo for this project and try to merge it back when it's mature.


@shinglyu shinglyu commented Apr 8, 2016

I started some experiment in this repo: https://github.com/shinglyu/servo-perf


@wlach wlach commented Apr 8, 2016

@edunham: Perfherder has a bunch of endpoints for getting series data (the UI uses these):

https://treeherder.mozilla.org/docs/#!/project/Performance_Datum_list
https://treeherder.mozilla.org/docs/#!/project/Performance_Signature_list

And also one to get a list of "alerts" (detected changes) programmatically:

https://treeherder.mozilla.org/docs/#!/performance/Performance_Alert_Summary_list

Feel free to ask either me or jmaher on irc.mozilla.org #treeherder or #perfherder if you have more questions
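
As a sketch of using those endpoints: the alert-summary list can be fetched with a plain HTTP GET. The exact path and query parameters below are assumptions based on the API docs linked above, so treat them as illustrative:

```python
from urllib.parse import urlencode

TREEHERDER = "https://treeherder.mozilla.org"

def alert_summaries_url(framework=1, page=1):
    # Assumed endpoint path; check it against the Treeherder API docs.
    qs = urlencode({"framework": framework, "page": page})
    return f"{TREEHERDER}/api/performance/alertsummary/?{qs}"

print(alert_summaries_url())
# Then fetch with e.g.:  requests.get(alert_summaries_url()).json()
```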


@shinglyu shinglyu commented Apr 11, 2016

@wlach Thank you for the information, but I still don't understand how Perfherder works.

  • So is Perfherder just a frontend that queries data from Treeherder?
  • Is the performance data actually a bunch of build artifacts from a series of treeherder jobs, which we fetch through the treeherder API? Is it not aggregated into a database?
  • I don't have any experience creating treeherder jobs; is there any documentation or a previous patch I can follow?
  • Do I need to request special permission to run my custom treeherder job on the production server?

Thank you!


@wlach wlach commented Apr 11, 2016

@shinglyu I think you figured this out for yourself, but yes, that's the guide to use for submitting data.
It doesn't cover performance data specifically (at least not yet), but there is some good prior art in autophone that you can hopefully use as a reference:

https://github.com/mozilla/autophone/blob/master/autophonetreeherder.py

Since this is the first time we'll be submitting Servo data to treeherder, we'll also need to send revision information. There's some guidance on doing that in the submitting data document that you linked to. Eventually you might want to consider using TaskCluster for scheduling jobs and submitting data, which I believe might take care of some of those details for you.

To answer your earlier questions, Treeherder/Perfherder does actually aggregate performance data in an easy-to-digest form, which is how we provide all the frontend views at https://treeherder.mozilla.org/perf.html

My recommendation would be as follows:

  1. Bring up your own test server and create a test program to submit data to it (both revision, and job/performance data).
  2. Make your performance testing job submit data to your test server
  3. We give you credentials and you start submitting data to stage (https://treeherder.allizom.org) for a few weeks, just to make sure everything's working
  4. Once we're confident that your script is submitting good, reliable data, start submitting to treeherder production.

@shinglyu shinglyu commented Apr 12, 2016

@wlach : Thanks a lot! I'll start step 1 and 2 and contact you when I'm ready for step 3. :)


@shinglyu shinglyu commented Apr 13, 2016

Update: the test runner is almost ready https://github.com/shinglyu/servo-perf
Some of the tp5 test cases (those with complex JS and many ad images) run forever even if I set -o output.png. Trying to close the window with window.close() wrapped in setTimeout() doesn't work either. I'm trying to figure out the root cause, and I may force-kill Servo if it runs for too long.
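
A sketch of that force-kill fallback, using subprocess's built-in timeout (the Servo command line in the comment reflects the flags discussed in this thread):

```python
import subprocess

def run_with_timeout(cmd, timeout_s=120):
    """Run a page-load command; return (returncode, stdout).

    On timeout, subprocess.run() kills the child for us, and we
    return None so the harness can record the page as hung.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
        return proc.returncode, proc.stdout
    except subprocess.TimeoutExpired:
        return None, b""

# A real invocation would look something like:
#   run_with_timeout(["./servo", "-o", "output.png", url], timeout_s=120)
rc, _ = run_with_timeout(["true"], timeout_s=5)  # placeholder command
print(rc)  # 0
```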


@shinglyu shinglyu commented Apr 13, 2016

@wlach: The treeherder-client on PyPI is not the latest version in the tree. Also, the sample code in the documentation is out of sync with the unit tests. Which version should I use to match the server version on stage and production?


@wlach wlach commented Apr 13, 2016

@shinglyu Good catch! I updated the version on PyPI to reflect what's in the tree (treeherder-client 2.1.0). Please use the new version. The docs should be up to date at this point; if they're not, please file a PR to fix them or let me know what's wrong so I can do so.


@larsbergstrom larsbergstrom commented Apr 13, 2016

@shinglyu Sometimes you may have better results with -x -o output.png. If Servo still does not exit with both of those flags, please open issues and we will look into them - that probably indicates a deadlock or other bug in Servo!


@shinglyu shinglyu commented Apr 14, 2016

@wlach: Thank you
@larsbergstrom: Thanks, I'll use -x -o. I'm trying to identify those tests in this bug: https://github.com/shinglyu/servo-perf/issues/1. I might temporarily disable them first and file bugs for them.


@shinglyu shinglyu commented Apr 15, 2016

@wlach I was able to submit a ResultSetCollection and a JobCollection through the Python API.

But I can't figure out how to format a performance artifact. I found the following code:
https://github.com/mozilla/autophone/blob/16669a6a13c78dc376ed60b9c6b005d69bda572b/tests/perftest.py#L31
But when I tried to find it on treeherder, I found something like this: https://autophone.s3.amazonaws.com/pub/mobile/tinderbox-builds/mozilla-inbound-android-api-15/1460693988/autophone-talos-tp4m-remote.ini-1-nexus-6p-2-106e7edc-214c-4427-a5e4-4f0405e7d30d-autophone.log

I thought the log should be a JSONified PerfherderArtifact? Or is it consumed in the backend, so I can't see it from the UI?


@shinglyu shinglyu commented Apr 18, 2016

Edit: I committed the wrong file...
Ha, I wrote a test script and successfully submitted to my local treeherder instance. I'll try to hook it up with my test runner.


@shinglyu shinglyu commented Apr 20, 2016

@larsbergstrom I'm not sure how to get the revision information when I submit data to treeherder. I am thinking about dumping the git log -n 1 output to a file and loading it when I run the test, but I'm not sure that's flexible enough if we want to move the test to our CI infrastructure in the future. How can I get things like the commit hash, author, and timestamp when I run the perf test on CI?
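
One way to capture those fields at build time, assuming a git checkout is available in the build step (the JSON field names are arbitrary choices, and author names containing quotes would need escaping):

```python
import json
import subprocess

# Format string for `git log -1`; %at is a bare unix timestamp, so the
# output is already valid JSON for well-behaved author names.
FMT = '{"commit": "%H", "author": "%an", "timestamp": %at}'

def commit_info(repo_dir):
    """Run at build time inside the checkout; ship the result with the build."""
    out = subprocess.check_output(
        ["git", "-C", repo_dir, "log", "-1", f"--pretty=format:{FMT}"])
    return json.loads(out)

# The test runner then just parses the shipped JSON, e.g.:
sample = '{"commit": "bca625b", "author": "shinglyu", "timestamp": 1461110400}'
print(json.loads(sample)["commit"])  # bca625b
```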


@shinglyu shinglyu commented Apr 20, 2016

@larsbergstrom @wlach Also, I'm not sure how to present the data points. Talos' tp5 uses this kind of summarization:

summarization:
  • subtest: ignore the first 5 data points, then take the median of the remaining 20 (source: test.py)
  • suite: geometric mean of the 51 subtest results
(ref)

That is one (median) time for each website, and one mean time for the whole suite.
But performance.timing gives us multiple measurements per page load. Should we split them by measurement or by website? For example:

By measurement

  • Suite 1: responseEnd
    • subtest: www.google.com
    • subtest: www.amazon.com
    • ...
  • Suite 2: domComplete
    • subtest: www.google.com
    • ...
  • ...

By website

  • Suite 1: www.google.com
    • subtest: responseEnd
    • subtest: domComplete
    • ...
  • Suite 2: www.amazon.com
    • subtest: responseEnd
    • subtest: domComplete
    • ...
  • ...
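
For reference, the Talos summarization quoted above is straightforward to express in code (the replicate counts in the sample data are shortened for illustration):

```python
import statistics
from math import prod

def subtest_summary(replicates, warmup=5):
    # Per-site: drop the first `warmup` replicates, median of the rest.
    return statistics.median(replicates[warmup:])

def suite_summary(subtest_medians):
    # Suite-level: geometric mean of the per-site medians.
    n = len(subtest_medians)
    return prod(subtest_medians) ** (1.0 / n)

# First 5 replicates are warm-up noise and get ignored:
print(subtest_summary([999, 950, 910, 905, 900, 100, 102, 98, 101, 100]))  # 100
print(round(suite_summary([100.0, 400.0]), 1))  # 200.0
```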

@wlach wlach commented Apr 20, 2016

@shinglyu For getting commit information, I wonder if it might not be easiest to use a library like GitPython (http://gitpython.readthedocs.org/)

For the second question, I think separating by measurement definitely makes the most sense. However, I would question the utility of measuring anything but the time for the document being fully loaded and painted (which is what tp5o measures). There's a complexity cost to recording additional information; I'd personally just start with the same metric as tp5o, then add measurements if they prove to be needed.


@shinglyu shinglyu commented Apr 21, 2016

@wlach I want to separate my build step from my test step, so I'll package the Servo binary into a zip and copy it to my test runner's directory. That way the test runner doesn't need access to the Servo code base. I think I'll use git log's formatting options to export the commit metadata as a JSON string and include it in the zip file.
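
That packaging step might look something like the following sketch; the file names (servo, revision.json) are invented for illustration:

```python
import json
import os
import tempfile
import zipfile

def package(binary_path, revision, out_zip):
    # Bundle the binary plus the commit metadata into one artifact.
    with zipfile.ZipFile(out_zip, "w") as z:
        z.write(binary_path, arcname="servo")
        z.writestr("revision.json", json.dumps(revision))

def read_revision(zip_path):
    # The runner only needs the zip, never the Servo checkout.
    with zipfile.ZipFile(zip_path) as z:
        return json.loads(z.read("revision.json"))

with tempfile.TemporaryDirectory() as d:
    bin_path = os.path.join(d, "servo")
    with open(bin_path, "wb") as f:
        f.write(b"\x7fELF")  # stand-in for the real binary
    zp = os.path.join(d, "servo-test.zip")
    package(bin_path, {"commit": "bca625b"}, zp)
    print(read_revision(zp)["commit"])  # bca625b
```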

Your suggestion makes a lot of sense. I think I'll submit only the domComplete timing for visualization, while keeping the other measurements in the log files. If we find that we need them, we can submit them later.


@wlach wlach commented Apr 21, 2016

@shinglyu: BTW, soon treeherder will have the capability of ingesting github revision data (on a push level, no less) which I think will work much better than you submitting revision data by hand. So I'd just get something hacky working there for now (your solution sounds fine) and hopefully we can switch to something better later in this quarter.

https://bugzilla.mozilla.org/show_bug.cgi?id=1264074


@shinglyu shinglyu commented Apr 25, 2016

@wlach Good to know!

I have automated the whole build > test > submit to local perfherder flow. I'll let it run for a few days to see if everything is stable enough for submitting to staging.


@jgraham jgraham commented Apr 25, 2016

@shinglyu Awesome!


@wlach wlach commented Apr 27, 2016

@shinglyu Submitting to treeherder stage should be no problem, just follow the procedure here to add credentials and ping me again when you've done so:

http://treeherder.readthedocs.io/common_tasks.html#generating-and-using-credentials-on-treeherder-stage-or-production


@shinglyu shinglyu commented Apr 28, 2016

@metajack: Thanks for the information
@wlach: Thank you, here is the bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1268381


@shinglyu shinglyu commented May 3, 2016

Some initial data can be seen on the staging server now: https://treeherder.allizom.org/#/jobs?repo=servo&selectedJob=1


@shinglyu shinglyu commented May 7, 2016

I documented how I submitted data to Perfherder: http://shinglyu.github.io/web/2016/05/07/visualizing_performance_data_on_perfherder.html
Feedback is welcome!

cc: @wlach


@wlach wlach commented May 9, 2016

Hey @shinglyu, great post! I would like to get some of that integrated within the treeherder documentation.

One thing: we should not be submitting Servo data with the "talos" framework, as that's intended solely for the Gecko platform. I'd like to add a new performance framework for "servo"; see: https://bugzilla.mozilla.org/show_bug.cgi?id=1271472


@metajack metajack commented May 12, 2016

@shinglyu I see jobs are getting sent to staging on a regular basis, including the performance artifacts. Is there a way to compare these results with firefox yet?

@wlach If we have a new framework (seems like servo-perf is what was chosen) can we then compare performance results against things in Talos? We definitely want to be able to see how we're doing against Firefox's tp5 results.


@wlach wlach commented May 12, 2016

@shinglyu I'm seeing some issues with this:

  1. It appears as if you're still specifying the talos framework. As of yesterday, the servo-perf framework is on stage, so please assign that to your series.

  2. It looks like the performance series signature keeps on changing, which makes it impossible to track performance and generate alerts. I can't see any logs to determine what you're actually submitting, so I don't know why this is. Could you create a log of what you're submitting to the job (PERFHERDER_DATA), and upload it to s3 or somewhere similar, and then link it to treeherder? You can see an example of adding a treeherder log to a job here:

    https://github.com/mozilla/autophone/blob/master/autophonetreeherder.py#L437

    (I presume Servo has an S3 account to use -- if you don't, let me know)
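
For reference, the log wlach asks for is scanned for a single "PERFHERDER_DATA:" line. A sketch of emitting one is below; the suite/subtest names and values are invented, and the exact schema should be checked against the Perfherder docs:

```python
import json

# Framework name matches the "servo-perf" framework from this thread;
# everything else here is sample data.
perf_data = {
    "framework": {"name": "servo-perf"},
    "suites": [{
        "name": "tp5-domComplete",
        "value": 900.0,  # suite-level summary
        "subtests": [{"name": "www.google.com", "value": 850.0}],
    }],
}
print("PERFHERDER_DATA: " + json.dumps(perf_data))
```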

@metajack I think it's going to be really hard to compare against Firefox unless you're running the exact same test, which I don't believe you are at this point. Maybe the easiest route is to somehow run servo-perf against firefox, perhaps on a nightly basis?


@metajack metajack commented May 12, 2016

How does our tp5 test differ from the one that Firefox runs?


@wlach wlach commented May 12, 2016

@metajack I'm not familiar with exactly what servo-perf is testing, if it's using the same pageset as talos tp5 that's a great start at measuring the same thing. But even if the pageset is the same, you would have to make sure that the harness is recording information in the same way.

The numbers from talos vs. servo-perf seem pretty far off from one another:

https://treeherder.allizom.org/perf.html#/graphs?series=%5Bmozilla-inbound,6a48ac54b45a24ccd037d18e2d58b0472c4ccd6a,1,1%5D&series=%5Bservo,b28838a4b625b0f341e87aeb3e10aeb1633afeed,1,8%5D&series=%5Bservo,f6067f4bc04fef24aa4eec8ff55794727bfe5f7f,1,8%5D&zoom=1462841532703.583,1463063448000,0,2206.5215179885645


@larsbergstrom larsbergstrom commented May 12, 2016

@wlach I'd expect Servo's numbers to be pretty far off - we have done nearly zero "complete page load" performance work yet, and there's a ton of known low-hanging fruit. So, that chart may be pretty close to reality :-)


@jgraham jgraham commented May 12, 2016

I may be missing something, but trying to compare performance numbers from different implementations of the "same" testsuite running on different hardware seems like it isn't going to produce good results. The infrastructure that produces results for Servo should also submit its own results for Firefox, running the same harness on the same hardware, in order to get meaningful numbers.


@metajack metajack commented May 12, 2016

@jgraham Thanks for pointing that out.

@shinglyu What are the rough specs for what the tp5 results you have submitted so far run on? Are you planning to add Firefox tests on the same hardware?


@shinglyu shinglyu commented May 13, 2016

@metajack We can't compare our servo-perf test with the existing Firefox tp5 Talos test. The reason is that we have our own custom test runner (open PR: #11107). It runs a subset of the tp5 tests, because some pages make Servo run forever (see #11087). It also measures the domComplete time from performance.timing, which is different from how Talos measures. We are planning to run Firefox in our test runner, our way, as @jgraham said; see https://github.com/shinglyu/servo-perf/issues/4

@wlach I changed the framework, but I broke the test runner in the process, so it failed to submit data for two days. The data you are looking at is probably two days old. The latest run should be correct: https://treeherder.allizom.org/#/jobs?repo=servo&selectedJob=16

About the "performance series signature": is that the job_guid? I thought that was for identifying a specific test run, so I randomly generate a UUID-style string. The old data seems to be on the same graph, but the new ones show up as one data point per graph: https://treeherder.allizom.org/perf.html#/graphs?series=[servo,4df09c87df5f6294eb04c94f19ce8a0aae144c0e,1,8]&series=[servo,951d1b202b324d85bc3229a334b88370f6c18363,1,1]&series=[servo,f6067f4bc04fef24aa4eec8ff55794727bfe5f7f,1,1]&selected=[servo,4df09c87df5f6294eb04c94f19ce8a0aae144c0e,2,3,1]

And yes, I haven't pushed the log to S3 and created a link in the artifact yet. It's on my backlog and I'll open a bug for that.


@shinglyu shinglyu commented May 13, 2016

@wlach: Now the data points are in the same graph again. https://treeherder.allizom.org/perf.html#/graphs?series=[servo,f6067f4bc04fef24aa4eec8ff55794727bfe5f7f,1]&selected=[servo,f6067f4bc04fef24aa4eec8ff55794727bfe5f7f,9,17]

I assume the problem was caused by my transition from talos to servo-perf?


@shinglyu shinglyu commented May 13, 2016

In case you are confused: the May 10 commit is still using talos. I changed to servo-perf and broke the code, so there is some missing data between May 10 and May 12. The first May 12 build is also broken, so the new, clean data starts from the bca625b commit.


@wlach wlach commented May 13, 2016

@shinglyu No, the performance signature is distinct from the job_guid. The signature is calculated from the properties of PERFHERDER_DATA (suite name, test name, options) as well as various reference data from the job (machine platform, options, ...). And yes, if you change the performance framework you'll get a new series (though the signature should remain the same, since the performance framework is not currently incorporated into the signature).
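
A toy model of that behavior (not the real algorithm, which lives in treeherder/etl/perf.py): the signature is a hash over the series' identifying properties, so it stays stable run to run, unlike a randomly generated job_guid:

```python
import hashlib

def series_signature(suite, test, machine_platform, options=()):
    # Concatenate the identifying properties and hash them; the real
    # property set and ordering are treeherder's, not this sketch's.
    key = "".join([suite, test, machine_platform] + sorted(options))
    return hashlib.sha1(key.encode("utf-8")).hexdigest()

a = series_signature("tp5-domComplete", "www.google.com", "linux64")
b = series_signature("tp5-domComplete", "www.google.com", "linux64")
print(a == b)  # True: same properties, same series
```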


@shinglyu shinglyu commented May 16, 2016

@wlach: Thanks. Do the last 4-5 tests have the same signature? The only parts that should change are the timestamps and performance numbers. I might have changed a subtest name when I fixed some bugs a week ago, but the latest 4 tests should be stable. Can you point me to the performance signature code/docs so I can double-check?


@shinglyu shinglyu commented May 16, 2016

@larsbergstrom @metajack I tried to prioritize the remaining work. Please let me know if we need to adjust anything.

Priority 1 (Critical for June Preview)

Priority 2 (Must have before, say, end of Q3 2016)

Priority 3 (Nice to have)

Edit: move automatic alert to P3


@wlach wlach commented May 16, 2016

@shinglyu: Code for signature calculation is here (warning: it has some rough edges): https://github.com/mozilla/treeherder/blob/master/treeherder/etl/perf.py


@shinglyu shinglyu commented May 17, 2016

@wlach: Thanks. So did you check the signature by manually querying the SQL DB, or is it shown in the UI?


@jdm jdm commented May 31, 2016

Out of curiosity, what will it take to allow measuring non-master branches? Once the off-thread HTML parsing work is ready, it would be useful to be able to compare the before and after timing before the changes are actually merged.


@autrilla autrilla commented May 31, 2016

@jdm I don't think it would be very complicated at all. As far as I know, we're still manually copying the Servo binary into the performance test runner's directory, so it would just be a matter of adding the branch as a command-line parameter to the runner (so the results get sent as another treeherder project, or whatever they're called) and copying in the Servo binary from the other branch.
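
A sketch of that flag, assuming a Python runner; the --branch flag and the project-naming scheme are invented for illustration:

```python
import argparse

def parse_args(argv):
    # Fold the branch into the project name the results report under,
    # so non-master runs land in their own treeherder series.
    p = argparse.ArgumentParser(description="servo-perf runner")
    p.add_argument("--branch", default="master")
    args = p.parse_args(argv)
    args.project = "servo" if args.branch == "master" else f"servo-{args.branch}"
    return args

print(parse_args(["--branch", "offthread-parsing"]).project)
# servo-offthread-parsing
```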
