Plugin / theme checksum verification #6
Just like we have checksum verification for WordPress core, we should be able to verify the integrity of installed plugins and themes.
However, WordPress.org doesn't currently publish plugin and theme checksums. We'd need to generate these and host them at a publicly accessible URL.
https://github.com/eriktorsner/wp-checksum is an existing project that implements this feature.
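To make the idea concrete, here is a minimal sketch of generating such a checksum manifest with core utilities. The directory layout (`plugins/<slug>/<version>.checksums`) and the use of md5 are assumptions for illustration, not an official format:

```shell
# Sketch: build a flat checksum manifest for one plugin version.
OUT=$(mktemp -d)
PLUGIN=$(mktemp -d)                      # stands in for an `svn export` of the plugin
printf 'hello' > "$PLUGIN/readme.txt"
mkdir -p "$OUT/plugins/example-plugin"
# One line per file, sorted for a stable manifest:
( cd "$PLUGIN" && find . -type f -print0 | sort -z | xargs -0 md5sum ) \
  > "$OUT/plugins/example-plugin/1.0.checksums"
cat "$OUT/plugins/example-plugin/1.0.checksums"
```

The output format is exactly what `md5sum -c` consumes, which keeps client-side verification trivial.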
@schlessera Some thoughts on this while I'm thinking of it.
First, @eriktorsner has an existing checksum middleware implementation that he's graciously offered to let us crib from. I'll let him weigh in here with more details.
Second, the simplest possible infrastructure to go with would be flat files (no database). I've chatted with the corresponding WordPress.org folks about hosting. If our middleware application can generate flat files served by some API, then it will be fine to sync those flat files to a WordPress.org server (with rsync or similar).
Lastly, the SVN checkouts are going to be hundreds of GB if not TB. DreamHost (via @getsource) has volunteered a server for us to run the checksum generator on.
First, for reference, my implementation is in two major parts:
1. The Worker
2. The API
With the suggested approach of storing all checksum data in files rather than in a database, some changes will be needed, but I think they are fairly small. Off the top of my head, this is what I'd like to do:
The Slim3 application would no longer be needed. However, the worker would still need to be hosted somewhere with a PHP interpreter and a PDO-compliant database (MySQL; I haven't tried SQLite, but I'm sure it would work fine).
Currently, 446,829 versions are indexed in the database, roughly ten times the number of individual plugins and themes in the official repos. I trust this is well within the limits of what a file system handles without performance issues, but please weigh in if you see a problem with this.
Using a file-based approach, we'd need to look for another solution if we also want to serve individual files from a plugin (for local diffing). It's entirely possible to request files from the SVN web front end, but I guess we'd have to check with a sysadmin whether the additional load is welcome.
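The local-diffing half of this works with manifests alone; only fetching pristine file contents needs the SVN front end. A sketch of detecting modified files against a manifest (file names here are made up):

```shell
# Sketch: local diffing — verify installed files against a manifest.
VERIFYDIR=$(mktemp -d); cd "$VERIFYDIR"
printf 'hello' > readme.txt
md5sum readme.txt > 1.0.checksums        # the manifest as served by the API
printf 'tampered' > readme.txt           # simulate a locally modified file
if md5sum -c --quiet 1.0.checksums 2>/dev/null; then
  echo 'all files match'
else
  echo 'modified files detected'
fi
```

Only after a mismatch is found would we need to fetch the original file to show an actual diff.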
I currently have experimental mechanisms for getting checksums from a couple of premium plugin vendors (Gravity Forms and Easy Digital Downloads). In the interest of overall security for WordPress users, I'd really like to see a way to keep this even if the service ends up being hosted on WordPress.org.
Current sizes (actual SVN repos, not checked out):
We had a Slack discussion about the server-side CPU intensity of using svnsync. To move things forward, I conducted a test using a real-world svn repo. During the test, roughly 3k revisions were synced (going from about revision 27k to 30k) at a speed of 10 revisions per second.
Server: DigitalOcean 4 GB droplet, 2 CPUs, Ubuntu 16.04
During the 5 minute sample time, the server CPU hovered between 0-4% for most of the time. Average CPU usage was 1.2%, peak CPU usage 8.3%. On the client, the average CPU usage was 17% with a peak at 49%.
Note 1: The CPU usage was measured with different tools on each side, so it's not obvious that the numbers can be compared like for like.
Note 2: The results are kind of expected. Reading revisions from a repo is mostly disk I/O. Subversion stores the diff between two revisions in a separate file per revision. The svnsync process is really just a long series of requests for individual diffs going from revision n to n+1, so this aligns closely with how Subversion stores things in its revision database. This works well for all non-binary files, but I think the process might be a bit more CPU intensive for images and other binary files, which probably need to be transformed into a format suitable for transfer. I suspect that the peaks in server-side usage above might be from handling binary files in individual revisions.
On the receiving end, each diff is then merged and committed into the revision tree, which is much more CPU intensive. As this happens, I'm fairly sure (please correct me if I'm wrong) that Subversion actually needs to take the content of the old file, apply the diff, and then store the new file (compressed) in the file system.
Note 3: From the server's perspective, svnsync is roughly the same as a client reading each revision's diff, one revision at a time.
Note 4: Committing new revisions takes more time the more revisions the repository has. The speed in this test (10 revisions/sec) is not realistic when we're at revision 1.7M. We're going to see well under 1 revision/sec which will further decrease the average load we put on the server.
Conclusion: Using svnsync to keep a separate repo in sync with the official plugin repo is not very hard on the Subversion server. It's going to spend most of its CPU time committing things from developers, we're just interested in reading individual revisions which is cheap in comparison.
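For reference, the mirroring flow under discussion can be sketched end to end with two throwaway local repositories. In the real setup the source would be https://plugins.svn.wordpress.org/; the repo names below are made up, and the block skips itself where the Subversion tools aren't installed:

```shell
# Sketch: svnsync mirroring, run entirely against local repos.
SYNCROOT=$(mktemp -d)
if command -v svnadmin >/dev/null 2>&1 && command -v svnsync >/dev/null 2>&1; then
  svnadmin create "$SYNCROOT/source"     # stands in for the live plugin repo
  svnadmin create "$SYNCROOT/mirror"
  # svnsync records its bookkeeping in revision properties, so the mirror
  # needs a pre-revprop-change hook that permits that:
  printf '#!/bin/sh\nexit 0\n' > "$SYNCROOT/mirror/hooks/pre-revprop-change"
  chmod +x "$SYNCROOT/mirror/hooks/pre-revprop-change"
  svnsync init "file://$SYNCROOT/mirror" "file://$SYNCROOT/source"
  svnsync sync "file://$SYNCROOT/mirror"   # replays revisions one at a time
  echo synced > "$SYNCROOT/result"
else
  echo skipped > "$SYNCROOT/result"
fi
cat "$SYNCROOT/result"
```

`svnsync sync` is restartable: re-running it picks up at the last mirrored revision, which is what makes the ongoing incremental sync cheap.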
I did some additional testing to see what kind of network effects svnsync has in real life. This svnsync was done over svn+ssh, which uses an ssh tunnel rather than the https used by the WordPress repos, but I think we're talking about the same order of magnitude in terms of resources.
The server is the same Digital Ocean server as in my last comment and I usually get 60Mbit download speed from my home office.
First, I synced a remote copy of a large repository for about an hour. Total work done was:
It's easy to see that the download rate isn't limited by bandwidth. At this rate, syncing all 1.7M revisions from the official plugin repo would take 168 hours (one week), but from experience I know that the rate goes down as the revision number goes up, so in the real world it's more like 2-3 weeks.
Next, I rsync'ed a 67 GB svn repo with a total of 1.68M revisions (slightly smaller than the current live plugin repo) between two devices on the same server (USB 3.0 flash drive to internal SSD). Even though bandwidth in this case is pretty much unlimited, there are just over 3.3 million files to sync, and rsync also does some integrity checking on each file, so things still take time. Here are some numbers:
As a next step, I allowed the source repo to svnsync 51 revisions from its "parent" repo, so that it was 51 revisions ahead of the copy created above. I then re-ran the rsync operation. In theory, there should be about 103 modified files to handle, but I didn't try to count them first.
So once the two repos are fairly in sync, rsync can be used to quickly establish a perfect file by file copy.
The issue is that rsync (or plain copying) is an all-or-nothing affair. Subversion has a proprietary storage format that is very easy to mess up: anything other than a perfect copy of a repository simply won't work. Somewhat simplified, each revision creates two new files, one for the diff and one for revision metadata. In addition to those, there's one large FSFS file (see https://stackoverflow.com/questions/19687614/what-does-fsfs-stand-for-as-related-to-subversion). If we rsync from a live repo that is receiving commits from developers, it will sometimes give us a repo copy that is just a little bit out of sync internally. I'm certain that svnsync offers better transaction integrity: it either gets a diff or it doesn't.
Regarding @schlessera's question about the difference between svnsyncing data vs. copying the same data: it's hard to answer, because the trick is figuring out exactly what data to copy, and rsync does a good job of finding the individual files. To give some sort of comparison with svnsyncing 150 MB from the same server down to my laptop, here's a benchmark of the raw bandwidth + ssh overhead:
About 51 Mbit/s
@eriktorsner: Great stuff! This data will help us make the case for using svnsync.
@Otto42 Yes, the planned approach was to build the API under a separate URL to be able to quickly iterate on it, and then migrate it over to the w.org API when everything is finalized (similarly to how we work on feature plugins before merging into Core, to keep initial velocity high). Do you think this does not make sense in this specific case?
No, I meant, why do we need to do a process to sync the repo instead of simply generating the checksums on w.org?
When we build zip files for plugins or themes, we have all the files right there. Adding code to generate checksums would be relatively simple in that process. We store those, make an endpoint to serve them, done.
Doing this all externally seems like adding a ton of load for code that we would never use anyway because it's the wrong way to integrate it to start with.
My thinking is that for any such process, we only want to run code when things actually change. We have such a system: we build zip files when a plugin changes. Add checksums to that, store them in a new table, and voila. Add a simple API call to return JSON data for, say, a list of packages, and there you go.
The reasoning behind building this on a separate server was that we don't want to add additional burden to the current SVN server. If we add this to the current server, and it happens to be more useful and successful than we'd like, it will be difficult to separate it again for scaling.
However, I have to admit that I don't yet know much about how the current API runs behind the scenes, or what resources the corresponding servers have. I'm happy to discuss this in more detail. In general, though, I think this might be a case where erring on the safe side is useful.
We don't do the ZIP building on the SVN servers, we do it on the normal web servers. Basically, we use cavalcade for job scheduling. When a plugin is committed to, a job is added to there to be run. That job runs on one of the web servers. It builds the ZIP, updates the plugin directory database, etc. Since it's already doing svn export to get the files to build the ZIPs, it can also make checksums at that time. This means that checksums will only update when plugins actually update, which is obviously the right way to scale it.
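The job described above could grow a checksum step like the following sketch. Everything here is a stand-in: `tar` substitutes for the real zip build, the export directory is faked, and in the real job the files would come from `svn export` of the tagged version:

```shell
# Sketch: the existing build job with a checksum step bolted on.
set -eu
BUILD=$(mktemp -d)
mkdir -p "$BUILD/example-plugin"
printf 'hello' > "$BUILD/example-plugin/readme.txt"   # stands in for `svn export`
cd "$BUILD"
tar -czf example-plugin.1.0.tar.gz example-plugin     # the existing archive step
# New step: checksum the exact files that went into the archive.
( cd example-plugin && find . -type f -print0 | sort -z | xargs -0 md5sum ) \
  > example-plugin.1.0.checksums
cat example-plugin.1.0.checksums
```

Because both artifacts come from the same export, the checksums are guaranteed to match the archive contents, with no second pass over SVN.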
Themes operate differently, but in a similar fashion with a cron job to update them from time to time.
Your suggested approach of syncing the SVN elsewhere seemingly adds far more load, not less. I think you should examine the existing system for a proper integration instead of trying to create something new, because whatever you're thinking of creating will simply have to be thrown away in the end for something a lot more like what I'm suggesting here anyway.