Detect changes using crates.io index repository #90

Merged
merged 3 commits into from
Dec 26, 2016

Member

Byron commented Dec 26, 2016

Possible Improvements

  • Database errors are ignored just to match the previous implementation. It might be better to do as much as possible, but at least return the last seen error so something shows up in the logs. Alternatively, errors could just be logged.
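
A minimal sketch of the "do as much as possible, but return the last seen error" idea follows. It is not the code from this PR: `process_all` and its `step` callback are hypothetical names, and the `eprintln!` call stands in for whatever logging the daemon actually uses.

```rust
use std::fmt::Debug;

/// Runs `step` for every item and never stops early.
/// Returns the number of successful steps, or the last error that was seen.
/// (Hypothetical helper; `step` stands in for the database insert.)
fn process_all<T, E: Debug>(
    items: &[T],
    mut step: impl FnMut(&T) -> Result<(), E>,
) -> Result<usize, E> {
    let mut succeeded = 0;
    let mut last_err = None;
    for item in items {
        match step(item) {
            Ok(()) => succeeded += 1,
            Err(err) => {
                // The "just log it" alternative: report the error and move on.
                eprintln!("step failed: {:?}", err);
                last_err = Some(err);
            }
        }
    }
    match last_err {
        Some(err) => Err(err),
        None => Ok(succeeded),
    }
}
```

With this shape, a single failing insert no longer hides the remaining crates from the queue, and the caller still gets an `Err` to log.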

Caveats

  • Even though the functionality itself is very well tested via the crates-index-diff and the crates-io-cli crates, I never saw the implementation in action.
  • Unless the crates.io-index is already checked out with the branch crates-index-diff_last-seen set to a recent commit, the first invocation of get_new_crates() will list (and possibly queue) all 40000 of them.
  • Previously a similar implementation might have caused file handle leakage. This should be watched carefully, as the leak could be in the outdated version of git2. If that should be the case, crates-io-cli could be used to move the leakage into its own process.

Notes

I had the feeling that this is required for it to build.
This assures that no matter how often we poll, we will always
see all changes since the last time we fetched.

Fixes #89

let index = try!(Index::from_path_or_cloned(&self.options.crates_io_index_path));
let changes = try!(index.fetch_changes());
for krate in changes.iter().filter(|k| k.kind != ChangeKind::Yanked) {
if self.db_cache.contains(&format!("{}-{}", krate.name, krate.version)) {

Member Author

I am not quite sure if that is still needed, assuming that the crates.io index will never repeat itself. It might be that the previous implementation needed it to prevent duplicates or unnecessary work.

Member

Yes, checking db_cache is no longer necessary, and loading db_cache is actually really expensive. That was the primary reason I always wanted to get new crates from the crates.io-index repository.

It is safe to remove the cache check.

Member Author

Alright, I removed it!
I also have a commit which removes the entire db_cache, but that might just be too much :).

Member

onur commented Dec 26, 2016

Awesome, and thank you very much again. There is only one more thing I'd like to mention since we are also updating all dependencies with this PR.

docs.rs is currently using an old version of cargo and until docs.rs starts using iron 0.4 we are basically stuck with an old version of cargo. Before merging this PR I'd like to update cargo to a recent revision that still uses openssl 0.7.

Cargo started depending on openssl 0.9 with this commit: rust-lang/cargo@15acaa9

The last revision with openssl 0.7 is f6500c6b. I'll update cargo to this revision and merge this PR. This revision is using git2 0.5, but I think there aren't any compatibility problems between git2 0.4 and 0.5.

Member Author

Byron commented Dec 26, 2016

Great to hear!
If there are any problems, just let me know and I will put git2 back to version 0.5 in the crates-index-diff crate - no problem.

onur added a commit that referenced this pull request Dec 26, 2016
Docs.rs currently stuck with openssl 0.7. This is only cargo revision
that is still using openssl 0.7 instead of 0.9.

Ref: #90
Member

onur commented Dec 26, 2016

@Byron this is working great but there is only one problem.

When I first ran get_new_crates, it added every crate and every version (~38k rows) to the build queue. I have a clone of crates.io-index in self.options.crates_io_index_path and it was only 6 commits behind. Can you make it take the current head of the self.options.crates_io_index_path repository and compare it with the new head?

Member

onur commented Dec 26, 2016

Ahh, it's using the head of the crates-index-diff_last-seen branch and comparing it with the new head of the master branch. I would prefer a comparison of old master and new master, but this is also perfectly fine.
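
To illustrate why the first run queued everything, here is a toy model of the last-seen marker. It does not use git2 or the real crates-index-diff API: `IndexModel`, its fields, and this `fetch_changes` are assumptions made for illustration, with `last_seen: None` standing in for a missing crates-index-diff_last-seen branch.

```rust
/// Toy stand-in for the index repository: each commit adds some crate versions.
struct IndexModel {
    commits: Vec<Vec<&'static str>>,
    /// Number of commits already reported; `None` means the marker is missing.
    last_seen: Option<usize>,
}

impl IndexModel {
    /// Returns everything added since the marker, then moves the marker to the head.
    fn fetch_changes(&mut self) -> Vec<&'static str> {
        // Without a marker there is nothing to compare against,
        // so the whole history counts as new.
        let start = self.last_seen.unwrap_or(0);
        let changes = self.commits[start..].iter().flatten().copied().collect();
        self.last_seen = Some(self.commits.len());
        changes
    }
}
```

In this model, a clone that is only a few commits behind still reports every crate when the marker is absent, while any later call reports only what was added since the previous call, however often it polls.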

@onur onur merged commit ecbdc84 into rust-lang:master Dec 26, 2016
onur added a commit that referenced this pull request Dec 26, 2016
Member Author

Byron commented Dec 27, 2016

Awesome!
I will now battle-test it and publish 140 or so crates in very short succession :)!

Because Docs.rs is so awesome (!!), I have now switched all my crates (at least those that I remember;)) to use docs.rs automatically.

Those that are generated, i.e. the google-apis-rs ones, link to their specific version. In the worst case, that link only becomes available after 10 minutes (I believe I read that value somewhere here), which could already put off interested parties.
Something that would certainly help is to considerably reduce the polling interval - what do you think about that? Another way to help could be to indicate that something is queued to be built.
What is your opinion about that?

Member

onur commented Dec 27, 2016

Yes, I was thinking the same yesterday. I already had a FIXME tag about this in daemon.rs. We can return the length of the changes vector from get_new_crates and build them right away when there are any. Reducing the sleep time between requests is also fine now.
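
A rough sketch of that loop, under stated assumptions: `next_step`, `run_daemon`, and the closure parameters are hypothetical names, and `get_new_crates` is modeled as a closure returning the number of newly queued crates rather than the real method in daemon.rs.

```rust
use std::time::Duration;

/// What the polling loop does after one check of the index.
#[derive(Debug, PartialEq)]
enum NextStep {
    BuildNow,
    Sleep(Duration),
}

/// `new_crates` is the count assumed to be returned by `get_new_crates`.
fn next_step(new_crates: usize, poll_interval: Duration) -> NextStep {
    if new_crates > 0 {
        NextStep::BuildNow
    } else {
        NextStep::Sleep(poll_interval)
    }
}

/// Bounded version of the daemon loop so the sketch terminates.
fn run_daemon(
    mut get_new_crates: impl FnMut() -> usize,
    mut build_queue: impl FnMut(),
    poll_interval: Duration,
    iterations: usize,
) {
    for _ in 0..iterations {
        match next_step(get_new_crates(), poll_interval) {
            // Something was queued: build immediately instead of waiting.
            NextStep::BuildNow => build_queue(),
            // Nothing new: wait for the next poll.
            NextStep::Sleep(pause) => std::thread::sleep(pause),
        }
    }
}
```

The design choice here is that the sleep only happens on an empty poll, so a shorter interval costs little when nothing changes and new crates start building as soon as they are seen.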

Member Author

Byron commented Dec 27, 2016

Great that you have it on your radar, as we are both dreaming of a world where docs.rs can build in near-realtime and is the de facto standard facility for all crate documentation.

By the way, right now I am anxiously looking at the site...
[screenshot, 2016-12-27 08:51:30]

...and am wondering if everything is alright. It seems to be about 50 minutes behind, and also doesn't seem to process the crates in the same order as returned by the find_changes() method. I know this thanks to the following invocation (and would have expected orders to be the same):

➜  docs.rs git:(master) ✗ crates recent-changes
google-tasks1 1.0.2+20141121 added
➜  docs.rs git:(master) ✗ crates recent-changes
google-blogger3 1.0.2+20150422 added
google-dfareporting2d1 1.0.2+20160323 added
google-dfareporting2d2 1.0.2+20160803 added
google-dfareporting2d3 1.0.2+20160803 added
google-groupsmigration1 1.0.2+20140416 added
google-manager1_beta2 1.0.2+20140915 added
google-proximitybeacon1_beta1 1.0.2+20160429 added
➜  docs.rs git:(master) ✗ crates recent-changes
google-adexchangebuyer1d3 1.0.2+20161020 added
google-adexchangebuyer1d4 1.0.2+20161020 added
google-adexchangeseller2 1.0.2+20160805 added
google-admin1_directory 1.0.2+20161124 added
google-admin1_reports 1.0.2+20160704 added
google-adsense1d4 1.0.2+20161206 added
google-adsensehost4d1 1.0.2+20161206 added
google-analytics3 1.0.2+20161004 added
google-androidenterprise1 1.0.2+20161207 added
google-androidpublisher2 1.0.2+20161212 added
google-appengine1 1.0.2+20161208 added
google-appengine1_beta4 1.0.2+20161208 added
google-appengine1_beta5 1.0.2+20161208 added
google-appsactivity1 1.0.2+20161202 added
google-appstate1 1.0.2+20161207 added
google-autoscaler1_beta2 1.0.2+20150629 added
google-bigquery2 1.0.2+20161130 added
google-blogger3-cli 1.0.2+20150422 added
google-books1 1.0.2+20161206 added
google-calendar3 1.0.2+20161211 added
google-classroom1 1.0.2+20161006 added
google-cloudbilling1 1.0.2+20151222 added
google-clouddebugger2 1.0.2+20160810 added
google-cloudlatencytest2 1.0.2+20160309 added
google-cloudmonitoring2_beta2 1.0.2+20161031 added
google-cloudresourcemanager1 1.0.2+20161212 added
google-cloudresourcemanager1_beta1 1.0.2+20161212 added
google-clouduseraccountsvm_beta 1.0.2+20160316 added
google-container1 1.0.2+20160421 added
google-content2 1.0.2+20161205 added
google-content2_sandbox 1.0.2+20161205 added
google-coordinate1 1.0.2+20150811 added
google-customsearch1 1.0.2+20160411 added
google-dataproc1 1.0.2+20161102 added
google-deploymentmanager2 1.0.2+20161209 added
google-deploymentmanager2_beta2 1.0.2+20160201 added
google-dfareporting2d1-cli 1.0.2+20160323 added
google-dfareporting2d2-cli 1.0.2+20160803 added
google-dfareporting2d3-cli 1.0.2+20160803 added
google-dfareporting2d4 1.0.2+20160803 added
google-dfareporting2d4-cli 1.0.2+20160803 added
google-dfareporting2d5 1.0.2+20161027 added
google-dfareporting2d6 1.0.2+20161027 added
google-dfareporting2d7 1.0.2+20161027 added
google-discovery1 1.0.2+00000000 added
google-dns1 1.0.2+20161130 added
google-doubleclickbidmanager1 1.0.2+20161010 added
google-doubleclicksearch2 1.0.2+20161108 added
google-drive2 1.0.2+20161212 added
google-drive3 1.0.2+20161212 added
google-firebasedynamiclinks1 1.0.2+20161118 added
google-fitness1 1.0.2+20161128 added
google-fusiontables2 1.0.2+20160526 added
google-games1 1.0.2+20161207 added
google-gamesconfiguration1_configuration 1.0.2+20161207 added
google-gamesmanagement1_management 1.0.2+20161207 added
google-gan1_beta1 1.0.2+20130205 added
google-genomics1 1.0.2+20160928 added
google-gmail1 1.0.2+20161206 added
google-groupsmigration1-cli 1.0.2+20140416 added
google-groupssettings1 1.0.2+20160525 added
google-iam1 1.0.2+20160915 added
google-identitytoolkit3 1.0.2+20161206 added
google-kgsearch1 1.0.2+20151215 added
google-licensing1 1.0.2+20150901 added
google-logging2 1.0.2+20161206 added
google-logging2_beta1 1.0.2+20161206 added
google-manager1_beta2-cli 1.0.2+20140915 added
google-manufacturers1 1.0.2+20161028 added
google-mirror1 1.0.2+20160616 added
google-ml1_beta1 1.0.2+20161212 added
google-monitoring3 1.0.2+20161212 added
google-pagespeedonline2 1.0.2+20161204 added
google-partners2 1.0.2+20151009 added
google-people1 1.0.2+20160210 added
google-playmoviespartner1 1.0.2+20160518 added
google-plus1 1.0.2+20161214 added
google-plusdomains1 1.0.2+20161214 added
google-prediction1d6 1.0.2+20160511 added
google-proximitybeacon1_beta1-cli 1.0.2+20160429 added
google-pubsub1 1.0.2+20161122 added
google-pubsub1_beta2 1.0.2+20161122 added
google-qpxexpress1 1.0.2+20160708 added
google-replicapool1_beta2 1.0.2+20160512 added
google-replicapoolupdater1_beta1 1.0.2+20161003 added
google-reseller1_sandbox 1.0.2+20160329 added
google-resourceviews1_beta2 1.0.2+20160512 added
google-safebrowsing4 1.0.2+20160520 added
google-serviceregistryalpha 1.0.2+20160401 added
google-siteverification1 1.0.2+20160228 added
google-slides1 1.0.2+20161213 added
google-spectrum1_explorer 1.0.2+20161116 added
google-sqladmin1_beta4 1.0.2+20161213 added
google-storage1 1.0.2+20161123 added
google-storagetransfer1 1.0.2+20150811 added
google-surveys2 1.0.2+20161103 added
google-tagmanager1 1.0.2+20160310 added
google-taskqueue1_beta2 1.0.2+20160428 added
google-tasks1-cli 1.0.2+20141121 added
google-translate2 1.0.2+20160627 added
google-urlshortener1 1.0.2+20150519 added
google-webfonts1 1.0.2+20160302 added
google-webmasters3 1.0.2+20160317 added
google-youtube3 1.0.2+20161202 added
google-youtubeanalytics1 1.0.2+20161213 added
google-youtubereporting1 1.0.2+20160719 added
➜  docs.rs git:(master) ✗ crates recent-changes
google-dfareporting2d5-cli 1.0.2+20161027 added
➜  docs.rs git:(master) ✗

Member

onur commented Dec 27, 2016

Unfortunately each google crate is taking 6-10 minutes to build, so everything is fine; they are just waiting in the queue.

This is the order in which docs.rs built the crates:

2016/12/27 07:53:07 [INFO] cratesfyi::docbuilder::chroot_builder: Building package google-blogger3-1.0.2+20150422
2016/12/27 07:58:33 [INFO] cratesfyi::docbuilder::chroot_builder: Building package google-dfareporting2d1-1.0.2+20160323
2016/12/27 08:06:38 [INFO] cratesfyi::docbuilder::chroot_builder: Building package google-dfareporting2d2-1.0.2+20160803
2016/12/27 08:14:22 [INFO] cratesfyi::docbuilder::chroot_builder: Building package google-groupsmigration1-1.0.2+20140416
2016/12/27 08:20:27 [INFO] cratesfyi::docbuilder::chroot_builder: Building package google-manager1_beta2-1.0.2+20140915
2016/12/27 08:26:38 [INFO] cratesfyi::docbuilder::chroot_builder: Building package google-proximitybeacon1_beta1-1.0.2+20160429
2016/12/27 08:32:32 [INFO] cratesfyi::docbuilder::chroot_builder: Building package google-tasks1-1.0.2+20141121
2016/12/27 08:42:53 [INFO] cratesfyi::docbuilder::chroot_builder: Building package google-androidenterprise1-1.0.2+20161207
2016/12/27 08:49:21 [INFO] cratesfyi::docbuilder::chroot_builder: Building package google-customsearch1-1.0.2+20160411
2016/12/27 08:55:19 [INFO] cratesfyi::docbuilder::chroot_builder: Building package google-dfareporting2d3-1.0.2+20160803

And the rest of the queue:

                     name                     |    version     
----------------------------------------------+----------------
 google-dfareporting2d4                       | 1.0.2+20160803
 google-dfareporting2d5                       | 1.0.2+20161027
 google-dfareporting2d6                       | 1.0.2+20161027
 google-dfareporting2d7                       | 1.0.2+20161027
 google-adexchangebuyer1d3                    | 1.0.2+20161020
 google-adexchangebuyer1d4                    | 1.0.2+20161020
 google-adsense1d4                            | 1.0.2+20161206
 google-cloudmonitoring2_beta2                | 1.0.2+20161031
 google-dns1                                  | 1.0.2+20161130
 google-gamesconfiguration1_configuration     | 1.0.2+20161207
 google-gan1_beta1                            | 1.0.2+20130205
 google-groupssettings1                       | 1.0.2+20160525
 google-iam1                                  | 1.0.2+20160915
 google-manufacturers1                        | 1.0.2+20161028
 google-pagespeedonline2                      | 1.0.2+20161204
 google-partners2                             | 1.0.2+20151009
 google-prediction1d6                         | 1.0.2+20160511
 google-pubsub1                               | 1.0.2+20161122
 google-pubsub1_beta2                         | 1.0.2+20161122
 google-replicapool1_beta2                    | 1.0.2+20160512
 google-resourceviews1_beta2                  | 1.0.2+20160512
 google-safebrowsing4                         | 1.0.2+20160520
...

I am not sure why we got different orders, but getting the right order with a git2 diff comparison was also an issue with my old code.

Member Author

Byron commented Dec 27, 2016

Thanks for the clarification! But wouldn't that mean that on the landing page, the most recent google crate should at best be 6 minutes old? It seems odd that the most recent release is said to be done an hour ago. Maybe the release specifies the actual release date in crates.io, and not when the documentation was built on docs.rs?

The order reported by find_changes() is dependent only on the diff between two arbitrary branches - to reproduce correct ordering, I believe it would have to step along the history and diff each commit separately.
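
The difference can be sketched with a toy model. It assumes that a diff between two trees lists entries in path order rather than publication order; `Snapshot`, `added`, `diff_endpoints`, and `diff_stepwise` are hypothetical names and not part of git2 or crates-index-diff.

```rust
use std::collections::BTreeSet;

/// The set of crate versions present in the index at one commit.
type Snapshot = BTreeSet<&'static str>;

/// Entries present in `new` but not in `old`, in sorted order,
/// which is roughly how a tree-to-tree diff would list them.
fn added(old: &Snapshot, new: &Snapshot) -> Vec<&'static str> {
    new.difference(old).copied().collect()
}

/// One diff between the first and last snapshot: publication order is lost.
fn diff_endpoints(history: &[Snapshot]) -> Vec<&'static str> {
    match (history.first(), history.last()) {
        (Some(first), Some(last)) => added(first, last),
        _ => Vec::new(),
    }
}

/// One diff per commit, concatenated: publication order is preserved.
fn diff_stepwise(history: &[Snapshot]) -> Vec<&'static str> {
    history
        .windows(2)
        .flat_map(|pair| added(&pair[0], &pair[1]))
        .collect()
}
```

If google-tasks1, google-blogger3, and google-adsense1d4 are published in that order, one commit each, the endpoint diff in this model reports them alphabetically while the stepwise diff reports them as published. The stepwise variant costs one diff per commit instead of one in total.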

However, another point of confusion is that docs.rs seems to have yet another order, maybe it's related to this sorting? It's not that I would think this is an actual issue, but it certainly added to my initial fear that it might not build all the docs.

Member

onur commented Dec 27, 2016

Yes, the release date is the actual release date on crates.io.

id is just an auto-incremented number. We are adding new crates to the bottom of the queue with each INSERT in get_new_crates. There might be some logical error in this, since we are also checking for new crates every 5 minutes.

I am not sure why you have any fears, but docs.rs was only skipping some crates because they weren't added properly to the queue, and thanks to you this problem is solved now. I only manually removed google-* crates from the queue once or twice in the past, because they were taking too long to build and you always released all of them at once.

BTW bottom of the queue right now:

  id   |                     name                     |    version     
-------+----------------------------------------------+----------------
 16271 | google-youtubeanalytics1-cli                 | 1.0.2+20161213
 16270 | google-doubleclicksearch2-cli                | 1.0.2+20161108
 16269 | google-coordinate1-cli                       | 1.0.2+20150811
 16268 | google-appengine1_beta5-cli                  | 1.0.2+20161208
 16267 | nanomsg-sys                                  | 0.6.1
 16266 | nanomsg                                      | 0.6.1
 16265 | google-taskqueue1_beta2-cli                  | 1.0.2+20160428
 16264 | google-tagmanager1-cli                       | 1.0.2+20160310
 16263 | google-serviceregistryalpha-cli              | 1.0.2+20160401
 16262 | google-games1-cli                            | 1.0.2+20161207
 16261 | google-drive3-cli                            | 1.0.2+20161212
 16260 | google-appengine1_beta4-cli                  | 1.0.2+20161208
 16259 | google-appengine1-cli                        | 1.0.2+20161208
 16258 | google-webfonts1-cli                         | 1.0.2+20160302
 16257 | google-storagetransfer1-cli                  | 1.0.2+20150811


I think there is definitely a logical error in my INSERT and SELECT statements and the 5 minute timeouts between each get_new_crates call. But since every new crate is getting added to the queue, I am fine with it.

Member Author

Byron commented Dec 27, 2016

Thanks again for letting me know!
I think what confuses me is that it can take very long until there is an update on the release page, clearly longer than 6 minutes. Maybe it's waiting for the appveyor builds, so the total time until all docs are finished is a multiple of that?

I should probably look more closely at how it works, as my use-case seems to push some limits to the point where I am blocking docs.rs for hours all by myself. After all, one release means ~140 crates, of which only 70 generate meaningful docs, with a total of 500kLOC.

Something I could imagine is to make it easy to add additional workers and some scheduling logic that prevents crates known to take a long time to use all of them.
But I am rambling - let's just wait for now :).

Member

onur commented Dec 27, 2016

@Byron you can see the list of crates in queue now: https://docs.rs/releases/queue

Since the crate order in the queue and the actual release times differ a bit, it's hard to spot new crates on the main page.

Member Author

Byron commented Dec 27, 2016

This is so awesome! Thanks for sharing that. I never clicked the 'recent releases' link on the main page, but will surely do so more often now! 174 crates are queued at the time of writing, incredible.

Something that would be awesome for my case is if instead of showing this (for example when clicking the documentation link of some google crate)...

[screenshot, 2016-12-27 13:02:02]

...one would show that it is queued, maybe alongside the queue number to give a hint how long it might take till it shows up, maybe like that:

[screenshot, 2016-12-27 13:06:06]

Doing this would solve my major problem, which is producing dead links right after uploading to crates.io.

If you think it would be nice to have, I could possibly contribute this feature, even though so far I have not brought up a server on my own, but would work to make that easier as well.

onur added a commit that referenced this pull request Dec 27, 2016
In addition to #90 this change will also run `build_queue` in it's own
thread to catch panics (only panicked twice in last 6 months).

This commit also introduces an attempt count to avoid endless tries to build
yanked crates. This only happens if a crate gets yanked before docs.rs tries.
docs.rs will only try to build a crate 5 times and ignore them after it but it
will still leave it in queue for inspection.
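
The two ideas in this commit message can be sketched as follows. This is not the actual docs.rs code: `QueuedCrate`, this `build_queue` signature, `run_isolated`, and the `build` callback are assumptions for illustration. Only the limit of 5 attempts and the idea of a separate thread come from the message above.

```rust
use std::thread;

/// Limit taken from the commit message above.
const MAX_ATTEMPTS: u32 = 5;

struct QueuedCrate {
    name: String,
    attempt: u32,
}

/// Tries every crate that still has attempts left. Built crates leave the
/// queue; failed ones stay with a higher attempt count; crates that reached
/// the limit are skipped but kept for inspection.
fn build_queue(queue: &mut Vec<QueuedCrate>, mut build: impl FnMut(&str) -> bool) {
    let mut remaining = Vec::new();
    for mut krate in queue.drain(..) {
        if krate.attempt >= MAX_ATTEMPTS {
            remaining.push(krate);
        } else if !build(&krate.name) {
            krate.attempt += 1;
            remaining.push(krate);
        }
    }
    *queue = remaining;
}

/// Runs one pass in its own thread, so a panic ends that pass and not the
/// daemon. Returns `false` if the pass panicked.
fn run_isolated<F: FnOnce() + Send + 'static>(pass: F) -> bool {
    thread::spawn(pass).join().is_ok()
}
```

In this sketch a crate that was yanked before its first build fails five passes in a row and is then left alone, instead of being retried forever.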
Development

Successfully merging this pull request may close these issues.

newly updated crates are not built if there are too many of them
2 participants