Add GitLab scraping #20

Open
pietroalbini opened this issue Sep 21, 2018 · 6 comments

@pietroalbini pietroalbini commented Sep 21, 2018

Scraping GitLab repositories is required if we want to test them on Crater (and we do). This issue tracks the implementation of the scraper.

The API calls we would need to make to scrape are:

This is currently blocked on:

  • Extend the GitLab /projects API to support filtering only Rust repos (GitLab issue)

@ecstatic-morse ecstatic-morse commented Feb 5, 2019

Gitlab CE now supports filtering /projects by programming language. I'm not sure when this will be deployed to gitlab.com.

However, Gitlab uses OFFSET and LIMIT to paginate the results of API calls, which means fetching the last page requires scanning the entire database. On my machine, fetching a page near the end of the database with https://gitlab.com/api/v4/projects?per_page=100&page=9000 takes about 14.5 seconds. If we assume that 10 seconds of that was spent in the database, fetching every project would take ~12.5 hours of database time.
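
For reference, here is a minimal sketch of how the offset-paginated request quoted above could be built, with an optional language filter. The helper name `projects_page_url` is hypothetical, and the `with_programming_language` parameter name is an assumption that should be checked against the Projects API documentation of the instance being scraped.

```python
from urllib.parse import urlencode

API_ROOT = "https://gitlab.com/api/v4/projects"


def projects_page_url(page: int, per_page: int = 100, language: str = "") -> str:
    """Build the URL for one offset-paginated page of the /projects listing."""
    params = {"per_page": per_page, "page": page}
    if language:
        # Assumed parameter name for the language filter; verify before use.
        params["with_programming_language"] = language
    return f"{API_ROOT}?{urlencode(params)}"


# Same request as the one timed above (page 9000, 100 projects per page).
last_page_url = projects_page_url(9000)

# Hypothetical Rust-only variant of the first page.
rust_first_page_url = projects_page_url(1, language="Rust")
```

Each such request still makes the database skip `(page - 1) * per_page` rows before returning anything, which is why late pages are slow.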

Crater's use-case won't be quite that bad. If 5% of projects on Gitlab contain Rust, we'll need only 37.5 minutes of database time to scrape. If the result is cached, perhaps it's feasible to start running crater on Gitlab projects, but it's probably better to wait until keyset-based pagination is re-enabled.
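
The two estimates above can be reproduced with a small back-of-the-envelope calculation. This sketch assumes that the database cost of a page grows linearly with its offset, so the average page costs about half of the last one, and that the Rust-only scrape scales proportionally with the 5% share. Both are simplifying assumptions, not measurements.

```python
def total_db_seconds(pages: int, last_page_seconds: float) -> float:
    """Total database time when page p costs last_page_seconds * p / pages.

    Summing p = 1..pages gives last_page_seconds * (pages + 1) / 2.
    """
    return last_page_seconds * (pages + 1) / 2


# 9000 pages, with ~10 s of database time assumed for the last page.
all_projects_hours = total_db_seconds(9000, 10.0) / 3600

# Proportional scaling for the assumed 5% share of Rust projects.
rust_only_minutes = all_projects_hours * 0.05 * 60

print(f"all projects: ~{all_projects_hours:.1f} h")
print(f"Rust only:    ~{rust_only_minutes:.1f} min")
```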

@pietroalbini pietroalbini commented Feb 5, 2019

> Gitlab CE now supports filtering /projects by programming language. I'm not sure when this will be deployed to gitlab.com.

That's great! Thanks for your work on this ❤️

> Crater's use-case won't be quite that bad. If 5% of projects on Gitlab contain Rust, we'll need only 37.5 minutes of database time to scrape. If the result is cached, perhaps it's feasible to start running crater on Gitlab projects, but it's probably better to wait until keyset-based pagination is re-enabled.

Yeah, our results are cached, so it wouldn't be that bad. If you ping me after the API is deployed on GitLab.com I can look into it, and if the performance is still bad I'll contact GitLab support to make sure we're not hurting them with the scrape.

@ecstatic-morse ecstatic-morse commented Feb 5, 2019

Will do!

@ecstatic-morse ecstatic-morse commented Feb 9, 2019

Looks like this will be deployed as part of the 11.8 release on February 22.

@mathstuf mathstuf commented Mar 23, 2019

If needed, we can add API calls to the gitlab crate to support this. We already support depagination too.
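
To illustrate what depagination means here, the following is a generic sketch that walks consecutive pages until an empty one comes back. It is not the gitlab crate's API: `depaginate`, `fetch_page`, and the fake page data are hypothetical names and values used only for illustration.

```python
from typing import Callable, Dict, Iterator, List


def depaginate(
    fetch_page: Callable[[int], List[dict]], first_page: int = 1
) -> Iterator[dict]:
    """Yield items from consecutive pages until an empty page is returned."""
    page = first_page
    while True:
        items = fetch_page(page)
        if not items:
            return
        yield from items
        page += 1


# Stand-in for a real HTTP call: two non-empty fake pages, then an empty one.
fake_pages: Dict[int, List[dict]] = {1: [{"id": 1}, {"id": 2}], 2: [{"id": 3}], 3: []}
projects = list(depaginate(lambda page: fake_pages.get(page, [])))
```

Passing the fetch function in as a parameter keeps the loop independent of the HTTP client and of which GitLab instance is being scraped.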

@mathstuf mathstuf commented Mar 23, 2019

There's also the question of which instances to scrape. gitlab.com is obvious, but the GNOME and freedesktop.org instances should probably also be on the list (at least).
