Can only scrape 10000 items. #19

Closed
fgregg opened this issue Oct 9, 2018 · 9 comments

@fgregg

fgregg commented Oct 9, 2018

This may be by intention, but

https://api.govinfo.gov/collections/CHRG/1990-01-01T00:00:00Z/?offset=10000&pageSize=100&api_key=DEMO_KEY

returns

{"message":"Error page size + offset must be less than maxoffset: 10000"}

If this policy is intended, it would be good to

  1. document this limitation
  2. provide bulk access (would be good to do this regardless)

Thanks so much.
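
Reading the error message literally, a client can stop paging before a request is rejected. Below is a minimal sketch, assuming the URL shape from the example above. `page_urls` and its `strict` flag are hypothetical names, and whether a sum of exactly 10000 is accepted is not confirmed in this thread.

```python
from urllib.parse import urlencode

BASE = "https://api.govinfo.gov/collections"
MAX_OFFSET = 10000  # value taken from the error message above


def page_urls(collection, start, page_size=100, api_key="DEMO_KEY", strict=True):
    """Yield page URLs that stay inside the offset + pageSize cap.

    strict=True requires offset + pageSize < MAX_OFFSET (a literal reading
    of the error message); strict=False also allows a sum equal to it.
    """
    limit = MAX_OFFSET if strict else MAX_OFFSET + 1
    offset = 0
    while offset + page_size < limit:
        query = urlencode(
            {"offset": offset, "pageSize": page_size, "api_key": api_key}
        )
        yield f"{BASE}/{collection}/{start}/?{query}"
        offset += page_size
```

With the defaults, the last URL produced has `offset=9800`, so anything beyond that has to come from a narrower date range.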

@jonquandt
Member

jonquandt commented Oct 9, 2018

@fgregg - The collections response does have a current limitation of 10,000 items. I'll look at how best to document it. You can also specify a narrower range using the endDate parameter.

I will also look into showing packages by last modified date as a default. That way you could more easily get the time ranges.

For example, a large number of CHRG were republished on 2018-06-05:
https://api.govinfo.gov/collections/CHRG/2018-06-05T00:00:00Z/2018-06-06T00:00:00Z?offset=0&pageSize=100&api_key=DEMO_KEY
"count": 13412,

All of those packages appear to have processed between 18:30 and 20:00.
https://api.govinfo.gov/collections/CHRG/2018-06-05T18:30:00Z/2018-06-05T20:00:00Z?offset=0&pageSize=100&api_key=DEMO_KEY
"count": 13412

Narrowing down:
https://api.govinfo.gov/collections/CHRG/2018-06-05T18:30:00Z/2018-06-05T18:50:00Z?offset=0&pageSize=100&api_key=DEMO_KEY
"count": 4131,

https://api.govinfo.gov/collections/CHRG/2018-06-05T18:50:01Z/2018-06-06T00:00:00Z?offset=0&pageSize=100&api_key=DEMO_KEY
"count": 9281,

Another 13172 were updated on 2018-07-30:
https://api.govinfo.gov/collections/CHRG/2018-07-30T00:00:00Z/2018-08-01T00:00:00Z?offset=0&pageSize=100&api_key=DEMO_KEY

This will give you about half of those results:
https://api.govinfo.gov/collections/CHRG/2018-07-30T12:00:00Z/2018-08-01T00:00:00Z?offset=0&pageSize=100&api_key=DEMO_KEY
"count": 6161,

The remainder:
https://api.govinfo.gov/collections/CHRG/2018-07-30T20:15:01Z/2018-08-01T00:00:00Z?offset=100&pageSize=100&api_key=DEMO_KEY
"count": 7005,

Outside of those two major update events, there have been only 334 new or updated packages since 2018-06-01.
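
The manual narrowing above can be automated by halving a window until its reported count drops under the cap. This is a sketch under assumptions, not the API's own method: `count_fn(start, end)` is a hypothetical callable that would request a window and return its `"count"` field, and the windows are treated as inclusive at both ends, following the 18:50:00Z / 18:50:01Z split used above.

```python
from datetime import datetime, timedelta, timezone

LIMIT = 10000  # cap reported by the collections endpoint


def fmt(ts):
    """Format a datetime the way the URLs above write timestamps."""
    return ts.strftime("%Y-%m-%dT%H:%M:%SZ")


def split_windows(start, end, count_fn, limit=LIMIT):
    """Return (start, end) windows whose reported count is below `limit`.

    count_fn(start, end) is a hypothetical callable returning the "count"
    for that window. Empty windows are dropped. A window that cannot be
    split further (start == end) is returned even if it is over the limit.
    """
    n = count_fn(start, end)
    if n == 0:
        return []
    if n < limit or start >= end:
        return [(start, end)]
    # Split at a whole second so the two halves do not overlap.
    mid = (start + (end - start) / 2).replace(microsecond=0)
    left = split_windows(start, mid, count_fn, limit)
    right = split_windows(mid + timedelta(seconds=1), end, count_fn, limit)
    return left + right
```

Each returned pair can be formatted with `fmt` and placed in the two date segments of the collections URL. If more than 10,000 packages share a single second, this approach cannot separate them.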

@fgregg
Author

fgregg commented Oct 9, 2018

yes, if we could sort by last-modified date that would be very good.

@fgregg
Author

fgregg commented Oct 9, 2018

thanks for this information. It's very valuable

@jonquandt jonquandt self-assigned this Oct 9, 2018
@jonquandt
Member

> yes, if we could sort by last-modified date that would be very good.

looking into this request.

@fgregg
Author

fgregg commented Oct 9, 2018

here's my current workaround. https://github.com/datamade/govinfo/blob/master/govinfo/__init__.py#L32-L61

there's a much more elegant solution if the data were sorted.

@jonquandt jonquandt added the "Upcoming" label (Items currently under development and will be available in our next release) Oct 25, 2018
@jonquandt jonquandt added this to the October 2018 update milestone Oct 25, 2018
@jonquandt
Member

collections response is now sorting by lastModified date.

https://api.govinfo.gov/collections/CHRG/1990-01-01T00:00:00Z/?offset=0&pageSize=100&api_key=DEMO_KEY

snippet:

{
  "count": 27027,
  "message": null,
  "nextPage": "https://api.govinfo.gov/collections/CHRG/1990-01-01T00:00:00Z/?offset=100&pageSize=100",
  "previousPage": null,
  "packages": [
    {
      "packageId": "CHRG-103hhrg76069",
      "lastModified": "2018-10-30T17:25:19Z",
      "packageLink": "https://api.govinfo.gov/packages/CHRG-103hhrg76069/summary",
      "docClass": "HHRG",
      "title": "Bpa Competitiveness"
    },
    {
      "packageId": "CHRG-103hhrg74075",
      "lastModified": "2018-10-30T17:21:12Z",
      "packageLink": "https://api.govinfo.gov/packages/CHRG-103hhrg74075/summary",
      "docClass": "HHRG",
      "title": "Bpa Proposed Flscal Year 1994 Budget"
    },
    {
      "packageId": "CHRG-103hhrg69593",
      "lastModified": "2018-10-30T17:17:43Z",
      "packageLink": "https://api.govinfo.gov/packages/CHRG-103hhrg69593/summary",
      "docClass": "HHRG",
      "title": "Indian Tribal Justice Act"
    },
    {
      "packageId": "CHRG-103hhrg74346",
      "lastModified": "2018-10-30T13:44:22Z",
      "packageLink": "https://api.govinfo.gov/packages/CHRG-103hhrg74346/summary",
      "docClass": "HHRG",
      "title": "Bpa Electric Power Resources Acquisition"
    },
    {
      "packageId": "CHRG-113shrg28394",
      "lastModified": "2018-10-30T13:41:19Z",
      "packageLink": "https://api.govinfo.gov/packages/CHRG-113shrg28394/summary",
      "docClass": "SHRG",
      "title": "Reassessing Solitary Confinement II: The Human Rights, Fiscal, and Public Safety Consequences"
    },
    {
      "packageId": "CHRG-115shrg27017",
      "lastModified": "2018-10-30T13:41:16Z",
      "packageLink": "https://api.govinfo.gov/packages/CHRG-115shrg27017/summary",
      "docClass": "SHRG",
      "title": "Nominations of Brock Long, Russell Vought and Neomi Rao"
    },
    ...
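
A client that depends on this ordering may want to verify it on each page. Here is a minimal sketch, assuming a complete response body shaped like the snippet above; `is_newest_first` is a hypothetical helper name. It relies on the fact that UTC timestamps in this fixed format sort correctly as plain strings.

```python
import json


def is_newest_first(response_text):
    """Return True if `packages` is ordered by lastModified, newest first."""
    data = json.loads(response_text)
    stamps = [pkg["lastModified"] for pkg in data.get("packages") or []]
    # Compare each timestamp with its successor; ties are allowed.
    return all(a >= b for a, b in zip(stamps, stamps[1:]))
```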

@fgregg
Author

fgregg commented Nov 3, 2018

@jonquandt in order to use the sorting as a way to get around the 10000 row limit, last modified needs to be sorted from earlier to later.

right now it's sorted from later to earlier.

If it's earlier to later, I can just use the last element I scraped before hitting the 10000 limit, and use that as a new start time parameter.

@fgregg
Author

fgregg commented Nov 3, 2018

actually, never mind. I can make this work.
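
One way this could work with newest-first ordering is to page until the cap, then restart with the last scraped `lastModified` as the new end date. The sketch below rests on assumptions and is not the linked workaround: `fetch_page(start, end, offset, page_size)` is a hypothetical callable returning a parsed response shaped like the snippet above, and the end date is assumed to be inclusive, so overlapping items are deduplicated by `packageId`.

```python
def scrape_all(fetch_page, start, end, page_size=100, max_offset=10000):
    """Collect packages newest first, restarting the window at the cap.

    fetch_page(start, end, offset, page_size) is a hypothetical callable
    that returns a dict with "packages" and "nextPage" keys.
    """
    seen = {}  # packageId -> package, in the order first seen
    while True:
        offset = 0
        got_new = False
        exhausted = False
        last_modified = None
        while offset + page_size < max_offset:
            page = fetch_page(start, end, offset, page_size)
            packages = page.get("packages") or []
            if not packages:
                exhausted = True
                break
            for pkg in packages:
                if pkg["packageId"] not in seen:
                    seen[pkg["packageId"]] = pkg
                    got_new = True
                last_modified = pkg["lastModified"]
            if page.get("nextPage") is None:
                exhausted = True
                break
            offset += page_size
        # Stop when the window is drained, or when a full pass found
        # nothing new (for example, too many items share one timestamp).
        if exhausted or not got_new:
            return list(seen.values())
        end = last_modified  # oldest timestamp seen becomes the new end date
```

If the end date turned out to be exclusive, items sharing the cursor timestamp could be missed, so that assumption is worth checking against a small window first.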

@fgregg
Author

fgregg commented Nov 3, 2018

Thank you for making this change.
