
New Data: Statutes at Large #34

Closed
JoshData opened this issue Jan 26, 2013 · 38 comments

@JoshData

Just a heads-up that Gordon and I are working on pulling info out of the new Statutes at Large MODS files.

@ghost ghost assigned JoshData Jan 26, 2013
@konklone

Ooh - using fdsys.py?

Curious how you intend to use them?

@GPHemsley

Josh can probably explain better, but the plan is to extract as much of the standard metadata from them as possible.

@GPHemsley

I've got a script now that extracts the basic Congress and bill numbers for all bills that became law. I'm going to attempt to squeeze some more data out of the MODS files in the coming days.

I'm also thinking that at some point the bill scraping code should be separated from the bill format/metadata files, since there will be multiple sources for bill data but only one way to output for each format (json/xml). I may tackle that at the same time.
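As a minimal sketch, extraction along those lines could look like the following. The `<bill>` element carrying congress/type/number attributes is an assumption about the FDsys MODS layout, not a confirmed schema; the real granule files may nest things differently:

```python
# Sketch: pull Congress and bill numbers out of a STATUTE MODS granule.
# The <bill congress=... type=... number=...> element is an assumed layout.
import xml.etree.ElementTree as ET

NS = {"mods": "http://www.loc.gov/mods/v3"}

SAMPLE = """<mods xmlns="http://www.loc.gov/mods/v3">
  <extension>
    <bill congress="82" type="hr" number="3096"/>
  </extension>
</mods>"""

def bills_for_granule(mods_xml):
    """Return one dict per bill reference found in the MODS document."""
    root = ET.fromstring(mods_xml)
    return [
        {
            "congress": bill.get("congress"),
            "bill_type": bill.get("type"),
            "number": bill.get("number"),
        }
        for bill in root.findall(".//mods:bill", NS)
    ]

print(bills_for_granule(SAMPLE))
# → [{'congress': '82', 'bill_type': 'hr', 'number': '3096'}]
```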

@JoshData

@konklone: My primary goal is to turn it into bills on GovTrack (enacted only, of course). So we'll be generating output that looks similar to the bill output. Plus whatever other interesting metadata is in the MODS files. And something with the text layer in the PDF.

@GPHemsley

I think the goal should also be to be able to do something like this and have it Just Work™:

./run bills --congress=82

(Even if it doesn't get all the bills in the 82nd Congress at first.)

@GPHemsley

I've got a working version of the script here:

https://github.com/GPHemsley/congress/blob/historical-bills-1951/tasks/statutes.py

You can run it by running a command like this:

./run statutes --path=STATUTE/1951/STATUTE-65 --govtrack

(After you run the fdsys task to put the fdsys files in place.)

@GPHemsley

So Josh pulled this into master, with a bunch of documentation, if you want to use it. (There may be more to do still, so I don't know if this should be closed just yet.)

@konklone

Ah ha! I get it now (looked at the code). So this is intended to fill in gaps from 1951 to 1972. I like this a lot. I have a couple thoughts (surprise), related to keeping the system sane as we expand into more scripts.

Since the data we get from THOMAS is uniformly superior to the data from the Statutes (right? is there anything unique to the Statutes collection?), the statutes script should probably default to an end year of 1972. This could become moot if we make the bills script default to a Statutes-driven approach for years before 1973.

It'd be great to keep this down to running one command, instead of two, using fdsys.py as a support library that the statutes.py script uses (rather than running it as a script as a prerequisite). You can see how I did this in bill_versions.py, to generate JSON files for each version of bill text.

@konklone

Separate thought - how much work would have to go into using scanned copies of the Statutes at Large for years prior to 1951? Since there's so much value just in getting the metadata, and scanning accurate text is not a concern, would it be worth it to engage in a manual (and one-time!) metadata collection effort using copies of the Statutes going back into antiquity, if we had them scanned?

@GPHemsley

I think it would be good to separate the code that is related to all bills from the code related to a single source of bills. And on top of that, it might be good to have the scrapers be separate from the parsers. Then the source-specific scripts could import the generic processing and output methods (as I do in statutes.py). Consolidating fdsys might be a part of that.

The Statutes data turns gaping holes into slightly-less-gaping holes by reverse-engineering (sort of) the metadata related to bills that have become law. The metadata is provided by the LOC somewhat accidentally, as a byproduct of being needed for archiving. As it stands, this does not get any information about the many bills that were never passed/enacted during the period from 1951 to 1972, and the data that it does get sometimes suffers from poor OCR. So yes, the Statutes data is essentially only a fallback for the cases when THOMAS data is not available.

Prior to 1951, Statutes data is not even available for enacted bills (AFAIK—I could be wrong), at least not from FDsys. However, the LOC also has information available on its American Memory site, such as here (http://memory.loc.gov/ammem/amlaw/lwhbsb.html), that might be worth looking into. Some parts are in text form, which would be (relatively) easy to scrape, while others are in GIF/TIFF image format, which would be a little more difficult. However, it even provides different versions of a given bill, for the Congresses that are available.

But if by "manual" metadata collection, you mean human-read and -input, I would definitely advise against that. There is just way too much metadata to collect.

@GPHemsley

It might be worth noting that they don't appear to have begun using codes like "H.R." to refer to bills until the 9th Congress (or later, depending on what you use as a reference).

@konklone

I do mean human-read and -input, but it'd be one-time only. It seems like it might be a worthwhile project, if the metadata that you've gathered from GPO's work is useful enough to build around. There is no fully official set of scanned Statutes before 1951 that I'm aware of, but I've definitely seen allegedly official unofficial sets of scanned Statutes PDFs going back a long way. Whether or not to consider them official enough for use would be an interesting question on its own.

I don't think it's worth tearing up the way we've done bills in a big way yet. Right now, we have a solid scraper for bill metadata from 1973 to now, a scraper for useful-if-holey data from 1951–1972, and a downloader for bill text from 1989 to the present (bill_versions.py). They all do very different things, they're all straightforward to use, and there's no friction or wasted effort yet. My inclination is usually to refactor reactively rather than proactively, and keeping each scraper relatively autonomous and separate allows us all to experiment more easily. I like how things are working.


@GPHemsley

If I'm understanding your intentions correctly, you're talking hundreds of thousands—if not millions—of bills, aren't you? I think a much more worthwhile project in the short term would be scraping American Memory. That has text data from the 6th through the 42nd Congresses, which would be much easier to parse automatically.

Regarding refactoring, my original reason for suggesting it was that I needed to import bill_info.py into statutes.py in order to get the generic output methods, but along with them came the THOMAS-related methods that I didn't need. At the very least, I think those two should be separated.

@konklone

I definitely don't want to micromanage anything here; if you think something can be improved, improve it. I would just be careful about adding any burden (requiring anyone writing a new script to know more about how the other scripts work) just to make things feel cleaner.

One way to make things better might be this: utils.py is getting pretty weighty, and is a mix of project-meta helpers and congress-meta helpers. Making a congress.py file and moving bill_info.output_bill, utils.current_congress, utils.split_bill_id, etc. into it seems like a good idea. It separates them like you describe, while keeping the scripts on the same flat pattern of "I only depend on myself, plus there are a couple of pools of utility methods I can dip into". I could see fdsys.py becoming its own pool of methods, and those files being put in their own directory.

Again, I do not want to nitpick; this is all going to work. We're hitting an awesome stride of growth in this project, and it probably does merit a bit of reorganization. I just think it will be easier in the long run for all of us if this all stays flat and simple and mostly non-systematized.


@GPHemsley

Speaking from my experience in writing statutes.py, I think splitting things out would make it easier, not harder, to write new scripts. I spent most of my time trying to track down all the various *_for() methods and what they meant and did, to see which ones I needed or could use. If all the generic ones were in their own file, it would have been somewhat easier for me to understand what was going on, I think.

For the record, these are the bill_info methods I used:

  • latest_status
  • history_from_actions
  • slip_law_from
  • current_title_for
  • output_bill

And there are probably others that I just didn't need but could be split out alongside them.

Speaking of utils, we could probably use some consolidation of the congress and congress-legislators utils.py files, perhaps as a separate project/repo. I had to do a lot of hacky things for legacy conversion to make things work together happily.

But yeah, I'm not attempting to make any crazy hierarchies here. Just splitting the pie up into smaller slices so I can pick only exactly what I need (while also making sure I can actually get what I need).

@GPHemsley

Pull request #39 is an important fix for making sure you get the right correspondence between bill number and bill text.

@JoshData

Let's not refactor yet. The next thing is pulling out bill text from 1951-1993 (there are fewer years of bill text on GPO than bill metadata on THOMAS).

@GPHemsley

I've tied in the Statute PDFs in pull request #41, so now you can actually see the bill/law associated with the often obscure titles.

Of course, pulling the text out of those PDFs is going to be quite an adventure unto itself. (Perhaps even one reserved for a separate issue.)

@GPHemsley

Would it be appropriate to include a reference such as "STATUTE-72-Pg3" alongside the action of enactment? The references field seems geared specifically towards the Congressional Record, but I think it would be good to open it up a little more to allow for other sources of information.

@konklone

I'd just add a new field. Even though it's called something general like "references", I think it should remain only CR refs, to keep assumptions when parsing that field simple. "source" might make sense.


@GPHemsley

Yeah, you're probably right. There will always be a record about it in the Congressional Record (or equivalent), but we might not always get the information from there. How about I make it a list named "sources", in case we ever have to combine sources to make a single action entry?

@konklone

Sure.


@GPHemsley

Of course that would leave the citation format to be determined. Should each source have a code for the general document/organization and then a specific citation within it, or should it just be { ... "sources": [ "STATUTE-72-Pg3" ] ... }?

@konklone

It's not too big a deal, since we can always regenerate it later, so how about just a URL to the original document for now? The other option is a full dict like [{source: "statutes", volume: "72", page: "3"}], which is also fine.


@JoshData

Both would be really helpful. I was going to add a source_url to all of our task output anyway, pointing to the page closest to where the information was scraped, suitable for "see more" type links. I'd like to see source_url added to all of the tasks, and, for the Statutes-generated files, something special on top of that, i.e. statute_citation: { "volume": 72, "page": 3 }, which would match the "72 Stat. 3" type citations people actually use.

@konklone

So the sources field would look like:

[{
  "source": "statutes",
  "source_url": "...",
  "volume": 72,
  "page": 3
}]

@GPHemsley

What URL should I use for the source_url? MODS? PDF?

Also, I think it would be good to also include the access ID ("STATUTE-72-Pg3"), since that's the primary identifier of a particular statute at the GPO. (When multiple statutes appear on the same page, they get different access IDs; "72 Stat. 3" could be ambiguous, though I could also include a field containing the page position value.)

@JoshData

I'd like a URL I can link to, so these sort of pages would be good:
http://www.gpo.gov/fdsys/granule/STATUTE-118/STATUTE-118-Pg493/content-detail.html

I'm not sure the accessID is useful without also the package ID it's contained in (STATUTE-72). Feel free to include one or both.

The citation is ambiguous to a bill, but it's what lawyers use sometimes, so it's useful.

@konklone

I think the "source_url" field should probably literally be the URL that was used to get the data being output, for provenance's sake. But you could add other URLs, and like you said, you can use the GPO identifier to construct other kinds of detail URLs client-side, too.


@GPHemsley

I currently have it outputting this:

  "sources": [
    {
      "access_id": "STATUTE-71-PgB6", 
      "page": "B6", 
      "position": "1", 
      "source": "statute", 
      "source_url": "http://www.gpo.gov/fdsys/granule/STATUTE-71/STATUTE-71-PgB6/content-detail.html", 
      "volume": "71"
    }
  ], 

@JoshData

@konklone If that's different from what I was suggesting, then I'm just going to ask for yet another field for a human-readable page....

@JoshData

Gordon- Looks great to me.

@GPHemsley

Updated:

  "sources": [
    {
      "access_id": "STATUTE-73-Pg14-2", 
      "package_id": "STATUTE-73", 
      "page": "14", 
      "position": "2", 
      "source": "statutes", 
      "source_url": "http://www.gpo.gov/fdsys/granule/STATUTE-73/STATUTE-73-Pg14-2/content-detail.html", 
      "volume": "73"
    }
  ], 
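For what it's worth, the access ID and URL in that output look regular enough to reconstruct from the citation pieces. A sketch; the layout, including the "-2" suffix for a statute's position on the page, is inferred from the example links in this thread, not from any official documentation:

```python
# Sketch: build the FDsys content-detail URL from the citation pieces.
# URL layout inferred from example links in this thread; treat it as an
# assumption, not a documented scheme.
def statute_source_url(volume, page, position=None):
    package_id = "STATUTE-%s" % volume
    access_id = "%s-Pg%s" % (package_id, page)
    # Statutes after the first on a page appear to get a "-N" suffix.
    if position is not None and int(position) > 1:
        access_id += "-%s" % position
    return ("http://www.gpo.gov/fdsys/granule/%s/%s/content-detail.html"
            % (package_id, access_id))

print(statute_source_url(118, 493))
# → http://www.gpo.gov/fdsys/granule/STATUTE-118/STATUTE-118-Pg493/content-detail.html
```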

@GPHemsley

I speak funny in my commit summaries. Pull request #43.

@JoshData

JoshData commented Feb 2, 2013

I've got this new bill data from 1951-1972 up on GovTrack now (http://www.govtrack.us/congress/bills/browse). Nice work, Gordon.

For the text, I'm thinking we extract the text layer of the PDF into bills/x/xddd/text-versions/enr/document.txt. (That's where the fdsys --store command puts current bill text.) Thoughts?
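A rough sketch of that flow, reading "x" as the bill type and "xddd" as type-plus-number (my interpretation of the path pattern, which may be wrong), and shelling out to pdftotext for the text layer (an assumed external dependency; the sketch defines but doesn't run that step):

```python
# Sketch: place a PDF's text layer at bills/x/xddd/text-versions/enr/document.txt.
# "x" = bill type and "xddd" = type+number is a guess at the path pattern;
# pdftotext is an assumed external tool.
import os
import subprocess

def text_version_path(bill_type, number, version="enr"):
    # e.g. bills/hr/hr3096/text-versions/enr/document.txt
    return os.path.join(
        "bills", bill_type, "%s%s" % (bill_type, number),
        "text-versions", version, "document.txt")

def extract_text_layer(pdf_path, bill_type, number):
    out = text_version_path(bill_type, number)
    os.makedirs(os.path.dirname(out), exist_ok=True)
    # -layout keeps the page's column layout in the extracted text
    subprocess.check_call(["pdftotext", "-layout", pdf_path, out])
    return out

print(text_version_path("hr", 3096))
# → bills/hr/hr3096/text-versions/enr/document.txt
```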

@konklone

konklone commented Feb 2, 2013

That makes sense to me. bill_versions.py is near-identical, putting a file at bills/x/xddd/text-versions/enr.json. I'll change it to be enr/data.json instead.


@GPHemsley

@tauberer It looks like you missed 1951–1957 (82–84). Also, you might want to make sure that the 85–88 files have been generated by the latest version of all files/scripts involved.

@konklone

This looks done enough to close. Re-open if I'm wrong, of course.
