
New Data: Statutes at Large #34

Closed
JoshData opened this issue Jan 26, 2013 · 38 comments

@JoshData

Just a heads-up that Gordon and I are working on pulling info out of the new Statutes at Large MODS files.

@ghost ghost assigned JoshData Jan 26, 2013
@konklone

Ooh - using fdsys.py?

Curious how you intend to use them?

@GPHemsley

Josh can probably explain better, but the plan is to extract as much of the standard metadata from them as possible.

@GPHemsley

I've got a script now that extracts the basic Congress and bill numbers for all bills that became law. I'm going to attempt to squeeze some more data out of the MODS files in the coming days.

I'm also thinking that at some point the bill scraping code should be separated from the bill format/metadata files, since there will be multiple sources for bill data but only one way to output for each format (json/xml). I may tackle that at the same time.
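As a minimal sketch, extraction along those lines could look like the following. The `<bill>` element carrying congress/type/number attributes is an assumption about the FDsys MODS layout, not a confirmed schema; the real granule files may nest things differently:

```python
# Sketch: pull Congress and bill numbers out of a STATUTE MODS granule.
# The <bill congress=... type=... number=...> element is an assumed layout.
import xml.etree.ElementTree as ET

NS = {"mods": "http://www.loc.gov/mods/v3"}

SAMPLE = """<mods xmlns="http://www.loc.gov/mods/v3">
  <extension>
    <bill congress="82" type="hr" number="3096"/>
  </extension>
</mods>"""

def bills_for_granule(mods_xml):
    """Return one dict per bill reference found in the MODS document."""
    root = ET.fromstring(mods_xml)
    return [
        {
            "congress": bill.get("congress"),
            "bill_type": bill.get("type"),
            "number": bill.get("number"),
        }
        for bill in root.findall(".//mods:bill", NS)
    ]

print(bills_for_granule(SAMPLE))
# → [{'congress': '82', 'bill_type': 'hr', 'number': '3096'}]
```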

@JoshData

@konklone: My primary goal is to turn it into bills on GovTrack (enacted only, of course). So we'll be generating output that looks similar to the bill output. Plus whatever other interesting metadata is in the MODS files. And something with the text layer in the PDF.

@GPHemsley

I think the goal should also be to be able to do something like this and have it Just Work™:

./run bills --congress=82

(Even if it doesn't get all the bills in the 82nd Congress at first.)

@GPHemsley

I've got a working version of the script here:

https://github.com/GPHemsley/congress/blob/historical-bills-1951/tasks/statutes.py

You can run it by running a command like this:

./run statutes --path=STATUTE/1951/STATUTE-65 --govtrack

(After you run the fdsys task to put the fdsys files in place.)

@GPHemsley

So Josh pulled this into master, with a bunch of documentation, if you want to use it. (There may be more to do still, so I don't know if this should be closed just yet.)

@konklone

Ah ha! I get it now (looked at the code). So this is intended to fill in gaps from 1951 to 1972. I like this a lot. I have a couple thoughts (surprise), related to keeping the system sane as we expand into more scripts.

Since the data we get from THOMAS is uniformly superior to the data from the Statutes (right? is there anything unique to the Statutes collection?), the statutes script should probably default to an end year of 1972. This could become moot if we make the bills script default to a Statutes-driven approach for years before 1973.

It'd be great to keep this down to running one command, instead of two, using fdsys.py as a support library that the statutes.py script uses (rather than running it as a script as a prerequisite). You can see how I did this in bill_versions.py, to generate JSON files for each version of bill text.

@konklone

Separate thought - how much work would have to go into using scanned copies of the Statutes at Large for years prior to 1951? Since there's so much value just in getting the metadata, and scanning accurate text is not a concern, would it be worth it to engage in a manual (and one-time!) metadata collection effort using copies of the Statutes going back into antiquity, if we had them scanned?

@GPHemsley

I think it would be good to separate the code that is related to all bills from the code related to a single source of bills. And on top of that, it might be good to have the scrapers be separate from the parsers. Then the source-specific scripts could import the generic processing and output methods (as I do in statutes.py). Consolidating fdsys might be a part of that.

The Statutes data turns gaping holes into slightly-less-gaping holes by reverse-engineering (sort of) the metadata related to bills that have become law. The metadata is provided by the LOC somewhat accidentally, as a byproduct of being needed for archiving. As it stands, this does not get any information about the many bills that were never passed/enacted during the period from 1951 to 1972, and the data that it does get sometimes suffers from poor OCR. So yes, the Statutes data is essentially only a fallback for the cases when THOMAS data is not available.

Prior to 1951, Statutes data is not even available for enacted bills (AFAIK—I could be wrong), at least not from FDsys. However, the LOC also has information available on its American Memory site, such as here (http://memory.loc.gov/ammem/amlaw/lwhbsb.html), that might be worth looking into. Some parts are in text form, which would be (relatively) easy to scrape, while others are in GIF/TIFF image format, which would be a little more difficult. However, it even provides different versions of a given bill, for the Congresses that are available.

But if by "manual" metadata collection, you mean human-read and -input, I would definitely advise against that. There is just way too much metadata to collect.

@GPHemsley

It might be worth noting that they don't appear to have begun using codes like "H.R." to refer to bills until the 9th Congress (or later, depending on what you use as a reference).

@konklone

I do mean human-read and -input, but it'd be one-time only. It seems like it might be a worthwhile project, if the metadata that you've gathered from GPO's work is useful enough to build around. There is no fully official set of scanned Statutes before 1951 that I'm aware of, but I've definitely seen allegedly official unofficial sets of scanned Statutes PDFs going back a long way. Whether or not to consider them official enough for use would be an interesting question on its own.

I don't think it's worth tearing up the way we've done bills in a big way yet. Right now, we have a solid scraper for bill metadata from 1973 to now, a scraper for useful-if-holey data from 1951–1972, and a downloader for bill text from 1989 to the present (bill_versions.py). They all do very different things, they're all straightforward to use, and there's no friction or wasted effort yet. My inclination is usually to refactor reactively rather than proactively, and keeping each scraper relatively autonomous and separate allows us all to experiment more easily. I like how things are working.


@GPHemsley

If I'm understanding your intentions correctly, you're talking hundreds of thousands—if not millions—of bills, aren't you? I think a much more worthwhile project in the short term would be scraping American Memory. That has text data from the 6th through the 42nd Congresses, which would be much easier to parse automatically.

Regarding refactoring, my original reason for suggesting it was that I needed to import bill_info.py into statutes.py in order to get the generic output methods, but along with them came the THOMAS-related methods that I didn't need. At the very least, I think those two should be separated.

@konklone

I definitely don't want to micromanage anything here; if you think something can be improved, improve it. I would just be careful about adding any burden (requiring anyone writing a new script to know more about how the other scripts work) just to make things feel cleaner.

One way to make things better might be this: utils.py is getting pretty weighty, and is a mix of project-meta helpers and congress-meta helpers. Making a congress.py file and moving bill_info.output_bill, utils.current_congress, utils.split_bill_id, etc. into it seems like a good idea. It separates them like you describe, while keeping the scripts on the same flat pattern of "I only depend on myself, plus there are a couple of pools of utility methods I can dip into". I could see fdsys.py becoming its own pool of methods, and those files being put in their own directory.

Again, I do not want to nitpick; this is all going to work. We're hitting an awesome stride of growth in this project, and it probably does merit a bit of reorganization. I just think it will be easier in the long run for all of us if this all stays flat and simple and mostly non-systematized.


@GPHemsley

Speaking from my experience in writing statutes.py, I think splitting things out would make it easier, not harder, to write new scripts. I spent most of my time trying to track down all the various *_for() methods and what they meant and did, to see which ones I needed or could use. If all the generic ones were in their own file, it would have been somewhat easier for me to understand what was going on, I think.

For the record, these are the bill_info methods I used:

  • latest_status
  • history_from_actions
  • slip_law_from
  • current_title_for
  • output_bill

And there are probably others that I just didn't need but could be split out alongside them.

Speaking of utils, we could probably use some consolidation of the congress and congress-legislators utils.py files, perhaps as a separate project/repo. I had to do a lot of hacky things for legacy conversion to make things work together happily.

But yeah, I'm not attempting to make any crazy hierarchies here. Just splitting the pie up into smaller slices so I can pick only exactly what I need (while also making sure I can actually get what I need).

@GPHemsley

Pull request #39 is an important fix for making sure you get the right correspondence between bill number and bill text.

@JoshData

Let's not refactor yet. The next thing is pulling out bill text from 1951-1993 (there are fewer years of bill text on GPO than bill metadata on THOMAS).

@GPHemsley

I've tied in the Statute PDFs in pull request #41, so now you can actually see the bill/law associated with the often obscure titles.

Of course, pulling the text out of those PDFs is going to be quite an adventure unto itself. (Perhaps even one reserved for a separate issue.)

@GPHemsley

Would it be appropriate to include a reference such as "STATUTE-72-Pg3" alongside the action of enactment? The references field seems geared specifically towards the Congressional Record, but I think it would be good to open it up a little more to allow for other sources of information.

@konklone

I'd just add a new field. Even though it's called something general like "references", I think it should remain only CR refs, to keep assumptions when parsing that field simple. "source" might make sense.


@GPHemsley

Yeah, you're probably right. There will always be a record about it in the Congressional Record (or equivalent), but we might not always get the information from there. How about I make it a list named "sources", in case we ever have to combine sources to make a single action entry?

@konklone

Sure.


@GPHemsley

Of course that would leave the citation format to be determined. Should each source have a code for the general document/organization and then a specific citation within it, or should it just be { ... "sources": [ "STATUTE-72-Pg3" ] ... }?

@konklone

It's not too big a deal, since we can always regenerate it later, so how about just a URL to the original document for now? The other option is a full dict like [{source: "statutes", volume: "72", page: "3"}], which is also fine.


@JoshData

Both would be really helpful. I was going to add a source_url to all of our task output anyway, pointing to the page closest to where the information was scraped, suitable for "see more" type links. I'd like to see source_url added to all of the tasks, and, for the Statutes-generated files, something special on top of that, i.e. statute_citation: { "volume": 72, "page": 3 }, which would match the "72 Stat. 3" type citations people actually use.

@konklone

So the sources field would look like:

[{
  "source": "statutes",
  "source_url": "...",
  "volume": 72,
  "page": 3
}]

@GPHemsley

What URL should I use for the source_url? MODS? PDF?

Also, I think it would be good to also include the access ID ("STATUTE-72-Pg3"), since that's the primary identifier of a particular statute at the GPO. (When multiple statutes appear on the same page, they get different access IDs; "72 Stat. 3" could be ambiguous, though I could also include a field containing the page position value.)

@JoshData

I'd like a URL I can link to, so these sort of pages would be good:
http://www.gpo.gov/fdsys/granule/STATUTE-118/STATUTE-118-Pg493/content-detail.html

I'm not sure the accessID is useful without also the package ID it's contained in (STATUTE-72). Feel free to include one or both.

The citation is ambiguous to a bill, but it's what lawyers use sometimes, so it's useful.

@konklone

I think the "source_url" field should probably literally be the URL that was used to get the data being output, for provenance's sake. But you could add other URLs, and like you said, you can use the GPO identifier to construct other kinds of detail URLs client-side, too.


@GPHemsley

I currently have it outputting this:

  "sources": [
    {
      "access_id": "STATUTE-71-PgB6", 
      "page": "B6", 
      "position": "1", 
      "source": "statute", 
      "source_url": "http://www.gpo.gov/fdsys/granule/STATUTE-71/STATUTE-71-PgB6/content-detail.html", 
      "volume": "71"
    }
  ], 

@JoshData

@konklone If that's different from what I was suggesting, then I'm just going to ask for yet another field for a human-readable page....

@JoshData

Gordon- Looks great to me.

@GPHemsley

Updated:

  "sources": [
    {
      "access_id": "STATUTE-73-Pg14-2", 
      "package_id": "STATUTE-73", 
      "page": "14", 
      "position": "2", 
      "source": "statutes", 
      "source_url": "http://www.gpo.gov/fdsys/granule/STATUTE-73/STATUTE-73-Pg14-2/content-detail.html", 
      "volume": "73"
    }
  ], 
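For what it's worth, the access ID and URL in that output look regular enough to reconstruct from the citation pieces. A sketch; the layout, including the "-2" suffix for a statute's position on the page, is inferred from the example links in this thread, not from any official documentation:

```python
# Sketch: build the FDsys content-detail URL from the citation pieces.
# URL layout inferred from example links in this thread; treat it as an
# assumption, not a documented scheme.
def statute_source_url(volume, page, position=None):
    package_id = "STATUTE-%s" % volume
    access_id = "%s-Pg%s" % (package_id, page)
    # Statutes after the first on a page appear to get a "-N" suffix.
    if position is not None and int(position) > 1:
        access_id += "-%s" % position
    return ("http://www.gpo.gov/fdsys/granule/%s/%s/content-detail.html"
            % (package_id, access_id))

print(statute_source_url(118, 493))
# → http://www.gpo.gov/fdsys/granule/STATUTE-118/STATUTE-118-Pg493/content-detail.html
```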

@GPHemsley

I speak funny in my commit summaries. Pull request #43.

@JoshData

JoshData commented Feb 2, 2013

I've got this new bill data from 1951-1972 up on GovTrack now (http://www.govtrack.us/congress/bills/browse). Nice work, Gordon.

For the text, I'm thinking we extract the text layer of the PDF into bills/x/xddd/text-versions/enr/document.txt. (That's where the fdsys --store command puts current bill text.) Thoughts?
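A rough sketch of that flow, reading "x" as the bill type and "xddd" as type-plus-number (my interpretation of the path pattern, which may be wrong), and shelling out to pdftotext for the text layer (an assumed external dependency; the sketch defines but doesn't run that step):

```python
# Sketch: place a PDF's text layer at bills/x/xddd/text-versions/enr/document.txt.
# "x" = bill type and "xddd" = type+number is a guess at the path pattern;
# pdftotext is an assumed external tool.
import os
import subprocess

def text_version_path(bill_type, number, version="enr"):
    # e.g. bills/hr/hr3096/text-versions/enr/document.txt
    return os.path.join(
        "bills", bill_type, "%s%s" % (bill_type, number),
        "text-versions", version, "document.txt")

def extract_text_layer(pdf_path, bill_type, number):
    out = text_version_path(bill_type, number)
    os.makedirs(os.path.dirname(out), exist_ok=True)
    # -layout keeps the page's column layout in the extracted text
    subprocess.check_call(["pdftotext", "-layout", pdf_path, out])
    return out

print(text_version_path("hr", 3096))
# → bills/hr/hr3096/text-versions/enr/document.txt
```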

@konklone

konklone commented Feb 2, 2013

That makes sense to me. bill_versions.py is near-identical, putting a file at bills/x/xddd/text-versions/enr.json. I'll change it to be enr/data.json instead.


@GPHemsley

@tauberer It looks like you missed 1951–1957 (82–84). Also, you might want to make sure that the 85–88 files have been generated by the latest version of all files/scripts involved.

@konklone

This looks done enough to close. Re-open if I'm wrong, of course.
