Text of amendments #52

Open
wilson428 opened this Issue Mar 20, 2013 · 27 comments

wilson428 commented Mar 20, 2013

I saw Gordon and Josh discussing this on the GovTrack side of things, but am I right that this does not currently fetch amendment text from the Congressional Record? If so, I can take a stab at it. Did you ever convert your Perl to Python, @JoshData?

konklone commented Mar 20, 2013

Looping in @drinks, who manages our CapitolWords project, which contains a big CR parser.

The text of amendments would be a huge get, and something I've wanted to parse out of the CR for a while. But I have no idea how hard that is.

JoshData commented Mar 20, 2013

Last time I did it (which was years ago), it wasn't really hard, but annoying to deal with CR scraping on THOMAS.

If you have Subversion installed, you can get my old Perl CR parser this way:
svn cat svn://govtrack.us/govtrack/gather/us/parse_record.pl

The last two subs handle the amendment-specific stuff. Only about 100 lines. Probably nothing salvageable, but maybe a good sanity check on logic.

Thanks Chris (and Eric and Dan).

drinks commented Mar 20, 2013

This could also be a starting place, though I'm not sure how complete:

http://capitolwords.org/api/1/text.json?apikey=<key>&title=text%20of%20amendments

wilson428 commented Mar 20, 2013

I think we can get it all straight from THOMAS. Every amendment has a link to "text of amendment," which links to a landing page, which in turn links to a custom query (using their weird ephemeral URLs). Since amendments_info.py already hits this page, I may start there.

konklone commented Mar 20, 2013

That's welcome. Ideally, we would extract and reuse whatever code CapitolWords is using to parse the CR, as something like a python-congressional-record library that unitedstates/congress could use to retrieve amendment text and store metadata for it. But if that's not feasible, then doing a one-off that uses THOMAS.gov is fine.

I hope Congress.gov preserves the same features and linkage, though: THOMAS.gov's shutoff date is getting closer.

konklone commented Mar 21, 2013

As far as I'm aware, there are no permalinks for this stuff, but it looks like if you go to an amendment's detail page and use the "Text of Amendment as Submitted" link, it takes you to a pretty terrible page, which links to a "printer friendly" page that looks a lot less terrible. I just looked at S. Amdt. 26 as an example, and its printer friendly page looks reasonable enough. The scraper might be a bit annoying and stateful, given the lack of permalinks.

My suggestion is to fetch text for amendments whose info has already been fetched, and maybe as a separate script, like amendment_text.py, which outputs new files in that amendment's output directory. The amendment number itself doesn't seem to appear on the printer friendly page.
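A minimal sketch of what such a separate script could look like. Everything here is an assumption for illustration: the function name `write_amendment_texts`, the idea that each amendment's output directory holds a `data.json`, the `text.txt` output name, and the `fetch_text` callback that stands in for whatever actually retrieves the text.

```python
import json
from pathlib import Path
from typing import Callable, Optional


def write_amendment_texts(root: Path,
                          fetch_text: Callable[[dict], Optional[str]]) -> int:
    """Write text.txt next to every already-fetched amendment info file.

    Assumes (hypothetically) that each amendment's output directory under
    `root` contains a data.json. `fetch_text` receives the parsed info and
    returns the amendment text, or None when no text could be found.
    """
    written = 0
    for info_path in sorted(root.glob("**/data.json")):
        info = json.loads(info_path.read_text(encoding="utf-8"))
        text = fetch_text(info)
        if text is None:
            continue  # no text found; leave this amendment untouched
        # Drop the text alongside the existing metadata.
        (info_path.parent / "text.txt").write_text(text, encoding="utf-8")
        written += 1
    return written
```

Passing the fetcher in as a callback keeps the directory walk independent of where the text comes from, so the source can change without touching the output logic.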

wilson428 commented Mar 21, 2013

Very good suggestion. Otherwise, THOMAS appears to behave differently based on the length of the text, since some amendments are all of 200 words. The CapitolWords project looks awesome, but it is too complex for me to integrate at the moment.

wilson428 commented Mar 21, 2013

Actually, does this URL structure mean anything to anyone?

http://thomas.loc.gov/cgi-bin/query/z?r113:S12MR3-0044:/

That's what I get if I use the share button in the upper right and go to "save" for SA27, a short Rubio amendment to SA26.

SA26 goes to here:

http://thomas.loc.gov/cgi-bin/query/z?r113:S11MR3-0039:/

I don't see any immediately obvious way to get this URL directly from the amendment number, though.

GPHemsley commented Mar 21, 2013

AFAICT, the format of the identifier is:

z?r[congress]:[chamber][day][month][year]-[item ID]:

Where [chamber] is "H" or "S" (or "E" for Extensions of Remarks or "D" for the Daily Digest), [day] is the 2-digit day of the month, [month] is a 2-character code for the month name ("MR" for March), [year] is the last digit of the year ("3" for 2013), and [item ID] is the 4-digit number of the item on the day's register.

For example, SA27 is the 44th item on the Senate register for March 12, 2013 in the 113th Congress:

http://thomas.loc.gov/cgi-bin/query/B?r113:@FIELD%28FLD003+s%29+@FIELD%28DDATE+20130312%29
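The format described above can be sketched as a small parser. This is an illustration, not anything from the repo: the names `CRItem` and `parse_thomas_cr_id` are made up, the month is left as its raw 2-character code because only "MR" is attested in this thread, and the full year is recovered on the assumption that the date falls in one of the two calendar years in which that Congress begins its sessions (2013 and 2014 for the 113th).

```python
import re
from typing import NamedTuple


class CRItem(NamedTuple):
    congress: int
    chamber: str      # "H", "S", "E" or "D"
    day: int
    month_code: str   # raw 2-character code, e.g. "MR"
    year: int
    item: int         # position on the day's register


# r[congress]:[chamber][day][month][year]-[item ID]:
_CR_ID = re.compile(r"r(\d+):([HSED])(\d{2})([A-Z]{2})(\d)-(\d{4}):")


def parse_thomas_cr_id(url: str) -> CRItem:
    """Parse a THOMAS Congressional Record identifier out of a URL."""
    match = _CR_ID.search(url)
    if match is None:
        raise ValueError("no CR identifier found in %r" % url)
    congress, chamber, day, month_code, year_digit, item = match.groups()
    # The nth Congress convenes in January of 1787 + 2n (113th -> 2013).
    first_year = 1787 + 2 * int(congress)
    # Assumption: the date falls in one of that Congress's two session years.
    years = [y for y in (first_year, first_year + 1)
             if y % 10 == int(year_digit)]
    if not years:
        raise ValueError("year digit does not fit congress %s" % congress)
    return CRItem(int(congress), chamber, int(day), month_code,
                  years[0], int(item))
```

For the SA27 URL above, this yields congress 113, chamber "S", day 12, month code "MR", year 2013, and item 44.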

wilson428 commented Mar 21, 2013

Thx, Gordon!

Just pushed a branch called amendment_text. I'm having some encoding nightmares, so there's a sloppy catch in the utils unescape function right now, with a URL in the comments that will break the original.

To try it, just pass --fulltext to the amendments call (see the README in the branch). Needs some work.

wilson428 commented Mar 22, 2013

I reworked it to just pull the text file of amendments from each day's CR, e.g.
http://www.gpo.gov/fdsys/pkg/CREC-2013-03-21/html/CREC-2013-03-21-pt1-PgS2169.htm

It's much easier to download this file and regex it than to crawl through THOMAS.
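A rough sketch of the "download and regex" idea, working on text that has already been downloaded. It assumes each amendment in the file opens with a marker like "SA 27. " at the start of a line. That marker format, the function name `split_amendments`, and the sample input are all illustrative assumptions, not the code in the branch.

```python
import re

# Assumed marker: each amendment opens with "SA <number>. " at line start.
AMENDMENT_START = re.compile(r"^[ \t]*SA (\d+)\. ", re.MULTILINE)


def split_amendments(record_text):
    """Map amendment number -> raw text, split on the assumed markers."""
    starts = list(AMENDMENT_START.finditer(record_text))
    amendments = {}
    for i, match in enumerate(starts):
        # An amendment runs until the next marker, or to the end of the file.
        if i + 1 < len(starts):
            end = starts[i + 1].start()
        else:
            end = len(record_text)
        number = int(match.group(1))
        amendments[number] = record_text[match.start():end].strip()
    return amendments


# Made-up placeholder input, not real Congressional Record text.
SAMPLE = (
    "TEXT OF AMENDMENTS\n"
    "\n"
    "  SA 1. Mr. EXAMPLE submitted an amendment; as follows:\n"
    "       On page 4, line 6, decrease the amount by $20,000,000,000.\n"
    "\n"
    "  SA 2. Ms. SAMPLE submitted an amendment; as follows:\n"
    "       At the end, add placeholder text.\n"
)

if __name__ == "__main__":
    for number, text in split_amendments(SAMPLE).items():
        print(number, len(text))
```

Real pages would also need the surrounding HTML stripped and entities unescaped before splitting.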

konklone commented Mar 22, 2013

That's exactly the approach Capitol Words takes. CW doesn't do anything special to parse out amendment text right now, but you still might look at its parser for lessons and ideas.

konklone commented Mar 22, 2013

In fact, I really should loop in @bycoffe too, since I believe he wrote the original version of that parser.

wilson428 commented Mar 22, 2013

Awesome, will do.

Has anyone tried to parse the amendment text and connect to the original legislation?

e.g., for "On page 4, line 6, decrease the amount by $20,000,000,000."

to connect to whatever's on page 4, line 6 of the actual legislation?
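As a first step toward that, the instruction itself can be turned into structured data. The sketch below handles only the sentence shape quoted above. The names `Instruction` and `parse_instruction` are made up for illustration, and accepting "increase" alongside "decrease" is an assumption.

```python
import re
from typing import NamedTuple, Optional


class Instruction(NamedTuple):
    page: int
    line: int
    action: str   # "increase" or "decrease"
    amount: int   # whole dollars


# Matches only sentences shaped like the example quoted above.
_INSTRUCTION = re.compile(
    r"On page (\d+), line (\d+), (increase|decrease) "
    r"the amount by \$([\d,]+)\."
)


def parse_instruction(sentence: str) -> Optional[Instruction]:
    """Return the parsed instruction, or None if the sentence does not match."""
    match = _INSTRUCTION.search(sentence)
    if match is None:
        return None
    page, line, action, amount = match.groups()
    return Instruction(int(page), int(line), action,
                       int(amount.replace(",", "")))
```

The page and line numbers could then be used to look up the matching line in a paginated copy of the bill, which is the harder part.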

JoshData commented Mar 22, 2013

As far as I know, no one has tried either that or anything in the bigger picture of parsing how things change other things.

konklone commented Mar 22, 2013

People sure do talk about it a lot, though! Everyone wants this. Are you a bad enough dude to connect amendments to legislation through page and line numbers?

wilson428 commented Mar 25, 2013

Any suggestions for how to integrate the fdsys routines into the amendment task, thus grabbing the unadulterated text of the amendment directly from GPO? I have a script that downloads the text file from GPO, parses it, and puts it in the right place, but right now it takes manual URLs from GPO (like the one a few messages back). I'm not sure how to go from an amendment (say, SA 136) to its location in the Congressional Record.

FYI, here's a crude demo of matching amendments to legislation. I can upload the guts of it once I get amendment text working right.

http://experimentsinform.com/media/demos/revisions/site/

konklone commented Mar 25, 2013

...whoa. Even this "crude demo" is more interesting than anything I've seen yet in the US for matching up amendments.

You can see an example of me using fdsys.py as a support library (instead of a standalone task) in bill_versions.py, which downloads bill text and version metadata from GPO and outputs JSON files for each version of each bill:
https://github.com/unitedstates/congress/blob/master/tasks/bill_versions.py

I actually haven't integrated bill_versions output into my production systems yet (still using an older Ruby-based GPO sync script), but I plan to.

wilson428 commented Mar 25, 2013

Awesome, thank you!

wilson428 commented Mar 26, 2013

The "amendment_text" branch now has a "parse" command that attempts to interpret the text of the amendments (which are now retrieved from GPO). It's smart enough to pinpoint "Title III" to a page and line if the amendment doesn't specify the exact location, but it still has a long way to go. See the README.

GPHemsley commented Mar 26, 2013

I just want to pop in to say that this is pretty awesome work.

wilson428 commented Mar 26, 2013

Thanks! Will take a lot more. All help much appreciated.

GPHemsley referenced this issue in govtrack/govtrack.us-web on Apr 10, 2013: "Include text of amendments voted on" #2 (open)

konklone commented Jun 17, 2013

Hey @wilson428, could I summon you to do a quick summary of the State of Amendment Text?

I'm putting the finishing touches on getting a /amendments endpoint done in Sunlight's Congress API using this data, and am making it full-text searchable on the purpose and description only. I'd love to get the full text indexed as well, even if it's not in structured form the way that bill text is.

What's the remaining path to dropping .txt files, even unformatted blob-of-words style, for each amendment?

wilson428 commented Jun 17, 2013

Awesome! This stalled over issues of where the raw text of amendments is stored in the CR, particularly in the House. I was following links from THOMAS to the CR to get the full text, but the agreement seemed to be that I should be using the fdsys scripts instead.

Then it turned out that House amendments in the CR are buried in the text of the daily activity. There was some sense that the text could be retrieved from a committee, I think?

I would love to dive back in, though I'd prefer to work on a forward-looking solution that doesn't die with the THOMAS end-of-life. Suggestions?

konklone commented Jun 17, 2013

If the text appears in the CR for both chambers, that's probably the most reliable path, even if the parsing is particularly messy. There's a lot of value even if the text is extracted in a non-readable, but indexable, form.

I don't think any committee would have floor amendments, although committee reports (which are released after huge delays) might have committee amendments in them sometimes. Also possible that the House Committee Repository might grow to include them for the House some day.

wilson428 commented Jun 17, 2013

Gotcha. I'll dive back in this week and give a more detailed report. Thanks for the nudge.

konklone commented Jun 17, 2013

And to be clear, I'm happy to work on this, either primarily or secondarily, once I emerge from my current unrelated work blitz.
