Text of amendments #52

Open
wilson428 opened this issue Mar 20, 2013 · 27 comments

Comments


@wilson428 wilson428 commented Mar 20, 2013

I saw Gordon and Josh discussing this on the govtrack side of things, but am I right that this does not currently fetch amendment text from the Congressional Record? If so, I can take a stab. Did you ever convert your Perl to Python, @JoshData?


@konklone konklone commented Mar 20, 2013

Looping in @drinks, who manages our CapitolWords project, which contains a big CR parser.

The text of amendments would be a huge get, and something I've wanted to parse out of the CR for a while. But I have no idea how hard that is.


@JoshData JoshData commented Mar 20, 2013

Last time I did it (which was years ago), it wasn't really hard, but it was annoying to deal with CR scraping on THOMAS.

If you have Subversion installed, you can get my old Perl CR parser this way:
svn cat svn://govtrack.us/govtrack/gather/us/parse_record.pl

The last two subs handle the amendment-specific stuff. Only about 100 lines. Probably nothing salvageable, but maybe a good sanity check on logic.

Thanks Chris (and Eric and Dan).


@drinks drinks commented Mar 20, 2013

This could also be a starting place, though I'm not sure how complete:

http://capitolwords.org/api/1/text.json?apikey=<key>&title=text%20of%20amendments



@wilson428 wilson428 commented Mar 20, 2013

I think we can get it all straight from THOMAS. Every amendment has a link to "text of amendment," which links to a landing page, which in turn links to a custom query (using their weird ephemeral URLs). Since amendments_info.py already hits this page, I may start there.


@konklone konklone commented Mar 20, 2013

That's welcome. The ideal thing here would be to extract and reuse whatever code CapitolWords is using to parse the CR - like a python-congressional-record library or something, that unitedstates/congress could wield to retrieve amendment text and store metadata for it. But if that's not feasible, then doing a one-off that uses THOMAS.gov is fine.

I hope Congress.gov preserves the same features and linkage, though - THOMAS.gov's shutoff date is getting closer.



@konklone konklone commented Mar 21, 2013

As far as I'm aware, there are no permalinks for this stuff. But if you go to an amendment's detail page and use the "Text of Amendment as Submitted" link, it takes you to a pretty terrible page, which links to a "printer friendly" page that looks a lot less terrible. I just looked at S. Amdt. 26 as an example, and its printer friendly page looks reasonable enough. The scraper might be a bit annoying and stateful, given the lack of permalinks.

My suggestion is to fetch text for amendments whose info has already been fetched, and maybe as a separate script, like amendment_text.py, which outputs new files in that amendment's output directory. The amendment number itself doesn't seem to appear on the printer friendly page.


@wilson428 wilson428 commented Mar 21, 2013

Very good suggestion. Otherwise, THOMAS appears to behave differently based on the length of the text, since some amendments are all of 200 words. The CapitolWords project looks awesome, but too complex for me to integrate at the moment.


@wilson428 wilson428 commented Mar 21, 2013

Actually, does this URL structure mean anything to anyone?

http://thomas.loc.gov/cgi-bin/query/z?r113:S12MR3-0044:/

That's what I get if I use the share button in the upper right and go to "save" for SA27, a short Rubio amendment to SA26.

SA26 goes to here:

http://thomas.loc.gov/cgi-bin/query/z?r113:S11MR3-0039:/

I don't see any immediately obvious way to get this URL directly from the amendment number, though.


@GPHemsley GPHemsley commented Mar 21, 2013

AFAICT, the format of the identifier is:

z?r[congress]:[chamber][day][month][year]-[item ID]:

Where [chamber] is "H" or "S" (or "E" for extension of remarks or "D" for digest), [day] is the 2-digit day of the month, [month] is a 2-character code for the month ("MR" for March), [year] is the last digit of the year ("3" for 2013), and [item ID] is the 4-digit number of the item on the day's register.

For example, SA27 is the 44th item on the Senate register for March 12, 2013 in the 113th Congress:

http://thomas.loc.gov/cgi-bin/query/B?r113:@FIELD%28FLD003+s%29+@FIELD%28DDATE+20130312%29
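
A minimal sketch of the identifier format described above, in Python. The names `RecordId`, `parse_record_id` and `build_record_id` are hypothetical and do not come from the project. The month is left as its raw two-character code, because only "MR" (March) is confirmed by the URLs in this thread.

```python
import re
from typing import NamedTuple


class RecordId(NamedTuple):
    """Pieces of a THOMAS Congressional Record identifier (hypothetical type)."""
    congress: int     # e.g. 113
    chamber: str      # "H", "S", "E" or "D"
    day: int          # day of the month
    month_code: str   # two-character month code, left undecoded (e.g. "MR")
    year_digit: int   # last digit of the year, e.g. 3 for 2013
    item: int         # position of the item in that day's register


# Pattern follows the format guessed at in this thread:
# z?r[congress]:[chamber][day][month][year]-[item ID]:
_ID_RE = re.compile(r"z\?r(\d+):([HSED])(\d{2})([A-Z]{2})(\d)-(\d{4}):")


def parse_record_id(url: str) -> RecordId:
    """Extract the identifier fields from a THOMAS query URL."""
    m = _ID_RE.search(url)
    if m is None:
        raise ValueError("no record identifier found in %r" % url)
    congress, chamber, day, month_code, year_digit, item = m.groups()
    return RecordId(int(congress), chamber, int(day), month_code,
                    int(year_digit), int(item))


def build_record_id(rid: RecordId) -> str:
    """Rebuild the query-string fragment from its fields."""
    return "z?r{}:{}{:02d}{}{}-{:04d}:".format(
        rid.congress, rid.chamber, rid.day, rid.month_code,
        rid.year_digit, rid.item)


if __name__ == "__main__":
    print(parse_record_id("http://thomas.loc.gov/cgi-bin/query/z?r113:S12MR3-0044:/"))
```

This only decodes a URL you already have. It does not solve the harder problem raised above of getting from an amendment number to the item ID.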


@wilson428 wilson428 commented Mar 21, 2013

Thx, Gordon!

Just pushed a branch called amendment_text. I'm having some encoding nightmares, so there's a sloppy catch in the utils unescape function right now with a URL in the comments that will break the original.

To try it, just pass --fulltext to the amendments call (see README in branch). Needs some work.


@wilson428 wilson428 commented Mar 22, 2013

I reworked it to just pull the text file of amendments from each day's CR, e.g.
http://www.gpo.gov/fdsys/pkg/CREC-2013-03-21/html/CREC-2013-03-21-pt1-PgS2169.htm

Much easier to download this file and regex it than to crawl through THOMAS.
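
A rough sketch of the "download and regex" idea, in Python. The URL pattern is copied from the example link above. Everything else is an assumption: the function names are hypothetical, and the heading pattern (`SA <number>. ` at the start of a line) is a guess at how Senate amendments are introduced in the CR text, which would need checking against real files. The download step itself is left out.

```python
import re
from datetime import date


def crec_url(day: date, part: int, page: str) -> str:
    """Build a GPO Congressional Record URL shaped like the example above.

    `page` is the page label as it appears in the link, e.g. "S2169".
    """
    pkg = "CREC-{}".format(day.isoformat())
    return "http://www.gpo.gov/fdsys/pkg/{0}/html/{0}-pt{1}-Pg{2}.htm".format(
        pkg, part, page)


# ASSUMPTION: each Senate amendment starts on its own line with "SA <n>. ".
_SA_HEADING = re.compile(r"^[ \t]*SA (\d+)\. ", re.MULTILINE)


def split_senate_amendments(text: str) -> dict:
    """Split a day's "text of amendments" into {amendment number: text}."""
    matches = list(_SA_HEADING.finditer(text))
    chunks = {}
    for i, m in enumerate(matches):
        # Each chunk runs from its own heading to the next heading (or EOF).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chunks[int(m.group(1))] = text[m.start():end].strip()
    return chunks


if __name__ == "__main__":
    print(crec_url(date(2013, 3, 21), 1, "S2169"))
```

The page label still has to come from somewhere, so this does not by itself locate a given amendment in the Record.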


@konklone konklone commented Mar 22, 2013

That's exactly the approach Capitol Words takes. CW doesn't do anything special to parse out amendment text right now, but you still might look at its parser for lessons and ideas.


@konklone konklone commented Mar 22, 2013

In fact, I really should loop in @bycoffe too, since I believe he wrote the original version of that parser.


@wilson428 wilson428 commented Mar 22, 2013

Awesome, will do.

Has anyone tried to parse the amendment text and connect it to the original legislation?

e.g., for "On page 4, line 6, decrease the amount by $20,000,000,000."

to connect to whatever's on page 4, line 6 of the actual legislation?
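
To make the question concrete, here is a small Python sketch that handles only the one sentence shape quoted above. The names `Instruction`, `parse_instruction` and `locate_line` are hypothetical, and the bill is assumed to be available as a dict mapping page numbers to lists of lines, which is not a format the project is known to produce.

```python
import re
from typing import Dict, List, NamedTuple, Optional


class Instruction(NamedTuple):
    page: int
    line: int
    action: str   # "increase" or "decrease"
    amount: int   # dollars


# Matches only instructions shaped like the example sentence in this thread.
_INSTR = re.compile(
    r"On page (\d+), line (\d+), (increase|decrease) the amount by \$([\d,]+)")


def parse_instruction(text: str) -> Optional[Instruction]:
    """Return the page/line/amount instruction in `text`, or None."""
    m = _INSTR.search(text)
    if m is None:
        return None
    page, line, action, amount = m.groups()
    return Instruction(int(page), int(line), action,
                       int(amount.replace(",", "")))


def locate_line(pages: Dict[int, List[str]], instr: Instruction) -> Optional[str]:
    """Look up the bill line an instruction points at (1-based line numbers)."""
    lines = pages.get(instr.page, [])
    if 1 <= instr.line <= len(lines):
        return lines[instr.line - 1]
    return None


if __name__ == "__main__":
    print(parse_instruction(
        "On page 4, line 6, decrease the amount by $20,000,000,000."))
```

Real amendments use many other phrasings ("strike", "insert after", references to sections rather than pages), so this covers only the simplest case.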


@JoshData JoshData commented Mar 22, 2013

As far as I know, no one has tried either that or anything in the bigger picture of parsing how things change other things.


@konklone konklone commented Mar 22, 2013

People sure do talk about it a lot, though! Everyone wants this. Are you a bad enough dude to connect amendments to legislation through page and line numbers?


@wilson428 wilson428 commented Mar 25, 2013

Any suggestions for how to integrate the fdsys routines into the amendment task, thus grabbing the unadulterated text of the amendment directly from GPO? I have the script to download the text file from GPO, parse it, and put it in the right place, but right now it takes manual URLs from GPO (like the one a few messages back). I'm not sure how to go from an amendment (say, SA 136) to its location in the Congressional Record.

FYI, here's a crude demo of matching amendments to legislation. I can upload the guts of it once I can get amendment text working right.

http://experimentsinform.com/media/demos/revisions/site/


@konklone konklone commented Mar 25, 2013

...whoa. Even this "crude demo" is more interesting than anything I've seen yet in the US for matching up amendments.

You can see an example of me using fdsys.py as a support library (instead of a standalone task) in bill_versions.py, which downloads bill text and version metadata from GPO and outputs JSON files for each version of each bill:
https://github.com/unitedstates/congress/blob/master/tasks/bill_versions.py

I actually haven't integrated bill_versions output into my production systems yet (still using an older Ruby-based GPO sync script), but I plan to.


@wilson428 wilson428 commented Mar 25, 2013

Awesome, thank you!



@wilson428 wilson428 commented Mar 26, 2013

The "amendment_text" branch now has a "parse" command that attempts to interpret the text of the amendments (which now are retrieved from GPO). It's smart enough to figure out how to pinpoint "Title III" to a page and line if the amendment doesn't specify the exact location, but still has a long ways to go. See README.


@GPHemsley GPHemsley commented Mar 26, 2013

I just want to pop in to say that this is pretty awesome work.


@wilson428 wilson428 commented Mar 26, 2013

Thanks! Will take a lot more. All help much appreciated.


@konklone konklone commented Jun 17, 2013

Hey @wilson428, could I summon you to do a quick summary of the State of Amendment Text?

I'm putting the finishing touches on getting a /amendments endpoint done in Sunlight's Congress API using this data, and am making it full-text searchable on the purpose and description only. I'd love to get the full text indexed as well, even if it's not in structured form the way that bill text is.

What's the remaining path to dropping .txt files, even unformatted blob-of-words style, for each amendment?


@wilson428 wilson428 commented Jun 17, 2013

Awesome! This stalled over issues of where the raw text of amendments is stored in the CR, particularly in the House. I was following links from THOMAS to the CR to get the full text, but the agreement seemed to be that I should be using the fdsys scripts instead.

Then it turned out that House amendments in the CR are buried in the text of the daily activity. There was some sense that the text could be retrieved from a committee, I think?

I would love to dive back in, though I'd prefer to work on a forward-looking solution that doesn't die with THOMAS's end-of-life. Suggestions?


@konklone konklone commented Jun 17, 2013

If the text appears in the CR for both chambers, that's probably the most reliable path, even if the parsing is particularly messy. There's a lot of value even if the text is extracted in a non-readable, but indexable, form.

I don't think any committee would have floor amendments, although committee reports (which are released after huge delays) might have committee amendments in them sometimes. Also possible that the House Committee Repository might grow to include them for the House some day.


@wilson428 wilson428 commented Jun 17, 2013

Gotcha. I'll dive back in this week and give a more detailed report. Thanks for the nudge.


@konklone konklone commented Jun 17, 2013

And to be clear, I'm happy to work on this, either primarily or secondarily, once I emerge from my current unrelated work blitz.
