Text of amendments #52
Comments
Looping in @drinks, who manages our CapitolWords project, which contains a big CR parser. The text of amendments would be a huge get, and something I've wanted to parse out of the CR for a while. But I have no idea how hard that is. |
Last time I did it (which was years ago), it wasn't really hard, but annoying to deal with CR scraping on THOMAS. If you have Subversion installed, you can get my old Perl CR parser this way: The last two subs handle the amendment-specific stuff. Only about 100 lines. Probably nothing salvageable, but maybe a good sanity check on logic. Thanks Chris (and Eric and Dan). |
This could also be a starting place, though I'm not sure how complete it is: http://capitolwords.org/api/1/text.json?apikey=%3Ckey%3E&title=text%20of%20amendments On Wednesday, March 20, 2013 at 5:04 PM, Joshua Tauberer wrote: |
I think we can get it all straight from THOMAS. Every amendment has a link to "text of amendment," which links to a landing page, which in turn links to a custom query (using their weird ephemeral URLs). Since |
That's welcome. The most ideal thing here would be to extract and reuse I hope Congress.gov preserves the same features and linkage, though - On Wed, Mar 20, 2013 at 5:13 PM, Chris Wilson notifications@github.com wrote:
Developer | sunlightfoundation.com |
I can't use permalinks for this stuff that I'm aware of, but it looks like if you go to an amendment's detail page and use the "Text of Amendment as Submitted" link, it takes you to a pretty terrible page, which links to a "printer friendly" page which looks a lot less terrible. I just looked at S. Amdt. 26 as an example, whose printer friendly page looks reasonable enough. The scraper might be a bit annoying and stateful, given the lack of permalinks. My suggestion is to fetch text for amendments whose info has already been fetched, and maybe as a separate script, like amendment_text.py, which outputs new files in that amendment's output directory. The amendment number itself doesn't seem to appear on the printer friendly page. |
Very good suggestion. Otherwise, THOMAS appears to behave differently based on the length of the text, since some amendments are all of 200 words. The CapitolWords project looks awesome, but too complex for me to integrate at the moment. |
Actually, does this URL structure mean anything to anyone? http://thomas.loc.gov/cgi-bin/query/z?r113:S12MR3-0044:/ That's what I get if I use the share button in the upper right and go to "save" for SA27, a short Rubio amendment to SA26. SA26 goes to here: http://thomas.loc.gov/cgi-bin/query/z?r113:S11MR3-0039:/ I don't see any immediately obvious way to get this URL directly from the Amendment # though |
AFAICT, the format of the identifier is: z?r[congress]:[chamber][day][month][year]-[item ID]: Where [chamber] is "H" or "S" (or "E" for extension of remarks or "D" for digest), [day] is the 2-digit day of the month, [month] is a 2-char representation of the month name, [year] is the significant digit of the year ("3" for 2013), and [item ID] is the 4-digit number of the item on the day's register. For example, SA27 is the 44th item on the Senate register for March 12, 2013 in the 113th Congress: http://thomas.loc.gov/cgi-bin/query/B?r113:@FIELD%28FLD003+s%29+@FIELD%28DDATE+20130312%29 |
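The identifier format described above can be sketched as a small parser. This is only an illustration: `parse_cr_item` and `CRItem` are hypothetical names, the month is left as the raw 2-character code because only "MR" (March) is confirmed in this thread, and the full year is inferred from the fact that the Nth Congress covers the years 1787 + 2N and 1788 + 2N.

```python
import re
from typing import NamedTuple


class CRItem(NamedTuple):
    """Fields of a THOMAS CR item identifier (hypothetical container)."""
    congress: int
    chamber: str      # "H", "S", "E" (extension of remarks) or "D" (digest)
    day: int
    month_code: str   # raw 2-char code, e.g. "MR"; not decoded here
    year: int
    item: int


# r[congress]:[chamber][day][month][year]-[item ID]
_ITEM_RE = re.compile(
    r"r(?P<congress>\d+):(?P<chamber>[HSED])(?P<day>\d{2})"
    r"(?P<month>[A-Z]{2})(?P<year>\d)-(?P<item>\d+)"
)


def parse_cr_item(url_or_id: str) -> CRItem:
    """Parse an identifier such as 'r113:S12MR3-0044' (or a URL containing it)."""
    m = _ITEM_RE.search(url_or_id)
    if m is None:
        raise ValueError("no CR item identifier found in %r" % url_or_id)
    congress = int(m.group("congress"))
    digit = int(m.group("year"))
    # The Nth Congress spans two calendar years; pick the one whose
    # last digit matches the single year digit in the identifier.
    first_year = 1787 + 2 * congress
    for year in (first_year, first_year + 1):
        if year % 10 == digit:
            break
    else:
        raise ValueError("year digit %d does not fit Congress %d" % (digit, congress))
    return CRItem(congress, m.group("chamber"), int(m.group("day")),
                  m.group("month"), year, int(m.group("item")))
```

Run against the SA27 URL quoted earlier in the thread, this should yield the 44th Senate item for day 12, month code "MR", 2013, in the 113th Congress.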
Thx, Gordon! Just pushed a branch called amendment_text. I'm having some encoding nightmares, so there's a sloppy catch in the utils unescape function right now with a URL in the comments that will break the original. To try it, just pass --fulltext to the amendments call (see README in branch). Needs some work. |
I reworked it to just pull the text file of amendments from each day's CR. e.g. Much easier to download this file and regex it than crawl through THOMAS |
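A rough sketch of the "download the file and regex it" idea, for illustration only. It assumes each amendment in the daily text starts on its own line with a marker like "SA 26. " (the layout of the real CR file may differ), and `split_amendments` and `SAMPLE` are made-up names with made-up sample text, not the code in the branch.

```python
import re

# Assumption: each Senate amendment begins a line with "SA <number>. ".
AMDT_START = re.compile(r"^[ \t]*SA (\d+)\. ", re.MULTILINE)


def split_amendments(cr_text):
    """Return a dict mapping 'SA <number>' to that amendment's block of text."""
    matches = list(AMDT_START.finditer(cr_text))
    blocks = {}
    for i, m in enumerate(matches):
        # Each block runs until the next amendment marker (or end of file).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(cr_text)
        blocks["SA " + m.group(1)] = cr_text[m.start():end].strip()
    return blocks


# Invented sample text, only to show the shape of the input.
SAMPLE = (
    "TEXT OF AMENDMENTS\n\n"
    "  SA 26. Mrs. EXAMPLE submitted an amendment; as follows:\n"
    "  On page 4, line 6, decrease the amount by $20,000,000,000.\n\n"
    "  SA 27. Mr. SAMPLE submitted an amendment to amendment SA 26; as follows:\n"
    "  Strike section 2.\n"
)

amendments = split_amendments(SAMPLE)
```

Mentions of another amendment in the middle of a line (like "amendment SA 26" above) are not treated as a new block, because the pattern only matches at the start of a line.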
That's exactly the approach Capitol Words takes. CW doesn't do anything special to parse out amendment text right now, but you still might look at its parser for lessons and ideas. |
In fact, I really should loop in @bycoffe too, since I believe he wrote the original version of that parser. |
Awesome, will do. Has anyone tried to parse the amendment text and connect to the original legislation? e.g., for "On page 4, line 6, decrease the amount by $20,000,000,000." to connect to whatever's on page 4, line 6 of the actual legislation? |
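As a minimal sketch of the first step, here is how the instruction quoted above could be turned into a structured page/line reference. The function name and the pattern are hypothetical, the "increase" case is assumed by symmetry with "decrease", and real amendments use many more phrasings than this one.

```python
import re

# Matches only the one phrasing quoted in the thread (plus "increase",
# which is an assumption).
INSTRUCTION = re.compile(
    r"On page (\d+), line (\d+), (increase|decrease) the amount by \$([\d,]+)"
)


def parse_instruction(text):
    """Return page, line and signed dollar change, or None if no match."""
    m = INSTRUCTION.search(text)
    if m is None:
        return None
    page, line, verb, amount = m.groups()
    dollars = int(amount.replace(",", ""))
    return {
        "page": int(page),
        "line": int(line),
        "delta": -dollars if verb == "decrease" else dollars,
    }
```

The page and line values could then be used to look up the matching spot in the bill text, which is the harder part of the problem.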
As far as I know, no one has tried either that or anything in the bigger picture of parsing how things change other things. |
People sure do talk about it a lot, though! Everyone wants this. Are you a bad enough dude to connect amendments to legislation through page and line numbers? |
Any suggestions for how to integrate the fdsys routines into the amendment task, thus grabbing the unadulterated text of the amendment directly from GPO? I have the script to download the text file from GPO and parse it and place it in the right place, but right now it takes manual URLs from GPO (like the one a few messages back). Not sure how to go from an amendment (say, SA 136) and find it in the Congressional Record. FYI, here's a crude demo of matching amendments to legislation. I can upload the guts of it once I can get amendment text working right. |
...whoa. Even this "crude demo" is more interesting than anything I've seen yet in the US for matching up amendments. You can see an example of me using fdsys.py as a support library (instead of a standalone task) in bill_versions.py, which downloads bill text and version metadata from GPO and outputs JSON files for each version of each bill: I actually haven't integrated bill_versions output into my production systems yet (still using an older Ruby-based GPO sync script), but I plan to. |
Awesome, thank you! On Mon, Mar 25, 2013 at 4:13 PM, Eric Mill notifications@github.com wrote:
christopher.e.wilson@gmail.com |
The "amendment_text" branch now has a "parse" command that attempts to interpret the text of the amendments (which are now retrieved from GPO). It's smart enough to figure out how to pinpoint "Title III" to a page and line if the amendment doesn't specify the exact location, but it still has a long way to go. See README. |
I just want to pop in to say that this is pretty awesome work. |
Thanks! Will take a lot more. All help much appreciated. |
Hey @wilson428, could I summon you to do a quick summary of the State of Amendment Text? I'm putting the finishing touches on getting a What's the remaining path to dropping .txt files, even unformatted blob-of-words style, for each amendment? |
Awesome! This stalled over issues of where the raw text of amendments is stored in the CR, particularly in the House. I was following links from THOMAS to the CR to get the full text, but the agreement seemed to be that I should be using the fdsys scripts instead. Then it turned out that House amendments in the CR are buried in the text of the daily activity. There was some sense that the text could be retrieved from a committee, I think? I would love to dive back in, though I'd prefer to work on a forward-looking solution that doesn't die with THOMAS's end-of-life. Suggestions? |
If the text appears in the CR for both chambers, that's probably the most reliable path, even if the parsing is particularly messy. There's a lot of value even if the text is extracted in a non-readable, but indexable, form. I don't think any committee would have floor amendments, although committee reports (which are released after huge delays) might have committee amendments in them sometimes. Also possible that the House Committee Repository might grow to include them for the House some day. |
Gotcha. I'll dive back in this week and give a more detailed report. Thanks for the nudge. |
And to be clear, I'm happy to work on this, either primarily or secondarily, once I emerge from my current unrelated work blitz. |
I saw Gordon and Josh discussing this on the govtrack side of things, but am I right that this does not currently fetch amendment text from the Congressional Record? If so, I can take a stab. Did you ever convert your Perl to Python, @JoshData?