Image for congressman #160
Comments
You can generally get this for current or recent members if you know the bioguide ID. If you look on the bioguide page for Mo Cowan, for instance, you'll see his photo at http://bioguide.congress.gov/bioguide/photo/C/C001099.jpg. In whatever language you use, you can then construct the URL from something like the snippet below.
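A minimal sketch in Python, assuming only the URL pattern visible in the Cowan example above (the first letter of the Bioguide ID is a subdirectory):

```python
# Sketch: build a bioguide photo URL from a Bioguide ID.
# Pattern taken from the Cowan example above.
def bioguide_photo_url(bioguide_id):
    return "http://bioguide.congress.gov/bioguide/photo/%s/%s.jpg" % (
        bioguide_id[0], bioguide_id)

print(bioguide_photo_url("C001099"))
# -> http://bioguide.congress.gov/bioguide/photo/C/C001099.jpg
```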
There is not 100% coverage for current members, but it's a good start
Sunlight also offers a set of MoC photos, named by Bioguide ID, for download as a zip file. We normalize them into a bunch of different sizes, with the largest being 250x200. Even though this project doesn't actually host the MoC photos, the little shell script Sunlight uses to do the resize work is in it here, and you could adapt it to your needs. Either way, you'd want to put the result into S3 or something.
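As a rough illustration of the same normalization idea in Python with Pillow (the real script is shell; the size labels and smaller dimensions here are made up, only the 250x200 maximum comes from the comment above):

```python
# Sketch: normalize one member photo into a few sizes, largest 250x200.
# Requires Pillow (pip install Pillow); sizes and filenames are illustrative only.
import os
from PIL import Image

SIZES = {"large": (250, 200), "medium": (125, 100), "small": (50, 40)}

def make_sizes(src_path, bioguide_id, out_dir="images"):
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    original = Image.open(src_path)
    for label, size in SIZES.items():
        copy = original.copy()
        copy.thumbnail(size)  # shrinks in place, preserving aspect ratio
        copy.save(os.path.join(out_dir, "%s-%s.jpg" % (bioguide_id, label)))
```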
Also, we get those photos from the Congressional Pictorial Directory, published by GPO, so they may not be the same as the ones on bioguide.congress.gov.
Also bulk data from GovTrack: https://www.govtrack.us/developers/data I could probably add a has_photo field to the GovTrack API....
Awesome. Thanks guys!
We had a terrific thread over at propublica/sunlight-congress#432 (comment) on this, and (after @mwweinberg picked up the phone and called the GPO), the resolution was to make a scraper for the GPO's Member Guide, and then offer the photos for download. I'm updating this ticket's description to reflect this. Does anyone have any objection to adding the images to this repository, or should they go somewhere else? I think it's convenient to have them here, since it's in scope for the repo. For reference, 812 The versions on the Member Guide are
One additional idea: we could potentially store the photos in a
There are high-res images over there, I think. I don't know whether we really need to store them in a repo (vs just a scraper), but if we do I'd strongly prefer a separate repo for it.
I think it might be a neat, low-maintenance thing to version the images. But a separate repo is fine, and makes experimentation easier. Do you know how to get the high-res images?
They were hi-res in the Wikipedia links, and the DOM inspector seemed to indicate they were bigger than on the site as displayed, but I didn't get ANY image when hitting the image URL, so that's as far as I got.
OK, so I did: `wget "http://memberguide.gpo.gov/ReadLibraryItem.ashx?SFN=iqw/hTCdweheEMFH1iwn0bt5yckfRo6E2eA2JdiV4F5SafjBF0U12w==&I=1MKI2SYWd4A="` and that got me a
A quick search to see if anyone had written a scraper for this before found nothing, but it did turn up this PDF of "Grading the Government's Data Publication Practices" by Jim Harper, which you may find interesting (although it may already be familiar to you).

"As noted above, the other ways of learning about House and Senate membership are ad hoc. The Government Printing Office has a "Guide to House and Senate Members" at http://memberguide.gpo.gov/ that duplicates information found elsewhere. The House website presents a list of members along with district information, party affiliation, and so on, in HTML format (http://www.house.gov/representatives/), and beta.congress.gov does as well (http://beta.congress.gov/members/). Someone who wants a complete dataset must collect data from these sources using a computer program to scrape the data and through manual curation. The HTML presentations do not break out key information in ways useful for computers. The Senate membership page, on the other hand, includes a link to an XML representation that is machine readable. That is the reason why the Senate scores so well compared to the House." http://www.cato.org/pubs/pas/PA711.pdf

http://beta.congress.gov/members is nicely scrapable (I wonder if they have an API), but then some images are missing, and we are back to wondering about the copyright. The mobile memberguide site is very scrapable, but the images are hosted on m.gpo.gov and are only a lo-res image and a lower-res thumbnail. But if wget works on memberguide.gpo.gov then that is a good start. As it happens, Wikipedia uses the same image of Vance McAllister, and their original file is also 589 × 719.
"(I wonder if they have an API)" Welcome to the world of legislative data. A fantastic and frustrating world awaits. :) Thanks for doing the research on getting the images, btw. |
That is solid research. I think a new scraper, for the normal (non-mobile) member guide is what's called for, to get maximum size and the greatest guarantee of public domain. |
The normal member guide is defaulting to the 112th congress. It can be downloaded with wget
I think we should be able to get the table with POST commands. For example, to select the 113th congress with wget (see the sketch after this comment):
But I've not got it working yet. Anyway, that page has links to each member's own page, with the photo. Some other options:
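A rough sketch of that POST idea, done with Python's requests rather than wget: the ASP.NET hidden fields get echoed back, and the dropdown field name below is only a guess, so inspect the real form before relying on it.

```python
# Sketch: POST back to the GPO member guide search page to select a Congress.
# The dropdown field name is hypothetical -- check it with a DOM inspector.
import requests
from bs4 import BeautifulSoup

URL = "http://www.memberguide.gpoaccess.gov/GetMembersSearch.aspx"

session = requests.Session()
page = session.get(URL)                      # initial GET: cookies + hidden fields
soup = BeautifulSoup(page.text, "html.parser")

# Echo the ASP.NET hidden fields (__VIEWSTATE, __EVENTVALIDATION, ...) back.
form = {field["name"]: field.get("value", "")
        for field in soup.find_all("input", type="hidden") if field.get("name")}
form["ctl00$ddlCongressSession"] = "113"     # hypothetical name of the dropdown

result = session.post(URL, data=form)
print(result.status_code)
```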
I think a
The couple of member pages I checked handily have a link to a bio page, which contains the Bioguide ID in the URL, for example:
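As an illustration of pulling the ID out of such a link (Bioguide IDs are one capital letter followed by six digits; the sample URL is just illustrative, not necessarily the one quoted above):

```python
# Sketch: extract a Bioguide ID (one letter + six digits) from a bio-page link.
import re

def extract_bioguide_id(href):
    match = re.search(r"\b([A-Z]\d{6})\b", href)
    return match.group(1) if match else None

print(extract_bioguide_id(
    "http://bioguide.congress.gov/scripts/biodisplay.pl?index=C001099"))
# -> C001099
```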
Hmm, I see that, though when I checked the page for a recent member, Vance McAllister, it didn't have one. So I think it's probably better to resolve using congress-legislators data, using the last name, state, and chamber for disambiguation where needed. Since we only need to match against legislators who served a term in that particular Congress, that should be pretty doable.
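A minimal sketch of that resolution step, assuming the congress-legislators YAML has already been loaded into a list of dicts (a loading sketch appears further down the thread); matching is on last name, state, and chamber, and anything ambiguous is left for manual review:

```python
# Sketch: resolve a scraped member to a Bioguide ID via congress-legislators
# data, using last name + state + chamber for disambiguation.
def resolve_bioguide(legislators, last_name, state, chamber):
    wanted_type = "sen" if chamber.lower().startswith("sen") else "rep"
    matches = []
    for legislator in legislators:
        term = legislator["terms"][-1]          # most recent term
        if (legislator["name"]["last"].lower() == last_name.lower()
                and term["state"] == state
                and term["type"] == wanted_type):
            matches.append(legislator)
    if len(matches) == 1:
        return matches[0]["id"]["bioguide"]
    return None                                 # not found or ambiguous
```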
It's probably just because McAllister is new. He also has no photo. Don't throw the baby (bioguide IDs) out with the bathwater (McAllister)!
There's some strange monkey business in their code, I think. For a while I wasn't getting reliable images even when using the same URL -- sometimes it produced photos of different legislators than the one I thought I was selecting, sometimes different resolutions, sometimes placeholder images. I suspect they're doing something stupid with session variables. This is pretty easy to verify given that a bare curl of the image src generally doesn't return the right photo. FWIW a working curl invocation (taken from Chrome's network tab) is below, and there isn't too much to it. My testing makes me think the referer is probably irrelevant, but I'm not 100% sure. I suspect you are going to have to establish the session cookie, though, and perhaps grab each legislator page's HTML before attempting to grab the image. I could be wrong about this, but something weird seems to be going on.
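A minimal sketch of that "establish the session first" approach using Python's requests (the URLs are placeholders for whatever the scraper finds on the guide; the key assumption, per the comment above, is that the member page has to be visited in the same session before the image URL serves the right photo):

```python
# Sketch: load the member page first to establish the GPO session,
# then fetch the photo with the same session cookie.
import requests

def fetch_member_photo(member_page_url, image_src):
    session = requests.Session()          # keeps the session cookie between requests
    session.get(member_page_url)          # visit the member page first
    response = session.get(image_src)     # then request the image in the same session
    response.raise_for_status()
    return response.content               # raw JPEG bytes
```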
Hmm, @JoshData, McAllister had a photo yesterday; his was the example I used. And I strongly suspect it has to do with what @sbma44 is isolating. But I wonder if it's time-based rather than session-based? Because yesterday, I ran:
And it gave me a photo for McAllister. Now, that exact command downloads a photo that says "No photo". I don't see why. This is malarkey!
I ran the exact same command yesterday and got the photo. Today, no photo. They're using cookies and sessions. Must be something in that URL. I think we need to go in through the front door and proceed from there.
Yeah, but a straight
and now the McAllister image works. Now I'm wondering if you can reliably
Maybe just a
By all means, see if you can get that working -- if it works, that'd be the easiest method.
Wow. This is one of the wackiest web scraping situations I've seen. There's going to be an opportunity to bug GPO about things in a few weeks. One of us can bring it up, or @hugovk if you're in the DC area we can let you know how to come bug GPO in person.
@JoshData Thanks, but I'm in the Helsinki area :)
I've discovered Python's mechanize. These were useful:
I've made a first proto version here:
Still to do:
(Side note: after lots of testing, the http://www.memberguide.gpoaccess.gov/GetMembersSearch.aspx page has been showing blank for me, in a browser and in code. A bit later it worked again, but now it's blank again. Perhaps there's some anti-scrape IP filtering. This may or may not be a problem in normal use, but perhaps some [random] delays will help.)
Aweeesssoommme. A few thoughts:
Yes, some rate limiting would probably keep the IP blocking in check.
And since the script is running inside the congress-legislators repo, the easiest thing is to use the
There are also some
You may want to add some caching logic, so that if the page has already been downloaded, it doesn't need to fetch it again. There's a
For now, it should output legislator images to a non-versioned directory -- I can handle making a new repo and moving it there (and migrating some
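A minimal sketch of the caching-plus-rate-limiting idea (the cache location and delay values are arbitrary choices for illustration, not project conventions):

```python
# Sketch: cache downloaded pages on disk and add a small random delay
# between real fetches, to avoid hammering memberguide.gpo.gov.
import os
import time
import random
import requests

CACHE_DIR = "cache"

def cached_get(url, cache_name):
    path = os.path.join(CACHE_DIR, cache_name)
    if os.path.exists(path):              # already downloaded: reuse it
        with open(path, "rb") as f:
            return f.read()
    time.sleep(random.uniform(1.0, 3.0))  # polite, slightly randomized delay
    body = requests.get(url).content
    if not os.path.isdir(CACHE_DIR):
        os.makedirs(CACHE_DIR)
    with open(path, "wb") as f:
        f.write(body)
    return body
```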
I've just added what I'd done before your comment:
This loads the YAML into an array of dicts rather than using the CSV. You're right, it's much easier that way. I added some
If the Bioguide ID isn't found in the member page, it's resolved against the YAML data.
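For reference, loading the YAML into a list of dicts is only a couple of lines with PyYAML (field names follow the congress-legislators schema):

```python
# Sketch: load legislators-current.yaml and index it by Bioguide ID.
import yaml  # PyYAML

with open("legislators-current.yaml") as f:
    legislators = yaml.safe_load(f)       # a list of dicts, one per member

by_bioguide = {leg["id"]["bioguide"]: leg for leg in legislators}
print(len(legislators), "current members loaded")
```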
It didn't resolve Bioguide IDs for four people.
The GPO data should be fixed (how to report?), but should we add a final resolution case for switched names? These three aren't in the YAML:
Chiesa left in 2013, Radel left in 2014, and Young died in 2013, so all have been removed from legislators-current.yaml. I've just spotted legislators-historical.yaml. We could use this, but there'll be more risk of matching the wrong person. I suppose some year matching could be implemented, plus reverse sorting the YAML list of dicts.
test_gpo_member_photos.py uses legislators-test.yaml, a subset of legislators-current.yaml, to unit test things like Bioguide ID matching and validation and Bioguide resolution. Run it like
TODO: Add caching of downloaded pages, rate limiting.
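A minimal sketch of that year-matching idea against legislators-historical.yaml: only accept a candidate whose terms overlap the Congress being scraped (the Congress-to-year table below covers just the two sessions mentioned in this thread):

```python
# Sketch: reduce wrong matches in historical data by requiring that the
# candidate actually served during the Congress being scraped.
CONGRESS_YEARS = {112: (2011, 2013), 113: (2013, 2015)}  # first/last year

def served_in_congress(legislator, congress):
    start_year, end_year = CONGRESS_YEARS[congress]
    for term in legislator["terms"]:
        term_start = int(term["start"][:4])   # terms carry ISO dates
        term_end = int(term["end"][:4])
        if term_start < end_year and term_end >= start_year:
            return True
    return False
```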
It's okay to add some hard coding for a tiny handful, or for it to miss
OK, I've added a hardcoding for BB. I've also added:
Very cool. Want to submit this as a PR to this project, since you've got it in your fork? I can migrate it to a new repo and give you write access from there.
I've submitted PR #167. If there's any other useful data in the member guide, this code could be easily adapted to scrape it. Python's mechanize and BeautifulSoup are very useful!
Closed by #167.
Use the GPO's Member Guide to fetch images of each member of Congress, and store the results here.
Reference discussion on this approach, and copyright issues: propublica/sunlight-congress#432
Original ticket:
Hi,
It's extremely difficult to pull an image from a reliable source. Additionally, after grabbing a congressman, I would need to make an additional call to search for a profile image based on the representatives returned. It would be great if this were a field in the JSON response, perhaps a URL that we can go out and grab the image from!
Keep up the good work!