101 real world web scraping exercises in Python 3 for data journalists
Latest commit 9543e05 Oct 5, 2015


Search-Script-Scrape: 101 webscraping and research tasks for the data journalist

Note: This exercise set is part of the Stanford Computational Journalism Lab. I've also written a blog post that gives a little more elaboration about the libraries used and a few of the exercises.


This repository contains 101 Web data-collection tasks in Python 3 that I assigned to my Computational Journalism class in Spring 2015 to give them regular exercise in programming and conducting research, and to expose them to the variety of data published online.

The hard part of many of these tasks is researching and finding the actual data source. Each script need only concern itself with fetching the data and printing the answer in the least painful way possible. Since the Computational Journalism class wasn't intended to be an actual programming class, adherence to idioms and best coding practices was not emphasized...(especially since I'm new to Python myself!)

Some examples of the tasks:

Repo status

The table below links to the available scripts. If there's no link, it means I haven't committed the code. For some of them, I had to rethink my approach to find a less verbose solution (or the target changed, as the Internet sometimes does), and now this repo has taken a backseat to many other data projects on my list. ¯\_(ツ)_/¯

Note: A lot of the code is not best practice. The tasks are a little repetitive so I got bored and ignored PEP8 and/or tried new libraries/conventions for fun.

Note: The "related URL" links to either the official source of the data, or at least a page with some background information. The second column of this table refers to the line count of the script, not the answer to the prompt.

The tasks

The repo currently contains scripts for 100 of 101 tasks:

Title Line count
1. Number of datasets currently listed on data.gov
[related URL] [script]
7 lines
2. The name of the most recently added dataset on data.gov
[related URL] [script]
7 lines
3. The number of people who visited a U.S. government website using Internet Explorer 6.0 in the last 90 days
[related URL] [script]
4 lines
4. The number of librarian-related job positions that the federal government is currently hiring for
[related URL] [script]
6 lines
5. The name of the company cited in the most recent consumer complaint involving student loans
[related URL] [script]
27 lines
6. From 2010 to 2013, the change in median cost of health, dental, and vision coverage for California city employees
[related URL] [script]
38 lines
7. The number of listed federal executive agency internet domains
[related URL] [script]
8 lines
8. The number of times when a New York heart surgeon's rate of patient deaths for all cardiac surgical procedures was "significantly higher" than the statewide rate, according to New York state's analysis.
[related URL] [script]
7 lines
9. The number of roll call votes that were rejected by a margin of fewer than 5 votes, in the first session of the U.S. Senate in the 114th Congress
[related URL] [script]
26 lines
10. The title of the highest paid California city government position in 2010
[related URL] [script]
35 lines
11. How much did the state of California collect in property taxes, according to the U.S. Census 2013 Annual Survey of State Government Tax Collections?
[related URL] [script]
23 lines
12. In 2010, the year-over-year change in enplanements at America's busiest airport
[related URL] [script]
51 lines
13. The number of armored carrier bank robberies recorded by the FBI in 2014
[related URL] [script]
15 lines
14. The number of workplace fatalities reported to federal and state OSHA in the latest fiscal year
[related URL] [script]
14 lines
15. Total number of wildlife strike incidents reported at San Francisco International Airport
[related URL] [script]
48 lines
16. The non-profit organization with the highest total revenue, according to the latest listing in ProPublica's Nonprofit Explorer
[related URL] [script]
11 lines
17. In the "Justice News" RSS feed maintained by the Justice Department, the number of items published on a Friday
[related URL] [script]
11 lines
18. The number of U.S. congressmembers who have Twitter accounts, according to Sunlight Foundation data
[related URL] [script]
9 lines
19. The total number of preliminary reports on aircraft safety incidents/accidents in the last 10 business days
[related URL] [script]
12 lines
20. The number of OSHA enforcement inspections involving Wal-Mart in California since 2014
[related URL] [script]
25 lines
21. The current humidity level at Great Smoky Mountains National Park
[related URL] [script]
6 lines
22. The names of the committees that Sen. Barbara Boxer currently serves on
[related URL] [script]
7 lines
23. The name of the California school with the highest number of girls enrolled in kindergarten, according to the CA Dept. of Education's latest enrollment data file.
[related URL] [script]
21 lines
24. Percentage of NYPD stop-and-frisk reports in which the suspect was white in 2014
[related URL] [script]
24 lines
25. Average frontal crash star rating for 2015 Honda Accords
[related URL] [script]
14 lines
26. The dropout rate for all of Santa Clara County high schools, according to the latest cohort data in CALPADS
[related URL] [script]
48 lines
27. The number of Class I Drug Recalls issued by the U.S. Food and Drug Administration since 2012
[related URL] [script]
14 lines
28. Total number of clinical trials as recorded by the National Institutes of Health
[related URL] [script]
7 lines
29. Number of days until Texas's next scheduled execution
[related URL] [script]
24 lines
30. The total number of inmates executed by Florida since 1976
[related URL] [script]
10 lines
31. The number of proposed U.S. federal regulations in which comments are due within the next 3 days
[related URL] [script]
29 lines
32. Number of Titles that have changed in the United States Code since its last release point
[related URL] [script]
6 lines
33. The number of FDA-approved, but now discontinued drug products that contain Fentanyl as an active ingredient
[related URL] [script]
14 lines
34. In the latest FDA Weekly Enforcement Report, the number of Class I and Class II recalls involving food
[related URL] [script]
10 lines
35. Most viewed data set on New York state's open data portal as of this month
[related URL] [script]
9 lines
36. Total number of visitors to the White House in 2012
[related URL] [script]
27 lines
37. The last time the CIA's Leadership page has been updated
[related URL] [script]
6 lines
38. The domain of the most visited U.S. government website right now
[related URL] [script]
5 lines
39. Number of medical device recalls issued by the U.S. Food and Drug Administration in 2013
[related URL] [script]
6 lines
40. Number of FOIA requests made to the Chicago Public Library
[related URL] [script]
6 lines
41. The number of currently open medical trials involving alcohol-related disorders
[related URL] [script]
5 lines
42. The name of the Supreme Court justice who delivered the opinion in the most recently announced decision
[related URL] [script]
31 lines
43. The number of citations that resulted from FDA inspections in fiscal year 2012
[related URL] [script]
10 lines
44. Number of people visiting a U.S. government website right now
[related URL] [script]
6 lines
45. The number of security alerts issued by US-CERT in the current year
[related URL] [script]
6 lines
46. The number of Pinterest accounts maintained by U.S. State Department embassies and missions
[related URL] [script]
13 lines
47. The number of international travel alerts from the U.S. State Department currently in effect
[related URL] [script]
7 lines
48. The difference in total White House staffmember salaries in 2014 versus 2010
[related URL] [script]
19 lines
49. Number of sponsored bills by Rep. Nancy Pelosi that were vetoed by the President
[related URL] [script]
11 lines
50. In the most recently transcribed Supreme Court argument, the number of times laughter broke out
[related URL] [script]
22 lines
51. The title of the most recent decision handed down by the U.S. Supreme Court
[related URL] [script]
6 lines
52. The average wage of optometrists according to the BLS's most recent National Occupational Employment and Wage Estimates report
[related URL] [script]
8 lines
53. The total number of on-campus hate crimes as reported to the U.S. Office of Postsecondary Education, in the most recent collection year
[related URL] [script]
45 lines
54. The number of people on FBI's Most Wanted List for white collar crimes
[related URL] [script]
6 lines
55. The number of Government Accountability Office reports and testimonies on the topic of veterans
[related URL] [script]
10 lines
56. Number of times Rep. Darrell Issa's remarks have made it onto the Congressional Record
[related URL] [script]
9 lines
57. The top 3 auto manufacturers, ranked by total number of recalls via NHTSA safety-related defect and compliance campaigns since 1967.
[related URL] [script]
24 lines
58. The number of published research papers from the NSA
[related URL] [script]
6 lines
59. The number of university-related datasets currently listed at data.gov
[related URL] [script]
7 lines
60. Number of chapters in Title 20 (Education) of the United States Code
[related URL] [script]
15 lines
61. The number of miles traveled by the current U.S. Secretary of State
[related URL] [script]
6 lines
62. For all of 2013, the number of potential signals of serious risks or new safety information that resulted from the FDA's FAERS
[related URL] [script]
14 lines
63. In the current dataset behind Medicare's Nursing Home Compare website, the total amount of fines received by penalized nursing homes
[related URL] [script]
35 lines
64. From March 1 to 7, 2015, the number of times in which designated FDA policy makers met with persons outside the U.S. federal executive branch
[related URL] [script]
5 lines
65. The number of failed votes in the roll calls 1 through 99, in the U.S. House of the 114th Congress
[related URL] [script]
12 lines
66. The highest minimum wage as mandated by state law.
[related URL] [script]
28 lines
67. For the most recently posted TSA.gov customer satisfaction survey, post the percentage of respondents who rated their "overall experience today" as "Excellent"
[related URL]
68. Number of FDA-approved prescription drugs with GlaxoSmithKline as the applicant holder
[related URL] [script]
11 lines
69. The average number of comments on the last 50 posts on NASA's official Instagram account
[related URL] [script]
40 lines
70. The highest salary possible for a White House staffmember in 2014
[related URL] [script]
10 lines
71. The percent increase in number of babies named Archer nationwide in 2010 compared to 2000, according to the Social Security Administration
[related URL] [script]
32 lines
72. The number of magnitude 4.5+ earthquakes detected worldwide by the USGS
[related URL] [script]
8 lines
73. The total amount of contributions made by lobbyists to Congress according to the latest downloadable quarterly report
[related URL] [script]
34 lines
74. The description of the bill most recently signed into law by the governor of Georgia
[related URL] [script]
12 lines
75. Total number of officer-involved shooting incidents listed by the Philadelphia Police Department
[related URL] [script]
9 lines
76. The total number of publications produced by the U.S. Government Accountability Office
[related URL] [script]
9 lines
77. Number of Dallas officer-involved fatal shooting incidents in 2014
[related URL] [script]
7 lines
78. Number of Cupertino, CA restaurants that have been shut down due to health violations in the last six months.
[related URL] [script]
6 lines
79. The change in total airline revenues from baggage fees, from 2013 to 2014
[related URL] [script]
19 lines
80. The total number of babies named Odin born in Colorado according to the Social Security Administration
[related URL] [script]
20 lines
81. The latest release date for T-100 Domestic Market (U.S. Carriers) statistics report
[related URL] [script]
13 lines
82. In the most recent FDA Adverse Events Reports quarterly extract, the number of patient reactions mentioning "Death"
[related URL] [script]
47 lines
83. The sum of White House staffmember salaries in 2014
[related URL] [script]
12 lines
84. The total number of notices published on the most recent date to the Federal Register
[related URL] [script]
6 lines
85. The number of iPhone units sold in the latest quarter, according to Apple Inc's most recent 10-Q report
[related URL] [script]
49 lines
86. Number of computer vulnerabilities in which IBM was the vendor in the latest Cyber Security Bulletin
[related URL] [script]
10 lines
87. Number of airports with existing construction related activity
[related URL] [script]
6 lines
88. The number of posts on TSA's Instagram account
[related URL] [script]
24 lines
89. In fiscal year 2013, the short description of the most frequently cited type of FDA's inspectional observations related to food products.
[related URL] [script]
32 lines
90. The currently serving U.S. congressmember with the most Twitter followers
[related URL] [script]
76 lines
91. Number of stop-and-frisk reports from the NYPD in 2014
[related URL] [script]
22 lines
92. In 2012-Q4, the total amount paid by Rep. Aaron Schock to Lobair LLC, according to Congressional spending records, as compiled by the Sunlight Foundation
[related URL] [script]
14 lines
93. Number of GitHub repositories maintained by the GSA's 18F organization, as listed on GitHub.com
[related URL] [script]
5 lines
94. The New York City high school with the highest average math score in the latest SAT results
[related URL] [script]
96 lines
95. Since 2002, the most commonly occurring winning number in New York's Lottery Mega Millions
[related URL] [script]
9 lines
96. The number of scheduled arguments according to the most recent U.S. Supreme Court argument calendar
[related URL] [script]
11 lines
97. The New York school with the highest rate of religious exemptions to vaccinations
[related URL] [script]
10 lines
98. The latest estimated population percent change for Detroit, MI, according to the latest Census QuickFacts summary.
[related URL] [script]
8 lines
99. According to the Medill National Security Zone, the number of chambered guns confiscated at airports by the TSA
[related URL] [script]
11 lines
100. The California city whose city manager earned the highest total wages per resident of the city in 2012
[related URL] [script]
23 lines
101. The number of women currently serving in the U.S. Congress, according to Sunlight Foundation data
[related URL] [script]
8 lines

How to run this stuff

Each task is meant to be a self-contained script: you run it, and it prints the answer I'm looking for. The scripts in this repo should "just work"...if you have all the dependencies installed that I had while writing them, and the web URLs they target haven't changed...so, basically, these may not work at all.

To copy the scripts quickly via the command line (by default, a ./search-script-scrape directory will be created):

$ git clone https://github.com/compjour/search-script-scrape.git

To run a script:

$ cd search-script-scrape
$ python3 scripts/1.py

I leave it to you and Google to figure out how to run Python 3 on your own system. FWIW, I was using the Python 3.4.3 provided by the Anaconda 2.2.0 installer for OS X. The most common third-party libraries used are Requests for downloading the files and lxml for HTML parsing.
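
For a feel of the fetch-parse-print shape that most of these tasks share, here is a rough sketch. To keep it runnable without installing anything, it swaps in the standard library's urllib.request and html.parser as stand-ins for Requests and lxml, so it is not how the repo's scripts are written. The names `LinkCounter`, `count_links`, and `fetch`, and the choice to count links, are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkCounter(HTMLParser):
    """Counts <a> tags that carry an href attribute."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.count += 1


def count_links(html_text):
    """Parse an HTML string and return the number of links in it."""
    parser = LinkCounter()
    parser.feed(html_text)
    parser.close()
    return parser.count


def fetch(url):
    """Download a page and decode it as text (UTF-8 is an assumption)."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")
```

With a network connection, `print(count_links(fetch(url)))` for some page `url` completes the run-it-and-print-the-answer shape. That call is left out of the block so the parsing half can be tried offline on any HTML string.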

Expanding on these scripts

To reiterate: each of these scripts is meant to print out a single answer, and so they don't actually show the full potential of how programming can automate data collection. As you get better at programming and recognizing its patterns, you'll find out how easy it is to abstract what seemed like a narrow task into something much bigger.

For example, Script #50 prints out the number of times laughter broke out in the most recently transcribed Supreme Court argument. Change two lines and that script will print out the laugh count in every transcribed Supreme Court argument: (demo here)
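
The general idea of going from "one transcript" to "every transcript" can be sketched as below. This is not the code from Script #50. It assumes each transcript has already been converted to plain text and that laughter is marked with the stage direction "(Laughter.)". The function names and the dict-of-transcripts input are hypothetical.

```python
import re

# Assumption: laughter appears in the transcript text as "(Laughter.)"
LAUGHTER = re.compile(r"\(Laughter\.?\)", re.IGNORECASE)


def count_laughter(transcript_text):
    """Number of laughter markers in a single transcript."""
    return len(LAUGHTER.findall(transcript_text))


def laughs_per_argument(transcripts):
    """Laugh count for every transcript.

    `transcripts` maps a label (e.g. a case name) to that argument's text.
    """
    return {label: count_laughter(text) for label, text in transcripts.items()}
```

The single-transcript function stays untouched; the only new work is the loop around it, which is why the change to the original script can be so small.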

The same kind of small code restructuring can be done to many of the tasks here. And you can also modify the parameters; why limit yourself to finding the highest paid "City Manager" in California when you can extend the search to every kind of California employee, across every year of salary data? (demo here)
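
One way to picture that kind of parameter change is to turn the hard-coded job title and year into function arguments. This is only a sketch: it assumes the salary data has been loaded into a list of dicts, and the keys `title`, `year`, and `total_wages` are hypothetical stand-ins for whatever columns the real data file uses.

```python
def highest_paid(rows, title=None, year=None):
    """Return the row with the largest total wages, or None if nothing matches.

    Leaving `title` or `year` as None means "don't filter on it".
    """
    matches = [
        row for row in rows
        if (title is None or row["title"] == title)
        and (year is None or row["year"] == year)
    ]
    return max(matches, key=lambda row: row["total_wages"], default=None)


def highest_paid_by_group(rows):
    """Highest-paid row for every (title, year) combination in the data."""
    best = {}
    for row in rows:
        key = (row["title"], row["year"])
        if key not in best or row["total_wages"] > best[key]["total_wages"]:
            best[key] = row
    return best
```

Calling `highest_paid(rows, title="City Manager", year=2012)` answers the narrow question, while `highest_paid_by_group(rows)` answers it for every position and year at once.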

And of course, in real-world data projects, you aren't typically interested in just printing the answer to your Terminal. You generally want to send the results to a spreadsheet and eventually to a web application (or other kind of publication). That's just a few more lines of programming, too...So while this repo contains a bunch of toy scripts, see if you can think of ways to turn them into bigger data explorations.
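
As a sketch of those "few more lines", the standard library's csv module can write results to a file that any spreadsheet program will open. The function names, the file name, and the column names in the usage note are placeholders, not anything used in the repo.

```python
import csv


def write_results(rows, fieldnames, out):
    """Write a list of dicts as CSV to an already-open text file object."""
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


def save_csv(path, rows, fieldnames):
    """Open `path` for writing and dump the rows into it as CSV."""
    # newline="" lets the csv module control line endings itself
    with open(path, "w", newline="", encoding="utf-8") as f:
        write_results(rows, fieldnames, f)
```

For example, `save_csv("results.csv", rows, ["task", "answer"])` would replace a script's final `print()` call, given `rows` as a list of dicts with those two keys.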

Post-mortem

The original requirement was that students finish all 100 scripts by the end of the quarter. That didn't quite work out so I reduced the requirement to 50. It was a bad idea to make this an "oh, just turn it in at the end of the year" assignment, as most people have the tendency to wait for finals week to do such work.

Most of the tasks are pretty straightforward, in terms of the Python programming. The majority of the time is spent figuring out exactly what the hell I'm referring to, so next time I do this, I'll probably provide the URL of the target page rather than having people attempt to divine the Google Path I used to get to the data.