Data are revolutionizing all fields of science, including political science. Managing unstructured data (particularly text) is a non-trivial challenge for social scientists, especially at a large scale. An example is the .gov dataset curated by the Internet Archive (IA). The IA curates web crawls from 1996 to the present and has carved out a database of all .gov pages. These pages have been parsed so that it is possible to query (for example) just the .html text. The resulting 82 TB database (WARC format) is currently hosted pro bono by a private company (Altiscale), distributed across a dozen or so servers. Running a query via Hadoop takes about 2 days, so investigating research questions using Altiscale is very time-consuming (and beyond the technical ability of nearly all political scientists). We also hope to identify and circumvent key challenges that arise from the non-scientific research designs used for the web crawls and from the changing nature of content now posted on the web.
For the most recent update, please see our blog.
For the final project report, please see below.
For code, please see our GitHub repository.
- Collect a subset of .gov data and parse it out from Hadoop into a relational database (to provide data that can answer a set of diverse but concrete questions). Possible subsets include: climate change-related pages; history of the White House web domain; threat-related terms.
- Put data into an SQL database and use it to answer some of the proposed questions.
- Construct reliability and validity measures (e.g. link-tracing sampling)
- Develop and document a repeatable process for asking new questions over the data
- Create an interface for sharing data with other users (private and public)
- Apply resources for text analytics and visualization to data (perhaps D3?)
- Finish Threat Construction paper and submit for publication
Project lead (who will spend time in the Data Science Studio):
Faculty (who are trying to start a long-term collaboration based on this project) :
Our aim was to investigate the 82 terabytes of parsed .gov text data and evaluate its potential for political science research, including extracting some smaller datasets that might be a useful starting place for investigating the wealth of data hosted in the cluster. We were able to extract two datasets of interest: a word frequency dataset, and the full text of sites mentioning climate change-related terms around the election cycles between 2004 and 2012. The word frequency dataset contains terms, drawn from presidential decision directives from 1990 to the present, that presidents have identified as "threats" to the union. They include drug and crime-related terms, climate change-related terms, financial crisis-related terms, terrorism and weapons of mass destruction-related terms, and words pertaining to human rights violations. These data take the form of a panel dataset: year, month, URL root (whitehouse.gov, senate.gov, any of the major government departments and all fifty states), term, and count of the term, as well as an entry for "total words" for each year, month, and URL group. The second dataset is the full text of any page that had at least one mention of any of the following climate change-related terms: natural disaster, global warming, fresh water, forest conservation, food security, security of food, desertification, intergovernmental panel on climate change, climatic research unit, climategate, greenhouse gas, anthropogenic, anthropocene, ocean acidification, pollution, or climate change.
The data is hosted on a Hadoop cluster, which runs MapReduce (itself written in Java). We wrote scripts in Apache Pig and Apache Hive to access and process the data. In some cases, we called Python functions because writing user-defined functions (UDFs) is easier in Python than in Pig. However, including UDFs in Python significantly slows down processing. To construct the first dataset, we used regular expression searches in the URL field and in the content field to find matches of URL roots (e.g. whitehouse.gov) and key terms (e.g. terrorism, climate change). Once these matches were found, we used a Python UDF to count the terms by group and also to "split" content into unigrams and count all the unigrams (for the total word count of a given page). This gave us both term counts and total terms per page, and generated a new field indicating that the page was part of a given URL group of interest (e.g. whitehouse.gov). We then summed counts by URL group, month, year and term to get total counts of each term per group and time period. The result was output as a tab-delimited file; at only 8.2 MB, it can be emailed to other scholars or posted online as a possible resource for political science research.
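The counting and aggregation steps described above can be sketched in plain Python. This is a minimal illustration, not the project's actual Pig script or UDF: the record layout `(url, year, month, content)`, the short `URL_GROUPS` and `TERMS` lists, and the function names are all assumptions made for the example, and matching is done by simple substring and regular expression search.

```python
import re
from collections import Counter

# Hypothetical, shortened lists; the project used many more roots and terms.
URL_GROUPS = ["whitehouse.gov", "senate.gov"]
TERMS = ["terrorism", "climate change"]

TERM_RE = re.compile("|".join(re.escape(t) for t in TERMS), re.IGNORECASE)
WORD_RE = re.compile(r"\w+")  # crude unigram splitter


def url_group(url):
    """Return the first URL root contained in the URL, or None."""
    lowered = url.lower()
    for root in URL_GROUPS:
        if root in lowered:
            return root
    return None


def count_terms(records):
    """Sum term counts and total words by (group, year, month, term).

    Each record is assumed to be a (url, year, month, content) tuple.
    """
    counts = Counter()
    for url, year, month, content in records:
        group = url_group(url)
        if group is None:
            continue  # page is outside the URL groups of interest
        key = (group, year, month)
        counts[key + ("total words",)] += len(WORD_RE.findall(content))
        for match in TERM_RE.findall(content):
            counts[key + (match.lower(),)] += 1
    return counts


def to_tsv(counts):
    """Render the aggregated counts as tab-delimited lines."""
    rows = sorted(counts.items())
    return "\n".join("\t".join(str(f) for f in key + (n,)) for key, n in rows)
```

Calling `to_tsv(count_terms(records))` on a list of records yields one row per group, year, month and term, with the "total words" row available as a denominator for relative frequencies.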
The second dataset is the full text of any parsed text capture that mentioned at least one of the above terms. This involved the same regular expression search in the content field, but instead of counting total terms, we simply "flagged" the document and piped it into a new file. The flagged records include all the original fields of the files: for example, the page "title", the full URL, the date of the capture, any "tags" or keywords on the page, the page "code" (a number that indicates what type of page it is), and finally the text content of the page. This full text dataset was still 252 GB. We attempted to host it on an Amazon instance in a PostgreSQL database, which allowed SQL queries, and we added a GUI front end so that users could sort through the results via point and click instead of the complicated log-in and script process required to run jobs in Hadoop. However, this was still too slow, cumbersome and unfriendly to be accessible to the everyday political scientist, so we ended up going another route. The final product is a dataset covering just the election cycles: between Nov and Jan in 2004, 2008, 2010 and 2012. This limits the overall size of the dataset but still allows for a rigorous analysis.
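The flag-and-filter step can be sketched as follows. This is an illustration under stated assumptions rather than the project's code: the dictionary keys `content` and `date` are hypothetical field names, and the election window is assumed to mean November and December of the listed year plus January of the following year, which the text above does not spell out.

```python
import re
from datetime import date

# Climate change-related terms as listed in the report.
CLIMATE_TERMS = [
    "natural disaster", "global warming", "fresh water", "forest conservation",
    "food security", "security of food", "desertification",
    "intergovernmental panel on climate change", "climatic research unit",
    "climategate", "greenhouse gas", "anthropogenic", "anthropocene",
    "ocean acidification", "pollution", "climate change",
]
CLIMATE_RE = re.compile("|".join(re.escape(t) for t in CLIMATE_TERMS), re.IGNORECASE)

ELECTION_YEARS = {2004, 2008, 2010, 2012}


def mentions_climate(content):
    """Flag a page if any climate term appears anywhere in its text."""
    return CLIMATE_RE.search(content) is not None


def in_election_window(capture_date):
    """Assumed window: Nov-Dec of an election year, or the following January."""
    if capture_date.month in (11, 12):
        return capture_date.year in ELECTION_YEARS
    if capture_date.month == 1:
        return capture_date.year - 1 in ELECTION_YEARS
    return False


def select_pages(pages):
    """Keep flagged pages captured inside an election window.

    Each page is assumed to be a dict with 'content' and 'date' keys;
    all other original fields are passed through untouched.
    """
    return [
        page for page in pages
        if mentions_climate(page["content"]) and in_election_window(page["date"])
    ]
```

Because the whole record is passed through unchanged, the output keeps every original field, mirroring the "flag and pipe to a new file" approach described above.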
We were also able to determine that the only "complete crawls" of the .gov domain were done around election cycles, making the periods between Nov and Jan in 2004, 2008, 2010 and 2012 the "gold standard" for our project. A total count of terms and a total count of URLs from those periods should represent an "exhaustive" crawl of the .gov domain.
We have presented the data at the PoliInformatics Conference and will make the data public via the PInet website. Code is available on my GitHub site. We are working on two papers based on this data. The first is a methodological paper presenting the .gov data as a potential tool for social science research and describing the methods and hurdles involved. The second is an "agenda analysis" of threat construction in the United States.
In short, we achieved all the key goals that we set out to accomplish (albeit not in the ways we had originally envisioned). The incubator was a huge success for us, and we are very grateful to Andrew Whitaker, Bill Howe, and Dan Halperin for their time and support, and to the eScience Institute for presenting this opportunity.