Table Scraps: An Actionable Framework for Multi-Table Data Wrangling From An Artifact Study of Computational Journalism
To appear, IEEE TVCG (Proc. InfoVis 2020)
For the many journalists who use data and computation to report the news, data wrangling is an integral part of their work. Despite an abundance of literature on data wrangling in the context of enterprise data analysis, little is known about the specific operations, processes, and pain points journalists encounter while performing this tedious, time-consuming task. To better understand the needs of this user group, we conduct a technical observation study of 50 public repositories of data and analysis code authored by 33 professional journalists at 26 news organizations. We develop two detailed and cross-cutting taxonomies of data wrangling in computational journalism, for actions and for processes. We observe the extensive use of multiple tables, a notable gap in previous wrangling analyses. We develop a concise, actionable framework for general multi-table data wrangling that includes wrangling operations documented in our taxonomy that are without clear parallels in other work. This framework, the first to incorporate tables as first-class objects, will support future interactive wrangling tools for both computational journalism and general-purpose use. We assess the generative and descriptive power of our framework through discussion of its relationship to our set of taxonomies.
Aisch, Gregor; Keller, Josh; Eddelbuettel, Dirk. (2016, June 13). Analysis of NICS gun purchase background checks. New York Times. Retrieved from https://github.com/nytimes/gunsales
Aldhous, Peter. (2016, September 16). "Shy Trumpers" polling analysis. BuzzFeed News. Retrieved from https://github.com/BuzzFeedNews/2016-09-shy-trumpers
Arthur, Rob. (2015, July 30). Buster Posey MVP. FiveThirtyEight. Retrieved from https://github.com/fivethirtyeight/data/tree/master/buster-posey-mvp
Bi, Frank. (2016, January 13). Uber launch cities and date. Vox. Retrieved from https://github.com/voxmedia/data-projects/tree/master/verge-uber-launch-dates
Bradshaw, Paul. (2019, April 6). Lack of electric car charging points 'putting off drivers'. BBC. Retrieved from https://github.com/BBC-Data-Unit/electric-car-charging-points
Bradshaw, Paul. (2019, March 8). Birmingham remains top destination for Londoners. BBC. Retrieved from https://github.com/BBC-Data-Unit/internal-migration-london
Bradshaw, Paul. (2016, December 8). Midwife units see one in four mums transferred by ambulance to hospital. BBC. Retrieved from https://github.com/BBC-Data-Unit/midwife-led-units
Ceccon, Stefano. (2016, January 16). Analysis: How the Conservatives won. The Times and Sunday Times. Retrieved from https://github.com/times/data/tree/master/general-election-2015-classification-tree
Chinoy, Sahil. (2018, November 12). The Cube Root Law. New York Times. Retrieved from https://observablehq.com/@sahilchinoy/the-cube-root-law
Chinoy, Sahil. (2018, October 11). Heat index. New York Times. Retrieved from https://observablehq.com/@sahilchinoy/heat-index
Flowers, Andrew. (2014, June 3). Infrastructure jobs. FiveThirtyEight. Retrieved from https://github.com/fivethirtyeight/data/tree/master/infrastructure-jobs
Flowers, Andrew. (2014, April 11). Librarians. FiveThirtyEight. Retrieved from https://github.com/fivethirtyeight/data/tree/master/librarians
Flowers, Andrew. (2014, April 8). Bechdel. FiveThirtyEight. Retrieved from https://github.com/fivethirtyeight/data/tree/master/bechdel
Fresques, Hannah. (2019, April 1). IRS Audit Rates by County. ProPublica. Retrieved from https://github.com/propublica/auditData
Groskopf, Christopher. (2017, June 27). Analysis of rideshare trips taken in New York City. Quartz. Retrieved from https://github.com/Quartz/nyc-trips
Groskopf, Christopher. (2017, April 10). Analysis of work from home IPUMS data. Quartz. Retrieved from https://github.com/Quartz/work-from-home
Heinle, Lexie. (2017, August 22). Analysis of NYS ed data for Erie, Niagara counties. The Buffalo News. Retrieved from https://github.com/thebuffalonews/new-york-schools-assessment
Hickey, Walter. (2018, February 26). Bob Ross. FiveThirtyEight. Retrieved from https://github.com/fivethirtyeight/data/tree/master/bob-ross
Jones, Brent. (2018, May 31). Crime and heat analysis. St. Louis Public Radio. Retrieved from https://github.com/stlpublicradio/2018-05-31-crime-and-heat-analysis
Kolly, Marie-José. (2018, January 31). 1805-regionen im fokus des US-praesidenten. Neue Zürcher Zeitung. Retrieved from https://github.com/nzzdev/st-methods/tree/master/1805-regionen im fokus des US-praesidenten
Keemahill, Dan. (2019, February 3). Analysis of Austin-Travis County EMS call data. Austin American-Statesman. Retrieved from https://github.com/statesman/2019-ems-analysis
Keller, Josh; Pearce, Adam. (2016, September 7). US State prison admissions by county. New York Times. Retrieved from https://github.com/TheUpshot/prison-admissions
Olson, Randy. (2015, July 22). US Weather History. FiveThirtyEight. Retrieved from https://github.com/fivethirtyeight/data/tree/master/us-weather-history
McDonald, Christian. (2018, April 15). Residential demolitions in Austin. Austin American-Statesman. Retrieved from https://github.com/statesman/demolitions
Mayes, Brittany Renee. (2017, April 17). Data analysis for education's school choice in Indiana project. National Public Radio. Retrieved from https://github.com/nprapps/school-choice
Meiners, Joan. (2018, August 16). Endangered species act Louisiana: American alligator. NOLA. Retrieved from https://github.com/beecycles/Endangered-Species-Act-Louisiana
Meiners, Joan. (2017, November 30). Power of Irma. WUFT. Retrieved from https://github.com/beecycles/Power_of_Irma
Menezes, Ryan; Stevens, Matt; Welsh, Ben. (2016, October 31). California "Conservation-Consumption Score" analysis. Los Angeles Times. Retrieved from https://github.com/datadesk/california-ccscore-analysis
Oh, Soo. (2015, July 8). Central line infection data. Vox. Retrieved from https://github.com/voxmedia/data-projects/tree/master/vox-central-line-infections
Singer-Vine, Jeremy. (2019, April 16). Analysis of early 2020 Democratic campaign co-donors. BuzzFeed News. Retrieved from https://github.com/BuzzFeedNews/2019-04-democratic-candidate-codonors
Singer-Vine, Jeremy. (2015, November 18). US Refugee Data and Analysis. BuzzFeed News. Retrieved from https://github.com/BuzzFeedNews/2015-11-refugees-in-the-united-states
Templon, John. (2016, November 2). Counties That Predict The Election. BuzzFeed News. Retrieved from https://github.com/BuzzFeedNews/2016-11-bellwether-counties
Templon, John. (2016, April 26). Analysis of Republican Donor Movement. BuzzFeed News. Retrieved from https://github.com/BuzzFeedNews/2016-04-republican-donor-movements
Tran, Andrew. (2017, December 18). How the Trump era is changing the federal bureaucracy. The Washington Post. Retrieved from https://github.com/wpinvestigates/federal_employees_trump_2017
Tran, Andrew. (2016, May 16). Analyzing LendingClub loan data for Connecticut. TrendCT. Retrieved from https://github.com/trendct/data/tree/master/2016/05/lending-club
Webster, MaryJo. (2019, June 1). Education achievement gap analysis. Star Tribune. Retrieved from https://github.com/striblab/201901-achievementgap
Webster, MaryJo. (2019, January 10). Hospital quality ratings data. Star Tribune. Retrieved from https://github.com/striblab/201901-hospitalquality
Wehrmeyer, Stefan. (2016, March 21). Euros für Ärzte Data Analysis. CORRECTIV. Retrieved from https://github.com/correctiv/awb-notebook
Welsh, Ben. (2019, April 29). Census "hard to count" analysis. Los Angeles Times. Retrieved from https://github.com/datadesk/census-hard-to-map-analysis
Welsh, Ben. (2018, December 18). California buildings in severe fire hazard zones. Los Angeles Times. Retrieved from https://observablehq.com/@palewire/california-buildings-in-severe-fire-hazard-zones
Welsh, Ben. (2017, May 25). California H-2A visas analysis. Los Angeles Times. Retrieved from https://github.com/datadesk/california-h2a-visas-analysis
Welsh, Ben. (2017, March 28). SWANA population map. Los Angeles Times. Retrieved from https://observablehq.com/@datadesk/swana-population-map
Welsh, Ben. (2017, March 17). California crop production wages analysis. Los Angeles Times. Retrieved from https://github.com/datadesk/california-crop-production-wages-analysis
Wilber, Jared. (2018, June 5). skatemusic. Polygraph. Retrieved from https://github.com/polygraph-cool/skatemusic
Wilson, Chris. (2016, December 20). Baby Name Politics. Time. Retrieved from https://github.com/TimeMagazine/babyname_politics
Wilson, Chris. (2014, May 27). Wikipedia rankings. Time. Retrieved from https://github.com/TimeMagazine/wikipedia-rankings
Yerardi, Joe. (2019, February 28). Injustice at Work. Center for Public Integrity. Retrieved from https://github.com/PublicI/employment-discrimination
Zarkhin, Fedor. (2017, April 21). Long-term care complaints data and analysis. The Oregonian. Retrieved from https://github.com/TheOregonian/long-term-care-db
Zhang, Christine. (2018, October 15). Maryland voter registration analysis. Baltimore Sun. Retrieved from https://github.com/baltimore-sun-data/2018-voter-registration
Zhang, Christine. (2018, December 4). Maryland schools star ratings analysis. Baltimore Sun. Retrieved from https://github.com/baltimore-sun-data/school-star-ratings-2018