This repository has been archived by the owner on Oct 6, 2020. It is now read-only.

Supplemental and other materials related to the paper Table Scraps, to appear in IEEE TVCG (Proc. InfoVis 2020).

steve-kasica/table-scraps

Table Scraps: An Actionable Framework for Multi-Table Data Wrangling From An Artifact Study of Computational Journalism

To appear, IEEE TVCG (Proc. InfoVis 2020)

For the many journalists who use data and computation to report the news, data wrangling is an integral part of their work. Despite an abundance of literature on data wrangling in the context of enterprise data analysis, little is known about the specific operations, processes, and pain points journalists encounter while performing this tedious, time-consuming task. To better understand the needs of this user group, we conduct a technical observation study of 50 public repositories of data and analysis code authored by 33 professional journalists at 26 news organizations. We develop two detailed and cross-cutting taxonomies of data wrangling in computational journalism, for actions and for processes. We observe the extensive use of multiple tables, a notable gap in previous wrangling analyses. We develop a concise, actionable framework for general multi-table data wrangling that includes wrangling operations documented in our taxonomy that are without clear parallels in other work. This framework, the first to incorporate tables as first-class objects, will support future interactive wrangling tools for both computational journalism and general-purpose use. We assess the generative and descriptive power of our framework through discussion of its relationship to our set of taxonomies.

arXiv pre-print

Journalist notebooks included in open-coding analysis

Aisch, Gregor; Keller, Josh; Eddelbuettel, Dirk. (2016, June 13). Analysis of NICS gun purchase background checks. New York Times. Retrieved from https://github.com/nytimes/gunsales

Aldhous, Peter. (2016, September 16). "Shy Trumpers" polling analysis. BuzzFeed News. Retrieved from https://github.com/BuzzFeedNews/2016-09-shy-trumpers

Arthur, Rob. (2015, July 30). Buster Posey MVP. FiveThirtyEight. Retrieved from https://github.com/fivethirtyeight/data/tree/master/buster-posey-mvp

Bi, Frank. (2016, January 13). Uber launch cities and date. Vox. Retrieved from https://github.com/voxmedia/data-projects/tree/master/verge-uber-launch-dates

Bradshaw, Paul. (2019, April 6). Lack of electric car charging points 'putting off drivers'. BBC. Retrieved from https://github.com/BBC-Data-Unit/electric-car-charging-points

Bradshaw, Paul. (2019, March 8). Birmingham remains top destination for Londoners. BBC. Retrieved from https://github.com/BBC-Data-Unit/internal-migration-london

Bradshaw, Paul. (2016, December 8). Midwife units see one in four mums transferred by ambulance to hospital. BBC. Retrieved from https://github.com/BBC-Data-Unit/midwife-led-units

Ceccon, Stefano. (2016, January 16). Analysis: How the Conservatives won. The Times and Sunday Times. Retrieved from https://github.com/times/data/tree/master/general-election-2015-classification-tree

Chinoy, Sahil. (2018, November 12). The Cube Root Law. New York Times. Retrieved from https://observablehq.com/@sahilchinoy/the-cube-root-law

Chinoy, Sahil. (2018, October 11). Heat index. New York Times. Retrieved from https://observablehq.com/@sahilchinoy/heat-index

Flowers, Andrew. (2014, June 3). Infrastructure jobs. FiveThirtyEight. Retrieved from https://github.com/fivethirtyeight/data/tree/master/infrastructure-jobs

Flowers, Andrew. (2014, April 11). Librarians. FiveThirtyEight. Retrieved from https://github.com/fivethirtyeight/data/tree/master/librarians

Flowers, Andrew. (2014, April 8). Bechdel. FiveThirtyEight. Retrieved from https://github.com/fivethirtyeight/data/tree/master/bechdel

Fresques, Hannah. (2019, April 1). IRS Audit Rates by County. ProPublica. Retrieved from https://github.com/propublica/auditData

Groskopf, Christopher. (2017, June 27). Analysis of rideshare trips taken in New York City. Quartz. Retrieved from https://github.com/Quartz/nyc-trips

Groskopf, Christopher. (2017, April 10). Analysis of work from home IPUMS data. Quartz. Retrieved from https://github.com/Quartz/work-from-home

Heinle, Lexie. (2017, August 22). Analysis of NYS ed data for Erie, Niagara counties. The Buffalo News. Retrieved from https://github.com/thebuffalonews/new-york-schools-assessment

Hickey, Walter. (2018, February 26). Bob Ross. FiveThirtyEight. Retrieved from https://github.com/fivethirtyeight/data/tree/master/bob-ross

Jones, Brent. (2018, May 31). Crime and heat analysis. St Louis Public Radio. Retrieved from https://github.com/stlpublicradio/2018-05-31-crime-and-heat-analysis

Kolly, Marie-José. (2018, January 31). 1805-regionen im fokus des US-praesidenten. Neue Zürcher Zeitung. Retrieved from https://github.com/nzzdev/st-methods/tree/master/1805-regionen im fokus des US-praesidenten

Keemahill, Dan. (2019, February 3). Analysis of Austin-Travis County EMS call data. Austin American-Statesman. Retrieved from https://github.com/statesman/2019-ems-analysis

Keller, Josh; Pearce, Adam. (2016, September 7). US State prison admissions by county. New York Times. Retrieved from https://github.com/TheUpshot/prison-admissions

Olson, Randy. (2015, July 22). US Weather History. FiveThirtyEight. Retrieved from https://github.com/fivethirtyeight/data/tree/master/us-weather-history

McDonald, Christian. (2018, April 15). Residential demolitions in Austin. Austin American-Statesman. Retrieved from https://github.com/statesman/demolitions

Mayes, Brittany Renee. (2017, April 17). Data analysis for education's school choice in Indiana project. National Public Radio. Retrieved from https://github.com/nprapps/school-choice

Meiners, Joan. (2018, August 16). Endangered species act Louisiana: American alligator. NOLA. Retrieved from https://github.com/beecycles/Endangered-Species-Act-Louisiana

Meiners, Joan. (2017, November 30). Power of Irma. WUFT. Retrieved from https://github.com/beecycles/Power_of_Irma

Menezes, Ryan; Stevens, Matt; Welsh, Ben. (2016, October 31). California "Conservation-Consumption Score" analysis. Los Angeles Times. Retrieved from https://github.com/datadesk/california-ccscore-analysis

Oh, Soo. (2015, July 8). Central line infection data. Vox. Retrieved from https://github.com/voxmedia/data-projects/tree/master/vox-central-line-infections

Singer-Vine, Jeremy. (2019, April 16). Analysis of early 2020 Democratic campaign co-donors. BuzzFeed News. Retrieved from https://github.com/BuzzFeedNews/2019-04-democratic-candidate-codonors

Singer-Vine, Jeremy. (2015, November 18). US Refugee Data and Analysis. BuzzFeed News. Retrieved from https://github.com/BuzzFeedNews/2015-11-refugees-in-the-united-states

Templon, John. (2016, November 2). Counties That Predict The Election. BuzzFeed News. Retrieved from https://github.com/BuzzFeedNews/2016-11-bellwether-counties

Templon, John. (2016, April 26). Analysis of Republican Donor Movement. BuzzFeed News. Retrieved from https://github.com/BuzzFeedNews/2016-04-republican-donor-movements

Tran, Andrew. (2017, December 18). How the Trump era is changing the federal bureaucracy. The Washington Post. Retrieved from https://github.com/wpinvestigates/federal_employees_trump_2017

Tran, Andrew. (2016, May 16). Analyzing LendingClub loan data for Connecticut. TrendCT. Retrieved from https://github.com/trendct/data/tree/master/2016/05/lending-club

Webster, MaryJo. (2019, June 1). Education achievement gap analysis. Star Tribune. Retrieved from https://github.com/striblab/201901-achievementgap

Webster, MaryJo. (2019, January 10). Hospital quality ratings data. Star Tribune. Retrieved from https://github.com/striblab/201901-hospitalquality

Wehrmeyer, Stefan. (2016, March 21). Euros für Ärzte Data Analysis. CORRECTIV. Retrieved from https://github.com/correctiv/awb-notebook

Welsh, Ben. (2019, April 29). Census "hard to count" analysis. Los Angeles Times. Retrieved from https://github.com/datadesk/census-hard-to-map-analysis

Welsh, Ben. (2018, December 18). California buildings in severe fire hazard zones. Los Angeles Times. Retrieved from https://observablehq.com/@palewire/california-buildings-in-severe-fire-hazard-zones

Welsh, Ben. (2017, May 25). California H-2A visas analysis. Los Angeles Times. Retrieved from https://github.com/datadesk/california-h2a-visas-analysis

Welsh, Ben. (2017, March 28). SWANA population map. Los Angeles Times. Retrieved from https://observablehq.com/@datadesk/swana-population-map

Welsh, Ben. (2017, March 17). California crop production wages analysis. Los Angeles Times. Retrieved from https://github.com/datadesk/california-crop-production-wages-analysis

Wilber, Jared. (2018, June 5). skatemusic. Polygraph. Retrieved from https://github.com/polygraph-cool/skatemusic

Wilson, Chris. (2016, December 20). Baby Name Politics. Time. Retrieved from https://github.com/TimeMagazine/babyname_politics

Wilson, Chris. (2014, May 27). Wikipedia rankings. Time. Retrieved from https://github.com/TimeMagazine/wikipedia-rankings

Yerardi, Joe. (2019, February 28). Injustice at Work. Center for Public Integrity. Retrieved from https://github.com/PublicI/employment-discrimination

Zarkhin, Fedor. (2017, April 21). Long-term care complaints data and analysis. The Oregonian. Retrieved from https://github.com/TheOregonian/long-term-care-db

Zhang, Christine. (2018, October 15). Maryland voter registration analysis. Baltimore Sun. Retrieved from https://github.com/baltimore-sun-data/2018-voter-registration

Zhang, Christine. (2018, December 4). Maryland schools star ratings analysis. Baltimore Sun. Retrieved from https://github.com/baltimore-sun-data/school-star-ratings-2018
