# Table Scraps: An Actionable Framework for Multi-Table Data Wrangling From An Artifact Study of Computational Journalism




## Unabridged Taxonomies of Data Wrangling in Computational Journalism

Here we present complete versions of our two taxonomies of data wrangling in computational journalism: *Actions* and *Processes*. We use shortcodes to refer to the longer descriptions of open and axial codes in the paper. Each shortcode follows this naming convention:

<1>.<2>.<3>.<4>.<5>

<1>: *A* for Action and *P* for Process

<2>: The first character of the code, capitalized

<3>: Letters a-z, lowercase

<4>: Arabic numerals, 1 - 9

<5>: Roman numerals, lowercase, i - x
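
The naming convention can be checked mechanically. The following is a minimal sketch, assuming the five segments are separated by periods and that a trailing period (as printed in the trees below) is not part of the shortcode. The names `SHORTCODE` and `parse_shortcode` are hypothetical helpers for illustration and are not part of the project's `lib.util` module.

```python
import re

# Lowercase Roman numerals i through x, as allowed for segment <5>
ROMAN = 'i|ii|iii|iv|v|vi|vii|viii|ix|x'

# Segments <3>, <4>, and <5> are optional, but each requires its parent
SHORTCODE = re.compile(
    r'(?P<taxonomy>[AP])'            # <1>: A for Action, P for Process
    r'\.(?P<category>[A-Z])'         # <2>: capitalized first character
    r'(?:\.(?P<letter>[a-z])'        # <3>: lowercase letter
    r'(?:\.(?P<number>[1-9])'        # <4>: Arabic numeral 1 to 9
    rf'(?:\.(?P<roman>{ROMAN}))?)?)?'  # <5>: Roman numeral i to x
)

def parse_shortcode(code):
    """Split a shortcode such as 'A.D.c.1.ii' into its named segments."""
    match = SHORTCODE.fullmatch(code.rstrip('.'))
    if match is None:
        raise ValueError('not a valid shortcode: {!r}'.format(code))
    parts = {k: v for k, v in match.groupdict().items() if v is not None}
    segments = len(parts)
    parts['taxonomy'] = 'Action' if parts['taxonomy'] == 'A' else 'Process'
    parts['segments'] = segments
    return parts

print(parse_shortcode('A.D.c.1.ii.'))
print(parse_shortcode('P.S'))
```

A code with more segments sits deeper in its taxonomy, so `segments` gives a quick way to compare a shortcode against the indentation of the trees below.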

In [18]:
%%html
<style>
    ol > li {
        list-style: none
    }
</style>

In [3]:
import pandas as pd
import numpy as np
from lib.util import displayMarkdown, getCodeset

codeset = pd.read_csv('data/codeset.csv').replace(np.nan, '', regex=True)

In [25]:
def displayTree(codes, minLevel=1000):
    """Format codeset dataframe as a tree in Markdown.

    Despite its name, `minLevel` is an upper bound: only codes whose
    `level` is less than or equal to it are displayed.
    """
    codeMarkdownTree = []
    for _, c in codes[codes.level <= minLevel].iterrows():
        # One tab per level nests the item; '1.' marks an ordered-list item
        item = '{}1. {}.&nbsp;&nbsp;&nbsp;**{}**: {}\n'.format(
            '\t' * c['level'],
            c['shortcode'],
            c['name'],
            c['desc']
        )
        codeMarkdownTree.append(item)
    return displayMarkdown('\n'.join(codeMarkdownTree))
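
Because `displayMarkdown` and `getCodeset` come from the project's `lib.util` module, the cell above cannot be run in isolation. The following is a minimal, self-contained sketch of the same formatting logic that returns the Markdown string instead of displaying it. The function name `format_tree`, the parameter name `max_level`, and the three-row `toy_codes` table are assumptions made for illustration; the `level` values are inferred from the indentation of the rendered trees.

```python
import pandas as pd

# Toy codeset: three rows copied from the Actions taxonomy, with assumed levels
toy_codes = pd.DataFrame({
    'shortcode': ['A.I', 'A.I.a', 'A.I.a.1'],
    'name': ['Import', 'Fetch', 'Extract Data From PDF'],
    'desc': [
        'How raw data is introduced into the wrangling environment',
        'Data is retrieved from a source external to the wrangling environment',
        'Using a data extraction tool, such as Tabula, to parse tables inside PDF documents',
    ],
    'level': [0, 1, 2],
})

def format_tree(codes, max_level=1000):
    """Return a nested Markdown list with one tab of indentation per level."""
    lines = []
    for row in codes[codes['level'] <= max_level].itertuples(index=False):
        lines.append('{}1. {}.&nbsp;&nbsp;&nbsp;**{}**: {}\n'.format(
            '\t' * row.level, row.shortcode, row.name, row.desc))
    return '\n'.join(lines)

# Keep only the two shallowest levels; A.I.a.1 is filtered out
tree_markdown = format_tree(toy_codes, max_level=1)
print(tree_markdown)
```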

## Actions Taxonomy

The *Actions* taxonomy details individual data wrangling steps made by journalists.

In [26]:
actions = getCodeset('actions.yaml')
actions['uniq'] = actions.name.str.lower()

actions = pd.merge(actions, codeset[codeset.type=='actions'][['name', 'level', 'shortcode']], left_on='uniq', right_on='name')
actions = actions[['name_x', 'desc', 'shortcode', 'level_x']] \
    .rename(columns={
    'name_x': 'name', 
    'level_x': 'level'
})
displayTree(actions, 3)

1. A.I.&nbsp;&nbsp;&nbsp;**Import**: How raw data is introduced into the wrangling environment

	1. A.I.a.&nbsp;&nbsp;&nbsp;**Fetch**: Data is retrieved from a source external to the wrangling environment

		1. A.I.a.1.&nbsp;&nbsp;&nbsp;**Extract Data From PDF**: Using a data extraction tool, such as Tabula, to parse tables inside PDF documents

	1. A.I.b.&nbsp;&nbsp;&nbsp;**Create**: Data is created inside the wrangling environment

		1. A.I.b.1.&nbsp;&nbsp;&nbsp;**Construct Data Manually**: The data is either copy-and-pasted or values are created manually

		1. A.I.b.2.&nbsp;&nbsp;&nbsp;**Generate Data Computationally**: Using data with values generated programmatically

		1. A.I.b.4.&nbsp;&nbsp;&nbsp;**Impute Missing Data**: Replace missing data values either manually, through data entry, or systematically, through R functions such as `lag`.

	1. A.I.c.&nbsp;&nbsp;&nbsp;**Load**: Data resides on the local disk and is loaded into the environment

1. A.C.&nbsp;&nbsp;&nbsp;**Clean**: The process of removing incorrect, incomplete, inaccurate, misformatted or otherwise corrupt observations, variables, and values within a dataset.

	1. A.C.a.&nbsp;&nbsp;&nbsp;**Remove**: Approaches to cleaning data that involve removing observations, variables, and values

		1. A.C.a.1.&nbsp;&nbsp;&nbsp;**Deduplicate**: Remove duplicate observations

		1. A.C.a.2.&nbsp;&nbsp;&nbsp;**Remove Non-Data Rows**: Remove notes and comments that are not observations

		1. A.C.a.3.&nbsp;&nbsp;&nbsp;**Remove Incomplete Data**: Drop observation if it contains incomplete values, often denoted as NA or Null

	1. A.C.b.&nbsp;&nbsp;&nbsp;**Replace**: Approaches to data cleaning that involve replacing observations, variables, and values

		1. A.C.b.1.&nbsp;&nbsp;&nbsp;**Replace NA Values**: Journalists replace NA values of a variable with another value. NA values can be denoted in various ways

		1. A.C.b.2.&nbsp;&nbsp;&nbsp;**Edit Values**: Existing data values are incorrect, but not NA, and must be changed by the journalist

		1. A.C.b.3.&nbsp;&nbsp;&nbsp;**Resolve Entities**: Resolve cases where different categorical values refer to the same entity. This wrangling action is particularly suited to fixing common data issues such as misspellings, inconsistent date formats, and name ordering.

		1. A.C.b.4.&nbsp;&nbsp;&nbsp;**Standardize Categorical Variables**: Make levels conform to some set of rules, such as replacing whitespace with underscores, trimming whitespace, etc

		1. A.C.b.5.&nbsp;&nbsp;&nbsp;**Scale Values**: Operations that apply a mathematical operation to a quantitative variable in the spirit of fixing data errors. For example, quantitative data may be expressed in millions, with only the significant digits displayed.

	1. A.C.c.&nbsp;&nbsp;&nbsp;**Reformat**: Wrangling operations that modify the table entry's appearance or style, but not value

		1. A.C.c.1.&nbsp;&nbsp;&nbsp;**Format Values**: Operations that change value appearance, but not the underlying variable type: changing case, specifying date format, rounding floats

		1. A.C.c.2.&nbsp;&nbsp;&nbsp;**Canonicalize Variable Names**: Operations that change column names

1. A.M.&nbsp;&nbsp;&nbsp;**Merge**: Operations that combine multiple datasets

	1. A.M.a.&nbsp;&nbsp;&nbsp;**Union Datasets**: Combining multiple datasets with identical variables into one dataset

	1. A.M.b.&nbsp;&nbsp;&nbsp;**Inner Join**: Take the intersection of two datasets on a shared key variable

	1. A.M.c.&nbsp;&nbsp;&nbsp;**Supplement**: The variables of one dataset are supplemented with the variables of another dataset

		1. A.M.c.1.&nbsp;&nbsp;&nbsp;**Outer Join**: Retain observations with no corresponding match in the dataset being joined upon

		1. A.M.c.2.&nbsp;&nbsp;&nbsp;**Full Join**: Retain observations with no corresponding match in either dataset

		1. A.M.c.3.&nbsp;&nbsp;&nbsp;**Concat Parallel Datasets**: Join two datasets by position, without specifying a joining key

	1. A.M.d.&nbsp;&nbsp;&nbsp;**Cartesian Product**: Create a new dataset by the unique pairing of each key in their respective datasets

	1. A.M.e.&nbsp;&nbsp;&nbsp;**Self Join Dataset**: Create a new dataset by joining it with itself

1. A.P.&nbsp;&nbsp;&nbsp;**Profile**: Operations that inspect the state of the data during wrangling

	1. A.P.a.&nbsp;&nbsp;&nbsp;**Run a Test**: Audit the data by constructing a pass or fail scenario

		1. A.P.a.1.&nbsp;&nbsp;&nbsp;**Report Rows With Column Number Discrepancies**: Check the number of columns or rows between tables

		1. A.P.a.2.&nbsp;&nbsp;&nbsp;**Test for Equality**: Test if two data structures are exactly the same

		1. A.P.a.3.&nbsp;&nbsp;&nbsp;**Test for Null Values**: Test the results of a calculation against different methods/packages

		1. A.P.a.4.&nbsp;&nbsp;&nbsp;**Validate Data Quality with Domain-Specific Rules**: Test with a domain-specific rule for the data, such as checking if the average temperature is higher than the maximum recorded value

	1. A.P.b.&nbsp;&nbsp;&nbsp;**Check Results**: Output the dataset for review

		1. A.P.b.1.&nbsp;&nbsp;&nbsp;**Peek at Data**: Display the first *n* observations, or take a random sample, with all variables of the dataset

		1. A.P.b.2.&nbsp;&nbsp;&nbsp;**Inspect Data Schema**: Check the data types of columns

		1. A.P.b.3.&nbsp;&nbsp;&nbsp;**Select Rows with Missing Values**: Inspect the dataset for observations with a missing value, often denoted as NA

		1. A.P.b.4.&nbsp;&nbsp;&nbsp;**Check for NAs**: See if any observations have NA values

		1. A.P.b.5.&nbsp;&nbsp;&nbsp;**Visualize Data**: Employ any kind of data visualization, including tables

	1. A.P.c.&nbsp;&nbsp;&nbsp;**Summarize Dataset**: Summarize the dataset numerically

		1. A.P.c.2.&nbsp;&nbsp;&nbsp;**Count Unique Values**: Report the number of unique values in one or more variables

		1. A.P.c.3.&nbsp;&nbsp;&nbsp;**Describe Statistically**: Generate descriptive statistics of the dataset, such as central tendency, dispersion, or distribution shape

1. A.D.&nbsp;&nbsp;&nbsp;**Derive**: Expand upon the original dataset without integrating another dataset

	1. A.D.a.&nbsp;&nbsp;&nbsp;**Detrend**: Remove the secular effect from a variable; these are not considered data cleaning operations because values are not erroneous

		1. A.D.a.1.&nbsp;&nbsp;&nbsp;**Adjust for Inflation**: Remove the effect of price inflation from data

		1. A.D.a.2.&nbsp;&nbsp;&nbsp;**Compute Index Number**: Calculate the change in a variable over time

		1. A.D.a.3.&nbsp;&nbsp;&nbsp;**Adjust for Season**: Adjust a variable to compensate for seasonal effect

	1. A.D.b.&nbsp;&nbsp;&nbsp;**Consolidate Variable Values**: Map a set of unique values to a smaller set, which is different from entity resolution

		1. A.D.b.1.&nbsp;&nbsp;&nbsp;**Bin Values**: Consolidate a quantitative variable into a smaller set of ordinal data

		1. A.D.b.2.&nbsp;&nbsp;&nbsp;**Combine Categorical Values**: Consolidate the levels of a categorical variable into a smaller set of levels.

	1. A.D.c.&nbsp;&nbsp;&nbsp;**Generate Unique Identifiers**: Attempt to create unique identifiers

		1. A.D.c.1.&nbsp;&nbsp;&nbsp;**Generate Observation Identification**: Produce unique identification for each observation

			1. A.D.c.1.i.&nbsp;&nbsp;&nbsp;**Create Soft Key**: Keys are not guaranteed to be unique per observation

			1. A.D.c.1.ii.&nbsp;&nbsp;&nbsp;**Create a Unique Key**: Keys are guaranteed to be unique per observation

		1. A.D.c.2.&nbsp;&nbsp;&nbsp;**Generate Dataset Identification**: Add a table identification value as a variable for all observations

	1. A.D.d.&nbsp;&nbsp;&nbsp;**Subset the Dataset**: Reduce the size or complexity of the actively wrangled dataset

		1. A.D.d.1.&nbsp;&nbsp;&nbsp;**Remove Variables**: Specify which variables to remove or retain from a dataset

		1. A.D.d.2.&nbsp;&nbsp;&nbsp;**Remove Observations**: Specify which observations to remove or retain from a dataset

			1. A.D.d.2.i.&nbsp;&nbsp;&nbsp;**Trim by Date Range**: Remove based on observations inside or outside a range of dates

			1. A.D.d.2.ii.&nbsp;&nbsp;&nbsp;**Trim by Geographic Area**: Remove based on observations inside or outside the geographic region

			1. A.D.d.2.iii.&nbsp;&nbsp;&nbsp;**Trim by Quantitative Threshold**: Remove based on observations above, below, equal to, or not equal to a quantitative value

			1. A.D.d.2.iv.&nbsp;&nbsp;&nbsp;**Trim by Categorical Value**: Remove based on observations that do or do not contain a specific value

	1. A.D.e.&nbsp;&nbsp;&nbsp;**Formulate a Performance Metric**: Calculate a quantitative variable

		1. A.D.e.1.&nbsp;&nbsp;&nbsp;**Assign Ranks**: Order observations explicitly as a variable

		1. A.D.e.2.&nbsp;&nbsp;&nbsp;**Standardize Variable**: Measure deviation from "normal," such as z-scores

		1. A.D.e.3.&nbsp;&nbsp;&nbsp;**Figure a Rate**: Calculate a normalized rate to provide a baseline for comparison

		1. A.D.e.4.&nbsp;&nbsp;&nbsp;**Calculate Change Over Time**: Calculate percentage change over time

		1. A.D.e.5.&nbsp;&nbsp;&nbsp;**Calculate Spread**: Calculate the difference between two values or rates

		1. A.D.e.6.&nbsp;&nbsp;&nbsp;**Domain-Specific Performance Metric**: Calculate a domain-specific metric

		1. A.D.e.7.&nbsp;&nbsp;&nbsp;**Get Extreme Values**: Calculate the highest or lowest values in a variable

1. A.T.&nbsp;&nbsp;&nbsp;**Transform**: Create or revise table variables based on existing variables, without *integrating* other tables

		1. A.T.a.1.&nbsp;&nbsp;&nbsp;**Transpose**: Change places between rows and columns within a table or matrix

		1. A.T.a.2.&nbsp;&nbsp;&nbsp;**Cross Tabulate**: Create a pivot table or crosstab

		1. A.T.a.3.&nbsp;&nbsp;&nbsp;**Spread Table**: Expand two columns of key value pairs into multiple columns

		1. A.T.a.4.&nbsp;&nbsp;&nbsp;**Gather**: Collapse table into key value pairs

		1. A.T.a.5.&nbsp;&nbsp;&nbsp;**Create a Flag**: Spread a categorical variable into multiple boolean variables

	1. A.T.b.&nbsp;&nbsp;&nbsp;**Modify Variables**: Change properties of variables within a dataset

		1. A.T.b.1.&nbsp;&nbsp;&nbsp;**Parse Variable**: Separate variables into multiple new variables using position or regular expressions

		1. A.T.b.2.&nbsp;&nbsp;&nbsp;**Consolidate Variables**: Combine two different variables into one composite variable

		1. A.T.b.3.&nbsp;&nbsp;&nbsp;**Replace Variable Levels**: Change the value of a level in a categorical variable to another value

	1. A.T.c.&nbsp;&nbsp;&nbsp;**Summarize**: Aggregate observations to summarize a phenomenon; we consider this a structural change as it effectively coarsens the dataset

		1. A.T.c.1.&nbsp;&nbsp;&nbsp;**Group By Variable**: Group or partition by the levels of one or more categorical variables

		1. A.T.c.2.&nbsp;&nbsp;&nbsp;**Aggregate**: Aggregate quantitative values using functions such as sum, mean, median, or count

		1. A.T.c.3.&nbsp;&nbsp;&nbsp;**Rolling Window Calculation**: Perform rolling-window aggregation

	1. A.T.d.&nbsp;&nbsp;&nbsp;**Sort**: Order observations implicitly by position within a data structure

1. A.E.&nbsp;&nbsp;&nbsp;**Export**: Export the results of data wrangling either by writing results to disk or by returning data from a function
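
Several of the *Merge* actions (A.M.) map fairly directly onto standard dataframe operations. The sketch below illustrates four of them with pandas, plus the profiling step of selecting rows with missing values (A.P.b.3). The two small tables, their column names, and their values are invented for illustration, and reading "Outer Join" as a one-sided (left) join is our interpretation of its description rather than something the taxonomy states.

```python
import pandas as pd

# Hypothetical tables; county names and figures are invented for illustration
population = pd.DataFrame({'county': ['A', 'B', 'C'],
                           'population': [100, 200, 300]})
cases = pd.DataFrame({'county': ['B', 'C', 'D'], 'cases': [5, 7, 9]})
more_cases = pd.DataFrame({'county': ['E'], 'cases': [11]})

# A.M.a Union Datasets: stack tables that share identical variables
all_cases = pd.concat([cases, more_cases], ignore_index=True)

# A.M.b Inner Join: keep only counties present in both tables
inner = population.merge(cases, on='county', how='inner')

# A.M.c.1 Outer Join (read here as a left join): keep every row of
# `population`, even when `cases` has no matching county
left_joined = population.merge(cases, on='county', how='left')

# A.M.c.2 Full Join: keep unmatched rows from either table
full = population.merge(cases, on='county', how='outer')

# A.P.b.3 Select Rows with Missing Values: inspect what failed to match
unmatched = full[full.isna().any(axis=1)]

print(len(inner), len(left_joined), len(full))
print(unmatched)
```

Comparing the row counts of the three joins is a cheap way to notice when a join has silently dropped observations or introduced missing values.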


## Process Taxonomy

The *Process* taxonomy consists of the paper authors' interpretations of the processes that occur during data wrangling.

In [29]:
process = getCodeset('process.yaml')
process['uniq'] = process.name.str.lower()

process = pd.merge(process, codeset[codeset.type=='process'][['name', 'level', 'shortcode']], left_on='uniq', right_on='name')
process = process[['name_x', 'desc', 'shortcode', 'level_x']] \
    .rename(columns={
    'name_x': 'name', 
    'level_x': 'level'
})
displayTree(process, 3)

1. P.S.&nbsp;&nbsp;&nbsp;**source**: Codes that describe how the raw data was obtained by journalists

	1. P.S.a.&nbsp;&nbsp;&nbsp;**collect data**: Journalists are the initial data collector

		1. P.S.a.1.&nbsp;&nbsp;&nbsp;**collect raw data**: The journalist collected the raw data themselves.

		1. P.S.a.2.&nbsp;&nbsp;&nbsp;**Freedom of Information data**: Data that was obtained via FOI/FOIA requests

	1. P.S.b.&nbsp;&nbsp;&nbsp;**acquire data**: Journalists acquired data from another party

		1. P.S.b.1.&nbsp;&nbsp;&nbsp;**use previously cleaned data**: Data that originated from a colleague

		1. P.S.b.2.&nbsp;&nbsp;&nbsp;**use public data**: Includes open-source datasets, datasets on Wikipedia, etc

		1. P.S.b.3.&nbsp;&nbsp;&nbsp;**use academic data**: Use data collected from an academic study

		1. P.S.b.4.&nbsp;&nbsp;&nbsp;**use non-public, provided data**: Use data that is not publicly available

		1. P.S.b.6.&nbsp;&nbsp;&nbsp;**use another news orgs data**: A dataset previously published by another news organization

		1. P.S.b.7.&nbsp;&nbsp;&nbsp;**use data from colleague**: A dataset was provided by another journalist

1. P.W.&nbsp;&nbsp;&nbsp;**workflow**: Codes pertaining to how the wrangling workflow is built.

	1. P.W.a.&nbsp;&nbsp;&nbsp;**Annotations**: Adding comments or notes in Markdown that explain what the journalist is doing.

	1. P.W.b.&nbsp;&nbsp;&nbsp;**Comp. processes**: Codes that demonstrate computational thinking on the part of the journalist.

		1. P.W.b.1.&nbsp;&nbsp;&nbsp;**construct a subroutine**: A set of instructions grouped together to be performed multiple times

		1. P.W.b.2.&nbsp;&nbsp;&nbsp;**construct data pipeline**: An instance where one script is designed to handle multiple data sources. Often journalists construct subroutines and loops.

	1. P.W.c.&nbsp;&nbsp;&nbsp;**toggle operation**: Ensuring that some code segments are not always run, such as by commenting out lines of code

1. P.C.&nbsp;&nbsp;&nbsp;**cause**: Based on the final output and comments, why does it seem like this data needs to be wrangled?

	1. P.C.a.&nbsp;&nbsp;&nbsp;**downstream input**: Output from wrangling will be input into some other program

		1. P.C.a.1.&nbsp;&nbsp;&nbsp;**wrangle data for graphics**: Data need to be formatted in order to be visualized in an article, including datasets.

		1. P.C.a.2.&nbsp;&nbsp;&nbsp;**wrangle data for model**: Data is being wrangled in order to create a model, whether the main point of the piece is for prediction or classification

		1. P.C.a.3.&nbsp;&nbsp;&nbsp;**create new datasets**: These raw datasets are being wrangled in order to create a new dataset

			1. P.C.a.3.i.&nbsp;&nbsp;&nbsp;**Combine periodic data**: Combine many separate datasets published over time into one dataset

			1. P.C.a.3.ii.&nbsp;&nbsp;&nbsp;**Merge seemingly disparate datasets**: When a notebook largely constitutes combining seemingly unrelated datasets

			1. P.C.a.3.iii.&nbsp;&nbsp;&nbsp;**Geolocate dataset records**: Pairing data with GIS info

		1. P.C.a.4.&nbsp;&nbsp;&nbsp;**Generate high-level summary**: Data of individual observations is aggregated in an attempt to find some meaningful structure or patterns

1. P.T.&nbsp;&nbsp;&nbsp;**themes**: General themes for how data objects are transformed throughout the wrangling process

	1. P.T.a.&nbsp;&nbsp;&nbsp;**divide and conquer**: Instances where the data wrangling process separates one object into smaller components

		1. P.T.a.1.&nbsp;&nbsp;&nbsp;**split, compute, and merge**: First, the journalist partitions a single data frame into multiple, separate data frames. Then, often identical computations are run on all the data frames. Finally, the multiple data frames are consolidated into one data frame again

		1. P.T.a.2.&nbsp;&nbsp;&nbsp;**split and compute**: One dataset is split into two or more and identical computations are applied to each dataset

	1. P.T.b.&nbsp;&nbsp;&nbsp;**join aggregate**: When aggregated statistics about a dataset are added to the datasets as a variable, either columnwise or row-wise (as with `adorn_totals` in R)

	1. P.T.c.&nbsp;&nbsp;&nbsp;**create a frequency table**: A table that displays the frequency of categorical variables within a column

	1. P.T.d.&nbsp;&nbsp;&nbsp;**trim fat**: Large numbers of observations or variables are removed from the dataset early in the wrangling process, often as the very first step. We infer that the removed sections are irrelevant to further analysis.

	1. P.T.e.&nbsp;&nbsp;&nbsp;**align variables**: Modifying dataset variables to match each other, often prior to merging datasets.

1. P.A.&nbsp;&nbsp;&nbsp;**analysis**: Kinds of analysis data journalists need to wrangle data to perform

	1. P.A.b.&nbsp;&nbsp;&nbsp;**compare groups**: The end analysis is just comparing different groups by a common metric

	1. P.A.c.&nbsp;&nbsp;&nbsp;**identify extreme values**: Identify values that are at the ends of the range, but not strictly outliers

	1. P.A.d.&nbsp;&nbsp;&nbsp;**show trend over time**: Analysis consists of showing how values change over time

	1. P.A.e.&nbsp;&nbsp;&nbsp;**calculate a statistic**: Calculate a single value from a dataset, such as number of records

	1. P.A.f.&nbsp;&nbsp;&nbsp;**count the data**: Analysis involves count-based metrics on the datasets including percentages, with optional filtering and aggregation

	1. P.A.g.&nbsp;&nbsp;&nbsp;**Lookup table values**: Analysis consists of looking up values in a table

	1. P.A.h.&nbsp;&nbsp;&nbsp;**examine relationship**: Analysis consists of examining the relationship between different phenomena

	1. P.A.i.&nbsp;&nbsp;&nbsp;**explain variance**: This can be done via PCA

	1. P.A.j.&nbsp;&nbsp;&nbsp;**search for clusters**: Look for groups within the data whose presence, or lack thereof, is significant

	1. P.A.k.&nbsp;&nbsp;&nbsp;**perform network analysis**: Journalists perform any kind of network analysis, such as finding all nearest neighbors in the network

	1. P.A.l.&nbsp;&nbsp;&nbsp;**explore dynamic network flow**: (Network analysis) explore the flow between different nodes in the graph, e.g. migration between cities

	1. P.A.m.&nbsp;&nbsp;&nbsp;**Create lookup table**: Make a table with two columns to map from one value to another

	1. P.A.n.&nbsp;&nbsp;&nbsp;**aggregate join**: Aggregating a table and then joining those results to the original table

1. P.M.&nbsp;&nbsp;&nbsp;**management**: General strategies journalists use for managing data within the wrangling environment

	1. P.M.a.&nbsp;&nbsp;&nbsp;**object persistence**: How do journalists regard previous versions of datasets after applying transformation functions?

		1. P.M.a.1.&nbsp;&nbsp;&nbsp;**data evolves**: Data and objects are overwritten and replaced during the wrangling process

			1. P.M.a.1.i.&nbsp;&nbsp;&nbsp;**variable replacement**: The output of any column calculation is reassigned to an existing column

			1. P.M.a.1.ii.&nbsp;&nbsp;&nbsp;**temporary joining column**: When a key for joining two datasets is created and deleted immediately after the join

			1. P.M.a.1.iii.&nbsp;&nbsp;&nbsp;**refine table**: Dataset refinement refers to when a table is subset in place, so that no new object is created in the environment

	1. P.M.b.&nbsp;&nbsp;&nbsp;**data quality**: How journalists proceed when data may be incomplete, erroneous, or otherwise not 100% clean

		1. P.M.b.1.&nbsp;&nbsp;&nbsp;**set data confidence threshold**: Removes rows where a quantitative value is less than, greater than, or not equal to a numeric value

		1. P.M.b.2.&nbsp;&nbsp;&nbsp;**tolerate dirty data**: Analysis continues despite clear data quality issues

1. P.P.&nbsp;&nbsp;&nbsp;**pain points**: Areas where journalists seem to be, or could be, frustrated in the wrangling process

	1. P.P.a.&nbsp;&nbsp;&nbsp;**fix incorrect calculation**: Calculations in the data are incorrect and the journalist must recalculate them

	1. P.P.b.&nbsp;&nbsp;&nbsp;**repetitive code**: Instances where code is repetitively copied and pasted

	1. P.P.c.&nbsp;&nbsp;&nbsp;**make an incorrect conclusion**: Instances where the journalist has made an incorrect conclusion about the data

	1. P.P.d.&nbsp;&nbsp;&nbsp;**post-merge clean up**: Pain points that come from the result of merging two datasets together

		1. P.P.d.1.&nbsp;&nbsp;&nbsp;**resort after merge**: When a sort has to be redone because a merge ruined the pre-merge order

		1. P.P.d.2.&nbsp;&nbsp;&nbsp;**fill in NA values after an outer join**: Because outer joins do not drop non-matching rows, the unmatched values in those rows are NA

		1. P.P.d.3.&nbsp;&nbsp;&nbsp;**lossy join**: When data is lost after integrating two tables

		1. P.P.d.4.&nbsp;&nbsp;&nbsp;**remove duplicate variables**: Two tables may have duplicate variables, which need to be removed after the merge

	1. P.P.e.&nbsp;&nbsp;&nbsp;**post-aggregation clean up**: Pain points that come from the result of grouping a table

		1. P.P.e.1.&nbsp;&nbsp;&nbsp;**data loss from aggregation**: When table columns are lost because they were dropped from the resulting dataset due to not being relevant in aggregation

		1. P.P.e.2.&nbsp;&nbsp;&nbsp;**silently dropping values after groupby**: Values other than those being grouped and calculated upon are lost in a group-by operation

	1. P.P.f.&nbsp;&nbsp;&nbsp;**data too large for repo**: Raw data cannot be included because files are too large

	1. P.P.g.&nbsp;&nbsp;&nbsp;**schema drift**: When the schema of a perennially published dataset varies from edition to edition

	1. P.P.h.&nbsp;&nbsp;&nbsp;**data type shyness**: Users often seem to avoid using built-in data types
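
Two of the multi-table patterns above, *split, compute, and merge* (P.T.a.1) and *aggregate join* (P.A.n), can be sketched in a few lines of pandas. This is an illustrative example under our own assumptions: the `sales` table, its column names, and its values are invented, and the specific computations (a within-state share and a state total) are placeholders for whatever a journalist would actually calculate.

```python
import pandas as pd

# Hypothetical table; states, years, and values are invented for illustration
sales = pd.DataFrame({
    'state': ['CA', 'CA', 'NY', 'NY'],
    'year': [2019, 2020, 2019, 2020],
    'value': [10, 20, 30, 60],
})

# P.T.a.1 split, compute, and merge
# Split: partition one data frame into separate data frames, one per state
parts = [group.copy() for _, group in sales.groupby('state')]

# Compute: run the identical calculation on every partition
for part in parts:
    part['share'] = part['value'] / part['value'].sum()

# Merge: consolidate the partitions into a single data frame again
combined = pd.concat(parts, ignore_index=True)

# P.A.n aggregate join
# Aggregate the table, then join the result back onto the original rows
totals = (sales.groupby('state', as_index=False)['value'].sum()
               .rename(columns={'value': 'state_total'}))
joined = sales.merge(totals, on='state', how='left')

print(combined)
print(joined)
```

The aggregate join keeps every original column alongside the new total, which sidesteps the *data loss from aggregation* pain point (P.P.e.1) at the cost of repeating the aggregate on every row.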
