This notebook includes analysis of the iterative coding process.

In [1]:
import pandas as pd
import altair as alt
from lib.util import displayMarkdown

alt.renderers.enable('notebook')

pd.set_option("display.max_rows", None)  # Don't truncate rows when printing a Pandas DataFrame instance

# Import data 

## Code tree

This CSV file is computed in `codebook.ipynb`.

In [2]:
codes = pd.read_csv('data/codes.csv')
codes.head()

Unnamed: 0,parent,name,desc,level,is_leaf,analysis
0,root,actions,Codes that describe actions the journalist has...,0,False,50
1,actions,import,How raw data is introduced into the programmin...,1,False,39
2,import,fetch,Data is retrieved from some external sources t...,2,False,6
3,fetch,pull tables out of pdf,"Using a table extraction tool, such as Tabula,...",3,True,1
4,fetch,api request,"Make a request to a web API, such as addresses...",3,True,1
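
The `parent` and `name` columns together encode a tree: each row names a code and the code directly above it. As a minimal sketch of how such a table can be walked, the snippet below collects a code and everything beneath it. The `toy_codes` frame and the `descendants` helper are hypothetical illustrations, not part of this notebook's code.

```python
import pandas as pd

# Hypothetical miniature code tree mirroring the `parent`/`name` columns above.
toy_codes = pd.DataFrame({
    "parent": ["root", "actions", "import", "fetch", "fetch"],
    "name": ["actions", "import", "fetch", "pull tables out of pdf", "api request"],
})

def descendants(tree, node):
    """Return `node` followed by every code beneath it, depth first."""
    found = [node]
    for child in tree.loc[tree["parent"] == node, "name"]:
        found.extend(descendants(tree, child))
    return found

branch = descendants(toy_codes, "import")
print(branch)
```

Leaf codes have no rows listing them as a parent, so the recursion stops there without an explicit base case.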


Now group the code-analysis pairs in the `codes` data frame by code and count the number of analyses per code. The frequency column, `freq`, is the number of analyses that contain at least one instance of that code.

In [3]:
codeCounts = codes.groupby(['name', 'level', 'is_leaf']) \
    .analysis.sum() \
    .to_frame('freq') \
    .reset_index()

codeCounts = pd.merge(codeCounts, codes[['name', 'desc']], how='left', on='name')

# Share of analyses that contain each code; 50 is assumed to be the total number of analyses
codeCounts['coverage'] = codeCounts.freq / 50

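def recurse(root, pNode, func):
    # Depth-first walk of the code tree; on the first call (root=None), pNode becomes the branch root
    root = root if root is not None else pNode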
    func(root, pNode)
    children = codes.loc[codes.parent == pNode]
    for child in children.name:
        recurse(root, child, func)

def markBranch(root, pnode):
    codeCounts.loc[codeCounts.name == pnode, 'is_' + root] = True

for branch in ['actions', 'strategies', 'observations', 'analysis']:
    codeCounts['is_' + branch] = False
    recurse(None, branch, markBranch) 

# Flag codes that are the direct parent of at least one leaf code
codeCounts['is_leaf_parent'] = False
for parent in codes[codes.is_leaf].parent.unique():
    codeCounts.loc[codeCounts.name == parent, 'is_leaf_parent'] = True

# Bin coverage into discrete values
bins = ['always', 'frequently', 'sometimes', 'seldom']
codeCounts['commonness'] = pd.cut(codeCounts.coverage, len(bins), labels=bins[::-1])
    
# Peek at results
codeCounts.head()

Unnamed: 0,name,level,is_leaf,freq,desc,coverage,is_actions,is_strategies,is_observations,is_analysis,is_leaf_parent,commonness
0,actions,0,False,50,Codes that describe actions the journalist has...,1.0,True,False,False,False,True,always
1,adjust for inflation,3,True,3,TK,0.06,True,False,False,False,False,seldom
2,adjust for season,3,True,1,Making seasonal adjustments to the data to det...,0.02,True,False,False,False,False,seldom
3,aggregate the forest from the trees,2,True,2,Data of individual observations is aggregated ...,0.04,False,False,True,False,False,seldom
4,analysis,1,False,40,Kinds of analysis data journalists need to wra...,0.8,False,False,True,True,True,always
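
When `pd.cut` is given an integer number of bins, it divides the observed range of values into equal-width intervals, so the `commonness` labels depend on the minimum and maximum coverage in the data. The sketch below uses made-up coverage values, assumed to span 0.02 to 1.0, to show where the cut points fall.

```python
import pandas as pd

bins = ["always", "frequently", "sometimes", "seldom"]

# Hypothetical coverage values, assumed to span 0.02 to 1.0.
coverage = pd.Series([0.02, 0.06, 0.30, 0.60, 0.80, 1.0])

# An integer bin count splits [min, max] into equal-width intervals;
# the labels are reversed so the lowest interval is "seldom".
commonness = pd.cut(coverage, len(bins), labels=bins[::-1])

# retbins=True also returns the interval edges that pd.cut computed.
edges = pd.cut(coverage, len(bins), retbins=True)[1]

print(commonness.tolist())
print(edges.round(3))
```

Under these assumptions each label covers a span of roughly 0.245, so a code would need coverage above about 0.755 to be labeled `always`.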


## Code-analysis-notebook network

This notebook also uses the data exported by `codebook.ipynb` to `data/code-analysis-network.csv`, which maps codes to notebooks to analyses.

In [4]:
analysisCodes = pd.read_csv('data/code-analysis-network.csv')
analysisCodes.head()

Unnamed: 0,name,analysis,notebook,level,is_leaf
0,pull tables out of pdf,2018-voter-registration,01_processing.ipynb,3.0,True
1,api request,california-h2a-visas-analysis,03_geocode.ipynb,3.0,True
2,query database,201901-achievementgap,build_data.R,3.0,True
3,scrape web for data,us-weather-history,wunderground_scraper.py,3.0,True
4,scrape web for data,long-term-care-db,mung-3-25-scrape,3.0,True
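
Because each row pairs a code with one notebook inside an analysis, the same analysis can appear several times for a single code. The sketch below, using made-up rows in the same shape, shows one way to count each analysis only once per code, which matches the definition of `freq` given earlier. Whether `codebook.ipynb` computes the count this way is an assumption, and the `another-notebook` entry is hypothetical.

```python
import pandas as pd

# Hypothetical rows in the shape of data/code-analysis-network.csv.
toy_network = pd.DataFrame({
    "name": [
        "scrape web for data",
        "scrape web for data",
        "scrape web for data",
        "api request",
    ],
    "analysis": [
        "us-weather-history",
        "long-term-care-db",
        "long-term-care-db",
        "california-h2a-visas-analysis",
    ],
    "notebook": [
        "wunderground_scraper.py",
        "mung-3-25-scrape",
        "another-notebook",  # hypothetical second notebook in the same analysis
        "03_geocode.ipynb",
    ],
})

# Count each analysis once per code, however many notebooks repeat the code.
freq = toy_network.groupby("name")["analysis"].nunique()
print(freq.to_dict())
```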


# Codes by Level

In [5]:
def displayTable(df):
    header = "| " + " | ".join(df.columns) + " |"
    breakline = "|" + "---|" * len(df.columns)
    rowTmpl = "| " + " | ".join(['{' + col + '}' for col in df.columns]) + " |"
    rows = '\n'.join([ rowTmpl.format(**row) for i, row in df.iterrows() ])
    displayMarkdown(header + '\n' + breakline + '\n' + rows)

def rollupTable(is_actions=True):
    # Note: range() stops one short of the deepest level, so that level gets no table
    for i in range(1, max(codeCounts.level)):
        codeLevel = codeCounts[(codeCounts.level == i) & (codeCounts.is_actions == is_actions)] \
            .sort_values(['commonness', 'coverage'], ascending=False)
        displayMarkdown("### Level {} {} Codes".format(i, 'Action' if is_actions else 'Observation'))
        displayTable(codeLevel[['name', 'desc', 'commonness']])
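
For reference, the string that a table builder of this kind produces can be inspected without rendering it. The sketch below is a hypothetical variant, `to_markdown_table`, that returns the Markdown text instead of displaying it and joins cell values directly rather than going through `str.format`, which may be more forgiving when column names contain characters such as `.` or `[`.

```python
import pandas as pd

def to_markdown_table(df):
    """Build a Markdown table string from a DataFrame (hypothetical helper)."""
    header = "| " + " | ".join(df.columns) + " |"
    breakline = "|" + "---|" * len(df.columns)
    rows = [
        "| " + " | ".join(str(value) for value in row) + " |"
        for row in df.itertuples(index=False)
    ]
    return "\n".join([header, breakline] + rows)

# Hypothetical two-row input using codes from this notebook.
toy = pd.DataFrame({
    "name": ["import", "export"],
    "commonness": ["always", "sometimes"],
})

table = to_markdown_table(toy)
print(table)
```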

## Action Codes

In [6]:
rollupTable()

### Level 1 Action Codes

| name | desc | commonness |
|---|---|---|
| import | How raw data is introduced into the programming/wrangling environment | always |
| clean | Operations to correct data that might be considered "dirty" | always |
| modify | *Modifying* the data consists of making minor changes without *integrating* other tables. | frequently |
| integrate | Combining data residing in different tables into one table. | frequently |
| recalculate | Expanding the table by calculating new columns based on existing columns without *integrating* other tables. | frequently |
| transform | Operations that transform a table into an aggregated, lower-resolution view of the original table. | frequently |
| check sanity | Operations that confirm the effect of a previous wrangling operation. | sometimes |
| display dataset | Different ways to check in on the state of the dataset during wrangling. | sometimes |
| export | Ways in which journalists export the results of their data wrangling. | sometimes |

### Level 2 Action Codes

| name | desc | commonness |
|---|---|---|
| load | Raw data resides on the local disk and is *loaded* into the environment, includes these file formats: .csv, .xlsx, .fec, .shp, .RData, etc. | frequently |
| format | Operations that modify the appearance or style of table values | frequently |
| formulate performance metric | Perform a calculation that is later used to compare different entities or the same entity over time. A recurring theme between many of these notebooks is to compare different entities, such as political parties, by a common, quantitative metric, such as percentage of all newly registered voters. | frequently |
| trim fat | Remove complete sections of data that just aren't relevant | frequently |
| supplement | Supplementation is characterized by integration operations that essentially add columns to existing data | sometimes |
| check results | Operations that output some visual representation of the table | sometimes |
| summarize | Codes that aggregate and calculate tables to get a more coarse view of the data. | sometimes |
| visualize data | Employing any kind of data visualization, including a table | sometimes |
| reshape | Operations that fundamentally change the table's structure, but do not perform any kind of summarization calculation. *Constructing a pivot table* often involves a *spread-like* operation when defining what values to use as columns in the new table. The difference with *reshaping* is that sometimes the journalist may not summarize the reshaped table. | sometimes |
| column reshaping | Codes concerning either separating one column into more than one or combining more than one column into one. | sometimes |
| union tables | TK | sometimes |
| rank data | Operations that encode semantic meaning about the data with table index. | sometimes |
| fix values | Individual values have errors that must be corrected. | sometimes |
| create | Data is created inside the programming environment | seldom |
| consolidate values of a single column | Codes that map a set of entities to a smaller set of entities | seldom |
| count the data | Operations that count things in the data set | seldom |
| deduplicate | Remove rows that contain two or more of the same observation, with identical values in all, one, or zero columns. | seldom |
| generate keys | Operations that create "key" columns. | seldom |
| fetch | Data is retrieved from some external sources to the programming environment | seldom |
| format table display | Operations that adjust the table display, such as how many decimals to round floats to | seldom |
| detrend | "filter out the secular effect in order to see what is going on specifically with the phenomenon you are investigating," Philip Meyer in *Precision Journalism*. | seldom |
| run a test | Operations that output a clear pass or fail value, often implemented by counting things | seldom |
| create flag | Flags are boolean expressions computed on column values and used in filtering and grouping | seldom |
| encode table identification in row | When some way of identifying the table is encoded as a separate column in each row. Common identification methods include the name of the corresponding file, an arbitrary table name, or boolean value | seldom |
| inner join tables | TK | seldom |
| remove incomplete data | Drop row if value(s) are incomplete, usually denoted as NA. | seldom |
| cartesian product | TK | seldom |
| describe statistically | Generates any kind of descriptive statistics of the dataset's central tendency, dispersion and distribution shape | seldom |
| pad column values | Adding either character prefixes or suffixes consistently to every row within a column | seldom |
| scale values | Operations that apply some mathematical operation to columns of quantitative data. This code is different from the codes under **Formulate performance metric** because this is closer to cleaning. | seldom |
| network-ify the data | When the data is inherently a graph, encode table columns and values to represent this structure tabularly | seldom |
| self join table | Join a table with itself | seldom |

### Level 3 Action Codes

| name | desc | commonness |
|---|---|---|
| format schema | Operations that modify anything except table values | frequently |
| winnow rows | Simply put, these operations remove table rows. | sometimes |
| subset columns | Removing columns from a table by specifying which ones to remove or keep. | sometimes |
| outer join tables | A join that returns rows with no corresponding match in the table being joined to, e.g. left or right joins. | sometimes |
| peek at data | Display the first *n* rows and all columns of the table | sometimes |
| figure a rate | Convert numbers to a normalized rate to "provide a comparison against some easily recognized baseline" Philip Meyer in *Precision Journalism* | sometimes |
| sort table | When rank is implicitly assigned by rearranging row position in the table. | sometimes |
| extract column values | Separating a column containing multiple rows | seldom |
| group by single column | When a table is grouped by a single column. | seldom |
| group by multiple columns | When a table is grouped by multiple columns, creating hierarchy. | seldom |
| cross tabulate | such as with a pivot table/crosstab | seldom |
| calculate spread | Calculating the difference between two values or rates | seldom |
| calculate change over time | Such as the percentage difference over time. | seldom |
| combine entities | Consolidating a range of categorical values into a smaller set of categorical values | seldom |
| create frequency table | Count the frequency of non-quantitative variables within a column | seldom |
| format values | Operations that change value appearance, e.g. changing case, specifying date format, rounding floats. | seldom |
| remove value characters | When characters inside a value are removed, such as periods, commas, dollar signs, etc | seldom |
| calculate central tendency | Measuring what a typical value is in the data, e.g. mean, median. | seldom |
| combine columns | Combining two columns into one, either through concatenation, addition, etc... | seldom |
| construct table manually | Using tables where column names and table values were either copy-and-pasted or entered manually. | seldom |
| count number of rows | Printing out the total number of rows in a table | seldom |
| create soft key | Keys used in matching without a guarantee of uniqueness, such as combinations of names and addresses | seldom |
| full join tables | Combine all rows and all columns of the two tables. a.k.a full outer join | seldom |
| gather table | Collapses table into key value pairs. | seldom |
| spread table | Expand two columns of key value pairs into multiple columns. | seldom |
| use lookup table | Using a table with two columns to map from one value to another. | seldom |
| create a unique key | The journalist creates a key that is actually unique in the table | seldom |
| inspect table schema | Check the data types of columns | seldom |
| replace na values | Raw data may contain incomplete table values (denoted as NA) or empty values (denoted as NULL) | seldom |
| rolling window calculation | Performs rolling-window aggregation | seldom |
| adjust for inflation | TK | seldom |
| count unique values | Report the number of unique values in one or more columns | seldom |
| resolve entities | Classic entity resolution: a column of categorical values has different names for the same entity. | seldom |
| scrape web for data | Systematically parsing HTML web pages for relevant data | seldom |
| standardize values | Measuring deviation from some definition of "normal", e.g. Z-scores | seldom |
| backfill missing data | Create data observations where there are missing entries. | seldom |
| domain-specific performance metric | A domain specific metric, such as the Cube Root Law for legislatures. | seldom |
| fix data errors manually | Instances where individual row-column values are changed manually. | seldom |
| generate data computationally | Using tables populated programmatically. | seldom |
| get extreme values | Calculate the highest or lowest value(s) | seldom |
| join aggregate | Extend the table (columnwise) with aggregate values, hence the number of rows stays constant but columns increase | seldom |
| rollup | Rename entity to the name of its parent (for hierarchical data) | seldom |
| strip whitespace | Removing extra whitespace characters from entity name | seldom |
| test for equality | Test if two data structures are exactly the same, e.g. two data frames | seldom |
| adjust for season | Making seasonal adjustments to the data to detrend for seasonal trends. | seldom |
| api request | Make a request to a web API, such as addresses translation service for lat-long coordinates from Bing. | seldom |
| assign ranks | When a column of numerical ranks is explicitly assigned to rows in the table. | seldom |
| bin values | Consolidating a range of quantitative data into ordinal data | seldom |
| check for nas | See if any rows have NA values. | seldom |
| compute index number | TK | seldom |
| concat parallel tables | When columns from multiple, parallel tables are concatenated together to form a new table. | seldom |
| copy table schema | A table is copied without any values but table column names and type identical, or nearly identical, to another table | seldom |
| create edge | A column with values that define a relationship to another row, which is not necessarily in a different table | seldom |
| define edge weights | Columns that define edge weights | seldom |
| display rows with missing values | E.g. filtering rows with a NA value in a particular column | seldom |
| fix mixed data types | Sometimes a column will contain a mix of two data types, e.g. integers and strings. | seldom |
| pull tables out of pdf | Using a table extraction tool, such as Tabula, to parse tables inside PDF documents. | seldom |
| query database | Data is imported through a database query | seldom |
| report rows with column number discrepancies | Finds if a row has a different number of columns than the header row | seldom |
| test different computations for equality | Test the results of a calculation against different methods/packages. The Upshot did this with variance. | seldom |
| validate data quality with domain-specific rules | Such as if the average temperature is higher than the maximum recorded temperature | seldom |

## Observation Codes

In [7]:
rollupTable(False)

### Level 1 Observation Codes

| name | desc | commonness |
|---|---|---|
| analysis | Kinds of analysis data journalists need to wrangle data to perform. | always |
| wrangling purpose | Why does this data need to be wrangled? | frequently |
| data acquisition | How the data was acquired by journalists | sometimes |
| pain points | Areas where journalists seem/could be frustrated in the wrangling process. | sometimes |
| workflow building | Codes pertaining to how the wrangling workflow is built. | sometimes |
| strategies | General strategies journalists employ when wrangling data. | sometimes |

### Level 2 Observation Codes

| name | desc | commonness |
|---|---|---|
| input for downstream applications | Output from wrangling will be input into some other program | sometimes |
| compare different groups along a common metric | The end analysis is just comparing different groups by a common metric. | sometimes |
| repetitive code | Instances where code is repetitively copied and pasted. | seldom |
| show trend over time | Analysis consists of showing how values change over time | seldom |
| think computationally | Codes that demonstrate computational thinking on the part of the journalist. | seldom |
| table splitting | Tables may be divided, partitioned, or otherwise split into multiple tables to accomplish a transformation goal. | seldom |
| use open government data | Data publicly available on open data portals, such as data.gov | seldom |
| creating new datasets | nan | seldom |
| answer a question | Analysis consists of using data to answer a specific question | seldom |
| calculate a statistic | Calculate a single value from a dataset, such as number of records. | seldom |
| tables evolve | Data and objects are destroyed during the wrangling process. | seldom |
| annotate workflow | Adding comments or notes in Markdown that explain what the journalist is doing. | seldom |
| collect raw data | Using first-hand observations or logs as data. | seldom |
| interpret statistical/ml model | Analyze features from a model such as linear regression or classification trees | seldom |
| outlier detection | Finding extreme cases or outliers in the data | seldom |
| post-merge clean up | Pain points that come from the result of merging two datasets together | seldom |
| tolerate dirty data | Analysis continues despite clear data quality issues. | seldom |
| explore dynamic network flow | (Network analysis) explore the flow between different nodes in the graph, e.g. migration between cities. | seldom |
| toggle step on and off | Some wrangling steps were not always run. Toggling off is often accomplished by commenting out code. | seldom |
| aggregate the forest from the trees | Data of individual observations is aggregated in an attempt to find some meaningful structure or patterns | seldom |
| data is precious | Data and objects are never actually lost in the programming environment. | seldom |
| post-aggregation clean up | Pain points that come from the result of grouping a table. | seldom |
| remove erroneous data | There are errors in the data that need to be removed | seldom |
| use another news orgs data | A dataset previously published by another news organization | seldom |
| use non-public, provided data | nan | seldom |
| use public data | Includes open-source datasets, tables on Wikipedia, etc.. | seldom |
| data too large for repo | Raw data cannot be included in SCM because files are too large | seldom |
| encode redundant information | When data that already exists in the table is recoded into the table. | seldom |
| explain variance | This can be done via PCA | seldom |
| find nearest neighbours in the network | (Network analysis) Find the closest neighbours for all points | seldom |
| fix incorrect calculation | Calculations in the data are incorrect and the journalist must recalculate them | seldom |
| freedom of information data | Data that was obtained via FOI/FOIA requests. | seldom |
| identify clusters or lack of clusters | nan | seldom |
| identify extreme values | nan | seldom |
| make an incorrect conclusion | Instances where the journalist has made an incorrect conclusion about the data. | seldom |
| set data confidence threshold | Removes rows where a quantitative value is less than, greater than, or not equal to a numeric value. | seldom |
| use academic data | nan | seldom |
| use data from colleague | A dataset was provided by another journalist. | seldom |
| use previously cleaned data | Data that originated from a colleague. | seldom |

### Level 3 Observation Codes

| name | desc | commonness |
|---|---|---|
| wrangle data for graphics | Data need to be formatted in order to be visualized in an article, including tables. | sometimes |
| architect repeating process | Instances where journalists employed a loop. | seldom |
| architect a subroutine | A set of instructions grouped together to be performed multiple times. | seldom |
| split, compute, and merge | First, the journalist partitions a single data frame into multiple, separate data frames. Then, often identical computations are run on all the data frames. Finally, the multiple data frames are consolidated into one data frame again. | seldom |
| refine table | Table refinement refers to when a table is subset *in place*; a new object is not created in the environment. | seldom |
| combine drifting datasets | Reconcile differences in periodically published datasets that have superficially changed over time, such as schema differences or entity names, to consolidate more than one dataset. | seldom |
| fill in na values after an outer join | As outer joins do not drop non-matching rows, those values have NA | seldom |
| split and compute | One table is split into two or more and identical computations are applied to each table. | seldom |
| combine data and geography | Pairing data with GIS info. | seldom |
| combine seemingly disparate datasets | When a notebook largely constitutes combining seemingly unrelated datasets. | seldom |
| preserve existing values | The output of any column calculation is assigned to a new column | seldom |
| create child table | A child table is a subset of the parent table declared as a new object in the environment. | seldom |
| data loss from aggregation | When table columns are lost because they were dropped from the resulting table due to not being relevant in aggregation. | seldom |
| resort after merge | When a sort has to be redone because a merge ruined the pre-merge order. | seldom |
| silently dropping values after groupby | Values other than those being grouped and calculated upon are lost in a group by operation | seldom |
| temporary joining column | When a key for joining two tables is created and destroyed immediately after the join. | seldom |
| value replacement | The output of any column calculation is reassigned to an existing column. | seldom |
| wrangle data for model | Data is being wrangled in order to create a model, whether the main point of the piece is for prediction or classification | seldom |