The `codebook.ipynb` notebook has three purposes. First, it parses the taxonomy of open and axial codes in `code_tree.yaml`. Second, it parses all the open codes in PDF printouts of computational notebooks located in the `notebooks/` directory. Third, it calculates the code-analysis frequency for each code: the number of unique analyses that contain at least one instance of each open and axial code. It combines all this data in the `codes` data frame and exports it to `data/codes.csv` for further analysis.

In [1]:
import re, yaml
import pandas as pd
from lib.util import getCodes, displayMarkdown

%autosave 0

pd.set_option("display.max_rows", None)  # Don't truncate rows when printing a Pandas DataFrame instance

Autosave disabled


# Parse code tree

Recursively traverse the YAML code tree to transform the data from a tree into tabular form. The node called "root" does not actually exist in the code tree; it is a placeholder parent assigned to the top-level groups.

In [2]:
with open('code_tree.yaml', 'r') as f:
    code_yaml = yaml.safe_load(f)
    
codes = []

def preTreeWalk(pNode, node, func, lvl=0):
    """ A recursive, pre-order traversal of the code groups YAML structure"""
    leaf = 'sub' not in node.keys()
    func(pNode, node, lvl, leaf)
    if not leaf:
        for child in node['sub']:
            preTreeWalk(node, child, func, lvl + 1)

parseYaml = lambda parent, child, lvl, leaf: codes.append({
    'parent': parent['name'].lower(),
    'name': child['name'].lower(),
    'desc': child['desc'],
    'level': lvl,
    'is_leaf': leaf
})

for grp in code_yaml:
    preTreeWalk({'name': 'root'}, grp, parseYaml)

codes = pd.DataFrame(codes)[['parent', 'name', 'desc', 'level', 'is_leaf']]
codes.head()

Unnamed: 0,parent,name,desc,level,is_leaf
0,root,actions,Codes that describe actions the journalist has...,0,False
1,actions,import,How raw data is introduced into the programmin...,1,False
2,import,fetch,Data is retrieved from some external sources t...,2,False
3,fetch,pull tables out of pdf,"Using a table extraction tool, such as Tabula,...",3,True
4,fetch,geocode addresses,Translate addresses to latitude-longitude coor...,3,True
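
The tree-to-table step can be illustrated on a toy tree. This is a minimal sketch, assuming each node is a dictionary with `name`, `desc`, and an optional `sub` list, as the cell above implies; the `toy_tree` contents and the `flatten` helper are hypothetical and not part of the notebook.

```python
# Hypothetical excerpt shaped like code_tree.yaml (assumed keys: name, desc, sub)
toy_tree = [
    {
        "name": "Actions",
        "desc": "Top-level axial code",
        "sub": [
            {
                "name": "Import",
                "desc": "Axial code",
                "sub": [{"name": "Load", "desc": "Open code (leaf)"}],
            },
            {"name": "Export", "desc": "Open code (leaf)"},
        ],
    }
]


def flatten(parent_name, node, level=0):
    """Pre-order walk: yield the node's own row first, then its children's rows."""
    is_leaf = "sub" not in node
    yield {
        "parent": parent_name.lower(),
        "name": node["name"].lower(),
        "level": level,
        "is_leaf": is_leaf,
    }
    for child in node.get("sub", []):
        yield from flatten(node["name"], child, level + 1)


# "root" is only a placeholder parent for the top-level groups
rows = [row for group in toy_tree for row in flatten("root", group)]
for row in rows:
    print(row)
```

Because the walk is pre-order, every parent row precedes its children, which is what lets the table be rendered later as a nested list in row order.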


## Display code tree

All open codes, their descriptions, and the corresponding axial codes are stored in the `code_tree.yaml` file. Although this file is the master copy for all open and axial codes, the raw YAML can be difficult to read, so it can be helpful to render the tree as Markdown.

In [3]:
codeMarkdownTree = [ '{}* **{}**: {}\n'.format('\t' * c['level'], c['name'].title(), c['desc']) for i, c in codes.iterrows() ]

displayMarkdown("""
### Codes\n {}
""".format('\n'.join(codeMarkdownTree)))


### Codes
 * **Actions**: Codes that describe actions the journalist has taken to wrangle data for further analysis

	* **Import**: How raw data is introduced into the programming/wrangling environment

		* **Fetch**: Data is retrieved from some external sources to the programming environment

			* **Pull Tables Out Of Pdf**: Using a table extraction tool, such as Tabula, to parse tables inside PDF documents.

			* **Geocode Addresses**: Translate addresses to latitude-longitude coordinates through web service, such as those from Bing.

			* **Query Database**: Data is imported through a database query

			* **Scrape Web For Data**: Systematically parsing HTML web pages for relevant data

		* **Create**: Data is created inside the programming environment

			* **Construct Table Manually**: Using tables where column names and table values were either copy-and-pasted or entered manually.

			* **Generate Data Computationally**: Using tables populated programmatically.

			* **Copy Table Schema**: A table is copied without any values but table column names and type identical, or nearly identical, to another table

			* **Backfill Missing Data**: Create data observations where there are missing entries.

		* **Load**: Raw data resides on the local disk and is *loaded* into the environment, includes these file formats: .csv, .xlsx, .fec, .shp, .RData, etc.

	* **Amend**: *Amending* the data constitutes making minor changes without *integrating* other tables.

		* **Detrend**: "filter out the secular effect in order to see what is going on specifically with the phenomenon you are investigating," Philip Meyer in *Precision Journalism*.

			* **Adjust For Inflation**: TK

			* **Compute Index Number**: TK

			* **Adjust For Season**: Making seasonal adjustments to the data to detrend for seasonal trends.

		* **Encode Table Identification In Row**: When some way of identifying the table is encoded as a separate column in each row. Common identification methods include the name of the corresponding file, an arbitrary table name, or boolean value

		* **Network-Ify The Data**: When the data is inherently a graph, encode table columns and values to represent this structure tabularly

			* **Create Edge**: A column with values that define a relationship to another row, which is not necessarily in a different table

			* **Define Edge Weights**: Columns that define edge weights

		* **Formulate Performance Metric**: Codes in this category specify a calculation that is later used to compare different entities or the same entity over time. A recurring theme between many of these notebooks is to compare different entities, such as political parties, by a common, quantitative metric, such as percentage of all newly registered voters.

			* **Standardize Values**: Measuring deviation from some definition of "normal", e.g. Z-scores

			* **Figure A Rate**: Convert numbers to a normalized rate to "provide a comparison against some easily recognized baseline" Philip Meyer in *Precision Journalism*

			* **Calculate Central Tendency**: Measuring what a typical value is in the data, e.g. mean, median.

			* **Calculate Change Over Time**: Such as the percentage difference over time.

			* **Calculate Spread**: Calculating the difference between two values or rates

			* **Domain-Specific Performance Metric**: A domain specific metric, such as the Cube Root Law for legislatures.

		* **Generate Keys**: Operations that create "key" columns.

			* **Create Soft Key**: Keys used in matching without a guarantee of uniqueness, such as combinations of names and addresses

			* **Create A Unique Key**: The journalist creates a key that is actually unique in the table

			* **Concatenate Columns Into Key**: Combine two string columns into one to create a key, e.g. combine city and state.

			* **Designate Column As Primary Key**: Designating a column as the unique identifier for all rows in the table.

		* **Rank Data**: Operations that encode semantic meaning about the data with table index.

			* **Assign Ranks**: When a column of numerical ranks is explicitly assigned to rows in the table.

			* **Sort Table**: When rank is implicitly assigned by rearranging row position in the table.

		* **Create Flag**: Flags are boolean expressions computed on column values and used in filtering and grouping

	* **Clean**: Operations to correct erroneous or remove otherwise unwanted rows and values from the table.

		* **Trim Fat**: Winnow down data that is not relevant to analysis.

			* **Subset Columns**: Removing columns from a table by specifying which ones to remove or keep.

			* **Winnow Rows**: Simply put, these operations remove table rows.

				* **Trim By Date Range**: Removing rows that are inside or outside a specific date range. This can be a method for detrending data by adjusting for season.

				* **Trim By Geographic Area**: Remove rows that are inside or outside the geographic area.

				* **Trim By Quantitative Threshold**: Remove rows that are above, below, equal to, or not equal to a numeric value.

				* **Trim By Contains Value**: Remove rows that do or do not contain specific values or types of values.

		* **Remove Incomplete Data**: Drop row if value(s) are incomplete, usually denoted as NA.

		* **Deduplicate**: Remove rows from the table that contain two or more of the same "observation." Duplicates may constitute rows with identical values in all, one, or zero columns.

		* **Edit**: Operations that modify table values

			* **Edit Table Values**: Directly editing values within a column

				* **Fix Data Errors Manually**: Instances where individual row-column values are changed by a journalist.

				* **Fix Mixed Data Types**: Sometimes a column will mix two data types, e.g. integers and strings.

				* **Remove Value Characters**: When characters inside a value are removed, such as periods, commas, dollar signs, etc

				* **Replace Na Values**: Raw data may contain incomplete table values (denoted as NA) or empty values (denoted as NULL)

			* **Map Column Values**: Edit operations that change all values within a column

				* **Translate Entity Names**: Performing a one-to-one mapping between values.

					* **Translate Entity Names Manually**: Manually specify the mapping between individual

					* **Pad Column Values**: Adding either character prefixes or suffixes consistently to every row within a column

					* **Strip Whitespace**: Removing extra whitespace characters from entity name

					* **Scale Values**: Operations that apply some mathematical operation to columns of quantitative data. This code is different from the codes under **Formulate performance metric** because it is closer to cleaning.

				* **Combine Values**: Codes that map a set of entities to a smaller set of entities

					* **Bin Values**: Classifying quantitative data into ordinal data.

					* **Combine Entities**: Combining values in categorical data

					* **Resolve Entities**: Classic entity resolution: a column of categorical values has different names for the same entity.

		* **Format**: Operations that modify the appearance or style of table values

			* **Format Values**: Operations that change value appearance, e.g. changing case, specifying date format, rounding floats.

			* **Correct Bad Formatting**: Changes that correct ill-formed data such as HTML entities and new lines (\n)

			* **Format Schema**: Operations that modify anything except table values

				* **Canonicalize Column Names**: Operations that change column names

				* **Change Column Data Type**: For example, changing a column of values from strings to integers

			* **Sort Table Rows**: Sorting a table in a way that does not rank rows, such as by a unique identifier

		* **Separate**: Mapping one column into more than one because multiple dimensions of the dataset packed into one column

			* **Extract Value Component**: A single row-column value may have multiple bits of info, e.g. dates and addresses, and the journalist extracts one component from that value, e.g. year and street, respectively

			* **Extract Property From Datetime**: Such as extracting the day of the month, year, etc., from a datetime column

			* **Slice Column Values**: Extracting the relevant column values by character position, e.g. the first five digits of a zip code

			* **Split Column On Delimiter**: Separate data dimensions by a common character, e.g. lat-long coordinates separated by a comma

			* **Get Unique Values**: TK

		* **Combine Columns**: Combining two columns into one

	* **Integrate**: Combining data residing in different tables into one table.

		* **Union Tables**: TK

		* **Inner Join Tables**: TK

		* **Supplement**: Supplementation is characterized by integration operations that essentially add columns to existing data

			* **Outer Join Tables**: A join that returns rows with no corresponding match in the table being joined to, e.g. left or right joins.

			* **Full Join Tables**: Combine all rows and all columns of the two tables. a.k.a full outer join

			* **Concat Parallel Tables**: When columns from multiple, parallel tables are concatenated together to form a new table.

			* **Use Lookup Table**: Using a table with two columns to map from one value to another.

		* **Cartesian Product**: TK

		* **Self Join Table**: Join a table with itself

	* **Transform**: Operations that transform a table into an aggregated, lower-resolution view of the original table.

		* **Summarize**: Codes that aggregate and calculate tables to get a more coarse view of the data.

			* **Join Aggregate**: "extends the input data objects with aggregate values in a new field" - Vega-Lite Join Aggregate docs.

			* **Rollup**: Rename entity to the name of its parent (for hierarchical data)

			* **Aggregate And Calculate**: When the data is grouped by one non-quantitative value and some calculation (sum, count, count unique) is applied to a different quantitative value

				* **Group By Single Column**: When a table is grouped by a single column.

				* **Group By Multiple Columns**: When a table is grouped by multiple columns, creating hierarchy.

				* **Rolling Window Calculation**: Performs rolling-window aggregation

				* **Sum Along Dimension**: Calculate the sum of all values within a row or column

			* **Create Frequency Table**: Count the frequency of non-quantitative variables within a column

			* **Count Value Frequency**: Count the frequency of categorical variables within a column

		* **Calculate**: These are within-column calculations that often, but not always, immediately follow an *aggregation* operation.

			* **Get Extreme Values**: Calculate the highest or lowest value(s)

			* **Count Unique Values In Column**: Produces a scalar with unique values in the column.

		* **Reshape**: Operations fundamentally change the table's structure, but do not perform any kind of summarization calculation. *Constructing a pivot table* often involves a *spread-like* operation when defining what values to use as columns in the new table. The difference with *reshaping* is that sometimes the journalist may not summarize the reshaped table.

			* **Spread Table**: Expand two columns of key value pairs into multiple columns.

			* **Gather Table**: Collapses table into key value pairs.

			* **Cross Tabulate**: such as with a pivot table/crosstab

	* **Display Dataset**: Different ways to check in on the state of the dataset during wrangling.

		* **Format Table Display**: Operations that adjust the table display, such as how many decimals to round floats

		* **Visualize Data**: Employing any kind of data visualization, including a table

		* **Describe Statistically**: Generates any kind of descriptive statistics of the dataset's central tendency, dispersion and distribution shape

	* **Check Sanity**: Operations that confirm the effect of a previous wrangling operation.

		* **Run A Test**: Operations output a clear pass or fail value, often implemented by counting things

			* **Report Rows With Column Number Discrepancies**: Finds if a row has a different number of columns than the header row

			* **Test For Equality**: Test if two data structures are exactly the same, e.g. two data frames

			* **Test Different Computations For Equality**: Test the results of a calculation against different methods/packages. The Upshot did this with variance.

			* **Validate Data Quality With Domain-Specific Rules**: Such as if the average temperature is higher than the maximum recorded temperature

		* **Check Results**: Operations that output some visual representation of the table

			* **Check Results Of Previous Operation**: 

			* **Peek At Data**: Display the first *n* rows and all columns of the table

			* **Inspect Table Schema**: Check the data types of columns

			* **Display Rows With Missing Values**: E.g. filtering rows with a NA value in a particular column

			* **Check For Nas**: See if any rows have NA values.

			* **Count Number Of Rows**: Printing out the total number of rows in a table

	* **Export**: Ways in which journalists export the results of their data wrangling.

* **Observations**: These codes cover observations from the coder about the wrangling processes, not actions performed by the journalist.

	* **Data Acquisition**: How the data was acquired by journalists

		* **Collect Raw Data**: Using first-hand observations or logs as data.

		* **Use Previously Cleaned Data**: Data that originated from a colleague.

		* **Use Public Data**: Includes open-source datasets, tables on Wikipedia, etc.

		* **Use Academic Data**: 

		* **Use Non-Public, Provided Data**: 

		* **Use Open Government Data**: Data publicly available on open data portals, such as data.gov

		* **Freedom Of Information Data**: Data that was obtained via FOI/FOIA requests.

		* **Use Another News Orgs Data**: A dataset previously published by another news organization

		* **Use Data From Colleague**: A dataset was provided by another journalist.

	* **Workflow Building**: Codes pertaining to how the wrangling workflow is built.

		* **Annotate Workflow**: Adding comments or notes in Markdown that explain what the journalist is doing.

		* **Think Computationally**: Codes that demonstrate computational thinking on the part of the journalist.

			* **Architect A Subroutine**: A set of instructions grouped together to be performed multiple times.

			* **Architect Repeating Process**: Instances where journalists employed a loop.

		* **Toggle Step On And Off**: Some wrangling steps were not always run. Toggling off is often accomplished by commenting out code.

	* **Wrangling Purpose**: Why does this data need to be wrangled?

		* **Input For Downstream Applications**: Output from wrangling will be input into some other program

			* **Wrangle Data For Graphics**: Data need to be formatted in order to be visualized in an article, including tables.

			* **Wrangle Data For Model**: Data is being wrangled in order to create a model, whether the main point of the piece is for prediction or classification

		* **Remove Erroneous Data**: There are errors in the data that need to be removed

		* **Creating New Datasets**: 

			* **Combine Drifting Datasets**: Reconcile difference in periodically published datasets that have superficially changed over time, such as schema differences or entity names, to consolidate more than one dataset.

			* **Combine Seemingly Disparate Datasets**: When a notebook largely constitutes combining seemingly unrelated datasets.

			* **Combine Data And Geography**: Pairing data with GIS info.

		* **Aggregate The Forest From The Trees**: Data of individual observations is aggregated in an attempt to find some meaningful structure or patterns

	* **Analysis**: Kinds of analysis data journalists need to wrangle data to perform.

		* **Interpret Statistical/Ml Model**: Analyze features from a model such as linear regression or classification trees

		* **Compare Different Groups Along A Common Metric**: The end analysis is just comparing different groups by a common metric.

		* **Show Trend Over Time**: Analysis consists of showing how values change over time

		* **Calculate A Statistic**: Calculate a single value from a dataset, such as number of records.

		* **Explain Variance**: This can be done via PCA

		* **Answer A Question**: Analysis consists of using data to answer a specific question

		* **Outlier Detection**: Finding extreme cases or outliers in the data

		* **Find Nearest Neighbours In The Network**: (Network analysis) Find the closest neighbours for all points

		* **Explore Dynamic Network Flow**: (Network analysis) explore the flow between different nodes in the graph, e.g. migration between cities.

	* **Strategies**: General strategies journalists employ when wrangling data.

		* **Tables Evolve**: Data and objects are destroyed during the wrangling process.

			* **Value Replacement**: The output of any column calculation is reassigned to an existing column.

			* **Temporary Joining Column**: When a key for joining two tables is created and destroyed immediately after the join.

			* **Refine Table**: Table refinement refers to when a table is subset *in place*, a new object is not created in the environment.

		* **Data Is Precious**: Data and objects are never actually lost in the programming environment.

			* **Preserve Existing Values**: The output of any column calculation is assigned to a new column

			* **Create Child Table**: A child table is a subset of the parent table declared as a new object in the environment.

		* **Set Data Confidence Threshold**: Removes rows where a quantitative value is less than, greater than, or not equal to a numeric value.

		* **Table Splitting**: Tables may be divided, partitioned, or otherwise split into multiple tables to accomplish a transformation goal.

			* **Split, Compute, And Merge**: First, the journalist partitions a single data frame into multiple, separate data frames. Then, often identical computations are run on all the data frames. Finally, the multiple data frames are consolidated into one data frame again.

			* **Split And Compute**: One table is split into two or more and identical computations are applied to each table.

		* **Tolerate Dirty Data**: Analysis continues despite clear data quality issues.

	* **Pain Points**: Areas where journalists seem/could be frustrated in the wrangling process.

		* **Fix Incorrect Calculation**: Calculations in the data are incorrect and the journalist must recalculate them

		* **Repetitive Code**: Instances where code is repetitively copied and pasted.

		* **Make An Incorrect Conclusion**: Instances where the journalist has made an incorrect conclusion about the data.

		* **Post-Merge Clean Up**: Pain points that come from the result of merging two datasets together

			* **Resort After Merge**: When a sort has to be re-done because a merge ruined the pre-merge order.

			* **Fill In Na Values After An Outer Join**: As outer joins do not drop non-matching rows, those values have NA

		* **Encode Redundant Information**: When data that already exists in the table is recoded into the table.

		* **Post-Aggregation Clean Up**: Pain points that come from the result of grouping a table.

			* **Data Loss From Aggregation**: When table columns are lost because they were dropped from the resulting table due to not being relevant in aggregation.

			* **Silently Dropping Values After Groupby**: Values other than those being grouped and calculated upon are lost in a group by operation

		* **Data Too Large For Repo**: Raw data cannot be included in SCM because files are too large
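
The rendering above can be reproduced on a few rows to show how `level` controls nesting. This is a minimal sketch with hypothetical rows; the `to_bullet` helper is illustrative and only mirrors the format string used in the cell above.

```python
# Hypothetical rows with the same columns the formatter reads from `codes`
rows = [
    {"name": "actions", "desc": "Top-level group", "level": 0},
    {"name": "import", "desc": "Nested group", "level": 1},
    {"name": "pull tables out of pdf", "desc": "Leaf code", "level": 2},
]


def to_bullet(row):
    """One Markdown bullet per row; each tab adds one level of list nesting."""
    indent = "\t" * row["level"]
    return "{}* **{}**: {}".format(indent, row["name"].title(), row["desc"])


bullets = [to_bullet(row) for row in rows]
print("\n".join(bullets))
```

Note that `str.title()` capitalizes the first letter of every word and lowercases the rest, which is why acronyms appear as "Pdf" or "Na" in the rendered tree.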



# Parse coded notebooks

For each computational notebook and script used for wrangling data in each analysis, we created PDF printouts with a `.html.pdf` extension. This extension distinguishes them from any PDFs checked into the repositories by contributors. All of these printouts fit the glob pattern `notebooks/**/**/*.html.pdf`. We open-coded PDF printouts using the comments feature in [Adobe Acrobat DC](https://acrobat.adobe.com/en/acrobat.html). Open codes are extracted from each PDF using some internals of the open-source [pdfannots CLI](https://github.com/0xabu/pdfannots). See the [main function in pdfannots.py](https://github.com/0xabu/pdfannots/blob/6dd8dd29a93a0f5ec55e4b47f0eb27d8088a11a0/pdfannots.py#L469) for more details.

The `codeData` data frame links open codes with the notebooks in which they appear. Warning: this cell may take a while to execute.

In [4]:
%%time
codeData = getCodes()

CPU times: user 56 s, sys: 140 ms, total: 56.2 s
Wall time: 56.6 s
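
To illustrate how the `.html.pdf` extension separates printouts from other PDFs, here is a minimal sketch using the standard-library `glob` module. The directory layout and file names are hypothetical, and whether `getCodes` collects files this way (or with `recursive=True`) is an assumption, since its implementation is not shown here.

```python
import glob
import os
import tempfile

PATTERN = os.path.join("notebooks", "**", "**", "*.html.pdf")

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical layout: notebooks/<article>/<analysis>/<file>
    folder = os.path.join(tmp, "notebooks", "some-article", "some-analysis")
    os.makedirs(folder)
    for filename in ("wrangle.ipynb.html.pdf", "source-report.pdf"):
        open(os.path.join(folder, filename), "w").close()

    # With recursive=True each ** spans zero or more directories, so the
    # doubled ** can report the same file more than once; set() removes repeats.
    matches = sorted(set(glob.glob(os.path.join(tmp, PATTERN), recursive=True)))
    printouts = [os.path.basename(path) for path in matches]

print(printouts)
```

Only the file ending in `.html.pdf` is picked up; the contributor's own `source-report.pdf` is ignored.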


## Sanity check

The cell below ensures that there aren't any codes in the code tree that aren't in the PDF printouts and vice versa. More precisely, it checks that the symmetric difference between the set of open codes in `code_tree.yaml` and the set of unique codes that appear across all PDF printouts (`notebooks/**/**/*.html.pdf`) is the empty set.

In [5]:
# Parse the code YAML for just the open codes (leaves)
leaves = []
def collectLeaves(node, repo):
    """Recursively traverse dictionary tree and collect only the leaf nodes"""
    if 'sub' in node.keys():
        for subnode in node['sub']:
            collectLeaves(subnode, repo)
    else:
        safeCode = node['name'].strip().lower()
        repo.append(safeCode)

for grp in code_yaml:
    collectLeaves(grp, leaves)

# Convert from lists to sets
leaves = set(leaves)
pdf_codes = set(codeData['code'].unique())

# Find any discrepancies
diff = lambda a, b, codes: displayMarkdown('Codes in `{}` but not in `{}`:\n{}\n'.format(a, b, '\n'.join(['* ' + c for c in codes])))

falsePositives = pdf_codes.difference(leaves)  # Codes used in the PDFs but missing from the tree
falseNegatives = leaves.difference(pdf_codes)  # Codes in the tree but never used in the PDFs

if not (falsePositives or falseNegatives):
    # Both sets are empty
    displayMarkdown('<p>All codes have been grouped!</p><img src="https://media.giphy.com/media/XreQmk7ETCak0/giphy.gif"> ')
else:
    # Problems: report each direction of the mismatch separately
    if falsePositives:
        diff('*.html.pdf', 'code_tree.yaml', falsePositives)
    if falseNegatives:
        diff('code_tree.yaml', '*.html.pdf', falseNegatives)

<p>All codes have been grouped!</p><img src="https://media.giphy.com/media/XreQmk7ETCak0/giphy.gif"> 
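
The check reduces to two set differences, one per direction. A minimal sketch with hypothetical code names, chosen only to show what a failing check would report:

```python
# Hypothetical sets: one code is misspelled or renamed in the PDFs,
# and one code in the tree was never applied to any printout
tree_codes = {"geocode addresses", "query database", "strip whitespace"}
pdf_codes = {"geocode addresses", "query database", "calculate normalized values"}

only_in_pdfs = pdf_codes - tree_codes  # used while coding but not in the tree
only_in_tree = tree_codes - pdf_codes  # defined in the tree but never used
mismatches = tree_codes ^ pdf_codes    # symmetric difference: union of both

print(sorted(only_in_pdfs))
print(sorted(only_in_tree))
print(len(mismatches) == 0)
```

The sanity check passes only when `mismatches` is empty, which is equivalent to both one-directional differences being empty.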

### Find notebooks with certain codes

If the extracted codes and the codes in `code_tree.yaml` don't match, we can locate the offending open code by grouping the data by code, analysis, and notebook.

In [6]:
needles = ['calculate normalized values']

codeData[codeData.code.isin([n.lower() for n in needles ])] \
    .groupby(['code', 'analysis', 'notebook']) \
    ['notebook'].count() \
    .to_frame('count')

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,count
code,analysis,notebook,Unnamed: 3_level_1


# Calculating code-analysis frequency

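Because the PDF printouts contain only open codes, we have to do a little data wrangling to figure out how many analyses contain at least one instance of each axial code. The loop below walks up the tree one level at a time, crediting each analysis to the parent of every code found in it.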


In [7]:
codeAnalysisTmp = pd.merge(codes[codes.is_leaf].copy()[['parent', 'name']], codeData[['code', 'analysis']],
                        how='left',
                        left_on='name',
                        right_on='code') \
                 .drop(['code'], axis=1) \
                 .drop_duplicates()

codeAnalysis = codeAnalysisTmp[['name', 'analysis']].copy()

while codeAnalysisTmp.parent.nunique() > 0:
    codeAnalysisTmp = codeAnalysisTmp[['parent', 'analysis']] \
        .rename(columns={'parent': 'name'}) \
        .drop_duplicates()

    codeAnalysisTmp = pd.merge(
        codeAnalysisTmp,
        codes[['parent', 'name']],    
        how = 'left',
        on = 'name')

    codeAnalysis = pd.concat([codeAnalysis, codeAnalysisTmp[['name', 'analysis']]]) \
        .drop_duplicates()

codeAnalysis = codeAnalysis.groupby('name')['analysis'].nunique().to_frame('analysis').reset_index()

displayMarkdown('The minimum analysis count for any code should be 1: {}'.format(1 == min(codeAnalysis.analysis)))

The minimum analysis count for any code should be 1: True
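
The roll-up can be checked by hand on a toy tree. This is a plain-Python sketch of the same idea, not the pandas implementation above; the `parent_of` links and the `observations` pairs are hypothetical.

```python
# Hypothetical child -> parent links, mirroring the `name` and `parent` columns
parent_of = {
    "actions": "root",
    "import": "actions",
    "fetch": "import",
    "geocode addresses": "fetch",
    "query database": "fetch",
    "load": "import",
}

# Hypothetical (open code, analysis) pairs, as would come from the PDF annotations
observations = [
    ("geocode addresses", "analysis-a"),
    ("query database", "analysis-a"),
    ("query database", "analysis-b"),
    ("load", "analysis-c"),
]

analyses_by_code = {}
for code, analysis in observations:
    node = code
    # Credit the analysis to the open code and to each ancestor below "root"
    while node != "root":
        analyses_by_code.setdefault(node, set()).add(analysis)
        node = parent_of[node]

# Sets count each analysis once per code, however many children it came through
frequency = {code: len(found) for code, found in analyses_by_code.items()}
print(frequency)
```

Here `fetch` counts two analyses rather than three, because `analysis-a` reaches it through two different children but is counted once. This sketch stops before the placeholder `root`, which also has no row of its own in `codes`.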

In [8]:
priorSize = codes.shape[0]

codes = pd.merge(codes, codeAnalysis, how='left', on='name')

displayMarkdown('The data frame `codes` differs by {} rows after the join'.format(priorSize - codes.shape[0]))
codes.head()

The data frame `codes` differs by 0 rows after the join

Unnamed: 0,parent,name,desc,level,is_leaf,analysis
0,root,actions,Codes that describe actions the journalist has...,0,False,50
1,actions,import,How raw data is introduced into the programmin...,1,False,39
2,import,fetch,Data is retrieved from some external sources t...,2,False,6
3,fetch,pull tables out of pdf,"Using a table extraction tool, such as Tabula,...",3,True,1
4,fetch,geocode addresses,Translate addresses to latitude-longitude coor...,3,True,1


# Export results

In [9]:
codes.to_csv('data/codes.csv')