# Codebook

This notebook contains a human-readable version of the codebook.

In [1]:
import os, glob, re, subprocess
import pandas as pd
import yaml
from pdfannots import pdfannots
import codecs
from IPython.display import display, Markdown

pd.set_option("display.max_rows", None)  # Don't truncate rows when printing a Pandas DataFrame instance

## Display Codes

All open codes, their descriptions, and the corresponding axial codes are stored in the `code_tree.yaml` file.

In [2]:
with open('code_tree.yaml', 'r') as f:
    code_yaml = yaml.safe_load(f)
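
Judging from how the rest of this notebook reads `code_yaml`, the parsed file is a list of top-level groups, where every node has a `name`, a `desc`, and an optional `sub` list of child nodes. The sketch below mimics that shape with a plain Python structure and walks it in pre-order. The entries, the `example_tree` variable, and the `render` helper are hypothetical and only illustrate the assumed layout.

```python
# Hypothetical miniature of the structure assumed for code_tree.yaml:
# a list of nodes, each with 'name', 'desc', and an optional 'sub' list.
example_tree = [
    {
        'name': 'actions',
        'desc': 'Codes that describe actions.',
        'sub': [
            {
                'name': 'amend',
                'desc': 'Create new columns.',
                'sub': [
                    {'name': 'create flag', 'desc': 'Boolean columns.'},
                ],
            },
            {'name': 'clean', 'desc': 'Remove unwanted rows.'},
        ],
    },
]


def render(node, lvl=0):
    """Return Markdown bullet lines for a node and its descendants (pre-order)."""
    lines = ['{}* **{}**: {}'.format('\t' * lvl, node['name'].title(), node['desc'])]
    for child in node.get('sub', []):  # leaves simply have no 'sub' key
        lines.extend(render(child, lvl + 1))
    return lines


markdown_lines = [line for grp in example_tree for line in render(grp)]
print('\n'.join(markdown_lines))
```

Each level of nesting adds one tab of indentation, so the nesting of the YAML carries over directly to nested Markdown bullets.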

The `code_tree.yaml` file is the master copy for all open and axial codes, but the raw YAML is difficult to read. This snippet therefore renders all the codes in Markdown.

In [3]:
codes = []

def getCodeTree(node, func, lvl=0):
    """ A recursive, pre-order traversal of the code groups YAML structure"""
    func(node['name'], node['desc'], lvl)
    if 'sub' in node.keys():
        for child in node['sub']:
            getCodeTree(child, func, lvl + 1)

parseYaml = lambda k, d, l: codes.append('{}* **{}**: {}\n'.format('\t' * l, k.title(), d))

for grp in code_yaml:
    getCodeTree(grp, parseYaml)

display(Markdown('### Codes\n' + ''.join(codes)))

### Codes
* **Actions**: Codes that describe actions the journalist has taken to wrangle data for further analysis.
	* **Amend**: *Amending* a table constitutes creating new columns in the table without *integrating* other tables.
		* **Detrend**: "filter out the secular effect in order to see what is going on specifically with the phenomenon you are investigating," Philip Meyer in *Precision Journalism*.
			* **Adjust For Inflation**: TK
			* **Adjust For Season**: TK
			* **Compute Index Number**: TK
		* **Encode Table-Level Data**: Adding columns populated with data aggregated from the whole table, such as the frequency of nominal variables.
			* **Encode Table Identification In Row**: When some way of identifying the table is encoded as a separate column in each row. Common identification methods include the name of the corresponding file, an arbitrary table name, or boolean value.
			* **Encode Table Summary Data In Row**: When high-level, aggregated data about the table is encoded in the rows. For example, a table column contains the frequency of nominal values in a different column of the same table.
		* **Formulate Performance Metric**: Codes in this category specify a calculation that is later used to compare different entities or the same entity over time. A recurring theme across many of these notebooks is to compare different entities, such as political parties, by a common, quantitative metric, such as percentage of all newly registered voters.
			* **Calculate Standardized Score**: Standardized scores are metrics that quantify deviation from some definition of "normal."
				* **Calculate Z-Score**: Calculate how many standard deviations a value in a column is away from the mean. Journalists calculate z-scores to find outliers in a dataset or to prepare the data for principal component analysis.
			* **Figure A Ratio**: Operations that normalize quantitative variables to allow comparisons between groups of different sizes.
				* **Calculate Ratio**: Dividing a quantitative variable by another in such a way that enables fair comparisons.
				* **Calculate Scaled Ratio**: For example, calculating per 1,000 rates and percentages.
			* **Calculate Central Tendency**: These metrics try to find a typical value in the data.
				* **Calculate Mean**: TK
				* **Calculate Median**: The middle value in a range of numbers.
			* **Quantify Change**: Measuring how much things change, usually over time.
				* **Calculate Percentage Difference**: "the difference between two values taken as a percentage of whichever value you are using as the base," according to Philip Meyer in *Precision Journalism.* This term is synonymous with percent change.
				* **Calculate Difference**: Subtracting two quantitative variables, including scalar values, vectors, and matrices.
			* **Calculate Spread**: TK
		* **Key Generation**: Operations that create "key" columns. These columns are often, but not always, used in group by or join operations. As this step is often a discrete precursor to data *integration*, it belongs in the *amend* group.
			* **Create A Semi-Unique Key**: TK
			* **Create A Unique Key**: TK
			* **Concatenate Columns Into Key**: TK
			* **Designate Column As Primary Key**: TK
		* **Rank Data**: Operations that encode semantic meaning about the data with the table index.
			* **Assign Ranks**: When a column of numerical ranks is explicitly assigned to rows in the table.
			* **Break Ties**: TK
			* **Sort Table**: When rank is implicitly assigned by rearranging row position in the table.
		* **Create Flag**: Flags are boolean expressions computed on column values and used in filtering and grouping.
	* **Clean**: Operations to correct erroneous or remove otherwise unwanted rows and values from the table.
		* **Trim Fat**: Removing portions of the table not relevant to analysis.
			* **Prune Columns**: Simply put, these operations remove table columns.
				* **Drop Columns**: TK
				* **Select Columns**: TK
				* **Align Table Columns For Consolidation**: TK
			* **Prune Rows**: Simply put, these operations remove table rows.
				* **Trim By Date Range**: TK
				* **Trim By Geographic Area**: TK
				* **Trim By Quantitative Threshold**: TK
			* **Filter Rows**: TK
			* **Drop Erroneous Rows**: TK
			* **Remove Incomplete Data**: TK
			* **Deduplicate**: Remove rows from the table that contain two or more of the same "observation." Duplicates may constitute rows with identical values in all, one, or zero columns.
				* **Prevent Double-Counting**: TK
				* **Drop Entirely Duplicate Rows**: TK
				* **Drop Rows With Duplicate Value In One Column**: TK
				* **Remove All Rows But The Master Record**: TK
		* **Edit**: Operations that modify table values
			* **Replace Na Values**: Raw data may contain incomplete table values (denoted as NA) or empty values (denoted as NULL).
			* **Fix Data Errors Manually**: TK
			* **Fix Incorrect Calculation**: TK
			* **Remove With Regular Expression**: TK
			* **Resolve Entity Names**: A surjective mapping from previous column values to new column values
				* **Perform Name Entity Resolution Manually**: Manually specify the mapping between old and new column values
				* **Strip Whitespace**: Note that this might also fall under *resolving entity names*.
				* **Combine Entities By String Matching**: TK
			* **Translate Entity Names**: Performing a bijective mapping between values, often to improve semantic meaning.
				* **Translate Entity Names Manually**: Manually specify the mapping between individual
				* **Join With Lookup Table**: Two column tables meant for mapping a key from one table to the unique column in the lookup table.
			* **Scale Values**: Operations that apply some mathematical operation to columns of quantitative data. This code differs from the codes under **Formulate performance metric** because it is closer to cleaning.
				* **Log-Ify Values**: TK
				* **Perform Scalar Multiplication**: TK
				* **Whiten Matrix**: Divide each feature by its standard deviation across all observations to give it unit variance.
		* **Format**: Operations that modify the table values appearance or style.
			* **Format Values**: Operations that modify the values within the table.
				* **Change Case**: TK
				* **Change Date Format**: TK
				* **Round Floating Point**: TK
			* **Format Schema**: Operations that modify anything except table values.
				* **Canonicalize Column Names**: Operations that change column names
				* **Change Column Data Type**: For example, changing a column of values from strings to integers
		* **Separate**: Mapping one column into more than one because multiple dimensions of the dataset are packed into one column.
			* **Extract Property From Datetime**: Such as extracting the day of the month, year, etc., from a datetime column.
			* **Slice Column Values**: Extracting the relevant column values by character position, e.g. the first five digits of a zip code.
			* **Split Column On Delimiter**: TK
			* **Get Unique Values**: TK
	* **Integrate**: Combining data residing in different tables into one table.
		* **Consolidate**: Combination of multiple tables into one. Schema changes are non-existent or inconsequential.
			* **Union Tables**: TK
			* **Concatenate Files Together**: TK
			* **Full Join Tables**: Combine all rows and all columns of the two tables.
		* **Intersect**: Joining two tables such that non-matching rows are excluded from the combined table.
			* **Inner Join Tables**: TK
			* **Natural Join**: TK
		* **Supplement**: Joining a table with a primary table such that all the rows in the primary table are in the combined table.
			* **Right Join Tables**: TK
			* **Left Join Tables**: TK
			* **Merge Metadata**: TK
			* **Ping Web Service**: TK
				* **Add Calculated Column From Axillary Data**: TK
				* **Geocode Addresses**: TK
		* **Other**: Integration operations that do not fall into the previous two categories
			* **Cartesian Product**: TK
			* **Self Join Table**: TK
	* **Transform**: Operations that transform a table into an aggregated, lower-resolution view of the original table.
		* **Summarize**: 
			* **Aggregate**: Codes that group the table along one or more table dimension.
				* **Group By Single Axis**: Grouping one or more columns such that grouped columns are hierarchically ordered when grouping by two or more columns. This operation is commonly implemented with `groupby` in Pandas.
					* **Group By Single Column**: 
					* **Group By Multiple Columns**: 
				* **Group By Double Axis**: Grouping by more than one column such that one grouped column is not hierarchically paired with another grouped column.
					* **Construct Pivot Table**: Is essentially the same as a crosstab except that the table axes may contain hierarchical, nominal data.
					* **Create A Crosstab**:  User performs a crosstab query, as defined by [Microsoft Office](https://support.office.com/en-us/article/make-summary-data-easier-to-read-by-using-a-crosstab-query-8465b89c-2ff2-4cc8-ba60-2cd8484667e8). Crosstabs are very similar to the reshaping operation *spread*, except that they summarize values using aggregate functions.
				* **Create Rolling Window**: 
		* **Calculate**: These are within-column calculations that often, but not always, immediately follow an *aggregation* operation.
			* **Sum Column Values**: 
			* **Get Max Value**: 
			* **Count Value Frequency**: 
			* **Count Unique Values In Column**: 
		* **Reshape**: Operations that fundamentally change the table's structure, but do not perform any kind of summarization calculation. *Constructing a pivot table* often involves a *spread-like* operation when defining what values to use as columns in the new table. The difference with *reshaping* is that sometimes the journalist may not summarize the reshaped table.
			* **Spread Table**: 
			* **Gather Table**: Collapses table into key value pairs.
	* **Display Dataset**: Different ways to check in on the state of the dataset during wrangling.
		* **Display A Table**: Operations that have to do with displaying the raw data as a table.
			* **Format Table Display**: 
			* **Display Entire Table**: 
		* **Understand Distribution**: Operations that reveal something of the underlying distribution of data.
			* **Plot Histogram**: 
			* **Plot Stacked Bar Chart**: 
			* **Plot Stacked Column Chart**: 
			* **Plot Scatterplot**: 
			* **Plot Trendline**: 
			* **Plot Column Chart**: 
			* **Plot Violin Plot**: 
			* **Plot Boxplot**: 
			* **Plot Scree Plot**: 
			* **Plot Line Chart**: Visualizations with lines connecting points on a chart.
	* **Check Sanity**: Operations that confirm the effect of a previous wrangling operation.
		* **Check Results Of Previous Operation**: 
		* **Compare Total Number Of Rows**: 
		* **Test For Equality**: Test if two data structures are exactly the same, e.g. two data frames.
		* **Peek At Data**: Display the first *n* rows and all columns of the table
		* **Inspect Table Schema**: 
* **Observations**: 
	* **Document**: When journalists annotate their data wrangling processes with non-executing comments or notes.
		* **Annotate Workflow**: 
	* **Cache Results From External Service**: 
	* **Export Data**: Ways in which journalists export the results of their data wrangling.
		* **Export Intermediate Results**: 
		* **Export Results**: 
	* **Workflow Building**: 
		* **Think Computationally**: Codes that demonstrate computational thinking on the part of the journalist.
			* **Architect A Subroutine**: 
			* **Architect Repeating Process**: Instances where journalists employed a loop.
		* **Toggle Step On And Off**: Some wrangling steps were not always run. Toggling off is often accomplished by commenting out code.
	* **Acquire Data**: Codes relating to how data is originally acquired by journalists.
		* **Extracted Data**: Extraction occurs when data is originally in a format that is not readily accessible for wrangling and analysis through programmatic methods. 
			* **Pull Tables Out Of Pdf**: Instances where journalists used a PDF extraction tool, such as Tabula, to work with raw data.
			* **Scrape Web For Data**: 
	* **Create**: Data used in wrangling/analysis is collected or generated by the journalist. In "Heat and Index" Sahil Chinoy computationally generates temperature and humidity data.
		* **Construct Table Manually**: Journalists hand-type the column names and table values.
		* **Generate Data Computationally**: 
	* **Collect Raw Data**: First-hand observations or logs
	* **Data Properties**: 
		* **History**: 
			* **Use Previously Cleaned Data**: 
		* **Structure**: Is the data tabular or geospatial? What kind of data is the journalist dealing with?
			* **Use Structured Ascii**: 
			* **Use Geospatial Data**: 
			* **Use Tabular Data**: 
		* **Source**: Where did this data come from?
			* **Use Public Disclosure Data**: 
			* **Use Public Data**: 
			* **Use Academic Data**: 
			* **Use Non-Public, Provided Data**: 
			* **Use Another News Orgs Data**: 
			* **Use Data From Colleague**: 
	* **Purpose**: Why does this data need to be wrangled? What end does wrangling serve? This category ventures into analysis, which is not wrangling.
		* **Analysis**: Kinds of analysis data journalists need to wrangle data to perform.
			* **Extract Single Value**: Sometimes, the whole point of wrangling is to calculate and report a single value for a story.
			* **Analyze Principle Components**: 
			* **Run Cluster Analysis**: Run some kind of clustering analysis, such as K-means.
			* **Fit A Generalized Linear Model**: 
			* **Look For Trends**: 
			* **Find Most Frequently Occurring**: 
			* **Find Worst Offender**: 
			* **Count Number Of Records**: 
			* **Image Analysis**: A programmatic, quantitative analysis of images.
		* **Wrangle Data For Graphics**: When a purpose of the notebook is to format data for other visualization tools
		* **Combine Seemingly Disparate Datasets**: When a notebook largely constitutes combining seemingly unrelated datasets.
	* **Strategies**: 
		* **Value Replacement**: The output of an intra- or inter- column calculation is reassigned to an existing column.
		* **Preserve Existing Values**: The output of an intra- or inter- column calculation is assigned to a new column.
		* **Set Data Confidence Threshold**: 
		* **Table Splitting**: Tables may be divided, partitioned, or otherwise split into multiple tables to accomplish a transformation goal.
			* **Split, Compute, And Merge**: First, the journalist partitions a single data frame into multiple, separate data frames. Then, often identical computations are run on all the data frames. Finally, the multiple data frames are consolidated into one data frame again.
			* **Split And Compute**: One table is split into two or more and identical computations are applied to each table.
			* **Peel And Merge**: When a single column of a data frame is isolated and computed upon, such as computing the frequency of a nominal column, and the results are merged back into the original table.
			* **Merge Tables To Create Pivot Table**: 
		* **Tolerate Dirty Data**: 
		* **Omits Data Quality Exploration**: 
		* **Temporary Joining Column**: 
	* **Pain Points**: 
		* **Repetitive Code**: Instances where code is repetitively copied and pasted.
		* **Make An Incorrect Conclusion**: Instances where the journalist has made an incorrect conclusion about the data.
		* **Resort After Merge**: 
		* **Data Loss From Aggregation**: 
		* **Encoding Provenance In Data**: 
		* **Data Too Large For Repo**: Raw data cannot be included in SCM because files are too large
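
Several of the performance-metric codes above describe simple formulas. The following is a minimal sketch of three of them (z-score, percentage difference, and scaled ratio) in plain Python. The function names and sample numbers are hypothetical, and the z-score uses the population standard deviation as an assumption, since the codebook does not say which variant the journalists used.

```python
import statistics


def z_score(value, column):
    """How many standard deviations `value` lies from the mean of `column`.

    Assumption: population standard deviation (statistics.pstdev).
    """
    return (value - statistics.mean(column)) / statistics.pstdev(column)


def percentage_difference(base, new):
    """Difference between two values as a percentage of the base value."""
    return (new - base) / base * 100


def per_thousand(count, population):
    """Scaled ratio: occurrences per 1,000 members of the population."""
    return count * 1000 / population


column = [2, 4, 4, 4, 5, 5, 7, 9]        # hypothetical column values
print(z_score(9, column))                # distance of 9 from the mean, in standard deviations
print(percentage_difference(50, 75))     # change from 50 to 75, relative to 50
print(per_thousand(30, 12000))           # 30 events in a population of 12,000
```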


## Methods

### Extracting open codes from PDF files

Open codes were extracted from each PDF using some internals of the open-source [pdfannots CLI](https://github.com/0xabu/pdfannots). See the [main function in pdfannots.py](https://github.com/0xabu/pdfannots/blob/6dd8dd29a93a0f5ec55e4b47f0eb27d8088a11a0/pdfannots.py#L469) for more details.

In [None]:
%%time
rep = lambda s, n: [ s for i in range(n) ]
codec = codecs.lookup('cp1252')
data = pd.DataFrame(columns=['org', 'article', 'analysis', 'index', 'cell', 'code'])
code_re = r'\[([^\]]+)\]\s([A-Za-z][^\n]+)\n?'  # Regular expression for parsing my coding comments

ptrn = os.path.join('.', 'notebooks', '**', '**', '*.html.pdf')
for fn in glob.iglob(ptrn, recursive=False):        
    org, article, analysis = fn.split('/')[2:]
    with open(fn, 'rb') as fobj:
        annots, outlines = pdfannots.process_file(fobj, codec, False)
    codes = []
    for annot in annots:
        if annot.contents is not None:
            codes += re.findall(code_re, annot.contents)
    df = pd.DataFrame({
        'org': rep(org, len(codes)),
        'article': rep(article, len(codes)),
        'analysis': rep(analysis[:-9], len(codes)),  # slice off file extension
        'index': [ i for i in range(len(codes)) ],
        'cell': [ c[0].strip() for c in codes ],
        'code': [ c[1].strip().lower() for c in codes ]
    })
    data = pd.concat([data, df])  # DataFrame.append was removed in pandas 2.0
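
To make the parsing step concrete, here is a minimal sketch of how a pattern of this shape splits one annotation into `(cell, code)` pairs. The annotation text is hypothetical, and `CODE_RE` is a stand-in for `code_re` that spells the letter class as `A-Za-z`.

```python
import re

# Stand-in for code_re: a bracketed cell label, one whitespace character,
# then a code name that starts with a letter and runs to the end of the line.
CODE_RE = r'\[([^\]]+)\]\s([A-Za-z][^\n]+)\n?'

# Hypothetical annotation text; real annotations come from the PDF files.
annotation = '[12] Group by single column\n[13] Sum Column Values \n'

# Normalize the same way the extraction cell does: strip whitespace, lowercase the code.
parsed = [
    (cell.strip(), code.strip().lower())
    for cell, code in re.findall(CODE_RE, annotation)
]
print(parsed)
```

Lowercasing the code name at extraction time is what allows it to be compared later against the lowercased leaf names from `code_tree.yaml`.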

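Many data journalists peek at their data repeatedly while wrangling it, and so will I.

* **org**: The organization that published the analysis.
* **article**: The name of the repository that contains the analysis.
* **analysis**: The file name of the analysis, with the `.html.pdf` extension sliced off.
* **index**: The zero-based position of the code among the codes extracted from a single PDF.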
* **cell**: The execution order of the cell in the PDF or the line number. This column is useful for tracing open codes back to where they occurred in these computational notebooks.
* **code**: The name of the open code.

In [None]:
data.sample(frac=1).head(10)

## Quality Control

Summarize the current coding progress.

In [None]:
summary_stats = [
    len(data['article'].unique()) + 1,  # Add one for two stories in wuft/Power_of_Irma repo
    len(data['code'].unique())
]

print('Articles: {}\nCodes: {}'.format(*summary_stats))

### Sanity Check 

Double check that every code generated from open coding has been covered in `code_tree.yaml` and every entity in `code_tree.yaml` is actually in a `.html.pdf` file.

In [None]:
# Parse the code YAML for just the open codes (leaves)
leaves = []
def collectLeaves(node, repo):
    """Recursively traverse dictionary tree and collect only the leave nodes"""
    if 'sub' in node.keys():
        for subnode in node['sub']:
            collectLeaves(subnode, repo)
    else:
        safeCode = node['name'].strip().lower()
        repo.append(safeCode)
for grp in code_yaml:
    collectLeaves(grp, leaves)

# Convert from lists to sets
leaves = set(leaves)
pdf_codes = set(data['code'].unique())

# Find any discrepancies
diff = lambda a, b, codes: display(Markdown('Codes in `{}` but not in `{}`:\n{}\n'.format(a, b, '\n'.join(['* ' + c for c in codes]))))

falsePositives = pdf_codes.difference(leaves)
falseNegatives = leaves.difference(pdf_codes)

if not (bool(falsePositives) or bool(falseNegatives)):
    # Both sets are the null set
    display(Markdown('<p>All codes have been grouped!</p><img src="https://media.giphy.com/media/XreQmk7ETCak0/giphy.gif"> '))
else:
    # Problems
    if falsePositives:
        diff('*.html.pdf', 'code_tree.yaml', falsePositives)
    if falseNegatives:
        diff('code_tree.yaml', '*.html.pdf', falseNegatives)
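
The check above boils down to two set differences. A minimal sketch with hypothetical code names (the `example_` variables are illustrative and not part of the notebook) shows how a misspelled code in a PDF surfaces on one side and an unused leaf surfaces on the other:

```python
# Hypothetical leaf names from the YAML tree and codes extracted from the PDFs.
example_leaves = {'calculate mean', 'calculate median', 'sort table'}
example_pdf_codes = {'calculate mean', 'sort table', 'sort tabel'}

# Codes used in a PDF but missing from the tree (e.g. a typo or an ungrouped code).
in_pdf_only = example_pdf_codes - example_leaves

# Leaves in the tree that no PDF annotation uses.
in_yaml_only = example_leaves - example_pdf_codes

print(sorted(in_pdf_only))
print(sorted(in_yaml_only))
```

Both differences being empty is the condition for the two sources to agree.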

If extracted codes and the codes in `code_tree.yaml` don't match, then we can find the corresponding open code by grouping data by code, article, and analysis.

In [10]:
needles = []

data[data.code.isin([n.lower() for n in needles ])] \
    .groupby(['code', 'article', 'analysis']) \
    ['analysis'].count() \
    .to_frame('count')

| code | article | analysis | count |
| --- | --- | --- | --- |
| run cluster analysis | data | cluster-paintings.py | 1 |


## Display all codes

Show all the unique codes generated so far, and link them to the organizations in whose analyses they appear.

In [None]:
data['mark'] = '✔'

(
    data[['code', 'org', 'mark']]
        .drop_duplicates(['code', 'org'])  # Keep one row per (code, org) pair
        .set_index(['code', 'org'])
        .unstack(fill_value='')
)