# Codebook

In [92]:
import os, glob, re, subprocess
import pandas as pd
import yaml
from pdfannots import pdfannots
import codecs
import treelib

pd.set_option("display.max_rows", None)  # Don't truncate rows when printing a Pandas DataFrame instance

# Extract codes from PDFs

The cell below extracts the codes from each PDF using some internals of the [pdfannots CLI](https://github.com/0xabu/pdfannots). See the [main function in pdfannots.py](https://github.com/0xabu/pdfannots/blob/6dd8dd29a93a0f5ec55e4b47f0eb27d8088a11a0/pdfannots.py#L469) for more details.

In [93]:
%%time
rep = lambda s, n: [ s for i in range(n) ]

codec = codecs.lookup('cp1252')

data = pd.DataFrame(columns=['org', 'article', 'analysis', 'index', 'cell', 'code'])

code_re = r'\[([^\]]+)\]\s([A-Za-z][^\n]+)\n?'  # Regular expression for parsing my coding comments: "[cell] code"

ptrn = os.path.join('.', 'notebooks', '**', '**', '*.html.pdf')
for fn in glob.iglob(ptrn, recursive=False):        
    org, article, analysis = fn.split('/')[2:]
    with open(fn, 'rb') as fobj:
        annots, outlines = pdfannots.process_file(fobj, codec, False)
    codes = []
    for annot in annots:
        if annot.contents is not None:
            codes += re.findall(code_re, annot.contents)
    df = pd.DataFrame({
        'org': rep(org, len(codes)),
        'article': rep(article, len(codes)),
        'analysis': rep(analysis[:-9], len(codes)),  # slice off file extension
        'index': [ i for i in range(len(codes)) ],
        'cell': [ c[0].strip() for c in codes ],
        'code': [ c[1].strip().lower() for c in codes ]
    })
    data = pd.concat([data, df])  # DataFrame.append was removed in pandas 2.0

CPU times: user 20 s, sys: 70 ms, total: 20.1 s
Wall time: 20.2 s
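
To make the parsing step concrete, here is a minimal sketch of how the regular expression splits one annotation comment into `(cell, code)` pairs. The annotation text is made up for illustration; it assumes each coding comment is written as `[cell] code` on its own line, and it uses `[A-Za-z]` for the leading letter of the code.

```python
import re

# Same shape as the pattern in the extraction cell: "[cell] code" per line.
code_re = r'\[([^\]]+)\]\s([A-Za-z][^\n]+)\n?'

# Hypothetical contents of a single PDF annotation (not real data).
sample_annotation = (
    "[paragraph 1] Use public disclosure data\n"
    "[1] Annotate workflow\n"
    "[1] Change column data type"
)

pairs = re.findall(code_re, sample_annotation)

# Normalize the same way the extraction cell does.
cells = [cell.strip() for cell, _ in pairs]
codes = [code.strip().lower() for _, code in pairs]

print(cells)
print(codes)
```

Each tuple returned by `re.findall` holds the bracketed cell label first and the code text second, which is why the extraction cell reads `c[0]` for the cell and `c[1]` for the code.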


In [94]:
data.head()

| | org | article | analysis | index | cell | code |
|---|---|---|---|---|---|---|
| 0 | baltimore-sun-data | 2018-voter-registration | 01_processing | 0 | paragraph 1 | use public disclosure data |
| 1 | baltimore-sun-data | 2018-voter-registration | 01_processing | 1 | paragraph 1 | pull tables out of pdf |
| 2 | baltimore-sun-data | 2018-voter-registration | 01_processing | 2 | 1 | annotate workflow |
| 3 | baltimore-sun-data | 2018-voter-registration | 01_processing | 3 | 1 | change column data type |
| 4 | baltimore-sun-data | 2018-voter-registration | 01_processing | 4 | 1 | canonicalize column names |


Summarize the current coding progress

In [95]:
summary_stats = [
    len(data['article'].unique()),
    len(data['code'].unique())
]

print('Articles: {}\nCodes: {}'.format(*summary_stats))

Articles: 17
Codes: 128


# Hierarchical Code Groups

In [96]:
def walkTheYaml(parent, children, func):
    """ A recursive, pre-order traversal of the code groups YAML structure"""
    for child in children:
        if isinstance(child, str):
            # Leaf nodes are strings.
            func(parent, child, True)
        elif isinstance(child, dict):
            # Interior nodes are dictionaries.
            key = list(child.keys())[0]
            func(parent, key, False)
            walkTheYaml(key, child[key], func)
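
As an illustration of the traversal order, the sketch below runs the same function over a tiny in-memory hierarchy. The hierarchy is hypothetical and only assumes the shape that `walkTheYaml` expects: a list whose items are either strings (leaves) or single-key dictionaries mapping a group name to another list.

```python
def walkTheYaml(parent, children, func):
    """A recursive, pre-order traversal of the code groups structure."""
    for child in children:
        if isinstance(child, str):
            # Leaf nodes are strings.
            func(parent, child, True)
        elif isinstance(child, dict):
            # Interior nodes are single-key dictionaries.
            key = list(child.keys())[0]
            func(parent, key, False)
            walkTheYaml(key, child[key], func)

# Hypothetical hierarchy, shaped like the output of yaml.safe_load.
toy_hierarchy = [
    {"clean": ["deduplicate", {"edit": ["strip whitespace"]}]},
    "rank entities",
]

visited = []
walkTheYaml("wrangling", toy_hierarchy, lambda p, c, is_leaf: visited.append((p, c, is_leaf)))

for parent, child, is_leaf in visited:
    print(parent, "->", child, "(leaf)" if is_leaf else "(group)")
```

Because the callback fires on a group before the function recurses into it, every parent is visited before its children, which is what later lets a tree be built node by node.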

## Sanity Check

Double check that every code generated from open coding has been covered in the hierarchy and every entity in `code_tree.yaml` is actually in a `.html.pdf` file.

In [97]:
root = 'Wrangling'
leaves = []
gatherLeaves = lambda p, c, l: leaves.append(c.strip().lower()) if l else None
code_tree = 'code_tree.yaml'

with open(code_tree, 'r') as f:
    code_hierarchy = yaml.safe_load(f)

walkTheYaml(root, code_hierarchy, gatherLeaves)
leaves = set(leaves)
pdf_codes = set(data['code'].unique())

diff = lambda a, b, codes: print('Codes in {} but not in {}:\n{}\n'.format(a, b, '\n'.join(['\t- ' + c for c in codes])))

ungrouped = pdf_codes - leaves  # codes found in PDFs but missing from the hierarchy
unused = leaves - pdf_codes     # codes in the hierarchy that never appear in a PDF

if not ungrouped:
    print("All codes have been grouped 😎")
else:
    diff('*.html.pdf', code_tree, ungrouped)
if unused:  # report this direction even when every PDF code is grouped
    diff(code_tree, '*.html.pdf', unused)

All codes have been grouped 😎


## A detailed description of codes

### Wrangling actions

* **Amend**: *Amending* a table constitutes creating new columns in the table without *integrating* other tables, hence it is different from codes under the *integration* category.
    * **Encode table-level data**: I think of each row in a table as representing an "observation" and columns representing dimensions of that "observation." *Encoding table-level data* occurs when journalists add columns to the table populated with data at a level higher than observation.
        * **Encode table summary data in row**: When high-level, aggregated data about the table is encoded in the rows. For example, a table column contains the frequency of nominal values in a different column of the same table.
    * **Formulate performance metric**: Codes in this category specify a calculation that is later used to compare different entities or the same entity over time. A recurring theme across many of these notebooks is comparing different entities, such as political parties, by a common, quantitative metric, such as percentage of all newly registered voters.
        * **Calculate standardized scores**: Standardized scores are metrics that quantify deviation from some definition of "normal."
        * **Figuring a normalized rate**: Figuring a rate often normalizes some performance metric to allow fair comparisons between entities.
        * **Calculate central tendency**: These metrics try to find a typical value in the data.
        * **Quantify Change**: Measuring how much things change, usually over time.
            * **Percentage Difference**: "the difference between two values taken as a percentage of whichever value you are using as the base," according to Philip Meyer in *Precision Journalism.* This term is synonymous with percent change.        
    * **Key Generation**: Operations that create "key" columns. These columns are often, but not always, used in group by or join operations. As this step is often a discrete precursor to data *integration*, it belongs in the *amend* group.

* **Clean**: Operations to correct erroneous or remove otherwise unwanted rows and values from the table.
    * **Trim fat**: Removing portions of the table not relevant to analysis.
        * **Reduce dimensionality**: Simply put, these operations remove table columns.
        * **Prune data**: Simply put, these operations remove table rows.
            * **Deduplicate**: Remove rows from the table that contain two or more of the same "observation." Duplicates may constitute rows with identical values in all, one, or zero columns. In ["Analysis of early 2020 Democratic campaign co-donors"](notebooks/buzzfeednews/2019-04-democratic-candidate-codonors), a self join created separate records for each donation to multiple candidates for each permutation of candidates. The values in each row were unique, but still constituted duplicates that would have resulted in double-counting during analysis.
    * **Edit**: Operations that fundamentally modify table values.
        * **Handle incomplete data**: Raw data may contain incomplete table values (denoted as NA) or empty values (denoted as NULL).
        * **Resolve entity names**: Perform name entity resolution. For example, the postal code for British Columbia is BC. If a column contained both the strings "British Columbia" and "BC," name entity resolution would replace all instances of "BC" with "British Columbia" or vice versa.
            * **Strip whitespace**: Note that this might also fall under *resolving entity names*.
        * **Scale values**: Operations that apply some mathematical operation to columns of quantitative data. This code is different from the codes under **Formulate performance metric** because it is closer to cleaning.
    * **Format**: Operations that modify the appearance or style of table values.
        * **Value Formatting**: Operations that modify the values within the table.
        * **Meta Formatting**: Operations that modify anything except table values.
    * **Separate**: A dataset may pack multiple dimensions into one column. A typical *separation* task for journalists involves parsing addresses encoded as one column in a raw table. A journalist must separate that field into one or all of the entities that constitute an address, such as street, zip code, or country.
        * **Slice Column Values**: Taken from the `slice` method of the `String` class in JavaScript, this operation refers to extracting the relevant column values by character position. This code was often used when American journalists extracted just the first five digits of a nine-digit zip code.
    * **Combine**: The act of combining more than one column into one column.
    * **Detrend data**: "filter out the secular effect in order to see what is going on specifically with the phenomenon you are investigating," according to Philip Meyer in *Precision Journalism*. This includes adjusting for inflation, population growth, and season.
    * **Rank entities**: A common pre-analysis task is to rank entities in the dataset by a common performance metric.

* **Integrate**: Combining data residing in different tables into one table.
    * **Consolidate**: Row-wise concatenation of multiple tables into a single table, such that schema changes are non-existent or inconsequential. This is similar to a `UNION` operation in SQL. For example, a government agency may publish data yearly, and a journalist *consolidates* these individual files into one dataset across many years.
    * **Intersect**: Joining two tables such that non-matching rows are excluded from the combined table. *Intersection* is usually implemented through an `INNER JOIN` operation in SQL terminology.
    * **Supplement**: Joining an auxiliary table with the primary table such that all the rows present in the combined table were also present in the primary table. In SQL terminology, *supplementation* is similar to `LEFT JOIN` and `RIGHT JOIN` operations. For example, a journalist may *supplement* a dataset of political candidates with aggregated campaign contribution data, as *BuzzFeed News* did in ["Analysis of early 2020 Democratic campaign co-donors"](notebooks/buzzfeednews/2019-04-democratic-candidate-codonors).
    * **Other**: Integration operations that don't fall into the previous categories.
* **Transform**:
    * **Summarize**: Operations that transform a table into an aggregated, lower-resolution view of the original table.
        * **Aggregate**: Codes that group the table along one or more table dimensions. Grouping dimensions are often company names, but can also be dates, geographic entities, boolean expressions, etc.
        * **Calculate**: These are within-column calculations that often, but not always, immediately follow an *aggregation* operation.
        * **Create a crosstab**: User performs a crosstab query, as defined by [Microsoft Office](https://support.office.com/en-us/article/make-summary-data-easier-to-read-by-using-a-crosstab-query-8465b89c-2ff2-4cc8-ba60-2cd8484667e8). Crosstabs are very similar to the reshaping operation *spread*, except that they summarize values using aggregate functions.
        * **Construct a pivot table**: Essentially the same as a crosstab, except that the table axes may contain hierarchical, nominal data.
    * **Reshape**: Operations that fundamentally change the table's structure, but do not perform any kind of summarization calculation. *Constructing a pivot table* often involves a *spread-like* operation when defining what values to use as columns in the new table. The difference with *reshaping* is that sometimes the journalist may not summarize the reshaped table.
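
Several of the wrangling actions described in this list map onto familiar data frame operations. The following is a minimal sketch, assuming pandas as the wrangling library and an entirely made-up donations table; the column names and values are illustrative and are not taken from any coded notebook.

```python
import pandas as pd

# Hypothetical toy data.
donations = pd.DataFrame({
    "donor": ["Ann Lee", "Ann Lee", "Bo Cruz", "Cy Diaz"],
    "zip": ["97201-1234", "97201-1234", "21201-0001", "70112-9999"],
    "party": ["D", "D", "R", "D"],
    "amount": [50, 50, 200, 75],
})

# Clean > Prune data > Deduplicate: drop rows that repeat an observation.
deduped = donations.drop_duplicates().reset_index(drop=True)

# Clean > Separate > Slice column values: keep the five-digit zip code.
deduped["zip5"] = deduped["zip"].str.slice(0, 5)

# Amend > Encode table summary data in row: frequency of each party value.
deduped["party_count"] = deduped.groupby("party")["party"].transform("count")

# Integrate > Supplement: a left join keeps every row of the primary table.
parties = pd.DataFrame({
    "party": ["D", "R"],
    "party_name": ["Democratic", "Republican"],
})
supplemented = deduped.merge(parties, on="party", how="left")

# Transform > Summarize: aggregate by party, then calculate a total.
totals = supplemented.groupby("party_name")["amount"].sum()

print(supplemented)
print(totals)
```

The same table passes through five different codes here, which mirrors how a single notebook cell in the corpus can receive more than one code.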

### Wrangling support

Some operations aren't directly relevant to modifying tables but still relevant in a survey of how data journalists wrangle data.

* **Display dataset**: Different ways to check in on the state of the dataset during wrangling.
    * **Check Sanity**: Operations that confirm the effect of a previous wrangling operation.
    * **Display a table**: Operations that have to do with displaying the raw data as a table.
    * **Understand distribution**: Operations that reveal something of the underlying distribution of data.
* **Documentation**: When journalists annotate their data wrangling processes with non-executing comments or notes.
* **Export data**: Ways in which journalists export the results of their data wrangling.
* **Workflow Building**:
    * **Think computationally**: Codes that demonstrate computational thinking on the part of the journalist.
* **Data Acquisition**: Codes relating to how data is originally acquired by journalists.
    * **Extracted data**: Extraction occurs when data is originally in a format that is not readily accessible for wrangling and analysis through programmatic methods.
    * **Existing tabular data**: The data exists in a programmatically accessible format.
    * **Create new data**: Data used in wrangling/analysis is collected or generated by the journalist.
* **Data properties**:
    * **Quality**: Has this table undergone any previous wrangling?


### Wrangling purpose

Why does this data need to be wrangled? What end does wrangling serve? This category ventures into analysis, which is not wrangling.

* **Analysis**: Kinds of analysis data journalists need to wrangle data to perform.
* **Wrangle data for graphics**: The data needs to be modified in order to be visualized with other tools.
* **Combine seemingly disparate datasets**: When a journalist's analysis largely constitutes combining seemingly unrelated datasets, such as baby names and state-wide results from the 2016 U.S. Presidential Election.

### Wrangling strategies

* **Value replacement**: When modifications to table values overwrite the previous values. For example, if a journalist replaced a column of nine-digit zip codes with five-digit versions, then the values would be replaced.
* **Table splitting**: Tables may be divided, partitioned, or otherwise split into multiple tables to accomplish a transformation goal. This often, but not always, involves re-merging the split table back into one table.
    * **Split, compute, and merge**: First, the journalist partitions a single data frame into multiple, separate data frames. Then, often identical computations are run on all the data frames. Finally, the multiple data frames are consolidated into one data frame again.
    * **Split and compute**: Similar to *Split, compute, and merge*, but the separate data frames are not recombined at the end of the wrangling process.
    * **Peel and merge**: When a single column of a data frame is isolated and computed upon, such as computing the frequency of a nominal column, and the results are merged back into the original table.
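
The two merging strategies above can be sketched as follows. This is an illustration under stated assumptions, not code from any coded notebook: it assumes pandas, and the `state` and `votes` columns are hypothetical.

```python
import pandas as pd

# Hypothetical toy table.
df = pd.DataFrame({
    "state": ["OR", "OR", "MD", "MD", "MD"],
    "votes": [10, 30, 5, 15, 20],
})

# Split, compute, and merge: partition by state, run the same computation
# on every partition, then consolidate the pieces into one data frame.
pieces = []
for _, part in df.groupby("state"):
    part = part.copy()
    part["share"] = part["votes"] / part["votes"].sum()
    pieces.append(part)
split_merged = pd.concat(pieces).sort_index()

# Peel and merge: isolate one column, compute its value frequencies,
# then merge the result back into the original table.
counts = df.groupby("state").size().rename("state_count").reset_index()
peel_merged = df.merge(counts, on="state", how="left")

print(split_merged)
print(peel_merged)
```

Stopping before the `pd.concat` call and keeping `pieces` as separate data frames would correspond to *Split and compute*.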


### Wrangling pain points

* **Repetitive code**: Instances where code is repetitively copied and pasted.


#### Other
* **Encode table-level detail**: Analysts often encode the table name or results from a frequency table

* **Figure a rate**: Any operation that considers one group's relation to the whole. This code covers simple rational numbers, percentages, and per-1000 rates.

* **Extract data from non-tabular form**: Includes scraping data from the web and parsing structured ASCII data (such as .fec files).

* **Generate data computationally**: Refers to programmatically generating raw data, e.g. `range` in Python. In "Heat and Index," Sahil Chinoy computationally generates temperature and humidity data.

* **Create Unique Key** *added donor-movement* is an interesting code because it occurs frequently when dealing with campaign finance data. It seems like there could be algorithmic approaches that find a unique key out of any given combination of columns. Unique keys are mainly names and places concatenated. You could maybe randomly sample the dataset to save time?

* **Calculate z-score**: Calculate how many standard deviations a value in a column is away from the mean, $(x_i - \bar{x})/\sigma_x \quad \forall x_i \in x$. Journalists perform this calculation simply to find outliers in a dataset or when preparing the data for principal component analysis.
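
The z-score formula above can be sketched in a few lines. This example assumes pandas and a made-up column of values; it also assumes $\sigma_x$ means the population standard deviation (`ddof=0`), whereas pandas defaults to the sample standard deviation (`ddof=1`). The outlier cutoff is purely illustrative.

```python
import pandas as pd

# Hypothetical column of quantitative values.
x = pd.Series([2, 4, 4, 4, 5, 5, 7, 9], name="x")

x_bar = x.mean()           # 5.0
sigma_x = x.std(ddof=0)    # population standard deviation, 2.0

# (x_i - x_bar) / sigma_x for every x_i in x
z_scores = (x - x_bar) / sigma_x

# Illustrative cutoff for flagging outliers; choose one suited to the data.
outliers = x[z_scores.abs() > 1.75]

print(z_scores.tolist())
print(outliers.tolist())
```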

In [98]:
tree = treelib.Tree()
addNode = lambda p, c, foo: tree.create_node(c.title(), c.lower(), parent=p.lower() if p is not None else None)

addNode(None, root, False)
walkTheYaml(root, code_hierarchy, addNode)

tree.show(line_type='ascii-em')

Wrangling
╠══ Wrangling Actions
║   ╠══ Amend
║   ║   ╠══ Detrend
║   ║   ║   ╠══ Adjust For Inflation
║   ║   ║   ╠══ Adjust For Season
║   ║   ║   ╚══ Compute Index Number
║   ║   ╠══ Encode Table-Level Data
║   ║   ║   ╠══ Encode Table Identification In Row
║   ║   ║   ╚══ Encode Table Summary Data In Row
║   ║   ╠══ Formulate Performance Metric
║   ║   ║   ╠══ Calculate Central Tendency
║   ║   ║   ║   ╚══ Calculate Mean
║   ║   ║   ╠══ Calculate Spread
║   ║   ║   ╠══ Calculate Standardized Score
║   ║   ║   ║   ╚══ Calculate Z-Score
║   ║   ║   ╠══ Figuring A Normalized Rate
║   ║   ║   ║   ╠══ Calculate Per 1K
║   ║   ║   ║   ╠══ Calculate Percentage
║   ║   ║   ║   ╚══ Calculate Proportion
║   ║   ║   ╚══ Quantify Change
║   ║   ║       ╠

## Display all codes

Show all the unique codes generated so far, and link them to the articles in which they appear.

In [99]:
data.groupby(['code', 'article', 'analysis'])['analysis'].count().to_frame('count')

| code | article | analysis | count |
|---|---|---|---|
| add calculated column from axillary data | cube_root_law | the_cube_root_law | 1 |
| adjust for inflation | california-crop-production-wages-analysis | 02-transform | 1 |
| adjust for season | california-ccscore-analysis | analysis | 1 |
| analyze principle components | wikipedia-rankings | wikipedia.r | 1 |
| annotate workflow | 2018-voter-registration | 01_processing | 1 |
| annotate workflow | 2018-voter-registration | 02_analysis | 1 |
| annotate workflow | 2019-04-democratic-candidate-codonors | analyze-campaign-codonors | 1 |
| annotate workflow | california-crop-production-wages-analysis | 02-transform | 1 |
| annotate workflow | california-crop-production-wages-analysis | 03-analysis | 1 |
| architect a subroutine | 2016-04-republican-donor-movements | donor-movement-analysis | 1 |


In [100]:
data['mark'] = '✔'

(
    data[['code', 'org', 'mark']]
        .drop_duplicates(['code', 'org'])  # Drop duplicate codes within an organization
        .set_index(['code', 'org'])
        .unstack(fill_value='')
)

| code | TheOregonian | baltimore-sun-data | buzzfeednews | la_times | nola | nytimes | stlpublicradio | time | wuft |
|---|---|---|---|---|---|---|---|---|---|
| add calculated column from axillary data | | | | | | ✔ | | | |
| adjust for inflation | | | | ✔ | | | | | |
| adjust for season | | | | ✔ | | | | | |
| analyze principle components | | | | | | | | ✔ | |
| annotate workflow | | ✔ | ✔ | ✔ | | | | | |
| architect a subroutine | | | ✔ | ✔ | | | | ✔ | |
| architect repeating process | | ✔ | ✔ | ✔ | | | | | |
| assign ranks | | | | | | | | ✔ | |
| break ties | | | | | | | | ✔ | |
| cache results from external service | | | | ✔ | | | | | |
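
For readers unfamiliar with the reshaping used to build this code-by-organization matrix, here is a minimal sketch on made-up rows. It assumes pandas; the rows are hypothetical, and it selects the `mark` column before unstacking so that the result has a single level of column labels.

```python
import pandas as pd

# Hypothetical coded rows; the repeated (code, org) pair is deliberate.
toy = pd.DataFrame({
    "code": ["annotate workflow", "annotate workflow", "assign ranks", "annotate workflow"],
    "org": ["buzzfeednews", "la_times", "time", "buzzfeednews"],
})
toy["mark"] = "✔"

matrix = (
    toy.drop_duplicates(["code", "org"])   # one row per (code, org) pair
       .set_index(["code", "org"])["mark"]
       .unstack(fill_value="")             # orgs become columns, gaps become ""
)

print(matrix)
```

Dropping duplicates first matters because `unstack` raises an error when the same `(code, org)` index pair appears more than once.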
