# Codebook

In [20]:
import os, glob, re, subprocess
import pandas as pd
import yaml
from pdfannots import pdfannots
import codecs
import treelib

pd.set_option("display.max_rows", None)  # Don't truncate rows when printing a Pandas DataFrame instance

# Extract codes from PDFs

The cell below extracts the codes from each annotated PDF using some internals of the [pdfannots CLI](https://github.com/0xabu/pdfannots). See the [main function in pdfannots.py](https://github.com/0xabu/pdfannots/blob/6dd8dd29a93a0f5ec55e4b47f0eb27d8088a11a0/pdfannots.py#L469) for more details.

In [21]:
%%time
rep = lambda s, n: [ s for i in range(n) ]

codec = codecs.lookup('cp1252')

data = pd.DataFrame(columns=['org', 'article', 'analysis', 'index', 'cell', 'code'])

code_re = r'\[([^\]]+)\]\s([A-Za-z][^\n]+)\n?'  # Regular expression for parsing my coding comments: "[cell] code"

ptrn = os.path.join('.', 'notebooks', '**', '**', '*.html.pdf')
for fn in glob.iglob(ptrn, recursive=False):        
    org, article, analysis = fn.split(os.sep)[2:]  # ./notebooks/<org>/<article>/<analysis>.html.pdf
    with open(fn, 'rb') as fobj:
        annots, outlines = pdfannots.process_file(fobj, codec, False)
    codes = []
    for annot in annots:
        if annot.contents is not None:
            codes += re.findall(code_re, annot.contents)
    df = pd.DataFrame({
        'org': rep(org, len(codes)),
        'article': rep(article, len(codes)),
        'analysis': rep(analysis[:-9], len(codes)),  # slice off file extension
        'index': [ i for i in range(len(codes)) ],
        'cell': [ c[0].strip() for c in codes ],
        'code': [ c[1].strip().lower() for c in codes ]
    })
    data = pd.concat([data, df])  # DataFrame.append was removed in pandas 2.0

CPU times: user 16.9 s, sys: 20 ms, total: 16.9 s
Wall time: 17.2 s
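
The pattern in `code_re` implies that each coding comment is a bracketed cell label followed by the code text, one code per line. A minimal sketch of what the pattern captures, using made-up annotation text (the sample string is hypothetical and not taken from any of the PDFs):

```python
import re

# Same shape as code_re above: "[cell label] code text", one code per line.
pattern = r'\[([^\]]+)\]\s([A-Za-z][^\n]+)\n?'

# Hypothetical annotation contents holding two coding comments.
sample = "[paragraph 1] Use third-party data\n[1] Annotate workflow\n"

# Normalize the same way the extraction loop does: strip both parts, lowercase the code.
parsed = [(cell.strip(), code.strip().lower())
          for cell, code in re.findall(pattern, sample)]
print(parsed)
```

Each match yields a `(cell, code)` pair, which the extraction loop unpacks into the `cell` and `code` columns.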


In [22]:
data.head()

,org,article,analysis,index,cell,code
0,baltimore-sun-data,2018-voter-registration,01_processing,0,paragraph 1,use third-party data
1,baltimore-sun-data,2018-voter-registration,01_processing,1,paragraph 1,pull tables out of pdf
2,baltimore-sun-data,2018-voter-registration,01_processing,2,1,annotate workflow
3,baltimore-sun-data,2018-voter-registration,01_processing,3,1,change column data type
4,baltimore-sun-data,2018-voter-registration,01_processing,4,1,canonicalize column names


Summarize the current coding progress.

In [23]:
summary_stats = [
    len(data['article'].unique()),
    len(data['code'].unique())
]

print('Articles: {}\nCodes: {}'.format(*summary_stats))

Articles: 14
Codes: 112


# Hierarchical Code Groups

In [24]:
def walkTheYaml(parent, children, func):
    """ A recursive, pre-order traversal of the code groups YAML structure"""
    for child in children:
        if isinstance(child, str):
            # Leaf nodes are strings.
            func(parent, child, True)
        elif isinstance(child, dict):
            # Interior nodes are dictionaries.
            key = list(child.keys())[0]
            func(parent, key, False)
            walkTheYaml(key, child[key], func)
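
As a quick check of the traversal order, the sketch below runs the same logic on a small hand-written hierarchy. It assumes the YAML file parses to nested lists in which leaves are strings and interior nodes are single-key dictionaries; the `walk` name and the sample hierarchy are illustrative only and are not the contents of `code_tree.yaml`.

```python
def walk(parent, children, func):
    """Pre-order traversal: visit a node, then recurse into its children."""
    for child in children:
        if isinstance(child, str):
            func(parent, child, True)        # leaf node
        elif isinstance(child, dict):
            key = next(iter(child))          # single-key dict names the group
            func(parent, key, False)         # interior node
            walk(key, child[key], func)

# Hypothetical parsed structure (what yaml.safe_load might return).
hierarchy = [
    {"acquire": ["use tabular data", {"extraction": ["scrape web for data"]}]},
    "annotate workflow",
]

visits = []
walk("wrangling", hierarchy, lambda p, c, leaf: visits.append((p, c, leaf)))
for visit in visits:
    print(visit)
```

Every callback receives the parent label, the node label, and a flag saying whether the node is a leaf, so the same walker can collect leaves or build a tree depending on the function passed in.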

Double-check that every code generated during open coding is covered by the hierarchy.

In [25]:
root = 'Wrangling'
leaves = []
gatherLeaves = lambda p, c, l: leaves.append(c.strip().lower()) if l else None
code_tree = 'code_tree.yaml'

with open(code_tree, 'r') as f:
    code_hierarchy = yaml.safe_load(f)

walkTheYaml(root, code_hierarchy, gatherLeaves)
leaves = set(leaves)
pdf_codes = set(data['code'].unique())

diff = lambda a, b, codes: print('Codes in {} but not in {}:\n{}\n'.format(a, b, '\n'.join(['\t- ' + c for c in codes])))

if len(pdf_codes - leaves) == 0:  # is null set
    print("All codes have been grouped 😎")
else:
    if len(pdf_codes - leaves) > 0:
        diff('*.html.pdf', code_tree, pdf_codes - leaves)
    if len(leaves - pdf_codes) > 0:
        diff(code_tree, '*.html.pdf', leaves - pdf_codes)

Codes in *.html.pdf but not in code_tree.yaml:
	- find most frequently occurring
	- set data confidence threshold

Codes in code_tree.yaml but not in *.html.pdf:
	- wrangle data for graphics
	- scale numeric values
	- find most frequently occuring
	- plot trendline
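
One of the mismatches printed above is a spelling difference ("occurring" in the PDFs versus "occuring" in the YAML file) rather than a missing code. A possible way to flag such near misses automatically is fuzzy matching with the standard library's `difflib`; this is a sketch, and the 0.9 cutoff is an arbitrary assumption rather than a tuned value.

```python
import difflib

# Codes copied from the two lists printed above.
only_in_pdfs = ["find most frequently occurring", "set data confidence threshold"]
only_in_tree = ["wrangle data for graphics", "scale numeric values",
                "find most frequently occuring", "plot trendline"]

# For each unmatched PDF code, look for a very similar code in the tree.
near_misses = {
    code: difflib.get_close_matches(code, only_in_tree, n=1, cutoff=0.9)
    for code in only_in_pdfs
}

for code, matches in near_misses.items():
    if matches:
        print('{!r} looks like {!r}'.format(code, matches[0]))
```

Codes with an empty match list are more likely to be genuinely missing from the hierarchy.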



## Display hierarchy

### Notes on Codes

* **Formulate performance metric**: specifying a calculation that is later used to compare different entities. A recurring theme across many of these notebooks is comparing entities, such as political parties, by a common quantitative metric, such as the percentage of all newly registered voters.
* **Figure a rate**: any operation that considers one group's relation to the whole. This code covers simple rational numbers, percentages, and per-1000 rates.
* **Merge metadata**: joining an auxiliary table to the primary table to provide context for the phenomenon currently being analyzed.
* **Merge data sources** refers to combining different schemas into one table. For example, *The Oregonian* compared complaints provided by a government agency with complaints scraped from the web.
* **Detrend data**: "filter out the secular effect in order to see what is going on specifically with the phenomenon you are investigating" (Philip Meyer, *Precision Journalism*). This includes adjusting for inflation, population growth, and season.
* **Extract data from non-tabular form** includes scraping data from the web and parsing structured ASCII data (such as .fec files).
* **Change dataset resolution** refers to changing, usually by decreasing, the granularity of the observations represented as rows in the table. Changes to dataset resolution are often caused by aggregation operations. For example, if every row in a table represents an observation on a particular date, then the dataset can be grouped by a coarser time interval, such as month, and aggregate quantitative values, such as a sum or mean, can be computed.
* **Consolidate data sources** refers to combining multiple tables into one table. This wrangling activity often occurs when data is located in separate, although not disparate, sources. For example, a government agency may publish data in a CSV file every year, but a data journalist wants to compare data across many years.
* **Generate data computationally** refers to programmatically generating raw data, e.g. `range` in Python. In "Heat and Index", Sahil Chinoy computationally generates temperature and humidity data.
* **Create Unique Key** *added donor-movement* is an interesting code because it occurs frequently when dealing with campaign finance data. There may be algorithmic approaches that find a unique key among the possible combinations of columns. Most unique keys are names and places concatenated. Randomly sampling the dataset might save time.
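
The idea in the last note, searching column combinations for a unique key, can be sketched with a brute-force search. This is only an illustration under assumptions: the `find_unique_keys` helper, the `max_size` limit, and the donor rows are all hypothetical and do not come from any of the coded notebooks.

```python
from itertools import combinations

def find_unique_keys(rows, columns, max_size=3):
    """Return the smallest column combinations whose values are unique per row."""
    for size in range(1, max_size + 1):
        keys = [
            combo for combo in combinations(columns, size)
            # A combination is a key if no two rows share the same value tuple.
            if len({tuple(row[c] for c in combo) for row in rows}) == len(rows)
        ]
        if keys:
            return keys  # stop at the smallest size that works
    return []

# Hypothetical campaign-finance-style rows.
donors = [
    {"name": "A. Smith", "city": "Portland", "state": "OR"},
    {"name": "A. Smith", "city": "Salem", "state": "OR"},
    {"name": "B. Jones", "city": "Portland", "state": "OR"},
]

unique_keys = find_unique_keys(donors, ["name", "city", "state"])
print(unique_keys)
```

On the sampling question: a random sample can cheaply rule out a candidate key, because a duplicate in the sample is also a duplicate in the full table, but it cannot confirm one, so any surviving candidate would still need a check against all rows.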

In [26]:
tree = treelib.Tree()
addNode = lambda p, c, is_leaf: tree.create_node(c.title(), c.lower(), parent=p.lower() if p is not None else None)  # is_leaf is unused

addNode(None, root, False)
walkTheYaml(root, code_hierarchy, addNode)

tree.show(line_type='ascii-em')

Wrangling
╠══ Acquire
║   ╠══ Data Source
║   ║   ╠══ Create New Data
║   ║   ║   ╠══ Construct Table Manually
║   ║   ║   ╚══ Generate Data Computationally
║   ║   ╠══ Extraction
║   ║   ║   ╠══ Pull Tables Out Of Pdf
║   ║   ║   ╚══ Scrape Web For Data
║   ║   ╚══ Use Existing Data
║   ║       ╠══ Use Another News Orgs Data
║   ║       ╠══ Use Previously Cleaned Data
║   ║       ╚══ Use Third-Party Data
║   ╚══ Data Type
║       ╠══ Use Non-Tabular Data
║       ║   ╠══ Use Geospatial Data
║       ║   ╚══ Use Structured Ascii
║       ╚══ Use Tabular Data
╠══ Build Workflow
║   ╠══ Cache Results From External Service
║   ╠══ Document
║   ║   ╚══ Annotate Workflow
║   ╠══ Export Data
║   ║   ╠══ Export Data For Graphics
║   ║   ╠══ Export Intermediate Results
║   ║   ╚══ Export Results
║   ╚══ Think Computationally
║       ╠══ Architect A Subroutine
║       ╠══ Architect Repeating Process
║       ╚══ Repetitive Code
╠══ Cleaning
║   ╠══ Deduplicate
║   ║   ╠══ Create A Unique Key
║   ║ 

## Display all codes

Show all the unique codes generated so far, and link them to the articles in which they appear.

In [27]:
data.groupby(['code', 'article', 'analysis'])['analysis'].count().to_frame('count')

code,article,analysis,count
add calculated column from axillary data,cube_root_law,the_cube_root_law,1
add column from intra-table calculation,2019-04-democratic-candidate-codonors,analyze-campaign-codonors,1
adjust for inflation,california-crop-production-wages-analysis,02-transform,1
adjust for season,california-ccscore-analysis,analysis,1
annotate workflow,2018-voter-registration,01_processing,1
annotate workflow,2018-voter-registration,02_analysis,1
annotate workflow,2019-04-democratic-candidate-codonors,analyze-campaign-codonors,1
annotate workflow,california-crop-production-wages-analysis,02-transform,1
annotate workflow,california-crop-production-wages-analysis,03-analysis,1
append to table,2018-05-31-crime-and-heat-analysis,crimes-and-heat,1


In [28]:
data['mark'] = '✔'

(
    data[['code', 'org', 'mark']]
        .drop_duplicates(['code', 'org'])  # Keep one row per code within each organization
        .set_index(['code', 'org'])
        .unstack(fill_value='')
)

code,TheOregonian,baltimore-sun-data,buzzfeednews,la_times,nytimes,stlpublicradio,wuft
add calculated column from axillary data,,,,,✔,,
add column from intra-table calculation,,,✔,,,,
adjust for inflation,,,,✔,,,
adjust for season,,,,✔,,,
annotate workflow,,✔,✔,✔,,,
append to table,,,,,,✔,
architect a subroutine,,,✔,✔,,,
architect parallel workflows,,✔,✔,✔,,,
architect repeating process,,✔,✔,✔,,,
cache results from external service,,,,✔,,,
