# Codebook

In [8]:
import os, glob, re, subprocess
import pandas as pd
import yaml
from pdfannots import pdfannots
import codecs
import treelib

pd.set_option("display.max_rows", None)  # Don't truncate rows when printing a Pandas DataFrame instance

# Extract codes from PDFs

The cell below extracts the codes from each PDF using some internals of the [pdfannots CLI](https://github.com/0xabu/pdfannots). See the [main function in pdfannots.py](https://github.com/0xabu/pdfannots/blob/6dd8dd29a93a0f5ec55e4b47f0eb27d8088a11a0/pdfannots.py#L469) for more details.

In [9]:
%%time
rep = lambda s, n: [ s for i in range(n) ]

codec = codecs.lookup('cp1252')

data = pd.DataFrame(columns=['org', 'article', 'analysis', 'index', 'cell', 'code'])

code_re = r'\[([^\]]+)\]\s([A-Za-z][^\n]+)\n?'  # Regular expression for parsing my coding comments: "[cell] code"

ptrn = os.path.join('.', 'notebooks', '**', '**', '*.html.pdf')
for fn in glob.iglob(ptrn, recursive=False):
    org, article, analysis = fn.split(os.sep)[2:]  # ./notebooks/<org>/<article>/<analysis>.html.pdf
    with open(fn, 'rb') as fobj:
        annots, outlines = pdfannots.process_file(fobj, codec, False)
    codes = []
    for annot in annots:
        if annot.contents is not None:
            codes += re.findall(code_re, annot.contents)
    df = pd.DataFrame({
        'org': rep(org, len(codes)),
        'article': rep(article, len(codes)),
        'analysis': rep(analysis[:-9], len(codes)),  # slice off file extension
        'index': [ i for i in range(len(codes)) ],
        'cell': [ c[0].strip() for c in codes ],
        'code': [ c[1].strip().lower() for c in codes ]
    })
    data = pd.concat([data, df])  # DataFrame.append was removed in pandas 2.0

CPU times: user 19 s, sys: 80 ms, total: 19 s
Wall time: 19.2 s
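
To make the expected comment format concrete, here is a minimal sketch of how a pattern shaped like `code_re` splits one annotation into (cell, code) pairs. The annotation text below is made up for illustration, and the character class is assumed to be `[A-Za-z]`.

```python
import re

# Same shape as code_re above: "[cell label] code text", one entry per line.
code_re = r'\[([^\]]+)\]\s([A-Za-z][^\n]+)\n?'

# Hypothetical annotation body; real annotations come from the PDFs.
contents = "[paragraph 1] Use third-party data\n[1] Annotate workflow\n"

# Normalize the same way the extraction cell does: strip both, lowercase the code.
pairs = [(cell.strip(), code.strip().lower())
         for cell, code in re.findall(code_re, contents)]
print(pairs)
# [('paragraph 1', 'use third-party data'), ('1', 'annotate workflow')]
```

Each bracketed label becomes the `cell` column and the text after it becomes the `code` column.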


In [10]:
data.head()

| | org | article | analysis | index | cell | code |
|---|---|---|---|---|---|---|
| 0 | baltimore-sun-data | 2018-voter-registration | 01_processing | 0 | paragraph 1 | use third-party data |
| 1 | baltimore-sun-data | 2018-voter-registration | 01_processing | 1 | paragraph 1 | pull tables out of pdf |
| 2 | baltimore-sun-data | 2018-voter-registration | 01_processing | 2 | 1 | annotate workflow |
| 3 | baltimore-sun-data | 2018-voter-registration | 01_processing | 3 | 1 | change column data type |
| 4 | baltimore-sun-data | 2018-voter-registration | 01_processing | 4 | 1 | canonicalize column names |


Summarize the current coding progress

In [11]:
summary_stats = [
    len(data['article'].unique()),
    len(data['code'].unique())
]

print('Articles: {}\nCodes: {}'.format(*summary_stats))

Articles: 17
Codes: 140


# Hierarchical Code Groups

In [18]:
def walkTheYaml(parent, children, func):
    """ A recursive, pre-order traversal of the code groups YAML structure"""
    for child in children:
        if isinstance(child, str):
            # Leaf nodes are strings.
            func(parent, child, True)
        elif isinstance(child, dict):
            # Interior nodes are dictionaries.
            key = list(child.keys())[0]
            func(parent, key, False)
            walkTheYaml(key, child[key], func)
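
As a rough illustration of the traversal order, the sketch below runs the same walk over a small, hand-written hierarchy. The structure is an assumption about what the parsed YAML looks like (strings for leaves, single-key dictionaries for groups), and the group name `Clean` is hypothetical.

```python
def walk_the_yaml(parent, children, func):
    """Pre-order traversal, mirroring walkTheYaml above."""
    for child in children:
        if isinstance(child, str):
            func(parent, child, True)        # leaf
        elif isinstance(child, dict):
            key = list(child.keys())[0]
            func(parent, key, False)         # group is visited before its children
            walk_the_yaml(key, child[key], func)

# Hypothetical hierarchy, shaped like the assumed parsed YAML.
hierarchy = [
    {"Clean": ["change column data type", "canonicalize column names"]},
    "annotate workflow",
]

visits = []
walk_the_yaml("Wrangling", hierarchy,
              lambda parent, child, is_leaf: visits.append((parent, child, is_leaf)))

for visit in visits:
    print(visit)
```

The callback receives `(parent, child, is_leaf)`, so the same walk can be reused to collect leaves or to build a tree.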

## Sanity Check

Double-check that every code generated during open coding is covered by the hierarchy, and that every leaf in `code_tree.yaml` actually appears in a `.html.pdf` file.

In [20]:
root = 'Wrangling'
leaves = []
gatherLeaves = lambda p, c, l: leaves.append(c.strip().lower()) if l else None
code_tree = 'code_tree.yaml'

with open(code_tree, 'r') as f:
    code_hierarchy = yaml.safe_load(f)

walkTheYaml(root, code_hierarchy, gatherLeaves)
leaves = set(leaves)
pdf_codes = set(data['code'].unique())

diff = lambda a, b, codes: print('Codes in {} but not in {}:\n{}\n'.format(a, b, '\n'.join(['\t- ' + c for c in codes])))

ungrouped = pdf_codes - leaves  # codes found in the PDFs but missing from the hierarchy
unused = leaves - pdf_codes     # codes in the hierarchy that never appear in a PDF

if not ungrouped and not unused:  # both differences are empty
    print("All codes have been grouped 😎")
else:
    if ungrouped:
        diff('*.html.pdf', code_tree, ungrouped)
    if unused:
        diff(code_tree, '*.html.pdf', unused)

All codes have been grouped 😎


## Display hierarchy

### Notes on Codes

* **Formulate performance metric**: specifying a calculation that is later used to compare different entities. A recurring theme across many of these notebooks is comparing different entities, such as political parties, by a common, quantitative metric, such as percentage of all newly registered voters.
* **Figure a rate**: any operation that considers one group's relation to the whole. This code covers: simple rational numbers, percentages, and per-1000 rates.
* **Merge metadata**: joining an auxiliary table to the primary table to provide context for the phenomenon currently being analyzed.
* **Merge data sources** refers to combining different schemas into one table. For example, *The Oregonian* compared complaints provided from a government agency with complaints scraped from the web.
* **Detrend data**: "filter out the secular effect in order to see what is going on specifically with the phenomenon you are investigating," Philip Meyer in *Precision Journalism*. This includes adjusting for inflation, population growth, and season. 
* **Extract data from non-tabular form** includes scraping data from the web and parsing structured ASCII data (such as .fec files).
* **Change dataset resolution** refers to changing, usually by decreasing, the granularity of the observations represented as rows in the table. Changes to dataset resolution are often caused by aggregation operations. For example, if every row in a table represents an observation on a given date, then the dataset can be grouped by a coarser time interval, such as month, and aggregate quantitative values, such as the sum or mean, can be computed.
* **Consolidate data sources** refers to combining multiple tables into one table. This wrangling activity often occurs when data is located in separate, although not disparate, sources. For example, a government agency may publish data in a CSV file every year, but a data journalist wants to compare data across many years.
* **Generate data computationally** refers to programmatically generating raw data, e.g. with `range` in Python. In "Heat and Index", Sahil Chinoy computationally generates temperature and humidity data.
* **Create Unique Key** *added donor-movement* is an interesting code because it occurs frequently when dealing with campaign finance data. There may be algorithmic approaches that find a unique key from any given combination of columns. Unique keys are mainly names and places concatenated. Could randomly sampling the dataset save time?
* **Calculate z-score**: calculating how many standard deviations a value in a column is from the mean, $(x_i - \bar{x})/\sigma_x \quad \forall x_i \in x$. Journalists do this either to find outliers in a dataset or to prepare the data for principal component analysis.
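
Two of the notes above, changing dataset resolution and calculating a z-score, can be sketched in a few lines. This is only an illustration on made-up numbers using the standard library. It assumes the population standard deviation for $\sigma_x$, whereas the coded notebooks may well use a sample standard deviation.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical daily observations: (ISO date, count).
daily = [
    ("2018-01-03", 4),
    ("2018-01-17", 6),
    ("2018-02-02", 5),
    ("2018-02-20", 5),
    ("2018-03-11", 30),
]

# Change dataset resolution: group days into months and sum the counts.
monthly = defaultdict(int)
for date, count in daily:
    monthly[date[:7]] += count  # the "YYYY-MM" prefix is the coarser key
monthly = dict(monthly)
# monthly == {"2018-01": 10, "2018-02": 10, "2018-03": 30}

# Calculate z-score: (x_i - mean) / standard deviation (population, by assumption).
values = list(monthly.values())
mu, sigma = mean(values), pstdev(values)
z_scores = {month: (total - mu) / sigma for month, total in monthly.items()}

for month, z in z_scores.items():
    print(month, round(z, 2))
```

March stands out with the only positive z-score, which is the kind of outlier signal the z-score note describes.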

In [None]:
tree = treelib.Tree()
addNode = lambda p, c, is_leaf: tree.create_node(c.title(), c.lower(), parent=p.lower() if p is not None else None)

addNode(None, root, False)
walkTheYaml(root, code_hierarchy, addNode)

tree.show(line_type='ascii-em')

## Display all codes

Show all the unique codes generated so far, and link them to the articles in which they appear.

In [None]:
data.groupby(['code', 'article', 'analysis'])['analysis'].count().to_frame('count')

In [None]:
data['mark'] = '✔'

(
    data[['code', 'org', 'mark']]
        .drop_duplicates(['code', 'org'])  # Drop duplicate codes within an organization
        .set_index(['code', 'org'])
        .unstack(fill_value='')
)