# Codebook

In [3]:
import os, glob, re, subprocess
import pandas as pd
import yaml
import treelib

pd.set_option("display.max_rows", None)  # Don't truncate rows when printing a Pandas DataFrame instance

# Extract codes from PDFs

The cell below extracts the codes from each PDF.

In [10]:
%%time
rep = lambda s, n: [ s for i in range(n) ]

data = pd.DataFrame(columns=['org', 'article', 'analysis', 'index', 'cell', 'code'])
levels = {'ll': 'low', 'ml': 'medium', 'hl': 'high'}

ptrn = os.path.join('.', 'notebooks', '**', '**', '*.html.pdf')
for fn in glob.iglob(ptrn, recursive=True):
    org, article, analysis = fn.split('/')[2:]
    
    proc = subprocess.Popen(['python', './pdfannots/pdfannots.py', fn], stdout=subprocess.PIPE)
    annots = proc.communicate()[0].decode('utf-8')
    codes = re.findall(r'--\s\[([^\]]+)\]\s([^\n]+)\n', annots)
    df = pd.DataFrame({
        'org': rep(org, len(codes)),
        'article': rep(article, len(codes)),
        'analysis': rep(analysis[:-9], len(codes)),  # slice off file extension
        'index': [ i for i in range(len(codes)) ],
        'cell': [ c[0].strip() for c in codes ],
        'code': [ c[1].strip().lower() for c in codes ]
    })
    data = pd.concat([data, df])  # DataFrame.append was removed in pandas 2.0

data.head()

CPU times: user 130 ms, sys: 200 ms, total: 330 ms
Wall time: 45.7 s
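
To make the regular expression in the cell above easier to follow, here is a minimal sketch that applies the same pattern to a made-up annotation string. The sample text is an assumption for illustration only; the actual output of `pdfannots.py` may be laid out differently.

```python
import re

# Hypothetical annotation text (an assumption, not real pdfannots output).
sample_annots = (
    " -- [In 4] Calculate Mean\n"
    " -- [In 7]  Adjust for Inflation \n"
    "A line without a bracketed cell label is ignored.\n"
)

# Same pattern as the extraction cell:
# "--", one whitespace character, "[cell]", one whitespace character, code text, newline.
pattern = r'--\s\[([^\]]+)\]\s([^\n]+)\n'
codes = re.findall(pattern, sample_annots)

# Mirror the clean-up applied when building the DataFrame.
cells = [c[0].strip() for c in codes]
labels = [c[1].strip().lower() for c in codes]

print(list(zip(cells, labels)))
```

Each match yields a `(cell, code)` pair, and the code text is stripped and lower-cased so that the same code written with different capitalization collapses to a single label.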


Summarize coding progress so far.

In [11]:
summary_stats = [
    len(data['article'].unique()),
    len(data['code'].unique())
]

print('Articles: {}\nCodes: {}'.format(*summary_stats))

Articles: 8
Codes: 63


# Axial Coding

In [12]:
def walkTheYaml(parent, children, func):
    """ A recursive, pre-order traversal of the code groups YAML structure"""
    for child in children:
        if isinstance(child, str):
            # Leaf nodes are strings.
            func(parent, child, True)
        elif isinstance(child, dict):
            # Interior nodes are dictionaries.
            key = list(child.keys())[0]
            func(parent, key, False)
            walkTheYaml(key, child[key], func)
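
As a rough illustration of the traversal order, the sketch below runs `walkTheYaml` on a small hand-written hierarchy. The hierarchy is hypothetical; it only assumes that `yaml.safe_load` returns a list in which leaves are strings and groups are single-key dictionaries, which is the shape the function expects.

```python
def walkTheYaml(parent, children, func):
    """Recursive, pre-order traversal (copied from the cell above)."""
    for child in children:
        if isinstance(child, str):
            func(parent, child, True)
        elif isinstance(child, dict):
            key = list(child.keys())[0]
            func(parent, key, False)
            walkTheYaml(key, child[key], func)

# Hypothetical hierarchy, shaped like a parsed axial.yaml might be.
toy_hierarchy = [
    {'Cleaning': [
        'Change column data type',
        {'Deduplicate': ['Create a unique key', 'Prevent double-count']},
    ]},
    'Use tabular data',
]

# Record every callback invocation as (parent, node, is_leaf).
visits = []
walkTheYaml('Wrangling', toy_hierarchy,
            lambda p, c, is_leaf: visits.append((p, c, is_leaf)))

for visit in visits:
    print(visit)
```

Each group is reported before its children, which is what lets a callback such as a tree builder create a parent node before attaching anything to it.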

Double-check that every code generated during open coding is covered by the hierarchy.

In [14]:
root = 'Wrangling'
leaves = []
gatherLeaves = lambda p, c, l: leaves.append(c.lower()) if l else None

with open('axial.yaml', 'r') as f:
    code_hierarchy = yaml.safe_load(f)

walkTheYaml(root, code_hierarchy, gatherLeaves)
pdf_codes = set(data['code'].unique())

diff = set(pdf_codes) - set(leaves)
if (len(diff) == 0):  # is null set
    print("All codes have been grouped 😎")
else:
    print("Not all codes accounted for! ")
    print('\n'.join([ '- ' + d for d in list(diff)]))

Not all codes accounted for! 
- extract date from datetime column
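
The check above only looks in one direction. A possible extension, sketched here with made-up values, also reports hierarchy leaves that never appear in the PDFs, which can reveal misspelled or stale entries in the YAML file. The names `ungrouped` and `unused` are hypothetical.

```python
# Hypothetical stand-ins for the values computed in the cell above.
pdf_codes = {'calculate mean', 'extract date from datetime column'}
leaves = ['calculate mean', 'adjust for season']

# Codes found in the PDFs that the hierarchy does not contain.
ungrouped = sorted(set(pdf_codes) - set(leaves))

# Leaves in the hierarchy that no PDF annotation uses.
unused = sorted(set(leaves) - set(pdf_codes))

print('Ungrouped codes:', ungrouped)
print('Unused leaves:', unused)
```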


## Display hierarchy

### Notes on Codes

* **Formulate performance metric**: specifying a calculation that is later used to compare different entities. A recurring theme across many of these notebooks is comparing different entities, such as political parties, by a common quantitative metric, such as the percentage of all newly registered voters.
* **Figure a rate**: any operation that considers one group's relation to the whole. This code covers: simple rational numbers, percentages, and per-1000 rates.
* **Merge metadata**: joining an auxiliary table to the primary table to provide context for the phenomenon currently being analyzed.
* **Detrend data**: "filter out the secular effect in order to see what is going on specifically with the phenomenon you are investigating," Philip Meyer in *Precision Journalism*. This includes adjusting for inflation, population growth, and season. 
* **Count in table** denotes looking up the total rows in a table as well as the number of unique values in a column.
* **Intra-column arithmetic** refers to any arithmetic operation, including count, on all values within a column.
* **Inter-column arithmetic** refers to any arithmetic operation on all values between columns.
* **Extract data from non-tabular form** includes scraping data from the web and parsing structured ASCII data (such as .fec files).
* **De-null data**: drop table rows where a column value is null.
* **melt table** refers to ??

* **Change dataset resolution** refers to changing, usually but not necessarily decreasing, the granularity of the observations represented as rows in the table. Changes to dataset resolution are often caused by aggregation operations. For example, if every row in a table represents an observation on a given date, then the dataset can be grouped by a coarser time interval, such as month, and aggregate quantitative values, such as sum or mean, can be computed.

* **Consolidate data sources** refers to combining multiple tables into one table. This wrangling activity often occurs when data is located in separate, although not disparate, sources. For example, a government agency may publish data in a CSV file every year, but a data journalist wants to compare data across many years.

* **Merge data sources** refers to combining fundamentally different tables into one table. For example, *The Oregonian* compared complaints provided by a government agency with complaints scraped from the web. The key distinction here is that *merge* denotes combining datasets for insights, while *consolidate* denotes combining datasets for convenience.
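
To ground a few of these codes, here is a minimal pandas sketch of what *change dataset resolution*, *consolidate data sources*, *merge metadata*, and *figure a rate* might look like in a notebook. All table names, column names, and values are invented for illustration and do not come from the coded articles.

```python
import pandas as pd

# Change dataset resolution: daily observations aggregated to months.
daily = pd.DataFrame({
    'date': pd.to_datetime(['2018-01-05', '2018-01-20', '2018-02-03']),
    'complaints': [3, 5, 2],
})
monthly = daily.groupby(daily['date'].dt.to_period('M'))['complaints'].sum()

# Consolidate data sources: same-shaped yearly tables stacked into one.
y2017 = pd.DataFrame({'county': ['A', 'B'], 'registered': [100, 250], 'year': 2017})
y2018 = pd.DataFrame({'county': ['A', 'B'], 'registered': [120, 240], 'year': 2018})
consolidated = pd.concat([y2017, y2018], ignore_index=True)

# Merge metadata: join an auxiliary table that provides context.
population = pd.DataFrame({'county': ['A', 'B'], 'population': [1000, 5000]})
merged = consolidated.merge(population, on='county', how='left')

# Figure a rate: registrations per 1,000 residents.
merged['per_1k'] = merged['registered'] * 1000 / merged['population']

print(monthly)
print(merged)
```

Note that consolidating stacks rows that share a schema, while merging joins columns from a table with a different schema on a shared key.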

In [8]:
tree = treelib.Tree()
addNode = lambda p, c, is_leaf: tree.create_node(c.title(), c.lower(), parent=p.lower() if p is not None else None)  # is_leaf is unused here

addNode(None, root, False)
walkTheYaml(root, code_hierarchy, addNode)

tree.show(line_type='ascii-em')

Wrangling
╠══ Acquisition
║   ╠══ Data Source
║   ║   ╠══ Extract Data In Non-Tabular
║   ║   ║   ╠══ Pull Tables Out Of Pdf
║   ║   ║   ╚══ Scrape Web For Data
║   ║   ╚══ Use Third-Party Data
║   ╚══ Data Type
║       ╠══ Import Structured Ascii
║       ╠══ Use Geospatial Data
║       ╠══ Use Structured Ascii
║       ╚══ Use Tabular Data
╠══ Analysis
║   ╠══ Compare Trends Over Time
║   ║   ╠══ Calculate Difference
║   ║   ╚══ Percentage Difference
║   ╠══ Count Column Values
║   ╚══ Count Unique Values In Column
╠══ Cleaning
║   ╠══ Canonicalize Column Names
║   ╠══ Change Column Data Type
║   ╠══ Deduplicate
║   ║   ╠══ Create A Unique Key
║   ║   ╠══ Drop Entirely Duplicate Rows
║   ║   ╠══ Drop Rows With Duplicate Value In One Column
║   ║   ╚══ Prevent Double-Count

## Display all codes

Show all the unique codes generated so far, and link them to the articles in which they appear.

In [9]:
data['mark'] = '✔'

(
    data[['code', 'analysis', 'mark']]
        .drop_duplicates(['code', 'analysis'])  # Drop duplicate codes within an article
        .set_index(['code', 'analysis'])
        .unstack(fill_value='')
)

| code | 01_processing | 02-transform | 02_analysis | 03-analysis | analysis | analyze-campaign-codonors | crimes-and-heat | facilities-analysis | mung-3-25-scrape | notebook | sahil_chinoy_heat_index |
|---|---|---|---|---|---|---|---|---|---|---|---|
| adjust for inflation | | ✔ | | | | | | | | | |
| adjust for season | | | | | ✔ | | | | | | |
| annotate workflow | ✔ | | ✔ | ✔ | | ✔ | | | | | |
| architect a subroutine | | | | | | ✔ | | | | | |
| architect parallel workflows | ✔ | ✔ | | | ✔ | ✔ | | | | | |
| architect repeating process | ✔ | ✔ | | | | | | | | | |
| calculate difference | | | | | ✔ | | | | | | |
| calculate mean | | ✔ | | | ✔ | | ✔ | | | | |
| calculate per 1k | | | | | | ✔ | | | | | |
| calculate percentage | ✔ | | | | | | | ✔ | | | |
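
The code-by-analysis matrix above comes from a deduplicate, index, and unstack chain. The sketch below reproduces that chain on a tiny invented table so the reshaping is easier to see. Unlike the cell above, it selects the `mark` column as a Series before unstacking, which is a small variation that avoids the extra `mark` level in the column header.

```python
import pandas as pd

# Hypothetical open-coding records: one row per annotation.
toy = pd.DataFrame({
    'code': ['calculate mean', 'calculate mean', 'calculate mean', 'adjust for season'],
    'analysis': ['02-transform', '02-transform', 'analysis', 'analysis'],
})
toy['mark'] = '✔'

matrix = (
    toy.drop_duplicates(['code', 'analysis'])   # one mark per (code, analysis) pair
       .set_index(['code', 'analysis'])['mark']  # Series with a two-level index
       .unstack(fill_value='')                   # analyses become columns
)

print(matrix)
```

Dropping duplicates first matters because `unstack` raises an error when the index contains repeated `(code, analysis)` pairs.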
