# Codebook

In [1]:
import os, glob, re, subprocess
import pandas as pd

pd.set_option("display.max_rows", None)  # Show every row; don't elide long tables with an ellipsis

In [2]:
rep = lambda s, n: [ s for i in range(n) ]

data = pd.DataFrame(columns=['org', 'article', 'analysis', 'index', 'level', 'cell', 'code'])
levels = {'ll': 'low', 'ml': 'medium', 'hl': 'high'}

ptrn = os.path.join('.', 'notebooks', '**', '**', '*.html.pdf')
for fn in glob.iglob(ptrn, recursive=True):
    org, article, analysis = fn.split(os.sep)[2:]  # path is ./notebooks/<org>/<article>/<analysis>.html.pdf
    
    proc = subprocess.Popen(['python', './pdfannots/pdfannots.py', fn], stdout=subprocess.PIPE)
    annots = proc.communicate()[0].decode('utf-8')
    codes = re.findall(r'([HML]L)\s\[([^\]]+)\]\s([^\n]+)\n', annots)
    df = pd.DataFrame({
        'org': rep(org, len(codes)),
        'article': rep(article, len(codes)),
        'analysis': rep(analysis[:-9], len(codes)),  # slice off file extension
        'index': [ i for i in range(len(codes)) ],
        'level': [ levels[c[0].strip().lower()] for c in codes ],
        'cell': [ c[1].strip() for c in codes ],
        'code': [ c[2].strip().lower() for c in codes ]
    })
    data = pd.concat([data, df])  # DataFrame.append was removed in pandas 2.0

data.head()

| | org | article | analysis | index | level | cell | code |
|---|---|---|---|---|---|---|---|
| 0 | baltimore-sun-data | 2018-voter-registration | 01_processing | 0 | medium | paragraph 1 | use third-party data |
| 1 | baltimore-sun-data | 2018-voter-registration | 01_processing | 1 | low | paragraph 1 | extract data from non-tabular form |
| 2 | baltimore-sun-data | 2018-voter-registration | 01_processing | 2 | low | 1 | annotate workflow |
| 3 | baltimore-sun-data | 2018-voter-registration | 01_processing | 3 | low | 1 | change column data type |
| 4 | baltimore-sun-data | 2018-voter-registration | 01_processing | 4 | high | 1 | cleanup data |
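
To make the parsing step concrete, here is a minimal sketch that runs the same regular expression over a hand-written sample. The sample text is an assumption about what the annotation extractor emits (a level tag, a bracketed cell reference, then the code); the real `pdfannots` output may be formatted differently.

```python
import re

# Hypothetical annotation text; the real pdfannots output may differ.
sample = (
    "ML [paragraph 1] use third-party data\n"
    "LL [1] Change column data type\n"
    "HL [1] cleanup data\n"
)

levels = {'ll': 'low', 'ml': 'medium', 'hl': 'high'}
pattern = r'([HML]L)\s\[([^\]]+)\]\s([^\n]+)\n'

# Each match is a (tag, cell, code) tuple, normalized the same way as in the loop above.
rows = [
    {'level': levels[tag.lower()], 'cell': cell.strip(), 'code': code.strip().lower()}
    for tag, cell, code in re.findall(pattern, sample)
]

for row in rows:
    print(row)
```

Note that the pattern requires a trailing newline, so a final annotation without one would be skipped.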


Summarize coding progress so far.

In [3]:
summary_stats = [
    len(data['article'].unique()),
    len(data['code'].unique())
]

print('Articles: {}\nCodes: {}'.format(*summary_stats))

Articles: 6
Codes: 58


## Display all codes

Show all the unique codes generated so far, and link them to the articles in which they appear.

In [4]:
data['mark'] = '✔'
codes = (
    data[['level', 'code', 'article', 'mark']]
        .drop_duplicates(['code', 'article'])  # Drop duplicate codes within an article
        .set_index(['level', 'code', 'article'])
        .unstack(fill_value='')
)
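
The reshaping in the cell above can be checked on a toy table. This is only a sketch: the article names and rows below are invented, and only the chain of `drop_duplicates`, `set_index`, and `unstack` mirrors the notebook.

```python
import pandas as pd

# Invented rows: one code appears twice in the same article, to show deduplication.
toy = pd.DataFrame({
    'level':   ['high', 'high', 'high', 'low'],
    'code':    ['cleanup data', 'cleanup data', 'cleanup data', 'drop columns'],
    'article': ['article-a', 'article-a', 'article-b', 'article-b'],
})
toy['mark'] = '✔'

matrix = (
    toy.drop_duplicates(['code', 'article'])        # one mark per code per article
       .set_index(['level', 'code', 'article'])
       .unstack(fill_value='')                      # articles become columns
)['mark']

print(matrix)
```

Each row of `matrix` is a (level, code) pair and each column is an article, with an empty string where the code does not appear.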

## High level codes

* **Change dataset resolution** refers to changing, usually by decreasing, the granularity of the observations represented as rows in a table. Changes in dataset resolution are often caused by aggregation operations. For example, if every row in a table represents an observation on a given date, the rows can be grouped by a coarser time interval, such as month, and aggregate values, such as the sum or mean, can be computed for each group.

* **Consolidate data sources** refers to combining multiple tables into one table. This wrangling activity often occurs when data is located in separate, although not disparate, sources. For example, a government agency may publish data in a CSV file every year, but a data journalist wants to compare data across many years.

* **Merge data sources** refers to combining fundamentally different tables into one table. For example, *The Oregonian* compared complaints provided by a government agency with complaints scraped from the web. The key distinction here is that *merge* denotes combining datasets for insights while *consolidate* denotes combining datasets for convenience.
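
As a rough illustration of these three codes in pandas, consider the sketch below. All table names, column names, and values are invented; the notebooks being coded may implement these activities quite differently.

```python
import pandas as pd

# Change dataset resolution: daily observations aggregated to monthly totals.
daily = pd.DataFrame({
    'date': pd.to_datetime(['2018-01-03', '2018-01-17', '2018-02-05']),
    'registrations': [10, 15, 7],
})
monthly = daily.groupby(daily['date'].dt.to_period('M'))['registrations'].sum()

# Consolidate data sources: stack yearly files that share one schema.
y2017 = pd.DataFrame({'county': ['A', 'B'], 'complaints': [3, 5]}).assign(year=2017)
y2018 = pd.DataFrame({'county': ['A', 'B'], 'complaints': [4, 2]}).assign(year=2018)
consolidated = pd.concat([y2017, y2018], ignore_index=True)

# Merge data sources: join fundamentally different tables on a shared key.
agency = pd.DataFrame({'facility': ['X', 'Y'], 'agency_complaints': [2, 0]})
scraped = pd.DataFrame({'facility': ['X', 'Y'], 'scraped_complaints': [5, 1]})
merged = agency.merge(scraped, on='facility', how='outer')
```

In this framing, consolidation adds rows with the same columns, while merging adds columns that come from a different kind of source.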


In [5]:
codes.query("level == 'high'").reset_index(level='level', col_level=1)['mark']

| code | 2018-voter-registration | 2019-04-democratic-candidate-codonors | california-ccscore-analysis | california-crop-production-wages-analysis | census-hard-to-map-analysis | long-term-care-db |
|---|---|---|---|---|---|---|
| architect a subroutine | | ✔ | | | | |
| architect parallel workflows | ✔ | ✔ | ✔ | ✔ | | |
| architect repeating process | ✔ | | | ✔ | | |
| change dataset resolution | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
| cleanup data | ✔ | ✔ | | ✔ | ✔ | ✔ |
| compare trends over time | ✔ | | ✔ | ✔ | | |
| consolidate data sources | ✔ | | | ✔ | | ✔ |
| format data for graphics | ✔ | | | ✔ | ✔ | |
| locate outliers in dataset | | | ✔ | | | |
| merge data sources | | | | | | ✔ |


## Medium level codes

* **Formulate performance metric**: specifying a calculation that is later used to compare different entities. A recurring theme across many of these notebooks is comparing different entities, such as political parties, by a common quantitative metric, such as the percentage of all newly registered voters.
* **Figure a rate**: any operation that considers one group's relation to the whole. This code covers simple rational numbers, percentages, and per-1000 rates.
* **Merge metadata**: joining an auxiliary table to the primary table to provide context for the phenomenon currently being analyzed.
* **Detrend data**: "filter out the secular effect in order to see what is going on specifically with the phenomenon you are investigating," Philip Meyer in *Precision Journalism*. This includes adjusting for inflation, population growth, and season. 
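
A small pandas sketch may help pin down three of these codes. Every number, column name, and the price index below is made up for illustration, and the inflation adjustment shown is just one common way to detrend.

```python
import pandas as pd

# Invented counts of newly registered voters by party.
voters = pd.DataFrame({
    'party': ['DEM', 'REP', 'UNA'],
    'new_voters': [500, 300, 200],
})

# Figure a rate: each party's share of all new voters, as a percentage.
voters['pct_of_total'] = 100 * voters['new_voters'] / voters['new_voters'].sum()

# Detrend data: express nominal wages in 2018 dollars using a made-up price index.
wages = pd.DataFrame({
    'year': [2010, 2018],
    'nominal_wage': [10.0, 12.0],
    'cpi': [100.0, 120.0],
})
base_cpi = wages.loc[wages['year'] == 2018, 'cpi'].iloc[0]
wages['real_wage'] = wages['nominal_wage'] * base_cpi / wages['cpi']

# Merge metadata: attach descriptive names from an auxiliary lookup table.
party_names = pd.DataFrame({
    'party': ['DEM', 'REP', 'UNA'],
    'party_name': ['Democratic', 'Republican', 'Unaffiliated'],
})
voters = voters.merge(party_names, on='party', how='left')
```

Here the nominal wage rises between the two years, but the inflation-adjusted wage is flat, which is the kind of secular effect detrending is meant to remove.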

In [6]:
codes.query("level == 'medium'").reset_index(level='level', col_level=1)['mark']

| code | 2018-voter-registration | 2019-04-democratic-candidate-codonors | california-ccscore-analysis | california-crop-production-wages-analysis | census-hard-to-map-analysis | long-term-care-db |
|---|---|---|---|---|---|---|
| calculate central tendency | | | ✔ | ✔ | | |
| calculate percentage difference | ✔ | | ✔ | | | |
| calculate standardized scores | | | ✔ | | | |
| create a lookup table | | | | | | ✔ |
| detrend data | | | ✔ | ✔ | | |
| encode provenance in table | | | | | | ✔ |
| figure a rate | ✔ | ✔ | | | | |
| formulate performance metric | ✔ | | ✔ | ✔ | | ✔ |
| merge metadata | | ✔ | | | ✔ | ✔ |
| prevent double-counting | | ✔ | | | | |


## Low level codes

* **Count in table** denotes looking up the total rows in a table as well as the number of unique values in a column.
* **Intra-column arithmetic** refers to any arithmetic operation, including count, on all values within a column.
* **Inter-column arithmetic** refers to any arithmetic operation on all values between columns.
* **Extract data from non-tabular form** includes scraping data from the web and parsing structured ASCII data (such as .fec files).
* **De-null data**: drop table rows where a column value is null.
* **Melt table** refers to ??
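
The defined low-level codes map onto short pandas idioms. The sketch below uses an invented table (column names and values are hypothetical) and covers only the codes defined above.

```python
import pandas as pd

# Invented facility table with one missing value in 'beds'.
df = pd.DataFrame({
    'county': ['A', 'B', 'B', 'C'],
    'beds': [100, 80, None, 40],
    'residents': [90, 60, 30, 20],
})

# Count in table: total rows and number of unique values in a column.
n_rows = len(df)
n_counties = df['county'].nunique()

# De-null data: drop rows where 'beds' is null.
clean = df.dropna(subset=['beds'])

# Intra-column arithmetic: one operation over all values within a column.
total_residents = clean['residents'].sum()

# Inter-column arithmetic: an operation between columns, computed row by row.
clean = clean.assign(occupancy=clean['residents'] / clean['beds'])
```

The distinction between the two arithmetic codes is the direction of the operation: intra-column reduces down a single column, while inter-column combines values across columns within each row.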

In [7]:
codes.query("level == 'low'").reset_index(level='level', col_level=1)['mark']

| code | 2018-voter-registration | 2019-04-democratic-candidate-codonors | california-ccscore-analysis | california-crop-production-wages-analysis | census-hard-to-map-analysis | long-term-care-db |
|---|---|---|---|---|---|---|
| annotate workflow | ✔ | ✔ | | ✔ | | |
| canonicalize column names | ✔ | ✔ | | | ✔ | ✔ |
| change column data type | ✔ | | ✔ | ✔ | ✔ | |
| construct pivot table | | | | | | ✔ |
| count in table | | ✔ | ✔ | | | ✔ |
| create a unique key | | ✔ | | | | |
| create constant column | | | | | | ✔ |
| de-null data | | | | | | ✔ |
| deduplicate | | | ✔ | | | ✔ |
| drop columns | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
