# Codebook

This Jupyter notebook has four purposes:

1. Parse all the open codes in PDF printouts of computational notebooks located in the `notebooks/` directory.
1. Parse the taxonomy of open and axial codes in `actions.yaml` and `process.yaml`. 
1. Calculate simple statistics about the codeset, such as how often each code occurs across analyses.
1. Export the codes to a CSV file that other notebooks can use without having to re-compile all the codes from the annotated PDFs.

In [1]:
import re, yaml
import pandas as pd
from lib.util import getCodes, displayMarkdown

%autosave 0

pd.set_option("display.max_rows", None)  # Don't truncate rows when printing a Pandas DataFrame instance

Autosave disabled


## Parse coded notebooks

To make the annotated PDFs easy to locate, all of them have a `.html.pdf` extension and sit inside this project directory, where they match the glob pattern `notebooks/**/**/*.html.pdf`. This extension distinguishes them from any PDFs checked into the repositories by the repos' original contributors.
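As a rough illustration of how that glob pattern selects files, here is a minimal sketch using only the standard library. The helper name `find_annotated_pdfs` and its `root` argument are assumptions made for this example; how `getCodes` actually discovers files is not shown in this notebook.

```python
import glob
import os

# Glob pattern quoted in the text; `root` is a hypothetical project directory.
PATTERN = "notebooks/**/**/*.html.pdf"


def find_annotated_pdfs(root="."):
    """Return the sorted, de-duplicated paths of annotated PDFs under `root`."""
    matches = glob.glob(os.path.join(root, PATTERN), recursive=True)
    # The set guards against one path being reported twice by the repeated `**`.
    return sorted(set(matches))


if __name__ == "__main__":
    for path in find_annotated_pdfs():
        print(path)
```

A plain `report.pdf` in the same directory would not be returned, because it lacks the `.html.pdf` suffix.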

We open-coded PDF printouts of computational notebooks and programming scripts using the comments feature in [Adobe Acrobat DC](https://acrobat.adobe.com/en/acrobat.html). Open codes are extracted from each PDF using some internals of the open-source [pdfannots CLI](https://github.com/0xabu/pdfannots). See the [main function in pdfannots.py](https://github.com/0xabu/pdfannots/blob/6dd8dd29a93a0f5ec55e4b47f0eb27d8088a11a0/pdfannots.py#L469) for more details. 

Warning: the data in `pdf_codes` has not been cleaned of spelling inconsistencies and other issues that crop up during manual data entry.

In [2]:
%%time
# This cell may take a while to complete.

pdf_codes = getCodes()
pdf_codes.head()

CPU times: user 4min 26s, sys: 1.39 s, total: 4min 27s
Wall time: 4min 30s


Unnamed: 0,org,analysis,notebook,index,cell,code
0,baltimore-sun-data,2018-voter-registration,01_processing.ipynb,0,paragraph 1,combine periodic data
1,baltimore-sun-data,2018-voter-registration,01_processing.ipynb,1,paragraph 1,extract data from pdf
2,baltimore-sun-data,2018-voter-registration,01_processing.ipynb,2,1,annotate workflow
3,baltimore-sun-data,2018-voter-registration,01_processing.ipynb,3,1,load
4,baltimore-sun-data,2018-voter-registration,01_processing.ipynb,4,1.1,change var type


## Parse code sets

Recursively traverse the YAML code trees to transform the data from a tree into tabular form. The node called "root" does not actually exist in the code tree.

In [3]:
def getCodeset(fn, debug=False):
    """Parse YAML file into a codeset dataframe"""
    with open(fn, 'r') as f:
        code_yaml = yaml.safe_load(f)
    
    codes = []

    def preTreeWalk(pNode, node, func, lvl=0):
        """ A recursive, pre-order traversal of the code groups YAML structure"""
        leaf = 'sub' not in node.keys()
        func(pNode, node, lvl, leaf)
        if not leaf:
            for child in node['sub']:
                if debug:
                    print(node, child)
                preTreeWalk(node, child, func, lvl + 1)

    parseYaml = lambda parent, child, lvl, leaf: codes.append({
        'parent': parent.get('name', '').lower(),
        'name': child['name'].lower(),
        'alias': child.get('alias'),
        'desc': child['desc'],
        'level': lvl,
        'is_leaf': leaf
    })

    for grp in code_yaml:
        preTreeWalk({'name': 'root'}, grp, parseYaml)

    codes = pd.DataFrame(codes)#[['parent', 'name', 'desc', 'level', 'is_leaf']]
    codes['type'] = fn.split('.')[0]
    return codes
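To make the traversal concrete, the sketch below flattens a small hand-written tree the same way, using plain dictionaries in place of a parsed YAML file. The toy tree and the `flatten` helper are assumptions for illustration only; the sketch ignores the `alias` and `desc` fields that `getCodeset` also records.

```python
def flatten(groups):
    """Flatten nested code groups into rows using a pre-order traversal."""
    rows = []

    def walk(parent_name, node, level):
        name = node["name"].lower()
        rows.append({
            "parent": parent_name,
            "name": name,
            "level": level,
            "is_leaf": "sub" not in node,  # a node without children is an open code
        })
        for child in node.get("sub", []):
            walk(name, child, level + 1)

    for group in groups:
        walk("root", group, 0)  # "root" is a synthetic parent, not a real node
    return rows


# Hypothetical fragment shaped like the structure the YAML files are parsed into.
toy_tree = [
    {"name": "Import", "sub": [
        {"name": "Fetch", "sub": [{"name": "API request"}]},
        {"name": "Load"},
    ]},
]

for row in flatten(toy_tree):
    print(row)
```

The toy tree yields four rows in the order `import`, `fetch`, `api request`, `load`. The parent is always emitted before its children, which is what makes the traversal pre-order.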

In [4]:
actions = getCodeset('actions.yaml')
processes = getCodeset('process.yaml')
codeset = pd.concat([actions, processes])
codeset.head()

Unnamed: 0,parent,name,alias,desc,level,is_leaf,type
0,root,import,,How raw data is introduced into the wrangling ...,0,False,actions
1,import,fetch,,Data is retrieved from a source external to th...,1,False,actions
2,fetch,extract data from pdf,,"Using a data extraction tool, such as Tabula, ...",2,True,actions
3,fetch,api request,Make an API Request,Making a request to a web service,2,True,actions
4,fetch,query database,Query a Database,Importing data through a database connection,2,True,actions


The file `translate.csv` is a lookup table for correcting typos in the codes applied to the PDFs. It maps each misspelled or outdated code to the matching string in the codeset.

In [5]:
translator = pd.read_csv('translate.csv')
translator['updated'] = translator.updated.str.strip().str.lower()
translator['code'] = translator.code.str.strip().str.lower()

codes = pdf_codes.merge(translator, how="left")
codes['name'] = codes['updated'].mask(
    pd.isnull, 
    codes['code']
)

codes.drop(columns=['code', 'updated'], inplace=True)
codes.head()

Unnamed: 0,org,analysis,notebook,index,cell,name
0,baltimore-sun-data,2018-voter-registration,01_processing.ipynb,0,paragraph 1,combine periodic data
1,baltimore-sun-data,2018-voter-registration,01_processing.ipynb,1,paragraph 1,extract data from pdf
2,baltimore-sun-data,2018-voter-registration,01_processing.ipynb,2,1,annotations
3,baltimore-sun-data,2018-voter-registration,01_processing.ipynb,3,1,load
4,baltimore-sun-data,2018-voter-registration,01_processing.ipynb,4,1.1,change var type
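The left merge followed by `mask` amounts to "use the corrected code if the lookup table has one, otherwise keep the original". A plain-Python sketch of that rule, assuming a hypothetical one-entry excerpt of the lookup table (the pair is taken from the two outputs above):

```python
# Hypothetical excerpt of the lookup table: raw code -> corrected code.
lookup = {"annotate workflow": "annotations"}


def translate(code, lookup):
    """Return the corrected code if one is listed, otherwise the code unchanged."""
    return lookup.get(code, code)


raw = ["combine periodic data", "annotate workflow", "load"]
print([translate(code, lookup) for code in raw])
# ['combine periodic data', 'annotations', 'load']
```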


### Calculate Shortcodes

We use shortcodes to refer to longer descriptions of open and axial codes in the paper. We follow this naming convention:

$$<1>.<2>.<3>.<4>.<5>$$

1. *A* for action and *P* for process
2. The first character of the top-level code, capitalized
3. Letters a-z, lowercase
4. Arabic numerals, starting at 1
5. Roman numerals, lowercase
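As a small worked example of this convention, the sketch below composes a shortcode from a code type, the name of a top-level code, and the 1-based sibling positions at each deeper level. The `shortcode` helper and its signature are assumptions for illustration; the notebook's own cell derives the positions from the sorted codeset instead.

```python
ROMAN = ["i", "ii", "iii", "iv", "v", "vi", "vii", "viii", "ix", "x"]
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def shortcode(code_type, top_level_name, positions=()):
    """Build a shortcode such as 'A.I.a.1'.

    `positions` holds the 1-based sibling positions at levels 1, 2 and 3.
    """
    parts = [code_type[0].upper(), top_level_name[0].upper()]
    formatters = [
        lambda n: LETTERS[n - 1],  # level 1: lowercase letter
        str,                       # level 2: Arabic numeral
        lambda n: ROMAN[n - 1],    # level 3: Roman numeral
    ]
    for fmt, n in zip(formatters, positions):
        parts.append(fmt(n))
    return ".".join(parts)


print(shortcode("actions", "import"))          # A.I
print(shortcode("actions", "import", (1, 1)))  # A.I.a.1
print(shortcode("process", "source", (2, 5)))  # P.S.b.5
```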

In [7]:
shortcodes = {}
roman = [None, 'i', 'ii', 'iii', 'iv', 'v', 'vi', 'vii', 'viii', 'ix', 'x']
# Index 0 is a placeholder so that position 1 maps to 'a'. The characters after 'z'
# are filler that keeps indexing from failing if a parent has more than 26 children.
alphabet = ' abcdefghijklmnopqrstuvwxyz asdfasdfsadfsdfasdf'

for codeType in codeset.type.unique():
    j = 1  # 1-based position of the current code among its siblings
    df = codeset[codeset.type == codeType].sort_values(['level', 'parent']) \
        .reset_index(drop=True)

    for i, row in df.iterrows():
        name = row['name']
        parent = row['parent']
        lvl = row['level']
        shortcodes[name] = []
        if i > 0:
            # Restart the sibling counter whenever the parent changes
            prevParent = df.loc[i - 1, 'parent']
            j = (j + 1) if prevParent == parent else 1

        if lvl == 0:  # positions 1 and 2: type letter and capitalized initial
            shortcodes[name] += [codeType[0].capitalize(), name[0].capitalize()]
        elif lvl == 1:  # position 3: lowercase letter
            shortcodes[name] = shortcodes.get(parent) + [alphabet[j]]
        elif lvl == 2:  # position 4: Arabic numeral
            shortcodes[name] += shortcodes.get(parent) + [str(j)]
        elif lvl == 3:  # position 5: Roman numeral
            shortcodes[name] += shortcodes.get(parent) + [roman[j]]
        else:
            raise ValueError("Codes went too deep")

shortcodes = { k: '.'.join(v) for (k,v) in shortcodes.items() }
shortcodes = pd.DataFrame.from_dict(shortcodes, orient='index', columns=['shortcode'])\
    .reset_index() \
    .rename(columns={'index': 'name'})


codeset = pd.merge(codeset, shortcodes, how='left', on='name')

### Quality Assurance

In this section we run several quality-assurance checks on our coding process.

#### Matching codes between notebooks and the codeset

The cell below ensures that there aren't any open codes in the codeset that are missing from the PDFs, and vice versa. More precisely, it checks that the difference between the set of open codes in `actions.yaml` + `process.yaml` and the set of unique codes that appear across all PDF printouts (and vice versa) is the empty set.
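The comparison boils down to two set differences. A minimal sketch on made-up names, where `compare_code_sets` is a hypothetical helper and "lod" stands in for a typo made while annotating a PDF:

```python
def compare_code_sets(codeset_names, pdf_names):
    """Return (codes only in the PDFs, codes only in the codeset), each sorted."""
    only_in_pdfs = sorted(set(pdf_names) - set(codeset_names))
    only_in_codeset = sorted(set(codeset_names) - set(pdf_names))
    return only_in_pdfs, only_in_codeset


only_in_pdfs, only_in_codeset = compare_code_sets(
    codeset_names={"load", "api request"},
    pdf_names={"load", "lod"},
)
print(only_in_pdfs)     # ['lod']
print(only_in_codeset)  # ['api request']
```

Both lists are empty exactly when the two sets are equal, which is the condition the cell below reports as a pass.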

In [8]:
# Convert from lists to sets
codeset_names = set(codeset[codeset.is_leaf == True].name)
pdf_names = set(codes['name'].unique())

# Find any discrepancies
diff = lambda a, b, codes: displayMarkdown('Codes in {} but not in {}:\n{}\n'.format(a, b, '\n'.join(['* ' + c for c in codes])))

falsePositives = pdf_names.difference(codeset_names)
falseNegatives = codeset_names.difference(pdf_names)

if not (bool(falsePositives) or bool(falseNegatives)):
    # Both sets are the null set
    displayMarkdown('<p>All the codes are A-OK!</p><img src="https://media.giphy.com/media/XreQmk7ETCak0/giphy.gif"> ')
else:
    # Problems
    if len(pdf_names.difference(codeset_names)) > 0:
        diff('PDFs', 'codeset', pdf_names.difference(codeset_names))
    if len(codeset_names.difference(pdf_names)) > 0:
        diff('codeset', 'PDFs', codeset_names.difference(pdf_names))

<p>All the codes are A-OK!</p><img src="https://media.giphy.com/media/XreQmk7ETCak0/giphy.gif"> 

#### Find codes in notebooks

Given a list of open codes in the `needles` variable, this code cell reveals which notebooks contain those open codes.

In [9]:
needles = ['visualize data', 'trim by date range']

codes['mark'] = '✔️'

codes[codes.name.isin([n.lower() for n in needles ])] \
    [['org', 'analysis', 'notebook', 'name', 'mark']] \
     .drop_duplicates() \
     .set_index(['org', 'analysis', 'notebook', 'name']) \
     .unstack(fill_value='')

org,analysis,notebook,trim by date range,visualize data
baltimore-sun-data,2018-voter-registration,02_analysis.ipynb,✔️,✔️
baltimore-sun-data,school-star-ratings-2018,analysis.ipynb,,✔️
buzzfeednews,2015-11-refugees-in-the-united-states,us-refugee-analysis.ipynb,,✔️
buzzfeednews,2016-09-shy-trumpers,shy-trumpers.nb,,✔️
buzzfeednews,2016-11-bellwether-counties,county-predictiveness.ipynb,✔️,
correctiv,awb-notebook,awb_meldungen.ipynb,,✔️
fivethirtyeight,bechdel,analyze-bechdel.R,✔️,✔️
fivethirtyeight,buster-posey-mvp,catcher_framing_capture.R,,✔️
fivethirtyeight,librarians,librarians.R,,✔️
fivethirtyeight,us-weather-history,visualize_weather.py,✔️,✔️


### Analysis Code Coverage

There exist a few taxonomies of data journalism that focus on analysis. At one point we were interested in this avenue of inquiry and wanted to make sure that every repo had an analysis code associated with it.

In [10]:
analysisCodes = codeset[codeset.parent == 'analysis'].name.unique()
codedAnalyses = set(codes[codes.name.isin(analysisCodes)].analysis.unique())
analyses = set(codes.analysis.unique())
codeDiff = analyses.difference(codedAnalyses)

if len(codeDiff) > 0:
    displayMarkdown("""
    The following analyses do not have analysis codes:

    * {}

    """.format('\n* '.join(codeDiff)))
else:
    displayMarkdown("All analyses have analysis codes")

All analyses have analysis codes

Right now, we only have data that associates an open code with an analysis script or computational notebook within a repo. We can't yet answer questions such as: how many analyses have a code within the clean branch of the actions taxonomy? The `codeAnalysis` data frame fills this gap by pairing each code, open or axial, with an analysis and notebook name as a single observation.

In [11]:
codeAnalysisTmp = pd.merge(codeset[codeset.is_leaf].copy()[['parent', 'name']], codes[['name', 'analysis', 'notebook']],
                        how='left',
                        left_on='name',
                        right_on='name') \
                .drop_duplicates()

codeAnalysis = codeAnalysisTmp[['name', 'analysis', 'notebook']].copy()

while codeAnalysisTmp.parent.nunique() > 0:
    codeAnalysisTmp = codeAnalysisTmp[['parent', 'analysis', 'notebook']] \
        .rename(columns={'parent': 'name'}) \
        .drop_duplicates()

    codeAnalysisTmp = pd.merge(
        codeAnalysisTmp,
        codeset[['parent', 'name']],    
        how = 'left',
        on = 'name')

    codeAnalysis = pd.concat([codeAnalysis, codeAnalysisTmp[['name', 'analysis', 'notebook']]]) \
        .drop_duplicates()

codeAnalysis = pd.merge(
    codeAnalysis,
    codeset[codeset.name != 'root'][['name', 'level', 'is_leaf']], 
    how='left')

codeAnalysis[codeAnalysis.name == 'clean'].head()

Unnamed: 0,name,analysis,notebook,level,is_leaf
2107,clean,school-star-ratings-2018,analysis.ipynb,0.0,False
2108,clean,2019-04-democratic-candidate-codonors,analyze-campaign-codonors.ipynb,0.0,False
2109,clean,california-ccscore-analysis,analysis.ipynb,0.0,False
2110,clean,california-h2a-visas-analysis,02_transform.ipynb,0.0,False
2111,clean,1805-regionen im fokus des US-praesidenten,stateOfUnion.R,0.0,False
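The roll-up performed by the loop above can be pictured as walking from each open code up through its ancestors and recording the analysis at every step. The following plain-Python sketch shows that idea on a small slice of the actions taxonomy; `roll_up`, the `parent_of` mapping, and the analysis names are assumptions for illustration, not the notebook's implementation.

```python
def roll_up(parent_of, observations):
    """Pair every ancestor of an observed code with the analysis it was seen in."""
    expanded = set()
    for code, analysis in observations:
        node = code
        # Climb towards the synthetic "root", which is never recorded itself.
        while node is not None and node != "root":
            expanded.add((node, analysis))
            node = parent_of.get(node)
    return expanded


# Child -> parent links for a small slice of the taxonomy.
parent_of = {
    "import": "root",
    "fetch": "import",
    "api request": "fetch",
    "load": "import",
}
# Hypothetical (open code, analysis) observations.
observations = [("api request", "analysis-a"), ("load", "analysis-b")]

for pair in sorted(roll_up(parent_of, observations)):
    print(pair)
```

Here `import` ends up paired with both analyses even though neither was annotated with `import` directly, which is what allows questions about whole branches of the taxonomy.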


#### Code frequency by note

Count the number of distinct analyses in which each qualitative code appears. We then add this aggregate dataset to the `codeset` dataset in what is called an *aggregate join*.

In [12]:
codesByNote = codeAnalysis.groupby('name')['analysis'] \
    .nunique() \
    .to_frame('total') \
    .reset_index()

codesByNote['percent'] = codesByNote.total / codesByNote.total.max()

# Aggregate join this data to our codeset table
codeset = codeset.merge(codesByNote, on='name')
codeset.head()

Unnamed: 0,parent,name,alias,desc,level,is_leaf,type,shortcode_x,shortcode_y,total,percent
0,root,import,,How raw data is introduced into the wrangling ...,0,False,actions,A.I,A.I,48,0.96
1,import,fetch,,Data is retrieved from a source external to th...,1,False,actions,A.I.a,A.I.a,6,0.12
2,fetch,extract data from pdf,,"Using a data extraction tool, such as Tabula, ...",2,True,actions,A.I.a.1,A.I.a.1,1,0.02
3,fetch,api request,Make an API Request,Making a request to a web service,2,True,actions,A.I.a.2,A.I.a.2,1,0.02
4,fetch,query database,Query a Database,Importing data through a database connection,2,True,actions,A.I.a.3,A.I.a.3,1,0.02
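Note that `percent` is computed relative to the largest per-code total, not relative to a separately counted number of analyses. A minimal plain-Python sketch of the same calculation, with a hypothetical `code_frequency` helper and made-up pairs:

```python
from collections import defaultdict


def code_frequency(pairs):
    """Count distinct analyses per code and scale each count by the largest one."""
    analyses_by_code = defaultdict(set)
    for code, analysis in pairs:
        analyses_by_code[code].add(analysis)  # sets ignore repeated pairs

    totals = {code: len(found) for code, found in analyses_by_code.items()}
    if not totals:
        return {}
    largest = max(totals.values())
    return {code: (total, total / largest) for code, total in totals.items()}


# Hypothetical (code, analysis) pairs; the repeated pair is counted once.
pairs = [("import", "a"), ("import", "b"), ("import", "a"), ("load", "a")]
print(code_frequency(pairs))
# {'import': (2, 1.0), 'load': (1, 0.5)}
```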


Join the per-code totals onto `codes` as well, and make sure that the join neither adds nor drops any rows.

In [13]:
priorSize = codes.shape[0]

codes = pd.merge(codes, codesByNote, how='left', on='name')

displayMarkdown(('The data frame `codes` differs by {} rows after the aggregate join'.format(priorSize - codes.shape[0])))
codes.head()

The data frame `codes` differs by 0 rows after the aggregate join

Unnamed: 0,org,analysis,notebook,index,cell,name,mark,total,percent
0,baltimore-sun-data,2018-voter-registration,01_processing.ipynb,0,paragraph 1,combine periodic data,✔️,5,0.1
1,baltimore-sun-data,2018-voter-registration,01_processing.ipynb,1,paragraph 1,extract data from pdf,✔️,1,0.02
2,baltimore-sun-data,2018-voter-registration,01_processing.ipynb,2,1,annotations,✔️,5,0.1
3,baltimore-sun-data,2018-voter-registration,01_processing.ipynb,3,1,load,✔️,42,0.84
4,baltimore-sun-data,2018-voter-registration,01_processing.ipynb,4,1.1,change var type,✔️,34,0.68


## Summary

Here is where we learn a little bit about the codes we have applied.

### Display taxonomies

All open codes, their descriptions, and the corresponding axial codes are stored in the `actions.yaml` and `process.yaml` files. Although these files are the master copy of all open and axial codes, the raw text can be difficult to read, so it helps to view the tree in rendered form.

#### Action codes

In [14]:
codeset[codeset.type=='actions']

Unnamed: 0,parent,name,alias,desc,level,is_leaf,type,shortcode_x,shortcode_y,total,percent
0,root,import,,How raw data is introduced into the wrangling ...,0,False,actions,A.I,A.I,48,0.96
1,import,fetch,,Data is retrieved from a source external to th...,1,False,actions,A.I.a,A.I.a,6,0.12
2,fetch,extract data from pdf,,"Using a data extraction tool, such as Tabula, ...",2,True,actions,A.I.a.1,A.I.a.1,1,0.02
3,fetch,api request,Make an API Request,Making a request to a web service,2,True,actions,A.I.a.2,A.I.a.2,1,0.02
4,fetch,query database,Query a Database,Importing data through a database connection,2,True,actions,A.I.a.3,A.I.a.3,1,0.02
5,fetch,scrape web for data,Scrape the Web for Data,Parsing HTML web pages for data,2,True,actions,A.I.a.4,A.I.a.4,3,0.06
6,import,create,,Data is created inside the wrangling environment,1,False,actions,A.I.b,A.I.b,26,0.52
7,create,construct data manually,,The data is either copy-and-pasted or values a...,2,True,actions,A.I.b.1,A.I.b.1,7,0.14
8,create,generate data computationally,,Using data with values generated programmatically,2,True,actions,A.I.b.2,A.I.b.2,4,0.08
9,create,copy table schema,Copy Data Schema,Data is copied with a schema but without any v...,2,True,actions,A.I.b.3,A.I.b.3,1,0.02


#### Processes

In [15]:
codeset[codeset.type=='process']

Unnamed: 0,parent,name,alias,desc,level,is_leaf,type,shortcode_x,shortcode_y,total,percent
98,root,source,,Codes that describe how the raw data was obtai...,0,False,process,P.S,P.S,24,0.48
99,source,collect data,,Journalists are the initial data collector,1,False,process,P.S.a,P.S.a,6,0.12
100,collect data,collect raw data,,The journalist collected the raw data themselves.,2,True,process,P.S.a.1,P.S.a.1,5,0.1
101,collect data,freedom of information data,,Data that was obtained via FOI/FOIA requests,2,True,process,P.S.a.2,P.S.a.2,1,0.02
102,source,acquire data,,Journalists acquired data from another party,1,False,process,P.S.b,P.S.b,19,0.38
103,acquire data,use previously cleaned data,,Data that originated from a colleague,2,True,process,P.S.b.1,P.S.b.1,1,0.02
104,acquire data,use public data,,"Includes open-source datasets, datasets on Wik...",2,True,process,P.S.b.2,P.S.b.2,2,0.04
105,acquire data,use academic data,,Use data collected from an academic study,2,True,process,P.S.b.3,P.S.b.3,1,0.02
106,acquire data,"use non-public, provided data",,Use data that is not publically available,2,True,process,P.S.b.4,P.S.b.4,2,0.04
107,acquire data,govt data portal,use open-government data portal,Data publically available on civic data portals,2,True,process,P.S.b.5,P.S.b.5,11,0.22


### Codeset Counts

Here is where we calculate various counts of our codeset.

In [16]:
cntUniq = lambda df: df.name.nunique()

displayMarkdown("""
| Category            | count           |
| ------------------- | --------------- |
| Total codes         | {total-codes}   |
| Max depth           | {max-depth}     |
| Open codes          | {open-codes}    |
| Axial codes         | {axial-codes}   |
| Action open codes   | {action-open}   |
| Action axial codes  | {action-axial}  |
| Process open codes  | {process-open}  |
| Process axial codes | {process-axial} |
""".format(**{
    'total-codes': cntUniq(codeset),
    'max-depth': codeset.level.max(),
    'open-codes':  cntUniq(codeset[codeset.is_leaf==True]),
    'axial-codes': cntUniq(codeset[codeset.is_leaf==False]),
    'action-open': cntUniq(codeset[(codeset.is_leaf) & (codeset.type=='actions')]),
    'action-axial': cntUniq(codeset[(~codeset.is_leaf) & (codeset.type=='actions')]),
    'process-open': cntUniq(codeset[(codeset.is_leaf) & (codeset.type=='process')]),
    'process-axial': cntUniq(codeset[(~codeset.is_leaf) & (codeset.type=='process')])   
}))


| Category            | count           |
| ------------------- | --------------- |
| Total codes         | 173   |
| Max depth           | 3     |
| Open codes          | 130    |
| Axial codes         | 43   |
| Action open codes   | 73   |
| Action axial codes  | 25  |
| Process open codes  | 57  |
| Process axial codes | 18 |


## Export results

We export a couple of CSV files for other notebooks to use.

* `data/codeset.csv` contains information on individual codes, such as their level in the tree and the number of analyses in which each code occurs.

* `data/code-analysis-network.csv` contains the occurrence of open and axial codes in individual notebooks.

### Export for LaTeX Table Guts

The code below produces the table guts that go between `\begin{tabular}` and `\end{tabular}`.

In [17]:
maxChildren = 4
whitelist = pd.concat([codeset[codeset.level == 0],  codeset[codeset.level == 1].groupby('parent').head(maxChildren)])['name']
latexCodeset = codeset[codeset.name.isin(whitelist)]
latexCodeset.head()

Unnamed: 0,parent,name,alias,desc,level,is_leaf,type,shortcode_x,shortcode_y,total,percent
0,root,import,,How raw data is introduced into the wrangling ...,0,False,actions,A.I,A.I,48,0.96
1,import,fetch,,Data is retrieved from a source external to th...,1,False,actions,A.I.a,A.I.a,6,0.12
6,import,create,,Data is created inside the wrangling environment,1,False,actions,A.I.b,A.I.b,26,0.52
12,import,load,,Data resides on the local disk and is loaded i...,1,True,actions,A.I.c,A.I.c,42,0.84
13,root,clean,,"The process of removing incorrect, incomplete,...",0,False,actions,A.C,A.C,41,0.82


In [19]:
# maxlvl = 1
# def parseRow(row):
#     name = row['name'].capitalize()        
#     args = {
#         'shortcode': row['shortcode'],
#         'name': '\\textbf{' + name + '}' if row['level'] == 0 else name,
#         'indent': row['level'] * 2,
#         'newline': '\n'
#     }
#     return "\hspace{{{indent}mm}} {name}".format(**args)

# actions = []
# process = []
# for i, row in latexCodeset[(latexCodeset.type == 'actions')].iterrows():
#     actions.append(parseRow(row))

# for i, row in latexCodeset[(latexCodeset.type == 'process')].iterrows():
#     process.append(parseRow(row))

# lines = []
# for i in range(len(process)):
#     try:
#         lines.append(actions[i] + ' & ' + process[i])
#     except IndexError:
#         lines.append(' & ' + process[i])
        
# print(' \\\ \n'.join(lines) + ' \\\ ')
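The commented-out draft above indents each code by its level and bolds top-level codes. A minimal sketch of that formatting follows. `latex_rows` is a hypothetical helper, not the `printTable` function called below (whose definition does not appear in this notebook), and it assumes code names contain no characters that need LaTeX escaping.

```python
def latex_rows(rows):
    """Render (shortcode, name, level) tuples as LaTeX tabular rows."""
    lines = []
    for shortcode, name, level in rows:
        label = name.capitalize()
        if level == 0:
            label = "\\textbf{" + label + "}"  # bold the top-level codes
        indent = "\\hspace{" + str(level * 2) + "mm} "  # 2 mm per tree level
        lines.append(shortcode + " & " + indent + label + " \\\\")
    return "\n".join(lines)


print(latex_rows([("A.I", "import", 0), ("A.I.a", "fetch", 1)]))
# A.I & \hspace{0mm} \textbf{Import} \\
# A.I.a & \hspace{2mm} Fetch \\
```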

#### Actions LaTeX Table

In [None]:
printTable('actions', 2)

#### Process LaTeX Table

In [None]:
printTable('process', 2)

### Export to File

In [20]:
codeset.to_csv('data/codeset.csv', index=False)

codes.to_csv('data/code-analysis-network.csv', index=False)