This notebook includes analysis of the iterative coding process.

In [1]:
import pandas as pd
import altair as alt
import numpy as np
from lib.util import displayMarkdown, getCitations

alt.renderers.enable('notebook')

pd.set_option("display.max_rows", None)  # Don't truncate rows when printing a Pandas DataFrame instance

# Import data 

## Code tree

This CSV file is computed in `codebook.ipynb`.

In [2]:
codes = pd.read_csv('data/codes.csv')
codes.head()

Unnamed: 0,parent,name,desc,level,is_leaf,analysis
0,root,actions,Codes that describe actions the journalist has...,0,False,50
1,actions,import,How raw data is introduced into the programmin...,1,False,39
2,import,fetch,Data is retrieved from some external sources t...,2,False,6
3,fetch,pull tables out of pdf,"Using a table extraction tool, such as Tabula,...",3,True,1
4,fetch,api request,"Make a request to a web API, such as addresses...",3,True,1


## Citations

In [3]:
citations = getCitations()
citations.head()

Unnamed: 0,journalist,year,month,date,analysis,organization,path
0,"Aisch, Gregor; Keller, Josh; Eddelbuettel, Dirk",2016,June,13,Analysis of NICS gun purchase background checks.,New York Times,gunsales
1,"Aldhous, Peter",2016,September,16,"""Shy Trumpers"" polling analysis.",BuzzFeed News,2016-09-shy-trumpers
2,"Arthur, Rob",2015,July,30,Buster Posey MVP.,FiveThirtyEight,buster-posey-mvp
3,"Bi, Frank",2016,Jan,13,Uber launch cities and date.,Vox,verge-uber-launch-dates
4,"Bradshaw, Paul",2019,April,6,Lack of electric car charging points 'putting ...,BBC,electric-car-charging-points


## Code-analysis-notebook network

This notebook also uses the data exported by `codebook.ipynb` to `data/code-analysis-network.csv`, which maps each code to the notebooks and analyses in which it appears. I left-merge this data frame with the `citations` data frame, matching `analysis` against `path`, to associate each code with the organizations and journalists that used it.

In [4]:
analysisCodes = pd.read_csv('data/code-analysis-network.csv')
analysisCodes = pd.merge(analysisCodes, 
         citations[['organization', 'path', 'journalist']].drop_duplicates(),
         how='left',
         left_on='analysis',
         right_on='path')[['name', 'analysis', 'notebook', 'level', 'is_leaf', 'organization', 'journalist']]

analysisCodes.head()

Unnamed: 0,name,analysis,notebook,level,is_leaf,organization,journalist
0,pull tables out of pdf,2018-voter-registration,01_processing.ipynb,3.0,True,Baltimore Sun,"Zhang, Christine"
1,api request,california-h2a-visas-analysis,03_geocode.ipynb,3.0,True,Los Angeles Times,"Welsh, Ben"
2,query database,201901-achievementgap,build_data.R,3.0,True,Star Tribune,"Webster, MaryJo"
3,scrape web for data,us-weather-history,wunderground_scraper.py,3.0,True,FiveThirtyEight,"Olson, Randy"
4,scrape web for data,long-term-care-db,mung-3-25-scrape,3.0,True,The Oregonian,"Zarkhin, Fedor"
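
Because the merge above is a left join, any analysis without a matching `path` in `citations` would silently get `NaN` for `organization` and `journalist`. A minimal sketch of one way to check for that, using made-up rows and hypothetical frame names (`toy_network`, `toy_citations`) rather than the real data:

```python
import pandas as pd

# Toy stand-ins for the two frames; every row here is invented for illustration.
toy_network = pd.DataFrame({
    'name': ['api request', 'scrape web for data', 'query database'],
    'analysis': ['analysis-a', 'analysis-b', 'analysis-c'],
})
toy_citations = pd.DataFrame({
    'path': ['analysis-a', 'analysis-b'],
    'organization': ['Org A', 'Org B'],
    'journalist': ['Doe, Jane', 'Roe, Richard'],
})

merged = pd.merge(
    toy_network,
    toy_citations[['organization', 'path', 'journalist']].drop_duplicates(),
    how='left',
    left_on='analysis',
    right_on='path',
    indicator=True,  # adds a `_merge` column: 'both' or 'left_only'
)

# Analyses that found no citation row keep NaN in the citation columns.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'analysis'].unique()
print(unmatched)
```

If `unmatched` is non-empty on the real data, the corresponding analyses would be missing from any per-journalist or per-organization counts.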


# Code Counts

Now group the code-analysis pairs in the `analysisCodes` data frame by code and compute how widely each code is used. The `analysis_percent` column is the percentage of analyses that contain at least one instance of that code; `journalist_percent` and `organization_percent` are the equivalent shares of journalists and organizations.

In [5]:
nuniq = {
    'analysis': analysisCodes.analysis.nunique(),
    'journalist': analysisCodes.journalist.nunique(),
    'organization': analysisCodes.organization.nunique(),
}

codeCounts = analysisCodes.groupby(['name', 'level', 'is_leaf']) \
    .agg({
        'analysis': lambda x: round(len(set(x)) / nuniq['analysis'] * 100, 2),
        'journalist': lambda x: round(len(set(x)) / nuniq['journalist'] * 100, 2),
        'organization': lambda x: round(len(set(x)) / nuniq['organization'] * 100, 2),
    }) \
    .rename(columns={
        'analysis': 'analysis_percent',
        'journalist': 'journalist_percent',
        'organization': 'organization_percent'
    }) \
    .reset_index()

codeCounts = pd.merge(codeCounts, codes[['name', 'desc']], how='left')

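def recurse(root, pNode, func):
    # Walk the code tree depth-first from pNode, calling func(root, node) on each node.
    # On the first call root is None, so the starting node becomes the root.
    root = root if root is not None else pNode
    func(root, pNode)
    children = codes.loc[codes.parent == pNode]
    for child in children.name:
        recurse(root, child, func)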

def markBranch(root, pnode):
    codeCounts.loc[codeCounts.name == pnode, 'is_' + root] = True

for branch in ['actions', 'strategies', 'observations', 'analysis']:
    codeCounts['is_' + branch] = False
    recurse(None, branch, markBranch) 

# Flag codes that are the direct parent of at least one leaf code
codeCounts['is_leaf_parent'] = False
for parent in codes[codes.is_leaf].parent.unique():
    codeCounts.loc[codeCounts.name == parent, 'is_leaf_parent'] = True

# Bin analysis_percent into four equal-width intervals over its observed range
bins = ['lots', 'frequently', 'infrequently', 'seldom']
codeCounts['commonness'] = pd.cut(codeCounts.analysis_percent, len(bins), labels=bins[::-1])

# Peek at results
codeCounts.head()

Unnamed: 0,name,level,is_leaf,analysis_percent,journalist_percent,organization_percent,desc,is_actions,is_strategies,is_observations,is_analysis,is_leaf_parent,commonness
0,actions,0.0,False,100.0,100.0,100.0,Codes that describe actions the journalist has...,True,False,False,False,True,lots
1,adjust for inflation,3.0,True,6.0,9.09,11.54,TK,True,False,False,False,False,seldom
2,adjust for season,3.0,True,2.0,3.03,3.85,Making seasonal adjustments to the data to det...,True,False,False,False,False,seldom
3,aggregate the forest from the trees,2.0,True,4.0,6.06,7.69,Data of individual observations is aggregated ...,False,False,True,False,False,seldom
4,analysis,1.0,False,100.0,100.0,100.0,Kinds of analysis data journalists need to wra...,False,False,True,True,True,lots
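
To make the `analysis_percent` and `commonness` columns concrete, here is a small sketch on invented data. The frame `toy` and its rows are hypothetical; only the technique (distinct analyses per code, divided by the total number of distinct analyses, then `pd.cut` with an integer bin count) mirrors the cell above.

```python
import pandas as pd

# Made-up code/analysis pairs; a code may appear several times in one analysis.
toy = pd.DataFrame({
    'name':     ['load', 'load', 'load', 'load', 'load', 'clean', 'clean', 'export'],
    'analysis': ['a1',   'a1',   'a2',   'a3',   'a4',   'a1',    'a3',    'a4'],
})

n_analyses = toy['analysis'].nunique()  # 4 distinct analyses in the toy corpus

# Percentage of analyses that contain each code at least once.
toy_counts = (
    toy.groupby('name')['analysis']
       .nunique()
       .div(n_analyses)
       .mul(100)
       .round(2)
       .rename('analysis_percent')
       .reset_index()
)

# Passing an integer to pd.cut splits the observed range (here 25 to 100)
# into that many equal-width intervals, lowest interval first.
bins = ['lots', 'frequently', 'infrequently', 'seldom']
toy_counts['commonness'] = pd.cut(
    toy_counts['analysis_percent'], len(bins), labels=bins[::-1]
)
print(toy_counts)
```

Here `load` appears in all four analyses (100%, "lots"), `clean` in two (50%, "infrequently"), and `export` in one (25%, "seldom"). Note that the interval edges depend on the minimum and maximum observed percentages, so the labels are relative to the corpus rather than fixed thresholds.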

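The `is_actions`, `is_strategies`, `is_observations`, and `is_analysis` flags come from walking the code tree from each branch root down through its children. The sketch below shows the same idea on a hypothetical five-row tree; it is an alternative formulation that collects descendant names and uses `isin`, not the notebook's callback-based `recurse`, and it assumes the tree contains no cycles.

```python
import pandas as pd

# Hypothetical miniature code tree in the same parent/name layout as `codes`.
toy_codes = pd.DataFrame({
    'parent': ['root', 'actions', 'import', 'root', 'observations'],
    'name':   ['actions', 'import', 'fetch', 'observations', 'pipelines'],
})

def descendants(tree, node):
    """Return `node` plus every code below it, following parent -> child links."""
    found = [node]
    for child in tree.loc[tree['parent'] == node, 'name']:
        found.extend(descendants(tree, child))
    return found

# One boolean column per branch, True for the branch root and all its descendants.
for branch in ['actions', 'observations']:
    toy_codes['is_' + branch] = toy_codes['name'].isin(descendants(toy_codes, branch))

print(toy_codes)
```

In this toy tree, `actions`, `import`, and `fetch` are flagged `is_actions`, while `observations` and `pipelines` are flagged `is_observations`.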

# Code Prevalence

The table below shows how prevalent each code is in the corpus, measured as the percentage of analyses, journalists, or organizations that use it. Codes are grouped by category and level, then sorted from most to least common.

In [13]:
# Copy the slice so the column assignments below modify an independent frame
codeLevel = codeCounts[codeCounts.level > 0].copy()

codeLevel.loc[~codeLevel.is_actions, 'category'] = 'Observation'
codeLevel.loc[codeLevel.is_actions, 'category'] = 'Action'

codeLevel = codeLevel.sort_values(['category', 'level', 'commonness', 'analysis_percent'], 
                      ascending=[True, True, False, False])

for col in ['analysis_percent', 'journalist_percent', 'organization_percent']:
    codeLevel[col] = codeLevel[col].apply(str) + '%'

codeLevel[['category', 'level', 'name', 'commonness' , 'analysis_percent', 'journalist_percent', 'organization_percent']]

Unnamed: 0,category,level,name,commonness,analysis_percent,journalist_percent,organization_percent
90,Action,1.0,import,lots,78.0%,81.82%,84.62%
23,Action,1.0,clean,lots,76.0%,81.82%,80.77%
99,Action,1.0,modify,frequently,70.0%,81.82%,96.15%
94,Action,1.0,integrate,frequently,68.0%,78.79%,88.46%
113,Action,1.0,recalculate,frequently,64.0%,75.76%,76.92%
151,Action,1.0,transform,frequently,56.0%,66.67%,69.23%
22,Action,1.0,check sanity,infrequently,50.0%,63.64%,73.08%
57,Action,1.0,display dataset,infrequently,38.0%,45.45%,50.0%
65,Action,1.0,export,infrequently,38.0%,45.45%,50.0%
97,Action,2.0,load,frequently,70.0%,75.76%,76.92%
