# Code Sequences

In [1]:
import pandas as pd
import altair as alt
pd.set_option('max_colwidth', 100)
pd.options.display.max_rows = 100

## Importing a PDF-code dataset
`Codebook.ipynb` already parses the annotated PDFs for comments and records the position at which each comment appears in the PDF in the `index` variable of the dataset located at `data/code-analysis-network.csv`. This data has to be joined with the values in `data/codeset.csv` to determine whether each code is a *process* or *action* code.

In [2]:
codes = pd.read_csv('data/code-analysis-network.csv')
codeTypeLookup = pd.read_csv('data/codeset.csv')[['name', 'type']]
# Join on the code name so each annotation picks up its code type
codes = codes.merge(codeTypeLookup, on='name')
codes['location'] = 'notebooks/' + codes.org + '/' + codes.analysis + '/' + codes.notebook + '.html.pdf'

# Pandas DataFrames already have an "index" attribute, accessible via codes.index,
# so use codes['index'] to access the position in the PDF.
codes['index'] = codes['index'] + 1

codes.head()

Unnamed: 0,org,analysis,notebook,index,cell,name,mark,total,percent,type,location
0,baltimore-sun-data,2018-voter-registration,01_processing.ipynb,1,paragraph 1,combine periodic data,✔️,5,0.1,process,notebooks/baltimore-sun-data/2018-voter-registration/01_processing.ipynb.html.pdf
1,buffalonews,new-york-schools-assessment,combine-csv.ipynb,1,paragraph 1,combine periodic data,✔️,5,0.1,process,notebooks/buffalonews/new-york-schools-assessment/combine-csv.ipynb.html.pdf
2,correctiv,awb-notebook,awb_meldungen.ipynb,11,4,combine periodic data,✔️,5,0.1,process,notebooks/correctiv/awb-notebook/awb_meldungen.ipynb.html.pdf
3,la_times,california-crop-production-wages-analysis,02-transform.ipynb,5,p1,combine periodic data,✔️,5,0.1,process,notebooks/la_times/california-crop-production-wages-analysis/02-transform.ipynb.html.pdf
4,la_times,california-h2a-visas-analysis,02_transform.ipynb,3,56,combine periodic data,✔️,5,0.1,process,notebooks/la_times/california-h2a-visas-analysis/02_transform.ipynb.html.pdf
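
A default inner merge silently drops any annotation whose code name has no match in the lookup table. The following is a minimal sketch of one way to check for that, using hypothetical toy data rather than the real CSV files; the code names `plot data` and `unlisted code` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for the two CSV files.
toy_codes = pd.DataFrame({
    'name': ['combine periodic data', 'plot data', 'unlisted code'],
    'index': [1, 2, 3],
})
toy_lookup = pd.DataFrame({
    'name': ['combine periodic data', 'plot data'],
    'type': ['process', 'actions'],
})

# how='left' keeps every annotation, indicator=True adds a `_merge` column,
# and validate='many_to_one' raises if a code name repeats in the lookup.
checked = toy_codes.merge(toy_lookup, on='name', how='left',
                          validate='many_to_one', indicator=True)

# Code names that would have been dropped by an inner merge
unmatched = checked.loc[checked['_merge'] == 'left_only', 'name'].tolist()
print(unmatched)
```

An empty list here would suggest that every annotated code has a type in the lookup.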


In [3]:
seq = codes[['location', 'type', 'name', 'index']]

# Count the codes in each PDF and join that total onto each row.
# Select the column with ['index'] so it cannot be confused with an index attribute.
codeCountsByLocation = codes.groupby('location')['index'].count().to_frame('total').reset_index()
seq = seq.merge(codeCountsByLocation, on='location')

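# Filter out process codes; copy so the new columns below are added to
# an independent DataFrame rather than a slice of the original
seq = seq[seq['type'] == 'actions'].copy()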

# Calculate relative "position" within the notebook
seq['position'] = seq['index'] / seq['total']

seq['position_bin'] = pd.cut(x=seq.position, 
                             bins=[0, 0.25, 0.5, 0.75, 1.0], 
                             labels=['Beginning', 'Early Middle', 'Later Middle', 'End'])
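
To make the binning concrete, here is a small sketch on a hypothetical PDF with four comments (toy values, not taken from the dataset). By default `pd.cut` uses right-closed intervals, so a position of exactly 0.25 falls in the first bin, and any value outside the interval (0, 1] gets no bin at all.

```python
import pandas as pd

bin_edges = [0, 0.25, 0.5, 0.75, 1.0]
bin_labels = ['Beginning', 'Early Middle', 'Later Middle', 'End']

# Hypothetical PDF containing four comments
toy = pd.DataFrame({'index': [1, 2, 3, 4], 'total': [4, 4, 4, 4]})
toy['position'] = toy['index'] / toy['total']
toy['position_bin'] = pd.cut(x=toy['position'], bins=bin_edges, labels=bin_labels)

# One label per comment, in order
toy_bins = toy['position_bin'].astype(str).tolist()
print(toy_bins)

# Positions outside (0, 1] are not assigned to any bin and become NaN
outside = pd.cut(x=pd.Series([0.0, 1.25]), bins=bin_edges, labels=bin_labels)
print(outside.isna().tolist())
```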

## Visual Analysis

In this section, I begin to analyze this sequence data.

### Quartile Heat Map
We bin the relative `position` variable, which is the comment index normalized by the number of comments in a PDF document. Then we visualize these ordinal values as a heat map, which provides a rough sense of where codes occur during analysis.

In [4]:
seqBin = seq.groupby(['name', 'position_bin'])\
    .position_bin.count().to_frame('count').reset_index()\
    .rename(columns={
        'position_bin': 'Quartile', 
        'name': 'Code Name',
        'count': 'Code Count'
    })

alt.Chart(data=seqBin, 
          title='Code Occurrence Frequency',
          width=300)\
    .mark_rect()\
    .encode(
        y='Code Name:O',
        x='Quartile:O',
        color='Code Count:Q',
        tooltip=['Code Name', 'Quartile', 'Code Count']
    )\
    .configure_axisX(labelAngle=0)\
    .interactive()

# TODO group by axial code, maybe
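
The counts behind the heat map can also be read as a plain table. The sketch below uses `pd.crosstab` on hypothetical toy rows (the code names `code a` and `code b` are placeholders) and is only one possible way to tabulate the same information.

```python
import pandas as pd

# Hypothetical toy rows: one row per coded comment
toy_seq = pd.DataFrame({
    'name': ['code a', 'code a', 'code a', 'code b'],
    'position_bin': ['Beginning', 'Beginning', 'End', 'End'],
})

# Rows are code names, columns are bins, cells are occurrence counts
table = pd.crosstab(toy_seq['name'], toy_seq['position_bin'])
print(table)
```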

In [9]:
alt.Chart(seq).mark_boxplot().encode(
    y='name:O',
    x='position:Q'
)
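
As a numeric companion to the box plot, the per-code count and median position can be computed directly. This is a sketch on hypothetical toy values, assuming a frame with the same `name` and `position` columns as `seq`.

```python
import pandas as pd

# Hypothetical toy data with the same columns the box plot uses
toy_seq = pd.DataFrame({
    'name': ['code a', 'code a', 'code a', 'code b', 'code b'],
    'position': [0.1, 0.2, 0.3, 0.8, 1.0],
})

# Count and median position per code, ordered from earliest to latest
summary = (toy_seq.groupby('name')['position']
           .agg(['count', 'median'])
           .sort_values('median'))
print(summary)
```

Sorting by the median gives a rough ordering of codes from those that tend to appear early in a notebook to those that tend to appear late.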