# Edge Probing Predictions Sandbox

Use this notebook as a starting point for #datascience on Edge Probing predictions. The code below (from `probing/analysis.py`) loads predictions from a run, does some preprocessing for convenience, and exposes two DataFrames for analysis: `example_df` and `target_df`.

We load the data into Pandas so it's easier to filter by various fields and to select particular columns of interest (such as `label.khot` and `preds.proba` for computing metrics). For an introduction to Pandas, see https://pandas.pydata.org/pandas-docs/stable/10min.html

In [1]:
import sys, os, re, json
import itertools
import collections
from importlib import reload
import pandas as pd
import numpy as np
from sklearn import metrics


The latest runs are here:

In [2]:
ls /nfs/jsalt/home/iftenney/exp/edges-20180913/

ls: cannot access '/nfs/jsalt/home/iftenney/exp/edges-20180913/': No such file or directory


The `elmo-chars` experiments probe the char CNN layer only (lexical baseline), while the `elmo-full` models use full ELMo with learned mixing weights. The run dir for each is just called "run" by default. 

In [3]:
import analysis
reload(analysis)

run_dir = "/home/itcast/jiant-v1-legacy/out3/ep_bertlarge/run"
preds = analysis.Predictions.from_run(run_dir, 'sup', 'test')
print("Number of examples: %d" % len(preds.example_df))
print("Number of total targets: %d" % len(preds.target_df))
print("Labels (%d total):" % len(preds.all_labels))
print(preds.all_labels)

07/28 09:12:29 AM: Loading vocabulary from /home/itcast/jiant-v1-legacy/out3/ep_bertlarge/vocab
07/28 09:12:29 AM: Loading token dictionary from /home/itcast/jiant-v1-legacy/out3/ep_bertlarge/vocab.
07/28 09:12:29 AM: Loading predictions from /home/itcast/jiant-v1-legacy/out3/ep_bertlarge/run/sup_test.json


Number of examples: 231480
Number of total targets: 598983
Labels (66 total):
['ARG0', 'ARG1', 'ARG2', 'ARG3', 'ARG4', 'ARG5', 'ARGA', 'ARGM-ADJ', 'ARGM-ADV', 'ARGM-CAU', 'ARGM-COM', 'ARGM-DIR', 'ARGM-DIS', 'ARGM-DSP', 'ARGM-EXT', 'ARGM-GOL', 'ARGM-LOC', 'ARGM-LVB', 'ARGM-MNR', 'ARGM-MOD', 'ARGM-NEG', 'ARGM-PNC', 'ARGM-PRD', 'ARGM-PRP', 'ARGM-PRR', 'ARGM-PRX', 'ARGM-REC', 'ARGM-TMP', 'C-ARG0', 'C-ARG1', 'C-ARG2', 'C-ARG3', 'C-ARG4', 'C-ARGM-ADJ', 'C-ARGM-ADV', 'C-ARGM-CAU', 'C-ARGM-COM', 'C-ARGM-DIR', 'C-ARGM-DIS', 'C-ARGM-DSP', 'C-ARGM-EXT', 'C-ARGM-LOC', 'C-ARGM-MNR', 'C-ARGM-MOD', 'C-ARGM-NEG', 'C-ARGM-PRP', 'C-ARGM-TMP', 'R-ARG0', 'R-ARG1', 'R-ARG2', 'R-ARG3', 'R-ARG4', 'R-ARG5', 'R-ARGM-ADV', 'R-ARGM-CAU', 'R-ARGM-COM', 'R-ARGM-DIR', 'R-ARGM-EXT', 'R-ARGM-GOL', 'R-ARGM-LOC', 'R-ARGM-MNR', 'R-ARGM-MOD', 'R-ARGM-PNC', 'R-ARGM-PRD', 'R-ARGM-PRP', 'R-ARGM-TMP']


### Top-level example info

`preds.example_df` contains information on the top-level examples. Mostly, this just stores the input text and any metadata fields that were present in the original data. This is useful if you want to link the targets back to the text, but you shouldn't need it to compute most metrics.

In [4]:
preds.example_df.head()

idx (index),idx,info.document_id,info.sentence_id,text
0,0,nw/dev_09_c2e/00/dev_09_c2e_0065,0,"In July of 1992 , after Abkhazia &apos; s self..."
1,1,nw/dev_09_c2e/00/dev_09_c2e_0036,0,"He said , It is natural for the Chinese to con..."
2,2,nw/dev_09_c2e/00/dev_09_c2e_0042,0,All units must reply according to the format r...
3,3,nw/dev_09_c2e/00/dev_09_c2e_0061,0,"At present , related authorities have already ..."
4,4,nw/dev_09_c2e/00/dev_09_c2e_0013,0,"Second , the annual report of Shanghai has dis..."


### Target info and predictions

`preds.target_df` contains the per-target input fields (`span1`, `span2`, and `label`) as well as any metadata associated with individual targets. The `idx` column references a row in `example_df` that this target belongs to, if you need to recover the original text.

The loader code does some preprocessing for convenience. In particular, we add a `label.ids` column which maps the list-of-string `label` column into a list of integer ids for these targets, as well as `label.khot` which contains a K-hot encoding of these ids. 

Each entry in `label.khot` should align to the corresponding entry in `preds.proba`, which contains the model's predicted probabilities $\hat{y} \in [0,1]$ for each class.
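To make the relationship between `label`, `label.ids`, and `label.khot` concrete, here is a minimal sketch of that kind of encoding. It is not the loader's actual code: the helper `to_khot` and the four-label vocabulary are hypothetical, and we assume ids are simply positions in the label list (as `preds.all_labels` suggests).

```python
import numpy as np

# Hypothetical toy vocabulary; in the notebook this role is played by preds.all_labels.
all_labels = ["ARG0", "ARG1", "ARG2", "ARGM-TMP"]
label_to_id = {label: i for i, label in enumerate(all_labels)}

def to_khot(labels, label_to_id):
    """Map a list of string labels to (ids, K-hot vector). Illustrative helper only."""
    ids = [label_to_id[label] for label in labels]
    khot = np.zeros(len(label_to_id), dtype=np.int32)
    khot[ids] = 1  # set a 1 at the position of every label present
    return ids, khot

ids, khot = to_khot(["ARG1", "ARGM-TMP"], label_to_id)
print(ids)   # [1, 3]
print(khot)  # [0 1 0 1]
```

Because the K-hot vector and the probability vector share the same label ordering, position `i` of `label.khot` can be compared directly with position `i` of `preds.proba`.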

For some analyses, it may be easier to work with the wide or long form of this DataFrame; see the cells below.

In [5]:
preds.target_df.head()

,idx,label,preds.proba,span1,span2,label.ids,label.khot
0,0,[ARG0],"[0.0362212136387825, 0.9109857082366943, 0.413...","(11, 12)","(6, 9)",[0],"[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
1,0,[ARGM-REC],"[0.05116146430373192, 0.4776630103588104, 0.08...","(11, 12)","(9, 10)",[26],"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
2,0,[ARG1],"[0.3328816890716553, 0.829319179058075, 0.0821...","(11, 12)","(12, 14)",[1],"[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
3,1,[ARG1],"[0.5545434951782227, 0.8481532335281372, 0.370...","(15, 16)","(16, 18)",[1],"[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
4,2,[ARGM-TMP],"[0.798281192779541, 0.017672736197710037, 0.11...","(43, 44)","(42, 43)",[27],"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
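If you do need the original text for a target, one way to recover it is to join on `idx`. The sketch below uses small made-up stand-ins for `preds.example_df` and `preds.target_df`; the helper `span_text` is hypothetical, and it assumes whitespace-separated tokens and end-exclusive `(start, end)` spans, which you should verify against your data.

```python
import pandas as pd

# Toy stand-ins for preds.example_df and preds.target_df (all values are made up).
example_df = pd.DataFrame({
    "idx": [0, 1],
    "text": ["The cat sat on the mat .", "She read the book yesterday ."],
})
target_df = pd.DataFrame({
    "idx": [0, 0, 1],
    "label": [["ARG0"], ["ARGM-LOC"], ["ARG1"]],
    "span1": [(2, 3), (2, 3), (1, 2)],
    "span2": [(0, 2), (3, 6), (2, 4)],
})

# Attach the parent example's text to every target row.
merged = target_df.merge(example_df[["idx", "text"]], on="idx", how="left")

def span_text(text, span):
    """Return the tokens covered by an end-exclusive (start, end) span (assumption)."""
    tokens = text.split()
    return " ".join(tokens[span[0]:span[1]])

merged["span1.text"] = [span_text(t, s) for t, s in zip(merged["text"], merged["span1"])]
merged["span2.text"] = [span_text(t, s) for t, s in zip(merged["text"], merged["span2"])]
print(merged[["idx", "label", "span1.text", "span2.text"]])
```

In the real `example_df`, `idx` appears both as the index name and as a column, so pandas may report the merge key as ambiguous; calling `reset_index(drop=True)` on it first is one way around that.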


### Wide and Long Data

For background on these views, see https://altair-viz.github.io/user_guide/data.html#long-form-vs-wide-form-data

Here's a "wide" version of the data, with the usual metadata plus `2 * num_labels` columns: `label.true.<label_name>` and `preds.proba.<label_name>` for each target class.

In [6]:
preds.target_df_wide.head()

07/28 09:13:26 AM: Generating wide-form target DataFrame. May be slow... 
07/28 09:16:04 AM: Done!


,idx,span1,span2,label.true.ARG0,label.true.ARG1,label.true.ARG2,label.true.ARG3,label.true.ARG4,label.true.ARG5,label.true.ARGA,...,preds.proba.R-ARGM-DIR,preds.proba.R-ARGM-EXT,preds.proba.R-ARGM-GOL,preds.proba.R-ARGM-LOC,preds.proba.R-ARGM-MNR,preds.proba.R-ARGM-MOD,preds.proba.R-ARGM-PNC,preds.proba.R-ARGM-PRD,preds.proba.R-ARGM-PRP,preds.proba.R-ARGM-TMP
0,0,"(11, 12)","(6, 9)",1,0,0,0,0,0,0,...,0.247644,0.221382,0.305474,0.016793,0.2724,0.264148,0.123025,0.19651,0.302759,0.043375
1,0,"(11, 12)","(9, 10)",0,0,0,0,0,0,0,...,0.655565,0.751496,0.775576,0.149934,0.655053,0.64877,0.692132,0.690925,0.684106,0.588227
2,0,"(11, 12)","(12, 14)",0,1,0,0,0,0,0,...,0.380756,0.353931,0.394354,0.041905,0.364044,0.364817,0.243931,0.408371,0.416636,0.254229
3,1,"(15, 16)","(16, 18)",0,1,0,0,0,0,0,...,0.216123,0.218544,0.28999,0.040835,0.221564,0.350754,0.228039,0.298226,0.328966,0.210418
4,2,"(43, 44)","(42, 43)",0,0,0,0,0,0,0,...,0.058375,0.195195,0.133012,0.01404,0.064443,0.150105,0.156278,0.215066,0.123972,0.173104
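For intuition about how a wide table like this relates to `target_df`, here is a rough sketch that expands the list-valued `label.khot` and `preds.proba` columns into one column per label. This is an illustration on toy data with a three-label vocabulary, not the implementation in `analysis.py`.

```python
import numpy as np
import pandas as pd

all_labels = ["ARG0", "ARG1", "ARG2"]  # toy label set (assumption)
target_df = pd.DataFrame({
    "idx": [0, 0, 1],
    "label.khot": [[1, 0, 0], [0, 1, 0], [0, 1, 0]],
    "preds.proba": [[0.9, 0.2, 0.1], [0.3, 0.8, 0.4], [0.6, 0.7, 0.2]],
})

# Stack each list-valued column into a 2-D array, then name one column per label.
true_cols = pd.DataFrame(
    np.array(target_df["label.khot"].tolist()),
    columns=["label.true." + label for label in all_labels],
    index=target_df.index,
)
proba_cols = pd.DataFrame(
    np.array(target_df["preds.proba"].tolist()),
    columns=["preds.proba." + label for label in all_labels],
    index=target_df.index,
)

# Metadata columns first, then 2 * num_labels value columns.
wide_df = pd.concat([target_df[["idx"]], true_cols, proba_cols], axis=1)
print(wide_df.shape)  # (3, 7)
```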


We can fairly easily compute per-label metrics from the wide form, by selecting the appropriate pair of columns:

In [7]:
wide_df = preds.target_df_wide
scores_by_label = {}
for label in preds.all_labels:
    y_true = wide_df['label.true.' + label]
    y_pred = wide_df['preds.proba.' + label] >= 0.5
    score = metrics.f1_score(y_true=y_true, y_pred=y_pred)
    scores_by_label[label] = score
scores = pd.Series(scores_by_label)
print(scores)
print("Macro average F1: %.04f" % scores.mean())

ARG0          0.378234
ARG1          0.525697
ARG2          0.351483
ARG3          0.052658
ARG4          0.064258
ARG5          0.001333
ARGA          0.000000
ARGM-ADJ      0.021439
ARGM-ADV      0.178669
ARGM-CAU      0.052401
ARGM-COM      0.000836
ARGM-DIR      0.103585
ARGM-DIS      0.266265
ARGM-DSP      0.000121
ARGM-EXT      0.044612
ARGM-GOL      0.018816
ARGM-LOC      0.070110
ARGM-LVB      0.013464
ARGM-MNR      0.100407
ARGM-MOD      0.215194
ARGM-NEG      0.374471
ARGM-PNC      0.014363
ARGM-PRD      0.023037
ARGM-PRP      0.057057
ARGM-PRR      0.000000
ARGM-PRX      0.000000
ARGM-REC      0.000947
ARGM-TMP      0.181524
C-ARG0        0.000869
C-ARG1        0.022423
                ...   
C-ARGM-COM    0.000000
C-ARGM-DIR    0.000000
C-ARGM-DIS    0.000000
C-ARGM-DSP    0.000000
C-ARGM-EXT    0.000000
C-ARGM-LOC    0.000000
C-ARGM-MNR    0.000381
C-ARGM-MOD    0.000000
C-ARGM-NEG    0.000000
C-ARGM-PRP    0.000000
C-ARGM-TMP    0.000000
R-ARG0        0.202565
R-ARG1     

And here's a "long" version of the same, with a single `label` column, and one column each for `label.true` and `preds.proba` for that label:

In [8]:
preds.target_df_long.head()

07/28 09:18:03 AM: Generating long-form target DataFrame. May be slow... 
07/28 09:18:08 AM: span2 detected; adding span_distance to long-form DataFrame.
07/28 09:19:10 AM: Done!


,idx,label,label.true,preds.proba,ex_idx,span_distance
0,0,ARG0,1,0.036221,0,2
1,0,ARG1,0,0.910986,0,2
2,0,ARG2,0,0.413718,0,2
3,0,ARG3,0,0.088801,0,2
4,0,ARG4,0,0.095298,0,2
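As the rows above suggest, the long form has one row per (target, label) pair. A small sketch of that reshaping on toy data follows; the column names `target_id` and `example_idx` are hypothetical and chosen only for illustration, and the explicit loop favors readability over speed, so it is not a substitute for the loader's own long-form code.

```python
import pandas as pd

all_labels = ["ARG0", "ARG1", "ARG2"]  # toy label set (assumption)
target_df = pd.DataFrame({
    "idx": [0, 1],
    "label.khot": [[1, 0, 0], [0, 1, 0]],
    "preds.proba": [[0.9, 0.2, 0.1], [0.3, 0.8, 0.4]],
})

rows = []
for target_id, row in target_df.iterrows():
    # Emit one output row for every label of this target.
    for label, is_true, proba in zip(all_labels, row["label.khot"], row["preds.proba"]):
        rows.append({
            "target_id": target_id,      # hypothetical name: row number in target_df
            "example_idx": row["idx"],   # hypothetical name: parent example
            "label": label,
            "label.true": is_true,
            "preds.proba": proba,
        })

long_df = pd.DataFrame(rows)
print(len(long_df))  # 6 = 2 targets * 3 labels
```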


We can easily get the set of labels available here:

In [9]:
preds.target_df_long.label.unique()

array(['ARG0', 'ARG1', 'ARG2', 'ARG3', 'ARG4', 'ARG5', 'ARGA', 'ARGM-ADJ',
       'ARGM-ADV', 'ARGM-CAU', 'ARGM-COM', 'ARGM-DIR', 'ARGM-DIS',
       'ARGM-DSP', 'ARGM-EXT', 'ARGM-GOL', 'ARGM-LOC', 'ARGM-LVB',
       'ARGM-MNR', 'ARGM-MOD', 'ARGM-NEG', 'ARGM-PNC', 'ARGM-PRD',
       'ARGM-PRP', 'ARGM-PRR', 'ARGM-PRX', 'ARGM-REC', 'ARGM-TMP',
       'C-ARG0', 'C-ARG1', 'C-ARG2', 'C-ARG3', 'C-ARG4', 'C-ARGM-ADJ',
       'C-ARGM-ADV', 'C-ARGM-CAU', 'C-ARGM-COM', 'C-ARGM-DIR',
       'C-ARGM-DIS', 'C-ARGM-DSP', 'C-ARGM-EXT', 'C-ARGM-LOC',
       'C-ARGM-MNR', 'C-ARGM-MOD', 'C-ARGM-NEG', 'C-ARGM-PRP',
       'C-ARGM-TMP', 'R-ARG0', 'R-ARG1', 'R-ARG2', 'R-ARG3', 'R-ARG4',
       'R-ARG5', 'R-ARGM-ADV', 'R-ARGM-CAU', 'R-ARGM-COM', 'R-ARGM-DIR',
       'R-ARGM-EXT', 'R-ARGM-GOL', 'R-ARGM-LOC', 'R-ARGM-MNR',
       'R-ARGM-MOD', 'R-ARGM-PNC', 'R-ARGM-PRD', 'R-ARGM-PRP',
       'R-ARGM-TMP'], dtype=object)

We can also compute micro-averaged metrics by thresholding `preds.proba` at 0.5 and comparing the result with the `label.true` column:

In [10]:
from sklearn import metrics
long_df = preds.target_df_long
metrics.f1_score(y_true=long_df['label.true'], y_pred=(long_df['preds.proba'] >= 0.5))

0.1928442044222027
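The macro average (mean of per-label F1, as computed from the wide form) and the micro average (a single F1 over all target-label pairs, as computed from the long form) can differ substantially when labels are imbalanced. The toy example below, with made-up probabilities for three labels, sketches why: a rare label that is never predicted correctly drags the macro average down much more than the micro average.

```python
import numpy as np
from sklearn import metrics

# Rows are targets, columns are labels (toy data; label 0 is frequent, label 2 is rare).
y_true = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
])
y_proba = np.array([
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.7, 0.6, 0.1],
    [0.2, 0.9, 0.6],
    [0.6, 0.2, 0.4],
])
y_pred = (y_proba >= 0.5).astype(int)  # same 0.5 threshold as in the cells above

# Macro: score each label separately, then average the scores.
per_label = [metrics.f1_score(y_true[:, j], y_pred[:, j]) for j in range(y_true.shape[1])]
macro = np.mean(per_label)

# Micro: pool every (target, label) decision into one binary problem.
micro = metrics.f1_score(y_true.ravel(), y_pred.ravel())

print("Macro average F1: %.4f" % macro)  # 0.5079
print("Micro average F1: %.4f" % micro)  # 0.6667
```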