In [1]:
from run_experiment import load_results

In [2]:
cols = ['model', 'train_layer', 'train_source', 'val_source', 'optimum_threshold', 'calibrated_acc', 'accuracy']
results = load_results()

### 1. Does SAPLMA work to identify incorrect answers in the QA task?
- Using (MC) QA data to train on, because it wasn't feasible for us to generate a full training set ourselves
- Evaluating on both held-out MC QA data and on some generated answers, to check that the method still works in this more realistic case
- Are incorrect statements generated by the model harder to spot than artificial ones?
- Does performance vary by QA subtopic?

#### Results when evaluating on MC answers

In [11]:
results[
    (results.train_source == 'qa')
    & (results.val_source == 'qa')
    & (results.val_topic.isin(['None', 'NaN']))
][cols].sort_values('calibrated_acc', ascending=False)

| | model | train_layer | train_source | val_source | optimum_threshold | calibrated_acc | accuracy |
|---|---|---|---|---|---|---|---|
| 60 | lr | -13.0 | qa | qa | 0.6002482175827026 | 0.8690140845070422 | 0.8642605633802818 |
| 88 | lr | -9.0 | qa | qa | 0.5001144409179688 | 0.868838028169014 | 0.8679577464788732 |
| 104 | lr | -5.0 | qa | qa | 0.5951988101005554 | 0.858450704225352 | 0.856161971830986 |
| 72 | lr | -1.0 | qa | qa | 0.4912859797477722 | 0.8545774647887324 | 0.8519366197183098 |
| 97 | saplma | -13.0 | qa | qa | 0.6245877146720886 | 0.8091549295774649 | 0.7959507042253521 |
| 5 | saplma | -5.0 | qa | qa | 0.7086103558540344 | 0.8024647887323944 | 0.7913732394366197 |
| 45 | saplma | -9.0 | qa | qa | 0.6225416660308838 | 0.8021126760563382 | 0.7913732394366197 |
| 87 | mm | -13.0 | qa | qa | 8.314564089115326e-28 | 0.7737676056338029 | 0.6901408450704225 |
| 79 | saplma | -1.0 | qa | qa | 0.6882284283638 | 0.7709507042253521 | 0.691725352112676 |
| 13 | mm | -9.0 | qa | qa | 5.509172958403877e-35 | 0.7588028169014085 | 0.6672535211267606 |


- LR does better than SAPLMA across all layers! Why?
- Results get slightly better as you move further back from the final layer (from -1 towards -13), though the difference is minimal once you are past the last layer

#### Results when evaluating on generated answers

**NOTE** the generated answers are ~57% true! Unsure if we can just state this when reporting our results or whether we should make some adjustment / balance the dataset.
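
One option for reporting this without rebalancing is to quote the majority-class baseline next to each accuracy, or to use balanced accuracy. The sketch below is only an illustration: the function names and the toy labels are made up here, not taken from `run_experiment` or from our data.

```python
def majority_baseline(labels):
    """Accuracy of always predicting the most common label."""
    n_true = sum(labels)
    return max(n_true, len(labels) - n_true) / len(labels)


def balanced_accuracy(labels, preds):
    """Mean of recall on the True class and recall on the False class."""
    tp = sum(1 for y, p in zip(labels, preds) if y and p)
    tn = sum(1 for y, p in zip(labels, preds) if not y and not p)
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)


# Toy example (made-up labels): 4 of 7 answers are true, roughly 57%.
labels = [1, 1, 1, 1, 0, 0, 0]
always_true = [1] * len(labels)

print(majority_baseline(labels))               # 4/7: what "always guess True" scores
print(balanced_accuracy(labels, always_true))  # 0.5: always-True gets no credit
```

Balanced accuracy puts a trivial always-True classifier at 0.5 regardless of the class split, which makes scores on the MC and generated sets easier to compare.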

In [13]:
results[
    (results.train_source == 'qa')
    & (results.val_source == 'qa-gen')
    & (results.val_topic.isin(['None', 'NaN']))
][cols].sort_values('calibrated_acc', ascending=False)

| | model | train_layer | train_source | val_source | optimum_threshold | calibrated_acc | accuracy |
|---|---|---|---|---|---|---|---|
| 18 | mm | -13.0 | qa | qa-gen | 1.5063328567479166e-25 | 0.7527539779681762 | 0.6401468788249693 |
| 77 | saplma | -13.0 | qa | qa-gen | 0.5229383707046509 | 0.7515299877600979 | 0.7471236230110158 |
| 64 | lr | -5.0 | qa | qa-gen | 0.6722432374954224 | 0.747858017135863 | 0.7439412484700123 |
| 40 | lr | -9.0 | qa | qa-gen | 0.8213123083114624 | 0.7422276621787025 | 0.7287637698898408 |
| 14 | mm | -9.0 | qa | qa-gen | 6.778570251991411e-35 | 0.7405140758873929 | 0.5850673194614443 |
| 65 | lr | -13.0 | qa | qa-gen | 0.8011077046394348 | 0.7358629130966952 | 0.7307221542227662 |
| 98 | saplma | -5.0 | qa | qa-gen | 0.5588269233703613 | 0.7321909424724602 | 0.7091799265605875 |
| 41 | saplma | -9.0 | qa | qa-gen | 0.5656270980834961 | 0.7309669522643819 | 0.7118727050183598 |
| 61 | lr | -1.0 | qa | qa-gen | 0.5834947228431702 | 0.7297429620563036 | 0.7272949816401468 |
| 31 | saplma | -1.0 | qa | qa-gen | 0.4230741858482361 | 0.7260709914320687 | 0.6952264381884945 |


- MM getting the best calibrated accuracy is surprising, given that it uses a threshold of about 1.5e-25. We should look into this result more. Is it actually quite a good method whose confidence is just badly calibrated?
- SAPLMA and LR are more evenly matched now
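
One way to check whether MM ranks answers well but is badly calibrated is a threshold-free metric such as AUROC. This is a rough sketch on made-up scores, not our MM outputs; both function names are hypothetical, and the convention of predicting True at or above the threshold is an assumption about how `optimum_threshold` is applied.

```python
def auroc(labels, scores):
    """Probability that a random true item outscores a random false item (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy_at(labels, scores, threshold):
    """Accuracy when predicting True for scores at or above the threshold (assumed convention)."""
    return sum((s >= threshold) == bool(y) for y, s in zip(labels, scores)) / len(labels)


# Made-up scores squashed towards zero, loosely imitating the tiny MM thresholds.
labels = [1, 1, 1, 0, 0, 0]
scores = [3e-20, 5e-24, 2e-25, 1e-26, 4e-30, 7e-33]

print(auroc(labels, scores))               # ranking quality, independent of any threshold
print(accuracy_at(labels, scores, 0.5))    # everything falls below 0.5, so all predicted False
print(accuracy_at(labels, scores, 1e-25))  # a tiny threshold separates the toy classes
```

If MM showed a high AUROC alongside poor accuracy at 0.5, that would point to a calibration problem rather than a weak signal.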

#### Comparison of MC and generative results

In [31]:
# Held-out MC results, ordered by (model, train_layer)
mc_res = results[
    (results.train_source == 'qa')
    & (results.val_source == 'qa')
    & (results.val_topic.isin(['None', 'NaN']))
].sort_values(['model', 'train_layer']).reset_index()

# Generated-answer results in the same order, so the rows line up by position
gen_res = results[
    (results.train_source == 'qa')
    & (results.val_source == 'qa-gen')
    & (results.val_topic.isin(['None', 'NaN']))
].sort_values(['model', 'train_layer']).reset_index()

# Attach the generated-answer scores and the MC-minus-generated differences
mc_res['gen_accuracy'] = gen_res.accuracy
mc_res['gen_cal_acc'] = gen_res.calibrated_acc
mc_res['acc_diff'] = mc_res.accuracy.astype(float) - gen_res.accuracy.astype(float)
mc_res['cal_acc_diff'] = mc_res.calibrated_acc.astype(float) - gen_res.calibrated_acc.astype(float)
mc_res[['model', 'train_layer', 'accuracy', 'calibrated_acc', 'gen_accuracy', 'gen_cal_acc', 'acc_diff', 'cal_acc_diff']]

| | model | train_layer | accuracy | calibrated_acc | gen_accuracy | gen_cal_acc | acc_diff | cal_acc_diff |
|---|---|---|---|---|---|---|---|---|
| 0 | lr | -13.0 | 0.8642605633802818 | 0.8690140845070422 | 0.7307221542227662 | 0.7358629130966952 | 0.133538 | 0.133151 |
| 1 | lr | -9.0 | 0.8679577464788732 | 0.868838028169014 | 0.7287637698898408 | 0.7422276621787025 | 0.139194 | 0.12661 |
| 2 | lr | -5.0 | 0.856161971830986 | 0.858450704225352 | 0.7439412484700123 | 0.747858017135863 | 0.112221 | 0.110593 |
| 3 | lr | -1.0 | 0.8519366197183098 | 0.8545774647887324 | 0.7272949816401468 | 0.7297429620563036 | 0.124642 | 0.124835 |
| 4 | mm | -13.0 | 0.6901408450704225 | 0.7737676056338029 | 0.6401468788249693 | 0.7527539779681762 | 0.049994 | 0.021014 |
| 5 | mm | -9.0 | 0.6672535211267606 | 0.7588028169014085 | 0.5850673194614443 | 0.7405140758873929 | 0.082186 | 0.018289 |
| 6 | mm | -5.0 | 0.6716549295774648 | 0.7376760563380281 | 0.543451652386781 | 0.6927784577723378 | 0.128203 | 0.044898 |
| 7 | mm | -1.0 | 0.5589788732394366 | 0.5774647887323944 | 0.4651162790697675 | 0.5006119951040392 | 0.093863 | 0.076853 |
| 8 | saplma | -13.0 | 0.7959507042253521 | 0.8091549295774649 | 0.7471236230110158 | 0.7515299877600979 | 0.048827 | 0.057625 |
| 9 | saplma | -9.0 | 0.7913732394366197 | 0.8021126760563382 | 0.7118727050183598 | 0.7309669522643819 | 0.079501 | 0.071146 |


- LR consistently loses more accuracy than SAPLMA when switching to the generated answers (drop of 0.11-0.14 vs 0-0.08)
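
The comparison cell lines the MC and generated rows up by position after sorting, which silently misaligns if either set is missing a (model, train_layer) pair. A slightly more defensive variant joins on the keys instead. This is a sketch on a made-up toy frame, not our `results`; the column names mirror ours but the values are invented.

```python
import pandas as pd

# Toy stand-in for the `results` frame (values are made up, not our results).
results = pd.DataFrame({
    'model': ['lr', 'lr', 'saplma', 'saplma', 'lr', 'saplma'],
    'train_layer': [-1.0, -5.0, -1.0, -5.0, -1.0, -1.0],
    'train_source': ['qa'] * 6,
    'val_source': ['qa', 'qa', 'qa', 'qa', 'qa-gen', 'qa-gen'],
    'accuracy': [0.85, 0.86, 0.69, 0.79, 0.73, 0.70],
})

keys = ['model', 'train_layer']
mc = results[results.val_source == 'qa']
gen = results[results.val_source == 'qa-gen']

# Join on the keys instead of relying on both frames having the same row order.
# validate='one_to_one' raises if a (model, train_layer) pair is duplicated;
# indicator=True adds a '_merge' column flagging rows present on only one side.
compared = mc.merge(gen[keys + ['accuracy']], on=keys, how='outer',
                    suffixes=('', '_gen'), validate='one_to_one', indicator=True)
compared['acc_diff'] = compared['accuracy'] - compared['accuracy_gen']

print(compared[keys + ['accuracy', 'accuracy_gen', 'acc_diff', '_merge']])
```

In the toy data the layer -5 rows have no generated counterpart, so they show up as `left_only` with a missing `acc_diff` rather than being paired with the wrong row.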

### 2. Are the models identifying truth? Can they generalise between QA and simple true/false datasets?

#### Train on true/false, evaluate on QA gen

In [33]:
results[
    (results.train_source == 'tf')
    & (results.val_source == 'qa-gen')
    & (results.val_topic.isin(['None', 'NaN']))
][cols].sort_values('calibrated_acc', ascending=False)

| | model | train_layer | train_source | val_source | optimum_threshold | calibrated_acc | accuracy |
|---|---|---|---|---|---|---|---|
| 0 | saplma | -13.0 | tf | qa-gen | 0.4979552626609802 | 0.6793145654834761 | 0.6558139534883721 |
| 29 | saplma | -9.0 | tf | qa-gen | 0.485476404428482 | 0.6621787025703794 | 0.6178702570379437 |
| 49 | saplma | -5.0 | tf | qa-gen | 0.4913727343082428 | 0.6602203182374542 | 0.6506731946144431 |
| 76 | lr | -5.0 | tf | qa-gen | 0.608629584312439 | 0.6271725826193391 | 0.6181150550795593 |
| 32 | mm | -5.0 | tf | qa-gen | 8.762726366740027e-08 | 0.6266829865361077 | 0.572827417380661 |
| 90 | lr | -9.0 | tf | qa-gen | 0.1324804872274398 | 0.609547123623011 | 0.583108935128519 |
| 85 | lr | -13.0 | tf | qa-gen | 0.0818003788590431 | 0.6090575275397796 | 0.5351285189718482 |
| 17 | lr | -1.0 | tf | qa-gen | 0.0131074236705899 | 0.5880048959608323 | 0.5385556915544676 |
| 80 | mm | -1.0 | tf | qa-gen | 1.74934020431241e-22 | 0.5789473684210527 | 0.5630354957160343 |
| 52 | mm | -9.0 | tf | qa-gen | 2.704772896322538e-06 | 0.5777233782129743 | 0.5507955936352509 |


- SAPLMA is now quite a bit better than LR.
- In terms of standard accuracy (0.5 threshold), the only results that beat always guessing True are SAPLMA at layers -5 to -13 and LR at layer -5
- This is completely at odds with Alex Mallen's results (Google Doc), which found that a linear probe fixed the NN's problems with generalising to different kinds of statement (negated ones).
- Should we replicate Mallen's setup (train on true/false statements without negations, evaluate on negated ones) and see whether we get the same result?