In [1]:
import sys
sys.path.append('..')

# Extract ">" Pairs from one evaluated BWS set
Throughout the whole Python module, we extract only `>` (greater-than) relations.

## Demo
The dictionaries `dok_*` store the count (frequency) of each `>` relation, e.g. `('B', 'C'): 1` means `B>C` was counted once.

In [2]:
from bwsample import extract_pairs

ids = ['A', 'B', 'C', 'D']
states = [0, 0, 2, 1]  # BEST=1, WORST=2

dok_all, dok_direct, dok_best, dok_worst = extract_pairs(ids, states)

print(dok_all)

{('D', 'C'): 1, ('D', 'A'): 1, ('A', 'C'): 1, ('D', 'B'): 1, ('B', 'C'): 1}


## Three types of pairs
We can distinguish three types of pairs, all of which are contained in `dok_all`.

- `"BEST > WORST"` : The dictionary `dok_direct` counts only pairs in which both objects were explicitly selected, one as `BEST=1` and the other as `WORST=2`.
- `"BEST > MIDDLE"` : The dictionary `dok_best` counts only pairs in which the lhs object was selected as `BEST=1` and the rhs object was unselected (`MIDDLE=0`).
- `"MIDDLE > WORST"` : The dictionary `dok_worst` counts only pairs in which the lhs object was unselected (`MIDDLE=0`) and the rhs object was selected as `WORST=2`.

The corresponding pairwise comparison matrix:

<img alt="Identify pairs from BWS set, and increment counts in dictionary." src="bwsample-extract.png" width="200px">

The three additional dictionaries `dok_direct`, `dok_best`, and `dok_worst` could be used for attribution analysis later on.

In [3]:
print("  BEST > WORST:", dok_direct)
print(" BEST > MIDDLE:", dok_best)
print("MIDDLE > WORST:", dok_worst)

  BEST > WORST: {('D', 'C'): 1}
 BEST > MIDDLE: {('D', 'A'): 1, ('D', 'B'): 1}
MIDDLE > WORST: {('A', 'C'): 1, ('B', 'C'): 1}
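The classification into three pair types can be sketched in plain Python. This is only an illustration: `sketch_extract_pairs` is a hypothetical helper, not the `bwsample` implementation, and it assumes that each BWS set contains exactly one `BEST=1` and exactly one `WORST=2` item.

```python
from collections import Counter


def sketch_extract_pairs(ids, states):
    """Hypothetical re-implementation for illustration only.

    Assumes exactly one BEST (1) and one WORST (2) state per set.
    """
    best = ids[states.index(1)]
    worst = ids[states.index(2)]
    middle = [i for i, s in zip(ids, states) if s == 0]

    direct = Counter({(best, worst): 1})              # BEST > WORST
    best_mid = Counter((best, m) for m in middle)     # BEST > MIDDLE
    mid_worst = Counter((m, worst) for m in middle)   # MIDDLE > WORST
    all_pairs = direct + best_mid + mid_worst         # union of all three

    return dict(all_pairs), dict(direct), dict(best_mid), dict(mid_worst)


s_all, s_direct, s_best, s_worst = sketch_extract_pairs(
    ['A', 'B', 'C', 'D'], [0, 0, 2, 1])
print(s_direct, s_best, s_worst)
```

For the demo set, the sketch yields the same three dictionaries as printed above, which suggests that a set of `n` items contributes `2n - 3` pairs in total.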


## Update dictionaries
You can update the dictionaries as follows:

In [4]:
ids = ['D', 'E', 'F', 'A']
states = [0, 1, 0, 2]

dok_all, dok_direct, dok_best, dok_worst = extract_pairs(
    ids, states, dok_all=dok_all, dok_direct=dok_direct, dok_best=dok_best, dok_worst=dok_worst)

For example, the pair `D>A` now has a count of 2.

In [5]:
print(dok_all)

{('D', 'C'): 1, ('D', 'A'): 2, ('A', 'C'): 1, ('D', 'B'): 1, ('B', 'C'): 1, ('E', 'A'): 1, ('E', 'D'): 1, ('E', 'F'): 1, ('F', 'A'): 1}
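The update behaves like adding the counts of the new set to the existing dictionary. A minimal sketch of that accumulation, assuming simple key-wise addition (`sketch_merge_counts` is a hypothetical helper, and the pair counts of the second set are written out by hand here):

```python
def sketch_merge_counts(old, new):
    """Hypothetical helper: add the counts in `new` to a copy of `old`."""
    merged = dict(old)
    for pair, cnt in new.items():
        merged[pair] = merged.get(pair, 0) + cnt
    return merged


# Pairs from the first set, ids=['A','B','C','D'], states=[0,0,2,1]
first = {('D', 'C'): 1, ('D', 'A'): 1, ('A', 'C'): 1,
         ('D', 'B'): 1, ('B', 'C'): 1}
# Pairs from the second set, ids=['D','E','F','A'], states=[0,1,0,2]
second = {('E', 'A'): 1, ('E', 'D'): 1, ('E', 'F'): 1,
          ('D', 'A'): 1, ('F', 'A'): 1}

merged = sketch_merge_counts(first, second)
print(merged[('D', 'A')])
```

`D>A` occurs once in each set (as `BEST > MIDDLE` in the first, as `MIDDLE > WORST` in the second), so its merged count is 2.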


## Convert dictionary to SciPy sparse matrix


In [6]:
from bwsample import to_scipy

cnts, idx = to_scipy(dok_all)

print("IDs: ", idx)
print("Counts:\n", cnts.todense().astype(int))

IDs:  ['A', 'B', 'C', 'D', 'E', 'F']
Counts:
 [[0 0 1 0 0 0]
 [0 0 1 0 0 0]
 [0 0 0 0 0 0]
 [2 1 1 0 0 0]
 [1 0 0 1 0 1]
 [1 0 0 0 0 0]]
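In the output above, row `i` and column `j` hold the count of `idx[i] > idx[j]`, with the IDs in sorted order. The following sketch reproduces that layout with nested lists instead of a SciPy sparse matrix; `sketch_to_dense` is a hypothetical helper, and the sorted ID order is an assumption based on the printed output.

```python
def sketch_to_dense(dok):
    """Hypothetical helper: build a dense count matrix from a pair dict.

    Rows are the lhs (winner) IDs, columns the rhs (loser) IDs.
    """
    idx = sorted({item for pair in dok for item in pair})
    pos = {item: k for k, item in enumerate(idx)}
    mat = [[0] * len(idx) for _ in idx]
    for (winner, loser), cnt in dok.items():
        mat[pos[winner]][pos[loser]] = cnt
    return mat, idx


dok_demo = {('D', 'C'): 1, ('D', 'A'): 2, ('A', 'C'): 1, ('D', 'B'): 1,
            ('B', 'C'): 1, ('E', 'A'): 1, ('E', 'D'): 1, ('E', 'F'): 1,
            ('F', 'A'): 1}
mat, ids_sorted = sketch_to_dense(dok_demo)
for row in mat:
    print(row)
```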


## Process multiple BWS sets
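Use `extract_pairs_batch`, which takes the states of all sets first and the corresponding IDs second (note that this argument order is reversed compared to `extract_pairs`):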

In [7]:
from bwsample import extract_pairs_batch, to_scipy

evaluated_combostates = ([0, 0, 2, 1], [0, 1, 0, 2])
mapped_sent_ids = (['id1', 'id2', 'id3', 'id4'], ['id4', 'id5', 'id6', 'id1'])

dok_all, dok_direct, dok_best, dok_worst = extract_pairs_batch(
    evaluated_combostates, mapped_sent_ids)

cnts, idx = to_scipy(dok_all)
print("IDs: ", idx)
print("Counts:\n", cnts.todense().astype(int))

IDs:  ['id1', 'id2', 'id3', 'id4', 'id5', 'id6']
Counts:
 [[0 0 1 0 0 0]
 [0 0 1 0 0 0]
 [0 0 0 0 0 0]
 [2 1 1 0 0 0]
 [1 0 0 1 0 1]
 [1 0 0 0 0 0]]


Alternatively, use `extract_pairs_batch2`, which takes a single sequence of `(states, ids)` tuples:

In [8]:
from bwsample import extract_pairs_batch2, to_scipy

data = (
    ([0, 0, 2, 1], ['id1', 'id2', 'id3', 'id4']), 
    ([0, 1, 0, 2], ['id4', 'id5', 'id6', 'id1'])
)

dok_all, dok_direct, dok_best, dok_worst = extract_pairs_batch2(data)

cnts, idx = to_scipy(dok_all)
print("IDs: ", idx)
print("Counts:\n", cnts.todense().astype(int))

IDs:  ['id1', 'id2', 'id3', 'id4', 'id5', 'id6']
Counts:
 [[0 0 1 0 0 0]
 [0 0 1 0 0 0]
 [0 0 0 0 0 0]
 [2 1 1 0 0 0]
 [1 0 0 1 0 1]
 [1 0 0 0 0 0]]
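The two batch input layouts shown in cells 7 and 8 carry the same information, so one can be converted into the other with `zip`. A small sketch using only the standard library, under the assumption that the inputs are shaped exactly as in those cells:

```python
evaluated_combostates = ([0, 0, 2, 1], [0, 1, 0, 2])
mapped_sent_ids = (['id1', 'id2', 'id3', 'id4'], ['id4', 'id5', 'id6', 'id1'])

# Two parallel sequences -> one sequence of (states, ids) tuples
data = tuple(zip(evaluated_combostates, mapped_sent_ids))

# ... and back again
states_again, ids_again = zip(*data)

print(data)
```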
