# Aligning constituency parse to gold reading

6/30/20

Okay, so for the moment the only reason we care about a constituency parse is that we want to extract **phrases** from passages.

Accurate, easy-to-use constituency parsers tokenize the passages in a way that isn't exactly compatible with our forced alignments. The differences include:

- Constituency parsing splits contractions, while forced alignments are word-by-word.
- Constituency parsing preserves punctuation (minor difference).
- Forced alignments contain **pauses**.

In [1]:
import json
import spacy
from benepar.spacy_plugin import BeneparComponent
from collections import defaultdict


In [2]:
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe(BeneparComponent('benepar_en'))
doc = nlp("Sam and Jo went for a hike. They took a path through the woods. Suddenly, Sam heard a noise coming from the tree above their heads. Jo climbed up to see what the noise was, and found two baby squirrels. The babies were alone, but their mother must be somewhere near. The children watched and waited. Sure enough, the mother soon returned with a mouthful of nuts. The noises stopped as the baby squirrels began to eat. Sam and Jo smiled, knowing the squirrels were safe with their mother.")

## Method

1. Remove pauses from the gold readings. After that, all the gold readings should contain the same words.

2. Align **constituency parse** to one of the gold readings. Punctuation tokens will have to be dropped, and contractions will have to be dealt with at a later time. (Item 330 has no contractions)
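Contractions are deferred above, but a minimal sketch of one way to handle them may help frame the problem. The helper `to_transcript_words` and the `CLITICS` set are hypothetical and not part of this notebook. The sketch also assumes the gold readings keep a contraction as a single lowercase word (e.g. `didn't`), which has not been checked against the alignment data.

```python
import string

# Assumed list of clitics that a parser splits off as separate tokens.
CLITICS = {"n't", "'s", "'re", "'ve", "'ll", "'d", "'m"}

def to_transcript_words(tokens):
    """Collapse parser-style tokens into transcript-style words (hypothetical helper)."""
    words = []
    for tok in tokens:
        low = tok.lower()
        if low in CLITICS and words:
            # Glue the clitic back onto the previous word: "did" + "n't" -> "didn't"
            words[-1] += low
        elif all(ch in string.punctuation for ch in tok):
            # Pure punctuation has no counterpart in a forced alignment
            continue
        else:
            words.append(low)
    return words

print(to_transcript_words(["They", "did", "n't", "stop", ",", "though", "."]))
# ['they', "didn't", 'stop', 'though']
```

Note that merging changes the token count, so any index mapping built afterwards would have to map several parse tokens to one gold word.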

#### Removing pauses

I add the duration of each pause to the following word, but for the time being this is irrelevant. All I need right now are the **words**, to create a *transcript*.

In [3]:
with open('data/raw/align.json') as f:
    raw_align = json.load(f)
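rv = {}
for sess, alignment in raw_align.items():
    no_pauses = []
    curr_nframes = 0  # pause frames seen since the last word
    for token, sframe, nframes in alignment:
        if token[0] != '<':
            # A real word: fold any preceding pause frames into its duration
            no_pauses.append([token, curr_nframes + nframes])
            curr_nframes = 0
        else:
            # Pause tokens start with '<'; accumulate their duration
            curr_nframes += nframes
    rv[sess] = no_pauses
# Keep only the sessions whose transcript has exactly 89 words (the full passage)
align_no_pause = {sess: rv[sess] for sess in rv if len(rv[sess]) == 89}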
# with open('data/raw/align-no-pause.json', 'w') as f:
#     json.dump(align_no_pause, f)
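To make the behaviour of the loop above easier to check, here is the same logic as a small function run on a toy alignment. The function name `merge_pauses`, the `<sil>` pause label, and the frame counts are made up for illustration; the only thing taken from the data format is that pause tokens start with `<`.

```python
def merge_pauses(alignment):
    """Drop pause tokens and add their frames to the next word."""
    merged = []
    pending = 0  # pause frames waiting to be attached to the next word
    for token, _start_frame, n_frames in alignment:
        if token.startswith('<'):
            pending += n_frames
        else:
            merged.append([token, pending + n_frames])
            pending = 0
    return merged

# Toy alignment: [token, start frame, number of frames]
toy = [
    ['<sil>', 0, 12],
    ['sam', 12, 30],
    ['<sil>', 42, 5],
    ['and', 47, 10],
    ['jo', 57, 25],
    ['<sil>', 82, 8],
]
print(merge_pauses(toy))
# [['sam', 42], ['and', 15], ['jo', 25]]
```

A pause at the very end of a reading has no following word, so its frames are dropped. The notebook loop behaves the same way.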

#### Align constituency parse to a gold reading

Each token in the constituency parse (incl. punctuation and sub-words) has an associated *index*. I match each token to its corresponding index in the gold readings, if possible. Punctuation tokens don't get assigned an index, but that's okay.

In [4]:
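# align_no_pause['54344']
gold_tokens = [xy[0] for xy in align_no_pause['54344']]
# Document indices of every non-punctuation token
idxs = [token.i for token in doc if str(token) not in {'.', ',', '?'}]
# Pair the two lists positionally; this relies on them having the same length
doc_idx_to_align_idx = dict(zip(idxs, range(len(gold_tokens))))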
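Because `zip` silently truncates to the shorter input, a mismatch between the parse and the gold reading would go unnoticed in the cell above. Below is a sketch of a stricter variant that works on plain lists of strings. The function `align_indices` is hypothetical, and it assumes gold tokens are lowercase, as they appear in the printed transcripts.

```python
def align_indices(parse_tokens, gold_tokens, punctuation=frozenset({'.', ',', '?'})):
    """Map parse-token positions to gold-token positions, checking that the words agree."""
    kept = [(i, tok) for i, tok in enumerate(parse_tokens) if tok not in punctuation]
    if len(kept) != len(gold_tokens):
        raise ValueError(f"{len(kept)} parse words vs {len(gold_tokens)} gold words")
    mapping = {}
    for gold_idx, (parse_idx, tok) in enumerate(kept):
        if tok.lower() != gold_tokens[gold_idx]:
            raise ValueError(f"mismatch at gold index {gold_idx}: "
                             f"{tok!r} vs {gold_tokens[gold_idx]!r}")
        mapping[parse_idx] = gold_idx
    return mapping

parse_tokens = ['Suddenly', ',', 'Sam', 'heard', 'a', 'noise', '.']
gold = ['suddenly', 'sam', 'heard', 'a', 'noise']
print(align_indices(parse_tokens, gold))
# {0: 0, 2: 1, 3: 2, 4: 3, 5: 4}
```

With a spaCy `doc`, the first argument could be built as `[str(token) for token in doc]`, since a token's position in that list matches `token.i`.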

#### Collect sentence indices

In [5]:
STRUCTURE_sentences = []
for sent in doc.sents:
    sent_indexed = [doc_idx_to_align_idx[token.i] for token in sent if token.i in doc_idx_to_align_idx]
    print(sent_indexed)
    print(' '.join(gold_tokens[idx] for idx in sent_indexed))
    STRUCTURE_sentences.append(sent_indexed)

[0, 1, 2, 3, 4, 5, 6]
sam and jo went for a hike
[7, 8, 9, 10, 11, 12, 13]
they took a path through the woods
[14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]
suddenly sam heard a noise coming from the tree above their heads
[26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39]
jo climbed up to see what the noise was and found two baby squirrels
[40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50]
the babies were alone but their mother must be somewhere near
[51, 52, 53, 54, 55]
the children watched and waited
[56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66]
sure enough the mother soon returned with a mouthful of nuts
[67, 68, 69, 70, 71, 72, 73, 74, 75, 76]
the noises stopped as the baby squirrels began to eat
[77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88]
sam and jo smiled knowing the squirrels were safe with their mother


#### Collect phrase indices

In [6]:
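def print_children(span, sent_num, STRUCTURE_phrases, level):
    """Recursively record [level, tag, token indices] for every multi-token
    phrase below `span`. Despite its name, this function prints nothing."""
    children = list(span._.children)
    for child in children:
        # Use the first constituent label, if the span has one
        TAG = child._.labels[0] if child._.labels else None
        token_indices = [token.i for token in child]
        # No leaf nodes and no trivial parents
        if TAG and len(token_indices) > 1:
            STRUCTURE_phrases[sent_num].append([
                level,
                TAG,
                token_indices
            ])
        # Descend one level; leaf spans have no children, which ends the recursion
        print_children(child, sent_num, STRUCTURE_phrases, level + 1)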

sents = list(doc.sents)
STRUCTURE_phrases = [[] for _ in range(len(sents))]
for sent_num, sent in enumerate(sents):
    print_children(sent, sent_num, STRUCTURE_phrases, level=0)
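The traversal above can be illustrated without a parser. The sketch below runs the same pre-order walk over a hand-written tree for "they took a path through the woods". The tree encoding (a `(label, children)` tuple with integer token positions as leaves) and the function names are assumptions made for this example, and the bracketing is written by hand rather than taken from benepar output.

```python
def leaves(node):
    """Return the token positions covered by a node."""
    if isinstance(node, int):
        return [node]
    _label, children = node
    return [i for child in children for i in leaves(child)]

def collect_phrases(node, level=0, out=None):
    """Record [level, label, token positions] for each multi-token phrase below node."""
    out = [] if out is None else out
    _label, children = node
    for child in children:
        if isinstance(child, int):
            continue  # a bare token is a leaf, never a phrase
        idxs = leaves(child)
        if len(idxs) > 1:  # skip single-token constituents
            out.append([level, child[0], idxs])
        collect_phrases(child, level + 1, out)
    return out

# Token positions: they=0 took=1 a=2 path=3 through=4 the=5 woods=6
tree = ('S', [
    ('NP', [0]),
    ('VP', [1, ('NP', [('NP', [2, 3]),
                       ('PP', [4, ('NP', [5, 6])])])]),
])
for level, label, idxs in collect_phrases(tree):
    print(level, label, idxs)
# 0 VP [1, 2, 3, 4, 5, 6]
# 1 NP [2, 3, 4, 5, 6]
# 2 NP [2, 3]
# 2 PP [4, 5, 6]
# 3 NP [5, 6]
```

As in the notebook cell, the root itself is not recorded, and the single-token `NP` for "they" is skipped.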

Demo: Sentence **3**, depth **2**

In [7]:
demo = [value for value in STRUCTURE_phrases[3] if value[0] == 2]

for _, TAG, TOKEN_IDS in demo:
    print(TAG + '\t' + ' '.join([gold_tokens[doc_idx_to_align_idx[idx]] for idx in TOKEN_IDS]))

S	to see what the noise was
NP	two baby squirrels


<img src='misc/fig-parse-tree-depth-nodes.png' width='600'>

Demo: Sentence **3**, all **NP**s

In [8]:
demo = [value for value in STRUCTURE_phrases[3] if value[1] == 'NP']
for _, TAG, TOKEN_IDS in demo:
    print(TAG + '\t' + ' '.join([gold_tokens[doc_idx_to_align_idx[idx]] for idx in TOKEN_IDS]))

NP	the noise
NP	two baby squirrels


<img src='misc/fig-parse-tree-np-nodes.png' width='600'>