# Featurize passage text

**Goal:** Given a passage of text, ultimately generate features that can be processed alongside transcripts and alignments to predict reading experts' **observations**.

**Constraint:** Features should be interpretable and explainable.

Begin by processing a single passage.

In [17]:
import re
import spacy
from spacy import displacy
from benepar.spacy_plugin import BeneparComponent

In [2]:
nlp = spacy.load('en_core_web_sm')

In [3]:
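# Map passage ID -> raw passage string (first two tab-separated columns).
with open('data/moby-passages-36/passages-with-line-breaks.tsv') as f:
    rows = (l.split('\t') for l in f)
    lines = {row[0]: row[1].strip() for row in rows}

line = lines['330']
print('\tRaw passage [incl. line breaks]', line, sep='\n')
# Keep only the text before the first '#', and replace each '$'
# (which appears to mark a line break) with a space.
line_text_only = line[:line.index('#')].replace('$', ' ')
print('\tProcessed passage', line_text_only, sep='\n')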

	Raw passage [incl. line breaks]
Sam and Jo went for a hike. They took a path through the $woods. Suddenly, Sam heard a noise coming from the tree $above their heads. Jo climbed up to see what the noise was $and found two baby squirrels. The babies were alone, but $their mother must be somewhere near. The children watched $and waited.$$Sure enough, the mother soon returned with a mouthful of nuts. $The noises stopped as the baby squirrels began to eat.$$Sam and Jo smiled, knowing the squirrels were safe with $their mother.#1#1.35#1#0.535
	Processed passage
Sam and Jo went for a hike. They took a path through the  woods. Suddenly, Sam heard a noise coming from the tree  above their heads. Jo climbed up to see what the noise was  and found two baby squirrels. The babies were alone, but  their mother must be somewhere near. The children watched  and waited.  Sure enough, the mother soon returned with a mouthful of nuts.  The noises stopped as the baby squirrels began to eat.  Sam and Jo smiled, knowing the squirrels were safe with  their mother.
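The processed passage above still contains doubled spaces where `$` was replaced. As a minimal sketch, the splitting step could be wrapped in a helper that also keeps the line layout. The function name `split_passage` is hypothetical, and the code assumes that `$` marks a line break (so `$$` produces an empty line). The `#`-separated fields at the end are returned as raw strings because their meaning is not documented here.

```python
def split_passage(raw):
    """Split a raw passage into flat text, display lines, and trailing '#' fields."""
    # Everything before the first '#' is passage text; the rest is kept as
    # uninterpreted strings, since the meaning of those fields is unknown here.
    text, _, tail = raw.partition('#')
    fields = tail.split('#') if tail else []
    # Assumption: '$' marks a line break, so '$$' yields an empty display line.
    display_lines = [part.strip() for part in text.split('$')]
    # Collapse runs of whitespace so downstream tools see single spaces.
    flat_text = ' '.join(text.replace('$', ' ').split())
    return flat_text, display_lines, fields


# Toy input in the same shape as the raw passage (not real data).
flat, display_lines, fields = split_passage('One two $three.$$Four five.#1#1.35')
print(flat)
print(display_lines)
print(fields)
```

Keeping `display_lines` around may be useful later if line position turns out to matter for the observations, but that is a guess rather than something established above.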

## Constituency parsing

Constituency parses come from `benepar`, an implementation of "Constituency Parsing with a Self-Attentive Encoder" (Kitaev and Klein, ACL 2018). It gives much better results than more rudimentary statistical parsers.

In [4]:
# The model must be downloaded once beforehand, e.g. benepar.download('benepar_en2').
nlp.add_pipe(BeneparComponent('benepar_en2'))

In [51]:
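doc = nlp(line_text_only)
sent = list(doc.sents)[0]  # first sentence of the passage
const_parse = sent._.parse_string  # bracketed constituency parse added by benepar
print(const_parse)
# To print the parse of every sentence instead:
# for s in doc.sents:
#     print(s._.parse_string)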

(S (NP (NNP Sam) (CC and) (NNP Jo)) (VP (VBD went) (PP (IN for) (NP (DT a) (NN hike)))) (. .))


<img src='misc/fig-parse-tree.png' width="400">
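To turn the bracketed parse string into something features can be computed from, one option is to read it into a nested structure and list the word span of each phrase. The sketch below uses only the standard library. The helper names `read_tree` and `phrase_spans` are hypothetical, and the code assumes well-formed Penn Treebank-style brackets in which literal parentheses in the text are escaped (e.g. as `-LRB-`).

```python
import re


def read_tree(parse_string):
    """Turn a bracketed parse string into nested (label, children) tuples."""
    tokens = re.findall(r'\(|\)|[^\s()]+', parse_string)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip '('
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ')':
            if tokens[pos] == '(':
                children.append(node())
            else:
                children.append(tokens[pos])  # a leaf word
                pos += 1
        pos += 1                      # skip ')'
        return (label, children)

    return node()


def phrase_spans(tree):
    """List (label, start, end) word spans, end exclusive, for phrase-level nodes."""
    spans = []

    def walk(node, start):
        label, children = node
        end = start
        for child in children:
            if isinstance(child, str):
                end += 1              # a word directly under a POS tag
            else:
                end = walk(child, end)
        # Record only nodes that dominate other nodes, i.e. skip POS tags.
        if any(not isinstance(c, str) for c in children):
            spans.append((label, start, end))
        return end

    walk(tree, 0)
    return spans


tree = read_tree('(S (NP (NNP Sam) (CC and) (NNP Jo)) (VP (VBD went) '
                 '(PP (IN for) (NP (DT a) (NN hike)))) (. .))')
for label, start, end in phrase_spans(tree):
    print(label, start, end)
```

Spans are word indices within the sentence and count the final period as a word, so they would need an offset to line up with passage-level token positions.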

## Dependency parsing

In [6]:
displacy.render(sent, style='dep', options={'distance': 100})

## Features pertaining to entire passage

 - length of passage (e.g. in words or sentences)
 - number of noun phrases (NPs)
 - average number of syllables per word
 - average number of morphemes per word
 - number of distinct phonetic sounds
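A few of the passage-level features listed above can be sketched without extra resources. The code below is a rough illustration, not a settled design: `count_syllables` is a hypothetical vowel-group heuristic that overcounts words with a silent final *e* (it gives 2 for *hike*), and the NP count simply matches `(NP ` in the bracketed parse strings. Morpheme and phoneme counts would need a lexicon or pronunciation dictionary and are left out.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: count groups of consecutive vowel letters."""
    groups = re.findall(r'[aeiouy]+', word.lower())
    return max(1, len(groups))


def passage_features(text, parse_strings):
    """Compute simple, interpretable features for one passage.

    text          -- the passage as a plain string
    parse_strings -- one bracketed constituency parse per sentence
    """
    words = re.findall(r"[A-Za-z']+", text)
    n_words = len(words)
    total_syllables = sum(count_syllables(w) for w in words)
    return {
        'n_words': n_words,
        'n_sentences': len(parse_strings),
        # '(NP ' matches NP nodes but not POS tags such as '(NNP '.
        'n_noun_phrases': sum(len(re.findall(r'\(NP\s', p)) for p in parse_strings),
        'mean_syllables_per_word': total_syllables / n_words if n_words else 0.0,
    }


# First sentence of the passage, with the parse string printed earlier.
features = passage_features(
    'Sam and Jo went for a hike.',
    ['(S (NP (NNP Sam) (CC and) (NNP Jo)) (VP (VBD went) '
     '(PP (IN for) (NP (DT a) (NN hike)))) (. .))'],
)
print(features)
```

Each value maps directly to something a reading expert could check by hand, which fits the interpretability constraint.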

## Features that pertain to a single token

 - part-of-speech (POS) tag
 - number of syllables
 - number of distinct phonetic sounds
 - distance to the left and right edges of the enclosing phrase (what constitutes a *phrase*?)
 - whether the token is a function word (*and*, *to*, etc.)
 - frequency over some corpus (a measure of how common a word is)
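Three of the token-level features above can be sketched in plain Python. Everything named here is an assumption made for illustration: `FUNCTION_WORDS` is a deliberately tiny hand-made list, the corpus is a toy `Counter`, and a *phrase* is taken to be the innermost enclosing NP, VP, or PP, which is only one possible answer to the open question in the list. The spans are hand-copied from the constituency parse of the first sentence.

```python
from collections import Counter

# Hypothetical, deliberately small function-word list; a real one would be longer.
FUNCTION_WORDS = {'a', 'an', 'the', 'and', 'or', 'but', 'to', 'of',
                  'for', 'in', 'on', 'with', 'as'}


def distances_to_phrase_edges(index, spans, labels=('NP', 'VP', 'PP')):
    """Return (left, right) distances from a word to the edges of its
    innermost enclosing phrase, or None if no such phrase exists."""
    enclosing = [(end - start, start, end) for label, start, end in spans
                 if label in labels and start <= index < end]
    if not enclosing:
        return None
    _, start, end = min(enclosing)    # shortest span = innermost phrase
    return index - start, end - 1 - index


def token_features(words, spans, corpus_counts):
    """Build one feature dict per word; corpus_counts must be a Counter."""
    total = sum(corpus_counts.values())
    rows = []
    for i, word in enumerate(words):
        lower = word.lower()
        rows.append({
            'word': word,
            'is_function_word': lower in FUNCTION_WORDS,
            'relative_frequency': corpus_counts[lower] / total if total else 0.0,
            'phrase_edge_distances': distances_to_phrase_edges(i, spans),
        })
    return rows


words = ['Sam', 'and', 'Jo', 'went', 'for', 'a', 'hike']
# (label, start, end) word spans with end exclusive.
spans = [('NP', 0, 3), ('VP', 3, 7), ('PP', 4, 7), ('NP', 5, 7)]
toy_counts = Counter(w.lower() for w in words)  # stand-in for a real corpus
rows = token_features(words, spans, toy_counts)
for row in rows:
    print(row)
```

Changing `labels` is the simplest way to try a different definition of *phrase*, for example restricting it to NPs only.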