# Featurize passage text

**Goal:** Given a passage of text, ultimately generate features that can be processed alongside transcripts and alignments to predict reading experts' **observations**.

**Constraint:** Features should be interpretable and explainable.

Begin by processing a single passage.

In [1]:
import re
import json
import spacy
import time
import eng_to_ipa
import pandas as pd
from spacy import displacy
from benepar.spacy_plugin import BeneparComponent
from NgramFetcher import NgramFetcher


In [2]:
nlp = spacy.load('en_core_web_sm')

In [3]:
with open('data/moby-passages-36/passages-with-line-breaks.tsv') as f:
    lines = (l.split('\t') for l in f)
    lines = {xy[0]: xy[1].strip() for xy in lines}

line = lines['330']
print('\tRaw passage [incl. line breaks]', line, sep='\n')
line_text_only = line[:line.index('#')].replace('$', ' ').replace('  ', ' ')
print('\tProcessed passage', line_text_only, sep='\n')

	Raw passage [incl. line breaks]
Sam and Jo went for a hike. They took a path through the $woods. Suddenly, Sam heard a noise coming from the tree $above their heads. Jo climbed up to see what the noise was $and found two baby squirrels. The babies were alone, but $their mother must be somewhere near. The children watched $and waited.$$Sure enough, the mother soon returned with a mouthful of nuts. $The noises stopped as the baby squirrels began to eat.$$Sam and Jo smiled, knowing the squirrels were safe with $their mother.#1#1.35#1#0.535
	Processed passage
Sam and Jo went for a hike. They took a path through the woods. Suddenly, Sam heard a noise coming from the tree above their heads. Jo climbed up to see what the noise was and found two baby squirrels. The babies were alone, but their mother must be somewhere near. The children watched and waited. Sure enough, the mother soon returned with a mouthful of nuts. The noises stopped as the baby squirrels began to eat. Sam and Jo smiled, k
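
The raw format above can also be split in one pass. The following is a minimal sketch, assuming `$` marks a line break and `#` separates the passage from its trailing fields; the helper name `split_raw_passage` is hypothetical, and the meaning of the trailing fields is not stated in this notebook, so they are returned uninterpreted.

```python
def split_raw_passage(raw):
    """Split a raw passage line into clean text, display lines, and trailing fields."""
    body, *fields = raw.split('#')
    # '$' marks a line break, so '$$' yields an empty display line.
    display_lines = [part.strip() for part in body.split('$')]
    # Splitting on whitespace and re-joining collapses any run of spaces,
    # however many consecutive breaks produced it.
    text = ' '.join(body.replace('$', ' ').split())
    return text, display_lines, fields


# Shortened example in the same format as the raw passage above.
raw = 'Sam and Jo went for a hike. They took a path through the $woods.$$Sure enough.#1#1.35#1#0.535'
text, display_lines, fields = split_raw_passage(raw)
print(text)
print(display_lines)
print(fields)
```

Keeping `display_lines` preserves where each printed line ends, which may matter if line position is later used as a feature.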

## Constituency parsing

The parser is `benepar` (loaded below as the `benepar_en2` model), an implementation of "Constituency Parsing with a Self-Attentive Encoder" (Kitaev and Klein, ACL 2018). It gives much better results than more rudimentary statistical parsers.

In [4]:
nlp.add_pipe(BeneparComponent('benepar_en2'))

In [5]:
doc = nlp(line_text_only)
sent = list(doc.sents)[0]
const_parse = sent._.parse_string
print(const_parse)

(S (NP (NNP Sam) (CC and) (NNP Jo)) (VP (VBD went) (PP (IN for) (NP (DT a) (NN hike)))) (. .))


<img src='misc/fig-parse-tree.png' width="400">

## Dependency parsing

In [6]:
displacy.render(sent, style='dep', options={'distance': 100})

## Features pertaining to entire passage

 - length of passage
 - number of NPs
 - average number of syllables per word
 - average number of morphemes per word
 - number of distinct phonetic sounds
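
Two of the passage-level features listed above (passage length and number of NPs) can be read directly off the bracketed parse strings. This is a rough sketch, assuming the parse format printed earlier; `count_label` and `count_tokens` are hypothetical helper names, and passage length is taken to mean the token count including punctuation.

```python
import re


def count_label(parse_string, label):
    """Count constituents carrying exactly this label in a bracketed parse."""
    # '(NP ' matches an NP node but not '(NNP ', because the label must
    # directly follow the opening bracket and be followed by whitespace.
    return len(re.findall(r'\(' + re.escape(label) + r'\s', parse_string))


def count_tokens(parse_string):
    """Count leaves, each of which appears as '(TAG word)' with no nested bracket."""
    return len(re.findall(r'\(\S+ [^()\s]+\)', parse_string))


parse = ('(S (NP (NNP Sam) (CC and) (NNP Jo)) '
         '(VP (VBD went) (PP (IN for) (NP (DT a) (NN hike)))) (. .))')
print(count_label(parse, 'NP'))   # number of noun phrases in this sentence
print(count_tokens(parse))        # number of tokens, punctuation included
```

Summing these counts over `sent._.parse_string` for every sentence would give the passage-level totals.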

## Features that pertain to a single token

 - POS
 - number of syllables
 - number of distinct phonetic sounds
 - left/right-distance to end of phrase (what constitutes *phrase*?)
 - function word? (and, to, etc.)
 - frequency over some corpus (a measure of how common a word is)
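
Some of the token-level ideas above can be prototyped without any external resources. The sketch below is an assumption-heavy illustration, not the method used later in this notebook: the syllable counter is a crude vowel-group heuristic that is wrong for some words, and `FUNCTION_WORDS` is a deliberately tiny, hypothetical list.

```python
import re

# Hypothetical, deliberately small function-word list; a real one would be longer.
FUNCTION_WORDS = {'a', 'an', 'the', 'and', 'but', 'or', 'to', 'for', 'of', 'in', 'on', 'with'}


def naive_syllable_count(word):
    """Estimate syllables by counting vowel groups, dropping a silent final 'e'."""
    w = word.lower()
    count = len(re.findall(r'[aeiouy]+', w))
    # A final 'e' after a consonant is usually silent ('hike', 'noise'),
    # but not in words ending in 'le' such as 'little'.
    if count > 1 and w.endswith('e') and not w.endswith('le') and w[-2] not in 'aeiouy':
        count -= 1
    # Punctuation and other non-alphabetic tokens get 0 syllables.
    return max(count, 1) if w.isalpha() else 0


def is_function_word(word):
    """Case-insensitive membership test against the function-word list."""
    return word.lower() in FUNCTION_WORDS


tokens = ['Sam', 'and', 'Jo', 'went', 'for', 'a', 'hike', '.']
for token in tokens:
    print(token, naive_syllable_count(token), int(is_function_word(token)))
```

A pronunciation dictionary would give more reliable syllable counts; the heuristic is only meant to show the shape of the feature.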

In [7]:
fetcher = NgramFetcher('data/staging/ngram_freqs.json')

Finished loading data/staging/ngram_freqs.json


In [8]:
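def compute_ngrams_for_sentence(doc, n_before, n_after):
    """Score every token of every sentence in `doc` by the n-gram around it.

    Each query spans `n_before` tokens to the left and `n_after` to the right.
    """
    # Prepend a start marker so sentence-initial tokens have left context.
    sentences = [['_START_'] + [token.text for token in sent] for sent in doc.sents]
    rv = []
    for sent in sentences:
        for idx, word in enumerate(sent):
            # The window is clipped at sentence edges, so edge queries are shorter.
            # Commas are rewritten as '_._' in the query string.
            query = ' '.join(sent[max(0, idx - n_before): idx + n_after + 1]).replace(',', '_._')
            score, was_scraped = fetcher.fetch(query)
            rv.append((word, score))
            # `was_scraped` presumably flags a live lookup; pause to rate-limit.
            if was_scraped:
                time.sleep(5)
    return rv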
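
To see which queries the slicing in this function produces, the window logic can be run on its own. This sketch reuses the same slice expression but leaves out the fetcher and the comma rewriting; `ngram_queries` is a hypothetical name.

```python
def ngram_queries(tokens, n_before, n_after):
    """Return the query string built around each token of one sentence."""
    queries = []
    for idx in range(len(tokens)):
        # Same slice as in the function above: clipped at 0 on the left,
        # and silently clipped by Python at the end of the list on the right.
        window = tokens[max(0, idx - n_before): idx + n_after + 1]
        queries.append(' '.join(window))
    return queries


tokens = ['_START_', 'Sam', 'and', 'Jo', 'went']
for token, query in zip(tokens, ngram_queries(tokens, 1, 1)):
    print(f'{token!r:>10} -> {query!r}')
```

Note that the first and last tokens of a sentence get a shorter query than the interior tokens, so their scores are not directly comparable with the rest.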

In [9]:
rv = compute_ngrams_for_sentence(doc, 1, 1)

trying: Jo went for
Ngram not found for: Jo went for
trying: Jo climbed up
Ngram not found for: Jo climbed up
trying: baby squirrels began
Ngram not found for: baby squirrels began
trying: knowing the squirrels
Ngram not found for: knowing the squirrels
trying: squirrels were safe
Ngram not found for: squirrels were safe


## Feature: trigram likelihood

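- Likelihood of the `center` token given one token of context on each side, i.e. the trigram `context` `center` `context` (computed above with `n_before=1`, `n_after=1`)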

In [10]:
f_trigram_likelihood = rv
f_trigram_likelihood[:5]

[('_START_', 2.89174061585493),
 ('Sam', 0.08594705249092478),
 ('and', 0.0002478075859267613),
 ('Jo', 0.001364227892765734),
 ('went', 0)]
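
How `NgramFetcher` turns a query into a score is not shown in this notebook. One common reading of "likelihood of the center given its context" is a relative-frequency estimate: the count of the full trigram divided by the count of all trigrams sharing the same left and right neighbours. The sketch below illustrates that reading only; the counts are made up and `center_likelihood` is a hypothetical helper.

```python
from collections import Counter

# Made-up trigram counts, for illustration only.
trigram_counts = Counter({
    ('Sam', 'and', 'Jo'): 3,
    ('Sam', 'or', 'Jo'): 1,
    ('Sam', 'with', 'Jo'): 1,
})


def center_likelihood(left, center, right, counts):
    """Estimate P(center | left, right) as a relative frequency."""
    # Total count of every trigram with the same outer context.
    total = sum(c for (l, _, r), c in counts.items() if l == left and r == right)
    if total == 0:
        return 0.0  # unseen context: no estimate available
    # Counter returns 0 for trigrams that were never observed.
    return counts[(left, center, right)] / total


print(center_likelihood('Sam', 'and', 'Jo', trigram_counts))
print(center_likelihood('Sam', 'and', 'Max', trigram_counts))
```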

## Feature: word length

- number of letters
- the `_START_` token is assigned length 0

In [11]:
sentences = [['_START_'] + [token.text for token in sent] for sent in doc.sents]
# `_START_` gets length 0; every other token gets its character count.
f_word_length = [
    (word, 0 if word == '_START_' else len(word))
    for sent in sentences
    for word in sent
]
f_word_length[:5]

[('_START_', 0), ('Sam', 3), ('and', 3), ('Jo', 2), ('went', 4)]

## Feature: sight word
- 1 if the token is a Dolch sight word at or below the passage's reading level (here, pre-primer through second grade), else 0
- the `_START_` token counts as a sight word

In [12]:
with open('data/general-resources/dolch-sight-words.json') as f:
    dolch = json.load(f)

In [13]:
appropriate_sight_words = dolch['pre-primer'] + dolch['primer'] + dolch['first-grade'] + dolch['second-grade']

In [14]:
rv = []
for sent in sentences:
    for word in sent:
        if word == '_START_':
            rv.append(tuple([word, 1]))
            continue
        if word in appropriate_sight_words:
            rv.append(tuple([word, 1]))
        else:
            rv.append(tuple([word, 0]))
f_sight_word = rv
f_sight_word[:5]

[('_START_', 1), ('Sam', 0), ('and', 1), ('Jo', 0), ('went', 1)]
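
The lookup above is case-sensitive, so a sentence-initial "They" only matches if the list contains that exact capitalization. The sketch below shows a case-insensitive variant that also selects levels by cutoff instead of concatenating them by hand. The word lists are a small illustrative subset of the Dolch lists, and the `'third-grade'` key, `LEVELS`, and both helper names are assumptions that do not come from the JSON file used here.

```python
# Illustrative subset only; the real lists are much longer.
dolch_subset = {
    'pre-primer': ['a', 'and', 'the'],
    'primer': ['they', 'went'],
    'first-grade': ['were'],
    'second-grade': ['their'],
    'third-grade': ['together'],
}

# Assumed ordering of levels, easiest first.
LEVELS = ['pre-primer', 'primer', 'first-grade', 'second-grade', 'third-grade']


def sight_words_up_to(dolch, level):
    """Collect every sight word at or below `level` into a lowercase set."""
    cutoff = LEVELS.index(level) + 1
    return {w.lower() for name in LEVELS[:cutoff] for w in dolch.get(name, [])}


def sight_word_flags(tokens, sight_words):
    """Flag each token with 1 if it is `_START_` or a sight word, else 0."""
    return [(t, 1 if t == '_START_' or t.lower() in sight_words else 0) for t in tokens]


allowed = sight_words_up_to(dolch_subset, 'second-grade')
flags = sight_word_flags(['_START_', 'They', 'went', 'together', '.'], allowed)
print(flags)
```

Using a set also makes each membership test constant-time, which matters little for one passage but helps when processing many.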

## Features: more n-grams

In the comments below, `w` is the target token and `v` is a context token.

In [15]:
fetcher.save()

File saved to data/staging/ngram_freqs.json


In [18]:
rv_ngrams_1_0 = compute_ngrams_for_sentence(doc, 1, 0) # bigram: v w

In [20]:
rv_ngrams_2_0 = compute_ngrams_for_sentence(doc, 2, 0) # trigram: v v w

trying: Jo went for
Ngram not found for: Jo went for
trying: Jo climbed up
Ngram not found for: Jo climbed up
trying: baby squirrels began
Ngram not found for: baby squirrels began
trying: knowing the squirrels
Ngram not found for: knowing the squirrels
trying: squirrels were safe
Ngram not found for: squirrels were safe


In [22]:
rv_ngrams_0_1 = compute_ngrams_for_sentence(doc, 0, 1) # bigram: w v

In [23]:
rv_ngrams_0_2 = compute_ngrams_for_sentence(doc, 0, 2) # trigram: w v v

trying: Jo went for
Ngram not found for: Jo went for
trying: Jo climbed up
Ngram not found for: Jo climbed up
trying: baby squirrels began
Ngram not found for: baby squirrels began
trying: knowing the squirrels
Ngram not found for: knowing the squirrels
trying: squirrels were safe
Ngram not found for: squirrels were safe


In [25]:
rv_ngrams_1_2 = compute_ngrams_for_sentence(doc, 1, 2) # 4-gram: v w v v

trying: _START_ Sam and Jo
Ngram not found for: _START_ Sam and Jo
trying: Sam and Jo went
Ngram not found for: Sam and Jo went
trying: and Jo went for
Ngram not found for: and Jo went for
trying: Jo went for a
Ngram not found for: Jo went for a
trying: _START_ They took a
Ngram not found for: _START_ They took a
trying: _START_ Suddenly _._ Sam
Ngram not found for: _START_ Suddenly _._ Sam
trying: Suddenly _._ Sam heard
Ngram not found for: Suddenly _._ Sam heard
trying: _._ Sam heard a
Ngram not found for: _._ Sam heard a
trying: Sam heard a noise
Ngram not found for: Sam heard a noise
trying: _START_ Jo climbed up
Ngram not found for: _START_ Jo climbed up
trying: Jo climbed up to
Ngram not found for: Jo climbed up to
trying: noise was and found
Ngram not found for: noise was and found
trying: was and found two
Ngram not found for: was and found two
trying: and found two baby
Ngram not found for: and found two baby
trying: found two baby squirrels
Ngram not found for: found two baby

In [27]:
rv_ngrams_2_1 = compute_ngrams_for_sentence(doc, 2, 1) # 4-gram: v v w v

trying: _START_ Sam and Jo
Ngram not found for: _START_ Sam and Jo
trying: Sam and Jo went
Ngram not found for: Sam and Jo went
trying: and Jo went for
Ngram not found for: and Jo went for
trying: Jo went for a
Ngram not found for: Jo went for a
trying: _START_ They took a
Ngram not found for: _START_ They took a
trying: _START_ Suddenly _._ Sam
Ngram not found for: _START_ Suddenly _._ Sam
trying: Suddenly _._ Sam heard
Ngram not found for: Suddenly _._ Sam heard
trying: _._ Sam heard a
Ngram not found for: _._ Sam heard a
trying: Sam heard a noise
Ngram not found for: Sam heard a noise
trying: _START_ Jo climbed up
Ngram not found for: _START_ Jo climbed up
trying: Jo climbed up to
Ngram not found for: Jo climbed up to
trying: noise was and found
Ngram not found for: noise was and found
trying: was and found two
Ngram not found for: was and found two
trying: and found two baby
Ngram not found for: and found two baby
trying: found two baby squirrels
Ngram not found for: found two baby

In [None]:
rv_ngrams_2_2 = compute_ngrams_for_sentence(doc, 2, 2) # 5-gram: v v w v v

## Feature: IPA (very naive)

- number of distinct glyphs in the IPA for each word

In [31]:
rv_ipa = []
for sent in sentences:
    for word in sent:
        if word in {'.', ',', '?', '!', '_START_'}:
            rv_ipa.append(tuple([word, 0]))
        else:
            rv_ipa.append(tuple([word, len(set(eng_to_ipa.convert(word)))]))
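
Because the count above is taken over raw characters, any stress marks in a transcription count as glyphs, and a diphthong such as `aɪ` counts as two. The sketch below isolates the counting step and optionally ignores stress marks. It operates on an IPA string directly; the transcriptions are hand-written for illustration and are not `eng_to_ipa` output, and `distinct_ipa_glyphs` is a hypothetical name.

```python
STRESS_MARKS = {'ˈ', 'ˌ'}  # primary and secondary stress


def distinct_ipa_glyphs(ipa, ignore=STRESS_MARKS):
    """Count distinct characters in an IPA string, skipping `ignore` and whitespace."""
    return len({ch for ch in ipa if ch not in ignore and not ch.isspace()})


print(distinct_ipa_glyphs('haɪk'))                  # h, a, ɪ, k
print(distinct_ipa_glyphs('ˈmʌðər'))                # stress mark ignored
print(distinct_ipa_glyphs('ˈmʌðər', ignore=set()))  # stress mark counted
```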

## Merge features

In [45]:
# text_matrix = pd.DataFrame(f_trigram_likelihood)
# text_matrix = text_matrix.rename(columns={0: 'token', 1: 'ngram-1-1'})

In [47]:
# values_of = lambda x: [xy[1] for xy in x]

In [48]:
# text_matrix['ngrams-1-0'] = values_of(rv_ngrams_1_0)

In [50]:
# text_matrix['ngrams-2-0'] = values_of(rv_ngrams_2_0)

In [51]:
# text_matrix['ngrams-0-1'] = values_of(rv_ngrams_0_1)
# text_matrix['ngrams-0-2'] = values_of(rv_ngrams_0_2)

In [53]:
# text_matrix['word-len'] = values_of(f_word_length)
# text_matrix['sight-word'] = values_of(f_sight_word)

In [55]:
# text_matrix['naive-ipa'] = values_of(rv_ipa)

In [59]:
# text_matrix.to_csv('output/20200623-text-matrix-330.tsv', sep='\t', index=False)
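
The commented-out cells above assume that every feature list has the same tokens in the same order. The sketch below makes that assumption explicit by checking alignment before merging. It uses only the standard library rather than pandas so that it runs on its own; `merge_features` is a hypothetical helper, and the sample values are the first three entries of `f_word_length` and `f_sight_word` shown earlier.

```python
import csv
import io


def merge_features(named_features):
    """Merge {'name': [(token, value), ...]} into a header and one row per token."""
    names = list(named_features)
    columns = [named_features[name] for name in names]
    lengths = {len(col) for col in columns}
    if len(lengths) != 1:
        raise ValueError(f'feature lists differ in length: {sorted(lengths)}')
    rows = []
    for entries in zip(*columns):
        # Every feature must describe the same token at this position.
        tokens = {tok for tok, _ in entries}
        if len(tokens) != 1:
            raise ValueError(f'token mismatch across features: {sorted(tokens)}')
        rows.append([entries[0][0]] + [value for _, value in entries])
    return ['token'] + names, rows


features = {
    'word-len': [('_START_', 0), ('Sam', 3), ('and', 3)],
    'sight-word': [('_START_', 1), ('Sam', 0), ('and', 1)],
}
header, rows = merge_features(features)

# Write tab-separated output to an in-memory buffer instead of a file.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter='\t', lineterminator='\n')
writer.writerow(header)
writer.writerows(rows)
print(buffer.getvalue())
```

Failing loudly on a mismatch is a deliberate choice here: a silently misaligned column would be hard to spot in the resulting matrix.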