# Identifying difficulties in a single reading session using text-analysis and performance matrices

Inputs:
- Text-analysis matrix for a passage
- Performance matrix for session (three readings)

Output:
- Report of text-analysis fields where problems are detected

*Methodological hurdles*: 

1. Linguistic: How do we distinguish words on which a reader struggles?
2. Computational: How do we decide whether a reader struggles on a particular text-analysis field enough for us to consider it a "problem area"?

## Solution to linguistic hurdle (very naive)
Define
- $d$: the threshold duration (in frames) for a `<pause>` marker to count as a pause, say 25.

If a reader misses a word, or pauses for at least $d$ frames before a word, we label that word as "difficult."
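A minimal sketch of this labeling rule follows. It assumes a hypothetical event format, in which a reading is a list of dicts that are either pauses (with a `duration` in frames) or words (with a `missed` flag). The actual layout of the performance matrix may differ, and the name `label_difficult_words` is illustrative only.

```python
from typing import Dict, List, Set


def label_difficult_words(events: List[Dict], d: int = 25) -> Set[int]:
    """Return the positions of words labeled "difficult" in one reading.

    Assumed (hypothetical) event format:
      {"type": "pause", "duration": <frames>}
      {"type": "word", "text": <str>, "missed": <bool>}
    A word is difficult if it was missed, or if a pause of at least
    d frames occurred immediately before it.
    """
    difficult: Set[int] = set()
    longest_pause = 0  # longest pause seen since the previous word
    position = 0       # index of the current word within the reading
    for event in events:
        if event["type"] == "pause":
            longest_pause = max(longest_pause, event["duration"])
        elif event["type"] == "word":
            if event.get("missed", False) or longest_pause >= d:
                difficult.add(position)
            position += 1
            longest_pause = 0  # a pause only counts against the next word
    return difficult
```

Returning word positions rather than word strings keeps repeated words (such as "the") distinct, which matters once the result is intersected with other word sets.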

**Future consideration**: Readers may pause at the beginning of a short phrase (a noun phrase, **NP**, or a prepositional phrase, **PP**) if the phrase contains a difficult word, for instance pausing before the phrase "*of the organization*". To account for this, we could incorporate constituency parses into the methodology above.

## Solution to computational hurdle (less naive)
*Assume we can accurately identify the words that a reader struggles on, and we wish to determine whether the reader is struggling on words with a quality $Q$.*

Define 
- $D_i$: the set of words that the reader has trouble with in reading $i$.
- $Q_i$: the set of words in reading $i$ with quality $Q$.
- $n_{\text{readings}}$: the threshold number of readings in which the reader needs to show difficulty, say 1 or 2.
- $p$: the threshold proportion of words in a particular reading on which the reader needs to show difficulty, say 0.5.

If $Q$ is binary-valued (e.g. $Q=$ words that are decodable at the reader's grade level), we can reduce this task to the following rule.

If, for at least $n_{\text{readings}}$ readings $i$, we observe that
$$
\frac{|D_i\cap Q_i|}{|Q_i|}>p\quad\text{or}\quad\frac{|D_i\cap Q_i^c|}{|Q_i^c|}>p,
$$
then we report that the reader is struggling on quality $Q$. Here $Q_i^c$ denotes the set of words in reading $i$ that do not have quality $Q$.
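The binary rule can be sketched as follows. This is an illustration under stated assumptions rather than a definitive implementation: words are identified by their positions, the complement $Q_i^c$ is taken relative to all words in reading $i$, and a proportion over an empty set is treated as 0. The function names are hypothetical.

```python
from typing import List, Set


def overlap_proportion(difficult: Set[int], group: Set[int]) -> float:
    """|D ∩ G| / |G|, treated as 0.0 when G is empty (an assumption)."""
    if not group:
        return 0.0
    return len(difficult & group) / len(group)


def struggles_on_binary_quality(
    D: List[Set[int]],      # D[i]: difficult word positions in reading i
    Q: List[Set[int]],      # Q[i]: positions of words with quality Q
    words: List[Set[int]],  # words[i]: all word positions in reading i
    n_readings: int = 1,
    p: float = 0.5,
) -> bool:
    """True if enough readings show difficulty on Q or on its complement."""
    flagged = 0
    for D_i, Q_i, W_i in zip(D, Q, words):
        Q_i_complement = W_i - Q_i
        if (overlap_proportion(D_i, Q_i) > p
                or overlap_proportion(D_i, Q_i_complement) > p):
            flagged += 1
    return flagged >= n_readings
```

For example, with six words per reading of which four have quality $Q$, a reader who struggles on three of those four in one reading exceeds $p=0.5$ in that reading only, so the quality is reported when $n_{\text{readings}}=1$ but not when $n_{\text{readings}}=2$.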

If $Q$ is discrete but not binary-valued (e.g. number of letters), we modify this approach. Suppose $Q$ takes on values $\{q_1,\dots,q_k\}$. Then we report each $q_m$ for which, in at least $n_{\text{readings}}$ readings $i$,
$$
\frac{|D_i\cap Q^{(m)}_i|}{|Q^{(m)}_i|}>p\quad\text{or}\quad\frac{|D_i\cap (Q^{(m)}_i)^c|}{|(Q^{(m)}_i)^c|}>p,
$$
where $Q^{(m)}_i$ is the set of words in reading $i$ whose value for quality $Q$ is less than $q_m$, and $(Q^{(m)}_i)^c$ is the set of remaining words in reading $i$.
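A sketch of the multi-valued case is given below. It assumes, hypothetically, that each reading is supplied as a dict mapping word position to the word's value for $Q$, that the $n_{\text{readings}}$ threshold carries over from the binary case, and that a proportion over an empty set counts as 0. The name `report_quality_values` is illustrative.

```python
from typing import Dict, List, Set


def report_quality_values(
    D: List[Set[int]],              # D[i]: difficult word positions
    values: List[Dict[int, float]], # values[i][w]: value of Q for word w
    q_values: List[float],          # candidate thresholds q_1, ..., q_k
    n_readings: int = 1,
    p: float = 0.5,
) -> List[float]:
    """Return each q_m flagged in at least n_readings readings."""
    reported = []
    for q_m in q_values:
        flagged = 0
        for D_i, values_i in zip(D, values):
            # Words whose value is strictly less than q_m, and the rest.
            below = {w for w, v in values_i.items() if v < q_m}
            rest = set(values_i) - below
            prop_below = len(D_i & below) / len(below) if below else 0.0
            prop_rest = len(D_i & rest) / len(rest) if rest else 0.0
            if prop_below > p or prop_rest > p:
                flagged += 1
        if flagged >= n_readings:
            reported.append(q_m)
    return reported
```

With word lengths 2, 3, 5, 8, 9 and difficulty on the two longest words, the thresholds 6 and 9 are reported, because the difficult words dominate the set of words at or above each of those thresholds.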

If $Q$ is continuous and ranges over $[a,b]$, we repeat the above process using the cut points of a fixed partition of $[a,b]$ as the values $q_m$.
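One way to obtain such a partition is sketched below, assuming equal-width intervals. The text only requires that the partition be fixed, so the equal-width choice and the name `partition_cut_points` are assumptions made for illustration.

```python
from typing import List


def partition_cut_points(a: float, b: float, k: int) -> List[float]:
    """Upper endpoints of k equal-width intervals partitioning [a, b].

    The returned values can play the role of q_1, ..., q_k when the
    discrete procedure is applied to a continuous quality.
    """
    if k < 1 or b <= a:
        raise ValueError("need k >= 1 and a < b")
    width = (b - a) / k
    return [a + j * width for j in range(1, k + 1)]
```

Because the discrete rule uses a strict comparison against $q_m$, words whose value equals $b$ exactly fall outside every "less than" set, which may or may not be the intended behavior.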

```python
# Thresholds for the rules above
THRESHOLD_PAUSE_DURATION = 25  # d, in frames
THRESHOLD_NUMBER_OF_DIFFICULT_READINGS = 1  # n_readings
THRESHOLD_OVERLAP_PROPORTION = 0.5  # p
```

```python
import json
import os
import sys
import pandas as pd
```