# Identifying difficulties in a single reading session using text-analysis and performance matrices

Inputs:
- Text-analysis matrix for a passage
- Performance matrix for session (three readings)

Output:
- Report of text-analysis fields where problems are detected

*Methodological hurdles*: 

1. Linguistic: How do we distinguish words on which a reader struggles?
2. Computational: How do we decide whether a reader is struggling enough on a particular text-analysis field for us to consider it a "problem area"?

## Solution to linguistic hurdle (very naive)
Define
- $d$: threshold duration for a `<pause>` marker to count as a pause, say 25 (frames).

If a reader misses a word, we label that word as "difficult." If a reader pauses for $d$ frames or more before a word in which 

**Future consideration**: Readers may pause at the beginning of a small phrase (**NP**s or **PP**s) if the phrase contains a difficult word. (For instance, pausing before reading the phrase "*of the organization*".) To account for this, we can devise a way to incorporate constituency parses into the above methodology.
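
The naive labeling rule of this section could be sketched roughly as follows. This is only an illustration under stated assumptions, not the contents of `identify_difficulties_util`: the function name `label_difficult_words` is hypothetical, rows are assumed to alternate pause, word, pause, word (so words sit at odd indices), a word is assumed to be "missed" when its `matches_expected` value is false, and pauses at positions where gold readers also pause are assumed to be ignored.

```python
def label_difficult_words(nframes, matches_expected, d, expected_pauses=()):
    """Return the row indices of words labeled difficult under the naive rule.

    Assumptions (hypothetical, not confirmed by the source):
    - rows alternate pause, word, pause, word, ... (words at odd indices)
    - matches_expected[idx] is False when the word at idx was missed
    - pauses whose index is in expected_pauses are not held against the reader
    """
    expected = set(expected_pauses)
    difficult = []
    for idx in range(1, len(nframes), 2):  # odd rows hold the words
        missed = not bool(matches_expected[idx])
        pause_idx = idx - 1  # the pause slot immediately before the word
        long_pause = nframes[pause_idx] >= d and pause_idx not in expected
        if missed or long_pause:
            difficult.append(idx)
    return difficult


# Toy data: the values could come from performance_df.nframes and
# performance_df.matches_expected in a real session.
toy_nframes = [0, 12, 30, 9, 40, 10, 3, 14]
toy_matches = [True, True, True, True, True, True, True, False]
print(label_difficult_words(toy_nframes, toy_matches, d=25, expected_pauses=[4]))
# [3, 7]
```

In the toy data, the word at row 3 follows an unexpected 30-frame pause, the 40-frame pause before row 5 is excused because gold readers pause there too, and the word at row 7 was missed outright.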

## Solution to computational hurdle (less naive)
*Assume we can accurately identify words that a reader struggles on, and we wish to determine if the reader is struggling on words for a quality $Q$.*

Define 
- $D_i$: the set of words that the reader has trouble with in reading $i$
- $Q_i$: the set of words in reading $i$ with quality $Q$
- $n_{\text{readings}}$: the threshold number of readings in which a reader needs to show difficulty, say 1 or 2
- $p$: the threshold overlap proportion for words in a particular reading on which a reader needs to show difficulty, say 0.5.

If $Q$ is binary-valued (e.g. $Q=$ words that are decodable at the reader's grade level), we can reduce this task to the following test.

If for at least $n_{\text{readings}}$ readings $i$ we observe that
$$
\frac{|D_i\cap Q_i|}{|Q_i|}>p\quad\text{or}\quad\frac{|D_i\cap Q_i^c|}{|Q_i^c|}>p,
$$
then we report that the reader is struggling on quality $Q$.
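
As a rough illustration of the binary case, the two proportions and the decision over readings might be computed as below. The function names `overlap_proportions` and `is_struggling` are hypothetical and are not the API of `identify_difficulties_util`; treating an empty set as contributing a proportion of 0 is also an assumption made here to avoid dividing by zero.

```python
def overlap_proportions(difficult, has_quality):
    """Return (|D ∩ Q| / |Q|, |D ∩ Q^c| / |Q^c|) for one reading.

    difficult: iterable of word positions the reader struggled on (D_i)
    has_quality: list of booleans, one per word, True if the word has quality Q
    """
    d = set(difficult)
    q = {i for i, flag in enumerate(has_quality) if flag}
    q_c = set(range(len(has_quality))) - q

    def proportion(s):
        # Assumption: an empty set contributes a proportion of 0.
        return len(d & s) / len(s) if s else 0.0

    return proportion(q), proportion(q_c)


def is_struggling(readings, p=0.5, n_readings=1):
    """Report True if at least n_readings readings exceed the threshold p."""
    hits = 0
    for difficult, has_quality in readings:
        on_q, on_qc = overlap_proportions(difficult, has_quality)
        if on_q > p or on_qc > p:
            hits += 1
    return hits >= n_readings


readings = [
    ({0, 2}, [True, False, True, False]),  # both words with Q are difficult
    (set(), [True, True, False]),          # no difficult words at all
]
print(is_struggling(readings, p=0.5, n_readings=1))  # True
print(is_struggling(readings, p=0.5, n_readings=2))  # False
```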

If $Q$ is discrete but not binary-valued (e.g. number of letters), we modify this approach. Suppose $Q$ takes on values $\{q_1,\dots,q_k\}$. Then we report each $q_m$ for which, in at least $n_{\text{readings}}$ readings $i$,
$$
\frac{|D_i\cap Q^{(m)}_i|}{|Q^{(m)}_i|}>p\quad\text{or}\quad\frac{|D_i\cap Q_i^{(m)c}|}{|Q_i^{(m)c}|}>p,
$$
where $Q^{(m)}_i$ is the set of words in reading $i$ whose value for quality $Q$ is less than $q_m$.
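
For a single reading, the discrete case might look like the sketch below. The name `flagged_values` is hypothetical, and skipping empty sets rather than dividing by zero is an assumption; counting across readings would still be needed to apply the threshold on the number of readings.

```python
def flagged_values(difficult, values, p=0.5):
    """Return each value q_m whose split of the words exceeds the threshold p.

    difficult: iterable of word positions the reader struggled on
    values: list with the value of quality Q for each word (e.g. word length)
    """
    d = set(difficult)
    everything = set(range(len(values)))
    flagged = []
    for q_m in sorted(set(values)):
        below = {i for i, v in enumerate(values) if v < q_m}  # words with value < q_m
        rest = everything - below                             # the complement
        # Assumption: an empty side of the split is skipped.
        proportions = [len(d & s) / len(s) for s in (below, rest) if s]
        if any(x > p for x in proportions):
            flagged.append(q_m)
    return flagged


word_lengths = [3, 3, 2, 4, 3, 7, 8]
difficult_positions = {5, 6}  # the reader struggled on the two longest words
print(flagged_values(difficult_positions, word_lengths, p=0.5))
# [4, 7, 8]
```

Here the cut at 4 is flagged because two of the three words of length 4 or more are difficult, while none of the shorter words are.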

If $Q$ is continuous and ranges over $[a,b]$, we repeat the above process with a fixed partition of $[a,b]$, using the partition points in place of the values $q_m$.
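
One possible reading of the continuous case, for a single reading, is sketched here. The equal-width partition and the names `partition_cut_points` and `flagged_cut_points` are assumptions made for illustration; any fixed partition would serve.

```python
def partition_cut_points(a, b, k):
    """Split [a, b] into k equal-width bins and return their upper edges."""
    width = (b - a) / k
    return [a + width * m for m in range(1, k + 1)]


def flagged_cut_points(difficult, values, cuts, p=0.5):
    """Return each cut point whose split of the words exceeds the threshold p."""
    d = set(difficult)
    everything = set(range(len(values)))
    flagged = []
    for cut in cuts:
        below = {i for i, v in enumerate(values) if v < cut}
        rest = everything - below
        # Assumption: an empty side of the split is skipped.
        if any(len(d & s) / len(s) > p for s in (below, rest) if s):
            flagged.append(cut)
    return flagged


cuts = partition_cut_points(0.0, 1.0, 4)
scores = [0.1, 0.2, 0.6, 0.9]  # a hypothetical continuous quality per word
print(cuts)                                            # [0.25, 0.5, 0.75, 1.0]
print(flagged_cut_points({2, 3}, scores, cuts, p=0.5))  # [0.25, 0.5, 0.75]
```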

In [1]:
THRESHOLD_PAUSE_DURATION = 25
THRESHOLD_NUMBER_OF_DIFFICULT_READINGS = 1
THRESHOLD_OVERLAP_PROPORTION = .5

# Do gold readers pause for noticeably different lengths of time?
THRESHOLD_PAUSE_DURATION_FOR_GOLD_READINGS = 5

In [2]:
import sys, os, json
import pandas as pd
import identify_difficulties_util as IDU

In [3]:
TEXT_ANALYSIS_MATRIX_1 = '../output/text-analysis-matrix/20200822_330.tsv'
TEXT_ANALYSIS_MATRIX_2 = '../output/text-analysis-matrix/20200822_2201.tsv'
TEXT_ANALYSIS_MATRIX_3 = '../output/text-analysis-matrix/20200822_2202.tsv'

PERFORMANCE_MATRIX_1 = '../output/performance-matrix/330_child_updated_20200823/30908.tsv'
PERFORMANCE_MATRIX_2 = '../output/performance-matrix/2201_child_updated_20200823/30908.tsv'
PERFORMANCE_MATRIX_3 = '../output/performance-matrix/2202_child_updated_20200823/30908.tsv'

GOLD_PERFORMANCE_MATRIX_DIR_1 = '../output/performance-matrix/330_gold_updated_20200823/'
GOLD_PERFORMANCE_MATRIX_DIR_2 = '../output/performance-matrix/2201_gold_updated_20200823/'
GOLD_PERFORMANCE_MATRIX_DIR_3 = '../output/performance-matrix/2202_gold_updated_20200823/'

In [4]:
text_analysis_df_1 = pd.read_csv(TEXT_ANALYSIS_MATRIX_1, sep='\t')
text_analysis_df_2 = pd.read_csv(TEXT_ANALYSIS_MATRIX_2, sep='\t')
text_analysis_df_3 = pd.read_csv(TEXT_ANALYSIS_MATRIX_3, sep='\t')

performance_df_1 = pd.read_csv(PERFORMANCE_MATRIX_1, sep='\t')
performance_df_2 = pd.read_csv(PERFORMANCE_MATRIX_2, sep='\t')
performance_df_3 = pd.read_csv(PERFORMANCE_MATRIX_3, sep='\t')

In [5]:
def find_expected_pauses_from_dir(directory):
    """Collect the row indices at which any gold reader in `directory` pauses
    for at least THRESHOLD_PAUSE_DURATION_FOR_GOLD_READINGS frames."""
    rv = set()
    for filename in os.listdir(directory):
        gold_df = pd.read_csv(os.path.join(directory, filename), sep='\t')
        for idx, nframes in enumerate(gold_df.nframes):
            # Keep only even-indexed rows (pauses).
            # Not strictly necessary, but maybe this is good to have.
            if idx % 2 == 0 and nframes >= THRESHOLD_PAUSE_DURATION_FOR_GOLD_READINGS:
                rv.add(idx)
    return sorted(rv)

In [6]:
gold_expected_pauses_1 = find_expected_pauses_from_dir(GOLD_PERFORMANCE_MATRIX_DIR_1)
gold_expected_pauses_2 = find_expected_pauses_from_dir(GOLD_PERFORMANCE_MATRIX_DIR_2)
gold_expected_pauses_3 = find_expected_pauses_from_dir(GOLD_PERFORMANCE_MATRIX_DIR_3)

In [7]:
TEST_difficult_words = IDU.find_problem_words_given_df_naive(
    performance_df_1,
    THRESHOLD_PAUSE_DURATION,
    gold_expected_pauses_1
)

In [8]:
text_analysis_df_1

Unnamed: 0.1,Unnamed: 0,word_is_start_of_line,word_is_end_of_line,word_is_start_of_paragraph,word_is_end_of_paragraph,word_contains_punctuation,word_is_start_of_sentence,word_is_end_of_sentence,word_length,CMU_length,...,lts_8,lts_9,lts_10,lts_11,lts_12,n_morphs,is_decodable_at_grade_level,sightword_pp,sightword_p,sightword_1
0,sam,True,False,True,False,False,True,False,3,3,...,False,False,False,False,False,1,True,False,False,False
1,and,False,False,False,False,False,False,False,3,3,...,False,False,False,False,False,1,True,True,True,True
2,jo,False,False,False,False,False,False,False,2,2,...,False,False,False,False,False,1,False,False,False,False
3,went,False,False,False,False,False,False,False,4,4,...,False,False,False,False,False,1,True,False,True,True
4,for,False,False,False,False,False,False,False,3,3,...,True,False,False,False,False,1,True,True,True,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
84,were,False,False,False,False,False,False,False,4,2,...,False,False,False,False,False,1,False,False,True,True
85,safe,False,False,False,False,False,False,False,4,3,...,False,False,False,False,False,1,True,False,False,False
86,with,False,True,False,False,False,False,False,4,3,...,False,False,False,False,False,1,True,True,True,True
87,their,True,False,False,False,False,False,False,5,3,...,False,False,False,False,False,1,False,False,False,True


In [9]:
for i in TEST_difficult_words:
    print(performance_df_1['Unnamed: 0'][i], i, performance_df_1.token[i])

went 7 went
a 11 a
took 17 took
through 23 through
suddenly 29 nan
sam 31 nan
heard 33 heard
from 41 from
their 49 their
climbed 55 <pause>
up 57 up
to 59 to
the 65 the
was 69 was
found 73 found
noises 137 noises
stopped 139 stopped
and 157 and
with 173 with
mother 177 mother


In [10]:
TEST_proportions = IDU.compute_difficult_words_overlap_proportion(
    TEST_difficult_words,
    list(text_analysis_df_1.is_decodable_at_grade_level),
    mode='binary'
)

In [11]:
TEST_proportions

{'true_overlap_proportion': 0.2,
 'false_overlap_proportion': 0.24074074074074073}
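
To read this output against the threshold defined in the first cell, a check along these lines could be used. The values are copied from the output above; the variable names `exceeds` and `reading_shows_difficulty` are hypothetical.

```python
THRESHOLD_OVERLAP_PROPORTION = .5

TEST_proportions = {
    'true_overlap_proportion': 0.2,
    'false_overlap_proportion': 0.24074074074074073,
}

# Compare each proportion with the threshold p.
exceeds = {key: value > THRESHOLD_OVERLAP_PROPORTION
           for key, value in TEST_proportions.items()}
reading_shows_difficulty = any(exceeds.values())
print(reading_shows_difficulty)  # False
```

Neither proportion exceeds 0.5, so under the rule above this reading would not count toward reporting `is_decodable_at_grade_level` as a problem area.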