# Identifying difficulties in a single reading session using text-analysis and performance matrices

Inputs:
- Text-analysis matrix for a passage
- Performance matrix for session (three readings)

Output:
- Report of text-analysis fields where problems are detected

*Methodological hurdles*: 

1. Linguistic: How do we distinguish words on which a reader struggles?
2. Computational: How do we decide whether a reader struggles enough on a particular text-analysis field for us to consider it a "problem area"?

## Solution to linguistic hurdle (very naive)
Define
- $d$: threshold duration for a `<pause>` marker to count as a pause, say 25 (frames).

If a reader misses a word, we label that word as "difficult." If a reader pauses for $d$ frames or more before a word in which 

**Future consideration**: Readers may pause at the beginning of a small phrase (**NP**s or **PP**s) if the phrase contains a difficult word. (For instance, pausing before reading the phrase "*of the organization*".) To account for this, we can devise a way to incorporate constituency parses into the above methodology.
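
The naive labelling rule can be sketched as follows. This is only an illustration, assuming that a pause of $d$ frames or more before a word also marks that word as difficult. The `Token` record, its fields, and the `expected_pauses` parameter (positions where a pause is considered normal) are hypothetical stand-ins, not the performance-matrix schema or the `identify_difficulties_util` implementation.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Hypothetical per-word record; not the real performance-matrix schema."""
    word: str
    missed: bool       # the reader did not produce the word
    pause_before: int  # frames of silence immediately before the word


def find_difficult_words(tokens, d=25, expected_pauses=frozenset()):
    """Return positions of words that were missed or preceded by a long pause."""
    difficult = []
    for idx, token in enumerate(tokens):
        # A pause only counts if it is long enough and not at an expected position.
        long_pause = token.pause_before >= d and idx not in expected_pauses
        if token.missed or long_pause:
            difficult.append(idx)
    return difficult


tokens = [
    Token("members", missed=False, pause_before=3),
    Token("of", missed=False, pause_before=30),
    Token("the", missed=False, pause_before=0),
    Token("organization", missed=True, pause_before=4),
]
print(find_difficult_words(tokens))                       # [1, 3]
print(find_difficult_words(tokens, expected_pauses={1}))  # [3]
```

Note that the pause before "of" flags the word "of" itself, not "organization", which is exactly the limitation the future consideration above describes.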

## Solution to computational hurdle (less naive)
*Assume we can accurately identify words that a reader struggles on, and we wish to determine if the reader is struggling on words for a quality $Q$.*

Define 
- $D_i$: the set of words that the reader has trouble with in reading $i$
- $Q_i$: the set of words in reading $i$ with quality $Q$
- $n_{\text{readings}}$: the threshold number of readings in which a reader needs to show difficulty, say 1 or 2
- $p$: the threshold overlap proportion for words in a particular reading on which a reader needs to show difficulty, say 0.5.

If $Q$ is binary-valued (e.g. $Q=$ words that are decodable at the reader's grade level), the task reduces to the following.

If for at least $n_{\text{readings}}$ readings $i$ we observe that
$$
\frac{|D_i\cap Q_i|}{|Q_i|}>p\quad\text{or}\quad\frac{|D_i\cap Q_i^c|}{|Q_i^c|}>p,
$$
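then we report that the reader is struggling on quality $Q$. Here $Q_i^c$ denotes the complement of $Q_i$, i.e. the words in reading $i$ that do not have quality $Q$.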
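
A minimal sketch of the binary criterion, assuming words are represented by their positions in the passage. The function names are hypothetical, and treating the proportion over an empty set as 0 is a choice made here, not something the method above specifies.

```python
def overlap_proportion(difficult, subset):
    """|difficult ∩ subset| / |subset|, defined here as 0.0 for an empty subset."""
    return len(difficult & subset) / len(subset) if subset else 0.0


def struggles_on_binary_quality(readings, p=0.5, n_readings=1):
    """readings: list of (D_i, Q_i, all_words_i) triples of sets, one per reading."""
    hits = 0
    for difficult, with_q, all_words in readings:
        without_q = all_words - with_q  # Q_i^c
        if (overlap_proportion(difficult, with_q) > p
                or overlap_proportion(difficult, without_q) > p):
            hits += 1
    return hits >= n_readings


all_words = set(range(10))
with_q = {0, 1, 2, 3, 4}
readings = [
    ({0, 1, 2}, with_q, all_words),  # 3/5 = 0.6 > 0.5, so this reading counts
    ({0}, with_q, all_words),        # 1/5 = 0.2 and 0/5 = 0.0, so it does not
]
print(struggles_on_binary_quality(readings, n_readings=1))  # True
print(struggles_on_binary_quality(readings, n_readings=2))  # False
```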

If $Q$ is discrete but not binary-valued (e.g. number of letters), we modify this approach. Suppose $Q$ takes on values $\{q_1,\dots,q_k\}$. Then we report each $q_m$ for which
$$
\frac{|D_i\cap Q^{(m)}_i|}{|Q^{(m)}_i|}>p\quad\text{or}\quad\frac{|D_i\cap Q_i^{(m)c}|}{|Q_i^{(m)c}|}>p,
$$
where $Q^{(m)}_i$ is the set of words in reading $i$ whose value for quality $Q$ is less than $q_m$, and $Q_i^{(m)c}$ is its complement within reading $i$.
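
For a single reading, the discrete case might look like the sketch below. It assumes each observed value of $Q$ is used as a threshold $q_m$. The function names and the `word_lengths` data are made up for illustration.

```python
def overlap_proportion(difficult, subset):
    """|difficult ∩ subset| / |subset|, defined here as 0.0 for an empty subset."""
    return len(difficult & subset) / len(subset) if subset else 0.0


def flagged_thresholds(difficult, values, p=0.5):
    """values maps word position -> discrete value of Q for one reading.

    Returns every q_m for which the words below q_m, or the remaining words,
    overlap with the difficult words by more than p.
    """
    all_words = set(values)
    flagged = []
    for q_m in sorted(set(values.values())):
        below = {w for w, v in values.items() if v < q_m}  # Q_i^(m)
        rest = all_words - below                           # its complement
        if (overlap_proportion(difficult, below) > p
                or overlap_proportion(difficult, rest) > p):
            flagged.append(q_m)
    return flagged


word_lengths = {0: 2, 1: 3, 2: 5, 3: 7, 4: 8, 5: 3}  # illustrative data
difficult = {3, 4}
print(flagged_thresholds(difficult, word_lengths))  # [5, 7, 8]
```

In this toy reading the difficult words are the two longest ones, so only the thresholds that separate long words from short ones are reported.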

If $Q$ is continuous and takes values in $[a,b]$, we repeat the above process with a fixed partition of $[a,b]$.
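
One way to realize the continuous case is to use the interior points of the partition as the thresholds $q_m$. The sketch below assumes an equal-width partition into $k$ intervals, which is only one possible choice. All names and data are hypothetical.

```python
def partition_points(a, b, k):
    """Interior cut points of an equal-width partition of [a, b] into k intervals."""
    step = (b - a) / k
    return [a + j * step for j in range(1, k)]


def overlap_proportion(difficult, subset):
    """|difficult ∩ subset| / |subset|, defined here as 0.0 for an empty subset."""
    return len(difficult & subset) / len(subset) if subset else 0.0


def flagged_cut_points(difficult, values, a, b, k=4, p=0.5):
    """values maps word position -> continuous value of Q for one reading."""
    all_words = set(values)
    flagged = []
    for cut in partition_points(a, b, k):
        below = {w for w, v in values.items() if v < cut}
        rest = all_words - below
        if (overlap_proportion(difficult, below) > p
                or overlap_proportion(difficult, rest) > p):
            flagged.append(cut)
    return flagged


scores = {0: 0.1, 1: 0.2, 2: 0.55, 3: 0.8, 4: 0.9}  # illustrative data
difficult = {3, 4}
print(partition_points(0.0, 1.0, 4))                          # [0.25, 0.5, 0.75]
print(flagged_cut_points(difficult, scores, 0.0, 1.0, p=0.7))  # [0.75]
```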

In [1]:
THRESHOLD_PAUSE_DURATION = 25
THRESHOLD_NUMBER_OF_DIFFICULT_READINGS = 1
THRESHOLD_OVERLAP_PROPORTION = .25

# Do gold readers pause for noticeably different lengths of time?
THRESHOLD_PAUSE_DURATION_FOR_GOLD_READINGS = 5

In [2]:
import sys, os, json
import pandas as pd
import identify_difficulties_util as IDU

from collections import Counter

In [3]:
TEXT_ANALYSIS_MATRIX_1 = '../output/text-analysis-matrix/20200822_330.tsv'
TEXT_ANALYSIS_MATRIX_2 = '../output/text-analysis-matrix/20200822_2201.tsv'
TEXT_ANALYSIS_MATRIX_3 = '../output/text-analysis-matrix/20200822_2202.tsv'

PERFORMANCE_MATRIX_1 = '../output/performance-matrix/330_child_updated_20200823/30908.tsv'
PERFORMANCE_MATRIX_2 = '../output/performance-matrix/2201_child_updated_20200823/30908.tsv'
PERFORMANCE_MATRIX_3 = '../output/performance-matrix/2202_child_updated_20200823/30908.tsv'

GOLD_PERFORMANCE_MATRIX_DIR_1 = '../output/performance-matrix/330_gold_updated_20200823/'
GOLD_PERFORMANCE_MATRIX_DIR_2 = '../output/performance-matrix/2201_gold_updated_20200823/'
GOLD_PERFORMANCE_MATRIX_DIR_3 = '../output/performance-matrix/2202_gold_updated_20200823/'

In [4]:
text_analysis_df_1 = pd.read_csv(TEXT_ANALYSIS_MATRIX_1, sep='\t')
text_analysis_df_2 = pd.read_csv(TEXT_ANALYSIS_MATRIX_2, sep='\t')
text_analysis_df_3 = pd.read_csv(TEXT_ANALYSIS_MATRIX_3, sep='\t')

performance_df_1 = pd.read_csv(PERFORMANCE_MATRIX_1, sep='\t')
performance_df_2 = pd.read_csv(PERFORMANCE_MATRIX_2, sep='\t')
performance_df_3 = pd.read_csv(PERFORMANCE_MATRIX_3, sep='\t')

In [5]:
def find_expected_pauses_from_dir(directory):
    """Return the sorted row indices at which any gold reader pauses.

    A row counts as an expected pause if, in at least one gold performance
    matrix in `directory`, it lasts THRESHOLD_PAUSE_DURATION_FOR_GOLD_READINGS
    frames or more.
    """
    expected_pauses = set()
    for filename in os.listdir(directory):
        gold_df = pd.read_csv(os.path.join(directory, filename), sep='\t')
        for idx, nframes in enumerate(gold_df.nframes):
            # Filter for only pauses (even-indexed rows).
            # Not necessary, but maybe this is good to have.
            if idx % 2 == 0 and nframes >= THRESHOLD_PAUSE_DURATION_FOR_GOLD_READINGS:
                expected_pauses.add(idx)
    return sorted(expected_pauses)

In [6]:
gold_expected_pauses_1 = find_expected_pauses_from_dir(GOLD_PERFORMANCE_MATRIX_DIR_1)
gold_expected_pauses_2 = find_expected_pauses_from_dir(GOLD_PERFORMANCE_MATRIX_DIR_2)
gold_expected_pauses_3 = find_expected_pauses_from_dir(GOLD_PERFORMANCE_MATRIX_DIR_3)

In [7]:
assert list(text_analysis_df_1.columns) == list(text_analysis_df_2.columns) ==\
    list(text_analysis_df_3.columns)

In [8]:
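def infer_mode(col_1, col_2, col_3):
    """Classify a text-analysis column as 'binary', 'continuous', or 'discrete'."""
    values = set(col_1 + col_2 + col_3)
    # Binary only if both truth values occur across the three passages.
    if values == {True, False}:
        return 'binary'
    # Heuristic: a decimal point in the printed values is taken to mean floats.
    if '.' in str(values):
        return 'continuous'
    return 'discrete'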

In [9]:
kept_proportions_for_readings = []
for perf_df, gold_expected_pauses in zip(
    [performance_df_1, performance_df_2, performance_df_3],
    [gold_expected_pauses_1, gold_expected_pauses_2, gold_expected_pauses_3]):

    difficult_words = IDU.find_problem_words_given_df_naive(
        perf_df,
        THRESHOLD_PAUSE_DURATION,
        gold_expected_pauses
    )

    kept_proportions_for_current_reading = {}
    for column in text_analysis_df_1.columns[1:]:
        column_1 = list(text_analysis_df_1[column])
        column_2 = list(text_analysis_df_2[column])
        column_3 = list(text_analysis_df_3[column])

        mode = infer_mode(column_1, column_2, column_3)

        proportions = IDU.compute_difficult_words_overlap_proportion(
            difficult_words,
            column,
            column_1,
            column_2,
            column_3,
            mode=mode
        )

        kept_proportions = {
            proportion_name: proportion
            for proportion_name, proportion in proportions.items()
            if proportion > THRESHOLD_OVERLAP_PROPORTION
        }

        kept_proportions_for_current_reading = {
            **kept_proportions_for_current_reading, **kept_proportions # merge
        }

    kept_proportions_for_readings.append(kept_proportions_for_current_reading)

In [10]:
print('''REPORT\n
Threshold pause duration (child)       = {}
Threshold number of difficult readings = {}
Threshold overlap proportion           = {}
Threshold pause duration (gold)        = {}\n\n
**********\n
'''.format(
    THRESHOLD_PAUSE_DURATION,
    THRESHOLD_NUMBER_OF_DIFFICULT_READINGS,
    THRESHOLD_OVERLAP_PROPORTION,
    THRESHOLD_PAUSE_DURATION_FOR_GOLD_READINGS
))

proportion_counter = Counter()
for kept_proportions_for_reading in kept_proportions_for_readings:
    for proportion_name, proportion in kept_proportions_for_reading.items():
        proportion_counter[proportion_name] += 1

for proportion_name, proportion_count in proportion_counter.most_common():
    if proportion_count >= THRESHOLD_NUMBER_OF_DIFFICULT_READINGS:
        print('({} counts) {}'.format(proportion_count, proportion_name))

        for idx, dictionary in enumerate(kept_proportions_for_readings):
            # Only some readings exceed the threshold for this field.
            if proportion_name in dictionary:
                print('\tReading {}: {:.4f}'.format(idx + 1, dictionary[proportion_name]))
    print()

REPORT

Threshold pause duration (child)       = 25
Threshold number of difficult readings = 1
Threshold overlap proportion           = 0.25
Threshold pause duration (gold)        = 5


**********


(3 counts) lts_6_true_overlap_proportion
	Reading 1: 0.6000
	Reading 2: 0.4000
	Reading 3: 0.4000

(2 counts) word_is_end_of_paragraph_true_overlap_proportion
	Reading 1: 0.3333
	Reading 2: 0.3333

(2 counts) sightword_pp_true_overlap_proportion
	Reading 1: 0.2581
	Reading 2: 0.2581

(1 counts) word_is_start_of_line_false_overlap_proportion
	Reading 1: 0.2532

(1 counts) word_is_end_of_line_true_overlap_proportion
	Reading 1: 0.3000

(1 counts) word_length_more_than_5_proportion
	Reading 1: 0.2647

(1 counts) word_length_more_than_6_proportion
	Reading 1: 0.2727

(1 counts) word_length_more_than_7_proportion
	Reading 1: 0.3077

(1 counts) CMU_length_more_than_3_proportion
	Reading 1: 0.2623

(1 counts) lts_1_false_overlap_proportion
	Reading 1: 0.2564

(1 counts) lts_2_true_overlap_proporti