# Sentence categorization analysis
In order to improve the heuristic algorithm used to identify candidate relation phrases to embed in `abstract.py`, we need to get a sense of where this algorithm fails. Some sentence structures yield a candidate phrase made up of non-contiguous spans of the sentence, which means that the phrase cannot be directly matched back to the tokenization used for embedding. This is solvable by allowing disjoint spans when indexing into the sentence, but it points to a deeper problem: in the sentences that I've observed behaving this way, the candidate phrase includes a lot of text that isn't relevant to the relation, so I don't want to just program a workaround. Instead, I've had the code output the parse trees of the sentences where it fails and where it succeeds. This way, we can determine how to improve the algorithm based on sentence structure.
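To make the "non-contiguous span" failure concrete, here is a minimal sketch of the check involved. The function name `find_contiguous_span` and the example sentence are made up for illustration; this is not the code in `abstract.py`, and it assumes that both the sentence and the candidate phrase are already lists of tokens.

```python
def find_contiguous_span(sentence_tokens, phrase_tokens):
    """Return (start, end) so that sentence_tokens[start:end] == phrase_tokens,
    or None if the phrase is not one contiguous run of the sentence."""
    n, m = len(sentence_tokens), len(phrase_tokens)
    if m == 0:
        return None
    for start in range(n - m + 1):
        if sentence_tokens[start:start + m] == phrase_tokens:
            return start, start + m
    return None


# Hypothetical sentence, not taken from the dataset
sentence = "Fumigated plants developed basal disease resistance".split()

# A phrase built from adjacent tokens can be indexed directly
print(find_contiguous_span(sentence, ["developed", "basal"]))       # (2, 4)

# A phrase stitched together from separate parts of the sentence cannot
print(find_contiguous_span(sentence, ["developed", "resistance"]))  # None
```

A phrase that comes back as `None` here is the kind that would need disjoint-span indexing, which is the workaround I'd rather avoid.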

In [1]:
import json
import pandas as pd
import numpy as np
from collections import defaultdict

## Reading in data

In [2]:
with open('../data/distant_sup_output/all_docs_bugfix_21Feb2023_skipped_sentences.json') as f:
    skipped = json.load(f)
with open('../data/distant_sup_output/all_docs_bugfix_21Feb2023_success_sentences.json') as f:
    success = json.load(f)

In [12]:
# Convert to multiindex dataframes
skipped_tups = {(i,j): skipped[i][j] 
       for i in skipped.keys() 
       for j in skipped[i].keys()}

skipped_mux = pd.MultiIndex.from_tuples(skipped_tups.keys())
skipped_df = pd.DataFrame(list(skipped_tups.values()), index=skipped_mux)

success_tups = {(i,j): success[i][j] 
       for i in success.keys() 
       for j in success[i].keys()}

success_mux = pd.MultiIndex.from_tuples(success_tups.keys())
success_df = pd.DataFrame(list(success_tups.values()), index=success_mux)


In [14]:
skipped

{'PMID30076223_abstract': {'parse': ['(S (ADVP (RB Here)) (, ,) (NP (PRP we)) (VP (VBD explored) (NP (NP (NP (DT the) (JJ physiological) (NNS functions)) (PP (IN of) (NP (CD NO-LRB-2) (-RRB- -RRB-))) (PP (IN in) (NP (NN plant) (NNS cells)))) (VBG using) (NP (NN short-term) (NN fumigation))) (PP (IN of) (NP (NP (NNP Arabidopsis)) (-LRB- -LRB-) (NP (NNP Arabidopsis) (NNP thaliana)) (-RRB- -RRB-))) (PP (IN for) (NP (CD 1) (NN h))) (PP (IN with) (NP (NP (NP (NP (CD 10) (NN µL) (CD L-LRB--1) (-RRB- -RRB-) (CD NO-LRB-2)) (. .)) (-RRB- -RRB-)) (SBAR (IN Although) (S (NP (NN leaf) (NNS symptoms)) (VP (VBD were) (ADJP (JJ absent))))))) (, ,) (NP (NP (DT the) (NN expression)) (PP (IN of) (NP (NP (NNS genes)) (VP (VBN related) (PP (IN to) (NP (NN pathogen) (NN resistance))))))) (VP (VBD was) (VP (VBN induced)))) (. .))',
   '(S (NP (JJ Fumigated) (NNS plants)) (VP (VBD developed) (NP (NP (NP (NP (NML (JJ basal) (NN disease)) (NN resistance)) (, ,) (CC or) (NP (VBN pattern-triggered) (NN immunity)

In [13]:
skipped_df

|  |  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| PMID30076223_abstract | parse | (S (ADVP (RB Here)) (, ,) (NP (PRP we)) (VP (V... | (S (NP (JJ Fumigated) (NNS plants)) (VP (VBD d... | (S (PP (IN In) (NP (NN sum))) (, ,) (NP (JJ ex... |  |  |  |  |  |  |  |
| PMID30076223_abstract | phrase | explored of Arabidopsis ( Arabidopsis thaliana... | developed were both required for the full expr... | triggers , pointing to a possible role for end... |  |  |  |  |  |  |  |
| PMID33863060_abstract | parse | (S (NP (NP (DT These) (JJ same) (NNS elements)... | (S (S (NP (NP (JJ Few) (NNS receptors)) (PP (I... | (S (S (NP (NP (NN Induction)) (PP (IN of) (NP ... | (S (NP (NP (DT The) (ADJP (JJ separate) (, ,) ... | (S (NP (NP (NN Analysis)) (PP (IN of) (NP (DT ... | (S (NP (DT The) (JJ same) (NN complexity)) (VP... | (NP (NP (NN Phytoalexin) (NN accumulation)) (P... |  |  |  |
| PMID33863060_abstract | phrase |  | NO PHRASE | NO PHRASE | means that must be relayed to the genes by mea... |  |  | NO PHRASE |  |  |  |
| PMID7639774_abstract | parse | (S (S (NP (NP (DT The) (NML (NN signal) (NN tr... | (S (PP (IN In) (NP (DT this) (NN chapter))) (,... |  |  |  |  |  |  |  |  |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| PMID18992204_abstract | phrase | decreased while was raised in all treated groups |  |  |  |  |  |  |  |  |  |
| PMID33272793_abstract | parse | (S (NP (NNP H-LRB-2-RRB-S)) (VP (VBZ is) (VP (... | (S (NP (NP (JJ Exogenous) (NN application)) (P... |  |  |  |  |  |  |  |  |
| PMID33272793_abstract | phrase |  | facilitates in plants under normal and environ... |  |  |  |  |  |  |  |  |
| PMID26747288_abstract | parse | (S (NP (NP (NNS Studies)) (PP (IN of) (NP (NP ... | (S (NP (VBN Photo-induced) (NML (NML (JJ oxida... | (S (NP (DT This)) (VP (VP (VBD occurred) (PP (... | (S (NP (VBN OXI1-mediated) (NN -LRB-1-RRB-O2) ... | (S (PP (IN In) (NP (JJ high-light-stressed) (N... | (S (NP (PRP$ Our) (NNS results)) (VP (VBP show... |  |  |  |  |


## Preliminary exploration

### Basic numbers
First, let's take a look at how many sentences from each doc were skipped or successful.

In [3]:
skipped_num = {k: len(v) for k, v in skipped.items()}
success_num = {k: len(v) for k, v in success.items()}

nums_df = pd.DataFrame({'skipped':pd.Series(skipped_num),'success':pd.Series(success_num)})
nums_df.head()

|  | skipped | success |
|---|---|---|
| PMID30076223_abstract | 3 | 7 |
| PMID33863060_abstract | 7 | 10 |
| PMID7639774_abstract | 2 | 2 |
| PMID16663587_abstract | 2 | 1 |
| PMID31140930_abstract | 6 | 9 |


Looking at the entire dataframe, it actually looks like we're doing an okay job, with more than half of the sentences getting a candidate relation phrase in most docs:

In [4]:
f'We are able to get a candidate relation phrase for more than half of the sentences in {nums_df[nums_df["success"] > nums_df["skipped"]].shape[0]} of {nums_df.shape[0]} docs'

'We are able to get a candidate relation phrase for more than half of the sentences in 43 of 56 docs'

### Level exploration
Now, let's do a little exploring of the constituency parse levels. First, let's just look at how many levels each sentence has, and if that's different for successes vs skips.

In [5]:
skipped_tree_depths = []
for doc_key, sents in skipped.items():
    for sent_idx, sent_struct in sents.items():
        num_levels = len(sent_struct.keys())
        skipped_tree_depths.append(num_levels)
        
success_tree_depths = []
for doc_key, sents in success.items():
    for sent_idx, sent_struct in sents.items():
        num_levels = len(sent_struct.keys())
        success_tree_depths.append(num_levels)

In [6]:
for group, lens in {'skipped': skipped_tree_depths, 'success': success_tree_depths}.items():
    print(f'Summary of level depths for {group} sentences:')
    print('------------------------------------------------')
    print(f'Max depth: {max(lens)}')
    print(f'Min depth: {min(lens)}')
    print(f'Mean depth: {np.mean(lens)}')
    print()

Summary of level depths for skipped sentences:
------------------------------------------------
Max depth: 22
Min depth: 5
Mean depth: 10.58445945945946

Summary of level depths for success sentences:
------------------------------------------------
Max depth: 22
Min depth: 5
Mean depth: 10.678048780487805



First of all, 22 levels... dang! However, it looks like the difference between success and failure isn't a function of the depth of the tree: both groups have the same min and max, and their means are nearly identical (10.58 vs 10.68). It must have to do with some other facet of the tree. What about how many children are on each level?

In [7]:
skipped_child_nums = defaultdict(list)
for doc_key, doc in skipped.items():
    for sent_idx, sent in doc.items():
        for level, components in sent.items():
            skipped_child_nums[level].append(len(components))

success_child_nums = defaultdict(list)
for doc_key, doc in success.items():
    for sent_idx, sent in doc.items():
        for level, components in sent.items():
            success_child_nums[level].append(len(components))

In [8]:
# Collect summary stats into a dataframe
skipped_child_means = {k: np.mean(v) for k, v in skipped_child_nums.items()}
success_child_means = {k: np.mean(v) for k, v in success_child_nums.items()}
skipped_child_max = {k: max(v) for k, v in skipped_child_nums.items()}
success_child_max = {k: max(v) for k, v in success_child_nums.items()}
skipped_child_min = {k: min(v) for k, v in skipped_child_nums.items()}
success_child_min = {k: min(v) for k, v in success_child_nums.items()}

child_stats_df = pd.DataFrame({'skipped_mean':pd.Series(skipped_child_means),
                               'success_mean':pd.Series(success_child_means),
                               'skipped_max':pd.Series(skipped_child_max),
                               'success_max':pd.Series(success_child_max),
                               'skipped_min':pd.Series(skipped_child_min),
                               'success_min':pd.Series(success_child_min)})
child_stats_df.head()

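|  | skipped_mean | success_mean | skipped_max | success_max | skipped_min | success_min |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1 | 1 | 1 | 1 |
| 1 | 3.844595 | 3.634146 | 7 | 7 | 2 | 2 |
| 2 | 5.47973 | 5.27561 | 13 | 13 | 3 | 3 |
| 3 | 6.793919 | 5.94878 | 18 | 18 | 1 | 1 |
| 4 | 6.179054 | 5.72439 | 20 | 18 | 1 | 1 |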


The summary statistics here don't seem that different either, so it's more likely the identity of the constituency components that differentiates the two groups.
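As a first probe of label identity, one option is to ask what fraction of sentences in each group contain a given label at a given level. The sketch below is only that, a sketch: `label_fractions` is a hypothetical helper, and it assumes each sentence is a dict mapping a level (as a string) to the list of labels on that level, which is how the cells above iterate over `skipped` and `success`. The toy sentences are invented.

```python
from collections import Counter


def label_fractions(sentences, level):
    """Fraction of sentences that contain each label at `level`.

    sentences: iterable of dicts mapping level (str) -> list of labels
    level: int, the level to inspect
    """
    counts = Counter()
    n_sentences = 0
    for sent in sentences:
        n_sentences += 1
        # set() so that a label repeated within one sentence is counted once
        counts.update(set(sent.get(str(level), [])))
    if n_sentences == 0:
        return {}
    return {label: count / n_sentences for label, count in counts.items()}


# Toy data, made up for illustration
skipped_toy = [{'0': ['S'], '1': ['S', 'CC', 'S', '.']},
               {'0': ['S'], '1': ['NP', 'VP', '.']}]
print(label_fractions(skipped_toy, level=1))
```

On the real data, the sentences could be collected with something like `[sent for doc in skipped.values() for sent in doc.values()]`, and the two resulting dicts compared label by label.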

## Detailed analysis
Now, we need to do the trickier work of figuring out how to meaningfully examine the label identities at each level for skipped vs success sentences. We'll start at the topmost levels and work our way down, as the first few levels tend to be relatively easy to interpret and may help us differentiate the failure modes from the success modes.

### Convert json to df
To more easily manipulate the data, let's make a dataframe where the columns are level numbers, the indices are `doc_key_sent_idx`, and the values in the cells are the labels at that level in that sentence of that doc. It would make sense to use a multiindex with the doc key as the outer level, but a flat index is easier here: I really only want to look across all the sentences, and keeping the doc key in the index string preserves it in case I want it later.

In [9]:
# Flatten jsons, then make df
MAX_DEPTH = 22  # deepest tree observed in either group

def flatten_levels(docs, max_depth=MAX_DEPTH):
    """Map '<doc_key>_<sent_idx>' to a list of per-level label tuples."""
    flat = {}
    for doc_key, doc in docs.items():
        for sent_idx, sent in doc.items():
            ident = f'{doc_key}_{sent_idx}'
            # Levels that a sentence doesn't reach become empty tuples
            flat[ident] = [tuple(sent.get(f'{i}', ())) for i in range(max_depth)]
    return flat

level_cols = [f'{i}' for i in range(MAX_DEPTH)]

skipped_flat = flatten_levels(skipped)
skipped_df = pd.DataFrame.from_dict(skipped_flat, columns=level_cols, orient='index')

success_flat = flatten_levels(success)
success_df = pd.DataFrame.from_dict(success_flat, columns=level_cols, orient='index')

### Level 0

In [10]:
success_df['0'].unique()

array([('S',), ('SINV',)], dtype=object)

In [11]:
skipped_df['0'].unique()

array([('S',), ('NP',), ('SINV',)], dtype=object)

This is surprising, because I expected to see only `'S'` at the top level. Interestingly, sentences whose trees begin with `'NP'` are skipped -- but is this just one sentence or is this a pattern?

In [12]:
success_df['0'].value_counts()

(S,)       409
(SINV,)      1
Name: 0, dtype: int64

In [13]:
skipped_df['0'].value_counts() 

(S,)       283
(NP,)       12
(SINV,)      1
Name: 0, dtype: int64

It looks like the `'NP'` top level failure is a trend -- let's bookmark that as the first fix.
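If the sentences are available as bracketed parse strings like the `parse` entries shown earlier in this notebook, the top-level label can be read straight off the string, which would let us flag `'NP'`-rooted fragments before the heuristic runs. This is a rough sketch: `root_label` is a hypothetical helper, and the two example parses are shortened, made-up strings rather than entries from the data.

```python
def root_label(parse):
    """Return the label of the outermost constituent of a bracketed parse string."""
    stripped = parse.strip()
    if not stripped.startswith('('):
        raise ValueError('expected a bracketed parse string')
    # The root label runs from just after the first '(' up to the next
    # whitespace character or bracket
    chars = []
    for ch in stripped[1:]:
        if ch.isspace() or ch in '()':
            break
        chars.append(ch)
    return ''.join(chars)


# Shortened, made-up parses for illustration
fragment = '(NP (NP (NN Phytoalexin) (NN accumulation)))'
full_sentence = '(S (NP (PRP we)) (VP (VBD explored)))'

print(root_label(fragment))       # NP
print(root_label(full_sentence))  # S
```

Whether `'NP'`-rooted parses should be skipped early or handled some other way is still an open question; this only identifies them.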

### Level 1
What about the next level?

In [14]:
succ_types_1 = success_df['1'].value_counts()

In [15]:
skip_types_1 = skipped_df['1'].value_counts()

First, let's see how many overlap between the two categories. It's likely that even if they overlap, the parts of the tree below this level are different between the two groups to explain why some were skipped and some weren't -- we'll want to do a more comprehensive analysis of this later on.

In [16]:
print(f'There are {len(succ_types_1)} and {len(skip_types_1)} patterns in '
     'success and skipped, respectively.\n\n')
succ_diff = set(succ_types_1.index).difference(set(skip_types_1.index))
skip_diff = set(skip_types_1.index).difference(set(succ_types_1.index))
print(f'There are {len(skip_diff)} unique patterns in the skipped sentences, which are:')
print('--------------------------------------------------------------------')
print(skip_diff)
print(f'\n\nThere are {len(succ_diff)} unique patterns in the success sentences, which are:')
print('--------------------------------------------------------------------')
print(succ_diff)

There are 24 and 29 patterns in success and skipped, respectively.


There are 12 unique patterns in the skipped sentences, which are:
--------------------------------------------------------------------
{('S', 'CC', 'S', '.'), ('SBAR', ',', 'SBAR', ',', 'CC', 'SBAR', '.'), ('S', 'CC', ',', 'S', '.'), ('PP', ',', 'S', ',', 'CC', 'S', '.'), ('S', ':', 'S', '.'), ('PP', ',', 'NP', ':', 'S', '.'), ('NP', ':', 'S', '.'), ('NP', 'PP', '.'), ('NP', ':', 'S', ',', 'CC', 'S', '.'), ('S', ',', 'RB', 'S', '.'), ('ADVP', ',', 'S', ',', 'CC', 'S', '.'), ('S', ',', 'CC', 'S', '.')}


There are 7 unique patterns in the success sentences, which are:
--------------------------------------------------------------------
{('-RRB-', 'ADVP', ',', 'NP', 'VP', '.'), ('ADVP', ',', 'SBAR', 'VP', '.'), ('ADVP', ',', 'ADJP', ',', 'NP', 'VP', '.'), ('ADVP', 'NP', 'VP', '.'), (':', 'S', ',', 'NP', 'VP', '.'), ('ADVP', ',', 'NP', 'ADVP', 'VP', '.'), ('SBAR', ',', 'SBAR', 'VP', '.')}


Even on just the second level, this is starting to get visually/manually overwhelming; we need a better way to assess the differences between success and failure modes.

### Developing a way to analyze patterns
#### Brainstorming
First, let's hypothesize about what patterns might differentiate between success and failure, and then we can think about ways to identify those patterns. Some ideas are:
* The presence/absence of a certain type at a certain level
    * For example, maybe the following two trees would have different outcomes:
```
                S                               S
                |                               |
              ------                         -------
              |    |                         |     |
              NP   VP                       NP     VP
           ------  ------                -------   ------
           |    |  |    |                |     |   |  |  |
          ADJ   NN PP   VBZ             ADJ    NN  PP ,  VBZ
```
    * In this case, we would look for trees whose similarity diverges at some level
    * First attempt: walk down the levels starting with S, and see how far we can go down the trees while still maintaining groups of more than 1
        * Group sizes of 1 defeat the purpose of this exercise
    * It's important to note that we have lost some potentially important information by sacrificing the tree structure. We may want to go back, save out just the parse strings, and convert them to dictionaries within this notebook if we still want that data format.
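For reference, converting a saved parse string back into the per-level dictionary format could look roughly like the sketch below. `parse_to_levels` is a hypothetical helper, not the code that produced the JSON files. It assumes that every `(` in the string opens a constituent (literal brackets in the text being escaped as `-LRB-`/`-RRB-`, as in the parses above) and that part-of-speech tags count as labels on their level.

```python
from collections import defaultdict


def parse_to_levels(parse):
    """Convert a bracketed parse string into {level (str): [labels on that level]}."""
    levels = defaultdict(list)
    depth = -1
    expect_label = False
    # Pad the brackets so that they split into their own tokens
    tokens = parse.replace('(', ' ( ').replace(')', ' ) ').split()
    for tok in tokens:
        if tok == '(':
            depth += 1
            expect_label = True
        elif tok == ')':
            depth -= 1
            expect_label = False
        elif expect_label:
            # The first token after '(' is the constituent label
            levels[str(depth)].append(tok)
            expect_label = False
        # Any other token is a word of the sentence and is ignored
    return dict(levels)


# Made-up parse for illustration
example = '(S (NP (PRP we)) (VP (VBD explored) (NP (DT the) (NNS functions))) (. .))'
print(parse_to_levels(example))
```

Like the existing format, this flattens each level into a single list, so it still doesn't record which parent each label belongs to. Keeping the parse strings around is what leaves the door open to recovering that.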

#### First attempt

In [30]:
def get_levelgroups(df, depth=2):
    """
    Generates a representation of the different categories of structures down to
    depth.
    
    parameters:
        df, pandas df: rows are sentences, columns are the labels at each level
        depth, int: the number of levels to group on, i.e. levels 0 through
            depth - 1. A depth of 0 is treated like a depth of 1. Default is 2.
        
    returns:
        depth_df, pandas df: rows are the groups, with the last column as the count
            of sentences in the group
    """
    cols = [f'{i}' for i in range(depth)]
    if not cols:
        # depth=0 yields no columns, so fall back to the top level
        cols = ['0']
    depth_df = df.groupby(cols).size().reset_index().rename(columns={0:'count'})
    return depth_df

In [36]:
for depth in range(22):
    depth_df = get_levelgroups(skipped_df, depth=depth)
    print(f'\nThe largest group down to level {depth} is {depth_df["count"].max()}')


The largest group down to level 0 is 283

The largest group down to level 1 is 283

The largest group down to level 2 is 155

The largest group down to level 3 is 10

The largest group down to level 4 is 5

The largest group down to level 5 is 4

The largest group down to level 6 is 1

The largest group down to level 7 is 1

The largest group down to level 8 is 1

The largest group down to level 9 is 1

The largest group down to level 10 is 1

The largest group down to level 11 is 1

The largest group down to level 12 is 1

The largest group down to level 13 is 1

The largest group down to level 14 is 1

The largest group down to level 15 is 1

The largest group down to level 16 is 1

The largest group down to level 17 is 1

The largest group down to level 18 is 1

The largest group down to level 19 is 1

The largest group down to level 20 is 1

The largest group down to level 21 is 1


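It looks like this isn't going to be particularly helpful beyond level 5, since from level 6 on even the largest group contains a single sentence. (Levels 0 and 1 report the same number because `range(depth)` is empty for a depth of 0, which falls back to column `'0'`, and that is also the only column used for a depth of 1.) What happens if we change this to only look at the sets of labels, ignoring order and repeats?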

In [45]:
def get_levelgroups_agnostic(df, depth=2):
    """
    Generates a representation of the different categories of structures down to
    depth, ignoring the order of the labels as well as repeat labels.
    
    parameters:
        df, pandas df: rows are sentences, columns are the labels at each level
        depth, int: the number of levels to group on, i.e. levels 0 through
            depth - 1. A depth of 0 is treated like a depth of 1. Default is 2.
        
    returns:
        depth_df, pandas df: rows are the groups, with the last column as the count
            of sentences in the group
    """
    cols = [f'{i}' for i in range(depth)]
    if not cols:
        # depth=0 yields no columns, so fall back to the top level
        cols = ['0']
    # NOTE: this overwrites the columns of the caller's df in place, so df holds
    # frozensets instead of tuples afterwards. Pass df.copy() to avoid that.
    for col in cols:
        new_col = [frozenset(e) for e in df[col].values]
        df[col] = new_col
    depth_df = df.groupby(cols).size().reset_index().rename(columns={0:'count'})
    return depth_df

In [46]:
for depth in range(22):
    depth_df = get_levelgroups_agnostic(skipped_df, depth=depth)
    print(f'\nThe largest group down to level {depth} is {depth_df["count"].max()}')


The largest group down to level 0 is 283

The largest group down to level 1 is 283

The largest group down to level 2 is 155

The largest group down to level 3 is 10

The largest group down to level 4 is 5

The largest group down to level 5 is 4

The largest group down to level 6 is 1

The largest group down to level 7 is 1

The largest group down to level 8 is 1

The largest group down to level 9 is 1

The largest group down to level 10 is 1

The largest group down to level 11 is 1

The largest group down to level 12 is 1

The largest group down to level 13 is 1

The largest group down to level 14 is 1

The largest group down to level 15 is 1

The largest group down to level 16 is 1

The largest group down to level 17 is 1

The largest group down to level 18 is 1

The largest group down to level 19 is 1

The largest group down to level 20 is 1

The largest group down to level 21 is 1


The largest group at each depth is the same size whether we treat the labels as sets or as tuples with repeats and order, so ignoring order and repeats doesn't make the biggest groups any bigger. Now, let's take a look at the groupings for the top few levels, and see if we can find any obvious trends in skipped vs not.

In [71]:
skipped_lv2_groups = get_levelgroups(skipped_df, depth=2)
success_lv2_groups = get_levelgroups(success_df, depth=2)

In [72]:
skipped_lv2_groups.shape, success_lv2_groups.shape

((25, 3), (24, 3))

In [60]:
in_common = pd.merge(skipped_lv2_groups[['0', '1']], success_lv2_groups[['0', '1']], how='inner', on=['0', '1'])

In [62]:
print(f'There are {skipped_lv2_groups.shape[0]} groups in the skipped sentences, '
     f'and {success_lv2_groups.shape[0]} groups in the success sentences. '
     f'{in_common.shape[0]} of these groups are in common between the two categories.')

There are 25 groups in the skipped sentences, and 24 groups in the success sentences. 0 of these groups are in common between the two categories.


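NOTE: The overlap comparison is wrong: looking manually, there's at least one category with the same labels in both dataframes. A likely cause is that `get_levelgroups_agnostic` overwrote the columns of `skipped_df` with frozensets in place, while `success_df` still holds tuples; a tuple never compares equal to a frozenset, so the merge finds no matches. That would also explain why `skipped_lv2_groups` has only 25 rows even though level 1 alone showed 29 distinct patterns in the skipped sentences. Need to come back to refine this code to figure out how to get overlaps.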
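As a cross-check that doesn't go through the dataframes at all, the overlap can be counted directly from the nested dictionaries. The sketch below is one possible way to do it, not a fix for the merge above: `pattern_counts` and `shared_patterns` are hypothetical helpers, the toy sentences are invented, and each sentence is assumed to be a dict mapping a level (as a string) to its list of labels.

```python
from collections import Counter


def pattern_counts(sentences, depth=2, ignore_order=False):
    """Count sentences per pattern over levels 0 through depth - 1.

    sentences: iterable of dicts mapping level (str) -> list of labels
    ignore_order: if True, compare each level as a frozenset of labels
    """
    counts = Counter()
    for sent in sentences:
        key = []
        for i in range(depth):
            labels = sent.get(str(i), [])
            key.append(frozenset(labels) if ignore_order else tuple(labels))
        counts[tuple(key)] += 1
    return counts


def shared_patterns(skipped_counts, success_counts):
    """Map each pattern found in both groups to (n_skipped, n_success)."""
    shared = skipped_counts.keys() & success_counts.keys()
    return {p: (skipped_counts[p], success_counts[p]) for p in shared}


# Toy data, made up for illustration
skipped_toy = [{'0': ['S'], '1': ['NP', 'VP', '.']},
               {'0': ['NP'], '1': ['NP', 'PP', '.']}]
success_toy = [{'0': ['S'], '1': ['NP', 'VP', '.']},
               {'0': ['S'], '1': ['ADVP', 'NP', 'VP', '.']}]

print(shared_patterns(pattern_counts(skipped_toy), pattern_counts(success_toy)))

# Both sides have to use the same key type: a tuple and a frozenset with the
# same labels are never equal
print(('S',) == frozenset({'S'}))  # False
```

Because both groups go through the same function with the same `ignore_order` setting, the keys are guaranteed to be of the same type on both sides, which is exactly what the dataframe comparison above seems to have lost.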