# Sentence categorization analysis
To improve the heuristic algorithm used to identify candidate relation phrases to embed in `abstract.py`, we need to get a sense of where this algorithm fails. To do this, we've added a function that records which constituency parse components appear at each level of the tree for a given sentence, and we've kept track of both successes and failures. This way, we can determine how to improve the algorithm based on sentence structure.

In [1]:
import json
import pandas as pd
import numpy as np
from collections import defaultdict

## Reading in data

In [2]:
with open('../data/distant_sup_output/all_06Feb2023_skipped_sentence_cats.json') as f:
    skipped = json.load(f)
with open('../data/distant_sup_output/all_06Feb2023_success_sentence_cats.json') as f:
    success = json.load(f)

## Preliminary exploration

### Basic numbers
First, let's take a look at how many sentences from each doc were skipped or successful.

In [3]:
skipped_num = {k: len(v) for k, v in skipped.items()}
success_num = {k: len(v) for k, v in success.items()}

nums_df = pd.DataFrame({'skipped':pd.Series(skipped_num),'success':pd.Series(success_num)})
nums_df.head()

| doc | skipped | success |
|---|---|---|
| PMID30076223_abstract | 3 | 7 |
| PMID33863060_abstract | 7 | 10 |
| PMID7639774_abstract | 2 | 2 |
| PMID16663587_abstract | 2 | 1 |
| PMID31140930_abstract | 6 | 9 |


Looking at the entire dataframe, it actually looks like we're doing an okay job, with more than half the sentences yielding a candidate relation phrase in most docs:

In [4]:
f'We are able to get a candidate relation phrase for more than half of the sentences in {nums_df[nums_df["success"] > nums_df["skipped"]].shape[0]} of {nums_df.shape[0]} docs'

'We are able to get a candidate relation phrase for more than half of the sentences in 43 of 56 docs'

### Level exploration
Now, let's do a little exploring of the constituency parse levels. First, let's just look at how many levels each sentence has, and if that's different for successes vs skips.

In [5]:
skipped_tree_depths = []
for doc_key, sents in skipped.items():
    for sent_idx, sent_struct in sents.items():
        num_levels = len(sent_struct.keys())
        skipped_tree_depths.append(num_levels)
        
success_tree_depths = []
for doc_key, sents in success.items():
    for sent_idx, sent_struct in sents.items():
        num_levels = len(sent_struct.keys())
        success_tree_depths.append(num_levels)

In [6]:
for group, lens in {'skipped': skipped_tree_depths, 'success': success_tree_depths}.items():
    print(f'Summary of level depths for {group} sentences:')
    print('------------------------------------------------')
    print(f'Max depth: {max(lens)}')
    print(f'Min depth: {min(lens)}')
    print(f'Mean depth: {np.mean(lens)}')
    print()

Summary of level depths for skipped sentences:
------------------------------------------------
Max depth: 22
Min depth: 5
Mean depth: 10.58445945945946

Summary of level depths for success sentences:
------------------------------------------------
Max depth: 22
Min depth: 5
Mean depth: 10.678048780487805



First of all, 22 levels... dang! However, it looks like the difference between success and failure isn't a function of the depth of the tree -- it must have to do with some other facet of the tree. What about how many children are on each level?

In [7]:
skipped_child_nums = defaultdict(list)
for doc_key, doc in skipped.items():
    for sent_idx, sent in doc.items():
        for level, components in sent.items():
            skipped_child_nums[level].append(len(components))

success_child_nums = defaultdict(list)
for doc_key, doc in success.items():
    for sent_idx, sent in doc.items():
        for level, components in sent.items():
            success_child_nums[level].append(len(components))

In [8]:
# Collect summary stats into a dataframe
skipped_child_means = {k: np.mean(v) for k, v in skipped_child_nums.items()}
success_child_means = {k: np.mean(v) for k, v in success_child_nums.items()}
skipped_child_max = {k: max(v) for k, v in skipped_child_nums.items()}
success_child_max = {k: max(v) for k, v in success_child_nums.items()}
skipped_child_min = {k: min(v) for k, v in skipped_child_nums.items()}
success_child_min = {k: min(v) for k, v in success_child_nums.items()}

child_stats_df = pd.DataFrame({'skipped_mean':pd.Series(skipped_child_means),
                               'success_mean':pd.Series(success_child_means),
                               'skipped_max':pd.Series(skipped_child_max),
                               'success_max':pd.Series(success_child_max),
                               'skipped_min':pd.Series(skipped_child_min),
                               'success_min':pd.Series(success_child_min)})
child_stats_df.head()

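| level | skipped_mean | success_mean | skipped_max | success_max | skipped_min | success_min |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1 | 1 | 1 | 1 |
| 1 | 3.844595 | 3.634146 | 7 | 7 | 2 | 2 |
| 2 | 5.47973 | 5.27561 | 13 | 13 | 3 | 3 |
| 3 | 6.793919 | 5.94878 | 18 | 18 | 1 | 1 |
| 4 | 6.179054 | 5.72439 | 20 | 18 | 1 | 1 |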


The summary statistics here don't seem that different either; therefore, it must be the identity of the constituency components that differentiates the two groups.

## Detailed analysis
Now, we need to do the trickier work of figuring out how to meaningfully examine the identities at each level for skipped vs success. We'll start at the topmost levels and work our way down, as the first few levels tend to be relatively sensical, and may help us differentiate the failure modes from success modes.

### Convert json to df
To more easily manipulate the data, let's make a dataframe where the columns are level numbers, the indices are `doc_key_sent_idx`, and the values in the cells are lists of the labels at that level in that sentence of that doc. It would make sense to use a multiindex with the doc key as the outer level, but I think it's easiest to do it this way since I really only want to look at all the sentences, just maintaining the doc keys in case I want them later.

In [9]:
# Flatten jsons, then make df
skipped_flat = {}
for doc_key, doc in skipped.items():
    for sent_idx, sent in doc.items():
        ident = f'{doc_key}_{sent_idx}'
        vals = []
        for i in range(22):
            # Levels missing from a shallower tree become empty tuples
            vals.append(tuple(sent.get(f'{i}', ())))
        skipped_flat[ident] = vals     
skipped_df = pd.DataFrame.from_dict(skipped_flat, columns=[f'{i}' for i in range(22)], orient='index')

success_flat = {}
for doc_key, doc in success.items():
    for sent_idx, sent in doc.items():
        ident = f'{doc_key}_{sent_idx}'
        vals = []
        for i in range(22):
            # Levels missing from a shallower tree become empty tuples
            vals.append(tuple(sent.get(f'{i}', ())))
        success_flat[ident] = vals
success_df = pd.DataFrame.from_dict(success_flat, columns=[f'{i}' for i in range(22)], orient='index')
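
The two loops above do the same work, so here's a minimal sketch of a shared helper. It assumes the nested doc -> sentence -> level -> labels layout used in this notebook, with levels numbered consecutively from `'0'`. The name `flatten_sentence_cats` and the toy input are made up for illustration and aren't part of `abstract.py`.

```python
def flatten_sentence_cats(cats, max_depth=None):
    """Flatten {doc: {sent: {level: labels}}} to {'doc_sent': [tuple, ...]}.

    Assumes level keys are the strings '0', '1', ... with no gaps, so the
    number of keys in a sentence equals its tree depth.
    """
    if max_depth is None:
        # Derive the deepest tree from the data instead of hardcoding it
        max_depth = max(
            (len(sent) for doc in cats.values() for sent in doc.values()),
            default=0,
        )
    flat = {}
    for doc_key, doc in cats.items():
        for sent_idx, sent in doc.items():
            # Levels missing from a shallower tree become empty tuples
            flat[f'{doc_key}_{sent_idx}'] = [
                tuple(sent.get(str(level), ())) for level in range(max_depth)
            ]
    return flat


# Toy input (hypothetical, not taken from the real data)
toy = {
    'docA': {
        '0': {'0': ['S'], '1': ['NP', 'VP', '.']},
        '1': {'0': ['NP']},
    }
}
toy_flat = flatten_sentence_cats(toy)
print(toy_flat)
```

To keep the columns aligned between the two groups, the same `max_depth` would need to be passed for both `skipped` and `success` before handing the result to `pd.DataFrame.from_dict(..., orient='index')`.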

### Level 0

In [10]:
success_df['0'].unique()

array([('S',), ('SINV',)], dtype=object)

In [11]:
skipped_df['0'].unique()

array([('S',), ('NP',), ('SINV',)], dtype=object)

This is surprising, because I expected there only to be `'S'` in the first level. Interestingly, sentences whose trees begin with `'NP'` are skipped -- but is this just one sentence or is this a pattern?

In [12]:
success_df['0'].value_counts()

(S,)       409
(SINV,)      1
Name: 0, dtype: int64

In [13]:
skipped_df['0'].value_counts() 

(S,)       283
(NP,)       12
(SINV,)      1
Name: 0, dtype: int64

It looks like the `'NP'` top level failure is a trend -- let's bookmark that as the first fix.
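
To follow up on that bookmark, it would help to know which sentences are rooted at `'NP'`. Here's a rough sketch that works on the nested JSON layout directly; `sentences_with_root` is a hypothetical helper and the toy dict below is invented, so the identifiers it prints are not real results.

```python
def sentences_with_root(cats, root_label):
    """Return 'doc_key_sent_idx' identifiers whose level 0 is exactly [root_label]."""
    return [
        f'{doc_key}_{sent_idx}'
        for doc_key, doc in cats.items()
        for sent_idx, sent in doc.items()
        if list(sent.get('0', [])) == [root_label]
    ]


# Toy stand-in for the `skipped` dict (hypothetical values)
toy_skipped = {
    'docA': {
        '0': {'0': ['NP'], '1': ['NP', 'PP', '.']},
        '1': {'0': ['S'], '1': ['NP', 'VP', '.']},
    },
    'docB': {
        '3': {'0': ['NP'], '1': ['NP', ':', 'S', '.']},
    },
}
np_rooted = sentences_with_root(toy_skipped, 'NP')
print(np_rooted)
```

Calling it as `sentences_with_root(skipped, 'NP')` should give the identifiers to look up in the original abstracts.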

### Level 1
What about the next level?

In [14]:
succ_types_1 = success_df['1'].value_counts()

In [15]:
skip_types_1 = skipped_df['1'].value_counts()

First, let's see how many patterns overlap between the two categories. It's likely that even if they overlap, the parts of the tree below this level are different between the two groups to explain why some were skipped and some weren't -- we'll want to do a more comprehensive analysis of this later on.

In [29]:
print(f'There are {len(succ_types_1)} and {len(skip_types_1)} patterns in '
     'success and skipped, respectively.\n\n')
succ_diff = set(succ_types_1.index).difference(set(skip_types_1.index))
skip_diff = set(skip_types_1.index).difference(set(succ_types_1.index))
print(f'There are {len(skip_diff)} unique patterns in the skipped sentences, which are:')
print('--------------------------------------------------------------------')
print(skip_diff)
print(f'\n\nThere are {len(succ_diff)} unique patterns in the success sentences, which are:')
print('--------------------------------------------------------------------')
print(succ_diff)

There are 24 and 29 patterns in success and skipped, respectively.


There are 12 unique patterns in the skipped sentences, which are:
--------------------------------------------------------------------
{('NP', ':', 'S', ',', 'CC', 'S', '.'), ('S', ',', 'RB', 'S', '.'), ('S', ',', 'CC', 'S', '.'), ('PP', ',', 'NP', ':', 'S', '.'), ('NP', ':', 'S', '.'), ('NP', 'PP', '.'), ('S', 'CC', 'S', '.'), ('PP', ',', 'S', ',', 'CC', 'S', '.'), ('S', 'CC', ',', 'S', '.'), ('SBAR', ',', 'SBAR', ',', 'CC', 'SBAR', '.'), ('ADVP', ',', 'S', ',', 'CC', 'S', '.'), ('S', ':', 'S', '.')}


There are 7 unique patterns in the success sentences, which are:
--------------------------------------------------------------------
{(':', 'S', ',', 'NP', 'VP', '.'), ('ADVP', 'NP', 'VP', '.'), ('ADVP', ',', 'ADJP', ',', 'NP', 'VP', '.'), ('SBAR', ',', 'SBAR', 'VP', '.'), ('ADVP', ',', 'NP', 'ADVP', 'VP', '.'), ('ADVP', ',', 'SBAR', 'VP', '.'), ('-RRB-', 'ADVP', ',', 'NP', 'VP', '.')}


Even on just the second level, this is starting to get visually/manually overwhelming; we need a better way to assess the differences between success and failure modes.
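
One possible starting point, sketched under assumptions: instead of eyeballing set differences, count how often each pattern shows up in each group and rank patterns by the fraction of their sentences that were skipped. `pattern_skip_rates` is a hypothetical helper, and the patterns and counts below are toy values, not results from the data.

```python
from collections import Counter


def pattern_skip_rates(success_patterns, skipped_patterns):
    """Return (pattern, n_success, n_skipped, skip_rate) rows.

    Rows are sorted by skip rate (highest first), then by total count.
    """
    succ = Counter(success_patterns)
    skip = Counter(skipped_patterns)
    rows = []
    for pattern in set(succ) | set(skip):
        n_succ, n_skip = succ[pattern], skip[pattern]
        rows.append((pattern, n_succ, n_skip, n_skip / (n_succ + n_skip)))
    rows.sort(key=lambda r: (-r[3], -(r[1] + r[2])))
    return rows


# Toy level-1 patterns (hypothetical counts)
toy_success = [('NP', 'VP', '.')] * 3 + [('ADVP', 'NP', 'VP', '.')]
toy_skipped = [('NP', 'VP', '.')] + [('NP', ':', 'S', '.')] * 2

rates = pattern_skip_rates(toy_success, toy_skipped)
for pattern, n_succ, n_skip, rate in rates:
    print(f'{rate:.2f}  success={n_succ}  skipped={n_skip}  {pattern}')
```

Since the cells of `success_df` and `skipped_df` are tuples, passing `success_df['1']` and `skipped_df['1']` should work as-is. Patterns with a high skip rate and a high total count would be the ones worth looking at first.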