# Sentence categorization analysis
To improve the heuristic in `abstract.py` that identifies candidate relation phrases to embed, we need a sense of where it fails. Some sentence structures produce a candidate phrase made up of non-contiguous spans of the sentence, so the phrase cannot be matched directly back to the tokenization used for embedding. This is solvable by allowing disjoint spans when indexing into the sentence, but it points to a deeper problem: in the sentences where I've observed this, the candidate phrase includes a lot of text that isn't relevant to the relation, so I don't want to just program a workaround. Instead, I've had the code output the parse trees of the sentences where it fails and where it succeeds, so that we can decide how to improve the algorithm based on sentence structure.
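To make the disjoint-span problem concrete, here is a minimal sketch of checking whether a candidate phrase maps onto one continuous run of sentence tokens. The helper name `find_contiguous_span`, the example sentence, and the whitespace tokenization are all assumptions for illustration; `abstract.py` may tokenize and match differently.

```python
def find_contiguous_span(sentence_tokens, phrase_tokens):
    """Return (start, end) if phrase_tokens occur as one continuous run
    in sentence_tokens (end is exclusive), otherwise None."""
    n, m = len(sentence_tokens), len(phrase_tokens)
    if m == 0:
        return None
    for start in range(n - m + 1):
        if sentence_tokens[start:start + m] == phrase_tokens:
            return (start, start + m)
    return None


# Hypothetical sentence; whitespace tokenization is an assumption.
sentence = "We found that gene A regulates gene B".split()

# A phrase taken from one continuous stretch of the sentence can be indexed.
contiguous = find_contiguous_span(sentence, "regulates gene B".split())

# A phrase stitched together from separate stretches cannot.
disjoint = find_contiguous_span(sentence, "found that regulates".split())
```

A `None` result corresponds to the failure case described above: the phrase exists only as pieces scattered across the sentence.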

In [1]:
import json
import pandas as pd
import numpy as np
from collections import defaultdict
import jsonlines
import sys
sys.path.append('../distant_supervision_re/')
from abstract import Abstract



## Reading in data

In [2]:
with open('../data/distant_sup_output/TRAIN_only_PP_23Feb2023_skipped_sentences.json') as f:
    skipped = json.load(f)
with open('../data/distant_sup_output/TRAIN_only_PP_23Feb2023_success_sentences.json') as f:
    success = json.load(f)

In [3]:
# Convert to multiindex dataframes
skipped_tups = {(i,j): skipped[i][j] 
       for i in skipped.keys() 
       for j in skipped[i].keys()}

skipped_mux = pd.MultiIndex.from_tuples(skipped_tups.keys())
skipped_df = pd.DataFrame(list(skipped_tups.values()), index=skipped_mux).T

success_tups = {(i,j): success[i][j] 
       for i in success.keys() 
       for j in success[i].keys()}

success_mux = pd.MultiIndex.from_tuples(success_tups.keys())
success_df = pd.DataFrame(list(success_tups.values()), index=success_mux).T


In [4]:
skipped_df.head()

Unnamed: 0_level_0,PMID33863060_abstract,PMID33863060_abstract,PMID7639774_abstract,PMID7639774_abstract,PMID16663587_abstract,PMID16663587_abstract,PMID31140930_abstract,PMID31140930_abstract,PMID25914698_abstract,PMID25914698_abstract,...,PMID16169960_abstract,PMID16169960_abstract,PMID25596183_abstract,PMID25596183_abstract,PMID22214939_abstract,PMID22214939_abstract,PMID33232332_abstract,PMID33232332_abstract,PMID18992204_abstract,PMID18992204_abstract
Unnamed: 0_level_1,parse,phrase,parse,phrase,parse,phrase,parse,phrase,parse,phrase,...,parse,phrase,parse,phrase,parse,phrase,parse,phrase,parse,phrase
0,(S (NP (NP (DT These) (JJ same) (NNS elements)...,NO PHRASE: Multiple levels with kids,(S (S (NP (NP (DT The) (NML (NN signal) (NN tr...,NO PHRASE: Unknown cause AttributeError,(S (S (NP (NP (JJ Two-week-old) (JJ dwarf) (NN...,NO PHRASE: Unknown cause AttributeError,(S (S (VP (VBG Aiming) (S (VP (TO to) (ADVP (R...,NO PHRASE: Multiple levels with kids,"(S (PP (IN In) (NP (DT this) (NN study))) (, ,...",found that were induced after RKN infection,...,"(S (NP (NP (NNP S)) (, ,) (NP (NN S'-1,3-pheny...","inhibit , while JA biosynthesis inhibitors do ...",(S (NP (NP (NN Pathogen) (NN reproduction)) (P...,NO PHRASE: Multiple levels with kids,(S (NP (NP (VBN Herbivore-induced) (NN plant) ...,NO PHRASE: Multiple levels with kids,(S (S (VP (VBG Using) (NP (DT the) (NML (NN to...,NO PHRASE: Multiple levels with kids,(S (NP (NP (NP (NN Hemoglobin)) (-LRB- -LRB-) ...,decreased while was raised in all treated groups
1,(S (S (NP (NP (JJ Few) (NNS receptors)) (PP (I...,NO PHRASE: Unknown cause AttributeError,"(S (PP (IN In) (NP (DT this) (NN chapter))) (,...",NO PHRASE: Unknown cause AttributeError,(S (NP (CD GA-LRB-3) (-RRB- -RRB-)) (VP (VP (V...,NO PHRASE: Multiple levels with kids,"(S (NP (DT A) (JJ total) (CD 2,698) (NNS genes...",NO PHRASE: Multiple levels with kids,(S (NP (NP (NN Application)) (PP (IN of) (NP (...,NO PHRASE: Multiple levels with kids,...,(S (NP (VBN Sodium-nitroprusside-induced) (NN ...,NO PHRASE: Multiple levels with kids,(S (NP (PRP We)) (VP (VBD used) (NP (JJ deep) ...,used to investigate the physiology of plant an...,(S (NP (DT These) (NNS HIPVs)) (VP (VBP are) (...,NO PHRASE: Multiple levels with kids,"(S (ADVP (RB Meanwhile)) (, ,) (NP (NP (CD fiv...",NO PHRASE: Multiple levels with kids,,
2,(S (S (NP (NP (NN Induction)) (PP (IN of) (NP ...,NO PHRASE: Unknown cause AttributeError,,,,,(S (NP (NP (JJ Simultaneous) (NN impairment)) ...,NO PHRASE: Multiple levels with kids,(S (NP (NP (DT The) (JJ pharmacological) (NN i...,NO PHRASE: Multiple levels with kids,...,(S (NP (DT The) (NNS results)) (VP (VP (VBP de...,NO PHRASE: Multiple levels with kids,"(S (NP (NP (QP (IN Over) (CD 3,000)) (NN patho...",NO PHRASE: Multiple levels with kids,(S (PP (IN Over) (NP (DT the) (JJ past) (CD 3)...,NO PHRASE: Multiple levels with kids,(S (NP (DT The) (NNS results)) (VP (VBD indica...,indicated that alleviates the asymmetrical JA ...,,
3,"(S (NP (NP (DT The) (ADJP (JJ separate) (, ,) ...",means that must be relayed to the genes by mea...,,,,,(S (NP (DT This)) (VP (VBD indicated) (SBAR (I...,indicated that positively regulates root defen...,(S (NP (NP (NN Silencing)) (PP (IN of) (NP (NN...,"compromised , suggesting that the PI2 gene med...",...,,,"(S (ADVP (RB Intriguingly)) (, ,) (NP (JJ indi...",NO PHRASE: Multiple levels with kids,(S (PP (IN In) (NP (DT a) (JJ recent) (NN pape...,reported that induce the emissions of volatile...,"(S (ADVP (RB Moreover)) (, ,) (NP (PRP we)) (V...","demonstrate that , may be involved in a molecu...",,
4,(S (NP (NP (NN Analysis)) (PP (IN of) (NP (DT ...,NO PHRASE: Multiple levels with kids,,,,,(S (S (VP (VBN Taken) (ADVP (RB together)))) (...,suggest that is a highly programmed process in...,,,...,,,(S (NP (NP (JJ Fungal) (NNS genes)) (VP (VBG e...,NO PHRASE: Multiple levels with kids,"(S (PP (IN In) (NP (NN addition))) (, ,) (NP (...",NO PHRASE: Multiple levels with kids,,,,


## Preliminary exploration of parse trees

I'm going to print the parse trees next to the phrases that were pulled from them, partly as a sanity check on the dataframe (making sure that parse trees and phrases are correctly aligned), and partly to get the lay of the land.

In [5]:
# Make cells wider so the parse trees render better
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

In [6]:
doc_ids = skipped_df.columns.get_level_values(0).unique()
for doc_key in doc_ids:
    for i in skipped_df.index:
        parse = skipped_df.loc[i, (doc_key, 'parse')]
        phrase = skipped_df.loc[i, (doc_key, 'phrase')]
        if parse is not None:
            print('--------------------------------------------------------------------------')
            print('Parse tree:')
            Abstract.visualize_parse(parse)
            print(f'\nIdentified phrase: {phrase}')

[Output truncated in this export: one ASCII-art parse tree per skipped sentence, each followed by its `Identified phrase:` line, for example `Identified phrase: NO PHRASE: Multiple levels with kids`.]

In [21]:
unknown = 0
mult_sent = 0
mult_kids = 0
disjoint = 0
sbar = 0
df_empties = 0
nones = 0
sbar_parse_strs = []
for doc_key in doc_ids:
    for i in skipped_df.index:
        phrase = skipped_df.loc[i, (doc_key, 'phrase')]
        parse = skipped_df.loc[i, (doc_key, 'parse')]
        if phrase == 'NO PHRASE: Multiple levels with kids':
            mult_kids += 1
        elif phrase == 'NO PHRASE: Unknown cause AttributeError':
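            # Rough proxy for nested sentences: this counts every 'S' character,
            # including those inside labels such as SBAR and NNS
            if parse.count('S') > 2: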
                mult_sent += 1 
            unknown += 1
        elif phrase is not None:
            if 'SBAR' in parse:
                sbar += 1
                sbar_parse_strs.append(parse)
            disjoint += 1
        elif phrase is None and parse is None:
            df_empties += 1
        elif phrase is None and parse is not None:
            nones += 1
print(f'Unknown: {unknown}, Multiple nodes with kids: {mult_kids}, disjoint phrases: {disjoint}')
print(f'Common sense check to make sure the only Nones come from ragged df rows: df empties: {df_empties}, Nones: {nones}')
print(f'SBAR is present in the parses of {sbar} of the {disjoint} total disjoint phrases')
print(f'{mult_sent} of {unknown} total sentence with UNK failures have multiple S annotations')

Unknown: 43, Multiple nodes with kids: 105, disjoint phrases: 53
Common sense check to make sure the only Nones come from ragged df rows: df empties: 195, Nones: 0
SBAR is present in the parses of 43 of the 53 total disjoint phrases
41 of 43 total sentence with UNK failures have multiple S annotations
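Because `parse.count('S')` counts every capital `S` character, it also matches labels such as `SBAR` and `NNS` and any `S` inside the tokens themselves. A stricter count of `S` constituents could look like the sketch below. This is an assumed alternative, not what produced the numbers above, and `count_label` and the example parse are hypothetical.

```python
import re


def count_label(parse, label):
    """Count constituents with exactly this label in a bracketed parse string."""
    # Match an opening bracket, the label, then whitespace,
    # so that 'S' does not also match 'SBAR'.
    return len(re.findall(r'\(' + re.escape(label) + r'\s', parse))


# Hypothetical parse string with two sub-sentences joined by a conjunction.
example = ('(S (S (NP (NNS Seeds)) (VP (VBD grew))) (CC and) '
           '(S (NP (PRP they)) (VP (VBD died))))')

n_chars = example.count('S')       # character-level count used in the cell above
n_s = count_label(example, 'S')    # constituent-level count
```

With a count like this, `count_label(parse, 'S') > 1` would flag parses that contain at least one `S` besides the root.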


### Summary of findings
* Sentences that fail with the `Unknown` error are mainly sentences that have `S` annotations as children of the 0-level `S`
    * They fail because they have no `VP` at the first level
    * The only exceptions are fragments that are not complete sentences and have no `VP` at all; there are only two of these in our dataset, and we don't mind dropping them
    * The clear solution is to treat each sub-`S` as its own sentence; the one open question is how that integrates with keeping track of where entities are in relation to phrases (pending brainstorm)
* Most sentences with disjoint phrases (43 of 53) have an `SBAR` annotation (clause introduced by a subordinating conjunction)
    * In most cases, the `S`-labeled part of the sentence under the `SBAR` label is what contains the biological relationships that we're interested in
* One scenario that fails with a `Multiple levels with kids` error is where two `VP`s are connected by a coordinating conjunction (`CC`)
    * The problem with just treating each `VP` separately is that they tend to share a subject
    * However, in the examples I've looked at, there aren't biological relationships to identify within the sentence because the subject is too general
    * I do *not* expect that observation to generalize (`n=2` currently), but I think it might be worth dropping sentences that follow this pattern to see how it affects performance
* Another `Multiple levels with kids` scenario is when a sentence is introduced by a `PP`
    * Easily solvable by ignoring `PP`s at the first level, which I think is a safe thing to do
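The structural ideas in the list above (splitting on sub-`S` nodes, pulling the `S` out from under an `SBAR`, and ignoring first-level `PP`s) can be sketched on bracketed parse strings as follows. This uses a small hand-written reader rather than whatever `Abstract` does internally, all function names and both example parses are hypothetical, and it does not address the open question of tracking entity positions.

```python
def parse_brackets(text):
    """Read a Penn-style bracketed parse into nested (label, children) tuples.
    Leaves are plain strings."""
    tokens = text.replace('(', ' ( ').replace(')', ' ) ').split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == '('
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ')':
            if tokens[pos] == '(':
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read()


def leaves(tree):
    """Return the words under a node, left to right."""
    out = []
    for child in tree[1]:
        out.extend(leaves(child) if isinstance(child, tuple) else [child])
    return out


def top_level_labels(tree):
    """Labels of the root's direct children."""
    return [c[0] for c in tree[1] if isinstance(c, tuple)]


def labels_ignoring_pp(tree):
    """First-level labels with any PP dropped."""
    return [label for label in top_level_labels(tree) if label != 'PP']


def sub_sentences(tree):
    """S nodes that are direct children of the root, each usable as its own sentence."""
    return [c for c in tree[1] if isinstance(c, tuple) and c[0] == 'S']


def clauses_under_sbar(tree):
    """Every S that is a direct child of an SBAR node, anywhere in the tree."""
    found = []
    label, children = tree
    for child in children:
        if isinstance(child, tuple):
            if label == 'SBAR' and child[0] == 'S':
                found.append(child)
            found.extend(clauses_under_sbar(child))
    return found


# Hypothetical parses, not taken from the dataset.
pp_tree = parse_brackets(
    '(S (PP (IN In) (NP (DT this) (NN study))) (, ,) (NP (PRP we)) '
    '(VP (VBD found) (SBAR (IN that) (S (NP (NN JA)) '
    '(VP (VBZ induces) (NP (NN defense)))))))')
cc_tree = parse_brackets(
    '(S (S (NP (NN A)) (VP (VBZ grows))) (CC and) (S (NP (NN B)) (VP (VBZ dies))))')

sbar_text = [' '.join(leaves(s)) for s in clauses_under_sbar(pp_tree)]
split_text = [' '.join(leaves(s)) for s in sub_sentences(cc_tree)]
```

Keeping each rule as its own function makes it easy to switch one on at a time and rerun the skipped-sentence counts to see its effect.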

## After implementing SBAR solution
Implementing the `SBAR` solution increased the number of sentences that got dropped. Why??

In [23]:
with open('../data/distant_sup_output/TRAIN_only_SBAR_solution_23Feb2023_skipped_sentences.json') as f:
    skipped = json.load(f)
with open('../data/distant_sup_output/TRAIN_only_SBAR_solution_23Feb2023_success_sentences.json') as f:
    success = json.load(f)

In [24]:
# Convert to multiindex dataframes
skipped_tups = {(i,j): skipped[i][j] 
       for i in skipped.keys() 
       for j in skipped[i].keys()}

skipped_mux = pd.MultiIndex.from_tuples(skipped_tups.keys())
skipped_df = pd.DataFrame(list(skipped_tups.values()), index=skipped_mux).T

success_tups = {(i,j): success[i][j] 
       for i in success.keys() 
       for j in success[i].keys()}

success_mux = pd.MultiIndex.from_tuples(success_tups.keys())
success_df = pd.DataFrame(list(success_tups.values()), index=success_mux).T


In [26]:
incomplete = 0
mult_sent = 0
mult_kids = 0
disjoint = 0
sbar = 0
df_empties = 0
nones = 0
sbar_parse_strs = []
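# Recompute doc_ids from the reloaded dataframe instead of reusing the earlier one
doc_ids = skipped_df.columns.get_level_values(0).unique()
for doc_key in doc_ids: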
    for i in skipped_df.index:
        phrase = skipped_df.loc[i, (doc_key, 'phrase')]
        parse = skipped_df.loc[i, (doc_key, 'parse')]
        if phrase == 'NO PHRASE: Multiple levels with kids':
            mult_kids += 1
        elif phrase == 'NO PHRASE: Multiple nested sentence annotations':
            mult_sent += 1
        elif phrase == 'NO PHRASE: Incomplete sentence':
            incomplete += 1
        elif phrase is not None:
            if 'SBAR' in parse:
                sbar += 1
                sbar_parse_strs.append(parse)
            disjoint += 1
        elif phrase is None and parse is None:
            df_empties += 1
        elif phrase is None and parse is not None:
            nones += 1           
            
print(f'Nested S: {mult_sent}, Incomplete sentence: {incomplete}, Multiple nodes with kids: {mult_kids}, Disjoint phrases: {disjoint}')
print(f'Common sense check to make sure the only Nones come from ragged df rows: df empties: {df_empties}, Nones: {nones}')
print(f'SBAR is present in the parses of {sbar} of the {disjoint} total disjoint phrases')

Nested S: 45, Incomplete sentence: 2, Multiple nodes with kids: 111, Disjoint phrases: 66
Common sense check to make sure the only Nones come from ragged df rows: df empties: 172, Nones: 0
SBAR is present in the parses of 56 of the 66 total disjoint phrases


Previously there were only 53 disjoint phrases among the skipped sentences, so that number has gone up to 66. There are also more sentences with multiple levels with kids, and four more sentences with nested S labels. What on earth did I do??
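One way to find out would be to track which sentences changed category between the two runs, rather than only comparing totals. The sketch below assumes each run can be flattened into a dict keyed by `(doc_key, parse string)`, on the assumption that a sentence's parse is identical across runs, with sentences that are absent from the skipped output treated as successes. `categorize` and `category_shift` are hypothetical helpers, and the toy data is made up.

```python
from collections import Counter

PREFIX = 'NO PHRASE: '


def categorize(phrase):
    """Map a phrase string from the skipped output to a category name."""
    if phrase is None:
        return 'Success'  # assumption: not in the skipped output means it succeeded
    if phrase.startswith(PREFIX):
        return phrase[len(PREFIX):]
    return 'Disjoint phrase'


def category_shift(before, after):
    """Count (old category, new category) pairs for sentences whose category changed."""
    moves = Counter()
    for key in set(before) | set(after):
        old = categorize(before.get(key))
        new = categorize(after.get(key))
        if old != new:
            moves[(old, new)] += 1
    return moves


# Toy data: keys are (doc_key, parse string), values are the recorded phrase
before = {
    ('doc1', '(S a)'): 'NO PHRASE: Multiple levels with kids',
    ('doc1', '(S b)'): 'some disjoint phrase',
}
after = {
    ('doc1', '(S a)'): 'NO PHRASE: Multiple levels with kids',
    ('doc1', '(S b)'): 'NO PHRASE: Multiple nested sentence annotations',
    ('doc1', '(S c)'): 'another disjoint phrase',
}
moves = category_shift(before, after)
for (old, new), n in sorted(moves.items()):
    print(f'{old} -> {new}: {n}')
```

The pairs that show up most often would point at which branch of the heuristic the SBAR change is affecting.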

In [27]:
doc_ids = skipped_df.columns.get_level_values(0).unique()
for doc_key in doc_ids:
    for i in skipped_df.index:
        parse = skipped_df.loc[i, (doc_key, 'parse')]
        phrase = skipped_df.loc[i, (doc_key, 'phrase')]
        if parse is not None:
            print('--------------------------------------------------------------------------')
            print('Parse tree:')
            Abstract.visualize_parse(parse)
            print(f'\nIdentified phrase: {phrase}')

--------------------------------------------------------------------------
Parse tree:
                                                                                                                                                                                                                                                    
[Parse tree renderings truncated. Each fragment is rooted at `S`; the only first-level children still legible are `VP` and `PP`.]


Identified phrase: NO PHRASE: Multiple levels with kids
--------------------------------------------------------------------------
Parse tree:
                                                                                                                                                                                                                                                                    
                                                                                                                             |                                                                                                                                       
                                                                                                                             S                                                                                                                                      
       ________________________________________________________________

                                                                                                                                                                                                                                                                                                                                                                         
                                                                                                                                                                             |                                                                                                                                                                                            
                                                                                                                                                                             S                                                                                                     