# Subsentence splitting

Compound sentences are common in written text. They consist of several fairly self-contained parts, yet not all of those parts are equally informative.

Moreover, a text may contain many such long sentences, so:
+ we won't shorten it much if we only select whole sentences;
+ a lot of useful information might be lost, with no chance of getting it into the result.

Our algorithm therefore faces the following subtasks:
1. Check whether the sentence is compound and, if so, of which type.
2. If the sentence is compound, find its clauses.
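
The first subtask can be sketched on dependency labels alone, without running a parser. This is a minimal sketch and not the method used in the rest of this notebook: the function `sentence_type`, the label sets `COORDINATION` and `SUBORDINATION`, and the four type names are hypothetical, and the choice of labels is an assumption based on the label scheme of spaCy's English models. The sample triples are copied from the parse table shown later in this notebook.

```python
# Dependency labels treated as clause markers (an assumption, not an exhaustive list).
COORDINATION = {"conj", "parataxis"}
SUBORDINATION = {"ccomp", "advcl", "relcl", "csubj"}
VERBAL = {"VERB", "AUX"}


def sentence_type(tokens):
    """Classify one sentence given (text, pos, dep) triples for its tokens."""
    # Only verbal tokens can head a clause, so ignore e.g. coordinated nouns.
    deps = {dep for _, pos, dep in tokens if pos in VERBAL}
    coordinated = bool(deps & COORDINATION)
    subordinated = bool(deps & SUBORDINATION)
    if coordinated and subordinated:
        return "mixed"
    if coordinated:
        return "coordinated"
    if subordinated:
        return "subordinated"
    return "simple"


sample = [
    ("The", "DET", "det"), ("things", "NOUN", "nsubj"), ("are", "AUX", "ccomp"),
    ("quite", "ADV", "advmod"), ("complicated", "ADJ", "acomp"), (",", "PUNCT", "punct"),
    ("we", "PRON", "nsubj"), ("shall", "AUX", "aux"), ("move", "VERB", "ROOT"),
    ("on", "ADP", "prt"),
]
print(sentence_type(sample))  # subordinated
```

The sample comes out as "subordinated" only because the parser attached `are` to `move` as `ccomp`. A different parse of the same sentence would give a different type, so the result is only as reliable as the parse.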

***There is a library for the similar task of clause splitting which, unfortunately, does not work with spaCy 3.0 either:***  
https://awesomeopensource.com/project/mmxgn/spacy-clausie

***The following code is largely copied from the book [Python Natural Language Processing Cookbook by Zhenya Antić](https://subscription.packtpub.com/book/data/9781838987312/2/ch02lvl1sec13/splitting-sentences-into-clauses)***

In [97]:
from IPython.display import HTML, display

# Build one HTML table row per token of the first sentence, showing its
# index, part of speech, dependency label, ancestors and children.
text = "<tr><th>[text]</th><th>[idx]</th><th>[POS]</th><th>[DEP]</th><th>[ANCESTORS]</th><th>[CHILDREN]</th></tr>"
for token in sentences[0]:
    ancestors = [t.text for t in token.ancestors]
    children = [t.text for t in token.children]
    text += "<tr><td>" + "</td><td>".join([
        token.text,
        str(token.i),
        token.pos_,
        token.dep_,
        str(ancestors),
        str(children)
    ]) + "</td></tr>"

# Left-align the cells and render the table in the notebook.
style = '<style> td, th { text-align: left !important; }</style>'
display(HTML(style + '<table style="font-size: 14px;">' + text + "</table>"))

| [text] | [idx] | [POS] | [DEP] | [ANCESTORS] | [CHILDREN] |
|---|---|---|---|---|---|
| The | 0 | DET | det | ['things', 'are', 'move'] | [] |
| things | 1 | NOUN | nsubj | ['are', 'move'] | ['The'] |
| are | 2 | AUX | ccomp | ['move'] | ['things', 'complicated'] |
| quite | 3 | ADV | advmod | ['complicated', 'are', 'move'] | [] |
| complicated | 4 | ADJ | acomp | ['are', 'move'] | ['quite'] |
| , | 5 | PUNCT | punct | ['move'] | [] |
| we | 6 | PRON | nsubj | ['move'] | [] |
| shall | 7 | AUX | aux | ['move'] | [] |
| move | 8 | VERB | ROOT | [] | ['are', ',', 'we', 'shall', 'on', '.'] |
| on | 9 | ADP | prt | ['move'] | [] |


In [108]:
spacy.explain("ccomp")

'clausal complement'

In [13]:
the_sentence = sentences[7]  # this sentence seems to have just the right level of complexity

In [98]:
the_sentence = sentences[0]

In [99]:
displacy.render(the_sentence, style='dep')

In [100]:
# We will use the following function to find the root token of the sentence, 
# which is usually the main verb. 
# In instances where there is a dependent clause, 
# it is the verb of the independent clause:

def find_root_of_sentence(doc):
    root_token = None
    for token in doc:
        if (token.dep_ == "ROOT"):
            root_token = token
    return root_token

# We will now find the root token of the sentence:
root_token = find_root_of_sentence(the_sentence)
print(root_token)

move


In [102]:
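# We can now use the following function to find the other verbs in the sentence:

def find_other_verbs(doc, root_token):
    other_verbs = []
    for token in doc:
        ancestors = list(token.ancestors)
        # A verb whose only ancestor is the root heads a clause attached directly to the root.
        if (token.pos_ == "VERB") and (len(ancestors) == 1) and (ancestors[0] == root_token):
            other_verbs.append(token)
        # Auxiliaries are collected wherever they sit in the tree.
        # Note: with this branch active, "are" and "shall" would be collected for the
        # sample sentence, so the empty list recorded below appears to come from a run without it.
        elif (token.pos_ == "AUX"):
            other_verbs.append(token)
    return other_verbs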

# Use the preceding function to find the remaining verbs in the sentence:
other_verbs = find_other_verbs(the_sentence, root_token)
print(other_verbs)

[]
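
The empty list follows from the parse table: the only token tagged `VERB` is the root itself, and `are`, which heads the first clause, is tagged `AUX`. The sketch below replays the filter on plain tuples copied from that table, with head indices read off the ANCESTORS column. It is an illustration under assumptions rather than a fix: `root_dependents` is a hypothetical helper, and `CLAUSE_DEPS` is an assumed set of clause-level labels.

```python
# (index, text, POS, dependency label, head index), copied from the parse table.
ROWS = [
    (0, "The", "DET", "det", 1),
    (1, "things", "NOUN", "nsubj", 2),
    (2, "are", "AUX", "ccomp", 8),
    (3, "quite", "ADV", "advmod", 4),
    (4, "complicated", "ADJ", "acomp", 2),
    (5, ",", "PUNCT", "punct", 8),
    (6, "we", "PRON", "nsubj", 8),
    (7, "shall", "AUX", "aux", 8),
    (8, "move", "VERB", "ROOT", 8),
    (9, "on", "ADP", "prt", 8),
]
ROOT_INDEX = 8
CLAUSE_DEPS = {"ccomp", "advcl", "conj", "relcl"}  # assumed clause-level labels


def root_dependents(rows, root, pos_tags, deps=None):
    """Texts of direct dependents of `root` with a POS in `pos_tags`
    and, if `deps` is given, a dependency label in `deps`."""
    return [
        text
        for i, text, pos, dep, head in rows
        if head == root and i != root and pos in pos_tags
        and (deps is None or dep in deps)
    ]


print(root_dependents(ROWS, ROOT_INDEX, {"VERB"}))                      # []
print(root_dependents(ROWS, ROOT_INDEX, {"VERB", "AUX"}))               # ['are', 'shall']
print(root_dependents(ROWS, ROOT_INDEX, {"VERB", "AUX"}, CLAUSE_DEPS))  # ['are']
```

Admitting `AUX` tokens also picks up `shall`, which is only an auxiliary of `move`. Filtering on the dependency label as well leaves just `are`, the head of the first clause.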


In [103]:
# We will use the following function to find the token spans for each verb:

def get_clause_token_span_for_verb(verb, the_sentence, all_verbs):
    """Return (first, last) token indices among the verb's direct children.

    Children that are themselves in all_verbs are skipped. A verb with no
    remaining children yields (len(the_sentence), 0), i.e. first > last.
    """
    first_token_index = len(the_sentence)
    last_token_index = 0
    this_verb_children = list(verb.children)
    for child in this_verb_children:
        if (child not in all_verbs):
            if (child.i < first_token_index):
                first_token_index = child.i
            if (child.i > last_token_index):
                last_token_index = child.i
    return first_token_index, last_token_index

In [104]:
# We will put all the verbs in one list and process each with the preceding function.
# This returns a tuple of start and end indices for each verb's clause:

token_spans = []
all_verbs = [root_token] + other_verbs
for verb in all_verbs:
    token_spans.append(
        get_clause_token_span_for_verb(verb, the_sentence, all_verbs))

In [105]:
all_verbs

[move]

In [106]:
token_spans

[(2, 10)]
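
To see where `(2, 10)` comes from, the min/max logic can be replayed by hand on the children of `move` listed in the parse table. This is a worked sketch: `clause_span` is a hypothetical stand-in that takes plain indices instead of spaCy tokens. The final period is assumed to be token 10 and the sentence length 11, since the table stops at index 9.

```python
def clause_span(child_indices, sentence_length):
    """Smallest and largest index among a verb's direct children."""
    first, last = sentence_length, 0
    for i in child_indices:
        first = min(first, i)
        last = max(last, i)
    return first, last


# Children of "move": are(2), ","(5), we(6), shall(7), on(9), "."(10, assumed).
move_children = [2, 5, 6, 7, 9, 10]
print(clause_span(move_children, 11))  # (2, 10)

# A verb without children keeps the initial values, so first > last.
print(clause_span([], 11))  # (11, 0)
```

The span starts at 2 because only direct children are considered: `The` and `things` depend on `are`, not on `move`, so they never enter the computation.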

In [107]:
# Using the start and end indices, we can now put together token spans for each clause.
# We sort the sentence_clauses list at the end so that the clauses are
# in the order they appear in the sentence:

sentence_clauses = []
for token_span in token_spans:
    start = token_span[0]
    end = token_span[1]
    # Skip verbs without children (start > end) and single-token spans.
    if (start < end):
        # Note: slicing is end-exclusive, so the token at index `end` is left out.
        clause = the_sentence[start:end]
        sentence_clauses.append(clause)
sentence_clauses = sorted(sentence_clauses, key=lambda clause: clause.start)

# Now, we can print the final result of the processing for our initial sentence

clauses_text = [clause.text for clause in sentence_clauses]
print(clauses_text)

# The result is as follows:

['are quite complicated, we shall move on']
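
The recorded result keeps both clauses in one span and drops `The things`. One possible alternative, which is not the book's approach, is to take the whole subtree of the `ccomp` token as one clause and the remaining non-punctuation tokens as the other. The sketch below does this on plain lists copied from the parse table. The helper `subtree` is hypothetical (in spaCy the corresponding attribute is `Token.subtree`), and the final period with label `punct` is an assumption, since the table stops at index 9.

```python
WORDS = ["The", "things", "are", "quite", "complicated", ",",
         "we", "shall", "move", "on", "."]
# Head index of each token, read off the ANCESTORS column; the root points to itself.
HEADS = [1, 2, 8, 4, 2, 8, 8, 8, 8, 8, 8]
DEPS = ["det", "nsubj", "ccomp", "advmod", "acomp", "punct",
        "nsubj", "aux", "ROOT", "prt", "punct"]


def subtree(heads, top):
    """Indices of `top` and of every token that depends on it, directly or not."""
    members = {top}
    changed = True
    while changed:
        changed = False
        for i, head in enumerate(heads):
            if head in members and i not in members:
                members.add(i)
                changed = True
    return sorted(members)


subordinate = subtree(HEADS, DEPS.index("ccomp"))
main = [i for i in range(len(WORDS))
        if i not in subordinate and DEPS[i] != "punct"]

subordinate_text = " ".join(WORDS[i] for i in subordinate)
main_text = " ".join(WORDS[i] for i in main)
print(subordinate_text)  # The things are quite complicated
print(main_text)         # we shall move on
```

This handles only a single `ccomp` clause. Sentences with several or nested clauses would need the same step applied to each clause head.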

