### Sentence Segmentation:

Now let's see how to split a document into individual sentences. This is useful when working with NLP algorithms, because many of them rely on sentence-level context to work out the meaning of a piece of text.

Let's jump into the implementation.

In [1]:
import spacy

In [2]:
nlp = spacy.load('en_core_web_sm') # load the small English language model

In [3]:
doc = nlp(u"This is the first sentence. This is another sentence. This is the last sentence")

In [4]:
for sent in doc.sents:
    print(sent) # prints every single sentence inside our doc object

This is the first sentence.
This is another sentence.
This is the last sentence


In [5]:
# Keep in mind that doc.sents is a generator, so trying to index it raises an error
# (TypeError: 'generator' object is not subscriptable).

# doc.sents[0]
# It generates the sentences one at a time instead of holding them all in memory.

In [6]:
# If you want to access a sentence by index, build a list from the doc.sents generator and index into that list.

list(doc.sents)[0]

# There you go: here's our first sentence, at index 0.

This is the first sentence.
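
To make the generator behaviour concrete without loading a model, here is a minimal sketch that uses a plain Python generator as a stand-in for `doc.sents`. The `fake_sents` helper is hypothetical and not part of spaCy; it only mimics the "one sentence at a time" behaviour described above.

```python
from itertools import islice

def fake_sents():
    """Hypothetical stand-in for doc.sents: yields sentences one at a time."""
    yield "This is the first sentence."
    yield "This is another sentence."
    yield "This is the last sentence"

# Subscripting a generator fails, just like doc.sents[0] would.
try:
    fake_sents()[0]
    indexing_error = None
except TypeError as err:
    indexing_error = err

# Option 1: materialise everything, then index (simple, but uses more memory).
first = list(fake_sents())[0]

# Option 2: skip ahead to index 1 and pull a single item, without building a full list.
second = next(islice(fake_sents(), 1, None))

print(type(indexing_error).__name__)
print(first)
print(second)
```

For short documents `list(doc.sents)` is the simplest choice; the `islice` variant is mainly worth it when a document is long and only a few sentences are needed.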

In [7]:
# Let's consider another example.

doc = nlp(u'"Management is doing the right things; leadership is doing the right things." -Peter Ducker')

# Notice the ';' in the middle of the quote. Let's see how spaCy segments this text!

In [8]:
# first let's check out the text of the doc object!

doc.text

'"Management is doing the right things; leadership is doing the right things." -Peter Ducker'

In [9]:
for sent in doc.sents:
    print(sent)
    print('\n')
    
# spaCy has split the quote and the attribution into two separate sentences, but it did not split on ';'

"Management is doing the right things; leadership is doing the right things."


-Peter Ducker




### Adding Segmentation Rules:

In [10]:
# Let's add a new rule to the spaCy pipeline.

# ADD A SEGMENTATION RULE
# Every token in a doc has its own index, which is available through the token.i attribute.
def test_index(doc):
    for token in doc:
        print(token.i)

In [11]:
test_index(doc)

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17


In [22]:
# now let's segment the sentences on ';'
def set_custom_boundries(doc):
    for token in doc[:-1]: # skip the last token so that doc[token.i + 1] never goes out of range
        if token.text == ';':
            doc[token.i + 1].is_sent_start = True # the token right after ';' starts a new sentence
            
    return doc

In [23]:
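# Add our function to the pipeline before the parser (spaCy v2 syntax, matching the rest of this notebook;
# spaCy v3 instead expects a component registered with @Language.component and added by its string name).
nlp.add_pipe(set_custom_boundries, before = 'parser')

nlp.pipe_names # make sure the new component is listed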

In [None]:
doc4 = nlp(u'"Management is doing the right things; leadership is doing the right things." -Peter Ducker')

for sent in doc4.sents:
    print(sent) # Now it separates sentences on ';'
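
The boundary rule itself can be traced without spaCy. The following is a simplified sketch on a plain list of token strings: `mark_boundaries` and `group_sentences` are hypothetical helpers, and the flags list stands in for the `is_sent_start` attribute. It only models the custom ';' rule, not the additional boundaries the parser would add in the real pipeline.

```python
def mark_boundaries(tokens, delimiter=";"):
    """Return one sentence-start flag per token; True right after each delimiter."""
    flags = [False] * len(tokens)
    if tokens:
        flags[0] = True  # the first token always opens a sentence
    # Skip the last token: there is no "next" token to mark after it.
    for i, text in enumerate(tokens[:-1]):
        if text == delimiter:
            flags[i + 1] = True
    return flags

def group_sentences(tokens, flags):
    """Group tokens into sentences, opening a new one at every True flag."""
    sentences = []
    for text, starts in zip(tokens, flags):
        if starts or not sentences:
            sentences.append([])
        sentences[-1].append(text)
    return sentences

# Assumed tokenisation of the quote (quotation marks and attribution left out).
tokens = ["Management", "is", "doing", "the", "right", "things", ";",
          "leadership", "is", "doing", "the", "right", "things", "."]

flags = mark_boundaries(tokens)
sentences = group_sentences(tokens, flags)
for sentence in sentences:
    print(" ".join(sentence))
```

Note that the ';' stays at the end of the first segment: the rule marks the token after the delimiter as a sentence start, it does not remove the delimiter.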

### Changing Segmentation Rules:

In [27]:
nlp = spacy.load('en_core_web_sm')

In [31]:
mystring = u"This is a sentence. This is another sentence .\n\nThis is a \nthird sentence."

In [32]:
print(mystring)

This is a sentence. This is another sentence .

This is a 
third sentence.


In [36]:
# let's first check the default behaviour

doc = nlp(mystring)

for sentence in doc.sents:
    print(sentence)
    
# By default spaCy splits after the periods here; the line break inside the third sentence does not start a new sentence.
# What if we only want to split on line breaks (\n)?

This is a sentence.
This is another sentence .


This is a 
third sentence.


In [37]:
# So let's replace spaCy's default segmentation rules.
# Note: SentenceSegmenter is part of spaCy v2; it is not available in spaCy v3.
from spacy.pipeline import SentenceSegmenter

In [44]:
def split_newlines(doc): # generator that yields one span per line, splitting after each newline token
    start = 0
    seen_newline = False
    
    for word in doc:
        if seen_newline:
            yield doc[start:word.i]
            start = word.i
            seen_newline = False
        
        elif word.text.startswith('\n'):
            seen_newline = True
            
    yield doc[start:]

In [39]:
sbd = SentenceSegmenter(nlp.vocab, strategy = split_newlines)

In [40]:
nlp.add_pipe(sbd)

In [41]:
doc = nlp(mystring)

In [43]:
for sentence in doc.sents:
    print(sentence)
    
# Now the doc is split only at line breaks, not at periods

This is a sentence. This is another sentence .


This is a 

third sentence.
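
To see why the output looks this way, the same splitting logic can be run on a plain list of strings. This is a sketch under assumptions: `split_on_newline_tokens` is a hypothetical pure-Python counterpart of `split_newlines`, and the token list is an assumed tokenisation of `mystring` in which each run of line breaks is its own token.

```python
def split_on_newline_tokens(tokens):
    """Yield token slices, starting a new slice right after each newline token."""
    start = 0
    seen_newline = False
    for i, text in enumerate(tokens):
        if seen_newline:
            # The previous token was a newline: close the current slice here.
            yield tokens[start:i]
            start = i
            seen_newline = False
        elif text.startswith("\n"):
            seen_newline = True
    # Whatever is left after the last newline forms the final slice.
    yield tokens[start:]

# Assumed tokenisation of mystring (newline runs kept as separate tokens).
tokens = ["This", "is", "a", "sentence", ".",
          "This", "is", "another", "sentence", ".", "\n\n",
          "This", "is", "a", "\n",
          "third", "sentence", "."]

segments = list(split_on_newline_tokens(tokens))
for segment in segments:
    print(segment)
```

Each newline token stays at the end of the segment it closes, which is why the printed sentences above are followed by extra blank lines.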
