# Sentence Segmentation or Boundary Detection

#### Deciding where sentences begin and end
- If the token is a period, it ends a sentence.
- If the preceding token is in a hand-compiled list of abbreviations, the period doesn't end a sentence.
- If the next token is capitalised, the period ends a sentence.
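
The three heuristics above can be sketched in plain Python, without spaCy. This is only an illustration: the abbreviation set, the whitespace tokenisation, and the function name `split_sentences` are assumptions made for the example, and the way the rules are combined (a period is a boundary only if it is not an abbreviation and the next token is capitalised or missing) is one possible reading of the list.

```python
# Hypothetical hand-compiled abbreviation list; a real one would be much longer.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "etc."}


def split_sentences(text):
    """Split text into sentences using the period, abbreviation and
    capitalisation heuristics on whitespace-separated tokens."""
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        # Rule 1: only a token ending in a period can close a sentence.
        if not token.endswith("."):
            continue
        # Rule 2: a known abbreviation does not close a sentence.
        if token.lower() in ABBREVIATIONS:
            continue
        # Rule 3: close the sentence if the next token is capitalised
        # (or if there is no next token at all).
        next_token = tokens[i + 1] if i + 1 < len(tokens) else None
        if next_token is None or next_token[0].isupper():
            sentences.append(" ".join(current))
            current = []
    if current:
        # Keep any trailing text that never reached a boundary.
        sentences.append(" ".join(current))
    return sentences


print(split_sentences("Dr. Smith went home. He slept."))
```

Here `Dr.` is skipped because it is in the abbreviation set, while `home.` closes a sentence because `He` is capitalised. Hand-written rules like these break easily (for example, an abbreviation followed by a proper noun at a real sentence end), which is why spaCy relies on the dependency parser by default.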

- Default: spaCy uses the dependency parser to set sentence boundaries.
- Custom: rule-based or manual.
  - Custom boundaries must be set before the doc is parsed.

In [59]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [60]:
# Read the file with a context manager so the handle is closed afterwards
with open('covid19.txt') as f:
    doc_covid = nlp(f.read())
doc_covid

Through the International Food Safety Authorities Network (INFOSAN),
national food safety authorities are seeking more information on the
potential for persistence of SARS-CoV-2, which causes COVID-19, on foods
traded internationally as well as the potential role of food in the transmission
of the virus. Currently, there are investigations conducted to evaluate the
viability and survival time of SARS-CoV-2. As a general rule, the consumption
of raw or undercooked animal products should be avoided. Raw meat, raw
milk or raw animal organs should be handled with care to avoid crosscontamination with uncooked foods.

### Finding Sentence Boundaries

In [61]:
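# Custom manual function (spaCy v2-style component: a plain callable that takes and returns a Doc)

def sent_boundary(docx):
    # Skip the last token so docx[token.i + 1] never runs past the end of the doc
    for token in docx[:-1]:
        if token.text == '---':
            # Mark the token after each '---' as the start of a new sentence
            docx[token.i + 1].is_sent_start = True
    return docx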

In [62]:
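# Add the rule before the parser so the parser respects the preset boundaries.
# Passing the function object directly is the spaCy v2 API; spaCy v3 expects
# the name of a registered component instead.

nlp.add_pipe(sent_boundary, before='parser')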

In [63]:
mytext = '''This is my first sentence---the last comment was so cool--- What if --- this is the last sentence'''

In [64]:
mysentence = nlp(mytext)

In [65]:
for sentence in mysentence.sents:
    print(sentence)


This is my first sentence---
the last comment was so cool--- What if ---
this is the last sentence


### Custom Rule-Based Segmentation

In [131]:
from spacy.lang.en import English
# SentenceSegmenter belongs to the spaCy v2 API; it was removed in spaCy v3
from spacy.pipeline import SentenceSegmenter

In [136]:
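def split_on_newline(doc):
    """Yield one Span per line: a new sentence starts at the first
    non-whitespace token after a newline token."""
    start = 0
    seen_newline = False
    for word in doc:
        if seen_newline and not word.is_space:
            # First real token after a newline: close the previous span here
            yield doc[start:word.i]
            start = word.i
            seen_newline = False
        elif word.text == '\n':
            seen_newline = True
    # Yield whatever remains after the last newline
    if start < len(doc):
        yield doc[start:len(doc)]
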

In [137]:
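nlp3 = English()   # blank English pipeline: tokenizer only, no trained model
sbd = SentenceSegmenter(nlp3.vocab, strategy=split_on_newline)
# Name the component explicitly so it can be removed by that name later
nlp3.add_pipe(sbd, name='sbd')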

In [138]:
with open('covid19.txt') as f:
    doc = nlp3(f.read())
doc

Through the International Food Safety Authorities Network (INFOSAN),
national food safety authorities are seeking more information on the
potential for persistence of SARS-CoV-2, which causes COVID-19, on foods
traded internationally as well as the potential role of food in the transmission
of the virus. Currently, there are investigations conducted to evaluate the
viability and survival time of SARS-CoV-2. As a general rule, the consumption
of raw or undercooked animal products should be avoided. Raw meat, raw
milk or raw animal organs should be handled with care to avoid crosscontamination with uncooked foods.

In [139]:
for i, sent in enumerate(doc.sents):
    print(i, sent)

0 Through the International Food Safety Authorities Network (INFOSAN),

1 national food safety authorities are seeking more information on the

2 potential for persistence of SARS-CoV-2, which causes COVID-19, on foods

3 traded internationally as well as the potential role of food in the transmission

4 of the virus. Currently, there are investigations conducted to evaluate the

5 viability and survival time of SARS-CoV-2. As a general rule, the consumption

6 of raw or undercooked animal products should be avoided. Raw meat, raw

7 milk or raw animal organs should be handled with care to avoid crosscontamination with uncooked foods.
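
The output above breaks at every line wrap and ignores the periods inside a line, because the rule looks only at newline tokens. The same generator logic can be checked without spaCy on a plain list of strings. This is a minimal sketch: the function name `split_on_newline_tokens` and the hand-written token list are assumptions for illustration, and `str.isspace()` stands in for spaCy's `Token.is_space`.

```python
def split_on_newline_tokens(tokens):
    """Yield slices of a token list, starting a new slice at the first
    non-whitespace token that follows a newline token."""
    start = 0
    seen_newline = False
    for i, tok in enumerate(tokens):
        if seen_newline and not tok.isspace():
            # First real token after a newline: close the previous slice
            yield tokens[start:i]
            start = i
            seen_newline = False
        elif tok == "\n":
            seen_newline = True
    # Yield whatever remains after the last newline
    if start < len(tokens):
        yield tokens[start:]


# Hand-written tokens resembling the wrapped text: a period mid-line,
# then a line break in the middle of the next sentence.
tokens = ["of", "the", "virus", ".", "Currently", ",", "\n",
          "viability", "and", "survival"]

for i, group in enumerate(split_on_newline_tokens(tokens)):
    print(i, group)
```

The period after `virus` does not start a new group, while the newline after `Currently ,` does. A newline rule therefore only suits text with one sentence per line; for hard-wrapped text like `covid19.txt` it splits in the wrong places.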


### Removing the Pipe

In [None]:
# remove_pipe takes the name the component is registered under; check nlp3.pipe_names if unsure
nlp3.remove_pipe('sbd')