
In [6]:
!pip install spacy==3.0

Collecting spacy==3.0
  Downloading spacy-3.0.0-cp37-cp37m-manylinux2014_x86_64.whl (12.7 MB)
Collecting srsly<3.0.0,>=2.4.0
  Downloading srsly-2.4.1-cp37-cp37m-manylinux2014_x86_64.whl (456 kB)
Collecting pathy
  Downloading pathy-0.6.0-py3-none-any.whl (42 kB)
Collecting catalogue<2.1.0,>=2.0.1
  Downloading catalogue-2.0.6-py3-none-any.whl (17 kB)
Collecting spacy-legacy<3.1.0,>=3.0.0
  Downloading spacy_legacy-3.0.8-py2.py3-none-any.whl (14 kB)
Collecting pydantic<1.8.0,>=1.7.1
  Downloading pydantic-1.7.4-cp37-cp37m-manylinux2014_x86_64.whl (9.1 MB)
Collecting typer<0.4.0,>=0.3.0
  Downloading typer-0.3.2-py3-none-any.whl (21 kB)
Collecting thinc<8.1.0,>=8.0.0
  Downloading thinc-8.0.10-cp37-cp37m-manylinux_2_17_x86_64.manylinux20

In [1]:
import spacy

In [2]:
!python -m spacy download en_core_web_sm
!python -m spacy download en_core_web_md
!python -m spacy download en_core_web_lg

Collecting en-core-web-sm==3.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0-py3-none-any.whl (13.7 MB)
Installing collected packages: en-core-web-sm
  Attempting uninstall: en-core-web-sm
    Found existing installation: en-core-web-sm 2.2.5
    Uninstalling en-core-web-sm-2.2.5:
      Successfully uninstalled en-core-web-sm-2.2.5
Successfully installed en-core-web-sm-3.0.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
Collecting en-core-web-md==3.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.0.0/en_core_web_md-3.0.0-py3-none-any.whl (47.1 MB)
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.0.0
✔ Download and installation successful

# Default: Using the dependency parse

Unlike other libraries, spaCy uses the dependency parse to determine sentence boundaries. This is usually the most accurate approach, but it requires a **trained pipeline** that provides accurate predictions. If your texts are close to general-purpose news or web text, this should work well out of the box with spaCy’s provided trained pipelines. For social media or conversational text that doesn’t follow the same conventions, your application may benefit from a custom-trained or rule-based component.

In [3]:
nlp_default = spacy.load('en_core_web_sm')
doc = nlp_default('This is a sentence. This is another sentence.')
for sent in doc.sents:
    print(sent.text)

This is a sentence.
This is another sentence.
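
Because this approach depends on a component that actually sets sentence boundaries, it can help to check for them before iterating over `doc.sents`. The following is a minimal sketch, assuming a blank English pipeline (tokenizer only, no trained components) created with `spacy.blank`; the fallback of treating the whole text as one unit is an illustrative choice, not something spaCy does for you.

```python
import spacy

# A blank English pipeline has a tokenizer but no parser, senter or sentencizer.
nlp_blank = spacy.blank("en")
doc = nlp_blank("This is a sentence. This is another sentence.")

# has_annotation("SENT_START") reports whether any sentence boundary has been set.
has_boundaries = doc.has_annotation("SENT_START")
print("Boundaries set:", has_boundaries)

if has_boundaries:
    sentences = [sent.text for sent in doc.sents]
else:
    # Illustrative fallback: avoid doc.sents and keep the text as a single unit.
    sentences = [doc.text]
print(sentences)
```

With a trained pipeline such as `en_core_web_sm`, the same check should report that boundaries are set, because the parser provides them.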


# Statistical sentence segmenter

The `SentenceRecognizer` is a simple statistical component that only provides sentence boundaries. Along with being faster and smaller than the parser, its primary advantage is that it’s easier to train because it only requires annotated sentence boundaries rather than full dependency parses. spaCy’s [trained pipelines](https://spacy.io/models) include both a parser and a trained sentence segmenter, which is [disabled](https://spacy.io/usage/processing-pipelines#disabling) by default. If you only need sentence boundaries and no parser, you can use the `exclude` or `disable` argument on `spacy.load` to load the pipeline without the parser and then enable the sentence recognizer explicitly with `nlp.enable_pipe`.

In [4]:
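# Load the pipeline without the parser; the trained "senter" component is disabled by default.
nlp_stat = spacy.load('en_core_web_sm', exclude=['parser'])
# Enable the statistical sentence recognizer so that doc.sents is available.
nlp_stat.enable_pipe('senter')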
doc = nlp_stat('This is a sentence. This is another sentence.')
for sent in doc.sents:
    print(sent.text)

This is a sentence.
This is another sentence.


# Rule-based pipeline component

The `Sentencizer` component is a [pipeline component](https://spacy.io/usage/processing-pipelines) that splits sentences on punctuation like `.`, `!` or `?`. You can plug it into your pipeline if you only need sentence boundaries without dependency parses.

In [5]:
from spacy.lang.en import English

nlp_rule = English()
nlp_rule.add_pipe('sentencizer')
doc = nlp_rule('This is a sentence. This is another sentence.')
for sent in doc.sents:
    print(sent.text)

This is a sentence.
This is another sentence.
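
The set of characters the `Sentencizer` splits on can be changed through its `punct_chars` setting. Here is a small sketch, assuming a blank English pipeline and the purely illustrative choice that only `;` ends a sentence in your data; the variable names are made up for this example.

```python
import spacy

nlp_punct = spacy.blank("en")
# Assumption for illustration: only ";" marks the end of a sentence.
nlp_punct.add_pipe("sentencizer", config={"punct_chars": [";"]})

doc = nlp_punct("First clause; second clause. Third clause.")
custom_sents = [sent.text for sent in doc.sents]
for sent_text in custom_sents:
    print(sent_text)
```

Since `.` is no longer in the list, the periods in this text should not trigger a split, so only the semicolon produces a boundary.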


# Custom rule-based strategy

If you want to implement your own strategy that differs from the default rule-based approach of splitting on punctuation, you can also create a [custom pipeline component](https://spacy.io/usage/processing-pipelines#custom-components) that takes a `Doc` object and sets the `Token.is_sent_start` attribute on individual tokens. If set to `True`, the token starts a new sentence. If set to `False`, the token is explicitly marked as not the start of a sentence. If set to `None` (default), it’s treated as a missing value and can still be overwritten by the parser.

> **Important note**
>
> To prevent inconsistent state, you can only set boundaries **before** a document is parsed (and `doc.has_annotation("DEP")` is `False`). To ensure that your component is added in the right place, you can set `before='parser'` or `first=True` when adding it to the pipeline using `nlp.add_pipe`.

Here’s an example of a component that implements a pre-processing rule for splitting on `"..."` tokens. The component is added before the parser, which is then used to further segment the text. That’s possible because `is_sent_start` is only set to `True` for some of the tokens; all others keep `None`, meaning the boundary is unset. This approach can be useful if you want to implement **additional** rules specific to your data, while still being able to take advantage of dependency-based sentence segmentation.

In [7]:
from spacy.language import Language

text = 'This is a sentence...hello...and another sentence.'

nlp_custom = spacy.load('en_core_web_sm')
doc = nlp_custom(text)
print('Before:', [sent.text for sent in doc.sents])

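@Language.component('set_custom_boundaries')
def set_custom_boundaries(doc):
    # Skip the last token: there is no following token to mark.
    for token in doc[:-1]:
        if token.text == '...':
            # The token after an ellipsis starts a new sentence.
            doc[token.i + 1].is_sent_start = True
    return doc

# Boundaries must be set before the parser runs.
nlp_custom.add_pipe('set_custom_boundaries', before='parser')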
doc = nlp_custom(text)
print('After:', [sent.text for sent in doc.sents])

Before: ['This is a sentence...hello...and another sentence.']
After: ['This is a sentence...', 'hello...', 'and another sentence.']
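
Setting `is_sent_start` to `False` works in the opposite direction: it suppresses a boundary that a later component would otherwise add. The sketch below is one way this could look, assuming a blank English pipeline with the rule-based `sentencizer` instead of the parser, and assuming the sentencizer leaves already-set boundaries untouched (its default behaviour in spaCy v3, as far as the documentation indicates). The component name `keep_quote_attached` and the quote rule itself are hypothetical.

```python
import spacy
from spacy.language import Language


@Language.component("keep_quote_attached")  # hypothetical component name
def keep_quote_attached(doc):
    # If a closing quote follows "!" or "?", mark the next token as NOT a sentence start.
    for token in doc[1:-1]:
        if token.text == '"' and doc[token.i - 1].text in ("!", "?"):
            doc[token.i + 1].is_sent_start = False
    return doc


text = 'He said "Stop!" and left. Nobody followed.'

# Baseline: the sentencizer alone splits after the quoted exclamation.
nlp_plain = spacy.blank("en")
nlp_plain.add_pipe("sentencizer")
plain_sents = [sent.text for sent in nlp_plain(text).sents]

# Guarded: the custom rule runs first, so its explicit False values are kept.
nlp_guarded = spacy.blank("en")
nlp_guarded.add_pipe("sentencizer")
nlp_guarded.add_pipe("keep_quote_attached", before="sentencizer")
guarded_sents = [sent.text for sent in nlp_guarded(text).sents]

print("Pipeline:", nlp_guarded.pipe_names)
print("Plain:", plain_sents)
print("Guarded:", guarded_sents)
```

Printing `nlp.pipe_names` is a quick way to confirm that the custom component really sits before the component that finishes the segmentation.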
