![Sentence Tokenization](http://www.digitalmeetsculture.net/wp-content/uploads/2015/04/article.jpg)

Source: http://www.digitalmeetsculture.net/article/article-about-preforma-published-in-archival-science/

# Sentence Tokenization

The previous article introduced word tokenization. What if we want to tokenize text into sentences instead? In general, we can split sentences on punctuation such as ".", "?" and "!". However, there are many exceptions if we split an article on those punctuation marks alone.
This article covers why we need sentence tokenization and how to do it.

# Why?
According to researchers, in about 86% of articles the most important sentence is among the first one or two sentences. I believe this is one of the reasons why the textsum model uses the first two sentences for training.
When I was in school, teachers taught us how to write an article: the most important sentence is placed first most of the time, and sometimes it appears as the last sentence.

# How?
So how can we tokenize sentences? You can use a simple Python script like the following, or a library such as NLTK or spaCy.

In [28]:
# Captured from https://en.wikipedia.org/wiki/Lexical_analysis

article = 'In computer science, lexical analysis, lexing or tokenization is the process of \
converting a sequence of characters (such as in a computer program or web page) into a \
sequence of tokens (strings with an assigned and thus identified meaning). A program that \
performs lexical analysis may be termed a lexer, tokenizer,[1] or scanner, though scanner \
is also a term for the first stage of a lexer. A lexer is generally combined with a parser, \
which together analyze the syntax of programming languages, web pages, and so forth.'

article2 = 'ConcateStringAnd123 ConcateSepcialCharacter_!@# !@#$%^&*()_+ 0123456'

article3 = 'It is a great moment from 10 a.m. to 1 p.m. every weekend.'

### Self-built

In [29]:
import re

for doc in [article, article2, article3]:
    print('Original Article: %s' % (doc))
    print()

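    # Raw string avoids invalid-escape warnings; the capturing group makes
    # re.split keep each delimiter as its own list item.
    sentences = re.split(r'(\.|!|\?)', doc)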
    
    for i, s in enumerate(sentences):
        print('-->Sentence %d: %s' % (i, s))

Original Article: In computer science, lexical analysis, lexing or tokenization is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of tokens (strings with an assigned and thus identified meaning). A program that performs lexical analysis may be termed a lexer, tokenizer,[1] or scanner, though scanner is also a term for the first stage of a lexer. A lexer is generally combined with a parser, which together analyze the syntax of programming languages, web pages, and so forth.

-->Sentence 0: In computer science, lexical analysis, lexing or tokenization is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of tokens (strings with an assigned and thus identified meaning)
-->Sentence 1: .
-->Sentence 2:  A program that performs lexical analysis may be termed a lexer, tokenizer,[1] or scanner, though scanner is also a term for the first stage of a lexer
-->Sentence 3: .


You can see that "a.m." should be treated as a single "word" rather than as a pair of sentence boundaries. Of course, we could enhance the regular expression above to handle it, but I will use a library rather than reinvent the wheel.
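
For readers curious what such an enhancement might look like, here is a minimal sketch. It is not the approach used in the rest of this article, and the `ABBREVIATIONS` set and the `split_sentences` helper are hypothetical names made up for illustration. The idea is to split only on whitespace that follows ".", "!" or "?", and then glue a chunk back onto the next one when it ends with a known abbreviation.

```python
import re

# Hypothetical, deliberately tiny abbreviation list (an assumption for this
# sketch); a real list would need to be much longer.
ABBREVIATIONS = {'a.m.', 'p.m.', 'e.g.', 'i.e.', 'mr.', 'mrs.', 'dr.'}

# Candidate boundary: whitespace that directly follows '.', '!' or '?'.
BOUNDARY = re.compile(r'(?<=[.!?])\s+')

def split_sentences(text):
    sentences = []
    current = []  # chunks collected for the sentence being built
    for chunk in BOUNDARY.split(text.strip()):
        words = chunk.split()
        if not words:
            continue  # empty input produces one empty chunk
        current.append(chunk)
        # If the chunk ends with a known abbreviation, assume the sentence
        # continues and keep collecting chunks.
        if words[-1].lower() not in ABBREVIATIONS:
            sentences.append(' '.join(current))
            current = []
    if current:
        sentences.append(' '.join(current))
    return sentences

for s in split_sentences('It is a great moment from 10 a.m. to 1 p.m. every weekend.'):
    print('-->', s)
```

This keeps the punctuation attached to each sentence and no longer breaks on "a.m." or "p.m.". It still has obvious limits: a sentence that really ends with an abbreviation is merged with the next one, and runs of whitespace between rejoined chunks are collapsed to a single space. Those gaps are a good reason to prefer a library.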

### spaCy

In [30]:
import spacy
print('spaCy Version: %s' % spacy.__version__)

spaCy Version: 2.3.5


In [31]:
spacy_nlp = spacy.load('en_core_web_sm')

In [32]:
for article in [article, article2, article3]:
    print('Original Article: %s' % (article))
    print()
    doc = spacy_nlp(article)
    # doc.sents yields sentence spans, not tokens
    for i, sent in enumerate(doc.sents):
        print('-->Sentence %d: %s' % (i, sent.text))

Original Article: In computer science, lexical analysis, lexing or tokenization is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of tokens (strings with an assigned and thus identified meaning). A program that performs lexical analysis may be termed a lexer, tokenizer,[1] or scanner, though scanner is also a term for the first stage of a lexer. A lexer is generally combined with a parser, which together analyze the syntax of programming languages, web pages, and so forth.

-->Sentence 0: In computer science, lexical analysis, lexing or tokenization is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of tokens (strings with an assigned and thus identified meaning).
-->Sentence 1: A program that performs lexical analysis may be termed a lexer, tokenizer,[1] or scanner, though scanner is also a term for the first stage of a lexer.
-->Sentence 2: A lexer is general

We can see that spaCy handles "a.m." without splitting the sentence there.

### NLTK

In [33]:
import nltk
from nltk.tokenize import sent_tokenize
print('NLTK Version: %s' % nltk.__version__)

NLTK Version: 3.4.4


In [34]:
# Uncomment on the first run to download the Punkt sentence tokenizer data
# nltk.download('punkt')

In [35]:
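# Note: an earlier loop reused `article` as its loop variable, so `article`
# now refers to article3 and the first item printed below is article3.
for article in [article, article2, article3]:
    print('Original Article: %s' % (article))
    print()

    sentences = sent_tokenize(article)
    for i, sent in enumerate(sentences):
        print('-->Sentence %d: %s' % (i, sent))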

Original Article: It is a great moment from 10 a.m. to 1 p.m. every weekend.

-->Sentence 0: It is a great moment from 10 a.m. to 1 p.m. every weekend.
Original Article: ConcateStringAnd123 ConcateSepcialCharacter_!@# !@#$%^&*()_+ 0123456

-->Sentence 0: ConcateStringAnd123 ConcateSepcialCharacter_!
-->Sentence 1: @# !
-->Sentence 2: @#$%^&*()_+ 0123456
Original Article: It is a great moment from 10 a.m. to 1 p.m. every weekend.

-->Sentence 0: It is a great moment from 10 a.m. to 1 p.m. every weekend.


# Conclusion
So far, NLTK and spaCy provide similar behavior, so the choice depends on which library you already use for other preprocessing steps.
Recently, I worked on a text mining project that classifies news by category. Of course, I could build an ML model to classify it, but I went for a simpler approach: focus only on the first sentence of each news item and perform a simple keyword search to build a baseline model. The result is not bad, and it is a very quick way to deliver an initial version.
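
A rough sketch of that kind of baseline could look like the following. The categories, the keyword lists and the helper names (`CATEGORY_KEYWORDS`, `first_sentence`, `classify`) are all hypothetical and only illustrate the idea, not the actual project. The first sentence is extracted here with a naive regular expression; in practice it could be swapped for `sent_tokenize` from NLTK or `doc.sents` from spaCy.

```python
import re

# Hypothetical keyword lists, for illustration only.
CATEGORY_KEYWORDS = {
    'sports': {'match', 'team', 'league', 'goal'},
    'finance': {'stock', 'market', 'shares', 'bank'},
}

def first_sentence(text):
    # Naive split: cut at the first '.', '!' or '?' followed by whitespace.
    return re.split(r'(?<=[.!?])\s+', text.strip(), maxsplit=1)[0]

def classify(text, default='unknown'):
    # Lowercase words that appear in the first sentence only.
    words = set(re.findall(r'[a-z]+', first_sentence(text).lower()))
    # Score each category by how many of its keywords appear.
    scores = {cat: len(words & kws) for cat, kws in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default

news = 'The team scored a late goal to win the match. Stock markets were flat.'
print(classify(news))
```

Because only the first sentence is scored, the finance keywords in the second sentence are ignored, which is exactly the trade-off this baseline makes.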