##### Understanding text- and sentence-based tokenization in NLP using a text summarization example

Tokenization is the process of breaking text into simple, digestible chunks so that it is easier for a computer to analyze the text in later NLP tasks. The chunks can be characters, words, or sentences.
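
To make the three chunk sizes concrete, here is a minimal sketch that tokenizes a short made-up sentence pair at the character, word, and sentence level. It uses simple regular expressions as rough stand-ins for NLTK's tokenizers, so the `sample` string and the patterns are illustrative assumptions, not what `word_tokenize` or `sent_tokenize` do internally.

```python
import re

# Hypothetical sample text, chosen only for illustration.
sample = "Tokenization splits text. It can work at several levels!"

# Character level: every character (including spaces) becomes a token.
char_tokens = list(sample)

# Word level: runs of word characters, with punctuation kept as separate tokens.
word_tokens = re.findall(r"\w+|[^\w\s]", sample)

# Sentence level: split after ., ! or ? when followed by whitespace.
sentence_tokens = re.split(r"(?<=[.!?])\s+", sample)

print(char_tokens[:5])
print(word_tokens)
print(sentence_tokens)
```

The regex-based sentence split would break on abbreviations such as "e.g.", which is one reason a trained tokenizer is usually preferred for real text.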

In [27]:
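# Requires the NLTK stop word and Punkt tokenizer data, e.g. nltk.download("stopwords") and
# nltk.download("punkt"); newer NLTK releases may ask for "punkt_tab" instead.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize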

In [28]:
text ="There are many techniques available to generate extractive summarization to keep it simple, I will be using an unsupervised learning approach to find the sentences similarity and rank them. Summarization can be defined as a task of producing a concise and fluent summary while preserving key information and overall meaning. One benefit of this will be, you don’t need to train and build a model prior start using it for your project. It’s good to understand Cosine similarity to make the best use of the code you are going to see. Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them. Its measures cosine of the angle between vectors. The angle will be 0 if sentences are similar."

In [29]:
stop_words = set(stopwords.words("english"))  # Load NLTK's predefined English stop words
words = word_tokenize(text)  # Split the text into word tokens
freq_table = dict()  # Maps each non-stop-word token to its count

In [30]:
for word in words:  # Loop over all the tokens in the text
    word = word.lower()
    if word in stop_words:  # Skip stop words
        continue
    if word in freq_table:  # Count each remaining token (punctuation tokens included)
        freq_table[word] += 1
    else:
        freq_table[word] = 1
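
The counting loop above can also be written with `collections.Counter`. The sketch below is one possible variant, not the notebook's method: the token list and the three-word stop list are hypothetical stand-ins for `word_tokenize(text)` and NLTK's English stop words, and it additionally drops tokens with no letters or digits (pure punctuation), which the loop above keeps.

```python
from collections import Counter

# Hypothetical, pre-tokenized input; in the notebook these would come from word_tokenize(text).
tokens = ["Cosine", "similarity", "is", "a", "measure", "of", "similarity", "."]
# Tiny illustrative stop list, not NLTK's full English list.
stop_words = {"is", "a", "of"}

freq_table = Counter(
    token.lower()
    for token in tokens
    # Keep tokens that are not stop words and contain at least one letter or digit.
    if token.lower() not in stop_words and any(ch.isalnum() for ch in token)
)

print(freq_table.most_common(2))
```

Filtering punctuation changes the scores computed later, so the printed values in this notebook would differ if this variant were swapped in.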

In [31]:
sentences = sent_tokenize(text)  # Split the text into sentences
sentence_value = dict()  # Maps each sentence to its score

In [32]:
print(sentences)

['There are many techniques available to generate extractive summarization to keep it simple, I will be using an unsupervised learning approach to find the sentences similarity and rank them.', 'Summarization can be defined as a task of producing a concise and fluent summary while preserving key information and overall meaning.', 'One benefit of this will be, you don’t need to train and build a model prior start using it for your project.', 'It’s good to understand Cosine similarity to make the best use of the code you are going to see.', 'Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them.', 'Its measures cosine of the angle between vectors.', 'The angle will be 0 if sentences are similar.']


In [33]:
for sentence in sentences:  # Score each sentence by summing the frequencies of the tokens it contains
    for word, freq in freq_table.items():
        if word in sentence.lower():  # Substring test, not a whole-word match
            if sentence in sentence_value:
                sentence_value[sentence] += freq
            else:
                sentence_value[sentence] = freq

In [34]:
print(sentence_value)

{'There are many techniques available to generate extractive summarization to keep it simple, I will be using an unsupervised learning approach to find the sentences similarity and rank them.': 32, 'Summarization can be defined as a task of producing a concise and fluent summary while preserving key information and overall meaning.': 20, 'One benefit of this will be, you don’t need to train and build a model prior start using it for your project.': 22, 'It’s good to understand Cosine similarity to make the best use of the code you are going to see.': 26, 'Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them.': 29, 'Its measures cosine of the angle between vectors.': 19, 'The angle will be 0 if sentences are similar.': 14}
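
The scoring cell tests `word in sentence.lower()`, which is a substring check, so a short table entry can match inside a longer word. The sketch below contrasts that with a whole-word check. The helper `score_sentence`, the two-entry frequency table, and the example sentence are all hypothetical and exist only to show the difference.

```python
import re

def score_sentence(sentence, freq_table):
    """Sum the frequencies of table words that occur as whole words in the sentence."""
    words_in_sentence = set(re.findall(r"\w+", sentence.lower()))
    return sum(freq for word, freq in freq_table.items() if word in words_in_sentence)

# Hypothetical frequency table and sentence, chosen to expose the difference.
freq_table = {"rank": 1, "angle": 3}
sentence = "Frankly, the angle is small."

# Substring matching, as in the scoring cell: "rank" is found inside "frankly".
substring_score = sum(f for w, f in freq_table.items() if w in sentence.lower())
# Whole-word matching: only "angle" counts.
whole_word_score = score_sentence(sentence, freq_table)

print(substring_score, whole_word_score)
```

One trade-off of this sketch: the `\w+` pattern splits hyphenated words such as "non-zero" into two pieces, so a table entry containing a hyphen would no longer match.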


In [35]:
# Compute the average sentence score, truncated to an integer
sum_values = 0
for sentence in sentence_value:
    sum_values += sentence_value[sentence]
average = int(sum_values / len(sentence_value))

In [39]:
average

23

In [40]:
summary = ''  # Build the summary from sentences scoring above 1.2 times the average
for sentence in sentences:
    if (sentence in sentence_value) and (sentence_value[sentence] > (1.2 * average)):
        summary += " " + sentence
print(summary)

 There are many techniques available to generate extractive summarization to keep it simple, I will be using an unsupervised learning approach to find the sentences similarity and rank them. Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them.
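
The averaging and selection steps can be collected into one small function. This is a sketch under stated assumptions rather than the notebook's exact code: `select_summary` and its `factor` parameter are hypothetical names, it uses the untruncated mean instead of `int(...)`, and it joins sentences with single spaces so the result has no leading space. The labels `s1` to `s7` stand in for the seven sentences, paired with the scores printed earlier.

```python
def select_summary(sentences, scores, factor=1.2):
    """Return the sentences whose score exceeds factor * mean score, in original order."""
    if not scores:
        return ""
    average = sum(scores.values()) / len(scores)
    chosen = [s for s in sentences if scores.get(s, 0) > factor * average]
    return " ".join(chosen)

# Placeholder labels for the seven sentences, with the scores from the notebook output.
labels = ["s1", "s2", "s3", "s4", "s5", "s6", "s7"]
scores = dict(zip(labels, [32, 20, 22, 26, 29, 19, 14]))

print(select_summary(labels, scores))
```

With these scores the untruncated mean is about 23.14, so the cutoff is about 27.8 rather than 27.6, and the same two sentences are selected. Raising `factor` shortens the summary; lowering it lets more sentences through.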
