# Main Process

- Text Cleaning
- Sentence Tokenization
- Word Tokenization
- Word-frequency table
- Summarization
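
The five steps above can be sketched end to end with the standard library alone. This is a minimal illustration, not the notebook's implementation: the regex-based sentence and word splitting, the `summarize` function, its `ratio` parameter, and the tiny `TOY_STOP_WORDS` set are all assumptions standing in for spaCy's tokenizer and `STOP_WORDS`.

```python
import re
from collections import Counter
from heapq import nlargest

# Hypothetical, deliberately tiny stop-word list; the notebook uses spaCy's STOP_WORDS.
TOY_STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in", "it"}


def words(sentence):
    """Word tokenization: lower-cased alphabetic runs (an assumption, not spaCy's rules)."""
    return re.findall(r"[a-z]+", sentence.lower())


def summarize(text, ratio=0.3):
    """Return the highest-scoring sentences of `text`, joined in document order."""
    # 1. Text cleaning: collapse whitespace runs into single spaces.
    cleaned = re.sub(r"\s+", " ", text).strip()
    # 2. Sentence tokenization: naive split after '.', '!' or '?'.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", cleaned) if s]
    # 3 + 4. Word tokenization and word-frequency table, skipping stop words.
    counts = Counter(
        w for s in sentences for w in words(s) if w not in TOY_STOP_WORDS
    )
    if not counts:
        return ""
    top = max(counts.values())
    weights = {w: c / top for w, c in counts.items()}
    # 5. Summarization: score each sentence, keep the best ones.
    scores = {
        i: sum(weights.get(w, 0.0) for w in words(s))
        for i, s in enumerate(sentences)
    }
    keep = max(1, int(len(sentences) * ratio))
    best = sorted(nlargest(keep, scores, key=scores.get))
    return " ".join(sentences[i] for i in best)


sample = (
    "Python is a language. "
    "Python code is readable and Python tools are popular. "
    "The sky is blue."
)
print(summarize(sample))
```

With three sentences and the default `ratio` of 0.3, only the single best-scoring sentence is kept.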

## Text

In [21]:
text = """
This article is about natural language processing done by computers. For the natural language processing done by the human brain, see Language processing in the brain. Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.
A major drawback of statistical methods is that they require elaborate feature engineering. Since 2015,[19] the field has thus largely abandoned statistical methods and shifted to neural networks for machine learning. Popular techniques include the use of word embeddings to capture semantic properties of words, and an increase in end-to-end learning of a higher-level task (e.g., question answering) instead of relying on a pipeline of separate intermediate tasks (e.g., part-of-speech tagging and dependency parsing). In some areas, this shift has entailed substantial changes in how NLP systems are designed, such that deep neural network-based approaches may be viewed as a new paradigm distinct from statistical natural language processing. For instance, the term neural machine translation (NMT) emphasizes the fact that deep learning-based approaches to machine translation directly learn sequence-to-sequence transformations, obviating the need for intermediate steps such as word alignment and language modeling that was used in statistical machine translation (SMT). Latest works tend to use non-technical structure of a given task to build proper neural network.[20]
Up to the 1980s, most natural language processing systems were based on complex sets of hand-written rules. Starting in the late 1980s, however, there was a revolution in natural language processing with the introduction of machine learning algorithms for language processing. This was due to both the steady increase in computational power (see Moore's law) and the gradual lessening of the dominance of Chomskyan theories of linguistics (e.g. transformational grammar), whose theoretical underpinnings discouraged the sort of corpus linguistics that underlies the machine-learning approach to language processing.[6]
"""

## Install Important Modules

In [2]:
!pip install -U spacy

Collecting spacy
  Downloading spacy-3.1.4-cp38-cp38-win_amd64.whl (12.0 MB)
Collecting catalogue<2.1.0,>=2.0.6
  Downloading catalogue-2.0.6-py3-none-any.whl (17 kB)
Collecting typer<0.5.0,>=0.3.0
  Downloading typer-0.4.0-py3-none-any.whl (27 kB)
Collecting murmurhash<1.1.0,>=0.28.0
  Downloading murmurhash-1.0.6-cp38-cp38-win_amd64.whl (21 kB)
Collecting srsly<3.0.0,>=2.4.1
  Downloading srsly-2.4.2-cp38-cp38-win_amd64.whl (452 kB)
Collecting cymem<2.1.0,>=2.0.2
  Downloading cymem-2.0.6-cp38-cp38-win_amd64.whl (36 kB)
Collecting wasabi<1.1.0,>=0.8.1
  Using cached wasabi-0.8.2-py3-none-any.whl (23 kB)
Collecting blis<0.8.0,>=0.4.0
  Downloading blis-0.7.5-cp38-cp38-win_amd64.whl (6.6 MB)
Collecting pathy>=0.3.5
  Downloading pathy-0.6.1-py3-none-any.whl (42 kB)
Collecting requests<3.0.0,>=2.13.0
  Using cached requests-2.26.0-py2.py3-none-any.whl (62 kB)
Collecting tqdm<5.0.0,>=4.38.0
  Using cached tqdm-4.62.3-py2.py3-none-any.whl (76 kB)
Collecting pydantic!=1.8,!=1.8.1,<1.9.0,>=

In [3]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.1.0/en_core_web_sm-3.1.0-py3-none-any.whl (13.6 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.1.0
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


## Importing Modules

In [26]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation

In [5]:
list(STOP_WORDS)

['beside',
 'empty',
 'about',
 'must',
 'further',
 'is',
 'from',
 'thence',
 'both',
 'last',
 'least',
 'across',
 'will',
 'serious',
 'and',
 'would',
 'us',
 'no',
 'were',
 'we',
 'fifty',
 'that',
 'less',
 'keep',
 'same',
 'such',
 'as',
 'thru',
 'else',
 'everything',
 'up',
 'full',
 'our',
 'could',
 'anyway',
 'an',
 'once',
 'cannot',
 'fifteen',
 'five',
 'should',
 'top',
 'whereupon',
 'beforehand',
 'someone',
 'whereas',
 'themselves',
 'may',
 'noone',
 'regarding',
 'after',
 '‘ve',
 'two',
 'each',
 'several',
 'therefore',
 'whereby',
 'sometime',
 'twenty',
 'what',
 'only',
 'among',
 'other',
 'under',
 'of',
 'again',
 'together',
 'between',
 'none',
 'most',
 'anywhere',
 'it',
 'sixty',
 'still',
 'but',
 '‘m',
 'done',
 'do',
 'there',
 'much',
 'by',
 'just',
 'nine',
 'not',
 '’s',
 'everyone',
 'have',
 'meanwhile',
 'on',
 'beyond',
 'take',
 'third',
 'see',
 'during',
 "'s",
 'throughout',
 'indeed',
 'why',
 'almost',
 'before',
 'all',
 'really

In [6]:
stopwords = list(STOP_WORDS)

In [7]:
nlp = spacy.load('en_core_web_sm')

In [22]:
doc = nlp(text)

In [23]:
tokens = [token.text for token in doc]
print(tokens)

['\n', 'This', 'article', 'is', 'about', 'natural', 'language', 'processing', 'done', 'by', 'computers', '.', 'For', 'the', 'natural', 'language', 'processing', 'done', 'by', 'the', 'human', 'brain', ',', 'see', 'Language', 'processing', 'in', 'the', 'brain', '.', 'Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'subfield', 'of', 'linguistics', ',', 'computer', 'science', ',', 'and', 'artificial', 'intelligence', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', 'language', ',', 'in', 'particular', 'how', 'to', 'program', 'computers', 'to', 'process', 'and', 'analyze', 'large', 'amounts', 'of', 'natural', 'language', 'data', '.', 'The', 'goal', 'is', 'a', 'computer', 'capable', 'of', '"', 'understanding', '"', 'the', 'contents', 'of', 'documents', ',', 'including', 'the', 'contextual', 'nuances', 'of', 'the', 'language', 'within', 'them', '.', 'The', 'technology', 'can', 'then', 'accurately', 'extract', 'information', 'and', 'insights',

In [27]:
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [28]:
punctuation = punctuation + '\n'
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~\n'

In [29]:
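word_frequencies = {}
for word in doc:
    # Skip stop words and punctuation (both compared in lower case)
    if word.text.lower() not in stopwords:
        if word.text.lower() not in punctuation:
            # Keys keep the original casing, so 'Language' and 'language'
            # are counted as two separate entries
            if word.text not in word_frequencies.keys():
                word_frequencies[word.text] = 1
            else:
                word_frequencies[word.text] += 1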
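
The loop above keys the table on `word.text`, so differently cased forms of the same word get separate counts. A possible variation, shown here only as a sketch, folds case before counting with `collections.Counter`. The helper name `count_words` and the `toy_stop_words` set are hypothetical, and the tokens are plain strings rather than spaCy tokens.

```python
from collections import Counter
from string import punctuation


def count_words(tokens, stop_words):
    """Count lower-cased tokens, skipping stop words and punctuation-only tokens."""
    skip_chars = set(punctuation + "\n")
    counts = Counter()
    for token in tokens:
        key = token.lower()
        if key in stop_words:
            continue
        # A token made only of punctuation or newline characters carries no content.
        if all(ch in skip_chars for ch in key):
            continue
        counts[key] += 1
    return counts


tokens = ["Natural", "language", "processing", "(", "NLP", ")",
          "is", "about", "Language", "."]
toy_stop_words = {"is", "about"}  # hypothetical stand-in for spaCy's STOP_WORDS
print(count_words(tokens, toy_stop_words))
```

In the notebook, the same function could be fed `[token.text for token in doc]` together with `stopwords`.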

In [30]:
print(word_frequencies)

{'article': 1, 'natural': 6, 'language': 12, 'processing': 8, 'computers': 3, 'human': 2, 'brain': 2, 'Language': 1, 'Natural': 1, 'NLP': 2, 'subfield': 1, 'linguistics': 3, 'computer': 2, 'science': 1, 'artificial': 1, 'intelligence': 1, 'concerned': 1, 'interactions': 1, 'particular': 1, 'program': 1, 'process': 1, 'analyze': 1, 'large': 1, 'amounts': 1, 'data': 1, 'goal': 1, 'capable': 1, 'understanding': 1, 'contents': 1, 'documents': 3, 'including': 1, 'contextual': 1, 'nuances': 1, 'technology': 1, 'accurately': 1, 'extract': 1, 'information': 1, 'insights': 1, 'contained': 1, 'categorize': 1, 'organize': 1, 'major': 1, 'drawback': 1, 'statistical': 4, 'methods': 2, 'require': 1, 'elaborate': 1, 'feature': 1, 'engineering': 1, '2015,[19': 1, 'field': 1, 'largely': 1, 'abandoned': 1, 'shifted': 1, 'neural': 4, 'networks': 1, 'machine': 6, 'learning': 5, 'Popular': 1, 'techniques': 1, 'include': 1, 'use': 2, 'word': 2, 'embeddings': 1, 'capture': 1, 'semantic': 1, 'properties': 1, 

In [31]:
max_frequency = max(word_frequencies.values())

In [32]:
max_frequency

12

In [33]:
for word in word_frequencies.keys():
    word_frequencies[word] = word_frequencies[word] / max_frequency
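
Because the cell above overwrites `word_frequencies` in place, running it a second time divides the already normalized values by `max_frequency` again. One way to avoid that, sketched here with a hypothetical helper named `normalize`, is to build a new dictionary and leave the raw counts untouched. The sample counts are taken from the table printed earlier.

```python
def normalize(counts):
    """Return a new dict with every count divided by the largest count."""
    if not counts:
        return {}
    peak = max(counts.values())
    return {word: count / peak for word, count in counts.items()}


raw = {"language": 12, "processing": 8, "natural": 6, "computers": 3}
weights = normalize(raw)
print(weights["processing"])  # → 0.6666666666666666
print(raw["language"])        # → 12, the raw counts are unchanged
```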

In [34]:
print(word_frequencies)

{'article': 0.08333333333333333, 'natural': 0.5, 'language': 1.0, 'processing': 0.6666666666666666, 'computers': 0.25, 'human': 0.16666666666666666, 'brain': 0.16666666666666666, 'Language': 0.08333333333333333, 'Natural': 0.08333333333333333, 'NLP': 0.16666666666666666, 'subfield': 0.08333333333333333, 'linguistics': 0.25, 'computer': 0.16666666666666666, 'science': 0.08333333333333333, 'artificial': 0.08333333333333333, 'intelligence': 0.08333333333333333, 'concerned': 0.08333333333333333, 'interactions': 0.08333333333333333, 'particular': 0.08333333333333333, 'program': 0.08333333333333333, 'process': 0.08333333333333333, 'analyze': 0.08333333333333333, 'large': 0.08333333333333333, 'amounts': 0.08333333333333333, 'data': 0.08333333333333333, 'goal': 0.08333333333333333, 'capable': 0.08333333333333333, 'understanding': 0.08333333333333333, 'contents': 0.08333333333333333, 'documents': 0.25, 'including': 0.08333333333333333, 'contextual': 0.08333333333333333, 'nuances': 0.08333333333

In [35]:
sentence_tokens = [sent for sent in doc.sents]
print(sentence_tokens)

[
, This article is about natural language processing done by computers., For the natural language processing done by the human brain, see Language processing in the brain., Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data., The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them., The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves., 
, A major drawback of statistical methods is that they require elaborate feature engineering., Since 2015,[19] the field has thus largely abandoned statistical methods and shifted to neural networks for machine learning., Popular techniques incl

In [36]:
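sentence_scores = {}
for sent in sentence_tokens:
    for word in sent:
        # The lookup uses the lower-cased token, so table keys stored with
        # capitals (e.g. 'NLP') are never matched here
        if word.text.lower() in word_frequencies.keys():
            # A sentence score is the sum of the normalized frequencies of its words
            if sent not in sentence_scores.keys():
                sentence_scores[sent] = word_frequencies[word.text.lower()]
            else:
                sentence_scores[sent] += word_frequencies[word.text.lower()]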
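
The scoring rule in the loop above (sum the weights of every word that appears in the table) can be written more compactly with `dict.get`. The sketch below uses plain lists of strings instead of spaCy spans and a hypothetical `score_sentences` helper; the weights are made up for illustration.

```python
def score_sentences(sentences, weights):
    """Map each sentence index to the sum of the weights of its words."""
    scores = {}
    for index, sentence_words in enumerate(sentences):
        total = sum(weights.get(word.lower(), 0.0) for word in sentence_words)
        # Like the loop above, sentences with no known word get no entry.
        if total > 0:
            scores[index] = total
    return scores


weights = {"language": 1.0, "processing": 0.5, "brain": 0.25}
sentences = [
    ["Language", "processing", "in", "the", "brain", "."],
    ["The", "sky", "is", "blue", "."],
]
print(score_sentences(sentences, weights))  # → {0: 1.75}
```

Since the score is a plain sum, longer sentences tend to score higher, which is consistent with the longest sentence ranking first in the output below.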

In [37]:
sentence_scores

{This article is about natural language processing done by computers.: 2.5,
 For the natural language processing done by the human brain, see Language processing in the brain.: 4.333333333333333,
 Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.: 6.833333333333331,
 The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them.: 2.0,
 The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.: 1.1666666666666667,
 A major drawback of statistical methods is that they require elaborate feature engineering.: 1.0,
 Since 2015,[19] the field has thus largely abandoned statistical methods and 

In [38]:
from heapq import nlargest

In [39]:
select_length = int(len(sentence_tokens) * 0.3)
select_length

4

In [40]:
summary = nlargest(select_length, sentence_scores, key = sentence_scores.get)
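
`nlargest` returns the selected sentences ordered by score, highest first, so the summary does not necessarily follow the order of the original text. If document order matters, one option is to select by position and sort afterwards. The sketch below assumes scores keyed by sentence index, and the helper name `pick_in_document_order` is hypothetical.

```python
from heapq import nlargest


def pick_in_document_order(scores, count):
    """Pick the `count` best-scoring sentence indices, in ascending (document) order."""
    best = nlargest(count, scores, key=scores.get)
    return sorted(best)


scores = {0: 4.3, 1: 1.0, 2: 6.8}
print(nlargest(2, scores, key=scores.get))  # → [2, 0], ordered by score
print(pick_in_document_order(scores, 2))    # → [0, 2], ordered by position
```

With spaCy `Span` objects as keys, sorting the selected sentences by `sent.start` should have the same effect.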

In [41]:
summary

[Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.,
 For instance, the term neural machine translation (NMT) emphasizes the fact that deep learning-based approaches to machine translation directly learn sequence-to-sequence transformations, obviating the need for intermediate steps such as word alignment and language modeling that was used in statistical machine translation (SMT).,
 Starting in the late 1980s, however, there was a revolution in natural language processing with the introduction of machine learning algorithms for language processing.,
 In some areas, this shift has entailed substantial changes in how NLP systems are designed, such that deep neural network-based approaches may be viewed as a new paradigm distinct from statistical natural langua

In [43]:
final_summary = [sent.text for sent in summary]

In [44]:
final_summary

['Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.',
 'For instance, the term neural machine translation (NMT) emphasizes the fact that deep learning-based approaches to machine translation directly learn sequence-to-sequence transformations, obviating the need for intermediate steps such as word alignment and language modeling that was used in statistical machine translation (SMT).',
 'Starting in the late 1980s, however, there was a revolution in natural language processing with the introduction of machine learning algorithms for language processing.',
 'In some areas, this shift has entailed substantial changes in how NLP systems are designed, such that deep neural network-based approaches may be viewed as a new paradigm distinct from statistical natural

In [45]:
summary = ' '.join(final_summary)

In [46]:
summary

'Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. For instance, the term neural machine translation (NMT) emphasizes the fact that deep learning-based approaches to machine translation directly learn sequence-to-sequence transformations, obviating the need for intermediate steps such as word alignment and language modeling that was used in statistical machine translation (SMT). Starting in the late 1980s, however, there was a revolution in natural language processing with the introduction of machine learning algorithms for language processing. In some areas, this shift has entailed substantial changes in how NLP systems are designed, such that deep neural network-based approaches may be viewed as a new paradigm distinct from statistical natural language pro

In [47]:
print(summary)

Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. For instance, the term neural machine translation (NMT) emphasizes the fact that deep learning-based approaches to machine translation directly learn sequence-to-sequence transformations, obviating the need for intermediate steps such as word alignment and language modeling that was used in statistical machine translation (SMT). Starting in the late 1980s, however, there was a revolution in natural language processing with the introduction of machine learning algorithms for language processing. In some areas, this shift has entailed substantial changes in how NLP systems are designed, such that deep neural network-based approaches may be viewed as a new paradigm distinct from statistical natural language proc

In [48]:
print(text)


This article is about natural language processing done by computers. For the natural language processing done by the human brain, see Language processing in the brain. Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.
A major drawback of statistical methods is that they require elaborate feature engineering. Since 2015,[19] the field has thus largely abandoned statistical methods and shifted to neural networks for machine learning. Popular techniques include the use o

In [49]:
print(len(text))
print(len(summary))

2544
1007


In this notebook I successfully summarized an article from [Wikipedia](https://en.wikipedia.org/wiki/Natural_language_processing). I followed the KGP Talkie video titled `NLP Tutorial 12 - Text Summarization using NLP`.