How to do text summarization?
1. Data Cleaning
2. Word Tokenization
3. Word Frequency Table
4. Sentence Tokenization
5. Summarization
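
As a rough orientation, the five steps above can be sketched in plain Python before bringing in spaCy. This is only an illustration under stated assumptions: the `summarize` helper, the tiny `STOP` set, and the regex-based tokenization are hypothetical stand-ins, not what the notebook uses below.

```python
import re
from collections import Counter
from heapq import nlargest

# Hypothetical, tiny stop-word list for illustration only
STOP = {"the", "a", "an", "is", "are", "of", "on", "to", "and", "in", "it"}

def summarize(text, ratio=0.3):
    # Sentence tokenization (naive): split after ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word tokenization and cleaning: lowercase words, drop stop words
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP]
    if not words:
        return ""
    # Word frequency table, normalized by the maximum count
    freq = Counter(words)
    top = max(freq.values())
    weights = {w: c / top for w, c in freq.items()}
    # Score each sentence as the sum of the weights of its words
    scores = {
        i: sum(weights.get(w, 0.0) for w in re.findall(r"[a-z']+", s.lower()))
        for i, s in enumerate(sentences)
    }
    # Keep the highest-scoring sentences (at least one), in document order
    k = max(1, int(len(sentences) * ratio))
    best = nlargest(k, scores, key=scores.get)
    return " ".join(sentences[i] for i in sorted(best))

demo = "Planes are crowded. Seats on planes keep shrinking. Nobody likes delays."
print(summarize(demo))
```

The regex sentence splitter is far cruder than a real sentence segmenter, so this sketch is meant only to show how the steps connect.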

In [None]:
text = """
Ever noticed how plane seats appear to be getting smaller and smaller? With increasing numbers of people taking to the skies, some experts are questioning if having such packed out planes is putting passengers at risk. They say that the shrinking space on aeroplanes is not only uncomfortable - it's putting our health and safety in danger. More than squabbling over the arm rest, shrinking space on planes putting our health and safety in danger? This week, a U.S consumer advisory group set up by the Department of Transportation said at a public hearing that while the government is happy to set standards for animals flying on planes, it doesn't stipulate a minimum amount of space for humans. 'In a world where animals have more rights to space and food than humans,' said Charlie Leocha, consumer representative on the committee. 'It is time that the DOT and FAA take a stand for humane treatment of passengers.' But could crowding on planes lead to more serious issues than fighting for space in the overhead lockers, crashing elbows and seat back kicking? Tests conducted by the FAA use planes with a 31 inch pitch, a standard which on some airlines has decreased . Many economy seats on United Airlines have 30 inches of room, while some airlines offer as little as 28 inches . Cynthia Corbertt, a human factors researcher with the Federal Aviation Administration, that it conducts tests on how quickly passengers can leave a plane. But these tests are conducted using planes with 31 inches between each row of seats, a standard which on some airlines has decreased, reported the Detroit News. The distance between two seats from one point on a seat to the same point on the seat behind it is known as the pitch. While most airlines stick to a pitch of 31 inches or above, some fall below this. While United Airlines has 30 inches of space, Gulf Air economy seats have between 29 and 32 inches, Air Asia offers 29 inches and Spirit Airlines offers just 28 inches. 
British Airways has a seat pitch of 31 inches, while easyJet has 29 inches, Thomson's short haul seat pitch is 28 inches, and Virgin Atlantic's is 30-31.
"""

## Starting with the spaCy library

In [None]:
!pip install -U spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.6.0/en_core_web_sm-3.6.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 49.8 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


## Importing the required libraries

In [None]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from heapq import nlargest

## 1. Data Cleaning

In [None]:
stopwords = list(STOP_WORDS)
stopwords

['every',
 'many',
 'whereafter',
 'hundred',
 'and',
 'beforehand',
 'everywhere',
 'part',
 'fifteen',
 'very',
 'not',
 'down',
 'if',
 'thus',
 'used',
 'neither',
 'thru',
 'indeed',
 'whereupon',
 'her',
 'move',
 'for',
 "'ve",
 'regarding',
 'your',
 'themselves',
 'noone',
 'often',
 'whether',
 'twelve',
 'because',
 'we',
 'himself',
 'their',
 'cannot',
 '’m',
 'three',
 'against',
 'made',
 'across',
 'as',
 'bottom',
 'he',
 'than',
 'third',
 'whatever',
 'other',
 'how',
 'had',
 'below',
 'was',
 'or',
 'beyond',
 'although',
 'quite',
 'whither',
 'after',
 'whereby',
 'nevertheless',
 'due',
 'using',
 'without',
 'several',
 'hereafter',
 'just',
 'yours',
 'also',
 'meanwhile',
 'his',
 'nowhere',
 'from',
 'a',
 'hereupon',
 'what',
 'more',
 'beside',
 'where',
 'whenever',
 'herself',
 '‘ve',
 'top',
 'besides',
 'while',
 'forty',
 "'s",
 'may',
 '’s',
 'somewhere',
 'of',
 'sometimes',
 '‘ll',
 'yet',
 'its',
 'therefore',
 '’ll',
 'whence',
 'around',
 '’ve',

In [None]:
nlp = spacy.load('en_core_web_sm')

In [None]:
doc = nlp(text)

## 2. Word Tokenization

In [None]:
tokens = [token.text for token in doc]
print(tokens)

['\n', 'Ever', 'noticed', 'how', 'plane', 'seats', 'appear', 'to', 'be', 'getting', 'smaller', 'and', 'smaller', '?', 'With', 'increasing', 'numbers', 'of', 'people', 'taking', 'to', 'the', 'skies', ',', 'some', 'experts', 'are', 'questioning', 'if', 'having', 'such', 'packed', 'out', 'planes', 'is', 'putting', 'passengers', 'at', 'risk', '.', 'They', 'say', 'that', 'the', 'shrinking', 'space', 'on', 'aeroplanes', 'is', 'not', 'only', 'uncomfortable', '-', 'it', "'s", 'putting', 'our', 'health', 'and', 'safety', 'in', 'danger', '.', 'More', 'than', 'squabbling', 'over', 'the', 'arm', 'rest', ',', 'shrinking', 'space', 'on', 'planes', 'putting', 'our', 'health', 'and', 'safety', 'in', 'danger', '?', 'This', 'week', ',', 'a', 'U.S', 'consumer', 'advisory', 'group', 'set', 'up', 'by', 'the', 'Department', 'of', 'Transportation', 'said', 'at', 'a', 'public', 'hearing', 'that', 'while', 'the', 'government', 'is', 'happy', 'to', 'set', 'standards', 'for', 'animals', 'flying', 'on', 'planes',

In [None]:
punctuation = punctuation + '\n'
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~\n'
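
One thing to keep in mind: because `punctuation` is a string, `x not in punctuation` is a substring test, not a per-character test. The sketch below shows the difference and one possible alternative based on a set of characters. The names `punct_chars` and `is_punct` are hypothetical helpers introduced for illustration.

```python
from string import punctuation

# Set of single punctuation characters, plus the newline added above
punct_chars = set(punctuation + "\n")

def is_punct(token_text):
    # True only when every character of the token is a punctuation character
    return bool(token_text) and all(ch in punct_chars for ch in token_text)

# "..." is not a contiguous substring of string.punctuation,
# so a substring test would not treat it as punctuation
print("..." in punctuation)
# The per-character test does treat it as punctuation
print(is_punct("..."))
# A token that mixes letters and punctuation is kept as a word
print(is_punct("U.S"))
```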

## 3. Word Frequency Table

In [None]:
word_frequencies = {}
for word in doc:
  # Skip stop words and punctuation (both compared in lowercase)
  if word.text.lower() not in stopwords and word.text.lower() not in punctuation:
    # Count the token; keys keep the token's original casing (word.text)
    word_frequencies[word.text] = word_frequencies.get(word.text, 0) + 1

In [None]:
print(word_frequencies)

{'noticed': 1, 'plane': 2, 'seats': 5, 'appear': 1, 'getting': 1, 'smaller': 2, 'increasing': 1, 'numbers': 1, 'people': 1, 'taking': 1, 'skies': 1, 'experts': 1, 'questioning': 1, 'having': 1, 'packed': 1, 'planes': 6, 'putting': 3, 'passengers': 3, 'risk': 1, 'shrinking': 2, 'space': 6, 'aeroplanes': 1, 'uncomfortable': 1, 'health': 2, 'safety': 2, 'danger': 2, 'squabbling': 1, 'arm': 1, 'rest': 1, 'week': 1, 'U.S': 1, 'consumer': 2, 'advisory': 1, 'group': 1, 'set': 2, 'Department': 1, 'Transportation': 1, 'said': 2, 'public': 1, 'hearing': 1, 'government': 1, 'happy': 1, 'standards': 1, 'animals': 2, 'flying': 1, 'stipulate': 1, 'minimum': 1, 'humans': 2, 'world': 1, 'rights': 1, 'food': 1, 'Charlie': 1, 'Leocha': 1, 'representative': 1, 'committee': 1, '\xa0': 1, 'time': 1, 'DOT': 1, 'FAA': 2, 'stand': 1, 'humane': 1, 'treatment': 1, 'crowding': 1, 'lead': 1, 'issues': 1, 'fighting': 1, 'overhead': 1, 'lockers': 1, 'crashing': 1, 'elbows': 1, 'seat': 5, 'kicking': 1, 'Tests': 1, '
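
The table above keeps the original casing in its keys (for example `'Tests'` and `'FAA'`), while the scoring step later looks words up by `word.text.lower()`. A possible variant, sketched here on plain strings, lowercases the keys so that both steps agree. `count_words` and the sample `tokens` list are hypothetical and stand in for the spaCy tokens.

```python
from collections import Counter

def count_words(tokens, stopwords, punct_chars):
    # Lowercase each token so 'Tests' and 'tests' share one entry
    lowered = (t.lower() for t in tokens)
    return Counter(
        t for t in lowered
        # Drop stop words and tokens made up only of punctuation characters
        if t not in stopwords and not all(ch in punct_chars for ch in t)
    )

# Hypothetical token list standing in for [token.text for token in doc]
tokens = ["Tests", "on", "planes", ",", "tests", "on", "Planes", "."]
counts = count_words(tokens, stopwords={"on"}, punct_chars=set(",."))
print(counts)
```

With this variant the counts, and therefore the normalized values and sentence scores, would differ from the outputs shown in this notebook.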

In [None]:
max_frequency = max(word_frequencies.values())

In [None]:
max_frequency

11

## Frequency normalization
Divide every value by max_frequency (here, 11), so that the most frequent word gets a weight of 1.

In [None]:
for word in word_frequencies.keys():
  word_frequencies[word] = word_frequencies[word]/max_frequency
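
As a small worked example of the same normalization, here is a sketch on a hand-made dictionary. The `normalize` helper and the `"top_word"` entry are hypothetical; only the count of 5 for `'seats'` and the maximum of 11 come from the outputs above.

```python
def normalize(counts):
    # Divide every count by the largest count so values fall in (0, 1]
    max_count = max(counts.values())
    return {word: count / max_count for word, count in counts.items()}

# 'seats' appears 5 times; "top_word" is a placeholder for the word seen 11 times
sample = {"seats": 5, "noticed": 1, "top_word": 11}
normalized = normalize(sample)
print(normalized["seats"])  # 5 / 11
```

Unlike the in-place loop above, this version returns a new dictionary and leaves the raw counts untouched, which can be convenient when the counts are needed again later.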

In [None]:
word_frequencies

{'noticed': 0.09090909090909091,
 'plane': 0.18181818181818182,
 'seats': 0.45454545454545453,
 'appear': 0.09090909090909091,
 'getting': 0.09090909090909091,
 'smaller': 0.18181818181818182,
 'increasing': 0.09090909090909091,
 'numbers': 0.09090909090909091,
 'people': 0.09090909090909091,
 'taking': 0.09090909090909091,
 'skies': 0.09090909090909091,
 'experts': 0.09090909090909091,
 'questioning': 0.09090909090909091,
 'having': 0.09090909090909091,
 'packed': 0.09090909090909091,
 'planes': 0.5454545454545454,
 'putting': 0.2727272727272727,
 'passengers': 0.2727272727272727,
 'risk': 0.09090909090909091,
 'shrinking': 0.18181818181818182,
 'space': 0.5454545454545454,
 'aeroplanes': 0.09090909090909091,
 'uncomfortable': 0.09090909090909091,
 'health': 0.18181818181818182,
 'safety': 0.18181818181818182,
 'danger': 0.18181818181818182,
 'squabbling': 0.09090909090909091,
 'arm': 0.09090909090909091,
 'rest': 0.09090909090909091,
 'week': 0.09090909090909091,
 'U.S': 0.0909090909

## 4. Sentence Tokenization
Output: a list of sentences.

In [None]:
sentence_tokens = list(doc.sents)
print(sentence_tokens)

[
Ever noticed how plane seats appear to be getting smaller and smaller?, With increasing numbers of people taking to the skies, some experts are questioning if having such packed out planes is putting passengers at risk., They say that the shrinking space on aeroplanes is not only uncomfortable - it's putting our health and safety in danger., More than squabbling over the arm rest, shrinking space on planes putting our health and safety in danger?, This week, a U.S consumer advisory group set up by the Department of Transportation said at a public hearing that while the government is happy to set standards for animals flying on planes, it doesn't stipulate a minimum amount of space for humans., 'In a world where animals have more rights to space and food than humans,' said Charlie Leocha, consumer representative on the committee. , 'It is time that the DOT and FAA take a stand for humane treatment of passengers.', But could crowding on planes lead to more serious issues than fighting 

In [None]:
sentence_scores = {}
for sent in sentence_tokens:
  for word in sent:
    # Look up each token by its lowercase form in the normalized frequency table
    key = word.text.lower()
    if key in word_frequencies:
      # Accumulate the sentence score, starting from 0 for a new sentence
      sentence_scores[sent] = sentence_scores.get(sent, 0) + word_frequencies[key]
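
The scoring rule itself can be sketched without spaCy: each sentence receives the sum of the weights of the words it contains. In this illustration `score_sentences` is a hypothetical helper, sentences are plain lists of strings, and the weights are made up.

```python
def score_sentences(sentences, weights):
    # sentences: list of token lists; weights: lowercase word -> normalized frequency
    scores = {}
    for idx, words in enumerate(sentences):
        total = sum(weights.get(w.lower(), 0.0) for w in words)
        # Like the loop above, only sentences with at least one known word get a score
        if total > 0:
            scores[idx] = total
    return scores

# Made-up weights and sentences for illustration
weights = {"planes": 1.0, "seats": 0.5}
sentences = [["Planes", "are", "full"], ["Seats", "on", "planes", "shrink"], ["Hello"]]
scores = score_sentences(sentences, weights)
print(scores)
```

Because the score is a plain sum, longer sentences tend to score higher simply by containing more words.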

In [None]:
sentence_scores

{
 Ever noticed how plane seats appear to be getting smaller and smaller?: 1.272727272727273,
 With increasing numbers of people taking to the skies, some experts are questioning if having such packed out planes is putting passengers at risk.: 2.0,
 They say that the shrinking space on aeroplanes is not only uncomfortable - it's putting our health and safety in danger.: 1.7272727272727275,
 More than squabbling over the arm rest, shrinking space on planes putting our health and safety in danger?: 2.3636363636363633,
 This week, a U.S consumer advisory group set up by the Department of Transportation said at a public hearing that while the government is happy to set standards for animals flying on planes, it doesn't stipulate a minimum amount of space for humans.: 3.181818181818181,
 'In a world where animals have more rights to space and food than humans,' said Charlie Leocha, consumer representative on the committee. : 1.8181818181818181,
 'It is time that the DOT and FAA take a stand

## 5. Summarization


In [None]:
select_length = int(len(sentence_tokens)*0.3)
select_length

4

In [None]:
summary = nlargest(select_length, sentence_scores, key = sentence_scores.get)
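
Note that `nlargest` returns items in descending score order, not in the order they appear in the text. If reading order matters, one option is to select by position and sort the positions afterwards. The sketch below assumes scores keyed by sentence index; the sample sentences and scores are made up.

```python
from heapq import nlargest

# Made-up sentences and scores, keyed by position in the document
sentences = ["First sentence.", "Second sentence.", "Third sentence.", "Fourth sentence."]
scores = {0: 1.7, 1: 0.4, 2: 0.9, 3: 2.1}

# Highest-scoring positions first
top = nlargest(2, scores, key=scores.get)
# Restore document order before joining, with a space between sentences
ordered = sorted(top)
summary_text = " ".join(sentences[i] for i in ordered)
print(summary_text)
```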

In [None]:
summary

[While United Airlines has 30 inches of space, Gulf Air economy seats have between 29 and 32 inches, Air Asia offers 29 inches and Spirit Airlines offers just 28 inches.,
 British Airways has a seat pitch of 31 inches, while easyJet has 29 inches, Thomson's short haul seat pitch is 28 inches, and Virgin Atlantic's is 30-31.,
 Many economy seats on United Airlines have 30 inches of room, while some airlines offer as little as 28 inches .,
 But these tests are conducted using planes with 31 inches between each row of seats, a standard which on some airlines has decreased, reported the Detroit News.]

In [None]:
# Convert each selected sentence (a spaCy Span) to text and concatenate them
final_summary = [sent.text for sent in summary]
summary = ''.join(final_summary)

In [None]:
summary

"While United Airlines has 30 inches of space, Gulf Air economy seats have between 29 and 32 inches, Air Asia offers 29 inches and Spirit Airlines offers just 28 inches.British Airways has a seat pitch of 31 inches, while easyJet has 29 inches, Thomson's short haul seat pitch is 28 inches, and Virgin Atlantic's is 30-31.\nMany economy seats on United Airlines have 30 inches of room, while some airlines offer as little as 28 inches .But these tests are conducted using planes with 31 inches between each row of seats, a standard which on some airlines has decreased, reported the Detroit News."

In [None]:
len(text)

2128

In [None]:
len(summary)

594
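
The two lengths can be turned into a compression ratio. This is a minimal sketch that hard-codes the character counts reported above rather than recomputing them.

```python
text_length = 2128     # len(text) reported above
summary_length = 594   # len(summary) reported above

# Share of the original text (in characters) kept in the summary
ratio = summary_length / text_length
print(f"{ratio:.1%}")
```

The result is close to the 30% of sentences requested through `select_length`, although that setting counts sentences, not characters.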