# Text Summarization using different methods

https://www.machinelearningplus.com/nlp/text-summarization-approaches-nlp-example/

Application :-

https://womencourage.acm.org/archive/2014/5_TechTalk_AutomaticTextSummarization.pdf

1) News Summarrization
2) NewsLetter from whole story.
3) Google search on basis on entity. Like shows summary in link
4) KT Internal document.
5) Financial research
6) Short description from complaint and request.
7) Legal contract analysis
8) Social media marketing
9) Email overload
10) Science and R&D
11) Help desk and customer support
12) 

#### Gensim Method 

gensim is a very handy python library for performing NLP tasks. The text summarization process using gensim library is based on TextRank Algorithm

TextRank is an extractive summarization technique. It is based on the concept that words which occur more frequently are significant. Hence , the sentences containing highly frequent words are important .

Based on this , the algorithm assigns scores to each sentence in the text . The top-ranked sentences make it to the summary.

In [None]:
# Importing package and summarizer
import gensim
from gensim.summarization import summarize

In [None]:
original_text = 'Junk foods taste good that’s why it is mostly liked by everyone of any age group especially kids and school going children. They generally ask for the junk food daily because they have been trend so by their parents from the childhood. They never have been discussed by their parents about the harmful effects of junk foods over health. According to the research by scientists, it has been found that junk foods have negative effects on the health in many ways. They are generally fried food found in the market in the packets. They become high in calories, high in cholesterol, low in healthy nutrients, high in sodium mineral, high in sugar, starch, unhealthy fat, lack of protein and lack of dietary fibers. Processed and junk foods are the means of rapid and unhealthy weight gain and negatively impact the whole body throughout the life. It makes able a person to gain excessive weight which is called as obesity. Junk foods tastes good and looks good however do not fulfil the healthy calorie requirement of the body. Some of the foods like french fries, fried foods, pizza, burgers, candy, soft drinks, baked goods, ice cream, cookies, etc are the example of high-sugar and high-fat containing foods. It is found according to the Centres for Disease Control and Prevention that Kids and children eating junk food are more prone to the type-2 diabetes. In type-2 diabetes our body become unable to regulate blood sugar level. Risk of getting this disease is increasing as one become more obese or overweight. It increases the risk of kidney failure. Eating junk food daily lead us to the nutritional deficiencies in the body because it is lack of essential nutrients, vitamins, iron, minerals and dietary fibers. It increases risk of cardiovascular diseases because it is rich in saturated fat, sodium and bad cholesterol. High sodium and bad cholesterol diet increases blood pressure and overloads the heart functioning. One who like junk food develop more risk to put on extra weight and become fatter and unhealthier. Junk foods contain high level carbohydrate which spike blood sugar level and make person more lethargic, sleepy and less active and alert. Reflexes and senses of the people eating this food become dull day by day thus they live more sedentary life. Junk foods are the source of constipation and other disease like diabetes, heart ailments, clogged arteries, heart attack, strokes, etc because of being poor in nutrition. Junk food is the easiest way to gain unhealthy weight. The amount of fats and sugar in the food makes you gain weight rapidly. However, this is not a healthy weight. It is more of fats and cholesterol which will have a harmful impact on your health. Junk food is also one of the main reasons for the increase in obesity nowadays.This food only looks and tastes good, other than that, it has no positive points. The amount of calorie your body requires to stay fit is not fulfilled by this food. For instance, foods like French fries, burgers, candy, and cookies, all have high amounts of sugar and fats. Therefore, this can result in long-term illnesses like diabetes and high blood pressure. This may also result in kidney failure. Above all, you can get various nutritional deficiencies when you don’t consume the essential nutrients, vitamins, minerals and more. You become prone to cardiovascular diseases due to the consumption of bad cholesterol and fat plus sodium. In other words, all this interferes with the functioning of your heart. Furthermore, junk food contains a higher level of carbohydrates. It will instantly spike your blood sugar levels. This will result in lethargy, inactiveness, and sleepiness. A person reflex becomes dull overtime and they lead an inactive life. To make things worse, junk food also clogs your arteries and increases the risk of a heart attack. Therefore, it must be avoided at the first instance to save your life from becoming ruined.The main problem with junk food is that people don’t realize its ill effects now. When the time comes, it is too late. Most importantly, the issue is that it does not impact you instantly. It works on your overtime; you will face the consequences sooner or later. Thus, it is better to stop now.You can avoid junk food by encouraging your children from an early age to eat green vegetables. Their taste buds must be developed as such that they find healthy food tasty. Moreover, try to mix things up. Do not serve the same green vegetable daily in the same style. Incorporate different types of healthy food in their diet following different recipes. This will help them to try foods at home rather than being attracted to junk food.In short, do not deprive them completely of it as that will not help. Children will find one way or the other to have it. Make sure you give them junk food in limited quantities and at healthy periods of time.'

# Passing the text corpus to summarizer 
short_summary = summarize(original_text)
print(short_summary, "\n")

# Summarization by ratio
summary_by_ratio=summarize(original_text,ratio=0.1)
print(summary_by_ratio, "\n")

# Summarization by word count
summary_by_word_count=summarize(original_text,word_count=30)
print(summary_by_word_count)

# Summarization when both ratio & word count is given
summary=summarize(original_text, ratio=0.1, word_count=30) # in this case word count will be prefered and ratio will be ignored.
print(summary)

They become high in calories, high in cholesterol, low in healthy nutrients, high in sodium mineral, high in sugar, starch, unhealthy fat, lack of protein and lack of dietary fibers.
Processed and junk foods are the means of rapid and unhealthy weight gain and negatively impact the whole body throughout the life.
Junk foods tastes good and looks good however do not fulfil the healthy calorie requirement of the body.
It is found according to the Centres for Disease Control and Prevention that Kids and children eating junk food are more prone to the type-2 diabetes.
Eating junk food daily lead us to the nutritional deficiencies in the body because it is lack of essential nutrients, vitamins, iron, minerals and dietary fibers.
It increases risk of cardiovascular diseases because it is rich in saturated fat, sodium and bad cholesterol.
High sodium and bad cholesterol diet increases blood pressure and overloads the heart functioning.
One who like junk food develop more risk to put on extra 

#### Text Summarization with Sumy

A sentence which is similar to many other sentences of the text has a high probability of being important. The approach of LexRank is that a particular sentence is recommended by other similar sentences and hence is ranked higher.


In [None]:
# Installing and Importing sumy
# !pip install sumy
import sumy

In [None]:
# sumy.summarizers

import nltk; nltk.download('punkt')

# Importing the parser and tokenizer
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer

# Import the LexRank summarizer
from sumy.summarizers.lex_rank import LexRankSummarizer

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
# Initializing the parser
my_parser = PlaintextParser.from_string(original_text,Tokenizer('english'))

# Creating a summary of 3 sentences.
lex_rank_summarizer = LexRankSummarizer()
lexrank_summary = lex_rank_summarizer(my_parser.document,sentences_count=3)

# Printing the summary
for sentence in lexrank_summary:
  print(sentence)

It is found according to the Centres for Disease Control and Prevention that Kids and children eating junk food are more prone to the type-2 diabetes.
It is more of fats and cholesterol which will have a harmful impact on your health.
Children will find one way or the other to have it.


#### LSA (Latent semantic analysis)

Latent Semantic Analysis is a unsupervised learning algorithm that can be used for extractive text summarization.

It extracts semantically significant sentences by applying singular value decomposition(SVD) to the matrix of term-document frequency. To learn more about this algorithm, check out here

In [None]:
# Import the summarizer
from sumy.summarizers.lsa import LsaSummarizer

# creating the summarizer
lsa_summarizer=LsaSummarizer()
lsa_summary= lsa_summarizer(my_parser.document,3)

# Printing the summary
for sentence in lsa_summary:
    print(sentence)

Junk foods taste good that’s why it is mostly liked by everyone of any age group especially kids and school going children.
To make things worse, junk food also clogs your arteries and increases the risk of a heart attack.
Therefore, it must be avoided at the first instance to save your life from becoming ruined.The main problem with junk food is that people don’t realize its ill effects now.


#### Luhn

Luhn Summarization algorithm’s approach is based on TF-IDF (Term Frequency-Inverse Document Frequency). It is useful when very low frequent words as well as highly frequent words(stopwords) are both not significant.

Based on this, sentence scoring is carried out and the high ranking sentences make it to the summary.

In [None]:
# Import the summarizer
from sumy.summarizers.luhn import LuhnSummarizer

#  Creating the summarizer
luhn_summarizer=LuhnSummarizer()
luhn_summary=luhn_summarizer(my_parser.document,sentences_count=3)

# Printing the summary
for sentence in luhn_summary:
  print(sentence)

They become high in calories, high in cholesterol, low in healthy nutrients, high in sodium mineral, high in sugar, starch, unhealthy fat, lack of protein and lack of dietary fibers.
It is found according to the Centres for Disease Control and Prevention that Kids and children eating junk food are more prone to the type-2 diabetes.
Eating junk food daily lead us to the nutritional deficiencies in the body because it is lack of essential nutrients, vitamins, iron, minerals and dietary fibers.


KL-Sum algorithm

It selects sentences based on similarity of word distribution as the original text. It aims to lower the KL-divergence criteria (learn more). It uses greedy optimization approach and keeps adding sentences till the KL-divergence decreases.

In [None]:
from sumy.summarizers.kl import KLSummarizer

# Instantiating the  KLSummarizer
kl_summarizer=KLSummarizer()
kl_summary=kl_summarizer(my_parser.document,sentences_count=3)

# Printing the summary
for sentence in kl_summary:
    print(sentence)

It is found according to the Centres for Disease Control and Prevention that Kids and children eating junk food are more prone to the type-2 diabetes.
High sodium and bad cholesterol diet increases blood pressure and overloads the heart functioning.
Junk food is the easiest way to gain unhealthy weight.


#### What is Abstractive Text Summarization?

Abstractive summarization is the new state of art method, which generates new sentences that could best represent the whole text. This is better than extractive methods where sentences are just selected from original text for the summary.

How to easily implement abstractive summarization?

A simple and effective way is through the Huggingface’s transformers library.

#### Summarization with T5 Transformers

T5 is an encoder-decoder model. It converts all language problems into a text-to-text format.

First, you need to import the tokenizer and corresponding model through below command.

It is preferred to use T5ForConditionalGeneration model when the input and output are both sequences.

In [None]:
# !pip install transformers
# !pip install SentencePiece

Collecting transformers
[?25l  Downloading https://files.pythonhosted.org/packages/ed/d5/f4157a376b8a79489a76ce6cfe147f4f3be1e029b7144fa7b8432e8acb26/transformers-4.4.2-py3-none-any.whl (2.0MB)
[K     |████████████████████████████████| 2.0MB 7.6MB/s 
Collecting sacremoses
[?25l  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
[K     |████████████████████████████████| 890kB 34.6MB/s 
[?25hCollecting tokenizers<0.11,>=0.10.1
[?25l  Downloading https://files.pythonhosted.org/packages/71/23/2ddc317b2121117bf34dd00f5b0de194158f2a44ee2bf5e47c7166878a97/tokenizers-0.10.1-cp37-cp37m-manylinux2010_x86_64.whl (3.2MB)
[K     |████████████████████████████████| 3.2MB 53.5MB/s 
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... [?25l[?25hdone
  Created wheel for sacremoses: filename=sacremoses-0.0.43-cp37-none-any.whl size=893262 sha256=76cd6

In [None]:
# !pip install transformers
# !pip install SentencePiece

# Importing requirements
from transformers import T5Tokenizer, T5Config, T5ForConditionalGeneration

# Instantiating the model and tokenizer 
my_model = T5ForConditionalGeneration.from_pretrained('t5-small')
tokenizer = T5Tokenizer.from_pretrained('t5-small')
# print(tokenizer)
# Concatenating the word "summarize:" to raw text
text = "summarize:" + original_text

# encoding the input text
input_ids=tokenizer.encode(text, return_tensors='pt', max_length=512, truncation=True)

# The syntax will be: transformers.PreTrainedModel.generate (input_ids=None, max_length=None, min_length=None, num_beams=None)

# Generating summary ids
summary_ids = my_model.generate(input_ids)

# Decoding the tensor and printing the summary.
t5_summary = tokenizer.decode(summary_ids[0])
print(t5_summary)

<pad> junk food is the source of constipation and other diseases. it is rich in saturated


#### Summarization with BART Transformers
transformers library of HuggingFace supports summarization with BART models.

Import the model and tokenizer. For problems where there is need to generate sequences , it is preferred to use BartForConditionalGeneration model.

In [None]:
test_text = "Indian cricket players T Natarajan and Shardul Thakur recently took delivery of their respective new Mahindra Thar SUVs. The off-roader was gifted to them and four other crickets by Anand Mahindra, Chairman - Mahindra & Mahindra, as a goodwill gesture for their incredible performance in the India-Australia test tour earlier this year. The test series was India's first win in Australia since 1988 with the young Indian cricket team taking a 2-1 series victory. Apart from Natarajan and Thakur, the other players that will receive the Thar SUV include Mohammad Siraj, Washington Sundar, Shubman Gill and Navdeep Saini."

In [None]:
# Importing the model
from transformers import BartForConditionalGeneration, BartTokenizer, BartConfig

# Loading the model and tokenizer for bart-large-cnn

tokenizer=BartTokenizer.from_pretrained('facebook/bart-large-cnn')
model=BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')

# Encoding the inputs and passing them to model.generate()
inputs = tokenizer.batch_encode_plus([test_text],return_tensors='pt')
summary_ids = model.generate(inputs['input_ids'], early_stopping=True)

# Decoding and printing the summary
bart_summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(bart_summary)

Indian cricket players T Natarajan and Shardul Thakur recently took delivery of their respective new Mahindra Thar SUVs. The off-roader was gifted to them and four other crickets as a goodwill gesture. The test series was India's first win in Australia since 1988 with the young Indian cricket team taking a 2-1 series victory.


#### Summarization with GPT-2 Transformers
GPT-2 transformer is another major player in text summarization, introduced by OpenAI. Thanks to transformers, the process followed is same just like with BART Transformers.

First, you have to import the tokenizer and model. Make sure that you import a LM Head type model, as it is necessary to generate sequences. Next, load the pretrained gpt-2 model and tokenizer .

In [None]:
# Importing model and tokenizer
from transformers import GPT2Tokenizer,GPT2LMHeadModel

# Instantiating the model and tokenizer with gpt-2
tokenizer=GPT2Tokenizer.from_pretrained('gpt2')
model=GPT2LMHeadModel.from_pretrained('gpt2')

# Encoding text to get input ids & pass them to model.generate()
inputs=tokenizer.batch_encode_plus([test_text],return_tensors='pt', max_length=100, truncation=True)
summary_ids=model.generate(inputs['input_ids'],early_stopping=True)

# Decoding and printing summary

GPT_summary=tokenizer.decode(summary_ids[0],skip_special_tokens=True)
print(GPT_summary)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Input length of input_ids is 100, but ``max_length`` is set to 20.This can lead to unexpected behavior. You should consider increasing ``config.max_length`` or ``max_length``.


Indian cricket players T Natarajan and Shardul Thakur recently took delivery of their respective new Mahindra Thar SUVs. The off-roader was gifted to them and four other crickets by Anand Mahindra, Chairman - Mahindra & Mahindra, as a goodwill gesture for their incredible performance in the India-Australia test tour earlier this year. The test series was India's first win in Australia since 1988 with the young Indian cricket team taking a 2


# IN this case BERT performs well for Cricket text from news.

# Most transformers are unfortunately completely constrained, which is the case for BERT (512 tokens max).

# If you want to use transformers without being limited to a sequence length, you should take a look at Transformer-XL or XLNet.

#### Summarization with XLM Transformers
Another transformer type that could be used for summarization are XLM Transformers.

You can import the XLMWithLMHeadModel as it supports generation of sequences.You can load the pretrained xlm-mlm-en-2048 model and tokenizer with weights using from_pretrained() method.

The nexts steps are same as the last three cases. The encoded input text is passed to generate() function with returns id sequence for the summary. You can decode and print the summary.

In [None]:
# Importing model and tokenizer
from transformers import XLMWithLMHeadModel, XLMTokenizer

# Instantiating the model and tokenizer 
tokenizer=XLMTokenizer.from_pretrained('xlm-mlm-en-2048')
model=XLMWithLMHeadModel.from_pretrained('xlm-mlm-en-2048')

# Encoding text to get input ids & pass them to model.generate()
inputs=tokenizer.batch_encode_plus([original_text],return_tensors='pt',max_length=100, truncation=True)
summary_ids=model.generate(inputs['input_ids'],early_stopping=True)

# Decode and print the summary
XLM_summary=tokenizer.decode(summary_ids[0],skip_special_tokens=True)
print(XLM_summary)

Some weights of XLMWithLMHeadModel were not initialized from the model checkpoint at xlm-mlm-en-2048 and are newly initialized: ['transformer.position_ids']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Input length of input_ids is 100, but ``max_length`` is set to 20.This can lead to unexpected behavior. You should consider increasing ``config.max_length`` or ``max_length``.


junk foods taste good that's why it is mostly liked by everyone of any age group especially kids and school going children. they generally ask for the junk food daily because they have been trend so by their parents from the childhood. they never have been discussed by their parents about the harmful effects of junk foods over health. according to the research by scientists, it has been found that junk foods have negative effects on the health in many ways. they are generally fried food found in the market


# You can notice that the XLM_summary isn’t very good. It is because , even though it supports summaization , the model was not finetuned for this task.

# Till Now BERT performs best for summary generation.