# Abstractive text summarization

# Install packages

In [None]:
!pip install transformers



In [None]:
!pip install sentencepiece



# Import data
We will use this text as the input for all of the summarization models below.

In [None]:
# text to summarize
original_text = '''
This past week, two of Thailand's largest cities, Bangkok and Chiang Mai,
earned the ignominious privilege of being among the 10 cities of the world
with the worst air quality during that period. The Ministry of Public
Health has blamed air pollution for causing 200,000 hospital admissions
in the past week alone.
Air pollution is one of Thailand's largest killers, more than obesity,
smoking, and even Covid-19. It accounted for over 50,000 premature
deaths in 2021, reducing average life expectancy by two years. Further,
there is widespread public concern that air pollution will reduce one of the
country's main sources of income, tourism, in places like Chiang Mai.
However, despite these grim statistics, this year's air pollution menace and
the government's ham-fisted response seems like deja vu. During the first
few months of every year, the level of air pollution spikes to hazardous
levels and smog covers the skies.
Every year the government responds by proclaiming a ban on forest fires
(but inadequately enforcing this ban), asking people to wear masks and
stay indoors, spraying water (which doesn't do much), unhelpfully and
incorrectly blaming smallholder farmers, and expressing grave concern
about the problem.
But every year the government fails to address the underlying drivers of
the problem. Prime Minister Prayut Chan-o-cha has halted three draft
laws related to air pollution.
There was hope the recently-elected Bangkok governor Chadchart
Sittipunt would do more to address air pollution at least within Bangkok's
jurisdiction. Yet his actions so far have been limited.
The government's actions, however, are not surprising if you look at all of
the major parties' platforms for the upcoming election. None has
prioritised air pollution, called for wide-ranging reforms, or made air
pollution a major part of their campaign.
The Thailand Development Research Institute says that of their 87 major
policy promises, only three are environmentally-related. While more data
on the sources of pollution would be helpful, it is clear that what the
current government has been doing since 2014 (not to mention the actions
of previous governments) has not worked.
We know there are three major sources of air pollution in the country:
transport, industry, and agriculture, and that pollution is worse in winter
months when there is an upsurge in agricultural burning and a
temperature inversion resulting in less wind and rain to disperse
pollutants.
How much any of these three sources contributes to the total amount
varies from month to month and by location. For example, transport and
industry emissions comprise a much larger share of total emissions in
Bangkok than in Chiang Mai, where the vast majority (up to 90%) of
emissions come from agriculture.
Countries who have been able to reduce air pollution show us that while
this is a wickedly difficult problem to solve, it is not impossible and there
are policy solutions out there which could reduce pollutant levels and
improve health nationwide. So, let's look at each of three sources.
1. Transport: In Bangkok, vehicular emissions are high due to the presence
of many older, high-polluting vehicles, together with a drastic increase in
the number of cars in recent years. To reverse these trends, the
government could initiate something like the USA's "cash for clunkers"
programme, which provides incentives for citizens to replace older, more
polluting cars with newer, cleaner, and more fuel-efficient ones.
A number of cities not only have designated bus lanes but also switched
their fleets to new vehicles powered by electricity or natural gas. Both
policies could incentivise the public to use public buses more. Finally,
cities like Singapore and London were able to significantly reduce their air
pollution and traffic congestion by introducing congestion pricing
schemes.
2. Industry: Thailand has no emissions inventory database to record
industrial emissions, despite having around 140,000 polluting factories. A
head of a local NGO told me: "Since there are no emissions inventories
from factories, we're working blind." Further, in 2019, the National
Legislative Assembly revised the Factory Act 1992 so that only industrial
companies with more than 50 employees and machinery exceeding 50
horsepower are subject to monitoring for waste discharge and anti-
pollution measures, including air pollution.
Additionally, the authority to fine major polluters rests with the
Department of Industrial Works (DIW) under the Ministry of Industry but
this creates a conflict of interest since DIW's mandate is to expand
industrial growth without any curbs. Thailand needs a law that requires
polluting factories to disclose their emissions, such as United States'
Toxics Release Inventory and the European Pollutant Release and
Transfer Register. This new law would make factory permits for operation
dependent upon lowering their emissions.
3. Agriculture: While most hotspots of biomass burning that cause
pollution inside Thailand are in fact outside its boundaries, a large
percentage still occurs within Thailand, particularly stemming from maize,
sugarcane, and rice harvesting. Thai agribusinesses have a high degree of
culpability for burning in neighbouring countries, such as Laos and
Myanmar, due to their investments and introduction of contract farming
schemes there.
However, no information has been released on which companies are
responsible for the burning and no government has ever held these
agribusinesses accountable or penalised them for the burning. A good
example Thailand could follow is Singapore's 2014 Transboundary Haze
Pollution Act that targets the business sector by imposing fines on
companies with operations in neighbouring countries found to contribute
to haze pollution within Singapore's borders.
Moreover, the government could insist upon stringent product standards,
such as no burnt sugarcane, and could help farmers by subsiding the
purchase of harvesting machines and introducing other cleaner
production methods.
Overall, while legislation can never be a single silver bullet solution, no
country that has achieved cleaner air quality has done so without first
having sensible air pollution policies in place. For example, the USA, UK,
and Singapore have all passed Clean Air Acts.
The citizen-driven proposed "Thai Clean Air Act" Act provides the tools to
address the underlying causes that have so far impeded the resolution of
this public health crisis.
The bill adopts a rights-based approach that establishes the public's right
to clean air and by doing so concurrently creates an obligation of the state
to protect this right.
Finally, it includes economic incentives to push current major polluters to
reduce their emissions.
We hope that all parties will show that they truly care about the health and
lives of the people and will seek to adopt these policies and enact the
citizen-led Thai Clean Air Act.
Other countries have successfully improved their air quality, so why not
Thailand too?
Danny Marks is an Assistant Professor of Environmental Politics and
Policy at Dublin City University. Weenarin Lulitanonda is a co-founder of
the Thailand Clean Air Network.
'''

# Summarization with T5
T5 is an encoder-decoder model. It converts all language problems into a text-to-text format.

In [None]:
# Importing requirements
# T5ForConditionalGeneration is the T5 class for tasks where both input and output are sequences.
from transformers import T5Tokenizer, T5ForConditionalGeneration

## import model

In [None]:
# Instantiating the model and tokenizer 
t5_model = T5ForConditionalGeneration.from_pretrained('t5-small')
t5_tokenizer = T5Tokenizer.from_pretrained('t5-small')

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-small automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


Next comes a step that is easy to forget: add the prefix "summarize:" to the beginning of the raw text. T5 decides which task to perform from the prefix prepended to the input text.

In [None]:
# Prepending the "summarize:" prefix to the raw text
text = "summarize:" + original_text
text

'summarize:\nThis past week, two of Thailand\'s largest cities, Bangkok and Chiang Mai,\nearned the ignominious privilege of being among the 10 cities of the world\nwith the worst air quality during that period. The Ministry of Public\nHealth has blamed air pollution for causing 200,000 hospital admissions\nin the past week alone.\nAir pollution is one of Thailand\'s largest killers, more than obesity,\nsmoking, and even Covid-19. It accounted for over 50,000 premature\ndeaths in 2021, reducing average life expectancy by two years. Further,\nthere is widespread public concern that air pollution will reduce one of the\ncountry\'s main sources of income, tourism, in places like Chiang Mai.\nHowever, despite these grim statistics, this year\'s air pollution menace and\nthe government\'s ham-fisted response seems like deja vu. During the first\nfew months of every year, the level of air pollution spikes to hazardous\nlevels and smog covers the skies.\nEvery year the government responds by 
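The prefix step can be wrapped in a small helper. This is a minimal sketch, not part of the original notebook: `build_t5_input` is a hypothetical name, and collapsing whitespace is an assumption made mainly for readability of the printed input. The default prefix includes a trailing space so that the prefix and the text stay separated even when the raw text does not start with a newline.

```python
import re


def build_t5_input(raw_text: str, prefix: str = "summarize: ") -> str:
    """Collapse whitespace in raw_text and prepend a T5 task prefix.

    Hypothetical helper for illustration; the prefix default is an assumption.
    """
    # Fold every run of spaces/newlines into a single space and trim the ends.
    cleaned = re.sub(r"\s+", " ", raw_text).strip()
    return prefix + cleaned


sample = """
This past week, two of Thailand's largest cities,
Bangkok and Chiang Mai, ...
"""
print(build_t5_input(sample))
```

The same helper works for other T5 tasks by passing a different prefix, for example `prefix="translate English to German: "`.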

## data preparation

Recall that T5 is an encoder-decoder model, so the input text must first be converted into a sequence of token ids (input ids).

Convert the input text into input ids with the tokenizer's encode() method.

In [None]:
# Encoding the input text; truncation=True makes the cut-off at max_length explicit
input_ids = t5_tokenizer.encode(text, return_tensors='pt', max_length=512, truncation=True)
input_ids.shape


torch.Size([1, 512])
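The shape of exactly 512 suggests that the article was cut off at the token limit, so the model only sees the first part of the text. One common workaround is to split the text into chunks that each fit the limit and summarize them separately. The sketch below is an illustration under stated assumptions: `chunk_by_token_budget` is a hypothetical helper, sentences are assumed to be split already, and the default token counter is a plain whitespace split rather than the real tokenizer.

```python
from typing import Callable, List


def chunk_by_token_budget(
    sentences: List[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Greedily pack sentences into chunks of at most max_tokens tokens.

    A single sentence longer than the budget is kept as its own
    (oversized) chunk rather than being split.
    """
    chunks: List[str] = []
    current: List[str] = []
    used = 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Close the current chunk when this sentence would exceed the budget.
        if current and used + n > max_tokens:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(sentence)
        used += n
    if current:
        chunks.append(" ".join(current))
    return chunks


demo = ["a b c", "d e", "f g h i", "j"]
print(chunk_by_token_budget(demo, max_tokens=5))
```

To count real tokens, one could pass something like `count_tokens=lambda s: len(t5_tokenizer.encode(s))`. Since `encode()` may append an end-of-sequence id on every call, this slightly overestimates, so leaving a small margin below 512 is a reasonable precaution.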

Pass input_ids to the model's generate() method, which returns a sequence of ids corresponding to the summary.

## generate summary

In [None]:
# Generating summary ids
summary_ids = t5_model.generate(input_ids, max_length=100, min_length=30)
print(summary_ids.shape)
print(summary_ids)

torch.Size([1, 59])
tensor([[    0,   799, 10441,    19,    80,    13, 10508,    31,     7,  2015,
         14804,     7,     6,    72,   145, 18719,     6, 10257,     6,    11,
           237,   638,  6961,  4481,     3,     5,     3,  3565, 20425,  7475,
             6,    48,   215,    31,     7,   799, 10441, 24034,  1330,   114,
          6009,  9056,     3,     5,   334,   215,     8,   789, 13288,    12,
          1115,     8,  3863,    13,     8,   682,     3,     5,     1]])


You can see that the model has returned a tensor containing a sequence of ids. Now use the tokenizer's decode() method to turn these ids into the summary text.

In [None]:
# Decoding the tensor and printing the summary.
t5_summary = t5_tokenizer.decode(summary_ids[0])
t5_summary

"<pad> air pollution is one of Thailand's largest killers, more than obesity, smoking, and even Covid-19. despite grim statistics, this year's air pollution menace seems like deja vu. every year the government fails to address the drivers of the problem.</s>"

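The decoded T5 string above still contains the special tokens `<pad>` and `</s>` and is all lowercase at sentence starts. Passing `skip_special_tokens=True` to `decode()`, as the BART cell later does, removes the special tokens. The sketch below shows an equivalent post-processing step in plain Python; `clean_t5_output` is a hypothetical helper, and the sentence-capitalisation rule is a simple heuristic that would also capitalise after abbreviations such as "e.g.".

```python
import re


def clean_t5_output(decoded: str, special_tokens=("<pad>", "</s>")) -> str:
    """Strip special tokens, normalise whitespace, capitalise sentence starts."""
    for token in special_tokens:
        decoded = decoded.replace(token, " ")
    decoded = re.sub(r"\s+", " ", decoded).strip()
    # Upper-case the first letter of the text and of each new sentence.
    return re.sub(
        r"(^|[.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        decoded,
    )


raw = "<pad> air pollution is bad. every year it returns.</s>"
print(clean_t5_output(raw))
```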
# Summarization with BART
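BART is a sequence-to-sequence model that pairs a BERT-like bidirectional encoder with a GPT-like autoregressive decoder, which makes it well suited to generation tasks such as summarization.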

In [None]:
# Importing the model and tokenizer classes
from transformers import BartForConditionalGeneration, BartTokenizer

## import model

Import the model and tokenizer. For problems that require generating sequences, the BartForConditionalGeneration model is preferred.

"facebook/bart-large-cnn" is a pretrained checkpoint fine-tuned specifically for summarization on the CNN/Daily Mail news dataset.

In [None]:
# Loading the model and tokenizer for bart-large-cnn
bart_tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
bart_model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')

## data preparation

Pass the input text to the model as a sequence of ids.

To do this, call the tokenizer's batch_encode_plus() function, which returns a dictionary containing the encoded sequence (or sequence pair) along with related fields such as the attention mask.

Set the max_length parameter in batch_encode_plus() to cap the input length at 512 tokens.

In [None]:
# Encoding the inputs; truncation=True makes the cut-off at max_length explicit
input_ids = bart_tokenizer.batch_encode_plus([original_text],
                                              max_length=512,
                                              truncation=True,
                                              return_tensors='pt')
input_ids['input_ids'].shape


torch.Size([1, 512])

## generate summary

In [None]:
# Passing the encoded input ids to model.generate()
summary_ids = bart_model.generate(input_ids['input_ids'], 
                                  early_stopping=True, 
                                  max_length=100, 
                                  min_length=30)
print(summary_ids.shape)
print(summary_ids)

torch.Size([1, 46])
tensor([[    2,     0, 17906,  6631,    16,    65,     9,  6547,    18,  1154,
         20480,     6,    55,    87, 14057,     6,  7893,     6,     8,   190,
         19150,   808,    12,  1646,     4,  1489,   692,  2869,   857,  1182,
          8710,    12,   139,    12,  7794,    34, 12856,   130,  2479,  2074,
          1330,     7,   935,  6631,     4,     2]])


model.generate() has returned a sequence of ids corresponding to the summary of the original text. You can convert this sequence of ids back to text with the decode() method.

In [None]:
# Decoding and printing the summary
bart_summary = bart_tokenizer.decode(summary_ids[0], skip_special_tokens=True)
bart_summary

"Air pollution is one of Thailand's largest killers, more than obesity, smoking, and even Covid-19. Prime Minister Prayut Chan-o-cha has halted three draft laws related to air pollution."

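Both summaries above appear to reuse sentences from the article almost verbatim, so it can be useful to measure how abstractive a summary really is. The sketch below computes two rough indicators: the compression ratio (summary words divided by source words) and the fraction of summary n-grams that never occur in the source. `summary_stats` and its helpers are hypothetical names, and the word tokenisation is a deliberately simple regular expression, not what the models use.

```python
import re
from typing import List, Set, Tuple


def _words(text: str) -> List[str]:
    # Lower-cased words; apostrophes are kept so "Thailand's" stays one word.
    return re.findall(r"[a-z0-9']+", text.lower())


def _ngrams(words: List[str], n: int) -> Set[Tuple[str, ...]]:
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def summary_stats(source: str, summary: str, n: int = 2) -> dict:
    """Return the compression ratio and the share of novel n-grams."""
    src, summ = _words(source), _words(summary)
    summ_ngrams = _ngrams(summ, n)
    novel = summ_ngrams - _ngrams(src, n)
    return {
        "compression": len(summ) / len(src) if src else 0.0,
        "novel_ngram_fraction": len(novel) / len(summ_ngrams) if summ_ngrams else 0.0,
    }


stats = summary_stats("the cat sat on the mat today", "the cat sat down")
print(stats)
```

In the notebook, calling `summary_stats(original_text, bart_summary)` and `summary_stats(original_text, t5_summary)` would give a quick side-by-side comparison. A novel n-gram fraction near zero indicates a mostly extractive summary.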
# Summarization with GPT-2

GPT-2, introduced by OpenAI, is a decoder-only language model that can also be applied to text summarization.
<br>
We use GPT-2 here because its pretrained weights are openly available; the weights of GPT-3 have not been publicly released.

First, import the tokenizer and model. Make sure to import an LM head model (GPT2LMHeadModel), since the language modeling head is needed to generate sequences. Next, load the pretrained GPT-2 model and tokenizer.

In [None]:
# Importing model and tokenizer
from transformers import GPT2Tokenizer, GPT2LMHeadModel

## import model

In [None]:
# Instantiating the model and tokenizer with gpt-2
gpt_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
gpt_model = GPT2LMHeadModel.from_pretrained('gpt2')

## data preparation
The Hugging Face library keeps the workflow simple, so the steps are almost the same as for the previous models.

In [None]:
# Encoding the text to get input ids; truncation=True makes the cut-off at max_length explicit
input_ids = gpt_tokenizer.batch_encode_plus([original_text], return_tensors='pt', max_length=512, truncation=True)
input_ids['input_ids'].shape


torch.Size([1, 512])

## generate summary

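For a decoder-only model such as GPT-2, summary_ids contains the prompt ids followed by any newly generated ids, and max_length counts the prompt tokens as well. With a 512-token prompt and max_length=100, the model therefore appends only a single token (note the warning and the [1, 513] shape in the output), so the decoded text is essentially the input text rather than a summary.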

In [None]:
# Passing the input ids to model.generate()
summary_ids = gpt_model.generate(input_ids['input_ids'], 
                                 max_length=100,
                                 early_stopping=True)
print(summary_ids.shape)
print(summary_ids)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Input length of input_ids is 512, but `max_length` is set to 100. This can lead to unexpected behavior. You should consider increasing `max_new_tokens`.


torch.Size([1, 513])
tensor([[  198,  1212,  1613,  1285,    11,   734,   286, 16952,   338,  4387,
          4736,    11, 35007,   290,   609, 15483, 36709,    11,   198, 39123,
           262,  3627,  6351,   699, 11941,   286,   852,  1871,   262,   838,
          4736,   286,   262,   995,   198,  4480,   262,  5290,  1633,  3081,
          1141,   326,  2278,    13,   383,  9475,   286,  5094,   198, 18081,
           468, 13772,  1633, 12231,   329,  6666,   939,    11,   830,  4436,
         25349,   198,   259,   262,  1613,  1285,  3436,    13,   198, 16170,
         12231,   318,   530,   286, 16952,   338,  4387, 25542,    11,   517,
           621, 13825,    11,   198, 48783,    11,   290,   772, 39751,   312,
            12,  1129,    13,   632, 17830,   329,   625,  2026,    11,   830,
         19905,   198, 22595,    82,   287, 33448,    11,  8868,  2811,  1204,
         29098,   416,   734,   812,    13,  7735,    11,   198,  8117,   318,
         10095,  1171,  2328,  

In [None]:
# Decoding and printing summary
gpt_summary = gpt_tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(gpt_summary)


This past week, two of Thailand's largest cities, Bangkok and Chiang Mai,
earned the ignominious privilege of being among the 10 cities of the world
with the worst air quality during that period. The Ministry of Public
Health has blamed air pollution for causing 200,000 hospital admissions
in the past week alone.
Air pollution is one of Thailand's largest killers, more than obesity,
smoking, and even Covid-19. It accounted for over 50,000 premature
deaths in 2021, reducing average life expectancy by two years. Further,
there is widespread public concern that air pollution will reduce one of the
country's main sources of income, tourism, in places like Chiang Mai.
However, despite these grim statistics, this year's air pollution menace and
the government's ham-fisted response seems like deja vu. During the first
few months of every year, the level of air pollution spikes to hazardous
levels and smog covers the skies.
Every year the government responds by proclaiming a ban on forest fir
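The decoded output above is an echo of the input rather than a summary. A common way to nudge GPT-2 towards summarizing is to append a "TL;DR:" cue to the text, reserve room for new tokens with `max_new_tokens`, and decode only the tokens generated after the prompt. The sketch below is an illustration under stated assumptions, not the notebook's method: `prompt_budget` and `tldr_summary` are hypothetical names, the cue string and the default of 60 new tokens are arbitrary choices, and the small `gpt2` checkpoint may still produce weak summaries.

```python
def prompt_budget(context_size: int, max_new_tokens: int, suffix_len: int) -> int:
    """Number of text tokens that fit once the cue and new tokens are reserved."""
    budget = context_size - max_new_tokens - suffix_len
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for the prompt")
    return budget


def tldr_summary(text: str, max_new_tokens: int = 60, model_name: str = "gpt2") -> str:
    """Generate a continuation after a 'TL;DR:' cue and return only the new text."""
    # Imported lazily so prompt_budget stays usable without these packages.
    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained(model_name)
    model = GPT2LMHeadModel.from_pretrained(model_name)

    suffix_ids = tokenizer.encode("\nTL;DR:")
    budget = prompt_budget(model.config.n_positions, max_new_tokens, len(suffix_ids))
    # Truncate the article, not the cue, so the cue is always the last thing seen.
    prompt_ids = tokenizer.encode(text.strip())[:budget] + suffix_ids

    input_ids = torch.tensor([prompt_ids])
    output_ids = model.generate(
        input_ids,
        attention_mask=torch.ones_like(input_ids),
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Keep only the tokens generated after the prompt.
    new_ids = output_ids[0, len(prompt_ids):]
    return tokenizer.decode(new_ids, skip_special_tokens=True).strip()


print(prompt_budget(context_size=1024, max_new_tokens=60, suffix_len=4))
```

In the notebook, `print(tldr_summary(original_text))` would run the full pipeline. Passing the attention mask and `pad_token_id` explicitly also addresses the two generation warnings shown earlier.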