<a href="https://colab.research.google.com/github/qmeng222/transformers-for-NLP/blob/main/Pipeline_Text_Summarization_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
# https://www.kaggle.com/shivamkushwaha/bbc-full-text-document-classification
# `wget` is a command-line utility for downloading files from the internet.
# `-nc` ("no-clobber") tells wget not to overwrite a file that already exists in the current directory, so an existing copy is not downloaded again.
!wget -nc https://lazyprogrammer.me/course_files/nlp/bbc_text_cls.csv

File ‘bbc_text_cls.csv’ already there; not retrieving.



In [4]:
# the `!` prefix tells the notebook to run this line as a shell command rather than as Python code
!pip install transformers # install the Hugging Face Transformers library



In [5]:
from transformers import pipeline # import `pipeline` function from the HF Transformers library to perform NLP tasks using pre-trained models

import pandas as pd # for data manipulation and analysis (in DataFrames)
import numpy as np # for numerical computations in Python
import textwrap # format strings for display

In [24]:
# read data from a Comma Separated Values file & store it in a pandas DataFrame object:
df = pd.read_csv('bbc_text_cls.csv')
print(type(df))
print(df.shape)

<class 'pandas.core.frame.DataFrame'>
(2225, 2)



In [7]:
# retrieve the first 5 rows of the Pandas DataFrame object:
df.head()

                                                text    labels
0  Ad sales boost Time Warner profit\n\nQuarterly...  business
1  Dollar gains on Greenspan speech\n\nThe dollar...  business
2  Yukos unit buyer faces loan claim\n\nThe owner...  business
3  High fuel prices hit BA's profits\n\nBritish A...  business
4  Pernod takeover talk lifts Domecq\n\nShares in...  business


In [37]:
doc = df[df.labels == 'business']['text'].sample(random_state=42)
# df[df.labels == 'business']['text']: keep only the text of the rows labelled 'business'
# .sample(random_state=42): randomly select one of those rows (n defaults to 1); fixing random_state makes the selection reproducible, so the same seed always yields the same row

In [26]:
print(type(doc))
print(doc.shape)

<class 'pandas.core.series.Series'>
(1,)


In [38]:
doc

480    Christmas sales worst since 1981\n\nUK retail ...
Name: text, dtype: object
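
The filter-then-sample step can be tried on a tiny stand-in DataFrame. This is only a sketch: the `toy` frame and its contents are made up for illustration and are not part of the BBC dataset.

```python
import pandas as pd

# Hypothetical miniature dataset with the same two columns as bbc_text_cls.csv
toy = pd.DataFrame({
    "text": ["story a", "story b", "story c", "story d"],
    "labels": ["business", "business", "sport", "business"],
})

# Boolean mask keeps the 'business' rows, then ['text'] selects one column (a Series)
business_texts = toy[toy.labels == "business"]["text"]

# With no n or frac given, .sample() returns a single row
first_draw = business_texts.sample(random_state=42)
second_draw = business_texts.sample(random_state=42)

print(first_draw.shape)                # a one-element Series
print(first_draw.equals(second_draw))  # same seed, same row
```

Changing `random_state` (or omitting it) is what makes a different article come up on each run.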

In [13]:
# helper function to wrap text for display:
def wrap(x):
  # textwrap.fill breaks the text into lines no longer than `width` (70 characters by default)
  # replace_whitespace=False: keep the newlines already in the text (the default, True, converts every whitespace character, including newlines, to a single space)
  # fix_sentence_endings=True: try to put exactly two spaces after sentence-ending punctuation (. ? !)
  return textwrap.fill(x, replace_whitespace=False, fix_sentence_endings=True)
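
To see what the two `textwrap.fill` options do, here is a small sketch on a made-up string (the `sample` text is an assumption for illustration, not taken from the dataset):

```python
import textwrap

# Hypothetical text with a blank line between the headline and the body
sample = "Headline here\n\nFirst sentence. Second sentence."

# Default behaviour: every whitespace character (including "\n") becomes a space
collapsed = textwrap.fill(sample)

# Keep the original newlines
preserved = textwrap.fill(sample, replace_whitespace=False)

# Additionally put two spaces after sentence-ending punctuation
fixed = textwrap.fill(sample, replace_whitespace=False, fix_sentence_endings=True)

print(repr(collapsed))
print(repr(preserved))
print(repr(fixed))
```

Note that with `replace_whitespace=False` the embedded newlines are counted as ordinary characters when lines are measured, so the first line after a paragraph break can come out shorter than the wrap width.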

In [39]:
# print the news content:
# doc.iloc[0] selects the first (and only) element of the `doc` Series
print(wrap(doc.iloc[0]))

Christmas sales worst since 1981

UK retail sales fell in December,
failing to meet expectations and making it by some counts the worst
Christmas since 1981.

Retail sales dropped by 1% on the month in
December, after a 0.6% rise in November, the Office for National
Statistics (ONS) said.  The ONS revised the annual 2004 rate of growth
down from the 5.9% estimated in November to 3.2%. A number of
retailers have already reported poor figures for December.  Clothing
retailers and non-specialist stores were the worst hit with only
internet retailers showing any significant growth, according to the
ONS.

The last time retailers endured a tougher Christmas was 23 years
previously, when sales plunged 1.7%.

The ONS echoed an earlier
caution from Bank of England governor Mervyn King not to read too much
into the poor December figures.  Some analysts put a positive gloss on
the figures, pointing out that the non-seasonally-adjusted figures
showed a performance comparable with 2003. The Novembe

In [29]:
# create a summarization pipeline object:
summarizer = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
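
The warning above can be addressed by naming the model explicitly. The sketch below simply reuses the model name and revision reported in the log message; whether that checkpoint is the right choice for a given use case is an assumption. The helper name `build_summarizer` is hypothetical.

```python
# Model name and revision copied from the warning printed by pipeline("summarization")
MODEL_NAME = "sshleifer/distilbart-cnn-12-6"
MODEL_REVISION = "a4f8f3e"

def build_summarizer():
    """Create a summarization pipeline pinned to a specific model and revision."""
    # Imported here so that defining the helper does not require transformers to be loaded yet
    from transformers import pipeline
    return pipeline("summarization", model=MODEL_NAME, revision=MODEL_REVISION)
```

Calling `summarizer = build_summarizer()` should then behave like the cell above, without relying on the library's default model staying the same across versions.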



In [31]:
# pass only the article body, without the headline, so the summarizer cannot simply reuse the headline:
summarizer(doc.iloc[0].split("\n", 1)[1])
# .split("\n", 1): split at the first newline only (at most 1 split), giving [headline, rest of article]; [1] picks the rest

[{'summary_text': ' Retail sales dropped by 1% on the month in December, after a 0.6% rise in November . Clothing retailers and non-specialist stores were the worst hit with only internet retailers showing any significant growth . The last time retailers endured a tougher Christmas was 23 years ago, when sales plunged 1.7% .'}]
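
The headline/body split can be checked in isolation. This is a minimal sketch; `split_headline` is a hypothetical helper that also guards against text with no newline, a case where indexing `[1]` directly would raise an `IndexError`.

```python
def split_headline(article):
    """Return (headline, body), splitting at the first newline only."""
    parts = article.split("\n", 1)
    if len(parts) == 1:
        # No newline found: treat the whole text as the body
        return "", parts[0].strip()
    headline, body = parts
    # The body still starts with the second "\n" of the blank line, so strip it
    return headline.strip(), body.strip()

# Shortened example in the same shape as the articles in the dataset
article = "Christmas sales worst since 1981\n\nUK retail sales fell in December."
headline, body = split_headline(article)
print(headline)
print(body)
```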

In [32]:
def print_summary(doc):
  result = summarizer(doc.iloc[0].split("\n", 1)[1]) # returns a list containing one dict with the key 'summary_text'
  print(wrap(result[0]['summary_text']))

In [33]:
print_summary(doc)

 Retail sales dropped by 1% on the month in December, after a 0.6%
rise in November . Clothing retailers and non-specialist stores were
the worst hit with only internet retailers showing any significant
growth . The last time retailers endured a tougher Christmas was 23
years ago, when sales plunged 1.7% .
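
The generated summary has a leading space and a space before each full stop. If cleaner display text is wanted, a small post-processing step is one option. The sketch below is an assumption about what "clean" means here and uses a hypothetical helper name, `tidy_summary`; it is not part of the Transformers library.

```python
import re

def tidy_summary(text):
    """Strip outer whitespace and remove spaces placed before . ! or ?"""
    # Only touch punctuation that is followed by whitespace or the end of the text,
    # so decimals such as 0.6 are left alone
    return re.sub(r"\s+([.!?])(?=\s|$)", r"\1", text.strip())

# Shortened, made-up example in the style of the summarizer output
raw = " Sales fell 1% . Growth was 0.6% ."
print(tidy_summary(raw))
```

Inside `print_summary`, this could be applied to `result[0]['summary_text']` before wrapping.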


# Try again with another category and random seed:

In [34]:
doc = df[df.labels == 'entertainment']['text'].sample(random_state=123)
print(wrap(doc.iloc[0]))

Goodrem wins top female MTV prize

Pop singer Delta Goodrem has
scooped one of the top individual prizes at the first Australian MTV
Music Awards.

The 21-year-old singer won the award for best female
artist, with Australian Idol runner-up Shannon Noll taking the title
of best male at the ceremony.  Goodrem, known in both Britain and
Australia for her role as Nina Tucker in TV soap Neighbours, also
performed a duet with boyfriend Brian McFadden.  Other winners
included Green Day, voted best group, and the Black Eyed Peas.
Goodrem, Green Day and the Black Eyed Peas took home two awards each.
As well as best female, Goodrem also took home the Pepsi Viewers
Choice Award, whilst Green Day bagged the prize for best rock video
for American Idiot.  The Black Eyed Peas won awards for best R 'n' B
video and sexiest video, both for Hey Mama.  Local singer and
songwriter Missy Higgins took the title of breakthrough artist of the
year, with Australian Idol winner Guy Sebastian taking the honours f

In [35]:
print_summary(doc)

 The 21-year-old singer won the award for best female artist .
Australian Idol runner-up Shannon Noll took the title of best male at
the ceremony . Other winners included Green Day, the Black Eyed Peas,
Missy Higgins and Green Day . The VH1 First Music Award went to Cher
honouring her achievements within the music industry .
