This notebook explores the summarization pipeline in Hugging Face Transformers, using BBC news articles as input.

In [1]:
import pandas as pd
import numpy as np
from transformers import pipeline

In [2]:
# download the BBC text classification dataset
# original dataset on Kaggle: https://www.kaggle.com/datasets/shivamkushwaha/bbc-full-text-document-classification
!wget -nc https://lazyprogrammer.me/course_files/nlp/bbc_text_cls.csv

File ‘bbc_text_cls.csv’ already there; not retrieving.



In [3]:
# load the dataset into a pandas DataFrame
df = pd.read_csv('bbc_text_cls.csv')

In [4]:
# check the dataset
df.head()

Unnamed: 0  text                                               labels
0           Ad sales boost Time Warner profit\n\nQuarterly...  business
1           Dollar gains on Greenspan speech\n\nThe dollar...  business
2           Yukos unit buyer faces loan claim\n\nThe owner...  business
3           High fuel prices hit BA's profits\n\nBritish A...  business
4           Pernod takeover talk lifts Domecq\n\nShares in...  business


In [5]:
# check the labels
df.labels.unique()

array(['business', 'entertainment', 'politics', 'sport', 'tech'],
      dtype=object)

In [6]:
# sample one random business article (the result is a one-row Series)
business_doc = df[df.labels == 'business']['text'].sample()

In [7]:
print(business_doc.iloc[0])

Japan bank shares up on link talk

Shares of Sumitomo Mitsui Financial (SMFG), and Daiwa Securities jumped amid speculation that two of Japan's biggest financial companies will merge.

Financial newspaper Nihon Keizai Shimbun claimed that the firms will join up next year and already have held discussions with Japanese regulators. The firms denied that they are about to link up, but said they are examining ways of working more closely together. SMFG shares climbed by 2.7% to 717,000, and Daiwa added 5.3% to 740 yen.

Combining SMFG, Japan's third-biggest lender, and Daiwa, the country's second-largest brokerage firm, would create a company with assets of more than $1,000bn (£537bn). SMFG President Yoshifumi Nishikawa said that the companies needed to bolster their businesses. "Both companies need to strengthen retail and other operations," he said, adding that "it's an issue we have in common". Daiwa said that "although it is true that the two groups have been engaging in various discus

In [8]:
# load the pipeline
summarizer = pipeline('summarization')

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
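
The warning above asks for an explicit model name and revision. A minimal sketch of pinning both, reusing the checkpoint and revision reported in the warning. The names `SUMMARIZER_SETTINGS` and `load_summarizer` are hypothetical helpers introduced here for illustration; they are not part of Transformers.

```python
# Checkpoint and revision copied from the warning printed by pipeline('summarization').
SUMMARIZER_SETTINGS = {
    'task': 'summarization',
    'model': 'sshleifer/distilbart-cnn-12-6',
    'revision': 'a4f8f3e',
}


def load_summarizer(settings=SUMMARIZER_SETTINGS):
    """Build the pipeline with an explicit model and revision (hypothetical helper)."""
    # imported here so the settings can be inspected without loading Transformers
    from transformers import pipeline
    return pipeline(**settings)


# usage, assuming transformers is installed and the Hub is reachable:
# summarizer = load_summarizer()
```

Pinning the revision should keep results stable even if the checkpoint on the Hub is updated later.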


In [9]:
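# summarize a sampled article (a one-row Series), skipping its headline
def summarize(doc):
    # the first line is the headline; only the body after it is summarized
    result = summarizer(doc.iloc[0].split('\n', 1)[1])
    return result[0]['summary_text']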
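
The function above assumes every article starts with a headline followed by a newline; `split('\n', 1)[1]` raises `IndexError` when a text contains no newline at all. A small sketch of a more defensive split, where `split_headline` is a hypothetical helper name and the sample string is taken from the business article printed earlier:

```python
def split_headline(text):
    """Return (headline, body); the whole text becomes the body if there is no newline."""
    headline, sep, body = text.partition('\n')
    if not sep:
        # no newline found: treat everything as body instead of raising IndexError
        return '', text.strip()
    return headline.strip(), body.strip()


sample = (
    "Japan bank shares up on link talk\n\n"
    "Shares of Sumitomo Mitsui Financial (SMFG), and Daiwa Securities jumped amid "
    "speculation that two of Japan's biggest financial companies will merge."
)
headline, body = split_headline(sample)
print(headline)
print(body)
```

Passing only `body` to the summarizer would match what `summarize` does, while tolerating texts that lack a headline line.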

In [10]:
print(summarize(business_doc))

 Sumitomo Mitsui Financial (SMFG), and Daiwa Securities (Daiwa Securities) deny that they are about to link up . SMFG shares climbed by 2.7% to 717,000, while Daiwa added 5.3% . Combining SMFG would create a company with assets of more than $1,000bn (£537bn)
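
The summary printed above begins with a space and has a space before some full stops. A purely cosmetic cleanup could look like the sketch below; `tidy_summary` is a hypothetical helper, and the regular expression assumes that whitespace directly before punctuation is never intentional (it would also alter text such as "grew .5%").

```python
import re


def tidy_summary(summary):
    """Strip outer whitespace and remove whitespace before sentence punctuation."""
    return re.sub(r'\s+([.,!?;:])', r'\1', summary.strip())


# part of the summary printed above, used as sample input
raw = (
    ' SMFG shares climbed by 2.7% to 717,000, while Daiwa added 5.3% . '
    'Combining SMFG would create a company with assets of more than $1,000bn (£537bn)'
)
print(tidy_summary(raw))
```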


In [11]:
# print a random article from each remaining topic, followed by its summary
entertainment_doc = df[df.labels == 'entertainment']['text'].sample()
print('Original text:', entertainment_doc.iloc[0], '\n', sep='\n')
print('Summary:', summarize(entertainment_doc), sep='\n')

Original text:
Shark Tale DVD is US best-seller

Oscar-nominated animation Shark Tale has raked in $80m (£42.4m) in the first week of its US DVD release becoming the year's best-selling home video so far.

The tally for its DVD and video sales soared past the film's opening week US box office takings of $56m (£29.7m). Shark Tale is now the sixth-highest earning DVD for first week sales. The all-time first-week record is held by 1995's Lion King followed by Shrek 2, Finding Nemo, the original Shrek, and Monster's Inc.

Shark Tale, whose voice cast includes Will Smith, Robert De Niro, Renee Zellweger and Martin Scorsese, sold more than 6 million DVDs and videos across the United States and Canada. It becomes the highest first-week earner for February, outshining My Big Fat Greek Wedding which sold four million units in 2003. Films which are expected to earn strong home video returns are usually timed for release in the busiest retail season which falls before Christmas. The best-selling 

In [12]:
politics_doc = df[df.labels == 'politics']['text'].sample()
print('Original text:', politics_doc.iloc[0], '\n', sep='\n')
print('Summary:', summarize(politics_doc), sep='\n')

Original text:
Cabinet anger at Brown cash raid

Ministers are unhappy about plans to use Whitehall cash to keep council tax bills down, local government minister Nick Raynsford has acknowledged.

Gordon Brown reallocated £512m from central to local government budgets in his pre-Budget report on Thursday. Mr Raynsford said he had held some "pretty frank discussions" with fellow ministers over the plans. But he said local governments had to deliver good services without big council tax rises.

The central government cash is part of a £1bn package to help local authorities in England keep next year's council tax rises below 5%, in what is likely to be a general election year.

Mr Raynsford said nearly all central government departments had an interest in well run local authorities. And he confirmed rows over the issue with ministerial colleagues. "Obviously we had some pretty frank discussions about this," he told BBC Radio 4's The World at One. But he said there was a recognition that "

In [13]:
sport_doc = df[df.labels == 'sport']['text'].sample()
print('Original text:', sport_doc.iloc[0], '\n', sep='\n')
print('Summary:', summarize(sport_doc), sep='\n')

Original text:
Bosvelt optimistic over new deal

Manchester City's Paul Bosvelt will find out "within a month" whether he is to be offered a new one-year deal.

The 34-year-old Dutch midfielder is out of contract in the summer and, although his age may count against him, he feels he can play on for another season. "I told the club I would like to stay for one more year. They promised me an answer within the next month so I am waiting to see," he said. "The main concern is my age but I think I have proved I am fit enough. Bosvelt joined City from Feyenoord in 2003 and at first he struggled to adapt to life in England. But his professionalism and dedication impressed manager Kevin Keegan. "He realised the pace of the game was faster than anything he was used to but he drove himself back into the team. He is an unsung hero," said Keegan.


Summary:
 Manchester City midfielder Paul Bosvelt is out of contract in the summer . The 34-year-old Dutch midfielder feels he can play on for another 
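
The per-topic cells repeat the same three lines with a different label each time. A sketch of the same idea as a single loop, under these assumptions: texts are already grouped in a plain dict (for the DataFrame used here, `df.groupby('labels')['text'].apply(list).to_dict()` would build one), and `summarize_fn` takes a string rather than the one-row Series that `summarize` expects. The names `summarize_one_per_label` and `first_sentence` are hypothetical, and the demo texts are toy placeholders rather than dataset rows.

```python
import random


def summarize_one_per_label(docs_by_label, summarize_fn, seed=None):
    """Pick one random text per label and return {label: (text, summary)}."""
    rng = random.Random(seed)  # a fixed seed makes the sampling repeatable
    results = {}
    for label, texts in docs_by_label.items():
        text = rng.choice(texts)
        results[label] = (text, summarize_fn(text))
    return results


def first_sentence(text):
    """Stand-in summarizer for illustration only: keep the first sentence."""
    return text.split('. ')[0]


# toy placeholder data, not taken from the BBC dataset
demo = {
    'sport': ['First sport sentence. Second sport sentence.'],
    'tech': ['First tech sentence. Second tech sentence.'],
}
for label, (text, summary) in summarize_one_per_label(demo, first_sentence, seed=0).items():
    print(label, 'Original text:', text, 'Summary:', summary, sep='\n')
```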

In [14]:
tech_doc = df[df.labels == 'tech']['text'].sample()
print('Original text:', tech_doc.iloc[0], '\n', sep='\n')
print('Summary:', summarize(tech_doc), sep='\n')

Original text:
EA to take on film and TV giants

Video game giant Electronic Arts (EA) says it wants to become the biggest entertainment firm in the world.

The US firm says it wants to compete with companies such as Disney and will only achieve this by making games appeal to mainstream audiences. EA publishes blockbuster titles such as Fifa and John Madden, as well as video game versions of movies such as Harry Potter and the James Bond films. Its revenues were $3bn (£1.65bn) in 2004, which EA hoped to double by 2009. EA is the biggest games publisher in the world and in 2004 had 27 titles which sold in excess of one million copies each. Nine of the 20 biggest-selling games in the UK last year were published by EA.

Gerhard Florin, EA's managing director for European publishing, said: "Doubling our industry in five years is not rocket science." He said it would take many years before EA could challenge Disney - which in 2004 reported revenues of $30bn (£16bn) - but it remained a goal 