The task is to use a TF-IDF vectorizer for text summarization.

In [1]:
import numpy as np
import pandas as pd
import textwrap
import nltk
from nltk.corpus import stopwords
from nltk import sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /Users/valentine/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/valentine/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
# download the BBC text classification dataset
# (original dataset on Kaggle: https://www.kaggle.com/datasets/shivamkushwaha/bbc-full-text-document-classification)
!wget -nc https://lazyprogrammer.me/course_files/nlp/bbc_text_cls.csv

File ‘bbc_text_cls.csv’ already there; not retrieving.



In [4]:
# load the dataset into a Pandas DataFrame
df = pd.read_csv('bbc_text_cls.csv')

In [5]:
# TF-IDF vectorizer that drops English stopwords and L1-normalizes each row
featurizer = TfidfVectorizer(stop_words=stopwords.words('english'), norm='l1')

In [6]:
# create a summarization function
def get_sentence_score(tfidf_row):
    # return the average of the non-zero values in the sentence's TF-IDF vector
    x = tfidf_row[tfidf_row != 0]
    return x.mean()

def summarize(text, show_scores=False, include_title=False):
    # text is a one-element Series: the first line is the title, the rest is the body
    title = text.iloc[0].split('\n', 1)[0] if include_title else None
    sents = sent_tokenize(text.iloc[0].split("\n", 1)[1])
    # fit TF-IDF on the sentences of this document only
    X = featurizer.fit_transform(sents)
    # score each sentence
    scores = np.zeros(len(sents))
    for i in range(len(sents)):
        score = get_sentence_score(X[i,:])
        scores[i] = score
    # sentence indices sorted by descending score
    sort_idx = np.argsort(-scores)
    # build the summary from the five highest-scoring sentences
    res = ''
    for i in sort_idx[:5]:
        if show_scores:
            res = res + f'(score: {scores[i]}) {sents[i]} '
        else:
            res = res + sents[i] + ' '
    return title + "\n" + res if title else res
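
A note on what the score measures: with norm='l1' every TF-IDF row sums to 1, so the mean of its non-zero values is 1 divided by the number of distinct non-stopword terms in the sentence. That would be consistent with scored output such as 0.1428... (about 1/7), and it suggests that shorter sentences tend to rank higher under this setup. The cell below is a minimal sketch to check this on toy sentences. The sentences are made up for illustration, and scikit-learn's built-in stop_words="english" list is used instead of the NLTK list so that the cell runs without extra downloads.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy sentences (made up for illustration, not taken from the dataset)
sentences = [
    "Oil shares rose.",
    "Investors welcomed the news about the oil discovery.",
    "Drilling produced strong results and shares closed higher on Thursday.",
]

# Assumption: scikit-learn's built-in English stop list stands in for NLTK's
vectorizer = TfidfVectorizer(stop_words="english", norm="l1")
X = vectorizer.fit_transform(sentences)

# With L1 normalization each row of non-negative weights sums to 1
row_sums = np.asarray(X.sum(axis=1)).ravel()

# Number of non-zero terms stored in each row
nnz = X.getnnz(axis=1)

# Same scoring rule as get_sentence_score: mean of the non-zero weights
scores = np.array([X[i].data.mean() for i in range(X.shape[0])])

for sent, n, score in zip(sentences, nnz, scores):
    print(f"{n} terms, score {score:.4f}: {sent}")
```

If sentence length should not dominate the ranking, one option to experiment with is norm='l2' or norm=None, which makes the score depend on the weights themselves rather than only on the term count.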

In [17]:
# sample a random business article; its summary is printed with scores below
doc = df[df['labels'] == 'business']['text'].sample()

In [18]:
res = summarize(doc, show_scores=True, include_title=True)

In [19]:
print(res)

Cairn shares up on new oil find
(score: 0.14285714285714288) Cairn's shares closed up 64 pence, or 6%, at 1130p on Thursday. (score: 0.1111111111111111) Chief executive Bill Gammell added: "The more we progress in Rajasthan the better we feel about it." (score: 0.09999999999999999) Cairn made the discovery after having been granted an extension to their drilling licence in January by Indian authorities. (score: 0.0909090909090909) Cairn said drilling to the north-west of its development site in Rajasthan had produced "very strong results". (score: 0.09090909090909088) 
Shares in Cairn Energy have jumped 6% after the firm said an Indian oilfield was larger than previously thought. 
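
The summary above lists sentences by descending score, so they appear out of their original order, and the count is fixed at five. The cell below is a hedged sketch of a variant that takes the number of sentences as a parameter and can restore document order. The helper name top_sentences and its parameters n and keep_order are hypothetical, it expects an already tokenized list of sentences, and it again uses scikit-learn's built-in stop list as a stand-in for the NLTK one.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def top_sentences(sents, n=5, keep_order=True):
    """Hypothetical helper: return the n best-scoring sentences from a list."""
    # Assumption: built-in English stop list instead of the NLTK stopwords
    vectorizer = TfidfVectorizer(stop_words="english", norm="l1")
    X = vectorizer.fit_transform(sents)
    # Mean of the non-zero weights per row, computed as row sum / non-zero count
    sums = np.asarray(X.sum(axis=1)).ravel()
    counts = X.getnnz(axis=1)
    # Sentences made up only of stopwords get a score of 0 instead of NaN
    scores = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    # Indices of the n highest scores; a stable sort keeps earlier sentences first on ties
    top = np.argsort(-scores, kind="stable")[:n]
    if keep_order:
        # Put the selected sentences back into document order
        top = np.sort(top)
    return [sents[i] for i in top]

# Toy input (made up for illustration)
example = [
    "Oil shares rose.",
    "The firm said drilling in the northern field had produced very strong results this year.",
    "Profits fell sharply.",
    "Investors welcomed the news about the discovery.",
]
print(" ".join(top_sentences(example, n=2)))
```

Keeping document order usually makes an extractive summary easier to read, at the cost of no longer showing which sentence scored highest.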


In [31]:
# make summaries of a random text from each class
for topic in df['labels'].unique():
    print(f'Topic: {topic}')
    doc = df[df['labels'] == topic]['text'].sample()
    print(summarize(doc, include_title=True),'\n')

Topic: business
UK Coal plunges into deeper loss
"We have a long journey ahead to fix these issues. The company said these actions should "significantly uplift earnings". In early trade on Thursday, its shares were down 10% at 119 pence. It expected 2005 to be a "transitional year" and to return to profitability in 2006. UK Coal said it was making "significant progress" in shaking up the business.  

Topic: entertainment
Band Aid retains number one spot
Opera band Il Divo have moved up one place with their eponymous album to number three. Maroon 5's album Songs About Jane has moved up to number seven despite being released 47 weeks ago. U2's How to Dismantle a Bomb remains at number one for a third week in a row, followed by Williams' Greatest Hits. The only other new entry in the top 10 came from Robbie Williams track Misunderstood, a new track written for his Greatest Hits album. And the Abba Gold greatest hits album has crept back into the top 40 more than nine years after it was fi