In [3]:
import pandas as pd
import numpy as np
import nltk
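# sent_tokenize relies on NLTK's Punkt data; if it is missing, run nltk.download('punkt')
# (newer NLTK releases may ask for 'punkt_tab' instead)
from nltk import sent_tokenize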
from sklearn.feature_extraction.text import TfidfVectorizer

In [4]:
data = pd.read_csv('bbc_text_cls.csv')
data.head()

Unnamed: 0,text,labels
0,Ad sales boost Time Warner profit\n\nQuarterly...,business
1,Dollar gains on Greenspan speech\n\nThe dollar...,business
2,Yukos unit buyer faces loan claim\n\nThe owner...,business
3,High fuel prices hit BA's profits\n\nBritish A...,business
4,Pernod takeover talk lifts Domecq\n\nShares in...,business


In [5]:
data['text'][0]

'Ad sales boost Time Warner profit\n\nQuarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier.\n\nThe firm, which is now one of the biggest investors in Google, benefited from sales of high-speed internet connections and higher advert sales. TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn. Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL.\n\nTime Warner said on Friday that it now owns 8% of search-engine Google. But its own internet business, AOL, had has mixed fortunes. It lost 464,000 subscribers in the fourth quarter profits were lower than in the preceding three quarters. However, the company said AOL\'s underlying profit before exceptional items rose 8% on the back of stronger internet advertising revenues. It hopes to increase subscribers by offering the online service free to TimeWarner internet customers and will try to sig

In [6]:
sentences = sent_tokenize(data['text'][0])

In [8]:
sentences[:3]

['Ad sales boost Time Warner profit\n\nQuarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier.',
 'The firm, which is now one of the biggest investors in Google, benefited from sales of high-speed internet connections and higher advert sales.',
 'TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn.']

In [9]:
tfidf_vectorizer = TfidfVectorizer()
vectorized = tfidf_vectorizer.fit_transform(sentences)

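**Syntax of TF-IDF:**

sklearn.feature_extraction.text.TfidfVectorizer(input='content', ...)

**Parameters:**

- **input:** One of 'filename', 'file', or 'content' (the default). It tells the vectorizer whether the items passed to it are file names, file objects, or the text itself.

**Attributes:**

- **vocabulary_:** A dictionary that maps each term (key) to its feature index (value).
- **idf_:** The inverse document frequency vector, with one value per term in the vocabulary, learned from the documents passed to fit.

**Methods:**

- **fit_transform():** Learns the vocabulary and idf values, then returns a sparse document-term matrix of tf-idf values (one row per document, one column per term).
- **get_feature_names_out():** Returns an array of feature names ordered by feature index. Older scikit-learn releases called this get_feature_names().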

In [11]:
# Display the idf value of each term in the corpus
print("idf values")
for k, v in zip(tfidf_vectorizer.get_feature_names_out(), tfidf_vectorizer.idf_):
    print(k, ':', v)

idf values
000 : 3.3513752571634776
09bn : 3.3513752571634776
10 : 3.3513752571634776
11 : 3.3513752571634776
13bn : 3.3513752571634776
1bn : 3.3513752571634776
2000 : 3.3513752571634776
2003 : 2.9459101490553135
2005 : 3.3513752571634776
27 : 2.9459101490553135
284m : 3.3513752571634776
300m : 3.3513752571634776
36bn : 3.3513752571634776
42 : 3.3513752571634776
464 : 3.3513752571634776
500m : 3.3513752571634776
600m : 3.3513752571634776
639m : 3.3513752571634776
76 : 3.3513752571634776
9bn : 3.3513752571634776
accounts : 2.9459101490553135
ad : 3.3513752571634776
adjust : 3.3513752571634776
advert : 3.3513752571634776
advertising : 2.9459101490553135
alexander : 3.3513752571634776
all : 3.3513752571634776
already : 3.3513752571634776
also : 2.9459101490553135
amount : 3.3513752571634776
an : 3.3513752571634776
analysts : 3.3513752571634776
and : 1.965080896043587
aol : 1.965080896043587
around : 3.3513752571634776
as : 2.6582280766035327
aside : 3.3513752571634776
at : 2.6582280766035
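
The idf values above are consistent with scikit-learn's default smoothed formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents (here, the 20 sentences of article 1) and df(t) is the number of sentences that contain term t. The following is a minimal sketch that reproduces a few of the printed values. The document frequencies (1, 2, 3, 7) are inferred from the output rather than read from the vectorizer, and `smoothed_idf` is a helper name introduced only for illustration.

```python
import math

def smoothed_idf(n_documents, document_frequency):
    """Smoothed idf, assuming TfidfVectorizer's default smooth_idf=True."""
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1

n_sentences = 20  # article 1 splits into 20 sentences
for df in (1, 2, 3, 7):  # inferred document frequencies (assumption)
    print(df, ':', smoothed_idf(n_sentences, df))
```

A term that appears in only one sentence gets the highest idf, while frequent terms such as "and" or "aol" get lower values.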

In [12]:
# Displaying tf-idf values along with indexing

# Getting indexes:
print("Word Indexes: ")
print(tfidf_vectorizer.vocabulary_)

# display tf-idf values
print("\ntf-idf value:")
print(vectorized)

# In matrix form
print("\ntf-idf values in matrix form")
print(vectorized.toarray())

Word Indexes: 
{'ad': 21, 'sales': 183, 'boost': 45, 'time': 205, 'warner': 217, 'profit': 161, 'quarterly': 167, 'profits': 162, 'at': 37, 'us': 214, 'media': 131, 'giant': 99, 'timewarner': 206, 'jumped': 122, '76': 18, 'to': 207, '13bn': 4, '600m': 16, 'for': 90, 'the': 202, 'three': 204, 'months': 134, 'december': 66, 'from': 95, '639m': 17, 'year': 227, 'earlier': 69, 'firm': 86, 'which': 222, 'is': 118, 'now': 137, 'one': 146, 'of': 139, 'biggest': 43, 'investors': 117, 'in': 111, 'google': 100, 'benefited': 40, 'high': 107, 'speed': 195, 'internet': 115, 'connections': 62, 'and': 32, 'higher': 108, 'advert': 23, 'said': 181, 'fourth': 92, 'quarter': 166, 'rose': 180, '11': 3, '1bn': 5, '10': 2, '9bn': 19, 'its': 121, 'were': 220, 'buoyed': 50, 'by': 53, 'off': 140, 'gains': 97, 'offset': 144, 'dip': 67, 'bros': 49, 'less': 124, 'users': 215, 'aol': 33, 'on': 145, 'friday': 94, 'that': 201, 'it': 119, 'owns': 152, 'search': 185, 'engine': 72, 'but': 52, 'own': 151, 'business': 51

In [13]:
print("number of sentences in article 1: ", len(sentences))

number of sentences in article 1:  20


In [16]:
print("Shape of vectorized: ", vectorized.shape)

Shape of vectorized:  (20, 228)


In [17]:
# Changing sparse matrix to array
tf_idf_matrix = vectorized.toarray()
tf_idf_matrix

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.16088028],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.34868323, ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.21646566,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.23188568, 0.        ,
        0.        ]])

In [24]:
# Score each sentence as the mean of its non-zero tf-idf values
scores = []
for row in tf_idf_matrix:
    non_zero = [value for value in row if value != 0]
    # A sentence with no vocabulary terms has nothing to average; score it 0
    score = sum(non_zero) / len(non_zero) if non_zero else 0.0
    scores.append(score)

print(scores)

[0.18903909497280832, 0.21260496978095947, 0.28286938539435885, 0.2188689647214197, 0.2721674144043621, 0.3113061162677085, 0.24650825771974072, 0.2318189200420009, 0.1889910099524704, 0.20905475661019424, 0.2987698867839692, 0.1686288846707722, 0.21230071776873882, 0.19162878226502164, 0.23057458478025905, 0.22440271391448605, 0.23065442914295697, 0.21808890743570952, 0.18816328005751867, 0.23244764664552092]
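
The loop above averages the non-zero tf-idf values in each row. The same score can be sketched with NumPy array operations, since zeros add nothing to a row sum. This is an illustrative alternative, not the notebook's own code: `mean_nonzero_per_row` is a hypothetical helper, and giving all-zero rows a score of 0.0 is an assumption of this sketch.

```python
import numpy as np

def mean_nonzero_per_row(matrix):
    """Mean of the non-zero entries in each row of a dense 2-D array."""
    matrix = np.asarray(matrix, dtype=float)
    nonzero_counts = np.count_nonzero(matrix, axis=1)
    row_sums = matrix.sum(axis=1)  # zeros do not change the sum
    # Rows without any non-zero entry keep the default score of 0.0 (assumption)
    return np.divide(
        row_sums,
        nonzero_counts,
        out=np.zeros_like(row_sums),
        where=nonzero_counts > 0,
    )

demo = [[0.0, 0.5, 0.5],
        [0.0, 0.0, 0.0],
        [0.2, 0.0, 0.4]]
print(mean_nonzero_per_row(demo))
```

Applied to the notebook's data, the call would be `mean_nonzero_per_row(tf_idf_matrix)`.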


In [25]:
len(scores)

20

In [57]:
# Select sentences whose score exceeds the threshold as the summary of the text
avg = sum(scores)/len(scores)
threshold = avg*1.04
print("Summary of Article 1:")
for i in range(len(scores)):
    if scores[i]>threshold:
        print("\n", sentences[i])

Summary of Article 1:

 TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn.

 Time Warner said on Friday that it now owns 8% of search-engine Google.

 But its own internet business, AOL, had has mixed fortunes.

 It lost 464,000 subscribers in the fourth quarter profits were lower than in the preceding three quarters.

 Time Warner's fourth quarter profits were slightly better than analysts' expectations.
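
The selection step keeps every sentence whose score exceeds a multiple of the average score, so the multiplier controls how many sentences survive. A small sketch of that rule on made-up scores follows; `select_sentences` and the demo values are hypothetical and are not taken from the article.

```python
def select_sentences(sentences, scores, factor=1.04):
    """Keep sentences scoring above factor * mean(scores), in original order."""
    if not scores:
        return []
    threshold = factor * sum(scores) / len(scores)
    return [s for s, score in zip(sentences, scores) if score > threshold]

demo_sentences = ["a", "b", "c", "d"]
demo_scores = [0.10, 0.30, 0.20, 0.40]  # made-up scores, mean is 0.25

print(select_sentences(demo_sentences, demo_scores, factor=1.0))
print(select_sentences(demo_sentences, demo_scores, factor=1.3))
```

Raising the factor makes the summary shorter; lowering it toward 1.0 keeps roughly the above-average sentences.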


In [85]:
def summarize(text):
    """Return the sentences whose mean non-zero tf-idf score exceeds 1.08x the average."""
    sentences = sent_tokenize(text)
    tfidf_vectorizer = TfidfVectorizer()
    vectorized = tfidf_vectorizer.fit_transform(sentences)
    # Change the sparse matrix to a dense array
    tf_idf_matrix = vectorized.toarray()
    # Score each sentence as the mean of its non-zero tf-idf values
    scores = []
    for row in tf_idf_matrix:
        non_zero = [value for value in row if value != 0]
        score = sum(non_zero) / len(non_zero) if non_zero else 0.0
        scores.append(score)
    # Select sentences whose score exceeds the threshold as the summary of the text
    avg = sum(scores) / len(scores)
    threshold = avg * 1.08
    summaries = []
    for sentence, score in zip(sentences, scores):
        if score > threshold:
            summaries.append(sentence)
    return summaries
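
A threshold tied to the average score produces summaries of varying length, and it returns nothing when every sentence scores the same (for example, a one-sentence text). One possible alternative, which is a different selection rule and not what this notebook uses, is to keep a fixed number of top-scoring sentences. The helper name `top_k_sentences` and the default `k=3` are assumptions made for this sketch.

```python
def top_k_sentences(sentences, scores, k=3):
    """Return the k highest-scoring sentences, kept in their original order."""
    # Rank sentence indices from highest to lowest score
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore document order so the summary reads naturally
    keep = sorted(ranked[:k])
    return [sentences[i] for i in keep]

demo_sentences = ["s0", "s1", "s2", "s3"]
demo_scores = [0.2, 0.5, 0.1, 0.4]  # made-up scores
print(top_k_sentences(demo_sentences, demo_scores, k=2))
```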

         

In [86]:
print("Summary of Article 1\n")
summarized = summarize(data['text'][0])
for sentence in summarized:
    print(sentence, end="\n")

Summary of Article 1

TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn.
Time Warner said on Friday that it now owns 8% of search-engine Google.
But its own internet business, AOL, had has mixed fortunes.
It lost 464,000 subscribers in the fourth quarter profits were lower than in the preceding three quarters.
Time Warner's fourth quarter profits were slightly better than analysts' expectations.


In [87]:
print("Summary of Article 2\n")
summarized = summarize(data['text'][1])
for sentence in summarized:
    print(sentence, end="\n")

Summary of Article 2

In late trading in New York, the dollar reached $1.2871 against the euro, from $1.2974 on Thursday.
Market concerns about the deficit has hit the greenback in recent months.
Worries about the deficit concerns about China do, however, remain.
The G7 meeting is thought unlikely to produce any meaningful movement in Chinese policy.


In [88]:
# Afaan Oromo article
with open('sample_article1.txt', 'r', encoding='utf-8') as fp:
    article = fp.read()
print("Summary of Sample article\n")
summarized = summarize(article)
for sentence in summarized:
    print(sentence, end="\n")

Summary of Sample article

Raacheel Ruutoo dubartii calliftuufi gad of deebistudha.
Tarkaanfii ishee mara keessatti jaalala Waaqayyoof qabdu dursuun beekamti.
Akkasumas, wangeelaa bakka buʼuudhaan namoota hedduuf fakkeenya taateetti.
Akkamiin Wiiliyaam Ruutoo waliin walbaran?
Barnootashee sadarkaa lammaffaa mana barumsaa dubartootaa Buteeretti xumurte.
Achiin booda Yunivarsiiti Keeniyaatatti Sirna Barnootaan (Education) digirii argatte.
Yeroo dheeraaf garuu carraa barsiistuu ta’uu hin arganne.
Kun ammoo yeroo lamaan isaanii yunivarsitiitti barumsa turanidha.
Bu’uurri jireenya isaanii wal qorachuu achitti eegalanidha.
Raacheel yeroo sanatti digirii jalqabaashee Sirna Barnootaan barachaa turte.
Amma ijjoollee ja’a – durba sadii fi dhiira sadii qabu.
Kanaaf, Aadde Raacheel amma amaatii lammii Naayijeeriyaa taateetti.
Sochiinshee dubaroota hedduu deegaraa jiraachuu himeera.
Sagantaa bishaan qulqulluu dubartootaaf dhiyeessuu irrattis hojjeteetti Aadde Raacheel.
Kana hordofe, mootummaan wali