In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
import re
import numpy as np
import string
import warnings
# hide warnings
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv('../data/bbc_text_cls.csv')

In [3]:
# Article lengths in characters, shortest to longest
df.text.str.len().sort_values()

1292      502
1561      720
1503      725
1515      728
1568      741
        ...  
1188    13828
2224    16159
1275    18388
762     19137
1185    25484
Name: text, Length: 2225, dtype: int64

In [4]:
article = df.loc[1185, 'text']

In [5]:
def custom_tokenizer(doc):
    # Split the text into word and punctuation tokens with NLTK
    doc = nltk.word_tokenize(doc)
    # Drop punctuation tokens (substring test against string.punctuation)
    doc = [i for i in doc if i not in string.punctuation]
    return doc

tfidf = TfidfVectorizer(stop_words='english',analyzer='word',lowercase=True,tokenizer=custom_tokenizer)
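
One caveat with the filter in `custom_tokenizer`: `i not in string.punctuation` is a substring test against one long string, so multi-character tokens such as the `` `` `` and `''` quote tokens that `nltk.word_tokenize` emits, or `...`, are not removed. A minimal sketch of a stricter filter follows; the helper name `drop_punctuation_tokens` and the token list are hypothetical, used only for illustration.

```python
import string

def drop_punctuation_tokens(tokens):
    """Keep only tokens that contain at least one non-punctuation character."""
    punct = set(string.punctuation)
    return [tok for tok in tokens if not all(ch in punct for ch in tok)]

# Hypothetical token list, shaped like word_tokenize output for quoted text.
tokens = ["``", "Completely", "wrong", ".", "''", "...", "sleep-walking", "$"]
print(drop_punctuation_tokens(tokens))
# ['Completely', 'wrong', 'sleep-walking']
```

Hyphenated words such as `sleep-walking` survive because they contain letters.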

In [6]:
article = re.sub(r'\n+','. ',article)
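
Replacing every newline run with `'. '` also appends a period to lines that already end in one, which produces the `..` sequences visible in the sentence list and can stop `sent_tokenize` from splitting there. As a possible alternative (not what this notebook runs), the sketch below adds a period only when a line lacks terminal punctuation. The helper names and the sample text are hypothetical.

```python
import re

def ends_sentence(line):
    """True if the line ends in . ! or ?, ignoring trailing quotes/brackets."""
    core = line.rstrip("\"')")
    return core.endswith((".", "!", "?"))

def join_lines_as_sentences(text):
    """Join newline-separated blocks, adding '.' only where it is missing."""
    lines = [ln.strip() for ln in re.split(r"\n+", text)]
    lines = [ln for ln in lines if ln]
    return " ".join(ln if ends_sentence(ln) else ln + "." for ln in lines)

# Hypothetical sample: a headline without a period, then two sentences.
sample = "A headline without a period\n\nFirst sentence.\nSecond sentence."
print(join_lines_as_sentences(sample))
# A headline without a period. First sentence. Second sentence.
```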

In [7]:
sents = nltk.sent_tokenize(article)

In [8]:
print(sents)

["Terror powers expose 'tyranny'.", "The Lord Chancellor has defended government plans to introduce control orders to keep foreign and British terrorist suspects under house arrest, where there isn't enough evidence to put them on trial.. Lord Falconer insists that the proposals do not equate to a police state and strike a balance between protecting the public against the threat of terrorism and upholding civil liberties.", 'But thriller writer Frederick Forsyth tells BBC News of his personal response to the move..', 'There is a mortal danger aimed at the heart of Britain.', 'Or so says Home Secretary Charles Clarke.', 'My reaction?', 'So what?', 'It is not that I am cynical or just do not care.', 'I care about this country very much..', 'But in the 66 years that I have been alive, there has not been one hour, of one day, of one month, of one year, when there has not been a threat aimed at us.', 'My point is, the British have always coped without becoming a dictatorship.', 'We have cop

In [9]:
sent_df = pd.DataFrame({'Sentences':sents})

In [10]:
sent_df.head()

| | Sentences |
|---|---|
| 0 | Terror powers expose 'tyranny'. |
| 1 | The Lord Chancellor has defended government pl... |
| 2 | But thriller writer Frederick Forsyth tells BB... |
| 3 | There is a mortal danger aimed at the heart of... |
| 4 | Or so says Home Secretary Charles Clarke. |


In [11]:
feature = tfidf.fit_transform(sent_df.Sentences)
feature

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 2071 stored elements and shape (233, 1102)>

In [12]:
feature_matrix = feature.todense()

In [13]:
feature_matrix

matrix([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        ...,
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.32322119, 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ]], shape=(233, 1102))

In [478]:
# def avg_calculator(x):
#     x = x.tolist()[0]
#     x = [j for j in x if j>0]
#     x = sum(x)/len(x)
#     return x
# [avg_calculator(i) for  i in feature_matrix]


In [14]:
# Sentence score = sum of TF-IDF weights / number of non-zero weights in the row
sent_df['score'] = np.true_divide(feature_matrix.sum(1),(feature_matrix!=0).sum(1)).tolist()

In [15]:
sent_df.score = sent_df.score.apply(lambda x:x[0])
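
The two cells above compute, for each sentence, the mean of its non-zero TF-IDF weights, then unwrap the one-element lists that `np.matrix.tolist()` produces. A sketch of the same calculation on a plain 2-D array is shown here; it returns a flat vector directly and leaves rows with no surviving terms as NaN without a division warning. The function name and the demo matrix are assumptions for illustration, not part of the notebook.

```python
import numpy as np

def mean_nonzero_per_row(weights):
    """Mean of the non-zero entries in each row; NaN for all-zero rows."""
    weights = np.asarray(weights, dtype=float)
    totals = weights.sum(axis=1)
    counts = (weights != 0).sum(axis=1)
    scores = np.full(totals.shape, np.nan)
    # Divide only where at least one weight is non-zero.
    np.divide(totals, counts, out=scores, where=counts > 0)
    return scores

# Hypothetical weights: two-term row, empty row, single-term row.
demo = [[0.6, 0.8, 0.0],
        [0.0, 0.0, 0.0],
        [1.0, 0.0, 0.0]]
print(mean_nonzero_per_row(demo))
```

Because the result is one-dimensional, it could be assigned to a DataFrame column without the extra `x[0]` step.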

In [16]:
# A score of exactly 1 means only one term survived stop-word removal; such
# fragments carry little meaning without the previous or next sentence.
# Masking with sent_df != 1 turns those scores into NaN, so they sort to the bottom.
sent_df[sent_df!=1].sort_values('score',ascending=False)

| | Sentences | score |
|---|---|---|
| 32 | That was why the tyrants lost. | 0.707107 |
| 64 | The threat now is new. | 0.706659 |
| 190 | Who is to decide whom is a suspect? | 0.705663 |
| 195 | Who will support their families? | 0.705663 |
| 196 | Will their children still go to school? | 0.704484 |
| ... | ... | ... |
| 98 | We are sleep-walking into this. | NaN |
| 99 | Wake up!. | NaN |
| 115 | Where will this end. | NaN |
| 124 | It is to you and I. | NaN |
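
Why single-term sentences score exactly 1: assuming `TfidfVectorizer` is left at its default `norm='l2'`, each row is scaled to unit length, so a row with one non-zero weight becomes `[1.0]` regardless of the raw weight. Two terms with equal raw weight each become about 0.707, which matches the top scores in the table. The sketch below reproduces this arithmetic in plain Python; the helper names and raw weights are made up for illustration.

```python
import math

def l2_normalize(raw_weights):
    """Scale a weight vector to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in raw_weights))
    return [w / norm for w in raw_weights] if norm else list(raw_weights)

def mean_nonzero(weights):
    """Mean of the non-zero weights; NaN if none are non-zero."""
    nonzero = [w for w in weights if w != 0]
    return sum(nonzero) / len(nonzero) if nonzero else float("nan")

single = l2_normalize([2.0])        # one surviving term
pair = l2_normalize([2.0, 2.0])     # two terms with equal raw weight
print(mean_nonzero(single))             # 1.0
print(round(mean_nonzero(pair), 6))     # 0.707107
```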


In [17]:
# Keep the top 30% of sentences by score (single-term sentences are masked to NaN and sort last)
sent_max = sent_df[sent_df!=1].sort_values('score',ascending=False).head(int(0.3*len(sent_df)))

In [18]:
sent_max.sort_index()

| | Sentences | score |
|---|---|---|
| 0 | Terror powers expose 'tyranny'. | 0.497685 |
| 3 | There is a mortal danger aimed at the heart of... | 0.445245 |
| 4 | Or so says Home Secretary Charles Clarke. | 0.446220 |
| 7 | It is not that I am cynical or just do not care. | 0.573720 |
| 8 | I care about this country very much.. | 0.556704 |
| ... | ... | ... |
| 211 | I agree with Mr Forsyth. | 0.574676 |
| 213 | The facts turned out to be very different. | 0.576345 |
| 215 | We become animals too.. | 0.662352 |
| 223 | They have arrived at their own interpretations... | 0.447031 |


In [19]:
text_summarized = '\n'.join(sent_max.sort_index().Sentences.to_list())
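
The selection logic in the last three cells (rank by score, keep the top 30%, then restore document order) can be sketched without pandas as follows. The function name, sentences, and scores are hypothetical; tie-breaking may differ from `sort_values`, whose default sort is not guaranteed to be stable.

```python
def select_top_fraction(sentences, scores, fraction=0.3):
    """Return the highest-scoring sentences, in their original order."""
    k = int(fraction * len(sentences))
    # Skip NaN scores (s != s) and single-term sentences (score == 1).
    candidates = [i for i, s in enumerate(scores) if s == s and s != 1]
    ranked = sorted(candidates, key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])  # back to document order
    return [sentences[i] for i in keep]

# Hypothetical sentences and scores, for illustration only.
sentences = ["S0", "S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8", "S9"]
scores = [0.50, 1.0, 0.71, 0.30, float("nan"), 0.66, 0.45, 0.20, 0.70, 0.10]
print(select_top_fraction(sentences, scores))
# ['S2', 'S5', 'S8']
```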

In [20]:
text_summarized

'Terror powers expose \'tyranny\'.\nThere is a mortal danger aimed at the heart of Britain.\nOr so says Home Secretary Charles Clarke.\nIt is not that I am cynical or just do not care.\nI care about this country very much..\nMy point is, the British have always coped without becoming a dictatorship.\nI was born on 25 August, 1938.\nA week after my first birthday, the threat had become reality.\nMy father wore a uniform for five years.\nAfter 1945 we yearned for peace at last.\nBehind the Iron Curtain, another genocidal psychopath, another threat.\nWe built shelters that would have sheltered nothing.\nWe spent our treasure on weapons instead of hospitals.\nWe took silly precautions.\nSome fought it; some marched futilely against it.\nBy the early seventies it was terrorism as well.\nThat was why the tyrants lost.\nCivil rights were infringed as little as humanly possible.\nNow the threat is Islamic fundamentalism.\nIt is based and funded abroad; so was the IRA.\nIt is extremely hard to 

In [21]:
# Wrap the steps above in an article summarizer function

def summarize_aricle(text):
    # Treat each newline-separated block (headline, paragraph) as ending a sentence
    text = re.sub(r'\n+','. ',text)
    sents = nltk.sent_tokenize(text)
    sent_df = pd.DataFrame({'Sentences':sents})
    # Refit the module-level tfidf so IDF weights are computed per article
    feature = tfidf.fit_transform(sent_df.Sentences)
    feature_matrix = feature.todense()
    # Score = mean of the non-zero TF-IDF weights in each sentence
    sent_df['score'] = np.true_divide(feature_matrix.sum(1),(feature_matrix!=0).sum(1)).tolist()
    sent_df.score = sent_df.score.apply(lambda x:x[0])
    # Mask single-term sentences (score == 1) to NaN, then keep the top 30% by score
    sent_max = sent_df[sent_df!=1].sort_values('score',ascending=False).head(int(0.3*len(sent_df)))
    # Restore the original sentence order before joining
    text_summarized = '\n'.join(sent_max.sort_index().Sentences.to_list())
    return text_summarized

### Using the function

In [22]:
print(df.loc[0, 'text'])

Ad sales boost Time Warner profit

Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier.

The firm, which is now one of the biggest investors in Google, benefited from sales of high-speed internet connections and higher advert sales. TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn. Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL.

Time Warner said on Friday that it now owns 8% of search-engine Google. But its own internet business, AOL, had has mixed fortunes. It lost 464,000 subscribers in the fourth quarter profits were lower than in the preceding three quarters. However, the company said AOL's underlying profit before exceptional items rose 8% on the back of stronger internet advertising revenues. It hopes to increase subscribers by offering the online service free to TimeWarner internet customers and will try to sign up AOL

In [23]:
summarized = summarize_aricle(df.loc[0, 'text'])
print(summarized)

Ad sales boost Time Warner profit.
But its own internet business, AOL, had has mixed fortunes.
It lost 464,000 subscribers in the fourth quarter profits were lower than in the preceding three quarters.
It has already offered to pay $300m to settle charges, in a deal that is under review by the SEC.
It will now book the sale of its stake in AOL Europe as a loss on the value of that stake.


In [24]:
article="""
Satya Narayana Nadella (Telugu: నాదెళ్ల సత్యనారాయణ, /nəˈdɛlə/; born 19 August 1967) is an Indian-American business executive.He is the executive chairman and CEO of Microsoft, succeeding Steve Ballmer in 2014 as CEO and John W. Thompson in 2021 as chairman. Before becoming CEO, he was the executive vice president of Microsoft's cloud and enterprise group, responsible for building and running the company's computing platforms.
Nadella was born in Hyderabad in Andhra Pradesh state, India into a Telugu-speaking Hindu family. His mother Prabhavati was a Sanskrit lecturer and his father, Bukkapuram Nadella Yugandhar, was an Indian Administrative Service officer of the 1962 batch. Yugandhar hailed from Bukkapuram in Anantapur district of Andhra Pradesh. Yugandhar's father migrated from Nadella village in Guntur district of Andhra Pradesh to Bukkapuram.
Satya Nadella attended the Hyderabad Public School, Begumpet before receiving a bachelor's in electrical engineering from the Manipal Institute of Technology in Karnataka in 1988. Nadella then traveled to the U.S. to study for an MS in computer science at the University of Wisconsin–Milwaukee, receiving his degree in 1990. Later, he received an MBA from the University of Chicago Booth School of Business in 1997.
Nadella worked at Sun Microsystems as a member of its technology staff before joining Microsoft in 1992.
At Microsoft, Nadella has led major projects that included the company's move to cloud computing and the development of one of the largest cloud infrastructures in the world.
Nadella worked as the senior vice-president of research and development (R&D) for the Online Services Division and vice-president of the Microsoft Business Division. Later, he was made the president of Microsoft's $19 billion Server and Tools Business and led a transformation of the company's business and technology culture from client services to cloud infrastructure and services. He has been credited for helping bring Microsoft's database, Windows Server and developer tools to its Azure cloud. The revenue from Cloud Services grew to $20.3 billion in June 2013 from $16.6 billion when he took over in 2011. He received $84.5 million in 2016 pay.

In 2013, Nadella's base salary was reportedly $669,167. Including stock bonuses, the total compensation stood at around $7.6 million.

Previous positions held by Nadella include:President of the Server & Tools Division (9 February 2011 – February 2014),Senior Vice-president of Research and Development for the Online Services Division (March 2007 – February 2011),Vice-president of the Business Division,Corporate Vice-president of Business Solutions and Search & Advertising Platform Group,Executive Vice-president of Cloud and Enterprise group


On 4 February 2014, Nadella was announced as the new CEO of Microsoft, the third CEO in the company's history, following Bill Gates and Steve Ballmer.

In October 2014, Nadella attended an event on Women in Computing and courted controversy after he made a statement that women should not ask for a raise and should trust the system. Nadella was criticised for the statement and he later apologized on Twitter. He then sent an email to Microsoft employees admitting he was "Completely wrong."
Nadella leads a live discussion on Microsoft's cloud strategy in 2014 in San Francisco.

Nadella's tenure at Microsoft has emphasized working with companies and technologies with which Microsoft also competes, including Apple Inc., Salesforce, IBM, and Dropbox. In contrast to previous Microsoft campaigns against the Linux operating system, Nadella proclaimed that "Microsoft ❤️ Linux", and Microsoft joined the Linux Foundation as a Platinum member in 2016.

Under Nadella, Microsoft revised its mission statement to "empower every person and every organization on the planet to achieve more". He orchestrated a cultural shift at Microsoft by emphasizing empathy, collaboration, and 'growth mindset'. He has transformed Microsoft's corporate culture into one that emphasizes continual learning and growth.

In 2014, Nadella's first acquisition with Microsoft was of Mojang, a Swedish game company best known for the computer game Minecraft, for $2.5 billion. He followed that by purchasing Xamarin for an undisclosed amount. He oversaw the purchase of professional network LinkedIn in 2016 for $26.2 billion. On October 26, 2018, Microsoft acquired GitHub for US$7.5 billion.

Since Nadella became CEO, Microsoft stock had tripled by September 2018, with a 27% annual growth rate.

In 2018, he was a Time 100 honoree.

In 2019, Nadella was named Financial Times Person of the Year and Fortune magazine Businessperson of the Year.

In 2020, Nadella was recognized as Global Indian Business Icon at CNBC-TV18's India Business Leader Awards in Mumbai.

In 2022, Nadella was awarded Padma Bhushan, the third highest civilian award in India by the Government of India.

In 1992, Nadella married Anupama, the daughter of his father's IAS batchmate. She was his junior at Manipal pursuing a B.Arch in the Faculty of Architecture. The couple had three children, a son and two daughters, and live in Clyde Hill and Bellevue, Washington. His son Zain was a legally blind quadriplegic with cerebral palsy. Zain died in February 2022, at the age of 26.

Nadella is an avid reader of American and Indian poetry. He also nurses a passion for cricket, having played on his school team. Nadella and his wife Anupama are part of the ownership group of Seattle Sounders FC, a Major League Soccer club.

Nadella has authored a book titled Hit Refresh that explores his life, his career in Microsoft and how he believes technology will shape the future. He announced that the profits from the book would go to Microsoft Philanthropies and through that to nonprofit organizations. """

In [25]:
summarized = summarize_aricle(article)
print(summarized)

Yugandhar hailed from Bukkapuram in Anantapur district of Andhra Pradesh.
He received $84.5 million in 2016 pay..
In 2013, Nadella's base salary was reportedly $669,167.
Nadella was criticised for the statement and he later apologized on Twitter.
He then sent an email to Microsoft employees admitting he was "Completely wrong.".
He followed that by purchasing Xamarin for an undisclosed amount.
He oversaw the purchase of professional network LinkedIn in 2016 for $26.2 billion.
In 2018, he was a Time 100 honoree..
She was his junior at Manipal pursuing a B.Arch in the Faculty of Architecture.
His son Zain was a legally blind quadriplegic with cerebral palsy.
He also nurses a passion for cricket, having played on his school team.
He announced that the profits from the book would go to Microsoft Philanthropies and through that to nonprofit organizations.
