In [106]:
# Motivation: an innovative news app converts news articles into 60-word summaries.
# Text summarization is the process of generating a concise and meaningful summary of text from sources such as books, news articles, blog posts, research papers, emails, and tweets.
# Early automatic text summarization used features such as word frequency and phrase frequency to extract important sentences from the text.

In [107]:
# Other approaches used the presence of cue words, title words appearing in the text, and the location of sentences
# to extract significant sentences for summarization.
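As a rough illustration of the frequency-based idea described above, the sketch below scores each sentence by the average corpus frequency of its words and picks the highest-scoring one. The function name `score_sentences` and the toy sentences are hypothetical and are not part of this notebook's pipeline.

```python
import re
from collections import Counter

def score_sentences(sentences):
    """Score each sentence by the average corpus frequency of its words."""
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    freq = Counter(w for words in tokenized for w in words)
    scores = []
    for words in tokenized:
        # Use the average so that long sentences are not favoured automatically.
        scores.append(sum(freq[w] for w in words) / len(words) if words else 0.0)
    return scores

# Hypothetical toy input
toy_sentences = [
    "Federer won the match.",
    "The match was played in Basel.",
    "Rain is expected tomorrow.",
]
toy_scores = score_sentences(toy_sentences)
best = toy_sentences[toy_scores.index(max(toy_scores))]
print(best)
```

Sentences that reuse frequent words rank higher. That is the intuition behind frequency-based extraction, though real systems typically also filter stopwords first.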

In [108]:
# Before looking at the TextRank algorithm, there is another algorithm we should become familiar with: the PageRank algorithm.

In [109]:
## PageRank is used for ranking web pages in online search results.

In [110]:
## Each web page is assigned a PageRank score. This score is the probability of a user visiting that page.

In [111]:
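# The links between pages are captured in a square matrix M with one row and one column per page. Each element of this matrix denotes the probability of a user transitioning from one web page to another. For example, the cell in row w1 and column w2 contains the probability of a transition from w1 to w2.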
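To make the transition matrix concrete, here is a minimal sketch of PageRank computed by power iteration with NumPy. The four-page link structure, the function name `pagerank_power_iteration`, and the damping factor of 0.85 are assumptions chosen for illustration. The notebook itself relies on `networkx` for this step.

```python
import numpy as np

def pagerank_power_iteration(M, damping=0.85, tol=1e-10, max_iter=1000):
    """Return PageRank scores for a row-stochastic transition matrix M.

    M[i, j] is the probability of moving from page i to page j.
    """
    n = M.shape[0]
    rank = np.full(n, 1.0 / n)  # start from a uniform distribution
    for _ in range(max_iter):
        # Follow a link with probability `damping`, otherwise jump to a random page.
        new_rank = (1.0 - damping) / n + damping * (M.T @ rank)
        if np.abs(new_rank - rank).sum() < tol:
            return new_rank
        rank = new_rank
    return rank

# Hypothetical web of four pages:
# w1 links to w2 and w3, w2 links to w3, w3 links to w1, w4 links to w3.
M = np.array([
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])
pr_scores = pagerank_power_iteration(M)
print(pr_scores)
```

In this toy graph w3 receives links from every other page and ends up with the highest score, while w4 has no incoming links and keeps only the random-jump share.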

In [112]:
import numpy as np
import pandas as pd
import nltk
nltk.download('punkt') # one time execution
import re

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\sbha69\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [113]:
## Read the dataset...
TextArticle = pd.read_csv("D:\\DataScience\\TextSummerizationUsingSentenceEmbedding\\TextRank-TextSummerization\\tennis_articles_v4.csv")

In [114]:
## Inspect the data..
TextArticle.head()

Unnamed: 0,article_id,article_text,source
0,1,Maria Sharapova has basically no friends as te...,https://www.tennisworldusa.org/tennis/news/Mar...
1,2,"BASEL, Switzerland (AP), Roger Federer advance...",http://www.tennis.com/pro-game/2018/10/copil-s...
2,3,Roger Federer has revealed that organisers of ...,https://scroll.in/field/899938/tennis-roger-fe...
3,4,Kei Nishikori will try to end his long losing ...,http://www.tennis.com/pro-game/2018/10/nishiko...
4,5,"Federer, 37, first broke through on tour over ...",https://www.express.co.uk/sport/tennis/1036101...


In [115]:
TextArticle['article_text'][0]

"Maria Sharapova has basically no friends as tennis players on the WTA Tour. The Russian player has no problems in openly speaking about it and in a recent interview she said: 'I don't really hide any feelings too much. I think everyone knows this is my job here. When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net.So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match. I'm a pretty competitive girl. I say my hellos, but I'm not sending any players flowers as well. Uhm, I'm not really friendly or close to many players. I have not a lot of friends away from the courts.' When she said she is not really close to a lot of players, is that something strategic that she is doing? Is it different on the men's tour than the women's tour? 'No, not at all. I think just because you're in the same 

In [116]:
TextArticle['article_text'][1]

"BASEL, Switzerland (AP), Roger Federer advanced to the 14th Swiss Indoors final of his career by beating seventh-seeded Daniil Medvedev 6-1, 6-4 on Saturday. Seeking a ninth title at his hometown event, and a 99th overall, Federer will play 93th-ranked Marius Copil on Sunday. Federer dominated the 20th-ranked Medvedev and had his first match-point chance to break serve again at 5-1. He then dropped his serve to love, and let another match point slip in Medvedev's next service game by netting a backhand. He clinched on his fourth chance when Medvedev netted from the baseline. Copil upset expectations of a Federer final against Alexander Zverev in a 6-3, 6-7 (6), 6-4 win over the fifth-ranked German in the earlier semifinal. The Romanian aims for a first title after arriving at Basel without a career win over a top-10 opponent. Copil has two after also beating No. 6 Marin Cilic in the second round. Copil fired 26 aces past Zverev and never dropped serve, clinching after 2 1/2 hours with

In [117]:
TextArticle['article_text'][2]

'Roger Federer has revealed that organisers of the re-launched and condensed Davis Cup gave him three days to decide if he would commit to the controversial competition. Speaking at the Swiss Indoors tournament where he will play in Sundays final against Romanian qualifier Marius Copil, the world number three said that given the impossibly short time frame to make a decision, he opted out of any commitment. "They only left me three days to decide", Federer said. "I didn\'t to have time to consult with all the people I had to consult. "I could not make a decision in that time, so I told them to do what they wanted." The 20-time Grand Slam champion has voiced doubts about the wisdom of the one-week format to be introduced by organisers Kosmos, who have promised the International Tennis Federation up to $3 billion in prize money over the next quarter-century. The competition is set to feature 18 countries in the November 18-24 finals in Madrid next year, and will replace the classic home-

In [118]:
## Split the text into individual sentences...
from nltk.tokenize import sent_tokenize
sentences = []
for s in TextArticle['article_text']:
    sentences.append(sent_tokenize(s))

sentences = [y for x in sentences for y in x] # flatten list

In [119]:
sentences[:5]

['Maria Sharapova has basically no friends as tennis players on the WTA Tour.',
 "The Russian player has no problems in openly speaking about it and in a recent interview she said: 'I don't really hide any feelings too much.",
 'I think everyone knows this is my job here.',
 "When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net.So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match.",
 "I'm a pretty competitive girl."]

In [120]:
## Download GloVe Word Embeddings
# GloVe word embeddings are vector representations of words. These word embeddings will be used to create vectors for our
# sentences. We could also have used the Bag-of-Words or TF-IDF approaches to create features for our sentences, but these
# methods ignore the order of the words (and the number of features is usually pretty large).
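The order-insensitivity of Bag-of-Words mentioned above can be seen in a small sketch: two sentences with the same words in a different order get identical count vectors. The helper `bag_of_words` and the example sentences are hypothetical. Note that the averaged GloVe vectors built later in this notebook are also order-insensitive, so the more practical gain here is a dense, fixed 100-dimensional representation rather than one feature per vocabulary word.

```python
from collections import Counter

def bag_of_words(sentence, vocabulary):
    """Return a count vector for `sentence` over a fixed `vocabulary`."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]

# Hypothetical examples: same words, opposite meaning
a = "federer beat medvedev"
b = "medvedev beat federer"
vocabulary = sorted(set(a.split()) | set(b.split()))

vec_a = bag_of_words(a, vocabulary)
vec_b = bag_of_words(b, vocabulary)
print(vocabulary)
print(vec_a, vec_b, vec_a == vec_b)
```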

In [121]:
# D:\DataScience\TextSummerizationUsingSentenceEmbedding\TextRank-TextSummerization\Globe6B

In [122]:
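# Extract word vectors
word_embeddings = {}
# Raw string so that the backslashes in the Windows path are not treated as escape sequences
with open(r'D:\DataScience\TextSummerizationUsingSentenceEmbedding\TextRank-TextSummerization\Globe6B\glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]  # first token is the word itself
        coefs = np.asarray(values[1:], dtype='float32')  # remaining tokens are the vector components
        word_embeddings[word] = coefs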

In [123]:
len(word_embeddings)
# We now have word vectors for 400,000 different terms stored in the dictionary – ‘word_embeddings’.

400000

In [124]:
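# Text Preprocessing
# remove punctuation, numbers and special characters
# (regex=True is needed in pandas 2.0+, where str.replace treats the pattern as a literal string by default)
clean_sentences = pd.Series(sentences).str.replace("[^a-zA-Z]", " ", regex=True)

# convert to lowercase
clean_sentences = [s.lower() for s in clean_sentences]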
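To see what this cleaning step does to a single sentence, the sketch below applies the same character class with the standard-library `re` module. The helper name `clean_sentence` and the example string are hypothetical.

```python
import re

def clean_sentence(sentence):
    """Replace every non-letter character with a space, then lowercase."""
    return re.sub(r"[^a-zA-Z]", " ", sentence).lower()

example = "Federer, 37, won 6-1, 6-4!"
cleaned = clean_sentence(example)
print(cleaned.split())
```

Because each removed character becomes a space, the string keeps its length and digits vanish entirely. This is also why fragments such as "th" (from "14th") survive in the cleaned sentences.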

In [125]:
# Get rid of the stopwords (commonly used words of a language – is, am, the, of, in, etc.) present in the sentences. If you have not downloaded nltk-stopwords, then execute the following line of code:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sbha69\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [126]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))  # a set makes membership tests fast

In [127]:
# Let’s define a function to remove these stopwords from our dataset.
def remove_stopwords(sen):
    # sen is a list of tokens; the result is a single space-joined string
    sen_new = " ".join([i for i in sen if i not in stop_words])
    return sen_new

In [128]:
# remove stopwords from the sentences
clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]
# We will use clean_sentences to create vectors for sentences in our data with the help of the GloVe word vectors.

In [129]:
clean_sentences

['maria sharapova basically friends tennis players wta tour',
 'russian player problems openly speaking recent interview said really hide feelings much',
 'think everyone knows job',
 'courts court playing competitor want beat every single person whether locker room across net one strike conversation weather know next minutes go try win tennis match',
 'pretty competitive girl',
 'say hellos sending players flowers well',
 'uhm really friendly close many players',
 'lot friends away courts',
 'said really close lot players something strategic',
 'different men tour women tour',
 '',
 'think sport mean friends everyone categorized tennis player going get along tennis players',
 'think every person different interests',
 'friends completely different jobs interests met different parts life',
 'think everyone thinks tennis players greatest friends',
 'ultimately tennis small part',
 'many things interested',
 'basel switzerland ap roger federer advanced th swiss indoors final career beati

In [130]:
# Vector Representation of Sentences
# Extract word vectors (same loading step as before)
word_embeddings = {}
# Raw string so that the backslashes in the Windows path are not treated as escape sequences
with open(r'D:\DataScience\TextSummerizationUsingSentenceEmbedding\TextRank-TextSummerization\Globe6B\glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        word_embeddings[word] = coefs

In [131]:
#word_embeddings

In [136]:
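# Build one 100-dimensional vector per sentence by averaging the GloVe vectors of its words.
# Words missing from GloVe contribute a zero vector; empty sentences get an all-zero vector.
sentence_vectors = []
for i in clean_sentences:
    if len(i) != 0:
        # the +0.001 keeps the denominator strictly positive
        v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()])/(len(i.split())+0.001)
    else:
        v = np.zeros((100,))
    sentence_vectors.append(v)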
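The averaging step can be checked by hand on a toy example. The sketch below uses hypothetical 3-dimensional embeddings instead of the 100-dimensional GloVe vectors, and a plain mean instead of the `+0.001` denominator, since the empty case is handled separately. The names `toy_embeddings` and `sentence_vector` are illustrative only.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings (the notebook uses 100-dimensional GloVe vectors)
toy_embeddings = {
    "tennis": np.array([1.0, 0.0, 2.0]),
    "match": np.array([3.0, 2.0, 0.0]),
}
DIM = 3

def sentence_vector(sentence, embeddings, dim):
    """Average the word vectors of `sentence`; unknown words count as zero vectors."""
    words = sentence.split()
    if not words:
        return np.zeros(dim)
    vectors = [embeddings.get(w, np.zeros(dim)) for w in words]
    return np.mean(vectors, axis=0)

known = sentence_vector("tennis match", toy_embeddings, DIM)
with_unknown = sentence_vector("tennis xyz", toy_embeddings, DIM)
empty = sentence_vector("", toy_embeddings, DIM)
print(known, with_unknown, empty)
```

An out-of-vocabulary word pulls the average towards zero, because it still counts in the denominator. The notebook's loop behaves the same way.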

In [137]:
sentence_vectors[0]

array([ 5.14825583e-02,  1.10544682e-01,  6.94999397e-01,  1.89168096e-01,
       -9.58077684e-02,  3.20288986e-01,  2.70662010e-01,  5.42440832e-01,
       -3.05938005e-01, -1.56364068e-01,  3.70127618e-01,  8.09492469e-02,
        8.41393881e-03,  2.47571543e-01, -3.69342804e-01, -7.61044994e-02,
        8.08582604e-02,  2.30643645e-01, -2.70402402e-01,  5.13828397e-01,
       -6.12548441e-02,  3.87900352e-01,  1.03121363e-01,  7.72494674e-01,
        2.59960234e-01, -7.96069205e-02,  1.42143592e-01, -9.62644577e-01,
        7.54904330e-01,  6.03260659e-02, -4.58570123e-01,  2.36780301e-01,
        2.29152635e-01, -1.56453326e-01,  3.97632688e-01, -2.32720934e-02,
       -5.05520999e-01,  4.13252831e-01, -2.85759270e-01, -1.35231465e-01,
       -1.37098104e-01, -1.48972601e-01,  3.37537557e-01, -3.49540442e-01,
        1.53484434e-01, -2.33341649e-01, -1.98460802e-01, -1.27821520e-01,
        5.08063912e-01, -3.68636876e-01, -2.28472307e-01, -3.15306723e-01,
        1.36149466e-01,  

In [138]:
### Similarity matrix preparation
# The next step is to find similarities between the sentences, using cosine similarity. Let’s create an empty similarity matrix and populate it with the cosine similarities of the sentences.
# similarity matrix
sim_mat = np.zeros([len(sentences), len(sentences)])

In [139]:
# We will use Cosine Similarity to compute the similarity between a pair of sentences.
from sklearn.metrics.pairwise import cosine_similarity

In [140]:
for i in range(len(sentences)):
    for j in range(len(sentences)):
        if i != j:
            sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,100), sentence_vectors[j].reshape(1,100))[0,0]
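The nested loop makes one library call per sentence pair, which becomes slow as the number of sentences grows. As a possible alternative, the sketch below computes the whole matrix at once with NumPy. The function name `cosine_similarity_matrix` and the toy vectors are hypothetical, and all-zero rows (such as empty sentences) are assigned a similarity of 0 to everything.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarities with a zeroed diagonal."""
    X = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    unit = X / norms
    sim = unit @ unit.T
    np.fill_diagonal(sim, 0.0)  # ignore self-similarity, as the loop does with i != j
    return sim

# Hypothetical 2-dimensional sentence vectors; the last one mimics an empty sentence
toy_vectors = [
    np.array([1.0, 0.0]),
    np.array([1.0, 1.0]),
    np.array([0.0, 0.0]),
]
toy_sim = cosine_similarity_matrix(toy_vectors)
print(toy_sim)
```

scikit-learn's `cosine_similarity` also accepts a full 2-D array and returns the pairwise matrix, so passing the stacked sentence vectors in a single call is another option.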

In [141]:
sim_mat[0][0]

0.0

In [142]:
## Applying the PageRank algorithm...
#Before proceeding further, let’s convert the similarity matrix sim_mat into a graph. The nodes of this graph will represent the sentences and the edges will represent the similarity scores between the sentences. On this graph, we will apply the PageRank algorithm to arrive at the sentence rankings.
import networkx as nx

nx_graph = nx.from_numpy_array(sim_mat)
scores = nx.pagerank(nx_graph)

In [143]:
## summary extraction...
ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)

In [144]:
# Extract top 10 sentences as the summary
for i in range(10):
    print(ranked_sentences[i][1])

When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net.So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match.
Major players feel that a big event in late November combined with one in January before the Australian Open will mean too much tennis and too little rest.
Speaking at the Swiss Indoors tournament where he will play in Sundays final against Romanian qualifier Marius Copil, the world number three said that given the impossibly short time frame to make a decision, he opted out of any commitment.
"I felt like the best weeks that I had to get to know players when I was playing were the Fed Cup weeks or the Olympic weeks, not necessarily during the tournaments.
Currently in ninth place, Nishikori with a win could move to within 125 points of the cut for the eight-man event in London 
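Printing sentences in rank order can make the summary jump between articles and topics. One possible variation, sketched below under the assumption that the scores are a dict keyed by sentence index (the shape `nx.pagerank` returns), selects the top `n` sentences and then restores their original order. The helper `top_n_in_document_order` and the toy data are hypothetical.

```python
import heapq

def top_n_in_document_order(scores, sentences, n):
    """Pick the n highest-scoring sentences, then restore their original order."""
    top_indices = heapq.nlargest(n, range(len(sentences)), key=lambda i: scores[i])
    return [sentences[i] for i in sorted(top_indices)]

# Hypothetical toy data
toy_sentences = ["A opens.", "B adds detail.", "C is a tangent.", "D concludes."]
toy_scores = {0: 0.30, 1: 0.15, 2: 0.10, 3: 0.45}
summary = top_n_in_document_order(toy_scores, toy_sentences, 2)
print(summary)
```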