## Text Summarization - (tf-idf)

#### STEP 1 : Data cleaning 
#### STEP 2 : Score of sentences (tf-idf)
#### STEP 3 : Summary Generation

## Initial Phase
### Importing Libraries and Reading Data 

In [1]:
### importing the necessary libraries

from nltk.corpus import stopwords
import numpy as np
import pandas
import nltk
import re
# from __future__ import division

In [2]:
df = pandas.read_csv('Data/tennis_articles_v4.csv')

In [3]:
print(df['article_text'])

for a in df['article_text']:
    print(len(a))

0    Maria Sharapova has basically no friends as te...
1    BASEL, Switzerland (AP), Roger Federer advance...
2    Roger Federer has revealed that organisers of ...
3    Kei Nishikori will try to end his long losing ...
4    Federer, 37, first broke through on tour over ...
5    Nadal has not played tennis since he was force...
6    Tennis giveth, and tennis taketh away. The end...
7    Federer won the Swiss Indoors last week by bea...
Name: article_text, dtype: object
1561
1331
2063
1341
2076
1545
1079
1833


### Tokenizing sentences into words which would be used for calculating tf-idf scores

In [4]:
### tokenized the sentences from the different news articles

from nltk.tokenize import sent_tokenize
s = ""
for a in df['article_text']:
      s += a
sentences = sent_tokenize(s)
# sentences

## STEP 1 : Data Cleaning
### Cleaning sentences, by removing Non Alphabet Characters and converting to Lower Case Letters

In [5]:
### pre processes the sentences by removing non alphabet characters and converting them to lower case letters 
### and stored in variable text

dict = {}
text=""
for a in sentences:
    temp = re.sub("[^a-zA-Z]"," ",a)
    temp = temp.lower()
    dict[temp] = a
    text+=temp
text

'maria sharapova has basically no friends as tennis players on the wta tour the russian player has no problems in openly speaking about it and in a recent interview she said   i don t really hide any feelings too much i think everyone knows this is my job here when i m on the courts or when i m on the court playing  i m a competitor and i want to beat every single person whether they re in the locker room or across the net so i m not the one to strike up a conversation about the weather and know that in the next few minutes i have to go and try to win a tennis match i m a pretty competitive girl i say my hellos  but i m not sending any players flowers as well uhm  i m not really friendly or close to many players i have not a lot of friends away from the courts  when she said she is not really close to a lot of players  is that something strategic that she is doing is it different on the men s tour than the women s tour  no  not at all i think just because you re in the same sport doesn

## STEP 2 : Getting tf-idf score of sentences
### Finding term frequency ( tf ) of words found in text

In [6]:
### calculated the frequency of the words found in text

stopwords = nltk.corpus.stopwords.words('english')
word_frequencies = {}
for word in nltk.word_tokenize(text):
    if word not in stopwords:
        if word not in word_frequencies.keys():
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1
print (word_frequencies)

{'maria': 1, 'sharapova': 1, 'basically': 1, 'friends': 6, 'tennis': 12, 'players': 16, 'wta': 2, 'tour': 5, 'russian': 1, 'player': 2, 'problems': 1, 'openly': 1, 'speaking': 2, 'recent': 1, 'interview': 1, 'said': 11, 'really': 6, 'hide': 1, 'feelings': 1, 'much': 3, 'think': 8, 'everyone': 3, 'knows': 1, 'job': 1, 'courts': 2, 'court': 6, 'playing': 3, 'competitor': 1, 'want': 1, 'beat': 1, 'every': 3, 'single': 1, 'person': 2, 'whether': 2, 'locker': 2, 'room': 1, 'across': 2, 'net': 1, 'one': 4, 'strike': 1, 'conversation': 1, 'weather': 1, 'know': 3, 'next': 8, 'minutes': 1, 'go': 2, 'try': 3, 'win': 9, 'match': 7, 'pretty': 2, 'competitive': 1, 'girl': 1, 'say': 2, 'hellos': 1, 'sending': 1, 'flowers': 1, 'well': 2, 'uhm': 1, 'friendly': 3, 'close': 3, 'many': 3, 'lot': 5, 'away': 3, 'something': 1, 'strategic': 1, 'different': 4, 'men': 1, 'women': 1, 'sport': 1, 'mean': 2, 'categorized': 1, 'going': 2, 'get': 4, 'along': 2, 'interests': 2, 'completely': 1, 'jobs': 1, 'met': 1,

### Finding weighted frequency of the words

In [7]:
### finding weighted frequency of the words

max_freq = max(word_frequencies.values())

for w in word_frequencies :
      word_frequencies[w]/=max_freq
# print word_frequencies

### Calculating sentence scores from the word frequncies

In [8]:
### calculating sentence scores from the word frequncies

sentence_scores = {}
for sent in sentences:
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies.keys():
            if len(sent.split(' ')) < 30:
                if sent not in sentence_scores.keys():
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]

## STEP 3 : Summary Generation
### Outputting the top 17 sentences as the summary

In [9]:
### getting the summary by taking top score sentences

import heapq
summary_sentences = heapq.nlargest(17, sentence_scores, key=sentence_scores.get)
summary = ' '.join(summary_sentences)

In [10]:
summary_sentences

['Federer has been handed a difficult draw where could could come across Kevin Anderson, Novak Djokovic and Rafael Nadal in the latter rounds.',
 'He used his first break point to close out the first set before going up 3-0 in the second and wrapping up the win on his first match point.',
 "Federer's projected route to the Paris final could also lead to matches against Kevin Anderson and Novak Djokovic.",
 'Two players, Stefanos Tsitsipas and Kyle Edmund, won their first career ATP titles last week (13:26).',
 "'BASEL, Switzerland (AP), Roger Federer advanced to the 14th Swiss Indoors final of his career by beating seventh-seeded Daniil Medvedev 6-1, 6-4 on Saturday.",
 "Nadal's appearance in Paris is a big boost to the tournament organisers who could see Roger Federer withdraw.",
 'Major players feel that a big event in late November combined with one in January before the Australian Open will mean too much tennis and too little rest.',
 'Meanwhile, Federer is hoping he can improve hi

In [11]:
summary

'Federer has been handed a difficult draw where could could come across Kevin Anderson, Novak Djokovic and Rafael Nadal in the latter rounds. He used his first break point to close out the first set before going up 3-0 in the second and wrapping up the win on his first match point. Federer\'s projected route to the Paris final could also lead to matches against Kevin Anderson and Novak Djokovic. Two players, Stefanos Tsitsipas and Kyle Edmund, won their first career ATP titles last week (13:26). \'BASEL, Switzerland (AP), Roger Federer advanced to the 14th Swiss Indoors final of his career by beating seventh-seeded Daniil Medvedev 6-1, 6-4 on Saturday. Nadal\'s appearance in Paris is a big boost to the tournament organisers who could see Roger Federer withdraw. Major players feel that a big event in late November combined with one in January before the Australian Open will mean too much tennis and too little rest. Meanwhile, Federer is hoping he can improve his service game as he hunts

In [12]:
len(summary)

2183