### Project: Summarizing a text using NLP

In [1]:
text = """Maria Sharapova has basically no friends as tennis players on the WTA Tour. The Russian player has no problems in openly speaking about it and in a recent interview she said: 'I don't really hide any feelings too much. I think everyone knows this is my job here.

When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net. So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match.

I'm a pretty competitive girl. I say my hellos, but I'm not sending any players flowers as well. Uhm, I'm not really friendly or close to many players. I have not a lot of friends away from the courts.' When she said she is not really close to a lot of players, is that something strategic that she is doing? Is it different on the men's tour than the women's tour? 'No, not at all. I think just because you're in the same sport doesn't mean that you have to be friends with everyone just because you're categorized, you're a tennis player, so you're going to get along with tennis players. I think every person has different interests.

I have friends that have completely different jobs and interests, and I've met them in very different parts of my life. I think everyone just thinks because we're tennis players we should be the greatest of friends. But ultimately tennis is just a very small part of what we do.

There are so many other things that we're interested in, that we do."""

In [2]:
len(text)

1565

In [3]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation

In [4]:
nlp = spacy.load('en_core_web_sm')

In [5]:
doc = nlp(text)

In [6]:
tokens = [token.text for token in doc]

In [7]:
print(tokens)

['Maria', 'Sharapova', 'has', 'basically', 'no', 'friends', 'as', 'tennis', 'players', 'on', 'the', 'WTA', 'Tour', '.', 'The', 'Russian', 'player', 'has', 'no', 'problems', 'in', 'openly', 'speaking', 'about', 'it', 'and', 'in', 'a', 'recent', 'interview', 'she', 'said', ':', "'", 'I', 'do', "n't", 'really', 'hide', 'any', 'feelings', 'too', 'much', '.', 'I', 'think', 'everyone', 'knows', 'this', 'is', 'my', 'job', 'here', '.', '\n\n', 'When', 'I', "'m", 'on', 'the', 'courts', 'or', 'when', 'I', "'m", 'on', 'the', 'court', 'playing', ',', 'I', "'m", 'a', 'competitor', 'and', 'I', 'want', 'to', 'beat', 'every', 'single', 'person', 'whether', 'they', "'re", 'in', 'the', 'locker', 'room', 'or', 'across', 'the', 'net', '.', 'So', 'I', "'m", 'not', 'the', 'one', 'to', 'strike', 'up', 'a', 'conversation', 'about', 'the', 'weather', 'and', 'know', 'that', 'in', 'the', 'next', 'few', 'minutes', 'I', 'have', 'to', 'go', 'and', 'try', 'to', 'win', 'a', 'tennis', 'match', '.', '\n\n', 'I', "'m", 

In [8]:
punctuation = punctuation + '\n'

In [9]:
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~\n'

### Text Cleaning

In [12]:
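# Count how often each word appears, skipping stop words and punctuation.
# Keys keep the token's original casing, so 'Tour' and 'tour' are counted separately.

word_freq = {}
stop_words = list(STOP_WORDS)
for word in doc:
    if word.text.lower() not in stop_words:
        # substring test against the punctuation string; '\n\n' tokens pass this filter
        if word.text.lower() not in punctuation:
            if word.text not in word_freq.keys():
                word_freq[word.text] = 1
            else:
                word_freq[word.text] += 1

print(word_freq)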

{'Maria': 1, 'Sharapova': 1, 'basically': 1, 'friends': 5, 'tennis': 6, 'players': 6, 'WTA': 1, 'Tour': 1, 'Russian': 1, 'player': 2, 'problems': 1, 'openly': 1, 'speaking': 1, 'recent': 1, 'interview': 1, 'said': 2, 'hide': 1, 'feelings': 1, 'think': 4, 'knows': 1, 'job': 1, '\n\n': 4, 'courts': 2, 'court': 1, 'playing': 1, 'competitor': 1, 'want': 1, 'beat': 1, 'single': 1, 'person': 2, 'locker': 1, 'room': 1, 'net': 1, 'strike': 1, 'conversation': 1, 'weather': 1, 'know': 1, 'minutes': 1, 'try': 1, 'win': 1, 'match': 1, 'pretty': 1, 'competitive': 1, 'girl': 1, 'hellos': 1, 'sending': 1, 'flowers': 1, 'Uhm': 1, 'friendly': 1, 'close': 2, 'lot': 2, 'away': 1, 'strategic': 1, 'different': 4, 'men': 1, 'tour': 2, 'women': 1, 'sport': 1, 'mean': 1, 'categorized': 1, 'going': 1, 'interests': 2, 'completely': 1, 'jobs': 1, 'met': 1, 'parts': 1, 'life': 1, 'thinks': 1, 'greatest': 1, 'ultimately': 1, 'small': 1, 'things': 1, 'interested': 1}
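
The counts above keep each token's original casing and include the `'\n\n'` paragraph breaks as if they were words. A minimal sketch of a case-folded count is shown here. It uses plain Python with a toy token list and a small hypothetical stop-word set (`toy_stop_words`) in place of spaCy's `STOP_WORDS`, and `count_words` is an illustrative helper, so it shows the idea rather than reproducing the cell above.

```python
from collections import Counter
from string import punctuation

# Toy inputs (assumptions for illustration, not the notebook's data)
tokens = ["Tour", "tour", "the", "Tennis", "tennis", ".", "\n\n", "players"]
toy_stop_words = {"the", "a", "on"}  # hypothetical stand-in for STOP_WORDS


def count_words(tokens, stop_words):
    """Count lowercased tokens, skipping stop words, punctuation and whitespace."""
    counts = Counter()
    for tok in tokens:
        word = tok.lower().strip()
        if not word or word in stop_words:
            continue  # whitespace-only token or stop word
        if all(ch in punctuation for ch in word):
            continue  # token made up entirely of punctuation characters
        counts[word] += 1
    return dict(counts)


word_counts = count_words(tokens, toy_stop_words)
print(word_counts)
```

With counts keyed by lowercase text, a later lookup with `word.text.lower()` would find every counted word regardless of how it was capitalized in the sentence.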


In [14]:
max_freq = max(word_freq.values())

In [15]:
# normalize: divide each count by the highest count so every score falls in (0, 1]

for word in word_freq.keys():
    word_freq[word] = word_freq[word]/max_freq

In [16]:
print(word_freq)

{'Maria': 0.16666666666666666, 'Sharapova': 0.16666666666666666, 'basically': 0.16666666666666666, 'friends': 0.8333333333333334, 'tennis': 1.0, 'players': 1.0, 'WTA': 0.16666666666666666, 'Tour': 0.16666666666666666, 'Russian': 0.16666666666666666, 'player': 0.3333333333333333, 'problems': 0.16666666666666666, 'openly': 0.16666666666666666, 'speaking': 0.16666666666666666, 'recent': 0.16666666666666666, 'interview': 0.16666666666666666, 'said': 0.3333333333333333, 'hide': 0.16666666666666666, 'feelings': 0.16666666666666666, 'think': 0.6666666666666666, 'knows': 0.16666666666666666, 'job': 0.16666666666666666, '\n\n': 0.6666666666666666, 'courts': 0.3333333333333333, 'court': 0.16666666666666666, 'playing': 0.16666666666666666, 'competitor': 0.16666666666666666, 'want': 0.16666666666666666, 'beat': 0.16666666666666666, 'single': 0.16666666666666666, 'person': 0.3333333333333333, 'locker': 0.16666666666666666, 'room': 0.16666666666666666, 'net': 0.16666666666666666, 'strike': 0.1666666
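
As a small worked example of the normalization step, the sketch below takes a handful of the raw counts printed earlier and divides each by the maximum count (6). It builds a new dictionary (`normalized`, a name chosen here for illustration) instead of overwriting the counts in place.

```python
# A few raw counts copied from the word_freq output above
counts = {"friends": 5, "tennis": 6, "players": 6, "think": 4, "player": 2}

max_count = max(counts.values())  # 6 for this subset, as in the full dictionary

# Divide every count by the maximum so the most frequent word scores 1.0
normalized = {word: count / max_count for word, count in counts.items()}

for word, score in normalized.items():
    print(f"{word}: {score:.3f}")
```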

### Sentence Tokenization

In [19]:
sentence_tokens = [sentence for sentence in doc.sents]
print(sentence_tokens)

[Maria Sharapova has basically no friends as tennis players on the WTA Tour., The Russian player has no problems in openly speaking about it and in a recent interview she said: 'I don't really hide any feelings too much., I think everyone knows this is my job here.

, When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net., So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match.

, I'm a pretty competitive girl., I say my hellos, but I'm not sending any players flowers as well., Uhm, I'm not really friendly or close to many players., I have not a lot of friends away from the courts.', When she said she is not really close to a lot of players, is that something strategic that she is doing?, Is it different on the men's tour than the women's tour?, 'No, not at all., I think just because yo

In [22]:
# Score each sentence by summing the normalized frequencies of its words.
# sent_score maps each sentence (key) to its score (value).

sent_score = {}
for sentence in sentence_tokens:
    for word in sentence:
        if word.text.lower() in word_freq.keys():
            if sentence not in sent_score.keys():
                sent_score[sentence] = word_freq[word.text.lower()]
            else:
                sent_score[sentence] += word_freq[word.text.lower()]        

In [23]:
print(sent_score)

{Maria Sharapova has basically no friends as tennis players on the WTA Tour.: 3.3333333333333335, The Russian player has no problems in openly speaking about it and in a recent interview she said: 'I don't really hide any feelings too much.: 1.8333333333333333, I think everyone knows this is my job here.

: 1.6666666666666665, When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net.: 2.1666666666666665, So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match.

: 2.9999999999999996, I'm a pretty competitive girl.: 0.5, I say my hellos, but I'm not sending any players flowers as well.: 1.5, Uhm, I'm not really friendly or close to many players.: 1.5, I have not a lot of friends away from the courts.': 1.6666666666666667, When she said she is not really close to a lot of players, is that some
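
The scoring loop looks up `word.text.lower()`, while the keys of `word_freq` keep their original casing. Capitalized entries such as `Maria` or `WTA` are therefore never matched, and `Tour` is matched through the separate lowercase key `tour`. The sketch below reproduces the arithmetic for the first sentence with plain strings; `score_sentence` is a hypothetical helper and the dictionary holds only the entries relevant to that sentence.

```python
# Normalized frequencies relevant to the first sentence (values from the output above)
word_freq = {
    "Maria": 1 / 6, "Sharapova": 1 / 6, "basically": 1 / 6,
    "friends": 5 / 6, "tennis": 1.0, "players": 1.0,
    "WTA": 1 / 6, "Tour": 1 / 6, "tour": 2 / 6,
}

sentence_words = ["Maria", "Sharapova", "has", "basically", "no", "friends",
                  "as", "tennis", "players", "on", "the", "WTA", "Tour", "."]


def score_sentence(words, freq):
    """Sum the frequencies of words whose lowercase form is a key in freq."""
    return sum(freq.get(w.lower(), 0.0) for w in words)


score = score_sentence(sentence_words, word_freq)
print(round(score, 2))  # (1 + 5 + 6 + 6 + 2) / 6, about 3.33
```

Only `basically`, `friends`, `tennis`, `players` and `tour` contribute, which matches the score of about 3.33 reported for that sentence.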

### Selecting the top 30% of sentences by score

In [25]:
from heapq import nlargest

In [27]:
len(sent_score)*0.3

5.1

30% of the 17 scored sentences is 5.1, so a 30% summary would keep at most 5 sentences. The next cell selects the top 8 instead.
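
One way to tie the number of selected sentences to the ratio, rather than typing it by hand, is sketched below. The sentence labels and scores are toy values, and `ratio` and `select_length` are names introduced here for illustration.

```python
from heapq import nlargest

# Toy sentence scores (assumed values, not the notebook's output)
sent_score = {"A": 3.3, "B": 1.8, "C": 1.7, "D": 2.2, "E": 3.0,
              "F": 0.5, "G": 1.5, "H": 1.4, "I": 1.6, "J": 2.9}

ratio = 0.3
# Truncate to a whole number of sentences, keeping at least one
select_length = max(1, int(len(sent_score) * ratio))

# nlargest iterates over the dict's keys and ranks them by their score
top_sentences = nlargest(select_length, sent_score, key=sent_score.get)
print(select_length, top_sentences)
```

Deriving the count from the ratio keeps the summary length in step with the document length when the input text changes.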

In [29]:
summary = nlargest(8, iterable = sent_score, key = sent_score.get)
# get function will return the score of each sentence from sent_score
# to help identify the top 8 highest-scoring sentences.

In [30]:
print(summary)

[I think just because you're in the same sport doesn't mean that you have to be friends with everyone just because you're categorized, you're a tennis player, so you're going to get along with tennis players., I think everyone just thinks because we're tennis players we should be the greatest of friends., Maria Sharapova has basically no friends as tennis players on the WTA Tour., I have friends that have completely different jobs and interests, and I've met them in very different parts of my life., So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match.

, I think every person has different interests.

, When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net., When she said she is not really close to a lot of players, is that something strategic that she is doing?]


In [31]:
final_summary = [sentence.text for sentence in summary]

In [32]:
print(final_summary)

["I think just because you're in the same sport doesn't mean that you have to be friends with everyone just because you're categorized, you're a tennis player, so you're going to get along with tennis players.", "I think everyone just thinks because we're tennis players we should be the greatest of friends.", 'Maria Sharapova has basically no friends as tennis players on the WTA Tour.', "I have friends that have completely different jobs and interests, and I've met them in very different parts of my life.", "So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match.\n\n", 'I think every person has different interests.\n\n', "When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net.", 'When she said she is not really close to a lot of players, is that something strategic that she is doing?']


In [34]:
summary = " ".join(final_summary)

In [35]:
print(summary)

I think just because you're in the same sport doesn't mean that you have to be friends with everyone just because you're categorized, you're a tennis player, so you're going to get along with tennis players. I think everyone just thinks because we're tennis players we should be the greatest of friends. Maria Sharapova has basically no friends as tennis players on the WTA Tour. I have friends that have completely different jobs and interests, and I've met them in very different parts of my life. So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match.

 I think every person has different interests.

 When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net. When she said she is not really close to a lot of players, is that something strategic that she is doing?
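
The blank lines inside the printed summary come from sentence texts that end in `'\n\n'`. A minimal sketch of removing them before joining is shown below, using a shortened toy list in place of `final_summary`; `clean_summary` is a name introduced here.

```python
# Toy stand-in for final_summary; two entries keep their trailing paragraph break
final_summary = [
    "I think everyone just thinks because we're tennis players we should be the greatest of friends.",
    "So I'm not the one to strike up a conversation about the weather.\n\n",
    "I think every person has different interests.\n\n",
]

# Strip surrounding whitespace from each sentence, then join with single spaces
clean_summary = " ".join(sentence.strip() for sentence in final_summary)
print(clean_summary)
```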


In [36]:
len(summary)

969

In [37]:
len(summary)/len(text)

0.6191693290734824

The summary is about 62% of the original text's length (969 of 1,565 characters).