# Extractive Summarization - Word Rankings

**Extractive summarization** is a text summarization technique that selects and assembles sentences or phrases directly from the source text to create a concise summary. Instead of generating new sentences, it extracts the most relevant and important content while preserving the original wording. We shall explore scoring algorithms for extractive summarization.

Scoring algorithms in text summarization assign scores to sentences based on criteria like word frequency, position, and keywords. These scores quantify sentence importance, aiding in content selection for summaries. Centrality-based methods, like PageRank, model sentence relationships in a graph. Statistical methods, like TF-IDF, weigh words' significance. Supervised learning utilizes features such as sentence length and context to predict importance. Hybrid approaches combine multiple indicators for robust scoring. Effective scoring algorithms enhance automated summarization by identifying pivotal content and shaping concise, coherent summaries.

## Objectives:
To explore two ways to do extractive summarization on the same text, and to evaluate both:

     1. Sentence Scoring Method
     2. TF-IDF Approach
     3. Evaluation with the ROUGE score

## Sentence Scoring Method

Sentence scores in summarization quantify a sentence's relevance and importance. Calculated using factors like word frequency, position, and semantic meaning, scores help rank sentences for inclusion in summaries. Higher scores indicate greater significance, aiding in content selection and producing concise, informative summaries.

**Steps to take**

    1. Preparing the data
    2. Processing the data
    3. Tokenizing the article into sentences
    4. Finding the weighted frequencies of the words
    5. Calculating the score of each sentence
    6. Getting the summary
    7. Evaluation

### Preparing the data

In [1]:
sample_text = """
 Maria Sharapova has basically no friends as tennis players on the WTA Tour. The Russian player has no problems in openly speaking about it and in a recent interview she said: 'I don't really hide any feelings too much. 
 I think everyone knows this is my job here. When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net.
 So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match. 
 I'm a pretty competitive girl. I say my hellos, but I'm not sending any players flowers as well. Uhm, I'm not really friendly or close to many players.
 I have not a lot of friends away from the courts.' When she said she is not really close to a lot of players, is that something strategic that she is doing? Is it different on the men's tour than the women's tour? 'No, not at all.
 I think just because you're in the same sport doesn't mean that you have to be friends with everyone just because you're categorized, you're a tennis player, so you're going to get along with tennis players. 
 I think every person has different interests. I have friends that have completely different jobs and interests, and I've met them in very different parts of my life.
 I think everyone just thinks because we're tennis players we should be the greatest of friends. But ultimately tennis is just a very small part of what we do. 
 There are so many other things that we're interested in, that we do.'
 """

### Importing Libraries Required

In [2]:
import nltk
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
# sent_tokenize is used for splitting text into sentences
# word_tokenize is used for splitting sentences into individual words

from collections import Counter # to count how often each word occurs

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/tanmaysharma/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Processing and tokenizing the data

In [3]:
# tokenization
sentences = sent_tokenize(sample_text)

# removing stop words
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
# tokenize the sample text, lowercase each word, and drop stop words and non-alphanumeric tokens
words = [word.lower() for word in word_tokenize(sample_text) if word.lower() not in stop_words and word.isalnum()]

# frequency for each individual word
word_freq = Counter(words)

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/tanmaysharma/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [4]:
max_frequency = word_freq.most_common(1)[0][1]
max_frequency

6

### Calculating Weighted Frequency

In [5]:
# divide every count by the highest count so the weights fall between 0 and 1
for word in word_freq:
    word_freq[word] /= max_frequency
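
As a small worked illustration of this normalization, the sketch below applies the same idea to a toy word list. The words and the `toy_` names are made up for the example and are not taken from the article: each count is divided by the highest count, so the most frequent word gets a weight of 1.0.

```python
from collections import Counter

# Toy tokens, invented for illustration only
toy_words = ["tennis", "tennis", "tennis", "friends", "friends", "tour"]
toy_freq = Counter(toy_words)

# Highest raw count in the toy data
toy_max = max(toy_freq.values())

# Weighted frequency = raw count / highest count
toy_weighted = {word: count / toy_max for word, count in toy_freq.items()}
print(toy_weighted)
```

Here `tennis` ends up at 1.0, while `friends` and `tour` get two thirds and one third of that weight.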

In [6]:
word_freq

Counter({'maria': 0.16666666666666666,
         'sharapova': 0.16666666666666666,
         'basically': 0.16666666666666666,
         'friends': 0.8333333333333334,
         'tennis': 1.0,
         'players': 1.0,
         'wta': 0.16666666666666666,
         'tour': 0.5,
         'russian': 0.16666666666666666,
         'player': 0.3333333333333333,
         'problems': 0.16666666666666666,
         'openly': 0.16666666666666666,
         'speaking': 0.16666666666666666,
         'recent': 0.16666666666666666,
         'interview': 0.16666666666666666,
         'said': 0.3333333333333333,
         'really': 0.5,
         'hide': 0.16666666666666666,
         'feelings': 0.16666666666666666,
         'much': 0.16666666666666666,
         'think': 0.6666666666666666,
         'everyone': 0.5,
         'knows': 0.16666666666666666,
         'job': 0.16666666666666666,
         'courts': 0.3333333333333333,
         'court': 0.16666666666666666,
         'playing': 0.16666666666666666,
  

### Sentence Scoring or Sentence threshold calculation

In [7]:
# dictionary to hold sentence scores
sentence_scores = {}

# scoring each sentence
for sentence in sentences:

    # tokenize the sentence, lowercase each word, and drop stop words and non-alphanumeric tokens
    word_sentence = [word.lower() for word in word_tokenize(sentence) if word.lower() not in stop_words and word.isalnum()]
    # summing the weighted frequencies of the remaining words
    sentence_score = sum([word_freq[word] for word in word_sentence])
    # storing the score for this sentence (no sentence-length limit is applied here)
    sentence_scores[sentence] = sentence_score
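
A common refinement of this kind of scoring is to leave very long sentences out of the ranking. The sketch below shows one way to do that. The 25-word cap, the regex tokenizer, and the names `score_sentences` and `max_words` are assumptions made for illustration, not part of this notebook's pipeline.

```python
import re

def score_sentences(sentences, word_weights, max_words=25):
    """Sum word weights per sentence, skipping sentences longer than max_words."""
    scores = {}
    for sentence in sentences:
        # Simple regex tokenizer, used instead of NLTK to keep the sketch self-contained
        tokens = re.findall(r"[a-z0-9]+", sentence.lower())
        if len(tokens) > max_words:
            continue  # too long to be a useful summary sentence
        # Words missing from word_weights contribute 0
        scores[sentence] = sum(word_weights.get(token, 0.0) for token in tokens)
    return scores

toy_weights = {"tennis": 1.0, "friends": 0.5}
toy_sentences = ["Tennis friends are rare.", "Tennis is a job.", "word " * 30]
toy_scores = score_sentences(toy_sentences, toy_weights)
print(toy_scores)
```

The third toy sentence has 30 words, so it never enters the score dictionary and cannot be selected later.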

### Getting the Summary

In [8]:
# selecting the top 5 sentences 
n = 5
summary_sentences = sorted(sentence_scores, key=sentence_scores.get, reverse=True)[:n]
summary = ' '.join(summary_sentences)
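
Joining sentences in score order, as in the cell above, can shuffle the narrative of the source text. One option is to select by score and then restore the original order. A minimal sketch with made-up sentences and scores follows; `top_n_in_order` is a hypothetical helper name.

```python
def top_n_in_order(sentences, scores, n):
    """Pick the n highest-scoring sentences, then return them in source order."""
    # Rank sentence indices by score, highest first
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Keep the top n and sort the indices back into document order
    keep = sorted(ranked[:n])
    return [sentences[i] for i in keep]

toy_sentences = ["First.", "Second.", "Third.", "Fourth."]
toy_scores = [0.2, 0.9, 0.1, 0.7]
print(top_n_in_order(toy_sentences, toy_scores, 2))  # ['Second.', 'Fourth.']
```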

In [9]:
summary

"I think just because you're in the same sport doesn't mean that you have to be friends with everyone just because you're categorized, you're a tennis player, so you're going to get along with tennis players. I think everyone just thinks because we're tennis players we should be the greatest of friends. \n Maria Sharapova has basically no friends as tennis players on the WTA Tour. I have friends that have completely different jobs and interests, and I've met them in very different parts of my life. So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match."

In [10]:
sample_text

"\n Maria Sharapova has basically no friends as tennis players on the WTA Tour. The Russian player has no problems in openly speaking about it and in a recent interview she said: 'I don't really hide any feelings too much. \n I think everyone knows this is my job here. When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net.\n So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match. \n I'm a pretty competitive girl. I say my hellos, but I'm not sending any players flowers as well. Uhm, I'm not really friendly or close to many players.\n I have not a lot of friends away from the courts.' When she said she is not really close to a lot of players, is that something strategic that she is doing? Is it different on the men's tour than the women's tour? 'No, not at all.\n I think just because you

### Rouge Score

In [11]:
!pip install rouge



In [12]:
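from rouge import Rouge

rouge = Rouge()
# get_scores takes (hypothesis, reference); the full text is passed first and the summary second,
# so recall ('r') is measured against the summary and precision ('p') against the full text
scores = rouge.get_scores(sample_text, summary)
scores[0]['rouge-1']['f']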

0.626086952220794

In [13]:
scores

[{'rouge-1': {'r': 1.0, 'p': 0.45569620253164556, 'f': 0.626086952220794},
  'rouge-2': {'r': 0.9719626168224299,
   'p': 0.3939393939393939,
   'f': 0.5606468961649509},
  'rouge-l': {'r': 1.0, 'p': 0.45569620253164556, 'f': 0.626086952220794}}]
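
To make the `r`, `p`, and `f` fields more concrete, here is a hand-rolled unigram overlap computation in the spirit of ROUGE-1. It is a simplified sketch (whitespace tokenization, clipped counts, hypothetical function name `rouge1`), so its numbers may not match the `rouge` package exactly.

```python
from collections import Counter

def rouge1(hypothesis, reference):
    """Unigram overlap: recall over the reference, precision over the hypothesis."""
    hyp_counts = Counter(hypothesis.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts
    overlap = sum((hyp_counts & ref_counts).values())
    recall = overlap / max(sum(ref_counts.values()), 1)
    precision = overlap / max(sum(hyp_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"r": recall, "p": precision, "f": f1}

toy_result = rouge1("tennis is my job", "tennis is just a small part of my job")
print(toy_result)
```

In this toy case every hypothesis word appears in the reference, so precision is 1.0 while recall is only 4/9. A perfect value on one side with a much lower value on the other is the typical pattern when one text is fully contained in the other.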

## TF-IDF Approach:

TF-IDF (Term Frequency-Inverse Document Frequency) is a text analysis technique that evaluates a word's importance within a document relative to its occurrence across a corpus. It multiplies how often a word appears in a document (TF) by a factor that shrinks as the word becomes more common across the corpus (IDF). High TF-IDF scores therefore identify words that are frequent in one document but rare elsewhere, which often convey its main theme.
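
As a rough worked example of the idea, the sketch below computes TF-IDF weights by hand for three toy "documents". It assumes the smoothed IDF variant, idf(t) = ln((1 + n) / (1 + df(t))) + 1, which scikit-learn documents as its default, and it skips the length normalization that `TfidfVectorizer` also applies. The toy documents and the function name `tfidf_weights` are made up for illustration.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {term: tf-idf} dict per document (no length normalization)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return weights

toy_docs = ["tennis players play tennis", "players have friends", "friends talk"]
toy_tfidf = tfidf_weights(toy_docs)
for doc_weights in toy_tfidf:
    print(doc_weights)
```

In the first toy document, `tennis` outweighs `players` for two reasons: it occurs twice there, and it appears in no other document.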

**Steps to follow**

    1. Tokenize all the sentences
    2. Create a TF-IDF matrix
    3. Use cosine similarity to get the relation between document and individual sentences.
    4. Rank the sentences.
    5. Get the summary.
    6. Evaluation.

### Importing Dependencies

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer # converts a collection of raw documents to a matrix of TF-IDF features
from sklearn.metrics.pairwise import cosine_similarity # cosine similarity to get similarity between two sentences
from heapq import nlargest # to get the top n sentences

### Sentence tokenization

In [15]:
sentences = sent_tokenize(sample_text)

### Create a TF-IDF matrix

In [16]:
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(sentences)

### Calculating Cosine Similarity

In [17]:
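# compare the last sentence (tfidf_matrix[-1]) with every other sentence;
# note that this scores sentences against the final sentence, not against the whole document
sentence_scores = cosine_similarity(tfidf_matrix[-1], tfidf_matrix[:-1])[0]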
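
If the goal is to score each sentence against the document as a whole rather than against a single sentence, one option is to build a vector for the entire text and compare every sentence with it. The sketch below illustrates that idea with plain word counts instead of TF-IDF weights, to stay dependency-free. The names `cosine` and `score_against_document` are hypothetical, and this is an alternative under those assumptions, not the method used in this notebook.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def score_against_document(sentences):
    """Score each sentence by its similarity to the whole-document vector."""
    vectors = [Counter(s.lower().split()) for s in sentences]
    # The document vector is the sum of all sentence vectors
    document = sum(vectors, Counter())
    return [cosine(v, document) for v in vectors]

toy = ["tennis players like tennis", "players travel", "weather is nice"]
toy_doc_scores = score_against_document(toy)
print(toy_doc_scores)
```

Sentences that share more vocabulary with the rest of the text score higher, so the first toy sentence ranks above the other two.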

### Scoring Sentences

In [18]:
summary_sentences = nlargest(n, range(len(sentence_scores)), key=sentence_scores.__getitem__)

### Summary

In [19]:
summary_tfidf = ' '.join([sentences[i] for i in sorted(summary_sentences)])
summary_tfidf

"\n Maria Sharapova has basically no friends as tennis players on the WTA Tour. The Russian player has no problems in openly speaking about it and in a recent interview she said: 'I don't really hide any feelings too much. I think everyone knows this is my job here. When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net. So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match."

### Rouge Score

In [20]:
scores = rouge.get_scores(sample_text, summary_tfidf)
scores[0]['rouge-1']['f']

0.6833333288347222

The TF-IDF approach gets a slightly better ROUGE-1 F-score than sentence scoring, a difference of about 0.06. However, simply ranking and selecting sentences from a paragraph cannot summarize it fully, because some information is always left out. To tackle this problem, we will use abstractive text summarization in the next Jupyter notebook.