# https://tinyurl.com/ANLPColab3Part1
Go to "File" -> "Save a Copy in Drive..."
This creates your own copy of the notebook in your Google Drive, so any changes you make don't affect the shared notebook.

## Extractive Summarization

Extractive summarization selects important sentences, phrases, or paragraphs directly from the source text and combines them into a summary, rather than generating new text. The key idea is to identify and extract the most significant portions of the text.


## Method 1: A custom *TextRank* implementation

### Let's build it step by step by defining the TextRank functions

In [1]:
#Install required libraries
!pip install networkx nltk

# Import libraries and download NLTK data
import networkx as nx
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.cluster.util import cosine_distance
import numpy as np

nltk.download('punkt')
nltk.download('stopwords')




[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [2]:
# Define TextRank functions

#Function to calculate the similarity between two sentences - uses cosine similarity metric
def sentence_similarity(sent1, sent2, stopwords=None):
    if stopwords is None:
        stopwords = []

    sent1 = [w.lower() for w in sent1]
    sent2 = [w.lower() for w in sent2]

    #create a set of all unique words from both sentences
    all_words = list(set(sent1 + sent2))

    #create vector representations for each sentence based on word frequencies
    vector1 = [0] * len(all_words)
    vector2 = [0] * len(all_words)

    for w in sent1:
        if w not in stopwords:
            vector1[all_words.index(w)] += 1

    for w in sent2:
        if w not in stopwords:
            vector2[all_words.index(w)] += 1

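    #a sentence made up only of stopwords yields a zero vector; treat it as dissimilar to avoid dividing by zero
    if not any(vector1) or not any(vector2):
        return 0.0

    #calculate the cosine similarity between these vectors
    return 1 - cosine_distance(vector1, vector2)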

#Function to create the similarity matrix for all sentences
def build_similarity_matrix(sentences, stop_words):

    #initialize a zero matrix of size(number of sentences) x (number of sentences)
    similarity_matrix = np.zeros((len(sentences), len(sentences)))

    #fill this matrix with similarity scores between each pair of sentences
    for idx1 in range(len(sentences)):
        for idx2 in range(len(sentences)):
            if idx1 == idx2:
                continue
            similarity_matrix[idx1][idx2] = sentence_similarity(sentences[idx1], sentences[idx2], stop_words)

    return similarity_matrix

#function that implements the TextRank algorithm
def generate_summary(text, top_n=5):
    stop_words = set(stopwords.words('english'))
    summarize_text = []

    sentences = sent_tokenize(text)
    sentence_words = [word_tokenize(sent.lower()) for sent in sentences]

    #create the graph and calculate similarities
    similarity_matrix = build_similarity_matrix(sentence_words, stop_words)

    sentence_similarity_graph = nx.from_numpy_array(similarity_matrix)
    scores = nx.pagerank(sentence_similarity_graph) #calculate the TextRank score

    #sort the sentences based on their scores and select the top N sentences
    ranked_sentences = sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)

    #Take the top N sentences in rank order (not document order); cap top_n at the number of sentences
    for i in range(min(top_n, len(ranked_sentences))):
        summarize_text.append(ranked_sentences[i][1])

    return " ".join(summarize_text)
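
To make the scoring step concrete, here is a minimal pure-Python sketch of the two ideas the functions above rely on: cosine similarity between word-count vectors, and a PageRank-style power iteration over the resulting similarity matrix. The helper names (`cosine_sim`, `pagerank_scores`), the damping factor of 0.85, the fixed iteration count, and the toy sentences are all assumptions made for illustration. `nx.pagerank` checks convergence and redistributes the score of isolated sentences, so its numbers may differ from this sketch.

```python
import math
from collections import Counter

def cosine_sim(words1, words2, stop_words=frozenset()):
    """Cosine similarity between the word-count vectors of two token lists."""
    c1 = Counter(w.lower() for w in words1 if w.lower() not in stop_words)
    c2 = Counter(w.lower() for w in words2 if w.lower() not in stop_words)
    dot = sum(c1[w] * c2[w] for w in c1)
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # a sentence with no content words is similar to nothing
    return dot / (norm1 * norm2)

def pagerank_scores(matrix, damping=0.85, iterations=100):
    """Simplified power iteration over a non-negative similarity matrix."""
    n = len(matrix)
    row_sums = [sum(row) for row in matrix]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for j in range(n):
            # Each sentence i passes its score to j in proportion to their similarity.
            incoming = sum(
                scores[i] * matrix[i][j] / row_sums[i]
                for i in range(n) if row_sums[i] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores

# Toy input: two related sentences and one unrelated sentence.
sentences = [
    "the cat sat on the mat".split(),
    "the cat lay on the rug".split(),
    "stock prices fell sharply today".split(),
]
stop = {"the", "on"}
n = len(sentences)
matrix = [
    [0.0 if i == j else cosine_sim(sentences[i], sentences[j], stop) for j in range(n)]
    for i in range(n)
]
scores = pagerank_scores(matrix)
print(matrix[0][1])  # similarity of the two "cat" sentences (they share 1 of 3 content words)
print(scores)        # the two related sentences outrank the unrelated one
```

The unrelated sentence has no edges in the graph, so it only ever receives the baseline score, which is why sentences that overlap with many others rise to the top.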

In [3]:
# Experiment with a text
text = """
Deep learning (also known as deep structured learning) is part of a
broader family of machine learning methods based on artificial neural networks with
representation learning. Learning can be supervised, semi-supervised or unsupervised.
Deep-learning architectures such as deep neural networks, deep belief networks, deep reinforcement learning,
recurrent neural networks and convolutional neural networks have been applied to
fields including computer vision, speech recognition, natural language processing,
machine translation, bioinformatics, drug design, medical image analysis, material
inspection and board game programs, where they have produced results comparable to
and in some cases surpassing human expert performance. Artificial neural networks
(ANNs) were inspired by information processing and distributed communication nodes
in biological systems. ANNs have various differences from biological brains. Specifically,
neural networks tend to be static and symbolic, while the biological brain of most living organisms
is dynamic (plastic) and analogue. The adjective "deep" in deep learning refers to the use of multiple
layers in the network. Early work showed that a linear perceptron cannot be a universal classifier,
but that a network with a nonpolynomial activation function with one hidden layer of unbounded width can.
Deep learning is a modern variation which is concerned with an unbounded number of layers of bounded size,
which permits practical application and optimized implementation, while retaining theoretical universality
under mild conditions. In deep learning the layers are also permitted to be heterogeneous and to deviate widely
from biologically informed connectionist models, for the sake of efficiency, trainability and understandability,
whence the structured part.

"""
print("Original text:")
print(text)
print("Original text length:")
print(len(text))


print("\nGenerating summary...")
print("\n")

new_summary = generate_summary(text, top_n=1)
print("Summary of provided text:")
print(new_summary)
print("Summary text length:")
print(len(new_summary))


Original text:

Deep learning (also known as deep structured learning) is part of a
broader family of machine learning methods based on artificial neural networks with
representation learning. Learning can be supervised, semi-supervised or unsupervised.
Deep-learning architectures such as deep neural networks, deep belief networks, deep reinforcement learning,
recurrent neural networks and convolutional neural networks have been applied to
fields including computer vision, speech recognition, natural language processing,
machine translation, bioinformatics, drug design, medical image analysis, material
inspection and board game programs, where they have produced results comparable to
and in some cases surpassing human expert performance. Artificial neural networks
(ANNs) were inspired by information processing and distributed communication nodes
in biological systems. ANNs have various differences from biological brains. Specifically,
neural networks tend to be static and symbolic, whi

In [4]:
#Trying a different top-n
print("Original text:")
print(text)
print("Original text length:")
print(len(text))


print("\nGenerating summary...")
print("\n")

new_summary = generate_summary(text, top_n=2)
print("Summary of provided text:")
print(new_summary)
print("Summary text length:")
print(len(new_summary))

Original text:

Deep learning (also known as deep structured learning) is part of a
broader family of machine learning methods based on artificial neural networks with
representation learning. Learning can be supervised, semi-supervised or unsupervised.
Deep-learning architectures such as deep neural networks, deep belief networks, deep reinforcement learning,
recurrent neural networks and convolutional neural networks have been applied to
fields including computer vision, speech recognition, natural language processing,
machine translation, bioinformatics, drug design, medical image analysis, material
inspection and board game programs, where they have produced results comparable to
and in some cases surpassing human expert performance. Artificial neural networks
(ANNs) were inspired by information processing and distributed communication nodes
in biological systems. ANNs have various differences from biological brains. Specifically,
neural networks tend to be static and symbolic, whi

In [5]:
new_summary = generate_summary(text, top_n=1)
print("Summary of provided text:")
print(new_summary)
print("Summary text length:")
print(len(new_summary))

Summary of provided text:
Deep-learning architectures such as deep neural networks, deep belief networks, deep reinforcement learning,
recurrent neural networks and convolutional neural networks have been applied to
fields including computer vision, speech recognition, natural language processing,
machine translation, bioinformatics, drug design, medical image analysis, material
inspection and board game programs, where they have produced results comparable to
and in some cases surpassing human expert performance.
Summary text length:
493
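
`generate_summary` joins the selected sentences in rank order, so with `top_n` greater than 1 the summary may not follow the order of the source text. Below is a minimal sketch of one way to keep document order, assuming per-sentence scores are already available. The helper name `top_n_in_document_order` and the toy scores are hypothetical and not part of the notebook.

```python
def top_n_in_document_order(sentences, scores, top_n):
    """Pick the top_n highest-scoring sentences, then restore source order."""
    top_n = min(top_n, len(sentences))
    # Rank sentence indices by score, highest first.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Keep the best top_n indices and sort them back into document order.
    chosen = sorted(ranked[:top_n])
    return " ".join(sentences[i] for i in chosen)

sentences = ["First point.", "Minor aside.", "Second point."]
scores = [0.30, 0.10, 0.60]
print(top_n_in_document_order(sentences, scores, top_n=2))
# First point. Second point.
```

Ranking by index rather than by `(score, sentence)` tuples also avoids comparing sentence strings when two scores tie.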


## Method 2: Using Python packages

### Now, let's run it using pre-defined functions in the spaCy and PyTextRank Python libraries

In [6]:
# Load required libraries
!pip install pytextrank #a spaCy extension that effectively implements the TextRank algorithm

import spacy
import pytextrank
#load spacy language model
nlp = spacy.load("en_core_web_sm")

Collecting pytextrank
  Downloading pytextrank-3.3.0-py3-none-any.whl.metadata (12 kB)
Collecting icecream>=2.1 (from pytextrank)
  Downloading icecream-2.1.3-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting colorama>=0.3.9 (from icecream>=2.1->pytextrank)
  Downloading colorama-0.4.6-py2.py3-none-any.whl.metadata (17 kB)
Collecting executing>=0.3.1 (from icecream>=2.1->pytextrank)
  Downloading executing-2.1.0-py2.py3-none-any.whl.metadata (8.9 kB)
Collecting asttokens>=2.0.1 (from icecream>=2.1->pytextrank)
  Downloading asttokens-2.4.1-py2.py3-none-any.whl.metadata (5.2 kB)
Downloading pytextrank-3.3.0-py3-none-any.whl (26 kB)
Downloading icecream-2.1.3-py2.py3-none-any.whl (8.4 kB)
Downloading asttokens-2.4.1-py2.py3-none-any.whl (27 kB)
Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Downloading executing-2.1.0-py2.py3-none-any.whl (25 kB)
Installing collected packages: executing, colorama, asttokens, icecream, pytextrank
Successfully installed asttokens-2.4.1 colorama-0.

In [7]:
#Load Text

text = '''Deep learning (also known as deep structured learning) is part of a
broader family of machine learning methods based on artificial neural networks with
representation learning. Learning can be supervised, semi-supervised or unsupervised.
Deep-learning architectures such as deep neural networks, deep belief networks, deep reinforcement learning,
recurrent neural networks and convolutional neural networks have been applied to
fields including computer vision, speech recognition, natural language processing,
machine translation, bioinformatics, drug design, medical image analysis, material
inspection and board game programs, where they have produced results comparable to
and in some cases surpassing human expert performance. Artificial neural networks
(ANNs) were inspired by information processing and distributed communication nodes
in biological systems. ANNs have various differences from biological brains. Specifically,
neural networks tend to be static and symbolic, while the biological brain of most living organisms
is dynamic (plastic) and analogue. The adjective "deep" in deep learning refers to the use of multiple
layers in the network. Early work showed that a linear perceptron cannot be a universal classifier,
but that a network with a nonpolynomial activation function with one hidden layer of unbounded width can.
Deep learning is a modern variation which is concerned with an unbounded number of layers of bounded size,
which permits practical application and optimized implementation, while retaining theoretical universality
under mild conditions. In deep learning the layers are also permitted to be heterogeneous and to deviate widely
from biologically informed connectionist models, for the sake of efficiency, trainability and understandability,
whence the structured part.'''

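# Placeholder for your own text: paste it between the triple quotes.
# Note: running this cell as-is overwrites `text` with two newlines, so the summary below is empty.
text = """

"""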



In [8]:
#Add the TextRank component to the spaCy pipeline
nlp.add_pipe("textrank")

<pytextrank.base.BaseTextRankFactory at 0x7d4371531ed0>

In [9]:
#Text summarization
print("Original Text:")
print(text)
print('Original Document Size:',len(text)) #number of characters
print('\n')


print("\nGenerating summary...")
print("\n")

doc = nlp(text)
summary = ''
summarySize = 0

#Score sentences using the top 2 ranked phrases and return at most 2 sentences
for sent in doc._.textrank.summary(limit_phrases=2, limit_sentences=2):
    summary = summary + " " + str(sent)
    summarySize += len(sent) # len() of a spaCy Span counts tokens, not characters

print("Summary :", summary)
print("\n")
print("Summary Size :", summarySize)

print("Summary length:")
print(len(summary)) #character count, including the leading spaces; differs from Summary Size, which counts tokens

Original Text:



Original Document Size: 2



Generating summary...


Summary :  




Summary Size : 1
Summary length:
3
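
Because `len()` on a spaCy `Span` returns a token count while `len()` on a string returns a character count, the two sizes printed by the cell are not directly comparable. The sketch below shows one way to report original and summary sizes in the same units. `summary_stats` is a hypothetical helper, and splitting on whitespace is only a rough stand-in for spaCy's tokenizer.

```python
def summary_stats(original, summary):
    """Compare an original text and its summary by character and word counts."""
    original, summary = original.strip(), summary.strip()
    stats = {
        "original_chars": len(original),
        "summary_chars": len(summary),
        "original_words": len(original.split()),
        "summary_words": len(summary.split()),
    }
    # Avoid dividing by zero when the input text is empty.
    stats["char_ratio"] = (
        stats["summary_chars"] / stats["original_chars"]
        if stats["original_chars"] else 0.0
    )
    return stats

print(summary_stats("one two three four", "one two"))
print(summary_stats("\n\n", " \n\n"))  # empty input, as in the run above
```

Stripping both strings first keeps leading spaces and blank lines from inflating either count.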
