
This notebook ranks sentences with TextRank, a general-purpose, graph-based ranking algorithm for NLP. TextRank does not rely on any training data and can work with any arbitrary piece of text.


Article followed: [Understand Text Summarization and Create Your Own Summarizer in Python](https://towardsdatascience.com/understand-text-summarization-and-create-your-own-summarizer-in-python-b26a9f09fc70)


In [1]:
import nltk
nltk.download("stopwords")
from nltk.cluster.util import cosine_distance
import numpy as np
import networkx as nx

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
file_name = "/content/drive/MyDrive/NLP/summary.txt"

In [4]:
with open(file_name, "r") as file:  # the context manager closes the file automatically
    filedata = file.readlines()
filedata

['In an attempt to build an AI-ready workforce, Microsoft announced Intelligent Cloud Hub which has been launched to empower the next generation of students with AI-ready skills. Envisioned as a three-year collaborative program, Intelligent Cloud Hub will support around 100 institutions with AI infrastructure, course content and curriculum, developer support, development tools and give students access to cloud and AI services. As part of the program, the Redmond giant which wants to expand its reach and is planning to build a strong developer ecosystem in India with the program will set up the core AI infrastructure and IoT Hub for the selected campuses. The company will provide AI development tools and Azure AI services such as Microsoft Cognitive Services, Bot Services and Azure Machine Learning.According to Manish Prakash, Country General Manager-PS, Health and Education, Microsoft India, said, "With AI being the defining technology of our time, it is transforming lives and industry

In [5]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
summarize_text = []



In [None]:
# Step 1 - Read text and tokenize

article = filedata[0].split(". ")
print(article, "\n")
sentences = []

for sentence in article:
  print(sentence)
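  # str.replace() treats "[^a-zA-Z]" as a literal substring, not a regex, so it
  # would leave the text unchanged; re.sub() is needed to strip non-letters.
  # Punctuation therefore stays attached to the words in the token lists.
  sentences.append(sentence.split(" "))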
#sentences.pop() 

sentences  #is a 2d array where each element is an array of words. each element is a sentence in word array form
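
A minimal sketch of what Step 1 would look like if the non-letter characters were actually stripped, assuming `re.sub` is used for the cleanup. The helper name `tokenize_sentences` and the sample text are made up for illustration and are not part of the notebook. Note that this variant also splits hyphenated words such as "AI-ready" into two tokens and drops digits, so it would not reproduce the outputs shown in this notebook.

```python
import re

def tokenize_sentences(text):
    """Split text on '. ' and turn each sentence into a list of letter-only words."""
    tokenized = []
    for sentence in text.split(". "):
        # re.sub interprets the pattern as a regex and replaces every non-letter
        cleaned = re.sub(r"[^a-zA-Z]", " ", sentence)
        words = cleaned.split()  # split() with no argument drops empty strings
        if words:
            tokenized.append(words)
    return tokenized

# Hypothetical sample text, not the article used in the notebook
sample = "Microsoft announced Intelligent Cloud Hub. It will support 100 institutions, said the company."
tokens = tokenize_sentences(sample)
print(tokens)
```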

In [11]:
def sentence_similarity(sent1, sent2, stopwords=None):
    # stopwords is a list of words to ignore
    if stopwords is None:
        stopwords = []
 
    #converting to lowercase and creating an array of words
    sent1 = [w.lower() for w in sent1]
    sent2 = [w.lower() for w in sent2]
 
    #set of all words in both sentences
    all_words = list(set(sent1 + sent2))
 
    #creating 2 vectors to store word frequency based on index of word in the allwords set
    vector1 = [0] * len(all_words)
    vector2 = [0] * len(all_words)
 
    # build the vector for the first sentence
    for w in sent1:
        if w in stopwords:
            continue
        vector1[all_words.index(w)] += 1
 
    # build the vector for the second sentence
    for w in sent2:
        if w in stopwords:
            continue
        vector2[all_words.index(w)] += 1
 
    # cosine_distance is 0 for identical vectors and grows as they diverge,
    # so 1 - distance is a similarity: the higher the value, the more similar.
    return 1 - cosine_distance(vector1, vector2)
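
To make the similarity measure concrete, here is a small standard-library sketch of the same idea: bag-of-words counts compared by cosine similarity. The function name `cosine_similarity` and the toy sentences are illustrative only. Returning 0.0 when a sentence has no words left after stopword removal is an assumption of this sketch; `sentence_similarity` above has no such guard.

```python
import math
from collections import Counter

def cosine_similarity(words1, words2, stop_words=()):
    """Cosine similarity between the word-count vectors of two token lists."""
    stop = set(stop_words)
    counts1 = Counter(w.lower() for w in words1 if w.lower() not in stop)
    counts2 = Counter(w.lower() for w in words2 if w.lower() not in stop)
    # Dot product over shared words (a Counter returns 0 for missing keys)
    dot = sum(counts1[w] * counts2[w] for w in counts1)
    norm1 = math.sqrt(sum(c * c for c in counts1.values()))
    norm2 = math.sqrt(sum(c * c for c in counts2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumption: an empty vector shares nothing with the other
    return dot / (norm1 * norm2)

a = ["AI", "tools", "for", "students"]
b = ["AI", "services", "for", "students"]
sim = cosine_similarity(a, b, stop_words=["for"])
print(sim)  # two of three non-stopword terms are shared, so this is 2/3
```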

In [7]:
# Create an empty similarity matrix
similarity_matrix = np.zeros((len(sentences), len(sentences)))
similarity_matrix 

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

In [12]:
for idx1 in range(len(sentences)):
  for idx2 in range(len(sentences)):
    if idx1 == idx2: #ignore if both are same sentences
      continue 
    similarity_matrix[idx1][idx2] = sentence_similarity(sentences[idx1], sentences[idx2], stop_words)
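
Because the similarity of sentence A to B equals that of B to A, the matrix is symmetric, so each pair only needs to be scored once. The sketch below shows that variant with plain lists. `build_similarity_matrix` and the `jaccard` stand-in similarity are hypothetical names for illustration; in this notebook the scoring function would be `sentence_similarity`.

```python
def build_similarity_matrix(items, similarity):
    """Fill a symmetric matrix by scoring each unordered pair exactly once."""
    n = len(items)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only; diagonal stays 0
            score = similarity(items[i], items[j])
            matrix[i][j] = score
            matrix[j][i] = score  # mirror into the lower triangle
    return matrix

def jaccard(words1, words2):
    """Stand-in similarity for the demo: shared words over all distinct words."""
    set1, set2 = set(words1), set(words2)
    return len(set1 & set2) / len(set1 | set2)

toy_sentences = [["ai", "tools"], ["ai", "services"], ["cloud", "hub"]]
toy_matrix = build_similarity_matrix(toy_sentences, jaccard)
print(toy_matrix)
```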


In [13]:
similarity_matrix

array([[0.        , 0.21516574, 0.10050378, 0.06600984, 0.        ,
        0.07106691, 0.25400025, 0.13074409, 0.09622504, 0.06085806],
       [0.21516574, 0.        , 0.19462474, 0.255655  , 0.16329932,
        0.27524094, 0.09837388, 0.10127394, 0.0745356 , 0.18856181],
       [0.10050378, 0.19462474, 0.        , 0.08956222, 0.09534626,
        0.06428243, 0.15316792, 0.11826248, 0.08703883, 0.22019275],
       [0.06600984, 0.255655  , 0.08956222, 0.        , 0.25048972,
        0.1266601 , 0.05029955, 0.23302069, 0.11433239, 0.21693046],
       [0.        , 0.16329932, 0.09534626, 0.25048972, 0.        ,
        0.13483997, 0.        , 0.12403473, 0.09128709, 0.23094011],
       [0.07106691, 0.27524094, 0.06428243, 0.1266601 , 0.13483997,
        0.        , 0.05415304, 0.0836242 , 0.06154575, 0.15569979],
       [0.25400025, 0.09837388, 0.15316792, 0.05029955, 0.        ,
        0.05415304, 0.        , 0.0996271 , 0.14664712, 0.18549556],
       [0.13074409, 0.10127394, 0.1182624

In [15]:
# Step 3 - Rank sentences in similarity matrix
similarity_graph = nx.from_numpy_array(similarity_matrix)
similarity_graph
#each node in this graph is a sentence

<networkx.classes.graph.Graph at 0x7fd62172c250>

In [16]:
#ranking of each node/sentence
scores = nx.pagerank(similarity_graph)
scores

{0: 0.08543021908426283,
 1: 0.1262117608296295,
 2: 0.09409890128802637,
 3: 0.11386767381829474,
 4: 0.0911373445844822,
 5: 0.08689211093261075,
 6: 0.08899375984519095,
 7: 0.10100416115804903,
 8: 0.08153023345816275,
 9: 0.13083383500129075}
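
For intuition about where these scores come from, the following is a rough power-iteration sketch of weighted PageRank on a small symmetric matrix. It is not networkx's implementation; the function name, the toy weights, and the tolerance are assumptions, and only the damping factor 0.85 mirrors the default `alpha` of `nx.pagerank`. Each sentence repeatedly passes its score to its neighbours in proportion to the edge weights until the scores settle.

```python
def pagerank_power_iteration(weights, damping=0.85, tol=1e-10, max_iter=200):
    """Weighted PageRank by power iteration on a square list-of-lists matrix."""
    n = len(weights)
    row_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n  # start from a uniform distribution
    for _ in range(max_iter):
        # Score held by nodes without outgoing weight is spread uniformly
        dangling = sum(scores[i] for i in range(n) if row_sums[i] == 0)
        new_scores = []
        for j in range(n):
            incoming = sum(
                scores[i] * weights[i][j] / row_sums[i]
                for i in range(n)
                if row_sums[i] > 0
            )
            new_scores.append((1 - damping) / n + damping * (incoming + dangling / n))
        change = sum(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if change < tol:
            break
    return scores

# Toy symmetric similarity matrix for three sentences (made-up values)
toy_weights = [
    [0.0, 0.2, 0.1],
    [0.2, 0.0, 0.3],
    [0.1, 0.3, 0.0],
]
toy_scores = pagerank_power_iteration(toy_weights)
print(toy_scores)
```

The scores sum to 1, and the sentence with the strongest overall connections (index 1 in the toy matrix) ends up ranked highest.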

In [17]:
# Step 4 - Sort the rank and pick top sentences
'''
ranked_sentence is a list of (score, sentence) tuples, sorted by score in descending order
i is the index of the sentence and s is the sentence as a list of words
'''
ranked_sentence = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)    
print("Indexes of top ranked_sentence order are ", ranked_sentence)    

Indexes of top ranked_sentence order are  [(0.13083383500129075, ['This', 'program', 'also', 'included', 'developer-focused', 'AI', 'school', 'that', 'provided', 'a', 'bunch', 'of', 'assets', 'to', 'help', 'build', 'AI', 'skills.']), (0.1262117608296295, ['Envisioned', 'as', 'a', 'three-year', 'collaborative', 'program,', 'Intelligent', 'Cloud', 'Hub', 'will', 'support', 'around', '100', 'institutions', 'with', 'AI', 'infrastructure,', 'course', 'content', 'and', 'curriculum,', 'developer', 'support,', 'development', 'tools', 'and', 'give', 'students', 'access', 'to', 'cloud', 'and', 'AI', 'services']), (0.11386767381829474, ['The', 'company', 'will', 'provide', 'AI', 'development', 'tools', 'and', 'Azure', 'AI', 'services', 'such', 'as', 'Microsoft', 'Cognitive', 'Services,', 'Bot', 'Services', 'and', 'Azure', 'Machine', 'Learning.According', 'to', 'Manish', 'Prakash,', 'Country', 'General', 'Manager-PS,', 'Health', 'and', 'Education,', 'Microsoft', 'India,', 'said,', '"With', 'AI', '

In [18]:
for i in range(3):
  summarize_text.append(" ".join(ranked_sentence[i][1]))

# Step 5 - Output the summarized text
print("Summarize Text: \n", ". ".join(summarize_text))

Summarize Text: 
 This program also included developer-focused AI school that provided a bunch of assets to help build AI skills.. Envisioned as a three-year collaborative program, Intelligent Cloud Hub will support around 100 institutions with AI infrastructure, course content and curriculum, developer support, development tools and give students access to cloud and AI services. The company will provide AI development tools and Azure AI services such as Microsoft Cognitive Services, Bot Services and Azure Machine Learning.According to Manish Prakash, Country General Manager-PS, Health and Education, Microsoft India, said, "With AI being the defining technology of our time, it is transforming lives and industry and the jobs of tomorrow will require a different skillset
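
The summary above lists sentences in rank order and shows a doubled period where a sentence already ended with one. One possible variant, sketched here under the assumption that document order reads better, picks the top-scoring sentences, restores their original order, and strips trailing periods before joining. `build_summary` and the toy data are hypothetical and not part of the notebook.

```python
def build_summary(sentences, scores, top_n=3):
    """Join the top_n highest-scoring sentences in their original document order."""
    # Indices sorted from highest to lowest score
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:top_n])  # back to document order
    # Strip trailing periods so the join does not produce ".."
    parts = [" ".join(sentences[i]).rstrip(".") for i in chosen]
    return ". ".join(parts) + "."

# Toy tokenized sentences and scores (made-up values)
toy_sentences = [
    ["First", "point"],
    ["Second", "point."],
    ["Third", "point"],
    ["Fourth", "point"],
]
toy_scores = {0: 0.2, 1: 0.4, 2: 0.1, 3: 0.3}
summary = build_summary(toy_sentences, toy_scores, top_n=2)
print(summary)  # Second point. Fourth point.
```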


In [19]:
sentences

[['In',
  'an',
  'attempt',
  'to',
  'build',
  'an',
  'AI-ready',
  'workforce,',
  'Microsoft',
  'announced',
  'Intelligent',
  'Cloud',
  'Hub',
  'which',
  'has',
  'been',
  'launched',
  'to',
  'empower',
  'the',
  'next',
  'generation',
  'of',
  'students',
  'with',
  'AI-ready',
  'skills'],
 ['Envisioned',
  'as',
  'a',
  'three-year',
  'collaborative',
  'program,',
  'Intelligent',
  'Cloud',
  'Hub',
  'will',
  'support',
  'around',
  '100',
  'institutions',
  'with',
  'AI',
  'infrastructure,',
  'course',
  'content',
  'and',
  'curriculum,',
  'developer',
  'support,',
  'development',
  'tools',
  'and',
  'give',
  'students',
  'access',
  'to',
  'cloud',
  'and',
  'AI',
  'services'],
 ['As',
  'part',
  'of',
  'the',
  'program,',
  'the',
  'Redmond',
  'giant',
  'which',
  'wants',
  'to',
  'expand',
  'its',
  'reach',
  'and',
  'is',
  'planning',
  'to',
  'build',
  'a',
  'strong',
  'developer',
  'ecosystem',
  'in',
  'India',
  'w