### **Installation, initialization and dependencies**

In [1]:
# change to local directory
import os
os.chdir("/content/drive/My Drive/Colab Notebooks/Demo capstone/Summarization")

In [2]:
# install packages (in addition to Colab pre-installed packages like torch)
!pip install aylien-apiclient
!pip install bert-extractive-summarizer
!pip install spacy==2.3.0
!pip install transformers
!pip install neuralcoref

Collecting aylien-apiclient
  Downloading https://files.pythonhosted.org/packages/f8/2e/f901d788b248afb61e5f4f5a1439df0a858aaeaea30bab425cd1e1dc685b/aylien-apiclient-0.7.0.tar.gz
Building wheels for collected packages: aylien-apiclient
  Building wheel for aylien-apiclient (setup.py) ... done
  Created wheel for aylien-apiclient: filename=aylien_apiclient-0.7.0-cp37-none-any.whl size=9099 sha256=6721821dd870d11d28da31b3d0dc2f3570140d7519b088c9ee66ffc65d0c8f6d
  Stored in directory: /root/.cache/pip/wheels/93/72/c6/b11d6bca9428ae3841ea52cc60924e8d916c00c2a10fc6fbea
Successfully built aylien-apiclient
Installing collected packages: aylien-apiclient
Successfully installed aylien-apiclient-0.7.0
Collecting bert-extractive-summarizer
  Downloading https://files.pythonhosted.org/packages/1a/07/fdb05f9e18b6f641499ef56737126fbd2fafe1cdc1a04ba069d5aa205901/bert_extractive_summarizer-0.7.1-py3-none-any.whl
Collecting transformers
  Downloading https://files.pythonhosted.org/pac

In [3]:
# importing packages

import numpy as np
import pandas as pd
import math
import re

import nltk
nltk.download('punkt')
nltk.download('stopwords')

import warnings
warnings.filterwarnings('ignore')

from nltk import sent_tokenize, word_tokenize, PorterStemmer
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

import transformers
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

from aylienapiclient import textapi

from sklearn.metrics.pairwise import cosine_similarity
import networkx as nx
from summarizer import Summarizer as summarizer_bert

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [4]:
txt = '''
U.S. Secretary of State Antony Blinken (2nd R), joined by national security advisor Jake Sullivan (R), speaks while facing Yang Jiechi (2nd L), director of the Central Foreign Affairs Commission Office, and Wang Yi (L), China's foreign minister at the opening session of U.S.-China talks at the Captain Cook Hotel in Anchorage, Alaska on March 18, 2021. Frederic J. Brown | AFP | Getty Images

BEIJING — The first high-level gathering of U.S. and Chinese officials under President Joe Biden kicked off with an exchange of insults at a pre-meeting press event in Alaska on Thursday. A planned four-minute photo session for the officials to address reporters ended up lasting one hour and 15 minutes due to a frothy exchange, according to NBC News. Both the Chinese and U.S. side kept calling the reporters back into the room so they could add remarks. Expectations were already low for the meeting in Anchorage, Alaska, with U.S. Secretary of State Antony Blinken, National Security Advisor Jake Sullivan, Chinese Foreign Minister Wang Yi and Yang Jiechi, director of the Central Foreign Affairs Commission of the Chinese Communist Party. In his opening remarks, Blinken said the U.S. would discuss its "deep concerns with actions by China, including in Xinjiang, Hong Kong, Taiwan, cyber attacks on the United States, economic coercion toward our allies." "Each of these actions threaten the rules-based order that maintains global stability. That's why they're not merely internal matters, and why we feel an obligation to raise these issues here today," Blinken said. "I said that the United States relationship with China will be competitive where it should be, collaborative where it can be, adversarial where it must be."

The United States does not have the qualification to say that it wants to speak to China from a position of strength. Yang Jiechi director of the Central Foreign Affairs Commission

Beijing considers issues in Xinjiang, Hong Kong and Taiwan as part of its domestic affairs, and the officials reiterated at the meeting that China is firmly opposed to foreign interference. Yang said the U.S. side "carefully orchestrated" the dialogue, according to an official translation reported by NBC. "I think we thought too well of the United States, we thought that the U.S. side will follow the necessary diplomatic protocols," Yang said, adding that "the United States does not have the qualification to say that it wants to speak to China from a position of strength." Yang said the U.S. must deal with the Chinese side in "the right way" and reiterated Beijing's call for cooperation.

I'm hearing deep satisfaction that the United States is back, that we're reengaged with our allies and partners. I'm also hearing deep concern about some of the actions your government is taking. Antony Blinken U.S. Secretary of State

Under Chinese President Xi Jinping, the Chinese government has been consolidating its power at home and abroad. In the last year, Beijing has pushed ahead with major trade deals with Asia-Pacific neighbors and the European Union. Chinese authorities have also emphasized their success in quickly controlling the coronavirus pandemic domestically, and their claim of lifting all 1.4 billion people in the country out of poverty — both of which Yang pointed to in his meeting with U.S. officials. "We believe that it is important for the United States to change its own image, and to stop advancing its own democracy in the rest of the world," Yang said. China's Ministry of Foreign Affairs did not immediately have a comment. State-run broadcaster CCTV said the U.S. went "seriously overtime" in its opening remarks and "provoked disputes," according to a CNBC translation of the Mandarin-language report.
'''

### **1. Using textrank model**

In [5]:
# function to remove stopwords
def remove_stopwords(sen):
    sen_new = " ".join([i for i in sen if i not in stop_words])
    return sen_new

# Extract word vectors - get Glove embeddings
word_embeddings = {}
print("start loading vectorization")

# if pythonanywhere
# f = open('/home/viethoangtranduong/xlite_capstone/summary/vectorization/glove.6B.50d.txt', encoding='utf-8')

# if not pythonanywhere
# each line of the GloVe file holds a word followed by its 50 coefficients
with open('./summary/vectorization/glove.6B.50d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        word_embeddings[word] = coefs

def summarizer_textrank_get(text, percent_sentences = 50):
    """Summarize using the TextRank method:
    embed each sentence as a vector, connect sentences with edges weighted by
    their cosine similarity, and rank the sentences with PageRank

    Args:
        text (str): the text to summarize
        percent_sentences (int, optional): percent to retain. Defaults to 50.

    Returns:
        dictionary with the method [method], summary [sentences], number of sentences [summary_num_sentences]
    """    
    sentences = sent_tokenize(text)
    num_sentences = int(percent_sentences / 100 * len(sentences))

    # remove punctuation, numbers and special characters
    # (regex=True is required: newer pandas versions treat the pattern literally by default)
    clean_sentences = pd.Series(sentences).str.replace("[^a-zA-Z]", " ", regex=True)

    # lowercase everything
    clean_sentences = [s.lower() for s in clean_sentences]

    # remove stopwords from the sentences
    clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]

    sentence_vectors = []
    for i in clean_sentences:
      if len(i) != 0:
        v = sum([word_embeddings.get(w, np.zeros((50,))) for w in i.split()])/(len(i.split())+0.001)
      else:
        v = np.zeros((50,))
      sentence_vectors.append(v)


    # similarity matrix
    sim_mat = np.zeros([len(sentences), len(sentences)])

    for i in range(len(sentences)):
      for j in range(len(sentences)):
        if i != j:
          sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,50), sentence_vectors[j].reshape(1,50))[0,0]

    nx_graph = nx.from_numpy_array(sim_mat)
    scores = nx.pagerank(nx_graph)

    ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)

    summary = ""

    # Take the num_sentences top-ranked sentences as the summary
    for i in range(num_sentences):
      summary += ranked_sentences[i][1] + " "
    sentences = sent_tokenize(summary)

    # output
    output = {'sentences': summary[:-1], 'summary_num_sentences': len(sentences), "method": "TextRank"}

    return output

start loading vectorization


In [6]:
output = summarizer_textrank_get(txt, 30)

print("Number of sentences:", output['summary_num_sentences'])
print("Summary:", output['sentences'])

Number of sentences: 6
Summary: "I think we thought too well of the United States, we thought that the U.S. side will follow the necessary diplomatic protocols," Yang said, adding that "the United States does not have the qualification to say that it wants to speak to China from a position of strength." Yang said the U.S. must deal with the Chinese side in "the right way" and reiterated Beijing's call for cooperation. In his opening remarks, Blinken said the U.S. would discuss its "deep concerns with actions by China, including in Xinjiang, Hong Kong, Taiwan, cyber attacks on the United States, economic coercion toward our allies." "We believe that it is important for the United States to change its own image, and to stop advancing its own democracy in the rest of the world," Yang said. Chinese authorities have also emphasized their success in quickly controlling the coronavirus pandemic domestically, and their claim of lifting all 1.4 billion people in the country out of poverty — bot
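
To make the ranking step more concrete, here is a minimal sketch of PageRank by power iteration over a small similarity matrix, in pure Python. It is not the `nx.pagerank` implementation called above; the function name `pagerank_scores`, the damping factor of 0.85, and the toy matrix are assumptions chosen for illustration.

```python
def pagerank_scores(sim, damping=0.85, iterations=100, tol=1e-8):
    """Rank the nodes of a weighted graph given as a square similarity matrix."""
    n = len(sim)

    # Row-normalize so each row is a probability distribution over neighbors.
    transition = []
    for row in sim:
        total = sum(row)
        if total > 0:
            transition.append([w / total for w in row])
        else:
            # A sentence similar to nothing spreads its score uniformly.
            transition.append([1.0 / n] * n)

    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = [
            (1 - damping) / n
            + damping * sum(scores[j] * transition[j][i] for j in range(n))
            for i in range(n)
        ]
        converged = sum(abs(a - b) for a, b in zip(new_scores, scores)) < tol
        scores = new_scores
        if converged:
            break
    return scores


# Toy similarity matrix (hypothetical values): sentence 0 is similar to both
# others, while sentences 1 and 2 are only weakly similar to each other.
toy_sim = [
    [0.0, 0.9, 0.8],
    [0.9, 0.0, 0.1],
    [0.8, 0.1, 0.0],
]
toy_scores = pagerank_scores(toy_sim)
best = max(range(len(toy_scores)), key=lambda i: toy_scores[i])
print("highest-ranked sentence index:", best)
```

The sentence that is most similar to the rest of the text collects the highest score, which is why TextRank tends to pick "central" sentences for the summary.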

### **2. Using huggingface (transformers) model**

In [7]:
tokenizer_hgf = AutoTokenizer.from_pretrained('./summary/models/hgf_model')
model_hgf = AutoModelForSeq2SeqLM.from_pretrained('./summary/models/hgf_model')

def split_group(sentences, cap = 600):
  """Split a long document into batches that fit the HuggingFace model's input limit

  Args:
      sentences (str or list): the text, or a list of its sentences
      cap (int, optional): maximum number of words per batch. Defaults to 600.

  Returns:
      (list) list of batches, each a list of sentences to feed into HGF
  """
  if isinstance(sentences, str):
    sentences = sent_tokenize(sentences)
    
  split = [[]]
  cur_val = 0

  # count words (a rough proxy for tokens); start a new batch when the cap would be exceeded
  for sen in sentences:
    val = len(sen.split(" "))
    if cur_val + val <= cap:
      split[-1].append(sen)
      cur_val += val
    else:
      cur_val = 0
      split.append([])
      split[-1].append(sen)
      cur_val += val
  return split

def summarize_hgf(txt):
  """summarize using HGF model

  Args:
      txt (str): text 

  Returns:
      (str) summary
  """  
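  inputs = tokenizer_hgf.encode(txt, return_tensors="pt")
  outputs = model_hgf.generate(inputs)
  # drop special tokens such as <s> and </s> instead of slicing by fixed offsets
  result = tokenizer_hgf.decode(outputs[0], skip_special_tokens=True)
  return result.strip()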

def summarizer_hgf_get(text, percent_sentences = 50):
  """Summarize with the HGF model: split the text into batches, summarize each batch, and join the results

  Args:
      text (str): the text to summarize
      percent_sentences (int, optional): percent to retain. Defaults to 50.

  Returns:
      dictionary with the method [method], summary [sentences], number of sentences [summary_num_sentences]
  """    
  
  num_sentences = int(percent_sentences / 100 * len(sent_tokenize(text)))
  
  # if the HGF model fails, fall back to aylien
  try:

    # split sentences into batches that fit the token limit,
    # then summarize each batch and append the results
    summary = ""
    split_texts = split_group(text)
    for item in split_texts:
      # join with a space so adjacent sentences are not glued together
      current_text = " ".join(item)
      current_summary = summarize_hgf(current_text)
      summary += current_summary + " "

    sentences = sent_tokenize(summary)

    # output
    output = {'sentences': summary, 'summary_num_sentences': len(sentences), "method": "hgf"}
    # print("HGF works!")
    return output

  # use aylien if failed
  except Exception:
    output = {}

    # send requests to aylien
    client = textapi.Client("79e389d3", "1bc2400da0cb4745c30fb68b67e5e5cf")

    out = client.Summarize({'sentences_number': num_sentences,
                            'text': text,
                            'title': "Class reading"})

    output['summary_num_sentences'] = num_sentences
    output['sentences'] = "".join([" " + val for val in out['sentences']])[1:]
    output['method'] = "aylien"
    # print("HGF failed! Use aylien")
    return output

In [8]:
output = summarizer_hgf_get(txt, 30)

print("Number of sentences:", output['summary_num_sentences'])
print("Summary:", output['sentences'])

Number of sentences: 5
Summary: First high-level gathering of U.S. and Chinese officials under President Joe Biden kicked off with an exchange of insults at a pre-meeting press event in Alaska on Thursday. A planned four-minute photo session for the officials to address reporters ended up lasting one hour and 15 minutes due to a frothy exchange. State-run broadcaster CCTV said the U.S. went "seriously overtime" in its opening remarks. CCTV said there were "provoked disputes," according to a CNBC translation of the Mandarin-language report. The United States and China have been locked in a bitter trade dispute for years. 
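
The batching idea behind `split_group` can be illustrated on its own. The sketch below is a standalone variant, not the function above: the name `batch_sentences` and the toy sentences are hypothetical, and word counts are only a rough stand-in for the model's real token counts.

```python
def batch_sentences(sentences, cap=600):
    """Greedily pack sentences into batches of at most `cap` words each."""
    batches = [[]]
    words_in_batch = 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new batch when this sentence would overflow the current one.
        # A single over-long sentence still gets a batch of its own, and no
        # empty batch is created for it.
        if batches[-1] and words_in_batch + n_words > cap:
            batches.append([])
            words_in_batch = 0
        batches[-1].append(sentence)
        words_in_batch += n_words
    return batches


# Toy input with a tiny cap so the split is easy to follow.
toy_sentences = ["one two three.", "four five.", "six seven eight nine.", "ten."]
toy_batches = batch_sentences(toy_sentences, cap=5)
print(toy_batches)
```

With `cap=5`, the first two sentences (3 + 2 words) fill one batch and the remaining two (4 + 1 words) fill the next. Checking `batches[-1]` before opening a new batch is a small design choice that avoids passing an empty string to the summarizer when the very first sentence exceeds the cap.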


### **3. Using bert-extractive model**

In [9]:
# Load model, model config and tokenizer via Transformers
custom_config = transformers.AutoConfig.from_pretrained('bert-base-cased')
custom_config.output_hidden_states = True
custom_tokenizer = transformers.AutoTokenizer.from_pretrained('bert-base-cased')
custom_model = transformers.AutoModel.from_pretrained('bert-base-cased', config=custom_config)
summarizer_bert_model = summarizer_bert(custom_model=custom_model, custom_tokenizer=custom_tokenizer, )

def summarizer_bertext_get(text, percent_sentences = 50):
  """Summarize using the BERT extractive method with a few filtering heuristics

  Args:
      text (str): the text to summarize
      percent_sentences (int, optional): percent to retain. Defaults to 50.

  Returns:
      dictionary with the method [method], summary [sentences], number of sentences [summary_num_sentences]
  """    
  num_sentences = int(percent_sentences / 100 * len(sent_tokenize(text)))
  
  # handle failing cases
  try:

    # Filter unnecessary information
    citations = {"References:", "Reference:", "Citations:", "Citation:", "Resources:", "Bibliography:", "Resource:", "Author Bio:", "Author bio:", "author bio:"}

    # remove citations
    current_l = len(text)
    for word in citations:
      try:
        idx = text.index(word)
        text = text[:idx]
      except ValueError:  # str.index raises ValueError when the marker is absent
        pass

      # check if accidentally cut all the text
      if len(text) < current_l/3:
        raise ValueError

    # 2nd round of filter
    sentences = sent_tokenize(text)
    filtered = set()
    for i, sentence in enumerate(sentences):
      if sentence[-1] in {"?", ":"} or sentence[-5:-1] in {'here', 'poor'} or sentence[:2] in {"By"} or len(sentence) < 10:
        filtered.add(sentence)
    
    # get the new text after filtering
    new_sentences = [" " + sentence for sentence in sentences if sentence not in filtered]
    new_text = "".join(new_sentences)
    total_sentences = len(new_sentences)
    
    # print(num_sentences, total_sentences)
    summary = summarizer_bert_model(new_text, ratio = (num_sentences + 1) / max(total_sentences, 1))

    # split into sub-sentences
    sentences = sent_tokenize(summary)
    
    # output
    output = {'sentences': summary, 'summary_num_sentences': len(sentences), "method": "bert"}
    # print("Bert transformer works!")
  
  
  # if failed, use aylien
  except Exception:
    
  
    output = {}

    # send request to aylien
    client = textapi.Client("79e389d3", "1bc2400da0cb4745c30fb68b67e5e5cf")

    out = client.Summarize({'sentences_number': num_sentences,
                            'text': text,
                            'title': "Class reading"})

    output['summary_num_sentences'] = num_sentences
    output['sentences'] = "".join([" " + val for val in out['sentences']])[1:]
    output['method'] = "aylien"
    # print("Bert transformer failed! Use aylien")
  return output


In [10]:
output = summarizer_bertext_get(txt, 30)

print("Number of sentences:", output['summary_num_sentences'])
print("Summary:", output['sentences'])

Number of sentences: 7
Summary: U.S. Secretary of State Antony Blinken (2nd R), joined by national security advisor Jake Sullivan (R), speaks while facing Yang Jiechi (2nd L), director of the Central Foreign Affairs Commission Office, and Wang Yi (L), China's foreign minister at the opening session of U.S.-China talks at the Captain Cook Hotel in Anchorage, Alaska on March 18, 2021. A planned four-minute photo session for the officials to address reporters ended up lasting one hour and 15 minutes due to a frothy exchange, according to NBC News. Expectations were already low for the meeting in Anchorage, Alaska, with U.S. Secretary of State Antony Blinken, National Security Advisor Jake Sullivan, Chinese Foreign Minister Wang Yi and Yang Jiechi, director of the Central Foreign Affairs Commission of the Chinese Communist Party. I think we thought too well of the United States, we thought that the U.S. side will follow the necessary diplomatic protocols," Yang said, adding that "the Unite
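
The second filtering round in `summarizer_bertext_get` can be read as a small predicate over sentences. The sketch below restates part of it in isolation; the name `drop_noisy_sentences` and the toy sentences are hypothetical. It matches bylines with `"By "` (with a trailing space) rather than the two-character prefix used above, and it leaves out the `'here'`/`'poor'` suffix check.

```python
def drop_noisy_sentences(sentences, min_length=10):
    """Keep sentences that look like body text rather than headings, questions, or bylines."""
    kept = []
    for sentence in sentences:
        is_question_or_heading = sentence.endswith(("?", ":"))
        is_byline = sentence.startswith("By ")
        is_too_short = len(sentence) < min_length
        if not (is_question_or_heading or is_byline or is_too_short):
            kept.append(sentence)
    return kept


# Toy sentences (made up for illustration).
toy = [
    "By Jane Doe, staff writer.",
    "What happened next?",
    "Key points:",
    "Short.",
    "The meeting ran over an hour longer than planned.",
]
toy_kept = drop_noisy_sentences(toy)
print(toy_kept)
```

Only the last sentence survives here. Naming each condition makes it easier to see which heuristic removed a sentence when a summary looks wrong.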

### **4. Using aylien model (external)**

In [11]:
def summarizer_aylien_get(text, percent_sentences = 50):
    """summarizing using aylien method

    Args:
        text (str): the text to summarize
        percent_sentences (int, optional): percent to retain. Defaults to 50.

    Returns:
        dictionary with the method [method], summary [sentences], number of sentences [summary_num_sentences]
    """    

    num_sentences = int(percent_sentences / 100 * len(sent_tokenize(text)))


    output = {}

    # send request to their service
    client = textapi.Client("79e389d3", "1bc2400da0cb4745c30fb68b67e5e5cf")

    out = client.Summarize({'sentences_number': num_sentences,
                            'text': text,
                            'title': None})

    output['summary_num_sentences'] = num_sentences
    output['sentences'] = "".join([" " + val for val in out['sentences']])[1:]
    output['method'] = "Aylien"
    return output

In [12]:
output = summarizer_aylien_get(txt, 30)

print("Number of sentences:", output['summary_num_sentences'])
print("Summary:", output['sentences'])

Number of sentences: 6
Summary: U.S. Secretary of State Antony Blinken (2nd R), joined by national security advisor Jake Sullivan (R), speaks while facing Yang Jiechi (2nd L), director of the Central Foreign Affairs Commission Office, and Wang Yi (L), China's foreign minister at the opening session of U.S.-China talks at the Captain Cook Hotel in Anchorage, Alaska on March 18, 2021. Expectations were already low for the meeting in Anchorage, Alaska, with U.S. Secretary of State Antony Blinken, National Security Advisor Jake Sullivan, Chinese Foreign Minister Wang Yi and Yang Jiechi, director of the Central Foreign Affairs Commission of the Chinese Communist Party. In his opening remarks, Blinken said the U.S. would discuss its "deep concerns with actions by China, including in Xinjiang, Hong Kong, Taiwan, cyber attacks on the United States, economic coercion toward our allies." Yang Jiechi director of the Central Foreign Affairs Commission

Beijing considers issues in Xinjiang, Hong Ko

### **5. Using tf-idf model**

In [13]:
# adapted from https://github.com/bhuwanesh340/Text_Summarization_NLP/blob/master/Text_Summary.py

# nltk already provides a sentence tokenizer, so calling sent_tokenize()
# is enough to create the list of sentences.

def _create_frequency_matrix(sentences):
    """Build a word-frequency table for each sentence

    Args:
        sentences (list): the sentences of the text

    Returns:
        dict mapping each sentence key to its {word: count} table
    """
    frequency_matrix = {}
    stopWords = set(stopwords.words("english"))
    ps = PorterStemmer()

    # remove stopwords and build a frequency matrix
    # if word not there, add to dictionary
    # if it's there, then add +1 to that value
    for sent in sentences:
        freq_table = {}
        words = word_tokenize(sent)
        for word in words:
            word = word.lower()
            word = ps.stem(word)
            if word in stopWords:
                continue
            if word in freq_table:
                freq_table[word] += 1
            else:
                freq_table[word] = 1

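        # NOTE: the key is only the first 15 characters of the sentence, so
        # sentences that share a 15-character prefix overwrite each other
        frequency_matrix[sent[:15]] = freq_table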

    return frequency_matrix


def _create_tf_matrix(freq_matrix):
    """Compute the term frequencies for each sentence

    Args:
        freq_matrix (dict): per-sentence word counts

    Returns:
        term frequency matrix
    """
    tf_matrix = {}

    # divide each count by the number of distinct words kept for the sentence
    for sent, f_table in freq_matrix.items():
        tf_table = {}

        count_words_in_sentence = len(f_table)
        for word, count in f_table.items():
            tf_table[word] = count / count_words_in_sentence

        tf_matrix[sent] = tf_table

    return tf_matrix


def _create_documents_per_words(freq_matrix):
    """Count how many sentences (documents) contain each word

    Args:
        freq_matrix (dict): per-sentence word counts

    Returns:
        dict mapping each word to the number of sentences it appears in
    """
    word_per_doc_table = {}

    # start counting 
    for sent, f_table in freq_matrix.items():
        for word, count in f_table.items():
            if word in word_per_doc_table:
                word_per_doc_table[word] += 1
            else:
                word_per_doc_table[word] = 1

    return word_per_doc_table


def _create_idf_matrix(freq_matrix, count_doc_per_words, total_documents):
    """Compute the inverse document frequency

    Args:
        freq_matrix (dict): frequency
        count_doc_per_words (dict): document frequency
        total_documents (int): number of documents

    Returns:
        inverse document frequency matrix
    """    
    idf_matrix = {}

    for sent, f_table in freq_matrix.items():
        idf_table = {}

        # follow the formula
        for word in f_table.keys():
            idf_table[word] = math.log10(total_documents / float(count_doc_per_words[word]))

        idf_matrix[sent] = idf_table

    return idf_matrix


def _create_tf_idf_matrix(tf_matrix, idf_matrix):
    """tf-idf matrix from tf and idf matrices

    Args:
        tf_matrix (dict): term frequency
        idf_matrix (dict): inverse document frequency

    Returns:
        dict: tf-idf matrix
    """    
    tf_idf_matrix = {}

    for (sent1, f_table1), (sent2, f_table2) in zip(tf_matrix.items(), idf_matrix.items()):

        tf_idf_table = {}

        for (word1, value1), (word2, value2) in zip(f_table1.items(),
                                                    f_table2.items()):  # here, keys are the same in both the table
            tf_idf_table[word1] = float(value1 * value2)

        tf_idf_matrix[sent1] = tf_idf_table

    return tf_idf_matrix


def _score_sentences(tf_idf_matrix) -> dict:
    """Score a sentence by its words' TF-IDF: sum the TF-IDF values of every
    retained word in the sentence and divide by the number of distinct words.

    Args:
        tf_idf_matrix (dict): tf-idf values above

    Returns:
        dict: each sentence value
    """    

    sentenceValue = {}

    for sent, f_table in tf_idf_matrix.items():
        total_score_per_sentence = 0

        count_words_in_sentence = len(f_table)
        for word, score in f_table.items():
            total_score_per_sentence += score
        # average TF-IDF per distinct word (0 if the sentence has no scored words)
        sentenceValue[sent] = total_score_per_sentence / count_words_in_sentence if count_words_in_sentence else 0

    return sentenceValue


def _find_score(sentenceValue, percent_sentences) -> float:
    """Find the score threshold: the (100 - percent_sentences)th percentile of the sentence values

    Args:
        sentenceValue (dict): sentences' values above 
        percent_sentences (int): percent of information to retain

    Returns:
        float: threshold to keep
    """   

    values = []
    for entry in sentenceValue:
        values.append(sentenceValue[entry])

    # Percentile cutoff, so that roughly percent_sentences percent of the sentences score at or above it
    return np.percentile(values, 100 - percent_sentences)


def _generate_summary(sentences, sentenceValue, threshold):
    """Build the summary from the sentences scoring at or above the threshold

    Args:
        sentences (list): all sentences
        sentenceValue (dict): the dict storing each sentence's value
        threshold (float): threshold to select sentences

    Returns:
        (str): summary
    """    
    sentence_count = 0
    summary = ''

    # check if qualify 
    for sentence in sentences:
        if sentence[:15] in sentenceValue and sentenceValue[sentence[:15]] >= (threshold):
            summary += " " + sentence
            sentence_count += 1

    return summary


def summarizer_tfidf_get(text, percent_sentences = 50):
    """Summarize using TF-IDF

    Args:
        text (str): the text to summarize
        percent_sentences (int, optional): percent to retain. Defaults to 50.

    Returns:
        dictionary with the method [method], summary [sentences], number of sentences [summary_num_sentences]
    """ 

    # 1 Sentence Tokenize
    sentences = sent_tokenize(text)
    
    total_documents = len(sentences)
    #print(sentences)

    # 2 Create the Frequency matrix of the words in each sentence.
    freq_matrix = _create_frequency_matrix(sentences)
    #print(freq_matrix)

    '''
    Term frequency (TF) is how often a word appears in a document, divided by how many words there are in the document.
    '''
    # 3 Calculate TermFrequency and generate a matrix
    tf_matrix = _create_tf_matrix(freq_matrix)
    #print(tf_matrix)

    # 4 creating table for documents per words
    count_doc_per_words = _create_documents_per_words(freq_matrix)
    #print(count_doc_per_words)

    '''
    Inverse document frequency (IDF) is how unique or rare a word is.
    '''
    # 5 Calculate IDF and generate a matrix
    idf_matrix = _create_idf_matrix(freq_matrix, count_doc_per_words, total_documents)
    #print(idf_matrix)

    # 6 Calculate TF-IDF and generate a matrix
    tf_idf_matrix = _create_tf_idf_matrix(tf_matrix, idf_matrix)
    #print(tf_idf_matrix)

    # 7 Important Algorithm: score the sentences
    sentence_scores = _score_sentences(tf_idf_matrix)
    #print(sentence_scores)

    # 8 Find the threshold
    threshold = _find_score(sentence_scores, percent_sentences)
    #print(threshold)

    # 9 Important Algorithm: Generate the summary
    summary = _generate_summary(sentences, sentence_scores, threshold)
    sentences = sent_tokenize(summary)

    # output
    output = {'sentences': summary, 'summary_num_sentences': len(sentences), "method": "tfidf"}

    return output

In [14]:
output = summarizer_tfidf_get(txt, 30)

print("Number of sentences:", output['summary_num_sentences'])
print("Summary:", output['sentences'])

Number of sentences: 7
Summary:  Frederic J. Both the Chinese and U.S. side kept calling the reporters back into the room so they could add remarks. "Each of these actions threaten the rules-based order that maintains global stability. The United States does not have the qualification to say that it wants to speak to China from a position of strength. I'm also hearing deep concern about some of the actions your government is taking. In the last year, Beijing has pushed ahead with major trade deals with Asia-Pacific neighbors and the European Union. China's Ministry of Foreign Affairs did not immediately have a comment.
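
As a compact illustration of the scoring pipeline above, the sketch below computes one average TF-IDF score per sentence on a toy input. It is a simplified stand-in, not the notebook's functions: `tfidf_sentence_scores` and the toy tokens are hypothetical, tokenization, stemming and stopword removal are assumed to have happened already, and sentences are indexed by position rather than by their first 15 characters.

```python
import math
from collections import Counter


def tfidf_sentence_scores(tokenized_sentences):
    """Return one average TF-IDF score per sentence (each sentence is a 'document')."""
    n_docs = len(tokenized_sentences)
    # Document frequency: in how many sentences does each word occur?
    doc_freq = Counter(word for sent in tokenized_sentences for word in set(sent))

    scores = []
    for sent in tokenized_sentences:
        counts = Counter(sent)
        if not counts:
            scores.append(0.0)  # nothing to score in an empty sentence
            continue
        n_unique = len(counts)
        total = 0.0
        for word, count in counts.items():
            tf = count / n_unique  # same normalization as _create_tf_matrix
            idf = math.log10(n_docs / doc_freq[word])
            total += tf * idf
        scores.append(total / n_unique)
    return scores


# Toy "sentences", already tokenized (made-up words for illustration).
toy_docs = [
    ["talks", "alaska"],
    ["talks", "trade"],
    ["talks", "talks"],
]
toy_tfidf = tfidf_sentence_scores(toy_docs)
print(toy_tfidf)
```

Because "talks" occurs in every sentence its IDF is log10(3/3) = 0, so the third sentence scores 0 while the first two are carried entirely by their rarer word. Indexing by position also sidesteps the key collisions that arise when two sentences share the same opening characters.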
