## Import Libraries

In [1]:
import pandas as pd
import numpy as np
import torch
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from google.colab import drive
drive.mount('/content/drive')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Mounted at /content/drive


## Data Processing

#### Paragraph texts

In [2]:
paragraphs = pd.read_json("/content/drive/MyDrive/vip_project_spring_2023/data/textbook_df.json")

In [3]:
paragraphs

Unnamed: 0,Textbook_Data,Chapter.Section
0,CHAPTER -2 Complex Numbers,-2.00
1,The basic manipulations of complex numbers are...,-2.00
2,Simple algebraic rules: Operations on complex...,-2.00
3,Elimination of trigonometry: Eulers formula f...,-2.00
4,Representation by vectors: A vector drawn fro...,-2.00
...,...,...
999,We also continued to stress the important conc...,10.12
1000,Lab~#11 is devoted to IIR filters. This lab us...,10.12
1001,"Lab: #11 PeZ, the,, and domains The PeZ tool ...",10.12
1002,The CD-ROM also contains the following demonst...,10.12


In [4]:
paragraphs["Textbook_Data"][0]

'CHAPTER -2 Complex Numbers '

In [5]:
length = len(paragraphs)
length

1004

#### Sub-chapters in the textbook

In [6]:
# Unique section labels in ascending numeric order
sections = sorted(set(paragraphs["Chapter.Section"]))
print(sections)

[-2.7, -2.6, -2.5, -2.4, -2.3, -2.2, -2.1, -2.0, -1.7, -1.6, -1.5, -1.4, -1.3, -1.2, -1.1, -1.0, 1.0, 1.1, 1.2, 1.3, 1.4, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9, 3.0, 3.1, 3.11, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 4.0, 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 5.0, 5.1, 5.2, 5.3, 5.4, 5.5, 5.6, 5.7, 5.8, 5.9, 6.0, 6.1, 6.2, 6.3, 6.4, 6.5, 6.6, 6.7, 6.8, 6.9, 7.1, 7.2, 7.3, 7.4, 7.5, 7.6, 7.7, 8.1, 8.11, 8.2, 8.4, 8.5, 8.6, 8.7, 8.8, 8.9, 9.0, 9.1, 9.2, 9.3, 9.4, 9.5, 9.6, 9.7, 9.8, 9.9, 10.0, 10.1, 10.11, 10.12, 10.2, 10.3, 10.4, 10.5, 10.6, 10.7, 10.8, 10.9]
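In the sorted list above, 3.11 falls between 3.1 and 3.2, and 10.11 and 10.12 fall before 10.2, because the labels are floats and sort numerically rather than in reading order. A float also cannot distinguish a section 10 from a section 1 (3.10 == 3.1). The following is a minimal sketch of sorting by (chapter, section) instead. It assumes the labels are available as strings, which is not how the DataFrame above stores them, and `section_key` is a hypothetical helper, not part of the notebook.

```python
from typing import Tuple

def section_key(label: str) -> Tuple[int, int]:
    """Split a label such as '3.11' into (chapter, section) integers."""
    chapter, _, section = label.partition(".")
    return int(chapter), int(section or 0)

# Illustrative labels only; the real ones come from paragraphs["Chapter.Section"]
labels = ["3.1", "3.11", "3.2", "-2.7", "-2.0", "10.12", "10.2"]
ordered = sorted(labels, key=section_key)
print(ordered)
```

With this key, section 11 sorts after section 2 within a chapter, and the appendix sections (negative chapter numbers) run from .0 upward instead of downward.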


#### Explore Chapter 1, Section 0

In [7]:
# Select the paragraphs of section 1.0 with a boolean mask
section_1_0_text = paragraphs.loc[paragraphs["Chapter.Section"] == 1.0, "Textbook_Data"].tolist()
section_1_0_text

['CHAPTER 1 Introduction This is a book about signals and systems. In this age of multimedia computers, audio and video entertainment systems, and digital communication systems, it is almost certain that you, the reader of this text, have formed some impression of the meaning of the terms signal and system, and you probably use the terms often in daily conversation.',
 'It is likely that your usage and understanding of the terms are correct within some rather broad definitions. For example, you may think of a signal as “something” that carries information. Usually, that something is a pattern of variations of a physical quantity that can be manipulated, stored, or transmitted by physical processes. Examples include speech signals, audio signals, video or image signals, biomedical signals, radar signals, and seismic signals, to name just a few. An important point is that signals can take many equivalent forms or representations. For example, a speech signal is produced as an acoustic si

In [8]:
# Concatenate the paragraphs, each prefixed with a space
section_1_0_string = ''.join(' ' + str(p) for p in section_1_0_text)
section_1_0_string

' CHAPTER 1 Introduction This is a book about signals and systems. In this age of multimedia computers, audio and video entertainment systems, and digital communication systems, it is almost certain that you, the reader of this text, have formed some impression of the meaning of the terms signal and system, and you probably use the terms often in daily conversation. It is likely that your usage and understanding of the terms are correct within some rather broad definitions. For example, you may think of a signal as “something” that carries information. Usually, that something is a pattern of variations of a physical quantity that can be manipulated, stored, or transmitted by physical processes. Examples include speech signals, audio signals, video or image signals, biomedical signals, radar signals, and seismic signals, to name just a few. An important point is that signals can take many equivalent forms or representations. For example, a speech signal is produced as an acoustic signal

#### Join the paragraphs of each sub-chapter into a single string

In [9]:
def to_string(section):
  """Join all paragraphs of one sub-chapter into a single string."""
  mask = paragraphs["Chapter.Section"] == section
  # Each paragraph is prefixed with a space, so the result starts with one
  return ''.join(' ' + str(p) for p in paragraphs.loc[mask, "Textbook_Data"])

In [10]:
to_string(1.0)

' CHAPTER 1 Introduction This is a book about signals and systems. In this age of multimedia computers, audio and video entertainment systems, and digital communication systems, it is almost certain that you, the reader of this text, have formed some impression of the meaning of the terms signal and system, and you probably use the terms often in daily conversation. It is likely that your usage and understanding of the terms are correct within some rather broad definitions. For example, you may think of a signal as “something” that carries information. Usually, that something is a pattern of variations of a physical quantity that can be manipulated, stored, or transmitted by physical processes. Examples include speech signals, audio signals, video or image signals, biomedical signals, radar signals, and seismic signals, to name just a few. An important point is that signals can take many equivalent forms or representations. For example, a speech signal is produced as an acoustic signal
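`to_string` scans the whole table each time it is called, once per sub-chapter. As a sketch of an alternative, the paragraphs can be grouped in a single pass. The version below uses only the standard library and toy rows. `join_by_section` is a hypothetical helper; in the notebook the rows would presumably come from `zip(paragraphs["Chapter.Section"], paragraphs["Textbook_Data"])`. Unlike `to_string`, it does not add a leading space.

```python
from collections import defaultdict

def join_by_section(rows):
    """Group (section, text) pairs and join each group's texts with spaces."""
    grouped = defaultdict(list)
    for section, text in rows:
        grouped[section].append(str(text))
    return {section: " ".join(texts) for section, texts in grouped.items()}

# Toy rows for illustration, not the textbook data
rows = [
    (1.0, "CHAPTER 1 Introduction"),
    (1.0, "Signals carry information."),
    (1.1, "Mathematical representation of signals."),
]
section_strings = join_by_section(rows)
print(section_strings[1.0])
```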

## Text Summarization algorithm

#### Import NLTK

In [11]:
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords

In [12]:
from nltk.tokenize import word_tokenize, sent_tokenize
sentences = sent_tokenize(section_1_0_string)
sentences

[' CHAPTER 1 Introduction This is a book about signals and systems.',
 'In this age of multimedia computers, audio and video entertainment systems, and digital communication systems, it is almost certain that you, the reader of this text, have formed some impression of the meaning of the terms signal and system, and you probably use the terms often in daily conversation.',
 'It is likely that your usage and understanding of the terms are correct within some rather broad definitions.',
 'For example, you may think of a signal as “something” that carries information.',
 'Usually, that something is a pattern of variations of a physical quantity that can be manipulated, stored, or transmitted by physical processes.',
 'Examples include speech signals, audio signals, video or image signals, biomedical signals, radar signals, and seismic signals, to name just a few.',
 'An important point is that signals can take many equivalent forms or representations.',
 'For example, a speech signal is p

In [13]:
def create_dictionary_table(text_string) -> dict:
    """Count how often each non-stop-word token occurs in text_string."""

    # Stop words: NLTK's English list plus punctuation tokens
    stop_words = set(stopwords.words("english"))
    stop_words.update([';', '“', '”', '.', ',', ':', '(', ')', '{', '}'])
    # stop_words.add('※')

    words = word_tokenize(text_string.lower())

    # Optional: reduce words to their root form
    # stemmer = SnowballStemmer("english")

    # Build the word frequency table
    frequency_table = dict()
    for wd in words:
        # wd = stemmer.stem(wd)
        if wd in stop_words:
            continue
        frequency_table[wd] = frequency_table.get(wd, 0) + 1

    return frequency_table
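The same counting step can be sketched with `collections.Counter`. To stay self-contained, this version swaps in a simple regular-expression tokenizer for `word_tokenize` and a tiny hand-written stop-word set for NLTK's English list. Both are stand-ins chosen for illustration, so counts on real text may differ from `create_dictionary_table`.

```python
import re
from collections import Counter

# Tiny illustrative subset, not NLTK's stop-word list
STOP_WORDS = {"a", "about", "and", "is", "this"}

def word_frequencies(text, stop_words=STOP_WORDS):
    """Count lower-cased alphanumeric tokens that are not stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in stop_words)

freq = word_frequencies(
    "This is a book about signals and systems. Signals carry information."
)
print(freq.most_common(1))
```

Because the regular expression keeps only alphanumeric runs, punctuation never becomes a token and does not need to be listed as a stop word.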

In [14]:
frequency_table = create_dictionary_table(section_1_0_string)
frequency_table


{'chapter': 1,
 '1': 1,
 'introduction': 1,
 'book': 1,
 'signals': 16,
 'systems': 9,
 'age': 1,
 'multimedia': 1,
 'computers': 1,
 'audio': 4,
 'video': 2,
 'entertainment': 1,
 'digital': 2,
 'communication': 1,
 'almost': 1,
 'certain': 1,
 'reader': 1,
 'text': 2,
 'formed': 1,
 'impression': 1,
 'meaning': 1,
 'terms': 3,
 'signal': 9,
 'system': 6,
 'probably': 1,
 'use': 2,
 'often': 2,
 'daily': 1,
 'conversation': 1,
 'likely': 1,
 'usage': 1,
 'understanding': 2,
 'correct': 1,
 'within': 1,
 'rather': 1,
 'broad': 1,
 'definitions': 1,
 'example': 4,
 'may': 2,
 'think': 1,
 'something': 3,
 'carries': 1,
 'information': 1,
 'usually': 1,
 'pattern': 2,
 'variations': 1,
 'physical': 2,
 'quantity': 1,
 'manipulated': 1,
 'stored': 2,
 'transmitted': 1,
 'processes': 1,
 'examples': 1,
 'include': 1,
 'speech': 2,
 'image': 1,
 'biomedical': 1,
 'radar': 1,
 'seismic': 1,
 'name': 1,
 'important': 1,
 'point': 1,
 'take': 1,
 'many': 1,
 'equivalent': 1,
 'forms': 1,
 'rep

In [15]:
def _calculate_sentence_scores(sentences, frequency_table) -> dict:
    """Score each sentence by the mean frequency of the table words it contains."""
    sentence_weight = dict()

    for sentence in sentences:
        # Caveat: sentences are keyed by their first 7 characters, so
        # sentences sharing a prefix accumulate into one entry.
        key = sentence[:7]
        lowered = sentence.lower()
        matched_words = 0
        for word in frequency_table:
            # Caveat: substring test, so 'age' also matches 'usage'
            if word in lowered:
                matched_words += 1
                sentence_weight[key] = sentence_weight.get(key, 0) + frequency_table[word]

        # Skip sentences with no matching words to avoid dividing by zero
        if matched_words == 0:
            continue
        sentence_weight[key] = sentence_weight[key] / matched_words

    return sentence_weight
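Two properties of the scoring function above are worth noting. Scores are keyed by the first seven characters of a sentence, so the two "For example, ..." sentences in section 1.0 share the single key `'For exa'`. And `in` on strings is a substring test, so a table word such as `age` is also found inside `usage` or `image`. The sketch below is one possible variant, not the notebook's method: it keys scores by sentence index and matches whole tokens, using a regular-expression tokenizer as a stand-in for `word_tokenize`. `score_sentences` is a hypothetical name.

```python
import re

def score_sentences(sentences, frequency_table):
    """Mean table frequency of the distinct table words in each sentence."""
    scores = {}
    for index, sentence in enumerate(sentences):
        tokens = set(re.findall(r"[a-z0-9]+", sentence.lower()))
        matched = [w for w in tokens if w in frequency_table]
        if matched:  # sentences with no table words get no score
            scores[index] = sum(frequency_table[w] for w in matched) / len(matched)
    return scores

# Toy inputs for illustration
frequency_table = {"signals": 2, "age": 1, "image": 1}
sentences = [
    "Signals carry information.",
    "For example, an image is a signal.",
    "For example, usage varies.",
]
scores = score_sentences(sentences, frequency_table)
print(scores)
```

Here the third sentence receives no score even though `usage` contains `age`, and the two "For example" sentences no longer collide.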

In [16]:
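def _calculate_threshold(sentence_weight, length) -> float:
    """Return the length-th largest sentence weight (the minimum if there are fewer)."""
    weight = list(sentence_weight.values())
    if len(weight) <= length:
        # Not more weights than requested: the minimum keeps every sentence
        th = np.min(weight)
    else:
        weight.sort(reverse=True)
        th = weight[length-1]

    return th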

In [17]:
weights = _calculate_sentence_scores(sentences, frequency_table)
weights

{' CHAPTE': 5.5,
 'In this': 2.1923076923076925,
 'It is l': 1.2142857142857142,
 'For exa': 2.5251177394034534,
 'Usually': 1.3571428571428572,
 'Example': 3.066666666666667,
 'An impo': 3.5,
 'The ter': 1.75,
 'More sp': 3.8181818181818183,
 'A CD pl': 2.4285714285714284,
 'In gene': 4.9,
 'Our goa': 3.8461538461538463,
 'Specifi': 3.857142857142857,
 'We also': 3.0526315789473686}

In [18]:
threshold = _calculate_threshold(weights, 4)
threshold

3.8461538461538463
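As a worked check of this value, the threshold is the fourth-largest of the fourteen weights printed above. The sketch below repeats the selection with the standard library only; `kth_largest` is a hypothetical stand-in for `_calculate_threshold`, and the numbers are copied from the `weights` output.

```python
def kth_largest(values, k):
    """Return the k-th largest value, or the smallest if there are at most k."""
    ordered = sorted(values, reverse=True)
    return ordered[-1] if len(ordered) <= k else ordered[k - 1]

# Sentence weights for section 1.0, copied from the output above
weights = [
    5.5, 2.1923076923076925, 1.2142857142857142, 2.5251177394034534,
    1.3571428571428572, 3.066666666666667, 3.5, 1.75,
    3.8181818181818183, 2.4285714285714284, 4.9, 3.8461538461538463,
    3.857142857142857, 3.0526315789473686,
]
threshold = kth_largest(weights, 4)
print(threshold)
```

The four largest weights are 5.5, 4.9, 3.857... and 3.846..., so every sentence scoring at least 3.846... passes the filter.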

In [19]:
def _get_article_summary(sentences, sentence_weight, threshold):
    """Concatenate, in document order, the sentences scoring at least threshold."""
    article_summary = ''

    for sentence in sentences:
        key = sentence[:7]
        if key in sentence_weight and sentence_weight[key] >= threshold:
            article_summary += " " + sentence

    return article_summary

In [20]:
_get_article_summary(sentences, weights, threshold)

'  CHAPTER 1 Introduction This is a book about signals and systems. In general, systems operate on signals to produce new signals or new signal representations. Our goal in this text is to develop a framework wherein it is possible to make precise statements about both signals and systems. Specifically, we want to show that mathematics is an appropriate language for describing and understanding signals and systems.'
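Filtering with `>=` against a threshold admits every sentence that ties with the threshold, and with prefix keys it admits every sentence that shares a qualifying prefix, so a summary can contain more sentences than requested. A minimal sketch of an alternative that returns at most `k` sentences in document order follows. It assumes scores are keyed by sentence index, which differs from the prefix keys used in this notebook, and `top_k_summary` is a hypothetical name.

```python
def top_k_summary(sentences, scores, k):
    """Join the k highest-scoring sentences, kept in their original order."""
    # sorted() is stable, so earlier sentences win ties
    ranked = sorted(scores, key=lambda i: scores[i], reverse=True)[:k]
    return " ".join(sentences[i] for i in sorted(ranked))

# Toy inputs for illustration
sentences = ["A.", "B.", "C.", "D."]
scores = {0: 1.0, 1: 3.0, 2: 3.0, 3: 2.0}
print(top_k_summary(sentences, scores, 2))
```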

## Apply to all sub-chapters

In [21]:
summary = {}
length_of_summary = []
for s in sections:
  section_string = to_string(s)
  section_sentences = sent_tokenize(section_string)
  section_frequency_table = create_dictionary_table(section_string)
  section_sentence_weights = _calculate_sentence_scores(section_sentences, section_frequency_table)
  section_threshold = _calculate_threshold(section_sentence_weights, 4)
  summary[str(s)] = "Chapter " + str(s) + ":" + _get_article_summary(section_sentences, section_sentence_weights, section_threshold)
  # Number of sentences in the generated summary
  sum_len = len(sent_tokenize(summary[str(s)]))
  length_of_summary.append(sum_len)
  # if sum_len > 7:
  #   section_threshold = _calculate_threshold(section_sentence_weights, 1)
  #   summary[str(s)] = _get_article_summary(section_sentences, section_sentence_weights, section_threshold)
  #   sum_len = len(sent_tokenize(summary[str(s)]))
  if sum_len == 0:
    print("NULL")

summary

{'-2.7': 'Chapter -2.7:  Summary and Links This appendix has presented a brief review of complex numbers and their visualization as vectors in the two-dimensional complex plane. The labs in Chapter 2 deal with various aspects of complex numbers, and also introduce MATLAB. In Lab #2, we also have included a number of MATLAB functions for plotting vectors from complex numbers ( zvect, zcat) and for changing between Cartesian and polar forms ( zprint). Lab: #2, Adding Sinusoids via Complex Amplitudes The Complex Numbers via MATLAB demo is a quick reference to these routines.',
 '-2.6': 'Chapter -2.6: Graphical display of the th roots of unity ( ). These are the solutions of. Notice that there are only distinct roots. Graphical display of the th roots of unity ( ). These are the solutions of. Notice that there are only distinct roots. Plot all the solutions.',
 '-2.5': 'Chapter -2.5: Draw the vector from the origin to the head of. This vector is the sum. Complex multiplication becomes a ro

## Evaluate length of summary

In [22]:
print(length_of_summary)

[4, 7, 5, 4, 4, 5, 4, 5, 7, 4, 4, 4, 4, 5, 5, 6, 4, 4, 4, 4, 4, 4, 4, 5, 4, 7, 7, 4, 5, 4, 6, 4, 4, 6, 5, 8, 6, 4, 7, 5, 4, 4, 4, 8, 5, 7, 8, 4, 6, 5, 4, 6, 5, 4, 6, 5, 4, 4, 4, 4, 4, 6, 4, 4, 4, 4, 5, 6, 4, 4, 5, 5, 8, 9, 6, 4, 4, 4, 5, 6, 5, 9, 6, 8, 6, 5, 5, 4, 4, 5, 4, 7, 4, 4, 4, 5, 7, 4, 6, 4, 5, 6, 8, 4, 4, 5, 9]


#### Average length of summary

In [23]:
np.average(length_of_summary)

5.093457943925234
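The average alone hides how often a summary exceeds the requested four sentences. A small sketch of tabulating the distribution is shown below. `length_report` is a hypothetical helper, and `sample` holds only the first eight values of `length_of_summary`, for illustration.

```python
from collections import Counter

def length_report(lengths, target):
    """Return counts per summary length and the share of summaries above target."""
    counts = Counter(lengths)
    over = sum(1 for n in lengths if n > target)
    return dict(sorted(counts.items())), over / len(lengths)

# First eight values of length_of_summary, used as a small sample
sample = [4, 7, 5, 4, 4, 5, 4, 5]
distribution, share_over = length_report(sample, target=4)
print(distribution, share_over)
```

Running the same helper on the full `length_of_summary` list would give the distribution for all sub-chapters.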

#### Export summary to CSV file

In [24]:
from google.colab import files
# One row per sub-chapter: (label, summary text)
df = pd.DataFrame(list(summary.items()), columns=["sub_chapter", "summarization"])
df.to_csv('summary.csv')
files.download('summary.csv')