<a href="https://colab.research.google.com/github/shubhamksingh1/Summarization/blob/main/Summary_Using_TF_IDF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

####Term Frequency * Inverse Document Frequency 

TF-IDF is the product of two measures: term frequency (TF) and inverse document frequency (IDF).

Term Frequency
Term frequency (TF) is how often a word appears in a document, divided by the total number of words in that document.

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)

Inverse document frequency
Term frequency measures how common a word is within a document; inverse document frequency (IDF) measures how rare the word is across all documents.

IDF(t) = log_10(Total number of documents / Number of documents with term t in it)

Example:

Consider a document containing 100 words in which the word apple appears 5 times. The term frequency (TF) for apple is then 5 / 100 = 0.05.
Now, assume we have 10 million documents and the word apple appears in one thousand of them. Then the inverse document frequency (IDF) is log_10(10,000,000 / 1,000) = 4.
Thus, the TF-IDF weight is the product of these quantities: 0.05 * 4 = 0.20.
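
The arithmetic in this example can be checked in a few lines of Python. This is a minimal sketch: the variable names are illustrative, and the base-10 logarithm is assumed because it is the base that yields the value 4 above.

```python
import math

# Values taken from the worked example.
term_count = 5              # occurrences of "apple" in the document
total_terms = 100           # words in the document
total_docs = 10_000_000     # documents in the collection
docs_with_term = 1_000      # documents that contain "apple"

tf = term_count / total_terms
idf = math.log10(total_docs / docs_with_term)
tf_idf = tf * idf

print(tf, idf, tf_idf)
```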

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import math
import nltk
from nltk import sent_tokenize, word_tokenize, PorterStemmer
from nltk.corpus import stopwords

In [None]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

###The 9 steps implementation

**Prerequisites:** Python 3 and the NLTK library.



###1. Tokenize the sentences
Here we tokenize the text into sentences rather than words, because the goal is to assign a weight to each sentence.

###2. Create the Frequency matrix of the words in each sentence.

We count how often each word occurs in each sentence.
Here, each sentence (truncated to its first 15 characters) is the key and the value is a dictionary of word frequencies.

In [None]:
def _create_frequency_matrix(sentences) -> dict:
    """
    Build a word-frequency table for every sentence.

    Words are lowercased and stemmed (a stemmer reduces a word to its
    root form), and words found in the stop-word set are skipped.
    :rtype: dict
    """
    frequency_matrix = {}
    stopWords = set(stopwords.words("english"))
    ps = PorterStemmer()

    for sent in sentences:
        freq_table = {}
        words = word_tokenize(sent)
        for word in words:
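            word = word.lower()
            word = ps.stem(word)
            # Note: the check runs on the stemmed form, so a stop word whose
            # stem differs from its listed form (e.g. "this") is not filtered.
            if word in stopWords:
                continue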

            if word in freq_table:
                freq_table[word] += 1
            else:
                freq_table[word] = 1

        # Key on the first 15 characters; sentences sharing that prefix overwrite each other.
        frequency_matrix[sent[:15]] = freq_table

    return frequency_matrix
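
To see the shape of the resulting table without the NLTK data, here is a minimal sketch of the same idea. The tiny stop-word set, the whitespace tokenization and the name `toy_frequency_matrix` are assumptions made for illustration; there is no stemming, so the counts will not match the NLTK-based function exactly.

```python
from collections import Counter

# Hypothetical stand-ins: a tiny stop-word set and whitespace tokenization
# replace NLTK's stopwords corpus, word_tokenize and PorterStemmer.
STOP_WORDS = {"the", "is", "a", "on"}

def toy_frequency_matrix(sentences):
    matrix = {}
    for sent in sentences:
        words = [w.strip(".,").lower() for w in sent.split()]
        counts = Counter(w for w in words if w and w not in STOP_WORDS)
        matrix[sent[:15]] = dict(counts)  # same 15-character key as above
    return matrix

sentences = ["The cat sat on the mat.", "The cat is a cat."]
toy_matrix = toy_frequency_matrix(sentences)
print(toy_matrix)
```

Each key is a sentence prefix and each value maps a word to its count within that sentence.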


###3. Calculate TermFrequency and generate a matrix

We’ll find the term frequency for each word in a sentence.

Recall the definition of TF:

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
Here, the document is a sentence and the term is a word in that sentence.

Comparing this table with the one generated in step 2, words with the same frequency within a sentence get the same TF score.

In [None]:
def _create_tf_matrix(freq_matrix):
    tf_matrix = {}

    for sent, f_table in freq_matrix.items():
        tf_table = {}

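        # Note: len(f_table) is the number of distinct words kept in the sentence,
        # not the total token count used in the TF definition.
        count_words_in_sentence = len(f_table)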
        for word, count in f_table.items():
            tf_table[word] = count / count_words_in_sentence

        tf_matrix[sent] = tf_table

    return tf_matrix
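
The function above divides by `len(f_table)`, the number of distinct words, whereas the TF definition divides by the total number of terms. The sketch below compares the two divisors on a hypothetical frequency table. It only illustrates the difference and is not a claim about which one the notebook intends.

```python
# Toy frequency table for one sentence (hypothetical values).
f_table = {"cat": 2, "sat": 1, "mat": 1}

distinct_words = len(f_table)          # divisor used in _create_tf_matrix
total_words = sum(f_table.values())    # divisor in the TF definition

tf_distinct = {w: c / distinct_words for w, c in f_table.items()}
tf_total = {w: c / total_words for w, c in f_table.items()}

print(tf_distinct)
print(tf_total)
```

With the total-count divisor the TF values of a sentence sum to 1; with the distinct-count divisor they can exceed 1 when words repeat.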



###4. Creating a table for documents per words


This is another simple table that helps in calculating the IDF matrix.
We count how many sentences contain each word; let’s call this the documents-per-word table.

In [None]:
def _create_documents_per_words(freq_matrix):
    word_per_doc_table = {}

    for sent, f_table in freq_matrix.items():
        for word, count in f_table.items():
            if word in word_per_doc_table:
                word_per_doc_table[word] += 1
            else:
                word_per_doc_table[word] = 1

    return word_per_doc_table
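
As a small illustration, the same count can be written with `collections.Counter`. The frequency matrix below is hypothetical; the point is that a sentence adds 1 for each word it contains, however often that word repeats inside the sentence.

```python
from collections import Counter

# Hypothetical frequency matrix: sentence key -> {word: count}.
freq_matrix = {
    "s1": {"cat": 2, "mat": 1},
    "s2": {"cat": 1, "dog": 1},
    "s3": {"dog": 3},
}

# Updating with the keys adds exactly 1 per word per sentence.
docs_per_word = Counter()
for f_table in freq_matrix.values():
    docs_per_word.update(f_table.keys())

print(dict(docs_per_word))
```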


###5. Calculate IDF and generate a matrix


We’ll find the IDF for each word in a sentence.

Recall the definition of IDF:

IDF(t) = log_10(Total number of documents / Number of documents with term t in it)
Here, the document is a sentence and the term is a word in that sentence.

In [None]:

def _create_idf_matrix(freq_matrix, count_doc_per_words, total_documents):
    idf_matrix = {}

    for sent, f_table in freq_matrix.items():
        idf_table = {}

        for word in f_table.keys():
            idf_table[word] = math.log10(total_documents / float(count_doc_per_words[word]))

        idf_matrix[sent] = idf_table

    return idf_matrix
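
A short sketch with hypothetical counts shows how the IDF value behaves: a word that appears in every sentence gets 0, and rarer words get larger values. It uses `math.log10`, as the function above does.

```python
import math

total_documents = 4  # hypothetical number of sentences
docs_per_word = {"apple": 4, "iphone": 2, "tablet": 1}

idf = {w: math.log10(total_documents / n) for w, n in docs_per_word.items()}
print(idf)
```

Because "apple" occurs in all four sentences here, it contributes nothing to any sentence's TF-IDF score.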

###6. Calculate TF-IDF and generate a matrix

Now that we have both matrices, the next step is straightforward.
TF-IDF is the product of TF and IDF.
In simple terms, we multiply the corresponding values from the two matrices to generate a new matrix.

In [None]:

def _create_tf_idf_matrix(tf_matrix, idf_matrix):
    tf_idf_matrix = {}

    for (sent1, f_table1), (sent2, f_table2) in zip(tf_matrix.items(), idf_matrix.items()):

        tf_idf_table = {}

        # Both tables were built from the same frequency table, so their keys match in order.
        for (word1, value1), (word2, value2) in zip(f_table1.items(), f_table2.items()):
            tf_idf_table[word1] = float(value1 * value2)

        tf_idf_matrix[sent1] = tf_idf_table

    return tf_idf_matrix
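
The `zip`-based version relies on both tables listing their words in the same order. A possible alternative, sketched here under the hypothetical name `multiply_by_key`, looks each word up by key, so the result does not depend on ordering.

```python
import math

def multiply_by_key(tf_matrix, idf_matrix):
    """Multiply TF and IDF values, looking each word up by key."""
    return {
        sent: {word: tf * idf_matrix[sent][word] for word, tf in tf_table.items()}
        for sent, tf_table in tf_matrix.items()
    }

# Hypothetical values; the IDF table is deliberately in a different order.
tf_matrix = {"s1": {"cat": 0.5, "mat": 0.25}}
idf_matrix = {"s1": {"mat": 0.6, "cat": 0.3}}

tf_idf = multiply_by_key(tf_matrix, idf_matrix)
print(tf_idf)
```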

###7. Score the sentences

Sentence scoring differs between algorithms. Here, we use the TF-IDF scores of the words in a sentence to weight that sentence.
This gives a table of sentences and their respective scores.

In [None]:
def _score_sentences(tf_idf_matrix) -> dict:
    """
    Score a sentence by the TF-IDF values of its words.
    Basic algorithm: sum the TF-IDF scores of the non-stop words in a sentence
    and divide by the number of distinct words in that sentence.
    :rtype: dict
    """

    sentenceValue = {}

    for sent, f_table in tf_idf_matrix.items():
        total_score_per_sentence = 0

        count_words_in_sentence = len(f_table)
        for word, score in f_table.items():
            total_score_per_sentence += score

        sentenceValue[sent] = total_score_per_sentence / count_words_in_sentence

    return sentenceValue
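
The scoring rule is an average of the per-word TF-IDF values, which the sketch below reproduces on hypothetical numbers. The `if table` guard is an addition, not part of the function above: it skips a sentence whose table is empty, which would otherwise cause a division by zero.

```python
import math

# Hypothetical TF-IDF matrix: sentence key -> {word: tf-idf value}.
tf_idf_matrix = {
    "s1": {"cat": 0.10, "mat": 0.30},
    "s2": {"dog": 0.05},
    "s3": {},  # e.g. a sentence in which every token was filtered out
}

scores = {
    sent: sum(table.values()) / len(table)
    for sent, table in tf_idf_matrix.items()
    if table  # assumption: silently skip empty tables
}
print(scores)
```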

###8. Find the threshold

As with any summarization algorithm, there are different ways to choose a threshold. Here we use the average sentence score.

In [None]:
def _find_average_score(sentenceValue) -> float:
    """
    Find the average score from the sentence value dictionary
    :rtype: float
    """
    sumValues = 0
    for entry in sentenceValue:
        sumValues += sentenceValue[entry]

    # Average score across all sentences of the original text
    average = (sumValues / len(sentenceValue))

    return average
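
The same average can be obtained from the standard library. This is only a sketch with hypothetical scores, to show that the threshold is the arithmetic mean of the sentence scores.

```python
import math
from statistics import mean

# Hypothetical sentence scores.
sentence_scores = {"s1": 0.2, "s2": 0.05, "s3": 0.35}

threshold = mean(sentence_scores.values())
print(threshold)
```

Other choices, such as a multiple of the mean or a percentile, would make the summary shorter or longer.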

###9. Generate the summary

Algorithm: select a sentence for the summary if its score is at least the average score.

In [None]:
def _generate_summary(sentences, sentenceValue, threshold):
    sentence_count = 0
    summary = ''

    for sentence in sentences:
        if sentence[:15] in sentenceValue and sentenceValue[sentence[:15]] >= (threshold):
            if summary == "":
                summary = sentence
            else:
                summary += " " + sentence
            sentence_count += 1

    return summary
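
The selection step can be illustrated on toy data. The sentences, scores and threshold below are hypothetical; the lookup uses the same 15-character key as the rest of the pipeline, and the chosen sentences keep their original order.

```python
sentences = ["Alpha comes first.", "Beta is second.", "Gamma closes."]

# Hypothetical scores keyed by the first 15 characters of each sentence.
scores = {"Alpha comes fir": 0.9, "Beta is second.": 0.1, "Gamma closes.": 0.6}
threshold = 0.5

selected = [s for s in sentences if scores.get(s[:15], 0.0) >= threshold]
toy_summary = " ".join(selected)
print(toy_summary)
```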

In [None]:
def run_summarization(text):
    """
    :param text: plain text of a long article
    :return: the summarized text
    """

    '''
    We already have a sentence tokenizer, so we just need 
    to run the sent_tokenize() method to create the array of sentences.
    '''
    # 1 Sentence Tokenize
    sentences = sent_tokenize(text)
    total_documents = len(sentences)
    print(type(sentences))
    print("sentences:\n")
    print(sentences)
    print("\n\n")

    # 2 Create the Frequency matrix of the words in each sentence.
    freq_matrix = _create_frequency_matrix(sentences)
    print("freq_matrix:\n")
    print(freq_matrix)
    print("\n\n")

    '''
    Term frequency (TF) is how often a word appears in a document, divided by how many words are there in a document.
    '''
    # 3 Calculate TermFrequency and generate a matrix
    tf_matrix = _create_tf_matrix(freq_matrix)
    print("tf_matrix:\n")
    print(tf_matrix)
    print("\n\n")

    # 4 creating table for documents per words
    count_doc_per_words = _create_documents_per_words(freq_matrix)
    print("count_doc_per_words:\n")
    print(count_doc_per_words)
    print("\n\n")

    '''
    Inverse document frequency (IDF) is how unique or rare a word is.
    '''
    # 5 Calculate IDF and generate a matrix
    idf_matrix = _create_idf_matrix(freq_matrix, count_doc_per_words, total_documents)
    print("idf_matrix:\n")
    print(idf_matrix)
    print("\n\n")

    # 6 Calculate TF-IDF and generate a matrix
    tf_idf_matrix = _create_tf_idf_matrix(tf_matrix, idf_matrix)
    print("tf_idf_matrix:\n")
    print(tf_idf_matrix)
    print("\n\n")

    # 7 Important Algorithm: score the sentences
    sentence_scores = _score_sentences(tf_idf_matrix)
    print("sentence_scores:\n")
    print(sentence_scores)
    print("\n\n")

    # 8 Find the threshold
    threshold = _find_average_score(sentence_scores)
    print("threshold:\n")
    print(threshold)
    print("\n\n")

    # 9 Important Algorithm: Generate the summary
    summary = _generate_summary(sentences, sentence_scores, 1 * threshold)
    return summary



In [None]:
with open(r'/content/drive/My Drive/Colab Notebooks/data.txt', 'r') as myfile:
    data = myfile.read()

In [None]:
print(data)

Apple Inc. is one of the world’s largest makers of PCs and peripheral and consumer products, such as the
iPod digital music player, the iPad tablet, the iPhone smartphone, and the ‘‘Apple Watch,’’ for sale
primarily to the business, creative, education, government, and consumer markets. It also sells operating
systems, utilities, languages, developer tools, and database software.
AAPL's iPhone directly accounted for 66% of FY 15 revenues, with over 230 million units sold. AAPL sold
over 169 million iPhones in FY 14, contributing 56% of revenues. This was AAPL's fastest-growing
segment over the past couple of years, and while we expect the rate of growth to slow as the business
becomes larger and more mature, we still see substantial opportunities related to international,
enterprise and education markets. We note the fall 2015 introductions of the next generation iPhone 6s
and the iPhone 6s Plus devices.
Released in April 2010, the iPad quickly became the best-selling tablet computer b

In [None]:
filesummary=run_summarization(data)
print(filesummary.replace("\n",""))

<class 'list'>
sentences:

['Apple Inc. is one of the world’s largest makers of PCs and peripheral and consumer products, such as the\niPod digital music player, the iPad tablet, the iPhone smartphone, and the ‘‘Apple Watch,’’ for sale\nprimarily to the business, creative, education, government, and consumer markets.', 'It also sells operating\nsystems, utilities, languages, developer tools, and database software.', "AAPL's iPhone directly accounted for 66% of FY 15 revenues, with over 230 million units sold.", 'AAPL sold\nover 169 million iPhones in FY 14, contributing 56% of revenues.', "This was AAPL's fastest-growing\nsegment over the past couple of years, and while we expect the rate of growth to slow as the business\nbecomes larger and more mature, we still see substantial opportunities related to international,\nenterprise and education markets.", 'We note the fall 2015 introductions of the next generation iPhone 6s\nand the iPhone 6s Plus devices.', 'Released in April 2010, the

In [None]:
from docx import Document  # provided by the python-docx package
# doc = Document("/content/Data.docx")

def readtxt(filename):
    doc = Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

textfromdoc = readtxt('/content/Data.docx')
print(textfromdoc)

Apple Inc. is one of the world’s largest makers of PCs and peripheral and consumer products, such as the
iPod digital music player, the iPad tablet, the iPhone smartphone, and the ‘‘Apple Watch,’’ for sale
primarily to the business, creative, education, government, and consumer markets. It also sells operating
systems, utilities, languages, developer tools, and database software.
AAPL's iPhone directly accounted for 66% of FY 15 revenues, with over 230 million units sold. AAPL sold
over 169 million iPhones in FY 14, contributing 56% of revenues. This was AAPL's fastest-growing
segment over the past couple of years, and while we expect the rate of growth to slow as the business
becomes larger and more mature, we still see substantial opportunities related to international,
enterprise and education markets. We note the fall 2015 introductions of the next generation iPhone 6s
and the iPhone 6s Plus devices.
Released in April 2010, the iPad quickly became the best-selling tablet computer b

In [None]:
filedocsummary=run_summarization(textfromdoc)
print(filedocsummary.replace("\n",""))

<class 'list'>
sentences:

['Apple Inc. is one of the world’s largest makers of PCs and peripheral and consumer products, such as the\niPod digital music player, the iPad tablet, the iPhone smartphone, and the ‘‘Apple Watch,’’ for sale\nprimarily to the business, creative, education, government, and consumer markets.', 'It also sells operating\nsystems, utilities, languages, developer tools, and database software.', "AAPL's iPhone directly accounted for 66% of FY 15 revenues, with over 230 million units sold.", 'AAPL sold\nover 169 million iPhones in FY 14, contributing 56% of revenues.', "This was AAPL's fastest-growing\nsegment over the past couple of years, and while we expect the rate of growth to slow as the business\nbecomes larger and more mature, we still see substantial opportunities related to international,\nenterprise and education markets.", 'We note the fall 2015 introductions of the next generation iPhone 6s\nand the iPhone 6s Plus devices.', 'Released in April 2010, the

In [None]:
result = run_summarization(text_str)
print(result)

<class 'list'>
sentences:

['\nThose Who Are Resilient Stay In The Game Longer\n“On the mountains of truth you can never climb in vain: either you will reach a point higher up today, or you will be training your powers so that you will be able to climb higher tomorrow.”\u200a—\u200aFriedrich Nietzsche\nChallenges and setbacks are not meant to defeat you, but promote you.', 'However, I realise after many years of defeats, it can crush your spirit and it is easier to give up than risk further setbacks and disappointments.Have you experienced this before?', 'To be honest, I don’t have the answers.', 'I can’t tell you what the right course of action is; only you will know.', 'However, it’s important not to be discouraged by failure when pursuing a goal or a dream, since failure itself means different things to different people.', 'To a person with a Fixed Mindset failure is a blow to their self-esteem, yet to a person with a Growth Mindset, it’s an opportunity to improve and find new ways 

In [None]:
sentence = sent_tokenize(text_str)
summary = sent_tokenize(result)
res = [sentence[x] for x in range(len(sentence)) if sentence[x] in summary]

In [None]:
res

['To be honest, I don’t have the answers.',
 'Same failure, yet different responses.',
 'Who is right and who is wrong?',
 'Neither.',
 'Each person has a different mindset that decides their outcome.',
 'It was at that point their biggest breakthrough came.',
 'Perhaps all those years of perseverance finally paid off.',
 'If you settle for less, you will receive less than you deserve and convince yourself you are justified to receive it.',
 'It must come from within you.',
 'Gnaw away at your problems until you solve them or find a solution.',
 'Where are you settling in your life right now?',
 'Could you be you playing for bigger stakes than you are?',
 'Vision + desire + dedication + patience + daily action leads to astonishing success.',
 'Success is a fickle and long game with highs and lows.',
 'So become intentional on what you want out of life.',
 'Commit to it.',
 'Nurture your dreams.',
 'Don’t leave your dreams to chance.']

In [None]:
sentence[7]

'Who is right and who is wrong?'

In [None]:
summary

['To be honest, I don’t have the answers.',
 'Same failure, yet different responses.',
 'Who is right and who is wrong?',
 'Neither.',
 'Each person has a different mindset that decides their outcome.',
 'It was at that point their biggest breakthrough came.',
 'Perhaps all those years of perseverance finally paid off.',
 'If you settle for less, you will receive less than you deserve and convince yourself you are justified to receive it.',
 'It must come from within you.',
 'Gnaw away at your problems until you solve them or find a solution.',
 'Where are you settling in your life right now?',
 'Could you be you playing for bigger stakes than you are?',
 'Vision + desire + dedication + patience + daily action leads to astonishing success.',
 'Success is a fickle and long game with highs and lows.',
 'So become intentional on what you want out of life.',
 'Commit to it.',
 'Nurture your dreams.',
 'Don’t leave your dreams to chance.']

In [None]:
final_summary = ""

In [None]:
for sentence in summary:
    final_summary += sentence

print(final_summary)

 To be honest, I don’t have the answers. Same failure, yet different responses. Who is right and who is wrong? Neither. Each person has a different mindset that decides their outcome. It was at that point their biggest breakthrough came. Perhaps all those years of perseverance finally paid off. If you settle for less, you will receive less than you deserve and convince yourself you are justified to receive it. It must come from within you. Gnaw away at your problems until you solve them or find a solution. Where are you settling in your life right now? Could you be you playing for bigger stakes than you are? Vision + desire + dedication + patience + daily action leads to astonishing success. Success is a fickle and long game with highs and lows. So become intentional on what you want out of life. Commit to it. Nurture your dreams. Don’t leave your dreams to chance.To be honest, I don’t have the answers.Same failure, yet different responses.Who is right and who is wrong?Neither.Each per

In [None]:
def generate_final_sorted_summary(summary):
    final_summary = ""
    for sentence in summary:
        final_summary+=sentence
    return final_summary

In [None]:
document1 = '''
We see more upside than downside risk to the upcoming iPhone product cycle and a building Services narrative. Even if
device revenue growth slows, Services and Wearables can pick
up the slack, delivering 7% revenue and 21% EPS growth
annually through CY20, supporting our SOTP-driven $232 PT.
Estimates move higher on iPhone ASPs, Services, Wearables.F3Q revenue (+17%
Y/Y) beat expectations largely on iPhone ASPs (+20%) and Services (+31%
reported,28% normalized) with wearables (+37% Y/Y) maintaining momentum
from previous quarters. Sept Q guidance also topped expectations and reflects a
similar mid teens revenue growth rate for the overall company despite the more
difficult compare from a year ago. The combination of a strong macro
environment and an increasingly engaged customer base led to double digit
growth in all regions on a sell-in basis during the June quarter. After flowing
through the stronger results - namely higher F4Q revenue, slightly offset by
higher OpEx, plus stronger LT Services and wearables growth - our FY18 EPS
increases to $11.75 (from $11.45) and FY19 EPS increases to $14.12(from $14.00).
Our $232 SOTP-driven price target is unchanged after raising our revenue
forecast modestly, shifting to CY19,and considering recent multiple compression
at Services peers (Facebook, Tencent, Alibaba).
What we learnedfrom Apple'sF3Q18 earnings:
(+) Services narrativegains momentum with references to strong pipeline of
new services. Services revenue grew 31% Y/Y,above consensus expectations of
26% Y/Y and more in-line with our 32% Y/Y forecast. However, backing out a
$236M one-time legal benefit implies 28% Y/Y normalized growth. In the June
quarter, paid subscriptions to Apple Services topped 300M users (+60% Y/Y)
and the App Store, AppleCare, Apple Music, iCloud and Apple Pay all setnew
June quarterly revenue records.For the App Store, results were even more
impressive when considering the Chinese government reportedly slowed the
process of new app approvals in the quarter (China is biggest App Store country
in the world),although approvals appear to be accelerating in recent weeks.
Importantly, management expressed enthusiasm for upcoming "new" Services
and, for the first time, provided some color on Apple's personnel and partnership
investments in original video content,noting they "aren't really ready to share the
details about it yet."
'''

In [None]:
result1 = run_summarization(document1)
print(result1)

<class 'list'>
sentences:

['\nWe see more upside than downside risk to the upcoming iPhone product cycle and a building Services narrative.', 'Even if\ndevice revenue growth slows, Services and Wearables can pick\nup the slack, delivering 7% revenue and 21% EPS growth\nannually through CY20, supporting our SOTP-driven $232 PT.', 'Estimates move higher on iPhone ASPs, Services, Wearables.F3Q revenue (+17%\nY/Y) beat expectations largely on iPhone ASPs (+20%) and Services (+31%\nreported,28% normalized) with wearables (+37% Y/Y) maintaining momentum\nfrom previous quarters.', 'Sept Q guidance also topped expectations and reflects a\nsimilar mid teens revenue growth rate for the overall company despite the more\ndifficult compare from a year ago.', 'The combination of a strong macro\nenvironment and an increasingly engaged customer base led to double digit\ngrowth in all regions on a sell-in basis during the June quarter.', 'After flowing\nthrough the stronger results - namely higher F4Q

In [None]:
sentence1 = sent_tokenize(document1)
summary1 = sent_tokenize(result1)
res1 = [sentence1[x] for x in range(len(sentence1)) if sentence1[x] in summary1]

In [None]:
res1  # already in original order, because _generate_summary walks the sentences in order

['\nWe see more upside than downside risk to the upcoming iPhone product cycle and a building Services narrative.',
 'Sept Q guidance also topped expectations and reflects a\nsimilar mid teens revenue growth rate for the overall company despite the more\ndifficult compare from a year ago.',
 'The combination of a strong macro\nenvironment and an increasingly engaged customer base led to double digit\ngrowth in all regions on a sell-in basis during the June quarter.',
 "What we learnedfrom Apple'sF3Q18 earnings:\n(+) Services narrativegains momentum with references to strong pipeline of\nnew services.",
 'Services revenue grew 31% Y/Y,above consensus expectations of\n26% Y/Y and more in-line with our 32% Y/Y forecast.',
 'However, backing out a\n$236M one-time legal benefit implies 28% Y/Y normalized growth.']

In [None]:
final_summary1 = generate_final_sorted_summary(res1)

In [None]:
final_summary1

"\nWe see more upside than downside risk to the upcoming iPhone product cycle and a building Services narrative.Sept Q guidance also topped expectations and reflects a\nsimilar mid teens revenue growth rate for the overall company despite the more\ndifficult compare from a year ago.The combination of a strong macro\nenvironment and an increasingly engaged customer base led to double digit\ngrowth in all regions on a sell-in basis during the June quarter.What we learnedfrom Apple'sF3Q18 earnings:\n(+) Services narrativegains momentum with references to strong pipeline of\nnew services.Services revenue grew 31% Y/Y,above consensus expectations of\n26% Y/Y and more in-line with our 32% Y/Y forecast.However, backing out a\n$236M one-time legal benefit implies 28% Y/Y normalized growth."

In [None]:
sentence = ["abc","def","ghi","jkl","mno"]
summary = ["ghi","abc","mno"]

res = [sentence[x] for x in range(len(sentence)) if sentence[x] in summary]

In [None]:
res

['abc', 'ghi', 'mno']
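
The check above keeps the order of `sentence` regardless of the order of `summary`. A slightly more idiomatic sketch of the same idea iterates over the items directly and uses a set for the membership test. The name `summary_set` is illustrative.

```python
sentence = ["abc", "def", "ghi", "jkl", "mno"]
summary = ["ghi", "abc", "mno"]

# A set makes each membership test constant time on average.
summary_set = set(summary)
res = [s for s in sentence if s in summary_set]
print(res)
```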