<a href="https://colab.research.google.com/github/smaranjitghose/PyArticleSummary/blob/master/Article_Summarization_101.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [24]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## Importing libraries

In [0]:
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize, sent_tokenize
import bs4 as BeautifulSoup
import urllib.request  

## Collecting the data


In [52]:
#fetching the content from the URL
fetched_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/20th_century')

article_read = fetched_data.read()

#parsing the URL content and storing in a variable
article_parsed = BeautifulSoup.BeautifulSoup(article_read,'html.parser')

#returning <p> tags
paragraphs = article_parsed.find_all('p')

article_content = ''

#looping through the paragraphs and adding them to the variable
for p in paragraphs:  
    article_content += p.text

print(article_content)

The 20th (twentieth) century was a century that began on
January 1, 1901[1] and ended on December 31, 2000.[2] It was the tenth and final century of the 2nd millennium. Strictly speaking, it is distinct from the century known as the 1900s which began on January 1, 1900, and ended on December 31, 1999.
The 20th century was dominated by a chain of events that heralded significant changes in world history as to redefine the era: flu pandemic, World War I and World War II, nuclear power and space exploration, nationalism and decolonization, the Cold War and post-Cold War conflicts; intergovernmental organizations and cultural homogenization through developments in emerging transportation and communications technology; poverty reduction and world population growth, awareness of environmental degradation, ecological extinction;[3][4] and the birth of the Digital Revolution, enabled by the wide adoption of MOS transistors and integrated circuits. It saw great advances in power generation, com
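The fetched text still carries Wikipedia's bracketed reference markers such as `[1]` and `[2]`, which later end up as tokens and at the start of sentences. A minimal sketch of one way to strip them before further processing, assuming the markers are always purely numeric; the helper name `strip_citation_markers` is hypothetical and not part of the original notebook:

```python
import re

def strip_citation_markers(text):
    """Remove bracketed numeric reference markers such as [1] or [15]."""
    # assumption: markers consist only of digits inside square brackets
    return re.sub(r"\[\d+\]", "", text)

sample = "January 1, 1901[1] and ended on December 31, 2000.[2] It was the tenth century."
print(strip_citation_markers(sample))
```

In the notebook this could be applied as `article_content = strip_citation_markers(article_content)` right after the paragraph loop, though doing so would change the frequencies and scores shown in the outputs that follow.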

In [53]:
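#saving the article text; the with-block closes the file once the write is done
with open('article_content.txt', 'w', encoding='utf-8') as file:
    characters_written = file.write(article_content)
characters_written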

20742

## Processing the data


In [54]:
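#building the set of English stop words
stop_words = set(stopwords.words("english"))

#splitting the article into word and punctuation tokens
words = word_tokenize(article_content)

#reducing words to their root form
stem = PorterStemmer()

#creating dictionary for the word frequency table
#note: the stop-word check runs on the stem, so stop words whose stem differs
#from the word itself ('was' -> 'wa') and punctuation tokens stay in the table
frequency_table = dict()
for wd in words:
    wd = stem.stem(wd)
    if wd in stop_words:
        continue
    if wd in frequency_table:
        frequency_table[wd] += 1
    else:
        frequency_table[wd] = 1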

for items in frequency_table.items():
    print(items)

('20th', 20)
('(', 10)
('twentieth', 1)
(')', 10)
('centuri', 45)
('wa', 36)
('began', 5)
('januari', 2)
('1', 4)
(',', 294)
('1901', 2)
('[', 40)
(']', 40)
('end', 15)
('decemb', 2)
('31', 2)
('2000', 1)
('.', 124)
('2', 2)
('It', 6)
('tenth', 1)
('final', 2)
('2nd', 1)
('millennium', 1)
('strictli', 1)
('speak', 1)
('distinct', 1)
('known', 2)
('1900', 2)
('1999', 2)
('domin', 3)
('chain', 1)
('event', 1)
('herald', 2)
('signific', 4)
('chang', 7)
('world', 49)
('histori', 6)
('redefin', 1)
('era', 1)
(':', 4)
('flu', 1)
('pandem', 1)
('war', 39)
('I', 4)
('II', 8)
('nuclear', 10)
('power', 12)
('space', 4)
('explor', 3)
('nation', 18)
('decolon', 3)
('cold', 5)
('post-cold', 1)
('conflict', 8)
(';', 11)
('intergovernment', 1)
('organ', 3)
('cultur', 7)
('homogen', 2)
('develop', 17)
('emerg', 3)
('transport', 4)
('commun', 10)
('technolog', 20)
('poverti', 2)
('reduct', 1)
('popul', 11)
('growth', 2)
('awar', 2)
('environment', 5)
('degrad', 1)
('ecolog', 2)
('extinct', 2)
('3', 1)
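The table above still contains punctuation (`','`, `'['`) and stemmed stop words (`'wa'`). A possible variant, sketched here under assumptions, filters on the lowercased token before stemming and skips tokens that contain no letter or digit. The function name `build_frequency_table` and its parameters are hypothetical; the default `stem` is the identity so the sketch runs without NLTK, and in the notebook one would presumably pass `PorterStemmer().stem` instead:

```python
from collections import Counter

def build_frequency_table(tokens, stop_words, stem=lambda w: w):
    """Count stems of tokens that are neither stop words nor pure punctuation."""
    table = Counter()
    for token in tokens:
        lowered = token.lower()
        # skip stop words and tokens without any letter or digit
        if lowered in stop_words or not any(ch.isalnum() for ch in lowered):
            continue
        # stem only after filtering, so stop words cannot slip through as stems
        table[stem(lowered)] += 1
    return dict(table)

tokens = ["It", "was", "the", "tenth", "century", ",", "a", "century", "of", "war", "."]
stop_words = {"it", "was", "the", "a", "of"}
print(build_frequency_table(tokens, stop_words))
```

This is not what the notebook's cell does, so the counts it produces would differ from the output listed above.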


## Tokenizing into sentences

In [55]:
from nltk.tokenize import word_tokenize, sent_tokenize

sentences = sent_tokenize(article_content)
for sentence in sentences:
    print(sentence)

The 20th (twentieth) century was a century that began on
January 1, 1901[1] and ended on December 31, 2000.
[2] It was the tenth and final century of the 2nd millennium.
Strictly speaking, it is distinct from the century known as the 1900s which began on January 1, 1900, and ended on December 31, 1999.
The 20th century was dominated by a chain of events that heralded significant changes in world history as to redefine the era: flu pandemic, World War I and World War II, nuclear power and space exploration, nationalism and decolonization, the Cold War and post-Cold War conflicts; intergovernmental organizations and cultural homogenization through developments in emerging transportation and communications technology; poverty reduction and world population growth, awareness of environmental degradation, ecological extinction;[3][4] and the birth of the Digital Revolution, enabled by the wide adoption of MOS transistors and integrated circuits.
It saw great advances in power generation, co

## Finding the weighted frequencies of the sentences

In [64]:
sentence_weight = dict()
for sentence in sentences:
    sentence_wordcount = (len(word_tokenize(sentence)))
    sentence_wordcount_without_stop_words = 0
    for word_weight in frequency_table:
        if word_weight in sentence.lower():
            sentence_wordcount_without_stop_words += 1
            if sentence[:7] in sentence_weight:
                sentence_weight[sentence[:7]] += frequency_table[word_weight]
            else:
                sentence_weight[sentence[:7]] = frequency_table[word_weight]

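    #averaging the accumulated weight over the number of matched words;
    #the check avoids an error for a sentence that matched no table entry
    if sentence_wordcount_without_stop_words:
        sentence_weight[sentence[:7]] = sentence_weight[sentence[:7]] / sentence_wordcount_without_stop_words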
      
for weights in sentence_weight.items():
    print(weights)

('The 20t', 15.034693134219632)
('[2] It ', 27.444444444444443)
('Strictl', 28.625)
('It saw ', 19.033333333333335)
('The ave', 13.456521739130435)
('[5]\nThe', 24.7)
('The Mar', 10.297297297297296)
('Through', 18.875)
('The dis', 16.44736842105263)
('It took', 19.941176470588236)
('[8][9][', 18.23076923076923)
('Global ', 10.692307692307692)
('[11] Tr', 16.88235294117647)
('Up unti', 16.0)
('[12]\nTh', 21.88888888888889)
('Nationa', 17.757575757575758)
('The cen', 27.45)
('Terms l', 27.863636363636363)
('Scienti', 22.865384615384617)
('It was ', 21.5)
('Horses ', 19.72)
('These d', 18.25925925925926)
('Humans ', 38.583333333333336)
('Mass me', 19.29032258064516)
('Advance', 11.5)
('Rapid t', 29.36842105263158)
('World W', 27.304347826086957)
('However', 46.63636363636363)
('For the', 23.285714285714285)
('The las', 23.636363636363637)
('[13]\nTh', 19.975)
('Technol', 22.77777777777778)
('After m', 15.714285714285714)
('In addi', 17.814814814814813)
('The Aus', 23.5)
('The Rus', 15.4)
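The cell above keys each score by the first seven characters of a sentence, so sentences that share a prefix (for example several starting with `The 20t`) land in the same entry, and it matches table words as substrings of the sentence. A minimal sketch of an alternative, assuming scores are keyed by sentence index and matching is done token by token; `score_sentences`, `tokenize`, and `stem` are hypothetical names, with stdlib defaults so the sketch runs without NLTK:

```python
def score_sentences(sentences, frequency_table, tokenize=str.split, stem=lambda w: w):
    """Return {sentence index: average table frequency of its matched tokens}."""
    scores = {}
    for index, sentence in enumerate(sentences):
        stems = [stem(token.lower()) for token in tokenize(sentence)]
        matched = [s for s in stems if s in frequency_table]
        # sentences without any known word get no score at all
        if matched:
            scores[index] = sum(frequency_table[s] for s in matched) / len(matched)
    return scores

sentences = ["war changed the world", "the world", "nothing here"]
frequency_table = {"war": 3, "world": 5, "changed": 1}
print(score_sentences(sentences, frequency_table))
```

In the notebook, `word_tokenize` and `PorterStemmer().stem` could be passed for `tokenize` and `stem`. Unlike the original cell, this variant counts every occurrence of a token in the sentence, so its numbers are not comparable with the weights printed above.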


## Calculating the threshold of the sentences

In [66]:
# Summing the scores of all sentences
sum_values = 0
for entry in sentence_weight:
    sum_values += sentence_weight[entry]
# Getting the average sentence score from the source text
average_score = sum_values / len(sentence_weight)
print(average_score)

22.745546270465134


## Obtain Summary


In [67]:
sentence_counter = 0
article_summary = ''

for sentence in sentences:
    if sentence[:7] in sentence_weight and sentence_weight[sentence[:7]] >= (1.5*average_score):
        article_summary += " " + sentence
        sentence_counter += 1
print(article_summary)

 Humans explored space for the first time, taking their first footsteps on the Moon. However, these same wars resulted in the destruction of the imperial system. The victorious Bolsheviks then established the Soviet Union, the world's first communist state. At the beginning of the period, the British Empire was the world's most powerful nation,[15] having acted as the world's policeman for the past century. In total, World War II left some 60 million people dead. At the beginning of the century, strong discrimination based on race and sex was significant in general society. During the century, the social taboo of sexism fell. Communications and information technology, transportation technology, and medical advances had radically altered daily lives. Since the US was in a dominant position, a major part of the process was Americanization. Terrorism, dictatorship, and the spread of nuclear weapons were pressing global issues. Millions were infected with HIV, the virus which causes AIDS. 
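The selection rule used here (keep a sentence when its score is at least 1.5 times the average score) can be wrapped in a small function. This is only a sketch under assumptions: `select_summary` is a hypothetical helper, and it expects scores keyed by sentence index rather than by the seven-character prefix used in the notebook:

```python
def select_summary(sentences, scores, factor=1.5):
    """Join the sentences whose score reaches factor times the average score."""
    if not scores:
        return ""
    average = sum(scores.values()) / len(scores)
    threshold = factor * average
    # unscored sentences are treated as scoring 0 and are never selected
    chosen = [s for i, s in enumerate(sentences) if scores.get(i, 0) >= threshold]
    return " ".join(chosen)

sentences = ["A.", "B.", "C."]
scores = {0: 1.0, 1: 1.0, 2: 7.0}
print(select_summary(sentences, scores))
```

Lowering `factor` keeps more sentences and lengthens the summary; raising it does the opposite.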

## Saving it

In [68]:
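#saving the summary; the with-block closes the file once the write is done
with open('article_summary.txt', 'w', encoding='utf-8') as file:
    characters_written = file.write(article_summary)
characters_written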

1184
