## Extractive-based-Text-Summarization
一个简单的抽取式摘要，首先抓取维基百科网页内容，在进行统计分析，进而进行抽取式摘要

In [1]:
import bs4 as BeautifulSoup
import urllib.request

###  step 0 获取并处理网页内容，得到原始文本数据article_content
urllib用于抓取网页，BeautifulSoup可以将输入文本转换为Unicode字符，将输出文本转换为UTF-8字符。

In [2]:
fetched_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/20th_century')
article_read = fetched_data.read()
article_parsed = BeautifulSoup.BeautifulSoup(article_read,'html.parser')

In [47]:
# find_all（'p'）用于获取所有内容段落，属性text返回文本内容。
paragraphs = article_parsed.find_all('p')
article = ''
for p in paragraphs:  
    article += p.text

In [48]:
print(article)

The 20th (twentieth) century was a century that began on
January 1, 1901[1] and ended on December 31, 2000.[2] It was the tenth and final century of the 2nd millennium.
The 20th century was dominated by a chain of events that heralded significant changes in world history as to redefine the era: flu pandemic, World War I and World War II, nuclear power and space exploration, nationalism and decolonization, the Cold War and post-Cold War conflicts; intergovernmental organizations and cultural homogenization through developments in emerging transportation and communications technology; poverty reduction and world population growth, awareness of environmental degradation, ecological extinction;[3][4] and the birth of the Digital Revolution, enabled by the wide adoption of MOS transistors and integrated circuits. It saw great advances in power generation, communication and medical technology that by the late 1980s allowed for near-instantaneous worldwide computer communication and genetic m

###  Step 1 计算词频表
这里用到的这个PorterStemmer工具实际上比较粗糙，例如was->wa，nltk中还有其他词还原的方法SnowballStemmer，LancasterStemmer等。

In [102]:
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer 
from nltk.tokenize import word_tokenize, sent_tokenize

In [103]:
def _create_dictionary_table(text_string) -> dict:
    # nltk停用词
    stop_words = set(stopwords.words("english"))    
    # nltk把单词还原成词根形式的方法
    stem = PorterStemmer()
    # tokenize
    words = word_tokenize(text_string)
    frequency_table = dict()
    for wd in words:
        wd = stem.stem(wd)
        if wd not in stop_words:
            if wd in frequency_table:
                frequency_table[wd] += 1
            else:
                frequency_table[wd] = 1
    return frequency_table

###  Step 2 文章拆分为句子

In [104]:
def _sent_tokenize(article):
    sentences = sent_tokenize(article)
    return sentences

###  Step 3 计算句子加权频率

In [115]:
def _calculate_sentence_scores(sentences, frequency_table) -> dict:
    sentence_weight = dict()
    stem = PorterStemmer()
    for sentence in sentences:
        sentence_stem = ' '.join([stem.stem(wd) for wd in sentence.split()]) 
        sentence_wordcount_without_stop_words = 0
        for word_weight in frequency_table:
            if word_weight in sentence_stem.lower():
                sentence_wordcount_without_stop_words += 1
                # 前句子7个词作为dict句子索引
                if sentence[:7] in sentence_weight:
                    sentence_weight[sentence[:7]] += frequency_table[word_weight]
                else:
                    sentence_weight[sentence[:7]] = frequency_table[word_weight]
            # 标准化句子得分
        sentence_weight[sentence[:7]] = sentence_weight[sentence[:7]]/sentence_wordcount_without_stop_words
    return sentence_weight
                    

### Step 4 计算句子平均得分
句子平均得分作为判断句子可否作为摘要的阈值

In [106]:
def _calculate_average_score(sentence_weight) -> int:
    sum_values = 0
    for entry in sentence_weight:
        sum_values += sentence_weight[entry]
    average_score = (sum_values / len(sentence_weight)) 
    return average_score

### Step 5 抽取摘要

In [107]:
def _get_article_summary(sentences, sentence_weight, threshold):
    sentence_counter = 0
    article_summary = ''
    for sentence in sentences:
        if sentence[:7] in sentence_weight and sentence_weight[sentence[:7]] >= (threshold):
            article_summary += " " + sentence
            sentence_counter += 1
    return article_summary, sentence_counter

### main

In [117]:
# step 1-5
frequency_table = _create_dictionary_table(article)
sentences = _sent_tokenize(article)
sentence_scores = _calculate_sentence_scores(sentences, frequency_table)
threshold = _calculate_average_score(sentence_scores)
article_summary, sentence_counter = _get_article_summary(sentences, sentence_scores, 1.5 * threshold)


In [118]:
print("原始句子数量" ,len(sentences))
print("摘要句子数量" ,(sentence_counter))

原始句子数量 124
摘要句子数量 12


In [119]:
print("摘要：\n",article_summary)

摘要：
  Humans explored space for the first time, taking their first footsteps on the Moon. However, these same wars resulted in the destruction of the imperial system. The victorious Bolsheviks then established the Soviet Union, the world's first communist state. At the beginning of the period, the British Empire was the world's most powerful nation,[15] having acted as the world's policeman for the past century. In total, World War II left some 60 million people dead. At the beginning of the century, strong discrimination based on race and sex was significant in general society. During the century, the social taboo of sexism fell. Since the US was in a dominant position, a major part of the process was Americanization. Terrorism, dictatorship, and the spread of nuclear weapons were pressing global issues. Millions were infected with HIV, the virus which causes AIDS. This includes deaths caused by wars, genocide, politicide and mass murders. Later in the 20th century, the development of