In [3]:
import nltk  # sumy's Tokenizer relies on NLTK's sentence tokenizer data being installed
from sumy.parsers.html import HtmlParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.summarizers.lsa import LsaSummarizer

In [4]:
num_sentences_in_summary=2
url="https://en.wikipedia.org/wiki/Automatic_summarization"
parser=HtmlParser.from_url(url, Tokenizer("english"))

In [5]:
summarizer_list=("TextRankSummarizer:", "LexRankSummarizer:",
                 "LuhnSummarizer:", "LsaSummarizer:")

summarizers=[TextRankSummarizer(), LexRankSummarizer(),
             LuhnSummarizer(), LsaSummarizer()]

In [6]:
# Run each summarizer on the same parsed document and print its top sentences.
for name, summarizer in zip(summarizer_list, summarizers):
    print(name)
    for sentence in summarizer(parser.document, num_sentences_in_summary):
        print(sentence)

    print("-"*30)

TextRankSummarizer:
For text, extraction is analogous to the process of skimming, where the summary (if available), headings and subheadings, figures, the first and last paragraphs of a section, and optionally the first and last sentences in a paragraph are read before one chooses to read the entire document in detail.
Once the graph is constructed, it is used to form a stochastic matrix, combined with a damping factor (as in the "random surfer model"), and the ranking over vertices is obtained by finding the eigenvector corresponding to eigenvalue 1 (i.e., the stationary distribution of the random walk on the graph).
------------------------------
LexRankSummarizer:
An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document.
The main difficulty in supervised extractive summarization is that the known summaries must be manually created by extracting sentences so the sentences in an original training documen
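
The TextRank output above describes the ranking step: the sentence graph is turned into a stochastic matrix, combined with a damping factor, and ranked by the stationary distribution of the random walk. Below is a minimal pure-Python sketch of that step. The word-overlap similarity, the function names, and the default damping of 0.85 are illustrative assumptions, not sumy's actual implementation.

```python
import math
import re


def sentence_similarity(a, b):
    """Word-overlap similarity, normalized by log sentence lengths (an assumed measure)."""
    words_a = set(re.findall(r"\w+", a.lower()))
    words_b = set(re.findall(r"\w+", b.lower()))
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0  # avoids log(1) + log(1) == 0 in the denominator
    return len(words_a & words_b) / (math.log(len(words_a)) + math.log(len(words_b)))


def textrank_scores(sentences, damping=0.85, iterations=50):
    """Approximate the stationary distribution of a damped random walk by power iteration."""
    n = len(sentences)
    if n == 0:
        return []
    # weights[i][j] is the edge weight between sentence i and sentence j.
    weights = [[sentence_similarity(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # Sentences with no edges spread their score uniformly over all sentences.
        dangling = sum(scores[i] for i in range(n) if out_sums[i] == 0) / n
        new_scores = []
        for j in range(n):
            rank = sum(scores[i] * weights[i][j] / out_sums[i]
                       for i in range(n) if out_sums[i] > 0)
            new_scores.append((1 - damping) / n + damping * (rank + dangling))
        scores = new_scores
    return scores


def top_sentences(sentences, count=2):
    """Return the `count` highest-scoring sentences in their original order."""
    scores = textrank_scores(sentences)
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:count]
    return [sentences[i] for i in sorted(best)]
```

Calling `top_sentences(list_of_sentences, 2)` mirrors `num_sentences_in_summary=2` above. A sentence that shares no words with the others receives only the baseline score and is ranked last.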

In [1]:
from summa import summarizer
from summa import keywords

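# Read the input file and close it promptly; the UTF-8 encoding is an assumption about the file.
with open("nlphistory.txt", encoding="utf-8") as f:
    text=f.read()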

print("Summary:\n", summarizer.summarize(text, ratio=0.1))

Summary:
 However, part-of-speech tagging introduced the use of hidden Markov models to natural language processing, and increasingly, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to the features making up the input data.
In the 2010s, representation learning and deep neural network-style machine learning methods became widespread in natural language processing, due in part to a flurry of results showing that such techniques[4][5] can achieve state-of-the-art results in many natural language tasks, for example in language modeling,[6] parsing,[7][8] and many others.


In [1]:
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

In [2]:
device="cpu"

model=T5ForConditionalGeneration.from_pretrained("t5-small").to(device)
tokenizer=T5Tokenizer.from_pretrained("t5-small")

In [3]:
text ="""
don’t build your own MT system if you don’t have to. It is more practical to make use of the translation APIs. When we use such APIs, it is important to pay closer attention to pricing policies. It would perhaps make sense to store the translations of frequently used text (called a translation memory or a translation cache). 

If you’re working with a entirely new language, or say a new domain where existing translation APIs do poorly, it would make sense to start with a domain knowledge based rule based translation system addressing the restricted scenario you deal with. Another approach to address such data scarce scenarios is to augment your training data by doing “back translation”. Let us say we want to translate from English to Navajo language. English is a popular language for MT, but Navajo is not. We do have a few examples of English-Navajo translation. In such a case, one can build a first MT model between Navajo-English, and use this system to translate a few Navajo sentences into English. At this point, these machine translated Navajo-English pairs can be added as additional training data to English-Navajo MT system. This results in a translation system with more examples to train on (even though some of these examples are synthetic). In general, though, if accuracy of translation is paramount, it would perhaps make sense to form a hybrid MT system which combines the neural models with rules and some form of post-processing, though. 

"""

In [4]:
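# Join lines with a space so words on either side of a line break do not run together.
preprocessed_text=text.strip().replace("\n", " ")
# T5 expects a task prefix; "summarize: " selects the summarization task.
t5_prepared_text="summarize: "+preprocessed_text

# max_length only takes effect when truncation is enabled;
# 512 tokens is the usual input limit for t5-small.
tokenized_text=tokenizer.encode(t5_prepared_text,
                                return_tensors="pt",
                                max_length=512,
                                truncation=True).to(device)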
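
The preprocessing in the cell above only handles newlines. A slightly more robust variant collapses every run of whitespace (newlines, tabs, repeated spaces) into a single space before adding the task prefix. The helper name `prepare_t5_input` is hypothetical and not part of the transformers library.

```python
def prepare_t5_input(text, prefix="summarize: "):
    """Collapse all whitespace runs to single spaces and prepend the T5 task prefix."""
    return prefix + " ".join(text.split())
```

The returned string can be passed to `tokenizer.encode` in place of the manually prepared text.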

In [7]:
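# Beam search with 4 beams; no_repeat_ngram_size=2 blocks repeated bigrams.
# min_length=100 forces at least 100 generated tokens, which for a short input
# can push the model to add content the source does not support.
summary_ids=model.generate(tokenized_text,
                           num_beams=4,
                           no_repeat_ngram_size=2,
                           min_length=100,
                           max_length=200,
                           early_stopping=True)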

In [8]:
output=tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("Summarized text: \n", output)

Summarized text: 
 it is more practical to make use of the translation APIs. if you’re working with a completely new language, it would make sense to store translations of frequently used text (called translation memory or translation cache) the MT system combines the neural models with rules and some form of post-processing, though, as well as the syntax of tumblr and Navajo. the system is based on the language used by the translators.
