
# Automatic summarization

(adapted from https://towardsdatascience.com/understand-text-summarization-and-create-your-own-summarizer-in-python-b26a9f09fc70)

Summarization can be defined as the *task of producing a concise and fluent summary while preserving the key information and the overall meaning.*

In this demo we use a technique known as **TextRank**. TextRank does not depend on any prior training data and can work with any arbitrary piece of text. It is a general-purpose, **graph**-based ranking algorithm for NLP.
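To make the graph-based ranking idea concrete, here is a minimal sketch of PageRank-style power iteration over a small weighted sentence graph. The function name `pagerank_scores`, the toy similarity values, the damping factor of 0.85 and the fixed number of iterations are illustrative assumptions; the notebook itself relies on `networkx.pagerank` for this step.

```python
def pagerank_scores(weights, damping=0.85, iterations=50):
    """Rank the nodes of a weighted graph by simple power iteration.

    `weights[i][j]` is a hypothetical similarity between sentences i and j.
    Nodes without any edge only keep the teleport share in this sketch.
    """
    n = len(weights)
    # Total outgoing weight of each node, used to normalize edge weights
    out_weight = [sum(row) for row in weights]
    scores = [1.0 / n] * n

    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Score flowing into node i from every connected node j
            incoming = sum(
                scores[j] * weights[j][i] / out_weight[j]
                for j in range(n)
                if out_weight[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores


# Toy similarity matrix for three sentences (made-up values)
toy_weights = [
    [0.0, 0.8, 0.1],
    [0.8, 0.0, 0.3],
    [0.1, 0.3, 0.0],
]
toy_scores = pagerank_scores(toy_weights)
best = max(range(len(toy_scores)), key=toy_scores.__getitem__)
print(toy_scores, best)
```

In this toy graph the second sentence (index 1) has the strongest links to the others, so it receives the highest score; a summarizer would pick it first.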

## Importing dependencies

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize
from nltk.cluster.util import cosine_distance
import numpy as np
import networkx as nx

## Test data

English:

In [None]:
!wget -O msft.txt https://raw.githubusercontent.com/edubey/text-summarizer/master/msft.txt
!cat msft.txt

Portuguese:

In [None]:

import nltk
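nltk.download('machado')
nltk.download('punkt')      # sentence tokenizer models used by sent_tokenize
nltk.download('stopwords')  # stopword lists used later by generate_summary
# Note: newer NLTK releases may ask for the 'punkt_tab' resource instead of 'punkt'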

from nltk.corpus import machado
print(machado.readme()[:1000])

In [None]:
dom_casmurro = machado.raw('romance/marm08.txt')
dom_casmurro[:100]

## Preprocessing

In [None]:
# Read the file line by line, closing the file handle afterwards
with open('msft.txt', "r") as f:
    msft_pp = f.readlines()

## Similarity matrix
Each sentence is represented as a bag-of-words (BoW) count vector, and the similarity between two sentences is one minus the cosine distance between their vectors (that is, their cosine similarity):

In [None]:
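import re

def sentence_similarity(sent1, sent2, stopwords=None):
    if stopwords is None:
        stopwords = []

    # Sentences arrive as strings, so split them into lowercase word tokens
    # (iterating over a string directly would compare characters, not words)
    sent1 = re.findall(r"\w+", sent1.lower())
    sent2 = re.findall(r"\w+", sent2.lower())

    all_words = list(set(sent1 + sent2))

    vector1 = [0] * len(all_words)
    vector2 = [0] * len(all_words)

    # build the count vector for the first sentence
    for w in sent1:
        if w in stopwords:
            continue
        vector1[all_words.index(w)] += 1

    # build the count vector for the second sentence
    for w in sent2:
        if w in stopwords:
            continue
        vector2[all_words.index(w)] += 1

    # a sentence made only of stopwords yields a zero vector; avoid dividing by zero
    if not any(vector1) or not any(vector2):
        return 0.0

    return 1 - cosine_distance(vector1, vector2)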
    
def build_similarity_matrix(sentences, stop_words):
  # Create an empty similarity matrix
  similarity_matrix = np.zeros((len(sentences), len(sentences)))

  for idx1 in range(len(sentences)):
    for idx2 in range(len(sentences)):
      if idx1 == idx2:  # skip comparing a sentence with itself
        continue
      # must sit inside the inner loop so that every (idx1, idx2) pair is filled
      similarity_matrix[idx1][idx2] = sentence_similarity(sentences[idx1], sentences[idx2], stop_words)
  return similarity_matrix
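As a quick check of the similarity measure described in this section, the sketch below computes the cosine similarity of bag-of-words count vectors by hand, using only the standard library. The helper name `cosine_similarity` and the three example sentences are hypothetical; they are assumed to be already tokenized and free of stopwords.

```python
import math
from collections import Counter


def cosine_similarity(words1, words2):
    """Cosine similarity between two bag-of-words count vectors."""
    counts1, counts2 = Counter(words1), Counter(words2)
    # Dot product over the words that the two sentences share
    dot = sum(counts1[w] * counts2[w] for w in counts1.keys() & counts2.keys())
    norm1 = math.sqrt(sum(c * c for c in counts1.values()))
    norm2 = math.sqrt(sum(c * c for c in counts2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty sentence is treated as similar to nothing
    return dot / (norm1 * norm2)


# Hypothetical sentences, already tokenized and stopword-free
s1 = ["microsoft", "launches", "ai", "program"]
s2 = ["microsoft", "expands", "ai", "skills", "program"]
s3 = ["weather", "sunny", "today"]

sim_12 = cosine_similarity(s1, s2)  # 3 shared words: 3 / (sqrt(4) * sqrt(5))
sim_13 = cosine_similarity(s1, s3)  # no shared words
print(round(sim_12, 3), sim_13)
```

Sentences that share vocabulary get a value close to 1, while sentences with no words in common get 0; these values are the edge weights of the sentence graph.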
  

## TextRank

In [None]:
def generate_summary(sentences, language='english', top_n=5):
  stop_words = stopwords.words(language)

  # Step 1 - Build the similarity matrix across sentences
  similarity_matrix = build_similarity_matrix(sentences, stop_words)

  # Step 2 - Rank the sentences by running PageRank on the similarity graph
  similarity_graph = nx.from_numpy_array(similarity_matrix)
  scores = nx.pagerank(similarity_graph)

  # Step 3 - Sort by score (highest first) and keep the top sentences
  ranked_sentences = sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)
  print("Ranked sentences (score, sentence):", ranked_sentences)

  # Sentences from sent_tokenize already end with punctuation, so join with a space;
  # slicing also avoids an IndexError when top_n exceeds the number of sentences
  return " ".join(s for _, s in ranked_sentences[:top_n])
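One possible variation on the final selection step, sketched here under assumptions: after picking the best-scored sentences, put them back in their original document order so the summary reads more naturally. The helper `top_sentences_in_order` and the demo data are hypothetical; `scores` is assumed to be a mapping from sentence index to score, which is the shape `nx.pagerank` returns.

```python
def top_sentences_in_order(sentences, scores, top_n=3):
    """Pick the `top_n` best-scored sentences, then restore document order."""
    # Sentence indexes sorted by score, best first
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Keep the best ones and sort their indexes back into document order
    chosen = sorted(ranked[:top_n])
    return [sentences[i] for i in chosen]


# Made-up sentences and scores, for illustration only
demo_sentences = ["First point.", "Minor detail.", "Second point.", "Aside."]
demo_scores = {0: 0.30, 1: 0.10, 2: 0.45, 3: 0.15}
demo_summary = " ".join(top_sentences_in_order(demo_sentences, demo_scores, top_n=2))
print(demo_summary)
```

Here the highest-scored sentence is the third one, but the summary still starts with "First point." because the selected sentences are emitted in the order they appear in the text.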

## Testing!

In [None]:
language  = 'english'
with open('msft.txt', "r") as f:
    text = f.read()
sentences = sent_tokenize(text, language=language)

print('TEXT:')
display(text)


summary = generate_summary(sentences, language=language, top_n=3)
print('SUMMARY:')
display(summary)

In [None]:
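language  = 'portuguese'
text      = dom_casmurro
# Use the Portuguese sentence tokenizer for the Portuguese text
sentences = sent_tokenize(text[50:5000], language=language)

print('TEXT:')
display(text[50:5000])


# Pass the Portuguese stopword list instead of the hard-coded English one
summary = generate_summary(sentences, language=language, top_n=5)
print('SUMMARY:')
display(summary)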

# Sumy

[Sumy](https://pypi.org/project/sumy/) is a Python library that implements several extractive summarization methods, for example:

* Luhn - heuristic method
* Edmundson - heuristic method with previous statistical research
* Latent Semantic Analysis (LSA)
* LexRank - unsupervised approach inspired by the PageRank and HITS algorithms
* TextRank - unsupervised approach, also using the PageRank algorithm
* SumBasic - method that is often used as a baseline in the literature
* KL-Sum - method that greedily adds sentences to a summary so long as it decreases the KL divergence
* Reduction - graph-based summarization

In [None]:
!pip install sumy

## Testing

In [None]:
from sumy.parsers.html import HtmlParser
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer 
from sumy.summarizers.lex_rank import LexRankSummarizer

from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words

In [None]:

SENTENCES_COUNT = 3
language = 'english'
text      = open('msft.txt', "r").read()

parser = PlaintextParser.from_string(text, Tokenizer(language))
stemmer = Stemmer(language)

# summarizer = LsaSummarizer(stemmer)
summarizer = LexRankSummarizer(stemmer)

summarizer.stop_words = get_stop_words(language)

# Convert each selected Sentence object to plain text
summary = [str(sentence) for sentence in summarizer(parser.document, SENTENCES_COUNT)]

' '.join(summary)

In [None]:
SENTENCES_COUNT = 3
language = 'portuguese'

url = "https://globoesporte.globo.com/motor/formula-1/noticia/temporada-2020-pode-ser-a-mais-cara-da-historia-da-formula-1-preve-diretor-da-rbr.ghtml"
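parser = HtmlParser.from_url(url, Tokenizer(language))
# Rebuild the stemmer for Portuguese; the one from the previous cell is English
stemmer = Stemmer(language)

summarizer = LsaSummarizer(stemmer)
summarizer.stop_words = get_stop_words(language)

# Convert each selected Sentence object to plain text
summary = [str(sentence) for sentence in summarizer(parser.document, SENTENCES_COUNT)]

' '.join(summary)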

In [None]:
text = dom_casmurro[50:5000]
SENTENCES_COUNT = 3
language = 'portuguese'

parser = PlaintextParser.from_string(text, Tokenizer(language))
stemmer = Stemmer(language)

# summarizer = LsaSummarizer(stemmer)
summarizer = LexRankSummarizer(stemmer)

summarizer.stop_words = get_stop_words(language)

# Convert each selected Sentence object to plain text
summary = [str(sentence) for sentence in summarizer(parser.document, SENTENCES_COUNT)]

' '.join(summary)

# Exercise

Ask the user to type (or paste) the URL of a news article on the web (for example, from g1.com.br).

Use the `parser.document.words` property from `sumy` to count how many words the original document contains, and display the final result after the summarization process.