## Text Summarization with Attention Mechanism
- Text summarization: condensing a relatively long source text into a relatively short summary that keeps only the key content
- This notebook summarizes text with a sequence-to-sequence (seq2seq) model based on the attention mechanism
- Text summarization is broadly divided into **extractive summarization** and **abstractive summarization** <br>
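  #### 1) Extractive summarization
  - Builds the summary by selecting a few key sentences or phrases from the source text
  - Every sentence or phrase in the resulting summary appears verbatim in the source text
  - Drawback: because the summary is composed only of existing sentences or phrases, the model's expressive power is limited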
  
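  #### 2) Abstractive summarization
  - Summarizes the source by generating new sentences that capture its key context, even if those sentences do not appear in the source
  - More difficult than extractive summarization
  - Drawback: it is fundamentally a supervised learning problem, so it requires a large amount of data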
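
To make the extractive idea concrete, here is a minimal sketch of a frequency-based extractive summarizer. It is not the method used in this notebook (which trains an abstractive seq2seq model); the function name `extractive_summary` and the scoring rule (sum of word frequencies) are assumptions chosen only for illustration.

```python
import re
from collections import Counter


def extractive_summary(text, num_sentences=1):
    """Toy extractive summarizer (hypothetical helper, for illustration only).

    Returns the `num_sentences` highest-scoring sentences of `text`,
    copied verbatim and kept in their original order.
    """
    # Split into sentences after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    # Count how often each lowercase word occurs in the whole text.
    freq = Counter(re.findall(r'[a-z]+', text.lower()))

    def score(sentence):
        # A sentence scores the summed frequency of the words it contains.
        return sum(freq[w] for w in re.findall(r'[a-z]+', sentence.lower()))

    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    # Restore the original sentence order for readability.
    return [s for s in sentences if s in top]


review = "The dog food is good. My dog loves this dog food. Shipping was slow."
summary = extractive_summary(review, num_sentences=1)
print(summary)
```

Every sentence in the output is copied from the input, which illustrates the drawback listed above: the summarizer cannot produce wording that the source does not already contain.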

In [1]:
# download attention.py (a third-party AttentionLayer implementation for Keras)
import urllib.request
urllib.request.urlretrieve("https://raw.githubusercontent.com/thushv89/attention_keras/master/src/layers/attention.py", filename="attention.py")

# clear the verbose cell output
from IPython.display import clear_output
clear_output()


# import packages
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
from bs4 import BeautifulSoup
pd.options.display.max_colwidth = None
np.random.seed(seed=0)
nltk.download('stopwords')

import tensorflow
from tensorflow.keras.preprocessing.text import Tokenizer 
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Input, LSTM, Embedding, Dense, Concatenate
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

from attention import AttentionLayer


def show_table(df, sample_num=2):
  """Print the shape and NA count of df, then display its first and last rows."""
  print('>>> shape :', df.shape)
  print('>>> No of NA :', df.isna().sum().sum())
  if len(df) <= sample_num*2:
    display(df)
  else:
    display(df.head(sample_num))
    display(df.tail(sample_num))


# pd : 1.1.5  |  np : 1.19.5  |  nltk : 3.2.5  |  tensorflow : 2.4.1
print(f'>>> pd : {pd.__version__}  |  np : {np.__version__}  |  nltk : {nltk.__version__}  |  tensorflow : {tensorflow.__version__}') 

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
>>> pd : 1.1.5  |  np : 1.19.5  |  nltk : 3.2.5  |  tensorflow : 2.4.1


## 1. Prepare data
- Uses the Amazon Fine Food Reviews dataset (https://www.kaggle.com/snap/amazon-fine-food-reviews)

In [2]:
# load and check data
data = pd.read_csv('drive/MyDrive/mount_data/Amazon_Fine_Food_Reviews.csv', nrows=100000, usecols=['Text', 'Summary'])
show_table(data)

>>> shape : (100000, 2)
>>> No of NA : 2


Unnamed: 0,Summary,Text
0,Good Quality Dog Food,I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than most.
1,Not as Advertised,"Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as ""Jumbo""."


Unnamed: 0,Summary,Text
99998,Spicy!!,"I do like these noodles although, to say they are spicy is somewhat of an understatement. No one else in the family tolerates spicy very well so, seeing these, I was looking forward to an extra little something for the palate. I was not disappointed. To be completely honest, I usually drain most of the liquid as it is almost too much!"
99999,"This spicy noodle cures my cold, upset stomach, and headache every time!","I love this noodle and have it once or twice a week. The amazing thing is that when I don't feel well because of a cold, a hot bowl of this noodle will cure my upset stomach, headache, and running nose! This may not work for you, but you should definitely try it."


In [3]:
# drop rows with missing values, then drop duplicated texts (keeping the last occurrence)
print(data.nunique(), '\n')
data = data.dropna().drop_duplicates('Text', keep='last', ignore_index=True)
show_table(data)

Summary    72348
Text       88426
dtype: int64 

>>> shape : (88425, 2)
>>> No of NA : 0


Unnamed: 0,Summary,Text
0,Good Quality Dog Food,I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than most.
1,Not as Advertised,"Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as ""Jumbo""."


Unnamed: 0,Summary,Text
88423,Spicy!!,"I do like these noodles although, to say they are spicy is somewhat of an understatement. No one else in the family tolerates spicy very well so, seeing these, I was looking forward to an extra little something for the palate. I was not disappointed. To be completely honest, I usually drain most of the liquid as it is almost too much!"
88424,"This spicy noodle cures my cold, upset stomach, and headache every time!","I love this noodle and have it once or twice a week. The amazing thing is that when I don't feel well because of a cold, a hot bowl of this noodle will cure my upset stomach, headache, and running nose! This may not work for you, but you should definitely try it."
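
The effect of the `dropna().drop_duplicates(...)` chain above can be checked on a small example. The sketch below uses a made-up four-row DataFrame (the names `toy` and `cleaned` and all values are hypothetical) to show that rows with a missing value are removed first, and that `keep='last'` retains the last row among duplicated texts.

```python
import pandas as pd

# Made-up data: rows 0 and 1 share the same Text, row 2 has no Summary.
toy = pd.DataFrame({
    'Summary': ['Great', 'Great again', None, 'Bad'],
    'Text': ['I love it', 'I love it', 'No summary here', 'Not for me'],
})

# Same chain as in the cell above, written with the `subset` keyword.
cleaned = toy.dropna().drop_duplicates(subset='Text', keep='last', ignore_index=True)
print(cleaned)
```

Because `ignore_index=True` renumbers the rows from 0, the last index in the cleaned table is one less than the row count, as in the output above (88425 rows, last index 88424).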


In [4]:
# contraction dictionary
contractions = {
    "ain't": "is not", "aren't": "are not", "can't": "cannot", "'cause": "because",
    "could've": "could have", "couldn't": "could not", "didn't": "did not", "doesn't": "does not",
    "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not",
    "he'd": "he would", "he'll": "he will", "he's": "he is", "how'd": "how did",
    "how'd'y": "how do you", "how'll": "how will", "how's": "how is", "I'd": "I would",
    "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have", "I'm": "I am",
    "I've": "I have", "i'd": "i would", "i'd've": "i would have", "i'll": "i will",
    "i'll've": "i will have", "i'm": "i am", "i've": "i have", "isn't": "is not",
    "it'd": "it would", "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have",
    "it's": "it is", "let's": "let us", "ma'am": "madam", "mayn't": "may not",
    "might've": "might have", "mightn't": "might not", "mightn't've": "might not have", "must've": "must have",
    "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have",
    "o'clock": "of the clock", "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not",
    "sha'n't": "shall not", "shan't've": "shall not have", "she'd": "she would", "she'd've": "she would have",
    "she'll": "she will", "she'll've": "she will have", "she's": "she is", "should've": "should have",
    "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have", "so's": "so as",
    "this's": "this is", "that'd": "that would", "that'd've": "that would have", "that's": "that is",
    "there'd": "there would", "there'd've": "there would have", "there's": "there is", "here's": "here is",
    "they'd": "they would", "they'd've": "they would have", "they'll": "they will", "they'll've": "they will have",
    "they're": "they are", "they've": "they have", "to've": "to have", "wasn't": "was not",
    "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have",
    "we're": "we are", "we've": "we have", "weren't": "were not", "what'll": "what will",
    "what'll've": "what will have", "what're": "what are", "what's": "what is", "what've": "what have",
    "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is",
    "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is",
    "who've": "who have", "why's": "why is", "why've": "why have", "will've": "will have",
    "won't": "will not", "won't've": "will not have", "would've": "would have", "wouldn't": "would not",
    "wouldn't've": "would not have", "y'all": "you all", "y'all'd": "you all would", "y'all'd've": "you all would have",
    "y'all're": "you all are", "y'all've": "you all have", "you'd": "you would", "you'd've": "you would have",
    "you'll": "you will", "you'll've": "you will have", "you're": "you are", "you've": "you have",
}
print('>>> No of contractions :', len(contractions))

# stopwords
stop_words = set(stopwords.words('english'))
print('>>> No of stopwords :', len(stop_words), np.random.choice(list(stop_words), 10), '...')

>>> No of contractions : 120
>>> No of stopwords : 179 ['after' 'were' 'd' 'am' 'here' 'while' 'own' 'yourselves' 'yours' 'only'] ...
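
As a rough illustration of how a contraction dictionary and a stop-word set like the ones above might be applied to a review, here is a minimal sketch. The function `clean_text`, its `remove_stopwords` flag, and the small stand-in collections `toy_contractions` and `toy_stop_words` are all hypothetical; they are assumptions for this example, not the preprocessing actually used later in the notebook.

```python
import re

# Small stand-ins (assumptions) for the full `contractions` dict and the
# NLTK `stop_words` set built above, so the example runs on its own.
toy_contractions = {"don't": "do not", "it's": "it is"}
toy_stop_words = {"i", "it", "is", "do", "not", "the", "a"}


def clean_text(sentence, contractions, stop_words, remove_stopwords=True):
    """Lowercase, expand contractions, keep letters only, optionally drop stop words."""
    sentence = sentence.lower()
    # Expand contractions token by token; unknown tokens are kept unchanged.
    sentence = ' '.join(contractions.get(tok, tok) for tok in sentence.split())
    # Replace every run of non-letter characters with a single space.
    sentence = re.sub(r'[^a-z]+', ' ', sentence)
    words = sentence.split()
    if remove_stopwords:
        words = [w for w in words if w not in stop_words]
    return ' '.join(words)


raw = "I don't like the taste, it's bland!"
print(clean_text(raw, toy_contractions, toy_stop_words))
print(clean_text(raw, toy_contractions, toy_stop_words, remove_stopwords=False))
```

The flag is included because stop-word removal can discard negations such as "not", which may matter more for short summaries than for long review texts. Note also that lowercasing first means only the lowercase keys of the dictionary (for example `"i'd"`, not `"I'd"`) can match in this sketch.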
