# TextRank

Automatic text summarization is the task of producing a concise, fluent summary that preserves the key information
and overall meaning of the source text.

1. Extractive Summarization
 - Identify the important sentences or phrases in the original text and extract only those.

2. Abstractive Summarization
 - Generate new sentences that convey the content of the original text.


3. TextRank: extractive and unsupervised text summarization
 - Concatenate text -> split into sentences -> sentence embeddings -> similarity matrix (between sentence vectors) -> graph -> PageRank scores -> top-ranked sentences
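
The pipeline above can be sketched end to end on a toy input. This is only a minimal illustration under stated assumptions: it splits sentences with a naive regular expression instead of NLTK, uses bag-of-words count vectors as a stand-in for GloVe sentence embeddings, and ranks with a hand-written power iteration instead of NetworkX. The function name `textrank_summary` and its parameters are hypothetical.

```python
import re
import numpy as np

def textrank_summary(text, top_n=2, damping=0.85, iters=50):
    """Toy TextRank: return the top_n sentences, best first."""
    # 1. Split into sentences (naive regex; the notebook uses NLTK instead).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    n = len(sentences)
    # 2. Bag-of-words count vectors (assumed stand-in for GloVe averages).
    tokens = [re.findall(r"[a-z0-9]+", s.lower()) for s in sentences]
    vocab = sorted({w for toks in tokens for w in toks})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = np.zeros((n, len(vocab)))
    for row, toks in enumerate(tokens):
        for w in toks:
            vectors[row, index[w]] += 1
    # 3. Cosine similarity matrix with a zeroed diagonal.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    unit = vectors / norms
    sim = unit @ unit.T
    np.fill_diagonal(sim, 0.0)
    # 4. PageRank by power iteration; isolated sentences link to all uniformly.
    row_sums = sim.sum(axis=1, keepdims=True)
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    transition = np.where(row_sums > 0, sim / safe_sums, 1.0 / n)
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):
        scores = (1 - damping) / n + damping * transition.T @ scores
    # 5. Return the highest-scoring sentences.
    ranked = np.argsort(-scores)
    return [sentences[i] for i in ranked[:top_n]]

print(textrank_summary("Cats chase mice. Cats chase birds. The weather is sunny."))
```

In this toy input the two sentences that share words reinforce each other, so they outrank the unrelated one.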

### Connect to the existing GitHub repo

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
%cd /content/drive/Shared drives/ZWTZWT
!git clone https://github.com/vantuan5644/LungCancerTreatment.git

/content/drive/Shared drives/ZWTZWT
Cloning into 'LungCancerTreatment'...
remote: Enumerating objects: 5928, done.
remote: Counting objects: 100% (5928/5928), done.
remote: Compressing objects: 100% (5278/5278), done.
remote: Total 5928 (delta 742), reused 5749 (delta 563), pack-reused 0
Receiving objects: 100% (5928/5928), 22.71 MiB | 8.42 MiB/s, done.
Resolving deltas: 100% (742/742), done.


In [5]:
%cd LungCancerTreatment/

/content/drive/Shared drives/ZWTZWT/LungCancerTreatment


## TextRank

In [6]:
import numpy as np
import pandas as pd
import nltk
nltk.download('punkt')
import re


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


### Splitting into sentences

In [8]:
data = pd.read_csv('ground_truths/ground_truth.csv')
data.head()
# Concatenate all text belonging to the same stage into one string
stage_level = (
    data[['text', 'stage_level']]
    .groupby('stage_level')
    .agg({'text': lambda text: ' '.join(text)})
)
data = stage_level.reset_index(level=0)
data

Unnamed: 0,stage_level,text
0,0.0,Because stage 0 NSCLC is limited to the lining...
1,1.0,"If you have stage I NSCLC, surgery may be the ..."
2,2.0,People who have stage II NSCLC and are healthy...
3,3.0,Treatment for stage IIIA NSCLC may include som...
4,4.0,Stage IV NSCLC is widespread when it is diagno...


In [9]:
# Split text into sentences
from nltk.tokenize import sent_tokenize
sentences = []
for s in data['text']:
    sentences.append(sent_tokenize(s))

sentences = [y for x in sentences for y in x] # flatten list
sentences[:5]

['Because stage 0 NSCLC is limited to the lining layer of airways and has not invaded deeper into the lung tissue or other areas, it is usually curable by surgery alone.',
 'No chemotherapy or radiation therapy is needed.',
 'If you are healthy enough for surgery, you can usually be treated by segmentectomy or wedge resection (removal of part of the lobe of the lung).',
 'Cancers in some locations (such as where the windpipe divides into the left and right main bronchi) may be treated with a sleeve resection, but in some cases they may be hard to remove completely without removing a lobe (lobectomy) or even an entire lung (pneumonectomy).',
 'For some stage 0 cancers, treatments such as photodynamic therapy (PDT), laser therapy, or brachytherapy (internal radiation) may be alternatives to surgery.']

### Make sentence embeddings from GloVe

In [10]:
# GloVe Embeddings
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove*.zip

--2020-04-11 08:49:20--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2020-04-11 08:49:20--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2020-04-11 08:49:21--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2020-0

In [0]:
# Extract word vectors: each line holds a word followed by its 100 coefficients
word_embeddings = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        word_embeddings[word] = coefs


#### Text Preprocessing

Replace newline characters with spaces

In [0]:
clean_sentences = [re.sub('\n+', ' ', sent) for sent in sentences]


Remove stopwords

In [17]:
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = stopwords.words('english')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [0]:
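def remove_stopwords(sen):
    # `sen` is a list of tokens; the comparison is case-sensitive, so
    # capitalised stopwords such as "The" are kept
    sen_new = " ".join([i for i in sen if i not in stop_words])
    return sen_new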

clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]


#### Make sentence vectors from word embeddings

In [0]:
sentence_vectors = []
for i in clean_sentences:
  if len(i) != 0:
    v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()])/(len(i.split()))
  else:
    v = np.zeros((100,))
  sentence_vectors.append(v)
  
assert len(sentences) == len(sentence_vectors)

In [24]:
sentence_vectors[0].shape

(100,)
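
To make the averaging step concrete, here is a small sketch with made-up 3-dimensional embeddings in place of the 100-dimensional GloVe vectors. The names `toy_embeddings` and `sentence_vector` are hypothetical and only serve the illustration.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings; the notebook uses 100-d GloVe vectors.
toy_embeddings = {
    "surgery": np.array([1.0, 0.0, 0.0]),
    "radiation": np.array([0.0, 1.0, 0.0]),
}

def sentence_vector(sentence, embeddings, dim):
    """Mean of the word vectors; unknown words contribute a zero vector."""
    words = sentence.split()
    if not words:
        return np.zeros(dim)
    return np.mean([embeddings.get(w, np.zeros(dim)) for w in words], axis=0)

known = sentence_vector("surgery radiation", toy_embeddings, 3)   # both words found
mixed = sentence_vector("Surgery radiation", toy_embeddings, 3)   # "Surgery" not found
print(known, mixed)
```

The lookup is case-sensitive and the `glove.6B` vocabulary is lowercased, so a capitalised token falls back to the zero vector and pulls the average toward zero. Lowercasing the sentences before the lookup may be worth considering.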

### Similarity Matrix Preparation

In [0]:
# Start from an (n, n) zero matrix, where n is the number of sentences
# Its off-diagonal entries will hold the pairwise cosine similarities of the sentences
sim_mat = np.zeros([len(sentences), len(sentences)])


In [0]:
from sklearn.metrics.pairwise import cosine_similarity

for i in range(len(sentences)):
  for j in range(len(sentences)):
    if i != j:
      sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,100), sentence_vectors[j].reshape(1,100))[0,0]
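
The nested loop calls `cosine_similarity` once per sentence pair, which can become slow as the number of sentences grows. As a sketch of an alternative, the same matrix can be computed with one matrix product in NumPy. The helper name `cosine_similarity_matrix` is hypothetical, and the zero-vector handling (similarity 0) is an assumption chosen to match the behaviour expected from the loop.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarities with the diagonal set to zero."""
    X = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # all-zero rows keep similarity 0
    unit = X / norms                 # scale every row to unit length
    sim = unit @ unit.T              # dot products of unit vectors = cosines
    np.fill_diagonal(sim, 0.0)       # mirrors the `if i != j` check in the loop
    return sim

toy_sim = cosine_similarity_matrix([[1, 0], [1, 1], [0, 0]])
print(toy_sim)
```

In the notebook this would be called as `cosine_similarity_matrix(sentence_vectors)`. scikit-learn's `cosine_similarity` also accepts a whole matrix, so `cosine_similarity(np.vstack(sentence_vectors))` followed by `np.fill_diagonal` is another option.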


### Applying PageRank algorithm

#### Convert into graph

We need to convert the similarity matrix **sim_mat** into a graph.

The nodes of this graph represent the sentences, and the edge weights are the similarity scores between sentences.

In [0]:
import networkx as nx

nx_graph = nx.from_numpy_array(sim_mat)
scores = nx.pagerank(nx_graph)
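
For reference, the fixed point that weighted PageRank converges to can be written as follows. The notation is introduced here and is not part of the notebook: $N$ is the number of sentences, $w_{ji}$ is the similarity between sentences $j$ and $i$, $s_i$ is the score of sentence $i$, and $d$ is the damping factor (`alpha`, 0.85 by default in `nx.pagerank`). The formula assumes non-negative weights and that every sentence has at least one non-zero similarity.

$$
s_i = \frac{1 - d}{N} + d \sum_{j \neq i} \frac{w_{ji}}{\sum_{k} w_{jk}} \, s_j
$$

A sentence therefore scores highly when it is similar to many sentences that themselves score highly.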


#### Summary Extraction

Extract the top N sentences by score to form the summary.

In [0]:
ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)


In [28]:
# Extract top 10 sentences as the summary
for i in range(10):
  print(ranked_sentences[i][1])


NSCLC that has spread to only one other site Cancer that is limited in the lungs and has only spread to one other site (such as the brain) is not common, but it can sometimes be treated (and even potentially cured) with surgery and/or radiation therapy to treat the area of cancer spread, followed by treatment of the cancer in the lung.
Even if positive margins are not found, chemo is usually recommended after surgery to try to destroy any cancer cells that might have been left behind.
As with stage I cancers, newer lab tests now being studied may help doctors find out which patients need this adjuvant treatment and which are less likely to benefit from it.
If you are in otherwise good health, treatments such as surgery, chemotherapy (chemo), targeted therapy, immunotherapy, and radiation therapy may help you live longer and make you feel better by relieving symptoms, even though they aren’t likely to cure you.
For people with stage I NSCLC that has a higher risk of coming back (based o
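
The summary above lists sentences by score, so their original order is lost. One possible variation, sketched here as an assumption rather than as part of the notebook's method, is to select the top N sentences and then print them in document order. The helper `top_sentences_in_order` is hypothetical. It accepts scores as a dict keyed by sentence index, which is the shape `nx.pagerank` returns for this graph.

```python
def top_sentences_in_order(sentences, scores, n):
    """Pick the n best-scoring sentences, then restore document order."""
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n]
    return [sentences[i] for i in sorted(best)]

toy_sentences = ["a", "b", "c", "d"]
toy_scores = {0: 0.1, 1: 0.4, 2: 0.2, 3: 0.3}
print(top_sentences_in_order(toy_sentences, toy_scores, 2))
```

Because the selection uses slicing, asking for more sentences than exist returns all of them instead of raising an `IndexError`.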

# Sequence-to-Sequence Modeling

There are two major components of a Seq2Seq model:

- Encoder: An LSTM reads the input sequence one word per timestep. It updates its internal state at every timestep and so captures the contextual information present in the input sequence.

![LSTM Encoder](https://cdn.analyticsvidhya.com/wp-content/uploads/2019/05/61.jpg)


- Decoder: An LSTM network that reads the target sequence word by word and predicts the same sequence offset by one timestep. **The decoder is trained to predict the next word in the sequence given the previous word.**

![LSTM Decoder](https://cdn.analyticsvidhya.com/wp-content/uploads/2019/05/71.jpg)
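
The "offset by one timestep" idea can be illustrated without any deep learning library. The sketch below builds the decoder input and the expected output from one target summary. The `<start>` and `<end>` marker tokens and the helper `decoder_pairs` are hypothetical names chosen for the example.

```python
def decoder_pairs(target_tokens, start="<start>", end="<end>"):
    """Build decoder input and expected output, offset by one timestep."""
    full = [start] + list(target_tokens) + [end]
    decoder_input = full[:-1]    # what the decoder sees at each step
    decoder_target = full[1:]    # what it should predict at that step
    return decoder_input, decoder_target

inputs, targets = decoder_pairs(["great", "taffy"])
for seen, expected in zip(inputs, targets):
    print(f"{seen!r:>10} -> {expected!r}")
```

At every position the decoder receives the previous word and is trained to emit the next one, with the end marker as the final prediction.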


The encoder compresses the entire input sequence into a fixed-length vector, from which the decoder predicts the output sequence. It is therefore difficult for the encoder to retain a long sequence in a single fixed-length vector. An **attention mechanism** addresses this issue: to predict each word, the decoder looks at a few specific parts of the input sequence rather than at the entire sequence.
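
As a rough sketch of the idea, the snippet below computes attention weights and a context vector for a single decoder timestep. It uses dot-product scoring for brevity, which is an assumption made for this example. The attention layer used later in the notebook may use a different scoring function (for example, an additive one). All names and toy values are hypothetical.

```python
import numpy as np

def dot_product_attention(decoder_state, encoder_states):
    """Return (context_vector, attention_weights) for one decoder timestep."""
    scores = encoder_states @ decoder_state          # one score per input timestep
    scores = scores - scores.max()                   # numerical stability for softmax
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over input timesteps
    context = weights @ encoder_states               # weighted sum of encoder states
    return context, weights

# Toy values: 3 input timesteps, hidden size 2.
encoder_states = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
decoder_state = np.array([0.0, 2.0])
context, weights = dot_product_attention(decoder_state, encoder_states)
print(weights, context)
```

The weights sum to one, and the input timesteps whose encoder states align best with the current decoder state contribute most to the context vector.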


### Train on the Amazon Fine Food Reviews dataset

This dataset consists of reviews of fine foods from Amazon. The data spans a period of more than 10 years, including all ~500,000 reviews up to October 2012. Each review includes product and user information, a rating, the plain-text review, and a summary. It also includes reviews from all other Amazon categories.



In [72]:
import numpy as np  
import pandas as pd 
import re           
from bs4 import BeautifulSoup 
from keras.preprocessing.text import Tokenizer 
from keras.preprocessing.sequence import pad_sequences
from nltk.corpus import stopwords   
from tensorflow.keras.layers import Input, LSTM, Embedding, Dense, Concatenate, TimeDistributed, Bidirectional
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import EarlyStopping
import warnings
pd.set_option("display.max_colwidth", 200)
warnings.filterwarnings("ignore")


Using TensorFlow backend.


In [38]:
from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"vantuan5644","key":"<redacted>"}'}

In [0]:
!pip install -q kaggle
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
# Restrict access to the API token so other users cannot read it
!chmod 600 ~/.kaggle/kaggle.json

In [40]:
import os
os.getcwd()


'/content/drive/Shared drives/ZWTZWT/LungCancerTreatment/src/text_summarization'

In [43]:
!kaggle datasets list -s amazon


ref                                                  title                                               size  lastUpdated          downloadCount  
---------------------------------------------------  -------------------------------------------------  -----  -------------------  -------------  
snap/amazon-fine-food-reviews                        Amazon Fine Food Reviews                           242MB  2017-05-01 18:51:31          67526  
sid321axn/amazon-alexa-reviews                       Amazon Alexa Reviews                               164KB  2018-07-31 17:45:14           7724  
bittlingmayer/amazonreviews                          Amazon Reviews for Sentiment Analysis              493MB  2019-11-18 02:50:34          23275  
grikomsn/amazon-cell-phones-reviews                  Amazon Cell Phones Reviews                           9MB  2019-12-26 22:21:16           6143  
datafiniti/consumer-reviews-of-amazon-products       Consumer Reviews of Amazon Products                 16MB  2

In [44]:
!kaggle datasets download -d snap/amazon-fine-food-reviews

Downloading amazon-fine-food-reviews.zip to /content/drive/Shared drives/ZWTZWT/LungCancerTreatment/src/text_summarization
100% 241M/242M [00:03<00:00, 67.4MB/s]
100% 242M/242M [00:03<00:00, 71.9MB/s]


In [45]:
os.listdir(os.getcwd())

['__init__.py',
 'extractive_textrank.ipynb',
 'sentences_transformer.py',
 'attention.py',
 '.ipynb_checkpoints',
 '__pycache__',
 'kaggle.json',
 'amazon-fine-food-reviews.zip']

In [0]:
import zipfile

dataset_file = os.path.join(os.getcwd(), 'amazon-fine-food-reviews.zip')
# Extract the archive into the current working directory
with zipfile.ZipFile(dataset_file, 'r') as zip_ref:
    zip_ref.extractall()


In [47]:
os.listdir(os.getcwd())

['__init__.py',
 'extractive_textrank.ipynb',
 'sentences_transformer.py',
 'attention.py',
 '.ipynb_checkpoints',
 '__pycache__',
 'kaggle.json',
 'amazon-fine-food-reviews.zip',
 'Reviews.csv',
 'database.sqlite',
 'hashes.txt']

### Load the dataset

In [0]:
data = pd.read_csv("Reviews.csv")


In [50]:
data.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...


Drop duplicates and NA values

In [0]:
data.drop_duplicates(subset=['Text'], inplace=True)  # drop reviews with duplicate text
data.dropna(axis=0, inplace=True)                    # drop rows with missing values

In [0]:
from contraction import contraction_mapping
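
The `contraction` module is local to the repository and its contents are not shown here. As a hedged illustration of what such a mapping typically looks like and how it is applied, consider the sketch below. The dictionary entries and the helper `expand_contractions` are hypothetical, and the real `contraction_mapping` is expected to be much longer.

```python
# Hypothetical excerpt; not the actual contents of the `contraction` module.
example_contractions = {
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
    "i've": "i have",
}

def expand_contractions(text, mapping):
    """Replace each space-separated token that appears in the mapping."""
    return " ".join(mapping.get(token, token) for token in text.lower().split(" "))

print(expand_contractions("I can't say it's bad", example_contractions))
```

Tokens are matched exactly, so a contraction with attached punctuation (such as `can't,`) is left unchanged by this approach.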

#### Text cleaning
 - Convert to lowercase
 - Remove HTML tags
 - Contraction mapping
 - Remove special characters
 - Remove stopwords


In [0]:
from bs4 import BeautifulSoup


In [0]:
stop_words = set(stopwords.words('english'))

def text_cleaner(text, remove_short_words=False, remove_xml_tag=False):
    newString = text.lower()
    if remove_xml_tag:
        # Strip HTML/XML markup and keep only the text content
        newString = BeautifulSoup(newString, "lxml").text
    newString = re.sub(r'\([^)]*\)', '', newString)     # drop parenthesised text
    newString = re.sub('"', '', newString)              # drop double quotes
    # Expand contractions using the imported mapping
    newString = ' '.join([contraction_mapping[t] if t in contraction_mapping else t for t in newString.split(" ")])
    newString = re.sub(r"'s\b", "", newString)          # drop possessive 's
    newString = re.sub('\n+', ' ', newString)
    newString = re.sub("[^a-zA-Z0-9]", " ", newString)  # keep letters and digits only
    tokens = [w for w in newString.split() if w not in stop_words]
    if remove_short_words:
        # Keep only words with at least three characters
        tokens = [w for w in tokens if len(w) >= 3]
    return ' '.join(tokens)


In [0]:
cleaned_text = []
for t in data['Text']:
    cleaned_text.append(text_cleaner(t, remove_xml_tag=True))


In [0]:
cleaned_summary = []
for t in data['Summary']:
    cleaned_summary.append(text_cleaner(t, remove_xml_tag=False))


In [0]:
data['cleaned_text'] = cleaned_text
data['cleaned_summary'] = cleaned_summary
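# Assign the result back; an in-place replace on a selected column may not update `data` in recent pandas versions
data['cleaned_summary'] = data['cleaned_summary'].replace('', np.nan)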
data.dropna(axis=0,inplace=True)


In [83]:
data[['cleaned_text', 'cleaned_summary']]

Unnamed: 0,cleaned_text,cleaned_summary
0,bought several vitality canned dog food products found good quality product looks like stew processed meat smells better labrador finicky appreciates product better,good quality dog food
1,product arrived labeled jumbo salted peanuts peanuts actually small sized unsalted sure error vendor intended represent product jumbo,advertised
2,confection around centuries light pillowy citrus gelatin nuts case filberts cut tiny squares liberally coated powdered sugar tiny mouthful heaven chewy flavorful highly recommend yummy treat famil...,delight says
3,looking secret ingredient robitussin believe found got addition root beer extract ordered made cherry soda flavor medicinal,cough medicine
4,great taffy great price wide assortment yummy taffy delivery quick taffy lover deal,great taffy
...,...,...
568449,great sesame chicken good better resturants eaten husband loved find recipes use,without
568450,disappointed flavor chocolate notes especially weak milk thickens flavor still disappoints worth try never buy use left gone time thanks small cans,disappointed
568451,stars small give 10 15 one training session tried train dog ceaser dog treats made puppy hyper compare ingredients know little stars basic food ingredients without preservatives food coloring swee...,perfect maltipoo
568452,best treats training rewarding dog good grooming lower calories loved doggies sweet potatoes seem favorite wet noses treat,favorite training reward treat


### Custom Attention Layer

In [0]:
!pwd

In [35]:
%cd src/text_summarization/
!wget https://raw.githubusercontent.com/thushv89/attention_keras/master/layers/attention.py

[Errno 2] No such file or directory: 'src/text_summarization/'
/content/drive/Shared drives/ZWTZWT/LungCancerTreatment/src/text_summarization
--2020-04-11 10:21:31--  https://raw.githubusercontent.com/thushv89/attention_keras/master/layers/attention.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5230 (5.1K) [text/plain]
Saving to: ‘attention.py’


2020-04-11 10:21:31 (8.33 MB/s) - ‘attention.py’ saved [5230/5230]



In [0]:
from attention import AttentionLayer