# TextRank

Automatic text summarization is the task of producing a concise and fluent summary while preserving key information
content and overall meaning.

1. Extractive Summarization
 - Identifying the most important sentences or phrases in the original text and extracting only those.

2. Abstractive Summarization
 - Generating new sentences that convey the content of the original text.


3. TextRank: extractive and unsupervised text summarization
 - Concatenate text -> sentences -> sentence embeddings -> similarity matrix (between vectors) -> graph -> PageRank scores -> top-ranked sentences

### Connect to the existing GitHub repo

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
%cd /content/drive/Shared drives/ZWTZWT
!git clone https://github.com/vantuan5644/LungCancerTreatment.git

/content/drive/Shared drives/ZWTZWT
Cloning into 'LungCancerTreatment'...
remote: Enumerating objects: 5928, done.
remote: Counting objects: 100% (5928/5928), done.
remote: Compressing objects: 100% (5278/5278), done.
remote: Total 5928 (delta 742), reused 5749 (delta 563), pack-reused 0
Receiving objects: 100% (5928/5928), 22.71 MiB | 8.42 MiB/s, done.
Resolving deltas: 100% (742/742), done.


In [5]:
%cd LungCancerTreatment/

/content/drive/Shared drives/ZWTZWT/LungCancerTreatment


## TextRank

In [6]:
import numpy as np
import pandas as pd
import nltk
nltk.download('punkt')
import re


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


### Splitting into sentences

In [8]:
data = pd.read_csv('ground_truths/ground_truth.csv')
# Concatenate all text that belongs to the same stage into one document per stage
stage_level = (
    data[['text', 'stage_level']]
    .groupby('stage_level')
    .agg({'text': lambda text: ' '.join(text)})
)
data = stage_level.reset_index(level=0)
data

Unnamed: 0,stage_level,text
0,0.0,Because stage 0 NSCLC is limited to the lining...
1,1.0,"If you have stage I NSCLC, surgery may be the ..."
2,2.0,People who have stage II NSCLC and are healthy...
3,3.0,Treatment for stage IIIA NSCLC may include som...
4,4.0,Stage IV NSCLC is widespread when it is diagno...


In [9]:
# Split text into sentences
from nltk.tokenize import sent_tokenize

# Tokenize each stage's text and flatten the result into a single list of sentences
sentences = [sent for text in data['text'] for sent in sent_tokenize(text)]
sentences[:5]

['Because stage 0 NSCLC is limited to the lining layer of airways and has not invaded deeper into the lung tissue or other areas, it is usually curable by surgery alone.',
 'No chemotherapy or radiation therapy is needed.',
 'If you are healthy enough for surgery, you can usually be treated by segmentectomy or wedge resection (removal of part of the lobe of the lung).',
 'Cancers in some locations (such as where the windpipe divides into the left and right main bronchi) may be treated with a sleeve resection, but in some cases they may be hard to remove completely without removing a lobe (lobectomy) or even an entire lung (pneumonectomy).',
 'For some stage 0 cancers, treatments such as photodynamic therapy (PDT), laser therapy, or brachytherapy (internal radiation) may be alternatives to surgery.']

### Build sentence embeddings from GloVe

In [10]:
# GloVe Embeddings
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove*.zip

--2020-04-11 08:49:20--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2020-04-11 08:49:20--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2020-04-11 08:49:21--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2020-0

In [0]:
# Extract word vectors: each line is a token followed by its 100 vector components
word_embeddings = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        word_embeddings[word] = coefs
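
Each line of a GloVe text file holds a token followed by its vector components, separated by spaces. The sketch below parses two made-up lines from an in-memory buffer in the same way as the loading cell; the words, the 4-dimensional vectors, and the helper name `load_glove` are illustrative assumptions, not part of the notebook or of the real file.

```python
import io
import numpy as np

def load_glove(handle):
    """Parse GloVe-style lines ("token v1 v2 ...") into a dict of float32 arrays."""
    embeddings = {}
    for line in handle:
        values = line.split()
        if not values:  # skip blank lines
            continue
        embeddings[values[0]] = np.asarray(values[1:], dtype='float32')
    return embeddings

# Two invented 4-dimensional entries standing in for glove.6B.100d.txt.
fake_file = io.StringIO("lung 0.1 0.2 0.3 0.4\nsurgery -0.5 0.0 0.5 1.0\n")
toy_embeddings = load_glove(fake_file)
print(toy_embeddings["lung"].shape)
```

Passing a file handle rather than a path keeps the parser easy to try on small samples before reading the full file.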


#### Text Preprocessing

Remove newline characters

In [0]:
clean_sentences = [re.sub('\n+', ' ', sent) for sent in sentences]


Remove stopwords

In [17]:
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = stopwords.words('english')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [0]:
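def remove_stopwords(sen):
    # sen is a list of tokens. Matching is case-sensitive, so capitalized
    # stopwords such as "Because" are kept.
    sen_new = " ".join([i for i in sen if i not in stop_words])
    return sen_new

clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]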
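
The GloVe 6B vectors are uncased and NLTK's English stopword list is lowercase, so tokens such as `Because` or `alone.` match neither the stopword list nor the embedding vocabulary. A minimal sketch of an optional normalization step follows, assuming that lowercasing and stripping non-letter characters is acceptable for this corpus. The function name `normalize` and the tiny stopword set are hypothetical stand-ins, not the notebook's code.

```python
import re

# Hypothetical stand-in for stopwords.words('english').
toy_stop_words = {"is", "to", "the", "of", "and", "it", "by", "or", "no"}

def normalize(sentence, stop_words):
    """Lowercase, replace runs of non-letters with a space, and drop stopwords."""
    letters_only = re.sub(r"[^a-z]+", " ", sentence.lower())
    return " ".join(w for w in letters_only.split() if w not in stop_words)

print(normalize("No chemotherapy or radiation therapy is needed.", toy_stop_words))
```

Note the trade-off: stripping non-letters also removes digits, so a phrase like "stage 0" would lose its stage number. Whether that matters depends on how the summary is used.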


#### Make sentence vectors from word embeddings

In [0]:
sentence_vectors = []
for i in clean_sentences:
    if len(i) != 0:
        # Average the 100-d GloVe vectors of the words; unknown words count as zeros
        v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()])/(len(i.split()))
    else:
        # Sentence is empty after stopword removal
        v = np.zeros((100,))
    sentence_vectors.append(v)

assert len(sentences) == len(sentence_vectors)

In [24]:
sentence_vectors[0].shape

(100,)
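
Each sentence vector is the mean of its word vectors. Out-of-vocabulary words add zeros to the sum but still count in the denominator, which pulls the average toward zero. A toy sketch of that behavior, using invented 3-dimensional embeddings (the dictionary and the function name `average_vector` are assumptions for illustration):

```python
import numpy as np

# Invented 3-dimensional embeddings, not real GloVe values.
toy_embeddings = {
    "surgery": np.array([1.0, 0.0, 2.0]),
    "radiation": np.array([3.0, 2.0, 0.0]),
}

def average_vector(sentence, embeddings, dim):
    """Mean of word vectors; unknown words are treated as zero vectors."""
    words = sentence.split()
    if not words:
        return np.zeros((dim,))
    return sum(embeddings.get(w, np.zeros((dim,))) for w in words) / len(words)

# "unknownword" is missing from the dictionary, so the sum is divided by 3, not 2.
v = average_vector("surgery radiation unknownword", toy_embeddings, dim=3)
print(v)
```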

### Similarity Matrix Preparation

In [0]:
# Start from an (n, n) zero matrix, where n is the number of sentences.
# It is filled with pairwise cosine similarities in the next cell.
sim_mat = np.zeros([len(sentences), len(sentences)])


In [0]:
from sklearn.metrics.pairwise import cosine_similarity

# Fill every off-diagonal entry; the diagonal stays 0 (no self-similarity)
for i in range(len(sentences)):
    for j in range(len(sentences)):
        if i != j:
            sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,100), sentence_vectors[j].reshape(1,100))[0,0]
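
The nested loop calls `cosine_similarity` once per ordered sentence pair, which becomes slow as the number of sentences grows. As a sketch of a vectorized alternative using only NumPy (the helper name `cosine_similarity_matrix` and the toy vectors are assumptions; all-zero vectors are assigned similarity 0 here to avoid division by zero):

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity with a zeroed diagonal."""
    X = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0      # keep all-zero rows as zeros instead of NaN
    unit = X / norms             # each non-zero row now has length 1
    sim = unit @ unit.T          # dot products of unit vectors are cosines
    np.fill_diagonal(sim, 0.0)   # no self-similarity, as in the loop
    return sim

toy_vectors = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0], [0.0, 0.0]]
sim = cosine_similarity_matrix(toy_vectors)
print(np.round(sim, 3))
```

In the notebook's terms, the input would be the stacked `sentence_vectors`, for example `np.vstack(sentence_vectors)`.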


### Applying PageRank algorithm

#### Convert into graph

We need to convert the similarity matrix **sim_mat** into a graph.

The nodes of this graph represent the sentences, and the edge weights are the similarity scores between them.

In [0]:
import networkx as nx

nx_graph = nx.from_numpy_array(sim_mat)
scores = nx.pagerank(nx_graph)
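
To make the ranking step less of a black box, here is a rough sketch of weighted PageRank by power iteration on a small similarity matrix. It is not NetworkX's implementation: the function name `pagerank_scores`, the tolerance, the iteration cap, and the toy matrix are assumptions. The damping factor 0.85 matches the default `alpha` of `nx.pagerank`.

```python
import numpy as np

def pagerank_scores(sim, damping=0.85, tol=1e-8, max_iter=200):
    """Power iteration on a non-negative similarity matrix."""
    W = np.asarray(sim, dtype=float)
    n = W.shape[0]
    row_sums = W.sum(axis=1, keepdims=True)
    # Row-normalize into transition probabilities; a row with no edges
    # jumps uniformly to every node.
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    P = np.where(row_sums > 0, W / safe_sums, 1.0 / n)
    scores = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        updated = (1 - damping) / n + damping * (scores @ P)
        converged = np.abs(updated - scores).sum() < tol
        scores = updated
        if converged:
            break
    return scores

# Sentence 0 is strongly similar to both others, so it should rank highest.
toy_sim = np.array([[0.0, 0.9, 0.8],
                    [0.9, 0.0, 0.1],
                    [0.8, 0.1, 0.0]])
toy_scores = pagerank_scores(toy_sim)
print(toy_scores, toy_scores.argmax())
```

A sentence scores highly when it is similar to many other sentences that are themselves well connected, which is why central, representative sentences rise to the top.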


#### Summary Extraction

Extract the top N sentences by PageRank score to form the summary.

In [0]:
ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)
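
`ranked_sentences` orders sentences by descending score, so a summary read straight from it can jump between stages. One possible variation, not used in this notebook, is to select the top N by score and then restore their original order. The sentences, the scores, and the helper name `top_n_in_document_order` below are invented for illustration.

```python
def top_n_in_document_order(sentences, scores, n):
    """Pick the n highest-scoring sentences, then restore their original order."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:n])]

toy_sentences = ["A.", "B.", "C.", "D."]
# A dict keyed by node index, the same shape that nx.pagerank returns.
toy_scores = {0: 0.10, 1: 0.40, 2: 0.15, 3: 0.35}
print(top_n_in_document_order(toy_sentences, toy_scores, n=2))
```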


In [28]:
# Extract top 10 sentences as the summary
for i in range(10):
  print(ranked_sentences[i][1])


NSCLC that has spread to only one other site Cancer that is limited in the lungs and has only spread to one other site (such as the brain) is not common, but it can sometimes be treated (and even potentially cured) with surgery and/or radiation therapy to treat the area of cancer spread, followed by treatment of the cancer in the lung.
Even if positive margins are not found, chemo is usually recommended after surgery to try to destroy any cancer cells that might have been left behind.
As with stage I cancers, newer lab tests now being studied may help doctors find out which patients need this adjuvant treatment and which are less likely to benefit from it.
If you are in otherwise good health, treatments such as surgery, chemotherapy (chemo), targeted therapy, immunotherapy, and radiation therapy may help you live longer and make you feel better by relieving symptoms, even though they aren’t likely to cure you.
For people with stage I NSCLC that has a higher risk of coming back (based o