This part of code is to extract the most significant sentences and noun chunks in each Abstract. TextRank has been applied for extractive summarization, it determines the importance score of each sentence based on its similarity score with other sentences. Drawing inspiration from the PageRank algorithm, TextRank creates a word graph and progressively calculates the ranking score of each sentence in an iterative manner. 

In [1]:
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.cluster.util import cosine_distance
import numpy as np
import spacy
from gensim.models import Word2Vec

  from .autonotebook import tqdm as notebook_tqdm


The similarity matrix can be calculated using two methods, the first method is calculated according to the number of words co-occured in two sequences, the second method to construct similarity matrix base on the embedding similarity between two sequences. 

The formula of the TextRank is $S(V_{i}) = (1-d) + d * \sum(j\epsilon linked(V_{i})) [ (W_{ji} / \sum(k\epsilon linked(V_{j})) W_{jk} ) * S(V_{j}) ]$
where $S (V_{i})$ is the TextRank score for sequence $V_{i}$.
d is the impedance coefficient, usually set to 0.85
linked(V_{j}) refers to the sequences except for the current one,
$W_{ji}$ is the similarity score between sequence i and j

In [2]:
def ranking(x_clean, noun_or_sentence = 0, number=5, similarity_method = 0):
    '''
    params:
        x_clean: clean the passage, 
        number: extract the top 5 noun chunks/sentences
        noun_or_sentence: if noun_or_sentence equal to 1, rank noun chunks; if noun_or_sentence is equal to 0, rank sentences
        similarity_method: similarity_matrix = 0, similarity analysis using word co-occurance pattern, similarity_matrix = 1, similarity analysis using word2vec similarity
    results:
        a dictionary containing the sentence along with its importance
    '''
    # clean the noun_chunks
    if noun_or_sentence == 1:
        x_clean = noun_chunks_split
        x_clean = [re.sub(r'[^a-zA-Z0-9-\s]', u'',sentence.replace('\n', '').replace('-',' '), flags=re.UNICODE) for sentence in x_clean]
        x_clean = [list(filter(lambda w: w not in stop_words and len(w)>2, map(lemmatizer.lemmatize, sentence.lower().split()))) for sentence in x_clean]
    similarity_matrix = np.zeros((len(x_clean),len(x_clean)))
    # similarity analysis using word co-occurance pattern
    if similarity_method == 0:
        for idx1 in range(0,len(x_clean)):
            for idx2 in range(0,len(x_clean)):
                if idx1 == idx2:
                    continue
                elif len(x_clean[idx1])==0 or len(x_clean[idx2])==0:
                    similarity_matrix[idx1][idx2] = 0
                else:
                    # construct a vocabulary containing all the words
                    vocabulary = list(set(x_clean[idx1] + x_clean[idx2]))
                    vector_sentence1 = [0]*len(vocabulary)
                    vector_sentence2 = [0]*len(vocabulary)
                    # count the word in each vocabulary
                    for word in x_clean[idx1]:
                        vector_sentence1[vocabulary.index(word)] += 1
                    for word in x_clean[idx2]:
                        vector_sentence2[vocabulary.index(word)] += 1
                    #calculate similarity matrix using cosine similarity
                    similarity_matrix[idx1][idx2] = 1 - cosine_distance(vector_sentence1, vector_sentence2)
    else:
        # Set the size of each word vector to be 100
        w2v = Word2Vec(x_clean, vector_size=100, min_count=1)

        # Get the embedding for each sentence by averaging the embeddings of the words in the sentence
        sentence_embeddings = [[w2v.wv[word] for word in words] for words in x_clean]
        sentence_embeddings = [np.mean(embedding, axis=0) for embedding in sentence_embeddings]

        for i, row_embedding in enumerate(sentence_embeddings):
            for j, column_embedding in enumerate(sentence_embeddings):
                if np.isnan(row_embedding).any() or np.isnan(column_embedding).any():
                    similarity_matrix[i][j] = 0
                else:
                    similarity_matrix[i][j] = 1 - cosine_distance(row_embedding, column_embedding)
    # complete similarity matrix
    similarity_matrix = similarity_matrix + similarity_matrix.T - np.diag(similarity_matrix.diagonal())
    similarity_matrix = np.divide(similarity_matrix,np.sum(similarity_matrix,axis = 0),where=np.sum(similarity_matrix,axis = 0) != 0)
    # iteratively update the textRank score
    vector = np.array([1] * len(similarity_matrix))
    previous_sum_vector = 0
    for epoch in range(100):
        vector = (1 - 0.85) + 0.85 * np.matmul(similarity_matrix, vector)
        if abs(previous_sum_vector - sum(vector)) >= 1e-5:
            previous_sum_vector = sum(vector)
        else:
            break
    vector_sort = list(np.argsort(vector))
    vector_sort.reverse()
    # extract the top n textrank sentences
    i = 0
    top_sentences = {}
    for i in range(number):
        print (str(vector_sort[i]) + " : " + str(vector[vector_sort[i]]))
        if noun_or_sentence == 0:
            sentence = sentences[vector_sort[i]]
        else:
            sentence = noun_chunks_split[vector_sort[i]]
        top_sentences[sentence] = vector[vector_sort[i]]
        print(sentence)
    return top_sentences

clean the sentences and extract noun chunks

In [3]:
# test a passage, clean the passage
nlp = spacy.load('en_core_web_sm')
text ="Observed scaling relations in galaxies between baryons and dark matter global properties are key to shed light on the process of galaxy formation and on the nature of dark matter. Here, we study the scaling relation between the neutral hydrogen (HI) and dark matter mass in isolated rotationally-supported disk galaxies at low redshift. We first show that state-of-the-art galaxy formation simulations predict that the HI-to-dark halo mass ratio decreases with stellar mass for the most massive disk galaxies. We then infer dark matter halo masses from high-quality rotation curve data for isolated disk galaxies in the local Universe, and report on the actual universality of the HI-to-dark halo mass ratio for these observed galaxies. This scaling relation holds for disks spanning a range of 4 orders of magnitude in stellar mass and 3 orders of magnitude in surface brightness. Accounting for the diversity of rotation curve shapes in our observational fits decreases the scatter of the HI-to-dark halo mass ratio while keeping it constant. This finding extends the previously reported discrepancy for the stellar-to-halo mass relation of massive disk galaxies within galaxy formation simulations to the realm of neutral atomic gas. Our result reveals that isolated galaxies with regularly rotating extended HI disks are surprisingly self-similar up to high masses, which hints at mass-independent self-regulation mechanisms that have yet to be fully understood."
nltk.download('punkt')
nltk.download('stopwords')
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
sentences = nltk.sent_tokenize(text)
x_clean = [re.sub(r'[^a-zA-Z0-9-\s]', u'',sentence.replace('\n', '').replace('-',' '), flags=re.UNICODE) for sentence in sentences]
x_clean = [list(filter(lambda w: w not in stop_words and len(w)>2, map(lemmatizer.lemmatize, sentence.lower().split()))) for sentence in x_clean]
# extract the noun chunks
doc = nlp(text)
noun_chunks_split = [chunk.text for chunk in doc.noun_chunks]

[nltk_data] Downloading package punkt to /Users/mac/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/mac/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


noun chunks similarity ranking using co-occurance matrix

In [4]:
top_sentences = ranking(x_clean, noun_or_sentence = 1, number = 30, similarity_method = 0)

23 : 1.4021092890451134
dark matter halo masses
19 : 1.3208139883700234
the HI-to-dark halo mass ratio
43 : 1.3208139883700234
the HI-to-dark halo mass ratio
28 : 1.3208139883700234
the HI-to-dark halo mass ratio
25 : 1.304102326929677
isolated disk galaxies
14 : 1.2985871523037398
dark matter mass
21 : 1.2437635249712296
the most massive disk galaxies
48 : 1.2437635249712296
massive disk galaxies
1 : 1.1391368593863604
galaxies
55 : 1.1368473063201823
high masses
15 : 1.08123495043033
isolated rotationally-supported disk galaxies
29 : 1.048644747227133
these observed galaxies
53 : 1.0422127806183559
isolated galaxies
37 : 1.0
magnitude
51 : 1.0
neutral atomic gas
12 : 1.0
the neutral hydrogen
33 : 1.0
4 orders
34 : 1.0
magnitude
36 : 1.0
3 orders
0 : 0.9675349485284513
Observed scaling relations
7 : 0.9656717650439303
galaxy formation
11 : 0.9380818803990555
the scaling relation
30 : 0.9380818803990555
This scaling relation
35 : 0.9239248730366187
stellar mass
20 : 0.9239248730366187


noun chunks similarity ranking using word embedding similarity

In [5]:
ranking(x_clean, noun_or_sentence = 1, number = 30, similarity_method = 1)

42 : 10459.01311373482
the scatter
24 : 3574.118291895084
high-quality rotation curve data
40 : 3284.1414626658334
rotation curve shapes
15 : 3087.8408540534215
isolated rotationally-supported disk galaxies
31 : 2874.8996401229197
disks
49 : 2506.3845655004106
galaxy formation simulations
54 : 2379.082466527472
regularly rotating extended HI disks
29 : 2277.3866503191457
these observed galaxies
48 : 2028.8996218860914
massive disk galaxies
21 : 2028.8996218860914
the most massive disk galaxies
55 : 2013.5238200551723
high masses
7 : 1957.0152209154837
galaxy formation
8 : 1873.0308153518727
the nature
51 : 1371.0511289210433
neutral atomic gas
20 : 1342.466402828283
stellar mass
35 : 1342.4664028282828
stellar mass
0 : 493.07162321111446
Observed scaling relations
25 : 392.60162028606555
isolated disk galaxies
37 : 381.91049721769724
magnitude
34 : 381.91049721769724
magnitude
5 : 278.5665286230951
light
39 : 74.09439557370426
the diversity
18 : 60.34346830567828
the-art
22 : 0.1500000

  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)


sentence ranking using co-occurance similarity matrix

In [6]:
ranking(x_clean, noun_or_sentence = 0, number = 5, similarity_method = 0)

2 : 1.327454290436561
We first show that state-of-the-art galaxy formation simulations predict that the HI-to-dark halo mass ratio decreases with stellar mass for the most massive disk galaxies.
3 : 1.319636058124042
We then infer dark matter halo masses from high-quality rotation curve data for isolated disk galaxies in the local Universe, and report on the actual universality of the HI-to-dark halo mass ratio for these observed galaxies.
1 : 1.1370994929413936
Here, we study the scaling relation between the neutral hydrogen (HI) and dark matter mass in isolated rotationally-supported disk galaxies at low redshift.
6 : 1.0890211350683119
This finding extends the previously reported discrepancy for the stellar-to-halo mass relation of massive disk galaxies within galaxy formation simulations to the realm of neutral atomic gas.
0 : 0.9364789389550399
Observed scaling relations in galaxies between baryons and dark matter global properties are key to shed light on the process of galaxy fo

sentence ranking using word2vec similarity matrix

In [7]:
ranking(x_clean, noun_or_sentence = 0, number = 5, similarity_method = 1)

3 : 1.2363638153252827
We then infer dark matter halo masses from high-quality rotation curve data for isolated disk galaxies in the local Universe, and report on the actual universality of the HI-to-dark halo mass ratio for these observed galaxies.
2 : 1.1379744915731522
We first show that state-of-the-art galaxy formation simulations predict that the HI-to-dark halo mass ratio decreases with stellar mass for the most massive disk galaxies.
6 : 1.0504689067331228
This finding extends the previously reported discrepancy for the stellar-to-halo mass relation of massive disk galaxies within galaxy formation simulations to the realm of neutral atomic gas.
0 : 1.0348830550365087
Observed scaling relations in galaxies between baryons and dark matter global properties are key to shed light on the process of galaxy formation and on the nature of dark matter.
1 : 1.0135659525968403
Here, we study the scaling relation between the neutral hydrogen (HI) and dark matter mass in isolated rotational