In [1]:
import numpy as np
import pandas as pd
import scipy
import matplotlib.pyplot as plt
import networkx as nx
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
%matplotlib inline

# Summarizing Text

Let's try out extractive summarization using the first four paragraphs of [The Great Gatsby](http://gutenberg.net.au/ebooks02/0200041h.html).

First, we'll try to extract the most representative sentence.  Then, we'll extract keywords.

## Sentence extraction

The steps of our sentence extraction process:

1. Parse and tokenize the text using spaCy, and divide into sentences.
2. Calculate the tf-idf matrix.
3. Calculate similarity scores.
4. Calculate TextRank: We're going to use the `networkx` package to run the TextRank algorithm.

Let's get started!


In [2]:
# Importing the text the lazy way.
gatsby="In my younger and more vulnerable years my father gave me some advice that Ive been turning over in my mind ever since. \"Whenever you feel like criticizing any one,\" he told me, \"just remember that all the people in this world havent had the advantages that youve had.\" He didn't say any more but weve always been unusually communicative in a reserved way, and I understood that he meant a great deal more than that. In consequence Im inclined to reserve all judgments, a habit that has opened up many curious natures to me and also made me the victim of not a few veteran bores. The abnormal mind is quick to detect and attach itself to this quality when it appears in a normal person, and so it came about that in college I was unjustly accused of being a politician, because I was privy to the secret griefs of wild, unknown men. Most of the confidences were unsought--frequently I have feigned sleep, preoccupation, or a hostile levity when I realized by some unmistakable sign that an intimate revelation was quivering on the horizon--for the intimate revelations of young men or at least the terms in which they express them are usually plagiaristic and marred by obvious suppressions. Reserving judgments is a matter of infinite hope. I am still a little afraid of missing something if I forget that, as my father snobbishly suggested, and I snobbishly repeat a sense of the fundamental decencies is parcelled out unequally at birth. And, after boasting this way of my tolerance, I come to the admission that it has a limit. Conduct may be founded on the hard rock or the wet marshes but after a certain point I dont care what its founded on. When I came back from the East last autumn I felt that I wanted the world to be in uniform and at a sort of moral attention forever; I wanted no more riotous excursions with privileged glimpses into the human heart. 
Only Gatsby, the man who gives his name to this book, was exempt from my reaction--Gatsby who represented everything for which I have an unaffected scorn. If personality is an unbroken series of successful gestures, then there was something gorgeous about him, some heightened sensitivity to the promises of life, as if he were related to one of those intricate machines that register earthquakes ten thousand miles away. This responsiveness had nothing to do with that flabby impressionability which is dignified under the name of the \"creative temperament\"--it was an extraordinary gift for hope, a romantic readiness such as I have never found in any other person and which it is not likely I shall ever find again. No--Gatsby turned out all right at the end; it is what preyed on Gatsby, what foul dust floated in the wake of his dreams that temporarily closed out my interest in the abortive sorrows and short-winded elations of men."

# We want to use the standard english-language parser.
# Note: install the model first with "python -m spacy download en_core_web_sm",
# then load 'en_core_web_sm' by name (the 'en' shortcut no longer works in spaCy 3).
parser = spacy.load('en_core_web_sm')

# Parsing Gatsby.
gatsby = parser(gatsby)

# Dividing the text into sentences and storing them as a list of strings.
sentences=[]
for span in gatsby.sents:
    # span.text is the sentence's original text, including the whitespace between tokens
    # (Token.string, used in older spaCy code, was removed in spaCy 3)
    sent = span.text.strip()
    sentences.append(sent)

# Creating the tf-idf matrix.
counter = TfidfVectorizer(lowercase=False, 
                          stop_words=None,
                          ngram_range=(1, 1), 
                          analyzer=u'word', 
                          max_df=.5, 
                          min_df=1,
                          max_features=None, 
                          vocabulary=None, 
                          binary=False)

#Applying the vectorizer
data_counts=counter.fit_transform(sentences)

In [3]:
print(sentences)

['In my younger and more vulnerable years my father gave me some advice that Ive been turning over in my mind ever since.', '"Whenever you feel like criticizing any one," he told me, "just remember that all the people in this world havent had the advantages that youve had."', "He didn't say any more but weve always been unusually communicative in a reserved way, and I understood that he meant a great deal more than that.", 'In consequence Im inclined to reserve all judgments, a habit that has opened up many curious natures to me and also made me the victim of not a few veteran bores.', 'The abnormal mind is quick to detect and attach itself to this quality when it appears in a normal person, and so it came about that in college', 'I was unjustly accused of being a politician, because I was privy to the secret griefs of wild, unknown men.', 'Most of the confidences were unsought--frequently I have feigned sleep, preoccupation, or a hostile levity when I realized by some unmistakable sig

In [5]:

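# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
features = counter.get_feature_names_out().tolist()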


In [6]:
print(features)

['And', 'Conduct', 'East', 'Gatsby', 'He', 'If', 'Im', 'In', 'Ive', 'Most', 'No', 'Only', 'Reserving', 'The', 'This', 'When', 'Whenever', 'abnormal', 'abortive', 'about', 'accused', 'admission', 'advantages', 'advice', 'afraid', 'after', 'again', 'all', 'also', 'always', 'am', 'an', 'any', 'appears', 'are', 'as', 'at', 'attach', 'attention', 'autumn', 'away', 'back', 'be', 'because', 'been', 'being', 'birth', 'boasting', 'book', 'bores', 'but', 'by', 'came', 'care', 'certain', 'closed', 'college', 'come', 'communicative', 'confidences', 'consequence', 'creative', 'criticizing', 'curious', 'deal', 'decencies', 'detect', 'didn', 'dignified', 'do', 'dont', 'dreams', 'dust', 'earthquakes', 'elations', 'end', 'ever', 'everything', 'excursions', 'exempt', 'express', 'extraordinary', 'father', 'feel', 'feigned', 'felt', 'few', 'find', 'flabby', 'floated', 'for', 'forever', 'forget', 'foul', 'found', 'founded', 'frequently', 'from', 'fundamental', 'gave', 'gestures', 'gift', 'gives', 'glimpses

In [7]:
data_counts.shape

(16, 284)

# Similarity

So far, this is all (hopefully) familiar: We've done text parsing and the tf-idf calculation before.  Each sentence is treated as its own document, so we should now have sentences represented as vectors, with each word's score rising with how often it occurs in the sentence and falling with how many sentences in the whole text contain it.
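
To make the weighting concrete, here is a minimal plain-Python sketch on three made-up sentences. It uses the textbook `tf * log(N / df)` weight, which is an assumption for illustration: scikit-learn's `TfidfVectorizer` uses a smoothed variant of the idf term, so its numbers will differ. What carries over is that each row is scaled to unit length, so a dot product between two rows is their cosine similarity. The names `toy_sentences`, `tfidf_vector`, and `cosine` are hypothetical.

```python
import math
from collections import Counter

# Toy "documents": each sentence is treated as its own document.
toy_sentences = [
    "hope is a gift",
    "hope is rare",
    "dust floated in the wake",
]

tokenized = [s.split() for s in toy_sentences]
n_docs = len(tokenized)

# Document frequency: the number of sentences that contain each word.
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tfidf_vector(tokens):
    """Return an L2-normalized {word: weight} dict using tf * log(N / df)."""
    counts = Counter(tokens)
    weights = {w: c * math.log(n_docs / doc_freq[w]) for w, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {w: v / norm for w, v in weights.items()} if norm else weights

vectors = [tfidf_vector(tokens) for tokens in tokenized]

def cosine(u, v):
    """Dot product of two unit-length sparse vectors, i.e. their cosine similarity."""
    return sum(weight * v.get(word, 0.0) for word, weight in u.items())

# Sentences 0 and 1 share "hope" and "is"; sentences 0 and 2 share nothing.
print(cosine(vectors[0], vectors[1]))
print(cosine(vectors[0], vectors[2]))
```

The first pair gets a positive score and the second pair scores zero, which is the kind of sentence-to-sentence similarity the ranking step works from.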

Now let's calculate the similarity scores for the sentences and apply the TextRank algorithm.  Because TextRank is based on Google's PageRank algorithm, the function is called `pagerank`.  The hyperparameters are the damping parameter `alpha` and the convergence parameter `tol`.
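
To show what `alpha` and `tol` control, here is a simplified power-iteration sketch in plain Python. It is not the `networkx` implementation: it assumes a symmetric weight matrix given as a list of lists in which every node has at least one edge, and the names `simple_pagerank` and `toy_similarity` are made up for this illustration.

```python
def simple_pagerank(weights, alpha=0.85, tol=1e-8, max_iter=100):
    """Power-iteration PageRank on a symmetric weight matrix (list of lists)."""
    n = len(weights)
    out_strength = [sum(row) for row in weights]
    ranks = [1.0 / n] * n
    for _ in range(max_iter):
        new_ranks = []
        for j in range(n):
            # Rank flowing into node j from every node i that links to it.
            incoming = sum(
                ranks[i] * weights[i][j] / out_strength[i]
                for i in range(n)
                if out_strength[i] > 0
            )
            # alpha is the share of rank passed along edges; the rest is spread evenly.
            new_ranks.append((1 - alpha) / n + alpha * incoming)
        # tol is the total change between iterations below which we stop.
        change = sum(abs(a - b) for a, b in zip(new_ranks, ranks))
        ranks = new_ranks
        if change < tol:
            break
    return ranks

# Hypothetical similarity scores between three sentences.
toy_similarity = [
    [0.0, 0.6, 0.3],
    [0.6, 0.0, 0.1],
    [0.3, 0.1, 0.0],
]
toy_ranks = simple_pagerank(toy_similarity)
print(toy_ranks)
```

Sentence 0 is the most similar to the others overall, so it ends up with the highest rank. A smaller `tol` means more iterations before stopping; a lower `alpha` pulls all ranks toward the uniform value.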

In [8]:
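# Calculating similarity. TfidfVectorizer L2-normalizes each row by default,
# so this product gives the cosine similarity between every pair of sentences.
similarity = data_counts * data_counts.T

# Identifying the sentence with the highest rank.
# from_scipy_sparse_matrix was removed in networkx 3.0; from_scipy_sparse_array replaces it.
nx_graph = nx.from_scipy_sparse_array(similarity)
ranks=nx.pagerank(nx_graph, alpha=.85, tol=.00000001)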

ranked = sorted(((ranks[i],s) for i,s in enumerate(sentences)),
                reverse=True)
print(ranked[0])


(0.074946631856127058, 'This responsiveness had nothing to do with that flabby impressionability which is dignified under the name of the "creative temperament"--it was an extraordinary gift for hope, a romantic readiness such as I have never found in any other person and which it is not likely I shall ever find again.')


Since a lot of Gatsby is about the narrator acting as the observer of other people's sordid secrets, this seems pretty good.  Now, let's extract some keywords.

# Keyword summarization

1) Parse and tokenize text (already done).  
2) Filter out stopwords, choose only nouns and adjectives.  
3) Calculate the neighbors of words (we'll use a window of 4).  
4) Run TextRank on the neighbor matrix.  
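
Step 3 can be sketched in plain Python before doing it with spaCy tokens and a DataFrame. This is only an illustration under stated assumptions: the token list and the set of kept words are made up, and `cooccurrence` is a hypothetical helper, not part of any library.

```python
from collections import defaultdict

def cooccurrence(tokens, keep, window=4):
    """Count how often each kept word is followed by another kept word within `window` tokens."""
    counts = defaultdict(int)
    for i, word in enumerate(tokens):
        if word not in keep:
            continue
        # Slicing past the end of the list is safe; it just returns fewer items.
        for neighbor in tokens[i + 1 : i + 1 + window]:
            if neighbor in keep:
                counts[(word, neighbor)] += 1
    return dict(counts)

toy_tokens = ["an", "extraordinary", "gift", "for", "hope", "a", "romantic", "readiness"]
toy_keep = {"extraordinary", "gift", "hope", "romantic", "readiness"}
pairs = cooccurrence(toy_tokens, toy_keep, window=4)
print(pairs)
```

Each `(word, neighbor)` count corresponds to one cell of the neighbor matrix that TextRank is then run on.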


In [9]:
# Removing stop words and punctuation, then getting a list of all unique words in the text
gatsby_filt = [word for word in gatsby if word.is_stop==False and (word.pos_=='NOUN' or word.pos_=='ADJ')]
words=set(gatsby_filt)

#Creating a grid indicating whether words are within 4 places of the target word
adjacency=pd.DataFrame(columns=words,index=words,data=0)

#Iterating through each word in the text and indicating which of the unique words are its neighbors
for i,word in enumerate(gatsby):
    # Checking whether the current word is one of the filtered words
    if word in gatsby_filt:
        # The window is the next four tokens; slicing stops at the end of the text on its own.
        end=min(len(gatsby),i+5)
        # The potential neighbors.
        nextwords=gatsby[i+1:end]
        # Filtering the neighbors to select only those in the word list
        neighbors=[x for x in nextwords if x in gatsby_filt]
        # Adding 1 to the adjacency matrix for neighbors of the target word
        if neighbors:
            adjacency.loc[word,neighbors]=adjacency.loc[word,neighbors]+1

print('done!')
        



done!


In [10]:

# Running TextRank
# from_numpy_matrix was removed in networkx 3.0; from_numpy_array replaces it.
nx_words = nx.from_numpy_array(adjacency.values)
ranks=nx.pagerank(nx_words, alpha=.85, tol=.00000001)

# Identifying the five most highly ranked keywords
ranked = sorted(((ranks[i],s) for i,s in enumerate(words)),
                reverse=True)
print(ranked[:5])


[(0.013370948308795436, hope), (0.012223431176324349, promises), (0.012223431176324349, exempt), (0.01214206885054891, glimpses), (0.011895137937387881, intimate)]


These results are less impressive.  'Hope', 'promises', and 'glimpses' certainly fit the elegiac, on-the-outside-looking-in tone of the book, but 'exempt' and 'intimate' are pretty generic.  TextRank may perform better on a larger text sample.

# Drill

It is also possible that keyword phrases will work better.  Modify the keyword extraction code to extract two-word phrases (bigrams) rather than single words.  Then try it with trigrams.  You will probably want to broaden the window that defines 'neighbors.'  Try a few different modifications, and write up your observations in your notebook.  Discuss with your mentor.
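
As a starting point, phrases of any length can be built from a list of already-filtered words with one small helper. This is a minimal sketch: `ngrams` and `toy_words` are hypothetical names used only for illustration.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by single spaces."""
    return [" ".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

toy_words = ["romantic", "readiness", "person", "likely"]
print(ngrams(toy_words, 2))  # two-word phrases
print(ngrams(toy_words, 3))  # three-word phrases
```

Because the helper always starts from the word list, calling it again gives the same result, which avoids accidentally joining strings that were already joined.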

## BIGRAMS ##

In [51]:
# Creating the tf-idf matrix.
counter_bigram = TfidfVectorizer(lowercase=False, 
                                  stop_words=None,
                                  ngram_range=(2, 2), 
                                  analyzer=u'word', 
                                  max_df=.5, 
                                  min_df=1,
                                  max_features=None, 
                                  vocabulary=None, 
                                  binary=False)

#Applying the vectorizer
data_counts_bigram = counter_bigram.fit_transform(sentences)

In [52]:
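# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
print(counter_bigram.get_feature_names_out().tolist())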

['And after', 'Conduct may', 'East last', 'Gatsby the', 'Gatsby turned', 'Gatsby what', 'Gatsby who', 'He didn', 'If personality', 'Im inclined', 'In consequence', 'In my', 'Ive been', 'Most of', 'No Gatsby', 'Only Gatsby', 'Reserving judgments', 'The abnormal', 'This responsiveness', 'When came', 'Whenever you', 'abnormal mind', 'abortive sorrows', 'about him', 'about that', 'accused of', 'admission that', 'advantages that', 'advice that', 'afraid of', 'after boasting', 'after certain', 'all judgments', 'all right', 'all the', 'also made', 'always been', 'am still', 'an extraordinary', 'an intimate', 'an unaffected', 'an unbroken', 'and also', 'and at', 'and attach', 'and marred', 'and more', 'and short', 'and snobbishly', 'and so', 'and understood', 'and which', 'any more', 'any one', 'any other', 'appears in', 'are usually', 'as have', 'as if', 'as my', 'at birth', 'at least', 'at sort', 'at the', 'attach itself', 'attention forever', 'autumn felt', 'back from', 'be founded', 'be in

In [83]:
# Token.string was removed in spaCy 3; Token.text is the token without trailing whitespace.
gatsby_filt = [word.text for word in gatsby if word.is_stop==False and (word.pos_=='NOUN' or word.pos_=='ADJ')]

In [84]:
gatsby_filt

['younger',
 'vulnerable',
 'years',
 'father',
 'advice',
 'mind',
 'people',
 'world',
 'advantages',
 'communicative',
 'reserved',
 'way',
 'great',
 'deal',
 'consequence',
 'inclined',
 'judgments',
 'habit',
 'curious',
 'natures',
 'victim',
 'veteran',
 'bores',
 'abnormal',
 'mind',
 'quick',
 'quality',
 'normal',
 'person',
 'college',
 'politician',
 'privy',
 'secret',
 'griefs',
 'wild',
 'unknown',
 'men',
 'Most',
 'confidences',
 'sleep',
 'preoccupation',
 'hostile',
 'levity',
 'unmistakable',
 'sign',
 'intimate',
 'revelation',
 'horizon',
 'intimate',
 'revelations',
 'young',
 'men',
 'terms',
 'plagiaristic',
 'obvious',
 'suppressions',
 'Reserving',
 'judgments',
 'matter',
 'infinite',
 'hope',
 'little',
 'afraid',
 'father',
 'sense',
 'fundamental',
 'decencies',
 'birth',
 'way',
 'tolerance',
 'admission',
 'limit',
 'Conduct',
 'hard',
 'rock',
 'wet',
 'marshes',
 'certain',
 'point',
 'autumn',
 'world',
 'uniform',
 'sort',
 'moral',
 'attention',
 

In [85]:
bigrams = list(zip(gatsby_filt, gatsby_filt[1:]))
trigrams = list(zip(gatsby_filt, gatsby_filt[1:], gatsby_filt[2:]))


In [86]:
print(trigrams)

[('younger', 'vulnerable', 'years'), ('vulnerable', 'years', 'father'), ('years', 'father', 'advice'), ('father', 'advice', 'mind'), ('advice', 'mind', 'people'), ('mind', 'people', 'world'), ('people', 'world', 'advantages'), ('world', 'advantages', 'communicative'), ('advantages', 'communicative', 'reserved'), ('communicative', 'reserved', 'way'), ('reserved', 'way', 'great'), ('way', 'great', 'deal'), ('great', 'deal', 'consequence'), ('deal', 'consequence', 'inclined'), ('consequence', 'inclined', 'judgments'), ('inclined', 'judgments', 'habit'), ('judgments', 'habit', 'curious'), ('habit', 'curious', 'natures'), ('curious', 'natures', 'victim'), ('natures', 'victim', 'veteran'), ('victim', 'veteran', 'bores'), ('veteran', 'bores', 'abnormal'), ('bores', 'abnormal', 'mind'), ('abnormal', 'mind', 'quick'), ('mind', 'quick', 'quality'), ('quick', 'quality', 'normal'), ('quality', 'normal', 'person'), ('normal', 'person', 'college'), ('person', 'college', 'politician'), ('college', 'p

In [87]:
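# Building the phrases straight from gatsby_filt, so that re-running this cell
# cannot join already-joined strings character by character.
bigrams = [' '.join(tup) for tup in zip(gatsby_filt, gatsby_filt[1:])]
trigrams = [' '.join(tup) for tup in zip(gatsby_filt, gatsby_filt[1:], gatsby_filt[2:])]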

In [88]:
print(trigrams)

['younger vulnerable years', 'vulnerable years father', 'years father advice', 'father advice mind', 'advice mind people', 'mind people world', 'people world advantages', 'world advantages communicative', 'advantages communicative reserved', 'communicative reserved way', 'reserved way great', 'way great deal', 'great deal consequence', 'deal consequence inclined', 'consequence inclined judgments', 'inclined judgments habit', 'judgments habit curious', 'habit curious natures', 'curious natures victim', 'natures victim veteran', 'victim veteran bores', 'veteran bores abnormal', 'bores abnormal mind', 'abnormal mind quick', 'mind quick quality', 'quick quality normal', 'quality normal person', 'normal person college', 'person college politician', 'college politician privy', 'politician privy secret', 'privy secret griefs', 'secret griefs wild', 'griefs wild unknown', 'wild unknown men', 'unknown men Most', 'men Most confidences', 'Most confidences sleep', 'confidences sleep preoccupation'

In [54]:


#Creating a grid indicating whether bigrams are within 4 places of the target bigram
adjacency_bigram=pd.DataFrame(columns=bigrams,index=bigrams,data=0)

#Iterating through each bigram and indicating which of the following bigrams are its neighbors
for i,word in enumerate(bigrams):
    # The window is the next four bigrams; slicing stops at the end of the list on its own.
    end=min(len(bigrams),i+5)
    # Every bigram in the window is a neighbor, so no further filtering is needed.
    neighbors=bigrams[i+1:end]
    # Adding 1 to the adjacency matrix for neighbors of the target bigram
    if neighbors:
        adjacency_bigram.loc[word,neighbors]=adjacency_bigram.loc[word,neighbors]+1

print('done!')
        


done!


In [61]:
# Running TextRank
# from_numpy_matrix was removed in networkx 3.0; from_numpy_array replaces it.
nx_words = nx.from_numpy_array(adjacency_bigram.values)
ranks=nx.pagerank(nx_words, alpha=.85, tol=.00000001)

# Identifying the five most highly ranked keywords
ranked = sorted(((ranks[i],s) for i,s in enumerate(bigrams)),
                reverse=True)
print(ranked[:5])

[(0.008478571504987353, 'reserved way'), (0.008478571504987353, 'person likely'), (0.008396088482342145, 'way great'), (0.008396088482342145, 'readiness person'), (0.008322012406773167, 'great deal')]


In [62]:
# Increasing Neighborhood size to 7

#Creating a grid indicating whether bigrams are within 7 places of the target bigram
adjacency_bigram=pd.DataFrame(columns=bigrams,index=bigrams,data=0)

#Iterating through each bigram and indicating which of the following bigrams are its neighbors
for i,word in enumerate(bigrams):
    # The window is the next seven bigrams; slicing stops at the end of the list on its own.
    end=min(len(bigrams),i+8)
    # Every bigram in the window is a neighbor, so no further filtering is needed.
    neighbors=bigrams[i+1:end]
    # Adding 1 to the adjacency matrix for neighbors of the target bigram
    if neighbors:
        adjacency_bigram.loc[word,neighbors]=adjacency_bigram.loc[word,neighbors]+1

print('done!')
        

done!


In [63]:
# Running TextRank
# from_numpy_matrix was removed in networkx 3.0; from_numpy_array replaces it.
nx_words = nx.from_numpy_array(adjacency_bigram.values)
ranks=nx.pagerank(nx_words, alpha=.85, tol=.00000001)

# Identifying the five most highly ranked keywords
ranked = sorted(((ranks[i],s) for i,s in enumerate(bigrams)),
                reverse=True)
print(ranked[:5])

[(0.00849776192675846, 'foul dust'), (0.008497761926758458, 'world advantages'), (0.00838308473138196, 'end foul'), (0.00838308473138196, 'advantages communicative'), (0.008284113942061676, 'likely end')]


In [66]:
# Increasing Neighborhood size to 10

#Creating a grid indicating whether bigrams are within 10 places of the target bigram
adjacency_bigram=pd.DataFrame(columns=bigrams,index=bigrams,data=0)

#Iterating through each bigram and indicating which of the following bigrams are its neighbors
for i,word in enumerate(bigrams):
    # The window is the next ten bigrams; slicing stops at the end of the list on its own.
    end=min(len(bigrams),i+11)
    # Every bigram in the window is a neighbor, so no further filtering is needed.
    neighbors=bigrams[i+1:end]
    # Adding 1 to the adjacency matrix for neighbors of the target bigram
    if neighbors:
        adjacency_bigram.loc[word,neighbors]=adjacency_bigram.loc[word,neighbors]+1

print('done!')
        

done!


In [67]:
# Running TextRank
# from_numpy_matrix was removed in networkx 3.0; from_numpy_array replaces it.
nx_words = nx.from_numpy_array(adjacency_bigram.values)
ranks=nx.pagerank(nx_words, alpha=.85, tol=.00000001)

# Identifying the five most highly ranked keywords
ranked = sorted(((ranks[i],s) for i,s in enumerate(bigrams)),
                reverse=True)
print(ranked[:5])

[(0.008478571504987353, 'reserved way'), (0.008478571504987353, 'person likely'), (0.008396088482342145, 'way great'), (0.008396088482342145, 'readiness person'), (0.008322012406773167, 'great deal')]


###  Results for Bigrams ###

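* For a neighborhood size of 4, we extract "reserved way", "person likely", "way great", "readiness person" and "great deal".

* For a size of 7, we extract "foul dust", "world advantages", "end foul", "advantages communicative" and "likely end".

* For a neighborhood size of 10, we extract the bigrams "reserved way", "person likely", "way great", "readiness person" and "great deal".

* The bigrams for sizes 4 and 10 are identical, down to the scores.  The trigram runs below give different results for each size, so this repeat may come from a cell run against a stale adjacency matrix (the cell numbers show the notebook was executed out of order); re-running the cells from top to bottom would confirm.  Several of the bigrams, such as "foul dust" and "great deal", echo recognizable phrases from the passage.

* The best examples are "way great" and "great deal".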

## TRIGRAMS ##

In [89]:
#Creating a grid indicating whether trigrams are within 4 places of the target trigram
adjacency_trigram=pd.DataFrame(columns=trigrams,index=trigrams,data=0)

#Iterating through each trigram and indicating which of the following trigrams are its neighbors
for i,word in enumerate(trigrams):
    # The window is the next four trigrams; slicing stops at the end of the list on its own.
    end=min(len(trigrams),i+5)
    # Every trigram in the window is a neighbor, so no further filtering is needed.
    neighbors=trigrams[i+1:end]
    # Adding 1 to the adjacency matrix for neighbors of the target trigram
    if neighbors:
        adjacency_trigram.loc[word,neighbors]=adjacency_trigram.loc[word,neighbors]+1

print('done!')
        


done!


In [90]:
# Running TextRank
# from_numpy_matrix was removed in networkx 3.0; from_numpy_array replaces it.
nx_words = nx.from_numpy_array(adjacency_trigram.values)
ranks=nx.pagerank(nx_words, alpha=.85, tol=.00000001)

# Identifying the five most highly ranked keywords
ranked = sorted(((ranks[i],s) for i,s in enumerate(trigrams)),
                reverse=True)
print(ranked[:5])

[(0.008608224436651105, 'wake dreams interest'), (0.008608224436651105, 'advice mind people'), (0.008418552148932803, 'mind people world'), (0.008418552148932803, 'dust wake dreams'), (0.008268498103417074, 'foul dust wake')]


In [91]:
#Creating a grid indicating whether trigrams are within 7 places of the target trigram
adjacency_trigram=pd.DataFrame(columns=trigrams,index=trigrams,data=0)

#Iterating through each trigram and indicating which of the following trigrams are its neighbors
for i,word in enumerate(trigrams):
    # The window is the next seven trigrams; slicing stops at the end of the list on its own.
    end=min(len(trigrams),i+8)
    # Every trigram in the window is a neighbor, so no further filtering is needed.
    neighbors=trigrams[i+1:end]
    # Adding 1 to the adjacency matrix for neighbors of the target trigram
    if neighbors:
        adjacency_trigram.loc[word,neighbors]=adjacency_trigram.loc[word,neighbors]+1

print('done!')
        


done!


In [92]:
# Running TextRank
# from_numpy_matrix was removed in networkx 3.0; from_numpy_array replaces it.
nx_words = nx.from_numpy_array(adjacency_trigram.values)
ranks=nx.pagerank(nx_words, alpha=.85, tol=.00000001)

# Identifying the five most highly ranked keywords
ranked = sorted(((ranks[i],s) for i,s in enumerate(trigrams)),
                reverse=True)
print(ranked[:5])

[(0.008562630346707433, 'world advantages communicative'), (0.008562630346707432, 'end foul dust'), (0.008447077754373141, 'likely end foul'), (0.008447077754373141, 'advantages communicative reserved'), (0.008347351464644144, 'person likely end')]


In [93]:
#Creating a grid indicating whether trigrams are within 10 places of the target trigram
adjacency_trigram=pd.DataFrame(columns=trigrams,index=trigrams,data=0)

#Iterating through each trigram and indicating which of the following trigrams are its neighbors
for i,word in enumerate(trigrams):
    # The window is the next ten trigrams; slicing stops at the end of the list on its own.
    end=min(len(trigrams),i+11)
    # Every trigram in the window is a neighbor, so no further filtering is needed.
    neighbors=trigrams[i+1:end]
    # Adding 1 to the adjacency matrix for neighbors of the target trigram
    if neighbors:
        adjacency_trigram.loc[word,neighbors]=adjacency_trigram.loc[word,neighbors]+1

print('done!')
        

done!


In [94]:
# Running TextRank
# from_numpy_matrix was removed in networkx 3.0; from_numpy_array replaces it.
nx_words = nx.from_numpy_array(adjacency_trigram.values)
ranks=nx.pagerank(nx_words, alpha=.85, tol=.00000001)

# Identifying the five most highly ranked keywords
ranked = sorted(((ranks[i],s) for i,s in enumerate(trigrams)),
                reverse=True)
print(ranked[:5])

[(0.008543296283550108, 'reserved way great'), (0.008543296283550108, 'readiness person likely'), (0.008460183839056497, 'way great deal'), (0.008460183839056497, 'romantic readiness person'), (0.008385542544465694, 'hope romantic readiness')]


### Trigram Results ###

* We finally see some more interesting phrases, and the top five change with each neighborhood size.

* The best appear to be "reserved way great", "hope romantic readiness", and "way great deal", all from the neighborhood size of 10.