This notebook reads a cleaned corpus text file and returns its most frequent words and its highest-ranked sentences, which together serve as a summary of the corpus. The approach follows the TextRank algorithm described in the following paper by R. Mihalcea and P. Tarau:
https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf

Additional reference implementation of the paper's algorithm:
https://github.com/davidadamojr/TextRank/blob/master/textrank/__init__.py


In [1]:
#import packages

import nltk, string
import numpy as np
import pandas as pd
import itertools
import textrank
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
#read in file

with open('exercise_benefits_corpus_cleaned.txt', 'r') as f:
    corpus = f.read()

corpus = corpus.decode('utf-8').strip()

### Preprocessing:
1. remove extra spaces, tabs, and returns
2. stemming, lemmatisation, POS tagging

In [15]:
from nltk.stem import PorterStemmer, WordNetLemmatizer
import string, re

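def processed(text):
    # collapse newlines, tabs, and repeated spaces into single spaces
    text = ' '.join(text.split()).lower()
    # note: lemmatize() expects a single word, so calling it on the whole
    # text leaves the text unchanged
    lemmatiser = WordNetLemmatizer()
    lem_text = lemmatiser.lemmatize(text)
    # strip ASCII punctuation
    regex = re.compile('[%s]' % re.escape(string.punctuation))
    lem_text = regex.sub('', lem_text)
    return lem_text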
    
clean = processed(corpus)

I compared stemming with lemmatizing and found that, for this particular corpus, the two methods yielded very similar results. I chose lemmatizing because it returns real dictionary words, whereas stemming sometimes produces truncated non-words.

In [23]:
#tag text with POS (part of speech) & tokenize

from nltk import pos_tag
from nltk.tokenize import sent_tokenize, word_tokenize

def tag_tokenize(processed_text):
    tokens = word_tokenize(processed_text)  # generate list of tokens
    tagged = pos_tag(tokens)  # list of (word, POS tag) tuples
    return tagged

tagged = tag_tokenize(clean)

In [24]:
#remove stopwords & filter for tags:
from nltk.corpus import stopwords

def filter_for_tags(tagged, tags=['NN', 'JJ', 'NNP']):
    #filter based on POS tags
    tagged = [item for item in tagged if item[1] in tags]
    return tagged

def filter_nostp(tagged_text):
    filtered = filter_for_tags(tagged_text)
    #filtered = re.sub(u"\u2019", "", filtered)
    stp = stopwords.words("english")
    add1 = ["thats", "says", "theres", "its", "whats", "wheres", "even", "also", "may", "might", "think", "believe", "study", "dr", "university"]
    add = [unicode(i, "utf-8") for i in add1]
    stop = stp + add
    no_stp = [w[0] for w in filtered if w[0] not in stop]
    return no_stp

#will call the filter_nostp() function later in word frequency count

tagged2 = filter_for_tags(tagged)

### Get most common words in corpus:

In [25]:
from collections import Counter

count = Counter(filter_nostp(tagged2))
top30 = count.most_common(30)
for i in top30:
    print i[0]

exercise
’
new
health
training
many
“
”
time
fat
brain
group
activity
week
physical
interval
immune
sedentary
workout
weight
research
blood
mice
fitness
age
memory
heart
highintensity
science
intense
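
The list above still contains the curly quotation marks ’, “ and ”. `string.punctuation` covers ASCII punctuation only, so the regex in `processed()` leaves them in place. A minimal sketch of one possible fix, assuming Python 3 strings (or `unicode` objects in Python 2), is to drop every character whose Unicode general category starts with `P`. The helper name `strip_unicode_punctuation` and the sample string are hypothetical and not part of the original notebook.

```python
import unicodedata


def strip_unicode_punctuation(text):
    """Remove every character whose Unicode general category is punctuation (P*)."""
    return ''.join(ch for ch in text
                   if not unicodedata.category(ch).startswith('P'))


# Made-up sample containing a curly apostrophe, curly quotes, a hyphen, and a period.
sample = u"it\u2019s a \u201chigh-intensity\u201d workout."
cleaned = strip_unicode_punctuation(sample)
```

Unlike `string.punctuation`, this leaves symbols such as `$` and `+` untouched (they belong to the Unicode symbol categories), so it complements the ASCII regex rather than replacing it.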


### TextRank Algorithm:

In [27]:
#tokenize sentences
sentences = sent_tokenize(corpus)

def normalize(tagged):
    """Return a list of tuples with the first item's periods removed."""
    tagged = [(item[0].replace('.', ''), item[1]) for item in tagged]
    return tagged

def unique_everseen(iterable, key=None):
    #List unique elements in order of appearance.
    seen = set()
    seen_add = seen.add
    if key is None:
        for element in iterable:
            # check membership as we go so that later duplicates are skipped
            if element not in seen:
                seen_add(element)
                yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen_add(k)
                yield element

In [28]:
#calculate cosine similarity:
import re, math
import itertools
import networkx as nx

WORD = re.compile(r'\w+')

def get_cosine(vec1, vec2):
    # dot product over the words the two vectors share
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])

    # product of the two vector norms
    sum1 = sum([vec1[x]**2 for x in vec1.keys()])
    sum2 = sum([vec2[x]**2 for x in vec2.keys()])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator

def text_to_vector(text):
    # bag-of-words counts (Counter is imported in the word-frequency cell above)
    words = WORD.findall(text)
    return Counter(words)


def build_graph(nodes):
    """Return a networkx graph instance.
    :param nodes: List of hashables that represent the nodes of a graph.
    """
    gr = nx.Graph()  # initialize an undirected graph
    gr.add_nodes_from(nodes)
    nodePairs = list(itertools.combinations(nodes, 2))

    # add edges to the graph (weighted by cosine similarity)
    for pair in nodePairs:
        firstString = pair[0]
        vector1 = text_to_vector(firstString)
        secondString = pair[1]
        vector2 = text_to_vector(secondString)
        pairwise_sim = get_cosine(vector1, vector2)
        gr.add_edge(firstString, secondString, weight=pairwise_sim)

    return gr

build_graph(sentences)

<networkx.classes.graph.Graph at 0x1a23c39550>
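
Each edge weight in the graph above is the cosine similarity between the word-count vectors of two sentences: the dot product of the counts divided by the product of the two vector norms. The small worked example below shows the arithmetic on two made-up sentences (not taken from the corpus). `cosine_similarity` is an illustrative restatement written for this example, not a replacement for `get_cosine`.

```python
import math
import re
from collections import Counter

WORD = re.compile(r'\w+')


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the bag-of-words count vectors of two strings."""
    vec_a = Counter(WORD.findall(text_a))
    vec_b = Counter(WORD.findall(text_b))
    shared = set(vec_a) & set(vec_b)
    dot = sum(vec_a[w] * vec_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up sentences: they share "exercise" and "the", 2 of the 4 words in each.
sim = cosine_similarity("exercise helps the brain", "the heart needs exercise")
# dot = 2, each norm = sqrt(4) = 2, so sim = 2 / (2 * 2) = 0.5
```

Because the `\w+` match is case-sensitive and keeps stopwords, two sentences that share many function words will score as similar even if their content words differ.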

### Return key phrases from corpus:

In [32]:
unique_word_set = unique_everseen([x[0] for x in tagged2])
word_set_list = list(unique_word_set)

In [33]:
#calculate score of each sentence

calculated_page_rank = nx.pagerank(build_graph(sentences), weight='weight')
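
`nx.pagerank` scores each sentence by how strongly it is connected to other well-connected sentences. The idea can be sketched without networkx as a power iteration over a weighted undirected graph. This is a simplified illustration, not networkx's implementation: the function name `weighted_pagerank`, the damping factor of 0.85, the fixed iteration count, and the toy weights are all assumptions made for the example.

```python
def weighted_pagerank(weights, damping=0.85, iterations=100):
    """Power-iteration PageRank on an undirected weighted graph.

    weights maps frozenset({u, v}) -> edge weight. Every node is assumed
    to have at least one edge with positive weight, and no self-loops.
    """
    nodes = sorted({n for edge in weights for n in edge})
    # Symmetric adjacency: neighbours[u][v] = weight of edge u-v.
    neighbours = {n: {} for n in nodes}
    for edge, w in weights.items():
        u, v = tuple(edge)
        neighbours[u][v] = w
        neighbours[v][u] = w
    # Total edge weight attached to each node.
    strength = {n: sum(neighbours[n].values()) for n in nodes}

    scores = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_scores = {}
        for n in nodes:
            # Each neighbour m passes on a share of its score proportional
            # to the weight of edge m-n relative to m's total edge weight.
            incoming = sum(scores[m] * w / strength[m]
                           for m, w in neighbours[n].items())
            new_scores[n] = (1.0 - damping) / len(nodes) + damping * incoming
        scores = new_scores
    return scores


# Toy graph with hypothetical similarity weights between three sentences.
toy = {
    frozenset(["s1", "s2"]): 0.9,
    frozenset(["s2", "s3"]): 0.8,
    frozenset(["s1", "s3"]): 0.1,
}
ranks = weighted_pagerank(toy)
ranking = sorted(ranks, key=ranks.get, reverse=True)
```

In this toy graph `s2` ranks first because it carries the two heaviest edges. The scores sum to 1, so they can be read as relative importance.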

In [43]:
# sentences in descending order of importance (highest PageRank score first)
keyphrases = sorted(calculated_page_rank, key=calculated_page_rank.get,
                        reverse=True)

# the number of sentences returned will be relative to the size of the
# text (a third of the number of vertices, i.e. sentences, in the graph)
one_third = len(keyphrases) // 3
keyphrases = keyphrases[0:one_third + 1]

In [44]:
# keep the top 19 sentences, encoded as UTF-8 for printing and writing to file
res = [sentence.encode('utf-8') for sentence in keyphrases[:19]]


In [45]:
#display neater output:

for i in res:
    print i

She and her colleagues did find increases in the activity of certain genes and the levels of some proteins in the brains of the runners that could have contributed to the changes in their synapses, she says.
These volunteers, who had been sedentary and overweight, were told they would be taking part in an exercise program to get them ready to complete a 5K race, and that the study would examine some of the effects of the training, including psychological impacts.
When scientists in Sweden scanned the spines of mice before and after they ran for several weeks on treadmills, the researchers noticed significant increases in the size of their spinal discs, indicating that those structures had been responding and adapting to the demands of running.
For their inaugural study of the riders , which was published in 2014, the scientists measured a broad range of the cyclists’ physical and cognitive abilities and compared them to those of sedentary older people and much younger men and women.
In

In [37]:
# the with-block closes the file automatically
with open('exercise_benefits_summary.txt', 'w') as textfile:
    for item in res:
        textfile.write("%s\n" % item)

#note: in the text file the punctuation displays normally (no stray symbols) because the sentences were encoded as UTF-8.