### Introduction 
This is a text summarization implementation in Python. 
I am writing a function, text_summary, that extracts a summary from an article.

I am following a video by Matt Gallivan.
Source: https://www.youtube.com/watch?v=wM2RBih6Cm0

The three text files used are:
1. https://s3-us-west-1.amazonaws.com/lehmanlife/apple-cook-letter-fbi.txt
2. https://s3-us-west-1.amazonaws.com/lehmanlife/public_int_test.txt
3. https://s3-us-west-1.amazonaws.com/lehmanlife/nyt_rich_donor.txt


### What does it do?
This script tries to extract a summary from an article. Here are the five sentences it returns when applied to Tim Cook's open letter on the FBI's demand.

1. A Dangerous Precedent Rather than asking for legislative action through Congress, the FBI is proposing an unprecedented use of the All Writs Act of 1789 to justify an expansion of its authority.
2. All that information needs to be protected from hackers and criminals who want to access it, steal it, and use it without our knowledge or permission.
3. And ultimately, we fear that this demand would undermine the very freedoms and liberty our government is meant to protect.
4. And while the government may argue that its use would be limited to this case, there is no way to guarantee such control.
5. Apple complies with valid subpoenas and search warrants, as we have in the San Bernardino case.



### Setup 

In [90]:
%%sh
pip install nltk



In [91]:
# nltk.download('punkt')

from nltk.tokenize import sent_tokenize
import nltk
import re
from math import log


### Read text file
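A quick review of how to read a text file. The file is only read here, so mode 'r' would be enough ('rw' is not one of the documented mode strings).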

In [92]:
text = open('public_int_test.txt', 'rw')

print text


<open file 'public_int_test.txt', mode 'rw' at 0x7fa920847780>


In [104]:
text_string = text.read()
print type(text_string)

<type 'str'>



Looks like I was able to read in the text file fine.


## Main Part
There are three components to producing a text summary in Python:

### 1. Break Text into pieces by each sentence 

In [94]:
def sentences(text):
    return sent_tokenize(text)
sent = sentences(text_string.decode('utf-8'))

#### Encoding matters
I had trouble running sent_tokenize because of a decoding issue: the file contents had to be decoded as UTF-8 first.
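One way to avoid the separate decode step is to let the file object do the decoding while reading. The sketch below uses `io.open`, which accepts an `encoding` argument on both Python 2 and Python 3; the helper name `read_text` is hypothetical and not part of the notebook.

```python
import io

def read_text(path, encoding='utf-8'):
    """Read a whole file and return decoded (unicode) text.

    Hypothetical helper for illustration only.
    """
    # io.open decodes while reading, so no later .decode() call is needed,
    # and the with-block closes the file once the read is done.
    with io.open(path, 'r', encoding=encoding) as handle:
        return handle.read()
```

With a helper like this, `sentences(read_text('public_int_test.txt'))` would receive already-decoded text.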

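The sent object is a list of sentences. sent_tokenize uses NLTK's pre-trained Punkt model, which splits at sentence-ending punctuation (period, question mark, exclamation mark) while trying not to break at abbreviations such as "Sept." or "p.m.".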

In [95]:
print type(sent)
print sent[0]
print "-"*10
print sent[1]

<type 'list'>
Democrats seize super PAC crown
Liberal groups outraise conservative counterparts 2-to-1 during first half of 2013

Correction, Sept. 6, 2013, 4:28 p.m.: The historical comparison of receipts for conservative and liberal super PACs has been amended since this story first published after an error was discovered in the data provided to the Center by the Federal Election Commission.
----------
Democrats have become the kings of super PACs.


One problem with sent_tokenize is that it does not treat the headline and subheadline as separate sentences, most likely because they do not end with punctuation. This may become a concern later on. 
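A possible workaround, sketched here under assumptions, is to split the text into lines first and tokenize each line on its own, so that a headline on its own line stays a separate piece. This only makes sense if the file is not hard-wrapped mid-sentence. The names `naive_sentences` and `split_lines_first` are hypothetical, and the regex splitter is a rough stand-in so the example runs without NLTK; in the notebook, `sent_tokenize` could be passed as `tokenize` instead.

```python
import re

def naive_sentences(paragraph):
    """Rough stand-in for sent_tokenize: split after . ! or ? followed by whitespace."""
    return [s for s in re.split(r'(?<=[.!?])\s+', paragraph.strip()) if s]

def split_lines_first(text, tokenize=naive_sentences):
    """Tokenize each non-empty line separately so headlines stay separate pieces."""
    pieces = []
    for line in text.splitlines():
        line = line.strip()
        if line:  # skip blank lines between paragraphs
            pieces.extend(tokenize(line))
    return pieces

# The opening lines of public_int_test.txt, as printed above
sample = (u"Democrats seize super PAC crown\n"
          u"Liberal groups outraise conservative counterparts 2-to-1 "
          u"during first half of 2013\n"
          u"\n"
          u"Democrats have become the kings of super PACs.")
pieces = split_lines_first(sample)
# pieces now holds three items: headline, subheadline, first sentence
```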


### 2. Connect these pieces in a graph

We will use the idea of a network. Our goal is to construct a directed graph in which each sentence is a node. The weight of each link is determined by the similarity of the two sentences it connects: the number of words they have in common, normalized by the lengths of the sentences. 
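To make the normalization concrete, here is a small worked example. It assumes the TextRank-style denominator, log(len1) + log(len2), and whitespace tokenization; the name `similarity_sketch` and the guards for empty or one-word inputs are additions for illustration.

```python
from math import log

def similarity_sketch(s1, s2):
    """Shared-word count divided by log(len1) + log(len2) (illustrative sketch)."""
    words1 = s1.lower().split()
    words2 = s2.lower().split()
    if not words1 or not words2:
        return 0.0  # an empty piece has nothing in common with anything
    denominator = log(len(words1)) + log(len(words2))
    if denominator == 0:
        return 0.0  # two one-word pieces: log(1) + log(1) == 0
    shared = set(words1) & set(words2)
    return len(shared) / denominator

headline = "Democrats seize super PAC crown"                  # 5 words
sentence = "Democrats have become the kings of super PACs."   # 8 words
# Shared words after lowercasing: "democrats" and "super"
# ("pac" and "pacs." do not match), so the score is 2 / (log(5) + log(8)).
score = similarity_sketch(headline, sentence)
```

Punctuation stays attached to words under whitespace splitting, which is why "PACs." does not match "PAC"; stripping punctuation first would be one possible refinement.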

In [96]:
def common_words(s1, s2):
    '''Return the set of words shared by two sentences.

    Reference: http://stackoverflow.com/questions/16351744/finding-the-common-words-between-two-text-corpus-in-nltk
    '''
    words1 = s1.lower().split()
    words2 = s2.lower().split()
    
    intersection = set(words1) & set(words2)
    return intersection

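def similarity(s1, s2):
    '''return the amount of similarity between two sentences/pieces'''
    words1 = s1.lower().split()
    words2 = s2.lower().split()

    # Shared-word count normalized by the sum of the logs of the two
    # sentence lengths; each log() call closes before the addition
    return (len(common_words(s1, s2)) /
            (log(len(words1)) + log(len(words2))))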

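def link(nodes):
    '''Return a list of edges connecting the nodes, where the edges are given a 
    weight based on their similarity'''
    # Compare by value (!=) rather than identity (is not), so two sentences
    # with identical text never produce a self-loop
    return [(start, end, similarity(start, end))
            for start in nodes
            for end in nodes
            if start != end]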

edge_link = link(sent)
edge_link[0]

(u'Democrats seize super PAC crown\nLiberal groups outraise conservative counterparts 2-to-1 during first half of 2013\n\nCorrection, Sept. 6, 2013, 4:28 p.m.: The historical comparison of receipts for conservative and liberal super PACs has been amended since this story first published after an error was discovered in the data provided to the Center by the Federal Election Commission.',
 u'Democrats have become the kings of super PACs.',
 0.9766417300203694)

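Looks good: each edge is a (start, end, weight) tuple. The printed weight 0.9766 equals 4 / log(58 + log(8)): the two pieces share four words (democrats, the, of, super) and contain 58 and 8 words, with the second log nested inside the first. Normalizing by the sum of the two logs, log(58) + log(8), would give about 0.65 instead.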

### 3. Rank importance for these links
We will use the PageRank algorithm to find the most important nodes, implemented with the *networkx* library. 
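To give a feel for what PageRank does with weighted edges, here is a minimal power-iteration sketch in plain Python. It is not networkx's implementation: the name `pagerank_sketch`, the iteration count, and the tolerance are assumptions for illustration, and the damping factor 0.85 is chosen to match the networkx default for `alpha`.

```python
def pagerank_sketch(nodes, edges, damping=0.85, iterations=100, tol=1e-9):
    """Weighted PageRank by power iteration (illustrative, not networkx's code)."""
    nodes = list(nodes)
    n = len(nodes)
    if n == 0:
        return {}
    # Total outgoing weight per node, used to normalize its outgoing edges
    out_weight = dict((node, 0.0) for node in nodes)
    for start, end, weight in edges:
        out_weight[start] += weight
    scores = dict((node, 1.0 / n) for node in nodes)
    for _ in range(iterations):
        # Score held by nodes without outgoing weight is spread over all nodes
        dangling = sum(scores[node] for node in nodes if out_weight[node] == 0)
        base = (1.0 - damping) / n + damping * dangling / n
        new_scores = dict((node, base) for node in nodes)
        # Each node passes its score along its edges, in proportion to weight
        for start, end, weight in edges:
            if out_weight[start] > 0:
                new_scores[end] += damping * scores[start] * weight / out_weight[start]
        change = sum(abs(new_scores[node] - scores[node]) for node in nodes)
        scores = new_scores
        if change < tol:
            break
    return scores

# Toy graph: "a" receives links from both "b" and "c", so it ranks highest
toy_edges = [("a", "b", 1.0), ("b", "a", 1.0), ("c", "a", 1.0)]
toy_scores = pagerank_sketch(["a", "b", "c"], toy_edges)
```

The scores always sum to 1, so they can be read as the share of "attention" each sentence receives from the others.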

In [97]:
%%sh
pip install networkx



In [98]:
import networkx as nx

def rank(nodes, edges):
    '''return a dictionary containing the scores for each node'''
    graph = nx.DiGraph()
    graph.add_nodes_from(nodes)
    graph.add_weighted_edges_from(edges)
    return nx.pagerank(graph)

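def summarize(text, num_summaries = 3):
    '''Create small summaries of a larger text.'''
    nodes = sentences(text)
    edges = link(nodes)
    scores = rank(nodes, edges)
    # Order sentences by PageRank score, highest first; sorted(scores)
    # alone would order the sentence strings alphabetically
    return sorted(scores, key=scores.get, reverse=True)[:num_summaries]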
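When picking the top sentences out of a dictionary of scores, the sort has to use the scores as its key, because sorting a dictionary directly orders its keys. The illustration below uses made-up sentences and scores; `top_sentences` is a hypothetical helper name.

```python
def top_sentences(scores, count=3):
    """Return the `count` highest-scoring keys, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:count]

# Hypothetical scores, invented purely for illustration
example_scores = {
    u"Apple complies with valid subpoenas.": 0.10,
    u"We fear this demand would undermine liberty.": 0.35,
    u"All that information needs to be protected.": 0.25,
    u"The FBI is proposing an unprecedented use of the All Writs Act.": 0.30,
}

by_score = top_sentences(example_scores, 2)   # ranked by score
alphabetical = sorted(example_scores)[:2]     # ranked by sentence text only
```

Here `by_score` starts with the 0.35 sentence, while `alphabetical` starts with "All that information..." regardless of its score.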


In [99]:
print summarize(text_string.decode('utf-8'), 5)

[u'(It spent $1.4 million on advertisements opposing Republican Gabriel Gomez who ultimately lost to Democrat Ed Markey by 10 percentage points.)', u'1 donor, giving $350,000 during the first half of the year.', u'Additionally, New York City Mayor Michael Bloomberg, an independent, and the Texas-based law firm of Democratic mega-donors Steve and Amber Mostyn, each donated $250,000 to Americans for Responsible Solutions.', u'Ahead of the 2012 election, Benioff bundled more than $500,000 for Obama\u2019s re-election efforts, and this spring, he visited the White House three times, records show.', u'All the while, the late GOP mega-donor Bob Perry \u2014 the Houston homebuilder who died in April \u2014 continued to fuel conservative super PACs at the dawn of 2013.']


### Results:
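The summaries don't seem very informative for understanding the article. Note that the five sentences come back in alphabetical order (they begin with "(It", "1 donor", "Additionally", "Ahead" and "All"), which is what sorting a dictionary of scores by its keys produces, so this output reflects the sentence text rather than the PageRank scores. For newspaper articles, focusing on the headline and subheadline might also work better than this method. 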

### Wrap in a function
Input: a text file
Output: a list of summary sentences

In [100]:
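def text_summary(txt_file, encode = 'utf-8', number = 3):
    '''Read a text file and return its `number` summary sentences.'''
    # The file is only read, so open it in 'r' mode; the with-block
    # closes it once the contents have been read
    with open(txt_file, 'r') as text:
        text_string = text.read().decode(encode)

    return summarize(text_string, number)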
    

In [101]:
text_summary('public_int_test.txt')

[u'(It spent $1.4 million on advertisements opposing Republican Gabriel Gomez who ultimately lost to Democrat Ed Markey by 10 percentage points.)',
 u'1 donor, giving $350,000 during the first half of the year.',
 u'Additionally, New York City Mayor Michael Bloomberg, an independent, and the Texas-based law firm of Democratic mega-donors Steve and Amber Mostyn, each donated $250,000 to Americans for Responsible Solutions.']

### Try out different articles

In [102]:
text_summary('nyt_rich_donor.txt', number = 5)

[u'A number of the families are tied to networks of ideological donors who, on the left and the right alike, have sought to fundamentally reshape their own political parties.',
 u'According to the Pew Research Center, nearly seven in 10 favor preserving Social Security and Medicare benefits as they are.',
 u'Across a sprawling country, they reside in an archipelago of wealth, exclusive neighborhoods dotting a handful of cities and towns.',
 u'And in an economy that has minted billionaires in a dizzying array of industries, most made their fortunes in just two: finance and energy.',
 u'And while the shale boom has generated new fortunes, it has also produced a glut of oil that is now driving down prices.']

This one is not too bad. Better than the previous one. 

In [119]:
apple = text_summary('apple-cook-letter-fbi.txt', number = 5)


In [120]:
for sen in apple: 
    print sen
    print "-"*5


A Dangerous Precedent
Rather than asking for legislative action through Congress, the FBI is proposing an unprecedented use of the All Writs Act of 1789 to justify an expansion of its authority.
-----
All that information needs to be protected from hackers and criminals who want to access it, steal it, and use it without our knowledge or permission.
-----
And ultimately, we fear that this demand would undermine the very freedoms and liberty our government is meant to protect.
-----
And while the government may argue that its use would be limited to this case, there is no way to guarantee such control.
-----
Apple complies with valid subpoenas and search warrants, as we have in the San Bernardino case.
-----


[u'A Dangerous Precedent\nRather than asking for legislative action through Congress, the FBI is proposing an unprecedented use of the All Writs Act of 1789 to justify an expansion of its authority.',
 u'All that information needs to be protected from hackers and criminals who want to access it, steal it, and use it without our knowledge or permission.',
 u'And ultimately, we fear that this demand would undermine the very freedoms and liberty our government is meant to protect.',
 u'And while the government may argue that its use would be limited to this case, there is no way to guarantee such control.',
 u'Answers to your questions about privacy and security\n\nThe Need for Encryption\nSmartphones, led by iPhone, have become an essential part of our lives.']