## Text summarization (Sentence Ranking)

#### STEP 1 : Data cleaning (removing non-letter characters, converting to lower case)
#### STEP 2 : Building Sentence Similarity Matrix
#### STEP 3 : Sentence Ranking
#### STEP 4 : Summary Generation

## Initial Phase
### Importing Libraries and Reading Data 

In [1]:
### Importing the necessary libraries

import re

import networkx as nx
import nltk
import numpy as np
import pandas as pd
from nltk.cluster.util import cosine_distance
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize

In [2]:
### Reading the data file

df = pd.read_csv('Data/tennis_articles_v4.csv')
df['article_text']

0    Maria Sharapova has basically no friends as te...
1    BASEL, Switzerland (AP), Roger Federer advance...
2    Roger Federer has revealed that organisers of ...
3    Kei Nishikori will try to end his long losing ...
4    Federer, 37, first broke through on tour over ...
5    Nadal has not played tennis since he was force...
6    Tennis giveth, and tennis taketh away. The end...
7    Federer won the Swiss Indoors last week by bea...
Name: article_text, dtype: object

In [4]:
### Quick check of the regular expression used for cleaning

s = 'he&&&s'
s = re.sub("[^a-zA-Z]", " ", s)  # every non-letter becomes a space: 'he   s'

## STEP 1 : Data Cleaning
### Cleaning sentences, by removing Non Alphabet Characters and converting to Lower Case Letters

In [5]:
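### Cleaning sentences by removing non-alphabet characters and converting to lower case

dict = {}  # maps each cleaned sentence back to its original text (note: shadows the built-in dict)

# Join the articles with a space so the last sentence of one article
# does not run into the first sentence of the next.
s = " ".join(df['article_text'])
s = s.lower()

# sent_tokenize needs the NLTK Punkt tokenizer data to be installed
sentences = sent_tokenize(s)

final = []  # cleaned sentences, in document order

for sentence in sentences:
    temp = re.sub("[^a-zA-Z]", " ", sentence)
    final.append(temp)
    dict[temp] = sentence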
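
As a small check of the cleaning step, the sketch below applies the same regular expression to two made-up sentences and keeps a lookup from each cleaned sentence back to its original text. The helper name `clean_sentence`, the sample sentences, and the extra whitespace collapsing are illustrative assumptions, not part of the notebook.

```python
import re


def clean_sentence(sentence):
    """Replace non-letters with spaces, lower-case, and collapse repeated spaces."""
    letters_only = re.sub("[^a-zA-Z]", " ", sentence)
    return " ".join(letters_only.lower().split())


# Hypothetical sample sentences, used only for illustration.
samples = ["Federer, 37, won the title!", "It's his 99th career title."]

# Map each cleaned sentence back to its original text.
cleaned_to_original = {clean_sentence(s): s for s in samples}

for cleaned, original in cleaned_to_original.items():
    print(repr(cleaned), "<-", repr(original))
```

Note that digits and apostrophes are dropped as well, so "It's" becomes the two tokens `it` and `s`.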

## STEP 2 : Building Sentence Similarity Matrix
### Similarity is found using Cosine Similarity between the vector representations of sentences

In [6]:
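### Define methods for calculating similarity

def sentence_similarity(sent1, sent2, stopwords=None):
    if stopwords is None:
        stopwords = []

    # The cleaned sentences are plain strings; split them into words first,
    # otherwise the loops below would iterate over single characters.
    if isinstance(sent1, str):
        sent1 = sent1.split()
    if isinstance(sent2, str):
        sent2 = sent2.split()

    sent1 = [w.lower() for w in sent1]
    sent2 = [w.lower() for w in sent2]

    all_words = list(set(sent1 + sent2))
    index = {w: i for i, w in enumerate(all_words)}  # word -> vector position

    vector1 = [0] * len(all_words)
    vector2 = [0] * len(all_words)

    # build the vector for the first sentence
    for w in sent1:
        if w in stopwords:
            continue
        vector1[index[w]] += 1

    # build the vector for the second sentence
    for w in sent2:
        if w in stopwords:
            continue
        vector2[index[w]] += 1

    # cosine_distance divides by the vector norms, so guard against empty vectors
    if not any(vector1) or not any(vector2):
        return 0.0

    return 1 - cosine_distance(vector1, vector2)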




def build_similarity_matrix(sentences, stop_words):
    # Create an empty similarity matrix
    similarity_matrix = np.zeros((len(sentences), len(sentences)))
 
    for idx1 in range(len(sentences)):
        for idx2 in range(len(sentences)):
            if idx1 == idx2: #ignore if both are same sentences
                continue 
            similarity_matrix[idx1][idx2] = sentence_similarity(sentences[idx1], sentences[idx2], stop_words)
    return similarity_matrix
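
To make the cosine similarity concrete, here is a minimal standard-library sketch that computes it for two short tokenised sentences. The function name `cosine_similarity` and the example sentences are assumptions made for illustration; the notebook itself relies on NLTK's `cosine_distance`.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between the word-count vectors of two token lists."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # The dot product only needs the words the two sentences share.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sentence is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical sentences: they share "the" and "final", 2 of 4 words each.
a = "federer won the final".split()
b = "nadal lost the final".split()
print(round(cosine_similarity(a, b), 2))  # 2 / (2 * 2) = 0.5
```

Identical sentences score 1 and sentences with no words in common score 0, which is why the matrix entries can be used directly as edge weights.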

## STEP 3 : Sentence Ranking
### Sentences are ranked using PageRank Algorithm on the Graph generated from the Sentence Similarity Matrix

In [7]:
### Ranking the sentences: a graph is built with networkx from the cosine similarity matrix
# (used as a weighted adjacency matrix); the sentences are then scored with PageRank,
# sorted, and stored in ranked_sentence

# Step 2 - Generate the similarity matrix across sentences (no stop words are filtered here)
sentence_similarity_matrix = build_similarity_matrix(final, [])

# Step 3 - Rank the sentences in the similarity matrix
sentence_similarity_graph = nx.from_numpy_array(sentence_similarity_matrix)
scores = nx.pagerank(sentence_similarity_graph)

# Step 4 - Sort by score, highest first
ranked_sentence = sorted(((scores[i], s) for i, s in enumerate(final)), reverse=True)
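
For intuition about what PageRank does with the similarity matrix, the following is a simplified power-iteration sketch in plain Python. It is not networkx's implementation: the function name `pagerank_scores`, the fixed iteration count, the toy matrix, and the handling of isolated sentences (their score is simply renormalised away) are all assumptions made for illustration.

```python
def pagerank_scores(weights, damping=0.85, iterations=100):
    """Power-iteration PageRank over a symmetric, non-negative weight matrix."""
    n = len(weights)
    row_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for j in range(n):
            # Each sentence i passes its score to j in proportion to weights[i][j].
            incoming = sum(
                scores[i] * weights[i][j] / row_sums[i]
                for i in range(n)
                if row_sums[i] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        total = sum(new_scores)
        scores = [s / total for s in new_scores]  # renormalise so scores sum to 1
    return scores


# Toy similarity matrix: sentences 0 and 1 are very similar, sentence 2 is an outlier.
toy = [
    [0.0, 0.8, 0.1],
    [0.8, 0.0, 0.1],
    [0.1, 0.1, 0.0],
]
toy_scores = pagerank_scores(toy)
toy_ranking = sorted(range(len(toy)), key=lambda i: toy_scores[i], reverse=True)
print(toy_ranking)
```

Sentences that are strongly similar to many others accumulate the highest scores, so the outlier sentence ends up last in the ranking.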

## STEP 4 : Summary Generation
### Summary is output as the top 5 ranked sentences

In [11]:
for i in range(5):
    print(dict[ranked_sentence[i][1]])

argentina and britain received wild cards to the new-look event, and will compete along with the four 2018 semi-finalists and the 12 teams who win qualifying rounds next february.
the competition is set to feature 18 countries in the november 18-24 finals in madrid next year, and will replace the classic home-and-away ties played four times per year for decades.
"nadal has not played tennis since he was forced to retire from the us open semi-finals against juan martin del porto with a knee injury.
"not always, but i really feel like in the mid-2000 years there was a huge shift of the attitudes of the top players and being more friendly and being more giving, and a lot of that had to do with players like roger coming up.
but with the atp world tour finals due to begin next month, nadal is ready to prove his fitness before the season-ending event at the 02 arena.
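
The sentences above are printed in rank order. One possible variation, sketched below, picks the same top-scoring sentences but returns them in their original document order, which can read more naturally. The helper `top_sentences_in_order` and the demo data are hypothetical and not part of the notebook.

```python
def top_sentences_in_order(sentences, scores, n):
    """Pick the n highest-scoring sentences, then return them in document order."""
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n]
    return [sentences[i] for i in sorted(best)]


# Hypothetical sentences and scores, for illustration only.
demo_sentences = ["first sentence.", "second sentence.", "third sentence.", "fourth sentence."]
demo_scores = [0.10, 0.40, 0.15, 0.35]

print(" ".join(top_sentences_in_order(demo_sentences, demo_scores, 2)))
```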
