# Text Analytics Exploration

## Data

In [13]:
url = "http://hansardpublic.parliament.sa.gov.au/Pages/HansardResult.aspx#/docid/HANSARD-10-25756"
text = "When I tried to get a copy of the commissioner's report after being tabled, why was I basically told that there was a very limited— The PRESIDENT: This is a matter of personal explanation in a supplementary. Just please, the Hon. Mr Wortley, ask your supplementary. The Hon. R.P. WORTLEY: Why weren't all members of parliament given a copy of the royal commission's report? The Hon. D.W. Ridgway: But you told us before you never read reports. The Hon. R.I. LUCAS (Treasurer) (15:26): Mr President, I won't go down that particular path, as delicious as that interjection might have been in relation to the Hon. Mr Wortley saying he couldn't trust himself to read his own reports. I don't know why the Hon. Mr Wortley was unable to get a copy of the royal commission report. It was certainly publicly available. If it pleases the member, I will see whether there is not a spare copy somewhere. If we do find a spare copy and give it to him, I will be asking questions afterwards of the honourable member just to make sure he did read it. The Hon. D.W. Ridgway: Do you want it delivered to Scuzzi or something more convenient for you? The PRESIDENT: Are you finished, the Hon. Mr Ridgway? The Hon. R.P. WORTLEY: You just worry about our trade exports, mate, for the state. The PRESIDENT: The Hon. Mr Wortley, I am waiting patiently here to give you the call for your question. Have you finished your private conversation with the Hon. Mr Ridgway? Yes? The Hon. Mr Wortley."
title = "Murray-Darling Basin Royal Commission"

In [14]:
# Import data from spreadsheet
import pandas as pd
import numpy as np
import os

#data = pd.read_excel (os.getcwd() + "\\HANSARDfullDataset.xlsx", sheet_name="text")
data = pd.read_excel("..\\data\\HANSARDfullDataset.xlsx", sheet_name="text")
df = pd.DataFrame(data, columns=['hansardID', 'text'])
df = df.astype({"hansardID": 'str', "text": 'str'})
df.dtypes

hansardID    object
text         object
dtype: object

In [15]:
len(df.index)

170275

In [17]:
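# Treat the literal string 'NULL' as missing, then drop rows without text.
# Assigning the result back avoids relying on in-place changes to a column selection.
df['text'] = df['text'].replace('NULL', np.nan)
df = df.dropna(subset=['text'])
len(df.index)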

170275

In [49]:
# Group the text fragments into one document per hansardID
grouped_text = df.groupby('hansardID')['text'].agg(lambda col: '. '.join(col))
grouped_text_df = pd.DataFrame(grouped_text, columns=['text'])
print(grouped_text)

hansardID
HANSARD-10-21289.xml    Bills. Children and Young People (Safety) Bill...
HANSARD-10-21290.xml    Sentencing Bill. Assent. His Excellency the Go...
HANSARD-10-21291.xml    Statutes Amendment (Possession of Firearms and...
HANSARD-10-21292.xml    Public Interest Disclosure Bill. Conference. T...
HANSARD-10-21310.xml    Bills. Statutes Amendment (Heavy Vehicles Regi...
                                              ...                        
HANSARD-11-34941.xml    Child Protection Department. 930 Ms STINSON (B...
HANSARD-11-34942.xml    Financial Counselling Service. 940 Ms STINSON ...
HANSARD-11-34943.xml    Guardianship of the Chief Executive. 1008 Ms S...
HANSARD-11-34944.xml    Guardianship of the Chief Executive. 1009 Ms S...
HANSARD-11-34945.xml    Family Group Conferences. 1015 Ms STINSON (Bad...
Name: text, Length: 8833, dtype: object
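
The grouping step above can be sketched without pandas to make its effect on the text easier to see. This is a minimal illustration only: `group_and_join` and the sample rows are hypothetical and are not taken from the spreadsheet.

```python
from collections import defaultdict


def group_and_join(rows, sep=". "):
    """Concatenate the text fragments of each document ID, in input order."""
    grouped = defaultdict(list)
    for doc_id, fragment in rows:
        grouped[doc_id].append(fragment)
    # Sort by ID to mirror the sorted index that groupby produces by default
    return {doc_id: sep.join(parts) for doc_id, parts in sorted(grouped.items())}


# Hypothetical rows for illustration (not real Hansard data)
rows = [
    ("doc-2", "Assent"),
    ("doc-1", "Bills"),
    ("doc-1", "Second Reading."),
    ("doc-1", "Adjourned debate"),
]
documents = group_and_join(rows)
print(documents["doc-1"])
```

A fragment that already ends with a full stop yields a doubled `..` after joining, which may matter later if sentences are split on `". "`.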


In [50]:
grouped_text.iloc[4]

"Bills. Statutes Amendment (Heavy Vehicles Registration Fees) Bill. Second Reading. Adjourned debate on second reading.. (Continued from 1 June 2017.). The Hon. J.M.A. LENSINK (16:35):  I rise to indicate opposition support for this bill, which is part of the harmonisation of national laws which relate to heavy vehicles. In particular, this bill amends the Highways Act and the Motor Vehicles Act so that South Australia can meet its obligations under the Heavy Vehicle National Law (South Australia) Act 2013, which contains the national law as a schedule. This bill provides for the creation of a national heavy vehicle regulator. For the benefit of readers, heavy vehicles are defined as trucks with a gross vehicle mass of 4.5 tonnes or more.. The section relating to registration of the national law has not commenced yet, so heavy vehicle registrations remain under state legislation; however, those jurisdictions which are participants, which I understand includes all states and territories

## Document Summarisation

Articles and libraries to look into further: 
* https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/
* https://stackabuse.com/text-summarization-with-nltk-in-python/
* https://github.com/alanbuxton/PyTeaserPython3
* https://github.com/abisee/pointer-generator
* https://github.com/DerwenAI/pytextrank
* https://github.com/tensorflow/models/tree/master/research/textsum
* https://radimrehurek.com/gensim/models/lsimodel.html
* https://towardsdatascience.com/text-summarization-in-python-76c0a41f0dc4 (additional links to articles at the end)

In [7]:
# A feature-based model extracts features from each sentence, then evaluates their importance
# Feature-based text summarization by TextTeaser
#from pyteaser import SummarizeUrl
#url = 'http://www.huffingtonpost.com/2013/11/22/twitter-forward-secrecy_n_4326599.html'
#summaries = SummarizeUrl(url)
#print(summaries)

In [8]:
# TextRank summarization
# Note: the gensim.summarization module was removed in gensim 4.0, so this cell needs gensim < 4.0
from gensim.summarization.summarizer import summarize
print(summarize(text))

The Hon. R.P. WORTLEY: Why weren't all members of parliament given a copy of the royal commission's report?
The Hon. D.W. Ridgway: But you told us before you never read reports.
I don't know why the Hon. Mr Wortley was unable to get a copy of the royal commission report.


In [9]:
# Topic Model summarisation
from gensim.test.utils import common_dictionary, common_corpus
from gensim.models import LsiModel
model = LsiModel(common_corpus, id2word=common_dictionary)
vectorized_corpus = model[common_corpus]
#print(vectorized_corpus)
#model.print_topics(1)

In [10]:
# TextTeaser: an automatic summarization algorithm that combines natural language processing and machine learning
#from textteaser import TextTeaser
#tt = TextTeaser()
#tt.summarize(title, text)

In [61]:
# https://towardsdatascience.com/understand-text-summarization-and-create-your-own-summarizer-in-python-b26a9f09fc70
# Approach uses TextRank algorithm
# TextRank does not rely on any previous training data and can work with any arbitrary piece of text. 
# TextRank is a general purpose graph-based ranking algorithm for NLP

# Import all necessary libraries
import os
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.cluster.util import cosine_distance
import numpy as np
import networkx as nx
import re 

# Generate clean sentences
def read_article(file_name, output=True):
    # Only the first line of the file is used, so the article must sit on a single line
    with open(file_name, "r") as file:
        filedata = file.readlines()
    article = filedata[0].split(". ")
    return create_clean_sentences(article, output)

def read_text(text, output=True):
    article = text.split(". ")
    return create_clean_sentences(article, output)

def create_clean_sentences(article, output=True):
    sentences = []
    
    for sentence in article:
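        # NOTE: str.replace() matches its first argument literally, not as a regex, so this
        # line leaves the sentence unchanged. re.sub(r"[^a-zA-Z]", " ", sentence) would strip
        # non-letters, but would also change the recorded outputs of the later cells.
        processed = sentence.replace("[^a-zA-Z]", " ")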
        word_count = len(re.findall(r'\w+', processed)) 
        if word_count > 2: # Include sentences with more than two words
            sentences.append(processed.split(" "))
        if output: 
            print(sentence, ": words = ", word_count)
    
    #sentences.pop() 
    
    return sentences   

# Similarity matrix
def sentence_similarity(sent1, sent2, stop_words=None):
    # The parameter is named stop_words so that it does not shadow the nltk stopwords import
    if stop_words is None:
        stop_words = []

    sent1 = [w.lower() for w in sent1]
    sent2 = [w.lower() for w in sent2]

    all_words = list(set(sent1 + sent2))

    vector1 = [0] * len(all_words)
    vector2 = [0] * len(all_words)

    # build the word-count vector for the first sentence
    for w in sent1:
        if w in stop_words:
            continue
        vector1[all_words.index(w)] += 1

    # build the word-count vector for the second sentence
    for w in sent2:
        if w in stop_words:
            continue
        vector2[all_words.index(w)] += 1
 
    return 1 - cosine_distance(vector1, vector2)
 
def build_similarity_matrix(sentences, stop_words):
    # Create an empty similarity matrix
    similarity_matrix = np.zeros((len(sentences), len(sentences)))
 
    for idx1 in range(len(sentences)):
        for idx2 in range(len(sentences)):
            if idx1 == idx2:  # skip comparing a sentence with itself
                continue 
            similarity_matrix[idx1][idx2] = sentence_similarity(sentences[idx1], sentences[idx2], stop_words)

    return similarity_matrix

# Generate Summary Method
def generate_summary(file_name, text, top_n=5, output=True):
    stop_words = stopwords.words('english')
    summarize_text = []

    # Step 1 - Read text and split it
    sentences = ""
    if file_name is not None:
        sentences = read_article(file_name, output)
    else:
        sentences = read_text(text, output)

    # Step 2 - Generate similarity matrix across sentences
    sentence_similarity_matrix = build_similarity_matrix(sentences, stop_words)

    # Step 3 - Rank sentences in similarity matrix
    sentence_similarity_graph = nx.from_numpy_array(sentence_similarity_matrix)
    try:
        scores = nx.pagerank(sentence_similarity_graph, max_iter=200)
    except (nx.NetworkXError, nx.PowerIterationFailedConvergence):
        # No usable graph, or PageRank did not converge: return an empty summary
        return ""
    
    # Step 4 - Sort the rank and pick top sentences
    ranked_sentence = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)
    if output:
        print("Indexes of top ranked_sentence order are ", ranked_sentence)    

    if len(ranked_sentence) < top_n:
        top_n = len(ranked_sentence)
        
    for i in range(top_n):
        summarize_text.append(" ".join(ranked_sentence[i][1]))

    # Step 5 - output the summary text
    if output: 
        print("Summary Text: \n", ". ".join(summarize_text))
        
    return '. '.join(summarize_text)


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\noaka\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
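
The similarity step in the cell above can be illustrated with the standard library alone. The following is a minimal sketch, not the notebook's implementation: `cosine_similarity` and `similarity_matrix` are hypothetical names, the sample sentences and stop words are made up, and returning `0.0` for a sentence with no remaining words is a choice of this sketch.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b, stop_words=frozenset()):
    """Cosine similarity between the word-count vectors of two token lists."""
    counts_a = Counter(w.lower() for w in tokens_a if w.lower() not in stop_words)
    counts_b = Counter(w.lower() for w in tokens_b if w.lower() not in stop_words)
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption of this sketch: an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


def similarity_matrix(sentences, stop_words=frozenset()):
    """Pairwise similarities with a zero diagonal (a sentence is not compared with itself)."""
    n = len(sentences)
    return [
        [0.0 if i == j else cosine_similarity(sentences[i], sentences[j], stop_words)
         for j in range(n)]
        for i in range(n)
    ]


# Made-up tokenised sentences and stop words, for illustration only
sentences = [
    ["The", "bill", "amends", "the", "act"],
    ["This", "bill", "amends", "the", "law"],
    ["Members", "asked", "questions"],
]
matrix = similarity_matrix(sentences, stop_words={"the", "this"})
for row in matrix:
    print([round(value, 3) for value in row])
```

The first two sentences share two of their three remaining words, so they score about 0.667 against each other, while the third shares none and scores 0 against both.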


In [62]:
# Summarise a test article read from file (top 2 sentences, no debug output)
generate_summary(os.getcwd() + "\\test.txt", None, 2, False)

'This program also included developer-focused AI school that provided a bunch of assets to help build AI skills.. Envisioned as a three-year collaborative program, Intelligent Cloud Hub will support around 100 institutions with AI infrastructure, course content and curriculum, developer support, development tools and give students access to cloud and AI services'

In [63]:
generate_summary(None, grouped_text.iloc[4], 2)

Bills : words =  1
Statutes Amendment (Heavy Vehicles Registration Fees) Bill : words =  7
Second Reading : words =  2
Adjourned debate on second reading. : words =  5
(Continued from 1 June 2017.) : words =  5
The Hon : words =  2
J.M.A : words =  3
LENSINK (16:35):  I rise to indicate opposition support for this bill, which is part of the harmonisation of national laws which relate to heavy vehicles : words =  26
In particular, this bill amends the Highways Act and the Motor Vehicles Act so that South Australia can meet its obligations under the Heavy Vehicle National Law (South Australia) Act 2013, which contains the national law as a schedule : words =  39
This bill provides for the creation of a national heavy vehicle regulator : words =  12
For the benefit of readers, heavy vehicles are defined as trucks with a gross vehicle mass of 4.5 tonnes or more. : words =  22
The section relating to registration of the national law has not commenced yet, so heavy vehicle registrations rema

'This bill provides for the creation of a national heavy vehicle regulator. In particular, this bill amends the Highways Act and the Motor Vehicles Act so that South Australia can meet its obligations under the Heavy Vehicle National Law (South Australia) Act 2013, which contains the national law as a schedule'
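
The debug output above shows `The Hon` and `J.M.A` listed as separate "sentences". A small sketch of the same split-and-count logic, using only the standard library, reproduces that behaviour. `split_and_count` is a hypothetical helper, and the sample string is a shortened version of the speaker line from the document above.

```python
import re


def split_and_count(text):
    """Split on '. ' and count \\w+ tokens in each piece, mirroring the notebook's logic."""
    return [(piece, len(re.findall(r"\w+", piece))) for piece in text.split(". ")]


sample = "The Hon. J.M.A. LENSINK (16:35): I rise to indicate opposition support for this bill"
pieces = split_and_count(sample)
for piece, word_count in pieces:
    print(repr(piece), word_count)
```

Because `\w+` counts `J`, `M` and `A` as three words, the fragment `J.M.A` passes the "more than two words" filter and becomes a candidate sentence. A sentence tokenizer that knows about abbreviations, such as `nltk.sent_tokenize`, may handle speaker names better, though that has not been tried here.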

In [64]:
generate_summary(None, grouped_text.iloc[2], 2, True)

Statutes Amendment (Possession of Firearms and Prohibited Weapons) Bill : words =  9
Assent : words =  1
His Excellency the Governor assented to the bill. : words =  8
Indexes of top ranked_sentence order are  [(0.5, ['Statutes', 'Amendment', '(Possession', 'of', 'Firearms', 'and', 'Prohibited', 'Weapons)', 'Bill']), (0.5, ['His', 'Excellency', 'the', 'Governor', 'assented', 'to', 'the', 'bill.'])]
Summary Text: 
 Statutes Amendment (Possession of Firearms and Prohibited Weapons) Bill. His Excellency the Governor assented to the bill.


'Statutes Amendment (Possession of Firearms and Prohibited Weapons) Bill. His Excellency the Governor assented to the bill.'
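
Both remaining sentences above receive a score of 0.5. With only two nodes the graph is symmetric, so an even split is what a PageRank-style ranking would be expected to give. The sketch below is a minimal power-iteration version of PageRank over a weight matrix. It is not the networkx implementation, and the `pagerank` function, its defaults and the example matrices are assumptions made for illustration.

```python
def pagerank(weights, damping=0.85, tol=1e-9, max_iter=200):
    """Power-iteration PageRank over a square matrix of non-negative edge weights."""
    n = len(weights)
    if n == 0:
        return []
    scores = [1.0 / n] * n
    row_sums = [sum(row) for row in weights]
    for _ in range(max_iter):
        # Score held by nodes without outgoing weight is shared equally among all nodes
        dangling = sum(scores[i] for i in range(n) if row_sums[i] == 0)
        new_scores = []
        for j in range(n):
            incoming = sum(
                scores[i] * weights[i][j] / row_sums[i]
                for i in range(n)
                if row_sums[i] > 0
            )
            new_scores.append((1 - damping) / n + damping * (incoming + dangling / n))
        if sum(abs(a - b) for a, b in zip(new_scores, scores)) < tol:
            return new_scores
        scores = new_scores
    return scores


# Two sentences with no shared words: every node is dangling, so the split is even
two_sentences = pagerank([[0.0, 0.0], [0.0, 0.0]])
print(two_sentences)

# Three sentences where only the first two resemble each other (made-up weights)
three_sentences = pagerank([[0.0, 0.6, 0.0], [0.6, 0.0, 0.0], [0.0, 0.0, 0.0]])
print([round(score, 3) for score in three_sentences])
```

In the three-sentence case the two mutually similar sentences end up with equal, higher scores than the isolated one, which is the property the summariser relies on when it picks the top-ranked sentences.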

In [None]:
# Iterate over all documents in the data frame and add a TextRank summary column
new_df = grouped_text_df.copy()  # copy so that the grouped data frame is left unchanged
new_df['textrank'] = new_df['text'].apply(lambda t: generate_summary(None, t, 2, False))

In [None]:
new_df.dtypes

In [43]:
new_df.to_excel('.\\DocumentSummary.xlsx', sheet_name='TextRank', index=True)