###### CSCE 5222: Feature Engineering - Assignment 2 (ICE_2)
###### Student: Naga Sumanth Vankadari
###### Instructor: Dr. Sayed Khushal Shah


## Code from Article: Text Summarization - https://towardsdatascience.com/understand-text-summarization-and-create-your-own-summarizer-in-python-b26a9f09fc70

##### Step 1: Importing Libraries

In [251]:
from nltk.corpus import stopwords
from nltk.cluster.util import cosine_distance
import numpy as np
import networkx as nx

##### Step 2: Generate clean sentences

In [252]:
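def read_article(file_name):
    # Read the file once; the context manager closes it even if an error occurs
    with open(file_name, "r") as file:
        filedata = file.readlines()

    print("Input Article: \n" + "".join(filedata))

    # Only the first line of the file is summarized; sentences are split on ". "
    article = filedata[0].split(". ")
    sentences = []

    for sentence in article:
        # Note: str.replace treats "[^a-zA-Z]" as a literal string, not a regex,
        # so this call leaves the sentence unchanged (re.sub would be needed to strip non-letters)
        sentences.append(sentence.replace("[^a-zA-Z]", " ").split(" "))
    sentences.pop()  # drop the last fragment produced by the split

    return sentences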
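
The call `sentence.replace("[^a-zA-Z]", " ")` in `read_article` matches its first argument literally, so the bracket expression is never interpreted as a character class. The sketch below contrasts it with `re.sub`, which does treat the pattern as a regular expression. The sample sentence and the helper name `strip_non_letters` are illustrative assumptions and are not part of the article's code.

```python
import re

def strip_non_letters(sentence):
    """Hypothetical helper: replace every non-letter character with a space."""
    return re.sub(r"[^a-zA-Z]", " ", sentence)

sample = "The 33-year-old was found guilty"

# str.replace searches for the literal text "[^a-zA-Z]", which is absent, so nothing changes.
literal_result = sample.replace("[^a-zA-Z]", " ")

# re.sub interprets the pattern as a character class, so digits and hyphens become spaces.
regex_result = strip_non_letters(sample)

print(literal_result)
print(regex_result)
```

Switching the notebook to `re.sub` would also change the tokens fed to the similarity step, so the summaries shown in this notebook would likely differ.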

##### Step 3: Build Sentence Similarity

In [253]:
def sentence_similarity(sent1, sent2, stopwords=None):
    if stopwords is None:
        stopwords = []
 
    sent1 = [w.lower() for w in sent1]
    sent2 = [w.lower() for w in sent2]
 
    all_words = list(set(sent1 + sent2))
 
    vector1 = [0] * len(all_words)
    vector2 = [0] * len(all_words)
 
    # build the vector for the first sentence
    for w in sent1:
        if w in stopwords:
            continue
        vector1[all_words.index(w)] += 1
 
    # build the vector for the second sentence
    for w in sent2:
        if w in stopwords:
            continue
        vector2[all_words.index(w)] += 1
 
    return 1 - cosine_distance(vector1, vector2)
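
To make the similarity score concrete, here is a minimal standard-library sketch of the same idea: build word-count vectors for two tokenized sentences, skip stop words, and take the cosine of the angle between the vectors. The function name `cosine_similarity`, the toy sentences, and the tiny stop-word set are assumptions for illustration; the notebook itself relies on `nltk.cluster.util.cosine_distance`. The zero-vector guard is an addition that the notebook's function does not have.

```python
import math
from collections import Counter

def cosine_similarity(words1, words2, stop_words=()):
    """Cosine similarity between the word-count vectors of two tokenized sentences."""
    counts1 = Counter(w.lower() for w in words1 if w.lower() not in stop_words)
    counts2 = Counter(w.lower() for w in words2 if w.lower() not in stop_words)

    # Dot product over the words of the first sentence (missing words count as 0).
    dot = sum(counts1[w] * counts2[w] for w in counts1)
    norm1 = math.sqrt(sum(c * c for c in counts1.values()))
    norm2 = math.sqrt(sum(c * c for c in counts2.values()))

    # Guard against sentences made up only of stop words (zero vectors).
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

s1 = "the cat sat on the mat".split()
s2 = "the cat lay on the rug".split()
stop = {"the", "on"}

score = cosine_similarity(s1, s2, stop)
print(round(score, 3))
```

After stop-word removal each sentence keeps three distinct words and the two share only `cat`, so the score is 1/3.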

##### Step 4: Build Similarity Matrix

In [254]:
 
def build_similarity_matrix(sentences, stop_words):
    # Create an empty similarity matrix
    similarity_matrix = np.zeros((len(sentences), len(sentences)))
 
    for idx1 in range(len(sentences)):
        for idx2 in range(len(sentences)):
            if idx1 == idx2:  # skip comparing a sentence with itself
                continue 
            similarity_matrix[idx1][idx2] = sentence_similarity(sentences[idx1], sentences[idx2], stop_words)

    return similarity_matrix
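
The shape of the resulting matrix is easier to see on a toy input. The sketch below fills a pairwise matrix with nested lists and leaves the diagonal at zero, as `build_similarity_matrix` does. The `shared_words` function is a deliberately simple stand-in for `sentence_similarity`, and the three toy sentences are assumptions chosen for illustration.

```python
def shared_words(words1, words2):
    """Stand-in similarity: number of distinct words the two sentences share."""
    return float(len(set(words1) & set(words2)))

def build_matrix(sentences, similarity):
    """Pairwise similarity matrix as nested lists, with the diagonal left at zero."""
    n = len(sentences)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                matrix[i][j] = similarity(sentences[i], sentences[j])
    return matrix

toy_sentences = [
    ["fuel", "prices", "rise"],
    ["fuel", "subsidies", "cut"],
    ["theatre", "awards", "announced"],
]

matrix = build_matrix(toy_sentences, shared_words)
for row in matrix:
    print(row)
```

Because the similarity is symmetric, so is the matrix, which is why it can be read as an undirected weighted graph in the next step.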

##### Step 5: Generate Summary Method


In [255]:

def generate_summary(file_name, top_n=5):
    stop_words = stopwords.words('english')
    summarize_text = []

    sentences =  read_article(file_name)

    sentence_similarity_matrix = build_similarity_matrix(sentences, stop_words)

    sentence_similarity_graph = nx.from_numpy_array(sentence_similarity_matrix)
    scores = nx.pagerank(sentence_similarity_graph)

    ranked_sentence = sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)

    # Take at most top_n sentences so that short articles do not raise an IndexError
    for i in range(min(top_n, len(ranked_sentence))):
        summarize_text.append(" ".join(ranked_sentence[i][1]))

    return ". ".join(summarize_text)
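
The ranking step treats the similarity matrix as a weighted graph and scores each sentence with PageRank. The sketch below is a simplified power iteration written with plain lists to show the idea; it is not the networkx implementation that `nx.pagerank` uses. The function name `rank_sentences`, the fixed iteration count, and the toy matrix are assumptions, and the damping value 0.85 mirrors the default `alpha` of `nx.pagerank`.

```python
def rank_sentences(matrix, damping=0.85, iterations=50):
    """Simplified PageRank over a non-negative weighted adjacency matrix."""
    n = len(matrix)
    scores = [1.0 / n] * n
    out_weight = [sum(row) for row in matrix]

    for _ in range(iterations):
        new_scores = []
        for j in range(n):
            incoming = 0.0
            for i in range(n):
                if out_weight[i] > 0:
                    # Node i passes on its score in proportion to its edge weights.
                    incoming += scores[i] * matrix[i][j] / out_weight[i]
                else:
                    # A node with no edges spreads its score evenly.
                    incoming += scores[i] / n
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores

# Toy similarity matrix: sentence 1 is the most similar to the other two.
similarity = [
    [0.0, 0.6, 0.1],
    [0.6, 0.0, 0.5],
    [0.1, 0.5, 0.0],
]

scores = rank_sentences(similarity)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

Sentences that are similar to many other sentences collect the most score, which is why `generate_summary` keeps the top-ranked ones as the extractive summary.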

##### Generate summary for sample provided in the article

In [256]:
print("Extracted Summary:\n" + generate_summary("msft.txt", 2))
print("------------------------")

Input Article: 
In an attempt to build an AI-ready workforce, Microsoft announced Intelligent Cloud Hub which has been launched to empower the next generation of students with AI-ready skills. Envisioned as a three-year collaborative program, Intelligent Cloud Hub will support around 100 institutions with AI infrastructure, course content and curriculum, developer support, development tools and give students access to cloud and AI services. As part of the program, the Redmond giant which wants to expand its reach and is planning to build a strong developer ecosystem in India with the program will set up the core AI infrastructure and IoT Hub for the selected campuses. The company will provide AI development tools and Azure AI services such as Microsoft Cognitive Services, Bot Services and Azure Machine Learning.According to Manish Prakash, Country General Manager-PS, Health and Education, Microsoft India, said, "With AI being the defining technology of our time, it is transforming live

##### Applying above summary extraction code to five articles from previous assignment (ICE_1)

In [257]:
import glob

output_1 = []
i = 1


for file_path in glob.iglob('/Users/nvvankad/Documents/new_laptop/Personal/Masters/CSCE 5222 Feature Engineering/Assignment2/5_articles/*.txt'):
    print("File Number:" + str(i))
    extracted_summary = generate_summary(file_path, 2)
    print("Extracted Summary:\n" + extracted_summary)
    print("------------------------")
    output_1.append(extracted_summary)
    i = i + 1

File Number:1
Input Article: 
Collins banned in landmark case. Sprinter Michelle Collins has received an eight-year ban for doping offences after a hearing at the North American Court of Arbitration for Sport (CAS). America's former world indoor 200m champion is the first athlete to be suspended without a positive drugs test or an admission of drugs use. Collins' ban is a result of her connection to the federal inquiry into the Balco doping scandal. The 33-year-old was found guilty of using performance-enhancing drugs. The US Anti-Doping Agency (USADA) decided to press charges against Collins in the summer. The sprinter has consistently protested her innocence but the CAS has upheld USADA's findings. "The USADA has proved, beyond a reasonable doubt, that Collins took EPO, the testosterone/epitestosterone cream and THG," said a CAS statement. "Collins used these substances to enhance her performance and elude the drug testing that was available at the time." So far a total of 13 athlete

## Apply Text Cleaning 

#### Read the input files and print the content

In [258]:
import glob


for file_path in glob.iglob('/Users/nvvankad/Documents/new_laptop/Personal/Masters/CSCE 5222 Feature Engineering/Assignment2/5_articles/*.txt'):
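    # Use a context manager so each file is closed after it is printed
    with open(file_path, 'r') as f:
        print(f.read())
    
    
# As seen below, the articles contain punctuation marks such as quotes, semicolons, and possessive 's.
# Cleaning will remove these, remove extra spaces, and convert the text to lower case.
# We will also apply lemmatization to reduce each word to its base form.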


Collins banned in landmark case. Sprinter Michelle Collins has received an eight-year ban for doping offences after a hearing at the North American Court of Arbitration for Sport (CAS). America's former world indoor 200m champion is the first athlete to be suspended without a positive drugs test or an admission of drugs use. Collins' ban is a result of her connection to the federal inquiry into the Balco doping scandal. The 33-year-old was found guilty of using performance-enhancing drugs. The US Anti-Doping Agency (USADA) decided to press charges against Collins in the summer. The sprinter has consistently protested her innocence but the CAS has upheld USADA's findings. "The USADA has proved, beyond a reasonable doubt, that Collins took EPO, the testosterone/epitestosterone cream and THG," said a CAS statement. "Collins used these substances to enhance her performance and elude the drug testing that was available at the time." So far a total of 13 athletes have been sanctioned for vio

### Text Cleaning - With Lemmatization

In [259]:
import glob
import os
import nltk
from nltk.stem import WordNetLemmatizer


# Downloading punkt and wordnet from NLTK
nltk.download('punkt')
nltk.download('wordnet')

# Saving the lemmatizer into an object
wordnet_lemmatizer = WordNetLemmatizer()

punctuation_signs = list("?:,;")

cleaned_files_save_path = '/Users/nvvankad/Documents/new_laptop/Personal/Masters/CSCE 5222 Feature Engineering/Assignment2/cleaned_output/'

for file_path in glob.iglob('/Users/nvvankad/Documents/new_laptop/Personal/Masters/CSCE 5222 Feature Engineering/Assignment2/5_articles/*.txt'):
    file_name = os.path.basename(file_path)
    print(file_name)
    
    # Read the article inside a context manager so the file is always closed
    with open(file_path, 'r') as f:
        file_content = f.read()
    file_content = file_content.replace('"', '')
    file_content = file_content.lower()
    for punct_sign in punctuation_signs:
        file_content = file_content.replace(punct_sign, '')
    file_content = file_content.replace("'s", "")
    
    lemmatized_list = []
    text_words = file_content.split(" ")
    for word in text_words:
        lemmatized_list.append(wordnet_lemmatizer.lemmatize(word, pos="v"))
    lemmatized_text = " ".join(lemmatized_list)
    
    
    print(lemmatized_text)
    
    
    # The with-block closes the file automatically, so no explicit close() is needed
    with open(cleaned_files_save_path + file_name, 'w') as f:
        f.write(lemmatized_text)

sport.txt
collins ban in landmark case. sprinter michelle collins have receive an eight-year ban for dope offences after a hear at the north american court of arbitration for sport (cas). america former world indoor 200m champion be the first athlete to be suspend without a positive drug test or an admission of drug use. collins' ban be a result of her connection to the federal inquiry into the balco dope scandal. the 33-year-old be find guilty of use performance-enhancing drugs. the us anti-doping agency (usada) decide to press charge against collins in the summer. the sprinter have consistently protest her innocence but the cas have uphold usada findings. the usada have prove beyond a reasonable doubt that collins take epo the testosterone/epitestosterone cream and thg say a cas statement. collins use these substances to enhance her performance and elude the drug test that be available at the time. so far a total of 13 athletes have be sanction for violations involve drug associate w

[nltk_data] Downloading package punkt to /Users/nvvankad/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/nvvankad/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!

### Extracted Summary after text cleaning with Lemmatization


In [260]:
import glob

output_2 = []
i = 1


for file_path in glob.iglob('/Users/nvvankad/Documents/new_laptop/Personal/Masters/CSCE 5222 Feature Engineering/Assignment2/cleaned_output/*.txt'):
    print("File Number:" + str(i))
    extracted_summary = generate_summary(file_path, 2)
    print("Extracted Summary:\n" + extracted_summary)
    print("------------------------")
    output_2.append(extracted_summary)
    i = i + 1

File Number:1
Input Article: 
collins ban in landmark case. sprinter michelle collins have receive an eight-year ban for dope offences after a hear at the north american court of arbitration for sport (cas). america former world indoor 200m champion be the first athlete to be suspend without a positive drug test or an admission of drug use. collins' ban be a result of her connection to the federal inquiry into the balco dope scandal. the 33-year-old be find guilty of use performance-enhancing drugs. the us anti-doping agency (usada) decide to press charge against collins in the summer. the sprinter have consistently protest her innocence but the cas have uphold usada findings. the usada have prove beyond a reasonable doubt that collins take epo the testosterone/epitestosterone cream and thg say a cas statement. collins use these substances to enhance her performance and elude the drug test that be available at the time. so far a total of 13 athletes have be sanction for violations invo

### Comparing Extracted Summary - Before and After Text Cleaning (With Lemmatization)

In [261]:
for i in range(len(output_1)):
    print("Summary Before Text Cleaning: \n" + output_1[i])
    print("\n")
    print("Summary After Text Cleaning with Lemmatization: \n" + output_2[i])
    print("--------------")

Summary Before Text Cleaning: 
Sprinter Michelle Collins has received an eight-year ban for doping offences after a hearing at the North American Court of Arbitration for Sport (CAS). "The USADA has proved, beyond a reasonable doubt, that Collins took EPO, the testosterone/epitestosterone cream and THG," said a CAS statement


Summary After Text Cleaning with Lemmatization: 
collins use these substances to enhance her performance and elude the drug test that be available at the time. collins ban in landmark case
--------------
Summary Before Text Cleaning: 
Indonesia's government has confirmed it is considering raising fuel prices by as much as 30%. Since President Yudhoyono's government came to power in October, it has indicated its intention of raising domestic fuel prices by cutting subsidies


Summary After Text Cleaning with Lemmatization: 
indonesia government have confirm it be consider raise fuel price by as much as 30%. indonesia pay subsidies to importers in order to stabilis
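
The summaries above are compared by reading them side by side. One rough way to put a number on how much two summaries share is the Jaccard overlap of their word sets, sketched below. The function name `summary_overlap` and the two short sample strings are illustrative assumptions; this measure is not part of the original assignment and ignores word order and meaning.

```python
def summary_overlap(summary_a, summary_b):
    """Jaccard overlap between the lower-cased word sets of two summaries."""
    words_a = set(summary_a.lower().split())
    words_b = set(summary_b.lower().split())
    union = words_a | words_b
    if not union:
        return 0.0
    return len(words_a & words_b) / len(union)

before = "Collins banned in landmark case"
after = "collins ban in landmark case"

overlap = summary_overlap(before, after)
print(round(overlap, 3))
```

Here four of the six distinct words are shared; `banned` and `ban` count as different words because the comparison is purely lexical.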

### Text Cleaning - Without Lemmatization

In [262]:
import glob
import os



punctuation_signs = list("?:!,;")


cleaned_files_save_path = '/Users/nvvankad/Documents/new_laptop/Personal/Masters/CSCE 5222 Feature Engineering/Assignment2/cleaned_output/'

for file_path in glob.iglob('/Users/nvvankad/Documents/new_laptop/Personal/Masters/CSCE 5222 Feature Engineering/Assignment2/5_articles/*.txt'):
    file_name = os.path.basename(file_path)
    print(file_name)
    
    # Read the article inside a context manager so the file is always closed
    with open(file_path, 'r') as f:
        file_content = f.read()
    file_content = file_content.replace('"', '')
    file_content = file_content.lower()
    for punct_sign in punctuation_signs:
        file_content = file_content.replace(punct_sign, '')
    file_content = file_content.replace("'s", "")
    print(file_content)
    
    
    # The with-block closes the file automatically, so no explicit close() is needed
    with open(cleaned_files_save_path + file_name, 'w') as f:
        f.write(file_content)

sport.txt
collins banned in landmark case. sprinter michelle collins has received an eight-year ban for doping offences after a hearing at the north american court of arbitration for sport (cas). america former world indoor 200m champion is the first athlete to be suspended without a positive drugs test or an admission of drugs use. collins' ban is a result of her connection to the federal inquiry into the balco doping scandal. the 33-year-old was found guilty of using performance-enhancing drugs. the us anti-doping agency (usada) decided to press charges against collins in the summer. the sprinter has consistently protested her innocence but the cas has upheld usada findings. the usada has proved beyond a reasonable doubt that collins took epo the testosterone/epitestosterone cream and thg said a cas statement. collins used these substances to enhance her performance and elude the drug testing that was available at the time. so far a total of 13 athletes have been sanctioned for viola
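
The cleaning steps in the cell above can also be read as one small function, which makes them easy to try on a single string. The sketch below applies the same operations in the same order: drop double quotes, lower-case, remove the listed punctuation signs, and remove `'s`. The function name `clean_text` and the sample sentence are assumptions for illustration.

```python
def clean_text(text, punctuation_signs="?:!,;"):
    """Lower-case the text and strip quotes, selected punctuation, and possessive 's."""
    cleaned = text.replace('"', '').lower()
    for sign in punctuation_signs:
        cleaned = cleaned.replace(sign, '')
    return cleaned.replace("'s", "")

raw = 'America\'s champion said, "Collins took EPO!"'
cleaned = clean_text(raw)
print(cleaned)
```

Because `'s` is removed with a plain string replacement, a possessive without a trailing `s`, such as `collins'`, keeps its apostrophe, which matches the cleaned output shown above.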

### Extracted Summary after text cleaning without Lemmatization

In [263]:
import glob

output_2 = []
i = 1


for file_path in glob.iglob('/Users/nvvankad/Documents/new_laptop/Personal/Masters/CSCE 5222 Feature Engineering/Assignment2/cleaned_output/*.txt'):
    print("File Number:" + str(i))
    extracted_summary = generate_summary(file_path, 2)
    print("Extracted Summary:\n" + extracted_summary)
    print("------------------------")
    output_2.append(extracted_summary)
    i = i + 1

File Number:1
Input Article: 
collins banned in landmark case. sprinter michelle collins has received an eight-year ban for doping offences after a hearing at the north american court of arbitration for sport (cas). america former world indoor 200m champion is the first athlete to be suspended without a positive drugs test or an admission of drugs use. collins' ban is a result of her connection to the federal inquiry into the balco doping scandal. the 33-year-old was found guilty of using performance-enhancing drugs. the us anti-doping agency (usada) decided to press charges against collins in the summer. the sprinter has consistently protested her innocence but the cas has upheld usada findings. the usada has proved beyond a reasonable doubt that collins took epo the testosterone/epitestosterone cream and thg said a cas statement. collins used these substances to enhance her performance and elude the drug testing that was available at the time. so far a total of 13 athletes have been 

Extracted Summary:
microsoft releases regular security updates to its software to protect pcs. microsoft says it is clamping down on people running pirated versions of its windows operating system by restricting their access to security features
------------------------
File Number:5
Input Article: 
bennett play takes theatre prizes. the history boys by alan bennett has been named best new play in the critics' circle theatre awards. set in a grammar school the play also earned a best actor prize for star richard griffiths as teacher hector. the producers was named best musical victoria hamilton was best actress for suddenly last summer and festen rufus norris was named best director. the history boys also won the best new comedy title at the theatregoers' choice awards. partly based upon alan bennett experience as a teacher the history boys has been at london national theatre since last may. the critics' circle named rebecca lenkiewicz its most promising playwright for the night season

### Comparing Extracted Summary - Before and After Text Cleaning (Without Lemmatization)

In [264]:
for i in range(len(output_1)):
    print("Summary Before Text Cleaning: \n" + output_1[i])
    print("\n")
    print("Summary After Text Cleaning without Lemmatization: \n" + output_2[i])
    print("--------------")

Summary Before Text Cleaning: 
Sprinter Michelle Collins has received an eight-year ban for doping offences after a hearing at the North American Court of Arbitration for Sport (CAS). "The USADA has proved, beyond a reasonable doubt, that Collins took EPO, the testosterone/epitestosterone cream and THG," said a CAS statement


Summary After Text Cleaning without Lemmatization: 
usada chief executive officer terry madden said the action taken against collins was further proof of that. the usada has proved beyond a reasonable doubt that collins took epo the testosterone/epitestosterone cream and thg said a cas statement
--------------
Summary Before Text Cleaning: 
Indonesia's government has confirmed it is considering raising fuel prices by as much as 30%. Since President Yudhoyono's government came to power in October, it has indicated its intention of raising domestic fuel prices by cutting subsidies


Summary After Text Cleaning without Lemmatization: 
indonesia government has confir