# Text Summarising of the articles

## Text-Summarization
**Automatic summarization** is the process of shortening a text document with software, in order to create a summary with the major points of the original document. Technologies that can make a coherent summary take into account variables such as length, writing style and syntax.

Automatic data summarization is part of the real machine learning and data mining. The main idea of summarization is to find a subset of data which contains the "information" of the entire set. Such techniques are widely used in industry today. Search engines are an example; others include summarization of documents, image collections and videos. Document summarization tries to create a representative summary or abstract of the entire document, by finding the most informative sentences, while in image summarization the system finds the most representative and important (i.e. salient) images. For surveillance videos, one might want to extract the important events from the uneventful context.

There are two general approaches to automatic summarization: Extraction and Abstraction. 
1. *Extractive Summarization*: These methods rely on extracting several parts, such as phrases and sentences, from a piece of text and stack them together to create a summary. Therefore, identifying the right sentences for summarization is of utmost importance in an extractive method.
2. *Abstractive Summarization*: These methods use advanced NLP techniques to generate an entirely new summary. Some parts of this summary may not even appear in the original text. Such a summary might include verbal innovations. 
Research to date has focused primarily on extractive methods, which are appropriate for image collection summarization and video summarization.

In this Jupyter notebook, TextRank algorithm for extractive text summarization is implemented using Google's PageRank search algorithm to generate corelations among sentences.

### Libraries Used
- [Numpy](http://www.numpy.org)
- [Pandas](https://pandas.pydata.org/)
- [Natural Language Toolkit](https://www.nltk.org/)

### Algorithms and Concepts
- TextRank
- [PageRank](https://en.wikipedia.org/wiki/PageRank)
- [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity)

### How to run
- Install the required libraries using pip, virtual environment or conda.
- Run `jupyter notebook` in your terminal.


In [1]:
# Importing the required Libraries
import numpy as np
import pandas as pd
import nltk
# nltk.download('punkt') # one time execution
import re
#nltk.download('stopwords') # one time execution
import matplotlib.pyplot as plt

from nltk.tokenize import sent_tokenize

from nltk.corpus import stopwords

from sklearn.metrics.pairwise import cosine_similarity

import networkx as nx

In [2]:
# Extract word vectors
word_embeddings = {}
file = open('glove.6B.100d.txt', encoding='utf-8')
for line in file:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    word_embeddings[word] = coefs
file.close()
len(word_embeddings)

400000

In [3]:
# reading the file
df = pd.read_excel('TASK.xlsx')

In [4]:
df

Unnamed: 0,TEST DATASET,Unnamed: 1
0,,Introduction
1,,Acnesol Gel is an antibiotic that fights bacte...
2,,Ambrodil Syrup is used for treating various re...
3,,Augmentin 625 Duo Tablet is a penicillin-type ...
4,,Azithral 500 Tablet is an antibiotic used to t...
...,...,...
996,,Azapure Tablet belongs to a group of medicines...
997,,Arimidex 1mg Tablet is used alone or with oth...
998,,Arpimune ME 100mg Capsule is used to prevent y...
999,,Amlodac CH Tablet is a combination medicine us...


In [5]:
df.columns

Index(['TEST DATASET', 'Unnamed: 1'], dtype='object')

In [6]:
df.rename(columns = {'Unnamed: 1' : 'Introduction' }, inplace=True)
# Deleting the first row
df.drop(0)

Unnamed: 0,TEST DATASET,Introduction
1,,Acnesol Gel is an antibiotic that fights bacte...
2,,Ambrodil Syrup is used for treating various re...
3,,Augmentin 625 Duo Tablet is a penicillin-type ...
4,,Azithral 500 Tablet is an antibiotic used to t...
5,,Alkasol Oral Solution is a medicine used in th...
...,...,...
996,,Azapure Tablet belongs to a group of medicines...
997,,Arimidex 1mg Tablet is used alone or with oth...
998,,Arpimune ME 100mg Capsule is used to prevent y...
999,,Amlodac CH Tablet is a combination medicine us...


In [15]:
# Converting the DataFrame into a dictionary
text_dictionary = {}
for i in range(1,len(df['TEST DATASET'])):
    text_dictionary[i] = df['Introduction'][i]
    
print(text_dictionary[1])

Acnesol Gel is an antibiotic that fights bacteria. It is used to treat acne, which appears as spots or pimples on your face, chest or back. This medicine works by attacking the bacteria that cause these pimples.Acnesol Gel is only meant for external use and should be used as advised by your doctor. You should normally wash and dry the affected area before applying a thin layer of the medicine. It should not be applied to broken or damaged skin. Avoid any contact with your eyes, nose, or mouth. Rinse it off with water if you accidentally get it in these areas. It may take several weeks for your symptoms to improve, but you should keep using this medicine regularly. Do not stop using it as soon as your acne starts to get better. Ask your doctor when you should stop treatment.Common side effects like minor itching, burning, or redness of the skin and oily skin may be seen in some people. These are usually temporary and resolve on their own. Consult your doctor if they bother you or do not

**There are 1000 such description of the different medicines. The task is to give summarised form of these description.**

In [8]:
# function to remove stopwords
def remove_stopwords(sen):
    stop_words = stopwords.words('english')
    
    sen_new = " ".join([i for i in sen if i not in stop_words])
    return sen_new

In [9]:
# function to make vectors out of the sentences
def sentence_vector_func (sentences_cleaned) : 
    sentence_vector = []
    for i in sentences_cleaned:
        if len(i) != 0:
            v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()])/(len(i.split())+0.001)
        else:
            v = np.zeros((100,))
        sentence_vector.append(v)
    
    return (sentence_vector)

In [10]:
# function to get the summary of the articles
# NOTE - Remove '#' infront of print statement for displaying the contents at different stages of the text summarisation process
def summary_text (test_text, n = 5):
    sentences = []
    
    # tokenising the text 
    sentences.append(sent_tokenize(test_text))
    # print(sentences)
    sentences = [y for x in sentences for y in x] # flatten list
    # print(sentences)
    
    # remove punctuations, numbers and special characters
    clean_sentences = pd.Series(sentences).str.replace("[^a-z A-Z 0-9]", " ")

    # make alphabets lowercase
    clean_sentences = [s.lower() for s in clean_sentences]
    #print(clean_sentences)

    
    # remove stopwords from the sentences
    clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]
    #print(clean_sentences)
    
    sentence_vectors = sentence_vector_func(clean_sentences)
    
    # similarity matrix
    sim_mat = np.zeros([len(sentences), len(sentences)])
    #print(sim_mat)
    
    # Finding the similarities between the sentences 
    for i in range(len(sentences)):
        for j in range(len(sentences)):
            if i != j:
                sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,100), sentence_vectors[j].reshape(1,100))[0,0]
    
    
    nx_graph = nx.from_numpy_array(sim_mat)
    scores = nx.pagerank(nx_graph)
    #print(scores)
    
    ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)))
    # Extract sentences as the summary
    summarised_string = ''
    for i in range(n):
        
        try:
            summarised_string = summarised_string + str(ranked_sentences[i][1])            
        except IndexError:
            print ("Summary Not Available")
    
    return (summarised_string)

In [21]:

print("Kindly let me know in how many sentences you want the summary - ")
x = int(input())

summary_dictionary = {}

for key in text_dictionary:
    
    para = text_dictionary[key]
    print("Summary of the article - ",key)
    summary = summary_text(para,x)
    summary_dictionary[key] = summary
    
    print(summary)
    print('='*120)    
    
print ("*"*40,"The process has been completed successfully","*"*40)

Kindly let me know in how many sentences you want the summary - 
3
Summary of the article -  1
Acnesol Gel is an antibiotic that fights bacteria.These are usually temporary and resolve on their own.Consult your doctor about using this medicine if you are pregnant or breastfeeding.
Summary of the article -  2
Ambrodil Syrup is used for treating various respiratory tract disorders associated with excessive mucus.It works by thinning and loosens mucus in the nose, windpipe and lungs and make it easier to cough out.Ambrodil Syrup should be taken with food.For better results, it is suggested to take it at the same time every day.
Summary of the article -  3
It is used to treat infections of the lungs (e.g., pneumonia), ear, nasal sinus, urinary tract, skin and soft tissue.You should take it regularly at evenly spaced intervals as per the schedule prescribed by your doctor.Augmentin 625 Duo Tablet is a penicillin-type of antibiotic that helps your body fight infections caused by bacteria.
Su

In [13]:
summary_table = pd.DataFrame(list(summary_dictionary.items()),columns = ['TEST DATASET','Summary'])

In [14]:
data_table = pd.DataFrame(list(text_dictionary.items()),columns = ['TEST DATASET','Introduction'])

In [15]:
# Combining the findings into the table
result  = pd.concat([data_table , summary_table['Summary']], axis = 1 , sort = False)
result

Unnamed: 0,TEST DATASET,Introduction,Summary
0,1,Acnesol Gel is an antibiotic that fights bacte...,Acnesol Gel is an antibiotic that fights bacte...
1,2,Ambrodil Syrup is used for treating various re...,Ambrodil Syrup is used for treating various re...
2,3,Augmentin 625 Duo Tablet is a penicillin-type ...,It is used to treat infections of the lungs (e...
3,4,Azithral 500 Tablet is an antibiotic used to t...,These are usually temporary and subside with t...
4,5,Alkasol Oral Solution is a medicine used in th...,This will prevent you from getting an upset st...
...,...,...,...
995,996,Azapure Tablet belongs to a group of medicines...,"Swallow it as a whole, do not crush, chew, br..."
996,997,Arimidex 1mg Tablet is used alone or with oth...,Swallow the tablets whole with a drink of wate...
997,998,Arpimune ME 100mg Capsule is used to prevent y...,"However, it is not recommended while breastfee..."
998,999,Amlodac CH Tablet is a combination medicine us...,Consult your doctor If any of these bother you...


In [16]:
# Saving it to a file (remove the '#' to save)
#result.to_csv("Summary_File.csv")