<a href="https://colab.research.google.com/github/vmavis/colab/blob/main/text_summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Importing Libraries**

All the necessary libraries and functions are installed and imported first. Any further libraries or functions that turn out to be needed are imported separately.

NumPy lets us work with arrays, while pandas lets us work with dataframes. NLTK provides natural language processing utilities; here it supplies the English stopword list and a cosine distance function. NetworkX lets us create, manipulate, and study the structure, dynamics, and functions of complex networks.

In [None]:
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.cluster.util import cosine_distance
import networkx as nx

In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## **Creating User-Defined Functions**

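A function to read the text file is defined below. It takes the first line of the file, splits it into sentences on `". "`, and splits each sentence into words on spaces. The `replace` call is meant to strip non-alphabetic characters, but `str.replace` matches its first argument literally rather than as a regular expression, so the words pass through unchanged. The last element produced by the split is dropped.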

In [None]:
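def read_article(file_name):
    # Read every line; the whole article is expected on the first line.
    with open(file_name, "r") as file:
        filedata = file.readlines()
    article = filedata[0].split(". ")
    sentences = []
    for sentence in article:
        print(sentence)
        # str.replace treats "[^a-zA-Z]" as literal text, so nothing is removed here;
        # re.sub would be needed to strip non-alphabetic characters.
        sentences.append(sentence.replace("[^a-zA-Z]", " ").split(" "))
    # Drop the last element left over from the split.
    sentences.pop()

    return sentences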
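
The difference between a literal `str.replace` and a regular-expression substitution can be seen in a small sketch. The sample string `raw` is made up for illustration and is not taken from `data_4D.txt`; `re.sub` is shown only as one possible way to get the intended cleaning, not as what the notebook runs.

```python
import re

# Hypothetical sample text, invented for this illustration.
raw = "Tourism@ is held in 2005. It covers ICT (Information Technology). Trailing fragment"

# Split on ". " the same way read_article does.
parts = raw.split(". ")

# str.replace looks for the literal text "[^a-zA-Z]", which never occurs here,
# so the sentence comes back unchanged.
literal = parts[0].replace("[^a-zA-Z]", " ")

# re.sub interprets the pattern as a regular expression and replaces every
# non-alphabetic character with a space; split() then drops the empty tokens.
cleaned = re.sub(r"[^a-zA-Z]", " ", parts[0]).split()

print(literal)
print(cleaned)
```

Switching the notebook to `re.sub` would change the tokens, and therefore the scores, shown in the recorded outputs.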

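A function to compute the similarity between two sentences is defined below. Both sentences are lowercased, and the set of all distinct words across the two sets the vector length. A word-count vector is then built for each sentence, skipping stopwords, and the similarity is returned as one minus the cosine distance between the two vectors.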



In [None]:
def sentence_similarity(sent1, sent2, stopwords=None):
    if stopwords is None:
        stopwords = []

    sent1 = [w.lower() for w in sent1]
    sent2 = [w.lower() for w in sent2]

    all_words = list(set(sent1+sent2))

    vector1 = [0]*len(all_words)
    vector2 = [0]*len(all_words)

    for w in sent1:
        if w in stopwords:
            continue
        vector1[all_words.index(w)]+=1

    for w in sent2:
        if w in stopwords:
            continue
        vector2[all_words.index(w)]+=1

    return 1-cosine_distance(vector1, vector2)
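
To make the bag-of-words comparison concrete, here is a minimal pure-Python sketch of the same idea. The helper names `cosine_similarity` and `bag_of_words_similarity` and the toy sentences are assumptions made for this example; the cosine is computed by hand instead of through NLTK, and the zero-vector guard is an addition that the function above does not have.

```python
import math

def cosine_similarity(v1, v2):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    # Assumption for this sketch: a sentence with no counted words
    # is treated as having zero similarity to everything.
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

def bag_of_words_similarity(sent1, sent2, stop_words=()):
    sent1 = [w.lower() for w in sent1]
    sent2 = [w.lower() for w in sent2]
    # One vector position per distinct word across both sentences.
    vocab = sorted(set(sent1 + sent2))
    index = {word: i for i, word in enumerate(vocab)}
    v1 = [0] * len(vocab)
    v2 = [0] * len(vocab)
    for w in sent1:
        if w not in stop_words:
            v1[index[w]] += 1
    for w in sent2:
        if w not in stop_words:
            v2[index[w]] += 1
    return cosine_similarity(v1, v2)

a = ["Tourism", "drives", "innovation"]
b = ["Innovation", "drives", "tourism"]
c = ["The", "fair", "is", "international"]

sim_ab = bag_of_words_similarity(a, b)  # same words, different order and case
sim_ac = bag_of_words_similarity(a, c)  # no words in common
print(sim_ab, sim_ac)
```

Because word order is ignored, `a` and `b` score as nearly identical, while `a` and `c` share no words and score zero.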

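A function to create the similarity matrix is defined below. A square matrix of zeros is created first to hold the similarity values. Each pair of distinct sentences is then scored; the diagonal stays at zero because a sentence is never compared with itself.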

In [None]:
def build_similarity_matrix(sentences, stop_words):
    similarity_matrix = np.zeros((len(sentences), len(sentences)))

    for idx1 in range(len(sentences)):
        for idx2 in range(len(sentences)):
            if idx1 == idx2:
                continue
            similarity_matrix[idx1][idx2] = sentence_similarity(sentences[idx1], sentences[idx2], stop_words)

    return similarity_matrix
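
The shape of the resulting matrix can be illustrated with plain lists. In this sketch `build_matrix` and `overlap` are hypothetical names, and `overlap` is a simple shared-word ratio standing in for the cosine-based similarity, chosen only to keep the example short.

```python
def build_matrix(sentences, similarity):
    """Pairwise similarity matrix with a zero diagonal."""
    n = len(sentences)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:  # a sentence is not compared with itself
                matrix[i][j] = similarity(sentences[i], sentences[j])
    return matrix

def overlap(s1, s2):
    # Stand-in similarity: shared words divided by all distinct words.
    a, b = set(s1), set(s2)
    return len(a & b) / len(a | b)

toy = [
    ["tourism", "fair"],
    ["tourism", "industry"],
    ["new", "technology"],
]
m = build_matrix(toy, overlap)
for row in m:
    print(row)
```

The matrix is symmetric whenever the similarity function is, which is why it can later be read as an undirected weighted graph.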

A function to summarize the text is defined below. The English stopwords are stored in a variable, and an empty list is created to hold the summary sentences. The text is read using the function defined previously, and the sentence similarity matrix is generated from it. The matrix is converted to a graph, PageRank assigns each sentence a score, and the sentences are sorted by score in descending order. Only the top `top_n` sentences are used for the summary.

In [None]:
def generate_summary(file_name, top_n):
    stop_words = stopwords.words('english')
    summarize_text = []

    # Step 1 - Read the text and split it into tokenized sentences
    sentences = read_article(file_name)

    # Step 2 - Generate the similarity matrix across sentences
    sentence_similarity_matrix = build_similarity_matrix(sentences, stop_words)

    # Step 3 - Rank the sentences with PageRank on the similarity graph
    sentence_similarity_graph = nx.from_numpy_array(sentence_similarity_matrix)
    scores = nx.pagerank(sentence_similarity_graph)

    # Step 4 - Sort by score in descending order
    ranked_sentence = sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)
    print("Ranked sentences with their scores: ", ranked_sentence)

    # Never ask for more sentences than the article contains
    for i in range(min(top_n, len(ranked_sentence))):
        summarize_text.append(" ".join(ranked_sentence[i][1]))

    # Step 5 - Output the summarized text
    print("Text Summary: \n", ". ".join(summarize_text))
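
The ranking step can be illustrated with a simplified power-iteration version of PageRank. This is a sketch under stated assumptions, not NetworkX's implementation: `pagerank_sketch` is a hypothetical helper, the 3x3 matrix is invented, the iteration count is fixed rather than tested for convergence, and the damping factor 0.85 matches the default `alpha` of `nx.pagerank`.

```python
def pagerank_sketch(weights, damping=0.85, iterations=100):
    """Simplified weighted PageRank on a square similarity matrix."""
    n = len(weights)
    scores = [1.0 / n] * n
    row_sums = [sum(row) for row in weights]
    for _ in range(iterations):
        new_scores = []
        for j in range(n):
            incoming = 0.0
            for i in range(n):
                if row_sums[i] > 0:
                    # Node i passes on its score in proportion to edge weight.
                    incoming += scores[i] * weights[i][j] / row_sums[i]
                else:
                    # A node with no edges spreads its score evenly.
                    incoming += scores[i] / n
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores

# Invented similarity matrix: sentences 0 and 1 are close, sentence 2 is an outlier.
similarity = [
    [0.0, 0.8, 0.1],
    [0.8, 0.0, 0.1],
    [0.1, 0.1, 0.0],
]
scores = pagerank_sketch(similarity)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(scores)
print(ranking)
```

Sentences that are strongly similar to many others accumulate score, so the outlier sentence ends up last in the ranking.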


## **Data Preprocessing**

The text is read with the previously defined function and stored in a variable for further computation.

In [None]:
sentences = read_article('data_4D.txt')

This paper focuses on an event for the main actors in e-tourism, called Tourism@
It is held on the French Riviera and is dedicated to new uses of Information and Communication Technology (ICT) in the tourism industry
It is a major international trade fair in Europe for innovative start up companies, high tech small and medium sized enterprises (SMEs), large multinationals and academics related to the tourism industry
Each edition of Tourism@ includes a competition aiming at awarding the best projects in terms of creativity and commitment to developing and implementing new technologies or new uses of ICT in the tourism industry
As far as Tourism@ has several unique characteristics in relation to innovation in tourism, it is an interesting case deserving an in-depth analysis
Innovation is playing an increasing role in services (Miles, 2001) and, unquestionably, is particularly important for the tourism industry (Hjalager, 2002)
Tourism has been one of the main drivers of Internet use in

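The similarity matrix is created using the previously defined function. It is then used to build the similarity graph. Because `stop_words=None` is passed here, stopwords are not filtered out of this matrix, unlike in `generate_summary`.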

In [None]:
sentence_similarity_matrix = build_similarity_matrix(sentences, stop_words=None)
sentence_similarity_graph = nx.from_numpy_array(sentence_similarity_matrix)

## **Text Summary**

The summary is printed below, along with the original sentences and the ranked list of sentences with their scores. The summary uses the 3 highest-scoring sentences.

In [None]:
generate_summary("data_4D.txt", 3)

This paper focuses on an event for the main actors in e-tourism, called Tourism@
It is held on the French Riviera and is dedicated to new uses of Information and Communication Technology (ICT) in the tourism industry
It is a major international trade fair in Europe for innovative start up companies, high tech small and medium sized enterprises (SMEs), large multinationals and academics related to the tourism industry
Each edition of Tourism@ includes a competition aiming at awarding the best projects in terms of creativity and commitment to developing and implementing new technologies or new uses of ICT in the tourism industry
As far as Tourism@ has several unique characteristics in relation to innovation in tourism, it is an interesting case deserving an in-depth analysis
Innovation is playing an increasing role in services (Miles, 2001) and, unquestionably, is particularly important for the tourism industry (Hjalager, 2002)
Tourism has been one of the main drivers of Internet use in

## **Scores**

All sentences are sorted in descending order by the PageRank scores computed on the similarity graph. They are then printed below.

In [None]:
scores = nx.pagerank(sentence_similarity_graph)
ranked_sentence = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)

In [None]:
for i in ranked_sentence:
  print(i, '\n')

(0.04880194724399263, ['The', 'aim', 'is', 'to', 'capture', 'the', 'evolution', 'of', 'innovative', 'activity', 'in', 'the', 'tourism', 'industry', 'through', 'the', 'empirical', 'analysis', 'of', 'the', 'Tourism@', 'event', 'and', 'the', 'annual', 'Tourism@', 'Awards', 'for', 'best', 'projects']) 

(0.045147634384120505, ['The', 'identification', 'and', 'selection', 'of', 'new', 'technologies', 'are', '“tricky', 'and', 'costly', 'processes”', '(Maskell', 'et', 'al.,', '2005:2)', 'and', 'especially', 'when', 'related', 'to', 'such', 'a', 'complex', 'and', 'heterogeneous', 'activity', 'as', 'it', 'is', 'the', 'case', 'in', 'the', 'tourism', 'industry']) 

(0.044283320293213156, ['This', 'article', 'exploits', 'the', 'database', 'of', 'the', 'competing', 'projects', 'to', 'analyse', 'the', 'dynamics', 'of', 'innovation', 'in', 'tourism']) 

(0.044204337674503325, ['It', 'is', 'held', 'on', 'the', 'French', 'Riviera', 'and', 'is', 'dedicated', 'to', 'new', 'uses', 'of', 'Information', 'an