Skip to content

shirbhargav/Graph-based-Extractive-Text-Summarization-Sentence-Scoring-Scheme-for-Big-Data-Applications

Repository files navigation

Extractive Text Summarization

Text Summarization is an emerging field in Natural Language Process (NLP) domain. It is very full for summarizing full-length document automatically. We have proposed the statistical model for the text summarization process. Because of features like easy-to- use, less complex, low computation resources requirement. We have also used a graph-based model to establish relations between sentences, and mathematical operations to establish a relation between words and sentences. We have used documents from the “WiKiHow” dataset to test our model. We have also demonstrated the effect of the sentence clustering method by including the graph clustering stage in the post-processing phase. We have used the GraphFrame module from the Apache Spark environment as it has the ability to parallel the execution of graph-based operations. To evaluate our model we have used Recall-oriented Understudying Gisting Evaluation (ROUGE) parameters. Based on the obtained results, it is clear that graph clustering is dependent on the type of documents provided in the dataset.

Brief Architechture

App Screenshot

Dataset

For this project we have used 'WikiHow Dataset'. The dataset is introduced in https://arxiv.org/abs/1810.09305. Please refer to the paper for more information regarding the dataset and its properties.

Each article consists of multiple paragraphs and each paragraph starts with a sentence summarizing it. By merging the paragraphs to form the article and the paragraph outlines to form the summary, the resulting version of the dataset contains more than 200,000 long-sequence pairs.

Sample of first 5000 article and Sentence CSVs and Reference CSV can be found in Sample Dataset folder.

Full Dataset Link: Link

Repo Link: Link

Deployment

  • Extract the Sentance CSV and Reference Summary CSV using WikiHow Topic Extractor.ipynb
  • Import CSVs to Apache Spark
  • Use Spark based tables to generate summary from using Extractive Text Summarization.ipynb

Requirement:

Hardware:

  • Intel i5 and Higher
  • 4+ Cores @2.80 GHz
  • 8 GB Ram

Software

  • Language: Python v3.8.5+

  • Jupyter Notebook

  • Packages:

    • Pandas v1.1.3+
    • Numpy v1.19.2+
    • NLTK v3.5+
    • Rake NLTK v3.5+
    • ROUGE v1.0.4+
    • PySpark v 3.1.0+
    • Graphframe v0.8.1+

Tech Stack

Python, Apache Spark, GraphFrames

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages