The data is published at https://github.com/duyvuleo/VNTC

# Introduction

In this tutorial, we will implement several algorithms for the text summarization problem.

## What is Text Summarization?

Text summarization is the problem of creating a short, accurate, and fluent summary of a longer text document.

Automatic text summarization methods are greatly needed to cope with the ever-growing amount of text available online: they help readers discover relevant information and consume it faster.

## What will we do in this tutorial?

In this tutorial, we will tackle text summarization for Vietnamese newspaper articles, using the algorithms below:
1. Extractive Text Summarization
    1. Doc2Vec
    2. Latent Semantic Analysis (LSA)
    3. Text Rank
2. Abstractive Text Summarization
    1. Google textsum


In this tutorial we only address the "**Single document summarization**" problem; the related "**Multi-document summarization**" problem will be discussed another time.

# Extractive Text Summarization

## Doc2Vec

### Basic idea
The idea of using the Doc2Vec algorithm for text summarization is as follows:
1. Split every document into individual sentences.
2. Represent each sentence as a vector using a Doc2Vec model.
3. Use the K-Means algorithm to find the most representative sentences.
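
The three steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the toy 2-D vectors stand in for real Doc2Vec sentence vectors, the helper names `kmeans` and `summarize` are hypothetical, and the small pure-Python K-Means is used only to keep the example dependency-free (in practice a library implementation such as scikit-learn's `KMeans` would likely be used instead). The selection rule shown here, picking the sentence nearest to each cluster centroid, is one common choice and not necessarily the exact rule used later in this tutorial.

```python
import math

def kmeans(points, k, n_iter=20):
    """Tiny Lloyd's-algorithm K-Means; returns a list of k centroids."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Deterministic initialisation: take k points spread evenly through the input.
    step = len(points) // k
    centroids = [list(points[i * step]) for i in range(k)]
    for _ in range(n_iter):
        # Assignment step: group points by their nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids

def summarize(sentences, vectors, k):
    """For each of the k clusters, keep the sentence closest to its centroid."""
    centroids = kmeans(vectors, k)
    chosen = {
        min(range(len(vectors)), key=lambda i: math.dist(vectors[i], c))
        for c in centroids
    }
    # Restore the original sentence order so the summary reads naturally.
    return [sentences[i] for i in sorted(chosen)]

# Toy data: six "sentences" whose vectors form two well-separated groups.
toy_sentences = ["s0", "s1", "s2", "s3", "s4", "s5"]
toy_vectors = [(0.0, 0.0), (0.2, 0.0), (0.1, 0.1),
               (5.0, 5.0), (5.2, 5.0), (5.1, 5.1)]
toy_summary = summarize(toy_sentences, toy_vectors, k=2)
print(toy_summary)
```

Here `k` plays the role of the desired summary length: each cluster contributes one sentence, so a two-cluster run on the toy data yields a two-sentence summary.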

In [1]:
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
from pyvi import ViTokenizer, ViPosTagger
from tqdm import tqdm
import numpy as np
import gensim

In [3]:
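import os

# The dataset is expected in a 'Data' folder located in the parent
# of the current working directory.
dir_path = os.path.dirname(os.path.realpath(os.getcwd()))
dir_path = os.path.join(dir_path, 'Data')

sentences = []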

In [4]:
import pickle

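def get_data(folder):
    """Read every file in `folder` and return its preprocessed sentences."""
    sentences = []
    for path in os.listdir(folder):
        file_path = os.path.join(folder, path)
        with open(file_path, 'r', encoding="utf-16") as f:
            for line in f:
                # Naive sentence split on full stops.
                for sen in line.split('.'):
                    # Skip fragments of 10 characters or fewer.
                    if len(sen) > 10:
                        # Lowercase and tokenize; punctuation and digits are dropped.
                        sen = gensim.utils.simple_preprocess(sen)
                        sen = ' '.join(sen)
                        # Vietnamese word segmentation: syllables of one word are joined by '_'.
                        sen = ViTokenizer.tokenize(sen)
                        sentences.append(sen)

    return sentences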

You could use multiprocessing here, but we do not, to keep the code easy to understand.
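
For readers who do want to parallelize, one possible shape is sketched below. It is a hedged illustration, not the code used in this tutorial: `flatten` and `collect_parallel` are hypothetical helper names, and the default of 8 worker processes is an arbitrary assumption. The sketch maps a worker function over a list of items with `multiprocessing.Pool` and merges the per-item lists into one flat list.

```python
from itertools import chain
from multiprocessing import Pool

def flatten(list_of_lists):
    """Merge a list of lists into one flat list, preserving order."""
    return list(chain.from_iterable(list_of_lists))

def collect_parallel(worker, items, processes=8):
    """Apply `worker` to every item in a process pool and flatten the results.

    `worker` must be picklable (a top-level function) and must return a list.
    """
    with Pool(processes) as pool:
        results = pool.map(worker, items)  # one list per item, in input order
    return flatten(results)
```

With the names from this notebook, a call might look like `sentences = collect_parallel(get_data, dirs)`. In a standalone script the call should sit under an `if __name__ == "__main__":` guard, and on platforms that spawn rather than fork worker processes the worker function may need to live in an importable module instead of a notebook cell.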

In [5]:
# from multiprocessing import Pool
# sentences = []
# train_paths = [os.path.join(dir_path, 'VNTC-master/Data/10Topics/Ver1.1/Train_Full'), 
#                os.path.join(dir_path, 'VNTC-master/Data/10Topics/Ver1.1/Test_Full'),
#                os.path.join(dir_path, 'VNTC-master/Data/27Topics/Ver1.1/new train'),
#                os.path.join(dir_path, 'VNTC-master/Data/27Topics/Ver1.1/new test')]

# dirs = []
# for path in train_paths:
#     for p in os.listdir(path):
#         dirs.append(os.path.join(path, p))

# for d in tqdm(dirs):
#     sens = get_data(d)
#     sentences = sentences + sens

# # with Pool(8) as pool:
# #     pool.map(get_data, tqdm(dirs))



In [6]:
# with open('./sentences.pkl', 'wb') as f:
#     pickle.dump(sentences, f)
with open('./sentences.pkl', 'rb') as f:
    sentences = pickle.load(f)

In [7]:
def get_corpus(sentences):
    """Wrap each sentence in a TaggedDocument tagged with its index."""
    corpus = []

    for i, sen in enumerate(tqdm(sentences)):
        words = sen.split(' ')
        tagged_document = gensim.models.doc2vec.TaggedDocument(words, [i])
        corpus.append(tagged_document)

    return corpus

In [8]:
train_corpus = get_corpus(sentences)

100%|██████████| 2385532/2385532 [00:24<00:00, 97517.13it/s] 


#### Build Doc2Vec model

In [9]:
# 300-dimensional vectors; ignore words that occur fewer than 2 times; 40 training epochs.
model = gensim.models.doc2vec.Doc2Vec(vector_size=300, min_count=2, epochs=40)
model.build_vocab(train_corpus)

In [10]:
%time model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)

KeyboardInterrupt: 

#### Test with new document