# Liputan6.id Dataset: Summarization

This project was carried out with @Roby Koeswojo, @Satriavi Dananjaya, and @Rijal Abdulhakim as part of Indonesia AI's Portfolio Project, with Text Summarization as the main topic.

The dataset used in this project was presented in the paper *"Liputan6: A Large-scale Indonesian Dataset for Text Summarization"* by Koto et al. (2020). According to the paper, it contains 193,883 `train`, 10,972 `dev`, and 10,972 `test` documents. In the summaries, 16.1% of unigrams, 52.5% of bigrams, 71.8% of trigrams, and 82.4% of 4-grams are novel, meaning they do not appear in the source article. The vocabulary size is 311K for the articles and 100K for the summaries.
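
To make the "novel n-gram" figures more concrete, here is a minimal sketch of how such a ratio can be computed for one article/summary pair. The function names and the toy sentences are my own, and I am assuming the ratio is taken over distinct n-grams; the paper's exact counting procedure may differ.

```python
def ngram_set(tokens, n):
    """Return the set of distinct n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def novel_ngram_ratio(article_tokens, summary_tokens, n):
    """Share of distinct summary n-grams that never occur in the article."""
    summary_grams = ngram_set(summary_tokens, n)
    if not summary_grams:
        return 0.0
    article_grams = ngram_set(article_tokens, n)
    return len(summary_grams - article_grams) / len(summary_grams)


# Toy example (hypothetical sentences, not taken from the dataset).
article = "banjir merendam puluhan rumah warga di kutai".split()
summary = "banjir merendam rumah di kutai barat".split()

ratios = {n: novel_ngram_ratio(article, summary, n) for n in (1, 2, 3)}
```

The higher the ratio, the more abstractive the summary is: a value of 1.0 for trigrams means no three-word sequence was copied verbatim from the article.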

It is worth noting that the authors provide two versions of the dataset, namely Canonical and Extreme. The one I use here is Canonical, which is the heavier, "original" version with the statistics given above. I should also add that the data originally came in JSON format, which is versatile but memory-hungry, so I converted it to Parquet to save time and computing resources.

In [6]:
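def read_parquet():
    # Testing a lazy (function-level) import! Might be handy for OOP...
    import polars as pl

    # Read with Polars, then convert to pandas.
    # to_pandas() needs pandas and pyarrow installed, but no pandas import here.
    df = pl.read_parquet('train.parquet').to_pandas()
    return df

# Keep the result: the following cells use `df`.
df = read_parquet()
df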

| | id | url | clean_article | clean_summary | extractive_summary |
|---|---|---|---|---|---|
| 0 | 100000.0 | https://www.liputan6.com/news/read/100000/yudh... | [[Liputan6, ., com, ,, Jakarta, :, Presiden, S... | [[Menurut, Presiden, Susilo, Bambang, Yudhoyon... | [0, 1] |
| 1 | 100002.0 | https://www.liputan6.com/news/read/100002/jepa... | [[Liputan6, ., com, ,, Jakarta, :, Perdana, Me... | [[Pada, masa, silam, Jepang, terlalu, ambisius... | [2, 3] |
| 2 | 100003.0 | https://www.liputan6.com/news/read/100003/pulu... | [[Liputan6, ., com, ,, Kutai, :, Banjir, denga... | [[Puluhan, hektare, areal, persawahan, yang, s... | [1, 5] |
| 3 | 100004.0 | https://www.liputan6.com/news/read/100004/pres... | [[Liputan6, ., com, ,, Jakarta, :, Presiden, S... | [[Sekjen, PBB, Kofi, Annan, memuji, langkah, P... | [2, 5] |
| 4 | 100005.0 | https://www.liputan6.com/news/read/100005/warg... | [[Liputan6, ., com, ,, Solok, :, Warga, Kampun... | [[Untuk, mempercepat, pelaksanaan, belajar-men... | [0, 2] |
| ... | ... | ... | ... | ... | ... |
| 193878 | 99995.0 | https://www.liputan6.com/news/read/99995/banji... | [[Liputan6, ., com, ,, Kutai, :, Banjir, yang,... | [[Sebanyak, 25, kecamatan, di, Kutai, Barat, d... | [1, 4, 3] |
| 193879 | 99996.0 | https://www.liputan6.com/news/read/99996/lima-... | [[Liputan6, ., com, ,, Kabupaten, Gowa, :, Lim... | [[Ribuan, kubik, lumpur, dari, Gunung, Bawakar... | [3, 6] |
| 193880 | 99997.0 | https://www.liputan6.com/news/read/99997/kawas... | [[Liputan6, ., com, ,, Nias, :, Sejumlah, desa... | [[Kawasan, paling, utara, di, Pulau, Nias, ,, ... | [1, 2, 6, 3] |
| 193881 | 99998.0 | https://www.liputan6.com/news/read/99998/kebak... | [[Liputan6, ., com, ,, Bogor, :, Kebakaran, di... | [[Dari, bukti-bukti, di, lapangan, ,, kebakara... | [0, 3] |


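Reading the Parquet file with Polars and converting it to pandas, with a lazy (function-level) import. The import only runs when the function is called, which is handy when the library is not always needed!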
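
As a small aside on what a function-level import actually does, here is a sketch using the standard-library `colorsys` module as a stand-in for a heavy library (the module choice and function name are just for illustration). The module is loaded on the first call and cached by Python, but the imported name stays local to the function.

```python
import sys


def red_as_hsv():
    # The import statement runs only when the function is called.
    import colorsys
    return colorsys.rgb_to_hsv(1.0, 0.0, 0.0)


hsv = red_as_hsv()

# The module object is now cached for the whole process...
module_is_cached = "colorsys" in sys.modules
# ...but the name `colorsys` was bound inside the function only.
name_is_global = "colorsys" in globals()
```

In other words, a later cell that wants to call the library directly still needs its own `import` line; that import is cheap, because Python reuses the cached module.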

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193883 entries, 0 to 193882
Data columns (total 5 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   id                  193883 non-null  float64
 1   url                 193883 non-null  object 
 2   clean_article       193883 non-null  object 
 3   clean_summary       193883 non-null  object 
 4   extractive_summary  193883 non-null  object 
dtypes: float64(1), object(4)
memory usage: 7.4+ MB


In [8]:
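# The last URL segment is the article slug: drop '.html' and turn hyphens into spaces.
df['clean_headline'] = df['url'].apply(lambda x: x.split('/')[-1].replace('.html','').replace('-',' '))
# The second-to-last segment is the numeric article ID, kept as a string instead of float64.
df['id'] = df['url'].apply(lambda x: x.split('/')[-2])
df = df.drop(columns=['url'])
df = df[['id','clean_headline','clean_article','clean_summary','extractive_summary']]
df = df.set_index('id')

df.head(1)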

| id | clean_headline | clean_article | clean_summary | extractive_summary |
|---|---|---|---|---|
| 100000 | yudhoyono berharap masalah kemiskinan menjadi ... | [[Liputan6, ., com, ,, Jakarta, :, Presiden, S... | [[Menurut, Presiden, Susilo, Bambang, Yudhoyon... | [0, 1] |


`df.info()` shows no null values in the dataframe, but `id` needlessly uses the `float64` data type. The `url` column, however, contains two parts that are useful for the analysis. First, the numeric part of the `url` can replace `id` as a string. Second, the part after it is actually the title of the article, which can serve as the article's headline. This also sets up a chain of conditional probabilities: if a word is present in the article, it has some probability of being included in the summary. Similarly, if that word is present in the summary, it has some probability of being included in the headline.
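
The conditional probabilities mentioned above can be estimated by simple counting. The sketch below is only an illustration under my own assumptions: documents are flat token lists, each distinct word is counted once per document, and the function name and toy data are hypothetical.

```python
def share_carried_over(source_docs, target_docs):
    """Estimate P(word in target | word in source) over paired documents."""
    present = 0  # distinct source words, summed over documents
    carried = 0  # those that also appear in the paired target
    for source, target in zip(source_docs, target_docs):
        source_vocab = set(source)
        present += len(source_vocab)
        carried += len(source_vocab & set(target))
    return carried / present if present else 0.0


# Toy data (not taken from the dataset).
articles = [["presiden", "berharap", "kemiskinan", "turun"],
            ["banjir", "merendam", "sawah"]]
summaries = [["presiden", "kemiskinan"],
             ["banjir", "sawah", "rusak"]]
headlines = [["presiden", "berharap"],
             ["banjir"]]

p_summary_given_article = share_carried_over(articles, summaries)
p_headline_given_summary = share_carried_over(summaries, headlines)
```

Here 4 of the 7 article words reach a summary, and 2 of the 5 summary words reach a headline.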

The next step is to remove stopwords using Louis Owen's stoplist, plus some additional words that I added. You can access this stoplist at https://github.com/louisowen6/NLP_bahasa_resources/blob/master/combined_stop_words.txt.

However, the sheer size of this dataset creates a memory problem, and I tried my very best to work around it on both Kaggle and Google Colab. First, I tried to save each part (headline, article, and summary) as a scikit-learn sparse matrix using `CountVectorizer`, but even the "mere" headline data exceeded both Kaggle's RAM limit (30 GB) and Colab's disk limit (100 GB). Second, I tried to use an Indonesian Hugging Face model to tokenize these lists for EDA, but `clean_article` hit the RAM limit and reset the kernel.

The only way left, then, is to use an n-gram language model (which also means zero vectorization) with either lazy or parallel execution. `pandas` runs on a single core with eager execution, which, combined with an absolutely inadequate laptop specification, means hours and hours of compute time per cell (yes, I did try to create a vocabulary from all three lists using `stack().unique()`, `count()`, and `isin()`, and those statements finished in 298 minutes). I need something to optimize this task further, and my (current) solution is to use `joblib` to parallelize the `pandas` apply step.

In case you're wondering why I didn't use the `pandarallel` library: I did try it, but ran into a system memory problem. It seems that `pandarallel` has issues accessing specific memory on Windows, but that's way beyond my knowledge for now.
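
To illustrate the vectorization-free, lazy n-gram counting idea described above, here is a minimal standard-library sketch. It is not the code used in this notebook: the function names and toy lines are hypothetical, and it assumes each document is already a list of tokens.

```python
from collections import Counter
from itertools import islice


def iter_ngrams(tokens, n):
    """Yield n-grams lazily; `tokens` must be a list or another sequence."""
    return zip(*(islice(tokens, i, None) for i in range(n)))


def count_ngrams(documents, n):
    """Update one Counter document by document; no matrix is ever built."""
    counts = Counter()
    for tokens in documents:
        counts.update(iter_ngrams(tokens, n))
    return counts


lines = ["banjir merendam sawah", "banjir merendam rumah"]

# The documents are passed as a generator, so only one tokenized
# document is held in memory at a time.
unigram_counts = count_ngrams((line.split() for line in lines), 1)
bigram_counts = count_ngrams((line.split() for line in lines), 2)
```

Memory use then grows with the number of distinct n-grams rather than with documents times vocabulary, which is the main appeal when a full document-term matrix does not fit in RAM.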

Let's start with removing stopwords.

In [None]:
def clean_text(texts):
    from joblib import Parallel, delayed  # Lazy import and parallelization

    with open('combined_stop_words.txt', 'r') as f:
        # A set gives constant-time membership checks.
        stop_words = set(f.read().splitlines())

    # Extra tokens to drop: site names plus leftovers of the ndarray string form.
    words_to_clean = ['liputan6', 'liputan', 'com', 'array', 'dtype', 'object', 'dtypeobject', 'sctv']
    stop_words.update(words_to_clean)

    def remove_stop_words(text):
        # Exact, case-sensitive match on whitespace-separated tokens.
        return ' '.join(word for word in text.split() if word not in stop_words)

    # One task per document, spread over all available cores.
    return Parallel(n_jobs=-1)(delayed(remove_stop_words)(text) for text in texts)

# astype(str) is necessary because each cell is, somehow, a numpy.ndarray rather than a string.
df['clean_article'] = clean_text(df['clean_article'].astype(str))
df['clean_summary'] = clean_text(df['clean_summary'].astype(str))
df['extractive_summary'] = clean_text(df['extractive_summary'].astype(str))
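
Since each cell appears to hold a nested list of tokens (one inner list per sentence), one possible alternative is to filter the tokens directly instead of going through `astype(str)`. The sketch below is only that, a sketch: the function name is hypothetical, the stop word set is a toy stand-in for the real stoplist, and lowercasing before the lookup is my own assumption.

```python
def remove_stopwords(sentences, stop_words):
    """Drop stop words from a document given as a list of token lists."""
    return [
        [token for token in sentence if token.lower() not in stop_words]
        for sentence in sentences
    ]


# Toy stand-in for the combined stoplist plus the extra words.
stop_words = {"liputan6", "com", "yang", "di"}

document = [
    ["Liputan6", ".", "com", ",", "Jakarta"],
    ["Banjir", "yang", "merendam", "sawah", "di", "Kutai"],
]

cleaned = remove_stopwords(document, stop_words)
```

This keeps the sentence structure intact and leaves punctuation tokens in place; they would need their own filter if they are not wanted.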

In [21]:
# A function-level import does not define `pd` globally, so import pandas here.
import pandas as pd

clean_headline = pd.read_pickle('clean_headline.pkl').to_list()
clean_article = pd.read_pickle('clean_article.pkl').to_list()
clean_summary = pd.read_pickle('clean_summary.pkl').to_list()

In [26]:
def ngram_analysis(x):
    import pandas as pd
    from nltk.util import ngrams
    from nltk.probability import FreqDist
    from nltk.tokenize import word_tokenize

    token = word_tokenize(' '.join(x))

    # Passing the FreqDist objects straight to pl.DataFrame failed with
    # "TypeError: 'tuple' object cannot be converted to 'PyString'",
    # so each n-gram tuple is joined into a string and stored as one row.
    rows = []
    for name, n in [('vocab', 1), ('bigram', 2), ('trigram', 3)]:
        freq = FreqDist(ngrams(token, n))
        rows.extend((name, ' '.join(gram), count) for gram, count in freq.items())

    return pd.DataFrame(rows, columns=['type', 'ngram', 'count'])

ngram_analysis(clean_headline)