This notebook is a demonstration for the "Vietnamese NLP Continual Learning" challenge. In the past, Underthesea has mainly focused on tuning models. With this project, we set ourselves a simple challenge: build a continual learning NLP system.

## The August 2021 Challenges: Vietnamese NLP Dataset for Continual Learning

The August 2021 Challenges include [part-of-speech tagging](https://en.wikipedia.org/wiki/Part-of-speech_tagging) and [dependency parsing](https://universaldependencies.org/).

### Create environment

In [None]:
%load_ext autoreload
%autoreload 2

# add the project folder to the import path
import os
import sys
from os.path import dirname, join
PROJECT_FOLDER = dirname(dirname(os.getcwd()))
sys.path.append(PROJECT_FOLDER)

# add dependencies
from underthesea.utils.col_analyzer import UDAnalyzer, computeIDF
from underthesea.utils.col_script import UDDataset
from IPython.display import display, display_png
from wordcloud import WordCloud
from PIL import Image
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# init folder
DATASETS_FOLDER = join(PROJECT_FOLDER, "datasets")
COL_FOLDER = join(DATASETS_FOLDER, "UD_Vietnamese-COL")
raw_file = join(COL_FOLDER, "corpus", "raw", "202108.txt")

### Datasets

In [None]:
%%capture
raw_file = join(COL_FOLDER, "corpus", "raw", "202108.txt")
generated_dataset = UDDataset.load_from_raw_file(raw_file)

ud_file = join(COL_FOLDER, "corpus", "ud", "202108.txt")
ud_dataset = UDDataset.load(ud_file)

generated_dataset.merge(ud_dataset)
dataset = generated_dataset

dataset.write(ud_file)

In [None]:
analyzer = UDAnalyzer()

In [None]:
analyzer.analyze_dataset_len(dataset)
words_pos = analyzer.analyze_words_pos(dataset)
punctuations = set(words_pos[words_pos['pos'] == 'CH']['word'])

In [None]:
sent_ids = analyzer.analyze_sent_ids(dataset)

In [None]:
doc_sents = analyzer.analyze_doc_sent_freq(dataset)

In [None]:
x = [item[1] for item in doc_sents]
plt.hist(x, bins=40)
plt.xticks(np.arange(min(x), max(x)+1, 3))
plt.title("How many sentences were collected for each doc URL?")
plt.xlabel("Number of sentences")
plt.ylabel("Frequency")
plt.show()

#### Stopwords using IDF

In [None]:
doc_word_freqs = analyzer.get_doc_word_counters(dataset).values()
idfs = computeIDF(doc_word_freqs)
print("Words with lowest IDFs are candidates for Stopwords!")
stopwords_idf = {k: v for k, v in sorted(dict(idfs).items(), key=lambda x: x[1])[:40]}
stopwords_idf
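
The cell above relies on `computeIDF`, whose implementation is not shown in this notebook. As a rough illustration of the idea, here is a minimal sketch that assumes the textbook definition `idf(w) = log(N / df(w))`, where `N` is the number of documents and `df(w)` is the number of documents containing `w`. The function name `compute_idf_sketch` and the toy documents are hypothetical and may differ from what `computeIDF` actually does.

```python
import math
from collections import Counter


def compute_idf_sketch(doc_word_counters):
    """Return {word: idf} using idf = log(N / df).

    Hypothetical helper for illustration only; it is not underthesea's computeIDF.
    """
    docs = list(doc_word_counters)
    n_docs = len(docs)
    doc_freq = Counter()
    for counter in docs:
        # Iterating over the keys counts each word at most once per document.
        doc_freq.update(counter.keys())
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


# Toy per-document word counts (made up, written without diacritics).
toy_docs = [
    Counter({"la": 3, "bong_da": 2}),
    Counter({"la": 1, "kinh_te": 4}),
    Counter({"la": 2, "bong_da": 1, "covid": 5}),
]
toy_idfs = compute_idf_sketch(toy_docs)

# Words that occur in every document get the lowest IDF, so they sort first.
lowest = sorted(toy_idfs.items(), key=lambda kv: kv[1])[:2]
print(lowest)
```

In this toy corpus, `la` appears in all three documents, so it has the lowest IDF and is the first stopword candidate.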

#### Stopwords using Kullback-Leibler divergence

In [None]:
from underthesea.datasets import stopwords
",".join(sorted(stopwords.words))
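
The cell above only loads the stopword list bundled with underthesea; how that list was built is not shown here. For readers unfamiliar with the idea, the sketch below shows one possible way to score stopword candidates with a Kullback-Leibler style term weight. It assumes the score of a word is the mean of `p_doc(w) * log2(p_doc(w) / p_corpus(w))` over the documents that contain it. The function name `kl_term_scores` and the toy data are hypothetical, and this is not necessarily the procedure behind `stopwords.words`.

```python
import math
from collections import Counter


def kl_term_scores(doc_word_counters):
    """Score each word by its mean per-document KL contribution.

    Hypothetical sketch: words whose in-document probability matches their
    corpus-wide probability score near zero and are stopword candidates.
    """
    docs = [c for c in doc_word_counters if sum(c.values()) > 0]
    corpus = Counter()
    for counter in docs:
        corpus.update(counter)  # add up raw counts across documents
    corpus_total = sum(corpus.values())

    score_sums = Counter()
    docs_with_word = Counter()
    for counter in docs:
        doc_total = sum(counter.values())
        for word, count in counter.items():
            p_doc = count / doc_total
            p_corpus = corpus[word] / corpus_total
            score_sums[word] += p_doc * math.log2(p_doc / p_corpus)
            docs_with_word[word] += 1
    # Average only over the documents in which the word occurs.
    return {word: score_sums[word] / docs_with_word[word] for word in score_sums}


# Toy per-document word counts (made up, written without diacritics).
toy_docs = [
    Counter({"la": 2, "bong_da": 2}),
    Counter({"la": 2, "kinh_te": 2}),
]
toy_scores = kl_term_scores(toy_docs)

# The least informative words come first.
candidates = sorted(toy_scores, key=toy_scores.get)
print(candidates[0], toy_scores[candidates[0]])
```

Here `la` is spread evenly across both documents, so its score is zero, while the topical words `bong_da` and `kinh_te` score higher.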

### Actionable Insights

We want to explore:

* What are the word frequencies?
* What are today's word frequencies?
* How many words are in this corpus?
* What are the out-of-vocabulary words?
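
Most of these questions reduce to counting tokens. The sketch below is a minimal, library-independent illustration that assumes each sentence is already a list of word tokens and that a reference vocabulary is available as a set of lowercased words. The names `corpus_stats`, `toy_sentences`, and `toy_vocabulary` are hypothetical and are not part of `UDAnalyzer`. Today's frequencies would additionally need a date attached to each sentence, which this sketch leaves out.

```python
from collections import Counter


def corpus_stats(sentences, vocabulary):
    """Return (word counter, out-of-vocabulary counter) for tokenized sentences.

    Hypothetical helper: a word is out of vocabulary when its lowercased
    form is missing from the given vocabulary set.
    """
    counter = Counter(word for sentence in sentences for word in sentence)
    oov = Counter({w: c for w, c in counter.items() if w.lower() not in vocabulary})
    return counter, oov


# Toy tokenized sentences and vocabulary (made up, written without diacritics).
toy_sentences = [["Toi", "yeu", "Viet_Nam"], ["Toi", "hoc", "NLP"]]
toy_vocabulary = {"toi", "yeu", "hoc", "viet_nam"}

counter, oov = corpus_stats(toy_sentences, toy_vocabulary)
n_tokens = sum(counter.values())  # total number of word occurrences
n_types = len(counter)            # number of distinct words

print(counter.most_common(1), n_tokens, n_types, dict(oov))
```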

#### What are the words?

In [None]:
counter = analyzer.analyze_words(dataset)

#### Remove some (potential) stopwords to get a clearer word cloud

In [None]:
wordlist = [word for word in counter]
for word in wordlist:
    if word in stopwords_idf or word in punctuations:
        del counter[word]

In [None]:
w1 = WordCloud().generate_from_frequencies(counter)
plt.figure(figsize=(16, 12), dpi=50)
plt.imshow(w1, interpolation="bilinear")
plt.axis("off")
plt.show()

A word cloud of the most frequent words in this corpus.

#### What are today's words?

In [None]:
counter = analyzer.analyze_today_words(dataset)

#### Remove some (potential) stopwords to get a clearer word cloud

In [None]:
wordlist = [word for word in counter]
for word in wordlist:
    if word in stopwords_idf or word in punctuations:
        del counter[word]

In [None]:
w1 = WordCloud().generate_from_frequencies(counter)
plt.figure(figsize=(16, 12), dpi=50)
plt.imshow(w1, interpolation="bilinear")
plt.axis("off")
plt.show()

#### Trending News

In [None]:
%%javascript
    var script = document.createElement('script');
    script.type = 'text/javascript';
    script.src = '//cdnjs.cloudflare.com/ajax/libs/d3/7.0.1/d3.min.js';
    document.head.appendChild(script);
    console.log(window.d3)
    
    var script = document.createElement('script');
    script.type = 'text/javascript';
    script.src = '//cdnjs.cloudflare.com/ajax/libs/jquery/3.6.0/jquery.min.js';
    document.head.appendChild(script);
    console.log(window.$)

In [None]:
from IPython.display import Javascript
from ui import generate_svg_script
svg_script = generate_svg_script(dataset.get_by_sent_id("1142").get_ud_str())

Javascript(svg_script)

Trending News Today

## How to Contribute?

It's great that you find this project interesting ❤️. Even the smallest contribution is appreciated. Welcome to this exciting journey with us.

### You can contribute in so many ways!

* [Create more useful open datasets](https://github.com/undertheseanlp/underthesea/tree/master/datasets/UD_Vietnamese-COL)
* [Create more actionable insights](https://github.com/undertheseanlp/underthesea/tree/master/datasets/UD_Vietnamese-COL)