Unsupervised-Thai-Document-Clustering-with-Sanook-news

TL;DR: This work builds an unsupervised model for clustering Thai news into 10 categories. It uses TF-IDF and SimCSE-WangchanBERTa embeddings weighted by the number of named entities as vector representations, and k-means as the clustering model.

Problem statement

Create an unsupervised model to cluster Sanook news into 10 categories.

Dataset

Method

1. Vector representation

1.1 Vector representation using Bag-of-Words (TF-IDF)

I create a vector representation using Bag-of-Words (TF-IDF) and use it as a baseline.

Bag-of-Words

  • Text cleaning: remove links, symbols, numbers, and special characters
  • Word tokenization: newmm (dictionary-based, Maximum Matching + Thai Character Cluster)
  • TF-IDF vectorization (see the sketch after this list)
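
A minimal sketch of this TF-IDF pipeline, assuming PyThaiNLP and scikit-learn are installed and that the cleaned news texts are held in a hypothetical list cleaned_texts:

from pythainlp.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

# Tokenize Thai text with the dictionary-based newmm engine
def thai_tokenize(text):
    return word_tokenize(text, engine="newmm", keep_whitespace=False)

# Build TF-IDF document vectors from the tokenized Thai words
vectorizer = TfidfVectorizer(tokenizer=thai_tokenize, token_pattern=None)
tfidf_vectors = vectorizer.fit_transform(cleaned_texts)  # shape: (n_documents, vocab_size)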

1.2 Vector representation using Transformer model

Transformer model

  • Text cleaning: remove link, symbols, numbers, special characters
  • Sentence tokenization: CRF
  • Sentence embedding: The best model is WangchanBERTa with SimCSE.
  • Weighting by the number of named entities: after the sentences are embedded into vectors by the Transformer model, each sentence vector is weighted by the number of named entities of particular types in that sentence, and the document vector representation is computed with the formulas below (a sketch follows the formulas).

$$ v_{d} = \frac {\sum_{s \in d}w_{s} \times v_{s}} {\sum w_{s}} $$

$$ w_{s} = n_{s} + 1$$

where $n_{s}$ denotes the number of named entities of particular types in sentence $s$. This weighting scheme is adopted from https://ieeexplore.ieee.org/document/9085059.
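
A minimal sketch of this weighting, assuming the sentence embeddings of one document are stacked in a hypothetical array sentence_vectors of shape (n_sentences, 768) and ne_counts holds the number of named entities per sentence:

import numpy as np

def document_vector(sentence_vectors, ne_counts):
    # w_s = n_s + 1, so sentences without named entities still contribute
    weights = np.asarray(ne_counts) + 1
    # v_d = sum_s(w_s * v_s) / sum_s(w_s): weighted average of sentence vectors
    return (weights[:, None] * sentence_vectors).sum(axis=0) / weights.sum()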

2. Clustering model

After we obtain the vector representation, we use it as the clustering features. I used simple k-means clustering, following the code below.

from sklearn.cluster import KMeans

# 10 clusters, one per expected news category
k = 10
km = KMeans(n_clusters=k, max_iter=100, n_init=55)
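
A minimal usage sketch, assuming the document vectors from step 1 are stacked in a hypothetical matrix doc_vectors of shape (n_documents, embedding_dim):

# Fit k-means and assign a cluster id (0-9) to each document
cluster_ids = km.fit_predict(doc_vectors)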

How to run code

  • For web scraping (you can skip this step; we have already downloaded the data for you)

    • Install the libraries by running pip install -r requirements.txt
    • Download chromedriver.exe and put it in the project directory.
    • Then run the sanook_web_scraping.ipynb notebook in that environment.
  • Document clustering

    • Run the Document_clustering.ipynb notebook on Google Colab. It contains:
      • Text preprocessing
      • Text representation
        • Bag-of-Words
        • Transformer Embedding
      • Clustering model
      • Evaluation
      • Error analysis

Results

The class of each cluster is chosen as the most frequent true label within that cluster.
The predictions are then compared with the labels, using accuracy as the evaluation metric (a sketch follows the results table below).

Vector representation techniques Acc
TF-IDF 0.8216
SimCSE WangchanBERTa 0.8330
SimCSE WangchanBERTa Weighted with number of Named-Entities 0.8445
SimCSE WangchanBERTa finetuned Weighted with number of Named-Entities 0.7368
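
A minimal sketch of this evaluation, assuming cluster_ids comes from the k-means step above and a hypothetical array true_labels holds the ground-truth category of each document:

import numpy as np
from sklearn.metrics import accuracy_score

true_labels = np.asarray(true_labels)
# Map each cluster to its most frequent true label (majority vote)
cluster_to_label = {}
for c in np.unique(cluster_ids):
    labels_in_cluster = true_labels[cluster_ids == c]
    values, counts = np.unique(labels_in_cluster, return_counts=True)
    cluster_to_label[c] = values[np.argmax(counts)]

# Turn cluster ids into predicted labels and score against the ground truth
predictions = np.array([cluster_to_label[c] for c in cluster_ids])
print(accuracy_score(true_labels, predictions))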

Discussion

  • I tried several Transformer models (BERT, RoBERTa, and WangchanBERTa) by adding a pooling layer to obtain embedding vectors of shape (number_of_samples, 768), but they did not perform well on this task.
  • SimCSE improves the model's performance.
  • The SimCSE model weighted by the number of named entities performs best in my experiments.

Future work

  • Try other clustering models (e.g., hierarchical clustering, DBSCAN)
  • Try dimensionality reduction methods (e.g., PCA)
  • Try other weighting schemes
  • Try vector representation with the Doc2vec method
  • Try soft clustering (topic modeling) (e.g., LDA)

Acknowledgements
