Unsupervised-Thai-Document-Clustering-with-Sanook-news

TL;DR: This work builds an unsupervised model for clustering Thai news into 10 categories. It uses TF-IDF and SimCSE-WangchanBERTa embeddings weighted by the number of named entities as vector representations, and k-means as the clustering model.

Problem statement

Create an unsupervised model to cluster Sanook news into 10 categories.

Dataset

Method

1. Vector representation

1.1 Vector representation using Bag-of-Words (TF-IDF)

I create a vector representation using Bag-of-Words (TF-IDF) and use it as a baseline.

Bag-of-Words

  • Text cleaning: remove links, symbols, numbers, and special characters
  • Word tokenization: newmm (dictionary-based, Maximum Matching + Thai Character Cluster)
  • TF-IDF vectorization (see the sketch after this list)
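
A minimal sketch of this TF-IDF pipeline, assuming PyThaiNLP and scikit-learn are installed and that the cleaned news texts are held in a hypothetical list cleaned_texts:

from pythainlp.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

# Tokenize Thai text with the dictionary-based newmm engine
def thai_tokenize(text):
    return word_tokenize(text, engine="newmm", keep_whitespace=False)

# Build TF-IDF document vectors from the tokenized Thai words
vectorizer = TfidfVectorizer(tokenizer=thai_tokenize, token_pattern=None)
tfidf_vectors = vectorizer.fit_transform(cleaned_texts)  # shape: (n_documents, vocab_size)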

1.2 Vector representation using Transformer model

Transformer model

  • Text cleaning: remove link, symbols, numbers, special characters
  • Sentence tokenization: CRF
  • Sentence embedding: The best model is WangchanBERTa with SimCSE.
  • Weighting by the number of named entities: after the sentences are embedded into vectors by the Transformer model, each sentence vector is weighted by the number of named entities of particular types in that sentence, and the document vector representation is computed with the formulas below (a sketch follows the formulas).

$$ v_{d} = \frac {\sum_{s \in d}w_{s} \times v_{s}} {\sum w_{s}} $$

$$ w_{s} = n_{s} + 1$$

where $n_{s}$ denotes the number of named entities of particular types in sentence $s$. This weighting scheme is adopted from https://ieeexplore.ieee.org/document/9085059.
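
A minimal sketch of this weighting, assuming the sentence embeddings of one document are stacked in a hypothetical array sentence_vectors of shape (n_sentences, 768) and ne_counts holds the number of named entities per sentence:

import numpy as np

def document_vector(sentence_vectors, ne_counts):
    # w_s = n_s + 1, so sentences without named entities still contribute
    weights = np.asarray(ne_counts) + 1
    # v_d = sum_s(w_s * v_s) / sum_s(w_s): weighted average of sentence vectors
    return (weights[:, None] * sentence_vectors).sum(axis=0) / weights.sum()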

2. Clustering model

After we obtain the vector representation, we use it as the clustering features. I used simple k-means clustering, following the code below.

from sklearn.cluster import KMeans

# 10 clusters, one per expected news category
k = 10
km = KMeans(n_clusters=k, max_iter=100, n_init=55)
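
A minimal usage sketch, assuming the document vectors from step 1 are stacked in a hypothetical matrix doc_vectors of shape (n_documents, embedding_dim):

# Fit k-means and assign a cluster id (0-9) to each document
cluster_ids = km.fit_predict(doc_vectors)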

How to run code

  • For web scraping (you can skip this step; we have already downloaded the data for you)

    • Install the libraries by running pip install -r requirements.txt
    • Download chromedriver.exe and put it in the project directory.
    • Then run the sanook_web_scraping.ipynb notebook in that environment.
  • Document clustering

    • Run the Document_clustering.ipynb notebook on Google Colab. It contains:
      • Text preprocessing
      • Text representation
        • Bag-of-Words
        • Transformer Embedding
      • Clustering model
      • Evaluation
      • Error analysis

Results

The class of each cluster is chosen as the most frequent true label within that cluster.
The predictions are then compared with the labels, using accuracy as the evaluation metric (a sketch follows the results table below).

Vector representation techniques Acc
TF-IDF 0.8216
SimCSE WangchanBERTa 0.8330
SimCSE WangchanBERTa Weighted with number of Named-Entities 0.8445
SimCSE WangchanBERTa finetuned Weighted with number of Named-Entities 0.7368
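
A minimal sketch of this evaluation, assuming cluster_ids comes from the k-means step above and a hypothetical array true_labels holds the ground-truth category of each document:

import numpy as np
from sklearn.metrics import accuracy_score

true_labels = np.asarray(true_labels)
# Map each cluster to its most frequent true label (majority vote)
cluster_to_label = {}
for c in np.unique(cluster_ids):
    labels_in_cluster = true_labels[cluster_ids == c]
    values, counts = np.unique(labels_in_cluster, return_counts=True)
    cluster_to_label[c] = values[np.argmax(counts)]

# Turn cluster ids into predicted labels and score against the ground truth
predictions = np.array([cluster_to_label[c] for c in cluster_ids])
print(accuracy_score(true_labels, predictions))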

Discussion

  • I tried several Transformer models (BERT, RoBERTa, and WangchanBERTa) by adding a pooling layer to obtain embedding vectors of shape (number_of_samples, 768), but they did not perform well on this task.
  • SimCSE improves the model's performance.
  • The SimCSE model weighted by the number of named entities performs best in my experiments.

Future work

  • Try other clustering models (e.g., hierarchical clustering, DBSCAN)
  • Try dimensionality reduction methods (e.g., PCA)
  • Try other weighting schemes
  • Try vector representation with the Doc2vec method
  • Try soft clustering (topic modeling) (e.g., LDA)

Acknowledgements
