<a href="https://colab.research.google.com/github/tomonari-masada/course2024-nlp/blob/main/03_topic_modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

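# Topic Modeling
* Perform more advanced analysis of text data while staying within the bag-of-words representation.
* Use latent Dirichlet allocation (LDA).
* Use the implementation in scikit-learn.
  * gensim's `LdaModel` is not recommended, because the default value of `passes` is 1.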

**LDA only shows what it can do after the kind of tuning shown below.**

## Dataset
* Use `CShorten/ML-ArXiv-Papers` from Hugging Face.
  * https://huggingface.co/datasets/CShorten/ML-ArXiv-Papers

In [None]:
from datasets import load_dataset

ds = load_dataset("CShorten/ML-ArXiv-Papers")
# Hold out 10% of the "train" split as a test set; the fixed seed keeps the split reproducible.
ds = ds["train"].train_test_split(test_size=0.1, seed=1234)

In [None]:
ds

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0.1', 'Unnamed: 0', 'title', 'abstract'],
        num_rows: 105832
    })
    test: Dataset({
        features: ['Unnamed: 0.1', 'Unnamed: 0', 'title', 'abstract'],
        num_rows: 11760
    })
})

In [None]:
ds["train"]["title"][:20]

['Wind ramp event prediction with parallelized Gradient Boosted Regression\n  Trees',
 'Common Phone: A Multilingual Dataset for Robust Acoustic Modelling',
 'Adiabatic Persistent Contrastive Divergence Learning',
 'Decentralized Local Stochastic Extra-Gradient for Variational\n  Inequalities',
 'Fuzzy Dynamical Genetic Programming in XCSF',
 'Probabilistic Neural Network Training for Semi-Supervised Classifiers',
 'The Traveling Observer Model: Multi-task Learning Through Spatial\n  Variable Embeddings',
 'Online Continual Learning with Natural Distribution Shifts: An Empirical\n  Study with Visual Data',
 'Inferring clonal evolution of tumors from single nucleotide somatic\n  mutations',
 'Using a Binary Classification Model to Predict the Likelihood of\n  Enrolment to the Undergraduate Program of a Philippine University',
 'Rapid Structural Pruning of Neural Networks with Set-based Task-Adaptive\n  Meta-Pruning',
 'AI-MIA: COVID-19 Detection & Severity Analysis through Medical Imagi

## Counting word occurrences

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

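# Drop English stop words and any word that appears in fewer than 10 titles.
vectorizer = CountVectorizer(stop_words="english", min_df=10)
# Build the vocabulary from the training titles only, then reuse it for the test titles.
X_train = vectorizer.fit_transform(ds["train"]["title"])
X_test = vectorizer.transform(ds["test"]["title"])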

In [None]:
X_train.shape

(105832, 5308)

In [None]:
X_test.shape

(11760, 5308)

## LDA

* As a first step, simply run variational inference for LDA with the default settings.

In [None]:
from sklearn.decomposition import LatentDirichletAllocation

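lda = LatentDirichletAllocation(
  n_components=20,   # number of topics
  evaluate_every=1,  # compute perplexity on the training data after every iteration
  verbose=1,         # print the progress log
  random_state=123,  # fix the seed for reproducibility
)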
lda.fit(X_train)

iteration: 1 of max_iter: 10, perplexity: 2198.5085
iteration: 2 of max_iter: 10, perplexity: 1848.8412
iteration: 3 of max_iter: 10, perplexity: 1647.0992
iteration: 4 of max_iter: 10, perplexity: 1517.0774
iteration: 5 of max_iter: 10, perplexity: 1437.1730
iteration: 6 of max_iter: 10, perplexity: 1389.1073
iteration: 7 of max_iter: 10, perplexity: 1359.7666
iteration: 8 of max_iter: 10, perplexity: 1339.7824
iteration: 9 of max_iter: 10, perplexity: 1325.9726
iteration: 10 of max_iter: 10, perplexity: 1316.1240
2985.28


In [None]:
lda.perplexity(X_test)

1316.124040359834
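
As a reminder of what the number means, perplexity is the exponential of the negative log-likelihood per word. The sketch below computes it for a hand-made list of per-token probabilities. The function name `perplexity_from_probs` is hypothetical, and this is the textbook definition only: scikit-learn's `perplexity` method uses an approximate (variational) bound on the log-likelihood instead of the exact value.

```python
import math

def perplexity_from_probs(token_probs):
    """Return exp(-average log-probability) over a sequence of tokens."""
    log_likelihood = sum(math.log(p) for p in token_probs)
    return math.exp(-log_likelihood / len(token_probs))

# A model that assigns probability 1/4 to each of 8 tokens is "as uncertain as"
# a uniform choice among 4 words, so its perplexity is 4 (up to rounding).
uniform_ppl = perplexity_from_probs([0.25] * 8)
print(round(uniform_ppl, 6))

# A more confident model gets a lower perplexity.
confident_ppl = perplexity_from_probs([0.5] * 8)
print(round(confident_ppl, 6))
```

Lower is better, which is why the tuning in this notebook looks for the setting with the smallest perplexity on the held-out titles.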

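## Hyperparameter tuning
* Tune the hyperparameters so that the perplexity is minimized.
  * The number of topics (`n_components`) can arguably be chosen to suit your own needs.
* For a given number of topics, tune both `doc_topic_prior` and `topic_word_prior`.
  * When the number of topics changes, the best values of `doc_topic_prior` and `topic_word_prior` change as well.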
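
The search over these three hyperparameters can also be written so that the results are collected and the best setting is picked programmatically rather than read off the log. The following is a minimal sketch on small random count matrices; `toy_train`, `toy_test`, `results`, and the grid values are assumptions chosen only so that the example runs in a few seconds, not the values used in this notebook.

```python
from itertools import product

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical stand-ins for X_train / X_test: random word counts.
rng = np.random.default_rng(0)
toy_train = rng.integers(0, 5, size=(60, 30))
toy_test = rng.integers(0, 5, size=(20, 30))

results = []
for n_components, alpha, eta in product([2, 4], [0.5, 0.1], [0.1, 0.01]):
    model = LatentDirichletAllocation(
        n_components=n_components,
        doc_topic_prior=alpha,
        topic_word_prior=eta,
        max_iter=5,
        random_state=123,
    )
    model.fit(toy_train)
    # Score each setting on held-out data, as in the cells of this section.
    results.append((model.perplexity(toy_test), n_components, alpha, eta))

# The setting with the smallest held-out perplexity wins.
best = min(results, key=lambda r: r[0])
print(f"best: {best[1]} topics, alpha={best[2]}, eta={best[3]}, perplexity={best[0]:.2f}")
```

If the best value sits on the edge of a grid, shifting the grid in that direction and searching again is a natural next step.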

### 1

In [None]:
for n_components in [20, 30, 40, 50]:
  for doc_topic_prior in [0.2, 0.1, 0.05]:
    for topic_word_prior in [0.05, 0.02, 0.01]:
      lda = LatentDirichletAllocation(
        n_components=n_components,
        doc_topic_prior=doc_topic_prior,
        topic_word_prior=topic_word_prior,
        max_iter=20,
        evaluate_every=1,
        verbose=1,
        random_state=123,
      )
      lda.fit(X_train)
      print(f"-- test perplexity: {lda.perplexity(X_test):.2f}")
      print(f"---- {n_components} topics, alpha={doc_topic_prior:.4f}, eta={topic_word_prior:.4f}")


iteration: 1 of max_iter: 20, perplexity: 3015.0258
iteration: 2 of max_iter: 20, perplexity: 2365.6630
iteration: 3 of max_iter: 20, perplexity: 1963.0553
iteration: 4 of max_iter: 20, perplexity: 1710.8841
iteration: 5 of max_iter: 20, perplexity: 1554.1238
iteration: 6 of max_iter: 20, perplexity: 1453.7942
iteration: 7 of max_iter: 20, perplexity: 1386.5439
iteration: 8 of max_iter: 20, perplexity: 1340.1355
iteration: 9 of max_iter: 20, perplexity: 1307.6926
iteration: 10 of max_iter: 20, perplexity: 1284.4908
iteration: 11 of max_iter: 20, perplexity: 1267.2861
iteration: 12 of max_iter: 20, perplexity: 1254.3886
iteration: 13 of max_iter: 20, perplexity: 1244.6055
iteration: 14 of max_iter: 20, perplexity: 1237.3350
iteration: 15 of max_iter: 20, perplexity: 1231.7562
iteration: 16 of max_iter: 20, perplexity: 1227.3495
iteration: 17 of max_iter: 20, perplexity: 1223.8334
iteration: 18 of max_iter: 20, perplexity: 1220.9968
iteration: 19 of max_iter: 20, perplexity: 1218.7509
it

### 2

In [None]:
for n_components in [20, 30, 40]:
  for doc_topic_prior in [0.4, 0.3, 0.2]:
    for topic_word_prior in [0.01, 0.005, 0.002]:
      lda = LatentDirichletAllocation(
        n_components=n_components,
        doc_topic_prior=doc_topic_prior,
        topic_word_prior=topic_word_prior,
        max_iter=20,
        evaluate_every=1,
        verbose=1,
        random_state=123,
      )
      lda.fit(X_train)
      print(f"-- test perplexity: {lda.perplexity(X_test):.2f}")
      print(f"---- {n_components} topics, alpha={doc_topic_prior:.4f}, eta={topic_word_prior:.4f}")


iteration: 1 of max_iter: 20, perplexity: 4627.6715
iteration: 2 of max_iter: 20, perplexity: 3237.7093
iteration: 3 of max_iter: 20, perplexity: 2350.2718
iteration: 4 of max_iter: 20, perplexity: 1878.1427
iteration: 5 of max_iter: 20, perplexity: 1612.6079
iteration: 6 of max_iter: 20, perplexity: 1456.4463
iteration: 7 of max_iter: 20, perplexity: 1360.0351
iteration: 8 of max_iter: 20, perplexity: 1298.7921
iteration: 9 of max_iter: 20, perplexity: 1258.6076
iteration: 10 of max_iter: 20, perplexity: 1231.6621
iteration: 11 of max_iter: 20, perplexity: 1212.9790
iteration: 12 of max_iter: 20, perplexity: 1199.7908
iteration: 13 of max_iter: 20, perplexity: 1190.2813
iteration: 14 of max_iter: 20, perplexity: 1183.0117
iteration: 15 of max_iter: 20, perplexity: 1177.4912
iteration: 16 of max_iter: 20, perplexity: 1173.4027
iteration: 17 of max_iter: 20, perplexity: 1170.3885
iteration: 18 of max_iter: 20, perplexity: 1168.0815
iteration: 19 of max_iter: 20, perplexity: 1166.2770
it

KeyboardInterrupt: 

### 3

In [None]:
for n_components in [15, 20, 25]:
  for doc_topic_prior in [0.6, 0.5, 0.4]:
    for topic_word_prior in [0.03, 0.02, 0.01]:
      lda = LatentDirichletAllocation(
        n_components=n_components,
        doc_topic_prior=doc_topic_prior,
        topic_word_prior=topic_word_prior,
        max_iter=20,
        evaluate_every=1,
        verbose=1,
        random_state=123,
      )
      lda.fit(X_train)
      print(f"-- test perplexity: {lda.perplexity(X_test):.2f}")
      print(f"---- {n_components} topics, alpha={doc_topic_prior:.4f}, eta={topic_word_prior:.4f}")


iteration: 1 of max_iter: 20, perplexity: 3100.8906
iteration: 2 of max_iter: 20, perplexity: 3014.2460
iteration: 3 of max_iter: 20, perplexity: 2749.3072
iteration: 4 of max_iter: 20, perplexity: 2337.7289
iteration: 5 of max_iter: 20, perplexity: 1970.2551
iteration: 6 of max_iter: 20, perplexity: 1710.4167
iteration: 7 of max_iter: 20, perplexity: 1536.7093
iteration: 8 of max_iter: 20, perplexity: 1420.4502
iteration: 9 of max_iter: 20, perplexity: 1341.8394
iteration: 10 of max_iter: 20, perplexity: 1288.1176
iteration: 11 of max_iter: 20, perplexity: 1250.4483
iteration: 12 of max_iter: 20, perplexity: 1223.6283
iteration: 13 of max_iter: 20, perplexity: 1204.4556
iteration: 14 of max_iter: 20, perplexity: 1190.7195
iteration: 15 of max_iter: 20, perplexity: 1180.8077
iteration: 16 of max_iter: 20, perplexity: 1173.3452
iteration: 17 of max_iter: 20, perplexity: 1167.6131
iteration: 18 of max_iter: 20, perplexity: 1163.0610
iteration: 19 of max_iter: 20, perplexity: 1159.2964
it

KeyboardInterrupt: 

### 4

In [None]:
for n_components in [10, 15, 20]:
  for doc_topic_prior in [0.8, 0.7, 0.6]:
    for topic_word_prior in [0.03, 0.02, 0.01]:
      lda = LatentDirichletAllocation(
        n_components=n_components,
        doc_topic_prior=doc_topic_prior,
        topic_word_prior=topic_word_prior,
        max_iter=20,
        evaluate_every=1,
        verbose=1,
        random_state=123,
      )
      lda.fit(X_train)
      print(f"-- test perplexity: {lda.perplexity(X_test):.2f}")
      print(f"---- {n_components} topics, alpha={doc_topic_prior:.4f}, eta={topic_word_prior:.4f}")


iteration: 1 of max_iter: 20, perplexity: 2266.4164
iteration: 2 of max_iter: 20, perplexity: 2244.5659
iteration: 3 of max_iter: 20, perplexity: 2185.8426
iteration: 4 of max_iter: 20, perplexity: 2059.9715
iteration: 5 of max_iter: 20, perplexity: 1883.4189
iteration: 6 of max_iter: 20, perplexity: 1708.4065
iteration: 7 of max_iter: 20, perplexity: 1564.9912
iteration: 8 of max_iter: 20, perplexity: 1457.6728
iteration: 9 of max_iter: 20, perplexity: 1380.0627
iteration: 10 of max_iter: 20, perplexity: 1324.0296
iteration: 11 of max_iter: 20, perplexity: 1283.0956
iteration: 12 of max_iter: 20, perplexity: 1252.4563
iteration: 13 of max_iter: 20, perplexity: 1228.8534
iteration: 14 of max_iter: 20, perplexity: 1210.3411
iteration: 15 of max_iter: 20, perplexity: 1195.5801
iteration: 16 of max_iter: 20, perplexity: 1183.6725
iteration: 17 of max_iter: 20, perplexity: 1173.9410
iteration: 18 of max_iter: 20, perplexity: 1165.9833
iteration: 19 of max_iter: 20, perplexity: 1159.4936
it

## Re-running variational inference with the best setting

In [None]:
# Rebuild the vocabulary from all titles (train + test) for the final model.
vectorizer = CountVectorizer(stop_words="english", min_df=10)
X = vectorizer.fit_transform(list(ds["train"]["title"]) + list(ds["test"]["title"]))

In [None]:
lda = LatentDirichletAllocation(
  n_components=15,
  doc_topic_prior=0.6,
  topic_word_prior=0.02,
  max_iter=50,
  evaluate_every=1,
  verbose=1,
  random_state=123,
)
lda.fit(X)

iteration: 1 of max_iter: 50, perplexity: 3212.0452
iteration: 2 of max_iter: 50, perplexity: 3116.1717
iteration: 3 of max_iter: 50, perplexity: 2825.9628
iteration: 4 of max_iter: 50, perplexity: 2385.6992
iteration: 5 of max_iter: 50, perplexity: 2001.4003
iteration: 6 of max_iter: 50, perplexity: 1732.1999
iteration: 7 of max_iter: 50, perplexity: 1553.8136
iteration: 8 of max_iter: 50, perplexity: 1436.3327
iteration: 9 of max_iter: 50, perplexity: 1357.6988
iteration: 10 of max_iter: 50, perplexity: 1303.5938
iteration: 11 of max_iter: 50, perplexity: 1265.2954
iteration: 12 of max_iter: 50, perplexity: 1237.3723
iteration: 13 of max_iter: 50, perplexity: 1216.6442
iteration: 14 of max_iter: 50, perplexity: 1201.1017
iteration: 15 of max_iter: 50, perplexity: 1189.4617
iteration: 16 of max_iter: 50, perplexity: 1180.6088
iteration: 17 of max_iter: 50, perplexity: 1173.7038
iteration: 18 of max_iter: 50, perplexity: 1168.3038
iteration: 19 of max_iter: 50, perplexity: 1163.9205
it

* Save the model.

In [None]:
import pickle

outfile = "lda_model.pk"
with open(outfile, 'wb') as pickle_file:
  pickle.dump(lda, pickle_file)

## Visualization

In [None]:
from datasets import load_dataset
from sklearn.feature_extraction.text import CountVectorizer

ds = load_dataset("CShorten/ML-ArXiv-Papers")
ds = ds["train"].train_test_split(test_size=0.1, seed=1234)

# Rebuild the same document-term matrix that the saved model was fitted on.
vectorizer = CountVectorizer(stop_words="english", min_df=10)
X = vectorizer.fit_transform(list(ds["train"]["title"]) + list(ds["test"]["title"]))

In [None]:
import pickle

outfile = "lda_model.pk"
with open(outfile, "rb") as pickle_file:
  lda_model = pickle.load(pickle_file)

* pyLDAvis must be installed beforehand.

In [None]:
import pyLDAvis
import pyLDAvis.lda_model

pyLDAvis.enable_notebook()
pyLDAvis.lda_model.prepare(lda_model, X, vectorizer, mds='mmds')