# 2. Computing Topics with LDA

- This notebook walks step by step through the LDA topic computation performed by make_lda_model.py.

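## 2.1 LDA Refresher

- LDA stands for Latent Dirichlet Allocation.
- LDA assumes that every word in every document appears probabilistically, according to the topic weights of the document and the topic weights of the word, and it infers those topic weights back from the data.
- For example, suppose we collect documents about dogs and documents about cats. A document about dogs should have a large "dog" topic weight and a small "cat" topic weight. Likewise, words such as "woof" and "dog run" should have a large dog topic weight, while words such as "meow" and "scratch" should have a large cat topic weight. LDA estimates these topic weights from the document data with machine learning (Bayesian inference).
- The topic weights and the other parameters are computed by Bayesian inference.
- The number of topics is sometimes chosen by the analyst on qualitative grounds, and sometimes chosen quantitatively with metrics such as [coherence](https://radimrehurek.com/gensim/models/coherencemodel.html) or perplexity.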
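The generative assumption described above can be sketched in plain Python. This is a toy illustration, not part of make_lda_model.py. The vocabulary, the per-topic word distributions in `phi`, the value of `alpha`, and the helper names `sample_dirichlet` and `generate_document` are all assumptions made for the example. The full model also draws `phi` from a Dirichlet distribution, whereas here it is fixed for readability.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(n_words, alpha, phi, vocab, rng):
    """Generate one document: draw topic weights, then a topic and a word per position."""
    theta = sample_dirichlet(alpha, rng)  # topic weights of this document
    topic_ids = list(range(len(phi)))
    words = []
    for _ in range(n_words):
        z = rng.choices(topic_ids, weights=theta)[0]  # topic of this word
        w = rng.choices(vocab, weights=phi[z])[0]     # word drawn from that topic
        words.append(w)
    return theta, words


# Hypothetical vocabulary and word distributions for a "dog" and a "cat" topic
vocab = ["woof", "dog_run", "meow", "scratch"]
phi = [
    [0.60, 0.30, 0.05, 0.05],  # topic 0: dog
    [0.05, 0.05, 0.50, 0.40],  # topic 1: cat
]

rng = random.Random(0)
theta, words = generate_document(10, alpha=[0.5, 0.5], phi=phi, vocab=vocab, rng=rng)
print(theta)
print(words)
```

LDA works in the opposite direction: given only the words, it estimates `theta` and `phi`.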

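### 2.1.1 More Detail

- The LDA model in detail:
![LDA_figure](figure/Smoothed_LDA.png)
    - A Dirichlet distribution with parameter $\alpha$ determines the topic distribution $\theta$ of each document (e.g. dog 0.7, cat 0.3).
    - Separately, a Dirichlet distribution with parameter $\beta$ determines the word distribution $\varphi$ of each topic (e.g. dog topic: woof 0.6, dog run 0.3, meow 0.05, scratch 0.05).
    - Within each document, the topic $Z$ of every word (e.g. dog) is drawn from $\theta$, and the word $W$ itself (e.g. woof) is then drawn from the word distribution $\varphi$ of the chosen topic $Z$.
    - The joint probability of the model can be written as
$$P(W,Z,\theta,\varphi; \alpha,\beta) = \prod^K_{i=1}P(\varphi_i;\beta)\prod^M_{j=1}P(\theta_j;\alpha)\prod^{N_j}_{t=1}P(Z_{j, t}|\theta_j)P(W_{j, t}|\varphi_{Z_{j, t}})$$
where $K$ is the number of topics, $M$ is the number of documents, and $N_j$ is the number of words in document $j$.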
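- The Dirichlet distribution is the conjugate prior of the multinomial distribution (the distribution over category occurrences), which makes multinomials easier to handle in Bayesian inference.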
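- Perplexity can be seen as the inverse of the predictive accuracy of the LDA model, so lower is better. The data is split into training and test data (note that within a single document some words go into the test data and others into the training data). The model is fit on the training data, and perplexity is then computed from how well the model predicts the test words. The formula is as follows.
$$perplexity(W^{test}|M) = \exp\left(-\frac{\sum^{D^{test}}_{d=1}\log\left(\prod^{N^{test}_d}_{n=1}\sum^K_{k=1}\theta_{dk}\varphi_{kw^{test}_{dn}}\right)}{\sum^{D^{test}}_{d=1}N^{test}_d}\right)$$
Here $D^{test}$ is the number of documents that contain test words, $N^{test}_d$ is the number of test words in document $d$, $w^{test}_{dn}$ is the $n$-th test word in document $d$, $M$ is the trained model, and $W^{test}$ is the vector of all test words.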
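The perplexity formula above can be evaluated by hand on a toy model. The sketch below assumes that the topic weights `theta` and word distributions `phi` are already known and that test words are given as integer word ids. The function name `perplexity` and all numbers are made up for the illustration.

```python
import math


def perplexity(test_docs, theta, phi):
    """exp(-(sum of log-probabilities of the test words) / (number of test words))."""
    log_lik = 0.0
    n_words = 0
    for d, doc in enumerate(test_docs):
        for w in doc:
            # Probability of word w in document d, marginalized over topics
            p = sum(theta[d][k] * phi[k][w] for k in range(len(phi)))
            log_lik += math.log(p)
            n_words += 1
    return math.exp(-log_lik / n_words)


theta = [[0.7, 0.3], [0.2, 0.8]]       # per-document topic weights
phi = [[0.60, 0.30, 0.05, 0.05],       # per-topic word distributions
       [0.05, 0.05, 0.50, 0.40]]
test_docs = [[0, 1, 0], [2, 3]]        # test words as word ids

informed = perplexity(test_docs, theta, phi)

# A model that spreads probability uniformly over the 4 words
uniform_phi = [[0.25] * 4, [0.25] * 4]
uniform = perplexity(test_docs, theta, uniform_phi)

print(informed, uniform)
```

A model that assigns every one of the 4 words the same probability has a perplexity of about 4, the vocabulary size. A model that predicts the test words better scores lower.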
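- Coherence measures how semantically related the top words of each topic are, typically based on how often those words co-occur in documents. The references below cover the individual measures.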
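As a rough illustration of coherence, the sketch below computes a UMass-style score in the spirit of Mimno (2011): for each ranked pair of top words it adds the log of the smoothed co-document frequency divided by the document frequency of the higher-ranked word. It is a simplified stand-in, not gensim's `u_mass` implementation. The function name `umass_coherence` and the toy documents are assumptions.

```python
import math


def umass_coherence(top_words, documents):
    """Sum over ranked word pairs of log((D(w_m, w_l) + 1) / D(w_l))."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            co_freq = doc_freq(top_words[m], top_words[l])
            score += math.log((co_freq + 1) / doc_freq(top_words[l]))
    return score


documents = [
    ["woof", "dog_run"],
    ["woof", "dog_run", "walk"],
    ["meow", "scratch"],
    ["meow", "scratch", "nap"],
]

coherent = umass_coherence(["woof", "dog_run"], documents)  # words that co-occur
mixed = umass_coherence(["woof", "meow"], documents)        # words that never co-occur
print(coherent, mixed)
```

Top words that tend to appear in the same documents score higher than top words that never appear together.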

### References

- For a detailed treatment of LDA including the math, Iwata's [トピックモデル](https://www.amazon.co.jp/dp/4061529048/ref=cm_sw_em_r_mt_dp_U_Vf70CbB6RVDJ3) (Topic Models) is easy to follow.
- There are several algorithms for the Bayesian inference. The [original paper](https://www.genetics.org/content/155/2/945) and [gensim](https://radimrehurek.com/gensim/models/ldamodel.html), the library used here, use variational Bayes. Other options include Gibbs sampling and expectation propagation. [plda](http://ai.deepq.com/plda/), the library this project used previously, uses Gibbs sampling.
- For Bayesian inference including the math, Bishop's [パターン認識と機械学習](https://www.amazon.co.jp/dp/4621061224/ref=cm_sw_em_r_mt_dp_U_fz60CbNVSCXAN) (Pattern Recognition and Machine Learning) is the most thorough, while Suyama's recently published [ベイズ推論による機械学習入門](https://www.amazon.co.jp/dp/4061538322/ref=cm_sw_em_r_mt_dp_U_eA60Cb2BHZ1B0) (Introduction to Machine Learning by Bayesian Inference) is an accessible introduction.
- For coherence, [these slides](https://www.slideshare.net/hoxo_m/coherence-57598192) by Makiyama are easy to follow.
- The papers behind the coherence measures implemented in gensim are listed below.
    - [c_v](http://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf) Röder et al. (2015). Accurate but slow.
    - [c_uci](https://www.aclweb.org/anthology/N10-1012) Newman (2010)
    - [u_mass](https://mimno.infosci.cornell.edu/papers/mimno-semantic-emnlp.pdf) Mimno (2011). Fast.
    - [c_npmi](https://www.aclweb.org/anthology/W13-0102) Aletras (2013)

## 2.2 Estimating Topic Weights with LDA

### 2.2.1 Loading the Data

In [1]:
!ls -lh ../data/interim/

合計 2.0G
-rw-r--r-- 1 user01 user01  39M  5月  9 21:33 df.feather
-rw-r--r-- 1 user01 user01  36M  5月  9 21:36 df_filterd.feather
-rw-r--r-- 1 user01 user01  40M  5月  9 21:36 df_filterd_joined.feather
-rw-rw-r-- 1 user01 user01 949M  5月  9 21:38 df_parse_filterd.csv
-rw-rw-r-- 1 user01 user01 949M  5月  8 21:15 df_parse_filterd_old.csv
-rw-r--r-- 1 user01 user01  17M  5月  9 17:14 df_parts.feather


In [1]:
# Load the tokenized (word-segmented) data
import pandas as pd

INTERIM_PATH = "../data/interim/"
df_parsed = pd.read_csv(f"{INTERIM_PATH}df_parse_filterd.csv")

In [2]:
df_parsed.shape

(2883934, 14)

In [3]:
df_parsed.head(3)

Unnamed: 0,id,word_position,original_text,surface_form,pronunciation,base_form,word_class,inflection_type,inflection_form,used,0,1,2,3
0,JP200410B50001,0,高速走行中(トンネルの中)、車両前部より白煙(あるいは水蒸気)シートを起こしたところ炎を確認...,高速,コウソク,高速,名詞-一般,,,True,名詞,一般,,
1,JP200410B50001,1,高速走行中(トンネルの中)、車両前部より白煙(あるいは水蒸気)シートを起こしたところ炎を確認...,走行中,ソウコウチュウ,走行中,名詞-固有名詞-一般,,,True,名詞,固有名詞,一般,
2,JP200410B50001,3,高速走行中(トンネルの中)、車両前部より白煙(あるいは水蒸気)シートを起こしたところ炎を確認...,トンネル,トンネル,トンネル,名詞-一般,,,True,名詞,一般,,


In [4]:
!ls ../data/external/

stop_words.csv	used_word_types.csv  形態素解析ツールの品詞体系.htm


In [5]:
# Load the words to exclude from the topic model
stop_words_path = "../data/external/stop_words.csv"

df_stop_words = pd.read_csv(stop_words_path)
df_stop_words

Unnamed: 0,word
0,時
1,中
2,の
3,お客様
4,店
5,とき
6,事
7,為
8,こと
9,様


In [6]:
# From src/utils/pandas.py
# A hand-written helper for anti-joins. Handy to have around.
def anti_join(left, right, **kwargs):
    """Return rows in `left` which are not present in `right`"""
    kwargs['how'] = 'left'
    kwargs['indicator'] = True
    return (
        left
        .merge(right, **kwargs)
        .query('_merge == "left_only"')
        .drop(columns='_merge')
    )

In [7]:
df_parsed.shape

(2883934, 14)

In [8]:
# Remove the stop words from df_parsed
df_parsed = df_parsed.pipe(
    anti_join,
    df_stop_words,
    left_on='base_form',
    right_on='word',
)

In [9]:
df_parsed.shape

(2862849, 15)

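About 21,000 records (words) were removed. The column count grew from 14 to 15 because the merge also brings in the `word` column of `df_stop_words`.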
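The behavior of the anti-join can be checked on a toy frame. The sketch below repeats the `anti_join` helper from above so that it runs on its own. The frames `tokens` and `stop`, and the follow-up `drop`, are assumptions added for illustration.

```python
import pandas as pd


def anti_join(left, right, **kwargs):
    """Return rows in `left` which are not present in `right`."""
    kwargs["how"] = "left"
    kwargs["indicator"] = True
    return (
        left.merge(right, **kwargs)
        .query('_merge == "left_only"')
        .drop(columns="_merge")
    )


# Hypothetical miniature versions of df_parsed and df_stop_words
tokens = pd.DataFrame({"id": ["a", "a", "b"], "base_form": ["高速", "こと", "音"]})
stop = pd.DataFrame({"word": ["こと", "時"]})

kept = anti_join(tokens, stop, left_on="base_form", right_on="word")
print(kept)

# The right-hand key column survives the merge and is all NaN; drop it if unwanted
kept_clean = kept.drop(columns="word")
print(kept_clean.columns.tolist())
```

Because `left_on` and `right_on` name different columns, the result keeps an extra, all-NaN `word` column. Dropping it restores the original column set.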

### 2.2.2 Preparing for LDA

In [10]:
# Collect the words of each document
parsed_col = (
    # Use only id and base_form (the dictionary form of each word)
    df_parsed[['id', 'base_form']]
    # Drop NA rows (a few are mixed in)
    .dropna()
    # Group by document ID and gather base_form into a list
    .groupby('id')
    .base_form
    .apply(list)
)

In [11]:
parsed_col.head()

id
JP200410B50001    [高速, 走行中, トンネル, 車両, 前部, 白, 煙, 水蒸気, シート, 起こす, 炎...
JP200410B50002                           [アイドリング, キンキン, 高位, する, 診断]
JP200410B50003    [普通に, 走行, する, 通勤, 使用, する, エンジン, フロント, カバ, オイル,...
JP200410B50004    [ディスクロータ, 錆びる, 走行, キーキー, 音, する, ディスクロータ, 錆びる, ...
JP200410B50005    [時々, オート, 動く, なる, オート, 機能, セット, する, 2, 3回, 動く,...
Name: base_form, dtype: object

In [12]:
from gensim import corpora

# Build the dictionary that maps each word to an integer id
dictionary = corpora.Dictionary(parsed_col)



In [13]:
for i in range(5): print(i, dictionary[i])

0 する
1 エンジン
2 エンジンオイル
3 シート
4 トンネル


In [14]:
len(dictionary)

43942

A dictionary of 43,942 words was built.

In [15]:
# Use the dictionary to convert each document into (word id, count) pairs
corpus = parsed_col.apply(dictionary.doc2bow)

In [16]:
parsed_col.iloc[1]

['アイドリング', 'キンキン', '高位', 'する', '診断']

The document above was converted as follows.

In [17]:
corpus.iloc[1]

[(0, 1), (24, 1), (25, 1), (26, 1), (27, 1)]

With the dictionary, the ids can be mapped back to words (the original word order is lost).

In [18]:
[dictionary[i] for i, _ in corpus.iloc[1]]

['する', 'アイドリング', 'キンキン', '診断', '高位']
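The round trip above can be sketched in plain Python to show what the dictionary and the bag-of-words conversion do. This is not gensim's implementation. The helpers `build_token2id` and `doc2bow` are assumptions that only mimic the behavior seen in the outputs above, and the two documents are shortened examples.

```python
from collections import Counter


def build_token2id(documents):
    """Assign consecutive ids to unseen tokens, in sorted order within each document."""
    token2id = {}
    for doc in documents:
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def doc2bow(doc, token2id):
    """Convert a token list to sorted (id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


docs = [
    ["エンジン", "する", "シート"],
    ["アイドリング", "する", "診断", "する"],
]
token2id = build_token2id(docs)
id2token = {i: t for t, i in token2id.items()}

bow = doc2bow(docs[1], token2id)
print(bow)
print([id2token[i] for i, _ in bow])  # back to words; order and repeats are gone
print(doc2bow(["テラデータ"], token2id))  # a word not in the dictionary is dropped
```

The bag-of-words form keeps only which words occur and how often, which is all that LDA uses.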

### 2.2.3 LDA Modeling

In [19]:
# LDA hyperparameters. Here the model is fit with 60 topics.
hyper_params = {
    'alpha': 'auto',
    'eta': 'auto',
    'num_topics': 60,
    'random_state': 0,
    'passes': 100,
    'gamma_threshold': 1e-3,
    "chunksize": 2000,
    "update_every":1,
    "decay":0.5,
    "offset":1.0,
    "eval_every":10,
    "iterations":50,
    "distributed":False,
}

In [54]:
# Compute the topic weights with LDA. This takes quite a long time.
from gensim.models import LdaModel

lm = LdaModel(
    corpus.tolist(),
    id2word=dictionary,
    **hyper_params,
)

  diff = np.log(self.expElogbeta)


In [56]:
# Save the trained model
import pickle
with open("../data/processed/lda_model.pkl", "wb") as f:
    pickle.dump(lm, f)

In [20]:
# Load the saved model
import pickle
with open("../data/processed/lda_model.pkl", "rb") as f:
    lm = pickle.load(f)

### 2.2.4 Computing Document Topic Weights

In [21]:
# Use the trained LDA model to compute the topic weights of the first document (top 5 shown)
topic_test = lm.get_document_topics(corpus.iloc[0], minimum_probability=0)
sorted(topic_test, key=lambda x: x[1], reverse=True)[:5]

[(46, 0.10485247),
 (0, 0.061555933),
 (50, 0.05547047),
 (43, 0.051919177),
 (30, 0.051878616)]

In [22]:
import numpy as np
np.array(parsed_col.iloc[0])

array(['高速', '走行中', 'トンネル', '車両', '前部', '白', '煙', '水蒸気', 'シート', '起こす',
       '炎', '確認', '高速道路', '備え付け', '消火器', '消火', 'エンジンオイル', '空', '状態',
       'エンジン', '上部', '出火', 'する', '模様'], dtype='<U7')

In the first document, topic 46 has the largest weight, followed by topic 0 and topic 50.
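When only the few largest topics are needed, `heapq.nlargest` avoids sorting the whole list. The sketch below uses the five (topic, weight) pairs shown in the output above, in shuffled order. The remaining 55 topics are left out, so this is an illustration rather than the full result.

```python
import heapq
from operator import itemgetter

# The five pairs from the output above, deliberately not in sorted order
topic_probs = [
    (0, 0.061555933),
    (30, 0.051878616),
    (43, 0.051919177),
    (46, 0.10485247),
    (50, 0.05547047),
]

# Take the three pairs with the largest weight (second element of each pair)
top3 = heapq.nlargest(3, topic_probs, key=itemgetter(1))
print(top3)
```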

In [23]:
sum([x[1] for x in topic_test])

0.9999999759020284

The topic weights are normalized to sum to 1 (the small difference is due to rounding).

In [24]:
def is_indict(word):
    # token2id maps every known word to its id, so membership is a dict lookup
    return word in dictionary.token2id

In [25]:
[is_indict(x) for x in ["今日","天気", "寒い", "霧", "多い", "テラデータ"]]

[True, True, True, True, True, False]

In [26]:
# Use the trained LDA model to compute the topic weights of a new document
topic_test2 = lm.get_document_topics(dictionary.doc2bow(["今日","天気", "寒い", "霧", "多い"]))
sorted(topic_test2, key=lambda x:x[1], reverse=True)[:5]

[(46, 0.10453187),
 (7, 0.052740756),
 (28, 0.050256826),
 (47, 0.049845695),
 (34, 0.047210664)]

For this document, topic 46 has the largest weight, followed by topic 7 and topic 28.

Applying the method over a Series computes the topics of many documents at once.

In [27]:
corpus[:5].apply(lm.get_document_topics , minimum_probability=0)

id
JP200410B50001    [(0, 0.061555933), (1, 0.008286304), (2, 0.006...
JP200410B50002    [(0, 0.026352558), (1, 0.01417279), (2, 0.0108...
JP200410B50003    [(0, 0.040343437), (1, 0.008687206), (2, 0.006...
JP200410B50004    [(0, 0.021307122), (1, 0.011459281), (2, 0.008...
JP200410B50005    [(0, 0.020028884), (1, 0.010771825), (2, 0.008...
Name: base_form, dtype: object

In [28]:
from operator import itemgetter

corpus[:5].apply(lm.get_document_topics , minimum_probability=0)\
    .apply(lambda x: pd.Series(map(itemgetter(1), x)))

Unnamed: 0_level_0,0,1,2,3,4,5,6,7,8,9,...,50,51,52,53,54,55,56,57,58,59
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
JP200410B50001,0.061556,0.008286,0.00634,0.006153,0.008337,0.00558,0.033003,0.029619,0.001875,0.005071,...,0.05547,0.031133,0.009908,0.0047,0.010765,0.008374,0.004746,0.005841,0.006615,0.005253
JP200410B50002,0.026353,0.014173,0.010844,0.04999,0.014259,0.009544,0.016982,0.050659,0.003207,0.008674,...,0.015944,0.013784,0.016946,0.008039,0.018412,0.053788,0.008118,0.00999,0.011315,0.008984
JP200410B50003,0.040343,0.008687,0.006647,0.006451,0.032931,0.00585,0.0346,0.031052,0.001966,0.005317,...,0.033964,0.008449,0.010387,0.004928,0.011285,0.008779,0.004976,0.006123,0.006935,0.005507
JP200410B50004,0.021307,0.011459,0.008768,0.008509,0.011529,0.039627,0.013731,0.00905,0.002593,0.007013,...,0.012891,0.011145,0.013702,0.0065,0.014887,0.01158,0.006564,0.008077,0.041058,0.007264
JP200410B50005,0.020029,0.010772,0.008242,0.007999,0.010837,0.007254,0.012907,0.008507,0.002438,0.006593,...,0.012118,0.010476,0.042875,0.00611,0.013993,0.010886,0.00617,0.037588,0.0086,0.006828


In [29]:
from operator import itemgetter

# Compute the topic weights of every document with the trained LDA model
df_docs = df_parsed[['id', 'original_text']].drop_duplicates()
df_docs.columns = ['id', 'doc']
df_topic = (corpus
    .apply(lm.get_document_topics , minimum_probability=0)
    .apply(lambda x: pd.Series(map(itemgetter(1), x)))
    .fillna(0)
    .rename(columns=lambda i: f'topic_{i}'))

In [30]:
df_topic.head()

Unnamed: 0_level_0,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,topic_9,...,topic_50,topic_51,topic_52,topic_53,topic_54,topic_55,topic_56,topic_57,topic_58,topic_59
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
JP200410B50001,0.061556,0.008286,0.00634,0.006153,0.008337,0.00558,0.033003,0.029619,0.001875,0.005071,...,0.05547,0.031133,0.009908,0.0047,0.010765,0.008374,0.004746,0.005841,0.006615,0.005253
JP200410B50002,0.026353,0.014173,0.010844,0.04999,0.014259,0.009544,0.016982,0.050659,0.003207,0.008674,...,0.015944,0.013784,0.016946,0.008039,0.018412,0.053788,0.008118,0.00999,0.011315,0.008984
JP200410B50003,0.040343,0.008687,0.006647,0.006451,0.032931,0.00585,0.0346,0.031052,0.001966,0.005317,...,0.033964,0.008449,0.010387,0.004928,0.011285,0.008779,0.004976,0.006123,0.006935,0.005507
JP200410B50004,0.021307,0.011459,0.008768,0.008509,0.011529,0.039627,0.013731,0.00905,0.002593,0.007013,...,0.012891,0.011145,0.013702,0.0065,0.014887,0.01158,0.006564,0.008077,0.041058,0.007264
JP200410B50005,0.020029,0.010772,0.008242,0.007999,0.010837,0.007254,0.012907,0.008507,0.002438,0.006593,...,0.012118,0.010476,0.042875,0.00611,0.013993,0.010886,0.00617,0.037588,0.0086,0.006828
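The conversion above takes the second element of each pair by position, which is correct as long as every document returns all topics in order. If any topic is missing from a list, for example because a probability threshold filtered it out, the values would shift into the wrong columns. A more defensive variant keys each value by its topic id, sketched here on hypothetical data (`doc_topics`, `n_topics`, and the document ids are made up).

```python
import pandas as pd

# Hypothetical per-document output; doc_b is missing topic 1
doc_topics = pd.Series(
    [[(0, 0.7), (1, 0.2), (2, 0.1)], [(0, 0.4), (2, 0.6)]],
    index=pd.Index(["doc_a", "doc_b"], name="id"),
)
n_topics = 3

df_wide = (
    # dict(pairs) maps topic id -> weight, so columns line up by topic id
    pd.DataFrame([dict(pairs) for pairs in doc_topics], index=doc_topics.index)
    .reindex(columns=range(n_topics))  # make sure every topic has a column
    .fillna(0.0)                       # missing topics get weight 0
    .rename(columns=lambda i: f"topic_{i}")
)
print(df_wide)
```

Building one DataFrame from a list of dicts also avoids creating a separate `pd.Series` per document, which tends to be slow on large corpora.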


In [31]:
df_doc_topics = pd.concat([
    df_docs.set_index('id'), df_topic], axis=1)

In [32]:
df_doc_topics.head(2)

Unnamed: 0_level_0,doc,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,...,topic_50,topic_51,topic_52,topic_53,topic_54,topic_55,topic_56,topic_57,topic_58,topic_59
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
JP200410B50001,高速走行中(トンネルの中)、車両前部より白煙(あるいは水蒸気)シートを起こしたところ炎を確認...,0.061556,0.008286,0.00634,0.006153,0.008337,0.00558,0.033003,0.029619,0.001875,...,0.05547,0.031133,0.009908,0.0047,0.010765,0.008374,0.004746,0.005841,0.006615,0.005253
JP200410B50002,アイドリングの時にキンキンと高い音がする $ 診断中 $,0.026353,0.014173,0.010844,0.04999,0.014259,0.009544,0.016982,0.050659,0.003207,...,0.015944,0.013784,0.016946,0.008039,0.018412,0.053788,0.008118,0.00999,0.011315,0.008984


In [33]:
# Delete data that is no longer needed and free the memory
import gc
del df_docs, df_topic, df_parsed, parsed_col
gc.collect()

2007691

### 2.2.5 Computing Word Topic Weights

In [34]:
# The topic weights of each word can be obtained with the get_topics method
print(lm.get_topics().shape)
lm.get_topics()

(60, 43942)


array([[2.9358150e-08, 2.9245975e-08, 2.9029462e-08, ..., 1.7191868e-09,
        1.7191868e-09, 1.7191868e-09],
       [5.7133636e-08, 5.6915336e-08, 5.6493981e-08, ..., 3.3456944e-09,
        3.3456944e-09, 3.3456944e-09],
       [8.6399503e-08, 8.6069377e-08, 8.5432191e-08, ..., 5.0594768e-09,
        5.0594768e-09, 5.0594768e-09],
       ...,
       [8.5603965e-08, 8.5276881e-08, 8.4645556e-08, ..., 5.0128910e-09,
        5.0128910e-09, 5.0128910e-09],
       [7.0304480e-08, 7.0035853e-08, 6.9517370e-08, ..., 4.1169672e-09,
        4.1169672e-09, 4.1169672e-09],
       [1.0711761e-07, 1.0670832e-07, 1.0591834e-07, ..., 6.2727104e-09,
        6.2727104e-09, 6.2727104e-09]], dtype=float32)

In [35]:
colnames = [
        f'topic_{i}'
        for i
        in range(hyper_params['num_topics'])
]

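# Put the topic weights of each word into a DataFrame. Dividing by the column sum at the end
# normalizes each topic column to sum to 1.
# dictionary.values() provides the word that corresponds to each id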
df_topic_words = pd.DataFrame(
    data=lm.get_topics().transpose(),
    index=dictionary.values(),
    columns=colnames
).apply(lambda s: s / s.sum())

In [36]:
df_topic_words.head()

Unnamed: 0,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,topic_9,...,topic_50,topic_51,topic_52,topic_53,topic_54,topic_55,topic_56,topic_57,topic_58,topic_59
する,2.935815e-08,5.713363e-08,8.639951e-08,9.533522e-08,6.265058e-08,1.070556e-07,4.823187e-08,6.81128e-08,3.972064e-07,8.87724e-08,...,4.796875e-08,5.965382e-08,4.926487e-08,9.416696e-08,5.604212e-08,5.715835e-08,1.100418e-07,8.560397e-08,7.030447e-08,1.071176e-07
エンジン,2.924597e-08,5.691533e-08,8.606938e-08,9.497096e-08,6.24112e-08,1.066466e-07,4.804758e-08,6.785255e-08,3.956887e-07,8.843322e-08,...,4.778547e-08,5.942589e-08,4.907664e-08,9.380716e-08,5.582799e-08,5.693995e-08,1.096213e-07,8.527689e-08,7.003585e-08,1.067083e-07
エンジンオイル,2.902946e-08,5.649397e-08,8.54322e-08,9.426787e-08,6.194915e-08,1.05857e-07,4.769188e-08,6.735023e-08,3.927594e-07,8.777853e-08,...,4.74317e-08,5.898595e-08,4.871331e-08,9.311269e-08,5.541468e-08,5.651842e-08,1.088098e-07,8.464556e-08,6.951736e-08,1.059183e-07
シート,2.913036e-08,5.669034e-08,8.572914e-08,9.459553e-08,6.216447e-08,1.06225e-07,4.785764e-08,6.758432e-08,3.941245e-07,8.808363e-08,...,4.759657e-08,5.919097e-08,4.888263e-08,9.343633e-08,5.560729e-08,5.671486e-08,1.09188e-07,8.493978e-08,6.975899e-08,1.062865e-07
トンネル,2.882918e-08,5.610421e-08,8.484277e-08,9.361749e-08,6.152175e-08,1.051267e-07,4.736284e-08,6.688555e-08,3.900496e-07,8.717291e-08,...,4.710446e-08,5.857899e-08,4.837722e-08,9.247027e-08,5.503236e-08,5.612848e-08,1.080591e-07,8.406157e-08,6.903774e-08,1.051876e-07
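The normalization step relies on `DataFrame.apply` working column by column by default, so each topic column is divided by its own sum. A small check on made-up numbers (the frame `df` below is hypothetical):

```python
import pandas as pd

# Hypothetical unnormalized weights: rows are words, columns are topics
df = pd.DataFrame(
    {"topic_0": [2.0, 1.0, 1.0], "topic_1": [1.0, 3.0, 0.0]},
    index=["a", "b", "c"],
)

# apply() passes one column at a time, so s.sum() is the total of that topic
normalized = df.apply(lambda s: s / s.sum())
print(normalized)
print(normalized.sum())  # each column now sums to 1
```

If the columns already sum to 1, this step leaves the values essentially unchanged.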


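The topic weights of each word can also be obtained with the show_topics method. With `formatted=False`, the result has the following structure:
```
list(
    (topic number,
        list(
            (word, topic weight)
        )
    )
)
```
Since the words do not have to be looked up in the dictionary, this approach is simpler.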
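The nested structure described above can be walked with ordinary comprehensions. The sketch below uses a tiny hand-written `topics` list with made-up words and weights in place of real `show_topics` output.

```python
# Hypothetical stand-in for the show_topics result: [(topic number, [(word, weight), ...]), ...]
topics = [
    (0, [("car", 0.5), ("ride", 0.3), ("see", 0.2)]),
    (1, [("steering", 0.6), ("left", 0.4)]),
]

# Flatten into one row per (topic, word), keeping the rank within the topic
rows = [
    {"topic": topic_id, "rank": rank, "word": word, "prob": prob}
    for topic_id, word_probs in topics
    for rank, (word, prob) in enumerate(word_probs, start=1)
]

# The highest-weighted word of each topic is the first entry of its inner list
top_word = {topic_id: word_probs[0][0] for topic_id, word_probs in topics}

print(rows[0])
print(top_word)
```

The flattened rows are convenient for filtering or for loading into a DataFrame in long format.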

In [37]:
topics = lm.show_topics(
    formatted=False,
    num_topics=-1,
    num_words=len(dictionary)
)

In [38]:
# Top 5 words of topic 46
topics[46][1][:5]

[('する', 0.6922604),
 ('音', 0.13246237),
 ('走行中', 0.07664953),
 ('大きい', 0.015344718),
 ('後ろ', 0.014640365)]

In [39]:
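# Collect all columns first, then build the DataFrame in one step.
# The loop variable is named word_probs so that it does not overwrite `topics`.
ranking_cols = {}
for i, word_probs in topics:
    ranking_cols[f'word_{i}'] = list(map(itemgetter(0), word_probs))
    ranking_cols[f'topic_{i}'] = list(map(itemgetter(1), word_probs))
df_topic_ranking = pd.DataFrame(ranking_cols)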

In [40]:
df_topic_ranking.head()

Unnamed: 0,word_0,topic_0,word_1,topic_1,word_2,topic_2,word_3,topic_3,word_4,topic_4,...,word_55,topic_55,word_56,topic_56,word_57,topic_57,word_58,topic_58,word_59,topic_59
0,車,0.126438,ハンドル,0.233072,常時,0.272778,アイドリング,0.157349,出る,0.358569,...,停止,0.104612,h,0.129137,入る,0.282461,作動,0.30121,バックドア,0.276745
1,ない,0.116999,左右,0.189375,ヘッドライト,0.150579,下がる,0.110083,水,0.110182,...,アイドリングストップ!,0.103098,勝手,0.09881,1,0.135957,キー,0.16008,隙間,0.106451
2,乗る,0.066253,操作,0.123392,曇る,0.108764,回転,0.107755,風,0.076323,...,止まる,0.102777,速度,0.063404,2,0.121095,ロック,0.158537,パネル,0.093098
3,状態,0.065279,切る,0.084143,作業,0.093764,落ちる,0.076472,冷却水,0.033676,...,エンスト,0.094227,03,0.049697,3,0.077928,レス,0.056721,間,0.050023
4,見る,0.063837,動かす,0.051492,内部,0.071177,フロントグリル,0.04169,溜まる,0.026473,...,信号,0.080949,スタート,0.039653,しない,0.067201,08,0.033073,リモコン,0.046368


In [45]:
# Save df_topic_ranking to an Excel sheet. This takes a little while.
with pd.ExcelWriter("../data/processed/df_topic_ranking.xlsx", engine='xlsxwriter') as writer:
    sheet_name = 'トピックごとの単語ランキング'
    df_topic_ranking.to_excel(writer, sheet_name=sheet_name)
    workbook = writer.book
    worksheet = writer.sheets[sheet_name]
    worksheet.autofilter(0, 0, *df_topic_ranking.shape)
    worksheet.freeze_panes(1, 0)

    worksheet.conditional_format(
        1, 1, *df_topic_ranking.shape,
        {'type': 'data_bar',
         'bar_solid': True,
         'min_value': 0,
         'max_value': 1,
        }
    )

    num_col_format = workbook.add_format({'num_format': '0.0000'})
    for i, column in enumerate(df_topic_ranking.columns, 1):
        if str(column).startswith('topic'):
            worksheet.set_column(i, i, None, num_col_format)

In [41]:
# Save df_topic_ranking as CSV
df_topic_ranking.to_csv("../data/processed/df_topic_ranking.csv", index=False)

In [47]:
# Save df_topic_words to an Excel sheet. This takes a little while.
with pd.ExcelWriter("../data/processed/df_topic_words.xlsx", engine='xlsxwriter') as writer:
    sheet_name = 'トピック'
    df_topic_words.to_excel(writer, sheet_name=sheet_name)
    workbook = writer.book
    worksheet = writer.sheets[sheet_name]
    worksheet.autofilter(0, 0, *df_topic_words.shape)
    worksheet.freeze_panes(1, 1)

    worksheet.conditional_format(
        1, 1, *df_topic_words.shape,
        {'type': 'data_bar',
         'bar_solid': True,
         'min_value': 0,
         'max_value': 1,
        }
    )

    num_col_format = workbook.add_format({'num_format': '0.0000'})
    for i, column in enumerate(df_topic_words.columns, 1):
        if column.startswith('topic'):
            worksheet.set_column(i, i, None, num_col_format)

In [54]:
# Save df_topic_words as CSV
df_topic_words.reset_index().to_csv("../data/processed/df_topic_words.csv", index=False)

In [49]:
# Save df_doc_topics to an Excel sheet. This takes a little while.
with pd.ExcelWriter("../data/processed/df_doc_topics.xlsx", engine='xlsxwriter') as writer:
    sheet_name = '文書トピック'
    df_doc_topics.to_excel(writer, sheet_name=sheet_name)
    workbook = writer.book
    worksheet = writer.sheets[sheet_name]
    worksheet.autofilter(0, 0, *df_doc_topics.shape)
    worksheet.conditional_format(
        1, 2, *df_doc_topics.shape,
        {'type': 'data_bar',
         'bar_solid': True,
         'min_value': 0,
         'max_value': 1,
        }
    )

    text_col_format = workbook.add_format({'bold': False})
    text_col_format.set_text_wrap()
    text_col_format.set_align('left')
    worksheet.set_column(1, 1, 75, text_col_format)

    num_col_format = workbook.add_format({'num_format': '0.0000'})
    worksheet.freeze_panes(1, 2)
    for i, column in enumerate(df_doc_topics.columns[1:], 2):
        if column.startswith('topic'):
            worksheet.set_column(i, i, None, num_col_format)

In [50]:
# Save df_doc_topics as CSV
df_doc_topics.reset_index().to_csv("../data/processed/df_doc_topics.csv", index=False)

In [51]:
!ls -lh ../data/processed

合計 744M
-rw-rw-r-- 1 user01 user01 374M  5月  9 21:55 df_doc_topics.csv
-rw-rw-r-- 1 user01 user01 178M  5月  9 21:55 df_doc_topics.xlsx
-rw-rw-r-- 1 user01 user01  96M  5月  9 21:50 df_topic_ranking.csv
-rw-rw-r-- 1 user01 user01  23M  5月  9 21:50 df_topic_ranking.xlsx
-rw-rw-r-- 1 user01 user01  35M  5月  9 21:51 df_topic_words.csv
-rw-rw-r-- 1 user01 user01  19M  5月  9 21:51 df_topic_words.xlsx
-rw-rw-r-- 1 user01 user01  22M  5月  9 08:14 lda_model.pkl


In [51]:
df_doc_topics = pd.read_csv("../data/processed/df_doc_topics.csv")

In [53]:
df_doc_topics.head(2)

Unnamed: 0,id,doc,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,...,topic_50,topic_51,topic_52,topic_53,topic_54,topic_55,topic_56,topic_57,topic_58,topic_59
0,JP200410B50001,高速走行中(トンネルの中)、車両前部より白煙(あるいは水蒸気)シートを起こしたところ炎を確認...,0.061556,0.008286,0.00634,0.006153,0.008337,0.00558,0.033003,0.029619,...,0.05547,0.031133,0.009908,0.0047,0.010765,0.008374,0.004746,0.005841,0.006615,0.005253
1,JP200410B50002,アイドリングの時にキンキンと高い音がする $ 診断中 $,0.026353,0.014173,0.010844,0.04999,0.014259,0.009544,0.016982,0.050659,...,0.015944,0.013784,0.016946,0.008039,0.018412,0.053788,0.008118,0.00999,0.011315,0.008984
