# Checking that LDA works

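In this sample, LDA groups 2,000 test documents into 10 topics and extracts the top 20 words for each topic.

The test data is in English. At a later date I will rerun the test with Japanese test data that has been tokenized with MeCab.

The test code is borrowed from the official scikit-learn site.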



## Loading the test data (2,000 documents)

In [9]:
from __future__ import print_function
from time import time

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups

n_samples = 2000
n_features = 1000
n_topics = 10
n_top_words = 20

print("Loading dataset...")
t0 = time()
dataset = fetch_20newsgroups(shuffle=True, 
                             random_state=1,
                             remove=('headers', 'footers', 'quotes'))
data_samples = dataset.data[:n_samples]
print("done in %0.3fs." % (time() - t0))

Loading dataset...
done in 1.504s.


The test data looks like this:

In [96]:
data_samples[0]

"Well i'm not sure about the story nad it did seem biased. What\nI disagree with is your statement that the U.S. Media is out to\nruin Israels reputation. That is rediculous. The U.S. media is\nthe most pro-israeli media in the world. Having lived in Europe\nI realize that incidences such as the one described in the\nletter have occured. The U.S. media as a whole seem to try to\nignore them. The U.S. is subsidizing Israels existance and the\nEuropeans are not (at least not to the same degree). So I think\nthat might be a reason they report more clearly on the\natrocities.\n\tWhat is a shame is that in Austria, daily reports of\nthe inhuman acts commited by Israeli soldiers and the blessing\nreceived from the Government makes some of the Holocaust guilt\ngo away. After all, look how the Jews are treating other races\nwhen they got power. It is unfortunate.\n"

## Creating the training data

This appears to be what LDA takes as input.

In [97]:
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.95,
                                min_df=2,
                                max_features=n_features,
                                stop_words='english')
t0 = time()
tf = tf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0))

Extracting tf features for LDA...
done in 0.305s.


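Running the vectorizer over the 2,000 sample texts returns `tf`, a sparse document-term matrix (2,000 documents × 1,000 features). `tf.data` holds only the non-zero entries, so the 51,752 shown below is the number of (document, word) pairs that actually occur, and each stored value is how many times that word appears in that document.

I believe this reading is correct, but I am still looking into it.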

In [105]:
len(tf.data)

51752

In [106]:
tf.data[0:20]

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 4], dtype=int64)
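
To make the meaning of `tf.data` more concrete, here is a minimal plain-Python sketch of the same bookkeeping on a toy corpus. It only illustrates the idea and is not how scikit-learn implements it; the documents and the names `docs`, `vocab`, `entries`, and `nonzero_counts` are made up for this example.

```python
from collections import Counter

# Toy corpus (hypothetical, not taken from 20 Newsgroups).
docs = [
    "disk drive disk",
    "drive controller",
    "game team game game",
]

# Vocabulary: every distinct word, sorted, mapped to a column index.
vocab = {word: col
         for col, word in enumerate(sorted({w for d in docs for w in d.split()}))}

# One (row, column, count) triple per word that actually occurs in a document.
# A sparse document-term matrix stores the same information: zeros are left out.
entries = []
for row, doc in enumerate(docs):
    for word, count in sorted(Counter(doc.split()).items()):
        entries.append((row, vocab[word], count))

nonzero_counts = [count for _, _, count in entries]

print(len(docs), "documents x", len(vocab), "vocabulary words")
print("stored (non-zero) entries:", len(nonzero_counts))
print("stored counts:", nonzero_counts)
print("total word occurrences:", sum(nonzero_counts))
```

The number of stored entries is smaller than the total number of word occurrences whenever a word repeats within a document, which is why `len(tf.data)` should not be read as a word count.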

## Running the training

The number of topics is set to 10.

In [100]:
print("Fitting LDA models with tf features, "
      "n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
# Note: scikit-learn 0.19 renamed n_topics to n_components, and later releases dropped n_topics.
lda = LatentDirichletAllocation(n_topics=n_topics,
                                max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)
t0 = time()
lda.fit(tf)
print("done in %0.3fs." % (time() - t0))

Fitting LDA models with tf features, n_samples=2000 and n_features=1000...
done in 2.515s.


## Training results

The vocabulary appears to have been narrowed down to 1,000 feature words.

In [101]:
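# Note: scikit-learn 1.0 and later provide get_feature_names_out(); get_feature_names() was removed in 1.2.
tf_feature_names = tf_vectorizer.get_feature_names()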
len(tf_feature_names)

1000

Next, display the top 20 feature words for each of the 10 topics.

In [86]:
top_words = []
for topic_idx, topic in enumerate(lda.components_):
    print("Topic #%d:" % topic_idx)
    top_words_idx = topic.argsort()[:(-n_top_words - 1):-1]
    top_words_list = [tf_feature_names[i] for i in top_words_idx]
    top_words.append(top_words_list)
    print(",".join(top_words_list))


Topic #0:
edu,com,mail,send,graphics,ftp,pub,available,contact,university,list,faq,ca,information,cs,1993,program,sun,uk,mit
Topic #1:
don,like,just,know,think,ve,way,use,right,good,going,make,sure,ll,point,got,need,really,time,doesn
Topic #2:
christian,think,atheism,faith,pittsburgh,new,bible,radio,games,alt,lot,just,religion,like,book,read,play,time,subject,believe
Topic #3:
drive,disk,windows,thanks,use,card,drives,hard,version,pc,software,file,using,scsi,help,does,new,dos,controller,16
Topic #4:
hiv,health,aids,disease,april,medical,care,research,1993,light,information,study,national,service,test,led,10,page,new,drug
Topic #5:
god,people,does,just,good,don,jesus,say,israel,way,life,know,true,fact,time,law,want,believe,make,think
Topic #6:
55,10,11,18,15,team,game,19,period,play,23,12,13,flyers,20,25,22,17,24,16
Topic #7:
car,year,just,cars,new,engine,like,bike,good,oil,insurance,better,tires,000,thing,speed,model,brake,driving,performance
Topic #8:
people,said,did,just,didn,know,ti
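
The slice `topic.argsort()[:(-n_top_words - 1):-1]` in the loop above is easy to misread, so here is a small sketch of what it does. It uses plain Python lists in place of NumPy arrays (list slicing follows the same rules), and the vocabulary and weights are invented for illustration.

```python
n_top_words = 3

# Hypothetical topic-word weights for a six-word vocabulary.
feature_names = ["car", "disk", "drive", "game", "god", "team"]
weights = [0.5, 4.0, 7.5, 0.1, 0.2, 2.0]

# Plain-Python stand-in for numpy's argsort(): indices in ascending weight order.
ascending_idx = sorted(range(len(weights)), key=lambda i: weights[i])

# Walk backwards from the last index and stop after n_top_words items,
# which yields the indices of the largest weights, largest first.
top_idx = ascending_idx[:(-n_top_words - 1):-1]
top_words = [feature_names[i] for i in top_idx]

print(top_words)
```

In other words, each topic's words are listed from the highest weight in `lda.components_` downwards.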

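## Investigating the functions used for analysis or prediction

### perplexity function

The official documentation says "Calculate approximate perplexity for data X."

Is this a measure of predictive accuracy?

I will keep investigating.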

In [110]:
lda.perplexity(tf) # trying it on the training data as-is

840047.75481485634

### score function

The official documentation says "Calculate approximate log-likelihood as score."

Like perplexity, is this a measure of predictive accuracy?

I will keep investigating.

In [114]:
lda.score(tf) # trying it on the training data as-is

-1116192.3372007054
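
As far as I can tell from the scikit-learn docstring, perplexity is defined as exp(-1 × log-likelihood per word), so `score` and `perplexity` look like two views of the same quantity: a higher score is better, and a lower perplexity is better. The sketch below shows that relationship in plain Python. It uses a made-up uniform model rather than LDA's approximate bound, and the function name `perplexity_from_log_likelihood` is hypothetical.

```python
import math


def perplexity_from_log_likelihood(total_log_likelihood, n_words):
    """exp(-log-likelihood per word); lower means a better fit."""
    return math.exp(-total_log_likelihood / n_words)


# Hypothetical baseline: a model that gives every word of a
# 1,000-word vocabulary the same probability, scored on 5,000 words.
vocab_size = 1000
n_words = 5000
uniform_log_likelihood = n_words * math.log(1.0 / vocab_size)

baseline = perplexity_from_log_likelihood(uniform_log_likelihood, n_words)
print(round(baseline, 6))
```

A uniform guess over a 1,000-word vocabulary has a perplexity of about 1,000, so a useful model should come out well below the vocabulary size. The value of roughly 840,000 reported above is far larger than that, so it is probably worth re-checking on a current scikit-learn release before reading much into it.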

### transform function

The official documentation says "Transform data X according to the fitted model."

This looks like the prediction step itself, but I will keep investigating to be sure.

In [115]:
answers = lda.transform(tf) # trying it on the training data as-is
answers

array([[  3.44893409e-03,   6.28598196e-01,   3.44908103e-03, ...,
          3.44868917e-03,   3.44943514e-03,   3.44884274e-03],
       [  3.33391110e-03,   3.33467317e-03,   9.69993827e-01, ...,
          3.33448122e-03,   3.33413230e-03,   3.33356505e-03],
       [  3.03086351e-03,   6.71061551e-01,   3.03059022e-03, ...,
          3.03054785e-03,   3.03076742e-03,   3.03062719e-03],
       ..., 
       [  2.08357969e-03,   8.84390569e-02,   2.08344836e-03, ...,
          2.08376930e-03,   2.08360955e-03,   2.08370167e-03],
       [  6.53656983e-04,   6.53705973e-04,   6.53712556e-04, ...,
          6.53665594e-04,   7.88066707e-01,   6.53729463e-04],
       [  2.00001102e-02,   2.00062768e-02,   2.21624245e-01, ...,
          2.00018814e-02,   6.18358729e-01,   2.00023251e-02]])
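
One way to check whether `transform` really is the prediction step is to call it on text the model has not seen. The following is a minimal sketch under several assumptions: it targets scikit-learn 0.19 or later (hence `n_components` instead of `n_topics`), and the tiny corpus and the names `train_docs`, `new_docs`, and `new_topics` are invented for illustration.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Tiny hypothetical corpus, just large enough to fit a model.
train_docs = [
    "disk drive controller scsi disk",
    "windows dos software disk drive",
    "team game play season game",
    "game team score play hockey",
]
new_docs = ["scsi controller for my disk drive", "the team won the game"]

vectorizer = CountVectorizer()
tf_train = vectorizer.fit_transform(train_docs)

# n_components is the scikit-learn 0.19+ name for n_topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(tf_train)

# Reuse the fitted vocabulary (transform, not fit_transform) for unseen text.
tf_new = vectorizer.transform(new_docs)
new_topics = lda.transform(tf_new)

print(new_topics.shape)        # one row per new document, one column per topic
print(new_topics.sum(axis=1))  # row sums of the topic weights
```

The important detail is that the new text goes through `vectorizer.transform`, so its columns line up with the vocabulary the model was trained on.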

In the example below, test document #557 appears to be closest to the fourth topic, #3, with a weight of about 0.91.

In [137]:
answers[557]

array([ 0.01000277,  0.0100014 ,  0.01000072,  0.90998758,  0.01000039,
        0.01000056,  0.01000438,  0.01000135,  0.01000046,  0.01000038])
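
Reading the dominant topic off a row like this can be done programmatically. The sketch below copies the `answers[557]` values shown above into a plain list (the names `doc_topics` and `dominant_topic` are just for this example) and picks the index with the largest weight.

```python
# Row copied from the answers[557] output above.
doc_topics = [0.01000277, 0.0100014, 0.01000072, 0.90998758, 0.01000039,
              0.01000056, 0.01000438, 0.01000135, 0.01000046, 0.01000038]

# The row behaves like a probability distribution over the 10 topics.
total = sum(doc_topics)

# Index of the largest weight = the dominant topic for this document.
dominant_topic = max(range(len(doc_topics)), key=lambda k: doc_topics[k])

print("sum of weights: %.4f" % total)
print("dominant topic: #%d (weight %.3f)"
      % (dominant_topic, doc_topics[dominant_topic]))
```

On the full NumPy array, `answers.argmax(axis=1)` should give the same kind of result for every document at once.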

Test document #557 and topic #3 are shown below, and the match looks right.

In [138]:
print("==========")
print("Topic #3:")
print(",".join(top_words[3]))
print("==========")
print("Sample Text:")
print(data_samples[557])

Topic #3:
drive,disk,windows,thanks,use,card,drives,hard,version,pc,software,file,using,scsi,help,does,new,dos,controller,16
Sample Text:
Is there a QIC-80 format tape drive that comes
with an EISA controller ?
Colorado's 250 only has ISA and MCA controllers.

Thanks. e-mail please.

-- 


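Let's try another example.

The top topic for test document #1002 is again the fourth one, #3, though with a weaker weight of about 0.54 (topics #1 and #8 take roughly 0.22 and 0.20). This one also seems roughly right.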

In [139]:
answers[1002]

array([ 0.00526363,  0.22352498,  0.00526431,  0.54179643,  0.00526353,
        0.00526522,  0.00526471,  0.00526447,  0.19782857,  0.00526415])
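
Because this document spreads its weight over several topics, ranking the topics is more informative than taking only the maximum. A small sketch, again using the row printed above as a plain list (the names `ranked` and `top3` are just for this example):

```python
# Row copied from the answers[1002] output above.
doc_topics = [0.00526363, 0.22352498, 0.00526431, 0.54179643, 0.00526353,
              0.00526522, 0.00526471, 0.00526447, 0.19782857, 0.00526415]

# Pair each weight with its topic index and sort by weight, largest first.
ranked = sorted(enumerate(doc_topics), key=lambda pair: pair[1], reverse=True)
top3 = ranked[:3]

for topic_idx, weight in top3:
    print("Topic #%d: %.3f" % (topic_idx, weight))
```

Topics #1 and #8 together carry about 0.42 of the weight, so this document is better described as a mixture than as a clean member of topic #3.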

In [140]:
print("==========")
print("Topic #3:")
print(",".join(top_words[3]))
print("==========")
print("Sample Text:")
print(data_samples[1002])

Topic #3:
drive,disk,windows,thanks,use,card,drives,hard,version,pc,software,file,using,scsi,help,does,new,dos,controller,16
Sample Text:






I've started to notice the same thing myself. I'm running DOS 5 and Win 3.1 so
I can fix it from the Windows Control Panel. At times it is the date, at
others the clock seems to be running several minutes behind where it should
be.

If you find out I'd like to know also. Oh, and I also leave my system running
all the time.
                                                                    
