# LDA topic model
* reference: https://blog.csdn.net/TiffanyRabbit/article/details/76445909

### 1. load data

In [1]:
# Load the data
from sklearn.datasets import fetch_20newsgroups
n_samples=2000
dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'))

data_samples = dataset.data[:n_samples]  # keep only the first n_samples (2000) documents

In [3]:
len(data_samples)

2000

In [4]:
data_samples[:3]

["Well i'm not sure about the story nad it did seem biased. What\nI disagree with is your statement that the U.S. Media is out to\nruin Israels reputation. That is rediculous. The U.S. media is\nthe most pro-israeli media in the world. Having lived in Europe\nI realize that incidences such as the one described in the\nletter have occured. The U.S. media as a whole seem to try to\nignore them. The U.S. is subsidizing Israels existance and the\nEuropeans are not (at least not to the same degree). So I think\nthat might be a reason they report more clearly on the\natrocities.\n\tWhat is a shame is that in Austria, daily reports of\nthe inhuman acts commited by Israeli soldiers and the blessing\nreceived from the Government makes some of the Holocaust guilt\ngo away. After all, look how the Jews are treating other races\nwhen they got power. It is unfortunate.\n",
 "\n\n\n\n\n\n\nYeah, do you expect people to read the FAQ, etc. and actually accept hard\natheism?  No, you need a little leap

### 2. preprocess
* lower casing
* remove stopwords
* tokenization
* stemming
* ...
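
The notebook does not show code for these steps, so the following is only a minimal sketch of what they could look like, using the standard library alone. The stop-word set and the `crude_stem` helper are hypothetical toy stand-ins (a real pipeline would normally use a full stop-word list and a proper stemmer), and the two-letter minimum token length is an assumption.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to", "in", "that", "it"}


def crude_stem(token):
    """Strip a few common suffixes. A toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of the stem remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lower-case, tokenize, drop stop words, then stem."""
    # Tokens are runs of two or more ASCII letters.
    tokens = re.findall(r"[a-z]{2,}", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


print(preprocess("The cars were parked"))
```

In the notebook itself, lower-casing, tokenization, and stop-word removal are handled by `CountVectorizer` in the next section, so a separate step like this would mainly be needed for stemming.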

### 3. CountVectorizer

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=200,
                                stop_words='english')
tf = tf_vectorizer.fit_transform(data_samples)

In [9]:
tf.shape

(2000, 200)
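
To make the shape above concrete: each row of `tf` is a document and each column is a vocabulary term, with the cell holding how often the term occurs in that document. The sketch below illustrates the idea in plain Python on made-up, already tokenized documents. It is not scikit-learn's implementation, and it only models `min_df`, not `max_df`, `max_features`, or stop-word removal.

```python
from collections import Counter

# Toy, already-tokenized documents (made up for illustration).
docs = [
    ["disk", "drive", "scsi", "disk"],
    ["god", "jesus", "life"],
    ["drive", "disk", "scsi"],
    ["life", "god"],
]

# Document frequency: the number of documents each term appears in.
doc_freq = Counter(term for doc in docs for term in set(doc))

min_df = 2  # keep terms that occur in at least two documents
vocabulary = sorted(t for t, df in doc_freq.items() if df >= min_df)

# One row per document, one column per vocabulary term.
counts = [[doc.count(term) for term in vocabulary] for doc in docs]

print(vocabulary)
for row in counts:
    print(row)
```

Here "jesus" is dropped because it appears in only one document, which is the same effect `min_df=2` has in the cell above.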

### 4. LDA & result

In [12]:
from sklearn.decomposition import LatentDirichletAllocation
n_topic = 10
lda = LatentDirichletAllocation(n_components=n_topic,
                                max_iter=50,
                                learning_method='batch')
lda.fit(tf)

LatentDirichletAllocation(max_iter=50)

In [14]:
# Print the 5 highest-weighted words for each of the 10 topics


def print_top_words(model, feature_names, n_top_words):
    # Print the highest-weighted terms for each topic
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]]))

    # Print the topic-word matrix
    print(model.components_)

n_top_words = 5
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
tf_feature_names = tf_vectorizer.get_feature_names_out()
print_top_words(lda, tf_feature_names, n_top_words)



Topic #0:
use car problem using good
Topic #1:
edu com thanks know mail
Topic #2:
said didn year game people
Topic #3:
don people just think like
Topic #4:
god does people life jesus
Topic #5:
space computer time program new
Topic #6:
edu graphics file ftp version
Topic #7:
government law key use state
Topic #8:
10 drive disk 11 16
Topic #9:
55 scsi true bit chip
[[1.00009180e-01 1.00020961e-01 1.00010490e-01 ... 2.48141330e+01
  4.01293547e+00 1.00024846e-01]
 [1.31306629e+00 1.00010515e-01 1.00009805e-01 ... 1.00015531e-01
  1.00015579e-01 9.22601054e+00]
 [1.00006433e-01 2.16236823e+00 1.00016104e-01 ... 1.69423237e+02
  1.25044762e+02 1.00026610e-01]
 ...
 [1.00010712e-01 1.00011526e-01 1.00022145e-01 ... 1.00036837e-01
  1.00025347e-01 1.00027556e-01]
 [1.00853456e+02 2.39509953e+02 1.63173238e+02 ... 1.00021813e-01
  7.21420367e+00 1.00036604e-01]
 [1.00000711e-01 1.00021782e-01 1.00014005e-01 ... 1.00004604e-01
  1.00006773e-01 2.02013174e+00]]


In [17]:
lda.components_.shape

(10, 200)
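
The rows of `components_` are unnormalized weights (pseudo-counts), one row per topic and one column per term. Dividing each row by its sum turns it into a distribution over terms for that topic. The sketch below shows this, along with what the `argsort()[:-n_top_words - 1:-1]` slice selects, on a small made-up matrix. The `components` and `feature_names` arrays are hypothetical stand-ins for `lda.components_` and the vectorizer's vocabulary, not values from the run above.

```python
import numpy as np

# Toy stand-in for lda.components_ (2 topics x 4 terms); values are made up.
components = np.array([
    [0.1, 4.0, 24.8, 0.2],
    [9.2, 0.1, 0.3, 1.3],
])
feature_names = np.array(["car", "god", "space", "disk"])

# Dividing each row by its sum gives a per-topic distribution over terms.
topic_word = components / components.sum(axis=1)[:, np.newaxis]

# argsort is ascending, so the reversed slice picks the n largest
# weights in each row, largest first.
n_top_words = 2
top_idx = components.argsort(axis=1)[:, : -n_top_words - 1 : -1]
top_words = feature_names[top_idx]

print(topic_word.sum(axis=1))
print(top_words)
```

With the fitted model, the same normalization would be `lda.components_ / lda.components_.sum(axis=1)[:, np.newaxis]`.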

**See the reference linked at the top for more details.**