# Application to Real Data
## About the Data
The data used in this project is part of LDA-C, a C implementation of latent Dirichlet allocation (LDA). It consists of three parts: a 1.3-megabyte corpus of text documents, a standard list of stop words (words such as "the" and "a" that are removed so they are not considered as words under topics), and a vocabulary list of the unique words that occur in the texts. The main corpus that we analyze comprises 500 documents from the Associated Press.
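
As a rough illustration of what removing stop words means, here is a minimal sketch. It is not the preprocessing used by LDApackage; the function name `remove_stop_words`, the tiny stop-word set, and the example sentence are all hypothetical, and tokenization is assumed to be a simple lowercase whitespace split.

```python
def remove_stop_words(text, stop_words):
    """Lowercase and whitespace-tokenize `text`, dropping any token in `stop_words`."""
    tokens = text.lower().split()
    return [t for t in tokens if t not in stop_words]

# Hypothetical stop-word set and document, for illustration only.
stop_words = {"the", "a", "of", "and"}
doc = "The price of the stock and a bond"

filtered = remove_stop_words(doc, stop_words)
print(filtered)
```

A real pipeline would typically also strip punctuation and map the remaining tokens to indices in the vocabulary list.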

## Gibbs Sampling Method
First, we use an algorithm based on Gibbs sampling to summarize the 500 documents into 30 topics and, under each topic, pick the top 10 words. The package is called "LDApackage". The final output is the file "Gibbs_RD_topwords.dat". **To reproduce the results, see the README for instructions.**
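
To make "pick the top 10 words under each topic" concrete, the sketch below ranks words by weight within each row of a topic-by-word matrix. It is an assumption-laden illustration, not LDApackage's internal code: the helper `top_words`, the toy vocabulary, and the made-up counts are all hypothetical.

```python
import numpy as np

def top_words(topic_word, vocab, n_top=10):
    """Return the `n_top` highest-weighted words for each topic (one row per topic)."""
    topic_word = np.asarray(topic_word)
    # Negate so that ascending argsort gives descending weights;
    # a stable sort keeps ties in vocabulary order.
    order = np.argsort(-topic_word, axis=1, kind="stable")[:, :n_top]
    return [[vocab[j] for j in row] for row in order]

# Toy example: 2 topics over a 5-word vocabulary (made-up counts).
vocab = ["market", "stock", "police", "court", "percent"]
counts = np.array([[9, 7, 0, 1, 5],
                   [0, 1, 8, 6, 2]])

toy_top = top_words(counts, vocab, n_top=3)
print(toy_top)
```

With 30 topics and `n_top=10`, the same call would yield 30 lists of 10 words each.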

In [None]:
! pip install .

In [None]:
import LDApackage

In [None]:
datafile = 'RealData.txt'
pred = LDApackage.preprocessing(datafile)
lda = LDApackage.LDAModel(pred)
lda.estimate()

#### To view the final output, see "topwords.dat" in the directory. (The results are overwritten each time the code is run, whether on the same or different data.)

## VIEM Method

In [None]:
# Read the documents
real_docs = LDApackage.read_documents('RealData.txt')
# Get the topics and their words
alpha, log_beta, topicwords = LDApackage.LDA_VIEM(real_docs, 10, 10)

# Save the topic words to a text file, one entry per line
with open('VIEM_RD_topwords.txt', 'w') as f:
    for item in topicwords:
        # An f-string is safe even if `item` is a tuple, unlike "%s" % item
        f.write(f"{item}\n")