# Application to Simulated Data

### Gibbs Sampling Method
First, we use the Gibbs sampling algorithm to extract 5 topics from the "One Piece" document and list the top 10 words under each topic. The algorithm is implemented in the package "LDApackage". The final output is written to the file "Gibbs_Sim_topwords.dat". $\textbf{To reproduce the results, see the README for instructions.}$

In [1]:
! pip install .

Processing /home/jovyan/work/sta-663-2019/projects/LDA
Building wheels for collected packages: LDApackage
  Building wheel for LDApackage (setup.py) ... done
  Stored in directory: /tmp/pip-ephem-wheel-cache-a1rd_c8w/wheels/e6/03/d4/b04aaa3df510f60cbe959e49b4d2be1f0b88341f429548e7c5
Successfully built LDApackage
Installing collected packages: LDApackage
  Found existing installation: LDApackage 1.0
    Uninstalling LDApackage-1.0:
      Successfully uninstalled LDApackage-1.0
Successfully installed LDApackage-1.0


In [2]:
import LDApackage

In [3]:
datafile = 'simulated.txt'
dpre = LDApackage.preprocessing(datafile)
lda = LDApackage.LDAModel(dpre)
lda.estimate()

#### To view the final output, see "topwords.dat" in the directory. (The file is overwritten each time the code is run, whether on the same or different data.)

### VIEM Method

In [None]:
# Read the documents
simulated_docs = LDApackage.read_documents_space('simulated.txt')

# Fit the model and get the topic words
alpha, log_beta, topicwords = LDApackage.LDA_VIEM(simulated_docs, 10, 10)

# Save the topic words to a text file, one entry per line
with open('VIEM_Sim_topwords.txt', 'w') as f:
    for item in topicwords:
        f.write("%s\n" % item)
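As a reading aid for the outputs of the cell above: in a common LDA parameterization, `log_beta` holds one row of log word probabilities per topic, and a topic's top words are the largest entries in its row. The sketch below assumes that layout and uses a made-up four-word vocabulary. It is not the `LDApackage` implementation, and `top_words` is a hypothetical helper introduced only for illustration.

```python
import math

def top_words(log_beta, vocab, n_top):
    """Return the n_top highest-scoring words for each topic (one row per topic)."""
    result = []
    for row in log_beta:
        # Rank vocabulary indices by log probability, largest first
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        result.append([vocab[j] for j in ranked[:n_top]])
    return result

# Toy example: 2 topics over a 4-word vocabulary (values are illustrative only)
vocab = ["luffy", "pirate", "manga", "series"]
probs = [
    [0.50, 0.30, 0.15, 0.05],
    [0.05, 0.10, 0.45, 0.40],
]
toy_log_beta = [[math.log(p) for p in row] for row in probs]

toy_topics = top_words(toy_log_beta, vocab, n_top=2)
print(toy_topics)
```

Because the logarithm is monotonic, ranking by log probability gives the same order as ranking by probability.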

### Discussion & Conclusion

The simulated data is a very short piece of text, so the topics generated by the algorithms are easy to inspect by eye, but it is harder to extract rich, specific topic distributions from so little data.

In the results from both Gibbs sampling and VIEM, the words grouped under each topic are clearly related to one another. On closer inspection, the groupings are largely meaningful.

For example, the words in topic 6 of the Gibbs sampling result ("manga," "series," etc.) all relate to the nature and attributes of the manga/anime industry. A similar topic appears in the VIEM output. As another example, the topic containing "Luffy," "pirate," "island," etc. groups words related to the main content and setting of the One Piece manga/anime itself.
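The agreement between the two algorithms could also be checked numerically rather than by eye. One simple option, sketched below, is the Jaccard overlap between top-word lists. The word lists here are made-up placeholders, not the actual contents of the output files, and `jaccard` and `best_matches` are hypothetical helpers, not part of `LDApackage`.

```python
def jaccard(words_a, words_b):
    """Jaccard overlap of two word lists: |A & B| / |A | B|."""
    a, b = set(words_a), set(words_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def best_matches(topics_a, topics_b):
    """For each topic in topics_a, find the most similar topic in topics_b."""
    matches = []
    for i, topic_a in enumerate(topics_a):
        scores = [jaccard(topic_a, topic_b) for topic_b in topics_b]
        j = max(range(len(scores)), key=scores.__getitem__)
        matches.append((i, j, scores[j]))
    return matches

# Placeholder top-word lists (illustrative only, not real model output)
gibbs_topics = [["manga", "series", "anime"], ["luffy", "pirate", "island"]]
viem_topics = [["luffy", "pirate", "crew"], ["manga", "series", "volume"]]

matches = best_matches(gibbs_topics, viem_topics)
for i, j, score in matches:
    print(f"Gibbs topic {i} ~ VIEM topic {j} (Jaccard = {score:.2f})")
```

Matching is done by word overlap rather than by topic index because the two algorithms need not number their topics in the same order.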

It is also clear that neither algorithm treats different forms of a word as the same token: "island" and its plural "islands," for example, are handled as two distinct words. Fortunately, such forms usually end up in the same topic, as both algorithms do with "island" and "islands," which speaks to the quality of the model. Given the complexity of natural language, this shortcoming is understandable and expected.
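To make the point about word forms concrete, the sketch below counts tokens in a made-up sentence before and after a deliberately naive plural-stripping step. This normalization is an assumption for illustration only and is not something `LDApackage` is known to do. A proper stemmer or lemmatizer would be more robust than the hypothetical `naive_singular` rule used here.

```python
from collections import Counter

def naive_singular(token):
    """Very rough plural stripping: drop one trailing 's' from longer words."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

# Made-up example sentence (not taken from the simulated data)
tokens = "the island and the islands of the grand line".split()

raw_counts = Counter(tokens)                                # "island" and "islands" are separate
merged_counts = Counter(naive_singular(t) for t in tokens)  # both counted as "island"

print(raw_counts["island"], raw_counts["islands"])
print(merged_counts["island"])
```

Without such a step the model sees two unrelated vocabulary entries and can only place them in the same topic through their co-occurrence patterns.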

Since the dataset is quite small, there is no significant difference in speed or accuracy between the initial and optimized algorithms. The application to real data and its discussion can be found in "Real+Data.ipynb".