# Mixture of Unigram Model

To reproduce the model results on the simulated data, follow these steps:
1. Run the preprocessing code below on the simulated text file.
2. Download the text file that the code generates (`simu.txt`), convert it into a Unix executable file in your local terminal, then re-upload it so it can be run with the UMM model.
3. In the Jupyter terminal, run `python pDMM.py --corpus simu --ntopics 10 --twords 10 --niters 500 --name unigram`.
4. Check the `output` folder for the generated file `unigram.topWords`.

In [1]:
import string

In [2]:
# Load the stopword list, one word per line.
with open("stopword.txt", 'r') as s:
    stpw = [word.strip() for word in s]
# Read the simulated corpus line by line.
with open("simulated.txt") as f:
    corpus = f.readlines()
word_list = []
for line in corpus:
    for word in line.strip().split():
        # Strip surrounding punctuation and lowercase before the stopword check.
        cleaned = word.strip(string.punctuation).lower()
        # Keep only non-empty tokens that are not stopwords.
        if cleaned and cleaned not in stpw:
            word_list.append(cleaned)

In [3]:
new = []
# Group the cleaned words into 234 lines of five words each.
for i in range(234):
    line = " ".join(word_list[i*5:i*5+5])
    new.append(line)
simu = "\n".join(new)

In [4]:
# Write the grouped lines to simu.txt; the file is closed automatically.
with open('simu.txt', 'w') as output_file:
    output_file.write(simu)

#Then convert this txt file into a Unix executable file in your local terminal, re-upload it, and run it with the UMM model.
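
The grouping step in cell `In [3]` hard-codes 234 lines of five words. As a minimal sketch, assuming the goal is simply to split the cleaned word list into fixed-size lines, the number of lines can be derived from the list length instead. The helper name `chunk_words` and the toy word list are hypothetical and not part of the original notebook. Unlike the original loop, this version keeps a shorter final line rather than padding with empty lines or dropping leftover words.

```python
def chunk_words(words, size=5):
    """Group a flat word list into space-joined lines of `size` words each."""
    # Step through the list in strides of `size`; the last line may be shorter.
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


# Hypothetical toy input standing in for `word_list`.
toy_words = ["luffy", "crew", "sea", "devil", "fruit", "grand", "line"]
lines = chunk_words(toy_words, size=5)
print("\n".join(lines))
```

With the real data, `"\n".join(chunk_words(word_list))` would play the role of `simu`.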

#### Mixture of Unigrams Results

Topic 0: luffy piece search treasure head law monkey fruit named pirate

Topic 1: devil fruit user fruits animals race sea power powers haki 

Topic 2: sea grand red water mountain half rain runs seas wind 

Topic 3: series pirates roger humans gol merry manga video animation produced 

Topic 4: crew luffy robin ancient liberates sabaody archipelago ace navy government 

Topic 5: crew blue pirates navy straw joins named grand east pirate 

Topic 6: luffy nami sanji arlong chopper body properties crew usopp encounters 

Topic 7: pose calm belts log developed called thirteen animated feature films 

Topic 8: piece manga pirates king set island zou series eiichiro history 

Topic 9: pirates straw island luffy hat grand kingdom magnetic island's fishman 

# Topic extraction with Non-negative Matrix Factorization and Latent Dirichlet Allocation


This is an example of applying `sklearn.decomposition.NMF` and
`sklearn.decomposition.LatentDirichletAllocation` to a corpus
of documents to extract additive models of the topic structure of the
corpus. The output is a list of topics, each represented as a list of
terms (weights are not shown).

Non-negative Matrix Factorization is applied with two different objective
functions: the Frobenius norm, and the generalized Kullback-Leibler divergence.
The latter is equivalent to Probabilistic Latent Semantic Indexing.
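
For reference, the two objectives can be written as follows. The symbols are introduced here for illustration and follow common NMF notation rather than anything defined in this notebook: $X$ is the non-negative document-term matrix, and $W$ and $H$ are the non-negative factors. Regularization terms are omitted.

$$
d_{\mathrm{Fro}}(X, WH) = \frac{1}{2}\,\lVert X - WH \rVert_F^2 = \frac{1}{2}\sum_{i,j}\bigl(X_{ij} - (WH)_{ij}\bigr)^2
$$

$$
d_{\mathrm{KL}}(X, WH) = \sum_{i,j}\Bigl(X_{ij}\,\log\frac{X_{ij}}{(WH)_{ij}} - X_{ij} + (WH)_{ij}\Bigr)
$$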

The default parameters (n_samples / n_features / n_components) should make
the example runnable in a couple of tens of seconds. You can try to
increase the dimensions of the problem, but be aware that the time
complexity is polynomial in NMF. In LDA, the time complexity is
proportional to (n_samples * iterations).
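
The following is a minimal sketch of fitting the three models described above on a tiny corpus. The toy documents, the helper `top_terms`, and the parameter values are assumptions chosen for illustration; they are not the settings or data used in this notebook. The term list is built from `vocabulary_` so that the sketch does not depend on a particular scikit-learn version.

```python
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, not the simulated data used in this notebook.
toy_docs = [
    "pirates sail the grand sea",
    "the crew sails the sea",
    "devil fruit gives power",
    "fruit users lose power in the sea",
    "pirates search for treasure",
    "the crew finds treasure",
]


def top_terms(model, vectorizer, n_top=3):
    """Return the n_top highest-weighted terms of each fitted topic."""
    # Order the vocabulary by column index to recover the feature names.
    names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
    return [[names[i] for i in topic.argsort()[:-n_top - 1:-1]]
            for topic in model.components_]


# NMF works on tf-idf features.
tfidf_vec = TfidfVectorizer(stop_words="english")
tfidf_toy = tfidf_vec.fit_transform(toy_docs)

# Objective 1: Frobenius norm (default coordinate-descent solver).
nmf_fro = NMF(n_components=2, beta_loss="frobenius", init="nndsvda",
              random_state=1, max_iter=500).fit(tfidf_toy)

# Objective 2: generalized Kullback-Leibler divergence (requires solver="mu").
nmf_kl = NMF(n_components=2, beta_loss="kullback-leibler", solver="mu",
             init="nndsvda", random_state=1, max_iter=500).fit(tfidf_toy)

# LDA works on raw term counts.
count_vec = CountVectorizer(stop_words="english")
counts_toy = count_vec.fit_transform(toy_docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts_toy)

for label, model, vec in [("NMF (Frobenius)", nmf_fro, tfidf_vec),
                          ("NMF (KL)", nmf_kl, tfidf_vec),
                          ("LDA", lda, count_vec)]:
    print(label, top_terms(model, vec))
```

On a corpus this small the topics are not meaningful; the point is only to show which vectorizer and which `beta_loss` / `solver` combination goes with each model.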

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import sklearn
from sklearn.decomposition import NMF, LatentDirichletAllocation

n_samples = 16
n_features = 1000
n_components = 10
n_top_words = 10

In [6]:
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()
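
The slice `topic.argsort()[:-n_top_words - 1:-1]` in `print_top_words` can be hard to read at first. As a small illustration with made-up weights and term names (both hypothetical), `argsort()` returns indices in ascending order of weight, and the reversed slice takes the last `n_top` of them, largest first:

```python
import numpy as np

# Hypothetical topic weights and matching term names.
weights = np.array([0.1, 0.7, 0.3, 0.9, 0.5])
names = ["sea", "luffy", "crew", "pirates", "island"]
n_top = 3

# Indices of the n_top largest weights, in descending order of weight.
top_idx = weights.argsort()[:-n_top - 1:-1]
top_terms_demo = [names[i] for i in top_idx]
print(top_terms_demo)  # ['pirates', 'luffy', 'island']
```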

In [7]:
import LDApackage
simulated_docs = LDApackage.read_documents_space('simulated.txt')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\txh06\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [8]:
# Use tf-idf features for NMF.
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                                   max_features=n_features,
                                   stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(simulated_docs)
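
To make the effect of `max_df=0.95` and `min_df=2` concrete, here is a minimal sketch on a hypothetical four-document corpus (not the simulated data). Terms that appear in more than 95% of the documents or in fewer than two documents are dropped from the vocabulary. The `stop_words` argument is left out here to isolate the document-frequency filters.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "luffy" is in every document,
# "pirates", "crew" and "devil" each appear in only one.
toy_docs = [
    "luffy pirates sea",
    "luffy crew sea",
    "luffy devil fruit",
    "luffy fruit sea",
]

toy_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2)
toy_tfidf = toy_vectorizer.fit_transform(toy_docs)

# Only terms with a document frequency between 2 and 3 survive.
kept_terms = sorted(toy_vectorizer.vocabulary_)
print(kept_terms)       # ['fruit', 'sea']
print(toy_tfidf.shape)  # (4, 2)
```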

In [9]:
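# Fit the NMF model with the generalized Kullback-Leibler divergence.
# Note: `alpha` was deprecated in scikit-learn 1.0 and removed in 1.2;
# newer versions use `alpha_W` / `alpha_H` instead.
nmf = NMF(n_components=n_components, random_state=1,
          beta_loss='kullback-leibler', solver='mu', max_iter=1000, alpha=.1,
          l1_ratio=.5).fit(tfidf)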

#print("\nTopics in NMF model (generalized Kullback-Leibler divergence):")
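# Note: scikit-learn 1.0+ provides get_feature_names_out(); get_feature_names() was removed in 1.2.
tfidf_feature_names = tfidf_vectorizer.get_feature_names()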
#print_top_words(nmf, tfidf_feature_names, n_top_words)

#### Topics in NMF model (generalized Kullback-Leibler divergence):
Topic 0: pirates line crew man search roger known king grand luffy

Topic 1: history japan series date manga oda eiichiro body volumes world

Topic 2: body devil animals used result users power presence time fruit

Topic 3: grand time called sea currents line works island making specific

Topic 4: law defeat mom sanji alliance caesar nami clown big straw

Topic 5: group battles robin straw franky leading crew ancient save pluton

Topic 6: used animals wind similar piece world various certain presence eiichiro

Topic 7: ace huge grand luffy adoptive forced fish thousand led new

Topic 8: soon cyborg island crew fishmen sabaody battle alias world archipelago

Topic 9: blue humans usopp government going captures water sanji defeats creatures