# Topic Model
## Read data

In [1]:
import pandas as pd
import numpy as np

In [47]:
df = pd.read_csv('../data/processed_data.csv')

In [48]:
df.head()

Unnamed: 0,hmid,wid,reflection_period,original_hm,cleaned_hm,modified,num_sentence,ground_truth_category,predicted_category,text
0,27673,2053,24h,i went on a successful date with someone i fel...,i went on a successful date with someone i fel...,True,1,,affection,went successful date someone felt sympathy con...
1,27674,2,24h,i was happy when my son got 90% marks in his e...,i was happy when my son got 90% marks in his e...,True,1,,affection,happy son got mark examination
2,27675,1936,24h,i went to the gym this morning and did yoga.,i went to the gym this morning and did yoga.,True,1,,exercise,went gym morning yoga
3,27676,206,24h,we had a serious talk with some friends of our...,we had a serious talk with some friends of our...,True,2,bonding,bonding,serious talk friend flaky lately understood go...
4,27677,6227,24h,i went with grandchildren to butterfly display...,i went with grandchildren to butterfly display...,True,1,,affection,went grandchild butterfly display crohn conser...


In [49]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100535 entries, 0 to 100534
Data columns (total 10 columns):
hmid                     100535 non-null int64
wid                      100535 non-null int64
reflection_period        100535 non-null object
original_hm              100535 non-null object
cleaned_hm               100535 non-null object
modified                 100535 non-null bool
num_sentence             100535 non-null int64
ground_truth_category    14125 non-null object
predicted_category       100535 non-null object
text                     100534 non-null object
dtypes: bool(1), int64(3), object(6)
memory usage: 7.0+ MB


## Vectorization

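Vectorize the text using the bag-of-words model, which in its TF-IDF form has three steps:

- Count how many times each word occurs in each message (term frequency).

- Weight the counts so that tokens occurring in many messages get a lower weight (inverse document frequency).

- Normalize the vectors to unit length (L2 norm) to abstract away from the original text length.

Only the first step is applied below: `CountVectorizer` produces raw counts, which is the kind of input LDA is designed for.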
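The three steps listed above can be sketched in plain Python. This is only an illustration, not the code this notebook runs: the helper `tfidf` and the three toy documents (loosely based on the `text` column shown in `df.head()`) are hypothetical, tokenization is a simple whitespace split, and the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` is an assumption of this sketch.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocab, rows): one L2-normalized TF-IDF vector per document."""
    tokenized = [doc.split() for doc in docs]  # naive whitespace tokenizer
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)

    # Step 1: term frequency, i.e. raw word counts per document.
    counts = [Counter(toks) for toks in tokenized]

    # Step 2: inverse document frequency (smoothed form, assumed here).
    doc_freq = {t: sum(1 for c in counts if t in c) for t in vocab}
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

    rows = []
    for c in counts:
        weighted = [c[t] * idf[t] for t in vocab]
        # Step 3: divide by the L2 norm so every vector has unit length.
        norm = math.sqrt(sum(w * w for w in weighted)) or 1.0
        rows.append([w / norm for w in weighted])
    return vocab, rows


# Hypothetical toy corpus for illustration only.
docs = [
    "went gym morning yoga",
    "went successful date",
    "happy son got mark examination",
]
vocab, rows = tfidf(docs)
```

In the first document, `went` appears in two of the three texts and so ends up with a lower weight than `gym`, which appears in only one.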



In [50]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

In [71]:
# Drop the one row whose processed text is missing (see df.info() above)
df = df[df['text'].notnull()]

In [79]:
bow_transformer = CountVectorizer().fit(df['text'])
# Print total number of vocab words
print(len(bow_transformer.vocabulary_))

24907


In [80]:
text_bow = bow_transformer.transform(df['text'])

## LDA

In [81]:
from sklearn.decomposition import LatentDirichletAllocation

In [84]:
n_components = 10
lda = LatentDirichletAllocation(n_components=n_components, max_iter=50,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)

In [85]:
lda.fit(text_bow)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='online', learning_offset=50.0,
             max_doc_update_iter=100, max_iter=50, mean_change_tol=0.001,
             n_components=10, n_jobs=None, n_topics=None, perp_tol=0.1,
             random_state=0, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)

In [86]:
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()
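The slice `topic.argsort()[:-n_top_words - 1:-1]` in `print_top_words` selects the indices of the `n_top_words` largest weights, largest first. A minimal pure-Python sketch of the same idea follows; the helper `top_indices`, the vocabulary, and the weights are all made up for illustration.

```python
def top_indices(weights, n_top):
    """Indices of the n_top largest weights, largest first."""
    # Indices ordered by ascending weight, like numpy's argsort().
    ascending = sorted(range(len(weights)), key=lambda i: weights[i])
    # Walk backwards from the end and stop after n_top entries.
    return ascending[:-n_top - 1:-1]


feature_names = ["car", "dog", "family", "gym", "happy"]  # hypothetical vocabulary
topic = [0.2, 3.1, 0.7, 5.4, 1.0]                         # hypothetical topic weights

top = top_indices(topic, 3)
print(" ".join(feature_names[i] for i in top))  # prints "gym dog happy"
```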

In [89]:
n_top_words = 10

In [90]:
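# Note: get_feature_names() was removed in scikit-learn 1.2;
# on newer versions use get_feature_names_out() instead.
feature_names = bow_transformer.get_feature_names()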
print_top_words(lda, feature_names, n_top_words)

Topic #0:
found dog day went took im trip watching around child
Topic #1:
time family went first friend happy moment life month long
Topic #2:
favorite one lunch food college spent didnt delicious girlfriend good
Topic #3:
happy made day got last work job birthday feel event
Topic #4:
happiness thought completed looking song state lost coming person right
Topic #5:
able go work day got gave get morning show ive
Topic #6:
wanted better someone summer date purchased bike ever actually computer
Topic #7:
friend new game bought daughter school house old went mother
Topic #8:
new got work month received finished finally ate working week
Topic #9:
car money got able seeing book sleep brought get company



In [93]:
import pyLDAvis
# Note: newer pyLDAvis releases (3.4+) rename this module to pyLDAvis.lda_model.
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()
pyLDAvis.sklearn.prepare(lda, text_bow, bow_transformer)

