## DS5559 - Project
## Notebook 6 - Topic Models
#### Name: Mengyao Zhang (mz6jv), Runhao Zhao (rz6dg)

## Synopsis
Use case: this notebook builds LDA topic models for our corpus with MALLET, treating each chapter as one document.

### Libraries

In [2]:
import pandas as pd
import sqlite3
import os
import textman as tx

### Configuration

In [12]:
corpus_db = 'project.db'
#max_words = 5000  # max number of vocabulary terms queried; tried for paragraph and chapter docs, but polo can't run on chapters
# 5000 was used for 8 books (paragraphs as docs)
max_words = 2000

num_topics = 20
num_iters = 1000
show_interval = 100

In [35]:
OHCO = ['book_num','chap_num', 'para_num', 'sent_num', 'token_num']
BOOKS = OHCO[:1] 
CHAPS = OHCO[:2]
PARAS = OHCO[:3]
SENTS = OHCO[:4]
#BAG = PARAS 
BAG = CHAPS

In [9]:
# For MALLET
os.environ["MALLET_HOME"] = './mallet-2.0.8'
mallet_path = '../../../mallet-2.0.8/bin/mallet'

### Process

#### Get the tokens we want from the database

In [36]:
# use SQL to keep tokens whose term is among the top max_words non-stop terms by tfidf_sum,
# and drop proper nouns (POS tags starting with NNP)
sql = """
SELECT * FROM token 
WHERE term_id IN (
    SELECT term_id FROM vocab 
    WHERE stop = 0
    ORDER BY tfidf_sum DESC LIMIT {}
) 
AND (pos NOT LIKE 'NNP%')
""".format(max_words)

In [37]:
with sqlite3.connect(corpus_db) as db:
    tokens = pd.read_sql(sql, db)
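
The same filter can also be written with a bound parameter instead of `str.format`, which keeps the limit out of the SQL string. The following is a minimal sketch, not the notebook's own code: it runs against a tiny in-memory database whose rows are made up for illustration, and only the table and column names are taken from the query above.

```python
import sqlite3
import pandas as pd

# Tiny in-memory stand-in for project.db. The rows are invented for this
# sketch; only the table and column names follow the query in the notebook.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE vocab (term_id INTEGER, term_str TEXT, stop INTEGER, tfidf_sum REAL);
CREATE TABLE token (term_id INTEGER, term_str TEXT, pos TEXT);
INSERT INTO vocab VALUES
    (1, 'walk', 0, 3.2), (2, 'the', 1, 0.1), (3, 'day', 0, 2.5),
    (4, 'jane', 0, 4.0), (5, 'indeed', 0, 0.4);
INSERT INTO token VALUES
    (1, 'walk', 'NN'), (2, 'the', 'DT'), (3, 'day', 'NN'),
    (4, 'jane', 'NNP'), (5, 'indeed', 'RB');
""")

# "?" is sqlite3's placeholder; the value is supplied through params.
demo_sql = """
SELECT * FROM token
WHERE term_id IN (
    SELECT term_id FROM vocab
    WHERE stop = 0
    ORDER BY tfidf_sum DESC LIMIT ?
)
AND pos NOT LIKE 'NNP%'
"""
demo_tokens = pd.read_sql(demo_sql, db, params=(3,))
db.close()

print(sorted(demo_tokens.term_str))
```

In this toy data the top three non-stop terms are `jane`, `walk` and `day`, and the proper-noun filter then removes `jane`.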

#### Create dataframe

In [38]:
tokens.head()

,book_num,chap_num,para_num,sent_num,token_num,pos,token_str,punc,num,term_str,term_id
0,1,0,1,0,0,NN,Chapter,0,0,chapter,6192
1,1,0,2,0,5,VBG,taking,0,0,taking,38869
2,1,0,2,0,7,NN,walk,0,0,walk,43412
3,1,0,2,0,9,NN,day,0,0,day,9790
4,1,0,2,1,5,RB,indeed,0,0,indeed,20395


In [21]:
tokens.shape

(2042777, 11)

In [39]:
len(tokens.term_str.unique())

1953

In [8]:
tokens.para_num.max()

279

In [40]:
tokens = tokens.set_index(BAG)

In [23]:
tokens.head()

book_num,chap_num,para_num,sent_num,token_num,pos,token_str,punc,num,term_str,term_id
1,0,1,0,0,NN,Chapter,0,0,chapter,6192
1,0,2,0,3,NN,possibility,0,0,possibility,30015
1,0,2,0,5,VBG,taking,0,0,taking,38869
1,0,2,0,7,NN,walk,0,0,walk,43412
1,0,2,0,9,NN,day,0,0,day,9790


#### Convert tokens df to the corpus format MALLET requires

In [24]:
# the following is for BAG=PARAS
'''
corpus = tx.gather_tokens(tokens, level=3, col='term_str')\
    .reset_index().rename(columns={'term_str':'doc_content'})
corpus['doc_label'] = corpus.apply(lambda x: "book-{}_chap-{}_para-{}".format(x.book_num,x.chap_num, x.para_num), 1)
'''


In [41]:
# the following is for BAG=CHAPS

corpus = tx.gather_tokens(tokens, level=2, col='term_str')\
    .reset_index().rename(columns={'term_str':'doc_content'})
corpus['doc_label'] = corpus.apply(lambda x: "book-{}_chap-{}".format(x.book_num,x.chap_num), 1)
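
For readers without `textman`, the gathering step can be approximated in plain pandas. This is a sketch under the assumption that `tx.gather_tokens` joins the `term_str` values of each bag into one space-separated string; the toy frame and the names `demo`, `bag` and `demo_corpus` are hypothetical.

```python
import pandas as pd

# Toy token table with the same key columns as the chapter-level bag.
demo = pd.DataFrame({
    "book_num": [1, 1, 1, 1, 2],
    "chap_num": [0, 0, 1, 1, 0],
    "term_str": ["chapter", "walk", "chapter", "way", "letter"],
})

bag = ["book_num", "chap_num"]  # corresponds to CHAPS in this notebook

# One row per chapter: join the chapter's terms into a single string.
demo_corpus = (
    demo.groupby(bag)["term_str"]
        .apply(" ".join)
        .reset_index()
        .rename(columns={"term_str": "doc_content"})
)

# Same label scheme as the cell above.
demo_corpus["doc_label"] = [
    "book-{}_chap-{}".format(b, c)
    for b, c in zip(demo_corpus.book_num, demo_corpus.chap_num)
]

print(demo_corpus)
```

Switching `bag` to the three paragraph-level keys would give the paragraph-as-document variant, with the label format extended accordingly.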

In [42]:
corpus.shape

(1593, 4)

In [43]:
corpus.head(10)

,book_num,chap_num,doc_content,doc_label
0,1,0,chapter taking walk day indeed hour morning si...,book-1_chap-0
1,1,1,chapter way new thing circumstance greatly bad...,book-1_chap-1
2,1,2,chapter next thing remember feeling seeing ter...,book-1_chap-2
3,1,3,chapter gathered enough hope wishing get well ...,book-1_chap-3
4,1,4,chapter five clock hardly struck morning broug...,book-1_chap-4
5,1,5,hour time exercise day fully bell time another...,book-1_chap-5
6,1,6,chapter next day getting dressing morning obli...,book-1_chap-6
7,1,7,chapter first quarter seemed age golden age ei...,book-1_chap-7
8,1,8,chapter half hour ended five clock struck scho...,book-1_chap-8
9,1,9,chapter rather spring drew indeed already come...,book-1_chap-9


In [25]:
corpus.head(10)

,book_num,chap_num,para_num,doc_content,doc_label
0,1,0,1,chapter,book-1_chap-0_para-1
1,1,0,2,possibility taking walk day wandering indeed h...,book-1_chap-0_para-2
2,1,0,3,glad never liked long walks especially afterno...,book-1_chap-0_para-3
3,1,0,4,said round mama drawing room lay sofa time nei...,book-1_chap-0_para-4
4,1,0,5,say done asked,book-1_chap-0_para-5
5,1,0,6,like besides something truly child taking mann...,book-1_chap-0_para-6
6,1,0,7,breakfast room adjoined drawing room contained...,book-1_chap-0_para-7
7,1,0,8,shut view right hand left clear glass day inte...,book-1_chap-0_para-8
8,1,0,9,returned book british birds cared little gener...,book-1_chap-0_para-9
9,1,0,10,vast round melancholy among,book-1_chap-0_para-10


In [44]:
selected_bk = [1,5,6,8,14,16,19,20,27] # select a book for each author (paras as doc)

In [45]:
new_corpus = corpus.loc[corpus.book_num.isin(selected_bk)] # create new corpus for selected books

In [46]:
new_corpus.shape

(480, 4)

#### Write corpus to csv file

In [14]:
#corpus[['doc_label','doc_content']].to_csv('novels_for_mallet.csv', index=False)

In [47]:
# write the reduced corpus to CSV; polo2 will use this file later
new_corpus[['doc_label','doc_content']].to_csv('novels_for_mallet_chaps_small.csv', index=False)
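
One thing worth checking before import: as far as we understand MALLET's `import-file` defaults, each line is read as a name, then a label, then the text. With a two-column CSV that has a header row, the header would then be read as a document and the first word of each chapter as its label. The sketch below writes a three-column, tab-separated, header-free layout instead. The placeholder label `novel` and the names `demo_corpus` and `mallet_lines` are assumptions for illustration, and it writes to an in-memory buffer rather than a real file.

```python
import io
import pandas as pd

# Two toy documents shaped like the corpus frame above.
demo_corpus = pd.DataFrame({
    "doc_label": ["book-1_chap-0", "book-1_chap-1"],
    "doc_content": ["chapter taking walk day", "chapter way new thing"],
})

# Add a constant placeholder label so the text stays intact in the third field.
demo_corpus.insert(1, "label", "novel")

# Tab-separated, no header, no index: one document per line.
buf = io.StringIO()
demo_corpus[["doc_label", "label", "doc_content"]].to_csv(
    buf, sep="\t", header=False, index=False
)
mallet_lines = buf.getvalue().splitlines()

print(mallet_lines[0])
```

Whether this matters for the results here depends on the MALLET version and options in use, so it is best treated as something to verify rather than a required change.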

### Use MALLET for Topic Model (all chapters in corpus)

#### Check MALLET options

In [10]:
!{mallet_path}

Unrecognized command: 
Mallet 2.0 commands: 

  import-dir         load the contents of a directory into mallet instances (one per file)
  import-file        load a single file into mallet instances (one per line)
  import-svmlight    load SVMLight format data files into Mallet instances
  info               get information about Mallet instances
  train-classifier   train a classifier from Mallet data files
  classify-dir       classify data from a single file with a saved classifier
  classify-file      classify the contents of a directory with a saved classifier
  classify-svmlight  classify data from a single file in SVMLight format
  train-topics       train a topic model from Mallet data files
  infer-topics       use a trained topic model to infer topics for new documents
  evaluate-topics    estimate the probability of new documents under a trained model
  prune              remove features based on frequency or information gain
  split              divide data i

#### Import corpus

In [11]:
!{mallet_path} import-file --input novels_for_mallet_chaps.csv --output novels_project.mallet --keep-sequence TRUE

#### Train topics (chapters as documents)

In [13]:
!{mallet_path} train-topics --input novels_project.mallet --num-topics {num_topics} --num-iterations {num_iters} \
--output-doc-topics project-doc-topics.txt \
--output-topic-keys project-topic-keys.txt \
--word-topic-counts-file project-word-topic-counts-file.txt \
--topic-word-weights-file project-topic-word-weights-file.txt \
--xml-topic-report project-topic-report.xml \
--xml-topic-phrase-report project-topic-phrase-report.xml \
--show-topics-interval {show_interval} \
--use-symmetric-alpha false  \
--optimize-interval 100 \
--diagnostics-file project-diagnostics.xml

Mallet LDA: 20 topics, 5 topic bits, 11111 topic mask
Data loaded.
max tokens: 3186
total tokens: 1779729
<10> LL/token: -9.17189
<20> LL/token: -8.53012
<30> LL/token: -8.34924
<40> LL/token: -8.26447
<50> LL/token: -8.21712
<60> LL/token: -8.18461
<70> LL/token: -8.15975
<80> LL/token: -8.13982
<90> LL/token: -8.1254

0	0.25	like would little one yet day well thought good still long know school shall take master never certain near white 
1	0.25	could would heart mother love even never father time last thought must mind life away first might long yet ever 
2	0.25	one saw could two room asked eye well though see half place seen never ladies looked replied lady look pleasure 
3	0.25	said doctor one father room know think molly came well time come squire mother much tell see sure say thought 
4	0.25	said would life hand seemed felt mind speak must heart may moment without still words sense made look present wish 
5	0.25	said sir know upon say man hand returned head come take looked back 

<370> LL/token: -8.04813
<380> LL/token: -8.05047
<390> LL/token: -8.05062

0	0.24042	like little would yet well day shall long still good seemed take one thought know found never white old time 
1	0.30529	could would heart never love time must life thought first ever one even away long last mind yet might knew 
2	0.27101	one see eyes lady though looked saw look well smile pleasure room wish asked replied half continued eye place ladies 
3	0.20542	father said doctor one mother papa molly room know time much think came well come went squire thought must tell 
4	0.30978	said would felt must could might feeling seemed life made without towards thought mind still eyes even sense hand words 
5	0.26651	said sir upon know returned say head hand little old man name time take another come looked put gentleman mind 
6	0.22868	said aunt uncle know think dear sister well much say mother went came way quite old like made going away 
7	0.30289	face door hand eyes night like one come bed head stood r

<650> LL/token: -8.08148
<660> LL/token: -8.08094
<670> LL/token: -8.08149
<680> LL/token: -8.07919
<690> LL/token: -8.08104

0	0.18782	like little yet would well one day seemed eye still looked night good long take could hand certain neither white 
1	0.40088	would could love never heart must one thought life time ever last still might even long first think away knew 
2	0.23977	would shall see till replied little answered tell master well one never asked must wish take let leave rather may 
3	0.14181	said father doctor one time molly papa mother well know think came much squire room come tell could day little 
4	0.32954	said would might could like made felt mind must without seemed feeling husband anything something towards little new looking sense 
5	0.2584	said upon sir little know say returned hand head time man looked old made back way one come put went 
6	0.15998	aunt uncle brother sister said family niece much dear children cousin hope went well say though years thought never not

<930> LL/token: -8.06167
<940> LL/token: -8.061
<950> LL/token: -8.06067
<960> LL/token: -8.05834
<970> LL/token: -8.05815
<980> LL/token: -8.058
<990> LL/token: -8.05655

0	0.1471	like would little yet well seemed one good still eye long night day hand heart certain indeed looked took white 
1	0.46229	would could love never heart letter must one thought life might think first day last knew long even see ever 
2	0.1625	would shall replied well little till master let answered take tell continued must time asked one cried alone leave see 
3	0.10274	father said doctor molly time papa one mother squire came much well thought house little room mamma would daughter yet 
4	0.25734	said would might like felt made could mind seemed husband must life feeling new without sense something thought even rather 
5	0.24677	said upon sir little hand head know say man returned old one time looked made way took put would back 
6	0.11996	aunt uncle brother said sister family niece cousin children dear hope
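
The topic lines printed above also end up in `project-topic-keys.txt`, which can be loaded into pandas for inspection. A minimal sketch, assuming the file holds tab-separated topic id, alpha weight and key words, as the log lines suggest; the sample text is an excerpt of the final log output shortened to ten words per topic, and the column names are our own.

```python
import io
import pandas as pd

# Excerpt in the assumed topic-keys layout: id <TAB> alpha <TAB> words.
sample = (
    "0\t0.1471\tlike would little yet well seemed one good still eye\n"
    "1\t0.46229\twould could love never heart letter must one thought life\n"
    "3\t0.10274\tfather said doctor molly time papa one mother squire came\n"
)

topics = pd.read_csv(
    io.StringIO(sample),  # replace with the real file path in practice
    sep="\t",
    header=None,
    names=["topic_id", "alpha", "topic_words"],
)

# Higher alpha means the topic is spread across more of the corpus.
topics = topics.sort_values("alpha", ascending=False).reset_index(drop=True)

print(topics)
```

Sorting by alpha puts broad, corpus-wide topics first and narrow ones (such as the topic led by `father`, `doctor`, `molly`) last.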

### Use MALLET for Topic Model (9 selected books)

In [14]:
!{mallet_path} import-file --input novels_for_mallet_chaps_small.csv --output novels_project.mallet --keep-sequence TRUE

In [15]:
!{mallet_path} train-topics --input novels_project.mallet --num-topics {num_topics} --num-iterations {num_iters} \
--output-doc-topics project-doc-topics.txt \
--output-topic-keys project-topic-keys.txt \
--word-topic-counts-file project-word-topic-counts-file.txt \
--topic-word-weights-file project-topic-word-weights-file.txt \
--xml-topic-report project-topic-report.xml \
--xml-topic-phrase-report project-topic-phrase-report.xml \
--show-topics-interval {show_interval} \
--use-symmetric-alpha false  \
--optimize-interval 100 \
--diagnostics-file project-diagnostics.xml

Mallet LDA: 20 topics, 5 topic bits, 11111 topic mask
Data loaded.
max tokens: 3186
total tokens: 525616
<10> LL/token: -8.82889
<20> LL/token: -8.31175
<30> LL/token: -8.15052
<40> LL/token: -8.05911
<50> LL/token: -8.00717
<60> LL/token: -7.96276
<70> LL/token: -7.93112
<80> LL/token: -7.90481
<90> LL/token: -7.88543

0	0.25	sir upon know never old little night time made much looked found life come another ever asked head heart see 
1	0.25	would letter family may might though take brother son thought said opinion even told money possible friend means give known 
2	0.25	little two one like old well young would got often must enough ladies good high large fine soon never children 
3	0.25	would replied half take papa make next began done get going another idea keep put care course boy shall little 
4	0.25	said aunt dear say little head hand returned anything think know like ever face love always quite old hope looking 
5	0.25	said shall must tell one see let know would could till never 

<340> LL/token: -7.84676
<350> LL/token: -7.85229
<360> LL/token: -7.85956
<370> LL/token: -7.8541
<380> LL/token: -7.8648
<390> LL/token: -7.86178

0	0.23555	said upon know sir little old hand time never one much ever see went head looked returned night life come 
1	0.31604	would letter may might though family brother father told well indeed business first friend make friends could leave take known 
2	0.31535	two young old like well good though nothing fine lady little ladies course side enough children three quite looked book 
3	0.30526	would make little replied half take done told back get took last tea care might put keep papa master next 
4	0.21669	said aunt dear little say think like always anything quite made face sat took head love looking eyes happy great 
5	0.30202	said shall must let tell never till could may ever know heart look master come hand one better see door 
6	0.22413	would could girl might daughter love tailor lady told wife think though say must young come done ma

<610> LL/token: -7.96792
<620> LL/token: -7.97222
<630> LL/token: -7.97778
<640> LL/token: -7.97586
<650> LL/token: -7.97401
<660> LL/token: -7.97513
<670> LL/token: -7.97947
<680> LL/token: -7.98155
<690> LL/token: -7.98027

0	0.23694	said little upon know head sir hand old much like returned never see made looking saw one ever face went 
1	0.37683	letter would could might must shall may take write last read wrote time words morning long brother think say hope 
2	0.35082	would well little children good like two fine girl long must sister one young ladies things old half school course 
3	0.28573	would master replied answered till get cried though young exclaimed little home cousin half papa keep father kitchen however next 
4	0.16033	aunt said dear think upon child time happy always quite say course love room agnes anything old wife long indeed 
5	0.36234	said shall must see could tell never let come heart well know would better head ever hand yet left give 
6	0.17593	would girl might 

[beta: 0.12078] 
<900> LL/token: -7.97263
<910> LL/token: -7.97554
<920> LL/token: -7.96904
<930> LL/token: -7.97513
<940> LL/token: -7.9684
<950> LL/token: -7.97412
<960> LL/token: -7.96631
<970> LL/token: -7.96938
<980> LL/token: -7.98002
<990> LL/token: -7.97389

0	0.22641	said little upon know head old sir hand much made time ever like returned dear face looking say never day 
1	0.35434	would letter could might must shall told think take house morning going till last soon tell come see make home 
2	0.29078	would little one well like children could two book school girl sister long day time course good things often lady 
3	0.16676	master replied answered till papa cried father cousin would began though half get young house kitchen door mistress keep lady 
4	0.09433	aunt said dear say agnes first returned upon long great really child traddles hope family time business next course table 
5	0.39136	said could see shall must would heart tell never come let well ever done away leave good 

In [None]:
# End