# Part 5.2 - Topic Modeling
---
### Papers Past Topic Modeling
<br/>

Ben Faulks - bmf43@uclive.ac.nz

Xiandong Cai - xca24@uclive.ac.nz

Yujie Cui - ycu23@uclive.ac.nz

In [1]:
import gc
import pandas as pd

**In this part, we will perform the following operations:**

1. use MALLET to train a topic model on the training set and write the result files;
1. infer topics for several subsets with the trained model and write the result files.

## 1 Training Topic Model

**MALLET accepts input either as one instance per file or as a single file with one instance per line. For us, the only practical choice is a single file with one instance per line, so we already prepared the .csv file for training in part 5.1.**
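**As a rough illustration of the one-instance-per-line layout, the sketch below flattens a few records into tab-separated lines (id, label, text) and reads them back. The three-column layout mirrors the preview shown in this notebook; the sample records and the helper name `to_instance_line` are made up for illustration, and the actual preprocessing in part 5.1 may differ.**

```python
import csv
import io


def to_instance_line(doc_id, label, text):
    """Flatten one document into a single tab-separated line (hypothetical helper)."""
    # Tabs and line breaks inside a field would split the instance, so collapse
    # every run of whitespace into a single space.
    fields = [" ".join(str(field).split()) for field in (doc_id, label, text)]
    return "\t".join(fields)


# Made-up sample records, not taken from the Papers Past data.
records = [
    (1, "Shipping News", "The barque arrived\nfrom Lyttelton on Monday."),
    (2, "Page 1 Advertisements", "FOR SALE\ta quantity of seed oats."),
]

lines = [to_instance_line(*record) for record in records]
buffer = io.StringIO("\n".join(lines) + "\n")

# Reading back: every line holds exactly one instance with three fields.
rows = list(csv.reader(buffer, delimiter="\t", quoting=csv.QUOTE_NONE))
for row in rows:
    print(row)
```

**Collapsing embedded tabs and newlines matters because a stray line break would otherwise be read as the start of a new instance.**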

**Check contents:**

In [2]:
path = r'../data/dataset/sample/train/train.csv'
pd.read_table(path, header=None, nrows=5).head()

Unnamed: 0,0,1,2
0,1854215,Page 1 Advertisements Column 1,v-/ .ADVERTISEMENTS. •- I Advertisements will ...
1,1854233,Page 1 Advertisements Column 2,"TVT-OTR PAPER, Bill Paper, Envelopes _LV Memor..."
2,1854245,Page 1 Advertisements Column 1,NOTICE.—Tim Newspaper may bs sent Free by Post...
3,1854253,THE TRUST DEED.,THE TRUST DEED.This Deed- made the 13th day of...
4,1854264,Page 2 Advertisements Column 1,NOTICE is hereby given that in case the follow...


**We do not think of the number of topics as a natural characteristic of a corpus. Real documents are not actually generated as combinations of multinomial distributions, so there is no "right" number of topics. Instead, we think of the number of topics as the scale of a map of the corpus: for a broad overview we use a small number of topics, and for more detail we use a larger one. The right number is the one that produces meaningful results and lets us accomplish our goal.**

**A wide range of values would work for us; here we train a topic model with 200 topics.**

**Many metrics and tools could help us tune the number of topics quantitatively, such as [ldatuning](https://cran.r-project.org/web/packages/ldatuning/vignettes/topics.html) and [topic coherence](https://datascienceplus.com/evaluation-of-topic-modeling-topic-coherence/). We leave this evaluation as future work.**
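**To make the idea of topic coherence concrete, here is a minimal sketch of a UMass-style coherence score, which rewards topics whose top words tend to occur in the same documents. It is not part of our pipeline; the function name `umass_coherence`, the smoothing value and the toy documents are assumptions chosen for illustration.**

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents, epsilon=1.0):
    """Sum of log((D(w_i, w_j) + epsilon) / D(w_j)) over ordered pairs of top words."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    score = 0.0
    # top_words is ordered from most to least probable, so in every pair
    # w_j is the higher-ranked word and w_i the lower-ranked one.
    for w_j, w_i in combinations(top_words, 2):
        denominator = doc_freq(w_j)
        if denominator == 0:
            continue  # w_j never occurs in the reference documents; skip the pair
        score += math.log((doc_freq(w_i, w_j) + epsilon) / denominator)
    return score


# Toy tokenized documents (made up).
docs = [
    ["ship", "harbour", "cargo"],
    ["ship", "harbour", "captain"],
    ["ship", "cargo", "wool"],
    ["sheep", "wool", "sale"],
]

coherent_score = umass_coherence(["ship", "harbour", "cargo"], docs)
mixed_score = umass_coherence(["ship", "sheep", "captain"], docs)
print(coherent_score, mixed_score)
```

**Scores closer to zero indicate top words that co-occur more often. Comparing the average score of models trained with different topic numbers is one possible way to pick a value.**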

In [3]:
%%capture capt
%%time
%%bash
#! /bin/bash

bash ./model.sh -i '../data/dataset/sample/train/train.csv' -o '../models/train/' -p 'train'

InputFile=../data/dataset/sample/train/train.csv
OutputDir=../models/train/
Process=train
CORES=6
SEED1=1
SEED2=1
TOPICS=200
ITERATION=2000
INTERVAL=40
BURNIN=300
IDFMIN=0.1
IDFMAX=10
23:45:31 :: Start import dataset...
Import new data for training.
23:48:44 :: Imported.
23:48:44 :: Start prune model...
23:49:54 :: Pruned.
23:49:54 :: Start training dataset...
04:02:55 :: Trained.


Training portion = 1.0
Validation portion = 0.0
Testing portion = 0.0
Prune info gain = 0
Prune count = 0
Prune df = 0
idf range = 0.1-10.0
1000
2000
...
138000
13

CPU times: user 252 ms, sys: 188 ms, total: 440 ms
Wall time: 4h 17min 24s


In [4]:
# Write the training log to a file so that MALLET's very long log is not printed in the notebook.
with open('../models/train/log.txt', 'w') as f:
    f.write(capt.stdout)
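**The settings echoed at the top of the training output (`TOPICS`, `ITERATION`, `INTERVAL`, `BURNIN`, `CORES`, `SEED2`) look like they feed MALLET's `train-topics` command. `model.sh` is not shown in this notebook, so the mapping below is an assumption rather than the script's actual contents; the `train.mallet` input name and the `build_train_command` helper are hypothetical. The sketch only builds and prints the command; it does not run MALLET.**

```python
import shlex

# Values copied from the settings echoed in the training output.
settings = {
    "TOPICS": 200,
    "ITERATION": 2000,
    "INTERVAL": 40,
    "BURNIN": 300,
    "CORES": 6,
    "SEED2": 1,
}


def build_train_command(mallet_input, output_dir, s):
    """Assemble a `mallet train-topics` argument list (assumed mapping)."""
    return [
        "mallet", "train-topics",
        "--input", mallet_input,
        "--num-topics", str(s["TOPICS"]),
        "--num-iterations", str(s["ITERATION"]),
        # Assumption: INTERVAL and BURNIN control hyperparameter optimization.
        "--optimize-interval", str(s["INTERVAL"]),
        "--optimize-burn-in", str(s["BURNIN"]),
        "--num-threads", str(s["CORES"]),
        # Assumption: SEED2 is the training seed (SEED1 may belong to another step).
        "--random-seed", str(s["SEED2"]),
        "--output-topic-keys", f"{output_dir}topicKeys.txt",
        "--inferencer-filename", f"{output_dir}inferencer.model",
        "--output-state", f"{output_dir}stat.gz",
        "--diagnostics-file", f"{output_dir}diagnostics.xml",
    ]


# `train.mallet` is a hypothetical name for the imported, pruned instance file.
command = build_train_command("../models/train/train.mallet", "../models/train/", settings)
print(shlex.join(command))
```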

**The output files are:**
* topic keywords in `topicKeys.txt`
* topic distribution per document in `topicKeys.txt`
* topic inferencer, used to infer the subsets, in `inferencer.model`
* the corpus with the topic assigned to each word in `stat.gz`
* statistical information in `diagnostics.xml`
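**As a hedged example of reading these results back, the sketch below parses topic keywords in the layout MALLET normally writes for topic keys: topic id, topic weight, then space-separated words, with tabs between the three parts. The sample text is made up and the helper `parse_topic_keys` is hypothetical; the real `topicKeys.txt` should be checked before relying on this layout.**

```python
def parse_topic_keys(text):
    """Parse lines of the form '<topic id>\\t<weight>\\t<word word ...>'."""
    topics = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        topic_id, weight, words = line.split("\t", 2)
        topics[int(topic_id)] = {"weight": float(weight), "words": words.split()}
    return topics


# Made-up sample in the assumed layout, not real model output.
sample = (
    "0\t0.05\tship harbour cargo captain\n"
    "1\t0.12\twool sheep sale market\n"
)

topic_keys = parse_topic_keys(sample)
for topic_id, info in sorted(topic_keys.items()):
    print(topic_id, info["weight"], " ".join(info["words"][:3]))
```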

## 2 Inferring Subset

**Besides analyzing and visualizing the topic model of the training dataset, we can extract subsets from the training dataset to focus on specific aspects or features, based on typical application scenarios. We run the trained inferencer on each subset to get a doc-topic matrix, which we then use to analyze and visualize the topics.**
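**To show what a doc-topic matrix looks like in practice, the sketch below parses inferencer output in the dense layout that recent MALLET versions write (document index, document name, then one proportion per topic, tab-separated) and picks the dominant topic of each document. The sample text, the three-topic size and the helper names are assumptions for illustration; the name of the output file produced by `model.sh` is not stated here.**

```python
def parse_doc_topics(text):
    """Return (document names, matrix) where matrix[d][k] is the weight of topic k in document d."""
    names, matrix = [], []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and an optional header line
        fields = line.split("\t")
        names.append(fields[1])
        matrix.append([float(value) for value in fields[2:]])
    return names, matrix


def dominant_topics(matrix):
    """Index of the highest-weighted topic for each document."""
    return [max(range(len(row)), key=row.__getitem__) for row in matrix]


# Made-up sample with three topics, not real inference output.
sample = (
    "0\tdoc-a\t0.70\t0.20\t0.10\n"
    "1\tdoc-b\t0.05\t0.15\t0.80\n"
)

names, matrix = parse_doc_topics(sample)
for name, topic in zip(names, dominant_topics(matrix)):
    print(name, topic)
```

**Each row of the matrix sums to roughly 1, so rows can be compared directly across subsets such as time ranges, regions or labels.**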

### 2.1 By Range of Time

**Check contents:**

In [5]:
path = r'../data/dataset/sample/subset/wwi/wwi.csv'
pd.read_table(path, header=None, nrows=5).head()

Unnamed: 0,0,1,2
0,3029448,Committee on Social Questions.,Committee on Social Questions.Archdeacon Willi...
1,3031078,"Napier In April, 1859.","Napier In April, 1859.The new town of Napier i..."
2,3031487,A List of Inexpensive Books.,A List of Inexpensive Books.We give month by m...
3,3031919,Poverty Bay Clerical association.,Poverty Bay Clerical association.For the last ...
4,3034547,Rotorua.,"Rotorua.Vicar: Yen. Archdeacon Tisdall, M.A.y...."


**Inferring:**

In [6]:
%%capture capt
%%time
%%bash -s $path
#! /bin/bash

bash ./model.sh -i $1 -o '../models/wwi/' -p 'infer'

In [7]:
# Write the inference log to a file so that MALLET's very long log is not printed in the notebook.
with open('../models/wwi/log.txt', 'w') as f:
    f.write(capt.stdout)

### 2.2 By Region

**Check contents:**

In [8]:
path = r'../data/dataset/sample/subset/regions/regions.csv'
pd.read_table(path, header=None, nrows=5).head()

Unnamed: 0,0,1,2
0,1854215,Page 1 Advertisements Column 1,v-/ .ADVERTISEMENTS. •- I Advertisements will ...
1,1854233,Page 1 Advertisements Column 2,"TVT-OTR PAPER, Bill Paper, Envelopes _LV Memor..."
2,1854245,Page 1 Advertisements Column 1,NOTICE.—Tim Newspaper may bs sent Free by Post...
3,1854253,THE TRUST DEED.,THE TRUST DEED.This Deed- made the 13th day of...
4,1854264,Page 2 Advertisements Column 1,NOTICE is hereby given that in case the follow...


**Inferring:**

In [9]:
%%capture capt
%%time
%%bash -s $path
#! /bin/bash

bash ./model.sh -i $1 -o '../models/regions/' -p 'infer'

In [10]:
# Write the inference log to a file so that MALLET's very long log is not printed in the notebook.
with open('../models/regions/log.txt', 'w') as f:
    f.write(capt.stdout)

### 2.3 By Label

**Check contents:**

In [11]:
path = r'../data/dataset/sample/subset/ads/ads.csv'
pd.read_table(path, header=None, nrows=5).head()

Unnamed: 0,0,1,2
0,1854215,Page 1 Advertisements Column 1,v-/ .ADVERTISEMENTS. •- I Advertisements will ...
1,1854233,Page 1 Advertisements Column 2,"TVT-OTR PAPER, Bill Paper, Envelopes _LV Memor..."
2,1854245,Page 1 Advertisements Column 1,NOTICE.—Tim Newspaper may bs sent Free by Post...
3,1854264,Page 2 Advertisements Column 1,NOTICE is hereby given that in case the follow...
4,1854344,Page 4 Advertisements Column 1,"JAMES COUPLAND, V f 'l SETTLER'S HOME AND \\j\..."


**Inferring:**

In [12]:
%%capture capt
%%time
%%bash -s $path
#! /bin/bash

bash ./model.sh -i $1 -o '../models/ads/' -p 'infer'

In [13]:
# Write the inference log to a file so that MALLET's very long log is not printed in the notebook.
with open('../models/ads/log.txt', 'w') as f:
    f.write(capt.stdout)

---

In [14]:
gc.collect()

0