# Part 5.2 - Topic Modeling
---
### Papers Past Topic Modeling
<br/>

Ben Faulks - bmf43@uclive.ac.nz

Xiandong Cai - xca24@uclive.ac.nz

Yujie Cui - ycu23@uclive.ac.nz

In [1]:
import gc
import pandas as pd
pd.set_option('display.max_columns', 120)
pd.set_option('display.max_colwidth', 120)

**In this part, we will perform the following operations:**

1. use MALLET to train a topic model on the training set and write the result files;
1. infer topics for the subsets and write the result files.

## 1 Training Topic model

**MALLET accepts input either as one instance per file or as a single file with one instance per line. For us, the only practical choice is the single file with one instance per line, so we prepared the .csv file for training in Part 5.1.**
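**As a minimal sketch of the one-instance-per-line layout, the snippet below builds a single tab-separated line per document. The three fields (id, title, text) are an assumption read off the preview further down, and `to_instance_line` is a hypothetical helper, not the code used in Part 5.1.**

```python
def to_instance_line(doc_id, title, text):
    """Return one tab-separated line for one document (hypothetical helper)."""
    # Collapse tabs, newlines and repeated spaces, since a stray tab or
    # line break inside a field would split the instance across columns or lines.
    def clean(value):
        return " ".join(str(value).split())

    return "\t".join([str(doc_id), clean(title), clean(text)])


# Toy input, not taken from the dataset.
line = to_instance_line(1, "Toy title", "first\ttoy\ndocument")
print(repr(line))
```

**Each document then occupies exactly one line, which is what lets a single file hold the whole training set.**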

**Check contents:**

In [None]:
path = r'../data/dataset/sample/train/train.csv'
pd.read_table(path, header=None, nrows=5).head()

Unnamed: 0,0,1,2
0,1854215,Page 1 Advertisements Column 1,"v-/ .ADVERTISEMENTS. •- I Advertisements will he inserted in the y \Gazette\"" at the nominal rate of Threepence for ..."
1,1854233,Page 1 Advertisements Column 2,"TVT-OTR PAPER, Bill Paper, Envelopes _LV Memorandum Books, Pens, Ink, &c., on sale at the \ Gazette Office.\"" O-OPE ..."
2,1854245,Page 1 Advertisements Column 1,"NOTICE.—Tim Newspaper may bs sent Free by Post(roithin Seven days of date,) to any part of Great Britain, New Zealan..."
3,1854253,THE TRUST DEED.,"THE TRUST DEED.This Deed- made the 13th day of March in the year 1863, between William Rawson Brame of Auckland New ..."
4,1854264,Page 2 Advertisements Column 1,NOTICE is hereby given that in case the following persons neglectto fulfill1 the Conditions on which all allotments ...


**We do not think of the number of topics as a natural characteristic of corpora. Real corpora are not really mixtures of multinomial distributions, so there is no "right" topic number. Instead, we think of the number of topics as the scale of a map of the corpus: for a broad overview we use a small topic number, and for more detail we use a larger one. The right number is the value that produces meaningful results that allow us to accomplish our goal.**

**There is a wide range of good values for us. Here we train on the dataset to get a topic model with 200 topics.**

**Many metrics and tools could help us tune the topic number quantitatively, such as [ldatuning](https://cran.r-project.org/web/packages/ldatuning/vignettes/topics.html) and [topic coherence](https://datascienceplus.com/evaluation-of-topic-modeling-topic-coherence/). We leave this evaluation as future work.**
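**To illustrate what a coherence metric measures, here is a minimal sketch of a UMass-style coherence score computed from document co-occurrence counts. It is not part of this notebook's pipeline. The function name `umass_coherence` and the toy documents are made up for illustration, and a real evaluation would use the top words of each trained topic and the full corpus.**

```python
import math


def umass_coherence(top_words, documents):
    """UMass-style coherence of one topic.

    top_words: topic words ordered from most to least probable.
    documents: iterable of token lists.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            d_l = doc_freq(top_words[l])
            if d_l == 0:
                continue  # skip words that never occur in the corpus
            d_ml = doc_freq(top_words[m], top_words[l])
            score += math.log((d_ml + 1) / d_l)
    return score


# Toy corpus, not taken from the dataset.
toy_docs = [
    ["war", "army", "troops"],
    ["war", "army"],
    ["sale", "price", "land"],
    ["land", "sale"],
]
coherent = umass_coherence(["war", "army"], toy_docs)
mixed = umass_coherence(["war", "sale"], toy_docs)
print(coherent > mixed)
```

**Words that tend to appear in the same documents score higher than words that never co-occur. Averaging such a score over all topics for several topic numbers is one way to compare candidate values.**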

In [None]:
%%time
%%bash
#! /bin/bash

bash ./model.sh -i '../data/dataset/sample/train/train.csv' -o '../models/train/' -p 'train'

#%%capture capt

In [None]:
# Write the training log to a file so that MALLET's very long log is not printed in the notebook.
#with open('../models/train/log.txt', 'w') as f:
#    f.write(capt.stdout)
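**The contents of `model.sh` are not shown in this notebook. As a hedged sketch only, the helpers below assemble the standard MALLET command lines that such a wrapper might run for training and for inference. The helper names, the `train.mallet` and `wwi.mallet` instance files, and the mapping of each option to an output file name are assumptions. The commands are only built here, not executed.**

```python
from pathlib import Path


def mallet_train_command(instances, out_dir, num_topics=200):
    """Build a `mallet train-topics` argument list (hypothetical helper)."""
    out = Path(out_dir)
    return [
        "mallet", "train-topics",
        "--input", str(instances),
        "--num-topics", str(num_topics),
        "--output-topic-keys", (out / "topicKeys.txt").as_posix(),
        "--inferencer-filename", (out / "inferencer.model").as_posix(),
        "--output-state", (out / "stat.gz").as_posix(),
        "--diagnostics-file", (out / "diagnostics.xml").as_posix(),
    ]


def mallet_infer_command(instances, inferencer, doc_topics_out):
    """Build a `mallet infer-topics` argument list (hypothetical helper)."""
    return [
        "mallet", "infer-topics",
        "--input", str(instances),
        "--inferencer", str(inferencer),
        "--output-doc-topics", str(doc_topics_out),
    ]


# Assumed file names; the .mallet files would come from `mallet import-file`.
train_cmd = mallet_train_command("train.mallet", "../models/train")
infer_cmd = mallet_infer_command(
    "wwi.mallet", "../models/train/inferencer.model", "../models/wwi/docTopics.txt"
)
print(" ".join(train_cmd))
```

**Such a list could be passed to `subprocess.run` with `capture_output=True`, which also offers another way to keep the long MALLET log out of the notebook.**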

**The output files are:**

1. `topicKeys.txt`: the top words of each topic;
1. `docTopics.txt`: the topic distribution of each document;
1. `inferencer.model`: the topic inferencer used to infer topics for the subsets;
1. `stat.gz`: the corpus tokens with the topic each one is assigned to;
1. `diagnostics.xml`: diagnostic statistics.
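**As a minimal sketch of how the topic words file could be read, the snippet below parses two made-up lines. It assumes the usual MALLET topic-keys layout of three tab-separated fields per line (topic id, Dirichlet weight, space-separated top words). The sample values are invented, and the column names are chosen here for illustration.**

```python
import io

import pandas as pd

# Made-up sample in the assumed layout: topic id, weight, top words.
sample_keys = (
    "0\t0.025\twar army troops german british\n"
    "1\t0.031\tsale land price auction lots\n"
)

topic_keys = pd.read_csv(
    io.StringIO(sample_keys),  # replace with the real file path
    sep="\t",
    header=None,
    names=["topic", "weight", "words"],
)
# Turn the space-separated string into a list of words per topic.
topic_keys["words"] = topic_keys["words"].str.split()
print(topic_keys)
```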

## 2 Inferring Subset

**Besides analyzing and visualizing the topic model of the training dataset, we can extract several subsets from the training dataset, based on typical application scenarios, to focus on specific points or features. We run each subset through the inferencer to get a document-topic matrix, which we then use to analyze and visualize topics.**
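**To make the document-topic matrix concrete, here is a minimal sketch that loads one and finds the dominant topic of each document. It assumes the dense output layout of recent MALLET versions (document index, document name, then one proportion per topic, tab-separated). Older versions write sorted topic/proportion pairs instead, which would need different parsing. The sample rows use three topics and invented proportions.**

```python
import io

import pandas as pd

# Made-up sample: doc index, doc name, then one proportion per topic.
sample_doc_topics = (
    "0\t1854215\t0.70\t0.20\t0.10\n"
    "1\t1854233\t0.05\t0.15\t0.80\n"
)

raw = pd.read_csv(io.StringIO(sample_doc_topics), sep="\t", header=None)

# Rows are documents, columns are topics 0..K-1.
doc_topics = pd.DataFrame(
    raw.iloc[:, 2:].to_numpy(),
    index=raw[1].rename("doc_id"),
)

# Topic with the largest proportion in each document.
dominant = doc_topics.idxmax(axis=1)
print(dominant)
```

**The same matrix can be averaged over the documents of a subset to compare topic weights between subsets.**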

### 2.1 By Range of Time

**Check contents:**

In [None]:
path = r'../data/dataset/sample/subset/wwi/wwi.csv'
pd.read_table(path, header=None, nrows=5).head()

**Inferring:**

In [None]:
%%time
%%bash -s $path
#! /bin/bash

bash ./model.sh -i $1 -o '../models/wwi/' -p 'infer'

#%%capture capt

In [None]:
# Write the inference log to a file so that MALLET's very long log is not printed in the notebook.
#with open('../models/wwi/log.txt', 'w') as f:
#    f.write(capt.stdout)

### 2.2 By Region

**Check contents:**

In [None]:
path = r'../data/dataset/sample/subset/regions/regions.csv'
pd.read_table(path, header=None, nrows=5).head()

**Inferring:**

In [None]:
%%time
%%bash -s $path
#! /bin/bash

bash ./model.sh -i $1 -o '../models/regions/' -p 'infer'

#%%capture capt

In [None]:
# Write the inference log to a file so that MALLET's very long log is not printed in the notebook.
#with open('../models/regions/log.txt', 'w') as f:
#    f.write(capt.stdout)

### 2.3 By Label

**Check contents:**

In [None]:
path = r'../data/dataset/sample/subset/ads/ads.csv'
pd.read_table(path, header=None, nrows=5).head()

**Inferring:**

In [None]:
%%time
%%bash -s $path
#! /bin/bash

bash ./model.sh -i $1 -o '../models/ads/' -p 'infer'

#%%capture capt

In [None]:
# Write the inference log to a file so that MALLET's very long log is not printed in the notebook.
#with open('../models/ads/log.txt', 'w') as f:
#    f.write(capt.stdout)

---

In [None]:
gc.collect()