# Part 5 - Topic Modeling
---
### Papers Past Topic Modeling
<br/>

Ben Faulks - bmf43@uclive.ac.nz

Xiandong Cai - xca24@uclive.ac.nz

Yujie Cui - ycu23@uclive.ac.nz

In [1]:
import os, sys, subprocess
sys.path.insert(0, '../utils')  # make the custom modules importable
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import *
from utils_load import conf_pyspark, load_dataset

# initiate PySpark
sc, spark = conf_pyspark()

sc

[('spark.app.id', 'local-1547904364416'),
 ('spark.app.name', 'local'),
 ('spark.rdd.compress', 'True'),
 ('spark.serializer.objectStreamReset', '100'),
 ('spark.driver.host', '192.168.1.207'),
 ('spark.driver.memory', '62g'),
 ('spark.master', 'local[*]'),
 ('spark.executor.id', 'driver'),
 ('spark.driver.port', '38757'),
 ('spark.submit.deployMode', 'client'),
 ('spark.ui.showConsoleProgress', 'true'),
 ('spark.driver.cores', '6'),
 ('spark.driver.maxResultSize', '4g')]


**In this part, we perform the following operations:**

1. train a topic model on the full dataset with MALLET, which gives a topic model and its topic words;
1. split off several subsets: at random, by time range, by region, and by advertisements;
1. infer each subset from the topic model of the full dataset, which gives a doc-topic matrix.

## 1 Load Data

**MALLET accepts either one instance per file or a single file with one instance per line. Only the second option is practical for us, so we need to merge the** `*.csv.gz` **files into one** `.csv` **file.**

In [2]:
%%bash

cat ../data/train/*.csv.gz > ../data/train/train.csv.gz

gunzip ../data/train/train.csv.gz
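
The same merge-and-decompress step could also be sketched in Python. This is only an illustrative sketch: the helper name `merge_gzip_parts` is hypothetical, and the demo runs on throwaway temporary files rather than on the real `../data/train` directory.

```python
import glob
import gzip
import os
import shutil
import tempfile


def merge_gzip_parts(part_dir, out_path):
    """Decompress every *.csv.gz part in part_dir into one plain-text file."""
    parts = sorted(glob.glob(os.path.join(part_dir, '*.csv.gz')))
    with open(out_path, 'wb') as out:
        for part in parts:
            with gzip.open(part, 'rb') as src:
                shutil.copyfileobj(src, out)
    return len(parts)


# Demo on temporary files (made-up rows, not the Papers Past corpus).
tmp = tempfile.mkdtemp()
for i, text in enumerate(['1\tTitle A\tBody A\n', '2\tTitle B\tBody B\n']):
    part_path = os.path.join(tmp, 'part-{}.csv.gz'.format(i))
    with gzip.open(part_path, 'wt', encoding='utf-8') as f:
        f.write(text)

merged_path = os.path.join(tmp, 'merged.csv')
n_parts = merge_gzip_parts(tmp, merged_path)
with open(merged_path, encoding='utf-8') as f:
    merged_lines = f.read().splitlines()
print(n_parts, merged_lines)
```

Writing the merged output under a name that does not match `*.csv.gz` keeps it out of the glob if the step is rerun.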

**Check the number of lines (rows/samples/documents) in the dataset:**

In [3]:
%%bash

wc -l ../data/train/train.csv

160140 ../data/train/train.csv


**Check contents:**

In [4]:
pd.read_table('../data/train/train.csv', header=None, nrows=5).head()

Unnamed: 0,0,1,2
0,1854232,Page 1 Advertisements Column 1,NOTICE.—This Ne?vspaper may b? sent Free by Po...
1,1854244,Page 4 Advertisements Column 1,"T7JOUND, a set of Pekoe Straps, The JJ owner m..."
2,1854262,THE CHRISTIAN CHURCH.,THE CHRISTIAN CHURCH.We have heard of an objec...
3,1854275,Page 1 Advertisements Column 2,"NOTE PAPER, Bill Paper, Envelopes Memorandum B..."
4,1854588,THE EASTERN CRISIS.,THE EASTERN CRISIS.[reuieu's telegrams— copyri...
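
Because each line must hold exactly one instance, it can be worth checking that every line splits into the three tab-separated fields shown above (id, title, content). The sketch below assumes that layout; the function name `check_instance_lines` and the broken sample row are hypothetical.

```python
def check_instance_lines(lines, n_fields=3):
    """Return (line_number, field_count) for lines without n_fields tab-separated fields."""
    bad = []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip('\n').split('\t')
        if len(fields) != n_fields:
            bad.append((number, len(fields)))
    return bad


sample = [
    '1854262\tTHE CHRISTIAN CHURCH.\tTHE CHRISTIAN CHURCH.We have heard...\n',
    '1854588\tTHE EASTERN CRISIS.\n',  # hypothetical broken row: content field missing
]
problems = check_instance_lines(sample)
print(problems)  # [(2, 2)]
```

On the real file, the same function could be fed an open file handle, for example `check_instance_lines(open('../data/train/train.csv', encoding='utf-8'))`.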


## 2 Training Full Dataset

**We do not think of the number of topics as a natural characteristic of a corpus. Documents are not really combinations of multinomial distributions, so there is no "right" number of topics. We think of the number of topics as the scale of a map of the corpus: for a broad overview we use a small number of topics, and for more detail we use a larger one. The right number is the value that produces meaningful results and lets us accomplish our goal.**

**A wide range of values works well for us; here we train on the dataset to get a topic model with 500 topics.**

**Many metrics and tools can help tune the number of topics quantitatively, such as [ldatuning](https://cran.r-project.org/web/packages/ldatuning/vignettes/topics.html) and [topic coherence](https://datascienceplus.com/evaluation-of-topic-modeling-topic-coherence/). We leave this evaluation as future work.**
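
To give a feel for what a coherence score measures, here is a minimal sketch of UMass-style coherence, which rewards topic words that tend to appear in the same documents. It is not part of this notebook's pipeline: the function name `umass_coherence` and the toy corpus are assumptions made purely for illustration.

```python
import math


def umass_coherence(top_words, documents):
    """Sum of log((D(w_m, w_l) + 1) / D(w_l)) over ordered word pairs (l < m)."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing all the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            denom = doc_freq(top_words[l])
            if denom == 0:
                continue  # word never occurs in the reference corpus
            score += math.log((doc_freq(top_words[m], top_words[l]) + 1) / denom)
    return score


# Toy corpus (made up): two "war" documents and two "wool" documents.
docs = [
    ['war', 'troops', 'front'],
    ['war', 'troops', 'germany'],
    ['wool', 'sheep', 'market'],
    ['wool', 'market', 'price'],
]
coherent = umass_coherence(['war', 'troops'], docs)
mixed = umass_coherence(['war', 'wool'], docs)
print(coherent > mixed)  # True: co-occurring words score higher
```

Averaging such a score over all topics for several candidate topic numbers is one way a quantitative comparison could be set up.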

In [5]:
%%capture capt
%%time
%%bash
#! /bin/bash

bash ./model.sh -i '../data/train/train.csv' -o './model_train' -p 'train';

In [6]:
# Write the training log to a file so that MALLET's very long log is not printed in the notebook.
with open('./model_train/train.log', 'w') as f:
    f.write(capt.stdout)

**The output files are:**
* topic words in 'topicKeys.txt'
* topic distribution per document in 'topicKeys.txt'
* topic inferencer for inferring subsets in 'inferencer.model'
* corpus that topics belong to in 'stat.gz'
* statistical information in 'diagnostics.xml'
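
As an illustration of how the topic words could be read back, the sketch below parses lines in the layout MALLET typically uses for topic keys: topic id, a weight, and the top words, separated by tabs. That layout is an assumption about the file written by `model.sh`, and both the function name `parse_topic_keys` and the sample lines are hypothetical.

```python
def parse_topic_keys(lines):
    """Map topic id -> (weight, list of top words) from tab-separated lines."""
    topics = {}
    for line in lines:
        line = line.rstrip('\n')
        if not line:
            continue
        topic_id, weight, words = line.split('\t', 2)
        topics[int(topic_id)] = (float(weight), words.split())
    return topics


# Made-up sample lines in the assumed "id <TAB> weight <TAB> words" layout.
sample = [
    '0\t0.05\twar troops german front\n',
    '1\t0.12\twool sheep market price\n',
]
topics = parse_topic_keys(sample)
print(topics[1][1][:2])  # ['wool', 'sheep']
```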

## 3 Subset

**Besides analyzing and visualizing the topic model of the full dataset, we can extract several subsets for typical application scenarios and focus the analysis on specific aspects.**

**First, load the clean dataset and check its dimensions:**

In [7]:
#df = load_dataset('dataset', spark)
df = load_dataset('dev', spark) # for development

print('Shape of dataframe: ({}, {})'.format(df.count(), len(df.columns)))
df.sample(False, 0.0001).limit(10).show()

Shape of dataframe: (160140, 7)
+--------+--------------------+-----------------+----------+-----+--------------------+--------------------+
|      id|           publisher|           region|      date|  ads|               title|             content|
+--------+--------------------+-----------------+----------+-----+--------------------+--------------------+
| 2272302|Hawera & Normanby...|         Taranaki|1883-09-26|false|THE EXPLOSIONS AT...|b'THE EXPLOSIONS ...|
| 8600214|     Taranaki Herald|         Taranaki|1886-07-01|false|GENERAL ASSEMBLY ...|GENERAL ASSEMBLY ...|
|11784635|   Otago Daily Times|            Otago|1882-12-05| true|Page 2 Advertisem...|b'Funeral Notices...|
|16100759|  Poverty Bay Herald|         Gisborne|1914-12-29|false|GERMANY'S NICKEL ...|GERMANY'S NICKEL ...|
|17702290|        Evening Post|       Wellington|1940-08-28|false|      MAKING OF ARMS|MAKING OF ARMSEXP...|
|18027194|        Evening Post|       Wellington|1926-03-29|false|     GOOD CRICKETERS|GOOD CRIC

### 3.1 By Range of Time

**For instance, we are interested in the topics covered by the papers during WWI, so we study the topic model around that period. According to Wikipedia, the war lasted from 28/7/1914 to 11/11/1918; we widen the window to 1912-1921 to analyze and visualize the topics of that time.**

**Decide start date and end date to sample:**

In [8]:
START = '1912-01-01'
END = '1921-12-31'

**Filter samples between start and end date, remove advertisements, and generate the subset - wwi:**

In [9]:
# Remove advertisements and keep rows dated from START to END (inclusive).
df_sub = (
    df.filter((df['ads'] == False) & (df['date'] >= START) & (df['date'] <= END))
)
print('Shape of dataframe: ({}, {})'.format(df_sub.count(), len(df_sub.columns)))

Shape of dataframe: (29789, 7)
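
The filter keeps a row only if it is not an advertisement and its date falls inside the window, with both end dates included. The plain-Python sketch below mirrors that logic on a handful of made-up rows; it does not use Spark, and the row values are hypothetical.

```python
from datetime import date

START = date(1912, 1, 1)
END = date(1921, 12, 31)

# Made-up rows that probe both boundaries and the ads flag.
rows = [
    {'id': 1, 'date': date(1911, 12, 31), 'ads': False},  # one day too early
    {'id': 2, 'date': date(1912, 1, 1), 'ads': False},    # first day, kept
    {'id': 3, 'date': date(1918, 11, 11), 'ads': True},   # in range, but an ad
    {'id': 4, 'date': date(1921, 12, 31), 'ads': False},  # last day, kept
    {'id': 5, 'date': date(1922, 1, 1), 'ads': False},    # one day too late
]

kept = [r['id'] for r in rows if not r['ads'] and START <= r['date'] <= END]
print(kept)  # [2, 4]
```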


**Check that the date range of the subset is correct:**

In [10]:
(df_sub.select(F.max(F.col('date')).alias('MAX')).limit(1).collect()[0].MAX, 
 df_sub.select(F.min(F.col('date')).alias('MIN')).limit(1).collect()[0].MIN)

(datetime.date(1921, 12, 31), datetime.date(1912, 1, 2))

**Generate subset to infer:**

In [11]:
df_sub = df_sub.select(F.col('id'), F.col('title'), F.col('content')).orderBy('id')

print('Shape of dataframe: ({}, {})'.format(df_sub.count(), len(df_sub.columns)))

Shape of dataframe: (29789, 3)


**Save subset:**

In [12]:
subset_path = r'../data/subset/wwi'

df_sub.write.csv(subset_path, sep='\t', mode='overwrite', compression='gzip')

print('Save subset to', subset_path)
print('subset size:', subprocess.check_output(['du','-sh', subset_path]).split()[0].decode('utf-8'))

Save subset to ../data/subset/wwi
subset size: 18M


### 3.2 By Region

**The full dataset covers 16 regions; we focus on the regions with the largest populations today (Auckland, Wellington, Canterbury and Otago).**

**Decide regions to sample:**

In [13]:
regions = ['Auckland', 'Wellington', 'Canterbury', 'Otago']

**Filter samples from the target regions and generate the subset - regions (advertisements are kept in this subset):**

In [14]:
df_sub = df.filter(F.col('region').isin(regions))

print('Shape of dataframe: ({}, {})'.format(df_sub.count(), len(df_sub.columns)))

Shape of dataframe: (78072, 7)


**Check that the regions in the subset are correct:**

In [15]:
df_sub.select(F.col('region')).distinct().show()

+----------+
|    region|
+----------+
|Wellington|
|  Auckland|
|     Otago|
|Canterbury|
+----------+



**Generate subset to infer:**

In [16]:
df_sub = df_sub.select(F.col('id'), F.col('title'), F.col('content')).orderBy('id')

print('Shape of dataframe: ({}, {})'.format(df_sub.count(), len(df_sub.columns)))

Shape of dataframe: (78072, 3)


**Save subset:**

In [17]:
subset_path = r'../data/subset/regions'

df_sub.write.csv(subset_path, sep='\t', mode='overwrite', compression='gzip')

print('Save subset to', subset_path)
print('subset size:', subprocess.check_output(['du','-sh', subset_path]).split()[0].decode('utf-8'))

Save subset to ../data/subset/regions
subset size: 77M


### 3.3 By Label

**The dataset has only one label (ads), which marks whether a sample/row/document/text is an advertisement. Advertisements carry less information than newspaper articles, but they are useful for analyzing daily life in earlier times. Advertisements account for 27.4% of the full dataset, so we extract them as a separate subset.**

**Filter samples of advertisements, and generate the subset - ads:**

In [18]:
# Keep only the advertisements.
df_sub = df.filter(F.col('ads') == True)

print('Shape of dataframe: ({}, {})'.format(df_sub.count(), len(df_sub.columns)))

Shape of dataframe: (44175, 7)
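
As a quick sanity check, the advertisement share can be recomputed from the two row counts printed in this notebook. The counts below come from the dataframe loaded here, which may not be the exact corpus behind the quoted percentage, so a small difference would not be surprising.

```python
ads_rows = 44175     # rows in the ads subset (printed above)
total_rows = 160140  # rows in the loaded dataframe (printed in section 3)

share = ads_rows / total_rows
print('{:.1%}'.format(share))  # 27.6%
```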


**Check labels in the subset are all "ads":**

In [19]:
df_sub.select(F.col('ads')).distinct().show()

+----+
| ads|
+----+
|true|
+----+



**Generate subset to infer:**

In [20]:
df_sub = df_sub.select(F.col('id'), F.col('title'), F.col('content')).orderBy('id')

print('Shape of dataframe: ({}, {})'.format(df_sub.count(), len(df_sub.columns)))

Shape of dataframe: (44175, 3)


**Save subset:**

In [21]:
subset_path = r'../data/subset/ads'

df_sub.write.csv(subset_path, sep='\t', mode='overwrite', compression='gzip')

print('Save subset to', subset_path)
print('subset size:', subprocess.check_output(['du','-sh', subset_path]).split()[0].decode('utf-8'))

Save subset to ../data/subset/ads
subset size: 55M


## 4 Inferring Subset

**We use the inferencer of the full-dataset model to infer each subset, which gives a doc-topic matrix for analyzing and visualizing topics.**
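
The contents of `model.sh` are not shown in this notebook, so the following is only a guess at the kind of MALLET call it might wrap for inference. The sketch builds an `infer-topics` command line without running it. The helper name `build_infer_command`, the output file name `docTopics.txt`, and the iteration count are assumptions; the instance list and inferencer paths are the ones that appear in the logs below.

```python
def build_infer_command(instances, inferencer, doc_topics_out, iterations=100, seed=1):
    """Build (but do not run) a MALLET infer-topics command as an argument list."""
    return [
        'mallet', 'infer-topics',
        '--input', instances,                  # MALLET instance list for the subset
        '--inferencer', inferencer,            # inferencer saved during training
        '--output-doc-topics', doc_topics_out, # where the doc-topic matrix is written
        '--num-iterations', str(iterations),
        '--random-seed', str(seed),
    ]


cmd = build_infer_command(
    './model_wwi/pruned.model',
    './model_train/inferencer.model',
    './model_wwi/docTopics.txt',  # hypothetical output name
)
print(' '.join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` on a machine where the `mallet` launcher is on the `PATH`.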

### 4.1 By Range of Time

**As with training on the full dataset, we merge the multiple compressed files into one** `*.csv` **file.**

In [22]:
%%bash

cat ../data/subset/wwi/*.csv.gz > ../data/subset/wwi/wwi.csv.gz

gunzip ../data/subset/wwi/wwi.csv.gz

**Check the number of lines (rows/samples/documents) in the dataset:**

In [23]:
%%bash

wc -l ../data/subset/wwi/wwi.csv

29789 ../data/subset/wwi/wwi.csv


**Check contents:**

In [24]:
pd.read_table('../data/subset/wwi/wwi.csv', header=None, nrows=5).head()

Unnamed: 0,0,1,2
0,3025974,Confirmations at Te Aute.,Confirmations at Te Aute.During the last quart...
1,3034188,Diocesan Notes.,Diocesan Notes.A recent letter received from t...
2,3042724,Untitled,We all want quiet ; we all want beauty for the...
3,3045343,Diocesan Paper.,Diocesan Paper.The following sums are acknow- ...
4,3050177,Rotorua.,"Rotorua.Vicar: Yen. Archdeacon Tisdall, M.A. C..."


**Inferring:**

In [25]:
%%capture capt
%%time
%%bash
#! /bin/bash

bash ./model.sh -i '../data/subset/wwi/wwi.csv' -o './model_wwi' -p 'infer';

InputFile=../data/subset/wwi/wwi.csv
OutputDir=./model_wwi
Process=infer
AllDir=./model_train
Inferencer=./model_train/inferencer.model
CORES=6
SEED1=1
SEED2=1
TOPICS=250
ITERATION=3000
INTERVAL=40
BURNIN=300
IDFMIN=1
IDFMAX=10
04:04:12 :: Start import dataset...
 Rewriting extended pipe from ./model_train/import.model
  Instance ID = 73e48a25-531d-4bd9-8b7e-acbfdaced195
Import new data for inferring.
04:04:41 :: Imported.
04:04:41 :: Start prune model...
04:04:53 :: Pruned.
04:04:53 :: Start infering dataset...
04:05:34 :: Inferred.


Training portion = 1.0
Validation portion = 0.0
Testing portion = 0.0
Prune info gain = 0
Prune count = 0
Prune df = 0
idf range = 1.0-10.0
1000
...
29000
features: 1414645 -> 261501
Writing instance list to ./model_wwi/pruned.model


CPU times: user 8 ms, sys: 8 ms, total: 16 ms
Wall time: 1min 21s


In [6]:
# Write the inference log to a file so that MALLET's very long log is not printed in the notebook.
with open('./model_wwi/wwi.log', 'w') as f:
    f.write(capt.stdout)

### 4.2 By Region

**Transform dataset files:**

In [26]:
%%bash

cat ../data/subset/regions/*.csv.gz > ../data/subset/regions/regions.csv.gz

gunzip ../data/subset/regions/regions.csv.gz

**Check the number of lines (rows/samples/documents) in the dataset:**

In [27]:
%%bash

wc -l ../data/subset/regions/regions.csv

78072 ../data/subset/regions/regions.csv


**Check contents:**

In [28]:
pd.read_table('../data/subset/regions/regions.csv', header=None, nrows=5).head()

Unnamed: 0,0,1,2
0,1854232,Page 1 Advertisements Column 1,NOTICE.—This Ne?vspaper may b? sent Free by Po...
1,1854244,Page 4 Advertisements Column 1,"T7JOUND, a set of Pekoe Straps, The JJ owner m..."
2,1854262,THE CHRISTIAN CHURCH.,THE CHRISTIAN CHURCH.We have heard of an objec...
3,1854275,Page 1 Advertisements Column 2,"NOTE PAPER, Bill Paper, Envelopes Memorandum B..."
4,1854588,THE EASTERN CRISIS.,THE EASTERN CRISIS.[reuieu's telegrams— copyri...


**Inferring:**

In [29]:
%%capture capt
%%time
%%bash
#! /bin/bash

bash ./model.sh -i '../data/subset/regions/regions.csv' -o './model_regions' -p 'infer';

InputFile=../data/subset/regions/regions.csv
OutputDir=./model_regions
Process=infer
AllDir=./model_train
Inferencer=./model_train/inferencer.model
CORES=6
SEED1=1
SEED2=1
TOPICS=250
ITERATION=3000
INTERVAL=40
BURNIN=300
IDFMIN=1
IDFMAX=10
04:05:35 :: Start import dataset...
 Rewriting extended pipe from ./model_train/import.model
  Instance ID = 73e48a25-531d-4bd9-8b7e-acbfdaced195
Import new data for inferring.
04:06:19 :: Imported.
04:06:19 :: Start prune model...
04:06:35 :: Pruned.
04:06:35 :: Start infering dataset...
04:09:21 :: Inferred.


Training portion = 1.0
Validation portion = 0.0
Testing portion = 0.0
Prune info gain = 0
Prune count = 0
Prune df = 0
idf range = 1.0-10.0
1000
...
78000
features: 1414645 -> 130373
Writing instance list to ./model_regions/pruned.model


CPU times: user 24 ms, sys: 8 ms, total: 32 ms
Wall time: 3min 46s


In [6]:
# Write the inference log to a file so that MALLET's very long log is not printed in the notebook.
with open('./model_regions/regions.log', 'w') as f:
    f.write(capt.stdout)

### 4.3 By Label

**Transform dataset files:**

In [30]:
%%bash

cat ../data/subset/ads/*.csv.gz > ../data/subset/ads/ads.csv.gz

gunzip ../data/subset/ads/ads.csv.gz

**Check the number of lines (rows/samples/documents) in the dataset:**

In [31]:
%%bash

wc -l ../data/subset/ads/ads.csv

44175 ../data/subset/ads/ads.csv


**Check contents:**

In [32]:
pd.read_table('../data/subset/ads/ads.csv', header=None, nrows=5).head()

Unnamed: 0,0,1,2
0,1854232,Page 1 Advertisements Column 1,NOTICE.—This Ne?vspaper may b? sent Free by Po...
1,1854244,Page 4 Advertisements Column 1,"T7JOUND, a set of Pekoe Straps, The JJ owner m..."
2,1854275,Page 1 Advertisements Column 2,"NOTE PAPER, Bill Paper, Envelopes Memorandum B..."
3,1855273,Page 3 Advertisements Column 7,"Business Notices. '-; NOTICE. DANEVIREE SASH, ..."
4,1855701,Page 4 Advertisements Column 7,NEEDHAM'S POLISHING PASTE. Used by Her Majesty...


**Inferring:**

In [33]:
%%capture capt
%%time
%%bash
#! /bin/bash

bash ./model.sh -i '../data/subset/ads/ads.csv' -o './model_ads' -p 'infer';

InputFile=../data/subset/ads/ads.csv
OutputDir=./model_ads
Process=infer
AllDir=./model_train
Inferencer=./model_train/inferencer.model
CORES=6
SEED1=1
SEED2=1
TOPICS=250
ITERATION=3000
INTERVAL=40
BURNIN=300
IDFMIN=1
IDFMAX=10
04:09:22 :: Start import dataset...
 Rewriting extended pipe from ./model_train/import.model
  Instance ID = 73e48a25-531d-4bd9-8b7e-acbfdaced195
Import new data for inferring.
04:10:01 :: Imported.
04:10:01 :: Start prune model...
04:10:14 :: Pruned.
04:10:14 :: Start infering dataset...
04:12:08 :: Inferred.


Training portion = 1.0
Validation portion = 0.0
Testing portion = 0.0
Prune info gain = 0
Prune count = 0
Prune df = 0
idf range = 1.0-10.0
1000
...
44000
features: 1414645 -> 171826
Writing instance list to ./model_ads/pruned.model


CPU times: user 16 ms, sys: 8 ms, total: 24 ms
Wall time: 2min 45s


In [6]:
# Write the inference log to a file so that MALLET's very long log is not printed in the notebook.
with open('./model_ads/ads.log', 'w') as f:
    f.write(capt.stdout)
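
Once inference has finished, the doc-topic output of each subset could be loaded for analysis. The sketch below assumes a dense tab-separated layout (document index, document name, then one proportion per topic), which is an assumption about what `model.sh` writes; the function names and the sample rows are hypothetical.

```python
def parse_doc_topics(lines):
    """Return (document names, matrix of topic proportions) from dense rows."""
    names, matrix = [], []
    for line in lines:
        if not line.strip() or line.startswith('#'):
            continue  # skip blank lines and an optional header line
        fields = line.rstrip('\n').split('\t')
        names.append(fields[1])
        matrix.append([float(x) for x in fields[2:]])
    return names, matrix


def dominant_topics(matrix):
    """Index of the highest-proportion topic for each document."""
    return [max(range(len(row)), key=row.__getitem__) for row in matrix]


# Made-up rows with three topics each.
sample = [
    '0\t3025974\t0.7\t0.2\t0.1\n',
    '1\t3034188\t0.1\t0.3\t0.6\n',
]
names, matrix = parse_doc_topics(sample)
print(names, dominant_topics(matrix))  # ['3025974', '3034188'] [0, 2]
```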

---