# Part 5 - Topic Modeling
---
### Papers Past Topic Modeling
<br/>

Ben Faulks - bmf43@uclive.ac.nz

Xiandong Cai - xca24@uclive.ac.nz

Yujie Cui - ycu23@uclive.ac.nz

In [1]:
import os, sys, subprocess
sys.path.insert(0, '../utils')  # make the custom utility modules importable
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import *
from utils_data import conf_pyspark, load_dataset

# initiate PySpark
sc, spark = conf_pyspark()

sc

[('spark.app.name', 'local'),
 ('spark.rdd.compress', 'True'),
 ('spark.app.id', 'local-1547785755768'),
 ('spark.driver.port', '33895'),
 ('spark.driver.host', '192.168.1.207'),
 ('spark.driver.memory', '62g'),
 ('spark.master', 'local[*]'),
 ('spark.executor.id', 'driver'),
 ('spark.submit.deployMode', 'client'),
 ('spark.serializer.objectStreamReset', '100'),
 ('spark.ui.showConsoleProgress', 'true'),
 ('spark.driver.cores', '6'),
 ('spark.driver.maxResultSize', '4g')]


**In this part, we will perform the following operations:**

1. train a topic model on the full dataset with MALLET, producing the topic model and its topic words;
1. split off several subsets: at random, by time range, by region, and by advertisement label;
1. run inference on each subset with the full-dataset topic model to obtain a doc-topic matrix.

## 1 Load Data

**MALLET accepts either one instance per file or a single file with one instance per line. For this dataset only the second form is practical, so we merge the** `*.csv.gz` **part files into a single** `.csv` **file.**

In [2]:
%%bash

cat ../data/train/*.csv.gz > ../data/train/dataset.csv.gz

gunzip ../data/train/dataset.csv.gz
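
The shell cell above works because concatenated gzip files still form a valid gzip stream. Where `cat` and `gunzip` are not available, a rough Python equivalent could look like the sketch below. The helper name `merge_gzip_parts` is hypothetical and not part of the project's `utils` modules.

```python
import glob
import gzip
import os
import shutil


def merge_gzip_parts(part_dir, out_path):
    """Decompress every *.csv.gz part in part_dir into one plain-text file.

    Returns the number of part files that were merged.
    """
    # Sort so the merged file always has the same part order.
    parts = sorted(glob.glob(os.path.join(part_dir, '*.csv.gz')))
    with open(out_path, 'wb') as out:
        for part in parts:
            with gzip.open(part, 'rb') as src:
                # Stream in chunks so large parts never sit fully in memory.
                shutil.copyfileobj(src, out)
    return len(parts)
```

For the training data this would be called as `merge_gzip_parts('../data/train', '../data/train/dataset.csv')`. Unlike the shell version, it leaves the compressed parts in place.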

**Check the number of lines (documents) in the dataset:**

In [3]:
%%bash

wc -l ../data/train/dataset.csv

16131646 ../data/train/dataset.csv


**Check contents:**

In [4]:
pd.read_table('../data/train/dataset.csv', header=None, nrows=5).head()

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1854213 | TO OUR HEADERS. | TO OUR HEADERS.; We have to apologize to our. ... |
| 1 | 1854214 | GOD REST THEE, WEARY TRAVELLER.\"\t\"GOD REST ... | |
| 2 | 1854215 | Page 1 Advertisements Column 1 | v-/ .ADVERTISEMENTS. •- I Advertisements will ... |
| 3 | 1854216 | Correspondence. | Correspondence.Ship \ MatildavWattenbacti;\" J... |
| 4 | 1854218 | General News. | General News.lV AMus£MENTS.--^Our record of sm... |
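
The preview is parsed by pandas, which honors quote characters, so it may not reflect how a line-oriented tool sees the file. As a rough sanity check of the assumed `id<TAB>title<TAB>content` layout, one could count fields per line with a plain tab split that ignores quoting. This is only a sketch, and `field_count_histogram` is a hypothetical helper.

```python
from collections import Counter


def field_count_histogram(lines, sep='\t'):
    """Count how many lines have each number of sep-separated fields.

    Quoting is deliberately ignored: every separator splits the line.
    """
    counts = Counter()
    for line in lines:
        counts[len(line.rstrip('\n').split(sep))] += 1
    return counts
```

Passing an open file handle for `../data/train/dataset.csv` as `lines` streams the file one line at a time. Any key other than 3 in the result points at lines that do not match the assumed three-column layout.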


## 2 Training Full Dataset

In [5]:
%%bash
#! /bin/bash

bash ./model.sh -i '../data/train/dataset.csv' -o './model_all' -p 'train'

InputFile=../data/train/dataset.csv
OutputDir=./model_all
Process=train
SEED1=1
SEED2=1
TOPICS=500
ITERATION=2000
INTERVAL=40
BURNIN=300
17:29:54 :: Start import dataset...
Import new data for training.
17:29:54 :: Imported.
17:29:54 :: Start training dataset...
17:29:54 :: Trained.
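
The contents of `model.sh` are not shown here. As a guess at what a training run with the parameters printed above could look like, the sketch below builds (but does not run) the two MALLET commands. The flags are standard `import-file` and `train-topics` options, but how they map to the script's variables is an assumption, and `corpus.mallet` and `docTopics.txt` are hypothetical file names.

```python
def mallet_train_commands(input_file, output_dir, topics=500, iterations=2000,
                          interval=40, burn_in=300, seed=1,
                          doc_topics_name='docTopics.txt', mallet='mallet'):
    """Build (but do not run) MALLET import and training commands."""
    corpus = f'{output_dir}/corpus.mallet'  # hypothetical intermediate file
    import_cmd = [
        mallet, 'import-file',
        '--input', input_file,
        '--output', corpus,
        '--keep-sequence',  # topic models need token sequences
    ]
    train_cmd = [
        mallet, 'train-topics',
        '--input', corpus,
        '--num-topics', str(topics),
        '--num-iterations', str(iterations),
        '--optimize-interval', str(interval),   # assumed meaning of INTERVAL
        '--optimize-burn-in', str(burn_in),     # assumed meaning of BURNIN
        '--random-seed', str(seed),
        '--output-topic-keys', f'{output_dir}/topicKeys.txt',
        '--output-doc-topics', f'{output_dir}/{doc_topics_name}',  # hypothetical name
        '--inferencer-filename', f'{output_dir}/inferencer.model',
        '--output-state', f'{output_dir}/stat.gz',
        '--diagnostics-file', f'{output_dir}/diagnostics.xml',
    ]
    return import_cmd, train_cmd
```

Each returned list can be handed to `subprocess.run(cmd, check=True)`. Keeping the arguments as a list avoids shell quoting problems with paths.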


**The output files are:**
* topic words in 'topicKeys.txt'
* per-document topic distribution in 'topicKeys.txt'
* topic inferencer, used to infer topics for the subsets, in 'inferencer.model'
* corpus with its topic assignments in 'stat.gz'
* statistical diagnostics in 'diagnostics.xml'
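
To work with the topic words programmatically, the topic-keys file can be parsed line by line. The sketch below assumes MALLET's usual layout of topic id, topic weight and a space-separated word list, separated by tabs. `parse_topic_keys` is a hypothetical helper.

```python
def parse_topic_keys(lines):
    """Map topic id -> (weight, [top words]) from MALLET topic-keys lines."""
    topics = {}
    for line in lines:
        line = line.rstrip('\n')
        if not line:
            continue  # skip blank lines
        # Assumed layout: <topic id> TAB <weight> TAB <word word word ...>
        topic_id, weight, words = line.split('\t', 2)
        topics[int(topic_id)] = (float(weight), words.split())
    return topics
```

A call such as `parse_topic_keys(open('./model_all/topicKeys.txt', encoding='utf-8'))` would return one entry per topic, assuming the file sits in the training output directory.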

## 3 Subset

**Besides analyzing and visualizing the topic model of the full dataset, we can extract several subsets, based on typical application scenarios, to focus the analysis on specific aspects.**

**First, load the clean dataset and check its dimensions:**

In [6]:
df = load_dataset('dataset', spark)

print('Shape of dataframe: ({}, {})'.format(df.count(), len(df.columns)))
df.sample(False, 0.00001).limit(10).show()

Shape of dataframe: (16131646, 7)
+-------+--------------------+-----------------+----------+-----+--------------------+--------------------+
|     id|           publisher|           region|      date|  ads|               title|             content|
+-------+--------------------+-----------------+----------+-----+--------------------+--------------------+
|2019485|       Bush Advocate|      Hawke's Bay|1891-02-17|false|        FRISCO MAIL.|FRISCO MAIL.The n...|
|2161884|     Lyttelton Times|       Canterbury|1858-07-31| true|Page 6 Advertisem...|SUMNER " NOTICE T...|
|2468499|       Clutha Leader|            Otago|1890-03-21| true|Page 1 Advertisem...|Business Notices....|
|2593027|   Manawatu Standard|Manawatu-Wanganui|1883-05-18|false|  TO CORRESPONDENTS.|b'TO CORRESPONDEN...|
|2728252|New Zealand Illus...|          unknown|1902-10-01|false|In Search of a Fo...|b'In Search of a ...|
|2882510|   North Otago Times|            Otago|1868-06-26|false|            AUSTRIA.|AUSTRIA.A despat

### 3.1 By Range of Time

**For instance, we are interested in the topics that appeared in the papers during WWI, so we study the topic model around that period. According to Wikipedia, the war lasted from 28/7/1914 to 11/11/1918; we widen the window to 1912-1921 to analyze and visualize the topics of those years.**

**Choose the start and end dates of the sample:**

In [7]:
START = '1912-01-01'
END = '1921-12-31'

**Filter samples between start and end date, remove advertisements, and generate the subset - wwi:**

In [8]:
# drop advertisements and keep only rows dated within [START, END]
df_sub = (
    df.filter((df['ads'] == False) & (df['date'] >= START) & (df['date'] <= END))
)
print('Shape of dataframe: ({}, {})'.format(df_sub.count(), len(df_sub.columns)))

Shape of dataframe: (3002271, 7)
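
Both date bounds in the filter are inclusive, and advertisements are dropped whatever their date. The same predicate in plain Python, applied to a few made-up records, is sketched below. `in_window` and the records are illustrative only.

```python
from datetime import date

START = date(1912, 1, 1)
END = date(1921, 12, 31)


def in_window(record, start=START, end=END):
    """True for a non-advertisement record dated within [start, end]."""
    return (not record['ads']) and start <= record['date'] <= end


# Made-up records, for illustration only.
records = [
    {'id': 1, 'date': date(1911, 12, 31), 'ads': False},  # one day too early
    {'id': 2, 'date': date(1912, 1, 1), 'ads': False},    # first day, kept
    {'id': 3, 'date': date(1918, 11, 11), 'ads': True},   # advertisement
    {'id': 4, 'date': date(1921, 12, 31), 'ads': False},  # last day, kept
]
kept_ids = [r['id'] for r in records if in_window(r)]
print(kept_ids)
```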


**Check that the date range of the subset is correct:**

In [9]:
(df_sub.select(F.max(F.col('date')).alias('MAX')).limit(1).collect()[0].MAX, 
 df_sub.select(F.min(F.col('date')).alias('MIN')).limit(1).collect()[0].MIN)

(datetime.date(1921, 12, 31), datetime.date(1912, 1, 1))

**Generate subset to infer:**

In [10]:
df_sub = df_sub.select(F.col('id'), F.col('title'), F.col('content')).orderBy('id')

print('Shape of dataframe: ({}, {})'.format(df_sub.count(), len(df_sub.columns)))

Shape of dataframe: (3002271, 3)


**Save subset:**

In [11]:
subset_path = r'../data/subset/wwi'

df_sub.write.csv(subset_path, sep='\t', mode='overwrite', compression='gzip')

print('Save subset to', subset_path)
print('subset size:', subprocess.check_output(['du','-sh', subset_path]).split()[0].decode('utf-8'))

Save subset to ../data/subset/wwi
subset size: 1.7G


### 3.2 By Region

**The full dataset covers 16 regions; we focus on the four that are the most populous today (Auckland, Wellington, Canterbury and Otago).**

**Decide regions to sample:**

In [12]:
regions = ['Auckland', 'Wellington', 'Canterbury', 'Otago']

**Filter samples of the target regions (advertisements are kept) and generate the subset - regions:**

In [13]:
df_sub = df.filter(F.col('region').isin(regions))

print('Shape of dataframe: ({}, {})'.format(df_sub.count(), len(df_sub.columns)))

Shape of dataframe: (7889642, 7)


**Check that the subset contains only the target regions:**

In [14]:
df_sub.select(F.col('region')).distinct().show()

+----------+
|    region|
+----------+
|Wellington|
|  Auckland|
|     Otago|
|Canterbury|
+----------+



**Generate subset to infer:**

In [15]:
df_sub = df_sub.select(F.col('id'), F.col('title'), F.col('content')).orderBy('id')

print('Shape of dataframe: ({}, {})'.format(df_sub.count(), len(df_sub.columns)))

Shape of dataframe: (7889642, 3)


**Save subset:**

In [16]:
subset_path = r'../data/subset/regions'

df_sub.write.csv(subset_path, sep='\t', mode='overwrite', compression='gzip')

print('Save subset to', subset_path)
print('subset size:', subprocess.check_output(['du','-sh', subset_path]).split()[0].decode('utf-8'))

Save subset to ../data/subset/regions
subset size: 7.4G


### 3.3 By Label

**The dataset has a single label (ads), which marks whether a document is an advertisement. Advertisements carry less information than newspaper articles, but they are useful for studying everyday life of the period. They account for 27.4% of the full dataset, so we extract them as a separate subset.**

**Filter samples of advertisements, and generate the subset - ads:**

In [17]:
# keep advertisements only
df_sub = df.filter(F.col('ads') == True)

print('Shape of dataframe: ({}, {})'.format(df_sub.count(), len(df_sub.columns)))

Shape of dataframe: (4417669, 7)


**Check labels in the subset are all "ads":**

In [18]:
df_sub.select(F.col('ads')).distinct().show()

+----+
| ads|
+----+
|true|
+----+



**Generate subset to infer:**

In [19]:
df_sub = df_sub.select(F.col('id'), F.col('title'), F.col('content')).orderBy('id')

print('Shape of dataframe: ({}, {})'.format(df_sub.count(), len(df_sub.columns)))

Shape of dataframe: (4417669, 3)


**Save subset:**

In [20]:
subset_path = r'../data/subset/ads'

df_sub.write.csv(subset_path, sep='\t', mode='overwrite', compression='gzip')

print('Save subset to', subset_path)
print('subset size:', subprocess.check_output(['du','-sh', subset_path]).split()[0].decode('utf-8'))

Save subset to ../data/subset/ads
subset size: 5.2G
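
The row counts printed above give each subset's share of the full dataset. The short check below reproduces the arithmetic. The counts are copied from the cell outputs, and `share` is a hypothetical helper.

```python
TOTAL = 16_131_646  # rows in the full dataset

# Row counts reported for each subset above.
subset_rows = {
    'wwi': 3_002_271,
    'regions': 7_889_642,
    'ads': 4_417_669,
}


def share(rows, total=TOTAL):
    """Return rows as a percentage of total, rounded to one decimal place."""
    return round(100 * rows / total, 1)


shares = {name: share(rows) for name, rows in subset_rows.items()}
print(shares)
```

The value for `ads` agrees with the 27.4% quoted for advertisements.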


## 4 Inferring Subset

**We run the inferencer on each subset to obtain its doc-topic matrix, which we use to analyze and visualize topics.**

### 4.1 By Range of Time

**As with the full training dataset, we merge the compressed part files into a single** `*.csv` **file.**

In [21]:
%%bash

cat ../data/subset/wwi/*.csv.gz > ../data/subset/wwi/wwi.csv.gz

gunzip ../data/subset/wwi/wwi.csv.gz

**Check the number of lines (documents) in the dataset:**

In [22]:
%%bash

wc -l ../data/subset/wwi/wwi.csv

3002271 ../data/subset/wwi/wwi.csv


**Check contents:**

In [23]:
pd.read_table('../data/subset/wwi/wwi.csv', header=None, nrows=5).head()

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 3024444 | The New Year. | The New Year.My Dear People,—-----t Although, ... |
| 1 | 3024489 | Committee on Social Questions. | Committee on Social Questions.Archdeacon Willi... |
| 2 | 3024508 | The Church and Social Reform. | The Church and Social Reform.Two leading men m... |
| 3 | 3024532 | Bible Teaching in Scbools. | Bible Teaching m Scbools.(By the Yen. Archdeac... |
| 4 | 3024551 | Bishop's Diary. | Bishop's Diary.Bay of Plenty.November 29 : Lef... |


**Inferring:**

In [24]:
%%bash
#! /bin/bash

bash ./model.sh -i '../data/subset/wwi/wwi.csv' -o './model_wwi' -p 'infer'

InputFile=../data/subset/wwi/wwi.csv
OutputDir=./model_wwi
Process=infer
AllDir=./model_all
Inferencer=./model_all/inferencer.model
SEED1=1
SEED2=1
TOPICS=500
ITERATION=2000
INTERVAL=40
BURNIN=300
17:59:34 :: Start import dataset...
Import new data for inferring.
17:59:34 :: Imported.
17:59:34 :: Start infering dataset...
17:59:34 :: Inferred.
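
The inference branch of `model.sh` is not shown either. A plausible shape, sketched below without running anything, is to import the subset through the training pipe so that both share one vocabulary, then apply the saved inferencer. `--use-pipe-from`, `--inferencer` and `--output-doc-topics` are standard MALLET options. The file names `corpus.mallet` and `docTopics.txt` are hypothetical, and the mapping of the script's other variables is left out because it is unknown.

```python
def mallet_infer_commands(input_file, output_dir, all_dir='./model_all',
                          seed=1, mallet='mallet'):
    """Build (but do not run) MALLET import and inference commands."""
    train_corpus = f'{all_dir}/corpus.mallet'      # hypothetical file name
    subset_corpus = f'{output_dir}/corpus.mallet'  # hypothetical file name
    import_cmd = [
        mallet, 'import-file',
        '--input', input_file,
        '--output', subset_corpus,
        '--keep-sequence',
        # Reuse the training pipe so word ids match the trained model.
        '--use-pipe-from', train_corpus,
    ]
    infer_cmd = [
        mallet, 'infer-topics',
        '--input', subset_corpus,
        '--inferencer', f'{all_dir}/inferencer.model',
        '--output-doc-topics', f'{output_dir}/docTopics.txt',  # hypothetical
        '--random-seed', str(seed),
    ]
    return import_cmd, infer_cmd
```

For the WWI subset this would be called as `mallet_infer_commands('../data/subset/wwi/wwi.csv', './model_wwi')`.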


### 4.2 By Region

**Transform dataset files:**

In [25]:
%%bash

cat ../data/subset/regions/*.csv.gz > ../data/subset/regions/regions.csv.gz

gunzip ../data/subset/regions/regions.csv.gz

**Check the number of lines (documents) in the dataset:**

In [26]:
%%bash

wc -l ../data/subset/regions/regions.csv

7889642 ../data/subset/regions/regions.csv


**Check contents:**

In [27]:
pd.read_table('../data/subset/regions/regions.csv', header=None, nrows=5).head()

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1854213 | TO OUR HEADERS. | TO OUR HEADERS.; We have to apologize to our. ... |
| 1 | 1854214 | GOD REST THEE, WEARY TRAVELLER.\"\t\"GOD REST ... | |
| 2 | 1854215 | Page 1 Advertisements Column 1 | v-/ .ADVERTISEMENTS. •- I Advertisements will ... |
| 3 | 1854216 | Correspondence. | Correspondence.Ship \ MatildavWattenbacti;\" J... |
| 4 | 1854218 | General News. | General News.lV AMus£MENTS.--^Our record of sm... |


**Inferring:**

In [28]:
%%bash
#! /bin/bash

bash ./model.sh -i '../data/subset/regions/regions.csv' -o './model_regions' -p 'infer'

InputFile=../data/subset/regions/regions.csv
OutputDir=./model_regions
Process=infer
AllDir=./model_all
Inferencer=./model_all/inferencer.model
SEED1=1
SEED2=1
TOPICS=500
ITERATION=2000
INTERVAL=40
BURNIN=300
18:02:36 :: Start import dataset...
Import new data for inferring.
18:02:36 :: Imported.
18:02:36 :: Start infering dataset...
18:02:36 :: Inferred.


### 4.3 By Label

**Transform dataset files:**

In [29]:
%%bash

cat ../data/subset/ads/*.csv.gz > ../data/subset/ads/ads.csv.gz

gunzip ../data/subset/ads/ads.csv.gz

**Check the number of lines (documents) in the dataset:**

In [30]:
%%bash

wc -l ../data/subset/ads/ads.csv

4417669 ../data/subset/ads/ads.csv


**Check contents:**

In [31]:
pd.read_table('../data/subset/ads/ads.csv', header=None, nrows=5).head()

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1854215 | Page 1 Advertisements Column 1 | v-/ .ADVERTISEMENTS. •- I Advertisements will ... |
| 1 | 1854231 | Page 4 Advertisements Column 1 | - Dr. Bell delivered a second Lecture on \Or M... |
| 2 | 1854232 | Page 1 Advertisements Column 1 | NOTICE.—This Ne?vspaper may b? sent Free by Po... |
| 3 | 1854233 | Page 1 Advertisements Column 2 | TVT-OTR PAPER, Bill Paper, Envelopes _LV Memor... |
| 4 | 1854234 | Page 1 Advertisements Column 3 | ■ '■:.■ isles' . ■■■■■\■ ■■'■ dining & refresh... |


**Inferring:**

In [32]:
%%bash
#! /bin/bash

bash ./model.sh -i '../data/subset/ads/ads.csv' -o './model_ads' -p 'infer'

InputFile=../data/subset/ads/ads.csv
OutputDir=./model_ads
Process=infer
AllDir=./model_all
Inferencer=./model_all/inferencer.model
SEED1=1
SEED2=1
TOPICS=500
ITERATION=2000
INTERVAL=40
BURNIN=300
18:04:28 :: Start import dataset...
Import new data for inferring.
18:04:28 :: Imported.
18:04:28 :: Start infering dataset...
18:04:28 :: Inferred.
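
Each inference run should leave a doc-topics file in its output directory, which is the doc-topic matrix mentioned at the start of this section. The reader below is a minimal sketch that assumes MALLET's dense layout: an optional `#` header line, then one tab-separated row per document with a row index, the document name and one proportion per topic. Both helper names are hypothetical.

```python
def parse_doc_topics(lines):
    """Yield (doc_name, [proportions]) from dense doc-topics lines."""
    for line in lines:
        if line.startswith('#') or not line.strip():
            continue  # skip the header and blank lines
        fields = line.rstrip('\n').split('\t')
        # fields[0] is the row index, fields[1] the document name or id.
        yield fields[1], [float(p) for p in fields[2:]]


def dominant_topic(proportions):
    """Index of the topic with the largest proportion."""
    return max(range(len(proportions)), key=proportions.__getitem__)
```

Because `parse_doc_topics` is a generator, a subset with millions of rows can be processed one document at a time instead of loading the whole matrix into memory.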


---