# Part 5.1 - Subsets
---
### Papers Past Topic Modeling
<br/>

Ben Faulks - bmf43@uclive.ac.nz

Xiandong Cai - xca24@uclive.ac.nz

Yujie Cui - ycu23@uclive.ac.nz

In [1]:
import gc, sys, subprocess
sys.path.insert(0, '../utils') # to import custom modules
import pandas as pd
pd.set_option('display.max_columns', 120)
pd.set_option('display.max_colwidth', 120)
from pyspark.sql import functions as F
from pyspark.sql.types import *
from utils import conf_pyspark, load_dataset

# initiate PySpark
sc, spark = conf_pyspark()

sc

[('spark.driver.host', 'x99.hub'),
 ('spark.app.name', 'local'),
 ('spark.rdd.compress', 'True'),
 ('spark.app.id', 'local-1548277111283'),
 ('spark.serializer.objectStreamReset', '100'),
 ('spark.driver.memory', '62g'),
 ('spark.master', 'local[*]'),
 ('spark.executor.id', 'driver'),
 ('spark.submit.deployMode', 'client'),
 ('spark.driver.port', '45204'),
 ('spark.ui.showConsoleProgress', 'true'),
 ('spark.driver.cores', '6'),
 ('spark.driver.maxResultSize', '4g')]


**In this part, we will build several subsets for topic modeling:**

1. build a random sample set from the clean dataset;
1. build a training set and meta set from the sample set;
1. build subsets from the sample set, based on three typical application scenarios: by time range, by region, and by label.

**The data directory tree is shown below:**
```
project
└── data                     # save all data
    ├── dataset              # save all processed datasets
    │   ├── clean            # save clean dataset
    │   └── sample           # save sampled dataset
    │       ├── meta         # save metadata of sampled dataset
    │       ├── train        # save training dataset of sampled dataset
    │       └── subset       # save all subsets from sampled dataset
    │           ├── wwi      # save subset for WWI
    │           ├── regions  # save subset for regions
    │           └── ads      # save subset for ADs
    └── papers_past          # save raw dataset
```

## 1 Load Dataset

**Load clean dataset:**

In [2]:
df = load_dataset('clean', spark)
df.cache()

DataFrame[id: int, publisher: string, region: string, date: date, ads: boolean, title: string, content: string]

In [3]:
print('Shape of dataframe: ({}, {})'.format(df.count(), len(df.columns)))

Shape of dataframe: (15121970, 7)


## 2 Sampling

**Topic modeling is computationally intensive, and training on the full dataset requires substantial computing resources. Because of memory and time limits, we have to downsize the dataset for training. We use random sampling so that the sample covers as wide a range of documents as possible.**

In [4]:
# constraint for random sampling
PROPORTION = 0.04 # the proportion of sampling
SEED = 1          # set seed to reproduce

In [5]:
df_sample = df.sample(False, PROPORTION, SEED)
df_sample.cache()

df.unpersist()

DataFrame[id: int, publisher: string, region: string, date: date, ads: boolean, title: string, content: string]

In [6]:
print('Shape of dataframe: ({}, {})'.format(df_sample.count(), len(df_sample.columns)))
#df_sample.limit(5).toPandas().head()

Shape of dataframe: (603629, 7)
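
**The sample has 603,629 rows, slightly fewer than 4% of 15,121,970 (about 604,879). This is expected when each row is kept independently with a fixed probability, which is how fraction-based sampling without replacement is generally understood to behave. The block below is a minimal plain-Python sketch of that idea, not Spark's implementation; the helper name** `bernoulli_sample` **and the toy row list are assumptions for illustration.**

```python
import random

def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction` (hypothetical helper)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

rows = list(range(100_000))  # stand-in for document ids, not real data
sample = bernoulli_sample(rows, fraction=0.04, seed=1)
repeat = bernoulli_sample(rows, fraction=0.04, seed=1)

print('sample size:', len(sample))
print('same seed reproduces the sample:', sample == repeat)
# realised sampling fraction from the counts printed in this notebook
print('realised fraction: {:.4f}'.format(603629 / 15121970))
```

**The sample size is close to, but rarely exactly, the requested fraction, and a fixed seed makes the draw repeatable.**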


**The dataframe** `df_sample` **is the clean sample set; the other subsets are extracted from it. We then split it into a training set and a metadata set.**

In [7]:
df_train = (df_sample
            .select(F.col('id'), 
                    F.col('title'), 
                    F.col('content'))
            .orderBy('id'))
df_train.cache()

df_meta = (df_sample
           .select(F.col('id'), 
                   F.col('publisher'), 
                   F.col('region'), 
                   F.col('date'), 
                   F.col('ads'))
           .orderBy('id'))
df_meta.cache()

DataFrame[id: int, publisher: string, region: string, date: date, ads: boolean]

In [8]:
print('Shape of train dataframe: ({}, {})'.format(df_train.count(), len(df_train.columns)))
print('Shape of meta  dataframe: ({}, {})'.format(df_meta.count(), len(df_meta.columns)))

Shape of train dataframe: (603629, 3)
Shape of meta  dataframe: (603629, 5)
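
**Both sets keep the id column and the same row count, so they can be joined back together on id later. The block below is a small plain-Python sketch of that split-and-rejoin idea; the records and the helper names** `split_columns` **and** `rejoin` **are made up for illustration and are not part of the project code.**

```python
TRAIN_COLS = ('id', 'title', 'content')
META_COLS = ('id', 'publisher', 'region', 'date', 'ads')

def split_columns(records):
    """Return (train, meta) lists, both ordered by id (hypothetical helper)."""
    ordered = sorted(records, key=lambda r: r['id'])
    train = [{c: r[c] for c in TRAIN_COLS} for r in ordered]
    meta = [{c: r[c] for c in META_COLS} for r in ordered]
    return train, meta

def rejoin(train, meta):
    """Join train and meta rows back together on id (hypothetical helper)."""
    meta_by_id = {m['id']: m for m in meta}
    return [{**t, **meta_by_id[t['id']]} for t in train]

# made-up records, for illustration only
records = [
    {'id': 2, 'publisher': 'Paper B', 'region': 'Otago', 'date': '1900-01-02',
     'ads': True, 'title': 'Notice', 'content': 'for sale'},
    {'id': 1, 'publisher': 'Paper A', 'region': 'Auckland', 'date': '1900-01-01',
     'ads': False, 'title': 'News', 'content': 'some text'},
]

train, meta = split_columns(records)
rejoined = rejoin(train, meta)
print('round trip preserved all columns:',
      rejoined == sorted(records, key=lambda r: r['id']))
```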


**Save the datasets, and merge the partitioned output files into a single .csv file:**

In [9]:
path = r'../data/dataset/sample/meta'

df_meta.write.csv(path, mode='overwrite')

df_meta.unpersist()

print('Saved dataset to', path)
print('Dataset size:', subprocess.check_output(['du','-sh', path]).split()[0].decode('utf-8'))

Saved dataset to ../data/dataset/sample/meta
Dataset size: 32M


In [10]:
%%bash -s $path

# concatenate the part files into one file
cat $1/*.csv > $1/meta.csv

rm -f $1/part-0* $1/\.part-0*

# check row number
wc -l $1/meta.csv

603629 ../data/dataset/sample/meta/meta.csv
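
**The same merge step can be sketched in Python for environments without a bash cell. This is an illustrative equivalent under stated assumptions, not the notebook's method: the helper name** `merge_part_files` **is hypothetical, the demo runs on throwaway files in a temporary directory, and it only handles the part files themselves (not checksum or marker files).**

```python
import glob
import os
import shutil
import tempfile

def merge_part_files(directory, output_name):
    """Concatenate part-*.csv files into one file, delete the parts, return the line count."""
    parts = sorted(glob.glob(os.path.join(directory, 'part-*.csv')))
    out_path = os.path.join(directory, output_name)
    with open(out_path, 'wb') as out:
        for part in parts:
            with open(part, 'rb') as src:
                shutil.copyfileobj(src, out)  # stream, so large files are not loaded at once
    for part in parts:
        os.remove(part)
    with open(out_path, 'rb') as merged:
        return sum(1 for _ in merged)

# demo on two small made-up part files
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, 'part-00000-demo.csv'), 'w') as f:
        f.write('1,a\n2,b\n')
    with open(os.path.join(tmp, 'part-00001-demo.csv'), 'w') as f:
        f.write('3,c\n')
    n_rows = merge_part_files(tmp, 'meta.csv')
    remaining = sorted(os.listdir(tmp))

print(n_rows, remaining)
```

**As with the line count above, the result equals the number of rows only if no field contains an embedded newline.**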


In [11]:
path = r'../data/dataset/sample/train'

df_train.write.csv(path, sep='\t', mode='overwrite')

df_train.unpersist()

print('Saved dataset to', path)
print('Dataset size:', subprocess.check_output(['du','-sh', path]).split()[0].decode('utf-8'))

Saved dataset to ../data/dataset/sample/train
Dataset size: 1.3G


In [12]:
%%bash -s $path

# concatenate the part files into one file
cat $1/*.csv > $1/train.csv

rm -f $1/part-0* $1/\.part-0*

# check row number
wc -l $1/train.csv

603629 ../data/dataset/sample/train/train.csv


## 3 Subsets

### 3.1 By Range of Time

**For instance, suppose we are interested in the topics that appear in the papers during WWI, so we study the topic models around that period. Wikipedia gives the dates of the war as 28 July 1914 to 11 November 1918; we widen the window to run from 1912 to 1921 in order to analyze and visualize topics across those years.**

In [13]:
START = '1912-01-01'
END = '1921-12-31'

**Filter samples between start and end date, remove advertisements, and generate the subset - wwi:**

In [14]:
# remove advertisements and keep samples within the date range
df_sub = (df_sample.filter((df_sample['ads'] == False) & (df_sample['date'] >= START) & (df_sample['date'] <= END)))

**Check that the date range of the subset is correct:**

In [15]:
(df_sub.select(F.max(F.col('date')).alias('MAX')).limit(1).collect()[0].MAX, 
 df_sub.select(F.min(F.col('date')).alias('MIN')).limit(1).collect()[0].MIN)

(datetime.date(1921, 12, 31), datetime.date(1912, 1, 1))
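
**Both boundary dates appear in the output, which is consistent with an inclusive filter on each end. The block below is a minimal plain-Python sketch of the same rule (not an advertisement, and date within the window); the records and the helper name** `in_wwi_window` **are made up for illustration.**

```python
from datetime import date

START = date(1912, 1, 1)
END = date(1921, 12, 31)

def in_wwi_window(record):
    """True for non-advertisements dated within [START, END], both ends included."""
    return (not record['ads']) and START <= record['date'] <= END

# made-up records, for illustration only
records = [
    {'id': 1, 'date': date(1911, 12, 31), 'ads': False},  # one day too early
    {'id': 2, 'date': date(1912, 1, 1), 'ads': False},    # lower boundary, kept
    {'id': 3, 'date': date(1916, 7, 1), 'ads': False},    # inside the window, kept
    {'id': 4, 'date': date(1921, 12, 31), 'ads': False},  # upper boundary, kept
    {'id': 5, 'date': date(1915, 4, 25), 'ads': True},    # advertisement, dropped
    {'id': 6, 'date': date(1922, 1, 1), 'ads': False},    # one day too late
]

subset = [r for r in records if in_wwi_window(r)]
kept_ids = [r['id'] for r in subset]
date_range = (min(r['date'] for r in subset), max(r['date'] for r in subset))
print(kept_ids, date_range)
```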

**Generate subset to infer:**

In [16]:
df_sub = df_sub.select(F.col('id'), F.col('title'), F.col('content')).orderBy('id')
df_sub.cache()

DataFrame[id: int, title: string, content: string]

In [17]:
print('Shape of dataframe: ({}, {})'.format(df_sub.count(), len(df_sub.columns)))

Shape of dataframe: (113535, 3)


**Save subset:**

In [18]:
path = r'../data/dataset/sample/subset/wwi'

df_sub.write.csv(path, sep='\t', mode='overwrite')

df_sub.unpersist()

print('Saved subset to', path)
print('subset size:', subprocess.check_output(['du','-sh', path]).split()[0].decode('utf-8'))

Saved subset to ../data/dataset/sample/subset/wwi
subset size: 157M


**Merge the partitioned output files into one .csv file for MALLET:**

In [19]:
%%bash -s $path

# concatenate the part files into one file
cat $1/*.csv > $1/wwi.csv

rm -f $1/part-0* $1/\.part-0*

# check row number
wc -l $1/wwi.csv

113535 ../data/dataset/sample/subset/wwi/wwi.csv


### 3.2 By Region

**There are 16 regions in the full dataset; we focus on the four that are the most populous today (Auckland, Wellington, Canterbury and Otago).**

**Decide regions to sample:**

In [20]:
regions = ['Auckland', 'Wellington', 'Canterbury', 'Otago']

**Filter samples of the target regions, and generate the subset - regions:**

In [21]:
df_sub = df_sample.filter(F.col('region').isin(regions))

**Check that the regions in the subset are correct:**

In [22]:
df_sub.select(F.col('region')).distinct().show()

+----------+
|    region|
+----------+
|Wellington|
|  Auckland|
|     Otago|
|Canterbury|
+----------+
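
**The visual check above can also be expressed as set comparisons, which scale better when the target list grows. The block below is a plain-Python sketch with made-up records; the variable names** `unexpected` **and** `missing` **are illustrative only.**

```python
regions = ['Auckland', 'Wellington', 'Canterbury', 'Otago']

# made-up records, for illustration only
records = [
    {'id': 1, 'region': 'Auckland'},
    {'id': 2, 'region': 'Otago'},
    {'id': 3, 'region': 'Taranaki'},    # not a target region, dropped
    {'id': 4, 'region': 'Wellington'},
    {'id': 5, 'region': 'Canterbury'},
]

# membership filter, the same idea as an "is in" test on the region column
subset = [r for r in records if r['region'] in regions]

found = {r['region'] for r in subset}
unexpected = found - set(regions)  # regions that should not be in the subset
missing = set(regions) - found     # target regions with no documents at all

print('unexpected:', unexpected, 'missing:', missing)
```

**Both sets being empty means the subset contains exactly the target regions.**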



**Generate subset to infer:**

In [23]:
df_sub = df_sub.select(F.col('id'), F.col('title'), F.col('content')).orderBy('id')
df_sub.cache()

DataFrame[id: int, title: string, content: string]

In [24]:
print('Shape of dataframe: ({}, {})'.format(df_sub.count(), len(df_sub.columns)))

Shape of dataframe: (296486, 3)


**Save subset:**

In [25]:
path = r'../data/dataset/sample/subset/regions'

df_sub.write.csv(path, sep='\t', mode='overwrite')

df_sub.unpersist()

print('Saved subset to', path)
print('subset size:', subprocess.check_output(['du','-sh', path]).split()[0].decode('utf-8'))

Saved subset to ../data/dataset/sample/subset/regions
subset size: 687M


**Merge the partitioned output files into one .csv file for MALLET:**

In [26]:
%%bash -s $path

# concatenate the part files into one file
cat $1/*.csv > $1/regions.csv

rm -f $1/part-0* $1/\.part-0*

# check row number
wc -l $1/regions.csv

296486 ../data/dataset/sample/subset/regions/regions.csv


### 3.3 By Label

**There is only one label (ads) in the dataset; it marks whether a sample (row, document, text) is an advertisement. Advertisements carry less information than newspaper articles, but they are useful for analyzing everyday life of the period. Advertisements account for 27.4% of the full dataset, so we extract a subset for them.**

**Filter samples of advertisements, and generate the subset - ads:**

In [27]:
# keep only advertisements
df_sub = df_sample.filter(F.col('ads') == True)

**Check labels in the subset are all "ads":**

In [28]:
df_sub.select(F.col('ads')).distinct().show()

+----+
| ads|
+----+
|true|
+----+



**Generate subset to infer:**

In [29]:
df_sub = df_sub.select(F.col('id'), F.col('title'), F.col('content')).orderBy('id')
df_sub.cache()

DataFrame[id: int, title: string, content: string]

In [30]:
print('Shape of dataframe: ({}, {})'.format(df_sub.count(), len(df_sub.columns)))

Shape of dataframe: (167439, 3)
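
**As a quick sanity check, the share of advertisements in the sample can be computed from the counts printed in this notebook and compared with the 27.4% quoted for the full dataset. The constant names below are illustrative; the numbers are the ones shown in the outputs above.**

```python
# counts printed in this notebook
N_SAMPLE = 603629  # rows in the random sample
N_ADS = 167439     # advertisement rows in the sample

ads_share = N_ADS / N_SAMPLE
n_articles = N_SAMPLE - N_ADS

print('Share of advertisements in the sample: {:.1%}'.format(ads_share))
print('Non-advertisement rows in the sample: {}'.format(n_articles))
```

**The sample share comes out at roughly 27.7%, close to the full-dataset figure, as one would hope from a random sample.**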


**Save subset:**

In [31]:
path = r'../data/dataset/sample/subset/ads'

df_sub.write.csv(path, sep='\t', mode='overwrite')

df_sub.unpersist()

print('Saved subset to', path)
print('subset size:', subprocess.check_output(['du','-sh', path]).split()[0].decode('utf-8'))

Saved subset to ../data/dataset/sample/subset/ads
subset size: 463M


**Merge the partitioned output files into one .csv file for MALLET:**

In [32]:
%%bash -s $path

# concatenate the part files into one file
cat $1/*.csv > $1/ads.csv

rm -f $1/part-0* $1/\.part-0*

# check row number
wc -l $1/ads.csv

167439 ../data/dataset/sample/subset/ads/ads.csv


---

In [33]:
sc.stop()
gc.collect()

148