# Part 6.1 - Prepare Dataframe
---

### Papers Past Topic Modeling
<br/>


Ben Faulks - bmf43@uclive.ac.nz

Xiandong Cai - xca24@uclive.ac.nz

Yujie Cui - ycu23@uclive.ac.nz

In [1]:
import sys, gc
sys.path.insert(0, '../utils')
from utils import conf_pyspark, load_dataset
from utils_preplot import preplot, load_doctopic
import pandas as pd
pd.set_option('display.max_columns', 120)
pd.set_option('display.max_colwidth', 120)
from pyspark.sql import functions as F
from pyspark.sql.types import *

import datetime
print (datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))

# initiate PySpark
sc, spark = conf_pyspark()

sc

2019-01-30 21:55:29
[('spark.app.name', 'local'),
 ('spark.driver.port', '35958'),
 ('spark.rdd.compress', 'True'),
 ('spark.serializer.objectStreamReset', '100'),
 ('spark.driver.host', '192.168.1.207'),
 ('spark.driver.memory', '62g'),
 ('spark.master', 'local[*]'),
 ('spark.executor.id', 'driver'),
 ('spark.submit.deployMode', 'client'),
 ('spark.app.id', 'local-1548838541706'),
 ('spark.ui.showConsoleProgress', 'true'),
 ('spark.driver.cores', '6'),
 ('spark.driver.maxResultSize', '4g')]


**In this part we generate the dataframes that Part 6.2 will use:**
* dominant topic per year dataframe;
* average topic weight per year dataframe;
* dominant topic per month dataframe (only for WWI analysis);
* average topic weight per month dataframe (only for WWI analysis);

**After training the topic models, we have the topic words list (**`topicKeys.txt`**) and the doc-topic matrix (**`docTopics.txt`**). The doc-topic matrix is a document-based file, so the topic weights in the matrix can be joined with the metadata (date, region, publisher) in the meta dataset. The combined dataset can then support many data mining or statistical tasks; for instance, the time granularity can be as fine as one day, which allows accurate time series analysis. Now we load these data to generate dataframes for analysis and visualization.**

## 1 Prepare Dataframe for Train Set

### 1.1 Load data

**Load metadata ("id", "region" and "date"):**

In [2]:
df_meta = load_dataset('meta', spark).select(F.col('id').alias('id_'), F.col('region'), F.col('date'))

In [3]:
df_meta.limit(5).toPandas().head()

Unnamed: 0,id_,region,date
0,1854213,Auckland,1862-06-14
1,1854215,Auckland,1862-06-14
2,1854221,Auckland,1862-06-14
3,1854224,Auckland,1862-07-03
4,1854232,Auckland,1863-08-01


**Load topic words list:**

In [3]:
path = r'../models/train/topicKeys.txt'

data_schema = StructType([
    StructField('topic', IntegerType()),
    StructField('weight_', FloatType()),
    StructField('words', StringType())
])

df_topics = (
    spark.read.format("com.databricks.spark.csv")
    .option("header", "false")
    .option("inferSchema", "false")
    .option("delimiter", "\t")
    .schema(data_schema)
    .load(path)
)
topic_number = df_topics.count()

In [5]:
print('Shape of dataframe: ({}, {})'.format(topic_number, len(df_topics.columns)))
df_topics.limit(10).toPandas().head(10)

Shape of dataframe: (200, 3)


Unnamed: 0,topic,weight_,words
0,0,0.00161,apply wanted good post work experience wellington position wages salary office experienced box required applications...
1,1,0.00908,killed police received people london persons hundred men explosion city women injured thousand number arrested wound...
2,2,0.00116,rooms price section bungalow deposit large modern home street garage kitchenette good sale post terms tram city view...
3,3,0.00893,meeting committee board motion chairman seconded moved thought matter present report carried messrs read resolution ...
4,4,0.01761,sydney south australia melbourne australian wales new_zealand received victoria queensland government federal adelai...
5,5,0.00908,chinese china japanese japan russia russian turkish troops british turkey received government war constantinople lon...
6,6,0.0036,yds prize yards race handicap sports entrance match rifle points events shooting trophy club competition won prizes ...
7,7,0.00325,handicap lady furlongs meeting miles mile hack soys king acceptances miss cup gold sir bst royal hurdles day rose club
8,8,0.00397,reward lost ost white found dog finder apply notice black branded person bay return pound returning office gold satu...
9,9,0.00332,duty customs duties tariff sugar goods free cent revenue tobacco paid amount beer stamp colony consumption articles ...


**Load doc-topic matrix:**

In [4]:
path = r'../models/train/docTopics.txt'

# generate new column names
columns = [str(x) for x in list(range(topic_number))]
columns.insert(0, 'id')
columns.insert(0, 'index')

# load data
df_doctopic = (
    spark.read.format("com.databricks.spark.csv")
    .option("header", "false")
    .option("inferSchema", "true")
    .option("delimiter", "\t")
    .load(path)
)

# rename the columns: 'index' is only the table row number, 'id' is the document id,
# and the remaining columns are named by topic number
df_doctopic = df_doctopic.toDF(*columns)

In [7]:
print('Shape of dataframe: ({}, {})'.format(df_doctopic.count(), len(df_doctopic.columns)))

df_doctopic.limit(5).toPandas().head()

Shape of dataframe: (3025602, 202)


Unnamed: 0,index,id,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,...,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199
0,0,1854213,4.8e-05,0.000271,3.5e-05,0.000267,0.000526,0.000271,0.000108,9.7e-05,0.000119,9.9e-05,0.000183,0.000308,8.9e-05,0.000245,0.000174,0.00022,0.000105,0.000195,0.000162,0.000106,6.2e-05,0.000201,0.000177,0.000266,0.000162,0.000402,5.9e-05,0.000231,0.000406,0.000184,0.000119,0.000165,0.000226,0.000305,0.00019,0.000285,0.000245,0.000212,0.00118,4.7e-05,0.000163,0.00013,0.836856,0.000117,7e-05,0.000226,0.000114,0.000292,0.000209,0.000221,0.000259,0.000103,0.000459,0.000179,0.000226,0.000375,0.000306,0.000181,...,0.000173,0.000134,0.000209,0.000201,0.000163,0.00016,0.06003,0.000203,0.000249,8.8e-05,0.000208,0.000206,0.00023,0.000187,0.000215,0.000275,0.000181,5.1e-05,0.000193,0.000235,6.3e-05,0.000152,0.00018,0.00047,0.000122,0.000221,0.000175,0.000179,0.059977,0.000246,0.000145,0.000224,0.000209,0.000364,9.8e-05,0.000281,0.000189,0.000103,0.000156,0.000195,0.000116,0.000242,0.000158,0.000255,0.000205,0.000294,0.000117,0.000226,0.000207,0.00026,8.5e-05,0.000171,0.000413,0.000192,0.000238,0.000102,0.000266,0.000159,0.000215,0.000231
1,1,1854215,2.8e-05,0.000155,2e-05,0.000153,0.000301,0.000155,6.2e-05,5.6e-05,6.8e-05,5.7e-05,0.000105,0.000176,5.1e-05,0.00014,0.0001,0.000126,6e-05,0.000112,9.3e-05,6e-05,3.6e-05,0.000115,0.000101,0.000152,9.3e-05,0.00023,3.4e-05,0.000132,0.000233,0.000105,6.8e-05,9.4e-05,0.000129,0.000174,0.000109,0.000163,0.00014,0.000121,0.000676,2.7e-05,9.3e-05,7.5e-05,0.00031,6.7e-05,4e-05,0.239527,6.5e-05,0.000167,0.00012,0.000127,0.000148,5.9e-05,0.000263,0.000103,0.000129,0.000215,0.290872,0.000104,...,9.9e-05,7.7e-05,0.00012,0.000115,9.3e-05,9.2e-05,0.000168,0.000116,0.000142,5.1e-05,0.000119,0.000118,0.000131,0.000107,0.000123,0.000157,0.000103,2.9e-05,0.000111,0.000135,3.6e-05,8.7e-05,0.000103,0.000269,7e-05,0.000127,0.0001,0.000103,0.000137,0.000141,8.3e-05,0.000129,0.00012,0.034408,5.6e-05,0.000161,0.000108,5.9e-05,8.9e-05,0.000112,6.6e-05,0.000138,9.1e-05,0.000146,0.000117,0.000168,6.7e-05,0.000129,0.000119,0.000149,4.9e-05,0.359194,0.000237,0.00011,0.000136,5.9e-05,0.000152,9.1e-05,0.000123,0.000132
2,2,1854221,3.9e-05,0.000219,2.8e-05,0.000215,0.000424,0.000219,8.7e-05,7.8e-05,9.6e-05,8e-05,0.000148,0.000249,7.2e-05,0.000198,0.00014,0.000178,8.5e-05,0.000158,0.00013,8.5e-05,5e-05,0.000162,0.000143,0.000214,0.000131,0.000324,4.7e-05,0.000186,0.000328,0.000148,9.6e-05,0.000133,0.000182,0.000246,0.000154,0.00023,0.000198,0.000171,0.000952,3.8e-05,0.000132,0.000105,0.000437,9.4e-05,5.6e-05,0.000182,9.2e-05,0.000236,0.000169,0.000179,0.000209,8.3e-05,0.00037,0.000145,0.000182,0.000303,0.000247,0.000146,...,0.00014,0.000108,0.000169,0.000162,0.000132,0.00013,0.000237,0.000164,0.000201,7.1e-05,0.000168,0.000166,0.000185,0.000151,0.000173,0.000222,0.000146,4.1e-05,0.000156,0.00019,5.1e-05,0.000123,0.000145,0.000379,9.9e-05,0.000178,0.000141,0.000145,0.000194,0.000199,0.000117,0.000181,0.000169,0.000294,7.9e-05,0.000227,0.000152,8.3e-05,0.000126,0.000158,9.3e-05,0.000195,0.000128,0.000206,0.000165,0.000237,9.4e-05,0.000182,0.000167,0.00021,6.9e-05,0.000138,0.000334,0.868039,0.000192,8.3e-05,0.000214,0.000128,0.000174,0.000186
3,3,1854224,3.6e-05,0.000204,2.6e-05,0.000201,0.000396,0.000204,8.1e-05,7.3e-05,8.9e-05,7.5e-05,0.000138,0.000232,6.7e-05,0.000185,0.000131,0.000166,7.9e-05,0.000147,0.000122,7.9e-05,4.7e-05,0.000151,0.000133,0.0002,0.000122,0.000302,4.4e-05,0.000174,0.000306,0.000138,8.9e-05,0.000124,0.00017,0.000229,0.000143,0.000214,0.000185,0.000159,0.000888,3.5e-05,0.000123,9.8e-05,0.000407,8.8e-05,5.2e-05,0.00017,8.6e-05,0.00022,0.000157,0.000167,0.000195,7.7e-05,0.000345,0.000135,0.00017,0.000283,0.00023,0.000136,...,0.000131,0.000101,0.000157,0.000151,0.000123,0.000121,0.000221,0.000153,0.000187,6.6e-05,0.000156,0.000155,0.000173,0.00014,0.000162,0.000207,0.000136,3.8e-05,0.000145,0.000177,4.7e-05,0.000114,0.000135,0.000354,9.2e-05,0.000166,0.000132,0.000135,0.00018,0.000185,0.000109,0.000169,0.000158,0.000274,7.4e-05,0.000212,0.000142,7.7e-05,0.000117,0.000147,8.7e-05,0.000182,0.000119,0.000192,0.000154,0.000221,8.8e-05,0.00017,0.000156,0.000196,6.4e-05,0.000129,0.000311,0.966867,0.000179,7.7e-05,0.0002,0.00012,0.000162,0.000174
4,4,1854232,1.1e-05,6.3e-05,8e-06,6.2e-05,0.000123,6.3e-05,2.5e-05,2.3e-05,2.8e-05,2.3e-05,4.3e-05,7.2e-05,2.1e-05,0.027936,4.1e-05,5.1e-05,2.5e-05,4.6e-05,3.8e-05,2.5e-05,1.5e-05,4.7e-05,4.1e-05,6.2e-05,3.8e-05,9.4e-05,1.4e-05,5.4e-05,9.5e-05,4.3e-05,2.8e-05,3.8e-05,5.3e-05,7.1e-05,4.4e-05,6.6e-05,5.7e-05,4.9e-05,0.021184,1.1e-05,3.8e-05,3e-05,0.000126,2.7e-05,1.6e-05,5.3e-05,2.7e-05,6.8e-05,4.9e-05,5.2e-05,6e-05,2.4e-05,0.264952,4.2e-05,5.3e-05,8.8e-05,0.02795,4.2e-05,...,4e-05,3.1e-05,4.9e-05,4.7e-05,3.8e-05,3.7e-05,6.8e-05,4.7e-05,5.8e-05,2.1e-05,4.8e-05,4.8e-05,5.4e-05,0.007013,5e-05,6.4e-05,4.2e-05,1.2e-05,4.5e-05,5.5e-05,1.5e-05,3.5e-05,4.2e-05,0.104654,2.9e-05,5.2e-05,4.1e-05,4.2e-05,5.6e-05,5.7e-05,3.4e-05,5.2e-05,4.9e-05,8.5e-05,2.3e-05,6.6e-05,4.4e-05,0.25093,3.6e-05,4.6e-05,2.7e-05,5.6e-05,3.7e-05,6e-05,4.8e-05,6.9e-05,2.7e-05,5.3e-05,4.8e-05,6.1e-05,2e-05,0.076706,9.6e-05,4.5e-05,5.6e-05,2.4e-05,6.2e-05,0.083672,5e-05,5.4e-05


**In the dataframe above, the "index" column is the row number; the "id" column is the sample/document/text id, the same as "id" in the dataset; the columns "0" to the maximum topic number hold the weight of each topic per document.**
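To make the column layout concrete, the sketch below parses one tab-separated row of this shape in plain Python. The sample row and the helper name `parse_doctopic_row` are made up for illustration; the real matrix has 200 topic columns, and the notebook reads it with Spark rather than by hand.

```python
def parse_doctopic_row(line):
    """Split one tab-separated doc-topic row into (index, doc_id, weights)."""
    fields = line.rstrip("\n").split("\t")
    index = int(fields[0])                     # row number
    doc_id = int(fields[1])                    # document id, matches "id" in the metadata
    weights = [float(x) for x in fields[2:]]   # one weight per topic, in topic order
    return index, doc_id, weights

# Hypothetical row with only 4 topics (the real matrix has 200).
sample = "0\t1854213\t0.05\t0.80\t0.10\t0.05\n"
index, doc_id, weights = parse_doctopic_row(sample)
print(index, doc_id, weights)
print(round(sum(weights), 6))  # the weights of one document sum to 1, up to rounding
```

Keeping the weights in topic order means the list position doubles as the topic number, which is what the column names "0", "1", ... encode in the Spark dataframe.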

### 1.2 Add Dominant Topics Column

**Find dominant topic of each document:**

In [8]:
# https://stackoverflow.com/questions/46819405/how-to-get-the-name-of-column-with-maximum-value-in-pyspark-dataframe

def argmax(cols, *args):
    # return the name of the first column that holds the row maximum
    return cols[args.index(max(args))]

def add_domtopic(df):
    """
    find the dominant topic of each sample/row/document
    input: dataframe of weight of each topic
    output: the same dataframe with 'domtopic' and 'weight' columns added
    """
    # topic columns are everything after 'index' and 'id'; df is the input
    # dataframe, so the last column is still a topic and must not be sliced off
    topic_cols = df.columns[2:]
    argmax_udf = F.udf(lambda *args: argmax(topic_cols, *args), StringType())
    return (df
            .withColumn('domtopic', argmax_udf(*topic_cols))
            .withColumn('weight',
                        F.greatest(*[F.col(x) for x in topic_cols])))

# add the df_dominant to doc-topic matrix
df_doctopic = add_domtopic(df_doctopic)
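As a plain-Python illustration of what the UDF computes per row, the sketch below picks the column name with the largest weight and returns it together with that weight. The column names and weights are invented, and `dominant_topic` is a hypothetical helper, not part of the notebook.

```python
def dominant_topic(cols, weights):
    """Return (column name, weight) of the largest weight; the first one wins on ties."""
    best = max(range(len(weights)), key=lambda i: weights[i])
    return cols[best], weights[best]

cols = ["0", "1", "2", "3"]        # topic columns are named by topic number
row = [0.05, 0.80, 0.10, 0.05]     # hypothetical topic weights of one document

topic, weight = dominant_topic(cols, row)
print(topic, weight)
```

The topic comes back as a string because it is a column name, which is presumably why the saved dataframe reports `domtopic: string`.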

In [9]:
#print('Shape of dataframe: ({}, {})'.format(df_doctopic.count(), len(df_doctopic.columns)))

#df_doctopic.limit(5).toPandas().head()

### 1.3 Add Metadata Columns

**Here we add only the "region" and "date" columns as metadata. The date is accurate to the day, so "date" can be converted to a year basis, a month basis, etc. as needed.**

In [5]:
df_doctopic = (df_doctopic
               .join(df_meta, df_doctopic.id == df_meta.id_)
               .withColumn('year', F.date_format('date', 'yyyy'))
               .drop('id_')
               .drop('date')
               .orderBy('index'))
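The conversion from a daily date to a coarser period can be sketched without Spark. The helper `to_period` below is hypothetical; it mirrors the idea of `F.date_format('date', 'yyyy')` for years and, as an assumption about the monthly case, produces labels of the form `YYYY-MM`.

```python
from datetime import date

def to_period(iso_date, unit="year"):
    """Truncate an ISO date string (YYYY-MM-DD) to a year or year-month label."""
    d = date.fromisoformat(iso_date)
    if unit == "year":
        return f"{d.year:04d}"
    if unit == "month":
        return f"{d.year:04d}-{d.month:02d}"
    raise ValueError(f"unsupported unit: {unit}")

print(to_period("1862-06-14", "year"))
print(to_period("1862-06-14", "month"))
```

Both labels sort correctly as strings, so grouping and ordering by the period column needs no further date handling.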

In [6]:
#print('Shape of dataframe: ({}, {})'.format(df_doctopic.count(), len(df_doctopic.columns)))

#df_doctopic.limit(5).toPandas().head()

Unnamed: 0,index,id,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,...,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,region,year
0,0,1854213,4.8e-05,0.000271,3.5e-05,0.000267,0.000526,0.000271,0.000108,9.7e-05,0.000119,9.9e-05,0.000183,0.000308,8.9e-05,0.000245,0.000174,0.00022,0.000105,0.000195,0.000162,0.000106,6.2e-05,0.000201,0.000177,0.000266,0.000162,0.000402,5.9e-05,0.000231,0.000406,0.000184,0.000119,0.000165,0.000226,0.000305,0.00019,0.000285,0.000245,0.000212,0.00118,4.7e-05,0.000163,0.00013,0.836856,0.000117,7e-05,0.000226,0.000114,0.000292,0.000209,0.000221,0.000259,0.000103,0.000459,0.000179,0.000226,0.000375,0.000306,0.000181,...,0.000209,0.000201,0.000163,0.00016,0.06003,0.000203,0.000249,8.8e-05,0.000208,0.000206,0.00023,0.000187,0.000215,0.000275,0.000181,5.1e-05,0.000193,0.000235,6.3e-05,0.000152,0.00018,0.00047,0.000122,0.000221,0.000175,0.000179,0.059977,0.000246,0.000145,0.000224,0.000209,0.000364,9.8e-05,0.000281,0.000189,0.000103,0.000156,0.000195,0.000116,0.000242,0.000158,0.000255,0.000205,0.000294,0.000117,0.000226,0.000207,0.00026,8.5e-05,0.000171,0.000413,0.000192,0.000238,0.000102,0.000266,0.000159,0.000215,0.000231,Auckland,1862-06
1,1,1854215,2.8e-05,0.000155,2e-05,0.000153,0.000301,0.000155,6.2e-05,5.6e-05,6.8e-05,5.7e-05,0.000105,0.000176,5.1e-05,0.00014,0.0001,0.000126,6e-05,0.000112,9.3e-05,6e-05,3.6e-05,0.000115,0.000101,0.000152,9.3e-05,0.00023,3.4e-05,0.000132,0.000233,0.000105,6.8e-05,9.4e-05,0.000129,0.000174,0.000109,0.000163,0.00014,0.000121,0.000676,2.7e-05,9.3e-05,7.5e-05,0.00031,6.7e-05,4e-05,0.239527,6.5e-05,0.000167,0.00012,0.000127,0.000148,5.9e-05,0.000263,0.000103,0.000129,0.000215,0.290872,0.000104,...,0.00012,0.000115,9.3e-05,9.2e-05,0.000168,0.000116,0.000142,5.1e-05,0.000119,0.000118,0.000131,0.000107,0.000123,0.000157,0.000103,2.9e-05,0.000111,0.000135,3.6e-05,8.7e-05,0.000103,0.000269,7e-05,0.000127,0.0001,0.000103,0.000137,0.000141,8.3e-05,0.000129,0.00012,0.034408,5.6e-05,0.000161,0.000108,5.9e-05,8.9e-05,0.000112,6.6e-05,0.000138,9.1e-05,0.000146,0.000117,0.000168,6.7e-05,0.000129,0.000119,0.000149,4.9e-05,0.359194,0.000237,0.00011,0.000136,5.9e-05,0.000152,9.1e-05,0.000123,0.000132,Auckland,1862-06
2,2,1854221,3.9e-05,0.000219,2.8e-05,0.000215,0.000424,0.000219,8.7e-05,7.8e-05,9.6e-05,8e-05,0.000148,0.000249,7.2e-05,0.000198,0.00014,0.000178,8.5e-05,0.000158,0.00013,8.5e-05,5e-05,0.000162,0.000143,0.000214,0.000131,0.000324,4.7e-05,0.000186,0.000328,0.000148,9.6e-05,0.000133,0.000182,0.000246,0.000154,0.00023,0.000198,0.000171,0.000952,3.8e-05,0.000132,0.000105,0.000437,9.4e-05,5.6e-05,0.000182,9.2e-05,0.000236,0.000169,0.000179,0.000209,8.3e-05,0.00037,0.000145,0.000182,0.000303,0.000247,0.000146,...,0.000169,0.000162,0.000132,0.00013,0.000237,0.000164,0.000201,7.1e-05,0.000168,0.000166,0.000185,0.000151,0.000173,0.000222,0.000146,4.1e-05,0.000156,0.00019,5.1e-05,0.000123,0.000145,0.000379,9.9e-05,0.000178,0.000141,0.000145,0.000194,0.000199,0.000117,0.000181,0.000169,0.000294,7.9e-05,0.000227,0.000152,8.3e-05,0.000126,0.000158,9.3e-05,0.000195,0.000128,0.000206,0.000165,0.000237,9.4e-05,0.000182,0.000167,0.00021,6.9e-05,0.000138,0.000334,0.868039,0.000192,8.3e-05,0.000214,0.000128,0.000174,0.000186,Auckland,1862-06
3,3,1854224,3.6e-05,0.000204,2.6e-05,0.000201,0.000396,0.000204,8.1e-05,7.3e-05,8.9e-05,7.5e-05,0.000138,0.000232,6.7e-05,0.000185,0.000131,0.000166,7.9e-05,0.000147,0.000122,7.9e-05,4.7e-05,0.000151,0.000133,0.0002,0.000122,0.000302,4.4e-05,0.000174,0.000306,0.000138,8.9e-05,0.000124,0.00017,0.000229,0.000143,0.000214,0.000185,0.000159,0.000888,3.5e-05,0.000123,9.8e-05,0.000407,8.8e-05,5.2e-05,0.00017,8.6e-05,0.00022,0.000157,0.000167,0.000195,7.7e-05,0.000345,0.000135,0.00017,0.000283,0.00023,0.000136,...,0.000157,0.000151,0.000123,0.000121,0.000221,0.000153,0.000187,6.6e-05,0.000156,0.000155,0.000173,0.00014,0.000162,0.000207,0.000136,3.8e-05,0.000145,0.000177,4.7e-05,0.000114,0.000135,0.000354,9.2e-05,0.000166,0.000132,0.000135,0.00018,0.000185,0.000109,0.000169,0.000158,0.000274,7.4e-05,0.000212,0.000142,7.7e-05,0.000117,0.000147,8.7e-05,0.000182,0.000119,0.000192,0.000154,0.000221,8.8e-05,0.00017,0.000156,0.000196,6.4e-05,0.000129,0.000311,0.966867,0.000179,7.7e-05,0.0002,0.00012,0.000162,0.000174,Auckland,1862-07
4,4,1854232,1.1e-05,6.3e-05,8e-06,6.2e-05,0.000123,6.3e-05,2.5e-05,2.3e-05,2.8e-05,2.3e-05,4.3e-05,7.2e-05,2.1e-05,0.027936,4.1e-05,5.1e-05,2.5e-05,4.6e-05,3.8e-05,2.5e-05,1.5e-05,4.7e-05,4.1e-05,6.2e-05,3.8e-05,9.4e-05,1.4e-05,5.4e-05,9.5e-05,4.3e-05,2.8e-05,3.8e-05,5.3e-05,7.1e-05,4.4e-05,6.6e-05,5.7e-05,4.9e-05,0.021184,1.1e-05,3.8e-05,3e-05,0.000126,2.7e-05,1.6e-05,5.3e-05,2.7e-05,6.8e-05,4.9e-05,5.2e-05,6e-05,2.4e-05,0.264952,4.2e-05,5.3e-05,8.8e-05,0.02795,4.2e-05,...,4.9e-05,4.7e-05,3.8e-05,3.7e-05,6.8e-05,4.7e-05,5.8e-05,2.1e-05,4.8e-05,4.8e-05,5.4e-05,0.007013,5e-05,6.4e-05,4.2e-05,1.2e-05,4.5e-05,5.5e-05,1.5e-05,3.5e-05,4.2e-05,0.104654,2.9e-05,5.2e-05,4.1e-05,4.2e-05,5.6e-05,5.7e-05,3.4e-05,5.2e-05,4.9e-05,8.5e-05,2.3e-05,6.6e-05,4.4e-05,0.25093,3.6e-05,4.6e-05,2.7e-05,5.6e-05,3.7e-05,6e-05,4.8e-05,6.9e-05,2.7e-05,5.3e-05,4.8e-05,6.1e-05,2e-05,0.076706,9.6e-05,4.5e-05,5.6e-05,2.4e-05,6.2e-05,0.083672,5e-05,5.4e-05,Auckland,1863-08


### 1.4 Document - Dominant Topics Dataframe

**The doc-topic matrix is high-dimensional and hard to plot intuitively, so we need to transform it to extract or reduce features. First we generate the dominant topics dataframe, which can be used to reveal the relationship between dominant topics and region/year.**

In [12]:
df_docdomtopic = (df_doctopic
                  .join(df_topics, df_doctopic.domtopic == df_topics.topic)
                  .select(F.col('id'), 
                          F.col('region'), 
                          F.col('year'), 
                          F.col('domtopic'), 
                          F.col('weight'), 
                          F.col('words'))
                  .orderBy('id'))

df_docdomtopic.cache();

In [13]:
print('Shape of dataframe: ({}, {})'.format(df_docdomtopic.count(), len(df_docdomtopic.columns)))

df_docdomtopic.limit(5).toPandas().head()

Shape of dataframe: (3025602, 6)


Unnamed: 0,id,region,year,domtopic,weight,words
0,1854213,Auckland,1862,42,0.836856,time question matter present fact made case position public great doubt opinion good make point part reason subject ...
1,1854215,Auckland,1862,191,0.359194,advertisements office post prizes stamps letters exceeding subscribers half ounce prize orders postage horse inserti...
2,1854221,Auckland,1862,193,0.868039,life love god heart day thy man world great thou men death long light thee earth eyes home sweet land
3,1854224,Auckland,1862,193,0.966867,life love god heart day thy man world great thou men death long light thee earth eyes home sweet land
4,1854232,Auckland,1863,52,0.264952,business public notice orders begs attention stock street goods general premises inform prices customers advertiseme...


**Save the dataframe for later use:**

In [14]:
path = r'../models/train/domTopics'

df_docdomtopic.write.csv(path, mode='overwrite')

df_docdomtopic.unpersist()

DataFrame[id: int, region: string, year: string, domtopic: string, weight: double, words: string]

In [15]:
%%bash -s "$path"

cat $1/*.csv > $1/domTopics.csv

mv $1/domTopics.csv $1/../

rm -rf $1
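For environments without a bash cell magic, the same merge can be approximated in Python. This is a sketch under assumptions: `merge_part_files` is a hypothetical helper, the part file names are invented, and it writes the merged file directly to its final location instead of moving it afterwards.

```python
import glob
import os
import shutil
import tempfile

def merge_part_files(part_dir, out_file):
    """Concatenate every *.csv part file in part_dir into out_file, then delete part_dir."""
    parts = sorted(glob.glob(os.path.join(part_dir, "*.csv")))
    with open(out_file, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    shutil.rmtree(part_dir)
    return len(parts)

# Demo on a throwaway directory with two made-up part files.
base = tempfile.mkdtemp()
part_dir = os.path.join(base, "domTopics")
os.makedirs(part_dir)
for name, text in [("part-00000.csv", "1,a\n"), ("part-00001.csv", "2,b\n")]:
    with open(os.path.join(part_dir, name), "w") as f:
        f.write(text)

merged_path = os.path.join(base, "domTopics.csv")
n = merge_part_files(part_dir, merged_path)
with open(merged_path) as f:
    merged = f.read()
print(n, repr(merged))
```

Writing the output outside `part_dir` avoids the merged file being matched by the `*.csv` pattern if the function is run twice.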

### 1.5 Average Weight Topics Dataframe

**Besides the dominant topics dataframe, we can calculate the average weight of each topic in a year and create a dataframe for it, which reveals how the weight of each topic varies over time. The average weights can also serve as features for data mining algorithms that find patterns, e.g. correlation between features. The topic weights of each document are already scaled to 0-1 and sum to 1, so the average weights of each year also sum to 1 and need no further scaling.**

In [7]:
df_avgweight = (df_doctopic
                .drop('index')
                .drop('id')
                .drop('domtopic')
                .drop('region')
                .drop('weight')
                .groupBy('year')
                .avg()
                .orderBy('year'))
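The claim that yearly averages still sum to 1 can be checked on a toy example. The sketch below uses invented documents with three topics and a hypothetical helper `average_by_year`; it mimics the `groupBy('year').avg()` step in plain Python.

```python
from collections import defaultdict

def average_by_year(rows):
    """rows: iterable of (year, [topic weights]); returns {year: [mean weight per topic]}."""
    sums, counts = {}, defaultdict(int)
    for year, weights in rows:
        if year not in sums:
            sums[year] = [0.0] * len(weights)
        sums[year] = [s + w for s, w in zip(sums[year], weights)]
        counts[year] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

# Hypothetical documents with 3 topics; each row of weights sums to 1.
rows = [
    ("1862", [0.7, 0.2, 0.1]),
    ("1862", [0.1, 0.6, 0.3]),
    ("1863", [0.25, 0.25, 0.5]),
]
avg = average_by_year(rows)
for year in sorted(avg):
    print(year, avg[year], round(sum(avg[year]), 6))
```

Because every document contributes a row that sums to 1, the mean of n such rows sums to n/n = 1, whatever the number of documents in a year.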

In [8]:
df_avgweight.limit(5).toPandas().head()

Unnamed: 0,year,avg(0),avg(1),avg(2),avg(3),avg(4),avg(5),avg(6),avg(7),avg(8),avg(9),avg(10),avg(11),avg(12),avg(13),avg(14),avg(15),avg(16),avg(17),avg(18),avg(19),avg(20),avg(21),avg(22),avg(23),avg(24),avg(25),avg(26),avg(27),avg(28),avg(29),avg(30),avg(31),avg(32),avg(33),avg(34),avg(35),avg(36),avg(37),avg(38),avg(39),avg(40),avg(41),avg(42),avg(43),avg(44),avg(45),avg(46),avg(47),avg(48),avg(49),avg(50),avg(51),avg(52),avg(53),avg(54),avg(55),avg(56),avg(57),avg(58),...,avg(140),avg(141),avg(142),avg(143),avg(144),avg(145),avg(146),avg(147),avg(148),avg(149),avg(150),avg(151),avg(152),avg(153),avg(154),avg(155),avg(156),avg(157),avg(158),avg(159),avg(160),avg(161),avg(162),avg(163),avg(164),avg(165),avg(166),avg(167),avg(168),avg(169),avg(170),avg(171),avg(172),avg(173),avg(174),avg(175),avg(176),avg(177),avg(178),avg(179),avg(180),avg(181),avg(182),avg(183),avg(184),avg(185),avg(186),avg(187),avg(188),avg(189),avg(190),avg(191),avg(192),avg(193),avg(194),avg(195),avg(196),avg(197),avg(198),avg(199)
0,1912-01,0.00097,0.014639,0.00018,0.003625,0.012039,0.015342,0.004117,0.004044,0.000988,0.001091,0.002887,0.002561,0.0018,0.004091,0.002668,0.0034,0.001242,0.005476,0.001092,0.004332,0.001933,0.00683,0.006409,0.013327,0.004392,0.01503,0.000293,0.003421,0.009448,0.010049,0.005101,0.00429,0.011478,0.007189,0.01016,0.007483,0.008974,0.000969,0.013567,0.000135,0.001681,0.002957,0.008036,0.001612,0.000127,0.006988,0.000347,0.00357,0.002174,0.003222,0.015092,0.004108,0.002495,0.011696,0.00348,0.007936,0.009228,0.001736,0.005553,...,0.001526,0.00568,0.005696,0.002724,0.002459,0.000946,0.003455,0.002651,0.007027,0.003351,0.01127,0.002863,0.012105,0.005198,0.003716,0.002849,0.004704,0.000997,0.002555,0.004187,0.003343,0.002242,0.00495,0.004057,0.014721,0.00358,0.006489,0.004477,0.003085,0.004728,0.001876,0.00194,0.002702,0.005309,0.001,0.007741,0.002759,0.000784,0.003639,0.003503,0.00077,0.003132,0.00118,0.011675,0.003013,0.008927,0.000419,0.002203,0.004681,0.005974,0.00081,0.002231,0.006692,0.003207,0.002111,0.00187,0.010024,0.002443,0.00311,0.001778
1,1912-02,0.000848,0.01751,0.000212,0.00434,0.008844,0.011836,0.005945,0.002592,0.000931,0.000638,0.003094,0.002476,0.002472,0.004718,0.003288,0.002227,0.002469,0.006201,0.001118,0.004349,0.002492,0.006788,0.007334,0.012528,0.005252,0.013001,0.000326,0.003729,0.008242,0.008366,0.005888,0.004898,0.008079,0.006677,0.008033,0.005567,0.006787,0.001444,0.012696,0.000126,0.00212,0.003752,0.009267,0.002874,0.000225,0.004448,0.000211,0.004003,0.00208,0.006324,0.008897,0.004273,0.003379,0.011202,0.006442,0.009532,0.008136,0.002261,0.005439,...,0.001558,0.00692,0.006052,0.001561,0.001805,0.000781,0.002992,0.001551,0.006171,0.004115,0.00648,0.002966,0.006644,0.003002,0.007283,0.002534,0.00386,0.001377,0.002711,0.003867,0.004292,0.002803,0.005231,0.004539,0.014973,0.003242,0.006125,0.005727,0.004384,0.003558,0.001398,0.000874,0.002736,0.00655,0.000841,0.005772,0.002399,0.00088,0.004088,0.004349,0.00029,0.003106,0.001501,0.007535,0.004371,0.009596,0.000299,0.002606,0.003283,0.006129,0.000593,0.002051,0.006652,0.003152,0.003943,0.001869,0.009577,0.002852,0.003443,0.002388
2,1912-03,0.000791,0.015762,0.000199,0.005694,0.009023,0.008583,0.008668,0.004564,0.001133,0.000931,0.003478,0.002996,0.002092,0.003939,0.003279,0.002569,0.002798,0.005212,0.00119,0.003155,0.002553,0.007258,0.006436,0.012434,0.005328,0.008876,0.000351,0.004403,0.009034,0.00834,0.006284,0.004196,0.008737,0.007015,0.008662,0.005598,0.005254,0.002304,0.014267,0.000316,0.002319,0.004726,0.008736,0.003422,0.000256,0.003655,0.000332,0.004669,0.002472,0.006235,0.009334,0.005239,0.002142,0.008125,0.005339,0.006418,0.009044,0.001925,0.006064,...,0.002377,0.006455,0.004602,0.001594,0.002292,0.000734,0.003811,0.001436,0.006418,0.003793,0.005869,0.00343,0.006661,0.003739,0.005715,0.002366,0.002773,0.001238,0.002121,0.003382,0.004785,0.002717,0.005232,0.004415,0.005899,0.003639,0.005944,0.002727,0.005022,0.003334,0.002744,0.002253,0.002791,0.006378,0.000848,0.004707,0.00219,0.000504,0.003525,0.004885,0.00043,0.003191,0.001424,0.01251,0.004857,0.010049,0.000426,0.002688,0.00396,0.005716,0.001031,0.002145,0.006726,0.003825,0.004377,0.00222,0.009812,0.003443,0.005296,0.002533
3,1912-04,0.000933,0.016111,0.000257,0.004269,0.006811,0.008263,0.003887,0.005724,0.001038,0.000796,0.00373,0.002665,0.002043,0.005094,0.002645,0.002597,0.002099,0.00388,0.001286,0.003197,0.002226,0.008766,0.005186,0.011684,0.004737,0.006585,0.000155,0.004078,0.008968,0.007781,0.005559,0.003919,0.00774,0.006798,0.008815,0.006932,0.008129,0.002566,0.012583,0.000316,0.002767,0.004313,0.006568,0.00238,0.000333,0.004378,0.000505,0.004932,0.002439,0.005853,0.0133,0.006496,0.002666,0.004785,0.005589,0.004396,0.026849,0.002341,0.00685,...,0.001218,0.006292,0.0055,0.002039,0.002018,0.000743,0.002832,0.001955,0.007754,0.003552,0.009009,0.003418,0.007214,0.005715,0.002256,0.002683,0.00272,0.001275,0.002874,0.002233,0.00448,0.002717,0.004416,0.004806,0.002476,0.003966,0.004632,0.003103,0.007231,0.002323,0.003858,0.001831,0.003108,0.007375,0.000598,0.004483,0.002148,0.000795,0.002687,0.004443,0.001046,0.004089,0.001486,0.012439,0.005025,0.009385,0.000354,0.002609,0.003835,0.006119,0.000741,0.002255,0.00899,0.004576,0.003186,0.003081,0.008381,0.00283,0.0052,0.002224
4,1912-05,0.000381,0.014027,0.000128,0.004472,0.007268,0.007818,0.001563,0.003416,0.001144,0.001386,0.003538,0.002646,0.001718,0.003798,0.002098,0.002242,0.00202,0.004903,0.001178,0.003056,0.00175,0.005403,0.007638,0.015948,0.004384,0.007665,0.00028,0.004047,0.008602,0.009689,0.005299,0.002369,0.007978,0.007158,0.008542,0.007745,0.010918,0.002196,0.015451,0.000198,0.002068,0.00204,0.007915,0.003723,0.000382,0.003545,0.000619,0.005792,0.003045,0.006234,0.011344,0.005909,0.003356,0.003983,0.006954,0.005151,0.014604,0.002195,0.00611,...,0.001999,0.005274,0.005111,0.00196,0.00232,0.000784,0.003126,0.002545,0.004657,0.003755,0.009864,0.004588,0.008481,0.003996,0.003429,0.002561,0.002767,0.001239,0.002248,0.00313,0.004769,0.00256,0.004617,0.004901,0.011115,0.003614,0.005089,0.004256,0.005642,0.003761,0.010416,0.001732,0.003558,0.007151,0.000841,0.006252,0.002704,0.00045,0.002322,0.005803,0.000525,0.003843,0.00162,0.010148,0.005124,0.008734,0.000413,0.002369,0.003803,0.005666,0.000547,0.001809,0.007689,0.005193,0.003963,0.001707,0.008985,0.002638,0.003475,0.002195


**Check that the years are identical to those in the dataset:**

In [18]:
year_doct = list(df_doctopic.select('year').distinct().orderBy('year').rdd.map(lambda r: r[0]).collect())
year_avgw = list(df_avgweight.select('year').rdd.map(lambda r: r[0]).collect())
pd.DataFrame({'yearDocTopic':year_doct, 'yearAvgWeight':year_avgw}).T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103
yearAvgWeight,1839,1840,1841,1842,1843,1844,1845,1846,1847,1848,1849,1850,1851,1852,1853,1854,1855,1856,1857,1858,1859,1860,1861,1862,1863,1864,1865,1866,1867,1868,1869,1870,1871,1872,1873,1874,1875,1876,1877,1878,1879,1880,1881,1882,1883,1884,1885,1886,1887,1888,1889,1890,1891,1892,1893,1894,1895,1896,1897,1898,1899,1900,1901,1902,1903,1907,1908,1909,1910,1911,1912,1913,1914,1915,1916,1917,1918,1919,1920,1921,1922,1923,1924,1925,1926,1927,1928,1929,1930,1931,1932,1933,1934,1935,1936,1937,1938,1939,1940,1941,1942,1943,1944,1945
yearDocTopic,1839,1840,1841,1842,1843,1844,1845,1846,1847,1848,1849,1850,1851,1852,1853,1854,1855,1856,1857,1858,1859,1860,1861,1862,1863,1864,1865,1866,1867,1868,1869,1870,1871,1872,1873,1874,1875,1876,1877,1878,1879,1880,1881,1882,1883,1884,1885,1886,1887,1888,1889,1890,1891,1892,1893,1894,1895,1896,1897,1898,1899,1900,1901,1902,1903,1907,1908,1909,1910,1911,1912,1913,1914,1915,1916,1917,1918,1919,1920,1921,1922,1923,1924,1925,1926,1927,1928,1929,1930,1931,1932,1933,1934,1935,1936,1937,1938,1939,1940,1941,1942,1943,1944,1945
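The same check can be done programmatically rather than by eye. The sketch below rebuilds the year labels shown in the table above (104 labels, with a jump from 1903 to 1907) instead of collecting them from Spark; `compare_years` is a hypothetical helper.

```python
def compare_years(a, b):
    """Return (same, missing): whether the lists match, and calendar years absent from a."""
    same = list(a) == list(b)
    years = sorted(int(y) for y in a)
    present = set(years)
    missing = [y for y in range(years[0], years[-1] + 1) if y not in present]
    return same, missing

# Year labels as in the table above: 1839-1903 and 1907-1945.
years = [str(y) for y in list(range(1839, 1904)) + list(range(1907, 1946))]
same, missing = compare_years(years, years)
print(same, len(years), missing)
```

Reporting the missing years alongside the equality flag makes gaps in coverage visible, which matters when plotting a yearly time series.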


**The average weight dataframe is small (topic_n\*year_n), so we save it directly using Pandas:**

In [19]:
path = r'../models/train/avgWeight.csv'

df_avgweight.toPandas().to_csv(path, header=False, index=False, encoding='utf-8')

## 2 Prepare Dataframe for Subset

**The process for a subset is the same as for the full dataset, so we wrapped the process in a function and call it for each subset.**

### 2.1 By Range of Time

**The time range is now a short period, so we need to generate new weights for the topic list in** `topicKeys.csv`**.**

In [25]:
path = r'../models/wwi/docTopicsInfer.txt'

df_doctopic = load_doctopic(path, topic_number, spark)
df_weight = df_doctopic.groupBy().sum().toPandas()
df_weight.drop(df_weight.columns[[0, 1]], axis=1, inplace=True)
df_weight = df_weight.T.reset_index(drop=True)
df_weight.columns = ['weight']
df_weight = df_weight / df_weight.sum()
df_topics_wwi = df_topics.toPandas().join(df_weight).drop(columns='weight_')
df_topics_wwi = df_topics_wwi[['topic', 'weight', 'words']]

print('Shape of dataframe:', df_topics_wwi.shape)
#df_topics_wwi.head()

Shape of dataframe: (200, 3)
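The reweighting above amounts to summing each topic column over the subset and dividing by the grand total. A minimal sketch of that normalisation, with invented sums for three topics and a hypothetical helper `normalize`:

```python
def normalize(column_sums):
    """Scale non-negative column sums so that they add up to 1."""
    total = sum(column_sums)
    if total == 0:
        raise ValueError("cannot normalise an all-zero vector")
    return [s / total for s in column_sums]

# Hypothetical per-topic sums over a subset of documents (3 topics).
topic_sums = [30.0, 50.0, 20.0]
weights = normalize(topic_sums)
print(weights)
```

The resulting weights describe the share of each topic within the subset only, so they are comparable with the weights computed for the remaining documents.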


In [26]:
path = r'../models/wwi/topicKeys.csv'
df_topics_wwi.to_csv(path, header=False, index=False, encoding='utf-8')

**Generate new weights for the others (all documents outside the WWI range).**

In [27]:
path = r'../models/train/docTopics.txt'

START = '1912-01-01'
END = '1921-12-31'

df_doctopic = load_doctopic(path, topic_number, spark)
df_sub = df_meta.filter((df_meta['date'] < START) | (df_meta['date'] > END))
df_doctopic = (df_doctopic
               .join(df_sub, df_doctopic.id == df_sub.id_)
               .drop('index').drop('id').drop('id_').drop('region').drop('date'))
df_weight = df_doctopic.groupBy().sum().toPandas()
df_weight = df_weight.T.reset_index(drop=True)
df_weight.columns = ['weight']
df_weight = df_weight / df_weight.sum()
df_topics_others = df_topics.toPandas().join(df_weight).drop(columns='weight_')
df_topics_others = df_topics_others[['topic', 'weight', 'words']]

print('Shape of dataframe:', df_topics_others.shape)
#df_topics_others.head()

Shape of dataframe: (200, 3)
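The filter condition that selects the documents outside the range can be illustrated in plain Python. This is only a sketch of the condition, not the Spark expression itself; the `(id, date)` pairs are invented and `outside_range` is a hypothetical helper.

```python
START = "1912-01-01"
END = "1921-12-31"

def outside_range(iso_date, start=START, end=END):
    """True when the date is before start or after end; start and end themselves are in range."""
    return iso_date < start or iso_date > end

# Hypothetical (id, date) pairs; ISO dates compare correctly as plain strings.
meta = [(1, "1911-12-31"), (2, "1912-01-01"), (3, "1918-11-11"),
        (4, "1921-12-31"), (5, "1922-01-01")]
others = [doc_id for doc_id, d in meta if outside_range(d)]
print(others)
```

Using strict comparisons on both sides keeps the boundary days in the WWI subset, so the two groups never overlap.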


In [28]:
path = r'../models/wwi/topicKeysOthers.csv'
df_topics_others.to_csv(path, header=False, index=False, encoding='utf-8')

**Generate other dataframes:**

In [30]:
path = r'../models/wwi/docTopicsInfer.txt'

df_doctopic = load_doctopic(path, topic_number, spark)

df_docdomtopic, df_avgweight = preplot(df_doctopic, df_meta, df_topics, 'month')

df_docdomtopic.cache();

**Save dataframes:**

In [31]:
path = r'../models/wwi/avgWeight.csv'
df_avgweight.toPandas().to_csv(path, header=False, index=False, encoding='utf-8')

path = r'../models/wwi/domTopics/'
df_docdomtopic.write.csv(path, mode='overwrite')

df_docdomtopic.unpersist();

**Merge the multiple part files into a single csv file:**

In [32]:
%%bash -s "$path"

cat $1/*.csv > $1/domTopics.csv

mv $1/domTopics.csv $1/../

rm -rf $1

### 2.2 By Region

#### 2.2.1 Otago

**Generate new weights for the region:**

In [33]:
path = r'../models/otago/docTopicsInfer.txt'

df_doctopic = load_doctopic(path, topic_number, spark)
df_weight = df_doctopic.groupBy().sum().toPandas()
df_weight.drop(df_weight.columns[[0, 1]], axis=1, inplace=True)
df_weight = df_weight.T.reset_index(drop=True)
df_weight.columns = ['weight']
df_weight = df_weight / df_weight.sum()
df_topics_regions = df_topics.toPandas().join(df_weight).drop(columns='weight_')
df_topics_regions = df_topics_regions[['topic', 'weight', 'words']]

print('Shape of dataframe:', df_topics_regions.shape)
#df_topics_regions.head()

Shape of dataframe: (200, 3)


In [34]:
path = r'../models/otago/topicKeys.csv'
df_topics_regions.to_csv(path, header=False, index=False, encoding='utf-8')

**Generate new weights for all other documents (outside the listed regions):**

In [35]:
path = r'../models/train/docTopics.txt'

#regions = ['Otago', 'Canterbury', 'Manawatu-Wanganui', 'Wellington']
regions = ['Otago']

df_doctopic = load_doctopic(path, topic_number, spark)
# keep only documents whose region is not in the list
df_sub = df_meta.filter(~F.col('region').isin(regions))
df_doctopic = (df_doctopic
               .join(df_sub, df_doctopic.id == df_sub.id_)
               .drop('index').drop('id').drop('id_').drop('region').drop('date'))
df_weight = df_doctopic.groupBy().sum().toPandas()
df_weight = df_weight.T.reset_index(drop=True)
df_weight.columns = ['weight']
df_weight = df_weight / df_weight.sum()
df_topics_others = df_topics.toPandas().join(df_weight).drop(columns='weight_')
df_topics_others = df_topics_others[['topic', 'weight', 'words']]

print('Shape of dataframe:', df_topics_others.shape)
#df_topics_others.head()

Shape of dataframe: (200, 3)


In [36]:
path = r'../models/otago/topicKeysOthers.csv'
df_topics_others.to_csv(path, header=False, index=False, encoding='utf-8')
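
The complement selection used above (every document whose region is not in `regions`) can be checked on a small example. This is a pandas sketch on made-up data, not the Spark code we ran; only the `id`/`id_` join keys and the `region` column mirror the names used in this notebook.

```python
import pandas as pd

# Hypothetical metadata and doc-topic weights for four documents.
meta = pd.DataFrame({
    "id_": [1, 2, 3, 4],
    "region": ["Otago", "Wellington", "Otago", "Canterbury"],
})
doc_topics = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "0": [0.9, 0.4, 0.8, 0.2],
    "1": [0.1, 0.6, 0.2, 0.8],
})

regions = ["Otago"]

# Keep the metadata rows whose region is NOT in the list.
others = meta[~meta["region"].isin(regions)]

# An inner join drops the doc-topic rows of the excluded regions.
joined = doc_topics.merge(others, left_on="id", right_on="id_", how="inner")

# Normalized topic weights over the remaining documents.
totals = joined[["0", "1"]].sum()
weights = totals / totals.sum()
print(weights)
```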

**Generate the dominant-topic and average-weight dataframes (per year):**

In [37]:
path = r'../models/otago/docTopicsInfer.txt'

df_doctopic = load_doctopic(path, topic_number, spark)

df_docdomtopic, df_avgweight = preplot(df_doctopic, df_meta, df_topics, 'year')

df_docdomtopic.cache();

**Save dataframes:**

In [38]:
path = r'../models/otago/avgWeight.csv'
df_avgweight.toPandas().to_csv(path, header=False, index=False, encoding='utf-8')

path = r'../models/otago/domTopics/'
df_docdomtopic.write.csv(path, mode='overwrite')

df_docdomtopic.unpersist();

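**Merge the part files into a single CSV file:**

In [39]:
%%bash -s "$path"

# concatenate Spark's part files into one CSV beside the output directory,
# and remove the directory only if that succeeds
cat "$1"/*.csv > "$1/../domTopics.csv" && rm -rf "$1"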

#### 2.2.2 Canterbury

**Generate new weights for the region:**

In [40]:
path = r'../models/canterbury/docTopicsInfer.txt'

df_doctopic = load_doctopic(path, topic_number, spark)
df_weight = df_doctopic.groupBy().sum().toPandas()
df_weight.drop(df_weight.columns[[0, 1]], axis=1, inplace=True)
df_weight = df_weight.T.reset_index(drop=True)
df_weight.columns = ['weight']
df_weight = df_weight / df_weight.sum()
df_topics_regions = df_topics.toPandas().join(df_weight).drop(columns='weight_')
df_topics_regions = df_topics_regions[['topic', 'weight', 'words']]

print('Shape of dataframe:', df_topics_regions.shape)
#df_topics_regions.head()

Shape of dataframe: (200, 3)


In [41]:
path = r'../models/canterbury/topicKeys.csv'
df_topics_regions.to_csv(path, header=False, index=False, encoding='utf-8')

**Generate new weights for all other documents (outside the listed regions):**

In [42]:
path = r'../models/train/docTopics.txt'

#regions = ['Otago', 'Canterbury', 'Manawatu-Wanganui', 'Wellington']
regions = ['Canterbury']

df_doctopic = load_doctopic(path, topic_number, spark)
# keep only documents whose region is not in the list
df_sub = df_meta.filter(~F.col('region').isin(regions))
df_doctopic = (df_doctopic
               .join(df_sub, df_doctopic.id == df_sub.id_)
               .drop('index').drop('id').drop('id_').drop('region').drop('date'))
df_weight = df_doctopic.groupBy().sum().toPandas()
df_weight = df_weight.T.reset_index(drop=True)
df_weight.columns = ['weight']
df_weight = df_weight / df_weight.sum()
df_topics_others = df_topics.toPandas().join(df_weight).drop(columns='weight_')
df_topics_others = df_topics_others[['topic', 'weight', 'words']]

print('Shape of dataframe:', df_topics_others.shape)
#df_topics_others.head()

Shape of dataframe: (200, 3)


In [43]:
path = r'../models/canterbury/topicKeysOthers.csv'
df_topics_others.to_csv(path, header=False, index=False, encoding='utf-8')

**Generate the dominant-topic and average-weight dataframes (per year):**

In [44]:
path = r'../models/canterbury/docTopicsInfer.txt'

df_doctopic = load_doctopic(path, topic_number, spark)

df_docdomtopic, df_avgweight = preplot(df_doctopic, df_meta, df_topics, 'year')

df_docdomtopic.cache();

**Save dataframes:**

In [45]:
path = r'../models/canterbury/avgWeight.csv'
df_avgweight.toPandas().to_csv(path, header=False, index=False, encoding='utf-8')

path = r'../models/canterbury/domTopics/'
df_docdomtopic.write.csv(path, mode='overwrite')

df_docdomtopic.unpersist();

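**Merge the part files into a single CSV file:**

In [46]:
%%bash -s "$path"

# concatenate Spark's part files into one CSV beside the output directory,
# and remove the directory only if that succeeds
cat "$1"/*.csv > "$1/../domTopics.csv" && rm -rf "$1"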

#### 2.2.3 Manawatu-Wanganui

**Generate new weights for the region:**

In [47]:
path = r'../models/manawatu-wanganui/docTopicsInfer.txt'

df_doctopic = load_doctopic(path, topic_number, spark)
df_weight = df_doctopic.groupBy().sum().toPandas()
df_weight.drop(df_weight.columns[[0, 1]], axis=1, inplace=True)
df_weight = df_weight.T.reset_index(drop=True)
df_weight.columns = ['weight']
df_weight = df_weight / df_weight.sum()
df_topics_regions = df_topics.toPandas().join(df_weight).drop(columns='weight_')
df_topics_regions = df_topics_regions[['topic', 'weight', 'words']]

print('Shape of dataframe:', df_topics_regions.shape)
#df_topics_regions.head()

Shape of dataframe: (200, 3)


In [48]:
path = r'../models/manawatu-wanganui/topicKeys.csv'
df_topics_regions.to_csv(path, header=False, index=False, encoding='utf-8')

**Generate new weights for all other documents (outside the listed regions):**

In [49]:
path = r'../models/train/docTopics.txt'

#regions = ['Otago', 'Canterbury', 'Manawatu-Wanganui', 'Wellington']
regions = ['Manawatu-Wanganui']

df_doctopic = load_doctopic(path, topic_number, spark)
# keep only documents whose region is not in the list
df_sub = df_meta.filter(~F.col('region').isin(regions))
df_doctopic = (df_doctopic
               .join(df_sub, df_doctopic.id == df_sub.id_)
               .drop('index').drop('id').drop('id_').drop('region').drop('date'))
df_weight = df_doctopic.groupBy().sum().toPandas()
df_weight = df_weight.T.reset_index(drop=True)
df_weight.columns = ['weight']
df_weight = df_weight / df_weight.sum()
df_topics_others = df_topics.toPandas().join(df_weight).drop(columns='weight_')
df_topics_others = df_topics_others[['topic', 'weight', 'words']]

print('Shape of dataframe:', df_topics_others.shape)
#df_topics_others.head()

Shape of dataframe: (200, 3)


In [50]:
path = r'../models/manawatu-wanganui/topicKeysOthers.csv'
df_topics_others.to_csv(path, header=False, index=False, encoding='utf-8')

**Generate the dominant-topic and average-weight dataframes (per year):**

In [51]:
path = r'../models/manawatu-wanganui/docTopicsInfer.txt'

df_doctopic = load_doctopic(path, topic_number, spark)

df_docdomtopic, df_avgweight = preplot(df_doctopic, df_meta, df_topics, 'year')

df_docdomtopic.cache();

**Save dataframes:**

In [52]:
path = r'../models/manawatu-wanganui/avgWeight.csv'
df_avgweight.toPandas().to_csv(path, header=False, index=False, encoding='utf-8')

path = r'../models/manawatu-wanganui/domTopics/'
df_docdomtopic.write.csv(path, mode='overwrite')

df_docdomtopic.unpersist();

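**Merge the part files into a single CSV file:**

In [53]:
%%bash -s "$path"

# concatenate Spark's part files into one CSV beside the output directory,
# and remove the directory only if that succeeds
cat "$1"/*.csv > "$1/../domTopics.csv" && rm -rf "$1"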

#### 2.2.4 Wellington

**Generate new weights for the region:**

In [54]:
path = r'../models/wellington/docTopicsInfer.txt'

df_doctopic = load_doctopic(path, topic_number, spark)
df_weight = df_doctopic.groupBy().sum().toPandas()
df_weight.drop(df_weight.columns[[0, 1]], axis=1, inplace=True)
df_weight = df_weight.T.reset_index(drop=True)
df_weight.columns = ['weight']
df_weight = df_weight / df_weight.sum()
df_topics_regions = df_topics.toPandas().join(df_weight).drop(columns='weight_')
df_topics_regions = df_topics_regions[['topic', 'weight', 'words']]

print('Shape of dataframe:', df_topics_regions.shape)
#df_topics_regions.head()

Shape of dataframe: (200, 3)


In [55]:
path = r'../models/wellington/topicKeys.csv'
df_topics_regions.to_csv(path, header=False, index=False, encoding='utf-8')

**Generate new weights for all other documents (outside the listed regions):**

In [56]:
path = r'../models/train/docTopics.txt'

#regions = ['Otago', 'Canterbury', 'Manawatu-Wanganui', 'Wellington']
regions = ['Wellington']

df_doctopic = load_doctopic(path, topic_number, spark)
# keep only documents whose region is not in the list
df_sub = df_meta.filter(~F.col('region').isin(regions))
df_doctopic = (df_doctopic
               .join(df_sub, df_doctopic.id == df_sub.id_)
               .drop('index').drop('id').drop('id_').drop('region').drop('date'))
df_weight = df_doctopic.groupBy().sum().toPandas()
df_weight = df_weight.T.reset_index(drop=True)
df_weight.columns = ['weight']
df_weight = df_weight / df_weight.sum()
df_topics_others = df_topics.toPandas().join(df_weight).drop(columns='weight_')
df_topics_others = df_topics_others[['topic', 'weight', 'words']]

print('Shape of dataframe:', df_topics_others.shape)
#df_topics_others.head()

Shape of dataframe: (200, 3)


In [57]:
path = r'../models/wellington/topicKeysOthers.csv'
df_topics_others.to_csv(path, header=False, index=False, encoding='utf-8')

**Generate the dominant-topic and average-weight dataframes (per year):**

In [58]:
path = r'../models/wellington/docTopicsInfer.txt'

df_doctopic = load_doctopic(path, topic_number, spark)

df_docdomtopic, df_avgweight = preplot(df_doctopic, df_meta, df_topics, 'year')

df_docdomtopic.cache();

**Save dataframes:**

In [59]:
path = r'../models/wellington/avgWeight.csv'
df_avgweight.toPandas().to_csv(path, header=False, index=False, encoding='utf-8')

path = r'../models/wellington/domTopics/'
df_docdomtopic.write.csv(path, mode='overwrite')

df_docdomtopic.unpersist();

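**Merge the part files into a single CSV file:**

In [60]:
%%bash -s "$path"

# concatenate Spark's part files into one CSV beside the output directory,
# and remove the directory only if that succeeds
cat "$1"/*.csv > "$1/../domTopics.csv" && rm -rf "$1"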

### 2.3 By Label

**Generate new weights for advertisements:**

In [61]:
path = r'../models/ads/docTopicsInfer.txt'

df_doctopic = load_doctopic(path, topic_number, spark)
df_weight = df_doctopic.groupBy().sum().toPandas()
df_weight.drop(df_weight.columns[[0, 1]], axis=1, inplace=True)
df_weight = df_weight.T.reset_index(drop=True)
df_weight.columns = ['weight']
df_weight = df_weight / df_weight.sum()
df_topics_ads = df_topics.toPandas().join(df_weight).drop(columns='weight_')
df_topics_ads = df_topics_ads[['topic', 'weight', 'words']]

print('Shape of dataframe:', df_topics_ads.shape)
#df_topics_ads.head()

Shape of dataframe: (200, 3)


In [62]:
path = r'../models/ads/topicKeys.csv'
df_topics_ads.to_csv(path, header=False, index=False, encoding='utf-8')

**Generate new weights for all other documents (non-advertisements):**

In [63]:
path = r'../models/train/docTopics.txt'

df_doctopic = load_doctopic(path, topic_number, spark)
# keep only documents that are not labelled as advertisements
df_sub = df_meta.filter(F.col('ads') == False)
df_doctopic = (df_doctopic
               .join(df_sub, df_doctopic.id == df_sub.id_)
               .drop('index').drop('id').drop('id_').drop('region').drop('date'))
df_weight = df_doctopic.groupBy().sum().toPandas()
df_weight = df_weight.T.reset_index(drop=True)
df_weight.columns = ['weight']
df_weight = df_weight / df_weight.sum()
df_topics_others = df_topics.toPandas().join(df_weight).drop(columns='weight_')
df_topics_others = df_topics_others[['topic', 'weight', 'words']]

print('Shape of dataframe:', df_topics_others.shape)
#df_topics_others.head()

Shape of dataframe: (200, 3)


In [64]:
path = r'../models/ads/topicKeysOthers.csv'
df_topics_others.to_csv(path, header=False, index=False, encoding='utf-8')

**Generate the dominant-topic and average-weight dataframes (per year):**

In [65]:
path = r'../models/ads/docTopicsInfer.txt'

df_doctopic = load_doctopic(path, topic_number, spark)

df_docdomtopic, df_avgweight = preplot(df_doctopic, df_meta, df_topics, 'year')

df_docdomtopic.cache()

DataFrame[id: int, region: string, year: string, domtopic: string, weight: double, words: string]

**Save dataframes:**

In [66]:
path = r'../models/ads/avgWeight.csv'
df_avgweight.toPandas().to_csv(path, header=False, index=False, encoding='utf-8')

path = r'../models/ads/domTopics'
df_docdomtopic.write.csv(path, mode='overwrite')

df_docdomtopic.unpersist()

DataFrame[id: int, region: string, year: string, domtopic: string, weight: double, words: string]

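**Merge the part files into a single CSV file:**

In [67]:
%%bash -s "$path"

# concatenate Spark's part files into one CSV beside the output directory,
# and remove the directory only if that succeeds
cat "$1"/*.csv > "$1/../domTopics.csv" && rm -rf "$1"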

----

In [68]:
sc.stop()
gc.collect()

144