 <img src="uva_seal.png"> 

### Topic Modeling in Spark

### University of Virginia
### DS 7200: Distributed Computing
### Last Updated: August 20, 2023

Source: https://medium.com/analytics-vidhya/distributed-topic-modelling-using-spark-nlp-and-spark-mllib-lda-6db3f06a4da3

---

#### Instructions

In this notebook, we will walk through a topic modeling task.  
The purpose of this exercise is to:

- Get you comfortable working with a repo and some external code
- Check your understanding of Spark and show you how much you've learned about it

You do not need to know topic modeling to complete the exercise, but you can read about it [here](http://www.cs.columbia.edu/~blei/papers/Blei2011.pdf).

#### Getting the data

The source article (above) provides a link to the dataset.  
In this folder, the file `topic_modeling_repo_setup.png` shows the terminal steps for getting the data from the repo.  

#### Install the Spark NLP package

In [3]:
# This version will be compatible with Spark 3.3
! pip install spark-nlp==4.3.2

Defaulting to user installation because normal site-packages is not writeable


In [4]:
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.pretrained import PretrainedPipeline
import sparknlp
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline

In [5]:
spark = SparkSession.builder \
    .appName("Spark NLP") \
    .config("spark.driver.memory","8G") \
    .config("spark.driver.maxResultSize", "2G") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.2") \
    .config("spark.kryoserializer.buffer.max", "1000M") \
    .getOrCreate()


#### Read in the data

In [6]:
path_to_data = '/sfs/qumulo/qhome/apt4c/topic_modeling/data/abcnews-date-text.csv'
file_type = "csv"
# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","

df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(path_to_data)
# Verify the count
df.count()

                                                                                

1041793

#### Show the first few records

In [7]:
df.show(5, truncate=False)

+------------+--------------------------------------------------+
|publish_date|headline_text                                     |
+------------+--------------------------------------------------+
|20030219    |aba decides against community broadcasting licence|
|20030219    |act fire witnesses must be aware of defamation    |
|20030219    |a g calls for infrastructure protection summit    |
|20030219    |air nz staff in aust strike for pay rise          |
|20030219    |air nz strike to affect australian travellers     |
+------------+--------------------------------------------------+
only showing top 5 rows
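
The reader options used when loading the file (`header`, `sep`, `inferSchema`) can be illustrated on a tiny in-memory sample. The sketch below uses Python's standard `csv` module rather than Spark, and the manual integer cast is only a stand-in for schema inference (an assumption for illustration, not Spark's actual algorithm). The two records are copied from the preview above.

```python
import csv
import io

# Two records copied from the dataset preview; the first row is the header.
sample = (
    "publish_date,headline_text\n"
    "20030219,aba decides against community broadcasting licence\n"
    "20030219,act fire witnesses must be aware of defamation\n"
)

# header="true": the first row supplies the column names.
# sep=",": fields are split on commas.
rows = list(csv.DictReader(io.StringIO(sample), delimiter=","))

# inferSchema="true": Spark scans the data to choose a type per column.
# Here the cast is done by hand, purely as an illustration.
for row in rows:
    row["publish_date"] = int(row["publish_date"])

print(rows[0])
```

Without `inferSchema`, Spark reads every CSV column as a string, which is analogous to skipping the cast in the loop above.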



#### Preprocessing

In [8]:
# Spark NLP annotators operate on a document column, so convert the raw text column first.
document_assembler = DocumentAssembler() \
    .setInputCol("headline_text") \
    .setOutputCol("document") \
    .setCleanupMode("shrink")

# Split each document into an array of tokens
tokenizer = Tokenizer() \
  .setInputCols(["document"]) \
  .setOutputCol("token")

# Clean unwanted characters
normalizer = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normalized")

# Remove stop words
stopwords_cleaner = StopWordsCleaner() \
    .setInputCols(["normalized"]) \
    .setOutputCol("cleanTokens") \
    .setCaseSensitive(False)

# Stem the words to bring them to the root form.
stemmer = Stemmer() \
    .setInputCols(["cleanTokens"]) \
    .setOutputCol("stem")

# Finisher converts Spark NLP annotations back into plain Spark columns.
# Spark NLP wraps each row in its own annotation structure when the text is converted to a document.
# Finisher restores the structure that later stages expect, i.e., an array of tokens.
finisher = Finisher() \
    .setInputCols(["stem"]) \
    .setOutputCols(["tokens"]) \
    .setOutputAsArray(True) \
    .setCleanAnnotations(False)

# Build the ML Pipeline. Many of these steps are common in NLP.
nlp_pipeline = Pipeline(
    stages=[document_assembler, 
            tokenizer,
            normalizer,
            stopwords_cleaner, 
            stemmer, 
            finisher])
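
To make the stages above concrete, here is a minimal plain-Python sketch of the first three text steps (tokenize, normalize, remove stop words) applied to one headline. It is not Spark NLP code: the whitespace tokenizer, the letters-only regex, and the tiny `STOP_WORDS` set are simplifying assumptions for illustration, and stemming is left out.

```python
import re

# Hypothetical, tiny stop-word list for illustration only;
# StopWordsCleaner uses a much larger default list.
STOP_WORDS = {"a", "against", "be", "for", "in", "of", "to"}


def preprocess(headline):
    """Tokenize, normalize, and remove stop words from one headline."""
    # Tokenize on whitespace (a simplification of the Tokenizer stage).
    tokens = headline.split()
    # Normalize: keep letters only, then drop tokens that became empty.
    normalized = [re.sub(r"[^A-Za-z]", "", token) for token in tokens]
    normalized = [token for token in normalized if token]
    # Remove stop words, ignoring case (mirrors setCaseSensitive(False)).
    return [token for token in normalized if token.lower() not in STOP_WORDS]


print(preprocess("act fire witnesses must be aware of defamation"))
```

In the real pipeline, the stemmer would then reduce the surviving tokens to their root forms before `Finisher` emits them as a plain array.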

#### Train the pipeline

In [9]:
nlp_model = nlp_pipeline.fit(df)



#### Apply the pipeline to transform dataframe

In [10]:
processed_df = nlp_model.transform(df)

#### Select the columns we need and the first 10K records

In [12]:
tokens_df = processed_df.select('publish_date','tokens').limit(10000)
tokens_df.show()

+------------+--------------------+
|publish_date|              tokens|
+------------+--------------------+
|    20030219|[aba, decid, comm...|
|    20030219|[act, fire, wit, ...|
|    20030219|[g, call, infrast...|
|    20030219|[air, nz, staff, ...|
|    20030219|[air, nz, strike,...|
|    20030219|[ambiti, olsson, ...|
|    20030219|[antic, delight, ...|
|    20030219|[aussi, qualifi, ...|
|    20030219|[aust, address, u...|
|    20030219|[australia, lock,...|
|    20030219|[australia, contr...|
|    20030219|[barca, take, rec...|
|    20030219|[bathhous, plan, ...|
|    20030219|[big, hope, launc...|
|    20030219|[big, plan, boost...|
|    20030219|[blizzard, buri, ...|
|    20030219|[brigadi, dismiss...|
|    20030219|[british, combat,...|
|    20030219|[bryant, lead, la...|
|    20030219|[bushfir, victim,...|
+------------+--------------------+
only showing top 20 rows



                                                                                

#### Build a vocabulary of 500 tokens and create features

In [13]:
from pyspark.ml.feature import CountVectorizer

cv = CountVectorizer(inputCol="tokens", outputCol="features", vocabSize=500, minDF=3.0)

# train the model
cv_model = cv.fit(tokens_df)

# transform the data. Output column name will be features.
vectorized_tokens = cv_model.transform(tokens_df)

                                                                                

#### Look at some records

In [12]:
vectorized_tokens.show(5, False)

+------------+---------------------------------------------+---------------------------------------------------------------+
|publish_date|tokens                                       |features                                                       |
+------------+---------------------------------------------+---------------------------------------------------------------+
|20030219    |[aba, decid, commun, broadcast, licenc]      |(500,[118,498],[1.0,1.0])                                      |
|20030219    |[act, fire, wit, must, awar, defam]          |(500,[12,116,389],[1.0,1.0,1.0])                               |
|20030219    |[g, call, infrastructur, protect, summit]    |(500,[14,444],[1.0,1.0])                                       |
|20030219    |[air, nz, staff, aust, strike, pai, rise]    |(500,[59,61,112,117,152,292,475],[1.0,1.0,1.0,1.0,1.0,1.0,1.0])|
|20030219    |[air, nz, strike, affect, australian, travel]|(500,[61,93,112,292],[1.0,1.0,1.0,1.0])                        |
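
Each entry in the `features` column is a sparse vector printed as `(size, [indices], [values])`: the vocabulary size, the vocabulary positions of the tokens found in the headline, and their counts. Tokens that did not make it into the 500-term vocabulary are dropped, which is why the first headline has five tokens but only two indices. The sketch below imitates this in plain Python on a toy corpus. The helper names `fit_vocabulary` and `to_sparse` are hypothetical, and the alphabetical tie-breaking is an assumption added for determinism.

```python
from collections import Counter


def fit_vocabulary(docs, vocab_size, min_df):
    """Pick the most frequent terms that appear in at least min_df documents."""
    term_freq = Counter(token for doc in docs for token in doc)
    doc_freq = Counter(token for doc in docs for token in set(doc))
    eligible = [term for term in term_freq if doc_freq[term] >= min_df]
    # Most frequent first; ties broken alphabetically (assumption).
    eligible.sort(key=lambda term: (-term_freq[term], term))
    return eligible[:vocab_size]


def to_sparse(doc, vocab):
    """Encode one document as (size, indices, values), ignoring unknown tokens."""
    index = {term: position for position, term in enumerate(vocab)}
    counts = Counter(index[token] for token in doc if token in index)
    indices = sorted(counts)
    return (len(vocab), indices, [float(counts[i]) for i in indices])


docs = [
    ["air", "nz", "staff", "strike"],
    ["air", "nz", "strike", "travel"],
    ["nz", "strike", "plan"],
]
vocab = fit_vocabulary(docs, vocab_size=3, min_df=2)
print(vocab)
print(to_sparse(docs[2], vocab))
```

In the last document, `plan` occurs in only one document, so it falls below `min_df` and is absent from the encoded vector.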


                                                                                

#### Build Latent Dirichlet Allocation (LDA) Model

In [17]:
from pyspark.ml.clustering import LDA

num_topics = 3
lda = LDA(k=num_topics, maxIter=10)

model = lda.fit(vectorized_tokens)
ll = model.logLikelihood(vectorized_tokens)
lp = model.logPerplexity(vectorized_tokens)

print("The lower bound on the log likelihood of the entire corpus: " + str(ll))
print("The upper bound on perplexity: " + str(lp))

                                                                                

The lower bound on the log likelihood of the entire corpus: -178739.95728641868
The upper bound on perplexity: 6.299871608854458
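
The two numbers above are closely related. A quick sanity check, assuming that `logPerplexity` is the negated log-likelihood bound divided by the number of tokens counted in `features`, and that natural logarithms are used (both are assumptions about the implementation, not something the notebook states):

```python
import math

# Values printed by the LDA cell above.
log_likelihood = -178739.95728641868
log_perplexity = 6.299871608854458

# Assumption: log_perplexity = -log_likelihood / token_count,
# so the token count can be recovered by division.
implied_token_count = -log_likelihood / log_perplexity

# Assumption: natural log, so exponentiating gives a per-token perplexity.
perplexity = math.exp(log_perplexity)

print(round(implied_token_count))
print(round(perplexity, 1))
```

Under those assumptions, the implied count of about 28,000 in-vocabulary tokens across 10,000 headlines is consistent with the short feature vectors shown earlier. Lower perplexity is better, so this value is mainly useful for comparing runs with different `k` or `maxIter`.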


                                                                                

#### Visualize the Topics

In [18]:
# Extract the vocabulary learned by CountVectorizer
vocab = cv_model.vocabulary
topics = model.describeTopics()
topics_rdd = topics.rdd

# Map each topic's term indices back to the corresponding words
topics_words = topics_rdd \
    .map(lambda row: row['termIndices']) \
    .map(lambda idx_list: [vocab[idx] for idx in idx_list]) \
    .collect()

for idx, topic in enumerate(topics_words):
    print(f"topic: {idx}")
    print("*" * 25)
    for word in topic:
        print(word)
    print("*" * 25)

topic: 0
*************************
win
man
charg
polic
face
council
world
urg
court
get
*************************
topic: 1
*************************
u
iraq
war
new
iraqi
baghdad
plan
sai
protest
call
*************************
topic: 2
*************************
govt
nsw
set
crash
deni
kill
claim
rain
troop
sai
*************************
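
The words above are stems (for example `polic` and `charg`), because the vocabulary was built after stemming. `describeTopics()` also returns a `termWeights` column alongside `termIndices`, and pairing each word with its weight shows how strongly it is tied to a topic. The sketch below demonstrates the lookup in plain Python. The vocabulary, rows, and weights are hypothetical stand-ins, not values from this model, and `label_topics` is an illustrative helper name.

```python
# Hypothetical stand-ins: a tiny vocabulary and two rows shaped like
# the output of describeTopics() (topic, termIndices, termWeights).
vocab = ["polic", "iraq", "govt", "war", "court"]
topic_rows = [
    {"topic": 0, "termIndices": [0, 4], "termWeights": [0.30, 0.20]},
    {"topic": 1, "termIndices": [1, 3], "termWeights": [0.35, 0.25]},
]


def label_topics(rows, vocabulary):
    """Return {topic id: [(word, weight), ...]} by looking up each term index."""
    labeled = {}
    for row in rows:
        pairs = zip(row["termIndices"], row["termWeights"])
        labeled[row["topic"]] = [(vocabulary[i], weight) for i, weight in pairs]
    return labeled


for topic_id, terms in label_topics(topic_rows, vocab).items():
    print(topic_id, terms)
```

With the real model, the rows could come from `topics.collect()`, since the topics DataFrame has only `k` rows and is small enough to bring to the driver.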
