# LDA topic modelling
This notebook demonstrates topic modelling on the IMDB review dataset with Latent Dirichlet Allocation (LDA) in PySpark, from text cleaning and feature extraction through to assigning each review its dominant topic.

In [1]:
import findspark
findspark.init()
from pyspark import SparkContext
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master('local[*]') \
    .appName('Spark-Example-30-ChatGPT') \
    .getOrCreate()

In [2]:
# There are some warnings, so we suppress them
import warnings
from pyspark.sql import functions as F
warnings.simplefilter(action='ignore', category=FutureWarning)

### Load and preprocess the data

In [3]:
import re
import numpy as np
# Forward slash avoids the invalid "\I" escape sequence and also works on Windows
tFile = "data/IMDB Dataset.csv.bz2"
df = spark.read.csv(tFile,header=True,inferSchema=True )
df.show(3)

+--------------------+---------+
|                text|sentiment|
+--------------------+---------+
|One of the other ...| positive|
|A wonderful littl...| positive|
|I thought this wa...| positive|
+--------------------+---------+
only showing top 3 rows



In [4]:
df = df.sample(.2, seed=100)
#df= df.where(F.col("sentiment")=="positive")

In [5]:
df.groupBy("sentiment").count().show()

+---------+-----+
|sentiment|count|
+---------+-----+
| positive| 5067|
| negative| 4993|
+---------+-----+



In [6]:
# Remove HTML tags from text
df = df.withColumn("text_c", F.regexp_replace(F.col("text"), r'<[^>]+>', ""))
# Replace common punctuation with a space so joined words are split apart
df = df.withColumn("text_c", F.regexp_replace("text_c", r"[\.\!\,\-\']", " "))
# Remove everything that is not a letter or a space
df = df.withColumn("text_c", F.regexp_replace("text_c", r"[^a-zA-Z\ ]", ""))
# Replace words of one or two characters with a space
df = df.withColumn("text_c", F.regexp_replace("text_c", r"\b\w{1,2}\b", " "))
df.toPandas().head(5)

Unnamed: 0,text,sentiment,text_c
0,A wonderful little production. <br /><br />The...,positive,wonderful little production The filming tec...
1,"Probably my all-time favorite movie, a story o...",positive,Probably all time favorite movie story ...
2,I sure would like to see a resurrection of a u...,positive,sure would like see resurrection d...
3,"This show was an amazing, fresh & innovative i...",negative,This show was amazing fresh innovative ide...
4,The cast played Shakespeare.<br /><br />Shakes...,negative,The cast played Shakespeare Shakespeare lost ...
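
To see what each cleaning step does to a single review, the same four patterns can be applied with Python's `re` module. This is only a sketch for inspection: the function name `clean_review` and the sample string are made up for illustration, and Spark's `regexp_replace` uses Java regular expressions, which behave the same way for these particular patterns.

```python
import re

def clean_review(text: str) -> str:
    """Apply the four regex steps from the Spark cell to a plain string."""
    text = re.sub(r"<[^>]+>", "", text)           # strip HTML tags such as <br />
    text = re.sub(r"[\.\!\,\-\']", " ", text)     # punctuation -> space ("It's" becomes "It s")
    text = re.sub(r"[^a-zA-Z\ ]", "", text)       # drop anything that is not a letter or a space
    text = re.sub(r"\b\w{1,2}\b", " ", text)      # blank out words of one or two characters
    return text

# Hypothetical sample review, not taken from the dataset
sample = "A wonderful little production. <br /><br />It's an 8/10!"
tokens = clean_review(sample).split()
print(tokens)  # ['wonderful', 'little', 'production']
```

The cleaned string still contains runs of spaces where short words and punctuation used to be, which is why `split()` is used here before comparing tokens.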


### Lemmatization (optional)
Lemmatization is the process of reducing a word to its base or root form, which is also known as a lemma. The purpose of lemmatization is to simplify text and make it easier to analyze by grouping together different forms of the same word. For example, the words "running," "ran," and "runs" can all be reduced to the base form "run" through lemmatization. 

However, lemmatization can be a **time-consuming operation**, especially when dealing with large amounts of text data. This is because the process involves analyzing each word in a text and identifying its base form. It also requires a comprehensive understanding of the grammatical rules of a language to accurately identify the correct lemma for each word.

Despite its time-consuming nature, lemmatization can be a powerful tool in natural language processing and text analysis. It can help with tasks such as sentiment analysis, topic modeling, and text classification. When using lemmatization, it's important to use it carefully and correctly to ensure that the text is properly processed and analyzed.
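
As a rough illustration of the idea only, the sketch below lemmatizes tokens with a hand-written lookup table. The table `LEMMA_TABLE` and the function `lemmatize_tokens` are hypothetical; a real lemmatizer such as the spaCy model in the next cell relies on a full vocabulary and part-of-speech information rather than a fixed dictionary.

```python
# Toy lookup table standing in for a real lemmatizer's vocabulary (illustration only)
LEMMA_TABLE = {"running": "run", "ran": "run", "runs": "run"}

def lemmatize_tokens(tokens):
    """Map each token to its lemma; unknown tokens are only lowercased."""
    return [LEMMA_TABLE.get(tok.lower(), tok.lower()) for tok in tokens]

lemmas = lemmatize_tokens(["Running", "ran", "runs", "movie"])
print(lemmas)  # ['run', 'run', 'run', 'movie']
```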

In [7]:
# import spacy
# from pyspark.sql.functions import udf
# from pyspark.sql.types import StringType


# # Load the spaCy model
# nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

# # Define a function to apply the lemmatizer to a text
# @udf(returnType=StringType())
# def lemmatize_text(text):
#     doc = nlp(text)
#     lemmas = [token.lemma_ for token in doc]
#     print(lemmas)
#     return " ".join(lemmas)

# # Define a UDF to apply the lemmatizer to a column
# # def l(text):
# #     return text
# # lemmatize_udf = udf(l, StringType())

# # Apply the UDF to a DataFrame column
# df0 = df.withColumn("text_c", lemmatize_text("text_c"))
# df0.show(3)

In [8]:
from pyspark.ml.feature import Tokenizer, CountVectorizer,IDF
from pyspark.ml.feature import StopWordsRemover
from pyspark.ml import Pipeline
from pyspark.ml.clustering import LDA

# Preprocessing Pipeline
A preprocessing pipeline for topic modelling with LDA may include the following steps:

- Tokenization: This step breaks the text into individual words or tokens. PySpark's `Tokenizer`, used below, lowercases the text and splits it on whitespace, so consecutive spaces left by the cleaning step produce the empty-string tokens visible in the `words` column.
- Stop words removal: This step involves removing commonly used words in a language that do not carry much meaning or contribute to the topic of the text, such as "the", "and", "is". Stop words removal helps to reduce the size of the vocabulary and improves the efficiency of subsequent steps.
- Count vectorizer: This step involves converting the tokenized text into a matrix of word counts. Each row represents a document, and each column represents a unique word in the vocabulary. The values in the matrix represent the frequency of each word in each document.
- IDF (Inverse Document Frequency): This step involves weighting the word counts to account for the frequency of each word in the entire corpus. Words that occur frequently across all documents are given a lower weight, while words that occur rarely are given a higher weight.

By using IDF, the weight of each word is inversely proportional to the number of documents that contain that word. This helps to identify the important words that are specific to a particular document and can help to distinguish between different topics.

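Once these steps are completed, the resulting TF-IDF vectors are passed to the LDA algorithm, which infers the underlying topics in the corpus and a topic distribution for each document. Note that LDA is formally defined over raw word counts, and Spark's documentation describes the expected input as word-count vectors. Feeding it TF-IDF weights, as this notebook does, is a heuristic that down-weights ubiquitous words; the raw counts in `features_c` could be used instead by setting `featuresCol` on the `LDA` estimator.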
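
The counting and weighting steps can be sketched in plain Python on a toy corpus. The three documents and the helper names are invented for illustration. The IDF formula `log((m + 1) / (df + 1))`, with `m` the number of documents and `df` the number of documents containing the term, is the smoothed form documented for Spark's `IDF`. Unlike this sketch, Spark's `CountVectorizer` orders the vocabulary by corpus frequency rather than alphabetically.

```python
import math
from collections import Counter

# Toy corpus of already tokenized, stop-word-filtered documents (hypothetical)
docs = [
    ["great", "story", "great", "cast"],
    ["bad", "story"],
    ["bad", "plot", "bad", "cast"],
]

# Vocabulary in alphabetical order: ['bad', 'cast', 'great', 'plot', 'story']
vocab = sorted({word for doc in docs for word in doc})

def count_vector(doc):
    """Term counts for one document, aligned with vocab."""
    counts = Counter(doc)
    return [counts[word] for word in vocab]

# Document frequency: in how many documents does each term occur?
m = len(docs)
doc_freq = {word: sum(word in doc for doc in docs) for word in vocab}

# Smoothed inverse document frequency
idf = {word: math.log((m + 1) / (doc_freq[word] + 1)) for word in vocab}

def tfidf_vector(doc):
    """Counts scaled by IDF, aligned with vocab."""
    return [count * idf[word] for count, word in zip(count_vector(doc), vocab)]

print(count_vector(docs[0]))   # [0, 1, 2, 0, 1]
print(tfidf_vector(docs[0]))
```

Here "great" occurs in one document and "story" in two, so "great" receives the larger IDF weight.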

In [16]:
# Text preprocessing pipeline
tokenizer = Tokenizer(inputCol="text_c", outputCol="words")
remover = StopWordsRemover(inputCol=tokenizer.getOutputCol(), outputCol="filtered")
# Run 1: keep the 500 most frequent words
# countVectorizer = CountVectorizer(inputCol=remover.getOutputCol(), outputCol="features_c", vocabSize=500)
# Run 2: keep the 1000 most frequent words
# countVectorizer = CountVectorizer(inputCol=remover.getOutputCol(), outputCol="features_c", vocabSize=1000)
# Run 3: also drop words found in fewer than 10 or in more than 5000 documents
countVectorizer = CountVectorizer(inputCol=remover.getOutputCol(), outputCol="features_c", vocabSize=1000, minDF=10, maxDF=5000)

idf = IDF(inputCol=countVectorizer.getOutputCol(), outputCol="features")
pipeline = Pipeline(stages=[tokenizer,remover, countVectorizer,idf])
data_model = pipeline.fit(df)

In [17]:
# Print the vocabulary
vocabulary = data_model.stages[2].vocabulary
print(vocabulary[:100])

['like', 'good', 'even', 'time', 'story', 'see', 'really', 'well', 'much', 'people', 'get', 'also', 'bad', 'great', 'first', 'made', 'make', 'movies', 'way', 'characters', 'think', 'character', 'watch', 'films', 'two', 'seen', 'life', 'best', 'many', 'show', 'never', 'love', 'plot', 'acting', 'know', 'little', 'ever', 'better', 'still', 'man', 'end', 'say', 'scenes', 'scene', 'back', 'something', 'real', 'thing', 'didn', 'funny', 'actors', 'watching', 'years', 'another', 'work', 'doesn', 'though', 'look', 'old', 'going', 'director', 'nothing', 'actually', 'find', 'makes', 'every', 'new', 'lot', 'part', 'world', 'seems', 'pretty', 'things', 'want', 'however', 'young', 'enough', 'fact', 'cast', 'around', 'quite', 'big', 'horror', 'long', 'take', 'got', 'may', 'without', 'give', 'music', 'action', 'comedy', 'series', 'saw', 'almost', 'role', 'right', 'always', 'must', 'gets']


In [18]:
# Transform the dataset using the preprocessing pipeline
dataset = data_model.transform(df)
dataset.toPandas().tail(5)

Unnamed: 0,text,sentiment,text_c,words,filtered,features_c,features
10055,"As someone who loves baseball history, especia...",negative,someone who loves baseball history especial...,"[, , someone, who, loves, baseball, history, ,...","[, , someone, loves, baseball, history, , espe...","(1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, ...","(0.7619554807841183, 0.0, 1.0810748933853436, ..."
10056,<br /><br />Headlines warn us of the current c...,positive,Headlines warn the current campaign demo...,"[headlines, warn, , , , , the, current, campai...","[headlines, warn, , , , , current, campaign, ,...","(0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 2.2319139025253767, ..."
10057,"Dog Bite Dog isn't going to be for everyone, b...",positive,Dog Bite Dog isn going for everyone but...,"[dog, bite, dog, isn, , , going, , , , , for, ...","[dog, bite, dog, isn, , , going, , , , , every...","(0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","(0.0, 0.9678927174918324, 0.0, 0.0, 1.22044136..."
10058,This is your typical junk comedy.<br /><br />T...,negative,This your typical junk comedy There are almo...,"[this, , , your, typical, junk, comedy, there,...","[, , typical, junk, comedy, almost, , , laughs...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
10059,I thought this movie did a down right good job...,positive,thought this movie did down right good job...,"[, , thought, this, movie, did, , , down, righ...","[, , thought, movie, , , right, good, job, , ,...","(2.0, 1.0, 1.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, ...","(1.5239109615682367, 0.9678927174918324, 1.081..."


# LDA
LDA (Latent Dirichlet Allocation) is a topic modeling technique that is used to identify the underlying topics within a large collection of text documents. It is a probabilistic model that assumes each document is a mixture of several topics and each topic is a mixture of several words.
- One common way to fit LDA (collapsed Gibbs sampling) starts by randomly assigning each word in each document to a topic. It then iteratively reassigns words to topics based on how probable the word is under each topic and how prevalent each topic is in the document. Spark's `LDA` estimator uses variational inference instead (the `online` optimizer by default, or `em`), but the intuition is the same.
- The output of the LDA algorithm is a set of topics, each represented by a distribution over words, and a set of document-topic distributions, which represent the degree to which each document belongs to each topic.
- LDA can be used for a wide range of applications, such as content analysis, information retrieval, and recommendation systems. It has been widely applied in fields such as social media analysis, e-commerce, and market research.
- One of the main advantages of LDA is that it can automatically identify the topics present in a corpus without any prior knowledge of the topics. This makes it a useful tool for analyzing large and complex datasets.
- However, LDA has some limitations. It may not work well with short documents, and it may require a large amount of data to accurately estimate the topic distributions. Additionally, the interpretation of the resulting topics may require human expertise and domain knowledge.
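
The mixture assumption can be made concrete with a small calculation: the probability of seeing a word in a document is the sum, over topics, of the document's topic weight times that topic's word probability. The numbers below are invented for illustration and do not come from the fitted model.

```python
# Hypothetical toy model: 2 topics over a 3-word vocabulary
vocab = ["horror", "love", "plot"]
topic_word = [
    [0.7, 0.1, 0.2],  # topic 0: P(word | topic 0)
    [0.1, 0.6, 0.3],  # topic 1: P(word | topic 1)
]
doc_topic = [0.25, 0.75]  # P(topic | document) for one document

def word_probability(word_idx):
    """P(word | document) = sum over topics of P(topic | document) * P(word | topic)."""
    return sum(p_topic * topic_word[t][word_idx] for t, p_topic in enumerate(doc_topic))

probs = [word_probability(i) for i in range(len(vocab))]
for word, p in zip(vocab, probs):
    print(f"{word}: {p:.3f}")
```

Because each topic row and the document's topic weights each sum to 1, the resulting word probabilities also sum to 1.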

In [19]:
# Find two topics
lda = LDA(k=2, maxIter=20)
model = lda.fit(dataset)

### Topics Matrix
Inferred topics, where each topic is represented by a distribution over terms. This is a matrix of size $vocabSize$ x $k$, where each column is a topic. No guarantees are given about the ordering of the topics.

Warning: If this model is actually a DistributedLDAModel instance produced by the Expectation-Maximization (“em”) optimizer, then this method could involve collecting a large amount of data to the driver (on the order of $vocabSize$ x $k$).

In [20]:
# Print the shape of the topics matrix (vocabSize x k), then the matrix itself
print(model.topicsMatrix().toArray().shape)
model.topicsMatrix().toArray()

(1000, 2)


array([[1537.03547464, 1148.72668518],
       [1269.86702665, 1293.02715337],
       [1452.0022824 , 1013.17549403],
       ...,
       [ 167.79486067,  234.77503152],
       [ 148.60787517,  188.93108744],
       [ 202.85600699,  169.74665493]])
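
The printed entries are far larger than 1, which suggests the columns hold unnormalized term weights rather than probabilities. Assuming that is the case, dividing each column by its sum gives a per-topic distribution over terms, comparable in scale to the weights that `describeTopics` reports. The sketch below does this on a made-up 3 x 2 matrix; `normalize_columns` is a hypothetical helper, not part of the PySpark API.

```python
def normalize_columns(matrix):
    """Scale each column of a row-major matrix so that it sums to 1."""
    n_topics = len(matrix[0])
    col_sums = [sum(row[k] for row in matrix) for k in range(n_topics)]
    return [[row[k] / col_sums[k] for k in range(n_topics)] for row in matrix]

# Hypothetical weights: 3 terms (rows) x 2 topics (columns)
toy = [
    [6.0, 1.0],
    [3.0, 3.0],
    [1.0, 4.0],
]
normalized = normalize_columns(toy)

# Term indices per topic, sorted by decreasing weight
top_terms = [
    sorted(range(len(toy)), key=lambda i: -normalized[i][k])
    for k in range(len(toy[0]))
]
print(normalized)
print(top_terms)  # [[0, 1, 2], [2, 1, 0]]
```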

### Describe topics
Return the topics described by weighted terms. Each topic is represented as a pair of matching arrays: (term indices, term weights in topic). Each topic’s terms are sorted in order of decreasing weight.

In [21]:

topics = model.describeTopics(5)
print("The topics described by their top-weighted terms:")
topics.toPandas().head(5)

The topics described by their top-weighted terms:


Unnamed: 0,topic,termIndices,termWeights
0,0,"[29, 12, 0, 2, 3]","[0.004340808299827221, 0.004317573201596433, 0..."
1,1,"[4, 31, 13, 1, 19]","[0.004655718693241141, 0.0042696444077906515, ..."


In [22]:
# Print the 30 most important words per topic
topics = model.describeTopics(30)
for row in topics.select("termIndices").collect():
    # Map each term index back to its word in the vocabulary
    print([vocabulary[i] for i in row["termIndices"]])

['show', 'bad', 'like', 'even', 'time', 'really', 'people', 'horror', 'get', 'ever', 'plot', 'series', 'good', 'films', 'funny', 'see', 'first', 'made', 'much', 'also', 'think', 'make', 'seen', 'never', 'know', 'thing', 'many', 'look', 'well', 'way']
['story', 'love', 'great', 'good', 'characters', 'character', 'well', 'life', 'see', 'like', 'also', 'much', 'two', 'best', 'really', 'movies', 'scenes', 'way', 'little', 'first', 'even', 'role', 'man', 'time', 'action', 'people', 'made', 'scene', 'films', 'performance']


# Create LDA model with ten topics

In [23]:
# Text preprocessing pipeline
tokenizer = Tokenizer(inputCol="text_c", outputCol="words")
remover = StopWordsRemover(inputCol=tokenizer.getOutputCol(), outputCol="filtered")
# Run 1: keep the 1000 most frequent words
# countVectorizer = CountVectorizer(inputCol=remover.getOutputCol(), outputCol="features_c", vocabSize=1000)
# Run 2: also drop words found in fewer than 10 or in more than 3000 documents
countVectorizer = CountVectorizer(inputCol=remover.getOutputCol(), outputCol="features_c", vocabSize=1000, minDF=10, maxDF=3000)

idf = IDF(inputCol=countVectorizer.getOutputCol(), outputCol="features")
pipeline = Pipeline(stages=[tokenizer,remover, countVectorizer,idf])
data_model = pipeline.fit(df)

In [24]:
vocabulary = data_model.stages[2].vocabulary
print(vocabulary[:100])

['story', 'much', 'people', 'get', 'also', 'bad', 'great', 'first', 'made', 'make', 'movies', 'way', 'characters', 'think', 'character', 'watch', 'films', 'two', 'seen', 'life', 'best', 'many', 'show', 'never', 'love', 'plot', 'acting', 'know', 'little', 'ever', 'better', 'still', 'man', 'end', 'say', 'scenes', 'scene', 'back', 'something', 'real', 'thing', 'didn', 'funny', 'actors', 'watching', 'years', 'another', 'work', 'doesn', 'look', 'though', 'old', 'director', 'going', 'nothing', 'actually', 'find', 'makes', 'every', 'new', 'lot', 'part', 'world', 'seems', 'pretty', 'things', 'want', 'young', 'however', 'enough', 'fact', 'cast', 'around', 'quite', 'big', 'horror', 'long', 'take', 'got', 'may', 'without', 'give', 'music', 'action', 'comedy', 'series', 'saw', 'almost', 'role', 'right', 'always', 'must', 'gets', 'interesting', 'times', 'thought', 'least', 'done', 'guy', 'far']


In [25]:
dataset = data_model.transform(df)
dataset.toPandas().head(5)

Unnamed: 0,text,sentiment,text_c,words,filtered,features_c,features
0,A wonderful little production. <br /><br />The...,positive,wonderful little production The filming tec...,"[, , wonderful, little, production, , the, fil...","[, , wonderful, little, production, , filming,...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.3659286614209..."
1,"Probably my all-time favorite movie, a story o...",positive,Probably all time favorite movie story ...,"[probably, , , all, time, favorite, movie, , ,...","[probably, , , time, favorite, movie, , , , st...","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.2204413675282908, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
2,I sure would like to see a resurrection of a u...,positive,sure would like see resurrection d...,"[, , sure, would, like, , , see, , , resurrect...","[, , sure, like, , , see, , , resurrection, , ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
3,"This show was an amazing, fresh & innovative i...",negative,This show was amazing fresh innovative ide...,"[this, show, was, , , amazing, , fresh, , inno...","[show, , , amazing, , fresh, , innovative, ide...","(0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 0.0, 2.0, 1.0, ...","(0.0, 0.0, 0.0, 0.0, 1.358554282960248, 2.9161..."
4,The cast played Shakespeare.<br /><br />Shakes...,negative,The cast played Shakespeare Shakespeare lost ...,"[the, cast, played, shakespeare, shakespeare, ...","[cast, played, shakespeare, shakespeare, lost,...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."


In [26]:
# Find ten topics
lda = LDA(k=10, maxIter=20)
model = lda.fit(dataset)

In [27]:
# Describe topics
topics = model.describeTopics(5)
print("The topics described by their top-weighted terms:")
topics.toPandas()

The topics described by their top-weighted terms:


Unnamed: 0,topic,termIndices,termWeights
0,0,"[140, 22, 84, 878, 425]","[0.00674244514691243, 0.006721268599666443, 0...."
1,1,"[85, 178, 443, 244, 457]","[0.016008637227919075, 0.015226828334273294, 0..."
2,2,"[462, 160, 206, 353, 71]","[0.005490235910419175, 0.00544727303736298, 0...."
3,3,"[5, 75, 64, 344, 196]","[0.006993053402361597, 0.006285541918551131, 0..."
4,4,"[355, 136, 29, 891, 2]","[0.008581149872091078, 0.007553194568265527, 0..."
5,5,"[502, 203, 122, 519, 223]","[0.01025525005376355, 0.008827180144495544, 0...."
6,6,"[300, 75, 10, 16, 841]","[0.01305622316428585, 0.008668892257177702, 0...."
7,7,"[12, 83, 0, 894, 14]","[0.006017541514811685, 0.0054426614986797925, ..."
8,8,"[260, 217, 22, 24, 350]","[0.007765881819154282, 0.00773736657412204, 0...."
9,9,"[192, 382, 346, 256, 648]","[0.009800991265009261, 0.008754473333628537, 0..."


In [29]:
# Print the 10 most important words per topic
topics = model.describeTopics(10)
for row in topics.select("termIndices").collect():
    # Map each term index back to its word in the vocabulary
    print([vocabulary[i] for i in row["termIndices"]])

['book', 'show', 'comedy', 'christmas', 'jokes', 'bad', 'laughs', 'laugh', 'funny', 'get']
['series', 'war', 'documentary', 'episode', 'episodes', 'show', 'world', 'western', 'season', 'joe']
['novel', 'american', 'version', 'town', 'cast', 'lee', 'wife', 'role', 'performances', 'modern']
['bad', 'horror', 'pretty', 'killer', 'budget', 'guy', 'first', 'didn', 'minutes', 'low']
['god', 'worst', 'ever', 'sick', 'people', 'bad', 'tom', 'write', 'read', 'think']
['jack', 'men', 'woman', 'room', 'women', 'young', 'story', 'dark', 'work', 'two']
['game', 'horror', 'movies', 'films', 'party', 'thriller', 'dance', 'dog', 'scene', 'director']
['characters', 'action', 'story', 'vampire', 'character', 'fight', 'audience', 'much', 'plot', 'believable']
['kids', 'school', 'show', 'love', 'michael', 'home', 'happy', 'get', 'family', 'water']
['father', 'art', 'son', 'human', 'king', 'powerful', 'japanese', 'baby', 'space', 'david']


# Topic classification with LDA
Although LDA is an unsupervised method, its output can support classification-style tasks: each document receives a distribution over topics, so it can be assigned to its dominant topic, or the distribution can serve as a feature vector for a downstream classifier.

Advantages of using LDA for classification:
- Identifies latent topics: LDA can automatically identify the underlying topics in a corpus of text. This can be useful for discovering hidden themes and patterns in the data.
- Unsupervised learning: LDA is an unsupervised learning technique, which means that it does not require labeled data to identify the topics. This can save time and effort in preparing labeled data for training a classifier.
- Handles high-dimensional data: LDA can handle high-dimensional data, such as text documents, by reducing the dimensionality of the data to a smaller set of topics.
- Flexibility: LDA is a flexible technique that can be adapted to different types of text data and modeling assumptions.

Drawbacks of using LDA for classification:
- Requires large data sets: LDA can require large data sets to accurately estimate the topic distributions. If the data set is too small, the resulting topic distributions may not be representative of the underlying data.
- Sensitivity to model parameters: LDA is sensitive to the number of topics and other model parameters. Choosing the optimal number of topics can be challenging and requires some trial and error.
- Limited interpretability: While LDA can identify latent topics, the resulting topics may not always be easily interpretable. It may require domain expertise to interpret the topics and assign meaningful labels to them.

In [30]:
# Add the per-document topic distribution and show the first rows
transformed = model.transform(dataset)
transformed.select("text_c","topicDistribution").toPandas().head(5)

Unnamed: 0,text_c,topicDistribution
0,wonderful little production The filming tec...,"[0.19214310682356103, 0.0008841305848342852, 0..."
1,Probably all time favorite movie story ...,"[0.0012333585687258593, 0.0011909018499694423,..."
2,sure would like see resurrection d...,"[0.0010322233310158424, 0.47717268359743487, 0..."
3,This show was amazing fresh innovative ide...,"[0.15415986303207854, 0.45121949918959053, 0.1..."
4,The cast played Shakespeare Shakespeare lost ...,"[0.0011227766163164547, 0.0010840717347798765,..."


In [31]:
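from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# Return the index of the most probable topic for a document.
# Without an explicit return type, @udf would default to StringType.
@udf(returnType=IntegerType())
def vect_argmax(row):
    return int(np.argmax(row.toArray()))

transformed1 = transformed.withColumn("argmax", vect_argmax(F.col("topicDistribution")))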

In [32]:
transformed1.select("text_c","argmax").toPandas().head(5)

Unnamed: 0,text_c,argmax
0,wonderful little production The filming tec...,2
1,Probably all time favorite movie story ...,8
2,sure would like see resurrection d...,1
3,This show was amazing fresh innovative ide...,1
4,The cast played Shakespeare Shakespeare lost ...,4
