# Sentiment analysis model using PySpark
The task is to build a sentiment analysis model in PySpark that classifies movie reviews as positive or negative. The input is a large dataset of movie reviews, each labeled as positive or negative. The goal is a binary classifier that accurately predicts the sentiment of reviews it has not seen before.

The model uses natural language processing (NLP) techniques to preprocess the text: tokenization, stopword removal, count vectorization, and TF-IDF weighting. Several machine learning algorithms (logistic regression, gradient-boosted trees, a linear SVM, and a multilayer perceptron) are then trained as binary classifiers on the resulting features. Each model is evaluated using precision, recall, F1 score, and the confusion matrix.

Finally, we show how a feature selection technique such as chi-square selection can be added to the preprocessing pipeline, and we train a logistic regression model on the selected features.

In [1]:
import findspark
findspark.init()
from pyspark import SparkContext
from pyspark.sql import SparkSession
sc = SparkContext("local[*]")
spark = SparkSession.builder.getOrCreate()

In [2]:
import numpy as np
from pyspark.sql import functions as F

# Data preprocessing

In [27]:
# Use a forward slash: "\I" inside a normal string literal is an invalid escape sequence
tFile = "data/IMDB Dataset.csv.bz2"
df0 = spark.read.csv(tFile,header=True)
df0.show(10)

+--------------------+---------+
|                text|sentiment|
+--------------------+---------+
|One of the other ...| positive|
|A wonderful littl...| positive|
|I thought this wa...| positive|
|Basically there's...| negative|
|Petter Mattei's L...| positive|
|Probably my all-t...| positive|
|I sure would like...| positive|
|This show was an ...| negative|
|Encouraged by the...| negative|
|If you like origi...| positive|
+--------------------+---------+
only showing top 10 rows



In [28]:
# Sample 25% of the data for faster model training (use the full dataset in practice)
df0 = df0.sample(0.25, seed=200)

In [29]:
# Encode the sentiment as a numeric label: positive = 1, negative = 0
df0 = df0.withColumn("label", F.when(F.col("sentiment")=="positive",1).otherwise(0)).cache()

In [30]:
# Remove HTML tags from the text
df0 = df0.withColumn("text_c", F.regexp_replace(F.col("text"), r'<[^>]+>', ""))
# Remove everything except ASCII letters and spaces
df0 = df0.withColumn("text_c", F.regexp_replace("text_c", r"[^a-zA-Z ]", ""))
# Remove words of one or two characters
df0 = df0.withColumn("text_c", F.regexp_replace("text_c", r"\b\w{1,2}\b", ""))
df0.toPandas().tail(5)

Unnamed: 0,text,sentiment,label,text_c
12643,"To be hones, I used to like this show and watc...",negative,0,hones used like this show and watch regul...
12644,This movie is a disgrace to the Major League F...,negative,0,This movie disgrace the Major League Franch...
12645,John Garfield plays a Marine who is blinded by...,positive,1,John Garfield plays Marine who blinded gre...
12646,"Bad plot, bad dialogue, bad acting, idiotic di...",negative,0,Bad plot bad dialogue bad acting idiotic direc...
12647,I'm going to have to disagree with the previou...,negative,0,going have disagree with the previous comme...
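
The three cleaning steps can be reproduced on a single string with Python's re module, which makes it easier to see what each pattern removes. This is a minimal sketch: clean_review and the sample string are illustrative and not part of the notebook, and Python's regex engine is assumed to behave like Spark's Java-based one for these simple patterns.

```python
import re

def clean_review(text: str) -> str:
    """Apply the same three regex substitutions as the Spark cell above."""
    text = re.sub(r"<[^>]+>", "", text)        # remove HTML tags such as <br />
    text = re.sub(r"[^a-zA-Z ]", "", text)     # keep only ASCII letters and spaces
    text = re.sub(r"\b\w{1,2}\b", "", text)    # drop words of one or two characters
    return text

sample = "A <br />great film, it's 10/10!"
cleaned = clean_review(sample)
print(repr(cleaned))  # ' great film its '
```

Removed words and characters leave their surrounding spaces behind, so the cleaned text can contain leading, trailing, or repeated spaces. Contractions such as "it's" are also merged into a single word.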


### Lemmatization (optional)
Lemmatization is the process of reducing a word to its base or root form, which is also known as a lemma. The purpose of lemmatization is to simplify text and make it easier to analyze by grouping together different forms of the same word. For example, the words "running," "ran," and "runs" can all be reduced to the base form "run" through lemmatization. 

However, lemmatization can be a **time-consuming operation**, especially when dealing with large amounts of text data. This is because the process involves analyzing each word in a text and identifying its base form. It also requires a comprehensive understanding of the grammatical rules of a language to accurately identify the correct lemma for each word.

Despite its cost, lemmatization can be a powerful tool in natural language processing and text analysis, helping with tasks such as sentiment analysis, topic modeling, and text classification. In this notebook it runs as a Python UDF over every review, which is likely to make it the slowest preprocessing step, so it is marked as optional.

In [31]:
import spacy
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Load the spaCy model
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

# Define a function to apply the lemmatizer to a text
def lemmatize_text(text):
    doc = nlp(text)
    lemmas = [token.lemma_ for token in doc]
    return " ".join(lemmas)

# Define a UDF to apply the lemmatizer to a column
lemmatize_udf = udf(lemmatize_text, StringType())

# Apply the UDF to a DataFrame column
df0 = df0.withColumn("text_c", lemmatize_udf(df0["text_c"]))

# Cache the result; otherwise the slow lemmatization UDF is re-executed on every later action
df0 = df0.cache()
df0.toPandas().tail(5)

Unnamed: 0,text,sentiment,label,text_c
12643,"To be hones, I used to like this show and watc...",negative,0,hone use like this show and watch reg...
12644,This movie is a disgrace to the Major League F...,negative,0,this movie disgrace the Major League Fran...
12645,John Garfield plays a Marine who is blinded by...,positive,1,John Garfield play Marine who blind gre...
12646,"Bad plot, bad dialogue, bad acting, idiotic di...",negative,0,bad plot bad dialogue bad act idiotic direct t...
12647,I'm going to have to disagree with the previou...,negative,0,go have disagree with the previous comme...


### Split the data into training and test sets

In [32]:
# Split the data in train and test (80%-20%)
df, test = df0.randomSplit(weights=[0.8,0.2], seed=200)
df.groupBy("label").count().show()

+-----+-----+
|label|count|
+-----+-----+
|    1| 5016|
|    0| 5138|
+-----+-----+



### Assign weights to classes

In [33]:
# Compute the share of each class in the training set
from pyspark.sql import functions as F
p_weight = df.filter('label == 1').count()/ df.count()
n_weight = df.filter('label == 0').count()/ df.count()
print(n_weight, p_weight)

0.5060074847350797 0.4939925152649202


In [38]:
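# Weight each row by the share of the opposite class, so that both classes carry equal total weight
df = df.withColumn("weight", F.when(F.col("label")==1,n_weight).otherwise(p_weight))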
df.show(5)

+--------------------+---------+-----+--------------------+------------------+
|                text|sentiment|label|              text_c|            weight|
+--------------------+---------+-----+--------------------+------------------+
| Så som i himmele...| positive|    1|   som   himmelen...|0.5060074847350797|
| While sporadical...| negative|    0|  while sporadica...|0.4939925152649202|
|'Blue Desert' may...| negative|    0|Blue Desert may h...|0.4939925152649202|
|'Checking Out' is...| positive|    1|check out    extr...|0.5060074847350797|
|'Presque Rien' ('...| positive|    1|presque Rien come...|0.5060074847350797|
+--------------------+---------+-----+--------------------+------------------+
only showing top 5 rows
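
A quick arithmetic check shows why each class is weighted by the share of the other class: the total weight of the two classes becomes equal. The sketch below uses the training counts printed earlier (5016 positive, 5138 negative); the variable names for the totals are illustrative.

```python
# Training-set class counts taken from the groupBy output above
n_pos, n_neg = 5016, 5138
total = n_pos + n_neg

p_weight = n_pos / total   # share of positive examples
n_weight = n_neg / total   # share of negative examples

# Positive rows get n_weight and negative rows get p_weight, as in the cell above
total_pos_weight = n_pos * n_weight
total_neg_weight = n_neg * p_weight

print(f"{p_weight:.4f} {n_weight:.4f}")
print(total_pos_weight, total_neg_weight)
```

Both totals equal n_pos * n_neg / total, so neither class dominates the weighted loss. With classes this close to balanced, the effect of the weights is small.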



In [34]:
from pyspark.ml.feature import HashingTF, IDF, Tokenizer, CountVectorizer
from pyspark.ml.feature import StopWordsRemover
from pyspark.ml.feature import PCA
from pyspark.ml.feature import ChiSqSelector
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.classification import LinearSVC
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.classification import GBTClassifier

from pyspark.ml import Pipeline

# Data transformation

### Preprocessing pipeline
Pipelines in Spark chain multiple processing stages into a single workflow that is fitted and applied as one unit and runs efficiently on a distributed cluster. This simplifies data preparation, because the same fitted stages can be applied consistently to both the training and the test data.

In [35]:
# Tokenize the review text
tokenizer = Tokenizer(inputCol="text_c", outputCol="words")
# Remove stop words
remover = StopWordsRemover(inputCol=tokenizer.getOutputCol(), outputCol="filtered")
# Count term occurrences, keeping the 1000 most frequent terms
countVectorizer = CountVectorizer(inputCol=remover.getOutputCol(), outputCol="rawFeatures", vocabSize=1000)
# Rescale the counts by inverse document frequency (TF-IDF)
idf = IDF(inputCol=countVectorizer.getOutputCol(), outputCol="featuresIDF")
# Create a preprocessing pipeline with 4 stages
pipeline_p = Pipeline(stages=[tokenizer, remover, countVectorizer, idf])

In [36]:
# Learn the data preprocessing model
data_model = pipeline_p.fit(df)

In [39]:
# Transform
transformed_data = data_model.transform(df)
transformed_data.toPandas().head(5)

Unnamed: 0,text,sentiment,label,text_c,weight,words,filtered,rawFeatures,featuresIDF
0,Så som i himmelen .. as above so below.. tha...,positive,1,som himmelen above below that very ...,0.506007,"[, , , som, , , himmelen, , , , , above, , , b...","[, , , som, , , himmelen, , , , , , , special,...","(77.0, 0.0, 4.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0,...","(0.015166436921203202, 0.0, 2.147722756669075,..."
1,While sporadically engrossing (including a fe...,negative,0,while sporadically engross include few eff...,0.493993,"[, , while, sporadically, engross, include, , ...","[, , sporadically, engross, include, , , effec...","(72.0, 0.0, 3.0, 0.0, 0.0, 0.0, 3.0, 0.0, 0.0,...","(0.014181603354891307, 0.0, 1.6107920675018061..."
2,'Blue Desert' may have had the potential to be...,negative,0,Blue Desert may have have the potential eve...,0.493993,"[blue, desert, may, have, have, the, potential...","[blue, desert, may, potential, , , , even, , ,...","(93.0, 2.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0,...","(0.01831790433340127, 0.8910976455736771, 0.53..."
3,'Checking Out' is an extraordinary film that t...,positive,1,check out extraordinary film that tower abo...,0.506007,"[check, out, , , , extraordinary, film, that, ...","[check, , , , extraordinary, film, tower, film...","(40.0, 0.0, 3.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0,...","(0.00787866853049517, 0.0, 1.6107920675018061,..."
4,'Presque Rien' ('Come Undone') is an earlier w...,positive,1,presque Rien come Undone early work the i...,0.506007,"[presque, rien, come, undone, , , , early, wor...","[presque, rien, come, undone, , , , early, wor...","(232.0, 0.0, 6.0, 3.0, 0.0, 2.0, 0.0, 0.0, 0.0...","(0.04569627747687199, 0.0, 3.2215841350036123,..."


In [40]:
# Transform the test data
transformed_test = data_model.transform(test)
transformed_test.toPandas().head(5)

Unnamed: 0,text,sentiment,label,text_c,words,filtered,rawFeatures,featuresIDF
0,'Don't Look In the Basement' is so easy to kno...,positive,1,do not look the Basement easy knock but...,"[do, not, look, , , the, basement, , , , easy,...","[look, , , basement, , , , easy, , , knock, tr...","(63.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0,...","(0.012408902935529893, 0.0, 0.5369306891672687..."
1,"*Flat SPOILERS* <br /><br />Five med students,...",positive,1,flat spoiler five med student Nelson Kiefer Su...,"[flat, spoiler, five, med, student, nelson, ki...","[flat, spoiler, five, med, student, nelson, ki...","(67.0, 2.0, 0.0, 1.0, 2.0, 0.0, 1.0, 0.0, 2.0,...","(0.01319676978857941, 0.8910976455736771, 0.0,..."
2,.... may seem far fetched.... but there really...,negative,0,may seem far fetched but there really be r...,"[, , may, seem, far, fetched, but, there, real...","[, , may, seem, far, fetched, really, , , real...","(69.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0,...","(0.013590703215104168, 0.0, 0.0, 0.0, 0.0, 0.6..."
3,...Our the grandpa's hour.<br /><br />More tha...,positive,1,our the grandpa hourMore than the gangster its...,"[our, the, grandpa, hourmore, than, the, gangs...","[grandpa, hourmore, gangster, , , detailed, de...","(36.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0,...","(0.007090801677445653, 0.0, 0.5369306891672687..."
4,...but I regret having seen it. Since the rati...,negative,0,but regret have see since the rating IMD...,"[but, , , regret, have, see, , , since, the, r...","[, , regret, see, , , since, rating, , , imdb,...","(75.0, 2.0, 2.0, 1.0, 1.0, 0.0, 2.0, 0.0, 0.0,...","(0.014772503494678443, 0.8910976455736771, 1.0..."


In [41]:
# Print the stages of the pipeline
data_model.stages

[Tokenizer_d41190dfd02b,
 StopWordsRemover_42b3525e1430,
 CountVectorizerModel: uid=CountVectorizer_a1a420fd1f66, vocabularySize=1000,
 IDFModel: uid=IDF_fb6e03ab72e7, numDocs=10154, numFeatures=1000]
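
Spark's IDF stage computes the inverse document frequency of a term as log((numDocs + 1) / (docFreq + 1)) and multiplies it by the raw count. The sketch below applies that formula in plain Python. The document frequency of 10152 is an assumption, back-solved from the first featuresIDF value shown above (the notebook does not print document frequencies), and spark_idf is an illustrative helper name.

```python
import math

def spark_idf(num_docs: int, doc_freq: int) -> float:
    """Smoothed inverse document frequency: log((m + 1) / (df + 1))."""
    return math.log((num_docs + 1) / (doc_freq + 1))

num_docs = 10154          # numDocs reported by the fitted IDFModel
assumed_doc_freq = 10152  # assumption: the first vocabulary term occurs in almost every review

raw_count = 77.0          # first rawFeatures entry of the first training row
tfidf = raw_count * spark_idf(num_docs, assumed_doc_freq)
print(tfidf)              # close to the 0.01517 shown in featuresIDF

# A term that appears in every document gets an IDF of log(1) = 0
print(spark_idf(num_docs, num_docs))
```

Terms that occur in nearly every review therefore contribute almost nothing to the TF-IDF vector, however often they appear.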

In [42]:
# Get the vocabulary of the CountVectorizer
data_model.stages[2].vocabulary[:20]

['',
 'movie',
 'film',
 'one',
 'see',
 'make',
 'like',
 'good',
 'get',
 'well',
 'time',
 'character',
 'watch',
 'bad',
 'even',
 'story',
 'really',
 'think',
 'show',
 'scene']
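
The empty string at the top of the vocabulary comes from the cleaning step: removed characters and short words leave runs of spaces, and Spark's Tokenizer splits on every single whitespace character. The plain-Python sketch below mimics that behavior and shows one possible remedy, collapsing whitespace before tokenizing. The helper names are illustrative, and the input string is loosely based on the first training row.

```python
import re

def naive_tokenize(text: str) -> list:
    """Lowercase and split on each single whitespace character."""
    return re.split(r"\s", text.lower())

def normalize_spaces(text: str) -> str:
    """Collapse runs of whitespace into one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()

text = "  som   himmelen  above"
print(naive_tokenize(text))                    # contains empty-string tokens
print(naive_tokenize(normalize_spaces(text)))  # ['som', 'himmelen', 'above']
```

In the Spark pipeline, a comparable effect could probably be achieved by using RegexTokenizer with a \s+ pattern in place of Tokenizer. That swap is a suggestion, not what the notebook does.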

# Metrics for the model

In [43]:
from pyspark.mllib.evaluation import MulticlassMetrics

def m_metrics_l(ml_model, test_data):
    """Print precision, recall, and F1 for the positive class, plus the confusion matrix."""
    predictions = ml_model.transform(test_data).cache()
    # MulticlassMetrics expects an RDD of (prediction, label) pairs, in that order
    predictionAndLabels = predictions.select("prediction", "label").rdd.map(lambda x: (float(x[0]), float(x[1]))).cache()

    # Print some predictions vs labels
    # print(predictionAndLabels.take(10))
    metrics = MulticlassMetrics(predictionAndLabels)

    # Statistics for the positive class (label 1.0)
    precision = metrics.precision(1.0)
    recall = metrics.recall(1.0)
    f1Score = metrics.fMeasure(1.0)
    print(f"Precision = {precision:.4f} Recall = {recall:.4f} F1 Score = {f1Score:.4f}")
    # Rows are actual classes, columns are predicted classes
    print("Confusion matrix \n", metrics.confusionMatrix().toArray().astype(int))
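
For reference, the three scores printed by this function can be derived directly from confusion-matrix counts. The sketch below uses small hypothetical counts (not results from this notebook) and an illustrative helper name.

```python
def binary_scores(tp: int, fp: int, fn: int) -> tuple:
    """Return (precision, recall, F1) for the positive class."""
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives
precision, recall, f1 = binary_scores(tp=8, fp=2, fn=4)
print(f"Precision = {precision:.4f} Recall = {recall:.4f} F1 Score = {f1:.4f}")
```

Precision and recall differ only in the denominator, and F1 is their harmonic mean, so it stays the same when the two are swapped.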

# ML Models
### Logistic Regression

In [44]:
import time
classifier = LogisticRegression(maxIter=10, regParam=0.1, featuresCol = "featuresIDF", weightCol="weight")
start = time.time()
pipeline = Pipeline(stages=[classifier])
print(f"Training started.")
model = pipeline.fit(transformed_data)
print(f"Model created in {time.time()-start:.2f}s.")
m_metrics_l(model,transformed_test)
print(f"Total time {time.time()-start:.2f}s.")

Training started.
Model created in 1.11s.
Precision = 0.8356 Recall = 0.8729 F1 Score = 0.8539
Confusion matrix 
 [[1000  219]
 [ 162 1113]]
Total time 7.30s.


### GBTClassifier

In [45]:
classifier = GBTClassifier(maxIter=10, featuresCol = "featuresIDF", weightCol="weight", maxDepth=5)
pipeline = Pipeline(stages=[classifier])
start = time.time()
print(f"Training started.")
model = pipeline.fit(transformed_data)
print(f"Model created in {time.time()-start:.2f}s.")
m_metrics_l(model,transformed_test)
print(f"Total time {time.time()-start:.2f}s.")

Training started.
Model created in 4.98s.
Precision = 0.6967 Recall = 0.8557 F1 Score = 0.7680
Confusion matrix 
 [[ 744  475]
 [ 184 1091]]
Total time 11.64s.


### LinearSVC

In [46]:
classifier = LinearSVC(maxIter=10, regParam=0.1, featuresCol = "featuresIDF", weightCol="weight")
pipeline = Pipeline(stages=[classifier])
start = time.time()
print(f"Training started.")
model = pipeline.fit(transformed_data)
print(f"Model created in {time.time()-start:.2f}s.")
m_metrics_l(model,transformed_test)
print(f"Total time {time.time()-start:.2f}s.")

Training started.
Model created in 1.21s.
Precision = 0.8317 Recall = 0.8800 F1 Score = 0.8552
Confusion matrix 
 [[ 992  227]
 [ 153 1122]]
Total time 7.71s.


### MultilayerPerceptronClassifier

In [47]:
from pyspark.ml.classification import MultilayerPerceptronClassifier
# Multilayer perceptron with 1000 input features, one hidden layer of 30 nodes, and 2 output classes.
# The input layer size must match the length of the feature vector (here 1000, the CountVectorizer vocabSize).
layers = [1000, 30, 2]

# Create the classifier and set its parameters
classifier = MultilayerPerceptronClassifier(maxIter=10, layers=layers,featuresCol = "featuresIDF", blockSize=128, seed=1234)
pipeline = Pipeline(stages=[classifier])
start = time.time()
print(f"Training started.")
model = pipeline.fit(transformed_data)
print(f"Model created in {time.time()-start:.2f}s.")
m_metrics_l(model,transformed_test)
print(f"Total time {time.time()-start:.2f}s.")

Training started.
Model created in 1.93s.
Precision = 0.8330 Recall = 0.8682 F1 Score = 0.8502
Confusion matrix 
 [[ 997  222]
 [ 168 1107]]
Total time 8.31s.
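
The layers list also determines the size of the network. As a rough sketch, assuming one weight per connection plus one bias per non-input node (the usual fully connected layout), the parameter count for [1000, 30, 2] can be computed as follows. The helper count_parameters is illustrative and not part of Spark.

```python
def count_parameters(layers: list) -> int:
    """Number of weights and biases in a fully connected feed-forward network."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layers, layers[1:]))

layers = [1000, 30, 2]
print(count_parameters(layers))  # 30092
```

Almost all of the parameters sit between the input and the hidden layer, so the vocabulary size largely determines the cost of training this model.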


# Feature selection
Spark provides the following feature selectors:
- VectorSlicer
- RFormula
- ChiSqSelector
- UnivariateFeatureSelector
- VarianceThresholdSelector

We use ChiSqSelector to reduce the number of features from 1000 to 200. Chi-square feature selection identifies the features most relevant to a classification task. It is based on the chi-square test of independence: each feature is tested against the label, and the features with the strongest evidence of dependence are kept.
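
To make the test concrete, the sketch below computes the Pearson chi-square statistic for a single feature from a contingency table of observed counts. The counts are hypothetical (word present or absent versus positive or negative review), and chi_square_statistic is an illustrative helper, not the Spark implementation.

```python
def chi_square_statistic(table: list) -> float:
    """Pearson chi-square statistic for a contingency table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if feature and label were independent
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts. Rows: word present / absent. Columns: positive / negative reviews.
table = [[30, 10],
         [20, 40]]
print(round(chi_square_statistic(table), 4))  # 16.6667
```

A larger statistic means the word's presence is more strongly associated with the label. A word that is spread evenly across both classes scores close to zero and would be among the first to be dropped.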

In [48]:
tokenizer = Tokenizer(inputCol="text_c", outputCol="words",)
remover = StopWordsRemover(inputCol=tokenizer.getOutputCol(), outputCol="filtered")
countVectorizer = CountVectorizer(inputCol=remover.getOutputCol(), outputCol="rawFeatures", vocabSize=1000)
idf = IDF(inputCol=countVectorizer.getOutputCol(), outputCol="featuresIDF")

# Select the top 200 features based on their chi-squared test value
selector = ChiSqSelector(numTopFeatures=200, featuresCol=idf.getOutputCol(), outputCol="features", labelCol="label")
# Create a preprocessing pipeline with 5 stages
pipeline_p = Pipeline(stages=[tokenizer,remover, countVectorizer, idf,selector])
# Learn the data preprocessing model
data_model = pipeline_p.fit(df)

### Print the selected features (words) by ChiSqSelector
To list the words chosen by ChiSqSelector, we can use the selectedFeatures attribute of the fitted selector model. It returns the indices of the selected features, which we map back to the CountVectorizer vocabulary to get the corresponding words.

In [49]:
vocabulary = data_model.stages[2].vocabulary
selected_indexes = data_model.stages[4].selectedFeatures
selected_words = [vocabulary[i] for i in selected_indexes]
print(selected_words)

['movie', 'make', 'like', 'good', 'get', 'watch', 'bad', 'even', 'story', 'great', 'look', 'say', 'also', 'play', 'love', 'thing', 'seem', 'life', 'want', 'plot', 'try', 'year', 'act', 'still', 'something', 'guy', 'performance', 'nothing', 'actually', 'young', 'role', 'become', 'point', 'minute', 'pretty', 'world', 'kill', 'horror', 'mean', 're', 'script', 'whole', 'least', 'may', 'acting', 'always', 'enjoy', 'family', 'live', 'series', 'anything', 'reason', 'effect', 'idea', 'fun', 'especially', 'bring', 'maybe', 'different', 'money', 'someone', 'job', 'true', 'shoot', 'waste', 'recommend', 'instead', 'hour', 'excellent', 'short', 'beautiful', 'else', 'war', 'view', 'half', 'attempt', 'poor', 'suppose', 'classic', 'human', 'stupid', 'rest', 'lack', 'either', 'completely', 'meet', 'wrong', 'dialogue', 'save', 'joke', 'awful', 'perfect', 'definitely', 'flick', 'terrible', 'fine', 'wonder', 'wonderful', 'sit', 'low', 'guess', 'experience', 'spend', 'fail', 'throw', 'win', 'relationship',

### Run the algorithm using the selected features only

In [50]:
# Transform the data
transformed_data = data_model.transform(df)
transformed_test = data_model.transform(test)

### Logistic regression
Selecting only the top 200 features with ChiSqSelector can be expected to cost some performance compared with using all 1000 features, because the information carried by the 800 discarded features is lost. Note that the model below also uses different settings from the earlier logistic regression (maxIter=5, no regularization, no class weights), so the two results are not strictly comparable.

In [51]:
classifier = LogisticRegression(maxIter=5, featuresCol = "features")
start = time.time()
pipeline = Pipeline(stages=[classifier])
print(f"Training started.")
model = pipeline.fit(transformed_data)
print(f"Model created in {time.time()-start:.2f}s.")
m_metrics_l(model,transformed_test)
print(f"Total time {time.time()-start:.2f}s.")

Training started.
Model created in 0.94s.
Precision = 0.8124 Recall = 0.8659 F1 Score = 0.8383
Confusion matrix 
 [[ 964  255]
 [ 171 1104]]
Total time 6.05s.
