# NLP Preprocessing and Machine Learning Model Construction

Here we use the cleaned data from the Data Wrangling step to build and train our machine learning models, so that the trained model can make the desired predictions when it is deployed on a new dataset.

# Initializing a Spark instance from a JupyterLab environment and creating a Spark session:

In [1]:
%%time

import findspark
findspark.init()
import pyspark
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession
from pyspark import StorageLevel
from pyspark.sql import functions as f
from pyspark.sql.functions import col
from pyspark.sql.functions import udf
from pyspark.sql import DataFrame
import re

spark = SparkSession \
        .builder \
        .appName("ML Model Building") \
        .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.12:3.0.1") \
        .config("spark.driver.memory","4g") \
        .config("spark.executor.memory","5g") \
        .master("local") \
        .getOrCreate()


CPU times: total: 15.6 ms
Wall time: 3.94 s


# Importing the cleaned dataset from MongoDB for NLP preprocessing

In [26]:
%%time

mongo_ip = "mongodb://localhost:27017/streaming."

cleaned_df = spark.read.format("com.mongodb.spark.sql.DefaultSource").option("uri", mongo_ip + "df_cleaned").load()

cleaned_df.createOrReplaceTempView("df_cleaned")

cleaned_df = spark.sql("SELECT polarity,text FROM df_cleaned;")

cleaned_df.show(5)

+--------+--------------------+
|polarity|                text|
+--------+--------------------+
|     0.0|Hello everyone I'...|
|     0.0|is not very excit...|
|     0.0|Apparently my par...|
|     0.0|Oh really It is t...|
|     0.0|up at AM on a Sat...|
+--------+--------------------+
only showing top 5 rows

CPU times: total: 0 ns
Wall time: 2.2 s


# Verifying the schema of the dataframe.

In [27]:
cleaned_df.printSchema()

root
 |-- polarity: double (nullable = true)
 |-- text: string (nullable = true)



# Splitting our data randomly into training, validation and test datasets:

In [28]:
(training_data, validation_data, test_data) = cleaned_df.randomSplit([0.80, 0.10, 0.10], seed=5777)

In [29]:
training_data.count() # checking the number of records in the training dataset.

1275891
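
The split sizes produced by `randomSplit` are approximate: each row is assigned at random, so the 80/10/10 weights are expected proportions rather than exact counts. The pure-Python sketch below mimics that behaviour on a small list. The helper `random_split` is hypothetical and only illustrates the idea; it is not Spark's implementation.

```python
import random
from itertools import accumulate

def random_split(rows, weights, seed):
    """Assign each row to one split by drawing a uniform number in [0, 1)."""
    total = sum(weights)
    # Normalise the weights and turn them into cumulative upper bounds.
    bounds = list(accumulate(w / total for w in weights))
    bounds[-1] = 1.0  # guard against floating-point drift
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        # The row goes to the first split whose upper bound exceeds the draw.
        for index, upper in enumerate(bounds):
            if draw < upper:
                splits[index].append(row)
                break
    return splits

rows = list(range(10_000))
train, validation, test = random_split(rows, [0.80, 0.10, 0.10], seed=5777)
print(len(train), len(validation), len(test))
```

Every row lands in exactly one split, and fixing the seed makes the assignment reproducible, which is why a seed is passed to `randomSplit` above.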

# NLP Preprocessing

NLP (Natural Language Processing) preprocessing is the series of steps applied to raw natural language text to prepare it for computational analysis. The goal is to convert unstructured text into a structured, numerical format that machine learning and other statistical models can work with.

The NLP preprocessing techniques used are:

- Tokenization: Breaking the text into individual words, phrases, or sentences.
- Stopword removal: Removing common words that don't add much meaning to the text, such as "the," "and," and "a."
- Hashing term frequencies (HTF): Converting each document's tokens into a fixed-length numerical feature vector by hashing every term to an index and counting its occurrences.
- Inverse Document Frequency (IDF): Weighting each term by how rare it is across the collection of documents, so that terms appearing in many documents contribute less.
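
The first three steps can be sketched in plain Python to show what each one does to a sentence. This is a minimal illustration under stated assumptions, not what Spark runs internally: the stop-word set is a tiny hand-picked subset, `NUM_FEATURES` is deliberately small, and `zlib.crc32` stands in for the MurmurHash3 function that Spark's `HashingTF` uses.

```python
import zlib
from collections import Counter

NUM_FEATURES = 16  # tiny on purpose; Spark's HashingTF defaults to 262144
STOP_WORDS = {"the", "and", "a", "is", "it"}  # illustrative subset only

def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word set."""
    return [t for t in tokens if t not in STOP_WORDS]

def hashing_tf(tokens, num_features=NUM_FEATURES):
    """Map each token to a bucket index and count occurrences per bucket."""
    counts = Counter()
    for token in tokens:
        # crc32 is a stand-in hash (assumption); Spark uses MurmurHash3.
        bucket = zlib.crc32(token.encode("utf-8")) % num_features
        counts[bucket] += 1
    return dict(counts)

tokens = tokenize("The movie is great and the cast is great")
filtered = remove_stop_words(tokens)
vector = hashing_tf(filtered)
print(filtered)
print(vector)
```

Because terms are hashed into a fixed number of buckets, two different words can share an index. A larger feature space makes such collisions rarer at the cost of wider vectors.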


In [30]:
%%time

from pyspark.ml.feature import (
    StopWordsRemover,
    Tokenizer,
    HashingTF,
    IDF,
)

from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline

# Tokenizer converts the input string to lowercase and splits it on whitespace. The output is an array of tokens (words).

tokenizer = Tokenizer(inputCol="text", outputCol="tokenized_words")

# Remove English stop words, which carry little information for model training.

stopwords_remover = StopWordsRemover(
    inputCol="tokenized_words",
    outputCol="stopwords_removed",
    stopWords=StopWordsRemover.loadDefaultStopWords("english")
)

# Hashing term frequencies:

hashing_tf = HashingTF(
    inputCol="stopwords_removed",
    outputCol="term_frequency",
)

# Inverse Document Frequency: 

idf = IDF(
    inputCol="term_frequency",
    outputCol="features",
    minDocFreq=5,
)


CPU times: total: 0 ns
Wall time: 243 ms
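
To make the `minDocFreq` setting concrete, the sketch below computes IDF weights in plain Python using the smoothed formula given in Spark's MLlib documentation, `log((m + 1) / (df + 1))`, where `m` is the number of documents and `df` is the number of documents containing the term. Terms that appear in fewer than `min_doc_freq` documents receive a weight of 0. The function `idf_weights` and the four toy documents are hypothetical and exist only for illustration.

```python
import math

def idf_weights(doc_term_sets, min_doc_freq=0):
    """Compute smoothed IDF per term; terms below min_doc_freq get 0.0."""
    num_docs = len(doc_term_sets)
    doc_freq = {}
    for terms in doc_term_sets:
        for term in terms:
            doc_freq[term] = doc_freq.get(term, 0) + 1
    weights = {}
    for term, df in doc_freq.items():
        if df < min_doc_freq:
            weights[term] = 0.0  # too rare: effectively filtered out
        else:
            weights[term] = math.log((num_docs + 1) / (df + 1))
    return weights

# Toy corpus: each document is represented by its set of distinct terms.
docs = [
    {"great", "movie"},
    {"great", "cast"},
    {"great", "plot"},
    {"boring", "plot"},
]
weights = idf_weights(docs, min_doc_freq=2)
for term in sorted(weights):
    print(term, round(weights[term], 4))
```

In this toy corpus "great" occurs in three of four documents and so gets a lower weight than "plot", which occurs in two. With `minDocFreq=5` as configured above, any term seen in fewer than five documents contributes nothing to the `features` column.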


# Why we chose the logistic regression model

Logistic regression has the following advantages:

- Can handle sparse data
- Fast to train
- Weights can be interpreted
- Positive weights correspond to words associated with positive sentiment
- Negative weights correspond to words associated with negative sentiment
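
The interpretability point can be illustrated with a small plain-Python sketch of how a binary logistic regression turns weights into a probability. The weights below are made up by hand for illustration, and `predict_positive_probability` is a hypothetical helper, not part of the pipeline.

```python
import math

def predict_positive_probability(features, weights, bias=0.0):
    """Return sigmoid(w . x + b) for a sparse feature dict."""
    score = bias + sum(weights.get(name, 0.0) * value
                       for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical weights, chosen by hand for illustration only.
weights = {"great": 1.5, "love": 1.2, "boring": -1.8, "awful": -2.0}

positive_review = {"great": 1.0, "love": 1.0}
negative_review = {"boring": 1.0, "awful": 1.0}

print(predict_positive_probability(positive_review, weights))
print(predict_positive_probability(negative_review, weights))
```

Features with positive weights push the probability above 0.5 and features with negative weights push it below. In the pipeline used here the features are hashed bucket indices rather than words, so relating a weight back to a word means hashing that word again, and collisions may blur the mapping.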

# Building a logistic regression pipeline model for sentiment analysis using the NLP preprocessing steps

In [31]:
%%time

lr = LogisticRegression(featuresCol='features',labelCol="polarity")

semantic_analysis_pipeline_lr = Pipeline(
    stages=[tokenizer, stopwords_remover, hashing_tf, idf, lr]
)

semantic_analysis_model_lr = semantic_analysis_pipeline_lr.fit(training_data)

CPU times: total: 0 ns
Wall time: 3min 30s


In [33]:
%%time

trained_df_lr = semantic_analysis_model_lr.transform(training_data)
validation_df_lr = semantic_analysis_model_lr.transform(validation_data)
test_df_lr = semantic_analysis_model_lr.transform(test_data)

trained_df_lr.show(5)
validation_df_lr.show(5)
test_df_lr.show(5)

+--------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+
|polarity|                text|     tokenized_words|   stopwords_removed|      term_frequency|            features|       rawPrediction|         probability|prediction|
+--------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+
|     0.0|' Napa was stunni...|[', napa, was, st...|[', napa, stunnin...|(262144,[1512,431...|(262144,[1512,431...|[9.25919367953240...|[0.92261958057139...|       0.0|
|     0.0|' and to think i ...|[', and, to, thin...|[', think, dedcic...|(262144,[19153,43...|(262144,[19153,43...|[8.32203343255971...|[0.64632398300240...|       0.0|
|     0.0|' audition today ...|[', audition, tod...|[', audition, tod...|(262144,[31536,61...|(262144,[31536,61...|[9.19437425386289...|[0.91279185861405..

# Logistic Regression model Evaluation

In [34]:
%%time

from pyspark.ml.evaluation import MulticlassClassificationEvaluator
evaluator = MulticlassClassificationEvaluator(labelCol="polarity", metricName="accuracy")

accuracy_val_lr = evaluator.evaluate(validation_df_lr)
accuracy_test_lr = evaluator.evaluate(test_df_lr)

print("Validation Data:")
print(f"Accuracy: {accuracy_val_lr*100:.5f}%")
print("Testing Data:")
print(f"Accuracy: {accuracy_test_lr*100:.5f}%")

# TODO: try a parameter grid with cross-validation to improve performance.

Validation Data:
Accuracy: 79.00657%
Testing Data:
Accuracy: 79.00044%
CPU times: total: 0 ns
Wall time: 1min 40s
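
Accuracy, the metric requested from `MulticlassClassificationEvaluator` above, is the fraction of rows whose `prediction` equals the label. The plain-Python sketch below spells out that calculation. The `accuracy` helper and the five example rows are hypothetical and are not taken from the dataset.

```python
def accuracy(labels, predictions):
    """Fraction of rows whose predicted label equals the true label."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty dataset")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

# Hypothetical labels and predictions for illustration only.
labels = [0.0, 0.0, 1.0, 1.0, 1.0]
predictions = [0.0, 1.0, 1.0, 1.0, 0.0]
print(f"Accuracy: {accuracy(labels, predictions) * 100:.5f}%")
```

Accuracy is easiest to interpret when the classes are roughly balanced; on a skewed dataset a model that always predicts the majority class can still score highly.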


In [35]:
%%time
final_logistic_model = semantic_analysis_pipeline_lr.fit(cleaned_df)

CPU times: total: 46.9 ms
Wall time: 4min 26s


In [39]:
%%time
final_logistic_model

CPU times: total: 0 ns
Wall time: 0 ns


PipelineModel_16ec05e33491

In [43]:
final_logistic_model_bkp = final_logistic_model
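
One caveat about the assignment above: in Python, plain assignment binds a second name to the same object rather than creating an independent copy. The sketch below shows the difference with a stand-in dictionary; the `model` dict is a hypothetical placeholder, not a Spark object.

```python
import copy

# Stand-in object for illustration only (not a real PipelineModel).
model = {"stages": ["tokenizer", "lr"], "threshold": 0.5}

alias = model                    # same object under a second name
snapshot = copy.deepcopy(model)  # independent copy

model["threshold"] = 0.7
print(alias["threshold"], snapshot["threshold"])
```

The alias reflects the change while the snapshot does not. For a fitted Spark `PipelineModel`, a durable backup is usually made by persisting the model to storage with its `save` method rather than by assigning it to another variable.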

# Inference:

The logistic regression model reached about 79% accuracy on both the validation dataset (79.00657%) and the test dataset (79.00044%).