# Twitter Sentiment Analysis with the Spark NLP Library

## Data preparation

### Introduction

This notebook was developed using the [all-spark-notebook](https://github.com/jupyter/docker-stacks/tree/master/all-spark-notebook) and should work without modification in that environment.

Earlier this year, I produced a [similar notebook](https://www.zepl.com/viewer/notebooks/bm90ZTovL3JvYm9yYXRpdmUvVHdpdHRlci1TZW50aW1lbnQtRXhhbXBsZS9hNTlmZjFkYTAzY2Y0ZWY0YTg5MWRlNjZkMWFlM2I0My9ub3RlLmpzb24) using [Zeppelin](https://zeppelin.apache.org/). In that project, I used the [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/) library to analyze sentiment of a Twitter stream of tweets about Donald Trump. The CoreNLP library is not specific to Spark but is written in Java so it's trivial to use it from within Spark. However, recent investigation into the current NLP space in the Spark ecosystem unearthed the recently released [NLP library](http://nlp.johnsnowlabs.com/index.html) from [John Snow Labs](http://www.johnsnowlabs.com). The library includes a sentiment analyzer which utilizes a novel approach based on the [research](http://arxiv.org/abs/1305.6143) and [code](https://github.com/vivekn/sentiment) produced by Vivek Narayanan et al. In the paper, the authors claim an 88% accuracy on sentiment prediction of reviews from the Internet Movie Database (IMDb).

As I considered this project, I came across a [data set of tweets about Trump](https://www.kaggle.com/ahsanijaz/trumptweets) which includes 4600 tweets collected on August 17, 2015. The data set stood out because the tweets were manually labeled as positive or negative. Of course, having pre-labeled data has proven tremedously helpful in evaluating a supervised sentiment analyzer as the labor-intensive work of providing classified data has already been performed by the provider (ahsanijaz). 

In comparison to the IMDB data set of 25000 reviews, the tweet data set is much more limited in number and in length of material. Nevertheless, in this notebook the analysis of the tweet data set resulted in an accuracy of 82.6%. As the Vivekn et al approach is based on a Naive Bayes model which lends itself to shorter text according to [prior research](https://nlp.stanford.edu/pubs/sidaw12_simple_sentiment.pdf), it appear to be a good choice for analyzing this data.

One factor that made a significant difference in the performance of the model is the approach taken here of not removing duplicate data from the training set. If duplicate data is removed from the set the accuracy drops to 78%. Duplicate tweets result primarily from the processing approach taken below in which retweets are represented in the data. Furthermore, the common indicator of a retweet ('RT') has been removed from the tweets which results in a modest futher redundancy. It may be the case that the redundancy helps reinforce the general mood of the population and thereby helps shift the weighting in a beneficial way. However, it also seems likely that the approach may cause overfitting of the data as it is more probable that duplicate tweets will appear in both the training and test sets. In this notebook, I have not tested this hypothesis but it certainly would be beneficial in getting better confidence in the resulting model.

### Preparation details

Before I could begin to analyze the data, data wrangling was necessary. The source file (`trumptweets.csv`) appears to have a Windows-1252 encoding and oddly has fewer columns defined than the data it describes. While the data is loaded using the Windows-1252 encoding, encoding artifacts remain that could influence the analsysis of the tweets and may be worth further analysis.

In order to align the data with the defined columns, I add a nonce header value. Furthermore, the Spark CSV parser behaves in a subtlely different way when periods (`.`) exist in the data so I removed these as well. Finally, the data provider used backslashes to escape quotes (`\"`) rather than the more common approach of using double quotes (`""`) so I substituted the former with the later so that the Spark CSV parser would work correctly.

In [1]:
%%bash

if [ ! -f trumptweet_mod.csv ]; then
    echo "Creating copy of trumptweet.csv with modifications"
    iconv -f CP1252 -t UTF-8 trumptweet.csv > trumptweet_mod.csv # Converting to UTF-8
    sed -i 's/"X",/"X","X_copy",/' trumptweet_mod.csv # Add new column to header to align header with data
    sed -i 's/"X.1",/"X_1",/' trumptweet_mod.csv # Spark reader behaves oddly when like dots are in column names
    sed -i 's/\\"/""/g' trumptweet_mod.csv # Use standard quote escapes for CSV rather than backslashes
else
    echo "trumptweet_mod.csv already exists so leaving unmodified"
fi

trumptweet_mod.csv already exists so leaving unmodified


We begin by installing the Spark NLP library into the environment as it is not included in the Jupyter environment noted above.

In [2]:
import os
# Load Spark NLP package as it's not included in all-spark-notebook
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages JohnSnowLabs:spark-nlp:1.2.3 pyspark-shell'

!pip install ftfy



On my local machine, I encountered garbage collection issues processing the data using the default configuration. Increasing the memory for both the executor and the driver were successful in resolving these problems.

In [3]:
import pyspark
from pyspark.sql import SparkSession

conf = pyspark.SparkConf()
# Default settings resulted in GC issues. However, may be able to reduce this if system memory is limited
conf.set("spark.executor.memory", "4g")
conf.set("spark.driver.memory", "4g")

sc = pyspark.SparkContext(conf = conf)
spark = SparkSession(sc)

The data contains linefeeds so the `multiLine` parameter was required to process the entire data set. As is seen below, the data contains a roughly equal number of positive and negative tweets.

In [11]:
# Use modified version of the data
labeled_data = spark.read.csv("trumptweet_mod.csv", header = True, escape = '"', mode = "FAILFAST", multiLine = "true")

# Show breakdown of positive and negative tweets
labeled_data.groupBy('Class').count().show()

+-----+-----+
|Class|count|
+-----+-----+
|    0| 2262|
|    1| 2340|
+-----+-----+



Below I perform the following normalizations on the data :

- Remove `RT`
- Remove user references (eg, `@realDonaldTrump`)
- Remove URLs
- Remove the hashtag character (`#`)
- Remove odd Unicode and unexplained 'tags' (eg, `<ed><U+00A0>`)
- Removed stray leading characters (`.`, `'`, and `"`)
- Removed whitespace and beginning and end of tweet
- Removed any entries that resulted in empty text after the above processing
- Removed all columns except for the normalized text and the classification

In [12]:
# Normalize tweets
from pyspark.sql.functions import regexp_replace
from pyspark.sql.functions import trim
from pyspark.sql.functions import udf
from ftfy import fix_text

fix_text_udf = udf(lambda s: fix_text(s))

rt_regex = r"(?=\s?)(RT:?)(?=\s?)" # Drop RT
user_regex = r"@\S+" # Drop user references
url_regex = r"http[s]?:\S+" # Drop URLs
hashtag_regex = r"#" # Remove hashtag itself but leave tag value
unicode_regex = r"<ed>|<U\+[^>]*>" # Clean up spurious unicode references
space_etc_regex = r"^[\.'\",]\s+|\s{2,}|\n" # Remove extra spaces, linefeeds and leading periods or commas.

# Remove RT, users (@foo), URLs, #s, line feeds, odd unicode 'tags'
#uber_regex =  "|".join([rt_regex, user_regex, url_regex, hashtag_regex, unicode_regex])
uber_regex =  "|".join([rt_regex, url_regex, hashtag_regex, unicode_regex])

labeled_data = labeled_data.withColumn("norm_text", fix_text_udf("text"))
labeled_data = labeled_data.withColumn("norm_text", regexp_replace("norm_text", uber_regex, ""))
labeled_data = labeled_data.withColumn("norm_text", trim(regexp_replace("norm_text", space_etc_regex, " ")))
labeled_data = labeled_data.filter(labeled_data.norm_text != '') # Drop any text that is now empty from actions above
labeled_data = labeled_data.select("norm_text", "Class") # Drop most of the data as isn't used
labeled_data.show()

+--------------------+-----+
|           norm_text|Class|
+--------------------+-----+
|@GOPBlackChick: I...|    1|
|@CNN is there any...|    0|
|@KurtSchlichter: ...|    0|
|@ajpeacemaker @md...|    0|
|THE TRUMP IMMIGRA...|    1|
|@CNNPolitics: Chr...|    0|
|@Morning_Joe Not ...|    1|
|@ThePatriot143: C...|    0|
|Trump is correct ...|    1|
|"I'm going to pre...|    1|
|I really hope peo...|    0|
|Trump is claiming...|    0|
|@marclamonthill: ...|    1|
|BOOM – Univision ...|    1|
|@GeoScarborough I...|    1|
|@AlexBlackStars @...|    0|
|@charlescwcooke: ...|    0|
|@mdabbss: Donald ...|    0|
|A little surprise...|    1|
|@Salon: Trump is ...|    0|
+--------------------+-----+
only showing top 20 rows



The Vivekn annotator requires separate files for both positive and negative training data. While there is not much documentation on the expected input (that is, whether it will accept a Spark directory containing multiple files as is common with Hadoop files), my attempts to utilize such an approach were not successful. As a result, I resorted to a hybrid approach of writing out the training data using the standard Spark API and then modifying the resulting file by accessing the underlying Hadoop libraries directly. I believe there is likely a better approach to doing so but I leave it as an exercise to the reader to discover.

In [13]:
# Split data 80%/20% for training and testing
train_data, test_data = labeled_data.randomSplit([0.8, 0.2], seed=71082)

# NLP code below requires separate files for both positive and negative training data.
# After much experimentation, not able to simple write to filesystem using Spark libraries
# as it produces both the data file and metadata files that the Vivekn annotator cannot process
# when simply indicating the directory where the file exists in the code below.
def write_df(df, dirname, filename):
    # Use Hadoop library directly to write to filesystem
    Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
    FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
    Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration

    tmp_name = filename + ".tmp"
    # Write to a single file rather than multiple
    df.coalesce(1).write.mode('overwrite').text(tmp_name)
    
    fs = FileSystem.get(Configuration())
    fs.mkdirs(Path(dirname))
    # Assume one file output
    file = fs.globStatus(Path(tmp_name + "/*.txt"))[0].getPath();
    fs.rename(file, Path(dirname + "/" + filename));
    fs.delete(Path(tmp_name), True);
    
# split training data into positive/negative
positive_data = train_data.filter(train_data.Class == "1").select("norm_text")
write_df(positive_data, "train-data.txt", "positive.txt")

negative_data = labeled_data.filter(train_data.Class == "0").select("norm_text")
# remove duplicates
negative_data = negative_data.distinct()
write_df(negative_data, "train-data.txt", "negative.txt")

test_data.write.mode("overwrite").parquet("test-data.parquet")

## Sentiment prediction

Now that we have the training and test data ready to go, it's time to put it to use!

After reading in the test data, we remove duplicate entries so as not to bias the results.

In [14]:
# Read in data produced earlier
test_data = spark.read.parquet("test-data.parquet")
test_data.show()

+--------------------+-----+
|           norm_text|Class|
+--------------------+-----+
|"@WIRED: There's ...|    1|
|"Disgusting, bigo...|    0|
|"Former Reagan An...|    0|
|"I'm going to pre...|    1|
|'Dilbert' creator...|    0|
|'I don't care if ...|    1|
|.@KatiePavlich sa...|    0|
|0/ da heck he sai...|    0|
|15 Things Trump a...|    1|
|20 times Donald T...|    0|
|5 Marketing Lesso...|    1|
|@97Musick: @FoxNe...|    1|
|@ABCPolitics: Don...|    1|
|@ABCPolitics: Don...|    1|
|@AG_Conservative:...|    0|
|@AdamofAlbion: If...|    0|
|@AdamsFlaFan: Don...|    0|
|@AlexAtFWW: Wait,...|    0|
|@AllenWest: Trump...|    1|
|@AnitaPaoPao: @_A...|    0|
+--------------------+-----+
only showing top 20 rows



We can now execute the NLP pipeline using the Vivekn annotator to create a model and make sentiment predictions on our test tweets.

In [15]:
from pyspark.ml import Pipeline, PipelineModel
from sparknlp.annotator import *
from sparknlp.base import DocumentAssembler, Finisher

document_assembler = DocumentAssembler().setInputCol("norm_text")
    
sentence_detector = SentenceDetectorModel().setInputCols(["document"]).setOutputCol("sentence")

tokenizer = RegexTokenizer().setInputCols(["sentence"]).setOutputCol("token")
        
normalizer = Normalizer().setInputCols(["token"]).setOutputCol("normal")        
        
spell_checker = NorvigSweetingApproach().setInputCols(["normal"]).setOutputCol("spell")
        
sentiment_detector = ViveknSentimentApproach().setInputCols(["spell", "sentence"]) \
    .setOutputCol("sentiment").setPositiveSource("train-data.txt/positive.txt") \
    .setNegativeSource("train-data.txt/negative.txt").setPruneCorpus(False)   
    
finisher = Finisher().setInputCols(["sentiment"]).setIncludeKeys(True)
    
pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    normalizer,
    spell_checker,
    sentiment_detector,
    finisher
])

sentiment_data = pipeline.fit(test_data).transform(test_data)    
sentiment_data.show()

+--------------------+-----+--------------------+
|           norm_text|Class|  finished_sentiment|
+--------------------+-----+--------------------+
|"@WIRED: There's ...|    1|    result->positive|
|"Disgusting, bigo...|    0|    result->negative|
|"Former Reagan An...|    0|    result->negative|
|"I'm going to pre...|    1|result->positive@...|
|'Dilbert' creator...|    0|result->positive@...|
|'I don't care if ...|    1|    result->negative|
|.@KatiePavlich sa...|    0|result->negative@...|
|0/ da heck he sai...|    0|result->negative@...|
|15 Things Trump a...|    1|    result->negative|
|20 times Donald T...|    0|    result->negative|
|5 Marketing Lesso...|    1|result->negative@...|
|@97Musick: @FoxNe...|    1|result->positive@...|
|@ABCPolitics: Don...|    1|    result->positive|
|@ABCPolitics: Don...|    1|    result->positive|
|@AG_Conservative:...|    0|    result->negative|
|@AdamofAlbion: If...|    0|    result->negative|
|@AdamsFlaFan: Don...|    0|    result->negative|


The pipeline adds an extra column with an sentiment indicated for each sentence in the tweet. The value of the indication is either `result->positive` or `result-negative` and multiple sentences are represented with the repetition of these values delimited by `@`.  In order to get a single Boolean value for the tweet, we take the average of each sentiment indication and then round the result to 0 or 1. In Python 3, the result of rounding a value of .5 is 0. Arbitrarily, I decided to be biased towards positive sentiments by adding a small value to force .5 to round to 1. In testing, I found coincidentally that doing so resulted in more accurate predictions.

In [16]:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
from statistics import mean

def sigmoid(s):
    return 0 if s is None else round(mean(map(lambda x: 1 if (x == "result->positive") else 0, s.split("@"))) + .01)

sigmoid_udf = udf(sigmoid, IntegerType())

sentiment_data = sentiment_data.withColumn("total_sentiment", sigmoid_udf("finished_sentiment"))
sentiment_data.show()

+--------------------+-----+--------------------+---------------+
|           norm_text|Class|  finished_sentiment|total_sentiment|
+--------------------+-----+--------------------+---------------+
|"@WIRED: There's ...|    1|    result->positive|              1|
|"Disgusting, bigo...|    0|    result->negative|              0|
|"Former Reagan An...|    0|    result->negative|              0|
|"I'm going to pre...|    1|result->positive@...|              1|
|'Dilbert' creator...|    0|result->positive@...|              1|
|'I don't care if ...|    1|    result->negative|              0|
|.@KatiePavlich sa...|    0|result->negative@...|              0|
|0/ da heck he sai...|    0|result->negative@...|              0|
|15 Things Trump a...|    1|    result->negative|              0|
|20 times Donald T...|    0|    result->negative|              0|
|5 Marketing Lesso...|    1|result->negative@...|              1|
|@97Musick: @FoxNe...|    1|result->positive@...|              1|
|@ABCPolit

Finally, we determine the accuracy of the model.

In [17]:
correct_count = sentiment_data.filter(sentiment_data.Class == sentiment_data.total_sentiment).count()
total_count =  sentiment_data.count()
accuracy = correct_count / total_count

print(f"Total of {correct_count} correct classifications out of {total_count} observations: {accuracy:.2%}")


Total of 729 correct classifications out of 869 observations: 83.89%
