# Twitter Sentiment Analysis with the Spark NLP Library

## Data preparation

### Introduction

This notebook was developed using the [all-spark-notebook](https://github.com/jupyter/docker-stacks/tree/master/all-spark-notebook) and should work without modification in that environment.

Earlier this year, I produced a [similar notebook](https://www.zepl.com/viewer/notebooks/bm90ZTovL3JvYm9yYXRpdmUvVHdpdHRlci1TZW50aW1lbnQtRXhhbXBsZS9hNTlmZjFkYTAzY2Y0ZWY0YTg5MWRlNjZkMWFlM2I0My9ub3RlLmpzb24) using [Zeppelin](https://zeppelin.apache.org/). In that project, I used the [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/) library to analyze sentiment of a Twitter stream of tweets about Donald Trump. The CoreNLP library is not specific to Spark but is written in Java so it's trivial to use it from within Spark. However, recent investigation into the current NLP space in the Spark ecosystem unearthed the recently released [NLP library](http://nlp.johnsnowlabs.com/index.html) from [John Snow Labs](http://www.johnsnowlabs.com). The library includes a sentiment analyzer which utilizes a novel approach based on the [research](http://arxiv.org/abs/1305.6143) and [code](https://github.com/vivekn/sentiment) produced by Vivek Narayanan et al. In the paper, the authors claim an 88% accuracy on sentiment prediction of reviews from the Internet Movie Database (IMDb).

As I considered this project, I came across a [data set of tweets about Trump](https://www.kaggle.com/ahsanijaz/trumptweets) which includes 4600 tweets collected on August 17, 2015. The data set stood out because the tweets were manually labeled as positive or negative. Of course, having pre-labeled data has proven tremedously helpful in evaluating a supervised sentiment analyzer as the labor-intensive work of providing classified data has already been performed by the provider (`ahsanijaz`). 

In comparison to the IMDB data set of 25000 reviews, the tweet data set is much more limited in number and in length of text. Nevertheless, in this notebook the analysis of the tweet data set resulted in an accuracy of 88.1%. As the Vivekn et al approach is based on a Naive Bayes model which lends itself to shorter text according to [prior research](https://nlp.stanford.edu/pubs/sidaw12_simple_sentiment.pdf), it appears to be a good choice for analyzing this data.

One factor that made a significant difference in the performance of the model is the approach taken here of not removing duplicate data from the training set. If duplicate data is removed from the set the accuracy drops to 83.7%. Duplicate tweets result primarily from the processing approach taken below in which retweets are represented in the data. Furthermore, the common indicators of a retweet ('RT' and 'via') have been removed from the tweets which results in a modest futher redundancy. It may be the case that the redundancy helps reinforce the general mood of the population and thereby helps shift the weighting in a beneficial way. However, it also seems likely that the approach may cause overfitting of the data as it is more probable that duplicate tweets will appear in both the training and test sets. In this notebook, I have not tested this hypothesis but it certainly would be beneficial in getting better confidence in the resulting model.

### Preparation details

Before I could begin to analyze the data, data wrangling was necessary. The source file (`trumptweets.csv`) appears to have a Windows-1252 encoding and oddly has fewer columns defined than the data it describes. While the data is loaded using the Windows-1252 encoding, encoding artifacts remain that could influence the analysis of the tweets and may be worth further investigation.

In order to align the data with the defined columns, I add a nonce header value (`X_copy`). Furthermore, the Spark CSV parser behaves in a subtlely different way when periods (`.`) exist in the column names so I modified these as well. Finally, the original data used backslashes to escape quotes (`\"`) rather than the more common approach of using double quotes (`""`) so I substituted the former with the later so that the Spark CSV parser would work correctly.

As I examined the data, I came across many instances of peculiar text like `<ed><U+00A0><U+00BD><ed><U+00B8><U+0089>`. After much fr research, I was able to determine that the text actually was an R artifact (specifically, it seems that the data was pulled from Twitter using the R ‘twitteR’ package) and are the result of R translating emojis to text (see [Emoticons decoder for social media sentiment analysis in R](http://opiateforthemass.es/articles/emoticons-in-R/) and [Twitter emoji encoding problems with twitteR and R](https://stackoverflow.com/questions/37999896/twitter-emoji-encoding-problems-with-twitter-and-r) for details). I found a [resource](https://github.com/today-is-a-good-day/emojis/blob/master/emojis.csv) that provided an exhaustive list of emojis in this format and which I was able to use to map to text equivalents (eg, 😊 to "smiling face with smiling eyes").

In [1]:
%%bash

if [ ! -f trumptweet_mod.csv ]; then
    echo "Creating copy of trumptweet.csv with modifications"
    iconv -f CP1252 -t UTF-8 trumptweet.csv > trumptweet_mod.csv # Convert to UTF-8
    sed -i 's/"X",/"X","X_copy",/' trumptweet_mod.csv # Add new column to header to align header with data
    sed -i 's/"X.1",/"X_1",/' trumptweet_mod.csv # Spark reader behaves oddly when like dots are in column names
    sed -i 's/\\"/""/g' trumptweet_mod.csv # Use standard quote escapes for CSV rather than backslashes
else
    echo "trumptweet_mod.csv already exists"
fi

if [ ! -f emojis.csv ]; then
    echo "Fetching emojis.csv"
    wget -q http://raw.githubusercontent.com/today-is-a-good-day/Emoticons/master/emojis.csv
else
    echo "emojis.csv already exists"  
fi

Creating copy of trumptweet.csv with modifications
Fetching emojis.csv


We begin by installing the Spark NLP library into the environment as it is not included in the Jupyter environment noted above.

In [2]:
import os
# Load Spark NLP package as it's not included in all-spark-notebook
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages JohnSnowLabs:spark-nlp:1.2.3 pyspark-shell'

import pyspark
from pyspark.sql import SparkSession

conf = pyspark.SparkConf()
#conf.set("spark.executor.memory", "4g")
conf.set("spark.driver.memory", "2g")

sc = pyspark.SparkContext('local[*]', conf = conf)
spark = SparkSession(sc)

The data contains linefeeds so the `multiLine` parameter was required to process the entire data set. As is seen below, the data contains a roughly equal number of positive and negative tweets.

In [3]:
# Use modified version of the data
labeled_data = spark.read.csv("trumptweet_mod.csv", header = True, escape = '"', mode = "FAILFAST", multiLine = "true")

# Show breakdown of positive and negative tweets
labeled_data.groupBy('Class').count().show()

+-----+-----+
|Class|count|
+-----+-----+
|    0| 2262|
|    1| 2340|
+-----+-----+



I use the `html` library to translate HTML entities to their text equivalents (eg, `&gt;`, `&amp;`, etc).

Additionally, I perform the following normalizations on the data :

- Remove `RT` and `via`
- Remove the 'at' character (`@`) from user references (eg, `@realDonaldTrump`)
- Remove URLs
- Convert R-style emoji references (eg, `<ed><U+00A0>...`) to their text equivalents
- Removed stray leading characters (`.` and `,`)
- Removed whitespace and beginning and end of tweet
- Removed any entries that resulted in empty text after the above processing
- Removed all columns except for the normalized text and the classification

In [4]:
# Normalize tweets
from pyspark.sql.functions import regexp_replace
from pyspark.sql.functions import trim
from pyspark.sql.functions import udf
import re
import html

rt_regex = r"^RT[ :]" # Drop RT
via_regex = "^Via | via " # Likewise, remove via references while preserving succeeding hashtags
user_regex = r"@(?=\S+)" # Remove @ but leave user value
url_regex = r"http\S+" # Drop URLs
hashtag_regex = r"#(?=\S+)" # Remove hashtag itself but leave tag value
space_etc_regex = r"^[\.,]|\s{2,}|\n" # Remove extra spaces, linefeeds and leading periods or commas.

# Remove RT, vias, @s, URLs, line feeds, etc
uber_regex = "|".join([url_regex, rt_regex, via_regex, user_regex, hashtag_regex])

emoji_data = spark.read.csv("emojis.csv", header = True, sep = ";")
emoji_data_list = map(lambda row: row.asDict(), emoji_data.collect())
emoji_data_dict = { row['ftu8']: row for row in emoji_data_list }

r_unicode_regex = re.compile(r"<e[a-f0-9]>\S+>")
r_unicode_pieces_regex = re.compile(r"<e[a-f]><U\+[A-F0-9]{4}><U\+[A-F0-9]{4}>")

def replace_emojis(s, r_unicode):
  r_unicode_matches = r_unicode_pieces_regex.findall(r_unicode)
  for n in range(len(r_unicode_matches), 0, -1):
    r_unicode_key = "".join(r_unicode_matches[0:n]).replace("U+00", "").lower()
    if r_unicode_key in emoji_data_dict:
      unicodes = emoji_data_dict[r_unicode_key].get("unicode").split()
      unicode = "".join(map(lambda u: f"<{u}>", unicodes))
      to_be_replaced = "".join(r_unicode_matches[0:n])
      s = s.replace(to_be_replaced, unicode) # Make translated value a sentence.
      break
  return s

# def replace_emojis(s, r_unicode):
#   r_unicode_matches = r_unicode_pieces_regex.findall(r_unicode)
#   for n in range(len(r_unicode_matches), 0, -1):
#     r_unicode_key = "".join(r_unicode_matches[0:n]).replace("U+00", "").lower()
#     if r_unicode_key in emoji_data_dict:
#       text = emoji_data_dict[r_unicode_key].get("EN")
#       to_be_replaced = "".join(r_unicode_matches[0:n])
#       s = s.replace(to_be_replaced, f" {text}. ") # Make translated value a sentence.
#       break
#   return s

def fix_emojis(s):
  is_done = False
  while r_unicode_regex.search(s) is not None and is_done is False:
    s_prior = s
    for match in r_unicode_regex.finditer(s):        
      m = match.group()
      s = replace_emojis(s, m) 
      is_done = s == s_prior
  return s

# Replace R-style emoji encoding with text equivalent
fix_emojis_udf = udf(fix_emojis)
# Convert <U+00??> references to Unicode characters
fix_unicode_udf = udf(lambda s: re.sub(r'<U\+([0-9a-fA-F]+)>', lambda m: chr(int(m.group(1),16)), s))
fix_smart_quotes_etc_udf = udf(lambda s: s.replace( "’", "'" ).replace( "“", '"' ).replace( "”", '"' ) \
                           .replace("…", "...").replace("‘","'").replace("⁉️", "!?").replace("‼", "!!") \
                           .replace("❤", "love").replace("♥", "love").replace("&039;", "\""))
unescape_html_udf = udf(lambda s: html.unescape(s))

#labeled_data = labeled_data.filter(labeled_data.lang == "en") # Drop any entries in languages other than English
labeled_data = labeled_data.withColumn("norm_text", regexp_replace("text", uber_regex, ""))
labeled_data = labeled_data.withColumn("norm_text", fix_emojis_udf("norm_text"))
#labeled_data = labeled_data.withColumn("norm_text", regexp_replace("norm_text", "<U\+[0-9a-fA-F]+>", ""))
labeled_data = labeled_data.withColumn("norm_text", fix_unicode_udf("norm_text"))
labeled_data = labeled_data.withColumn("norm_text", unescape_html_udf("norm_text"))
labeled_data = labeled_data.withColumn("norm_text", fix_smart_quotes_etc_udf("norm_text"))
labeled_data = labeled_data.withColumn("norm_text", trim(regexp_replace("norm_text", space_etc_regex, " ")))

labeled_data = labeled_data.filter(labeled_data.norm_text != '') # Drop any text that is now empty from actions above
labeled_data = labeled_data.select("norm_text", "Class") # Drop most of the data as isn't used

labeled_data.coalesce(1).write.mode("overwrite").csv("labeled-data.csv", header = True, escape = "\"")

labeled_data.cache()
labeled_data.show()

+--------------------+-----+
|           norm_text|Class|
+--------------------+-----+
|GOPBlackChick: Il...|    1|
|CNN is there any ...|    0|
|KurtSchlichter: C...|    0|
|ajpeacemaker mdja...|    0|
|THE TRUMP IMMIGRA...|    1|
|CNNPolitics: Chri...|    0|
|Morning_Joe Not a...|    1|
|ThePatriot143: Co...|    0|
|Trump is correct ...|    1|
|"I'm going to pre...|    1|
|I really hope peo...|    0|
|Trump is claiming...|    0|
|marclamonthill: L...|    1|
|BOOM – Univision ...|    1|
|GeoScarborough I ...|    1|
|AlexBlackStars Dm...|    0|
|charlescwcooke: T...|    0|
|mdabbss: Donald t...|    0|
|A little surprise...|    1|
|Salon: Trump is t...|    0|
+--------------------+-----+
only showing top 20 rows



The Vivekn annotator requires separate files for both positive and negative training data. While there is not much documentation on the expected input, my attempts to utilize such an approach were not successful unless I removed the any metadata files (eg, `_SUCCESS` and `_SUCCESS.crc`) from the paritioned files created by Spark. As a result, I resorted to a hybrid approach of writing out the training data using the standard Spark API and then deleting problematic files by accessing the underlying Hadoop libraries directly. 

In [5]:
# Split data 80%/20% for training and testing
train_data, test_data = labeled_data.randomSplit([0.8, 0.2], seed=71082)
 
# Vivekn sentiment approach in SparkNLP below requires separate files for both positive and negative training data.
# After much experimentation, not able to simple write to filesystem using Spark libraries
# as it produces both the data file and metadata files that the Vivekn annotator cannot process
# when simply indicating the directory where the file exists in the code below.
def write_df(df, dirname, filename):
    # Use Hadoop library directly to write to filesystem
    Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
    FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
    Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration

    tmp_name = filename + ".tmp"
    # Write to a single file rather than multiple
    df.coalesce(1).write.mode('overwrite').text(tmp_name)
    
    fs = FileSystem.get(Configuration())
    fs.mkdirs(Path(dirname))
    # Assume one file output
    file = fs.globStatus(Path(tmp_name + "/*.txt"))[0].getPath();
    fs.rename(file, Path(dirname + "/" + filename));
    fs.delete(Path(tmp_name), True);
    
#train_data = train_data.distinct()    

# split training data into positive/negative
positive_data = train_data.filter(train_data.Class == "1").select("norm_text")
write_df(positive_data, "train-data", "positive.txt")

negative_data = labeled_data.filter(train_data.Class == "0").select("norm_text")
negative_data = negative_data.distinct()
write_df(negative_data, "train-data", "negative.txt")

# remove duplicates
#test_data = test_data.distinct()
test_data.write.mode("overwrite").parquet("test-data.parquet")

## Sentiment prediction

Now that we have the training and test data ready to go, it's time to put it to use!

After reading in the test data, we remove duplicate entries so as not to bias the results.

In [6]:
# Read in data produced earlier
test_data = spark.read.parquet("test-data.parquet")
test_data.cache()
test_data.show()

+--------------------+-----+
|           norm_text|Class|
+--------------------+-----+
|"Did you know tha...|    0|
|"I told you guys ...|    0|
|"In 24 days plus ...|    1|
|"PolitiFact: Dona...|    1|
|'Dilbert' creator...|    0|
|'I am Batman,' Tr...|    0|
|08/17 1952: chess...|    0|
|20 times Donald T...|    0|
|2phonefranki: the...|    0|
|2phonefranki: the...|    0|
|7 ways the latest...|    1|
|ABC: Donald Trump...|    1|
|ABCPolitics: Dona...|    1|
|ADobranic: Alfred...|    1|
|ANY FAITH I HAD L...|    0|
|Actually, if I we...|    1|
|AdamBaldwin: BREA...|    1|
|Adviser: Donald T...|    1|
|Adviser: Donald T...|    1|
|Adviser: Donald T...|    1|
+--------------------+-----+
only showing top 20 rows



We can now execute the NLP pipeline using the Vivekn annotator to create a model and make sentiment predictions on our test tweets.

In [7]:
from pyspark.ml import Pipeline, PipelineModel
from sparknlp.annotator import *
from sparknlp.base import DocumentAssembler, Finisher

document_assembler = DocumentAssembler().setInputCol("norm_text")
    
sentence_detector = SentenceDetectorModel().setInputCols(["document"]).setOutputCol("sentence")

tokenizer = RegexTokenizer().setInputCols(["sentence"]).setOutputCol("token")
        
normalizer = Normalizer().setInputCols(["token"]).setOutputCol("normal")        
        
spell_checker = NorvigSweetingApproach().setInputCols(["normal"]).setOutputCol("spell")
  
sentiment_detector = ViveknSentimentApproach().setInputCols(["spell", "sentence"]) \
    .setOutputCol("sentiment").setPositiveSource("train-data/positive.txt") \
    .setNegativeSource("train-data/negative.txt").setPruneCorpus(False)       
    
finisher = Finisher().setInputCols(["sentiment"]).setIncludeKeys(True)
    
pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    normalizer,
    spell_checker,
    sentiment_detector,
    finisher
])

sentiment_data = pipeline.fit(test_data).transform(test_data)    
sentiment_data.show()

+--------------------+-----+--------------------+
|           norm_text|Class|  finished_sentiment|
+--------------------+-----+--------------------+
|"Did you know tha...|    0|    result->negative|
|"I told you guys ...|    0|    result->negative|
|"In 24 days plus ...|    1|result->positive@...|
|"PolitiFact: Dona...|    1|result->positive@...|
|'Dilbert' creator...|    0|result->positive@...|
|'I am Batman,' Tr...|    0|    result->negative|
|08/17 1952: chess...|    0|result->positive@...|
|20 times Donald T...|    0|    result->negative|
|2phonefranki: the...|    0|    result->negative|
|2phonefranki: the...|    0|    result->negative|
|7 ways the latest...|    1|    result->negative|
|ABC: Donald Trump...|    1|    result->positive|
|ABCPolitics: Dona...|    1|    result->positive|
|ADobranic: Alfred...|    1|result->positive@...|
|ANY FAITH I HAD L...|    0|    result->negative|
|Actually, if I we...|    1|result->positive@...|
|AdamBaldwin: BREA...|    1|    result->negative|


The pipeline adds an extra column with an sentiment indicated for each sentence in the tweet. The value of the indication is either `result->positive` or `result->negative` and multiple sentences are represented with the repetition of these values delimited by `@`.  In order to get a single Boolean value for the tweet, we take the average of each sentiment indication and then round the result to 0 or 1.

In [8]:
from pyspark.sql.types import IntegerType
from statistics import mean

def sigmoid(s):
    return 0 if s is None else round(mean(map(lambda x: 1 if (x == "result->positive") else 0, s.split("@"))) + .01)

sigmoid_udf = udf(sigmoid, IntegerType())

sentiment_data = sentiment_data.withColumn("total_sentiment", sigmoid_udf("finished_sentiment"))
sentiment_data.cache()
sentiment_data.write.mode("overwrite").parquet("sentiment-data.parquet")
sentiment_data.coalesce(1).write.mode("overwrite").csv("sentiment-data.csv", header = True, escape = "\"")
sentiment_data.show()

+--------------------+-----+--------------------+---------------+
|           norm_text|Class|  finished_sentiment|total_sentiment|
+--------------------+-----+--------------------+---------------+
|"Did you know tha...|    0|    result->negative|              0|
|"I told you guys ...|    0|    result->negative|              0|
|"In 24 days plus ...|    1|result->positive@...|              1|
|"PolitiFact: Dona...|    1|result->positive@...|              1|
|'Dilbert' creator...|    0|result->positive@...|              1|
|'I am Batman,' Tr...|    0|    result->negative|              0|
|08/17 1952: chess...|    0|result->positive@...|              1|
|20 times Donald T...|    0|    result->negative|              0|
|2phonefranki: the...|    0|    result->negative|              0|
|2phonefranki: the...|    0|    result->negative|              0|
|7 ways the latest...|    1|    result->negative|              0|
|ABC: Donald Trump...|    1|    result->positive|              1|
|ABCPoliti

Finally, we determine the accuracy of the model.

In [10]:
correct_count = sentiment_data.filter(sentiment_data.Class == sentiment_data.total_sentiment).count()
total_count =  sentiment_data.count()

print(f"Total correct classification percentage: {correct_count / total_count:.1%} [{correct_count} of {total_count}]")

# False Pos %
pos_count = sentiment_data.filter(sentiment_data.Class == 1).count()
true_pos_count = sentiment_data.filter(sentiment_data.total_sentiment == 1).filter(sentiment_data.Class == 1).count()
false_pos_count = sentiment_data.filter(sentiment_data.total_sentiment == 1).filter(sentiment_data.Class == 0).count()

print(f"True positive percentage: {true_pos_count/pos_count:.1%} [{true_pos_count} of {pos_count}]")
print(f"False positive percentage: {false_pos_count/pos_count:.1%} [{false_pos_count} of {pos_count}]")

# False Neg %
neg_count = sentiment_data.filter(sentiment_data.Class == 0).count()
true_neg_count = sentiment_data.filter(sentiment_data.total_sentiment == 0).filter(sentiment_data.Class == 0).count()
false_neg_count = sentiment_data.filter(sentiment_data.total_sentiment == 0).filter(sentiment_data.Class == 1).count()

print(f"True negative percentage: {true_neg_count/neg_count:.1%} [{true_neg_count} of {neg_count}]")
print(f"False negative percentage: {false_neg_count/neg_count:.1%} [{false_neg_count} of {neg_count}]")

Total correct classification percentage: 86.3% [750 of 869]
True positive percentage: 86.4% [370 of 428]
False positive percentage: 14.3% [61 of 428]
True negative percentage: 86.2% [380 of 441]
False negative percentage: 13.2% [58 of 441]
