# Data preparation

This notebook was developed using the [all-spark-notebook](https://github.com/jupyter/docker-stacks/tree/master/all-spark-notebook) and should work without modification in that environment.

Earlier this year, I produced a [similar notebook](https://www.zepl.com/viewer/notebooks/bm90ZTovL3JvYm9yYXRpdmUvVHdpdHRlci1TZW50aW1lbnQtRXhhbXBsZS9hNTlmZjFkYTAzY2Y0ZWY0YTg5MWRlNjZkMWFlM2I0My9ub3RlLmpzb24) using [Zeppelin](https://zeppelin.apache.org/). In that project, I used the [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/) library to analyze the sentiment of a Twitter stream. That library is not specific to Spark, but it is written in Java, so it's trivial to use from Spark. Out of curiosity, I recently looked into the current NLP space in the Spark ecosystem to see if anything Spark-specific had been released in the interim and came across a recently released [NLP library](http://www.johnsnowlabs.com/dataops-blog/natural-language-processing-library/) from [John Snow Labs](http://www.johnsnowlabs.com). The library includes a sentiment analyzer I hadn't heard of before, based on [research](http://arxiv.org/abs/1305.6143) and [code](https://github.com/vivekn/sentiment) produced by Vivek Narayanan, Ishan Arora, and Arjun Bhatia. In the paper, the authors claim 88% accuracy on the Internet Movie Database (IMDb) data set.

As I considered this project, I came across a [data set of tweets about Trump](https://www.kaggle.com/ahsanijaz/trumptweets) which includes 4600 tweets collected on August 17, 2015 and manually labeled as positive or negative. This data is tremendously helpful for evaluating a sentiment analyzer, as the hard work of classifying the data has already been done by the provider.

In comparison to the IMDb data set of 25,000 reviews, the tweet data set is much more limited in both number and length of material. Nevertheless, the analysis of the tweet data set resulted in an accuracy of 91%. The approach of Narayanan et al. is based on a Naive Bayes model, which lends itself to shorter text according to [prior research](https://nlp.stanford.edu/pubs/sidaw12_simple_sentiment.pdf); this may explain the robust performance of the algorithm on this data.

Before I could begin to analyze the data, I needed to do some data munging. The source file (`trumptweets.csv`) appears to use Windows-1252 encoding, and its header row has fewer columns than the data rows it describes. Although the data is loaded using the Windows-1252 encoding, artifacts remain that could be influencing the analysis of the tweets and may be worth investigating.
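
Both problems can be checked before any rewriting. The following is a minimal sketch using only the Python standard library. The sample bytes and the helper name `column_widths` are made up for illustration and are not taken from the actual file; the sample only imitates a header that names fewer columns than the rows carry.

```python
import csv
import io

def column_widths(raw_bytes, encoding="cp1252"):
    """Return (header width, sorted distinct data-row widths) for CSV bytes."""
    text = raw_bytes.decode(encoding)
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader)
    widths = sorted({len(row) for row in reader if row})
    return len(header), widths

# Hypothetical sample: the header names 3 columns, each data row carries 4.
# Byte 0x92 is a right single quotation mark in Windows-1252.
sample = b'"X","text","Class"\n"1","1","I\x92m in","1"\n"2","2","No way","0"\n'

header_width, row_widths = column_widths(sample)
decoded_quote = b"I\x92m".decode("cp1252")
print(header_width, row_widths, decoded_quote)
```

If the header width differs from the row widths, as it does in this sample, the header needs an extra column name before a strict CSV reader can align names with values.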

In [1]:
%%bash

if [ ! -f trumptweet_mod.csv ]; then
    echo "Creating copy of trumptweet.csv with modifications"
    sed 's/"X",/"X","X_copy",/' trumptweet.csv > trumptweet_mod.csv # Add new column to header to align header with data
    sed -i 's/"X.1",/"X_1",/g' trumptweet_mod.csv # Spark reader behaves oddly when dots are in column names
    sed -i 's/\\"/""/g' trumptweet_mod.csv # Use standard quote escapes for CSV rather than backslashes
else
    echo "trumptweet_mod.csv already exists"
fi

trumptweet_mod.csv already exists
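
For readers without `sed`, the three substitutions in the cell above can be approximated in plain Python. This is a sketch under stated assumptions: `fix_csv_text` is a hypothetical helper, the sample rows are invented, and the `"X.1"` replacement is done literally here, whereas the dot in the `sed` pattern is a regex wildcard.

```python
import csv
import io

def fix_csv_text(csv_text):
    """Apply the header and quote-escape fixes line by line."""
    fixed_lines = []
    for line in csv_text.split("\n"):
        # sed without /g: only the first match on each line is replaced
        line = line.replace('"X",', '"X","X_copy",', 1)
        # Replace the dotted column name with an underscore variant
        line = line.replace('"X.1",', '"X_1",')
        # Turn backslash-escaped quotes into doubled quotes
        line = line.replace('\\"', '""')
        fixed_lines.append(line)
    return "\n".join(fixed_lines)

# Invented sample: a 4-name header over a 5-field row with backslash escapes
sample = '"X","X.1","text","Class"\n"1","1","1","He said \\"no\\"","0"'

fixed = fix_csv_text(sample)
rows = list(csv.reader(io.StringIO(fixed, newline="")))
print(rows[0])
print(rows[1])
```

After the fixes, the header and the data row have the same number of fields, and the embedded quotes parse as part of the text field.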


In [2]:
import pyspark
from pyspark.sql import SparkSession

conf = pyspark.SparkConf()
# Default settings resulted in GC issues; these values may be reducible if system memory is limited
conf.set("spark.executor.memory", "4g")
conf.set("spark.driver.memory", "4g")

sc = pyspark.SparkContext(conf = conf)
spark = SparkSession(sc)

In [27]:
# Use modified version of the data
labeled_data = spark.read.csv("trumptweet_mod.csv", header = True, escape = '"', encoding = "windows-1252", mode = "FAILFAST", multiLine = "true")

# Drop possible NAs or null values for safety's sake
labeled_data = labeled_data.filter((labeled_data.Class == '0') | (labeled_data.Class == '1'))

# Show breakdown of positive and negative tweets
labeled_data.groupBy('Class').count().show()

+-----+-----+
|Class|count|
+-----+-----+
|    0| 2262|
|    1| 2340|
+-----+-----+



In [26]:
# Normalize tweets
from pyspark.sql.functions import regexp_replace
from pyspark.sql.functions import trim

rt_regex = r"\bRT\b:?" # Drop RT only as a standalone token, so words such as SUPPORT are left intact
user_regex = r"@\S+" # Drop user references
url_regex = r"http[s]?:\S+" # Drop URLs
hashtag_regex = r"#" # Remove hashtag itself but leave tag value
unicode_regex = r"<ed>|<U\+[^>]*>" # Clean up spurious unicode references
space_etc_regex = r"^[\.'\",]\s+|\s{2,}|\n" # Remove extra spaces, linefeeds and leading periods or commas.

# Remove RT, users (@foo), URLs, #s, odd unicode 'tags'; spaces and line feeds are handled separately below
uber_regex = "|".join([rt_regex, user_regex, url_regex, hashtag_regex, unicode_regex])

labeled_data = labeled_data.withColumn("norm_text", regexp_replace("text", uber_regex, ""))
labeled_data = labeled_data.withColumn("norm_text", trim(regexp_replace("norm_text", space_etc_regex, " ")))
labeled_data = labeled_data.filter(labeled_data.norm_text != '') # Drop any text that is now empty from actions above
labeled_data = labeled_data.select("norm_text", "Class") # Drop the remaining columns as they aren't used
labeled_data.show()

+--------------------+-----+
|           norm_text|Class|
+--------------------+-----+
|Illegals must be ...|    1|
|is there any othe...|    0|
|Caring - The GOP ...|    0|
|So much stupid go...|    0|
|THE TRUMP IMMIGRA...|    1|
|Christie on Donal...|    0|
|Not a Trump fan, ...|    1|
|Court Has To Step...|    0|
|Trump is correct ...|    1|
|"I�m going to pre...|    1|
|I really hope peo...|    0|
|Trump is claiming...|    0|
|Latest poll has T...|    1|
|BOOM � Univision ...|    1|
|I am now all in f...|    1|
|"His ratings amon...|    0|
|Today's Trump pos...|    0|
|Donald trump the ...|    0|
|A little surprise...|    1|
|Trump is the last...|    0|
+--------------------+-----+
only showing top 20 rows
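
To see what the two normalization passes do to a single tweet, the same patterns can be tried with Python's `re` module outside Spark. This is only a sketch: Spark's `regexp_replace` uses Java regular expressions, which I believe behave the same way for these particular patterns, and the sample tweet is invented. The `rt_regex` here uses word boundaries so that words such as SUPPORT keep their letters.

```python
import re

rt_regex = r"\bRT\b:?"              # standalone RT, optionally followed by a colon
user_regex = r"@\S+"                # user references
url_regex = r"http[s]?:\S+"         # URLs
hashtag_regex = r"#"                # hashtag sign only, the tag text is kept
unicode_regex = r"<ed>|<U\+[^>]*>"  # spurious unicode 'tags'
space_etc_regex = r"^[\.'\",]\s+|\s{2,}|\n"

uber_regex = "|".join([rt_regex, user_regex, url_regex, hashtag_regex, unicode_regex])

def normalize(text):
    # First pass removes unwanted tokens, second pass collapses whitespace
    stripped = re.sub(uber_regex, "", text)
    return re.sub(space_etc_regex, " ", stripped).strip()

# Invented sample tweet
sample = "RT @someone: I SUPPORT #Trump <U+00A0> see https://example.com/x\nnow"
normalized = normalize(sample)
print(normalized)  # I SUPPORT Trump see now
```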



In [23]:
# Split data 80%/20% for training and testing
train_data, test_data = labeled_data.randomSplit([0.8, 0.2], seed=71082)

# NLP code below requires separate files for positive and negative training data.
# After much experimentation, I was not able to simply write to the filesystem using Spark libraries,
# as that produces multiple files and metadata files that the Vivekn annotator cannot process.
def write_df(df, dirname, filename):
    # Use Hadoop library directly to write to filesystem
    Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
    FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
    Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration

    tmp_name = filename + ".tmp"
    # Write to a single file rather than multiple
    df.coalesce(1).write.mode('overwrite').text(tmp_name)
    
    fs = FileSystem.get(Configuration())
    fs.mkdirs(Path(dirname))
    # Assume one file output
    file = fs.globStatus(Path(tmp_name + "/*.txt"))[0].getPath()
    fs.rename(file, Path(dirname + "/" + filename))
    fs.delete(Path(tmp_name), True)
       
# split training data into positive/negative
positive_data = train_data.filter(train_data.Class == "1").select("norm_text")
write_df(positive_data, "train-data.txt", "positive.txt")

negative_data = train_data.filter(train_data.Class == "0").select("norm_text")
write_df(negative_data, "train-data.txt", "negative.txt")

test_data.write.mode("overwrite").parquet("test-data.parquet")
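
For a data set this small, a simpler alternative to renaming Spark's part file might be to stream the rows to the driver and write them with plain Python. The sketch below assumes the output should land on the local filesystem rather than on whatever filesystem Hadoop is configured for, and `write_lines` is a hypothetical helper. It takes any iterable of strings so that it can be tried without a Spark session.

```python
import os
import tempfile

def write_lines(lines, dirname, filename):
    """Write an iterable of strings to dirname/filename, one per line."""
    os.makedirs(dirname, exist_ok=True)
    path = os.path.join(dirname, filename)
    with open(path, "w", encoding="utf-8") as handle:
        for line in lines:
            handle.write(line + "\n")
    return path

# With Spark, the rows could be streamed to the driver, for example:
#   write_lines((row.norm_text for row in positive_data.toLocalIterator()),
#               "train-data.txt", "positive.txt")

# Stand-alone demonstration with made-up tweets in a temporary directory
demo_dir = tempfile.mkdtemp()
demo_path = write_lines(["Trump is correct", "Latest poll"], demo_dir, "positive.txt")
with open(demo_path, encoding="utf-8") as handle:
    written = handle.read().splitlines()
print(written)
```

This trades scalability for simplicity: every row passes through the driver, so it is only reasonable when the data fits there comfortably, as a few thousand tweets should.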