# Natural language processing and sentiment analysis

In this notebook, we'll show you how to analyze the sentiments (positive, negative, or neutral) of our synthetic social media updates.  There are a lot of tricky problems in preparing natural language inputs for machine learning and in training machine learning models for natural language, but we're going to ignore those because we can use some libraries that have done most of the hard work for us:

1.  [spaCy](https://spacy.io) is a library that has methods and pretrained models for parsing natural language and determining parts of speech, etc., in many languages, and
2.  [VADER](https://github.com/cjhutto/vaderSentiment) is a library that can characterize the sentiments of individual sentences.

While sentiment analysis is an interesting application, we hope you'll be inspired to try other language-processing tasks with spaCy, too.  You'll also learn how to glue sophisticated language-processing code into a Spark pipeline so that you can use it to process streaming data.  

The bigger lesson of this notebook is that you may not always need to train a model to add intelligence to an application:  often pretrained models or even off-the-shelf intelligent APIs are available for interesting tasks (this is particularly true for tasks like language processing and image recognition that are broadly applicable).

## Setup

We'll start by importing the spaCy library and telling it to load a pretrained model for English text.  The spaCy project ships [several pretrained models](https://spacy.io/models/), but we are going to use [a relatively compact model of English](https://spacy.io/models/en#en_core_web_sm) trained on web pages.

In [1]:
import spacy

In [2]:
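# 'en' is a shortcut link for the small English model; spaCy 3 and later
# dropped shortcut links and need the full name, e.g. 'en_core_web_sm'
english = spacy.load('en')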

The next step is to import the analyzer class from VADER.

In [3]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

## Processing text and identifying sentiments

We'll look at the sentiment of some example text from Jane Austen (we picked a particularly recognizable excerpt):

In [4]:
sampletext = """ It is a truth universally acknowledged, that a single man in possession
of a good fortune, must be in want of a wife.

However little known the feelings or views of such a man may be on his
first entering a neighbourhood, this truth is so well fixed in the minds
of the surrounding families, that he is considered the rightful property
of some one or other of their daughters. """

We're going to tell spaCy to use `english` -- the model we loaded -- to parse the input text.

In [5]:
result = english(sampletext)

In [6]:
type(result)

spacy.tokens.doc.Doc

We may not know what we can do with a `spacy.tokens.doc.Doc`, but since most good Python code includes documentation, we can find out!

In [7]:
help(result)

Help on Doc object:

class Doc(builtins.object)
 |  A sequence of Token objects. Access sentences and named entities, export
 |  annotations to numpy arrays, losslessly serialize to compressed binary
 |  strings. The `Doc` object holds an array of `TokenC` structs. The
 |  Python-level `Token` and `Span` objects are views of this array, i.e.
 |  they don't own the data themselves.
 |  
 |  EXAMPLE: Construction 1
 |      >>> doc = nlp(u'Some text')
 |  
 |      Construction 2
 |      >>> from spacy.tokens import Doc
 |      >>> doc = Doc(nlp.vocab, words=[u'hello', u'world', u'!'],
 |                    spaces=[True, False, False])
 |  
 |  Methods defined here:
 |  
 |  __bytes__(...)
 |  
 |  __getitem__(...)
 |      Get a `Token` or `Span` object.
 |      
 |      i (int or tuple) The index of the token, or the slice of the document
 |          to get.
 |      RETURNS (Token or Span): The token at `doc[i]]`, or the span at
 |          `doc[start : end]`.
 |      
 |      EXAMPLE:
 |

One of the cool things we can do with spaCy is identify parts of speech in natural language text (in fact, we used this feature to identify words that should become hashtags in our [synthetic update generator](/notebooks/generate.ipynb)):

In [8]:
for token in result:
    print(token.text, token.pos_)

  SPACE
It PRON
is VERB
a DET
truth NOUN
universally ADV
acknowledged VERB
, PUNCT
that ADP
a DET
single ADJ
man NOUN
in ADP
possession NOUN

 SPACE
of ADP
a DET
good ADJ
fortune NOUN
, PUNCT
must VERB
be VERB
in ADP
want NOUN
of ADP
a DET
wife NOUN
. PUNCT


 SPACE
However ADV
little ADJ
known ADJ
the DET
feelings NOUN
or CCONJ
views NOUN
of ADP
such ADJ
a DET
man NOUN
may VERB
be VERB
on ADP
his ADJ

 SPACE
first ADV
entering VERB
a DET
neighbourhood NOUN
, PUNCT
this DET
truth NOUN
is VERB
so ADV
well ADV
fixed VERB
in ADP
the DET
minds NOUN

 SPACE
of ADP
the DET
surrounding VERB
families NOUN
, PUNCT
that ADP
he PRON
is VERB
considered VERB
the DET
rightful ADJ
property NOUN

 SPACE
of ADP
some DET
one NUM
or CCONJ
other ADJ
of ADP
their ADJ
daughters NOUN
. PUNCT


It works pretty well!  But for our purposes in this notebook, we're just going to use spaCy to divide updates into sentences, which we can then feed to VADER.  Let's instantiate a VADER analyzer now:

In [9]:
analyzer = SentimentIntensityAnalyzer()

Now we can get the sentiment scores for each sentence:  negative (`neg`), neutral (`neu`), positive (`pos`), and overall sentiment (`compound`).

In [10]:
[analyzer.polarity_scores(str(s)) for s in list(result.sents)]

[{'neg': 0.0, 'neu': 0.711, 'pos': 0.289, 'compound': 0.6705},
 {'neg': 0.0, 'neu': 0.895, 'pos': 0.105, 'compound': 0.6147}]

Unsurprisingly, the first two sentences of _Pride and Prejudice_ score as neutral-to-positive (we don't have a pretrained hilarity detector, alas).  Let's try some raw text from the negative product reviews corpus:

In [11]:
negative = english(""" This oatmeal is not good. Its mushy, soft, I don't like it. Quaker Oats is the way to go. 

Seriously this product was as tasteless as they come. There are much better tasting products out 
there but at 100 calories its better than a special k bar or cookie snack pack. You just have to 
season it or combine it with something else to share the flavor.

These were nasty, they were so greasy and too rich for my blood, plus they lacked major flavor, 
no spicy jalapeno flavor at all.
""")

[(s, analyzer.polarity_scores(str(s))) for s in list(negative.sents)]

[( This oatmeal is not good.,
  {'neg': 0.376, 'neu': 0.624, 'pos': 0.0, 'compound': -0.3412}),
 (Its mushy, soft, I don't like it.,
  {'neg': 0.297, 'neu': 0.703, 'pos': 0.0, 'compound': -0.2755}),
 (Quaker Oats is the way to go. 
  , {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}),
 (Seriously this product was as tasteless as they come.,
  {'neg': 0.175, 'neu': 0.825, 'pos': 0.0, 'compound': -0.1779}),
 (There are much better tasting products out 
  there but at 100 calories its better than a special k bar or cookie snack pack.,
  {'neg': 0.0, 'neu': 0.658, 'pos': 0.342, 'compound': 0.8537}),
 (You just have to 
  season it or combine it with something else to share the flavor.
  , {'neg': 0.0, 'neu': 0.872, 'pos': 0.128, 'compound': 0.296}),
 (These were nasty, they were so greasy and too rich for my blood, plus they lacked major flavor, 
  no spicy jalapeno flavor at all.,
  {'neg': 0.191, 'neu': 0.691, 'pos': 0.118, 'compound': -0.296})]
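
If you need a single label per sentence rather than four numbers, one common convention is to threshold the `compound` score. The sketch below assumes cutoffs of ±0.05, which the VADER README suggests as typical values; `label_sentiment` and its `threshold` parameter are hypothetical names, not part of VADER, and the right cutoff will depend on your data.

```python
def label_sentiment(scores, threshold=0.05):
    """Map a VADER-style score dict to 'positive', 'negative', or 'neutral'.

    `scores` is a dict with a 'compound' key, like the ones returned by
    analyzer.polarity_scores(). The threshold is an assumption to tune.
    """
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Score dicts copied from the product-review output above.
examples = [
    {"neg": 0.376, "neu": 0.624, "pos": 0.0, "compound": -0.3412},
    {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0},
    {"neg": 0.0, "neu": 0.658, "pos": 0.342, "compound": 0.8537},
]
labels = [label_sentiment(s) for s in examples]
print(labels)  # ['negative', 'neutral', 'positive']
```

Note that "Quaker Oats is the way to go." comes out as neutral under this rule, because VADER assigned it a compound score of exactly 0.0.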

## Streaming natural language processing

Now we'll connect this sort of analysis to Spark so we can apply it to streaming data.  As before, we're going to load the Kafka connector package:

In [12]:
import os
SPARK_VERSION="2.2.0"
os.environ["PYSPARK_SUBMIT_ARGS"] = "--packages org.apache.spark:spark-sql-kafka-0-10_2.11:%s pyspark-shell" % SPARK_VERSION

Then we'll set up a Spark session:

In [13]:
import pyspark

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master("local[2]") \
    .appName("sentiment test") \
    .getOrCreate()

And we'll deserialize the JSON message payloads into structured data that we can process easily with Spark's data frame operations or structured streaming:

In [14]:
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.functions import column, from_json

structure = StructType([StructField(fn, StringType(), True) for fn in "text user_id update_id".split()])

records = spark \
  .read \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "kafka.kafka.svc:9092") \
  .option("subscribe", "social-firehose") \
  .load() \
  .select(column("value").cast(StringType()).alias("value")) \
  .select(from_json(column("value"), structure).alias("json")) \
  .select(column("json.update_id"), column("json.user_id").alias("user_id"), column("json.text"))
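
To make the schema concrete, here is a plain-Python sketch of roughly what the `from_json` step does to a single message. The payload is a made-up example whose field names come from `structure`; `parse_payload` is a hypothetical helper, and the real messages on the `social-firehose` topic may look different.

```python
import json

# Same field names as the Spark schema above.
FIELDS = "text user_id update_id".split()


def parse_payload(raw):
    """Decode one JSON payload and keep only the fields named in the schema.

    Missing fields become None, mirroring the nullable schema fields.
    """
    decoded = json.loads(raw)
    return {field: decoded.get(field) for field in FIELDS}


# Hypothetical payload; Kafka delivers values as bytes, hence the decode.
raw = (b'{"text": "Worse than all! #health #news", '
       b'"user_id": "7761320665", "update_id": "00000000000000000002"}')
record = parse_payload(raw.decode("utf-8"))
print(record["update_id"], record["user_id"], record["text"])
```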

We would like to use spaCy and VADER in Spark [user-defined functions](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.udf) so that we can 

1.  write a query that parses longer text in a data frame or structured stream into multiple sentences, and
2.  write a query that generates sentiment scores for individual records.

Since spaCy depends on a rather large and expensive-to-load model file, we don't want to refer to the model file directly in our user-defined function:  it would be prohibitively expensive to serialize and deserialize it every time we wanted to run the function.  Typically with Spark programs, we'd prefer to [broadcast](https://spark.apache.org/docs/latest/rdd-programming-guide.html#broadcast-variables) large data like models, but the spaCy model is tricky to serialize.  So instead, we'll use this trick suggested by the [Sparkling Pandas library](https://github.com/sparklingpandas/sparklingml/blob/627c8f23688397a53e2e9e805e92a54c2be1cf3d/sparklingml/transformation_functions.py#L53), essentially simulating lazily-initialized worker-local storage for spaCy models.

In [15]:
# This code is borrowed from Sparkling Pandas; see here:
# https://github.com/sparklingpandas/sparklingml/blob/627c8f23688397a53e2e9e805e92a54c2be1cf3d/sparklingml/transformation_functions.py#L53
class SpacyMagic(object):
    """
    Simple Spacy Magic to minimize loading time.
    >>> SpacyMagic.get("en")
    <spacy.en.English ...
    """
    _spacys = {}

    @classmethod
    def get(cls, lang):
        if lang not in cls._spacys:
            import spacy
            cls._spacys[lang] = spacy.load(lang)
        return cls._spacys[lang]
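
The pattern is easier to see with the expensive load replaced by a stand-in. In this sketch, `LazyModelCache` and `fake_load` are hypothetical names used only for illustration: the loader runs the first time a language is requested in a given Python process, and the cached object is reused afterwards, which is the behavior `SpacyMagic` relies on in each worker process.

```python
class LazyModelCache(object):
    """Per-process cache that loads each model at most once."""
    _models = {}

    @classmethod
    def get(cls, name, loader):
        # Only call the (expensive) loader on the first request for `name`.
        if name not in cls._models:
            cls._models[name] = loader(name)
        return cls._models[name]


load_calls = []


def fake_load(name):
    """Stand-in for spacy.load: records the call and returns a dummy model."""
    load_calls.append(name)
    return {"model": name}


first = LazyModelCache.get("en", fake_load)
second = LazyModelCache.get("en", fake_load)
print(first is second, len(load_calls))  # True 1
```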

Now we can make a user-defined function to split social-media updates into sentences.  We will use spaCy, which is more expensive than most reasonable heuristics for splitting text into sentences (but also much smarter).

In [16]:
from pyspark.sql.types import ArrayType
from pyspark.sql.functions import udf

def split_sentences_impl(s):
    """ splits an English string into sentences, using spaCy """
    english = SpacyMagic.get("en")
    return [str(sentence) for sentence in english(s).sents]

split_sentences = udf(split_sentences_impl, ArrayType(StringType()))

To see what this looks like, we'll run it on the first 10 rows of the data frame:

In [17]:
split_records = records \
  .orderBy("update_id") \
  .limit(10) \
  .select("update_id", "user_id", split_sentences(column("text")).alias("sentences")) \
  .cache()

split_records.collect()

[Row(update_id='00000000000000000000', user_id='4665560161', sentences=["Elinor wished that the same forbearance could have extended towards herself, but that was impossible, and she was necessarily drawn from the mother's description.", '#socialmedia #marketing #yolo']),
 Row(update_id='00000000000000000000', user_id='1000040647', sentences=['It did not suit her situation or feelings, I might have rejoiced in its termination.', '#tbt #fail #yolo']),
 Row(update_id='00000000000000000000', user_id='9086078734', sentences=['The furniture was in all probability have gained some news of them; and till we know that she ever should receive another so perfectly gratifying in the occasion and the style.', '#retweet #yolo #ff']),
 Row(update_id='00000000000000000001', user_id='3082369400', sentences=['After this period every appearance of equal permanency.', '#health']),
 Row(update_id='00000000000000000001', user_id='5902440326', sentences=['Her performance was pleasing, though by no means tir

We can explode each array into multiple rows to make further processing easier:

In [18]:
from pyspark.sql.functions import explode
sentences = split_records.select("update_id", "user_id", explode(column("sentences")).alias("sentence"))
sentences.show(truncate=False)

+--------------------+----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|update_id           |user_id   |sentence                                                                                                                                                                        |
+--------------------+----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|00000000000000000000|4665560161|Elinor wished that the same forbearance could have extended towards herself, but that was impossible, and she was necessarily drawn from the mother's description.              |
|00000000000000000000|4665560161|#socialmedia #marketing #yolo                                                                                              

Now we'll create our user-defined function for VADER scoring:  it will take text and return a sentiment structure.  Note that here, in contrast to what we did with the spaCy model, we _are_ actually creating a broadcast variable for the VADER analyzer.

In [19]:
from pyspark.sql.types import FloatType

sentiment_fields = "pos neg neu compound".split()
sentiment_structure = StructType([StructField(fn, FloatType(), True) for fn in sentiment_fields])

analyzer_bcast = spark.sparkContext.broadcast(analyzer)

def vader_impl(s):
    va = analyzer_bcast.value
    result = va.polarity_scores(s)
    return [result[key] for key in sentiment_fields]

sentiment_score = udf(vader_impl, sentiment_structure)
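
One detail worth calling out: `vader_impl` returns a plain list, so the values are matched to the struct fields by position, and the list order has to agree with `sentiment_fields`. The plain-Python sketch below uses a score dict copied from the earlier review output to show that mapping; `to_row` is a hypothetical helper, not part of Spark or VADER.

```python
sentiment_fields = "pos neg neu compound".split()


def to_row(scores):
    """Order a VADER score dict the same way vader_impl does."""
    return [scores[key] for key in sentiment_fields]


# Scores for "This oatmeal is not good." from the output above.
scores = {"neg": 0.376, "neu": 0.624, "pos": 0.0, "compound": -0.3412}
row = to_row(scores)
print(row)  # [0.0, 0.376, 0.624, -0.3412]

# Pairing the values with the field names recovers the original dict.
print(dict(zip(sentiment_fields, row)) == scores)  # True
```

This is why a sentiment column displayed as `[0.0,0.629,0.371,...]` should be read as `pos`, then `neg`, then `neu`, with `compound` last.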

Finally, we can annotate each sentence with its sentiment and order from most negative to most positive:

In [21]:
sentences \
  .select("update_id", "user_id", "sentence", sentiment_score(column("sentence")).alias("sentiment")) \
  .orderBy("sentiment.compound") \
  .show()

+--------------------+----------+--------------------+--------------------+
|           update_id|   user_id|            sentence|           sentiment|
+--------------------+----------+--------------------+--------------------+
|00000000000000000002|7761320665|     Worse than all!|[0.0,0.629,0.371,...|
|00000000000000000002|8304162681|I could not help ...|[0.0,0.361,0.639,...|
|00000000000000000000|4665560161|Elinor wished tha...|   [0.0,0.0,1.0,0.0]|
|00000000000000000000|1000040647|    #tbt #fail #yolo|   [0.0,0.0,1.0,0.0]|
|00000000000000000002|8304162681|    #health #retweet|   [0.0,0.0,1.0,0.0]|
|00000000000000000000|9086078734|  #retweet #yolo #ff|   [0.0,0.0,1.0,0.0]|
|00000000000000000000|4665560161|#socialmedia #mar...|   [0.0,0.0,1.0,0.0]|
|00000000000000000001|3359902759|#marketing #follo...|   [0.0,0.0,1.0,0.0]|
|00000000000000000002|7761320665|       #health #news|   [0.0,0.0,1.0,0.0]|
|00000000000000000003|2529702535|She is netting he...|   [0.0,0.0,1.0,0.0]|
|00000000000