## Analyzing/Predicting Sentiment From Amazon Reviews
For this exercise, let's go back to the sentiment analysis we did earlier in the course - specifically, the Amazon reviews dataset.

It's important to start with a clear goal in mind. In this case, we'd like to determine if we can **predict whether a review is positive or negative based on the language in the review.**

We're going to tackle this problem with Spark - so you'll need to apply the principles you've learned thus far in the context of Spark.

Some tips to help you get started:

1. PySpark always needs to point at a running Spark instance. You can do that through a SparkContext (or a SparkSession, which wraps one).
2. We're still working in batch mode, so you'll need to load an entire file into memory in order to run any models you build.
3. Spark likes to execute models in a pipeline, so remember that when the time comes to set up your model.
4. Spark's machine learning algorithms expect numeric variables, so both the review text and the labels have to be converted to numbers before modeling.
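
Tip 4 applies to the label as well as to the features: a star rating has to become a numeric class before a classifier can use it. Here is a minimal sketch in plain Python, assuming each review carries a numeric star rating from 1 to 5 and that ratings of 4 and above count as positive. The helper name `rating_to_label` and the threshold are illustrative assumptions, not part of the exercise.

```python
def rating_to_label(rating, positive_threshold=4.0):
    """Map a star rating to a binary sentiment label: 1 = positive, 0 = negative.

    The threshold is an assumption made for illustration; choose whatever
    cut-off matches your own definition of a positive review.
    """
    return 1 if float(rating) >= positive_threshold else 0


# try it on a handful of example ratings
labels = [rating_to_label(r) for r in [5.0, 4.0, 3.0, 1.0]]
print(labels)  # [1, 1, 0, 0]
```

In Spark, a function like this could be wrapped in a user-defined function with an `IntegerType` return type and applied to the rating column of the DataFrame.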

In [3]:
from pyspark import SparkContext
from pyspark.sql import SparkSession, SQLContext

from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import IntegerType

from pyspark.ml.feature import Tokenizer, Word2Vec

# these imports are how we build and manage our data science processes: cleaning data, preparing a model,
# executing the model, and evaluating the model.
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer, VectorAssembler
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.mllib.evaluation import BinaryClassificationMetrics

from matplotlib import pyplot as plt
import numpy as np
import functools
%matplotlib inline

In [4]:
# we use a set of constants for clarity and simplicity in managing the notebook.
# this allows you to refer back to this cell at any time if you need to either confirm or modify any of these values.

DATA_NAME = "AmznInstantVideo.json"
APP_NAME = "Sentiment Analysis with Amazon Reviews Exercise"
SPARK_URL = "local[*]"
RANDOM_SEED = 141107
TRAINING_DATA_RATIO = 0.8
RF_NUM_TREES = 10
RF_MAX_DEPTH = 4
RF_NUM_BINS = 32


The first thing we always do is create a SparkSession (which starts a SparkContext for us), and then immediately afterward create a SQLContext so we can load and manipulate RDDs and DataFrames.

In [6]:
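# build (or reuse) a SparkSession, then take its underlying SparkContext for the SQLContext
spark = SparkSession.builder.appName(APP_NAME).master(SPARK_URL).getOrCreate()
sc = spark.sparkContext
sqlContext = SQLContext(sc)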

Exception: Java gateway process exited before sending its port number
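
This exception commonly means PySpark could not launch a JVM, for example because Java is not installed or `JAVA_HOME` points to the wrong place. Here is a small standard-library check that can be run before creating the session. The helper name `java_environment` is hypothetical, and this is only a first diagnostic, not a complete fix.

```python
import os
import shutil


def java_environment():
    """Report where (if anywhere) a Java runtime can be found."""
    return {
        # full path to the `java` executable, or None if it is not on PATH
        "java_on_path": shutil.which("java"),
        # value of the JAVA_HOME environment variable, or None if unset
        "JAVA_HOME": os.environ.get("JAVA_HOME"),
    }


print(java_environment())
```

If both values come back as `None`, installing a Java runtime (or setting `JAVA_HOME`) and restarting the notebook kernel is a reasonable first thing to try.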

revisit: https://github.com/Thinkful-Ed/big-data-student-resources/blob/master/examples/Amazon%20Reviews%20Exercise.ipynb