<a href="https://colab.research.google.com/github/susiexia/BigData_ETL-on-Amazon-dataset/blob/master/pyspark_NPLPipeline_MLmodel.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

A pipeline lets us collect all of the transformations we have created as ordered stages and run them in a single pass. The output of each stage is passed on to the next one.
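
The stage-to-stage hand-off can be illustrated without Spark. This is a plain-Python sketch of the idea only, not the `pyspark.ml.Pipeline` API; `run_pipeline`, `lowercase` and `split_words` are hypothetical names made up for the illustration.

```python
from functools import reduce

def lowercase(text):
    # Stage 1: normalise case
    return text.lower()

def split_words(text):
    # Stage 2: split the lowercased text on whitespace
    return text.split()

def run_pipeline(stages, value):
    # Feed the output of each stage into the next one, in order
    return reduce(lambda acc, stage: stage(acc), stages, value)

tokens = run_pipeline([lowercase, split_words], "Crust is not good.")
print(tokens)
```

The order of the list matters: swapping the two stages would fail, because `split_words` returns a list and `lowercase` expects a string.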

In [0]:
# Install Java, Spark, and Findspark
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://www-us.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
!tar xf spark-2.4.5-bin-hadoop2.7.tgz
!pip install -q findspark

# Set Environment Variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.5-bin-hadoop2.7"

# Locate the Spark installation and add pyspark to sys.path
import findspark
findspark.init()

# Start a SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Yelp_NLP').getOrCreate()

In [4]:
# Read in data from an AWS S3 bucket
from pyspark import SparkFiles

url = "https://s3.amazonaws.com/dataviz-curriculum/day_2/yelp_reviews.csv"
spark.sparkContext.addFile(url)
df = spark.read.csv(SparkFiles.get('yelp_reviews.csv'), sep=',', header=True)
df.show(truncate=False)

+--------+---------------------------------------------------------------------------------------------------------------+
|class   |text                                                                                                           |
+--------+---------------------------------------------------------------------------------------------------------------+
|positive|Wow... Loved this place.                                                                                       |
|negative|Crust is not good.                                                                                             |
|negative|Not tasty and the texture was just nasty.                                                                      |
|positive|Stopped by during the late May bank holiday off Rick Steve recommendation and loved it.                        |
|positive|The selection on the menu was great and so were the prices.                                                    |
|negative|Now I 

Additional functions to use
1. Use the StringIndexer function to convert the 'class' string column to label indices. These labels are what the ML model will predict.
2. Use the length function to add a column holding the character length of each review text, to be used as a feature later.
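
To make the two steps concrete, here is a plain-Python sketch of what they compute, assuming the default frequency-descending ordering of `StringIndexer` (the most frequent label gets index 0.0). The helper `string_index` is hypothetical, and breaking ties alphabetically is an assumption of this sketch. The three rows are taken from the sample output above.

```python
from collections import Counter

def string_index(labels):
    # Most frequent label gets 0.0, the next one 1.0, and so on.
    # Ties are broken alphabetically here (an assumption of this sketch).
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda lab: (-counts[lab], lab))
    mapping = {lab: float(i) for i, lab in enumerate(ordered)}
    return [mapping[lab] for lab in labels], mapping

rows = [
    ("positive", "Wow... Loved this place."),
    ("negative", "Crust is not good."),
    ("negative", "Not tasty and the texture was just nasty."),
]

classes = [c for c, _ in rows]
label_col, mapping = string_index(classes)

# Character length of each review, the same quantity length(df.text) produces
length_col = [len(text) for _, text in rows]

print(mapping)
print(label_col)
print(length_col)
```

On these three rows 'negative' is the more frequent class, so it is the one mapped to 0.0. On the full dataset the mapping depends on the actual class counts.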

In [0]:
# import ml.feature functions
from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF, StringIndexer, VectorAssembler
from pyspark.sql.functions import length

In [0]:
# Add the text length as a new column, to be used as a feature later
df = df.withColumn('length', length(df.text))
# df.show()

Create all the **transformation** stages to be applied in our pipeline

*(no need to call fit() and transform() on each stage ourselves; the pipeline does that when it runs)*


In [0]:
# Create the ml.feature stages that will be applied to df
strIndexed = StringIndexer(inputCol='class', outputCol='label')
tokenizer = Tokenizer(inputCol='text', outputCol='tokened')
stopremover = StopWordsRemover(inputCol='tokened', outputCol='removed')
hashngTF = HashingTF(inputCol='removed', outputCol='hashed')
idf = IDF(inputCol='hashed', outputCol='idf_token')
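
What the four text stages compute can be sketched in plain Python. This is an illustration under stated assumptions, not Spark's implementation: `zlib.crc32` stands in for the MurmurHash3 function that `HashingTF` uses, the table has 16 buckets instead of the default 262144, and `STOP_WORDS` is a small made-up subset of the real stop word list. The IDF formula log((m + 1) / (df + 1)), with m documents and document frequency df, is the one Spark documents for `IDF`.

```python
import math
import zlib

NUM_FEATURES = 16  # tiny table for readability; HashingTF defaults to 262144
STOP_WORDS = {"is", "the", "and", "was", "this"}  # illustrative subset only

def tokenize(text):
    # Like Tokenizer: lowercase, then split on whitespace
    return text.lower().split()

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def hashing_tf(tokens):
    # Hash each token to a bucket index and count occurrences per bucket.
    # crc32 is a stand-in hash for this sketch.
    counts = {}
    for t in tokens:
        idx = zlib.crc32(t.encode("utf-8")) % NUM_FEATURES
        counts[idx] = counts.get(idx, 0.0) + 1.0
    return counts

def fit_idf(docs):
    # idf = log((m + 1) / (df + 1)) for each bucket seen in the corpus
    m = len(docs)
    doc_freq = {}
    for doc in docs:
        for idx in doc:
            doc_freq[idx] = doc_freq.get(idx, 0) + 1
    return {idx: math.log((m + 1) / (n + 1)) for idx, n in doc_freq.items()}

texts = ["Crust is not good.", "Not tasty and the texture was just nasty."]
tf_docs = [hashing_tf(remove_stop_words(tokenize(t))) for t in texts]
idf_weights = fit_idf(tf_docs)
tfidf_docs = [{i: tf * idf_weights[i] for i, tf in doc.items()} for doc in tf_docs]
print(tfidf_docs)
```

The token "not" appears in both reviews, so its bucket gets an IDF of log(3/3) = 0 and contributes nothing to the feature vector. Note also that punctuation stays attached to tokens ("good."), because the tokenizer only splits on whitespace.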

Create a **feature vector** containing the output from IDF and the length.

This combines all the raw features into a single column that the ML model will be trained on.

In [0]:
#from pyspark.ml.linalg import Vector
# create feature vectors
clean_up = VectorAssembler(inputCols=['idf_token', 'length'], outputCol='features')
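
As a rough picture of what the assembler does here: the 262144-slot TF-IDF vector gets one extra slot at the end that holds the length. The sketch below mimics that in plain Python using the (size, indices, values) layout Spark prints for sparse vectors. `assemble` is a hypothetical helper, and the numbers are copied from the first row of the output further down.

```python
def assemble(sparse_size, sparse_values, scalar):
    # sparse_values: {index: value} for a vector of length sparse_size.
    # The scalar is appended as one extra slot at index sparse_size.
    indices = sorted(sparse_values)
    values = [sparse_values[i] for i in indices]
    return (sparse_size + 1, indices + [sparse_size], values + [float(scalar)])

idf_token = {
    33933: 4.51085950651685,
    69654: 6.215607598755275,
    123604: 3.8642323415917974,
}
features = assemble(262144, idf_token, 24)
print(features)
```

This is why the feature vectors in the output have size 262145 and why their last index, 262144, carries the review length.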


Create the pipeline and list the stages in the order they need to be executed.

In [0]:
from pyspark.ml import Pipeline
data_prep_pipeline = Pipeline(stages= [strIndexed, tokenizer, stopremover, hashngTF, idf, clean_up])

**RUN** the Pipeline

In [35]:
# Fit the pipeline on df; this produces a PipelineModel
cleaner = data_prep_pipeline.fit(df)
# Use the PipelineModel to transform the original df
cleaned = cleaner.transform(df)
cleaned.select('label','features').show(truncate=False)

+-----+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|label|features                                                                                                                                                                                                                                                                |
+-----+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|0.0  |(262145,[33933,69654,123604,262144],[4.51085950651685,6.215607598755275,3.8642323415917974,24.0])                                                                             

## RUN ML model

Randomly split the cleaned data roughly 70-30. The 70% training set is passed to the Naive Bayes model so it can learn to predict the labels.
The remaining 30% is the testing set, used to check the predictions.
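
The split is random per row, so the two parts are only approximately 70% and 30% of the data. A minimal plain-Python sketch of that idea follows; `random_split` is a hypothetical helper written for illustration, not Spark's code.

```python
import random

def random_split(rows, weights, seed=None):
    # Normalise the weights into cumulative boundaries, e.g. [0.7, 1.0]
    total = sum(weights)
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        # Each row lands in the first bucket whose boundary exceeds its draw;
        # the last bucket also catches any floating-point leftovers
        for i, b in enumerate(bounds):
            if draw < b or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

training_rows, testing_rows = random_split(list(range(1000)), [0.7, 0.3], seed=42)
print(len(training_rows), len(testing_rows))
```

Without a seed, every run gives a different split and therefore a slightly different score. Spark's `randomSplit` also accepts a `seed` argument if a repeatable split is wanted.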

In [0]:
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [0]:
# Break the whole dataset into a training set and a testing set
training, testing = cleaned.randomSplit([0.7,0.3])

# Create a Naive Bayes model
nb = NaiveBayes() 
predictor = nb.fit(training)     # fit on the training data; predictor is a NaiveBayesModel

# Use the fitted model to predict on the testing data
test_results = predictor.transform(testing)
test_results.select('features', 'rawPrediction','probability','prediction').show(truncate= False)
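
For intuition about what the model learns, here is a textbook multinomial Naive Bayes with Laplace smoothing in plain Python. `NaiveBayes()` defaults to the multinomial model type with smoothing 1.0, but this sketch is not Spark's implementation and may differ from it in details such as how the class priors are smoothed. `fit_multinomial_nb` and `predict` are hypothetical helpers, and the four training rows are toy data.

```python
import math
from collections import defaultdict

def fit_multinomial_nb(X, y, smoothing=1.0):
    # X: equal-length lists of non-negative feature values, y: class labels
    n_features = len(X[0])
    class_counts = defaultdict(int)
    feature_sums = defaultdict(lambda: [0.0] * n_features)
    for row, label in zip(X, y):
        class_counts[label] += 1
        for j, v in enumerate(row):
            feature_sums[label][j] += v

    model = {}
    for label, sums in feature_sums.items():
        log_prior = math.log(class_counts[label] / len(y))
        denom = sum(sums) + smoothing * n_features
        log_likelihood = [math.log((s + smoothing) / denom) for s in sums]
        model[label] = (log_prior, log_likelihood)
    return model

def predict(model, row):
    # Score per class = log prior + sum(feature value * log likelihood)
    scores = {
        label: log_prior + sum(v * ll for v, ll in zip(row, log_lik))
        for label, (log_prior, log_lik) in model.items()
    }
    return max(scores, key=scores.get)

X = [[2, 0, 1], [3, 0, 0], [0, 2, 1], [0, 3, 0]]  # toy feature rows
y = [0.0, 0.0, 1.0, 1.0]
nb_model = fit_multinomial_nb(X, y)
print(predict(nb_model, [1, 0, 0]), predict(nb_model, [0, 1, 0]))
```

The smoothing term keeps a feature that never occurred with a class from driving that class's probability to zero.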

## Evaluate the predictions

In [45]:
# Use the multiclass evaluator for a concise summary score.
# Its default metricName is "f1"; pass metricName="accuracy" to report accuracy instead.
acc_eval = MulticlassClassificationEvaluator()

acc = acc_eval.evaluate(test_results)    # action (evaluate)

print("F1 score of model at predicting reviews was: %f" % acc)


0.7272814799644581
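
One thing to keep in mind when reading this number: `MulticlassClassificationEvaluator()` with no arguments reports the weighted F1 score, not plain accuracy. The two can differ, as this plain-Python sketch shows. `accuracy` and `weighted_f1` are hypothetical helpers, the labels are toy data, and the sketch assumes F1 is averaged across classes with weights equal to each class's share of the true labels.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    # Fraction of rows where the prediction matches the true label
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_f1(y_true, y_pred):
    # Per-class F1, averaged with weights equal to each class's share of y_true
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += f1 * n / len(y_true)
    return total

# A model that always predicts class 0.0 on an imbalanced toy set
y_true = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0]
y_pred = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(accuracy(y_true, y_pred), weighted_f1(y_true, y_pred))
```

Here accuracy is 4/6 while the weighted F1 is lower, because the class that is never predicted contributes an F1 of zero.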
