<a href="https://colab.research.google.com/github/susiexia/BigData_Amazon_reviews_ETL_Cloud/blob/master/Amazon_Reviews_Classification_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project PART 3
Perform ETL in the cloud and analyze the data with a Natural Language Processing (NLP) pipeline that includes machine learning.
- (**part 1** in "Amazon_Reriews_ETL_process.ipynb" <https://colab.research.google.com/drive/1N0fTd5rpGznaC15aYb5M63PDzCBJ_0e4>)

- (**part 2** in "Amazon_Reviews_NLP_ML.ipynb" <https://colab.research.google.com/drive/1kAFj2v4wxlFVrCksN4CiKrWHC4dQj8f9>)

**This part uses the helpful_votes and total_votes columns of the vine DataFrame to predict star_rating with a linear regression model. It also fits Naive Bayes and logistic regression classifiers that predict the vine flag.**

In [0]:
# Install Java, Spark and findspark
!apt-get install openjdk-8-jdk-headless -qq #> /dev/null
!wget -q http://www-us.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
!tar xf spark-2.4.5-bin-hadoop2.7.tgz
!pip install -q findspark


# Set Environment Variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.5-bin-hadoop2.7"

# Let findspark locate the Spark installation so that pyspark can be imported
import findspark
findspark.init()

In [0]:
# Create a Spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Amazon_LR').getOrCreate()

In [8]:
from google.colab import files
uploaded = files.upload()

In [0]:
from pyspark.sql.types import StructField, StructType, IntegerType, StringType
schema = [StructField('review_id',StringType(),True), StructField('star_rating', IntegerType(), True), 
          StructField('helpful_votes', IntegerType(), True), StructField('total_votes', IntegerType(), True),
          StructField('vine',StringType(),True)]
final = StructType(fields=schema)

In [9]:
vine_reviews_df = spark.read.csv('vine.csv', sep=',',header=True, schema=final)
vine_reviews_df.show()
vine_reviews_df.printSchema()


+--------------+-----------+-------------+-----------+----+
|     review_id|star_rating|helpful_votes|total_votes|vine|
+--------------+-----------+-------------+-----------+----+
| RWOE9SUY9N8J7|          5|            0|          0|   N|
|R3QRJQDHI4N1NW|          1|            4|          4|   N|
| RS02YZ0GSKWJJ|          2|            1|          1|   N|
|R2EIZK8D31VYZO|          3|            0|          1|   N|
|R23E01JXNIQRXI|          3|            0|          1|   N|
|R1ZGQ80LY59OEK|          5|            3|          3|   N|
|R1JICQZYMO0IM0|          5|            0|          0|   N|
|R2COOCGQX2QXNB|          5|            0|          1|   N|
|R28KX9E8RB627P|          1|            2|          3|   N|
|R1IIDFQLB2TVLQ|          4|            3|          3|   N|
|R3H6FT9FPDZI6C|          5|            0|          1|   N|
| RMXCHZNGBWHRO|          5|            1|          1|   N|
| R6MEWP6M0LG5D|          5|            0|          0|   N|
|R2KDB9Y7EBBYQB|          5|            

root
 |-- review_id: string (nullable = true)
 |-- star_rating: integer (nullable = true)
 |-- helpful_votes: integer (nullable = true)
 |-- total_votes: integer (nullable = true)
 |-- vine: string (nullable = true)

In [10]:
# Add a column with the share of votes marked helpful (a ratio between 0 and 1)
vine_reviews_df = vine_reviews_df.withColumn('helpful_votes_percentage', 
                                             vine_reviews_df['helpful_votes']/vine_reviews_df['total_votes'])

# Reviews with total_votes == 0 get a null ratio; dropna() removes those rows
vine_reviews_df = vine_reviews_df.dropna()
vine_reviews_df.show(5)

+--------------+-----------+-------------+-----------+----+------------------------+
|     review_id|star_rating|helpful_votes|total_votes|vine|helpful_votes_percentage|
+--------------+-----------+-------------+-----------+----+------------------------+
|R3QRJQDHI4N1NW|          1|            4|          4|   N|                     1.0|
| RS02YZ0GSKWJJ|          2|            1|          1|   N|                     1.0|
|R2EIZK8D31VYZO|          3|            0|          1|   N|                     0.0|
|R23E01JXNIQRXI|          3|            0|          1|   N|                     0.0|
|R1ZGQ80LY59OEK|          5|            3|          3|   N|                     1.0|
+--------------+-----------+-------------+-----------+----+------------------------+
only showing top 5 rows
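
The first review in the raw data (0 helpful votes out of 0) is missing from the output above. A minimal plain-Python sketch of why, assuming Spark's default behaviour in which division by zero yields null; `helpful_ratio` is a hypothetical helper used only for illustration:

```python
from typing import Optional

def helpful_ratio(helpful_votes: int, total_votes: int) -> Optional[float]:
    """Mimic Spark's default division: a zero denominator gives null (None)."""
    if total_votes == 0:
        return None
    return helpful_votes / total_votes

# (helpful_votes, total_votes) pairs from the first four rows of the raw table
rows = [(0, 0), (4, 4), (1, 1), (0, 1)]
ratios = [helpful_ratio(h, t) for h, t in rows]

# dropna() keeps only rows without nulls
kept = [r for r in ratios if r is not None]

print(ratios)
print(kept)
```

Dropping these rows also means the models below only ever see reviews that received at least one vote.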



In [11]:
# statistical summary 
vine_reviews_df.describe().show()

+-------+--------------+------------------+-----------------+------------------+------+------------------------+
|summary|     review_id|       star_rating|    helpful_votes|       total_votes|  vine|helpful_votes_percentage|
+-------+--------------+------------------+-----------------+------------------+------+------------------------+
|  count|        353555|            353555|           353555|            353555|353555|                  353555|
|   mean|          null|3.9592962905347115|7.058344529139738| 8.554001499059552|  null|      0.7606408731303688|
| stddev|          null|1.4528450865994198|46.76890783473558|48.930514112084055|  null|      0.3220946950293199|
|    min|R1001H341IC645|                 1|                0|                 1|     N|                     0.0|
|    max| RZZZPYCL9LDIT|                 5|            13362|             13636|     Y|                     1.0|
+-------+--------------+------------------+-----------------+------------------+------+---------

# Naive Bayes

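Result: Naive Bayes scores about 0.97 when predicting whether a review is a Vine review from helpful_votes, total_votes and star_rating. The score is the default metric of MulticlassClassificationEvaluator, which is weighted F1 rather than plain accuracy.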

In [13]:
# Encode the vine column ('N'/'Y') as a numeric label and assemble the feature vector
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml import Pipeline

strIndexed = StringIndexer(inputCol='vine', outputCol='label')

assembler = VectorAssembler(inputCols=['helpful_votes', 'total_votes','star_rating'], outputCol='Attributes')

data_prep_pipeline = Pipeline(stages= [strIndexed, assembler])

pipeline = data_prep_pipeline.fit(vine_reviews_df)
lr_df = pipeline.transform(vine_reviews_df)
clean_df = lr_df.select("Attributes","label")
clean_df.show(5)

+-------------+-----+
|   Attributes|label|
+-------------+-----+
|[4.0,4.0,1.0]|  0.0|
|[1.0,1.0,2.0]|  0.0|
|[0.0,1.0,3.0]|  0.0|
|[0.0,1.0,3.0]|  0.0|
|[3.0,3.0,5.0]|  0.0|
+-------------+-----+
only showing top 5 rows
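
Every row shown above has label 0.0 because StringIndexer, by default, orders labels by descending frequency, so the most common value ('N') receives index 0.0. A plain-Python sketch of that ordering follows; `fit_string_index` and the toy data are illustrative assumptions, and the alphabetical tie-break is a choice made here rather than something the notebook relies on:

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct string to a float index, most frequent first."""
    counts = Counter(values)
    # Sort by descending count; ties are broken alphabetically (assumption)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {label: float(i) for i, label in enumerate(ordered)}

vine_flags = ["N", "N", "Y", "N", "N"]  # toy data, not the real column
mapping = fit_string_index(vine_flags)
labels = [mapping[v] for v in vine_flags]

print(mapping)
print(labels)
```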



In [16]:
# Split the data into a training set (70%) and a testing set (30%)
training, testing = clean_df.randomSplit([0.7,0.3])

# Create a Naive Bayes estimator
nb = NaiveBayes(featuresCol='Attributes', labelCol='label') 
predictor = nb.fit(training)     # fit on the training set; predictor is a NaiveBayesModel

# Apply the fitted model to the testing data
test_results = predictor.transform(testing)
test_results.select('Attributes', 'rawPrediction','probability','prediction').show(truncate= False)


+-------------+---------------------------------------+-----------------------------------------+----------+
|Attributes   |rawPrediction                          |probability                              |prediction|
+-------------+---------------------------------------+-----------------------------------------+----------+
|[0.0,1.0,1.0]|[-2.428943926370085,-7.120069394144616]|[0.9909070871220875,0.009092912877912376]|0.0       |
|[0.0,1.0,1.0]|[-2.428943926370085,-7.120069394144616]|[0.9909070871220875,0.009092912877912376]|0.0       |
|[0.0,1.0,1.0]|[-2.428943926370085,-7.120069394144616]|[0.9909070871220875,0.009092912877912376]|0.0       |
|[0.0,1.0,1.0]|[-2.428943926370085,-7.120069394144616]|[0.9909070871220875,0.009092912877912376]|0.0       |
|[0.0,1.0,1.0]|[-2.428943926370085,-7.120069394144616]|[0.9909070871220875,0.009092912877912376]|0.0       |
|[0.0,1.0,1.0]|[-2.428943926370085,-7.120069394144616]|[0.9909070871220875,0.009092912877912376]|0.0       |
|[0.0,1.0,1.0]|[-2.
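
The rawPrediction column holds one unnormalized log-score per class, and the probability column is those scores after exponentiating and normalizing (a softmax). The sketch below reproduces the first row's probabilities from its rawPrediction values; `softmax` is a helper written for this illustration:

```python
import math

def softmax(log_scores):
    """Turn per-class log-scores into probabilities that sum to 1."""
    m = max(log_scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in log_scores]
    total = sum(exps)
    return [e / total for e in exps]

# rawPrediction of the first row in the output above
raw_prediction = [-2.428943926370085, -7.120069394144616]
probability = softmax(raw_prediction)

print(probability)
```

The predicted class is simply the index of the largest probability, here class 0.0 (not a Vine review).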

In [18]:
# Evaluate the predictions; with no arguments the evaluator uses its default metric, weighted F1
f1_eval = MulticlassClassificationEvaluator()

f1 = f1_eval.evaluate(test_results)

print("F1 score of model at predicting vine reviews was : %f " % f1)

F1 score of model at predicting vine reviews was : 0.976541 
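
A score near 0.97 should be read with the class balance in mind. MulticlassClassificationEvaluator() without arguments reports weighted F1, and on a heavily imbalanced label both weighted F1 and accuracy can be high for a model that never predicts the minority class. The sketch below shows this on made-up data (97 negatives, 3 positives); the numbers are illustrative and are not taken from the review dataset:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's share of y_true."""
    total = len(y_true)
    score = 0.0
    for c in set(y_true):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += f1 * (tp + fn) / total
    return score

y_true = [0] * 97 + [1] * 3   # toy imbalanced labels
y_pred = [0] * 100            # a model that always predicts the majority class

acc = accuracy(y_true, y_pred)
wf1 = weighted_f1(y_true, y_pred)
print("accuracy: %.3f, weighted F1: %.3f" % (acc, wf1))
```

Comparing the score against such a majority-class baseline, or looking at per-class recall, gives a fairer picture of how well Vine reviews are actually detected.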


# Linear Regression

In [0]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# Combine the two predictor columns into a single feature vector
assembler = VectorAssembler(inputCols=['helpful_votes', 'total_votes'], outputCol='Attributes')

assembled_df = assembler.transform(vine_reviews_df)

# Keep only the features and the target column
finalized_data = assembled_df.select("Attributes","star_rating")
finalized_data.show(5)
finalized_data.count()


+----------+-----------+
|Attributes|star_rating|
+----------+-----------+
| [4.0,4.0]|          1|
| [1.0,1.0]|          2|
| [0.0,1.0]|          3|
| [0.0,1.0]|          3|
| [3.0,3.0]|          5|
+----------+-----------+
only showing top 5 rows



2219424

In [0]:
# Split the data into a training set (70%) and a testing set (30%)
training,testing = finalized_data.randomSplit([0.7,0.3])

regressor = LinearRegression(featuresCol = 'Attributes', labelCol = 'star_rating')
# Fit the model on the training set (regressor now refers to the fitted model)
regressor = regressor.fit(training)
# Evaluate the fitted model on the testing set
predicting = regressor.evaluate(testing)
# Show the predicted star ratings next to the actual ones
predicting.predictions.show(5)

+----------+-----------+-----------------+
|Attributes|star_rating|       prediction|
+----------+-----------+-----------------+
| [0.0,1.0]|          1|3.918252893250348|
| [0.0,1.0]|          1|3.918252893250348|
| [0.0,1.0]|          1|3.918252893250348|
| [0.0,1.0]|          1|3.918252893250348|
| [0.0,1.0]|          1|3.918252893250348|
+----------+-----------+-----------------+
only showing top 5 rows



In [0]:
# Coefficients of the regression model (one per feature: helpful_votes, total_votes)
coeff = regressor.coefficients
# Intercept of the regression model
intr = regressor.intercept
print("The coefficient of the model is : %a" % coeff)
print("The Intercept of the model is : %f" % intr)

The coefficient of the model is : DenseVector([0.097, -0.0914])
The Intercept of the model is : 4.007358
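
A linear regression prediction is the intercept plus the dot product of the coefficients and the feature vector. The sketch below applies the values printed above to a review with 0 helpful votes out of 1 total vote; `predict_star_rating` is a hypothetical helper. Because the printed coefficients are rounded and randomSplit is unseeded, the result need not match the prediction column of another run exactly:

```python
# Values as printed above (the coefficients are rounded in that output)
coefficients = [0.097, -0.0914]   # helpful_votes, total_votes
intercept = 4.007358

def predict_star_rating(helpful_votes, total_votes):
    """intercept + sum of coefficient * feature."""
    features = [helpful_votes, total_votes]
    return intercept + sum(w * x for w, x in zip(coefficients, features))

example = predict_star_rating(0, 1)
print("%.6f" % example)
```

With coefficients this small, every prediction stays close to the intercept of about 4, which is also close to the mean star rating in the statistical summary.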


In [0]:
from pyspark.ml.evaluation import RegressionEvaluator
# Named reg_eval to avoid shadowing Python's built-in eval()
reg_eval = RegressionEvaluator(labelCol="star_rating", predictionCol="prediction", metricName="rmse")

# Root Mean Square Error
rmse = reg_eval.evaluate(predicting.predictions)
print("RMSE: %.3f" % rmse)

# Mean Square Error
mse = reg_eval.evaluate(predicting.predictions, {reg_eval.metricName: "mse"})
print("MSE: %.3f" % mse)

# Mean Absolute Error
mae = reg_eval.evaluate(predicting.predictions, {reg_eval.metricName: "mae"})
print("MAE: %.3f" % mae)

# r2 - coefficient of determination
r2 = reg_eval.evaluate(predicting.predictions, {reg_eval.metricName: "r2"})
print("r2: %.3f" % r2)


RMSE: 1.455
MSE: 2.118
MAE: 1.210
r2: 0.026
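
For reference, the four metrics can be written out in a few lines of plain Python. This is a sketch of the standard definitions on toy data, not the Spark implementation; `regression_metrics` is a name chosen for this example. It also shows that a model which always predicts the mean rating has an r2 of exactly 0, which puts the reported r2 of 0.026 in context:

```python
import math

def regression_metrics(y_true, y_pred):
    """Return RMSE, MSE, MAE and r2 for paired lists of actual and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    return {
        "rmse": math.sqrt(mse),
        "mse": mse,
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot,
    }

y_true = [1, 2, 3, 4, 5]   # toy star ratings
y_pred = [3, 3, 3, 3, 3]   # always predict the mean rating
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

An r2 this close to 0 suggests that helpful_votes and total_votes explain almost none of the variation in star_rating.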


# Logistic Regression

Logistic regression models the probability of a binary outcome from a set of predictor variables. Here the predictors are 'helpful_votes', 'total_votes' and 'star_rating', and the outcome is the 'vine' column.

In [0]:
# Encode the vine column ('N'/'Y') as a numeric label and assemble the feature vector
from pyspark.ml.feature import StringIndexer
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline

strIndexed = StringIndexer(inputCol='vine', outputCol='label')

assembler = VectorAssembler(inputCols=['helpful_votes', 'total_votes','star_rating'], outputCol='Attributes')

data_prep_pipeline = Pipeline(stages= [strIndexed, assembler])

pipeline = data_prep_pipeline.fit(vine_reviews_df)
lr_df = pipeline.transform(vine_reviews_df)
clean_df = lr_df.select("Attributes","label")
clean_df.show(5)

+-------------+-----+
|   Attributes|label|
+-------------+-----+
|[4.0,4.0,1.0]|  0.0|
|[1.0,1.0,2.0]|  0.0|
|[0.0,1.0,3.0]|  0.0|
|[0.0,1.0,3.0]|  0.0|
|[3.0,3.0,5.0]|  0.0|
+-------------+-----+
only showing top 5 rows



In [0]:
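# Split the data into a training set (70%) and a testing set (30%)
train,test = clean_df.randomSplit([0.7,0.3])

# Fit the logistic regression model on the training set
lr = LogisticRegression(labelCol ="label", featuresCol="Attributes")
model = lr.fit(train)
# Score both sets so that train and test performance can be compared
predict_train = model.transform(train)
predict_test = model.transform(test)
predict_test.show(10)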

+-------------+-----+--------------------+--------------------+----------+
|   Attributes|label|       rawPrediction|         probability|prediction|
+-------------+-----+--------------------+--------------------+----------+
|[0.0,1.0,1.0]|  0.0|[4.92895245807518...|[0.99281787870779...|       0.0|
|[0.0,1.0,1.0]|  0.0|[4.92895245807518...|[0.99281787870779...|       0.0|
|[0.0,1.0,1.0]|  0.0|[4.92895245807518...|[0.99281787870779...|       0.0|
|[0.0,1.0,1.0]|  0.0|[4.92895245807518...|[0.99281787870779...|       0.0|
|[0.0,1.0,1.0]|  0.0|[4.92895245807518...|[0.99281787870779...|       0.0|
|[0.0,1.0,1.0]|  0.0|[4.92895245807518...|[0.99281787870779...|       0.0|
|[0.0,1.0,1.0]|  0.0|[4.92895245807518...|[0.99281787870779...|       0.0|
|[0.0,1.0,1.0]|  0.0|[4.92895245807518...|[0.99281787870779...|       0.0|
|[0.0,1.0,1.0]|  0.0|[4.92895245807518...|[0.99281787870779...|       0.0|
|[0.0,1.0,1.0]|  0.0|[4.92895245807518...|[0.99281787870779...|       0.0|
+-------------+-----+----
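
For a binary logistic regression, the probability of a class is the logistic (sigmoid) function of its raw score. The sketch below applies it to the first component of rawPrediction, using only the digits visible in the truncated output above, and recovers the first component of the probability column; `sigmoid` is a helper written for this illustration:

```python
import math

def sigmoid(z):
    """Logistic function: maps a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# First component of rawPrediction, as far as the truncated output shows it
raw_score_class0 = 4.92895245807518
p_class0 = sigmoid(raw_score_class0)

print(p_class0)
```

The probability of the other class is 1 minus this value, so these rows are assigned to class 0.0 with about 99% confidence.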

In [0]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator=BinaryClassificationEvaluator(rawPredictionCol='rawPrediction',labelCol='label')
predict_test.select("label","rawPrediction","prediction","probability").show(5)
print("The area under ROC for train set is {}".format(evaluator.evaluate(predict_train)))
print("The area under ROC for test set is {}".format(evaluator.evaluate(predict_test)))

+-----+--------------------+----------+--------------------+
|label|       rawPrediction|prediction|         probability|
+-----+--------------------+----------+--------------------+
|  0.0|[4.92895245807518...|       0.0|[0.99281787870779...|
|  0.0|[4.92895245807518...|       0.0|[0.99281787870779...|
|  0.0|[4.92895245807518...|       0.0|[0.99281787870779...|
|  0.0|[4.92895245807518...|       0.0|[0.99281787870779...|
|  0.0|[4.92895245807518...|       0.0|[0.99281787870779...|
+-----+--------------------+----------+--------------------+
only showing top 5 rows

The area under ROC for train set is 0.5978297406156183
The area under ROC for test set is 0.6018163911448003


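The area under the ROC curve (AUC) measures how well the model's scores separate the two classes: it is the probability that a randomly chosen Vine review receives a higher score than a randomly chosen non-Vine review. An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation, so the test AUC of about 0.6 means the model is only slightly better than chance. It does not mean that the classification threshold is 60%.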
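
AUC can be computed directly as the fraction of (positive, negative) pairs in which the positive example gets the higher score. A small sketch on made-up labels and scores follows; `pairwise_auc` is an illustrative helper, not the method BinaryClassificationEvaluator uses internally:

```python
def pairwise_auc(labels, scores):
    """Fraction of positive/negative pairs ranked correctly (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [0, 0, 0, 1, 1]                # toy labels: 1 = Vine review
scores = [0.1, 0.4, 0.35, 0.8, 0.3]     # toy model scores for class 1
auc = pairwise_auc(labels, scores)
print("AUC: %.3f" % auc)
```

Here 4 of the 6 positive/negative pairs are ordered correctly, so the AUC is about 0.667. No threshold is involved in the calculation.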