# Coursework Part 2: Amazon Fine Food Reviews

 ***By:*** *Vaida Gulbinskaite and Gediminas Sadaunykas*

# Data Description

For the second part of the coursework we use the Amazon Fine Food Reviews dataset, which contains 568,454 food reviews that Amazon.com users left between October 1999 and October 2012. We selected this dataset because it offers a wide range of avenues that could be explored, from building product recommendation models to sentiment analysis. It has been widely used by professionals and aspiring data scientists, which makes it a compelling choice, and its size makes it a viable 'playground' for building big data models.

We decided to tackle a text classification task: predicting whether a review was helpful or unhelpful to other buyers on Amazon, and identifying which words are distinctive for helpful and unhelpful reviews. To do so, we created a new metric called 'Helpfulness_perct', the percentage of voters who found the review helpful. It is derived from HelpfulnessNumerator (the number of people who voted the review helpful) and HelpfulnessDenominator (the number of people who 'reviewed' the review). Helpfulness_perct was then converted into two classes: 1 for helpful reviews (Helpfulness_perct >= 75%) and 0 for unhelpful reviews (Helpfulness_perct <= 25%). Reviews falling between the two thresholds were discarded.
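
The labelling rule can be written out in a few lines of plain Python. This is only an illustrative sketch: the function name `helpfulness_label` and the example vote counts are hypothetical, and the minimum of 10 votes is the visibility threshold applied in the outlier treatment step of the notebook.

```python
def helpfulness_label(numerator, denominator, min_votes=10):
    """Return 1 (helpful), 0 (unhelpful) or None (the review is dropped)."""
    if denominator < min_votes:
        return None  # too few votes for the ratio to be meaningful
    pct = 100.0 * numerator / denominator
    if pct > 100:
        return None  # more helpful votes than votes cast: treated as an outlier
    if pct >= 75:
        return 1
    if pct <= 25:
        return 0
    return None  # ambiguous middle band is discarded


# Hypothetical (numerator, denominator) pairs
examples = [(9, 10), (2, 12), (6, 12), (3, 4), (12, 10)]
for num, den in examples:
    print(num, den, helpfulness_label(num, den))
```

Returning `None` for dropped rows keeps the three outcomes (helpful, unhelpful, excluded) explicit in one place.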

First, the data was preprocessed: data types were set, outliers were treated, and the classes were balanced (about 86% of the labelled reviews were helpful). Then two data analysis pipelines were implemented: one with a single training/validation (80%/20%) split and a single combination of parameters, the other with 10-fold cross-validation over a parameter grid of 27 possible combinations. Models were evaluated using ROC-AUC, precision, recall and F1 scores. Finally, the most significant words were identified, defined by the total sum of their TF.IDF values after adjusting for stopwords and common words.
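
To make the size of the search concrete, the sketch below enumerates the grid in plain Python. The candidate values are the ones used for the second pipeline in this notebook; the dictionary layout and variable names are only illustrative, and the final "+ 1" assumes the winning combination is refitted once on all training data after cross-validation.

```python
from itertools import product

# Candidate values used for the second pipeline's parameter grid
reg_params = [0.01, 0.1, 1]
max_iters = [10, 50, 100]
vocab_sizes = [10, 100, 1000]

grid = [
    {"regParam": r, "maxIter": m, "vocabSize": v}
    for r, m, v in product(reg_params, max_iters, vocab_sizes)
]

num_folds = 10
# One fit per combination per fold, plus one assumed final refit of the best combination
total_fits = len(grid) * num_folds + 1

print(len(grid), total_fits)
```

The number of pipeline fits grows multiplicatively with every extra grid dimension, which is why the cross-validated pipeline takes so much longer than the single split.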


# Data Preprocessing


In [1]:
## Read data into DataFrame

from pyspark.sql.functions import lit
from pyspark.sql.types import FloatType
from pyspark.sql import SparkSession
from pyspark.sql import Row

spark = SparkSession.builder.getOrCreate()
raw_DF = spark.read.csv("hdfs://saltdean/data/reviews/Reviews.csv", header=True, inferSchema=True) # Read the data into Data Frame

#Changing type of data from string (which is set by default) to float
changedTypedf = raw_DF.withColumn("HelpfulnessNumerator", raw_DF["HelpfulnessNumerator"].cast(FloatType()))
changedTypedf = changedTypedf.withColumn("Score", changedTypedf["Score"].cast(FloatType()))
changedTypedf = changedTypedf.withColumn("HelpfulnessDenominator", changedTypedf["HelpfulnessDenominator"].cast(FloatType()))

# Create a new column that holds the helpfulness percentage
review_df = changedTypedf.withColumn(
    "Helpfulness_perct",
    (changedTypedf["HelpfulnessNumerator"] / changedTypedf["HelpfulnessDenominator"]) * 100)


In [2]:
## OUTLIER TREATMENT

# Replace nulls with 0: rows with a zero denominator have no defined percentage.
review_df = review_df.na.fill(0)

# Outlier definition 1: Helpfulness_perct > 100% (2 rows)
# Outlier definition 2: low visibility, i.e. HelpfulnessDenominator < 10.
review_df = review_df[(review_df['Helpfulness_perct']<=100) & (review_df['HelpfulnessDenominator']>=10)]

print('Total number of instances: %d ' % review_df.count())
print('-'*100)

## (NO MISSING VALUES)

Total number of instances: 24892 
----------------------------------------------------------------------------------------------------


In [3]:
## Keep only rows for which we will have labels

review_df = review_df[(review_df['Helpfulness_perct']>=75) | (review_df['Helpfulness_perct']<=25)]

print('Total number of instances: %d' % review_df.count())
print('-'*100)

Total number of instances: 19752
----------------------------------------------------------------------------------------------------


In [4]:
## Create the labels (>=75% helpful, <=25% unhelpful)

from pyspark.sql.functions import udf
from pyspark.sql.types import *

# Define the function that creates the labels.
# Rows between 25% and 75% were removed above, so anything below 75 is unhelpful.
def label(helpfulness):
    if helpfulness >= 75:
        return 1
    return 0

labelUDF = udf(label, IntegerType())  # Create the UDF
review_df = review_df.withColumn('Helpfulness_label', labelUDF(review_df['Helpfulness_perct'])) # Apply the created UDF


# Keep only two columns
review_df=review_df['Helpfulness_label', 'Text']
review_df.groupBy('Helpfulness_label').count().orderBy('Helpfulness_label').show()

+-----------------+-----+
|Helpfulness_label|count|
+-----------------+-----+
|                0| 2757|
|                1|16995|
+-----------------+-----+



In [5]:
## Replicating unhelpful reviews

# As a result of the class imbalance, logistic regression struggled to differentiate between the classes and tended to predict the more frequent one.
# We tried to deal with the imbalance by changing the un-/helpfulness thresholds (<30%; >90%) and the outlier definition (helpfulness denominator < 5).
# In the end we chose oversampling: unhelpful reviews were quadrupled.

from pyspark.sql.functions import array, explode, lit

# Create a DataFrame with helpful reviews only. It is used to filter out duplicated helpful reviews.
helpful_df = review_df[(review_df['Helpfulness_label']==1)] 

# Duplicate every row; the extra copy gets Helpfulness_label == 0.
review_df = review_df.withColumn("Helpfulness_label", explode(array(lit(0),(review_df["Helpfulness_label"]))))

# Filter out the copies of helpful reviews that were assigned 0 when the data was duplicated.
review_df =review_df[((helpful_df['Helpfulness_label']==review_df['Helpfulness_label']) & (review_df['Text']==helpful_df['Text'])|((review_df['Helpfulness_label']==0) &(review_df['Text']!=helpful_df['Text'])))]

# Duplicate again!
review_df = review_df.withColumn("Helpfulness_label", explode(array(lit(0),(review_df["Helpfulness_label"]))))

# Filter out the copies of helpful reviews that were assigned 0 when the data was duplicated.
review_df =review_df[((helpful_df['Helpfulness_label']==review_df['Helpfulness_label']) & (review_df['Text']==helpful_df['Text'])|((review_df['Helpfulness_label']==0) &(review_df['Text']!=helpful_df['Text'])))]

# TESTING DUPLICATION
#review_df.groupBy('Helpfulness_label').count().orderBy('Helpfulness_label').show() 
#review_df.count()
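
The effect of the oversampling step can be pictured without Spark. The sketch below is a plain-Python illustration of repeating minority-class rows, not the notebook's implementation; the helper `oversample` and the toy rows are hypothetical.

```python
from collections import Counter


def oversample(rows, minority_label, factor):
    """Return a new list in which every minority-class row appears `factor` times."""
    out = []
    for label, text in rows:
        copies = factor if label == minority_label else 1
        out.extend([(label, text)] * copies)
    return out


# Toy (label, text) rows: three helpful reviews, one unhelpful
rows = [(1, "great coffee"), (1, "my dog loves it"), (1, "tasty"), (0, "bad")]
balanced = oversample(rows, minority_label=0, factor=4)
counts = Counter(label for label, _ in balanced)
print(counts)
```

One caveat worth keeping in mind with this technique: if duplication happens before the train/test split, copies of the same review can end up on both sides of the split, which tends to make test scores look better than they would on unseen reviews.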

In [6]:
#Renaming columns for pipeline format
review_df = review_df.selectExpr("Helpfulness_label as label", 'Text as Text') 

# Data analysis

In [7]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import Tokenizer, IDF, CountVectorizer
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder, TrainValidationSplit
from pyspark.sql.functions import col

### 1st Pipeline

In [9]:
## TRAINING/VALIDATION SPLIT ESTIMATOR (PIPELINE 1, FULL DATA)

from time import time

training_df, test_df = review_df.randomSplit([0.8, 0.2]) # Split data into training/testing

t1=time()
tokenizer = Tokenizer(inputCol="Text", outputCol="words") # Tokenizer (transformer)
Cvectorizer = CountVectorizer(inputCol=tokenizer.getOutputCol(), outputCol="featuresTF") # Term-count feature extractor (estimator)
idf = IDF(inputCol=Cvectorizer.getOutputCol(), outputCol="features") # IDF rescaling (estimator)
lr = LogisticRegression() # Classifier (estimator)
pipeline_est = Pipeline(stages=[tokenizer ,Cvectorizer, idf, lr]) # Pipeline estimator

bc_eval=BinaryClassificationEvaluator() # Evaluator, measures ROC-AUC by default.
paramGrid = ParamGridBuilder().addGrid(lr.regParam, [0.1]).addGrid(lr.maxIter, [50]).addGrid(Cvectorizer.vocabSize, [100]).build() # Single parameter combination: regParam 0.1, maxIter 50, vocabSize 100.

TVS_est = TrainValidationSplit(estimator=pipeline_est, evaluator=bc_eval, estimatorParamMaps=paramGrid, trainRatio=0.8) # Train/validation split estimator, 80/20.
TVS_mod = TVS_est.fit(training_df) # Best model (the only model)
prediction_train = TVS_mod.transform(training_df)
t2=time()

print("ROC-AUC on training data, pipeline 1: ", bc_eval.evaluate(prediction_train))
print('Duration of training/validation, pipeline 1: {}s'.format(t2-t1))


ROC-AUC on training data, pipeline 1:  0.7983221662251336
Duration of training/validation, pipeline 1: 36.028708696365356s


In [10]:
# EVALUATION (PIPELINE 1)

prediction_test = TVS_mod.transform(test_df) # Get predictions on test data

predictionAndLabels=prediction_test['label', 'prediction'] # Keep only two columns for evaluation: true label vs predicted label

TP_DF=predictionAndLabels[(predictionAndLabels['label']==1) & (predictionAndLabels['prediction']==1)].count() #True positive
FP_DF=predictionAndLabels[(predictionAndLabels['label']==0) & (predictionAndLabels['prediction']==1)].count() #False positive
TN_DF=predictionAndLabels[(predictionAndLabels['label']==0) & (predictionAndLabels['prediction']==0)].count() #True negative
FN_DF=predictionAndLabels[(predictionAndLabels['label']==1) & (predictionAndLabels['prediction']==0)].count() #False negative

Accuracy=(TP_DF+TN_DF)/(TP_DF+TN_DF+FP_DF+FN_DF) # Accuracy
Precision_Positive=TP_DF/(TP_DF+FP_DF) # Precision for positive class
Precision_Negative=TN_DF/(TN_DF+FN_DF) # Precision for negative class
Recall_Positive=TP_DF/(TP_DF+FN_DF) # Recall for positive class
Recall_Negative=TN_DF/(TN_DF+FP_DF) # Recall for negative class
F1_positive=2*(Precision_Positive*Recall_Positive)/(Precision_Positive+Recall_Positive) # F1 score for positive class
F1_negative=2*(Precision_Negative*Recall_Negative)/(Precision_Negative+Recall_Negative) # F1 score for negative class

print('Accuracy: {:.4f}'.format(Accuracy))
print('Precision positive class: {:.4f} , zero class: {:.4f}'.format(Precision_Positive, Precision_Negative))
print('Recall positive class: {:.4f} , zero class: {:.4f}'.format(Recall_Positive, Recall_Negative))
print('F1 score positive: {:.4f}, zero class: {:.4f}'.format(F1_positive, F1_negative))

Accuracy: 0.7265
Precision positive class: 0.7465 , zero class: 0.6878
Recall positive class: 0.8220 , zero class: 0.5842
F1 score positive: 0.7825, zero class: 0.6318


In [15]:
# CROSS VALIDATION ESTIMATOR (PIPELINE 2)

from pyspark.ml.feature import StopWordsRemover
from pyspark.ml.tuning import CrossValidator

training_df2, test_df2 = review_df.randomSplit([0.8, 0.2]) # Splitting data

t3=time()
tokenizer_2 = Tokenizer(inputCol="Text", outputCol="words") # Tokenizer
remover = StopWordsRemover(inputCol=tokenizer_2.getOutputCol(), outputCol="filtered") # Stopword remover (NEW)
Cvectorizer_2 = CountVectorizer(inputCol=remover.getOutputCol(), outputCol="featuresTF") # Term counts over a learned vocabulary (no hashing)
idf_2 = IDF(inputCol=Cvectorizer_2.getOutputCol(), outputCol="features") # TF.IDF
lr_2 = LogisticRegression() # Logistic regression
pipeline_est2 = Pipeline(stages=[tokenizer_2,remover, Cvectorizer_2, idf_2, lr_2]) # Estimator

bc_eval2=BinaryClassificationEvaluator() # Evaluator
paramGrid2 = ParamGridBuilder().addGrid(lr_2.regParam, [0.01, 0.1, 1]).addGrid(lr_2.maxIter, [10, 50, 100]).addGrid(Cvectorizer_2.vocabSize, [10, 100, 1000]).build() # Parameter grid (3 x 3 x 3 = 27 combinations)

crossval_est=CrossValidator(estimator=pipeline_est2, estimatorParamMaps=paramGrid2, evaluator=bc_eval2, numFolds=10) # 10-fold cross-validation estimator
cv_Model = crossval_est.fit(training_df2) # Best model
prediction_train2 = cv_Model.transform(training_df2)
t4=time()

print("ROC-AUC on training data, pipeline 2: ", bc_eval2.evaluate(prediction_train2))
print('Duration of training/validation, pipeline 2: {}s'.format(t4-t3))

ROC-AUC on training data, pipeline 2:  0.9181437209256595
Duration of training/validation, pipeline 2: 3353.651606321335s


In [16]:
# EVALUATION (PIPELINE 2)
prediction_test2 = cv_Model.transform(test_df2)

predictionAndLabels=prediction_test2['label', 'prediction'] # Keep only two columns for evaluation: true label vs predicted label

TP_DF=predictionAndLabels[(predictionAndLabels['label']==1) & (predictionAndLabels['prediction']==1)].count() #True positive
FP_DF=predictionAndLabels[(predictionAndLabels['label']==0) & (predictionAndLabels['prediction']==1)].count() #False positive
TN_DF=predictionAndLabels[(predictionAndLabels['label']==0) & (predictionAndLabels['prediction']==0)].count() #True negative
FN_DF=predictionAndLabels[(predictionAndLabels['label']==1) & (predictionAndLabels['prediction']==0)].count() #False negative

Accuracy=(TP_DF+TN_DF)/(TP_DF+TN_DF+FP_DF+FN_DF) # Accuracy
Precision_Positive=TP_DF/(TP_DF+FP_DF) # Precision for positive class
Precision_Negative=TN_DF/(TN_DF+FN_DF) # Precision for negative class
Recall_Positive=TP_DF/(TP_DF+FN_DF) # Recall for positive class
Recall_Negative=TN_DF/(TN_DF+FP_DF) # Recall for negative class
F1_positive=2*(Precision_Positive*Recall_Positive)/(Precision_Positive+Recall_Positive) # F1 score for positive class
F1_negative=2*(Precision_Negative*Recall_Negative)/(Precision_Negative+Recall_Negative) # F1 score for negative class

print('Accuracy, pipeline 2: {:.4f}'.format(Accuracy))
print('Precision, pipeline 2, positive class: {:.4f} , zero class: {:.4f}'.format(Precision_Positive, Precision_Negative))
print('Recall, pipeline 2, positive class: {:.4f} , zero class: {:.4f}'.format(Recall_Positive, Recall_Negative))
print('F1 score, pipeline 2, positive: {:.4f}, zero class: {:.4f}'.format(F1_positive, F1_negative))


Accuracy, pipeline 2: 0.8007
Precision, pipeline 2, positive class: 0.8472 , zero class: 0.7327
Recall, pipeline 2, positive class: 0.8226 , zero class: 0.7662
F1 score, pipeline 2, positive: 0.8347, zero class: 0.7490


In [63]:
# BEST PARAMETERS

paramMap=list(zip(cv_Model.getEstimatorParamMaps(), cv_Model.avgMetrics)) # Pair each parameter combination with its cross-validation metric
paramMax=max(paramMap, key=lambda x: x[1]) # Pick the combination with the highest metric
print(paramMax) # Print

({Param(parent='LogisticRegression_463980565906a8c07060', name='maxIter', doc='max number of iterations (>= 0).'): 50, Param(parent='CountVectorizer_48249614a40f8a49a4c0', name='vocabSize', doc='max size of the vocabulary. Default 1 << 18.'): 1000, Param(parent='LogisticRegression_463980565906a8c07060', name='regParam', doc='regularization parameter (>= 0).'): 0.01}, 8.932434716819103)


### Important features

In [24]:
### Training data/UNHELPFUL REVIEWS

prediction_train2.createOrReplaceTempView('final_train_df') # Register the table
SQL_0='SELECT label, filtered, features FROM final_train_df WHERE label==0' # SQL query to leave UNHELPFUL REVIEWS
SQL_train_0=spark.sql(SQL_0) # Get UNHELPFUL reviews dataframe

Cvectorizer_0=CountVectorizer(inputCol='filtered', outputCol='vectors', vocabSize=1000) # 1 step create vocabulary (best size)
vocabulary_0 = Cvectorizer_0.fit(SQL_train_0).vocabulary # 2 step create vocabulary


In [25]:
def unpacklists(x):
    x1, x2=x
    newlist=[]
    if len(x1)==len(x2):
        for i in range(len(x1)):
            newlist.append((x1[i],x2[i]))
    else:
        print('Word list != TF.IDF list')
    return newlist

In [33]:
UNHELP_TRAIN_RDD= SQL_train_0.rdd.map(lambda x: ([w for w in x['filtered'] if w in vocabulary_0], [x['features'][vocabulary_0.index(w)] for w in x['filtered'] if w in vocabulary_0]))
UNHELP_TRAIN_RDD2=UNHELP_TRAIN_RDD.flatMap(lambda x: unpacklists(x)) 
UNHELP_TRAIN_RDD3=UNHELP_TRAIN_RDD2.reduceByKey(lambda a,b: a+b) # REDUCE by adding TF.IDF values.

print('TOP 25 terms for TRAIN DATA/UNHELPFUL REVIEWS \n')
UNHELP_TRAIN_RDD3.sortBy(lambda a: a[1], ascending=False).take(25) # Take top 25 most important words/terms

TOP 25 terms for TRAIN DATA/UNHELPFUL REVIEWS 



[('', 61148.64659787561),
 ('/><br', 28604.208956306662),
 ('like', 6580.1790703728657),
 ('bags', 4462.9505104399741),
 ('taste', 4201.1463931012458),
 ('good', 3810.374888460929),
 ('much', 2287.3237676207455),
 ('try', 1634.5649279682625),
 ('us', 1613.466946905728),
 ('something', 1517.8881027582836),
 ('give', 1358.755589770185),
 ('say', 1276.9888102650214),
 ('would', 1235.3370499928162),
 ('however,', 1025.0455407480329),
 ('one', 976.52909909644995),
 ('baby', 771.59632420282401),
 ("it's", 767.40236691715802),
 ('milk', 752.66701822286734),
 ('product', 751.90723645696619),
 ('it.<br', 700.0025735796537),
 ('get', 533.56841592524358),
 ('buy', 524.42148663521516),
 ('says', 484.89410667982054),
 ('cheaper', 480.33221802943621),
 ("don't", 472.87959840949833)]

In [34]:
### Training data/HELPFUL REVIEWS

SQL_1='SELECT label, filtered, features FROM final_train_df WHERE label==1' # SQL query to leave HELPFUL REVIEWS
SQL_train_1=spark.sql(SQL_1) # Get HELPFUL reviews dataframe

Cvectorizer_1=CountVectorizer(inputCol='filtered', outputCol='vectors', vocabSize=1000) # 1 step create vocabulary
vocabulary_1 = Cvectorizer_1.fit(SQL_train_1).vocabulary # 2 step create vocabulary


In [37]:
HELP_TRAIN_RDD= SQL_train_1.rdd.map(lambda x: ([w for w in x['filtered'] if w in vocabulary_1], [x['features'][vocabulary_1.index(w)] for w in x['filtered'] if w in vocabulary_1]))
HELP_TRAIN_RDD2=HELP_TRAIN_RDD.flatMap(lambda x: unpacklists(x)) 
HELP_TRAIN_RDD3=HELP_TRAIN_RDD2.reduceByKey(lambda a,b: a+b) # REDUCE by adding TF.IDF values.

print('TOP 25 terms for TRAIN DATA/HELPFUL REVIEWS \n')
HELP_TRAIN_RDD3.sortBy(lambda a: a[1], ascending=False).take(25) # Take top 25 most important words/terms

TOP 25 terms for TRAIN DATA/HELPFUL REVIEWS 



[('', 203819.79802735956),
 ('/><br', 74836.919605018164),
 ('coffee', 30652.178770446109),
 ('like', 16542.342633498556),
 ('dog', 14854.644743638695),
 ('one', 13991.010111496938),
 ('would', 11089.030436406483),
 ('taste', 10016.690914293471),
 ('really', 9317.8796913897349),
 ('much', 8022.9151884723651),
 ('time', 5996.1203812019521),
 ('try', 4724.7993719982023),
 ('since', 4641.2424109675558),
 ('without', 4317.0192447807613),
 ('good', 3968.2456858418159),
 ('use', 2578.9435606011048),
 ("it's", 2467.2896149171575),
 ('-', 2384.4114863058021),
 ('get', 2373.535762065067),
 ('times', 2318.5023666671068),
 ('tried', 2285.2827710186707),
 ('probably', 2264.349821668699),
 ('little', 1964.8760641943354),
 ('food', 1944.7099005534681),
 ('product', 1847.5619515012852)]

In [38]:
### Testing data/UNHELPFUL REVIEWS

prediction_test2.createOrReplaceTempView('final_test_df') # Register the table
SQL_00='SELECT label, filtered, features FROM final_test_df WHERE label==0' # SQL query to leave UNHELPFUL REVIEWS
SQL_train_00=spark.sql(SQL_00) # Get UNHELPFUL reviews dataframe

Cvectorizer_00=CountVectorizer(inputCol='filtered', outputCol='vectors', vocabSize=1000) # 1 step create vocabulary
vocabulary_00 = Cvectorizer_00.fit(SQL_train_00).vocabulary # 2 step create vocabulary


In [39]:
UNHELP_TEST_RDD= SQL_train_00.rdd.map(lambda x: ([w for w in x['filtered'] if w in vocabulary_00], [x['features'][vocabulary_00.index(w)] for w in x['filtered'] if w in vocabulary_00]))
UNHELP_TEST_RDD2=UNHELP_TEST_RDD.flatMap(lambda x: unpacklists(x)) 
UNHELP_TEST_RDD3=UNHELP_TEST_RDD2.reduceByKey(lambda a,b: a+b) # REDUCE by adding TF.IDF values.

print('TOP 25 terms for TEST DATA/UNHELPFUL REVIEWS \n')
UNHELP_TEST_RDD3.sortBy(lambda a: a[1], ascending=False).take(25) # Take top 25 most important words/terms

TOP 25 terms for TEST DATA/UNHELPFUL REVIEWS 



[('', 17799.020061798212),
 ('/><br', 8791.2837768289319),
 ('like', 1766.6336740332458),
 ('much', 648.58336166312733),
 ('make', 555.63660346386337),
 ('good', 371.15746061154368),
 ('would', 307.98814123108593),
 ('1', 299.24245938673693),
 ('get', 245.52014095593046),
 ('one', 245.38423515756912),
 ("don't", 233.96413625523056),
 ('product', 201.43118195472965),
 ('eternally', 186.29121167860433),
 ('buy', 175.5361163134632),
 ('covering', 166.11907788255613),
 ('us', 154.14603821226157),
 ("it's", 139.17225242164179),
 ('taste', 134.78210775922901),
 ('sin', 132.32368532981363),
 ('ordered', 121.28135778135885),
 ('yahushua', 114.35500028760868),
 ('but,', 110.42206319671919),
 ("i'm", 96.872827542156813),
 ('real', 96.109041963698488),
 ('maltodextrin', 88.365067471207141)]

In [40]:
### Testing data/HELPFUL REVIEWS

SQL_11='SELECT label, filtered, features FROM final_test_df WHERE label==1' # SQL query to leave HELPFUL REVIEWS
SQL_train_11=spark.sql(SQL_11) # Get HELPFUL reviews dataframe

Cvectorizer_11=CountVectorizer(inputCol='filtered', outputCol='vectors', vocabSize=1000) # 1 step create vocabulary
vocabulary_11 = Cvectorizer_11.fit(SQL_train_11).vocabulary # 2 step create vocabulary


In [42]:
HELP_TEST_RDD= SQL_train_11.rdd.map(lambda x: ([w for w in x['filtered'] if w in vocabulary_11], [x['features'][vocabulary_11.index(w)] for w in x['filtered'] if w in vocabulary_11]))
HELP_TEST_RDD2=HELP_TEST_RDD.flatMap(lambda x: unpacklists(x)) 
HELP_TEST_RDD3=HELP_TEST_RDD2.reduceByKey(lambda a,b: a+b) # REDUCE by adding TF.IDF values.

print('TOP 25 terms for TEST DATA/HELPFUL REVIEWS \n')
HELP_TEST_RDD3.sortBy(lambda a: a[1], ascending=False).take(25) # Take top 25 most important words/terms

TOP 25 terms for TEST DATA/HELPFUL REVIEWS 



[('', 52017.145452478057),
 ('/><br', 21294.301154063818),
 ('like', 4417.2093208488441),
 ('one', 3390.5019558028735),
 ('taste', 2393.7627416522364),
 ('sugar', 2103.3757414952574),
 ('time', 1556.1531835690769),
 ('found', 1353.652850243634),
 ('good', 1119.0120454258463),
 ('use', 817.9386203482868),
 ('product', 754.51479109674392),
 ('also', 693.59746309876095),
 ('reviews', 673.03345094214956),
 ('really', 658.27775917144595),
 ('getting', 652.41215929125804),
 ('get', 644.21278623902731),
 ('would', 641.85933439872326),
 ("it's", 627.82198020182921),
 ("don't", 626.21841815750224),
 ('coffee', 623.12620175458392),
 ('food', 617.81606054501901),
 ("i've", 508.78155941760372),
 ('even', 422.81383781054063),
 ('flavor', 416.80286640939659),
 ('tried', 411.28126367322386)]

# Findings

The first pipeline returned satisfactory results, achieving about 73% accuracy on the test data. This is well above the share of helpful reviews after oversampling (roughly 60%), which was the initial baseline we wanted to surpass. Both precision and recall were higher for the helpful class, indicating that the model is better at identifying the positive class. This is a little surprising, as we expected oversampling to increase the discriminatory power of ''unhelpful'' features through duplication.

The second pipeline, a 10-fold cross-validated logistic regression model, had an extra preprocessing step (stopword removal) and a parameter grid covering the logistic regression regularization parameter, the maximum number of iterations, and the size of the vocabulary (feature) vector. It returned a test accuracy of just over 80% and a ROC-AUC of about 0.92 on the training data, which we consider quite high. Precision and recall remained higher for the positive class (85% and 82%), but the gap between the classes narrowed, as can be seen from the F1 scores (the harmonic mean of each class's precision and recall): helpful reviews 78% -> 83%; unhelpful reviews 63% -> 75%.

As expected, the best parameters included the largest feature vector size we specified, 1000, which is still not large considering that there were roughly 80,000 distinct tokens before any preprocessing. The best value for the maximum number of iterations was 50, the middle of the specified range, which suggests that the model converges well before the upper limit. The regularization parameter took the lowest value in the grid (0.01), suggesting that the model separates signal from noise well without strong regularization. Overall, it took 36 seconds to train the first pipeline and about 56 minutes to train the second. Although the time difference is large, the final model is the result of an exhaustive search over the parameter grid combined with 10-fold cross-validation, which should yield a more robust and generalizable model.
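
All of the per-class scores discussed here follow from the four confusion-matrix counts. A minimal sketch in plain Python, using made-up counts rather than the notebook's actual results; the helper name `class_metrics` is hypothetical.

```python
def class_metrics(tp, fp, tn, fn):
    """Accuracy plus per-class precision, recall and F1 from a binary confusion matrix."""

    def ratio(num, den):
        return num / den if den else 0.0

    def f1(precision, recall):
        # Harmonic mean of the SAME class's precision and recall
        return ratio(2 * precision * recall, precision + recall)

    p_pos, r_pos = ratio(tp, tp + fp), ratio(tp, tp + fn)
    p_neg, r_neg = ratio(tn, tn + fn), ratio(tn, tn + fp)
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "positive": {"precision": p_pos, "recall": r_pos, "f1": f1(p_pos, r_pos)},
        "negative": {"precision": p_neg, "recall": r_neg, "f1": f1(p_neg, r_neg)},
    }


# Illustrative counts only
metrics = class_metrics(tp=80, fp=20, tn=60, fn=40)
print(metrics)
```

Pairing each class's precision with that class's own recall inside `f1` is the detail that is easiest to get wrong when the formulas are written out by hand for both classes.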


We defined important words as those with high TF.IDF values. TF.IDF is a combined weight consisting of the term frequency within a review (TF) and the term's specificity across the corpus (IDF). The dataset was split by label, and the TF.IDF values for the whole 1000-word vocabulary (1000 being our best value for the feature vector size) were summed per term. The terms at the top of every list were empty tokens produced by unintended spaces and fragments of HTML line-break tags, both prevalent across the whole corpus. The most important verb, ''like'', also appeared on both lists. ''Bags'' was the first noun on the unhelpful list, while ''coffee'' and ''dog'' led the helpful list, the latter possibly indicating our shared compassion for humans' best animal friends. Many words were shared between the lists, and many were not very distinctive.

However, deriving important words from the testing data alone produced an interesting result. ''Yahushua'' appeared among the top 25 ''unhelpful'' terms; after digging deeper we found that it is a Hebrew form of the name Jesus. Only one review in the whole corpus contained this term. The review was a long religious rant, which was very unusual, seen by many people, and marked as helpful by very few (a very low Helpfulness_perct). It is interesting that the analysis picked up on a term occurring in a single unhelpful review (which was, of course, quadrupled). At least in this instance we believe the result is justifiable, as the food reviews section is not an expected venue for expressing religious beliefs.
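
The "sum of TF.IDF per term, split by label" idea can be sketched in plain Python. This is an illustration under stated assumptions rather than the notebook's Spark code: the toy documents and the helpers `summed_tfidf` and `top_terms` are hypothetical, and the smoothed IDF formula log((N + 1) / (df + 1)) is assumed here to mirror the one used by Spark ML's `IDF`.

```python
import math
from collections import Counter, defaultdict


def summed_tfidf(docs):
    """docs: list of (label, tokens). Returns {label: {term: summed TF.IDF}}."""
    n_docs = len(docs)
    doc_freq = Counter()
    for _, tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    # Smoothed inverse document frequency (assumed formula, see lead-in)
    idf = {t: math.log((n_docs + 1) / (df + 1)) for t, df in doc_freq.items()}

    totals = defaultdict(lambda: defaultdict(float))
    for label, tokens in docs:
        for term, tf in Counter(tokens).items():
            totals[label][term] += tf * idf[term]
    return totals


def top_terms(term_scores, k):
    """Return the k highest-scoring (term, score) pairs."""
    return sorted(term_scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Toy corpus: two helpful (1) and two unhelpful (0) reviews
docs = [
    (1, ["coffee", "tastes", "great"]),
    (1, ["dog", "loves", "coffee"]),
    (0, ["bags", "arrived", "torn"]),
    (0, ["bags", "bags", "leak"]),
]
totals = summed_tfidf(docs)
print(top_terms(totals[0], 2))
print(top_terms(totals[1], 2))
```

Keying the scores by the term itself, rather than by a position in a vocabulary list, means each weight is always attributed to the term that produced it, regardless of how any vocabulary happens to be ordered.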


Although high classification accuracy was achieved on this data, it is only a glimpse into a much larger space of possibilities. First of all, the data we used comprised only about 4% of the total number of reviews, and there are almost certainly useful patterns hidden in the unused data. Data quality could have been further improved by removing the unintended spaces and line-break tags, which were highly prevalent and ranked as important terms. Another possible improvement is a larger parameter grid: only 27 parameter combinations were tested, which is a tiny number considering that the pipeline consists of several parametrisable stages. Finally, other models such as SVMs often perform well on text classification, while Naive Bayes is known to scale well and train quickly. It would be interesting to see how they compare against logistic regression on this particular task.
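
As a rough idea of what the suggested cleaning step might look like, the sketch below strips HTML line-break tags and stray punctuation before tokenizing. It is plain Python and purely illustrative: the helper `clean_tokens`, the regular expression and the set of stripped characters are assumptions, not part of the notebook.

```python
import re

# Matches <br>, <br/>, <br /> in any letter case
BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)
# Punctuation removed from the start and end of each token (apostrophes are kept)
EDGE_PUNCTUATION = '.,;:!?()"-'


def clean_tokens(text):
    """Lowercase, drop line-break tags, split on whitespace, trim edge punctuation."""
    text = BR_TAG.sub(" ", text.lower())
    tokens = (tok.strip(EDGE_PUNCTUATION) for tok in text.split())
    return [tok for tok in tokens if tok]  # discard empty tokens


sample = "Great coffee.<br /><br />However, it's pricey -  "
print(clean_tokens(sample))
```

With a step like this, tokens such as `''`, `'/><br'`, `'it.<br'`, `'however,'` and `'-'` from the lists above would be removed or merged with their clean forms instead of competing with real words for vocabulary slots.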


### End.