# Amazon Fine Food Reviews Analysis using PySpark

The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.<br>

Number of reviews: 50,000<br>

Attribute Information:


1. Text - text of the review
2. Score - positive/negative


#### Objective:

<br>
[Q] How to determine if a review is positive or negative?<br>
<br> 
[Ans] 

In [1]:
import pyspark 
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("NLP").getOrCreate()
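# Counts the block managers (driver plus executors) Spark knows about; despite the name, this is not a CPU core count
cores = spark._jsc.sc().getExecutorMemoryStatus().keySet().size()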
spark

In [2]:
from pyspark.ml.classification import *
from pyspark.ml.evaluation import *
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml import Pipeline # For pipeline development
from pyspark.ml.feature import * #CountVectorizer,StringIndexer, RegexTokenizer,StopWordsRemover
from pyspark.sql.functions import * #col, udf,regexp_replace,isnull
from pyspark.sql.types import * #StringType,IntegerType

### Loading the Dataset

In [3]:
# CSV
df = spark.read.csv('amazon_rev.csv',inferSchema=True,header=True)

In [4]:
# Let's read a few full reviews
df.show(20,False)

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
|Text                                                                                                                                                                                                            

In [5]:
df.limit(5).toPandas()

Unnamed: 0,Text,Score
0,I have bought several of the Vitality canned d...,positive
1,Product arrived labeled as Jumbo Salted Peanut...,negative
2,This is a confection that has been around a fe...,positive
3,If you are looking for the secret ingredient i...,negative
4,Great taffy at a great price. There was a wid...,positive


In [6]:
df.printSchema()

root
 |-- Text: string (nullable = true)
 |-- Score: string (nullable = true)



In [7]:
df.count()

50000

## How many null values do we have?

Before dropping anything, let's count how many rows contain at least one null value.

In [8]:
# Count how many rows a null drop would remove before actually executing it
og_len = df.count()
drop_len = df.na.drop().count()
print("Total Rows that contain at least one null value:",og_len-drop_len)
print("Fraction of Rows that contain at least one null value:", (og_len-drop_len)/og_len)

Total Rows that contain at least one null value: 1
Fraction of Rows that contain at least one null value: 2e-05


In [9]:
df = df.dropna()

In [10]:
df.count()

49999

In [11]:
# Quick data quality check on the Score column
# This is going to be our label column, so its class balance is important
df.groupBy("Score").count().orderBy(col("count").desc()).limit(10).toPandas()

Unnamed: 0,Score,count
0,positive,41788
1,negative,8211


In [12]:
# Let's check the quality of the review text
df.select("Text").show(10,False)

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Text                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |


## Clean the Text column

Keep in mind that you can (and in practice should) do all of this in one chained call.
We will show each step individually for the purpose of learning.
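
To make the effect of each cleaning step concrete, here is a minimal pure-Python sketch of the same four steps applied to a single string. It is an illustration only, not the Spark code itself: `clean_text` and the sample sentence are hypothetical names and data, and the standard `re` module stands in for Spark's `translate`, `regexp_replace`, and `lower`.

```python
import re

def clean_text(text: str) -> str:
    """Mirror the notebook's four cleaning steps on a plain Python string."""
    # 1. Replace "/", "(" and ")" with spaces (one translate call covers all three)
    text = text.translate(str.maketrans("/()", "   "))
    # 2. Drop every character that is neither an ASCII letter nor a space
    text = re.sub(r"[^A-Za-z ]+", "", text)
    # 3. Collapse runs of spaces into a single space
    text = re.sub(r" +", " ", text)
    # 4. Lower-case the result
    return text.lower()

sample = "Great taffy (5/5) at a GREAT price!!"  # made-up example sentence
cleaned = clean_text(sample)
print(cleaned)
```

One side effect worth knowing about: step 2 also strips apostrophes, so a word like "don't" becomes "dont" and will no longer match an entry in a stopword list that contains the apostrophe form.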

In [13]:
df = df.withColumn("Text",translate(col("Text"), "/", " ")) \
        .withColumn("Text",translate(col("Text"), "(", " ")) \
        .withColumn("Text",translate(col("Text"), ")", " "))
df.select("Text").show(7,False)

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Text                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |


In [14]:
# Remove anything that is not a letter or a space
df = df.withColumn("Text",regexp_replace(col('Text'), '[^A-Za-z ]+', ''))
df.select("Text").show(10,False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Text                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
+-----------------------------

In [15]:
# Remove multiple spaces
df = df.withColumn("Text",regexp_replace(col('Text'), ' +', ' '))
df.select("Text").show(4,False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Text                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
+-------------------------------------------

In [16]:
# Lower case everything
df = df.withColumn("Text",lower(col('Text')))
df.select("Text").show(4,False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Text                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
+-------------------------------------------

## Prep Data for NLP 

Alright, so here is where our analysis turns from basic text cleaning to actually turning our text into numbers (the backbone of NLP). These next several steps are specific to NLP.

## Split text into words (Tokenizing)

You'll see a new column is added to our dataframe that we call "words". This column contains an array of strings as opposed to just a string (the current data type of the Text column).
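
As a rough illustration of what splitting on the pattern `\W` does, the sketch below tokenizes a string in plain Python. `regex_tokenize` is a hypothetical helper, not Spark's `RegexTokenizer`; it assumes the pattern marks the gaps between tokens, that text is lower-cased first, and that empty tokens are discarded.

```python
import re

def regex_tokenize(text: str, pattern: str = r"\W", min_token_length: int = 1) -> list:
    """Split on every match of `pattern` and keep the non-empty pieces."""
    pieces = re.split(pattern, text.lower())
    return [p for p in pieces if len(p) >= min_token_length]

tokens = regex_tokenize("great taffy at a great price")
print(tokens)
```

Because `\W` matches a single non-word character, two consecutive spaces produce an empty piece between them; the length filter is what removes it.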

In [17]:
regex_tokenizer = RegexTokenizer(inputCol="Text", outputCol="words", pattern=r"\W")
raw_words = regex_tokenizer.transform(df)
raw_words.show(2,False)
raw_words.printSchema()

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Text                                                                                                                                                                                                                                                               |Score   |words                                                                                                                                                 

## Removing Stopwords

**Recall from the content review lecture**
"Stopwords" are any words that we feel would "distract" our model from performing its best. This list can be customized, but for now, we will just use the default list.
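
Conceptually, stopword removal is just a filter over the token array. The sketch below shows the idea in plain Python; `STOPWORDS` is a small hand-picked subset chosen for illustration (not Spark's default list), and `remove_stopwords` is a hypothetical helper that assumes case-insensitive matching.

```python
# Illustrative subset only; Spark's default English list is much longer
STOPWORDS = {"i", "me", "my", "have", "of", "the", "a", "at"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stopword set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stopwords]

filtered = remove_stopwords(["i", "have", "bought", "several", "of", "the", "vitality"])
print(filtered)
```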

In [18]:
# from pyspark.ml.feature import StopWordsRemover

# Define a list of stop words or use default list
remover = StopWordsRemover(inputCol="words", outputCol="filtered")
stopwords = remover.getStopWords() 

# Display default list
stopwords[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your']

In [19]:
words_df = remover.transform(raw_words)
words_df.show(1,False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Text                                                                                                                                                                                                                                  

## Now we need to encode the Score column to a column of indices

Remember that MLlib requires our dependent variable to not only be a numeric data type, but also zero indexed. We can use Spark's handy built-in StringIndexer to accomplish this, just like we did in the classification lectures. By default StringIndexer assigns index 0 to the most frequent label, which is why `positive` becomes 0.0 and `negative` becomes 1.0 below.
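
The frequency-based ordering can be sketched in a few lines of plain Python. `fit_string_index` is a hypothetical helper, not the Spark API; it assumes labels are ranked by descending frequency, and the alphabetical tie-break is an assumption added only to make the sketch deterministic. The class counts are the ones reported by the `groupBy("Score")` cell earlier in this notebook.

```python
from collections import Counter

def fit_string_index(labels):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(labels)
    # Descending frequency; ties broken alphabetically (an assumption of this sketch)
    ordered = sorted(counts, key=lambda lab: (-counts[lab], lab))
    return {lab: float(i) for i, lab in enumerate(ordered)}

labels = ["positive"] * 41788 + ["negative"] * 8211
mapping = fit_string_index(labels)
print(mapping)
```

Keeping this mapping in mind matters later: in every per-label metric, label 0 is the majority `positive` class and label 1 is the minority `negative` class.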

In [20]:
indexer = StringIndexer(inputCol="Score", outputCol="label")
feature_data = indexer.fit(words_df).transform(words_df)
feature_data.show(5)
feature_data.printSchema()

+--------------------+--------+--------------------+--------------------+-----+
|                Text|   Score|               words|            filtered|label|
+--------------------+--------+--------------------+--------------------+-----+
|i have bought sev...|positive|[i, have, bought,...|[bought, several,...|  0.0|
|product arrived l...|negative|[product, arrived...|[product, arrived...|  1.0|
|this is a confect...|positive|[this, is, a, con...|[confection, arou...|  0.0|
|if you are lookin...|negative|[if, you, are, lo...|[looking, secret,...|  1.0|
|great taffy at a ...|positive|[great, taffy, at...|[great, taffy, gr...|  0.0|
+--------------------+--------+--------------------+--------------------+-----+
only showing top 5 rows

root
 |-- Text: string (nullable = true)
 |-- Score: string (nullable = true)
 |-- words: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- filtered: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- l

## Converting text into vectors

We will test out the following three vectorizers:

1. Count Vectors
2. TF-IDF
3. Word2Vec
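
Before running the Spark transformers, it may help to see the hashing trick and IDF weighting on toy data. The sketch below is plain Python under stated assumptions: `zlib.crc32` is a stand-in hash (Spark uses a different hash function, so bucket assignments will not match), the IDF formula is assumed to be `log((N + 1) / (df + 1))`, and `hashing_tf`, `idf_weights`, and the two tiny documents are hypothetical.

```python
import math
import zlib

def hashing_tf(tokens, num_features=20):
    """Term-frequency vector: each token is hashed into one of num_features buckets."""
    vec = [0.0] * num_features
    for tok in tokens:
        bucket = zlib.crc32(tok.encode("utf-8")) % num_features  # stand-in hash
        vec[bucket] += 1.0
    return vec

def idf_weights(tf_vectors):
    """Per-bucket inverse document frequency, log((N + 1) / (df + 1))."""
    n_docs = len(tf_vectors)
    weights = []
    for j in range(len(tf_vectors[0])):
        doc_freq = sum(1 for vec in tf_vectors if vec[j] > 0)
        weights.append(math.log((n_docs + 1) / (doc_freq + 1)))
    return weights

docs = [["great", "taffy", "great", "price"], ["product", "arrived", "labeled"]]
tf = [hashing_tf(d) for d in docs]
idf = idf_weights(tf)
tfidf = [[t * w for t, w in zip(vec, idf)] for vec in tf]
print(tf[0])
print(tfidf[0])
```

With only 20 buckets, unrelated words frequently land in the same bucket, so each feature is a mix of many terms. A bucket that is non-zero in every document receives an IDF weight of zero, which removes it from the TF-IDF vector entirely.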

In [21]:
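# Count Vector: CountVectorizer learns an explicit vocabulary, while HashingTF hashes
# each term into a fixed number of buckets; both yield term-frequency vectors
# cv = CountVectorizer(inputCol="filtered", outputCol="features")
# model = cv.fit(feature_data)
# countVectorizer_features = model.transform(feature_data)

# Hashing TF
# numFeatures=20 keeps the vectors tiny (the default is 262144) but forces many unrelated words into the same bucket
hashingTF = HashingTF(inputCol="filtered", outputCol="rawfeatures", numFeatures=20)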
HTFfeaturizedData = hashingTF.transform(feature_data)

# TF-IDF
idf = IDF(inputCol="rawfeatures", outputCol="features")
idfModel = idf.fit(HTFfeaturizedData)
TFIDFfeaturizedData = idfModel.transform(HTFfeaturizedData)
TFIDFfeaturizedData.name = 'TFIDFfeaturizedData'

#rename the HTF features to features to be consistent
HTFfeaturizedData = HTFfeaturizedData.withColumnRenamed("rawfeatures","features")
HTFfeaturizedData.name = 'HTFfeaturizedData' #We will use later for printing

In [22]:
TFIDFfeaturizedData.limit(5).toPandas()

Unnamed: 0,Text,Score,words,filtered,label,rawfeatures,features
0,i have bought several of the vitality canned d...,positive,"[i, have, bought, several, of, the, vitality, ...","[bought, several, vitality, canned, dog, food,...",0.0,"(0.0, 3.0, 2.0, 2.0, 1.0, 1.0, 2.0, 4.0, 4.0, ...","(0.0, 1.3131902060863196, 0.7708544287433142, ..."
1,product arrived labeled as jumbo salted peanut...,negative,"[product, arrived, labeled, as, jumbo, salted,...","[product, arrived, labeled, jumbo, salted, pea...",1.0,"(0.0, 2.0, 4.0, 1.0, 2.0, 0.0, 0.0, 4.0, 0.0, ...","(0.0, 0.8754601373908798, 1.5417088574866284, ..."
2,this is a confection that has been around a fe...,positive,"[this, is, a, confection, that, has, been, aro...","[confection, around, centuries, light, pillowy...",0.0,"(2.0, 3.0, 5.0, 0.0, 2.0, 1.0, 2.0, 0.0, 2.0, ...","(0.5112818338139713, 1.3131902060863196, 1.927..."
3,if you are looking for the secret ingredient i...,negative,"[if, you, are, looking, for, the, secret, ingr...","[looking, secret, ingredient, robitussin, beli...",1.0,"(2.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, ...","(0.5112818338139713, 0.4377300686954399, 0.0, ..."
4,great taffy at a great price there was a wide ...,positive,"[great, taffy, at, a, great, price, there, was...","[great, taffy, great, price, wide, assortment,...",0.0,"(0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 2.0, ...","(0.0, 0.0, 0.0, 0.0, 0.3824030925114636, 0.0, ..."


In [23]:
# Word2Vec
word2Vec = Word2Vec(vectorSize=3, minCount=0, inputCol="filtered", outputCol="features")
model = word2Vec.fit(feature_data)

W2VfeaturizedData = model.transform(feature_data)
# W2VfeaturizedData.show(1,False)

# Word2Vec vectors typically contain negative values, so we rescale them here so that we can use the Naive Bayes classifier
scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures")

# Compute summary statistics and generate MinMaxScalerModel
scalerModel = scaler.fit(W2VfeaturizedData)

# Rescale each feature to the range [min, max] (0 to 1 by default)
scaled_data = scalerModel.transform(W2VfeaturizedData)
W2VfeaturizedData = scaled_data.select('Text','label','scaledFeatures')
W2VfeaturizedData = W2VfeaturizedData.withColumnRenamed('scaledFeatures','features')

W2VfeaturizedData.name = 'W2VfeaturizedData' # We will need this to print later
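
The rescaling step exists because Naive Bayes expects non-negative feature values. As a minimal sketch of column-wise min-max scaling to the range 0 to 1, consider the plain-Python version below. `min_max_scale` and the three toy vectors are hypothetical, and mapping a constant column to 0.5 is an assumption of this sketch.

```python
def min_max_scale(rows):
    """Rescale each column of a list of equal-length vectors to [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    scaled = []
    for row in rows:
        scaled.append([
            # A constant column has no spread; map it to the midpoint (assumption)
            0.5 if high == low else (x - low) / (high - low)
            for x, low, high in zip(row, lows, highs)
        ])
    return scaled

vectors = [[-0.2, 0.1, 0.0], [0.4, -0.3, 0.0], [0.1, 0.5, 0.0]]  # toy Word2Vec-like rows
scaled = min_max_scale(vectors)
print(scaled)
```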

# Train and Evaluate your model

From here on out, it is straight-up classification. We train each classifier in turn on the training split and score it on the held-out test split.

In [24]:
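# Set up our evaluation objects
# Note: pointing rawPredictionCol at the hard 0/1 'prediction' column means the AUC is computed
# from a single operating point; the default 'rawPrediction' column gives the usual ROC AUC
Bin_evaluator = BinaryClassificationEvaluator(rawPredictionCol='prediction') #labelCol='label'
# Bin_evaluator = BinaryClassificationEvaluator() #labelCol='label'
MC_evaluator = MulticlassClassificationEvaluator(metricName="accuracy") # predictionCol="prediction",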
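
To see why it matters whether an AUC is computed from hard 0/1 predictions or from continuous scores, here is a minimal pure-Python sketch of a rank-based AUC (the probability that a random positive outranks a random negative, with ties counted as one half). It is not Spark's implementation, and the labels and scores are made-up values for illustration.

```python
def roc_auc(labels, scores):
    """Rank-based AUC: P(score of a positive > score of a negative), ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up example: the scores rank most positives above the negatives,
# but none of them crosses the 0.5 decision threshold
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.45, 0.40, 0.30, 0.35, 0.20, 0.15, 0.10, 0.05]
hard = [1 if s >= 0.5 else 0 for s in scores]

auc_scores = roc_auc(labels, scores)
auc_hard = roc_auc(labels, hard)
print(auc_scores, auc_hard)
```

When a model predicts almost everything as the majority class, the hard predictions carry almost no ranking information, so an AUC computed from them sits near 0.5 even if the underlying scores rank the examples reasonably well.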

## Evaluate using HTF

In [25]:
HTF=HTFfeaturizedData

In [26]:
HTF.limit(5).toPandas()

Unnamed: 0,Text,Score,words,filtered,label,features
0,i have bought several of the vitality canned d...,positive,"[i, have, bought, several, of, the, vitality, ...","[bought, several, vitality, canned, dog, food,...",0.0,"(0.0, 3.0, 2.0, 2.0, 1.0, 1.0, 2.0, 4.0, 4.0, ..."
1,product arrived labeled as jumbo salted peanut...,negative,"[product, arrived, labeled, as, jumbo, salted,...","[product, arrived, labeled, jumbo, salted, pea...",1.0,"(0.0, 2.0, 4.0, 1.0, 2.0, 0.0, 0.0, 4.0, 0.0, ..."
2,this is a confection that has been around a fe...,positive,"[this, is, a, confection, that, has, been, aro...","[confection, around, centuries, light, pillowy...",0.0,"(2.0, 3.0, 5.0, 0.0, 2.0, 1.0, 2.0, 0.0, 2.0, ..."
3,if you are looking for the secret ingredient i...,negative,"[if, you, are, looking, for, the, secret, ingr...","[looking, secret, ingredient, robitussin, beli...",1.0,"(2.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, ..."
4,great taffy at a great price there was a wide ...,positive,"[great, taffy, at, a, great, price, there, was...","[great, taffy, great, price, wide, assortment,...",0.0,"(0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 2.0, ..."


In [27]:
HTF = HTF.select('features','label')

In [28]:
HTF.limit(5).toPandas()

Unnamed: 0,features,label
0,"(0.0, 3.0, 2.0, 2.0, 1.0, 1.0, 2.0, 4.0, 4.0, ...",0.0
1,"(0.0, 2.0, 4.0, 1.0, 2.0, 0.0, 0.0, 4.0, 0.0, ...",1.0
2,"(2.0, 3.0, 5.0, 0.0, 2.0, 1.0, 2.0, 0.0, 2.0, ...",0.0
3,"(2.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, ...",1.0
4,"(0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 2.0, ...",0.0


### Split train and test dataset 70:30 ratio

In [29]:
# Pass a seed (e.g. seed=42) to randomSplit to make the split reproducible
train,test = HTF.randomSplit([0.7,0.3])

### Logistic Regression

In [30]:
# This is the simplest approach, which does not use cross validation
# Let's go ahead and train a Logistic Regression Algorithm
classifier = LogisticRegression()
fitModel = classifier.fit(train)

# Score the test set once and reuse the result for both metrics
predictions = fitModel.transform(test)

# Evaluation method for binary classification problem
auc = Bin_evaluator.evaluate(predictions)
print("AUC:",auc)

# Evaluation for a multiclass classification problem
accuracy = (MC_evaluator.evaluate(predictions))*100
print("Accuracy: {0:.2f}".format(accuracy),"%") 
print(" ")

AUC: 0.5009030748771118
Accuracy: 83.15 %
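
An accuracy figure on its own is hard to judge when the classes are imbalanced. A quick sanity check, sketched below in plain Python, is the accuracy of a trivial classifier that always predicts the majority class. The counts come from the `groupBy("Score")` cell earlier in this notebook and describe the full dataset, so the share in a random test split will differ slightly.

```python
# Class counts reported earlier in this notebook (full dataset after dropping nulls)
counts = {"positive": 41788, "negative": 8211}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total * 100

print("Always predicting '%s' gives about %.2f%% accuracy" % (majority_label, baseline_accuracy))
```

An accuracy in the low 83% range is therefore no better than always answering `positive`, which is consistent with an AUC of about 0.5.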
 


### Logistic Regression with cross validation

In [31]:
# First tell Spark which classifier you want to use
classifier = LogisticRegression()

# Then Set up your parameter grid for the cross validator to conduct hyperparameter tuning
paramGrid = (ParamGridBuilder().addGrid(classifier.maxIter, [10, 15,20]).build())

# Then set up the Cross Validator which requires all of the following parameters:
crossval = CrossValidator(estimator=classifier,
                          estimatorParamMaps=paramGrid,
                          evaluator=MC_evaluator,
                          numFolds=2) # 3 + is best practice

# Then fit your model
fitModel = crossval.fit(train)

# Collect the best model and
# print the coefficient matrix
# These values should be compared relative to each other
# And intercepts can be compared to other models
BestModel = fitModel.bestModel
print("Intercept: " + str(BestModel.interceptVector))
print("Coefficients: \n" + str(BestModel.coefficientMatrix))

# You can extract the best model from this run like this if you want
LR_BestModel = BestModel

# Next you need to generate predictions on the test dataset
# fitModel automatically uses the best model 
# so we don't need to use BestModel here
predictions = fitModel.transform(test)

# Now print the accuracy rate of the model or AUC for a binary classifier
accuracy = (MC_evaluator.evaluate(predictions))*100
print(accuracy)

Intercept: [-1.7552356928419908]
Coefficients: 
DenseMatrix([[-0.0458688 , -0.04132501,  0.0649768 , -0.00604277,  0.05469958,
               0.00067068,  0.01100899,  0.0090707 , -0.03797806, -0.01824852,
               0.00506662,  0.00809073,  0.03650167, -0.05553102,  0.00569813,
               0.01660173,  0.01792814, -0.014733  ,  0.0740175 , -0.01065871]])
83.1454714228882
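
Cross-validation cost grows with both the grid size and the number of folds. The sketch below counts the model fits in plain Python, assuming one fit per parameter combination per fold plus one final refit of the best combination on the full training set. `expand_grid` and `count_fits` are hypothetical helpers; the two grids mirror the ones used for Logistic Regression and LinearSVC in this notebook.

```python
from itertools import product

def expand_grid(grid):
    """Turn {param: [values]} into a list of every parameter combination."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]

def count_fits(grid, num_folds):
    """One fit per combination per fold, plus one final refit (assumption)."""
    return len(expand_grid(grid)) * num_folds + 1

lr_grid = {"maxIter": [10, 15, 20]}
svc_grid = {"maxIter": [10, 15], "regParam": [0.1, 0.01]}

print(count_fits(lr_grid, num_folds=2))
print(count_fits(svc_grid, num_folds=2))
```

This is why the notebook keeps `numFolds=2` and small grids: moving to three or more folds, as the comments recommend, multiplies the training time accordingly.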


### Classification Diagnostics

In [32]:
# Load the Summary
trainingSummary = LR_BestModel.summary

# General Describe
trainingSummary.predictions.describe().show()

# Obtain the objective per iteration
objectiveHistory = trainingSummary.objectiveHistory
print(" ")
print("objectiveHistory: (scaled loss + regularization) at each iteration")
for objective in objectiveHistory:
    print(objective)

# for multiclass, we can inspect metrics on a per-label basis
print(" ")
print("False positive rate by label:")
for i, rate in enumerate(trainingSummary.falsePositiveRateByLabel):
    print("label %d: %s" % (i, rate))

print(" ")
print("True positive rate by label:")
for i, rate in enumerate(trainingSummary.truePositiveRateByLabel):
    print("label %d: %s" % (i, rate))

print(" ")
print("Precision by label:")
for i, prec in enumerate(trainingSummary.precisionByLabel):
    print("label %d: %s" % (i, prec))

print(" ")
print("Recall by label:")
for i, rec in enumerate(trainingSummary.recallByLabel):
    print("label %d: %s" % (i, rec))

print(" ")
print("F-measure by label:")
for i, f in enumerate(trainingSummary.fMeasureByLabel()):
    print("label %d: %s" % (i, f))

# Collect the overall accuracy and the weighted metrics, then print them
accuracy = trainingSummary.accuracy
falsePositiveRate = trainingSummary.weightedFalsePositiveRate
truePositiveRate = trainingSummary.weightedTruePositiveRate
fMeasure = trainingSummary.weightedFMeasure()
precision = trainingSummary.weightedPrecision
recall = trainingSummary.weightedRecall
print(" ")
print("Accuracy: %s\nFPR: %s\nTPR: %s\nF-measure: %s\nPrecision: %s\nRecall: %s"
      % (accuracy, falsePositiveRate, truePositiveRate, fMeasure, precision, recall))



+-------+------------------+--------------------+
|summary|             label|          prediction|
+-------+------------------+--------------------+
|  count|             34917|               34917|
|   mean|0.1627287567660452|9.737377208809462E-4|
| stddev|0.3691233000715696| 0.03119002110587992|
|    min|               0.0|                 0.0|
|    max|               1.0|                 1.0|
+-------+------------------+--------------------+

 
objectiveHistory: (scaled loss + regularization) at each iteration
0.4441672055289093
0.44262624396381306
0.44162110090017065
0.43846986595127413
0.43843875718076086
0.4384280294909694
0.43842610172043195
0.4384255876771879
0.4384255285867055
0.4384255251048576
0.4384255248081934
 
False positive rate by label:
label 0: 0.9985920450545582
label 1: 0.0008893449632290064
 
True positive rate by label:
label 0: 0.999110655036771
label 1: 0.0014079549454417458
 
Precision by label:
label 0: 0.8373419717340825
label 1: 0.23529411764705882
 
Recal
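
The per-label figures printed above can be combined into an F-measure by hand, which makes the weakness on the minority class explicit. The sketch below uses the precision for label 1 and its true positive rate (which is the same quantity as recall) exactly as printed; `f_measure` is a hypothetical helper implementing the standard F-beta formula.

```python
def f_measure(precision, recall, beta=1.0):
    """Standard F-beta score; returns 0.0 when both inputs are zero."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Values printed by the training summary for label 1 (the 'negative' class)
precision_neg = 0.23529411764705882
recall_neg = 0.0014079549454417458  # equal to the true positive rate for label 1

f1_neg = f_measure(precision_neg, recall_neg)
print(f1_neg)
```

An F1 score this close to zero for label 1 shows that the model almost never identifies a negative review, even though the overall accuracy looks respectable.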

### One vs. Rest
Recap from lecture: the One-vs-Rest classifier is a type of multiclass classifier that trains a single binary classifier per class, with the samples of that class as positives and all other samples as negatives. Each class is therefore compared against the rest of the classes as a whole, as opposed to against each other class individually.
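
The relabeling at the heart of One-vs-Rest can be sketched in plain Python as follows. `one_vs_rest_targets` is a hypothetical helper that only builds the binary target vectors; the binary classifiers that would be trained on them are omitted.

```python
def one_vs_rest_targets(labels):
    """For each class, build a binary target: 1 for that class, 0 for all others."""
    classes = sorted(set(labels))
    return {c: [1 if y == c else 0 for y in labels] for c in classes}

labels = [0, 1, 0, 0, 1]  # toy labels
targets = one_vs_rest_targets(labels)
print(targets)
```

With only two classes the two target vectors are exact complements of each other, so the two binary models solve mirror-image problems. This is why the intercepts and coefficients printed by the next cell differ only in sign, and why One-vs-Rest adds little for a binary task like this one.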

In [33]:
# instantiate the base classifier.
lr = LogisticRegression()
# instantiate the One Vs Rest Classifier.
classifier = OneVsRest(classifier=lr)

# Add parameters of your choice here:
paramGrid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .build()
#Cross Validator requires the following parameters:
crossval = CrossValidator(estimator=classifier,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          numFolds=2) # 3 is best practice

# Run cross-validation, and choose the best set of parameters.
fitModel = crossval.fit(train)

# Print the Coefficients
# First we need to extract the best model from fit model

# Get Best Model
BestModel = fitModel.bestModel
# Extract list of binary models
models = BestModel.models
for model in models:
    print('\033[1m' + 'Intercept: '+ '\033[0m',model.intercept,'\033[1m' + '\nCoefficients:'+ '\033[0m',model.coefficients)
        
# Now generate predictions on test dataset
predictions = fitModel.transform(test)
# And calculate the accuracy score
accuracy = (MC_evaluator.evaluate(predictions))*100
# And print
print(accuracy)

Intercept:  1.7562212198980984 
Coefficients: [0.03916173632684779,0.035156926314581984,-0.056674186496254074,0.00473309147734678,-0.047524822771146336,-0.0011739823293586322,-0.009425113974763134,-0.00866796444555344,0.031154257959753993,0.015776249717221972,-0.004658416328512517,-0.007286548577725694,-0.03197465970689617,0.046330321108072156,-0.005208539204358248,-0.01362685984205546,-0.015065824769997061,0.011532426330988733,-0.0647510773099216,0.008821781387471066]
Intercept:  -1.7562212198980982 
Coefficients: [-0.03916173632684776,-0.03515692631458197,0.056674186496254074,-0.004733091477346791,0.0475248227711463,0.0011739823293586455,0.009425113974763163,0.008667964445553459,-0.031154257959754038,-0.015776249717221975,0.004658416328512565,0.007286548577725655,0.03197465970689616,-0.04633032110807215,0.005208539204358288,0.013626859842055436,0.015065824769997058,-0.011532426330988728,0.0647510773099216,-0.008821781387471112]
83.13221058215092


### Multilayer Perceptron Classifier

In [34]:
# Count how many features you have (one row is enough, so avoid collecting the whole column)
features = HTF.select(['features']).take(1)
features_count = len(features[0][0])
# Count how many classes you have 
class_count = HTF.select(countDistinct("label")).collect()
classes = class_count[0][0]

# Then use this number to specify the layers
# The first number in this list is the input layer which has to be equal to the number of features in your vector
# The second number is the first hidden layer
# The third number is the second hidden layer 
# The fourth number is the output layer which has to be equal to your class size
layers = [features_count, features_count+1, features_count, classes]
# Instantiate the classifier
classifier = MultilayerPerceptronClassifier(maxIter=100, layers=layers, blockSize=128, seed=1234)

# Fit the model
fitModel = classifier.fit(train)

# Print the model Weights
print('\033[1m' + "Model Weights: "+ '\033[0m',fitModel.weights.size)
   
# Generate predictions on test dataframe
predictions = fitModel.transform(test)
# Calculate accuracy score
accuracy = (MC_evaluator.evaluate(predictions))*100
# Print accuracy score
print("Accuracy: ",accuracy)

Model Weights:  923
Accuracy:  83.17862352473146
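
The weight count printed above can be checked by hand. In a fully connected network each layer contributes (inputs + 1 bias) × outputs parameters; the sketch below applies that to the layer sizes used in this cell, where 20 is the number of hashed features and 2 the number of classes. `mlp_weight_count` is a hypothetical helper.

```python
def mlp_weight_count(layers):
    """Total parameters of a dense network: (n_in + 1 bias) * n_out per layer pair."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layers, layers[1:]))

layers = [20, 21, 20, 2]  # features_count, features_count + 1, features_count, classes
n_weights = mlp_weight_count(layers)
print(n_weights)
```

The three layer pairs contribute 441, 440, and 42 parameters respectively, which adds up to the 923 reported by `fitModel.weights.size`.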


### NaiveBayes

In [35]:
# Add parameters of your choice here:
classifier = NaiveBayes()
paramGrid = (ParamGridBuilder() \
             .addGrid(classifier.smoothing, [0.0, 0.2, 0.4, 0.6]) \
             .build())

#Cross Validator requires all of the following parameters:
crossval = CrossValidator(estimator=classifier,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          numFolds=2) # 3 + is best practice
# Fit Model: Run cross-validation, and choose the best set of parameters.
fitModel = crossval.fit(train)

predictions = fitModel.transform(test)
accuracy = (MC_evaluator.evaluate(predictions))*100
print("Accuracy: ",accuracy)

Accuracy:  83.09242805993901


### Linear Support Vector Machine

In [36]:
# Count how many classes you have and produce an error if it's more than 2.
class_count = HTF.select(countDistinct("label")).collect()
classes = class_count[0][0]
if classes > 2:
    print("LinearSVC cannot be used because PySpark currently only accepts binary classification data for this algorithm")

# Add parameters of your choice here:
classifier = LinearSVC()
paramGrid = (ParamGridBuilder() \
             .addGrid(classifier.maxIter, [10, 15]) \
             .addGrid(classifier.regParam, [0.1, 0.01]) \
             .build())

#Cross Validator requires all of the following parameters:
crossval = CrossValidator(estimator=classifier,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          numFolds=2) # 3 + is best practice
# Fit Model: Run cross-validation, and choose the best set of parameters.
fitModel = crossval.fit(train)

BestModel = fitModel.bestModel

print("Intercept: \n" + str(BestModel.intercept))
print('\033[1m' + " Coefficients"+ '\033[0m')
print("You should compare these relative to each other")
print("Coefficients: \n" + str(BestModel.coefficients))

# Automatically gets the best model
predictions = fitModel.transform(test)
accuracy = (MC_evaluator.evaluate(predictions))*100
print("Accuracy: ",accuracy)

Intercept: 
-1.0006773247335061
 Coefficients
You should compare these relative to each other
Coefficients: 
[-0.00035307353654296834,-0.0005490781608652362,0.0,-0.0004860194616013917,0.0,0.0,0.0,-7.44495262994637e-05,-0.0006117979350420631,0.0,0.0,-0.00029164920941916723,0.0,-0.0009463157955664083,0.0,0.0012536082826390459,0.0010541717755659904,-0.0005881226613804043,0.0,0.00042638426570139066]
Accuracy:  83.23166688768067


### Decision Tree

In [37]:
# Add parameters of your choice here:
classifier = DecisionTreeClassifier()
paramGrid = (ParamGridBuilder()
#              .addGrid(classifier.maxDepth, [2, 5, 10, 20, 30])
             .addGrid(classifier.maxBins, [10, 20, 40, 80, 100])
             .build())

#Cross Validator requires all of the following parameters:
crossval = CrossValidator(estimator=classifier,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          numFolds=2) # 3 + is best practice
# Fit Model: Run cross-validation, and choose the best set of parameters.
fitModel = crossval.fit(train)

# Collect and print feature importances
BestModel = fitModel.bestModel
featureImportances = BestModel.featureImportances.toArray()
print("Feature Importances: ",featureImportances)

predictions = fitModel.transform(test)
accuracy = (MC_evaluator.evaluate(predictions))*100
print("Accuracy: ",accuracy)

Feature Importances:  [0.0170495  0.         0.         0.         0.15897238 0.05901237
 0.         0.02260163 0.         0.01463241 0.         0.
 0.09054291 0.06136892 0.         0.08208282 0.05856106 0.02156435
 0.36902538 0.04458628]
Accuracy:  83.19851478583742
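
The importance vector printed above is easier to read once it is ranked. The sketch below copies the printed values into a plain Python list and sorts them; the indices are positions in the feature vector, and nothing here depends on Spark.

```python
# Importances copied from the Decision Tree output above (20 features).
importances = [
    0.0170495, 0.0, 0.0, 0.0, 0.15897238, 0.05901237,
    0.0, 0.02260163, 0.0, 0.01463241, 0.0, 0.0,
    0.09054291, 0.06136892, 0.0, 0.08208282, 0.05856106, 0.02156435,
    0.36902538, 0.04458628,
]

# Pair each importance with its feature index and sort, largest first.
ranked = sorted(enumerate(importances), key=lambda pair: pair[1], reverse=True)
top3 = ranked[:3]

# Features with zero importance were never chosen for a split.
unused = [idx for idx, value in enumerate(importances) if value == 0.0]

for idx, value in top3:
    print(f"feature {idx}: {value:.4f}")
print("features never used in a split:", unused)
print("sum of importances:", round(sum(importances), 6))
```

A single feature (index 18) carries over a third of the total importance, while eight of the twenty features are not used by the tree at all.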


### Random Forest

In [38]:
# Add parameters of your choice here:
classifier = RandomForestClassifier()
paramGrid = (ParamGridBuilder() \
               .addGrid(classifier.maxDepth, [2, 5, 10])
#                                .addGrid(classifier.maxBins, [5, 10, 20])
#                                .addGrid(classifier.numTrees, [5, 20, 50])
             .build())

#Cross Validator requires all of the following parameters:
crossval = CrossValidator(estimator=classifier,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          numFolds=2) # 3 + is best practice

# Fit Model: Run cross-validation, and choose the best set of parameters.
fitModel = crossval.fit(train)

# Retrieve best model from cross val
BestModel = fitModel.bestModel
featureImportances = BestModel.featureImportances.toArray()
print("Feature Importances: ",featureImportances)

predictions = fitModel.transform(test)

accuracy = (MC_evaluator.evaluate(predictions))*100
print(" ")
print("Accuracy: ",accuracy)

Feature Importances:  [0.06020908 0.03939737 0.07056043 0.03823435 0.05812769 0.03984151
 0.04801501 0.05225476 0.04689558 0.04124826 0.05142836 0.04313354
 0.05872751 0.05157523 0.04310136 0.05637617 0.04787212 0.04896633
 0.06660303 0.03743231]
 
Accuracy:  83.29134067099854


### Gradient Boost Tree Classifier

In [39]:
class_count = HTF.select(countDistinct("label")).collect()
classes = class_count[0][0]
if classes > 2:
    print("GBTClassifier cannot be used because PySpark currently only accepts binary classification data for this algorithm")

# Add parameters of your choice here:
classifier = GBTClassifier()

paramGrid = (ParamGridBuilder() \
#                              .addGrid(classifier.maxDepth, [2, 5, 10, 20, 30]) \
#                              .addGrid(classifier.maxBins, [10, 20, 40, 80, 100]) \
             .addGrid(classifier.maxIter, [10, 15,50,100])
             .build())

#Cross Validator requires all of the following parameters:
crossval = CrossValidator(estimator=classifier,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          numFolds=2) # 3 + is best practice

# Fit Model: Run cross-validation, and choose the best set of parameters.
fitModel = crossval.fit(train)

BestModel = fitModel.bestModel
featureImportances = BestModel.featureImportances.toArray()
print("Feature Importances: ",featureImportances)
    
predictions = fitModel.transform(test)
accuracy = (MC_evaluator.evaluate(predictions))*100
print(" ")
print("Accuracy: ",accuracy)

Feature Importances:  [0.06000566 0.04120609 0.05678355 0.04895157 0.0597304  0.03591276
 0.05130978 0.0512845  0.05811942 0.05014301 0.05865339 0.0435035
 0.05003062 0.04635694 0.04065306 0.05319584 0.04429699 0.04715108
 0.05827277 0.04443907]
 
Accuracy:  83.11231932104496


### Evaluate using TF-IDF

In [40]:
tf_idf=TFIDFfeaturizedData

In [41]:
tf_idf.limit(5).toPandas()

| | Text | Score | words | filtered | label | rawfeatures | features |
|---|---|---|---|---|---|---|---|
| 0 | i have bought several of the vitality canned d... | positive | [i, have, bought, several, of, the, vitality, ... | [bought, several, vitality, canned, dog, food,... | 0.0 | (0.0, 3.0, 2.0, 2.0, 1.0, 1.0, 2.0, 4.0, 4.0, ... | (0.0, 1.3131902060863196, 0.7708544287433142, ... |
| 1 | product arrived labeled as jumbo salted peanut... | negative | [product, arrived, labeled, as, jumbo, salted,... | [product, arrived, labeled, jumbo, salted, pea... | 1.0 | (0.0, 2.0, 4.0, 1.0, 2.0, 0.0, 0.0, 4.0, 0.0, ... | (0.0, 0.8754601373908798, 1.5417088574866284, ... |
| 2 | this is a confection that has been around a fe... | positive | [this, is, a, confection, that, has, been, aro... | [confection, around, centuries, light, pillowy... | 0.0 | (2.0, 3.0, 5.0, 0.0, 2.0, 1.0, 2.0, 0.0, 2.0, ... | (0.5112818338139713, 1.3131902060863196, 1.927... |
| 3 | if you are looking for the secret ingredient i... | negative | [if, you, are, looking, for, the, secret, ingr... | [looking, secret, ingredient, robitussin, beli... | 1.0 | (2.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, ... | (0.5112818338139713, 0.4377300686954399, 0.0, ... |
| 4 | great taffy at a great price there was a wide ... | positive | [great, taffy, at, a, great, price, there, was... | [great, taffy, great, price, wide, assortment,... | 0.0 | (0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 2.0, ... | (0.0, 0.0, 0.0, 0.0, 0.3824030925114636, 0.0, ... |
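
The `features` column looks like `rawfeatures` scaled slot by slot by an inverse document frequency weight. The sketch below assumes the smoothed form documented for Spark's `IDF`, `log((m + 1) / (df + 1))`, and checks the scaling against values read from the table. The helper names and the corpus sizes in the last lines are hypothetical, used only for illustration.

```python
import math

def idf(num_docs: int, doc_freq: int) -> float:
    """Smoothed inverse document frequency: log((m + 1) / (df + 1))."""
    return math.log((num_docs + 1) / (doc_freq + 1))

def tf_idf(term_counts, idf_weights):
    """Scale each raw term count by the IDF weight of its feature slot."""
    return [tf * w for tf, w in zip(term_counts, idf_weights)]

# Weight for feature slot 1, read from the table: in row 3 a raw count of 1.0
# maps to 0.4377300686954399, so that value is the slot's IDF weight.
idf_slot1 = 0.4377300686954399

# Row 0 has a raw count of 3.0 in that slot and row 1 has 2.0.
row0_value = tf_idf([3.0], [idf_slot1])[0]
row1_value = tf_idf([2.0], [idf_slot1])[0]
print(row0_value, row1_value)

# Hypothetical corpus sizes, only to show how the weight behaves:
# a term present in nearly every document gets a weight near zero.
print(idf(num_docs=1000, doc_freq=999), idf(num_docs=1000, doc_freq=9))
```

The two scaled values agree with the `features` entries shown for rows 0 and 1, which supports reading the column as count times a per-slot weight.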


In [42]:
tf_idf = tf_idf.select('features','label')

In [43]:
tf_idf.limit(5).toPandas()

| | features | label |
|---|---|---|
| 0 | (0.0, 1.3131902060863196, 0.7708544287433142, ... | 0.0 |
| 1 | (0.0, 0.8754601373908798, 1.5417088574866284, ... | 1.0 |
| 2 | (0.5112818338139713, 1.3131902060863196, 1.927... | 0.0 |
| 3 | (0.5112818338139713, 0.4377300686954399, 0.0, ... | 1.0 |
| 4 | (0.0, 0.0, 0.0, 0.0, 0.3824030925114636, 0.0, ... | 0.0 |


### Split into train and test datasets (70:30 ratio)

In [44]:
train,test = tf_idf.randomSplit([0.7,0.3])

### Logistic Regression

In [45]:
# This is the simplest approach: a single fit with no cross-validation
# Let's go ahead and train a Logistic Regression model
classifier = LogisticRegression()
fitModel = classifier.fit(train)

# Evaluation method for binary classification problem
predictionAndLabels = fitModel.transform(test)
auc = Bin_evaluator.evaluate(predictionAndLabels)
print("AUC:",auc)

# Evaluation for a multiclass classification problem
predictions = fitModel.transform(test)
accuracy = (MC_evaluator.evaluate(predictions))*100
print("Accuracy: {0:.2f}".format(accuracy),"%") #     print("Test Error = %g " % (1.0 - accuracy))
print(" ")

AUC: 0.5007538720417449
Accuracy: 83.28 %
 


### Logistic Regression with cross validation

In [46]:
# First tell Spark which classifier you want to use
classifier = LogisticRegression()

# Then Set up your parameter grid for the cross validator to conduct hyperparameter tuning
paramGrid = (ParamGridBuilder().addGrid(classifier.maxIter, [10, 15,20]).build())

# Then set up the Cross Validator which requires all of the following parameters:
crossval = CrossValidator(estimator=classifier,
                          estimatorParamMaps=paramGrid,
                          evaluator=MC_evaluator,
                          numFolds=2) # 3 + is best practice

# Then fit your model
fitModel = crossval.fit(train)

# Collect the best model and
# print the coefficient matrix
# These values should be compared relative to each other
# And intercepts can be compared to those of other models
BestModel = fitModel.bestModel
print("Intercept: " + str(BestModel.interceptVector))
print("Coefficients: \n" + str(BestModel.coefficientMatrix))

# You can extract the best model from this run like this if you want
LR_BestModel = BestModel

# Next you need to generate predictions on the test dataset
# fitModel automatically uses the best model 
# so we don't need to use BestModel here
predictions = fitModel.transform(test)

# Now print the accuracy rate of the model or AUC for a binary classifier
accuracy = (MC_evaluator.evaluate(predictions))*100
print(accuracy)

Intercept: [-1.739762372354217]
Coefficients: 
DenseMatrix([[-0.16772199, -0.10669487,  0.1930526 , -0.06237147,  0.14219767,
              -0.00458707,  0.01636683,  0.03149717, -0.1486108 , -0.08092725,
              -0.00643504,  0.03529166,  0.15469715, -0.17743527, -0.00495268,
               0.06287747,  0.04761007, -0.00875633,  0.17464348, -0.01523864]])
83.27810180275715
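
The cost of a cross-validated grid search grows with the product of the grid sizes and the number of folds. The sketch below counts the fits, assuming every parameter combination is trained once per fold and the winning combination is then refit once on the full training set. `count_cv_fits` is a hypothetical helper, not part of PySpark.

```python
from itertools import product

def count_cv_fits(param_grid: dict, num_folds: int) -> int:
    """Number of model fits for k-fold CV over a full parameter grid.

    Every combination is trained once per fold, then the winning
    combination is refit once on the full training set.
    """
    combos = list(product(*param_grid.values()))
    return len(combos) * num_folds + 1

# Grid used in the cell above: three maxIter values, two folds.
lr_fits = count_cv_fits({"maxIter": [10, 15, 20]}, num_folds=2)
print(lr_fits)

# A two-parameter grid such as maxIter x regParam multiplies out.
svc_fits = count_cv_fits({"maxIter": [10, 15], "regParam": [0.1, 0.01]}, num_folds=2)
print(svc_fits)
```

Moving from 2 folds to 3 or more, as the comments recommend, scales the first term linearly, so it is worth keeping grids small while experimenting.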


### Classification Diagnostics

In [47]:
# Load the Summary
trainingSummary = LR_BestModel.summary

# General Describe
trainingSummary.predictions.describe().show()

# Obtain the objective per iteration
objectiveHistory = trainingSummary.objectiveHistory
print(" ")
print("objectiveHistory: (scaled loss + regularization) at each iteration")
for objective in objectiveHistory:
    print(objective)

# for multiclass, we can inspect metrics on a per-label basis
print(" ")
print("False positive rate by label:")
for i, rate in enumerate(trainingSummary.falsePositiveRateByLabel):
    print("label %d: %s" % (i, rate))

print(" ")
print("True positive rate by label:")
for i, rate in enumerate(trainingSummary.truePositiveRateByLabel):
    print("label %d: %s" % (i, rate))

print(" ")
print("Precision by label:")
for i, prec in enumerate(trainingSummary.precisionByLabel):
    print("label %d: %s" % (i, prec))

print(" ")
print("Recall by label:")
for i, rec in enumerate(trainingSummary.recallByLabel):
    print("label %d: %s" % (i, rec))

print(" ")
print("F-measure by label:")
for i, f in enumerate(trainingSummary.fMeasureByLabel()):
    print("label %d: %s" % (i, f))

# Collect overall accuracy and label-weighted metrics, then print them
accuracy = trainingSummary.accuracy
falsePositiveRate = trainingSummary.weightedFalsePositiveRate
truePositiveRate = trainingSummary.weightedTruePositiveRate
fMeasure = trainingSummary.weightedFMeasure()
precision = trainingSummary.weightedPrecision
recall = trainingSummary.weightedRecall
print(" ")
print("Accuracy: %s\nFPR: %s\nTPR: %s\nF-measure: %s\nPrecision: %s\nRecall: %s"
      % (accuracy, falsePositiveRate, truePositiveRate, fMeasure, precision, recall))



+-------+-------------------+--------------------+
|summary|              label|          prediction|
+-------+-------------------+--------------------+
|  count|              34911|               34911|
|   mean|0.16307181117699293|0.001690011744149...|
| stddev|0.36943646955959625|0.041075588044162406|
|    min|                0.0|                 0.0|
|    max|                1.0|                 1.0|
+-------+-------------------+--------------------+

 
objectiveHistory: (scaled loss + regularization) at each iteration
0.4447287187443416
0.4431910806032841
0.4419524355367641
0.4383240446218865
0.43827161266027753
0.4382593171290481
0.43825598410140554
0.43825536794588954
0.43825529161228394
0.438255287165618
0.43825528653432017
 
False positive rate by label:
label 0: 0.9973651853152995
label 1: 0.001505921007598056
 
True positive rate by label:
label 0: 0.998494078992402
label 1: 0.0026348146847005095
 
Precision by label:
label 0: 0.8370825203718582
label 1: 0.2542372881355932
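
These per-label rates suggest the model almost never predicts label 1. A rough way to put the roughly 83% accuracy in context is to compare it with a constant majority-class predictor. The sketch below uses only numbers copied from the summary output; the variable names are illustrative.

```python
# Values copied from the training summary output above.
label_mean = 0.16307181117699293    # fraction of rows with label 1.0
tpr_label0 = 0.998494078992402      # recall for label 0
tpr_label1 = 0.0026348146847005095  # recall for label 1

# Accuracy of a model that always predicts the majority class (label 0).
majority_baseline = 1.0 - label_mean

# Balanced accuracy averages recall over classes, so each class counts equally.
balanced_accuracy = (tpr_label0 + tpr_label1) / 2

# Overall accuracy is the class-frequency-weighted average of the recalls.
weighted_accuracy = (1.0 - label_mean) * tpr_label0 + label_mean * tpr_label1

print(f"majority baseline: {majority_baseline:.4f}")
print(f"balanced accuracy: {balanced_accuracy:.4f}")
print(f"training accuracy: {weighted_accuracy:.4f}")
```

On these numbers the fitted model does not beat the constant predictor, and its balanced accuracy sits near 0.5, so accuracy alone is a weak yardstick for this imbalanced label.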
 

### One vs. Rest
Recap from lecture: the One-vs-Rest classifier is a multiclass strategy that trains one binary classifier per class, treating the samples of that class as positives and all other samples as negatives. Each class is therefore compared against the rest of the classes as a whole, rather than against each other class individually.
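
The relabeling step behind this strategy can be sketched in a few lines of plain Python. `one_vs_rest_labels` is a hypothetical helper for illustration, not the PySpark implementation, and the label list is a small sample.

```python
def one_vs_rest_labels(labels):
    """Build one binary label list per class: 1.0 for that class, 0.0 for the rest."""
    classes = sorted(set(labels))
    return {c: [1.0 if y == c else 0.0 for y in labels] for c in classes}

# Sample labels for five reviews (0.0 = positive, 1.0 = negative).
labels = [0.0, 1.0, 0.0, 1.0, 0.0]
binary_problems = one_vs_rest_labels(labels)
for cls, binary in binary_problems.items():
    print(cls, binary)
```

With only two classes the two binary problems are exact complements of each other, so the two fitted models end up as mirror images with opposite-signed intercepts and coefficients.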

In [48]:
# instantiate the base classifier.
lr = LogisticRegression()
# instantiate the One Vs Rest Classifier.
classifier = OneVsRest(classifier=lr)

# Add parameters of your choice here:
paramGrid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .build()
#Cross Validator requires the following parameters:
crossval = CrossValidator(estimator=classifier,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          numFolds=2) # 3 is best practice

# Run cross-validation, and choose the best set of parameters.
fitModel = crossval.fit(train)

# Print the Coefficients
# First we need to extract the best model from fit model

# Get Best Model
BestModel = fitModel.bestModel
# Extract list of binary models
models = BestModel.models
for model in models:
    print('\033[1m' + 'Intercept: '+ '\033[0m',model.intercept,'\033[1m' + '\nCoefficients:'+ '\033[0m',model.coefficients)
        
# Now generate predictions on test dataset
predictions = fitModel.transform(test)
# And calculate the accuracy score
accuracy = (MC_evaluator.evaluate(predictions))*100
# And print
print(accuracy)

Intercept: 1.7420281793148344
Coefficients: [0.14307963346299235,0.09056419819629825,-0.16784392032699819,0.051061306357161655,-0.12367325819801774,0.002344284711907399,-0.014036012294216584,-0.029979669492204878,0.1221436720127177,0.06970948849092526,0.0049318753372649465,-0.03167745542820083,-0.1346814328739016,0.14861053163477678,0.004086692731493456,-0.05214892625370983,-0.04006441884310158,0.0043179028011782064,-0.1525613332074583,0.012085668069256343]
Intercept: -1.742028179314835
Coefficients: [-0.14307963346299227,-0.09056419819629828,0.16784392032699824,-0.051061306357161516,0.12367325819801768,-0.0023442847119073363,0.014036012294216525,0.02997966949220505,-0.12214367201271781,-0.06970948849092524,-0.004931875337264979,0.03167745542820068,0.13468143287390147,-0.14861053163477678,-0.004086692731493408,0.05214892625370964,0.040064418843101635,-0.004317902801178229,0.15256133320745846,-0.012085668069256292]
83.29798515376459


### Multilayer Perceptron Classifier

In [49]:
# Count how many features you have
# take(1) fetches a single row instead of collecting the whole dataset to the driver
features = tf_idf.select(['features']).take(1)
features_count = len(features[0][0])
# Count how many classes you have 
class_count = tf_idf.select(countDistinct("label")).collect()
classes = class_count[0][0]

# Then use this number to specify the layers
# The first number in this list is the input layer which has to be equal to the number of features in your vector
# The second number is the first hidden layer
# The third number is the second hidden layer 
# The fourth number is the output layer which has to be equal to your class size
layers = [features_count, features_count+1, features_count, classes]
# Instantiate the classifier
classifier = MultilayerPerceptronClassifier(maxIter=100, layers=layers, blockSize=128, seed=1234)

# Fit the model
fitModel = classifier.fit(train)

# Print the number of model weights
print('\033[1m' + "Model Weights: "+ '\033[0m',fitModel.weights.size)
   
# Generate predictions on test dataframe
predictions = fitModel.transform(test)
# Calculate accuracy score
accuracy = (MC_evaluator.evaluate(predictions))*100
# Print accuracy score
print("Accuracy: ",accuracy)

Model Weights:  923
Accuracy:  83.31786850477201
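
The weight count printed above can be reproduced from the `layers` list alone. The sketch assumes a fully connected network with one bias term per unit, and that `features_count` is 20 here (inferred from the 20 coefficients printed for the other models). `mlp_weight_count` is a hypothetical helper.

```python
def mlp_weight_count(layers):
    """Count weights in a fully connected network, one bias per output unit."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layers, layers[1:]))

# 20 input features, hidden layers of 21 and 20 units, 2 output classes.
total_weights = mlp_weight_count([20, 21, 20, 2])
print(total_weights)  # 923, matching fitModel.weights.size above
```

Each pair of adjacent layers contributes `(inputs + 1) * outputs` weights, so the count is driven mostly by the widest adjacent pair.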


### NaiveBayes

In [50]:
# Add parameters of your choice here:
classifier = NaiveBayes()
paramGrid = (ParamGridBuilder() \
             .addGrid(classifier.smoothing, [0.0, 0.2, 0.4, 0.6]) \
             .build())

#Cross Validator requires all of the following parameters:
crossval = CrossValidator(estimator=classifier,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          numFolds=2) # 3 + is best practice
# Fit Model: Run cross-validation, and choose the best set of parameters.
fitModel = crossval.fit(train)

predictions = fitModel.transform(test)
accuracy = (MC_evaluator.evaluate(predictions))*100
print("Accuracy: ",accuracy)

Accuracy:  83.31124072110286
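
The `smoothing` values in the grid control how much probability mass is reserved for features that never occur in a class. The sketch below shows the generic additive (Lidstone) form, which is assumed here to be what the parameter corresponds to in the multinomial model. The per-class totals are hypothetical.

```python
def smoothed_likelihoods(feature_totals, smoothing):
    """Additive (Lidstone) smoothing of per-class feature totals.

    P(feature | class) = (total_f + smoothing) / (sum(totals) + smoothing * n_features)
    """
    n_features = len(feature_totals)
    denom = sum(feature_totals) + smoothing * n_features
    return [(total + smoothing) / denom for total in feature_totals]

# Hypothetical per-class totals for four features; the third never occurs.
totals = [6.0, 3.0, 0.0, 1.0]
for lam in [0.0, 0.2, 0.4, 0.6]:
    print(lam, [round(p, 4) for p in smoothed_likelihoods(totals, lam)])
```

With `smoothing=0.0` the unseen feature gets probability zero, which becomes minus infinity in log space, so the nonzero grid values are usually the safer candidates.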


### Linear Support Vector Machine

In [51]:
# Count how many classes you have and print a warning if there are more than 2.
class_count = tf_idf.select(countDistinct("label")).collect()
classes = class_count[0][0]
if classes > 2:
    print("LinearSVC cannot be used because PySpark currently only accepts binary classification data for this algorithm")

# Add parameters of your choice here:
classifier = LinearSVC()
paramGrid = (ParamGridBuilder() \
             .addGrid(classifier.maxIter, [10, 15]) \
             .addGrid(classifier.regParam, [0.1, 0.01]) \
             .build())

#Cross Validator requires all of the following parameters:
crossval = CrossValidator(estimator=classifier,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          numFolds=2) # 3 + is best practice
# Fit Model: Run cross-validation, and choose the best set of parameters.
fitModel = crossval.fit(train)

BestModel = fitModel.bestModel

print("Intercept: \n" + str(BestModel.intercept))
print('\033[1m' + " Coefficients"+ '\033[0m')
print("You should compare these relative to each other")
print("Coefficients: \n" + str(BestModel.coefficients))
    
# Automatically gets the best model
predictions = fitModel.transform(test)
accuracy = (MC_evaluator.evaluate(predictions))*100
print("Accuracy: ",accuracy)

Intercept: 
-1.0082883209353788
 Coefficients
You should compare these relative to each other
Coefficients: 
[-0.006055138393847619,-0.004954237622905466,0.0025206419775702,-0.0072919574205939095,0.008076459221482814,0.003956149510343715,-0.0015851127298549443,0.00044834218347127844,-0.007149444931317731,-0.004188330611820416,-0.008288242604555485,0.002087474228905888,0.01104548053544675,-0.0053009979426403295,0.0,0.0029510539096506984,0.0018548571343827863,0.0045785196226850655,0.0012388463286382,-0.0004215029370989358]
Accuracy:  83.31124072110286


### Decision Tree

In [52]:
# Add parameters of your choice here:
classifier = DecisionTreeClassifier()
paramGrid = (ParamGridBuilder() \
#                              .addGrid(classifier.maxDepth, [2, 5, 10, 20, 30]) \
             .addGrid(classifier.maxBins, [10, 20, 40, 80, 100]) \
             .build())

#Cross Validator requires all of the following parameters:
crossval = CrossValidator(estimator=classifier,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          numFolds=2) # 3 + is best practice
# Fit Model: Run cross-validation, and choose the best set of parameters.
fitModel = crossval.fit(train)

# Collect and print feature importances
BestModel = fitModel.bestModel
featureImportances = BestModel.featureImportances.toArray()
print("Feature Importances: ",featureImportances)

predictions = fitModel.transform(test)
accuracy = (MC_evaluator.evaluate(predictions))*100
print("Accuracy: ",accuracy)

Feature Importances:  [0.04348115 0.03112926 0.25340888 0.         0.07732208 0.
 0.01869512 0.0163862  0.         0.         0.         0.
 0.         0.         0.02070184 0.06168935 0.0190686  0.03472348
 0.39136731 0.03202673]
Accuracy:  83.27810180275715


### Random Forest

In [53]:
# Add parameters of your choice here:
classifier = RandomForestClassifier()
paramGrid = (ParamGridBuilder() \
               .addGrid(classifier.maxDepth, [2, 5, 10])
#                                .addGrid(classifier.maxBins, [5, 10, 20])
#                                .addGrid(classifier.numTrees, [5, 20, 50])
             .build())

#Cross Validator requires all of the following parameters:
crossval = CrossValidator(estimator=classifier,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          numFolds=2) # 3 + is best practice

# Fit Model: Run cross-validation, and choose the best set of parameters.
fitModel = crossval.fit(train)

# Retrieve best model from cross val
BestModel = fitModel.bestModel
featureImportances = BestModel.featureImportances.toArray()
print("Feature Importances: ",featureImportances)

predictions = fitModel.transform(test)

accuracy = (MC_evaluator.evaluate(predictions))*100
print(" ")
print("Accuracy: ",accuracy)

Feature Importances:  [0.06146049 0.04240204 0.0789521  0.04234037 0.06046521 0.04140761
 0.04439617 0.04691719 0.04854666 0.04321129 0.05400712 0.03631441
 0.05631209 0.04150863 0.04894504 0.05543624 0.0419969  0.04546184
 0.06924306 0.04067555]
 
Accuracy:  83.39077412513257


### Gradient Boost Tree Classifier

In [54]:
class_count = tf_idf.select(countDistinct("label")).collect()
classes = class_count[0][0]
if classes > 2:
    print("GBTClassifier cannot be used because PySpark currently only accepts binary classification data for this algorithm")

# Add parameters of your choice here:
classifier = GBTClassifier()

paramGrid = (ParamGridBuilder() \
#                              .addGrid(classifier.maxDepth, [2, 5, 10, 20, 30]) \
#                              .addGrid(classifier.maxBins, [10, 20, 40, 80, 100]) \
             .addGrid(classifier.maxIter, [10, 15,50,100])
             .build())

#Cross Validator requires all of the following parameters:
crossval = CrossValidator(estimator=classifier,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          numFolds=2) # 3 + is best practice

# Fit Model: Run cross-validation, and choose the best set of parameters.
fitModel = crossval.fit(train)

BestModel = fitModel.bestModel
featureImportances = BestModel.featureImportances.toArray()
print("Feature Importances: ",featureImportances)
    
predictions = fitModel.transform(test)
accuracy = (MC_evaluator.evaluate(predictions))*100
print(" ")
print("Accuracy: ",accuracy)

Feature Importances:  [0.05264466 0.0489709  0.06427695 0.04452043 0.05883297 0.03739779
 0.04118678 0.05431903 0.05324753 0.05530648 0.04880276 0.04650764
 0.05761578 0.05138938 0.04543186 0.05298764 0.04232292 0.04905561
 0.05731328 0.0378696 ]
 
Accuracy:  83.15217391304348


### Evaluate using W2V

In [55]:
W2V=W2VfeaturizedData

In [56]:
W2V.limit(5).toPandas()

| | Text | label | features |
|---|---|---|---|
| 0 | i have bought several of the vitality canned d... | 0.0 | [0.37703929231552025, 0.4992033522898879, 0.37... |
| 1 | product arrived labeled as jumbo salted peanut... | 1.0 | [0.3369212129641121, 0.465928146427232, 0.4858... |
| 2 | this is a confection that has been around a fe... | 0.0 | [0.5058178372353674, 0.4429592355448553, 0.527... |
| 3 | if you are looking for the secret ingredient i... | 1.0 | [0.5585962595857322, 0.4035500947362397, 0.634... |
| 4 | great taffy at a great price there was a wide ... | 0.0 | [0.33080630765268737, 0.31161728536385536, 0.6... |


In [57]:
W2V = W2V.select('features','label')

In [58]:
W2V.limit(5).toPandas()

| | features | label |
|---|---|---|
| 0 | [0.37703929231552025, 0.4992033522898879, 0.37... | 0.0 |
| 1 | [0.3369212129641121, 0.465928146427232, 0.4858... | 1.0 |
| 2 | [0.5058178372353674, 0.4429592355448553, 0.527... | 0.0 |
| 3 | [0.5585962595857322, 0.4035500947362397, 0.634... | 1.0 |
| 4 | [0.33080630765268737, 0.31161728536385536, 0.6... | 0.0 |


### Split into train and test datasets (70:30 ratio)

In [59]:
train,test =W2V.randomSplit([0.7,0.3])

### Logistic Regression

In [60]:
# This is the simplest approach: a single fit with no cross-validation
# Let's go ahead and train a Logistic Regression model
classifier = LogisticRegression()
fitModel = classifier.fit(train)

# Evaluation method for binary classification problem
predictionAndLabels = fitModel.transform(test)
auc = Bin_evaluator.evaluate(predictionAndLabels)
print("AUC:",auc)

# Evaluation for a multiclass classification problem
predictions = fitModel.transform(test)
accuracy = (MC_evaluator.evaluate(predictions))*100
print("Accuracy: {0:.2f}".format(accuracy),"%") #     print("Test Error = %g " % (1.0 - accuracy))
print(" ")

AUC: 0.5023500871509174
Accuracy: 83.54 %
 


### Logistic Regression with cross validation

In [61]:
# First tell Spark which classifier you want to use
classifier = LogisticRegression()

# Then Set up your parameter grid for the cross validator to conduct hyperparameter tuning
paramGrid = (ParamGridBuilder().addGrid(classifier.maxIter, [10, 15,20]).build())

# Then set up the Cross Validator which requires all of the following parameters:
crossval = CrossValidator(estimator=classifier,
                          estimatorParamMaps=paramGrid,
                          evaluator=MC_evaluator,
                          numFolds=2) # 3 + is best practice

# Then fit your model
fitModel = crossval.fit(train)

# Collect the best model and
# print the coefficient matrix
# These values should be compared relative to each other
# And intercepts can be compared to those of other models
BestModel = fitModel.bestModel
print("Intercept: " + str(BestModel.interceptVector))
print("Coefficients: \n" + str(BestModel.coefficientMatrix))

# You can extract the best model from this run like this if you want
LR_BestModel = BestModel

# Next you need to generate predictions on the test dataset
# fitModel automatically uses the best model 
# so we don't need to use BestModel here
predictions = fitModel.transform(test)

# Now print the accuracy rate of the model or AUC for a binary classifier
accuracy = (MC_evaluator.evaluate(predictions))*100
print(accuracy)

Intercept: [-8.73403647883886]
Coefficients: 
DenseMatrix([[-9.80919095, 12.67384101, 11.12925288]])

83.53523860518037


### Classification Diagnostics

In [62]:
# Load the Summary
trainingSummary = LR_BestModel.summary

# General Describe
trainingSummary.predictions.describe().show()

# Obtain the objective per iteration
objectiveHistory = trainingSummary.objectiveHistory
print(" ")
print("objectiveHistory: (scaled loss + regularization) at each iteration")
for objective in objectiveHistory:
    print(objective)

# for multiclass, we can inspect metrics on a per-label basis
print(" ")
print("False positive rate by label:")
for i, rate in enumerate(trainingSummary.falsePositiveRateByLabel):
    print("label %d: %s" % (i, rate))

print(" ")
print("True positive rate by label:")
for i, rate in enumerate(trainingSummary.truePositiveRateByLabel):
    print("label %d: %s" % (i, rate))

print(" ")
print("Precision by label:")
for i, prec in enumerate(trainingSummary.precisionByLabel):
    print("label %d: %s" % (i, prec))

print(" ")
print("Recall by label:")
for i, rec in enumerate(trainingSummary.recallByLabel):
    print("label %d: %s" % (i, rec))

print(" ")
print("F-measure by label:")
for i, f in enumerate(trainingSummary.fMeasureByLabel()):
    print("label %d: %s" % (i, f))

# Collect overall accuracy and label-weighted metrics, then print them
accuracy = trainingSummary.accuracy
falsePositiveRate = trainingSummary.weightedFalsePositiveRate
truePositiveRate = trainingSummary.weightedTruePositiveRate
fMeasure = trainingSummary.weightedFMeasure()
precision = trainingSummary.weightedPrecision
recall = trainingSummary.weightedRecall
print(" ")
print("Accuracy: %s\nFPR: %s\nTPR: %s\nF-measure: %s\nPrecision: %s\nRecall: %s"
      % (accuracy, falsePositiveRate, truePositiveRate, fMeasure, precision, recall))



+-------+-------------------+--------------------+
|summary|              label|          prediction|
+-------+-------------------+--------------------+
|  count|              35058|               35058|
|   mean|0.16435620970962406|0.002367505276969...|
| stddev|0.37060378270058114| 0.04860007786901043|
|    min|                0.0|                 0.0|
|    max|                1.0|                 1.0|
+-------+-------------------+--------------------+

 
objectiveHistory: (scaled loss + regularization) at each iteration
0.4468233827202532
0.44278460872843634
0.44155663687416286
0.43196426149801054
0.4279279641407289
0.4276879583135659
0.42760956444362935
0.42760593702596034
0.4276058924327114
0.4276058902273279
0.4276058902220024
 
False positive rate by label:
label 0: 0.9951405761888233
label 1: 0.001877389404696887
 
True positive rate by label:
label 0: 0.9981226105953032
label 1: 0.004859423811176675
 
Precision by label:
label 0: 0.8360543245175125
label 1: 0.3373493975903614


### One vs. Rest
Recap from lecture: the One-vs-Rest classifier is a multiclass strategy that trains one binary classifier per class, treating the samples of that class as positives and all other samples as negatives. Each class is therefore compared against the rest of the classes as a whole, rather than against each other class individually.

In [63]:
# instantiate the base classifier.
lr = LogisticRegression()
# instantiate the One Vs Rest Classifier.
classifier = OneVsRest(classifier=lr)

# Add parameters of your choice here:
paramGrid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .build()
#Cross Validator requires the following parameters:
crossval = CrossValidator(estimator=classifier,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          numFolds=2) # 3 is best practice

# Run cross-validation, and choose the best set of parameters.
fitModel = crossval.fit(train)

# Print the Coefficients
# First we need to extract the best model from fit model

# Get Best Model
BestModel = fitModel.bestModel
# Extract list of binary models
models = BestModel.models
for model in models:
    print('\033[1m' + 'Intercept: '+ '\033[0m',model.intercept,'\033[1m' + '\nCoefficients:'+ '\033[0m',model.coefficients)
        
# Now generate predictions on test dataset
predictions = fitModel.transform(test)
# And calculate the accuracy score
accuracy = (MC_evaluator.evaluate(predictions))*100
# And print
print(accuracy)

Intercept: 1.5769374860074399
Coefficients: [1.3000949495079814,-0.9683979716326919,-0.1922672880889995]
Intercept: -1.5769374860074388
Coefficients: [-1.3000949495079814,0.9683979716326914,0.19226728808899865]
83.60886152198648


### Multilayer Perceptron Classifier

In [64]:
# Count how many features you have
# take(1) fetches a single row instead of collecting the whole dataset to the driver
features = W2V.select(['features']).take(1)
features_count = len(features[0][0])
# Count how many classes you have 
class_count = W2V.select(countDistinct("label")).collect()
classes = class_count[0][0]

# Then use this number to specify the layers
# The first number in this list is the input layer which has to be equal to the number of features in your vector
# The second number is the first hidden layer
# The third number is the second hidden layer 
# The fourth number is the output layer which has to be equal to your class size
layers = [features_count, features_count+1, features_count, classes]
# Instantiate the classifier
classifier = MultilayerPerceptronClassifier(maxIter=100, layers=layers, blockSize=128, seed=1234)

# Fit the model
fitModel = classifier.fit(train)

# Print the number of model weights
print('\033[1m' + "Model Weights: "+ '\033[0m',fitModel.weights.size)
   
# Generate predictions on test dataframe
predictions = fitModel.transform(test)
# Calculate accuracy score
accuracy = (MC_evaluator.evaluate(predictions))*100
# Print accuracy score
print("Accuracy: ",accuracy)

Model Weights:  39
Accuracy:  83.6155545144234


### NaiveBayes

In [65]:
# Add parameters of your choice here:
classifier = NaiveBayes()
paramGrid = (ParamGridBuilder() \
             .addGrid(classifier.smoothing, [0.0, 0.2, 0.4, 0.6]) \
             .build())

#Cross Validator requires all of the following parameters:
crossval = CrossValidator(estimator=classifier,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          numFolds=2) # 3 + is best practice
# Fit Model: Run cross-validation, and choose the best set of parameters.
fitModel = crossval.fit(train)

predictions = fitModel.transform(test)
accuracy = (MC_evaluator.evaluate(predictions))*100
print("Accuracy: ",accuracy)

Accuracy:  83.60886152198648


### Linear Support Vector Machine

In [66]:
# Count how many classes you have and print a warning if there are more than 2.
class_count = W2V.select(countDistinct("label")).collect()
classes = class_count[0][0]
if classes > 2:
    print("LinearSVC cannot be used because PySpark currently only accepts binary classification data for this algorithm")

# Add parameters of your choice here:
classifier = LinearSVC()
paramGrid = (ParamGridBuilder() \
             .addGrid(classifier.maxIter, [10, 15]) \
             .addGrid(classifier.regParam, [0.1, 0.01]) \
             .build())

#Cross Validator requires all of the following parameters:
crossval = CrossValidator(estimator=classifier,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          numFolds=2) # 3 + is best practice
# Fit Model: Run cross-validation, and choose the best set of parameters.
fitModel = crossval.fit(train)

BestModel = fitModel.bestModel

print("Intercept: \n" + str(BestModel.intercept))
print('\033[1m' + " Coefficients"+ '\033[0m')
print("You should compare these relative to each other")
print("Coefficients: \n" + str(BestModel.coefficients))
    
# transform() on the fitted CrossValidatorModel uses the best model automatically
predictions = fitModel.transform(test)
accuracy = (MC_evaluator.evaluate(predictions))*100
print("Accuracy: ",accuracy)

Intercept: 
-1.1124333603133294
 Coefficients
You should compare these relative to each other
Coefficients: 
[-0.13533571729137764,0.17796321481569402,0.17104163386899277]
Accuracy:  83.60886152198648


### Decision Tree

In [67]:
# Add parameters of your choice here:
classifier = DecisionTreeClassifier()
paramGrid = (ParamGridBuilder()
             # .addGrid(classifier.maxDepth, [2, 5, 10, 20, 30])
             .addGrid(classifier.maxBins, [10, 20, 40, 80, 100])
             .build())

#Cross Validator requires all of the following parameters:
crossval = CrossValidator(estimator=classifier,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          numFolds=2) # 3 + is best practice
# Fit Model: Run cross-validation, and choose the best set of parameters.
fitModel = crossval.fit(train)

# Collect and print feature importances
BestModel = fitModel.bestModel
featureImportances = BestModel.featureImportances.toArray()
print("Feature Importances: ",featureImportances)

predictions = fitModel.transform(test)
accuracy = (MC_evaluator.evaluate(predictions))*100
print("Accuracy: ",accuracy)

Feature Importances:  [0.45558102 0.49611745 0.04830153]
Accuracy:  83.60886152198648
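
The importance vector is easier to read once it is paired with feature names and sorted. The sketch below does that in plain Python using the values printed above. The names `feature_0` to `feature_2` are hypothetical placeholders, since the real features are unnamed positions in the feature vector.

```python
# Importances printed by the decision tree cell above
feature_importances = [0.45558102, 0.49611745, 0.04830153]

# Hypothetical names: the real features are unnamed vector positions
feature_names = ["feature_0", "feature_1", "feature_2"]

# Pair each name with its score and sort from most to least important
ranked = sorted(zip(feature_names, feature_importances),
                key=lambda pair: pair[1], reverse=True)

for name, score in ranked:
    print(f"{name}: {score:.3f}")

# The scores are normalized, so they should sum to roughly 1
total = sum(feature_importances)
print(f"total: {total:.4f}")
```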


### Random Forest Classifier

In [68]:
# Add parameters of your choice here:
classifier = RandomForestClassifier()
paramGrid = (ParamGridBuilder()
             .addGrid(classifier.maxDepth, [2, 5, 10])
             # .addGrid(classifier.maxBins, [5, 10, 20])
             # .addGrid(classifier.numTrees, [5, 20, 50])
             .build())

#Cross Validator requires all of the following parameters:
crossval = CrossValidator(estimator=classifier,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          numFolds=2) # 3 + is best practice

# Fit Model: Run cross-validation, and choose the best set of parameters.
fitModel = crossval.fit(train)

# Retrieve best model from cross val
BestModel = fitModel.bestModel
featureImportances = BestModel.featureImportances.toArray()
print("Feature Importances: ",featureImportances)

predictions = fitModel.transform(test)

accuracy = (MC_evaluator.evaluate(predictions))*100
print(" ")
print("Accuracy: ",accuracy)

Feature Importances:  [0.33805675 0.35609284 0.3058504 ]
 
Accuracy:  83.63563349173415


### Gradient Boost Tree Classifier

In [69]:
class_count = W2V.select(countDistinct("label")).collect()
classes = class_count[0][0]
if classes > 2:
    print("GBTClassifier cannot be used because PySpark currently only accepts binary classification data for this algorithm")

# Add parameters of your choice here:
classifier = GBTClassifier()

paramGrid = (ParamGridBuilder()
             # .addGrid(classifier.maxDepth, [2, 5, 10, 20, 30])
             # .addGrid(classifier.maxBins, [10, 20, 40, 80, 100])
             .addGrid(classifier.maxIter, [10, 15, 50, 100])
             .build())

#Cross Validator requires all of the following parameters:
crossval = CrossValidator(estimator=classifier,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          numFolds=2) # 3 + is best practice

# Fit Model: Run cross-validation, and choose the best set of parameters.
fitModel = crossval.fit(train)

BestModel = fitModel.bestModel
featureImportances = BestModel.featureImportances.toArray()
print("Feature Importances: ",featureImportances)
    
predictions = fitModel.transform(test)
accuracy = (MC_evaluator.evaluate(predictions))*100
print(" ")
print("Accuracy: ",accuracy)

Feature Importances:  [0.35424962 0.30067166 0.34507872]
 
Accuracy:  83.49508065055886
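
The class-count check in the cell above only prints a message, so training would still be attempted on multiclass data. One possible alternative is to raise an exception before fitting. The sketch below is plain Python under that assumption; `BINARY_ONLY` and `check_binary_support` are hypothetical names introduced here for illustration.

```python
# Classifiers that, per the messages in this notebook, only accept binary labels
BINARY_ONLY = ("LinearSVC", "GBTClassifier")


def check_binary_support(classifier_name, classes):
    """Raise ValueError if a binary-only classifier gets more than 2 classes."""
    if classifier_name in BINARY_ONLY and classes > 2:
        raise ValueError(
            f"{classifier_name} only supports binary classification, "
            f"but the data has {classes} classes"
        )


check_binary_support("GBTClassifier", 2)           # binary data, passes silently
check_binary_support("RandomForestClassifier", 5)  # not binary-only, passes

try:
    check_binary_support("LinearSVC", 3)
except ValueError as err:
    print(err)
```

Calling the check before `crossval.fit(train)` would stop the cell early instead of failing somewhere inside the training job.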


## Create an All-Encompassing Classification Training and Evaluation Function
This function lets us iterate over any set of classifiers, training and evaluating each one in turn.

In [70]:
def ClassTrainEval(classifier,features,classes,train,test):

    def FindMtype(classifier):
        # The classifier is passed in already instantiated
        M = classifier
        # Look up its class name
        Mtype = type(M).__name__
        
        return Mtype
    
    Mtype = FindMtype(classifier)
    

    def IntanceFitModel(Mtype,classifier,classes,features,train):
        
        if Mtype == "OneVsRest":
            # instantiate the base classifier.
            lr = LogisticRegression()
            # instantiate the One Vs Rest Classifier.
            OVRclassifier = OneVsRest(classifier=lr)
#             fitModel = OVRclassifier.fit(train)
            # Add parameters of your choice here:
            paramGrid = ParamGridBuilder() \
                .addGrid(lr.regParam, [0.1, 0.01]) \
                .build()
            #Cross Validator requires the following parameters:
            crossval = CrossValidator(estimator=OVRclassifier,
                                      estimatorParamMaps=paramGrid,
                                      evaluator=MulticlassClassificationEvaluator(),
                                      numFolds=2) # 3 is best practice
            # Run cross-validation, and choose the best set of parameters.
            fitModel = crossval.fit(train)
            return fitModel
        if Mtype == "MultilayerPerceptronClassifier":
            # specify layers for the neural network:
            # input layer of size features, two intermediate of features+1 and same size as features
            # and output of size number of classes
            # Note: crossvalidator cannot be used here
            features_count = len(features[0][0])
            layers = [features_count, features_count+1, features_count, classes]
            MPC_classifier = MultilayerPerceptronClassifier(maxIter=100, layers=layers, blockSize=128, seed=1234)
            fitModel = MPC_classifier.fit(train)
            return fitModel
        if Mtype in("LinearSVC","GBTClassifier") and classes != 2: # These classifiers currently only accept binary classification
            print(Mtype," could not be used because PySpark currently only accepts binary classification data for this algorithm")
            return
        if Mtype in("LogisticRegression","NaiveBayes","RandomForestClassifier","GBTClassifier","LinearSVC","DecisionTreeClassifier"):
  
            # Add parameters of your choice here:
            if Mtype == "LogisticRegression":
                paramGrid = (ParamGridBuilder()
                             # .addGrid(classifier.regParam, [0.1, 0.01])
                             .addGrid(classifier.maxIter, [10, 15, 20])
                             .build())
                
            # Add parameters of your choice here:
            if Mtype == "NaiveBayes":
                paramGrid = (ParamGridBuilder()
                             .addGrid(classifier.smoothing, [0.0, 0.2, 0.4, 0.6])
                             .build())
                
            # Add parameters of your choice here:
            if Mtype == "RandomForestClassifier":
                paramGrid = (ParamGridBuilder()
                             .addGrid(classifier.maxDepth, [2, 5, 10])
                             # .addGrid(classifier.maxBins, [5, 10, 20])
                             # .addGrid(classifier.numTrees, [5, 20, 50])
                             .build())
                
            # Add parameters of your choice here:
            if Mtype == "GBTClassifier":
                paramGrid = (ParamGridBuilder()
                             # .addGrid(classifier.maxDepth, [2, 5, 10, 20, 30])
                             # .addGrid(classifier.maxBins, [10, 20, 40, 80, 100])
                             .addGrid(classifier.maxIter, [10, 15, 50, 100])
                             .build())
                
            # Add parameters of your choice here:
            if Mtype == "LinearSVC":
                paramGrid = (ParamGridBuilder()
                             .addGrid(classifier.maxIter, [10, 15])
                             .addGrid(classifier.regParam, [0.1, 0.01])
                             .build())
            
            # Add parameters of your choice here:
            if Mtype == "DecisionTreeClassifier":
                paramGrid = (ParamGridBuilder()
                             # .addGrid(classifier.maxDepth, [2, 5, 10, 20, 30])
                             .addGrid(classifier.maxBins, [10, 20, 40, 80, 100])
                             .build())
            
            #Cross Validator requires all of the following parameters:
            crossval = CrossValidator(estimator=classifier,
                                      estimatorParamMaps=paramGrid,
                                      evaluator=MulticlassClassificationEvaluator(),
                                      numFolds=2) # 3 + is best practice
            # Fit Model: Run cross-validation, and choose the best set of parameters.
            fitModel = crossval.fit(train)
            return fitModel
    
    fitModel = IntanceFitModel(Mtype,classifier,classes,features,train)
    
    # Print feature selection metrics
    if fitModel is not None:
        
        if Mtype == "OneVsRest":
            # Get Best Model
            BestModel = fitModel.bestModel
            print(" ")
            print('\033[1m' + Mtype + '\033[0m')
            # Extract list of binary models
            models = BestModel.models
            for model in models:
                print('\033[1m' + 'Intercept: '+ '\033[0m',model.intercept,'\033[1m' + '\nCoefficients:'+ '\033[0m',model.coefficients)

        if Mtype == "MultilayerPerceptronClassifier":
            print("")
            print('\033[1m' + Mtype," Weights"+ '\033[0m')
            print('\033[1m' + "Model Weights: "+ '\033[0m',fitModel.weights.size)
            print("")

        if Mtype in("DecisionTreeClassifier", "GBTClassifier","RandomForestClassifier"):
            # FEATURE IMPORTANCES
            # Estimate of the importance of each feature.
            # Each feature's importance is the average of its importance across all trees
            # in the ensemble. The importance vector is normalized to sum to 1.
            # Get Best Model
            BestModel = fitModel.bestModel
            print(" ")
            print('\033[1m' + Mtype," Feature Importances"+ '\033[0m')
            print("(Scores add up to 1)")
            print("Lowest score is the least important")
            print(" ")
            print(BestModel.featureImportances)
            
            if Mtype == "DecisionTreeClassifier":
                global DT_featureimportances
                DT_featureimportances = BestModel.featureImportances.toArray()
                global DT_BestModel
                DT_BestModel = BestModel
            if Mtype == "GBTClassifier":
                global GBT_featureimportances
                GBT_featureimportances = BestModel.featureImportances.toArray()
                global GBT_BestModel
                GBT_BestModel = BestModel
            if Mtype == "RandomForestClassifier":
                global RF_featureimportances
                RF_featureimportances = BestModel.featureImportances.toArray()
                global RF_BestModel
                RF_BestModel = BestModel

        if Mtype == "LogisticRegression":
            # Get Best Model
            BestModel = fitModel.bestModel
            print(" ")
            print('\033[1m' + Mtype," Coefficient Matrix"+ '\033[0m')
            print("You should compare these relative to each other")
            print("Coefficients: \n" + str(BestModel.coefficientMatrix))
            print("Intercept: " + str(BestModel.interceptVector))
            global LR_coefficients
            LR_coefficients = BestModel.coefficientMatrix.toArray()
            global LR_BestModel
            LR_BestModel = BestModel

        if Mtype == "LinearSVC":
            # Get Best Model
            BestModel = fitModel.bestModel
            print(" ")
            print('\033[1m' + Mtype," Coefficients"+ '\033[0m')
            print("You should compare these relative to each other")
            print("Coefficients: \n" + str(BestModel.coefficients))
            global LSVC_coefficients
            LSVC_coefficients = BestModel.coefficients.toArray()
            global LSVC_BestModel
            LSVC_BestModel = BestModel
        
   
    # Set the column names to match the external results dataframe that we will join with later:
    columns = ['Classifier', 'Result']
    
    if Mtype in("LinearSVC","GBTClassifier") and classes != 2:
        Mtype = [Mtype] # make this a list
        score = ["N/A"]
        result = spark.createDataFrame(zip(Mtype,score), schema=columns)
    else:
        predictions = fitModel.transform(test)
        MC_evaluator = MulticlassClassificationEvaluator(metricName="accuracy") # predictionCol="prediction" (default)
        accuracy = (MC_evaluator.evaluate(predictions))*100
        Mtype = [Mtype] # make this a list
        score = [str(accuracy)] # convert to a string and wrap in a list
        result = spark.createDataFrame(zip(Mtype,score), schema=columns)
        result = result.withColumn('Result',result.Result.substr(0, 5))
        
    return result
    # Best models, feature importances and coefficients are exposed through the global variables set above
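
The `substr(0, 5)` call in the function shortens the accuracy string for display. Assuming it keeps the first five characters, the result is a truncated value rather than a rounded one. The plain-Python sketch below mimics that behavior with string slicing and compares it with two-decimal formatting; it illustrates the difference and is not a change to the function.

```python
accuracy = 83.6155545144234  # the MLP accuracy reported earlier in this notebook

# Keeping the first five characters truncates the value
truncated = str(accuracy)[:5]

# Formatting to two decimals rounds it instead
rounded = f"{accuracy:.2f}"

print(truncated)  # 83.61
print(rounded)    # 83.62

# A fixed character count also behaves differently for other magnitudes
print(str(100.0)[:5])  # 100.0
print(f"{100.0:.2f}")  # 100.00
```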

In [71]:
# from pyspark.ml.classification import *
# from pyspark.ml.evaluation import *
# from pyspark.sql import functions
# from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Comment out Naive Bayes if your data still contains negative values
classifiers = [
    LogisticRegression(),
    OneVsRest(),
    LinearSVC(),
    NaiveBayes(),
    RandomForestClassifier(),
    GBTClassifier(),
    DecisionTreeClassifier(),
    MultilayerPerceptronClassifier(),
]

featureDF_list = [HTFfeaturizedData,TFIDFfeaturizedData,W2VfeaturizedData]

In [72]:
for featureDF in featureDF_list:
    print('\033[1m' + featureDF.name," Results:"+ '\033[0m')
    train, test = featureDF.randomSplit([0.7, 0.3],seed = 11)
    features = featureDF.select(['features']).collect()
    # Count the distinct labels so the function can tell binary from multiclass data
    class_count = featureDF.select(countDistinct("label")).collect()
    classes = class_count[0][0]

    #set up your results table
    columns = ['Classifier', 'Result']
    vals = [("Place Holder","N/A")]
    results = spark.createDataFrame(vals, columns)

    for classifier in classifiers:
        new_result = ClassTrainEval(classifier,features,classes,train,test)
        results = results.union(new_result)
    results = results.where("Classifier!='Place Holder'")
    # show() prints the table itself and returns None, so it should not be wrapped in print()
    results.show(truncate=False)

HTFfeaturizedData  Results:
 
LogisticRegression  Coefficient Matrix
You should compare these relative to each other
Coefficients: 
DenseMatrix([[-0.03887312, -0.03588526,  0.07307599, -0.00198963,  0.05241345,
               0.01283564,  0.01537288,  0.00417559, -0.04219799, -0.02513613,
               0.00393539,  0.00339796,  0.04777044, -0.06073248, -0.00429448,
               0.01497416,  0.00661035, -0.01435145,  0.0770096 , -0.01242766]])
Intercept: [-1.74414154586384]
 
OneVsRest
Intercept:  1.7410187222960556
Coefficients: [0.01437586840641003,0.013219286777314378,-0.03149779225069064,-0.0008909001576969105,-0.02286012542393386,-0.007824434469452527,-0.006936977761607128,-0.004861400745946295,0.012882752354334281,0.009717501838771715,-0.003178312131499693,-0.003971385389827328,-0.020723763421755022,0.01991574970435164,0.0005112888789387337,-0.005562972670953192,-0.0032714604963370894,0.0023937420915032346,-0.033411768973834716,0.0028468

 
OneVsRest
Intercept:  1.4726702019506157
Coefficients: [1.4013113811736666,-0.8577465187768131,-0.18734253482253893]
Intercept:  -1.4726702019506153
Coefficients: [-1.4013113811736666,0.8577465187768131,0.18734253482253832]
 
LinearSVC  Coefficients
You should compare these relative to each other
Coefficients: 
[-0.14136597442183704,0.19527600763891134,0.18528475579148718]
 
RandomForestClassifier  Feature Importances
(Scores add up to 1)
Lowest score is the least important
 
(3,[0,1,2],[0.3375748086933921,0.3656097920289301,0.2968153992776778])
 
GBTClassifier  Feature Importances
(Scores add up to 1)
Lowest score is the least important
 
(3,[0,1,2],[0.3376577025241064,0.321911701330617,0.34043059614527665])
 
DecisionTreeClassifier  Feature Importances
(Scores add up to 1)
Lowest score is the least important
 
(3,[],[])

MultilayerPerceptronClassifier  Weights
Model Weights:  39

+------------