# Classification in PySpark's MLlib

PySpark offers a good variety of algorithms for classification problems. However, because PySpark operates on distributed DataFrames, popular single-machine Python libraries such as scikit-learn cannot be applied to them directly, which means we need to use PySpark's MLlib package for these tasks. Luckily, MLlib offers a good selection of algorithms! In this notebook we will go over how to prepare our data and then train and test the classification algorithms PySpark offers.

## Algorithms Available

PySpark offers the following algorithms for classification. 

1. Logistic Regression 
2. Naive Bayes
3. One Vs Rest
4. Linear Support Vector Machine (SVC)
5. Random Forest Classifier
6. GBT Classifier
7. Decision Tree Classifier
8. Multilayer Perceptron Classifier (Neural Network)

In [1]:
#Importing pyspark and creating a session
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('classification1').getOrCreate()
spark

In [2]:
#Importing required libraries
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import MinMaxScaler

### Data Set Name: Autistic Spectrum Disorder Screening Data for Adult
Autistic Spectrum Disorder (ASD) is a neurodevelopmental condition associated with significant healthcare costs, and early diagnosis can significantly reduce these. Unfortunately, waiting times for an ASD diagnosis are lengthy and procedures are not cost effective. The economic impact of autism and the increase in the number of ASD cases across the world reveal an urgent need for the development of easily implemented and effective screening methods. Therefore, a time-efficient and accessible ASD screening method is urgently needed to help health professionals and to inform individuals whether they should pursue formal clinical diagnosis. The rapid growth in the number of ASD cases worldwide necessitates datasets related to behaviour traits. However, such datasets are rare, making it difficult to perform thorough analyses to improve the efficiency, sensitivity, specificity and predictive accuracy of the ASD screening process. Presently, very few autism datasets associated with clinical diagnosis or screening are available, and most of them are genetic in nature. Hence, we propose a new dataset related to autism screening of adults that contains 20 features to be utilised for further analysis, especially in determining influential autistic traits and improving the classification of ASD cases. In this dataset, we record ten behavioural features (AQ-10-Adult) plus ten individual characteristics that have proved to be effective in detecting ASD cases from controls in behaviour science. Note that the file loaded below is the toddler version of this screening data, as its file name and its `Qchat-10-Score` and `Age_Mons` columns show.

### Source: 
https://www.kaggle.com/faizunnabi/autism-screening

In [3]:
path ="datasets-mlib/"
df = spark.read.csv(path+'Toddler Autism dataset July 2018.csv',inferSchema=True,header=True)

In [4]:
#Checking the dataset
df.limit(6).toPandas()

Unnamed: 0,Case_No,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,Age_Mons,Qchat-10-Score,Sex,Ethnicity,Jaundice,Family_mem_with_ASD,Who completed the test,Class/ASD Traits
0,1,0,0,0,0,0,0,1,1,0,1,28,3,f,middle eastern,yes,no,family member,No
1,2,1,1,0,0,0,1,1,0,0,0,36,4,m,White European,yes,no,family member,Yes
2,3,1,0,0,0,0,0,1,1,0,1,36,4,m,middle eastern,yes,no,family member,Yes
3,4,1,1,1,1,1,1,1,1,1,1,24,10,m,Hispanic,no,no,family member,Yes
4,5,1,1,0,1,1,1,1,1,1,1,20,9,f,White European,no,yes,family member,Yes
5,6,1,1,0,0,1,1,1,1,1,1,21,8,m,black,no,no,family member,Yes


In [5]:
df.printSchema()

root
 |-- Case_No: integer (nullable = true)
 |-- A1: integer (nullable = true)
 |-- A2: integer (nullable = true)
 |-- A3: integer (nullable = true)
 |-- A4: integer (nullable = true)
 |-- A5: integer (nullable = true)
 |-- A6: integer (nullable = true)
 |-- A7: integer (nullable = true)
 |-- A8: integer (nullable = true)
 |-- A9: integer (nullable = true)
 |-- A10: integer (nullable = true)
 |-- Age_Mons: integer (nullable = true)
 |-- Qchat-10-Score: integer (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Ethnicity: string (nullable = true)
 |-- Jaundice: string (nullable = true)
 |-- Family_mem_with_ASD: string (nullable = true)
 |-- Who completed the test: string (nullable = true)
 |-- Class/ASD Traits : string (nullable = true)



In the dataset,
- Independent variables (features): A1 through Who completed the test (Case_No is only a row identifier and is dropped)
- Dependent variable: Class/ASD Traits 

In [6]:
#Identifying the number of classes in the dependent variable
df.groupBy("Class/ASD Traits ").count().show()

+-----------------+-----+
|Class/ASD Traits |count|
+-----------------+-----+
|               No|  326|
|              Yes|  728|
+-----------------+-----+



### Formatting data

MLlib requires all the input features to be assembled into a single vector column.

In [7]:
#Separating dependent and independent variables
input_columns = df.columns # Collect the column names as a list
input_columns = input_columns[1:-1] # keep only relevant columns: drop the first column (Case_No) and the last column (the dependent variable)
input_columns

['A1',
 'A2',
 'A3',
 'A4',
 'A5',
 'A6',
 'A7',
 'A8',
 'A9',
 'A10',
 'Age_Mons',
 'Qchat-10-Score',
 'Sex',
 'Ethnicity',
 'Jaundice',
 'Family_mem_with_ASD',
 'Who completed the test']

In [8]:
dependent_var = 'Class/ASD Traits '
dependent_var

'Class/ASD Traits '

In [9]:
#Converting the dependent variable column from Yes/No strings to a numeric label (0.0 and 1.0)
renamed = df.withColumn("label_str", df[dependent_var].cast(StringType())) #Copy the target into a new column and cast it to string type
indexer = StringIndexer(inputCol="label_str", outputCol="label") #PySpark estimators look for a target column named "label" by default

In [10]:
indexed = indexer.fit(renamed).transform(renamed)
indexed.limit(5).toPandas()

Unnamed: 0,Case_No,A1,A2,A3,A4,A5,A6,A7,A8,A9,...,Age_Mons,Qchat-10-Score,Sex,Ethnicity,Jaundice,Family_mem_with_ASD,Who completed the test,Class/ASD Traits,label_str,label
0,1,0,0,0,0,0,0,1,1,0,...,28,3,f,middle eastern,yes,no,family member,No,No,1.0
1,2,1,1,0,0,0,1,1,0,0,...,36,4,m,White European,yes,no,family member,Yes,Yes,0.0
2,3,1,0,0,0,0,0,1,1,0,...,36,4,m,middle eastern,yes,no,family member,Yes,Yes,0.0
3,4,1,1,1,1,1,1,1,1,1,...,24,10,m,Hispanic,no,no,family member,Yes,Yes,0.0
4,5,1,1,0,1,1,1,1,1,1,...,20,9,f,White European,no,yes,family member,Yes,Yes,0.0
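
The label mapping in the output above (`Yes` becomes 0.0 and `No` becomes 1.0) can look inverted at first. By default `StringIndexer` orders values by descending frequency, so the most common string receives index 0.0. Below is a minimal pure-Python sketch of that ordering using the class counts printed earlier; the helper name `frequency_index` is hypothetical and is not part of PySpark.

```python
from collections import Counter

def frequency_index(values):
    """Map each distinct string to an index, most frequent first.

    Ties are broken alphabetically in this sketch.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}

# Class counts taken from the groupBy output above: 728 "Yes" rows, 326 "No" rows
labels = ["Yes"] * 728 + ["No"] * 326
mapping = frequency_index(labels)
print(mapping)  # {'Yes': 0.0, 'No': 1.0}
```

So throughout this notebook, label 0.0 corresponds to `Yes` and label 1.0 corresponds to `No`.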


In [11]:
#Converting all the string type data in the input column list to numeric, otherwise pyspark cannot process it
numeric_inputs = []
string_inputs = []
for column in input_columns:
    #Identifying string type variables (isinstance does not depend on how a given PySpark version prints the type)
    if isinstance(indexed.schema[column].dataType, StringType):
        #Setting up string indexer function 
        indexer = StringIndexer(inputCol=column, outputCol=column+"_num")
        #Calling the indexer function
        indexed = indexer.fit(indexed).transform(indexed)
        #Collecting the new column names
        new_col_name = column+"_num"
        #Appending the column name to string input list
        string_inputs.append(new_col_name)
    else:
        numeric_inputs.append(column)

In [12]:
indexed.limit(5).toPandas()

Unnamed: 0,Case_No,A1,A2,A3,A4,A5,A6,A7,A8,A9,...,Family_mem_with_ASD,Who completed the test,Class/ASD Traits,label_str,label,Sex_num,Ethnicity_num,Jaundice_num,Family_mem_with_ASD_num,Who completed the test_num
0,1,0,0,0,0,0,0,1,1,0,...,no,family member,No,No,1.0,1.0,2.0,1.0,0.0,0.0
1,2,1,1,0,0,0,1,1,0,0,...,no,family member,Yes,Yes,0.0,0.0,0.0,1.0,0.0,0.0
2,3,1,0,0,0,0,0,1,1,0,...,no,family member,Yes,Yes,0.0,0.0,2.0,1.0,0.0,0.0
3,4,1,1,1,1,1,1,1,1,1,...,no,family member,Yes,Yes,0.0,0.0,5.0,0.0,0.0,0.0
4,5,1,1,0,1,1,1,1,1,1,...,yes,family member,Yes,Yes,0.0,1.0,0.0,0.0,1.0,0.0


#### Treating for skewness and outliers

Skewness measures how much a distribution of values deviates from symmetry around the mean. A value of zero means the distribution is symmetric, while a positive skewness indicates a greater number of smaller values, and a negative value indicates a greater number of larger values. 

As a general rule of thumb: 

 - If skewness is **less than -1 or greater than 1**, the distribution is highly skewed. 
 - If skewness is **between -1 and -0.5 or between 0.5 and 1**, the distribution is moderately skewed. 
 - If skewness is **between -0.5 and 0.5**, the distribution is approximately symmetric.
 
A common recommendation for treating skewness is either a log transformation for positive skewed data or an exponential transformation for negatively skewed data.
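
To make the rule of thumb concrete, here is a small pure-Python sketch that computes skewness with the usual moment-based definition and then applies the thresholds listed above. The function names and the sample values are made up for illustration, and the sketch is meant to run in a plain Python session rather than replace PySpark's `skewness` function.

```python
import math

def population_skewness(values):
    """Third central moment divided by the variance raised to 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def describe_skew(s):
    """Apply the rule of thumb from the text."""
    if s < -1 or s > 1:
        return "highly skewed"
    if s < -0.5 or s > 0.5:
        return "moderately skewed"
    return "approximately symmetric"

raw = [1, 2, 4, 8, 16, 32, 64]           # made-up, right-skewed values
logged = [math.log(v + 1) for v in raw]  # log(x + 1) transform

print(describe_skew(population_skewness(raw)))     # highly skewed
print(describe_skew(population_skewness(logged)))  # approximately symmetric
```

The log transform compresses the long right tail, which is why the second result falls back inside the symmetric band.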


**Outliers** <br>
One common way to correct outliers is by flooring and capping which means editing any value that is above or below a certain threshold (99th percentile or 1st percentile) back to the highest/lowest value in that percentile. For example, if the 99th percentile is 96 and there is a value of 1,000, you would change that value to 96. 
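
The flooring and capping step can be sketched in a few lines of plain Python. The thresholds below are hypothetical (a 1st percentile of 10 and the 99th percentile of 96 from the example above), and `floor_and_cap` is an illustrative helper, not a PySpark function.

```python
def floor_and_cap(values, lower, upper):
    """Clamp every value into the closed interval [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]

# Hypothetical thresholds: 1st percentile = 10, 99th percentile = 96
print(floor_and_cap([5, 50, 1000], lower=10, upper=96))  # [10, 50, 96]
```

Run this in a fresh Python session: inside this notebook, `from pyspark.sql.functions import *` replaces the built-in `min` and `max` with Spark column functions, which is why the PySpark version below uses `when(...).otherwise(...)` instead.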

In [13]:
#Creating empty dictionary d
d = {}
# Creating a dictionary of quantiles from numeric cols. Using the top and bottom 1%, but it can be adjusted if needed
# (the loop variable is named c so that it does not shadow pyspark.sql.functions.col)
for c in numeric_inputs: 
    d[c] = indexed.approxQuantile(c,[0.01,0.99],0.25) #the last number is the relative error: increase it to go faster, decrease it for more accurate quantiles

#Now check for skewness for all numeric cols
for c in numeric_inputs:
    skew = indexed.agg(skewness(indexed[c])).collect() #check for skewness
    skew = skew[0][0]
    # If strong skewness is found, floor and cap the column at the quantiles above,
    # then apply the appropriate transformation
    if skew > 1: # If right skew, floor, cap and log(x+1)
        indexed = indexed.withColumn(c, \
        log(when(indexed[c] < d[c][0],d[c][0])\
        .when(indexed[c] > d[c][1], d[c][1])\
        .otherwise(indexed[c] ) +1).alias(c))
        print(c+" has been treated for positive (right) skewness. (skew =",skew,")")
    elif skew < -1: # If left skew floor, cap and exp(x)
        indexed = indexed.withColumn(c, \
        exp(when(indexed[c] < d[c][0],d[c][0])\
        .when(indexed[c] > d[c][1], d[c][1])\
        .otherwise(indexed[c] )).alias(c))
        print(c+" has been treated for negative (left) skewness. (skew =",skew,")")

Since none of the print statements above were executed, no numeric column had a skewness beyond -1 or 1, so the data was left untransformed.

In [14]:
#Checking for negative values in the dataframe 
#Producing a warning if there are negative values in the dataframe, because Naive Bayes cannot be used on them
#Note: Checking only the numeric input values since anything that is indexed won't have negative values

# Calculating the mins for all numeric input columns (using indexed, the dataframe that is assembled below)
minimums = indexed.select([min(c).alias(c) for c in numeric_inputs]) 
# Creating an array for all mins and select only the input cols
min_array = minimums.select(array(numeric_inputs).alias("mins")) 
# Collecting global min as Python object
df_minimum = min_array.select(array_min(min_array.mins)).collect() 
# Slicing to get the number itself
df_minimum = df_minimum[0][0] 

# If there are ANY negative vals found in the df, print a warning message
if df_minimum < 0:
    print("WARNING: The Naive Bayes Classifier will not be able to process your dataframe as it contains negative values")
else:
    print("No negative values were found in your dataframe.")

No negative values were found in your dataframe.


In [15]:
# Now creating final features list
features_list = numeric_inputs + string_inputs
# Creating vector assembler object
assembler = VectorAssembler(inputCols=features_list,outputCol='features')
# And calling on the vector assembler to transform your dataframe
output = assembler.transform(indexed).select('features','label')

In [16]:
output.show()

+--------------------+-----+
|            features|label|
+--------------------+-----+
|(17,[6,7,9,10,11,...|  1.0|
|(17,[0,1,5,6,10,1...|  0.0|
|(17,[0,6,7,9,10,1...|  0.0|
|[1.0,1.0,1.0,1.0,...|  0.0|
|[1.0,1.0,0.0,1.0,...|  0.0|
|[1.0,1.0,0.0,0.0,...|  0.0|
|(17,[0,3,4,5,8,10...|  0.0|
|(17,[1,4,6,7,8,9,...|  0.0|
|(17,[6,9,10,11,13...|  1.0|
|[1.0,1.0,1.0,0.0,...|  0.0|
|[1.0,0.0,0.0,1.0,...|  0.0|
|[1.0,1.0,1.0,1.0,...|  0.0|
|(17,[10,12,13,14]...|  1.0|
|[1.0,1.0,1.0,1.0,...|  0.0|
|(17,[10,13],[18.0...|  1.0|
|(17,[0,1,2,4,6,7,...|  0.0|
|(17,[10,13,15],[3...|  1.0|
|[1.0,1.0,1.0,0.0,...|  0.0|
|(17,[0,4,9,10,11,...|  1.0|
|[1.0,1.0,1.0,0.0,...|  0.0|
+--------------------+-----+
only showing top 20 rows
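
The `features` column above mixes two notations. Rows printed as `[1.0,1.0,...]` are dense vectors, while rows printed as `(17,[6,7,...],[...])` are sparse vectors that store the vector size, the indices of the non-zero entries, and their values. As far as I know, the assembler simply keeps whichever form is more compact for each row. The pure-Python sketch below expands a sparse triple into a full list; `to_dense` is a hypothetical helper, and the second value in the example row is made up because the notebook output truncates it.

```python
def to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a full list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# A row shaped like "(17,[10,13],[18.0, ...])"; the 2.0 is a placeholder value
row = to_dense(17, [10, 13], [18.0, 2.0])
print(len(row), row[10], row[13])  # 17 18.0 2.0
```

Both notations describe the same 17-element feature vector, so the algorithms below treat them identically.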



In [17]:
# Creating the min max scaler object (rescaling also takes care of any negative values that are present)
#Using range 0 to 1000

scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures",min=0,max=1000)
print("Features scaled to range: [%f, %f]" % (scaler.getMin(), scaler.getMax()))
# Computing summary statistics and generate MinMaxScalerModel
scalerModel = scaler.fit(output)

Features scaled to range: [0.000000, 1000.000000]


In [18]:
#Rescaling each feature to range [min, max].
scaled_data = scalerModel.transform(output)
scaled_data

DataFrame[features: vector, label: double, scaledFeatures: vector]
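
For intuition, min-max scaling maps each feature linearly from its observed range onto the target range. The sketch below is a plain-Python version of that formula; the column bounds of 12 and 36 are assumptions chosen for illustration, and `min_max_scale` is a hypothetical helper rather than a PySpark API.

```python
def min_max_scale(x, col_min, col_max, new_min=0.0, new_max=1.0):
    """Linearly map x from [col_min, col_max] onto [new_min, new_max]."""
    if col_max == col_min:
        # Constant column: nothing to stretch, so use the midpoint of the target range
        return 0.5 * (new_min + new_max)
    return (x - col_min) / (col_max - col_min) * (new_max - new_min) + new_min

# Hypothetical column whose smallest value is 12 and largest value is 36
print(min_max_scale(18, 12, 36, new_min=0, new_max=1000))  # 250.0
```

With those assumed bounds, a raw value of 18 lands at 250, a quarter of the way along the 0 to 1000 range.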

In [19]:
final_data = scaled_data.select('label','scaledFeatures')
# Renaming to default value
final_data = final_data.withColumnRenamed("scaledFeatures","features")
final_data.show()

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  1.0|(17,[6,7,9,10,11,...|
|  0.0|(17,[0,1,5,6,10,1...|
|  0.0|(17,[0,6,7,9,10,1...|
|  0.0|[1000.0,1000.0,10...|
|  0.0|[1000.0,1000.0,0....|
|  0.0|[1000.0,1000.0,0....|
|  0.0|(17,[0,3,4,5,8,10...|
|  0.0|(17,[1,4,6,7,8,9,...|
|  1.0|(17,[6,9,10,11,13...|
|  0.0|[1000.0,1000.0,10...|
|  0.0|[1000.0,0.0,0.0,1...|
|  0.0|[1000.0,1000.0,10...|
|  1.0|(17,[10,12,13,14]...|
|  0.0|[1000.0,1000.0,10...|
|  1.0|(17,[10,13],[250....|
|  0.0|(17,[0,1,2,4,6,7,...|
|  1.0|(17,[10,13,15],[1...|
|  0.0|[1000.0,1000.0,10...|
|  1.0|(17,[0,4,9,10,11,...|
|  0.0|(17,[0,1,2,4,6,7,...|
+-----+--------------------+
only showing top 20 rows



#### Splitting the data into training and test data

In [20]:
#Splitting the data into train and test
train, test = final_data.randomSplit([0.7,0.3])

In [21]:
train.count()

759

In [22]:
test.count()

295

##### Reading in dependencies from PySpark

In [23]:
from pyspark.ml.classification import *
from pyspark.ml.evaluation import *
from pyspark.sql.functions import *
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

In [24]:
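#Setting up our evaluation objects
#Note: rawPredictionCol='prediction' makes the AUC below use the hard 0/1 predictions; the default 'rawPrediction' column would rank rows by their continuous scores instead
binary_eval = BinaryClassificationEvaluator(rawPredictionCol='prediction')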
MC_eval = MulticlassClassificationEvaluator(metricName='accuracy')

### Implementing Logistic Regression
Logistic regression, also known as the "logit" model, estimates the probability (a number between 0 and 1) of an event occurring, having been given some previous data to “learn” from. It works with either binary or multinomial (more than 2 categories) targets and passes a linear combination of the features through the logistic (sigmoid) function to turn it into that probability.
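
As a rough illustration of how a fitted binary logistic regression turns features into a probability, here is a minimal pure-Python sketch. The coefficients and intercept are made up and are not the values learned later in this notebook; `predict_probability` is a hypothetical helper.

```python
import math

def sigmoid(z):
    """Logistic function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(features, coefficients, intercept):
    """Probability of the positive class for one row of features."""
    z = intercept + sum(w * x for w, x in zip(coefficients, features))
    return sigmoid(z)

# Made-up two-feature model: the linear score is 2.0 + (-2.0 * 1.0) + (0.5 * 0.0) = 0
p = predict_probability([1.0, 0.0], coefficients=[-2.0, 0.5], intercept=2.0)
print(p)  # 0.5
```

A linear score of zero sits exactly on the decision boundary, which is why the probability comes out at 0.5.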

In [25]:
#Instantiating the logistic regression estimator
classifier = LogisticRegression()

In [26]:
fitModel = classifier.fit(train)

In [27]:
# Evaluation method for binary classification problem
predictionAndLabels = fitModel.transform(test)
auc = binary_eval.evaluate(predictionAndLabels)
print("AUC:",auc)

AUC: 1.0


#### Cross validation

In [28]:
#Instantiating the classifier
classifier = LogisticRegression()

In [29]:
#Setting up parameter grid for the cross validator
paramGrid = (ParamGridBuilder().addGrid(classifier.maxIter, [10,15,20]).build())

#Setting up cross validator
crossval = CrossValidator(estimator=classifier,
                         estimatorParamMaps=paramGrid,
                         evaluator=MC_eval,
                         numFolds=2)
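
To show how much work the cross validator above does, the sketch below expands a parameter grid and splits row indices into folds in plain Python. It assumes round-robin fold assignment for simplicity, whereas PySpark assigns rows to folds randomly; `expand_grid` and `k_fold_indices` are hypothetical helpers.

```python
from itertools import product

def expand_grid(grid):
    """Cartesian product of a {param_name: [values]} dict."""
    names = list(grid)
    return [dict(zip(names, combo)) for combo in product(*(grid[n] for n in names))]

def k_fold_indices(n_rows, k):
    """Yield (train_idx, validation_idx) pairs; every row is validated exactly once."""
    for fold in range(k):
        validation = [i for i in range(n_rows) if i % k == fold]
        train = [i for i in range(n_rows) if i % k != fold]
        yield train, validation

param_maps = expand_grid({"maxIter": [10, 15, 20]})
folds = list(k_fold_indices(n_rows=10, k=2))
print(len(param_maps) * len(folds))  # 6
```

With three candidate values and two folds, six models are fitted during the search, and the best setting is then refitted on the full training set.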

In [30]:
#Fitting the model
fitModel = crossval.fit(train)

In [31]:
#Collecting the best model
BestModel = fitModel.bestModel
print("Intercept:" + str(BestModel.interceptVector))
print("Coefficients" + str(BestModel.coefficientMatrix))

Intercept:[51.71794521838649]
CoefficientsDenseMatrix([[-0.01162562, -0.01066733, -0.01043395, -0.0117568 , -0.010962  ,
              -0.01144383, -0.01162115, -0.01056535, -0.01129973, -0.01105928,
              -0.00050193, -0.03290132, -0.00015347, -0.00196149, -0.00026341,
               0.00017106, -0.00143897]])


In [32]:
#We can extract the best model from this run like below
LR_BestModel = BestModel
print("Best model:", LR_BestModel)
#Generating the prediction
predictions = fitModel.transform(test)

#Calculating the accuracy
accuracy = (MC_eval.evaluate(predictions))*100
print("Accuracy:",accuracy)

Best model: LogisticRegressionModel: uid=LogisticRegression_9fbcd87d07a4, numClasses=2, numFeatures=17
Accuracy: 100.0


#### Classification Diagnostics

In [33]:
#Loading the summary
trainingSummary = LR_BestModel.summary

In [34]:
#Generating summary statistics for the labels and predictions on the training data
trainingSummary.predictions.describe().show()

+-------+------------------+------------------+
|summary|             label|        prediction|
+-------+------------------+------------------+
|  count|               759|               759|
|   mean| 0.310935441370224| 0.310935441370224|
| stddev|0.4631816603062981|0.4631816603062981|
|    min|               0.0|               0.0|
|    max|               1.0|               1.0|
+-------+------------------+------------------+



In [35]:
#Obtaining the objective per iteration
objectiveHistory = trainingSummary.objectiveHistory
print("Objective History at each iteration")
for objective in objectiveHistory:
    print(objective)

Objective History at each iteration
0.6198470847843046
0.51854176071745
0.45458359771239726
0.286090535418719
0.2581878313998878
0.22019114416073587
0.19985339704802196
0.1777687572970849
0.14257636964308962
0.08415978728826765
0.05660084996068759
0.02703699402904461
0.017672459276642388
0.011628025824695512
0.006500667503663923
0.005577037685455732
0.002711847204264477
0.0015241918832879177
0.0007732832507238134
0.00039497815976479193
0.0001977556235196983


In [36]:
#For multiclass, we can inspect metrics on a per-label basis
print(" ")
print("False positive rate by label:")
for i, rate in enumerate(trainingSummary.falsePositiveRateByLabel):
    print("label %d: %s" % (i, rate))

print(" ")
print("True positive rate by label:")
for i, rate in enumerate(trainingSummary.truePositiveRateByLabel):
    print("label %d: %s" % (i, rate))

print(" ")
print("Precision by label:")
for i, prec in enumerate(trainingSummary.precisionByLabel):
    print("label %d: %s" % (i, prec))

print(" ")
print("Recall by label:")
for i, rec in enumerate(trainingSummary.recallByLabel):
    print("label %d: %s" % (i, rec))

print(" ")
print("F-measure by label:")
for i, f in enumerate(trainingSummary.fMeasureByLabel()):
    print("label %d: %s" % (i, f))

 
False positive rate by label:
label 0: 0.0
label 1: 0.0
 
True positive rate by label:
label 0: 1.0
label 1: 1.0
 
Precision by label:
label 0: 1.0
label 1: 1.0
 
Recall by label:
label 0: 1.0
label 1: 1.0
 
F-measure by label:
label 0: 1.0
label 1: 1.0
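
The per-label numbers above are all derived from counts of true and false positives and negatives. The following pure-Python sketch shows those definitions on a tiny made-up example with a few mistakes in it, so the values are more informative than the perfect scores printed above; `per_label_metrics` is a hypothetical helper.

```python
def per_label_metrics(labels, predictions, label):
    """Precision, recall, false positive rate and F1 for one label."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == label and p == label)
    fp = sum(1 for y, p in pairs if y != label and p == label)
    fn = sum(1 for y, p in pairs if y == label and p != label)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "fpr": fpr, "f1": f1}

y_true = [0, 0, 0, 1, 1, 1]  # made-up labels
y_pred = [0, 0, 1, 1, 1, 0]  # made-up predictions with two mistakes
m = per_label_metrics(y_true, y_pred, label=1)
print(m)
```

For label 1 this example has two true positives, one false positive and one false negative, so precision, recall and F1 all come out at two thirds.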


In [37]:
#Collecting the overall (weighted) training metrics and printing them (includes accuracy)
accuracy = trainingSummary.accuracy
falsePositiveRate = trainingSummary.weightedFalsePositiveRate
truePositiveRate = trainingSummary.weightedTruePositiveRate
fMeasure = trainingSummary.weightedFMeasure()
precision = trainingSummary.weightedPrecision
recall = trainingSummary.weightedRecall
print(" ")
print("Accuracy: %s\nFPR: %s\nTPR: %s\nF-measure: %s\nPrecision: %s\nRecall: %s"
      % (accuracy, falsePositiveRate, truePositiveRate, fMeasure, precision, recall))

 
Accuracy: 1.0
FPR: 0.0
TPR: 1.0
F-measure: 1.0
Precision: 1.0
Recall: 1.0


### Multilayer Perceptron Classifier
A multilayer perceptron (MLP) is a class of feedforward artificial neural network. It consists of at least three layers of nodes: an input layer, a hidden layer and an output layer. Except for the input nodes, each node is a neuron that uses a nonlinear activation function. MLP utilizes a supervised learning technique called backpropagation for training. Its multiple layers and non-linear activation distinguish MLP from a linear perceptron. It can distinguish data that is not linearly separable.

#### Building the MLP Classifier

In [45]:
#Counting the features from the first row (first() avoids collecting every row to the driver)
features_count = len(final_data.select('features').first()[0])

#Counting number of classes
class_count = final_data.select(countDistinct("label")).collect()
classes = class_count[0][0]

In [46]:
#Creating a list of layers
#First number in this list is the input layer which has to be equal to the number of features in your vector
#Second number is the first hidden layer
#Third number is the second hidden layer 
#Fourth number is the output layer which has to be equal to your class size
layers = [features_count, features_count+1, features_count, classes]

In [47]:
#Instantiating the classifier
classifier = MultilayerPerceptronClassifier(maxIter=100, layers=layers, blockSize=128, seed=1234)

In [48]:
#Fitting the model
fitModel = classifier.fit(train)

In [49]:
#Getting the model weights
print("Weights:", fitModel.weights.size)

Weights: 683
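
The figure of 683 can be checked by hand: each pair of consecutive layers needs one weight per input-output connection plus one bias per output node. A minimal sketch, assuming the layer sizes used in this notebook (17 features and 2 classes); `mlp_weight_count` is a hypothetical helper.

```python
def mlp_weight_count(layers):
    """Each layer transition needs (inputs + 1 bias) * outputs parameters."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layers, layers[1:]))

layers = [17, 18, 17, 2]  # [features_count, features_count+1, features_count, classes]
print(mlp_weight_count(layers))  # 683
```

The three transitions contribute 324, 323 and 36 parameters respectively.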


In [50]:
#Generating the predictions on test dataframe
predictions = fitModel.transform(test)

In [51]:
accuracy = (MC_eval.evaluate(predictions))*100
print("Accuracy", accuracy)

Accuracy 90.84745762711864


### Naive Bayes
The Naive Bayes Classifier is a collection of classification algorithms based on Bayes Theorem. It is not a single algorithm but a family of algorithms that all share a common principle, that every feature being classified is independent of the value of any other feature. 

**Hyper Parameters:**

 - **smoothing** = It is problematic when a frequency-based probability is zero, because it will wipe out all the information in the other probabilities. A common solution is Laplace (additive) smoothing, a technique for smoothing categorical data. In PySpark, this number needs to be >= 0; the default is 1.0.
 - **thresholds** = Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0, excepting that at most one value may be 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold. The default value is none. 
 - **weightCol** = If you have a weight column you would enter the name of the column here. If this is not set or empty, we treat all instance weights as 1.0.
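
The first two hyperparameters can be illustrated with a few lines of plain Python. This is a simplified sketch of additive smoothing and of the `p/t` threshold rule described above, not PySpark's internal implementation; both helper names are hypothetical and the thresholds are assumed to be strictly positive.

```python
def smoothed_probability(count, total, n_categories, smoothing=1.0):
    """Additive (Laplace) smoothing of a frequency-based probability."""
    return (count + smoothing) / (total + smoothing * n_categories)

def predict_with_thresholds(probabilities, thresholds):
    """Return the index of the class with the largest p / t."""
    scaled = [p / t for p, t in zip(probabilities, thresholds)]
    return scaled.index(max(scaled))

# A category never seen in 10 observations, with 2 possible categories
print(smoothed_probability(0, 10, n_categories=2, smoothing=0.0))  # 0.0
print(smoothed_probability(0, 10, n_categories=2, smoothing=1.0))  # 1/12, no longer zero

# Class 0 is more probable, but its higher threshold hands the prediction to class 1
print(predict_with_thresholds([0.6, 0.4], [0.8, 0.2]))  # 1
```

Note that the grid in the next cell includes a smoothing value of 0.0, which is exactly the case where a zero count stays at zero.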

#### Building Naive Bayes Classifier

In [52]:
#Instantiating the classifier
classifier = NaiveBayes()

In [53]:
#Adding the parameters
paramGrid = (ParamGridBuilder() \
            .addGrid(classifier.smoothing, [0.0, 0.2, 0.4, 0.6]) \
            .build())
#Cross validating 
crossval = CrossValidator(estimator=classifier,
                         estimatorParamMaps=paramGrid,
                         evaluator=MulticlassClassificationEvaluator(),
                         numFolds=2)

In [54]:
#Fitting the model
fitModel = crossval.fit(train)

In [55]:
predictions = fitModel.transform(test)
accuracy = (MC_eval.evaluate(predictions))*100
print("Accuracy:", accuracy)

Accuracy: 86.77966101694915


### Linear Support Vector Machine
Linear SVMs are based on the idea of finding a hyperplane that best divides a dataset into two classes, which is why you can only use it for binary classification. Support vectors are the data points nearest to the hyperplane, the points of a data set that, if removed, would alter the position of the dividing hyperplane. Because of this, they can be considered the critical elements of a data set. Intuitively, the further from the hyperplane our data points lie, the more confident we are that they have been correctly classified. We therefore want our data points to be as far away from the hyperplane as possible, while still being on the correct side of it. So when new testing data is added, whatever side of the hyperplane it lands will decide the class that we assign to it.
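
As a sketch of the idea, a fitted linear SVM classifies a point by checking which side of the hyperplane it falls on, that is, the sign of `w·x + b`. The hyperplane below is made up for illustration, and the helper names are hypothetical.

```python
def svm_margin(features, coefficients, intercept):
    """Signed score: positive on one side of the hyperplane, negative on the other."""
    return intercept + sum(w * x for w, x in zip(coefficients, features))

def svm_predict(features, coefficients, intercept):
    """Assign class 1.0 when the score is positive, otherwise class 0.0."""
    return 1.0 if svm_margin(features, coefficients, intercept) > 0 else 0.0

w, b = [1.0, -1.0], -0.5  # made-up hyperplane
print(svm_predict([2.0, 0.5], w, b))  # 1.0 (score = 1.0)
print(svm_predict([0.0, 1.0], w, b))  # 0.0 (score = -1.5)
```

The larger the absolute score, the further the point lies from the hyperplane and the more confident the classification.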

In [56]:
#Counting the number of classes and printing a warning if there are more than 2.
class_count = final_data.select(countDistinct("label")).collect()
classes = class_count[0][0]
if classes > 2:
    print("LinearSVC cannot be used because PySpark currently only accepts binary classification data for this algorithm")

In [57]:
#Adding parameters
classifier = LinearSVC()
paramGrid = (ParamGridBuilder() \
             .addGrid(classifier.maxIter, [10, 15]) \
             .addGrid(classifier.regParam, [0.1, 0.01]) \
             .build())

#Cross Validator requires all of the following parameters:
crossval = CrossValidator(estimator=classifier,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          numFolds=2)

In [58]:
#Fitting the model and getting the best model
fitModel = crossval.fit(train)

BestModel = fitModel.bestModel

In [59]:
BestModel.intercept

2.118898883442666

In [60]:
BestModel.coefficients

DenseVector([-0.0004, -0.0007, -0.0002, -0.0005, -0.0006, -0.0007, -0.0006, -0.0006, -0.0009, -0.0004, 0.0004, -0.0017, 0.0001, -0.0001, -0.0002, 0.0001, -0.0002])

In [62]:
#fit model contains the best model
predictions = fitModel.transform(test)
accuracy = (MC_eval.evaluate(predictions))*100
print("Accuracy: ",accuracy)

Accuracy:  97.96610169491525


### Decision Tree Classifier
Decision tree classifiers are a supervised learning method that classifies a variable by learning a set of if-then-else decision rules from historical data. The deeper the tree, the more complex the decision rules and the more closely the model fits the training data. 

A decision tree builds classification or regression models in the form of a tree structure. It breaks a data set down into smaller and smaller subsets while an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision node has two or more branches. A leaf node represents a classification or decision. The topmost decision node in a tree, which corresponds to the best predictor, is called the root node. Decision trees can handle both categorical and numerical data.

#### Common Hyper Parameters

 - **maxBins** = Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature.
 - **maxDepth** = The maximum depth of the tree. In PySpark the default is 5 and the value must be between 0 and 30, so unlike scikit-learn's `max_depth=None` the tree does not keep expanding until every leaf is pure. A pure leaf is one where all of the data on the leaf comes from the same class.
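
Leaf purity is usually measured with an impurity score, and Gini impurity is the default for PySpark's decision tree classifier. Here is a minimal pure-Python sketch of that score, with `gini` as a hypothetical helper and made-up class counts.

```python
def gini(class_counts):
    """Gini impurity: 0.0 for a pure node, larger for a more mixed node."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

print(gini([10, 0]))  # 0.0, a pure leaf
print(gini([5, 5]))   # 0.5, the most mixed a two-class node can be
```

At each decision node the tree looks for the split that reduces this impurity the most.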

In [63]:
#Instantiating the classifier
classifier = DecisionTreeClassifier()

In [64]:
paramGrid = (ParamGridBuilder()
             #.addGrid(classifier.maxDepth, [2, 5, 10, 20, 30])
             .addGrid(classifier.maxBins, [10, 20, 40, 80, 100])
             .build())

#Generating cross validator
crossval = CrossValidator(estimator=classifier,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          numFolds=2)

In [65]:
#Fitting the model
fitModel = crossval.fit(train)

In [66]:
#Collecting and printing feature importances
BestModel = fitModel.bestModel
featureImportances = BestModel.featureImportances.toArray()
print("Feature Importances: ",featureImportances)

Feature Importances:  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
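
The importance array is easier to read when paired with the feature names. The sketch below does this in plain Python with the values printed above; the names follow the order of `features_list` (numeric inputs first, then the indexed string columns). In the notebook itself, `zip(features_list, featureImportances)` would give the same pairing.

```python
# Feature names in assembly order: numeric inputs, then indexed string columns
feature_names = (
    ["A%d" % i for i in range(1, 11)]
    + ["Age_Mons", "Qchat-10-Score"]
    + ["Sex_num", "Ethnicity_num", "Jaundice_num",
       "Family_mem_with_ASD_num", "Who completed the test_num"]
)

# Importances copied from the output above: all zero except index 11
importances = [0.0] * 11 + [1.0] + [0.0] * 5

ranked = sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)
print(ranked[0])  # ('Qchat-10-Score', 1.0)
```

The tree relies on `Qchat-10-Score` alone, which suggests the class label may be derived directly from that score. If so, the perfect accuracy reported below says more about the dataset than about the model.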


In [68]:
predictions = fitModel.transform(test)
accuracy = (MC_eval.evaluate(predictions))*100
print("Accuracy: ",accuracy)

Accuracy:  100.0


### Random Forest Classifier
A random forest trains many decision trees, each on a different random subset of the training data. For example, given a training set with 6 classes, a random forest may create three decision trees, each taking one subset as input. Finally, it predicts based on the majority of votes from the individual trees. This works well because a single decision tree may be prone to noise, but the aggregate of many decision trees reduces the effect of noise, giving more accurate results. The subsets used by different trees may overlap. 
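
The voting step can be sketched in plain Python as follows. This illustrates the majority-vote idea only and is not necessarily how PySpark combines its trees internally; `majority_vote` is a hypothetical helper and the tree outputs are made up.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Three made-up trees: two vote for class 1.0, one votes for class 0.0
print(majority_vote([1.0, 0.0, 1.0]))  # 1.0
```

A single noisy tree is outvoted as long as most of the other trees agree.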

#### Common Hyper Parameters

 - **maxBins** = Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature.
 - **maxDepth** = The maxDepth parameter specifies the maximum depth of each tree. In PySpark the default is 5 and the value must be between 0 and 30, so trees do not keep expanding until every leaf is pure. A pure leaf is one where all of the data on the leaf comes from the same class.

In [69]:
classifier = RandomForestClassifier()

In [70]:
#Adding parameter grid
paramGrid = (ParamGridBuilder()
             .addGrid(classifier.maxDepth, [2, 5, 10])
             #.addGrid(classifier.maxBins, [5, 10, 20])
             #.addGrid(classifier.numTrees, [5, 20, 50])
             .build())

#Cross validating
crossval = CrossValidator(estimator=classifier,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          numFolds=2)

In [71]:
#Fitting the model
fitModel = crossval.fit(train)

In [72]:
#Retrieving the best model 
BestModel = fitModel.bestModel
featureImportances = BestModel.featureImportances.toArray()
print("Feature Importances: ",featureImportances)

Feature Importances:  [0.         0.         0.         0.00948622 0.06919361 0.10492895
 0.1366632  0.02066992 0.10951917 0.         0.         0.54953893
 0.         0.         0.         0.         0.        ]


In [73]:
predictions = fitModel.transform(test)

In [74]:
accuracy = (MC_eval.evaluate(predictions))*100
print(" ")
print("Accuracy: ",accuracy)

 
Accuracy:  100.0
