##### The PySpark MLlib API provides a DecisionTreeClassifier model for classification with the decision tree method.

The decision tree is a well-known and powerful supervised machine learning algorithm that can be used for classification and regression tasks. It learns top-down, building a tree-like structure that extracts rules from the training data. Each branch of the tree corresponds to the outcome of a decision on a feature value.
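
To make the idea of rule extraction concrete, the following minimal sketch picks a single split threshold on one numeric feature by minimizing the weighted Gini impurity of the two resulting groups. It is plain Python written for illustration, not Spark's implementation; the helper names `gini` and `best_threshold` and the toy data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best rule 'value <= threshold'."""
    best = (None, float("inf"))
    n = len(values)
    for t in sorted(set(values))[:-1]:  # the largest value cannot split the data
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[1]:
            best = (t, score)
    return best

# Toy data (hypothetical): ages and income classes
ages = [22, 25, 30, 41, 47, 52]
income = ["<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K"]

threshold, impurity = best_threshold(ages, income)
print("Rule: age <=", threshold, "| weighted Gini:", impurity)
```

A full tree repeats this search on each resulting group, and over every feature, until a stopping condition such as a maximum depth is reached.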

In [None]:
# Import necessary packages
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
from pyspark.sql.types import DoubleType
from pyspark.sql import functions as func
from pyspark.mllib.util import MLUtils
from pyspark.sql import SQLContext
from pyspark import SparkContext

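# Create the Spark context.
# getOrCreate() expects a SparkConf, not a string, so set the app name through SparkConf.
from pyspark import SparkConf
sc = SparkContext.getOrCreate(SparkConf().setAppName("ch6"))
sqlCtx = SQLContext(sc)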

# Model 1
## Training Dataset

### Preparing the data

We use the ch6_training dataset to perform classification; it can be loaded from the DSPR_Data_Sets folder. The code below loads the dataset.

In [None]:
ch6_training = sqlCtx.read.option('header','true').options(delimiter=",").csv('DSPR_Data_Sets/adult_ch6_training')
print("column types:", ch6_training.dtypes)
print("Rows :", ch6_training.count())
# Use withColumn() to convert the data type of a DataFrame column.
# The first argument is the name of the new (or replaced) column; the second is
# the source column with cast() applied to the target DataType.
ch6_training = ch6_training.withColumn("Cap_Gains_Losses_Double",func.col("Cap_Gains_Losses").cast(DoubleType()))
print("column types:", ch6_training.dtypes)

# Show Data Frame
ch6_training.show(10)

In [None]:
# Index labels, adding metadata to the Income column.
# Fit on whole dataset to include all labels in index.
incomeIndexer = StringIndexer(inputCol="Income", outputCol="indexedLabel")

# Run the indexer.
incomeIndexer_fit = incomeIndexer.fit(ch6_training)

# Transformer : A Transformer is an algorithm which can transform one DataFrame into another DataFrame .
# E.g., an ML model is a Transformer which transforms DataFrame with features into a DataFrame with predictions.
dataframe_training = incomeIndexer_fit.transform(ch6_training)
print(dataframe_training.dtypes)

# Show Data Frame
dataframe_training.show(10)

In [None]:
# Index labels, adding metadata to the Marital status column.
# Fit on whole dataset to include all labels in index.
maritalIndexer = StringIndexer(inputCol="Marital status", outputCol="Marital feature")

# Run the indexer.
maritalIndexer_fit = maritalIndexer.fit(dataframe_training)

# Transformer : A Transformer is an algorithm which can transform one DataFrame into another DataFrame .
# E.g., an ML model is a Transformer which transforms DataFrame with features into a DataFrame with predictions.
dataframe_training = maritalIndexer_fit.transform(dataframe_training)
print(dataframe_training.dtypes)

# Show Data Frame
dataframe_training.show(5)

In [None]:
# Assemble the feature columns into a single vector column.
# VectorAssembler is a Transformer, so no fit step is needed.
featureAssembler = VectorAssembler(inputCols = ['Cap_Gains_Losses_Double', 'Marital feature'] , outputCol='features')
dataframe_training = featureAssembler.transform(dataframe_training)
dataframe_training.show(5)

In [None]:
dataframe_training_output = dataframe_training.select(['indexedLabel', 'features'])
dataframe_training_output.show(5)

## Test Dataset
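
The test rows must be encoded with the same label and category indices that the model saw during training. The sketch below imitates frequency-ordered string indexing in plain Python to show how fitting an indexer separately on each split can yield different mappings. The helper `fit_index` and the toy values are hypothetical; it mirrors the default descending-frequency ordering of `StringIndexer`, and the alphabetical tie-break is an assumption of this sketch. It is not Spark code.

```python
from collections import Counter

def fit_index(values):
    """Map each distinct string to an index, most frequent first (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}

# Toy splits (hypothetical) with different category frequencies
train = ["Married", "Married", "Married", "Never-married", "Never-married", "Divorced"]
test = ["Never-married", "Never-married", "Divorced", "Divorced", "Divorced", "Married"]

train_map = fit_index(train)
test_map = fit_index(test)

print("Fitted on training:", train_map)
print("Fitted on test:    ", test_map)
# The same category can receive a different index in each mapping
print("Mappings agree:", train_map == test_map)
```

Applying the mapping fitted on the training split to the test split avoids this mismatch.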


In [None]:
ch6_test = sqlCtx.read.option('header','true').options(delimiter=",").csv('DSPR_Data_Sets/adult_ch6_test')
ch6_test = ch6_test.withColumn("Cap_Gains_Losses_Double",func.col("Cap_Gains_Losses").cast(DoubleType()))

In [None]:
# Reuse the indexer fitted on the training data so that the label indices
# in the test set match those the model was trained on.
dataframe_test = incomeIndexer_fit.transform(ch6_test)

In [None]:
# Reuse the marital-status indexer fitted on the training data.
# With the default handleInvalid="error", a category unseen in training raises an error.
dataframe_test = maritalIndexer_fit.transform(dataframe_test)

In [None]:
featureAssembler = VectorAssembler(inputCols = ['Cap_Gains_Losses_Double', 'Marital feature'] , outputCol='features')
dataframe_test = featureAssembler.transform(dataframe_test)

In [None]:
dataframe_test = dataframe_test.select(['indexedLabel', 'features'])
dataframe_test.show()

# Prediction and Accuracy Check

In [None]:
# Create DecisionTreeClassifier
dtc = DecisionTreeClassifier(featuresCol="features", labelCol="indexedLabel")

# Fit dataframe to the DecisionTreeClassifier
dtc = dtc.fit(dataframe_training)

# Make predictions.
pred = dtc.transform(dataframe_test)
pred.show(10)

# Classification model evaluation
While there are many different types of classification algorithms, the evaluation of classification models all share similar principles. In a supervised classification problem, there exists a true output and a model-generated predicted output for each data point. For this reason, the results for each data point can be assigned to one of four categories:

* True Positive (TP) - label is positive and prediction is also positive
* True Negative (TN) - label is negative and prediction is also negative
* False Positive (FP) - label is negative but prediction is positive
* False Negative (FN) - label is positive but prediction is negative

source : https://spark.apache.org/docs/2.2.0/mllib-evaluation-metrics.html#:~:text=the%20F%2Dmeasure.-,Binary%20classification,-Binary%20classifiers%20are
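
As a small illustration of the four categories, the sketch below tallies them from two plain Python lists. The function name `confusion_counts`, the choice of 1.0 as the positive class, and the sample values are assumptions made for this example.

```python
def confusion_counts(labels, predictions, positive=1.0):
    """Return (tp, tn, fp, fn) for paired true labels and predictions."""
    tp = tn = fp = fn = 0
    for y, y_hat in zip(labels, predictions):
        if y == positive and y_hat == positive:
            tp += 1  # label positive, prediction positive
        elif y != positive and y_hat != positive:
            tn += 1  # label negative, prediction negative
        elif y != positive and y_hat == positive:
            fp += 1  # label negative, prediction positive
        else:
            fn += 1  # label positive, prediction negative
    return tp, tn, fp, fn

# Sample values (hypothetical)
labels      = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0]
predictions = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0]

tp, tn, fp, fn = confusion_counts(labels, predictions)
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)
```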

#### F1 score
The F1 score is the harmonic mean of precision and recall: F1 = 2 * (precision * recall) / (precision + recall). It ranges from 0 to 1, with 0 the lowest and 1 the highest, and summarizes a model's performance in a single number that balances the two.

#### Recall
Recall is the fraction of actual positives that the model found: TP / (TP + FN).

#### Precision
Precision is the fraction of predicted positives that are actually positive: TP / (TP + FP).
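
The definitions above reduce to simple ratios of the four confusion counts. The following sketch computes them in plain Python and returns 0.0 when a denominator is empty, which is a convention chosen for this example; the function name `classification_metrics` and the example counts are hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Example counts (hypothetical)
metrics = classification_metrics(tp=30, tn=50, fp=10, fn=10)
for name, value in metrics.items():
    print(name, "=", value)
```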

In [None]:
tp = pred.filter((pred.indexedLabel == 1) & (pred.prediction == 1)).count()
tn = pred.filter((pred.indexedLabel == 0) & (pred.prediction == 0)).count()
fp = pred.filter((pred.indexedLabel == 0) & (pred.prediction == 1)).count()
fn = pred.filter((pred.indexedLabel == 1) & (pred.prediction == 0)).count()

print("True Positives:", tp)
print("True Negatives:", tn)
print("False Positives:", fp)
print("False Negatives:", fn)

a = ((tp + tn)/pred.count())
r = float(tp) / (tp + fn)
p = float(tp) / (tp + fp)
f1 = 2 * ((p * r)/(p + r))

print("Accuracy:", a)
print("Recall:", r)
print("Precision:", p)
print("F1 score:", f1)

# Model 2
## Training Dataset

### Preparing the data

We use the ch3_training dataset to perform classification; it can be loaded from the DSPR_Data_Sets folder. The code below loads the dataset and splits it 70/30 into training and test sets.

In [None]:
from pyspark.sql.functions import monotonically_increasing_id

ch3_read = sqlCtx.read.option('header','true').options(delimiter=",").csv('DSPR_Data_Sets/adult_ch3_training')

ch3 = ch3_read.withColumn("age",func.col("age").cast(DoubleType()))
ch3 = ch3.select("*").withColumn("id", monotonically_increasing_id())

ch3_select = ch3.select(["id", "age", "marital-status", "income"])

ch3_training, ch3_test = ch3_select.randomSplit([0.7, 0.3])

print("Training count :", ch3_training.count())
print("Test count :", ch3_test.count())

ch3_training.show(5)
ch3_test.show(5)

In [None]:
# Index labels, adding metadata to the Income column.
# Fit on whole dataset to include all labels in index.
incomeIndexer = StringIndexer(inputCol="income", outputCol="indexedLabel")

incomeIndexer_fit = incomeIndexer.fit(ch3_training)

dataframe_training = incomeIndexer_fit.transform(ch3_training)

dataframe_training.show(5)

In [None]:
# Index labels, adding metadata to the Marital status column.
# Fit on whole dataset to include all labels in index.
maritalIndexer = StringIndexer(inputCol="marital-status", outputCol="Marital feature")

# Run the indexer.
maritalIndexer_fit = maritalIndexer.fit(dataframe_training)

# Transformer : A Transformer is an algorithm which can transform one DataFrame into another DataFrame .
# E.g., an ML model is a Transformer which transforms DataFrame with features into a DataFrame with predictions.
dataframe_training = maritalIndexer_fit.transform(dataframe_training)
print(dataframe_training.dtypes)

# Show Data Frame
dataframe_training.show(5)

In [None]:
# Assemble the feature columns into a single vector column.
# VectorAssembler is a Transformer, so no fit step is needed.
featureAssembler = VectorAssembler(inputCols = ['age', 'Marital feature'] , outputCol='features')
dataframe_training = featureAssembler.transform(dataframe_training)
dataframe_training.show(5)

In [None]:
dataframe_training_output = dataframe_training.select(['indexedLabel', 'features'])
dataframe_training_output.show(5)

## Test Dataset

In [None]:
# Reuse the indexer fitted on the training split so that label indices stay consistent.
dataframe_test = incomeIndexer_fit.transform(ch3_test)

dataframe_test.show(5)

In [None]:
# Index the marital-status column with the indexer fitted on the training split,
# so each category maps to the same index in the training and test data.
# With the default handleInvalid="error", a category unseen in training raises an error.
dataframe_test = maritalIndexer_fit.transform(dataframe_test)
print(dataframe_test.dtypes)

# Show Data Frame
dataframe_test.show(5)

In [None]:
# Assemble the feature columns into a single vector column.
# VectorAssembler is a Transformer, so no fit step is needed.
featureAssembler = VectorAssembler(inputCols = ['age', 'Marital feature'] , outputCol='features')
dataframe_test = featureAssembler.transform(dataframe_test)
dataframe_test.show(5)

In [None]:
dataframe_test_output = dataframe_test.select(['indexedLabel', 'features'])
dataframe_test_output.show(5)

In [None]:
# Create DecisionTreeClassifier
dtc = DecisionTreeClassifier(featuresCol="features", labelCol="indexedLabel")

# Fit dataframe to the DecisionTreeClassifier
dtc = dtc.fit(dataframe_training)

# Make predictions.
pred = dtc.transform(dataframe_test)
pred.show(10)

## Search for rawPrediction

# Classification model evaluation


In [None]:
tp = pred.filter((pred.indexedLabel == 1) & (pred.prediction == 1)).count()
tn = pred.filter((pred.indexedLabel == 0) & (pred.prediction == 0)).count()
fp = pred.filter((pred.indexedLabel == 0) & (pred.prediction == 1)).count()
fn = pred.filter((pred.indexedLabel == 1) & (pred.prediction == 0)).count()

print("True Positives:", tp)
print("True Negatives:", tn)
print("False Positives:", fp)
print("False Negatives:", fn)

a = ((tp + tn)/pred.count()) # a for accuracy
r = float(tp) / (tp + fn) # r for recall
p = float(tp) / (tp + fp) # p for precision
f1 = 2 * ((p * r)/(p + r)) # f1 for F1 score

print("Accuracy:", a)
print("Recall:", r)
print("Precision:", p)
print("F1 score:", f1)