## DS/CMPSC 410 Spring 2021
## Instructor: Professor John Yen
## TAs: Rupesh Prajapati and Dongkuan Xu
## Lab 8 Decision Tree Learning Using MLlib and Visualization

## The goals of this lab are for you to be able to
- Understand the function of the different steps/stages involved in a Spark ML pipeline
- Construct a decision tree using the Spark ML machine learning module
- Generate a visualization of decision trees
- Compare and evaluate decision tree models created using different hyperparameters (e.g., maximum tree depth)

## The data set used in this lab is a Breast Cancer diagnosis dataset.

## Submit the following items for Lab 8 (DT)
- Completed Jupyter Notebook of Lab 8 (in HTML or PDF format)
- A Word or PDF file that includes the two decision tree visualizations and the answers to Exercise 4.

## Total Number of Exercises: 4
- Exercise 1: 5 points
- Exercise 2: 5 points
- Exercise 3: 20 points  
- Exercise 4: 30 points (Word or PDF file)
## Total Points: 60 points

# Due: midnight, March 26 (Friday), 2021

## Load and set up the Python files for decision tree visualizations
1. Create a "Lab8DT" directory in the work directory of your ICDS-ROAR home directory.
2. If you have not done so, copy or upload this file to the "Lab8DT" directory.
3. Create a subdirectory under "Lab8DT" called "decision_tree_plot" (name the directory EXACTLY this way).
4. Upload the following three files in Module 8 from Canvas to the decision_tree_plot directory:
- decision_tree_parser.py
- decision_tree_plot.py
- tree_template.jinja2
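
Before importing from the subdirectory, it can help to confirm that the layout described in the steps above is in place. The following is a minimal sketch; the helper name `missing_plot_files` is hypothetical and not part of the lab files, and the template is matched by its `tree_template.` prefix so the check does not depend on the exact extension.

```python
from pathlib import Path

def missing_plot_files(lab_dir):
    """Return the expected files that are absent from lab_dir/decision_tree_plot."""
    plot_dir = Path(lab_dir) / "decision_tree_plot"
    # An absent directory is treated as an empty one, so everything is reported missing.
    existing = {p.name for p in plot_dir.iterdir()} if plot_dir.is_dir() else set()
    missing = [name
               for name in ("decision_tree_parser.py", "decision_tree_plot.py")
               if name not in existing]
    if not any(name.startswith("tree_template.") for name in existing):
        missing.append("tree_template.* (template file)")
    return missing

# "." assumes the notebook is running from inside the Lab8DT directory.
print(missing_plot_files("."))
```

An empty list means all three files were found; any names printed are still missing.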

### Follow the instructions below and execute the PySpark code cell by cell. Make modifications as required.

In [None]:
import pyspark
import csv

## Notice that we import SparkSession from the PySpark SQL module because Spark ML works with DataFrames, which are created through a SparkSession.
## Notice also the classes imported from ML and from three submodules of ML: classification, feature, and evaluation.

In [None]:
from pyspark import SparkContext
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import *
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler, IndexToString
from pyspark.ml.evaluation import MulticlassClassificationEvaluator, BinaryClassificationEvaluator

## The following two lines import relevant functions from the two python files you uploaded into the decision_tree_plot subdirectory.

In [None]:
from decision_tree_plot.decision_tree_parser import decision_tree_parse
from decision_tree_plot.decision_tree_plot import plot_trees

## This lab runs Spark in the local mode.
## Notice we are creating a SparkSession, not a SparkContext, when we use ML pipeline.
## The "getOrCreate()" method means we can re-evaluate this without a need to "stop the current SparkSession" first (unlike SparkContext).

In [None]:
ss=SparkSession.builder.master("local").appName("lab 8 DT").getOrCreate()

## As we have seen in Lab 4, SparkSession offers a way to read a CSV/text file with the capability to interpret the first row as being the header and infer the type of different columns based on their values.
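
To make the two reader options concrete, here is a plain-Python analogy of what "header" and "inferSchema" mean. It is a simplified sketch, not Spark's implementation: it distinguishes only integers from strings, and the sample rows are made up for illustration. It shows how a single non-numeric token (here a hypothetical "?") is enough for a column to be typed as a string.

```python
import csv
import io

def infer_column_types(csv_text):
    """Treat the first row as the header and type each column as 'int' or 'string'."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)                      # header=True: first row holds column names
    types = {name: "int" for name in header}
    for row in reader:
        for name, value in zip(header, row):
            try:
                int(value)
            except ValueError:
                types[name] = "string"         # one non-integer value makes the column a string
    return types

# Made-up sample rows, not taken from the lab data file.
sample = "clump_thickness,bare_nuclei,class\n5,1,2\n8,?,4\n"
print(infer_column_types(sample))
```

Calling `data.printSchema()` on the real DataFrame shows the types Spark actually inferred.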

## Exercise 1: (5 points) Complete the following path with the path for your home directory.  

In [None]:
data = ss.read.csv("/storage/home/???/Lab8DT/breast-cancer-wisconsin.data.txt", header=True, inferSchema=True)

## Exercise 2: (5 points) Enter your name below:
- My Name: 

## randomSplit is a DataFrame method that splits the rows of the DataFrame into subsets (here two: one for training, the other for testing), using a number as the seed for the random number generator.
## If you want to generate a different split, you can use a different seed.
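
The role of the weights and the seed can be illustrated with a plain-Python analogy. This is a sketch under the assumption that each row is assigned to a subset independently at random. It is not Spark's code, and the function `random_split` is a hypothetical stand-in. It shows two things: the subset sizes are only approximately 70% and 30%, and the same seed reproduces the same split.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row independently to a subset with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    subsets = [[] for _ in weights]
    for row in rows:
        draw = rng.random() * total
        cumulative = 0.0
        for subset, weight in zip(subsets, weights):
            cumulative += weight
            if draw < cumulative:
                subset.append(row)
                break
        else:
            subsets[-1].append(row)  # guards against floating-point round-off
    return subsets

rows = list(range(1000))
train, test = random_split(rows, [0.7, 0.3], seed=1234)
train_again, _ = random_split(rows, [0.7, 0.3], seed=1234)
print(len(train), len(test))   # close to 700 and 300, but usually not exact
print(train == train_again)    # True: the same seed gives the same split
```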

In [None]:
trainingData, testData= data.randomSplit([0.7, 0.3], seed=1234)

In [None]:
data.printSchema()

In [None]:
labelIndexer = StringIndexer(inputCol="class", outputCol="indexedLabel").fit(data)

In [None]:
bnIndexer = StringIndexer(inputCol="bare_nuclei", outputCol="bare_nuclei_index").fit(data)

In [None]:
input_features = ['clump_thickness', 'unif_cell_size', 'unif_cell_shape', 'marg_adhesion', 'single_epith_cell_size', 'bland_chrom', 'norm_nucleoli', 'mitoses', 'bare_nuclei_index']

## Exercise 3: (20 points) Choose two different values for the maxDepth hyperparameter of the decision tree (recommended values: 2 to 7). Run the entire sequence of code below once for each value to generate two decision trees.
- Record the f1 measure for each max_depth below  
- Make sure you CHANGE the name of the files for saving decision trees (e.g., "DTtree_d3.html", "DTtree_d5.html", ...). 

## Answer for Exercise 3: 
- f1 measure for max_depth ??? = ???
- f1 measure for max_depth ??? = ???

In [None]:
assembler = VectorAssembler( inputCols=input_features, outputCol="features")
dt=DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="features", maxDepth=???, minInstancesPerNode=2)
# Convert the predicted index (column "prediction") back to the original class label.
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedClass", labels=labelIndexer.labels)
pipeline = Pipeline(stages=[labelIndexer, bnIndexer, assembler, dt, labelConverter])
model = pipeline.fit(trainingData)
predictions = model.transform(testData)

In [None]:
predictions.persist()

In [None]:
predictions.take(2)

In [None]:
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="f1")

In [None]:
f1 = evaluator.evaluate(predictions)
print("f1 score:", f1)
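
For reference, the "f1" metric of MulticlassClassificationEvaluator is, to my understanding, a weighted F1: the F1 of each class averaged with weights equal to that class's share of the true labels. The sketch below recomputes it in plain Python on a tiny made-up example; the function `weighted_f1` and the sample labels are illustrative assumptions, not part of the lab code.

```python
from collections import Counter

def weighted_f1(labels, predictions):
    """Per-class F1, averaged with weights equal to each class's share of the true labels."""
    total = len(labels)
    true_counts = Counter(labels)
    pred_counts = Counter(predictions)
    correct = Counter(l for l, p in zip(labels, predictions) if l == p)
    score = 0.0
    for cls, n_true in true_counts.items():
        tp = correct[cls]
        precision = tp / pred_counts[cls] if pred_counts[cls] else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * n_true / total
    return score

# Made-up labels: class 0 has F1 = 0.8, class 1 has F1 = 2/3.
labels      = [0, 0, 0, 1]
predictions = [0, 0, 1, 1]
print(round(weighted_f1(labels, predictions), 4))  # 0.7667
```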

## stages[3] of the pipeline is "dt" (DecisionTreeClassifier).  stages[0] is labelIndexer.
## model is a PipelineModel (not a DataFrame) representing the trained pipeline.
## model.stages[3] gives us the Decision Tree model learned.
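
The index 3 depends on the order of the stages list passed to Pipeline. As an alternative to a hard-coded index, a stage can be looked up by its type. The sketch below uses stand-in classes so that it runs without Spark; `find_stage` and the `Fake...` classes are hypothetical names introduced only for illustration.

```python
def find_stage(stages, stage_type):
    """Return (index, stage) for the first stage that is an instance of stage_type."""
    for index, stage in enumerate(stages):
        if isinstance(stage, stage_type):
            return index, stage
    raise LookupError(f"no stage of type {stage_type.__name__}")

# Stand-in classes so the sketch runs without Spark; they are NOT Spark classes.
class FakeIndexerModel: pass
class FakeAssembler: pass
class FakeTreeModel: pass

# Same order as the lab pipeline: labelIndexer, bnIndexer, assembler, dt, labelConverter.
stages = [FakeIndexerModel(), FakeIndexerModel(), FakeAssembler(),
          FakeTreeModel(), FakeIndexerModel()]
index, tree_stage = find_stage(stages, FakeTreeModel)
print(index)  # 3
```

In the notebook, the equivalent call would presumably be `find_stage(model.stages, DecisionTreeClassificationModel)`, with `DecisionTreeClassificationModel` imported from `pyspark.ml.classification`.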

In [None]:
DTmodel = model.stages[3]
print(DTmodel)

In [None]:
model_path="./DTmodel_vis"

### Provide a file name for your decision tree below.  If your PySpark code runs successfully, you should see your DT visualization file (e.g., DTtree_d3.html) in your Lab8DT directory.  
### Note: Change the file name before you rerun the code with a different maxDepth hyperparameter value, so that the earlier visualization is not overwritten.
### Download the two decision tree visualization files and open them in a browser to see the trees.
### Use a screen capture tool (e.g., Screenshot on Mac, Snipping Tool on PC) to capture the decision trees, and include them in your Word document with your answers to Exercise 4 (see below).
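
One way to avoid overwriting an earlier visualization is to derive the file name from the depth. The sketch below shows this, together with the index-to-name mapping that the plotting cell builds from input_features (written here as a dict comprehension). The helper `tree_output_path` is a hypothetical name, and only the first three feature names are listed to keep the example short.

```python
def tree_output_path(max_depth):
    """Build a file name such as 'DTtree_d3.html' from the tree depth."""
    return f"DTtree_d{max_depth}.html"

for depth in (3, 5):
    print(tree_output_path(depth))   # DTtree_d3.html, then DTtree_d5.html

# Shortened feature list for illustration; the lab uses all nine input features.
features = ['clump_thickness', 'unif_cell_size', 'unif_cell_shape']
# Maps the feature index (as a string) to the feature name, presumably so that
# the plot can label each split with a name instead of an index.
column = {str(idx): name for idx, name in enumerate(features)}
print(column)
```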

In [None]:
tree=decision_tree_parse(DTmodel, ss, model_path)
column = dict([(str(idx), i) for idx, i in enumerate(input_features)])
plot_trees(tree, column = column, output_path = '????.html')

## Exercise 4: (30 points) Use a Word document to
- (a) Show the decision tree visualization for the two values of max_depth (10 points)
- (b) Discuss the key difference between the two trees. (10 points)
- (c) Discuss which tree you believe is a better model for breast cancer diagnosis, and explain the rationale of your choice. (10 points)
### Submit the PDF version of the Word document as a part of this lab.