# ST590 Project 3 - Hepatitis C Predictions
Yi Ren

## Introduction
### Supervised Learning
Supervised learning is a subcategory of machine learning and artificial intelligence. It is defined by its use of labeled datasets to train algorithms to classify data or predict outcomes accurately. 

In supervised learning, the algorithm “learns” from the training dataset by iteratively making predictions on the data and adjusting for the correct answer. While supervised learning models tend to be more accurate than unsupervised learning models, they require upfront human intervention to label the data appropriately. 

Supervised learning is classified into two categories of algorithms: 

+ Classification: A classification problem is one where the output variable is a category, such as “red” or “blue”, or “disease” or “no disease”.
+ Regression: A regression problem is one where the output variable is a real value, such as “dollars” or “weight”.

In both cases, supervised learning works with “labeled” data: each training example is already tagged with the correct answer.

Types:
+ Regression
+ Logistic Regression
+ Classification
+ Naive Bayes Classifiers
+ K-NN (k nearest neighbors)
+ Decision Trees
+ Support Vector Machine

Advantages:

+ Supervised learning uses previously collected, labeled examples to produce outputs for new data.
+ Helps to optimize performance criteria with the help of experience.
+ Supervised machine learning helps to solve various types of real-world computation problems.
+ It performs classification and regression tasks.
+ It allows estimating or mapping the result to a new sample. 
+ We have complete control over choosing the number of classes we want in the training data.

Disadvantages:

+ Classifying big data can be challenging.
+ Training for supervised learning can require a lot of computation time.
+ Supervised learning cannot handle all complex tasks in machine learning.
+ It requires a labelled data set and training process.

More detailed information can be found via [Supervised and Unsupervised Learning](https://www.geeksforgeeks.org/supervised-unsupervised-learning/).

### Data Information
The dataset used in this project is [HCV](https://archive.ics.uci.edu/ml/datasets/HCV+data), which contains the laboratory values of blood donors and Hepatitis C patients with their demographic values. 

All attributes except Category and Sex are numerical. 

Attributes 5-14 are the laboratory measurements. The full attribute list is:

1) X (Patient ID/No.) 
2) Category (diagnosis) (values: '0=Blood Donor', '0s=suspect Blood Donor', '1=Hepatitis', '2=Fibrosis', '3=Cirrhosis') 
3) Age (in years) 
4) Sex (f,m) 
5) ALB 
6) ALP 
7) ALT 
8) AST 
9) BIL 
10) CHE 
11) CHOL 
12) CREA 
13) GGT 
14) PROT 

The target attribute for classification is Category: blood donors vs. Hepatitis C patients (including disease progression: 'just' Hepatitis C, Fibrosis, Cirrhosis). To be more specific, I combined '0s=suspect Blood Donor' and '0=Blood Donor' into a single blood donor class (label 0), and I combined '1=Hepatitis', '2=Fibrosis' and '3=Cirrhosis' into a single class (label 1) indicating that the patient has Hepatitis C. 
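The category merge described above can be sketched in plain Python. This is only an illustration of the mapping: the helper name `to_binary_label` and the two set names are hypothetical, and the notebook itself performs the equivalent step with `StringIndexer` and `Binarizer` further down.

```python
# Hypothetical helper: collapse the five Category strings into a binary label.
DONOR_CATEGORIES = {"0=Blood Donor", "0s=suspect Blood Donor"}
PATIENT_CATEGORIES = {"1=Hepatitis", "2=Fibrosis", "3=Cirrhosis"}


def to_binary_label(category: str) -> int:
    """Return 0 for (suspect) blood donors and 1 for any Hepatitis C stage."""
    if category in DONOR_CATEGORIES:
        return 0
    if category in PATIENT_CATEGORIES:
        return 1
    # Fail loudly on anything that is not one of the five documented values.
    raise ValueError(f"unexpected category: {category!r}")


example_categories = [
    "0=Blood Donor",
    "0s=suspect Blood Donor",
    "1=Hepatitis",
    "3=Cirrhosis",
]
labels = [to_binary_label(c) for c in example_categories]
print(labels)
```

Raising on unknown values is a design choice for the sketch: a silent default would hide typos in the category strings.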


The main goal for this project is to predict whether the patient has Hepatitis C or not (binary response) using the given HCV data. 

### Data Preparation
Create the spark session 

In [1]:
import pandas as pd
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local[*]').appName('hcv').getOrCreate()

Read in the data

In [3]:
hcv_data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/00571/hcvdat0.csv')
hcv_data.head()

Unnamed: 0.1,Unnamed: 0,Category,Age,Sex,ALB,ALP,ALT,AST,BIL,CHE,CHOL,CREA,GGT,PROT
0,1,0=Blood Donor,32,m,38.5,52.5,7.7,22.1,7.5,6.93,3.23,106.0,12.1,69.0
1,2,0=Blood Donor,32,m,38.5,70.3,18.0,24.7,3.9,11.17,4.8,74.0,15.6,76.5
2,3,0=Blood Donor,32,m,46.9,74.7,36.2,52.6,6.1,8.84,5.2,86.0,33.2,79.3
3,4,0=Blood Donor,32,m,43.2,52.0,30.6,22.6,18.9,7.33,4.74,80.0,33.8,75.7
4,5,0=Blood Donor,32,m,39.2,74.1,32.6,24.8,9.6,9.15,4.32,76.0,29.9,68.7


Detect missing values and drop every row in which at least one value is missing

In [4]:
hcv_data.isna().sum()

Unnamed: 0     0
Category       0
Age            0
Sex            0
ALB            1
ALP           18
ALT            1
AST            0
BIL            0
CHE            0
CHOL          10
CREA           0
GGT            0
PROT           1
dtype: int64

In [5]:
hcv_data = hcv_data.dropna()

Replace suspect Blood Donor with Blood Donor

In [6]:
hcv_data.loc[hcv_data['Category'] == '0s=suspect Blood Donor', 'Category'] = '0=Blood Donor'

Convert to a spark SQL data frame

In [7]:
hcv = spark.createDataFrame(hcv_data)
hcv.show(5)


+----------+-------------+---+---+----+----+----+----+----+-----+----+-----+----+----+
|Unnamed: 0|     Category|Age|Sex| ALB| ALP| ALT| AST| BIL|  CHE|CHOL| CREA| GGT|PROT|
+----------+-------------+---+---+----+----+----+----+----+-----+----+-----+----+----+
|         1|0=Blood Donor| 32|  m|38.5|52.5| 7.7|22.1| 7.5| 6.93|3.23|106.0|12.1|69.0|
|         2|0=Blood Donor| 32|  m|38.5|70.3|18.0|24.7| 3.9|11.17| 4.8| 74.0|15.6|76.5|
|         3|0=Blood Donor| 32|  m|46.9|74.7|36.2|52.6| 6.1| 8.84| 5.2| 86.0|33.2|79.3|
|         4|0=Blood Donor| 32|  m|43.2|52.0|30.6|22.6|18.9| 7.33|4.74| 80.0|33.8|75.7|
|         5|0=Blood Donor| 32|  m|39.2|74.1|32.6|24.8| 9.6| 9.15|4.32| 76.0|29.9|68.7|
+----------+-------------+---+---+----+----+----+----+----+-----+----+-----+----+----+
only showing top 5 rows



## Splitting the Data, Metrics, and Models

### Metrics
#### Test Area Under ROC (ROC AUC)
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. AUC measures the entire two-dimensional area underneath the entire ROC curve (think integral calculus) from (0,0) to (1,1).

AUC provides an aggregate measure of performance across all possible classification thresholds. One way of interpreting AUC is as the probability that the model ranks a random positive example more highly than a random negative example. For example, given the following examples, which are arranged from left to right in ascending order of logistic regression predictions:

![Predictions ranked in ascending order of logistic regression score](https://developers.google.com/static/machine-learning/crash-course/images/AUCPredictionsRanked.svg)

AUC represents the probability that a random positive (green) example is positioned to the right of a random negative (red) example.

An excellent model has an AUC near 1, which means it separates the classes well. An AUC near 0 is the worst case for separability: the model ranks the classes in reverse, so negatives tend to score higher than positives. When the AUC is 0.5, the model has no class separation capacity whatsoever.
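The ranking interpretation of AUC can be checked directly on a handful of scores. The sketch below is a minimal, assumption-laden illustration: the function name `auc_by_ranking` and the toy labels and scores are made up, and ties are counted as half a win, which is a common convention rather than something stated in the text.

```python
def auc_by_ranking(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half (assumed convention)
    return wins / (len(pos) * len(neg))


# Toy data, listed in ascending order of score.
toy_labels = [0, 0, 1, 0, 1, 1]
toy_scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]
toy_auc = auc_by_ranking(toy_labels, toy_scores)
print(round(toy_auc, 3))
```

Here 8 of the 9 positive/negative pairs are ordered correctly, so the value is 8/9. The double loop is quadratic in the number of examples, which is fine for illustration but not how libraries compute the metric.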

##### Advantages:
+ AUC is scale-invariant. It measures how well predictions are ranked, rather than their absolute values.
+ AUC is classification-threshold-invariant. It measures the quality of the model's predictions irrespective of what classification threshold is chosen.

##### Disadvantages:
+ Scale invariance is not always desirable. For example, sometimes we really do need well calibrated probability outputs, and AUC won’t tell us about that.
+ Classification-threshold invariance is not always desirable. In cases where there are wide disparities in the cost of false negatives vs. false positives, it may be critical to minimize one type of classification error. For example, when doing email spam detection, you likely want to prioritize minimizing false positives (even if that results in a significant increase of false negatives). AUC isn't a useful metric for this type of optimization.

More details can be found via [Classification: ROC Curve and AUC](https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc)

#### Area under PR (PR AUC)
Any prediction relative to labeled data can be a true positive, false positive, true negative, or false negative. The precision-recall curve is constructed by calculating and plotting the precision against the recall for a single classifier at a variety of thresholds. A precision-recall curve helps to visualize how the choice of threshold affects classifier performance, and can even help us select the best threshold for a specific problem.
<img src=https://miro.medium.com/v2/resize:fit:1400/format:webp/1*6QPLsDvjo4H6OZrxEBI8Fg.png width = '600'>

Generally, the higher the AUC-PR score, the better a classifier performs for the given task. In a perfect classifier, AUC-PR =1. In a “baseline” classifier, the AUC-PR will depend on the fraction of observations belonging to the positive class. For example, in a balanced binary classification data set, the “baseline” classifier will have AUC-PR = 0.5. A classifier that provides some predictive value will fall between the “baseline” and perfect classifiers.
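To make the construction of the curve concrete, the following sketch computes one (precision, recall) point per threshold on a tiny, made-up balanced sample. The function name `precision_recall_points` and the data are assumptions for illustration only.

```python
def precision_recall_points(labels, scores):
    """Return (threshold, precision, recall) using each distinct score as a threshold."""
    total_pos = sum(labels)
    points = []
    for t in sorted(set(scores), reverse=True):
        # Predict positive whenever the score reaches the threshold.
        predicted = [1 if s >= t else 0 for s in scores]
        tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 1)
        fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p == 1)
        precision = tp / (tp + fp)   # tp + fp >= 1 because t is one of the scores
        recall = tp / total_pos
        points.append((t, precision, recall))
    return points


pr_labels = [0, 0, 1, 0, 1, 1]   # balanced: three positives, three negatives
pr_scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]
pr_points = precision_recall_points(pr_labels, pr_scores)
for threshold, precision, recall in pr_points:
    print(threshold, round(precision, 3), round(recall, 3))
```

At the lowest threshold every example is predicted positive, so recall is 1 and precision equals the positive fraction (0.5 here), which is the level a "baseline" classifier sits at in a balanced data set.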

##### Advantages:
+ Area under PR is a useful metric for communicating precision/recall trade-offs to other stakeholders.
+ If you care more about the positive class, PR AUC is a good choice because it is more sensitive to improvements for the positive class.

##### Disadvantages:
+ When data is heavily imbalanced, PR AUC focuses mainly on the positive class (PPV and TPR) and cares less about the frequent negative class, so it says little about how well negatives are identified.

More details can be found via [Precision-Recall Curves](https://medium.com/@douglaspsteen/precision-recall-curves-d32e5b290248)

#### ROC AUC vs PR AUC
What is common between ROC AUC and PR AUC is that they both look at the prediction scores of classification models and not at thresholded class assignments. What is different, however, is that ROC AUC looks at the true positive rate (TPR) and false positive rate (FPR), while PR AUC looks at the positive predictive value (PPV) and TPR. Put differently, both ROC axes are normalized by the actual class counts in the data, whereas precision is normalized by the number of predicted positives.

In general, if you care equally about the positive and negative class or your dataset is quite balanced, then going with ROC AUC is a good idea.
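A small numeric example may help show why class balance matters for this choice. The confusion counts below are invented: both scenarios have the same TPR and FPR, so they map to the same point on a ROC curve, yet their precision (PPV) differs sharply once negatives vastly outnumber positives.

```python
def rates(tp, fp, tn, fn):
    """Return (TPR, FPR, PPV) from confusion-matrix counts."""
    tpr = tp / (tp + fn)   # recall: share of actual positives found
    fpr = fp / (fp + tn)   # share of actual negatives flagged by mistake
    ppv = tp / (tp + fp)   # precision: share of predicted positives that are right
    return tpr, fpr, ppv


# Hypothetical counts: 100 positives in both cases.
balanced = rates(tp=80, fp=10, tn=90, fn=20)         # 100 negatives
imbalanced = rates(tp=80, fp=1000, tn=9000, fn=20)   # 10,000 negatives

print("balanced  ", [round(v, 3) for v in balanced])
print("imbalanced", [round(v, 3) for v in imbalanced])
```

The ROC coordinates are identical in the two cases, while precision falls from roughly 0.89 to roughly 0.07, which is the kind of change the PR curve exposes.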

### Splitting the Data
A goal of supervised learning is to build a model that performs well on new data, which train test split helps you simulate. To be more specific, by using similar data for training and testing, you can minimize the effects of data discrepancies and better understand the characteristics of the model. After a model has been processed by using the training set, you test the model by making predictions against the test set. 

The split should always be done before any transformations are fitted. Whatever transformations are learned on the training set must then be applied, unchanged, to the test set.
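The ordering (split first, then learn the transformation from the training rows only) can be sketched without Spark. Standardization is used here purely as a stand-in transformation; the notebook does not scale its features, and the toy values and variable names are assumptions.

```python
import random
import statistics

# Toy feature column; the seed mirrors the notebook's choice of seed = 1.
rng = random.Random(1)
values = [float(v) for v in range(1, 21)]
rng.shuffle(values)

# 1) Split first (80/20).
cut = int(0.8 * len(values))
train_vals, test_vals = values[:cut], values[cut:]

# 2) Learn the transformation parameters from the training rows only.
train_mean = statistics.fmean(train_vals)
train_std = statistics.pstdev(train_vals)


def standardize(x):
    """Apply the training-set mean and standard deviation to any value."""
    return (x - train_mean) / train_std


# 3) Apply the same fitted transformation to both sets.
train_scaled = [standardize(x) for x in train_vals]
test_scaled = [standardize(x) for x in test_vals]
print(len(train_scaled), len(test_scaled))
```

Because the test rows never influence `train_mean` or `train_std`, the test set stays a fair stand-in for new data.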


In [8]:
train, test = hcv.randomSplit([0.8,0.2], seed = 1)

### Transformation

Index the string columns (`Category`, `Sex`) as numeric codes

In [9]:
from pyspark.ml.feature import SQLTransformer, StringIndexer, Binarizer, VectorAssembler
indexer = StringIndexer(inputCols = ['Category', 'Sex'], outputCols = ['category_numeric', 'Sex_indicator'])
indexerTrans = indexer.fit(hcv)
indexerTrans.transform(hcv)

DataFrame[Unnamed: 0: bigint, Category: string, Age: bigint, Sex: string, ALB: double, ALP: double, ALT: double, AST: double, BIL: double, CHE: double, CHOL: double, CREA: double, GGT: double, PROT: double, category_numeric: double, Sex_indicator: double]

Convert to a 0/1 indicator

In [10]:
binaryTrans = Binarizer(threshold = 0.5, inputCol = 'category_numeric', outputCol = 'category_indicator')
binaryTrans.transform(
    indexerTrans.transform(hcv))

DataFrame[Unnamed: 0: bigint, Category: string, Age: bigint, Sex: string, ALB: double, ALP: double, ALT: double, AST: double, BIL: double, CHE: double, CHOL: double, CREA: double, GGT: double, PROT: double, category_numeric: double, Sex_indicator: double, category_indicator: double]

In [11]:
sqlTrans = SQLTransformer(
    statement = """
                SELECT Age, ALB, ALP, ALT, AST, BIL, CHE, CHOL, CREA, GGT, PROT, Sex_indicator, category_indicator as label FROM __THIS__
                """
)

In [12]:
sqlTrans.transform(
    binaryTrans.transform(
        indexerTrans.transform(hcv)
    )
).show(10)

+---+----+----+----+----+----+-----+----+-----+----+----+-------------+-----+
|Age| ALB| ALP| ALT| AST| BIL|  CHE|CHOL| CREA| GGT|PROT|Sex_indicator|label|
+---+----+----+----+----+----+-----+----+-----+----+----+-------------+-----+
| 32|38.5|52.5| 7.7|22.1| 7.5| 6.93|3.23|106.0|12.1|69.0|          0.0|  0.0|
| 32|38.5|70.3|18.0|24.7| 3.9|11.17| 4.8| 74.0|15.6|76.5|          0.0|  0.0|
| 32|46.9|74.7|36.2|52.6| 6.1| 8.84| 5.2| 86.0|33.2|79.3|          0.0|  0.0|
| 32|43.2|52.0|30.6|22.6|18.9| 7.33|4.74| 80.0|33.8|75.7|          0.0|  0.0|
| 32|39.2|74.1|32.6|24.8| 9.6| 9.15|4.32| 76.0|29.9|68.7|          0.0|  0.0|
| 32|41.6|43.3|18.5|19.7|12.3| 9.92|6.05|111.0|91.0|74.0|          0.0|  0.0|
| 32|46.3|41.3|17.5|17.8| 8.5| 7.01|4.79| 70.0|16.9|74.5|          0.0|  0.0|
| 32|42.2|41.9|35.8|31.1|16.1| 5.82| 4.6|109.0|21.5|67.1|          0.0|  0.0|
| 32|50.9|65.5|23.2|21.2| 6.9| 8.69| 4.1| 83.0|13.7|71.3|          0.0|  0.0|
| 32|42.4|86.3|20.3|20.0|35.2| 5.46|4.45| 81.0|15.9|69.9|       

Put the predictors into features

In [13]:
assembler = VectorAssembler(inputCols = ['Age', 'ALB', 'ALP', 'ALT', 'AST', 'BIL', 'CHE', 'CHOL', 'CREA', 'GGT', 'PROT', 'Sex_indicator'], outputCol = 'features', handleInvalid = 'keep')

In [14]:
assembler.transform(
    sqlTrans.transform(
        binaryTrans.transform(
            indexerTrans.transform(hcv)
        )
    )
).show(10)

+---+----+----+----+----+----+-----+----+-----+----+----+-------------+-----+--------------------+
|Age| ALB| ALP| ALT| AST| BIL|  CHE|CHOL| CREA| GGT|PROT|Sex_indicator|label|            features|
+---+----+----+----+----+----+-----+----+-----+----+----+-------------+-----+--------------------+
| 32|38.5|52.5| 7.7|22.1| 7.5| 6.93|3.23|106.0|12.1|69.0|          0.0|  0.0|[32.0,38.5,52.5,7...|
| 32|38.5|70.3|18.0|24.7| 3.9|11.17| 4.8| 74.0|15.6|76.5|          0.0|  0.0|[32.0,38.5,70.3,1...|
| 32|46.9|74.7|36.2|52.6| 6.1| 8.84| 5.2| 86.0|33.2|79.3|          0.0|  0.0|[32.0,46.9,74.7,3...|
| 32|43.2|52.0|30.6|22.6|18.9| 7.33|4.74| 80.0|33.8|75.7|          0.0|  0.0|[32.0,43.2,52.0,3...|
| 32|39.2|74.1|32.6|24.8| 9.6| 9.15|4.32| 76.0|29.9|68.7|          0.0|  0.0|[32.0,39.2,74.1,3...|
| 32|41.6|43.3|18.5|19.7|12.3| 9.92|6.05|111.0|91.0|74.0|          0.0|  0.0|[32.0,41.6,43.3,1...|
| 32|46.3|41.3|17.5|17.8| 8.5| 7.01|4.79| 70.0|16.9|74.5|          0.0|  0.0|[32.0,46.3,41.3,1...|
| 32|42.2|

### Models
+ Logistic Regression

Logistic Regression is a supervised machine learning algorithm that can be used to model the probability of a certain class or event. It works best when the classes can be separated reasonably well by a linear decision boundary, and it is used when the outcome is binary or dichotomous in nature. That means logistic regression is usually used for binary classification problems.

In logistic regression, the sigmoid function is used to map the predicted values to probabilities. This function maps any real value to a value between 0 and 1. It has a positive derivative at every point and exactly one inflection point.
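A minimal sketch of the sigmoid and its derivative is shown below. The function names are illustrative, and the two-branch form is just one common way to avoid overflow for large negative inputs.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real value to the open interval (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the algebraically equivalent form to avoid overflow.
    e = math.exp(z)
    return e / (1.0 + e)


def sigmoid_derivative(z: float) -> float:
    """Derivative s(z) * (1 - s(z)); it peaks at the inflection point z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)


print(sigmoid(0.0), sigmoid_derivative(0.0))
print(round(sigmoid(4.39100438168772), 4))
```

The last line uses the raw score from the first logistic regression output row shown later in the notebook (about 4.391); applying the sigmoid gives approximately 0.9878, in line with the probability reported there for class 0.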

Logistic regression can be extended into three different types:

+ Binomial: Where the target variable can have only two possible types.
+ Multinomial: Where the target variable has three or more possible types, which may not have any quantitative significance. 
+ Ordinal: Where the target variable has ordered categories.

Configure an ML pipeline

In [15]:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline
lr = LogisticRegression(featuresCol = 'features', labelCol = 'label', maxIter = 10)
pipeline1 = Pipeline(stages = [indexerTrans, binaryTrans, sqlTrans, assembler, lr])

Create ParamGrid for cross validation

In [16]:
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
paramGrid1 = ParamGridBuilder() \
    .addGrid(lr.regParam, [0.01, 0.1, 0.5, 1.0, 2.0]) \
    .addGrid(lr.elasticNetParam, [0.0, 0.25, 0.5, 0.75, 1.0]) \
    .build()
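The grid built by `ParamGridBuilder` is the Cartesian product of the listed values, which determines how many models cross-validation has to fit. The sketch below reproduces that count in plain Python; the dictionary keys are just labels for illustration, and the five folds are the value the notebook passes as `numFolds`.

```python
from itertools import product

reg_params = [0.01, 0.1, 0.5, 1.0, 2.0]
elastic_net_params = [0.0, 0.25, 0.5, 0.75, 1.0]

# Every regParam is paired with every elasticNetParam.
grid = [
    {"regParam": r, "elasticNetParam": e}
    for r, e in product(reg_params, elastic_net_params)
]

num_folds = 5
fits_during_cv = len(grid) * num_folds   # one fit per combination per fold

print(len(grid), fits_during_cv)
```

With 25 combinations and 5 folds, cross-validation trains 125 models before refitting the best combination on the full training set, so adding values to either list grows the cost multiplicatively.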

Run cross-validation and choose the best set of parameters

In [18]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
crossval1 = CrossValidator(estimator = pipeline1,
                          estimatorParamMaps = paramGrid1,
                          evaluator = BinaryClassificationEvaluator(metricName = 'areaUnderROC'),
                          numFolds = 5) 

In [19]:
crossval1_1 = CrossValidator(estimator = pipeline1,
                          estimatorParamMaps = paramGrid1,
                          evaluator = BinaryClassificationEvaluator(metricName = 'areaUnderPR'),
                          numFolds = 5) 
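As a rough picture of what `numFolds = 5` means, the sketch below partitions row indices into five folds and yields one train/validation pair per fold. It assigns rows to folds by striding, which is an assumption made to keep the example deterministic; Spark assigns rows to folds randomly.

```python
def k_fold_indices(n_rows, k):
    """Yield (training_indices, validation_indices) for each of k folds."""
    # Fold i holds rows i, i + k, i + 2k, ... (deterministic striding).
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


splits = list(k_fold_indices(n_rows=10, k=5))
for training, validation in splits:
    print(validation, len(training))
```

Each row is used for validation exactly once and for training in the other four rounds; the metric is averaged over the rounds for every parameter combination.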

+ Decision Tree

Decision Trees are used for both regression and classification problems. They visually flow like trees, hence the name, and in the classification case, they start with the root of the tree and follow binary splits based on variable outcomes until a leaf node is reached and the final binary result is given. 

Decision trees are widely used since they are easy to interpret, handle categorical features, extend to the multiclass classification setting, do not require feature scaling, and are able to capture non-linearities and feature interactions. Tree ensemble algorithms such as random forests and boosting are among the top performers for classification and regression tasks.
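The root-to-leaf flow described above can be sketched with a tiny hand-written tree. Everything in this example is hypothetical: the feature names are borrowed from the dataset, but the thresholds and leaf labels are invented for illustration and have no clinical meaning.

```python
# Hypothetical tree: internal nodes test one feature against a threshold,
# leaves carry the predicted class (0 = blood donor, 1 = Hepatitis C).
toy_tree = {
    "feature": "AST",
    "threshold": 50.0,
    "left": {"leaf": 0},
    "right": {
        "feature": "ALT",
        "threshold": 20.0,
        "left": {"leaf": 1},
        "right": {"leaf": 0},
    },
}


def predict_with_tree(node, row):
    """Follow binary splits from the root until a leaf is reached."""
    while "leaf" not in node:
        go_left = row[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["leaf"]


print(predict_with_tree(toy_tree, {"AST": 22.1, "ALT": 7.7}))
print(predict_with_tree(toy_tree, {"AST": 90.0, "ALT": 10.0}))
```

No feature scaling is involved because each split compares a single raw feature value with a threshold.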

Configure an ML pipeline

In [48]:
from pyspark.ml.classification import DecisionTreeClassifier
dt = DecisionTreeClassifier(featuresCol = 'features', labelCol = 'label', maxDepth = 3)
pipeline2 = Pipeline(stages = [indexerTrans, binaryTrans, sqlTrans, assembler, dt])

Create ParamGrid for cross validation

In [49]:
paramGrid2 = (ParamGridBuilder()
             .addGrid(dt.maxDepth, [2, 5, 10, 20, 30])
             .addGrid(dt.maxBins, [10, 20, 40, 80, 100])
             .build())

Run cross-validation and choose the best set of parameters

In [50]:
crossval2 = CrossValidator(estimator = pipeline2,
                          estimatorParamMaps = paramGrid2,
                          evaluator = BinaryClassificationEvaluator(metricName = 'areaUnderROC'),
                          numFolds = 5) 

In [51]:
crossval2_1 = CrossValidator(estimator = pipeline2,
                          estimatorParamMaps = paramGrid2,
                          evaluator = BinaryClassificationEvaluator(metricName = 'areaUnderPR'),
                          numFolds = 5) 

+ Random Forest Classifier

Random forests are ensembles of decision trees. Random forests combine many decision trees in order to reduce the risk of overfitting. The spark.ml implementation supports random forests for binary and multiclass classification and for regression, using both continuous and categorical features.

If the data contains one very strong predictor, nearly every bootstrap tree will probably use it for the first split, which makes the trees more correlated with each other. To counter this, a random forest considers only a random subset of the predictors at each split. The forest contains many decision trees, each of which produces its own classification of the input data, and these are combined into the final prediction. 
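The two sources of randomness in a random forest, plus the final vote, can be sketched as three small helpers. This is not a full random forest: the helper names are made up, the tree-growing step is omitted, and the square-root rule for the subset size is a common heuristic assumed here.

```python
import random
from collections import Counter


def bootstrap_indices(n_rows, rng):
    """Sample row indices with replacement; one such sample per tree."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def random_feature_subset(features, k, rng):
    """Pick k distinct features to consider at one candidate split."""
    return rng.sample(features, k)


def majority_vote(votes):
    """Combine the class predictions of the individual trees."""
    return Counter(votes).most_common(1)[0][0]


rng = random.Random(0)
features = ["Age", "ALB", "ALP", "ALT", "AST", "BIL",
            "CHE", "CHOL", "CREA", "GGT", "PROT", "Sex_indicator"]
k = max(1, int(len(features) ** 0.5))   # square-root heuristic (assumption)

sample_rows = bootstrap_indices(10, rng)
split_candidates = random_feature_subset(features, k, rng)
forest_prediction = majority_vote([0, 1, 1, 0, 1])   # made-up tree votes

print(sample_rows)
print(split_candidates)
print(forest_prediction)
```

Because only a few features are eligible at each split, a single dominant predictor cannot appear at the top of every tree, which is what reduces the correlation between trees.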

Configure an ML pipeline

In [52]:
from pyspark.ml.classification import RandomForestClassifier
rf = RandomForestClassifier(featuresCol = 'features', labelCol = 'label')
pipeline3 = Pipeline(stages = [indexerTrans, binaryTrans, sqlTrans, assembler, rf])

Create ParamGrid for cross validation

In [53]:
paramGrid3 = ParamGridBuilder() \
             .addGrid(rf.maxDepth, [2, 5, 10, 20, 30]) \
             .addGrid(rf.maxBins, [10, 20, 40, 80, 100]) \
             .build()

Run cross-validation and choose the best set of parameters

In [54]:
crossval3 = CrossValidator(estimator = pipeline3,
                          estimatorParamMaps = paramGrid3,
                          evaluator = BinaryClassificationEvaluator(metricName = 'areaUnderROC'),
                          numFolds = 5) 

In [55]:
crossval3_1 = CrossValidator(estimator = pipeline3,
                          estimatorParamMaps = paramGrid3,
                          evaluator = BinaryClassificationEvaluator(metricName = 'areaUnderPR'),
                          numFolds = 5) 

+ One-vs-Rest classifier (a.k.a. One-vs-All)

OneVsRest is an example of a machine learning reduction for performing multiclass classification given a base classifier that can perform binary classification efficiently. It is also known as “One-vs-All.”

OneVsRest is implemented as an Estimator. For the base classifier, it takes instances of Classifier and creates a binary classification problem for each of the k classes. The classifier for class i is trained to predict whether the label is i or not, distinguishing class i from all other classes.

Predictions are made by evaluating each binary classifier; the index of the most confident classifier is output as the label.
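The prediction rule can be sketched as an argmax over per-class confidence scores. The three scoring functions below are hypothetical stand-ins for trained binary classifiers, chosen only so that each one prefers a different region of a one-dimensional input.

```python
def ovr_predict(binary_scorers, x):
    """Return (winning class index, all scores) for one input."""
    scores = [scorer(x) for scorer in binary_scorers]
    best_class = max(range(len(scores)), key=scores.__getitem__)
    return best_class, scores


# Hypothetical "class i vs. rest" scorers: a higher score means more confidence.
toy_scorers = [
    lambda x: 1.0 - x,              # class 0: confident for small x
    lambda x: 1.0 - abs(x - 1.0),   # class 1: confident near x = 1
    lambda x: x - 1.0,              # class 2: confident for large x
]

for x in (0.1, 1.0, 2.5):
    print(x, ovr_predict(toy_scorers, x)[0])
```

With only two classes, as in this project, the reduction amounts to a single binary decision, so wrapping `LinearSVC` in `OneVsRest` mainly serves to demonstrate the estimator.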

Configure an ML pipeline

In [57]:
from pyspark.ml.classification import OneVsRest, LinearSVC
LSVC = LinearSVC()
ovr = OneVsRest(classifier = LSVC, featuresCol = 'features', labelCol = 'label')
pipeline4 = Pipeline(stages = [indexerTrans, binaryTrans, sqlTrans, assembler, ovr])                

Create ParamGrid for cross validation

In [58]:
paramGrid4 = ParamGridBuilder() \
            .addGrid(LSVC.maxIter, [10, 100]) \
            .addGrid(LSVC.regParam,[0.001, 0.01, 1.0,10.0]) \
            .build()

Run cross-validation and choose the best set of parameters

In [59]:
crossval4 = CrossValidator(estimator = pipeline4,
                          estimatorParamMaps = paramGrid4,
                          evaluator = BinaryClassificationEvaluator(metricName = 'areaUnderROC'),
                          numFolds = 5) 

In [60]:
crossval4_1 = CrossValidator(estimator = pipeline4,
                          estimatorParamMaps = paramGrid4,
                          evaluator = BinaryClassificationEvaluator(metricName = 'areaUnderPR'),
                          numFolds = 5) 

+ Gradient-Boosted Tree Classifier

Gradient boosting is a functional gradient algorithm: it repeatedly selects a function (a weak hypothesis) that points in the direction of the negative gradient, so that adding it reduces a loss function. A gradient boosting classifier combines several weak learning models to produce a powerful predictive model.

Gradient boosting involves three elements:

+ A loss function to be optimized.
+ A weak learner to make predictions.
+ An additive model to add weak learners to minimize the loss function.
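The three elements listed above can be seen together in a toy implementation. This sketch makes several simplifying assumptions: it uses squared-error loss on a one-dimensional input (for which the negative gradient is just the residual), single-split regression stumps as weak learners, and made-up data. Spark's `GBTClassifier` differs in its loss and tree construction.

```python
def fit_stump(xs, residuals):
    """Weak learner: the best single-threshold stump under squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        err = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x <= t else right_mean


def boost(xs, ys, rounds=20, learning_rate=0.5):
    """Additive model: start from the mean, then add stumps fitted to residuals."""
    base = sum(ys) / len(ys)
    stumps = []

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    for _ in range(rounds):
        # Negative gradient of 0.5 * (y - f(x))^2 with respect to f(x).
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict


toy_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
toy_y = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
model = boost(toy_x, toy_y)


def mse(predict):
    return sum((y - predict(x)) ** 2 for x, y in zip(toy_x, toy_y)) / len(toy_x)


baseline_mse = mse(lambda x: sum(toy_y) / len(toy_y))
boosted_mse = mse(model)
print(baseline_mse, boosted_mse)
```

The learning rate shrinks each stump's contribution, so the loss drops gradually over the rounds instead of in a single step.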


Configure an ML pipeline

In [62]:
from pyspark.ml.classification import GBTClassifier
gbt = GBTClassifier(maxIter=10, featuresCol = 'features', labelCol = 'label')
pipeline5 = Pipeline(stages = [indexerTrans, binaryTrans, sqlTrans, assembler, gbt])

Create ParamGrid for cross validation

In [63]:
paramGrid5 = ParamGridBuilder() \
            .addGrid(gbt.maxIter, [10, 20]) \
            .addGrid(gbt.maxDepth, [2, 5, 10]) \
            .addGrid(gbt.maxBins, [10, 40, 80]) \
            .build()

Run cross-validation and choose the best set of parameters

In [64]:
crossval5 = CrossValidator(estimator = pipeline5,
                          estimatorParamMaps = paramGrid5,
                          evaluator = BinaryClassificationEvaluator(metricName = 'areaUnderROC'),
                          numFolds = 5) 

In [65]:
crossval5_1 = CrossValidator(estimator = pipeline5,
                          estimatorParamMaps = paramGrid5,
                          evaluator = BinaryClassificationEvaluator(metricName = 'areaUnderPR'),
                          numFolds = 5) 

## Model Fitting

+ Logistic Regression

Fit the model with cross validation using areaUnderROC metric

In [23]:
cvModel1 = crossval1.fit(train)
cvModel1.transform(test).show(5)

+---+----+----+----+----+----+----+----+-----+----+----+-------------+-----+--------------------+--------------------+--------------------+----------+
|Age| ALB| ALP| ALT| AST| BIL| CHE|CHOL| CREA| GGT|PROT|Sex_indicator|label|            features|       rawPrediction|         probability|prediction|
+---+----+----+----+----+----+----+----+-----+----+----+-------------+-----+--------------------+--------------------+--------------------+----------+
| 32|39.2|74.1|32.6|24.8| 9.6|9.15|4.32| 76.0|29.9|68.7|          0.0|  0.0|[32.0,39.2,74.1,3...|[4.39100438168772...|[0.98776331112521...|       0.0|
| 33|39.0|51.7|15.9|24.0| 6.8|6.46|3.38| 65.0| 7.0|70.4|          0.0|  0.0|[33.0,39.0,51.7,1...|[2.94809674541979...|[0.95017345819235...|       0.0|
| 33|38.7|39.8|22.5|23.0| 4.1|4.63|4.97| 63.0|15.2|71.9|          0.0|  0.0|[33.0,38.7,39.8,2...|[3.44834239436967...|[0.96918166883337...|       0.0|
| 33|41.8|65.0|33.1|38.0| 6.6|8.83|4.43| 71.0|24.0|72.7|          0.0|  0.0|[33.0,41.8,65.0,3.

Fit the model with cross validation using areaUnderPR metric

In [33]:
cvModel1_1 = crossval1_1.fit(train)
cvModel1_1.transform(test).show(5)

+---+----+----+----+----+----+----+----+-----+----+----+-------------+-----+--------------------+--------------------+--------------------+----------+
|Age| ALB| ALP| ALT| AST| BIL| CHE|CHOL| CREA| GGT|PROT|Sex_indicator|label|            features|       rawPrediction|         probability|prediction|
+---+----+----+----+----+----+----+----+-----+----+----+-------------+-----+--------------------+--------------------+--------------------+----------+
| 32|39.2|74.1|32.6|24.8| 9.6|9.15|4.32| 76.0|29.9|68.7|          0.0|  0.0|[32.0,39.2,74.1,3...|[3.94024607603227...|[0.98092740702406...|       0.0|
| 33|39.0|51.7|15.9|24.0| 6.8|6.46|3.38| 65.0| 7.0|70.4|          0.0|  0.0|[33.0,39.0,51.7,1...|[2.72781381482310...|[0.93864806007784...|       0.0|
| 33|38.7|39.8|22.5|23.0| 4.1|4.63|4.97| 63.0|15.2|71.9|          0.0|  0.0|[33.0,38.7,39.8,2...|[3.54728115462578...|[0.97200353406240...|       0.0|
| 33|41.8|65.0|33.1|38.0| 6.6|8.83|4.43| 71.0|24.0|72.7|          0.0|  0.0|[33.0,41.8,65.0,3.

+ Decision Tree

Fit the model with cross validation using areaUnderROC metric

In [49]:
cvModel2 = crossval2.fit(train)
cvModel2.transform(test).show(5)


+---+----+----+----+----+----+----+----+-----+----+----+-------------+-----+--------------------+-------------+-----------+----------+
|Age| ALB| ALP| ALT| AST| BIL| CHE|CHOL| CREA| GGT|PROT|Sex_indicator|label|            features|rawPrediction|probability|prediction|
+---+----+----+----+----+----+----+----+-----+----+----+-------------+-----+--------------------+-------------+-----------+----------+
| 32|39.2|74.1|32.6|24.8| 9.6|9.15|4.32| 76.0|29.9|68.7|          0.0|  0.0|[32.0,39.2,74.1,3...|  [374.0,0.0]|  [1.0,0.0]|       0.0|
| 33|39.0|51.7|15.9|24.0| 6.8|6.46|3.38| 65.0| 7.0|70.4|          0.0|  0.0|[33.0,39.0,51.7,1...|  [374.0,0.0]|  [1.0,0.0]|       0.0|
| 33|38.7|39.8|22.5|23.0| 4.1|4.63|4.97| 63.0|15.2|71.9|          0.0|  0.0|[33.0,38.7,39.8,2...|   [19.0,0.0]|  [1.0,0.0]|       0.0|
| 33|41.8|65.0|33.1|38.0| 6.6|8.83|4.43| 71.0|24.0|72.7|          0.0|  0.0|[33.0,41.8,65.0,3...|  [374.0,0.0]|  [1.0,0.0]|       0.0|
| 33|45.2|88.3|32.4|31.2|10.1|9.78|5.51|102.0|48.5|76.5

Fit the model with cross validation using areaUnderPR metric

In [67]:
cvModel2_1 = crossval2_1.fit(train)
cvModel2_1.transform(test).show(5)

+---+----+----+----+----+----+----+----+-----+----+----+-------------+-----+--------------------+-------------+-----------+----------+
|Age| ALB| ALP| ALT| AST| BIL| CHE|CHOL| CREA| GGT|PROT|Sex_indicator|label|            features|rawPrediction|probability|prediction|
+---+----+----+----+----+----+----+----+-----+----+----+-------------+-----+--------------------+-------------+-----------+----------+
| 32|39.2|74.1|32.6|24.8| 9.6|9.15|4.32| 76.0|29.9|68.7|          0.0|  0.0|[32.0,39.2,74.1,3...|  [374.0,0.0]|  [1.0,0.0]|       0.0|
| 33|39.0|51.7|15.9|24.0| 6.8|6.46|3.38| 65.0| 7.0|70.4|          0.0|  0.0|[33.0,39.0,51.7,1...|  [374.0,0.0]|  [1.0,0.0]|       0.0|
| 33|38.7|39.8|22.5|23.0| 4.1|4.63|4.97| 63.0|15.2|71.9|          0.0|  0.0|[33.0,38.7,39.8,2...|   [19.0,0.0]|  [1.0,0.0]|       0.0|
| 33|41.8|65.0|33.1|38.0| 6.6|8.83|4.43| 71.0|24.0|72.7|          0.0|  0.0|[33.0,41.8,65.0,3...|  [374.0,0.0]|  [1.0,0.0]|       0.0|
| 33|45.2|88.3|32.4|31.2|10.1|9.78|5.51|102.0|48.5|76.5

+ Random Forest

Fit the model with cross validation using areaUnderROC metric

In [52]:
cvModel3 = crossval3.fit(train)
cvModel3.transform(test).show(5)

+---+----+----+----+----+----+----+----+-----+----+----+-------------+-----+--------------------+--------------------+--------------------+----------+
|Age| ALB| ALP| ALT| AST| BIL| CHE|CHOL| CREA| GGT|PROT|Sex_indicator|label|            features|       rawPrediction|         probability|prediction|
+---+----+----+----+----+----+----+----+-----+----+----+-------------+-----+--------------------+--------------------+--------------------+----------+
| 32|39.2|74.1|32.6|24.8| 9.6|9.15|4.32| 76.0|29.9|68.7|          0.0|  0.0|[32.0,39.2,74.1,3...|[19.9431068830735...|[0.99715534415367...|       0.0|
| 33|39.0|51.7|15.9|24.0| 6.8|6.46|3.38| 65.0| 7.0|70.4|          0.0|  0.0|[33.0,39.0,51.7,1...|[19.9558619851143...|[0.99779309925571...|       0.0|
| 33|38.7|39.8|22.5|23.0| 4.1|4.63|4.97| 63.0|15.2|71.9|          0.0|  0.0|[33.0,38.7,39.8,2...|[19.8847556330415...|[0.99423778165207...|       0.0|
| 33|41.8|65.0|33.1|38.0| 6.6|8.83|4.43| 71.0|24.0|72.7|          0.0|  0.0|[33.0,41.8,65.0,3.

Fit the model with cross validation using areaUnderPR metric

In [66]:
cvModel3_1 = crossval3_1.fit(train)
cvModel3_1.transform(test).show(5)

+---+----+----+----+----+----+----+----+-----+----+----+-------------+-----+--------------------+--------------------+--------------------+----------+
|Age| ALB| ALP| ALT| AST| BIL| CHE|CHOL| CREA| GGT|PROT|Sex_indicator|label|            features|       rawPrediction|         probability|prediction|
+---+----+----+----+----+----+----+----+-----+----+----+-------------+-----+--------------------+--------------------+--------------------+----------+
| 32|39.2|74.1|32.6|24.8| 9.6|9.15|4.32| 76.0|29.9|68.7|          0.0|  0.0|[32.0,39.2,74.1,3...|[19.9275555104627...|[0.99637777552313...|       0.0|
| 33|39.0|51.7|15.9|24.0| 6.8|6.46|3.38| 65.0| 7.0|70.4|          0.0|  0.0|[33.0,39.0,51.7,1...|[19.9338448186388...|[0.99669224093194...|       0.0|
| 33|38.7|39.8|22.5|23.0| 4.1|4.63|4.97| 63.0|15.2|71.9|          0.0|  0.0|[33.0,38.7,39.8,2...|[18.9903193149702...|[0.94951596574851...|       0.0|
| 33|41.8|65.0|33.1|38.0| 6.6|8.83|4.43| 71.0|24.0|72.7|          0.0|  0.0|[33.0,41.8,65.0,3.

+ One-vs-Rest classifier (a.k.a. One-vs-All)

Fit the model with cross validation using areaUnderROC metric

In [55]:
cvModel4 = crossval4.fit(train)
cvModel4.transform(test).show(5)

+---+----+----+----+----+----+----+----+-----+----+----+-------------+-----+--------------------+--------------------+----------+
|Age| ALB| ALP| ALT| AST| BIL| CHE|CHOL| CREA| GGT|PROT|Sex_indicator|label|            features|       rawPrediction|prediction|
+---+----+----+----+----+----+----+----+-----+----+----+-------------+-----+--------------------+--------------------+----------+
| 32|39.2|74.1|32.6|24.8| 9.6|9.15|4.32| 76.0|29.9|68.7|          0.0|  0.0|[32.0,39.2,74.1,3...|[1.91871314873404...|       0.0|
| 33|39.0|51.7|15.9|24.0| 6.8|6.46|3.38| 65.0| 7.0|70.4|          0.0|  0.0|[33.0,39.0,51.7,1...|[1.29938010699354...|       0.0|
| 33|38.7|39.8|22.5|23.0| 4.1|4.63|4.97| 63.0|15.2|71.9|          0.0|  0.0|[33.0,38.7,39.8,2...|[1.46307702769153...|       0.0|
| 33|41.8|65.0|33.1|38.0| 6.6|8.83|4.43| 71.0|24.0|72.7|          0.0|  0.0|[33.0,41.8,65.0,3...|[1.50783408046238...|       0.0|
| 33|45.2|88.3|32.4|31.2|10.1|9.78|5.51|102.0|48.5|76.5|          0.0|  0.0|[33.0,45.2,88.

Fit the model with cross validation using areaUnderPR metric

In [69]:
cvModel4_1 = crossval4_1.fit(train)
cvModel4_1.transform(test).show(5)

+---+----+----+----+----+----+----+----+-----+----+----+-------------+-----+--------------------+--------------------+----------+
|Age| ALB| ALP| ALT| AST| BIL| CHE|CHOL| CREA| GGT|PROT|Sex_indicator|label|            features|       rawPrediction|prediction|
+---+----+----+----+----+----+----+----+-----+----+----+-------------+-----+--------------------+--------------------+----------+
| 32|39.2|74.1|32.6|24.8| 9.6|9.15|4.32| 76.0|29.9|68.7|          0.0|  0.0|[32.0,39.2,74.1,3...|[3.54335798651759...|       0.0|
| 33|39.0|51.7|15.9|24.0| 6.8|6.46|3.38| 65.0| 7.0|70.4|          0.0|  0.0|[33.0,39.0,51.7,1...|[1.82233024925300...|       0.0|
| 33|38.7|39.8|22.5|23.0| 4.1|4.63|4.97| 63.0|15.2|71.9|          0.0|  0.0|[33.0,38.7,39.8,2...|[2.41339915166002...|       0.0|
| 33|41.8|65.0|33.1|38.0| 6.6|8.83|4.43| 71.0|24.0|72.7|          0.0|  0.0|[33.0,41.8,65.0,3...|[2.79827502851329...|       0.0|
| 33|45.2|88.3|32.4|31.2|10.1|9.78|5.51|102.0|48.5|76.5|          0.0|  0.0|[33.0,45.2,88.

+ Gradient-Boosted Tree Classifier

Fit the model with cross validation using areaUnderROC metric

In [22]:
cvModel5 = crossval5.fit(train)
cvModel5.transform(test).show(5)

+---+----+----+----+----+----+----+----+-----+----+----+-------------+-----+--------------------+--------------------+--------------------+----------+
|Age| ALB| ALP| ALT| AST| BIL| CHE|CHOL| CREA| GGT|PROT|Sex_indicator|label|            features|       rawPrediction|         probability|prediction|
+---+----+----+----+----+----+----+----+-----+----+----+-------------+-----+--------------------+--------------------+--------------------+----------+
| 32|39.2|74.1|32.6|24.8| 9.6|9.15|4.32| 76.0|29.9|68.7|          0.0|  0.0|[32.0,39.2,74.1,3...|[1.55529190155845...|[0.95732721157099...|       0.0|
| 33|39.0|51.7|15.9|24.0| 6.8|6.46|3.38| 65.0| 7.0|70.4|          0.0|  0.0|[33.0,39.0,51.7,1...|[1.55529190155845...|[0.95732721157099...|       0.0|
| 33|38.7|39.8|22.5|23.0| 4.1|4.63|4.97| 63.0|15.2|71.9|          0.0|  0.0|[33.0,38.7,39.8,2...|[1.51665609307152...|[0.95405657332409...|       0.0|
| 33|41.8|65.0|33.1|38.0| 6.6|8.83|4.43| 71.0|24.0|72.7|          0.0|  0.0|[33.0,41.8,65.0,3.

Fit the model with cross validation using areaUnderPR metric

In [70]:
cvModel5_1 = crossval5_1.fit(train)
cvModel5_1.transform(test).show(5)

+---+----+----+----+----+----+----+----+-----+----+----+-------------+-----+--------------------+--------------------+--------------------+----------+
|Age| ALB| ALP| ALT| AST| BIL| CHE|CHOL| CREA| GGT|PROT|Sex_indicator|label|            features|       rawPrediction|         probability|prediction|
+---+----+----+----+----+----+----+----+-----+----+----+-------------+-----+--------------------+--------------------+--------------------+----------+
| 32|39.2|74.1|32.6|24.8| 9.6|9.15|4.32| 76.0|29.9|68.7|          0.0|  0.0|[32.0,39.2,74.1,3...|[1.55529190155845...|[0.95732721157099...|       0.0|
| 33|39.0|51.7|15.9|24.0| 6.8|6.46|3.38| 65.0| 7.0|70.4|          0.0|  0.0|[33.0,39.0,51.7,1...|[1.55529190155845...|[0.95732721157099...|       0.0|
| 33|38.7|39.8|22.5|23.0| 4.1|4.63|4.97| 63.0|15.2|71.9|          0.0|  0.0|[33.0,38.7,39.8,2...|[1.51665609307152...|[0.95405657332409...|       0.0|
| 33|41.8|65.0|33.1|38.0| 6.6|8.83|4.43| 71.0|24.0|72.7|          0.0|  0.0|[33.0,41.8,65.0,3.

In [68]:
hcv_data['Category'].value_counts()

0=Blood Donor    533
3=Cirrhosis       24
1=Hepatitis       20
2=Fibrosis        12
Name: Category, dtype: int64
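
To make the imbalance in these counts concrete, the short sketch below converts them into shares. It assumes that the binary `label` used by the classifiers is 0 for blood donors and 1 for every other category; that mapping is an assumption made here for illustration, and the names `counts`, `shares`, and `positive_rate` are hypothetical.

```python
# Class counts copied from the value_counts() output above.
counts = {
    "0=Blood Donor": 533,
    "3=Cirrhosis": 24,
    "1=Hepatitis": 20,
    "2=Fibrosis": 12,
}

total = sum(counts.values())

# Share of each category in the data set.
shares = {name: n / total for name, n in counts.items()}

# Assumption: label = 0 for blood donors, label = 1 for all other categories.
positives = total - counts["0=Blood Donor"]
positive_rate = positives / total

for name, share in shares.items():
    print(f"{name:15s} {share:.3f}")
print(f"{'positive rate':15s} {positive_rate:.3f}")
```

Under that assumption, fewer than one in ten rows belongs to the positive class.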

From the output, we can observe that the best model differs between our two metrics. As mentioned above, ROC AUC is built from the true positive rate (TPR) and the false positive rate (FPR), while PR AUC is built from the positive predictive value (PPV) and the TPR. TPR and FPR are conditioned on the actual classes in the data, whereas PPV is conditioned on the predicted class.

When the data is heavily imbalanced, as the category counts above show, PR AUC focuses mainly on the positive class (PPV and TPR) and is less sensitive to the frequent negative class. For that reason, we prefer ROC AUC as the metric for the cross-validation process.
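
The difference between the two metrics can be illustrated on a toy example. The sketch below is pure Python and is not how Spark computes these values: `roc_auc` uses the pairwise-ranking definition, and `average_precision` is a step-wise approximation of PR AUC (Spark's `areaUnderPR` interpolates differently). The labels and scores are made up for illustration.

```python
def roc_auc(labels, scores):
    """Share of (positive, negative) pairs where the positive scores higher; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Toy data (hypothetical): 2 positives among 10 samples.
labels = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

# Same data plus 10 extra negatives that the model scores very low.
labels_more = labels + [0] * 10
scores_more = scores + [0.01] * 10

print(roc_auc(labels, scores), average_precision(labels, scores))
print(roc_auc(labels_more, scores_more), average_precision(labels_more, scores_more))
```

In this toy case, adding easy negatives raises the ROC AUC while the average precision stays the same, because the extra negatives rank below every positive and never enter the precision calculation.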

## Model Testing
For each of the 5 classifiers, use the best model found by cross validation and evaluate its performance on the test set using the **areaUnderROC** and **areaUnderPR** metrics.

+ Logistic Regression

In [43]:
BinaryClassificationEvaluator(metricName = 'areaUnderROC').evaluate(cvModel1.transform(test))

0.9174311926605506

In [48]:
BinaryClassificationEvaluator(metricName = 'areaUnderPR').evaluate(cvModel1.transform(test))

0.7405865750055891

+ Decision Tree

In [50]:
BinaryClassificationEvaluator(metricName = 'areaUnderROC').evaluate(cvModel2.transform(test))

0.8061926605504587

In [51]:
BinaryClassificationEvaluator(metricName = 'areaUnderPR').evaluate(cvModel2.transform(test))

0.6169871794871795

+ Random Forest

In [53]:
BinaryClassificationEvaluator(metricName = 'areaUnderROC').evaluate(cvModel3.transform(test))

0.9919724770642201

In [54]:
BinaryClassificationEvaluator(metricName = 'areaUnderPR').evaluate(cvModel3.transform(test))

0.920405982905983

+ One-vs-Rest classifier (a.k.a. One-vs-All)

In [56]:
BinaryClassificationEvaluator(metricName = 'areaUnderROC').evaluate(cvModel4.transform(test))

0.9529816513761469

In [57]:
BinaryClassificationEvaluator(metricName = 'areaUnderPR').evaluate(cvModel4.transform(test))

0.865635278548669

+ Gradient-Boosted Tree Classifier

In [24]:
BinaryClassificationEvaluator(metricName = 'areaUnderROC').evaluate(cvModel5.transform(test))

0.9833715596330277

In [25]:
BinaryClassificationEvaluator(metricName = 'areaUnderPR').evaluate(cvModel5.transform(test))

0.8270724067599067

Random forest is the best classifier among all 5 models. Its **areaUnderROC** of 0.992 means there is approximately a 99.2% chance that the model ranks a randomly chosen positive case above a randomly chosen negative case. Similarly, its **areaUnderPR** of 0.920 is very close to 1 and is the highest among all the classifiers. This value summarizes precision (the share of true positives among all cases predicted as positive) across all recall levels.
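
The comparison can be checked by collecting the reported test-set values in one place. The sketch below copies the numbers from the evaluator outputs above, rounded to four decimals; the names `results` and `rank_by` are hypothetical helpers, not part of the original notebook.

```python
# Test-set metrics copied from the evaluator outputs above (rounded to 4 decimals).
results = {
    "Logistic Regression": {"areaUnderROC": 0.9174, "areaUnderPR": 0.7406},
    "Decision Tree": {"areaUnderROC": 0.8062, "areaUnderPR": 0.6170},
    "Random Forest": {"areaUnderROC": 0.9920, "areaUnderPR": 0.9204},
    "One-vs-Rest": {"areaUnderROC": 0.9530, "areaUnderPR": 0.8656},
    "Gradient-Boosted Tree": {"areaUnderROC": 0.9834, "areaUnderPR": 0.8271},
}


def rank_by(metric):
    """Return model names sorted from best to worst on the given metric."""
    return sorted(results, key=lambda name: results[name][metric], reverse=True)


roc_ranking = rank_by("areaUnderROC")
pr_ranking = rank_by("areaUnderPR")

print(f"{'Model':24s}{'areaUnderROC':>14s}{'areaUnderPR':>13s}")
for name in roc_ranking:
    row = results[name]
    print(f"{name:24s}{row['areaUnderROC']:>14.4f}{row['areaUnderPR']:>13.4f}")

print("Best by areaUnderROC:", roc_ranking[0])
print("Best by areaUnderPR: ", pr_ranking[0])
```

Random forest ranks first on both metrics, while the order of the gradient-boosted tree and the One-vs-Rest classifier swaps between the two rankings.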