Good job 100/100

## Adult income Data Prediction
#### Group members: 
>Yingxue Zhu (820894341)  Xuan Li (819264580)
              
#### Goal of the project: 
>Determine whether a person makes over 50K a year.

#### Dataset reference: 
>https://archive.ics.uci.edu/ml/datasets/adult 
              
>This data was extracted from the census bureau database found at http://www.census.gov/ftp/pub/DES/www/welcome.html

>Donor: Ronny Kohavi and Barry Becker

>Split into train-test using MLC++ GenCVFiles (2/3, 1/3 random).

#### Data description
>http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names

#### Files and kernels
>income_data_analysis.ipynb -- please use kernel Apache Toree - Scala

>data_visualization.ipynb   -- please use kernel Python 3


#### Organization

>income_data_analysis.ipynb

> * 1 Load data
* 2 Rename 
* 3 Data cleaning & Label encode income column
* 4 Explore data
   * 4.1 Probability of income >50k and <=50k
   * 4.2 Influence of different features on income
   * 4.3 Data Visualization
* 5 Transform text type columns
* 6 Pearson Correlation
* 7 Assemble raw features
* 8 Prediction:
   * 8.1 Logistic Regression
   * 8.2 Decision Tree
   * 8.3 Random Forest
   * 8.4 Naive Bayes
   * 8.5 Neural Network
   * 8.6 Gradient-Boosted Trees Classifier
* 9 Conclusion


In [16]:
import org.apache.spark.sql.functions._
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.ml.classification.{RandomForestClassifier}
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.{Pipeline}
import org.apache.spark.mllib.linalg.{Matrix, Matrices}
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.mllib.evaluation.MulticlassMetrics

# 1. Load training data and testing data

In [4]:
//load training data
val reader = spark.read
var currentDirectory = new java.io.File(".").getCanonicalPath
val trainDataDir = currentDirectory + "/adult.data"
val trainData = reader.format("csv").
      option("mode","DROPMALFORMED").
      option("nullValue", " ?").
      option("header",false).
      option("inferSchema",true).
      option("sep", ",").
      load(trainDataDir)
trainData.show(2)
var train_size = trainData.count()
println("Total sample size of train data: " + train_size)
//load testing data
val testDataDir = currentDirectory + "/adult.test"
val testData = reader.format("csv").
      option("mode","DROPMALFORMED").
      option("nullValue", " ?").
      option("header",false).
      option("inferSchema",true).
      option("sep", ",").
      load(testDataDir)
testData.show(2)
var test_size = testData.count()
println("Total sample size of test data: " + test_size)
println("Total instances of data: " + (train_size + test_size))

+---+-----------------+-------+----------+----+-------------------+----------------+--------------+------+-----+------+----+----+--------------+------+
|_c0|              _c1|    _c2|       _c3| _c4|                _c5|             _c6|           _c7|   _c8|  _c9|  _c10|_c11|_c12|          _c13|  _c14|
+---+-----------------+-------+----------+----+-------------------+----------------+--------------+------+-----+------+----+----+--------------+------+
| 39|        State-gov|77516.0| Bachelors|13.0|      Never-married|    Adm-clerical| Not-in-family| White| Male|2174.0| 0.0|40.0| United-States| <=50K|
| 28| Self-emp-not-inc|83311.0| Bachelors|13.0| Married-civ-spouse| Exec-managerial|       Husband| White| Male|   0.0| 0.0|13.0| United-States| <=50K|
+---+-----------------+-------+----------+----+-------------------+----------------+--------------+------+-----+------+----+----+--------------+------+
only showing top 2 rows

Total sample size of train data: 32561
+---+--------+--------+-

reader = org.apache.spark.sql.DataFrameReader@7c5ea4ed
currentDirectory = /Users/xuanli/Downloads
trainDataDir = /Users/xuanli/Downloads/adult.data
trainData = [_c0: int, _c1: string ... 13 more fields]
train_size = 32561
testDataDir = /Users/xuanli/Downloads/adult.test
testData = [_c0: int, _c1: string ... 13 more fields]
test_size = 16281


16281

Before remove of those unknown:<br/>
Total sample size of train data: 32561<br/>
Total sample size of test data: 16281<br/>
Total instances: 48842<br/>
which are same numbers in data description. 

# 2. Rename 

In [5]:
val newNames = Seq("age", "workclass", "fnlwgt", "education", "education_num", "marital_status", "occupation", "relationship", "race", "sex", "capital_gain", "capital_loss", "hours_per_week", "native_country", "income")
val trainDataRenamed = trainData.toDF(newNames: _*)
val testDataRenamed = testData.toDF(newNames: _*)
trainDataRenamed.show(5)
//testDataRenamed.show(5)

+---+-----------------+--------+----------+-------------+-------------------+------------------+--------------+------+-------+------------+------------+--------------+--------------+------+
|age|        workclass|  fnlwgt| education|education_num|     marital_status|        occupation|  relationship|  race|    sex|capital_gain|capital_loss|hours_per_week|native_country|income|
+---+-----------------+--------+----------+-------------+-------------------+------------------+--------------+------+-------+------------+------------+--------------+--------------+------+
| 39|        State-gov| 77516.0| Bachelors|         13.0|      Never-married|      Adm-clerical| Not-in-family| White|   Male|      2174.0|         0.0|          40.0| United-States| <=50K|
| 28| Self-emp-not-inc| 83311.0| Bachelors|         13.0| Married-civ-spouse|   Exec-managerial|       Husband| White|   Male|         0.0|         0.0|          13.0| United-States| <=50K|
| 38|          Private|215646.0|   HS-grad|       

newNames = List(age, workclass, fnlwgt, education, education_num, marital_status, occupation, relationship, race, sex, capital_gain, capital_loss, hours_per_week, native_country, income)
trainDataRenamed = [age: int, workclass: string ... 13 more fields]
testDataRenamed = [age: int, workclass: string ... 13 more fields]


[age: int, workclass: string ... 13 more fields]

# 3. Data cleaning and Label encode income column

In [6]:
//label encoder for income column 
val indexerTrain = new StringIndexer().setInputCol("income")
                                 .setOutputCol("label")
                                 .fit(trainDataRenamed)
val trainDataRemoved = indexerTrain.transform(trainDataRenamed).na.drop()
val indexerTest = new StringIndexer().setInputCol("income")
                                 .setOutputCol("label")
                                 .fit(testDataRenamed)
val testDataRemoved = indexerTest.transform(testDataRenamed).na.drop()
//trainDataRemoved.printSchema
//testDataRemoved.printSchema
val numTrain = trainDataRemoved.count()
val numTest = testDataRemoved.count()
val total = numTrain + numTest
println("total number of entries: " + total)

total number of entries: 45222


indexerTrain = strIdx_0b127e504d1f
trainDataRemoved = [age: int, workclass: string ... 14 more fields]
indexerTest = strIdx_d01baa48ae81
testDataRemoved = [age: int, workclass: string ... 14 more fields]
numTrain = 30162
numTest = 15060
total = 45222


45222

After remove missing value:<br/>
Total sample size of train data: 30162<br/>
Total sample size of test data: 15060<br/>
Total instances: 45222<br/>
which are same numbers in data description. 

# 4. Explore data
## 4.1 percentage of income >50k and <=50k 

In [7]:
//combine train and test 
val adultAll = trainDataRemoved.unionAll(testDataRemoved)
adultAll.show(2)
adultAll.count()
val above = adultAll.filter(expr("label = 1")).count().toDouble/total.toDouble
val below = adultAll.filter(expr("label = 0")).count().toDouble/total.toDouble

+---+-----------------+-------+----------+-------------+-------------------+----------------+--------------+------+-----+------------+------------+--------------+--------------+------+-----+
|age|        workclass| fnlwgt| education|education_num|     marital_status|      occupation|  relationship|  race|  sex|capital_gain|capital_loss|hours_per_week|native_country|income|label|
+---+-----------------+-------+----------+-------------+-------------------+----------------+--------------+------+-----+------------+------------+--------------+--------------+------+-----+
| 39|        State-gov|77516.0| Bachelors|         13.0|      Never-married|    Adm-clerical| Not-in-family| White| Male|      2174.0|         0.0|          40.0| United-States| <=50K|  0.0|
| 50| Self-emp-not-inc|83311.0| Bachelors|         13.0| Married-civ-spouse| Exec-managerial|       Husband| White| Male|         0.0|         0.0|          13.0| United-States| <=50K|  0.0|
+---+-----------------+-------+----------+---

## 4.2 Influence of different features on income

In [8]:
//age category
def age_range(x:Double):String = { 
    var n = x.toInt/5
    var s =(n*5).toString
    var e =(n*5+5).toString
    "["+s+"~"+e+")"
}
val age_type = udf(age_range(_:Double):String)
val bucketed = adultAll.withColumn("age_bucket", age_type(col("age")))
bucketed.groupBy("age_bucket").agg(count("*").alias("total_count")).orderBy(desc("total_count")).show

                                                                                +----------+-----------+
|age_bucket|total_count|
+----------+-----------+
|   [30~35)|       6198|
|   [35~40)|       6164|
|   [25~30)|       5737|
|   [40~45)|       5531|
|   [20~25)|       5256|
|   [45~50)|       4774|
|   [50~55)|       3637|
|   [55~60)|       2627|
|   [15~20)|       2052|
|   [60~65)|       1685|
|   [65~70)|        829|
|   [70~75)|        410|
|   [75~80)|        179|
|   [80~85)|         84|
|   [90~95)|         46|
|   [85~90)|         13|
+----------+-----------+


In [10]:
//education vs income
adultAll.groupBy("education").
    agg((sum("label")/count("*")*100).alias("more than 50K(%)")).orderBy(desc("more than 50K(%)")).show

+-------------+------------------+
|    education|  more than 50K(%)|
+-------------+------------------+
|  Prof-school| 75.41401273885351|
|    Doctorate| 73.34558823529412|
|      Masters| 55.40970564836913|
|    Bachelors|41.981505944517835|
|   Assoc-acdm|26.410086264100862|
|    Assoc-voc|25.727411944869832|
| Some-college| 20.10304071118295|
|      HS-grad|16.343096800378813|
|         12th| 7.452339688041594|
|         10th| 6.704824202780049|
|      7th-8th| 6.682867557715674|
|          9th| 5.621301775147929|
|         11th| 5.497220506485485|
|      5th-6th|4.8997772828507795|
|      1st-4th|3.6036036036036037|
|    Preschool|1.3888888888888888|
+-------------+------------------+



**Conclusion:**
education has dramatic influence on income. 
Basically, the higher the person was educated, the higher chance the person has annual income greater than 50k.

In [25]:
// native_country vs income
adultAll.groupBy("native_country").
    agg((sum("label")/count("*")*100).alias("more than 50K(%)")).orderBy(desc("more than 50K(%)")).show

+--------------+------------------+                                             
|native_country|  more than 50K(%)|
+--------------+------------------+
|        Taiwan| 45.45454545454545|
|        France| 44.44444444444444|
|         India| 42.17687074829932|
|       England| 39.49579831932773|
|          Iran|39.285714285714285|
|        Canada|36.809815950920246|
|        Greece|36.734693877551024|
|         Japan|34.831460674157306|
|    Yugoslavia| 34.78260869565217|
|      Cambodia| 34.61538461538461|
|       Hungary| 33.33333333333333|
|         Italy|              33.0|
|         China|31.858407079646017|
|       Germany| 30.05181347150259|
|   Philippines|29.681978798586574|
|          Hong| 28.57142857142857|
|       Ireland| 27.77777777777778|
|          Cuba|25.563909774436087|
| United-States|25.302722076915625|
|        Poland|19.753086419753085|
+--------------+------------------+
only showing top 20 rows



The native country also has influence on income. The data shows that person from Taiwan has the highest chance to get annual income greater than 50k.
And the lowest is from Poland.

In [26]:
//race vs income
adultAll.groupBy("race").
    agg((sum("label")/count("*")*100).alias("more than 50K(%)")).orderBy(desc("more than 50K(%)")).show

+-------------------+------------------+                                        
|               race|  more than 50K(%)|
+-------------------+------------------+
| Asian-Pac-Islander| 28.31926323867997|
|              White| 26.23705112716243|
|              Other|12.747875354107649|
|              Black|12.630085146641438|
| Amer-Indian-Eskimo|12.183908045977011|
+-------------------+------------------+



The data shows race does not significant effect on income.

In [28]:
//sex vs income
adultAll.groupBy("sex").
    agg((sum("label")/count("*")*100).alias("more than 50K(%)")).orderBy(desc("more than 50K(%)")).show

+-------+------------------+                                                    
|    sex|  more than 50K(%)|
+-------+------------------+
|   Male|31.247747895305793|
| Female|11.357604627424294|
+-------+------------------+



The data shows that man that has annual income greater than 50k is about as twice as woman does.

In [29]:
//marital_status vs income
adultAll.groupBy("marital_status").
    agg((sum("label")/count("*")*100).alias("more than 50K(%)")).orderBy(desc("more than 50K(%)")).show

+--------------------+------------------+                                       
|      marital_status|  more than 50K(%)|
+--------------------+------------------+
|  Married-civ-spouse|45.423889812396105|
|   Married-AF-spouse|             43.75|
|            Divorced| 10.40177862474194|
| Married-spouse-a...| 9.782608695652174|
|             Widowed| 9.475332811276429|
|           Separated| 7.016300496102056|
|       Never-married| 4.802027675023976|
+--------------------+------------------+



The data shows marriage status does have influence on annual income. (Married spouse tend to get annual income over 50k)

# 4.3 Data Visualization 
> see dataVisualization.ipynb


# 5. transform text type columns 

In [7]:
//string index all string columns for both train and test data
val stringColumns = Array("workclass","education","marital_status",
    "occupation","relationship","race","sex",
              "native_country")
val index_transformers: Array[org.apache.spark.ml.PipelineStage] = stringColumns.map(
      cname => new StringIndexer()
        .setInputCol(cname)
        .setOutputCol(s"idx_${cname}")
    )
val index_pipeline = new Pipeline().setStages(index_transformers)
//train
var index_model_train = index_pipeline.fit(trainDataRemoved)
val trainData_tran = index_model_train.transform(trainDataRemoved)
//test
var index_model_test = index_pipeline.fit(testDataRemoved)
val testData_tran = index_model_train.transform(testDataRemoved)
trainData_tran.printSchema
//testData_tran.printSchema

root
 |-- age: integer (nullable = true)
 |-- workclass: string (nullable = true)
 |-- fnlwgt: double (nullable = true)
 |-- education: string (nullable = true)
 |-- education_num: double (nullable = true)
 |-- marital_status: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- relationship: string (nullable = true)
 |-- race: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- capital_gain: double (nullable = true)
 |-- capital_loss: double (nullable = true)
 |-- hours_per_week: double (nullable = true)
 |-- native_country: string (nullable = true)
 |-- income: string (nullable = true)
 |-- label: double (nullable = true)
 |-- idx_workclass: double (nullable = true)
 |-- idx_education: double (nullable = true)
 |-- idx_marital_status: double (nullable = true)
 |-- idx_occupation: double (nullable = true)
 |-- idx_relationship: double (nullable = true)
 |-- idx_race: double (nullable = true)
 |-- idx_sex: double (nullable = true)
 |-- idx_native_country

stringColumns = Array(workclass, education, marital_status, occupation, relationship, race, sex, native_country)
index_transformers = Array(strIdx_4fd6a6daf7b6, strIdx_f309bbc97292, strIdx_c1a8cacd6f33, strIdx_e9416b4b7cdd, strIdx_b35a2365e101, strIdx_dd424b002974, strIdx_8de12a432028, strIdx_9560abbfaa00)
index_pipeline = pipeline_d96013e685d7
index_model_train = pipeline_d96013e685d7
trainData_tran = [age: int, workclass: string ... 22 more fields]
index_model_test = pipeline_d96013e685d7
testData_tran = [age: int, workclass: string ... 22 more fields]


[age: int, workclass: string ... 22 more fields]

In [8]:
//drop all the original text columns
val fields_train = trainData_tran.schema.fields filter {
x => x.dataType match { 
      case x: org.apache.spark.sql.types.StringType => true
      case _ => false 
      } 
    } map { x => x.name }
var trainData_newDf = fields_train.foldLeft(trainData_tran){ case(dframe,field) => dframe.drop(field) }
trainData_newDf = trainData_newDf.withColumn("label2", expr("label")).drop("label").withColumnRenamed("label2","label")
//test
val fields_test = testData_tran.schema.fields filter {
x => x.dataType match { 
      case x: org.apache.spark.sql.types.StringType => true
      case _ => false 
      } 
    } map { x => x.name }
val testData_newDf = fields_test.foldLeft(testData_tran){ case(dframe,field) => dframe.drop(field) }
trainData_newDf.show(2)
//testData_newDf.show(2)

+---+-------+-------------+------------+------------+--------------+-------------+-------------+------------------+--------------+----------------+--------+-------+------------------+-----+
|age| fnlwgt|education_num|capital_gain|capital_loss|hours_per_week|idx_workclass|idx_education|idx_marital_status|idx_occupation|idx_relationship|idx_race|idx_sex|idx_native_country|label|
+---+-------+-------------+------------+------------+--------------+-------------+-------------+------------------+--------------+----------------+--------+-------+------------------+-----+
| 39|77516.0|         13.0|      2174.0|         0.0|          40.0|          3.0|          2.0|               1.0|           3.0|             1.0|     0.0|    0.0|               0.0|  0.0|
| 28|83311.0|         13.0|         0.0|         0.0|          13.0|          1.0|          2.0|               0.0|           2.0|             0.0|     0.0|    0.0|               0.0|  0.0|
+---+-------+-------------+------------+----------

fields_train = Array(workclass, education, marital_status, occupation, relationship, race, sex, native_country, income)
trainData_newDf = [age: int, fnlwgt: double ... 13 more fields]
trainData_newDf = [age: int, fnlwgt: double ... 13 more fields]
fields_test = Array(workclass, education, marital_status, occupation, relationship, race, sex, native_country, income)
testData_newDf = [age: int, fnlwgt: double ... 13 more fields]


[age: int, fnlwgt: double ... 13 more fields]

# 6. Pearson Correlation

In [10]:
import org.apache.spark.mllib.stat.Statistics

var feature = 0;
val columnNames = trainData_newDf.columns
val numFeature = columnNames.size - 1
while(feature < numFeature)
{
    var name = columnNames(feature)
    var correlation = trainData_newDf.stat.corr("label", name)
    println(s"The correlation between $name and income is: " + correlation)
    feature +=  1
}

The correlation between age and income is: 0.24203068168136402
The correlation between fnlwgt and income is: -0.008957423359171636
The correlation between education_num and income is: 0.335286196752636
The correlation between capital_gain and income is: 0.221196214548056
The correlation between capital_loss and income is: 0.1500533083972996
The correlation between hours_per_week and income is: 0.22948012988851152
The correlation between idx_workclass and income is: 0.13693664382909343
The correlation between idx_education and income is: 0.04611569765633624
The correlation between idx_marital_status and income is: -0.3133594943582077
The correlation between idx_occupation and income is: -0.1829482090625016
The correlation between idx_relationship and income is: -0.2552405895314646
The correlation between idx_race and income is: -0.0679025544900534
The correlation between idx_sex and income is: -0.21669868107558524
The correlation between idx_native_country and income is: -0.019815557146

feature = 14
columnNames = Array(age, fnlwgt, education_num, capital_gain, capital_loss, hours_per_week, idx_workclass, idx_education, idx_marital_status, idx_occupation, idx_relationship, idx_race, idx_sex, idx_native_country, label)
numFeature = 14


14

The result calculated here are same with heat map in dataVisualization.ipynb

# 7. assemble raw features 

In [9]:
//train
val featureCol_train: Array[String] = trainData_newDf.columns.filter(_ != "label")
val assembler_train = new VectorAssembler()
                      .setInputCols(featureCol_train)
                      .setOutputCol("rawFeatures")
val featured_trainData_df = assembler_train.transform(trainData_newDf)
featured_trainData_df.printSchema
//featured_trainData_df.show(1)
//test
val featureCol_test: Array[String] = testData_newDf.columns.filter(_ != "label")
val assembler_test = new VectorAssembler()
                      .setInputCols(featureCol_test)
                      .setOutputCol("rawFeatures")
val featured_testData_df = assembler_test.transform(testData_newDf)
//featured_testData_df.show(1)

root
 |-- age: integer (nullable = true)
 |-- fnlwgt: double (nullable = true)
 |-- education_num: double (nullable = true)
 |-- capital_gain: double (nullable = true)
 |-- capital_loss: double (nullable = true)
 |-- hours_per_week: double (nullable = true)
 |-- idx_workclass: double (nullable = true)
 |-- idx_education: double (nullable = true)
 |-- idx_marital_status: double (nullable = true)
 |-- idx_occupation: double (nullable = true)
 |-- idx_relationship: double (nullable = true)
 |-- idx_race: double (nullable = true)
 |-- idx_sex: double (nullable = true)
 |-- idx_native_country: double (nullable = true)
 |-- label: double (nullable = true)
 |-- rawFeatures: vector (nullable = true)



featureCol_train = Array(age, fnlwgt, education_num, capital_gain, capital_loss, hours_per_week, idx_workclass, idx_education, idx_marital_status, idx_occupation, idx_relationship, idx_race, idx_sex, idx_native_country)
assembler_train = vecAssembler_ed78e22d83a1
featured_trainData_df = [age: int, fnlwgt: double ... 14 more fields]
featureCol_test = Array(age, fnlwgt, education_num, capital_gain, capital_loss, hours_per_week, idx_workclass, idx_education, idx_marital_status, idx_occupation, idx_relationship, idx_race, idx_sex, idx_native_country)
assembler_test = vecAssembler_666dafc49bee


featured_testData_df: org.apache.spark.sql.DataFr...


vecAssembler_666dafc49bee

# 8. Prediction based on different machine learning algorithms:

> 8.1 Logistic Regression

> 8.2 Decision Tree

> 8.3 Random Forest

> 8.4 Naive Bayes

> 8.5 Neural Network 

> 8.6 GBTClassifier


## 8.1 Logistic Regression

In [56]:
//Train a Logistic Model
import scala.collection.mutable
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel, OneVsRest} 

val stages = new mutable.ArrayBuffer[PipelineStage]()
stages += new RFormula().setFormula("label ~ age + fnlwgt + education_num + capital_gain + capital_loss + hours_per_week + idx_workclass + idx_education + idx_marital_status + idx_occupation + idx_relationship + idx_race + idx_sex + idx_native_country")
stages += new LogisticRegression()
                    .setLabelCol("label")
                    .setFeaturesCol("rawFeatures") 
                    .setMaxIter(10)
                    .setTol(1E-6)
                    .setFitIntercept(true)
val pipeline = new Pipeline().setStages(stages.toArray)

var start_log=System.currentTimeMillis()

val pipelineModel = pipeline.fit(featured_trainData_df) 
val model_log = pipelineModel.stages.last.asInstanceOf[LogisticRegressionModel]
//test the model
val examFormular = new RFormula().setFormula("label ~ age + fnlwgt + education_num + capital_gain + capital_loss + hours_per_week + idx_workclass + idx_education + idx_marital_status + idx_occupation + idx_relationship + idx_race + idx_sex + idx_native_country")
val fittedRF = examFormular.fit(featured_testData_df)

var end_log=System.currentTimeMillis()
    
    
val preparedDF = fittedRF.transform(featured_testData_df)
val prediction_log = model_log.transform(preparedDF)

val predictionAndLabels_log = prediction_log.select("prediction", "label").as[(Double,Double)].rdd
val metrics_log = new MulticlassMetrics(predictionAndLabels_log)
var confusion_log = metrics_log.confusionMatrix

var TP_log = confusion_log(1,1)
var TN_log = confusion_log(0,0)
var FP_log = confusion_log(0,1)
var FN_log = confusion_log(1,0)
val total = TP_log + TN_log + FP_log + FN_log
var mis_log = (FP_log+FN_log)/total
var FPrate_log = FP_log/(TN_log + FP_log) 
var FNrate_log = FN_log/(FN_log + TP_log)
var sensitivity_log = TP_log/(FN_log + TP_log)
var specificity_log = TN_log/(TN_log + FP_log)
println("False positive rate = " + FPrate_log)
println("False negative rate = " + FNrate_log)
println("specificity for logistic model = " + specificity_log)
println("sensitivity for logistic model = " + sensitivity_log)

var time_log=end_log-start_log

println("Confusion matrix for logistics model:")
println(metrics_log.confusionMatrix)
println("Accuracy for logistics model:")
println(metrics_log.accuracy)
println("Test Error for logistics model = " + (1.0 - metrics_log.accuracy))
printf("It took %d ms.\n",time_log)

False positive rate = 0.05457746478873239
False negative rate = 0.5218918918918919
specificity for logistic model = 0.9454225352112676
sensitivity for logistic model = 0.4781081081081081
Confusion matrix for logistics model:
10740.0  620.0   
1931.0   1769.0  
Accuracy for logistics model:
0.8306108897742364
Test Error for logistics model = 0.16938911022576364
It took 1840 ms.


stages = ArrayBuffer(RFormula(label ~ age + fnlwgt + education_num + capital_gain + capital_loss + hours_per_week + idx_workclass + idx_education + idx_marital_status + idx_occupation + idx_relationship + idx_race + idx_sex + idx_native_country) (uid=rFormula_9768e00f9ce9), logreg_cf54405bb8f4)
pipeline = pipeline_15783c88f47f
start_log = 1513754758418
pipelineModel = pipeline_15783c88f47f


model_log: org.apache.spark.ml.c...


pipeline_15783c88f47f

In [14]:
//the One vs Rest Classifier algorithm
// instantiate the base classifier
val classifier = new LogisticRegression()
  .setMaxIter(10)
  .setTol(1E-6)
  .setFitIntercept(true)
// instantiate the One Vs Rest Classifier.
val ovr = new OneVsRest().setClassifier(classifier)
                         .setLabelCol("label")
                         .setFeaturesCol("rawFeatures") 

// train the multiclass model.
val model_ovr = ovr.fit(featured_trainData_df)

// evaluate the model 
val predictions_ovr = model_ovr.transform(featured_testData_df)

val predictionAndLabels_ovr = predictions_ovr.select("prediction", "label").as[(Double,Double)].rdd
val metrics_ovr = new MulticlassMetrics(predictionAndLabels_ovr)

println("Confusion matrix for one VS rest model:")
println(metrics_ovr.confusionMatrix)
println("Accuracy for one VS rest model:")
println(metrics_ovr.accuracy)
println("Test Error for one VS rest model = " + (1.0 - metrics_ovr.accuracy))


Confusion matrix for one VS rest model:
10740.0  620.0   
1931.0   1769.0  
Accuracy for one VS rest model:
0.8306108897742364
Test Error for one VS rest model = 0.16938911022576364


classifier = logreg_136956e95035
ovr = oneVsRest_45023ae0a083
model_ovr = oneVsRest_45023ae0a083
predictions_ovr = [age: int, fnlwgt: double ... 15 more fields]
predictionAndLabels_ovr = MapPartitionsRDD[368] at rdd at <console>:113
metrics_ovr = org.apache.spark.mllib.evaluation.MulticlassMetrics@6b01d16f


org.apache.spark.mllib.evaluation.MulticlassMetrics@6b01d16f

## 8.2 Decision Tree


In [10]:
//fits the decision tree model and evaluate the model on test data
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.regression.DecisionTreeRegressionModel
import org.apache.spark.ml.regression.DecisionTreeRegressor
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
import org.apache.spark.ml.feature.Binarizer

val evaluator = new MulticlassClassificationEvaluator()
                          .setLabelCol("label")
                          .setPredictionCol("prediction")
val nFolds = 3
val paramGrid = new ParamGridBuilder().build() 

val dt = new DecisionTreeRegressor().setLabelCol("label").setFeaturesCol("rawFeatures").setMaxBins(50)

val cv_dt = new CrossValidator()
  .setEstimator(dt)
  .setEvaluator(evaluator) 
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(nFolds)

var start_dt=System.currentTimeMillis()
val model_dt = cv_dt.fit(featured_trainData_df)
//evaluate
var prediction_dt = model_dt.transform(featured_testData_df)
var end_dt=System.currentTimeMillis()
    
//binarize the predition 
val binarizer: Binarizer = new Binarizer().
  setInputCol("prediction").
  setOutputCol("binarized_pred").
  setThreshold(0.5)

prediction_dt = binarizer.transform(prediction_dt) 
val predictionAndLabels_dt = prediction_dt.select("binarized_pred", "label").as[(Double,Double)].rdd
val metrics_dt = new MulticlassMetrics(predictionAndLabels_dt)
val time_dt = end_dt - start_dt

var confusion_dt = metrics_dt.confusionMatrix

var TP_dt = confusion_dt(1,1)
var TN_dt = confusion_dt(0,0)
var FP_dt = confusion_dt(0,1)
var FN_dt = confusion_dt(1,0)
//val total = TP_log + TN_log + FP_log + FN_log
var mis_dt = (FP_dt+FN_dt)/total
var FPrate_dt = FP_dt/(TN_dt + FP_dt) 
var FNrate_dt = FN_dt/(FN_dt + TP_dt)
var sensitivity_dt = TP_dt/(FN_dt + TP_dt)
var specificity_dt = TN_dt/(TN_dt + FP_dt)
println("False positive rate = " + FPrate_dt)
println("False negative rate = " + FNrate_dt)
println("specificity for decision tree model = " + specificity_dt)
println("sensitivity for decision tree model = " + sensitivity_dt)

println("Confusion matrix for decision tree model:")
println(metrics_dt.confusionMatrix)
println("Accuracy for decision tree model:")
println(metrics_dt.accuracy)
println("Test Error for decision tree model = " + (1.0 - metrics_dt.accuracy))
printf("It took %d ms.\n",time_dt)

False positive rate = 0.04665492957746479
False negative rate = 0.5094594594594595
specificity for decision tree model = 0.9533450704225352
sensitivity for decision tree model = 0.4905405405405405
Confusion matrix for decision tree model:
10830.0  530.0   
1885.0   1815.0  
Accuracy for decision tree model:
0.8396414342629482
Test Error for decision tree model = 0.16035856573705176
It took 14925 ms.


evaluator = mcEval_b27de2dc0018
nFolds = 3
paramGrid = 
dt = dtr_dc1d606bb831
cv_dt = cv_cd5f4184d645
start_dt = 1513917282663
model_dt = cv_cd5f4184d645
prediction_dt = [age: i...


Array({
})


[age: int, fnlwgt: double ... 16 more fields]

## 8.3 Random Forest


In [11]:
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.mllib.evaluation.MulticlassMetrics
//Train a RandomForest Model
val rf = new RandomForestClassifier()
                  .setLabelCol("label")
                  .setFeaturesCol("rawFeatures")
                  .setNumTrees(10)
                  .setMaxBins(50)

val cv_rf = new CrossValidator()
  .setEstimator(rf)
  .setEvaluator(evaluator) 
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(nFolds)

var start_rf=System.currentTimeMillis()

val model_rf = cv_rf.fit(featured_trainData_df)
//evaluate random forest model
val prediction_rf = model_rf.transform(featured_testData_df)
var end_rf=System.currentTimeMillis()

prediction_rf.show(3)
val predictionAndLabels_rf = prediction_rf.select("prediction", "label").as[(Double,Double)].rdd
val metrics_rf = new MulticlassMetrics(predictionAndLabels_rf)
var confusion = metrics_rf.confusionMatrix
val time_rf = end_rf - start_rf

println("Confusion matrix for random forest:")
println(confusion)
println("Accuracy for random forest:")
println(metrics_rf.accuracy)
println("Test Error for random forest = " + (1.0 - metrics_rf.accuracy))

var TP = confusion(1,1)
var TN = confusion(0,0)
var FP = confusion(0,1)
var FN = confusion(1,0)
val total = TP + TN + FP + FN
var mis = (FP+FN)/total
var FPrate = FP/(TN + FP) 
var FNrate = FN/(FN + TP)
var sensitivity = TP/(FN + TP)
var specificity = TN/(TN + FP)
println("False positive rate = " + FPrate)
println("False negative rate = " + FNrate)
println("specificity = " + specificity)
println("sensitivity = " + sensitivity)
printf("It took %d ms.\n",time_rf)

+---+--------+-------------+------------+------------+--------------+-----+-------------+-------------+------------------+--------------+----------------+--------+-------+------------------+--------------------+--------------------+--------------------+----------+
|age|  fnlwgt|education_num|capital_gain|capital_loss|hours_per_week|label|idx_workclass|idx_education|idx_marital_status|idx_occupation|idx_relationship|idx_race|idx_sex|idx_native_country|         rawFeatures|       rawPrediction|         probability|prediction|
+---+--------+-------------+------------+------------+--------------+-----+-------------+-------------+------------------+--------------+----------------+--------+-------+------------------+--------------------+--------------------+--------------------+----------+
| 25|226802.0|          7.0|         0.0|         0.0|          40.0|  0.0|          0.0|          5.0|               1.0|           6.0|             2.0|     1.0|    0.0|               0.0|[25.0,226802.0,

rf = rfc_dcfa881eddef
cv_rf = cv_b36151ce33ef
start_rf = 1513917396827
model_rf = cv_b36151ce33ef
prediction_rf = [age: int, fnlwgt: double ... 17 more fields]
end_rf = 1513917409581
predictionAndLabels_rf = MapPartitionsRDD[594] at rdd at <console>:112
metrics_rf = org.apache.spark.mllib.evaluation.MulticlassMetrics@1070ce42


confusion: org.apache.spark.mllib.linalg.Matrix...


org.apache.spark.mllib.evaluation.MulticlassMetrics@1070ce42

## 8.4 Naive Bayes


In [12]:
import org.apache.spark.ml.classification.NaiveBayes

// Train a NaiveBayes model.


val nb = new NaiveBayes()                      
          .setLabelCol("label")
          .setFeaturesCol("rawFeatures") 

val cv_nb = new CrossValidator()
  .setEstimator(nb)
  .setEvaluator(evaluator) 
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(nFolds)

var start_nb = System.currentTimeMillis()
val model_nb = cv_nb.fit(featured_trainData_df)

val predictions_nb = model_nb.transform(featured_testData_df)
var end_nb = System.currentTimeMillis()

val predictionAndLabels_nb = predictions_nb.select("prediction", "label").as[(Double,Double)].rdd
val metrics_nb = new MulticlassMetrics(predictionAndLabels_nb)
val time_nb = end_nb - start_nb

var confusion_nb = metrics_nb.confusionMatrix

var TP_nb = confusion_nb(1,1)
var TN_nb = confusion_nb(0,0)
var FP_nb = confusion_nb(0,1)
var FN_nb = confusion_nb(1,0)
//val total = TP_log + TN_log + FP_log + FN_log
var mis_nb = (FP_nb+FN_nb)/total
var FPrate_nb = FP_nb/(TN_nb + FP_nb) 
var FNrate_nb = FN_nb/(FN_nb + TP_nb)
var sensitivity_nb = TP_nb/(FN_nb + TP_nb)
var specificity_nb = TN_nb/(TN_nb + FP_nb)
println("False positive rate = " + FPrate_nb)
println("False negative rate = " + FNrate_nb)
println("specificity for naive bayes model = " + specificity_nb)
println("sensitivity for naive bayes model = " + sensitivity_nb)


println("Confusion matrix for Naive Bayes:")
println(metrics_nb.confusionMatrix)
println("Accuracy for Naive Bayes:")
println(metrics_nb.accuracy)
println("Test Error for Naive Bayes = " + (1.0 - metrics_nb.accuracy))
printf("It took %d ms.\n",time_nb)

featured_trainData_df.show(2)

False positive rate = 0.045158450704225354
False negative rate = 0.7683783783783784
specificity for naive bayes model = 0.9548415492957747
sensitivity for naive bayes model = 0.23162162162162162
Confusion matrix for Naive Bayes:
10847.0  513.0  
2843.0   857.0  
Accuracy for Naive Bayes:
0.7771580345285525
Test Error for Naive Bayes = 0.22284196547144752
It took 5616 ms.
+---+-------+-------------+------------+------------+--------------+-------------+-------------+------------------+--------------+----------------+--------+-------+------------------+-----+--------------------+
|age| fnlwgt|education_num|capital_gain|capital_loss|hours_per_week|idx_workclass|idx_education|idx_marital_status|idx_occupation|idx_relationship|idx_race|idx_sex|idx_native_country|label|         rawFeatures|
+---+-------+-------------+------------+------------+--------------+-------------+-------------+------------------+--------------+----------------+--------+-------+------------------+-----+---------------

nb = nb_e95456f172ba
cv_nb = cv_06afee0e1292
start_nb = 1513917446655
model_nb = cv_06afee0e1292
predictions_nb = [age: int, fnlwgt: double ... 17 more fields]
end_nb = 1513917452271
predictionAndLabels_nb = MapPartitionsRDD[719] at rdd at <console>:122
metrics_nb = org.apache.spark.mllib.evaluation.MulticlassMetrics@7bebd76a
time_nb = 5616
confusion_nb = 
TP_nb = 857.0


TN_nb: Doub...


857.0

## 8.5 Neural Network 


In [13]:
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
// try three different layers  
val layers1 = Array[Int](14, 7, 4, 3)
val layers2 = Array[Int](14, 28, 10, 3)
val layers3 = Array[Int](14, 40, 28, 14, 7, 3)
val start_nn = System.currentTimeMillis()
// create the trainer and set its parameters
val trainer_nn_1 = new MultilayerPerceptronClassifier().
  setLabelCol("label").
  setFeaturesCol("rawFeatures").
  setLayers(layers1).
  setBlockSize(128).
  setSeed(1234L).
  setMaxIter(100)
// train the model and predict
val cv_nn_1 = new CrossValidator()
  .setEstimator(trainer_nn_1)
  .setEvaluator(evaluator) 
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(nFolds)

val model_nn_1 = cv_nn_1.fit(featured_trainData_df)
val prediction_nn_1= model_nn_1.transform(featured_testData_df)
val end_nn = System.currentTimeMillis()
val time_nn = end_nn - start_nn

/*other two nn with diff parameters 
val trainer_nn_2=new MultilayerPerceptronClassifier().
  setLabelCol("label").
  setFeaturesCol("rawFeatures").
  setLayers(layers2).
  setBlockSize(128).
  setSeed(1234L).
  setMaxIter(500)
val model_nn_2 = trainer_nn_2.fit(featured_trainData_df)
val prediction_nn_2= model_nn_2.transform(featured_testData_df)


val trainer_nn_3=new MultilayerPerceptronClassifier().
  setLabelCol("label").
  setFeaturesCol("rawFeatures").
  setLayers(layers3).
  setBlockSize(128).
  setSeed(1234L).
  setMaxIter(100)
val model_nn_3 = trainer_nn_3.fit(featured_trainData_df)
val prediction_nn_3= model_nn_3.transform(featured_testData_df)
*/

layers1 = Array(14, 7, 4, 3)
layers2 = Array(14, 28, 10, 3)
layers3 = Array(14, 40, 28, 14, 7, 3)
start_nn = 1513917512253
trainer_nn_1 = mlpc_638150417509
cv_nn_1 = cv_c6211d15f71b
model_nn_1 = cv_c6211d15f71b
prediction_nn_1 = [age: int, fnlwgt: double ... 15 more fields]
end_nn = 1513917598132
time_nn = 85879


85879

In [14]:
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.mllib.evaluation.MulticlassMetrics
val spark = SparkSession.builder().appName("Spark-ML").master("local").getOrCreate()
import spark.implicits._

// compute accuracy on the test set
var predictionAndLabels_nn1 = prediction_nn_1.select("prediction", "label").as[(Double,Double)].rdd
val metrics_nn_1 = new MulticlassMetrics(predictionAndLabels_nn1)

var confusion_nn = metrics_nn_1.confusionMatrix

var TP_nn = confusion_nn(1,1)
var TN_nn = confusion_nn(0,0)
var FP_nn = confusion_nn(0,1)
var FN_nn = confusion_nn(1,0)
//val total = TP_log + TN_log + FP_log + FN_log
var mis_nn = (FP_nb+FN_nb)/total
var FPrate_nn = FP_nn/(TN_nn + FP_nn) 
var FNrate_nn = FN_nn/(FN_nn + TP_nn)
var sensitivity_nn = TP_nn/(FN_nn + TP_nn)
var specificity_nn = TN_nn/(TN_nn + FP_nn)
println("False positive rate = " + FPrate_nn)
println("False negative rate = " + FNrate_nn)
println("specificity for neural network model = " + specificity_nn)
println("sensitivity for neural network model = " + sensitivity_nn)

println("Test for nn_1:")
println("Confusion matrix")
println(metrics_nn_1.confusionMatrix)
println("Accuracy="+metrics_nn_1.accuracy)
println("Test Error = " + (1.0 - metrics_nn_1.accuracy))
printf("It took %d ms.\n",time_nn)

/* other model with diff parameters
var predictionAndLabels_nn2 = prediction_nn_2.select("prediction", "label").as[(Double,Double)].rdd
val metrics_nn_2 = new MulticlassMetrics(predictionAndLabels_nn2)
println("Test for nn_2:")
println("Confusion matrix")
println(metrics_nn_2.confusionMatrix)
println("Accuracy="+metrics_nn_2.accuracy)
println("Test Error = " + (1.0 - metrics_nn_2.accuracy))
var predictionAndLabels_nn3 = prediction_nn_3.select("prediction", "label").as[(Double,Double)].rdd
val metrics_nn_3 = new MulticlassMetrics(predictionAndLabels_nn3)
println("Test for nn_3:")
println("Confusion matrix")
println(metrics_nn_3.confusionMatrix)
println("Accuracy="+metrics_nn_3.accuracy)
println("Test Error = " + (1.0 - metrics_nn_3.accuracy))
*/

False positive rate = 0.007394366197183098
False negative rate = 0.8248648648648649
specificity for neural network model = 0.9926056338028169
sensitivity for neural network model = 0.17513513513513512
Test for nn_1:
Confusion matrix
11276.0  84.0   
3052.0   648.0  
Accuracy=0.7917662682602922
Test Error = 0.20823373173970783
It took 85879 ms.


spark = org.apache.spark.sql.SparkSession@6481ce0d
predictionAndLabels_nn1 = MapPartitionsRDD[1420] at rdd at <console>:129
metrics_nn_1 = org.apache.spark.mllib.evaluation.MulticlassMetrics@5f7f23ce
confusion_nn = 
TP_nn = 648.0
TN_nn = 11276.0
FP_nn = 84.0
FN_nn = 3052.0
mis_nn = 0.22284196547144755
FPrate_nn = 0.007394366197183098
FNrate_nn = 0.8248648648648649


sensitivity_nn: Doub...


0.8248648648648649

## 8.6 Gradient-Boosted Trees Classifier

In [15]:
import org.apache.spark.ml.classification.GBTClassifier
val trainer_gbt_1 = new GBTClassifier().
  setLabelCol("label").
  setFeaturesCol("rawFeatures").
  setMaxIter(10).
  setMaxBins(50)

val cv_gbt_1 = new CrossValidator()
  .setEstimator(trainer_gbt_1)
  .setEvaluator(evaluator) 
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(nFolds)

val start_gbt = System.currentTimeMillis()
val model_gbt_1 = cv_gbt_1.fit(featured_trainData_df)
// compute accuracy on the test set
val prediction_gbt_1= model_gbt_1.transform(featured_testData_df)
val end_gbt = System.currentTimeMillis()

var predictionAndLabels = prediction_gbt_1.select("prediction", "label").as[(Double,Double)].rdd
val metrics_gbt_1 = new MulticlassMetrics(predictionAndLabels)
val time_gbt = end_gbt - start_gbt

var confusion_gbt = metrics_gbt_1.confusionMatrix

var TP_gbt = confusion_gbt(1,1)
var TN_gbt = confusion_gbt(0,0)
var FP_gbt = confusion_gbt(0,1)
var FN_gbt = confusion_gbt(1,0)
//val total = TP_log + TN_log + FP_log + FN_log
var mis_gbt = (FP_gbt+FN_gbt)/total
var FPrate_gbt = FP_gbt/(TN_gbt + FP_gbt) 
var FNrate_gbt = FN_gbt/(FN_gbt + TP_gbt)
var sensitivity_gbt = TP_gbt/(FN_gbt + TP_gbt)
var specificity_gbt = TN_gbt/(TN_gbt + FP_gbt)
println("False positive rate = " + FPrate_gbt)
println("False negative rate = " + FNrate_gbt)
println("specificity for GBT classifier model = " + specificity_gbt)
println("sensitivity for GBT classifier model = " + sensitivity_gbt)

println("Test for gbt_1:")
println("Confusion matrix")
println(metrics_gbt_1.confusionMatrix)
println("Accuracy="+metrics_gbt_1.accuracy)
println("Test Error = " + (1.0 - metrics_gbt_1.accuracy))
printf("It took %d ms.\n",time_gbt)

False positive rate = 0.06778169014084508
False negative rate = 0.3816216216216216
specificity for GBT classifier model = 0.9322183098591549
sensitivity for GBT classifier model = 0.6183783783783784
Test for gbt_1:
Confusion matrix
10590.0  770.0   
1412.0   2288.0  
Accuracy=0.8551128818061089
Test Error = 0.1448871181938911
It took 32249 ms.


trainer_gbt_1 = gbtc_b3728421fb21
cv_gbt_1 = cv_e3b2dc149575
start_gbt = 1513917665137
model_gbt_1 = cv_e3b2dc149575
prediction_gbt_1 = [age: int, fnlwgt: double ... 17 more fields]
end_gbt = 1513917697386
predictionAndLabels = MapPartitionsRDD[2592] at rdd at <console>:130
metrics_gbt_1 = org.apache.spark.mllib.evaluation.MulticlassMetrics@5eb2540d
time_gbt = 32249
confusion_gbt = 


1412.0   2288.0...


10590.0  770.0   
1412.0   2288.0  

# 9. Conclusion

| item |logistic regression  | decision tree|random forest |NaiveBayes|Neural Network| GBT|
|-------|----------------|-----------------|-----------------|-----------------|-------|---------|
|Accuracy| 0.83061   |  0.83964       | 0.84734      |  0.77716| 0.79177   |  0.85511  |
|Misclassification Rate| 0.16939   | 0.16036   |   0.15266|  0.22284  |  0.20823 |  0.14489 |
|False Positive Rate |     0.05458       |   0.04665        |   0.05035| 0.04516  | 0.00739  |  0.06778 |
|False Negative Rate| 0.52190  | 0.50946 |0.46676 |   0.76838  |  0.82486  |  0.38162   |
|sensitivity (true pos)|  0.47811   |  0.49054   | 0.53324 | 0.23162   |    0.17514 | 0.61838   |
|specificity (true neg) |  0.94542  |  0.95335  |   0.94965  |  0.95484   | 0.99261  |  0.93222    |
|time| 1840ms | 2152ms  |6539ms | 1216ms|   83070ms |    17496ms |
|time with cross validation| --| 14925ms | 12754ms  | 5616ms  | 85879ms  |  32249ms |



It is obvious that:
> 1. Higher Accuracy, lower Misclassification rate. <br/>
2. Naive Bayes took the least time, Neural Network took the most time.<br/>
3. Naive Bayes procudes lowest overall accuracy, Gradient-Boosted Trees Classifier produces highest overall accuracy.<br/>
4. Neural Network has the lowest False Positive rate and highest specificity which means Neural Network is best at predicting negative (if a person will get income <= 50k), Gradient-Boosted Trees Classifier has highest False Positive rate and lowest specificity which means is worst at prediction negative.<br/>
5. Gradient-Boosted Trees Classifier has lowest False Negative rate and highest sensitivity which means GBT is the best model at predition positive (if a person will get income > 50k), Neural Network has the highest False Negative rate and lowest sensitivity which means Neural Network is the worst at prediction positive.<br/>

As we can see from the above table, Gradient-Boosted Trees Classifier produces highest overall Accuracy, and compared to other models, it took less time, thus, it is the best model for our dataset. <br/>

**Gradient boosting and Random Forest** are the most accurate model among the six, both of them are improved version of Decision tree，which coincide with dataVisualization.ipynb result. <br/>
**Gradient boosting** have four improvemnts, Tree Constraints, Shrinkage, Random sampling and Penalized Learning, which make it more accurate.<br/>
**Random Forest** operates by constructing a multitude of decision trees, based on Bootstrap Aggregation(bagging) on training samples. Thus, it is able to classify large amounts of data with accuracy.<br/>

In data description reports, Naive Bayes model achieve 83.88% accuracy, however, our NB model in both spark and scikit learn only achieve 77%-79% accuracy. For prediction result of other algorithm, we got the error rate very close to what they got in data description webpage.