<table style="border: none" align="left">
   <tr style="border: none">
      <th style="border: none"><font face="verdana" size="5" color="black"><b>Part 2 (Optional): Use Spark and Python to Predict Equipment Purchase: Advanced analysis and Hyperparameter Tuning</b></font></th>
      <th style="border: none"><img src="https://github.com/pmservice/customer-satisfaction-prediction/blob/master/app/static/images/ml_icon_gray.png?raw=true" alt="Watson Machine Learning icon" height="40" width="40"></th>
   </tr>
   <tr style="border: none">
       <th style="border: none"><img src="https://github.com/pmservice/wml-sample-models/blob/master/spark/product-line-prediction/images/products_graphics.png?raw=true" alt="Icon"> </th>
   </tr>
</table>

## Learning goals

You will learn how to:

-  Load a CSV file into an Apache® Spark DataFrame.
-  Analyze attributes of a dataset.
-  Create an Apache® Spark machine learning pipeline.
-  Train and evaluate a model.
-  Tune hyperparameters of a model.
-  Compare different models.


## Contents

This notebook contains the following parts:

1.  [Attribute Analysis](#analyze)
2.  [Hyperparameter tuning](#tuning)
3.  [Model comparison](#compare)
4.  [Summary and next steps](#summary)

<a id="analyze"></a>
## 1. Attribute Analysis

## 1.1 Load the data

In this section, you will load the data as an Apache® Spark DataFrame and explore the data.

    1. Go to the 'Find and add data' panel on the right-hand side
    2. Find the shaped data asset 'Outdoor_Equipment_Sales.csv_shaped.csv'
    3. Click on the 'Insert to code' dropdown
    4. Select 'Insert SparkSession DataFrame'



In [1]:
## Insert code here

Waiting for a Spark session to start...
Spark Initialization Done! ApplicationId = app-20190607073125-0002
KERNEL_ID = 1368dee6-e8d1-4036-9459-4e4df47f9353


In the next part, we will assign the data to another DataFrame, `df`, and look at the data types and values.

In [3]:
df = df_data_1

The attribute `AGE` is stored as a string, but it holds numerical data, so we need to convert it to a numerical type.

In [4]:
from pyspark.sql.types import IntegerType
df = df.withColumn("AGE", df["AGE"].cast(IntegerType()))

In [6]:
split_data = df.randomSplit([0.8, 0.2], 24)
train_data = split_data[0]
test_data = split_data[1]

print('Number of training records: ' + str(train_data.count()))
print('Number of testing records : ' + str(test_data.count()))

Number of training records: 48369
Number of testing records : 12126
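
`randomSplit` treats the weights as sampling probabilities rather than exact quotas, so the resulting sizes only approximate an 80/20 split. As a quick sanity check, the following plain-Python sketch (no Spark needed) recomputes the fractions from the counts printed above:

```python
# Counts copied from the output above.
n_train = 48369
n_test = 12126
n_total = n_train + n_test

train_fraction = n_train / n_total
test_fraction = n_test / n_total

print('Total records    : {}'.format(n_total))
print('Training fraction: {:.4f}'.format(train_fraction))
print('Testing fraction : {:.4f}'.format(test_fraction))
```

Because the split is seeded (`24` in the cell above), rerunning it on the same DataFrame and the same cluster configuration should give the same partition, though this is not guaranteed if the underlying data is repartitioned.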


## 1.2 Explore the attributes

Since three of the predictors (`GENDER`, `MARITAL_STATUS`, and `PROFESSION`) and the target `PRODUCT_LINE` are categorical, you can perform chi-squared tests on them. A chi-squared test of independence can be performed when both the predictor and the target (label) are categorical. Its goal is to assess whether two categorical variables are related.

In statistical hypothesis testing, the p-value or probability value or significance is, for a given statistical model, the probability that, when the null hypothesis is true, the statistical summary (such as the absolute value of the sample mean difference between two compared groups) would be greater than or equal to the actual observed results.

In this method, as part of experimental design, before performing the experiment, one first chooses a model (the null hypothesis) and a threshold value for p, called the significance level of the test, traditionally 5% or 1% and denoted as α. If the p-value is less than the chosen significance level (α), that suggests that the observed data is sufficiently inconsistent with the null hypothesis and that the null hypothesis may be rejected. However, that does not prove that the tested hypothesis is true. When the p-value is calculated correctly, this test guarantees that the type I error rate is at most α. For typical analysis, using the standard α = 0.05 cutoff, the null hypothesis is rejected when p < .05 and not rejected when p > .05. The p-value does not, in itself, support reasoning about the probabilities of hypotheses but is only a tool for deciding whether to reject the null hypothesis <a  href="https://en.wikipedia.org/wiki/P-value" target="_blank" rel="noopener no referrer">[2]</a>.

You will use the `scipy.stats` module for the chi-squared test.


In [7]:
from scipy import stats
import pandas as pd

In [8]:
df_pd = df.toPandas()

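The `chisquare` method returns the chi-squared test statistic and the p-value. When it is given only the observed counts of a single attribute, it tests whether all categories of that attribute are equally frequent.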

In [9]:
stats.chisquare(df_pd['GENDER'].value_counts())

Power_divergenceResult(statistic=99.30409124721051, pvalue=2.1656021491785304e-23)
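
To make the statistic above less of a black box, here is a minimal plain-Python sketch that recomputes it for `GENDER` by hand. It assumes equal expected frequencies (the default when `stats.chisquare` receives only observed counts); the two counts are the column totals of the `GENDER` cross-tabulation that appears later in this section.

```python
# Observed GENDER counts (column totals of the GENDER cross-tabulation in this notebook).
observed = {'F': 29022, 'M': 31473}

n = sum(observed.values())
expected = n / len(observed)  # equal expected frequency for every category

# Pearson's chi-squared statistic: sum of (observed - expected)^2 / expected
statistic = sum((count - expected) ** 2 / expected for count in observed.values())

print('chi-squared statistic = {:.4f}'.format(statistic))
```

The result should agree with the `statistic` field reported by `stats.chisquare` above.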

In [10]:
stats.chisquare(df_pd['MARITAL_STATUS'].value_counts())

Power_divergenceResult(statistic=18189.22479543764, pvalue=0.0)

In [11]:
stats.chisquare(df_pd['PROFESSION'].value_counts())

Power_divergenceResult(statistic=60196.786908008915, pvalue=0.0)

In [12]:
stats.chisquare(df_pd['PRODUCT_LINE'].value_counts())

Power_divergenceResult(statistic=24684.79775188032, pvalue=0.0)

Let's create a cross-tabulation matrix for each predictor and get the chi-squared test results.

In [13]:
target_classes = ['Camping Equipment', 'Gold Equipment', 'Mountaineering Equipment', 'Outdoor Protection', 'Personal Accessories']

Cross-tabulation matrix for predictor `GENDER` and target `PRODUCT_LINE`.

In [14]:
cont_gender = pd.crosstab(df_pd['PRODUCT_LINE'], df_pd['GENDER'])

In [15]:
cont_gender_df = cont_gender
cont_gender_df.index = target_classes
cont_gender_df.index.name = 'PRODUCT_LINE'

In [16]:
cont_gender_df

| PRODUCT_LINE | F | M |
|---|---|---|
| Camping Equipment | 9429 | 14712 |
| Gold Equipment | 2256 | 4228 |
| Mountaineering Equipment | 3393 | 6665 |
| Outdoor Protection | 1924 | 622 |
| Personal Accessories | 12020 | 5246 |


The first value in the output of the `chi2_contingency` method is the chi-squared test statistic, the second value is the p-value, the third value is the degrees of freedom, and the last value is the table of expected frequencies.

In [17]:
stats.chi2_contingency(cont_gender)

(6054.423628417609, 0.0, 4, array([[11581.45469874, 12559.54530126],
        [ 3110.64795438,  3373.35204562],
        [ 4825.24631788,  5232.75368212],
        [ 1221.42345648,  1324.57654352],
        [ 8283.22757253,  8982.77242747]]))
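
The pieces of that output can be reproduced by hand. The following is a minimal plain-Python sketch, not SciPy's implementation: it derives the expected counts under independence from the row and column totals of the `GENDER` table, then sums Pearson's statistic over all cells. No continuity correction is applied, which matches SciPy's behavior for tables with more than one degree of freedom.

```python
# Observed counts from the GENDER cross-tabulation (rows: product lines, columns: F, M).
observed = [
    [9429, 14712],
    [2256, 4228],
    [3393, 6665],
    [1924, 622],
    [12020, 5246],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / grand total
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson's chi-squared statistic summed over every cell of the table
statistic = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
dof = (len(row_totals) - 1) * (len(col_totals) - 1)

print('statistic = {:.4f}, dof = {}'.format(statistic, dof))
print('expected first row = {}'.format([round(e, 2) for e in expected[0]]))
```

A large gap between observed and expected counts in many cells is what drives the statistic up and the p-value down.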

Using `stats.chi2_contingency`, you can test whether two categorical variables, here a predictor $A$ and the target $B$, are independent.

$H_{0}$ (null hypothesis): $A$ and $B$ are independent.  
$H_{1}$ (alternative hypothesis): $A$ and $B$ are dependent.

If $p$ < $0.05$, reject $H_{0}$ and treat $A$ and $B$ as dependent; otherwise, there is not enough evidence to reject $H_{0}$.

Since the reported $p$-value is $0.0$ (too small to be represented as a floating-point number), $H_{0}$ (null hypothesis) is rejected: `GENDER` and `PRODUCT_LINE` are dependent.

Cross-tabulation matrix for predictor `MARITAL_STATUS` and target `PRODUCT_LINE`.

In [18]:
cont_marital = pd.crosstab(df_pd['PRODUCT_LINE'], df_pd['MARITAL_STATUS'])
cont_marital_df = cont_marital
cont_marital_df.index = target_classes
cont_marital_df.index.name = 'PRODUCT_LINE'

In [19]:
cont_marital_df

| PRODUCT_LINE | Married | Single | Unspecified |
|---|---|---|---|
| Camping Equipment | 14332 | 8277 | 1532 |
| Gold Equipment | 4851 | 427 | 1206 |
| Mountaineering Equipment | 2769 | 6619 | 670 |
| Outdoor Protection | 1682 | 578 | 286 |
| Personal Accessories | 7258 | 8754 | 1254 |


The first value in the output of the `chi2_contingency` method is the chi-squared test statistic, the second value is the p-value, the third value is the degrees of freedom, and the last value is the table of expected frequencies.

In [20]:
stats.chi2_contingency(cont_marital)

(7833.008344807513,
 0.0,
 8,
 array([[12327.69273494,  9838.76940243,  1974.53786263],
        [ 3311.07906439,  2642.58236218,   530.33857344],
        [ 5136.15564923,  4099.18158525,   822.66276552],
        [ 1300.12450616,  1037.63335813,   208.24213571],
        [ 8816.94804529,  7036.83329201,  1412.2186627 ]]))

Using `stats.chi2_contingency`, you can test whether two categorical variables, here a predictor $A$ and the target $B$, are independent.

$H_{0}$ (null hypothesis): $A$ and $B$ are independent.  
$H_{1}$ (alternative hypothesis): $A$ and $B$ are dependent.

If $p$ < $0.05$, reject $H_{0}$ and treat $A$ and $B$ as dependent; otherwise, there is not enough evidence to reject $H_{0}$.

Since the reported $p$-value is $0.0$ (too small to be represented as a floating-point number), $H_{0}$ (null hypothesis) is rejected: `MARITAL_STATUS` and `PRODUCT_LINE` are dependent.

Cross-tabulation matrix for predictor `PROFESSION` and target `PRODUCT_LINE`.

In [21]:
cont_profession = pd.crosstab(df_pd['PRODUCT_LINE'], df_pd['PROFESSION'])
cont_profession_df = cont_profession
cont_profession_df.index = target_classes
cont_profession_df.index.name = 'PRODUCT_LINE'

In [22]:
cont_profession_df

| PRODUCT_LINE | Executive | Hospitality | Other | Professional | Retail | Retired | Sales | Student | Trades |
|---|---|---|---|---|---|---|---|---|---|
| Camping Equipment | 3775 | 1977 | 9684 | 1866 | 619 | 30 | 3452 | 505 | 2233 |
| Gold Equipment | 175 | 459 | 3477 | 1025 | 21 | 443 | 357 | 54 | 473 |
| Mountaineering Equipment | 499 | 83 | 4044 | 2142 | 449 | 14 | 1273 | 815 | 739 |
| Outdoor Protection | 292 | 303 | 1140 | 136 | 179 | 124 | 79 | 220 | 73 |
| Personal Accessories | 1152 | 502 | 6261 | 3801 | 1522 | 578 | 1576 | 1365 | 509 |


The first value in the output of the `chi2_contingency` method is the chi-squared test statistic, the second value is the p-value, the third value is the degrees of freedom, and the last value is the table of expected frequencies.

In [23]:
stats.chi2_contingency(cont_profession)

(10305.783485396083,
 0.0,
 32,
 array([[2351.64745847, 1326.46803868, 9819.21557153, 3579.54822713,
         1113.3711877 ,  474.47969254, 2688.45221919, 1180.8119514 ,
         1607.00565336],
        [ 631.62595256,  356.27433672, 2637.330424  ,  961.42623357,
          299.03892884,  127.43988759,  722.08790809,  317.15275643,
          431.6235722 ],
        [ 979.78004794,  552.6538061 , 4091.0347632 , 1491.36722043,
          463.87007191,  197.685131  , 1120.10490123,  491.9682949 ,
          669.53576329],
        [ 248.01352178,  139.89427225, 1035.57113811,  377.5125217 ,
          117.42028267,   50.04040003,  283.53420944,  124.53283742,
          169.4808166 ],
        [1681.93301926,  948.70954624, 7022.84810315, 2560.14579717,
          796.29952889,  339.35488883, 1922.82076205,  844.53415985,
         1149.35419456]]))

Using `stats.chi2_contingency`, you can test whether two categorical variables, here a predictor $A$ and the target $B$, are independent.

$H_{0}$ (null hypothesis): $A$ and $B$ are independent.  
$H_{1}$ (alternative hypothesis): $A$ and $B$ are dependent.

If $p$ < $0.05$, reject $H_{0}$ and treat $A$ and $B$ as dependent; otherwise, there is not enough evidence to reject $H_{0}$.

Since the reported $p$-value is $0.0$ (too small to be represented as a floating-point number), $H_{0}$ (null hypothesis) is rejected: `PROFESSION` and `PRODUCT_LINE` are dependent.
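
The degrees of freedom reported by the three tests (4, 8, and 32) follow directly from the shape of each contingency table. A small sketch, where `degrees_of_freedom` is a hypothetical helper written for this illustration and not part of SciPy:

```python
def degrees_of_freedom(n_rows, n_cols):
    """Degrees of freedom of a chi-squared test on an n_rows x n_cols contingency table."""
    return (n_rows - 1) * (n_cols - 1)

# Each table has 5 product lines (rows); the columns are the categories of the predictor.
table_shapes = {
    'GENDER': (5, 2),
    'MARITAL_STATUS': (5, 3),
    'PROFESSION': (5, 9),
}

dof_by_predictor = {name: degrees_of_freedom(r, c) for name, (r, c) in table_shapes.items()}
for name, dof in dof_by_predictor.items():
    print('{}: dof = {}'.format(name, dof))
```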

### Therefore, we may conclude that the categorical attributes we explored are not independent of `PRODUCT_LINE`, so it is reasonable to consider them as predictors of `PRODUCT_LINE`.

<a id="tuning"></a>
## 2. Hyperparameter Tuning

In this step, we will tune the hyperparameters of a model to find the combination that gives the best evaluation score.

### 2.1 Create the pipeline<a id="pipe"></a>

In this subsection, you will create an Apache® Spark machine learning pipeline and train the model.

In the first step, you need to import the Apache® Spark machine learning modules that will be needed in the subsequent steps.

In [24]:
from pyspark.ml.feature import OneHotEncoder, StringIndexer, IndexToString, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml import Pipeline, Model
from pyspark.ml.tuning import ParamGridBuilder
import numpy as np
from pyspark.ml.tuning import CrossValidator

In the following step, use the `StringIndexer` transformer to convert all string fields into numerical indices.

In [25]:
stringIndexer_label = StringIndexer(inputCol='PRODUCT_LINE', outputCol='label').fit(df)
stringIndexer_prof = StringIndexer(inputCol='PROFESSION', outputCol='PROFESSION_IX')
stringIndexer_gend = StringIndexer(inputCol='GENDER', outputCol='GENDER_IX')
stringIndexer_mar = StringIndexer(inputCol='MARITAL_STATUS', outputCol='MARITAL_STATUS_IX')
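
By default, `StringIndexer` orders labels by descending frequency, so the most frequent value receives index `0.0`. The plain-Python sketch below mimics that default ordering for `PRODUCT_LINE`, using the row totals of the cross-tabulations above. `frequency_desc_index` is a hypothetical helper written for this illustration; the mapping in your own run depends on the data the indexer is fitted on.

```python
def frequency_desc_index(counts):
    """Map each label to an index, most frequent label first (hypothetical helper)."""
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# PRODUCT_LINE totals, summed from the rows of the cross-tabulations above.
product_line_counts = {
    'Camping Equipment': 24141,
    'Gold Equipment': 6484,
    'Mountaineering Equipment': 10058,
    'Outdoor Protection': 2546,
    'Personal Accessories': 17266,
}

label_index = frequency_desc_index(product_line_counts)
for label, index in sorted(label_index.items(), key=lambda item: item[1]):
    print('{:<26} -> {}'.format(label, index))
```

This ordering is why the fitted label indexer's `labels` list is later passed to `IndexToString`: it is the only record of which index stands for which product line.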

In the following step, create a feature vector to combine all features (predictors) together.

In [26]:
vectorAssembler_features = VectorAssembler(inputCols=['GENDER_IX', 'AGE', 'MARITAL_STATUS_IX', 'PROFESSION_IX'], outputCol='features')

We will use a `RandomForestClassifier`.

In [27]:
rf = RandomForestClassifier(labelCol="label", featuresCol="features")

Finally, convert the indexed labels back to original labels.

In [40]:
labelConverter = IndexToString(inputCol='prediction', outputCol='predictedLabel', labels=stringIndexer_label.labels)

Next, we will build the pipeline. A pipeline consists of transformers and an estimator.

In [41]:
pipeline = Pipeline(stages=[stringIndexer_label, stringIndexer_prof, stringIndexer_gend, stringIndexer_mar,vectorAssembler_features, rf, labelConverter])

###  2.2 Create a parameter grid

In this subsection, we will create a grid of parameters and train the model multiple times to check which parameters give us the best evaluation score.



#### We will use a `ParamGridBuilder` to construct a grid of parameters to search over. The cross-validator will try all combinations of values of the two parameters `numTrees` and `maxDepth` and determine the best model using the evaluator.

In [29]:
paramGrid = ParamGridBuilder() \
    .addGrid(rf.numTrees, [int(x) for x in np.linspace(start=10, stop=50, num=3)]) \
    .addGrid(rf.maxDepth, [int(x) for x in np.linspace(start=5, stop=30, num=3)]) \
    .build()
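
To see which combinations the grid actually contains, the values can be expanded by hand. This is a plain-Python sketch that stands in for `np.linspace` and `ParamGridBuilder`; `linspace_ints` is a hypothetical helper, and the dictionaries only illustrate the combinations, they are not Spark `ParamMap` objects.

```python
from itertools import product

def linspace_ints(start, stop, num):
    """Evenly spaced values from start to stop (inclusive), truncated to int."""
    step = (stop - start) / (num - 1)
    return [int(start + i * step) for i in range(num)]

num_trees_values = linspace_ints(10, 50, 3)
max_depth_values = linspace_ints(5, 30, 3)

# Every pairing of a numTrees value with a maxDepth value
grid = [{'numTrees': t, 'maxDepth': d} for t, d in product(num_trees_values, max_depth_values)]

num_folds = 3  # same value as numFolds in the CrossValidator cell
fits_during_search = len(grid) * num_folds

print('numTrees values:', num_trees_values)
print('maxDepth values:', max_depth_values)
print('combinations   :', len(grid))
print('model fits during cross-validation:', fits_during_search)
```

Note that the middle `maxDepth` value is truncated from 17.5 to 17 by `int`, which is why a depth of 17 can appear among the candidates. The cost of the search grows with the product of the grid sizes and the number of folds, so it is worth keeping the grid small on large data.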

In [30]:
# The evaluator is created without a metricName, so model selection uses its default metric (F1).
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          numFolds=3)

In the next step, we will run cross-validation on the training data.

In [31]:
cvModel = crossval.fit(train_data)
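
Conceptually, k-fold cross-validation partitions the training rows into `numFolds` groups and uses each group once for validation while training on the rest. The sketch below illustrates that idea on row indices in plain Python. It is an illustration under simple assumptions (shuffle, then deal rows into folds), not Spark's actual fold assignment, and `k_fold_splits` is a hypothetical helper.

```python
import random

def k_fold_splits(n_rows, num_folds, seed=0):
    """Yield (train_indices, validation_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into num_folds groups of near-equal size
    folds = [indices[i::num_folds] for i in range(num_folds)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

splits = list(k_fold_splits(n_rows=10, num_folds=3))
for train, validation in splits:
    print('train={} validation={}'.format(sorted(train), sorted(validation)))
```

Each parameter combination is scored on every validation fold, the scores are averaged, and the combination with the best average is refitted on the full training data to produce `bestModel`.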

In [32]:
predictions = cvModel.transform(test_data)

### So the best selected model achieved an accuracy of:

In [33]:
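# Note: no metricName is set here, so the evaluator uses its default metric (F1),
# even though the value is printed as "Accuracy" below. Pass
# metricName='accuracy' to match the evaluators used in section 3.
evaluator = MulticlassClassificationEvaluator(predictionCol="prediction")
accuracy = evaluator.evaluate(predictions)
print('Accuracy = {:.2f}%'.format(accuracy*100))
print('Test Error = {:.2f}%'.format((1.0 - accuracy)*100))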

Accuracy = 58.12%
Test Error = 41.88%
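
When reading this score, keep in mind that `MulticlassClassificationEvaluator` uses F1 (weighted by class frequency) as its default metric when `metricName` is not set, and that F1 and accuracy generally differ. The toy sketch below computes both by hand on made-up labels to show the gap; the data and the helper functions are purely illustrative and are not taken from this notebook's results.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that equal the true label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

def weighted_f1(labels, predictions):
    """Per-class F1 averaged with weights equal to each class's share of the true labels."""
    total = 0.0
    for cls in set(labels):
        tp = sum(1 for y, p in zip(labels, predictions) if y == cls and p == cls)
        fp = sum(1 for y, p in zip(labels, predictions) if y != cls and p == cls)
        fn = sum(1 for y, p in zip(labels, predictions) if y == cls and p != cls)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        total += f1 * (tp + fn)  # tp + fn is the number of true members of the class
    return total / len(labels)

# Made-up toy data: three classes, six rows.
toy_labels = [0, 0, 0, 1, 1, 2]
toy_predictions = [0, 0, 1, 1, 1, 0]

toy_accuracy = accuracy(toy_labels, toy_predictions)
toy_f1 = weighted_f1(toy_labels, toy_predictions)

print('accuracy    = {:.4f}'.format(toy_accuracy))
print('weighted F1 = {:.4f}'.format(toy_f1))
```

For a like-for-like comparison between models, compute every score with the same `metricName`.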


Now we can see which parameters gave us the best model:

In [34]:
bestPipeline = cvModel.bestModel

In [35]:
bestPipeline.stages

[StringIndexer_4d97bfab309ca7e974d5,
 StringIndexer_4bea95623432f96644b5,
 StringIndexer_40f8b033bff79943a767,
 StringIndexer_4fadb77431afead94fe6,
 VectorAssembler_4fa7813a47805614ef97,
 RandomForestClassificationModel (uid=RandomForestClassifier_4c2b9c9397b1f6ed7622) with 50 trees]

In [36]:
bestModel = bestPipeline.stages[5]

In [37]:
print('numTrees - ', bestModel.getNumTrees)
print('maxDepth - ', bestModel.getOrDefault('maxDepth'))

numTrees -  50
maxDepth -  17


### Therefore, we may conclude that, within the grid we searched, the best `RandomForestClassifier` model uses the values of `numTrees` and `maxDepth` obtained above.

<a id="compare"></a>
## 3. Model Comparison

In this section, we will train two more classification models on the same training data and compare their performance with that of the first trained model.

### 3.1 DecisionTreeClassifier

A decision tree is a flowchart-like tree structure in which each internal node represents a feature (or attribute), each branch represents a decision rule, and each leaf node represents an outcome. The topmost node in a decision tree is known as the root node. The tree learns to partition the data on the basis of attribute values, splitting it recursively in a process called recursive partitioning<a  href="https://www.datacamp.com/community/tutorials/decision-tree-classification-python" target="_blank" rel="noopener no referrer">[3]</a>. 
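
One common way to choose a decision rule is to pick the split that most reduces node impurity; Spark's `DecisionTreeClassifier` uses Gini impurity by default. The sketch below scores a single hypothetical split on `GENDER`, using the class counts from the cross-tabulation in section 1.2. It is a hand calculation for illustration only and does not reproduce how Spark searches over candidate splits.

```python
def gini(counts):
    """Gini impurity of a node, given the class counts that fall into it."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

# PRODUCT_LINE counts per GENDER, copied from the cross-tabulation in section 1.2.
female = [9429, 2256, 3393, 1924, 12020]
male = [14712, 4228, 6665, 622, 5246]
parent = [f + m for f, m in zip(female, male)]

n_female, n_male = sum(female), sum(male)
n_total = n_female + n_male

parent_impurity = gini(parent)
# Impurity after the split: child impurities weighted by the share of rows in each child
split_impurity = (n_female * gini(female) + n_male * gini(male)) / n_total
impurity_decrease = parent_impurity - split_impurity

print('parent impurity   = {:.4f}'.format(parent_impurity))
print('impurity if split = {:.4f}'.format(split_impurity))
print('decrease          = {:.4f}'.format(impurity_decrease))
```

A tree repeats this kind of comparison over candidate splits at every node and keeps the one with the largest decrease, until a stopping rule such as `maxDepth` is reached.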

In [42]:
from pyspark.ml.classification import DecisionTreeClassifier
dt = DecisionTreeClassifier(labelCol="label", featuresCol="features")
pipeline_dt = Pipeline(stages=[stringIndexer_label, stringIndexer_prof, stringIndexer_gend, stringIndexer_mar, vectorAssembler_features, dt, labelConverter])
model_dt = pipeline_dt.fit(train_data)

In [43]:
predictions = model_dt.transform(test_data)
evaluatorDT = MulticlassClassificationEvaluator(labelCol='label', predictionCol='prediction', metricName='accuracy')
accuracy = evaluatorDT.evaluate(predictions)

print('Accuracy = {:.2f}%'.format(accuracy*100))
print('Test Error = {:.2f}%'.format((1.0 - accuracy)*100))

Accuracy = 55.62%
Test Error = 44.38%


### 3.2 MultilayerPerceptronClassifier

The multilayer perceptron classifier (MLPC) is a classifier based on the feedforward artificial neural network in the current implementation of the Spark ML API. The MLPC employs backpropagation for learning the model. Technically, Spark uses the logistic loss function for optimization and L-BFGS as the optimization routine. The number of nodes $N$ in the output layer corresponds to the number of classes <a  href="https://dzone.com/articles/deep-learning-via-multilayer-perceptron-classifier" target="_blank" rel="noopener no referrer">[4]</a>. 

In [44]:
from pyspark.ml.classification import MultilayerPerceptronClassifier
# 4 input features, two hidden layers (5 and 4 nodes), and 5 output classes
layers = [4, 5, 4, 5]
# create the trainer and set its parameters
trainer = MultilayerPerceptronClassifier(maxIter=100, layers=layers, blockSize=128, seed=1234)
pipeline_mpc = Pipeline(stages=[stringIndexer_label, stringIndexer_prof, stringIndexer_gend, stringIndexer_mar, vectorAssembler_features, trainer, labelConverter])
model_mpc = pipeline_mpc.fit(train_data)
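
The `layers` list fixes the shape of the network: the first entry must match the length of the feature vector (4 here) and the last entry the number of classes (5 product lines). As a rough sense of model size, the sketch below counts the weights and biases of a fully connected network with those layer sizes. `count_parameters` is a hypothetical helper, and the count assumes one bias per non-input node.

```python
def count_parameters(layers):
    """Number of weights and biases in a fully connected network with the given layer sizes."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layers, layers[1:]))

layers = [4, 5, 4, 5]  # same layer sizes as in the cell above
n_inputs, n_outputs = layers[0], layers[-1]
n_parameters = count_parameters(layers)

print('inputs = {}, outputs = {}, trainable parameters = {}'.format(
    n_inputs, n_outputs, n_parameters))
```

With so few parameters and only `maxIter=100` iterations, this network is a small baseline; wider hidden layers or more iterations are natural things to try next.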

In [45]:
predictions = model_mpc.transform(test_data)
evaluatorMPC = MulticlassClassificationEvaluator(labelCol='label', predictionCol='prediction', metricName='accuracy')
accuracy = evaluatorMPC.evaluate(predictions)

print('Accuracy = {:.2f}%'.format(accuracy*100))
print('Test Error = {:.2f}%'.format((1.0 - accuracy)*100))

Accuracy = 47.52%
Test Error = 52.48%
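
To compare the three models side by side, the scores printed in this notebook can be collected and ranked. This is a minimal sketch with the values copied by hand from the outputs above. The ranking is only meaningful if all scores are computed with the same metric on the same test set, so treat it as indicative until every evaluator uses the same `metricName`.

```python
# Scores copied from the outputs printed in this notebook (in percent).
scores = {
    'RandomForestClassifier (tuned)': 58.12,
    'DecisionTreeClassifier': 55.62,
    'MultilayerPerceptronClassifier': 47.52,
}

# Sort from highest to lowest score
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
best_name, best_score = ranking[0]

for name, score in ranking:
    print('{:<32} {:.2f}%'.format(name, score))
print('Best model: {}'.format(best_name))
```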


### We have now trained and evaluated two more models. We are therefore in a position to select the best of the models we have and use it to predict customers' equipment purchases.

<a id="summary"></a>
## 4. Summary and next steps     

You successfully completed this notebook! 
 
You learned how to use Apache® Spark Machine Learning for model creation and evaluation. 

You also learned how to tune parameters of a model, and how to select the best model for your purpose after comparing multiple models.
 
Check out our [Online Documentation](https://dataplatform.ibm.com/docs/content/analyze-data/wml-setup.html) for more samples, tutorials, documentation, how-tos, and blog posts. 

### Authors

**Lukasz Cmielowski**, Ph.D., is an Automation Architect and Data Scientist at IBM with a track record of developing enterprise-level applications that substantially increase clients' ability to turn data into actionable knowledge.  
**Jihyoung Kim**, Ph.D., is a Data Scientist at IBM who strives to make data science easy for everyone through Watson Studio.

Copyright © 2017-2019 IBM. This notebook and its source code are released under the terms of the MIT License.