# Spark ML - Part Two

The Spark ML library was introduced because MLlib (the RDD-based library you used in Part One) wasn’t scalable or extensible enough, nor was it sufficiently practical for use in real machine learning projects. The goal of the new library, now Spark's official machine learning library, is to generalize machine learning operations and streamline machine learning processes. Influenced by Python’s scikit-learn library, it introduces several new abstractions (estimators, transformers, and evaluators) that can be combined to form pipelines. All four can be parameterized with ML parameters in a general way.

Spark ML uses DataFrame objects throughout to represent datasets. This is why the old MLlib algorithms can’t be simply upgraded: the Spark ML architecture requires structural changes, so new implementations of the same algorithms are necessary.

In this notebook you will use the Spark ML library to perform classification and clustering using logistic regression, decision trees, random forests, and k-means clustering algorithms.

Before anything else, let's initialize a Spark session:

In [1]:
import findspark
findspark.init()

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Spark course - ML Part 2").\
    master("local[*]").enableHiveSupport().getOrCreate()

sc = spark.sparkContext

## Logistic regression - preparing the data

The example dataset that you’ll use for logistic regression is the well-known adult dataset (http://archive.ics.uci.edu/ml/datasets/Adult), extracted from the 1994 United States census data. It contains 13 attributes with data about a person’s sex, age, education, marital status, race, native country, and so on, plus the target variable (income). The goal is to predict whether a person earns more or less than $50,000 per year. In the raw data the income column holds the strings `<=50K` and `>50K`; you will convert them to 0 and 1 later in this notebook.

To load the data execute the following snippet:

In [2]:
from pyspark.sql import DataFrame
from pyspark.sql import Row
from pyspark.sql.types import *

def tofloat(s):
    # Convert a string to float; return None when the value cannot be parsed
    try:
        return float(s)
    except (TypeError, ValueError):
        return None
    
def rawtorow(s):
    return Row(float(s[0]), s[1], float(s[2]), s[3], s[4], s[5], s[6], s[7], s[8], 
               float(s[9]), float(s[10]), float(s[11]), s[12], s[13])

dfraw = sc.textFile("../first-edition/ch08/adult.raw", 4).\
    map(lambda x: x.split(", ")).map(rawtorow).toDF(['age', 'workclass', 'fnlwgt', 'education', 
                                                    'marital_status', 'occupation', 'relationship',
                                                   'race', 'sex', 'capital_gain', 'capital_loss', 
                                                    'hours_per_week', 'native_country', 'income'])

Examine the first 200 rows of the data using the `show` method and specifying 200 as the argument.

In [3]:
dfraw.show(200)

+----+----------------+--------+------------+--------------------+-----------------+--------------+------------------+------+------------+------------+--------------+--------------+------+
| age|       workclass|  fnlwgt|   education|      marital_status|       occupation|  relationship|              race|   sex|capital_gain|capital_loss|hours_per_week|native_country|income|
+----+----------------+--------+------------+--------------------+-----------------+--------------+------------------+------+------------+------------+--------------+--------------+------+
|39.0|       State-gov| 77516.0|   Bachelors|       Never-married|     Adm-clerical| Not-in-family|             White|  Male|      2174.0|         0.0|          40.0| United-States| <=50K|
|50.0|Self-emp-not-inc| 83311.0|   Bachelors|  Married-civ-spouse|  Exec-managerial|       Husband|             White|  Male|         0.0|         0.0|          13.0| United-States| <=50K|
|38.0|         Private|215646.0|     HS-grad|          

As you can see, some cells contain `?` as their values. Those are missing values. They appear only in these columns: `workclass`, `occupation` and `native_country`. Display the most frequent values from the `workclass` column.

In [4]:
import pyspark.sql.functions as F
dfraw.groupBy(dfraw.workclass).count().orderBy(F.col('count').desc()).show()

+----------------+-----+
|       workclass|count|
+----------------+-----+
|         Private|33906|
|Self-emp-not-inc| 3862|
|       Local-gov| 3136|
|               ?| 2799|
|       State-gov| 1981|
|    Self-emp-inc| 1695|
|     Federal-gov| 1432|
|     Without-pay|   21|
|    Never-worked|   10|
+----------------+-----+



Do the same for the other two columns.

In [5]:
dfraw.groupBy(dfraw.occupation).count().orderBy(F.col('count').desc()).show()
dfraw.groupBy(dfraw.native_country).count().orderBy(F.col('count').desc()).show()

+-----------------+-----+
|       occupation|count|
+-----------------+-----+
|   Prof-specialty| 6172|
|     Craft-repair| 6112|
|  Exec-managerial| 6086|
|     Adm-clerical| 5611|
|            Sales| 5504|
|    Other-service| 4923|
|Machine-op-inspct| 3022|
|                ?| 2809|
| Transport-moving| 2355|
|Handlers-cleaners| 2072|
|  Farming-fishing| 1490|
|     Tech-support| 1446|
|  Protective-serv|  983|
|  Priv-house-serv|  242|
|     Armed-Forces|   15|
+-----------------+-----+

+------------------+-----+
|    native_country|count|
+------------------+-----+
|     United-States|43832|
|            Mexico|  951|
|                 ?|  857|
|       Philippines|  295|
|           Germany|  206|
|       Puerto-Rico|  184|
|            Canada|  182|
|       El-Salvador|  155|
|             India|  151|
|              Cuba|  138|
|           England|  127|
|             China|  122|
|             South|  115|
|           Jamaica|  106|
|             Italy|  105|
|Dominican-Republic

Now use the `DataFrame`'s built-in `na.replace` method to change all missing values into the most frequent value of the corresponding column (`Private` for `workclass`, `Prof-specialty` for `occupation` and `United-States` for `native_country`).

In [6]:
dfrawrp = dfraw.na.replace({"?": "Private"}, subset=["workclass"])
dfrawrpl = dfrawrp.na.replace({"?": "Prof-specialty"}, subset=["occupation"])
dfrawnona = dfrawrpl.na.replace({"?": "United-States"}, subset=["native_country"])

Examine the new DataFrame to see if the changes have been made.

In [7]:
dfrawnona.groupBy(dfrawnona.workclass).count().orderBy(F.col('count').desc()).show(5)
dfrawnona.groupBy(dfrawnona.occupation).count().orderBy(F.col('count').desc()).show(9)
dfrawnona.groupBy(dfrawnona.native_country).count().orderBy(F.col('count').desc()).show(4)

+----------------+-----+
|       workclass|count|
+----------------+-----+
|         Private|36705|
|Self-emp-not-inc| 3862|
|       Local-gov| 3136|
|       State-gov| 1981|
|    Self-emp-inc| 1695|
+----------------+-----+
only showing top 5 rows

+-----------------+-----+
|       occupation|count|
+-----------------+-----+
|   Prof-specialty| 8981|
|     Craft-repair| 6112|
|  Exec-managerial| 6086|
|     Adm-clerical| 5611|
|            Sales| 5504|
|    Other-service| 4923|
|Machine-op-inspct| 3022|
| Transport-moving| 2355|
|Handlers-cleaners| 2072|
+-----------------+-----+
only showing top 9 rows

+--------------+-----+
|native_country|count|
+--------------+-----+
| United-States|44689|
|        Mexico|  951|
|   Philippines|  295|
|       Germany|  206|
+--------------+-----+
only showing top 4 rows



You don't have any more missing values, but the categorical values cannot be used by machine learning algorithms as strings. They have to be converted to numerical values, but a naive solution of simply enumerating them wouldn't work because that would imply the existence of a ranking scheme, while no such scheme exists in the real world. What you can do instead is one-hot-encode those values into several columns.
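
To make the idea concrete, here is a minimal plain-Python sketch of string indexing followed by one-hot encoding. It is not Spark's implementation: the helper names `string_index` and `one_hot` and the toy column are made up for illustration, and breaking frequency ties alphabetically is an assumption.

```python
from collections import Counter

def string_index(values):
    """Map each distinct string to an index, most frequent value first.

    Ties are broken alphabetically (an assumption made for this sketch).
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {label: idx for idx, label in enumerate(ordered)}

def one_hot(index, size):
    """Return a list of zeros with a single 1.0 at the given position."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

# Toy column, not the real dataset
workclass = ["Private", "Private", "Private", "Local-gov", "Local-gov", "State-gov"]
mapping = string_index(workclass)
encoded = [one_hot(mapping[v], len(mapping)) for v in workclass]

print(mapping)      # {'Private': 0, 'Local-gov': 1, 'State-gov': 2}
print(encoded[-1])  # [0.0, 0.0, 1.0]
```

No category is numerically "larger" than another after encoding: each one simply switches on its own column.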

For that you need Spark's `StringIndexer`, `OneHotEncoder` and `VectorAssembler` classes.

Write a function which takes a DataFrame and a list of columns and replaces each column from the list with its string-indexed version.

In [8]:
from pyspark.ml.feature import StringIndexer

def indexStringColumns(df, cols):
    newdf = df
    for col in cols:
        si = StringIndexer(inputCol=col, outputCol=col+"-num")
        sm = si.fit(newdf)
        newdf = sm.transform(newdf).drop(col)
        newdf = newdf.withColumnRenamed(col+"-num", col)
    return newdf

Use it now on your DataFrame with no missing values and string-index the following columns: `workclass`, `education`, `marital_status`, `occupation`, `relationship`, `race`, `sex`, `native_country`, `income`. Inspect the output to see what the function has done.

In [9]:
dfnumeric = indexStringColumns(dfrawnona, ["workclass", "education", "marital_status", "occupation", "relationship", 
                                           "race", "sex", "native_country", "income"])
dfnumeric.show()

+----+--------+------------+------------+--------------+---------+---------+--------------+----------+------------+----+---+--------------+------+
| age|  fnlwgt|capital_gain|capital_loss|hours_per_week|workclass|education|marital_status|occupation|relationship|race|sex|native_country|income|
+----+--------+------------+------------+--------------+---------+---------+--------------+----------+------------+----+---+--------------+------+
|39.0| 77516.0|      2174.0|         0.0|          40.0|      3.0|      2.0|           1.0|       3.0|         1.0| 0.0|0.0|           0.0|   0.0|
|50.0| 83311.0|         0.0|         0.0|          13.0|      1.0|      2.0|           0.0|       2.0|         0.0| 0.0|0.0|           0.0|   0.0|
|38.0|215646.0|         0.0|         0.0|          40.0|      0.0|      0.0|           2.0|       8.0|         1.0| 0.0|0.0|           0.0|   0.0|
|53.0|234721.0|         0.0|         0.0|          40.0|      0.0|      5.0|           0.0|       8.0|         0.0| 1.

As you can see, the string values have been replaced with numerical values. The following function will one-hot-encode those values into their separate columns. 

In [10]:
def oneHotEncodeColumns(df, cols):
    from pyspark.ml.feature import OneHotEncoder
    newdf = df
    for c in cols:
        onehotenc = OneHotEncoder(inputCol=c, outputCol=c+"-onehot", dropLast=False)
        newdf = onehotenc.transform(newdf).drop(c)
        newdf = newdf.withColumnRenamed(c+"-onehot", c)
    return newdf

Use it now to one-hot-encode the string-indexed columns. Inspect the resulting DataFrame to see the results.

In [11]:
dfhot = oneHotEncodeColumns(dfnumeric, ["workclass", "education", "marital_status", "occupation", 
                                        "relationship", "race", "native_country"])
dfhot.show()

+----+--------+------------+------------+--------------+---+------+-------------+---------------+--------------+--------------+-------------+-------------+---------------+
| age|  fnlwgt|capital_gain|capital_loss|hours_per_week|sex|income|    workclass|      education|marital_status|    occupation| relationship|         race| native_country|
+----+--------+------------+------------+--------------+---+------+-------------+---------------+--------------+--------------+-------------+-------------+---------------+
|39.0| 77516.0|      2174.0|         0.0|          40.0|0.0|   0.0|(8,[3],[1.0])| (16,[2],[1.0])| (7,[1],[1.0])|(14,[3],[1.0])|(6,[1],[1.0])|(5,[0],[1.0])| (41,[0],[1.0])|
|50.0| 83311.0|         0.0|         0.0|          13.0|0.0|   0.0|(8,[1],[1.0])| (16,[2],[1.0])| (7,[0],[1.0])|(14,[2],[1.0])|(6,[0],[1.0])|(5,[0],[1.0])| (41,[0],[1.0])|
|38.0|215646.0|         0.0|         0.0|          40.0|0.0|   0.0|(8,[0],[1.0])| (16,[0],[1.0])| (7,[2],[1.0])|(14,[8],[1.0])|(6,[1],[1.0])

So, each one-hot-encoded column now contains sparse vectors. The next step is to use `VectorAssembler` to merge all the columns into a single column called `features`.
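
Values such as `(8,[3],[1.0])` in the output above use Spark's sparse-vector notation: the vector size, the indices of the non-zero entries, and the values at those indices. The following plain-Python sketch expands that notation into a dense list; `sparse_to_dense` is a hypothetical helper written for illustration, not a Spark function.

```python
def sparse_to_dense(size, indices, values):
    """Expand (size, indices, values) into a dense list of floats."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# (8,[3],[1.0]) is the workclass value of the first row shown above
print(sparse_to_dense(8, [3], [1.0]))
# [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
```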

Construct a new instance of `VectorAssembler`, set its output column to `features`, and set its input columns to all columns except `income`.

In [12]:
from pyspark.ml.feature import VectorAssembler
cols = dfhot.columns
cols.remove("income")
va = VectorAssembler(outputCol="features", inputCols=cols)

Now use this `VectorAssembler` instance to transform the DataFrame with one-hot-encoded columns. Then preserve only the `features` and `income` columns (use `select`). Finally, rename the *income* column to *label*.

In [13]:
lpoints = va.transform(dfhot).select("features", "income").withColumnRenamed("income", "label")

Inspect the resulting DataFrame. 

In [14]:
lpoints.show()

+--------------------+-----+
|            features|label|
+--------------------+-----+
|(103,[0,1,2,4,9,1...|  0.0|
|(103,[0,1,4,7,16,...|  0.0|
|(103,[0,1,4,6,14,...|  0.0|
|(103,[0,1,4,6,19,...|  0.0|
|(103,[0,1,4,5,6,1...|  0.0|
|(103,[0,1,4,5,6,1...|  0.0|
|(103,[0,1,4,5,6,2...|  0.0|
|(103,[0,1,4,7,14,...|  1.0|
|(103,[0,1,2,4,5,6...|  1.0|
|(103,[0,1,2,4,6,1...|  1.0|
|(103,[0,1,4,6,15,...|  1.0|
|(103,[0,1,4,9,16,...|  1.0|
|(103,[0,1,4,5,6,1...|  0.0|
|(103,[0,1,4,6,20,...|  0.0|
|(103,[0,1,4,6,18,...|  1.0|
|(103,[0,1,4,6,22,...|  0.0|
|(103,[0,1,4,7,14,...|  0.0|
|(103,[0,1,4,6,14,...|  0.0|
|(103,[0,1,4,6,19,...|  0.0|
|(103,[0,1,4,5,7,1...|  1.0|
+--------------------+-----+
only showing top 20 rows



In [16]:
lpoints.show(20, False)

+-------------------------------------------------------------------------------------------------+-----+
|features                                                                                         |label|
+-------------------------------------------------------------------------------------------------+-----+
|(103,[0,1,2,4,9,16,31,40,52,57,62],[39.0,77516.0,2174.0,40.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])       |0.0  |
|(103,[0,1,4,7,16,30,39,51,57,62],[50.0,83311.0,13.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])                |0.0  |
|(103,[0,1,4,6,14,32,45,52,57,62],[38.0,215646.0,40.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])               |0.0  |
|(103,[0,1,4,6,19,30,45,51,58,62],[53.0,234721.0,40.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])               |0.0  |
|(103,[0,1,4,5,6,16,30,37,55,58,70],[28.0,338409.0,40.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])         |0.0  |
|(103,[0,1,4,5,6,17,30,39,55,57,62],[37.0,284582.0,40.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])         |0.0  |
|(103,[0,1,4,5,6,24,35,42,52,58,74],[49.0,1601

## Using pipelines

To see how pipelines work, let's repeat the same procedure, but using the `pyspark.ml.Pipeline` class. First, create a DataFrame with the "income" column renamed to "label", using the DataFrame without missing data (the one you created above). (Just call `withColumnRenamed`)

In [38]:
pipdf = dfrawnona.withColumnRenamed('income', 'label')

Now execute the following cell which defines two lists with names of columns to string-index and the names of the remaining columns (without "label").

In [39]:
columns_to_encode = ["workclass", "education", "marital_status", "sex", "occupation", "relationship", "race", "native_country"]
other_columns = ['age', 'fnlwgt', 'capital_gain', 'capital_loss', 'hours_per_week']

In the next step, create a list of `StringIndexer` objects and a list of `OneHotEncoder` objects, one per column in the `columns_to_encode` list. The `OneHotEncoder` objects should reference the columns created by the `StringIndexer` objects and produce columns with the original names suffixed with "-onehot".

In [28]:
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import OneHotEncoder

def createIndexers(cols):
    return [StringIndexer(inputCol=col, outputCol=col+"-num") for col in cols]

def createOneHots(cols):
    return [OneHotEncoder(inputCol=col+"-num", outputCol=col+"-onehot", dropLast=False) for col in cols]

indexers = createIndexers(columns_to_encode)
onehots = createOneHots(columns_to_encode)

This is the list of column names which will be assembled.

In [40]:
columns_to_assemble = [c+'-onehot' for c in columns_to_encode] + other_columns

Now create a `VectorAssembler` which creates a `features` column from the columns in `columns_to_assemble`.

In [41]:
from pyspark.ml.feature import VectorAssembler

va = VectorAssembler(outputCol="features", inputCols=columns_to_assemble)

Finally, create a `Pipeline` object with stages comprised of all `StringIndexer`, all `OneHotEncoder` objects and the `VectorAssembler` object (in that order).

In [43]:
from pyspark.ml import Pipeline

pip = Pipeline(stages=indexers+onehots+[va])

Now you can `fit` the pipeline on the input DataFrame and then use the resulting model to `transform` the same dataset. 

In [44]:
pipmodel = pip.fit(pipdf)
pippred = pipmodel.transform(pipdf)

Examine the resulting DataFrame to see if everything is OK and then create the final DataFrame containing only `features` and `label` columns.

In [45]:
pippred.show()
final = pippred.select('features', 'label')

+----+----------------+--------+------------+--------------------+-----------------+-------------+------------------+------+------------+------------+--------------+--------------+-----+-------------+-------------+------------------+-------+--------------+----------------+--------+------------------+----------------+----------------+---------------------+-------------+-----------------+-------------------+-------------+---------------------+--------------------+
| age|       workclass|  fnlwgt|   education|      marital_status|       occupation| relationship|              race|   sex|capital_gain|capital_loss|hours_per_week|native_country|label|workclass-num|education-num|marital_status-num|sex-num|occupation-num|relationship-num|race-num|native_country-num|workclass-onehot|education-onehot|marital_status-onehot|   sex-onehot|occupation-onehot|relationship-onehot|  race-onehot|native_country-onehot|            features|
+----+----------------+--------+------------+--------------------+

Now you have only two columns: *features*, encoded as a sparse vector, and *label*, containing the target value. 

Split the DataFrame into a training DataFrame containing 80% of the data and a validation DataFrame containing the rest.

In [46]:
splits = final.randomSplit([0.8, 0.2])
adulttrain = splits[0].cache()
adultvalid = splits[1].cache()

## Training a LogisticRegression model

Create an instance of `pyspark.ml.classification.LogisticRegression` called `lr`, set its regularization parameter to 0.01, its maximum number of iterations to 500, and its `fitIntercept` parameter to `True`. Then call `fit` using the training dataset.

In [19]:
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(regParam=0.01, maxIter=500, fitIntercept=True)
lrmodel = lr.fit(adulttrain)

Look at the model's learned weights (the `coefficients` field) and the intercept value.

In [35]:
print(lrmodel.coefficients)
print(lrmodel.intercept)

[0.020157626786388494,6.603690621609869e-07,0.0001425610141114458,0.000561527929209688,0.026527963778700495,-0.5034520945928904,0.033770285373482804,-0.3853274952722448,0.03521916248055273,-0.10081997017360551,0.23598360344219937,0.5917481429736907,-0.9504456630045122,-1.2939506517324026,-0.3468401927845901,-0.021640247997683545,0.7326485140578827,1.1143948230449094,0.1464011490022018,-0.9457051156688034,0.22953299112509415,-1.0561161912451364,-1.4308120330518177,1.7059261986249328,-1.2411800244199034,-0.6884111713218324,1.5770625796830247,-1.1462311640499518,-1.4807511501074684,-2.117125852067845,0.8365116861077424,-0.6915924629346963,-0.2821544325777778,-0.38737079855242146,-0.25761496103162,-0.20826729870911606,0.6908537419663245,0.19987472349472823,0.031023453933745705,0.6798583305387058,-0.0600088823729738,0.18673554944348694,-0.7433830575106748,-0.2757349372048197,-0.10390864672220759,-0.6002859026788119,-0.7969053670815228,0.47400562398819435,0.3420620283496889,-1.11365133100832

To interpret these values, take a weight value, for example w = 0.1928862316, and calculate e^w, which is approximately 1.2127. The result is the factor by which the odds are multiplied when the corresponding feature increases by 1. In this example, the odds of a person earning more than $50,000 per year increase by 21.27% if that feature increases by 1 (most of these columns are one-hot-encoded, so an increase 'by 1' actually means 'switch on').
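
The arithmetic behind this interpretation can be checked with a few lines of plain Python. This is only a sketch: `odds_multiplier` is a hypothetical helper name, and the weight is the example value from the text, not one taken from the trained model.

```python
import math

def odds_multiplier(weight):
    """Factor by which the odds change when the feature increases by 1."""
    return math.exp(weight)

w = 0.1928862316                       # example weight from the text
factor = odds_multiplier(w)
percent_change = (factor - 1.0) * 100  # change in odds, in percent

print(f"{factor:.4f}")          # approximately 1.2127
print(f"{percent_change:.2f}")  # approximately 21.27
```

A negative weight gives a factor below 1, meaning the odds decrease when the feature is switched on.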

Now you can use the model to get predictions for input values. The model is an instance of a `Transformer`, which means it has a `transform` method for transforming DataFrames. Do that now with your validation dataset and inspect the result.

In [36]:
validpredicts = lrmodel.transform(adultvalid)
validpredicts.show()

+--------------------+-----+--------------------+--------------------+----------+
|            features|label|       rawPrediction|         probability|prediction|
+--------------------+-----+--------------------+--------------------+----------+
|(103,[0,1,2,4,5,6...|  1.0|[-0.9188485528202...|[0.28519256768000...|       1.0|
|(103,[0,1,2,4,5,6...|  0.0|[0.41442085633236...|[0.60214744202577...|       0.0|
|(103,[0,1,2,4,5,6...|  0.0|[3.07676852698740...|[0.95592423306525...|       0.0|
|(103,[0,1,2,4,5,6...|  0.0|[3.70090307759355...|[0.97589423232617...|       0.0|
|(103,[0,1,2,4,5,6...|  0.0|[3.19323563613614...|[0.96057892720889...|       0.0|
|(103,[0,1,2,4,5,6...|  0.0|[3.43249273119202...|[0.96870472544821...|       0.0|
|(103,[0,1,2,4,5,6...|  0.0|[4.44710823085458...|[0.98842320435030...|       0.0|
|(103,[0,1,2,4,5,6...|  0.0|[3.00334907145142...|[0.95272519756421...|       0.0|
|(103,[0,1,2,4,5,6...|  0.0|[1.75403073205888...|[0.85246047419474...|       0.0|
|(103,[0,1,2,4,5

The `probability` column contains vectors with two values: the probability that the sample isn’t in the category (the person is making less than $50,000) and the probability that it is. These two values always add up to 1. The  `rawPrediction` column also contains vectors with two values: the log-odds that a sample doesn’t belong to the category and the log-odds that it does. These two values are always opposite numbers (they add up to 0). The `prediction` column contains 1s and 0s, which indicates whether a sample is likely to belong to the category. A sample is likely to belong to the category if its probability is greater than a certain threshold (0.5 by default).
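
For binary logistic regression, each probability is the logistic (sigmoid) function of the corresponding log-odds. The sketch below reproduces the first row of the output above in plain Python; the raw value is copied from the truncated display, so the result only matches to the displayed precision.

```python
import math

def sigmoid(z):
    """Logistic function: maps log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

raw0 = -0.9188485528202   # first rawPrediction component of the first row
p0 = sigmoid(raw0)        # probability of class 0
p1 = sigmoid(-raw0)       # probability of class 1 (opposite log-odds)

threshold = 0.5           # default threshold mentioned in the text
prediction = 1.0 if p1 > threshold else 0.0

print(p0, p1, prediction)
```

Here `p0` agrees with the displayed `probability` value of about 0.2852, the two probabilities add up to 1, and the prediction is 1.0, as in the table.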

You can evaluate the performance of the model using `BinaryClassificationEvaluator`. Just create a new instance and call `evaluate` on the predictions you obtained for your validation dataset.

In [37]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
bceval = BinaryClassificationEvaluator()
bceval.evaluate(validpredicts)

0.9040311371194393

The resulting value depends on the current metric. Find it out with the `getMetricName` method.

In [38]:
bceval.getMetricName()

'areaUnderROC'

Set the metric name to "areaUnderPR" and evaluate the model again.

In [39]:
bceval.setMetricName("areaUnderPR")
bceval.evaluate(validpredicts)

0.7580554567790339

The `pr` and `roc` methods of the Scala version of the `BinaryClassificationMetrics` class allow you to obtain graph points for precision-recall and receiver operating characteristic (ROC) curves. Unfortunately, this capability is missing from the Python implementation.
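
If you only need the curve points for a small, collected sample of predictions, they can be computed by hand. The following is a plain-Python sketch, not a port of the Scala methods: `roc_points` and `auc` are hypothetical helpers, the scores are made-up data, and the input must contain at least one positive and one negative label.

```python
def roc_points(scored):
    """Return ROC points (false positive rate, true positive rate).

    `scored` is a list of (score, label) pairs with label 1 or 0.
    """
    pos = sum(1 for _, y in scored if y == 1)
    neg = len(scored) - pos
    ordered = sorted(scored, key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ordered):
        threshold = ordered[i][0]
        # Consume every sample tied at this score before emitting a point
        while i < len(ordered) and ordered[i][0] == threshold:
            if ordered[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under a piecewise-linear curve (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Made-up (score, label) pairs for illustration
sample = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.4, 0), (0.2, 0)]
points = roc_points(sample)
print(points)
print(auc(points))  # 8/9, about 0.889
```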

## K-fold cross-validation

k-fold cross-validation consists of dividing the dataset into k subsets of equal sizes and training k models excluding a different subset each time. The excluded subset is used as the validation set, and all other subsets are used together as the training set. For each set of parameters you want to validate, you train all k models and then calculate the mean error across all k models (as in figure 8.6). Finally, you choose the set of parameters giving you the smallest average error.
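
The procedure can be sketched in plain Python as follows. This is an illustration of the idea, not Spark's implementation: the folds here are contiguous index ranges with no shuffling, and `fake_error` is a made-up stand-in for training a model and measuring its error on the validation fold.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, valid_indices) for k folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        valid = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, valid
        start += size

def cross_validate(n_samples, k, param_values, train_and_score):
    """Return the parameter with the smallest mean error and all mean errors."""
    mean_errors = {}
    for p in param_values:
        errors = [train_and_score(p, train, valid)
                  for train, valid in k_fold_indices(n_samples, k)]
        mean_errors[p] = sum(errors) / len(errors)
    best = min(mean_errors, key=mean_errors.get)
    return best, mean_errors

def fake_error(param, train, valid):
    # Hypothetical error function: pretends 0.01 is the ideal parameter value
    return abs(param - 0.01)

folds = list(k_fold_indices(10, 5))
best, mean_errors = cross_validate(10, 5, [0.001, 0.01, 0.1], fake_error)
print(folds[0])  # ([2, 3, 4, 5, 6, 7, 8, 9], [0, 1])
print(best)      # 0.01
```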

Spark's `pyspark.ml.tuning.CrossValidator` class can automate this for you. It needs an *estimator* (a `LogisticRegression` object, for example) and an *evaluator* (`BinaryClassificationEvaluator`, for example), and the number of folds to use.

Construct a `CrossValidator`, passing it the estimator and evaluator objects you created earlier, and specify 5 folds.

In [116]:
from pyspark.ml.tuning import CrossValidator
cv = CrossValidator(estimator=lr, evaluator=bceval, numFolds=5)

Now build an instance of `pyspark.ml.tuning.ParamGridBuilder`: create a new instance, call `addGrid` for parameters `lr.maxIter` (with one value of 1000) and `lr.regParam` (with values 0.0001, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5), and then call `build`. 

In [119]:
from pyspark.ml.tuning import ParamGridBuilder
paramGrid = ParamGridBuilder().addGrid(lr.maxIter, [1000]).\
    addGrid(lr.regParam, [0.0001, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5]).build()
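
`build` returns one parameter map per combination of the values passed to `addGrid`, that is, their Cartesian product. A plain-Python sketch of the same idea follows; `build_grid` is a hypothetical helper, and string keys stand in for Spark's `Param` objects.

```python
from itertools import product

def build_grid(param_values):
    """Return a list of dicts, one per combination of parameter values."""
    names = list(param_values)
    combos = product(*(param_values[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]

grid = build_grid({
    "maxIter": [1000],
    "regParam": [0.0001, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5],
})
print(len(grid))  # 7
print(grid[0])    # {'maxIter': 1000, 'regParam': 0.0001}
```

With 5 folds, each of these 7 combinations is trained 5 times during cross-validation, so the grid size directly multiplies the training cost.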

Finally, set the resulting parameter grid as the value of the `estimatorParamMaps` field of your `CrossValidator` and `fit` it on the training dataset.

In [120]:
cv.setEstimatorParamMaps(paramGrid)
cvmodel = cv.fit(adulttrain)

The resulting model has a `bestModel` field which contains the model with best statistical performance. Examine its `coefficients` field.

In [121]:
cvmodel.bestModel.coefficients

DenseVector([0.0222, 0.0, 0.0003, 0.0007, 0.03, -0.6956, -0.4136, -0.916, -0.4334, -0.5776, -0.2629, 0.1881, -1.4715, -4.8562, -0.5909, -0.2194, 0.5937, 1.0053, -0.0685, -1.398, 0.0502, -1.4597, -1.951, 1.7028, -1.6965, -1.0666, 1.5611, -1.616, -2.1614, -11.7542, 1.2528, -1.4578, -0.9816, -1.1204, -0.923, -0.9109, 1.1223, -0.0298, -0.155, 0.5113, -0.2565, -0.0062, -1.0898, -0.4754, -0.2933, -0.8742, -1.1179, 0.3092, 0.1876, -2.3661, 0.308, -0.3914, 0.1259, -0.9891, -0.0657, 0.7056, -0.8472, -0.6744, -0.9351, -0.4725, -1.2847, -0.8483, -0.9671, -1.9029, -0.9007, -0.9924, -1.3611, -0.6444, -1.5826, -1.1403, -0.8561, -0.5944, -1.7426, -2.155, -0.7574, -0.4024, -2.6014, -1.4475, -1.6589, -1.0907, -2.1497, -3.5133, -1.4285, -0.3714, -1.2017, -0.934, -1.2177, -1.6934, -2.0246, -2.2358, -0.2823, 0.0647, -1.9894, -1.7176, -0.227, -2.2883, 0.3142, -2.5081, -1.7239, -1.4792, -5.4461, -0.7987, -4.7835])

Now use this best model to transform the validation dataset and validate it using a new `BinaryClassificationEvaluator`.

In [122]:
BinaryClassificationEvaluator().evaluate(cvmodel.bestModel.transform(adultvalid))

0.9063025251654941

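The result should be slightly better than the previous result you obtained (about 0.906 versus 0.904 in this run; your exact numbers will differ because the split is random).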

## Multiclass logistic regression

As we said earlier, multiclass classification means a classifier categorizes input examples into several classes. One of the options for performing multiclass classification in Spark is the *one vs. rest strategy*. When using the one vs. rest strategy, you train one model per class, each time treating all other classes (the rest) as negatives. Then, when classifying new samples, you classify them using all the trained models and pick the class corresponding to the model that gives the highest probability. Spark ML provides the `pyspark.ml.classification.OneVsRest` class precisely for this purpose. It produces a `OneVsRestModel` that you can use for dataset transformation. You can use the `MulticlassMetrics` class from MLlib for evaluating the results.
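
The two halves of the strategy, relabeling for training and picking the winner at prediction time, can be sketched in plain Python. The helper names and the probabilities below are made up for illustration; the actual binary models are left out.

```python
def binarize(labels, positive_class):
    """Relabel a multiclass target for one binary model: class vs. the rest."""
    return [1.0 if y == positive_class else 0.0 for y in labels]

def one_vs_rest_predict(per_class_probability):
    """Pick the class whose binary model reports the highest probability."""
    return max(per_class_probability, key=per_class_probability.get)

labels = [0, 3, 7, 3, 9]
print(binarize(labels, 3))  # [0.0, 1.0, 0.0, 1.0, 0.0]

# Hypothetical outputs of four per-class models for a single sample
probabilities = {0: 0.10, 3: 0.72, 7: 0.31, 9: 0.05}
print(one_vs_rest_predict(probabilities))  # 3
```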

As an example dataset you will use the data extracted from scaled images of handwritten numbers. It’s a public dataset available from the UCI machine learning repository, containing 10,992 samples of handwritten digits from 0 to 9. Each sample contains 16 pixels with intensity values of 0–100.

Use the following code to load the dataset and split it into the training and validation datasets.

In [17]:
dfpen = sc.textFile("../first-edition/ch08/penbased.dat", 4).map(lambda x: x.split(", ")).\
    map(lambda row: [int(float(x)) for x in row]).map(lambda raw: Row(*raw)).\
    toDF(['pix1', 'pix2', 'pix3', 'pix4', 'pix5', 'pix6', 'pix7', 'pix8', 'pix9', 'pix10', 
         'pix11', 'pix12', 'pix13', 'pix14', 'pix15', 'pix16', 'label'])

from pyspark.ml.feature import VectorAssembler
va = VectorAssembler().setOutputCol("features")
cols = dfpen.columns
cols.remove('label')
va.setInputCols(cols)
penlpoints = va.transform(dfpen).select("features", "label")

pensets = penlpoints.randomSplit([0.8, 0.2])
pentrain = pensets[0].cache()
penvalid = pensets[1].cache()

Create a `LogisticRegression` classifier with a regularization parameter of 0.01. Then create a `OneVsRest` object and give it this classifier using the `setClassifier` method. Then call the `fit` method of the `OneVsRest` object using the `pentrain` dataset.

In [20]:
from pyspark.ml.classification import OneVsRest
penlr = LogisticRegression(regParam=0.01)
ovrest = OneVsRest()
ovrest.setClassifier(penlr)
ovrestmodel = ovrest.fit(pentrain)

Use the resulting model to transform the validation dataset and convert the result into an RDD containing prediction-label tuples. This is needed because you will use `pyspark.mllib.evaluation.MulticlassMetrics` evaluator from the MLlib library.

In [21]:
penresult = ovrestmodel.transform(penvalid)
penPreds = penresult.select("prediction", "label").\
    rdd.map(lambda row: (float(row[0]), float(row[1])))

Finally, provide the resulting RDD to the `MulticlassMetrics` constructor.

In [23]:
from pyspark.mllib.evaluation import MulticlassMetrics
mce = MulticlassMetrics(penPreds)


0.8922339278852328

You can now access `precision` and `recall` metrics for each class by calling methods with those names and providing the index of the class you want. Do that now for several classes.

In [136]:
print(mce.precision(3))
print(mce.recall(3))

0.8981481481481481
1.0


`MulticlassMetrics`'s `confusionMatrix` is a good way to visualize the results. Print it out now.

In [138]:
print(mce.confusionMatrix())

DenseMatrix([[201.,   3.,   0.,   0.,   2.,   0.,   2.,   0.,  19.,   1.],
             [  0., 133.,  28.,   5.,   0.,  16.,   0.,   1.,   0.,   0.],
             [  0.,   6., 168.,   0.,   0.,   1.,   0.,   4.,   0.,   0.],
             [  0.,   0.,   0., 194.,   0.,   0.,   0.,   0.,   0.,   0.],
             [  0.,   1.,   0.,   0., 188.,   1.,   1.,   0.,   0.,   2.],
             [  0.,   0.,   1.,  12.,   0., 120.,  11.,   5.,   3.,  29.],
             [  0.,   0.,   0.,   0.,   1.,   0., 181.,   0.,   4.,   0.],
             [  0.,  10.,   2.,   3.,   1.,   4.,   0., 201.,   1.,   5.],
             [ 10.,   6.,   0.,   0.,   0.,   4.,   0.,   1., 168.,   0.],
             [  2.,   7.,   0.,   2.,   2.,   3.,   0.,   0.,   0., 175.]])
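
In this matrix, rows correspond to actual classes and columns to predicted classes, so per-class precision and recall can be read off it directly. The plain-Python sketch below recomputes the values for class 3 from the matrix printed above; the `precision` and `recall` helpers are written for illustration and are not the MLlib methods.

```python
confusion = [
    [201,   3,   0,   0,   2,   0,   2,   0,  19,   1],
    [  0, 133,  28,   5,   0,  16,   0,   1,   0,   0],
    [  0,   6, 168,   0,   0,   1,   0,   4,   0,   0],
    [  0,   0,   0, 194,   0,   0,   0,   0,   0,   0],
    [  0,   1,   0,   0, 188,   1,   1,   0,   0,   2],
    [  0,   0,   1,  12,   0, 120,  11,   5,   3,  29],
    [  0,   0,   0,   0,   1,   0, 181,   0,   4,   0],
    [  0,  10,   2,   3,   1,   4,   0, 201,   1,   5],
    [ 10,   6,   0,   0,   0,   4,   0,   1, 168,   0],
    [  2,   7,   0,   2,   2,   3,   0,   0,   0, 175],
]

def precision(matrix, cls):
    """Correct predictions of cls divided by all predictions of cls (column sum)."""
    return matrix[cls][cls] / sum(row[cls] for row in matrix)

def recall(matrix, cls):
    """Correct predictions of cls divided by all actual members of cls (row sum)."""
    return matrix[cls][cls] / sum(matrix[cls])

print(precision(confusion, 3))  # 194 / 216, about 0.898
print(recall(confusion, 3))     # 194 / 194 = 1.0
```

Both numbers match the `mce.precision(3)` and `mce.recall(3)` output shown earlier.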


## Decision trees

You will now train decision tree and random forest models on the same dataset, but first its label column needs to be string-indexed. Execute this code to prepare the data.

In [140]:
dtsi = StringIndexer(inputCol="label", outputCol="label-i")
dtsm = dtsi.fit(penlpoints)
pendtlpoints = dtsm.transform(penlpoints).drop("label").\
    withColumnRenamed("label-i", "label")
pendtsets = pendtlpoints.randomSplit([0.8, 0.2])
pendttrain = pendtsets[0].cache()
pendtvalid = pendtsets[1].cache()

Create an instance of `pyspark.ml.classification.DecisionTreeClassifier`, set its `maxDepth` to 20 and call `fit` using `pendttrain` DataFrame.

In [141]:
from pyspark.ml.classification import DecisionTreeClassifier
dt = DecisionTreeClassifier(maxDepth=20)
dtmodel = dt.fit(pendttrain)

In Scala, you can examine the decisions the model makes by traversing its decision tree from the `rootNode` all the way to the leaf nodes.

However, this feature is not available in Python.

Now evaluate the model by transforming the validation set (`pendtvalid`), again creating an RDD of prediction-label tuples and using a `MulticlassMetrics` instance. Print out the resulting precision and confusion matrix.

In [153]:
dtpredicts = dtmodel.transform(pendtvalid)
dtresrdd = dtpredicts.select("prediction", "label").rdd.map(lambda row: (float(row[0]), float(row[1])))
dtmm = MulticlassMetrics(dtresrdd)
print(dtmm.precision())
print(dtmm.confusionMatrix())

0.9601468274777137
DenseMatrix([[189.,   0.,   0.,   0.,   0.,   0.,   1.,   2.,   1.,   1.],
             [  0., 203.,   0.,   2.,   2.,   1.,   0.,   0.,   0.,   1.],
             [  0.,   0., 171.,   0.,   0.,   0.,   0.,   5.,   0.,   0.],
             [  0.,   2.,   0., 193.,   6.,   3.,   1.,   3.,   1.,   3.],
             [  0.,   0.,   0.,  12., 182.,   0.,   0.,   0.,   0.,   0.],
             [  0.,   0.,   1.,   0.,   0., 187.,   0.,   1.,   0.,   0.],
             [  2.,   1.,   0.,   0.,   0.,   0., 163.,   1.,   3.,   0.],
             [  0.,   0.,   2.,   0.,   0.,   0.,   3., 178.,   3.,   0.],
             [  0.,   0.,   0.,   2.,   0.,   0.,   0.,   1., 183.,   0.],
             [  0.,   0.,   0.,   3.,   1.,   0.,   0.,   1.,   4., 182.]])


Are the results better than those obtained using logistic regression?
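To make such comparisons concrete, it can help to see how the single precision figure relates to the confusion matrix. The following is a minimal plain-Python sketch (no Spark needed), assuming rows are actual labels and columns are predicted labels. The helper names `overall_accuracy` and `per_class_recall` are illustrative and not part of Spark. The matrix is copied from the decision tree output above.

```python
# Confusion matrix copied from the decision tree output above
# (assumed layout: rows = actual label, columns = predicted label).
dt_confusion = [
    [189, 0, 0, 0, 0, 0, 1, 2, 1, 1],
    [0, 203, 0, 2, 2, 1, 0, 0, 0, 1],
    [0, 0, 171, 0, 0, 0, 0, 5, 0, 0],
    [0, 2, 0, 193, 6, 3, 1, 3, 1, 3],
    [0, 0, 0, 12, 182, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 187, 0, 1, 0, 0],
    [2, 1, 0, 0, 0, 0, 163, 1, 3, 0],
    [0, 0, 2, 0, 0, 0, 3, 178, 3, 0],
    [0, 0, 0, 2, 0, 0, 0, 1, 183, 0],
    [0, 0, 0, 3, 1, 0, 0, 1, 4, 182],
]

def overall_accuracy(matrix):
    """Fraction of examples on the diagonal (correct predictions)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

def per_class_recall(matrix):
    """For each actual label, the share of its examples predicted correctly."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]

dt_accuracy = overall_accuracy(dt_confusion)
dt_recall = per_class_recall(dt_confusion)
print(dt_accuracy)
print(["%.3f" % r for r in dt_recall])
```

The overall figure agrees with the precision printed for the decision tree, and the per-class values show which digits the model confuses most often. The same two helpers can be applied to the other confusion matrices in this notebook.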

## Random forests

The random forests algorithm is an ensemble method that trains a number of decision trees and produces the final result by averaging the results from all of them. This helps the algorithm reduce overfitting and find a global optimum that individual decision trees can’t find on their own.
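The idea of combining trees can be illustrated with a toy sketch in plain Python. This is not how Spark implements its random forests: the three "trees" below are hypothetical hand-written rules standing in for trained decision trees, and `forest_predict` is an illustrative name. Each tree returns class probabilities, and the ensemble averages them and picks the most probable class.

```python
# Toy ensemble: each "tree" is a hypothetical rule that returns
# [P(class 0), P(class 1)] for an input with two features.
def tree_a(x):
    return [0.9, 0.1] if x[0] < 5 else [0.2, 0.8]

def tree_b(x):
    return [0.7, 0.3] if x[1] < 3 else [0.4, 0.6]

def tree_c(x):
    return [0.6, 0.4] if x[0] + x[1] < 9 else [0.1, 0.9]

def forest_predict(trees, x):
    """Average the trees' class probabilities; return (class, averaged probabilities)."""
    votes = [tree(x) for tree in trees]
    num_classes = len(votes[0])
    avg = [sum(v[c] for v in votes) / len(votes) for c in range(num_classes)]
    best = max(range(num_classes), key=lambda c: avg[c])
    return best, avg

trees = [tree_a, tree_b, tree_c]
# All three toy trees agree on this input.
label_low, probs_low = forest_predict(trees, [2, 1])
# Here tree_a favors class 0, but the other two outweigh it.
label_high, probs_high = forest_predict(trees, [4, 6])
print(label_low, probs_low)
print(label_high, probs_high)
```

The second call shows the point of the ensemble: a single tree that is wrong on some input can be outvoted by the rest.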

Random forests in Spark are implemented by the classes `RandomForestClassifier` and `RandomForestRegressor` (here you will use the classification version). Besides `maxDepth`, which you already used for decision trees, you can configure two additional parameters: `numTrees` (the number of trees to train; the default is 20) and `featureSubsetStrategy` (determines how feature bagging is done). The defaults work fine in most cases.

Create a new instance of `RandomForestClassifier`, set its `maxDepth` to 20 and `fit` it on the `pendttrain` dataset.

In [154]:
from pyspark.ml.classification import RandomForestClassifier
rf = RandomForestClassifier(maxDepth=20)
rfmodel = rf.fit(pendttrain)

The resulting model has a `trees` field containing the trees it has trained. Access it now.

In [155]:
rfmodel.trees

[DecisionTreeClassificationModel (uid=dtc_d531958eab76) of depth 16 with 681 nodes,
 DecisionTreeClassificationModel (uid=dtc_698c6240285a) of depth 20 with 757 nodes,
 DecisionTreeClassificationModel (uid=dtc_4e0734765111) of depth 15 with 637 nodes,
 DecisionTreeClassificationModel (uid=dtc_4c4bfaf01bca) of depth 18 with 671 nodes,
 DecisionTreeClassificationModel (uid=dtc_4f88bbab552c) of depth 17 with 713 nodes,
 DecisionTreeClassificationModel (uid=dtc_7ae7f53fdc69) of depth 16 with 765 nodes,
 DecisionTreeClassificationModel (uid=dtc_7d14e3004158) of depth 18 with 707 nodes,
 DecisionTreeClassificationModel (uid=dtc_400a81ca6dc1) of depth 20 with 811 nodes,
 DecisionTreeClassificationModel (uid=dtc_4e548f1bd104) of depth 20 with 717 nodes,
 DecisionTreeClassificationModel (uid=dtc_cac58538d0b1) of depth 17 with 679 nodes,
 DecisionTreeClassificationModel (uid=dtc_671ed323f30e) of depth 15 with 759 nodes,
 DecisionTreeClassificationModel (uid=dtc_6f2d951d886c) of depth 19 with 671

The resulting model is just another `Transformer`, so you can use it to transform the validation dataset as you did previously. Do that now, and then use a `MulticlassMetrics` instance to obtain its precision and confusion matrix.

In [156]:
rfpredicts = rfmodel.transform(pendtvalid)
rfresrdd = rfpredicts.select("prediction", "label").rdd.map(lambda row: (float(row[0]), float(row[1])))
rfmm = MulticlassMetrics(rfresrdd)
print(rfmm.precision())
print(rfmm.confusionMatrix())

0.9884635553224961
DenseMatrix([[193.,   0.,   0.,   0.,   0.,   0.,   0.,   1.,   0.,   0.],
             [  0., 208.,   0.,   0.,   1.,   0.,   0.,   0.,   0.,   0.],
             [  0.,   0., 175.,   0.,   0.,   0.,   0.,   1.,   0.,   0.],
             [  0.,   2.,   0., 203.,   5.,   0.,   0.,   0.,   0.,   2.],
             [  0.,   1.,   0.,   2., 191.,   0.,   0.,   0.,   0.,   0.],
             [  0.,   0.,   0.,   0.,   0., 189.,   0.,   0.,   0.,   0.],
             [  0.,   0.,   0.,   0.,   0.,   0., 169.,   0.,   1.,   0.],
             [  0.,   1.,   0.,   0.,   0.,   0.,   0., 185.,   0.,   0.],
             [  0.,   0.,   0.,   0.,   0.,   0.,   1.,   2., 183.,   0.],
             [  0.,   0.,   0.,   1.,   1.,   0.,   0.,   0.,   0., 189.]])


## K-means clustering

K-means is one of the simplest and most widely used clustering algorithms. Unfortunately, it has drawbacks: it has trouble handling non-spherical clusters and unevenly sized clusters (uneven by density or by radius). It also can’t make efficient use of the one-hot-encoded features you used when preparing the data for logistic regression. It’s often used for classifying text documents, along with the term frequency-inverse document frequency (TF-IDF) feature-vectorization method.

Each image of handwritten digits, which you’ll use for this example, is represented as a series of numbers (dimensions) representing image pixels. As such, each image is a point in an n-dimensional space. K-means clustering can group together images that are close in this space. In an ideal case, all of these will be images of the same digit.
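What "close in this space" means can be sketched in plain Python. The snippet below is a minimal illustration of the standard k-means iteration (assign each point to its nearest centroid, then move each centroid to the mean of its points), not Spark's distributed implementation. The name `simple_kmeans`, the 2-D toy points, and the fixed starting centroids are assumptions made for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def simple_kmeans(points, start_centroids, max_iter=20):
    """Toy k-means; returns (centroids, cluster index for each point)."""
    centroids = [list(c) for c in start_centroids]  # work on a copy
    assignments = None
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        new_assignments = [
            min(range(len(centroids)),
                key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing changed, so the clustering has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its points.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments

# Two obvious groups of 2-D points and two hypothetical starting centroids.
toy_points = [[1, 1], [2, 1], [1, 2], [9, 9], [8, 9], [9, 8]]
final_centroids, clusters = simple_kmeans(toy_points, [[0, 0], [10, 10]])
print(clusters)
print(final_centroids)
```

The handwritten digit images work the same way, only with more dimensions per point and ten clusters instead of two.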

To implement k-means, you first have to make sure your dataset is standardized (all dimensions are of comparable ranges), because k-means clustering doesn’t work well with non-standardized data. The dimensions of the handwritten digit dataset are already standardized (all the values go from 0 to 100), so you can skip this step now. Also, with clustering algorithms there’s no point in splitting the data into training and validation sets, so you’ll use the entire dataset contained in the `penlpoints` DataFrame you used before.
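For a dataset that is not already on a common scale, one simple option is to rescale every dimension to the same range. The following is a minimal plain-Python sketch, assuming min-max rescaling to the 0 to 100 range to mirror the handwritten digit data. `rescale_columns` is an illustrative helper; in a Spark pipeline you would more likely use one of the scalers in `pyspark.ml.feature`, such as `StandardScaler` or `MinMaxScaler`.

```python
def rescale_columns(rows, new_min=0.0, new_max=100.0):
    """Min-max rescale every column of a list of equal-length rows."""
    scaled_columns = []
    for col in zip(*rows):  # iterate over columns (dimensions)
        lo, hi = min(col), max(col)
        span = hi - lo
        if span == 0:
            # Constant column: map every value to new_min.
            scaled_columns.append([new_min for _ in col])
        else:
            scaled_columns.append(
                [new_min + (v - lo) * (new_max - new_min) / span for v in col]
            )
    # Transpose back to rows.
    return [list(row) for row in zip(*scaled_columns)]

# Two dimensions with very different ranges.
raw = [[1.0, 2000.0], [2.0, 4000.0], [3.0, 10000.0]]
scaled = rescale_columns(raw)
print(scaled)
```

After rescaling, both dimensions contribute to the distance on a comparable scale, instead of the second one dominating.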

The `KMeans` estimator can be parameterized with the following parameters:
- `k`: Number of clusters to find (default is 2)
- `maxIter`: Maximum number of iterations to perform (default is 20)
- `predictionCol`: Prediction column name (default is "prediction")
- `featuresCol`: Features column name (default is "features")
- `tol`: Convergence tolerance
- `seed`: Random seed value for cluster initialization

Create a new instance of `pyspark.ml.clustering.KMeans`, set `k` to 10 (there are 10 digits in the dataset) and `maxIter` to 500. Then use it to fit a model with the `penlpoints` DataFrame.

In [157]:
from pyspark.ml.clustering import KMeans
kmeans = KMeans(k=10, maxIter=500)
kmmodel = kmeans.fit(penlpoints)

Execute the following cell to define the `printContingency` function, which prints a contingency table with the original labels as rows and the k-means cluster indexes as columns. Each cell in the table contains the number of examples that belong to both the original label and the predicted cluster.

In [164]:
#rdd contains tuples (prediction, label)
def printContingency(rdd, labels):
    import operator
    numl = len(labels)
    tablew = 6*numl + 10
    divider = "----------"
    for l in labels:
        divider += "+-----"
    summ = 0
    print("orig.class", end='')
    for l in labels:
        print("|Pred"+str(l), end='')
    print()
    print(divider)
    labelMap = {}
    for l in labels:
        #keep rows whose original label is l, then count them per predicted cluster
        predCounts = rdd.filter(lambda p:  p[1] == l).countByKey()
        #get the cluster with most elements
        topLabelCount = sorted(predCounts.items(), key=operator.itemgetter(1), reverse=True)[0]
        #if this cluster is already mapped to a previously processed label
        if(topLabelCount[0] in labelMap):
            #and that label has fewer elements in the cluster, replace it
            if(labelMap[topLabelCount[0]][1] < topLabelCount[1]):
                #labelMap is keyed by cluster index, so subtract the replaced entry's count
                summ -= labelMap[topLabelCount[0]][1]
                labelMap.update({topLabelCount[0]: (l, topLabelCount[1])})
                summ += topLabelCount[1]
            #else leave the previous label in
        else:
            labelMap.update({topLabelCount[0]: (l, topLabelCount[1])})
            summ += topLabelCount[1]
        predictions = iter(sorted(predCounts.items(), key=operator.itemgetter(0)))
        predcount = next(predictions)
        print("%6d    " % (l), end='')
        for predl in labels:
            if(predcount[0] == predl):
                print("|%5d" % (predcount[1]), end='')
                try:
                    predcount = next(predictions)
                except StopIteration:
                    pass
            else:
                print("|    0", end='')
        print()
        print(divider)
    print("Purity: %s" % (float(summ)/rdd.count()))
    print("Predicted->original label map: %s" % str([str(x[0])+": "+str(x[1][0]) for x in labelMap.items()]))
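The purity figure that `printContingency` prints at the end can be illustrated without Spark. The sketch below follows the common definition of purity (each cluster is credited with its most frequent original label), which is close to, but not identical to, the bookkeeping in `printContingency`. The function name `purity` and the toy pairs are assumptions made for the example.

```python
from collections import Counter

def purity(pairs):
    """pairs: iterable of (predicted_cluster, original_label) tuples."""
    pairs = list(pairs)
    # For each cluster, count how often each original label appears in it.
    per_cluster = {}
    for cluster, label in pairs:
        per_cluster.setdefault(cluster, Counter())[label] += 1
    # Credit each cluster with the count of its most common label.
    matched = sum(counts.most_common(1)[0][1] for counts in per_cluster.values())
    return matched / len(pairs)

# Toy data: cluster 0 is mostly label 3, cluster 1 is all label 5,
# and cluster 2 is split evenly between labels 7 and 3.
toy_pairs = [(0, 3), (0, 3), (0, 5), (1, 5), (1, 5), (1, 5), (2, 7), (2, 3)]
toy_purity = purity(toy_pairs)
print(toy_purity)
```

A purity of 1.0 would mean every cluster contains examples of a single label only; lower values indicate clusters that mix several labels.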

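The function takes an RDD of (prediction, label) tuples with numeric values. Call the model’s `transform` method on the `penlpoints` DataFrame, select the `prediction` and `label` columns, obtain the RDD through the `rdd` field, and then call the function with that RDD as the first argument and the range of labels as the second. Note that `range(0, 9)`, as used in the next cell, covers only 0 through 8; pass `range(0, 10)` to include digit 9 and all ten clusters.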

In [165]:
kmpredicts = kmmodel.transform(penlpoints)
printContingency(kmpredicts.select("prediction", "label").rdd, range(0, 9))

orig.class|Pred0|Pred1|Pred2|Pred3|Pred4|Pred5|Pred6|Pred7|Pred8
----------+-----+-----+-----+-----+-----+-----+-----+-----+-----
     0    |    0|    3|    6|   30|    7|    0|  638|    0|  353
----------+-----+-----+-----+-----+-----+-----+-----+-----+-----
     1    |   21|  574|  282|    8|    1|   70|    0|   66|    0
----------+-----+-----+-----+-----+-----+-----+-----+-----+-----
     2    |    2|   15| 1003|    0|    0|    2|    0|    0|    0
----------+-----+-----+-----+-----+-----+-----+-----+-----+-----
     3    |    0|   19|    1|    0|    1|  919|    0|    2|    0
----------+-----+-----+-----+-----+-----+-----+-----+-----+-----
     4    |    0|   12|    4|   41|  938|    1|    0|   31|    0
----------+-----+-----+-----+-----+-----+-----+-----+-----+-----
     5    |    0|    0|    0|    6|    0|  211|    0|  175|    0
----------+-----+-----+-----+-----+-----+-----+-----+-----+-----
     6    |    0|    0|    0|  965|    3|    0|    0|    0|    0
----------+-----+-----+--