# Week : Machine Learning over Spark

# Lab Goals


* Get familiar with Spark ML and understand the main concepts of Pipelines.
* Build a simple logistic regression model using Spark's ML pipeline interface and tune the model's hyperparameters. The available algorithms are listed at https://spark.apache.org/docs/latest/ml-guide.html.

## Main concepts in Pipelines


MLlib standardizes APIs for machine learning algorithms to make it easier to combine multiple algorithms into a single pipeline, or workflow. This section covers the key concepts introduced by the Pipelines API, where the pipeline concept is mostly inspired by the scikit-learn project.



   * **DataFrame**: This ML API uses DataFrame from Spark SQL as an ML dataset, which can hold a variety of data types. E.g., a DataFrame could have different columns storing text, feature vectors, true labels, and predictions.

   * **Transformer**: A Transformer is an algorithm which can transform one DataFrame into another DataFrame. A feature transformer might take a DataFrame, read a column (e.g., text), map it into a new column (e.g., feature vectors), and output a new DataFrame with the mapped column appended. Also, a learning model might take a DataFrame, read the column containing feature vectors, predict the label for each feature vector, and output a new DataFrame with predicted labels appended as a column.
   * **Estimator**: An Estimator is an algorithm which can be fit on a DataFrame to produce a Transformer. Technically, an Estimator implements a method *fit()*, which accepts a DataFrame and produces a Model, which is a Transformer. For example, a learning algorithm such as LogisticRegression is an Estimator, and calling *fit()* trains a *LogisticRegressionModel*, which is a Model and hence a *Transformer*.

   * **Pipeline**: A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow.

   * **Parameter**: All Transformers and Estimators now share a common API for specifying parameters.
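
To make the Estimator/Transformer contract concrete, here is a toy sketch in plain Python (no Spark required). The class names `ToyIndexer`, `ToyIndexerModel`, and `ToyPipeline` are hypothetical and only mimic the shape of the API described above; rows are plain dictionaries rather than a DataFrame, and this is not how Spark implements these classes.

```python
from collections import Counter


class ToyIndexerModel:
    """Transformer: maps a string column to a numeric index column."""

    def __init__(self, input_col, output_col, mapping):
        self.input_col = input_col
        self.output_col = output_col
        self.mapping = mapping

    def transform(self, rows):
        # Return new rows with the output column appended; the input rows are not mutated.
        return [{**row, self.output_col: self.mapping[row[self.input_col]]} for row in rows]


class ToyIndexer:
    """Estimator: fit() learns the mapping from the data and returns a Transformer."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, rows):
        counts = Counter(row[self.input_col] for row in rows)
        # In this sketch the most frequent value gets index 0; ties are broken alphabetically.
        ordered = sorted(counts, key=lambda v: (-counts[v], v))
        mapping = {value: float(i) for i, value in enumerate(ordered)}
        return ToyIndexerModel(self.input_col, self.output_col, mapping)


class ToyPipeline:
    """Chains stages: each Estimator is fitted in order, then used to transform the data."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            # Estimators expose fit(); stages without it are already Transformers.
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)
            fitted.append(model)
        return fitted  # list of Transformers, playing the role of a fitted pipeline


rows = [{"sex": "Male"}, {"sex": "Female"}, {"sex": "Male"}]
fitted_stages = ToyPipeline([ToyIndexer("sex", "sexIndex")]).fit(rows)

out = rows
for model in fitted_stages:
    out = model.transform(out)
print(out)
```

The point to notice is the split of responsibilities: `fit()` looks at the data and returns a new object, and only that returned object has `transform()`.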



## Model Hyperparameter Tuning

In machine learning it is necessary to distinguish between model parameters and hyperparameters (structural parameters). Model parameters are adjusted during training (e.g., the weights of a linear model or the structure of a decision tree), while hyperparameters are set in advance (e.g., the regularization strength of a linear model or the maximum depth of a decision tree). Each model usually has many hyperparameters, and no single set of values works best for every task, so they have to be chosen per task. Grid search is commonly used for this: several candidate values are selected for each hyperparameter, a model is trained for every combination, and the combination that gives the best quality (in terms of the metric being optimized) is kept. For the comparison to be meaningful, the model must be assessed on data it was not trained on, which means splitting the data into training and testing samples.
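
The idea of grid search can be sketched in a few lines of plain Python. Everything here is an illustrative assumption: the toy data, the one-weight model, and the hyperparameter names `reg` and `power` are invented for the example and are unrelated to the Adult dataset used later.

```python
from itertools import product

# Toy data where y is roughly 2 * x. Training and validation samples are kept separate.
train = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
valid = [(5.0, 10.1), (6.0, 11.8)]


def fit(data, reg, power):
    """Learn the model parameter w (a single weight) for fixed hyperparameters reg and power."""
    sxx = sum((x ** power) ** 2 for x, _ in data)
    sxy = sum((x ** power) * y for x, y in data)
    return sxy / (sxx + reg)  # ridge-style closed form for one weight


def mse(data, w, power):
    """Mean squared error of the prediction w * x**power."""
    return sum((w * x ** power - y) ** 2 for x, y in data) / len(data)


# Candidate values for each hyperparameter.
grid = {"reg": [0.0, 1.0, 10.0], "power": [1, 2]}

results = []
for reg, power in product(grid["reg"], grid["power"]):
    w = fit(train, reg, power)                        # parameters come from the training sample
    results.append((mse(valid, w, power), reg, power))  # quality is measured on held-out data

best_score, best_reg, best_power = min(results)
print(len(results), "combinations tried; best reg:", best_reg, "best power:", best_power)
```

The weight `w` is a model parameter (learned by `fit`), whereas `reg` and `power` are hyperparameters (fixed before training and selected by comparing validation error).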

# Task 1: Building a logistic regression model using Spark's ML pipeline interface

### We are going to use the Adult dataset https://archive.ics.uci.edu/ml/datasets/Adult. This data is derived from census data and consists of information about individuals and their annual income. We will use this information to predict whether an individual earns <=50K or >50K a year. 


* **Configure your SparkContext. By default this uses a local master unless you call setMaster or change your environment variables.**

* **Use getOrCreate here so that evaluating the cell multiple times does not create multiple SparkContexts.**
    

In [382]:
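from pyspark.context import SparkContext
from pyspark.conf import SparkConf
from pyspark.sql.session import SparkSession

# Application settings; add .setMaster(...) here to override the default master
conf = SparkConf().setAppName("IntroductionSparkML")
# Reuse the running SparkContext if one exists, otherwise create it from conf
sc = SparkContext.getOrCreate(conf=conf)
# Despite its name, sqlContext is a SparkSession, the entry point for DataFrames
sqlContext = SparkSession.builder.getOrCreate()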

## Basic operation with PySpark 

**Now start by loading some data which is in csv format**

In [383]:
df = sqlContext.read.format("csv").option("header", "true").load("Desktop/adult.data")

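**Have a look at the DataFrame and write down the attributes and their types**

In [384]:
df.printSchema()

root
 |-- age: string (nullable = true)
 |-- workclass: string (nullable = true)
 |-- fnlwgt: string (nullable = true)
 |-- education: string (nullable = true)
 |-- education-num: string (nullable = true)
 |-- maritial-status: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- relationship: string (nullable = true)
 |-- race: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- capital-gain: string (nullable = true)
 |-- capital-loss: string (nullable = true)
 |-- hours-per-week: string (nullable = true)
 |-- native-country: string (nullable = true)
 |-- category: string (nullable = true)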

**Display a subset of the dataframe**

In [354]:
df.show(5)

+---+-----------------+-------+----------+-------------+-------------------+------------------+--------------+------+-------+------------+------------+--------------+--------------+--------+
|age|        workclass| fnlwgt| education|education-num|    maritial-status|        occupation|  relationship|  race|    sex|capital-gain|capital-loss|hours-per-week|native-country|category|
+---+-----------------+-------+----------+-------------+-------------------+------------------+--------------+------+-------+------------+------------+--------------+--------------+--------+
| 39|        State-gov|  77516| Bachelors|           13|      Never-married|      Adm-clerical| Not-in-family| White|   Male|        2174|           0|            40| United-States|   <=50K|
| 50| Self-emp-not-inc|  83311| Bachelors|           13| Married-civ-spouse|   Exec-managerial|       Husband| White|   Male|           0|           0|            13| United-States|   <=50K|
| 38|          Private| 215646|   HS-grad|   

**So far Spark has loaded every value as a string because we have not specified a schema. Now let Spark infer the schema (inference also copes with the extra leading space in the numeric fields).**

In [396]:
df = sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").load("Desktop/adult.data")

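**Have a look at the DataFrame and write down the attributes and their types (cache the DataFrame in memory)**

In [397]:
df.cache()  # keep the DataFrame in memory, since it is reused in the following cells
df.printSchema()

root
 |-- age: integer (nullable = true)
 |-- workclass: string (nullable = true)
 |-- fnlwgt: double (nullable = true)
 |-- education: string (nullable = true)
 |-- education-num: double (nullable = true)
 |-- maritial-status: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- relationship: string (nullable = true)
 |-- race: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- capital-gain: double (nullable = true)
 |-- capital-loss: double (nullable = true)
 |-- hours-per-week: double (nullable = true)
 |-- native-country: string (nullable = true)
 |-- category: string (nullable = true)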

**Get summary statistics of the data with describe(). It computes the count, mean, standard deviation, min, and max.**

In [398]:
df.describe().show()

+-------+------------------+------------+------------------+-------------+-----------------+---------------+-----------------+------------+-------------------+-------+------------------+----------------+------------------+--------------+--------+
|summary|               age|   workclass|            fnlwgt|    education|    education-num|maritial-status|       occupation|relationship|               race|    sex|      capital-gain|    capital-loss|    hours-per-week|native-country|category|
+-------+------------------+------------+------------------+-------------+-----------------+---------------+-----------------+------------+-------------------+-------+------------------+----------------+------------------+--------------+--------+
|  count|             32561|       32561|             32561|        32561|            32561|          32561|            32561|       32561|              32561|  32561|             32561|           32561|             32561|         32561|   32561|
|   mean| 38

**Now get the summary statistics of only one column (capital-gain) by passing the column name to describe()**

In [399]:
df.describe('capital-gain').show()

+-------+------------------+
|summary|      capital-gain|
+-------+------------------+
|  count|             32561|
|   mean|1077.6488437087312|
| stddev| 7385.292084840354|
|    min|               0.0|
|    max|           99999.0|
+-------+------------------+



**In some cases it is interesting to see descriptive statistics for a pair of columns. Count the number of people with income below or above 50K by age. This operation is called a crosstab.**

In [400]:
df.crosstab('age', 'category').sort("age_category").show(100)

+------------+------+-----+
|age_category| <=50K| >50K|
+------------+------+-----+
|          17|   395|    0|
|          18|   550|    0|
|          19|   710|    2|
|          20|   753|    0|
|          21|   717|    3|
|          22|   752|   13|
|          23|   865|   12|
|          24|   767|   31|
|          25|   788|   53|
|          26|   722|   63|
|          27|   754|   81|
|          28|   748|  119|
|          29|   679|  134|
|          30|   690|  171|
|          31|   705|  183|
|          32|   639|  189|
|          33|   684|  191|
|          34|   643|  243|
|          35|   659|  217|
|          36|   635|  263|
|          37|   566|  292|
|          38|   545|  282|
|          39|   538|  278|
|          40|   526|  268|
|          41|   529|  279|
|          42|   510|  270|
|          43|   497|  273|
|          44|   443|  281|
|          45|   446|  288|
|          46|   445|  292|
|          47|   420|  288|
|          48|   326|  217|
|          49|   371

**Use the drop() API to drop the education-num column. Note that drop() returns a new DataFrame, so df itself is unchanged.**

In [401]:
df.drop('education-num').columns

['age',
 'workclass',
 'fnlwgt',
 'education',
 'maritial-status',
 'occupation',
 'relationship',
 'race',
 'sex',
 'capital-gain',
 'capital-loss',
 'hours-per-week',
 'native-country',
 'category']

## Data preprocessing

**Data preprocessing is a critical step in machine learning. For instance, income is not a linear function of age: young people usually earn less than people in mid-career, and after retirement a household lives on its savings, so income decreases again. To capture this pattern, you can add the square of age as a feature. To add a new feature, you need to:**

* **Select the column**
* **Apply the transformation and add it to the DataFrame**

**After applying the transformation, print the schema**

In [402]:
from pyspark.sql.functions import col  # a wildcard import would shadow built-ins such as sum and max

# 1 Select the column
age_square = df.select(col("age")**2)

# 2 Apply the transformation and add it to the DataFrame
df = df.withColumn("age_square", col("age")**2)

df.printSchema()

root
 |-- age: integer (nullable = true)
 |-- workclass: string (nullable = true)
 |-- fnlwgt: double (nullable = true)
 |-- education: string (nullable = true)
 |-- education-num: double (nullable = true)
 |-- maritial-status: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- relationship: string (nullable = true)
 |-- race: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- capital-gain: double (nullable = true)
 |-- capital-loss: double (nullable = true)
 |-- hours-per-week: double (nullable = true)
 |-- native-country: string (nullable = true)
 |-- category: string (nullable = true)
 |-- age_square: double (nullable = true)



**You can see that age_square has been successfully added to the data frame. Now change the order of the variables with select; bring age_square right after age and then print the first record.**

In [403]:
COLUMNS = ['age', 'age_square', 'workclass', 'fnlwgt', 'education', 'education-num', 'maritial-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'category']
df = df.select(COLUMNS)
df.first()

Row(age=39, age_square=1521.0, workclass=' State-gov', fnlwgt=77516.0, education=' Bachelors', education-num=13.0, maritial-status=' Never-married', occupation=' Adm-clerical', relationship=' Not-in-family', race=' White', sex=' Male', capital-gain=2174.0, capital-loss=0.0, hours-per-week=40.0, native-country=' United-States', category=' <=50K')

**Spark has two machine learning APIs: the DataFrame-based API in pyspark.ml is the primary one, while the RDD-based API in pyspark.mllib has entered maintenance mode. We will use the DataFrame-based API.** 

**Similar to scikit-learn, PySpark has a pipeline API. A pipeline is a convenient way to keep the data-preparation steps organized: you push the data into the pipeline, the stages are applied in order, and the output is used to feed the algorithm.**

**Since we are going to try the logistic regression algorithm, we have to convert the categorical variables in the dataset into numeric variables. There are two ways to do this.**
  *  **Category Indexing:** assigns each category a numeric value from {0, 1, 2, ..., numCategories-1}. This introduces an implicit ordering among the categories, so it is more suitable for ordinal variables (e.g., Poor: 0, Average: 1, Good: 2).
  * **One-Hot Encoding:** converts categories into binary vectors with at most one nonzero value (e.g., Blue: [1, 0], Green: [0, 1], Red: [0, 0]).
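
The two encodings can be illustrated with a small plain-Python sketch. The helper names `index_categories` and `one_hot` are hypothetical, and the alphabetical index order is an assumption made for the example; it is not necessarily the order Spark would assign.

```python
def index_categories(values):
    """Category indexing: map each distinct value to 0..numCategories-1 (alphabetical order here)."""
    return {value: i for i, value in enumerate(sorted(set(values)))}


def one_hot(index, num_categories):
    """One-hot encoding that drops the last category, so the last one is encoded as all zeros."""
    vec = [0] * (num_categories - 1)
    if index < num_categories - 1:
        vec[index] = 1
    return vec


colors = ["Blue", "Green", "Red", "Blue"]
mapping = index_categories(colors)
encoded = {color: one_hot(i, len(mapping)) for color, i in mapping.items()}

print(mapping)
print(encoded)
```

Dropping the last category is what gives vectors with "at most one" nonzero value: three categories need only two positions, and the third category is the all-zero vector.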

**Here, we will use a combination of StringIndexer and OneHotEncoderEstimator (renamed OneHotEncoder in Spark 3.0) to convert the categorical variables.**
   
**Since we will have more than one stage of feature transformations, we use a Pipeline to tie the stages together. This simplifies our code.**

**1. Index each categorical column using the StringIndexer, and then convert the indexed categories into one-hot encoded variables. The resulting output has the binary vectors appended to the end of each row.** 

In [404]:
# Only the DataFrame-based pyspark.ml classes used in the following cells are imported
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler

In [405]:
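from pyspark.ml import Pipeline
# OneHotEncoderEstimator exists in Spark 2.3-2.4; Spark 3.0 renamed it to OneHotEncoder
try:
    from pyspark.ml.feature import OneHotEncoderEstimator
except ImportError:
    from pyspark.ml.feature import OneHotEncoder as OneHotEncoderEstimator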
CATE_FEATURES = ['workclass', 'education', 'maritial-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']
stages = [] # stages in our Pipeline
for categoricalCol in CATE_FEATURES:
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + "Index")
    encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()],
                                     outputCols=[categoricalCol + "classVec"])
    stages += [stringIndexer, encoder]

**2. Index the class label (category)**

Spark, like many other libraries, does not accept string values for the label. Convert the label column with StringIndexer and add it to the list of stages.

In [406]:
# Convert label into label indices using the StringIndexer
label_stringIdx =  StringIndexer(inputCol="category", outputCol="label")
stages += [label_stringIdx]

**3. Use a VectorAssembler to combine all the feature columns into a single vector column. This includes both the numeric columns and the one-hot encoded binary vector columns in our dataset.**


In [407]:
# Transform all features into a vector using VectorAssembler
numericCols = ["age", "age_square", "fnlwgt", "education-num", "capital-gain", "capital-loss", "hours-per-week"]
assemblerInputs = [c + "classVec" for c in CATE_FEATURES] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

**4. Now that all the steps are ready, push the data to the pipeline.**

In [408]:
# Create a Pipeline.
pipeline = Pipeline().setStages(stages)
pipelineModel = pipeline.fit(df)
PreparedData = pipelineModel.transform(df)

In [409]:
PreparedData.show(1)

+---+----------+----------+-------+----------+-------------+---------------+-------------+--------------+------+-----+------------+------------+--------------+--------------+--------+--------------+-----------------+--------------+-----------------+--------------------+-----------------------+---------------+------------------+-----------------+--------------------+---------+-------------+--------+-------------+-------------------+----------------------+-----+--------------------+
|age|age_square| workclass| fnlwgt| education|education-num|maritial-status|   occupation|  relationship|  race|  sex|capital-gain|capital-loss|hours-per-week|native-country|category|workclassIndex|workclassclassVec|educationIndex|educationclassVec|maritial-statusIndex|maritial-statusclassVec|occupationIndex|occupationclassVec|relationshipIndex|relationshipclassVec|raceIndex| raceclassVec|sexIndex|  sexclassVec|native-countryIndex|native-countryclassVec|label|            features|
+---+----------+----------+-

**Randomly split data into training and test sets. Set seed for reproducibility**

In [410]:
(trainingData, testData) = PreparedData.randomSplit([0.7, 0.3], seed=100)
print(trainingData.count())
print(testData.count())

22838
9723


**Create an initial LogisticRegression model and set the regularization parameter to 0.1**

In [411]:
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(labelCol="label", featuresCol="features", maxIter=10, regParam=0.1)


In [412]:

# Train model with Training Data
lrModel = lr.fit(trainingData)

In [413]:
# Make predictions on test data using the transform() method.
# LogisticRegressionModel.transform() only reads the 'features' column.
predictions = lrModel.transform(testData)

In [414]:
# View model's predictions and probabilities of each prediction class
# You can select any columns in the above schema to view as well. For example's sake we will choose age & occupation
selected = predictions.select("label", "prediction", "probability", "age", "occupation")
display(selected)


DataFrame[label: double, prediction: double, probability: vector, age: int, occupation: string]

**We can use BinaryClassificationEvaluator to evaluate our model. We can set the required column names in rawPredictionCol and labelCol Param and the metric in metricName Param.**

**Note that the default metric for the BinaryClassificationEvaluator is areaUnderROC**

In [415]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Evaluate model
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction")
evaluator.evaluate(predictions)

0.8922700536574224
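
To build intuition for the areaUnderROC value reported above: it can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes that quantity directly on four made-up examples. The function name `area_under_roc` and the data are assumptions for illustration; this shows what the metric measures, not how Spark computes it.

```python
def area_under_roc(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]   # made-up model scores for the positive class
auc = area_under_roc(labels, scores)
print(auc)  # 0.75: three of the four positive/negative pairs are ranked correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why the metric does not depend on choosing a classification threshold.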

**Now try the model with the ParamGridBuilder and the CrossValidator.**

**If you are unsure what params are available for tuning, you can use explainParams() to print a list of all params and their definitions.**


In [345]:
print(lr.explainParams())

aggregationDepth: suggested depth for treeAggregate (>= 2). (default: 2)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. (default: 0.0)
family: The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial (default: auto)
featuresCol: features column name. (default: features, current: features)
fitIntercept: whether to fit an intercept term. (default: True)
labelCol: label column name. (default: label, current: label)
lowerBoundsOnCoefficients: The lower bounds on coefficients if fitting under bound constrained optimization. The bound matrix must be compatible with the shape (1, number of features) for binomial regression, or (number of classes, number of features) for multinomial regression. (undefined)
lowerBoundsOnIntercepts: The lower bounds on intercepts if fitting under bound constrained optimization. The

**As we specify 3 values for regParam, 3 values for elasticNetParam, and 3 values for maxIter, this grid will have 3 x 3 x 3 = 27 parameter settings for CrossValidator to choose from. We will create a 5-fold cross validator.**

In [346]:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# Create ParamGrid for Cross Validation
paramGrid = (ParamGridBuilder()
             .addGrid(lr.regParam, [0.01, 0.5, 2.0])
             .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
             .addGrid(lr.maxIter, [1, 5, 10])
             .build())


In [347]:
# Create 5-fold CrossValidator
cv = CrossValidator(estimator=lr, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5)

# Run cross validations
cvModel = cv.fit(trainingData)
# This will likely take a fair amount of time because of the number of models being trained and evaluated
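
To see why the cross-validation run is slow, it helps to count the model fits. The sketch below enumerates the same grid in plain Python; the dictionary is only a stand-in for ParamGridBuilder, and the final "+ 1" assumes that the best setting is refitted once on all of the training data after cross-validation.

```python
from itertools import product

# Same candidate values as in the ParamGridBuilder cell
grid = {
    "regParam": [0.01, 0.5, 2.0],
    "elasticNetParam": [0.0, 0.5, 1.0],
    "maxIter": [1, 5, 10],
}
num_folds = 5

# One dict per parameter setting, i.e. the Cartesian product of all candidate values
param_maps = [dict(zip(grid, combo)) for combo in product(*grid.values())]

fits_during_cv = len(param_maps) * num_folds  # every setting is trained once per fold
total_fits = fits_during_cv + 1               # plus one final refit of the best setting

print(len(param_maps), fits_during_cv, total_fits)
```

Adding one more candidate value to any hyperparameter multiplies the work, so grids are usually kept small or coarse at first and refined around the best region.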

In [348]:
# Use the test set to measure the performance of our model on new data
predictions = cvModel.transform(testData)

In [349]:
# cvModel uses the best model found from the Cross Validation
# Evaluate best model
evaluator.evaluate(predictions)

0.9014437786492666