### Introduction to Machine Learning in Spark using MLlib

In this notebook we demonstrate some of the basics of machine learning using Spark's MLlib library. The examples are taken from Spark's documentation. Full details, along with the equivalent code in Scala, can be found [here](https://spark.apache.org/docs/latest/ml-pipeline.html).

Topics
- MLlib
- Vectors
- Transformers
- Estimators
- Models
- Pipelines
- Pipeline models
- Parameters

### Example from MLlib documentation (no pipeline)

This example uses Logistic Regression for classification. It highlights some of the most important classes in MLlib:
 - Transformers
 - Estimators
 - Parameters

**Note:** The first code cell shows the format of the DataFrame columns that MLlib requires for machine learning:
- the label (or dependent variable) as a `double`
- the features (independent variables) as a `Vector`
- the default column names for these are `label` and `features`, respectively

In [0]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression

# Prepare training data from a list of (label, features) tuples
# Note that MLlib requires the label (or dependent variable) to be of type double

# The column names "label" and "features" are the defaults in MLlib
training = spark.createDataFrame([
    (1.0, Vectors.dense([0.0, 1.1, 0.1])),
    (0.0, Vectors.dense([2.0, 1.0, -1.0])),
    (0.0, Vectors.dense([2.0, 1.3, 1.0])),
    (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"])

In [0]:
training.printSchema()

In [0]:
training.show()

Create a `LogisticRegression` instance. This instance is an **Estimator**.

In [0]:
# Create a LogisticRegression instance
lr = LogisticRegression(maxIter=10, regParam=0.01)

Print out the parameters, documentation, and any default values

In [0]:
print("Logistic Regression parameters:\n" + lr.explainParams() + "\n")

Learn a LogisticRegression model

In [0]:
# Learn a LogisticRegression model. This uses the parameters stored in lr.
model1 = lr.fit(training)

# Note that even though this is a toy problem, fitting the model initiates several Spark jobs (fitting a model requires many iterations)
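
To see why fitting is iterative, here is a minimal pure-Python sketch of L2-regularized logistic regression trained by batch gradient descent on the same four rows. This is not how MLlib fits the model: Spark uses its own optimizers and standardizes features by default, so the numbers will differ from `model1`. The function name `fit_logistic`, the learning rate, and the update rule are assumptions made for illustration; `max_iter` and `reg_param` only play roles analogous to `maxIter` and `regParam`.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real-valued margin to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, max_iter=10, reg_param=0.01, learning_rate=0.5):
    """Toy batch gradient descent; rows is a list of (label, features) tuples.

    learning_rate is a hypothetical knob for this sketch, not an MLlib parameter.
    """
    n = len(rows)
    n_features = len(rows[0][1])
    weights = [0.0] * n_features
    intercept = 0.0
    for _ in range(max_iter):  # each pass over the data is one iteration
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for label, features in rows:
            margin = intercept + sum(w * x for w, x in zip(weights, features))
            error = sigmoid(margin) - label
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        # Average gradient plus an L2 penalty on the weights (not the intercept)
        weights = [w - learning_rate * (g / n + reg_param * w)
                   for w, g in zip(weights, grad_w)]
        intercept -= learning_rate * grad_b / n
    return weights, intercept


rows = [
    (1.0, [0.0, 1.1, 0.1]),
    (0.0, [2.0, 1.0, -1.0]),
    (0.0, [2.0, 1.3, 1.0]),
    (1.0, [0.0, 1.2, -0.5]),
]
weights, intercept = fit_logistic(rows, max_iter=10, reg_param=0.01)
print("weights:", weights)
print("intercept:", intercept)
```

Each loop iteration needs a full pass over the data, which is why even a tiny dataset triggers several jobs when the data lives in a distributed DataFrame.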

Since `model1` is a Model (i.e. a transformer produced by an Estimator), we can view the parameters it used during `fit()`.

Note that the values for `maxIter` and `regParam` are as specified above. The other parameters are at their default values.

In [0]:
# Print the parameter (name: value) pairs, where names are unique IDs for
# this LogisticRegression instance

print("Model 1 was fit using parameters: ")
params = lr.extractParamMap()
for item in params:
    print(item.name, params[item])
    # print(item.name, item.doc, params[item]) # item.doc provides a short description of the parameter

We may alternatively specify parameters using a Python dictionary as a paramMap

In [0]:
# We may alternatively specify parameters using a Python dictionary as a paramMap
paramMap = {lr.maxIter: 20}
paramMap[lr.maxIter] = 30  # Specify 1 Param, overwriting the original maxIter.
paramMap.update({lr.regParam: 0.1, lr.threshold: 0.55})  # Specify multiple Params.

# You can combine paramMaps, which are Python dictionaries.
paramMap2 = {lr.probabilityCol: "myProbability"}  # Change output column name
paramMapCombined = paramMap.copy()
paramMapCombined.update(paramMap2)
    
# print (paramMapCombined)
params = paramMapCombined
for item in params:
    print(item.name, params[item])


Now learn a new model using the `paramMapCombined` parameters.

Values in `paramMapCombined` override any values for the same parameters that were set earlier on `lr` (in the constructor or via `lr.set*` methods).

In [0]:
model2 = lr.fit(training, paramMapCombined)
print("model2 is of type: " + str(type(model2)))

In [0]:
print("Model 2 was fit using parameters: ")
params = lr.extractParamMap(extra=paramMapCombined)
for item in params:
    print(item.name, params[item])
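
The precedence at work here (default values first, then values set on `lr`, then the extra param map) can be pictured with plain dictionaries. The sketch below is only an analogy: it uses string keys and a hypothetical helper `resolve_params`, whereas MLlib keys its param maps by `Param` objects, and the default values listed are included for illustration.

```python
def resolve_params(defaults, user_set, extra=None):
    """Merge three layers of parameters; later layers win on conflicts."""
    merged = dict(defaults)      # lowest priority: default values
    merged.update(user_set)      # values set on the estimator itself
    merged.update(extra or {})   # highest priority: param map passed at fit time
    return merged


# Illustrative values mirroring the cells above (string keys are an assumption)
defaults = {"maxIter": 100, "regParam": 0.0, "threshold": 0.5,
            "probabilityCol": "probability"}
user_set = {"maxIter": 10, "regParam": 0.01}
extra = {"maxIter": 30, "regParam": 0.1, "threshold": 0.55,
         "probabilityCol": "myProbability"}

resolved = resolve_params(defaults, user_set, extra)
for name, value in sorted(resolved.items()):
    print(name, value)
```

Because the merge copies the defaults instead of mutating them, the estimator's own settings are unchanged after the call, which matches the way `lr` can be reused with a different param map.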

In [0]:
# Examine the model coefficients
print(model2.coefficientMatrix)
print(model2.coefficients)

Make predictions on test data using the `Transformer.transform()` method. `LogisticRegressionModel.transform()` uses only the `features` column; the `label` column is not used when predicting.

Note that `model2.transform()` outputs a `myProbability` column instead of the usual `probability` column, since we changed the `lr.probabilityCol` parameter earlier.

In [0]:
# Prepare test data
test = spark.createDataFrame([
    (1.0, Vectors.dense([-1.0, 1.5, 1.3])),
    (0.0, Vectors.dense([3.0, 2.0, -0.1])),
    (1.0, Vectors.dense([0.0, 2.2, -1.5]))], ["label", "features"])

prediction = model2.transform(test)

result = prediction.select("features", "label", "myProbability", "prediction") \
    .collect()

for row in result:
    print("features=%s, label=%s -> prob=%s, prediction=%s"
          % (row.features, row.label, row.myProbability, row.prediction))

In [0]:
prediction.show()

In [0]:
prediction.head(3)
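
How the `prediction` column follows from the probability and the `threshold` parameter can be sketched in plain Python. The coefficients and intercept below are made-up values, not the ones learned by `model2`, and `predict_row` is a hypothetical helper. The part being illustrated is the decision rule: class 1 is predicted when its probability exceeds the threshold.

```python
import math


def predict_row(features, coefficients, intercept, threshold=0.5):
    """Return ([P(label=0), P(label=1)], prediction) for one feature list."""
    margin = intercept + sum(c * x for c, x in zip(coefficients, features))
    p1 = 1.0 / (1.0 + math.exp(-margin))
    probability = [1.0 - p1, p1]
    prediction = 1.0 if p1 > threshold else 0.0
    return probability, prediction


# Hypothetical model values, chosen only for this illustration
coefficients = [-1.0, 0.5, 0.2]
intercept = 0.1

test_rows = [[-1.0, 1.5, 1.3], [3.0, 2.0, -0.1], [0.0, 2.2, -1.5]]
predictions = []
for features in test_rows:
    probability, prediction = predict_row(
        features, coefficients, intercept, threshold=0.55)
    predictions.append(prediction)
    print("features=%s -> prob=%s, prediction=%s"
          % (features, probability, prediction))
```

Raising the threshold from 0.5 to 0.55, as `paramMap` does above, only changes the outcome for rows whose class-1 probability falls between the two values.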

### Example from MLlib documentation (with pipeline)

MLlib standardizes APIs for machine learning algorithms to make it easier to combine multiple algorithms into a single `pipeline`, or workflow. MLlib's Pipelines API was inspired by the scikit-learn project.

This example illustrates a machine learning pipeline that chains multiple Transformers and Estimators together to specify a simple ML workflow. It builds on the following concepts:
- DataFrames: This ML API uses DataFrames from Spark SQL as an ML dataset, which can hold a variety of data types. For example, a DataFrame could have different columns storing text, feature vectors, true labels, and predictions.
- Transformers: A Transformer is an algorithm which can transform one DataFrame into another DataFrame. For example, an ML model is a Transformer which transforms a DataFrame with features into a DataFrame with predictions.
- Estimators: An Estimator is an algorithm which can be fit on a DataFrame to produce a Transformer. For example, a learning algorithm is an Estimator which trains on a DataFrame and produces a model.
- Parameters: All Transformers and Estimators share a common API for specifying parameters.
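
The relationship between these pieces can be sketched with a few plain-Python classes. This is a toy model of the pattern, not MLlib's implementation: rows are dictionaries instead of DataFrame rows, and the classes `Tokenize`, `KeywordLearner`, `KeywordModel`, `ToyPipeline`, and `ToyPipelineModel` are hypothetical names invented for this illustration.

```python
class Tokenize:
    """Transformer: adds a 'words' field to every row."""

    def transform(self, rows):
        return [dict(row, words=row["text"].lower().split()) for row in rows]


class KeywordLearner:
    """Estimator: fit() learns the words seen only in positive rows."""

    def fit(self, rows):
        positive, negative = set(), set()
        for row in rows:
            target = positive if row["label"] == 1.0 else negative
            target.update(row["words"])
        return KeywordModel(positive - negative)


class KeywordModel:
    """Model: the Transformer produced by KeywordLearner.fit()."""

    def __init__(self, keywords):
        self.keywords = keywords

    def transform(self, rows):
        return [dict(row, prediction=1.0 if self.keywords & set(row["words"]) else 0.0)
                for row in rows]


class ToyPipeline:
    """Estimator made of stages; fit() returns a ToyPipelineModel."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):   # Estimator: replace it with its fitted model
                stage = stage.fit(rows)
            fitted.append(stage)
            rows = stage.transform(rows)  # feed the output to the next stage
        return ToyPipelineModel(fitted)


class ToyPipelineModel:
    """Transformer that applies every fitted stage in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


training_rows = [
    {"text": "a b c d e spark", "label": 1.0},
    {"text": "b d", "label": 0.0},
    {"text": "spark f g h", "label": 1.0},
    {"text": "hadoop mapreduce", "label": 0.0},
]
test_rows = [{"text": t} for t in
             ["spark i j k", "l m n", "spark hadoop spark", "apache hadoop"]]

model = ToyPipeline(stages=[Tokenize(), KeywordLearner()]).fit(training_rows)
predictions = model.transform(test_rows)
for row in predictions:
    print(row["text"], "->", row["prediction"])
```

The key design point is that fitting the pipeline turns every Estimator stage into a Model, so the fitted pipeline contains only Transformers and can be applied to unlabeled rows.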

In [0]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

# Prepare training documents from a list of (id, text, label) tuples.
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0)
], ["id", "text", "label"])

In [0]:
# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# Fit the pipeline to training documents.
model = pipeline.fit(training)

print(type(model))
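
What the first two stages compute can be approximated in plain Python. The sketch assumes whitespace tokenization with lowercasing and uses `zlib.crc32` as a stand-in hash; Spark's `HashingTF` uses a different hash function and a much larger default number of features, so the indices below will not match Spark's. The names `tokenize`, `bucket`, and `hashing_tf`, and the choice `num_features=16`, are assumptions made for this illustration.

```python
import zlib
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def bucket(word, num_features):
    """Map a word to a feature index with a stand-in hash (not Spark's)."""
    return zlib.crc32(word.encode("utf-8")) % num_features


def hashing_tf(words, num_features=16):
    """Return a sparse {index: term count} mapping for a list of words."""
    counts = Counter(bucket(word, num_features) for word in words)
    return dict(sorted(counts.items()))


words = tokenize("spark hadoop spark")
features = hashing_tf(words)
print(words)
print(features)
```

No vocabulary is stored: the index is computed directly from each word, at the cost of possible collisions when two different words land in the same bucket.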

In [0]:
# Prepare test documents, which are unlabeled (id, text) tuples.
test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "spark hadoop spark"),
    (7, "apache hadoop")
], ["id", "text"])

In [0]:
# Make predictions on test documents and examine resulting dataframe
prediction2 = model.transform(test)

# prediction2.show()
prediction2.head()

In [0]:
# Print columns of interest
selected = prediction2.select("id", "text", "probability", "prediction")
for row in selected.collect():
    rid, text, prob, prediction = row
    print("(%d, %s) --> prob=%s, prediction=%f" % (rid, text, str(prob), prediction))

### Hands On

![Hands-on](https://cis442f-open-data.s3.amazonaws.com/pictures/hands.png "Hands-on")


#### Exercises

Work through this code line by line.
1. What are DataFrames?
2. What are Transformers?
3. What are Estimators?
4. What are Pipelines?
5. How do you set model parameters?