In [16]:
import findspark
findspark.init()
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0)
], ["id", "text", "label"])

test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "spark hadoop spark"),
    (7, "apache hadoop")
], ["id", "text"])

training.show()

+---+----------------+-----+
| id|            text|label|
+---+----------------+-----+
|  0| a b c d e spark|  1.0|
|  1|             b d|  0.0|
|  2|     spark f g h|  1.0|
|  3|hadoop mapreduce|  0.0|
+---+----------------+-----+



# SparkML: Transformers and Estimators

![SparkML](sparkml.webp)

Rodrigo Agundez - 06 April 2023 - [SparkML Documentation](https://spark.apache.org/docs/latest/ml-pipeline.html)

# Main concepts

- ## Transformer
    A Transformer is an algorithm which can transform one DataFrame into another DataFrame.
- ## Estimator
    An Estimator is an algorithm which can be fit on a DataFrame to produce a Transformer.
- ## Pipeline
    A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow.

# Estimator fit -> Transformer

![SparkML Fit](sparkml_pipeline_fit.png)

Above, the top row represents a Pipeline with three stages. The first two (Tokenizer and HashingTF) are Transformers (blue), and the third (LogisticRegression) is an Estimator (red). The bottom row represents data flowing through the pipeline, where cylinders indicate DataFrames.

The Pipeline.fit() method is called on the original DataFrame, which has raw text documents and labels. The Tokenizer.transform() method splits the raw text documents into words, adding a new column with words to the DataFrame. The HashingTF.transform() method converts the words column into feature vectors, adding a new column with those vectors to the DataFrame. Now, since LogisticRegression is an Estimator, the Pipeline first calls LogisticRegression.fit() to produce a LogisticRegressionModel. If the Pipeline had more Estimators, it would call the LogisticRegressionModel’s transform() method on the DataFrame before passing the DataFrame to the next stage.
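
The control flow described above can be sketched in plain Python. This is a toy illustration, not Spark's implementation: the "DataFrames" are lists of dicts, and `Lowercase`, `VocabularyEstimator`, `VocabularyModel`, `fit_pipeline` and `transform_pipeline` are hypothetical names invented for the example.

```python
class Lowercase:
    """Toy Transformer: exposes transform() only."""

    def transform(self, rows):
        return [{**r, "text": r["text"].lower()} for r in rows]


class VocabularyModel:
    """Toy Transformer produced by fitting VocabularyEstimator."""

    def __init__(self, vocab):
        self.vocab = vocab

    def transform(self, rows):
        return [
            {**r, "known": [w for w in r["text"].split() if w in self.vocab]}
            for r in rows
        ]


class VocabularyEstimator:
    """Toy Estimator: fit() learns from the data and returns a Transformer."""

    def fit(self, rows):
        vocab = {w for r in rows for w in r["text"].split()}
        return VocabularyModel(vocab)


def fit_pipeline(stages, rows):
    """Fit every Estimator in order and return the list of fitted stages."""
    last_estimator = max(
        (i for i, s in enumerate(stages) if hasattr(s, "fit")), default=-1
    )
    fitted = []
    for i, stage in enumerate(stages):
        # An Estimator is replaced by the Transformer it produces.
        model = stage.fit(rows) if hasattr(stage, "fit") else stage
        fitted.append(model)
        # Data only needs to flow on while a later Estimator still has to be fit.
        if i < last_estimator:
            rows = model.transform(rows)
    return fitted


def transform_pipeline(fitted, rows):
    """Pass the data through every fitted stage in order."""
    for stage in fitted:
        rows = stage.transform(rows)
    return rows


train = [{"text": "Spark ML"}, {"text": "Hadoop MapReduce"}]
fitted = fit_pipeline([Lowercase(), VocabularyEstimator()], train)
result = transform_pipeline(fitted, [{"text": "Apache Spark"}])
print(result)
```

After fitting, every element of `fitted` has only a `transform()` method, which is why the fitted pipeline can be applied to new data without labels.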

# Transformer transform

![SparkML Transform](sparkml_pipeline_transform.png)

In the figure above, the PipelineModel has the same number of stages as the original Pipeline, but all Estimators in the original Pipeline have become Transformers. When the PipelineModel’s transform() method is called on a test dataset, the data are passed through the fitted pipeline in order. Each stage’s transform() method updates the dataset and passes it to the next stage.

# Example: Classifier of documents

In [17]:
training.show()

+---+----------------+-----+
| id|            text|label|
+---+----------------+-----+
|  0| a b c d e spark|  1.0|
|  1|             b d|  0.0|
|  2|     spark f g h|  1.0|
|  3|hadoop mapreduce|  0.0|
+---+----------------+-----+



# Build Pipeline

![SparkML Pipeline Estimator](sparkml_pipeline_estimator.png)

In [18]:
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
pipeline

Pipeline_3847b0efdcec

# Fit the pipeline

![SparkML Pipeline Estimator Fit](sparkml_pipeline_estimator_fit.png)

In [19]:
doc_transformer = pipeline.fit(training)
doc_transformer

PipelineModel_c9e6dcc16c74

# Transform a DataFrame

![SparkML Pipeline Transformer Transform](sparkml_pipeline_transformer_transform.png)

In [6]:
doc_transformer.transform(training).show()

+---+----------------+-----+--------------------+--------------------+--------------------+--------------------+----------+
| id|            text|label|               words|            features|       rawPrediction|         probability|prediction|
+---+----------------+-----+--------------------+--------------------+--------------------+--------------------+----------+
|  0| a b c d e spark|  1.0|[a, b, c, d, e, s...|(262144,[74920,89...|[-5.9388192690225...|[0.00262821349694...|       1.0|
|  1|             b d|  0.0|              [b, d]|(262144,[89530,14...|[5.62050636899133...|[0.99639027118011...|       0.0|
|  2|     spark f g h|  1.0|    [spark, f, g, h]|(262144,[36803,17...|[-6.1134100250082...|[0.00220810505702...|       1.0|
|  3|hadoop mapreduce|  0.0| [hadoop, mapreduce]|(262144,[132966,1...|[6.66214714866364...|[0.99872323370637...|       0.0|
+---+----------------+-----+--------------------+--------------------+--------------------+--------------------+----------+



# Save pipeline

In [7]:
doc_transformer.write().overwrite().save("doc_transformer.pipeline")

                                                                                

# Transform another DataFrame

In [10]:
%reset -f
from pyspark.sql import SparkSession
from pyspark.ml import PipelineModel
spark = SparkSession.builder.getOrCreate()
test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "spark hadoop spark"),
    (7, "apache hadoop")
], ["id", "text"])
test.show()

+---+------------------+
| id|              text|
+---+------------------+
|  4|       spark i j k|
|  5|             l m n|
|  6|spark hadoop spark|
|  7|     apache hadoop|
+---+------------------+



In [11]:
doc_transformer = PipelineModel.load("doc_transformer.pipeline")
doc_transformer.transform(test).show()

+---+------------------+--------------------+--------------------+--------------------+--------------------+----------+
| id|              text|               words|            features|       rawPrediction|         probability|prediction|
+---+------------------+--------------------+--------------------+--------------------+--------------------+----------+
|  4|       spark i j k|    [spark, i, j, k]|(262144,[19036,68...|[0.52882855227968...|[0.62920984896684...|       0.0|
|  5|             l m n|           [l, m, n]|(262144,[1303,526...|[4.16914139534005...|[0.98477000676230...|       0.0|
|  6|spark hadoop spark|[spark, hadoop, s...|(262144,[173558,1...|[-1.8649814141188...|[0.13412348342566...|       1.0|
|  7|     apache hadoop|    [apache, hadoop]|(262144,[68303,19...|[5.41564427200184...|[0.99557321143985...|       0.0|
+---+------------------+--------------------+--------------------+--------------------+--------------------+----------+



# Custom Transformer and Estimator

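Writing a custom Transformer or Estimator in PySpark is not easy, especially when it has to be saved and loaded as part of a pipeline. Two useful starting points: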

- [StackOverflow: Create a custom Transformer in PySpark ML](https://stackoverflow.com/questions/32331848/create-a-custom-transformer-in-pyspark-ml)
- [StackOverflow: Serialize a custom transformer using python to be used within a Pyspark ML pipeline](https://stackoverflow.com/questions/41399399/serialize-a-custom-transformer-using-python-to-be-used-within-a-pyspark-ml-pipel/44377489#44377489)

# [Custom SparkML](04-custom_sparkml.ipynb)