In [1]:
# create a spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").\
                                     appName("spark_on_docker").\
                                     getOrCreate()



----------------------------------------------------
    Formatting Models According to Your Use Case
----------------------------------------------------

To preprocess data for Spark’s different advanced analytics tools, you must consider your end objective. The following list walks through the requirements for input data structure for each advanced analytics task in MLlib:

    - In the case of most classification and regression algorithms, you want to get your data into a column of type Double to represent the label and a column of type Vector (either dense or sparse) to represent the features.

    - In the case of recommendation, you want to get your data into a column of users, a column of items (say movies or books), and a column of ratings.

    - In the case of unsupervised learning, a column of type Vector (either dense or sparse) is needed to represent the features.

    - In the case of graph analytics, you will want a DataFrame of vertices and a DataFrame of edges.


The best way to get your data in these formats is through transformers. Transformers are functions that accept a DataFrame as an argument and return a new DataFrame as a response.

Before we proceed, we’re going to read in several different sample datasets, each of which has
different properties we will manipulate in this chapter:

In [2]:
spark.conf.set("spark.sql.shuffle.partitions", 5)

sales = spark.read.format("csv")\
    .option("header", "true")\
    .option("inferSchema", "true")\
    .load("work/TheDefinitiveGuide/Spark-The-Definitive-Guide/data/retail-data/by-day/*.csv")\
    .coalesce(5)\
    .where("Description IS NOT NULL")
fakeIntDF = spark.read.parquet("work/TheDefinitiveGuide/Spark-The-Definitive-Guide/data/simple-ml-integers")
simpleDF = spark.read.json("work/TheDefinitiveGuide/Spark-The-Definitive-Guide/data/simple-ml")
scaleDF = spark.read.parquet("work/TheDefinitiveGuide/Spark-The-Definitive-Guide/data/simple-ml-scaling")



In [3]:
sales.cache()
sales.show(5)


+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|   580538|    23084|  RABBIT NIGHT LIGHT|      48|2011-12-05 08:38:00|     1.79|   14075.0|United Kingdom|
|   580538|    23077| DOUGHNUT LIP GLOSS |      20|2011-12-05 08:38:00|     1.25|   14075.0|United Kingdom|
|   580538|    22906|12 MESSAGE CARDS ...|      24|2011-12-05 08:38:00|     1.65|   14075.0|United Kingdom|
|   580538|    21914|BLUE HARMONICA IN...|      24|2011-12-05 08:38:00|     1.25|   14075.0|United Kingdom|
|   580538|    22467|   GUMBALL COAT RACK|       6|2011-12-05 08:38:00|     2.55|   14075.0|United Kingdom|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
only showing top 5 rows




--------------------
    Transformers
--------------------

Transformers are functions that convert raw data in some way. This might be to create a new interaction variable (from two other variables), to normalize a column, or to simply turn it into a Double to be input into a model. Transformers are primarily used in preprocessing or feature generation.

A Spark transformer exposes only a transform method, because its behavior does not depend on the input data.

Estimators for Preprocessing

An estimator is necessary when a transformation you would like to perform must be initialized with data or information about the input column (often derived by doing a pass over the input column itself). 

For example, if you wanted to scale the values in our column to have mean zero and unit variance, you would need to perform a pass over the entire data in order to calculate the values you would use to normalize the data to mean zero and unit variance. In effect, an estimator can be a transformer configured according to your particular input data. In simplest terms, you can either blindly apply a transformation (a “regular” transformer type) or perform a transformation based on your data (an estimator type).

An example of this type of estimator is the StandardScaler, which scales your input column according to the mean and standard deviation of the values in that column, so that each dimension has zero mean and a variance of 1. For that reason it must first perform a pass over the data to create the transformer.
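
The split between an estimator and the transformer it produces can be sketched in plain Python, without Spark. The class names MeanCenterer and MeanCentererModel are hypothetical and exist only for this illustration; they are not part of MLlib.

```python
from statistics import mean


class MeanCentererModel:
    """Hypothetical 'transformer': holds a value learned from data."""

    def __init__(self, learned_mean):
        self.learned_mean = learned_mean

    def transform(self, values):
        # Applies the stored mean; no further pass over the training data.
        return [v - self.learned_mean for v in values]


class MeanCenterer:
    """Hypothetical 'estimator': must see the data before it can transform."""

    def fit(self, values):
        # One pass over the input to learn the parameter.
        return MeanCentererModel(mean(values))


train = [1.0, 2.0, 3.0, 6.0]
model = MeanCenterer().fit(train)
centered = model.transform(train)
print(centered)  # [-2.0, -1.0, 0.0, 3.0]
```

The fitted model can then be applied to new data without recomputing the mean, which is the same fit-then-transform pattern the scaler cells in this notebook follow.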

High-Level Transformers

High-level transformers allow you to concisely specify a number of transformations in one. These operate at a “high level”, and allow you to avoid doing data manipulations or transformations one by one. 

In general, you should try to use the highest level transformers you can, in order to minimize the risk of error and help you focus on the business problem instead of the smaller details of implementation. While this is not always possible, it’s a good objective.

RFormula

The RFormula is the easiest transformer to use when you have “conventionally” formatted data. 

Spark borrows this transformer from the R language to make it simple to declaratively specify a set of transformations for your data.

The RFormula will automatically handle categorical inputs (specified as strings) by performing something called one-hot encoding. 
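
As a rough illustration of what one-hot encoding means, here is a minimal plain-Python sketch. The helper one_hot is hypothetical, and the category ordering it uses (alphabetical) is an assumption; the layout Spark actually produces, including whether a reference category is dropped, may differ.

```python
def one_hot(values):
    """Map each distinct string to an indicator vector (illustrative only)."""
    categories = sorted(set(values))            # fixed, reproducible ordering
    index = {c: i for i, c in enumerate(categories)}
    encoded = []
    for v in values:
        row = [0.0] * len(categories)
        row[index[v]] = 1.0                     # exactly one position is "hot"
        encoded.append(row)
    return categories, encoded


categories, encoded = one_hot(["green", "blue", "red", "blue"])
print(categories)  # ['blue', 'green', 'red']
print(encoded[0])  # [0.0, 1.0, 0.0]
```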

With the RFormula, numeric columns will be cast to Double but will not be one-hot encoded. If the label column is of type String, it will be first transformed to Double with StringIndexer.

The RFormula allows you to specify your transformations in declarative syntax. It is simple to use once you understand the syntax. Currently, RFormula supports a limited subset of the R operators that in practice work quite well for simple transformations. The basic operators are:
    
    ~
    Separate target and terms

    +
    Concatenate terms; “+ 0” means removing the intercept (this means the y-intercept of the line
    that we will fit will be 0)
    
    -
    Remove a term; “- 1” means removing intercept (this means the y-intercept of the line that
    we will fit will be 0)
    
    :
    Interaction (multiplication for numeric values, or binarized categorical values)
    
    .
    All columns except the target/dependent variable

RFormula also uses default columns of label and features to label, you guessed it, the label and the set of features that it outputs (for supervised machine learning).
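
To make the ":" interaction operator more concrete, the following plain-Python sketch multiplies a one-hot encoded category by a numeric value, which is roughly what a term such as color:value1 expresses. The function interact and the indicator layout [is_blue, is_green, is_red] are assumptions made for illustration, not Spark's internal representation.

```python
def interact(indicators, numeric):
    """Multiply every indicator in a one-hot row by a numeric value."""
    return [flag * numeric for flag in indicators]


# Assumed layout for illustration: [is_blue, is_green, is_red]
green = [0.0, 1.0, 0.0]
green_times_value = interact(green, 15.0)
print(green_times_value)  # [0.0, 15.0, 0.0]
```

Only the slot for the row's own category carries the numeric value, so a model can learn a separate slope for each category.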

In [4]:
from pyspark.ml.feature import RFormula

supervised = RFormula(formula="lab ~ . + color:value1 + color:value2")
supervised.fit(simpleDF).transform(simpleDF).show(10)


+-----+----+------+------------------+--------------------+-----+
|color| lab|value1|            value2|            features|label|
+-----+----+------+------------------+--------------------+-----+
|green|good|     1|14.386294994851129|(10,[1,2,3,5,8],[...|  1.0|
| blue| bad|     8|14.386294994851129|(10,[2,3,6,9],[8....|  0.0|
| blue| bad|    12|14.386294994851129|(10,[2,3,6,9],[12...|  0.0|
|green|good|    15| 38.97187133755819|(10,[1,2,3,5,8],[...|  1.0|
|green|good|    12|14.386294994851129|(10,[1,2,3,5,8],[...|  1.0|
|green| bad|    16|14.386294994851129|(10,[1,2,3,5,8],[...|  0.0|
|  red|good|    35|14.386294994851129|(10,[0,2,3,4,7],[...|  1.0|
|  red| bad|     1| 38.97187133755819|(10,[0,2,3,4,7],[...|  0.0|
|  red| bad|     2|14.386294994851129|(10,[0,2,3,4,7],[...|  0.0|
|  red| bad|    16|14.386294994851129|(10,[0,2,3,4,7],[...|  0.0|
+-----+----+------+------------------+--------------------+-----+
only showing top 10 rows



SQL Transformers

A SQLTransformer allows you to leverage Spark’s vast library of SQL-related manipulations just as you would an MLlib transformation. 

Any SELECT statement you can use in SQL is a valid transformation. The only thing you need to change is that instead of using the table name, you should just use the keyword __THIS__. 

You might want to use SQLTransformer if you want to formally codify some DataFrame manipulation as a preprocessing step, or try different SQL expressions for features during hyperparameter tuning. Also note that the output DataFrame contains only the columns produced by the SELECT statement; to carry the input columns through, select them explicitly.

You might also use a SQLTransformer to represent all of your manipulations on the rawest form of your data, so you can version different variations of manipulations as transformers. This gives you the benefit of building and testing varying pipelines, all by simply swapping out transformers.
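
One way to picture the placeholder is as a stand-in for a table name that is filled in when the statement runs. The sketch below uses Python's built-in sqlite3 module rather than Spark, so it is only an analogy; the table name sales_sample and its three rows are made up for illustration, and the ORDER BY clause is added only to make the output order deterministic.

```python
import sqlite3

statement = """
    SELECT sum(Quantity), count(*), CustomerID
    FROM __THIS__
    GROUP BY CustomerID
    ORDER BY CustomerID
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_sample (CustomerID REAL, Quantity INTEGER)")
conn.executemany(
    "INSERT INTO sales_sample VALUES (?, ?)",
    [(14075.0, 48), (14075.0, 20), (12782.0, 6)],  # made-up sample rows
)

# Swap the placeholder for a real table name, then run the query.
rows = conn.execute(statement.replace("__THIS__", "sales_sample")).fetchall()
print(rows)  # [(6, 1, 12782.0), (68, 2, 14075.0)]
conn.close()
```

The result has exactly the three columns named in the SELECT, which mirrors the shape of the SQLTransformer output in this notebook.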

In [5]:
from pyspark.ml.feature import SQLTransformer

basicTransformation = SQLTransformer()\
    .setStatement("""
                    SELECT sum(Quantity), count(*), CustomerID
                    FROM __THIS__
                    GROUP BY CustomerID
    """)

basicTransformation.transform(sales).show(10)




+-------------+--------+----------+
|sum(Quantity)|count(1)|CustomerID|
+-------------+--------+----------+
|         1721|     119|   18180.0|
|         1070|     107|   12782.0|
|          701|      59|   17402.0|
|          478|      35|   16642.0|
|          477|      28|   16811.0|
|          986|      71|   15053.0|
|         1419|      50|   12913.0|
|          445|      43|   12628.0|
|         4505|     183|   14401.0|
|          271|      20|   16851.0|
+-------------+--------+----------+
only showing top 10 rows




-----------------------
    VectorAssembler
-----------------------

The VectorAssembler is a tool you’ll use in nearly every single pipeline you generate. It helps concatenate all your features into one big vector you can then pass into an estimator. 

It’s used typically in the last step of a machine learning pipeline and takes as input a number of columns of Boolean, Double, or Vector. This is particularly helpful if you’re going to perform a number of manipulations using a variety of transformers and need to gather all of those results together.
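
Conceptually, assembling amounts to flattening a row of scalars and vectors into a single list of doubles. The helper assemble below is a hypothetical plain-Python sketch of that idea, not the MLlib implementation.

```python
def assemble(row):
    """Flatten booleans, numbers, and lists of numbers into one list of floats."""
    out = []
    for value in row:
        if isinstance(value, (list, tuple)):
            out.extend(float(x) for x in value)   # vector column: append each element
        else:
            out.append(float(value))              # scalar column: True becomes 1.0
    return out


ints_only = assemble([1, 2, 3])
mixed = assemble([True, 2.5, [0.0, 4.0]])
print(ints_only)  # [1.0, 2.0, 3.0]
print(mixed)      # [1.0, 2.5, 0.0, 4.0]
```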

In [7]:
from pyspark.ml.feature import VectorAssembler

va = VectorAssembler().setInputCols(["int1", "int2", "int3"])
va.transform(fakeIntDF).show()


+----+----+----+------------------------------------+
|int1|int2|int3|VectorAssembler_6c6d56b8b957__output|
+----+----+----+------------------------------------+
|   1|   2|   3|                       [1.0,2.0,3.0]|
|   4|   5|   6|                       [4.0,5.0,6.0]|
|   7|   8|   9|                       [7.0,8.0,9.0]|
+----+----+----+------------------------------------+



Working with Continuous Features

Continuous features are just values on the number line, from positive infinity to negative infinity.

There are two common families of transformers for continuous features. 

You can convert continuous features into categorical features via a process called bucketing, 
or you can scale and normalize your features according to several different requirements. 

These transformers will only work on Double types, so make sure you’ve turned any other numerical values to Double:

In [8]:
contDF = spark.range(20).selectExpr("cast(id as double)")

In [10]:
contDF.show(10)

+---+
| id|
+---+
|0.0|
|1.0|
|2.0|
|3.0|
|4.0|
|5.0|
|6.0|
|7.0|
|8.0|
|9.0|
+---+
only showing top 10 rows



-------------------
    Bucketing
-------------------

The most straightforward approach to bucketing or binning is using the Bucketizer. This will split a given continuous feature into the buckets of your designation. You specify how buckets should be created via an array or list of Double values. This is useful because you may want to simplify the features in your dataset or simplify their representations for interpretation later on.

For example, imagine you have a column that represents a person’s weight and you would like to predict some value based on this information. In some cases, it might be simpler to create three buckets of “overweight,” “average,” and “underweight.” 

To specify the bucket, set its borders. For example, setting splits to 5.0, 10.0, 250.0 on our contDF will actually fail because we don’t cover all possible input ranges. When specifying your bucket points, the values you pass into splits must satisfy three requirements:

    - The minimum value in your splits array must be less than or equal to the minimum value in your DataFrame.
    - The maximum value in your splits array must be greater than or equal to the maximum value in your DataFrame.
    - You need to specify at a minimum three values in the splits array, which creates two buckets.
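
The effect of these requirements can be sketched in plain Python with the standard bisect module. The function bucketize is hypothetical; it treats each bucket as a half-open interval and lets the last bucket include its upper bound, which is an assumption about the intended semantics rather than a copy of Spark's code.

```python
from bisect import bisect_right


def bucketize(value, splits):
    """Return the index of the bucket [splits[i], splits[i+1]) containing value."""
    if len(splits) < 3:
        raise ValueError("need at least three split points (two buckets)")
    if value < splits[0] or value > splits[-1]:
        raise ValueError(f"{value} is outside the range covered by splits")
    if value == splits[-1]:
        return len(splits) - 2       # the last bucket also includes its upper bound
    return bisect_right(splits, value) - 1


borders = [-1.0, 5.0, 10.0, 250.0, 600.0]
buckets = [bucketize(float(x), borders) for x in range(20)]
print(buckets)  # five 0s, five 1s, then ten 2s
```

Calling bucketize(0.0, [5.0, 10.0, 250.0]) raises a ValueError, because 0.0 lies below the lowest split.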

In addition to splitting based on hardcoded values, another option is to split based on percentiles in our data. This is done with QuantileDiscretizer, which will bucket the values into a user-specified number of buckets, with the splits being determined by approximate quantile values. 
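
The idea of quantile-based splits can be sketched with the standard library. This is only an approximation of the concept: it uses statistics.quantiles to compute exact cut points, whereas QuantileDiscretizer uses an approximate algorithm, so the bucket boundaries need not match Spark's output.

```python
from bisect import bisect_right
from collections import Counter
from statistics import quantiles

values = [float(x) for x in range(20)]

# Four interior cut points divide the data into five groups.
cuts = quantiles(values, n=5)

# Bucket index = number of cut points at or below the value.
buckets = [bisect_right(cuts, v) for v in values]
counts = Counter(buckets)
print(cuts)
print(sorted(counts.items()))  # [(0, 4), (1, 4), (2, 4), (3, 4), (4, 4)]
```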

In [11]:
from pyspark.ml.feature import Bucketizer
bucketBorders = [-1.0, 5.0, 10.0, 250.0, 600.0]
bucketer = Bucketizer().setSplits(bucketBorders).setInputCol("id")
bucketer.transform(contDF).show()

+----+-------------------------------+
|  id|Bucketizer_76b264ea51eb__output|
+----+-------------------------------+
| 0.0|                            0.0|
| 1.0|                            0.0|
| 2.0|                            0.0|
| 3.0|                            0.0|
| 4.0|                            0.0|
| 5.0|                            1.0|
| 6.0|                            1.0|
| 7.0|                            1.0|
| 8.0|                            1.0|
| 9.0|                            1.0|
|10.0|                            2.0|
|11.0|                            2.0|
|12.0|                            2.0|
|13.0|                            2.0|
|14.0|                            2.0|
|15.0|                            2.0|
|16.0|                            2.0|
|17.0|                            2.0|
|18.0|                            2.0|
|19.0|                            2.0|
+----+-------------------------------+



In [16]:
from pyspark.ml.feature import QuantileDiscretizer
bucketer = QuantileDiscretizer().setNumBuckets(5).setInputCol("id")
fittedBucketer = bucketer.fit(contDF)
fittedBucketer.transform(contDF).show()

+----+----------------------------------------+
|  id|QuantileDiscretizer_d5e52fdec202__output|
+----+----------------------------------------+
| 0.0|                                     0.0|
| 1.0|                                     0.0|
| 2.0|                                     0.0|
| 3.0|                                     1.0|
| 4.0|                                     1.0|
| 5.0|                                     1.0|
| 6.0|                                     1.0|
| 7.0|                                     2.0|
| 8.0|                                     2.0|
| 9.0|                                     2.0|
|10.0|                                     2.0|
|11.0|                                     3.0|
|12.0|                                     3.0|
|13.0|                                     3.0|
|14.0|                                     3.0|
|15.0|                                     4.0|
|16.0|                                     4.0|
|17.0|                                     4.0|
|18.0|                                     4.0|
|19.0|                                     4.0|
+----+----------------------------------------+

-----------------------------------
    Scaling and Normalization
-----------------------------------

Scaling and normalization are useful when your data contains a number of columns based on different scales. 

For instance, say we have a DataFrame with two columns: weight (in ounces) and height (in feet). If you don’t scale or normalize, the algorithm will be less sensitive to variations in height because height values in feet are much lower than weight values in ounces. That’s an example where you should scale your data. 

An example of normalization might involve transforming the data so that each point’s value is a representation of its distance from the mean of that column.

In MLlib, this is always done on columns of type Vector. MLlib will look across all the rows in a given column (of type Vector) and then treat every dimension in those vectors as its own particular column. It will then apply the scaling or normalization function on each dimension separately.

----------------------
    StandardScaler
----------------------

The StandardScaler standardizes a set of features to have zero mean and a standard deviation of 1. The flag withStd (true by default) scales the data to unit standard deviation, while the flag withMean (false by default) centers the data prior to scaling it. With the defaults, the features are therefore scaled but not centered.
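
The arithmetic behind both flags can be checked in plain Python. The rows below are the five feature vectors of scaleDF as printed in this notebook; using the sample standard deviation (dividing by n - 1) is an assumption that happens to reproduce the printed values.

```python
from statistics import mean, stdev

rows = [
    [1.0, 0.1, -1.0],
    [2.0, 1.1, 1.0],
    [1.0, 0.1, -1.0],
    [2.0, 1.1, 1.0],
    [3.0, 10.1, 3.0],
]

columns = list(zip(*rows))                 # one tuple per vector dimension
stds = [stdev(col) for col in columns]     # sample standard deviation (n - 1)
means = [mean(col) for col in columns]

# withStd=True, withMean=False: divide by the standard deviation only.
scaled = [[x / s for x, s in zip(row, stds)] for row in rows]

# withStd=True, withMean=True: subtract the mean first, then divide.
standardized = [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

print(round(scaled[0][0], 6))  # 1.195229
```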

In [20]:
from pyspark.ml.feature import StandardScaler
sScaler = StandardScaler().setInputCol("features")
sScaler.fit(scaleDF).transform(scaleDF).show()


+---+--------------+-----------------------------------+
| id|      features|StandardScaler_9ee186f65881__output|
+---+--------------+-----------------------------------+
|  0|[1.0,0.1,-1.0]|               [1.19522860933439...|
|  1| [2.0,1.1,1.0]|               [2.39045721866878...|
|  0|[1.0,0.1,-1.0]|               [1.19522860933439...|
|  1| [2.0,1.1,1.0]|               [2.39045721866878...|
|  1|[3.0,10.1,3.0]|               [3.58568582800318...|
+---+--------------+-----------------------------------+



MinMaxScaler

The MinMaxScaler will scale the values in a vector (component-wise) to the proportional values on a scale from a given min value to a max value. If you specify the minimum value to be 0 and the maximum value to be 1, then all the values will fall between 0 and 1. The example below uses a range of 5 to 10 instead:
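
As a worked version of the rescaling, the sketch below applies (x - col_min) / (col_max - col_min) * (max - min) + min to each dimension. The function min_max_scale is hypothetical, and it assumes every column has at least two distinct values so the denominator is never zero. The input uses the three distinct feature vectors from scaleDF.

```python
def min_max_scale(rows, new_min=0.0, new_max=1.0):
    """Rescale each column of rows linearly into [new_min, new_max]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [
            # Assumes hi != lo for every column.
            (x - lo) / (hi - lo) * (new_max - new_min) + new_min
            for x, lo, hi in zip(row, lows, highs)
        ]
        for row in rows
    ]


rows = [[1.0, 0.1, -1.0], [2.0, 1.1, 1.0], [3.0, 10.1, 3.0]]
rescaled = min_max_scale(rows, new_min=5.0, new_max=10.0)
print([[round(x, 6) for x in row] for row in rescaled])
```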

In [23]:
from pyspark.ml.feature import MinMaxScaler
minMax = MinMaxScaler().setMin(5).setMax(10).setInputCol("features")
fittedminMax = minMax.fit(scaleDF)
fittedminMax.transform(scaleDF).show()

+---+--------------+---------------------------------+
| id|      features|MinMaxScaler_17b9eb1feb0e__output|
+---+--------------+---------------------------------+
|  0|[1.0,0.1,-1.0]|                    [5.0,5.0,5.0]|
|  1| [2.0,1.1,1.0]|                    [7.5,5.5,7.5]|
|  0|[1.0,0.1,-1.0]|                    [5.0,5.0,5.0]|
|  1| [2.0,1.1,1.0]|                    [7.5,5.5,7.5]|
|  1|[3.0,10.1,3.0]|                 [10.0,10.0,10.0]|
+---+--------------+---------------------------------+



MaxAbsScaler

The max absolute scaler (MaxAbsScaler) scales the data by dividing each value by the maximum absolute value in that feature. All values therefore end up between −1 and 1. This transformer does not shift or center the data at all in the process:
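
The same computation in plain Python, as a minimal sketch: the function max_abs_scale is hypothetical and assumes no column consists entirely of zeros. The input uses the three distinct feature vectors from scaleDF.

```python
def max_abs_scale(rows):
    """Divide each column by its largest absolute value; no shifting."""
    columns = list(zip(*rows))
    max_abs = [max(abs(x) for x in col) for col in columns]  # assumed nonzero
    return [[x / m for x, m in zip(row, max_abs)] for row in rows]


rows = [[1.0, 0.1, -1.0], [2.0, 1.1, 1.0], [3.0, 10.1, 3.0]]
scaled = max_abs_scale(rows)
print(scaled[2])  # [1.0, 1.0, 1.0]
```

Because nothing is subtracted, zeros stay zero and signs are preserved: the -1.0 in the first row becomes a negative value rather than being shifted into a positive range.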

In [24]:
from pyspark.ml.feature import MaxAbsScaler
maScaler = MaxAbsScaler().setInputCol("features")
fittedmaScaler = maScaler.fit(scaleDF)
fittedmaScaler.transform(scaleDF).show()

+---+--------------+---------------------------------+
| id|      features|MaxAbsScaler_bcddc0d98dbc__output|
+---+--------------+---------------------------------+
|  0|[1.0,0.1,-1.0]|             [0.33333333333333...|
|  1| [2.0,1.1,1.0]|             [0.66666666666666...|
|  0|[1.0,0.1,-1.0]|             [0.33333333333333...|
|  1| [2.0,1.1,1.0]|             [0.66666666666666...|
|  1|[3.0,10.1,3.0]|                    [1.0,1.0,1.0]|
+---+--------------+---------------------------------+



ElementwiseProduct

The ElementwiseProduct allows us to scale each value in a vector by an arbitrary value. For example, given the vector below and the row “1, 0.1, -1” the output will be “10, 1.5, -20.” Naturally the dimensions of the scaling vector must match the dimensions of the vector inside the relevant column:

In [25]:
from pyspark.ml.feature import ElementwiseProduct
from pyspark.ml.linalg import Vectors
scaleUpVec = Vectors.dense(10.0, 15.0, 20.0)
scalingUp = ElementwiseProduct()\
    .setScalingVec(scaleUpVec)\
    .setInputCol("features")
scalingUp.transform(scaleDF).show()

+---+--------------+---------------------------------------+
| id|      features|ElementwiseProduct_32eb16dea613__output|
+---+--------------+---------------------------------------+
|  0|[1.0,0.1,-1.0]|                       [10.0,1.5,-20.0]|
|  1| [2.0,1.1,1.0]|                       [20.0,16.5,20.0]|
|  0|[1.0,0.1,-1.0]|                       [10.0,1.5,-20.0]|
|  1| [2.0,1.1,1.0]|                       [20.0,16.5,20.0]|
|  1|[3.0,10.1,3.0]|                      [30.0,151.5,60.0]|
+---+--------------+---------------------------------------+




Normalizer

The normalizer allows us to scale multidimensional vectors using one of several power norms, set through the parameter “p”. For example, we can use the Manhattan norm (or Manhattan distance) with p = 1, Euclidean norm with p = 2, and so on. The Manhattan distance is a measure of distance where you can only travel from point to point along the straight lines of an axis (like the streets in Manhattan).
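
The p-norm scaling can be written out in a few lines of plain Python. The function normalize is a hypothetical sketch that assumes a finite p of at least 1 and a vector that is not all zeros; it operates on one row at a time, which is also how this kind of normalization differs from the column-wise scalers above.

```python
def normalize(vector, p=2.0):
    """Divide a vector by its p-norm: (sum of |x|^p) ^ (1/p)."""
    norm = sum(abs(x) ** p for x in vector) ** (1.0 / p)  # assumed nonzero
    return [x / norm for x in vector]


l1 = normalize([1.0, 0.1, -1.0], p=1)   # Manhattan norm: |1| + |0.1| + |-1| = 2.1
l2 = normalize([3.0, 4.0], p=2)         # Euclidean norm: sqrt(9 + 16) = 5
print(round(l1[0], 6))  # 0.47619
```

After L1 normalization the absolute values of each row sum to 1, and the first component matches the value shown for the first row of the Normalizer output.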

In [26]:
from pyspark.ml.feature import Normalizer
manhattanDistance = Normalizer().setP(1).setInputCol("features")
manhattanDistance.transform(scaleDF).show()

+---+--------------+-------------------------------+
| id|      features|Normalizer_9379351bf8c8__output|
+---+--------------+-------------------------------+
|  0|[1.0,0.1,-1.0]|           [0.47619047619047...|
|  1| [2.0,1.1,1.0]|           [0.48780487804878...|
|  0|[1.0,0.1,-1.0]|           [0.47619047619047...|
|  1| [2.0,1.1,1.0]|           [0.48780487804878...|
|  1|[3.0,10.1,3.0]|           [0.18633540372670...|
+---+--------------+-------------------------------+

