### DataFrame: This ML API uses the DataFrame from Spark SQL as an ML dataset, which can hold a variety of data types.


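### Transformer: A Transformer is an algorithm that transforms one DataFrame into another DataFrame. For example, an ML model is a Transformer that transforms a DataFrame with features into a DataFrame with predictions.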

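### Estimator: An Estimator is an algorithm that can be fit on a DataFrame to produce a Transformer. For example, a learning algorithm is an Estimator that trains on a DataFrame and produces a model.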

### Pipeline: A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow.

### Parameter: All Transformers and Estimators now share a common API for specifying parameters.
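
The contract between these abstractions can be sketched in plain Python. This is a toy illustration, not Spark's implementation: the classes `Scaler`, `MeanCenterer`, `MeanCentererModel`, and `ToyPipeline` and the helper `run` are hypothetical, and a list of floats stands in for a DataFrame.

```python
# Toy sketch of the Transformer / Estimator / Pipeline contract.
# All names are hypothetical; a list of floats stands in for a DataFrame.

class Scaler:
    """A Transformer: maps one dataset to another, with nothing to learn."""
    def __init__(self, factor):
        self.factor = factor

    def transform(self, data):
        return [x * self.factor for x in data]


class MeanCentererModel:
    """The Transformer produced by fitting a MeanCenterer."""
    def __init__(self, mean):
        self.mean = mean

    def transform(self, data):
        return [x - self.mean for x in data]


class MeanCenterer:
    """An Estimator: fit() learns from the data and returns a Transformer."""
    def fit(self, data):
        return MeanCentererModel(sum(data) / len(data))


class ToyPipeline:
    """Chains stages; fitting it yields a list of fitted Transformers."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, data):
        fitted = []
        for stage in self.stages:
            # Estimators are fit on the data as transformed so far;
            # plain Transformers are used as they are.
            model = stage.fit(data) if hasattr(stage, "fit") else stage
            data = model.transform(data)
            fitted.append(model)
        return fitted


def run(fitted_stages, data):
    """Apply every fitted stage in order."""
    for stage in fitted_stages:
        data = stage.transform(data)
    return data


fitted = ToyPipeline([Scaler(2.0), MeanCenterer()]).fit([1.0, 2.0, 3.0])
print(run(fitted, [1.0, 2.0, 3.0]))  # [-2.0, 0.0, 2.0]
```

The point of the design is that fitting always returns something with a `transform` method, so a fitted pipeline can be applied to new data in exactly the same way as a single model.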

# Example: Estimator, Transformer, and Param

In [1]:
from pyspark.ml.linalg import Vectors

In [2]:
from pyspark.ml.classification import LogisticRegression

In [4]:
from pyspark.sql import SparkSession

In [5]:
spark = (
    SparkSession.builder
    .appName("python spark sql example")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
)

In [6]:
training = spark.createDataFrame([
    (1.0, Vectors.dense([0.0, 1.1, 0.1])),
    (0.0, Vectors.dense([2.0, 1.0, -1.0])),
    (0.0, Vectors.dense([2.0, 1.3, 1.0])),
    (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"])

In [7]:
training.show()

+-----+--------------+
|label|      features|
+-----+--------------+
|  1.0| [0.0,1.1,0.1]|
|  0.0|[2.0,1.0,-1.0]|
|  0.0| [2.0,1.3,1.0]|
|  1.0|[0.0,1.2,-0.5]|
+-----+--------------+



In [8]:
# Create a LogisticRegression instance. This instance is an Estimator.
lr = LogisticRegression(maxIter=10, regParam=0.01)

In [10]:
print("LogisticRegression parameters:\n" + lr.explainParams() + "\n")

LogisticRegression parameters:
aggregationDepth: suggested depth for treeAggregate (>= 2). (default: 2)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. (default: 0.0)
family: The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial (default: auto)
featuresCol: features column name. (default: features)
fitIntercept: whether to fit an intercept term. (default: True)
labelCol: label column name. (default: label)
lowerBoundsOnCoefficients: The lower bounds on coefficients if fitting under bound constrained optimization. The bound matrix must be compatible with the shape (1, number of features) for binomial regression, or (number of classes, number of features) for multinomial regression. (undefined)
lowerBoundsOnIntercepts: The lower bounds on intercepts if fitting under bound constrained optimization. The bou

In [11]:
model1 = lr.fit(training)

In [12]:
print("Model 1 was fit using parameters: ")
print(model1.extractParamMap())

Model 1 was fit using parameters: 
{Param(parent='LogisticRegression_498bb11e5275a0978037', name='aggregationDepth', doc='suggested depth for treeAggregate (>= 2)'): 2, Param(parent='LogisticRegression_498bb11e5275a0978037', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty'): 0.0, Param(parent='LogisticRegression_498bb11e5275a0978037', name='family', doc='The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial.'): 'auto', Param(parent='LogisticRegression_498bb11e5275a0978037', name='featuresCol', doc='features column name'): 'features', Param(parent='LogisticRegression_498bb11e5275a0978037', name='fitIntercept', doc='whether to fit an intercept term'): True, Param(parent='LogisticRegression_498bb11e5275a0978037', name='labelCol', doc='label column name'): 'label', Param(parent='LogisticRegressio

In [13]:
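# Parameters can also be specified with a Python dict used as a param map
paramMap = {lr.maxIter: 20}

In [14]:
# Overwrite the value just set for maxIter
paramMap[lr.maxIter] = 30

In [15]:
# Set several params at once
paramMap.update({lr.regParam: 0.1, lr.threshold: 0.55})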

In [16]:
paramMap2 = {lr.probabilityCol: "myProbability"}  # Change output column name
paramMapCombined = paramMap.copy()
paramMapCombined.update(paramMap2)
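
How these maps combine can be sketched with plain dicts. String keys stand in for the `Param` objects such as `lr.maxIter` (an assumption made so the sketch runs without Spark), and the values are the ones used in the cells above. When a param map is passed to `fit`, its values take precedence over those set on the estimator earlier.

```python
# Plain-dict sketch of how the param maps above combine.
# String keys are stand-ins for Param objects such as lr.maxIter.
constructor_params = {"maxIter": 10, "regParam": 0.01}  # LogisticRegression(...)

param_map = {"maxIter": 20}
param_map["maxIter"] = 30                               # overwrite the earlier value
param_map.update({"regParam": 0.1, "threshold": 0.55})  # set several at once

param_map2 = {"probabilityCol": "myProbability"}
combined = param_map.copy()                             # leave param_map untouched
combined.update(param_map2)

# Values supplied at fit time override those set on the estimator.
effective = {**constructor_params, **combined}
print(effective)
# {'maxIter': 30, 'regParam': 0.1, 'threshold': 0.55, 'probabilityCol': 'myProbability'}
```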

In [17]:
model2 = lr.fit(training, paramMapCombined)
print("Model 2 was fit using parameters: ")
print(model2.extractParamMap())

Model 2 was fit using parameters: 
{Param(parent='LogisticRegression_498bb11e5275a0978037', name='aggregationDepth', doc='suggested depth for treeAggregate (>= 2)'): 2, Param(parent='LogisticRegression_498bb11e5275a0978037', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty'): 0.0, Param(parent='LogisticRegression_498bb11e5275a0978037', name='family', doc='The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial.'): 'auto', Param(parent='LogisticRegression_498bb11e5275a0978037', name='featuresCol', doc='features column name'): 'features', Param(parent='LogisticRegression_498bb11e5275a0978037', name='fitIntercept', doc='whether to fit an intercept term'): True, Param(parent='LogisticRegression_498bb11e5275a0978037', name='labelCol', doc='label column name'): 'label', Param(parent='LogisticRegressio

In [18]:
# Prepare test data
test = spark.createDataFrame([
    (1.0, Vectors.dense([-1.0, 1.5, 1.3])),
    (0.0, Vectors.dense([3.0, 2.0, -0.1])),
    (1.0, Vectors.dense([0.0, 2.2, -1.5]))], ["label", "features"])

In [19]:
prediction = model2.transform(test)

In [23]:
prediction.show()

+-----+--------------+--------------------+--------------------+----------+
|label|      features|       rawPrediction|       myProbability|prediction|
+-----+--------------+--------------------+--------------------+----------+
|  1.0|[-1.0,1.5,1.3]|[-2.8046569418746...|[0.05707304171034...|       1.0|
|  0.0|[3.0,2.0,-0.1]|[2.49587635664205...|[0.92385223117041...|       0.0|
|  1.0|[0.0,2.2,-1.5]|[-2.0935249027913...|[0.10972776114779...|       1.0|
+-----+--------------+--------------------+--------------------+----------+



In [20]:
result = prediction.select("features", "label", "myProbability", "prediction").collect()

In [21]:
for row in result:
    print("features=%s, label=%s -> prob=%s, prediction=%s"
          % (row.features, row.label, row.myProbability, row.prediction))

features=[-1.0,1.5,1.3], label=1.0 -> prob=[0.057073041710340174,0.9429269582896599], prediction=1.0
features=[3.0,2.0,-0.1], label=0.0 -> prob=[0.9238522311704104,0.07614776882958973], prediction=0.0
features=[0.0,2.2,-1.5], label=1.0 -> prob=[0.10972776114779419,0.8902722388522057], prediction=1.0
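
For binary logistic regression, the output columns above are related by the logistic function and the threshold. The plain-Python sketch below recomputes the first probability entry of each row from the first `rawPrediction` entry shown by `prediction.show()` and then applies the 0.55 threshold set in `paramMap`. The raw values are truncated in that output, so the agreement is only approximate. The helper and variable names are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

THRESHOLD = 0.55  # lr.threshold from paramMap

# First rawPrediction entry per row, as displayed (truncated) by prediction.show()
raw_first = [-2.8046569418746, 2.49587635664205, -2.0935249027913]

# Probability vectors printed by the loop above
printed = [
    [0.057073041710340174, 0.9429269582896599],
    [0.9238522311704104, 0.07614776882958973],
    [0.10972776114779419, 0.8902722388522057],
]

predictions = []
for raw, prob in zip(raw_first, printed):
    p0 = sigmoid(raw)   # probability of label 0
    p1 = 1.0 - p0       # probability of label 1
    # Predict label 1 only when its probability exceeds the threshold.
    predictions.append(1.0 if p1 > THRESHOLD else 0.0)
    print("recomputed p0=%.6f, printed p0=%.6f" % (p0, prob[0]))

print(predictions)  # [1.0, 0.0, 1.0]
```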


# Example: Pipeline

In [24]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

In [25]:
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0)
], ["id", "text", "label"])

In [26]:
training.show()

+---+----------------+-----+
| id|            text|label|
+---+----------------+-----+
|  0| a b c d e spark|  1.0|
|  1|             b d|  0.0|
|  2|     spark f g h|  1.0|
|  3|hadoop mapreduce|  0.0|
+---+----------------+-----+



In [27]:
tokenizer = Tokenizer(inputCol="text", outputCol="words")

In [28]:
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
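
`Tokenizer` lowercases the text and splits it on whitespace, and `HashingTF` maps each term to a vector index by hashing it and counts occurrences per index. A minimal plain-Python sketch of that idea follows. It is an illustration only: it uses `zlib.crc32` and 16 buckets for readability, whereas Spark uses MurmurHash3 and a much larger default number of features, so the indices will not match Spark's. The function names are hypothetical.

```python
import zlib

NUM_FEATURES = 16  # assumption: a small size chosen for readability

def tokenize(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()

def hashing_tf(words, num_features=NUM_FEATURES):
    """Count terms per hash bucket (the 'hashing trick')."""
    vec = [0.0] * num_features
    for word in words:
        # crc32 is a stand-in hash, not the one Spark uses.
        idx = zlib.crc32(word.encode("utf-8")) % num_features
        vec[idx] += 1.0
    return vec

words = tokenize("spark hadoop spark")
features = hashing_tf(words)
print(words)     # ['spark', 'hadoop', 'spark']
print(features)
```

Because the index depends only on the term, no vocabulary has to be built or stored. The cost is that different terms can collide in the same bucket, and a larger feature dimension makes this less likely.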

In [29]:
lr = LogisticRegression(maxIter=10, regParam=0.001)

In [30]:
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

In [31]:
model = pipeline.fit(training)

In [32]:
test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "spark hadoop spark"),
    (7, "apache hadoop")
], ["id", "text"])


In [33]:
prediction = model.transform(test)

In [34]:
selected = prediction.select("id", "text", "probability", "prediction")

In [35]:
for row in selected.collect():
    # Unpack into `pred` so the `prediction` DataFrame is not overwritten
    rid, text, prob, pred = row
    print("(%d, %s) --> prob=%s, prediction=%f" % (rid, text, str(prob), pred))

(4, spark i j k) --> prob=[0.1596407738787475,0.8403592261212525], prediction=1.000000
(5, l m n) --> prob=[0.8378325685476744,0.16216743145232562], prediction=0.000000
(6, spark hadoop spark) --> prob=[0.06926633132976037,0.9307336686702395], prediction=1.000000
(7, apache hadoop) --> prob=[0.9821575333444218,0.01784246665557808], prediction=0.000000
