# Pipelines
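### The small pipeline shown in the figure below consists of two transformers and one estimator:
### When Pipeline.fit() is called, the input DataFrame containing the raw text is passed to the Tokenizer transformer
### Its output is passed to the HashingTF transformer, which converts the words into features
### The Pipeline recognizes that LogisticRegression is an estimator, so it calls its fit function on the computed features to produce a LogisticRegressionModel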
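
The fit logic described above can be sketched in plain Python. This is only an illustration of the transformer/estimator contract, not the pyspark.ml implementation: the class names (`ToyTokenizer`, `ToyCounter`, `ToyMeanEstimator`, `ToyMeanModel`) and the helpers `fit_pipeline` and `transform_pipeline` are hypothetical, and the sketch treats any stage that has a `fit` method as an estimator.

```python
# Illustrative sketch only: toy stages that mimic the fit/transform contract.
# None of these names exist in pyspark.ml.

class ToyTokenizer:
    """Transformer: lowercases each line and splits it into words."""
    def transform(self, rows):
        return [line.lower().split() for line in rows]


class ToyCounter:
    """Transformer: turns each word list into a single numeric feature."""
    def transform(self, rows):
        return [len(words) for words in rows]


class ToyMeanModel:
    """The fitted model returned by the estimator; it is itself a transformer."""
    def __init__(self, mean):
        self.mean = mean

    def transform(self, rows):
        return [value > self.mean for value in rows]


class ToyMeanEstimator:
    """Estimator: fit() learns a parameter and returns a model."""
    def fit(self, rows):
        return ToyMeanModel(sum(rows) / len(rows))


def fit_pipeline(stages, data):
    """Run the stages in order and return the list of fitted transformers."""
    fitted = []
    for stage in stages:
        if hasattr(stage, "fit"):
            # Estimator: fit on the data produced so far, keep the model.
            stage = stage.fit(data)
        fitted.append(stage)
        # Simplification: always transform so the next stage sees the output.
        data = stage.transform(data)
    return fitted


def transform_pipeline(fitted, data):
    """Apply every fitted stage in order, as a fitted pipeline would."""
    for stage in fitted:
        data = stage.transform(data)
    return data


stages = [ToyTokenizer(), ToyCounter(), ToyMeanEstimator()]
fitted = fit_pipeline(stages, ["a b c", "a"])
print(transform_pipeline(fitted, ["a b c", "a"]))
```

The fitted list plays the role of a PipelineModel: every estimator has been replaced by the model it produced, so the result contains only transformers.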

In [6]:
# Build a small workflow with a Pipeline
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("spark-python").getOrCreate()
# Build a DataFrame with an id, a line of text, and a label
text_data = spark.createDataFrame(
    [
        [1, "Spark is a unified data analytics engine", 0.0],
        [2, "Spark is cool and it is fun to work with Spark", 0.0],
        [3, "There is a lot of exciting sessions at upcoming Spark summit", 0.0],
        [4, "signup to win a million dollars", 0.0]
    ], schema=["id", "line", "label"])
text_data.show()

+---+--------------------+-----+
| id|                line|label|
+---+--------------------+-----+
|  1|Spark is a unifie...|  0.0|
|  2|Spark is cool and...|  0.0|
|  3|There is a lot of...|  0.0|
|  4|signup to win a m...|  0.0|
+---+--------------------+-----+



In [7]:
# Stage 1: transformer that splits each line into words
tokenizer = Tokenizer(inputCol="line", outputCol="words")

# Stage 2: transformer (the output of stage 1 is the input of stage 2)
hashingTF = HashingTF(inputCol="words", outputCol="features", numFeatures=4096)
# Stage 3: estimator
logisticReg = LogisticRegression(maxIter=5, regParam=0.01)
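
To make the first two stages more concrete, here is a minimal plain-Python sketch of tokenizing a line and hashing the words into a fixed number of buckets. It is not Spark's code: the function names `tokenize` and `hashing_tf` are hypothetical, and `zlib.crc32` is used as a stand-in hash, whereas Spark's HashingTF uses MurmurHash3, so the bucket indices will differ from those Spark produces.

```python
import zlib


def tokenize(line):
    # Lowercase and split on whitespace, similar in spirit to Tokenizer.
    return line.lower().split()


def hashing_tf(words, num_features=4096):
    # Map each word to a bucket index and count occurrences per bucket.
    # Assumption: crc32 stands in for the hash function Spark uses.
    counts = {}
    for word in words:
        index = zlib.crc32(word.encode("utf-8")) % num_features
        counts[index] = counts.get(index, 0.0) + 1.0
    return counts


words = tokenize("Spark is cool and it is fun to work with Spark")
features = hashing_tf(words)
# Sparse view of the feature vector: {bucket index: term frequency}
print(sorted(features.items()))
```

Because the vector size is fixed by `num_features`, no vocabulary has to be built first. The trade-off is that two different words can land in the same bucket.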

In [8]:
# Build a pipeline made up of the three stages above
pipeline = Pipeline(stages=[tokenizer, hashingTF, logisticReg])
# Trigger the stages to run in order
logisticRegModel = pipeline.fit(text_data)

In [9]:
from pyspark.sql import SparkSession

# Save the model and the pipeline (persisted to HDFS)

In [None]:
logisticRegModel.save("/logistic-regression-model")
pipeline.save("/logistic-regression-pipeline")

# Load the model and the pipeline (from HDFS)
prevModel = PipelineModel.load("/logistic-regression-model")
prevPipeline = Pipeline.load("/logistic-regression-pipeline")

In [None]:
# Use the learned model to transform the data
logisticRegModel.transform(text_data).show()  # pass truncate=False to show full column contents