Tools for processing large-scale data:
- PySpark: http://spark.apache.org/docs/latest/api/python/
    - http://spark.apache.org/docs/latest/rdd-programming-guide.html
    - Installation: https://www.zhihu.com/question/35973656?sort=created
    
- H2O: http://www.h2o.ai/
- XGBoost: https://github.com/dmlc/xgboost

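`PySpark` is a distributed computing framework. It uses the **Resilient Distributed Dataset** (RDD) abstraction to work with collections of objects in parallel, so that a distributed dataset can be accessed as if it lived on a single machine.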

# Creating a SparkContext
`SparkContext` holds all job-related configuration (for example, memory settings or the number of worker tasks) and lets us specify a master and connect to a Spark cluster.

In [3]:
!java -version

java version "9.0.4"
Java(TM) SE Runtime Environment (build 9.0.4+11)
Java HotSpot(TM) 64-Bit Server VM (build 9.0.4+11, mixed mode)


In [1]:
from pyspark import SparkContext
from uuid import uuid4

def generateUUID():
    return str(uuid4())

sc = SparkContext('local', 'job_.{0}'.format(generateUUID()))

```py
SparkContext(master=None, appName=None, sparkHome=None, pyFiles=None, environment=None, batchSize=0, serializer=PickleSerializer(), conf=None, gateway=None, jsc=None, profiler_cls=<class 'pyspark.profiler.BasicProfiler'>)
```

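- `master`: a URL giving the location of the Spark master, the machine that coordinates job scheduling and distributes tasks to the worker machines in the cluster.
Every Spark job consists of two kinds of components: the `Driver` (which issues commands and collects progress information about the job) and the `Executor` (which carries out operations on the RDD). These can be created on the same machine, or on different machines, which allows a dataset too large for the memory of one machine to be distributed across several machines and processed in parallel.
    - `local`: run locally
    - or the URL of a remote machine in the cluster
    
- `appName`: the name given to the application; here it is built from a unique identifier generated with the `uuid` library.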
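
To make the two arguments concrete, here is a minimal sketch in plain Python. The helper `make_app_name` and the `MASTER_EXAMPLES` table are hypothetical names introduced only for illustration; the master URL forms listed are the commonly documented ones, and the host and port are placeholders.

```python
from uuid import uuid4


def make_app_name(prefix="job_"):
    """Build a unique application name (hypothetical helper)."""
    # str(uuid4()) is 36 characters long: 32 hex digits plus 4 hyphens.
    return "{0}.{1}".format(prefix, uuid4())


# Commonly documented master URL forms (illustrative values only).
MASTER_EXAMPLES = {
    "local": "one worker thread on this machine",
    "local[4]": "four worker threads on this machine",
    "local[*]": "as many worker threads as logical cores",
    "spark://host:7077": "a standalone cluster master (placeholder host and port)",
}

app_name = make_app_name()
print(app_name)
```

Because a fresh UUID is drawn on every call, two contexts started one after another get different names, which makes individual runs easier to tell apart in the Spark UI.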

We can open the Spark UI at `http://localhost:4040`:

In [8]:
sc.stop()      # stop the SparkContext (this also shuts down the Spark UI)

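Note: when running locally, only one `SparkContext` can be active on `localhost` at a time, so to change the context's configuration you must stop it and start a new one. Once the `SparkContext` has been created, we can initialize other context objects, which carry the parameters and functionality needed to work with particular kinds of datasets. For example, `SQLContext` lets us query datasets with SQL and operate on DataFrames.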

In [10]:
from pyspark import SQLContext
sqlContext = SQLContext(sc)

# Creating an RDD

In [2]:
from pyspark import SparkContext
from uuid import uuid4
import pandas as pd


data = pd.read_csv('movies.csv')

def generateUUID():
    return str(uuid4())

sc = SparkContext('local', 'job_.{0}'.format(generateUUID()))

rdd_data = sc.parallelize([list(r)[2:-1]] for r in data.itertuples())

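`itertuples()` returns each row of the pandas DataFrame as a tuple. Each tuple is converted to a list and sliced with `[2:-1]`, which keeps the elements from index 2 up to, but not including, the last one; in other words, it drops the index that pandas inserts automatically, the row-number column of the original file, and the final column. Note that the extra pair of square brackets wraps every row in a one-element list, which is why the rows returned by `take()` later in this notebook are nested one level deeper than the columns themselves.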
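
The effect of the slice and of the extra brackets can be checked without pandas. The sketch below uses a `namedtuple` as a stand-in for one tuple produced by `itertuples()`; the field names and the value of the last column are made up for illustration.

```python
from collections import namedtuple

# Stand-in for one tuple from DataFrame.itertuples(): the pandas index first,
# then the columns. Field names here are hypothetical.
Row = namedtuple("Row", ["Index", "row_number", "title", "year", "length", "last_col"])
r = Row(0, 1, "$", 1971, 121, "x")

flat = list(r)[2:-1]      # drops Index, row_number and the last column
nested = [list(r)[2:-1]]  # the same slice wrapped in an extra list

print(flat)    # ['$', 1971, 121]
print(nested)  # [['$', 1971, 121]]
```

If flat rows are wanted, a list comprehension of the form `[list(r)[2:-1] for r in data.itertuples()]` yields them directly.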

`sc.parallelize` converts a collection into an RDD. We can use `getNumPartitions()` to find out how many partitions this distributed collection contains:

In [4]:
rdd_data.getNumPartitions()

1

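## Changing the number of RDD partitions
We can change the number of partitions of an RDD, and with it the amount of work done per partition. `repartition()` can increase the number of partitions and `coalesce()` can reduce it. Both return a new RDD and leave `rdd_data` itself unchanged, and `coalesce()` does not increase the partition count by default, so `coalesce(2)` on the single-partition `rdd_data` still reports 1: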

In [7]:
rdd_data.repartition(10).getNumPartitions()

10

In [8]:
rdd_data.coalesce(2).getNumPartitions()

1
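
The following toy model in plain Python shows the idea of splitting a collection into partitions and merging them again. It is only a sketch: `split_into_partitions` and `coalesce` are hypothetical helpers, not Spark's actual implementation.

```python
def split_into_partitions(items, n):
    """Split a list into n contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), n)
    parts, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        parts.append(items[start:end])
        start = end
    return parts


def coalesce(parts, n):
    """Merge neighbouring partitions down to at most n; never adds partitions."""
    if n >= len(parts):
        return parts
    groups = split_into_partitions(parts, n)  # groups of whole partitions
    return [[x for p in group for x in p] for group in groups]


data = list(range(10))
one = split_into_partitions(data, 1)
four = split_into_partitions(data, 4)
print(len(four))              # 4
print(len(coalesce(four, 2))) # 2
print(len(coalesce(one, 2)))  # 1
```

The last line mirrors the notebook output above: asking a one-partition collection to coalesce to two partitions leaves it with one.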

## Slicing
`take()` returns a sample of the data in an RDD.

In [11]:
rdd_data.take(5)

[[['$',
   1971,
   121,
   nan,
   6.4,
   348,
   4.5,
   4.5,
   4.5,
   4.5,
   14.5,
   24.5,
   24.5,
   14.5,
   4.5,
   4.5,
   nan,
   0,
   0,
   1,
   1,
   0,
   0]],
 [['$1000 a Touchdown',
   1939,
   71,
   nan,
   6.0,
   20,
   0.0,
   14.5,
   4.5,
   24.5,
   14.5,
   14.5,
   14.5,
   4.5,
   4.5,
   14.5,
   nan,
   0,
   0,
   1,
   0,
   0,
   0]],
 [['$21 a Day Once a Month',
   1941,
   7,
   nan,
   8.2,
   5,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   24.5,
   0.0,
   44.5,
   24.5,
   24.5,
   nan,
   0,
   1,
   0,
   0,
   0,
   0]],
 [['$40,000',
   1996,
   70,
   nan,
   8.2,
   6,
   14.5,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   34.5,
   45.5,
   nan,
   0,
   0,
   1,
   0,
   0,
   0]],
 [['$50,000 Climax Show, The',
   1975,
   71,
   nan,
   3.4,
   17,
   24.5,
   4.5,
   0.0,
   14.5,
   14.5,
   4.5,
   0.0,
   0.0,
   0.0,
   24.5,
   nan,
   0,
   0,
   0,
   0,
   0,
   0]]]

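The Spark UI shows no activity until you run a command that asks for a result to be printed or returned to the notebook. This is because Spark uses a lazy execution model: an operation is carried out only when a later step requires its result, and until then Spark simply waits for such a request.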
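
Lazy execution can be illustrated with ordinary Python iterators. This is only an analogy, not Spark code: `map()` plays the role of a transformation, and pulling five items with `islice` plays the role of an action such as `take(5)`.

```python
from itertools import islice

calls = []


def square(x):
    calls.append(x)  # record every time the function really runs
    return x * x


# Building the pipeline does no work yet: map() returns a lazy iterator.
pipeline = map(square, range(1000))
calls_before = len(calls)

# Requesting results triggers execution, and only as much as is needed.
first_five = list(islice(pipeline, 5))
calls_after = len(calls)

print(calls_before, first_five, calls_after)  # 0 [0, 1, 4, 9, 16] 5
```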

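To load the data through the PySpark DataFrame API, we use a file stored in **JavaScript Object Notation** (JSON) format. The following command generates it by mapping the elements of each row to a dictionary and encoding that dictionary as JSON: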

In [10]:
import json

rdd_data.map(lambda x: json.JSONEncoder().encode({str(k):str(v) for (k, v) in zip(data.columns[2:-1], x)})).\
saveAsTextFile('movies.json')

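Looking at the output location shows that we created a `movies.json` directory containing multiple files (as many data files as the RDD has partitions). This is the same way the **Hadoop Distributed File System (HDFS)** stores data in directories.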
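
The row-to-JSON mapping can be tried on its own, without Spark. In this sketch the column names and the two rows are hypothetical sample values; each row becomes one JSON object on its own line, with keys and values converted to strings as in the cell above.

```python
import json

columns = ["title", "year", "length"]  # hypothetical column names
rows = [["$", 1971, 121], ["$40,000", 1996, 70]]

# One JSON object per row; str() is applied to every key and value.
lines = [json.dumps({str(k): str(v) for k, v in zip(columns, row)}) for row in rows]
text = "\n".join(lines)
print(text)

# Each line parses back into a dictionary independently of the others.
decoded = [json.loads(line) for line in text.splitlines()]
```

Note that `zip()` stops at the shorter of its two inputs, so a row with fewer elements than there are column names yields a dictionary with fewer keys.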

For more information, see: http://spark.apache.org/docs/latest/api/python/pyspark.html

In [14]:
rdd_data.take(1)

[[['$',
   1971,
   121,
   nan,
   6.4,
   348,
   4.5,
   4.5,
   4.5,
   4.5,
   14.5,
   24.5,
   24.5,
   14.5,
   4.5,
   4.5,
   nan,
   0,
   0,
   1,
   1,
   0,
   0]]]

# Creating a Spark DataFrame
Load the JSON files:

In [18]:
from pyspark import SQLContext
sqlContext = SQLContext(sc)

df = sqlContext.read.json('movies.json')

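If we intend to run many further operations on this data, we can cache it (persist it to temporary storage) so that Spark works with it in its own internal storage format, which speeds up repeated access. The following command caches the dataset: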

In [20]:
df.cache()

DataFrame[year: string]

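We can register the DataFrame under a table alias so that it can be queried through `sqlContext` (note that `registerTempTable()` has been deprecated since Spark 2.0 in favor of `createOrReplaceTempView()`):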

In [21]:
df.registerTempTable('movies')

We can then query the dataset as if it were a table in a relational database system:

In [22]:
sqlContext.sql('select * from movies limit 5').show()

+--------------------+
|                year|
+--------------------+
|['$', 1971, 121, ...|
|['$1000 a Touchdo...|
|['$21 a Day Once ...|
|['$40,000', 1996,...|
|['$50,000 Climax ...|
+--------------------+
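
For comparison, the same query shape can be run against a real relational table. The sketch below uses Python's built-in `sqlite3` module as a stand-in (it is not Spark SQL); the table layout is an assumption, the titles and years come from the sample rows shown earlier, and the sixth row is made up so that `LIMIT 5` has something to cut off.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?)",
    [
        ("$", 1971),
        ("$1000 a Touchdown", 1939),
        ("$21 a Day Once a Month", 1941),
        ("$40,000", 1996),
        ("$50,000 Climax Show, The", 1975),
        ("extra row", 2000),  # made-up sixth row
    ],
)

rows = conn.execute("SELECT * FROM movies LIMIT 5").fetchall()
conn.close()
print(len(rows))  # 5
```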



As with a pandas DataFrame, we can aggregate the data by a particular column:

In [23]:
df.groupBy('year').count().collect()

Py4JJavaError: An error occurred while calling o135.collectToPython.
: java.lang.IllegalArgumentException
	at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
	at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
	at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
	at org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:46)
	at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:449)
	at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:432)
	at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
	at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:103)
	at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:103)
	at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
	at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
	at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:103)
	at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
	at org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:432)
	at org.apache.xbean.asm5.ClassReader.a(Unknown Source)
	at org.apache.xbean.asm5.ClassReader.b(Unknown Source)
	at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
	at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
	at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:262)
	at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:261)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:261)
	at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
	at org.apache.spark.SparkContext.clean(SparkContext.scala:2292)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2066)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2092)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:939)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:938)
	at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:297)
	at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply$mcI$sp(Dataset.scala:3195)
	at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:3192)
	at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:3192)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
	at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:3225)
	at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3192)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
	at java.base/java.lang.reflect.Method.invoke(Unknown Source)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.base/java.lang.Thread.run(Unknown Source)
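
As a plain-Python illustration of what a group-and-count aggregation computes, the sketch below counts occurrences per year with `collections.Counter`. The years are taken from the five sample rows shown earlier, plus one repeated value added here so that a group with more than one member appears.

```python
from collections import Counter

# Five years from the sample rows, plus one repeat (1971) added for illustration.
years = [1971, 1939, 1941, 1996, 1975, 1971]

counts = Counter(years)
result = sorted(counts.items())  # list of (year, count) pairs
print(result)
# [(1939, 1), (1941, 1), (1971, 2), (1975, 1), (1996, 1)]
```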


We can also access a single column with pandas-like syntax:

In [24]:
df.year

Column<b'year'>

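If we want to operate on all the data on a single machine rather than on partitions spread across several machines, we can call `collect()` to gather everything in one place. Use this command with care: on a large dataset it merges the data from every partition and sends it to the Driver, which may exhaust the Driver's memory.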

`collect()` returns a list of `Row` objects; since a PySpark `Row` behaves like a tuple, individual elements (columns) can be accessed by index:

In [25]:
df.collect()[0][0]

Py4JJavaError: An error occurred while calling o127.collectToPython.
: java.lang.IllegalArgumentException
	at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
	at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
	at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
	at org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:46)
	at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:449)
	at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:432)
	at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
	at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:103)
	at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:103)
	at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
	at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
	at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:103)
	at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
	at org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:432)
	at org.apache.xbean.asm5.ClassReader.a(Unknown Source)
	at org.apache.xbean.asm5.ClassReader.b(Unknown Source)
	at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
	at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
	at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:262)
	at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:261)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:261)
	at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
	at org.apache.spark.SparkContext.clean(SparkContext.scala:2292)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2066)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2092)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:939)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:938)
	at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:297)
	at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply$mcI$sp(Dataset.scala:3195)
	at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:3192)
	at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:3192)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
	at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:3225)
	at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3192)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
	at java.base/java.lang.reflect.Method.invoke(Unknown Source)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.base/java.lang.Thread.run(Unknown Source)


We can also convert a PySpark DataFrame to an RDD:

In [26]:
rdd_data = df.rdd

A PySpark DataFrame can also be converted to a pandas DataFrame:

In [27]:
df.toPandas()

Py4JJavaError: An error occurred while calling o127.collectToPython.
: java.lang.IllegalArgumentException
	at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
	at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
	at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
	at org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:46)
	at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:449)
	at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:432)
	at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
	at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:103)
	at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:103)
	at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
	at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
	at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:103)
	at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
	at org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:432)
	at org.apache.xbean.asm5.ClassReader.a(Unknown Source)
	at org.apache.xbean.asm5.ClassReader.b(Unknown Source)
	at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
	at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
	at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:262)
	at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:261)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:261)
	at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
	at org.apache.spark.SparkContext.clean(SparkContext.scala:2292)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2066)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2092)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:939)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:938)
	at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:297)
	at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply$mcI$sp(Dataset.scala:3195)
	at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:3192)
	at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:3192)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
	at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:3225)
	at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3192)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
	at java.base/java.lang.reflect.Method.invoke(Unknown Source)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.base/java.lang.Thread.run(Unknown Source)


In [28]:
sc.stop()