Spark version 2.0.0, Python version 3.5.2, Jupyter notebook server version 4.2.1.

<a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-nd/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/4.0/">Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License</a>.
<img src="http://spark-mooc.github.io/web-assets/images/ta_Spark-logo-small.png">
<img src="http://spark-mooc.github.io/web-assets/images/python-logo-master-v3-TM-flattened_small.png">
<h1>Spark Tutorial: Learning Apache Spark</h1>
<p>This tutorial will teach you how to use [Apache Spark](http://spark.apache.org/), a framework for large-scale data processing, within a notebook. Many traditional frameworks were designed to be run on a single computer.  However, many datasets today are too large to be stored on a single computer, and even when a dataset can be stored on one computer (such as the datasets in this tutorial), the dataset can often be processed much more quickly using multiple computers.</p>
<p>Spark has efficient implementations of a number of <em>transformations</em> and <em>actions</em> that can be composed together to perform data processing and analysis.  Spark excels at distributing these operations across a cluster while abstracting away many of the underlying implementation details.  Spark has been designed with a focus on scalability and efficiency.  With Spark you can begin developing your solution on your laptop, using a small dataset, and then use that same code to process terabytes or even petabytes across a distributed cluster.</p>
<p>During this tutorial we will cover:
<ul>
<li>Part 1: Basic notebook usage and [Python](https://docs.python.org/2/) integration</li>
<li>Part 2: An introduction to using [Apache Spark](https://spark.apache.org/) with the [PySpark SQL API](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark-sql-module) running in a notebook</li>
<li>Part 3: Using DataFrames and chaining together transformations and actions</li>
<li>Part 4: Python Lambda functions and User Defined Functions</li>
<li>Part 5: Additional DataFrame actions</li>
<li>Part 6: Additional DataFrame transformations</li>
<li>Part 7: Caching DataFrames and storage options</li>
<li>Part 8: Debugging Spark applications and lazy evaluation</li>
</ul></p>
<p>The following transformations will be covered:
<ul><li><strong>select(), filter(), distinct(), dropDuplicates(), orderBy(), groupBy()</strong></li></ul></p>
<p>The following actions will be covered:
<ul><li><strong>first(), take(), count(), collect(), show()</strong></li></ul></p>
<p>Also covered:
<ul><li><strong>cache(), unpersist()</strong></li></ul></p>
<p>Note that, for reference, you can look up the details of these methods in [Spark's PySpark SQL API](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark-sql-module).</p>
<h2>Part 1: Basic notebook usage and [Python](https://docs.python.org/2/) integration</h2>
<h3>(1a) Notebook usage</h3>
<p>A notebook is comprised of a linear sequence of cells.  These cells can contain either <em>markdown</em> or <em>code</em>, but we won't mix both in one cell.  When a markdown cell is executed it renders formatted text, images, and links just like HTML in a normal webpage.  The text you are reading right now is part of a markdown cell.  <em>Python code</em> cells allow you to execute arbitrary Python commands just like in any Python shell. Place your cursor inside the cell below, and press <strong>"Shift" + "Enter"</strong> to execute the code and advance to the next cell.  You can also press <strong>"Ctrl" + "Enter"</strong> to execute the code and remain in the cell.  These commands work the same in both markdown and code cells.</p>

In [1]:
# This is a Python cell. You can run normal Python code here...
print('The sum of 1 and 1 is {0}'.format(1+1))

The sum of 1 and 1 is 2


In [2]:
# Here is another Python cell, this time with a variable (x) declaration and an if statement:
x = 42
if x > 40:
    print ('The sum of 1 and 2 is {0}'.format(1+2))

The sum of 1 and 2 is 3


<h3>(1b) Notebook state</h3>
<p>As you work through a notebook it is important that you run all of the code cells.  The notebook is stateful, which means that variables and their values are retained until the kernel is restarted.  If you do not run all of the code cells as you proceed through the notebook, your variables will not be properly initialized and later code might fail.  You will also need to rerun any cells that you have modified in order for the changes to be available to other cells.</p>

In [3]:
# This cell relies on x being defined already.
# If we didn't run the cells from part (1a) this code would fail.
print (x * 2)

84


<h3>(1c) Library imports</h3>
<p>We can import standard Python libraries ([modules](https://docs.python.org/2/tutorial/modules.html)) the usual way.  An `import` statement will import the specified module.  In this tutorial and future labs, we will provide any imports that are necessary.</p>

In [4]:
# Import the regular expression library
import re
m = re.search('(?<=abc)def', 'abcdef')
m.group(0)

'def'

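**(?:pattern)**<br/>
&emsp;&emsp;Matches pattern but does not capture the match; that is, this is a non-capturing group whose result is not stored for later use. This is useful when combining parts of a pattern with the alternation character "(|)". For example, "industr(?:y|ies)" is a more concise expression than "industry|industries".<br/>
**(?=pattern)**<br/>
&emsp;&emsp;Positive lookahead: matches the search string at any position where a string matching pattern begins. This is a non-capturing match, so the matched text is not stored for later use. For example, "Windows(?=95|98|NT|2000)" matches the "Windows" in "Windows2000" but not the "Windows" in "Windows3.1". A lookahead does not consume characters: after a match, the search for the next match starts immediately after the last match, not after the characters that made up the lookahead.<br/>
**(?!pattern)**<br/>
&emsp;&emsp;Negative lookahead: matches the search string at any position where a string not matching pattern begins. This is a non-capturing match, so the matched text is not stored for later use. For example, "Windows(?!95|98|NT|2000)" matches the "Windows" in "Windows3.1" but not the "Windows" in "Windows2000".<br/>
**(?<=pattern)**<br/>
&emsp;&emsp;Positive lookbehind: like positive lookahead, but in the opposite direction. For example, "(?<=95|98|NT|2000)Windows" matches the "Windows" in "2000Windows" but not the "Windows" in "3.1Windows". (Python's `re` module requires a lookbehind to have a fixed width, so in Python this particular pattern, whose alternatives differ in length, raises an error.)<br/>
**(?<!pattern)**<br/>
&emsp;&emsp;Negative lookbehind: like negative lookahead, but in the opposite direction. For example, "(?<!95|98|NT|2000)Windows" matches the "Windows" in "3.1Windows" but not the "Windows" in "2000Windows". The same fixed-width restriction applies in Python.<br/>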
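
The lookaround constructs described above can be tried directly with Python's `re` module. The snippet below is a small sketch that reuses the Windows examples; the sample strings are made up for illustration. Because Python's `re` only accepts fixed-width lookbehinds, the lookbehind patterns here use just the two-character alternatives `95|98|NT`.

```python
import re

# (?:...) groups without capturing: only the whole match is available.
m = re.search(r'industr(?:y|ies)', 'heavy industries')
non_capturing = (m.group(0), m.groups())

# (?=...) positive lookahead: match "Windows" only when a listed version follows.
lookahead_hit = re.search(r'Windows(?=95|98|NT|2000)', 'Windows2000')
lookahead_miss = re.search(r'Windows(?=95|98|NT|2000)', 'Windows3.1')

# (?!...) negative lookahead: match "Windows" only when no listed version follows.
neg_lookahead_hit = re.search(r'Windows(?!95|98|NT|2000)', 'Windows3.1')

# (?<=...) positive lookbehind: match "Windows" only when a listed version precedes it.
lookbehind_hit = re.search(r'(?<=95|98|NT)Windows', '98Windows')

# (?<!...) negative lookbehind: match "Windows" only when no listed version precedes it.
neg_lookbehind_miss = re.search(r'(?<!95|98|NT)Windows', '98Windows')

print(non_capturing)
print(lookahead_hit.group(0), lookahead_miss)
print(neg_lookahead_hit.group(0))
print(lookbehind_hit.group(0), neg_lookbehind_miss)
```

In every successful case the matched text is just `Windows`: the lookaround parts constrain where the match may occur without becoming part of it.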

In [5]:
# Import the datetime library
import datetime
print ('This was last run on: {0}'.format(datetime.datetime.now()))

This was last run on: 2016-09-27 23:35:56.864694


<h2>Part 2: An introduction to using [Apache Spark](https://spark.apache.org/) with the [PySpark SQL API](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark-sql-module) running in a notebook</h2>
<h3>Spark Context</h3>
<p>In Spark, communication occurs between a `driver` and `executors`.  The `driver` has Spark jobs that it needs to run and these jobs are split into tasks that are submitted to the `executors` for completion.  The results from these tasks are delivered back to the `driver`.</p>
<p>In Part 1, we saw that normal Python code can be executed via cells. When using `Databricks` this code gets executed in the Spark driver's Java Virtual Machine (JVM) and not in an executor's JVM, and when using a `Jupyter notebook` it is executed within the kernel associated with the notebook. Since no Spark functionality is actually being used, no tasks are launched on the executors.</p>
<p>In order to use Spark and its DataFrame API we will need to use a `SQLContext`.  When running Spark, you start a new Spark application by creating a [SparkContext](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext). You can then create a [SQLContext](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.SQLContext) from the `SparkContext`. When the `SparkContext` is created, it asks the master for some cores to use to do work.  The master sets these cores aside just for you; they won't be used for other applications. When using Databricks, both a `SparkContext` and a `SQLContext` are created for you automatically. `sc` is your `SparkContext`, and `sqlContext` is your `SQLContext`.</p>
<h3>(2a) Example Cluster</h3>
<p>The diagram shows an example cluster, where the slots allocated for an application are outlined in purple. (Note: We're using the term _slots_ here to indicate threads available to perform parallel work for Spark. Spark documentation often refers to these threads as _cores_, which is a confusing term, as the number of slots available on a particular machine does not necessarily have any relationship to the number of physical CPU cores on that machine.)</p>
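
As a rough analogy only (this is plain Python, not Spark), the idea of a fixed number of slots working through a larger number of tasks can be sketched with a thread pool. The names `NUM_SLOTS` and `run_task` are hypothetical and chosen purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_SLOTS = 4  # hypothetical: number of threads available for parallel work

def run_task(task_id):
    # Stand-in for one unit of work; the return value goes back to the caller.
    return task_id * task_id

# The caller (playing the role of the driver) splits a job into 8 tasks
# and hands them to a pool that runs at most NUM_SLOTS of them at a time.
with ThreadPoolExecutor(max_workers=NUM_SLOTS) as pool:
    results = list(pool.map(run_task, range(8)))

print(results)
```

The pool size is independent of how many physical CPU cores the machine has, which is the point the note above makes about slots.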
<img src="http://spark-mooc.github.io/web-assets/images/cs105x/diagram-2a.png" style="height: 800px;float: right"/>
<p>You can view the details of your Spark application in the Spark web UI.  The web UI is accessible in Databricks by going to "Clusters" and then clicking on the "Spark UI" link for your cluster.  In the web UI, under the "Jobs" tab, you can see a list of jobs that have been scheduled or run.  It's likely there isn't anything interesting here yet because we haven't run any jobs, but we'll return to this page later.</p>
<p>At a high level, every Spark application consists of a `driver program` that launches various parallel operations on executor Java Virtual Machines (JVMs) running either in a cluster or locally on the same machine. In Databricks, "Databricks Shell" is the driver program.  When running locally, `pyspark` is the driver program. In all cases, this driver program contains the main loop for the program and creates distributed datasets on the cluster, then applies operations (transformations & actions) to those datasets.
<strong>Driver programs access Spark through a SparkContext object, which represents a connection to a computing cluster.</strong> A Spark SQL context object (`sqlContext`) is the main entry point for Spark DataFrame and SQL functionality. A `SQLContext` can be used to create DataFrames, which let you direct the operations on your data.</p>
<p>Try printing out `sqlContext` to see its type.</p>

In [6]:
# Display the type of the Spark sqlContext
type(sqlContext)

pyspark.sql.context.SQLContext

Note that in the run above the type is `SQLContext`. On a Spark 1.x build with Hive support, the type would instead be `HiveContext`. Compiling Spark with Hive support is a good idea, even if you don't have a Hive metastore. As the
[Spark Programming Guide](http://spark.apache.org/docs/latest/sql-programming-guide.html#starting-point-sqlcontext) states, a `HiveContext` "provides a superset of the functionality provided by the basic `SQLContext`. Additional features include the ability to write queries using the more complete HiveQL parser, access to Hive UDFs [user-defined functions], and the ability to read data from Hive tables. To use a `HiveContext`, you do not need to have an existing Hive setup, and all of the data sources available to a `SQLContext` are still available."
<h3>(2b) SparkContext attributes</h3>
<p>You can use Python's [dir()](https://docs.python.org/2/library/functions.html?highlight=dir#dir) function to get a list of all the attributes (including methods) accessible through the `sqlContext` object.</p>

In [7]:
# List sqlContext's attributes
dir(sqlContext)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_inferSchema',
 '_instantiatedContext',
 '_jsc',
 '_jsqlContext',
 '_jvm',
 '_sc',
 '_ssql_ctx',
 'cacheTable',
 'clearCache',
 'createDataFrame',
 'createExternalTable',
 'dropTempTable',
 'getConf',
 'getOrCreate',
 'newSession',
 'range',
 'read',
 'readStream',
 'registerDataFrameAsTable',
 'registerFunction',
 'setConf',
 'sparkSession',
 'sql',
 'streams',
 'table',
 'tableNames',
 'tables',
 'udf',
 'uncacheTable']

### (2c) Getting help

Alternatively, you can use Python's [help()](https://docs.python.org/2/library/functions.html?highlight=help#help) function to get an easier to read list of all the attributes, including examples, that the `sqlContext` object has.

In [8]:
# Use help to obtain more detailed information
help(sqlContext)

Help on SQLContext in module pyspark.sql.context object:

class SQLContext(builtins.object)
 |  The entry point for working with structured data (rows and columns) in Spark, in Spark 1.x.
 |  
 |  As of Spark 2.0, this is replaced by :class:`SparkSession`. However, we are keeping the class
 |  here for backward compatibility.
 |  
 |  A SQLContext can be used create :class:`DataFrame`, register :class:`DataFrame` as
 |  tables, execute SQL over tables, cache tables, and read parquet files.
 |  
 |  :param sparkContext: The :class:`SparkContext` backing this SQLContext.
 |  :param sparkSession: The :class:`SparkSession` around which this SQLContext wraps.
 |  :param jsqlContext: An optional JVM Scala SQLContext. If set, we do not instantiate a new
 |      SQLContext in the JVM, instead we make all calls to this object.
 |  
 |  Methods defined here:
 |  
 |  __init__(self, sparkContext, sparkSession=None, jsqlContext=None)
 |      Creates a new SQLContext.
 |      
 |      >>> from date

<p>Outside of `pyspark` or a notebook, `SQLContext` is created from the lower-level `SparkContext`, which is usually used to create Resilient Distributed Datasets (RDDs). An RDD is the way Spark actually represents data internally; DataFrames are actually implemented in terms of RDDs.</p>
<p>**While you can interact directly with RDDs, DataFrames are preferred.** They're generally faster, and they perform the same no matter what language (Python, R, Scala or Java) you use with Spark.</p>
<p>**In this course, we'll be using DataFrames, so we won't be interacting directly with the Spark Context object very much.** However, it's worth knowing that inside `pyspark` or a notebook, you already have an existing `SparkContext` in the `sc` variable. One simple thing we can do with `sc` is check the version of Spark we're using:</p>

In [9]:
# After reading the help we've decided we want to use sc.version to see what version of Spark we are running
sc.version

'2.0.0'

Get the Python version number using Python's `platform` module:

In [10]:
import platform
print(platform.python_version())

3.5.2


In [11]:
dir(sc)

['PACKAGE_EXTENSIONS',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__init__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_accumulatorServer',
 '_active_spark_context',
 '_batchSize',
 '_callsite',
 '_checkpointFile',
 '_conf',
 '_dictToJavaMap',
 '_do_init',
 '_ensure_initialized',
 '_gateway',
 '_getJavaStorageLevel',
 '_initialize_context',
 '_javaAccumulator',
 '_jsc',
 '_jvm',
 '_lock',
 '_next_accum_id',
 '_pickled_broadcast_vars',
 '_python_includes',
 '_temp_dir',
 '_unbatched_serializer',
 'accumulator',
 'addFile',
 'addPyFile',
 'appName',
 'applicationId',
 'binaryFiles',
 'binaryRecords',
 'broadcast',
 'cancelAllJobs',
 'cancelJobGroup',
 'clearFiles',
 'defaultMinPartitions',
 'd

In [12]:
help(sc)

Help on SparkContext in module pyspark.context object:

class SparkContext(builtins.object)
 |  Main entry point for Spark functionality. A SparkContext represents the
 |  connection to a Spark cluster, and can be used to create L{RDD} and
 |  broadcast variables on that cluster.
 |  
 |  Methods defined here:
 |  
 |  __enter__(self)
 |      Enable 'with SparkContext(...) as sc: app(sc)' syntax.
 |  
 |  __exit__(self, type, value, trace)
 |      Enable 'with SparkContext(...) as sc: app' syntax.
 |      
 |      Specifically stop the context on exit of the with block.
 |  
 |  __getnewargs__(self)
 |  
 |  __init__(self, master=None, appName=None, sparkHome=None, pyFiles=None, environment=None, batchSize=0, serializer=PickleSerializer(), conf=None, gateway=None, jsc=None, profiler_cls=<class 'pyspark.profiler.BasicProfiler'>)
 |      Create a new SparkContext. At least the master and app name should be set,
 |      either through the named parameters here or through C{conf}.
 |      

In [13]:
# Help can be used on any Python object
help(map)

Help on class map in module builtins:

class map(object)
 |  map(func, *iterables) --> map object
 |  
 |  Make an iterator that computes the function using arguments from
 |  each of the iterables.  Stops when the shortest iterable is exhausted.
 |  
 |  Methods defined here:
 |  
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |  
 |  __iter__(self, /)
 |      Implement iter(self).
 |  
 |  __new__(*args, **kwargs) from builtins.type
 |      Create and return a new object.  See help(type) for accurate signature.
 |  
 |  __next__(self, /)
 |      Implement next(self).
 |  
 |  __reduce__(...)
 |      Return state information for pickling.



<h2>Part 3: Using DataFrames and chaining together transformations and actions</h2>
<h3>Working with your first DataFrames</h3>

In Spark, we first create a base [DataFrame](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame). We can then apply one or more transformations to that base DataFrame. **A DataFrame is immutable, so once it is created, it cannot be changed. As a result, each transformation creates a new DataFrame.** Finally, we can apply one or more actions to the DataFrames.

> Note that Spark uses <em>lazy evaluation</em>, so transformations are not actually executed until an action occurs.

We will perform several exercises to obtain a better understanding of DataFrames:
* Create a Python collection of 10,000 people records
* Create a Spark DataFrame from that collection
* Subtract one from each value using `select`
* Perform action `collect` to view results
* Perform action `count` to view counts
* Apply transformation `filter` and view results with `collect`
* Learn about `lambda` functions
* Explore how lazy evaluation works and the debugging challenges that it introduces

A DataFrame consists of a series of **`Row`** objects; each **`Row`** object has a set of named columns. You can think of a DataFrame as modeling a table, though the data source being processed does not have to be a table.

More formally, a DataFrame must have a **_schema_**, which means it must consist of columns, each of which has a _name_ and a _type_. Some data sources have schemas built into them. Examples include RDBMS databases, Parquet files, and NoSQL databases like Cassandra. Other data sources don't have computer-readable schemas, but you can often apply a schema programmatically.
<h3>(3a) Create a Python collection of 10,000 people</h3>
<p>We will use a third-party Python testing library called [fake-factory](https://pypi.python.org/pypi/fake-factory/0.5.3) to create a collection of fake person records.
When using Faker for unit testing, you will often want to generate the same data set. The generator offers a seed() method, which seeds the _random number generator_. Calling the same script twice with the same seed produces the same results.</p>
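
The effect of seeding can be illustrated with Python's standard `random` module, used here as a stand-in because it needs no extra install. Faker's own generator is not involved in this sketch, and the helper `draw` is hypothetical.

```python
import random

def draw(seed, n=3):
    # Use an independent generator so the global random state is untouched.
    rng = random.Random(seed)
    return [rng.randint(0, 99) for _ in range(n)]

first = draw(4321)
second = draw(4321)

# The same seed yields the same sequence of "random" values.
print(first == second)
```

Reproducible output of this kind is what makes seeded fake data usable in tests that compare against expected results.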

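Faker is a Python package that generates fake data for you. Whether you need to bootstrap a database, create good-looking XML documents, continuously generate data for stress testing, or anonymize data pulled from a production server, Faker is an excellent choice.

It can be installed with pip: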
>pip install fake-factory

In [14]:
from faker import Factory
fake = Factory.create()
fake.seed(4321)

We're going to use this factory to create a collection of randomly generated people records. In the next section, we'll turn that collection into a DataFrame. We'll use the Spark `Row` class, because that will help us define the Spark DataFrame schema. There are other ways to define schemas, though; see the Spark Programming Guide's discussion of [schema inference](http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection) for more information. (For instance, we could also use a Python `namedtuple`.)
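
To make the `namedtuple` remark concrete, here is a small sketch, with no Spark involved, of how a `namedtuple` carries field names alongside its values. The `Person` type and the sample record are hypothetical; the field names simply mirror the entries generated in this section.

```python
from collections import namedtuple

# Hypothetical record type; the field names could double as column names.
Person = namedtuple('Person', ['last_name', 'first_name', 'ssn', 'occupation', 'age'])

# A made-up record, not produced by Faker.
p = Person('Doe', 'Jane', '000-00-0000', 'Engineer', 30)

print(Person._fields)             # the field names, in order
print(p.last_name, p.age)         # access by name
print(p._asdict()['occupation'])  # convert to a mapping when needed
```

A `namedtuple` is still an ordinary tuple, so code that indexes records by position keeps working while the names remain available for schema inference.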

In [15]:
# Each entry consists of last_name, first_name, ssn (Social Security number), job, and age (at least 1)
from pyspark.sql import Row
def fake_entry():
    # fake.name() returns a string such as 'Jason Brown'; split() with no
    # arguments splits on any whitespace and returns a list of strings.
    name = fake.name().split()
    return (name[1], name[0], fake.ssn(), fake.job(), abs(2016 - fake.date_time().year) + 1)

In [16]:
# Create a helper function to call a function repeatedly
def repeat(times, func, *args, **kwargs):
    for _ in range(times):
        yield func(*args, **kwargs)

In [17]:
data = list(repeat(10000, fake_entry))

`data` is just a normal Python list, containing Python tuple objects. Let's look at the first item in the list:

In [18]:
data[0]

('Brown', 'Jason', '182-83-5988', 'Community education officer', 2)

We can check the size of the list using the Python `len()` function.

In [19]:
len(data)

10000

<h3>(3b) Distributed data and using a collection to create a DataFrame</h3>

In Spark, datasets are represented as a list of entries, where the list is broken up into many different partitions that are each stored on a different machine.  Each partition holds a unique subset of the entries in the list.  Spark calls datasets that it stores "Resilient Distributed Datasets" (RDDs). Even DataFrames are ultimately represented as RDDs, with additional metadata. (Metadata is data that describes other data; its main benefit is that it lets the description and classification of information be formalized, which makes machine processing possible.)

<img src="http://spark-mooc.github.io/web-assets/images/cs105x/diagram-3b.png" style="width: 900px; float: right; margin: 5px"/>

One of the defining features of Spark, compared to other data analytics frameworks (e.g., Hadoop), is that _**it stores data in memory rather than on disk**_.  This allows Spark applications to run much more quickly, because they are not slowed down by needing to read data from disk.
The figure to the right illustrates how Spark breaks a list of data entries into partitions that are each stored in memory on a worker.
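
The partitioning idea can be sketched in plain Python. This is only an illustration of splitting a list into disjoint chunks; the helper `split_into_partitions` is hypothetical and does not reflect how Spark actually assigns entries to partitions.

```python
def split_into_partitions(entries, num_partitions):
    # Produce contiguous chunks whose sizes differ by at most one.
    size, extra = divmod(len(entries), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(entries[start:end])
        start = end
    return partitions

entries = list(range(10))
partitions = split_into_partitions(entries, 3)
print(partitions)
```

Every entry lands in exactly one chunk, which mirrors the statement that each partition holds a unique subset of the entries.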


To create the DataFrame, we'll use `sqlContext.createDataFrame()`, and we'll pass our array of data in as an argument to that function. Spark will create a new set of input data based on data that is passed in.  A DataFrame requires a _**schema**_, which is a list of columns, where each column has a name and a type. Our list of data has elements with types (mostly strings, but one integer). We'll supply the rest of the schema and the column names as the second argument to `createDataFrame()`.

Let's view the help for `createDataFrame()`.

In [20]:
help(sqlContext.createDataFrame)

Help on method createDataFrame in module pyspark.sql.context:

createDataFrame(data, schema=None, samplingRatio=None) method of pyspark.sql.context.SQLContext instance
    Creates a :class:`DataFrame` from an :class:`RDD`, a list or a :class:`pandas.DataFrame`.
    
    When ``schema`` is a list of column names, the type of each column
    will be inferred from ``data``.
    
    When ``schema`` is ``None``, it will try to infer the schema (column names and types)
    from ``data``, which should be an RDD of :class:`Row`,
    or :class:`namedtuple`, or :class:`dict`.
    
    When ``schema`` is :class:`DataType` or datatype string, it must match the real data, or
    exception will be thrown at runtime. If the given schema is not StructType, it will be
    wrapped into a StructType as its only field, and the field name will be "value", each record
    will also be wrapped into a tuple, which can be converted to row later.
    
    If schema inference is needed, ``samplingRatio`` is use

In [21]:
dataDF = sqlContext.createDataFrame(data, ('last_name', 'first_name', 'ssn', 'occupation', 'age'))

Let's see what type `sqlContext.createDataFrame()` returned.

In [22]:
print ('type of dataDF: {0}'.format(type(dataDF)))

type of dataDF: <class 'pyspark.sql.dataframe.DataFrame'>


Let's take a look at the DataFrame's schema and some of its rows.

In [23]:
dataDF.printSchema()

root
 |-- last_name: string (nullable = true)
 |-- first_name: string (nullable = true)
 |-- ssn: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- age: long (nullable = true)



We can register the newly created DataFrame as a named table, using the `registerDataFrameAsTable()` method.

**NOTE:** `registerDataFrameAsTable()` registers the given DataFrame as a temporary table in the catalog. Temporary tables exist only during the lifetime of this instance of `SQLContext`.

In [24]:
sqlContext.registerDataFrameAsTable(dataDF, 'dataframe')

What methods can we call on this DataFrame?

In [25]:
help(dataDF)

Help on DataFrame in module pyspark.sql.dataframe object:

class DataFrame(builtins.object)
 |  A distributed collection of data grouped into named columns.
 |  
 |  A :class:`DataFrame` is equivalent to a relational table in Spark SQL,
 |  and can be created using various functions in :class:`SQLContext`::
 |  
 |      people = sqlContext.read.parquet("...")
 |  
 |  Once created, it can be manipulated using the various domain-specific-language
 |  (DSL) functions defined in: :class:`DataFrame`, :class:`Column`.
 |  
 |  To select a column from the data frame, use the apply method::
 |  
 |      ageCol = people.age
 |  
 |  A more concrete example::
 |  
 |      # To create DataFrame using SQLContext
 |      people = sqlContext.read.parquet("...")
 |      department = sqlContext.read.parquet("...")
 |  
 |      people.filter(people.age > 30).join(department, people.deptId == department.id)          .groupBy(department.name, "gender").agg({"salary": "avg", "age": "max"})
 |  
 |  .. ve

How many partitions will the DataFrame be split into?

In [26]:
dataDF.rdd.getNumPartitions()

1

In [27]:
help(dataDF.rdd.getNumPartitions)

Help on method getNumPartitions in module pyspark.rdd:

getNumPartitions() method of pyspark.rdd.RDD instance
    Returns the number of partitions in RDD
    
    >>> rdd = sc.parallelize([1, 2, 3, 4], 2)
    >>> rdd.getNumPartitions()
    2



###### A note about DataFrames and queries

When you use DataFrames or Spark SQL, you are building up a _**query plan**_. Each transformation you apply to a DataFrame adds some information to the query plan. When you finally call an action, which triggers execution of your **Spark job**, several things happen:

1. Spark's Catalyst optimizer analyzes the query plan (called an _unoptimized logical query plan_) and attempts to optimize it. Optimizations include (but aren't limited to) rearranging and combining `filter()` operations for efficiency, converting `Decimal` operations to more efficient long integer operations, and pushing some operations down into the data source (e.g., a `filter()` operation might be translated to a SQL `WHERE` clause, if the data source is a traditional SQL RDBMS). The result of this optimization phase is an _optimized logical plan_.
2. Once Catalyst has an optimized logical plan, it then constructs multiple _physical_ plans from it. Specifically, it implements the query in terms of lower level Spark RDD operations.
3. Catalyst chooses which physical plan to use via _cost optimization_. That is, it determines which physical plan is the most efficient (or least expensive), and uses that one.
4. Finally, once the physical RDD execution plan is established, Spark actually executes the job.
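
As a toy illustration of one optimization mentioned in step 1, combining adjacent `filter()` operations, the sketch below represents a plan as a list of `(operation, argument)` pairs. The plan representation, `combine_filters`, and `run` are all hypothetical and far simpler than Catalyst's real logical plans.

```python
def combine_filters(plan):
    # Merge runs of consecutive filter steps into a single filter whose
    # predicate requires all of the original predicates to hold.
    optimized = []
    for op, arg in plan:
        if op == 'filter' and optimized and optimized[-1][0] == 'filter':
            previous = optimized[-1][1]
            # Default arguments bind the two predicates at definition time.
            merged = lambda row, f=previous, g=arg: f(row) and g(row)
            optimized[-1] = ('filter', merged)
        else:
            optimized.append((op, arg))
    return optimized

def run(plan, rows):
    # Execute a plan step by step over a list of dict rows.
    for op, arg in plan:
        if op == 'filter':
            rows = [r for r in rows if arg(r)]
        elif op == 'select':
            rows = [{c: r[c] for c in arg} for r in rows]
    return rows

rows = [{'name': 'a', 'age': 5}, {'name': 'b', 'age': 15}, {'name': 'c', 'age': 50}]
plan = [('filter', lambda r: r['age'] > 10),
        ('filter', lambda r: r['age'] < 40),
        ('select', ['name'])]

optimized = combine_filters(plan)
print(len(plan), len(optimized))
print(run(plan, rows) == run(optimized, rows))
```

The optimized plan has fewer steps but returns the same rows, which is the property any such rewrite has to preserve.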

You can examine the query plan using the `explain()` function on a DataFrame. By default, `explain()` only shows you the final physical plan; however, if you pass it an argument of `True`, it will show you all phases.

(If you want to take a deeper dive into how Catalyst optimizes DataFrame queries, this blog post, while a little old, is an excellent overview: [Deep Dive into Spark SQL's Catalyst Optimizer](https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html).)

Let's add a couple of transformations to our DataFrame and look at the query plan on the resulting transformed DataFrame. Don't be too concerned if it looks like gibberish. As you gain more experience with Apache Spark, you'll begin to be able to use `explain()` to help you understand more about your DataFrame operations.

In [28]:
newDF = dataDF.distinct().select('*')
newDF.explain(True)

== Parsed Logical Plan ==
'Project [*]
+- Aggregate [last_name#0, first_name#1, ssn#2, occupation#3, age#4L], [last_name#0, first_name#1, ssn#2, occupation#3, age#4L]
   +- LogicalRDD [last_name#0, first_name#1, ssn#2, occupation#3, age#4L]

== Analyzed Logical Plan ==
last_name: string, first_name: string, ssn: string, occupation: string, age: bigint
Project [last_name#0, first_name#1, ssn#2, occupation#3, age#4L]
+- Aggregate [last_name#0, first_name#1, ssn#2, occupation#3, age#4L], [last_name#0, first_name#1, ssn#2, occupation#3, age#4L]
   +- LogicalRDD [last_name#0, first_name#1, ssn#2, occupation#3, age#4L]

== Optimized Logical Plan ==
Aggregate [last_name#0, first_name#1, ssn#2, occupation#3, age#4L], [last_name#0, first_name#1, ssn#2, occupation#3, age#4L]
+- LogicalRDD [last_name#0, first_name#1, ssn#2, occupation#3, age#4L]

== Physical Plan ==
*HashAggregate(keys=[last_name#0, first_name#1, ssn#2, occupation#3, age#4L], functions=[], output=[last_name#0, first_name#1, ssn#2

### (3c): Subtract one from each value using _select_

So far, we've created a distributed DataFrame that is split into many partitions, where each partition is stored on a single machine in our cluster.  Let's look at what happens when we do a basic operation on the dataset.  Many useful data analysis operations can be specified as "do something to each item in the dataset".  These data-parallel operations are convenient because each item in the dataset can be processed individually: the operation on one entry doesn't affect the operations on any of the other entries.  Therefore, Spark can parallelize the operation.

One of the most common DataFrame operations is `select()`, and it works more or less like a SQL `SELECT` statement: You can select specific columns from the DataFrame, and you can even use `select()` to create _new_ columns with values that are derived from existing column values. We can use `select()` to create a new column that decrements the value of the existing `age` column.

**`select()` is a _transformation_. It returns a new DataFrame that captures both the previous DataFrame and the operation to add to the query (`select`, in this case). But it does *not* actually execute anything on the cluster. When transforming DataFrames, we are building up a _query plan_. That query plan will be optimized, implemented (in terms of RDDs), and executed by Spark _only_ when we call an action.**
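
Lazy evaluation can be mimicked in plain Python with generators. The sketch below is an analogy, not Spark code: `lazy_decrement` (a hypothetical helper) plays the role of a transformation and does no work until `list()`, standing in for an action, consumes it.

```python
calls = []  # records every value that is actually processed

def lazy_decrement(values):
    # Generator: the body runs only while something iterates over it.
    for v in values:
        calls.append(v)
        yield v - 1

ages = [2, 20, 11]
pending = lazy_decrement(ages)   # "transformation": nothing has run yet
processed_before = len(calls)

result = list(pending)           # "action": forces the computation
processed_after = len(calls)

print(processed_before, processed_after, result)
```

The input list `ages` is left unchanged, much as a transformation leaves the original DataFrame untouched and yields a new one.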

In [29]:
# Transform dataDF through a select transformation and rename the newly created '(age -1)' column to 'age'
# Because select is a transformation and Spark uses lazy evaluation, no jobs, stages,
# or tasks will be launched when we run this code.
subDF = dataDF.select('last_name', 'first_name', 'ssn', 'occupation', (dataDF.age - 1).alias('age'))

Let's take a look at the query plan.

In [30]:
subDF.explain(True)

== Parsed Logical Plan ==
'Project [unresolvedalias('last_name, None), unresolvedalias('first_name, None), unresolvedalias('ssn, None), unresolvedalias('occupation, None), (age#4L - 1) AS age#19]
+- LogicalRDD [last_name#0, first_name#1, ssn#2, occupation#3, age#4L]

== Analyzed Logical Plan ==
last_name: string, first_name: string, ssn: string, occupation: string, age: bigint
Project [last_name#0, first_name#1, ssn#2, occupation#3, (age#4L - cast(1 as bigint)) AS age#19L]
+- LogicalRDD [last_name#0, first_name#1, ssn#2, occupation#3, age#4L]

== Optimized Logical Plan ==
Project [last_name#0, first_name#1, ssn#2, occupation#3, (age#4L - 1) AS age#19L]
+- LogicalRDD [last_name#0, first_name#1, ssn#2, occupation#3, age#4L]

== Physical Plan ==
*Project [last_name#0, first_name#1, ssn#2, occupation#3, (age#4L - 1) AS age#19L]
+- Scan ExistingRDD[last_name#0,first_name#1,ssn#2,occupation#3,age#4L]


A better way to visualize the data is to use the `show()` method. If you don't tell `show()` how many rows to display, it displays 20 rows.

In [None]:
subDF.show()

+---------+----------+-----------+--------------------+---+
|last_name|first_name|        ssn|          occupation|age|
+---------+----------+-----------+--------------------+---+
|    Brown|     Jason|182-83-5988|Community educati...|  1|
|    Brown|      Cody|298-53-9877|   Financial planner| 19|
|  Griffin|    Sandra|175-58-0111|Community educati...| 10|
|    Wyatt|     David|270-76-3455|Teaching laborato...| 24|
|   George|    Daniel|200-38-3837| Medical illustrator| 18|
|   Rogers|     Barry|634-25-3185|   Market researcher| 30|
|   Foster|    Morgan|464-88-6116|Production assist...| 39|
|   Hansen|      John|773-12-5058|Armed forces logi...| 42|
|     Rice|     Derek|054-51-9007|Presenter, broadc...| 23|
|   Little|    Cheryl|725-70-4549|Broadcast journalist| 41|
|   Peters|      Leah|826-58-6908|   Therapist, sports|  8|
|    Simon|     Logan|062-37-6157|Horticulturist, c...| 31|
|  Bennett|   Krystal|308-43-0932|Engineer, communi...| 24|
|    Burke|     Kelly|299-52-4282|Aerona

### (3d) Use _collect_ to view results

<img src="http://spark-mooc.github.io/web-assets/images/cs105x/diagram-3d.png" style="height:700px;float:right"/>

To see a list of elements decremented by one, we need to create a new list on the driver from the data distributed across the executor nodes.  To do this we can call the `collect()` method on our DataFrame.  `collect()` is often called after transformations that shrink the dataset, **so that only a *small* amount of data is returned to the driver.  This matters because the data returned to the driver must fit into the driver's available memory.  If it does not, the driver will crash**.

The `collect()` method is the first action operation that we have encountered.  Action operations cause Spark to perform the (lazy) transformation operations that are required to compute the values returned by the action.  In our example, this means that tasks will now be launched to perform the **`createDataFrame`**, **`select`**, and **`collect`** operations.

In the diagram, the dataset is broken into four partitions, so four `collect()` tasks are launched. Each task collects the entries in its partition and sends the result to the driver, which creates a list of the values.
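
The four-task picture can be mimicked with ordinary Python lists. This is only a sketch of the data flow: the partition layout and the helper names `collect_task` and `collect` are assumptions made for illustration, and the `(last_name, age)` pairs are taken from the `show()` output above.

```python
# Four hypothetical partitions holding (last_name, age) pairs.
partitions = [
    [("Brown", 1), ("Brown", 19)],
    [("Griffin", 10)],
    [("Wyatt", 24), ("George", 18)],
    [("Rogers", 30)],
]


def collect_task(partition):
    # Each task returns every entry of its own partition.
    return list(partition)


def collect(all_partitions):
    # The driver appends each task's result to one list,
    # so the driver ends up holding every row at once.
    driver_list = []
    for partition in all_partitions:
        driver_list.extend(collect_task(partition))
    return driver_list


results = collect(partitions)
```

Because `results` contains every row, its size grows with the whole dataset rather than with a single partition, which is why the driver's memory is the limiting factor.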

Now let's run `collect()` on `subDF`.

In [None]:
# Let's collect the data
results = subDF.collect()
print(results)

----------------------------------------
Exception happened during processing of request from ('127.0.0.1', 43566)
----------------------------------------


Traceback (most recent call last):
  File "/usr/mySofter/Anaconda3/lib/python3.5/socketserver.py", line 313, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/usr/mySofter/Anaconda3/lib/python3.5/socketserver.py", line 341, in process_request
    self.finish_request(request, client_address)
  File "/usr/mySofter/Anaconda3/lib/python3.5/socketserver.py", line 354, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/mySofter/Anaconda3/lib/python3.5/socketserver.py", line 681, in __init__
    self.handle()
  File "/usr/mySofter/spark-2.0.0-bin-hadoop2.7/python/pyspark/accumulators.py", line 235, in handle
    num_updates = read_int(self.rfile)
  File "/usr/mySofter/spark-2.0.0-bin-hadoop2.7/python/pyspark/serializers.py", line 545, in read_int
    raise EOFError
EOFError


timeout: timed out

A better way to visualize the data is to use the `show()` method. If you don't tell `show()` how many rows to display, it displays 20 rows.

In [None]:
subDF.show()

If you'd prefer that `show()` not **`truncate`** the data, you can tell it not to:

In [None]:
subDF.show(n=30, truncate=False)
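
To make the effect of truncation concrete, here is a plain-Python sketch of the rule that the default output above appears to follow: strings longer than 20 characters are cut to 17 characters plus `...`. The function name `truncate_cell` is hypothetical, and the exact rule is an assumption inferred from that output rather than taken from Spark's source.

```python
def truncate_cell(value, width=20):
    """Shorten a cell roughly the way the default show() output appears to."""
    text = str(value)
    if width > 0 and len(text) > width:
        if width < 4:
            # Too narrow to fit an ellipsis, so just cut.
            return text[:width]
        return text[: width - 3] + "..."
    return text


# "Distributed systems engineer" is a made-up value, not from the dataset.
long_cell = truncate_cell("Distributed systems engineer")
exact_cell = truncate_cell("Broadcast journalist")               # exactly 20 characters
full_cell = truncate_cell("Distributed systems engineer", width=0)  # no truncation
```

A 20-character value such as `Broadcast journalist` passes through unchanged, which matches the untruncated row in the table above.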

### (3e) Use _count_ to get total

One of the most basic jobs that we can run is a count: the `count()` action returns the number of elements (rows) in a DataFrame. Since `select()` creates a new DataFrame with the same number of elements as the starting DataFrame, we expect that applying `count()` to each DataFrame will return the same result.

<img src="http://spark-mooc.github.io/web-assets/images/cs105x/diagram-3e.png" style="height:700px;float:right"/>

Note that because `count()` is an action operation, if we had not already performed an action with `collect()`, then Spark would now perform the transformation operations when we executed `count()`.

Each task counts the entries in its partition and sends the result to your SparkContext, which adds up all of the counts. The figure on the right shows what would happen if we ran `count()` on a small example dataset with just four partitions.
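
The per-partition counting described above can be sketched in plain Python. This is not Spark code: the partition layout and the name `count_task` are assumptions for illustration, and the ages are a few values borrowed from the `show()` output.

```python
# Four hypothetical partitions of ages.
partitions = [[1, 19], [10], [24, 18], [30]]


def count_task(partition):
    # Each task only looks at its own partition.
    return len(partition)


partial_counts = [count_task(p) for p in partitions]  # one number per task
total = sum(partial_counts)                            # the driver adds them up

# A select-style transformation maps each row to exactly one row,
# so the total is unchanged.
decremented = [[age - 1 for age in p] for p in partitions]
total_after_select = sum(count_task(p) for p in decremented)
```

Only one small integer per partition travels back to the driver, which is why `count()` is far cheaper to return than `collect()`.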

In [None]:
print(dataDF.count())
print(subDF.count())

### (3f) Apply transformation _filter_ and view results with _collect_

Next, we'll create a new DataFrame that only contains the people whose ages are less than 10. To do this, we'll use the **`filter()`** transformation. (You can also use `where()`, an alias for `filter()`, if you prefer something more SQL-like). The `filter()` method is a transformation operation that creates a new DataFrame from the input DataFrame, keeping only values that match the filter expression.

The figure shows how this might work on the small four-partition dataset.

<img src="http://spark-mooc.github.io/web-assets/images/cs105x/diagram-3f.png" style="height:700px;float:right"/>

To view the filtered list of elements less than 10, we need to create a new list on the driver from the distributed data on the executor nodes.  We could use the `collect()` method to return a list that contains all of the elements in this filtered DataFrame to the driver program; the cell below uses `show()` to display the rows and `count()` to report how many there are.

In [None]:
filteredDF = subDF.filter(subDF.age < 10)
filteredDF.show(truncate=False)
filteredDF.count()

(These are some seriously precocious children...)

## Part 4: Python Lambda functions and User Defined Functions

Python supports the use of small one-line **anonymous functions** that are not bound to a name at runtime.

`lambda` functions, borrowed from LISP (short for LISt Processor, a list-processing language), can be used wherever function objects are required. They are syntactically restricted to a single expression. Remember that `lambda` functions are **a matter of style** and using them is never required: semantically, they are just **syntactic sugar** for a normal function definition. You can always define a separate normal function instead, but using a `lambda` function is an equivalent and more compact form of coding. Ideally you should consider using `lambda` functions where you want to encapsulate non-reusable code without littering your code with one-line functions.
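
As a quick illustration of that equivalence, the same predicate can be written both ways in plain Python. The names `less_than_ten` and `less_than_ten_lambda` and the sample ages are made up for this sketch.

```python
# A named function ...
def less_than_ten(age):
    return age < 10


# ... and the same logic as a lambda expression.
less_than_ten_lambda = lambda age: age < 10

ages = [1, 19, 10, 24, 8]

# Python's built-in filter() accepts either one, because both are function objects.
kept_by_def = list(filter(less_than_ten, ages))
kept_by_lambda = list(filter(less_than_ten_lambda, ages))

# The only visible difference is the name attached to the function object.
def_name = less_than_ten.__name__
lambda_name = less_than_ten_lambda.__name__
```

Both calls keep the same ages; the lambda simply reports its name as `<lambda>`, which is one reason a named function can be easier to spot in tracebacks.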

Here, instead of defining a separate function for the `filter()` transformation, we will use an inline `lambda` function and we will register that lambda as a Spark <a target="_blank" href="http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=udf#pyspark.sql.functions.udf">_User Defined Function_ (UDF)</a>. A UDF is a special **wrapper** around a function, allowing the function to be used in a DataFrame query.

In [None]:
from pyspark.sql.types import BooleanType
from pyspark.sql.functions import udf
less_ten = udf(lambda s: s < 10, BooleanType())
lambdaDF = subDF.filter(less_ten(subDF.age))
lambdaDF.show(truncate=False)
lambdaDF.count()
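
One thing to keep in mind with Python UDFs is that a null column value reaches the function as `None`, and in Python 3 the comparison `None < 10` raises a `TypeError`. Nothing in this tutorial's data suggests that `age` contains nulls, so the following is a defensive sketch under that assumption; the name `is_under_ten` is hypothetical, and treating a missing age as "not under ten" is just one possible choice.

```python
def is_under_ten(age):
    """Null-safe version of the predicate used in the cell above."""
    if age is None:
        # Treat a missing age as "not under ten" instead of raising.
        return False
    return age < 10


# The bare lambda fails on a missing value in Python 3.
try:
    (lambda s: s < 10)(None)
    lambda_raised = False
except TypeError:
    lambda_raised = True

safe_on_none = is_under_ten(None)
safe_on_seven = is_under_ten(7)
safe_on_forty = is_under_ten(40)
```

A function like this could be wrapped with `udf(..., BooleanType())` in the same way as the lambda in the cell above.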