<h1>Spark SQL, DataFrames and Datasets Guide</h1>
<ul>
  <li>Overview<ul>
      <li>SQL</li>
      <li>Datasets and DataFrames</li>
    </ul>
  </li>
  <li>Getting Started<ul>
      <li>Starting Point: SparkSession</li>
      <li>Creating DataFrames</li>
      <li>Untyped Dataset Operations (aka DataFrame Operations)</li>
      <li>Running SQL Queries Programmatically</li>
      <li>Creating Datasets</li>
      <li>Interoperating with RDDs<ul>
          <li>Inferring the Schema Using Reflection</li>
          <li>Programmatically Specifying the Schema</li>
        </ul>
      </li>
    </ul>
  </li>
  <li>Data Sources<ul>
      <li>Generic Load/Save Functions<ul>
          <li>Manually Specifying Options</li>
          <li>Run SQL on files directly</li>
          <li>Save Modes</li>
          <li>Saving to Persistent Tables</li>
        </ul>
      </li>
      <li>Parquet Files<ul>
          <li>Loading Data Programmatically</li>
          <li>Partition Discovery</li>
          <li>Schema Merging</li>
          <li>Hive metastore Parquet table conversion<ul>
              <li>Hive/Parquet Schema Reconciliation</li>
              <li>Metadata Refreshing</li>
            </ul>
          </li>
          <li>Configuration</li>
        </ul>
      </li>
      <li>JSON Datasets</li>
      <li>Hive Tables<ul>
          <li>Interacting with Different Versions of Hive Metastore</li>
        </ul>
      </li>
      <li>JDBC To Other Databases</li>
      <li>Troubleshooting</li>
    </ul>
  </li>
  <li>Performance Tuning<ul>
      <li>Caching Data In Memory</li>
      <li>Other Configuration Options</li>
    </ul>
  </li>
  <li>Distributed SQL Engine<ul>
      <li>Running the Thrift JDBC/ODBC server</li>
      <li>Running the Spark SQL CLI</li>
    </ul>
  </li>
  <li>Migration Guide<ul>
      <li>Upgrading From Spark SQL 1.6 to 2.0</li>
      <li>Upgrading From Spark SQL 1.5 to 1.6</li>
      <li>Upgrading From Spark SQL 1.4 to 1.5</li>
      <li>Upgrading from Spark SQL 1.3 to 1.4 <ul>
          <li>DataFrame data reader/writer interface</li>
          <li>DataFrame.groupBy retains grouping columns</li>
          <li>Behavior change on DataFrame.withColumn</li>
        </ul>
      </li>
      <li>Upgrading from Spark SQL 1.0-1.2 to 1.3<ul>
          <li>Rename of SchemaRDD to DataFrame</li>
          <li>Unification of the Java and Scala APIs</li>
          <li>Isolation of Implicit Conversions and Removal of dsl Package (Scala-only)</li>
          <li>Removal of the type aliases in org.apache.spark.sql for DataType (Scala-only)</li>
          <li>UDF Registration Moved to <code>sqlContext.udf</code> (Java &amp; Scala)</li>
          <li>Python DataTypes No Longer Singletons</li>
        </ul>
      </li>
      <li>Compatibility with Apache Hive<ul>
          <li>Deploying in Existing Hive Warehouses</li>
          <li>Supported Hive Features</li>
          <li>Unsupported Hive Functionality</li>
        </ul>
      </li>
    </ul>
  </li>
  <li>Reference<ul>
      <li>Data Types</li>
      <li>NaN Semantics</li>
    </ul>
  </li>
</ul>

<h1>Overview</h1>
<p>Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided
by Spark SQL give Spark more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform additional optimizations. There are several ways to
interact with Spark SQL, including SQL and the Dataset API. When computing a result, the same execution engine is used, independent of which API or language you use to express the computation. This unification
means that developers can easily switch back and forth between different APIs based on which provides the most natural way to express a given transformation.</p>
<p>All of the examples on this page use sample data included in the Spark distribution and can be run in the <code>spark-shell</code>, <code>pyspark</code> shell, or <code>sparkR</code> shell.</p>

<h2>SQL</h2>
<p>One use of Spark SQL is to execute SQL queries. Spark SQL can also be used to read data from an existing Hive installation. For more on how to configure this feature, please refer to the <strong>Hive Tables</strong> section. When running SQL from within another programming language, the results will be returned as a <strong>Dataset/DataFrame</strong>. You can also interact with the SQL interface using the <strong>command line</strong> or over <strong>JDBC/ODBC</strong>.</p>

<h2>Datasets and DataFrames</h2>

<p>A Dataset is a distributed collection of data. Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL&#8217;s optimized execution engine. A Dataset can be <strong>constructed</strong> from JVM objects and then
manipulated using functional transformations (<code>map</code>, <code>flatMap</code>, <code>filter</code>, etc.).
The Dataset API is available in <a href="http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset">Scala</a> and
<a href="http://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/sql/Dataset.html">Java</a>. Python does not support the Dataset API, but due to Python&#8217;s dynamic nature, many of the benefits of the Dataset API are already available (i.e. you can access the field of a row by name naturally: <code>row.columnName</code>). The case for R is similar.</p>

<p>A DataFrame is a <em>Dataset</em> organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of <a href="http://spark.apache.org/docs/latest/sql-programming-guide.html#data-sources">sources</a> such as: <u>structured data files, tables in Hive, external databases, or existing RDDs</u>. The DataFrame API is available in Scala, Java, <a href="http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame">Python</a>, and <a href="http://spark.apache.org/docs/latest/api/R/index.html">R</a>. In Scala and Java, a DataFrame is represented by a Dataset of <code>Row</code>s. In <a href="http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset">the Scala API</a>, <code>DataFrame</code> is simply a type alias of <code>Dataset[Row]</code>. In <a href="http://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/sql/Dataset.html">the Java API</a>, by contrast, users need to use <code>Dataset&lt;Row&gt;</code> to represent a <code>DataFrame</code>.</p>

<p>Throughout this document, we will often refer to Scala/Java Datasets of <code>Row</code>s as DataFrames.</p>

<h1>Getting Started</h1>
<h2>Starting Point: SparkSession</h2>
<p>The entry point into all functionality in Spark is the <a href="http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.SparkSession"><code>SparkSession</code></a> class. To create a basic <code>SparkSession</code>, just use <code>SparkSession.builder</code>:</p>

<pre><code>from pyspark.sql import SparkSession

# Build a session, or reuse one that already exists in this process.
spark = (
    SparkSession.builder
    .appName("PythonSQL")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
)
</code></pre>

<p><code>SparkSession</code> in Spark 2.0 provides built-in support for Hive features, including the ability to write queries using HiveQL, access to Hive UDFs (user-defined functions), and the ability to read data from Hive tables. To use these features, you do not need to have an existing Hive setup.</p>

<h2 id="creating-dataframes">Creating DataFrames</h2>
<p>With a <code>SparkSession</code>, applications can create DataFrames from an <a href="http://spark.apache.org/docs/latest/sql-programming-guide.html#interoperating-with-rdds">existing <code>RDD</code></a>, from a Hive table, or from <a href="http://spark.apache.org/docs/latest/sql-programming-guide.html#data-sources">Spark data sources</a>.</p>

<p>As an example, the following creates a DataFrame based on the content of a JSON file:</p>