pull latest from apache spark #6

Merged
merged 1,347 commits into from
Mar 18, 2016
This pull request is big! We’re only showing the most recent 250 commits.

Commits on Mar 3, 2016

  1. [SPARK-13423][HOTFIX] Static analysis fixes for 2.x / fixed for Scala 2.10
    
    ## What changes were proposed in this pull request?
    
    Fixes a compile problem caused by the inadvertent use of `Option.contains`, which exists only in Scala 2.11. The change should have been to replace `Option.exists(_ == x)` with `== Some(x)`; replacing `exists` with `contains` only makes sense for collections. Replacing the use of `Option.exists` still makes sense, though, as it's misleading.
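
    To make the distinction concrete, here is a minimal Scala sketch (illustrative values only) of the three forms involved:

    ```scala
    val opt: Option[Int] = Some(3)

    // Compiles on both Scala 2.10 and 2.11, but reads like a collection predicate:
    val viaExists = opt.exists(_ == 3)   // true

    // The intended replacement: states the comparison directly.
    val viaEquals = opt == Some(3)       // true

    // Only available since Scala 2.11, which is what broke the 2.10 build:
    // val viaContains = opt.contains(3)
    ```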
    
    ## How was this patch tested?
    
    Jenkins tests / compilation
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes #11493 from srowen/SPARK-13423.2.
    srowen committed Mar 3, 2016 (commit 645c3a8)
  2. [SPARK-13013][DOCS] Replace example code in mllib-clustering.md using include_example
    
    Replace example code in mllib-clustering.md using include_example
    https://issues.apache.org/jira/browse/SPARK-13013
    
    The example code in the user guide is embedded in the markdown and hence it is not easy to test. It would be nice to automatically test them. This JIRA is to discuss options to automate example code testing and see what we can do in Spark 1.6.
    
    Goal is to move actual example code to spark/examples and test compilation in Jenkins builds. Then in the markdown, we can reference part of the code to show in the user guide. This requires adding a Jekyll tag that is similar to https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., called include_example.
    `{% include_example scala/org/apache/spark/examples/mllib/KMeansExample.scala %}`
    Jekyll will find `examples/src/main/scala/org/apache/spark/examples/mllib/KMeansExample.scala`, pick the code blocks marked "example", and render them as the `{% highlight %}` code block in the markdown.
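
    As a rough sketch (assuming comment markers such as `// $example on$` / `// $example off$` delimit the snippet; the exact marker syntax is defined by the new Jekyll tag), a referenced example file might look like:

    ```scala
    // examples/src/main/scala/org/apache/spark/examples/mllib/KMeansExample.scala
    object KMeansExample {
      def main(args: Array[String]): Unit = {
        // $example on$
        // ... the KMeans code shown in the user guide goes here ...
        println("KMeans example body")
        // $example off$
      }
    }
    ```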
    
    See more sub-tasks in parent ticket: https://issues.apache.org/jira/browse/SPARK-11337
    
    Author: Xin Ren <iamshrek@126.com>
    
    Closes #11116 from keypointt/SPARK-13013.
    keypointt authored and mengxr committed Mar 3, 2016 (commit 70f6f96)
  3. [SPARK-13599][BUILD] remove transitive groovy dependencies from Hive

    ## What changes were proposed in this pull request?
    
    Modifies the dependency declarations of the all the hive artifacts, to explicitly exclude the groovy-all JAR.
    
    This stops the groovy classes *and everything else in that uber-JAR* from getting into spark-assembly JAR.
    
    ## How was this patch tested?
    
    1. Pre-patch build was made: `mvn clean install -Pyarn,hive,hive-thriftserver`
    1. spark-assembly expanded, observed to have the org.codehaus.groovy packages and JARs
    1. A maven dependency tree was created `mvn dependency:tree -Pyarn,hive,hive-thriftserver  -Dverbose > target/dependencies.txt`
    1. This text file examined to confirm that groovy was being imported as a dependency of `org.spark-project.hive`
    1. Patch applied
    1. Repeated step1: clean build of project with ` -Pyarn,hive,hive-thriftserver` set
    1. Examined created spark-assembly, verified no org.codehaus packages
    1. Verified that the maven dependency tree no longer references groovy
    
    Note also that the size of the assembly JAR was 181628646 bytes before this patch and 166318515 bytes after, about 15 MB smaller. That's a good indication that things are being excluded.
    
    Author: Steve Loughran <stevel@hortonworks.com>
    
    Closes #11449 from steveloughran/fixes/SPARK-13599-groovy-dependency.
    steveloughran authored and rxin committed Mar 3, 2016 (commit 9a48c65)
  4. [SPARK-12877][ML] Add train-validation-split to pyspark

    ## What changes were proposed in this pull request?
    The changes proposed were to add train-validation-split to pyspark.ml.tuning.
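
    For context, the new Python class mirrors the existing Scala API; a hedged Scala sketch of the equivalent usage (the estimator and grid values here are illustrative) looks like:

    ```scala
    import org.apache.spark.ml.evaluation.RegressionEvaluator
    import org.apache.spark.ml.regression.LinearRegression
    import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}

    val lr = new LinearRegression()
    val paramGrid = new ParamGridBuilder()
      .addGrid(lr.regParam, Array(0.01, 0.1))
      .build()

    // Hold out 20% of the data for validation instead of doing k-fold cross-validation.
    val tvs = new TrainValidationSplit()
      .setEstimator(lr)
      .setEvaluator(new RegressionEvaluator())
      .setEstimatorParamMaps(paramGrid)
      .setTrainRatio(0.8)
    ```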
    
    ## How was this patch tested?
    This patch was tested through unit tests located in pyspark/ml/test.py.
    
    This is my original work and I license it to Spark.
    
    Author: JeremyNixon <jnixon2@gmail.com>
    
    Closes #11335 from JeremyNixon/tvs_pyspark.
    JeremyNixon authored and mengxr committed Mar 3, 2016 (commit 511d492)
  5. [SPARK-13543][SQL] Support for specifying compression codec for Parquet/ORC via option()
    
    ## What changes were proposed in this pull request?
    
    This PR adds the support to specify compression codecs for both ORC and Parquet.
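
    A minimal usage sketch (assuming the option key is `compression` and codec names such as `snappy` and `zlib`; the paths are illustrative):

    ```scala
    import org.apache.spark.sql.SQLContext

    def writeCompressed(sqlContext: SQLContext): Unit = {
      val df = sqlContext.range(0, 100)
      // Parquet with Snappy compression.
      df.write.option("compression", "snappy").parquet("/tmp/out-parquet")
      // ORC with zlib compression.
      df.write.option("compression", "zlib").orc("/tmp/out-orc")
    }
    ```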
    
    ## How was this patch tested?
    
    unittests within IDE and code style tests with `dev/run_tests`.
    
    Author: hyukjinkwon <gurwls223@gmail.com>
    
    Closes #11464 from HyukjinKwon/SPARK-13543.
    HyukjinKwon authored and rxin committed Mar 3, 2016 (commit cf95d72)
  6. [MINOR][ML][DOC] Remove duplicated periods at the end of some sharedParam
    
    ## What changes were proposed in this pull request?
    Remove duplicated periods at the end of some sharedParams in ScalaDoc, such as [here](https://github.com/apache/spark/pull/11344/files#diff-9edc669edcf2c0c7cf1efe4a0a57da80L367)
    cc mengxr srowen
    ## How was this patch tested?
    Documents change, no test.
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes #11344 from yanboliang/shared-cleanup.
    yanboliang authored and jkbradley committed Mar 3, 2016 (commit ce58e99)
  7. [SPARK-13423][HOTFIX] Static analysis fixes for 2.x / fixed for Scala 2.10, again
    
    ## What changes were proposed in this pull request?
    
    Fixes (another) compile problem due to inadvertent use of Option.contains, only in Scala 2.11
    
    ## How was this patch tested?
    
    Jenkins tests
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes #11496 from srowen/SPARK-13423.3.
    srowen committed Mar 3, 2016 (commit 52035d1)
  8. [MINOR] Fix typos in comments and testcase name of code

    ## What changes were proposed in this pull request?
    
    This PR fixes typos in comments and testcase name of code.
    
    ## How was this patch tested?
    
    manual.
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #11481 from dongjoon-hyun/minor_fix_typos_in_code.
    dongjoon-hyun authored and srowen committed Mar 3, 2016 (commit 941b270)
  9. [SPARK-13632][SQL] Move commands.scala to command package

    ## What changes were proposed in this pull request?
    
    This patch simply moves things to a new package in an effort to reduce the size of the diff in #11048. Currently the new package only has one file, but in the future we'll add many new commands in SPARK-13139.
    
    ## How was this patch tested?
    
    Jenkins.
    
    Author: Andrew Or <andrew@databricks.com>
    
    Closes #11482 from andrewor14/commands-package.
    Andrew Or authored and rxin committed Mar 3, 2016 (commit 3edcc40)
  10. [SPARK-13584][SQL][TESTS] Make ContinuousQueryManagerSuite not output logs to the console
    
    ## What changes were proposed in this pull request?
    
    Make ContinuousQueryManagerSuite not output logs to the console. The logs will still output to `unit-tests.log`.
    
    I also updated `SQLListenerMemoryLeakSuite` to use `quietly` to avoid changing the log level which won't output logs to `unit-tests.log`.
    
    ## How was this patch tested?
    
    Just check Jenkins output.
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #11439 from zsxwing/quietly-ContinuousQueryManagerSuite.
    zsxwing committed Mar 3, 2016 (commit ad0de99)

Commits on Mar 4, 2016

  1. [SPARK-13415][SQL] Visualize subquery in SQL web UI

    ## What changes were proposed in this pull request?
    
    This PR supports visualization of subqueries in the SQL web UI and also improves the explain output for subqueries, especially when they are used together with whole-stage codegen.
    
    For example:
    ```python
    >>> sqlContext.range(100).registerTempTable("range")
    >>> sqlContext.sql("select id / (select sum(id) from range) from range where id > (select id from range limit 1)").explain(True)
    == Parsed Logical Plan ==
    'Project [unresolvedalias(('id / subquery#9), None)]
    :  +- 'SubqueryAlias subquery#9
    :     +- 'Project [unresolvedalias('sum('id), None)]
    :        +- 'UnresolvedRelation `range`, None
    +- 'Filter ('id > subquery#8)
       :  +- 'SubqueryAlias subquery#8
       :     +- 'GlobalLimit 1
       :        +- 'LocalLimit 1
       :           +- 'Project [unresolvedalias('id, None)]
       :              +- 'UnresolvedRelation `range`, None
       +- 'UnresolvedRelation `range`, None
    
    == Analyzed Logical Plan ==
    (id / scalarsubquery()): double
    Project [(cast(id#0L as double) / cast(subquery#9 as double)) AS (id / scalarsubquery())#11]
    :  +- SubqueryAlias subquery#9
    :     +- Aggregate [(sum(id#0L),mode=Complete,isDistinct=false) AS sum(id)#10L]
    :        +- SubqueryAlias range
    :           +- Range 0, 100, 1, 4, [id#0L]
    +- Filter (id#0L > subquery#8)
       :  +- SubqueryAlias subquery#8
       :     +- GlobalLimit 1
       :        +- LocalLimit 1
       :           +- Project [id#0L]
       :              +- SubqueryAlias range
       :                 +- Range 0, 100, 1, 4, [id#0L]
       +- SubqueryAlias range
          +- Range 0, 100, 1, 4, [id#0L]
    
    == Optimized Logical Plan ==
    Project [(cast(id#0L as double) / cast(subquery#9 as double)) AS (id / scalarsubquery())#11]
    :  +- SubqueryAlias subquery#9
    :     +- Aggregate [(sum(id#0L),mode=Complete,isDistinct=false) AS sum(id)#10L]
    :        +- Range 0, 100, 1, 4, [id#0L]
    +- Filter (id#0L > subquery#8)
       :  +- SubqueryAlias subquery#8
       :     +- GlobalLimit 1
       :        +- LocalLimit 1
       :           +- Project [id#0L]
       :              +- Range 0, 100, 1, 4, [id#0L]
       +- Range 0, 100, 1, 4, [id#0L]
    
    == Physical Plan ==
    WholeStageCodegen
    :  +- Project [(cast(id#0L as double) / cast(subquery#9 as double)) AS (id / scalarsubquery())#11]
    :     :  +- Subquery subquery#9
    :     :     +- WholeStageCodegen
    :     :        :  +- TungstenAggregate(key=[], functions=[(sum(id#0L),mode=Final,isDistinct=false)], output=[sum(id)#10L])
    :     :        :     +- INPUT
    :     :        +- Exchange SinglePartition, None
    :     :           +- WholeStageCodegen
    :     :              :  +- TungstenAggregate(key=[], functions=[(sum(id#0L),mode=Partial,isDistinct=false)], output=[sum#14L])
    :     :              :     +- Range 0, 1, 4, 100, [id#0L]
    :     +- Filter (id#0L > subquery#8)
    :        :  +- Subquery subquery#8
    :        :     +- CollectLimit 1
    :        :        +- WholeStageCodegen
    :        :           :  +- Project [id#0L]
    :        :           :     +- Range 0, 1, 4, 100, [id#0L]
    :        +- Range 0, 1, 4, 100, [id#0L]
    ```
    
    The web UI looks like:
    
    ![subquery](https://cloud.githubusercontent.com/assets/40902/13377963/932bcbae-dda7-11e5-82f7-03c9be85d77c.png)
    
    This PR also changes the tree structure of WholeStageCodegen to make it consistent with others. Before this change, both WholeStageCodegen and InputAdapter held references to the same plans, which could be updated without notifying the other, causing problems; this was discovered by #11403.
    
    ## How was this patch tested?
    
    Existing tests, also manual tests with the example query, check the explain and web UI.
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #11417 from davies/viz_subquery.
    Davies Liu authored and yhuai committed Mar 4, 2016 (commit b373a88)
  2. [SPARK-13601] [TESTS] use 1 partition in tests to avoid race conditions

    ## What changes were proposed in this pull request?
    
    Fix race conditions when cleaning up files.
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #11507 from davies/flaky.
    Davies Liu authored and davies committed Mar 4, 2016 (commit d062587)
  3. [SPARK-13647] [SQL] also check if numeric value is within allowed range in _verify_type
    
    ## What changes were proposed in this pull request?
    
    This PR makes `_verify_type` in `types.py` more strict: it also checks whether a numeric value is within the allowed range.
    
    ## How was this patch tested?
    
    newly added doc test.
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #11492 from cloud-fan/py-verify.
    cloud-fan authored and davies committed Mar 4, 2016 (commit 15d57f9)
  4. [SPARK-12941][SQL][MASTER] Spark-SQL JDBC Oracle dialect fails to map string datatypes to Oracle VARCHAR datatype mapping
    
    ## What changes were proposed in this pull request?
    A test suite was added for the bug fix (SPARK-12941), covering the mapping of StringType to the corresponding Oracle VARCHAR datatype.
    
    ## How was this patch tested?
    manual tests done
    
    Author: thomastechs <thomas.sebastian@tcs.com>
    Author: THOMAS SEBASTIAN <thomas.sebastian@tcs.com>
    
    Closes #11489 from thomastechs/thomastechs-12941-master-new.
    thomastechs authored and yhuai committed Mar 4, 2016 (commit f6ac7c3)
  5. [SPARK-13652][CORE] Copy ByteBuffer in sendRpcSync as it will be recycled
    
    ## What changes were proposed in this pull request?
    
    `sendRpcSync` should copy the response content because the underlying buffer will be recycled and reused.
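
    The idea of the fix, as a hedged standalone sketch (not the actual transport-layer code):

    ```scala
    import java.nio.ByteBuffer

    // Copy the response bytes before the underlying buffer is recycled and reused.
    def copyResponse(response: ByteBuffer): ByteBuffer = {
      val copy = ByteBuffer.allocate(response.remaining())
      copy.put(response)
      copy.flip()
      copy
    }
    ```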
    
    ## How was this patch tested?
    
    Jenkins unit tests.
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #11499 from zsxwing/SPARK-13652.
    zsxwing committed Mar 4, 2016 (commit 465c665)
  6. [SPARK-13603][SQL] support SQL generation for subquery

    ## What changes were proposed in this pull request?
    
    This adds support for SQL generation for subquery expressions, which are recursively replaced with a SubqueryHolder inside SQLBuilder.
    
    ## How was this patch tested?
    
    Added unit tests.
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #11453 from davies/sql_subquery.
    Davies Liu authored and liancheng committed Mar 4, 2016 (commit dd83c20)
  7. [SPARK-13646][MLLIB] QuantileDiscretizer counts dataset twice in get…

    ## What changes were proposed in this pull request?
    
    It avoids counting the dataframe twice.
    
    Author: Abou Haydar Elias <abouhaydar.elias@gmail.com>
    Author: Elie A <abouhaydar.elias@gmail.com>
    
    Closes #11491 from eliasah/quantile-discretizer-patch.
    eliasah authored and srowen committed Mar 4, 2016 (commit 27e88fa)
  8. [SPARK-13398][STREAMING] Move away from thread pool task support to forkjoin
    
    ## What changes were proposed in this pull request?
    
    Remove the old deprecated ThreadPoolExecutor and replace it with an ExecutionContext using a ForkJoinPool. The downside of this is that Scala's ForkJoinPool doesn't give us a way to specify the thread pool name (and is a wrapper of Java's in 2.12) except by providing a custom factory. Note that we can't use Java's ForkJoinPool directly in Scala 2.11 since it uses an ExecutionContext which reports system parallelism. One other implicit change is that the old ExecutionContext would have reported a different default parallelism since it used system parallelism rather than threadpool parallelism (this was likely not intended but also likely not a huge difference).
    
    The previous version of this PR attempted to use an execution context constructed on the ThreadPool (but not the deprecated ThreadPoolExecutor class) so as to keep the ability to have human readable named threads but this reported system parallelism.
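
    A hedged sketch of why a custom factory is needed to get named threads (illustrative only, not the helper the patch adds):

    ```scala
    import java.util.concurrent.{ForkJoinPool, ForkJoinWorkerThread}
    import scala.concurrent.ExecutionContext

    def namedForkJoinContext(prefix: String, parallelism: Int): ExecutionContext = {
      val factory = new ForkJoinPool.ForkJoinWorkerThreadFactory {
        override def newThread(pool: ForkJoinPool): ForkJoinWorkerThread = {
          val thread = ForkJoinPool.defaultForkJoinWorkerThreadFactory.newThread(pool)
          thread.setName(prefix + "-" + thread.getPoolIndex)
          thread
        }
      }
      // Parallelism comes from the pool itself, not from the system default.
      ExecutionContext.fromExecutorService(
        new ForkJoinPool(parallelism, factory, null, false))
    }
    ```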
    
    ## How was this patch tested?
    
    unit tests: streaming/testOnly org.apache.spark.streaming.util.*
    
    Author: Holden Karau <holden@us.ibm.com>
    
    Closes #11423 from holdenk/SPARK-13398-move-away-from-ThreadPoolTaskSupport-java-forkjoin.
    holdenk authored and srowen committed Mar 4, 2016 (commit c04dc27)
  9. [SPARK-12925] Improve HiveInspectors.unwrap for StringObjectInspector.…

    The earlier fix did not copy the bytes, and it was possible for a higher level to reuse the Text object, which was causing issues. The proposed fix now copies the bytes from Text while still avoiding the expensive encoding/decoding.
    
    Author: Rajesh Balamohan <rbalamohan@apache.org>
    
    Closes #11477 from rajeshbalamohan/SPARK-12925.2.
    rbalamohan authored and srowen committed Mar 4, 2016 (commit 204b02b)
  10. [SPARK-13673][WINDOWS] Fixed not to pollute environment variables.

    ## What changes were proposed in this pull request?
    
    This patch fixes the problem that `bin\beeline.cmd` pollutes environment variables.
    The similar problem is reported and fixed in https://issues.apache.org/jira/browse/SPARK-3943, but `bin\beeline.cmd` seems to be added later.
    
    ## How was this patch tested?
    
    manual tests:
      I executed the new `bin\beeline.cmd` and confirmed that %SPARK_HOME% doesn't remain in the command prompt.
    
    Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>
    
    Closes #11516 from tsudukim/feature/SPARK-13673.
    tsudukim authored and srowen committed Mar 4, 2016 (commit e617508)
  11. [SPARK-13676] Fix mismatched default values for regParam in LogisticRegression
    
    ## What changes were proposed in this pull request?
    
    The default value of regularization parameter for `LogisticRegression` algorithm is different in Scala and Python. We should provide the same value.
    
    **Scala**
    ```
    scala> new org.apache.spark.ml.classification.LogisticRegression().getRegParam
    res0: Double = 0.0
    ```
    
    **Python**
    ```
    >>> from pyspark.ml.classification import LogisticRegression
    >>> LogisticRegression().getRegParam()
    0.1
    ```
    
    ## How was this patch tested?
    manual. Check the following in `pyspark`.
    ```
    >>> from pyspark.ml.classification import LogisticRegression
    >>> LogisticRegression().getRegParam()
    0.0
    ```
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #11519 from dongjoon-hyun/SPARK-13676.
    dongjoon-hyun authored and mengxr committed Mar 4, 2016 (commit c8f2545)
  12. [SPARK-13036][SPARK-13318][SPARK-13319] Add save/load for feature.py

    Add save/load for feature.py. Meanwhile, add save/load for `ElementwiseProduct` in Scala side and fix a bug of missing `setDefault` in `VectorSlicer` and `StopWordsRemover`.
    
    In this PR I ignore `RFormula` and `RFormulaModel` because their Scala implementation is pending in #9884. I'll add them in this PR if #9884 gets merged first, or add a follow-up JIRA for `RFormula`.
    
    Author: Xusen Yin <yinxusen@gmail.com>
    
    Closes #11203 from yinxusen/SPARK-13036.
    yinxusen authored and mengxr committed Mar 4, 2016 (commit 83302c3)
  13. [SPARK-13633][SQL] Move things into catalyst.parser package

    ## What changes were proposed in this pull request?
    
    This patch simply moves things to existing package `o.a.s.sql.catalyst.parser` in an effort to reduce the size of the diff in #11048. This is conceptually the same as a recently merged patch #11482.
    
    ## How was this patch tested?
    
    Jenkins.
    
    Author: Andrew Or <andrew@databricks.com>
    
    Closes #11506 from andrewor14/parser-package.
    Andrew Or committed Mar 4, 2016 (commit b7d4147)
  14. [SPARK-13459][WEB UI] Separate Alive and Dead Executors in Executor Totals Table
    
    ## What changes were proposed in this pull request?
    
    Now that dead executors are shown in the executors table (#10058) the totals table is updated to include the separate totals for alive and dead executors as well as the current total, as originally discussed in #10668
    
    ## How was this patch tested?
    
    Manually verified by running the Standalone Web UI in the latest Safari and Firefox ESR
    
    Author: Alex Bozarth <ajbozart@us.ibm.com>
    
    Closes #11381 from ajbozarth/spark13459.
    ajbozarth authored and Tom Graves committed Mar 4, 2016 (commit 5f42c28)
  15. [SPARK-13255] [SQL] Update vectorized reader to directly return ColumnarBatch instead of InternalRows.
    
    ## What changes were proposed in this pull request?
    
    Currently, the parquet reader returns rows one by one which is bad for performance. This patch
    updates the reader to directly return ColumnarBatches. This is only enabled with whole stage
    codegen, which is the only operator currently that is able to consume ColumnarBatches (instead
    of rows). The current implementation is a bit of a hack to get this to work and we should do
    more refactoring of these low level interfaces to make this work better.
    
    ## How was this patch tested?
    
    ```
    Results:
    TPCDS:                             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)
    ---------------------------------------------------------------------------------
    q55 (before)                             8897 / 9265         12.9          77.2
    q55                                      5486 / 5753         21.0          47.6
    ```
    
    Author: Nong Li <nong@databricks.com>
    
    Closes #11435 from nongli/spark-13255.
    nongli authored and davies committed Mar 4, 2016 (commit a6e2bd3)

Commits on Mar 5, 2016

  1. [SPARK-12073][STREAMING] backpressure rate controller consumes events preferentially from lagging partitions
    
    I'm pretty sure this is the reason we couldn't easily recover from an unbalanced Kafka partition under heavy load when using backpressure.
    
    `maxMessagesPerPartition` calculates an appropriate limit for the message rate from all partitions, and then divides by the number of partitions to determine how many messages to retrieve per partition. The problem with this approach is that when one partition is behind by millions of records (due to random Kafka issues) and the rate estimator calculates that only 100k total messages can be retrieved, each partition (out of, say, 32) retrieves at most 100k/32 = 3125 messages.
    
    This PR (still needing a test) determines a per-partition desired message count by using the current lag for each partition to preferentially weight the total message limit among the partitions. In this situation, if each partition gets 1k messages, but 1 partition starts 1M behind, then the total number of messages to retrieve is (32 * 1k + 1M) = 1032000 messages, of which the one partition needs 1001000. So, it gets (1001000 / 1032000) = 97% of the 100k messages, and the other 31 partitions share the remaining 3%.
    
    Assuming all 100k of the messages are retrieved and processed within the batch window, the rate calculator will increase the number of messages to retrieve in the next batch until it reaches a new stable point or the backlog is finished being processed.
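
    A small Scala sketch of the lag-weighted split described above (illustrative only, not the actual DirectKafkaInputDStream code):

    ```scala
    // Split a total message budget across partitions in proportion to each
    // partition's current lag.
    def weightedLimits(lags: Map[Int, Long], totalBudget: Long): Map[Int, Long] = {
      val totalLag = lags.values.sum.toDouble
      lags.map { case (partition, lag) =>
        partition -> math.round(totalBudget * (lag / totalLag))
      }
    }

    // 31 partitions lagging by 1k each and one lagging by 1M, with a 100k budget:
    // the lagging partition gets roughly 97% of the budget, as in the example above.
    val lags = (0 until 31).map(_ -> 1000L).toMap + (31 -> 1000000L)
    val limits = weightedLimits(lags, 100000L)
    ```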
    
    We're going to try deploying this internally at Shopify to see if this resolves our issue.
    
    tdas koeninger holdenk
    
    Author: Jason White <jason.white@shopify.com>
    
    Closes #10089 from JasonMWhite/rate_controller_offsets.
    JasonMWhite authored and zsxwing committed Mar 5, 2016 (commit f19228e)
  2. [SPARK-12720][SQL] SQL Generation Support for Cube, Rollup, and Grouping Sets
    
    #### What changes were proposed in this pull request?
    
    This PR is for supporting SQL generation for cube, rollup and grouping sets.
    
    For example, a query using rollup:
    ```SQL
    SELECT count(*) as cnt, key % 5, grouping_id() FROM t1 GROUP BY key % 5 WITH ROLLUP
    ```
    Original logical plan:
    ```
      Aggregate [(key#17L % cast(5 as bigint))#47L,grouping__id#46],
                [(count(1),mode=Complete,isDistinct=false) AS cnt#43L,
                 (key#17L % cast(5 as bigint))#47L AS _c1#45L,
                 grouping__id#46 AS _c2#44]
      +- Expand [List(key#17L, value#18, (key#17L % cast(5 as bigint))#47L, 0),
                 List(key#17L, value#18, null, 1)],
                [key#17L,value#18,(key#17L % cast(5 as bigint))#47L,grouping__id#46]
         +- Project [key#17L,
                     value#18,
                     (key#17L % cast(5 as bigint)) AS (key#17L % cast(5 as bigint))#47L]
            +- Subquery t1
               +- Relation[key#17L,value#18] ParquetRelation
    ```
    Converted SQL:
    ```SQL
      SELECT count( 1) AS `cnt`,
             (`t1`.`key` % CAST(5 AS BIGINT)),
             grouping_id() AS `_c2`
      FROM `default`.`t1`
      GROUP BY (`t1`.`key` % CAST(5 AS BIGINT))
      GROUPING SETS (((`t1`.`key` % CAST(5 AS BIGINT))), ())
    ```
    
    #### How was this patch tested?
    
    Added eight test cases in `LogicalPlanToSQLSuite`.
    
    Author: gatorsmile <gatorsmile@gmail.com>
    Author: xiaoli <lixiao1983@gmail.com>
    Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
    
    Closes #11283 from gatorsmile/groupingSetsToSQL.
    gatorsmile authored and liancheng committed Mar 5, 2016 (commit adce5ee)
  3. [SPARK-13693][STREAMING][TESTS] Stop StreamingContext before deleting checkpoint dir
    
    ## What changes were proposed in this pull request?
    
    Stop the StreamingContext before deleting the checkpoint dir to avoid the race condition where deleting the checkpoint dir and writing a checkpoint happen at the same time.
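
    The ordering the fix enforces, as a hedged sketch (the helper name is illustrative):

    ```scala
    import java.io.File
    import org.apache.commons.io.FileUtils
    import org.apache.spark.streaming.StreamingContext

    def tearDown(ssc: StreamingContext, checkpointDir: String): Unit = {
      // Stop the context first so no checkpoint write can race with the delete.
      ssc.stop(stopSparkContext = false)
      FileUtils.deleteDirectory(new File(checkpointDir))
    }
    ```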
    
    The flaky test log is here: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/256/testReport/junit/org.apache.spark.streaming/MapWithStateSuite/_It_is_not_a_test_/
    
    ## How was this patch tested?
    
    unit tests
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #11531 from zsxwing/SPARK-13693.
    zsxwing committed Mar 5, 2016 (commit 8290004)

Commits on Mar 6, 2016

  1. Revert "[SPARK-13616][SQL] Let SQLBuilder convert logical plan without a project on top of it"
    
    This reverts commit f87ce05.
    
    According to discussion in #11466, let's revert PR #11466 for safe.
    
    Author: Cheng Lian <lian@databricks.com>
    
    Closes #11539 from liancheng/revert-pr-11466.
    liancheng committed Mar 6, 2016 (commit 8ff8809)
  2. [SPARK-13697] [PYSPARK] Fix the missing module name of TransformFunctionSerializer.loads
    
    ## What changes were proposed in this pull request?
    
    Set the function's module name to `__main__` if it's missing in `TransformFunctionSerializer.loads`.
    
    ## How was this patch tested?
    
    Manually test in the shell.
    
    Before this patch:
    ```
    >>> from pyspark.streaming import StreamingContext
    >>> from pyspark.streaming.util import TransformFunction
    >>> ssc = StreamingContext(sc, 1)
    >>> func = TransformFunction(sc, lambda x: x, sc.serializer)
    >>> func.rdd_wrapper(lambda x: x)
    TransformFunction(<function <lambda> at 0x106ac8b18>)
    >>> bytes = bytearray(ssc._transformerSerializer.serializer.dumps((func.func, func.rdd_wrap_func, func.deserializers)))
    >>> func2 = ssc._transformerSerializer.loads(bytes)
    >>> print(func2.func.__module__)
    None
    >>> print(func2.rdd_wrap_func.__module__)
    None
    >>>
    ```
    After this patch:
    ```
    >>> from pyspark.streaming import StreamingContext
    >>> from pyspark.streaming.util import TransformFunction
    >>> ssc = StreamingContext(sc, 1)
    >>> func = TransformFunction(sc, lambda x: x, sc.serializer)
    >>> func.rdd_wrapper(lambda x: x)
    TransformFunction(<function <lambda> at 0x108bf1b90>)
    >>> bytes = bytearray(ssc._transformerSerializer.serializer.dumps((func.func, func.rdd_wrap_func, func.deserializers)))
    >>> func2 = ssc._transformerSerializer.loads(bytes)
    >>> print(func2.func.__module__)
    __main__
    >>> print(func2.rdd_wrap_func.__module__)
    __main__
    >>>
    ```
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #11535 from zsxwing/loads-module.
    zsxwing authored and davies committed Mar 6, 2016 (commit ee913e6)

Commits on Mar 7, 2016

  1. [SPARK-13685][SQL] Rename catalog.Catalog to ExternalCatalog

    ## What changes were proposed in this pull request?
    
    Today we have `analysis.Catalog` and `catalog.Catalog`. In the future the former will call the latter. When that happens, if both of them are still called `Catalog` it will be very confusing. This patch renames the latter `ExternalCatalog` because it is expected to talk to external systems.
    
    ## How was this patch tested?
    
    Jenkins.
    
    Author: Andrew Or <andrew@databricks.com>
    
    Closes #11526 from andrewor14/rename-catalog.
    Andrew Or authored and rxin committed Mar 7, 2016 (commit bc7a3ec)
  2. [SPARK-13705][DOCS] UpdateStateByKey Operation documentation incorrectly refers to StatefulNetworkWordCount
    
    ## What changes were proposed in this pull request?
    The reference to StatefulNetworkWordCount.scala should be removed from the updateStateByKey documentation until there is an example for updateStateByKey.
    
    ## How was this patch tested?
    Have tested the new documentation with jekyll build.
    
    Author: rmishra <rmishra@pivotal.io>
    
    Closes #11545 from rishitesh/SPARK-13705.
    rmishra authored and srowen committed Mar 7, 2016 (commit 4b13896)
  3. Fixing the type of the sentiment happiness value

    ## What changes were proposed in this pull request?
    
    Added the conversion to int for the 'happiness value' read from the file. Otherwise, later on line 75 the multiplication will multiply a string by a number, yielding values like "-2-2" instead of -4.
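
    A tiny Scala illustration of why the cast matters (multiplying a string by an Int repeats the string instead of doing arithmetic):

    ```scala
    val happiness = "-2"              // value as read from the file
    val wrong = happiness * 2         // "-2-2": string repetition, not arithmetic
    val right = happiness.toInt * 2   // -4
    ```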
    
    ## How was this patch tested?
    
    Tested manually.
    
    Author: Yury Liavitski <seconds.before@gmail.com>
    Author: Yury Liavitski <yury.liavitski@il111.ice.local>
    
    Closes #11540 from heliocentrist/fix-sentiment-value-type.
    heliocentrist authored and srowen committed Mar 7, 2016 (commit 03f57a6)
  4. [SPARK-13651] Generator outputs are not resolved correctly resulting in run time error
    
    ## What changes were proposed in this pull request?
    
    ```
    Seq(("id1", "value1")).toDF("key", "value").registerTempTable("src")
    sqlContext.sql("SELECT t1.* FROM src LATERAL VIEW explode(map('key1', 100, 'key2', 200)) t1 AS key, value")
    ```
    Results in following logical plan
    
    ```
    Project [key#2,value#3]
    +- Generate explode(HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFMap(key1,100,key2,200)), true, false, Some(genoutput), [key#2,value#3]
       +- SubqueryAlias src
          +- Project [_1#0 AS key#2,_2#1 AS value#3]
             +- LocalRelation [_1#0,_2#1], [[id1,value1]]
    ```
    
    The above query fails with following runtime error.
    ```
    java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.spark.unsafe.types.UTF8String
    	at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
    	at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:221)
    	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(generated.java:42)
    	at org.apache.spark.sql.execution.Generate$$anonfun$doExecute$1$$anonfun$apply$9.apply(Generate.scala:98)
    	at org.apache.spark.sql.execution.Generate$$anonfun$doExecute$1$$anonfun$apply$9.apply(Generate.scala:96)
    	at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
    	at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
    	at scala.collection.Iterator$class.foreach(Iterator.scala:742)
    	at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
            <stack-trace omitted.....>
    ```
    In this case the generated outputs are wrongly resolved from its child (LocalRelation) due to
    https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L537-L548
    ## How was this patch tested?
    
    Added unit tests in hive/SQLQuerySuite and AnalysisSuite
    
    Author: Dilip Biswal <dbiswal@us.ibm.com>
    
    Closes #11497 from dilipbiswal/spark-13651.
    dilipbiswal authored and davies committed Mar 7, 2016 (commit d7eac9d)
  5. [SPARK-13694][SQL] QueryPlan.expressions should always include all expressions
    
    ## What changes were proposed in this pull request?
    
    It's weird that `QueryPlan.expressions` doesn't always contain all the expressions in the plan. This PR marks `QueryPlan.expressions` final to forbid subclasses from overriding it to exclude some expressions. Currently only `Generate` overrides it; we can use `producedAttributes` to fix the unresolved attribute problem for it.
    
    Note that this PR doesn't fix the problem in #11497
    
    ## How was this patch tested?
    
    existing tests.
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #11532 from cloud-fan/generate.
    cloud-fan authored and marmbrus committed Mar 7, 2016 (commit 4896411)
  6. [SPARK-13495][SQL] Add Null Filters in the query plan for Filters/Joins based on their data constraints
    
    ## What changes were proposed in this pull request?
    
    This PR adds an optimizer rule to eliminate reading (unnecessary) NULL values, when they are not required for correctness, by inserting `isNotNull` filters into the query plan. These filters are currently inserted beneath existing `Filter` and `Join` operators and are inferred based on their data constraints.
    
    Note: While this optimization is applicable to all types of join, it primarily benefits `Inner` and `LeftSemi` joins.
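
    A hedged DataFrame-level illustration of the inference (not the optimizer rule itself): with the rule in place, the first query effectively behaves like the second, which can skip null values early.

    ```scala
    import org.apache.spark.sql.SQLContext

    def example(sqlContext: SQLContext): Unit = {
      import sqlContext.implicits._
      val df = sqlContext.range(0, 10).toDF("a")

      val userQuery = df.filter($"a" > 5)
      val withInferredNullFilter = df.filter($"a".isNotNull && $"a" > 5)

      userQuery.explain(true)
      withInferredNullFilter.explain(true)
    }
    ```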
    
    ## How was this patch tested?
    
    1. Added a new `NullFilteringSuite` that tests for `IsNotNull` filters in the query plan for joins and filters. Also, tests interaction with the `CombineFilters` optimizer rules.
    2. Test generated ExpressionTrees via `OrcFilterSuite`
    3. Test filter source pushdown logic via `SimpleTextHadoopFsRelationSuite`
    
    cc yhuai nongli
    
    Author: Sameer Agarwal <sameer@databricks.com>
    
    Closes #11372 from sameeragarwal/gen-isnotnull.
    sameeragarwal authored and yhuai committed Mar 7, 2016 (commit ef77003)
  7. [SPARK-12243][BUILD][PYTHON] PySpark tests are slow in Jenkins.

    ## What changes were proposed in this pull request?
    
    In the Jenkins pull request builder, PySpark tests take around [962 seconds](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52530/console) of end-to-end time to run, despite the fact that we run four Python test suites in parallel. According to the log, the basic reason is that the long-running tests start at the end due to the FIFO queue. We first try to reduce the test time by starting some long-running tests first using a simple priority queue.
    
    ```
    ========================================================================
    Running PySpark tests
    ========================================================================
    ...
    Finished test(python3.4): pyspark.streaming.tests (213s)
    Finished test(pypy): pyspark.sql.tests (92s)
    Finished test(pypy): pyspark.streaming.tests (280s)
    Tests passed in 962 seconds
    ```
    
    ## How was this patch tested?
    
    Manual check.
    Check 'Running PySpark tests' part of the Jenkins log.
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #11551 from dongjoon-hyun/SPARK-12243.
    dongjoon-hyun authored and JoshRosen committed Mar 7, 2016 (commit e72914f)
  8. [MINOR][DOC] improve the doc for "spark.memory.offHeap.size"

    The description of "spark.memory.offHeap.size" in the current document does not clearly state that the memory is counted in bytes.
    
    This PR contains a small fix for this tiny issue
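
    For illustration, configuring 1 GB of off-heap memory means passing a byte count (the keys are the existing `spark.memory.offHeap.*` settings; the value format is the point being clarified):

    ```scala
    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.memory.offHeap.enabled", "true")
      // The size is a number of bytes, so 1 GB is 1073741824, not "1".
      .set("spark.memory.offHeap.size", (1024L * 1024 * 1024).toString)
    ```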
    
    document fix
    
    Author: CodingCat <zhunansjtu@gmail.com>
    
    Closes #11561 from CodingCat/master.
    CodingCat authored and zsxwing committed Mar 7, 2016 (commit a3ec50a)
  9. [SPARK-13722][SQL] No Push Down for Non-deterministics Predicates through Generate
    
    #### What changes were proposed in this pull request?
    
    Non-deterministic predicates should not be pushed through Generate.
    
    #### How was this patch tested?
    
    Added a test case in `FilterPushdownSuite.scala`
    
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes #11562 from gatorsmile/pushPredicateDownWindow.
    gatorsmile authored and marmbrus committed Mar 7, 2016 (commit b6071a7)
  10. [SPARK-13655] Improve isolation between tests in KinesisBackedBlockRDDSuite
    
    This patch modifies `KinesisBackedBlockRDDTests` to increase the isolation between tests in order to fix a bug which causes the tests to hang.
    
    See #11558 for more details.
    
    /cc zsxwing srowen
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #11564 from JoshRosen/SPARK-13655.
    JoshRosen authored and zsxwing committed Mar 7, 2016 (commit e9e67b3)
  11. [SPARK-529][CORE][YARN] Add type-safe config keys to SparkConf.

    This is, in a way, the basics to enable SPARK-529 (which was closed as
    won't fix but I think is still valuable). In fact, Spark SQL created
    something for that, and this change basically factors out that code
    and inserts it into SparkConf, with some extra bells and whistles.
    
    To showcase the usage of this pattern, I modified the YARN backend
    to use the new config keys (defined in the new `config` package object
    under `o.a.s.deploy.yarn`). Most of the changes are mechanical, although
    logic had to be slightly modified in a handful of places.
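
    A hypothetical, self-contained illustration of the type-safe key pattern described above (the names and shape here are invented for illustration, not the actual API added by this patch):

    ```scala
    // Each key carries its type, default, and parsing logic in one place.
    case class ConfigEntry[T](key: String, default: T, parse: String => T)

    val MAX_EXECUTOR_FAILURES: ConfigEntry[Int] =
      ConfigEntry("spark.yarn.max.executor.failures", 3, _.toInt)

    def get[T](settings: Map[String, String], entry: ConfigEntry[T]): T =
      settings.get(entry.key).map(entry.parse).getOrElse(entry.default)
    ```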
    
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    
    Closes #10205 from vanzin/conf-opts.
    Marcelo Vanzin committed Mar 7, 2016 (commit e1fb857)
  12. [SPARK-13442][SQL] Make type inference recognize boolean types

    ## What changes were proposed in this pull request?
    
    https://issues.apache.org/jira/browse/SPARK-13442
    
    This PR adds support for inferring `BooleanType` during schema inference.
    It infers case-insensitive `true` / `false` values as `BooleanType`.
    
    Unittests were added for `CSVInferSchemaSuite` and `CSVSuite` for end-to-end test.
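
    A hypothetical usage sketch (the reader option names shown here are assumptions): with schema inference enabled, a column containing only case-insensitive `true`/`false` values should now come back as `BooleanType` rather than `StringType`.

    ```scala
    import org.apache.spark.sql.SQLContext

    def inferFlags(sqlContext: SQLContext): Unit = {
      val df = sqlContext.read
        .format("csv")
        .option("header", "true")
        .option("inferSchema", "true")   // assumed option name for inference
        .load("/tmp/flags.csv")
      df.printSchema()                   // boolean-looking columns: BooleanType
    }
    ```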
    
    ## How was this patch tested?
    
    This was tested with unittests and with `dev/run_tests` for coding style
    
    Author: hyukjinkwon <gurwls223@gmail.com>
    
    Closes #11315 from HyukjinKwon/SPARK-13442.
    HyukjinKwon authored and rxin committed Mar 7, 2016 (commit 8577260)
  13. [SPARK-13596][BUILD] Move misc top-level build files into appropriate subdirs
    
    ## What changes were proposed in this pull request?
    
    Move many top-level files in dev/ or other appropriate directory. In particular, put `make-distribution.sh` in `dev` and update docs accordingly. Remove deprecated `sbt/sbt`.
    
    I was (so far) unable to figure out how to move `tox.ini`. `scalastyle-config.xml` should be movable but edits to the project `.sbt` files didn't work; config file location is updatable for compile but not test scope.
    
    ## How was this patch tested?
    
    `./dev/run-tests` to verify RAT and checkstyle work. Jenkins tests for the rest.
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes #11522 from srowen/SPARK-13596.
    srowen authored and rxin committed Mar 7, 2016 (commit 0eea12a)
  14. [SPARK-13665][SQL] Separate the concerns of HadoopFsRelation

    `HadoopFsRelation` is used for reading most files into Spark SQL.  However today this class mixes the concerns of file management, schema reconciliation, scan building, bucketing, partitioning, and writing data.  As a result, many data sources are forced to reimplement the same functionality and the various layers have accumulated a fair bit of inefficiency.  This PR is a first cut at separating this into several components / interfaces that are each described below.  Additionally, all implementations inside of Spark (parquet, csv, json, text, orc, svmlib) have been ported to the new API `FileFormat`.  External libraries, such as spark-avro will also need to be ported to work with Spark 2.0.
    
    ### HadoopFsRelation
    A simple `case class` that acts as a container for all of the metadata required to read from a datasource.  All discovery, resolution and merging logic for schemas and partitions has been removed.  This an internal representation that no longer needs to be exposed to developers.
    
    ```scala
    case class HadoopFsRelation(
        sqlContext: SQLContext,
        location: FileCatalog,
        partitionSchema: StructType,
        dataSchema: StructType,
        bucketSpec: Option[BucketSpec],
        fileFormat: FileFormat,
        options: Map[String, String]) extends BaseRelation
    ```
    
    ### FileFormat
    The primary interface that will be implemented by each different format including external libraries.  Implementors are responsible for reading a given format and converting it into `InternalRow` as well as writing out an `InternalRow`.  A format can optionally return a schema that is inferred from a set of files.
    
    ```scala
    trait FileFormat {
      def inferSchema(
          sqlContext: SQLContext,
          options: Map[String, String],
          files: Seq[FileStatus]): Option[StructType]
    
      def prepareWrite(
          sqlContext: SQLContext,
          job: Job,
          options: Map[String, String],
          dataSchema: StructType): OutputWriterFactory
    
      def buildInternalScan(
          sqlContext: SQLContext,
          dataSchema: StructType,
          requiredColumns: Array[String],
          filters: Array[Filter],
          bucketSet: Option[BitSet],
          inputFiles: Array[FileStatus],
          broadcastedConf: Broadcast[SerializableConfiguration],
          options: Map[String, String]): RDD[InternalRow]
    }
    ```
    
    The current interface is based on what was required to get all the tests passing again, but still mixes a couple of concerns (i.e. `bucketSet` is passed down to the scan instead of being resolved by the planner).  Additionally, scans are still returning `RDD`s instead of iterators for single files.  In a future PR, bucketing should be removed from this interface and the scan should be isolated to a single file.
    
    ### FileCatalog
    This interface is used to list the files that make up a given relation, as well as handle directory based partitioning.
    
    ```scala
    trait FileCatalog {
      def paths: Seq[Path]
      def partitionSpec(schema: Option[StructType]): PartitionSpec
      def allFiles(): Seq[FileStatus]
      def getStatus(path: Path): Array[FileStatus]
      def refresh(): Unit
    }
    ```
    
    Currently there are two implementations:
     - `HDFSFileCatalog` - based on code from the old `HadoopFsRelation`.  Infers partitioning by recursive listing and caches this data for performance
     - `HiveFileCatalog` - based on the above, but it uses the partition spec from the Hive Metastore.
    
    ### ResolvedDataSource
    Produces a logical plan given the following description of a Data Source (which can come from DataFrameReader or a metastore):
     - `paths: Seq[String] = Nil`
     - `userSpecifiedSchema: Option[StructType] = None`
     - `partitionColumns: Array[String] = Array.empty`
     - `bucketSpec: Option[BucketSpec] = None`
     - `provider: String`
     - `options: Map[String, String]`
    
    This class is responsible for deciding which of the Data Source APIs a given provider is using (including the non-file based ones).  All reconciliation of partitions, buckets, schema from metastores or inference is done here.
    
    ### DataSourceAnalysis / DataSourceStrategy
    Responsible for analyzing and planning reading/writing of data using any of the Data Source APIs, including:
     - pruning the files from partitions that will be read based on filters.
     - appending partition columns*
     - applying additional filters when a data source can not evaluate them internally.
     - constructing an RDD that is bucketed correctly when required*
     - sanity checking schema match-up and other analysis when writing.
    
    *In the future we should do that following:
     - Break out file handling into its own Strategy as its sufficiently complex / isolated.
     - Push the appending of partition columns down in to `FileFormat` to avoid an extra copy / unvectorization.
     - Use a custom RDD for scans instead of `SQLNewNewHadoopRDD2`
    
    Author: Michael Armbrust <michael@databricks.com>
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #11509 from marmbrus/fileDataSource.
    marmbrus authored and rxin committed Mar 7, 2016 (commit e720dda)
  15. [SPARK-13648] Add Hive Cli to classes for isolated classloader

    ## What changes were proposed in this pull request?
    
    Adding the hive-cli classes to the classloader
    
    ## How was this patch tested?
    
    The hive Versionssuite tests were run
    
    This is my original work and I license the work to the project under the project's open source license.
    
    Author: Tim Preece <tim.preece.in.oz@gmail.com>
    
    Closes #11495 from preecet/master.
    preecet authored and marmbrus committed Mar 7, 2016 (commit 46f25c2)

Commits on Mar 8, 2016

  1. [SPARK-13689][SQL] Move helper things in CatalystQl to new utils object

    ## What changes were proposed in this pull request?
    
    When we add more DDL parsing logic in the future, SparkQl will become very big. To keep it smaller, we'll introduce helper "parser objects", e.g. one to parse alter table commands. However, these parser objects will need to access some helper methods that exist in CatalystQl. The proposal is to move those methods to an isolated ParserUtils object.
    
    This is based on viirya's changes in #11048. It prefaces the bigger fix for SPARK-13139 to make the diff of that patch smaller.
    
    ## How was this patch tested?
    
    No change in functionality, so just Jenkins.
    
    Author: Andrew Or <andrew@databricks.com>
    
    Closes #11529 from andrewor14/parser-utils.
    Andrew Or committed Mar 8, 2016 (commit da7bfac)
  2. [SPARK-13404] [SQL] Create variables for input row when it's actually used
    
    ## What changes were proposed in this pull request?
    
    This PR change the way how we generate the code for the output variables passing from a plan to it's parent.
    
    Right now, they are generated before calling the parent's consume(). This is not efficient: if the parent is a Filter or Join that filters out most of the rows, the time spent accessing columns that are not used by the Filter or Join is wasted.

    This PR tries to improve this by deferring the access of columns until they are actually used by a plan. After this PR, a plan does not need to generate code to evaluate its output variables; it just passes the ExprCode to its parent via `consume()`. In `parent.consumeChild()`, the output from the child is checked against `usedInputs`, and code is generated for the columns that are part of `usedInputs` before calling `doConsume()`.
    
    This PR also change the `if` from
    ```
    if (cond) {
      xxx
    }
    ```
    to
    ```
    if (!cond) continue;
    xxx
    ```
    The new one could help to reduce the nested indents for multiple levels of Filter and BroadcastHashJoin.
    
    It also added some comments for operators.
    
    ## How was this patch tested?
    
    Unit tests. Manually ran TPCDS Q55, this PR improve the performance about 30% (scale=10, from 2.56s to 1.96s)
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #11274 from davies/gen_defer.
    Davies Liu authored and davies committed Mar 8, 2016 (commit 25bba58)
  3. [SPARK-13711][CORE] Don't call SparkUncaughtExceptionHandler in AppClient as it's in driver
    
    ## What changes were proposed in this pull request?
    
    AppClient runs on the driver side. It should not call `Utils.tryOrExit` as it will send the exception to SparkUncaughtExceptionHandler and call `System.exit`. This PR just removes `Utils.tryOrExit`.
    
    ## How was this patch tested?
    
    manual tests.
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #11566 from zsxwing/SPARK-13711.
    zsxwing committed Mar 8, 2016 (commit 017cdf2)
  4. [SPARK-13659] Refactor BlockStore put*() APIs to remove returnValues

    In preparation for larger refactoring, this patch removes the confusing `returnValues` option from the BlockStore put() APIs: returning the value is only useful in one place (caching) and in other situations, such as block replication, it's simpler to put() and then get().
    
    As part of this change, I needed to refactor `BlockManager.doPut()`'s block replication code. I also changed `doPut()` to access the memory and disk stores directly rather than calling them through the BlockStore interface; this is in anticipation of a followup patch to remove the BlockStore interface so that the disk store can expose a binary-data-oriented API which is not concerned with Java objects or serialization.
    
    These changes should be covered by the existing storage unit tests. The best way to review this patch is probably to look at the individual commits, all of which are small and have useful descriptions to guide the review.
    
    /cc davies for review.
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #11502 from JoshRosen/remove-returnvalues.
    JoshRosen authored and Andrew Or committed Mar 8, 2016 (commit e52e597)
  5. [HOT-FIX][BUILD] Use the new location of checkstyle-suppressions.xml

    ## What changes were proposed in this pull request?
    
    This PR fixes `dev/lint-java` and `mvn checkstyle:check` failures due to the recent file location change.
    The following is the error message of current master.
    ```
    Checkstyle checks failed at following occurrences:
    [ERROR] Failed to execute goal org.apache.maven.plugins:maven-checkstyle-plugin:2.17:check (default-cli) on project spark-parent_2.11: Failed during checkstyle configuration: cannot initialize module SuppressionFilter - Cannot set property 'file' to 'checkstyle-suppressions.xml' in module SuppressionFilter: InvocationTargetException: Unable to find: checkstyle-suppressions.xml -> [Help 1]
    ```
    
    ## How was this patch tested?
    
    Manual. The following command should run correctly.
    ```
    ./dev/lint-java
    mvn checkstyle:check
    ```
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #11567 from dongjoon-hyun/hotfix_checkstyle_suppression.
    dongjoon-hyun authored and srowen committed Mar 8, 2016
    Configuration menu
    Copy the full SHA
    7771c73 View commit details
    Browse the repository at this point in the history
  6. [SPARK-13117][WEB UI] WebUI should use the local ip not 0.0.0.0

    ## What changes were proposed in this pull request?
    
    In the Web UI, the Jetty server now binds to the SPARK_LOCAL_IP config value if it
    is configured; otherwise it falls back to the default value '0.0.0.0'.
    
    It is continuation as per the closed PR #11133 for the JIRA SPARK-13117 and discussion in SPARK-13117.
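    A rough sketch of the binding rule described above (illustrative only, not the actual Jetty startup code; the helper below is hypothetical):
    
    ```scala
    // Hypothetical sketch: prefer SPARK_LOCAL_IP when set, otherwise bind to all interfaces.
    object BindHost {
      def resolve(): String =
        sys.env.get("SPARK_LOCAL_IP").filter(_.nonEmpty).getOrElse("0.0.0.0")
    }
    
    // The Jetty connector (or any server socket) would then be started with host = BindHost.resolve().
    ```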
    
    ## How was this patch tested?
    
    This has been verified using the command 'netstat -tnlp | grep <PID>' to check which IP/hostname each process binds to, following the steps below.
    
    In the results below, the PID in the command is the corresponding process id.
    
    #### Without the patch changes
    The Web UI (Jetty server) does not take the value configured for SPARK_LOCAL_IP and listens on all interfaces.
    ###### Master
    ```
    [devarajstobdtserver2 sbin]$ netstat -tnlp | grep 3930
    tcp6       0      0 :::8080                 :::*                    LISTEN      3930/java
    ```
    
    ###### Worker
    ```
    [devarajstobdtserver2 sbin]$ netstat -tnlp | grep 4090
    tcp6       0      0 :::8081                 :::*                    LISTEN      4090/java
    ```
    
    ###### History Server Process,
    ```
    [devarajstobdtserver2 sbin]$ netstat -tnlp | grep 2471
    tcp6       0      0 :::18080                :::*                    LISTEN      2471/java
    ```
    ###### Driver
    ```
    [devarajstobdtserver2 spark-master]$ netstat -tnlp | grep 6556
    tcp6       0      0 :::4040                 :::*                    LISTEN      6556/java
    ```
    
    #### With the patch changes
    
    ##### i. With SPARK_LOCAL_IP configured
    If SPARK_LOCAL_IP is configured, the Web UI (Jetty server) of every process binds to the configured value.
    ###### Master
    ```
    [devarajstobdtserver2 sbin]$ netstat -tnlp | grep 1561
    tcp6       0      0 x.x.x.x:8080       :::*                    LISTEN      1561/java
    ```
    ###### Worker
    ```
    [devarajstobdtserver2 sbin]$ netstat -tnlp | grep 2229
    tcp6       0      0 x.x.x.x:8081       :::*                    LISTEN      2229/java
    ```
    ###### History Server
    ```
    [devarajstobdtserver2 sbin]$ netstat -tnlp | grep 3747
    tcp6       0      0 x.x.x.x:18080      :::*                    LISTEN      3747/java
    ```
    ###### Driver
    ```
    [devarajstobdtserver2 spark-master]$ netstat -tnlp | grep 6013
    tcp6       0      0 x.x.x.x:4040       :::*                    LISTEN      6013/java
    ```
    
    ##### ii. Without SPARK_LOCAL_IP configured
    If SPARK_LOCAL_IP is not configured, the Web UI (Jetty server) of every process starts with the default value '0.0.0.0'.
    ###### Master
    ```
    [devarajstobdtserver2 sbin]$ netstat -tnlp | grep 4573
    tcp6       0      0 :::8080                 :::*                    LISTEN      4573/java
    ```
    
    ###### Worker
    ```
    [devarajstobdtserver2 sbin]$ netstat -tnlp | grep 4703
    tcp6       0      0 :::8081                 :::*                    LISTEN      4703/java
    ```
    
    ###### History Server
    ```
    [devarajstobdtserver2 sbin]$ netstat -tnlp | grep 4846
    tcp6       0      0 :::18080                :::*                    LISTEN      4846/java
    ```
    
    ###### Driver
    ```
    [devarajstobdtserver2 sbin]$ netstat -tnlp | grep 5437
    tcp6       0      0 :::4040                 :::*                    LISTEN      5437/java
    ```
    
    Author: Devaraj K <devaraj@apache.org>
    
    Closes #11490 from devaraj-kavali/SPARK-13117-v1.
    Devaraj K authored and srowen committed Mar 8, 2016
    Configuration menu
    Copy the full SHA
    9bf76dd View commit details
    Browse the repository at this point in the history
  7. [SPARK-13675][UI] Fix wrong historyserver url link for application ru…

    …nning in yarn cluster mode
    
    ## What changes were proposed in this pull request?
    
    Currently, the URL each application uses to access the history UI looks like:
    http://localhost:18080/history/application_1457058760338_0016/1/jobs/ or http://localhost:18080/history/application_1457058760338_0016/2/jobs/
    
    Here **1** or **2** represents the attempt count in `historypage.js`, but `HistoryServer` parses it as an attempt id, while the correct attempt id should look like "appattempt_1457058760338_0016_000002", so HistoryServer fails to parse a correct attempt id from it.
    
    This is OK in yarn client mode, since we don't need this attempt id to fetch the app cache, but it fails in yarn cluster mode, where attempt id "1" or "2" is actually wrong.
    
    So we should fix this URL so that the correct application id and attempt id are parsed. Also, the suffix "jobs/" is not needed.
    
    Here is the screenshot:
    
    ![screen shot 2016-02-29 at 3 57 32 pm](https://cloud.githubusercontent.com/assets/850797/13524377/d4b44348-e235-11e5-8b3e-bc06de306e87.png)
    
    ## How was this patch tested?
    
    This patch is tested manually, with different master and deploy mode.
    
    ![image](https://cloud.githubusercontent.com/assets/850797/13524419/118be5a0-e236-11e5-8022-3ff613ccde46.png)
    
    Author: jerryshao <sshao@hortonworks.com>
    
    Closes #11518 from jerryshao/SPARK-13675.
    jerryshao authored and Tom Graves committed Mar 8, 2016
    Configuration menu
    Copy the full SHA
    9e86e6e View commit details
    Browse the repository at this point in the history
  8. [SPARK-13637][SQL] use more information to simplify the code in Expan…

    …d builder
    
    ## What changes were proposed in this pull request?
    
    The code in `Expand.apply` can be simplified using existing information:
    
    * the `groupByExprs` parameter contains only `Attribute`s
    * the `child` parameter is a `Project` that appends aliased group-by expressions to its child's output
    
    ## How was this patch tested?
    
    by existing tests.
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #11485 from cloud-fan/expand.
    cloud-fan committed Mar 8, 2016
    Configuration menu
    Copy the full SHA
    7d05d02 View commit details
    Browse the repository at this point in the history
  9. [HOTFIX][YARN] Fix yarn cluster mode fire and forget regression

    ## What changes were proposed in this pull request?
    
    Fire-and-forget was disabled by default; with patch #10205 it became enabled by default, so this is a regression that should be fixed.
    
    ## How was this patch tested?
    
    Manually verified this change.
    
    Author: jerryshao <sshao@hortonworks.com>
    
    Closes #11577 from jerryshao/hot-fix-yarn-cluster.
    jerryshao authored and Marcelo Vanzin committed Mar 8, 2016
    Configuration menu
    Copy the full SHA
    ca1a7b9 View commit details
    Browse the repository at this point in the history
  10. [SPARK-13715][MLLIB] Remove last usages of jblas in tests

    ## What changes were proposed in this pull request?
    
    Remove last usage of jblas, in tests
    
    ## How was this patch tested?
    
    Jenkins tests -- the same ones that are being modified.
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes #11560 from srowen/SPARK-13715.
    srowen committed Mar 8, 2016
    Configuration menu
    Copy the full SHA
    54040f8 View commit details
    Browse the repository at this point in the history
  11. [SPARK-13657] [SQL] Support parsing very long AND/OR expressions

    ## What changes were proposed in this pull request?
    
    In order to avoid a StackOverflow when parsing an expression with hundreds of ORs, we should use a loop instead of recursive functions to flatten the tree into a list. This PR also builds a balanced tree to reduce the depth of the generated And/Or expressions, to avoid StackOverflow in the analyzer/optimizer.
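    A minimal sketch of the balanced-tree idea, using a simplified expression model rather than Catalyst's actual classes:
    
    ```scala
    // Simplified expression model (illustrative only, not Catalyst).
    sealed trait Expr
    case class Pred(name: String) extends Expr
    case class Or(left: Expr, right: Expr) extends Expr
    
    // Build a balanced Or tree so the depth is O(log n) instead of O(n),
    // which avoids deep recursion (and StackOverflow) in later phases.
    def balancedOr(preds: IndexedSeq[Expr]): Expr =
      if (preds.length == 1) preds.head
      else {
        val (left, right) = preds.splitAt(preds.length / 2)
        Or(balancedOr(left), balancedOr(right))
      }
    
    // 500 predicates yield a tree of depth ~9 rather than 500.
    val tree = balancedOr((1 to 500).map(i => Pred(s"p$i")))
    ```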
    
    ## How was this patch tested?
    
    Added new unit tests. Manually tested with TPCDS Q3, which has hundreds of predicates in it [1]. These predicates help reduce the number of partitions, so the query time went from 60 seconds to 8 seconds.
    
    [1] https://github.com/cloudera/impala-tpcds-kit/blob/master/queries/q3.sql
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #11501 from davies/long_or.
    Davies Liu authored and davies committed Mar 8, 2016
    Configuration menu
    Copy the full SHA
    78d3b60 View commit details
    Browse the repository at this point in the history
  12. [SPARK-13695] Don't cache MEMORY_AND_DISK blocks as bytes in memory a…

    …fter spills
    
    When a cached block is spilled to disk and read back in serialized form (i.e. as bytes), the current BlockManager implementation will attempt to re-insert the serialized block into the MemoryStore even if the block's storage level requests deserialized caching.
    
    This behavior adds some complexity to the MemoryStore but I don't think it offers many performance benefits and I'd like to remove it in order to simplify a larger refactoring patch. Therefore, this patch changes the behavior so that disk store reads will only cache bytes in the memory store for blocks with serialized storage levels.
    
    There are two places where we request serialized bytes from the BlockStore:
    
    1. getLocalBytes(), which is only called when reading local copies of TorrentBroadcast pieces. Broadcast pieces are always cached using a serialized storage level, so this won't lead to a mismatch in serialization forms if spilled bytes read from disk are cached as bytes in the memory store.
    2. the non-shuffle-block branch in getBlockData(), which is only called by the NettyBlockRpcServer when responding to requests to read remote blocks. Caching the serialized bytes in memory only benefits us if those cached bytes are read before they are evicted, and that seems unlikely since remote reads of non-broadcast cached blocks appear to be very rare. Caching these bytes when they have a low probability of being read is bad if it risks evicting blocks that are cached in their expected serialized/deserialized forms, since those blocks seem more likely to be read in local computation.
    
    Given the argument above, I think this change is unlikely to cause performance regressions.
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #11533 from JoshRosen/remove-memorystore-level-mismatch.
    JoshRosen committed Mar 8, 2016
    Configuration menu
    Copy the full SHA
    ad3c9a9 View commit details
    Browse the repository at this point in the history
  13. [SPARK-12727][SQL] support SQL generation for aggregate with multi-di…

    …stinct
    
    ## What changes were proposed in this pull request?
    
    This PR adds SQL generation support for aggregates with multiple DISTINCTs, by simply moving the `DistinctAggregationRewriter` rule to the optimizer.
    
    More discussion is needed, as this breaks an important contract: an analyzed plan should be able to run without optimization. However, the `ComputeCurrentTime` rule has already broken it to some extent, and I think we should maybe add a new phase for this kind of rule, because strictly speaking such rules don't belong to analysis and are coupled with the physical plan implementation.
    
    ## How was this patch tested?
    
    existing tests
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #11579 from cloud-fan/distinct.
    cloud-fan authored and rxin committed Mar 8, 2016
    Configuration menu
    Copy the full SHA
    46881b4 View commit details
    Browse the repository at this point in the history
  14. [ML] testEstimatorAndModelReadWrite should call checkModelData

    ## What changes were proposed in this pull request?
    Although we defined ```checkModelData``` in the [```read/write``` test](https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala#L994) of ML estimators/models and pass it to ```testEstimatorAndModelReadWrite```, ```testEstimatorAndModelReadWrite``` never actually calls ```checkModelData``` to check the equality of model data. So we currently do not run the model-data equality check for any test case; we should fix it.
    This also fixes a bug in the LDA read/write test, which did not set ```docConcentration```. That bug should have made the test fail, but it went unnoticed because ```checkModelData``` was never actually run.
    cc jkbradley mengxr
    ## How was this patch tested?
    No new unit tests; the existing ones should pass.
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes #11513 from yanboliang/ml-check-model-data.
    yanboliang authored and jkbradley committed Mar 8, 2016
    Configuration menu
    Copy the full SHA
    9740954 View commit details
    Browse the repository at this point in the history
  15. [SPARK-13740][SQL] add null check for _verify_type in types.py

    ## What changes were proposed in this pull request?
    
    This PR adds null check in `_verify_type` according to the nullability information.
    
    ## How was this patch tested?
    
    new doc tests
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #11574 from cloud-fan/py-null-check.
    cloud-fan authored and yhuai committed Mar 8, 2016
    Configuration menu
    Copy the full SHA
    d5ce617 View commit details
    Browse the repository at this point in the history
  16. [SPARK-13593] [SQL] improve the createDataFrame to accept data type…

    … string and verify the data
    
    ## What changes were proposed in this pull request?
    
    This PR improves the `createDataFrame` method to also accept a data type string, so users can convert a Python RDD to a DataFrame easily, for example `df = rdd.toDF("a: int, b: string")`.
    It also supports a flat schema, so users can convert an RDD of ints to a DataFrame directly; we automatically wrap each int into a row for users.
    If a schema is given, we now check whether the real data matches it and throw an error if it doesn't.
    
    ## How was this patch tested?
    
    new tests in `test.py` and doc test in `types.py`
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #11444 from cloud-fan/pyrdd.
    cloud-fan authored and davies committed Mar 8, 2016
    Configuration menu
    Copy the full SHA
    d57daf1 View commit details
    Browse the repository at this point in the history
  17. [SPARK-13400] Stop using deprecated Octal escape literals

    ## What changes were proposed in this pull request?
    
    This removes the remaining deprecated octal escape literals. The following are the warnings on those two lines.
    ```
    LiteralExpressionSuite.scala:99: Octal escape literals are deprecated, use \u0000 instead.
    HiveQlSuite.scala:74: Octal escape literals are deprecated, use \u002c instead.
    ```
    
    ## How was this patch tested?
    
    Manual.
    During the build, there should be no warnings about `Octal escape literals`.
    ```
    mvn -DskipTests clean install
    ```
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #11584 from dongjoon-hyun/SPARK-13400.
    dongjoon-hyun authored and rxin committed Mar 8, 2016
    Configuration menu
    Copy the full SHA
    076009b View commit details
    Browse the repository at this point in the history
  18. [SPARK-13738][SQL] Cleanup Data Source resolution

    Follow-up to #11509 that simply refactors the interface we use when resolving a pluggable `DataSource`.
     - Multiple functions share the same set of arguments so we make this a case class, called `DataSource`.  Actual resolution is now done by calling a function on this class.
     - Instead of having multiple methods named `apply` (some of which do writing, some of which do reading) we now explicitly have `resolveRelation()` and `write(mode, df)`.
     - Get rid of `Array[String]` since this is an internal API and was forcing us to awkwardly call `toArray` in a bunch of places.
    
    Author: Michael Armbrust <michael@databricks.com>
    
    Closes #11572 from marmbrus/dataSourceResolution.
    marmbrus authored and rxin committed Mar 8, 2016
    Configuration menu
    Copy the full SHA
    1e28840 View commit details
    Browse the repository at this point in the history
  19. [SPARK-13668][SQL] Reorder filter/join predicates to short-circuit is…

    …NotNull checks
    
    ## What changes were proposed in this pull request?
    
    If a filter predicate or a join condition consists of `IsNotNull` checks, we should reorder these checks such that these non-nullability checks are evaluated before the rest of the predicates.
    
    For example, if a filter predicate is of the form `a > 5 && isNotNull(b)`, we should rewrite it as `isNotNull(b) && a > 5` during physical plan generation.
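    A minimal sketch of the reordering, using a simplified predicate model (not Catalyst's actual expressions):
    
    ```scala
    // Simplified predicate model, for illustration only.
    sealed trait Predicate
    case class IsNotNull(col: String) extends Predicate
    case class GreaterThan(col: String, value: Int) extends Predicate
    
    // Move cheap IsNotNull checks to the front of a conjunction so they
    // short-circuit before the remaining (potentially costlier) predicates.
    def reorderConjuncts(conjuncts: Seq[Predicate]): Seq[Predicate] = {
      val (notNullChecks, others) = conjuncts.partition(_.isInstanceOf[IsNotNull])
      notNullChecks ++ others
    }
    
    // reorderConjuncts(Seq(GreaterThan("a", 5), IsNotNull("b")))
    // => Seq(IsNotNull("b"), GreaterThan("a", 5))
    ```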
    
    ## How was this patch tested?
    
    new unit tests that verify the physical plan for both filters and joins in `ReorderedPredicateSuite`
    
    Author: Sameer Agarwal <sameer@databricks.com>
    
    Closes #11511 from sameeragarwal/reorder-isnotnull.
    sameeragarwal authored and yhuai committed Mar 8, 2016
    Configuration menu
    Copy the full SHA
    e430614 View commit details
    Browse the repository at this point in the history

Commits on Mar 9, 2016

  1. [SPARK-13755] Escape quotes in SQL plan visualization node labels

    When generating Graphviz DOT files in the SQL query visualization, we need to escape double quotes inside node labels. This is a follow-up to #11309, which fixed a similar issue in Spark Core's DAG visualization.
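    A minimal sketch of the kind of escaping involved (illustrative only, not the actual visualization code):
    
    ```scala
    // Escape embedded double quotes so the label stays a valid DOT string literal.
    def escapeDotLabel(label: String): String = label.replace("\"", "\\\"")
    
    val node = s"""n1 [label="${escapeDotLabel("a = \"x\"")}"];"""
    // => n1 [label="a = \"x\""];
    ```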
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #11587 from JoshRosen/graphviz-escaping.
    JoshRosen committed Mar 9, 2016
    Configuration menu
    Copy the full SHA
    81f54ac View commit details
    Browse the repository at this point in the history
  2. [SPARK-13625][PYSPARK][ML] Added a check to see if an attribute is a …

    …property when getting param list
    
    ## What changes were proposed in this pull request?
    
    Added a check in pyspark.ml.param.Param.params() to see if an attribute is a property (decorated with `property`) before checking if it is a `Param` instance.  This prevents the property from being invoked to 'get' this attribute, which could possibly cause an error.
    
    ## How was this patch tested?
    
    Added a test case with a class that has a property which raises an error when invoked, and then called `Param.params` to verify that the property is not invoked while another `Param` in the class is still found.  Also ran the pyspark-ml tests before the fix (which triggered the error) and again after the fix to verify that the error was resolved and the method works properly.
    
    Author: Bryan Cutler <cutlerb@gmail.com>
    
    Closes #11476 from BryanCutler/pyspark-ml-property-attr-SPARK-13625.
    BryanCutler authored and jkbradley committed Mar 9, 2016
    Configuration menu
    Copy the full SHA
    d8813fa View commit details
    Browse the repository at this point in the history
  3. [SPARK-13750][SQL] fix sizeInBytes of HadoopFsRelation

    ## What changes were proposed in this pull request?
    
    This PR fixes the sizeInBytes of HadoopFsRelation.
    
    ## How was this patch tested?
    
    Added regression test for that.
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #11590 from davies/fix_sizeInBytes.
    Davies Liu authored and marmbrus committed Mar 9, 2016
    Configuration menu
    Copy the full SHA
    982ef2b View commit details
    Browse the repository at this point in the history
  4. [SPARK-13754] Keep old data source name for backwards compatibility

    ## Motivation
    The CSV data source was contributed by Databricks. It is the inlined version of https://github.com/databricks/spark-csv. The data source name was `com.databricks.spark.csv`. As a result, there are many tables created on older versions of Spark with that name as the source. For backwards compatibility we should keep the old name.
    
    ## Proposed changes
    `com.databricks.spark.csv` was added to the `backwardCompatibilityMap` in `ResolvedDataSource.scala`.
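    A minimal sketch of the lookup described above; the mapped target name below is an assumption for illustration, not necessarily the exact internal class name:
    
    ```scala
    // Old external package names are translated to the built-in data source.
    val backwardCompatibilityMap: Map[String, String] =
      Map("com.databricks.spark.csv" -> "csv")  // assumed target name, for illustration
    
    def lookupDataSource(provider: String): String =
      backwardCompatibilityMap.getOrElse(provider, provider)
    
    // lookupDataSource("com.databricks.spark.csv") => "csv"
    // lookupDataSource("parquet")                  => "parquet"
    ```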
    
    ## Tests
    A unit test was added to `CSVSuite` to parse a csv file using the old name.
    
    Author: Hossein <hossein@databricks.com>
    
    Closes #11589 from falaki/SPARK-13754.
    falaki authored and marmbrus committed Mar 9, 2016
    Configuration menu
    Copy the full SHA
    cc4ab37 View commit details
    Browse the repository at this point in the history
  5. [SPARK-7286][SQL] Deprecate !== in favour of =!=

    This PR replaces #9925 which had issues with CI. **Please see the original PR for any previous discussions.**
    
    ## What changes were proposed in this pull request?
    Deprecate the SparkSQL column operator !== and use =!= as an alternative.
    Fixes subtle issues related to operator precedence (basically, !== does not have the same priority as its logical negation, ===).
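    A minimal, self-contained sketch of the precedence pitfall using a toy class (not Spark's Column API): an operator that ends in `=` but does not start with `=` (like `!==`) is parsed as an assignment-style operator with the lowest precedence, while `===` and `=!=` share the same, higher precedence.
    
    ```scala
    // Toy boolean wrapper to show how the operators parse; illustrative only.
    case class B(v: Boolean) {
      def ===(o: B): B = B(v == o.v)
      def !==(o: B): B = B(v != o.v)   // old spelling, assignment-operator precedence
      def =!=(o: B): B = B(v != o.v)   // replacement spelling, same precedence as ===
      def &&(o: B): B = B(v && o.v)
    }
    
    val a = B(true); val b = B(false); val c = B(true)
    
    val good = a =!= b && c    // parses as (a =!= b) && c  => B(true)
    val bad  = a !== (b && c)  // `a !== b && c` would parse like this, usually not what was meant
    ```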
    
    ## How was this patch tested?
    All currently existing tests.
    
    Author: Jakob Odersky <jodersky@gmail.com>
    
    Closes #11588 from jodersky/SPARK-7286.
    jodersky authored and rxin committed Mar 9, 2016
    Configuration menu
    Copy the full SHA
    035d3ac View commit details
    Browse the repository at this point in the history
  6. [SPARK-13692][CORE][SQL] Fix trivial Coverity/Checkstyle defects

    ## What changes were proposed in this pull request?
    
    This issue fixes the following potential bugs and Java coding style issues detected by Coverity and Checkstyle.
    
    - Implement both null and type checking in equals functions.
    - Fix wrong type casting logic in SimpleJavaBean2.equals.
    - Add `implement Cloneable` to `UTF8String` and `SortedIterator`.
    - Remove dereferencing before null check in `AbstractBytesToBytesMapSuite`.
    - Fix coding style: Add '{}' to single `for` statement in mllib examples.
    - Remove unused imports in `ColumnarBatch` and `JavaKinesisStreamSuite`.
    - Remove unused fields in `ChunkFetchIntegrationSuite`.
    - Add `stop()` to prevent resource leak.
    
    Please note that the last two checkstyle errors exist on newly added commits after [SPARK-13583](https://issues.apache.org/jira/browse/SPARK-13583).
    
    ## How was this patch tested?
    
    manual via `./dev/lint-java` and Coverity site.
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #11530 from dongjoon-hyun/SPARK-13692.
    dongjoon-hyun authored and srowen committed Mar 9, 2016
    Configuration menu
    Copy the full SHA
    f3201ae View commit details
    Browse the repository at this point in the history
  7. [SPARK-13640][SQL] Synchronize ScalaReflection.mirror method.

    ## What changes were proposed in this pull request?
    
    The `ScalaReflection.mirror` method should be synchronized when the Scala version is `2.10`, because `universe.runtimeMirror` is not thread-safe.
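    A minimal sketch of the synchronization, assuming the Scala 2.10 behavior described above (not the exact Spark code):
    
    ```scala
    import scala.reflect.runtime.{universe => ru}
    
    object ScalaReflectionSketch {
      // Guard mirror creation with a lock because universe.runtimeMirror is not
      // thread-safe on Scala 2.10; concurrent first calls can corrupt internal state.
      def mirror: ru.Mirror = this.synchronized {
        ru.runtimeMirror(Thread.currentThread().getContextClassLoader)
      }
    }
    ```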
    
    ## How was this patch tested?
    
    I added a test to check the thread safety of the `ScalaReflection.mirror` method in `ScalaReflectionSuite`, which will throw the following exception in Scala `2.10` without this patch:
    
    ```
    [info] - thread safety of mirror *** FAILED *** (49 milliseconds)
    [info]   java.lang.UnsupportedOperationException: tail of empty list
    [info]   at scala.collection.immutable.Nil$.tail(List.scala:339)
    [info]   at scala.collection.immutable.Nil$.tail(List.scala:334)
    [info]   at scala.reflect.internal.SymbolTable.popPhase(SymbolTable.scala:172)
    [info]   at scala.reflect.internal.Symbols$Symbol.unsafeTypeParams(Symbols.scala:1477)
    [info]   at scala.reflect.internal.Symbols$TypeSymbol.tpe(Symbols.scala:2777)
    [info]   at scala.reflect.internal.Mirrors$RootsBase.init(Mirrors.scala:235)
    [info]   at scala.reflect.runtime.JavaMirrors$class.createMirror(JavaMirrors.scala:34)
    [info]   at scala.reflect.runtime.JavaMirrors$class.runtimeMirror(JavaMirrors.scala:61)
    [info]   at scala.reflect.runtime.JavaUniverse.runtimeMirror(JavaUniverse.scala:12)
    [info]   at scala.reflect.runtime.JavaUniverse.runtimeMirror(JavaUniverse.scala:12)
    [info]   at org.apache.spark.sql.catalyst.ScalaReflection$.mirror(ScalaReflection.scala:36)
    [info]   at org.apache.spark.sql.catalyst.ScalaReflectionSuite$$anonfun$12$$anonfun$apply$mcV$sp$1$$anonfun$apply$1$$anonfun$apply$2.apply(ScalaReflectionSuite.scala:256)
    [info]   at org.apache.spark.sql.catalyst.ScalaReflectionSuite$$anonfun$12$$anonfun$apply$mcV$sp$1$$anonfun$apply$1$$anonfun$apply$2.apply(ScalaReflectionSuite.scala:252)
    [info]   at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
    [info]   at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
    [info]   at scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107)
    [info]   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    [info]   at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    [info]   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    [info]   at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
    ```
    
    Notice that the test will pass when Scala version is `2.11`.
    
    Author: Takuya UESHIN <ueshin@happy-camper.st>
    
    Closes #11487 from ueshin/issues/SPARK-13640.
    ueshin authored and srowen committed Mar 9, 2016
    Configuration menu
    Copy the full SHA
    2c5af7d View commit details
    Browse the repository at this point in the history
  8. [SPARK-13631][CORE] Thread-safe getLocationsWithLargestOutputs

    ## What changes were proposed in this pull request?
    
    If a job is being scheduled in one thread which has a dependency on an
    RDD currently executing a shuffle in another thread, Spark would throw a
    NullPointerException. This patch synchronizes access to `mapStatuses` and
    skips null status entries (which are in-progress shuffle tasks).
    
    ## How was this patch tested?
    
    Our client code unit test suite, which was reliably reproducing the race
    condition with 10 threads, shows that this fixes it. I have not found a minimal
    test case to add to Spark, but I will attempt to do so if desired.
    
    The same test case was tripping up on SPARK-4454, which was fixed by
    making other DAGScheduler code thread-safe.
    
    shivaram srowen
    
    Author: Andy Sloane <asloane@tetrationanalytics.com>
    
    Closes #11505 from a1k0n/SPARK-13631.
    Andy Sloane authored and srowen committed Mar 9, 2016
    Configuration menu
    Copy the full SHA
    cbff280 View commit details
    Browse the repository at this point in the history
  9. [SPARK-13702][CORE][SQL][MLLIB] Use diamond operator for generic inst…

    …ance creation in Java code.
    
    ## What changes were proposed in this pull request?
    
    In order to make `docs/examples` (and other related code) simpler, more readable and more user-friendly, this PR replaces existing code like the following by using the `diamond` operator.
    
    ```
    -    final ArrayList<Product2<Object, Object>> dataToWrite =
    -      new ArrayList<Product2<Object, Object>>();
    +    final ArrayList<Product2<Object, Object>> dataToWrite = new ArrayList<>();
    ```
    
    Java 7 and higher support the **diamond** operator, which replaces the type arguments required to invoke the constructor of a generic class with an empty set of type parameters (<>). Currently, Spark's Java code uses this inconsistently.
    
    ## How was this patch tested?
    
    Manual.
    Pass the existing tests.
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #11541 from dongjoon-hyun/SPARK-13702.
    dongjoon-hyun authored and srowen committed Mar 9, 2016
    Configuration menu
    Copy the full SHA
    c3689bc View commit details
    Browse the repository at this point in the history
  10. [SPARK-13769][CORE] Update Java Doc in Spark Submit

    JIRA : https://issues.apache.org/jira/browse/SPARK-13769
    
    The java doc here (https://github.com/apache/spark/blob/e97fc7f176f8bf501c9b3afd8410014e3b0e1602/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L51)
    needs to be updated from "The latter two operations are currently supported only for standalone cluster mode." to "The latter two operations are currently supported only for standalone and mesos cluster modes."
    
    Author: Ahmed Kamal <ahmed.kamal@badrit.com>
    
    Closes #11600 from AhmedKamal/SPARK-13769.
    Ahmed Kamal authored and srowen committed Mar 9, 2016
    Configuration menu
    Copy the full SHA
    8e8633e View commit details
    Browse the repository at this point in the history
  11. [SPARK-13698][SQL] Fix Analysis Exceptions when Using Backticks in Ge…

    …nerate
    
    ## What changes were proposed in this pull request?
    An analysis exception occurs while running the following query.
    ```
    SELECT ints FROM nestedArray LATERAL VIEW explode(a.b) `a` AS `ints`
    ```
    ```
    Failed to analyze query: org.apache.spark.sql.AnalysisException: cannot resolve '`ints`' given input columns: [a, `ints`]; line 1 pos 7
    'Project ['ints]
    +- Generate explode(a#0.b), true, false, Some(a), [`ints`#8]
       +- SubqueryAlias nestedarray
          +- LocalRelation [a#0], [[[[1,2,3]]]]
    ```
    
    ## How was this patch tested?
    
    Added new unit tests in SQLQuerySuite and HiveQlSuite
    
    Author: Dilip Biswal <dbiswal@us.ibm.com>
    
    Closes #11538 from dilipbiswal/SPARK-13698.
    dilipbiswal authored and cloud-fan committed Mar 9, 2016
    Configuration menu
    Copy the full SHA
    53ba6d6 View commit details
    Browse the repository at this point in the history
  12. [SPARK-13242] [SQL] codegen fallback in case-when if there many branches

    ## What changes were proposed in this pull request?
    
    If there are many branches in a CaseWhen expression, the generated code could exceed the 64KB limit for a single Java method and fail to compile. This PR changes it to fall back to interpreted mode if there are more than 20 branches.
    
    This PR is based on #11243 and #11221, thanks to joehalliwell
    
    Closes #11243
    Closes #11221
    
    ## How was this patch tested?
    
    Add a test with 50 branches.
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #11592 from davies/fix_when.
    Davies Liu authored and davies committed Mar 9, 2016
    Configuration menu
    Copy the full SHA
    9634e17 View commit details
    Browse the repository at this point in the history
  13. Revert "[SPARK-13668][SQL] Reorder filter/join predicates to short-ci…

    …rcuit isNotNull checks"
    
    This reverts commit e430614.
    davies committed Mar 9, 2016
    Configuration menu
    Copy the full SHA
    7791d0c View commit details
    Browse the repository at this point in the history
  14. [SPARK-13595][BUILD] Move docker, extras modules into external

    ## What changes were proposed in this pull request?
    
    Move `docker` dirs out of top level into `external/`; move `extras/*` into `external/`
    
    ## How was this patch tested?
    
    This is tested with Jenkins tests.
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes #11523 from srowen/SPARK-13595.
    srowen committed Mar 9, 2016
    Configuration menu
    Copy the full SHA
    256704c View commit details
    Browse the repository at this point in the history
  15. [SPARK-13763][SQL] Remove Project when its Child's Output is Nil

    #### What changes were proposed in this pull request?
    
    As shown in another PR (#11596), we use `SELECT 1` as a dummy table when a SQL statement requires a table reference but the contents of the table are not important. For example,
    
    ```SQL
    SELECT value FROM (select 1) dummyTable Lateral View explode(array(1,2,3)) adTable as value
    ```
    Before this PR, the optimized plan contained a useless `Project` after the optimizer executed the `ColumnPruning` rule, as shown below:
    
    ```
    == Analyzed Logical Plan ==
    value: int
    Project [value#22]
    +- Generate explode(array(1, 2, 3)), true, false, Some(adtable), [value#22]
       +- SubqueryAlias dummyTable
          +- Project [1 AS 1#21]
             +- OneRowRelation$
    
    == Optimized Logical Plan ==
    Generate explode([1,2,3]), false, false, Some(adtable), [value#22]
    +- Project
       +- OneRowRelation$
    ```
    
    After the fix, the optimized plan removed the useless `Project`, as shown below:
    ```
    == Optimized Logical Plan ==
    Generate explode([1,2,3]), false, false, Some(adtable), [value#22]
    +- OneRowRelation$
    ```
    
    This PR removes a `Project` when its child's output is Nil.
    
    #### How was this patch tested?
    
    Added a new unit test case into the suite `ColumnPruningSuite.scala`
    
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes #11599 from gatorsmile/projectOneRowRelation.
    gatorsmile authored and marmbrus committed Mar 9, 2016
    Configuration menu
    Copy the full SHA
    23369c3 View commit details
    Browse the repository at this point in the history
  16. [SPARK-13728][SQL] Fix ORC PPD test so that pushed filters can be che…

    …cked.
    
    ## What changes were proposed in this pull request?
    https://issues.apache.org/jira/browse/SPARK-13728
    
    #11509 made the test output only a single ORC file.
    It used to be 10 files, but after that change only a single file is written, so ORC could not skip stripes based on the pushed-down filters.
    This PR simply repartitions the data into 10 partitions so that the test passes.
    ## How was this patch tested?
    
    Unit tests, and `./dev/run_tests` for the code style checks.
    
    Author: hyukjinkwon <gurwls223@gmail.com>
    
    Closes #11593 from HyukjinKwon/SPARK-13728.
    HyukjinKwon authored and marmbrus committed Mar 9, 2016
    Configuration menu
    Copy the full SHA
    cad29a4 View commit details
    Browse the repository at this point in the history
  17. [SPARK-13615][ML] GeneralizedLinearRegression supports save/load

    ## What changes were proposed in this pull request?
    ```GeneralizedLinearRegression``` supports ```save/load```.
    cc mengxr
    ## How was this patch tested?
    unit test.
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes #11465 from yanboliang/spark-13615.
    yanboliang authored and jkbradley committed Mar 9, 2016
    Configuration menu
    Copy the full SHA
    0dd0648 View commit details
    Browse the repository at this point in the history
  18. [SPARK-13523] [SQL] Reuse exchanges in a query

    ## What changes were proposed in this pull request?
    
    It’s possible for a query to have common parts, for example in a self join; it would be good to avoid recomputing the duplicated part, to save CPU and memory (broadcast or cache).
    
    Exchange materializes the underlying RDD by shuffle or collect, so it's a good point at which to check for duplicates and reuse them. Duplicated exchanges mean they generate exactly the same result inside a query.
    
    In order to find the duplicated exchanges, we should be able to compare SparkPlans to check whether they produce the same results. We already have that for LogicalPlan, so we should move it into QueryPlan to make it available for SparkPlan.
    
    Once we find the duplicated exchanges, we replace all of them with the same SparkPlan object (possibly wrapped by ReusedExchange for explain), and then the plan tree becomes a DAG. Since all the planners only work with trees, this rule should be the last one in the entire planning.
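    A minimal sketch of the dedup idea, keyed on a canonical description of the result each exchange produces (a simplified model, not Spark's actual rule):
    
    ```scala
    // Simplified model: an exchange plus a key describing the result it produces.
    case class ExchangeNode(id: Int, resultKey: String)
    
    // Keep the first exchange seen for each result key and reuse it for later
    // duplicates, so equivalent subtrees are materialized only once.
    def reuseExchanges(exchanges: Seq[ExchangeNode]): Seq[ExchangeNode] = {
      val seen = scala.collection.mutable.HashMap.empty[String, ExchangeNode]
      exchanges.map(e => seen.getOrElseUpdate(e.resultKey, e))
    }
    
    // reuseExchanges(Seq(ExchangeNode(1, "hash(id)"), ExchangeNode(2, "hash(id)")))
    // => Seq(ExchangeNode(1, "hash(id)"), ExchangeNode(1, "hash(id)"))
    ```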
    
    After the rule, the plan looks like:
    
    ```
    WholeStageCodegen
    :  +- Project [id#0L]
    :     +- BroadcastHashJoin [id#0L], [id#2L], Inner, BuildRight, None
    :        :- Project [id#0L]
    :        :  +- BroadcastHashJoin [id#0L], [id#1L], Inner, BuildRight, None
    :        :     :- Range 0, 1, 4, 1024, [id#0L]
    :        :     +- INPUT
    :        +- INPUT
    :- BroadcastExchange HashedRelationBroadcastMode(true,List(id#1L),List(id#1L))
    :  +- WholeStageCodegen
    :     :  +- Range 0, 1, 4, 1024, [id#1L]
    +- ReusedExchange [id#2L], BroadcastExchange HashedRelationBroadcastMode(true,List(id#1L),List(id#1L))
    ```
    
    ![bjoin](https://cloud.githubusercontent.com/assets/40902/13414787/209e8c5c-df0a-11e5-8a0f-edff69d89e83.png)
    
    For three ways SortMergeJoin,
    ```
    == Physical Plan ==
    WholeStageCodegen
    :  +- Project [id#0L]
    :     +- SortMergeJoin [id#0L], [id#4L], None
    :        :- INPUT
    :        +- INPUT
    :- WholeStageCodegen
    :  :  +- Project [id#0L]
    :  :     +- SortMergeJoin [id#0L], [id#3L], None
    :  :        :- INPUT
    :  :        +- INPUT
    :  :- WholeStageCodegen
    :  :  :  +- Sort [id#0L ASC], false, 0
    :  :  :     +- INPUT
    :  :  +- Exchange hashpartitioning(id#0L, 200), None
    :  :     +- WholeStageCodegen
    :  :        :  +- Range 0, 1, 4, 33554432, [id#0L]
    :  +- WholeStageCodegen
    :     :  +- Sort [id#3L ASC], false, 0
    :     :     +- INPUT
    :     +- ReusedExchange [id#3L], Exchange hashpartitioning(id#0L, 200), None
    +- WholeStageCodegen
       :  +- Sort [id#4L ASC], false, 0
       :     +- INPUT
       +- ReusedExchange [id#4L], Exchange hashpartitioning(id#0L, 200), None
    ```
    ![sjoin](https://cloud.githubusercontent.com/assets/40902/13414790/27aea61c-df0a-11e5-8cbf-fbc985c31d95.png)
    
    If the same ShuffleExchange or BroadcastExchange has execute()/executeBroadcast() called by different parents, it should cache the RDD/Broadcast and return the same one to all the parents.
    
    ## How was this patch tested?
    
    Added some unit tests for this. Also did some manual tests on TPCDS queries Q59 and Q64, where we can see some exchanges being re-used (this requires a change in PhysicalRDD for sameResult, done in #11514).
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #11403 from davies/dedup.
    Davies Liu authored and davies committed Mar 9, 2016
    Configuration menu
    Copy the full SHA
    3dc9ae2 View commit details
    Browse the repository at this point in the history
  19. [SPARK-13527][SQL] Prune Filters based on Constraints

    #### What changes were proposed in this pull request?
    
    Remove all the deterministic conditions in a [[Filter]] that are already contained in the child's constraints.
    
    For example, the first query can be simplified to the second one.
    
    ```scala
        val queryWithUselessFilter = tr1
          .where("tr1.a".attr > 10 || "tr1.c".attr < 10)
          .join(tr2.where('d.attr < 100), Inner, Some("tr1.a".attr === "tr2.a".attr))
          .where(
            ("tr1.a".attr > 10 || "tr1.c".attr < 10) &&
            'd.attr < 100 &&
            "tr2.a".attr === "tr1.a".attr)
    ```
    ```scala
        val query = tr1
          .where("tr1.a".attr > 10 || "tr1.c".attr < 10)
          .join(tr2.where('d.attr < 100), Inner, Some("tr1.a".attr === "tr2.a".attr))
    ```
    #### How was this patch tested?
    
    Six test cases are added.
    
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes #11406 from gatorsmile/FilterRemoval.
    gatorsmile authored and marmbrus committed Mar 9, 2016
    Configuration menu
    Copy the full SHA
    c6aa356 View commit details
    Browse the repository at this point in the history
  20. [SPARK-11861][ML] Add feature importances for decision trees

    This patch adds an API entry point for single decision tree feature importances.
    
    Author: sethah <seth.hendrickson16@gmail.com>
    
    Closes #9912 from sethah/SPARK-11861.
    sethah authored and jkbradley committed Mar 9, 2016
    Configuration menu
    Copy the full SHA
    e1772d3 View commit details
    Browse the repository at this point in the history
  21. [SPARK-13781][SQL] Use ExpressionSets in ConstraintPropagationSuite

    ## What changes were proposed in this pull request?
    
    This PR is a small follow up on #11338 (https://issues.apache.org/jira/browse/SPARK-13092) to use `ExpressionSet` as part of the verification logic in `ConstraintPropagationSuite`.
    ## How was this patch tested?
    
    No new tests added. Just changes the verification logic in `ConstraintPropagationSuite`.
    
    Author: Sameer Agarwal <sameer@databricks.com>
    
    Closes #11611 from sameeragarwal/expression-set.
    sameeragarwal authored and marmbrus committed Mar 9, 2016
    Configuration menu
    Copy the full SHA
    dbf2a7c View commit details
    Browse the repository at this point in the history

Commits on Mar 10, 2016

  1. [SPARK-13747][SQL] Fix concurrent query with fork-join pool

    ## What changes were proposed in this pull request?
    
    Fix this use case, which was already fixed in SPARK-10548 in 1.6 but was broken in master due to #9264:
    
    ```
    (1 to 100).par.foreach { _ => sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count() }
    ```
    
    This threw `IllegalArgumentException` consistently before this patch. For more detail, see the JIRA.
    
    ## How was this patch tested?
    
    New test in `SQLExecutionSuite`.
    
    Author: Andrew Or <andrew@databricks.com>
    
    Closes #11586 from andrewor14/fix-concurrent-sql.
    Andrew Or authored and zsxwing committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    37fcda3 View commit details
    Browse the repository at this point in the history
  2. [SPARK-13778][CORE] Set the executor state for a worker when removing it

    ## What changes were proposed in this pull request?
    
    When a worker is lost, the executors on this worker are also lost. But Master's ApplicationPage still displays their states as running.
    
    This patch just sets the executor state to `LOST` when a worker is lost.
    
    ## How was this patch tested?
    
    manual tests
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #11609 from zsxwing/SPARK-13778.
    zsxwing authored and Andrew Or committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    40e0676 View commit details
    Browse the repository at this point in the history
  3. [SPARK-13775] History page sorted by completed time desc by default.

    ## What changes were proposed in this pull request?
    Originally the page was sorted by AppID by default.
    After testing and user feedback, we think it is best to sort by completed time (descending).
    
    ## How was this patch tested?
    Manually test, with screenshot as follows.
    ![sorted-by-complete-time-desc](https://cloud.githubusercontent.com/assets/11683054/13647686/d6dea924-e5fa-11e5-8fc5-68e039b74b6f.png)
    
    Author: zhuol <zhuol@yahoo-inc.com>
    
    Closes #11608 from zhuoliu/13775.
    zhuol authored and Andrew Or committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    238447d View commit details
    Browse the repository at this point in the history
  4. [MINOR] Fix typo in 'hypot' docstring

    Minor typo:  docstring for pyspark.sql.functions: hypot has extra characters
    
    N/A
    
    Author: Tristan Reid <treid@netflix.com>
    
    Closes #11616 from tristanreid/master.
    tristanreid authored and Andrew Or committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    5f7dbdb View commit details
    Browse the repository at this point in the history
  5. [SPARK-13492][MESOS] Configurable Mesos framework webui URL.

    ## What changes were proposed in this pull request?
    
    Previously the Mesos framework webui URL was derived only from the Spark UI address, leaving no way to configure it. This commit makes it configurable. If unset, it falls back to the previous behavior.
    
    Motivation:
    This change is necessary in order to install Spark on DCOS and give it a custom service link. The configured `webui_url` points to a reverse proxy in the DCOS environment.
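    A minimal sketch of the fallback behavior, assuming a configuration key along the lines of `spark.mesos.driver.webui.url` (the exact key name here is an assumption):
    
    ```scala
    // If the webui URL option is set, use it; otherwise fall back to the Spark UI address.
    def frameworkWebUiUrl(conf: Map[String, String], sparkUiAddress: String): String =
      conf.getOrElse("spark.mesos.driver.webui.url", sparkUiAddress)
    
    // frameworkWebUiUrl(Map("spark.mesos.driver.webui.url" -> "https://proxy/service/spark"),
    //                   "http://10.0.0.1:4040")              => "https://proxy/service/spark"
    // frameworkWebUiUrl(Map.empty, "http://10.0.0.1:4040")   => "http://10.0.0.1:4040"
    ```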
    
    ## How was this patch tested?
    
    Locally using unit tests, and on the DCOS testing and stable revisions.
    
    Author: Sergiusz Urbaniak <sur@mesosphere.io>
    
    Closes #11369 from s-urbaniak/sur-webui-url.
    Sergiusz Urbaniak authored and Andrew Or committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    a4a0add View commit details
    Browse the repository at this point in the history
  6. [SPARK-13760][SQL] Fix BigDecimal constructor for FloatType

    ## What changes were proposed in this pull request?
    
    A very minor change to use `BigDecimal.decimal(f: Float)` instead of `BigDecimal(f: Float)`. The latter is deprecated and can result in inconsistencies due to an implicit conversion to `Double`.
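    A small illustration of the inconsistency (the exact printed digits depend on the float value, so treat the comments as indicative):
    
    ```scala
    val f = 0.1f
    // Widening to Double first drags in the Float's binary representation error:
    val viaDouble  = BigDecimal(f.toDouble)    // ~0.10000000149011612
    // decimal(f: Float) uses the Float's own decimal string representation:
    val viaDecimal = BigDecimal.decimal(f)     // 0.1
    ```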
    
    ## How was this patch tested?
    
    N/A
    
    cc yhuai
    
    Author: Sameer Agarwal <sameer@databricks.com>
    
    Closes #11597 from sameeragarwal/bigdecimal.
    sameeragarwal authored and yhuai committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    926e9c4 View commit details
    Browse the repository at this point in the history
  7. Configuration menu
    Copy the full SHA
    7906461 View commit details
    Browse the repository at this point in the history
  8. [SPARK-13766][SQL] Consistent file extensions for files written by in…

    …ternal data sources
    
    ## What changes were proposed in this pull request?
    
    https://issues.apache.org/jira/browse/SPARK-13766
    This PR makes the file extensions written by the internal data sources consistent.
    
    **Before**
    
    - TEXT, CSV and JSON
    ```
    [.COMPRESSION_CODEC_NAME]
    ```
    
    - Parquet
    ```
    [.COMPRESSION_CODEC_NAME].parquet
    ```
    
    - ORC
    ```
    .orc
    ```
    
    **After**
    
    - TEXT, CSV and JSON
    ```
    .txt[.COMPRESSION_CODEC_NAME]
    .csv[.COMPRESSION_CODEC_NAME]
    .json[.COMPRESSION_CODEC_NAME]
    ```
    
    - Parquet
    ```
    [.COMPRESSION_CODEC_NAME].parquet
    ```
    
    - ORC
    ```
    [.COMPRESSION_CODEC_NAME].orc
    ```
    
    When the compression codec is set,
    - For Parquet and ORC, each file still stays in Parquet or ORC format but just compresses its data internally, so I think it is okay to keep `.parquet` and `.orc` at the end.
    
    - For text, CSV and JSON, the file is no longer stored in its original format but in a form that depends on the compression codec, so each keeps `.json`, `.csv` or `.txt` before the compression extension (see the sketch below).
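    A minimal sketch of the naming rule described above (the helper is hypothetical, not the actual writer code):
    
    ```scala
    // codecExt is e.g. Some(".gz") or Some(".snappy"); None means no compression.
    def outputFileName(base: String, format: String, codecExt: Option[String]): String = {
      val codec = codecExt.getOrElse("")
      format match {
        case "parquet" | "orc" => s"$base$codec.$format"   // codec before the format extension
        case "text"            => s"$base.txt$codec"       // data extension first, then codec
        case "csv" | "json"    => s"$base.$format$codec"
      }
    }
    
    // outputFileName("part-00000", "csv", Some(".gz"))         => part-00000.csv.gz
    // outputFileName("part-00000", "parquet", Some(".snappy")) => part-00000.snappy.parquet
    ```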
    
    ## How was this patch tested?
    
    Unit tests are used and `./dev/run_tests` for coding style tests.
    
    Author: hyukjinkwon <gurwls223@gmail.com>
    
    Closes #11604 from HyukjinKwon/SPARK-13766.
    HyukjinKwon authored and rxin committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    aa0eba2 View commit details
    Browse the repository at this point in the history
  9. [SPARK-13794][SQL] Rename DataFrameWriter.stream() DataFrameWriter.st…

    …artStream()
    
    ## What changes were proposed in this pull request?
    The new name makes it more obvious with the verb "start" that we are actually starting some execution.
    
    ## How was this patch tested?
    This is just a rename. Existing unit tests should cover it.
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes #11627 from rxin/SPARK-13794.
    rxin committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    8a3acb7 View commit details
    Browse the repository at this point in the history
  10. [SPARK-7420][STREAMING][TESTS] Enable test: o.a.s.streaming.JobGenera…

    …torSuite "Do not clear received…
    
    ## How was this patch tested?
    
    unit test
    
    Author: proflin <proflin.me@gmail.com>
    
    Closes #11626 from lw-lin/SPARK-7420.
    lw-lin authored and rxin committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    8bcad28 View commit details
    Browse the repository at this point in the history
  11. [SPARK-13706][ML] Add Python Example for Train Validation Split

    ## What changes were proposed in this pull request?
    
    This pull request adds a python example for train validation split.
    
    ## How was this patch tested?
    
    This was style tested through lint-python, generally tested with ./dev/run-tests, and run in notebook and shell environments. It was viewed in docs locally with jekyll serve.
    
    This contribution is my original work and I license it to Spark under its open source license.
    
    Author: JeremyNixon <jnixon2@gmail.com>
    
    Closes #11547 from JeremyNixon/tvs_example.
    JeremyNixon authored and MLnick committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    3e3c3d5 View commit details
    Browse the repository at this point in the history
  12. [MINOR][SQL] Replace DataFrameWriter.stream() with startStream() in c…

    …omments.
    
    ## What changes were proposed in this pull request?
    
    Following #11627, this PR replaces `DataFrameWriter.stream()` with `startStream()` in the comments of `ContinuousQueryListener.java`.
    
    ## How was this patch tested?
    
    Manual. (It only changes comments.)
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #11629 from dongjoon-hyun/minor_rename.
    dongjoon-hyun authored and rxin committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    9525c56 View commit details
    Browse the repository at this point in the history
  13. [SPARK-11108][ML] OneHotEncoder should support other numeric types

    Adding support for other numeric types:
    
    * Integer
    * Short
    * Long
    * Float
    * Decimal
    
    Author: sethah <seth.hendrickson16@gmail.com>
    
    Closes #9777 from sethah/SPARK-11108.
    sethah authored and MLnick committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    9fe38ab View commit details
    Browse the repository at this point in the history
  14. [SPARK-13663][CORE] Upgrade Snappy Java to 1.1.2.1

    ## What changes were proposed in this pull request?
    
    Update snappy to 1.1.2.1 to pull in a single fix -- the OOM fix we already worked around.
    Supersedes #11524
    
    ## How was this patch tested?
    
    Jenkins tests.
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes #11631 from srowen/SPARK-13663.
    srowen committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    927e22e View commit details
    Browse the repository at this point in the history
  15. [SPARK-13758][STREAMING][CORE] enhance exception message to avoid mis…

    …leading
    
    We have a recoverable Spark Streaming job with checkpointing enabled; it runs correctly the first time, but throws the following exception when restarted and recovered from the checkpoint.
    ```
    org.apache.spark.SparkException: RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
     	at org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$sc(RDD.scala:87)
     	at org.apache.spark.rdd.RDD.withScope(RDD.scala:352)
     	at org.apache.spark.rdd.RDD.union(RDD.scala:565)
     	at org.apache.spark.streaming.Repo$$anonfun$createContext$1.apply(Repo.scala:23)
     	at org.apache.spark.streaming.Repo$$anonfun$createContext$1.apply(Repo.scala:19)
     	at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:627)
    ```
    
    According to the exception, it looks as if I invoked transformations and actions inside other transformations, but I did not. The real reason is that I used an external RDD in a DStream operation. External RDD data is not stored in the checkpoint, so during recovery the initial value of _sc in this RDD is null, which triggers the exception above. The error message is misleading: it says nothing about the real issue.
    Here is the code to reproduce it.
    
    ```scala
    object Repo {
    
      def createContext(ip: String, port: Int, checkpointDirectory: String):StreamingContext = {
    
        println("Creating new context")
        val sparkConf = new SparkConf().setAppName("Repo").setMaster("local[2]")
        val ssc = new StreamingContext(sparkConf, Seconds(2))
        ssc.checkpoint(checkpointDirectory)
    
        var cached = ssc.sparkContext.parallelize(Seq("apple, banana"))
    
        val words = ssc.socketTextStream(ip, port).flatMap(_.split(" "))
        words.foreachRDD((rdd: RDD[String]) => {
          val res = rdd.map(word => (word, word.length)).collect()
          println("words: " + res.mkString(", "))
    
          cached = cached.union(rdd)
          cached.checkpoint()
          println("cached words: " + cached.collect.mkString(", "))
        })
        ssc
      }
    
      def main(args: Array[String]) {
    
        val ip = "localhost"
        val port = 9999
        val dir = "/home/maowei/tmp"
    
        val ssc = StreamingContext.getOrCreate(dir,
          () => {
            createContext(ip, port, dir)
          })
        ssc.start()
        ssc.awaitTermination()
      }
    }
    ```
    
    Author: mwws <wei.mao@intel.com>
    
    Closes #11595 from mwws/SPARK-MissleadingLog.
    wei-mao-intel authored and srowen committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    74267be View commit details
    Browse the repository at this point in the history
  16. [SPARK-13636] [SQL] Directly consume UnsafeRow in wholestage codegen …

    …plans
    
    JIRA: https://issues.apache.org/jira/browse/SPARK-13636
    
    ## What changes were proposed in this pull request?
    
    As shown in the whole-stage codegen version of the Sort operator, when Sort is on top of Exchange (or another operator that produces UnsafeRow), we create variables from the UnsafeRow and then create another UnsafeRow from those variables. We should avoid this unnecessary unpacking and repacking of variables from UnsafeRows.
    
    ## How was this patch tested?
    
    All existing wholestage codegen tests should be passed.
    
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    
    Closes #11484 from viirya/direct-consume-unsaferow.
    viirya authored and davies committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    d24801a View commit details
    Browse the repository at this point in the history
  17. [SPARK-13727][CORE] SparkConf.contains does not consider deprecated keys

    The contains() method does not behave consistently with get() when the key is deprecated. For example,
    import org.apache.spark.SparkConf
    val conf = new SparkConf()
    conf.set("spark.io.compression.lz4.block.size", "12345")  # display some deprecated warning message
    conf.get("spark.io.compression.lz4.block.size") # return 12345
    conf.get("spark.io.compression.lz4.blockSize") # return 12345
    conf.contains("spark.io.compression.lz4.block.size") # return true
    conf.contains("spark.io.compression.lz4.blockSize") # return false
    
    The fix makes contains() and get() consistent.
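    A minimal sketch of the idea (a simplified model of the settings map, not SparkConf's actual code): contains() consults the same deprecated/current key equivalence that get() understands.
    
    ```scala
    // Pairs of equivalent keys (deprecated alternative -> current name).
    val deprecatedToCurrent = Map(
      "spark.io.compression.lz4.block.size" -> "spark.io.compression.lz4.blockSize")
    val currentToDeprecated = deprecatedToCurrent.map(_.swap)
    
    // Check the key itself plus any equivalent spelling, so contains() agrees
    // with a get() that understands deprecated keys.
    def contains(settings: Map[String, String], key: String): Boolean =
      settings.contains(key) ||
        deprecatedToCurrent.get(key).exists(settings.contains) ||
        currentToDeprecated.get(key).exists(settings.contains)
    
    // With settings = Map("spark.io.compression.lz4.block.size" -> "12345"):
    // contains(settings, "spark.io.compression.lz4.block.size") => true
    // contains(settings, "spark.io.compression.lz4.blockSize")  => true
    ```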
    
    I've added a test case for this.
    
    Unit tests should be sufficient.
    
    Author: bomeng <bmeng@us.ibm.com>
    
    Closes #11568 from bomeng/SPARK-13727.
    bomeng authored and Marcelo Vanzin committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    235f4ac View commit details
    Browse the repository at this point in the history
  18. [SPARK-13759][SQL] Add IsNotNull constraints for expressions with an …

    …inequality
    
    ## What changes were proposed in this pull request?
    
    This PR adds support for inferring `IsNotNull` constraints from expressions with an `!==`. More specifically, if an operator has a condition on `a !== b`, we know that both `a` and `b` in the operator output can no longer be null.
    
    ## How was this patch tested?
    
    1. Modified a test in `ConstraintPropagationSuite` to test for expressions with an inequality.
    2. Added a test in `NullFilteringSuite` for making sure an Inner join with a "non-equal" condition appropriately filters out null from their input.
    
    cc nongli
    
    Author: Sameer Agarwal <sameer@databricks.com>
    
    Closes #11594 from sameeragarwal/isnotequal-constraints.
    sameeragarwal authored and yhuai committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    19f4ac6 View commit details
    Browse the repository at this point in the history
  19. [SPARK-13790] Speed up ColumnVector's getDecimal

    ## What changes were proposed in this pull request?
    
    We should reuse an object similar to the other non-primitive type getters. For
    a query that computes averages over decimal columns, this shows a 10% speedup
    on overall query times.
    
    ## How was this patch tested?
    
    Existing tests and this benchmark
    
    ```
    TPCDS Snappy:                       Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)
    --------------------------------------------------------------------------------
    q27-agg (master)                       10627 / 11057         10.8          92.3
    q27-agg (this patch)                     9722 / 9832         11.8          84.4
    ```
    
    Author: Nong Li <nong@databricks.com>
    
    Closes #11624 from nongli/spark-13790.
    nongli authored and davies committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    747d2f5 View commit details
    Browse the repository at this point in the history
  20. [SQL][TEST] Increased timeouts to reduce flakiness in ContinuousQuery…

    …ManagerSuite
    
    ## What changes were proposed in this pull request?
    
    ContinuousQueryManager is sometimes flaky on Jenkins. I could not reproduce it on my machine, so I guess it is about the waiting times, which cause problems when Jenkins is under load. I have increased the wait times in the hope that it will be less flaky.
    
    ## How was this patch tested?
    
    I reran the unit test many times in a loop on my machine. I am going to run it a few times in Jenkins; that is the real test.
    
    Author: Tathagata Das <tathagata.das1565@gmail.com>
    
    Closes #11638 from tdas/cqm-flaky-test.
    tdas authored and yhuai committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    3d2b6f5 View commit details
    Browse the repository at this point in the history
  21. [SPARK-13696] Remove BlockStore class & simplify interfaces of mem. &…

    … disk stores
    
    Today, both the MemoryStore and DiskStore implement a common `BlockStore` API, but I feel that this API is inappropriate because it abstracts away important distinctions between the behavior of these two stores.
    
    For instance, the disk store doesn't have a notion of storing deserialized objects, so it's confusing for it to expose object-based APIs like putIterator() and getValues() instead of only exposing binary APIs and pushing the responsibilities of serialization and deserialization to the client. Similarly, the DiskStore put() methods accepted a `StorageLevel` parameter even though the disk store can only store blocks in one form.
    
    As part of a larger BlockManager interface cleanup, this patch removes the BlockStore interface and refines the MemoryStore and DiskStore interfaces to reflect narrower sets of responsibilities for those components. Some of the benefits of this interface cleanup are reflected in simplifications to several unit tests to eliminate now-unnecessary mocking, significant simplification of the BlockManager's `getLocal()` and `doPut()` methods, and a narrower API between the MemoryStore and DiskStore.
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #11534 from JoshRosen/remove-blockstore-interface.
    JoshRosen authored and Andrew Or committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    81d4853 View commit details
    Browse the repository at this point in the history
  22. [SPARK-3854][BUILD] Scala style: require spaces before {.

    ## What changes were proposed in this pull request?
    
    Since the opening curly brace, '{', has many usages as discussed in [SPARK-3854](https://issues.apache.org/jira/browse/SPARK-3854), this PR adds a ScalaStyle rule that requires a space before '{' (preventing the '){' pattern shown below) and fixes the code accordingly. If we enforce this in ScalaStyle from now on, it will improve Scala code quality and reduce review time.
    ```
    // Correct:
    if (true) {
      println("Wow!")
    }
    
    // Incorrect:
    if (true){
       println("Wow!")
    }
    ```
    IntelliJ also shows new warnings based on this.
    
    ## How was this patch tested?
    
    Pass the Jenkins ScalaStyle test.
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #11637 from dongjoon-hyun/SPARK-3854.
    dongjoon-hyun authored and Andrew Or committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    91fed8e View commit details
    Browse the repository at this point in the history

Commits on Mar 11, 2016

  1. [SPARK-13751] [SQL] generate better code for Filter

    ## What changes were proposed in this pull request?
    
    This PR improve the codegen of Filter by:
    
    1. Filter out rows early if they contain a null value that would cause the condition to evaluate to null or false. After this, we can simplify the condition, because the inputs are no longer nullable.
    
    2. Split the condition into conjunctive predicates, then check them one by one.
    
    Here is a piece of generated code for Filter in TPCDS Q55:
    ```java
    /* 109 */       /*** CONSUME: Filter ((((isnotnull(d_moy#149) && isnotnull(d_year#147)) && (d_moy#149 = 11)) && (d_year#147 = 1999)) && isnotnull(d_date_sk#141)) */
    /* 110 */       /* input[0, int] */
    /* 111 */       boolean project_isNull2 = rdd_row.isNullAt(0);
    /* 112 */       int project_value2 = project_isNull2 ? -1 : (rdd_row.getInt(0));
    /* 113 */       /* input[1, int] */
    /* 114 */       boolean project_isNull3 = rdd_row.isNullAt(1);
    /* 115 */       int project_value3 = project_isNull3 ? -1 : (rdd_row.getInt(1));
    /* 116 */       /* input[2, int] */
    /* 117 */       boolean project_isNull4 = rdd_row.isNullAt(2);
    /* 118 */       int project_value4 = project_isNull4 ? -1 : (rdd_row.getInt(2));
    /* 119 */
    /* 120 */       if (project_isNull3) continue;
    /* 121 */       if (project_isNull4) continue;
    /* 122 */       if (project_isNull2) continue;
    /* 123 */
    /* 124 */       /* (input[1, int] = 11) */
    /* 125 */       boolean filter_value6 = false;
    /* 126 */       filter_value6 = project_value3 == 11;
    /* 127 */       if (!filter_value6) continue;
    /* 128 */
    /* 129 */       /* (input[2, int] = 1999) */
    /* 130 */       boolean filter_value9 = false;
    /* 131 */       filter_value9 = project_value4 == 1999;
    /* 132 */       if (!filter_value9) continue;
    /* 133 */
    /* 134 */       filter_metricValue1.add(1);
    /* 135 */
    /* 136 */       /*** CONSUME: Project [d_date_sk#141] */
    /* 137 */
    /* 138 */       project_rowWriter1.write(0, project_value2);
    /* 139 */       append(project_result1.copy());
    ```
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #11585 from davies/gen_filter.
    Davies Liu authored and davies committed Mar 11, 2016
    Configuration menu
    Copy the full SHA
    020ff8c View commit details
    Browse the repository at this point in the history
  2. [SPARK-13604][CORE] Sync worker's state after registering with master

    ## What changes were proposed in this pull request?
    
    Here are all the cases where the Master cannot talk with the Worker for a while and then the network comes back.
    
    1. Master doesn't know the network issue (not yet timeout)
    
      a. Worker doesn't know the network issue (onDisconnected is not called)
        - Worker keeps sending Heartbeat. Both Worker and Master don't know the network issue. Nothing to do. (Finally, Master will notice the heartbeat timeout if network is not recovered)
    
      b. Worker knows the network issue (onDisconnected is called)
        - Worker stops sending Heartbeat and sends `RegisterWorker` to master. Master will reply `RegisterWorkerFailed("Duplicate worker ID")`. Worker calls "System.exit(1)" (Finally, Master will notice the heartbeat timeout if network is not recovered) (May leak driver processes. See [SPARK-13602](https://issues.apache.org/jira/browse/SPARK-13602))
    
    2. Worker timeout (Master knows the network issue). In such case,  master removes Worker and its executors and drivers.
    
      a. Worker doesn't know the network issue (onDisconnected is not called)
        - Worker keeps sending Heartbeat.
        - If the network is back, say Master receives Heartbeat, Master sends `ReconnectWorker` to Worker
        - Worker send `RegisterWorker` to master.
        - Master accepts `RegisterWorker` but doesn't know executors and drivers in Worker. (may leak executors)
    
      b. Worker knows the network issue (onDisconnected is called)
        - Worker stops sending `Heartbeat` and will send `RegisterWorker` to the master.
        - Master accepts `RegisterWorker` but doesn't know executors and drivers in Worker. (may leak executors)
    
    This PR fixes the executor and driver leaks in 2.a and 2.b when the Worker re-registers with the Master. The approach is to make the Worker send `WorkerLatestState` to sync its state after registering with the Master successfully. Then the Master will ask the Worker to kill unknown executors and drivers.
    
    Note: the Worker cannot just kill executors after registering with the Master because, in the Worker, `LaunchExecutor` and `RegisteredWorker` are processed in two threads. If `LaunchExecutor` happens before `RegisteredWorker`, the Worker's executor list will contain new executors after the Master accepts `RegisterWorker`. We should not kill these executors. So the Worker sends the list to the Master and lets the Master tell the Worker which executors should be killed.
    
    ## How was this patch tested?
    
    test("SPARK-13604: Master should ask Worker kill unknown executors and drivers")
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #11455 from zsxwing/orphan-executors.
    zsxwing authored and Andrew Or committed Mar 11, 2016
    Configuration menu
    Copy the full SHA
    27fe6ba View commit details
    Browse the repository at this point in the history
  3. [SPARK-13244][SQL] Migrates DataFrame to Dataset

    ## What changes were proposed in this pull request?
    
    This PR unifies DataFrame and Dataset by migrating existing DataFrame operations to Dataset and making `DataFrame` a type alias of `Dataset[Row]`.
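    At its core, the unification boils down to making `DataFrame` an alias, roughly like the following sketch (the package object is shown for illustration only):
    
    ```scala
    package object sql {
      // DataFrame is no longer a separate class; it is just a Dataset of Rows.
      type DataFrame = Dataset[Row]
    }
    ```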
    
    Most Scala code changes are source compatible, but Java API is broken as Java knows nothing about Scala type alias (mostly replacing `DataFrame` with `Dataset<Row>`).
    
    There are several noticeable API changes related to those returning arrays:
    
    1.  `collect`/`take`
    
        -   Old APIs in class `DataFrame`:
    
            ```scala
            def collect(): Array[Row]
            def take(n: Int): Array[Row]
            ```
    
        -   New APIs in class `Dataset[T]`:
    
            ```scala
            def collect(): Array[T]
            def take(n: Int): Array[T]
    
            def collectRows(): Array[Row]
            def takeRows(n: Int): Array[Row]
            ```
    
        Two specialized methods, `collectRows` and `takeRows`, are added because Java doesn't support returning generic arrays. Thus, for example, `Dataset[T].collect(): Array[T]` actually returns `Object` instead of `Array<T>` on the Java side.
    
        Normally, Java users may fall back to `collectAsList` and `takeAsList`.  The two new specialized versions are added to avoid performance regression in ML related code (but maybe I'm wrong and they are not necessary here).
    
    1.  `randomSplit`
    
        -   Old APIs in class `DataFrame`:
    
            ```scala
            def randomSplit(weights: Array[Double], seed: Long): Array[DataFrame]
            def randomSplit(weights: Array[Double]): Array[DataFrame]
            ```
    
        -   New APIs in class `Dataset[T]`:
    
            ```scala
            def randomSplit(weights: Array[Double], seed: Long): Array[Dataset[T]]
            def randomSplit(weights: Array[Double]): Array[Dataset[T]]
            ```
    
        Similar problem as above, but hasn't been addressed for Java API yet.  We can probably add `randomSplitAsList` to fix this one.
    
    1.  `groupBy`
    
        Some original `DataFrame.groupBy` methods have conflicting signature with original `Dataset.groupBy` methods.  To distinguish these two, typed `Dataset.groupBy` methods are renamed to `groupByKey`.
    
    Other noticeable changes:
    
    1.  Dataset always does eager analysis now
    
        We used to support disabling DataFrame eager analysis to help report partially analyzed, malformed logical plans on analysis failure.  However, Dataset encoders require eager analysis during Dataset construction.  To preserve the error reporting feature, `AnalysisException` now takes an extra `Option[LogicalPlan]` argument to hold the partially analyzed plan, so that we can check the plan tree when reporting test failures.  This plan is passed by `QueryExecution.assertAnalyzed`.
    
    ## How was this patch tested?
    
    Existing tests do the work.
    
    ## TODO
    
    - [ ] Fix all tests
    - [ ] Re-enable MiMA check
    - [ ] Update ScalaDoc (`since`, `group`, and example code)
    
    Author: Cheng Lian <lian@databricks.com>
    Author: Yin Huai <yhuai@databricks.com>
    Author: Wenchen Fan <wenchen@databricks.com>
    Author: Cheng Lian <liancheng@users.noreply.github.com>
    
    Closes #11443 from liancheng/ds-to-df.
    liancheng authored and yhuai committed Mar 11, 2016
    Configuration menu
    Copy the full SHA
    1d54278 View commit details
    Browse the repository at this point in the history
  4. [MINOR][DOC] Fix supported hive version in doc

    ## What changes were proposed in this pull request?
    
    Today, Spark 1.6.1 and the updated docs were released. Unfortunately, there is obsolete Hive version information in the docs: [Building Spark](http://spark.apache.org/docs/latest/building-spark.html#building-with-hive-and-jdbc-support). This PR fixes the following two lines.
    ```
    -By default Spark will build with Hive 0.13.1 bindings.
    +By default Spark will build with Hive 1.2.1 bindings.
    -# Apache Hadoop 2.4.X with Hive 13 support
    +# Apache Hadoop 2.4.X with Hive 1.2.1 support
    ```
    The `sql/README.md` file also describes the Hive version.
    
    ## How was this patch tested?
    
    Manual.
    
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #11639 from dongjoon-hyun/fix_doc_hive_version.
    dongjoon-hyun authored and rxin committed Mar 11, 2016
    Configuration menu
    Copy the full SHA
    88fa866 View commit details
    Browse the repository at this point in the history
  5. [SPARK-13327][SPARKR] Added parameter validations for colnames<-

    Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.attlocal.net>
    Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.usca.ibm.com>
    
    Closes #11220 from olarayej/SPARK-13312-3.
    Oscar D. Lara Yejas authored and shivaram committed Mar 11, 2016
    Configuration menu
    Copy the full SHA
    416e71a View commit details
    Browse the repository at this point in the history
  6. [SPARK-13789] Infer additional constraints from attribute equality

    ## What changes were proposed in this pull request?
    
    This PR adds support for inferring an additional set of data constraints based on attribute equality. For example, if an operator has constraints of the form (`a = 5`, `a = b`), we can now automatically infer an additional constraint of the form `b = 5`.
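    A small sketch with assumed column names and an illustrative DataFrame: given the filter below, the analyzer already tracks `a = 5` and `a = b`, and with this change it can also infer `b = 5`:
    
    ```scala
    // df is an illustrative DataFrame with integer columns a and b.
    val df = sqlContext.range(10).selectExpr("id AS a", "(id % 3) AS b")
    // This filter carries constraints a = 5 and a = b; the new rule also infers b = 5.
    val inferred = df.filter(df("a") === 5 && df("a") === df("b"))
    ```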
    
    ## How was this patch tested?
    
    Tested that new constraints are properly inferred for filters (by adding a new test) and equi-joins (by modifying an existing test)
    
    Author: Sameer Agarwal <sameer@databricks.com>
    
    Closes #11618 from sameeragarwal/infer-isequal-constraints.
    sameeragarwal authored and rxin committed Mar 11, 2016
    Configuration menu
    Copy the full SHA
    c3a6269 View commit details
    Browse the repository at this point in the history
  7. [SPARK-13389][SPARKR] SparkR support first/last with ignore NAs

    ## What changes were proposed in this pull request?
    
    SparkR support first/last with ignore NAs
    
    cc sun-rui felixcheung shivaram
    
    ## How was this patch tested?
    
    unit tests
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes #11267 from yanboliang/spark-13389.
    yanboliang authored and shivaram committed Mar 11, 2016
    Configuration menu
    Copy the full SHA
    4d535d1 View commit details
    Browse the repository at this point in the history
  8. [SPARK-13732][SPARK-13797][SQL] Remove projectList from Window and El…

    …iminate useless Window
    
    #### What changes were proposed in this pull request?
    
    `projectList` is redundant: its value is always the same as `child.output`. Remove it from the `Window` class. Removing it simplifies the code in the Analyzer and Optimizer.
    
    This PR is based on the discussion started by cloud-fan in a separate PR:
    #5604 (comment)
    
    This PR also eliminates useless `Window`.
    
    cloud-fan yhuai
    
    #### How was this patch tested?
    
    Existing test cases cover it.
    
    Author: gatorsmile <gatorsmile@gmail.com>
    Author: xiaoli <lixiao1983@gmail.com>
    Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
    
    Closes #11565 from gatorsmile/removeProjListWindow.
    gatorsmile authored and cloud-fan committed Mar 11, 2016
    Configuration menu
    Copy the full SHA
    560489f View commit details
    Browse the repository at this point in the history
  9. [SPARK-12718][SPARK-13720][SQL] SQL generation support for window fun…

    …ctions
    
    ## What changes were proposed in this pull request?
    
    Add SQL generation support for window functions. The idea is simple: just treat the `Window` operator like `Project`, i.e. add a subquery to its child when necessary, generate a `SELECT ... FROM ...` SQL string, and implement the `sql` method for window-related expressions, e.g. `WindowSpecDefinition`, `WindowFrame`, etc.
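    For illustration (the table and column names are assumed, not taken from the patch), a query like the following, which contains a window function, should now be expressible back as a SQL string by the generator:
    
    ```scala
    // Assumed table t with columns key and value.
    val df = sqlContext.sql(
      "SELECT key, value, MAX(value) OVER (PARTITION BY key ORDER BY value) AS running_max FROM t")
    ```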
    
    This PR also fixes SPARK-13720 by improving the process of adding an extra `SubqueryAlias` (the `RecoverScopingInfo` rule). Before this PR, we updated the qualifiers in the project list while adding the subquery. However, this is incomplete, as we need to update qualifiers in all ancestors that refer to attributes here. In this PR, we split `RecoverScopingInfo` into 2 rules: `AddSubQuery` and `UpdateQualifier`. `AddSubQuery` only adds a subquery if necessary, and `UpdateQualifier` re-propagates and updates qualifiers bottom up.
    
    Ideally we should put the bug fix part in an individual PR, but this bug also blocks the window stuff, so I put them together here.
    
    Many thanks to gatorsmile for the initial discussion and test cases!
    
    ## How was this patch tested?
    
    new tests in `LogicalPlanToSQLSuite`
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #11555 from cloud-fan/window.
    cloud-fan committed Mar 11, 2016
    Configuration menu
    Copy the full SHA
    6871cc8 View commit details
    Browse the repository at this point in the history
  10. [HOT-FIX] fix compile

    Fix the compilation failure introduced by #11555 because of a merge conflict.
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #11648 from cloud-fan/hotbug.
    cloud-fan committed Mar 11, 2016
    Configuration menu
    Copy the full SHA
    74c4e26 View commit details
    Browse the repository at this point in the history
  11. [MINOR][CORE] Fix a duplicate "and" in a log message.

    Author: Marcelo Vanzin <vanzin@cloudera.com>
    
    Closes #11642 from vanzin/spark-conf-typo.
    Marcelo Vanzin authored and rxin committed Mar 11, 2016
    Configuration menu
    Copy the full SHA
    e33bc67 View commit details
    Browse the repository at this point in the history
  12. [SPARK-13672][ML] Add python examples of BisectingKMeans in ML and MLLIB

    JIRA: https://issues.apache.org/jira/browse/SPARK-13672
    
    ## What changes were proposed in this pull request?
    
    add two python examples of BisectingKMeans for ml and mllib
    
    ## How was this patch tested?
    
    manual tests
    
    Author: Zheng RuiFeng <ruifengz@foxmail.com>
    
    Closes #11515 from zhengruifeng/mllib_bkm_pe.
    zhengruifeng authored and MLnick committed Mar 11, 2016
    Configuration menu
    Copy the full SHA
    d18276c View commit details
    Browse the repository at this point in the history
  13. [SPARK-13294][PROJECT INFRA] Remove MiMa's dependency on spark-class …

    …/ Spark assembly
    
    This patch removes the need to build a full Spark assembly before running the `dev/mima` script.
    
    - I modified the `tools` project to remove a direct dependency on Spark, so `sbt/sbt tools/fullClasspath` will now return the classpath for the `GenerateMIMAIgnore` class itself plus its own dependencies.
       - This required me to delete two classes full of dead code that we don't use anymore
    - `GenerateMIMAIgnore` now uses [ClassUtil](http://software.clapper.org/classutil/) to find all of the Spark classes rather than our homemade JAR traversal code. The problem in our own code was that it didn't handle folders of classes properly, which is necessary in order to generate excludes with an assembly-free Spark build.
    - `./dev/mima` no longer runs through `spark-class`, eliminating the need to reason about classpath ordering between `SPARK_CLASSPATH` and the assembly.
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #11178 from JoshRosen/remove-assembly-in-run-tests.
    JoshRosen committed Mar 11, 2016
    Configuration menu
    Copy the full SHA
    6ca990f View commit details
    Browse the repository at this point in the history
  14. [SPARK-13512][ML] add example and doc for MaxAbsScaler

    ## What changes were proposed in this pull request?
    
    jira: https://issues.apache.org/jira/browse/SPARK-13512
    Add example and doc for ml.feature.MaxAbsScaler.
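    A minimal usage sketch (the input/output column names and the input DataFrame `dataFrame`, which is assumed to contain a vector column, are illustrative):
    
    ```scala
    import org.apache.spark.ml.feature.MaxAbsScaler
    
    // Rescale each feature to [-1, 1] by dividing by the feature's max absolute value.
    val scaler = new MaxAbsScaler()
      .setInputCol("features")
      .setOutputCol("scaledFeatures")
    val scalerModel = scaler.fit(dataFrame)
    val scaledData = scalerModel.transform(dataFrame)
    ```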
    
    ## How was this patch tested?
     unit tests
    
    Author: Yuhao Yang <hhbyyh@gmail.com>
    
    Closes #11392 from hhbyyh/maxabsdoc.
    hhbyyh authored and MLnick committed Mar 11, 2016
    Configuration menu
    Copy the full SHA
    0b713e0 View commit details
    Browse the repository at this point in the history
  15. [SPARK-13787][ML][PYSPARK] Pyspark feature importances for decision t…

    …ree and random forest
    
    ## What changes were proposed in this pull request?
    
    This patch adds a `featureImportance` property to the Pyspark API for `DecisionTreeRegressionModel`, `DecisionTreeClassificationModel`, `RandomForestRegressionModel` and `RandomForestClassificationModel`.
    
    ## How was this patch tested?
    
    Python doc tests for the affected classes were updated to check feature importances.
    
    Author: sethah <seth.hendrickson16@gmail.com>
    
    Closes #11622 from sethah/SPARK-13787.
    sethah authored and MLnick committed Mar 11, 2016
    Configuration menu
    Copy the full SHA
    234f781 View commit details
    Browse the repository at this point in the history
  16. [HOT-FIX][SQL][ML] Fix compile error from use of DataFrame in Java Ma…

    …xAbsScaler example
    
    ## What changes were proposed in this pull request?
    
    Fix build failure introduced in #11392 (change `DataFrame` -> `Dataset<Row>`).
    
    ## How was this patch tested?
    
    Existing build/unit tests
    
    Author: Nick Pentreath <nick.pentreath@gmail.com>
    
    Closes #11653 from MLnick/java-maxabs-example-fix.
    MLnick committed Mar 11, 2016
    Configuration menu
    Copy the full SHA
    8fff0f9 View commit details
    Browse the repository at this point in the history
  17. [SPARK-13577][YARN] Allow Spark jar to be multiple jars, archive.

    In preparation for the demise of assemblies, this change allows the
    YARN backend to use multiple jars and globs as the "Spark jar". The
    config option has been renamed to "spark.yarn.jars" to reflect that.
    
    A second option "spark.yarn.archive" was also added; if set, this
    takes precedence and uploads an archive expected to contain the jar
    files with the Spark code and its dependencies.
    
    Existing deployments should keep working, mostly. This change drops
    support for the "SPARK_JAR" environment variable, and also does not
    fall back to using "jarOfClass" if no configuration is set, falling
    back to finding files under SPARK_HOME instead. This should be fine
    since "jarOfClass" probably wouldn't work unless you were using
    spark-submit anyway.
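    A configuration sketch (the paths are illustrative, not from the patch):
    
    ```scala
    import org.apache.spark.SparkConf
    
    // Point YARN at a set of jars instead of a single assembly...
    val conf = new SparkConf()
      .set("spark.yarn.jars", "hdfs:///spark/jars/*")
    // ...or ship one archive containing them; if both are set, the archive takes precedence.
    //  .set("spark.yarn.archive", "hdfs:///spark/spark-libs.zip")
    ```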
    
    Tested with the unit tests, and trying the different config options
    on a YARN cluster.
    
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    
    Closes #11500 from vanzin/SPARK-13577.
    Marcelo Vanzin authored and Tom Graves committed Mar 11, 2016
    Configuration menu
    Copy the full SHA
    07f1c54 View commit details
    Browse the repository at this point in the history
  18. [SPARK-13817][BUILD][SQL] Re-enable MiMA and removes object DataFrame

    ## What changes were proposed in this pull request?
    
    PR #11443 temporarily disabled MiMA check, this PR re-enables it.
    
    One extra change is that `object DataFrame` is also removed. The only purpose of introducing `object DataFrame` was to use it as an internal factory for creating `Dataset[Row]`. By replacing this internal factory with `Dataset.newDataFrame`, both `DataFrame` and `DataFrame$` are entirely removed from the API, so that we can simply put a `MissingClassProblem` filter in `MimaExcludes.scala` for most DataFrame API  changes.
    
    ## How was this patch tested?
    
    Tested by MiMA check triggered by Jenkins.
    
    Author: Cheng Lian <lian@databricks.com>
    
    Closes #11656 from liancheng/re-enable-mima.
    liancheng committed Mar 11, 2016
    Configuration menu
    Copy the full SHA
    6d37e1e View commit details
    Browse the repository at this point in the history
  19. [SPARK-13780][SQL] Add missing dependency to build.

    This is needed to avoid odd compiler errors when building just the
    sql package with maven, because of odd interactions between scalac
    and shaded classes.
    
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    
    Closes #11640 from vanzin/SPARK-13780.
    Marcelo Vanzin committed Mar 11, 2016
    Configuration menu
    Copy the full SHA
    99b7187 View commit details
    Browse the repository at this point in the history
  20. [STREAMING][MINOR] Fix a duplicate "be" in comments

    Author: Liwei Lin <proflin.me@gmail.com>
    
    Closes #11650 from lw-lin/typo.
    lw-lin authored and rxin committed Mar 11, 2016
    Configuration menu
    Copy the full SHA
    eb650a8 View commit details
    Browse the repository at this point in the history
  21. [SPARK-13328][CORE] Poor read performance for broadcast variables wit…

    …h dynamic resource allocation
    
    When dynamic resource allocation is enabled, fetching broadcast variables from removed executors was causing job failures, and SPARK-9591 fixed this problem by trying all locations of a block before giving up. However, the locations of a block are retrieved only once from the driver in this process, and the locations in this list can be stale due to dynamic resource allocation. This situation gets worse when running on a large cluster, where the location list can contain several hundred entries, of which tens may be stale. What we have observed is that with the default settings of 3 max retries and 5s between retries (that's 15s per location), the time it takes to read a broadcast variable can be as high as ~17m (70 failed attempts * 15s/attempt).
    
    Author: Nezih Yigitbasi <nyigitbasi@netflix.com>
    
    Closes #11241 from nezihyigitbasi/SPARK-13328.
    nezihyigitbasi authored and Andrew Or committed Mar 11, 2016
    Configuration menu
    Copy the full SHA
    ff776b2 View commit details
    Browse the repository at this point in the history
  22. [SPARK-13807] De-duplicate Python*Helper instantiation code in PySp…

    …ark streaming
    
    This patch de-duplicates code in PySpark streaming which loads the `Python*Helper` classes. I also changed a few `raise e` statements to simply `raise` in order to preserve the full exception stacktrace when re-throwing.
    
    Here's a link to the whitespace-change-free diff: https://github.com/apache/spark/compare/master...JoshRosen:pyspark-reflection-deduplication?w=0
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #11641 from JoshRosen/pyspark-reflection-deduplication.
    JoshRosen authored and zsxwing committed Mar 11, 2016
    Configuration menu
    Copy the full SHA
    073bf9d View commit details
    Browse the repository at this point in the history
  23. [SPARK-13814] [PYSPARK] Delete unnecessary imports in python examples…

    … files
    
    JIRA:  https://issues.apache.org/jira/browse/SPARK-13814
    
    ## What changes were proposed in this pull request?
    
    delete unnecessary imports in python examples files
    
    ## How was this patch tested?
    
    manual tests
    
    Author: Zheng RuiFeng <ruifengz@foxmail.com>
    
    Closes #11651 from zhengruifeng/del_import_pe.
    zhengruifeng authored and davies committed Mar 11, 2016
    Configuration menu
    Copy the full SHA
    42afd72 View commit details
    Browse the repository at this point in the history
  24. [SPARK-13139][SQL] Parse Hive DDL commands ourselves

    ## What changes were proposed in this pull request?
    
    This patch is ported over from viirya's changes in #11048. Currently for most DDLs we just pass the query text directly to Hive. Instead, we should parse these commands ourselves and in the future (not part of this patch) use the `HiveCatalog` to process these DDLs. This is a pretext to merging `SQLContext` and `HiveContext`.
    
    Note: As of this patch we still pass the query text to Hive. The difference is that we now parse the commands ourselves so in the future we can just use our own catalog.
    
    ## How was this patch tested?
    
    Jenkins, plus a new `DDLCommandSuite`, which comprises about 40% of the changes here.
    
    Author: Andrew Or <andrew@databricks.com>
    
    Closes #11573 from andrewor14/parser-plus-plus.
    Andrew Or authored and yhuai committed Mar 11, 2016
    Configuration menu
    Copy the full SHA
    66d9d0e View commit details
    Browse the repository at this point in the history
  25. [SPARK-13830] prefer block manager than direct result for large result

    ## What changes were proposed in this pull request?
    
    The current RPC can't handle large blocks very well; it's very slow to fetch a 100 MB block (about 1 minute). After switching to the block manager to fetch it, it took about 10 seconds (which could still be improved).
    
    ## How was this patch tested?
    
    existing unit tests.
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #11659 from davies/direct_result.
    Davies Liu authored and davies committed Mar 11, 2016
    Configuration menu
    Copy the full SHA
    2ef4c59 View commit details
    Browse the repository at this point in the history

Commits on Mar 12, 2016

  1. [SPARK-13671] [SPARK-13311] [SQL] Use different physical plans for RD…

    …D and data sources
    
    ## What changes were proposed in this pull request?
    
    This PR splits PhysicalRDD into two classes, PhysicalRDD and PhysicalScan. PhysicalRDD is used for DataFrames created from existing RDDs. PhysicalScan is used for DataFrames created from data sources. This enables us to apply different optimizations to each of them.
    
    Also fix the problem for sameResult() on two DataSourceScan.
    
    Also fix the equality check to toString for `In`. It's better to use Seq there, but we can't break this public API (sad).
    
    ## How was this patch tested?
    
    Existing tests. Manually tested with TPCDS query Q59 and Q64, all those duplicated exchanges can be re-used now, also saw there are 40+% performance improvement (saving half of the scan).
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #11514 from davies/existing_rdd.
    Davies Liu authored and davies committed Mar 12, 2016
    Configuration menu
    Copy the full SHA
    ba8c86d View commit details
    Browse the repository at this point in the history
  2. [SPARK-13828][SQL] Bring back stack trace of AnalysisException thrown…

    … from QueryExecution.assertAnalyzed
    
    PR #11443 added an extra `plan: Option[LogicalPlan]` argument to `AnalysisException` and attached the partially analyzed plan to the thrown `AnalysisException` in `QueryExecution.assertAnalyzed()`.  However, the original stack trace wasn't properly inherited.  This PR fixes this issue by inheriting the stack trace.
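    The idea, shown here as a generic sketch rather than the actual Spark code, is to copy the original stack trace onto the wrapping exception before rethrowing it:
    
    ```scala
    // Hypothetical helper: rethrow `original` wrapped in a new exception while
    // keeping the original stack trace visible to callers.
    def rethrowPreservingTrace(original: Throwable, message: String): Nothing = {
      val wrapped = new RuntimeException(message, original)
      wrapped.setStackTrace(original.getStackTrace)
      throw wrapped
    }
    ```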
    
    A test case is added to verify that the first entry of `AnalysisException` stack trace isn't from `QueryExecution`.
    
    Author: Cheng Lian <lian@databricks.com>
    
    Closes #11677 from liancheng/analysis-exception-stacktrace.
    liancheng authored and rxin committed Mar 12, 2016
    Configuration menu
    Copy the full SHA
    4eace4d View commit details
    Browse the repository at this point in the history

Commits on Mar 13, 2016

  1. [SPARK-13841][SQL] Removes Dataset.collectRows()/takeRows()

    ## What changes were proposed in this pull request?
    
    This PR removes two methods, `collectRows()` and `takeRows()`, from `Dataset[T]`. These methods were added in PR #11443, and were later considered not useful.
    
    ## How was this patch tested?
    
    Existing tests should do the work.
    
    Author: Cheng Lian <lian@databricks.com>
    
    Closes #11678 from liancheng/remove-collect-rows-and-take-rows.
    liancheng committed Mar 13, 2016
    Configuration menu
    Copy the full SHA
    c079420 View commit details
    Browse the repository at this point in the history
  2. [MINOR][DOCS] Replace DataFrame with Dataset in Javadoc.

    ## What changes were proposed in this pull request?
    
    SPARK-13817 (PR #11656) replaces `DataFrame` with `Dataset` from Java. This PR fixes the remaining broken links and sample Java code in `package-info.java`. As a result, it will update the following Javadoc.
    
    * http://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/attribute/package-summary.html
    * http://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/feature/package-summary.html
    
    ## How was this patch tested?
    
    Manual.
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #11675 from dongjoon-hyun/replace_dataframe_with_dataset_in_javadoc.
    dongjoon-hyun authored and liancheng committed Mar 13, 2016
    Configuration menu
    Copy the full SHA
    db88d02 View commit details
    Browse the repository at this point in the history
  3. [SPARK-13810][CORE] Add Port Configuration Suggestions on Bind Except…

    …ions
    
    ## What changes were proposed in this pull request?
    Currently, when a java.net.BindException is thrown, it displays the following message:
    
    java.net.BindException: Address already in use: Service '$serviceName' failed after 16 retries!
    
    This change adds port configuration suggestions to the BindException, for example, for the UI, it now displays
    
    java.net.BindException: Address already in use: Service 'SparkUI' failed after 16 retries! Consider explicitly setting the appropriate port for 'SparkUI' (for example spark.ui.port for SparkUI) to an available port or increasing spark.port.maxRetries.
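    A sketch of acting on the new suggestion (the port and retry values are illustrative):
    
    ```scala
    import org.apache.spark.SparkConf
    
    // Either pin the UI to a port known to be free, or allow more bind retries.
    val conf = new SparkConf()
      .set("spark.ui.port", "4050")
      .set("spark.port.maxRetries", "32")
    ```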
    
    ## How was this patch tested?
    Manual tests
    
    Author: Bjorn Jonsson <bjornjon@gmail.com>
    
    Closes #11644 from bjornjon/master.
    bjornjon authored and srowen committed Mar 13, 2016
    Configuration menu
    Copy the full SHA
    515e4af View commit details
    Browse the repository at this point in the history
  4. [SPARK-13812][SPARKR] Fix SparkR lint-r test errors.

    ## What changes were proposed in this pull request?
    
    This PR fixes all newly captured SparkR lint-r errors after the lintr package was updated from GitHub.
    
    ## How was this patch tested?
    
    dev/lint-r
    SparkR unit tests
    
    Author: Sun Rui <rui.sun@intel.com>
    
    Closes #11652 from sun-rui/SPARK-13812.
    Sun Rui authored and shivaram committed Mar 13, 2016
    Configuration menu
    Copy the full SHA
    c7e68c3 View commit details
    Browse the repository at this point in the history

Commits on Mar 14, 2016

  1. [SQL] fix typo in DataSourceRegister

    ## What changes were proposed in this pull request?
    fix typo in DataSourceRegister
    
    ## How was this patch tested?
    
    found when going through latest code
    
    Author: Jacky Li <jacky.likun@huawei.com>
    
    Closes #11686 from jackylk/patch-12.
    jackylk authored and rxin committed Mar 14, 2016
    Configuration menu
    Copy the full SHA
    f3daa09 View commit details
    Browse the repository at this point in the history
  2. [SPARK-13834][BUILD] Update sbt and sbt plugins for 2.x.

    ## What changes were proposed in this pull request?
    
    For 2.0.0, we should bring **sbt** and the **sbt plugins** up to date. This PR checks the status of each plugin and bumps the following.
    
    * sbt: 0.13.9 --> 0.13.11
    * sbteclipse-plugin: 2.2.0 --> 4.0.0
    * sbt-dependency-graph: 0.7.4 --> 0.8.2
    * sbt-mima-plugin: 0.1.6 --> 0.1.9
    * sbt-revolver: 0.7.2 --> 0.8.0
    
    All other plugins are up-to-date. (Note that `sbt-avro` seems to have changed from 0.3.2 to 1.0.1, but it's not published in the repository.)
    
    During the upgrade, this PR also updated the following MiMa error. Note that the related exclusion filter is already registered correctly; the difference seems due to a change in the problem type that MiMa reports.
    ```
     // SPARK-12896 Send only accumulator updates to driver, not TaskMetrics
     ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.Accumulable.this"),
    -ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.Accumulator.this"),
    +ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.Accumulator.this"),
    ```
    
    ## How was this patch tested?
    
    Pass the Jenkins build.
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #11669 from dongjoon-hyun/update_mima.
    dongjoon-hyun authored and rxin committed Mar 14, 2016
    Configuration menu
    Copy the full SHA
    473263f View commit details
    Browse the repository at this point in the history
  3. [SPARK-13823][CORE][STREAMING][SQL] Always specify Charset in String …

    …<-> byte[] conversions (and remaining Coverity items)
    
    ## What changes were proposed in this pull request?
    
    - Fixes calls to `new String(byte[])` or `String.getBytes()` that rely on the platform default encoding, to use UTF-8 (see the sketch after this list)
    - Same for `InputStreamReader` and `OutputStreamWriter` constructors
    - Standardizes on UTF-8 everywhere
    - Standardizes specifying the encoding with `StandardCharsets.UTF_8`, not the Guava constant or "UTF-8" (which means handling `UnsupportedEncodingException`)
    - (also addresses the other remaining Coverity scan issues, which are pretty trivial; these are separated into commit srowen@1deecd8 )
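    A minimal sketch of the pattern the patch standardizes on:
    
    ```scala
    import java.nio.charset.StandardCharsets
    
    // Always pass an explicit charset instead of relying on the platform default.
    val bytes = "some text".getBytes(StandardCharsets.UTF_8)
    val text  = new String(bytes, StandardCharsets.UTF_8)
    ```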
    
    ## How was this patch tested?
    
    Jenkins tests
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes #11657 from srowen/SPARK-13823.
    srowen authored and rxin committed Mar 14, 2016
    Configuration menu
    Copy the full SHA
    1840852 View commit details
    Browse the repository at this point in the history
  4. Closes #11668

    rxin committed Mar 14, 2016
    Configuration menu
    Copy the full SHA
    e58fa19 View commit details
    Browse the repository at this point in the history
  5. [MINOR][DOCS] Fix more typos in comments/strings.

    ## What changes were proposed in this pull request?
    
    This PR fixes 135 typos over 107 files:
    * 121 typos in comments
    * 11 typos in testcase name
    * 3 typos in log messages
    
    ## How was this patch tested?
    
    Manual.
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #11689 from dongjoon-hyun/fix_more_typos.
    dongjoon-hyun authored and srowen committed Mar 14, 2016
    Configuration menu
    Copy the full SHA
    acdf219 View commit details
    Browse the repository at this point in the history
  6. [SPARK-13746][TESTS] stop using deprecated SynchronizedSet

    trait SynchronizedSet in package mutable is deprecated
    
    Author: Wilson Wu <wilson888888888@gmail.com>
    
    Closes #11580 from wilson888888888/spark-synchronizedset.
    Wilson Wu authored and srowen committed Mar 14, 2016
    Configuration menu
    Copy the full SHA
    31d069d View commit details
    Browse the repository at this point in the history
  7. [SPARK-13207][SQL] Make partitioning discovery ignore _SUCCESS files.

    If a `_SUCCESS` file appears in an inner partition directory, partition discovery will treat it as a data file. Partition discovery then fails because it finds that the directory structure is not valid. We should ignore those `_SUCCESS` files.
    
    In the future, it would be better to ignore all files/dirs starting with `_` or `.`. This PR does not make that change; I am keeping this change simple so we can consider getting it into branch 1.6.
    
    To ignore all files/dirs starting with `_` or `.`, the main change would be to let ParquetRelation have another way to get metadata files. Right now, it relies on FileStatusCache's cachedLeafStatuses, which returns file statuses of both metadata files (e.g. metadata files used by Parquet) and data files, which requires more changes.
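    As a hypothetical illustration of the narrower fix (the helper name and signature are assumed, not the actual patch):
    
    ```scala
    import org.apache.hadoop.fs.FileStatus
    
    // Drop _SUCCESS markers before partition discovery so they are not mistaken for data files.
    def dataFilesOnly(statuses: Seq[FileStatus]): Seq[FileStatus] =
      statuses.filterNot(_.getPath.getName == "_SUCCESS")
    ```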
    
    https://issues.apache.org/jira/browse/SPARK-13207
    
    Author: Yin Huai <yhuai@databricks.com>
    
    Closes #11088 from yhuai/SPARK-13207.
    yhuai committed Mar 14, 2016
    Configuration menu
    Copy the full SHA
    250832c View commit details
    Browse the repository at this point in the history
  8. [SPARK-13139][SQL] Follow-ups to #11573

    Addressing outstanding comments in #11573.
    
    Jenkins, new test case in `DDLCommandSuite`
    
    Author: Andrew Or <andrew@databricks.com>
    
    Closes #11667 from andrewor14/ddl-parser-followups.
    Andrew Or authored and yhuai committed Mar 14, 2016
    Configuration menu
    Copy the full SHA
    9a1680c View commit details
    Browse the repository at this point in the history
  9. [SPARK-13833] Guard against race condition when re-caching disk block…

    …s in memory
    
    When reading data from the DiskStore and attempting to cache it back into the memory store, we should guard against race conditions where multiple readers are attempting to re-cache the same block in memory.
    
    This patch accomplishes this by synchronizing on the block's `BlockInfo` object while trying to re-cache a block.
    
    (Will file JIRA as soon as ASF JIRA stops being down / laggy).
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #11660 from JoshRosen/concurrent-recaching-fixes.
    JoshRosen authored and Andrew Or committed Mar 14, 2016
    Configuration menu
    Copy the full SHA
    9a87afd View commit details
    Browse the repository at this point in the history
  10. [SPARK-13578][CORE] Modify launch scripts to not use assemblies.

    Instead of looking for a specially-named assembly, the scripts now will
    blindly add all jars under the libs directory to the classpath. This
    libs directory is still currently the old assembly dir, so things should
    keep working the same way as before until we make more packaging changes.
    
    The only lost feature is the detection of multiple assemblies; I consider
    that a minor nicety that only really affects a few developers, so it's probably
    OK.
    
    Tested locally by running spark-shell; also did some minor Win32 testing
    (just made sure spark-shell started).
    
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    
    Closes #11591 from vanzin/SPARK-13578.
    Marcelo Vanzin authored and JoshRosen committed Mar 14, 2016
    Configuration menu
    Copy the full SHA
    45f8053 View commit details
    Browse the repository at this point in the history
  11. [SPARK-13779][YARN] Avoid cancelling non-local container requests.

    To maximize locality, the YarnAllocator would cancel any requests with a
    stale locality preference or no locality preference. This assumed that
    the majority of tasks had locality preferences, but that may not be the case
    when scanning S3. This caused container requests for S3 tasks to be
    constantly cancelled and resubmitted.
    
    This changes the allocator's logic to cancel only stale requests and
    just enough requests without locality preferences to submit requests
    with locality preferences. This avoids cancelling requests without
    locality preferences that would be resubmitted without locality
    preferences.
    
    We've deployed this patch on our clusters and verified that jobs that couldn't get executors because they kept canceling and resubmitting requests are fixed. Large jobs are running fine.
    
    Author: Ryan Blue <blue@apache.org>
    
    Closes #11612 from rdblue/SPARK-13779-fix-yarn-allocator-requests.
    rdblue authored and Marcelo Vanzin committed Mar 14, 2016
    Configuration menu
    Copy the full SHA
    63f642a View commit details
    Browse the repository at this point in the history
  12. [SPARK-13658][SQL] BooleanSimplification rule is slow with large bool…

    …ean expressions
    
    JIRA: https://issues.apache.org/jira/browse/SPARK-13658
    
    ## What changes were proposed in this pull request?
    
    Quoted from the JIRA description: when running TPCDS Q3 [1] with lots of predicates to filter out the partitions, the optimizer rule BooleanSimplification takes about 2 seconds (it uses lots of semanticEquals calls, which require copying the whole tree).
    
    It would be great if we could speed it up.
    
    [1] https://github.com/cloudera/impala-tpcds-kit/blob/master/queries/q3.sql
    
    How to speed it up:
    
    When we ask for the canonicalized expression of an `Expression`, it calls `Canonicalize.execute` on itself. `Canonicalize.execute` basically transforms up all expressions contained in this expression. However, we don't keep the canonicalized versions of these child expressions. So the next time we ask for the canonicalized versions of the child expressions (e.g., in `BooleanSimplification`), we rerun `Canonicalize.execute` on each of them. This wastes a lot of time.
    
    By forcing the child expressions to compute and keep their canonicalized versions first, we can avoid re-canonicalizing these expressions.
    
    I simply benchmarked it with an expression that is part of the WHERE clause in TPCDS Q3:
    
        val testRelation = LocalRelation('ss_sold_date_sk.int, 'd_moy.int, 'i_manufact_id.int, 'ss_item_sk.string, 'i_item_sk.string, 'd_date_sk.int)
    
        val input = ('d_date_sk === 'ss_sold_date_sk) && ('ss_item_sk === 'i_item_sk) && ('i_manufact_id === 436) && ('d_moy === 12) && (('ss_sold_date_sk > 2415355 && 'ss_sold_date_sk < 2415385) || ('ss_sold_date_sk > 2415720 && 'ss_sold_date_sk < 2415750) || ('ss_sold_date_sk > 2416085 && 'ss_sold_date_sk < 2416115) || ('ss_sold_date_sk > 2416450 && 'ss_sold_date_sk < 2416480) || ('ss_sold_date_sk > 2416816 && 'ss_sold_date_sk < 2416846) || ('ss_sold_date_sk > 2417181 && 'ss_sold_date_sk < 2417211) || ('ss_sold_date_sk > 2417546 && 'ss_sold_date_sk < 2417576) || ('ss_sold_date_sk > 2417911 && 'ss_sold_date_sk < 2417941) || ('ss_sold_date_sk > 2418277 && 'ss_sold_date_sk < 2418307) || ('ss_sold_date_sk > 2418642 && 'ss_sold_date_sk < 2418672) || ('ss_sold_date_sk > 2419007 && 'ss_sold_date_sk < 2419037) || ('ss_sold_date_sk > 2419372 && 'ss_sold_date_sk < 2419402) || ('ss_sold_date_sk > 2419738 && 'ss_sold_date_sk < 2419768) || ('ss_sold_date_sk > 2420103 && 'ss_sold_date_sk < 2420133) || ('ss_sold_date_sk > 2420468 && 'ss_sold_date_sk < 2420498) || ('ss_sold_date_sk > 2420833 && 'ss_sold_date_sk < 2420863) || ('ss_sold_date_sk > 2421199 && 'ss_sold_date_sk < 2421229) || ('ss_sold_date_sk > 2421564 && 'ss_sold_date_sk < 2421594) || ('ss_sold_date_sk > 2421929 && 'ss_sold_date_sk < 2421959) || ('ss_sold_date_sk > 2422294 && 'ss_sold_date_sk < 2422324) || ('ss_sold_date_sk > 2422660 && 'ss_sold_date_sk < 2422690) || ('ss_sold_date_sk > 2423025 && 'ss_sold_date_sk < 2423055) || ('ss_sold_date_sk > 2423390 && 'ss_sold_date_sk < 2423420) || ('ss_sold_date_sk > 2423755 && 'ss_sold_date_sk < 2423785) || ('ss_sold_date_sk > 2424121 && 'ss_sold_date_sk < 2424151) || ('ss_sold_date_sk > 2424486 && 'ss_sold_date_sk < 2424516) || ('ss_sold_date_sk > 2424851 && 'ss_sold_date_sk < 2424881) || ('ss_sold_date_sk > 2425216 && 'ss_sold_date_sk < 2425246) || ('ss_sold_date_sk > 2425582 && 'ss_sold_date_sk < 2425612) || ('ss_sold_date_sk > 2425947 && 'ss_sold_date_sk < 2425977) || ('ss_sold_date_sk > 2426312 && 'ss_sold_date_sk < 2426342) || ('ss_sold_date_sk > 2426677 && 'ss_sold_date_sk < 2426707) || ('ss_sold_date_sk > 2427043 && 'ss_sold_date_sk < 2427073) || ('ss_sold_date_sk > 2427408 && 'ss_sold_date_sk < 2427438) || ('ss_sold_date_sk > 2427773 && 'ss_sold_date_sk < 2427803) || ('ss_sold_date_sk > 2428138 && 'ss_sold_date_sk < 2428168) || ('ss_sold_date_sk > 2428504 && 'ss_sold_date_sk < 2428534) || ('ss_sold_date_sk > 2428869 && 'ss_sold_date_sk < 2428899) || ('ss_sold_date_sk > 2429234 && 'ss_sold_date_sk < 2429264) || ('ss_sold_date_sk > 2429599 && 'ss_sold_date_sk < 2429629) || ('ss_sold_date_sk > 2429965 && 'ss_sold_date_sk < 2429995) || ('ss_sold_date_sk > 2430330 && 'ss_sold_date_sk < 2430360) || ('ss_sold_date_sk > 2430695 && 'ss_sold_date_sk < 2430725) || ('ss_sold_date_sk > 2431060 && 'ss_sold_date_sk < 2431090) || ('ss_sold_date_sk > 2431426 && 'ss_sold_date_sk < 2431456) || ('ss_sold_date_sk > 2431791 && 'ss_sold_date_sk < 2431821) || ('ss_sold_date_sk > 2432156 && 'ss_sold_date_sk < 2432186) || ('ss_sold_date_sk > 2432521 && 'ss_sold_date_sk < 2432551) || ('ss_sold_date_sk > 2432887 && 'ss_sold_date_sk < 2432917) || ('ss_sold_date_sk > 2433252 && 'ss_sold_date_sk < 2433282) || ('ss_sold_date_sk > 2433617 && 'ss_sold_date_sk < 2433647) || ('ss_sold_date_sk > 2433982 && 'ss_sold_date_sk < 2434012) || ('ss_sold_date_sk > 2434348 && 'ss_sold_date_sk < 2434378) || ('ss_sold_date_sk > 2434713 && 'ss_sold_date_sk < 2434743)))
    
        val plan = testRelation.where(input).analyze
        val actual = Optimize.execute(plan)
    
    With this patch:
    
        352 milliseconds
        346 milliseconds
        340 milliseconds
    
    Without this patch:
    
        585 milliseconds
        880 milliseconds
        677 milliseconds
    
    ## How was this patch tested?
    
    Existing tests should pass.
    
    Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    
    Closes #11647 from viirya/improve-expr-canonicalize.
    viirya authored and marmbrus committed Mar 14, 2016
    Configuration menu
    Copy the full SHA
    6a4bfcd View commit details
    Browse the repository at this point in the history
  13. [SPARK-13848][SPARK-5185] Update to Py4J 0.9.2 in order to fix classl…

    …oading issue
    
    This patch upgrades Py4J from 0.9.1 to 0.9.2 in order to include a patch which modifies Py4J to use the current thread's ContextClassLoader when performing reflection / class loading. This is necessary in order to fix [SPARK-5185](https://issues.apache.org/jira/browse/SPARK-5185), a longstanding issue affecting the use of `--jars` and `--packages` in PySpark.
    
    In order to demonstrate that the fix works, I removed the workarounds which were added as part of [SPARK-6027](https://issues.apache.org/jira/browse/SPARK-6027) / #4779 and other patches.
    
    Py4J diff: py4j/py4j@0.9.1...0.9.2
    
    /cc zsxwing tdas davies brkyvz
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #11687 from JoshRosen/py4j-0.9.2.
    JoshRosen committed Mar 14, 2016
    Configuration menu
    Copy the full SHA
    07cb323 View commit details
    Browse the repository at this point in the history
  14. [SPARK-12583][MESOS] Mesos shuffle service: Don't delete shuffle file…

    …s before application has stopped
    
    ## Problem description:
    
    The Mesos shuffle service has been completely unusable since Spark 1.6.0. The problem seems to have appeared with the move from Akka to Netty in the networking layer. Until now, a connection from the driver to each shuffle service was used as a signal for the shuffle service to determine whether the driver was still running. Since 1.6.0, this connection is closed after spark.shuffle.io.connectionTimeout (or spark.network.timeout if the former is not set) due to it being idle. The shuffle service interprets this as a signal that the driver has stopped, despite the driver still being alive. Thus, shuffle files are deleted before the application has stopped.
    
    ### Context and analysis:
    
    spark shuffle fails with mesos after 2mins: https://issues.apache.org/jira/browse/SPARK-12583
    External shuffle service broken w/ Mesos: https://issues.apache.org/jira/browse/SPARK-13159
    
    This is a follow up on #11207 .
    
    ## What changes were proposed in this pull request?
    
    This PR adds a heartbeat signal from the Driver (in MesosExternalShuffleClient) to all registered external mesos shuffle service instances. In MesosExternalShuffleBlockHandler, a thread periodically checks whether a driver has timed out and cleans an application's shuffle files if this is the case.
    
    ## How was this patch tested?
    
    This patch has been tested on a small mesos test cluster using the spark-shell. Log output from mesos shuffle service:
    ```
    16/02/19 15:13:45 INFO mesos.MesosExternalShuffleBlockHandler: Received registration request from app 294def07-3249-4e0f-8d71-bf8c83c58a50-0018 (remote address /xxx.xxx.xxx.xxx:52391, heartbeat timeout 120000 ms).
    16/02/19 15:13:47 INFO shuffle.ExternalShuffleBlockResolver: Registered executor AppExecId{appId=294def07-3249-4e0f-8d71-bf8c83c58a50-0018, execId=3} with ExecutorShuffleInfo{localDirs=[/foo/blockmgr-c84c0697-a3f9-4f61-9c64-4d3ee227c047], subDirsPerLocalDir=64, shuffleManager=sort}
    16/02/19 15:13:47 INFO shuffle.ExternalShuffleBlockResolver: Registered executor AppExecId{appId=294def07-3249-4e0f-8d71-bf8c83c58a50-0018, execId=7} with ExecutorShuffleInfo{localDirs=[/foo/blockmgr-bf46497a-de80-47b9-88f9-563123b59e03], subDirsPerLocalDir=64, shuffleManager=sort}
    16/02/19 15:16:02 INFO mesos.MesosExternalShuffleBlockHandler: Application 294def07-3249-4e0f-8d71-bf8c83c58a50-0018 timed out. Removing shuffle files.
    16/02/19 15:16:02 INFO shuffle.ExternalShuffleBlockResolver: Application 294def07-3249-4e0f-8d71-bf8c83c58a50-0018 removed, cleanupLocalDirs = true
    16/02/19 15:16:02 INFO shuffle.ExternalShuffleBlockResolver: Cleaning up executor AppExecId{appId=294def07-3249-4e0f-8d71-bf8c83c58a50-0018, execId=3}'s 1 local dirs
    16/02/19 15:16:02 INFO shuffle.ExternalShuffleBlockResolver: Cleaning up executor AppExecId{appId=294def07-3249-4e0f-8d71-bf8c83c58a50-0018, execId=7}'s 1 local dirs
    ```
    Note: there are 2 executors running on this slave.
    
    Author: Bertrand Bossy <bertrand.bossy@teralytics.net>
    
    Closes #11272 from bbossy/SPARK-12583-mesos-shuffle-service-heartbeat.
    Bertrand Bossy authored and Andrew Or committed Mar 14, 2016
    Configuration menu
    Copy the full SHA
    310981d View commit details
    Browse the repository at this point in the history
  15. [MINOR][DOCS] Added Missing back slashes

    ## What changes were proposed in this pull request?
    
    When studying Spark, many users just copy examples from the documentation and paste them into their terminals,
    and because of that the missing backslashes lead them to run into shell errors.
    
    The added backslashes avoid that problem for Spark users who work this way.
    
    ## How was this patch tested?
    
    I generated the documentation locally using jekyll and checked the generated pages
    
    Author: Daniel Santana <mestresan@gmail.com>
    
    Closes #11699 from danielsan/master.
    danielsan authored and Andrew Or committed Mar 14, 2016
    Configuration menu
    Copy the full SHA
    9f13f0f View commit details
    Browse the repository at this point in the history
  16. [MINOR][COMMON] Fix copy-paste oversight in variable naming

    ## What changes were proposed in this pull request?
    
    JavaUtils.java has methods to convert time and byte strings for internal use. This change renames a variable used in byteStringAs() from timeError to byteError.
    
    Author: Bjorn Jonsson <bjornjon@gmail.com>
    
    Closes #11695 from bjornjon/master.
    bjornjon authored and Andrew Or committed Mar 14, 2016
    Configuration menu
    Copy the full SHA
    e06493c View commit details
    Browse the repository at this point in the history
  17. [SPARK-13054] Always post TaskEnd event for tasks

    I am using dynamic container allocation and speculation and am seeing issues with the active task accounting. The Executor UI still shows active tasks on an executor even though the job/stage has completed. I think it's also preventing dynamic allocation from releasing containers, because it thinks there are still running tasks.
    There are multiple issues here:
    - If a task-end event (in this case probably from a speculative task) arrives after the stage has finished, DAGScheduler.handleTaskCompletion will skip the task completion event
    
    Author: Thomas Graves <tgraves@prevailsail.corp.gq1.yahoo.com>
    Author: Thomas Graves <tgraves@staydecay.corp.gq1.yahoo.com>
    Author: Tom Graves <tgraves@yahoo-inc.com>
    
    Closes #10951 from tgravescs/SPARK-11701.
    Thomas Graves authored and Andrew Or committed Mar 14, 2016
    Configuration menu
    Copy the full SHA
    23385e8 View commit details
    Browse the repository at this point in the history
  18. [SPARK-13686][MLLIB][STREAMING] Add a constructor parameter `regParam…

    …` to (Streaming)LinearRegressionWithSGD
    
    ## What changes were proposed in this pull request?
    
    `LinearRegressionWithSGD` and `StreamingLinearRegressionWithSGD` do not have `regParam` as a constructor argument. They just depend on GradientDescent's default regParam value.
    To be consistent with other algorithms, we should add it. The same default value is used.
    
    ## How was this patch tested?
    
    Pass the existing unit test.
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #11527 from dongjoon-hyun/SPARK-13686.
    dongjoon-hyun authored and mengxr committed Mar 14, 2016
    Configuration menu
    Copy the full SHA
    a48296f View commit details
    Browse the repository at this point in the history
  19. [SPARK-10907][SPARK-6157] Remove pendingUnrollMemory from MemoryStore

    This patch refactors the MemoryStore to remove the concept of `pendingUnrollMemory`. It also fixes SPARK-6157: "Unrolling with MEMORY_AND_DISK should always release memory".
    
    Key changes:
    
    - Inline `MemoryStore.tryToPut` at its three call sites in the `MemoryStore`.
    - Inline `Memory.unrollSafely` at its only call site (in `MemoryStore.putIterator`).
    - Inline `MemoryManager.acquireStorageMemory` at its call sites.
    - Simplify the code as a result of this inlining (some parameters have fixed values after inlining, so lots of branches can be removed).
    - Remove the `pendingUnrollMemory` map by returning the amount of unrollMemory allocated when returning an iterator after a failed `putIterator` call.
    - Change `putIterator` to return an instance of `PartiallyUnrolledIterator`, a special iterator subclass which will automatically free the unroll memory of its partially-unrolled elements when the iterator is consumed. To handle cases where the iterator is not consumed (e.g. when a MEMORY_ONLY put fails), `PartiallyUnrolledIterator` exposes a `close()` method which may be called to discard the unrolled values and free their memory (see the sketch after this list).
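    
    A minimal sketch of such a close-on-consume iterator, under assumed names (freeUnroll is a stand-in for the MemoryStore bookkeeping, not the actual API):
    
    ```scala
    class PartiallyUnrolledIteratorSketch[T](
        unrolled: Iterator[T],
        rest: Iterator[T],
        freeUnroll: () => Unit) extends Iterator[T] {
    
      private var unrollFreed = false
    
      private def maybeFree(): Unit = {
        if (!unrollFreed && !unrolled.hasNext) {
          freeUnroll()        // all partially-unrolled values consumed: release their memory
          unrollFreed = true
        }
      }
    
      override def hasNext: Boolean = { maybeFree(); unrolled.hasNext || rest.hasNext }
    
      override def next(): T = {
        val v = if (unrolled.hasNext) unrolled.next() else rest.next()
        maybeFree()
        v
      }
    
      /** Discard the unrolled values and free their memory without consuming them. */
      def close(): Unit = {
        if (!unrollFreed) { freeUnroll(); unrollFreed = true }
      }
    }
    ```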
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #11613 from JoshRosen/cleanup-unroll-memory.
    JoshRosen authored and Andrew Or committed Mar 14, 2016
    Configuration menu
    Copy the full SHA
    38529d8 View commit details
    Browse the repository at this point in the history
  20. [SPARK-13626][CORE] Avoid duplicate config deprecation warnings.

    Three different things were needed to get rid of spurious warnings:
    - silence deprecation warnings when cloning configuration
    - change the way SparkHadoopUtil instantiates SparkConf to silence
      warnings
    - avoid creating new SparkConf instances where it's not needed.
    
    On top of that, I changed the way that Logging.scala detects the repl;
    now it uses a method that is overridden in the repl's Main class, and
    the hack in Utils.scala is not needed anymore. This makes the 2.11 repl
    behave like the 2.10 one and set the default log level to WARN, which
    is a lot better. Previously, this wasn't working because the 2.11 repl
    triggers log initialization earlier than the 2.10 one.
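    
    As an illustration only (the trait and member names below are assumptions, not Spark's actual code), the override-based repl detection could look roughly like this:
    
    ```scala
    // The logging trait exposes an overridable hook; the repl's Main overrides it instead
    // of Spark probing the call stack or classpath to guess whether it runs inside a shell.
    trait LoggingSketch {
      protected def isInsideRepl: Boolean = false
      def defaultLogLevel: String = if (isInsideRepl) "WARN" else "INFO"
    }
    
    object ReplMainSketch extends LoggingSketch {
      override protected def isInsideRepl: Boolean = true
    }
    ```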
    
    I also removed and simplified some other code in the 2.11 repl's Main
    to avoid replicating logic that already exists elsewhere in Spark.
    
    Tested the 2.11 repl in local and yarn modes.
    
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    
    Closes #11510 from vanzin/SPARK-13626.
    Marcelo Vanzin committed Mar 14, 2016
    Configuration menu
    Copy the full SHA
    8301fad View commit details
    Browse the repository at this point in the history
  21. [SPARK-13843][STREAMING] Remove streaming-flume, streaming-mqtt, stre…

    …aming-zeromq, streaming-akka, streaming-twitter to Spark packages
    
    ## What changes were proposed in this pull request?
    
    Currently there are a few sub-projects, each for integrating with different external sources for Streaming.  Now that we have better ability to include external libraries (spark packages) and with Spark 2.0 coming up, we can move the following projects out of Spark to https://github.com/spark-packages
    
    - streaming-flume
    - streaming-akka
    - streaming-mqtt
    - streaming-zeromq
    - streaming-twitter
    
    They are just ancillary packages, and considering the overhead of maintenance, running tests, and PR failures, it's better to maintain them outside of Spark. In addition, these projects can have their own release cycles and we can release them faster.
    
    I have already copied these projects to https://github.com/spark-packages
    
    ## How was this patch tested?
    
    Jenkins tests
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #11672 from zsxwing/remove-external-pkg.
    zsxwing authored and rxin committed Mar 14, 2016
    Configuration menu
    Copy the full SHA
    06dec37 View commit details
    Browse the repository at this point in the history

Commits on Mar 15, 2016

  1. [SPARK-11826][MLLIB] Refactor add() and subtract() methods

    srowen Could you please check this when you have time?
    
    Author: Ehsan M.Kermani <ehsanmo1367@gmail.com>
    
    Closes #9916 from ehsanmok/JIRA-11826.
    ehsanmok authored and jkbradley committed Mar 15, 2016
    Configuration menu
    Copy the full SHA
    992142b View commit details
    Browse the repository at this point in the history
  2. [SPARK-13664][SQL] Add a strategy for planning partitioned and bucket…

    …ed scans of files
    
    This PR adds a new strategy, `FileSourceStrategy`, that can be used for planning scans of collections of files that might be partitioned or bucketed.
    
    Compared with the existing planning logic in `DataSourceStrategy` this version has the following desirable properties:
     - It removes the need to have `RDD`, `broadcastedHadoopConf` and other distributed concerns  in the public API of `org.apache.spark.sql.sources.FileFormat`
     - Partition column appending is delegated to the format to avoid an extra copy / devectorization when appending partition columns
     - It minimizes the amount of data that is shipped to each executor (i.e. it does not send the whole list of files to every worker in the form of a hadoop conf)
     - it natively supports bucketing files into partitions, and thus does not require coalescing / creating a `UnionRDD` with the correct partitioning.
     - Small files are automatically coalesced into fewer tasks using an approximate bin-packing algorithm (see the sketch after this list).
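    
    A hedged first-fit-decreasing sketch of that bin-packing idea, under an assumed FileInfo(path, size) and a bytes-per-task budget; FileSourceStrategy's real cost model and ordering may differ:
    
    ```scala
    import scala.collection.mutable.ArrayBuffer
    
    case class FileInfo(path: String, sizeInBytes: Long)
    
    def packFilesIntoTasks(files: Seq[FileInfo], maxBytesPerTask: Long): Seq[Seq[FileInfo]] = {
      class Bucket { var bytes = 0L; val members = ArrayBuffer.empty[FileInfo] }
      val buckets = ArrayBuffer.empty[Bucket]
      // Place large files first and reuse the first bucket that still has room.
      for (f <- files.sortBy(-_.sizeInBytes)) {
        val target = buckets.find(_.bytes + f.sizeInBytes <= maxBytesPerTask).getOrElse {
          val b = new Bucket; buckets += b; b
        }
        target.bytes += f.sizeInBytes
        target.members += f
      }
      buckets.map(_.members.toSeq).toSeq
    }
    ```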
    
    Currently only a testing source is planned / tested using this strategy.  In follow-up PRs we will port the existing formats to this API.
    
    A stub for `FileScanRDD` is also added, but most methods remain unimplemented.
    
    Other minor cleanups:
     - partition pruning is pushed into `FileCatalog` so both the new and old code paths can use this logic.  This will also allow future implementations to use indexes or other tricks (i.e. a MySQL metastore)
     - The partitions from the `FileCatalog` now propagate information about file sizes all the way up to the planner so we can intelligently spread files out.
     - `Array` -> `Seq` in some internal APIs to avoid unnecessary `toArray` calls
     - Rename `Partition` to `PartitionDirectory` to differentiate partitions used earlier in pruning from those where we have already enumerated the files and their sizes.
    
    Author: Michael Armbrust <michael@databricks.com>
    
    Closes #11646 from marmbrus/fileStrategy.
    marmbrus committed Mar 15, 2016
    Configuration menu
    Copy the full SHA
    17eec0a View commit details
    Browse the repository at this point in the history
  3. [SPARK-13882][SQL] Remove org.apache.spark.sql.execution.local

    ## What changes were proposed in this pull request?
    We introduced some local operators in org.apache.spark.sql.execution.local package but never fully wired the engine to actually use these. We still plan to implement a full local mode, but it's probably going to be fairly different from what the current iterator-based local mode would look like. Based on what we know right now, we might want a push-based columnar version of these operators.
    
    Let's just remove them for now, and we can always re-introduced them in the future by looking at branch-1.6.
    
    ## How was this patch tested?
    This is simply dead code removal.
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes #11705 from rxin/SPARK-13882.
    rxin committed Mar 15, 2016
    Configuration menu
    Copy the full SHA
    4bf4609 View commit details
    Browse the repository at this point in the history
  4. [SPARK-10380][SQL] Fix confusing documentation examples for astype/dr…

    …op_duplicates.
    
    ## What changes were proposed in this pull request?
    We have seen users getting confused by the documentation for astype and drop_duplicates, because the examples in them do not use these functions (but do use their aliases). This patch simply removes all examples for these functions and notes that they are aliases.
    
    ## How was this patch tested?
    Existing PySpark unit tests.
    
    Closes #11543.
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes #11698 from rxin/SPARK-10380.
    rxin authored and marmbrus committed Mar 15, 2016
    Configuration menu
    Copy the full SHA
    8e0b030 View commit details
    Browse the repository at this point in the history
  5. [SPARK-13791][SQL] Add MetadataLog and HDFSMetadataLog

    ## What changes were proposed in this pull request?
    
    - Add a MetadataLog interface for reliable metadata storage (a sketch of such an interface follows this list).
    - Add HDFSMetadataLog as a MetadataLog implementation based on HDFS.
    - Update FileStreamSource to use HDFSMetadataLog instead of managing metadata by itself.
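    
    A sketch of what such a batch-id keyed log interface can look like; the method names are assumptions for illustration, not necessarily the exact API:
    
    ```scala
    trait MetadataLogSketch[T] {
      /** Atomically record metadata for a batch; return false if the batch already exists. */
      def add(batchId: Long, metadata: T): Boolean
      /** Return the metadata for a given batch, if it was recorded. */
      def get(batchId: Long): Option[T]
      /** Return the most recently committed batch and its metadata, if any. */
      def getLatest(): Option[(Long, T)]
    }
    ```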
    
    ## How was this patch tested?
    
    unit tests
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #11625 from zsxwing/metadata-log.
    zsxwing authored and marmbrus committed Mar 15, 2016
    Configuration menu
    Copy the full SHA
    b5e3bd8 View commit details
    Browse the repository at this point in the history
  6. [SPARK-13880][SPARK-13881][SQL] Rename DataFrame.scala Dataset.scala,…

    … and remove LegacyFunctions
    
    ## What changes were proposed in this pull request?
    1. Rename DataFrame.scala to Dataset.scala, since the class is now named Dataset.
    2. Remove LegacyFunctions. It was introduced in Spark 1.6 for backward compatibility, and can be removed in Spark 2.0.
    
    ## How was this patch tested?
    Should be covered by existing unit/integration tests.
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes #11704 from rxin/SPARK-13880.
    rxin authored and cloud-fan committed Mar 15, 2016
    Configuration menu
    Copy the full SHA
    e76679a View commit details
    Browse the repository at this point in the history
  7. [SPARK-13661][SQL] avoid the copy in HashedRelation

    ## What changes were proposed in this pull request?
    
    Avoid the copy in HashedRelation, since most HashedRelations are built from an Array[Row]; the copy() is added for LeftSemiJoinHash. This helps reduce memory consumption for broadcast joins.
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #11666 from davies/remove_copy.
    Davies Liu authored and rxin committed Mar 15, 2016
    Configuration menu
    Copy the full SHA
    9256840 View commit details
    Browse the repository at this point in the history
  8. [SPARK-13353][SQL] fast serialization for collecting DataFrame/Dataset

    ## What changes were proposed in this pull request?
    
    When we call DataFrame/Dataset.collect(), the Java serializer (or Kryo serializer) is used to serialize the UnsafeRows on the executors and deserialize them into UnsafeRows on the driver. The Java serializer (and Kryo serializer) are slow on millions of rows because they look for identical rows, but usually there are none.
    
    This PR serializes the UnsafeRows as a byte array by packing them together; the Java serializer (or Kryo serializer) then serializes the bytes very quickly (there are fewer objects, and byte arrays are not compared by content).
    
    The UnsafeRow format is highly compressible, the serialized bytes are also compressed (configurable by spark.io.compression.codec).
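    
    The packing idea can be sketched as follows, with illustrative names rather than Spark's internal API: each row's bytes are length-prefixed and appended, so the serializer only sees one large byte array instead of millions of row objects.
    
    ```scala
    import java.io.{ByteArrayOutputStream, DataOutputStream}
    
    def packRows(rows: Iterator[Array[Byte]]): Array[Byte] = {
      val bos = new ByteArrayOutputStream()
      val out = new DataOutputStream(bos)
      rows.foreach { r =>
        out.writeInt(r.length)  // length prefix for each row
        out.write(r)            // the raw row bytes
      }
      out.writeInt(-1)          // end-of-stream marker
      out.flush()
      bos.toByteArray
    }
    ```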
    
    ## How was this patch tested?
    
    Existing unit tests.
    
    Add a benchmark for collect, before this patch:
    ```
    Intel(R) Core(TM) i7-4558U CPU  2.80GHz
    collect:                        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -------------------------------------------------------------------------------------------
    collect 1 million                      3991 / 4311          0.3        3805.7       1.0X
    collect 2 millions                  10083 / 10637          0.1        9616.0       0.4X
    collect 4 millions                  29551 / 30072          0.0       28182.3       0.1X
    ```
    
    After this patch:
    ```
    Intel(R) Core(TM) i7-4558U CPU  2.80GHz
    collect:                        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -------------------------------------------------------------------------------------------
    collect 1 million                        775 / 1170          1.4         738.9       1.0X
    collect 2 millions                     1153 / 1758          0.9        1099.3       0.7X
    collect 4 millions                     4451 / 5124          0.2        4244.9       0.2X
    ```
    
    We can see about 5-7X speedup.
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #11664 from davies/serialize_row.
    Davies Liu authored and rxin committed Mar 15, 2016
    Configuration menu
    Copy the full SHA
    f72743d View commit details
    Browse the repository at this point in the history
  9. [SPARK-13884][SQL] Remove DescribeCommand's dependency on LogicalPlan

    ## What changes were proposed in this pull request?
    This patch removes DescribeCommand's dependency on LogicalPlan. After this patch, DescribeCommand simply accepts a TableIdentifier. It minimizes the dependency, and blocks my next patch (removes SQLContext dependency from SparkPlanner).
    
    ## How was this patch tested?
    Should be covered by existing unit tests and Hive compatibility tests that run describe table.
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes #11710 from rxin/SPARK-13884.
    rxin committed Mar 15, 2016
    Configuration menu
    Copy the full SHA
    e649580 View commit details
    Browse the repository at this point in the history
  10. [SPARK-13888][DOC] Remove Akka Receiver doc and refer to the DStream …

    …Akka project
    
    ## What changes were proposed in this pull request?
    
    I have copied the docs of Streaming Akka to https://github.com/spark-packages/dstream-akka/blob/master/README.md
    
    So we can remove them from Spark now.
    
    ## How was this patch tested?
    
    Only document changes.
    
    (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #11711 from zsxwing/remove-akka-doc.
    zsxwing authored and rxin committed Mar 15, 2016
    Configuration menu
    Copy the full SHA
    43304b1 View commit details
    Browse the repository at this point in the history
  11. [SPARK-13870][SQL] Add scalastyle escaping correctly in CSVSuite.scala

    ## What changes were proposed in this pull request?
    
    When `CSVSuite.scala` was initially created in SPARK-12833, there was a typo in `scalastyle:on`: `scalstyle:on`. This mistakenly turned off ScalaStyle checking for the rest of the file, so a violation in the recently added `SPARK-12668` code could not be caught. This PR fixes the existing escaping and adds a new escaping for the `SPARK-12668` code, like the following.
    
    ```scala
       test("test aliases sep and encoding for delimiter and charset") {
    +    // scalastyle:off
         val cars = sqlContext
    ...
           .load(testFile(carsFile8859))
    +    // scalastyle:on
    ```
    This will prevent future potential problems, too.
    
    ## How was this patch tested?
    
    Pass the Jenkins test.
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #11700 from dongjoon-hyun/SPARK-13870.
    dongjoon-hyun authored and rxin committed Mar 15, 2016
    Configuration menu
    Copy the full SHA
    a51f877 View commit details
    Browse the repository at this point in the history
  12. [SPARK-13890][SQL] Remove some internal classes' dependency on SQLCon…

    …text
    
    ## What changes were proposed in this pull request?
    In general it is better for internal classes to not depend on the external class (in this case SQLContext) to reduce coupling between user-facing APIs and the internal implementations. This patch removes SQLContext dependency from some internal classes such as SparkPlanner, SparkOptimizer.
    
    As part of this patch, I also removed the following internal methods from SQLContext:
    ```
    protected[sql] def functionRegistry: FunctionRegistry
    protected[sql] def optimizer: Optimizer
    protected[sql] def sqlParser: ParserInterface
    protected[sql] def planner: SparkPlanner
    protected[sql] def continuousQueryManager
    protected[sql] def prepareForExecution: RuleExecutor[SparkPlan]
    ```
    
    ## How was this patch tested?
    Existing unit/integration tests.
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes #11712 from rxin/sqlContext-planner.
    rxin committed Mar 15, 2016
    Configuration menu
    Copy the full SHA
    276c2d5 View commit details
    Browse the repository at this point in the history
  13. [SPARK-13840][SQL] Split Optimizer Rule ColumnPruning to ColumnPrunin…

    …g and EliminateOperator
    
    #### What changes were proposed in this pull request?
    
    Before this PR, the two optimizer rules `ColumnPruning` and `PushPredicateThroughProject` reverse each other's effects, so the optimizer always reaches the maximum number of iterations when optimizing some queries, and extra `Project` operators are found in the plan. For example, below is the optimized plan after reaching 100 iterations:
    
    ```
    Join Inner, Some((cast(id1#16 as bigint) = id1#18L))
    :- Project [id1#16]
    :  +- Filter isnotnull(cast(id1#16 as bigint))
    :     +- Project [id1#16]
    :        +- Relation[id1#16,newCol#17] JSON part: struct<>, data: struct<id1:int,newCol:int>
    +- Filter isnotnull(id1#18L)
       +- Relation[id1#18L] JSON part: struct<>, data: struct<id1:bigint>
    ```
    
    This PR splits the optimizer rule `ColumnPruning` into `ColumnPruning` and `EliminateOperators`.
    
    The issue becomes worse with another rule, `NullFiltering`, which can add extra `IsNotNull` filters. We have to be careful about introducing extra `Filter` operators when the benefit is not large enough. Another PR will be submitted by sameeragarwal to handle this issue.
    
    cc sameeragarwal marmbrus
    
    In addition, `ColumnPruning` should not push `Project` through non-deterministic `Filter`. This could cause wrong results. This will be put in a separate PR.
    
    cc davies cloud-fan yhuai
    
    #### How was this patch tested?
    
    Modified the existing test cases.
    
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes #11682 from gatorsmile/viewDuplicateNames.
    gatorsmile authored and rxin committed Mar 15, 2016
    Configuration menu
    Copy the full SHA
    99bd2f0 View commit details
    Browse the repository at this point in the history
  14. [SPARK-13660][SQL][TESTS] ContinuousQuerySuite floods the logs with g…

    …arbage
    
    ## What changes were proposed in this pull request?
    
    Use the 'testQuietly' method to keep ContinuousQuerySuite from flooding the console with log garbage.
    
    ContinuousQuerySuite no longer outputs logs to the console; the logs are still written to unit-tests.log.
    
    ## How was this patch tested?
    
    Just check Jenkins output.
    
    Author: Xin Ren <iamshrek@126.com>
    
    Closes #11703 from keypointt/SPARK-13660.
    keypointt authored and rxin committed Mar 15, 2016
    Configuration menu
    Copy the full SHA
    10251a7 View commit details
    Browse the repository at this point in the history
  15. [SPARK-12379][ML][MLLIB] Copy GBT implementation to spark.ml

    Currently, GBTs in spark.ml wrap the implementation in spark.mllib. This is preventing several improvements to GBTs in spark.ml, so we need to move the implementation to ml and use spark.ml decision trees in the implementation. At first, we should make minimal changes to the implementation.
    Performance testing should be done to ensure there were no regressions.
    
    Performance testing results are [here](https://docs.google.com/document/d/1dYd2mnfGdUKkQ3vZe2BpzsTnI5IrpSLQ-NNKDZhUkgw/edit?usp=sharing)
    
    Author: sethah <seth.hendrickson16@gmail.com>
    
    Closes #10607 from sethah/SPARK-12379.
    sethah authored and MLnick committed Mar 15, 2016
    Configuration menu
    Copy the full SHA
    dafd70f View commit details
    Browse the repository at this point in the history
  16. [SPARK-13803] restore the changes in SPARK-3411

    ## What changes were proposed in this pull request?
    
    This patch contains the functionality to balance the load of the cluster-mode drivers among workers
    
    This patch restores the changes from #1106, which were erased by the merge of #731
    
    ## How was this patch tested?
    
    test with existing test cases
    
    Author: CodingCat <zhunansjtu@gmail.com>
    
    Closes #11702 from CodingCat/SPARK-13803.
    CodingCat authored and srowen committed Mar 15, 2016
    Configuration menu
    Copy the full SHA
    bd5365b View commit details
    Browse the repository at this point in the history
  17. [SPARK-13576][BUILD] Don't create assembly for examples.

    As part of the goal to stop creating assemblies in Spark, this change
    modifies the mvn and sbt builds to not create an assembly for examples.
    
    Instead, dependencies are copied to the build directory (under
    target/scala-xx/jars), and in the final archive, into the "examples/jars"
    directory.
    
    To avoid having to deal too much with Windows batch files, I made examples
    run through the launcher library; the spark-submit launcher now has a
    special mode to run examples, which adds all the necessary jars to the
    spark-submit command line, and replaces the bash and batch scripts that
    were used to run examples. The scripts are now just a thin wrapper around
    spark-submit; another advantage is that now all spark-submit options are
    supported.
    
    There are a few glitches; in the mvn build, a lot of duplicated dependencies
    get copied, because they are promoted to "compile" scope due to extra
    dependencies in the examples module (such as HBase). In the sbt build,
    all dependencies are copied, because there doesn't seem to be an easy
    way to filter things.
    
    I plan to clean some of this up when the rest of the tasks are finished.
    When the main assembly is replaced with jars, we can remove duplicate jars
    from the examples directory during packaging.
    
    Tested by running SparkPi in: maven build, sbt build, dist created by
    make-distribution.sh.
    
    Finally: note that running the "assembly" target in sbt doesn't build
    the examples anymore. You need to run "package" for that.
    
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    
    Closes #11452 from vanzin/SPARK-13576.
    Marcelo Vanzin committed Mar 15, 2016
    Configuration menu
    Copy the full SHA
    48978ab View commit details
    Browse the repository at this point in the history
  18. [SPARK-13893][SQL] Remove SQLContext.catalog/analyzer (internal method)

    ## What changes were proposed in this pull request?
    Our internal code can go through SessionState.catalog and SessionState.analyzer. This brings two small benefits:
    1. Reduces internal dependency on SQLContext.
    2. Removes 2 public methods in Java (Java does not obey package private visibility).
    
    More importantly, according to the design in SPARK-13485, we'd need to claim this catalog function for the user-facing public functions, rather than having an internal field.
    
    ## How was this patch tested?
    Existing unit/integration test code.
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes #11716 from rxin/SPARK-13893.
    rxin committed Mar 15, 2016
    Configuration menu
    Copy the full SHA
    5e6f2f4 View commit details
    Browse the repository at this point in the history
  19. [SPARK-13642][YARN] Changed the default application exit state to fai…

    …led for yarn cluster mode
    
    ## What changes were proposed in this pull request?
    
    Change the default exit state to `failed` for any application running in yarn cluster mode.
    
    ## How was this patch tested?
    
    Unit test is done locally.
    
    CC tgravescs and vanzin .
    
    Author: jerryshao <sshao@hortonworks.com>
    
    Closes #11693 from jerryshao/SPARK-13642.
    jerryshao authored and Marcelo Vanzin committed Mar 15, 2016
    Configuration menu
    Copy the full SHA
    d89c714 View commit details
    Browse the repository at this point in the history
  20. [SPARK-13896][SQL][STRING] Dataset.toJSON should return Dataset

    ## What changes were proposed in this pull request?
    Change the return type of toJSON in the Dataset class.
    ## How was this patch tested?
    No additional unit test required.
    
    Author: Stavros Kontopoulos <stavros.kontopoulos@typesafe.com>
    
    Closes #11732 from skonto/fix_toJson.
    Stavros Kontopoulos authored and rxin committed Mar 15, 2016
    Configuration menu
    Copy the full SHA
    50e3644 View commit details
    Browse the repository at this point in the history
  21. [MINOR] a minor fix for the comments of a method in RPC Dispatcher

    ## What changes were proposed in this pull request?
    
    a minor fix for the comments of a method in RPC Dispatcher
    
    ## How was this patch tested?
    
    existing unit tests
    
    Author: CodingCat <zhunansjtu@gmail.com>
    
    Closes #11738 from CodingCat/minor_rpc.
    CodingCat authored and zsxwing committed Mar 15, 2016
    Configuration menu
    Copy the full SHA
    dddf2f2 View commit details
    Browse the repository at this point in the history
  22. [SPARK-13626][CORE] Revert change to SparkConf's constructor.

    It shouldn't be private.
    
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    
    Closes #11734 from vanzin/SPARK-13626-api.
    Marcelo Vanzin committed Mar 15, 2016
    Configuration menu
    Copy the full SHA
    41eaabf View commit details
    Browse the repository at this point in the history
  23. [SPARK-13895][SQL] DataFrameReader.text should return Dataset[String]

    ## What changes were proposed in this pull request?
    This patch changes DataFrameReader.text()'s return type from DataFrame to Dataset[String].
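    
    A hedged usage sketch, assuming a SQLContext named sqlContext is in scope; after this change, text() yields a Dataset[String] rather than a single-column DataFrame.
    
    ```scala
    import org.apache.spark.sql.Dataset
    
    val lines: Dataset[String] = sqlContext.read.text("README.md")
    val longLines = lines.filter((l: String) => l.length > 80)  // work directly with Strings
    ```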
    
    Closes #11731.
    
    ## How was this patch tested?
    Updated existing integration tests to reflect the change.
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes #11739 from rxin/SPARK-13895.
    rxin authored and yhuai committed Mar 15, 2016
    Configuration menu
    Copy the full SHA
    643649d View commit details
    Browse the repository at this point in the history

Commits on Mar 16, 2016

  1. [SPARK-13918][SQL] Merge SortMergeJoin and SortMergeOuterJoin

    ## What changes were proposed in this pull request?
    
    This PR just moves some code from SortMergeOuterJoin into SortMergeJoin.
    
    This is to support codegen for outer joins.
    
    ## How was this patch tested?
    
    existing tests.
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #11743 from davies/gen_smjouter.
    Davies Liu authored and rxin committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    bbd887f View commit details
    Browse the repository at this point in the history
  2. [MINOR][TEST][SQL] Remove wrong "expected" parameter in checkNaNWitho…

    …utCodegen
    
    ## What changes were proposed in this pull request?
    
    Remove the unused "expected" parameter from checkNaNWithoutCodegen in MathFunctionsSuite.scala.
    This function checks for NaN values, so the "expected" parameter is useless. Callers do not pass an "expected" value, and the similar functions checkNaNWithGeneratedProjection and checkNaNWithOptimization do not use it either.
    
    Author: Yucai Yu <yucai.yu@intel.com>
    
    Closes #11718 from yucai/unused_expected.
    Yucai Yu authored and rxin committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    52b6a89 View commit details
    Browse the repository at this point in the history
  3. [SPARK-13917] [SQL] generate broadcast semi join

    ## What changes were proposed in this pull request?
    
    This PR brings codegen support for broadcast left-semi join.
    
    ## How was this patch tested?
    
    Existing tests. Added benchmark, the result show 7X speedup.
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #11742 from davies/gen_semi.
    Davies Liu authored and davies committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    421f6c2 View commit details
    Browse the repository at this point in the history
  4. [SPARK-9837][ML] R-like summary statistics for GLMs via iteratively r…

    …eweighted least squares
    
    ## What changes were proposed in this pull request?
    Provide R-like summary statistics for GLMs via iteratively reweighted least squares.
    ## How was this patch tested?
    unit tests.
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes #11694 from yanboliang/spark-9837.
    yanboliang authored and mengxr committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    3665294 View commit details
    Browse the repository at this point in the history
  5. [SPARK-13920][BUILD] MIMA checks should apply to @experimental and @D…

    …eveloperAPI APIs
    
    ## What changes were proposed in this pull request?
    
    We are able to change `Experimental` and `DeveloperAPI` APIs freely, but we should still monitor and manage those APIs carefully. This PR for [SPARK-13920](https://issues.apache.org/jira/browse/SPARK-13920) enables the MiMa check and adds filters for them.
    
    ## How was this patch tested?
    
    Pass the Jenkins tests (including MiMa).
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #11751 from dongjoon-hyun/SPARK-13920.
    dongjoon-hyun authored and rxin committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    3c578c5 View commit details
    Browse the repository at this point in the history
  6. [SPARK-13899][SQL] Produce InternalRow instead of external Row at CSV…

    … data source
    
    ## What changes were proposed in this pull request?
    
    https://issues.apache.org/jira/browse/SPARK-13899
    
    This PR makes CSV data source produce `InternalRow` instead of `Row`.
    
    Basically, this resembles JSON data source. It uses the same codes for casting.
    
    ## How was this patch tested?
    
    Unit tests were used within IDE and code style was checked by `./dev/run_tests`.
    
    Author: hyukjinkwon <gurwls223@gmail.com>
    
    Closes #11717 from HyukjinKwon/SPARK-13899.
    HyukjinKwon authored and rxin committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    9202479 View commit details
    Browse the repository at this point in the history
  7. [SPARK-12653][SQL] Re-enable test "SPARK-8489: MissingRequirementErro…

    …r during reflection"
    
    ## What changes were proposed in this pull request?
    
    The purpose of [SPARK-12653](https://issues.apache.org/jira/browse/SPARK-12653) is re-enabling a regression test.
    Historically, the target regression test is added by [SPARK-8498](093c348), but is temporarily disabled by [SPARK-12615](8ce645d) due to binary compatibility error.
    
    The following is the current error message at the submitting spark job with the pre-built `test.jar` file in the target regression test.
    ```
    Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.SparkContext$.$lessinit$greater$default$6()Lscala/collection/Map;
    ```
    
    Simply rebuilding `test.jar` cannot restore the purpose of the test case, since we need to support both Scala 2.10 and 2.11 for a while. For example, we will see the following Scala 2.11 error if we use a `test.jar` built with Scala 2.10.
    ```
    Exception in thread "main" java.lang.NoSuchMethodError: scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaMirrors$JavaMirror;
    ```
    
    This PR replaces the existing `test.jar` with `test-2.10.jar` and `test-2.11.jar` and improves the regression test to use the suitable jar file.
    
    ## How was this patch tested?
    
    Pass the existing Jenkins test.
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #11744 from dongjoon-hyun/SPARK-12653.
    dongjoon-hyun authored and srowen committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    431a3d0 View commit details
    Browse the repository at this point in the history
  8. [SPARK-13906] Ensure that there are at least 2 dispatcher threads.

    ## What changes were proposed in this pull request?
    
    Force at least two dispatcher-event-loop threads. Since SparkDeploySchedulerBackend (in AppClient) calls askWithRetry on CoarseGrainedScheduler in the same process, the driver needs at least two dispatcher threads to prevent the dispatcher thread from hanging.
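    
    A minimal sketch of the guard, with availableProcessors() standing in for whatever configured value would otherwise be used (the surrounding code is illustrative, not Spark's exact Dispatcher):
    
    ```scala
    // Never size the dispatcher's thread pool below 2, so one thread stays free
    // even while another blocks on a same-process ask.
    val requested = Runtime.getRuntime.availableProcessors()
    val numDispatcherThreads = math.max(2, requested)
    ```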
    
    ## How was this patch tested?
    
    Manual.
    
    Author: Yonathan Randolph <yonathangmail.com>
    
    Author: Yonathan Randolph <yonathan@liftigniter.com>
    
    Closes #11728 from yonran/SPARK-13906.
    yonran authored and srowen committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    05ab294 View commit details
    Browse the repository at this point in the history
  9. [SPARK-13823][SPARK-13397][SPARK-13395][CORE] More warnings, Standard…

    …Charset follow up
    
    ## What changes were proposed in this pull request?
    
    Follow up to #11657
    
    - Also update `String.getBytes("UTF-8")` to use `StandardCharsets.UTF_8`
    - And fix one last new Coverity warning that turned up (use of unguarded `wait()` replaced by simpler/more robust `java.util.concurrent` classes in tests)
    - And while we're here cleaning up Coverity warnings, just fix about 15 more build warnings
    
    ## How was this patch tested?
    
    Jenkins tests
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes #11725 from srowen/SPARK-13823.2.
    srowen committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    3b461d9 View commit details
    Browse the repository at this point in the history
  10. [SPARK-13396] Stop using our internal deprecated .metrics on Exceptio…

    JIRA: https://issues.apache.org/jira/browse/SPARK-13396
    
    Stop using our internal deprecated .metrics on ExceptionFailure; instead, use accumUpdates.
    
    Author: GayathriMurali <gayathri.m.softie@gmail.com>
    
    Closes #11544 from GayathriMurali/SPARK-13396.
    GayathriMurali authored and srowen committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    56d8824 View commit details
    Browse the repository at this point in the history
  11. [SPARK-13793][CORE] PipedRDD doesn't propagate exceptions while readi…

    …ng parent RDD
    
    ## What changes were proposed in this pull request?
    
    PipedRDD creates a child thread to read the output of the parent stage and feed it to the pipe process. This change uses a variable to save any exception thrown in the child thread and then propagates the exception in the main thread if the variable was set, as sketched below.
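    
    A minimal sketch of that pattern, with assumed names (PipeHelper and propagateChildException are illustrative, not Spark's actual identifiers):
    
    ```scala
    class PipeHelper(feed: () => Unit) {
      @volatile private var childThreadException: Throwable = _
    
      // Child thread: feeds parent-RDD output to the external process, recording any failure.
      private val writer = new Thread("pipe stdin writer") {
        override def run(): Unit =
          try feed() catch { case t: Throwable => childThreadException = t }
      }
      writer.start()
    
      /** Called from the main thread while reading process output; re-throws the child's failure. */
      def propagateChildException(): Unit = {
        val t = childThreadException
        if (t != null) throw new RuntimeException("Exception in pipe input writer thread", t)
      }
    }
    ```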
    
    ## How was this patch tested?
    
    - Added a unit test
    - Ran all the existing tests in PipedRDDSuite and they all pass with the change
    - Tested the patch with a real pipe() job, bounced the executor node which ran the parent stage to simulate a fetch failure and observed that the parent stage was re-ran.
    
    Author: Tejas Patil <tejasp@fb.com>
    
    Closes #11628 from tejasapatil/pipe_rdd.
    tejasapatil authored and srowen committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    1d95fb6 View commit details
    Browse the repository at this point in the history
  12. [SPARK-13889][YARN] Fix integer overflow when calculating the max num…

    …ber of executor failure
    
    ## What changes were proposed in this pull request?
    The max number of executor failures before failing the application defaults to twice the maximum number of executors when dynamic allocation is enabled. The default value of "spark.dynamicAllocation.maxExecutors" is Int.MaxValue, so this causes an integer overflow and a wrong result: the calculated default max number of executor failures is 3. This PR adds a check to avoid the overflow.
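    
    A hedged illustration of the overflow and of one way to clamp it (the actual fix in this PR may differ in detail):
    
    ```scala
    val maxNumExecutors = Int.MaxValue                    // default of spark.dynamicAllocation.maxExecutors
    val broken = math.max(maxNumExecutors * 2, 3)         // Int overflow: 2 * Int.MaxValue == -2, so this is 3
    val fixed  = math.max(math.min(maxNumExecutors, Int.MaxValue / 2) * 2, 3)  // clamp before doubling
    ```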
    
    ## How was this patch tested?
    It tests whether the value is greater than Int.MaxValue / 2 to avoid the overflow when it is multiplied by 2.
    
    Author: Carson Wang <carson.wang@intel.com>
    
    Closes #11713 from carsonwang/IntOverflow.
    carsonwang authored and srowen committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    496d2a2 View commit details
    Browse the repository at this point in the history
  13. [SPARK-13823][HOTFIX] Increase tryAcquire timeout and assert it succe…

    …eds to fix failure on slow machines
    
    ## What changes were proposed in this pull request?
    
    I'm seeing several PR builder builds fail after https://github.com/apache/spark/pull/11725/files. Example:
    
    https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.4/lastFailedBuild/console
    
    ```
    testCommunication(org.apache.spark.launcher.LauncherServerSuite)  Time elapsed: 0.023 sec  <<< FAILURE!
    java.lang.AssertionError: expected:<app-id> but was:<null>
    	at org.apache.spark.launcher.LauncherServerSuite.testCommunication(LauncherServerSuite.java:93)
    ```
    
    However, other builds pass this same test, including the test when run locally and on the Jenkins PR builder. The failure itself concerns a change to how the test waits on a condition, and the wait can time out; therefore I think this is due to fast/slow machine differences.
    
    This is an attempt at a hot fix; it's a little hard to verify since locally and on the PR builder, it passes anyway. The change itself should be harmless anyway.
    
    Why didn't this happen before, if the new logic was supposed to be equivalent to the old? I think this is the sequence:
    
    - First attempt to acquire semaphore for 10ms actually silently times out
    - The change being waited for happens just after that, a bit too late
    - Assertion passes since condition became true just in time
    - `release()` fires from the listener
    - Next `tryAcquire` however immediately succeeds because the first `tryAcquire` didn't acquire anything, but its subsequent condition is not yet true; this would explain why the second one always fails
    
    Versus the original using `notifyAll()`, there's a small difference: `wait()`-ing after `notifyAll()` just results in another wait; it doesn't make it return immediately. So this was a tiny latent issue that was masked by the semantics. Now the test asserts that the event actually happened (semaphore was acquired). (The timeout is still here to prevent the test from hanging forever, and to detect really slow response.) The timeout is increased to a second to allow plenty of time anyway.
    
    ## How was this patch tested?
    
    Jenkins tests
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes #11763 from srowen/SPARK-13823.3.
    srowen committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    9412547 View commit details
    Browse the repository at this point in the history
  14. [SPARK-13281][CORE] Switch broadcast of RDD to exception from warning

    ## What changes were proposed in this pull request?
    
    In SparkContext, throw IllegalArgumentException when trying to broadcast an RDD directly, instead of just logging a warning.
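    
    A hedged sketch of such a check, assuming spark-core on the classpath; the message text is illustrative, not necessarily Spark's exact wording. Note that require throws IllegalArgumentException, matching the behavior described above.
    
    ```scala
    import org.apache.spark.rdd.RDD
    
    def assertNotDirectlyBroadcastingRdd(value: Any): Unit = {
      // Broadcasting an RDD object itself is almost never what the user wants.
      require(!value.isInstanceOf[RDD[_]],
        "Can not directly broadcast RDDs; instead, call collect() and broadcast the result.")
    }
    ```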
    
    ## How was this patch tested?
    
    mvn clean install
    Add UT in BroadcastSuite
    
    Author: Wesley Tang <tangmingjun@mininglamp.com>
    
    Closes #11735 from breakdawn/master.
    Wesley Tang authored and srowen committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    5f6bdf9 View commit details
    Browse the repository at this point in the history
  15. [SPARK-13360][PYSPARK][YARN] PYSPARK_DRIVER_PYTHON and PYSPARK_PYTHON…

    … is not picked up in yarn-cluster mode
    
    Author: Jeff Zhang <zjffdu@apache.org>
    
    Closes #11238 from zjffdu/SPARK-13360.
    zjffdu authored and Marcelo Vanzin committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    eacd9d8 View commit details
    Browse the repository at this point in the history
  16. [SPARK-13924][SQL] officially support multi-insert

    ## What changes were proposed in this pull request?
    
    There is a feature of hive SQL called multi-insert. For example:
    ```
    FROM src
    INSERT OVERWRITE TABLE dest1
    SELECT key + 1
    INSERT OVERWRITE TABLE dest2
    SELECT key WHERE key > 2
    INSERT OVERWRITE TABLE dest3
    SELECT col EXPLODE(arr) exp AS col
    ...
    ```
    
    We partially support it currently, with some limitations: 1) WHERE can't reference columns produced by LATERAL VIEW. 2) It's not executed eagerly, i.e. `sql("...multi-insert clause...")` won't take place right away like other commands, e.g. CREATE TABLE.
    
    This PR removes these limitations and make us fully support multi-insert.
    
    ## How was this patch tested?
    
    new tests in `SQLQuerySuite`
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #11754 from cloud-fan/lateral-view.
    cloud-fan authored and rxin committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    d9e8f26 View commit details
    Browse the repository at this point in the history
  17. [SPARK-13894][SQL] SqlContext.range return type from DataFrame to Dat…

    …aSet
    
    ## What changes were proposed in this pull request?
    https://issues.apache.org/jira/browse/SPARK-13894
    Change the return type of the `SQLContext.range` API from `DataFrame` to `Dataset`.
    
    ## How was this patch tested?
    No additional unit test required.
    
    Author: Cheng Hao <hao.cheng@intel.com>
    
    Closes #11730 from chenghao-intel/range.
    chenghao-intel authored and rxin committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    d9670f8 View commit details
    Browse the repository at this point in the history
  18. [SPARK-13816][GRAPHX] Add parameter checks for algorithms in Graphx

    JIRA: https://issues.apache.org/jira/browse/SPARK-13816
    
    ## What changes were proposed in this pull request?
    
    Add parameter checks for algorithms in GraphX: Pregel, LabelPropagation, PageRank, SVDPlusPlus
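    
    A hedged sketch of the kind of argument validation involved; the parameter names and messages are illustrative, not GraphX's exact code:
    
    ```scala
    def validatePageRankArgs(numIter: Int, resetProb: Double): Unit = {
      require(numIter > 0, s"Number of iterations must be greater than 0, but got $numIter")
      require(resetProb >= 0.0 && resetProb <= 1.0,
        s"Random reset probability must belong to [0, 1], but got $resetProb")
    }
    ```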
    
    ## How was this patch tested?
    
    manual tests
    
    Author: Zheng RuiFeng <ruifengz@foxmail.com>
    
    Closes #11655 from zhengruifeng/graphx_param_check.
    zhengruifeng authored and rxin committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    9198497 View commit details
    Browse the repository at this point in the history
  19. [SPARK-13827][SQL] Can't add subquery to an operator with same-name o…

    …utputs while generate SQL string
    
    ## What changes were proposed in this pull request?
    
    This PR tries to solve a fundamental issue in the `SQLBuilder`. When we want to turn a logical plan into a SQL string and put it after the FROM clause, we need to wrap it with a sub-query. However, a logical plan is allowed to have same-name outputs with different qualifiers (e.g. the `Join` operator), and this kind of plan can't be put under a subquery, as we erase and assign a new qualifier to all outputs, making it impossible to distinguish same-name outputs.
    
    To solve this problem, this PR renames all attributes with globally unique names(using exprId), so that we don't need qualifiers to resolve ambiguity anymore.
    
    For example, `SELECT x.key, MAX(y.key) OVER () FROM t x JOIN t y`, we will parse this SQL to a Window operator and a Project operator, and add a sub-query between them. The generated SQL looks like:
    ```
    SELECT sq_1.key, sq_1.max
    FROM (
        SELECT sq_0.key, sq_0.key, MAX(sq_0.key) OVER () AS max
        FROM (
            SELECT x.key, y.key FROM t1 AS x JOIN t2 AS y
        ) AS sq_0
    ) AS sq_1
    ```
    You can see, the `key` columns become ambiguous after `sq_0`.
    
    After this PR, it will generate something like:
    ```
    SELECT attr_30 AS key, attr_37 AS max
    FROM (
        SELECT attr_30, attr_37
        FROM (
            SELECT attr_30, attr_35, MAX(attr_35) AS attr_37
            FROM (
                SELECT attr_30, attr_35 FROM
                    (SELECT key AS attr_30 FROM t1) AS sq_0
                INNER JOIN
                    (SELECT key AS attr_35 FROM t1) AS sq_1
            ) AS sq_2
        ) AS sq_3
    ) AS sq_4
    ```
    The outermost SELECT is used to turn the generated named to real names back, and the innermost SELECT is used to alias real columns to our generated names. Between them, there is no name ambiguity anymore.
    
    ## How was this patch tested?
    
    existing tests and new tests in LogicalPlanToSQLSuite.
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #11658 from cloud-fan/gensql.
    cloud-fan authored and yhuai committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    1d1de28 View commit details
    Browse the repository at this point in the history
  20. [SPARK-12721][SQL] SQL Generation for Script Transformation

    #### What changes were proposed in this pull request?
    
    This PR is to convert to SQL from analyzed logical plans containing operator `ScriptTransformation`.
    
    For example, below is the SQL containing `Transform`
    ```
    SELECT TRANSFORM (a, b, c, d) USING 'cat' FROM parquet_t2
    ```
    
    Its logical plan is like
    ```
    ScriptTransformation [a#210L,b#211L,c#212L,d#213L], cat, [key#208,value#209], HiveScriptIOSchema(List(),List(),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),List((field.delim,	)),List((field.delim,	)),Some(org.apache.hadoop.hive.ql.exec.TextRecordReader),Some(org.apache.hadoop.hive.ql.exec.TextRecordWriter),true)
    +- SubqueryAlias parquet_t2
       +- Relation[a#210L,b#211L,c#212L,d#213L] ParquetRelation
    ```
    
    The generated SQL will be like
    ```
    SELECT TRANSFORM (`parquet_t2`.`a`, `parquet_t2`.`b`, `parquet_t2`.`c`, `parquet_t2`.`d`) USING 'cat' AS (`key` string, `value` string) FROM `default`.`parquet_t2`
    ```
    #### How was this patch tested?
    
    Seven test cases are added to `LogicalPlanToSQLSuite`.
    
    Author: gatorsmile <gatorsmile@gmail.com>
    Author: xiaoli <lixiao1983@gmail.com>
    Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
    
    Closes #11503 from gatorsmile/transformToSQL.
    gatorsmile authored and yhuai committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    c4bd576 View commit details
    Browse the repository at this point in the history
  21. [SPARK-13038][PYSPARK] Add load/save to pipeline

    ## What changes were proposed in this pull request?
    
    JIRA issue: https://issues.apache.org/jira/browse/SPARK-13038
    
    1. Add load/save to PySpark Pipeline and PipelineModel
    
    2. Add `_transfer_stage_to_java()` and `_transfer_stage_from_java()` for `JavaWrapper`.
    
    ## How was this patch tested?
    
    Test with doctest.
    
    Author: Xusen Yin <yinxusen@gmail.com>
    
    Closes #11683 from yinxusen/SPARK-13038-only.
    yinxusen authored and jkbradley committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    ae6c677 View commit details
    Browse the repository at this point in the history
  22. [SPARK-13613][ML] Provide ignored tests to export test dataset into C…

    …SV format
    
    ## What changes were proposed in this pull request?
    Provide ignored test cases to export the test dataset into CSV format in ```LinearRegressionSuite```, ```LogisticRegressionSuite```, ```AFTSurvivalRegressionSuite``` and ```GeneralizedLinearRegressionSuite```, so users can validate the training accuracy compared with R's glm, glmnet and survival package.
    cc mengxr
    ## How was this patch tested?
    The test suite is ignored, but I have enabled all these cases offline and it works as expected.
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes #11463 from yanboliang/spark-13613.
    yanboliang authored and mengxr committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    3f06eb7 View commit details
    Browse the repository at this point in the history
  23. [SPARK-11888][ML] Decision tree persistence in spark.ml

    ### What changes were proposed in this pull request?
    
    Made these MLReadable and MLWritable: DecisionTreeClassifier, DecisionTreeClassificationModel, DecisionTreeRegressor, DecisionTreeRegressionModel
    * The shared implementation is in treeModels.scala
    * I use case classes to create a DataFrame to save, and I use the Dataset API to parse loaded files.
    
    Other changes:
    * Made CategoricalSplit.numCategories public (to use in persistence)
    * Fixed a bug in DefaultReadWriteTest.testEstimatorAndModelReadWrite, where it did not call the checkModelData function passed as an argument.  This caused an error in LDASuite, which I fixed.
    
    ### How was this patch tested?
    
    Persistence is tested via unit tests.  For each algorithm, there are 2 non-trivial trees (depth 2).  One is built with continuous features, and one with categorical; this ensures that both types of splits are tested.
    
    Author: Joseph K. Bradley <joseph@databricks.com>
    
    Closes #11581 from jkbradley/dt-io.
    jkbradley committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    6fc2b65 View commit details
    Browse the repository at this point in the history
  24. [SPARK-13927][MLLIB] add row/column iterator to local matrices

    ## What changes were proposed in this pull request?
    
    Add row/column iterator to local matrices to simplify tasks like BlockMatrix => RowMatrix conversion. It handles dense and sparse matrices properly.
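    
    A hedged usage sketch, assuming the new iterator is exposed as rowIter on local matrices (spark-mllib on the classpath); it should behave the same for dense and sparse matrices.
    
    ```scala
    import org.apache.spark.mllib.linalg.Matrices
    
    // 2 x 3 dense matrix in column-major order.
    val m = Matrices.dense(2, 3, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
    m.rowIter.foreach(println)  // prints two row vectors of length 3
    ```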
    
    ## How was this patch tested?
    
    Unit tests on sparse and dense matrix.
    
    cc: dbtsai
    
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes #11757 from mengxr/SPARK-13927.
    mengxr authored and DB Tsai committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    85c42fd View commit details
    Browse the repository at this point in the history
  25. [SPARK-13034] PySpark ml.classification support export/import

    ## What changes were proposed in this pull request?
    
    Add export/import for all estimators and transformers(which have Scala implementation) under pyspark/ml/classification.py.
    
    ## How was this patch tested?
    
    ./python/run-tests
    ./dev/lint-python
    Unit tests added to check persistence in Logistic Regression
    
    Author: GayathriMurali <gayathri.m.softie@gmail.com>
    
    Closes #11707 from GayathriMurali/SPARK-13034.
    GayathriMurali authored and jkbradley committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    27e1f38 View commit details
    Browse the repository at this point in the history
  26. [SPARK-13942][CORE][DOCS] Remove Shark-related docs for 2.x

    ## What changes were proposed in this pull request?
    
    `Shark` was merged into `Spark SQL` back in [July 2014](https://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html). The following seems to be the only legacy. For Spark 2.x, we had better clean up those docs.
    
    **Migration Guide**
    ```
    - ## Migration Guide for Shark Users
    - ...
    - ### Scheduling
    - ...
    - ### Reducer number
    - ...
    - ### Caching
    ```
    
    ## How was this patch tested?
    
    Pass the Jenkins test.
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #11770 from dongjoon-hyun/SPARK-13942.
    dongjoon-hyun authored and rxin committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    4ce2d24 View commit details
    Browse the repository at this point in the history
  27. [SPARK-13922][SQL] Filter rows with null attributes in vectorized par…

    …quet reader
    
    ## What changes were proposed in this pull request?
    
    It's common for many SQL operators to not care about reading `null` values for correctness. Currently, this is achieved by performing `isNotNull` checks (for all relevant columns) on a per-row basis. Pushing these null filters in the vectorized parquet reader should bring considerable benefits (especially for cases when the underlying data doesn't contain any nulls or contains all nulls).
    
    ## How was this patch tested?
    
            Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
            String with Nulls Scan (0%):        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
            -------------------------------------------------------------------------------------------
            SQL Parquet Vectorized                   1229 / 1648          8.5         117.2       1.0X
            PR Vectorized                             833 /  846         12.6          79.4       1.5X
            PR Vectorized (Null Filtering)            732 /  782         14.3          69.8       1.7X
    
            Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
            String with Nulls Scan (50%):       Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
            -------------------------------------------------------------------------------------------
            SQL Parquet Vectorized                    995 / 1053         10.5          94.9       1.0X
            PR Vectorized                             732 /  772         14.3          69.8       1.4X
            PR Vectorized (Null Filtering)            725 /  790         14.5          69.1       1.4X
    
            Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
            String with Nulls Scan (95%):       Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
            -------------------------------------------------------------------------------------------
            SQL Parquet Vectorized                    326 /  333         32.2          31.1       1.0X
            PR Vectorized                             190 /  200         55.1          18.2       1.7X
            PR Vectorized (Null Filtering)            168 /  172         62.2          16.1       1.9X
    
    Author: Sameer Agarwal <sameer@databricks.com>
    
    Closes #11749 from sameeragarwal/perf-testing.
    sameeragarwal authored and yhuai committed Mar 16, 2016
    b90c020
  28. [SPARK-13871][SQL] Support for inferring filters from data constraints

    ## What changes were proposed in this pull request?
    
    This PR generalizes the `NullFiltering` optimizer rule in catalyst to `InferFiltersFromConstraints` that can automatically infer all relevant filters based on an operator's constraints while making sure of 2 things:
    
    (a) no redundant filters are generated, and
    (b) filters that do not contribute to any further optimizations are not generated.
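    
    As a rough illustration of the kind of inference involved (a sketch only, assuming tables `t1` and `t2` are registered with `sqlContext`; this is not the rule's actual code):
    
    ```scala
    import org.apache.spark.sql.functions.col
    
    val joined = sqlContext.table("t1")
      .join(sqlContext.table("t2"), col("a") === col("b"))
      .where(col("a") > 10)
    
    // With filter inference, the optimized plan should also carry the inferred
    // predicates isnotnull(b) and b > 10 on the t2 side of the join:
    joined.explain(true)
    ```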
    
    ## How was this patch tested?
    
    Extended all tests in `InferFiltersFromConstraintsSuite` (which were initially based on `NullFilteringSuite`) to test filter inference in `Filter` and `Join` operators.
    
    In particular, the two tests (`single inner join with pre-existing filters: filter out values on either side` and `multiple inner joins: filter out values on all sides on equi-join keys`) attempt to highlight/test the real potential of this rule for join optimization.
    
    Author: Sameer Agarwal <sameer@databricks.com>
    
    Closes #11665 from sameeragarwal/infer-filters.
    sameeragarwal authored and yhuai committed Mar 16, 2016
    f96997b
  29. [SPARK-13869][SQL] Remove redundant conditions while combining filters

    ## What changes were proposed in this pull request?
    
    **[I'll link it to the JIRA once ASF JIRA is back online]**
    
    This PR modifies the existing `CombineFilters` rule to remove redundant conditions while combining individual filter predicates. For instance, queries of the form `table.where('a === 1 && 'b === 1).where('a === 1 && 'c === 1)` will now be optimized to `table.where('a === 1 && 'b === 1 && 'c === 1)` (instead of `table.where('a === 1 && 'a === 1 && 'b === 1 && 'c === 1)`).
    
    ## How was this patch tested?
    
    Unit test in `FilterPushdownSuite`
    
    Author: Sameer Agarwal <sameer@databricks.com>
    
    Closes #11670 from sameeragarwal/combine-filters.
    sameeragarwal authored and yhuai committed Mar 16, 2016
    77ba302
  30. [SPARK-11011][SQL] Narrow type of UDT serialization

    ## What changes were proposed in this pull request?
    
    Narrow down the parameter type of `UserDefinedType#serialize()`. Currently, the parameter type is `Any`, however it would logically make more sense to narrow it down to the type of the actual user defined type.
    
    ## How was this patch tested?
    
    Existing tests were successfully run on a local machine.
    
    Author: Jakob Odersky <jakob@odersky.com>
    
    Closes #11379 from jodersky/SPARK-11011-udt-types.
    jodersky authored and mengxr committed Mar 16, 2016
    d4d8493

Commits on Mar 17, 2016

  1. [SPARK-13761][ML] Deprecate validateParams

    ## What changes were proposed in this pull request?
    
    Deprecate validateParams() method here: https://github.com/apache/spark/blob/035d3acdf3c1be5b309a861d5c5beb803b946b5e/mllib/src/main/scala/org/apache/spark/ml/param/params.scala#L553
    Move all functionality in overridden methods to transformSchema().
    Check docs to make sure they indicate complex Param interaction checks should be done in transformSchema.
    
    ## How was this patch tested?
    
    unit tests
    
    Author: Yuhao Yang <hhbyyh@gmail.com>
    
    Closes #11620 from hhbyyh/depreValid.
    hhbyyh authored and jkbradley committed Mar 17, 2016
    92b7057
  2. [SPARK-13923][SQL] Implement SessionCatalog

    ## What changes were proposed in this pull request?
    
    As part of the effort to merge `SQLContext` and `HiveContext`, this patch implements an internal catalog called `SessionCatalog` that handles temporary functions and tables and delegates metastore operations to `ExternalCatalog`. Currently, this is still dead code, but in the future it will be part of `SessionState` and will replace `o.a.s.sql.catalyst.analysis.Catalog`.
    
    A recent patch #11573 parses Hive commands ourselves in Spark, but still passes the entire query text to Hive. In a future patch, we will use `SessionCatalog` to implement the parsed commands.
    
    ## How was this patch tested?
    
    800+ lines of tests in `SessionCatalogSuite`.
    
    Author: Andrew Or <andrew@databricks.com>
    
    Closes #11750 from andrewor14/temp-catalog.
    Andrew Or authored and yhuai committed Mar 17, 2016
    ca9ef86
  3. [SPARK-13719][SQL] Parse JSON rows having an array type and a struct …

    …type in the same field
    
    ## What changes were proposed in this pull request?
    
    PR #2400 added support for parsing JSON rows wrapped in an array. However, it throws an exception when the given data contains array data and struct data in the same field, as below:
    
    ```json
    {"a": {"b": 1}}
    {"a": []}
    ```
    
    and the schema is given as below:
    
    ```scala
    val schema =
      StructType(
        StructField("a", StructType(
          StructField("b", StringType) :: Nil
        )) :: Nil)
    ```
    
    - **Before**
    
    ```scala
    sqlContext.read.schema(schema).json(path).show()
    ```
    
    ```scala
    Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 0.0 failed 4 times, most recent failure: Lost task 7.3 in stage 0.0 (TID 10, 192.168.1.170): java.lang.ClassCastException: org.apache.spark.sql.types.GenericArrayData cannot be cast to org.apache.spark.sql.catalyst.InternalRow
    	at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getStruct(rows.scala:50)
    	at org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getStruct(rows.scala:247)
    	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown Source)
    ...
    ```
    
    - **After**
    
    ```scala
    sqlContext.read.schema(schema).json(path).show()
    ```
    
    ```bash
    +----+
    |   a|
    +----+
    | [1]|
    |null|
    +----+
    ```
    
    For other data types, the given values are converted to `null` in this case, but only this case throws an exception.
    
    This PR makes the support for wrapped rows applied only at the top level.
    
    ## How was this patch tested?
    
    Unit tests were used, plus `./dev/run_tests` for code style checks.
    
    Author: hyukjinkwon <gurwls223@gmail.com>
    
    Closes #11752 from HyukjinKwon/SPARK-3308-follow-up.
    HyukjinKwon authored and yhuai committed Mar 17, 2016
    917f400
  4. [SPARK-13873] [SQL] Avoid copy of UnsafeRow when there is no join in …

    …whole stage codegen
    
    ## What changes were proposed in this pull request?
    
    We need to copy the UnsafeRow because a Join could produce multiple rows from a single input row. We can avoid the copy if there is no join (or the join will not produce multiple rows) inside WholeStageCodegen.
    
    Updated the benchmark for `collect`; we see a 20-30% speedup.
    
    ## How was this patch tested?
    
    existing unit tests.
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #11740 from davies/avoid_copy2.
    Davies Liu authored and davies committed Mar 17, 2016
    c100d31
  5. [SPARK-13118][SQL] Expression encoding for optional synthetic classes

    ## What changes were proposed in this pull request?
    
    Fix expression generation for optional types.
    Standard Java reflection causes issues when dealing with synthetic Scala objects (things that do not map to Java and thus contain a dollar sign in their name). This patch introduces Scala reflection in such cases.
    
    This patch also adds a regression test for Dataset's handling of classes defined in package objects (which was the initial purpose of this PR).
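    
    A minimal sketch of the regression scenario (hypothetical package and class names, written as a single source file for brevity):
    
    ```scala
    // A case class declared inside a package object compiles to a synthetic class
    // (its JVM name contains a dollar sign), which used to break encoder generation.
    package object mypkg {
      case class Point(x: Int, y: Int)
    }
    
    package mypkg {
      object EncoderDemo {
        def run(sqlContext: org.apache.spark.sql.SQLContext): Unit = {
          import sqlContext.implicits._
          // Building a Dataset of the package-object class previously failed here.
          sqlContext.createDataset(Seq(Point(1, 2), Point(3, 4))).show()
        }
      }
    }
    ```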
    
    ## How was this patch tested?
    A new test in ExpressionEncoderSuite that tests optional inner classes and a regression test for Dataset's handling of package objects.
    
    Author: Jakob Odersky <jakob@odersky.com>
    
    Closes #11708 from jodersky/SPARK-13118-package-objects.
    jodersky authored and rxin committed Mar 17, 2016
    7eef246
  6. [MINOR][SQL][BUILD] Remove duplicated lines

    ## What changes were proposed in this pull request?
    
    This PR removes three minor duplicated lines. The first one causes the following unreachable-code warning.
    ```
    JoinSuite.scala:52: unreachable code
    [warn]       case j: BroadcastHashJoin => j
    ```
    The other two are consecutive repetitions in a `Seq` of MiMa filters.
    
    ## How was this patch tested?
    
    Pass the existing Jenkins test.
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #11773 from dongjoon-hyun/remove_duplicated_line.
    dongjoon-hyun authored and rxin committed Mar 17, 2016
    c890c35
  7. [SPARK-12855][MINOR][SQL][DOC][TEST] remove spark.sql.dialect from do…

    …c and test
    
    ## What changes were proposed in this pull request?
    
    Since developer API of plug-able parser has been removed in #10801 , docs should be updated accordingly.
    
    ## How was this patch tested?
    
    This patch will not affect the real code path.
    
    Author: Daoyuan Wang <daoyuan.wang@intel.com>
    
    Closes #11758 from adrian-wang/spark12855.
    adrian-wang authored and rxin committed Mar 17, 2016
    d1c193a
  8. [SPARK-13926] Automatically use Kryo serializer when shuffling RDDs w…

    …ith simple types
    
    Because ClassTags are available when constructing ShuffledRDD we can use them to automatically use Kryo for shuffle serialization when the RDD's types are known to be compatible with Kryo.
    
    This patch introduces `SerializerManager`, a component which picks the "best" serializer for a shuffle given the elements' ClassTags. It will automatically pick a Kryo serializer for ShuffledRDDs whose key, value, and/or combiner types are primitives, arrays of primitives, or strings. In the future we can use this class as a narrow extension point to integrate specialized serializers for other types, such as ByteBuffers.
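    
    A rough sketch of the selection idea (illustrative only, not the actual `SerializerManager` code):
    
    ```scala
    import scala.reflect.ClassTag
    
    // Kryo is known to work well for primitives, arrays of primitives, and strings.
    val kryoFriendly: Set[Class[_]] = Set(
      classOf[Boolean], classOf[Byte], classOf[Char], classOf[Short],
      classOf[Int], classOf[Long], classOf[Float], classOf[Double], classOf[String])
    
    def canUseKryo(ct: ClassTag[_]): Boolean = {
      val cls = ct.runtimeClass
      kryoFriendly.contains(cls) ||
        (cls.isArray && kryoFriendly.contains(cls.getComponentType))
    }
    ```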
    
    In a planned followup patch, I will extend the BlockManager APIs so that we're able to use similar automatic serializer selection when caching RDDs (this is a little trickier because the ClassTags need to be threaded through many more places).
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #11755 from JoshRosen/automatically-pick-best-serializer.
    JoshRosen authored and rxin committed Mar 17, 2016
    de1a84e
  9. [SPARK-13403][SQL] Pass hadoopConfiguration to HiveConf constructors.

    This commit updates the HiveContext so that sc.hadoopConfiguration is used to instantiate its internal instances of HiveConf.
    
    I tested this by overriding the S3 FileSystem implementation from spark-defaults.conf as "spark.hadoop.fs.s3.impl" (to avoid [HADOOP-12810](https://issues.apache.org/jira/browse/HADOOP-12810)).
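    
    A hedged sketch of that kind of override (the FileSystem class below is illustrative, not a recommendation):
    
    ```scala
    import org.apache.spark.{SparkConf, SparkContext}
    
    // Any "spark.hadoop.*" setting is copied into sc.hadoopConfiguration (minus the prefix).
    val conf = new SparkConf()
      .setAppName("hive-conf-demo")
      .setMaster("local[2]")
      .set("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    val sc = new SparkContext(conf)
    
    // With this change, HiveContext builds its HiveConf from sc.hadoopConfiguration,
    // so the overridden FileSystem implementation is also visible on Hive code paths.
    ```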
    
    Author: Ryan Blue <blue@apache.org>
    
    Closes #11273 from rdblue/SPARK-13403-new-hive-conf-from-hadoop-conf.
    rdblue authored and rxin committed Mar 17, 2016
    5faba9f
  10. [SPARK-13948] MiMa check should catch if the visibility changes to pr…

    …ivate
    
    MiMa excludes are currently generated using both the current Spark version's classes and Spark 1.2.0's classes, but this doesn't make sense: we should only be ignoring classes which were `private` in the previous Spark version, not classes which became private in the current version.
    
    This patch updates `dev/mima` to only generate excludes with respect to the previous artifacts that MiMa checks against. It also updates `MimaBuild` so that `excludeClass` only applies directly to the class being excluded and not to its companion object (since a class and its companion object can have different accessibility).
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #11774 from JoshRosen/SPARK-13948.
    JoshRosen authored and rxin committed Mar 17, 2016
    82066a1
  11. Revert "[SPARK-13840][SQL] Split Optimizer Rule ColumnPruning to Colu…

    …mnPruning and EliminateOperator"
    
    This reverts commit 99bd2f0.
    davies committed Mar 17, 2016
    30c1884
  12. [MINOR][DOC] Add JavaStreamingTestExample

    ## What changes were proposed in this pull request?
    
    Add a Java example for StreamingTest.
    
    ## How was this patch tested?
    
    manual tests in CLI: bin/run-example mllib.JavaStreamingTestExample dataDir 5 100
    
    Author: Zheng RuiFeng <ruifengz@foxmail.com>
    
    Closes #11776 from zhengruifeng/streaming_je.
    zhengruifeng authored and MLnick committed Mar 17, 2016
    204c9de
  13. [SPARK-13629][ML] Add binary toggle Param to CountVectorizer

    ## What changes were proposed in this pull request?
    
    It would be handy to add a binary toggle Param to CountVectorizer, as in the scikit-learn one: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
    If set, then all non-zero counts will be set to 1.
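    
    A short sketch of how the new toggle might be used (assuming the Param is exposed through a `setBinary` setter, mirroring scikit-learn's `binary` option):
    
    ```scala
    import org.apache.spark.ml.feature.CountVectorizer
    
    val cv = new CountVectorizer()
      .setInputCol("words")
      .setOutputCol("features")
      .setBinary(true)  // all non-zero term counts are reported as 1.0
    ```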
    
    ## How was this patch tested?
    
    unit tests
    
    Author: Yuhao Yang <hhbyyh@gmail.com>
    
    Closes #11536 from hhbyyh/cvToggle.
    hhbyyh authored and MLnick committed Mar 17, 2016
    357d82d
  14. [SPARK-13901][CORE] correct the logDebug information when jump to the…

    … next locality level
    
    JIRA Issue: https://issues.apache.org/jira/browse/SPARK-13901
    In the getAllowedLocalityLevel method of TaskSetManager, we log the wrong debug information when jumping to the next locality level, so this patch fixes it.
    
    Author: trueyao <501663994@qq.com>
    
    Closes #11719 from trueyao/logDebug-localityWait.
    trueyao authored and srowen committed Mar 17, 2016
    ea9ca6f
  15. [SPARK-13928] Move org.apache.spark.Logging into org.apache.spark.int…

    …ernal.Logging
    
    ## What changes were proposed in this pull request?
    
    Logging was made private in Spark 2.0. If we move it, then users would be able to create a Logging trait themselves to avoid changing their own code.
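    
    A minimal sketch of that workaround: an application can declare its own `Logging` trait (names assumed, built on plain SLF4J) instead of depending on Spark's now-internal one.
    
    ```scala
    import org.slf4j.{Logger, LoggerFactory}
    
    trait Logging {
      @transient private lazy val log: Logger = LoggerFactory.getLogger(getClass)
      protected def logInfo(msg: => String): Unit = if (log.isInfoEnabled) log.info(msg)
      protected def logWarning(msg: => String): Unit = if (log.isWarnEnabled) log.warn(msg)
    }
    ```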
    
    ## How was this patch tested?
    
    existing tests.
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #11764 from cloud-fan/logger.
    cloud-fan committed Mar 17, 2016
    8ef3399
  16. [SPARK-12719][SQL] SQL generation support for Generate

    ## What changes were proposed in this pull request?
    
    This PR adds SQL generation support for the `Generate` operator. It always converts the `Generate` operator into `LATERAL VIEW` format, as there are many limitations on putting a UDTF in the project list.
    
    This PR is based on #11658, please see the last commit to review the real changes.
    
    Thanks dilipbiswal for his initial work! Takes over #11596
    
    ## How was this patch tested?
    
    new tests in `LogicalPlanToSQLSuite`
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #11696 from cloud-fan/generate.
    cloud-fan committed Mar 17, 2016
    1974d1d
  17. [SPARK-13776][WEBUI] Limit the max number of acceptors and selectors …

    …for Jetty
    
    ## What changes were proposed in this pull request?
    
    As each acceptor/selector in Jetty uses one thread, the number of threads should be at least the number of acceptors and selectors plus 1. Otherwise, the Jetty server's thread pool may be exhausted by acceptors/selectors and become unable to respond to any request.
    
    To avoid wasting threads, the PR limits the max number of acceptors and selectors and also updates the max thread number if necessary.
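    
    A toy sketch of the sizing constraint described above (not Spark's actual Jetty setup code):
    
    ```scala
    // Each acceptor and selector permanently occupies one pool thread, so the pool
    // must be strictly larger than their total or no thread is left to serve requests.
    def safePoolSize(acceptors: Int, selectors: Int, requestedThreads: Int): Int =
      math.max(requestedThreads, acceptors + selectors + 1)
    ```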
    
    ## How was this patch tested?
    
    Just make sure we don't break any existing tests
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #11615 from zsxwing/SPARK-13776.
    zsxwing authored and srowen committed Mar 17, 2016
    65b75e6
  18. [SPARK-13427][SQL] Support USING clause in JOIN.

    ## What changes were proposed in this pull request?
    
    Support queries that JOIN tables with a USING clause.
    SELECT * FROM table1 JOIN table2 USING <column_list>
    
    The USING clause can be used to simplify the join condition when:
    
    1) equi-join semantics are desired, and
    2) the join columns have the same name in both tables.
    
    We already have the support for Natural Join in Spark. This PR makes
    use of the already existing infrastructure for natural join to
    form the join condition and also the projection list.
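    
    A small example of the new syntax (a sketch, assuming tables `employees` and `departments` that both have a `dept_id` column are registered with `sqlContext`):
    
    ```scala
    val df = sqlContext.sql(
      "SELECT * FROM employees JOIN departments USING (dept_id)")
    // Behaves like an equi-join on dept_id, with dept_id appearing only once in the output.
    df.show()
    ```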
    
    ## How was this patch tested?
    
    Have added unit tests in SQLQuerySuite, CatalystQlSuite, ResolveNaturalJoinSuite
    
    Author: Dilip Biswal <dbiswal@us.ibm.com>
    
    Closes #11297 from dilipbiswal/spark-13427.
    dilipbiswal authored and marmbrus committed Mar 17, 2016
    637a78f
  19. [SPARK-13838] [SQL] Clear variable code to prevent it to be re-evalua…

    …ted in BoundAttribute
    
    JIRA: https://issues.apache.org/jira/browse/SPARK-13838
    ## What changes were proposed in this pull request?
    
    We should also clear the variable code in `BoundReference.genCode` to prevent it from being evaluated twice, as we did in `evaluateVariables`.
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
    
    Closes #11674 from viirya/avoid-reevaluate.
    viirya authored and davies committed Mar 17, 2016
    5f3bda6
  20. [SPARK-12719][HOTFIX] Fix compilation against Scala 2.10

    ## What changes were proposed in this pull request?
    
    Compilation against Scala 2.10 fails with:
    ```
    [error] [warn] /home/jenkins/workspace/spark-master-compile-sbt-scala-2.10/sql/hive/src/main/scala/org/apache/spark/sql/hive/SQLBuilder.scala:483: Cannot check match for         unreachability.
    [error] (The analysis required more space than allowed. Please try with scalac -Dscalac.patmat.analysisBudget=512 or -Dscalac.patmat.analysisBudget=off.)
    [error] [warn]     private def addSubqueryIfNeeded(plan: LogicalPlan): LogicalPlan = plan match {
    ```
    
    ## How was this patch tested?
    
    Compilation against Scala 2.10
    
    Author: tedyu <yuzhihong@gmail.com>
    
    Closes #11787 from yy2016/master.
    tedyu authored and yhuai committed Mar 17, 2016
    3ee7996
  21. [SPARK-13937][PYSPARK][ML] Change JavaWrapper _java_obj from static t…

    …o member variable
    
    ## What changes were proposed in this pull request?
    In PySpark's wrapper.py, change JavaWrapper's _java_obj from an unused static variable to a member variable, consistent with its usage in derived classes.
    
    ## How was this patch tested?
    Ran python tests for ML and MLlib.
    
    Author: Bryan Cutler <cutlerb@gmail.com>
    
    Closes #11767 from BryanCutler/JavaWrapper-static-_java_obj-SPARK-13937.
    BryanCutler authored and jkbradley committed Mar 17, 2016
    828213d
  22. [SPARK-11891] Model export/import for RFormula and RFormulaModel

    https://issues.apache.org/jira/browse/SPARK-11891
    
    Author: Xusen Yin <yinxusen@gmail.com>
    
    Closes #9884 from yinxusen/SPARK-11891.
    yinxusen authored and jkbradley committed Mar 17, 2016
    edf8b87
  23. 4c08e2c
  24. [SPARK-13761][ML] Remove remaining uses of validateParams

    ## What changes were proposed in this pull request?
    
    Cleanups from [#11620]: remove remaining uses of validateParams, and put functionality into transformSchema
    
    ## How was this patch tested?
    
    Existing unit tests, modified to check using transformSchema instead of validateParams
    
    Author: Joseph K. Bradley <joseph@databricks.com>
    
    Closes #11790 from jkbradley/SPARK-13761-cleanup.
    jkbradley committed Mar 17, 2016
    b39e80d
  25. [SPARK-10788][MLLIB][ML] Remove duplicate bins for decision trees

    Decision trees in spark.ml (RandomForest.scala) communicate twice as much data as needed for unordered categorical features. Here's an example.
    
    Say there are 3 categories A, B, C. We consider 3 splits:
    
    * A vs. B, C
    * A, B vs. C
    * A, C vs. B
    
    Currently, we collect statistics for each of the 6 subsets of categories (3 * 2 = 6). However, we could instead collect statistics for the 3 subsets on the left-hand side of the 3 possible splits: {A}, {A,B}, and {A,C}. If we also have stats for the entire node, then we can compute the stats for the 3 subsets on the right-hand side of the splits. In pseudomath: stats(B,C) = stats(A,B,C) - stats(A).
    
    This patch adds a parent stats array to the `DTStatsAggregator` so that the right child stats do not need to be stored. The right child stats are computed by subtracting left child stats from the parent stats for unordered categorical features.
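    
    A toy illustration of the subtraction trick (plain arrays standing in for the aggregator's per-label statistics; not the actual `DTStatsAggregator` code):
    
    ```scala
    // parentStats: aggregated label counts for the whole node; leftStats: counts for the
    // left-hand-side subset of one candidate split. The right-hand side follows by subtraction.
    def rightStats(parentStats: Array[Double], leftStats: Array[Double]): Array[Double] =
      parentStats.zip(leftStats).map { case (p, l) => p - l }
    
    rightStats(Array(10.0, 6.0), Array(4.0, 1.0))  // => Array(6.0, 5.0)
    ```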
    
    Author: sethah <seth.hendrickson16@gmail.com>
    
    Closes #9474 from sethah/SPARK-10788.
    sethah authored and jkbradley committed Mar 17, 2016
    1614485

Commits on Mar 18, 2016

  1. [SPARK-13974][SQL] sub-query names do not need to be globally unique …

    …while generate SQL
    
    ## What changes were proposed in this pull request?
    
    Sub-query names only need to be unique within a single SQL string generation, not globally. This PR moves the `newSubqueryName` method into `class SQLBuilder` and removes `object SQLBuilder`.
    
    also addressed 2 minor comments in #11696
    
    ## How was this patch tested?
    
    existing tests.
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #11783 from cloud-fan/tmp.
    cloud-fan committed Mar 18, 2016
    453455c
  2. [SPARK-13976][SQL] do not remove sub-queries added by user when gener…

    …ate SQL
    
    ## What changes were proposed in this pull request?
    
    We haven't figured out the correct logic for adding sub-queries yet, so we should not clear all sub-queries before generating SQL. This PR changes the logic to only remove sub-queries above a table relation.
    
    An example of this bug: the original SQL is `SELECT a FROM (SELECT a FROM tbl) t WHERE a = 1`.
    Before this PR, we generate:
    ```
    SELECT attr_1 AS a FROM
      SELECT attr_1 FROM (
        SELECT a AS attr_1 FROM tbl
      ) AS sub_q0
      WHERE attr_1 = 1
    ```
    A sub-query is missing, so this SQL string is invalid.
    
    After this PR, we will generate:
    ```
    SELECT attr_1 AS a FROM (
      SELECT attr_1 FROM (
        SELECT a AS attr_1 FROM tbl
      ) AS sub_q0
      WHERE attr_1 = 1
    ) AS t
    ```
    
    TODO: for long term, we should find a way to add sub-queries correctly, so that arbitrary logical plans can be converted to SQL string.
    
    ## How was this patch tested?
    
    `LogicalPlanToSQLSuite`
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #11786 from cloud-fan/bug-fix.
    cloud-fan committed Mar 18, 2016
    6037ed0
  3. [SPARK-13921] Store serialized blocks as multiple chunks in MemoryStore

    This patch modifies the BlockManager, MemoryStore, and several other storage components so that serialized cached blocks are stored as multiple small chunks rather than as a single contiguous ByteBuffer.
    
    This change will help to improve the efficiency of memory allocation and the accuracy of memory accounting when serializing blocks. Our current serialization code uses a ByteBufferOutputStream, which doubles and re-allocates its backing byte array; this increases the peak memory requirements during serialization (since we need to hold extra memory while expanding the array). In addition, we currently don't account for the extra wasted space at the end of the ByteBuffer's backing array, so a 129 megabyte serialized block may actually consume 256 megabytes of memory. After switching to storing blocks in multiple chunks, we'll be able to efficiently trim the backing buffers so that no space is wasted.
    
    This change is also a prerequisite to being able to cache blocks which are larger than 2GB (although full support for that depends on several other changes which have not been implemented yet).
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #11748 from JoshRosen/chunked-block-serialization.
    JoshRosen committed Mar 18, 2016
    6c2d894
  4. [SPARK-12719][HOTFIX] Fix compilation against Scala 2.10

    PR #11696 introduced a complex pattern match that broke the Scala 2.10 match unreachability check and caused a build failure. This PR fixes the issue by expanding that pattern match into several simpler ones.
    
    Note that tuning or turning off `-Dscalac.patmat.analysisBudget` doesn't work for this case.
    
    Compilation against Scala 2.10
    
    Author: tedyu <yuzhihong@gmail.com>
    
    Closes #11798 from yy2016/master.
    tedyu authored and liancheng committed Mar 18, 2016
    90a1d8d
  5. [SPARK-13826][SQL] Revises Dataset ScalaDoc

    ## What changes were proposed in this pull request?
    
    This PR revises Dataset API ScalaDoc.  All public methods are divided into the following groups
    
    * `groupname basic`: Basic Dataset functions
    * `groupname action`: Actions
    * `groupname untypedrel`: Untyped Language Integrated Relational Queries
    * `groupname typedrel`: Typed Language Integrated Relational Queries
    * `groupname func`: Functional Transformations
    * `groupname rdd`: RDD Operations
    * `groupname output`: Output Operations
    
    `since` tag and sample code are also updated.  We may want to add more sample code for typed APIs.
    
    ## How was this patch tested?
    
    Documentation change.  Checked by building unidoc locally.
    
    Author: Cheng Lian <lian@databricks.com>
    
    Closes #11769 from liancheng/spark-13826-ds-api-doc.
    liancheng authored and rxin committed Mar 18, 2016
    10ef4f3
  6. [SPARK-13930] [SQL] Apply fast serialization on collect limit operator

    ## What changes were proposed in this pull request?
    
    JIRA: https://issues.apache.org/jira/browse/SPARK-13930
    
    Recently the fast serialization has been introduced to collecting DataFrame/Dataset (#11664). The same technology can be used on collect limit operator too.
    
    ## How was this patch tested?
    
    Add a benchmark for collect limit to `BenchmarkWholeStageCodegen`.
    
    Without this patch:
    
        model name      : Westmere E56xx/L56xx/X56xx (Nehalem-C)
        collect limit:                      Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
        -------------------------------------------------------------------------------------------
        collect limit 1 million                  3413 / 3768          0.3        3255.0       1.0X
        collect limit 2 millions                9728 / 10440          0.1        9277.3       0.4X
    
    With this patch:
    
        model name      : Westmere E56xx/L56xx/X56xx (Nehalem-C)
        collect limit:                      Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
        -------------------------------------------------------------------------------------------
        collect limit 1 million                   833 / 1284          1.3         794.4       1.0X
        collect limit 2 millions                 3348 / 4005          0.3        3193.3       0.2X
    
    Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
    
    Closes #11759 from viirya/execute-take.
    viirya authored and davies committed Mar 18, 2016
    750ed64
  7. [SPARK-13826][SQL] Addendum: update documentation for Datasets

    ## What changes were proposed in this pull request?
    This patch updates documentation for Datasets. I also updated some internal documentation for exchange/broadcast.
    
    ## How was this patch tested?
    Just documentation/api stability update.
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes #11814 from rxin/dataset-docs.
    rxin committed Mar 18, 2016
    bb1fda0
  8. [MINOR][ML] When trainingSummary is None, it should throw RuntimeExce…

    …ption.
    
    ## What changes were proposed in this pull request?
    When trainingSummary is None, it should throw a `RuntimeException`.
    cc mengxr
    ## How was this patch tested?
    Existing tests.
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes #11784 from yanboliang/fix-summary.
    yanboliang authored and srowen committed Mar 18, 2016
    7783b6f
  9. [SPARK-14001][SQL] support multi-children Union in SQLBuilder

    ## What changes were proposed in this pull request?
    
    The fix is simple: use the existing `CombineUnions` rule to combine adjacent Unions before building the SQL string.
    
    ## How was this patch tested?
    
    The re-enabled test
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #11818 from cloud-fan/bug-fix.
    cloud-fan authored and liancheng committed Mar 18, 2016
    0f1015f
  10. [MINOR][DOC] Fix nits in JavaStreamingTestExample

    ## What changes were proposed in this pull request?
    
    Fix some nits discussed in #11776 (comment):
    * use `!rdd.isEmpty` instead of `rdd.count > 0`
    * use `static` instead of `AtomicInteger`
    * remove the unneeded `throws Exception`
    
    ## How was this patch tested?
    
    manual tests
    
    Author: Zheng RuiFeng <ruifengz@foxmail.com>
    
    Closes #11821 from zhengruifeng/je_fix.
    zhengruifeng authored and srowen committed Mar 18, 2016
    53f32a2
  11. [SPARK-13972][SQL] hive tests should fail if SQL generation failed

    ## What changes were proposed in this pull request?
    
    Now we should be able to convert all logical plans to SQL strings, as long as they are parsed from a Hive query. This PR changes the error handling to throw exceptions instead of just logging.
    
    We will send new PRs for spotted bugs, and merge this one after all bugs are fixed.
    
    ## How was this patch tested?
    
    existing tests.
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #11782 from cloud-fan/test.
    cloud-fan authored and liancheng committed Mar 18, 2016
    0acb32a
  12. [SPARK-14004][SQL][MINOR] AttributeReference and Alias should only us…

    …e the first qualifier to generate SQL strings
    
    ## What changes were proposed in this pull request?
    
    Current implementations of `AttributeReference.sql` and `Alias.sql` join all available qualifiers, which is logically wrong. However, this mistake doesn't cause any real SQL generation bugs, since there is always at most one qualifier for any given `AttributeReference` or `Alias`.
    
    This PR fixes the issue by picking only the first qualifier.
    
    ## How was this patch tested?
    
    Existing tests should be enough.
    
    Author: Cheng Lian <lian@databricks.com>
    
    Closes #11820 from liancheng/spark-14004-single-qualifier.
    liancheng committed Mar 18, 2016
    14c7236
  13. [SPARK-13977] [SQL] Brings back Shuffled hash join

    ## What changes were proposed in this pull request?
    
    ShuffledHashJoin (including its outer join variant) was removed in 1.6 in favor of SortMergeJoin, which is more robust and also fast.
    
    ShuffledHashJoin is still useful when: 1) one table is much smaller than the other, so the cost of building a hash table on the smaller table is lower than sorting the larger one, and 2) any partition of the small table can fit in memory.
    
    This PR brings back ShuffledHashJoin, essentially reverting #9645 and fixing the conflicts. It also merges outer join and left-semi join into the same class. This PR does not implement full outer join, because that cannot be implemented efficiently (it would require building hash tables on both sides).
    
    A simple benchmark (one table is 5x smaller than the other) shows that ShuffledHashJoin can be 2X faster than SortMergeJoin.
    
    ## How was this patch tested?
    
    Added new unit tests for ShuffledHashJoin.
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #11788 from davies/shuffle_join.
    Davies Liu authored and davies committed Mar 18, 2016
    9c23c81