pull latest from apache spark #6

Merged
merged 1,347 commits into from
Mar 18, 2016
This pull request is big! We’re only showing the most recent 250 commits.

Commits on Mar 3, 2016

  1. [SPARK-13423][HOTFIX] Static analysis fixes for 2.x / fixed for Scala 2.10
    
    ## What changes were proposed in this pull request?
    
    Fixes a compile problem caused by the inadvertent use of `Option.contains`, which exists only in Scala 2.11. The change should have been to replace `Option.exists(_ == x)` with `== Some(x)`; replacing `exists` with `contains` only makes sense for collections. Replacing the use of `Option.exists` still makes sense, though, as it's misleading.
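
    To make the distinction concrete, here is a minimal Scala sketch (illustrative values only) of the three forms involved:

    ```scala
    val opt: Option[Int] = Some(3)

    // Compiles on both Scala 2.10 and 2.11, but reads like a collection predicate:
    val viaExists = opt.exists(_ == 3)   // true

    // The intended replacement: states the comparison directly.
    val viaEquals = opt == Some(3)       // true

    // Only available since Scala 2.11, which is what broke the 2.10 build:
    // val viaContains = opt.contains(3)
    ```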
    
    ## How was this patch tested?
    
    Jenkins tests / compilation
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes #11493 from srowen/SPARK-13423.2.
    srowen committed Mar 3, 2016 (commit 645c3a8)
  2. [SPARK-13013][DOCS] Replace example code in mllib-clustering.md using include_example
    
    Replace example code in mllib-clustering.md using include_example
    https://issues.apache.org/jira/browse/SPARK-13013
    
    The example code in the user guide is embedded in the markdown and hence it is not easy to test. It would be nice to automatically test them. This JIRA is to discuss options to automate example code testing and see what we can do in Spark 1.6.
    
    Goal is to move actual example code to spark/examples and test compilation in Jenkins builds. Then in the markdown, we can reference part of the code to show in the user guide. This requires adding a Jekyll tag that is similar to https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., called include_example.
    `{% include_example scala/org/apache/spark/examples/mllib/KMeansExample.scala %}`
    Jekyll will find `examples/src/main/scala/org/apache/spark/examples/mllib/KMeansExample.scala`, pick the code blocks marked "example", and render them as the `{% highlight %}` code block in the markdown.
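
    As a rough sketch (assuming comment markers such as `// $example on$` / `// $example off$` delimit the snippet; the exact marker syntax is defined by the new Jekyll tag), a referenced example file might look like:

    ```scala
    // examples/src/main/scala/org/apache/spark/examples/mllib/KMeansExample.scala
    object KMeansExample {
      def main(args: Array[String]): Unit = {
        // $example on$
        // ... the KMeans code shown in the user guide goes here ...
        println("KMeans example body")
        // $example off$
      }
    }
    ```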
    
    See more sub-tasks in parent ticket: https://issues.apache.org/jira/browse/SPARK-11337
    
    Author: Xin Ren <iamshrek@126.com>
    
    Closes #11116 from keypointt/SPARK-13013.
    keypointt authored and mengxr committed Mar 3, 2016 (commit 70f6f96)
  3. [SPARK-13599][BUILD] remove transitive groovy dependencies from Hive

    ## What changes were proposed in this pull request?
    
    Modifies the dependency declarations of the all the hive artifacts, to explicitly exclude the groovy-all JAR.
    
    This stops the groovy classes *and everything else in that uber-JAR* from getting into spark-assembly JAR.
    
    ## How was this patch tested?
    
    1. Pre-patch build was made: `mvn clean install -Pyarn,hive,hive-thriftserver`
    1. spark-assembly expanded, observed to have the org.codehaus.groovy packages and JARs
    1. A maven dependency tree was created `mvn dependency:tree -Pyarn,hive,hive-thriftserver  -Dverbose > target/dependencies.txt`
    1. This text file examined to confirm that groovy was being imported as a dependency of `org.spark-project.hive`
    1. Patch applied
    1. Repeated step1: clean build of project with ` -Pyarn,hive,hive-thriftserver` set
    1. Examined created spark-assembly, verified no org.codehaus packages
    1. Verified that the maven dependency tree no longer references groovy
    
    Note also that the size of the assembly JAR was 181628646 bytes before this patch and 166318515 bytes after, about 15 MB smaller. That's a good indication that things are being excluded.
    
    Author: Steve Loughran <stevel@hortonworks.com>
    
    Closes #11449 from steveloughran/fixes/SPARK-13599-groovy-dependency.
    steveloughran authored and rxin committed Mar 3, 2016 (commit 9a48c65)
  4. [SPARK-12877][ML] Add train-validation-split to pyspark

    ## What changes were proposed in this pull request?
    The changes proposed were to add train-validation-split to pyspark.ml.tuning.
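
    For context, the new Python class mirrors the existing Scala API; a hedged Scala sketch of the equivalent usage (the estimator and grid values here are illustrative) looks like:

    ```scala
    import org.apache.spark.ml.evaluation.RegressionEvaluator
    import org.apache.spark.ml.regression.LinearRegression
    import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}

    val lr = new LinearRegression()
    val paramGrid = new ParamGridBuilder()
      .addGrid(lr.regParam, Array(0.01, 0.1))
      .build()

    // Hold out 20% of the data for validation instead of doing k-fold cross-validation.
    val tvs = new TrainValidationSplit()
      .setEstimator(lr)
      .setEvaluator(new RegressionEvaluator())
      .setEstimatorParamMaps(paramGrid)
      .setTrainRatio(0.8)
    ```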
    
    ## How was this patch tested?
    This patch was tested through unit tests located in pyspark/ml/test.py.
    
    This is my original work and I license it to Spark.
    
    Author: JeremyNixon <jnixon2@gmail.com>
    
    Closes #11335 from JeremyNixon/tvs_pyspark.
    JeremyNixon authored and mengxr committed Mar 3, 2016 (commit 511d492)
  5. [SPARK-13543][SQL] Support for specifying compression codec for Parquet/ORC via option()
    
    ## What changes were proposed in this pull request?
    
    This PR adds the support to specify compression codecs for both ORC and Parquet.
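
    A minimal usage sketch (assuming the option key is `compression` and codec names such as `snappy` and `zlib`; the paths are illustrative):

    ```scala
    import org.apache.spark.sql.SQLContext

    def writeCompressed(sqlContext: SQLContext): Unit = {
      val df = sqlContext.range(0, 100)
      // Parquet with Snappy compression.
      df.write.option("compression", "snappy").parquet("/tmp/out-parquet")
      // ORC with zlib compression.
      df.write.option("compression", "zlib").orc("/tmp/out-orc")
    }
    ```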
    
    ## How was this patch tested?
    
    unittests within IDE and code style tests with `dev/run_tests`.
    
    Author: hyukjinkwon <gurwls223@gmail.com>
    
    Closes #11464 from HyukjinKwon/SPARK-13543.
    HyukjinKwon authored and rxin committed Mar 3, 2016 (commit cf95d72)
  6. [MINOR][ML][DOC] Remove duplicated periods at the end of some sharedParam
    
    ## What changes were proposed in this pull request?
    Remove duplicated periods at the end of some sharedParams in ScalaDoc, such as [here](https://github.com/apache/spark/pull/11344/files#diff-9edc669edcf2c0c7cf1efe4a0a57da80L367)
    cc mengxr srowen
    ## How was this patch tested?
    Documents change, no test.
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes #11344 from yanboliang/shared-cleanup.
    yanboliang authored and jkbradley committed Mar 3, 2016 (commit ce58e99)
  7. [SPARK-13423][HOTFIX] Static analysis fixes for 2.x / fixed for Scala 2.10, again
    
    ## What changes were proposed in this pull request?
    
    Fixes (another) compile problem due to inadvertent use of Option.contains, only in Scala 2.11
    
    ## How was this patch tested?
    
    Jenkins tests
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes #11496 from srowen/SPARK-13423.3.
    srowen committed Mar 3, 2016 (commit 52035d1)
  8. [MINOR] Fix typos in comments and testcase name of code

    ## What changes were proposed in this pull request?
    
    This PR fixes typos in comments and testcase name of code.
    
    ## How was this patch tested?
    
    manual.
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #11481 from dongjoon-hyun/minor_fix_typos_in_code.
    dongjoon-hyun authored and srowen committed Mar 3, 2016 (commit 941b270)
  9. [SPARK-13632][SQL] Move commands.scala to command package

    ## What changes were proposed in this pull request?
    
    This patch simply moves things to a new package in an effort to reduce the size of the diff in #11048. Currently the new package only has one file, but in the future we'll add many new commands in SPARK-13139.
    
    ## How was this patch tested?
    
    Jenkins.
    
    Author: Andrew Or <andrew@databricks.com>
    
    Closes #11482 from andrewor14/commands-package.
    Andrew Or authored and rxin committed Mar 3, 2016 (commit 3edcc40)
  10. [SPARK-13584][SQL][TESTS] Make ContinuousQueryManagerSuite not output logs to the console
    
    ## What changes were proposed in this pull request?
    
    Make ContinuousQueryManagerSuite not output logs to the console. The logs will still output to `unit-tests.log`.
    
    I also updated `SQLListenerMemoryLeakSuite` to use `quietly` to avoid changing the log level which won't output logs to `unit-tests.log`.
    
    ## How was this patch tested?
    
    Just check Jenkins output.
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #11439 from zsxwing/quietly-ContinuousQueryManagerSuite.
    zsxwing committed Mar 3, 2016 (commit ad0de99)

Commits on Mar 4, 2016

  1. [SPARK-13415][SQL] Visualize subquery in SQL web UI

    ## What changes were proposed in this pull request?
    
    This PR supports visualization of subqueries in the SQL web UI and also improves the explain output for subqueries, especially when they are used together with whole-stage codegen.
    
    For example:
    ```python
    >>> sqlContext.range(100).registerTempTable("range")
    >>> sqlContext.sql("select id / (select sum(id) from range) from range where id > (select id from range limit 1)").explain(True)
    == Parsed Logical Plan ==
    'Project [unresolvedalias(('id / subquery#9), None)]
    :  +- 'SubqueryAlias subquery#9
    :     +- 'Project [unresolvedalias('sum('id), None)]
    :        +- 'UnresolvedRelation `range`, None
    +- 'Filter ('id > subquery#8)
       :  +- 'SubqueryAlias subquery#8
       :     +- 'GlobalLimit 1
       :        +- 'LocalLimit 1
       :           +- 'Project [unresolvedalias('id, None)]
       :              +- 'UnresolvedRelation `range`, None
       +- 'UnresolvedRelation `range`, None
    
    == Analyzed Logical Plan ==
    (id / scalarsubquery()): double
    Project [(cast(id#0L as double) / cast(subquery#9 as double)) AS (id / scalarsubquery())#11]
    :  +- SubqueryAlias subquery#9
    :     +- Aggregate [(sum(id#0L),mode=Complete,isDistinct=false) AS sum(id)#10L]
    :        +- SubqueryAlias range
    :           +- Range 0, 100, 1, 4, [id#0L]
    +- Filter (id#0L > subquery#8)
       :  +- SubqueryAlias subquery#8
       :     +- GlobalLimit 1
       :        +- LocalLimit 1
       :           +- Project [id#0L]
       :              +- SubqueryAlias range
       :                 +- Range 0, 100, 1, 4, [id#0L]
       +- SubqueryAlias range
          +- Range 0, 100, 1, 4, [id#0L]
    
    == Optimized Logical Plan ==
    Project [(cast(id#0L as double) / cast(subquery#9 as double)) AS (id / scalarsubquery())#11]
    :  +- SubqueryAlias subquery#9
    :     +- Aggregate [(sum(id#0L),mode=Complete,isDistinct=false) AS sum(id)#10L]
    :        +- Range 0, 100, 1, 4, [id#0L]
    +- Filter (id#0L > subquery#8)
       :  +- SubqueryAlias subquery#8
       :     +- GlobalLimit 1
       :        +- LocalLimit 1
       :           +- Project [id#0L]
       :              +- Range 0, 100, 1, 4, [id#0L]
       +- Range 0, 100, 1, 4, [id#0L]
    
    == Physical Plan ==
    WholeStageCodegen
    :  +- Project [(cast(id#0L as double) / cast(subquery#9 as double)) AS (id / scalarsubquery())#11]
    :     :  +- Subquery subquery#9
    :     :     +- WholeStageCodegen
    :     :        :  +- TungstenAggregate(key=[], functions=[(sum(id#0L),mode=Final,isDistinct=false)], output=[sum(id)#10L])
    :     :        :     +- INPUT
    :     :        +- Exchange SinglePartition, None
    :     :           +- WholeStageCodegen
    :     :              :  +- TungstenAggregate(key=[], functions=[(sum(id#0L),mode=Partial,isDistinct=false)], output=[sum#14L])
    :     :              :     +- Range 0, 1, 4, 100, [id#0L]
    :     +- Filter (id#0L > subquery#8)
    :        :  +- Subquery subquery#8
    :        :     +- CollectLimit 1
    :        :        +- WholeStageCodegen
    :        :           :  +- Project [id#0L]
    :        :           :     +- Range 0, 1, 4, 100, [id#0L]
    :        +- Range 0, 1, 4, 100, [id#0L]
    ```
    
    The web UI looks like:
    
    ![subquery](https://cloud.githubusercontent.com/assets/40902/13377963/932bcbae-dda7-11e5-82f7-03c9be85d77c.png)
    
    This PR also changes the tree structure of WholeStageCodegen to make it consistent with others. Before this change, both WholeStageCodegen and InputAdapter held references to the same plans, which could be updated without notifying the other, causing problems; this was discovered by #11403.
    
    ## How was this patch tested?
    
    Existing tests, also manual tests with the example query, check the explain and web UI.
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #11417 from davies/viz_subquery.
    Davies Liu authored and yhuai committed Mar 4, 2016 (commit b373a88)
  2. [SPARK-13601] [TESTS] use 1 partition in tests to avoid race conditions

    ## What changes were proposed in this pull request?
    
    Fix race conditions when cleaning up files.
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #11507 from davies/flaky.
    Davies Liu authored and davies committed Mar 4, 2016 (commit d062587)
  3. [SPARK-13647] [SQL] also check if numeric value is within allowed range in _verify_type
    
    ## What changes were proposed in this pull request?
    
    This PR makes `_verify_type` in `types.py` more strict: it also checks whether a numeric value is within the allowed range.
    
    ## How was this patch tested?
    
    newly added doc test.
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #11492 from cloud-fan/py-verify.
    cloud-fan authored and davies committed Mar 4, 2016 (commit 15d57f9)
  4. [SPARK-12941][SQL][MASTER] Spark-SQL JDBC Oracle dialect fails to map string datatypes to Oracle VARCHAR datatype mapping
    
    ## What changes were proposed in this pull request?
    A test suite was added for the bug fix (SPARK-12941), covering the mapping of StringType to the corresponding Oracle VARCHAR datatype.
    
    ## How was this patch tested?
    manual tests done
    
    Author: thomastechs <thomas.sebastian@tcs.com>
    Author: THOMAS SEBASTIAN <thomas.sebastian@tcs.com>
    
    Closes #11489 from thomastechs/thomastechs-12941-master-new.
    thomastechs authored and yhuai committed Mar 4, 2016 (commit f6ac7c3)
  5. [SPARK-13652][CORE] Copy ByteBuffer in sendRpcSync as it will be recycled
    
    ## What changes were proposed in this pull request?
    
    `sendRpcSync` should copy the response content because the underlying buffer will be recycled and reused.
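
    The idea of the fix, as a hedged standalone sketch (not the actual transport-layer code):

    ```scala
    import java.nio.ByteBuffer

    // Copy the response bytes before the underlying buffer is recycled and reused.
    def copyResponse(response: ByteBuffer): ByteBuffer = {
      val copy = ByteBuffer.allocate(response.remaining())
      copy.put(response)
      copy.flip()
      copy
    }
    ```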
    
    ## How was this patch tested?
    
    Jenkins unit tests.
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #11499 from zsxwing/SPARK-13652.
    zsxwing committed Mar 4, 2016 (commit 465c665)
  6. [SPARK-13603][SQL] support SQL generation for subquery

    ## What changes were proposed in this pull request?
    
    This adds support for SQL generation for subquery expressions, which are recursively replaced with a SubqueryHolder inside SQLBuilder.
    
    ## How was this patch tested?
    
    Added unit tests.
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #11453 from davies/sql_subquery.
    Davies Liu authored and liancheng committed Mar 4, 2016 (commit dd83c20)
  7. [SPARK-13646][MLLIB] QuantileDiscretizer counts dataset twice in get…

    ## What changes were proposed in this pull request?
    
    It avoids counting the dataframe twice.
    
    Author: Abou Haydar Elias <abouhaydar.elias@gmail.com>
    Author: Elie A <abouhaydar.elias@gmail.com>
    
    Closes #11491 from eliasah/quantile-discretizer-patch.
    eliasah authored and srowen committed Mar 4, 2016 (commit 27e88fa)
  8. [SPARK-13398][STREAMING] Move away from thread pool task support to forkjoin
    
    ## What changes were proposed in this pull request?
    
    Remove the old deprecated ThreadPoolExecutor and replace it with an ExecutionContext using a ForkJoinPool. The downside of this is that Scala's ForkJoinPool doesn't give us a way to specify the thread pool name (and is a wrapper of Java's in 2.12) except by providing a custom factory. Note that we can't use Java's ForkJoinPool directly in Scala 2.11 since it uses an ExecutionContext which reports system parallelism. One other implicit change is that the old ExecutionContext would have reported a different default parallelism since it used system parallelism rather than threadpool parallelism (this was likely not intended but also likely not a huge difference).
    
    The previous version of this PR attempted to use an execution context constructed on the ThreadPool (but not the deprecated ThreadPoolExecutor class) so as to keep the ability to have human readable named threads but this reported system parallelism.
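
    A hedged sketch of why a custom factory is needed to get named threads (illustrative only, not the helper the patch adds):

    ```scala
    import java.util.concurrent.{ForkJoinPool, ForkJoinWorkerThread}
    import scala.concurrent.ExecutionContext

    def namedForkJoinContext(prefix: String, parallelism: Int): ExecutionContext = {
      val factory = new ForkJoinPool.ForkJoinWorkerThreadFactory {
        override def newThread(pool: ForkJoinPool): ForkJoinWorkerThread = {
          val thread = ForkJoinPool.defaultForkJoinWorkerThreadFactory.newThread(pool)
          thread.setName(prefix + "-" + thread.getPoolIndex)
          thread
        }
      }
      // Parallelism comes from the pool itself, not from the system default.
      ExecutionContext.fromExecutorService(
        new ForkJoinPool(parallelism, factory, null, false))
    }
    ```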
    
    ## How was this patch tested?
    
    unit tests: streaming/testOnly org.apache.spark.streaming.util.*
    
    Author: Holden Karau <holden@us.ibm.com>
    
    Closes #11423 from holdenk/SPARK-13398-move-away-from-ThreadPoolTaskSupport-java-forkjoin.
    holdenk authored and srowen committed Mar 4, 2016 (commit c04dc27)
  9. [SPARK-12925] Improve HiveInspectors.unwrap for StringObjectInspector.…

    The earlier fix did not copy the bytes, and it was possible for a higher level to reuse the Text object, which was causing issues. The proposed fix now copies the bytes from Text while still avoiding the expensive encoding/decoding.
    
    Author: Rajesh Balamohan <rbalamohan@apache.org>
    
    Closes #11477 from rajeshbalamohan/SPARK-12925.2.
    rbalamohan authored and srowen committed Mar 4, 2016 (commit 204b02b)
  10. [SPARK-13673][WINDOWS] Fixed not to pollute environment variables.

    ## What changes were proposed in this pull request?
    
    This patch fixes the problem that `bin\beeline.cmd` pollutes environment variables.
    The similar problem is reported and fixed in https://issues.apache.org/jira/browse/SPARK-3943, but `bin\beeline.cmd` seems to be added later.
    
    ## How was this patch tested?
    
    manual tests:
      I executed the new `bin\beeline.cmd` and confirmed that %SPARK_HOME% doesn't remain in the command prompt.
    
    Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>
    
    Closes #11516 from tsudukim/feature/SPARK-13673.
    tsudukim authored and srowen committed Mar 4, 2016 (commit e617508)
  11. [SPARK-13676] Fix mismatched default values for regParam in LogisticRegression
    
    ## What changes were proposed in this pull request?
    
    The default value of regularization parameter for `LogisticRegression` algorithm is different in Scala and Python. We should provide the same value.
    
    **Scala**
    ```
    scala> new org.apache.spark.ml.classification.LogisticRegression().getRegParam
    res0: Double = 0.0
    ```
    
    **Python**
    ```
    >>> from pyspark.ml.classification import LogisticRegression
    >>> LogisticRegression().getRegParam()
    0.1
    ```
    
    ## How was this patch tested?
    manual. Check the following in `pyspark`.
    ```
    >>> from pyspark.ml.classification import LogisticRegression
    >>> LogisticRegression().getRegParam()
    0.0
    ```
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #11519 from dongjoon-hyun/SPARK-13676.
    dongjoon-hyun authored and mengxr committed Mar 4, 2016 (commit c8f2545)
  12. [SPARK-13036][SPARK-13318][SPARK-13319] Add save/load for feature.py

    Add save/load for feature.py. Meanwhile, add save/load for `ElementwiseProduct` in Scala side and fix a bug of missing `setDefault` in `VectorSlicer` and `StopWordsRemover`.
    
    In this PR I ignore `RFormula` and `RFormulaModel` because their Scala implementation is pending in #9884. I'll add them in this PR if #9884 gets merged first, or add a follow-up JIRA for `RFormula`.
    
    Author: Xusen Yin <yinxusen@gmail.com>
    
    Closes #11203 from yinxusen/SPARK-13036.
    yinxusen authored and mengxr committed Mar 4, 2016 (commit 83302c3)
  13. [SPARK-13633][SQL] Move things into catalyst.parser package

    ## What changes were proposed in this pull request?
    
    This patch simply moves things to existing package `o.a.s.sql.catalyst.parser` in an effort to reduce the size of the diff in #11048. This is conceptually the same as a recently merged patch #11482.
    
    ## How was this patch tested?
    
    Jenkins.
    
    Author: Andrew Or <andrew@databricks.com>
    
    Closes #11506 from andrewor14/parser-package.
    Andrew Or committed Mar 4, 2016 (commit b7d4147)
  14. [SPARK-13459][WEB UI] Separate Alive and Dead Executors in Executor Totals Table
    
    ## What changes were proposed in this pull request?
    
    Now that dead executors are shown in the executors table (#10058) the totals table is updated to include the separate totals for alive and dead executors as well as the current total, as originally discussed in #10668
    
    ## How was this patch tested?
    
    Manually verified by running the Standalone Web UI in the latest Safari and Firefox ESR
    
    Author: Alex Bozarth <ajbozart@us.ibm.com>
    
    Closes #11381 from ajbozarth/spark13459.
    ajbozarth authored and Tom Graves committed Mar 4, 2016 (commit 5f42c28)
  15. [SPARK-13255] [SQL] Update vectorized reader to directly return ColumnarBatch instead of InternalRows.
    
    ## What changes were proposed in this pull request?
    
    Currently, the parquet reader returns rows one by one which is bad for performance. This patch
    updates the reader to directly return ColumnarBatches. This is only enabled with whole stage
    codegen, which is the only operator currently that is able to consume ColumnarBatches (instead
    of rows). The current implementation is a bit of a hack to get this to work and we should do
    more refactoring of these low level interfaces to make this work better.
    
    ## How was this patch tested?
    
    ```
    Results:
    TPCDS:                             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)
    ---------------------------------------------------------------------------------
    q55 (before)                             8897 / 9265         12.9          77.2
    q55                                      5486 / 5753         21.0          47.6
    ```
    
    Author: Nong Li <nong@databricks.com>
    
    Closes #11435 from nongli/spark-13255.
    nongli authored and davies committed Mar 4, 2016 (commit a6e2bd3)

Commits on Mar 5, 2016

  1. [SPARK-12073][STREAMING] backpressure rate controller consumes events preferentially from lagging partitions
    
    I'm pretty sure this is the reason we couldn't easily recover from an unbalanced Kafka partition under heavy load when using backpressure.
    
    `maxMessagesPerPartition` calculates an appropriate limit for the message rate from all partitions, and then divides by the number of partitions to determine how many messages to retrieve per partition. The problem with this approach is that when one partition is behind by millions of records (due to random Kafka issues) and the rate estimator calculates that only 100k total messages can be retrieved, each partition (out of, say, 32) retrieves at most 100k/32 = 3125 messages.
    
    This PR (still needing a test) determines a per-partition desired message count by using the current lag for each partition to preferentially weight the total message limit among the partitions. In this situation, if each partition gets 1k messages, but 1 partition starts 1M behind, then the total number of messages to retrieve is (32 * 1k + 1M) = 1032000 messages, of which the one partition needs 1001000. So, it gets (1001000 / 1032000) = 97% of the 100k messages, and the other 31 partitions share the remaining 3%.
    
    Assuming all 100k of the messages are retrieved and processed within the batch window, the rate calculator will increase the number of messages to retrieve in the next batch until it reaches a new stable point or the backlog is finished being processed.
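
    A small Scala sketch of the lag-weighted split described above (illustrative only, not the actual DirectKafkaInputDStream code):

    ```scala
    // Split a total message budget across partitions in proportion to each
    // partition's current lag.
    def weightedLimits(lags: Map[Int, Long], totalBudget: Long): Map[Int, Long] = {
      val totalLag = lags.values.sum.toDouble
      lags.map { case (partition, lag) =>
        partition -> math.round(totalBudget * (lag / totalLag))
      }
    }

    // 31 partitions lagging by 1k each and one lagging by 1M, with a 100k budget:
    // the lagging partition gets roughly 97% of the budget, as in the example above.
    val lags = (0 until 31).map(_ -> 1000L).toMap + (31 -> 1000000L)
    val limits = weightedLimits(lags, 100000L)
    ```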
    
    We're going to try deploying this internally at Shopify to see if this resolves our issue.
    
    tdas koeninger holdenk
    
    Author: Jason White <jason.white@shopify.com>
    
    Closes #10089 from JasonMWhite/rate_controller_offsets.
    JasonMWhite authored and zsxwing committed Mar 5, 2016 (commit f19228e)
  2. [SPARK-12720][SQL] SQL Generation Support for Cube, Rollup, and Grouping Sets
    
    #### What changes were proposed in this pull request?
    
    This PR is for supporting SQL generation for cube, rollup and grouping sets.
    
    For example, a query using rollup:
    ```SQL
    SELECT count(*) as cnt, key % 5, grouping_id() FROM t1 GROUP BY key % 5 WITH ROLLUP
    ```
    Original logical plan:
    ```
      Aggregate [(key#17L % cast(5 as bigint))#47L,grouping__id#46],
                [(count(1),mode=Complete,isDistinct=false) AS cnt#43L,
                 (key#17L % cast(5 as bigint))#47L AS _c1#45L,
                 grouping__id#46 AS _c2#44]
      +- Expand [List(key#17L, value#18, (key#17L % cast(5 as bigint))#47L, 0),
                 List(key#17L, value#18, null, 1)],
                [key#17L,value#18,(key#17L % cast(5 as bigint))#47L,grouping__id#46]
         +- Project [key#17L,
                     value#18,
                     (key#17L % cast(5 as bigint)) AS (key#17L % cast(5 as bigint))#47L]
            +- Subquery t1
               +- Relation[key#17L,value#18] ParquetRelation
    ```
    Converted SQL:
    ```SQL
      SELECT count( 1) AS `cnt`,
             (`t1`.`key` % CAST(5 AS BIGINT)),
             grouping_id() AS `_c2`
      FROM `default`.`t1`
      GROUP BY (`t1`.`key` % CAST(5 AS BIGINT))
      GROUPING SETS (((`t1`.`key` % CAST(5 AS BIGINT))), ())
    ```
    
    #### How was this patch tested?
    
    Added eight test cases in `LogicalPlanToSQLSuite`.
    
    Author: gatorsmile <gatorsmile@gmail.com>
    Author: xiaoli <lixiao1983@gmail.com>
    Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
    
    Closes #11283 from gatorsmile/groupingSetsToSQL.
    gatorsmile authored and liancheng committed Mar 5, 2016 (commit adce5ee)
  3. [SPARK-13693][STREAMING][TESTS] Stop StreamingContext before deleting checkpoint dir
    
    ## What changes were proposed in this pull request?
    
    Stop the StreamingContext before deleting the checkpoint dir to avoid the race condition where deleting the checkpoint dir and writing a checkpoint happen at the same time.
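
    The ordering the fix enforces, as a hedged sketch (the helper name is illustrative):

    ```scala
    import java.io.File
    import org.apache.commons.io.FileUtils
    import org.apache.spark.streaming.StreamingContext

    def tearDown(ssc: StreamingContext, checkpointDir: String): Unit = {
      // Stop the context first so no checkpoint write can race with the delete.
      ssc.stop(stopSparkContext = false)
      FileUtils.deleteDirectory(new File(checkpointDir))
    }
    ```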
    
    The flaky test log is here: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/256/testReport/junit/org.apache.spark.streaming/MapWithStateSuite/_It_is_not_a_test_/
    
    ## How was this patch tested?
    
    unit tests
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #11531 from zsxwing/SPARK-13693.
    zsxwing committed Mar 5, 2016 (commit 8290004)

Commits on Mar 6, 2016

  1. Revert "[SPARK-13616][SQL] Let SQLBuilder convert logical plan without a project on top of it"
    
    This reverts commit f87ce05.
    
    According to discussion in #11466, let's revert PR #11466 for safe.
    
    Author: Cheng Lian <lian@databricks.com>
    
    Closes #11539 from liancheng/revert-pr-11466.
    liancheng committed Mar 6, 2016 (commit 8ff8809)
  2. [SPARK-13697] [PYSPARK] Fix the missing module name of TransformFunctionSerializer.loads
    
    ## What changes were proposed in this pull request?
    
    Set the function's module name to `__main__` if it's missing in `TransformFunctionSerializer.loads`.
    
    ## How was this patch tested?
    
    Manually test in the shell.
    
    Before this patch:
    ```
    >>> from pyspark.streaming import StreamingContext
    >>> from pyspark.streaming.util import TransformFunction
    >>> ssc = StreamingContext(sc, 1)
    >>> func = TransformFunction(sc, lambda x: x, sc.serializer)
    >>> func.rdd_wrapper(lambda x: x)
    TransformFunction(<function <lambda> at 0x106ac8b18>)
    >>> bytes = bytearray(ssc._transformerSerializer.serializer.dumps((func.func, func.rdd_wrap_func, func.deserializers)))
    >>> func2 = ssc._transformerSerializer.loads(bytes)
    >>> print(func2.func.__module__)
    None
    >>> print(func2.rdd_wrap_func.__module__)
    None
    >>>
    ```
    After this patch:
    ```
    >>> from pyspark.streaming import StreamingContext
    >>> from pyspark.streaming.util import TransformFunction
    >>> ssc = StreamingContext(sc, 1)
    >>> func = TransformFunction(sc, lambda x: x, sc.serializer)
    >>> func.rdd_wrapper(lambda x: x)
    TransformFunction(<function <lambda> at 0x108bf1b90>)
    >>> bytes = bytearray(ssc._transformerSerializer.serializer.dumps((func.func, func.rdd_wrap_func, func.deserializers)))
    >>> func2 = ssc._transformerSerializer.loads(bytes)
    >>> print(func2.func.__module__)
    __main__
    >>> print(func2.rdd_wrap_func.__module__)
    __main__
    >>>
    ```
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #11535 from zsxwing/loads-module.
    zsxwing authored and davies committed Mar 6, 2016 (commit ee913e6)

Commits on Mar 7, 2016

  1. [SPARK-13685][SQL] Rename catalog.Catalog to ExternalCatalog

    ## What changes were proposed in this pull request?
    
    Today we have `analysis.Catalog` and `catalog.Catalog`. In the future the former will call the latter. When that happens, if both of them are still called `Catalog` it will be very confusing. This patch renames the latter `ExternalCatalog` because it is expected to talk to external systems.
    
    ## How was this patch tested?
    
    Jenkins.
    
    Author: Andrew Or <andrew@databricks.com>
    
    Closes #11526 from andrewor14/rename-catalog.
    Andrew Or authored and rxin committed Mar 7, 2016 (commit bc7a3ec)
  2. [SPARK-13705][DOCS] UpdateStateByKey Operation documentation incorrectly refers to StatefulNetworkWordCount
    
    ## What changes were proposed in this pull request?
    The reference to StatefulNetworkWordCount.scala should be removed from the updateStateByKey documentation until there is an example for updateStateByKey.
    
    ## How was this patch tested?
    Have tested the new documentation with jekyll build.
    
    Author: rmishra <rmishra@pivotal.io>
    
    Closes #11545 from rishitesh/SPARK-13705.
    rmishra authored and srowen committed Mar 7, 2016 (commit 4b13896)
  3. Fixing the type of the sentiment happiness value

    ## What changes were proposed in this pull request?
    
    Added the conversion to int for the 'happiness value' read from the file. Otherwise, later on line 75 the multiplication will multiply a string by a number, yielding values like "-2-2" instead of -4.
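
    A tiny Scala illustration of why the cast matters (multiplying a string by an Int repeats the string instead of doing arithmetic):

    ```scala
    val happiness = "-2"              // value as read from the file
    val wrong = happiness * 2         // "-2-2": string repetition, not arithmetic
    val right = happiness.toInt * 2   // -4
    ```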
    
    ## How was this patch tested?
    
    Tested manually.
    
    Author: Yury Liavitski <seconds.before@gmail.com>
    Author: Yury Liavitski <yury.liavitski@il111.ice.local>
    
    Closes #11540 from heliocentrist/fix-sentiment-value-type.
    heliocentrist authored and srowen committed Mar 7, 2016 (commit 03f57a6)
  4. [SPARK-13651] Generator outputs are not resolved correctly resulting in run time error
    
    ## What changes were proposed in this pull request?
    
    ```
    Seq(("id1", "value1")).toDF("key", "value").registerTempTable("src")
    sqlContext.sql("SELECT t1.* FROM src LATERAL VIEW explode(map('key1', 100, 'key2', 200)) t1 AS key, value")
    ```
    Results in following logical plan
    
    ```
    Project [key#2,value#3]
    +- Generate explode(HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFMap(key1,100,key2,200)), true, false, Some(genoutput), [key#2,value#3]
       +- SubqueryAlias src
          +- Project [_1#0 AS key#2,_2#1 AS value#3]
             +- LocalRelation [_1#0,_2#1], [[id1,value1]]
    ```
    
    The above query fails with following runtime error.
    ```
    java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.spark.unsafe.types.UTF8String
    	at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
    	at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:221)
    	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(generated.java:42)
    	at org.apache.spark.sql.execution.Generate$$anonfun$doExecute$1$$anonfun$apply$9.apply(Generate.scala:98)
    	at org.apache.spark.sql.execution.Generate$$anonfun$doExecute$1$$anonfun$apply$9.apply(Generate.scala:96)
    	at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
    	at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
    	at scala.collection.Iterator$class.foreach(Iterator.scala:742)
    	at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
            <stack-trace omitted.....>
    ```
    In this case the generated outputs are wrongly resolved from its child (LocalRelation) due to
    https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L537-L548
    ## How was this patch tested?
    
    Added unit tests in hive/SQLQuerySuite and AnalysisSuite
    
    Author: Dilip Biswal <dbiswal@us.ibm.com>
    
    Closes #11497 from dilipbiswal/spark-13651.
    dilipbiswal authored and davies committed Mar 7, 2016 (commit d7eac9d)
  5. [SPARK-13694][SQL] QueryPlan.expressions should always include all expressions
    
    ## What changes were proposed in this pull request?
    
    It's weird that `QueryPlan.expressions` doesn't always contain all the expressions in the plan. This PR marks `QueryPlan.expressions` final to forbid subclasses from overriding it to exclude some expressions. Currently only `Generate` overrides it; we can use `producedAttributes` to fix the unresolved attribute problem for it.
    
    Note that this PR doesn't fix the problem in #11497
    
    ## How was this patch tested?
    
    existing tests.
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #11532 from cloud-fan/generate.
    cloud-fan authored and marmbrus committed Mar 7, 2016 (commit 4896411)
  6. [SPARK-13495][SQL] Add Null Filters in the query plan for Filters/Joins based on their data constraints
    
    ## What changes were proposed in this pull request?
    
    This PR adds an optimizer rule to eliminate reading (unnecessary) NULL values, when they are not required for correctness, by inserting `isNotNull` filters into the query plan. These filters are currently inserted beneath existing `Filter` and `Join` operators and are inferred based on their data constraints.
    
    Note: While this optimization is applicable to all types of join, it primarily benefits `Inner` and `LeftSemi` joins.
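
    A hedged DataFrame-level illustration of the inference (not the optimizer rule itself): with the rule in place, the first query effectively behaves like the second, which can skip null values early.

    ```scala
    import org.apache.spark.sql.SQLContext

    def example(sqlContext: SQLContext): Unit = {
      import sqlContext.implicits._
      val df = sqlContext.range(0, 10).toDF("a")

      val userQuery = df.filter($"a" > 5)
      val withInferredNullFilter = df.filter($"a".isNotNull && $"a" > 5)

      userQuery.explain(true)
      withInferredNullFilter.explain(true)
    }
    ```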
    
    ## How was this patch tested?
    
    1. Added a new `NullFilteringSuite` that tests for `IsNotNull` filters in the query plan for joins and filters. Also, tests interaction with the `CombineFilters` optimizer rules.
    2. Test generated ExpressionTrees via `OrcFilterSuite`
    3. Test filter source pushdown logic via `SimpleTextHadoopFsRelationSuite`
    
    cc yhuai nongli
    
    Author: Sameer Agarwal <sameer@databricks.com>
    
    Closes #11372 from sameeragarwal/gen-isnotnull.
    sameeragarwal authored and yhuai committed Mar 7, 2016 (commit ef77003)
  7. [SPARK-12243][BUILD][PYTHON] PySpark tests are slow in Jenkins.

    ## What changes were proposed in this pull request?
    
    In the Jenkins pull request builder, PySpark tests take around [962 seconds](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52530/console) of end-to-end time to run, despite the fact that we run four Python test suites in parallel. According to the log, the basic reason is that the long-running tests start at the end due to the FIFO queue. We first try to reduce the test time by starting some long-running tests first using a simple priority queue.
    
    ```
    ========================================================================
    Running PySpark tests
    ========================================================================
    ...
    Finished test(python3.4): pyspark.streaming.tests (213s)
    Finished test(pypy): pyspark.sql.tests (92s)
    Finished test(pypy): pyspark.streaming.tests (280s)
    Tests passed in 962 seconds
    ```
    
    ## How was this patch tested?
    
    Manual check.
    Check 'Running PySpark tests' part of the Jenkins log.
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #11551 from dongjoon-hyun/SPARK-12243.
    dongjoon-hyun authored and JoshRosen committed Mar 7, 2016 (commit e72914f)
  8. [MINOR][DOC] improve the doc for "spark.memory.offHeap.size"

    The description of "spark.memory.offHeap.size" in the current document does not clearly state that the memory is counted in bytes.
    
    This PR contains a small fix for this tiny issue
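
    For illustration, configuring 1 GB of off-heap memory means passing a byte count (the keys are the existing `spark.memory.offHeap.*` settings; the value format is the point being clarified):

    ```scala
    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.memory.offHeap.enabled", "true")
      // The size is a number of bytes, so 1 GB is 1073741824, not "1".
      .set("spark.memory.offHeap.size", (1024L * 1024 * 1024).toString)
    ```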
    
    document fix
    
    Author: CodingCat <zhunansjtu@gmail.com>
    
    Closes #11561 from CodingCat/master.
    CodingCat authored and zsxwing committed Mar 7, 2016 (commit a3ec50a)
  9. [SPARK-13722][SQL] No Push Down for Non-deterministics Predicates through Generate
    
    #### What changes were proposed in this pull request?
    
    Non-deterministic predicates should not be pushed through Generate.
    
    #### How was this patch tested?
    
    Added a test case in `FilterPushdownSuite.scala`
    
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes #11562 from gatorsmile/pushPredicateDownWindow.
    gatorsmile authored and marmbrus committed Mar 7, 2016 (commit b6071a7)
  10. [SPARK-13655] Improve isolation between tests in KinesisBackedBlockRDDSuite
    
    This patch modifies `KinesisBackedBlockRDDTests` to increase the isolation between tests in order to fix a bug which causes the tests to hang.
    
    See #11558 for more details.
    
    /cc zsxwing srowen
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #11564 from JoshRosen/SPARK-13655.
    JoshRosen authored and zsxwing committed Mar 7, 2016 (commit e9e67b3)
  11. [SPARK-529][CORE][YARN] Add type-safe config keys to SparkConf.

    This is, in a way, the basics to enable SPARK-529 (which was closed as
    won't fix but I think is still valuable). In fact, Spark SQL created
    something for that, and this change basically factors out that code
    and inserts it into SparkConf, with some extra bells and whistles.
    
    To showcase the usage of this pattern, I modified the YARN backend
    to use the new config keys (defined in the new `config` package object
    under `o.a.s.deploy.yarn`). Most of the changes are mechanical, although
    logic had to be slightly modified in a handful of places.
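
    A hypothetical, self-contained illustration of the type-safe key pattern described above (the names and shape here are invented for illustration, not the actual API added by this patch):

    ```scala
    // Each key carries its type, default, and parsing logic in one place.
    case class ConfigEntry[T](key: String, default: T, parse: String => T)

    val MAX_EXECUTOR_FAILURES: ConfigEntry[Int] =
      ConfigEntry("spark.yarn.max.executor.failures", 3, _.toInt)

    def get[T](settings: Map[String, String], entry: ConfigEntry[T]): T =
      settings.get(entry.key).map(entry.parse).getOrElse(entry.default)
    ```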
    
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    
    Closes #10205 from vanzin/conf-opts.
    Marcelo Vanzin committed Mar 7, 2016 (commit e1fb857)
  12. [SPARK-13442][SQL] Make type inference recognize boolean types

    ## What changes were proposed in this pull request?
    
    https://issues.apache.org/jira/browse/SPARK-13442
    
    This PR adds support for inferring `BooleanType` during schema inference.
    It infers case-insensitive `true` / `false` values as `BooleanType`.
    
    Unittests were added for `CSVInferSchemaSuite` and `CSVSuite` for end-to-end test.
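
    A hypothetical usage sketch (the reader option names shown here are assumptions): with schema inference enabled, a column containing only case-insensitive `true`/`false` values should now come back as `BooleanType` rather than `StringType`.

    ```scala
    import org.apache.spark.sql.SQLContext

    def inferFlags(sqlContext: SQLContext): Unit = {
      val df = sqlContext.read
        .format("csv")
        .option("header", "true")
        .option("inferSchema", "true")   // assumed option name for inference
        .load("/tmp/flags.csv")
      df.printSchema()                   // boolean-looking columns: BooleanType
    }
    ```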
    
    ## How was this patch tested?
    
    This was tested with unittests and with `dev/run_tests` for coding style
    
    Author: hyukjinkwon <gurwls223@gmail.com>
    
    Closes #11315 from HyukjinKwon/SPARK-13442.
    HyukjinKwon authored and rxin committed Mar 7, 2016 (commit 8577260)
  13. [SPARK-13596][BUILD] Move misc top-level build files into appropriate subdirs
    
    ## What changes were proposed in this pull request?
    
    Move many top-level files in dev/ or other appropriate directory. In particular, put `make-distribution.sh` in `dev` and update docs accordingly. Remove deprecated `sbt/sbt`.
    
    I was (so far) unable to figure out how to move `tox.ini`. `scalastyle-config.xml` should be movable but edits to the project `.sbt` files didn't work; config file location is updatable for compile but not test scope.
    
    ## How was this patch tested?
    
    `./dev/run-tests` to verify RAT and checkstyle work. Jenkins tests for the rest.
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes #11522 from srowen/SPARK-13596.
    srowen authored and rxin committed Mar 7, 2016 (commit 0eea12a)
  14. [SPARK-13665][SQL] Separate the concerns of HadoopFsRelation

    `HadoopFsRelation` is used for reading most files into Spark SQL.  However today this class mixes the concerns of file management, schema reconciliation, scan building, bucketing, partitioning, and writing data.  As a result, many data sources are forced to reimplement the same functionality and the various layers have accumulated a fair bit of inefficiency.  This PR is a first cut at separating this into several components / interfaces that are each described below.  Additionally, all implementations inside of Spark (parquet, csv, json, text, orc, svmlib) have been ported to the new API `FileFormat`.  External libraries, such as spark-avro will also need to be ported to work with Spark 2.0.
    
    ### HadoopFsRelation
    A simple `case class` that acts as a container for all of the metadata required to read from a datasource.  All discovery, resolution and merging logic for schemas and partitions has been removed.  This an internal representation that no longer needs to be exposed to developers.
    
    ```scala
    case class HadoopFsRelation(
        sqlContext: SQLContext,
        location: FileCatalog,
        partitionSchema: StructType,
        dataSchema: StructType,
        bucketSpec: Option[BucketSpec],
        fileFormat: FileFormat,
        options: Map[String, String]) extends BaseRelation
    ```
    
    ### FileFormat
    The primary interface that will be implemented by each different format including external libraries.  Implementors are responsible for reading a given format and converting it into `InternalRow` as well as writing out an `InternalRow`.  A format can optionally return a schema that is inferred from a set of files.
    
    ```scala
    trait FileFormat {
      def inferSchema(
          sqlContext: SQLContext,
          options: Map[String, String],
          files: Seq[FileStatus]): Option[StructType]
    
      def prepareWrite(
          sqlContext: SQLContext,
          job: Job,
          options: Map[String, String],
          dataSchema: StructType): OutputWriterFactory
    
      def buildInternalScan(
          sqlContext: SQLContext,
          dataSchema: StructType,
          requiredColumns: Array[String],
          filters: Array[Filter],
          bucketSet: Option[BitSet],
          inputFiles: Array[FileStatus],
          broadcastedConf: Broadcast[SerializableConfiguration],
          options: Map[String, String]): RDD[InternalRow]
    }
    ```
    
    The current interface is based on what was required to get all the tests passing again, but still mixes a couple of concerns (i.e. `bucketSet` is passed down to the scan instead of being resolved by the planner).  Additionally, scans are still returning `RDD`s instead of iterators for single files.  In a future PR, bucketing should be removed from this interface and the scan should be isolated to a single file.
    
    ### FileCatalog
    This interface is used to list the files that make up a given relation, as well as handle directory based partitioning.
    
    ```scala
    trait FileCatalog {
      def paths: Seq[Path]
      def partitionSpec(schema: Option[StructType]): PartitionSpec
      def allFiles(): Seq[FileStatus]
      def getStatus(path: Path): Array[FileStatus]
      def refresh(): Unit
    }
    ```
    
    Currently there are two implementations:
     - `HDFSFileCatalog` - based on code from the old `HadoopFsRelation`.  Infers partitioning by recursive listing and caches this data for performance
     - `HiveFileCatalog` - based on the above, but it uses the partition spec from the Hive Metastore.
    
    ### ResolvedDataSource
    Produces a logical plan given the following description of a Data Source (which can come from DataFrameReader or a metastore):
     - `paths: Seq[String] = Nil`
     - `userSpecifiedSchema: Option[StructType] = None`
     - `partitionColumns: Array[String] = Array.empty`
     - `bucketSpec: Option[BucketSpec] = None`
     - `provider: String`
     - `options: Map[String, String]`
    
    This class is responsible for deciding which of the Data Source APIs a given provider is using (including the non-file based ones).  All reconciliation of partitions, buckets, schema from metastores or inference is done here.
    
    ### DataSourceAnalysis / DataSourceStrategy
    Responsible for analyzing and planning reading/writing of data using any of the Data Source APIs, including:
     - pruning the files from partitions that will be read based on filters.
     - appending partition columns*
     - applying additional filters when a data source can not evaluate them internally.
     - constructing an RDD that is bucketed correctly when required*
     - sanity checking schema match-up and other analysis when writing.
    
    *In the future we should do that following:
     - Break out file handling into its own Strategy as its sufficiently complex / isolated.
     - Push the appending of partition columns down in to `FileFormat` to avoid an extra copy / unvectorization.
     - Use a custom RDD for scans instead of `SQLNewNewHadoopRDD2`
    
    Author: Michael Armbrust <michael@databricks.com>
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #11509 from marmbrus/fileDataSource.
    marmbrus authored and rxin committed Mar 7, 2016 (commit e720dda)
  15. [SPARK-13648] Add Hive Cli to classes for isolated classloader

    ## What changes were proposed in this pull request?
    
    Adding the hive-cli classes to the classloader
    
    ## How was this patch tested?
    
    The hive Versionssuite tests were run
    
    This is my original work and I license the work to the project under the project's open source license.
    
    Author: Tim Preece <tim.preece.in.oz@gmail.com>
    
    Closes #11495 from preecet/master.
    preecet authored and marmbrus committed Mar 7, 2016 (commit 46f25c2)

Commits on Mar 8, 2016

  1. [SPARK-13689][SQL] Move helper things in CatalystQl to new utils object

    ## What changes were proposed in this pull request?
    
    When we add more DDL parsing logic in the future, SparkQl will become very big. To keep it smaller, we'll introduce helper "parser objects", e.g. one to parse alter table commands. However, these parser objects will need to access some helper methods that exist in CatalystQl. The proposal is to move those methods to an isolated ParserUtils object.
    
    This is based on viirya's changes in #11048. It prefaces the bigger fix for SPARK-13139 to make the diff of that patch smaller.
    
    ## How was this patch tested?
    
    No change in functionality, so just Jenkins.
    
    Author: Andrew Or <andrew@databricks.com>
    
    Closes #11529 from andrewor14/parser-utils.
    Andrew Or committed Mar 8, 2016 (commit da7bfac)
  2. [SPARK-13404] [SQL] Create variables for input row when it's actually used
    
    ## What changes were proposed in this pull request?
    
    This PR change the way how we generate the code for the output variables passing from a plan to it's parent.
    
    Right now, they are generated before calling the parent's consume(). This is not efficient: if the parent is a Filter or Join that filters out most of the rows, the time spent accessing columns that are not used by the Filter or Join is wasted.

    This PR tries to improve this by deferring the access of columns until they are actually used by a plan. After this PR, a plan does not need to generate code to evaluate its output variables; it just passes the ExprCode to its parent via `consume()`. In `parent.consumeChild()`, the output from the child is checked against `usedInputs`, and code is generated for the columns that are part of `usedInputs` before calling `doConsume()`.
    
    This PR also change the `if` from
    ```
    if (cond) {
      xxx
    }
    ```
    to
    ```
    if (!cond) continue;
    xxx
    ```
    The new one could help to reduce the nested indents for multiple levels of Filter and BroadcastHashJoin.
    
    It also added some comments for operators.
    
    ## How was this patch tested?
    
    Unit tests. Manually ran TPCDS Q55, this PR improve the performance about 30% (scale=10, from 2.56s to 1.96s)
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #11274 from davies/gen_defer.
    Davies Liu authored and davies committed Mar 8, 2016 (commit 25bba58)
  3. [SPARK-13711][CORE] Don't call SparkUncaughtExceptionHandler in AppClient as it's in driver
    
    ## What changes were proposed in this pull request?
    
    AppClient runs on the driver side. It should not call `Utils.tryOrExit` as it will send the exception to SparkUncaughtExceptionHandler and call `System.exit`. This PR just removes `Utils.tryOrExit`.
    
    ## How was this patch tested?
    
    manual tests.
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #11566 from zsxwing/SPARK-13711.
    zsxwing committed Mar 8, 2016 (commit 017cdf2)
  4. [SPARK-13659] Refactor BlockStore put*() APIs to remove returnValues

    In preparation for larger refactoring, this patch removes the confusing `returnValues` option from the BlockStore put() APIs: returning the value is only useful in one place (caching) and in other situations, such as block replication, it's simpler to put() and then get().
    
    As part of this change, I needed to refactor `BlockManager.doPut()`'s block replication code. I also changed `doPut()` to access the memory and disk stores directly rather than calling them through the BlockStore interface; this is in anticipation of a followup patch to remove the BlockStore interface so that the disk store can expose a binary-data-oriented API which is not concerned with Java objects or serialization.
    
    These changes should be covered by the existing storage unit tests. The best way to review this patch is probably to look at the individual commits, all of which are small and have useful descriptions to guide the review.
    
    /cc davies for review.
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #11502 from JoshRosen/remove-returnvalues.
    JoshRosen authored and Andrew Or committed Mar 8, 2016 (commit e52e597)
  5. [HOT-FIX][BUILD] Use the new location of checkstyle-suppressions.xml

    ## What changes were proposed in this pull request?
    
    This PR fixes `dev/lint-java` and `mvn checkstyle:check` failures due to the recent file location change.
    The following is the error message of current master.
    ```
    Checkstyle checks failed at following occurrences:
    [ERROR] Failed to execute goal org.apache.maven.plugins:maven-checkstyle-plugin:2.17:check (default-cli) on project spark-parent_2.11: Failed during checkstyle configuration: cannot initialize module SuppressionFilter - Cannot set property 'file' to 'checkstyle-suppressions.xml' in module SuppressionFilter: InvocationTargetException: Unable to find: checkstyle-suppressions.xml -> [Help 1]
    ```
    
    ## How was this patch tested?
    
    Manual. The following command should run correctly.
    ```
    ./dev/lint-java
    mvn checkstyle:check
    ```
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #11567 from dongjoon-hyun/hotfix_checkstyle_suppression.
    dongjoon-hyun authored and srowen committed Mar 8, 2016
    Configuration menu
    Copy the full SHA
    7771c73 View commit details
    Browse the repository at this point in the history
  6. [SPARK-13117][WEB UI] WebUI should use the local ip not 0.0.0.0

    ## What changes were proposed in this pull request?
    
    In the Web UI, the Jetty server now binds to the SPARK_LOCAL_IP config value if it
    is configured; otherwise it falls back to the default value '0.0.0.0'.
    
    It is continuation as per the closed PR #11133 for the JIRA SPARK-13117 and discussion in SPARK-13117.
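    A rough sketch of the binding rule described above (illustrative only, not the actual Jetty startup code; the helper below is hypothetical):
    
    ```scala
    // Hypothetical sketch: prefer SPARK_LOCAL_IP when set, otherwise bind to all interfaces.
    object BindHost {
      def resolve(): String =
        sys.env.get("SPARK_LOCAL_IP").filter(_.nonEmpty).getOrElse("0.0.0.0")
    }
    
    // The Jetty connector (or any server socket) would then be started with host = BindHost.resolve().
    ```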
    
    ## How was this patch tested?
    
    This has been verified using the command 'netstat -tnlp | grep <PID>' to check which IP/hostname each process binds to, following the steps below.
    
    In the results below, the PID in the command is the corresponding process id.
    
    #### Without the patch changes
    The Web UI (Jetty server) does not take the value configured for SPARK_LOCAL_IP and listens on all interfaces.
    ###### Master
    ```
    [devarajstobdtserver2 sbin]$ netstat -tnlp | grep 3930
    tcp6       0      0 :::8080                 :::*                    LISTEN      3930/java
    ```
    
    ###### Worker
    ```
    [devarajstobdtserver2 sbin]$ netstat -tnlp | grep 4090
    tcp6       0      0 :::8081                 :::*                    LISTEN      4090/java
    ```
    
    ###### History Server Process,
    ```
    [devarajstobdtserver2 sbin]$ netstat -tnlp | grep 2471
    tcp6       0      0 :::18080                :::*                    LISTEN      2471/java
    ```
    ###### Driver
    ```
    [devarajstobdtserver2 spark-master]$ netstat -tnlp | grep 6556
    tcp6       0      0 :::4040                 :::*                    LISTEN      6556/java
    ```
    
    #### With the patch changes
    
    ##### i. With SPARK_LOCAL_IP configured
    If SPARK_LOCAL_IP is configured, the Web UI (Jetty server) of every process binds to the configured value.
    ###### Master
    ```
    [devarajstobdtserver2 sbin]$ netstat -tnlp | grep 1561
    tcp6       0      0 x.x.x.x:8080       :::*                    LISTEN      1561/java
    ```
    ###### Worker
    ```
    [devarajstobdtserver2 sbin]$ netstat -tnlp | grep 2229
    tcp6       0      0 x.x.x.x:8081       :::*                    LISTEN      2229/java
    ```
    ###### History Server
    ```
    [devarajstobdtserver2 sbin]$ netstat -tnlp | grep 3747
    tcp6       0      0 x.x.x.x:18080      :::*                    LISTEN      3747/java
    ```
    ###### Driver
    ```
    [devarajstobdtserver2 spark-master]$ netstat -tnlp | grep 6013
    tcp6       0      0 x.x.x.x:4040       :::*                    LISTEN      6013/java
    ```
    
    ##### ii. Without SPARK_LOCAL_IP configured
    If SPARK_LOCAL_IP is not configured, the Web UI (Jetty server) of every process starts with the default value '0.0.0.0'.
    ###### Master
    ```
    [devarajstobdtserver2 sbin]$ netstat -tnlp | grep 4573
    tcp6       0      0 :::8080                 :::*                    LISTEN      4573/java
    ```
    
    ###### Worker
    ```
    [devarajstobdtserver2 sbin]$ netstat -tnlp | grep 4703
    tcp6       0      0 :::8081                 :::*                    LISTEN      4703/java
    ```
    
    ###### History Server
    ```
    [devarajstobdtserver2 sbin]$ netstat -tnlp | grep 4846
    tcp6       0      0 :::18080                :::*                    LISTEN      4846/java
    ```
    
    ###### Driver
    ```
    [devarajstobdtserver2 sbin]$ netstat -tnlp | grep 5437
    tcp6       0      0 :::4040                 :::*                    LISTEN      5437/java
    ```
    
    Author: Devaraj K <devaraj@apache.org>
    
    Closes #11490 from devaraj-kavali/SPARK-13117-v1.
    Devaraj K authored and srowen committed Mar 8, 2016
    Configuration menu
    Copy the full SHA
    9bf76dd View commit details
    Browse the repository at this point in the history
  7. [SPARK-13675][UI] Fix wrong historyserver url link for application ru…

    …nning in yarn cluster mode
    
    ## What changes were proposed in this pull request?
    
    Currently, the URL each application uses to access the history UI looks like:
    http://localhost:18080/history/application_1457058760338_0016/1/jobs/ or http://localhost:18080/history/application_1457058760338_0016/2/jobs/
    
    Here **1** or **2** represents the attempt count in `historypage.js`, but `HistoryServer` parses it as an attempt id, while the correct attempt id should look like "appattempt_1457058760338_0016_000002", so HistoryServer fails to parse a correct attempt id from it.
    
    This is OK in yarn client mode, since we don't need this attempt id to fetch the app cache, but it fails in yarn cluster mode, where attempt id "1" or "2" is actually wrong.
    
    So we should fix this URL so that the correct application id and attempt id are parsed. Also, the suffix "jobs/" is not needed.
    
    Here is the screenshot:
    
    ![screen shot 2016-02-29 at 3 57 32 pm](https://cloud.githubusercontent.com/assets/850797/13524377/d4b44348-e235-11e5-8b3e-bc06de306e87.png)
    
    ## How was this patch tested?
    
    This patch is tested manually, with different master and deploy mode.
    
    ![image](https://cloud.githubusercontent.com/assets/850797/13524419/118be5a0-e236-11e5-8022-3ff613ccde46.png)
    
    Author: jerryshao <sshao@hortonworks.com>
    
    Closes #11518 from jerryshao/SPARK-13675.
    jerryshao authored and Tom Graves committed Mar 8, 2016
    Configuration menu
    Copy the full SHA
    9e86e6e View commit details
    Browse the repository at this point in the history
  8. [SPARK-13637][SQL] use more information to simplify the code in Expan…

    …d builder
    
    ## What changes were proposed in this pull request?
    
    The code in `Expand.apply` can be simplified using existing information:
    
    * the `groupByExprs` parameter contains only `Attribute`s
    * the `child` parameter is a `Project` that appends aliased group-by expressions to its child's output
    
    ## How was this patch tested?
    
    by existing tests.
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #11485 from cloud-fan/expand.
    cloud-fan committed Mar 8, 2016
    Configuration menu
    Copy the full SHA
    7d05d02 View commit details
    Browse the repository at this point in the history
  9. [HOTFIX][YARN] Fix yarn cluster mode fire and forget regression

    ## What changes were proposed in this pull request?
    
    Fire-and-forget was disabled by default; with patch #10205 it became enabled by default, so this is a regression that should be fixed.
    
    ## How was this patch tested?
    
    Manually verified this change.
    
    Author: jerryshao <sshao@hortonworks.com>
    
    Closes #11577 from jerryshao/hot-fix-yarn-cluster.
    jerryshao authored and Marcelo Vanzin committed Mar 8, 2016
    Configuration menu
    Copy the full SHA
    ca1a7b9 View commit details
    Browse the repository at this point in the history
  10. [SPARK-13715][MLLIB] Remove last usages of jblas in tests

    ## What changes were proposed in this pull request?
    
    Remove last usage of jblas, in tests
    
    ## How was this patch tested?
    
    Jenkins tests -- the same ones that are being modified.
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes #11560 from srowen/SPARK-13715.
    srowen committed Mar 8, 2016
    Configuration menu
    Copy the full SHA
    54040f8 View commit details
    Browse the repository at this point in the history
  11. [SPARK-13657] [SQL] Support parsing very long AND/OR expressions

    ## What changes were proposed in this pull request?
    
    In order to avoid a StackOverflow when parsing an expression with hundreds of ORs, we should use a loop instead of recursive functions to flatten the tree into a list. This PR also builds a balanced tree to reduce the depth of the generated And/Or expressions, to avoid StackOverflow in the analyzer/optimizer.
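    A minimal sketch of the balanced-tree idea, using a simplified expression model rather than Catalyst's actual classes:
    
    ```scala
    // Simplified expression model (illustrative only, not Catalyst).
    sealed trait Expr
    case class Pred(name: String) extends Expr
    case class Or(left: Expr, right: Expr) extends Expr
    
    // Build a balanced Or tree so the depth is O(log n) instead of O(n),
    // which avoids deep recursion (and StackOverflow) in later phases.
    def balancedOr(preds: IndexedSeq[Expr]): Expr =
      if (preds.length == 1) preds.head
      else {
        val (left, right) = preds.splitAt(preds.length / 2)
        Or(balancedOr(left), balancedOr(right))
      }
    
    // 500 predicates yield a tree of depth ~9 rather than 500.
    val tree = balancedOr((1 to 500).map(i => Pred(s"p$i")))
    ```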
    
    ## How was this patch tested?
    
    Added new unit tests. Manually tested with TPCDS Q3, which has hundreds of predicates in it [1]. These predicates help reduce the number of partitions, so the query time went from 60 seconds to 8 seconds.
    
    [1] https://github.com/cloudera/impala-tpcds-kit/blob/master/queries/q3.sql
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #11501 from davies/long_or.
    Davies Liu authored and davies committed Mar 8, 2016
    Configuration menu
    Copy the full SHA
    78d3b60 View commit details
    Browse the repository at this point in the history
  12. [SPARK-13695] Don't cache MEMORY_AND_DISK blocks as bytes in memory a…

    …fter spills
    
    When a cached block is spilled to disk and read back in serialized form (i.e. as bytes), the current BlockManager implementation will attempt to re-insert the serialized block into the MemoryStore even if the block's storage level requests deserialized caching.
    
    This behavior adds some complexity to the MemoryStore but I don't think it offers many performance benefits and I'd like to remove it in order to simplify a larger refactoring patch. Therefore, this patch changes the behavior so that disk store reads will only cache bytes in the memory store for blocks with serialized storage levels.
    
    There are two places where we request serialized bytes from the BlockStore:
    
    1. getLocalBytes(), which is only called when reading local copies of TorrentBroadcast pieces. Broadcast pieces are always cached using a serialized storage level, so this won't lead to a mismatch in serialization forms if spilled bytes read from disk are cached as bytes in the memory store.
    2. the non-shuffle-block branch in getBlockData(), which is only called by the NettyBlockRpcServer when responding to requests to read remote blocks. Caching the serialized bytes in memory only benefits us if those cached bytes are read before they are evicted, and that seems unlikely since remote reads of non-broadcast cached blocks appear to be very rare. Caching these bytes when they have a low probability of being read is bad if it risks evicting blocks that are cached in their expected serialized/deserialized forms, since those blocks seem more likely to be read in local computation.
    
    Given the argument above, I think this change is unlikely to cause performance regressions.
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #11533 from JoshRosen/remove-memorystore-level-mismatch.
    JoshRosen committed Mar 8, 2016
    Configuration menu
    Copy the full SHA
    ad3c9a9 View commit details
    Browse the repository at this point in the history
  13. [SPARK-12727][SQL] support SQL generation for aggregate with multi-di…

    …stinct
    
    ## What changes were proposed in this pull request?
    
    This PR adds SQL generation support for aggregates with multiple DISTINCTs, by simply moving the `DistinctAggregationRewriter` rule to the optimizer.
    
    More discussion is needed, as this breaks an important contract: an analyzed plan should be able to run without optimization. However, the `ComputeCurrentTime` rule has already broken it to some extent, and I think we should maybe add a new phase for this kind of rule, because strictly speaking such rules don't belong to analysis and are coupled with the physical plan implementation.
    
    ## How was this patch tested?
    
    existing tests
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #11579 from cloud-fan/distinct.
    cloud-fan authored and rxin committed Mar 8, 2016
    Configuration menu
    Copy the full SHA
    46881b4 View commit details
    Browse the repository at this point in the history
  14. [ML] testEstimatorAndModelReadWrite should call checkModelData

    ## What changes were proposed in this pull request?
    Although we defined ```checkModelData``` in the [```read/write``` test](https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala#L994) of ML estimators/models and pass it to ```testEstimatorAndModelReadWrite```, ```testEstimatorAndModelReadWrite``` never actually calls ```checkModelData``` to check the equality of model data. So we currently do not run the model-data equality check for any test case; we should fix it.
    This also fixes a bug in the LDA read/write test, which did not set ```docConcentration```. That bug should have made the test fail, but it went unnoticed because ```checkModelData``` was never actually run.
    cc jkbradley mengxr
    ## How was this patch tested?
    No new unit tests; the existing ones should pass.
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes #11513 from yanboliang/ml-check-model-data.
    yanboliang authored and jkbradley committed Mar 8, 2016
    Configuration menu
    Copy the full SHA
    9740954 View commit details
    Browse the repository at this point in the history
  15. [SPARK-13740][SQL] add null check for _verify_type in types.py

    ## What changes were proposed in this pull request?
    
    This PR adds null check in `_verify_type` according to the nullability information.
    
    ## How was this patch tested?
    
    new doc tests
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #11574 from cloud-fan/py-null-check.
    cloud-fan authored and yhuai committed Mar 8, 2016
    Configuration menu
    Copy the full SHA
    d5ce617 View commit details
    Browse the repository at this point in the history
  16. [SPARK-13593] [SQL] improve the createDataFrame to accept data type…

    … string and verify the data
    
    ## What changes were proposed in this pull request?
    
    This PR improves the `createDataFrame` method to also accept a data type string, so users can convert a Python RDD to a DataFrame easily, for example `df = rdd.toDF("a: int, b: string")`.
    It also supports a flat schema, so users can convert an RDD of ints to a DataFrame directly; we automatically wrap each int into a row for users.
    If a schema is given, we now check whether the real data matches it and throw an error if it doesn't.
    
    ## How was this patch tested?
    
    new tests in `test.py` and doc test in `types.py`
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #11444 from cloud-fan/pyrdd.
    cloud-fan authored and davies committed Mar 8, 2016
    Configuration menu
    Copy the full SHA
    d57daf1 View commit details
    Browse the repository at this point in the history
  17. [SPARK-13400] Stop using deprecated Octal escape literals

    ## What changes were proposed in this pull request?
    
    This removes the remaining deprecated octal escape literals. The following are the warnings on those two lines.
    ```
    LiteralExpressionSuite.scala:99: Octal escape literals are deprecated, use \u0000 instead.
    HiveQlSuite.scala:74: Octal escape literals are deprecated, use \u002c instead.
    ```
    
    ## How was this patch tested?
    
    Manual.
    During the build, there should be no warnings about `Octal escape literals`.
    ```
    mvn -DskipTests clean install
    ```
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #11584 from dongjoon-hyun/SPARK-13400.
    dongjoon-hyun authored and rxin committed Mar 8, 2016
    Configuration menu
    Copy the full SHA
    076009b View commit details
    Browse the repository at this point in the history
  18. [SPARK-13738][SQL] Cleanup Data Source resolution

    Follow-up to #11509 that simply refactors the interface we use when resolving a pluggable `DataSource`.
     - Multiple functions share the same set of arguments so we make this a case class, called `DataSource`.  Actual resolution is now done by calling a function on this class.
     - Instead of having multiple methods named `apply` (some of which do writing, some of which do reading) we now explicitly have `resolveRelation()` and `write(mode, df)`.
     - Get rid of `Array[String]` since this is an internal API and was forcing us to awkwardly call `toArray` in a bunch of places.
    
    Author: Michael Armbrust <michael@databricks.com>
    
    Closes #11572 from marmbrus/dataSourceResolution.
    marmbrus authored and rxin committed Mar 8, 2016
    Configuration menu
    Copy the full SHA
    1e28840 View commit details
    Browse the repository at this point in the history
  19. [SPARK-13668][SQL] Reorder filter/join predicates to short-circuit is…

    …NotNull checks
    
    ## What changes were proposed in this pull request?
    
    If a filter predicate or a join condition consists of `IsNotNull` checks, we should reorder these checks such that these non-nullability checks are evaluated before the rest of the predicates.
    
    For example, if a filter predicate is of the form `a > 5 && isNotNull(b)`, we should rewrite it as `isNotNull(b) && a > 5` during physical plan generation.
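    A minimal sketch of the reordering, using a simplified predicate model (not Catalyst's actual expressions):
    
    ```scala
    // Simplified predicate model, for illustration only.
    sealed trait Predicate
    case class IsNotNull(col: String) extends Predicate
    case class GreaterThan(col: String, value: Int) extends Predicate
    
    // Move cheap IsNotNull checks to the front of a conjunction so they
    // short-circuit before the remaining (potentially costlier) predicates.
    def reorderConjuncts(conjuncts: Seq[Predicate]): Seq[Predicate] = {
      val (notNullChecks, others) = conjuncts.partition(_.isInstanceOf[IsNotNull])
      notNullChecks ++ others
    }
    
    // reorderConjuncts(Seq(GreaterThan("a", 5), IsNotNull("b")))
    // => Seq(IsNotNull("b"), GreaterThan("a", 5))
    ```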
    
    ## How was this patch tested?
    
    new unit tests that verify the physical plan for both filters and joins in `ReorderedPredicateSuite`
    
    Author: Sameer Agarwal <sameer@databricks.com>
    
    Closes #11511 from sameeragarwal/reorder-isnotnull.
    sameeragarwal authored and yhuai committed Mar 8, 2016
    Configuration menu
    Copy the full SHA
    e430614 View commit details
    Browse the repository at this point in the history

Commits on Mar 9, 2016

  1. [SPARK-13755] Escape quotes in SQL plan visualization node labels

    When generating Graphviz DOT files in the SQL query visualization, we need to escape double quotes inside node labels. This is a follow-up to #11309, which fixed a similar issue in Spark Core's DAG visualization.
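    A minimal sketch of the kind of escaping involved (illustrative only, not the actual visualization code):
    
    ```scala
    // Escape embedded double quotes so the label stays a valid DOT string literal.
    def escapeDotLabel(label: String): String = label.replace("\"", "\\\"")
    
    val node = s"""n1 [label="${escapeDotLabel("a = \"x\"")}"];"""
    // => n1 [label="a = \"x\""];
    ```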
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #11587 from JoshRosen/graphviz-escaping.
    JoshRosen committed Mar 9, 2016
    Configuration menu
    Copy the full SHA
    81f54ac View commit details
    Browse the repository at this point in the history
  2. [SPARK-13625][PYSPARK][ML] Added a check to see if an attribute is a …

    …property when getting param list
    
    ## What changes were proposed in this pull request?
    
    Added a check in pyspark.ml.param.Param.params() to see if an attribute is a property (decorated with `property`) before checking if it is a `Param` instance.  This prevents the property from being invoked to 'get' this attribute, which could possibly cause an error.
    
    ## How was this patch tested?
    
    Added a test case with a class that has a property which raises an error when invoked, and then called `Param.params` to verify that the property is not invoked while another `Param` in the class is still found.  Also ran the pyspark-ml tests before the fix (which triggered the error) and again after the fix to verify that the error was resolved and the method works properly.
    
    Author: Bryan Cutler <cutlerb@gmail.com>
    
    Closes #11476 from BryanCutler/pyspark-ml-property-attr-SPARK-13625.
    BryanCutler authored and jkbradley committed Mar 9, 2016
    Configuration menu
    Copy the full SHA
    d8813fa View commit details
    Browse the repository at this point in the history
  3. [SPARK-13750][SQL] fix sizeInBytes of HadoopFsRelation

    ## What changes were proposed in this pull request?
    
    This PR fixes the sizeInBytes of HadoopFsRelation.
    
    ## How was this patch tested?
    
    Added regression test for that.
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #11590 from davies/fix_sizeInBytes.
    Davies Liu authored and marmbrus committed Mar 9, 2016
    Configuration menu
    Copy the full SHA
    982ef2b View commit details
    Browse the repository at this point in the history
  4. [SPARK-13754] Keep old data source name for backwards compatibility

    ## Motivation
    The CSV data source was contributed by Databricks. It is the inlined version of https://github.com/databricks/spark-csv. The data source name was `com.databricks.spark.csv`. As a result, there are many tables created on older versions of Spark with that name as the source. For backwards compatibility we should keep the old name.
    
    ## Proposed changes
    `com.databricks.spark.csv` was added to the `backwardCompatibilityMap` in `ResolvedDataSource.scala`.
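    A minimal sketch of the lookup described above; the mapped target name below is an assumption for illustration, not necessarily the exact internal class name:
    
    ```scala
    // Old external package names are translated to the built-in data source.
    val backwardCompatibilityMap: Map[String, String] =
      Map("com.databricks.spark.csv" -> "csv")  // assumed target name, for illustration
    
    def lookupDataSource(provider: String): String =
      backwardCompatibilityMap.getOrElse(provider, provider)
    
    // lookupDataSource("com.databricks.spark.csv") => "csv"
    // lookupDataSource("parquet")                  => "parquet"
    ```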
    
    ## Tests
    A unit test was added to `CSVSuite` to parse a csv file using the old name.
    
    Author: Hossein <hossein@databricks.com>
    
    Closes #11589 from falaki/SPARK-13754.
    falaki authored and marmbrus committed Mar 9, 2016
    Configuration menu
    Copy the full SHA
    cc4ab37 View commit details
    Browse the repository at this point in the history
  5. [SPARK-7286][SQL] Deprecate !== in favour of =!=

    This PR replaces #9925 which had issues with CI. **Please see the original PR for any previous discussions.**
    
    ## What changes were proposed in this pull request?
    Deprecate the SparkSQL column operator !== and use =!= as an alternative.
    Fixes subtle issues related to operator precedence (basically, !== does not have the same priority as its logical negation, ===).
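    A minimal, self-contained sketch of the precedence pitfall using a toy class (not Spark's Column API): an operator that ends in `=` but does not start with `=` (like `!==`) is parsed as an assignment-style operator with the lowest precedence, while `===` and `=!=` share the same, higher precedence.
    
    ```scala
    // Toy boolean wrapper to show how the operators parse; illustrative only.
    case class B(v: Boolean) {
      def ===(o: B): B = B(v == o.v)
      def !==(o: B): B = B(v != o.v)   // old spelling, assignment-operator precedence
      def =!=(o: B): B = B(v != o.v)   // replacement spelling, same precedence as ===
      def &&(o: B): B = B(v && o.v)
    }
    
    val a = B(true); val b = B(false); val c = B(true)
    
    val good = a =!= b && c    // parses as (a =!= b) && c  => B(true)
    val bad  = a !== (b && c)  // `a !== b && c` would parse like this, usually not what was meant
    ```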
    
    ## How was this patch tested?
    All currently existing tests.
    
    Author: Jakob Odersky <jodersky@gmail.com>
    
    Closes #11588 from jodersky/SPARK-7286.
    jodersky authored and rxin committed Mar 9, 2016
    Configuration menu
    Copy the full SHA
    035d3ac View commit details
    Browse the repository at this point in the history
  6. [SPARK-13692][CORE][SQL] Fix trivial Coverity/Checkstyle defects

    ## What changes were proposed in this pull request?
    
    This issue fixes the following potential bugs and Java coding style issues detected by Coverity and Checkstyle.
    
    - Implement both null and type checking in equals functions.
    - Fix wrong type casting logic in SimpleJavaBean2.equals.
    - Add `implement Cloneable` to `UTF8String` and `SortedIterator`.
    - Remove dereferencing before null check in `AbstractBytesToBytesMapSuite`.
    - Fix coding style: Add '{}' to single `for` statement in mllib examples.
    - Remove unused imports in `ColumnarBatch` and `JavaKinesisStreamSuite`.
    - Remove unused fields in `ChunkFetchIntegrationSuite`.
    - Add `stop()` to prevent resource leak.
    
    Please note that the last two checkstyle errors exist on newly added commits after [SPARK-13583](https://issues.apache.org/jira/browse/SPARK-13583).
    
    ## How was this patch tested?
    
    manual via `./dev/lint-java` and Coverity site.
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #11530 from dongjoon-hyun/SPARK-13692.
    dongjoon-hyun authored and srowen committed Mar 9, 2016
    Configuration menu
    Copy the full SHA
    f3201ae View commit details
    Browse the repository at this point in the history
  7. [SPARK-13640][SQL] Synchronize ScalaReflection.mirror method.

    ## What changes were proposed in this pull request?
    
    The `ScalaReflection.mirror` method should be synchronized when the Scala version is `2.10`, because `universe.runtimeMirror` is not thread-safe.
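    A minimal sketch of the synchronization, assuming the Scala 2.10 behavior described above (not the exact Spark code):
    
    ```scala
    import scala.reflect.runtime.{universe => ru}
    
    object ScalaReflectionSketch {
      // Guard mirror creation with a lock because universe.runtimeMirror is not
      // thread-safe on Scala 2.10; concurrent first calls can corrupt internal state.
      def mirror: ru.Mirror = this.synchronized {
        ru.runtimeMirror(Thread.currentThread().getContextClassLoader)
      }
    }
    ```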
    
    ## How was this patch tested?
    
    I added a test to check the thread safety of the `ScalaReflection.mirror` method in `ScalaReflectionSuite`, which will throw the following exception in Scala `2.10` without this patch:
    
    ```
    [info] - thread safety of mirror *** FAILED *** (49 milliseconds)
    [info]   java.lang.UnsupportedOperationException: tail of empty list
    [info]   at scala.collection.immutable.Nil$.tail(List.scala:339)
    [info]   at scala.collection.immutable.Nil$.tail(List.scala:334)
    [info]   at scala.reflect.internal.SymbolTable.popPhase(SymbolTable.scala:172)
    [info]   at scala.reflect.internal.Symbols$Symbol.unsafeTypeParams(Symbols.scala:1477)
    [info]   at scala.reflect.internal.Symbols$TypeSymbol.tpe(Symbols.scala:2777)
    [info]   at scala.reflect.internal.Mirrors$RootsBase.init(Mirrors.scala:235)
    [info]   at scala.reflect.runtime.JavaMirrors$class.createMirror(JavaMirrors.scala:34)
    [info]   at scala.reflect.runtime.JavaMirrors$class.runtimeMirror(JavaMirrors.scala:61)
    [info]   at scala.reflect.runtime.JavaUniverse.runtimeMirror(JavaUniverse.scala:12)
    [info]   at scala.reflect.runtime.JavaUniverse.runtimeMirror(JavaUniverse.scala:12)
    [info]   at org.apache.spark.sql.catalyst.ScalaReflection$.mirror(ScalaReflection.scala:36)
    [info]   at org.apache.spark.sql.catalyst.ScalaReflectionSuite$$anonfun$12$$anonfun$apply$mcV$sp$1$$anonfun$apply$1$$anonfun$apply$2.apply(ScalaReflectionSuite.scala:256)
    [info]   at org.apache.spark.sql.catalyst.ScalaReflectionSuite$$anonfun$12$$anonfun$apply$mcV$sp$1$$anonfun$apply$1$$anonfun$apply$2.apply(ScalaReflectionSuite.scala:252)
    [info]   at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
    [info]   at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
    [info]   at scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107)
    [info]   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    [info]   at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    [info]   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    [info]   at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
    ```
    
    Notice that the test will pass when Scala version is `2.11`.
    
    Author: Takuya UESHIN <ueshin@happy-camper.st>
    
    Closes #11487 from ueshin/issues/SPARK-13640.
    ueshin authored and srowen committed Mar 9, 2016
    Configuration menu
    Copy the full SHA
    2c5af7d View commit details
    Browse the repository at this point in the history
  8. [SPARK-13631][CORE] Thread-safe getLocationsWithLargestOutputs

    ## What changes were proposed in this pull request?
    
    If a job is being scheduled in one thread which has a dependency on an
    RDD currently executing a shuffle in another thread, Spark would throw a
    NullPointerException. This patch synchronizes access to `mapStatuses` and
    skips null status entries (which are in-progress shuffle tasks).
    
    ## How was this patch tested?
    
    Our client code unit test suite, which was reliably reproducing the race
    condition with 10 threads, shows that this fixes it. I have not found a minimal
    test case to add to Spark, but I will attempt to do so if desired.
    
    The same test case was tripping up on SPARK-4454, which was fixed by
    making other DAGScheduler code thread-safe.
    
    shivaram srowen
    
    Author: Andy Sloane <asloane@tetrationanalytics.com>
    
    Closes #11505 from a1k0n/SPARK-13631.
    Andy Sloane authored and srowen committed Mar 9, 2016
    Configuration menu
    Copy the full SHA
    cbff280 View commit details
    Browse the repository at this point in the history
  9. [SPARK-13702][CORE][SQL][MLLIB] Use diamond operator for generic inst…

    …ance creation in Java code.
    
    ## What changes were proposed in this pull request?
    
    In order to make `docs/examples` (and other related code) simpler, more readable and more user-friendly, this PR replaces existing code like the following by using the `diamond` operator.
    
    ```
    -    final ArrayList<Product2<Object, Object>> dataToWrite =
    -      new ArrayList<Product2<Object, Object>>();
    +    final ArrayList<Product2<Object, Object>> dataToWrite = new ArrayList<>();
    ```
    
    Java 7 and higher support the **diamond** operator, which replaces the type arguments required to invoke the constructor of a generic class with an empty set of type parameters (<>). Currently, Spark's Java code uses this inconsistently.
    
    ## How was this patch tested?
    
    Manual.
    Pass the existing tests.
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #11541 from dongjoon-hyun/SPARK-13702.
    dongjoon-hyun authored and srowen committed Mar 9, 2016
    Configuration menu
    Copy the full SHA
    c3689bc View commit details
    Browse the repository at this point in the history
  10. [SPARK-13769][CORE] Update Java Doc in Spark Submit

    JIRA : https://issues.apache.org/jira/browse/SPARK-13769
    
    The java doc here (https://github.com/apache/spark/blob/e97fc7f176f8bf501c9b3afd8410014e3b0e1602/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L51)
    needs to be updated from "The latter two operations are currently supported only for standalone cluster mode." to "The latter two operations are currently supported only for standalone and mesos cluster modes."
    
    Author: Ahmed Kamal <ahmed.kamal@badrit.com>
    
    Closes #11600 from AhmedKamal/SPARK-13769.
    Ahmed Kamal authored and srowen committed Mar 9, 2016
    Configuration menu
    Copy the full SHA
    8e8633e View commit details
    Browse the repository at this point in the history
  11. [SPARK-13698][SQL] Fix Analysis Exceptions when Using Backticks in Ge…

    …nerate
    
    ## What changes were proposed in this pull request?
    An analysis exception occurs while running the following query.
    ```
    SELECT ints FROM nestedArray LATERAL VIEW explode(a.b) `a` AS `ints`
    ```
    ```
    Failed to analyze query: org.apache.spark.sql.AnalysisException: cannot resolve '`ints`' given input columns: [a, `ints`]; line 1 pos 7
    'Project ['ints]
    +- Generate explode(a#0.b), true, false, Some(a), [`ints`#8]
       +- SubqueryAlias nestedarray
          +- LocalRelation [a#0], [[[[1,2,3]]]]
    ```
    
    ## How was this patch tested?
    
    Added new unit tests in SQLQuerySuite and HiveQlSuite
    
    Author: Dilip Biswal <dbiswal@us.ibm.com>
    
    Closes #11538 from dilipbiswal/SPARK-13698.
    dilipbiswal authored and cloud-fan committed Mar 9, 2016
    Configuration menu
    Copy the full SHA
    53ba6d6 View commit details
    Browse the repository at this point in the history
  12. [SPARK-13242] [SQL] codegen fallback in case-when if there many branches

    ## What changes were proposed in this pull request?
    
    If there are many branches in a CaseWhen expression, the generated code could exceed the 64KB limit for a single Java method and fail to compile. This PR changes it to fall back to interpreted mode if there are more than 20 branches.
    
    This PR is based on #11243 and #11221, thanks to joehalliwell
    
    Closes #11243
    Closes #11221
    
    ## How was this patch tested?
    
    Add a test with 50 branches.
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #11592 from davies/fix_when.
    Davies Liu authored and davies committed Mar 9, 2016
    Configuration menu
    Copy the full SHA
    9634e17 View commit details
    Browse the repository at this point in the history
  13. Revert "[SPARK-13668][SQL] Reorder filter/join predicates to short-ci…

    …rcuit isNotNull checks"
    
    This reverts commit e430614.
    davies committed Mar 9, 2016
    Configuration menu
    Copy the full SHA
    7791d0c View commit details
    Browse the repository at this point in the history
  14. [SPARK-13595][BUILD] Move docker, extras modules into external

    ## What changes were proposed in this pull request?
    
    Move `docker` dirs out of top level into `external/`; move `extras/*` into `external/`
    
    ## How was this patch tested?
    
    This is tested with Jenkins tests.
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes #11523 from srowen/SPARK-13595.
    srowen committed Mar 9, 2016
    Configuration menu
    Copy the full SHA
    256704c View commit details
    Browse the repository at this point in the history
  15. [SPARK-13763][SQL] Remove Project when its Child's Output is Nil

    #### What changes were proposed in this pull request?
    
    As shown in another PR (#11596), we use `SELECT 1` as a dummy table when a SQL statement requires a table reference but the contents of the table are not important. For example,
    
    ```SQL
    SELECT value FROM (select 1) dummyTable Lateral View explode(array(1,2,3)) adTable as value
    ```
    Before this PR, the optimized plan contained a useless `Project` after the optimizer executed the `ColumnPruning` rule, as shown below:
    
    ```
    == Analyzed Logical Plan ==
    value: int
    Project [value#22]
    +- Generate explode(array(1, 2, 3)), true, false, Some(adtable), [value#22]
       +- SubqueryAlias dummyTable
          +- Project [1 AS 1#21]
             +- OneRowRelation$
    
    == Optimized Logical Plan ==
    Generate explode([1,2,3]), false, false, Some(adtable), [value#22]
    +- Project
       +- OneRowRelation$
    ```
    
    After the fix, the optimized plan removed the useless `Project`, as shown below:
    ```
    == Optimized Logical Plan ==
    Generate explode([1,2,3]), false, false, Some(adtable), [value#22]
    +- OneRowRelation$
    ```
    
    This PR removes a `Project` when its child's output is Nil.
    
    #### How was this patch tested?
    
    Added a new unit test case into the suite `ColumnPruningSuite.scala`
    
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes #11599 from gatorsmile/projectOneRowRelation.
    gatorsmile authored and marmbrus committed Mar 9, 2016
    Configuration menu
    Copy the full SHA
    23369c3 View commit details
    Browse the repository at this point in the history
  16. [SPARK-13728][SQL] Fix ORC PPD test so that pushed filters can be che…

    …cked.
    
    ## What changes were proposed in this pull request?
    https://issues.apache.org/jira/browse/SPARK-13728
    
    #11509 made the test output only a single ORC file.
    It used to be 10 files, but after that change only a single file is written, so ORC could not skip stripes based on the pushed-down filters.
    This PR simply repartitions the data into 10 partitions so that the test passes.
    ## How was this patch tested?
    
    Unit tests, and `./dev/run_tests` for the code style checks.
    
    Author: hyukjinkwon <gurwls223@gmail.com>
    
    Closes #11593 from HyukjinKwon/SPARK-13728.
    HyukjinKwon authored and marmbrus committed Mar 9, 2016
    Configuration menu
    Copy the full SHA
    cad29a4 View commit details
    Browse the repository at this point in the history
  17. [SPARK-13615][ML] GeneralizedLinearRegression supports save/load

    ## What changes were proposed in this pull request?
    ```GeneralizedLinearRegression``` supports ```save/load```.
    cc mengxr
    ## How was this patch tested?
    unit test.
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes #11465 from yanboliang/spark-13615.
    yanboliang authored and jkbradley committed Mar 9, 2016
    Configuration menu
    Copy the full SHA
    0dd0648 View commit details
    Browse the repository at this point in the history
  18. [SPARK-13523] [SQL] Reuse exchanges in a query

    ## What changes were proposed in this pull request?
    
    It’s possible for a query to have common parts, for example in a self join; it would be good to avoid recomputing the duplicated part, to save CPU and memory (broadcast or cache).
    
    Exchange materializes the underlying RDD by shuffle or collect, so it's a good point at which to check for duplicates and reuse them. Duplicated exchanges mean they generate exactly the same result inside a query.
    
    In order to find the duplicated exchanges, we should be able to compare SparkPlans to check whether they produce the same results. We already have that for LogicalPlan, so we should move it into QueryPlan to make it available for SparkPlan.
    
    Once we find the duplicated exchanges, we replace all of them with the same SparkPlan object (possibly wrapped by ReusedExchange for explain), and then the plan tree becomes a DAG. Since all the planners only work with trees, this rule should be the last one in the entire planning.
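    A minimal sketch of the dedup idea, keyed on a canonical description of the result each exchange produces (a simplified model, not Spark's actual rule):
    
    ```scala
    // Simplified model: an exchange plus a key describing the result it produces.
    case class ExchangeNode(id: Int, resultKey: String)
    
    // Keep the first exchange seen for each result key and reuse it for later
    // duplicates, so equivalent subtrees are materialized only once.
    def reuseExchanges(exchanges: Seq[ExchangeNode]): Seq[ExchangeNode] = {
      val seen = scala.collection.mutable.HashMap.empty[String, ExchangeNode]
      exchanges.map(e => seen.getOrElseUpdate(e.resultKey, e))
    }
    
    // reuseExchanges(Seq(ExchangeNode(1, "hash(id)"), ExchangeNode(2, "hash(id)")))
    // => Seq(ExchangeNode(1, "hash(id)"), ExchangeNode(1, "hash(id)"))
    ```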
    
    After the rule, the plan looks like:
    
    ```
    WholeStageCodegen
    :  +- Project [id#0L]
    :     +- BroadcastHashJoin [id#0L], [id#2L], Inner, BuildRight, None
    :        :- Project [id#0L]
    :        :  +- BroadcastHashJoin [id#0L], [id#1L], Inner, BuildRight, None
    :        :     :- Range 0, 1, 4, 1024, [id#0L]
    :        :     +- INPUT
    :        +- INPUT
    :- BroadcastExchange HashedRelationBroadcastMode(true,List(id#1L),List(id#1L))
    :  +- WholeStageCodegen
    :     :  +- Range 0, 1, 4, 1024, [id#1L]
    +- ReusedExchange [id#2L], BroadcastExchange HashedRelationBroadcastMode(true,List(id#1L),List(id#1L))
    ```
    
    ![bjoin](https://cloud.githubusercontent.com/assets/40902/13414787/209e8c5c-df0a-11e5-8a0f-edff69d89e83.png)
    
    For three ways SortMergeJoin,
    ```
    == Physical Plan ==
    WholeStageCodegen
    :  +- Project [id#0L]
    :     +- SortMergeJoin [id#0L], [id#4L], None
    :        :- INPUT
    :        +- INPUT
    :- WholeStageCodegen
    :  :  +- Project [id#0L]
    :  :     +- SortMergeJoin [id#0L], [id#3L], None
    :  :        :- INPUT
    :  :        +- INPUT
    :  :- WholeStageCodegen
    :  :  :  +- Sort [id#0L ASC], false, 0
    :  :  :     +- INPUT
    :  :  +- Exchange hashpartitioning(id#0L, 200), None
    :  :     +- WholeStageCodegen
    :  :        :  +- Range 0, 1, 4, 33554432, [id#0L]
    :  +- WholeStageCodegen
    :     :  +- Sort [id#3L ASC], false, 0
    :     :     +- INPUT
    :     +- ReusedExchange [id#3L], Exchange hashpartitioning(id#0L, 200), None
    +- WholeStageCodegen
       :  +- Sort [id#4L ASC], false, 0
       :     +- INPUT
       +- ReusedExchange [id#4L], Exchange hashpartitioning(id#0L, 200), None
    ```
    ![sjoin](https://cloud.githubusercontent.com/assets/40902/13414790/27aea61c-df0a-11e5-8cbf-fbc985c31d95.png)
    
    If the same ShuffleExchange or BroadcastExchange has execute()/executeBroadcast() called by different parents, it should cache the RDD/Broadcast and return the same one to all the parents.
    
    ## How was this patch tested?
    
    Added some unit tests for this. Also did some manual tests on TPCDS queries Q59 and Q64, where we can see some exchanges being re-used (this requires a change in PhysicalRDD for sameResult, done in #11514).
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #11403 from davies/dedup.
    Davies Liu authored and davies committed Mar 9, 2016
    Configuration menu
    Copy the full SHA
    3dc9ae2 View commit details
    Browse the repository at this point in the history
  19. [SPARK-13527][SQL] Prune Filters based on Constraints

    #### What changes were proposed in this pull request?
    
    Remove all the deterministic conditions in a [[Filter]] that are already contained in the child's constraints.
    
    For example, the first query can be simplified to the second one.
    
    ```scala
        val queryWithUselessFilter = tr1
          .where("tr1.a".attr > 10 || "tr1.c".attr < 10)
          .join(tr2.where('d.attr < 100), Inner, Some("tr1.a".attr === "tr2.a".attr))
          .where(
            ("tr1.a".attr > 10 || "tr1.c".attr < 10) &&
            'd.attr < 100 &&
            "tr2.a".attr === "tr1.a".attr)
    ```
    ```scala
        val query = tr1
          .where("tr1.a".attr > 10 || "tr1.c".attr < 10)
          .join(tr2.where('d.attr < 100), Inner, Some("tr1.a".attr === "tr2.a".attr))
    ```
    #### How was this patch tested?
    
    Six test cases are added.
    
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes #11406 from gatorsmile/FilterRemoval.
    gatorsmile authored and marmbrus committed Mar 9, 2016
    Configuration menu
    Copy the full SHA
    c6aa356 View commit details
    Browse the repository at this point in the history
  20. [SPARK-11861][ML] Add feature importances for decision trees

    This patch adds an API entry point for single decision tree feature importances.
    
    Author: sethah <seth.hendrickson16@gmail.com>
    
    Closes #9912 from sethah/SPARK-11861.
    sethah authored and jkbradley committed Mar 9, 2016
    Configuration menu
    Copy the full SHA
    e1772d3 View commit details
    Browse the repository at this point in the history
  21. [SPARK-13781][SQL] Use ExpressionSets in ConstraintPropagationSuite

    ## What changes were proposed in this pull request?
    
    This PR is a small follow up on #11338 (https://issues.apache.org/jira/browse/SPARK-13092) to use `ExpressionSet` as part of the verification logic in `ConstraintPropagationSuite`.
    ## How was this patch tested?
    
    No new tests added. Just changes the verification logic in `ConstraintPropagationSuite`.
    
    Author: Sameer Agarwal <sameer@databricks.com>
    
    Closes #11611 from sameeragarwal/expression-set.
    sameeragarwal authored and marmbrus committed Mar 9, 2016
    Configuration menu
    Copy the full SHA
    dbf2a7c View commit details
    Browse the repository at this point in the history

Commits on Mar 10, 2016

  1. [SPARK-13747][SQL] Fix concurrent query with fork-join pool

    ## What changes were proposed in this pull request?
    
    Fix this use case, which was already fixed in SPARK-10548 in 1.6 but was broken in master due to #9264:
    
    ```
    (1 to 100).par.foreach { _ => sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count() }
    ```
    
    This threw `IllegalArgumentException` consistently before this patch. For more detail, see the JIRA.
    
    ## How was this patch tested?
    
    New test in `SQLExecutionSuite`.
    
    Author: Andrew Or <andrew@databricks.com>
    
    Closes #11586 from andrewor14/fix-concurrent-sql.
    Andrew Or authored and zsxwing committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    37fcda3 View commit details
    Browse the repository at this point in the history
  2. [SPARK-13778][CORE] Set the executor state for a worker when removing it

    ## What changes were proposed in this pull request?
    
    When a worker is lost, the executors on this worker are also lost. But Master's ApplicationPage still displays their states as running.
    
    This patch just sets the executor state to `LOST` when a worker is lost.
    
    ## How was this patch tested?
    
    manual tests
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #11609 from zsxwing/SPARK-13778.
    zsxwing authored and Andrew Or committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    40e0676 View commit details
    Browse the repository at this point in the history
  3. [SPARK-13775] History page sorted by completed time desc by default.

    ## What changes were proposed in this pull request?
    Originally the page was sorted by AppID by default.
    After testing and user feedback, we think it is best to sort by completed time (descending).
    
    ## How was this patch tested?
    Manually test, with screenshot as follows.
    ![sorted-by-complete-time-desc](https://cloud.githubusercontent.com/assets/11683054/13647686/d6dea924-e5fa-11e5-8fc5-68e039b74b6f.png)
    
    Author: zhuol <zhuol@yahoo-inc.com>
    
    Closes #11608 from zhuoliu/13775.
    zhuol authored and Andrew Or committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    238447d View commit details
    Browse the repository at this point in the history
  4. [MINOR] Fix typo in 'hypot' docstring

    Minor typo:  docstring for pyspark.sql.functions: hypot has extra characters
    
    N/A
    
    Author: Tristan Reid <treid@netflix.com>
    
    Closes #11616 from tristanreid/master.
    tristanreid authored and Andrew Or committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    5f7dbdb View commit details
    Browse the repository at this point in the history
  5. [SPARK-13492][MESOS] Configurable Mesos framework webui URL.

    ## What changes were proposed in this pull request?
    
    Previously the Mesos framework webui URL was derived only from the Spark UI address, leaving no way to configure it. This commit makes it configurable. If unset, it falls back to the previous behavior.
    
    Motivation:
    This change is necessary in order to install Spark on DCOS and give it a custom service link. The configured `webui_url` points to a reverse proxy in the DCOS environment.
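    A minimal sketch of the fallback behavior, assuming a configuration key along the lines of `spark.mesos.driver.webui.url` (the exact key name here is an assumption):
    
    ```scala
    // If the webui URL option is set, use it; otherwise fall back to the Spark UI address.
    def frameworkWebUiUrl(conf: Map[String, String], sparkUiAddress: String): String =
      conf.getOrElse("spark.mesos.driver.webui.url", sparkUiAddress)
    
    // frameworkWebUiUrl(Map("spark.mesos.driver.webui.url" -> "https://proxy/service/spark"),
    //                   "http://10.0.0.1:4040")              => "https://proxy/service/spark"
    // frameworkWebUiUrl(Map.empty, "http://10.0.0.1:4040")   => "http://10.0.0.1:4040"
    ```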
    
    ## How was this patch tested?
    
    Locally using unit tests, and on the DCOS testing and stable revisions.
    
    Author: Sergiusz Urbaniak <sur@mesosphere.io>
    
    Closes #11369 from s-urbaniak/sur-webui-url.
    Sergiusz Urbaniak authored and Andrew Or committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    a4a0add View commit details
    Browse the repository at this point in the history
  6. [SPARK-13760][SQL] Fix BigDecimal constructor for FloatType

    ## What changes were proposed in this pull request?
    
    A very minor change to use `BigDecimal.decimal(f: Float)` instead of `BigDecimal(f: Float)`. The latter is deprecated and can result in inconsistencies due to an implicit conversion to `Double`.
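    A small illustration of the inconsistency (the exact printed digits depend on the float value, so treat the comments as indicative):
    
    ```scala
    val f = 0.1f
    // Widening to Double first drags in the Float's binary representation error:
    val viaDouble  = BigDecimal(f.toDouble)    // ~0.10000000149011612
    // decimal(f: Float) uses the Float's own decimal string representation:
    val viaDecimal = BigDecimal.decimal(f)     // 0.1
    ```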
    
    ## How was this patch tested?
    
    N/A
    
    cc yhuai
    
    Author: Sameer Agarwal <sameer@databricks.com>
    
    Closes #11597 from sameeragarwal/bigdecimal.
    sameeragarwal authored and yhuai committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    926e9c4 View commit details
    Browse the repository at this point in the history
  7. Configuration menu
    Copy the full SHA
    7906461 View commit details
    Browse the repository at this point in the history
  8. [SPARK-13766][SQL] Consistent file extensions for files written by in…

    …ternal data sources
    
    ## What changes were proposed in this pull request?
    
    https://issues.apache.org/jira/browse/SPARK-13766
    This PR makes the file extensions written by the internal data sources consistent.
    
    **Before**
    
    - TEXT, CSV and JSON
    ```
    [.COMPRESSION_CODEC_NAME]
    ```
    
    - Parquet
    ```
    [.COMPRESSION_CODEC_NAME].parquet
    ```
    
    - ORC
    ```
    .orc
    ```
    
    **After**
    
    - TEXT, CSV and JSON
    ```
    .txt[.COMPRESSION_CODEC_NAME]
    .csv[.COMPRESSION_CODEC_NAME]
    .json[.COMPRESSION_CODEC_NAME]
    ```
    
    - Parquet
    ```
    [.COMPRESSION_CODEC_NAME].parquet
    ```
    
    - ORC
    ```
    [.COMPRESSION_CODEC_NAME].orc
    ```
    
    When the compression codec is set,
    - For Parquet and ORC, each file still stays in Parquet or ORC format but just compresses its data internally, so I think it is okay to keep `.parquet` and `.orc` at the end.
    
    - For text, CSV and JSON, the file is no longer stored in its original format but in a form that depends on the compression codec, so each keeps `.json`, `.csv` or `.txt` before the compression extension (see the sketch below).
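    A minimal sketch of the naming rule described above (the helper is hypothetical, not the actual writer code):
    
    ```scala
    // codecExt is e.g. Some(".gz") or Some(".snappy"); None means no compression.
    def outputFileName(base: String, format: String, codecExt: Option[String]): String = {
      val codec = codecExt.getOrElse("")
      format match {
        case "parquet" | "orc" => s"$base$codec.$format"   // codec before the format extension
        case "text"            => s"$base.txt$codec"       // data extension first, then codec
        case "csv" | "json"    => s"$base.$format$codec"
      }
    }
    
    // outputFileName("part-00000", "csv", Some(".gz"))         => part-00000.csv.gz
    // outputFileName("part-00000", "parquet", Some(".snappy")) => part-00000.snappy.parquet
    ```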
    
    ## How was this patch tested?
    
    Unit tests are used and `./dev/run_tests` for coding style tests.
    
    Author: hyukjinkwon <gurwls223@gmail.com>
    
    Closes #11604 from HyukjinKwon/SPARK-13766.
    HyukjinKwon authored and rxin committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    aa0eba2 View commit details
    Browse the repository at this point in the history
  9. [SPARK-13794][SQL] Rename DataFrameWriter.stream() DataFrameWriter.st…

    …artStream()
    
    ## What changes were proposed in this pull request?
    The new name makes it more obvious with the verb "start" that we are actually starting some execution.
    
    ## How was this patch tested?
    This is just a rename. Existing unit tests should cover it.
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes #11627 from rxin/SPARK-13794.
    rxin committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    8a3acb7 View commit details
    Browse the repository at this point in the history
  10. [SPARK-7420][STREAMING][TESTS] Enable test: o.a.s.streaming.JobGenera…

    …torSuite "Do not clear received…
    
    ## How was this patch tested?
    
    unit test
    
    Author: proflin <proflin.me@gmail.com>
    
    Closes #11626 from lw-lin/SPARK-7420.
    lw-lin authored and rxin committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    8bcad28 View commit details
    Browse the repository at this point in the history
  11. [SPARK-13706][ML] Add Python Example for Train Validation Split

    ## What changes were proposed in this pull request?
    
    This pull request adds a python example for train validation split.
    
    ## How was this patch tested?
    
    This was style tested through lint-python, generally tested with ./dev/run-tests, and run in notebook and shell environments. It was viewed in docs locally with jekyll serve.
    
    This contribution is my original work and I license it to Spark under its open source license.
    
    Author: JeremyNixon <jnixon2@gmail.com>
    
    Closes #11547 from JeremyNixon/tvs_example.
    JeremyNixon authored and MLnick committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    3e3c3d5 View commit details
    Browse the repository at this point in the history
  12. [MINOR][SQL] Replace DataFrameWriter.stream() with startStream() in c…

    …omments.
    
    ## What changes were proposed in this pull request?
    
    Following #11627, this PR replaces `DataFrameWriter.stream()` with `startStream()` in the comments of `ContinuousQueryListener.java`.
    
    ## How was this patch tested?
    
    Manual. (It only changes comments.)
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #11629 from dongjoon-hyun/minor_rename.
    dongjoon-hyun authored and rxin committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    9525c56 View commit details
    Browse the repository at this point in the history
  13. [SPARK-11108][ML] OneHotEncoder should support other numeric types

    Adding support for other numeric types:
    
    * Integer
    * Short
    * Long
    * Float
    * Decimal
    
    Author: sethah <seth.hendrickson16@gmail.com>
    
    Closes #9777 from sethah/SPARK-11108.
    sethah authored and MLnick committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    9fe38ab View commit details
    Browse the repository at this point in the history
  14. [SPARK-13663][CORE] Upgrade Snappy Java to 1.1.2.1

    ## What changes were proposed in this pull request?
    
    Update snappy to 1.1.2.1 to pull in a single fix -- the OOM fix we already worked around.
    Supersedes #11524
    
    ## How was this patch tested?
    
    Jenkins tests.
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes #11631 from srowen/SPARK-13663.
    srowen committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    927e22e View commit details
    Browse the repository at this point in the history
  15. [SPARK-13758][STREAMING][CORE] enhance exception message to avoid mis…

    …leading
    
    We have a recoverable Spark Streaming job with checkpointing enabled; it runs correctly the first time, but throws the following exception when restarted and recovered from the checkpoint.
    ```
    org.apache.spark.SparkException: RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
     	at org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$sc(RDD.scala:87)
     	at org.apache.spark.rdd.RDD.withScope(RDD.scala:352)
     	at org.apache.spark.rdd.RDD.union(RDD.scala:565)
     	at org.apache.spark.streaming.Repo$$anonfun$createContext$1.apply(Repo.scala:23)
     	at org.apache.spark.streaming.Repo$$anonfun$createContext$1.apply(Repo.scala:19)
     	at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:627)
    ```
    
    According to the exception, it looks as if I invoked transformations and actions inside other transformations, but I did not. The real reason is that I used an external RDD in a DStream operation. External RDD data is not stored in the checkpoint, so during recovery the initial value of _sc in this RDD is null, which triggers the exception above. The error message is misleading: it says nothing about the real issue.
    Here is the code to reproduce it.
    
    ```scala
    object Repo {
    
      def createContext(ip: String, port: Int, checkpointDirectory: String):StreamingContext = {
    
        println("Creating new context")
        val sparkConf = new SparkConf().setAppName("Repo").setMaster("local[2]")
        val ssc = new StreamingContext(sparkConf, Seconds(2))
        ssc.checkpoint(checkpointDirectory)
    
        var cached = ssc.sparkContext.parallelize(Seq("apple, banana"))
    
        val words = ssc.socketTextStream(ip, port).flatMap(_.split(" "))
        words.foreachRDD((rdd: RDD[String]) => {
          val res = rdd.map(word => (word, word.length)).collect()
          println("words: " + res.mkString(", "))
    
          cached = cached.union(rdd)
          cached.checkpoint()
          println("cached words: " + cached.collect.mkString(", "))
        })
        ssc
      }
    
      def main(args: Array[String]) {
    
        val ip = "localhost"
        val port = 9999
        val dir = "/home/maowei/tmp"
    
        val ssc = StreamingContext.getOrCreate(dir,
          () => {
            createContext(ip, port, dir)
          })
        ssc.start()
        ssc.awaitTermination()
      }
    }
    ```
    
    Author: mwws <wei.mao@intel.com>
    
    Closes #11595 from mwws/SPARK-MissleadingLog.
    wei-mao-intel authored and srowen committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    74267be View commit details
    Browse the repository at this point in the history
  16. [SPARK-13636] [SQL] Directly consume UnsafeRow in wholestage codegen …

    …plans
    
    JIRA: https://issues.apache.org/jira/browse/SPARK-13636
    
    ## What changes were proposed in this pull request?
    
    As shown in the whole-stage codegen version of the Sort operator, when Sort is on top of Exchange (or another operator that produces UnsafeRow), we create variables from the UnsafeRow and then create another UnsafeRow from those variables. We should avoid this unnecessary unpacking and repacking of variables from UnsafeRows.
    
    ## How was this patch tested?
    
    All existing wholestage codegen tests should be passed.
    
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    
    Closes #11484 from viirya/direct-consume-unsaferow.
    viirya authored and davies committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    d24801a View commit details
    Browse the repository at this point in the history
  17. [SPARK-13727][CORE] SparkConf.contains does not consider deprecated keys

    The contains() method does not behave consistently with get() when the key is deprecated. For example,
    import org.apache.spark.SparkConf
    val conf = new SparkConf()
    conf.set("spark.io.compression.lz4.block.size", "12345")  # display some deprecated warning message
    conf.get("spark.io.compression.lz4.block.size") # return 12345
    conf.get("spark.io.compression.lz4.blockSize") # return 12345
    conf.contains("spark.io.compression.lz4.block.size") # return true
    conf.contains("spark.io.compression.lz4.blockSize") # return false
    
    The fix makes contains() and get() consistent.
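    A minimal sketch of the idea (a simplified model of the settings map, not SparkConf's actual code): contains() consults the same deprecated/current key equivalence that get() understands.
    
    ```scala
    // Pairs of equivalent keys (deprecated alternative -> current name).
    val deprecatedToCurrent = Map(
      "spark.io.compression.lz4.block.size" -> "spark.io.compression.lz4.blockSize")
    val currentToDeprecated = deprecatedToCurrent.map(_.swap)
    
    // Check the key itself plus any equivalent spelling, so contains() agrees
    // with a get() that understands deprecated keys.
    def contains(settings: Map[String, String], key: String): Boolean =
      settings.contains(key) ||
        deprecatedToCurrent.get(key).exists(settings.contains) ||
        currentToDeprecated.get(key).exists(settings.contains)
    
    // With settings = Map("spark.io.compression.lz4.block.size" -> "12345"):
    // contains(settings, "spark.io.compression.lz4.block.size") => true
    // contains(settings, "spark.io.compression.lz4.blockSize")  => true
    ```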
    
    I've added a test case for this.
    
    Unit tests should be sufficient.
    
    Author: bomeng <bmeng@us.ibm.com>
    
    Closes #11568 from bomeng/SPARK-13727.
    bomeng authored and Marcelo Vanzin committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    235f4ac View commit details
    Browse the repository at this point in the history
  18. [SPARK-13759][SQL] Add IsNotNull constraints for expressions with an …

    …inequality
    
    ## What changes were proposed in this pull request?
    
    This PR adds support for inferring `IsNotNull` constraints from expressions with an `!==`. More specifically, if an operator has a condition on `a !== b`, we know that both `a` and `b` in the operator output can no longer be null.
    
    ## How was this patch tested?
    
    1. Modified a test in `ConstraintPropagationSuite` to test for expressions with an inequality.
    2. Added a test in `NullFilteringSuite` for making sure an Inner join with a "non-equal" condition appropriately filters out null from their input.
    
    cc nongli
    
    Author: Sameer Agarwal <sameer@databricks.com>
    
    Closes #11594 from sameeragarwal/isnotequal-constraints.
    sameeragarwal authored and yhuai committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    19f4ac6 View commit details
    Browse the repository at this point in the history
  19. [SPARK-13790] Speed up ColumnVector's getDecimal

    ## What changes were proposed in this pull request?
    
    We should reuse an object similar to the other non-primitive type getters. For
    a query that computes averages over decimal columns, this shows a 10% speedup
    on overall query times.
    
    ## How was this patch tested?
    
    Existing tests and this benchmark
    
    ```
    TPCDS Snappy:                       Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)
    --------------------------------------------------------------------------------
    q27-agg (master)                       10627 / 11057         10.8          92.3
    q27-agg (this patch)                     9722 / 9832         11.8          84.4
    ```
    
    Author: Nong Li <nong@databricks.com>
    
    Closes #11624 from nongli/spark-13790.
    nongli authored and davies committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    747d2f5 View commit details
    Browse the repository at this point in the history
  20. [SQL][TEST] Increased timeouts to reduce flakiness in ContinuousQuery…

    …ManagerSuite
    
    ## What changes were proposed in this pull request?
    
    ContinuousQueryManager is sometimes flaky on Jenkins. I could not reproduce it on my machine, so I guess it is about the waiting times, which cause problems when Jenkins is under load. I have increased the wait times in the hope that it will be less flaky.
    
    ## How was this patch tested?
    
    I reran the unit test many times in a loop on my machine. I am going to run it a few times in Jenkins; that is the real test.
    
    Author: Tathagata Das <tathagata.das1565@gmail.com>
    
    Closes #11638 from tdas/cqm-flaky-test.
    tdas authored and yhuai committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    3d2b6f5 View commit details
    Browse the repository at this point in the history
  21. [SPARK-13696] Remove BlockStore class & simplify interfaces of mem. &…

    … disk stores
    
    Today, both the MemoryStore and DiskStore implement a common `BlockStore` API, but I feel that this API is inappropriate because it abstracts away important distinctions between the behavior of these two stores.
    
    For instance, the disk store doesn't have a notion of storing deserialized objects, so it's confusing for it to expose object-based APIs like putIterator() and getValues() instead of only exposing binary APIs and pushing the responsibilities of serialization and deserialization to the client. Similarly, the DiskStore put() methods accepted a `StorageLevel` parameter even though the disk store can only store blocks in one form.
    
    As part of a larger BlockManager interface cleanup, this patch removes the BlockStore interface and refines the MemoryStore and DiskStore interfaces to reflect narrower sets of responsibilities for those components. Some of the benefits of this interface cleanup are reflected in simplifications to several unit tests to eliminate now-unnecessary mocking, significant simplification of the BlockManager's `getLocal()` and `doPut()` methods, and a narrower API between the MemoryStore and DiskStore.
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #11534 from JoshRosen/remove-blockstore-interface.
    JoshRosen authored and Andrew Or committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    81d4853 View commit details
    Browse the repository at this point in the history
  22. [SPARK-3854][BUILD] Scala style: require spaces before {.

    ## What changes were proposed in this pull request?
    
    Since the opening curly brace, '{', has many usages as discussed in [SPARK-3854](https://issues.apache.org/jira/browse/SPARK-3854), this PR adds a ScalaStyle rule that requires a space before '{' (preventing the '){' pattern shown below) and fixes the code accordingly. If we enforce this in ScalaStyle from now on, it will improve Scala code quality and reduce review time.
    ```
    // Correct:
    if (true) {
      println("Wow!")
    }
    
    // Incorrect:
    if (true){
       println("Wow!")
    }
    ```
    IntelliJ also shows new warnings based on this.
    
    ## How was this patch tested?
    
    Pass the Jenkins ScalaStyle test.
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #11637 from dongjoon-hyun/SPARK-3854.
    dongjoon-hyun authored and Andrew Or committed Mar 10, 2016
    Configuration menu
    Copy the full SHA
    91fed8e View commit details
    Browse the repository at this point in the history

Commits on Mar 11, 2016

  1. [SPARK-13751] [SQL] generate better code for Filter

    ## What changes were proposed in this pull request?
    
    This PR improve the codegen of Filter by:
    
    1. Filter out rows early if they contain a null value that would cause the condition to evaluate to null or false. After this, we can simplify the condition, because the inputs are no longer nullable.
    
    2. Split the condition into conjunctive predicates, then check them one by one.
    
    Here is a piece of generated code for Filter in TPCDS Q55:
    ```java
    /* 109 */       /*** CONSUME: Filter ((((isnotnull(d_moy#149) && isnotnull(d_year#147)) && (d_moy#149 = 11)) && (d_year#147 = 1999)) && isnotnull(d_date_sk#141)) */
    /* 110 */       /* input[0, int] */
    /* 111 */       boolean project_isNull2 = rdd_row.isNullAt(0);
    /* 112 */       int project_value2 = project_isNull2 ? -1 : (rdd_row.getInt(0));
    /* 113 */       /* input[1, int] */
    /* 114 */       boolean project_isNull3 = rdd_row.isNullAt(1);
    /* 115 */       int project_value3 = project_isNull3 ? -1 : (rdd_row.getInt(1));
    /* 116 */       /* input[2, int] */
    /* 117 */       boolean project_isNull4 = rdd_row.isNullAt(2);
    /* 118 */       int project_value4 = project_isNull4 ? -1 : (rdd_row.getInt(2));
    /* 119 */
    /* 120 */       if (project_isNull3) continue;
    /* 121 */       if (project_isNull4) continue;
    /* 122 */       if (project_isNull2) continue;
    /* 123 */
    /* 124 */       /* (input[1, int] = 11) */
    /* 125 */       boolean filter_value6 = false;
    /* 126 */       filter_value6 = project_value3 == 11;
    /* 127 */       if (!filter_value6) continue;
    /* 128 */
    /* 129 */       /* (input[2, int] = 1999) */
    /* 130 */       boolean filter_value9 = false;
    /* 131 */       filter_value9 = project_value4 == 1999;
    /* 132 */       if (!filter_value9) continue;
    /* 133 */
    /* 134 */       filter_metricValue1.add(1);
    /* 135 */
    /* 136 */       /*** CONSUME: Project [d_date_sk#141] */
    /* 137 */
    /* 138 */       project_rowWriter1.write(0, project_value2);
    /* 139 */       append(project_result1.copy());
    ```
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #11585 from davies/gen_filter.
    Davies Liu authored and davies committed Mar 11, 2016
    Configuration menu
    Copy the full SHA
    020ff8c View commit details
    Browse the repository at this point in the history
  2. [SPARK-13604][CORE] Sync worker's state after registering with master

    ## What changes were proposed in this pull request?
    
    Here are all the cases where the Master cannot talk with the Worker for a while and then the network comes back.
    
    1. Master doesn't know the network issue (not yet timeout)
    
      a. Worker doesn't know the network issue (onDisconnected is not called)
        - Worker keeps sending Heartbeat. Both Worker and Master don't know the network issue. Nothing to do. (Finally, Master will notice the heartbeat timeout if network is not recovered)
    
      b. Worker knows the network issue (onDisconnected is called)
        - Worker stops sending Heartbeat and sends `RegisterWorker` to master. Master will reply `RegisterWorkerFailed("Duplicate worker ID")`. Worker calls "System.exit(1)" (Finally, Master will notice the heartbeat timeout if network is not recovered) (May leak driver processes. See [SPARK-13602](https://issues.apache.org/jira/browse/SPARK-13602))
    
    2. Worker timeout (Master knows the network issue). In such case,  master removes Worker and its executors and drivers.
    
      a. Worker doesn't know the network issue (onDisconnected is not called)
        - Worker keeps sending Heartbeat.
        - If the network is back, say Master receives Heartbeat, Master sends `ReconnectWorker` to Worker
        - Worker send `RegisterWorker` to master.
        - Master accepts `RegisterWorker` but doesn't know executors and drivers in Worker. (may leak executors)
    
      b. Worker knows the network issue (onDisconnected is called)
        - Worker stops sending `Heartbeat` and will send `RegisterWorker` to the master.
        - Master accepts `RegisterWorker` but doesn't know executors and drivers in Worker. (may leak executors)
    
    This PR fixes the executor and driver leaks in 2.a and 2.b when the Worker re-registers with the Master. The approach is to make the Worker send `WorkerLatestState` to sync its state after registering with the Master successfully. Then the Master will ask the Worker to kill unknown executors and drivers.
    
    Note: the Worker cannot just kill executors after registering with the Master because, in the Worker, `LaunchExecutor` and `RegisteredWorker` are processed in two threads. If `LaunchExecutor` happens before `RegisteredWorker`, the Worker's executor list will contain new executors after the Master accepts `RegisterWorker`. We should not kill these executors. So the Worker sends the list to the Master and lets the Master tell the Worker which executors should be killed.
    
    ## How was this patch tested?
    
    test("SPARK-13604: Master should ask Worker kill unknown executors and drivers")
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #11455 from zsxwing/orphan-executors.
    zsxwing authored and Andrew Or committed Mar 11, 2016
    Configuration menu
    Copy the full SHA
    27fe6ba View commit details
    Browse the repository at this point in the history
  3. [SPARK-13244][SQL] Migrates DataFrame to Dataset

    ## What changes were proposed in this pull request?
    
    This PR unifies DataFrame and Dataset by migrating existing DataFrame operations to Dataset and making `DataFrame` a type alias of `Dataset[Row]`.
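    At its core, the unification boils down to making `DataFrame` an alias, roughly like the following sketch (the package object is shown for illustration only):
    
    ```scala
    package object sql {
      // DataFrame is no longer a separate class; it is just a Dataset of Rows.
      type DataFrame = Dataset[Row]
    }
    ```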
    
    Most Scala code changes are source compatible, but Java API is broken as Java knows nothing about Scala type alias (mostly replacing `DataFrame` with `Dataset<Row>`).
    
    There are several noticeable API changes related to those returning arrays:
    
    1.  `collect`/`take`
    
        -   Old APIs in class `DataFrame`:
    
            ```scala
            def collect(): Array[Row]
            def take(n: Int): Array[Row]
            ```
    
        -   New APIs in class `Dataset[T]`:
    
            ```scala
            def collect(): Array[T]
            def take(n: Int): Array[T]
    
            def collectRows(): Array[Row]
            def takeRows(n: Int): Array[Row]
            ```
    
        Two specialized methods, `collectRows` and `takeRows`, are added because Java doesn't support returning generic arrays. Thus, for example, `Dataset[T].collect(): Array[T]` actually returns `Object` instead of `Array<T>` on the Java side.
    
        Normally, Java users may fall back to `collectAsList` and `takeAsList`.  The two new specialized versions are added to avoid performance regression in ML related code (but maybe I'm wrong and they are not necessary here).
    
    1.  `randomSplit`
    
        -   Old APIs in class `DataFrame`:
    
            ```scala
            def randomSplit(weights: Array[Double], seed: Long): Array[DataFrame]
            def randomSplit(weights: Array[Double]): Array[DataFrame]
            ```
    
        -   New APIs in class `Dataset[T]`:
    
            ```scala
            def randomSplit(weights: Array[Double], seed: Long): Array[Dataset[T]]
            def randomSplit(weights: Array[Double]): Array[Dataset[T]]
            ```
    
        Similar problem as above, but hasn't been addressed for Java API yet.  We can probably add `randomSplitAsList` to fix this one.
    
    1.  `groupBy`
    
        Some original `DataFrame.groupBy` methods have conflicting signature with original `Dataset.groupBy` methods.  To distinguish these two, typed `Dataset.groupBy` methods are renamed to `groupByKey`.
    
    Other noticeable changes:
    
    1.  Dataset always does eager analysis now
    
        We used to support disabling DataFrame eager analysis to help report partially analyzed, malformed logical plans on analysis failure.  However, Dataset encoders require eager analysis during Dataset construction.  To preserve the error reporting feature, `AnalysisException` now takes an extra `Option[LogicalPlan]` argument to hold the partially analyzed plan, so that we can check the plan tree when reporting test failures.  This plan is passed by `QueryExecution.assertAnalyzed`.
    
    ## How was this patch tested?
    
    Existing tests do the work.
    
    ## TODO
    
    - [ ] Fix all tests
    - [ ] Re-enable MiMA check
    - [ ] Update ScalaDoc (`since`, `group`, and example code)
    
    Author: Cheng Lian <lian@databricks.com>
    Author: Yin Huai <yhuai@databricks.com>
    Author: Wenchen Fan <wenchen@databricks.com>
    Author: Cheng Lian <liancheng@users.noreply.github.com>
    
    Closes #11443 from liancheng/ds-to-df.
    liancheng authored and yhuai committed Mar 11, 2016
    Configuration menu
    Copy the full SHA
    1d54278 View commit details
    Browse the repository at this point in the history
  4. [MINOR][DOC] Fix supported hive version in doc

    ## What changes were proposed in this pull request?
    
    Today, Spark 1.6.1 and the updated docs were released. Unfortunately, there is obsolete Hive version information in the docs: [Building Spark](http://spark.apache.org/docs/latest/building-spark.html#building-with-hive-and-jdbc-support). This PR fixes the following two lines.
    ```
    -By default Spark will build with Hive 0.13.1 bindings.
    +By default Spark will build with Hive 1.2.1 bindings.
    -# Apache Hadoop 2.4.X with Hive 13 support
    +# Apache Hadoop 2.4.X with Hive 1.2.1 support
    ```
    The `sql/README.md` file also describes the Hive version.
    
    ## How was this patch tested?
    
    Manual.
    
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #11639 from dongjoon-hyun/fix_doc_hive_version.
    dongjoon-hyun authored and rxin committed Mar 11, 2016
    Configuration menu
    Copy the full SHA
    88fa866 View commit details
    Browse the repository at this point in the history
  5. [SPARK-13327][SPARKR] Added parameter validations for colnames<-

    Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.attlocal.net>
    Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.usca.ibm.com>
    
    Closes #11220 from olarayej/SPARK-13312-3.
    Oscar D. Lara Yejas authored and shivaram committed Mar 11, 2016
    Configuration menu
    Copy the full SHA
    416e71a View commit details
    Browse the repository at this point in the history
  6. [SPARK-13789] Infer additional constraints from attribute equality

    ## What changes were proposed in this pull request?
    
    This PR adds support for inferring an additional set of data constraints based on attribute equality. For example, if an operator has constraints of the form (`a = 5`, `a = b`), we can now automatically infer an additional constraint of the form `b = 5`.
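    A small sketch with assumed column names and an illustrative DataFrame: given the filter below, the analyzer already tracks `a = 5` and `a = b`, and with this change it can also infer `b = 5`:
    
    ```scala
    // df is an illustrative DataFrame with integer columns a and b.
    val df = sqlContext.range(10).selectExpr("id AS a", "(id % 3) AS b")
    // This filter carries constraints a = 5 and a = b; the new rule also infers b = 5.
    val inferred = df.filter(df("a") === 5 && df("a") === df("b"))
    ```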
    
    ## How was this patch tested?
    
    Tested that new constraints are properly inferred for filters (by adding a new test) and equi-joins (by modifying an existing test)
    
    Author: Sameer Agarwal <sameer@databricks.com>
    
    Closes #11618 from sameeragarwal/infer-isequal-constraints.
    sameeragarwal authored and rxin committed Mar 11, 2016
    Configuration menu
    Copy the full SHA
    c3a6269 View commit details
    Browse the repository at this point in the history
  7. [SPARK-13389][SPARKR] SparkR support first/last with ignore NAs

    ## What changes were proposed in this pull request?
    
    SparkR support first/last with ignore NAs
    
    cc sun-rui felixcheung shivaram
    
    ## How was this patch tested?
    
    unit tests
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes #11267 from yanboliang/spark-13389.
    yanboliang authored and shivaram committed Mar 11, 2016
    Configuration menu
    Copy the full SHA
    4d535d1 View commit details
    Browse the repository at this point in the history
  8. [SPARK-13732][SPARK-13797][SQL] Remove projectList from Window and El…

    …iminate useless Window
    
    #### What changes were proposed in this pull request?
    
    `projectList` is redundant: its value is always the same as `child.output`. Remove it from the `Window` class. Removing it simplifies the code in the Analyzer and Optimizer.
    
    This PR is based on the discussion started by cloud-fan in a separate PR:
    #5604 (comment)
    
    This PR also eliminates useless `Window`.
    
    cloud-fan yhuai
    
    #### How was this patch tested?
    
    Existing test cases cover it.
    
    Author: gatorsmile <gatorsmile@gmail.com>
    Author: xiaoli <lixiao1983@gmail.com>
    Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
    
    Closes #11565 from gatorsmile/removeProjListWindow.
    gatorsmile authored and cloud-fan committed Mar 11, 2016
    Configuration menu
    Copy the full SHA
    560489f View commit details
    Browse the repository at this point in the history
  9. [SPARK-12718][SPARK-13720][SQL] SQL generation support for window fun…

    …ctions
    
    ## What changes were proposed in this pull request?
    
    Add SQL generation support for window functions. The idea is simple: just treat the `Window` operator like `Project`, i.e. add a subquery to its child when necessary, generate a `SELECT ... FROM ...` SQL string, and implement the `sql` method for window-related expressions, e.g. `WindowSpecDefinition`, `WindowFrame`, etc.
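    For illustration (the table and column names are assumed, not taken from the patch), a query like the following, which contains a window function, should now be expressible back as a SQL string by the generator:
    
    ```scala
    // Assumed table t with columns key and value.
    val df = sqlContext.sql(
      "SELECT key, value, MAX(value) OVER (PARTITION BY key ORDER BY value) AS running_max FROM t")
    ```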
    
    This PR also fixes SPARK-13720 by improving the process of adding an extra `SubqueryAlias` (the `RecoverScopingInfo` rule). Before this PR, we updated the qualifiers in the project list while adding the subquery. However, this is incomplete, as we need to update qualifiers in all ancestors that refer to attributes here. In this PR, we split `RecoverScopingInfo` into 2 rules: `AddSubQuery` and `UpdateQualifier`. `AddSubQuery` only adds a subquery if necessary, and `UpdateQualifier` re-propagates and updates qualifiers bottom up.
    
    Ideally we should put the bug fix part in an individual PR, but this bug also blocks the window stuff, so I put them together here.
    
    Many thanks to gatorsmile for the initial discussion and test cases!
    
    ## How was this patch tested?
    
    new tests in `LogicalPlanToSQLSuite`
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #11555 from cloud-fan/window.
    cloud-fan committed Mar 11, 2016
    Configuration menu
    Copy the full SHA
    6871cc8 View commit details
    Browse the repository at this point in the history
  10. [HOT-FIX] fix compile

    Fix the compilation failure introduced by #11555 because of a merge conflict.
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #11648 from cloud-fan/hotbug.
    cloud-fan committed Mar 11, 2016
    Configuration menu
    Copy the full SHA
    74c4e26 View commit details
    Browse the repository at this point in the history
  11. [MINOR][CORE] Fix a duplicate "and" in a log message.

    Author: Marcelo Vanzin <vanzin@cloudera.com>
    
    Closes #11642 from vanzin/spark-conf-typo.
    Marcelo Vanzin authored and rxin committed Mar 11, 2016
    Configuration menu
    Copy the full SHA
    e33bc67 View commit details
    Browse the repository at this point in the history
  12. [SPARK-13672][ML] Add python examples of BisectingKMeans in ML and MLLIB

    JIRA: https://issues.apache.org/jira/browse/SPARK-13672
    
    ## What changes were proposed in this pull request?
    
    add two python examples of BisectingKMeans for ml and mllib
    
    ## How was this patch tested?
    
    manual tests
    
    Author: Zheng RuiFeng <ruifengz@foxmail.com>
    
    Closes #11515 from zhengruifeng/mllib_bkm_pe.
    zhengruifeng authored and MLnick committed Mar 11, 2016
    Configuration menu
    Copy the full SHA
    d18276c View commit details
    Browse the repository at this point in the history
  13. [SPARK-13294][PROJECT INFRA] Remove MiMa's dependency on spark-class …

    …/ Spark assembly
    
    This patch removes the need to build a full Spark assembly before running the `dev/mima` script.
    
    - I modified the `tools` project to remove a direct dependency on Spark, so `sbt/sbt tools/fullClasspath` will now return the classpath for the `GenerateMIMAIgnore` class itself plus its own dependencies.
       - This required me to delete two classes full of dead code that we don't use anymore
    - `GenerateMIMAIgnore` now uses [ClassUtil](http://software.clapper.org/classutil/) to find all of the Spark classes rather than our homemade JAR traversal code. The problem in our own code was that it didn't handle folders of classes properly, which is necessary in order to generate excludes with an assembly-free Spark build.
    - `./dev/mima` no longer runs through `spark-class`, eliminating the need to reason about classpath ordering between `SPARK_CLASSPATH` and the assembly.
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #11178 from JoshRosen/remove-assembly-in-run-tests.
    JoshRosen committed Mar 11, 2016
    Configuration menu
    Copy the full SHA
    6ca990f View commit details
    Browse the repository at this point in the history
  14. [SPARK-13512][ML] add example and doc for MaxAbsScaler

    ## What changes were proposed in this pull request?
    
    jira: https://issues.apache.org/jira/browse/SPARK-13512
    Add example and doc for ml.feature.MaxAbsScaler.
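    A minimal usage sketch (the input/output column names and the input DataFrame `dataFrame`, which is assumed to contain a vector column, are illustrative):
    
    ```scala
    import org.apache.spark.ml.feature.MaxAbsScaler
    
    // Rescale each feature to [-1, 1] by dividing by the feature's max absolute value.
    val scaler = new MaxAbsScaler()
      .setInputCol("features")
      .setOutputCol("scaledFeatures")
    val scalerModel = scaler.fit(dataFrame)
    val scaledData = scalerModel.transform(dataFrame)
    ```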
    
    ## How was this patch tested?
     unit tests
    
    Author: Yuhao Yang <hhbyyh@gmail.com>
    
    Closes #11392 from hhbyyh/maxabsdoc.
    hhbyyh authored and MLnick committed Mar 11, 2016
    Configuration menu
    Copy the full SHA
    0b713e0 View commit details
    Browse the repository at this point in the history
  15. [SPARK-13787][ML][PYSPARK] Pyspark feature importances for decision t…

    …ree and random forest
    
    ## What changes were proposed in this pull request?
    
    This patch adds a `featureImportance` property to the Pyspark API for `DecisionTreeRegressionModel`, `DecisionTreeClassificationModel`, `RandomForestRegressionModel` and `RandomForestClassificationModel`.
    
    ## How was this patch tested?
    
    Python doc tests for the affected classes were updated to check feature importances.
    
    Author: sethah <seth.hendrickson16@gmail.com>
    
    Closes #11622 from sethah/SPARK-13787.
    sethah authored and MLnick committed Mar 11, 2016
    Configuration menu
    Copy the full SHA
    234f781 View commit details
    Browse the repository at this point in the history
  16. [HOT-FIX][SQL][ML] Fix compile error from use of DataFrame in Java Ma…

    …xAbsScaler example
    
    ## What changes were proposed in this pull request?
    
    Fix build failure introduced in #11392 (change `DataFrame` -> `Dataset<Row>`).
    
    ## How was this patch tested?
    
    Existing build/unit tests
    
    Author: Nick Pentreath <nick.pentreath@gmail.com>
    
    Closes #11653 from MLnick/java-maxabs-example-fix.
    MLnick committed Mar 11, 2016
    Configuration menu
    Copy the full SHA
    8fff0f9 View commit details
    Browse the repository at this point in the history
  17. [SPARK-13577][YARN] Allow Spark jar to be multiple jars, archive.

    In preparation for the demise of assemblies, this change allows the
    YARN backend to use multiple jars and globs as the "Spark jar". The
    config option has been renamed to "spark.yarn.jars" to reflect that.
    
    A second option "spark.yarn.archive" was also added; if set, this
    takes precedence and uploads an archive expected to contain the jar
    files with the Spark code and its dependencies.
    
    Existing deployments should keep working, mostly. This change drops
    support for the "SPARK_JAR" environment variable, and also does not
    fall back to using "jarOfClass" if no configuration is set, falling
    back to finding files under SPARK_HOME instead. This should be fine
    since "jarOfClass" probably wouldn't work unless you were using
    spark-submit anyway.
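    A configuration sketch (the paths are illustrative, not from the patch):
    
    ```scala
    import org.apache.spark.SparkConf
    
    // Point YARN at a set of jars instead of a single assembly...
    val conf = new SparkConf()
      .set("spark.yarn.jars", "hdfs:///spark/jars/*")
    // ...or ship one archive containing them; if both are set, the archive takes precedence.
    //  .set("spark.yarn.archive", "hdfs:///spark/spark-libs.zip")
    ```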
    
    Tested with the unit tests, and trying the different config options
    on a YARN cluster.
    
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    
    Closes #11500 from vanzin/SPARK-13577.
    Marcelo Vanzin authored and Tom Graves committed Mar 11, 2016
    Configuration menu
    Copy the full SHA
    07f1c54 View commit details
    Browse the repository at this point in the history
  18. [SPARK-13817][BUILD][SQL] Re-enable MiMA and removes object DataFrame

    ## What changes were proposed in this pull request?
    
    PR #11443 temporarily disabled MiMA check, this PR re-enables it.
    
    One extra change is that `object DataFrame` is also removed. The only purpose of introducing `object DataFrame` was to use it as an internal factory for creating `Dataset[Row]`. By replacing this internal factory with `Dataset.newDataFrame`, both `DataFrame` and `DataFrame$` are entirely removed from the API, so that we can simply put a `MissingClassProblem` filter in `MimaExcludes.scala` for most DataFrame API  changes.
    
    ## How was this patch tested?
    
    Tested by MiMA check triggered by Jenkins.
    
    Author: Cheng Lian <lian@databricks.com>
    
    Closes #11656 from liancheng/re-enable-mima.
    liancheng committed Mar 11, 2016
    Configuration menu
    Copy the full SHA
    6d37e1e View commit details
    Browse the repository at this point in the history
  19. [SPARK-13780][SQL] Add missing dependency to build.

    This is needed to avoid odd compiler errors when building just the
    sql package with maven, because of odd interactions between scalac
    and shaded classes.
    
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    
    Closes #11640 from vanzin/SPARK-13780.
    Marcelo Vanzin committed Mar 11, 2016
    Configuration menu
    Copy the full SHA
    99b7187 View commit details
    Browse the repository at this point in the history
  20. [STREAMING][MINOR] Fix a duplicate "be" in comments

    Author: Liwei Lin <proflin.me@gmail.com>
    
    Closes #11650 from lw-lin/typo.
    lw-lin authored and rxin committed Mar 11, 2016
    Configuration menu
    Copy the full SHA
    eb650a8 View commit details
    Browse the repository at this point in the history
  21. [SPARK-13328][CORE] Poor read performance for broadcast variables wit…

    …h dynamic resource allocation
    
    When dynamic resource allocation is enabled, fetching broadcast variables from removed executors was causing job failures, and SPARK-9591 fixed this problem by trying all locations of a block before giving up. However, the locations of a block are retrieved only once from the driver in this process, and the locations in this list can be stale due to dynamic resource allocation. This situation gets worse when running on a large cluster, where the location list can contain several hundred entries, of which tens may be stale. What we have observed is that with the default settings of 3 max retries and 5s between retries (that's 15s per location), the time it takes to read a broadcast variable can be as high as ~17m (70 failed attempts * 15s/attempt).
    
    Author: Nezih Yigitbasi <nyigitbasi@netflix.com>
    
    Closes #11241 from nezihyigitbasi/SPARK-13328.
    nezihyigitbasi authored and Andrew Or committed Mar 11, 2016
    Configuration menu
    Copy the full SHA
    ff776b2 View commit details
    Browse the repository at this point in the history
  22. [SPARK-13807] De-duplicate Python*Helper instantiation code in PySp…

    …ark streaming
    
    This patch de-duplicates code in PySpark streaming which loads the `Python*Helper` classes. I also changed a few `raise e` statements to simply `raise` in order to preserve the full exception stacktrace when re-throwing.
    
    Here's a link to the whitespace-change-free diff: https://github.com/apache/spark/compare/master...JoshRosen:pyspark-reflection-deduplication?w=0
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #11641 from JoshRosen/pyspark-reflection-deduplication.
    JoshRosen authored and zsxwing committed Mar 11, 2016
    Configuration menu
    Copy the full SHA
    073bf9d View commit details
    Browse the repository at this point in the history
  23. [SPARK-13814] [PYSPARK] Delete unnecessary imports in python examples…

    … files
    
    JIRA:  https://issues.apache.org/jira/browse/SPARK-13814
    
    ## What changes were proposed in this pull request?
    
    delete unnecessary imports in python examples files
    
    ## How was this patch tested?
    
    manual tests
    
    Author: Zheng RuiFeng <ruifengz@foxmail.com>
    
    Closes #11651 from zhengruifeng/del_import_pe.
    zhengruifeng authored and davies committed Mar 11, 2016
    Configuration menu
    Copy the full SHA
    42afd72 View commit details
    Browse the repository at this point in the history
  24. [SPARK-13139][SQL] Parse Hive DDL commands ourselves

    ## What changes were proposed in this pull request?
    
    This patch is ported over from viirya's changes in #11048. Currently for most DDLs we just pass the query text directly to Hive. Instead, we should parse these commands ourselves and in the future (not part of this patch) use the `HiveCatalog` to process these DDLs. This is a pretext to merging `SQLContext` and `HiveContext`.
    
    Note: As of this patch we still pass the query text to Hive. The difference is that we now parse the commands ourselves so in the future we can just use our own catalog.
    
    ## How was this patch tested?
    
    Jenkins, plus a new `DDLCommandSuite`, which comprises about 40% of the changes here.
    
    Author: Andrew Or <andrew@databricks.com>
    
    Closes #11573 from andrewor14/parser-plus-plus.
    Andrew Or authored and yhuai committed Mar 11, 2016
    Configuration menu
    Copy the full SHA
    66d9d0e View commit details
    Browse the repository at this point in the history
  25. [SPARK-13830] prefer block manager than direct result for large result

    ## What changes were proposed in this pull request?
    
    The current RPC can't handle large blocks very well; it's very slow to fetch a 100 MB block (about 1 minute). After switching to the block manager to fetch it, it took about 10 seconds (which could still be improved).
    
    ## How was this patch tested?
    
    existing unit tests.
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #11659 from davies/direct_result.
    Davies Liu authored and davies committed Mar 11, 2016
    Configuration menu
    Copy the full SHA
    2ef4c59 View commit details
    Browse the repository at this point in the history

Commits on Mar 12, 2016

  1. [SPARK-13671] [SPARK-13311] [SQL] Use different physical plans for RD…

    …D and data sources
    
    ## What changes were proposed in this pull request?
    
    This PR splits PhysicalRDD into two classes, PhysicalRDD and PhysicalScan. PhysicalRDD is used for DataFrames created from existing RDDs. PhysicalScan is used for DataFrames created from data sources. This enables us to apply different optimizations to each of them.
    
    Also fix the problem for sameResult() on two DataSourceScan.
    
    Also fix the equality check to toString for `In`. It's better to use Seq there, but we can't break this public API (sad).
    
    ## How was this patch tested?
    
    Existing tests. Manually tested with TPCDS query Q59 and Q64, all those duplicated exchanges can be re-used now, also saw there are 40+% performance improvement (saving half of the scan).
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #11514 from davies/existing_rdd.
    Davies Liu authored and davies committed Mar 12, 2016
    Configuration menu
    Copy the full SHA
    ba8c86d View commit details
    Browse the repository at this point in the history
  2. [SPARK-13828][SQL] Bring back stack trace of AnalysisException thrown…

    … from QueryExecution.assertAnalyzed
    
    PR #11443 added an extra `plan: Option[LogicalPlan]` argument to `AnalysisException` and attached the partially analyzed plan to the thrown `AnalysisException` in `QueryExecution.assertAnalyzed()`.  However, the original stack trace wasn't properly inherited.  This PR fixes this issue by inheriting the stack trace.
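    The idea, shown here as a generic sketch rather than the actual Spark code, is to copy the original stack trace onto the wrapping exception before rethrowing it:
    
    ```scala
    // Hypothetical helper: rethrow `original` wrapped in a new exception while
    // keeping the original stack trace visible to callers.
    def rethrowPreservingTrace(original: Throwable, message: String): Nothing = {
      val wrapped = new RuntimeException(message, original)
      wrapped.setStackTrace(original.getStackTrace)
      throw wrapped
    }
    ```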
    
    A test case is added to verify that the first entry of `AnalysisException` stack trace isn't from `QueryExecution`.
    
    Author: Cheng Lian <lian@databricks.com>
    
    Closes #11677 from liancheng/analysis-exception-stacktrace.
    liancheng authored and rxin committed Mar 12, 2016
    Configuration menu
    Copy the full SHA
    4eace4d View commit details
    Browse the repository at this point in the history

Commits on Mar 13, 2016

  1. [SPARK-13841][SQL] Removes Dataset.collectRows()/takeRows()

    ## What changes were proposed in this pull request?
    
    This PR removes two methods, `collectRows()` and `takeRows()`, from `Dataset[T]`. These methods were added in PR #11443, and were later considered not useful.
    
    ## How was this patch tested?
    
    Existing tests should do the work.
    
    Author: Cheng Lian <lian@databricks.com>
    
    Closes #11678 from liancheng/remove-collect-rows-and-take-rows.
    liancheng committed Mar 13, 2016
    Configuration menu
    Copy the full SHA
    c079420 View commit details
    Browse the repository at this point in the history
  2. [MINOR][DOCS] Replace DataFrame with Dataset in Javadoc.

    ## What changes were proposed in this pull request?
    
    SPARK-13817 (PR #11656) replaces `DataFrame` with `Dataset` from Java. This PR fixes the remaining broken links and sample Java code in `package-info.java`. As a result, it will update the following Javadoc.
    
    * http://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/attribute/package-summary.html
    * http://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/feature/package-summary.html
    
    ## How was this patch tested?
    
    Manual.
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #11675 from dongjoon-hyun/replace_dataframe_with_dataset_in_javadoc.
    dongjoon-hyun authored and liancheng committed Mar 13, 2016
    Configuration menu
    Copy the full SHA
    db88d02 View commit details
    Browse the repository at this point in the history
  3. [SPARK-13810][CORE] Add Port Configuration Suggestions on Bind Except…

    …ions
    
    ## What changes were proposed in this pull request?
    Currently, when a java.net.BindException is thrown, it displays the following message:
    
    java.net.BindException: Address already in use: Service '$serviceName' failed after 16 retries!
    
    This change adds port configuration suggestions to the BindException, for example, for the UI, it now displays
    
    java.net.BindException: Address already in use: Service 'SparkUI' failed after 16 retries! Consider explicitly setting the appropriate port for 'SparkUI' (for example spark.ui.port for SparkUI) to an available port or increasing spark.port.maxRetries.
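    A sketch of acting on the new suggestion (the port and retry values are illustrative):
    
    ```scala
    import org.apache.spark.SparkConf
    
    // Either pin the UI to a port known to be free, or allow more bind retries.
    val conf = new SparkConf()
      .set("spark.ui.port", "4050")
      .set("spark.port.maxRetries", "32")
    ```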
    
    ## How was this patch tested?
    Manual tests
    
    Author: Bjorn Jonsson <bjornjon@gmail.com>
    
    Closes #11644 from bjornjon/master.
    bjornjon authored and srowen committed Mar 13, 2016
    Configuration menu
    Copy the full SHA
    515e4af View commit details
    Browse the repository at this point in the history
  4. [SPARK-13812][SPARKR] Fix SparkR lint-r test errors.

    ## What changes were proposed in this pull request?
    
    This PR fixes all newly captured SparkR lint-r errors after the lintr package was updated from GitHub.
    
    ## How was this patch tested?
    
    dev/lint-r
    SparkR unit tests
    
    Author: Sun Rui <rui.sun@intel.com>
    
    Closes #11652 from sun-rui/SPARK-13812.
    Sun Rui authored and shivaram committed Mar 13, 2016
    Configuration menu
    Copy the full SHA
    c7e68c3 View commit details
    Browse the repository at this point in the history

Commits on Mar 14, 2016

  1. [SQL] fix typo in DataSourceRegister

    ## What changes were proposed in this pull request?
    fix typo in DataSourceRegister
    
    ## How was this patch tested?
    
    found when going through latest code
    
    Author: Jacky Li <jacky.likun@huawei.com>
    
    Closes #11686 from jackylk/patch-12.
    jackylk authored and rxin committed Mar 14, 2016
    Configuration menu
    Copy the full SHA
    f3daa09 View commit details
    Browse the repository at this point in the history
  2. [SPARK-13834][BUILD] Update sbt and sbt plugins for 2.x.

    ## What changes were proposed in this pull request?
    
    For 2.0.0, we should bring **sbt** and the **sbt plugins** up to date. This PR checks the status of each plugin and bumps the following.
    
    * sbt: 0.13.9 --> 0.13.11
    * sbteclipse-plugin: 2.2.0 --> 4.0.0
    * sbt-dependency-graph: 0.7.4 --> 0.8.2
    * sbt-mima-plugin: 0.1.6 --> 0.1.9
    * sbt-revolver: 0.7.2 --> 0.8.0
    
    All other plugins are up-to-date. (Note that `sbt-avro` seems to have changed from 0.3.2 to 1.0.1, but it's not published in the repository.)
    
    During the upgrade, this PR also updated the following MiMa error. Note that the related exclusion filter is already registered correctly; the difference seems due to a change in the problem type that MiMa reports.
    ```
     // SPARK-12896 Send only accumulator updates to driver, not TaskMetrics
     ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.Accumulable.this"),
    -ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.Accumulator.this"),
    +ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.Accumulator.this"),
    ```
    
    ## How was this patch tested?
    
    Pass the Jenkins build.
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #11669 from dongjoon-hyun/update_mima.
    dongjoon-hyun authored and rxin committed Mar 14, 2016
    Configuration menu
    Copy the full SHA
    473263f View commit details
    Browse the repository at this point in the history
  3. [SPARK-13823][CORE][STREAMING][SQL] Always specify Charset in String …

    …<-> byte[] conversions (and remaining Coverity items)
    
    ## What changes were proposed in this pull request?
    
    - Fixes calls to `new String(byte[])` or `String.getBytes()` that rely on the platform default encoding, to use UTF-8 (see the sketch after this list)
    - Same for `InputStreamReader` and `OutputStreamWriter` constructors
    - Standardizes on UTF-8 everywhere
    - Standardizes specifying the encoding with `StandardCharsets.UTF_8`, not the Guava constant or "UTF-8" (which means handling `UnsupportedEncodingException`)
    - (also addresses the other remaining Coverity scan issues, which are pretty trivial; these are separated into commit srowen@1deecd8 )
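    A minimal sketch of the pattern the patch standardizes on:
    
    ```scala
    import java.nio.charset.StandardCharsets
    
    // Always pass an explicit charset instead of relying on the platform default.
    val bytes = "some text".getBytes(StandardCharsets.UTF_8)
    val text  = new String(bytes, StandardCharsets.UTF_8)
    ```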
    
    ## How was this patch tested?
    
    Jenkins tests
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes #11657 from srowen/SPARK-13823.
    srowen authored and rxin committed Mar 14, 2016
    Configuration menu
    Copy the full SHA
    1840852 View commit details
    Browse the repository at this point in the history
  4. Closes #11668

    rxin committed Mar 14, 2016
    Configuration menu
    Copy the full SHA
    e58fa19 View commit details
    Browse the repository at this point in the history
  5. [MINOR][DOCS] Fix more typos in comments/strings.

    ## What changes were proposed in this pull request?
    
    This PR fixes 135 typos over 107 files:
    * 121 typos in comments
    * 11 typos in testcase name
    * 3 typos in log messages
    
    ## How was this patch tested?
    
    Manual.
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #11689 from dongjoon-hyun/fix_more_typos.
    dongjoon-hyun authored and srowen committed Mar 14, 2016
    Configuration menu
    Copy the full SHA
    acdf219 View commit details
    Browse the repository at this point in the history
  6. [SPARK-13746][TESTS] stop using deprecated SynchronizedSet

    trait SynchronizedSet in package mutable is deprecated
    
    Author: Wilson Wu <wilson888888888@gmail.com>
    
    Closes #11580 from wilson888888888/spark-synchronizedset.
    Wilson Wu authored and srowen committed Mar 14, 2016
    Configuration menu
    Copy the full SHA
    31d069d View commit details
    Browse the repository at this point in the history
  7. [SPARK-13207][SQL] Make partitioning discovery ignore _SUCCESS files.

    If a `_SUCCESS` file appears in an inner partition directory, partition discovery will treat it as a data file. Partition discovery then fails because it finds that the directory structure is not valid. We should ignore those `_SUCCESS` files.
    
    In the future, it would be better to ignore all files/dirs starting with `_` or `.`. This PR does not make that change; I am keeping this change simple so we can consider getting it into branch 1.6.
    
    To ignore all files/dirs starting with `_` or `.`, the main change would be to let ParquetRelation have another way to get metadata files. Right now, it relies on FileStatusCache's cachedLeafStatuses, which returns file statuses of both metadata files (e.g. metadata files used by Parquet) and data files, which requires more changes.
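    As a hypothetical illustration of the narrower fix (the helper name and signature are assumed, not the actual patch):
    
    ```scala
    import org.apache.hadoop.fs.FileStatus
    
    // Drop _SUCCESS markers before partition discovery so they are not mistaken for data files.
    def dataFilesOnly(statuses: Seq[FileStatus]): Seq[FileStatus] =
      statuses.filterNot(_.getPath.getName == "_SUCCESS")
    ```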
    
    https://issues.apache.org/jira/browse/SPARK-13207
    
    Author: Yin Huai <yhuai@databricks.com>
    
    Closes #11088 from yhuai/SPARK-13207.
    yhuai committed Mar 14, 2016
    Configuration menu
    Copy the full SHA
    250832c View commit details
    Browse the repository at this point in the history
  8. [SPARK-13139][SQL] Follow-ups to #11573

    Addressing outstanding comments in #11573.
    
    Jenkins, new test case in `DDLCommandSuite`
    
    Author: Andrew Or <andrew@databricks.com>
    
    Closes #11667 from andrewor14/ddl-parser-followups.
    Andrew Or authored and yhuai committed Mar 14, 2016
    Configuration menu
    Copy the full SHA
    9a1680c View commit details
    Browse the repository at this point in the history
  9. [SPARK-13833] Guard against race condition when re-caching disk block…

    …s in memory
    
    When reading data from the DiskStore and attempting to cache it back into the memory store, we should guard against race conditions where multiple readers are attempting to re-cache the same block in memory.
    
    This patch accomplishes this by synchronizing on the block's `BlockInfo` object while trying to re-cache a block.
    
    (Will file JIRA as soon as ASF JIRA stops being down / laggy).
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #11660 from JoshRosen/concurrent-recaching-fixes.
    JoshRosen authored and Andrew Or committed Mar 14, 2016
    Configuration menu
    Copy the full SHA
    9a87afd View commit details
    Browse the repository at this point in the history
  10. [SPARK-13578][CORE] Modify launch scripts to not use assemblies.

    Instead of looking for a specially-named assembly, the scripts now will
    blindly add all jars under the libs directory to the classpath. This
    libs directory is still currently the old assembly dir, so things should
    keep working the same way as before until we make more packaging changes.
    
    The only lost feature is the detection of multiple assemblies; I consider
    that a minor nicety that only really affects a few developers, so it's probably
    OK.
    
    Tested locally by running spark-shell; also did some minor Win32 testing
    (just made sure spark-shell started).
    
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    
    Closes #11591 from vanzin/SPARK-13578.
    Marcelo Vanzin authored and JoshRosen committed Mar 14, 2016
    Configuration menu
    Copy the full SHA
    45f8053 View commit details
    Browse the repository at this point in the history
  11. [SPARK-13779][YARN] Avoid cancelling non-local container requests.

    To maximize locality, the YarnAllocator would cancel any requests with a
    stale locality preference or no locality preference. This assumed that
    the majority of tasks had locality preferences, but that may not be the case
    when scanning S3. This caused container requests for S3 tasks to be
    constantly cancelled and resubmitted.
    
    This changes the allocator's logic to cancel only stale requests and
    just enough requests without locality preferences to submit requests
    with locality preferences. This avoids cancelling requests without
    locality preferences that would be resubmitted without locality
    preferences.
    
    We've deployed this patch on our clusters and verified that jobs that couldn't get executors because they kept canceling and resubmitting requests are fixed. Large jobs are running fine.
    
    Author: Ryan Blue <blue@apache.org>
    
    Closes #11612 from rdblue/SPARK-13779-fix-yarn-allocator-requests.
    rdblue authored and Marcelo Vanzin committed Mar 14, 2016
    Configuration menu
    Copy the full SHA
    63f642a View commit details
    Browse the repository at this point in the history
  12. [SPARK-13658][SQL] BooleanSimplification rule is slow with large bool…

    …ean expressions
    
    JIRA: https://issues.apache.org/jira/browse/SPARK-13658
    
    ## What changes were proposed in this pull request?
    
    Quoted from the JIRA description: when running TPCDS Q3 [1] with lots of predicates to filter out the partitions, the optimizer rule BooleanSimplification takes about 2 seconds (it uses lots of semanticEquals calls, which require copying the whole tree).
    
    It would be great if we could speed it up.
    
    [1] https://github.com/cloudera/impala-tpcds-kit/blob/master/queries/q3.sql
    
    How to speed it up:
    
    When we ask for the canonicalized expression of an `Expression`, it calls `Canonicalize.execute` on itself. `Canonicalize.execute` basically transforms up all expressions contained in this expression. However, we don't keep the canonicalized versions of these child expressions. So the next time we ask for the canonicalized versions of the child expressions (e.g., in `BooleanSimplification`), we rerun `Canonicalize.execute` on each of them. This wastes a lot of time.
    
    By forcing the child expressions to compute and keep their canonicalized versions first, we can avoid re-canonicalizing these expressions.
    
    I simply benchmarked it with an expression that is part of the WHERE clause in TPCDS Q3:
    
        val testRelation = LocalRelation('ss_sold_date_sk.int, 'd_moy.int, 'i_manufact_id.int, 'ss_item_sk.string, 'i_item_sk.string, 'd_date_sk.int)
    
        val input = ('d_date_sk === 'ss_sold_date_sk) && ('ss_item_sk === 'i_item_sk) && ('i_manufact_id === 436) && ('d_moy === 12) && (('ss_sold_date_sk > 2415355 && 'ss_sold_date_sk < 2415385) || ('ss_sold_date_sk > 2415720 && 'ss_sold_date_sk < 2415750) || ('ss_sold_date_sk > 2416085 && 'ss_sold_date_sk < 2416115) || ('ss_sold_date_sk > 2416450 && 'ss_sold_date_sk < 2416480) || ('ss_sold_date_sk > 2416816 && 'ss_sold_date_sk < 2416846) || ('ss_sold_date_sk > 2417181 && 'ss_sold_date_sk < 2417211) || ('ss_sold_date_sk > 2417546 && 'ss_sold_date_sk < 2417576) || ('ss_sold_date_sk > 2417911 && 'ss_sold_date_sk < 2417941) || ('ss_sold_date_sk > 2418277 && 'ss_sold_date_sk < 2418307) || ('ss_sold_date_sk > 2418642 && 'ss_sold_date_sk < 2418672) || ('ss_sold_date_sk > 2419007 && 'ss_sold_date_sk < 2419037) || ('ss_sold_date_sk > 2419372 && 'ss_sold_date_sk < 2419402) || ('ss_sold_date_sk > 2419738 && 'ss_sold_date_sk < 2419768) || ('ss_sold_date_sk > 2420103 && 'ss_sold_date_sk < 2420133) || ('ss_sold_date_sk > 2420468 && 'ss_sold_date_sk < 2420498) || ('ss_sold_date_sk > 2420833 && 'ss_sold_date_sk < 2420863) || ('ss_sold_date_sk > 2421199 && 'ss_sold_date_sk < 2421229) || ('ss_sold_date_sk > 2421564 && 'ss_sold_date_sk < 2421594) || ('ss_sold_date_sk > 2421929 && 'ss_sold_date_sk < 2421959) || ('ss_sold_date_sk > 2422294 && 'ss_sold_date_sk < 2422324) || ('ss_sold_date_sk > 2422660 && 'ss_sold_date_sk < 2422690) || ('ss_sold_date_sk > 2423025 && 'ss_sold_date_sk < 2423055) || ('ss_sold_date_sk > 2423390 && 'ss_sold_date_sk < 2423420) || ('ss_sold_date_sk > 2423755 && 'ss_sold_date_sk < 2423785) || ('ss_sold_date_sk > 2424121 && 'ss_sold_date_sk < 2424151) || ('ss_sold_date_sk > 2424486 && 'ss_sold_date_sk < 2424516) || ('ss_sold_date_sk > 2424851 && 'ss_sold_date_sk < 2424881) || ('ss_sold_date_sk > 2425216 && 'ss_sold_date_sk < 2425246) || ('ss_sold_date_sk > 2425582 && 'ss_sold_date_sk < 2425612) || ('ss_sold_date_sk > 2425947 && 'ss_sold_date_sk < 2425977) || ('ss_sold_date_sk > 2426312 && 'ss_sold_date_sk < 2426342) || ('ss_sold_date_sk > 2426677 && 'ss_sold_date_sk < 2426707) || ('ss_sold_date_sk > 2427043 && 'ss_sold_date_sk < 2427073) || ('ss_sold_date_sk > 2427408 && 'ss_sold_date_sk < 2427438) || ('ss_sold_date_sk > 2427773 && 'ss_sold_date_sk < 2427803) || ('ss_sold_date_sk > 2428138 && 'ss_sold_date_sk < 2428168) || ('ss_sold_date_sk > 2428504 && 'ss_sold_date_sk < 2428534) || ('ss_sold_date_sk > 2428869 && 'ss_sold_date_sk < 2428899) || ('ss_sold_date_sk > 2429234 && 'ss_sold_date_sk < 2429264) || ('ss_sold_date_sk > 2429599 && 'ss_sold_date_sk < 2429629) || ('ss_sold_date_sk > 2429965 && 'ss_sold_date_sk < 2429995) || ('ss_sold_date_sk > 2430330 && 'ss_sold_date_sk < 2430360) || ('ss_sold_date_sk > 2430695 && 'ss_sold_date_sk < 2430725) || ('ss_sold_date_sk > 2431060 && 'ss_sold_date_sk < 2431090) || ('ss_sold_date_sk > 2431426 && 'ss_sold_date_sk < 2431456) || ('ss_sold_date_sk > 2431791 && 'ss_sold_date_sk < 2431821) || ('ss_sold_date_sk > 2432156 && 'ss_sold_date_sk < 2432186) || ('ss_sold_date_sk > 2432521 && 'ss_sold_date_sk < 2432551) || ('ss_sold_date_sk > 2432887 && 'ss_sold_date_sk < 2432917) || ('ss_sold_date_sk > 2433252 && 'ss_sold_date_sk < 2433282) || ('ss_sold_date_sk > 2433617 && 'ss_sold_date_sk < 2433647) || ('ss_sold_date_sk > 2433982 && 'ss_sold_date_sk < 2434012) || ('ss_sold_date_sk > 2434348 && 'ss_sold_date_sk < 2434378) || ('ss_sold_date_sk > 2434713 && 'ss_sold_date_sk < 2434743)))
    
        val plan = testRelation.where(input).analyze
        val actual = Optimize.execute(plan)
    
    With this patch:
    
        352 milliseconds
        346 milliseconds
        340 milliseconds
    
    Without this patch:
    
        585 milliseconds
        880 milliseconds
        677 milliseconds
    
    ## How was this patch tested?
    
    Existing tests should pass.
    
    Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    
    Closes #11647 from viirya/improve-expr-canonicalize.
    viirya authored and marmbrus committed Mar 14, 2016
    Configuration menu
    Copy the full SHA
    6a4bfcd View commit details
    Browse the repository at this point in the history
  13. [SPARK-13848][SPARK-5185] Update to Py4J 0.9.2 in order to fix classl…

    …oading issue
    
    This patch upgrades Py4J from 0.9.1 to 0.9.2 in order to include a patch which modifies Py4J to use the current thread's ContextClassLoader when performing reflection / class loading. This is necessary in order to fix [SPARK-5185](https://issues.apache.org/jira/browse/SPARK-5185), a longstanding issue affecting the use of `--jars` and `--packages` in PySpark.
    
    In order to demonstrate that the fix works, I removed the workarounds which were added as part of [SPARK-6027](https://issues.apache.org/jira/browse/SPARK-6027) / #4779 and other patches.
    
    Py4J diff: py4j/py4j@0.9.1...0.9.2
    
    /cc zsxwing tdas davies brkyvz
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #11687 from JoshRosen/py4j-0.9.2.
    JoshRosen committed Mar 14, 2016
    Configuration menu
    Copy the full SHA
    07cb323 View commit details
    Browse the repository at this point in the history
  14. [SPARK-12583][MESOS] Mesos shuffle service: Don't delete shuffle file…

    …s before application has stopped
    
    ## Problem description:
    
    The Mesos shuffle service has been completely unusable since Spark 1.6.0. The problem seems to have appeared with the move from Akka to Netty in the networking layer. Until now, a connection from the driver to each shuffle service was used as a signal for the shuffle service to determine whether the driver was still running. Since 1.6.0, this connection is closed after spark.shuffle.io.connectionTimeout (or spark.network.timeout if the former is not set) due to it being idle. The shuffle service interprets this as a signal that the driver has stopped, despite the driver still being alive. Thus, shuffle files are deleted before the application has stopped.
    
    ### Context and analysis:
    
    spark shuffle fails with mesos after 2mins: https://issues.apache.org/jira/browse/SPARK-12583
    External shuffle service broken w/ Mesos: https://issues.apache.org/jira/browse/SPARK-13159
    
    This is a follow up on #11207 .
    
    ## What changes were proposed in this pull request?
    
    This PR adds a heartbeat signal from the Driver (in MesosExternalShuffleClient) to all registered external mesos shuffle service instances. In MesosExternalShuffleBlockHandler, a thread periodically checks whether a driver has timed out and cleans an application's shuffle files if this is the case.
    
    ## How was this patch tested?
    
    This patch has been tested on a small mesos test cluster using the spark-shell. Log output from mesos shuffle service:
    ```
    16/02/19 15:13:45 INFO mesos.MesosExternalShuffleBlockHandler: Received registration request from app 294def07-3249-4e0f-8d71-bf8c83c58a50-0018 (remote address /xxx.xxx.xxx.xxx:52391, heartbeat timeout 120000 ms).
    16/02/19 15:13:47 INFO shuffle.ExternalShuffleBlockResolver: Registered executor AppExecId{appId=294def07-3249-4e0f-8d71-bf8c83c58a50-0018, execId=3} with ExecutorShuffleInfo{localDirs=[/foo/blockmgr-c84c0697-a3f9-4f61-9c64-4d3ee227c047], subDirsPerLocalDir=64, shuffleManager=sort}
    16/02/19 15:13:47 INFO shuffle.ExternalShuffleBlockResolver: Registered executor AppExecId{appId=294def07-3249-4e0f-8d71-bf8c83c58a50-0018, execId=7} with ExecutorShuffleInfo{localDirs=[/foo/blockmgr-bf46497a-de80-47b9-88f9-563123b59e03], subDirsPerLocalDir=64, shuffleManager=sort}
    16/02/19 15:16:02 INFO mesos.MesosExternalShuffleBlockHandler: Application 294def07-3249-4e0f-8d71-bf8c83c58a50-0018 timed out. Removing shuffle files.
    16/02/19 15:16:02 INFO shuffle.ExternalShuffleBlockResolver: Application 294def07-3249-4e0f-8d71-bf8c83c58a50-0018 removed, cleanupLocalDirs = true
    16/02/19 15:16:02 INFO shuffle.ExternalShuffleBlockResolver: Cleaning up executor AppExecId{appId=294def07-3249-4e0f-8d71-bf8c83c58a50-0018, execId=3}'s 1 local dirs
    16/02/19 15:16:02 INFO shuffle.ExternalShuffleBlockResolver: Cleaning up executor AppExecId{appId=294def07-3249-4e0f-8d71-bf8c83c58a50-0018, execId=7}'s 1 local dirs
    ```
    Note: there are 2 executors running on this slave.
    
    Author: Bertrand Bossy <bertrand.bossy@teralytics.net>
    
    Closes #11272 from bbossy/SPARK-12583-mesos-shuffle-service-heartbeat.
    Bertrand Bossy authored and Andrew Or committed Mar 14, 2016
    Configuration menu
    Copy the full SHA
    310981d View commit details
    Browse the repository at this point in the history
  15. [MINOR][DOCS] Added Missing back slashes

    ## What changes were proposed in this pull request?
    
    When studying Spark, many users just copy examples from the documentation and paste them into their terminals,
    and because of that the missing backslashes lead them to run into shell errors.
    
    The added backslashes avoid that problem for Spark users who work this way.
    
    ## How was this patch tested?
    
    I generated the documentation locally using jekyll and checked the generated pages
    
    Author: Daniel Santana <mestresan@gmail.com>
    
    Closes #11699 from danielsan/master.
    danielsan authored and Andrew Or committed Mar 14, 2016
    Configuration menu
    Copy the full SHA
    9f13f0f View commit details
    Browse the repository at this point in the history
  16. [MINOR][COMMON] Fix copy-paste oversight in variable naming

    ## What changes were proposed in this pull request?
    
    JavaUtils.java has methods to convert time and byte strings for internal use. This change renames a variable used in byteStringAs() from timeError to byteError.
    
    Author: Bjorn Jonsson <bjornjon@gmail.com>
    
    Closes #11695 from bjornjon/master.
    bjornjon authored and Andrew Or committed Mar 14, 2016
    Configuration menu
    Copy the full SHA
    e06493c View commit details
    Browse the repository at this point in the history
  17. [SPARK-13054] Always post TaskEnd event for tasks

    I am using dynamic container allocation and speculation and am seeing issues with the active task accounting. The Executor UI still shows active tasks on an executor even though the job/stage has completed. I think it's also preventing dynamic allocation from releasing containers, because it thinks there are still running tasks.
    There are multiple issues here:
    - If a task-end event (in this case probably from a speculative task) arrives after the stage has finished, DAGScheduler.handleTaskCompletion will skip the task completion event
    
    Author: Thomas Graves <tgraves@prevailsail.corp.gq1.yahoo.com>
    Author: Thomas Graves <tgraves@staydecay.corp.gq1.yahoo.com>
    Author: Tom Graves <tgraves@yahoo-inc.com>
    
    Closes #10951 from tgravescs/SPARK-11701.
    Thomas Graves authored and Andrew Or committed Mar 14, 2016
    Configuration menu
    Copy the full SHA
    23385e8 View commit details
    Browse the repository at this point in the history
  18. [SPARK-13686][MLLIB][STREAMING] Add a constructor parameter `regParam…

    …` to (Streaming)LinearRegressionWithSGD
    
    ## What changes were proposed in this pull request?
    
    `LinearRegressionWithSGD` and `StreamingLinearRegressionWithSGD` do not have `regParam` as a constructor argument. They just depend on GradientDescent's default regParam value.
    To be consistent with other algorithms, we should add it. The same default value is used.
    
    ## How was this patch tested?
    
    Pass the existing unit test.
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #11527 from dongjoon-hyun/SPARK-13686.
    dongjoon-hyun authored and mengxr committed Mar 14, 2016
    Configuration menu
    Copy the full SHA
    a48296f View commit details
    Browse the repository at this point in the history
  19. [SPARK-10907][SPARK-6157] Remove pendingUnrollMemory from MemoryStore

    This patch refactors the MemoryStore to remove the concept of `pendingUnrollMemory`. It also fixes SPARK-6157: "Unrolling with MEMORY_AND_DISK should always release memory".
    
    Key changes:
    
    - Inline `MemoryStore.tryToPut` at its three call sites in the `MemoryStore`.
    - Inline `Memory.unrollSafely` at its only call site (in `MemoryStore.putIterator`).
    - Inline `MemoryManager.acquireStorageMemory` at its call sites.
    - Simplify the code as a result of this inlining (some parameters have fixed values after inlining, so lots of branches can be removed).
    - Remove the `pendingUnrollMemory` map by returning the amount of unrollMemory allocated when returning an iterator after a failed `putIterator` call.
    - Change `putIterator` to return an instance of `PartiallyUnrolledIterator`, a special iterator subclass which will automatically free the unroll memory of its partially-unrolled elements when the iterator is consumed. To handle cases where the iterator is not consumed (e.g. when a MEMORY_ONLY put fails), `PartiallyUnrolledIterator` exposes a `close()` method which may be called to discard the unrolled values and free their memory (see the sketch after this list).
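    
    A minimal sketch of such a close-on-consume iterator, under assumed names (freeUnroll is a stand-in for the MemoryStore bookkeeping, not the actual API):
    
    ```scala
    class PartiallyUnrolledIteratorSketch[T](
        unrolled: Iterator[T],
        rest: Iterator[T],
        freeUnroll: () => Unit) extends Iterator[T] {
    
      private var unrollFreed = false
    
      private def maybeFree(): Unit = {
        if (!unrollFreed && !unrolled.hasNext) {
          freeUnroll()        // all partially-unrolled values consumed: release their memory
          unrollFreed = true
        }
      }
    
      override def hasNext: Boolean = { maybeFree(); unrolled.hasNext || rest.hasNext }
    
      override def next(): T = {
        val v = if (unrolled.hasNext) unrolled.next() else rest.next()
        maybeFree()
        v
      }
    
      /** Discard the unrolled values and free their memory without consuming them. */
      def close(): Unit = {
        if (!unrollFreed) { freeUnroll(); unrollFreed = true }
      }
    }
    ```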
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #11613 from JoshRosen/cleanup-unroll-memory.
    JoshRosen authored and Andrew Or committed Mar 14, 2016
    Configuration menu
    Copy the full SHA
    38529d8 View commit details
    Browse the repository at this point in the history
  20. [SPARK-13626][CORE] Avoid duplicate config deprecation warnings.

    Three different things were needed to get rid of spurious warnings:
    - silence deprecation warnings when cloning configuration
    - change the way SparkHadoopUtil instantiates SparkConf to silence
      warnings
    - avoid creating new SparkConf instances where it's not needed.
    
    On top of that, I changed the way that Logging.scala detects the repl;
    now it uses a method that is overridden in the repl's Main class, and
    the hack in Utils.scala is not needed anymore. This makes the 2.11 repl
    behave like the 2.10 one and set the default log level to WARN, which
    is a lot better. Previously, this wasn't working because the 2.11 repl
    triggers log initialization earlier than the 2.10 one.
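    
    As an illustration only (the trait and member names below are assumptions, not Spark's actual code), the override-based repl detection could look roughly like this:
    
    ```scala
    // The logging trait exposes an overridable hook; the repl's Main overrides it instead
    // of Spark probing the call stack or classpath to guess whether it runs inside a shell.
    trait LoggingSketch {
      protected def isInsideRepl: Boolean = false
      def defaultLogLevel: String = if (isInsideRepl) "WARN" else "INFO"
    }
    
    object ReplMainSketch extends LoggingSketch {
      override protected def isInsideRepl: Boolean = true
    }
    ```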
    
    I also removed and simplified some other code in the 2.11 repl's Main
    to avoid replicating logic that already exists elsewhere in Spark.
    
    Tested the 2.11 repl in local and yarn modes.
    
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    
    Closes #11510 from vanzin/SPARK-13626.
    Marcelo Vanzin committed Mar 14, 2016
    Configuration menu
    Copy the full SHA
    8301fad View commit details
    Browse the repository at this point in the history
  21. [SPARK-13843][STREAMING] Remove streaming-flume, streaming-mqtt, stre…

    …aming-zeromq, streaming-akka, streaming-twitter to Spark packages
    
    ## What changes were proposed in this pull request?
    
    Currently there are a few sub-projects, each for integrating with different external sources for Streaming.  Now that we have better ability to include external libraries (spark packages) and with Spark 2.0 coming up, we can move the following projects out of Spark to https://github.com/spark-packages
    
    - streaming-flume
    - streaming-akka
    - streaming-mqtt
    - streaming-zeromq
    - streaming-twitter
    
    They are just ancillary packages, and considering the overhead of maintenance, running tests, and PR failures, it's better to maintain them outside of Spark. In addition, these projects can have their own release cycles and we can release them faster.
    
    I have already copied these projects to https://github.com/spark-packages
    
    ## How was this patch tested?
    
    Jenkins tests
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #11672 from zsxwing/remove-external-pkg.
    zsxwing authored and rxin committed Mar 14, 2016
    Configuration menu
    Copy the full SHA
    06dec37 View commit details
    Browse the repository at this point in the history

Commits on Mar 15, 2016

  1. [SPARK-11826][MLLIB] Refactor add() and subtract() methods

    srowen Could you please check this when you have time?
    
    Author: Ehsan M.Kermani <ehsanmo1367@gmail.com>
    
    Closes #9916 from ehsanmok/JIRA-11826.
    ehsanmok authored and jkbradley committed Mar 15, 2016
    Configuration menu
    Copy the full SHA
    992142b View commit details
    Browse the repository at this point in the history
  2. [SPARK-13664][SQL] Add a strategy for planning partitioned and bucket…

    …ed scans of files
    
    This PR adds a new strategy, `FileSourceStrategy`, that can be used for planning scans of collections of files that might be partitioned or bucketed.
    
    Compared with the existing planning logic in `DataSourceStrategy` this version has the following desirable properties:
     - It removes the need to have `RDD`, `broadcastedHadoopConf` and other distributed concerns  in the public API of `org.apache.spark.sql.sources.FileFormat`
     - Partition column appending is delegated to the format to avoid an extra copy / devectorization when appending partition columns
     - It minimizes the amount of data that is shipped to each executor (i.e. it does not send the whole list of files to every worker in the form of a hadoop conf)
     - it natively supports bucketing files into partitions, and thus does not require coalescing / creating a `UnionRDD` with the correct partitioning.
     - Small files are automatically coalesced into fewer tasks using an approximate bin-packing algorithm (see the sketch after this list).
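    
    A hedged first-fit-decreasing sketch of that bin-packing idea, under an assumed FileInfo(path, size) and a bytes-per-task budget; FileSourceStrategy's real cost model and ordering may differ:
    
    ```scala
    import scala.collection.mutable.ArrayBuffer
    
    case class FileInfo(path: String, sizeInBytes: Long)
    
    def packFilesIntoTasks(files: Seq[FileInfo], maxBytesPerTask: Long): Seq[Seq[FileInfo]] = {
      class Bucket { var bytes = 0L; val members = ArrayBuffer.empty[FileInfo] }
      val buckets = ArrayBuffer.empty[Bucket]
      // Place large files first and reuse the first bucket that still has room.
      for (f <- files.sortBy(-_.sizeInBytes)) {
        val target = buckets.find(_.bytes + f.sizeInBytes <= maxBytesPerTask).getOrElse {
          val b = new Bucket; buckets += b; b
        }
        target.bytes += f.sizeInBytes
        target.members += f
      }
      buckets.map(_.members.toSeq).toSeq
    }
    ```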
    
    Currently only a testing source is planned / tested using this strategy.  In follow-up PRs we will port the existing formats to this API.
    
    A stub for `FileScanRDD` is also added, but most methods remain unimplemented.
    
    Other minor cleanups:
     - partition pruning is pushed into `FileCatalog` so both the new and old code paths can use this logic.  This will also allow future implementations to use indexes or other tricks (i.e. a MySQL metastore)
     - The partitions from the `FileCatalog` now propagate information about file sizes all the way up to the planner so we can intelligently spread files out.
     - `Array` -> `Seq` in some internal APIs to avoid unnecessary `toArray` calls
     - Rename `Partition` to `PartitionDirectory` to differentiate partitions used earlier in pruning from those where we have already enumerated the files and their sizes.
    
    Author: Michael Armbrust <michael@databricks.com>
    
    Closes #11646 from marmbrus/fileStrategy.
    marmbrus committed Mar 15, 2016
    Configuration menu
    Copy the full SHA
    17eec0a View commit details
    Browse the repository at this point in the history
  3. [SPARK-13882][SQL] Remove org.apache.spark.sql.execution.local

    ## What changes were proposed in this pull request?
    We introduced some local operators in org.apache.spark.sql.execution.local package but never fully wired the engine to actually use these. We still plan to implement a full local mode, but it's probably going to be fairly different from what the current iterator-based local mode would look like. Based on what we know right now, we might want a push-based columnar version of these operators.
    
    Let's just remove them for now, and we can always re-introduced them in the future by looking at branch-1.6.
    
    ## How was this patch tested?
    This is simply dead code removal.
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes #11705 from rxin/SPARK-13882.
    rxin committed Mar 15, 2016
    Configuration menu
    Copy the full SHA
    4bf4609 View commit details
    Browse the repository at this point in the history
  4. [SPARK-10380][SQL] Fix confusing documentation examples for astype/dr…

    …op_duplicates.
    
    ## What changes were proposed in this pull request?
    We have seen users getting confused by the documentation for astype and drop_duplicates, because the examples in them do not use these functions (but do use their aliases). This patch simply removes all examples for these functions and notes that they are aliases.
    
    ## How was this patch tested?
    Existing PySpark unit tests.
    
    Closes #11543.
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes #11698 from rxin/SPARK-10380.
    rxin authored and marmbrus committed Mar 15, 2016
    Configuration menu
    Copy the full SHA
    8e0b030 View commit details
    Browse the repository at this point in the history
  5. [SPARK-13791][SQL] Add MetadataLog and HDFSMetadataLog

    ## What changes were proposed in this pull request?
    
    - Add a MetadataLog interface for reliable metadata storage (a sketch of such an interface follows this list).
    - Add HDFSMetadataLog as a MetadataLog implementation based on HDFS.
    - Update FileStreamSource to use HDFSMetadataLog instead of managing metadata by itself.
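    
    A sketch of what such a batch-id keyed log interface can look like; the method names are assumptions for illustration, not necessarily the exact API:
    
    ```scala
    trait MetadataLogSketch[T] {
      /** Atomically record metadata for a batch; return false if the batch already exists. */
      def add(batchId: Long, metadata: T): Boolean
      /** Return the metadata for a given batch, if it was recorded. */
      def get(batchId: Long): Option[T]
      /** Return the most recently committed batch and its metadata, if any. */
      def getLatest(): Option[(Long, T)]
    }
    ```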
    
    ## How was this patch tested?
    
    unit tests
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #11625 from zsxwing/metadata-log.
    zsxwing authored and marmbrus committed Mar 15, 2016
    Configuration menu
    Copy the full SHA
    b5e3bd8 View commit details
    Browse the repository at this point in the history
  6. [SPARK-13880][SPARK-13881][SQL] Rename DataFrame.scala Dataset.scala,…

    … and remove LegacyFunctions
    
    ## What changes were proposed in this pull request?
    1. Rename DataFrame.scala to Dataset.scala, since the class is now named Dataset.
    2. Remove LegacyFunctions. It was introduced in Spark 1.6 for backward compatibility, and can be removed in Spark 2.0.
    
    ## How was this patch tested?
    Should be covered by existing unit/integration tests.
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes #11704 from rxin/SPARK-13880.
    rxin authored and cloud-fan committed Mar 15, 2016
    Configuration menu
    Copy the full SHA
    e76679a View commit details
    Browse the repository at this point in the history
  7. [SPARK-13661][SQL] avoid the copy in HashedRelation

    ## What changes were proposed in this pull request?
    
    Avoid the copy in HashedRelation, since most HashedRelations are built from an Array[Row]; the copy() is added for LeftSemiJoinHash. This helps reduce memory consumption for broadcast joins.
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #11666 from davies/remove_copy.
    Davies Liu authored and rxin committed Mar 15, 2016
    Configuration menu
    Copy the full SHA
    9256840 View commit details
    Browse the repository at this point in the history
  8. [SPARK-13353][SQL] fast serialization for collecting DataFrame/Dataset

    ## What changes were proposed in this pull request?
    
    When we call DataFrame/Dataset.collect(), the Java serializer (or Kryo serializer) is used to serialize the UnsafeRows on the executors and deserialize them into UnsafeRows on the driver. The Java serializer (and Kryo serializer) are slow on millions of rows because they look for identical rows, but usually there are none.
    
    This PR serializes the UnsafeRows as a byte array by packing them together; the Java serializer (or Kryo serializer) then serializes the bytes very quickly (there are fewer objects, and byte arrays are not compared by content).
    
    The UnsafeRow format is highly compressible, the serialized bytes are also compressed (configurable by spark.io.compression.codec).
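    
    The packing idea can be sketched as follows, with illustrative names rather than Spark's internal API: each row's bytes are length-prefixed and appended, so the serializer only sees one large byte array instead of millions of row objects.
    
    ```scala
    import java.io.{ByteArrayOutputStream, DataOutputStream}
    
    def packRows(rows: Iterator[Array[Byte]]): Array[Byte] = {
      val bos = new ByteArrayOutputStream()
      val out = new DataOutputStream(bos)
      rows.foreach { r =>
        out.writeInt(r.length)  // length prefix for each row
        out.write(r)            // the raw row bytes
      }
      out.writeInt(-1)          // end-of-stream marker
      out.flush()
      bos.toByteArray
    }
    ```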
    
    ## How was this patch tested?
    
    Existing unit tests.
    
    Add a benchmark for collect, before this patch:
    ```
    Intel(R) Core(TM) i7-4558U CPU  2.80GHz
    collect:                        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -------------------------------------------------------------------------------------------
    collect 1 million                      3991 / 4311          0.3        3805.7       1.0X
    collect 2 millions                  10083 / 10637          0.1        9616.0       0.4X
    collect 4 millions                  29551 / 30072          0.0       28182.3       0.1X
    ```
    
    After this patch:
    ```
    Intel(R) Core(TM) i7-4558U CPU  2.80GHz
    collect:                        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -------------------------------------------------------------------------------------------
    collect 1 million                        775 / 1170          1.4         738.9       1.0X
    collect 2 millions                     1153 / 1758          0.9        1099.3       0.7X
    collect 4 millions                     4451 / 5124          0.2        4244.9       0.2X
    ```
    
    We can see about 5-7X speedup.
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #11664 from davies/serialize_row.
    Davies Liu authored and rxin committed Mar 15, 2016
    Configuration menu
    Copy the full SHA
    f72743d View commit details
    Browse the repository at this point in the history
  9. [SPARK-13884][SQL] Remove DescribeCommand's dependency on LogicalPlan

    ## What changes were proposed in this pull request?
    This patch removes DescribeCommand's dependency on LogicalPlan. After this patch, DescribeCommand simply accepts a TableIdentifier. It minimizes the dependency, and blocks my next patch (removes SQLContext dependency from SparkPlanner).
    
    ## How was this patch tested?
    Should be covered by existing unit tests and Hive compatibility tests that run describe table.
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes #11710 from rxin/SPARK-13884.
    rxin committed Mar 15, 2016
    Configuration menu
    Copy the full SHA
    e649580 View commit details
    Browse the repository at this point in the history
  10. [SPARK-13888][DOC] Remove Akka Receiver doc and refer to the DStream …

    …Akka project
    
    ## What changes were proposed in this pull request?
    
    I have copied the docs of Streaming Akka to https://github.com/spark-packages/dstream-akka/blob/master/README.md
    
    So we can remove them from Spark now.
    
    ## How was this patch tested?
    
    Only document changes.
    
    (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #11711 from zsxwing/remove-akka-doc.
    zsxwing authored and rxin committed Mar 15, 2016
    Configuration menu
    Copy the full SHA
    43304b1 View commit details
    Browse the repository at this point in the history
  11. [SPARK-13870][SQL] Add scalastyle escaping correctly in CSVSuite.scala

    ## What changes were proposed in this pull request?
    
    When `CSVSuite.scala` was initially created in SPARK-12833, there was a typo in `scalastyle:on`: `scalstyle:on`. This mistakenly turned off ScalaStyle checking for the rest of the file, so a violation in the recently added `SPARK-12668` code could not be caught. This PR fixes the existing escaping and adds a new escaping for the `SPARK-12668` code, like the following.
    
    ```scala
       test("test aliases sep and encoding for delimiter and charset") {
    +    // scalastyle:off
         val cars = sqlContext
    ...
           .load(testFile(carsFile8859))
    +    // scalastyle:on
    ```
    This will prevent future potential problems, too.
    
    ## How was this patch tested?
    
    Pass the Jenkins test.
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #11700 from dongjoon-hyun/SPARK-13870.
    dongjoon-hyun authored and rxin committed Mar 15, 2016
    Configuration menu
    Copy the full SHA
    a51f877 View commit details
    Browse the repository at this point in the history
  12. [SPARK-13890][SQL] Remove some internal classes' dependency on SQLCon…

    …text
    
    ## What changes were proposed in this pull request?
    In general it is better for internal classes to not depend on the external class (in this case SQLContext) to reduce coupling between user-facing APIs and the internal implementations. This patch removes SQLContext dependency from some internal classes such as SparkPlanner, SparkOptimizer.
    
    As part of this patch, I also removed the following internal methods from SQLContext:
    ```
    protected[sql] def functionRegistry: FunctionRegistry
    protected[sql] def optimizer: Optimizer
    protected[sql] def sqlParser: ParserInterface
    protected[sql] def planner: SparkPlanner
    protected[sql] def continuousQueryManager
    protected[sql] def prepareForExecution: RuleExecutor[SparkPlan]
    ```
    
    ## How was this patch tested?
    Existing unit/integration tests.
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes #11712 from rxin/sqlContext-planner.
    rxin committed Mar 15, 2016
    Configuration menu
    Copy the full SHA
    276c2d5 View commit details
    Browse the repository at this point in the history
  13. [SPARK-13840][SQL] Split Optimizer Rule ColumnPruning to ColumnPrunin…

    …g and EliminateOperator
    
    #### What changes were proposed in this pull request?
    
    Before this PR, the two optimizer rules `ColumnPruning` and `PushPredicateThroughProject` reverse each other's effects, so the optimizer always reaches the maximum number of iterations when optimizing some queries, and extra `Project` operators are found in the plan. For example, below is the optimized plan after reaching 100 iterations:
    
    ```
    Join Inner, Some((cast(id1#16 as bigint) = id1#18L))
    :- Project [id1#16]
    :  +- Filter isnotnull(cast(id1#16 as bigint))
    :     +- Project [id1#16]
    :        +- Relation[id1#16,newCol#17] JSON part: struct<>, data: struct<id1:int,newCol:int>
    +- Filter isnotnull(id1#18L)
       +- Relation[id1#18L] JSON part: struct<>, data: struct<id1:bigint>
    ```
    
    This PR splits the optimizer rule `ColumnPruning` into `ColumnPruning` and `EliminateOperators`.
    
    The issue becomes worse with another rule, `NullFiltering`, which can add extra `IsNotNull` filters. We have to be careful about introducing extra `Filter` operators when the benefit is not large enough. Another PR will be submitted by sameeragarwal to handle this issue.
    
    cc sameeragarwal marmbrus
    
    In addition, `ColumnPruning` should not push `Project` through non-deterministic `Filter`. This could cause wrong results. This will be put in a separate PR.
    
    cc davies cloud-fan yhuai
    
    #### How was this patch tested?
    
    Modified the existing test cases.
    
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes #11682 from gatorsmile/viewDuplicateNames.
    gatorsmile authored and rxin committed Mar 15, 2016
    Configuration menu
    Copy the full SHA
    99bd2f0 View commit details
    Browse the repository at this point in the history
  14. [SPARK-13660][SQL][TESTS] ContinuousQuerySuite floods the logs with g…

    …arbage
    
    ## What changes were proposed in this pull request?
    
    Use the 'testQuietly' method to keep ContinuousQuerySuite from flooding the console with log garbage.
    
    ContinuousQuerySuite no longer outputs logs to the console; the logs are still written to unit-tests.log.
    
    ## How was this patch tested?
    
    Just check Jenkins output.
    
    Author: Xin Ren <iamshrek@126.com>
    
    Closes #11703 from keypointt/SPARK-13660.
    keypointt authored and rxin committed Mar 15, 2016
    Configuration menu
    Copy the full SHA
    10251a7 View commit details
    Browse the repository at this point in the history
  15. [SPARK-12379][ML][MLLIB] Copy GBT implementation to spark.ml

    Currently, GBTs in spark.ml wrap the implementation in spark.mllib. This is preventing several improvements to GBTs in spark.ml, so we need to move the implementation to ml and use spark.ml decision trees in the implementation. At first, we should make minimal changes to the implementation.
    Performance testing should be done to ensure there were no regressions.
    
    Performance testing results are [here](https://docs.google.com/document/d/1dYd2mnfGdUKkQ3vZe2BpzsTnI5IrpSLQ-NNKDZhUkgw/edit?usp=sharing)
    
    Author: sethah <seth.hendrickson16@gmail.com>
    
    Closes #10607 from sethah/SPARK-12379.
    sethah authored and MLnick committed Mar 15, 2016
    Configuration menu
    Copy the full SHA
    dafd70f View commit details
    Browse the repository at this point in the history
  16. [SPARK-13803] restore the changes in SPARK-3411

    ## What changes were proposed in this pull request?
    
    This patch contains the functionality to balance the load of the cluster-mode drivers among workers
    
    This patch restores the changes from #1106, which were erased by the merge of #731
    
    ## How was this patch tested?
    
    test with existing test cases
    
    Author: CodingCat <zhunansjtu@gmail.com>
    
    Closes #11702 from CodingCat/SPARK-13803.
    CodingCat authored and srowen committed Mar 15, 2016
    Configuration menu
    Copy the full SHA
    bd5365b View commit details
    Browse the repository at this point in the history
  17. [SPARK-13576][BUILD] Don't create assembly for examples.

    As part of the goal to stop creating assemblies in Spark, this change
    modifies the mvn and sbt builds to not create an assembly for examples.
    
    Instead, dependencies are copied to the build directory (under
    target/scala-xx/jars), and in the final archive, into the "examples/jars"
    directory.
    
    To avoid having to deal too much with Windows batch files, I made examples
    run through the launcher library; the spark-submit launcher now has a
    special mode to run examples, which adds all the necessary jars to the
    spark-submit command line, and replaces the bash and batch scripts that
    were used to run examples. The scripts are now just a thin wrapper around
    spark-submit; another advantage is that now all spark-submit options are
    supported.
    
    There are a few glitches; in the mvn build, a lot of duplicated dependencies
    get copied, because they are promoted to "compile" scope due to extra
    dependencies in the examples module (such as HBase). In the sbt build,
    all dependencies are copied, because there doesn't seem to be an easy
    way to filter things.
    
    I plan to clean some of this up when the rest of the tasks are finished.
    When the main assembly is replaced with jars, we can remove duplicate jars
    from the examples directory during packaging.
    
    Tested by running SparkPi in: maven build, sbt build, dist created by
    make-distribution.sh.
    
    Finally: note that running the "assembly" target in sbt doesn't build
    the examples anymore. You need to run "package" for that.
    
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    
    Closes #11452 from vanzin/SPARK-13576.
    Marcelo Vanzin committed Mar 15, 2016
    Configuration menu
    Copy the full SHA
    48978ab View commit details
    Browse the repository at this point in the history
  18. [SPARK-13893][SQL] Remove SQLContext.catalog/analyzer (internal method)

    ## What changes were proposed in this pull request?
    Our internal code can go through SessionState.catalog and SessionState.analyzer. This brings two small benefits:
    1. Reduces internal dependency on SQLContext.
    2. Removes 2 public methods in Java (Java does not obey package private visibility).
    
    More importantly, according to the design in SPARK-13485, we'd need to claim this catalog function for the user-facing public functions, rather than having an internal field.
    
    ## How was this patch tested?
    Existing unit/integration test code.
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes #11716 from rxin/SPARK-13893.
    rxin committed Mar 15, 2016
    Configuration menu
    Copy the full SHA
    5e6f2f4 View commit details
    Browse the repository at this point in the history
  19. [SPARK-13642][YARN] Changed the default application exit state to fai…

    …led for yarn cluster mode
    
    ## What changes were proposed in this pull request?
    
    Change the default exit state to `failed` for any application running in yarn cluster mode.
    
    ## How was this patch tested?
    
    Unit test is done locally.
    
    CC tgravescs and vanzin .
    
    Author: jerryshao <sshao@hortonworks.com>
    
    Closes #11693 from jerryshao/SPARK-13642.
    jerryshao authored and Marcelo Vanzin committed Mar 15, 2016
    Configuration menu
    Copy the full SHA
    d89c714 View commit details
    Browse the repository at this point in the history
  20. [SPARK-13896][SQL][STRING] Dataset.toJSON should return Dataset

    ## What changes were proposed in this pull request?
    Change the return type of toJSON in the Dataset class.
    ## How was this patch tested?
    No additional unit test required.
    
    Author: Stavros Kontopoulos <stavros.kontopoulos@typesafe.com>
    
    Closes #11732 from skonto/fix_toJson.
    Stavros Kontopoulos authored and rxin committed Mar 15, 2016
    Configuration menu
    Copy the full SHA
    50e3644 View commit details
    Browse the repository at this point in the history
  21. [MINOR] a minor fix for the comments of a method in RPC Dispatcher

    ## What changes were proposed in this pull request?
    
    a minor fix for the comments of a method in RPC Dispatcher
    
    ## How was this patch tested?
    
    existing unit tests
    
    Author: CodingCat <zhunansjtu@gmail.com>
    
    Closes #11738 from CodingCat/minor_rpc.
    CodingCat authored and zsxwing committed Mar 15, 2016
    Configuration menu
    Copy the full SHA
    dddf2f2 View commit details
    Browse the repository at this point in the history
  22. [SPARK-13626][CORE] Revert change to SparkConf's constructor.

    It shouldn't be private.
    
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    
    Closes #11734 from vanzin/SPARK-13626-api.
    Marcelo Vanzin committed Mar 15, 2016
    Configuration menu
    Copy the full SHA
    41eaabf View commit details
    Browse the repository at this point in the history
  23. [SPARK-13895][SQL] DataFrameReader.text should return Dataset[String]

    ## What changes were proposed in this pull request?
    This patch changes DataFrameReader.text()'s return type from DataFrame to Dataset[String].
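    
    A hedged usage sketch, assuming a SQLContext named sqlContext is in scope; after this change, text() yields a Dataset[String] rather than a single-column DataFrame.
    
    ```scala
    import org.apache.spark.sql.Dataset
    
    val lines: Dataset[String] = sqlContext.read.text("README.md")
    val longLines = lines.filter((l: String) => l.length > 80)  // work directly with Strings
    ```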
    
    Closes #11731.
    
    ## How was this patch tested?
    Updated existing integration tests to reflect the change.
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes #11739 from rxin/SPARK-13895.
    rxin authored and yhuai committed Mar 15, 2016
    Configuration menu
    Copy the full SHA
    643649d View commit details
    Browse the repository at this point in the history

Commits on Mar 16, 2016

  1. [SPARK-13918][SQL] Merge SortMergeJoin and SortMergeOuterJoin

    ## What changes were proposed in this pull request?
    
    This PR just moves some code from SortMergeOuterJoin into SortMergeJoin.
    
    This is to support codegen for outer joins.
    
    ## How was this patch tested?
    
    existing tests.
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #11743 from davies/gen_smjouter.
    Davies Liu authored and rxin committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    bbd887f View commit details
    Browse the repository at this point in the history
  2. [MINOR][TEST][SQL] Remove wrong "expected" parameter in checkNaNWitho…

    …utCodegen
    
    ## What changes were proposed in this pull request?
    
    Remove the unused "expected" parameter from checkNaNWithoutCodegen in MathFunctionsSuite.scala.
    This function checks for NaN values, so the "expected" parameter is useless. Callers do not pass an "expected" value, and the similar functions checkNaNWithGeneratedProjection and checkNaNWithOptimization do not use it either.
    
    Author: Yucai Yu <yucai.yu@intel.com>
    
    Closes #11718 from yucai/unused_expected.
    Yucai Yu authored and rxin committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    52b6a89 View commit details
    Browse the repository at this point in the history
  3. [SPARK-13917] [SQL] generate broadcast semi join

    ## What changes were proposed in this pull request?
    
    This PR brings codegen support for broadcast left-semi join.
    
    ## How was this patch tested?
    
    Existing tests. Added benchmark, the result show 7X speedup.
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #11742 from davies/gen_semi.
    Davies Liu authored and davies committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    421f6c2 View commit details
    Browse the repository at this point in the history
  4. [SPARK-9837][ML] R-like summary statistics for GLMs via iteratively r…

    …eweighted least squares
    
    ## What changes were proposed in this pull request?
    Provide R-like summary statistics for GLMs via iteratively reweighted least squares.
    ## How was this patch tested?
    unit tests.
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes #11694 from yanboliang/spark-9837.
    yanboliang authored and mengxr committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    3665294 View commit details
    Browse the repository at this point in the history
  5. [SPARK-13920][BUILD] MIMA checks should apply to @experimental and @D…

    …eveloperAPI APIs
    
    ## What changes were proposed in this pull request?
    
    We are able to change `Experimental` and `DeveloperAPI` APIs freely, but we should still monitor and manage those APIs carefully. This PR for [SPARK-13920](https://issues.apache.org/jira/browse/SPARK-13920) enables the MiMa check and adds filters for them.
    
    ## How was this patch tested?
    
    Pass the Jenkins tests (including MiMa).
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #11751 from dongjoon-hyun/SPARK-13920.
    dongjoon-hyun authored and rxin committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    3c578c5 View commit details
    Browse the repository at this point in the history
  6. [SPARK-13899][SQL] Produce InternalRow instead of external Row at CSV…

    … data source
    
    ## What changes were proposed in this pull request?
    
    https://issues.apache.org/jira/browse/SPARK-13899
    
    This PR makes CSV data source produce `InternalRow` instead of `Row`.
    
    Basically, this resembles JSON data source. It uses the same codes for casting.
    
    ## How was this patch tested?
    
    Unit tests were used within IDE and code style was checked by `./dev/run_tests`.
    
    Author: hyukjinkwon <gurwls223@gmail.com>
    
    Closes #11717 from HyukjinKwon/SPARK-13899.
    HyukjinKwon authored and rxin committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    9202479 View commit details
    Browse the repository at this point in the history
  7. [SPARK-12653][SQL] Re-enable test "SPARK-8489: MissingRequirementErro…

    …r during reflection"
    
    ## What changes were proposed in this pull request?
    
    The purpose of [SPARK-12653](https://issues.apache.org/jira/browse/SPARK-12653) is re-enabling a regression test.
    Historically, the target regression test is added by [SPARK-8498](093c348), but is temporarily disabled by [SPARK-12615](8ce645d) due to binary compatibility error.
    
    The following is the current error message at the submitting spark job with the pre-built `test.jar` file in the target regression test.
    ```
    Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.SparkContext$.$lessinit$greater$default$6()Lscala/collection/Map;
    ```
    
    Simply rebuilding `test.jar` cannot restore the purpose of the test case, since we need to support both Scala 2.10 and 2.11 for a while. For example, we will see the following Scala 2.11 error if we use a `test.jar` built with Scala 2.10.
    ```
    Exception in thread "main" java.lang.NoSuchMethodError: scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaMirrors$JavaMirror;
    ```
    
    This PR replaces the existing `test.jar` with `test-2.10.jar` and `test-2.11.jar` and improves the regression test to use the suitable jar file.
    
    ## How was this patch tested?
    
    Pass the existing Jenkins test.
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #11744 from dongjoon-hyun/SPARK-12653.
    dongjoon-hyun authored and srowen committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    431a3d0 View commit details
    Browse the repository at this point in the history
  8. [SPARK-13906] Ensure that there are at least 2 dispatcher threads.

    ## What changes were proposed in this pull request?
    
    Force at least two dispatcher-event-loop threads. Since SparkDeploySchedulerBackend (in AppClient) calls askWithRetry on CoarseGrainedScheduler in the same process, the driver needs at least two dispatcher threads to prevent the dispatcher thread from hanging.
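    
    A minimal sketch of the guard, with availableProcessors() standing in for whatever configured value would otherwise be used (the surrounding code is illustrative, not Spark's exact Dispatcher):
    
    ```scala
    // Never size the dispatcher's thread pool below 2, so one thread stays free
    // even while another blocks on a same-process ask.
    val requested = Runtime.getRuntime.availableProcessors()
    val numDispatcherThreads = math.max(2, requested)
    ```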
    
    ## How was this patch tested?
    
    Manual.
    
    Author: Yonathan Randolph <yonathangmail.com>
    
    Author: Yonathan Randolph <yonathan@liftigniter.com>
    
    Closes #11728 from yonran/SPARK-13906.
    yonran authored and srowen committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    05ab294 View commit details
    Browse the repository at this point in the history
  9. [SPARK-13823][SPARK-13397][SPARK-13395][CORE] More warnings, Standard…

    …Charset follow up
    
    ## What changes were proposed in this pull request?
    
    Follow up to #11657
    
    - Also update `String.getBytes("UTF-8")` to use `StandardCharsets.UTF_8`
    - And fix one last new Coverity warning that turned up (use of unguarded `wait()` replaced by simpler/more robust `java.util.concurrent` classes in tests)
    - And while we're here cleaning up Coverity warnings, just fix about 15 more build warnings
    
    ## How was this patch tested?
    
    Jenkins tests
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes #11725 from srowen/SPARK-13823.2.
    srowen committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    3b461d9 View commit details
    Browse the repository at this point in the history
  10. [SPARK-13396] Stop using our internal deprecated .metrics on Exceptio…

    JIRA: https://issues.apache.org/jira/browse/SPARK-13396
    
    Stop using our internal deprecated .metrics on ExceptionFailure; instead, use accumUpdates.
    
    Author: GayathriMurali <gayathri.m.softie@gmail.com>
    
    Closes #11544 from GayathriMurali/SPARK-13396.
    GayathriMurali authored and srowen committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    56d8824 View commit details
    Browse the repository at this point in the history
  11. [SPARK-13793][CORE] PipedRDD doesn't propagate exceptions while readi…

    …ng parent RDD
    
    ## What changes were proposed in this pull request?
    
    PipedRDD creates a child thread to read the output of the parent stage and feed it to the pipe process. This change uses a variable to save any exception thrown in the child thread and then propagates the exception in the main thread if the variable was set, as sketched below.
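    
    A minimal sketch of that pattern, with assumed names (PipeHelper and propagateChildException are illustrative, not Spark's actual identifiers):
    
    ```scala
    class PipeHelper(feed: () => Unit) {
      @volatile private var childThreadException: Throwable = _
    
      // Child thread: feeds parent-RDD output to the external process, recording any failure.
      private val writer = new Thread("pipe stdin writer") {
        override def run(): Unit =
          try feed() catch { case t: Throwable => childThreadException = t }
      }
      writer.start()
    
      /** Called from the main thread while reading process output; re-throws the child's failure. */
      def propagateChildException(): Unit = {
        val t = childThreadException
        if (t != null) throw new RuntimeException("Exception in pipe input writer thread", t)
      }
    }
    ```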
    
    ## How was this patch tested?
    
    - Added a unit test
    - Ran all the existing tests in PipedRDDSuite and they all pass with the change
    - Tested the patch with a real pipe() job, bounced the executor node which ran the parent stage to simulate a fetch failure and observed that the parent stage was re-ran.
    
    Author: Tejas Patil <tejasp@fb.com>
    
    Closes #11628 from tejasapatil/pipe_rdd.
    tejasapatil authored and srowen committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    1d95fb6 View commit details
    Browse the repository at this point in the history
  12. [SPARK-13889][YARN] Fix integer overflow when calculating the max num…

    …ber of executor failure
    
    ## What changes were proposed in this pull request?
    The max number of executor failures before failing the application defaults to twice the maximum number of executors when dynamic allocation is enabled. The default value of "spark.dynamicAllocation.maxExecutors" is Int.MaxValue, so this causes an integer overflow and a wrong result: the calculated default max number of executor failures is 3. This PR adds a check to avoid the overflow.
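    
    A hedged illustration of the overflow and of one way to clamp it (the actual fix in this PR may differ in detail):
    
    ```scala
    val maxNumExecutors = Int.MaxValue                    // default of spark.dynamicAllocation.maxExecutors
    val broken = math.max(maxNumExecutors * 2, 3)         // Int overflow: 2 * Int.MaxValue == -2, so this is 3
    val fixed  = math.max(math.min(maxNumExecutors, Int.MaxValue / 2) * 2, 3)  // clamp before doubling
    ```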
    
    ## How was this patch tested?
    It tests whether the value is greater than Int.MaxValue / 2 to avoid the overflow when it is multiplied by 2.
    
    Author: Carson Wang <carson.wang@intel.com>
    
    Closes #11713 from carsonwang/IntOverflow.
    carsonwang authored and srowen committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    496d2a2 View commit details
    Browse the repository at this point in the history
  13. [SPARK-13823][HOTFIX] Increase tryAcquire timeout and assert it succe…

    …eds to fix failure on slow machines
    
    ## What changes were proposed in this pull request?
    
    I'm seeing several PR builder builds fail after https://github.com/apache/spark/pull/11725/files. Example:
    
    https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.4/lastFailedBuild/console
    
    ```
    testCommunication(org.apache.spark.launcher.LauncherServerSuite)  Time elapsed: 0.023 sec  <<< FAILURE!
    java.lang.AssertionError: expected:<app-id> but was:<null>
    	at org.apache.spark.launcher.LauncherServerSuite.testCommunication(LauncherServerSuite.java:93)
    ```
    
    However, other builds pass this same test, including the test when run locally and on the Jenkins PR builder. The failure itself concerns a change to how the test waits on a condition, and the wait can time out; therefore I think this is due to fast/slow machine differences.
    
    This is an attempt at a hot fix; it's a little hard to verify since locally and on the PR builder, it passes anyway. The change itself should be harmless anyway.
    
    Why didn't this happen before, if the new logic was supposed to be equivalent to the old? I think this is the sequence:
    
    - First attempt to acquire semaphore for 10ms actually silently times out
    - The change being waited for happens just after that, a bit too late
    - Assertion passes since condition became true just in time
    - `release()` fires from the listener
    - Next `tryAcquire` however immediately succeeds because the first `tryAcquire` didn't acquire anything, but its subsequent condition is not yet true; this would explain why the second one always fails
    
    Versus the original using `notifyAll()`, there's a small difference: `wait()`-ing after `notifyAll()` just results in another wait; it doesn't make it return immediately. So this was a tiny latent issue that was masked by the semantics. Now the test asserts that the event actually happened (semaphore was acquired). (The timeout is still here to prevent the test from hanging forever, and to detect really slow response.) The timeout is increased to a second to allow plenty of time anyway.
    
    ## How was this patch tested?
    
    Jenkins tests
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes #11763 from srowen/SPARK-13823.3.
    srowen committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    9412547 View commit details
    Browse the repository at this point in the history
  14. [SPARK-13281][CORE] Switch broadcast of RDD to exception from warning

    ## What changes were proposed in this pull request?
    
    In SparkContext, throw IllegalArgumentException when trying to broadcast an RDD directly, instead of just logging a warning.
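    
    A hedged sketch of such a check, assuming spark-core on the classpath; the message text is illustrative, not necessarily Spark's exact wording. Note that require throws IllegalArgumentException, matching the behavior described above.
    
    ```scala
    import org.apache.spark.rdd.RDD
    
    def assertNotDirectlyBroadcastingRdd(value: Any): Unit = {
      // Broadcasting an RDD object itself is almost never what the user wants.
      require(!value.isInstanceOf[RDD[_]],
        "Can not directly broadcast RDDs; instead, call collect() and broadcast the result.")
    }
    ```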
    
    ## How was this patch tested?
    
    mvn clean install
    Add UT in BroadcastSuite
    
    Author: Wesley Tang <tangmingjun@mininglamp.com>
    
    Closes #11735 from breakdawn/master.
    Wesley Tang authored and srowen committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    5f6bdf9 View commit details
    Browse the repository at this point in the history
  15. [SPARK-13360][PYSPARK][YARN] PYSPARK_DRIVER_PYTHON and PYSPARK_PYTHON…

    … is not picked up in yarn-cluster mode
    
    Author: Jeff Zhang <zjffdu@apache.org>
    
    Closes #11238 from zjffdu/SPARK-13360.
    zjffdu authored and Marcelo Vanzin committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    eacd9d8 View commit details
    Browse the repository at this point in the history
  16. [SPARK-13924][SQL] officially support multi-insert

    ## What changes were proposed in this pull request?
    
    There is a feature of hive SQL called multi-insert. For example:
    ```
    FROM src
    INSERT OVERWRITE TABLE dest1
    SELECT key + 1
    INSERT OVERWRITE TABLE dest2
    SELECT key WHERE key > 2
    INSERT OVERWRITE TABLE dest3
    SELECT col EXPLODE(arr) exp AS col
    ...
    ```
    
    We partially support it currently, with some limitations: 1) WHERE can't reference columns produced by LATERAL VIEW. 2) It's not executed eagerly, i.e. `sql("...multi-insert clause...")` won't take place right away like other commands, e.g. CREATE TABLE.
    
    This PR removes these limitations and make us fully support multi-insert.
    
    ## How was this patch tested?
    
    new tests in `SQLQuerySuite`
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #11754 from cloud-fan/lateral-view.
    cloud-fan authored and rxin committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    d9e8f26 View commit details
    Browse the repository at this point in the history
  17. [SPARK-13894][SQL] SqlContext.range return type from DataFrame to Dat…

    …aSet
    
    ## What changes were proposed in this pull request?
    https://issues.apache.org/jira/browse/SPARK-13894
    Change the return type of the `SQLContext.range` API from `DataFrame` to `Dataset`.
    
    ## How was this patch tested?
    No additional unit test required.
    
    Author: Cheng Hao <hao.cheng@intel.com>
    
    Closes #11730 from chenghao-intel/range.
    chenghao-intel authored and rxin committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    d9670f8 View commit details
    Browse the repository at this point in the history
  18. [SPARK-13816][GRAPHX] Add parameter checks for algorithms in Graphx

    JIRA: https://issues.apache.org/jira/browse/SPARK-13816
    
    ## What changes were proposed in this pull request?
    
    Add parameter checks for algorithms in GraphX: Pregel, LabelPropagation, PageRank, SVDPlusPlus
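    
    A hedged sketch of the kind of argument validation involved; the parameter names and messages are illustrative, not GraphX's exact code:
    
    ```scala
    def validatePageRankArgs(numIter: Int, resetProb: Double): Unit = {
      require(numIter > 0, s"Number of iterations must be greater than 0, but got $numIter")
      require(resetProb >= 0.0 && resetProb <= 1.0,
        s"Random reset probability must belong to [0, 1], but got $resetProb")
    }
    ```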
    
    ## How was this patch tested?
    
    manual tests
    
    Author: Zheng RuiFeng <ruifengz@foxmail.com>
    
    Closes #11655 from zhengruifeng/graphx_param_check.
    zhengruifeng authored and rxin committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    9198497 View commit details
    Browse the repository at this point in the history
  19. [SPARK-13827][SQL] Can't add subquery to an operator with same-name o…

    …utputs while generate SQL string
    
    ## What changes were proposed in this pull request?
    
    This PR tries to solve a fundamental issue in the `SQLBuilder`. When we want to turn a logical plan into a SQL string and put it after the FROM clause, we need to wrap it with a sub-query. However, a logical plan is allowed to have same-name outputs with different qualifiers (e.g. the `Join` operator), and this kind of plan can't be put under a subquery, as we erase and assign a new qualifier to all outputs, making it impossible to distinguish same-name outputs.
    
    To solve this problem, this PR renames all attributes with globally unique names(using exprId), so that we don't need qualifiers to resolve ambiguity anymore.
    
    For example, `SELECT x.key, MAX(y.key) OVER () FROM t x JOIN t y`, we will parse this SQL to a Window operator and a Project operator, and add a sub-query between them. The generated SQL looks like:
    ```
    SELECT sq_1.key, sq_1.max
    FROM (
        SELECT sq_0.key, sq_0.key, MAX(sq_0.key) OVER () AS max
        FROM (
            SELECT x.key, y.key FROM t1 AS x JOIN t2 AS y
        ) AS sq_0
    ) AS sq_1
    ```
    You can see, the `key` columns become ambiguous after `sq_0`.
    
    After this PR, it will generate something like:
    ```
    SELECT attr_30 AS key, attr_37 AS max
    FROM (
        SELECT attr_30, attr_37
        FROM (
            SELECT attr_30, attr_35, MAX(attr_35) AS attr_37
            FROM (
                SELECT attr_30, attr_35 FROM
                    (SELECT key AS attr_30 FROM t1) AS sq_0
                INNER JOIN
                    (SELECT key AS attr_35 FROM t1) AS sq_1
            ) AS sq_2
        ) AS sq_3
    ) AS sq_4
    ```
    The outermost SELECT is used to turn the generated named to real names back, and the innermost SELECT is used to alias real columns to our generated names. Between them, there is no name ambiguity anymore.
    
    ## How was this patch tested?
    
    existing tests and new tests in LogicalPlanToSQLSuite.
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #11658 from cloud-fan/gensql.
    cloud-fan authored and yhuai committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    1d1de28 View commit details
    Browse the repository at this point in the history
  20. [SPARK-12721][SQL] SQL Generation for Script Transformation

    #### What changes were proposed in this pull request?
    
    This PR is to convert to SQL from analyzed logical plans containing operator `ScriptTransformation`.
    
    For example, below is the SQL containing `Transform`
    ```
    SELECT TRANSFORM (a, b, c, d) USING 'cat' FROM parquet_t2
    ```
    
    Its logical plan is like
    ```
    ScriptTransformation [a#210L,b#211L,c#212L,d#213L], cat, [key#208,value#209], HiveScriptIOSchema(List(),List(),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),List((field.delim,	)),List((field.delim,	)),Some(org.apache.hadoop.hive.ql.exec.TextRecordReader),Some(org.apache.hadoop.hive.ql.exec.TextRecordWriter),true)
    +- SubqueryAlias parquet_t2
       +- Relation[a#210L,b#211L,c#212L,d#213L] ParquetRelation
    ```
    
    The generated SQL will be like
    ```
    SELECT TRANSFORM (`parquet_t2`.`a`, `parquet_t2`.`b`, `parquet_t2`.`c`, `parquet_t2`.`d`) USING 'cat' AS (`key` string, `value` string) FROM `default`.`parquet_t2`
    ```
    #### How was this patch tested?
    
    Seven test cases are added to `LogicalPlanToSQLSuite`.
    
    Author: gatorsmile <gatorsmile@gmail.com>
    Author: xiaoli <lixiao1983@gmail.com>
    Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
    
    Closes #11503 from gatorsmile/transformToSQL.
    gatorsmile authored and yhuai committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    c4bd576 View commit details
    Browse the repository at this point in the history
  21. [SPARK-13038][PYSPARK] Add load/save to pipeline

    ## What changes were proposed in this pull request?
    
    JIRA issue: https://issues.apache.org/jira/browse/SPARK-13038
    
    1. Add load/save to PySpark Pipeline and PipelineModel
    
    2. Add `_transfer_stage_to_java()` and `_transfer_stage_from_java()` for `JavaWrapper`.
    
    ## How was this patch tested?
    
    Test with doctest.
    
    Author: Xusen Yin <yinxusen@gmail.com>
    
    Closes #11683 from yinxusen/SPARK-13038-only.
    yinxusen authored and jkbradley committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    ae6c677 View commit details
    Browse the repository at this point in the history
  22. [SPARK-13613][ML] Provide ignored tests to export test dataset into C…

    …SV format
    
    ## What changes were proposed in this pull request?
    Provide ignored test cases to export the test dataset into CSV format in ```LinearRegressionSuite```, ```LogisticRegressionSuite```, ```AFTSurvivalRegressionSuite``` and ```GeneralizedLinearRegressionSuite```, so users can validate the training accuracy compared with R's glm, glmnet and survival package.
    cc mengxr
    ## How was this patch tested?
    The test suite is ignored, but I have enabled all these cases offline and it works as expected.
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes #11463 from yanboliang/spark-13613.
    yanboliang authored and mengxr committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    3f06eb7 View commit details
    Browse the repository at this point in the history
  23. [SPARK-11888][ML] Decision tree persistence in spark.ml

    ### What changes were proposed in this pull request?
    
    Made these MLReadable and MLWritable: DecisionTreeClassifier, DecisionTreeClassificationModel, DecisionTreeRegressor, DecisionTreeRegressionModel
    * The shared implementation is in treeModels.scala
    * I use case classes to create a DataFrame to save, and I use the Dataset API to parse loaded files.
    
    Other changes:
    * Made CategoricalSplit.numCategories public (to use in persistence)
    * Fixed a bug in DefaultReadWriteTest.testEstimatorAndModelReadWrite, where it did not call the checkModelData function passed as an argument.  This caused an error in LDASuite, which I fixed.
    
    ### How was this patch tested?
    
    Persistence is tested via unit tests.  For each algorithm, there are 2 non-trivial trees (depth 2).  One is built with continuous features, and one with categorical; this ensures that both types of splits are tested.
    
    Author: Joseph K. Bradley <joseph@databricks.com>
    
    Closes #11581 from jkbradley/dt-io.
    jkbradley committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    6fc2b65 View commit details
    Browse the repository at this point in the history
  24. [SPARK-13927][MLLIB] add row/column iterator to local matrices

    ## What changes were proposed in this pull request?
    
    Add row/column iterator to local matrices to simplify tasks like BlockMatrix => RowMatrix conversion. It handles dense and sparse matrices properly.
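    
    A hedged usage sketch, assuming the new iterator is exposed as rowIter on local matrices (spark-mllib on the classpath); it should behave the same for dense and sparse matrices.
    
    ```scala
    import org.apache.spark.mllib.linalg.Matrices
    
    // 2 x 3 dense matrix in column-major order.
    val m = Matrices.dense(2, 3, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
    m.rowIter.foreach(println)  // prints two row vectors of length 3
    ```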
    
    ## How was this patch tested?
    
    Unit tests on sparse and dense matrix.
    
    cc: dbtsai
    
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes #11757 from mengxr/SPARK-13927.
    mengxr authored and DB Tsai committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    85c42fd View commit details
    Browse the repository at this point in the history
  25. [SPARK-13034] PySpark ml.classification support export/import

    ## What changes were proposed in this pull request?
    
    Add export/import for all estimators and transformers(which have Scala implementation) under pyspark/ml/classification.py.
    
    ## How was this patch tested?
    
    ./python/run-tests
    ./dev/lint-python
    Unit tests added to check persistence in Logistic Regression
    
    Author: GayathriMurali <gayathri.m.softie@gmail.com>
    
    Closes #11707 from GayathriMurali/SPARK-13034.
    GayathriMurali authored and jkbradley committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    27e1f38 View commit details
    Browse the repository at this point in the history
  26. [SPARK-13942][CORE][DOCS] Remove Shark-related docs for 2.x

    ## What changes were proposed in this pull request?
    
    `Shark` was merged into `Spark SQL` back in [July 2014](https://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html). The following seems to be the only legacy. For Spark 2.x, we had better clean up those docs.
    
    **Migration Guide**
    ```
    - ## Migration Guide for Shark Users
    - ...
    - ### Scheduling
    - ...
    - ### Reducer number
    - ...
    - ### Caching
    ```
    
    ## How was this patch tested?
    
    Pass the Jenkins test.
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #11770 from dongjoon-hyun/SPARK-13942.
    dongjoon-hyun authored and rxin committed Mar 16, 2016
    Configuration menu
    Copy the full SHA
    4ce2d24 View commit details
    Browse the repository at this point in the history
  27. [SPARK-13922][SQL] Filter rows with null attributes in vectorized par…

    …quet reader
    
    ## What changes were proposed in this pull request?
    
    It's common for many SQL operators to not care about reading `null` values for correctness. Currently, this is achieved by performing `isNotNull` checks (for all relevant columns) on a per-row basis. Pushing these null filters in the vectorized parquet reader should bring considerable benefits (especially for cases when the underlying data doesn't contain any nulls or contains all nulls).
    
    ## How was this patch tested?
    
            Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
            String with Nulls Scan (0%):        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
            -------------------------------------------------------------------------------------------
            SQL Parquet Vectorized                   1229 / 1648          8.5         117.2       1.0X
            PR Vectorized                             833 /  846         12.6          79.4       1.5X
            PR Vectorized (Null Filtering)            732 /  782         14.3          69.8       1.7X
    
            Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
            String with Nulls Scan (50%):       Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
            -------------------------------------------------------------------------------------------
            SQL Parquet Vectorized                    995 / 1053         10.5          94.9       1.0X
            PR Vectorized                             732 /  772         14.3          69.8       1.4X
            PR Vectorized (Null Filtering)            725 /  790         14.5          69.1       1.4X
    
            Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
            String with Nulls Scan (95%):       Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
            -------------------------------------------------------------------------------------------
            SQL Parquet Vectorized                    326 /  333         32.2          31.1       1.0X
            PR Vectorized                             190 /  200         55.1          18.2       1.7X
            PR Vectorized (Null Filtering)            168 /  172         62.2          16.1       1.9X
    
    Author: Sameer Agarwal <sameer@databricks.com>
    
    Closes #11749 from sameeragarwal/perf-testing.
    sameeragarwal authored and yhuai committed Mar 16, 2016
    b90c020
  28. [SPARK-13871][SQL] Support for inferring filters from data constraints

    ## What changes were proposed in this pull request?
    
    This PR generalizes the `NullFiltering` optimizer rule in catalyst to `InferFiltersFromConstraints` that can automatically infer all relevant filters based on an operator's constraints while making sure of 2 things:
    
    (a) no redundant filters are generated, and
    (b) filters that do not contribute to any further optimizations are not generated.
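    
    As a rough illustration of the kind of inference involved (a sketch only, assuming tables `t1` and `t2` are registered with `sqlContext`; this is not the rule's actual code):
    
    ```scala
    import org.apache.spark.sql.functions.col
    
    val joined = sqlContext.table("t1")
      .join(sqlContext.table("t2"), col("a") === col("b"))
      .where(col("a") > 10)
    
    // With filter inference, the optimized plan should also carry the inferred
    // predicates isnotnull(b) and b > 10 on the t2 side of the join:
    joined.explain(true)
    ```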
    
    ## How was this patch tested?
    
    Extended all tests in `InferFiltersFromConstraintsSuite` (which were initially based on `NullFilteringSuite`) to test filter inference in `Filter` and `Join` operators.
    
    In particular, the two tests (`single inner join with pre-existing filters: filter out values on either side` and `multiple inner joins: filter out values on all sides on equi-join keys`) attempt to highlight/test the real potential of this rule for join optimization.
    
    Author: Sameer Agarwal <sameer@databricks.com>
    
    Closes #11665 from sameeragarwal/infer-filters.
    sameeragarwal authored and yhuai committed Mar 16, 2016
    f96997b
  29. [SPARK-13869][SQL] Remove redundant conditions while combining filters

    ## What changes were proposed in this pull request?
    
    **[I'll link it to the JIRA once ASF JIRA is back online]**
    
    This PR modifies the existing `CombineFilters` rule to remove redundant conditions while combining individual filter predicates. For instance, queries of the form `table.where('a === 1 && 'b === 1).where('a === 1 && 'c === 1)` will now be optimized to `table.where('a === 1 && 'b === 1 && 'c === 1)` (instead of `table.where('a === 1 && 'a === 1 && 'b === 1 && 'c === 1)`).
    
    ## How was this patch tested?
    
    Unit test in `FilterPushdownSuite`
    
    Author: Sameer Agarwal <sameer@databricks.com>
    
    Closes #11670 from sameeragarwal/combine-filters.
    sameeragarwal authored and yhuai committed Mar 16, 2016
    77ba302
  30. [SPARK-11011][SQL] Narrow type of UDT serialization

    ## What changes were proposed in this pull request?
    
    Narrow down the parameter type of `UserDefinedType#serialize()`. Currently, the parameter type is `Any`, however it would logically make more sense to narrow it down to the type of the actual user defined type.
    
    ## How was this patch tested?
    
    Existing tests were successfully run on a local machine.
    
    Author: Jakob Odersky <jakob@odersky.com>
    
    Closes #11379 from jodersky/SPARK-11011-udt-types.
    jodersky authored and mengxr committed Mar 16, 2016
    d4d8493

Commits on Mar 17, 2016

  1. [SPARK-13761][ML] Deprecate validateParams

    ## What changes were proposed in this pull request?
    
    Deprecate validateParams() method here: https://github.com/apache/spark/blob/035d3acdf3c1be5b309a861d5c5beb803b946b5e/mllib/src/main/scala/org/apache/spark/ml/param/params.scala#L553
    Move all functionality in overridden methods to transformSchema().
    Check docs to make sure they indicate complex Param interaction checks should be done in transformSchema.
    
    ## How was this patch tested?
    
    unit tests
    
    Author: Yuhao Yang <hhbyyh@gmail.com>
    
    Closes #11620 from hhbyyh/depreValid.
    hhbyyh authored and jkbradley committed Mar 17, 2016
    92b7057
  2. [SPARK-13923][SQL] Implement SessionCatalog

    ## What changes were proposed in this pull request?
    
    As part of the effort to merge `SQLContext` and `HiveContext`, this patch implements an internal catalog called `SessionCatalog` that handles temporary functions and tables and delegates metastore operations to `ExternalCatalog`. Currently, this is still dead code, but in the future it will be part of `SessionState` and will replace `o.a.s.sql.catalyst.analysis.Catalog`.
    
    A recent patch #11573 parses Hive commands ourselves in Spark, but still passes the entire query text to Hive. In a future patch, we will use `SessionCatalog` to implement the parsed commands.
    
    ## How was this patch tested?
    
    800+ lines of tests in `SessionCatalogSuite`.
    
    Author: Andrew Or <andrew@databricks.com>
    
    Closes #11750 from andrewor14/temp-catalog.
    Andrew Or authored and yhuai committed Mar 17, 2016
    ca9ef86
  3. [SPARK-13719][SQL] Parse JSON rows having an array type and a struct …

    …type in the same field
    
    ## What changes were proposed in this pull request?
    
    PR #2400 added support for parsing JSON rows wrapped in an array. However, it throws an exception when the given data contains array data and struct data in the same field, as below:
    
    ```json
    {"a": {"b": 1}}
    {"a": []}
    ```
    
    and the schema is given as below:
    
    ```scala
    val schema =
      StructType(
        StructField("a", StructType(
          StructField("b", StringType) :: Nil
        )) :: Nil)
    ```
    
    - **Before**
    
    ```scala
    sqlContext.read.schema(schema).json(path).show()
    ```
    
    ```scala
    Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 0.0 failed 4 times, most recent failure: Lost task 7.3 in stage 0.0 (TID 10, 192.168.1.170): java.lang.ClassCastException: org.apache.spark.sql.types.GenericArrayData cannot be cast to org.apache.spark.sql.catalyst.InternalRow
    	at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getStruct(rows.scala:50)
    	at org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getStruct(rows.scala:247)
    	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown Source)
    ...
    ```
    
    - **After**
    
    ```scala
    sqlContext.read.schema(schema).json(path).show()
    ```
    
    ```bash
    +----+
    |   a|
    +----+
    | [1]|
    |null|
    +----+
    ```
    
    For other data types, the given values are converted to `null` in this case, but only this case throws an exception.
    
    This PR makes the support for wrapped rows applied only at the top level.
    
    ## How was this patch tested?
    
    Unit tests were used, plus `./dev/run_tests` for code style checks.
    
    Author: hyukjinkwon <gurwls223@gmail.com>
    
    Closes #11752 from HyukjinKwon/SPARK-3308-follow-up.
    HyukjinKwon authored and yhuai committed Mar 17, 2016
    917f400
  4. [SPARK-13873] [SQL] Avoid copy of UnsafeRow when there is no join in …

    …whole stage codegen
    
    ## What changes were proposed in this pull request?
    
    We need to copy the UnsafeRow because a Join could produce multiple rows from a single input row. We can avoid the copy if there is no join (or the join will not produce multiple rows) inside WholeStageCodegen.
    
    Updated the benchmark for `collect`; we see a 20-30% speedup.
    
    ## How was this patch tested?
    
    existing unit tests.
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #11740 from davies/avoid_copy2.
    Davies Liu authored and davies committed Mar 17, 2016
    c100d31
  5. [SPARK-13118][SQL] Expression encoding for optional synthetic classes

    ## What changes were proposed in this pull request?
    
    Fix expression generation for optional types.
    Standard Java reflection causes issues when dealing with synthetic Scala objects (things that do not map to Java and thus contain a dollar sign in their name). This patch introduces Scala reflection in such cases.
    
    This patch also adds a regression test for Dataset's handling of classes defined in package objects (which was the initial purpose of this PR).
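    
    A minimal sketch of the regression scenario (hypothetical package and class names, written as a single source file for brevity):
    
    ```scala
    // A case class declared inside a package object compiles to a synthetic class
    // (its JVM name contains a dollar sign), which used to break encoder generation.
    package object mypkg {
      case class Point(x: Int, y: Int)
    }
    
    package mypkg {
      object EncoderDemo {
        def run(sqlContext: org.apache.spark.sql.SQLContext): Unit = {
          import sqlContext.implicits._
          // Building a Dataset of the package-object class previously failed here.
          sqlContext.createDataset(Seq(Point(1, 2), Point(3, 4))).show()
        }
      }
    }
    ```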
    
    ## How was this patch tested?
    A new test in ExpressionEncoderSuite that tests optional inner classes and a regression test for Dataset's handling of package objects.
    
    Author: Jakob Odersky <jakob@odersky.com>
    
    Closes #11708 from jodersky/SPARK-13118-package-objects.
    jodersky authored and rxin committed Mar 17, 2016
    7eef246
  6. [MINOR][SQL][BUILD] Remove duplicated lines

    ## What changes were proposed in this pull request?
    
    This PR removes three minor duplicated lines. The first one causes the following unreachable-code warning.
    ```
    JoinSuite.scala:52: unreachable code
    [warn]       case j: BroadcastHashJoin => j
    ```
    The other two are consecutive repetitions in a `Seq` of MiMa filters.
    
    ## How was this patch tested?
    
    Pass the existing Jenkins test.
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #11773 from dongjoon-hyun/remove_duplicated_line.
    dongjoon-hyun authored and rxin committed Mar 17, 2016
    c890c35
  7. [SPARK-12855][MINOR][SQL][DOC][TEST] remove spark.sql.dialect from do…

    …c and test
    
    ## What changes were proposed in this pull request?
    
    Since developer API of plug-able parser has been removed in #10801 , docs should be updated accordingly.
    
    ## How was this patch tested?
    
    This patch will not affect the real code path.
    
    Author: Daoyuan Wang <daoyuan.wang@intel.com>
    
    Closes #11758 from adrian-wang/spark12855.
    adrian-wang authored and rxin committed Mar 17, 2016
    d1c193a
  8. [SPARK-13926] Automatically use Kryo serializer when shuffling RDDs w…

    …ith simple types
    
    Because ClassTags are available when constructing ShuffledRDD we can use them to automatically use Kryo for shuffle serialization when the RDD's types are known to be compatible with Kryo.
    
    This patch introduces `SerializerManager`, a component which picks the "best" serializer for a shuffle given the elements' ClassTags. It will automatically pick a Kryo serializer for ShuffledRDDs whose key, value, and/or combiner types are primitives, arrays of primitives, or strings. In the future we can use this class as a narrow extension point to integrate specialized serializers for other types, such as ByteBuffers.
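    
    A rough sketch of the selection idea (illustrative only, not the actual `SerializerManager` code):
    
    ```scala
    import scala.reflect.ClassTag
    
    // Kryo is known to work well for primitives, arrays of primitives, and strings.
    val kryoFriendly: Set[Class[_]] = Set(
      classOf[Boolean], classOf[Byte], classOf[Char], classOf[Short],
      classOf[Int], classOf[Long], classOf[Float], classOf[Double], classOf[String])
    
    def canUseKryo(ct: ClassTag[_]): Boolean = {
      val cls = ct.runtimeClass
      kryoFriendly.contains(cls) ||
        (cls.isArray && kryoFriendly.contains(cls.getComponentType))
    }
    ```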
    
    In a planned followup patch, I will extend the BlockManager APIs so that we're able to use similar automatic serializer selection when caching RDDs (this is a little trickier because the ClassTags need to be threaded through many more places).
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #11755 from JoshRosen/automatically-pick-best-serializer.
    JoshRosen authored and rxin committed Mar 17, 2016
    de1a84e
  9. [SPARK-13403][SQL] Pass hadoopConfiguration to HiveConf constructors.

    This commit updates the HiveContext so that sc.hadoopConfiguration is used to instantiate its internal instances of HiveConf.
    
    I tested this by overriding the S3 FileSystem implementation from spark-defaults.conf as "spark.hadoop.fs.s3.impl" (to avoid [HADOOP-12810](https://issues.apache.org/jira/browse/HADOOP-12810)).
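    
    A hedged sketch of that kind of override (the FileSystem class below is illustrative, not a recommendation):
    
    ```scala
    import org.apache.spark.{SparkConf, SparkContext}
    
    // Any "spark.hadoop.*" setting is copied into sc.hadoopConfiguration (minus the prefix).
    val conf = new SparkConf()
      .setAppName("hive-conf-demo")
      .setMaster("local[2]")
      .set("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    val sc = new SparkContext(conf)
    
    // With this change, HiveContext builds its HiveConf from sc.hadoopConfiguration,
    // so the overridden FileSystem implementation is also visible on Hive code paths.
    ```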
    
    Author: Ryan Blue <blue@apache.org>
    
    Closes #11273 from rdblue/SPARK-13403-new-hive-conf-from-hadoop-conf.
    rdblue authored and rxin committed Mar 17, 2016
    5faba9f
  10. [SPARK-13948] MiMa check should catch if the visibility changes to pr…

    …ivate
    
    MiMa excludes are currently generated using both the current Spark version's classes and Spark 1.2.0's classes, but this doesn't make sense: we should only be ignoring classes which were `private` in the previous Spark version, not classes which became private in the current version.
    
    This patch updates `dev/mima` to only generate excludes with respect to the previous artifacts that MiMa checks against. It also updates `MimaBuild` so that `excludeClass` only applies directly to the class being excluded and not to its companion object (since a class and its companion object can have different accessibility).
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #11774 from JoshRosen/SPARK-13948.
    JoshRosen authored and rxin committed Mar 17, 2016
    82066a1
  11. Revert "[SPARK-13840][SQL] Split Optimizer Rule ColumnPruning to Colu…

    …mnPruning and EliminateOperator"
    
    This reverts commit 99bd2f0.
    davies committed Mar 17, 2016
    30c1884
  12. [MINOR][DOC] Add JavaStreamingTestExample

    ## What changes were proposed in this pull request?
    
    Add a Java example for StreamingTest.
    
    ## How was this patch tested?
    
    manual tests in CLI: bin/run-example mllib.JavaStreamingTestExample dataDir 5 100
    
    Author: Zheng RuiFeng <ruifengz@foxmail.com>
    
    Closes #11776 from zhengruifeng/streaming_je.
    zhengruifeng authored and MLnick committed Mar 17, 2016
    204c9de
  13. [SPARK-13629][ML] Add binary toggle Param to CountVectorizer

    ## What changes were proposed in this pull request?
    
    It would be handy to add a binary toggle Param to CountVectorizer, as in the scikit-learn one: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
    If set, then all non-zero counts will be set to 1.
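    
    A short sketch of how the new toggle might be used (assuming the Param is exposed through a `setBinary` setter, mirroring scikit-learn's `binary` option):
    
    ```scala
    import org.apache.spark.ml.feature.CountVectorizer
    
    val cv = new CountVectorizer()
      .setInputCol("words")
      .setOutputCol("features")
      .setBinary(true)  // all non-zero term counts are reported as 1.0
    ```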
    
    ## How was this patch tested?
    
    unit tests
    
    Author: Yuhao Yang <hhbyyh@gmail.com>
    
    Closes #11536 from hhbyyh/cvToggle.
    hhbyyh authored and MLnick committed Mar 17, 2016
    357d82d
  14. [SPARK-13901][CORE] correct the logDebug information when jump to the…

    … next locality level
    
    JIRA Issue: https://issues.apache.org/jira/browse/SPARK-13901
    In the getAllowedLocalityLevel method of TaskSetManager, we log the wrong debug information when jumping to the next locality level, so this patch fixes it.
    
    Author: trueyao <501663994@qq.com>
    
    Closes #11719 from trueyao/logDebug-localityWait.
    trueyao authored and srowen committed Mar 17, 2016
    ea9ca6f
  15. [SPARK-13928] Move org.apache.spark.Logging into org.apache.spark.int…

    …ernal.Logging
    
    ## What changes were proposed in this pull request?
    
    Logging was made private in Spark 2.0. If we move it, then users would be able to create a Logging trait themselves to avoid changing their own code.
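    
    A minimal sketch of that workaround: an application can declare its own `Logging` trait (names assumed, built on plain SLF4J) instead of depending on Spark's now-internal one.
    
    ```scala
    import org.slf4j.{Logger, LoggerFactory}
    
    trait Logging {
      @transient private lazy val log: Logger = LoggerFactory.getLogger(getClass)
      protected def logInfo(msg: => String): Unit = if (log.isInfoEnabled) log.info(msg)
      protected def logWarning(msg: => String): Unit = if (log.isWarnEnabled) log.warn(msg)
    }
    ```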
    
    ## How was this patch tested?
    
    existing tests.
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #11764 from cloud-fan/logger.
    cloud-fan committed Mar 17, 2016
    8ef3399
  16. [SPARK-12719][SQL] SQL generation support for Generate

    ## What changes were proposed in this pull request?
    
    This PR adds SQL generation support for the `Generate` operator. It always converts the `Generate` operator into `LATERAL VIEW` format, as there are many limitations on putting a UDTF in the project list.
    
    This PR is based on #11658, please see the last commit to review the real changes.
    
    Thanks dilipbiswal for his initial work! Takes over #11596
    
    ## How was this patch tested?
    
    new tests in `LogicalPlanToSQLSuite`
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #11696 from cloud-fan/generate.
    cloud-fan committed Mar 17, 2016
    1974d1d
  17. [SPARK-13776][WEBUI] Limit the max number of acceptors and selectors …

    …for Jetty
    
    ## What changes were proposed in this pull request?
    
    As each acceptor/selector in Jetty uses one thread, the number of threads should be at least the number of acceptors and selectors plus 1. Otherwise, the Jetty server's thread pool may be exhausted by acceptors/selectors and become unable to respond to any request.
    
    To avoid wasting threads, the PR limits the max number of acceptors and selectors and also updates the max thread number if necessary.
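    
    A toy sketch of the sizing constraint described above (not Spark's actual Jetty setup code):
    
    ```scala
    // Each acceptor and selector permanently occupies one pool thread, so the pool
    // must be strictly larger than their total or no thread is left to serve requests.
    def safePoolSize(acceptors: Int, selectors: Int, requestedThreads: Int): Int =
      math.max(requestedThreads, acceptors + selectors + 1)
    ```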
    
    ## How was this patch tested?
    
    Just make sure we don't break any existing tests
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #11615 from zsxwing/SPARK-13776.
    zsxwing authored and srowen committed Mar 17, 2016
    65b75e6
  18. [SPARK-13427][SQL] Support USING clause in JOIN.

    ## What changes were proposed in this pull request?
    
    Support queries that JOIN tables with a USING clause.
    SELECT * FROM table1 JOIN table2 USING <column_list>
    
    The USING clause can be used to simplify the join condition when:
    
    1) equi-join semantics are desired, and
    2) the join columns have the same name in both tables.
    
    We already have the support for Natural Join in Spark. This PR makes
    use of the already existing infrastructure for natural join to
    form the join condition and also the projection list.
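    
    A small example of the new syntax (a sketch, assuming tables `employees` and `departments` that both have a `dept_id` column are registered with `sqlContext`):
    
    ```scala
    val df = sqlContext.sql(
      "SELECT * FROM employees JOIN departments USING (dept_id)")
    // Behaves like an equi-join on dept_id, with dept_id appearing only once in the output.
    df.show()
    ```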
    
    ## How was this patch tested?
    
    Have added unit tests in SQLQuerySuite, CatalystQlSuite, ResolveNaturalJoinSuite
    
    Author: Dilip Biswal <dbiswal@us.ibm.com>
    
    Closes #11297 from dilipbiswal/spark-13427.
    dilipbiswal authored and marmbrus committed Mar 17, 2016
    637a78f
  19. [SPARK-13838] [SQL] Clear variable code to prevent it to be re-evalua…

    …ted in BoundAttribute
    
    JIRA: https://issues.apache.org/jira/browse/SPARK-13838
    ## What changes were proposed in this pull request?
    
    We should also clear the variable code in `BoundReference.genCode` to prevent it from being evaluated twice, as we did in `evaluateVariables`.
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
    
    Closes #11674 from viirya/avoid-reevaluate.
    viirya authored and davies committed Mar 17, 2016
    5f3bda6
  20. [SPARK-12719][HOTFIX] Fix compilation against Scala 2.10

    ## What changes were proposed in this pull request?
    
    Compilation against Scala 2.10 fails with:
    ```
    [error] [warn] /home/jenkins/workspace/spark-master-compile-sbt-scala-2.10/sql/hive/src/main/scala/org/apache/spark/sql/hive/SQLBuilder.scala:483: Cannot check match for         unreachability.
    [error] (The analysis required more space than allowed. Please try with scalac -Dscalac.patmat.analysisBudget=512 or -Dscalac.patmat.analysisBudget=off.)
    [error] [warn]     private def addSubqueryIfNeeded(plan: LogicalPlan): LogicalPlan = plan match {
    ```
    
    ## How was this patch tested?
    
    Compilation against Scala 2.10
    
    Author: tedyu <yuzhihong@gmail.com>
    
    Closes #11787 from yy2016/master.
    tedyu authored and yhuai committed Mar 17, 2016
    3ee7996
  21. [SPARK-13937][PYSPARK][ML] Change JavaWrapper _java_obj from static t…

    …o member variable
    
    ## What changes were proposed in this pull request?
    In PySpark's wrapper.py, change JavaWrapper's _java_obj from an unused static variable to a member variable, consistent with its usage in derived classes.
    
    ## How was this patch tested?
    Ran python tests for ML and MLlib.
    
    Author: Bryan Cutler <cutlerb@gmail.com>
    
    Closes #11767 from BryanCutler/JavaWrapper-static-_java_obj-SPARK-13937.
    BryanCutler authored and jkbradley committed Mar 17, 2016
    828213d
  22. [SPARK-11891] Model export/import for RFormula and RFormulaModel

    https://issues.apache.org/jira/browse/SPARK-11891
    
    Author: Xusen Yin <yinxusen@gmail.com>
    
    Closes #9884 from yinxusen/SPARK-11891.
    yinxusen authored and jkbradley committed Mar 17, 2016
    edf8b87
  23. 4c08e2c
  24. [SPARK-13761][ML] Remove remaining uses of validateParams

    ## What changes were proposed in this pull request?
    
    Cleanups from [#11620]: remove remaining uses of validateParams, and put functionality into transformSchema
    
    ## How was this patch tested?
    
    Existing unit tests, modified to check using transformSchema instead of validateParams
    
    Author: Joseph K. Bradley <joseph@databricks.com>
    
    Closes #11790 from jkbradley/SPARK-13761-cleanup.
    jkbradley committed Mar 17, 2016
    b39e80d
  25. [SPARK-10788][MLLIB][ML] Remove duplicate bins for decision trees

    Decision trees in spark.ml (RandomForest.scala) communicate twice as much data as needed for unordered categorical features. Here's an example.
    
    Say there are 3 categories A, B, C. We consider 3 splits:
    
    * A vs. B, C
    * A, B vs. C
    * A, C vs. B
    
    Currently, we collect statistics for each of the 6 subsets of categories (3 * 2 = 6). However, we could instead collect statistics for the 3 subsets on the left-hand side of the 3 possible splits: {A}, {A,B}, and {A,C}. If we also have stats for the entire node, then we can compute the stats for the 3 subsets on the right-hand side of the splits. In pseudomath: stats(B,C) = stats(A,B,C) - stats(A).
    
    This patch adds a parent stats array to the `DTStatsAggregator` so that the right child stats do not need to be stored. The right child stats are computed by subtracting left child stats from the parent stats for unordered categorical features.
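    
    A toy illustration of the subtraction trick (plain arrays standing in for the aggregator's per-label statistics; not the actual `DTStatsAggregator` code):
    
    ```scala
    // parentStats: aggregated label counts for the whole node; leftStats: counts for the
    // left-hand-side subset of one candidate split. The right-hand side follows by subtraction.
    def rightStats(parentStats: Array[Double], leftStats: Array[Double]): Array[Double] =
      parentStats.zip(leftStats).map { case (p, l) => p - l }
    
    rightStats(Array(10.0, 6.0), Array(4.0, 1.0))  // => Array(6.0, 5.0)
    ```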
    
    Author: sethah <seth.hendrickson16@gmail.com>
    
    Closes #9474 from sethah/SPARK-10788.
    sethah authored and jkbradley committed Mar 17, 2016
    1614485

Commits on Mar 18, 2016

  1. [SPARK-13974][SQL] sub-query names do not need to be globally unique …

    …while generate SQL
    
    ## What changes were proposed in this pull request?
    
    Sub-query names only need to be unique within a single SQL string generation, not globally. This PR moves the `newSubqueryName` method into `class SQLBuilder` and removes `object SQLBuilder`.
    
    also addressed 2 minor comments in #11696
    
    ## How was this patch tested?
    
    existing tests.
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #11783 from cloud-fan/tmp.
    cloud-fan committed Mar 18, 2016
    453455c
  2. [SPARK-13976][SQL] do not remove sub-queries added by user when gener…

    …ate SQL
    
    ## What changes were proposed in this pull request?
    
    We haven't figured out the correct logic for adding sub-queries yet, so we should not clear all sub-queries before generating SQL. This PR changes the logic to only remove sub-queries above a table relation.
    
    An example of this bug: the original SQL is `SELECT a FROM (SELECT a FROM tbl) t WHERE a = 1`.
    Before this PR, we generate:
    ```
    SELECT attr_1 AS a FROM
      SELECT attr_1 FROM (
        SELECT a AS attr_1 FROM tbl
      ) AS sub_q0
      WHERE attr_1 = 1
    ```
    A sub-query is missing, so this SQL string is invalid.
    
    After this PR, we will generate:
    ```
    SELECT attr_1 AS a FROM (
      SELECT attr_1 FROM (
        SELECT a AS attr_1 FROM tbl
      ) AS sub_q0
      WHERE attr_1 = 1
    ) AS t
    ```
    
    TODO: for long term, we should find a way to add sub-queries correctly, so that arbitrary logical plans can be converted to SQL string.
    
    ## How was this patch tested?
    
    `LogicalPlanToSQLSuite`
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #11786 from cloud-fan/bug-fix.
    cloud-fan committed Mar 18, 2016
    6037ed0
  3. [SPARK-13921] Store serialized blocks as multiple chunks in MemoryStore

    This patch modifies the BlockManager, MemoryStore, and several other storage components so that serialized cached blocks are stored as multiple small chunks rather than as a single contiguous ByteBuffer.
    
    This change will help to improve the efficiency of memory allocation and the accuracy of memory accounting when serializing blocks. Our current serialization code uses a ByteBufferOutputStream, which doubles and re-allocates its backing byte array; this increases the peak memory requirements during serialization (since we need to hold extra memory while expanding the array). In addition, we currently don't account for the extra wasted space at the end of the ByteBuffer's backing array, so a 129 megabyte serialized block may actually consume 256 megabytes of memory. After switching to storing blocks in multiple chunks, we'll be able to efficiently trim the backing buffers so that no space is wasted.
    
    This change is also a prerequisite to being able to cache blocks which are larger than 2GB (although full support for that depends on several other changes which have not been implemented yet).
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #11748 from JoshRosen/chunked-block-serialization.
    JoshRosen committed Mar 18, 2016
    6c2d894
  4. [SPARK-12719][HOTFIX] Fix compilation against Scala 2.10

    PR #11696 introduced a complex pattern match that broke the Scala 2.10 match unreachability check and caused a build failure. This PR fixes the issue by expanding that pattern match into several simpler ones.
    
    Note that tuning or turning off `-Dscalac.patmat.analysisBudget` doesn't work for this case.
    
    Compilation against Scala 2.10
    
    Author: tedyu <yuzhihong@gmail.com>
    
    Closes #11798 from yy2016/master.
    tedyu authored and liancheng committed Mar 18, 2016
    90a1d8d
  5. [SPARK-13826][SQL] Revises Dataset ScalaDoc

    ## What changes were proposed in this pull request?
    
    This PR revises Dataset API ScalaDoc.  All public methods are divided into the following groups
    
    * `groupname basic`: Basic Dataset functions
    * `groupname action`: Actions
    * `groupname untypedrel`: Untyped Language Integrated Relational Queries
    * `groupname typedrel`: Typed Language Integrated Relational Queries
    * `groupname func`: Functional Transformations
    * `groupname rdd`: RDD Operations
    * `groupname output`: Output Operations
    
    `since` tag and sample code are also updated.  We may want to add more sample code for typed APIs.
    
    ## How was this patch tested?
    
    Documentation change.  Checked by building unidoc locally.
    
    Author: Cheng Lian <lian@databricks.com>
    
    Closes #11769 from liancheng/spark-13826-ds-api-doc.
    liancheng authored and rxin committed Mar 18, 2016
    10ef4f3
  6. [SPARK-13930] [SQL] Apply fast serialization on collect limit operator

    ## What changes were proposed in this pull request?
    
    JIRA: https://issues.apache.org/jira/browse/SPARK-13930
    
    Recently the fast serialization has been introduced to collecting DataFrame/Dataset (#11664). The same technology can be used on collect limit operator too.
    
    ## How was this patch tested?
    
    Add a benchmark for collect limit to `BenchmarkWholeStageCodegen`.
    
    Without this patch:
    
        model name      : Westmere E56xx/L56xx/X56xx (Nehalem-C)
        collect limit:                      Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
        -------------------------------------------------------------------------------------------
        collect limit 1 million                  3413 / 3768          0.3        3255.0       1.0X
        collect limit 2 millions                9728 / 10440          0.1        9277.3       0.4X
    
    With this patch:
    
        model name      : Westmere E56xx/L56xx/X56xx (Nehalem-C)
        collect limit:                      Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
        -------------------------------------------------------------------------------------------
        collect limit 1 million                   833 / 1284          1.3         794.4       1.0X
        collect limit 2 millions                 3348 / 4005          0.3        3193.3       0.2X
    
    Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
    
    Closes #11759 from viirya/execute-take.
    viirya authored and davies committed Mar 18, 2016
    750ed64
  7. [SPARK-13826][SQL] Addendum: update documentation for Datasets

    ## What changes were proposed in this pull request?
    This patch updates documentation for Datasets. I also updated some internal documentation for exchange/broadcast.
    
    ## How was this patch tested?
    Just documentation/api stability update.
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes #11814 from rxin/dataset-docs.
    rxin committed Mar 18, 2016
    bb1fda0
  8. [MINOR][ML] When trainingSummary is None, it should throw RuntimeExce…

    …ption.
    
    ## What changes were proposed in this pull request?
    When trainingSummary is None, it should throw a `RuntimeException`.
    cc mengxr
    ## How was this patch tested?
    Existing tests.
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes #11784 from yanboliang/fix-summary.
    yanboliang authored and srowen committed Mar 18, 2016
    7783b6f
  9. [SPARK-14001][SQL] support multi-children Union in SQLBuilder

    ## What changes were proposed in this pull request?
    
    The fix is simple: use the existing `CombineUnions` rule to combine adjacent Unions before building the SQL string.
    
    ## How was this patch tested?
    
    The re-enabled test
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #11818 from cloud-fan/bug-fix.
    cloud-fan authored and liancheng committed Mar 18, 2016
    0f1015f
  10. [MINOR][DOC] Fix nits in JavaStreamingTestExample

    ## What changes were proposed in this pull request?
    
    Fix some nits discussed in #11776 (comment):
    * use `!rdd.isEmpty` instead of `rdd.count > 0`
    * use `static` instead of `AtomicInteger`
    * remove the unneeded `throws Exception`
    
    ## How was this patch tested?
    
    manual tests
    
    Author: Zheng RuiFeng <ruifengz@foxmail.com>
    
    Closes #11821 from zhengruifeng/je_fix.
    zhengruifeng authored and srowen committed Mar 18, 2016
    53f32a2
  11. [SPARK-13972][SQL] hive tests should fail if SQL generation failed

    ## What changes were proposed in this pull request?
    
    Now we should be able to convert all logical plans to SQL strings, as long as they are parsed from a Hive query. This PR changes the error handling to throw exceptions instead of just logging.
    
    We will send new PRs for spotted bugs, and merge this one after all bugs are fixed.
    
    ## How was this patch tested?
    
    existing tests.
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #11782 from cloud-fan/test.
    cloud-fan authored and liancheng committed Mar 18, 2016
    0acb32a
  12. [SPARK-14004][SQL][MINOR] AttributeReference and Alias should only us…

    …e the first qualifier to generate SQL strings
    
    ## What changes were proposed in this pull request?
    
    Current implementations of `AttributeReference.sql` and `Alias.sql` join all available qualifiers, which is logically wrong. However, this mistake doesn't cause any real SQL generation bugs, since there is always at most one qualifier for any given `AttributeReference` or `Alias`.
    
    This PR fixes the issue by picking only the first qualifier.
    
    ## How was this patch tested?
    
    Existing tests should be enough.
    
    Author: Cheng Lian <lian@databricks.com>
    
    Closes #11820 from liancheng/spark-14004-single-qualifier.
    liancheng committed Mar 18, 2016
    14c7236
  13. [SPARK-13977] [SQL] Brings back Shuffled hash join

    ## What changes were proposed in this pull request?
    
    ShuffledHashJoin (including its outer join variant) was removed in 1.6 in favor of SortMergeJoin, which is more robust and also fast.
    
    ShuffledHashJoin is still useful when: 1) one table is much smaller than the other, so the cost of building a hash table on the smaller table is lower than sorting the larger one, and 2) any partition of the small table can fit in memory.
    
    This PR brings back ShuffledHashJoin, essentially reverting #9645 and fixing the conflicts. It also merges outer join and left-semi join into the same class. This PR does not implement full outer join, because that cannot be implemented efficiently (it would require building hash tables on both sides).
    
    A simple benchmark (one table is 5x smaller than the other) shows that ShuffledHashJoin can be 2X faster than SortMergeJoin.
    
    ## How was this patch tested?
    
    Added new unit tests for ShuffledHashJoin.
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #11788 from davies/shuffle_join.
    Davies Liu authored and davies committed Mar 18, 2016
    9c23c81