
pull latest from apache spark #6

Merged
merged 1,347 commits into from
Mar 18, 2016
Conversation

rekhajoshm
Owner

pull latest from apache spark

srowen and others added 30 commits March 3, 2016 15:11
… 2.10

## What changes were proposed in this pull request?

Fixes a compile problem due to inadvertent use of `Option.contains`, which exists only in Scala 2.11. The change should have been to replace `Option.exists(_ == x)` with `== Some(x)`; replacing `exists` with `contains` only makes sense for collections. Replacing the use of `Option.exists` still makes sense, though, as it's misleading.
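
As a minimal, hedged illustration (not the actual patch), the difference between the three forms looks like this:

```scala
val opt: Option[Int] = Some(3)

val viaExists = opt.exists(_ == 3)   // compiles on 2.10 and 2.11, but reads like a collection test
val viaSome   = opt == Some(3)       // the 2.10-safe replacement chosen here
// val viaContains = opt.contains(3) // Option.contains exists only in Scala 2.11+, breaking the 2.10 build
```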

## How was this patch tested?

Jenkins tests / compilation


Author: Sean Owen <sowen@cloudera.com>

Closes #11493 from srowen/SPARK-13423.2.
… include_example

Replace example code in mllib-clustering.md using include_example
https://issues.apache.org/jira/browse/SPARK-13013

The example code in the user guide is embedded in the markdown and hence it is not easy to test. It would be nice to automatically test them. This JIRA is to discuss options to automate example code testing and see what we can do in Spark 1.6.

Goal is to move actual example code to spark/examples and test compilation in Jenkins builds. Then in the markdown, we can reference part of the code to show in the user guide. This requires adding a Jekyll tag that is similar to https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., called include_example.
`{% include_example scala/org/apache/spark/examples/mllib/KMeansExample.scala %}`
Jekyll will find `examples/src/main/scala/org/apache/spark/examples/mllib/KMeansExample.scala`, pick the code blocks marked "example", and render them inside `{% highlight %}` blocks in the markdown.
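
A hedged sketch of what a marked-up example file might look like (the `$example on$`/`$example off$` marker names are an assumption based on the proposal above):

```scala
package org.apache.spark.examples.mllib

object KMeansExample {
  def main(args: Array[String]): Unit = {
    // $example on$
    // Only the code between these markers is extracted by the include_example tag
    // and rendered inside a {% highlight %} block in the user guide.
    println("training k-means here")
    // $example off$
  }
}
```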

See more sub-tasks in parent ticket: https://issues.apache.org/jira/browse/SPARK-11337

Author: Xin Ren <iamshrek@126.com>

Closes #11116 from keypointt/SPARK-13013.
## What changes were proposed in this pull request?

Modifies the dependency declarations of the all the hive artifacts, to explicitly exclude the groovy-all JAR.

This stops the groovy classes *and everything else in that uber-JAR* from getting into spark-assembly JAR.

## How was this patch tested?

1. Pre-patch build was made: `mvn clean install -Pyarn,hive,hive-thriftserver`
1. spark-assembly expanded, observed to have the org.codehaus.groovy packages and JARs
1. A maven dependency tree was created `mvn dependency:tree -Pyarn,hive,hive-thriftserver  -Dverbose > target/dependencies.txt`
1. This text file examined to confirm that groovy was being imported as a dependency of `org.spark-project.hive`
1. Patch applied
1. Repeated step1: clean build of project with ` -Pyarn,hive,hive-thriftserver` set
1. Examined created spark-assembly, verified no org.codehaus packages
1. Verified that the maven dependency tree no longer references groovy

Note also that the size of the assembly JAR was 181,628,646 bytes before this patch and 166,318,515 bytes after, roughly 15 MB smaller. That's a good indication that things are being excluded.

Author: Steve Loughran <stevel@hortonworks.com>

Closes #11449 from steveloughran/fixes/SPARK-13599-groovy-dependency.
## What changes were proposed in this pull request?
The changes proposed were to add train-validation-split to pyspark.ml.tuning.

## How was this patch tested?
This patch was tested through unit tests located in pyspark/ml/test.py.

This is my original work and I license it to Spark.

Author: JeremyNixon <jnixon2@gmail.com>

Closes #11335 from JeremyNixon/tvs_pyspark.
…et/ORC via option()

## What changes were proposed in this pull request?

This PR adds the support to specify compression codecs for both ORC and Parquet.
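
A hedged sketch of how the option might be used (assumes an existing DataFrame `df`; the codec names are illustrative):

```scala
// Write with an explicit compression codec instead of relying on the global default.
df.write.option("compression", "snappy").parquet("/tmp/users_parquet")
df.write.option("compression", "zlib").orc("/tmp/users_orc")
```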

## How was this patch tested?

unittests within IDE and code style tests with `dev/run_tests`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #11464 from HyukjinKwon/SPARK-13543.
…aram

## What changes were proposed in this pull request?
Remove duplicated periods at the end of some sharedParams in ScalaDoc, such as [here](https://github.com/apache/spark/pull/11344/files#diff-9edc669edcf2c0c7cf1efe4a0a57da80L367)
cc mengxr srowen
## How was this patch tested?
Documents change, no test.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #11344 from yanboliang/shared-cleanup.
… 2.10, again

## What changes were proposed in this pull request?

Fixes (another) compile problem due to inadvertent use of `Option.contains`, which exists only in Scala 2.11.

## How was this patch tested?

Jenkins tests

Author: Sean Owen <sowen@cloudera.com>

Closes #11496 from srowen/SPARK-13423.3.
## What changes were proposed in this pull request?

This PR fixes typos in comments and testcase name of code.

## How was this patch tested?

manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11481 from dongjoon-hyun/minor_fix_typos_in_code.
## What changes were proposed in this pull request?

This patch simply moves things to a new package in an effort to reduce the size of the diff in #11048. Currently the new package only has one file, but in the future we'll add many new commands in SPARK-13139.

## How was this patch tested?

Jenkins.

Author: Andrew Or <andrew@databricks.com>

Closes #11482 from andrewor14/commands-package.
… logs to the console

## What changes were proposed in this pull request?

Make ContinuousQueryManagerSuite not output logs to the console. The logs will still output to `unit-tests.log`.

I also updated `SQLListenerMemoryLeakSuite` to use `quietly` to avoid changing the log level which won't output logs to `unit-tests.log`.

## How was this patch tested?

Just check Jenkins output.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #11439 from zsxwing/quietly-ContinuousQueryManagerSuite.
## What changes were proposed in this pull request?

This PR supports visualization for subqueries in the SQL web UI and also improves the explain output for subqueries, especially when used together with whole-stage codegen.

For example:
```python
>>> sqlContext.range(100).registerTempTable("range")
>>> sqlContext.sql("select id / (select sum(id) from range) from range where id > (select id from range limit 1)").explain(True)
== Parsed Logical Plan ==
'Project [unresolvedalias(('id / subquery#9), None)]
:  +- 'SubqueryAlias subquery#9
:     +- 'Project [unresolvedalias('sum('id), None)]
:        +- 'UnresolvedRelation `range`, None
+- 'Filter ('id > subquery#8)
   :  +- 'SubqueryAlias subquery#8
   :     +- 'GlobalLimit 1
   :        +- 'LocalLimit 1
   :           +- 'Project [unresolvedalias('id, None)]
   :              +- 'UnresolvedRelation `range`, None
   +- 'UnresolvedRelation `range`, None

== Analyzed Logical Plan ==
(id / scalarsubquery()): double
Project [(cast(id#0L as double) / cast(subquery#9 as double)) AS (id / scalarsubquery())#11]
:  +- SubqueryAlias subquery#9
:     +- Aggregate [(sum(id#0L),mode=Complete,isDistinct=false) AS sum(id)#10L]
:        +- SubqueryAlias range
:           +- Range 0, 100, 1, 4, [id#0L]
+- Filter (id#0L > subquery#8)
   :  +- SubqueryAlias subquery#8
   :     +- GlobalLimit 1
   :        +- LocalLimit 1
   :           +- Project [id#0L]
   :              +- SubqueryAlias range
   :                 +- Range 0, 100, 1, 4, [id#0L]
   +- SubqueryAlias range
      +- Range 0, 100, 1, 4, [id#0L]

== Optimized Logical Plan ==
Project [(cast(id#0L as double) / cast(subquery#9 as double)) AS (id / scalarsubquery())#11]
:  +- SubqueryAlias subquery#9
:     +- Aggregate [(sum(id#0L),mode=Complete,isDistinct=false) AS sum(id)#10L]
:        +- Range 0, 100, 1, 4, [id#0L]
+- Filter (id#0L > subquery#8)
   :  +- SubqueryAlias subquery#8
   :     +- GlobalLimit 1
   :        +- LocalLimit 1
   :           +- Project [id#0L]
   :              +- Range 0, 100, 1, 4, [id#0L]
   +- Range 0, 100, 1, 4, [id#0L]

== Physical Plan ==
WholeStageCodegen
:  +- Project [(cast(id#0L as double) / cast(subquery#9 as double)) AS (id / scalarsubquery())#11]
:     :  +- Subquery subquery#9
:     :     +- WholeStageCodegen
:     :        :  +- TungstenAggregate(key=[], functions=[(sum(id#0L),mode=Final,isDistinct=false)], output=[sum(id)#10L])
:     :        :     +- INPUT
:     :        +- Exchange SinglePartition, None
:     :           +- WholeStageCodegen
:     :              :  +- TungstenAggregate(key=[], functions=[(sum(id#0L),mode=Partial,isDistinct=false)], output=[sum#14L])
:     :              :     +- Range 0, 1, 4, 100, [id#0L]
:     +- Filter (id#0L > subquery#8)
:        :  +- Subquery subquery#8
:        :     +- CollectLimit 1
:        :        +- WholeStageCodegen
:        :           :  +- Project [id#0L]
:        :           :     +- Range 0, 1, 4, 100, [id#0L]
:        +- Range 0, 1, 4, 100, [id#0L]
```

The web UI looks like:

![subquery](https://cloud.githubusercontent.com/assets/40902/13377963/932bcbae-dda7-11e5-82f7-03c9be85d77c.png)

This PR also changes the tree structure of WholeStageCodegen to make it consistent with other operators. Before this change, both WholeStageCodegen and InputAdapter held references to the same plans, which could be updated without notifying the other, causing problems; this was discovered by #11403.

## How was this patch tested?

Existing tests, plus manual tests with the example query, checking the explain output and the web UI.

Author: Davies Liu <davies@databricks.com>

Closes #11417 from davies/viz_subquery.
## What changes were proposed in this pull request?

Fix race conditions when cleaning up files.

## How was this patch tested?

Existing tests.

Author: Davies Liu <davies@databricks.com>

Closes #11507 from davies/flaky.
…ge in _verify_type

## What changes were proposed in this pull request?

This PR makes `_verify_type` in `types.py` stricter, also checking whether a numeric value is within the allowed range.

## How was this patch tested?

newly added doc test.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #11492 from cloud-fan/py-verify.
… string datatypes to Oracle VARCHAR datatype mapping

## What changes were proposed in this pull request?
A test suite was added for the bug fix (SPARK-12941), covering the mapping of StringType to the corresponding Oracle VARCHAR datatype.

## How was this patch tested?
manual tests done

Author: thomastechs <thomas.sebastian@tcs.com>
Author: THOMAS SEBASTIAN <thomas.sebastian@tcs.com>

Closes #11489 from thomastechs/thomastechs-12941-master-new.
…cled

## What changes were proposed in this pull request?

`sendRpcSync` should copy the response content because the underlying buffer will be recycled and reused.
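
A hedged sketch of the defensive copy (not the actual network-layer code):

```scala
import java.nio.ByteBuffer

// Copy the response into a fresh buffer before the underlying one is recycled and reused.
def copyResponse(response: ByteBuffer): ByteBuffer = {
  val copy = ByteBuffer.allocate(response.remaining())
  copy.put(response)
  copy.flip()
  copy
}
```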

## How was this patch tested?

Jenkins unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #11499 from zsxwing/SPARK-13652.
## What changes were proposed in this pull request?

This adds support for SQL generation for subquery expressions, which are recursively replaced with a SubqueryHolder inside SQLBuilder.

## How was this patch tested?

Added unit tests.

Author: Davies Liu <davies@databricks.com>

Closes #11453 from davies/sql_subquery.
## What changes were proposed in this pull request?

It avoids counting the DataFrame twice.
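
A hedged sketch of the general pattern (assumes an existing DataFrame `df`; the sampling logic is illustrative, not QuantileDiscretizer's actual code):

```scala
val requiredSamples = 10000L
val n = df.count()                                       // count once and reuse the value
val fraction = math.min(1.0, requiredSamples.toDouble / n)
val sampled = df.sample(withReplacement = false, fraction)
```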

Author: Abou Haydar Elias <abouhaydar.elias@gmail.com>
Author: Elie A <abouhaydar.elias@gmail.com>

Closes #11491 from eliasah/quantile-discretizer-patch.
…orkjoin

## What changes were proposed in this pull request?

Remove the old deprecated ThreadPoolExecutor and replace it with an ExecutionContext using a ForkJoinPool. The downside is that Scala's ForkJoinPool doesn't give us a way to specify the thread pool name (and is a wrapper of Java's in 2.12) except by providing a custom factory. Note that we can't use Java's ForkJoinPool directly in Scala 2.11, since it uses an ExecutionContext that reports system parallelism. One other implicit change is that the old ExecutionContext would have reported a different default parallelism, since it used system parallelism rather than thread pool parallelism (this was likely not intended, but also likely not a huge difference).

The previous version of this PR attempted to use an execution context constructed on the thread pool (but not the deprecated ThreadPoolExecutor class) so as to keep human-readable thread names, but this reported system parallelism.
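
A minimal sketch of the general shape of the change (not Spark's actual code; the pool size and usage are illustrative):

```scala
import java.util.concurrent.ForkJoinPool
import scala.concurrent.{ExecutionContext, Future}

// An ExecutionContext backed by a ForkJoinPool with explicit parallelism,
// rather than the deprecated ThreadPoolExecutor-based task support.
val pool = new ForkJoinPool(4)
implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

Future { /* write-ahead-log work would run here */ }
```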

## How was this patch tested?

unit tests: streaming/testOnly org.apache.spark.streaming.util.*

Author: Holden Karau <holden@us.ibm.com>

Closes #11423 from holdenk/SPARK-13398-move-away-from-ThreadPoolTaskSupport-java-forkjoin.
The earlier fix did not copy the bytes, and it is possible for a higher level to reuse the Text object, which was causing issues. The proposed fix copies the bytes from Text while still avoiding the expensive encoding/decoding.

Author: Rajesh Balamohan <rbalamohan@apache.org>

Closes #11477 from rajeshbalamohan/SPARK-12925.2.
## What changes were proposed in this pull request?

This patch fixes the problem that `bin\beeline.cmd` pollutes environment variables.
A similar problem was reported and fixed in https://issues.apache.org/jira/browse/SPARK-3943, but `bin\beeline.cmd` seems to have been added later.

## How was this patch tested?

manual tests:
  I executed the new `bin\beeline.cmd` and confirmed that %SPARK_HOME% doesn't remain in the command prompt.

Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>

Closes #11516 from tsudukim/feature/SPARK-13673.
…egression

## What changes were proposed in this pull request?

The default value of regularization parameter for `LogisticRegression` algorithm is different in Scala and Python. We should provide the same value.

**Scala**
```
scala> new org.apache.spark.ml.classification.LogisticRegression().getRegParam
res0: Double = 0.0
```

**Python**
```
>>> from pyspark.ml.classification import LogisticRegression
>>> LogisticRegression().getRegParam()
0.1
```

## How was this patch tested?
manual. Check the following in `pyspark`.
```
>>> from pyspark.ml.classification import LogisticRegression
>>> LogisticRegression().getRegParam()
0.0
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11519 from dongjoon-hyun/SPARK-13676.
Add save/load for feature.py. Meanwhile, add save/load for `ElementwiseProduct` in Scala side and fix a bug of missing `setDefault` in `VectorSlicer` and `StopWordsRemover`.

In this PR I ignore the `RFormula` and `RFormulaModel` because its Scala implementation is pending in #9884. I'll add them in this PR if #9884 gets merged first. Or add a follow-up JIRA for `RFormula`.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #11203 from yinxusen/SPARK-13036.
## What changes were proposed in this pull request?

This patch simply moves things to existing package `o.a.s.sql.catalyst.parser` in an effort to reduce the size of the diff in #11048. This is conceptually the same as a recently merged patch #11482.

## How was this patch tested?

Jenkins.

Author: Andrew Or <andrew@databricks.com>

Closes #11506 from andrewor14/parser-package.
…otals Table

## What changes were proposed in this pull request?

Now that dead executors are shown in the executors table (#10058), the totals table is updated to include separate totals for alive and dead executors as well as the current total, as originally discussed in #10668.

## How was this patch tested?

Manually verified by running the Standalone Web UI in the latest Safari and Firefox ESR

Author: Alex Bozarth <ajbozart@us.ibm.com>

Closes #11381 from ajbozarth/spark13459.
…narBatch instead of InternalRows.

## What changes were proposed in this pull request?


Currently, the parquet reader returns rows one by one which is bad for performance. This patch
updates the reader to directly return ColumnarBatches. This is only enabled with whole stage
codegen, which is the only operator currently that is able to consume ColumnarBatches (instead
of rows). The current implementation is a bit of a hack to get this to work and we should do
more refactoring of these low level interfaces to make this work better.

## How was this patch tested?

```
Results:
TPCDS:                             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)
---------------------------------------------------------------------------------
q55 (before)                             8897 / 9265         12.9          77.2
q55                                      5486 / 5753         21.0          47.6
```

Author: Nong Li <nong@databricks.com>

Closes #11435 from nongli/spark-13255.
… preferentially from lagg…

…ing partitions

I'm pretty sure this is the reason we couldn't easily recover from an unbalanced Kafka partition under heavy load when using backpressure.

`maxMessagesPerPartition` calculates an appropriate limit for the message rate from all partitions, and then divides by the number of partitions to determine how many messages to retrieve per partition. The problem with this approach is that when one partition is behind by millions of records (due to random Kafka issues), but the rate estimator calculates only 100k total messages can be retrieved, each partition (out of say 32) only retrieves max 100k/32=3125 messages.

This PR (still needing a test) determines a per-partition desired message count by using the current lag for each partition to preferentially weight the total message limit among the partitions. In this situation, if each partition gets 1k messages, but 1 partition starts 1M behind, then the total number of messages to retrieve is (32 * 1k + 1M) = 1032000 messages, of which the one partition needs 1001000. So, it gets (1001000 / 1032000) = 97% of the 100k messages, and the other 31 partitions share the remaining 3%.

Assuming all 100k messages are retrieved and processed within the batch window, the rate calculator will increase the number of messages to retrieve in the next batch, until it reaches a new stable point or the backlog is finished being processed.
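
A hedged sketch of the lag-weighted allocation described above (names and numbers are illustrative, not the PR's actual code):

```scala
def perPartitionLimits(lag: Map[Int, Long], totalLimit: Long): Map[Int, Long] = {
  val totalLag = math.max(lag.values.sum, 1L)
  // Each partition's share of the budget is proportional to how far behind it is.
  lag.map { case (partition, l) => partition -> math.round(totalLimit.toDouble * l / totalLag) }
}

// 31 partitions with 1k of new messages plus one partition that is 1M further behind:
// the lagging partition gets roughly 97% of a 100k-message budget, matching the example above.
val lags   = (0 until 31).map(_ -> 1000L).toMap + (31 -> 1001000L)
val limits = perPartitionLimits(lags, 100000L)
```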

We're going to try deploying this internally at Shopify to see if this resolves our issue.

tdas koeninger holdenk

Author: Jason White <jason.white@shopify.com>

Closes #10089 from JasonMWhite/rate_controller_offsets.
…ing Sets

#### What changes were proposed in this pull request?

This PR is for supporting SQL generation for cube, rollup and grouping sets.

For example, a query using rollup:
```SQL
SELECT count(*) as cnt, key % 5, grouping_id() FROM t1 GROUP BY key % 5 WITH ROLLUP
```
Original logical plan:
```
  Aggregate [(key#17L % cast(5 as bigint))#47L,grouping__id#46],
            [(count(1),mode=Complete,isDistinct=false) AS cnt#43L,
             (key#17L % cast(5 as bigint))#47L AS _c1#45L,
             grouping__id#46 AS _c2#44]
  +- Expand [List(key#17L, value#18, (key#17L % cast(5 as bigint))#47L, 0),
             List(key#17L, value#18, null, 1)],
            [key#17L,value#18,(key#17L % cast(5 as bigint))#47L,grouping__id#46]
     +- Project [key#17L,
                 value#18,
                 (key#17L % cast(5 as bigint)) AS (key#17L % cast(5 as bigint))#47L]
        +- Subquery t1
           +- Relation[key#17L,value#18] ParquetRelation
```
Converted SQL:
```SQL
  SELECT count( 1) AS `cnt`,
         (`t1`.`key` % CAST(5 AS BIGINT)),
         grouping_id() AS `_c2`
  FROM `default`.`t1`
  GROUP BY (`t1`.`key` % CAST(5 AS BIGINT))
  GROUPING SETS (((`t1`.`key` % CAST(5 AS BIGINT))), ())
```

#### How was this patch tested?

Added eight test cases in `LogicalPlanToSQLSuite`.

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

Closes #11283 from gatorsmile/groupingSetsToSQL.
… checkpoint dir

## What changes were proposed in this pull request?

Stop the StreamingContext before deleting the checkpoint dir to avoid the race condition where deleting the checkpoint dir and writing a checkpoint happen at the same time.

The flaky test log is here: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/256/testReport/junit/org.apache.spark.streaming/MapWithStateSuite/_It_is_not_a_test_/

## How was this patch tested?

unit tests

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #11531 from zsxwing/SPARK-13693.
…t a project on top of it"

This reverts commit f87ce05.

According to the discussion in #11466, let's revert PR #11466 to be safe.

Author: Cheng Lian <lian@databricks.com>

Closes #11539 from liancheng/revert-pr-11466.
…ionSerializer.loads

## What changes were proposed in this pull request?

Set the function's module name to `__main__` if it's missing in `TransformFunctionSerializer.loads`.

## How was this patch tested?

Manually test in the shell.

Before this patch:
```
>>> from pyspark.streaming import StreamingContext
>>> from pyspark.streaming.util import TransformFunction
>>> ssc = StreamingContext(sc, 1)
>>> func = TransformFunction(sc, lambda x: x, sc.serializer)
>>> func.rdd_wrapper(lambda x: x)
TransformFunction(<function <lambda> at 0x106ac8b18>)
>>> bytes = bytearray(ssc._transformerSerializer.serializer.dumps((func.func, func.rdd_wrap_func, func.deserializers)))
>>> func2 = ssc._transformerSerializer.loads(bytes)
>>> print(func2.func.__module__)
None
>>> print(func2.rdd_wrap_func.__module__)
None
>>>
```
After this patch:
```
>>> from pyspark.streaming import StreamingContext
>>> from pyspark.streaming.util import TransformFunction
>>> ssc = StreamingContext(sc, 1)
>>> func = TransformFunction(sc, lambda x: x, sc.serializer)
>>> func.rdd_wrapper(lambda x: x)
TransformFunction(<function <lambda> at 0x108bf1b90>)
>>> bytes = bytearray(ssc._transformerSerializer.serializer.dumps((func.func, func.rdd_wrap_func, func.deserializers)))
>>> func2 = ssc._transformerSerializer.loads(bytes)
>>> print(func2.func.__module__)
__main__
>>> print(func2.rdd_wrap_func.__module__)
__main__
>>>
```

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #11535 from zsxwing/loads-module.
zhengruifeng and others added 27 commits March 17, 2016 11:09
## What changes were proposed in this pull request?

Add the java example of StreamingTest

## How was this patch tested?

manual tests in CLI: bin/run-example mllib.JavaStreamingTestExample dataDir 5 100

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #11776 from zhengruifeng/streaming_je.
## What changes were proposed in this pull request?

It would be handy to add a binary toggle Param to CountVectorizer, as in the scikit-learn one: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
If set, then all non-zero counts will be set to 1.
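
A hedged sketch of how the toggle might look from the ML pipeline API, assuming it is exposed as a `binary` Param with the usual setter:

```scala
import org.apache.spark.ml.feature.CountVectorizer

val cv = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .setBinary(true)  // all non-zero term counts are capped at 1, as with scikit-learn's binary=True
```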

## How was this patch tested?

unit tests

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #11536 from hhbyyh/cvToggle.
… next locality level

JIRA Issue: https://issues.apache.org/jira/browse/SPARK-13901
In the getAllowedLocalityLevel method of TaskSetManager, we get wrong logDebug information when jumping to the next locality level, so we should fix it.

Author: trueyao <501663994@qq.com>

Closes #11719 from trueyao/logDebug-localityWait.
…ernal.Logging

## What changes were proposed in this pull request?

Logging was made private in Spark 2.0. If we move it, then users would be able to create a Logging trait themselves to avoid changing their own code.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #11764 from cloud-fan/logger.
## What changes were proposed in this pull request?

This PR adds SQL generation support for the `Generate` operator. It always converts the `Generate` operator into `LATERAL VIEW` format, as there are many limitations on putting a UDTF in the project list.

This PR is based on #11658, please see the last commit to review the real changes.

Thanks dilipbiswal for his initial work! Takes over #11596

## How was this patch tested?

new tests in `LogicalPlanToSQLSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #11696 from cloud-fan/generate.
…for Jetty

## What changes were proposed in this pull request?

As each acceptor/selector in Jetty will use one thread, the number of threads should be at least the number of acceptors and selectors plus 1. Otherwise, the thread pool of the Jetty server may be exhausted by acceptors/selectors and not be able to respond to any request.

To avoid wasting threads, the PR limits the max number of acceptors and selectors and also updates the max thread number if necessary.
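
A minimal sketch of the sizing rule described above (the numbers are illustrative):

```scala
// Each acceptor and selector permanently occupies a thread, so the pool needs
// at least acceptors + selectors + 1 threads to have one left over for requests.
val acceptors  = 2
val selectors  = 4
val configured = 8                                 // hypothetical configured max thread count
val minThreads = acceptors + selectors + 1
val maxThreads = math.max(configured, minThreads)  // bump the max if it is too small
```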

## How was this patch tested?

Just make sure we don't break any existing tests

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #11615 from zsxwing/SPARK-13776.
## What changes were proposed in this pull request?

Support queries that JOIN tables with a USING clause:
SELECT * FROM table1 JOIN table2 USING <column_list>

The USING clause can be used to simplify the join condition when:

1) Equijoin semantics is desired, and
2) The join columns have the same name in both tables.

We already have support for NATURAL JOIN in Spark. This PR makes
use of the existing natural-join infrastructure to
form the join condition and also the projection list.
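
A hedged sketch of the newly supported syntax (table and column names are illustrative; assumes an existing `sqlContext`):

```scala
// USING (dept_id) is shorthand for an equi-join on dept_id; the shared column appears once.
val joined = sqlContext.sql(
  "SELECT * FROM employees JOIN departments USING (dept_id)")
joined.show()
```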

## How was this patch tested?

Added unit tests in SQLQuerySuite, CatalystQlSuite, and ResolveNaturalJoinSuite.

Author: Dilip Biswal <dbiswal@us.ibm.com>

Closes #11297 from dilipbiswal/spark-13427.
…ted in BoundAttribute

JIRA: https://issues.apache.org/jira/browse/SPARK-13838
## What changes were proposed in this pull request?

We should also clear the variable code in `BoundReference.genCode` to prevent it from being evaluated twice, as we did in `evaluateVariables`.

## How was this patch tested?

Existing tests.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #11674 from viirya/avoid-reevaluate.
## What changes were proposed in this pull request?

Compilation against Scala 2.10 fails with:
```
[error] [warn] /home/jenkins/workspace/spark-master-compile-sbt-scala-2.10/sql/hive/src/main/scala/org/apache/spark/sql/hive/SQLBuilder.scala:483: Cannot check match for         unreachability.
[error] (The analysis required more space than allowed. Please try with scalac -Dscalac.patmat.analysisBudget=512 or -Dscalac.patmat.analysisBudget=off.)
[error] [warn]     private def addSubqueryIfNeeded(plan: LogicalPlan): LogicalPlan = plan match {
```

## How was this patch tested?

Compilation against Scala 2.10

Author: tedyu <yuzhihong@gmail.com>

Closes #11787 from yy2016/master.
…o member variable

## What changes were proposed in this pull request?
In PySpark's wrapper.py, change JavaWrapper._java_obj from an unused static variable to a member variable, consistent with usage in derived classes.

## How was this patch tested?
Ran python tests for ML and MLlib.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #11767 from BryanCutler/JavaWrapper-static-_java_obj-SPARK-13937.
## What changes were proposed in this pull request?

Cleanups from [#11620]: remove remaining uses of validateParams, and put functionality into transformSchema

## How was this patch tested?

Existing unit tests, modified to check using transformSchema instead of validateParams

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #11790 from jkbradley/SPARK-13761-cleanup.
Decision trees in spark.ml (RandomForest.scala) communicate twice as much data as needed for unordered categorical features. Here's an example.

Say there are 3 categories A, B, C. We consider 3 splits:

* A vs. B, C
* A, B vs. C
* A, C vs. B

Currently, we collect statistics for each of the 6 subsets of categories (3 * 2 = 6). However, we could instead collect statistics for the 3 subsets on the left-hand side of the 3 possible splits: A and A,B and A,C. If we also have stats for the entire node, then we can compute the stats for the 3 subsets on the right-hand side of the splits. In pseudomath: stats(B,C) = stats(A,B,C) - stats(A).

This patch adds a parent stats array to the `DTStatsAggregator` so that the right child stats do not need to be stored. The right child stats are computed by subtracting left child stats from the parent stats for unordered categorical features.
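
A hedged sketch of the subtraction trick (the arrays are illustrative stat vectors, not the real `DTStatsAggregator` layout):

```scala
// Right-child stats are recovered from the parent instead of being aggregated and shipped.
def rightChildStats(parent: Array[Double], left: Array[Double]): Array[Double] =
  parent.zip(left).map { case (p, l) => p - l }

val statsABC = Array(30.0, 12.0)                  // stats(A,B,C) for the whole node
val statsA   = Array(10.0, 5.0)                   // stats(A): left side of the split A vs. B,C
val statsBC  = rightChildStats(statsABC, statsA)  // stats(B,C) = stats(A,B,C) - stats(A)
```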

Author: sethah <seth.hendrickson16@gmail.com>

Closes #9474 from sethah/SPARK-10788.
…while generate SQL

## What changes were proposed in this pull request?

We only need to make sub-query names unique every time we generate a SQL string, but not all the time. This PR moves the `newSubqueryName` method to `class SQLBuilder` and removes `object SQLBuilder`.

also addressed 2 minor comments in #11696

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #11783 from cloud-fan/tmp.
…ate SQL

## What changes were proposed in this pull request?

We haven't figured out the correct logic to add sub-queries yet, so we should not clear all sub-queries before generating SQL. This PR changes the logic to only remove sub-queries directly above a table relation.

An example of this bug; original SQL: `SELECT a FROM (SELECT a FROM tbl) t WHERE a = 1`
Before this PR, we would generate:
```
SELECT attr_1 AS a FROM
  SELECT attr_1 FROM (
    SELECT a AS attr_1 FROM tbl
  ) AS sub_q0
  WHERE attr_1 = 1
```
We missed a sub-query and this SQL string is illegal.

After this PR, we will generate:
```
SELECT attr_1 AS a FROM (
  SELECT attr_1 FROM (
    SELECT a AS attr_1 FROM tbl
  ) AS sub_q0
  WHERE attr_1 = 1
) AS t
```

TODO: in the long term, we should find a way to add sub-queries correctly, so that arbitrary logical plans can be converted to SQL strings.

## How was this patch tested?

`LogicalPlanToSQLSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #11786 from cloud-fan/bug-fix.
This patch modifies the BlockManager, MemoryStore, and several other storage components so that serialized cached blocks are stored as multiple small chunks rather than as a single contiguous ByteBuffer.

This change will help to improve the efficiency of memory allocation and the accuracy of memory accounting when serializing blocks. Our current serialization code uses a ByteBufferOutputStream, which doubles and re-allocates its backing byte array; this increases the peak memory requirements during serialization (since we need to hold extra memory while expanding the array). In addition, we currently don't account for the extra wasted space at the end of the ByteBuffer's backing array, so a 129 megabyte serialized block may actually consume 256 megabytes of memory. After switching to storing blocks in multiple chunks, we'll be able to efficiently trim the backing buffers so that no space is wasted.

This change is also a prerequisite to being able to cache blocks which are larger than 2GB (although full support for that depends on several other changes which have not been implemented yet).
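
A hedged sketch of the chunking idea (the chunk size and class are illustrative, not Spark's actual implementation):

```scala
import scala.collection.mutable.ArrayBuffer

// Bytes accumulate in fixed-size chunks, so no oversized doubling buffer is ever allocated
// and unused tail space is bounded by a single chunk.
class ChunkedBytes(chunkSize: Int = 4 * 1024 * 1024) {
  private val chunks = ArrayBuffer.empty[Array[Byte]]
  private var pos = 0  // write position inside the last chunk

  def write(bytes: Array[Byte]): Unit = {
    var offset = 0
    while (offset < bytes.length) {
      if (chunks.isEmpty || pos == chunkSize) { chunks += new Array[Byte](chunkSize); pos = 0 }
      val n = math.min(chunkSize - pos, bytes.length - offset)
      System.arraycopy(bytes, offset, chunks.last, pos, n)
      pos += n
      offset += n
    }
  }

  def size: Long = if (chunks.isEmpty) 0L else (chunks.size - 1).toLong * chunkSize + pos
}
```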

Author: Josh Rosen <joshrosen@databricks.com>

Closes #11748 from JoshRosen/chunked-block-serialization.
PR #11696 introduced a complex pattern match that broke the Scala 2.10 match unreachability check and caused a build failure. This PR fixes the issue by expanding that pattern match into several simpler ones.

Note that tuning or turning off `-Dscalac.patmat.analysisBudget` doesn't work for this case.

Compilation against Scala 2.10

Author: tedyu <yuzhihong@gmail.com>

Closes #11798 from yy2016/master.
## What changes were proposed in this pull request?

This PR revises Dataset API ScalaDoc.  All public methods are divided into the following groups

* `groupname basic`: Basic Dataset functions
* `groupname action`: Actions
* `groupname untypedrel`: Untyped Language Integrated Relational Queries
* `groupname typedrel`: Typed Language Integrated Relational Queries
* `groupname func`: Functional Transformations
* `groupname rdd`: RDD Operations
* `groupname output`: Output Operations

`since` tag and sample code are also updated.  We may want to add more sample code for typed APIs.
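
As a hedged sketch, the grouping relies on Scaladoc's `@group` tag; the trait and method below are placeholders, only the tag placement matters:

```scala
trait ExampleDatasetDocs {
  /** Returns the number of rows in the Dataset.
    * @group action
    */
  def count(): Long
}
```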

## How was this patch tested?

Documentation change.  Checked by building unidoc locally.

Author: Cheng Lian <lian@databricks.com>

Closes #11769 from liancheng/spark-13826-ds-api-doc.
## What changes were proposed in this pull request?

JIRA: https://issues.apache.org/jira/browse/SPARK-13930

Recently, fast serialization was introduced for collecting DataFrames/Datasets (#11664). The same technique can be used for the collect limit operator too.

## How was this patch tested?

Add a benchmark for collect limit to `BenchmarkWholeStageCodegen`.

Without this patch:

    model name      : Westmere E56xx/L56xx/X56xx (Nehalem-C)
    collect limit:                      Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -------------------------------------------------------------------------------------------
    collect limit 1 million                  3413 / 3768          0.3        3255.0       1.0X
    collect limit 2 millions                9728 / 10440          0.1        9277.3       0.4X

With this patch:

    model name      : Westmere E56xx/L56xx/X56xx (Nehalem-C)
    collect limit:                      Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -------------------------------------------------------------------------------------------
    collect limit 1 million                   833 / 1284          1.3         794.4       1.0X
    collect limit 2 millions                 3348 / 4005          0.3        3193.3       0.2X

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #11759 from viirya/execute-take.
## What changes were proposed in this pull request?
This patch updates documentation for Datasets. I also updated some internal documentation for exchange/broadcast.

## How was this patch tested?
Just documentation/api stability update.

Author: Reynold Xin <rxin@databricks.com>

Closes #11814 from rxin/dataset-docs.
…ption.

## What changes were proposed in this pull request?
When trainingSummary is None, it should throw ```RuntimeException```.
cc mengxr
## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #11784 from yanboliang/fix-summary.
## What changes were proposed in this pull request?

The fix is simple: use the existing `CombineUnions` rule to combine adjacent Unions before building the SQL string.

## How was this patch tested?

The re-enabled test

Author: Wenchen Fan <wenchen@databricks.com>

Closes #11818 from cloud-fan/bug-fix.
## What changes were proposed in this pull request?

Fix some nits discussed in #11776 (comment):
- use `!rdd.isEmpty` instead of `rdd.count > 0`
- use static instead of AtomicInteger
- remove unneeded "throws Exception"

## How was this patch tested?

manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #11821 from zhengruifeng/je_fix.
## What changes were proposed in this pull request?

Now we should be able to convert all logical plans to SQL strings, if they are parsed from a Hive query. This PR changes the error handling to throw exceptions instead of just logging.

We will send new PRs for spotted bugs, and merge this one after all bugs are fixed.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #11782 from cloud-fan/test.
…e the first qualifier to generate SQL strings

## What changes were proposed in this pull request?

Current implementations of `AttributeReference.sql` and `Alias.sql` join all available qualifiers, which is logically wrong. This implementation mistake doesn't cause any real SQL generation bugs, though, since there is always at most one qualifier for any given `AttributeReference` or `Alias`.

This PR fixes this issue by picking only the first qualifier.
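
A hedged sketch of the difference (the case class is hypothetical, not the real `AttributeReference` internals):

```scala
case class Attr(name: String, qualifiers: Seq[String]) {
  // Old behaviour: join every qualifier, which is logically wrong.
  def oldSql: String = (qualifiers :+ name).map(p => s"`$p`").mkString(".")
  // New behaviour: pick only the first qualifier, if any.
  def newSql: String = (qualifiers.headOption.toSeq :+ name).map(p => s"`$p`").mkString(".")
}

Attr("key", Seq("t1")).newSql  // "`t1`.`key`"
```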

## How was this patch tested?

Existing tests should be enough.

Author: Cheng Lian <lian@databricks.com>

Closes #11820 from liancheng/spark-14004-single-qualifier.
## What changes were proposed in this pull request?

ShuffledHashJoin (including its outer-join variant) was removed in 1.6 in favor of SortMergeJoin, which is more robust and also fast.

ShuffledHashJoin is still useful when: 1) one table is much smaller than the other, so the cost of building a hash table on the smaller table is lower than sorting the larger one, and 2) any partition of the small table fits in memory.

This PR brings back ShuffledHashJoin, basically reverting #9645 and fixing the conflicts. It also merges outer join and left-semi join into the same class. This PR does not implement full outer join, because it cannot be implemented efficiently (it would require building hash tables on both sides).

A simple benchmark (one table is 5x smaller than the other) shows that ShuffledHashJoin can be 2X faster than SortMergeJoin.

## How was this patch tested?

Added new unit tests for ShuffledHashJoin.

Author: Davies Liu <davies@databricks.com>

Closes #11788 from davies/shuffle_join.
rekhajoshm added a commit that referenced this pull request Mar 18, 2016
pull latest from apache spark
rekhajoshm merged commit c73c32a into rekhajoshm:master on Mar 18, 2016
rekhajoshm pushed a commit that referenced this pull request Mar 18, 2016
…ansform Clause

#### What changes were proposed in this pull request?
```SQL
FROM
(FROM test SELECT TRANSFORM(key, value) USING 'cat' AS (`thing1` int, thing2 string)) t
SELECT thing1 + 1
```
This query returns an analysis error, like:
```
Failed to analyze query: org.apache.spark.sql.AnalysisException: cannot resolve '`thing1`' given input columns: [`thing1`, thing2]; line 3 pos 7
'Project [unresolvedalias(('thing1 + 1), None)]
+- SubqueryAlias t
   +- ScriptTransformation [key#2,value#3], cat, [`thing1`#6,thing2#7], HiveScriptIOSchema(List(),List(),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),List((field.delim,	)),List((field.delim,	)),Some(org.apache.hadoop.hive.ql.exec.TextRecordReader),Some(org.apache.hadoop.hive.ql.exec.TextRecordWriter),false)
      +- SubqueryAlias test
         +- Project [_1#0 AS key#2,_2#1 AS value#3]
            +- LocalRelation [_1#0,_2#1], [[1,1],[2,2],[3,3],[4,4],[5,5]]
```

The backticks around \`thing1\` should be cleaned before entering the Parser/Analyzer. This PR fixes this issue.

#### How was this patch tested?

Added a test case and modified an existing test case

Author: gatorsmile <gatorsmile@gmail.com>

Closes apache#11415 from gatorsmile/scriptTransform.