
latest apache spark #14 (Merged)

88 commits, merged on Jun 21, 2018

Commits on Jun 1, 2018

  1. [MINOR][YARN] Add YARN-specific credential providers in debug logging message
    
     This PR adds a debug log for YARN-specific credential providers, which are loaded via the service loader mechanism.

     It took me a while to figure out whether they were actually being loaded. I had to explicitly set the deprecated configuration and check whether it was actually picked up.

     The change was manually tested. Logs look like:
    
    ```
    Using the following builtin delegation token providers: hadoopfs, hive, hbase.
    Using the following YARN-specific credential providers: yarn-test.
    ```
    
    Author: hyukjinkwon <gurwls223@apache.org>
    
    Closes #21466 from HyukjinKwon/minor-log.
    
    Change-Id: I18e2fb8eeb3289b148f24c47bb3130a560a881cf
    HyukjinKwon authored and jerryshao committed Jun 1, 2018
     Commit: 2c9c862
  2. [SPARK-24330][SQL] Refactor ExecuteWriteTask and Use while in writing files
    
    ## What changes were proposed in this pull request?
    1. Refactor ExecuteWriteTask in FileFormatWriter to reduce common logic and improve readability.
     After the change, callers only need to call `commit()` or `abort()` at the end of the task.
    Also there is less code in `SingleDirectoryWriteTask` and `DynamicPartitionWriteTask`.
    Definitions of related classes are moved to a new file, and `ExecuteWriteTask` is renamed to `FileFormatDataWriter`.
    
    2. As per code style guide: https://github.com/databricks/scala-style-guide#traversal-and-zipwithindex , we avoid using `for` for looping in [FileFormatWriter](https://github.com/apache/spark/pull/21381/files#diff-3b69eb0963b68c65cfe8075f8a42e850L536) , or `foreach` in [WriteToDataSourceV2Exec](https://github.com/apache/spark/pull/21381/files#diff-6fbe10db766049a395bae2e785e9d56eL119).
    In such critical code path, using `while` is good for performance.
    
    ## How was this patch tested?
    Existing unit test.
    I tried the microbenchmark in #21409
    
    | Workload | Before changes(Best/Avg Time(ms)) | After changes(Best/Avg Time(ms)) |
     | --- | --- | --- |
    |Output Single Int Column|    2018 / 2043   |    2096 / 2236 |
    |Output Single Double Column| 1978 / 2043 | 2013 / 2018 |
    |Output Int and String Column| 6332 / 6706 | 6162 / 6298 |
    |Output Partitions| 4458 / 5094 |  3792 / 4008  |
    |Output Buckets|           5695 / 6102 |    5120 / 5154 |
    
    Also a microbenchmark on my laptop for general comparison among while/foreach/for :
    ```
    class Writer {
      var sum = 0L
      def write(l: Long): Unit = sum += l
    }
    
    def testWhile(iterator: Iterator[Long]): Long = {
      val w = new Writer
      while (iterator.hasNext) {
        w.write(iterator.next())
      }
      w.sum
    }
    
    def testForeach(iterator: Iterator[Long]): Long = {
      val w = new Writer
      iterator.foreach(w.write)
      w.sum
    }
    
    def testFor(iterator: Iterator[Long]): Long = {
      val w = new Writer
      for (x <- iterator) {
        w.write(x)
      }
      w.sum
    }
    
    val data = 0L to 100000000L
    val start = System.nanoTime
    (0 to 10).foreach(_ => testWhile(data.iterator))
    println("benchmark while: " + (System.nanoTime - start)/1000000)
    
    val start2 = System.nanoTime
    (0 to 10).foreach(_ => testForeach(data.iterator))
    println("benchmark foreach: " + (System.nanoTime - start2)/1000000)
    
    val start3 = System.nanoTime
     (0 to 10).foreach(_ => testFor(data.iterator))
    println("benchmark for: " + (System.nanoTime - start3)/1000000)
    ```
    Benchmark result:
    `while`: 15401 ms
    `foreach`: 43034 ms
    `for`: 41279 ms
    
    Author: Gengliang Wang <gengliang.wang@databricks.com>
    
    Closes #21381 from gengliangwang/refactorExecuteWriteTask.
    gengliangwang authored and cloud-fan committed Jun 1, 2018
     Commit: cbaa729
  3. [SPARK-24444][DOCS][PYTHON] Improve Pandas UDF docs to explain column assignment
    
    ## What changes were proposed in this pull request?
    
    Added sections to pandas_udf docs, in the grouped map section, to indicate columns are assigned by position.
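
     For illustration (not part of the original PR), a minimal PySpark sketch of the documented behavior: with a grouped map pandas UDF, the returned pandas DataFrame's columns are matched to the declared schema by position, not by name.

     ```python
     from pyspark.sql.functions import pandas_udf, PandasUDFType

     # Output columns are assigned by position: the first returned column maps
     # to "id" and the second to "v", regardless of the pandas column names.
     @pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
     def normalize(pdf):
         v = pdf.v
         return pdf.assign(v=(v - v.mean()) / v.std())

     df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ("id", "v"))
     df.groupby("id").apply(normalize).show()
     ```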
    
    ## How was this patch tested?
    
    NA
    
    Author: Bryan Cutler <cutlerb@gmail.com>
    
    Closes #21471 from BryanCutler/arrow-doc-pandas_udf-column_by_pos-SPARK-21427.
    BryanCutler authored and HyukjinKwon committed Jun 1, 2018
     Commit: b2d0226
  4. [SPARK-24326][MESOS] add support for local:// scheme for the app jar

    ## What changes were proposed in this pull request?
    
     * Adds support for the local:// scheme, as in the k8s case, for image-based deployments where the jar is already in the image. Affects cluster mode and the Mesos dispatcher. Also covers the file:// scheme. Keeps the default case where jar resolution happens on the host.
    
    ## How was this patch tested?
    
    Dispatcher image with the patch, use it to start DC/OS Spark service:
    skonto/spark-local-disp:test
    
    Test image with my application jar located at the root folder:
    skonto/spark-local:test
    
     Dockerfile for that image:

     ```
     FROM mesosphere/spark:2.3.0-2.2.1-2-hadoop-2.6
     COPY spark-examples_2.11-2.2.1.jar /
     WORKDIR /opt/spark/dist
     ```
    
    Tests:
    
    The following work as expected:
    
    * local normal example
    ```
    dcos spark run --submit-args="--conf spark.mesos.appJar.local.resolution.mode=container --conf spark.executor.memory=1g --conf spark.mesos.executor.docker.image=skonto/spark-local:test
     --conf spark.executor.cores=2 --conf spark.cores.max=8
     --class org.apache.spark.examples.SparkPi local:///spark-examples_2.11-2.2.1.jar"
    ```
    
    * make sure the flag does not affect other uris
    ```
    dcos spark run --submit-args="--conf spark.mesos.appJar.local.resolution.mode=container --conf spark.executor.memory=1g  --conf spark.executor.cores=2 --conf spark.cores.max=8
     --class org.apache.spark.examples.SparkPi https://s3-eu-west-1.amazonaws.com/fdp-stavros-test/spark-examples_2.11-2.1.1.jar"
    ```
    
    * normal example no local
    ```
    dcos spark run --submit-args="--conf spark.executor.memory=1g  --conf spark.executor.cores=2 --conf spark.cores.max=8
     --class org.apache.spark.examples.SparkPi https://s3-eu-west-1.amazonaws.com/fdp-stavros-test/spark-examples_2.11-2.1.1.jar"
    
    ```
    
     The following fails:

     * uses local:// with no setting (the default is host-based resolution)
    ```
    dcos spark run --submit-args="--conf spark.executor.memory=1g --conf spark.mesos.executor.docker.image=skonto/spark-local:test
      --conf spark.executor.cores=2 --conf spark.cores.max=8
      --class org.apache.spark.examples.SparkPi local:///spark-examples_2.11-2.2.1.jar"
    ```
    ![image](https://user-images.githubusercontent.com/7945591/40283021-8d349762-5c80-11e8-9d62-2a61a4318fd5.png)
    
    Author: Stavros Kontopoulos <stavros.kontopoulos@lightbend.com>
    
    Closes #21378 from skonto/local-upstream.
    Stavros Kontopoulos authored and Felix Cheung committed Jun 1, 2018
     Commit: 22df953
  5. [SPARK-23920][SQL] add array_remove to remove all elements that equal element from array
    
    ## What changes were proposed in this pull request?
    
     Add `array_remove` to remove all elements that equal the given element from an array.
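
     For illustration (not from the original description), a minimal example of the new function via `spark.sql`:

     ```python
     # array_remove(arr, elem) drops every occurrence of elem from arr.
     spark.sql("SELECT array_remove(array(1, 2, 3, 2), 2) AS result").show()
     # result: [1, 3]
     ```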
    
    ## How was this patch tested?
    
    add unit tests
    
    Author: Huaxin Gao <huaxing@us.ibm.com>
    
    Closes #21069 from huaxingao/spark-23920.
    huaxingao authored and ueshin committed Jun 1, 2018
     Commit: 98909c3
  6. [SPARK-24351][SS] offsetLog/commitLog purge thresholdBatchId should be computed with current committed epoch but not currentBatchId in CP mode
    
    ## What changes were proposed in this pull request?
     Compute the thresholdBatchId used to purge metadata based on the current committed epoch instead of currentBatchId in CP (continuous processing) mode, to avoid cleaning up all the committed metadata in some cases, as described in the JIRA [SPARK-24351](https://issues.apache.org/jira/browse/SPARK-24351).
    
    ## How was this patch tested?
    Add new unit test.
    
    Author: Huang Tengfei <tengfei.h@gmail.com>
    
    Closes #21400 from ivoson/branch-cp-meta.
    ivoson authored and zsxwing committed Jun 1, 2018
     Commit: 6039b13
  7. Revert "[SPARK-24369][SQL] Correct handling for multiple distinct aggregations having the same argument set"
    
    This reverts commit 1e46f92.
    gatorsmile committed Jun 1, 2018
     Commit: d2c3de7
  8. [INFRA] Close stale PRs.

    Closes #21444
    Marcelo Vanzin committed Jun 1, 2018
     Commit: 09e78c1
  9. [SPARK-24340][CORE] Clean up non-shuffle disk block manager files following executor exits on a Standalone cluster
    
    ## What changes were proposed in this pull request?
    
     Currently we only clean up the local directories when the application is removed. However, when executors die and restart repeatedly, many temp files are left untouched in the local directories, which is undesirable and can gradually use up disk space.

     We can detect executor death in the Worker and clean up the non-shuffle files (files not ending with ".index" or ".data") in the local directories. We should not touch the shuffle files, since they are expected to be used by the external shuffle service.

     The scope of this PR is limited to implementing the cleanup logic on a Standalone cluster; we defer to experts familiar with other cluster managers (YARN/Mesos/K8s) to determine whether it's worth adding similar support.
    
    ## How was this patch tested?
    
    Add new test suite to cover.
    
    Author: Xingbo Jiang <xingbo.jiang@databricks.com>
    
    Closes #21390 from jiangxb1987/cleanupNonshuffleFiles.
    jiangxb1987 authored and gatorsmile committed Jun 1, 2018
     Commit: 8ef167a

Commits on Jun 2, 2018

  1. [SPARK-23668][K8S] Added missing config property in running-on-kubernetes.md
    
    ## What changes were proposed in this pull request?
    PR #20811 introduced a new Spark configuration property `spark.kubernetes.container.image.pullSecrets` for specifying image pull secrets. However, the documentation wasn't updated accordingly. This PR adds the property introduced into running-on-kubernetes.md.
    
    ## How was this patch tested?
    N/A.
    
    foxish mccheah please help merge this. Thanks!
    
    Author: Yinan Li <ynli@google.com>
    
    Closes #21480 from liyinan926/master.
    liyinan926 authored and Felix Cheung committed Jun 2, 2018
     Commit: a36c1a6

Commits on Jun 3, 2018

  1. [SPARK-24356][CORE] Duplicate strings in File.path managed by FileSegmentManagedBuffer
    
    This patch eliminates duplicate strings that come from the 'path' field of
    java.io.File objects created by FileSegmentManagedBuffer. That is, we want
    to avoid the situation when multiple File instances for the same pathname
    "foo/bar" are created, each with a separate copy of the "foo/bar" String
    instance. In some scenarios such duplicate strings may waste a lot of memory
    (~ 10% of the heap). To avoid that, we intern the pathname with
    String.intern(), and before that we make sure that it's in a normalized
    form (contains no "//", "///" etc.) Otherwise, the code in java.io.File
    would normalize it later, creating a new "foo/bar" String copy.
    Unfortunately, the normalization code that java.io.File uses internally
    is in the package-private class java.io.FileSystem, so we cannot call it
    here directly.
    
    ## What changes were proposed in this pull request?
    
    Added code to ExternalShuffleBlockResolver.getFile(), that normalizes and then interns the pathname string before passing it to the File() constructor.
    
    ## How was this patch tested?
    
    Added unit test
    
    Author: Misha Dmitriev <misha@cloudera.com>
    
    Closes #21456 from countmdm/misha/spark-24356.
    misha-cloudera authored and srowen committed Jun 3, 2018
     Commit: de4feae

Commits on Jun 4, 2018

  1. [SPARK-24455][CORE] fix typo in TaskSchedulerImpl comment

     Change runTasks to submitTasks in the comment in TaskSchedulerImpl.scala.
    
    Author: xueyu <xueyu@yidian-inc.com>
    Author: Xue Yu <278006819@qq.com>
    
    Closes #21485 from xueyumusic/fixtypo1.
    xueyu authored and HyukjinKwon committed Jun 4, 2018
     Commit: a2166ec
  2. [SPARK-24369][SQL] Correct handling for multiple distinct aggregations having the same argument set
    
    ## What changes were proposed in this pull request?
    
    bring back #21443
    
    This is a different approach: just change the check to count distinct columns with `toSet`
    
    ## How was this patch tested?
    
    a new test to verify the planner behavior.
    
    Author: Wenchen Fan <wenchen@databricks.com>
    Author: Takeshi Yamamuro <yamamuro@apache.org>
    
    Closes #21487 from cloud-fan/back.
    cloud-fan authored and gatorsmile committed Jun 4, 2018
     Commit: 416cd1f
  3. [SPARK-23786][SQL] Checking column names of csv headers

    ## What changes were proposed in this pull request?
    
     Currently, column names in CSV file headers are not checked against the provided schema of the CSV data. This can cause errors such as those shown in [SPARK-23786](https://issues.apache.org/jira/browse/SPARK-23786) and #20894 (comment). I introduced a new CSV option, `enforceSchema`. If it is enabled (the default, `true`), Spark forcibly applies the provided or inferred schema to the CSV files; in that case, CSV headers are ignored and not checked against the schema. If `enforceSchema` is set to `false`, additional checks can be performed. For example, if a column in the CSV header and in the schema have different ordering, the following exception is thrown:
    
    ```
    java.lang.IllegalArgumentException: CSV file header does not contain the expected fields
     Header: depth, temperature
     Schema: temperature, depth
    CSV file: marina.csv
    ```
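
     A hedged sketch (file name and schema are illustrative) of how the new option is meant to be used so that Spark validates the CSV header against the schema:

     ```python
     from pyspark.sql.types import StructType, StructField, DoubleType

     schema = StructType([
         StructField("temperature", DoubleType()),
         StructField("depth", DoubleType()),
     ])

     # With enforceSchema=false, the CSV header is checked against the schema,
     # and a column-order mismatch raises IllegalArgumentException as above.
     df = (spark.read
           .option("header", True)
           .option("enforceSchema", False)
           .schema(schema)
           .csv("marina.csv"))
     ```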
    
    ## How was this patch tested?
    
    The changes were tested by existing tests of CSVSuite and by 2 new tests.
    
    Author: Maxim Gekk <maxim.gekk@databricks.com>
    Author: Maxim Gekk <max.gekk@gmail.com>
    
    Closes #20894 from MaxGekk/check-column-names.
    MaxGekk authored and gatorsmile committed Jun 4, 2018
     Commit: 1d9338b
  4. [SPARK-23903][SQL] Add support for date extract

    ## What changes were proposed in this pull request?
    
    Add support for date `extract` function:
    ```sql
    spark-sql> SELECT EXTRACT(YEAR FROM TIMESTAMP '2000-12-16 12:21:13');
    2000
    ```
     Supported fields are the same as [Hive](https://github.com/apache/hive/blob/rel/release-2.3.3/ql/src/java/org/apache/hadoop/hive/ql/parse/IdentifiersParser.g#L308-L316): `YEAR`, `QUARTER`, `MONTH`, `WEEK`, `DAY`, `DAYOFWEEK`, `HOUR`, `MINUTE`, `SECOND`.
    
    ## How was this patch tested?
    
    unit tests
    
    Author: Yuming Wang <yumwang@ebay.com>
    
    Closes #21479 from wangyum/SPARK-23903.
    wangyum authored and ueshin committed Jun 4, 2018
     Commit: 0be5aa2
  5. [SPARK-21896][SQL] Fix StackOverflow caused by window functions inside aggregate functions
    
    ## What changes were proposed in this pull request?
    
    This PR explicitly prohibits window functions inside aggregates. Currently, this will cause StackOverflow during analysis. See PR #19193 for previous discussion.
    
    ## How was this patch tested?
    
    This PR comes with a dedicated unit test.
    
    Author: aokolnychyi <anton.okolnychyi@sap.com>
    
    Closes #21473 from aokolnychyi/fix-stackoverflow-window-funcs.
    aokolnychyi authored and cloud-fan committed Jun 4, 2018
     Commit: 7297ae0
  6. [SPARK-24290][ML] add support for Array input for instrumentation.logNamedValue
    
    ## What changes were proposed in this pull request?
    
     Extend instrumentation.logNamedValue to support Array input, and change the logging of "clusterSizes" to use the new method.
    
    ## How was this patch tested?
    
    N/A
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: Lu WANG <lu.wang@databricks.com>
    
    Closes #21347 from ludatabricks/SPARK-24290.
    lu-wang-dl authored and mengxr committed Jun 4, 2018
     Commit: b24d3db
  7. [SPARK-24300][ML] change the way to set seed in ml.cluster.LDASuite.generateLDAData
    
    ## What changes were proposed in this pull request?
    
     Use a different RNG in each partition.
    
    ## How was this patch tested?
    
    manually
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: Lu WANG <lu.wang@databricks.com>
    
    Closes #21492 from ludatabricks/SPARK-24300.
    lu-wang-dl authored and mengxr committed Jun 4, 2018
     Commit: ff0501b

Commits on Jun 5, 2018

  1. [SPARK-24215][PYSPARK] Implement _repr_html_ for dataframes in PySpark

    ## What changes were proposed in this pull request?
    
    Implement `_repr_html_` for PySpark while in notebook and add config named "spark.sql.repl.eagerEval.enabled" to control this.
    
    The dev list thread for context: http://apache-spark-developers-list.1001551.n3.nabble.com/eager-execution-and-debuggability-td23928.html
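
     A short, hedged illustration (assuming a running SparkSession in a notebook) of the new config:

     ```python
     # With eager evaluation enabled, evaluating a DataFrame in a notebook cell
     # renders it through _repr_html_ as an HTML table.
     spark.conf.set("spark.sql.repl.eagerEval.enabled", "true")

     df = spark.range(3)
     df  # in a Jupyter cell this now displays an HTML-rendered table
     ```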
    
    ## How was this patch tested?
    
    New ut in DataFrameSuite and manual test in jupyter. Some screenshot below.
    
    **After:**
    ![image](https://user-images.githubusercontent.com/4833765/40268422-8db5bef0-5b9f-11e8-80f1-04bc654a4f2c.png)
    
    **Before:**
    ![image](https://user-images.githubusercontent.com/4833765/40268431-9f92c1b8-5b9f-11e8-9db9-0611f0940b26.png)
    
    Author: Yuanjian Li <xyliyuanjian@gmail.com>
    
    Closes #21370 from xuanyuanking/SPARK-24215.
    xuanyuanking authored and HyukjinKwon committed Jun 5, 2018
     Commit: dbb4d83
  2. [SPARK-16451][REPL] Fail shell if SparkSession fails to start.

    Currently, in spark-shell, if the session fails to start, the
    user sees a bunch of unrelated errors which are caused by code
    in the shell initialization that references the "spark" variable,
    which does not exist in that case. Things like:
    
    ```
    <console>:14: error: not found: value spark
           import spark.sql
    ```
    
    The user is also left with a non-working shell (unless they want
    to just write non-Spark Scala or Python code, that is).
    
    This change fails the whole shell session at the point where the
    failure occurs, so that the last error message is the one with
    the actual information about the failure.
    
    For the python error handling, I moved the session initialization code
    to session.py, so that traceback.print_exc() only shows the last error.
    Otherwise, the printed exception would contain all previous exceptions
    with a message "During handling of the above exception, another
    exception occurred", making the actual error kinda hard to parse.
    
    Tested with spark-shell, pyspark (with 2.7 and 3.5), by forcing an
    error during SparkContext initialization.
    
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    
    Closes #21368 from vanzin/SPARK-16451.
    Marcelo Vanzin authored and HyukjinKwon committed Jun 5, 2018
     Commit: b3417b7
  3. [SPARK-15784] Add Power Iteration Clustering to spark.ml

    ## What changes were proposed in this pull request?
    
    According to the discussion on JIRA. I rewrite the Power Iteration Clustering API in `spark.ml`.
    
    ## How was this patch tested?
    
    Unit test.
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: WeichenXu <weichen.xu@databricks.com>
    
    Closes #21493 from WeichenXu123/pic_api.
    WeichenXu123 authored and mengxr committed Jun 5, 2018
     Commit: e8c1a0c
  4. [SPARK-24453][SS] Fix error recovering from the failure in a no-data batch
    
    ## What changes were proposed in this pull request?
    
    The error occurs when we are recovering from a failure in a no-data batch (say X) that has been planned (i.e. written to offset log) but not executed (i.e. not written to commit log). Upon recovery the following sequence of events happen.
    
    1. `MicroBatchExecution.populateStartOffsets` sets `currentBatchId` to X. Since there was no data in the batch, the `availableOffsets` is same as `committedOffsets`, so `isNewDataAvailable` is `false`.
    2. When `MicroBatchExecution.constructNextBatch` is called, ideally it should immediately return true because the next batch has already been constructed. However, the check of whether the batch has been constructed was `if (isNewDataAvailable) return true`. Since the planned batch is a no-data batch, it escaped this check and proceeded to plan the same batch X *once again*.
    
    The solution is to have an explicit flag that signifies whether a batch has already been constructed or not. `populateStartOffsets` is going to set the flag appropriately.
    
    ## How was this patch tested?
    
    new unit test
    
    Author: Tathagata Das <tathagata.das1565@gmail.com>
    
    Closes #21491 from tdas/SPARK-24453.
    tdas committed Jun 5, 2018
     Commit: 2c2a86b
  5. [SPARK-22384][SQL] Refine partition pruning when attribute is wrapped in Cast
    
    ## What changes were proposed in this pull request?
    
     The SQL below will fetch all partitions from the metastore, which puts a heavy burden on the metastore:
    ```
    CREATE TABLE `partition_test`(`col` int) PARTITIONED BY (`pt` byte)
    SELECT * FROM partition_test WHERE CAST(pt AS INT)=1
    ```
     The reason is that the analyzed attribute `pt` is wrapped in `Cast`, and `HiveShim` fails to generate a proper partition filter.
     This PR proposes to take `Cast` into consideration when generating the partition filter.
    
    ## How was this patch tested?
    Test added.
    This pr proposes to use analyzed expressions in `HiveClientSuite`
    
    Author: jinxing <jinxing6042@126.com>
    
    Closes #19602 from jinxing64/SPARK-22384.
    jinxing authored and cloud-fan committed Jun 5, 2018
     Commit: 93df3cd

Commits on Jun 6, 2018

  1. [SPARK-24187][R][SQL] Add array_join function to SparkR

    ## What changes were proposed in this pull request?
    
    This PR adds array_join function to SparkR
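
     For context, a hedged example of the underlying SQL function that the new SparkR wrapper exposes (shown here via `spark.sql` rather than R):

     ```python
     # array_join(array, delimiter) concatenates the array elements into a string.
     spark.sql("SELECT array_join(array('a', 'b', 'c'), ', ') AS joined").show()
     # joined: "a, b, c"
     ```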
    
    ## How was this patch tested?
    
    Add unit test in test_sparkSQL.R
    
    Author: Huaxin Gao <huaxing@us.ibm.com>
    
    Closes #21313 from huaxingao/spark-24187.
    huaxingao authored and HyukjinKwon committed Jun 6, 2018
     Commit: e9efb62
  2. [SPARK-23803][SQL] Support bucket pruning

    ## What changes were proposed in this pull request?
    support bucket pruning when filtering on a single bucketed column on the following predicates -
    EqualTo, EqualNullSafe, In, And/Or predicates
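
     A hedged PySpark sketch (the table and column names are made up) of a query shape that can benefit from bucket pruning:

     ```python
     # Write a table bucketed on "id"; an equality filter on the bucket column
     # lets Spark scan only the matching bucket files instead of all of them.
     spark.range(0, 1000).write.bucketBy(8, "id").sortBy("id").saveAsTable("bucketed_ids")

     spark.table("bucketed_ids").where("id = 42").explain()
     ```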
    
    ## How was this patch tested?
    refactored unit tests to test the above.
    
    based on gatorsmile work in e3c75c6
    
    Author: Asher Saban <asaban@palantir.com>
    Author: asaban <asaban@palantir.com>
    
    Closes #20915 from sabanas/filter-prune-buckets.
    Asher Saban authored and cloud-fan committed Jun 6, 2018
     Commit: e76b012

Commits on Jun 8, 2018

  1. [SPARK-24119][SQL] Add interpreted execution to SortPrefix expression

    ## What changes were proposed in this pull request?
    
    Implemented eval in SortPrefix expression.
    
    ## How was this patch tested?
    
    - ran existing sbt SQL tests
    - added unit test
    - ran existing Python SQL tests
    - manual tests: disabling codegen -- patching code to disable beyond what spark.sql.codegen.wholeStage=false can do -- and running sbt SQL tests
    
    Author: Bruce Robbins <bersprockets@gmail.com>
    
    Closes #21231 from bersprockets/sortprefixeval.
    bersprockets authored and hvanhovell committed Jun 8, 2018
     Commit: 1462bba
  2. [SPARK-24224][ML-EXAMPLES] Java example code for Power Iteration Clustering in spark.ml
    
    ## What changes were proposed in this pull request?
    
    Java example code for Power Iteration Clustering  in spark.ml
    
    ## How was this patch tested?
    
    Locally tested
    
    Author: Shahid <shahidki31@gmail.com>
    
    Closes #21283 from shahidki31/JavaPicExample.
    shahidki31 authored and srowen committed Jun 8, 2018
     Commit: 2c10020
  3. [SPARK-24191][ML] Scala Example code for Power Iteration Clustering

    ## What changes were proposed in this pull request?
    
    Added example code for Power Iteration Clustering in Spark ML examples
    
    Author: Shahid <shahidki31@gmail.com>
    
    Closes #21248 from shahidki31/sparkCommit.
    shahidki31 authored and srowen committed Jun 8, 2018
     Commit: a5d775a
  4. [SPARK-24477][SPARK-24454][ML][PYTHON] Imports submodule in ml/__init__.py and add ImageSchema into __all__
    
    ## What changes were proposed in this pull request?
    
    This PR attaches submodules to ml's `__init__.py` module.
    
    Also, adds `ImageSchema` into `image.py` explicitly.
    
    ## How was this patch tested?
    
    Before:
    
    ```python
    >>> from pyspark import ml
    >>> ml.image
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'module' object has no attribute 'image'
    >>> ml.image.ImageSchema
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'module' object has no attribute 'image'
    ```
    
    ```python
    >>> "image" in globals()
    False
    >>> from pyspark.ml import *
    >>> "image" in globals()
    False
    >>> image
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    NameError: name 'image' is not defined
    ```
    
    After:
    
    ```python
    >>> from pyspark import ml
    >>> ml.image
    <module 'pyspark.ml.image' from '/.../spark/python/pyspark/ml/image.pyc'>
    >>> ml.image.ImageSchema
    <pyspark.ml.image._ImageSchema object at 0x10d973b10>
    ```
    
    ```python
    >>> "image" in globals()
    False
    >>> from pyspark.ml import *
    >>> "image" in globals()
    True
    >>> image
     <module 'pyspark.ml.image' from '/.../spark/python/pyspark/ml/image.pyc'>
    ```
    
    Author: hyukjinkwon <gurwls223@apache.org>
    
    Closes #21483 from HyukjinKwon/SPARK-24454.
    HyukjinKwon authored and mengxr committed Jun 8, 2018
     Commit: 173fe45
  5. [SPARK-23984][K8S] Initial Python Bindings for PySpark on K8s

    ## What changes were proposed in this pull request?
    
    Introducing Python Bindings for PySpark.
    
    - [x] Running PySpark Jobs
    - [x] Increased Default Memory Overhead value
    - [ ] Dependency Management for virtualenv/conda
    
    ## How was this patch tested?
    
    This patch was tested with
    
    - [x] Unit Tests
    - [x] Integration tests with [this addition](apache-spark-on-k8s/spark-integration#46)
    ```
    KubernetesSuite:
    - Run SparkPi with no resources
    - Run SparkPi with a very long application name.
    - Run SparkPi with a master URL without a scheme.
    - Run SparkPi with an argument.
    - Run SparkPi with custom labels, annotations, and environment variables.
    - Run SparkPi with a test secret mounted into the driver and executor pods
    - Run extraJVMOptions check on driver
    - Run SparkRemoteFileTest using a remote data file
    - Run PySpark on simple pi.py example
    - Run PySpark with Python2 to test a pyfiles example
    - Run PySpark with Python3 to test a pyfiles example
    Run completed in 4 minutes, 28 seconds.
    Total number of tests run: 11
    Suites: completed 2, aborted 0
    Tests: succeeded 11, failed 0, canceled 0, ignored 0, pending 0
    All tests passed.
    ```
    
    Author: Ilan Filonenko <if56@cornell.edu>
    Author: Ilan Filonenko <ifilondz@gmail.com>
    
    Closes #21092 from ifilonenko/master.
    ifilonenko authored and mccheah committed Jun 8, 2018
     Commit: 1a644af
  6. [SPARK-17756][PYTHON][STREAMING] Workaround to avoid return type mismatch in PythonTransformFunction
    
    ## What changes were proposed in this pull request?
    
     This PR proposes to wrap the transformed RDD within `TransformFunction`. `PythonTransformFunction` appears to require returning a `JavaRDD` from `_jrdd`.
    
    https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/python/pyspark/streaming/util.py#L67
    
    https://github.com/apache/spark/blob/6ee28423ad1b2e6089b82af64a31d77d3552bb38/streaming/src/main/scala/org/apache/spark/streaming/api/python/PythonDStream.scala#L43
    
     However, some APIs can make this a `JavaPairRDD`, for example `zip` in PySpark's RDD API.
     `_jrdd` can be checked as below:
    
    ```python
    >>> rdd.zip(rdd)._jrdd.getClass().toString()
    u'class org.apache.spark.api.java.JavaPairRDD'
    ```
    
    So, here, I wrapped it with `map` so that it ensures returning `JavaRDD`.
    
    ```python
    >>> rdd.zip(rdd).map(lambda x: x)._jrdd.getClass().toString()
    u'class org.apache.spark.api.java.JavaRDD'
    ```
    
    I tried to elaborate some failure cases as below:
    
    ```python
    from pyspark.streaming import StreamingContext
    ssc = StreamingContext(spark.sparkContext, 10)
    ssc.queueStream([sc.range(10)]) \
        .transform(lambda rdd: rdd.cartesian(rdd)) \
        .pprint()
    ssc.start()
    ```
    
    ```python
    from pyspark.streaming import StreamingContext
    ssc = StreamingContext(spark.sparkContext, 10)
    ssc.queueStream([sc.range(10)]).foreachRDD(lambda rdd: rdd.cartesian(rdd))
    ssc.start()
    ```
    
    ```python
    from pyspark.streaming import StreamingContext
    ssc = StreamingContext(spark.sparkContext, 10)
    ssc.queueStream([sc.range(10)]).foreachRDD(lambda rdd: rdd.zip(rdd))
    ssc.start()
    ```
    
    ```python
    from pyspark.streaming import StreamingContext
    ssc = StreamingContext(spark.sparkContext, 10)
    ssc.queueStream([sc.range(10)]).foreachRDD(lambda rdd: rdd.zip(rdd).union(rdd.zip(rdd)))
    ssc.start()
    ```
    
    ```python
    from pyspark.streaming import StreamingContext
    ssc = StreamingContext(spark.sparkContext, 10)
    ssc.queueStream([sc.range(10)]).foreachRDD(lambda rdd: rdd.zip(rdd).coalesce(1))
    ssc.start()
    ```
    
    ## How was this patch tested?
    
    Unit tests were added in `python/pyspark/streaming/tests.py` and manually tested.
    
    Author: hyukjinkwon <gurwls223@gmail.com>
    
    Closes #19498 from HyukjinKwon/SPARK-17756.
    HyukjinKwon committed Jun 8, 2018
     Commit: b070ded
  7. [SPARK-23010][K8S] Initial checkin of k8s integration tests.

    These tests were developed in the https://github.com/apache-spark-on-k8s/spark-integration repo
    by several contributors. This is a copy of the current state into the main apache spark repo.
    The only changes from the current spark-integration repo state are:
    * Move the files from the repo root into resource-managers/kubernetes/integration-tests
    * Add a reference to these tests in the root README.md
    * Fix a path reference in dev/dev-run-integration-tests.sh
    * Add a TODO in include/util.sh
    
    ## What changes were proposed in this pull request?
    
    Incorporation of Kubernetes integration tests.
    
    ## How was this patch tested?
    
    This code has its own unit tests, but the main purpose is to provide the integration tests.
    I tested this on my laptop by running dev/dev-run-integration-tests.sh --spark-tgz ~/spark-2.4.0-SNAPSHOT-bin--.tgz
    
    The spark-integration tests have already been running for months in AMPLab, here is an example:
    https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-scheduled-spark-integration-master/
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: Sean Suchter <sean-github@suchter.com>
    Author: Sean Suchter <ssuchter@pepperdata.com>
    
    Closes #20697 from ssuchter/ssuchter-k8s-integration-tests.
    ssuchter authored and mccheah committed Jun 8, 2018
     Commit: f433ef7

Commits on Jun 9, 2018

  1. [SPARK-24412][SQL] Adding docs about automagical type casting in `isin` and `isInCollection` APIs
    
    ## What changes were proposed in this pull request?
     Update documentation for the `isInCollection` API to clearly explain the "auto-casting" of elements if their types are different.
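
     A hedged PySpark illustration of the related `isin` API (values and column names are made up); per the updated docs, elements whose type differs from the column's are auto-cast before comparison:

     ```python
     from pyspark.sql.functions import col

     df = spark.createDataFrame([(1,), (2,), (3,)], ["id"])

     # Keeps the rows whose id is in the given collection.
     df.where(col("id").isin(1, 2)).show()
     ```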
    
    ## How was this patch tested?
    No-Op
    
    Author: Thiruvasakan Paramasivan <thiru@apple.com>
    
    Closes #21519 from trvskn/sql-doc-update.
    raptond authored and dbtsai committed Jun 9, 2018
     Commit: 36a3409
  2. [SPARK-24468][SQL] Handle negative scale when adjusting precision for decimal operations
    
    ## What changes were proposed in this pull request?
    
    In SPARK-22036 we introduced the possibility to allow precision loss in arithmetic operations (according to the SQL standard). The implementation was drawn from Hive's one, where Decimals with a negative scale are not allowed in the operations.
    
    The PR handles the case when the scale is negative, removing the assertion that it is not.
    
    ## How was this patch tested?
    
    added UTs
    
    Author: Marco Gaido <marcogaido91@gmail.com>
    
    Closes #21499 from mgaido91/SPARK-24468.
    mgaido91 authored and cloud-fan committed Jun 9, 2018
     Commit: f07c506

Commits on Jun 11, 2018

  1. [SPARK-23754][PYTHON][FOLLOWUP] Move UDF stop iteration wrapping from driver to executor
    
    ## What changes were proposed in this pull request?
    SPARK-23754 was fixed in #21383 by changing the UDF code to wrap the user function, but this required a hack to save its argspec. This PR reverts this change and fixes the `StopIteration` bug in the worker
    
    ## How does this work?
    
     The root of the problem is that when a user-supplied function raises a `StopIteration`, PySpark might stop processing data if this function is used in a for-loop. The solution is to catch `StopIteration` exceptions and re-raise them as `RuntimeError`s, so that the execution fails and the error is reported to the user. This is done using the `fail_on_stopiteration` wrapper (a minimal sketch follows the list below), in different ways depending on where the function is used:
     - In RDDs, the user function is wrapped in the driver, because this function is also called in the driver itself.
     - In SQL UDFs, the function is wrapped in the worker, since all processing happens there. Moreover, the worker needs the signature of the user function, which is lost when wrapping it, but passing this signature to the worker requires a not so nice hack.
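
     A minimal sketch of the wrapper idea (not the exact PySpark implementation): re-raise `StopIteration` from the user function as `RuntimeError` so the task fails visibly instead of silently truncating data.

     ```python
     def fail_on_stopiteration(f):
         """Sketch: wrap a user function so a StopIteration it raises becomes
         a RuntimeError and the task fails with a clear error."""
         def wrapper(*args, **kwargs):
             try:
                 return f(*args, **kwargs)
             except StopIteration as e:
                 raise RuntimeError(
                     "Caught StopIteration thrown from user's code; failing the task", e)
         return wrapper
     ```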
    
    ## How was this patch tested?
    
    Same tests, plus tests for pandas UDFs
    
    Author: edorigatti <emilio.dorigatti@gmail.com>
    
    Closes #21467 from e-dorigatti/fix_udf_hack.
    e-dorigatti authored and HyukjinKwon committed Jun 11, 2018
     Commit: 3e5b4ae
  2. [SPARK-19826][ML][PYTHON] add spark.ml Python API for PIC

    ## What changes were proposed in this pull request?
    
    add spark.ml Python API for PIC
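
     A hedged usage sketch of the new Python wrapper (the toy graph is illustrative):

     ```python
     from pyspark.ml.clustering import PowerIterationClustering

     df = spark.createDataFrame(
         [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0), (3, 4, 1.0)],
         ["src", "dst", "weight"])

     pic = PowerIterationClustering(k=2, maxIter=10, weightCol="weight")
     # assignClusters returns a DataFrame of (id, cluster) assignments.
     pic.assignClusters(df).show()
     ```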
    
    ## How was this patch tested?
    
    add doctest
    
    Author: Huaxin Gao <huaxing@us.ibm.com>
    
    Closes #21513 from huaxingao/spark--19826.
    huaxingao authored and mengxr committed Jun 11, 2018
     Commit: a99d284
  3. [MINOR][CORE] Log committer class used by HadoopMapRedCommitProtocol

    ## What changes were proposed in this pull request?
    
    When HadoopMapRedCommitProtocol is used (e.g., when using saveAsTextFile() or
    saveAsHadoopFile() with RDDs), it's not easy to determine which output committer
    class was used, so this PR simply logs the class that was used, similarly to what
    is done in SQLHadoopMapReduceCommitProtocol.
    
    ## How was this patch tested?
    
    Built Spark then manually inspected logging when calling saveAsTextFile():
    
    ```scala
    scala> sc.setLogLevel("INFO")
    scala> sc.textFile("README.md").saveAsTextFile("/tmp/out")
    ...
    18/05/29 10:06:20 INFO HadoopMapRedCommitProtocol: Using output committer class org.apache.hadoop.mapred.FileOutputCommitter
    ```
    
    Author: Jonathan Kelly <jonathak@amazon.com>
    
    Closes #21452 from ejono/master.
    ejono authored and srowen committed Jun 11, 2018
     Commit: 9b6f242
  4. [SPARK-24520] Double braces in documentations

    There are double braces in the markdown, which break the link.
    
    ## What changes were proposed in this pull request?
    
    (Please fill in changes proposed in this fix)
    
    ## How was this patch tested?
    
    (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
    (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: Fokko Driesprong <fokkodriesprong@godatadriven.com>
    
    Closes #21528 from Fokko/patch-1.
    Fokko Driesprong authored and srowen committed Jun 11, 2018
     Commit: 2dc047a
  5. [SPARK-24134][DOCS] A missing full-stop in doc "Tuning Spark".

    ## What changes were proposed in this pull request?
    
    In the document [Tuning Spark -> Determining Memory Consumption](https://spark.apache.org/docs/latest/tuning.html#determining-memory-consumption), a full stop was missing in the second paragraph.
    
    It's `...use SizeEstimator’s estimate method This is useful for experimenting...`, while there is supposed to be a full stop before `This`.
    
    Screenshot showing before change is attached below.
    <img width="1033" alt="screen shot 2018-05-01 at 5 22 32 pm" src="https://user-images.githubusercontent.com/11539188/39468206-778e3d8a-4d64-11e8-8a92-38464952b54b.png">
    
    ## How was this patch tested?
    
    This is a simple change in doc. Only one full stop was added in plain text.
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: Xiaodong <11539188+XD-DENG@users.noreply.github.com>
    
    Closes #21205 from XD-DENG/patch-1.
    XD-DENG authored and srowen committed Jun 11, 2018
     Commit: f5af86e

Commits on Jun 12, 2018

  1. [SPARK-22144][SQL] ExchangeCoordinator combine the partitions of an 0 sized pre-shuffle to 0
    
    ## What changes were proposed in this pull request?
     When the number of pre-shuffle partitions is 0, the number of post-shuffle partitions should be 0 instead of spark.sql.shuffle.partitions.
    
    ## How was this patch tested?
     Verified that ExchangeCoordinator converts a pre-shuffle with 0 partitions into a post-shuffle with 0 partitions, instead of one with spark.sql.shuffle.partitions partitions.
    
    Author: liutang123 <liutang123@yeah.net>
    
    Closes #19364 from liutang123/SPARK-22144.
    liutang123 authored and cloud-fan committed Jun 12, 2018
     Commit: 0481977
  2. [SPARK-23732][DOCS] Fix source links in generated scaladoc.

    Apply the suggestion on the bug to fix source links. Tested with
    the 2.3.1 release docs.
    
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    
    Closes #21521 from vanzin/SPARK-23732.
    Marcelo Vanzin authored and HyukjinKwon committed Jun 12, 2018
     Commit: dc22465
  3. [SPARK-24502][SQL] flaky test: UnsafeRowSerializerSuite

    ## What changes were proposed in this pull request?
    
    `UnsafeRowSerializerSuite` calls `UnsafeProjection.create` which accesses `SQLConf.get`, while the current active SparkSession may already be stopped, and we may hit exception like this
    
    ```
    sbt.ForkMain$ForkError: java.lang.IllegalStateException: LiveListenerBus is stopped.
    	at org.apache.spark.scheduler.LiveListenerBus.addToQueue(LiveListenerBus.scala:97)
    	at org.apache.spark.scheduler.LiveListenerBus.addToStatusQueue(LiveListenerBus.scala:80)
    	at org.apache.spark.sql.internal.SharedState.<init>(SharedState.scala:93)
    	at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
    	at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
    	at scala.Option.getOrElse(Option.scala:121)
    	at org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:120)
    	at org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:119)
    	at org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:286)
    	at org.apache.spark.sql.test.TestSparkSession.sessionState$lzycompute(TestSQLContext.scala:42)
    	at org.apache.spark.sql.test.TestSparkSession.sessionState(TestSQLContext.scala:41)
    	at org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95)
    	at org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95)
    	at scala.Option.map(Option.scala:146)
    	at org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:95)
    	at org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:94)
    	at org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:126)
    	at org.apache.spark.sql.catalyst.expressions.CodeGeneratorWithInterpretedFallback.createObject(CodeGeneratorWithInterpretedFallback.scala:54)
    	at org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.create(Projection.scala:157)
    	at org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.create(Projection.scala:150)
    	at org.apache.spark.sql.execution.UnsafeRowSerializerSuite.org$apache$spark$sql$execution$UnsafeRowSerializerSuite$$unsafeRowConverter(UnsafeRowSerializerSuite.scala:54)
    	at org.apache.spark.sql.execution.UnsafeRowSerializerSuite.org$apache$spark$sql$execution$UnsafeRowSerializerSuite$$toUnsafeRow(UnsafeRowSerializerSuite.scala:49)
    	at org.apache.spark.sql.execution.UnsafeRowSerializerSuite$$anonfun$2.apply(UnsafeRowSerializerSuite.scala:63)
    	at org.apache.spark.sql.execution.UnsafeRowSerializerSuite$$anonfun$2.apply(UnsafeRowSerializerSuite.scala:60)
    ...
    ```
    
    ## How was this patch tested?
    
    N/A
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #21518 from cloud-fan/test.
    cloud-fan committed Jun 12, 2018
     Commit: 01452ea
  4. docs: fix typo

    no => no[t]
    
    ## What changes were proposed in this pull request?
    
    Fixing a typo.
    
    ## How was this patch tested?
    
    Visual check of the docs.
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: Tom Saleeba <tom.saleeba@gmail.com>
    
    Closes #21496 from tomsaleeba/patch-1.
    tomsaleeba authored and srowen committed Jun 12, 2018
     Commit: 1d7db65
  5. [SPARK-15064][ML] Locale support in StopWordsRemover

    ## What changes were proposed in this pull request?
    
    Add locale support for `StopWordsRemover`.
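
     A hedged sketch of how the new locale parameter might be used from Python (assuming the Python wrapper mirrors the Scala API added here):

     ```python
     from pyspark.ml.feature import StopWordsRemover

     # Case-insensitive stop-word matching uses the given locale's casing rules.
     remover = StopWordsRemover(inputCol="raw", outputCol="filtered", locale="en_US")

     df = spark.createDataFrame([(["The", "quick", "brown", "fox"],)], ["raw"])
     remover.transform(df).show(truncate=False)
     ```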
    
    ## How was this patch tested?
    
    [Scala|Python] unit tests.
    
    Author: Lee Dongjin <dongjin@apache.org>
    
    Closes #21501 from dongjinleekr/feature/SPARK-15064.
    dongjinleekr authored and mengxr committed Jun 12, 2018
     Commit: 5d6a53d
  6. [SPARK-24531][TESTS] Remove version 2.2.0 from testing versions in HiveExternalCatalogVersionsSuite
    
    ## What changes were proposed in this pull request?
    
    Removing version 2.2.0 from testing versions in HiveExternalCatalogVersionsSuite as it is not present anymore in the mirrors and this is blocking all the open PRs.
    
    ## How was this patch tested?
    
    running UTs
    
    Author: Marco Gaido <marcogaido91@gmail.com>
    
    Closes #21540 from mgaido91/SPARK-24531.
    mgaido91 authored and gatorsmile committed Jun 12, 2018
     Commit: 2824f14
  7. [SPARK-24416] Fix configuration specification for killBlacklisted executors
    
    ## What changes were proposed in this pull request?
    
    spark.blacklist.killBlacklistedExecutors is defined as
    
    (Experimental) If set to "true", allow Spark to automatically kill, and attempt to re-create, executors when they are blacklisted. Note that, when an entire node is added to the blacklist, all of the executors on that node will be killed.
    
     I presume the killing of blacklisted executors only happens after the stage completes successfully and all tasks have completed, or on fetch failures (updateBlacklistForFetchFailure/updateBlacklistForSuccessfulTaskSet). It is confusing because the definition states that the executor will be re-created as soon as it is blacklisted. This is not true: while the stage is in progress and an executor is blacklisted, no cleanup is attempted until the stage finishes.
    
    Author: Sanket Chintapalli <schintap@yahoo-inc.com>
    
    Closes #21475 from redsanket/SPARK-24416.
    Sanket Chintapalli authored and squito committed Jun 12, 2018
     Commit: 3af1d3e
  8. [SPARK-23931][SQL] Adds arrays_zip function to sparksql

     Signed-off-by: DylanGuedes <djmgguedes@gmail.com>
    
    ## What changes were proposed in this pull request?
    
    Addition of arrays_zip function to spark sql functions.
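
     An illustrative example (not from the PR) of the new function through `spark.sql`:

     ```python
     # arrays_zip merges the input arrays element-wise into an array of structs,
     # pairing arr1[i] with arr2[i].
     spark.sql("SELECT arrays_zip(array(1, 2), array('a', 'b')) AS zipped").show(truncate=False)
     ```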
    
    ## How was this patch tested?
    
    (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
    Unit tests that checks if the results are correct.
    
    Author: DylanGuedes <djmgguedes@gmail.com>
    
    Closes #21045 from DylanGuedes/SPARK-23931.
    DylanGuedes authored and ueshin committed Jun 12, 2018
     Commit: f0ef1b3
  9. [SPARK-24216][SQL] Spark TypedAggregateExpression uses getSimpleName that is not safe in scala
    
    ## What changes were proposed in this pull request?
    
     When a user creates an aggregator object in Scala and passes the aggregator to Spark Dataset's agg() method, Spark will initialize TypedAggregateExpression with the nodeName field set to aggregator.getClass.getSimpleName. However, getSimpleName is not safe in a Scala environment, depending on how the user created the aggregator object. For example, if the aggregator class's fully qualified name is "com.my.company.MyUtils$myAgg$2$", getSimpleName will throw java.lang.InternalError "Malformed class name". This has been reported in scalatest scalatest/scalatest#1044 and discussed in many Scala upstream JIRAs such as SI-8110 and SI-5425.
    
    To fix this issue, we follow the solution in scalatest/scalatest#1044 to add safer version of getSimpleName as a util method, and TypedAggregateExpression will invoke this util method rather than getClass.getSimpleName.
    
    ## How was this patch tested?
    added unit test
    
    Author: Fangshi Li <fli@linkedin.com>
    
    Closes #21276 from fangshil/SPARK-24216.
    Fangshi Li authored and cloud-fan committed Jun 12, 2018
     Commit: cc88d7f
  10. [SPARK-23933][SQL] Add map_from_arrays function

    ## What changes were proposed in this pull request?
    
    The PR adds the SQL function `map_from_arrays`. The behavior of the function is based on Presto's `map`. Since SparkSQL already had a `map` function, we prepared the different name for this behavior.
    
     This function returns a map built from a pair of arrays for keys and values.
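
     For illustration (not from the PR description), a minimal example via `spark.sql`:

     ```python
     # map_from_arrays builds a map whose keys come from the first array and
     # whose values come from the second; both arrays must have the same length.
     spark.sql("SELECT map_from_arrays(array(1, 2), array('a', 'b')) AS m").show()
     # m maps 1 -> 'a' and 2 -> 'b'
     ```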
    
    ## How was this patch tested?
    
    Added UTs
    
    Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
    
    Closes #21258 from kiszk/SPARK-23933.
    kiszk authored and ueshin committed Jun 12, 2018
     Commit: ada28f2
  11. [SPARK-23010][BUILD][FOLLOWUP] Fix java checkstyle failure of kubernetes-integration-tests
    
    ## What changes were proposed in this pull request?
    
    Fix java checkstyle failure of kubernetes-integration-tests
    
    ## How was this patch tested?
    
    Checked manually on my local environment.
    
    Author: Xingbo Jiang <xingbo.jiang@databricks.com>
    
    Closes #21545 from jiangxb1987/k8s-checkstyle.
    jiangxb1987 authored and Marcelo Vanzin committed Jun 12, 2018
     Commit: 0d3714d
  12. [SPARK-24506][UI] Add UI filters to tabs added after binding

    ## What changes were proposed in this pull request?
    
    Currently, `spark.ui.filters` are not applied to the handlers added after binding the server. This means that every page which is added after starting the UI will not have the filters configured on it. This can allow unauthorized access to the pages.
    
    The PR adds the filters also to the handlers added after the UI starts.
    
    ## How was this patch tested?
    
    manual tests (without the patch, starting the thriftserver with `--conf spark.ui.filters=org.apache.hadoop.security.authentication.server.AuthenticationFilter --conf spark.org.apache.hadoop.security.authentication.server.AuthenticationFilter.params="type=simple"` you can access `http://localhost:4040/sqlserver`; with the patch, 401 is the response as for the other pages).
    
    Author: Marco Gaido <marcogaido91@gmail.com>
    
    Closes #21523 from mgaido91/SPARK-24506.
    mgaido91 authored and Marcelo Vanzin committed Jun 12, 2018
     Commit: f53818d

Commits on Jun 13, 2018

  1. [SPARK-22239][SQL][PYTHON] Enable grouped aggregate pandas UDFs as window functions with unbounded window frames
    
    ## What changes were proposed in this pull request?
    This PR enables using a grouped aggregate pandas UDFs as window functions. The semantics is the same as using SQL aggregation function as window functions.
    
    ```
           >>> from pyspark.sql.functions import pandas_udf, PandasUDFType
           >>> from pyspark.sql import Window
           >>> df = spark.createDataFrame(
           ...     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
           ...     ("id", "v"))
           >>> pandas_udf("double", PandasUDFType.GROUPED_AGG)
           ... def mean_udf(v):
           ...     return v.mean()
           >>> w = Window.partitionBy('id')
           >>> df.withColumn('mean_v', mean_udf(df['v']).over(w)).show()
           +---+----+------+
           | id|   v|mean_v|
           +---+----+------+
           |  1| 1.0|   1.5|
           |  1| 2.0|   1.5|
           |  2| 3.0|   6.0|
           |  2| 5.0|   6.0|
           |  2|10.0|   6.0|
           +---+----+------+
    ```
    
    The scope of this PR is somewhat limited in terms of:
    (1) Only supports unbounded window, which acts essentially as group by.
    (2) Only supports aggregation functions, not "transform" like window functions (n -> n mapping)
    
    Both of these are left as future work. Especially, (1) needs careful thinking w.r.t. how to pass rolling window data to python efficiently. (2) is a bit easier but does require more changes therefore I think it's better to leave it as a separate PR.
    
    ## How was this patch tested?
    
    WindowPandasUDFTests
    
    Author: Li Jin <ice.xelloss@gmail.com>
    
    Closes #21082 from icexelloss/SPARK-22239-window-udf.
    icexelloss authored and HyukjinKwon committed Jun 13, 2018
    Configuration menu
    Copy the full SHA
    9786ce6 View commit details
    Browse the repository at this point in the history
  2. [SPARK-24466][SS] Fix TextSocketMicroBatchReader to be compatible wit…

    …h netcat again
    
    ## What changes were proposed in this pull request?
    
    TextSocketMicroBatchReader was no longer compatible with netcat because it launches a temporary reader to read the schema, closes it, and then re-opens a reader. While a reliable socket server should be able to handle this without any issue, the nc command normally can't handle multiple connections and simply exits when the temporary reader is closed.
    
    This patch makes TextSocketMicroBatchReader compatible with netcat again by deferring the socket opening to the first call of planInputPartitions() instead of the constructor.
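
    A minimal sketch of the lazy-connection idea, with illustrative names rather than the actual reader internals:
    ```scala
    import java.net.Socket

    // The socket is opened on first use rather than in the constructor, so a schema-only
    // pass no longer opens and immediately closes an extra connection (which makes nc exit).
    class LazySocketSource(host: String, port: Int) {
      private var socket: Option[Socket] = None

      private def connected(): Socket = socket.getOrElse {
        val s = new Socket(host, port)
        socket = Some(s)
        s
      }

      def planInputPartitions(): Socket = connected()
    }
    ```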
    
    ## How was this patch tested?
    
    Added a unit test which fails on the current code and succeeds with the patch. Also tested manually.
    
    Author: Jungtaek Lim <kabhwan@gmail.com>
    
    Closes #21497 from HeartSaVioR/SPARK-24466.
    HeartSaVioR authored and HyukjinKwon committed Jun 13, 2018
    Configuration menu
    Copy the full SHA
    3352d6f View commit details
    Browse the repository at this point in the history
  3. [SPARK-24485][SS] Measure and log elapsed time for filesystem operati…

    …ons in HDFSBackedStateStoreProvider
    
    ## What changes were proposed in this pull request?
    
    This patch measures and logs the elapsed time of each operation that communicates with the file system (mostly remote HDFS in production) in HDFSBackedStateStoreProvider, to help investigate latency issues.
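
    An illustrative timing wrapper in the spirit of this change (not the actual Spark helper):
    ```scala
    // Wrap a filesystem call, measure its wall-clock time, and log it.
    def timeTakenMs[T](label: String)(body: => T): T = {
      val start = System.nanoTime()
      try body finally {
        val elapsedMs = (System.nanoTime() - start) / 1000000L
        // The real change would go through Spark's logging; println keeps the sketch self-contained.
        println(s"$label took $elapsedMs ms")
      }
    }
    ```
    Calls that touch the checkpoint filesystem would be wrapped this way so their latencies show up in the logs.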
    
    ## How was this patch tested?
    
    Manually tested.
    
    Author: Jungtaek Lim <kabhwan@gmail.com>
    
    Closes #21506 from HeartSaVioR/SPARK-24485.
    HeartSaVioR authored and HyukjinKwon committed Jun 13, 2018
    Configuration menu
    Copy the full SHA
    4c388bc View commit details
    Browse the repository at this point in the history
  4. [SPARK-24479][SS] Added config for registering streamingQueryListeners

    ## What changes were proposed in this pull request?
    
    Currently a "StreamingQueryListener" can only be registered programatically. We could have a new config "spark.sql.streamingQueryListeners" similar to  "spark.sql.queryExecutionListeners" and "spark.extraListeners" for users to register custom streaming listeners.
    
    ## How was this patch tested?
    
    New unit test and running example programs.
    
    Author: Arun Mahadevan <arunm@apache.org>
    
    Closes #21504 from arunmahadevan/SPARK-24480.
    arunmahadevan authored and HyukjinKwon committed Jun 13, 2018
    Configuration menu
    Copy the full SHA
    7703b46 View commit details
    Browse the repository at this point in the history
  5. [SPARK-24500][SQL] Make sure streams are materialized during Tree tra…

    …nsforms.
    
    ## What changes were proposed in this pull request?
    If you construct catalyst trees using `scala.collection.immutable.Stream` you can run into situations where valid transformations do not seem to have any effect. There are two causes for this behavior:
    - `Stream` is evaluated lazily. Note that the default implementation will generally only evaluate the function for the first element (this makes testing a bit tricky).
    - `TreeNode` and `QueryPlan` use side effects to detect whether a tree has changed. Mapping over a stream is lazy and may not trigger this side effect. If this happens, the node will incorrectly assume that it did not change and return itself instead of the newly created node (this is done for GC reasons).
    
    This PR fixes this issue by forcing materialization on streams in `TreeNode` and `QueryPlan`.
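
    A plain-Scala illustration of the laziness issue (not Spark internals):
    ```scala
    var mapCalls = 0
    val children = Stream(1, 2, 3)
    val mapped = children.map { c => mapCalls += 1; c + 1 }
    // Stream.map applies the function to the head eagerly but defers the tail, so mapCalls == 1
    // here; a change-detection mechanism based on that side effect sees an "unchanged" tree.
    mapped.force   // materializing the stream evaluates the remaining elements (mapCalls == 3)
    ```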
    
    ## How was this patch tested?
    Unit tests were added to `TreeNodeSuite` and `LogicalPlanSuite`. An integration test was added to the `PlannerSuite`
    
    Author: Herman van Hovell <hvanhovell@databricks.com>
    
    Closes #21539 from hvanhovell/SPARK-24500.
    hvanhovell authored and cloud-fan committed Jun 13, 2018
    Configuration menu
    Copy the full SHA
    299d297 View commit details
    Browse the repository at this point in the history
  6. [SPARK-24235][SS] Implement continuous shuffle writer for single read…

    …er partition.
    
    ## What changes were proposed in this pull request?
    
    https://docs.google.com/document/d/1IL4kJoKrZWeyIhklKUJqsW-yEN7V7aL05MmM65AYOfE/edit
    
    Implement continuous shuffle write RDD for a single reader partition. (I don't believe any implementation changes are actually required for multiple reader partitions, but this PR is already very large, so I want to exclude those for now to keep the size down.)
    
    ## How was this patch tested?
    
    new unit tests
    
    Author: Jose Torres <torres.joseph.f+github@gmail.com>
    
    Closes #21428 from jose-torres/writerTask.
    jose-torres authored and zsxwing committed Jun 13, 2018
    Configuration menu
    Copy the full SHA
    1b46f41 View commit details
    Browse the repository at this point in the history
  7. [SPARK-24531][TESTS] Replace 2.3.0 version with 2.3.1

    ## What changes were proposed in this pull request?
    
    The PR updates the 2.3 version tested to the new release 2.3.1.
    
    ## How was this patch tested?
    
    existing UTs
    
    Author: Marco Gaido <marcogaido91@gmail.com>
    
    Closes #21543 from mgaido91/patch-1.
    mgaido91 authored and gatorsmile committed Jun 13, 2018
    Configuration menu
    Copy the full SHA
    3bf7691 View commit details
    Browse the repository at this point in the history

Commits on Jun 14, 2018

  1. [MINOR][CORE][TEST] Remove unnecessary sort in UnsafeInMemorySorterSuite

    ## What changes were proposed in this pull request?
    
    We don't require a specific ordering of the input data, so the sort action is unnecessary and misleading.
    
    ## How was this patch tested?
    
    Existing test suite.
    
    Author: Xingbo Jiang <xingbo.jiang@databricks.com>
    
    Closes #21536 from jiangxb1987/sorterSuite.
    jiangxb1987 authored and HyukjinKwon committed Jun 14, 2018
    Configuration menu
    Copy the full SHA
    534065e View commit details
    Browse the repository at this point in the history
  2. [SPARK-24495][SQL] EnsureRequirement returns wrong plan when reorderi…

    …ng equal keys
    
    ## What changes were proposed in this pull request?
    
    `EnsureRequirement`, in its `reorder` method, currently assumes that the same key appears only once in the join condition. This might not be the case, and when the assumption does not hold, it returns a wrong plan which produces a wrong query result.
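
    For illustration, a minimal sketch of a join whose condition reuses the same key (assuming a SparkSession `spark` with implicits imported; not the exact reproduction from the JIRA):
    ```scala
    import spark.implicits._

    val t1 = Seq((1, 1)).toDF("a", "b")
    val t2 = Seq((1, 1)).toDF("x", "y")
    // The key t1("a") appears twice; reordering the equal keys must keep each occurrence
    // paired with its partner on the other side, otherwise the plan (and result) is wrong.
    val joined = t1.join(t2, t1("a") === t2("x") && t1("a") === t2("y"))
    ```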
    
    ## How was this patch tested?
    
    added UT
    
    Author: Marco Gaido <marcogaido91@gmail.com>
    
    Closes #21529 from mgaido91/SPARK-24495.
    mgaido91 authored and gatorsmile committed Jun 14, 2018
    Configuration menu
    Copy the full SHA
    fdadc4b View commit details
    Browse the repository at this point in the history
  3. [SPARK-24563][PYTHON] Catch TypeError when testing existence of HiveC…

    …onf when creating pysp…
    
    …ark shell
    
    ## What changes were proposed in this pull request?
    
    This PR catches TypeError when testing for the existence of HiveConf while creating the pyspark shell.
    
    ## How was this patch tested?
    
    Manually tested. Here are the manual test cases:
    
    Build with hive:
    ```
    (pyarrow-dev) Lis-MacBook-Pro:spark icexelloss$ bin/pyspark
    Python 3.6.5 | packaged by conda-forge | (default, Apr  6 2018, 13:44:09)
    [GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    18/06/14 14:55:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /__ / .__/\_,_/_/ /_/\_\   version 2.4.0-SNAPSHOT
          /_/
    
    Using Python version 3.6.5 (default, Apr  6 2018 13:44:09)
    SparkSession available as 'spark'.
    >>> spark.conf.get('spark.sql.catalogImplementation')
    'hive'
    ```
    
    Build without hive:
    ```
    (pyarrow-dev) Lis-MacBook-Pro:spark icexelloss$ bin/pyspark
    Python 3.6.5 | packaged by conda-forge | (default, Apr  6 2018, 13:44:09)
    [GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    18/06/14 15:04:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /__ / .__/\_,_/_/ /_/\_\   version 2.4.0-SNAPSHOT
          /_/
    
    Using Python version 3.6.5 (default, Apr  6 2018 13:44:09)
    SparkSession available as 'spark'.
    >>> spark.conf.get('spark.sql.catalogImplementation')
    'in-memory'
    ```
    
    Failed to start shell:
    ```
    (pyarrow-dev) Lis-MacBook-Pro:spark icexelloss$ bin/pyspark
    Python 3.6.5 | packaged by conda-forge | (default, Apr  6 2018, 13:44:09)
    [GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    18/06/14 15:07:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    /Users/icexelloss/workspace/spark/python/pyspark/shell.py:45: UserWarning: Failed to initialize Spark session.
      warnings.warn("Failed to initialize Spark session.")
    Traceback (most recent call last):
      File "/Users/icexelloss/workspace/spark/python/pyspark/shell.py", line 41, in <module>
        spark = SparkSession._create_shell_session()
      File "/Users/icexelloss/workspace/spark/python/pyspark/sql/session.py", line 581, in _create_shell_session
        return SparkSession.builder.getOrCreate()
      File "/Users/icexelloss/workspace/spark/python/pyspark/sql/session.py", line 168, in getOrCreate
        raise py4j.protocol.Py4JError("Fake Py4JError")
    py4j.protocol.Py4JError: Fake Py4JError
    (pyarrow-dev) Lis-MacBook-Pro:spark icexelloss$
    ```
    
    Author: Li Jin <ice.xelloss@gmail.com>
    
    Closes #21569 from icexelloss/SPARK-24563-fix-pyspark-shell-without-hive.
    icexelloss authored and Marcelo Vanzin committed Jun 14, 2018
    Configuration menu
    Copy the full SHA
    d3eed8f View commit details
    Browse the repository at this point in the history
  4. [SPARK-24543][SQL] Support any type as DDL string for from_json's schema

    ## What changes were proposed in this pull request?
    
    In the PR, I propose to support any DataType represented as DDL string for the from_json function. After the changes, it will be possible to specify `MapType` in SQL like:
    ```sql
    select from_json('{"a":1, "b":2}', 'map<string, int>')
    ```
    and in Scala (similar in other languages)
    ```scala
    val in = Seq("""{"a": {"b": 1}}""").toDS()
    val schema = "map<string, map<string, int>>"
    val out = in.select(from_json($"value", schema, Map.empty[String, String]))
    ```
    
    ## How was this patch tested?
    
    Added a couple of SQL tests and modified existing tests for Python and Scala. The existing tests were modified because it is not important for them in which format the schema for `from_json` is provided.
    
    Author: Maxim Gekk <maxim.gekk@databricks.com>
    
    Closes #21550 from MaxGekk/from_json-ddl-schema.
    MaxGekk authored and cloud-fan committed Jun 14, 2018
    Configuration menu
    Copy the full SHA
    b8f27ae View commit details
    Browse the repository at this point in the history
  5. [SPARK-24319][SPARK SUBMIT] Fix spark-submit execution where no main …

    …class is required.
    
    ## What changes were proposed in this pull request?
    
    After [PR 20925](#20925) it is no longer possible to execute the following commands:
    * run-example
    * run-example --help
    * run-example --version
    * run-example --usage-error
    * run-example --status ...
    * run-example --kill ...
    
    This PR allows execution of the mentioned commands again.
    
    ## How was this patch tested?
    
    Existing unit tests extended + additional written.
    
    Author: Gabor Somogyi <gabor.g.somogyi@gmail.com>
    
    Closes #21450 from gaborgsomogyi/SPARK-24319.
    gaborgsomogyi authored and Marcelo Vanzin committed Jun 14, 2018
    Configuration menu
    Copy the full SHA
    18cb0c0 View commit details
    Browse the repository at this point in the history
  6. [SPARK-24248][K8S] Use level triggering and state reconciliation in s…

    …cheduling and lifecycle
    
    ## What changes were proposed in this pull request?
    
    Previously, the scheduler backend was maintaining state in many places, not only for reading state but also writing to it. For example, state had to be managed in both the watch and in the executor allocator runnable. Furthermore, one had to keep track of multiple hash tables.
    
    We can do better here by:
    
    1. Consolidating the places where we manage state. Here, we take inspiration from traditional Kubernetes controllers. These controllers tend to follow a level-triggered mechanism. This means that the controller will continuously monitor the API server via watches and polling, and on periodic passes, the controller will reconcile the current state of the cluster with the desired state. We implement this by introducing the concept of a pod snapshot, which is a given state of the executors in the Kubernetes cluster. We operate periodically on snapshots. To prevent overloading the API server with polling requests to get the state of the cluster (particularly for executor allocation where we want to be checking frequently to get executors to launch without unbearably bad latency), we use watches to populate snapshots by applying observed events to a previous snapshot to get a new snapshot. Whenever we do poll the cluster, the polled state replaces any existing snapshot - this ensures eventual consistency and mirroring of the cluster, as is desired in a level triggered architecture.
    
    2. Storing less specialized in-memory state in general. Previously we were creating hash tables to represent the state of executors; instead, it is easier to represent state solely by the snapshots. A rough sketch of the snapshot-and-reconcile idea follows.
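
    An illustrative sketch of the level-triggered loop described above; all names here are assumptions made for this sketch, not the actual classes introduced by the PR:
    ```scala
    // A snapshot is just the last observed state of the executor pods.
    case class ExecutorPodsSnapshot(podPhases: Map[Long, String])   // executor id -> pod phase

    class ExecutorReconciler(desiredTotal: Int) {
      @volatile private var snapshot = ExecutorPodsSnapshot(Map.empty)

      // Watch events incrementally update the snapshot...
      def onWatchEvent(execId: Long, phase: String): Unit =
        snapshot = ExecutorPodsSnapshot(snapshot.podPhases + (execId -> phase))

      // ...and a periodic full poll simply replaces it, keeping it eventually consistent.
      def onFullPoll(observed: Map[Long, String]): Unit =
        snapshot = ExecutorPodsSnapshot(observed)

      // On each periodic pass, reconcile the desired state against the snapshot.
      def reconcile(requestExecutor: () => Unit): Unit = {
        val running = snapshot.podPhases.count { case (_, phase) => phase == "Running" }
        (running until desiredTotal).foreach(_ => requestExecutor())
      }
    }
    ```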
    
    ## How was this patch tested?
    
    Integration tests should verify that there are no regressions end to end. Unit tests are to be updated, focusing in particular on different orderings of events, especially accounting for events arriving in unexpected order.
    
    Author: mcheah <mcheah@palantir.com>
    
    Closes #21366 from mccheah/event-queue-driven-scheduling.
    mccheah committed Jun 14, 2018
    Configuration menu
    Copy the full SHA
    270a9a3 View commit details
    Browse the repository at this point in the history

Commits on Jun 15, 2018

  1. [SPARK-24478][SQL] Move projection and filter push down to physical c…

    …onversion
    
    ## What changes were proposed in this pull request?
    
    This removes the v2 optimizer rule for push-down and instead pushes filters and required columns when converting to a physical plan, as suggested by marmbrus. This makes the v2 relation cleaner because the output and filters do not change in the logical plan.
    
    A side-effect of this change is that the stats from the logical (optimized) plan no longer reflect pushed filters and projection. This is a temporary state, until the planner gathers stats from the physical plan instead. An alternative to this approach is rdblue@9d3a11e.
    
    The first commit was proposed in #21262. This PR replaces #21262.
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: Ryan Blue <blue@apache.org>
    
    Closes #21503 from rdblue/SPARK-24478-move-push-down-to-physical-conversion.
    rdblue authored and cloud-fan committed Jun 15, 2018
    Configuration menu
    Copy the full SHA
    22daeba View commit details
    Browse the repository at this point in the history
  2. [PYTHON] Fix typo in serializer exception

    ## What changes were proposed in this pull request?
    
    Fix typo in exception raised in Python serializer
    
    ## How was this patch tested?
    
    No code changes
    
    Author: Ruben Berenguel Montoro <ruben@mostlymaths.net>
    
    Closes #21566 from rberenguel/fix_typo_pyspark_serializers.
    rberenguel authored and HyukjinKwon committed Jun 15, 2018
    Configuration menu
    Copy the full SHA
    6567fc4 View commit details
    Browse the repository at this point in the history
  3. [SPARK-24490][WEBUI] Use WebUI.addStaticHandler in web UIs

    `WebUI` defines `addStaticHandler`, but the web UIs don't use it (and instead duplicate its logic). Let's clean them up and remove the duplication.
    
    Local build and waiting for Jenkins
    
    Author: Jacek Laskowski <jacek@japila.pl>
    
    Closes #21510 from jaceklaskowski/SPARK-24490-Use-WebUI.addStaticHandler.
    jaceklaskowski authored and Marcelo Vanzin committed Jun 15, 2018
    Configuration menu
    Copy the full SHA
    495d8cf View commit details
    Browse the repository at this point in the history
  4. [SPARK-24396][SS][PYSPARK] Add Structured Streaming ForeachWriter for…

    … python
    
    ## What changes were proposed in this pull request?
    
    This PR adds `foreach` for streaming queries in Python. Users will be able to specify their processing logic in two different ways.
    - As a function that takes a row as input.
    - As an object that has `open`, `process`, and `close` methods.
    
    See the python docs in this PR for more details.
    
    ## How was this patch tested?
    Added java and python unit tests
    
    Author: Tathagata Das <tathagata.das1565@gmail.com>
    
    Closes #21477 from tdas/SPARK-24396.
    tdas committed Jun 15, 2018
    Configuration menu
    Copy the full SHA
    b5ccf0d View commit details
    Browse the repository at this point in the history
  5. [SPARK-24452][SQL][CORE] Avoid possible overflow in int add or multiple

    ## What changes were proposed in this pull request?
    
    This PR fixes possible overflows in int addition and multiplication. In particular, the overflows in multiplication are detected by [Spotbugs](https://spotbugs.github.io/).
    
    The following assignments may overflow on the right-hand side; as a result, the stored value may be negative.
    ```
    long = int * int
    long = int + int
    ```
    
    To avoid this problem, this PR casts from int to long on the right-hand side.
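
    For illustration (variable names are made up), the overflow and the fix in Scala terms:
    ```scala
    val numChunks: Int = 100000
    val chunkSize: Int = 100000

    val wrong: Long = numChunks * chunkSize          // the multiply happens in Int and overflows
    val right: Long = numChunks.toLong * chunkSize   // widen one operand first, multiply in Long
    ```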
    
    ## How was this patch tested?
    
    Existing UTs.
    
    Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
    
    Closes #21481 from kiszk/SPARK-24452.
    kiszk authored and cloud-fan committed Jun 15, 2018
    Configuration menu
    Copy the full SHA
    90da7dc View commit details
    Browse the repository at this point in the history
  6. [SPARK-24525][SS] Provide an option to limit number of rows in a Memo…

    …rySink
    
    ## What changes were proposed in this pull request?
    
    Provide an option to limit the number of rows in a MemorySink. Currently, MemorySink and MemorySinkV2 have unbounded size, meaning that if they're used on big data, they can OOM the stream. This change adds a maxMemorySinkRows option to limit how many rows MemorySink and MemorySinkV2 can hold. By default, they remain unbounded.
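
    A hedged usage sketch; the option name is taken from the description above, and `streamingDF` plus the query name are assumptions:
    ```scala
    val query = streamingDF.writeStream
      .format("memory")
      .queryName("recent_results")
      .option("maxMemorySinkRows", 100000)   // cap how many rows the in-memory sink retains
      .start()
    ```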
    
    ## How was this patch tested?
    
    Added new unit tests.
    
    Author: Mukul Murthy <mukul.murthy@databricks.com>
    
    Closes #21559 from mukulmurthy/SPARK-24525.
    mukulmurthy authored and brkyvz committed Jun 15, 2018
    Configuration menu
    Copy the full SHA
    e4fee39 View commit details
    Browse the repository at this point in the history

Commits on Jun 16, 2018

  1. add one supported type missing from the javadoc

    ## What changes were proposed in this pull request?
    
    The supported java.math.BigInteger type is not mentioned in the javadoc of Encoders.bean()
    
    ## How was this patch tested?
    
    only Javadoc fix
    
    Author: James Yu <james@ispot.tv>
    
    Closes #21544 from yuj/master.
    James Yu authored and rxin committed Jun 16, 2018
    Configuration menu
    Copy the full SHA
    c7c0b08 View commit details
    Browse the repository at this point in the history

Commits on Jun 18, 2018

  1. [SPARK-24573][INFRA] Runs SBT checkstyle after the build to work arou…

    …nd a side-effect
    
    ## What changes were proposed in this pull request?
    
    It seems checkstyle affects the build in the Jenkins PR builder. I can't reproduce it locally, and it seems it can only be reproduced in the PR builder.
    
    I checked the places it goes through, and my speculation is that checkstyle's compilation in SBT has a side effect on the assembly build.
    
    This PR proposes to run the SBT checkstyle after the build.
    
    ## How was this patch tested?
    
    Jenkins tests.
    
    Author: hyukjinkwon <gurwls223@apache.org>
    
    Closes #21579 from HyukjinKwon/investigate-javastyle.
    HyukjinKwon committed Jun 18, 2018
    Configuration menu
    Copy the full SHA
    b0a9352 View commit details
    Browse the repository at this point in the history
  2. [SPARK-23772][SQL] Provide an option to ignore column of all null val…

    …ues or empty array during JSON schema inference
    
    ## What changes were proposed in this pull request?
    This PR adds a new JSON option `dropFieldIfAllNull` to ignore columns whose values are all null, or always an empty array/struct, during JSON schema inference.
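
    A usage sketch of the new option (the input path is illustrative):
    ```scala
    val df = spark.read
      .option("dropFieldIfAllNull", "true")   // drop fields that are all-null or always an empty array/struct
      .json("/path/to/input.json")
    df.printSchema()   // the inferred schema omits those fields
    ```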
    
    ## How was this patch tested?
    Added tests in `JsonSuite`.
    
    Author: Takeshi Yamamuro <yamamuro@apache.org>
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes #20929 from maropu/SPARK-23772.
    maropu authored and HyukjinKwon committed Jun 18, 2018
    Configuration menu
    Copy the full SHA
    e219e69 View commit details
    Browse the repository at this point in the history
  3. [SPARK-24526][BUILD][TEST-MAVEN] Spaces in the build dir causes failu…

    …res in the build/mvn script
    
    ## What changes were proposed in this pull request?
    
    Wrap the call to `${MVN_BIN}` in quotes so it handles paths that contain spaces.
    
    ## How was this patch tested?
    
    Ran the following to confirm using the build/mvn tool with a space in the build dir now works without error
    
    ```
    mkdir /tmp/test\ spaces
    cd /tmp/test\ spaces
    git clone https://github.com/apache/spark.git
    cd spark
    # Remove all mvn references in PATH so the script will download mvn to the local dir
    ./build/mvn -DskipTests clean package
    ```
    
    Author: trystanleftwich <trystan@atscale.com>
    
    Closes #21534 from trystanleftwich/SPARK-24526.
    trystanleftwich authored and HyukjinKwon committed Jun 18, 2018
    Configuration menu
    Copy the full SHA
    bce1775 View commit details
    Browse the repository at this point in the history
  4. [SPARK-24548][SQL] Fix incorrect schema of Dataset with tuple encoders

    ## What changes were proposed in this pull request?
    
    When creating tuple expression encoders, we should give the serializer expressions of tuple items correct names, so we can have correct output schema when we use such tuple encoders.
    
    ## How was this patch tested?
    
    Added test.
    
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    
    Closes #21576 from viirya/SPARK-24548.
    viirya authored and cloud-fan committed Jun 18, 2018
    Configuration menu
    Copy the full SHA
    8f225e0 View commit details
    Browse the repository at this point in the history

Commits on Jun 19, 2018

  1. [SPARK-24478][SQL][FOLLOWUP] Move projection and filter push down to …

    …physical conversion
    
    ## What changes were proposed in this pull request?
    
    This is a followup of #21503, to completely move operator pushdown to the planner rule.
    
    The code is mostly from #21319.
    
    ## How was this patch tested?
    
    existing tests
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #21574 from cloud-fan/followup.
    cloud-fan committed Jun 19, 2018
    Configuration menu
    Copy the full SHA
    1737d45 View commit details
    Browse the repository at this point in the history
  2. [SPARK-24542][SQL] UDF series UDFXPathXXXX allow users to pass carefu…

    …lly crafted XML to access arbitrary files
    
    ## What changes were proposed in this pull request?
    
    The UDFXPathXXXX UDF series allows users to pass carefully crafted XML to access arbitrary files. Spark has no built-in access control, so even when an external access control library is used, users might bypass it and read file contents.
    
    This PR basically patches the Hive fix to Apache Spark. https://issues.apache.org/jira/browse/HIVE-18879
    
    ## How was this patch tested?
    
    A unit test case
    
    Author: Xiao Li <gatorsmile@gmail.com>
    
    Closes #21549 from gatorsmile/xpathSecurity.
    gatorsmile authored and cloud-fan committed Jun 19, 2018
    Configuration menu
    Copy the full SHA
    9a75c18 View commit details
    Browse the repository at this point in the history
  3. [SPARK-24521][SQL][TEST] Fix ineffective test in CachedTableSuite

    ## What changes were proposed in this pull request?
    
    test("withColumn doesn't invalidate cached dataframe") in CachedTableSuite doesn't not work because:
    
    The UDF is executed and test count incremented when "df.cache()" is called and the subsequent "df.collect()" has no effect on the test result.
    
    This PR fixed this test and add another test for caching UDF.
    
    ## How was this patch tested?
    
    Add new tests.
    
    Author: Li Jin <ice.xelloss@gmail.com>
    
    Closes #21531 from icexelloss/fix-cache-test.
    icexelloss authored and gatorsmile committed Jun 19, 2018
    Configuration menu
    Copy the full SHA
    a78a904 View commit details
    Browse the repository at this point in the history
  4. [SPARK-24556][SQL] Always rewrite output partitioning in ReusedExchan…

    …geExec and InMemoryTableScanExec
    
    ## What changes were proposed in this pull request?
    
    Currently, ReusedExchangeExec and InMemoryTableScanExec only rewrite the output partitioning if the child's partitioning is HashPartitioning, and do nothing for other partitionings, e.g., RangePartitioning. We should always rewrite it; otherwise, an unnecessary shuffle can be introduced, as in https://issues.apache.org/jira/browse/SPARK-24556.
    
    ## How was this patch tested?
    
    Add new tests.
    
    Author: yucai <yyu1@ebay.com>
    
    Closes #21564 from yucai/SPARK-24556.
    yucai authored and cloud-fan committed Jun 19, 2018
    Configuration menu
    Copy the full SHA
    9dbe53e View commit details
    Browse the repository at this point in the history
  5. [SPARK-24534][K8S] Bypass non spark-on-k8s commands

    ## What changes were proposed in this pull request?
    This PR changes entrypoint.sh to allow running commands other than the spark-on-k8s ones (init, driver, executor), so users can keep their normal workflow without hacking the image to bypass the entrypoint.
    
    ## How was this patch tested?
    This patch was built manually on my local machine, and I ran some tests with a combination of `docker run` commands.
    
    Author: rimolive <ricardo.martinelli.oliveira@gmail.com>
    
    Closes #21572 from rimolive/rimolive-spark-24534.
    rimolive authored and erikerlandson committed Jun 19, 2018
    Configuration menu
    Copy the full SHA
    13092d7 View commit details
    Browse the repository at this point in the history
  6. [SPARK-24565][SS] Add API for in Structured Streaming for exposing ou…

    …tput rows of each microbatch as a DataFrame
    
    ## What changes were proposed in this pull request?
    
    Currently, the micro-batches in MicroBatchExecution are not exposed to the user through any public API. This was because we did not want to expose the micro-batches, so that every API we expose could eventually also be supported in the continuous engine. But now that we have a better sense of how to build a ContinuousExecution, I am considering adding APIs that run only in MicroBatchExecution. I have quite a few use cases where exposing the micro-batch output as a DataFrame is useful.
    - Pass the output rows of each batch to a library that is designed only for batch jobs (for example, many ML libraries need to collect() while learning).
    - Reuse batch data sources for output whose streaming version does not exist (e.g., the Redshift data source).
    - Write the output rows to multiple places by writing twice for each batch. This is not the most elegant approach for multiple-output streaming queries, but it is likely better than running two streaming queries that process the same data twice.
    
    The proposal is to add a method `foreachBatch(f: Dataset[T] => Unit)` to Scala/Java/Python `DataStreamWriter`.
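
    A sketch following the signature proposed above (the merged API may differ, e.g. by also passing a batch id); `streamingDF` and the output path are assumptions:
    ```scala
    val query = streamingDF.writeStream
      .foreachBatch { batchDF: org.apache.spark.sql.DataFrame =>
        // Reuse an ordinary batch writer for the rows of each micro-batch.
        batchDF.write.format("parquet").mode("append").save("/tmp/stream-output")
      }
      .start()
    ```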
    
    ## How was this patch tested?
    New unit tests.
    
    Author: Tathagata Das <tathagata.das1565@gmail.com>
    
    Closes #21571 from tdas/foreachBatch.
    tdas authored and zsxwing committed Jun 19, 2018
    Configuration menu
    Copy the full SHA
    2cb9763 View commit details
    Browse the repository at this point in the history
  7. [SPARK-24583][SQL] Wrong schema type in InsertIntoDataSourceCommand

    ## What changes were proposed in this pull request?
    
    Change the insert input schema from `insertRelationType` to `insertRelationType.asNullable`, in order to avoid nullability being overridden.
    
    ## How was this patch tested?
    
    Added one test in InsertSuite.
    
    Author: Maryann Xue <maryannxue@apache.org>
    
    Closes #21585 from maryannxue/spark-24583.
    maryannxue authored and gatorsmile committed Jun 19, 2018
    Configuration menu
    Copy the full SHA
    bc0498d View commit details
    Browse the repository at this point in the history

Commits on Jun 20, 2018

  1. [SPARK-23778][CORE] Avoid unneeded shuffle when union gets an empty RDD

    ## What changes were proposed in this pull request?
    
    When `union` is invoked on several RDDs, one of which is empty, the result of the operation is a `UnionRDD`. This causes an unneeded extra shuffle when all the other RDDs have the same partitioning.
    
    The PR ignores incoming empty RDDs in the union method.
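
    An illustration of the scenario (assumes a SparkContext `sc`):
    ```scala
    val counts = sc.parallelize(1 to 10).map(x => (x % 3, 1)).reduceByKey(_ + _)  // hash-partitioned
    val empty  = sc.emptyRDD[(Int, Int)]

    // Before this change, the empty RDD (which carries no partitioner) forced a plain UnionRDD,
    // discarding `counts`' partitioner and causing an avoidable shuffle in later keyed operations.
    // Ignoring empty RDDs preserves the partitioner-aware union.
    val unioned = sc.union(counts, empty)
    ```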
    
    ## How was this patch tested?
    
    added UT
    
    Author: Marco Gaido <marcogaido91@gmail.com>
    
    Closes #21333 from mgaido91/SPARK-23778.
    mgaido91 authored and cloud-fan committed Jun 20, 2018
    Configuration menu
    Copy the full SHA
    bc11146 View commit details
    Browse the repository at this point in the history
  2. [MINOR][SQL] Remove invalid comment from SparkStrategies

    ## What changes were proposed in this pull request?
    
    This patch removes an invalid comment from SparkStrategies, given that a TODO-like comment is no longer the preferred style: #21388 (comment)
    
    Removing the invalid comment will keep contributors from spending time on work that is not going to be merged.
    
    ## How was this patch tested?
    
    N/A
    
    Author: Jungtaek Lim <kabhwan@gmail.com>
    
    Closes #21595 from HeartSaVioR/MINOR-remove-invalid-comment-on-spark-strategies.
    HeartSaVioR authored and hvanhovell committed Jun 20, 2018
    Configuration menu
    Copy the full SHA
    c8ef923 View commit details
    Browse the repository at this point in the history
  3. [SPARK-24575][SQL] Prohibit window expressions inside WHERE and HAVIN…

    …G clauses
    
    ## What changes were proposed in this pull request?
    
    As discussed [before](#19193 (comment)), this PR prohibits window expressions inside WHERE and HAVING clauses.
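
    For example, a query of the now-prohibited shape (table and column names are illustrative):
    ```scala
    spark.sql("""
      SELECT id, v
      FROM events
      WHERE row_number() OVER (PARTITION BY id ORDER BY v) = 1
    """)
    // The analyzer now rejects the window expression inside WHERE with an explicit error;
    // the usual workaround is to compute the window function in a subquery and filter on it.
    ```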
    
    ## How was this patch tested?
    
    This PR comes with a dedicated unit test.
    
    Author: aokolnychyi <anton.okolnychyi@sap.com>
    
    Closes #21580 from aokolnychyi/spark-24575.
    aokolnychyi authored and hvanhovell committed Jun 20, 2018
    Configuration menu
    Copy the full SHA
    c5a0d11 View commit details
    Browse the repository at this point in the history
  4. [SPARK-24578][CORE] Cap sub-region's size of returned nio buffer

    ## What changes were proposed in this pull request?
    This PR tries to fix the performance regression introduced by SPARK-21517.
    
    In our production job, we perform many parallel computations; with high probability, some task is scheduled on a host-2 where it needs to read cached block data from host-1. Often, this large transfer makes the cluster suffer timeout issues (it retries 3 times, each with a 120s timeout, and then recomputes to put the cached block into the local MemoryStore).
    
    The root cause is that we don't do `consolidateIfNeeded` anymore as we are using
    ```
    Unpooled.wrappedBuffer(chunks.length, getChunks(): _*)
    ```
    in ChunkedByteBuffer. If we have many small chunks, it can cause `buf.notBuffer(...)` to perform very badly when we have to call `copyByteBuf(...)` many times.
    
    ## How was this patch tested?
    Existing unit tests and also test in production
    
    Author: Wenbo Zhao <wzhao@twosigma.com>
    
    Closes #21593 from WenboZhao/spark-24578.
    WenboZhao authored and zsxwing committed Jun 20, 2018
    Configuration menu
    Copy the full SHA
    3f4bda7 View commit details
    Browse the repository at this point in the history

Commits on Jun 21, 2018

  1. [SPARK-24547][K8S] Allow for building spark on k8s docker images with…

    …out cache and don't forget to push spark-py container.
    
    ## What changes were proposed in this pull request?
    
    https://issues.apache.org/jira/browse/SPARK-24547
    
    TL;DR from JIRA issue:
    
    - The first time I generated images for 2.4.0, Docker was using its cache, so when running jobs, old jars were still in the Docker image. This produced errors like the following in the executors:
    
    `java.io.InvalidClassException: org.apache.spark.storage.BlockManagerId; local class incompatible: stream classdesc serialVersionUID = 6155820641931972169, local class serialVersionUID = -3720498261147521051`
    
    - The second problem was that the spark container was pushed, but the spark-py container was not yet. This was simply forgotten in the initial PR.
    
    - A third problem I also ran into, because I had an older Docker version, was #21551, so I have not included a fix for that in this ticket.
    
    ## How was this patch tested?
    
    I've tested it on my own Spark on k8s deployment.
    
    Author: Ray Burgemeestre <ray.burgemeestre@brightcomputing.com>
    
    Closes #21555 from rayburgemeestre/SPARK-24547.
    Ray Burgemeestre authored and Anirudh Ramanathan committed Jun 21, 2018
    Configuration menu
    Copy the full SHA
    15747cf View commit details
    Browse the repository at this point in the history
  2. [SPARK-23912][SQL] add array_distinct

    ## What changes were proposed in this pull request?
    
    Add array_distinct to remove duplicate values from an array.
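
    A quick usage sketch (assumes a SparkSession `spark`):
    ```scala
    import org.apache.spark.sql.functions.array_distinct
    import spark.implicits._

    val df = Seq(Seq(1, 2, 2, 3, 3)).toDF("a")
    df.select(array_distinct($"a")).show()
    // expect a single row containing [1, 2, 3]
    ```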
    
    ## How was this patch tested?
    
    Add unit tests
    
    Author: Huaxin Gao <huaxing@us.ibm.com>
    
    Closes #21050 from huaxingao/spark-23912.
    huaxingao authored and ueshin committed Jun 21, 2018
    Configuration menu
    Copy the full SHA
    9de11d3 View commit details
    Browse the repository at this point in the history